Data Types and Features
This page documents the type of data CAOS_SEISMIC ingests and the features it considers — so a reader understands what the model actually sees. It is the companion to Data-Sources (which documents where the data comes from); read that first if you want the providers and licenses. Here the question is narrower and more technical: given the raw feeds, what is one record, how is it cleaned, and which numbers are derived from it and fed to the model?
The answer has a strict shape:
- One data type dominates — the catalog event record (an earthquake's time, place, depth, and magnitude). Everything load-bearing is derived from it.
- A homogenization step turns the heterogeneous magnitude field into a single, comparable scale (Mw) and estimates the magnitude of completeness `Mc`, below which the catalog cannot be trusted.
- Catalog-derived features capture recent activity and its clustering structure: recent-window rates, ETAS/Omori intensities, and Zaliapin–Ben-Zion nearest-neighbor proximity ($\eta$, $T$, $R$).
- Context covariates from the global enrichers (slab geometry, distance-to-fault, plate-boundary type, GNSS strain rate, tidal Coulomb stress) tell the model what kind of place a cell is.
Honest framing. Categories 1–3 (the catalog and its derived features) carry essentially all of the published short-term forecasting skill. Category 4 (context covariates) is upside, not foundation: a covariate enters the model only after it shows a positive, significant pseudo-prospective information gain over a catalog-only baseline. This page labels every feature with that distinction.
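The gate described above can be sketched as a paired comparison of log-likelihood scores. This is a minimal illustration, not the project's evaluation code: the function name, the use of a paired t-statistic, and the default threshold `t_crit=1.645` (a large-sample one-sided 5% level) are all assumptions made for the example.

```python
import math

def information_gain_gate(ll_model, ll_baseline, t_crit=1.645):
    """Return (mean gain, t-statistic, passes) for paired log-likelihood scores.

    ll_model / ll_baseline: log-likelihood of the same test units (events or
    test windows) under the covariate model and the catalog-only baseline.
    t_crit is a hypothetical threshold, not a value taken from the source.
    """
    diffs = [m - b for m, b in zip(ll_model, ll_baseline)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired scores")
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0.0:
        # Degenerate case: every unit shows the same gain
        t_stat = math.inf if mean > 0 else (-math.inf if mean < 0 else 0.0)
    else:
        t_stat = mean / math.sqrt(var / n)
    return mean, t_stat, (mean > 0 and t_stat > t_crit)

gain, t_stat, ships = information_gain_gate(
    [-1.0, -1.2, -0.9, -1.1], [-1.5, -1.6, -1.5, -1.4]
)
```

A covariate whose gain is positive but not significant fails the gate and would be reported as a near-null rather than shipped.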
The atomic unit of input is a single catalog event record. Whether it comes from USGS ComCat, ISC, or a regional network (see Data-Sources), every record carries the same core fields:
| Field | Type | Meaning | Notes for modeling |
|---|---|---|---|
| `event_id` | string | Provider's unique origin id | Used to dedupe across providers (ComCat vs regional vs EMSC). |
| `time` | UTC timestamp | Origin time | The clock the whole point-process model runs on; sub-second precision matters for Omori decay at short lags. |
| `latitude`, `longitude` | float (degrees) | Epicenter | Joined to the forecast grid; distances are computed great-circle. |
| `depth` | float (km) | Hypocentral depth | Separates crustal / interface / intraslab when combined with Slab2 (§5). |
| `magnitude` | float | Reported size | Heterogeneous; see `magType`. |
| `magType` | string | Magnitude scale (`ml`, `mb`, `ms`, `mw`, `md`, …) | First-class field, never dropped. Different scales saturate differently; mixing them silently distorts the magnitude–frequency tail. |
The single most important hygiene rule on the raw record: keep `magType`. A catalog that mixes `mb`, `Ms`, and `Mw` without recording which is which makes the Gutenberg–Richter b-value (and hence every rate forecast) wrong. Homogenization (§3) depends on this field being present and correct.
A record may carry more — moment tensor, focal mechanism, nodal planes, P/T axes (from Global CMT, Data-Sources §5) — which feed the mechanism / stress features (§5). But the six fields above are the irreducible minimum every event must have.
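One way to enforce the irreducible minimum is to make the record a typed structure that refuses to exist without `magType`. The sketch below is illustrative only: the class name `EventRecord`, the snake_case attribute names, and the validation choices are assumptions, not the project's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class EventRecord:
    """Hypothetical container for the core catalog fields."""
    event_id: str
    time: datetime        # origin time, must be timezone-aware (UTC)
    latitude: float       # degrees
    longitude: float      # degrees
    depth_km: float
    magnitude: float
    mag_type: str         # magnitude scale; never dropped

    def __post_init__(self):
        if not self.mag_type:
            raise ValueError("magType is required on every record")
        if self.time.tzinfo is None:
            raise ValueError("origin time must carry a timezone (UTC)")
        if not (-90.0 <= self.latitude <= 90.0):
            raise ValueError("latitude out of range")
        if not (-180.0 <= self.longitude <= 180.0):
            raise ValueError("longitude out of range")

rec = EventRecord(
    event_id="example-0001",  # made-up id for illustration
    time=datetime(2020, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
    latitude=-33.4, longitude=-70.6, depth_km=35.0,
    magnitude=5.1, mag_type="mw",
)
```

Rejecting a record at construction time keeps a missing magnitude scale from silently reaching the homogenization step.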
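The model does not forecast individual events; it forecasts a rate: the expected number of events per unit time, area, and magnitude, conditioned on the catalog history so far.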
So the event record is the input; a calibrated probability per region × magnitude × horizon is the output. Everything in between is the homogenization and feature engineering documented below.
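The step from a rate to a probability can be sketched with the standard Poisson relation: if the expected number of qualifying events in a region and horizon is $N$, the probability of at least one is $1 - e^{-N}$. The helper names and the constant-rate assumption below are illustrative; the real model integrates a time-varying intensity.

```python
import math

def expected_count(rate_per_day: float, horizon_days: float) -> float:
    """Expected number of events, assuming the rate is constant over the horizon."""
    return rate_per_day * horizon_days

def prob_at_least_one(n_expected: float) -> float:
    """Poisson probability of one or more events given the expected count."""
    # expm1 keeps precision when the expected count is very small
    return -math.expm1(-n_expected)

# Example: 0.01 events/day above the target magnitude, 7-day horizon
p_week = prob_at_least_one(expected_count(0.01, 7.0))
```

For small expected counts the probability is close to the count itself, which is why low rates and low probabilities are often quoted interchangeably.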
```mermaid
flowchart TD
REC["Catalog event record<br/>time · lat · lon · depth · magnitude · magType"]
DEDUP["Dedupe across providers<br/>(preferred-origin id)"]
MC["Estimate Mc(x,y,t)<br/>MAXC+0.2 / GFT / EMR"]
CUT["Cut events below local Mc"]
HOM["Homogenize magnitude -> Mw<br/>total-least-squares ML/mb -> Mw<br/>anchored on ISC-GEM / Global CMT"]
DECL["Decluster (DUAL view)<br/>GK background | full catalog for triggering<br/>Zaliapin-Ben-Zion eta/T/R labels"]
FEAT["Catalog-derived features<br/>recent-window rates · ETAS/Omori intensities · eta/T/R"]
REC --> DEDUP --> MC --> CUT --> HOM --> DECL --> FEAT
```
The order is load-bearing and identical to the Pipeline DAG: you cannot homogenize before you
know magType, you cannot estimate a trustworthy b-value before you cut below Mc, and you must keep
both a declustered catalog (for the stationary background) and the full catalog (for the
triggering signal). Declustering the input to the conditional model is the single most common pipeline
mistake — the aftershock/foreshock triggering is the predictable signal.
Two transforms turn the raw, heterogeneous record into a statistically usable one.
Mc is the minimum magnitude above which all events in a given space–time volume are detected
(Wiemer & Wyss 2000). Below Mc the catalog is missing events, and any rate computed there is biased.
It must be estimated per spatial cell and per time epoch — a single global Mc injects fake
non-stationarity (e.g. a network densification that lowered Mc would look like a real seismicity
change). Estimators (run several; their spread is the uncertainty):
- MAXC (maximum curvature): `Mc` is the magnitude bin where the non-cumulative frequency–magnitude distribution peaks. Fast, but biased low when the roll-off is gradual; the community fix is MAXC + 0.2, which is California-tuned and must be re-validated per region.
- GFT (goodness-of-fit test): accept the lowest `Mc` where a fitted Gutenberg–Richter law matches the observed distribution at the chosen confidence level.
- EMR (entire magnitude range): models the full distribution including the incomplete part; gives `Mc` and a bootstrap uncertainty.
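Of the three estimators, MAXC is simple enough to sketch in a few lines. The version below assumes magnitudes binned at 0.1 units and applies the +0.2 correction as a parameter, so it can be re-tuned per region; the function name and defaults are illustrative.

```python
import numpy as np

def mc_maxc(mags, bin_width=0.1, correction=0.2):
    """Maximum-curvature Mc: modal bin of the non-cumulative FMD plus a correction."""
    mags = np.asarray(mags, dtype=float)
    # Map each magnitude to an integer bin index
    idx = np.round(mags / bin_width).astype(int)
    lowest = idx.min()
    counts = np.bincount(idx - lowest)
    # Bin with the most events = peak of the non-cumulative distribution
    peak_mag = (np.argmax(counts) + lowest) * bin_width
    return peak_mag + correction

# Toy catalog: the distribution peaks at M 1.5
toy = [1.0] * 5 + [1.5] * 20 + [2.0] * 10 + [2.5] * 4
mc_est = mc_maxc(toy)
```

Running the same function with `correction=0.0` and comparing against GFT or EMR on the same cell is one way to see how much the fixed offset matters in a given region.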
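`Mc` estimation rests on the Gutenberg–Richter law, with the Aki–Utsu maximum-likelihood b-value:

$$
\log_{10} N(\ge M) = a - bM, \qquad \hat{b} = \frac{\log_{10} e}{\bar{M} - \left(M_c - \Delta M / 2\right)}
$$

where $N(\ge M)$ is the number of events with magnitude at least $M$, $\bar{M}$ is the mean magnitude of the events with $M \ge M_c$, and $\Delta M$ is the magnitude bin width.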
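A minimal sketch of the Aki–Utsu maximum-likelihood estimator, with Utsu's half-bin correction for binned magnitudes and Aki's first-order standard error $b/\sqrt{N}$. The function name is illustrative, and the synthetic catalog is generated with a known $b = 1$ purely to check the estimator.

```python
import numpy as np

def b_value_aki_utsu(mags, mc, bin_width=0.1):
    """Return (b, sigma_b) from events at or above Mc."""
    m = np.asarray(mags, dtype=float)
    m = m[m >= mc - 1e-9]            # keep only the complete part
    if m.size < 2:
        raise ValueError("too few events above Mc")
    # Mean magnitude above the lower edge of the Mc bin
    mean_excess = m.mean() - (mc - bin_width / 2.0)
    if mean_excess <= 0:
        raise ValueError("mean magnitude must exceed the Mc bin edge")
    b = np.log10(np.e) / mean_excess
    sigma_b = b / np.sqrt(m.size)    # Aki (1965) first-order uncertainty
    return b, sigma_b

# Synthetic continuous magnitudes with true b = 1.0 above Mc = 2.0
rng = np.random.default_rng(0)
synthetic = 2.0 + rng.exponential(scale=1.0 / np.log(10), size=20000)
b_hat, b_err = b_value_aki_utsu(synthetic, mc=2.0, bin_width=0.0)
```

Setting `bin_width=0.0` is correct for continuous magnitudes; for a catalog reported to 0.1 units the default applies the half-bin shift.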
Highest-stakes case: short-term aftershock incompleteness. Right after a large mainshock, exactly when the forecast matters most, `Mc` spikes for hours to days and a naive model under-forecasts productivity. The daily update uses an incompleteness-aware likelihood with a time-dependent $M_c(t)$, not the static fit. This is a decided method, flagged honestly in product copy as a known limitation of the most important real-time updates.
Catalogs mix ML, mb, Ms, Md, Mw — not interchangeable (different saturation, different
physics). For a single statistical model, magnitudes are homogenized to Mw (proportional to
seismic moment, non-saturating). Where Mw is missing for small events, regional total-least-squares
regressions ML→Mw and mb→Mw are fit (TLS, not OLS — both axes carry error), anchored on the
ISC-GEM / Global CMT overlap. Both native and Mw-homogenized magnitudes are stored; the conversion
is versioned, because a wrong conversion shifts the whole Gutenberg–Richter relation and every rate
forecast.
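The difference between TLS and OLS is that TLS minimizes perpendicular distance to the line, so errors in both magnitudes are treated symmetrically. A minimal sketch using the principal direction of the centered point cloud follows; it assumes equal error variance on both axes (plain orthogonal regression), which a production fit would replace with a ratio estimated from the data. The function name is illustrative.

```python
import numpy as np

def fit_tls_line(x, y):
    """Orthogonal (total-least-squares) line fit y = slope * x + intercept.

    Assumes equal error variance in x and y, and a line that is not vertical.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    centered = np.column_stack([x - x_mean, y - y_mean])
    # First right-singular vector = direction of largest variance = TLS line
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    dx, dy = vt[0]
    slope = dy / dx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Toy pairs lying exactly on Mw = 0.8 * ML + 0.5 (made-up coefficients)
ml = np.array([2.0, 3.0, 4.0, 5.0])
mw = 0.8 * ml + 0.5
slope, intercept = fit_tls_line(ml, mw)
```

Storing the fitted `slope` and `intercept` alongside a version tag is what makes a later correction to the conversion traceable.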
These features are computed only from the cleaned catalog and carry the dominant skill. They are the model's view of how active, and how clustered, recent seismicity is.
Counts and rates of events above Mc in trailing windows (e.g. last 1, 7, 30, 365 days) per grid
cell, plus their ratios (a short-window rate divided by a long-window rate is a simple, robust
"activation" signal). These capture the raw level of recent activity that any clustering model
amplifies.
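A minimal sketch of the trailing-window features for one grid cell. It assumes the input is already cut at the local `Mc` and that times are expressed in days on a common clock; the function names and the small `eps` guarding against an empty long window are illustrative choices.

```python
import numpy as np

def window_rates(event_times_days, now_days, windows=(1, 7, 30, 365)):
    """Events per day in each trailing window ending at now_days."""
    age = now_days - np.asarray(event_times_days, dtype=float)
    rates = {}
    for w in windows:
        # Count events that occurred within the last w days
        n = np.count_nonzero((age >= 0) & (age < w))
        rates[w] = n / w
    return rates

def activation_ratio(rates, short=7, long=365, eps=1e-6):
    """Short-window rate divided by long-window rate."""
    return rates[short] / (rates[long] + eps)

# Five events in one cell, evaluated at day 100
rates = window_rates([99.5, 98.0, 95.0, 80.0, 10.0], now_days=100.0)
ratio = activation_ratio(rates)
```

A ratio well above 1 means the last week is much busier than the cell's yearly average, which is the activation signal described above.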
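The Epidemic-Type Aftershock Sequence (ETAS) model expresses the conditional rate as a background plus the sum of triggering contributions from all past events:

$$
\lambda(t, x, y \mid \mathcal{H}_t) = \mu(x, y) + \sum_{i:\, t_i < t} K\, e^{\alpha (m_i - M_c)}\, (t - t_i + c)^{-p}\, f(x - x_i,\, y - y_i;\, m_i)
$$

The Omori–Utsu temporal decay $(t - t_i + c)^{-p}$ sets how fast each event's contribution fades, the productivity $K e^{\alpha (m_i - M_c)}$ scales it with the parent's magnitude, and $f$ is a spatial kernel centered on the parent epicenter.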
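A temporal-only sketch of a background-plus-triggering rate may help make the structure concrete. It assumes the common parameterization with productivity $K e^{\alpha(m - M_c)}$ and Omori–Utsu kernel $(t - t_i + c)^{-p}$, and it omits the spatial kernel entirely; the function name and all parameter values are illustrative, not fitted.

```python
import numpy as np

def etas_rate(t, event_times, event_mags, mu, K, alpha, c, p, mc):
    """Conditional rate at time t: background mu plus summed triggering."""
    times = np.asarray(event_times, dtype=float)
    mags = np.asarray(event_mags, dtype=float)
    past = times < t                       # only earlier events can trigger
    dt = t - times[past]
    productivity = K * np.exp(alpha * (mags[past] - mc))
    omori = (dt + c) ** (-p)               # Omori-Utsu temporal decay
    return mu + float(np.sum(productivity * omori))

params = dict(mu=0.1, K=2.0, alpha=1.0, c=1.0, p=1.0, mc=3.0)
rate_early = etas_rate(1.0, [0.0], [3.0], **params)   # one day after the event
rate_late = etas_rate(10.0, [0.0], [3.0], **params)   # ten days after
rate_quiet = etas_rate(1.0, [], [], **params)         # no past events
```

With no past events the rate collapses to the background `mu`; after an event it jumps and then decays back toward it.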
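A non-parametric, data-driven way to quantify how strongly each event is clustered to its most likely parent. Define the proximity from event $i$ to a later event $j$ as

$$
\eta_{ij} = t_{ij}\, r_{ij}^{d_f}\, 10^{-b\, m_i}, \qquad t_{ij} = t_j - t_i > 0,
$$

with $r_{ij}$ the epicentral distance, $d_f$ the fractal dimension of epicenters, $b$ the Gutenberg–Richter b-value, and $m_i$ the magnitude of the candidate parent ($\eta_{ij} = \infty$ when $t_{ij} \le 0$). The nearest neighbor (most likely parent) of $j$ is the event $i$ that minimizes $\eta_{ij}$. The proximity factors into rescaled time and space, $\eta_{ij} = T_{ij} R_{ij}$, with $T_{ij} = t_{ij}\, 10^{-q\, b\, m_i}$ and $R_{ij} = r_{ij}^{d_f}\, 10^{-(1-q)\, b\, m_i}$ (commonly $q = 0.5$).

The histogram of $\log_{10} \eta$ is typically bimodal: a background mode at large $\eta$ and a clustered mode at small $\eta$. The split between the two modes labels each event as background or clustered.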
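A brute-force sketch of the nearest-neighbor search in the space-time-magnitude domain. It assumes the usual definition (inter-event time times distance to the power $d_f$, discounted by $10^{-b m}$ of the candidate parent), planar coordinates in km rather than great-circle distances, and a floor `r_min_km` to avoid zero distances. The function name and the defaults `b=1.0`, `d_f=1.6`, `q=0.5` are illustrative.

```python
import numpy as np

def nearest_neighbor_proximity(times, x_km, y_km, mags,
                               b=1.0, d_f=1.6, q=0.5, r_min_km=0.1):
    """Return (parent index, eta, T, R) per event; parent = -1 if none earlier."""
    t = np.asarray(times, dtype=float)
    x = np.asarray(x_km, dtype=float)
    y = np.asarray(y_km, dtype=float)
    m = np.asarray(mags, dtype=float)
    n = t.size
    parent = np.full(n, -1, dtype=int)
    eta = np.full(n, np.inf)
    T = np.full(n, np.nan)
    R = np.full(n, np.nan)
    for j in range(n):
        dt = t[j] - t                      # positive for earlier events
        earlier = dt > 0
        if not earlier.any():
            continue
        r = np.maximum(np.hypot(x[j] - x, y[j] - y), r_min_km)
        T_ij = dt * 10.0 ** (-q * b * m)               # rescaled time
        R_ij = r ** d_f * 10.0 ** (-(1 - q) * b * m)   # rescaled distance
        eta_ij = np.where(earlier, T_ij * R_ij, np.inf)
        i = int(np.argmin(eta_ij))         # most likely parent
        parent[j], eta[j], T[j], R[j] = i, eta_ij[i], T_ij[i], R_ij[i]
    return parent, eta, T, R

# Toy catalog: an M6 at the origin, then two small events (days, km)
parent, eta, T, R = nearest_neighbor_proximity(
    times=[0.0, 1.0, 1000.0], x_km=[0.0, 1.0, 500.0],
    y_km=[0.0, 0.0, 0.0], mags=[6.0, 3.0, 3.0],
)
```

The loop is quadratic in the number of events, which is fine for a sketch but would need a spatial index or windowing on a full catalog.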
The dual-catalog rule (stated once, enforced everywhere). A declustered catalog (Gardner–Knopoff as a transparent cross-check; Zaliapin–Ben-Zion as primary) feeds only the stationary background rate $\mu(x,y)$ and Poisson-baseline calibration. The full, un-declustered catalog feeds the conditional/ETAS triggering term, because the triggering is the signal. The Gardner–Knopoff windows (OpenQuake hmtk coefficients) are $L(M)=10^{\,0.1238M+0.983}$ km and $T(M)=10^{\,0.032M+2.7389}$ d for $M\ge6.5$, else $T(M)=10^{\,0.5409M-0.547}$ d.
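The Gardner–Knopoff windows quoted above translate directly into code. The sketch below only evaluates the window sizes from the stated coefficients; the function name is illustrative and the declustering loop that applies the windows is not shown.

```python
def gk_windows(magnitude: float):
    """Gardner-Knopoff space (km) and time (days) windows for a mainshock."""
    distance_km = 10.0 ** (0.1238 * magnitude + 0.983)
    if magnitude >= 6.5:
        time_days = 10.0 ** (0.032 * magnitude + 2.7389)
    else:
        time_days = 10.0 ** (0.5409 * magnitude - 0.547)
    return distance_km, time_days

dist_m5, days_m5 = gk_windows(5.0)   # roughly 40 km and 144 days
dist_m7, days_m7 = gk_windows(7.0)   # roughly 71 km and 918 days
```

Because the time window changes form at M 6.5, it is worth checking values on both sides of that threshold when porting the coefficients.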
These features tell the model what kind of place a cell is. Each is joined as a static or slowly-varying spatial field from a global enricher (Data-Sources §7–§13), and each must demonstrate marginal information gain over a catalog-only baseline before it ships.
```mermaid
flowchart LR
subgraph SRC["Global enrichers (sources)"]
S2["Slab2<br/>subduction geometry"]
GF["GEM Active Faults"]
PB["Bird PB2002 plates"]
GN["NGL GNSS / MIDAS"]
WS["World Stress Map<br/>+ Global CMT mechanisms"]
TD["Tidal stress<br/>(pygtide + SPOTL/TPXO)"]
end
subgraph COV["Per-cell context covariates"]
C1["slab depth / dip / interface distance"]
C2["distance-to-nearest-fault · fault style"]
C3["distance-to-boundary · boundary type"]
C4["strain rate (dilatation, max shear)"]
C5["faulting regime · rake · dCFS"]
C6["tidal dCFS · phase · Mf envelope"]
end
S2 --> C1
GF --> C2
PB --> C3
GN --> C4
WS --> C5
TD --> C6
COV --> GRID["Forecast grid<br/>(joined per cell)"]
```
From Slab2: depth-to-slab, local dip and strike, and distance-to-interface. Combined with event
depth, these separate interface, intraslab, and crustal seismicity — very different
physics and rates. This is the highest-value context covariate in subduction regions.
From GEM Active Faults: distance-to-nearest-active-fault and fault style. From Bird PB2002: distance-to-plate-boundary, boundary type (subduction / transform / ridge), and plate-pair relative velocity. Together they place a cell in its tectonic setting.
From NGL GNSS / MIDAS: a strain-rate field (dilatation, maximum shear, second invariant) interpolated from station velocities. Strain rate correlates with long-term seismicity rate, so it primarily informs the background term, not the short-horizon term.
From Global CMT mechanisms and the World Stress Map: faulting regime, rake / P-T axis orientation, and Coulomb stress transfer ($\Delta\mathrm{CFS}$) from recent large events.
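From the computed tidal-stress series (Data-Sources §13): the tidal Coulomb failure stress at the forecast instant and its derived shape. These enter the conditional intensity as a regularized multiplier with the physically correct exponential (rate-and-state) form, with a learned coupling $\beta$ that the data is allowed to drive to ~0:

$$
\lambda(t, x, y) = \lambda_{\mathrm{ETAS}}(t, x, y)\, \exp\!\big(\beta\, \Delta\mathrm{CFS}_{\mathrm{tide}}(t, x, y)\big)
$$

Engineered tidal features:

- Tidal $\Delta\mathrm{CFS}$: the tidal Coulomb failure stress at the forecast instant.
- Tidal stressing rate: the time derivative of tidal $\Delta\mathrm{CFS}$.
- Tidal phase: a circular encoding, $(\sin\phi, \cos\phi)$, so the periodic variable stays continuous.
- Semidiurnal / Mf envelope: recent-window semidiurnal and fortnightly amplitude.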
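A minimal sketch of how an exponential tidal multiplier and two of the engineered features could be computed. The function names, the kPa units, the hourly sampling, and the symbol `beta` for the learned coupling are assumptions for illustration; the toy stress series is a pure 12.42 h sinusoid, not real tidal output.

```python
import math
import numpy as np

def tidal_multiplier(dcfs_kpa: float, beta_per_kpa: float) -> float:
    """Exponential rate multiplier; beta = 0 switches the tidal term off."""
    return math.exp(beta_per_kpa * dcfs_kpa)

def encode_phase(phase_rad: float):
    """Circular encoding so phases 0 and 2*pi map to the same point."""
    return math.sin(phase_rad), math.cos(phase_rad)

def stressing_rate(dcfs_series_kpa, dt_hours: float):
    """Finite-difference time derivative of the tidal stress series (kPa/h)."""
    return np.gradient(np.asarray(dcfs_series_kpa, dtype=float), dt_hours)

hours = np.arange(0.0, 48.0, 1.0)
toy_dcfs = 1.5 * np.sin(2 * np.pi * hours / 12.42)   # toy semidiurnal signal
rates = stressing_rate(toy_dcfs, dt_hours=1.0)
neutral = tidal_multiplier(toy_dcfs[3], beta_per_kpa=0.0)
boosted = tidal_multiplier(1.0, beta_per_kpa=0.1)
```

Because the multiplier is exactly 1 when `beta` is 0, a fit that finds no tidal signal leaves the catalog-driven rate untouched.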
The complete feature set the model considers. Tier marks the honest hierarchy: spine features (from the catalog) carry the skill; context covariates are upside, gated on measured information gain.
| Feature | Definition | Source | Tier | Why it matters |
|---|---|---|---|---|
| Recent-window rate | Count/rate of events ≥ `Mc` in trailing windows (1/7/30/365 d) per cell | Catalog (homogenized) | Spine | Raw level of recent activity that clustering amplifies. |
| Rate ratio (activation) | Short-window rate ÷ long-window rate | Catalog | Spine | Simple, robust signal that a sequence is starting. |
| ETAS/Omori intensity | Conditional rate $\lambda(t, x, y \mid \mathcal{H}_t)$: background plus summed triggering | Catalog (ETAS fit) | Spine | The physics-grounded conditional rate; the baseline to beat. |
| Productivity / decay state | Current productivity and Omori–Utsu decay contribution of recent events | Catalog (ETAS fit) | Spine | Encodes how strongly recent large events are still triggering. |
| ZBZ proximity $\eta$ | Nearest-neighbor space–time–magnitude proximity | Catalog | Spine | Data-driven clustering strength per event. |
| ZBZ rescaled time $T$ | Magnitude-normalized inter-event time to the nearest neighbor | Catalog | Spine | Separates temporal clustering from spatial. |
| ZBZ rescaled space $R$ | Magnitude-normalized distance to the nearest neighbor | Catalog | Spine | Separates spatial clustering; with $T$, gives $\eta = T \cdot R$. |
| b-value $b$ | Aki–Utsu MLE per cell/epoch | Catalog | Spine | Sets the magnitude distribution and the large-event tail. |
| `Mc(x,y,t)` | Magnitude of completeness per cell/epoch | Catalog | Spine | Trust boundary; below it rates are biased. A monitored quantity. |
| Slab depth / dip | Depth-to-slab and local dip at the cell | Slab2 | Context | Separates interface / intraslab / crustal seismicity. |
| Interface distance | Distance to the subduction interface | Slab2 | Context | Conditions triggering geometry in subduction zones. |
| Distance-to-fault | Distance to nearest mapped active fault | GEM Active Faults | Context | Proximity to a known fault raises prior rate. |
| Fault style | Mapped faulting style at the cell | GEM Active Faults | Context | Constrains expected mechanism / orientation. |
| Distance-to-boundary | Distance to nearest plate boundary | Bird PB2002 | Context | Places the cell in the plate-tectonic frame. |
| Boundary type | Subduction / transform / ridge | Bird PB2002 | Context | Sets the dominant physics (e.g. thrust vs strike-slip). |
| Strain rate | Dilatation / max shear / second invariant | NGL GNSS / MIDAS | Context | Correlates with the long-term background rate. |
| Faulting regime / rake | Stress regime and P-T axis orientation | World Stress Map / Global CMT | Context | Orients the fault for tidal/Coulomb resolution. |
| $\Delta\mathrm{CFS}$ | Coulomb stress transfer from recent large events | Global CMT mechanisms | Context | Physically motivated triggering covariate. |
| Tidal $\Delta\mathrm{CFS}$ | Tidal Coulomb failure stress at the instant | Computed (pygtide + SPOTL/TPXO) | Context | Regularized multiplier; coupling allowed to → 0. |
| Tidal stressing rate | Time derivative of tidal $\Delta\mathrm{CFS}$ | Computed | Context | Rate governs whether a phase correlation is even detectable. |
| Tidal phase $(\sin\phi, \cos\phi)$ | Circular encoding of tidal phase | Computed | Context | Keeps the periodic phase variable continuous. |
| Semidiurnal / Mf envelope | Recent-window semidiurnal and fortnightly amplitude | Computed | Context | The fortnightly band (Mf, ~13.7 d; spring-neap modulation, ~14.8 d) is where a clean tidal signal is most plausible. |
Stated for honesty and to forestall over-reading:
- No "imminence" or alarm feature. Nothing in the feature set is a countdown or a deterministic precursor. Every feature feeds a bounded, calibrated probability scored against reality.
- No raw waveforms in v1. Waveform-derived detection (ML phase pickers) would improve the input catalog (lower, more stable `Mc`), which helps every model, but detection is not forecasting, and it is out of the v1 feature set.
- No un-gated covariate. A context covariate that fails to show positive prospective information gain over the catalog-only baseline stays out of the public model, reported as a near-null.
- No declustered catalog as the conditional input. The triggering signal is kept; only the background term uses the declustered view.
- Aki, K. (1965). Maximum likelihood estimate of b in the formula log N = a − bM. Bull. Earthq. Res. Inst. 43, 237–239.
- Baiesi, M., & Paczuski, M. (2004). Scale-free networks of earthquakes and aftershocks. Phys. Rev. E 69, 066106. DOI 10.1103/PhysRevE.69.066106.
- Beeler, N. M., & Lockner, D. A. (2003). Why earthquakes correlate weakly with the solid Earth tides. JGR Solid Earth 108(B8), 2391. DOI 10.1029/2001JB001518.
- Gardner, J. K., & Knopoff, L. (1974). Is the sequence of earthquakes in Southern California, with aftershocks removed, Poissonian? BSSA 64(5), 1363–1367.
- Hayes, G. P., et al. (2018). Slab2, a comprehensive subduction zone geometry model. Science 362, 58–61. DOI 10.1126/science.aat4723.
- Ogata, Y. (1988). Statistical models for earthquake occurrences and residual analysis for point processes. JASA 83(401), 9–27. DOI 10.1080/01621459.1988.10478560.
- Ogata, Y. (1998). Space–time point-process models for earthquake occurrences. Ann. Inst. Statist. Math. 50(2), 379–402. DOI 10.1023/A:1003403601725.
- Wiemer, S., & Wyss, M. (2000). Minimum magnitude of completeness in earthquake catalogs. BSSA 90(4), 859–869. DOI 10.1785/0119990114.
- Woessner, J., & Wiemer, S. (2005). Assessing the quality of earthquake catalogues: estimating the magnitude of completeness and its uncertainty (EMR). BSSA 95(2), 684–698. DOI 10.1785/0120040007.
- Zaliapin, I., Gabrielov, A., Keilis-Borok, V., & Wong, H. (2008). Clustering analysis of seismicity and aftershock identification. Phys. Rev. Lett. 101, 018501. DOI 10.1103/PhysRevLett.101.018501.
- Zaliapin, I., & Ben-Zion, Y. (2020). Earthquake declustering using the nearest-neighbor approach in space-time-magnitude domain. JGR Solid Earth 125, e2018JB017120. DOI 10.1029/2018JB017120.
See also: Data-Sources — where each of these data types and covariates comes from · Pipeline — the versioned DAG that produces these features · Models-Classical — how the ETAS / Omori / Gutenberg–Richter features are used · Models-ML — how the feature set feeds the neural challenger.
⚠️ Disclaimer: read this. CAOS_SEISMIC produces probabilistic forecasts, not predictions. It is an independent research and education tool. It is NOT an official earthquake early-warning or civil-protection system, it does NOT predict when, where, or how large an earthquake will be, and it must NOT be used for life-safety, emergency, or evacuation decisions. Every number it publishes is a bounded, calibrated probability conditioned on the present state of seismicity, never an alarm, a countdown, or a "safe" state. A single outcome neither confirms nor refutes a probabilistic forecast. It complements, and does not replace or speak for, official agencies; always follow your national seismological and civil-protection authorities (e.g. USGS, INGV, CSN (Chile, SENAPRED for civil protection), GeoNet, JMA). The software is provided "as is", without warranty of any kind (MIT License); the authors accept no liability for its use. Data are courtesy of their providers (USGS/ANSS, ISC/ISC-GEM, Global CMT, EMSC, CSN, and others) under their respective licenses and attribution terms. See Honest-Limits for the full epistemic context.
CAOS_SEISMIC · seismic.fasl-work.com · source · MIT
Conditional probabilistic seismic forecasting — forecasts, never predictions.