Data Types and Features

Felipe Santibañez-Leal edited this page Jun 17, 2026 · 1 revision

This page documents the type of data CAOS_SEISMIC ingests and the features it considers — so a reader understands what the model actually sees. It is the companion to Data-Sources (which documents where the data comes from); read that first if you want the providers and licenses. Here the question is narrower and more technical: given the raw feeds, what is one record, how is it cleaned, and which numbers are derived from it and fed to the model?

The answer has a strict shape:

  1. One data type dominates — the catalog event record (an earthquake's time, place, depth, and magnitude). Everything load-bearing is derived from it.
  2. A homogenization step turns the heterogeneous magnitude field into a single, comparable scale (Mw) and estimates the magnitude of completeness Mc below which the catalog cannot be trusted.
  3. Catalog-derived features capture recent activity and its clustering structure — recent-window rates, ETAS/Omori intensities, and Zaliapin–Ben-Zion nearest-neighbor proximity ($\eta$, $T$, $R$).
  4. Context covariates from the global enrichers (slab geometry, distance-to-fault, plate-boundary type, GNSS strain rate, tidal Coulomb stress) tell the model what kind of place a cell is.

Honest framing. Categories 1–3 (the catalog and its derived features) carry essentially all of the published short-term forecasting skill. Category 4 (context covariates) is upside, not foundation: a covariate enters the model only after it shows a positive, significant pseudo-prospective information gain over a catalog-only baseline. This page labels every feature with that distinction.


1. The catalog event record (the primary data type)

The atomic unit of input is a single catalog event record. Whether it comes from USGS ComCat, ISC, or a regional network (see Data-Sources), every record carries the same core fields:

| Field | Type | Meaning | Notes for modeling |
| --- | --- | --- | --- |
| event_id | string | Provider's unique origin id | Used to dedupe across providers (ComCat vs regional vs EMSC). |
| time | UTC timestamp | Origin time | The clock the whole point-process model runs on; sub-second precision matters for Omori decay at short lags. |
| latitude, longitude | float (degrees) | Epicenter | Joined to the forecast grid; distances are computed great-circle. |
| depth | float (km) | Hypocentral depth | Separates crustal / interface / intraslab when combined with Slab2 (§5). |
| magnitude | float | Reported size | Heterogeneous; see magType. |
| magType | string | Magnitude scale (ml, mb, ms, mw, md, …) | First-class field, never dropped. Different scales saturate differently; mixing them silently distorts the magnitude–frequency tail. |

The single most important hygiene rule on the raw record: keep magType. A catalog that mixes mb, Ms, and Mw without recording which is which makes the Gutenberg–Richter b-value (and hence every rate forecast) wrong. Homogenization (§3) depends on this field being present and correct.

A record may carry more — moment tensor, focal mechanism, nodal planes, P/T axes (from Global CMT, Data-Sources §5) — which feed the mechanism / stress features (§5). But the six fields above are the irreducible minimum every event must have.
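A minimal sketch of the six-field record as a typed structure, for illustration only. The class name `CatalogEvent`, the field names `depth_km` and `mag_type`, and the `dedupe` helper are hypothetical, and the dedupe step assumes provider ids have already been mapped to a shared preferred-origin id.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CatalogEvent:
    """One catalog event record with the six irreducible fields."""
    event_id: str
    time: datetime      # origin time, must be timezone-aware (UTC)
    latitude: float     # degrees
    longitude: float    # degrees
    depth_km: float     # hypocentral depth
    magnitude: float    # reported size, on the scale named by mag_type
    mag_type: str       # e.g. "ml", "mb", "mw"; never dropped

    def __post_init__(self):
        if self.time.tzinfo is None:
            raise ValueError("origin time must be timezone-aware")
        if not -90.0 <= self.latitude <= 90.0:
            raise ValueError("latitude out of range")
        if not -180.0 <= self.longitude <= 180.0:
            raise ValueError("longitude out of range")
        if not self.mag_type.strip():
            raise ValueError("magType is required on every record")


def dedupe(events):
    """Keep the first record seen for each event_id (assumed shared id)."""
    first_seen = {}
    for ev in events:
        first_seen.setdefault(ev.event_id, ev)
    return list(first_seen.values())
```

Rejecting a record with an empty `mag_type` at construction time is one way to make the "keep magType" rule hard to violate by accident.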

1.1 From record to forecast target

The model does not forecast individual events; it forecasts a rate $\lambda$ per space–time–magnitude cell (a 0.1° × 0.1° × 0.1-magnitude bin, see Technical-Architecture), from which the public exceedance probability is derived:

$$P(\ge 1 \text{ event} \ge M^*) = 1 - e^{-N_{\ge M^*}}, \qquad N_{\ge M^*} = \iint \lambda\,\Phi(M^*)\,dx\,dy\,dt, \quad \Phi(M^*) = 10^{-b\,(M^*-M_c)}$$

So the event record is the input; a calibrated probability per region × magnitude × horizon is the output. Everything in between is the homogenization and feature engineering documented below.
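The exceedance formula above can be sketched numerically as follows. This assumes $b$ and $M_c$ are constant over the integration domain, so that $\Phi(M^*)$ factors out of the integral and the expected count of events $\ge M_c$ is supplied directly. The function names and the example numbers are illustrative, not values from the project.

```python
import math


def expected_count_above(n_above_mc, b, m_star, mc):
    """Scale the expected count of events >= Mc down to events >= M*."""
    if m_star < mc:
        raise ValueError("M* below Mc lies outside the trusted range")
    return n_above_mc * 10.0 ** (-b * (m_star - mc))


def exceedance_probability(n_expected):
    """P(at least one event) for a Poisson count with mean n_expected."""
    if n_expected < 0:
        raise ValueError("expected count must be non-negative")
    # expm1 keeps precision when n_expected is very small
    return -math.expm1(-n_expected)


# Illustrative inputs: 20 expected events >= Mc = 2.5 in the window, b = 1.0
n_m5 = expected_count_above(20.0, b=1.0, m_star=5.0, mc=2.5)
p_m5 = exceedance_probability(n_m5)
```

With these inputs the expected count above M5 is about 0.063 and the exceedance probability about 0.061; for small counts the two are nearly equal, which is why rare-event probabilities track the expected number closely.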


2. The catalog-to-features flow

flowchart TD
    REC["Catalog event record<br/>time · lat · lon · depth · magnitude · magType"]
    DEDUP["Dedupe across providers<br/>(preferred-origin id)"]
    MC["Estimate Mc(x,y,t)<br/>MAXC+0.2 / GFT / EMR"]
    CUT["Cut events below local Mc"]
    HOM["Homogenize magnitude -> Mw<br/>total-least-squares ML/mb -> Mw<br/>anchored on ISC-GEM / Global CMT"]
    DECL["Decluster (DUAL view)<br/>GK background  |  full catalog for triggering<br/>Zaliapin-Ben-Zion eta/T/R labels"]
    FEAT["Catalog-derived features<br/>recent-window rates · ETAS/Omori intensities · eta/T/R"]
    REC --> DEDUP --> MC --> CUT --> HOM --> DECL --> FEAT

The order is load-bearing and identical to the Pipeline DAG: you cannot homogenize before you know magType, you cannot estimate a trustworthy b-value before you cut events below Mc, and you must keep both a declustered catalog (for the stationary background) and the full catalog (for the triggering signal). Declustering the input to the conditional model is the single most common pipeline mistake, because the aftershock/foreshock triggering is the predictable signal.


3. Homogenization: completeness Mc and the move to Mw

Two transforms turn the raw, heterogeneous record into a statistically usable one.

3.1 Magnitude of completeness Mc

Mc is the minimum magnitude above which all events in a given space–time volume are detected (Wiemer & Wyss 2000). Below Mc the catalog is missing events, and any rate computed there is biased. It must be estimated per spatial cell and per time epoch — a single global Mc injects fake non-stationarity (e.g. a network densification that lowered Mc would look like a real seismicity change). Estimators (run several; their spread is the uncertainty):

  • MAXC — maximum curvature: Mc = the magnitude bin where the non-cumulative frequency–magnitude distribution peaks. Fast, but biases low in gradual roll-offs; the community fix is MAXC + 0.2, which is California-tuned and must be re-validated per region.
  • GFT — goodness-of-fit test: accept the lowest Mc where a fitted Gutenberg–Richter law matches the observed distribution at the chosen confidence level.
  • EMR — entire magnitude range: models the full distribution including the incomplete part; gives Mc and a bootstrap uncertainty.
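Of the three estimators, MAXC is simple enough to sketch in a few lines. The version below is a minimal illustration, not the project's implementation: the function name, the tie-breaking rule (lowest magnitude wins), and the default bin width are assumptions, and the +0.2 correction is exposed as a parameter because it has to be re-validated per region.

```python
from collections import Counter


def mc_maxc(magnitudes, bin_width=0.1, correction=0.2):
    """Maximum-curvature Mc: the most populated magnitude bin plus a correction."""
    if not magnitudes:
        raise ValueError("need at least one magnitude")
    # Assign each magnitude to the nearest bin index
    counts = Counter(round(m / bin_width) for m in magnitudes)
    # Most populated bin; ties are broken toward the lower magnitude
    peak_bin = min(counts, key=lambda k: (-counts[k], k))
    return round(peak_bin * bin_width + correction, 10)
```

Running this alongside GFT and EMR on the same cell and epoch, and reporting the spread, follows the "run several" advice above.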

The law Mc estimation rides on is Gutenberg–Richter, with the Aki–Utsu maximum-likelihood b-value:

$$\log_{10} N(\ge M) = a - bM, \qquad \hat{b} = \frac{\log_{10} e}{\bar{M} - (M_c - \Delta M/2)}$$

where $\bar{M}$ is the mean magnitude of events $\ge M_c$ and $\Delta M$ the bin width. Never hard-code $b = 1$. Both $M_c(t)$ and $b(t)$ are exposed as monitored quantities — drift in either flags catalog or network breakage.
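The Aki–Utsu estimator above translates directly into code. This sketch also returns Aki's first-order standard error $\hat{b}/\sqrt{N}$ as a rough uncertainty; the function name and the small tolerance used when comparing magnitudes to $M_c$ are assumptions.

```python
import math


def b_value_aki_utsu(magnitudes, mc, bin_width=0.1):
    """Maximum-likelihood b-value with the binning correction, plus b/sqrt(N)."""
    # Tolerance guards against floating-point noise in binned magnitudes
    complete = [m for m in magnitudes if m >= mc - 1e-9]
    n = len(complete)
    if n < 2:
        raise ValueError("need at least two events at or above Mc")
    mean_m = sum(complete) / n
    denom = mean_m - (mc - bin_width / 2.0)
    if denom <= 0:
        raise ValueError("mean magnitude must exceed Mc - bin_width/2")
    b = math.log10(math.e) / denom
    return b, b / math.sqrt(n)
```

Calling this per cell and per epoch, with the locally estimated $M_c$, gives the monitored $b(t)$ series; the standard error makes it easier to tell drift from noise in sparsely populated cells.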

Highest-stakes case — short-term aftershock incompleteness. Right after a large mainshock — exactly when the forecast matters most — Mc spikes for hours-to-days and a naive model under-forecasts productivity. The daily update uses an incompleteness-aware likelihood with a time-dependent $M_c(t)$, not the static fit. This is a decided method, flagged honestly in product copy as a known limitation of the most important real-time updates.

3.2 Magnitude homogenization to Mw

Catalogs mix ML, mb, Ms, Md, and Mw, which are not interchangeable (different saturation, different physics). For a single statistical model, magnitudes are homogenized to Mw (proportional to seismic moment, non-saturating). Where Mw is missing for small events, regional total-least-squares regressions ML→Mw and mb→Mw are fit (TLS, not OLS, because both axes carry error), anchored on the ISC-GEM / Global CMT overlap. Both native and Mw-homogenized magnitudes are stored; the conversion is versioned, because a wrong conversion shifts the whole Gutenberg–Richter relation and every rate forecast.
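To make the TLS-versus-OLS distinction concrete, here is a minimal orthogonal-regression sketch for a linear conversion such as ML→Mw. It assumes equal error variance on both magnitude axes (the simplest TLS case); a production fit would likely weight the axes by their actual uncertainties. The function name and the synthetic pairs in the usage lines are illustrative only.

```python
import math


def tls_fit(x, y):
    """Orthogonal (total-least-squares) line fit y = slope * x + intercept."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((v - my) ** 2 for v in y)
    sxy = sum((a - mx) * (v - my) for a, v in zip(x, y))
    if sxy == 0:
        raise ValueError("no linear relation between the two scales")
    # Closed-form slope minimizing perpendicular distances to the line
    slope = (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4.0 * sxy ** 2)) / (2.0 * sxy)
    return slope, my - slope * mx


# Synthetic, noise-free pairs (not real catalog data) lying on Mw = 0.8*ML + 0.5
ml = [2.0, 3.0, 4.0, 5.0]
mw = [0.8 * m + 0.5 for m in ml]
slope, intercept = tls_fit(ml, mw)
```

Unlike OLS, swapping the roles of the two scales yields the same line, which is the property that matters when neither magnitude is error-free.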


4. Catalog-derived features (the load-bearing feature set)

These features are computed only from the cleaned catalog and carry the dominant skill. They are the model's view of how active, and how clustered, recent seismicity is.

4.1 Recent-window rates

Counts and rates of events above Mc in trailing windows (e.g. last 1, 7, 30, 365 days) per grid cell, plus their ratios (a short-window rate divided by a long-window rate is a simple, robust "activation" signal). These capture the raw level of recent activity that any clustering model amplifies.
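A small sketch of the trailing-window rates and the activation ratio for a single grid cell. It assumes the event times passed in are already restricted to that cell and to events at or above the local Mc; the function names and the floor that prevents division by zero are hypothetical choices.

```python
from datetime import datetime, timedelta, timezone


def window_rates(event_times, now, windows_days=(1, 7, 30, 365)):
    """Events per day in each trailing window ending at `now`."""
    rates = {}
    for days in windows_days:
        start = now - timedelta(days=days)
        count = sum(1 for t in event_times if start < t <= now)
        rates[days] = count / days
    return rates


def activation_ratio(rates, short=7, long=365, floor=1e-6):
    """Short-window rate divided by long-window rate, floored to stay finite."""
    return rates[short] / max(rates[long], floor)


now = datetime(2026, 1, 10, tzinfo=timezone.utc)
times = [now - timedelta(hours=12), now - timedelta(days=3),
         now - timedelta(days=20), now - timedelta(days=400)]
rates = window_rates(times, now)
ratio = activation_ratio(rates)
```

A ratio well above 1 means the last week has been busier than the yearly average for that cell.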

4.2 ETAS / Omori conditional intensities

The Epidemic-Type Aftershock Sequence model expresses the conditional rate as a background plus the sum of triggering contributions from all past events:

$$\lambda(t,x,y \mid \mathcal{H}_t) = \mu(x,y) + \sum_{i:\,t_i<t} K\,e^{\alpha(m_i - M_0)}\; \frac{p-1}{c}\Big(1+\frac{t-t_i}{c}\Big)^{-p}\; f(x-x_i, y-y_i \mid m_i)$$

The Omori–Utsu temporal decay $\big(1+\frac{t-t_i}{c}\big)^{-p}$ and the Utsu productivity $K\,e^{\alpha(m_i-M_0)}$ are the heart of aftershock physics. Evaluated at the forecast instant, this intensity (and its components: current background level and current triggered excess) is itself a feature the model consumes, and is the reference baseline the whole product must beat (see Models-Classical).
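The sketch below evaluates the temporal part of the ETAS intensity and returns the background and triggered components separately, as the feature description suggests. It assumes the spatial kernel $f$ has been integrated out (it integrates to 1) and that $\mu$ is already the space-integrated background; the function name and all parameter values in the usage lines are illustrative, not fitted.

```python
import math


def etas_temporal_intensity(t, history, mu, K, alpha, c, p, m0):
    """Return (total, background, triggered) intensity at time t (days).

    history: iterable of (t_i, m_i) pairs; only events with t_i < t contribute.
    """
    if c <= 0 or p <= 1:
        raise ValueError("this normalized Omori form needs c > 0 and p > 1")
    triggered = 0.0
    for t_i, m_i in history:
        if t_i >= t:
            continue
        productivity = K * math.exp(alpha * (m_i - m0))
        omori = (p - 1.0) / c * (1.0 + (t - t_i) / c) ** (-p)
        triggered += productivity * omori
    return mu + triggered, mu, triggered


# Illustrative parameters only; time unit is days
total, background, excess = etas_temporal_intensity(
    t=1.0, history=[(0.0, 6.0), (0.5, 4.2)],
    mu=0.05, K=0.1, alpha=1.5, c=0.01, p=1.2, m0=3.0)
```

Keeping `background` and `excess` as separate outputs lets a downstream model see how much of the current rate is triggered rather than stationary.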

4.3 Zaliapin–Ben-Zion nearest-neighbor proximity ($\eta$, $T$, $R$)

A non-parametric, data-driven way to quantify how strongly each event is clustered to its most likely parent. Define the proximity from event $j$ to a candidate parent $i$:

$$\eta_{ij} = t_{ij}\,(r_{ij})^{d_f}\,10^{-b\,m_i}, \qquad \eta_j = \min_i \eta_{ij}$$

with $t_{ij}$ the inter-event time (years, $>0$ only if $i$ precedes $j$), $r_{ij}$ the epicentral distance (km), $d_f$ the fractal dimension of hypocenters, $b$ the b-value, and $m_i$ the parent magnitude. It decomposes into rescaled time and rescaled space:

$$T_j = t_{ij}\,10^{-q b m_i}, \qquad R_j = (r_{ij})^{d_f}\,10^{-(1-q) b m_i}, \qquad \eta_{ij} = T_j R_j \quad (q \approx 0.5)$$

The histogram of $\log_{10}\eta$ is bimodal — one mode is background, one is clustered — so the threshold $\eta_0$ that separates them is a principled cluster labeler. Crucially, $\eta$, $T$, and $R$ are kept as per-event ML features (not just a keep/drop flag): they encode foreshock/aftershock structure and swarm-vs-burst topology directly.
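A brute-force sketch of the nearest-neighbor computation, returning the parent index together with $\eta$, $T$, and $R$ for every event. It is quadratic in catalog size, so it is only meant to show the definition; the function names, the minimum-distance floor (to keep co-located events from producing $\eta = 0$), and the parameter values in the usage lines are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2.0 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def nearest_neighbors(events, b, d_f, q=0.5, r_floor_km=0.1):
    """events: list of (t_years, lat, lon, mag) sorted by time.

    Returns one (parent_index, eta, T, R) tuple per event; an event with
    no earlier candidate gets (None, inf, inf, inf).
    """
    out = []
    for j, (tj, latj, lonj, _mj) in enumerate(events):
        best = (None, math.inf, math.inf, math.inf)
        for i in range(j):
            ti, lati, loni, mi = events[i]
            dt = tj - ti
            if dt <= 0:
                continue  # a parent must strictly precede its child
            r = max(haversine_km(lati, loni, latj, lonj), r_floor_km)
            T = dt * 10.0 ** (-q * b * mi)
            R = r ** d_f * 10.0 ** (-(1.0 - q) * b * mi)
            if T * R < best[1]:
                best = (i, T * R, T, R)
        out.append(best)
    return out


# Illustrative catalog and parameters (b and d_f should be estimated per region)
catalog = [(0.0, 0.0, 0.0, 5.0), (0.001, 0.0, 0.01, 2.0), (1.0, 10.0, 10.0, 2.0)]
links = nearest_neighbors(catalog, b=1.0, d_f=1.6)
```

Keeping `T` and `R` alongside `eta` in the output is what allows them to be used as per-event features rather than only as a cluster label.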

The dual-catalog rule (stated once, enforced everywhere). A declustered catalog (Gardner–Knopoff as a transparent cross-check; Zaliapin–Ben-Zion as primary) feeds only the stationary background rate $\mu(x,y)$ and Poisson-baseline calibration. The full, un-declustered catalog feeds the conditional/ETAS triggering term, because the triggering is the signal. The Gardner–Knopoff windows (OpenQuake hmtk coefficients) are $L(M)=10^{\,0.1238M+0.983}$ km and $T(M)=10^{\,0.032M+2.7389}$ d for $M\ge6.5$, else $10^{\,0.5409M-0.547}$ d.


5. Context covariates (from the global enrichers — upside, not foundation)

These features tell the model what kind of place a cell is. Each is joined as a static or slowly-varying spatial field from a global enricher (Data-Sources §7–§13), and each must demonstrate marginal information gain over a catalog-only baseline before it ships.

flowchart LR
    subgraph SRC["Global enrichers (sources)"]
        S2["Slab2<br/>subduction geometry"]
        GF["GEM Active Faults"]
        PB["Bird PB2002 plates"]
        GN["NGL GNSS / MIDAS"]
        WS["World Stress Map<br/>+ Global CMT mechanisms"]
        TD["Tidal stress<br/>(pygtide + SPOTL/TPXO)"]
    end
    subgraph COV["Per-cell context covariates"]
        C1["slab depth / dip / interface distance"]
        C2["distance-to-nearest-fault · fault style"]
        C3["distance-to-boundary · boundary type"]
        C4["strain rate (dilatation, max shear)"]
        C5["faulting regime · rake · dCFS"]
        C6["tidal dCFS · phase · Mf envelope"]
    end
    S2 --> C1
    GF --> C2
    PB --> C3
    GN --> C4
    WS --> C5
    TD --> C6
    COV --> GRID["Forecast grid<br/>(joined per cell)"]

5.1 Slab and geometry covariates

From Slab2: depth-to-slab, local dip and strike, and distance-to-interface. Combined with event depth, these separate interface, intraslab, and crustal seismicity — very different physics and rates. This is the highest-value context covariate in subduction regions.

5.2 Fault and plate-boundary covariates

From GEM Active Faults: distance-to-nearest-active-fault and fault style. From Bird PB2002: distance-to-plate-boundary, boundary type (subduction / transform / ridge), and plate-pair relative velocity. Together they place a cell in its tectonic setting.

5.3 Geodetic strain rate

From NGL GNSS / MIDAS: a strain-rate field (dilatation, maximum shear, second invariant) interpolated from station velocities. Strain rate correlates with long-term seismicity rate, so it primarily informs the background term, not the short-horizon term.

5.4 Stress and mechanism covariates

From Global CMT mechanisms and the World Stress Map: faulting regime, rake / P-T axis orientation, and Coulomb stress-transfer ($\Delta\mathrm{CFS}$) from recent large events — physically motivated triggering features, second-tier effort.

5.5 Tidal Coulomb stress covariates

From the computed tidal-stress series (Data-Sources §13): the tidal Coulomb failure stress at the forecast instant and its derived shape. These enter the conditional intensity as a regularized multiplier with the physically correct exponential (rate-and-state) form, with a learned coupling the data is allowed to drive to ~0:

$$\lambda(t \mid \mathcal{H}) = \lambda_0(t \mid \mathcal{H}) \cdot \exp\!\Big(\beta\,\tfrac{\Delta\mathrm{CFS}(t)}{A\sigma}\Big), \qquad \beta \in [0,1]$$

Engineered tidal features: $\Delta\mathrm{CFS}(t_0)$; its rate $\dot{S}(t_0)$; the circular phase $(\sin\theta, \cos\theta)$; the semidiurnal envelope; and the fortnightly envelope (Mf at ~13.7 d; the spring–neap MSf beat at ~14.8 d). For most regions the expected gain is near-null; only shallow ocean-loaded thrusts, ridges, and the tremor/slow-slip channel show a measurable (still few-percent) effect, and that is reported honestly.
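A minimal sketch of the tidal multiplier and the circular phase encoding. The function names, the choice of kPa as the stress unit, and the example values are assumptions; the only requirement carried over from the formula is that $\Delta\mathrm{CFS}$ and $A\sigma$ share units and that $\beta$ stays in $[0, 1]$.

```python
import math


def tidal_rate_multiplier(delta_cfs_kpa, a_sigma_kpa, beta):
    """exp(beta * dCFS / (A*sigma)); beta = 0 switches the tidal term off."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if a_sigma_kpa <= 0:
        raise ValueError("A*sigma must be positive")
    return math.exp(beta * delta_cfs_kpa / a_sigma_kpa)


def phase_features(theta_rad):
    """Circular encoding so that phases 0 and 2*pi map to the same point."""
    return math.sin(theta_rad), math.cos(theta_rad)


# Illustrative values only: 2 kPa tidal stress, A*sigma = 20 kPa, beta = 0.3
multiplier = tidal_rate_multiplier(2.0, 20.0, 0.3)
```

Because the coupling enters inside the exponential, a fitted $\beta$ near zero leaves the baseline intensity untouched, which is the intended behavior for regions with no detectable tidal signal.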


6. Feature catalog (definition · source · why it matters)

The complete feature set the model considers. Tier marks the honest hierarchy: spine features (from the catalog) carry the skill; context covariates are upside, gated on measured information gain.

| Feature | Definition | Source | Tier | Why it matters |
| --- | --- | --- | --- | --- |
| Recent-window rate | Count/rate of events ≥ Mc in trailing windows (1/7/30/365 d) per cell | Catalog (homogenized) | Spine | Raw level of recent activity that clustering amplifies. |
| Rate ratio (activation) | Short-window rate ÷ long-window rate | Catalog | Spine | Simple, robust signal that a sequence is starting. |
| ETAS/Omori intensity | $\lambda(t,x,y\mid\mathcal{H}_t)$ and its background/triggered components at the forecast instant | Catalog (ETAS fit) | Spine | The physics-grounded conditional rate; the baseline to beat. |
| Productivity / decay state | Current $K\,e^{\alpha(m-M_0)}$ and Omori $(1+t/c)^{-p}$ contributions | Catalog (ETAS fit) | Spine | Encodes how strongly recent large events are still triggering. |
| ZBZ proximity $\eta$ | Nearest-neighbor space–time–magnitude proximity | Catalog | Spine | Data-driven clustering strength per event. |
| ZBZ rescaled time $T$ | $t_{ij}\,10^{-qbm_i}$ | Catalog | Spine | Separates temporal clustering from spatial. |
| ZBZ rescaled space $R$ | $(r_{ij})^{d_f}\,10^{-(1-q)bm_i}$ | Catalog | Spine | Separates spatial clustering; with $T$ identifies swarms vs bursts. |
| b-value $b(x,y,t)$ | Aki–Utsu MLE per cell/epoch | Catalog | Spine | Sets the magnitude distribution and the large-event tail. |
| Mc(x,y,t) | Magnitude of completeness per cell/epoch | Catalog | Spine | Trust boundary; below it rates are biased. A monitored quantity. |
| Slab depth / dip | Depth-to-slab and local dip at the cell | Slab2 | Context | Separates interface / intraslab / crustal seismicity. |
| Interface distance | Distance to the subduction interface | Slab2 | Context | Conditions triggering geometry in subduction zones. |
| Distance-to-fault | Distance to nearest mapped active fault | GEM Active Faults | Context | Proximity to a known fault raises prior rate. |
| Fault style | Mapped faulting style at the cell | GEM Active Faults | Context | Constrains expected mechanism / orientation. |
| Distance-to-boundary | Distance to nearest plate boundary | Bird PB2002 | Context | Places the cell in the plate-tectonic frame. |
| Boundary type | Subduction / transform / ridge | Bird PB2002 | Context | Sets the dominant physics (e.g. thrust vs strike-slip). |
| Strain rate | Dilatation / max shear / second invariant | NGL GNSS / MIDAS | Context | Correlates with the long-term background rate. |
| Faulting regime / rake | Stress regime and P-T axis orientation | World Stress Map / Global CMT | Context | Orients the fault for tidal/Coulomb resolution. |
| $\Delta\mathrm{CFS}$ (static) | Coulomb stress transfer from recent large events | Global CMT mechanisms | Context | Physically motivated triggering covariate. |
| Tidal $\Delta\mathrm{CFS}(t_0)$ | Tidal Coulomb failure stress at the instant | Computed (pygtide + SPOTL/TPXO) | Context | Regularized multiplier; coupling allowed to → 0. |
| Tidal stressing rate $\dot{S}$ | Time derivative of tidal $\Delta\mathrm{CFS}$ | Computed | Context | Rate governs whether a phase correlation is even detectable. |
| Tidal phase $(\sin\theta,\cos\theta)$ | Circular encoding of tidal phase | Computed | Context | Keeps the periodic phase variable continuous. |
| Semidiurnal / Mf envelope | Recent-window semidiurnal and fortnightly amplitude | Computed | Context | The fortnightly band (Mf ~13.7 d; MSf ~14.8 d) is where a clean tidal signal is most plausible. |

7. What the model deliberately does not ingest as a feature

Stated for honesty and to forestall over-reading:

  • No "imminence" or alarm feature. Nothing in the feature set is a countdown or a deterministic precursor. Every feature feeds a bounded, calibrated probability scored against reality.
  • No raw waveforms in v1. Waveform-derived detection (ML phase pickers) would improve the input catalog (lower, more stable Mc), which helps every model — but detection is not forecasting, and it is out of the v1 feature set.
  • No un-gated covariate. A context covariate that fails to show positive prospective information gain over the catalog-only baseline stays out of the public model, reported as a near-null.
  • No declustered catalog as the conditional input. The triggering signal is kept; only the background term uses the declustered view.
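The covariate gate in the list above relies on information gain over the catalog-only baseline. A minimal sketch of one way to compute it for gridded Poisson forecasts follows. The function names are hypothetical, and the significance check (mean gain minus 1.96 standard errors across test periods, a normal approximation) is a stand-in assumption, not the project's stated test.

```python
import math


def poisson_log_likelihood(rates, counts):
    """Joint log-likelihood of observed bin counts under independent Poisson rates."""
    ll = 0.0
    for lam, n in zip(rates, counts):
        if lam <= 0:
            raise ValueError("forecast rates must be strictly positive")
        ll += -lam + n * math.log(lam) - math.lgamma(n + 1)
    return ll


def information_gain_per_event(model_rates, baseline_rates, counts):
    """Log-likelihood difference (model minus baseline) per observed event."""
    n_events = sum(counts)
    if n_events == 0:
        raise ValueError("no observed events in the test period")
    gain = (poisson_log_likelihood(model_rates, counts)
            - poisson_log_likelihood(baseline_rates, counts))
    return gain / n_events


def passes_gate(period_gains, z=1.96):
    """True if the mean gain across test periods is positive beyond z standard errors."""
    n = len(period_gains)
    if n < 2:
        raise ValueError("need at least two test periods")
    mean = sum(period_gains) / n
    var = sum((g - mean) ** 2 for g in period_gains) / (n - 1)
    return mean - z * math.sqrt(var / n) > 0
```

A covariate whose per-period gains are small and scattered around zero fails this check and would be reported as a near-null rather than shipped.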

References

  • Aki, K. (1965). Maximum likelihood estimate of b in the formula log N = a − bM. Bull. Earthq. Res. Inst. 43, 237–239.
  • Baiesi, M., & Paczuski, M. (2004). Scale-free networks of earthquakes and aftershocks. Phys. Rev. E 69, 066106. DOI 10.1103/PhysRevE.69.066106.
  • Beeler, N. M., & Lockner, D. A. (2003). Why earthquakes correlate weakly with the solid Earth tides. JGR Solid Earth 108(B8), 2391. DOI 10.1029/2001JB001518.
  • Gardner, J. K., & Knopoff, L. (1974). Is the sequence of earthquakes in Southern California, with aftershocks removed, Poissonian? BSSA 64(5), 1363–1367.
  • Hayes, G. P., et al. (2018). Slab2, a comprehensive subduction zone geometry model. Science 362, 58–61. DOI 10.1126/science.aat4723.
  • Ogata, Y. (1988). Statistical models for earthquake occurrences and residual analysis for point processes. JASA 83(401), 9–27. DOI 10.1080/01621459.1988.10478560.
  • Ogata, Y. (1998). Space–time point-process models for earthquake occurrences. Ann. Inst. Statist. Math. 50(2), 379–402. DOI 10.1023/A:1003403601725.
  • Wiemer, S., & Wyss, M. (2000). Minimum magnitude of completeness in earthquake catalogs. BSSA 90(4), 859–869. DOI 10.1785/0119990114.
  • Woessner, J., & Wiemer, S. (2005). Assessing the quality of earthquake catalogues: estimating the magnitude of completeness and its uncertainty (EMR). BSSA 95(2), 684–698. DOI 10.1785/0120040007.
  • Zaliapin, I., Gabrielov, A., Keilis-Borok, V., & Wong, H. (2008). Clustering analysis of seismicity and aftershock identification. Phys. Rev. Lett. 101, 018501. DOI 10.1103/PhysRevLett.101.018501.
  • Zaliapin, I., & Ben-Zion, Y. (2020). Earthquake declustering using the nearest-neighbor approach in space-time-magnitude domain. JGR Solid Earth 125, e2018JB017120. DOI 10.1029/2018JB017120.

See also: Data-Sources — where each of these data types and covariates comes from · Pipeline — the versioned DAG that produces these features · Models-Classical — how the ETAS / Omori / Gutenberg–Richter features are used · Models-ML — how the feature set feeds the neural challenger.