Data Types and Features

This page documents the types of data CAOS_SEISMIC ingests and the features it considers, so a reader understands what the model actually sees. It is the companion to Data-Sources, which documents where the data comes from; read that first if you want the providers and licenses. Here the question is narrower and more technical: given the raw feeds, what is one record, how is it cleaned, and which numbers are derived from it and fed to the model?

The answer has a strict shape:

1. One data type dominates: the catalog event record (an earthquake's time, place, depth, and magnitude). Everything load-bearing is derived from it.
2. A homogenization step turns the heterogeneous magnitude field into a single, comparable scale (Mw) and estimates the magnitude of completeness Mc below which the catalog cannot be trusted.
3. Catalog-derived features capture recent activity and its clustering structure: recent-window rates, ETAS/Omori intensities, and Zaliapin–Ben-Zion nearest-neighbor proximity ($\eta$, $T$, $R$).
4. Context covariates from the global enrichers (slab geometry, distance-to-fault, plate-boundary type, GNSS strain rate, tidal Coulomb stress) tell the model what kind of place a cell is.

Honest framing. Categories 1–3 (the catalog and its derived features) carry essentially all of the published short-term forecasting skill. Category 4 (context covariates) is upside, not foundation: a covariate enters the model only after it shows a positive, significant pseudo-prospective information gain over a catalog-only baseline. This page labels every feature with that distinction.

1. The catalog event record (the primary data type)

The atomic unit of input is a single catalog event record. Whether it comes from USGS ComCat, ISC, or a regional network (see Data-Sources), every record carries the same core fields:

| Field | Type | Meaning | Notes for modeling |
|---|---|---|---|
| `event_id` | string | Provider's unique origin id | Used to dedupe across providers (ComCat vs regional vs EMSC). |
| `time` | UTC timestamp | Origin time | The clock the whole point-process model runs on; sub-second precision matters for Omori decay at short lags. |
| `latitude`, `longitude` | float (degrees) | Epicenter | Joined to the forecast grid; distances are computed great-circle. |
| `depth` | float (km) | Hypocentral depth | Separates crustal / interface / intraslab when combined with Slab2 (§5). |
| `magnitude` | float | Reported size | Heterogeneous; see `magType`. |
| `magType` | string | Magnitude scale (`ml`, `mb`, `ms`, `mw`, `md`, …) | First-class field, never dropped. Different scales saturate differently; mixing them silently distorts the magnitude–frequency tail. |

The single most important hygiene rule on the raw record: keep magType. A catalog that mixes mb, Ms, and Mw without recording which is which makes the Gutenberg–Richter b-value (and hence every rate forecast) wrong. Homogenization (§3) depends on this field being present and correct.

A record may carry more — moment tensor, focal mechanism, nodal planes, P/T axes (from Global CMT, Data-Sources §5) — which feed the mechanism / stress features (§5). But the six fields above are the irreducible minimum every event must have.
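As an illustration of the six-field minimum, here is a minimal sketch of a record type. The class name `CatalogEvent`, the attribute name `mag_type`, and the `dedupe` helper are hypothetical and not taken from the CAOS_SEISMIC code base. Deduplicating on an exact `event_id` match is a simplification, since different providers may assign different ids to the same earthquake.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CatalogEvent:
    """Hypothetical container for the six required fields of one event."""
    event_id: str
    time: datetime      # origin time, timezone-aware (UTC)
    latitude: float     # degrees
    longitude: float    # degrees
    depth: float        # km
    magnitude: float
    mag_type: str       # the magType field, e.g. "ml", "mb", "mw"

    def __post_init__(self):
        if self.time.tzinfo is None:
            raise ValueError("time must be timezone-aware (UTC)")
        if not -90.0 <= self.latitude <= 90.0:
            raise ValueError("latitude out of range")
        if not -180.0 <= self.longitude <= 180.0:
            raise ValueError("longitude out of range")
        if not self.mag_type:
            raise ValueError("magType is required and must not be dropped")


def dedupe(events):
    """Keep the first record seen per event_id (simplified, exact-id match only)."""
    seen = {}
    for ev in events:
        seen.setdefault(ev.event_id, ev)
    return list(seen.values())
```

Rejecting a record with an empty `mag_type` at construction time is one way to enforce the hygiene rule before homogenization ever runs.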

1.1 From record to forecast target

The model does not forecast individual events; it forecasts a rate $\lambda$ per space–time–magnitude cell (a 0.1° × 0.1° × 0.1-magnitude bin, see Technical-Architecture), from which the public exceedance probability is derived:

$$P(\ge 1 \text{ event} \ge M^*) = 1 - e^{-N_{\ge M^*}}, \qquad N_{\ge M^*} = \iiint \lambda\,\Phi(M^*)\,dx\,dy\,dt, \quad \Phi(M^*) = 10^{-b\,(M^*-M_c)}$$

So the event record is the input; a calibrated probability per region × magnitude × horizon is the output. Everything in between is the homogenization and feature engineering documented below.
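The exceedance formula can be sketched for the simplest case: a rate that is constant over the cell and the horizon, so the integral collapses to a product. The function name and the per-day rate unit are assumptions made for this example.

```python
import math


def exceedance_probability(rate_above_mc, b, m_star, mc, horizon_days):
    """P(at least one event >= m_star) within the horizon.

    rate_above_mc is the expected number of events >= mc per day,
    assumed constant here so the integral reduces to rate * horizon.
    """
    phi = 10.0 ** (-b * (m_star - mc))          # fraction of events >= m_star
    n_expected = rate_above_mc * horizon_days * phi
    return -math.expm1(-n_expected)             # 1 - exp(-N), stable for small N


# 0.5 events/day above Mc = 2.0, b = 1.0, target M* = 5.0, 7-day horizon
p = exceedance_probability(0.5, 1.0, 5.0, 2.0, 7.0)
print(f"{p:.4%}")
```

With these illustrative numbers the expected count is 0.0035, so the probability is roughly 0.35%. Using `expm1` avoids loss of precision when the expected count is very small, which is the usual regime for large target magnitudes.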

2. The catalog-to-features flow

```mermaid
flowchart TD
    REC["Catalog event record<br/>time · lat · lon · depth · magnitude · magType"]
    DEDUP["Dedupe across providers<br/>(preferred-origin id)"]
    MC["Estimate Mc(x,y,t)<br/>MAXC+0.2 / GFT / EMR"]
    CUT["Cut events below local Mc"]
    HOM["Homogenize magnitude -> Mw<br/>total-least-squares ML/mb -> Mw<br/>anchored on ISC-GEM / Global CMT"]
    DECL["Decluster (DUAL view)<br/>GK background  |  full catalog for triggering<br/>Zaliapin-Ben-Zion eta/T/R labels"]
    FEAT["Catalog-derived features<br/>recent-window rates · ETAS/Omori intensities · eta/T/R"]
    REC --> DEDUP --> MC --> CUT --> HOM --> DECL --> FEAT
```

The order is load-bearing and identical to the Pipeline DAG: you cannot homogenize before you know magType, you cannot estimate a trustworthy b-value before you cut below Mc, and you must keep both a declustered catalog (for the stationary background) and the full catalog (for the triggering signal). Declustering the input to the conditional model is the single most common pipeline mistake — the aftershock/foreshock triggering is the predictable signal.

3. Homogenization: completeness `Mc` and the move to Mw

Two transforms turn the raw, heterogeneous record into a statistically usable one.

3.1 Magnitude of completeness `Mc`

Mc is the minimum magnitude above which all events in a given space–time volume are detected (Wiemer & Wyss 2000). Below Mc the catalog is missing events, and any rate computed there is biased. It must be estimated per spatial cell and per time epoch — a single global Mc injects fake non-stationarity (e.g. a network densification that lowered Mc would look like a real seismicity change). Estimators (run several; their spread is the uncertainty):

MAXC — maximum curvature: Mc = the magnitude bin where the non-cumulative frequency–magnitude distribution peaks. Fast, but biases low in gradual roll-offs; the community fix is MAXC + 0.2, which is California-tuned and must be re-validated per region.
GFT — goodness-of-fit test: accept the lowest Mc where a fitted Gutenberg–Richter law matches the observed distribution at the chosen confidence level.
EMR — entire magnitude range: models the full distribution including the incomplete part; gives Mc and a bootstrap uncertainty.
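A minimal sketch of the MAXC estimator with the +0.2 correction described above. It assumes magnitudes are already reported at a fixed bin width; the function name and defaults are illustrative, and the correction should be re-validated per region as noted.

```python
import numpy as np


def mc_maxc(mags, bin_width=0.1, correction=0.2):
    """Maximum-curvature Mc: the most populated magnitude bin plus a correction."""
    mags = np.asarray(mags, dtype=float)
    if mags.size == 0:
        raise ValueError("empty catalog")
    # Integer bin indices avoid floating-point problems at bin edges.
    idx = np.round(mags / bin_width).astype(int)
    lowest = idx.min()
    counts = np.bincount(idx - lowest)   # non-cumulative frequency per bin
    mode_bin = lowest + int(np.argmax(counts))
    return mode_bin * bin_width + correction


sample = [1.0] * 5 + [1.5] * 20 + [2.0] * 10 + [2.5] * 3
print(round(mc_maxc(sample), 2))
```

In the sample, the 1.5 bin is the most populated, so the corrected estimate is 1.7. Running GFT and EMR on the same window and comparing the spread is what turns this point estimate into an uncertainty.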

The law Mc estimation rides on is Gutenberg–Richter, with the Aki–Utsu maximum-likelihood b-value:

$$\log_{10} N(\ge M) = a - bM, \qquad \hat{b} = \frac{\log_{10} e}{\bar{M} - (M_c - \Delta M/2)}$$

where $\bar{M}$ is the mean magnitude of events $\ge M_c$ and $\Delta M$ the bin width. Never hard-code $b = 1$. Both $M_c(t)$ and $b(t)$ are exposed as monitored quantities — drift in either flags catalog or network breakage.
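The Aki–Utsu estimator above can be sketched in a few lines. The function name is hypothetical; the standard error $\hat{b}/\sqrt{N}$ is the first-order estimate from Aki (1965), and the synthetic catalog is only there to show the call.

```python
import math
import numpy as np


def b_value_aki_utsu(mags, mc, bin_width=0.1):
    """Maximum-likelihood b-value for binned magnitudes >= mc.

    Returns (b, sigma) with sigma = b / sqrt(N), Aki's first-order error.
    """
    mags = np.asarray(mags, dtype=float)
    above = mags[mags >= mc - 1e-9]          # tolerance for binned floats
    if above.size < 2:
        raise ValueError("too few events above Mc")
    mean_excess = above.mean() - (mc - bin_width / 2.0)
    b = math.log10(math.e) / mean_excess
    return b, b / math.sqrt(above.size)


# Synthetic Gutenberg-Richter catalog with true b = 1, binned to 0.1 units.
rng = np.random.default_rng(0)
continuous = 1.95 + rng.exponential(1.0 / math.log(10.0), 20000)
binned = np.round(continuous / 0.1) * 0.1
b_hat, sigma = b_value_aki_utsu(binned, mc=2.0)
print(f"b = {b_hat:.2f} +/- {sigma:.3f}")
```

The half-bin term $\Delta M/2$ matters: dropping it on 0.1-binned data biases the estimate noticeably high, which is one reason the bin width should travel with the catalog metadata.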

Highest-stakes case: short-term aftershock incompleteness. Right after a large mainshock, exactly when the forecast matters most, Mc spikes for hours to days, and a naive model that reads the missing small events as real quiet under-forecasts productivity. The daily update therefore uses an incompleteness-aware likelihood with a time-dependent $M_c(t)$ rather than the static fit. The choice of method is settled, but product copy still flags it as a known limitation of the most important real-time updates.

3.2 Magnitude homogenization to Mw

Catalogs mix ML, mb, Ms, Md, Mw — not interchangeable (different saturation, different physics). For a single statistical model, magnitudes are homogenized to Mw (proportional to seismic moment, non-saturating). Where Mw is missing for small events, regional total-least-squares regressions ML→Mw and mb→Mw are fit (TLS, not OLS — both axes carry error), anchored on the ISC-GEM / Global CMT overlap. Both native and Mw-homogenized magnitudes are stored; the conversion is versioned, because a wrong conversion shifts the whole Gutenberg–Richter relation and every rate forecast.
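To illustrate why TLS differs from OLS, here is a minimal orthogonal-regression sketch. It assumes equal error variance on both magnitude axes, which is an assumption of this example (a Deming fit with a known variance ratio is the general case), and the function names are hypothetical.

```python
import numpy as np


def tls_line(x, y):
    """Fit y = slope * x + intercept by minimizing orthogonal distances."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    centered = np.column_stack([x - x_mean, y - y_mean])
    # The first right-singular vector is the direction of largest variance,
    # which is the TLS line direction for equal error variances.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    dx, dy = vt[0]
    if abs(dx) < 1e-12:
        raise ValueError("fitted line is vertical")
    slope = dy / dx
    return slope, y_mean - slope * x_mean


def to_mw(native_mag, slope, intercept):
    """Apply a fitted conversion; the native value should be stored alongside."""
    return slope * native_mag + intercept


ml = np.linspace(2.0, 5.0, 20)
mw = 0.8 * ml + 0.5                  # noise-free toy relation, illustrative only
slope, intercept = tls_line(ml, mw)
print(round(slope, 3), round(intercept, 3))
```

The toy coefficients are not a real ML→Mw relation; in practice the fit would use the events that have both a native magnitude and an ISC-GEM or Global CMT Mw.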

4. Catalog-derived features (the load-bearing feature set)

These features are computed only from the cleaned catalog and carry the dominant skill. They are the model's view of how active, and how clustered, recent seismicity is.

4.1 Recent-window rates

Counts and rates of events above Mc in trailing windows (e.g. last 1, 7, 30, 365 days) per grid cell, plus their ratios (a short-window rate divided by a long-window rate is a simple, robust "activation" signal). These capture the raw level of recent activity that any clustering model amplifies.
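A minimal sketch of the trailing-window rates and the activation ratio for one grid cell. Times are expressed in days on an arbitrary common clock; the function names and the small `eps` guard against division by zero are assumptions of this example.

```python
import numpy as np

WINDOWS_DAYS = (1, 7, 30, 365)


def window_rates(event_times_days, t_now, windows=WINDOWS_DAYS):
    """Events per day in each trailing window (t_now - w, t_now]."""
    t = np.sort(np.asarray(event_times_days, dtype=float))
    upper = np.searchsorted(t, t_now, side="right")   # excludes future events
    rates = {}
    for w in windows:
        lower = np.searchsorted(t, t_now - w, side="right")
        rates[w] = float(upper - lower) / w
    return rates


def activation_ratio(rates, short=7, long=365, eps=1e-6):
    """Short-window rate divided by long-window rate."""
    return rates[short] / (rates[long] + eps)


times = [0.5, 80.0, 95.0, 99.7, 99.9, 300.0]   # the last event is in the future
rates = window_rates(times, t_now=100.0)
print(rates[1], activation_ratio(rates))
```

Only events above the local Mc should be passed in, and events after `t_now` are excluded so the feature cannot leak future information into a pseudo-prospective test.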

4.2 ETAS / Omori conditional intensities

The Epidemic-Type Aftershock Sequence model expresses the conditional rate as a background plus the sum of triggering contributions from all past events:

$$\lambda(t,x,y \mid \mathcal{H}_t) = \mu(x,y) + \sum_{i:\,t_i<t} K\,e^{\alpha(m_i - M_0)}\; \frac{p-1}{c}\Big(1+\frac{t-t_i}{c}\Big)^{-p}\; f(x-x_i, y-y_i \mid m_i)$$

The Omori–Utsu temporal decay $\big(1+\frac{t-t_i}{c}\big)^{-p}$ and the Utsu productivity $K\,e^{\alpha(m_i-M_0)}$ are the heart of aftershock physics. Evaluated at the forecast instant, this intensity (and its components: the current background level and the current triggered excess) is itself a feature the model consumes, and it is the reference baseline the whole product must beat (see Models-Classical).
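A minimal sketch of evaluating the temporal part of this intensity at a forecast instant. It deliberately drops the spatial kernel $f$ (so it is the purely temporal ETAS form, a simplification), and the parameter values in the example are illustrative rather than fitted.

```python
import numpy as np


def etas_temporal_intensity(t, times, mags, mu, K, alpha, c, p, m0):
    """Return (total, background, triggered) intensity at time t.

    Only events strictly before t contribute. Requires p > 1 so that the
    normalized Omori-Utsu kernel (p - 1) / c * (1 + dt / c) ** -p integrates to 1.
    """
    times = np.asarray(times, dtype=float)
    mags = np.asarray(mags, dtype=float)
    past = times < t
    dt = t - times[past]
    productivity = K * np.exp(alpha * (mags[past] - m0))
    omori = (p - 1.0) / c * (1.0 + dt / c) ** (-p)
    triggered = float(np.sum(productivity * omori))
    return mu + triggered, mu, triggered


total, background, triggered = etas_temporal_intensity(
    t=0.01, times=[0.0, 5.0], mags=[4.0, 6.0],
    mu=0.2, K=0.1, alpha=1.0, c=0.01, p=1.2, m0=4.0,
)
print(total, background, triggered)
```

Returning the background and triggered parts separately mirrors the text: both components, not only their sum, are usable as features.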

4.3 Zaliapin–Ben-Zion nearest-neighbor proximity ($\eta$, $T$, $R$)

A non-parametric, data-driven way to quantify how strongly each event is clustered to its most likely parent. Define the proximity from event $j$ to a candidate parent $i$:

$$\eta_{ij} = t_{ij}\,(r_{ij})^{d_f}\,10^{-b\,m_i}, \qquad \eta_j = \min_i \eta_{ij}$$

with $t_{ij}$ the inter-event time (years, $>0$ only if $i$ precedes $j$), $r_{ij}$ the epicentral distance (km), $d_f$ the fractal dimension of hypocenters, $b$ the b-value, and $m_i$ the parent magnitude. It decomposes into rescaled time and rescaled space:

$$T_j = t_{ij}\,10^{-q b m_i}, \qquad R_j = (r_{ij})^{d_f}\,10^{-(1-q) b m_i}, \qquad \eta_{ij} = T_j R_j \quad (q \approx 0.5)$$

The histogram of $\log_{10}\eta$ is bimodal — one mode is background, one is clustered — so the threshold $\eta_0$ that separates them is a principled cluster labeler. Crucially, $\eta$, $T$, and $R$ are kept as per-event ML features (not just a keep/drop flag): they encode foreshock/aftershock structure and swarm-vs-burst topology directly.
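A minimal sketch of the nearest-neighbor computation defined above. It assumes planar coordinates in km (a real pipeline would use great-circle distances), a brute-force O(n²) search, a small distance floor to avoid zero proximity for co-located events, and illustrative defaults for $b$ and $d_f$; all names are hypothetical.

```python
import numpy as np


def nearest_neighbor_proximity(t_years, x_km, y_km, mags,
                               b=1.0, d_f=1.6, q=0.5, r_min_km=1e-3):
    """Return (parent, eta, T, R) arrays; events with no earlier event get -1 / inf."""
    t = np.asarray(t_years, dtype=float)
    x = np.asarray(x_km, dtype=float)
    y = np.asarray(y_km, dtype=float)
    m = np.asarray(mags, dtype=float)
    n = t.size
    parent = np.full(n, -1, dtype=int)
    eta = np.full(n, np.inf)
    T = np.full(n, np.inf)
    R = np.full(n, np.inf)
    for j in range(n):
        earlier = np.flatnonzero(t < t[j])       # candidate parents precede j
        if earlier.size == 0:
            continue
        dt = t[j] - t[earlier]
        r = np.maximum(np.hypot(x[j] - x[earlier], y[j] - y[earlier]), r_min_km)
        t_resc = dt * 10.0 ** (-q * b * m[earlier])
        r_resc = r ** d_f * 10.0 ** (-(1.0 - q) * b * m[earlier])
        prox = t_resc * r_resc                   # equals dt * r**d_f * 10**(-b*m)
        k = int(np.argmin(prox))
        parent[j], eta[j], T[j], R[j] = earlier[k], prox[k], t_resc[k], r_resc[k]
    return parent, eta, T, R


parent, eta, T, R = nearest_neighbor_proximity(
    t_years=[0.0, 1.0, 1.001], x_km=[0.0, 100.0, 1.0],
    y_km=[0.0, 0.0, 0.0], mags=[6.0, 3.0, 2.0],
)
print(parent)
```

In the example the third event is assigned to the distant-in-time magnitude 6 event rather than the much more recent magnitude 3 event, because the magnitude weighting and the short distance dominate the proximity.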

The dual-catalog rule (stated once, enforced everywhere). A declustered catalog (Gardner–Knopoff as a transparent cross-check; Zaliapin–Ben-Zion as primary) feeds only the stationary background rate $\mu(x,y)$ and Poisson-baseline calibration. The full, un-declustered catalog feeds the conditional/ETAS triggering term, because the triggering is the signal. The Gardner–Knopoff windows (OpenQuake hmtk coefficients) are $L(M)=10^{0.1238M+0.983}$ km and $T(M)=10^{0.032M+2.7389}$ d for $M\ge6.5$, else $10^{0.5409M-0.547}$ d.
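The Gardner–Knopoff window formulas quoted above translate directly into code. This is a sketch of the window sizes only, not of the full declustering loop; the function name is hypothetical.

```python
def gardner_knopoff_window(magnitude):
    """Return (distance_km, time_days) for a mainshock of the given magnitude."""
    distance_km = 10.0 ** (0.1238 * magnitude + 0.983)
    if magnitude >= 6.5:
        time_days = 10.0 ** (0.032 * magnitude + 2.7389)
    else:
        time_days = 10.0 ** (0.5409 * magnitude - 0.547)
    return distance_km, time_days


for m in (5.0, 7.0):
    d, t = gardner_knopoff_window(m)
    print(f"M{m}: {d:.0f} km, {t:.0f} d")
```

For a magnitude 5 mainshock this gives a window of about 40 km and about 144 days, which is the scale on which the Gardner–Knopoff cross-check treats later, smaller events as dependent.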

5. Context covariates (from the global enrichers — upside, not foundation)

These features tell the model what kind of place a cell is. Each is joined as a static or slowly-varying spatial field from a global enricher (Data-Sources §7–§13), and each must demonstrate marginal information gain over a catalog-only baseline before it ships.

```mermaid
flowchart LR
    subgraph SRC["Global enrichers (sources)"]
        S2["Slab2<br/>subduction geometry"]
        GF["GEM Active Faults"]
        PB["Bird PB2002 plates"]
        GN["NGL GNSS / MIDAS"]
        WS["World Stress Map<br/>+ Global CMT mechanisms"]
        TD["Tidal stress<br/>(pygtide + SPOTL/TPXO)"]
    end
    subgraph COV["Per-cell context covariates"]
        C1["slab depth / dip / interface distance"]
        C2["distance-to-nearest-fault · fault style"]
        C3["distance-to-boundary · boundary type"]
        C4["strain rate (dilatation, max shear)"]
        C5["faulting regime · rake · dCFS"]
        C6["tidal dCFS · phase · Mf envelope"]
    end
    S2 --> C1
    GF --> C2
    PB --> C3
    GN --> C4
    WS --> C5
    TD --> C6
    COV --> GRID["Forecast grid<br/>(joined per cell)"]
```

5.1 Slab and geometry covariates

From Slab2: depth-to-slab, local dip and strike, and distance-to-interface. Combined with event depth, these separate interface, intraslab, and crustal seismicity — very different physics and rates. This is the highest-value context covariate in subduction regions.

5.2 Fault and plate-boundary covariates

From GEM Active Faults: distance-to-nearest-active-fault and fault style. From Bird PB2002: distance-to-plate-boundary, boundary type (subduction / transform / ridge), and plate-pair relative velocity. Together they place a cell in its tectonic setting.

5.3 Geodetic strain rate

From NGL GNSS / MIDAS: a strain-rate field (dilatation, maximum shear, second invariant) interpolated from station velocities. Strain rate correlates with long-term seismicity rate, so it primarily informs the background term, not the short-horizon term.

5.4 Stress and mechanism covariates

From Global CMT mechanisms and the World Stress Map: faulting regime, rake / P-T axis orientation, and Coulomb stress-transfer ($\Delta\mathrm{CFS}$) from recent large events — physically motivated triggering features, second-tier effort.

5.5 Tidal Coulomb stress covariates

From the computed tidal-stress series (Data-Sources §13): the tidal Coulomb failure stress at the forecast instant and its derived shape. These enter the conditional intensity as a regularized multiplier with the physically correct exponential (rate-and-state) form, with a learned coupling the data is allowed to drive to ~0:

$$\lambda(t \mid \mathcal{H}) = \lambda_0(t \mid \mathcal{H}) \cdot \exp\!\Big(\beta\,\tfrac{\Delta\mathrm{CFS}(t)}{A\sigma}\Big), \qquad \beta \in [0,1]$$

Engineered tidal features: $\Delta\mathrm{CFS}(t_0)$; its rate $\dot{S}(t_0)$; the circular phase $(\sin\theta, \cos\theta)$; the semidiurnal envelope; and the fortnightly envelope (the ~14.7 d spring-neap cycle; the Mf constituent proper has a ~13.7 d period). For most regions the expected gain is near-null; only shallow ocean-loaded thrusts, ridges, and the tremor/slow-slip channel show a measurable (still few-percent) effect, and it is reported as such.
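A minimal sketch of the tidal multiplier and the circular phase encoding described in this section. The kPa unit, the function names, and the example values are assumptions; computing $\Delta\mathrm{CFS}(t)$ itself is out of scope here.

```python
import math


def tidal_rate_multiplier(dcfs_kpa, a_sigma_kpa, beta):
    """exp(beta * dCFS / (A * sigma)); beta = 0 switches the tidal term off."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if a_sigma_kpa <= 0.0:
        raise ValueError("A*sigma must be positive")
    return math.exp(beta * dcfs_kpa / a_sigma_kpa)


def phase_features(theta_rad):
    """Encode a periodic phase as (sin, cos) so 0 and 2*pi map to the same point."""
    return math.sin(theta_rad), math.cos(theta_rad)


print(tidal_rate_multiplier(dcfs_kpa=1.0, a_sigma_kpa=10.0, beta=1.0))
print(phase_features(0.0))
```

With a 1 kPa tidal stress and $A\sigma$ = 10 kPa the multiplier at full coupling is about 1.105, a rate change of roughly ten percent; as $\beta$ shrinks toward 0 the multiplier tends to 1 and the tidal term drops out.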

6. Feature catalog (definition · source · why it matters)

The complete feature set the model considers. Tier marks the honest hierarchy: spine features (from the catalog) carry the skill; context covariates are upside, gated on measured information gain.

| Feature | Definition | Source | Tier | Why it matters |
|---|---|---|---|---|
| Recent-window rate | Count/rate of events ≥ `Mc` in trailing windows (1/7/30/365 d) per cell | Catalog (homogenized) | Spine | Raw level of recent activity that clustering amplifies. |
| Rate ratio (activation) | Short-window rate ÷ long-window rate | Catalog | Spine | Simple, robust signal that a sequence is starting. |
| ETAS/Omori intensity | $\lambda(t,x,y\mid\mathcal{H}_t)$ and its background/triggered components at the forecast instant | Catalog (ETAS fit) | Spine | The physics-grounded conditional rate; the baseline to beat. |
| Productivity / decay state | Current $K\,e^{\alpha(m-M_0)}$ and Omori $(1+t/c)^{-p}$ contributions | Catalog (ETAS fit) | Spine | Encodes how strongly recent large events are still triggering. |
| ZBZ proximity $\eta$ | Nearest-neighbor space–time–magnitude proximity | Catalog | Spine | Data-driven clustering strength per event. |
| ZBZ rescaled time $T$ | $t_{ij}\,10^{-qbm_i}$ | Catalog | Spine | Separates temporal clustering from spatial. |
| ZBZ rescaled space $R$ | $(r_{ij})^{d_f}\,10^{-(1-q)bm_i}$ | Catalog | Spine | Separates spatial clustering; with $T$ identifies swarms vs bursts. |
| b-value $b(x,y,t)$ | Aki–Utsu MLE per cell/epoch | Catalog | Spine | Sets the magnitude distribution and the large-event tail. |
| `Mc`(x,y,t) | Magnitude of completeness per cell/epoch | Catalog | Spine | Trust boundary; below it rates are biased. A monitored quantity. |
| Slab depth / dip | Depth-to-slab and local dip at the cell | Slab2 | Context | Separates interface / intraslab / crustal seismicity. |
| Interface distance | Distance to the subduction interface | Slab2 | Context | Conditions triggering geometry in subduction zones. |
| Distance-to-fault | Distance to nearest mapped active fault | GEM Active Faults | Context | Proximity to a known fault raises prior rate. |
| Fault style | Mapped faulting style at the cell | GEM Active Faults | Context | Constrains expected mechanism / orientation. |
| Distance-to-boundary | Distance to nearest plate boundary | Bird PB2002 | Context | Places the cell in the plate-tectonic frame. |
| Boundary type | Subduction / transform / ridge | Bird PB2002 | Context | Sets the dominant physics (e.g. thrust vs strike-slip). |
| Strain rate | Dilatation / max shear / second invariant | NGL GNSS / MIDAS | Context | Correlates with the long-term background rate. |
| Faulting regime / rake | Stress regime and P-T axis orientation | World Stress Map / Global CMT | Context | Orients the fault for tidal/Coulomb resolution. |
| $\Delta\mathrm{CFS}$ (static) | Coulomb stress transfer from recent large events | Global CMT mechanisms | Context | Physically motivated triggering covariate. |
| Tidal $\Delta\mathrm{CFS}(t_0)$ | Tidal Coulomb failure stress at the instant | Computed (pygtide + SPOTL/TPXO) | Context | Regularized multiplier; coupling allowed to shrink to 0. |
| Tidal stressing rate $\dot{S}$ | Time derivative of tidal $\Delta\mathrm{CFS}$ | Computed | Context | Rate governs whether a phase correlation is even detectable. |
| Tidal phase $(\sin\theta,\cos\theta)$ | Circular encoding of tidal phase | Computed | Context | Keeps the periodic phase variable continuous. |
| Semidiurnal / fortnightly envelope | Recent-window semidiurnal and fortnightly amplitude | Computed | Context | The fortnightly band (~14.7 d spring-neap cycle; Mf proper is ~13.7 d) is where a clean tidal signal is most plausible. |

7. What the model deliberately does not ingest as a feature

Stated for honesty and to forestall over-reading:

No "imminence" or alarm feature. Nothing in the feature set is a countdown or a deterministic precursor. Every feature feeds a bounded, calibrated probability scored against reality.
No raw waveforms in v1. Waveform-derived detection (ML phase pickers) would improve the input catalog (lower, more stable Mc), which helps every model — but detection is not forecasting, and it is out of the v1 feature set.
No un-gated covariate. A context covariate that fails to show positive prospective information gain over the catalog-only baseline stays out of the public model, reported as a near-null.
No declustered catalog as the conditional input. The triggering signal is kept; only the background term uses the declustered view.
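The gating rule for context covariates (positive, significant information gain over the catalog-only baseline) can be sketched as follows. The Poisson gridded log-likelihood and information gain per event are standard forecast-testing quantities. The one-sided normal-approximation check in `passes_gate` is an assumption of this example, not the project's documented test, and all function names are hypothetical.

```python
import math
import numpy as np


def poisson_loglik(forecast, observed):
    """Joint Poisson log-likelihood of observed counts; forecast rates must be > 0."""
    lam = np.asarray(forecast, dtype=float)
    n = np.asarray(observed, dtype=float)
    log_factorial = np.array([math.lgamma(k + 1.0) for k in n.ravel()]).reshape(n.shape)
    return float(np.sum(-lam + n * np.log(lam) - log_factorial))


def information_gain_per_event(candidate, baseline, observed):
    """Log-likelihood difference (candidate minus baseline) per observed event."""
    n_events = float(np.sum(observed))
    if n_events == 0:
        raise ValueError("no observed events in the test window")
    diff = poisson_loglik(candidate, observed) - poisson_loglik(baseline, observed)
    return diff / n_events


def passes_gate(gains, z=1.645):
    """True if the lower one-sided bound on the mean gain is above zero."""
    g = np.asarray(gains, dtype=float)
    if g.size < 2:
        return False
    lower = g.mean() - z * g.std(ddof=1) / math.sqrt(g.size)
    return bool(lower > 0.0)


observed = [0, 1, 0, 2]
baseline = [0.75, 0.75, 0.75, 0.75]       # catalog-only forecast (toy values)
candidate = [0.1, 1.0, 0.1, 1.8]          # forecast with the covariate (toy values)
print(information_gain_per_event(candidate, baseline, observed))
print(passes_gate([0.10, 0.20, 0.15, 0.12]))
```

Here `gains` would hold one information-gain value per pseudo-prospective test window; a covariate whose gains scatter around zero fails the gate and is reported as a near-null.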

References

Aki, K. (1965). Maximum likelihood estimate of b in the formula log N = a − bM. Bull. Earthq. Res. Inst. 43, 237–239.
Baiesi, M., & Paczuski, M. (2004). Scale-free networks of earthquakes and aftershocks. Phys. Rev. E 69, 066106. DOI 10.1103/PhysRevE.69.066106.
Beeler, N. M., & Lockner, D. A. (2003). Why earthquakes correlate weakly with the solid Earth tides. JGR Solid Earth 108(B8), 2391. DOI 10.1029/2001JB001518.
Gardner, J. K., & Knopoff, L. (1974). Is the sequence of earthquakes in Southern California, with aftershocks removed, Poissonian? BSSA 64(5), 1363–1367.
Hayes, G. P., et al. (2018). Slab2, a comprehensive subduction zone geometry model. Science 362, 58–61. DOI 10.1126/science.aat4723.
Ogata, Y. (1988). Statistical models for earthquake occurrences and residual analysis for point processes. JASA 83(401), 9–27. DOI 10.1080/01621459.1988.10478560.
Ogata, Y. (1998). Space–time point-process models for earthquake occurrences. Ann. Inst. Statist. Math. 50(2), 379–402. DOI 10.1023/A:1003403601725.
Wiemer, S., & Wyss, M. (2000). Minimum magnitude of completeness in earthquake catalogs. BSSA 90(4), 859–869. DOI 10.1785/0119990114.
Woessner, J., & Wiemer, S. (2005). Assessing the quality of earthquake catalogues: estimating the magnitude of completeness and its uncertainty (EMR). BSSA 95(2), 684–698. DOI 10.1785/0120040007.
Zaliapin, I., Gabrielov, A., Keilis-Borok, V., & Wong, H. (2008). Clustering analysis of seismicity and aftershock identification. Phys. Rev. Lett. 101, 018501. DOI 10.1103/PhysRevLett.101.018501.
Zaliapin, I., & Ben-Zion, Y. (2020). Earthquake declustering using the nearest-neighbor approach in space-time-magnitude domain. JGR Solid Earth 125, e2018JB017120. DOI 10.1029/2018JB017120.

See also: Data-Sources — where each of these data types and covariates comes from · Pipeline — the versioned DAG that produces these features · Models-Classical — how the ETAS / Omori / Gutenberg–Richter features are used · Models-ML — how the feature set feeds the neural challenger.

⚠️ Disclaimer — read this. CAOS_SEISMIC produces probabilistic forecasts, not predictions. It is an independent research and education tool. It is NOT an official earthquake early-warning or civil-protection system, it does NOT predict when, where, or how large an earthquake will be, and it must NOT be used for life-safety, emergency, or evacuation decisions. Every number it publishes is a bounded, calibrated probability conditioned on the present state of seismicity — never an alarm, a countdown, or a "safe" state. A single outcome neither confirms nor refutes a probabilistic forecast.

It complements, and does not replace or speak for, official agencies — always follow your national seismological and civil-protection authorities (e.g. USGS, INGV, CSN (Chile, SENAPRED for civil protection), GeoNet, JMA). The software is provided "as is", without warranty of any kind (MIT License); the authors accept no liability for its use. Data are courtesy of their providers (USGS/ANSS, ISC/ISC-GEM, Global CMT, EMSC, CSN, and others) under their respective licenses and attribution terms. See Honest-Limits for the full epistemic context.

CAOS_SEISMIC · seismic.fasl-work.com · source · MIT
