Skip to content

Methodology History

Felipe Santibañez-Leal edited this page Jun 17, 2026 · 1 revision

Methodology History

Earthquakes cannot be predicted, but their probability can be forecast — reported honestly, with uncertainty, evaluated against reality, never as an alarm and never as a promise of safety.

This page is the authoritative methodological record of CAOS_SEISMIC. It does two things:

  1. It traces how the methodology of short-term earthquake forecasting actually evolved — the century-long arc from the first empirical scaling laws to self-exciting point processes, the CSEP testing revolution, and the neural / machine-learning era with its honest verdict — naming the key papers, people, and why each step mattered.
  2. It documents how this product's own methodology was reasoned out on top of that history: why ETAS is the calibrated reference, why a stationary Poisson model is the mandatory null, why the "stronger model" is a gated, context-conditioned neural temporal point process, why global-context-conditioning is the central thesis, why every number is scored under CSEP, and why the framing is relentlessly honest.

Every equation is real and every non-trivial claim is anchored to peer-reviewed literature with a DOI. A magnitude convention note: magnitudes are homogenized to moment magnitude $M_w$ where possible; $M_c$ is the magnitude of completeness; $b$ is the Gutenberg–Richter slope; $\beta = b\ln 10$; information gains are reported in nats (the CSEP convention), never bits.


Table of contents


Part I — The historical arc of the field

The methodology of earthquake forecasting did not arrive as a single theory. It accreted, over more than a century, out of three robust empirical regularities, a point-process formalism that fused them, a testing culture that made claims falsifiable, and — most recently — a machine-learning wave whose honest contribution turned out to be narrower than its hype. The sequence below is roughly chronological, but the deeper logic is epistemic: each step is an answer to "how do we make an honest, testable statement about the future of a system we cannot deterministically predict?"

flowchart TD
    GR["Gutenberg–Richter 1944<br/>magnitude–frequency law"] --> ETAS
    OM["Omori 1894 / Utsu 1961<br/>aftershock decay law"] --> ETAS
    PROD["Utsu productivity<br/>aftershock abundance ∝ magnitude"] --> ETAS
    RJ["Reasenberg–Jones 1989<br/>operational aftershock probabilities"] --> STEP["STEP 2005<br/>gridded short-term maps"]
    RJ --> OAF["USGS OAF<br/>operational service"]
    ETAS["ETAS<br/>Ogata 1988 / 1998<br/>self-exciting point process"] --> SMOOTH["Smoothed seismicity<br/>Helmstetter et al. 2007<br/>spatial background μ(x,y)"]
    ETAS --> CSEP["CSEP / RELM 2006–<br/>prospective testing"]
    SMOOTH --> CSEP
    STEP --> CSEP
    CSEP --> ICEF["ICEF / OEF 2011<br/>operational doctrine"]
    ICEF --> NEURAL["Neural TPP era<br/>2016–2026<br/>honest verdict: gated challenger"]
    CSEP --> NEURAL
Loading

1. The empirical scaling laws (1894–1965): statistics, not timing

The foundation of every honest forecast is a small set of empirical scaling laws that describe the statistics of populations of earthquakes — not the timing of any single event. This distinction is the whole game: the laws let us say "the rate of $M \ge 5$ events in this region this week is $X$ times the background," which is genuine, testable skill, and they categorically do not let us say "an $M7$ will strike here on Thursday."

1.1 Gutenberg–Richter — the magnitude–frequency law (1944)

The number of earthquakes is, to first order, a power law in magnitude:

$$\log_{10} N(\ge M) = a - b,M, \qquad M \ge M_c.$$

Here $N(\ge M)$ is the number (or rate) of events with magnitude $\ge M$, $a$ sets the overall productivity, and $b$ — the b-value — is the slope, globally near 1. Equivalently, above completeness the magnitude density is exponential:

$$f(M) = \beta, e^{-\beta (M - M_c)}, \qquad \beta = b\ln 10.$$

The law supplies the magnitude term of every forecast: given a rate of small events, it sets the expected rate of large ones. Crucially, because the distribution is self-similar (no characteristic scale), there is no "gap" that signals a specific large event is "due." Gutenberg & Richter (1944) established the relation for California; Aki (1965) gave the maximum-likelihood estimator for $b$,

$$\hat b = \frac{\log_{10} e}{\bar M - M_c},$$

later refined for finite magnitude bins (the Aki–Utsu / Tinti–Mulargia binning correction, Tinti & Mulargia 1987):

$$\hat b = \frac{\log_{10} e}{\bar M - \left(M_c - \tfrac{\Delta M}{2}\right)}.$$

Why it mattered: it turned the raw fact "small quakes are common, big ones rare" into a quantitative, estimable distribution — the first ingredient of a forecast. Its central weakness, the strong bias of $\hat b$ when the magnitude of completeness $M_c$ is mis-estimated (Wiemer & Wyss 2000), is also the origin of an entire sub-discipline on completeness estimation that any honest pipeline must carry.

1.2 Omori–Utsu — the aftershock decay law (1894 / 1961)

Fusakichi Omori observed in 1894 that aftershock rate decays as roughly $1/t$ after a mainshock. Utsu (1961) generalized it to an adjustable exponent — the modified Omori (Omori–Utsu) law:

$$n(t) = \frac{K}{(t + c)^{p}}, \qquad p \approx 1,$$

with cumulative form (for $p \ne 1$)

$$N(t_1, t_2) = \frac{K}{1 - p}\left[(t_2 + c)^{1-p} - (t_1 + c)^{1-p}\right].$$

Here $K$ is productivity, $c$ a small time offset (hours to a day, controversial because it is biased by the catalog incompleteness immediately after a mainshock), and $p$ the decay exponent (typically $0.9$–$1.5$). This is the single most skillful, operationally usable short-term signal: aftershock rates are elevated and predictable in aggregate right after a large event.

Why it mattered: it is the temporal kernel that every modern short-term model — Reasenberg–Jones, STEP, ETAS — embeds. The honesty caveat it carries forward: right after a large mainshock, exactly the highest-stakes moment, $M_c$ spikes for hours to days and a naive fit underestimates productivity precisely when a large aftershock is most likely.

1.3 Utsu productivity — aftershock abundance scales with magnitude

The third regularity completes the trio: the number of aftershocks a mainshock produces grows exponentially with its magnitude, $k(m) \propto e^{\alpha(m - M_0)}$. Combined with Gutenberg–Richter (magnitudes) and Omori–Utsu (timing), it yields the modern clustering models. Båth's law (the largest aftershock is about 1.2 magnitude units below the mainshock) is a related empirical regularity. The three together are treated as the field's "three empirical scaling relations" (Shcherbakov, Turcotte & Rundle 2004).

The honest takeaway from Part 1. These laws are statistical and conditional. They are the reason short-term forecasting has real skill — and the reason that skill is always a statement about probability over a population, never about the timing of one rupture.

2. Operational aftershock forecasting (1989–2016): from law to service

The first step from "law" to "operational forecast" was made by Reasenberg & Jones (1989) in Science. They multiplied the Gutenberg–Richter magnitude term by the modified-Omori time decay to get the rate of aftershocks of magnitude $\ge M$ at time $t$ after a mainshock $M_m$:

$$\lambda(t, M) = \frac{10^{,a + b(M_m - M)}}{(t + c)^{p}}.$$

Treating events as a non-homogeneous Poisson process, the probability of one or more such aftershocks in a window $[t_1, t_2]$ is

$$N(M; t_1, t_2) = \int_{t_1}^{t_2} \lambda(t, M), dt, \qquad P(\ge 1) = 1 - e^{-N}.$$

Why it mattered: this is the first model that produced an actual public probability — "the chance of a damaging aftershock in the next week" — rather than a description of past statistics. It is the direct ancestor of the USGS Operational Aftershock Forecast (OAF) service, whose lineage runs Reasenberg & Jones (1989) → Page et al. (2016), who introduced global generic parameters by tectonic regime and, critically, modeled intersequence variability so that forecasts come with uncertainty bounds, not point estimates.

STEP (Short-Term Earthquake Probability; Gerstenberger, Wiemer, Jones & Reasenberg 2005, Nature) wrapped Reasenberg–Jones clustering plus a background term into hourly, gridded shaking-probability maps for California. STEP is the methodological template for the output shape of a daily gridded probabilistic product — "tomorrow's earthquakes" as a map — exactly the form this product takes (at daily cadence).

3. ETAS (1988–1998): the self-exciting point process

The decisive theoretical unification was Yosihiko Ogata's Epidemic-Type Aftershock Sequence (ETAS) model — temporal in Ogata (1988, JASA), spatio-temporal in Ogata (1998, AISM). ETAS is a self-exciting (Hawkes) marked point process: a stationary background rate plus a sum over all past events, each of which triggers its own offspring with Omori-time decay, Gutenberg–Richter magnitudes, and a spatial kernel. The conditional intensity — the instantaneous expected rate at $(x, y, t)$ given the history $\mathcal H_t$ — is

$$\lambda(t, x, y \mid \mathcal H_t) = \mu(x, y) + \sum_{i:, t_i < t} K, e^{\alpha (M_i - M_0)}, \left(1 + \tfrac{t - t_i}{c}\right)^{-p}, f!\left(x - x_i, y - y_i \mid M_i\right),$$

with the Ogata (1998) inverse-power spatial kernel

$$f(x, y \mid M_i) = \frac{q - 1}{\pi,\zeta^2}\left(1 + \frac{r^2}{\zeta^2}\right)^{-q}, \qquad \zeta = D, e^{\gamma (M_i - M_0)},\quad r^2 = (x-x_i)^2 + (y-y_i)^2.$$

So ETAS is literally (background) + (Utsu productivity) × (Omori–Utsu decay) × (spatial spread), with Gutenberg–Richter magnitudes, stitched into one equation. It is the canonical example of turning the three scaling laws of §1 into a forecast. The model is fit by maximizing the point-process log-likelihood

$$\ln L = \sum_i \ln \lambda(t_i, x_i, y_i \mid \mathcal H_{t_i})

  • \int_0^T!!\int_A \lambda(t, x, y \mid \mathcal H_t), dx, dy, dt,$$

and the background field $\mu(x,y)$ is recovered by stochastic declustering (Zhuang, Ogata & Vere-Jones 2002), which assigns each event a probability of being background versus triggered.

A subtlety with real operational consequences is stability, which requires two logically separate conditions:

  1. Finite branching: the productivity × magnitude integral converges only if $\alpha &lt; \beta$ (with $\beta = b\ln 10$). If $\alpha \ge \beta$ the largest events dominate and the model is improper.
  2. Subcriticality: given $\alpha &lt; \beta$, the branching ratio $n$ (expected direct offspring per event) must satisfy $n &lt; 1$. A fit with $n \ge 1$ is supercritical (explosive) and is rejected as a mis-fit.

Why it mattered: ETAS is fully generative (you can simulate synthetic next-day catalogs and compute any test on them), has only ~5–8 interpretable parameters (little overfitting headroom), and has decades of prospective track record. It is the de-facto operational baseline for short-term forecasting — the model any challenger must beat. This is exactly the role it plays in this product (Part II, §9).

4. Smoothed seismicity and the spatial background

ETAS needs a stationary background field $\mu(x,y)$where earthquakes occur independent of recent triggering. Helmstetter, Kagan & Jackson (2007) answered this with smoothed seismicity: large earthquakes nucleate where small ones cluster, so one smooths the locations of past small events with an adaptive kernel whose bandwidth follows local data density (distance to the $n$-th nearest neighbor):

$$\mu(x, y) = \sum_i K_{d_i}(r), \qquad K_d(r) = \frac{C(d)}{\left(r^2 + d^2\right)^{s}}.$$

The catalog is declustered first so the map reflects the long-term average, not transient bursts. This was the best-performing time-independent spatial forecast in the RELM/CSEP experiments.

Implementation honesty. The kernel exponent $s$ and normalization $C(d)$ vary across the Helmstetter–Kagan–Jackson family of papers (forms with $s = 1$ and $s = 3/2$ both appear), and the neighbor count is a region-tuned hyperparameter. These are pinned to a reference implementation, not hard-coded as universal constants.

Why it mattered for us: smoothed seismicity plays two roles in this product — it is the spatial background $\mu(x,y)$ that feeds ETAS, and it is the stationary Poisson reference, the mandatory null any time-dependent model must beat (§10).

5. CSEP (2006–): the testing revolution that made forecasts honest

Forecasting only became a science — rather than a sequence of post-hoc success stories — when it became prospectively falsifiable. The Regional Earthquake Likelihood Models (RELM) experiment and its successor, the Collaboratory for the Study of Earthquake Predictability (CSEP), established the culture: a testing center controls the input data and runs the models (not the modelers), evaluating each forecast against future seismicity it never saw. This removes hindsight tuning and makes results reproducible.

CSEP scores a gridded Poisson forecast by the joint log-likelihood over space–magnitude bins,

$$L(\Omega \mid \Lambda) = \sum_{\text{bins } i}\Big[-\lambda_i + \omega_i \ln \lambda_i

  • \ln(\omega_i!)\Big],$$

and applies a battery of consistency tests (Schorlemmer et al. 2007; Zechar, Gerstenberger & Rhoades 2010):

Test Question
N-test Is the total number of forecast events consistent with what occurred?
M-test Is the magnitude distribution consistent?
S-test Is the spatial distribution consistent?
L-test / CL-test Is the joint (pseudo-)likelihood consistent? (CL conditions on $N_{obs}$.)

Passing consistency tests is necessary but not sufficient. Skill is established only by winning a comparison test against a real baseline, via the information gain per earthquake (IGPE, in nats):

$$I_N(A, B) = \frac{1}{N}\sum_{i=1}^{N}\Big(\ln \lambda_{A}(k_i) - \ln \lambda_{B}(k_i)\Big)

  • \frac{\hat N_A - \hat N_B}{N},$$

tested with a paired T-test ($T = I_N/(s/\sqrt N)$) and the non-parametric W-test. All of this is implemented in the community toolkit pyCSEP (Savran et al. 2022).

Why it mattered: CSEP is the difference between "we forecast the earthquake" (a cherry-picked anecdote) and "across many days, our stated probabilities matched observed frequencies" (a calibrated, auditable claim). It is the testing backbone of this product (§13), and the reason a public reliability diagram is the single most credibility-building artifact the product can ship.

6. Operational Earthquake Forecasting and the ICEF doctrine (2009–2011)

The 2009 L'Aquila earthquake — and the criminal trials that followed a miscommunication of low probability — provoked the field's most important governance document. Italy appointed the International Commission on Earthquake Forecasting for Civil Protection (ICEF), chaired by Thomas H. Jordan; its report (Jordan et al. 2011) is the canonical statement of Operational Earthquake Forecasting (OEF): the authoritative dissemination of time-dependent, regularly updated probabilistic forecasts for civil-protection decision-making.

The ICEF report fixed the field's vocabulary, and ours. A prediction is "a deterministic statement that a future earthquake will or will not occur" (yes/no); a forecast is "a probability (greater than zero but less than one)" that such an event occurs. The report's single most important honesty constraint: short-term probabilities "may vary over orders of magnitude but typically remain low in an absolute sense (< 1 % per day)."

Real OEF systems run today and are honest, calibrated, and validated:

  • USGS OAF (Reasenberg–Jones → Page et al. 2016 global) publishes probabilities and expected counts with uncertainty.
  • OEF-Italy (INGV) runs an ensemble of three clustering models (two ETAS flavors plus a STEP model), updated every midnight and after each $M_L \ge 3.5$ event. Its 10-year prospective validation (Spassiani, Falcone, Murru & Marzocchi 2023) found it broadly reliable — with a documented underestimation during the 2016–2017 Central Italy sequence caused by post-mainshock catalog incompleteness, exactly the §1.2 limit.

Why it mattered: the field's answer to L'Aquila was not "stop forecasting." It was "do probabilistic forecasting properly and communicate it honestly." That doctrine is the spine of this product's framing (§14) and its governance posture.

7. The neural / machine-learning era (2016–2026) and its honest verdict

The deep-learning wave reached seismology with high expectations. The honest verdict, established by the field's most rigorous benchmarks, is sobering — and clarifying.

The cautionary tale. DeVries, Viegas, Wattenberg & Meade (2018, Nature) trained a deep network (6 hidden layers, ~13,451 free parameters, 12 stress features) to forecast aftershock spatial patterns and reported AUC 0.85 versus 0.58 for the classical Coulomb stress criterion. Mignan & Broccardo (2019, Nature) then matched that 0.85 with a 2-parameter logistic regression ("one neuron") using a single physically simple feature. The deep net's apparent advantage was an artifact of massive over-parameterization (versus only ~199 effective mainshocks), a per-cell "computer-vision" framing that inflated the apparent sample size, and an inappropriate metric (AUC is invariant to calibration and degenerates into a region classifier on rare per-cell tasks).

The decisive benchmark. EarthquakeNPP (Stockman, Lawson & Werner, TMLR 2026) benchmarked five modern neural point processes (NSTPP, DeepSTPP, AutoSTPP, DSTPP, SMASH) against ETAS on California 1971–2021 with strict chronological splits and CSEP consistency tests. None outperformed ETAS. ETAS won spatial log-likelihood consistently; on ComCat, ETAS passed the spatial test at ~92 % while the best NPP managed only ~69 % — and the spatial test is exactly where forecasting value lives. The crucial methodological fix: earlier neural-TPP-for-earthquakes work had used non-chronological splits (which inflate metrics because earthquakes trigger one another) and had excluded the giant 2011 Tohoku sequence; once temporal splits and the big sequences are restored, the apparent neural advantage evaporates.

Where ML genuinely helps. The honest, evidence-grounded reading is that neural value is concentrated in two real ETAS gaps, not in "deep-learning magic":

  1. Multivariate covariate ingestion — ETAS uses one catalog; a neural conditional intensity can fuse geodesy/InSAR, pore-pressure data, multiple catalogs, and sub-$M_c$ events. FERN (Zlydenko et al. 2023) — an ETAS-generalizing encoder that replaces fixed kernels with MLPs — reported a 4–12 % information-gain improvement, but its own authors note the gain came mostly from ingesting sub-$M_c$ events, that it was not CSEP-tested, provided no uncertainty quantification, and its test period ended before Tohoku.
  2. Learned spatial anisotropy — fault-aligned triggering structure recovered without explicit fault inputs.

RECAST (Dascher-Cousineau et al. 2023) — a GRU-based neural TPP — improves on temporal ETAS only when the training catalog is large ($\gtrsim 10^4$ events) and merely matches it otherwise.

Detection is not forecasting. ML waveform models (PhaseNet, EQTransformer, PhaseNO, the SeisLM foundation model) are mature and production-grade for detection, phase-picking, and characterization — they build better, more complete catalogs, which is the biggest near-term realizable lever for both ETAS and any neural forecaster. But they tell you what already happened; they do not forecast. SeisLM's "foreshock–aftershock" task is retrospective classification of existing waveforms relative to a known mainshock. This product keeps the detection/forecasting line explicit so that detection branding never implies prediction.

The honest verdict (2026). No pure machine-learning model has robustly beaten a well-fit ETAS for short-term forecasting under fair, prospective CSEP-style testing. This justifies shipping ETAS-class as the calibrated reference and gating any neural model behind a CSEP win — but it is stated as "on the benchmarks to date," not as an unconditional law that ML can never add skill.


Part II — How this product's methodology was reasoned out

Part I is the field's history. Part II is this product's history of decisions — what was chosen, and the reasoning behind each choice. The through-line: build the strongest calibrated, testable conditional estimator, ship it honestly, and never let a more flexible model reach the public until it has earned its place against the established baseline.

8. The forecasting target: a conditional intensity

The first decision was to formulate the problem as a marked spatio-temporal point process and to estimate its conditional intensity $\lambda^*(t, x, y, m \mid \mathcal H_t)$ — the instantaneous expected rate of events of magnitude $m$ at location $(x, y)$ and time $t$ given the history $\mathcal H_t$. The reason is the survival term in the point-process log-likelihood:

$$\log \mathcal L = \sum_{i=1}^{n} \log \lambda^_(t_i) - \int_0^T \lambda^_(\tau), d\tau.$$

The integral ("compensator") term is what makes a model probabilistic and calibratable rather than a regressor. A system that drops it — by reframing forecasting as classification ("will there be an $M\ge5$ tomorrow? yes/no") or rate regression — throws away the very structure CSEP scores, and becomes un-calibratable. This is the structural root of most ML-forecasting failures, and the reason the product commits to a point-process formulation from the ground up.

The public number is then an exceedance probability. The expected number of events above a target magnitude $M^*$ in region $A$ over horizon $[0, T]$ is

$$N_{\ge M^_} = \int_0^T!!\int_A \lambda(t, x, y \mid \mathcal H_t),\Phi(M^_), dx, dy, dt, \qquad \Phi(M^_) = 10^{-b(M^_ - M_c)},$$

and the published probability is

$$P(\ge 1 \text{ event} \ge M^_) = 1 - e^{-N_{\ge M^_}}.$$

The product forecasts the full conditional magnitude distribution and thresholds it for display (per-region $M^*$, explicit per-region $M_{\max}$), because a single fixed threshold is brittle — a Chilean $M \ge 6$ is routine, a UK $M \ge 6$ is unheard-of — and discards the Gutenberg–Richter information the model already computes. Forecasting the distribution also enables the M-test and CRPS scoring. The public formula $P = 1 - e^{-N}$ never changes; only the quality of $\lambda$ improves as the model improves.

9. Why ETAS is the calibrated reference

ETAS (§3) is shipped as the calibrated, defensible core for three reasons that no alternative matches at v0:

  • It is fully generative. Simulating $\ge 10{,}000$ synthetic next-day catalogs produces bounded probabilities for the 1-day / 2-day / 7-day horizons and lets every CSEP test run on the ensemble.
  • It has few interpretable parameters (~5–8), so there is little overfitting headroom and the fit is defensible to a senior seismologist.
  • It is the field's prospective baseline. Decades of CSEP track record make it the honest yardstick.

The product's ETAS is not a toy. It runs the full hygiene pipeline a simplistic baseline omits: rolling per-region $M_c(x,y,t)$ with an incompleteness-aware post-mainshock likelihood, $M_w$ homogenization, the dual-catalog declustering rule (background fit on a declustered catalog via Zaliapin–Ben-Zion; triggering fit on the full catalog because clustering is the predictable signal), a fine $0.1° \times 0.1°$ fit grid with $0.1$-magnitude bins (so the S-test and M-test can resolve where and what size), both stability gates enforced per fit, and propagated parameter uncertainty. A transparent Reasenberg–Jones fallback runs alongside as a sanity check — the same two-model transparency USGS OAF uses.

10. Why a stationary Poisson model is the mandatory null

Every time-dependent forecast is required to beat a stationary, time-independent Poisson model built from smoothed seismicity (§4). This is non-negotiable for an honest reason: most of any map's area-and-time is quiet, and on quiet days a time-dependent model should read near climatology. If a "sophisticated" model cannot beat the smoothed-seismicity null on the IGPE comparison test with a confidence interval excluding zero, it has added no skill — it has only added complexity. The null is also the principled cold-start floor: where recent seismicity is sparse or zero, the conditional rate floors to $\mu(x,y)$, never to an arbitrary per-day constant.

This discipline is the direct lesson of the DeVries/Mignan episode (§7): always include a trivially simple baseline. Here the trivial baselines are time-independent Poisson, smoothed seismicity, and ETAS — and any model that does not beat all three is not shipped.

11. Why the stronger model is a gated, context-conditioned neural TPP

The "stronger model" is not a different default; it is a gated challenger. The product ships ETAS-class as the reference and treats any neural model as a candidate that reaches the public map only if it beats ETAS in the product's own prospective CSEP harness (positive IGPE, T-test confidence interval excluding zero) and is calibrated (reliability diagram, not just sharpness). Calibration is a release blocker.

The chosen challenger class is a conditional spatio-temporal neural temporal point process with a Hawkes inductive bias, in the spirit of FERN: keep the additive background + summed-triggering skeleton of ETAS, replace the fixed kernels with small MLPs / attention, and model magnitude explicitly (a real gap in most neural point processes, flagged by EarthquakeNPP). The design targets the two proven neural levers from §7 — covariate ingestion and learned anisotropy — rather than chasing "deep-learning magic" that the benchmarks show does not exist for this problem. A spatial CNN appears in the architecture only as a context encoder for static geophysical fields, never as a standalone aftershock-pattern CNN (the refuted DeVries approach is documented as a lesson, not a template).

The gate is the methodology. What makes the challenger honest is not the architecture but the gate: strict temporal splits, proper scoring, a CSEP win over ETAS and the Poisson null, and a calibration check — every one of them a guardrail distilled from the failures in §7.

12. Why global-context-conditioning is the thesis

The product's central scientific thesis is global-context-conditioning: rather than fitting an isolated regional model, it treats any country as a view into one global field, trains on global seismicity plus every feasible global covariate, and runs inference across many countries at once — high-seismicity (Chile, Japan, Indonesia, Mexico, Turkey, California, New Zealand) and low-seismicity (UK, Germany, Australia, Brazil).

The reasoning is twofold:

  1. A region-only model is the opposite of the purpose. The interesting, testable question is whether a conditional forecaster, given enough global context (tectonic regime, slab geometry, plate-boundary type, strain rate), generalizes across tectonic settings — not whether it can be over-fit to one well-instrumented region.
  2. Comparing across high- and low-seismicity regions is an explicit evaluation goal: it is the way to detect and quantify bias toward high-seismicity zones, which a single-region build cannot surface at all.

Mechanically, this is implemented as a tiled global field: the globe is partitioned into interior tiles (which own cells for aggregation) plus halos (which carry edge events so triggering is edge-correct), each tile fit with a tectonic-regime prior anchored on the USGS OAF tectonic-regime study (Page et al. 2016). Five regimes — subduction interface, intraslab, crustal/strike-slip, intraplate, ridge — assigned per event from Slab2 and plate-boundary geometry, with a self-contained heuristic fallback when the enricher grids are absent. The tiled ETAS stays the calibrated reference the neural model must beat; the neural challenger's job is to do better than per-tile ETAS by exploiting the global covariate context that per-tile ETAS cannot absorb. This is precisely the neural lever §7 identified as real.

13. Why CSEP is the only acceptable scoring framework

The product adopts CSEP / pyCSEP as the only scoring framework — standard tests, not bespoke metrics — because using the same yardstick the field uses is what makes the product defensible: reviewers can dispute the model, not the test code. The methodology around scoring carries several non-obvious but load-bearing decisions:

  • Catalog-based tests are the primary path. Regional seismicity is over-dispersed relative to Poisson (variance $\gg$ mean, because of clustering), so Poisson grid tests over-reject during aftershock sequences. The product emits both a gridded-rate forecast (Poisson tests) and a catalog-based forecast (Savran et al. 2020; over-dispersion-honest tests), and the pessimistic uncertainty bound is deliberately wider than a naive Poisson interval (negative-binomial behavior; Kagan 2017).
  • Skill = winning a comparison test, in nats, against smoothed seismicity and ETAS — never a consistency-test pass reported as if it were skill.
  • A strict forecast clock makes temporal leakage structurally impossible: at each daily issue time the model is handed only the catalog slice $(-\infty, t)$, the forecast is sealed with an immutable input snapshot (catalog version, $M_c$ grid, declustering choice, parameters), then the clock advances. The testing-mode hierarchy is explicit: true prospective > pseudo-prospective > retrospective out-of-sample > in-sample (Mizrahi et al. 2024).
  • AUC / accuracy are banned as primary skill metrics — the direct lesson of DeVries/Mignan (§7). ROC appears only as a communication aid (the Molchan diagram / area skill score), never as a skill claim.
  • The headline credibility artifact is the live reliability diagram — "when we said 5 %, it happened ~5 % of the time" — updated as the prospective record accrues.

The information gain over a Poisson reference is reported as state-dependent (large during active aftershock sequences, near zero in quiet periods, with a modest all-period average), in nats — never a fabricated round figure, never bits.

14. Why the framing stays honest

The final methodological commitment is the honest framing itself, treated as a first-class part of the product rather than a disclaimer bolted on at the end. It rests on a clear epistemic basis: deterministic prediction is effectively impossible because whether a small rupture cascades into a great earthquake depends on unmeasurably fine details of the crust (Geller, Jackson, Kagan & Mulargia 1997), with self-organized criticality (Bak & Tang 1989) as the leading explanatory framework, not settled physics. What is achievable is OEF: conditional probabilities that typically remain low in absolute terms even when the relative gain during a sequence is large.

The product therefore commits, in code and copy, to:

  • Never an alarm, never a "safe" state. No alarm dots, no countdown, no binary call. Red is reserved for model quality, never danger. Every number is shown next to its long-term baseline, with horizon and magnitude threshold always attached.
  • Honest, real uncertainty (P10 / median / P90), over-dispersion-aware, with a staleness banner and coverage mask — blank never reads as "safe."
  • Complement, never compete. It is an independent research/education tool that complements official OEF (USGS, INGV, CSN, GeoNet, JMA), never an authoritative civil-protection alarm.
  • A single outcome neither validates nor invalidates a probabilistic forecast — carried into copy with the 2019 Ridgecrest worked example (a ~3 % first-week forecast was not wrong when the ~3 % outcome occurred).

The reason this is methodology and not marketing is the field's own cautionary tale: the L'Aquila harm was a communication failure — false reassurance — not a failure to predict. The honest framing is the engineered antidote to both over-reassurance and over-alarm. See the companion page Honest Limits for the full epistemics.

The product's creed, carried verbatim into the app: Earthquakes cannot be predicted, but their probability can be forecast — reported honestly, with uncertainty, evaluated against reality, never as an alarm and never as a promise of safety.


References

Canonical, peer-reviewed sources (USGS / ISC / CSEP documentation and journal articles). DOIs given where registered.

  1. Aki, K. (1965). Maximum likelihood estimate of b in the formula log N = a − bM and its confidence limits. Bull. Earthq. Res. Inst. 43, 237–239.
  2. Bak, P. & Tang, C. (1989). Earthquakes as a self-organized critical phenomenon. JGR 94(B11), 15635–15637. doi:10.1029/JB094iB11p15635
  3. Dascher-Cousineau, K. et al. (2023). Using deep learning for flexible and scalable earthquake forecasting (RECAST). GRL 50, e2023GL103909. doi:10.1029/2023GL103909
  4. DeVries, P. M. R., Viegas, F., Wattenberg, M. & Meade, B. J. (2018). Deep learning of aftershock patterns following large earthquakes. Nature 560, 632–634. doi:10.1038/s41586-018-0438-y
  5. Dieterich, J. (1994). A constitutive law for rate of earthquake production and its application to earthquake clustering. JGR 99(B2), 2601–2618. doi:10.1029/93JB02581
  6. Geller, R. J., Jackson, D. D., Kagan, Y. Y. & Mulargia, F. (1997). Earthquakes cannot be predicted. Science 275(5306), 1616–1617. doi:10.1126/science.275.5306.1616
  7. Gerstenberger, M. C., Wiemer, S., Jones, L. M. & Reasenberg, P. A. (2005). Real-time forecasts of tomorrow's earthquakes in California (STEP). Nature 435, 328–331. doi:10.1038/nature03622
  8. Gutenberg, B. & Richter, C. F. (1944). Frequency of earthquakes in California. BSSA 34, 185–188.
  9. Helmstetter, A., Kagan, Y. Y. & Jackson, D. D. (2007). High-resolution time-independent grid-based forecast for M ≥ 5 earthquakes in California. SRL 78(1), 78–86. doi:10.1785/gssrl.78.1.78
  10. Jordan, T. H. et al. (2011). Operational Earthquake Forecasting: State of Knowledge and Guidelines for Utilization (ICEF Report). Annals of Geophysics 54(4), 315–391. doi:10.4401/ag-5350
  11. Kagan, Y. Y. (2017). Worldwide earthquake forecasts. GJI 211(1), 335–345. doi:10.1093/gji/ggx300
  12. Mignan, A. & Broccardo, M. (2019). One neuron versus deep learning in aftershock prediction. Nature 575, E1–E3. doi:10.1038/s41586-019-1582-8
  13. Mizrahi, L. et al. (2024). Developing, testing, and communicating earthquake forecasts: Current practices and future directions. Reviews of Geophysics 62. doi:10.1029/2023RG000823
  14. Ogata, Y. (1988). Statistical models for earthquake occurrences and residual analysis for point processes. JASA 83(401), 9–27. doi:10.1080/01621459.1988.10478560
  15. Ogata, Y. (1998). Space–time point-process models for earthquake occurrences. Ann. Inst. Statist. Math. 50(2), 379–402. doi:10.1023/A:1003403601725
  16. Page, M. T. et al. (2016). Three ingredients for improved global aftershock forecasts. BSSA 106(5), 2290–2301. doi:10.1785/0120160073
  17. Reasenberg, P. A. & Jones, L. M. (1989). Earthquake hazard after a mainshock in California. Science 243(4895), 1173–1176. doi:10.1126/science.243.4895.1173
  18. Savran, W. H. et al. (2020). Pseudoprospective evaluation of UCERF3-ETAS forecasts during the 2019 Ridgecrest sequence. BSSA 110(4), 1799–1817. doi:10.1785/0120200026
  19. Savran, W. H. et al. (2022). pyCSEP: A Python toolkit for earthquake forecast developers. SRL 93(5), 2858–2870. doi:10.1785/0220220033
  20. Schorlemmer, D., Gerstenberger, M. C., Wiemer, S., Jackson, D. D. & Rhoades, D. A. (2007). Earthquake likelihood model testing. SRL 78(1), 17–29. doi:10.1785/gssrl.78.1.17
  21. Shcherbakov, R., Turcotte, D. L. & Rundle, J. B. (2004). A generalized Omori's law for earthquake aftershock decay. GRL 31, L11613. doi:10.1029/2004GL019808
  22. Spassiani, I., Falcone, G., Murru, M. & Marzocchi, W. (2023). Operational Earthquake Forecasting in Italy: validation after 10 yr of operativity. GJI 234(3), 2501–2518.
  23. Stockman, S., Lawson, D. & Werner, M. J. (2026, accepted). EarthquakeNPP: A benchmark for earthquake forecasting with neural point processes. TMLR. arXiv:2410.08226
  24. Tinti, S. & Mulargia, F. (1987). Confidence intervals of b-values for grouped magnitudes. BSSA 77(6), 2125–2134.
  25. Wiemer, S. & Wyss, M. (2000). Minimum magnitude of completeness in earthquake catalogs. BSSA 90(4), 859–869. doi:10.1785/0119990114
  26. Woessner, J. & Wiemer, S. (2005). Assessing the quality of earthquake catalogues. BSSA 95(2), 684–698. doi:10.1785/0120040007
  27. Zaliapin, I. & Ben-Zion, Y. (2020). Earthquake declustering using the nearest-neighbor approach in space–time–magnitude domain. JGR Solid Earth 125, e2018JB017120. doi:10.1029/2018JB017120
  28. Zechar, J. D., Gerstenberger, M. C. & Rhoades, D. A. (2010). Likelihood-based tests for evaluating space–rate–magnitude earthquake forecasts. BSSA 100(3), 1184–1195. doi:10.1785/0120090192
  29. Zhuang, J., Ogata, Y. & Vere-Jones, D. (2002). Stochastic declustering of space–time earthquake occurrences. JASA 97(458), 369–380. doi:10.1198/016214502760046925
  30. Zlydenko, O. et al. (2023). A neural encoder for earthquake rate forecasting (FERN). Scientific Reports 13. doi:10.1038/s41598-023-38033-9

Collaboratory for the Study of Earthquake Predictability (CSEP): https://cseptesting.org. USGS Operational Aftershock Forecasting: https://earthquake.usgs.gov/data/oaf/.


See also: Honest Limits — the epistemics of why deterministic prediction is impossible, why absolute probabilities stay small, and the L'Aquila communication lesson.

Clone this wiki locally