Skip to content

Evaluation and Tests

Felipe Santibañez-Leal edited this page Jun 18, 2026 · 2 revisions

Evaluation and Tests

This page is the credibility backbone of CAOS_SEISMIC. A conditional probabilistic seismic forecast is only as trustworthy as the framework used to score it. We therefore adopt the community-endorsed standard of the field — CSEP — and report all of its tests, including the ones we fail. We do not invent bespoke metrics: if a forecast is scored with a non-standard score, a senior seismologist will (correctly) dismiss the result.

Framing that never leaves the product. Deterministic earthquake prediction is impossible. CAOS_SEISMIC is a conditional probability estimator — never an alarm, never a single-event call. The tests below verify calibration (are our stated probabilities honest?) and relative information gain (do we add skill over standard baselines?). They do not, and cannot, verify that "we called the quake." There is no such thing.

Contents

  1. What CSEP is and why it is the only standard we use
  2. Two forecast representations: gridded-rate and catalog-based
  3. Shared notation
  4. Consistency (calibration) tests — N / M / S / L / CL
  5. Comparison tests — information gain vs Poisson AND ETAS (T / W)
  6. Catalog-based tests — relaxing the Poisson assumption
  7. Reliability diagrams and proper scoring rules
  8. Molchan / ROC — and why AUC is banned as a skill metric
  9. The forecast-clock protocol (pseudo-prospective, leakage-free)
  10. Completeness magnitude consistency
  11. Multi-country back-analysis design (high- vs low-seismicity bias)
  12. The headline thesis measurement: context-over-ETAS information gain
  13. pyCSEP — the tooling
  14. The full report table — report every cell, including failures
  15. Software tests (the repository's automated suite)
  16. Honest limitations of evaluation
  17. References

1. What CSEP is and why it is the only standard we use

The Collaboratory for the Study of Earthquake Predictability (CSEP) is the international infrastructure and rule-set for testing probabilistic earthquake forecasts in a transparent, reproducible, and — crucially — prospective manner. It grew out of the Regional Earthquake Likelihood Models (RELM) working group for California.

The credibility of a forecast comes from the protocol being fixed before the data are seen. CSEP enforces that forecasts are deposited, then sealed, then scored against a pre-declared authoritative catalog. Our back-analysis emulates this exactly: we freeze the model and its parameters, then score. The whole point is that a reviewer can dispute our model but never our scoring, because the scores are the community's, computed by the community's tooling (pyCSEP).

Foundational papers:

  • Field, E. H. (2007). Overview of the Working Group for the Development of Regional Earthquake Likelihood Models (RELM). Seismological Research Letters, 78(1), 7–16. DOI: 10.1785/gssrl.78.1.7. — sets up RELM: alternate models of earthquake potential evaluated head-to-head for California.
  • Schorlemmer, D., Gerstenberger, M. C., Wiemer, S., Jackson, D. D., & Rhoades, D. A. (2007). Earthquake Likelihood Model Testing. Seismological Research Letters, 78(1), 17–29. DOI: 10.1785/gssrl.78.1.17. — defines the original N-test, L-test, and R-test.
  • Schorlemmer, D., & Gerstenberger, M. C. (2007). RELM Testing Center. Seismological Research Letters, 78(1), 30–36. DOI: 10.1785/gssrl.78.1.30. — the rules: testing area and grid, forecast periods, magnitude ranges, authoritative catalog, declustering, and the distinction between five-year, one-year, and one-day models. CAOS_SEISMIC is conceptually a one-day model, re-issued daily.

Authoritative anchors for current best practice and the operational template:

  • Mizrahi, L., Dallo, I., van der Elst, N. J., et al. (2024). Developing, Testing, and Communicating Earthquake Forecasts: Current Practices and Future Directions. Reviews of Geophysics, 62. DOI: 10.1029/2023RG000823. — lays out the prospective / pseudo-prospective / retrospective / in-sample / out-of-sample taxonomy we use in §9.
  • Serafini, F., Bayona, J. A., Silva, F., et al. (2025). A benchmark database of ten years of prospective next-day earthquake forecasts in California from CSEP. Scientific Data, 12, 1501. DOI: 10.1038/s41597-025-05766-3. — 25 fully automated M≥3.95 daily models, >50,000 daily forecasts (Aug 2007–Aug 2018), evaluated with pyCSEP. This is the single most useful operational template for a daily-forecast product like ours, and a real public benchmark our California region scores against.

2. Two forecast representations: gridded-rate and catalog-based

CSEP supports two distinct forecast representations, and the tests differ between them. CAOS_SEISMIC emits both, because the daily forecast is strongly clustered (aftershocks, swarms, foreshocks) and the two representations make different assumptions.

Gridded-rate (grid-based) Catalog-based (simulation)
What it is Expected number of events $\lambda(m_i, s_j)$ in each space–magnitude bin Many synthetic catalogs (stochastic event sets) simulated from the model
Core assumption Events in each bin are independent and Poisson distributed No Poisson assumption — empirical distribution from simulations
Pros Simple, fast to test, analytic test distributions; directly comparable to the published CSEP California daily benchmark Captures clustering / over-dispersion, correct uncertainty, magnitude–space correlations
Cons Poisson under-disperses variance → over-rejects during sequences Computationally heavier; must store/serve many catalogs
Tests N, M, S, L, CL (Poisson) + T, W comparison Number, Magnitude, Spatial, Pseudo-likelihood (empirical) + comparison + calibration

Why both. Because earthquakes cluster, the variance of real event counts is far larger than the mean (over-dispersion). The Poisson assumption baked into the classic gridded L/CL-tests therefore systematically over-rejects us during active sequences — the model can be perfectly reasonable and still "fail" a Poisson test purely because Poisson under-estimates the spread. The fix is the catalog-based representation, whose test distributions are empirical (built from the simulations) and therefore over-dispersion-honest.

Operationally, every issue, every horizon, CAOS_SEISMIC emits:

  1. a gridded-rate forecast (a space–magnitude grid of expected rates) for the primary, comparable CSEP tests, and
  2. a catalog-based forecast (≥10,000 synthetic catalogs simulated from the model) to get honest, non-Poisson uncertainty and run the Savran et al. (2020) catalog tests.

This dual emission is exactly what pyCSEP is built to consume. Every Poisson-grid-test failure during a sequence is reported paired with its catalog-based result — never in isolation.


3. Shared notation

Following Schorlemmer et al. (2007) and Zechar et al. (2010):

  • Region $R$ is partitioned into bins indexed by magnitude $m_i$ and space $s_j$.
  • $\lambda(m_i, s_j)$ = forecast expected number of events in bin $(m_i, s_j)$.
  • $\omega(m_i, s_j)$ = observed number of events in bin $(m_i, s_j)$.
  • $N_{\text{fore}} = \sum_{m_i, s_j \in R} \lambda(m_i, s_j)$ = total forecast number.
  • $N_{\text{obs}} = \sum_{m_i, s_j \in R} \omega(m_i, s_j)$ = total observed number.
  • A hat ($\hat{\cdot}$) over a quantity denotes a value computed on a simulated catalog drawn from the forecast.

All consistency-test quantiles share one logic: compute an observed test statistic, simulate its distribution under the forecast, then report the quantile score = the fraction of simulated statistics $\le$ observed. A forecast is consistent when the observed value is not in the extreme tail; the conventional effective two-sided rejection sits at the $0.025 / 0.975$ quantiles.


4. Consistency (calibration) tests — N / M / S / L / CL

Consistency tests answer "is this one model plausibly correct?" They calibrate a single model. They never establish skill — a smoothed climatology can pass them all. Skill comes only from the comparison tests in §5.

4.1 The joint log-likelihood (kernel of L, CL, M, S)

For a forecast $\Lambda$ and an observation $\Omega$, the Poisson joint log-likelihood is

$$L(\Omega \mid \Lambda) = \sum_{m_i, s_j \in R} \Big[ -\lambda(m_i, s_j) + \omega(m_i, s_j),\log \lambda(m_i, s_j) - \log\big(\omega(m_i, s_j)!\big) \Big].$$

This is the standard Poisson log-likelihood summed over independent bins (Schorlemmer et al. 2007). It is also the accumulated logarithmic score (§7), which ties CSEP directly to proper-scoring-rule theory.

4.2 L-test (Likelihood)

Test statistic $L_{\text{obs}} = L(\Omega \mid \Lambda)$. Simulate $\hat{L}_x$ from many catalogs drawn from $\Lambda$. Quantile score:

$$\gamma = \frac{\big|{, \hat{L}_x \mid \hat{L}_x \le L_{\text{obs}} ,}\big|}{\big|{, \hat{L} ,}\big|}.$$

A low $\gamma$ means the observed catalog is less likely than almost all simulations → forecast inconsistent. The L-test conflates the number, spatial, and magnitude components, which is why it was split into N/S/M and is now diagnosed via the conditional likelihood (CL) test. The L-test is not deprecated in current pyCSEP, but we prefer the CL-test for shape diagnosis.

4.3 CL-test (Conditional Likelihood) — preferred over raw L

Identical to the L-test except the simulated catalogs are conditioned to contain exactly $N_{\text{obs}}$ events. This removes the number component, isolating spatial + magnitude shape independent of whether the total rate was right:

$$\gamma_{CL} = \frac{\big|{, \hat{CL}_x \mid \hat{CL}_x \le CL_{\text{obs}} ,}\big|}{\big|{, \hat{CL} ,}\big|}.$$

Preferred over the raw L-test because the L-test is dominated by, and confounded with, the number forecast (Werner et al. 2011).

4.4 N-test (Number)

Observed $N_{\text{obs}} = \sum \omega(m_i, s_j)$ should be Poisson with mean $N_{\text{fore}}$. The modern (Zechar et al. 2010) two-sided version reports two quantile scores:

$$\delta_1 = 1 - F\big((N_{\text{obs}} - 1)\mid N_{\text{fore}}\big), \qquad \delta_2 = F\big(N_{\text{obs}} \mid N_{\text{fore}}\big),$$

where $F$ is the Poisson CDF

$$F(x \mid N_{\text{fore}}) = \exp(-N_{\text{fore}}) \sum_{i=0}^{x} \frac{N_{\text{fore}}^{,i}}{i!}.$$

Interpretation (matches the pyCSEP source): a small $\delta_1$ ⇒ the forecast predicts too few events; a small $\delta_2$too many.

Over-dispersion caveat. The Poisson assumption for number is demonstrably wrong for real catalogs — earthquakes cluster, producing variance ≫ mean (Kagan 2017). The negative-binomial distribution adds a clustering parameter and yields wider, more realistic intervals. The explicit negative-binomial N-test is not in the core pyCSEP grid test list, so our production route to over-dispersed number uncertainty is the catalog-based number test (§6.1), which is non-parametric and is the maintainers' intended path.

4.5 M-test (Magnitude — Gutenberg–Richter shape)

Collapse over space and standardize so totals agree:

$$\omega^m(m_i) = \sum_{s_j \in S} \omega(m_i, s_j), \qquad \lambda^m(m_i) = \frac{N_{\text{obs}}}{N_{\text{fore}}} \sum_{s_j \in S} \lambda(m_i, s_j).$$

Test statistic $M = L(\Omega^m \mid \Lambda^m)$; quantile $\kappa = |{\hat{M}x \le M}| / |{\hat{M}}|$. The $N{\text{obs}}/N_{\text{fore}}$ rescaling isolates the shape of the magnitude (Gutenberg–Richter) distribution from the total rate (Zechar et al. 2010).

4.6 S-test (Spatial)

Collapse over magnitude and standardize:

$$\omega^s(s_j) = \sum_{m_i \in M} \omega(m_i, s_j), \qquad \lambda^s(s_j) = \frac{N_{\text{obs}}}{N_{\text{fore}}} \sum_{m_i \in M} \lambda(m_i, s_j).$$

Test statistic $S = L(\Omega^s \mid \Lambda^s)$; quantile $\zeta = |{\hat{S}_x \le S}| / |{\hat{S}}|$. A low $\zeta$ means events occurred where the model assigned low spatial probability (Zechar et al. 2010).

The N/M/S/L/CL nomenclature is current. The primary citation for the M / S / modern-N tests is Zechar, J. D., Gerstenberger, M. C., & Rhoades, D. A. (2010). Likelihood-based tests for evaluating space–rate–magnitude earthquake forecasts. BSSA, 100(3), 1184–1195. DOI: 10.1785/0120090192.


5. Comparison tests — information gain vs Poisson AND ETAS (T / W)

Consistency tests ask "is model X plausibly correct?" Comparison tests ask "is A better than B?" — and skill lives only here. This is how we show the model beats the standard baselines.

5.1 Information Gain Per Earthquake (IGPE) — the central quantity, in nats

For each observed earthquake $i$ (in bin $k_i$), the mean per-event information gain of model A over model B is

$$I_N(A,B) = \frac{1}{N}\sum_{i=1}^{N} (X_i - Y_i) - \frac{\hat{N}_A - \hat{N}_B}{N}, \qquad X_i = \log \lambda_A(k_i),; Y_i = \log \lambda_B(k_i),$$

where $\hat{N}_A, \hat{N}_B$ are the total forecast counts of A and B and $N$ is the observed count. $I_N(A,B) > 0$ ⇒ A is more informative than B per earthquake (Rhoades et al. 2011).

Units matter. IGPE here is in nats (natural-log units), not bits. We report it in nats consistently, because that is how the CSEP literature reports it.

5.2 T-test (paired)

Sample variance of the per-event differences:

$$s^2 = \frac{1}{N-1}\sum_{i=1}^{N}(X_i - Y_i)^2 - \frac{1}{N^2 - N}\left[\sum_{i=1}^{N}(X_i - Y_i)\right]^2,$$

with test statistic distributed as Student-$t$ with $N-1$ degrees of freedom:

$$T = \frac{I_N(A,B)}{s/\sqrt{N}} \sim t_{N-1}.$$

A confidence interval on $I_N(A,B)$ that excludes 0 means a statistically significant difference in informativeness.

5.3 W-test (Wilcoxon signed-rank)

The non-parametric companion to the T-test: it tests whether the median of $(X_i - Y_i)$ equals $(\hat{N}_A - \hat{N}_B)/N$, without assuming the differences are normal. We use it as a robustness check when $N$ is small or the differences are skewed.

Primary citation for T / W / IGPE: Rhoades, D. A., Schorlemmer, D., Gerstenberger, M. C., Christophersen, A., Zechar, J. D., & Imoto, M. (2011). Efficient testing of earthquake forecasting models. Acta Geophysica, 59(4), 728–747. DOI: 10.2478/s11600-011-0013-5. This paper introduces the paired t-test as a more direct, more efficient alternative to the older R-test.

5.4 The mandatory baselines — Poisson smoothed-seismicity AND ETAS

A comparison test needs a model B. CAOS_SEISMIC must beat both of these, and both are standard, reproducible, and themselves defensible:

  1. a time-independent smoothed-seismicity model (a Poisson / Helmstetter–Werner-style background null), and
  2. an ETAS model — the standard self-exciting clustering baseline.

We claim skill only when IGPE is positive with a T-test confidence interval excluding zero, corroborated by the W-test — against at least the smoothed-seismicity baseline, and, for the headline thesis (§12), against ETAS.

Honest expectation. IGPE over a Poisson / smoothed-seismicity null is state-dependent — large during active sequences, near zero in quiet periods. For scale, time-independent model contrasts in prospective California CSEP give IGPE of only about −0.7 to +0.5 nats. There is no stable "2–3" steady-state value; we never quote one.


6. Catalog-based tests — relaxing the Poisson assumption

Savran et al. (2020) introduced non-parametric analogs of the CSEP tests that operate on $J$ synthetic catalogs simulated by the model, building empirical test distributions instead of Poisson ones. These are the right tests for clustered, daily, conditional forecasts like ours.

Primary citation: Savran, W. H., Werner, M. J., Marzocchi, W., Rhoades, D. A., Jackson, D. D., Milner, K., Field, E., & Michael, A. (2020). Pseudoprospective Evaluation of UCERF3-ETAS Forecasts during the 2019 Ridgecrest Sequence. BSSA, 110(4), 1799–1817. DOI: 10.1785/0120200026.

Let $\hat{\lambda}_s$ be the model's spatial rate density and $\Lambda_U = \lambda_1 \cup \dots \cup \lambda_J$ the union of all simulated catalogs.

6.1 Number test (catalog-based) — the over-dispersion-honest path

Empirical, not Poisson:

$$\delta_1 = P(N_j \ge N_{\text{obs}}) = 1 - F_N(N_{\text{obs}} - 1), \qquad \delta_2 = P(N_j \le N_{\text{obs}}) = F_N(N_{\text{obs}}),$$

where $F_N$ is the empirical CDF of simulated catalog counts. This is the production path to over-dispersed number uncertainty referenced in §4.4.

6.2 Pseudo-likelihood test (catalog-based)

The continuous marked-point-process log-likelihood is

$$L = \sum_{i=1}^{N}\log\lambda(e_i \mid H_t) - \int_R \lambda(e \mid H_t),dR.$$

pyCSEP approximates it with a gridded pseudo-log-likelihood. Observed:

$$\hat{L}_{\text{obs}} = \sum_{i=1}^{N_{\text{obs}}} \log \hat{\lambda}_s(k_i) - \bar{N},$$

with simulated distribution $\hat{L}j = \sum{i=1}^{N_j}\log\hat{\lambda}s(k{ij}) - \bar{N}$ and quantile $\gamma_L = P(\hat{L}j \le \hat{L}{\text{obs}})$.

6.3 Spatial test (catalog-based)

Normalized rate density $\hat{\lambda}_s^* = \hat{\lambda}_s / \sum_R \hat{\lambda}_s$. Observed:

$$\hat{S}_{\text{obs}} = \frac{1}{N_{\text{obs}}}\sum_{i=1}^{N_{\text{obs}}}\log \hat{\lambda}_s^*(k_i),$$

with $\hat{S}j = \frac{1}{N_j}\sum{i=1}^{N_j}\log\hat{\lambda}s^*(k{ij})$ and quantile $\gamma_s = P(\hat{S}j \le \hat{S}{\text{obs}})$.

6.4 Magnitude test (catalog-based) and its refinements

The original (number-biased) statistic over magnitude bins $k$:

$$d_{\text{obs}} = \sum_k \left[\log!\Big(\tfrac{N_{\text{obs}}}{N_U}\Lambda_U^{(m)}(k) + 1\Big) - \log\big(\Omega^{(m)}(k) + 1\big)\right]^2,$$

with simulated $D_j$ analogously and $\gamma_m = P(D_j \le d_{\text{obs}})$. Because this is biased by $N_{\text{obs}}$, pyCSEP also implements the resampled magnitude test (each resampled catalog has exactly $N_{\text{obs}}$ events) and the MLL (modified multinomial log-likelihood) magnitude test.

6.5 Calibration test (over time)

A Kolmogorov–Smirnov test on the collection of quantile scores across many forecast intervals — it checks that the quantile scores are themselves uniformly distributed, i.e. the model is calibrated over time, not just on one window. This is the test most directly relevant to a daily product evaluated over years, and we report it alongside the reliability diagram.


7. Reliability diagrams and proper scoring rules

On top of the CSEP suite — never as a substitute — we report proper scoring rules and a reliability diagram, because they are widely understood outside seismology and are honest about probability quality.

7.1 The reliability diagram — the headline credibility artifact

A reliability diagram plots forecast probability (x-axis) against observed frequency (y-axis), per horizon, per region. It is the public "when we said X%, it happened ~X% of the time" picture, and it is the single most honest summary of whether a probabilistic product is calibrated. A perfectly calibrated forecast lies on the diagonal. We validate calibration specifically in the quiet / cold-start regime, which dominates the diagram (most space–time cells are quiet) — not only during active sequences. The diagram is refreshed as the live prospective record grows.

7.2 Proper scoring rules (Gneiting & Raftery 2007)

A scoring rule $S(P, y)$ assigns a numerical penalty/reward to a probabilistic forecast $P$ given outcome $y$. It is proper if the forecaster minimizes expected score by reporting their true belief, and strictly proper if that optimum is unique (Gneiting & Raftery 2007).

  • Logarithmic score — the rule CSEP's L-test is built on:

$$\text{LogS}(p, y) = -\log p(y).$$

Lower is better; strictly proper. The CSEP joint log-likelihood (§4.1) is the accumulated log score over bins.

  • Brier score — for a probability $p$ of a binary event with outcome $y \in {0,1}$:

$$\text{BS} = \frac{1}{T}\sum_{t=1}^{T} (p_t - y_t)^2.$$

Strictly proper, bounded in $[0,1]$; the natural metric for our bounded probability of ≥1 target event in the window output. It decomposes into reliability − resolution + uncertainty (Murphy 1973), which lets us separately report calibration (reliability) and discrimination (resolution). The score originates with Brier (1950); Gneiting & Raftery (2007) is the proper-scoring-rule theory that treats it as a proper rule.

  • Continuous Ranked Probability Score (CRPS) — generalizes the Brier score to a full predictive CDF $F$ with outcome $y$:

$$\text{CRPS}(F, y) = \int_{-\infty}^{\infty} \big(F(x) - \mathbf{1}{x \ge y}\big)^2, dx.$$

Useful if we report a predictive distribution over number of events rather than a single probability.

Honesty note. Brier / Log / CRPS in isolation are not sufficient for seismology peer review — they don't account for the multiplicity of bins or for clustering. They are a communication-friendly supplement to the CSEP suite, never a replacement. We lead with CSEP; we report Brier / Log / CRPS as accessible secondary numbers.


8. Molchan / ROC — and why AUC is banned as a skill metric

Even though CAOS_SEISMIC is explicitly not an alarm, regulators and the public think in alarm terms, so we report the Molchan diagram — the standard "probability gain vs cost of false alarms" view (Zechar & Jordan 2008).

The Molchan diagram plots miss rate $\nu$ against fraction of space–time occupied by alarm $\tau$, for a sweep of alarm thresholds:

$$\nu = 1 - H = \frac{\text{events missed}}{\text{total events}}, \qquad \tau = \frac{\text{space–time alarmed}}{\text{total space–time}}.$$

Boundary conditions: $\tau = 0 \Rightarrow \nu = 1$ (alarm nowhere, miss everything); $\tau = 1 \Rightarrow \nu = 0$ (alarm everywhere, miss nothing). A skillful forecast bows toward the origin (low $\tau$, low $\nu$).

The Area Skill Score (ASS) is the normalized area above the Molchan trajectory: 1 = perfect skill, 0 = perfect non-skill, 0.5 = expected for a random alarm function (Zechar & Jordan 2010).

ROC (hit rate $H = 1 - \nu$ vs false-alarm rate) is the closely related complement. We prefer the Molchan diagram in seismology because $\tau$ (fraction of space–time alarmed) is more meaningful than the classical false-alarm rate when the no-event class overwhelmingly dominates.

Why AUC is banned as a skill metric

We do not use the area under the ROC curve (AUC) as a skill metric, and we say so explicitly:

  • The no-event class dominates by orders of magnitude. In a daily space–time grid, the overwhelming majority of cells contain no target event. AUC is computed over all cell–window pairs, so it is dominated by trivially-correct "nothing happened" cases and inflates toward 1 for almost any non-degenerate forecast. A high AUC is therefore not evidence of useful forecasting skill.
  • AUC is not a proper scoring rule and ignores calibration entirely. It measures only ranking (discrimination), so a badly mis-calibrated forecast — one that states wildly wrong probabilities — can still post a high AUC. For a product whose entire value proposition is calibrated probabilities, ranking-only metrics measure the wrong thing.
  • The field's own metric is the Area Skill Score, not AUC. The Molchan / ASS framing reports skill relative to a random alarm baseline (0.5) on the axis that matters ($\tau$), which is the defensible alarm-style view.

So: Molchan + ASS for the alarm-style communication view; never AUC as a headline skill number.


9. The forecast-clock protocol (pseudo-prospective, leakage-free)

This is where products lose credibility. Definitions follow Mizrahi et al. (2024) and CSEP convention.

Mode Time causality preserved? Test data known to model at training? Credibility
Prospective (true) Yes — forecasts deposited before events occur No Gold standard. Slow (must wait years).
Pseudo-prospective Yes — split preserves time order; model fit only on data before each forecast No Strong, accepted for back-analysis. Our primary mode.
Retrospective, out-of-sample (non-causal split) No — split (e.g. random k-fold) ignores time No (future may leak via global params) Weak for forecasting; sanity checks only.
Retrospective, in-sample No Yes — evaluated on training data Not evidence of skill; a fit diagnostic only.

Pseudo-prospective is the primary back-analysis mode. True prospective is the gold standard and the live tail; retrospective out-of-sample is weak; in-sample is a fit diagnostic only.

The forecast clock — leakage made structurally impossible

The control is structural, not disciplinary. A back-test driver iterates issue times $t$; at each $t$ it hands the model a catalog sliced to $(-\infty, t)$, the model emits the forecast for $[t, t+h)$, the driver seals it, then advances. The model code has no access to the future slice — leakage becomes an impossibility, not a matter of care.

flowchart LR
    A["Issue time t"] --> B["Slice catalog to (-inf, t)<br/>ONLY past events"]
    B --> C["Model emits forecast<br/>gridded + catalog-based<br/>for [t, t+h)"]
    C --> D["SEAL forecast<br/>snapshot inputs + Mc + params + version"]
    D --> E["Advance t -> t + 1 day"]
    E --> A
    F["Final authoritative catalog"] -.->|score later| D
Loading

Five leakage failure modes engineered against:

  1. Temporal leakage. Any feature, normalization statistic, hyperparameter, or weight that "saw" data at or after the issue time. Rule: for a forecast issued at $t$, every input and every fitted parameter derives only from data strictly before $t$. (CSEP requires parameters fixed before the test period.)
  2. Catalog-revision leakage. Using a final, relocated, magnitude-revised catalog as the input the model "had." Real-time catalogs differ from final ones. We snapshot the exact input catalog state for every daily issue so a forecast is byte-reproducible. Where reconstructing historical real-time states is infeasible, we document the optimistic bias explicitly rather than hide it. (The target catalog used for scoring is the best final catalog, with the declustering policy fixed in advance.)
  3. $M_c$ / completeness leakage. Estimating $M_c$, b-value, or smoothing bandwidths on the whole catalog including the test window. $M_c$ is estimated on the training window only, or fixed by a pre-declared, physically justified threshold (§10).
  4. Region / parameter snooping. Choosing the test region, magnitude threshold, or horizons after peeking at where the model does well. Region, $M_{\min}$, and horizons are pre-registered before scoring.
  5. Multiple-testing inflation. Running many regions/horizons/thresholds and reporting only the wins. We pre-register the full grid and report all cells, including failures.

Provenance and reproducibility (existential for "evaluated against reality")

Each daily forecast persists, immutably and versioned: the forecast (gridded + catalog-based) with its issue timestamp; the exact input catalog snapshot; the $M_c$ grid, declustering choice, model version, and all parameter values. This makes any past forecast byte-reproducible and auditable months later — the property the entire value proposition depends on.


10. Completeness magnitude consistency

The single most common way a forecast looks artificially good or bad is an $M_c$ mismatch between training and testing.

  • $M_c$ definition: the magnitude above which (essentially) all events are detected; below it the Gutenberg–Richter frequency–magnitude distribution rolls off.
  • Standard estimator — Maximum Curvature (MAXC): $M_c$ = magnitude of the peak of the non-cumulative frequency–magnitude distribution (Wiemer & Wyss 2000). MAXC tends to underestimate $M_c$; a widely used correction is

$$M_c = M_c^{\text{MAXC}} + 0.2.$$

This +0.2 correction is California-tuned, not universal — we re-validate it per region (with EMR / GFT and a frequency–magnitude inspection) and take the conservative (higher) value before exporting it to Chile / Japan / NZ / Italy.

  • Cross-check estimators: Goodness-of-Fit Test (GFT), b-value stability (MBS), Entire-Magnitude-Range (EMR) — Woessner & Wiemer (2005).

Rules for the back-analysis:

  1. Estimate $M_c$ per region, on the training window only, with MAXC+0.2 cross-checked against EMR; take the more conservative value.
  2. Set a single uniform target $M_{\min}$ at or above the worst (highest) regional $M_c$ across the entire study period, applied identically to the model and every baseline — so completeness never silently changes the target population. CSEP California next-day work used M≥3.95 for exactly this robustness (Serafini et al. 2025).
  3. Use the same $M_{\min}$, same grid, same declustering policy for every model compared. Comparison tests (T/W) are only fair if both forecasts cover an identical space–magnitude–time domain.
  4. Score on the non-declustered catalog. CAOS_SEISMIC deliberately forecasts clustering (aftershocks), so the target population must include aftershocks — declustering the target would throw away the very signal we forecast. We state this explicitly. (The dual-catalog rule — declustered background, full catalog for triggering — applies to model inputs, not to the scored target.)

11. Multi-country back-analysis design (high- vs low-seismicity bias)

We run the back-analysis across tectonically diverse regions to (a) prevent cherry-picking a friendly area, and (b) — as an explicit evaluation goal — measure and detect bias toward high-seismicity zones. A global model that only "works" where earthquakes are abundant is the opposite of the product's purpose, so the evaluation is deliberately designed to surface that failure if it exists.

Region Authoritative catalog Why included
Chile / N. Chile CSN (via EarthScope/IRIS) Subduction megathrust; dense national network; very high seismicity
California ANSS/ComCat, SCEDC/NCEDC Direct comparison to the 25 published CSEP next-day models (Serafini et al. 2025)
Japan JMA Dense network, very low $M_c$, abundant sequences
New Zealand GeoNet Mixed subduction / crustal; established CSEP history
Italy (optional) INGV OEF-Italy operational analogue; moderate seismicity stress test

The high- vs low-seismicity bias comparison. Inference is run across both high-seismicity countries (Chile, Japan, Indonesia, Mexico, Turkey, California, NZ) and low-seismicity countries (e.g. UK, Germany, Australia, Brazil). The point is to compare calibration and information gain across the seismicity spectrum: if the model is well calibrated only where data are abundant and silently mis-calibrated in quiet regions, the reliability diagrams and per-region CSEP scores will show it. Reporting both ends of the spectrum — including the regions where the model under-performs — is itself the antidote to the selection-bias trap CSEP exists to prevent.

Time split. A learning split of roughly 60–70% and a pseudo-prospective testing split of 30–40%, with frozen hyperparameters across the test split. Region polygons, horizons, thresholds, grid, $M_{\min}$, and declustering are pre-registered before scoring. California specifically anchors against Serafini et al. (2025) — the gold-standard operational template for a daily product. An optional fully-prospective live tail (the last 6–12 months, running "for real") demonstrates true prospective behavior once the product is live.

Horizons & output. Horizons 1 day / 2 days / 7 days, re-issued daily. Per issue per horizon, two products: a GriddedForecast (expected rate per space–magnitude bin) and a CatalogForecast (≥10,000 synthetic catalogs). The public-facing scalar is the bounded exceedance probability $P(\ge 1\text{ event with } M \ge M_{\min}\text{ in the region within the horizon})$, derived from the catalog forecast as the fraction of simulations with ≥1 qualifying event.


12. The headline thesis measurement: context-over-ETAS information gain

The product's scientific thesis is not "we can forecast aftershocks" — ETAS already does that, and so does anything that captures Omori–Utsu decay. The thesis is that a context-conditioned model adds skill on top of a well-fit ETAS. The measurement that proves or refutes it is therefore precise:

Headline thesis metric: the IGPE of the context-conditioned model over a well-fit ETAS, in nats, with a T-test confidence interval that excludes zero and a corroborating W-test, measured pseudo-prospectively across the multi-country panel — and reported for every region × horizon cell, including the cells where it is not positive.

Why this framing is the honest one:

  • Consistency tests never establish skill. Because the scored target is mostly aftershocks, a trivial "aftershocks follow mainshocks" model passes the consistency (N/M/S/L/CL) tests. A naive skill number can be an artifact of clustering that both the model and ETAS already capture.
  • The gain over ETAS is the genuine signal. Because both the context model and ETAS already reproduce Omori clustering, a positive, significant IGPE-over-ETAS is not "I predicted aftershocks" — it is "I predicted them better than the standard self-exciting model." That difference is the only thing that justifies a context-conditioned model existing.
  • The publish bar is gated on this. A heavier / context-conditioned model reaches the public forecast field only if it beats ETAS in our own pseudo-prospective CSEP harness (positive IGPE, T-test CI excluding zero) and passes calibration. Otherwise ETAS — the mandatory calibrated reference and safe default — is what ships. As of writing, no pure neural point process has robustly beaten a well-fit ETAS in prospective CSEP, so this bar is real, not ceremonial.

13. pyCSEP — the tooling

We use pyCSEP, the community Python toolkit, so that a reviewer can dispute our model but never our test code.

Primary citation: Savran, W. H., Bayona, J. A., Iturrieta, P., Bao, H., Bayliss, K., Herrmann, M., Schorlemmer, D., Marzocchi, W., & Werner, M. J. (2022). pyCSEP: A Python Toolkit for Earthquake Forecast Developers. Seismological Research Letters, 93(5), 2858–2870. DOI: 10.1785/0220220033; JOSS companion DOI: 10.21105/joss.03658. Docs: docs.cseptesting.org.

pyCSEP provides four module groups:

  1. Catalog access & processing — readers for ComCat (USGS), ISC-GEM, CSEP catalog formats; filtering by region/magnitude/time; declustering helpers.
  2. Forecast representationsGriddedForecast (space–magnitude rate grid) and CatalogForecast (ensemble of synthetic catalogs).
  3. Statistical tests — community-endorsed implementations:
    • Grid-based: number_test, magnitude_test, spatial_test, likelihood_test, conditional_likelihood_test, paired_t_test, w_test.
    • Catalog-based: number_test, spatial_test, magnitude_test, pseudolikelihood_test, calibration_test, resampled_magnitude_test, MLL_magnitude_test.
  4. Visualization & utilities — consistency-test plots (with simulated confidence bars), comparison plots (IGPE with confidence intervals), spatial forecast maps, Molchan / ROC helpers.

Caveat. The documented core pyCSEP grid test set uses the Poisson number / likelihood tests; there is no explicit negative-binomial N-test in the core API. Our over-dispersion handling therefore comes from the catalog-based path (§6.1), which is the maintainers' intended route. Using pyCSEP also gives free file-format compatibility with the CSEP testing centers should we ever pursue truly prospective registration.


14. The full report table — report every cell, including failures

For each region × horizon (1d / 2d / 7d) cell, we report the following — and we report every cell, including failures. A region × horizon where the model fails to beat ETAS is published as such. Selective reporting is itself the selection-bias trap CSEP exists to prevent.

Aspect evaluated Grid-based test Catalog-based test Score symbol Pass criterion
Total number / rate N-test (Poisson, $\delta_1,\delta_2$) catalog Number test $\delta_1, \delta_2$ $0.025 &lt; \delta &lt; 0.975$
Magnitude (GR shape) M-test catalog Magnitude / resampled / MLL $\kappa$ $\kappa &gt; 0.025$
Spatial location S-test catalog Spatial test $\zeta, \gamma_s$ $&gt; 0.025$
Joint likelihood L-test, CL-test catalog Pseudo-likelihood $\gamma, \gamma_{CL}, \gamma_L$ $&gt; 0.025$
Calibration over time Calibration (KS) test KS $p$ not rejected
Skill vs smoothed-seismicity T-test (IGPE) (comparison via IGPE) $I_N(A,B), T$ CI excludes 0, $I_N&gt;0$
Skill vs ETAS (the thesis) T-test (IGPE) (comparison via IGPE) $I_N(A,B), T$ CI excludes 0, $I_N&gt;0$
Skill robustness W-test $W$ consistent with T-test
Alarm-style communication Molchan diagram + ASS, ROC same $\tau,\nu$, ASS ASS $&gt; 0.5$ (never AUC)
Probability quality (public) Brier (+reliability diagram), Log, CRPS same BS, LogS, CRPS reliability ≈ diagonal

Reporting rules: (1) report every cell for every region × horizon, including failures; (2) show consistency-test plots with simulated confidence bars and comparison plots with IGPE confidence intervals (pyCSEP defaults); (3) include a reliability diagram per horizon — the most honest single picture of calibration. (4) Accompany any Poisson grid-test failure during an active sequence with the catalog-based (over-dispersed) result, and say so explicitly — it is an honest, defensible nuance, not an excuse.

How the results feed the web app's Back-analysis section

  • A CSEP calibration badge (green / amber / red) for model quality only — N-test / S-test / L/CL + reliability diagram. The traffic-light triad is reserved exclusively for model quality and never for the forecast field itself.
  • A reliability diagram per horizon — the public "when we said X%, it happened ~X%" artifact, refreshed as the live record grows.
  • Expected-count-vs-baseline bars and an expected-vs-observed time series in the no-map summary (accessible, no WebGL required).
  • Per-region × per-horizon results, including the cells where the model does not beat ETAS — published honestly.
  • A worked teaching example: the 2019 Ridgecrest case (a ~3% first-week forecast was not wrong when the ~3% outcome occurred) to teach that single outcomes neither validate nor invalidate a probabilistic forecast.
  • A prominent last-run / staleness banner and coverage mask so blank never reads as "safe."

15. Software tests (the repository's automated suite)

Scientific honesty also requires that the code doing the science is correct. The repository ships an automated test layer alongside the CSEP evaluation, and both are part of the credibility story.

15.1 The pytest unit suite

The pipeline is covered by a unit-test suite executed with pytest. It verifies the individual building blocks of the forecasting and evaluation stack independently of any network data — for example:

  • catalog hygiene (the load-bearing order: $M_c(x,y,t)$ → moment-magnitude homogenization → dual-catalog declustering);
  • the ETAS fit (MLE), its stability gates, and its simulation routine;
  • the tiled / regime-aware global forecaster (fitting ETAS per tile on halo events, routing cells to owning tiles, falling back to the smoothed null when a tile is thin or supercritical);
  • the calibration and QA-gate logic;
  • the compact-artifact writer.

The suite is meant to grow with the code; new functionality lands with its tests.

15.2 Import purity

A deliberate test asserts import purity: importing the package must not pull heavy scientific dependencies (e.g. ObsPy, xarray, GeoPandas, the deep-learning stack) at import time. This keeps the package fast to import, keeps the static / CLI surface lightweight, and prevents accidental coupling of the lightweight code paths to heavyweight optional deps.

15.3 The check smoke test

A first-class end-to-end smoke test is exposed through the CLI (e.g. caos-seismic check). It runs the whole short pipeline on a small slice of real catalog data — fetch → clean → estimate $M_c$ → condition ETAS → write the compact artifact → QA gate — and asserts the artifact is produced and passes its QA gate. It is the fastest honest answer to "does the real pipeline still run end-to-end?" and is the recommended first command after setting up the environment.

15.4 The operational QA gate (no silent corrupt publishes)

Distinct from the test suite but part of the same correctness posture: the daily job will not commit an anomalous or stale artifact. The QA gate checks catalog freshness, an event-count sanity bound, the absence of a duplicate/retracted spike near $M^*$, and N-test drift. On failure it commits nothing and leaves the last-good artifact in place; the staleness banner does the rest. This means a published forecast is always either fresh and sane, or visibly stale — never silently wrong.


16. Honest limitations of evaluation

These stay in the product framing — they are part of being trustworthy:

  • No deterministic prediction. None of these tests yields a time-place-magnitude prediction. We estimate conditional probabilities; the tests verify calibration and relative information gain, not "we called the quake."
  • Low base rates ⇒ slow statistics. For large $M_{\min}$ and short horizons, target events per window are rare; consistency tests have low power and comparison tests need years of data to reach significance. We are patient and do not over-claim from a short window.
  • Poisson tests over-reject during clustering. Mitigated by the catalog-based tests, but it means the classic CSEP L/N grid tests must be read with care for a daily clustered model — always paired with the catalog-based result.
  • Catalog dependence. Results depend on the authoritative catalog and on the $M_c$ choices; different agencies' catalogs can flip marginal results. We fix and document them.
  • Passing consistency tests is not skill. A model can pass N/M/S/L and still be useless (e.g. a smoothed climatology). Skill is established only by winning comparison tests against real baselines (smoothed-seismicity and ETAS).
  • Proper scores and ASS are communication aids, not peer-review-grade headline skill. We lead with CSEP; Brier / Log / CRPS / Molchan-ASS support it, never replace it. AUC is not used as a skill metric at all.

16b. Real results (live, first published 2026-06) — every model, honest failures included

These are the real, leakage-free numbers produced by the harness above and rendered live on the Back-analysis page (not a sample). Full provenance and the justification for every modelling change are in the repository experiment register.

Model + catalog. Regime-tiled ETAS, 195 ETAS tiles, b = 1.337, M_c = 5.35, fit on 97,230 real events (USGS ComCat, 1990–2026, M ≥ 4.5; Scordilis-2006 Mw homogenization). The N-test is not yet passed (the forecast count needs calibrating) — surfaced, not hidden.

Pseudo-prospective back-analysis (60-day window, 8 views, IGPE vs the Poisson null, in nats):

Result Value Reading
Global whole-Earth field vs null (1 d) +0.0835 the global context contribution — higher than any single country (aggregated worldwide triggering)
High − low seismicity gap (1 d) +0.0184 skill concentrates in active zones — the measured bias (Japan leads at +0.072; low-seismicity interiors ≈ 0)
Context channel (neural) ≈ 0 by construction until geodetic/stress covariates are wired — consistent with the field (no NPP beats ETAS prospectively as of 2026)

Multi-model benchmark (single-window leakage-free IGPE vs the null; every model, including ones we do not ship):

View obs ETAS Ensemble Reasenberg–Jones
Global 248 +3.712 +2.795 −0.025
Chile 11 +0.036 −0.061 −0.182
Japan 5 −0.081 −0.013 −0.438

Honest finding. ETAS is the best single model; the naive equal-weight ensemble underperforms ETAS because averaging in the null and the weak Reasenberg–Jones dilutes the triggering signal that is the skill. This matches the literature (ensembles help only when score-weighted with strong members; Marzocchi-Zechar-Jordan 2012, Herrmann & Marzocchi 2023), so we do not ship the naive ensemble — the next lever is score-weighted stacking of ETAS variants. The full cited evidence base is in improvement-evidence.

Reported honestly: negative and zero results are kept. A model that does not beat its baseline is shown as such; that is the entire point of the CSEP discipline above.


17. References

  1. Field, E. H. (2007). Overview of the Working Group for the Development of RELM. SRL, 78(1), 7–16. 10.1785/gssrl.78.1.7
  2. Schorlemmer, D., Gerstenberger, M. C., Wiemer, S., Jackson, D. D., & Rhoades, D. A. (2007). Earthquake Likelihood Model Testing. SRL, 78(1), 17–29. 10.1785/gssrl.78.1.17
  3. Schorlemmer, D., & Gerstenberger, M. C. (2007). RELM Testing Center. SRL, 78(1), 30–36. 10.1785/gssrl.78.1.30
  4. Zechar, J. D., & Jordan, T. H. (2008). Testing alarm-based earthquake predictions. GJI, 172(2), 715–724. 10.1111/j.1365-246X.2007.03676.x
  5. Zechar, J. D., Gerstenberger, M. C., & Rhoades, D. A. (2010). Likelihood-based tests for evaluating space–rate–magnitude earthquake forecasts. BSSA, 100(3), 1184–1195. 10.1785/0120090192
  6. Zechar, J. D., & Jordan, T. H. (2010). The Area Skill Score Statistic for Evaluating Earthquake Predictability Experiments. PAGEOPH, 167, 893–906. 10.1007/s00024-010-0086-0
  7. Rhoades, D. A., Schorlemmer, D., Gerstenberger, M. C., Christophersen, A., Zechar, J. D., & Imoto, M. (2011). Efficient testing of earthquake forecasting models. Acta Geophysica, 59(4), 728–747. 10.2478/s11600-011-0013-5
  8. Werner, M. J., Helmstetter, A., Jackson, D. D., & Kagan, Y. Y. (2011). High-Resolution Long-Term and Short-Term Earthquake Forecasts for California. BSSA, 101(4), 1630–1648. 10.1785/0120090340
  9. Savran, W. H., Werner, M. J., Marzocchi, W., Rhoades, D. A., Jackson, D. D., Milner, K., Field, E., & Michael, A. (2020). Pseudoprospective Evaluation of UCERF3-ETAS Forecasts during the 2019 Ridgecrest Sequence. BSSA, 110(4), 1799–1817. 10.1785/0120200026
  10. Savran, W. H., Bayona, J. A., Iturrieta, P., et al. (2022). pyCSEP: A Python Toolkit for Earthquake Forecast Developers. SRL, 93(5), 2858–2870. 10.1785/0220220033; JOSS 10.21105/joss.03658
  11. Mizrahi, L., Dallo, I., van der Elst, N. J., et al. (2024). Developing, Testing, and Communicating Earthquake Forecasts. Reviews of Geophysics, 62. 10.1029/2023RG000823
  12. Serafini, F., Bayona, J. A., Silva, F., et al. (2025). A benchmark database of ten years of prospective next-day earthquake forecasts in California from CSEP. Scientific Data, 12, 1501. 10.1038/s41597-025-05766-3
  13. Gneiting, T., & Raftery, A. E. (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. JASA, 102(477), 359–378. 10.1198/016214506000001437
  14. Kagan, Y. Y. (2017). Earthquake number forecasts testing. GJI, 211(1), 335–345. 10.1093/gji/ggx300
  15. Wiemer, S., & Wyss, M. (2000). Minimum magnitude of completeness in earthquake catalogs. BSSA, 90(4), 859–869. 10.1785/0119990114
  16. Woessner, J., & Wiemer, S. (2005). Assessing the quality of earthquake catalogues: Estimating the magnitude of completeness and its uncertainty. BSSA, 95(2), 684–698. 10.1785/0120040007
  17. Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1), 1–3. 10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2
  18. Murphy, A. H. (1973). A new vector partition of the probability score. Journal of Applied Meteorology, 12(4), 595–600. 10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2
  19. pyCSEP documentation — Theory of CSEP Tests: docs.cseptesting.org/getting_started/theory.html

Public technical documentation. Calibration and information gain are what these tests verify; deterministic prediction is impossible and is never claimed.

Clone this wiki locally