-
Notifications
You must be signed in to change notification settings - Fork 0
Evaluation and Tests
This page is the credibility backbone of CAOS_SEISMIC. A conditional probabilistic seismic forecast is only as trustworthy as the framework used to score it. We therefore adopt the community-endorsed standard of the field — CSEP — and report all of its tests, including the ones we fail. We do not invent bespoke metrics: if a forecast is scored with a non-standard score, a senior seismologist will (correctly) dismiss the result.
Framing that never leaves the product. Deterministic earthquake prediction is impossible. CAOS_SEISMIC is a conditional probability estimator — never an alarm, never a single-event call. The tests below verify calibration (are our stated probabilities honest?) and relative information gain (do we add skill over standard baselines?). They do not, and cannot, verify that "we called the quake." There is no such thing.
Contents
- What CSEP is and why it is the only standard we use
- Two forecast representations: gridded-rate and catalog-based
- Shared notation
- Consistency (calibration) tests — N / M / S / L / CL
- Comparison tests — information gain vs Poisson AND ETAS (T / W)
- Catalog-based tests — relaxing the Poisson assumption
- Reliability diagrams and proper scoring rules
- Molchan / ROC — and why AUC is banned as a skill metric
- The forecast-clock protocol (pseudo-prospective, leakage-free)
- Completeness magnitude consistency
- Multi-country back-analysis design (high- vs low-seismicity bias)
- The headline thesis measurement: context-over-ETAS information gain
- pyCSEP — the tooling
- The full report table — report every cell, including failures
- Software tests (the repository's automated suite)
- Honest limitations of evaluation
- References
The Collaboratory for the Study of Earthquake Predictability (CSEP) is the international infrastructure and rule-set for testing probabilistic earthquake forecasts in a transparent, reproducible, and — crucially — prospective manner. It grew out of the Regional Earthquake Likelihood Models (RELM) working group for California.
The credibility of a forecast comes from the protocol being fixed before the data are
seen. CSEP enforces that forecasts are deposited, then sealed, then scored against a
pre-declared authoritative catalog. Our back-analysis emulates this exactly: we freeze the
model and its parameters, then score. The whole point is that a reviewer can dispute our
model but never our scoring, because the scores are the community's, computed by the
community's tooling (pyCSEP).
Foundational papers:
- Field, E. H. (2007). Overview of the Working Group for the Development of Regional Earthquake Likelihood Models (RELM). Seismological Research Letters, 78(1), 7–16. DOI: 10.1785/gssrl.78.1.7. — sets up RELM: alternate models of earthquake potential evaluated head-to-head for California.
- Schorlemmer, D., Gerstenberger, M. C., Wiemer, S., Jackson, D. D., & Rhoades, D. A. (2007). Earthquake Likelihood Model Testing. Seismological Research Letters, 78(1), 17–29. DOI: 10.1785/gssrl.78.1.17. — defines the original N-test, L-test, and R-test.
- Schorlemmer, D., & Gerstenberger, M. C. (2007). RELM Testing Center. Seismological Research Letters, 78(1), 30–36. DOI: 10.1785/gssrl.78.1.30. — the rules: testing area and grid, forecast periods, magnitude ranges, authoritative catalog, declustering, and the distinction between five-year, one-year, and one-day models. CAOS_SEISMIC is conceptually a one-day model, re-issued daily.
Authoritative anchors for current best practice and the operational template:
- Mizrahi, L., Dallo, I., van der Elst, N. J., et al. (2024). Developing, Testing, and Communicating Earthquake Forecasts: Current Practices and Future Directions. Reviews of Geophysics, 62. DOI: 10.1029/2023RG000823. — lays out the prospective / pseudo-prospective / retrospective / in-sample / out-of-sample taxonomy we use in §9.
-
Serafini, F., Bayona, J. A., Silva, F., et al. (2025). A benchmark database of ten
years of prospective next-day earthquake forecasts in California from CSEP. Scientific
Data, 12, 1501. DOI: 10.1038/s41597-025-05766-3.
— 25 fully automated M≥3.95 daily models, >50,000 daily forecasts (Aug 2007–Aug 2018),
evaluated with
pyCSEP. This is the single most useful operational template for a daily-forecast product like ours, and a real public benchmark our California region scores against.
CSEP supports two distinct forecast representations, and the tests differ between them. CAOS_SEISMIC emits both, because the daily forecast is strongly clustered (aftershocks, swarms, foreshocks) and the two representations make different assumptions.
| Gridded-rate (grid-based) | Catalog-based (simulation) | |
|---|---|---|
| What it is | Expected number of events |
Many synthetic catalogs (stochastic event sets) simulated from the model |
| Core assumption | Events in each bin are independent and Poisson distributed | No Poisson assumption — empirical distribution from simulations |
| Pros | Simple, fast to test, analytic test distributions; directly comparable to the published CSEP California daily benchmark | Captures clustering / over-dispersion, correct uncertainty, magnitude–space correlations |
| Cons | Poisson under-disperses variance → over-rejects during sequences | Computationally heavier; must store/serve many catalogs |
| Tests | N, M, S, L, CL (Poisson) + T, W comparison | Number, Magnitude, Spatial, Pseudo-likelihood (empirical) + comparison + calibration |
Why both. Because earthquakes cluster, the variance of real event counts is far larger than the mean (over-dispersion). The Poisson assumption baked into the classic gridded L/CL-tests therefore systematically over-rejects us during active sequences — the model can be perfectly reasonable and still "fail" a Poisson test purely because Poisson under-estimates the spread. The fix is the catalog-based representation, whose test distributions are empirical (built from the simulations) and therefore over-dispersion-honest.
Operationally, every issue, every horizon, CAOS_SEISMIC emits:
- a gridded-rate forecast (a space–magnitude grid of expected rates) for the primary, comparable CSEP tests, and
- a catalog-based forecast (≥10,000 synthetic catalogs simulated from the model) to get honest, non-Poisson uncertainty and run the Savran et al. (2020) catalog tests.
This dual emission is exactly what pyCSEP is built to consume. Every Poisson-grid-test
failure during a sequence is reported paired with its catalog-based result — never in
isolation.
Following Schorlemmer et al. (2007) and Zechar et al. (2010):
- Region
$R$ is partitioned into bins indexed by magnitude$m_i$ and space$s_j$ . -
$\lambda(m_i, s_j)$ = forecast expected number of events in bin$(m_i, s_j)$ . -
$\omega(m_i, s_j)$ = observed number of events in bin$(m_i, s_j)$ . -
$N_{\text{fore}} = \sum_{m_i, s_j \in R} \lambda(m_i, s_j)$ = total forecast number. -
$N_{\text{obs}} = \sum_{m_i, s_j \in R} \omega(m_i, s_j)$ = total observed number. - A hat (
$\hat{\cdot}$ ) over a quantity denotes a value computed on a simulated catalog drawn from the forecast.
All consistency-test quantiles share one logic: compute an observed test statistic, simulate
its distribution under the forecast, then report the quantile score = the fraction of
simulated statistics
Consistency tests answer "is this one model plausibly correct?" They calibrate a single model. They never establish skill — a smoothed climatology can pass them all. Skill comes only from the comparison tests in §5.
For a forecast
This is the standard Poisson log-likelihood summed over independent bins (Schorlemmer et al. 2007). It is also the accumulated logarithmic score (§7), which ties CSEP directly to proper-scoring-rule theory.
Test statistic
A low pyCSEP, but we prefer the CL-test
for shape diagnosis.
Identical to the L-test except the simulated catalogs are conditioned to contain exactly
Preferred over the raw L-test because the L-test is dominated by, and confounded with, the number forecast (Werner et al. 2011).
Observed
where
Interpretation (matches the pyCSEP source): a small
Over-dispersion caveat. The Poisson assumption for number is demonstrably wrong for
real catalogs — earthquakes cluster, producing variance ≫ mean (Kagan 2017). The
negative-binomial distribution adds a clustering parameter and yields wider, more realistic
intervals. The explicit negative-binomial N-test is not in the core pyCSEP grid test
list, so our production route to over-dispersed number uncertainty is the catalog-based
number test (§6.1), which is non-parametric and is the maintainers' intended path.
Collapse over space and standardize so totals agree:
Test statistic
Collapse over magnitude and standardize:
Test statistic
The N/M/S/L/CL nomenclature is current. The primary citation for the M / S / modern-N tests is Zechar, J. D., Gerstenberger, M. C., & Rhoades, D. A. (2010). Likelihood-based tests for evaluating space–rate–magnitude earthquake forecasts. BSSA, 100(3), 1184–1195. DOI: 10.1785/0120090192.
Consistency tests ask "is model X plausibly correct?" Comparison tests ask "is A better than B?" — and skill lives only here. This is how we show the model beats the standard baselines.
For each observed earthquake
where
Units matter. IGPE here is in nats (natural-log units), not bits. We report it in nats consistently, because that is how the CSEP literature reports it.
Sample variance of the per-event differences:
with test statistic distributed as Student-$t$ with
A confidence interval on
The non-parametric companion to the T-test: it tests whether the median of
Primary citation for T / W / IGPE: Rhoades, D. A., Schorlemmer, D., Gerstenberger, M. C., Christophersen, A., Zechar, J. D., & Imoto, M. (2011). Efficient testing of earthquake forecasting models. Acta Geophysica, 59(4), 728–747. DOI: 10.2478/s11600-011-0013-5. This paper introduces the paired t-test as a more direct, more efficient alternative to the older R-test.
A comparison test needs a model B. CAOS_SEISMIC must beat both of these, and both are standard, reproducible, and themselves defensible:
- a time-independent smoothed-seismicity model (a Poisson / Helmstetter–Werner-style background null), and
- an ETAS model — the standard self-exciting clustering baseline.
We claim skill only when IGPE is positive with a T-test confidence interval excluding zero, corroborated by the W-test — against at least the smoothed-seismicity baseline, and, for the headline thesis (§12), against ETAS.
Honest expectation. IGPE over a Poisson / smoothed-seismicity null is state-dependent — large during active sequences, near zero in quiet periods. For scale, time-independent model contrasts in prospective California CSEP give IGPE of only about −0.7 to +0.5 nats. There is no stable "2–3" steady-state value; we never quote one.
Savran et al. (2020) introduced non-parametric analogs of the CSEP tests that operate on
Primary citation: Savran, W. H., Werner, M. J., Marzocchi, W., Rhoades, D. A., Jackson, D. D., Milner, K., Field, E., & Michael, A. (2020). Pseudoprospective Evaluation of UCERF3-ETAS Forecasts during the 2019 Ridgecrest Sequence. BSSA, 110(4), 1799–1817. DOI: 10.1785/0120200026.
Let
Empirical, not Poisson:
where
The continuous marked-point-process log-likelihood is
pyCSEP approximates it with a gridded pseudo-log-likelihood. Observed:
with simulated distribution $\hat{L}j = \sum{i=1}^{N_j}\log\hat{\lambda}s(k{ij}) - \bar{N}$ and quantile $\gamma_L = P(\hat{L}j \le \hat{L}{\text{obs}})$.
Normalized rate density
with $\hat{S}j = \frac{1}{N_j}\sum{i=1}^{N_j}\log\hat{\lambda}s^*(k{ij})$ and quantile $\gamma_s = P(\hat{S}j \le \hat{S}{\text{obs}})$.
The original (number-biased) statistic over magnitude bins
with simulated pyCSEP also implements the resampled magnitude test (each
resampled catalog has exactly
A Kolmogorov–Smirnov test on the collection of quantile scores across many forecast intervals — it checks that the quantile scores are themselves uniformly distributed, i.e. the model is calibrated over time, not just on one window. This is the test most directly relevant to a daily product evaluated over years, and we report it alongside the reliability diagram.
On top of the CSEP suite — never as a substitute — we report proper scoring rules and a reliability diagram, because they are widely understood outside seismology and are honest about probability quality.
A reliability diagram plots forecast probability (x-axis) against observed frequency (y-axis), per horizon, per region. It is the public "when we said X%, it happened ~X% of the time" picture, and it is the single most honest summary of whether a probabilistic product is calibrated. A perfectly calibrated forecast lies on the diagonal. We validate calibration specifically in the quiet / cold-start regime, which dominates the diagram (most space–time cells are quiet) — not only during active sequences. The diagram is refreshed as the live prospective record grows.
A scoring rule
- Logarithmic score — the rule CSEP's L-test is built on:
Lower is better; strictly proper. The CSEP joint log-likelihood (§4.1) is the accumulated log score over bins.
-
Brier score — for a probability
$p$ of a binary event with outcome$y \in {0,1}$ :
Strictly proper, bounded in
-
Continuous Ranked Probability Score (CRPS) — generalizes the Brier score to a full
predictive CDF
$F$ with outcome$y$ :
Useful if we report a predictive distribution over number of events rather than a single probability.
Honesty note. Brier / Log / CRPS in isolation are not sufficient for seismology peer review — they don't account for the multiplicity of bins or for clustering. They are a communication-friendly supplement to the CSEP suite, never a replacement. We lead with CSEP; we report Brier / Log / CRPS as accessible secondary numbers.
Even though CAOS_SEISMIC is explicitly not an alarm, regulators and the public think in alarm terms, so we report the Molchan diagram — the standard "probability gain vs cost of false alarms" view (Zechar & Jordan 2008).
The Molchan diagram plots miss rate
Boundary conditions:
The Area Skill Score (ASS) is the normalized area above the Molchan trajectory: 1 = perfect skill, 0 = perfect non-skill, 0.5 = expected for a random alarm function (Zechar & Jordan 2010).
ROC (hit rate
We do not use the area under the ROC curve (AUC) as a skill metric, and we say so explicitly:
- The no-event class dominates by orders of magnitude. In a daily space–time grid, the overwhelming majority of cells contain no target event. AUC is computed over all cell–window pairs, so it is dominated by trivially-correct "nothing happened" cases and inflates toward 1 for almost any non-degenerate forecast. A high AUC is therefore not evidence of useful forecasting skill.
- AUC is not a proper scoring rule and ignores calibration entirely. It measures only ranking (discrimination), so a badly mis-calibrated forecast — one that states wildly wrong probabilities — can still post a high AUC. For a product whose entire value proposition is calibrated probabilities, ranking-only metrics measure the wrong thing.
-
The field's own metric is the Area Skill Score, not AUC. The Molchan / ASS framing
reports skill relative to a random alarm baseline (0.5) on the axis that matters
(
$\tau$ ), which is the defensible alarm-style view.
So: Molchan + ASS for the alarm-style communication view; never AUC as a headline skill number.
This is where products lose credibility. Definitions follow Mizrahi et al. (2024) and CSEP convention.
| Mode | Time causality preserved? | Test data known to model at training? | Credibility |
|---|---|---|---|
| Prospective (true) | Yes — forecasts deposited before events occur | No | Gold standard. Slow (must wait years). |
| Pseudo-prospective | Yes — split preserves time order; model fit only on data before each forecast | No | Strong, accepted for back-analysis. Our primary mode. |
| Retrospective, out-of-sample (non-causal split) | No — split (e.g. random k-fold) ignores time | No (future may leak via global params) | Weak for forecasting; sanity checks only. |
| Retrospective, in-sample | No | Yes — evaluated on training data | Not evidence of skill; a fit diagnostic only. |
Pseudo-prospective is the primary back-analysis mode. True prospective is the gold standard and the live tail; retrospective out-of-sample is weak; in-sample is a fit diagnostic only.
The control is structural, not disciplinary. A back-test driver iterates issue times
flowchart LR
A["Issue time t"] --> B["Slice catalog to (-inf, t)<br/>ONLY past events"]
B --> C["Model emits forecast<br/>gridded + catalog-based<br/>for [t, t+h)"]
C --> D["SEAL forecast<br/>snapshot inputs + Mc + params + version"]
D --> E["Advance t -> t + 1 day"]
E --> A
F["Final authoritative catalog"] -.->|score later| D
Five leakage failure modes engineered against:
-
Temporal leakage. Any feature, normalization statistic, hyperparameter, or weight that
"saw" data at or after the issue time. Rule: for a forecast issued at
$t$ , every input and every fitted parameter derives only from data strictly before$t$ . (CSEP requires parameters fixed before the test period.) - Catalog-revision leakage. Using a final, relocated, magnitude-revised catalog as the input the model "had." Real-time catalogs differ from final ones. We snapshot the exact input catalog state for every daily issue so a forecast is byte-reproducible. Where reconstructing historical real-time states is infeasible, we document the optimistic bias explicitly rather than hide it. (The target catalog used for scoring is the best final catalog, with the declustering policy fixed in advance.)
-
$M_c$ / completeness leakage. Estimating$M_c$ , b-value, or smoothing bandwidths on the whole catalog including the test window.$M_c$ is estimated on the training window only, or fixed by a pre-declared, physically justified threshold (§10). -
Region / parameter snooping. Choosing the test region, magnitude threshold, or
horizons after peeking at where the model does well. Region,
$M_{\min}$ , and horizons are pre-registered before scoring. - Multiple-testing inflation. Running many regions/horizons/thresholds and reporting only the wins. We pre-register the full grid and report all cells, including failures.
Each daily forecast persists, immutably and versioned: the forecast (gridded +
catalog-based) with its issue timestamp; the exact input catalog snapshot; the
The single most common way a forecast looks artificially good or bad is an
-
$M_c$ definition: the magnitude above which (essentially) all events are detected; below it the Gutenberg–Richter frequency–magnitude distribution rolls off. -
Standard estimator — Maximum Curvature (MAXC):
$M_c$ = magnitude of the peak of the non-cumulative frequency–magnitude distribution (Wiemer & Wyss 2000). MAXC tends to underestimate$M_c$ ; a widely used correction is
This +0.2 correction is California-tuned, not universal — we re-validate it per region (with EMR / GFT and a frequency–magnitude inspection) and take the conservative (higher) value before exporting it to Chile / Japan / NZ / Italy.
- Cross-check estimators: Goodness-of-Fit Test (GFT), b-value stability (MBS), Entire-Magnitude-Range (EMR) — Woessner & Wiemer (2005).
Rules for the back-analysis:
- Estimate
$M_c$ per region, on the training window only, with MAXC+0.2 cross-checked against EMR; take the more conservative value. - Set a single uniform target
$M_{\min}$ at or above the worst (highest) regional$M_c$ across the entire study period, applied identically to the model and every baseline — so completeness never silently changes the target population. CSEP California next-day work used M≥3.95 for exactly this robustness (Serafini et al. 2025). - Use the same
$M_{\min}$ , same grid, same declustering policy for every model compared. Comparison tests (T/W) are only fair if both forecasts cover an identical space–magnitude–time domain. - Score on the non-declustered catalog. CAOS_SEISMIC deliberately forecasts clustering (aftershocks), so the target population must include aftershocks — declustering the target would throw away the very signal we forecast. We state this explicitly. (The dual-catalog rule — declustered background, full catalog for triggering — applies to model inputs, not to the scored target.)
We run the back-analysis across tectonically diverse regions to (a) prevent cherry-picking a friendly area, and (b) — as an explicit evaluation goal — measure and detect bias toward high-seismicity zones. A global model that only "works" where earthquakes are abundant is the opposite of the product's purpose, so the evaluation is deliberately designed to surface that failure if it exists.
| Region | Authoritative catalog | Why included |
|---|---|---|
| Chile / N. Chile | CSN (via EarthScope/IRIS) | Subduction megathrust; dense national network; very high seismicity |
| California | ANSS/ComCat, SCEDC/NCEDC | Direct comparison to the 25 published CSEP next-day models (Serafini et al. 2025) |
| Japan | JMA | Dense network, very low |
| New Zealand | GeoNet | Mixed subduction / crustal; established CSEP history |
| Italy (optional) | INGV | OEF-Italy operational analogue; moderate seismicity stress test |
The high- vs low-seismicity bias comparison. Inference is run across both high-seismicity countries (Chile, Japan, Indonesia, Mexico, Turkey, California, NZ) and low-seismicity countries (e.g. UK, Germany, Australia, Brazil). The point is to compare calibration and information gain across the seismicity spectrum: if the model is well calibrated only where data are abundant and silently mis-calibrated in quiet regions, the reliability diagrams and per-region CSEP scores will show it. Reporting both ends of the spectrum — including the regions where the model under-performs — is itself the antidote to the selection-bias trap CSEP exists to prevent.
Time split. A learning split of roughly 60–70% and a pseudo-prospective testing split
of 30–40%, with frozen hyperparameters across the test split. Region polygons,
horizons, thresholds, grid,
Horizons & output. Horizons 1 day / 2 days / 7 days, re-issued daily. Per issue per
horizon, two products: a GriddedForecast (expected rate per space–magnitude bin) and a
CatalogForecast (≥10,000 synthetic catalogs). The public-facing scalar is the bounded
exceedance probability
The product's scientific thesis is not "we can forecast aftershocks" — ETAS already does that, and so does anything that captures Omori–Utsu decay. The thesis is that a context-conditioned model adds skill on top of a well-fit ETAS. The measurement that proves or refutes it is therefore precise:
Headline thesis metric: the IGPE of the context-conditioned model over a well-fit ETAS, in nats, with a T-test confidence interval that excludes zero and a corroborating W-test, measured pseudo-prospectively across the multi-country panel — and reported for every region × horizon cell, including the cells where it is not positive.
Why this framing is the honest one:
- Consistency tests never establish skill. Because the scored target is mostly aftershocks, a trivial "aftershocks follow mainshocks" model passes the consistency (N/M/S/L/CL) tests. A naive skill number can be an artifact of clustering that both the model and ETAS already capture.
- The gain over ETAS is the genuine signal. Because both the context model and ETAS already reproduce Omori clustering, a positive, significant IGPE-over-ETAS is not "I predicted aftershocks" — it is "I predicted them better than the standard self-exciting model." That difference is the only thing that justifies a context-conditioned model existing.
- The publish bar is gated on this. A heavier / context-conditioned model reaches the public forecast field only if it beats ETAS in our own pseudo-prospective CSEP harness (positive IGPE, T-test CI excluding zero) and passes calibration. Otherwise ETAS — the mandatory calibrated reference and safe default — is what ships. As of writing, no pure neural point process has robustly beaten a well-fit ETAS in prospective CSEP, so this bar is real, not ceremonial.
We use pyCSEP, the community Python toolkit, so that a reviewer can dispute our model
but never our test code.
Primary citation: Savran, W. H., Bayona, J. A., Iturrieta, P., Bao, H., Bayliss, K., Herrmann, M., Schorlemmer, D., Marzocchi, W., & Werner, M. J. (2022). pyCSEP: A Python Toolkit for Earthquake Forecast Developers. Seismological Research Letters, 93(5), 2858–2870. DOI: 10.1785/0220220033; JOSS companion DOI: 10.21105/joss.03658. Docs: docs.cseptesting.org.
pyCSEP provides four module groups:
- Catalog access & processing — readers for ComCat (USGS), ISC-GEM, CSEP catalog formats; filtering by region/magnitude/time; declustering helpers.
-
Forecast representations —
GriddedForecast(space–magnitude rate grid) andCatalogForecast(ensemble of synthetic catalogs). -
Statistical tests — community-endorsed implementations:
- Grid-based:
number_test,magnitude_test,spatial_test,likelihood_test,conditional_likelihood_test,paired_t_test,w_test. - Catalog-based:
number_test,spatial_test,magnitude_test,pseudolikelihood_test,calibration_test,resampled_magnitude_test,MLL_magnitude_test.
- Grid-based:
- Visualization & utilities — consistency-test plots (with simulated confidence bars), comparison plots (IGPE with confidence intervals), spatial forecast maps, Molchan / ROC helpers.
Caveat. The documented core
pyCSEPgrid test set uses the Poisson number / likelihood tests; there is no explicit negative-binomial N-test in the core API. Our over-dispersion handling therefore comes from the catalog-based path (§6.1), which is the maintainers' intended route. UsingpyCSEPalso gives free file-format compatibility with the CSEP testing centers should we ever pursue truly prospective registration.
For each region × horizon (1d / 2d / 7d) cell, we report the following — and we report every cell, including failures. A region × horizon where the model fails to beat ETAS is published as such. Selective reporting is itself the selection-bias trap CSEP exists to prevent.
| Aspect evaluated | Grid-based test | Catalog-based test | Score symbol | Pass criterion |
|---|---|---|---|---|
| Total number / rate | N-test (Poisson, |
catalog Number test | ||
| Magnitude (GR shape) | M-test | catalog Magnitude / resampled / MLL | ||
| Spatial location | S-test | catalog Spatial test | ||
| Joint likelihood | L-test, CL-test | catalog Pseudo-likelihood | ||
| Calibration over time | — | Calibration (KS) test | KS |
not rejected |
| Skill vs smoothed-seismicity | T-test (IGPE) | (comparison via IGPE) | CI excludes 0, |
|
| Skill vs ETAS (the thesis) | T-test (IGPE) | (comparison via IGPE) | CI excludes 0, |
|
| Skill robustness | W-test | — | consistent with T-test | |
| Alarm-style communication | Molchan diagram + ASS, ROC | same |
|
ASS |
| Probability quality (public) | Brier (+reliability diagram), Log, CRPS | same | BS, LogS, CRPS | reliability ≈ diagonal |
Reporting rules: (1) report every cell for every region × horizon, including failures;
(2) show consistency-test plots with simulated confidence bars and comparison plots with IGPE
confidence intervals (pyCSEP defaults); (3) include a reliability diagram per horizon —
the most honest single picture of calibration. (4) Accompany any Poisson grid-test failure
during an active sequence with the catalog-based (over-dispersed) result, and say so
explicitly — it is an honest, defensible nuance, not an excuse.
- A CSEP calibration badge (green / amber / red) for model quality only — N-test / S-test / L/CL + reliability diagram. The traffic-light triad is reserved exclusively for model quality and never for the forecast field itself.
- A reliability diagram per horizon — the public "when we said X%, it happened ~X%" artifact, refreshed as the live record grows.
- Expected-count-vs-baseline bars and an expected-vs-observed time series in the no-map summary (accessible, no WebGL required).
- Per-region × per-horizon results, including the cells where the model does not beat ETAS — published honestly.
- A worked teaching example: the 2019 Ridgecrest case (a ~3% first-week forecast was not wrong when the ~3% outcome occurred) to teach that single outcomes neither validate nor invalidate a probabilistic forecast.
- A prominent last-run / staleness banner and coverage mask so blank never reads as "safe."
Scientific honesty also requires that the code doing the science is correct. The repository ships an automated test layer alongside the CSEP evaluation, and both are part of the credibility story.
The pipeline is covered by a unit-test suite executed with pytest. It verifies the
individual building blocks of the forecasting and evaluation stack independently of any
network data — for example:
- catalog hygiene (the load-bearing order:
$M_c(x,y,t)$ → moment-magnitude homogenization → dual-catalog declustering); - the ETAS fit (MLE), its stability gates, and its simulation routine;
- the tiled / regime-aware global forecaster (fitting ETAS per tile on halo events, routing cells to owning tiles, falling back to the smoothed null when a tile is thin or supercritical);
- the calibration and QA-gate logic;
- the compact-artifact writer.
The suite is meant to grow with the code; new functionality lands with its tests.
A deliberate test asserts import purity: importing the package must not pull heavy scientific dependencies (e.g. ObsPy, xarray, GeoPandas, the deep-learning stack) at import time. This keeps the package fast to import, keeps the static / CLI surface lightweight, and prevents accidental coupling of the lightweight code paths to heavyweight optional deps.
A first-class end-to-end smoke test is exposed through the CLI (e.g.
caos-seismic check). It runs the whole short pipeline on a small slice of real catalog
data — fetch → clean → estimate
Distinct from the test suite but part of the same correctness posture: the daily job will
not commit an anomalous or stale artifact. The QA gate checks catalog freshness, an
event-count sanity bound, the absence of a duplicate/retracted spike near
These stay in the product framing — they are part of being trustworthy:
- No deterministic prediction. None of these tests yields a time-place-magnitude prediction. We estimate conditional probabilities; the tests verify calibration and relative information gain, not "we called the quake."
-
Low base rates ⇒ slow statistics. For large
$M_{\min}$ and short horizons, target events per window are rare; consistency tests have low power and comparison tests need years of data to reach significance. We are patient and do not over-claim from a short window. - Poisson tests over-reject during clustering. Mitigated by the catalog-based tests, but it means the classic CSEP L/N grid tests must be read with care for a daily clustered model — always paired with the catalog-based result.
-
Catalog dependence. Results depend on the authoritative catalog and on the
$M_c$ choices; different agencies' catalogs can flip marginal results. We fix and document them. - Passing consistency tests is not skill. A model can pass N/M/S/L and still be useless (e.g. a smoothed climatology). Skill is established only by winning comparison tests against real baselines (smoothed-seismicity and ETAS).
- Proper scores and ASS are communication aids, not peer-review-grade headline skill. We lead with CSEP; Brier / Log / CRPS / Molchan-ASS support it, never replace it. AUC is not used as a skill metric at all.
These are the real, leakage-free numbers produced by the harness above and rendered live on the Back-analysis page (not a sample). Full provenance and the justification for every modelling change are in the repository experiment register.
Model + catalog. Regime-tiled ETAS, 195 ETAS tiles, b = 1.337, M_c = 5.35, fit on 97,230 real events (USGS ComCat, 1990–2026, M ≥ 4.5; Scordilis-2006 Mw homogenization). The N-test is not yet passed (the forecast count needs calibrating) — surfaced, not hidden.
Pseudo-prospective back-analysis (60-day window, 8 views, IGPE vs the Poisson null, in nats):
| Result | Value | Reading |
|---|---|---|
| Global whole-Earth field vs null (1 d) | +0.0835 | the global context contribution — higher than any single country (aggregated worldwide triggering) |
| High − low seismicity gap (1 d) | +0.0184 | skill concentrates in active zones — the measured bias (Japan leads at +0.072; low-seismicity interiors ≈ 0) |
| Context channel (neural) | ≈ 0 | by construction until geodetic/stress covariates are wired — consistent with the field (no NPP beats ETAS prospectively as of 2026) |
Multi-model benchmark (single-window leakage-free IGPE vs the null; every model, including ones we do not ship):
| View | obs | ETAS | Ensemble | Reasenberg–Jones |
|---|---|---|---|---|
| Global | 248 | +3.712 | +2.795 | −0.025 |
| Chile | 11 | +0.036 | −0.061 | −0.182 |
| Japan | 5 | −0.081 | −0.013 | −0.438 |
Honest finding. ETAS is the best single model; the naive equal-weight ensemble underperforms ETAS because averaging in the null and the weak Reasenberg–Jones dilutes the triggering signal that is the skill. This matches the literature (ensembles help only when score-weighted with strong members; Marzocchi-Zechar-Jordan 2012, Herrmann & Marzocchi 2023), so we do not ship the naive ensemble — the next lever is score-weighted stacking of ETAS variants. The full cited evidence base is in improvement-evidence.
Reported honestly: negative and zero results are kept. A model that does not beat its baseline is shown as such; that is the entire point of the CSEP discipline above.
- Field, E. H. (2007). Overview of the Working Group for the Development of RELM. SRL, 78(1), 7–16. 10.1785/gssrl.78.1.7
- Schorlemmer, D., Gerstenberger, M. C., Wiemer, S., Jackson, D. D., & Rhoades, D. A. (2007). Earthquake Likelihood Model Testing. SRL, 78(1), 17–29. 10.1785/gssrl.78.1.17
- Schorlemmer, D., & Gerstenberger, M. C. (2007). RELM Testing Center. SRL, 78(1), 30–36. 10.1785/gssrl.78.1.30
- Zechar, J. D., & Jordan, T. H. (2008). Testing alarm-based earthquake predictions. GJI, 172(2), 715–724. 10.1111/j.1365-246X.2007.03676.x
- Zechar, J. D., Gerstenberger, M. C., & Rhoades, D. A. (2010). Likelihood-based tests for evaluating space–rate–magnitude earthquake forecasts. BSSA, 100(3), 1184–1195. 10.1785/0120090192
- Zechar, J. D., & Jordan, T. H. (2010). The Area Skill Score Statistic for Evaluating Earthquake Predictability Experiments. PAGEOPH, 167, 893–906. 10.1007/s00024-010-0086-0
- Rhoades, D. A., Schorlemmer, D., Gerstenberger, M. C., Christophersen, A., Zechar, J. D., & Imoto, M. (2011). Efficient testing of earthquake forecasting models. Acta Geophysica, 59(4), 728–747. 10.2478/s11600-011-0013-5
- Werner, M. J., Helmstetter, A., Jackson, D. D., & Kagan, Y. Y. (2011). High-Resolution Long-Term and Short-Term Earthquake Forecasts for California. BSSA, 101(4), 1630–1648. 10.1785/0120090340
- Savran, W. H., Werner, M. J., Marzocchi, W., Rhoades, D. A., Jackson, D. D., Milner, K., Field, E., & Michael, A. (2020). Pseudoprospective Evaluation of UCERF3-ETAS Forecasts during the 2019 Ridgecrest Sequence. BSSA, 110(4), 1799–1817. 10.1785/0120200026
- Savran, W. H., Bayona, J. A., Iturrieta, P., et al. (2022). pyCSEP: A Python Toolkit for Earthquake Forecast Developers. SRL, 93(5), 2858–2870. 10.1785/0220220033; JOSS 10.21105/joss.03658
- Mizrahi, L., Dallo, I., van der Elst, N. J., et al. (2024). Developing, Testing, and Communicating Earthquake Forecasts. Reviews of Geophysics, 62. 10.1029/2023RG000823
- Serafini, F., Bayona, J. A., Silva, F., et al. (2025). A benchmark database of ten years of prospective next-day earthquake forecasts in California from CSEP. Scientific Data, 12, 1501. 10.1038/s41597-025-05766-3
- Gneiting, T., & Raftery, A. E. (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. JASA, 102(477), 359–378. 10.1198/016214506000001437
- Kagan, Y. Y. (2017). Earthquake number forecasts testing. GJI, 211(1), 335–345. 10.1093/gji/ggx300
- Wiemer, S., & Wyss, M. (2000). Minimum magnitude of completeness in earthquake catalogs. BSSA, 90(4), 859–869. 10.1785/0119990114
- Woessner, J., & Wiemer, S. (2005). Assessing the quality of earthquake catalogues: Estimating the magnitude of completeness and its uncertainty. BSSA, 95(2), 684–698. 10.1785/0120040007
- Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1), 1–3. 10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2
- Murphy, A. H. (1973). A new vector partition of the probability score. Journal of Applied Meteorology, 12(4), 595–600. 10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2
- pyCSEP documentation — Theory of CSEP Tests: docs.cseptesting.org/getting_started/theory.html
Public technical documentation. Calibration and information gain are what these tests verify; deterministic prediction is impossible and is never claimed.
⚠️ Disclaimer — read this. CAOS_SEISMIC produces probabilistic forecasts, not predictions. It is an independent research and education tool. It is NOT an official earthquake early-warning or civil-protection system, it does NOT predict when, where, or how large an earthquake will be, and it must NOT be used for life-safety, emergency, or evacuation decisions. Every number it publishes is a bounded, calibrated probability conditioned on the present state of seismicity — never an alarm, a countdown, or a "safe" state. A single outcome neither confirms nor refutes a probabilistic forecast.It complements, and does not replace or speak for, official agencies — always follow your national seismological and civil-protection authorities (e.g. USGS, INGV, CSN (Chile, SENAPRED for civil protection), GeoNet, JMA). The software is provided "as is", without warranty of any kind (MIT License); the authors accept no liability for its use. Data are courtesy of their providers (USGS/ANSS, ISC/ISC-GEM, Global CMT, EMSC, CSN, and others) under their respective licenses and attribution terms. See Honest-Limits for the full epistemic context.
CAOS_SEISMIC · seismic.fasl-work.com · source · MIT
Conditional probabilistic seismic forecasting — forecasts, never predictions.
Overview
Methodology & History
Classical models
- Models-Classical · index
- Gutenberg-Richter-Law
- Omori-Utsu-Law
- ETAS-Model
- Reasenberg-Jones-Model
- STEP-Model
- EEPAS-Model
- Smoothed-Seismicity
- Brownian-Passage-Time
- Rate-and-State-and-Coulomb
ML & analytical methods
- Models-ML · index
- Temporal-Point-Processes
- RMTPP
- Neural-Hawkes-Process
- Transformer-Hawkes-Process
- RECAST-and-FERN
- CNN-Spatial-Models
- Graph-and-Recurrent-Networks
- Detection-vs-Forecasting
Models employed
Data
Architecture
Evaluation
Progress
Reference