CNN Spatial Models

CNN Spatial Models and the Cautionary Tale — DeVries 2018 vs. Mignan & Broccardo 2019

The single most important evaluation lesson in machine-learning-for-seismicity, told in full. A 2018 Nature paper used a deep neural network to forecast the spatial pattern of aftershocks and reported it crushed the classical baseline. A 2019 Nature rebuttal matched the deep net with a two-parameter logistic regression — "one neuron." The deep net's apparent advantage was an artifact of architecture and evaluation, not learned physics. This page works through the original claim, the refutation, the reply, and — most importantly — why AUC is the wrong metric for forecasting and what discipline this product bakes in as a result.

Why this page exists. Every design rule on the Models — Employed page — "AUC is banned as a primary metric," "always include a trivially simple baseline," "a CNN is a spatial-context encoder, never a standalone classifier" — descends directly from this episode. If you read only one page about how ML-for-earthquakes goes wrong, read this one.

The setup: spatial aftershock forecasting
The original claim — DeVries et al. 2018
The refutation — Mignan & Broccardo 2019 ("one neuron")
The reply — DeVries et al. 2019
Why AUC is the wrong metric for forecasting
Effective sample size: 131,000 rows ≠ 131,000 samples
The five evaluation lessons (baked into this product)
How this product uses CNNs safely
References

1. The setup: spatial aftershock forecasting

After a large mainshock, aftershocks cluster in space in a pattern related to the static stress change the mainshock imposed on the surrounding crust. The classical hypothesis is the Coulomb failure stress change,

$$\Delta \mathrm{CFS} = \Delta\tau + \mu'\,\Delta\sigma_n,$$

where $\Delta\tau$ is the shear-stress change on a chosen fault plane, $\Delta\sigma_n$ the normal-stress change, and $\mu'$ an effective friction coefficient. Where $\Delta\mathrm{CFS} > 0$, faults are pushed toward failure; where it is negative, they are relaxed. The forecasting question is: given the mainshock's computed stress field, which grid cells will light up with aftershocks?
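As a small illustration of the formula above, the sketch below evaluates $\Delta\mathrm{CFS}$ for a few grid cells. The function name, the stress values, and the default $\mu' = 0.4$ are illustrative assumptions rather than values from the papers discussed here, and the sign convention (positive $\Delta\sigma_n$ means unclamping) is assumed to match the plus sign in the formula.

```python
def coulomb_stress_change(d_tau, d_sigma_n, mu_eff=0.4):
    """Return dCFS = d_tau + mu_eff * d_sigma_n, in the units of the inputs.

    mu_eff = 0.4 is an illustrative default, not a value taken from the source.
    Positive d_sigma_n is assumed to mean unclamping.
    """
    return d_tau + mu_eff * d_sigma_n


# Hypothetical (shear, normal) stress changes in MPa at three grid cells.
cells = [(0.10, 0.05), (-0.08, 0.02), (0.02, -0.20)]

for d_tau, d_sigma_n in cells:
    dcfs = coulomb_stress_change(d_tau, d_sigma_n)
    state = "pushed toward failure" if dcfs > 0 else "relaxed"
    print(f"d_tau={d_tau:+.2f}  d_sigma_n={d_sigma_n:+.2f}  dCFS={dcfs:+.3f}  {state}")
```

The third cell shows why the normal-stress term matters: a small positive shear change is outweighed by clamping, so the cell ends up relaxed.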

This is a tempting target for deep learning — it is a spatial pattern-recognition problem with rich physical inputs (the full stress tensor) and a large apparent dataset (every grid cell around every mainshock is a labelled example). That temptation is exactly the trap.

2. The original claim — DeVries et al. 2018

DeVries, Viegas, Wattenberg & Meade (2018), Deep learning of aftershock patterns following large earthquakes, Nature 560:632–634 (doi:10.1038/s41586-018-0438-y).

Data: ~131,000 aftershocks drawn from ~199 mainshock–aftershock sequences, gridded around each mainshock.
Inputs: 12 features derived from the mainshock's coseismic stress-change tensor (the six independent components and related quantities) at each grid cell.
Model: a deep neural network — 6 hidden layers, 50 nodes per layer — predicting, per cell, the binary label "did an aftershock occur here, yes/no."
Headline result: the deep net achieved AUC = 0.85, versus AUC = 0.58 for the classical Coulomb-failure-stress criterion. The authors further suggested the network had discovered new physical triggering quantities — pointing to combinations like the second invariant of the deviatoric stress tensor (related to von Mises stress / maximum shear) as candidate controls the net seemed to favour over $\Delta\mathrm{CFS}$.

Framed this way, the result was striking: deep learning not only beat the textbook physics, it seemed to reveal better physics. It was a high-profile, Nature-cover-grade claim.

3. The refutation — Mignan & Broccardo 2019 ("one neuron")

Mignan & Broccardo (2019), One neuron versus deep learning in aftershock prediction, Nature 575:E1–E3 (doi:10.1038/s41586-019-1582-8; preprint arXiv:1904.01983; code at github.com/amignan/pred_EQ_aftershockXYZ). The original preprint title was even blunter: "One neuron is more informative than a deep neural network for aftershock pattern forecasting."

The core demonstration: a logistic regression with two free parameters — a single weight $w$ and a bias $b$, i.e. literally one artificial neuron —

$$p(x) = \sigma(w\,x + b) = \frac{1}{1 + e^{-(w x + b)}},$$

driven by essentially one physically simple feature (the sum of the absolute values of the independent stress-change tensor components), matches or exceeds the deep net's AUC. Two parameters matched roughly 13,451.
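To make the size of this model concrete, here is a minimal sketch of the one-neuron forecast: one feature, one weight, one bias. The function names and the values of `w`, `b`, and the stress components are hypothetical and are not fitted to any data; any feature transformation or fitting procedure used in the published code is not reproduced here.

```python
import math


def abs_stress_sum(stress6):
    """Sum of absolute values of the six independent stress-change components."""
    return sum(abs(c) for c in stress6)


def one_neuron(x, w, b):
    """Logistic regression with a single weight and bias: sigma(w * x + b)."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


# Hypothetical parameters and stress components (illustrative, not fitted).
w, b = 2.0, -1.0
stress6 = [0.1, -0.2, 0.05, 0.0, -0.1, 0.05]

x = abs_stress_sum(stress6)
p = one_neuron(x, w, b)
print(f"feature x = {x:.3f}, aftershock probability p = {p:.3f}")
```

The whole model is the two numbers `w` and `b`, which is what makes it a useful floor for any larger architecture to beat.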

The critique, point by point:

Massive over-parameterization. The DNN had roughly 13,451 free parameters for a problem with only 12 inputs and — critically — only ~199 effectively independent mainshocks. A model with thousands of parameters fit to a couple hundred independent units is over-parameterized by orders of magnitude; its apparent skill is overfitting that the evaluation setup failed to expose.
Inflated apparent sample size. Framing the task as a "computer-vision-like" per-cell classification turned ~199 sequences into ~131,000 labelled cells. But cells around a single mainshock are spatially correlated, so they are not independent samples. Treating them as independent inflated the effective $N$ enormously and undermined any uncertainty or significance attached to the reported AUC (see §5–§6).
No physical justification. DeVries et al. offered no mechanistic reason for the specific absolute-value stress quantities the network favoured; the "new physics" reading was not grounded — and a two-parameter model with a single transparent feature did just as well.
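The parameter count quoted in the critique can be checked by arithmetic. The sketch below assumes a plain fully connected network with a bias on every unit, 12 inputs, six hidden layers of 50 nodes, and a single output unit; the helper name is hypothetical.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a fully connected network with these layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


deep_net = [12] + [50] * 6 + [1]   # 12 inputs, 6 hidden layers of 50, 1 output
one_neuron = [1, 1]                # 1 feature, 1 output

n_deep = dense_param_count(deep_net)
n_one = dense_param_count(one_neuron)
n_sequences = 199                  # approximate number of independent sequences

print(n_deep)                              # 13451
print(n_one)                               # 2
print(round(n_deep / n_sequences, 1))      # parameters per independent sequence
```

Under these assumptions the count reproduces the 13,451 figure, which works out to dozens of free parameters per independent mainshock sequence.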

The deeper message: the deep net learned nothing a single neuron couldn't, so whatever generalizable signal exists in this problem is low-dimensional. The DNN's extra capacity bought overfitting, not physics.

flowchart LR
    subgraph DNN["DeVries 2018 — deep net"]
        D1["12 stress features<br/>per grid cell"] --> D2["6 layers x 50 nodes<br/>~13,451 parameters"]
        D2 --> D3["AUC = 0.85"]
    end
    subgraph ONE["Mignan & Broccardo 2019 — one neuron"]
        M1["1 feature:<br/>sum |stress components|"] --> M2["logistic regression<br/>2 parameters (w, b)"]
        M2 --> M3["AUC >= 0.85 (ties or beats)"]
    end
    D3 --> V["Verdict:<br/>the deep net learned nothing<br/>a single neuron couldn't"]
    M3 --> V

4. The reply — DeVries et al. 2019

DeVries, Viegas, Wattenberg & Meade (2019), Reply to: One neuron versus deep learning in aftershock prediction, Nature 575:E4–E5 (doi:10.1038/s41586-019-1583-7).

The authors acknowledged that the simpler model performs comparably, while maintaining that the deep net remained a useful exploratory tool for surfacing candidate physical quantities (the von Mises / second-invariant directions worth investigating). In other words: as a hypothesis-generator for physics, the net may have value; as a forecasting model that beats the baseline, the claim did not survive. That distinction — exploratory tool vs. shippable forecaster — is the one this product enforces operationally.

5. Why AUC is the wrong metric for forecasting

This is the load-bearing technical lesson, and it is worth stating precisely.

What AUC measures. The Area Under the ROC Curve is the probability that a randomly chosen positive case is scored higher than a randomly chosen negative case:

$$\mathrm{AUC} = \Pr\big(\,\hat s(x_+) > \hat s(x_-)\,\big).$$

It is a ranking statistic. It asks only "are the cells with aftershocks ranked above the cells without?" It says nothing about whether the published numbers are right.
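The definition above translates directly into a pairwise count. The sketch below is a brute-force illustration with hypothetical scores and labels; counting ties as one half is a common convention that the formula above does not spell out.

```python
def auc_pairwise(scores, labels):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as one half (a common convention, assumed here).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
               for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical per-cell scores; label 1 means an aftershock occurred in the cell.
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]

print(auc_pairwise(scores, labels))  # 8 of the 9 pairs are ordered correctly
```

Only the ordering of the scores enters the computation; their numerical values never do.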

Why that is fatal for a forecast. A probabilistic forecast publishes rates and probabilities — "5% chance of an event in this cell tomorrow." AUC has two properties that make it blind to exactly what matters:

Invariance to any monotone rescaling of the scores. If you take a perfectly ranked forecast of small probabilities and multiply every one by ten (or pass the scores through any strictly increasing function), the AUC is unchanged, yet the forecast is now grossly miscalibrated. AUC therefore cannot detect miscalibration of the rates a forecast publishes. A model can have AUC 0.85 and still claim "50%" for cells that fire 5% of the time.
Degeneration into a region classifier on rare, gridded tasks. When positives are rare and spatially clustered (aftershocks near a fault), a high AUC is largely achieved by learning which broad region is seismically active — it measures between-region rate differences, not skill at the forecasting margin that decides a daily probability. You can score well by knowing "near the fault > far from the fault" while being useless at the actual decision.
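The rescaling invariance can be demonstrated on a toy grid. In the sketch below, the probabilities and labels are hypothetical; the forecast is inflated tenfold, the AUC does not move, and the Brier score (a proper scoring rule, lower is better) gets much worse.

```python
def auc(scores, labels):
    """Pairwise-ranking AUC; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
               for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))


def brier(probs, labels):
    """Mean squared difference between forecast probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


# Hypothetical 20-cell grid with one aftershock cell and low forecast rates.
labels = [0, 1] + [0] * 18
probs = [0.09, 0.08, 0.07, 0.06] + [0.04] * 16
inflated = [10 * p for p in probs]   # same ranking, ten times the probability

print("AUC   original vs inflated:", auc(probs, labels), auc(inflated, labels))
print("Brier original vs inflated:", brier(probs, labels), brier(inflated, labels))
```

A ranking metric treats the two forecasts as identical, while a reader of the published probabilities would be told "80%" for a cell class that fires about one time in twenty.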

The right metrics instead. Forecasting demands proper scoring rules (logarithmic score, Brier score, CRPS) and the CSEP consistency and comparison tests (N/M/S/CL tests, Information-Gain-Per-Earthquake against real baselines), all computed on the generative forecast that retains the point-process survival term. A strictly proper scoring rule is one whose expected value is optimized only by reporting the true probabilities, which is precisely the property AUC lacks. See Evaluation for the full battery.
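The propriety property can be seen numerically for a single cell. The sketch below assumes a true event probability of 5% and scans reported probabilities on a coarse grid; both scores are written in negatively oriented form (lower is better), and the function names are illustrative.

```python
import math


def expected_log_score(p_true, q):
    """Expected negative log-likelihood when the truth is p_true and q is reported."""
    return -(p_true * math.log(q) + (1.0 - p_true) * math.log(1.0 - q))


def expected_brier(p_true, q):
    """Expected squared error when the truth is p_true and q is reported."""
    return p_true * (1.0 - q) ** 2 + (1.0 - p_true) * q ** 2


p_true = 0.05                              # assumed true probability for one cell
grid = [k / 100 for k in range(1, 100)]    # candidate reported probabilities

best_log = min(grid, key=lambda q: expected_log_score(p_true, q))
best_brier = min(grid, key=lambda q: expected_brier(p_true, q))

print("best report under log score:  ", best_log)
print("best report under Brier score:", best_brier)
```

Under both rules the forecaster does best, in expectation, by reporting the true 5%; inflating or deflating the number is penalized.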

Product rule (verbatim). AUC and accuracy are banned as primary forecasting metrics in this product. Skill is established only by winning CSEP comparison tests against a real ETAS baseline, with a calibrated reliability diagram as a release blocker.

6. Effective sample size: 131,000 rows ≠ 131,000 samples

The over-parameterization critique only bites because the effective sample size was far smaller than the row count. This is a general, recurring trap, so it deserves its own statement.

Aftershock cells from a single mainshock are strongly correlated — physically (a contiguous stress lobe) and statistically (neighbouring cells share the same mainshock, the same catalog completeness, the same processing). A useful heuristic: the effective number of independent units is closer to the number of independent sequences ($\sim 199$) than to the number of cells ($\sim 131{,}000$).

Two consequences follow:

Variance of any skill estimate computed as if cells were independent is badly underestimated — confidence intervals are far too tight, so a difference that looks "significant" may be noise across just a couple hundred sequences.
The overfitting threshold is set by effective $N$, not row count. With $\sim 13{,}451$ parameters and $\sim 199$ effective units, parameters $\gg$ effective samples, and overfitting must be the null hypothesis until disproven on held-out, forward-in-time data.

The general rule the product adopts: count effective sample size, not rows. Spatially or temporally correlated observations from a few sequences are not independent draws, and every metric (AUC, log-likelihood, IGPE) computed under a false-independence assumption is misleading.
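One rough way to put numbers on this is the Kish design-effect approximation from survey statistics, $n_\text{eff} = n / (1 + (m - 1)\rho)$, where $m$ is the mean cluster size and $\rho$ the intra-cluster correlation. This formula is not part of the papers discussed here, it assumes equal-sized clusters, and the $\rho$ values below are purely hypothetical.

```python
def effective_sample_size(n_rows, n_clusters, icc):
    """Kish design-effect approximation: n / (1 + (m - 1) * icc).

    m is the mean cluster size; icc is the intra-cluster correlation in [0, 1].
    Equal-sized clusters are assumed.
    """
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must lie in [0, 1]")
    m = n_rows / n_clusters
    return n_rows / (1.0 + (m - 1.0) * icc)


n_rows, n_sequences = 131_000, 199   # approximate counts quoted in the text

# Hypothetical intra-sequence correlations, for illustration only.
for icc in (0.0, 0.01, 0.1, 0.5, 1.0):
    n_eff = effective_sample_size(n_rows, n_sequences, icc)
    print(f"icc={icc:<5} effective sample size ~ {n_eff:,.0f}")
```

With no correlation the effective size equals the row count, and with perfect correlation it collapses to the number of sequences; even modest correlation moves it most of the way toward the latter.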

7. The five evaluation lessons (baked into this product)

Always include a trivially simple baseline — logistic regression, a single physical feature, or ETAS. If a two-parameter model ties your deep net, the deep net learned nothing generalizable. (This product runs a smoothed-seismicity Poisson null and ETAS as mandatory baselines.)
Count effective sample size, not row count. Correlated cells from a few sequences are not independent samples; metrics computed as if they were are misleading (§6).
Parameters $\gg$ effective samples $\Rightarrow$ assume overfitting until proven otherwise on held-out time.
AUC / classification is the wrong metric for forecasting. It is invariant to monotone rescaling (hence blind to miscalibration) and degenerates into a region classifier on rare gridded tasks. Use proper scoring + CSEP consistency tests on the generative forecast (§5).
Retrospective $\neq$ prospective. Nothing counts until it runs forward in time on data the model never touched at training. (This product's daily forecast clock makes temporal leakage structurally impossible — see Pipeline and Models — Employed §9.)
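Lessons 2 and 5 together imply a particular way of splitting data: whole sequences go to one side, and the test side lies strictly later in time. The sketch below is a minimal illustration with hypothetical field names (`sequence_id`, `mainshock_time`) and made-up rows. It is not this product's pipeline, and it does not address subtler leakage such as aftershock windows that extend past the cutoff.

```python
def forward_split(rows, cutoff):
    """Split per-cell rows by mainshock time so no sequence straddles the cutoff."""
    train = [r for r in rows if r["mainshock_time"] < cutoff]
    test = [r for r in rows if r["mainshock_time"] >= cutoff]
    shared = {r["sequence_id"] for r in train} & {r["sequence_id"] for r in test}
    if shared:
        # Only possible if one sequence carries inconsistent mainshock times.
        raise ValueError(f"sequences on both sides of the cutoff: {sorted(shared)}")
    return train, test


# Hypothetical rows: several grid cells per mainshock sequence.
rows = [
    {"sequence_id": "A", "mainshock_time": 1999.6, "cell": 0},
    {"sequence_id": "A", "mainshock_time": 1999.6, "cell": 1},
    {"sequence_id": "B", "mainshock_time": 2010.2, "cell": 0},
    {"sequence_id": "C", "mainshock_time": 2016.3, "cell": 0},
    {"sequence_id": "C", "mainshock_time": 2016.3, "cell": 1},
]

train, test = forward_split(rows, cutoff=2012.0)
print("train sequences:", sorted({r["sequence_id"] for r in train}))
print("test sequences: ", sorted({r["sequence_id"] for r in test}))
```

A random per-cell split would instead scatter cells of the same sequence across both sides, which is exactly the correlated-sample leak described in §6.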

This is the deep-learning analogue of the selection-bias trap that discredited Accelerating Moment Release — a different method, the same failure of evaluation discipline (see Models — Classical).

8. How this product uses CNNs safely

CNNs are not banned here — they are demoted. The lesson of DeVries is not "never use a CNN," it is "never let a CNN be a standalone per-cell classifier scored by AUC." The safe pattern:

A CNN is used only as a spatial-context encoder, never as a standalone classifier.

Concretely, a CNN ingests gridded spatial fields — Slab2 subduction geometry, distance-to-fault, GNSS strain rate, smoothed background density — and encodes them into a context vector $\mathbf{c}_i$ that conditions a point-process intensity. The conditional-intensity output retains the survival (compensator) term end-to-end, so the published number stays a proper, calibratable probability — exactly what a per-cell AUC classifier throws away.

flowchart LR
    GRID["Gridded spatial context<br/>Slab2 · faults · GNSS strain · mu(x,y)"] --> CNN["CNN spatial-context encoder"]
    CNN --> CVEC["context vector c_i"]
    HIST["Event history H_t<br/>(t_i, x_i, y_i, m_i)"] --> ENC["Triggering encoder<br/>(Hawkes skeleton)"]
    CVEC --> ENC
    ENC --> LAM["Conditional intensity<br/>lambda(t,x,y | H_t)"]
    LAM --> LL["Point-process log-likelihood<br/>(survival term retained)"]
    LL --> PROB["Proper, calibratable probability<br/>(CSEP-testable, AUC banned)"]

The difference is total: a DeVries-style CNN outputs an un-calibratable ranking scored by a metric blind to miscalibration; a context-encoder CNN feeds a generative point process scored by CSEP. The gated neural challenger only ever uses the second pattern, and even then must beat ETAS prospectively before it reaches the public map.
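To show what "survival term retained" means in practice, here is a minimal sketch of a point-process log-likelihood for a purely temporal Hawkes process with an exponential kernel. It is a simplification for illustration, not this product's model: the spatial dimensions are dropped, the parameter names are hypothetical, and the constant `mu` stands in for the background rate that a context vector might modulate.

```python
import math


def hawkes_loglik(times, t_end, mu, alpha, beta):
    """Log-likelihood on [0, t_end] of a temporal Hawkes process.

    Intensity: lambda(t) = mu + sum_j alpha * beta * exp(-beta * (t - t_j)), t_j < t.
    `times` must be sorted ascending and lie within [0, t_end].
    """
    log_sum = 0.0
    for i, t in enumerate(times):
        triggered = sum(alpha * beta * math.exp(-beta * (t - s)) for s in times[:i])
        log_sum += math.log(mu + triggered)
    # Survival (compensator) term: the integral of lambda over [0, t_end].
    compensator = mu * t_end + sum(
        alpha * (1.0 - math.exp(-beta * (t_end - s))) for s in times
    )
    return log_sum - compensator


# Hypothetical event times (days) and parameters, for illustration only.
times = [1.0, 2.0, 4.0]
ll = hawkes_loglik(times, t_end=10.0, mu=0.2, alpha=0.5, beta=1.0)
print(f"log-likelihood = {ll:.4f}")
```

The compensator penalizes intensity wherever and whenever nothing happened. Dropping it would reward a model for predicting high rates everywhere, which is the same failure a ranking-only metric permits.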

References

DeVries, P.M.R., Viegas, F., Wattenberg, M. & Meade, B.J. (2018). Deep learning of aftershock patterns following large earthquakes. Nature 560, 632–634. doi:10.1038/s41586-018-0438-y
Mignan, A. & Broccardo, M. (2019). One neuron versus deep learning in aftershock prediction. Nature 575, E1–E3. doi:10.1038/s41586-019-1582-8 · preprint arXiv:1904.01983 · code: github.com/amignan/pred_EQ_aftershockXYZ
DeVries, P.M.R., Viegas, F., Wattenberg, M. & Meade, B.J. (2019). Reply to: One neuron versus deep learning in aftershock prediction. Nature 575, E4–E5. doi:10.1038/s41586-019-1583-7
King, G.C.P., Stein, R.S. & Lin, J. (1994). Static stress changes and the triggering of earthquakes. Bulletin of the Seismological Society of America 84(3), 935–953. doi:10.1785/BSSA0840030935
Stein, R.S. (1999). The role of stress transfer in earthquake occurrence. Nature 402, 605–609. doi:10.1038/45144
Bradley, A.P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30(7), 1145–1159. doi:10.1016/S0031-3203(96)00142-2
Gneiting, T. & Raftery, A.E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102(477), 359–378. doi:10.1198/016214506000001437
Zechar, J.D., Gerstenberger, M.C. & Rhoades, D.A. (2010). Likelihood-based tests for evaluating space–rate–magnitude earthquake forecasts. Bulletin of the Seismological Society of America 100(3), 1184–1195. doi:10.1785/0120090192
Mizrahi, L., Nandan, S., van der Elst, N. & Wiemer, S. (2024). Question-driven ensembles of flexible ETAS models / leakage taxonomy. Reviews of Geophysics 62. doi:10.1029/2023RG000823
CSEP / pyCSEP — Collaboratory for the Study of Earthquake Predictability. https://cseptesting.org · https://github.com/SCECcode/pycsep

See also: Models — Classical · Models — ML · Models — Employed · RECAST and FERN · Graph & Recurrent Networks · Detection vs. Forecasting · Evaluation · Honest Limits.

⚠️ Disclaimer — read this. CAOS_SEISMIC produces probabilistic forecasts, not predictions. It is an independent research and education tool. It is NOT an official earthquake early-warning or civil-protection system, it does NOT predict when, where, or how large an earthquake will be, and it must NOT be used for life-safety, emergency, or evacuation decisions. Every number it publishes is a bounded, calibrated probability conditioned on the present state of seismicity — never an alarm, a countdown, or a "safe" state. A single outcome neither confirms nor refutes a probabilistic forecast.

It complements, and does not replace or speak for, official agencies — always follow your national seismological and civil-protection authorities (e.g. USGS, INGV, CSN (Chile, SENAPRED for civil protection), GeoNet, JMA). The software is provided "as is", without warranty of any kind (MIT License); the authors accept no liability for its use. Data are courtesy of their providers (USGS/ANSS, ISC/ISC-GEM, Global CMT, EMSC, CSN, and others) under their respective licenses and attribution terms. See Honest-Limits for the full epistemic context.

CAOS_SEISMIC · seismic.fasl-work.com · source · MIT
