Graph and Recurrent Networks

Felipe Santibañez-Leal edited this page Jun 17, 2026 · 1 revision

Graph and Recurrent Networks for Seismicity — GNNs, RNNs, and LSTMs

Two large families of neural networks are routinely applied to seismicity: graph neural networks (GNNs), which exploit the natural graph of a seismic-station network, and recurrent networks (RNNs / LSTMs / GRUs), which process event or rate sequences in time. This page treats both in depth: the intuition and governing equations, what they are genuinely good at, the recurring traps when they are pointed at forecasting rather than characterization, and the honest verdict on why — for a calibrated, testable forecasting product — they sit upstream or behind a gate, not at the core.

The framing. GNNs and RNNs are real, useful tools. But their established wins are overwhelmingly on the detection / characterization side (where the answer is about events that already happened), not the forecasting side (where the answer is a calibrated probability of events that have not). Keeping that line explicit is the whole point — see Detection vs. Forecasting.


Table of contents

  1. Two graphs, two sequence problems — orienting the families
  2. Graph neural networks — intuition and equations
  3. GNNs on seismicity — where they win, where they don't
  4. Recurrent networks — RNN, LSTM, GRU
  5. RNN/LSTM seismicity-rate regression — the failure modes
  6. The right recurrent design: a recurrent point process
  7. Honest verdict and role in this product
  8. References

1. Two graphs, two sequence problems — orienting the families

It helps to fix what the network is a function of before judging it:

  • A GNN operates on a graph $G = (V, E)$. In seismology the most productive choice is $V$ = seismic stations, $E$ = station adjacency — the network is the station array. Here the GNN fuses multi-station waveforms to detect, associate, and locate events. A different, harder choice is $V$ = spatial cells or faults, used to forecast — and this is where results weaken.
  • An RNN / LSTM / GRU operates on a sequence. In seismology that sequence is either a waveform (samples in time, for detection) or an event/rate series (a catalog or binned counts, for forecasting). Again: the waveform/detection use is strong; the rate-regression forecasting use is where the traps live.

The recurring pattern across both families: a graph over the station network, or a sequence over a waveform, is a genuine strength, but it serves detection rather than forecasting. A graph or sequence defined over future seismicity is where forecasting is actually at stake, and that is where both families are weak.

flowchart TD
    subgraph Strong["Strong, mature — DETECTION / CHARACTERIZATION"]
        G1["GNN over station network<br/>(V = stations)"] --> S1["association · location · source params"]
        R1["RNN/LSTM over waveform<br/>(samples in time)"] --> S2["phase picking · detection"]
    end
    subgraph Weak["Weak / unproven — FORECASTING"]
        G2["GNN over spatial cells / faults<br/>(V = cells)"] --> W1["magnitude / occurrence forecast<br/>'remains weak for all models'"]
        R2["LSTM over binned rate series"] --> W2["next-bin rate regression<br/>un-calibratable"]
    end

2. Graph neural networks — intuition and equations

A GNN learns representations of nodes by passing messages along edges. Each node aggregates features from its neighbours, transforms the result, and repeats over several layers, so that after $k$ layers a node "sees" its $k$-hop neighbourhood. The canonical message-passing update for node $v$ at layer $\ell$ is

$$h_v^{(\ell+1)} = \phi\!\Big( h_v^{(\ell)},\; \bigoplus_{u \in \mathcal{N}(v)} \psi\big(h_v^{(\ell)}, h_u^{(\ell)}, e_{uv}\big) \Big),$$

where $\mathcal{N}(v)$ is the neighbour set, $\psi$ is a learnable message function, $\bigoplus$ a permutation-invariant aggregator (sum / mean / max), $\phi$ a learnable update, and $e_{uv}$ optional edge features. The graph convolutional special case (Kipf & Welling 2017) is

$$H^{(\ell+1)} = \sigma\!\Big( \tilde{D}^{-1/2}\, \tilde{A}\, \tilde{D}^{-1/2}\, H^{(\ell)}\, W^{(\ell)} \Big),$$

with $\tilde{A} = A + I$ the adjacency-plus-self-loops, $\tilde{D}$ its degree matrix, $H^{(\ell)}$ the node-feature matrix, $W^{(\ell)}$ the learnable weights, and $\sigma$ a nonlinearity. The $\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ factor is a symmetric normalization of the neighbour average.
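As a rough illustration of the graph-convolution formula in this section, the sketch below builds the symmetric normalization by hand and applies one layer with a ReLU nonlinearity. The three-node path graph, the single input feature, and the identity weight are illustrative assumptions, and `matmul` and `gcn_layer` are hypothetical helper names, not part of any library.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def gcn_layer(adj, feats, weights):
    """One GCN layer: ReLU(D~^{-1/2} A~ D~^{-1/2} H W)."""
    n = len(adj)
    # A~ = A + I adds a self-loop to every node.
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    # D~ is the degree matrix of A~ (row sums).
    deg = [sum(row) for row in a_tilde]
    # Symmetric normalization: entry (i, j) is divided by sqrt(d_i * d_j).
    a_hat = [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
             for i in range(n)]
    pre_activation = matmul(matmul(a_hat, feats), weights)
    return [[max(0.0, x) for x in row] for row in pre_activation]

# Assumed toy graph: three nodes in a line, 0 - 1 - 2.
adj = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 1.0],
       [0.0, 1.0, 0.0]]
feats = [[1.0], [0.0], [0.0]]   # only node 0 carries a feature
weights = [[1.0]]               # identity weight, for readability

out = gcn_layer(adj, feats, weights)
print(out)
```

With these inputs node 0 keeps half of its own feature, node 1 receives a share scaled by $1/\sqrt{6}$, and node 2 (two hops away) receives nothing after a single layer.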

Why this fits a station array. A seismic network is literally a graph: stations are nodes, geographic/operational proximity defines edges, and an event is observed at many stations at once. A GNN respects that the same wavefront hits the array — it can fuse the multi-station picture in a way a single-station model cannot. Spatio-temporal GNNs add a temporal module (a recurrence or temporal convolution) on top, giving message passing in space and propagation in time.
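The claim that $k$ layers expose a node to its $k$-hop neighbourhood can be checked on a toy array. The sketch below is a minimal instance of the message-passing update under loudly simplified assumptions: scalar node features, a message function that simply forwards the neighbour's feature, a sum aggregator, an additive update, and no edge features. The four "stations" in a line and the function name `message_passing_step` are hypothetical.

```python
def message_passing_step(h, neighbours):
    """One round: each node adds the sum of its neighbours' features."""
    new_h = {}
    for v, h_v in h.items():
        # Sum is permutation-invariant: neighbour order does not matter.
        aggregated = sum(h[u] for u in neighbours[v])
        # Assumed update rule: phi(h_v, m) = h_v + m.
        new_h[v] = h_v + aggregated
    return new_h

# Assumed toy array: stations 0 - 1 - 2 - 3 in a line.
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

# Only station 3 carries a signal initially.
h0 = {0: 0.0, 1: 0.0, 2: 0.0, 3: 1.0}

h1 = message_passing_step(h0, neighbours)
h2 = message_passing_step(h1, neighbours)
h3 = message_passing_step(h2, neighbours)

for name, state in (("h1", h1), ("h2", h2), ("h3", h3)):
    print(name, state)
```

Station 0 is three hops from station 3, so its feature stays at zero after one and two rounds and becomes non-zero only after the third.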


3. GNNs on seismicity — where they win, where they don't

Where GNNs are genuinely strong. When the graph is the station network, GNNs are a natural and effective fit for:

  • Phase association — deciding which picks across many stations belong to the same event.
  • Earthquake location — jointly using multi-station arrivals on the array graph.
  • Source characterization — fusing station observations to estimate source parameters.

This is active and credible 2024–2025 research (e.g. spatio-temporal graph convolutional networks for source characterization; graph-based association). It is detection-side work: it concerns events that have already occurred.

Where GNNs underwhelm — forecasting. When the graph is re-purposed as spatial cells or faults to forecast future magnitude or occurrence, the literature converges on a blunt finding:

Depth and magnitude prediction "remain weak for all tested models."

The reasons are structural, not incidental:

  • Forecasting future magnitude is fighting Gutenberg–Richter: given that an event occurs, its magnitude is approximately memoryless, so there is little learnable signal in "what size is next" (see Honest Limits).
  • A cell-graph GNN that outputs a per-cell occurrence label inherits the DeVries trap — correlated cells, classification metrics, no survival term (see CNN Spatial Models).
  • No GNN forecaster has passed prospective CSEP testing against ETAS.

The honest summary: GNNs help where the graph is the station network (detection-side), not where you need a calibrated future rate. A 2025 hybrid spatio-temporal GNN line of work (Frontiers in AI) is interesting research, not a shipping forecaster.


4. Recurrent networks — RNN, LSTM, GRU

A recurrent network maintains a hidden state $h_t$ that it updates as it consumes a sequence, giving it memory of the past. The vanilla RNN update is

$$h_t = \tanh\!\big(W_{hh}\, h_{t-1} + W_{xh}\, x_t + b_h\big), \qquad y_t = W_{hy}\, h_t + b_y.$$
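A minimal sketch of the vanilla RNN update, written with plain Python lists so that each term of the formula is visible. The one-dimensional hidden state, the weight values, and the three-step input sequence are illustrative assumptions; `rnn_step` is a hypothetical helper name.

```python
import math

def rnn_step(h_prev, x_t, W_hh, W_xh, b_h):
    """h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h) for list-based vectors."""
    h_t = []
    for i in range(len(b_h)):
        pre = b_h[i]
        pre += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        pre += sum(W_xh[i][j] * x_t[j] for j in range(len(x_t)))
        h_t.append(math.tanh(pre))
    return h_t

# Assumed toy parameters: hidden size 1, input size 1.
W_hh = [[0.5]]
W_xh = [[1.0]]
b_h = [0.0]

# A single impulse followed by silence.
xs = [[1.0], [0.0], [0.0]]

h = [0.0]
h_hist = []
for x_t in xs:
    h = rnn_step(h, x_t, W_hh, W_xh, b_h)
    h_hist.append(h)

print(h_hist)
```

After the impulse the hidden state shrinks at every step, which is the "memory of the past" fading as the recurrence is applied repeatedly.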

Vanilla RNNs suffer from vanishing/exploding gradients over long sequences. The LSTM (Hochreiter & Schmidhuber 1997), in its now-standard form with a forget gate, mitigates the vanishing-gradient problem with a gated cell state $c_t$ and input/forget/output gates:

$$ \begin{aligned} f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f), &\quad i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i), \\ \tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c), &\quad o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o), \\ c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, &\quad h_t &= o_t \odot \tanh(c_t). \end{aligned} $$

The forget gate $f_t$ lets the cell retain information across long gaps, mitigating vanishing gradients. The GRU (Cho et al. 2014) is a lighter two-gate variant (update + reset) that RECAST uses as its encoder. All three are excellent sequence encoders — the question is always what sequence, predicting what target.
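To make the role of the forget gate concrete, here is a scalar sketch of one LSTM step (hidden size 1, input size 1) following the gate equations of this section. The parameter layout `(w_h, w_x, b)` per gate, the bias values of plus or minus 10, and the 100-step horizon are assumptions chosen only to push the gates close to fully open or fully closed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(h_prev, c_prev, x_t, p):
    """One scalar LSTM step; p maps gate name -> (w_h, w_x, b)."""
    def affine(name):
        w_h, w_x, b = p[name]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(affine("f"))        # forget gate
    i_t = sigmoid(affine("i"))        # input gate
    o_t = sigmoid(affine("o"))        # output gate
    c_tilde = math.tanh(affine("c"))  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t

def run(params, steps=100, c0=1.0):
    """Feed zeros for `steps` steps and return the final cell state."""
    h, c = 0.0, c0
    for _ in range(steps):
        h, c = lstm_step(h, c, 0.0, params)
    return c

# Forget gate held open (bias +10), input gate held shut (bias -10).
remember = {"f": (0.0, 0.0, 10.0), "i": (0.0, 0.0, -10.0),
            "o": (0.0, 0.0, 0.0), "c": (0.0, 0.0, 0.0)}
# Same cell, but with the forget gate held shut (bias -10).
forget = dict(remember, f=(0.0, 0.0, -10.0))

c_remember = run(remember)
c_forget = run(forget)
print(c_remember, c_forget)
```

With the forget gate open the initial cell value survives 100 silent steps almost intact; with it shut the value is wiped almost immediately. In a trained network these gate values are learned functions of the input and hidden state rather than fixed biases.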


5. RNN/LSTM seismicity-rate regression — the failure modes

A very common — and very flawed — design feeds binned seismicity-rate or magnitude time series into an LSTM to "predict the next bin." It looks reasonable and fails for structural reasons:

  1. It drops the point-process survival term. Regressing a binned rate abandons the compensator integral $\int_0^T \lambda^*(\tau)\,d\tau$ that makes a forecast a proper probability. The output is un-calibratable: there is no honest probability to publish (see RECAST and FERN §2).
  2. Class imbalance defeats it. Large events are rare, so a model trained to minimize average error learns to predict "no large event" essentially always. It scores beautifully on accuracy and fails on exactly the events that matter — the same imbalance pathology that makes accuracy useless here.
  3. Shuffled splits leak the future. Random train/test shuffling of a clustered catalog lets the model see aftershocks of a sequence whose mainshock is in the test set — the data-leakage flaw EarthquakeNPP identified, which "artificially inflates performance measures due to the nature of earthquake triggering." Metrics that look strong under shuffling evaporate under chronological splits.
  4. No built-in seismological physics. An LSTM's memory decay biases it toward the most recent bin; it has no built-in Omori–Utsu decay or Gutenberg–Richter magnitude law, so it must re-learn from scarce data what ETAS encodes for free.
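The class-imbalance point in the list above can be put in numbers. The sketch below uses the Gutenberg–Richter relation $P(M \ge m \mid M \ge M_c) = 10^{-b\,(m - M_c)}$ with purely illustrative values (b = 1, $M_c$ = 3, "large" meaning $M \ge 6$, and 100,000 catalogued events); none of these numbers come from a real catalog.

```python
# Illustrative assumptions only: b-value, completeness magnitude,
# "large event" threshold, and catalog size are not from any real catalog.
b_value = 1.0
m_c = 3.0
m_large = 6.0
n_events = 100_000

# Gutenberg-Richter: fraction of events at or above m_large.
p_large = 10.0 ** (-b_value * (m_large - m_c))
n_large = round(n_events * p_large)
n_small = n_events - n_large

# A degenerate classifier that always predicts "no large event".
true_negatives = n_small
true_positives = 0
accuracy = (true_negatives + true_positives) / n_events
recall = true_positives / n_large

print(f"large events: {n_large} of {n_events}")
print(f"accuracy = {accuracy:.3f}, recall on large events = {recall:.1f}")
```

Under these assumptions the do-nothing classifier is right 99.9% of the time while catching none of the 100 large events, which is why accuracy alone says nothing useful here.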

The net effect: binned-rate LSTM "forecasters" tend to report impressive retrospective numbers that do not survive a fair, prospective, calibration-aware evaluation.
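A minimal sketch of the difference between a shuffled and a chronological split, assuming a catalog represented as `(time, magnitude)` tuples. The helper names, the 80/20 ratio, and the toy catalog are hypothetical; a real evaluation would also need to handle sequences that straddle the cut.

```python
import random

def chronological_split(events, train_fraction=0.8):
    """Split so that every test event postdates every training event."""
    ordered = sorted(events, key=lambda e: e[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

def shuffled_split(events, train_fraction=0.8, seed=0):
    """Random split that ignores time (shown only as the anti-pattern)."""
    shuffled = events[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def leaks_future(train, test):
    """True if any training event occurs after the earliest test event."""
    return max(t for t, _ in train) > min(t for t, _ in test)

# Toy catalog: 50 events at days 0..49 with placeholder magnitudes.
catalog = [(float(day), 3.0) for day in range(50)]

chrono_train, chrono_test = chronological_split(catalog)
shuf_train, shuf_test = shuffled_split(catalog)

print("chronological split leaks:", leaks_future(chrono_train, chrono_test))
print("shuffled split leaks:     ", leaks_future(shuf_train, shuf_test))
```

The `leaks_future` check is a cheap guard worth running before trusting any reported metric: if it returns true, the model was trained on events that occurred after some of the events it is being tested on.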


6. The right recurrent design: a recurrent point process

Recurrence is not the problem — recurrence used to regress a rate is. The principled use of an RNN in forecasting is to encode history inside a temporal point process, so the survival term is retained. The foundational example is RMTPP (Du et al. 2016): an RNN encodes the history after the $j$-th event into a hidden state $h_j$, and the conditional intensity between events is

$$\lambda^*(t) = \exp\!\big(\mathbf{v}^{\top} h_j + w\,(t - t_j) + b\big),$$

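where the $w\,(t - t_j)$ term makes the log-intensity linear in the time since the last event (for $w < 0$ an exponential decay, which resembles aftershock decay only qualitatively, since Omori–Utsu decay is a power law) and $h_j$ carries the marked history. The Neural Hawkes process (Mei & Eisner 2017) goes further with a continuous-time LSTM whose cell state decays between events,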

$$\lambda^*(t) = \mathrm{softplus}\!\big(\mathbf{v}^{\top} h(t)\big),$$

which can even let past events lower future intensity (inhibition) — something a classical Hawkes process cannot represent. RECAST (Dascher-Cousineau et al. 2023) is the earthquake-specific member of this lineage: a GRU encoder + neural-density decoder, which beats temporal ETAS only on large catalogs ($\gtrsim 10^4$ events) and otherwise matches it (see RECAST and FERN).
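To show what retaining the survival term buys, the sketch below evaluates the RMTPP-style intensity of this section in closed form. Writing $a = \mathbf{v}^{\top} h_j + b$ and $\Delta = t - t_j$, the compensator is $\int_0^{\Delta} e^{a + w s}\,ds = (e^{a + w\Delta} - e^{a})/w$ for $w \neq 0$, and the probability of no event in the interval is the exponential of its negative. The values of `a` and `w` are arbitrary assumptions standing in for the output of a trained encoder; the function names are hypothetical.

```python
import math

def compensator(a, w, dt):
    """Integral of exp(a + w*s) for s in [0, dt]."""
    if w == 0.0:
        return math.exp(a) * dt
    return (math.exp(a + w * dt) - math.exp(a)) / w

def interval_loglik(a, w, dt):
    """Log-density of the next event arriving dt after the last one:
    log-intensity at dt minus the compensator (the survival term)."""
    return (a + w * dt) - compensator(a, w, dt)

def prob_event_within(a, w, dt):
    """Probability of at least one event within dt of the last one."""
    return 1.0 - math.exp(-compensator(a, w, dt))

# Assumed values standing in for a trained encoder's output.
a, w = 0.0, -1.0

p_one_day = prob_event_within(a, w, 1.0)
p_two_days = prob_event_within(a, w, 2.0)
print(p_one_day, p_two_days)
```

Because the density includes the compensator, the quantities returned by `prob_event_within` are genuine probabilities that can be scored and tested for calibration, which a regressed bin rate cannot offer.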

The distinction in one line. An LSTM that regresses a binned rate is un-calibratable and leaks; an LSTM (or GRU) that parameterizes a conditional intensity with the survival term is a legitimate, testable forecaster. This product only ever considers the second kind, and even then behind a CSEP gate.


7. Honest verdict and role in this product

  • GNNs: kept upstream, on the detection side, where the graph is the station network — association, location, source characterization. They build better, more complete catalogs (a lower, more stable $M_c$), which is the single biggest realizable near-term lever for both ETAS and any neural forecaster. They are not used as a cell-graph occurrence classifier.
  • RNN/LSTM/GRU: never as a binned-rate regressor. The only admissible recurrent forecaster is a recurrent neural point process (RECAST-style), and only as the gated challenger of Models — Employed §5 — it reaches the public map solely if it beats ETAS in our own prospective CSEP harness and is calibrated.

This places both families exactly where the evidence supports them: GNNs upstream building the catalog, recurrent point processes as a gated challenger, and the calibrated classical ETAS reference as the shipping core. The line between detection and forecasting is kept explicit throughout — see Detection vs. Forecasting.


8. References

  1. Kipf, T.N. & Welling, M. (2017). Semi-Supervised Classification with Graph Convolutional Networks. ICLR 2017. arXiv:1609.02907
  2. Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O. & Dahl, G.E. (2017). Neural Message Passing for Quantum Chemistry. ICML 2017. arXiv:1704.01212
  3. Hochreiter, S. & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation 9(8), 1735–1780. doi:10.1162/neco.1997.9.8.1735
  4. Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. EMNLP 2014. arXiv:1406.1078
  5. Du, N., Dai, H., Trivedi, R., Upadhyay, U., Gomez-Rodriguez, M. & Song, L. (2016). Recurrent Marked Temporal Point Processes. KDD 2016. doi:10.1145/2939672.2939875
  6. Mei, H. & Eisner, J. (2017). The Neural Hawkes Process: A Neurally Self-Modulating Multivariate Point Process. NeurIPS 2017. arXiv:1612.09328
  7. Dascher-Cousineau, K., Shchur, O., Brodsky, E.E. & Günnemann, S. (2023). Using deep learning for flexible and scalable earthquake forecasting (RECAST). Geophysical Research Letters 50, e2023GL103909. doi:10.1029/2023GL103909
  8. Stockman, S., Lawson, D. & Werner, M.J. (2026, accepted). EarthquakeNPP: A Benchmark for Earthquake Forecasting with Neural Point Processes. TMLR. arXiv:2410.08226
  9. McBrearty, I.W. & Beroza, G.C. (2023). Earthquake phase association with graph neural networks. Bulletin of the Seismological Society of America 113(2), 524–547. doi:10.1785/0120220182
  10. Mousavi, S.M. & Beroza, G.C. (2022). Deep-learning seismology. Science 377, eabm4470. doi:10.1126/science.abm4470

See also: Models — Classical · Models — ML · Models — Employed · RECAST and FERN · CNN Spatial Models · Detection vs. Forecasting · Evaluation · Honest Limits.
