Transformer Hawkes Process
One topic, in depth. The Transformer Hawkes Process (THP) replaces the recurrence of RMTPP and the Neural Hawkes Process with self-attention: the history representation feeding the conditional intensity is built by a Transformer encoder, so any past event can directly influence the present without information having to survive a chain of recurrent updates. This page derives its temporal encoding, attention mechanism, and intensity, and gives an honest account of its relevance to seismic forecasting.
Honest framing up front. THP (and its sibling, the Self-Attentive Hawkes Process, SAHP) were introduced and validated on non-seismic event-sequence benchmarks (social-media, retail, electronic health records, financial and synthetic streams). They report better log-likelihood than RMTPP/NHP on those domains. Those wins do not automatically transfer to earthquakes: in rigorous, prospective, CSEP-style testing on earthquake catalogs, self-attentive neural point processes — like every neural TPP — have not robustly beaten a well-fit ETAS as of 2026 (see Models — ML §4). THP is documented here as the most principled neural-forecasting option to study (it is a point process, and its attention can in principle model long-range / cross-fault dependencies), not as a shipping capability.
Equation-provenance note. The exact published THP intensity expression should be confirmed against the primary source (Zuo et al., 2020, PMLR v119) before being treated as canonical; the per-mark softplus form below matches the form used elsewhere in this wiki and the standard secondary surveys.
- Intuition: attention instead of recurrence
- Temporal encoding of event times
- Self-attention over the event history
- The continuous conditional intensity
- The likelihood and training
- Parameter estimation and practicalities
- Strengths
- Limitations — and why they matter for seismicity
- Role in operational earthquake forecasting
- Worked illustration
- References
Recurrent neural TPPs (RMTPP, Neural Hawkes) push the entire history through a single evolving hidden state. Information from a distant, important event (say a large mainshock weeks ago) must survive every intervening update to still influence the present — and recurrent memory decays. This is the well-known long-range bottleneck of RNNs.
Zuo et al. (2020) apply the Transformer's solution: self-attention. Each event attends directly to every other event in the history with a learned, content-and-time-dependent weight, so a distant mainshock can influence the current intensity through a direct attention edge rather than a long recurrent chain. The history representation $h(t_i)$ is therefore a causal summary of the whole past, not the end state of a recurrence.
The Self-Attentive Hawkes Process (SAHP; Zhang et al., 2020) is a close cousin with the same core idea; THP and SAHP generally report the best log-likelihood among neural TPPs on standard non-seismic benchmarks.
Self-attention on its own is permutation-equivariant: it has no inherent notion of when events happened. THP injects time with a temporal positional encoding, a deterministic sinusoidal map of each event time $t_j$ to a vector $z(t_j)$. This encoding is added to the event's mark embedding (magnitude/location embedding, in a seismic adaptation) to form the input token for event $j$.
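A minimal sketch of such a sinusoidal map, in plain Python. The alternating sine/cosine layout and the base of 10000 follow the usual Transformer convention (Vaswani et al., 2017) and are assumptions here, not a transcription of the published THP formula; `temporal_encoding`, `input_token`, and `d_model` are illustrative names.

```python
import math

def temporal_encoding(t, d_model=8, base=10000.0):
    """Map a scalar event time t to a d_model-dimensional vector z(t).

    Assumed layout: even indices use sin, odd indices use cos, with
    geometrically spaced frequencies (standard Transformer convention).
    """
    z = []
    for i in range(d_model):
        # Paired dimensions (2k, 2k+1) share one frequency.
        freq = base ** (-(2 * (i // 2)) / d_model)
        z.append(math.sin(t * freq) if i % 2 == 0 else math.cos(t * freq))
    return z

def input_token(mark_embedding, t):
    """Input token = mark embedding + temporal encoding (element-wise sum)."""
    z = temporal_encoding(t, d_model=len(mark_embedding))
    return [m + zi for m, zi in zip(mark_embedding, z)]
```

Because the map is deterministic and takes the raw event time rather than the sequence index, two events separated by a long quiet period receive visibly different encodings even when they are adjacent in the sequence.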
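Stacking the per-event input tokens into a matrix $X$, each attention head computes

$$S = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right)V, \qquad Q = XW^Q,\; K = XW^K,\; V = XW^V,$$

where $d_k$ is the key dimension and $M$ is the causal mask: $M_{ij} = -\infty$ for every position that query $i$ must not see, and $0$ otherwise.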
```mermaid
flowchart TB
    subgraph Tokens
        T1["mark embⱼ + z(tⱼ)"]
    end
    T1 --> ATT["Masked multi-head<br/>self-attention<br/>softmax(QKᵀ/√d_k + M) V"]
    ATT --> FFN["Feed-forward + residual + LayerNorm"]
    FFN --> H["h(t_i): causal history summary"]
    H --> INT["λ*(t) = softplus( α·(t−t_i)/t_i + wᵀh(t_i) + b )"]
    INT --> LL["log-likelihood: Σ log λ* − ∫ λ* dτ"]
```
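The attention step in the diagram can be sketched as a single head in plain Python. This is an illustration under stated assumptions, not the reference implementation: it omits multiple heads, the feed-forward block, residuals, and LayerNorm, and it masks strictly future positions, so the output for event $i$ may see event $i$ itself and should only be used for the interval after $t_i$. The name `causal_self_attention` and the list-of-rows matrix layout are hypothetical.

```python
import math

def causal_self_attention(X, Wq, Wk, Wv):
    """Single-head masked self-attention over a list of token vectors X.

    Wq, Wk, Wv are projection matrices given as lists of rows (d_in x d_out).
    """
    def matvec(x, W):
        return [sum(x[r] * W[r][c] for r in range(len(x)))
                for c in range(len(W[0]))]

    Q = [matvec(x, Wq) for x in X]
    K = [matvec(x, Wk) for x in X]
    V = [matvec(x, Wv) for x in X]
    d_k = len(K[0])
    out = []
    for i in range(len(X)):
        # Causal mask: query i only scores keys j <= i. Future keys would get
        # -inf (zero weight after the softmax), so they are simply skipped.
        scores = [sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        out.append([sum(weights[j] * V[j][c] for j in range(i + 1))
                    for c in range(len(V[0]))])
    return out
```

A cheap leak check follows directly from this structure: perturbing a later token must leave every earlier output row unchanged. If it does not, the mask is wrong and any reported likelihood is contaminated.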
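The hidden representation $h(t_i)$ is defined only at event times. Between events, THP carries it forward with a continuous conditional intensity: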
$$\boxed{\;\lambda^*(t) = \text{softplus}\!\Big( \alpha\,\frac{t - t_i}{t_i} + \mathbf{w}^{\top} h(t_i) + b \Big), \qquad t \in (t_i, t_{i+1}]\;}$$
(for a multivariate / marked process, one such expression per mark, each with its own readout parameters). The three terms are:
- $\mathbf{w}^{\top} h(t_i)$: the history contribution from the attention summary; it sets the intensity level and carries long-range dependencies via attention.
- $\alpha\,\dfrac{t - t_i}{t_i}$: the continuous interpolation term, a learned coefficient $\alpha$ on the elapsed time since the last event, normalized by $t_i$. Its sign governs whether the intensity rises or relaxes between events (with $\alpha < 0$ giving Omori-like decay).
- $b$: a base offset. The softplus keeps $\lambda^*(t) > 0$ (and, as in NHP, grows only linearly for large arguments, avoiding $\exp$ blow-ups).
So the intensity level and shape are set by attention over the full history, while a simple analytic term carries it continuously to the next event. This combines long-range expressiveness (attention) with a tractable inter-event form.
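The boxed expression translates almost line for line into code. The sketch below is illustrative only: `history_score` stands in for the scalar $\mathbf{w}^{\top} h(t_i)$ that an encoder would supply, and the function names are hypothetical. Note the implicit requirement $t_i > 0$, since the elapsed time is divided by $t_i$.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)): linear for large x, never negative.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def thp_intensity(t, t_i, history_score, alpha, b):
    """lambda*(t) on (t_i, t_{i+1}]; history_score stands for w^T h(t_i)."""
    if t < t_i or t_i <= 0:
        raise ValueError("need t >= t_i > 0 (elapsed time is divided by t_i)")
    return softplus(alpha * (t - t_i) / t_i + history_score + b)
```

With `alpha < 0` the returned value decreases monotonically as `t` moves away from `t_i`; with `alpha > 0` it rises. In both cases it stays strictly positive.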
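THP is trained on the point-process log-likelihood of the observed sequence on a window $[0, T]$:

$$\ell = \sum_{i} \log \lambda^*(t_i) \;-\; \int_{0}^{T} \lambda^*(\tau)\,d\tau$$

As with Neural Hawkes, the compensator $\int_{0}^{T} \lambda^*(\tau)\,d\tau$ must be approximated, by Monte Carlo sampling or numerical quadrature.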
THP also commonly adds auxiliary heads (next-time and next-mark prediction losses) to stabilize training, but the point-process likelihood with its compensator remains the core objective — the survival penalty that keeps the model calibratable rather than a regressor, the dividing line emphasized across this wiki.
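A minimal sketch of the core objective with a Monte Carlo compensator, assuming the softplus intensity form used on this page. Everything here is illustrative: `scores[i]` is a stand-in for $\mathbf{w}^{\top} h(t_i)$, the auxiliary next-time and next-mark losses are omitted, and the sketch conditions on the first event, so it integrates from $t_1$ to $t_n$ rather than over a full observation window.

```python
import math
import random

def softplus(x):
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def log_likelihood(event_times, scores, alpha, b, n_samples=200, seed=0):
    """Sum of log-intensities at events minus a Monte Carlo compensator.

    event_times: increasing times t_1 < ... < t_n (all > 0).
    scores[i]:   stand-in for w^T h(t_i), the readout of the history summary.
    """
    rng = random.Random(seed)
    ll = 0.0
    for i in range(1, len(event_times)):
        t_prev, t_next = event_times[i - 1], event_times[i]

        def lam(t):
            # Intensity on (t_prev, t_next], driven by the previous summary.
            return softplus(alpha * (t - t_prev) / t_prev + scores[i - 1] + b)

        ll += math.log(lam(t_next))  # event term
        # Compensator on (t_prev, t_next]: interval length times the mean
        # intensity at uniformly sampled times (an unbiased MC estimate).
        dt = t_next - t_prev
        mc_mean = sum(lam(rng.uniform(t_prev, t_next))
                      for _ in range(n_samples)) / n_samples
        ll -= dt * mc_mean  # survival penalty
    return ll
```

The subtraction is the survival penalty: raising the intensity everywhere increases the event term but is charged for in the integral, which is what keeps the fitted rate calibratable.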
- Parameters. Mark embeddings, the attention projection matrices $W^Q, W^K, W^V$ (per head), feed-forward and layer-norm weights, and the intensity readout $\{\alpha, \mathbf{w}, b\}$ (per mark), all trained end-to-end by stochastic gradient ascent on the (MC-approximated) log-likelihood.
- Causal masking is mandatory. The mask $M$ must forbid attending to current/future events; omitting it leaks the label and invalidates every evaluation.
- Compute. Self-attention is $O(n^2)$ in sequence length but fully parallel: fast to train, and cheap at inference for the daily-forecast use case.
- Data hunger / overfitting. Transformers are parameter-rich and notoriously data-hungry; seismic catalogs offer few effectively-independent large sequences, the regime where over-parameterized models overfit (the DeVries–Mignan lesson, Models — ML §5). The product's guardrails are unchanged: strictly temporal (rolling-origin) splits, trivial + ETAS baselines alongside, proper scoring + CSEP consistency tests, and calibration as a release blocker.
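The strictly temporal (rolling-origin) split named in the guardrails can be sketched as a small generator. This is a hypothetical helper, not the product's harness: `first_cutoff`, `step`, and `horizon` are illustrative parameters in the same time units as the catalog.

```python
def rolling_origin_splits(event_times, first_cutoff, step, horizon):
    """Yield (train, test) lists where every train time precedes every test time.

    The cutoff advances by `step`; each test window covers the `horizon`
    immediately after the cutoff, so no future event ever reaches training.
    """
    t_end = event_times[-1]
    cutoff = first_cutoff
    while cutoff < t_end:
        train = [t for t in event_times if t <= cutoff]
        test = [t for t in event_times if cutoff < t <= cutoff + horizon]
        if train and test:
            yield train, test
        cutoff += step
```

Each fold retrains (or at least re-conditions) on everything up to the cutoff and scores only what follows it, which is the property a random shuffle of events would destroy.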
- Long-range dependencies without recurrence decay. Direct attention edges let a distant mainshock influence the present intensity — in principle capturing cross-fault / long-memory structure that RNN-based TPPs lose.
- Parallel, scalable training. No sequential recurrence; attention is parallel across the sequence.
- Best neural-TPP likelihood on standard benchmarks. THP/SAHP generally top RMTPP/NHP on non-seismic event-sequence log-likelihood.
- Principled point process. Trained on the true likelihood with a compensator, so it yields a proper, calibratable intensity — not a black-box classifier. Of the neural options, it is the most methodologically aligned with what forecasting requires.
- Validated off-domain; gains do not auto-transfer. THP's wins are on retail/EHR/social/synthetic streams. On earthquake catalogs under fair, prospective, temporal splits, self-attentive (and all) neural TPPs have not robustly beaten ETAS (EarthquakeNPP; see Models — ML §4). This is the single most important caveat.
- Data hunger vs. scarce sequences. Transformers want large, diverse training corpora; the supply of effectively-independent large seismic sequences is small, inviting overfitting that vanishes — or reverses — under honest temporal testing.
- No native spatial kernel. THP is a temporal (optionally marked) process. Seismic forecasting needs a continuous spatial density over $(x,y)$; bolting space on as discrete marks is crude.
- No built-in seismic physics. No Omori law, no Gutenberg–Richter, no branching-ratio subcriticality constraint unless imposed. A free THP can drift from seismologically sensible behaviour and offers no interpretable parameters for expert review.
- Approximate, costlier likelihood. The compensator must be approximated (MC/quadrature), adding variance and compute over ETAS's analytic integrals.
- Attention ≠ causation. High attention weight on a past event is not evidence of a physical triggering link; over-reading attention maps as "discovered physics" repeats the DeVries over-interpretation error.
Net assessment for this product. THP is the most principled neural-forecasting architecture to study, and it is the natural sequence backbone if the gated neural challenger ever goes transformer. But it is never a default forecaster. A seismic THP would keep the ETAS inductive bias (additive background + summed triggering), model magnitude and space explicitly, and must clear the hard gate: a prospective CSEP win over a well-fit ETAS plus a passing reliability diagram.
THP has no direct operational role in CAOS_SEISMIC today. Its relevance is forward-looking:
- The principled neural backbone. If the research track pursues a neural challenger, a THP-style self-attention encoder is the most defensible history summarizer — it is itself a point process and trains on the same likelihood the product grades on, so it slots directly into the CSEP harness.
- Long-range / cross-fault structure is the specific capability worth testing: whether attention over a long event history adds prospective skill over ETAS's local triggering, for some region, is an open empirical question — to be answered in the harness, not asserted.
- A cautionary data-point. That the strongest-likelihood neural TPP still does not beat ETAS on earthquakes under fair testing is exactly why the product ships an ETAS-class core and gates all neural work behind demonstrated prospective skill + calibration.
In OEF terms, THP maximizes expressive history modeling; the seismic evidence keeps the burden of proof squarely on demonstrated prospective skill against the physics-informed baseline.
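Consider a horizon just after the most recent event at time $t_i$, with the attention summary $h(t_i)$ and the readout $\{\alpha, \mathbf{w}, b\}$ held fixed. Evaluate the softplus argument and the resulting intensity at increasing elapsed times:

| elapsed $t - t_i$ (days) | softplus argument | $\lambda^*(t)$ (per day) |
|---|---|---|
| 0.0 | | ~0.40 |
| 1.0 | | |
| 2.0 | | |
| 7.0 | | |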
The intensity starts at ~0.40/day and relaxes smoothly: an Omori-flavoured decay whose level was set by attention over the entire history (so a large mainshock far in the past could have lifted that starting level through a direct attention edge, where an RNN might have forgotten it). The probability of at least one event within a horizon $\Delta$ after $t_i$ follows from the compensator: $P = 1 - \exp\!\big(-\int_{t_i}^{t_i+\Delta} \lambda^*(\tau)\,d\tau\big)$.
As always, this is a bounded, calibratable probability conditioned on history — never an alarm, never a deterministic call. A passing reliability diagram, not a single outcome, is what would validate it.
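The same kind of evaluation can be reproduced numerically. The parameter values below are hypothetical, chosen only so that the intensity starts near 0.40/day as in the text; they are not the values behind the original table. The compensator is integrated with a midpoint rule, and `T_I`, `ALPHA`, and `LEVEL` are illustrative names.

```python
import math

def softplus(x):
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# Hypothetical values, chosen only so the intensity starts near 0.40/day.
T_I = 10.0      # time of the most recent event, in days
ALPHA = -1.0    # negative: the intensity relaxes between events
LEVEL = -0.7    # stands for w^T h(t_i) + b

def intensity(elapsed):
    """lambda* at `elapsed` days after the most recent event."""
    return softplus(ALPHA * elapsed / T_I + LEVEL)

def prob_at_least_one(horizon, n=1000):
    """P(>= 1 event within `horizon` days) = 1 - exp(-compensator)."""
    step = horizon / n
    # Midpoint-rule approximation of the integral of lambda* over the horizon.
    integral = sum(intensity((k + 0.5) * step) for k in range(n)) * step
    return 1.0 - math.exp(-integral)

for elapsed in (0.0, 1.0, 2.0, 7.0):
    argument = ALPHA * elapsed / T_I + LEVEL
    print(f"{elapsed:4.1f}  {argument:+.2f}  {intensity(elapsed):.3f}")
print(f"P(>=1 event within 7 days) = {prob_at_least_one(7.0):.3f}")
```

Changing `LEVEL` shifts the whole curve up or down, which is the role attention plays here; changing `ALPHA` changes only how fast it relaxes.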
- Zuo, S., Jiang, H., Li, Z., Zhao, T. & Zha, H. (2020). Transformer Hawkes Process. Proceedings of the 37th International Conference on Machine Learning (ICML), PMLR v119, 11692–11702. https://proceedings.mlr.press/v119/zuo20a.html
- Zhang, Q., Lipani, A., Kirnap, O. & Yilmaz, E. (2020). Self-Attentive Hawkes Process. ICML 2020, PMLR v119. arXiv:1907.07561
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017. arXiv:1706.03762
- Mei, H. & Eisner, J. (2017). The Neural Hawkes Process: A Neurally Self-Modulating Multivariate Point Process. NeurIPS 2017. arXiv:1612.09328
- Du, N., Dai, H., Trivedi, R., Upadhyay, U., Gomez-Rodriguez, M. & Song, L. (2016). Recurrent Marked Temporal Point Processes. KDD 2016. doi:10.1145/2939672.2939875
- Ogata, Y. (1988). Statistical models for earthquake occurrences and residual analysis for point processes. JASA 83(401), 9–27. doi:10.1080/01621459.1988.10478560
- Shchur, O., Türkmen, A.C., Januschowski, T. & Günnemann, S. (2021). Neural Temporal Point Processes: A Review. IJCAI 2021 Survey Track. arXiv:2104.03528
- Stockman, S., Lawson, D. & Werner, M.J. (2026, accepted). EarthquakeNPP: A Benchmark for Earthquake Forecasting with Neural Point Processes. TMLR. arXiv:2410.08226
See also: Temporal Point Processes · RMTPP · Neural Hawkes Process · Models — ML · Models — Classical · Evaluation · Honest-Limits.
⚠️ Disclaimer: read this. CAOS_SEISMIC produces probabilistic forecasts, not predictions. It is an independent research and education tool. It is NOT an official earthquake early-warning or civil-protection system, it does NOT predict when, where, or how large an earthquake will be, and it must NOT be used for life-safety, emergency, or evacuation decisions. Every number it publishes is a bounded, calibrated probability conditioned on the present state of seismicity, never an alarm, a countdown, or a "safe" state. A single outcome neither confirms nor refutes a probabilistic forecast. It complements, and does not replace or speak for, official agencies; always follow your national seismological and civil-protection authorities (e.g. USGS, INGV, CSN in Chile (with SENAPRED for civil protection), GeoNet, JMA). The software is provided "as is", without warranty of any kind (MIT License); the authors accept no liability for its use. Data are courtesy of their providers (USGS/ANSS, ISC/ISC-GEM, Global CMT, EMSC, CSN, and others) under their respective licenses and attribution terms. See Honest-Limits for the full epistemic context.