
Transformer Hawkes Process

Felipe Santibañez-Leal edited this page Jun 17, 2026 · 1 revision

Transformer Hawkes Process — self-attentive intensity (Zuo et al., 2020)

One topic, in depth. The Transformer Hawkes Process (THP) replaces the recurrence of RMTPP and the Neural Hawkes Process with self-attention: the history representation feeding the conditional intensity is built by a Transformer encoder, so any past event can directly influence the present without information having to survive a chain of recurrent updates. This page derives its temporal encoding, attention mechanism, and intensity, and gives an honest account of its relevance to seismic forecasting.

Honest framing up front. THP and its sibling, the Self-Attentive Hawkes Process (SAHP), were introduced and validated on non-seismic event-sequence benchmarks (social media, retail, electronic health records, financial, and synthetic streams), where they report better log-likelihood than RMTPP/NHP. Those wins do not automatically transfer to earthquakes: in rigorous, prospective, CSEP-style testing on earthquake catalogs, self-attentive neural point processes, like every neural TPP, have not robustly beaten a well-fit ETAS as of 2026 (see Models — ML §4). THP is documented here as the most principled neural-forecasting option to study (it is a point process, and its attention can in principle model long-range / cross-fault dependencies), not as a shipping capability.

Equation-provenance note. The exact published THP intensity expression should be confirmed against the primary source (Zuo et al., 2020, PMLR v119) before being treated as canonical; the per-mark softplus form below matches the form used elsewhere in this wiki and the standard secondary surveys.


Table of contents

  1. Intuition: attention instead of recurrence
  2. Temporal encoding of event times
  3. Self-attention over the event history
  4. The continuous conditional intensity
  5. The likelihood and training
  6. Parameter estimation and practicalities
  7. Strengths
  8. Limitations — and why they matter for seismicity
  9. Role in operational earthquake forecasting
  10. Worked illustration
  11. References

1. Intuition: attention instead of recurrence

Recurrent neural TPPs (RMTPP, Neural Hawkes) push the entire history through a single evolving hidden state. Information from a distant, important event (say a large mainshock weeks ago) must survive every intervening update to still influence the present — and recurrent memory decays. This is the well-known long-range bottleneck of RNNs.

Zuo et al. (2020) apply the Transformer's solution: self-attention. Each event attends directly to every other event in the history with a learned, content-and-time-dependent weight, so a distant mainshock can influence the current intensity through a direct attention edge rather than a long recurrent chain. The history representation $h(t_i)$ at each event is a weighted sum over all past events, and the conditional intensity is read out from it. Self-attention is also parallelizable across the sequence (no sequential recurrence), which speeds training.

The Self-Attentive Hawkes Process (SAHP; Zhang et al., 2020) is a close cousin with the same core idea; THP and SAHP generally report the best log-likelihood among neural TPPs on standard non-seismic benchmarks.


2. Temporal encoding of event times

Self-attention on its own is permutation-equivariant: it has no inherent notion of event order, let alone of when events happened. THP injects time with a temporal positional encoding, a deterministic, sinusoidal map of each event time $t_i$ to a vector $\mathbf{z}(t_i) \in \mathbb{R}^{d}$, analogous to the Transformer's positional encoding but using the continuous timestamp rather than the integer position:

$$\big[\mathbf{z}(t_i)\big]_{j} = \begin{cases} \sin\!\big(t_i \,/\, 10000^{\,(j-1)/d}\big), & j \text{ odd},\\[4pt] \cos\!\big(t_i \,/\, 10000^{\,j/d}\big), & j \text{ even}. \end{cases}$$

This encoding is added to the event's mark embedding (magnitude/location embedding, in a seismic adaptation) to form the input token for event $i$. Encoding the actual time (not just order) is what lets the model reason about inter-event durations — essential for an Omori-like decay, where how long ago a mainshock occurred is what matters.
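The encoding above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the reference implementation: the function names `temporal_encoding` and `input_token` are hypothetical, the sine/cosine assignment follows the formula as written on this page, and the mark embedding is assumed to be a list of the same length $d$.

```python
import math

def temporal_encoding(t, d):
    """Sinusoidal encoding z(t) of a continuous timestamp.

    Follows the formula on this page with a 1-based component index j = 1..d.
    """
    z = []
    for j in range(1, d + 1):
        if j % 2 == 1:
            # odd component: sine with exponent (j - 1) / d
            z.append(math.sin(t / 10000 ** ((j - 1) / d)))
        else:
            # even component: cosine with exponent j / d
            z.append(math.cos(t / 10000 ** (j / d)))
    return z

def input_token(mark_embedding, t):
    """Token for one event: mark embedding plus temporal encoding (same length)."""
    z = temporal_encoding(t, len(mark_embedding))
    return [e + zj for e, zj in zip(mark_embedding, z)]

# Two events with the same (hypothetical) mark embedding but different times
# receive different tokens, so the encoder can tell them apart.
emb = [0.2, -0.1, 0.5, 0.0]
print(input_token(emb, 1.0))
print(input_token(emb, 30.0))
```

Because the input is the timestamp itself, the units of $t$ (seconds, days) change the encoding; whatever unit is chosen has to be used consistently in training and forecasting.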


3. Self-attention over the event history

Stacking the per-event input tokens into a matrix $X$ (one row per past event), a single attention head computes queries, keys, and values by learned linear maps $Q = X W^Q$, $K = X W^K$, $V = X W^V$, and forms the attention output

$$\text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}} + M\right) V,$$

where $d_k$ is the key dimension and $M$ is a causal mask ($M_{ij} = -\infty$ for $j > i$) that forbids an event from attending to the future. Event $i$ may attend to itself: $h(t_i)$ is only used for the intensity on $(t_i, t_{i+1}]$, where event $i$ already belongs to the history, and masking the diagonal as well would leave the first event with nothing to attend to. This enforces the point-process requirement that the intensity at any time $t$ depends only on the history $\mathcal{H}_{t}$. (Without this mask the model would leak future information, the cardinal sin of Models — ML.) Multiple heads are concatenated and passed through a position-wise feed-forward network with residual connections and layer normalization, as in a standard Transformer encoder, producing a hidden representation $h(t_i)$ for each event that summarizes the whole causal history.
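A minimal sketch of the masked attention step, in plain Python with no dependencies. It assumes the queries, keys, and values have already been produced by the learned projections, and it masks strictly later events only ($j > i$), so that every row, including the first, keeps at least one finite score. The names `softmax` and `causal_attention` are illustrative and not taken from any THP codebase.

```python
import math

def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q, K: n x d_k lists of rows; V: n x d_v list of rows.
    Row i may attend to positions j <= i only.
    """
    n, d_k, d_v = len(Q), len(Q[0]), len(V[0])
    outputs, weights = [], []
    for i in range(n):
        scores = []
        for j in range(n):
            if j > i:
                scores.append(-math.inf)  # M_ij = -inf: future event
            else:
                dot = sum(q * k for q, k in zip(Q[i], K[j]))
                scores.append(dot / math.sqrt(d_k))
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * V[j][c] for j in range(n)) for c in range(d_v)])
    return outputs, weights

# Toy tokens (three events, d = 2); Q = K = V = X for brevity.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, W = causal_attention(X, X, X)
for row in W:
    print([round(w, 3) for w in row])
```

A cheap leakage check follows from the mask: changing the token of a later event must leave the outputs of all earlier events unchanged.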

flowchart TB
    subgraph Tokens
      T1["mark embⱼ + z(tⱼ)"]
    end
    T1 --> ATT["Masked multi-head<br/>self-attention<br/>softmax(QKᵀ/√d_k + M) V"]
    ATT --> FFN["Feed-forward + residual + LayerNorm"]
    FFN --> H["h(t_i) — causal history summary"]
    H --> INT["λ*(t) = softplus( α·(t−t_i)/t_i + wᵀh(t_i) + b )"]
    INT --> LL["log-likelihood: Σ log λ* − ∫ λ* dτ"]

4. The continuous conditional intensity

The hidden representation $h(t_i)$ is computed at each event. To obtain a continuous intensity for times $t$ between events $t \in (t_i, t_{i+1}]$, THP interpolates with an explicit elapsed-time term:

$$\boxed{\;\lambda^*(t) = \text{softplus}\!\Big( \alpha\, \frac{t - t_i}{t_i} + \mathbf{w}^{\top} h(t_i) + b \Big), \qquad t \in (t_i, t_{i+1}]\;}$$

(for a multivariate / marked process, one such expression per mark $m$, with parameters $\alpha_m, \mathbf{w}_m, b_m$). Term by term:

  • $\mathbf{w}^{\top} h(t_i)$ — the history contribution from the attention summary; it sets the intensity level and carries long-range dependencies via attention.
  • $\alpha\,\dfrac{t - t_i}{t_i}$ — the continuous interpolation term: a learned coefficient $\alpha$ on the elapsed time since the last event, normalized by $t_i$. Its sign governs whether the intensity rises or relaxes between events (with $\alpha < 0$ giving Omori-like decay).
  • $b$ — a base offset; softplus keeps $\lambda^*(t) > 0$ (and, as in NHP, grows only linearly for large arguments, avoiding $\exp$ blow-ups).

So the intensity level and shape are set by attention over the full history, while a simple analytic term carries it continuously to the next event. This combines long-range expressiveness (attention) with a tractable inter-event form.
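The intensity formula can be evaluated directly once the scalar readout $\mathbf{w}^{\top} h(t_i)$ is known. The sketch below assumes that readout is supplied as a plain number `wh` (computing it requires the encoder); `thp_intensity` is a hypothetical helper name. Note that the normalization by $t_i$ requires $t_i > 0$, which the function checks explicitly.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def thp_intensity(t, t_i, wh, alpha, b):
    """lambda*(t) on (t_i, t_{i+1}], given wh = w^T h(t_i).

    alpha scales the elapsed time (t - t_i) / t_i; b is the base offset.
    """
    if t_i <= 0 or t < t_i:
        raise ValueError("need t_i > 0 and t >= t_i")
    return softplus(alpha * (t - t_i) / t_i + wh + b)

# With alpha < 0 the intensity relaxes as time passes since the last event.
t_last = 10.0
for t in (10.0, 12.0, 15.0, 20.0):
    print(t, round(thp_intensity(t, t_last, wh=-0.7, alpha=-0.4, b=0.0), 4))
```

The stable `softplus` form avoids overflow for large positive arguments and stays strictly positive for the negative arguments that occur here.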


5. The likelihood and training

THP is trained on the point-process log-likelihood:

$$\log\mathcal{L} = \sum_{i=1}^{n} \log \lambda^*(t_i) \;-\; \int_0^T \lambda^*(\tau)\,d\tau.$$

As with Neural Hawkes, the compensator $\int_0^T \lambda^*(\tau)\,d\tau$ has no elementary closed form (the integrand is a softplus of an affine function of elapsed time), so it is approximated. THP uses either Monte-Carlo sampling or a numerical-integration approximation per inter-event interval,

$$\int_{t_i}^{t_{i+1}} \lambda^*(\tau)\,d\tau \approx \frac{t_{i+1}-t_i}{L}\sum_{\ell=1}^{L}\lambda^*(\tau_\ell), \qquad \tau_\ell \sim \mathcal{U}(t_i,t_{i+1}).$$

THP also commonly adds auxiliary heads (next-time and next-mark prediction losses) to stabilize training, but the point-process likelihood with its compensator remains the core objective — the survival penalty that keeps the model calibratable rather than a regressor, the dividing line emphasized across this wiki.
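The core objective can be sketched as follows, leaving out the auxiliary heads. Several simplifications are assumptions of this sketch rather than statements about the published method: the readouts $\mathbf{w}^{\top} h(t_i)$ are passed in as a precomputed list `wh`, the process is univariate, and the likelihood is conditioned on the first event (the sum starts at the second event and the integral at $t_1$), because the intensity formula on this page is only defined after an event has occurred.

```python
import math
import random

def softplus(x):
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def intensity(t, t_i, wh_i, alpha, b):
    return softplus(alpha * (t - t_i) / t_i + wh_i + b)

def log_likelihood(times, wh, alpha, b, n_samples=200, rng=None):
    """Monte-Carlo approximated log-likelihood, conditioned on the first event.

    times: increasing event times with times[0] > 0.
    wh[i]: readout w^T h(t_i), assumed already computed by the encoder.
    """
    rng = rng or random.Random(0)
    event_term, compensator = 0.0, 0.0
    for i in range(len(times) - 1):
        t_i, t_next = times[i], times[i + 1]
        # Log-intensity at the next event, read from the history up to t_i.
        event_term += math.log(intensity(t_next, t_i, wh[i], alpha, b))
        # Uniform samples estimate the integral over (t_i, t_next].
        samples = [rng.uniform(t_i, t_next) for _ in range(n_samples)]
        mean_lam = sum(intensity(s, t_i, wh[i], alpha, b) for s in samples) / n_samples
        compensator += (t_next - t_i) * mean_lam
    return event_term - compensator

times = [1.0, 2.0, 4.0, 5.5]
wh = [0.3, 0.3, 0.3, 0.3]
print(log_likelihood(times, wh, alpha=-0.5, b=0.0))
```

A useful sanity check: with `alpha = 0` the intensity is constant between events, so the Monte-Carlo estimate must match the exact homogeneous-Poisson value up to rounding.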


6. Parameter estimation and practicalities

  • Parameters. Mark embeddings, the attention projection matrices $W^Q, W^K, W^V$ (per head), feed-forward and layer-norm weights, the intensity readout $\{\alpha, \mathbf{w}, b\}$ (per mark) — trained end-to-end by stochastic gradient ascent on the (MC-approximated) log-likelihood.
  • Causal masking is mandatory. The mask $M$ must forbid attending to future events, and the likelihood term for event $i$ must be read from $h(t_{i-1})$, never from $h(t_i)$; violating either leaks the label and invalidates every evaluation.
  • Compute. Self-attention is $O(n^2)$ in sequence length but fully parallel — fast to train, and cheap at inference for the daily-forecast use case.
  • Data hunger / overfitting. Transformers are parameter-rich and notoriously data-hungry; seismic catalogs offer few effectively-independent large sequences, the regime where over-parameterized models overfit (the DeVries–Mignan lesson, Models — ML §5). The product's guardrails are unchanged: strictly temporal (rolling-origin) splits, trivial + ETAS baselines alongside, proper scoring + CSEP consistency tests, and calibration as a release blocker.
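The strictly temporal (rolling-origin) split named in the last bullet can be sketched as below. This is one plausible reading of the guardrail, not the product's harness: the function name `rolling_origin_splits` and its parameters (`first_cutoff`, `step`, `horizon`) are hypothetical, and times are assumed sorted and in a common unit.

```python
def rolling_origin_splits(times, first_cutoff, step, horizon):
    """Yield (train_idx, test_idx) pairs for sorted event times.

    For each cutoff c, training uses events with t <= c and testing uses
    events with c < t <= c + horizon, so no test event precedes a training one.
    """
    k = 0
    while first_cutoff + k * step < times[-1]:
        cutoff = first_cutoff + k * step
        train = [i for i, t in enumerate(times) if t <= cutoff]
        test = [i for i, t in enumerate(times) if cutoff < t <= cutoff + horizon]
        yield train, test
        k += 1

times = [0.5, 1.2, 2.7, 3.1, 4.8, 6.0, 7.4]
splits = list(rolling_origin_splits(times, first_cutoff=3.0, step=2.0, horizon=2.0))
for train, test in splits:
    print(train, test)
```

Every model in the comparison, including the trivial and ETAS baselines, would be refit on `train` and scored on `test` for each origin, so none of them sees events later than its own cutoff.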

7. Strengths

  • Long-range dependencies without recurrence decay. Direct attention edges let a distant mainshock influence the present intensity — in principle capturing cross-fault / long-memory structure that RNN-based TPPs lose.
  • Parallel, scalable training. No sequential recurrence; attention is parallel across the sequence.
  • Best neural-TPP likelihood on standard benchmarks. THP/SAHP generally top RMTPP/NHP on non-seismic event-sequence log-likelihood.
  • Principled point process. Trained on the true likelihood with a compensator, so it yields a proper, calibratable intensity — not a black-box classifier. Of the neural options, it is the most methodologically aligned with what forecasting requires.

8. Limitations — and why they matter for seismicity

  • Validated off-domain; gains do not auto-transfer. THP's wins are on retail/EHR/social/synthetic streams. On earthquake catalogs under fair, prospective, temporal splits, self-attentive (and all) neural TPPs have not robustly beaten ETAS (EarthquakeNPP; see Models — ML §4). This is the single most important caveat.
  • Data hunger vs. scarce sequences. Transformers want large, diverse training corpora; the supply of effectively-independent large seismic sequences is small, inviting overfitting that vanishes — or reverses — under honest temporal testing.
  • No native spatial kernel. THP is a temporal (optionally marked) process. Seismic forecasting needs a continuous spatial density over $(x,y)$; bolting space on as discrete marks is crude.
  • No built-in seismic physics. No Omori law, no Gutenberg–Richter, no branching-ratio subcriticality constraint unless imposed. A free THP can drift from seismologically sensible behaviour and offers no interpretable parameters for expert review.
  • Approximate, costlier likelihood. The compensator must be approximated (MC/quadrature), adding variance and compute over ETAS's analytic integrals.
  • Attention ≠ causation. High attention weight on a past event is not evidence of a physical triggering link; over-reading attention maps as "discovered physics" repeats the DeVries over-interpretation error.

Net assessment for this product. THP is the most principled neural-forecasting architecture to study, and it is the natural sequence backbone if the gated neural challenger ever goes transformer. But it is never a default forecaster. A seismic THP would need to keep the ETAS inductive bias (additive background + summed triggering), model magnitude and space explicitly, and clear the hard gate: a prospective CSEP win over a well-fit ETAS plus a passing reliability diagram.


9. Role in operational earthquake forecasting

THP has no direct operational role in CAOS_SEISMIC today. Its relevance is forward-looking:

  • The principled neural backbone. If the research track pursues a neural challenger, a THP-style self-attention encoder is the most defensible history summarizer — it is itself a point process and trains on the same likelihood the product grades on, so it slots directly into the CSEP harness.
  • Long-range / cross-fault structure is the specific capability worth testing: whether attention over a long event history adds prospective skill over ETAS's local triggering, for some region, is an open empirical question — to be answered in the harness, not asserted.
  • A cautionary data-point. That the strongest-likelihood neural TPP still does not beat ETAS on earthquakes under fair testing is exactly why the product ships an ETAS-class core and gates all neural work behind demonstrated prospective skill + calibration.

In OEF terms, THP maximizes expressive history modeling; the seismic evidence keeps the burden of proof squarely on demonstrated prospective skill against the physics-informed baseline.


10. Worked illustration

Consider a horizon just after the most recent event $t_i$, and suppose attention over the history produces $\mathbf{w}^{\top} h(t_i) = -0.7$, with base $b = 0$. For elapsed time $s = t - t_i$, fold the normalizing scale $t_i$ into an effective decay rate $\alpha' = \alpha/t_i$, so that the interpolation term $\alpha\, s / t_i$ becomes $\alpha' s$, and take $\alpha' = -0.4\,\text{day}^{-1}$ for this illustration:

$$\lambda^*(t_i + s) = \text{softplus}(-0.4\,s - 0.7) = \log\!\big(1 + e^{-0.4 s - 0.7}\big)\quad[\text{events/day}].$$

Evaluate:

| elapsed $s$ (days) | argument | $\lambda^*$ = softplus (events/day) |
|---|---|---|
| 0.0 | $-0.70$ | $0.403$ |
| 1.0 | $-1.10$ | $0.287$ |
| 2.0 | $-1.50$ | $0.201$ |
| 7.0 | $-3.50$ | $0.030$ |

The intensity starts at ~0.40/day and relaxes smoothly: an Omori-flavoured decay whose level was set by attention over the entire history (so a large mainshock far in the past could have lifted that starting level through a direct attention edge, where an RNN might have forgotten it). The probability of at least one event within $H = 2$ days uses the compensator (numerically, $\Lambda = \int_0^2 \lambda^*\,ds \approx 0.585$):

$$P(\ge 1 \text{ event in 2 days}) = 1 - e^{-0.585} \approx 44\%.$$

As always, this is a bounded, calibratable probability conditioned on history — never an alarm, never a deterministic call. A passing reliability diagram, not a single outcome, is what would validate it.
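The numbers in this illustration can be checked with a short script. It is a sketch under the same assumptions as the example (readout $-0.7$, effective decay rate $-0.4$ per day, $b = 0$); the helper names `lam` and `compensator` are hypothetical, and composite Simpson's rule is used here simply as a deterministic stand-in for the Monte-Carlo or quadrature approximation discussed in §5.

```python
import math

def softplus(x):
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def lam(s, alpha_prime=-0.4, wh=-0.7, b=0.0):
    """Intensity (events/day) s days after the most recent event."""
    return softplus(alpha_prime * s + wh + b)

def compensator(horizon, n=1000):
    """Composite Simpson's rule for the integral of lam over [0, horizon].

    n is the number of sub-intervals and must be even.
    """
    h = horizon / n
    total = lam(0.0) + lam(horizon)
    total += 4 * sum(lam((2 * k - 1) * h) for k in range(1, n // 2 + 1))
    total += 2 * sum(lam(2 * k * h) for k in range(1, n // 2))
    return total * h / 3

for s in (0.0, 1.0, 2.0, 7.0):
    print(f"s={s:3.1f} days  lambda={lam(s):.3f} events/day")

Lambda = compensator(2.0)
p_any = 1.0 - math.exp(-Lambda)
print(f"Lambda={Lambda:.3f}  P(at least one event in 2 days)={p_any:.3f}")
```

Running it prints the table rows and the two-day probability, so the figures quoted in this section can be verified independently of any trained model.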


References

  1. Zuo, S., Jiang, H., Li, Z., Zhao, T. & Zha, H. (2020). Transformer Hawkes Process. Proceedings of the 37th International Conference on Machine Learning (ICML), PMLR v119, 11692–11702. https://proceedings.mlr.press/v119/zuo20a.html
  2. Zhang, Q., Lipani, A., Kirnap, O. & Yilmaz, E. (2020). Self-Attentive Hawkes Process. ICML 2020, PMLR v119. arXiv:1907.07561
  3. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017. arXiv:1706.03762
  4. Mei, H. & Eisner, J. (2017). The Neural Hawkes Process: A Neurally Self-Modulating Multivariate Point Process. NeurIPS 2017. arXiv:1612.09328
  5. Du, N., Dai, H., Trivedi, R., Upadhyay, U., Gomez-Rodriguez, M. & Song, L. (2016). Recurrent Marked Temporal Point Processes. KDD 2016. doi:10.1145/2939672.2939875
  6. Ogata, Y. (1988). Statistical models for earthquake occurrences and residual analysis for point processes. JASA 83(401), 9–27. doi:10.1080/01621459.1988.10478560
  7. Shchur, O., Türkmen, A.C., Januschowski, T. & Günnemann, S. (2021). Neural Temporal Point Processes: A Review. IJCAI 2021 Survey Track. arXiv:2104.03528
  8. Stockman, S., Lawson, D. & Werner, M.J. (2026, accepted). EarthquakeNPP: A Benchmark for Earthquake Forecasting with Neural Point Processes. TMLR. arXiv:2410.08226

See also: Temporal Point Processes · RMTPP · Neural Hawkes Process · Models — ML · Models — Classical · Evaluation · Honest-Limits.
