
Neural Hawkes Process

Felipe Santibañez-Leal edited this page Jun 17, 2026 · 1 revision

Neural Hawkes Process — continuous-time LSTM (Mei & Eisner, 2017)

One topic, in depth. The Neural Hawkes Process (NHP) is the neural temporal point process that fixed RMTPP's biggest rigidity: instead of freezing the history vector between events, it lets the hidden state evolve continuously in time via a continuous-time LSTM. This grants the model an ability a classical Hawkes process structurally lacks — inhibition, where a past event can lower the future rate. This page derives the continuous-time cell, the intensity, and the likelihood, and gives an honest account of its relevance to seismic forecasting.

Honest framing up front. NHP was introduced and validated on non-seismic event streams (retail, social media, electronic health records, synthetic self-modulating processes). It is a foundational neural-TPP architecture, not an earthquake model. Its log-likelihood wins on those domains do not automatically transfer to seismicity: under fair, prospective, CSEP-style testing on earthquake catalogs, no neural point process has robustly beaten a well-fit ETAS as of 2026 (see Models — ML §4). NHP belongs in this wiki as a key conceptual step between RMTPP and Transformer Hawkes, and because its self-modulation idea (events can excite or inhibit) is genuinely relevant to seismicity's mix of triggering and quiescence.


Table of contents

  1. Intuition: a self-modulating Hawkes process
  2. From discrete-state to continuous-time LSTM
  3. The conditional intensity
  4. Excitation and inhibition (what classical Hawkes cannot do)
  5. The likelihood and its Monte-Carlo compensator
  6. Parameter estimation and practicalities
  7. Strengths
  8. Limitations — and why they matter for seismicity
  9. Role in operational earthquake forecasting
  10. Worked illustration
  11. References

1. Intuition: a self-modulating Hawkes process

A classical Hawkes process (the point-process skeleton of ETAS) is purely self-exciting: every past event adds a non-negative kick to the intensity, $\lambda^*(t) = \mu + \sum_{i:t_i<t}\phi(t-t_i)$ with $\phi \ge 0$. It can model "events beget events" (aftershocks) but cannot model an event making the future less likely — there is no mechanism for inhibition or for the background itself to ebb and flow.
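As a minimal sketch of the purely self-exciting form above, the snippet below evaluates $\lambda^*(t)$ with an exponential kernel $\phi(u) = \alpha e^{-\beta u}$. The kernel shape and the values of `mu`, `alpha`, `beta` and the event times are assumptions chosen for illustration, not fitted quantities.

```python
import math

def hawkes_intensity(t, event_times, mu=0.2, alpha=0.5, beta=1.0):
    """lambda*(t) = mu + sum over past events of alpha * exp(-beta * (t - t_i))."""
    kicks = sum(alpha * math.exp(-beta * (t - ti)) for ti in event_times if ti < t)
    return mu + kicks

events = [1.0, 1.5, 4.0]  # hypothetical event times (days)
for t in (0.5, 1.6, 3.9, 4.1):
    print(f"t={t:.1f}  lambda*={hawkes_intensity(t, events):.3f}")
```

Every kick is non-negative, so the intensity can never fall below `mu`; that floor is the rigidity the self-modulating model removes.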

Mei & Eisner (2017) generalized this into a neurally self-modulating point process. The idea: let a recurrent hidden state $h(t)$ represent the system, let each event update that state, and let the state decay continuously toward a baseline between events — but allow the decay targets and gates to be learned, so an event can drive the intensity up or down. The intensity is then a smooth, positive function of $h(t)$. Because $h(t)$ keeps moving between events (unlike RMTPP's frozen $h_j$), the intensity can be non-monotone within an interval, and the influence of an event can change sign. Mei & Eisner call this a self-modulating multivariate point process.

For seismicity this is conceptually attractive: real catalogs show both excitation (Omori aftershock bursts) and effective quiescence (rate drops, stress shadows, post-seismic relaxation patterns) that a purely additive, non-negative kernel cannot represent.


2. From discrete-state to continuous-time LSTM

A standard LSTM updates a cell state $c$ only at discrete steps. NHP's central device is the continuous-time LSTM (cLSTM): between events the cell state does not stay constant — it decays exponentially from its post-event value $c_j$ toward a learned steady-state target $\bar{c}_j$, at a learned rate $\delta_j$:

$$c(t) = \bar{c}_{j} + (c_{j} - \bar{c}_{j})\,\exp\!\big(-\delta_{j}\,(t - t_j)\big), \qquad t \in (t_j, t_{j+1}].$$

  • $c_j$ — the cell state immediately after event $j$ (the usual LSTM input/forget/cell update, applied at the event).
  • $\bar{c}_j$ — the target the cell decays toward between events (a second, learned cell vector).
  • $\delta_j > 0$ — a learned per-dimension decay rate (output of a softplus gate). Large $\delta$ means fast relaxation to baseline; small $\delta$ means long memory.

The hidden state is then $h(t) = o_j \odot \tanh\!\big(c(t)\big)$, where $o_j$ is the LSTM output gate at event $j$ and $\odot$ is elementwise product. So $h(t)$ is a smooth, continuously varying vector between events, which is the key difference from RMTPP's piecewise-constant $h_j$.
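The decay rule and the hidden-state readout can be sketched in a few lines. This is an illustrative, per-dimension implementation using plain Python lists; the two-dimensional values of `c_j`, `c_bar_j`, `delta_j` and `o_j` are hypothetical, chosen so that one dimension relaxes quickly and the other slowly.

```python
import math

def cell_state(t, t_j, c_j, c_bar_j, delta_j):
    """Per-dimension decay c(t) = c_bar + (c_j - c_bar) * exp(-delta * (t - t_j))."""
    dt = t - t_j
    return [cb + (c - cb) * math.exp(-d * dt)
            for c, cb, d in zip(c_j, c_bar_j, delta_j)]

def hidden_state(t, t_j, c_j, c_bar_j, delta_j, o_j):
    """h(t) = o_j * tanh(c(t)), elementwise."""
    c_t = cell_state(t, t_j, c_j, c_bar_j, delta_j)
    return [o * math.tanh(c) for o, c in zip(o_j, c_t)]

# Hypothetical state after an event at t_j = 0: a fast and a slow dimension.
t_j = 0.0
c_j, c_bar_j = [1.0, -0.5], [0.0, 0.5]
delta_j, o_j = [2.0, 0.1], [0.9, 0.6]

for t in (0.0, 0.5, 2.0, 10.0):
    h = hidden_state(t, t_j, c_j, c_bar_j, delta_j, o_j)
    print(t, [round(x, 3) for x in h])

# Time for each dimension to close half the gap to its target: ln(2) / delta.
half_lives = [math.log(2) / d for d in delta_j]
print([round(x, 2) for x in half_lives])
```

The half-life $\ln 2 / \delta$ makes the "fast relaxation versus long memory" reading of $\delta$ concrete: here the first dimension forgets in well under a day while the second retains context for about a week.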

At each event the cLSTM performs a standard gated update (input, forget, output gates, plus the extra decay-rate gate producing $\delta_j$), reading the event's mark and timing. The result: a recurrent state that both jumps at events and drifts between them.
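For reference, the gated update at event $j$ can be written out as below. This is a sketch in the form given by Mei & Eisner (2017), re-indexed to this page's convention; the weight names ($W_\cdot$, $U_\cdot$, $d_\cdot$), the one-hot mark vector $\mathbf{k}_j$, and the left-limit notation $t_j^-$ (the state just before the event, after decaying over the previous interval) are labels assumed here and should be checked against the paper before being relied on.

$$\begin{aligned}
i_j &= \sigma\big(W_i \mathbf{k}_j + U_i h(t_j^-) + d_i\big), & \bar{i}_j &= \sigma\big(W_{\bar{i}} \mathbf{k}_j + U_{\bar{i}} h(t_j^-) + d_{\bar{i}}\big), \\
f_j &= \sigma\big(W_f \mathbf{k}_j + U_f h(t_j^-) + d_f\big), & \bar{f}_j &= \sigma\big(W_{\bar{f}} \mathbf{k}_j + U_{\bar{f}} h(t_j^-) + d_{\bar{f}}\big), \\
z_j &= 2\,\sigma\big(W_z \mathbf{k}_j + U_z h(t_j^-) + d_z\big) - 1, & o_j &= \sigma\big(W_o \mathbf{k}_j + U_o h(t_j^-) + d_o\big), \\
c_j &= f_j \odot c(t_j^-) + i_j \odot z_j, & \bar{c}_j &= \bar{f}_j \odot \bar{c}_{j-1} + \bar{i}_j \odot z_j, \\
\delta_j &= \text{softplus}\big(W_d \mathbf{k}_j + U_d h(t_j^-) + d_d\big).
\end{aligned}$$

The event's timing enters implicitly: $h(t_j^-)$ and $c(t_j^-)$ depend on how long the state has been decaying since event $j-1$.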

flowchart LR
    subgraph "Between events: continuous decay"
      C0["cⱼ (post-event)"] -->|"c(t)=c̄ⱼ+(cⱼ−c̄ⱼ)e^(−δⱼ(t−tⱼ))"| HT["h(t)=oⱼ⊙tanh(c(t))"]
    end
    HT --> INT["λ*ₖ(t)=softplus(wₖᵀ h(t))"]
    EV["event (tⱼ,kⱼ)"] -->|gated cLSTM update| C0
    EV -.->|learned δⱼ via softplus| C0
    INT --> LL["log-likelihood<br/>Σ log λ* − ∫ λ* dτ (Monte-Carlo)"]

3. The conditional intensity

For each event type (mark) $k$, the conditional intensity is a softplus of a linear projection of the continuously evolving hidden state:

$$\boxed{\;\lambda^*_k(t) = f_k\!\big(\mathbf{w}_k^{\top} h(t)\big), \qquad f_k(x) = s_k\,\log\!\big(1 + e^{x / s_k}\big)\;}$$

where $f_k$ is a scaled softplus (a smooth, strictly positive function) and $s_k > 0$ is a learned scale parameter per type. The total intensity is $\lambda^*(t) = \sum_k \lambda^*_k(t)$.

Why softplus rather than RMTPP's $\exp$? Two reasons. (1) Positivity without explosion: softplus is positive everywhere yet grows only linearly for large arguments, avoiding the runaway intensities $\exp$ can produce. (2) Sign-agnostic input: because $h(t)$ can be positive or negative and moves continuously, $\mathbf{w}_k^{\top} h(t)$ can rise or fall over an interval, so $\lambda^*_k(t)$ can be non-monotone between events — the expressive gain over RMTPP (whose log-intensity is linear, hence monotone, between events; see RMTPP §8).

The mark distribution at an event is obtained from the per-type intensities, $P(k \mid t) = \lambda^*_k(t)\,/\,\sum_{k'}\lambda^*_{k'}(t)$, a natural multivariate competing-risks form.
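A small sketch of the intensity and mark-distribution formulas in this section. The softplus is written in a numerically stable form so that large projections do not overflow; the hidden state `h`, weight rows `W` and `scales` are made-up values for three hypothetical event types.

```python
import math

def scaled_softplus(x, s=1.0):
    """f(x) = s * log(1 + exp(x / s)), rearranged to avoid overflow for large |x|."""
    u = x / s
    return s * (max(u, 0.0) + math.log1p(math.exp(-abs(u))))

def intensities(h, W, scales):
    """lambda*_k = f_k(w_k . h) for every event type k."""
    return [scaled_softplus(sum(w * x for w, x in zip(w_k, h)), s_k)
            for w_k, s_k in zip(W, scales)]

def mark_probabilities(lams):
    """P(k | t) = lambda*_k / sum over k' of lambda*_k'."""
    total = sum(lams)
    return [lam / total for lam in lams]

h = [0.4, -0.2]                              # hypothetical h(t)
W = [[1.0, 0.5], [-2.0, 1.0], [0.0, 0.0]]    # one weight row per type
scales = [1.0, 0.5, 1.0]

lams = intensities(h, W, scales)
probs = mark_probabilities(lams)
print([round(x, 3) for x in lams], round(sum(lams), 3))
print([round(p, 3) for p in probs])
```

Note that a type whose projection is zero still has intensity $s_k \log 2$, not zero: softplus keeps every type possible at all times, and only strongly negative projections push a rate toward zero.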


4. Excitation and inhibition (what classical Hawkes cannot do)

This is the headline capability. In a classical Hawkes process the kernel $\phi \ge 0$, so the intensity can only be pushed up by past events; the contribution of an event monotonically relaxes back to (but never below) baseline. NHP removes that restriction:

  • An event updates the cell state $c_j$ and the decay target $\bar{c}_j$. If the update drives $\mathbf{w}_k^{\top} h(t)$ downward, the intensity for type $k$ decreases after that event — inhibition. So "this event makes the next one less likely / later" is representable.
  • The baseline itself is self-modulating: $h(t)$ drifts toward a learned target, so the effective background rate is not a fixed $\mu$ but a state-dependent, evolving quantity.

For seismicity, inhibition and a modulating baseline map (loosely) onto phenomena a purely excitatory ETAS handles only by hand: stress shadows, post-mainshock rate deficits in some regions, and non-stationary background driven by transients. NHP can represent these in principle. Whether it estimates them reliably from limited earthquake data — and whether that helps a calibrated forecast — is a different question, answered honestly in §8.
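To make the sign change concrete, the sketch below follows a single hidden dimension after two hypothetical events: one that leaves the cell state above its decay target and one that leaves it below. All numbers (`c_post`, `c_bar`, `delta`, and the unit gate and weight) are assumptions for illustration.

```python
import math

def softplus(x):
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def intensity(dt, c_post, c_bar, delta, o=1.0, w=1.0):
    """Intensity dt after an event, for one hidden dimension."""
    c_t = c_bar + (c_post - c_bar) * math.exp(-delta * dt)
    return softplus(w * o * math.tanh(c_t))

lags = (0.0, 0.5, 1.0, 3.0)
baseline = softplus(0.0)  # long-run level when the decay target is 0

# Same decay target and rate; only the post-event cell state differs in sign.
excite = [intensity(dt, c_post=+1.5, c_bar=0.0, delta=1.0) for dt in lags]
inhibit = [intensity(dt, c_post=-1.5, c_bar=0.0, delta=1.0) for dt in lags]

print("baseline", round(baseline, 3))
print("excite  ", [round(x, 3) for x in excite])
print("inhibit ", [round(x, 3) for x in inhibit])
```

The excitatory event starts above the baseline and relaxes down to it; the inhibitory event starts below the baseline and recovers up to it. A non-negative Hawkes kernel can produce only the first of these two shapes.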


5. The likelihood and its Monte-Carlo compensator

NHP is trained on the standard point-process log-likelihood:

$$\log\mathcal{L} = \sum_{j=1}^{n} \log \lambda^*_{k_j}(t_j) \;-\; \int_0^T \lambda^*(\tau)\,d\tau, \qquad \lambda^*(\tau) = \sum_k \lambda^*_k(\tau).$$

Unlike RMTPP, the compensator $\int_0^T \lambda^*(\tau)\,d\tau$ has no closed form: the hidden state $h(\tau)$ evolves nonlinearly (softplus of a projection of an exponentially decaying cell state), so the integral is approximated by Monte-Carlo. On each inter-event interval $(t_j, t_{j+1}]$ one samples times $\tau_\ell$ uniformly and estimates

$$\int_{t_j}^{t_{j+1}} \lambda^*(\tau)\,d\tau \approx \frac{t_{j+1}-t_j}{L}\sum_{\ell=1}^{L} \lambda^*(\tau_\ell).$$

This is the practical price of continuous-time expressiveness — the survival penalty is now an estimated quantity rather than an analytic one. It remains the same compensator that makes the model calibratable: it penalizes intensity placed where no event occurred. (This is the term that classification/regression framings discard, the structural failure noted on Models — ML.)
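The estimator can be sketched as follows for a single event type, so that $\log\lambda^*_{k_j}$ reduces to $\log\lambda^*$. The intensity `lam` is a toy stand-in (an assumption, not a trained cLSTM) in which the projected state relaxes the same way after every event, and treating time 0 as a state reset is a further simplification.

```python
import math
import random

def softplus(x):
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def lam(tau, t_last):
    """Toy total intensity (assumed): projected state relaxes from +0.5 to -1.0."""
    return softplus(-1.0 + 1.5 * math.exp(-(tau - t_last)))

def mc_compensator(t_a, t_b, intensity, n_samples, rng):
    """Estimate the integral of intensity over the interval from uniform samples."""
    width = t_b - t_a
    total = sum(intensity(t_a + width * rng.random()) for _ in range(n_samples))
    return width * total / n_samples

def log_likelihood(event_times, T, n_samples=200, seed=0):
    """Sum of log-intensities at events minus the Monte-Carlo compensator on [0, T]."""
    rng = random.Random(seed)
    ll, t_prev = 0.0, 0.0                    # assumption: state reset at time 0
    boundaries = list(event_times) + [T]     # T closes the window; it is not an event
    for idx, t in enumerate(boundaries):
        since_prev = lambda tau, t0=t_prev: lam(tau, t0)
        ll -= mc_compensator(t_prev, t, since_prev, n_samples, rng)
        if idx < len(event_times):
            ll += math.log(since_prev(t))
        t_prev = t
    return ll

events = [0.4, 1.1, 1.3, 3.0]                # hypothetical event times (days)
print(round(log_likelihood(events, T=4.0), 3))
```

Rerunning with a different `seed` or a smaller `n_samples` changes the result slightly, which is the Monte-Carlo noise in the survival penalty that the text describes.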


6. Parameter estimation and practicalities

  • Parameters. All cLSTM gate weights (input/forget/output, plus the decay-rate gate producing $\delta_j$), including those that produce the two cell vectors per step ($c_j$, $\bar{c}_j$), the per-type intensity weights $\mathbf{w}_k$ and scales $s_k$, and the mark embeddings. All are trained end-to-end by stochastic gradient ascent (BPTT) on the Monte-Carlo log-likelihood.
  • Compensator samples ($L$). A variance/compute knob: the Monte-Carlo estimate of the integral is unbiased for any $L$, but too few samples give a noisy survival penalty and too many are slow. Implementations tune $L$ per interval length.
  • Cost. Heavier than RMTPP (continuous-state decay + MC integration), but still cheap at inference relative to large ETAS simulations.
  • Data hunger / overfitting. Like every neural TPP, NHP has many parameters and needs many effectively-independent sequences. Seismic catalogs supply few independent large sequences — the exact condition under which over-parameterized models overfit (the DeVries–Mignan lesson, Models — ML §5). The product's response is unchanged: any neural model must beat ETAS in a strictly temporal, prospective CSEP harness and pass calibration before it ships.

7. Strengths

  • Inhibition / self-modulation. Can represent an event lowering the future rate and a time-varying baseline — strictly more expressive than a non-negative-kernel Hawkes/ETAS.
  • Non-monotone intra-interval intensity. Because $h(t)$ evolves continuously, the intensity can rise and fall between events — fixing RMTPP's monotone-interval rigidity.
  • Continuous-time memory with controllable horizon. The learned decay rates $\delta_j$ let the model keep some dimensions long-lived (slow context) and others short-lived (recent dynamics).
  • Principled probabilistic object. Trained on the true point-process likelihood with a (Monte-Carlo) compensator, so it yields a proper, calibratable intensity — not a classifier.

8. Limitations — and why they matter for seismicity

  • Validated off-domain. NHP's gains are on retail/EHR/social/synthetic streams. They do not auto-transfer to earthquakes. On earthquake catalogs under fair temporal splits, neural TPPs of this family have not robustly beaten ETAS (EarthquakeNPP; see Models — ML §4).
  • Recurrent memory bottleneck. A single evolving state must summarize all history; very long-range or cross-fault dependencies can still be lost to recurrence decay — the gap that motivates the attention-based Transformer Hawkes Process.
  • No native spatial kernel. NHP is a multivariate temporal process (event types), not a continuous spatial one. Seismic forecasting needs a spatial density over $(x,y)$; discretizing space into "types" is crude and scales poorly.
  • No built-in seismic physics. No Omori law, no Gutenberg–Richter magnitude distribution, no branching-ratio subcriticality constraint unless added by hand. A free NHP can drift from seismologically sensible behaviour.
  • Costlier, approximate likelihood. The Monte-Carlo compensator adds variance and compute relative to ETAS's analytic integrals and RMTPP's closed form.
  • Interpretability. Parameters are not seismologically meaningful (no $p$, $c$, $\alpha$, $b$), making expert review and uncertainty propagation harder than for ETAS.

Net assessment for this product. NHP's self-modulation is a genuinely interesting capability, and it is studied in the neural-challenger research track. But it is never a default forecaster. If a seismic adaptation of this idea is pursued, it keeps the ETAS skeleton (additive background + summed triggering), models magnitude and space explicitly, and must clear the hard gate: a prospective CSEP win over a well-fit ETAS plus a passing reliability diagram.


9. Role in operational earthquake forecasting

NHP has no direct operational role in CAOS_SEISMIC today. Its contributions are conceptual:

  • Self-modulation as a design idea. The notion that the background and the event-influence can be learned and time-varying — including inhibition — is a useful lens on non-stationary seismicity (transients, stress shadows) that a fixed-$\mu$, non-negative-kernel ETAS handles only by hand.
  • Continuous-time state as a template. The cLSTM's continuously evolving hidden state is the recurrent counterpart to attention-based history summaries, both of which a seismic neural challenger might use on top of an ETAS inductive bias rather than as a free-form replacement.
  • A cautionary data-point. That such an expressive, principled model still does not beat ETAS on earthquakes under fair testing is exactly why the product ships an ETAS-class core and gates all neural work behind prospective CSEP skill + calibration.

In OEF terms, NHP widens what an intensity can express; the seismic evidence keeps the burden of proof on demonstrated prospective skill, not expressiveness.


10. Worked illustration

Take a single intensity dimension (drop the type subscript) and fix the learned softplus scale at 1. Suppose right after an event the projection is $\mathbf{w}^{\top} h(t_j) = +0.5$ but the decay target is negative, $\mathbf{w}^{\top}\bar h = -1.0$, with per-dimension relaxation such that the projected state, as a function of elapsed time $s$, decays as

$$z(s) \equiv \mathbf{w}^{\top} h(t_j + s) = -1.0 + (0.5 - (-1.0))\,e^{-\delta s} = -1.0 + 1.5\,e^{-\delta s}, \qquad \delta = 1.0\,\text{day}^{-1}.$$

The intensity is $\lambda^*(t_j + s) = \text{softplus}(z(s)) = \log(1 + e^{z(s)})$. Evaluate:

| elapsed $s$ (days) | $z(s)$ | $\lambda^*$ = softplus$(z)$ (events/day) |
|---|---|---|
| 0.0 | $+0.50$ | $0.974$ |
| 0.5 | $-0.090$ | $0.649$ |
| 1.0 | $-0.448$ | $0.494$ |
| 2.0 | $-0.797$ | $0.372$ |
| $\to\infty$ | $-1.00$ | $0.313$ |
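The rows above can be recomputed with a few lines of Python. This is only a check of the arithmetic under the stated assumptions (post-event projection $+0.5$, target $-1.0$, $\delta = 1$ per day, unit softplus scale); the function names are illustrative.

```python
import math

def z(s, delta=1.0, z_post=0.5, z_target=-1.0):
    """Projected state s days after the event."""
    return z_target + (z_post - z_target) * math.exp(-delta * s)

def softplus(x):
    return math.log1p(math.exp(x))

print("   s     z(s)  lambda*")
for s in (0.0, 0.5, 1.0, 2.0):
    print(f"{s:4.1f}  {z(s):+.3f}  {softplus(z(s)):.3f}")
print(f" inf  {-1.0:+.3f}  {softplus(-1.0):.3f}")
```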

The intensity starts elevated (just under 1/day) and decays monotonically toward a baseline of $\text{softplus}(-1.0) \approx 0.313$/day, never dipping below it. Had the post-event projection instead been below the decay target, the intensity would start suppressed and recover upward toward the baseline: an inhibition-then-recovery shape that a non-negative Hawkes kernel cannot produce. With several hidden dimensions decaying at different rates, the projection can also rise and fall within a single interval, which RMTPP's monotone interval rules out. That is the qualitative capability NHP adds.

The probability of at least one event within $H = 1$ day requires the compensator, which NHP estimates by Monte-Carlo; numerically integrating $\lambda^*$ over $[0,1]$ here gives $\Lambda \approx 0.68$, so $P(\ge 1 \text{ in 1 day}) = 1 - e^{-0.68} \approx 49\%$. This is still a bounded, calibratable probability, never an alarm.
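The one-day probability can be checked by deterministic quadrature rather than Monte-Carlo, which is feasible here because the illustration is one-dimensional. The trapezoid rule and the step count are choices made for this sketch, and it assumes no further event resets the state inside the horizon.

```python
import math

def intensity(s, delta=1.0):
    """softplus of the projected state s days after the event."""
    z = -1.0 + 1.5 * math.exp(-delta * s)
    return math.log1p(math.exp(z))

def trapezoid(f, a, b, n=1000):
    """Composite trapezoid rule with n equal steps."""
    h = (b - a) / n
    inner = sum(f(a + i * h) for i in range(1, n))
    return h * (0.5 * f(a) + inner + 0.5 * f(b))

horizon_days = 1.0
Lambda = trapezoid(intensity, 0.0, horizon_days)
p_at_least_one = 1.0 - math.exp(-Lambda)
print(f"Lambda = {Lambda:.3f}, P(>=1 event) = {p_at_least_one:.3f}")
```

Because the intensity lies between roughly 0.49 and 0.97 events/day over the first day, the integral must fall between those two values, which gives a quick sanity bound on any quoted $\Lambda$.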


References

  1. Mei, H. & Eisner, J. (2017). The Neural Hawkes Process: A Neurally Self-Modulating Multivariate Point Process. Advances in Neural Information Processing Systems (NeurIPS) 30. arXiv:1612.09328
  2. Du, N., Dai, H., Trivedi, R., Upadhyay, U., Gomez-Rodriguez, M. & Song, L. (2016). Recurrent Marked Temporal Point Processes. KDD 2016. doi:10.1145/2939672.2939875
  3. Hawkes, A.G. (1971). Spectra of some self-exciting and mutually exciting point processes. Biometrika 58(1), 83–90. doi:10.1093/biomet/58.1.83
  4. Ogata, Y. (1988). Statistical models for earthquake occurrences and residual analysis for point processes. JASA 83(401), 9–27. doi:10.1080/01621459.1988.10478560
  5. Shchur, O., Türkmen, A.C., Januschowski, T. & Günnemann, S. (2021). Neural Temporal Point Processes: A Review. IJCAI 2021 Survey Track. arXiv:2104.03528
  6. Zuo, S., Jiang, H., Li, Z., Zhao, T. & Zha, H. (2020). Transformer Hawkes Process. ICML 2020, PMLR v119, 11692–11702.
  7. Stockman, S., Lawson, D. & Werner, M.J. (2026, accepted). EarthquakeNPP: A Benchmark for Earthquake Forecasting with Neural Point Processes. TMLR. arXiv:2410.08226

See also: Temporal Point Processes · RMTPP · Transformer Hawkes Process · Models — ML · Models — Classical · Honest-Limits.
