Neural Hawkes Process

Neural Hawkes Process — continuous-time LSTM (Mei & Eisner, 2017)

One topic, in depth. The Neural Hawkes Process (NHP) is the neural temporal point process that fixed RMTPP's biggest rigidity: instead of freezing the history vector between events, it lets the hidden state evolve continuously in time via a continuous-time LSTM. This grants the model an ability a classical Hawkes process structurally lacks — inhibition, where a past event can lower the future rate. This page derives the continuous-time cell, the intensity, and the likelihood, and gives an honest account of its relevance to seismic forecasting.

Honest framing up front. NHP was introduced and validated on non-seismic event streams (retail, social media, electronic health records, synthetic self-modulating processes). It is a foundational neural-TPP architecture, not an earthquake model. Its log-likelihood wins on those domains do not automatically transfer to seismicity: under fair, prospective, CSEP-style testing on earthquake catalogs, no neural point process has robustly beaten a well-fit ETAS as of 2026 (see Models — ML §4). NHP belongs in this wiki as a key conceptual step between RMTPP and Transformer Hawkes, and because its self-modulation idea (events can excite or inhibit) is genuinely relevant to seismicity's mix of triggering and quiescence.

Intuition: a self-modulating Hawkes process
From discrete-state to continuous-time LSTM
The conditional intensity
Excitation and inhibition (what classical Hawkes cannot do)
The likelihood and its Monte-Carlo compensator
Parameter estimation and practicalities
Strengths
Limitations — and why they matter for seismicity
Role in operational earthquake forecasting
Worked illustration
References

1. Intuition: a self-modulating Hawkes process

A classical Hawkes process (the point-process skeleton of ETAS) is purely self-exciting: every past event adds a non-negative kick to the intensity, $\lambda^*(t) = \mu + \sum_{i:t_i<t}\phi(t-t_i)$ with $\phi \ge 0$. It can model "events beget events" (aftershocks) but cannot model an event making the future less likely — there is no mechanism for inhibition or for the background itself to ebb and flow.
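As a point of comparison for what follows, here is a minimal sketch of the purely self-exciting intensity above. The exponential kernel $\phi(u) = \alpha\beta e^{-\beta u}$ and the values of `mu`, `alpha`, and `beta` are illustrative assumptions, not taken from the source; the point is only that every term in the sum is non-negative, so the rate never falls below `mu`.

```python
import math

def hawkes_intensity(t, event_times, mu=0.2, alpha=0.8, beta=1.0):
    """Classical Hawkes intensity with an assumed exponential kernel.

    phi(u) = alpha * beta * exp(-beta * u) is non-negative for all u >= 0,
    so each past event can only raise the rate above the background mu.
    """
    kicks = sum(alpha * beta * math.exp(-beta * (t - ti))
                for ti in event_times if ti < t)
    return mu + kicks

# Hypothetical event times (arbitrary units).
events = [1.0, 1.5, 4.0]
for t in (0.5, 1.2, 2.0, 5.0, 50.0):
    print(f"t={t:5.1f}  lambda={hawkes_intensity(t, events):.4f}")
```

Whatever the event history, the printed rate stays at or above `mu`: there is no setting of this model in which an event lowers the future rate.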

Mei & Eisner (2017) generalized this into a neurally self-modulating point process. The idea: let a recurrent hidden state $h(t)$ represent the system, let each event update that state, and let the state decay continuously toward a baseline between events — but allow the decay targets and gates to be learned, so an event can drive the intensity up or down. The intensity is then a smooth, positive function of $h(t)$. Because $h(t)$ keeps moving between events (unlike RMTPP's frozen $h_j$), the intensity can be non-monotone within an interval, and the influence of an event can change sign. Mei & Eisner call this a self-modulating multivariate point process.

For seismicity this is conceptually attractive: real catalogs show both excitation (Omori aftershock bursts) and effective quiescence (rate drops, stress shadows, post-seismic relaxation patterns) that a purely additive, non-negative kernel cannot represent.

2. From discrete-state to continuous-time LSTM

A standard LSTM updates a cell state $c$ only at discrete steps. NHP's central device is the continuous-time LSTM (cLSTM): between events the cell state does not stay constant — it decays exponentially from its post-event value $c_j$ toward a learned steady-state target $\bar{c}_j$, at a learned rate $\delta_j$:

$$c(t) = \bar{c}_{j} + (c_{j} - \bar{c}_{j})\,\exp\!\big(-\delta_{j}\,(t - t_j)\big), \qquad t \in (t_j, t_{j+1}].$$

$c_j$ — the cell state immediately after event $j$ (the usual LSTM input/forget/cell update, applied at the event).
$\bar{c}_j$ — the target the cell decays toward between events (a second, learned cell vector).
$\delta_j > 0$ — a learned per-dimension decay rate (output of a softplus gate). Large $\delta$ means fast relaxation to baseline; small $\delta$ means long memory.

The hidden state is then $h(t) = o_j \odot \tanh\!\big(c(t)\big)$, where $o_j$ is the LSTM output gate at event $j$ and $\odot$ is elementwise product. So $h(t)$ is a smooth, continuously varying vector between events, the key difference from RMTPP's piecewise-constant $h_j$.

At each event the cLSTM performs a standard gated update (input, forget, output gates, plus the extra decay-rate gate producing $\delta_j$), reading the event's mark and timing. The result: a recurrent state that both jumps at events and drifts between them.
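The between-event drift can be sketched in a few lines. This is an illustration under stated assumptions, not the reference implementation: the function names and the 2-dimensional values of `c_j`, `c_bar_j`, `delta_j`, and `o_j` are hypothetical, and in the real model they would be produced by the gated update at event $j$.

```python
import math

def cell_state(t, t_j, c_j, c_bar_j, delta_j):
    """c(t): elementwise exponential decay from c_j toward c_bar_j after time t_j."""
    dt = t - t_j
    return [cb + (c - cb) * math.exp(-d * dt)
            for c, cb, d in zip(c_j, c_bar_j, delta_j)]

def hidden_state(t, t_j, c_j, c_bar_j, delta_j, o_j):
    """h(t) = o_j * tanh(c(t)), elementwise."""
    c_t = cell_state(t, t_j, c_j, c_bar_j, delta_j)
    return [o * math.tanh(c) for o, c in zip(o_j, c_t)]

# Hypothetical post-event quantities for a 2-dimensional state.
t_j = 0.0
c_j = [1.0, -0.5]       # cell state right after the event
c_bar_j = [-0.5, 0.8]   # decay targets
delta_j = [2.0, 0.1]    # fast dimension, slow dimension
o_j = [0.9, 0.6]        # output gate

for t in (0.0, 0.5, 2.0, 20.0):
    h = hidden_state(t, t_j, c_j, c_bar_j, delta_j, o_j)
    print(f"t={t:5.1f}  h(t)={[round(x, 3) for x in h]}")
```

The first dimension relaxes within about a unit of time while the second is still moving at $t = 20$, which is the long-memory versus short-memory split that the per-dimension decay rate controls.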

flowchart LR
    subgraph "Between events: continuous decay"
      C0["cⱼ (post-event)"] -->|"c(t)=c̄ⱼ+(cⱼ−c̄ⱼ)e^(−δⱼ(t−tⱼ))"| HT["h(t)=oⱼ⊙tanh(c(t))"]
    end
    HT --> INT["λ*ₖ(t)=softplus(wₖᵀ h(t))"]
    EV["event (tⱼ,kⱼ)"] -->|gated cLSTM update| C0
    EV -.->|learned δⱼ via softplus| C0
    INT --> LL["log-likelihood<br/>Σ log λ* − ∫ λ* dτ (Monte-Carlo)"]

3. The conditional intensity

For each event type (mark) $k$, the conditional intensity is a softplus of a linear projection of the continuously evolving hidden state:

$$\boxed{\;\lambda^*_k(t) = f_k\!\big(\mathbf{w}_k^{\top} h(t)\big), \qquad f_k(x) = s_k\,\log\!\big(1 + e^{x / s_k}\big)\;}$$

where $f_k$ is a scaled softplus (a smooth, strictly positive function) and $s_k > 0$ is a learned scale parameter per type. The total intensity is $\lambda^*(t) = \sum_k \lambda^*_k(t)$.

Why softplus rather than RMTPP's $\exp$? Two reasons. (1) Positivity without explosion: softplus is positive everywhere yet grows only linearly for large arguments, avoiding the runaway intensities $\exp$ can produce. (2) Sign-agnostic input: because $h(t)$ can be positive or negative and moves continuously, $\mathbf{w}_k^{\top} h(t)$ can rise or fall over an interval, so $\lambda^*_k(t)$ can be non-monotone between events — the expressive gain over RMTPP (whose log-intensity is linear, hence monotone, between events; see RMTPP §8).

The mark distribution at an event is obtained from the per-type intensities, $P(k \mid t) = \lambda^*_k(t)\,/\,\sum_{k'}\lambda^*_{k'}(t)$, a natural multivariate competing-risks form.
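A small sketch of the per-type intensity and the mark distribution may help make the formulas concrete. The function names, the hidden state `h`, the weight rows `W`, and the `scales` are hypothetical values chosen for illustration; the softplus is written in a numerically stable form, which is an implementation choice rather than something the source specifies.

```python
import math

def scaled_softplus(x, s=1.0):
    """f(x) = s * log(1 + exp(x / s)), written to avoid overflow for large |x|."""
    u = x / s
    return s * (max(u, 0.0) + math.log1p(math.exp(-abs(u))))

def type_intensities(h, W, scales):
    """lambda_k = f_k(w_k . h) for each event type k."""
    return [scaled_softplus(sum(w * x for w, x in zip(w_k, h)), s_k)
            for w_k, s_k in zip(W, scales)]

def mark_probabilities(lams):
    """P(k | t) = lambda_k / sum over k' of lambda_k'."""
    total = sum(lams)
    return [lam / total for lam in lams]

# Hypothetical hidden state, projection weights, and scales for 3 event types.
h = [0.4, -0.7]
W = [[1.5, 0.2], [-0.8, 1.1], [0.1, 0.1]]
scales = [1.0, 0.5, 2.0]

lams = type_intensities(h, W, scales)
probs = mark_probabilities(lams)
for k, (lam, p) in enumerate(zip(lams, probs)):
    print(f"type {k}: lambda={lam:.4f}  P(k|t)={p:.4f}")
```

Note that the second type has a negative projection yet still receives a positive intensity, which is the "positivity without explosion" property in practice.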

4. Excitation and inhibition (what classical Hawkes cannot do)

This is the headline capability. In a classical Hawkes process the kernel $\phi \ge 0$, so the intensity can only be pushed up by past events; the contribution of an event monotonically relaxes back to (but never below) baseline. NHP removes that restriction:

An event updates the cell state $c_j$ and the decay target $\bar{c}_j$. If the update drives $\mathbf{w}_k^{\top} h(t)$ downward, the intensity for type $k$ decreases after that event — inhibition. So "this event makes the next one less likely / later" is representable.
The baseline itself is self-modulating: $h(t)$ drifts toward a learned target, so the effective background rate is not a fixed $\mu$ but a state-dependent, evolving quantity.

For seismicity, inhibition and a modulating baseline map (loosely) onto phenomena a purely excitatory ETAS handles only by hand: stress shadows, post-mainshock rate deficits in some regions, and non-stationary background driven by transients. NHP can represent these in principle. Whether it estimates them reliably from limited earthquake data — and whether that helps a calibrated forecast — is a different question, answered honestly in §8.
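One way to see when the rate falls between events is to differentiate the intensity along the decay path. The following is a sketch derived from the definitions in §2 and §3, writing $z_k(t) = \mathbf{w}_k^{\top} h(t)$ and $\sigma$ for the logistic sigmoid (both symbols are introduced here for convenience):

$$\frac{dc}{dt} = -\,\delta_j \odot \big(c(t) - \bar{c}_j\big), \qquad \frac{dh}{dt} = o_j \odot \big(1 - \tanh^2 c(t)\big) \odot \frac{dc}{dt},$$

$$\frac{d\lambda^*_k}{dt} = \sigma\!\big(z_k(t)/s_k\big)\;\mathbf{w}_k^{\top}\frac{dh}{dt}.$$

Because $\sigma > 0$, the sign of $d\lambda^*_k/dt$ is the sign of $\mathbf{w}_k^{\top}\,dh/dt$, a weighted sum over hidden dimensions whose terms can carry different signs and decay at different rates. The intensity falls wherever that sum is negative, and the sum can change sign inside an interval, which is where non-monotone behaviour comes from.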

5. The likelihood and its Monte-Carlo compensator

NHP is trained on the standard point-process log-likelihood:

$$\log\mathcal{L} = \sum_{j=1}^{n} \log \lambda^*_{k_j}(t_j) \;-\; \int_0^T \lambda^*(\tau)\,d\tau, \qquad \lambda^*(\tau) = \sum_k \lambda^*_k(\tau).$$

Unlike RMTPP, the compensator $\int_0^T \lambda^*(\tau)\,d\tau$ has no closed form: the hidden state $h(\tau)$ evolves nonlinearly (softplus of a projection of an exponentially decaying cell state), so the integral is approximated by Monte-Carlo. On each inter-event interval $(t_j, t_{j+1}]$ (with $t_0 = 0$ and $t_{n+1} = T$ closing the first and last intervals) one samples $L$ times $\tau_\ell$ uniformly and estimates

$$\int_{t_j}^{t_{j+1}} \lambda^*(\tau)\,d\tau \approx \frac{t_{j+1}-t_j}{L}\sum_{\ell=1}^{L} \lambda^*(\tau_\ell).$$

This is the practical price of continuous-time expressiveness — the survival penalty is now an estimated quantity rather than an analytic one. It remains the same compensator that makes the model calibratable: it penalizes intensity placed where no event occurred. (This is the term that classification/regression framings discard, the structural failure noted on Models — ML.)
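The assembly of the log-likelihood from an event term and an interval-by-interval Monte-Carlo compensator can be sketched as follows. To keep the estimate checkable, the sketch uses a single event type and a stand-in intensity `toy_intensity` whose integral is known exactly; that function, the event times, and the horizon `T` are all assumptions for illustration, whereas in NHP the intensity would be the summed softplus intensities of §3.

```python
import math
import random

def toy_intensity(t):
    """Stand-in for lambda*(t); chosen only because its integral is known exactly."""
    return 0.5 + 0.4 * math.sin(t)

def exact_integral(a, b):
    """Closed-form integral of toy_intensity over [a, b], used as a reference."""
    return 0.5 * (b - a) - 0.4 * (math.cos(b) - math.cos(a))

def mc_interval(lam, a, b, n_samples, rng):
    """(b - a) / L times the sum of lam at L uniform draws from the interval."""
    return (b - a) / n_samples * sum(lam(rng.uniform(a, b)) for _ in range(n_samples))

def log_likelihood(lam, event_times, T, n_samples, rng):
    """Sum of log-intensities at the events minus the Monte-Carlo compensator."""
    log_term = sum(math.log(lam(t)) for t in event_times)
    edges = [0.0] + list(event_times) + [T]
    compensator = sum(mc_interval(lam, a, b, n_samples, rng)
                      for a, b in zip(edges[:-1], edges[1:]) if b > a)
    return log_term - compensator

# Hypothetical single-type sequence on [0, T].
events = [0.7, 1.9, 2.4, 5.1]
T = 6.0

exact_ll = sum(math.log(toy_intensity(t)) for t in events) - exact_integral(0.0, T)
mc_ll = log_likelihood(toy_intensity, events, T, 5000, random.Random(0))
print(f"exact log-likelihood = {exact_ll:.4f}")
print(f"Monte-Carlo estimate = {mc_ll:.4f}")
```

The two numbers differ only through the sampled compensator; the event term is computed exactly in both.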

6. Parameter estimation and practicalities

Parameters. All cLSTM gate weights (input/forget/output, the decay-rate gate producing $\delta_j$, and the weights that produce the two cell vectors $c_j$ and $\bar{c}_j$ at each event), the per-type intensity weights $\mathbf{w}_k$ and scales $s_k$, and the mark embeddings. All are trained end-to-end by stochastic gradient ascent, backpropagating through time (BPTT), on the Monte-Carlo log-likelihood.
Compensator samples ($L$). A variance/cost knob: the uniform-sampling estimate of the integral is unbiased for any $L$, but too few samples gives a noisy survival penalty and too many is slow. Implementations tune $L$ per interval length.
Cost. Heavier than RMTPP (continuous-state decay + MC integration), but still cheap at inference relative to large ETAS simulations.
Data hunger / overfitting. Like every neural TPP, NHP has many parameters and needs many effectively-independent sequences. Seismic catalogs supply few independent large sequences — the exact condition under which over-parameterized models overfit (the DeVries–Mignan lesson, Models — ML §5). The product's response is unchanged: any neural model must beat ETAS in a strictly temporal, prospective CSEP harness and pass calibration before it ships.
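The effect of the sample count $L$ on the compensator estimate can be checked empirically. The sketch below repeats the uniform-sampling estimate many times for several values of $L$ and reports the mean and spread; the intensity `toy_rate`, the interval width, and the repeat count are hypothetical choices made only for this illustration.

```python
import math
import random
import statistics

def softplus(x):
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def toy_rate(tau):
    """Hypothetical intensity on one interval: projection decaying from +1.0 toward -0.5."""
    return softplus(-0.5 + 1.5 * math.exp(-0.8 * tau))

def mc_estimate(n_samples, width, rng):
    """Uniform-sampling estimate of the integral of toy_rate over [0, width]."""
    draws = (toy_rate(rng.uniform(0.0, width)) for _ in range(n_samples))
    return width / n_samples * sum(draws)

def spread_by_sample_count(counts, width=2.0, repeats=2000, seed=0):
    """Mean and standard deviation of the estimate for each sample count."""
    rng = random.Random(seed)
    out = {}
    for n in counts:
        estimates = [mc_estimate(n, width, rng) for _ in range(repeats)]
        out[n] = (statistics.fmean(estimates), statistics.stdev(estimates))
    return out

results = spread_by_sample_count((1, 4, 16, 64, 256))
for n, (mean, sd) in results.items():
    print(f"L={n:4d}  mean={mean:.4f}  sd={sd:.4f}")
```

The means should agree across sample counts while the spread shrinks roughly as $1/\sqrt{L}$, so quadrupling $L$ about halves the noise in the survival penalty.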

7. Strengths

Inhibition / self-modulation. Can represent an event lowering the future rate and a time-varying baseline — strictly more expressive than a non-negative-kernel Hawkes/ETAS.
Non-monotone intra-interval intensity. Because $h(t)$ evolves continuously, the intensity can rise and fall between events — fixing RMTPP's monotone-interval rigidity.
Continuous-time memory with controllable horizon. The learned decay rates $\delta_j$ let the model keep some dimensions long-lived (slow context) and others short-lived (recent dynamics).
Principled probabilistic object. Trained on the true point-process likelihood with a (Monte-Carlo) compensator, so it yields a proper, calibratable intensity — not a classifier.

8. Limitations — and why they matter for seismicity

Validated off-domain. NHP's gains are on retail/EHR/social/synthetic streams. They do not auto-transfer to earthquakes. On earthquake catalogs under fair temporal splits, neural TPPs of this family have not robustly beaten ETAS (EarthquakeNPP; see Models — ML §4).
Recurrent memory bottleneck. A single evolving state must summarize all history; very long-range or cross-fault dependencies can still be lost to recurrence decay — the gap that motivates the attention-based Transformer Hawkes Process.
No native spatial kernel. NHP is a multivariate temporal process (event types), not a continuous spatial one. Seismic forecasting needs a spatial density over $(x,y)$; discretizing space into "types" is crude and scales poorly.
No built-in seismic physics. No Omori law, no Gutenberg–Richter magnitude distribution, no branching-ratio subcriticality constraint unless added by hand. A free NHP can drift from seismologically sensible behaviour.
Costlier, approximate likelihood. The Monte-Carlo compensator adds variance and compute relative to ETAS's analytic integrals and RMTPP's closed form.
Interpretability. Parameters are not seismologically meaningful (no $p$, $c$, $\alpha$, $b$), making expert review and uncertainty propagation harder than for ETAS.

Net assessment for this product. NHP's self-modulation is a genuinely interesting capability, and it is studied in the neural-challenger research track. But it is never a default forecaster. If a seismic adaptation of this idea is pursued, it keeps the ETAS skeleton (additive background + summed triggering), models magnitude and space explicitly, and must clear the hard gate: a prospective CSEP win over a well-fit ETAS plus a passing reliability diagram.

9. Role in operational earthquake forecasting

NHP has no direct operational role in CAOS_SEISMIC today. Its contributions are conceptual:

Self-modulation as a design idea. The notion that the background and the event-influence can be learned and time-varying — including inhibition — is a useful lens on non-stationary seismicity (transients, stress shadows) that a fixed-$\mu$, non-negative-kernel ETAS handles only by hand.
Continuous-time state as a template. The cLSTM's continuously evolving hidden state is the recurrent counterpart to attention-based history summaries, both of which a seismic neural challenger might use on top of an ETAS inductive bias rather than as a free-form replacement.
A cautionary data-point. That such an expressive, principled model still does not beat ETAS on earthquakes under fair testing is exactly why the product ships an ETAS-class core and gates all neural work behind prospective CSEP skill + calibration.

In OEF terms, NHP widens what an intensity can express; the seismic evidence keeps the burden of proof on demonstrated prospective skill, not expressiveness.

10. Worked illustration

Take a single intensity dimension (drop the type subscript) and fix the softplus scale to 1. Suppose right after an event the projection is $\mathbf{w}^{\top} h(t_j) = +0.5$ but the decay target is negative, $\mathbf{w}^{\top}\bar h = -1.0$, and assume for simplicity that the projected state itself relaxes exponentially, as

$$z(s) \equiv \mathbf{w}^{\top} h(t_j + s) = -1.0 + (0.5 - (-1.0))\,e^{-\delta s} = -1.0 + 1.5\,e^{-\delta s}, \qquad \delta = 1.0\,\text{day}^{-1}.$$

The intensity is $\lambda^*(t_j + s) = \text{softplus}(z(s)) = \log(1 + e^{z(s)})$. Evaluate:

elapsed $s$ (days)	$z(s)$	$\lambda^*$ = softplus$(z)$ (events/day)
0.0	$+0.50$	$0.974$
0.5	$-0.090$	$0.649$
1.0	$-0.448$	$0.494$
2.0	$-0.797$	$0.372$
$\to\infty$	$-1.00$	$0.313$

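The intensity starts elevated (just under 1/day) and falls monotonically toward a baseline of $\text{softplus}(-1.0) \approx 0.313$/day; with a single exponentially decaying projection it never drops below the level where it settles. Had the post-event projection instead been below the decay target (say $z(0) = -2.0$ with the same target $-1.0$), the intensity would start suppressed at $\text{softplus}(-2.0) \approx 0.127$/day and recover upward toward $0.313$/day: an explicit inhibition-then-recovery shape that a non-negative Hawkes kernel cannot produce. A curve that dips and then recovers inside one interval needs several hidden dimensions with different decay rates, which the full model has and RMTPP's monotone interval does not allow. That is the qualitative capability NHP adds.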

The probability of at least one event within $H = 1$ day requires the compensator over that window; numerically integrating $\lambda^*$ over $[0,1]$ here gives $\Lambda \approx 0.68$, so $P(\ge 1 \text{ in 1 day}) = 1 - e^{-0.68} \approx 49\%$. This is still a bounded, calibratable probability, never an alarm.
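The numbers in this illustration can be recomputed directly. The sketch below evaluates the projected state and the intensity at the tabulated times, then integrates the intensity over the first day with a composite Simpson rule (a deterministic stand-in for the Monte-Carlo estimate, reasonable here because the toy curve is one-dimensional and smooth). The function names and default arguments are hypothetical.

```python
import math

def softplus(z):
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def projection(s, z0=0.5, target=-1.0, delta=1.0):
    """z(s): projected state s days after the event, relaxing from z0 to target."""
    return target + (z0 - target) * math.exp(-delta * s)

def rate(s, z0=0.5, target=-1.0, delta=1.0):
    """Intensity in events/day with the softplus scale fixed to 1."""
    return softplus(projection(s, z0, target, delta))

def simpson(f, a, b, n=200):
    """Composite Simpson rule with an even number n of sub-intervals."""
    h = (b - a) / n
    odd = sum(f(a + i * h) for i in range(1, n, 2))
    even = sum(f(a + i * h) for i in range(2, n, 2))
    return h / 3.0 * (f(a) + 4.0 * odd + 2.0 * even + f(b))

for s in (0.0, 0.5, 1.0, 2.0):
    print(f"s={s:3.1f}  z={projection(s):+.3f}  rate={rate(s):.3f}")

big_lambda = simpson(rate, 0.0, 1.0)           # expected events in the first day
p_at_least_one = 1.0 - math.exp(-big_lambda)   # probability of one or more events
print(f"Lambda over [0, 1] = {big_lambda:.3f}")
print(f"P(at least one event in 1 day) = {p_at_least_one:.3f}")
```

Changing `z0` and `target` in `rate` is a quick way to explore other shapes, for example a post-event level below the target.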

References

Mei, H. & Eisner, J. (2017). The Neural Hawkes Process: A Neurally Self-Modulating Multivariate Point Process. Advances in Neural Information Processing Systems (NeurIPS) 30. arXiv:1612.09328
Du, N., Dai, H., Trivedi, R., Upadhyay, U., Gomez-Rodriguez, M. & Song, L. (2016). Recurrent Marked Temporal Point Processes. KDD 2016. doi:10.1145/2939672.2939875
Hawkes, A.G. (1971). Spectra of some self-exciting and mutually exciting point processes. Biometrika 58(1), 83–90. doi:10.1093/biomet/58.1.83
Ogata, Y. (1988). Statistical models for earthquake occurrences and residual analysis for point processes. JASA 83(401), 9–27. doi:10.1080/01621459.1988.10478560
Shchur, O., Türkmen, A.C., Januschowski, T. & Günnemann, S. (2021). Neural Temporal Point Processes: A Review. IJCAI 2021 Survey Track. arXiv:2104.03528
Zuo, S., Jiang, H., Li, Z., Zhao, T. & Zha, H. (2020). Transformer Hawkes Process. ICML 2020, PMLR v119, 11692–11702.
Stockman, S., Lawson, D. & Werner, M.J. (2026, accepted). EarthquakeNPP: A Benchmark for Earthquake Forecasting with Neural Point Processes. TMLR. arXiv:2410.08226

See also: Temporal Point Processes · RMTPP · Transformer Hawkes Process · Models — ML · Models — Classical · Honest-Limits.

⚠️ Disclaimer — read this. CAOS_SEISMIC produces probabilistic forecasts, not predictions. It is an independent research and education tool. It is NOT an official earthquake early-warning or civil-protection system, it does NOT predict when, where, or how large an earthquake will be, and it must NOT be used for life-safety, emergency, or evacuation decisions. Every number it publishes is a bounded, calibrated probability conditioned on the present state of seismicity — never an alarm, a countdown, or a "safe" state. A single outcome neither confirms nor refutes a probabilistic forecast.

It complements, and does not replace or speak for, official agencies — always follow your national seismological and civil-protection authorities (e.g. USGS, INGV, CSN (Chile, SENAPRED for civil protection), GeoNet, JMA). The software is provided "as is", without warranty of any kind (MIT License); the authors accept no liability for its use. Data are courtesy of their providers (USGS/ANSS, ISC/ISC-GEM, Global CMT, EMSC, CSN, and others) under their respective licenses and attribution terms. See Honest-Limits for the full epistemic context.

CAOS_SEISMIC · seismic.fasl-work.com · source · MIT
