Felipe Santibañez-Leal edited this page Jun 17, 2026 · 1 revision

RMTPP — Recurrent Marked Temporal Point Process (Du et al., 2016)

One topic, in depth. RMTPP is the first of the neural temporal point processes — the model that replaced a hand-designed Hawkes triggering kernel with a recurrent neural network that learns the history representation, while keeping the point-process log-likelihood as its training objective. This page derives its conditional intensity, explains the design choices, works the likelihood, and gives an honest account of its (limited) relevance to seismic forecasting.

Honest framing up front (read before the math). RMTPP was introduced and validated on non-seismic event streams — financial transactions, electronic health records, taxi trips, social-media and music-listening logs. It is a foundational architecture, not a seismic forecaster. Its accuracy gains on those domains do not automatically transfer to earthquakes: in rigorous, prospective CSEP-style testing on earthquake catalogs, no neural point process has robustly beaten a well-fit ETAS as of 2026 (see Models — ML §4). RMTPP appears here because it is the conceptual root of Neural Hawkes and Transformer Hawkes, and because understanding why its intensity form is restrictive motivates the models that followed.


Table of contents

  1. Intuition and place in history
  2. The architecture: an RNN over event history
  3. The conditional intensity
  4. The marked likelihood and training objective
  5. Closed-form next-event density and prediction
  6. Parameter estimation and practicalities
  7. Strengths
  8. Limitations — and why they matter for seismicity
  9. Role in operational earthquake forecasting
  10. Worked illustration
  11. References

1. Intuition and place in history

A classical Hawkes / ETAS process specifies its triggering kernel by hand: each past event adds a fixed-shape decaying "kick" to the intensity (Omori–Utsu in time, Utsu productivity in magnitude). That is powerful but rigid — the functional form of how history influences the future is decided in advance by the modeller.

Du et al. (2016) asked: what if a neural network learns how history maps to the future intensity, instead of us prescribing kernels? Their Recurrent Marked Temporal Point Process (RMTPP) feeds the sequence of past events — each carrying a mark (a categorical/continuous label; for seismicity, magnitude and location) and an inter-event time — through a recurrent neural network. The RNN compresses the entire history into a fixed-length hidden vector $h_j$, and the intensity for the time until the next event is read out from $h_j$ plus an explicit elapsed-time term.

The payoff is a single model that jointly learns (a) the timing dynamics and (b) the mark dynamics from data, with no hand-tuned kernels. RMTPP is, in the lineage of temporal point processes, the first "the kernel is a neural net" model, and every neural TPP since is a variation on this idea.


2. The architecture: an RNN over event history

Events arrive as a sequence $\{(t_j, k_j)\}_{j=1}^n$, where $t_j$ is the time and $k_j$ is the mark of the $j$-th event. RMTPP processes them recurrently:

  1. Embed each event: the mark $k_j$ is mapped to an embedding $\mathbf{y}_j$, and the timing is encoded by features of the inter-event gap $\Delta t_j = t_j - t_{j-1}$ (e.g. $\Delta t_j$ and simple temporal features).

  2. Recur: a recurrent unit (the paper uses a vanilla RNN; LSTM/GRU are drop-in replacements) updates the hidden state

    $$h_j = \max\!\big(\mathbf{0},\; W^{y}\mathbf{y}_j + W^{t}\Delta t_j + W^{h} h_{j-1} + \mathbf{b}_h\big),$$

    so $h_j$ is a learned summary of the entire history up to and including event $j$.

  3. Read out two things from $h_j$: a conditional intensity governing the time to the next event (§3) and a softmax distribution over the mark of the next event (§4).

flowchart LR
    E1["(t₁,k₁)"] --> R1["RNN cell"]
    E2["(t₂,k₂)"] --> R2["RNN cell"]
    E3["(t₃,k₃)"] --> R3["RNN cell"]
    R1 -->|h₁| R2
    R2 -->|h₂| R3
    R3 -->|hⱼ| INT["Intensity readout<br/>λ*(t) = exp(vᵀhⱼ + w(t−tⱼ) + b)"]
    R3 -->|hⱼ| MK["Mark readout<br/>softmax over next mark"]
    INT --> LL["Point-process log-likelihood"]
    MK --> LL

The hidden state $h_j$ is held fixed between events $j$ and $j+1$; only the explicit elapsed-time term varies the intensity over that interval. This is the design decision that makes the next-event density closed-form (§5) — and also the source of RMTPP's main limitation (§8).
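
The recurrence in step 2 can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the helper names (`matvec`, `rnn_step`, `encode_history`), the toy dimensions, the random weights, the zero initial state and the convention $t_0 = 0$ are all assumptions made for the example.

```python
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rnn_step(h_prev, y_j, dt_j, W_y, W_t, W_h, b_h):
    """h_j = max(0, W_y y_j + W_t * dt_j + W_h h_prev + b_h), element-wise."""
    from_mark = matvec(W_y, y_j)
    from_hist = matvec(W_h, h_prev)
    return [max(0.0, m + wt * dt_j + h + b)
            for m, wt, h, b in zip(from_mark, W_t, from_hist, b_h)]

def encode_history(events, embed, weights, hidden_size):
    """Fold (time, mark) events into one hidden vector; assumes t_0 = 0 and h_0 = 0."""
    h, t_prev = [0.0] * hidden_size, 0.0
    for t_j, k_j in events:
        h = rnn_step(h, embed[k_j], t_j - t_prev, *weights)
        t_prev = t_j
    return h

def rand_weight():
    """Small random weight; a stand-in for a trained parameter."""
    return random.uniform(-0.5, 0.5)

# Toy setup (hypothetical sizes and random weights, for illustration only).
random.seed(0)
HIDDEN, EMBED, N_MARKS = 4, 3, 2
embed = [[rand_weight() for _ in range(EMBED)] for _ in range(N_MARKS)]
weights = (
    [[rand_weight() for _ in range(EMBED)] for _ in range(HIDDEN)],   # W_y
    [rand_weight() for _ in range(HIDDEN)],                           # W_t, one weight per hidden unit
    [[rand_weight() for _ in range(HIDDEN)] for _ in range(HIDDEN)],  # W_h
    [rand_weight() for _ in range(HIDDEN)],                           # b_h
)
events = [(0.5, 0), (1.2, 1), (3.0, 0)]  # (t_j, k_j) pairs
h_last = encode_history(events, embed, weights, HIDDEN)
print(h_last)  # non-negative summary of the three events
```

The returned vector is what both readouts consume; nothing about it changes until the next event is fed in.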


3. The conditional intensity

Between the $j$-th and $(j{+}1)$-th events — i.e. for $t \in (t_j, t_{j+1}]$ — RMTPP defines the conditional intensity as

$$\boxed{\;\lambda^*(t) = \exp\!\big(\mathbf{v}^{\top} h_j + w\,(t - t_j) + b\big)\;}$$

Term by term:

  • $\mathbf{v}^{\top} h_j$: the history contribution, a learned linear projection of the RNN summary $h_j$. This sets the intensity level right after event $j$ and carries all marked history (magnitudes, locations, gaps).
  • $w\,(t - t_j)$: the current-influence / elapsed-time term, a single scalar $w$ multiplying the time since the last event. With $w < 0$ the intensity decays toward zero as time passes (an Omori-like relaxation, though exponential in form rather than power-law); with $w > 0$ it grows; with $w = 0$ it is flat (a memoryless, homogeneous-Poisson segment). The sign and size of $w$ are learned, not prescribed.
  • $b$: a base log-intensity offset.
  • The outer $\exp(\cdot)$ guarantees $\lambda^*(t) > 0$, as any intensity must be non-negative (cf. the non-negativity requirement).

The crucial structural fact: $h_j$ is constant over the interval, so the only time-variation of $\log\lambda^*(t)$ within an interval is the linear term $w(t - t_j)$. Hence $\lambda^*(t)$ is a log-linear (i.e. exponential) function of elapsed time between events, and therefore monotone over each interval. This is exactly the property that yields a tractable likelihood but also restricts the shapes RMTPP can represent (§8).
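
A short numerical check makes the role of $w$ concrete. The sketch below evaluates the intensity formula on a grid of elapsed times for three signs of $w$; the function names and the numeric values are illustrative assumptions, not values from the paper.

```python
import math

def intensity(s, vh, w, b):
    """lambda*(t_j + s) = exp(v^T h_j + w * s + b) for elapsed time s >= 0."""
    return math.exp(vh + w * s + b)

def is_monotone(values):
    """True if the sequence never changes direction."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    return all(d <= 0 for d in diffs) or all(d >= 0 for d in diffs)

vh, b = -2.0, 0.0                      # hypothetical history projection and offset
grid = [0.25 * i for i in range(17)]   # elapsed times 0, 0.25, ..., 4.0

profiles = {w: [intensity(s, vh, w, b) for s in grid] for w in (-0.5, 0.0, 0.5)}
for w, rates in profiles.items():
    trend = "decays" if w < 0 else "grows" if w > 0 else "stays flat"
    print(f"w = {w:+.1f}: rate {trend} from {rates[0]:.3f} to {rates[-1]:.3f}, "
          f"monotone = {is_monotone(rates)}")
```

Whatever value the optimizer picks for $w$, the curve within one interval has no turning point; any change of direction has to wait for the next event to update $h_j$.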


4. The marked likelihood and training objective

RMTPP is trained by maximizing the point-process log-likelihood, augmented with the mark term. For a sequence of $n$ events on $[0,T]$:

$$\log\mathcal{L} = \sum_{j=1}^{n}\Big[\underbrace{\log\lambda^*(t_j)}_{\text{timing}} + \underbrace{\log P(k_j \mid h_{j-1})}_{\text{mark}}\Big] \;-\; \underbrace{\int_0^T \lambda^*(\tau)\,d\tau}_{\text{compensator / survival penalty}}.$$

Because $\lambda^*$ is exponential in elapsed time with $h_j$ frozen between events, the compensator over a single inter-event interval has a closed form. On $(t_j, t_{j+1}]$:

$$\int_{t_j}^{t_{j+1}} \lambda^*(\tau)\,d\tau = \int_{t_j}^{t_{j+1}} e^{\mathbf{v}^{\top} h_j + w(\tau - t_j) + b}\,d\tau = \frac{1}{w}\Big(e^{\mathbf{v}^{\top}h_j + w(t_{j+1}-t_j) + b} - e^{\mathbf{v}^{\top}h_j + b}\Big).$$

Summing these interval integrals gives the full survival penalty with no Monte-Carlo approximation — a notable practical advantage over later neural TPPs whose compensators must be numerically integrated. The mark term is a standard cross-entropy from the softmax readout (categorical marks); for a continuous mark like magnitude one substitutes an appropriate density (in a seismic adaptation, a Gutenberg–Richter-style $f(m)$ — see Temporal Point Processes §5).
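
As a sanity check on the closed form, the sketch below computes one interval's contribution to the log-likelihood and compares the analytic compensator against a brute-force midpoint-rule integral. Function names and the toy inputs are assumptions for illustration; the mark term is a categorical softmax, and $w \neq 0$ is assumed.

```python
import math

def compensator(vh, w, b, dt):
    """Closed-form integral of exp(vh + w*s + b) over s in [0, dt]; assumes w != 0."""
    return (math.exp(vh + w * dt + b) - math.exp(vh + b)) / w

def log_softmax(logits, k):
    """Log-probability of class k under a softmax, computed stably."""
    m = max(logits)
    return logits[k] - m - math.log(sum(math.exp(z - m) for z in logits))

def interval_log_lik(vh, w, b, dt, mark_logits, mark):
    """Timing term + mark term - compensator for one inter-event interval."""
    log_rate_at_event = vh + w * dt + b
    return log_rate_at_event + log_softmax(mark_logits, mark) - compensator(vh, w, b, dt)

def midpoint_integral(vh, w, b, dt, n=20_000):
    """Numerical reference: midpoint rule on the same integrand."""
    step = dt / n
    return step * sum(math.exp(vh + w * (i + 0.5) * step + b) for i in range(n))

# Hypothetical readout values for a single interval of length 2.0.
vh, w, b, dt = -2.0, -0.5, 0.0, 2.0
analytic = compensator(vh, w, b, dt)
numeric = midpoint_integral(vh, w, b, dt)
ll = interval_log_lik(vh, w, b, dt, mark_logits=[0.3, -0.1, 0.0], mark=0)
print(f"compensator: analytic={analytic:.6f}, midpoint={numeric:.6f}, log-lik={ll:.4f}")
```

Summing `interval_log_lik` over consecutive events, plus a compensator-only term from the last event to $T$, would give the objective above under these assumptions.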

The presence of the compensator is what keeps RMTPP a genuine probabilistic model rather than a regressor: it penalizes high intensity where no event occurred, which is exactly what makes the output calibratable — the dividing line emphasized throughout Models — ML.


5. Closed-form next-event density and prediction

From the survival identity, the density of the next event time given the current state $h_j$ is

$$f^*(t) = \lambda^*(t)\,\exp\!\left(-\int_{t_j}^{t}\lambda^*(\tau)\,d\tau\right) = \exp\!\Big(\mathbf{v}^{\top}h_j + w(t-t_j) + b - \tfrac{1}{w}\big(e^{\mathbf{v}^{\top}h_j + w(t-t_j)+b} - e^{\mathbf{v}^{\top}h_j+b}\big)\Big).$$

The expected time to the next event is then $\hat t_{j+1} = \int_{t_j}^{\infty} t\,f^*(t)\,dt$, which RMTPP evaluates by numerical integration of this closed-form density. The next mark is predicted from the softmax readout. The probability of at least one event within a horizon $H$ is $1 - \exp\!\big(-\int_{t_j}^{t_j+H}\lambda^*(\tau)\,d\tau\big)$, using the closed-form compensator above; this is the same exceedance shape the product publishes.
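
The density, the expected gap and the horizon probability can be sketched as follows. All names and numbers are illustrative assumptions, and the midpoint rule stands in for whatever quadrature an actual implementation uses. One observation, offered as arithmetic rather than as a claim from the paper: for $w < 0$ the total compensator stays finite as $t \to \infty$, so the density carries less than unit mass and the mean is not finite; the example therefore uses $w > 0$ for the expected gap.

```python
import math

def compensator(s, vh, w, b):
    """Integral of lambda* from t_j to t_j + s; assumes w != 0."""
    return (math.exp(vh + w * s + b) - math.exp(vh + b)) / w

def density(s, vh, w, b):
    """f*(t_j + s) = lambda*(t_j + s) * exp(-compensator)."""
    return math.exp(vh + w * s + b - compensator(s, vh, w, b))

def prob_event_within(H, vh, w, b):
    """P(at least one event in (t_j, t_j + H])."""
    return 1.0 - math.exp(-compensator(H, vh, w, b))

def midpoint(f, lo, hi, n=20_000):
    """Midpoint-rule integral of f over [lo, hi]."""
    step = (hi - lo) / n
    return step * sum(f(lo + (i + 0.5) * step) for i in range(n))

# Hypothetical readout with w > 0, so the density integrates to one.
vh, w, b = -2.0, 0.5, 0.0
upper = 20.0  # truncation point; survival probability beyond it is negligible here
mass = midpoint(lambda s: density(s, vh, w, b), 0.0, upper)
expected_gap = midpoint(lambda s: s * density(s, vh, w, b), 0.0, upper)
print(f"mass={mass:.6f}, expected gap={expected_gap:.3f}, "
      f"P(event within 2)={prob_event_within(2.0, vh, w, b):.3f}")
```

The predicted next-event time is then $t_j$ plus `expected_gap`. Integrating `density` over $[0, H]$ reproduces `prob_event_within(H, ...)`, which is a convenient consistency check on any implementation.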


6. Parameter estimation and practicalities

  • Parameters. The RNN weights $\{W^y, W^t, W^h, \mathbf{b}_h\}$, the intensity readout $\{\mathbf{v}, w, b\}$, the mark-embedding and softmax weights. Trained end-to-end by stochastic gradient ascent on the log-likelihood (back-propagation through time).
  • Stability of $w$. The compensator divides by $w$, so $w \to 0$ needs a limiting form ($\int e^{c}\,d\tau = e^{c}\,\Delta t$); implementations special-case small $|w|$.
  • Sequence handling. Long catalogs are processed in truncated-BPTT windows; the hidden state carries context across windows.
  • Regularization / data hunger. Like all neural TPPs, RMTPP needs many sequences to fit without overfitting — a binding constraint for seismicity, where the number of effectively independent large sequences is small (the central lesson of the DeVries–Mignan episode; see Models — ML §5). The product's guardrail is that any neural model must clear ETAS in a strictly temporal, prospective CSEP harness before it can ship.
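
The small-$|w|$ special case mentioned above can be handled as in the sketch below. This is one reasonable way to do it, not necessarily how any particular RMTPP implementation does; the function names and the tolerance are assumptions. It factors the compensator as $e^{\mathbf{v}^{\top}h_j + b}\,\Delta t\,(e^{x}-1)/x$ with $x = w\,\Delta t$, uses `math.expm1` away from zero, and falls back to a Taylor expansion near it.

```python
import math

def stable_compensator(vh, w, b, dt, tol=1e-6):
    """Integral of exp(vh + w*s + b) over [0, dt], safe for w at or near zero."""
    base = math.exp(vh + b)          # intensity at the start of the interval
    x = w * dt
    if abs(x) < tol:
        # Taylor expansion: (e^x - 1) / x = 1 + x/2 + O(x^2)
        return base * dt * (1.0 + 0.5 * x)
    # expm1 avoids the cancellation in e^x - 1 for small but non-negligible x
    return base * dt * math.expm1(x) / x

def naive_compensator(vh, w, b, dt):
    """Direct transcription of the closed form; divides by zero at w == 0."""
    return (math.exp(vh + w * dt + b) - math.exp(vh + b)) / w

for w in (0.0, 1e-12, -0.5):
    print(f"w = {w:g}: compensator = {stable_compensator(-2.0, w, 0.0, 2.0):.6f}")
```

At $w = 0$ the function returns exactly $e^{c}\,\Delta t$, and for ordinary values of $w$ it agrees with the direct formula.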

7. Strengths

  • Learned history representation. No hand-designed kernel: the RNN discovers how marked history drives the future rate, in principle capturing dependencies ETAS's fixed kernels cannot.
  • Joint timing + mark model. Time-to-next-event and next-mark are learned together, sharing $h_j$.
  • Closed-form likelihood and density. The frozen-$h_j$ design makes the compensator and the next-event density analytic — fast, exact training with no Monte-Carlo compensator.
  • General-purpose. The same architecture handles any marked stream; seismicity is just one instantiation (with magnitude/location marks).

8. Limitations — and why they matter for seismicity

  • Monotone intra-interval intensity. With $h_j$ frozen between events, $\log\lambda^*(t)$ is linear in elapsed time, so $\lambda^*$ can only rise or fall monotonically within an interval. Real seismic relaxation after a mainshock, and especially non-monotone dynamics like a delayed secondary surge, is not representable inside a single interval. This is the specific rigidity that Neural Hawkes relaxes with a continuously evolving (LSTM) state.
  • Single-vector history bottleneck. The whole past is compressed into one fixed-length $h_j$; very long-range or cross-fault dependencies can be lost to RNN memory decay — the gap Transformer Hawkes addresses with self-attention.
  • No explicit spatial kernel. RMTPP is fundamentally a temporal, categorically-marked model. Seismic forecasting needs a genuine spatial density; adapting RMTPP to space requires bolting on a spatial mark, which is not its native strength.
  • No built-in seismic physics. There is no Omori law, no Gutenberg–Richter, no branching-ratio stability constraint unless added by hand. A purely learned RMTPP has no guarantee of the subcritical behaviour ETAS enforces.
  • Validated off-domain. Its reported wins are on the non-seismic streams listed at the top of this page (financial, EHR, taxi, social-media and music-listening logs). Gains do not auto-transfer to earthquakes, which is the headline caveat of this entire model family. On earthquake catalogs under fair temporal splits, neural TPPs have not robustly beaten ETAS.

Net assessment for this product. RMTPP is studied as the conceptual ancestor and as a baseline in the neural-challenger research track — never as a default forecaster. Any version of it reaches the public map only behind the same hard gate as every neural model: a prospective CSEP win over a well-fit ETAS, plus a passing calibration / reliability diagram.


9. Role in operational earthquake forecasting

RMTPP itself has no operational role in CAOS_SEISMIC. Its role is conceptual and methodological:

  • It establishes the template — a neural network parameterizing a conditional intensity, trained on the point-process log-likelihood — that the product's gated neural challenger follows, but with a Hawkes inductive bias (additive background + summed triggering) and explicit magnitude modeling that plain RMTPP lacks.
  • Its closed-form compensator is a useful property to inherit where possible (cheap, exact training).
  • Its limitations (monotone intervals, single-vector bottleneck, no spatial kernel, no seismic physics) are precisely the design requirements the product imposes on any neural model it would actually deploy: keep the ETAS skeleton, model magnitude and space explicitly, and prove skill prospectively.

In OEF terms: RMTPP teaches how to wire a learned intensity into the calibratable point-process framework, while the empirical record teaches that doing so does not, by itself, beat the physics-informed baseline.


10. Worked illustration

Suppose after some event $j$ the RNN produces a history projection $\mathbf{v}^{\top}h_j = -2.0$, with $b = 0$ and a learned decay $w = -0.5\,\text{day}^{-1}$. The intensity at elapsed time $s = t - t_j$ (in days) is

$$\lambda^*(t_j + s) = e^{-2.0 - 0.5 s} \quad [\text{events/day}].$$

So immediately after the event ($s=0$) the rate is $e^{-2.0} \approx 0.135$/day and it relaxes toward zero — a learned, Omori-flavoured decay (here monotone, by construction). The probability of at least one further event within the next $H = 2$ days uses the closed-form compensator:

$$\Lambda = \int_0^{2} e^{-2.0 - 0.5 s}\,ds = e^{-2.0}\cdot\frac{1 - e^{-0.5\cdot 2}}{0.5} = 0.135 \times \frac{1 - e^{-1}}{0.5} \approx 0.135 \times 1.264 \approx 0.171,$$

$$P(\ge 1 \text{ event in 2 days}) = 1 - e^{-0.171} \approx 15.7\%.$$

Two points. First, the entire forecast came from one scalar elapsed-time term and one frozen history vector — illustrating both the elegance and the rigidity of RMTPP: within this interval the rate can only decay monotonically. Second, this is still a rate-based, bounded probability with a survival penalty — an honest forecast, not a prediction that an event will occur. If no event happens in the two days, the ~16% forecast was not refuted.
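
The arithmetic above can be reproduced, and cross-checked by simulation, with the short script below. The closed-form part follows the worked numbers directly. The simulation part is an addition for illustration only: it draws next-event gaps by inverting the compensator (a standard sampling technique for point processes, not something this page attributes to the paper), with the seed and sample size chosen arbitrarily.

```python
import math
import random

vh, b, w, H = -2.0, 0.0, -0.5, 2.0     # values from the worked illustration

rate0 = math.exp(vh + b)                          # rate right after the event
Lambda = rate0 * (1.0 - math.exp(w * H)) / (-w)   # closed-form compensator over H
p_closed = 1.0 - math.exp(-Lambda)

def sample_gap(rng):
    """Solve Lambda(s) = E for s, with E ~ Exp(1).
    Returns math.inf when the decaying rate never accumulates enough mass,
    i.e. when no further event occurs under this frozen intensity."""
    arg = 1.0 + w * rng.expovariate(1.0) / rate0
    return math.log(arg) / w if arg > 0.0 else math.inf

rng = random.Random(0)
n_draws = 100_000
p_sim = sum(sample_gap(rng) <= H for _ in range(n_draws)) / n_draws

print(f"rate at s=0: {rate0:.3f}/day, Lambda: {Lambda:.3f}, "
      f"P closed-form: {p_closed:.1%}, P simulated: {p_sim:.1%}")
```

Most simulated draws contain no event inside the two-day window, which is the point of the second remark: a forecast of roughly 16% is a statement about frequency, not about any single outcome.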


11. References

  1. Du, N., Dai, H., Trivedi, R., Upadhyay, U., Gomez-Rodriguez, M. & Song, L. (2016). Recurrent Marked Temporal Point Processes: Embedding Event History to Vector. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2016), 1555–1564. doi:10.1145/2939672.2939875
  2. Hawkes, A.G. (1971). Spectra of some self-exciting and mutually exciting point processes. Biometrika 58(1), 83–90. doi:10.1093/biomet/58.1.83
  3. Ogata, Y. (1988). Statistical models for earthquake occurrences and residual analysis for point processes. Journal of the American Statistical Association 83(401), 9–27. doi:10.1080/01621459.1988.10478560
  4. Mei, H. & Eisner, J. (2017). The Neural Hawkes Process: A Neurally Self-Modulating Multivariate Point Process. NeurIPS 2017. arXiv:1612.09328
  5. Shchur, O., Türkmen, A.C., Januschowski, T. & Günnemann, S. (2021). Neural Temporal Point Processes: A Review. IJCAI 2021 Survey Track. arXiv:2104.03528
  6. Stockman, S., Lawson, D. & Werner, M.J. (2026, accepted). EarthquakeNPP: A Benchmark for Earthquake Forecasting with Neural Point Processes. Transactions on Machine Learning Research (TMLR). arXiv:2410.08226
  7. Dascher-Cousineau, K., Shchur, O., Brodsky, E.E. & Günnemann, S. (2023). Using deep learning for flexible and scalable earthquake forecasting (RECAST). Geophysical Research Letters 50, e2023GL103909. doi:10.1029/2023GL103909

See also: Temporal Point Processes · Neural Hawkes Process · Transformer Hawkes Process · Models — ML · Models — Classical · Honest-Limits.
