RMTPP
One topic, in depth. RMTPP is the first of the neural temporal point processes — the model that replaced a hand-designed Hawkes triggering kernel with a recurrent neural network that learns the history representation, while keeping the point-process log-likelihood as its training objective. This page derives its conditional intensity, explains the design choices, works the likelihood, and gives an honest account of its (limited) relevance to seismic forecasting.
Honest framing up front (read before the math). RMTPP was introduced and validated on non-seismic event streams — financial transactions, electronic health records, taxi trips, social-media and music-listening logs. It is a foundational architecture, not a seismic forecaster. Its accuracy gains on those domains do not automatically transfer to earthquakes: in rigorous, prospective CSEP-style testing on earthquake catalogs, no neural point process has robustly beaten a well-fit ETAS as of 2026 (see Models — ML §4). RMTPP appears here because it is the conceptual root of Neural Hawkes and Transformer Hawkes, and because understanding why its intensity form is restrictive motivates the models that followed.
- Intuition and place in history
- The architecture: an RNN over event history
- The conditional intensity
- The marked likelihood and training objective
- Closed-form next-event density and prediction
- Parameter estimation and practicalities
- Strengths
- Limitations — and why they matter for seismicity
- Role in operational earthquake forecasting
- Worked illustration
- References
A classical Hawkes / ETAS process specifies its triggering kernel by hand: each past event adds a fixed-shape decaying "kick" to the intensity (Omori–Utsu in time, Utsu productivity in magnitude). That is powerful but rigid — the functional form of how history influences the future is decided in advance by the modeller.
Du et al. (2016) asked: what if a neural network learns how history maps to the future intensity,
instead of us prescribing kernels? Their Recurrent Marked Temporal Point Process (RMTPP) feeds the
sequence of past events — each carrying a mark (a categorical/continuous label; for seismicity,
magnitude and location) and an inter-event time — through a recurrent neural network. The RNN compresses
the entire history into a fixed-length hidden vector $h_j$.
The payoff is a single model that jointly learns (a) the timing dynamics and (b) the mark dynamics from data, with no hand-tuned kernels. RMTPP is, in the lineage of temporal point processes, the first "the kernel is a neural net" model, and every neural TPP since is a variation on this idea.
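Events arrive as a sequence $\{(t_j, k_j)\}_{j=1}^{n}$ of event times $t_j$ and marks $k_j$. The model processes each event in three steps: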
- Embed each event: the mark $k_j$ is mapped to an embedding $\mathbf{y}_j$, and the timing is encoded by features of the inter-event gap $\Delta t_j = t_j - t_{j-1}$ (e.g. $\Delta t_j$ and simple temporal features).
- Recur: a recurrent unit (the paper uses a vanilla RNN; LSTM/GRU are drop-in replacements) updates the hidden state

  $$h_j = \max\!\big(\mathbf{0},\; W^{y}\mathbf{y}_j + W^{t}\Delta t_j + W^{h} h_{j-1} + \mathbf{b}_h\big),$$

  so $h_j$ is a learned summary of the entire history up to and including event $j$.
- Read out two things from $h_j$: a conditional intensity governing the time to the next event (§3) and a softmax distribution over the mark of the next event (§4).
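The embed and recur steps can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the weight values, the two-dimensional hidden state, the one-hot `embedding` table and the helper names (`relu`, `matvec`, `rnn_step`) are all assumptions made for the example.

```python
def relu(vec):
    """Element-wise max(0, x)."""
    return [max(0.0, x) for x in vec]

def matvec(mat, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]

def rnn_step(y_j, dt_j, h_prev, W_y, W_t, W_h, b_h):
    """One update h_j = max(0, W^y y_j + W^t dt_j + W^h h_{j-1} + b_h)."""
    a = matvec(W_y, y_j)      # mark-embedding contribution
    r = matvec(W_h, h_prev)   # recurrent contribution
    pre = [a_i + w_t * dt_j + r_i + b_i
           for a_i, w_t, r_i, b_i in zip(a, W_t, r, b_h)]
    return relu(pre)

# Hypothetical toy weights: 2-d hidden state, 2-d mark embedding.
W_y = [[0.5, -0.2], [0.1, 0.3]]
W_t = [0.4, -0.6]
W_h = [[0.9, 0.0], [0.0, 0.8]]
b_h = [0.0, 0.1]
embedding = {0: [1.0, 0.0], 1: [0.0, 1.0]}  # mark id -> y_j

events = [(0.0, 0), (0.5, 1), (2.0, 0)]  # (time, mark)
h = [0.0, 0.0]
t_prev = 0.0
for t, k in events:
    h = rnn_step(embedding[k], t - t_prev, h, W_y, W_t, W_h, b_h)
    t_prev = t
print(h)  # final history summary after the third event
```

The state changes only inside `rnn_step`, that is, only when an event arrives. Nothing updates it between events.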
flowchart LR
E1["(t₁,k₁)"] --> R1["RNN cell"]
E2["(t₂,k₂)"] --> R2["RNN cell"]
E3["(t₃,k₃)"] --> R3["RNN cell"]
R1 -->|h₁| R2
R2 -->|h₂| R3
R3 -->|hⱼ| INT["Intensity readout<br/>λ*(t) = exp(vᵀhⱼ + w(t−tⱼ) + b)"]
R3 -->|hⱼ| MK["Mark readout<br/>softmax over next mark"]
INT --> LL["Point-process log-likelihood"]
MK --> LL
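The hidden state $h_j$ is updated only when an event arrives and is held fixed until the next one. Between the $j$-th and $(j+1)$-th events, for $t \in (t_j, t_{j+1}]$, the conditional intensity is

$$\lambda^*(t) = \exp\!\big(\mathbf{v}^{\top} h_j + w\,(t - t_j) + b\big).$$

Term by term: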
- $\mathbf{v}^{\top} h_j$: the history contribution, a learned linear projection of the RNN summary $h_j$. This sets the intensity level right after event $j$ and carries all marked history (magnitudes, locations, gaps).
- $w\,(t - t_j)$: the current-influence / elapsed-time term, a single scalar $w$ multiplying the time since the last event. With $w < 0$ the intensity decays as time passes (an Omori-like relaxation toward background); with $w > 0$ it grows; with $w = 0$ it is flat (a memoryless, homogeneous-Poisson segment). The sign and size of $w$ are learned, not prescribed.
- $b$: a base log-intensity offset.
- The outer $\exp(\cdot)$ guarantees $\lambda^*(t) \ge 0$, as any intensity must (cf. the non-negativity requirement).
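To make the three regimes of $w$ concrete, the following sketch evaluates the intensity at a few elapsed times. The function name `intensity`, the argument `vh` (standing for the precomputed scalar $\mathbf{v}^{\top} h_j$) and all numeric values are assumptions chosen for illustration.

```python
import math

def intensity(t, t_j, vh, w, b):
    """lambda*(t) = exp(v^T h_j + w (t - t_j) + b) for t >= t_j.

    `vh` is the already-computed scalar v^T h_j.
    """
    return math.exp(vh + w * (t - t_j) + b)

# Hypothetical values, for illustration only.
t_j, vh, b = 10.0, 0.3, -1.0
for w in (-0.5, 0.0, 0.5):
    rates = [intensity(t_j + s, t_j, vh, w, b) for s in (0.0, 1.0, 2.0)]
    print(w, [round(r, 4) for r in rates])
```

All three rows start from the same level $\exp(\mathbf{v}^{\top} h_j + b)$ at zero elapsed time. Only the sign of $w$ decides whether the rate then falls, stays flat or rises.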
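The crucial structural fact: $h_j$ is frozen between events, so $\log\lambda^*(t)$ is linear in the elapsed time $t - t_j$ and every integral of the intensity over an inter-event interval has a closed form.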
RMTPP is trained by maximizing the point-process log-likelihood, augmented
with the mark term. For a sequence of $n$ events $\{(t_j, k_j)\}_{j=1}^{n}$ observed on $[0, T]$:

$$\log\mathcal{L} = \sum_{j=1}^{n}\Big[\underbrace{\log\lambda^*(t_j)}_{\text{timing}} + \underbrace{\log P(k_j \mid h_{j-1})}_{\text{mark}}\Big] \;-\; \underbrace{\int_0^T \lambda^*(\tau)\,d\tau}_{\text{compensator / survival penalty}}.$$
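Because $h_j$ is frozen within each interval, the integral of the intensity over $(t_j, t_{j+1}]$ is analytic:

$$\int_{t_j}^{t_{j+1}} \lambda^*(\tau)\,d\tau = \frac{1}{w}\Big(e^{\mathbf{v}^{\top}h_j + w(t_{j+1}-t_j) + b} - e^{\mathbf{v}^{\top}h_j + b}\Big).$$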
Summing these interval integrals gives the full survival penalty with no Monte-Carlo approximation
— a notable practical advantage over later neural TPPs whose compensators must be numerically
integrated. The mark term is a standard cross-entropy from the softmax readout (categorical marks); for
a continuous mark like magnitude one substitutes an appropriate density (in a seismic adaptation, a
Gutenberg–Richter-style exponential density over magnitude).
The presence of the compensator is what keeps RMTPP a genuine probabilistic model rather than a regressor: it penalizes high intensity where no event occurred, which is exactly what makes the output calibratable — the dividing line emphasized throughout Models — ML.
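A minimal sketch of the objective, under simplifying assumptions: the observation window is taken to end at the last event, the per-interval levels `c` (standing for $\mathbf{v}^{\top}h + b$ from the preceding state) and the mark log-probabilities are supplied directly rather than produced by a trained network, and $w \neq 0$. The function names and numbers are hypothetical.

```python
import math

def interval_compensator(c, w, dt):
    """Closed-form integral of exp(c + w*s) for s in [0, dt], assuming w != 0."""
    return (math.exp(c + w * dt) - math.exp(c)) / w

def log_likelihood(intervals, w):
    """Sum of timing, mark and compensator terms over inter-event intervals.

    Each interval is (c, dt, log_p_mark), where c = v^T h + b is the frozen
    level set by the preceding state, dt is the gap to the next event, and
    log_p_mark is the softmax log-probability of that event's observed mark.
    """
    total = 0.0
    for c, dt, log_p_mark in intervals:
        total += c + w * dt                      # log lambda*(t_j) at the event
        total += log_p_mark                      # mark term (enters with a plus)
        total -= interval_compensator(c, w, dt)  # survival penalty
    return total

# Hypothetical inputs; in a real model c and log_p_mark come from the RNN.
intervals = [(-1.0, 0.5, math.log(0.7)), (-0.4, 1.5, math.log(0.2))]
print(log_likelihood(intervals, w=-0.8))
```

Raising `c` increases the timing term but also the subtracted integral, so the objective cannot be improved simply by predicting a high rate everywhere.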
From the survival identity, the density of the next event time given the
current state $h_j$ is

$$f^*(t) = \lambda^*(t)\,\exp\!\left(-\int_{t_j}^{t}\lambda^*(\tau)\,d\tau\right) = \exp\!\Big(\mathbf{v}^{\top}h_j + w(t-t_j) + b - \tfrac{1}{w}\big(e^{\mathbf{v}^{\top}h_j + w(t-t_j)+b} - e^{\mathbf{v}^{\top}h_j+b}\big)\Big).$$
The expected time to the next event is then $\hat t_{j+1} = \int_{t_j}^{\infty} t\,f^*(t)\,dt$, which RMTPP evaluates by numerical integration of this closed-form density. The next mark is predicted from the softmax readout. The probability of at least one event within a horizon $H$ is $1 - \exp\!\big(-\int_{t_j}^{t_j+H}\lambda^*(\tau)\,d\tau\big)$, using the closed-form compensator above; this is the same exceedance shape the product publishes.
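The numerical integration of the closed-form density can be sketched with a plain trapezoidal rule. The parameter values, the truncation point `s_max`, the grid size and the function names are assumptions. The example uses $w > 0$ so that the density integrates to one. For $w < 0$ the integrated intensity stays bounded as $t \to \infty$, so the density carries less than unit mass and the integral would have to be truncated at a finite horizon.

```python
import math

def cumulative_intensity(s, c, w):
    """Integral of exp(c + w*u) for u in [0, s], assuming w != 0."""
    return (math.exp(c + w * s) - math.exp(c)) / w

def next_event_density(s, c, w):
    """f*(t_j + s): intensity times survival probability, s = elapsed time."""
    return math.exp(c + w * s - cumulative_intensity(s, c, w))

def expected_gap(c, w, s_max, n=20000):
    """Trapezoidal estimate of the mean gap: integral of s * f*(t_j + s) on [0, s_max]."""
    step = s_max / n
    total = 0.0
    for i in range(n + 1):
        s = i * step
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * s * next_event_density(s, c, w)
    return total * step

# Hypothetical parameters with w > 0, so the density integrates to one.
c, w = -1.0, 0.3
gap = expected_gap(c, w, s_max=40.0)
print(round(gap, 3))  # add to t_j to obtain the predicted next event time
```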
- Parameters. The RNN weights $\{W^y, W^t, W^h, \mathbf{b}_h\}$, the intensity readout $\{\mathbf{v}, w, b\}$, and the mark-embedding and softmax weights. All are trained end-to-end by stochastic gradient ascent on the log-likelihood (back-propagation through time).
- Stability of $w$. The compensator divides by $w$, so $w \to 0$ needs a limiting form ($\int e^{c}\,d\tau = e^{c}\,\Delta t$); implementations special-case small $|w|$.
- Sequence handling. Long catalogs are processed in truncated-BPTT windows; the hidden state carries context across windows.
- Regularization / data hunger. Like all neural TPPs, RMTPP needs many sequences to fit without overfitting — a binding constraint for seismicity, where the number of effectively independent large sequences is small (the central lesson of the DeVries–Mignan episode; see Models — ML §5). The product's guardrail is that any neural model must clear ETAS in a strictly temporal, prospective CSEP harness before it can ship.
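One way to handle the small-$|w|$ case from the stability item above is sketched below. It is an illustration rather than any particular implementation: the function name and the `eps` threshold are assumptions, and `math.expm1` is used because it evaluates $e^{x} - 1$ accurately when $x$ is close to zero.

```python
import math

def stable_compensator(c, w, dt, eps=1e-8):
    """Integral of exp(c + w*s) for s in [0, dt], safe as w -> 0.

    Uses expm1 to avoid cancellation for small |w*dt| and falls back to the
    limiting form exp(c) * dt when |w| is below the (hypothetical) eps threshold.
    """
    if abs(w) < eps:
        return math.exp(c) * dt
    return math.exp(c) * math.expm1(w * dt) / w

print(stable_compensator(-1.0, 0.0, 2.0))     # limiting form
print(stable_compensator(-1.0, 1e-12, 2.0))   # agrees with the limit
print(stable_compensator(-1.0, -0.5, 2.0))    # ordinary case
```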
- Learned history representation. No hand-designed kernel: the RNN discovers how marked history drives the future rate, in principle capturing dependencies ETAS's fixed kernels cannot.
- Joint timing + mark model. Time-to-next-event and next-mark are learned together, sharing $h_j$.
- Closed-form likelihood and density. The frozen-$h_j$ design makes the compensator and the next-event density analytic: fast, exact training with no Monte-Carlo compensator.
- General-purpose. The same architecture handles any marked stream; seismicity is just one instantiation (with magnitude/location marks).
- Monotone intra-interval intensity. With $h_j$ frozen between events, $\log\lambda^*(t)$ is linear in elapsed time, so $\lambda^*$ can only rise or fall monotonically within an interval. Real seismic relaxation after a mainshock (a power-law Omori–Utsu decay rather than an exponential one), and especially non-monotone dynamics like a delayed secondary surge, is not representable inside a single interval. This is the specific rigidity that Neural Hawkes relaxes with a continuously evolving (LSTM) state.
- Single-vector history bottleneck. The whole past is compressed into one fixed-length $h_j$; very long-range or cross-fault dependencies can be lost to RNN memory decay. This is the gap Transformer Hawkes addresses with self-attention.
- No explicit spatial kernel. RMTPP is fundamentally a temporal, categorically-marked model. Seismic forecasting needs a genuine spatial density; adapting RMTPP to space requires bolting on a spatial mark, which is not its native strength.
- No built-in seismic physics. There is no Omori law, no Gutenberg–Richter, no branching-ratio stability constraint unless added by hand. A purely learned RMTPP has no guarantee of the subcritical behaviour ETAS enforces.
- Validated off-domain. Its reported wins are on financial, EHR, taxi and social-media streams. Gains do not auto-transfer to earthquakes, which is the headline caveat of this entire model family. On earthquake catalogs under fair temporal splits, neural TPPs have not robustly beaten ETAS.
Net assessment for this product. RMTPP is studied as the conceptual ancestor and as a baseline in the neural-challenger research track — never as a default forecaster. Any version of it reaches the public map only behind the same hard gate as every neural model: a prospective CSEP win over a well-fit ETAS, plus a passing calibration / reliability diagram.
RMTPP itself has no operational role in CAOS_SEISMIC. Its role is conceptual and methodological:
- It establishes the template — a neural network parameterizing a conditional intensity, trained on the point-process log-likelihood — that the product's gated neural challenger follows, but with a Hawkes inductive bias (additive background + summed triggering) and explicit magnitude modeling that plain RMTPP lacks.
- Its closed-form compensator is a useful property to inherit where possible (cheap, exact training).
- Its limitations (monotone intervals, single-vector bottleneck, no spatial kernel, no seismic physics) are precisely the design requirements the product imposes on any neural model it would actually deploy: keep the ETAS skeleton, model magnitude and space explicitly, and prove skill prospectively.
In OEF terms: RMTPP teaches how to wire a learned intensity into the calibratable point-process framework, while the empirical record teaches that doing so does not, by itself, beat the physics-informed baseline.
Suppose after some event
So immediately after the event (
Two points. First, the entire forecast came from one scalar elapsed-time term and one frozen history vector — illustrating both the elegance and the rigidity of RMTPP: within this interval the rate can only decay monotonically. Second, this is still a rate-based, bounded probability with a survival penalty — an honest forecast, not a prediction that an event will occur. If no event happens in the two days, the ~16% forecast was not refuted.
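The arithmetic behind a forecast of this kind can be reproduced in a few lines. The values below are assumptions picked for illustration, not figures taken from this page: an initial rate of 0.2 events per day (so $\mathbf{v}^{\top}h_j + b = \log 0.2$), a decay of $w = -1$ per day, and a two-day horizon. With these inputs the probability of at least one event comes out near 16%.

```python
import math

def prob_at_least_one(rate0, w, horizon):
    """P(>=1 event within `horizon`) for lambda*(t) = rate0 * exp(w * elapsed)."""
    expected_count = rate0 * math.expm1(w * horizon) / w  # closed-form compensator
    return 1.0 - math.exp(-expected_count)

# Hypothetical values chosen for illustration.
rate0, w, horizon = 0.2, -1.0, 2.0
p = prob_at_least_one(rate0, w, horizon)
print(f"rate after {horizon:g} days: {rate0 * math.exp(w * horizon):.3f} per day")
print(f"P(at least one event in {horizon:g} days): {p:.3f}")
```

The whole calculation depends on just two numbers, the starting rate and $w$, which is the rigidity described in the first point above.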
- Du, N., Dai, H., Trivedi, R., Upadhyay, U., Gomez-Rodriguez, M. & Song, L. (2016). Recurrent Marked Temporal Point Processes: Embedding Event History to Vector. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2016), 1555–1564. doi:10.1145/2939672.2939875
- Hawkes, A.G. (1971). Spectra of some self-exciting and mutually exciting point processes. Biometrika 58(1), 83–90. doi:10.1093/biomet/58.1.83
- Ogata, Y. (1988). Statistical models for earthquake occurrences and residual analysis for point processes. Journal of the American Statistical Association 83(401), 9–27. doi:10.1080/01621459.1988.10478560
- Mei, H. & Eisner, J. (2017). The Neural Hawkes Process: A Neurally Self-Modulating Multivariate Point Process. NeurIPS 2017. arXiv:1612.09328
- Shchur, O., Türkmen, A.C., Januschowski, T. & Günnemann, S. (2021). Neural Temporal Point Processes: A Review. IJCAI 2021 Survey Track. arXiv:2104.03528
- Stockman, S., Lawson, D. & Werner, M.J. (2026, accepted). EarthquakeNPP: A Benchmark for Earthquake Forecasting with Neural Point Processes. Transactions on Machine Learning Research (TMLR). arXiv:2410.08226
- Dascher-Cousineau, K., Shchur, O., Brodsky, E.E. & Günnemann, S. (2023). Using deep learning for flexible and scalable earthquake forecasting (RECAST). Geophysical Research Letters 50, e2023GL103909. doi:10.1029/2023GL103909
See also: Temporal Point Processes · Neural Hawkes Process · Transformer Hawkes Process · Models — ML · Models — Classical · Honest-Limits.
⚠️ Disclaimer: read this. CAOS_SEISMIC produces probabilistic forecasts, not predictions. It is an independent research and education tool. It is NOT an official earthquake early-warning or civil-protection system, it does NOT predict when, where, or how large an earthquake will be, and it must NOT be used for life-safety, emergency, or evacuation decisions. Every number it publishes is a bounded, calibrated probability conditioned on the present state of seismicity, never an alarm, a countdown, or a "safe" state. A single outcome neither confirms nor refutes a probabilistic forecast. It complements, and does not replace or speak for, official agencies; always follow your national seismological and civil-protection authorities (e.g. USGS, INGV, CSN (Chile, SENAPRED for civil protection), GeoNet, JMA). The software is provided "as is", without warranty of any kind (MIT License); the authors accept no liability for its use. Data are courtesy of their providers (USGS/ANSS, ISC/ISC-GEM, Global CMT, EMSC, CSN, and others) under their respective licenses and attribution terms. See Honest-Limits for the full epistemic context.
CAOS_SEISMIC · seismic.fasl-work.com · source · MIT
Conditional probabilistic seismic forecasting — forecasts, never predictions.