Skip to content

Temporal Point Processes

Felipe Santibañez-Leal edited this page Jun 17, 2026 · 1 revision

Temporal Point Processes — the conditional-intensity framework

One topic, in depth. This page is the mathematical spine of every forecasting model in CAOS_SEISMIC. It defines the object all of them estimate — the conditional intensity function $\lambda^*(t \mid \mathcal{H}_t)$ — and the point-process log-likelihood that turns an intensity into a calibratable probability. Classical models (ETAS, Omori–Utsu, Reasenberg–Jones) and neural models (RMTPP, Neural Hawkes, Transformer Hawkes) are all special cases of what is defined here. If you read only one page on the why of the model family, read this one.

Honest framing up front. Everything below describes how to estimate rates and probabilities of seismicity, not how to predict individual earthquakes. A point-process model never says "an earthquake will happen." It says "given the history, the expected rate over the next window is $\lambda$, and therefore the probability of at least one event above $M^*$ is $1 - e^{-\Lambda}$." That number is bounded, calibrated, and testable. It is never an alarm. See Honest-Limits.


Table of contents

  1. Intuition: a catalog is a realization of a point process
  2. The counting process and the conditional intensity
  3. From intensity to a full probability model
  4. The point-process log-likelihood (the heart of it)
  5. Marks: magnitude and space
  6. Self-exciting (Hawkes) processes — the seismic special case
  7. Parameter estimation: MLE, the compensator, and residual analysis
  8. Simulation: from intensity to forecasts
  9. Why this framework, and its limits
  10. Role in operational earthquake forecasting
  11. Worked illustration
  12. References

1. Intuition: a catalog is a realization of a point process

An earthquake catalog is a list of events ${(t_i, x_i, y_i, m_i)}$ — a time, a location, and a magnitude for each. Mathematically, the times alone form a temporal point process: a random set of points on the time axis. Adding location and magnitude makes it a marked spatio-temporal point process, where the "marks" are $(x_i, y_i, m_i)$.

The key intuition is that we do not try to model each $t_i$ as a separate random variable. Instead we model the rate at which events arrive, allowing that rate to depend on the entire past. A quiet fault has a low arrival rate; immediately after a mainshock the rate spikes (aftershocks) and then decays. A single function — the conditional intensity $\lambda^*(t \mid \mathcal{H}_t)$ — captures all of this. Everything else (counts, probabilities, simulations, scores) is derived from it.

This is the same mathematical object that governs queue arrivals, neuron firing, financial trades, and social-media cascades. What makes seismic point processes special is the physics baked into the intensity (Omori decay, Gutenberg–Richter magnitudes, self-excitation) — but the scaffolding is universal, which is precisely why the neural architectures on the sibling pages, originally built for non-seismic streams, are even candidates here.


2. The counting process and the conditional intensity

Let $N(t)$ be the counting process — the number of events that have occurred up to time $t$. Let

$$\mathcal{H}_t = {(t_i, x_i, y_i, m_i) : t_i < t}$$

denote the history: everything observed strictly before $t$. The conditional intensity function is the instantaneous expected rate of events at time $t$, given the history:

$$\lambda^*(t \mid \mathcal{H}_t) = \lim_{\Delta t \to 0} \frac{\mathbb{E}!\left[N(t + \Delta t) - N(t) \mid \mathcal{H}_t\right]}{\Delta t}.$$

The superscript $$ is the standard Daley–Vere-Jones notation reminding us that the intensity is conditioned on history — without it, $\lambda$ would be a deterministic (Poisson) rate. Reading the definition: in a tiny window $[t, t+\Delta t)$, the probability of exactly one event is $\lambda^(t),\Delta t + o(\Delta t)$, and the probability of two or more is $o(\Delta t)$ (events do not pile up at a single instant).

A few immediate consequences:

  • Non-negativity. $\lambda^*(t) \ge 0$ always — it is a rate. Neural parameterizations enforce this with a softplus or exp output (see Neural Hawkes and RMTPP).
  • History dependence is the whole game. If $\lambda^*$ did not depend on $\mathcal{H}_t$, the process would be an inhomogeneous Poisson process and earthquakes would be "memoryless" in their occurrence — which contradicts the observed Omori-law clustering. The dependence on $\mathcal{H}_t$ is what lets a mainshock raise the future rate.
  • The compensator. The integral $\Lambda(t) = \int_0^t \lambda^*(\tau),d\tau$ is the compensator (cumulative intensity). It is the expected number of events by time $t$ and is the engine of both the likelihood (§4) and residual diagnostics (§7).

3. From intensity to a full probability model

The conditional intensity determines the entire distribution of the next event. Define the conditional survival between the last event at $t_j$ and a candidate next time $t$:

$$S^_(t) = \Pr(t_{j+1} &gt; t \mid \mathcal{H}_{t_j}) = \exp!\left(-\int_{t_j}^{t} \lambda^_(\tau),d\tau\right).$$

The probability density of the next event time is then

$$f^_(t) = \lambda^_(t), S^_(t) = \lambda^_(t), \exp!\left(-\int_{t_j}^{t} \lambda^*(\tau),d\tau\right).$$

This factorization — an instantaneous hazard $\lambda^(t)$ multiplied by the survival up to that instant — is the single most important identity in the theory. It says the model is fully specified by $\lambda^$ alone: give me the intensity and I can write down the density of every inter-event time, hence the joint density of the whole catalog.

Crucially, the probability of at least one event in a forecast window $[t, t+H)$ follows directly:

$$\Pr\big(N(t+H) - N(t) \ge 1 \mid \mathcal{H}_t\big) = 1 - \exp!\left(-\int_{t}^{t+H} \lambda^*(\tau),d\tau\right) = 1 - e^{-\Lambda_{[t,t+H)}}.$$

This is exactly the exceedance probability the product publishes (restricted to magnitudes $\ge M^*$ via the marked extension in §5). The "$1 - e^{-\Lambda}$" the product copy references is not a heuristic — it is the survival identity of the point process.

flowchart LR
    H["History H_t<br/>(t_i, x_i, y_i, m_i)"] --> L["Conditional intensity<br/>λ*(t | H_t)"]
    L --> C["Compensator<br/>Λ = ∫ λ* dτ"]
    C --> S["Survival<br/>S*(t) = exp(−Λ)"]
    S --> P["Exceedance probability<br/>P(≥1 event) = 1 − e^(−Λ)"]
    L --> LL["Log-likelihood<br/>Σ log λ*(t_i) − ∫ λ* dτ"]
    LL --> FIT["MLE / training"]
    L --> SIM["Ogata thinning<br/>→ synthetic catalogs"]
    SIM --> FC["Forecast bands<br/>(CSEP tests)"]
Loading

4. The point-process log-likelihood (the heart of it)

Given an observed catalog with event times $t_1 &lt; t_2 &lt; \dots &lt; t_n$ on the interval $[0, T]$, the joint density is the product of the next-event densities (§3). Taking logs and using the survival factorization, the inter-event survival terms telescope into a single integral over $[0,T]$, giving the point-process log-likelihood:

$$\boxed{;\log \mathcal{L} = \sum_{i=1}^{n} \log \lambda^_(t_i \mid \mathcal{H}_{t_i}) ;-; \int_0^T \lambda^_(\tau \mid \mathcal{H}_\tau), d\tau;}$$

Read the two terms separately, because the contrast is the whole point:

  • The sum $\sum_i \log \lambda^*(t_i)$ rewards placing high intensity exactly where events actually occurred. A model that puts a tall spike on every observed event maximizes this term.
  • The integral $\int_0^T \lambda^*(\tau),d\tau$ — the compensator — is the survival penalty. It punishes putting high intensity where events did not occur. A model that is "always alarmed" pays a large integral penalty.

These two terms are in tension, and that tension is what forces a model to output an honest rate rather than a binary "alarm / no alarm." A classifier ("will there be an $M\ge5$ tomorrow? yes/no") optimizes only an analogue of the first term and throws away the integral. That is the structural reason classification/regression framings become un-calibratable: without the compensator there is no survival penalty, so there is nothing forcing the published probability to match observed frequencies. This single equation is the dividing line between point-process forecasting and the ML-for-earthquakes failure modes catalogued on Models — ML.

Operational note. The same log-likelihood is reused at test time as the CSEP L-test (likelihood test) and pseudo-likelihood score — so the quantity we fit on is the quantity we are later graded on. See Evaluation.

For the inhomogeneous Poisson special case ($\lambda^*(t) = \lambda(t)$ independent of history), the formula is unchanged but the integral becomes a deterministic constant — which is exactly the climatological baseline the product always shows next to its conditional number.


5. Marks: magnitude and space

Seismic forecasting needs where and how large, not just when. The marked conditional intensity factorizes (under the usual separability assumption) into a ground intensity times mark densities:

$$\lambda^_(t, x, y, m \mid \mathcal{H}_t) = \lambda^_(t \mid \mathcal{H}_t), f(x, y \mid t, \mathcal{H}_t), f(m).$$

  • Magnitude mark $f(m)$. Standard seismology takes magnitude as (approximately) independent of history given that an event occurs — the Gutenberg–Richter law, $f(m) = \beta, e^{-\beta(m - M_c)}$ with $\beta = b \ln 10$, for $m \ge M_c$. This "magnitude is memoryless" property is why deterministically predicting the size of the next event is not supported, and why the product forecasts the distribution of magnitudes, not a number.
  • Spatial mark $f(x,y\mid\cdot)$. A density over location, typically a magnitude-scaled power-law kernel centered on triggering events (classical ETAS) or a learned spatial field (FERN-style neural encoders).

The marked log-likelihood simply adds the mark log-densities at observed events and integrates the ground intensity over space as well as time:

$$\log\mathcal{L} = \sum_{i=1}^{n}\Big[\log\lambda^*(t_i) + \log f(x_i,y_i\mid\cdot) + \log f(m_i)\Big]

  • \int_0^T!!\int_A \lambda^*(t,x,y),dx,dy,dt.$$

The marked exceedance probability at a display threshold $M^*$ — the public scalar — is

$$P(\ge 1 \text{ event} \ge M^_) = 1 - e^{-\Lambda_{\ge M^_}}, \qquad \Lambda_{\ge M^_} = \int_t^{t+H}!!\int_A \lambda^_(\tau,x,y), \Phi(M^*),dx,dy,d\tau,$$

with the GR exceedance factor $\Phi(M^) = 10^{-b,(M^ - M_c)}$. The neural pages keep $f(m)$ explicit rather than dropping it — a real gap in many neural point processes that the product treats as a hard requirement.


6. Self-exciting (Hawkes) processes — the seismic special case

A Hawkes process is the temporal point process whose intensity is a constant background plus a sum of decaying "kicks," one per past event:

$$\lambda^*(t) = \mu + \sum_{i:,t_i < t} \phi(t - t_i),$$

where $\mu \ge 0$ is the background rate and $\phi(\cdot) \ge 0$ is the triggering kernel (each event raises the future rate, then the kick decays). Hawkes is the natural language for seismicity because earthquakes trigger aftershocks — the process is self-exciting.

ETAS is precisely a marked Hawkes process with seismologically motivated kernels: an Omori–Utsu temporal decay $\phi(t-t_i) \propto (t - t_i + c)^{-p}$, a magnitude-dependent productivity $\propto e^{\alpha(m_i - m_0)}$, and a spatial kernel. (Full ETAS equations and estimation are on Models — Classical.)

A central quantity is the branching ratio

$$n = \int_0^{\infty}!!\int \phi(\tau),d\tau \cdots,$$

the expected number of direct aftershocks per event. The process is subcritical / stationary only if $n &lt; 1$; at $n \ge 1$ the cascade is supercritical (a single event triggers, on average, an ever-growing family) and the model is rejected as a mis-fit. The neural Hawkes models on the sibling pages generalize this skeleton in two directions:

Generalization What it relaxes Which page
Kernel becomes a learned RNN function of history $\phi$ no longer a fixed parametric decay RMTPP
Background + kicks become a continuous-time LSTM with decay allows inhibition ($\phi &lt; 0$ effects), self-modulation Neural Hawkes
History summarized by self-attention long-range dependencies without recurrence memory decay Transformer Hawkes

Each is a strictly more expressive intensity plugged into the same log-likelihood of §4. That shared likelihood is why they are directly comparable to ETAS — and why, despite the extra expressiveness, none has robustly beaten a well-fit ETAS in prospective CSEP testing as of 2026.


7. Parameter estimation: MLE, the compensator, and residual analysis

Maximum likelihood. Parameters $\theta$ (e.g. ETAS's $\mu, K, c, p, \alpha$, or a neural net's weights) are fit by maximizing the log-likelihood of §4. For ETAS this is a low-dimensional, smooth optimization (5–8 parameters); for neural intensities it is stochastic gradient ascent on the same objective. The compensator integral $\int_0^T \lambda^*,d\tau$ is usually the expensive part — it must be evaluated (analytically for ETAS kernels, by quadrature or Monte Carlo for neural intensities) every step.

Residual analysis (the random-time-change theorem). A deep, useful fact: if $\lambda^*$ is the true intensity, then transforming each event time by its own compensator,

$$\tau_i = \Lambda(t_i) = \int_0^{t_i} \lambda^*(\tau),d\tau,$$

turns the catalog into a unit-rate Poisson process in the transformed time $\tau$. So a fitted model can be diagnosed by checking whether the ${\tau_i}$ look Poisson (inter-event gaps $\sim \text{Exponential}(1)$, no remaining clustering). This is Ogata's residual analysis and is the principled way to detect that a model is missing structure (e.g. an unmodeled swarm) without appeal to any external metric.

Uncertainty. Parameter uncertainty comes from the inverse Hessian (Fisher information) of the log-likelihood, from bootstrap, or from a Bayesian posterior. The product propagates this into the published P10/median/P90 bands — a Poisson-only interval is not sufficient because regional seismicity is over-dispersed relative to Poisson (see Evaluation).


8. Simulation: from intensity to forecasts

Because the model is generative, we forecast by simulation rather than by a closed-form formula whenever the window contains self-triggering. The standard method is Ogata's thinning algorithm:

  1. Propose candidate event times from a homogeneous Poisson process with rate $\bar\lambda \ge \sup_{[t,t+H)} \lambda^*$ (an upper bound on the intensity over the window).
  2. Accept each candidate $t^\star$ with probability $\lambda^*(t^\star)/\bar\lambda$.
  3. Each accepted event is appended to the history, raising the intensity for later candidates (this is what reproduces aftershock cascades).
  4. Repeat to the horizon $H$; draw each accepted event's magnitude from $f(m)$ and location from the spatial kernel.

Running this $\ge 10{,}000$ times yields an ensemble of synthetic next-window catalogs. From the ensemble we read off the distribution of event counts above $M^*$ — hence the median forecast, the P10/P90 bands, and every CSEP catalog-based test statistic. This is exactly the "10,000 simulated catalogs per day" protocol used both by the product's daily job and by the EarthquakeNPP benchmark, so ETAS and any neural challenger are simulated and scored identically.


9. Why this framework, and its limits

Strengths.

  • One object, everything derived. Counts, probabilities, simulations, and scores all flow from $\lambda^*$. No separate "probability calibration model" is bolted on — the survival identity is the probability model.
  • Calibratable by construction. The compensator/survival penalty is what makes the output a proper probability, gradeable by CSEP consistency tests and reliability diagrams.
  • Unifies classical and neural. ETAS, Reasenberg–Jones, RMTPP, Neural Hawkes, and Transformer Hawkes are all the same framework with different intensities — directly comparable on the same likelihood and the same simulation protocol.

Limits and assumptions (stated honestly).

  • Separability (time ⟂ space ⟂ magnitude) is an approximation; real catalogs show magnitude–space–time coupling that simple separable kernels miss.
  • Completeness. The likelihood assumes events below $M_c$ are simply absent, not mis-recorded. A drifting or post-mainshock-spiking $M_c$ silently biases the fit — handled by a time-varying $M_c(x,y,t)$ (see Data Sources and Methodology).
  • Stationarity of the background $\mu$ over the fit window — often violated by transients (slow-slip, swarms, induced seismicity).
  • The framework forecasts rates, not individual events. No amount of intensity modeling makes the next earthquake's time/size deterministically knowable — the GR memorylessness of magnitude and the nonlinearity of rupture nucleation forbid it. See Honest-Limits.

10. Role in operational earthquake forecasting

Operational Earthquake Forecasting (OEF) is the practice of issuing authoritative, regularly updated probabilities of future seismicity (Jordan et al. 2011). The conditional-intensity framework is its mathematical foundation:

  • The daily public number is the marked exceedance probability $1 - e^{-\Lambda_{\ge M^*}}$ of §5, computed from a fitted intensity over a 1d / 2d / 7d window.
  • The information gain of one model over another (the CSEP comparison metric, in nats) is a difference of the per-event log-intensities — a direct consequence of the §4 likelihood. This is how the product certifies that any neural challenger genuinely beats ETAS before it reaches the map.
  • The honesty guarantees — bounded probabilities, no alarms, a climatological baseline always shown alongside — are properties of the math, not editorial choices: the survival identity bounds the output in $[0,1]$, and the compensator forces it to track observed frequency.

In short, the conditional intensity is what the product estimates; the log-likelihood is how it is fit and graded; and the survival identity is why the published number is an honest probability and never a prediction.


11. Worked illustration

Consider a single isolated $M_w,6.0$ mainshock and a temporal-only ETAS-style intensity, ignoring background for clarity:

$$\lambda^*(t) = \frac{K, e^{\alpha (m - m_0)}}{(t - t_0 + c)^{p}},$$

with illustrative values $K = 0.02,\text{day}^{p-1}$, $\alpha = 1.5$, $m - m_0 = 2.0$, $c = 0.01,\text{day}$, $p = 1.1$. The expected number of triggered events of any size in the first day after the mainshock is the compensator over $[t_0, t_0 + 1]$:

$$\Lambda = \int_{t_0}^{t_0+1} \lambda^*(\tau),d\tau = K, e^{\alpha(m-m_0)} \int_0^{1} (s + c)^{-p},ds = K, e^{\alpha(m-m_0)} \cdot \frac{(1+c)^{1-p} - c^{1-p}}{1-p}.$$

Plugging in: $e^{\alpha(m-m_0)} = e^{3.0} \approx 20.1$, and the bracket evaluates to $\frac{(1.01)^{-0.1} - (0.01)^{-0.1}}{-0.1} \approx \frac{0.999 - 1.585}{-0.1} \approx 5.86$, giving $\Lambda \approx 0.02 \times 20.1 \times 5.86 \approx 2.36$ events of any size.

To get the probability of at least one aftershock of $M \ge 5$, scale by the GR exceedance factor with $b = 1.0$, $M_c = 3.0$: $\Phi(5) = 10^{-1.0,(5 - 3)} = 10^{-2} = 0.01$, so $\Lambda_{\ge 5} \approx 2.36 \times 0.01 = 0.0236$, and

$$P(\ge 1 \text{ event } M\ge 5 \text{ in 1 day}) = 1 - e^{-0.0236} \approx 2.3%.$$

Two honest readings of this worked number: (1) the relative gain over a quiet-day climatology (which might be $\sim$0.01%) is two-to-three orders of magnitude — the genuine value of conditional forecasting; (2) the absolute probability is still small (~2%), so if no $M\ge5$ aftershock occurs, the forecast was not wrong — a single outcome neither confirms nor refutes a probability. This is the 2019 Ridgecrest lesson the product copy repeats.


References

  1. Daley, D.J. & Vere-Jones, D. (2003). An Introduction to the Theory of Point Processes, Vol. I: Elementary Theory and Methods (2nd ed.). Springer. doi:10.1007/b97277
  2. Ogata, Y. (1988). Statistical models for earthquake occurrences and residual analysis for point processes. Journal of the American Statistical Association 83(401), 9–27. doi:10.1080/01621459.1988.10478560
  3. Ogata, Y. (1998). Space-time point-process models for earthquake occurrences. Annals of the Institute of Statistical Mathematics 50(2), 379–402. doi:10.1023/A:1003403601725
  4. Ogata, Y. (1981). On Lewis' simulation method for point processes (the thinning algorithm). IEEE Transactions on Information Theory 27(1), 23–31. doi:10.1109/TIT.1981.1056305
  5. Hawkes, A.G. (1971). Spectra of some self-exciting and mutually exciting point processes. Biometrika 58(1), 83–90. doi:10.1093/biomet/58.1.83
  6. Rasmussen, J.G. (2018). Lectures on the Poisson Process and temporal point processes. arXiv:1806.00221
  7. Reinhart, A. (2018). A review of self-exciting spatio-temporal point processes and their applications. Statistical Science 33(3), 299–318. doi:10.1214/17-STS629
  8. Schoenberg, F.P. (2003). Multidimensional residual analysis of point process models for earthquake occurrences. Journal of the American Statistical Association 98(464), 789–795. doi:10.1198/016214503000000710
  9. Jordan, T.H., Chen, Y.-T., Gasparini, P., Madariaga, R., Main, I., Marzocchi, W., Papadopoulos, G., Sobolev, G., Yamaoka, K. & Zschau, J. (2011). Operational earthquake forecasting: state of knowledge and guidelines for utilization. Annals of Geophysics 54(4), 315–391. doi:10.4401/ag-5350

See also: Models — Classical · Models — ML · RMTPP · Neural Hawkes Process · Transformer Hawkes Process · Evaluation · Glossary · Honest-Limits.

Clone this wiki locally