Skip to content

Smoothed Seismicity

Felipe Santibañez-Leal edited this page Jun 17, 2026 · 1 revision

Smoothed Seismicity — the adaptive-kernel spatial background and the stationary Poisson null

A single-topic deep page. Smoothed seismicity is the model that answers "where do earthquakes happen?" by turning a catalog of past small events into a smooth, stationary map of the long-term rate. It is the most consequential "boring" model in the whole system, because it plays two jobs at once: it is the spatial background $\mu(x,y)$ that the time-dependent ETAS model needs, and it is the stationary, time-independent Poisson reference — the mandatory null that every time-dependent or machine-learning forecaster must demonstrably beat in prospective CSEP testing before it can claim any forecasting skill at all.

Honest framing. This page describes a forecast, not a prediction. Smoothed seismicity produces a conditional rate density — a number of expected events per unit area per unit time per magnitude bin — never a deterministic statement that an event will or will not occur. It is time-independent by construction: it says nothing about when, only about the long-run where and how often. The published quantity is always a bounded probability in $(0,1)$, shown next to its own long-term baseline.


Table of contents

  1. Intuition and history
  2. The governing model — kernel density of a point process
  3. The adaptive power-law kernel
  4. Why adaptive bandwidth, not fixed bandwidth
  5. From a spatial density to a space–rate–magnitude forecast
  6. Declustering: the long-term map must not be a snapshot of a burst
  7. Parameter estimation and retrospective optimization
  8. The dual role: background for ETAS, and the mandatory null
  9. Assumptions, strengths and limitations
  10. A worked illustration
  11. Role in operational earthquake forecasting
  12. References

1. Intuition and history

The founding empirical observation is simple and robust: large earthquakes tend to nucleate where small earthquakes have clustered in the past. Microseismicity (down to $M \approx 2$) traces out, at high spatial resolution, the faults and fault networks that are currently active. A region that has produced thousands of small events over a few decades is far more likely to produce the next $M \ge 5$ than a region that has been seismically silent. The locations of small events are, in effect, a dense, cheap proxy for the spatial loading and fault geometry that we cannot observe directly.

The naïve way to use this observation — counting events in fixed grid cells — is badly behaved: cells are arbitrary, the map is discontinuous at cell boundaries, an empty cell next to a busy cell reads as "impossible" when it is merely "undersampled", and the resolution is fixed regardless of how much data there is locally. Kernel smoothing fixes all of these. Instead of binning, each past event is replaced by a smooth "bump" (a kernel) centred on its epicentre, and the bumps are summed to give a continuous density field. Where events are dense the bumps overlap and the density is high; where events are sparse the density tapers smoothly to a low, non-zero floor.

The decisive methodological step — and what makes this a genuinely competitive forecast rather than a textbook exercise — is the adaptive bandwidth introduced by Helmstetter, Kagan & Jackson (2007). The width of each bump is not fixed; it is set by the local data density, so that well-sampled fault traces are imaged sharply and sparsely-sampled regions are smoothed broadly. This single idea let an unassuming, parameter-light, time-independent model become the best-performing time-independent forecast in the RELM/CSEP California experiments — beating several far more elaborate models. Werner et al. (2011) later extended the same machinery to time-dependent adaptive smoothing, and the testing/evaluation discipline around it grew into the modern CSEP framework. The model is now a standard ingredient of national hazard models and the default background term of operational ETAS implementations.

flowchart LR
  A["Catalog of past<br/>small events (M ≥ 2)"] --> B["Decluster<br/>(remove aftershock bursts)"]
  B --> C["Place an adaptive kernel<br/>on each event"]
  C --> D["Sum kernels →<br/>smooth density μ(x,y)"]
  D --> E["× Gutenberg–Richter<br/>magnitude tail"]
  E --> F["× time scaling"]
  F --> G["Space–rate–magnitude<br/>forecast λ(x,y,m)"]
  G --> H1["Role 1:<br/>background μ for ETAS"]
  G --> H2["Role 2:<br/>stationary Poisson NULL<br/>(model to beat)"]
Loading

2. The governing model — kernel density of a point process

Treat the past epicentres ${(x_i, y_i)}_{i=1}^{N}$ as a realisation of an inhomogeneous spatial Poisson point process with unknown intensity $\mu(x,y)$ (events per unit area per unit time). The kernel density estimate of that intensity is a sum of normalized kernels, one per event:

$$\mu(x,y) = \sum_{i=1}^{N} K_{d_i}!\big(\lVert (x,y) - (x_i,y_i)\rVert\big),$$

where $\lVert \cdot \rVert$ is the (horizontal) epicentral distance, $K_{d}(\cdot)$ is a kernel function with bandwidth $d$, and $d_i$ is the bandwidth assigned to event $i$. Each kernel is a proper probability density over the plane,

$$\int_{\mathbb{R}^2} K_{d}(\lVert \mathbf{r} \rVert), d\mathbf{r} = 1,$$

so that if we sum $N$ kernels the total integral is exactly $N$ — the estimated density conserves the number of events. Multiplying by a temporal rate (events per unit time) turns this spatial density into a spatial-rate field. The estimator is non-parametric: it makes no assumption about the functional shape of $\mu(x,y)$, only that nearby past events indicate elevated future rate.

The same object can be written as the background term of the general conditional intensity of a marked space–time point process. In a pure smoothed-seismicity (time-independent) forecast, the triggering sum is absent and the intensity is just $\mu(x,y)$ multiplied by a magnitude density and a horizon length — i.e. a stationary Poisson process in time. That is exactly what makes it the natural null: it is the conditional-intensity model with the time-dependence switched off.


3. The adaptive power-law kernel

Helmstetter, Kagan & Jackson (2007) use an isotropic power-law kernel rather than a Gaussian, because earthquake epicentres have heavy spatial tails — neighbouring faults are correlated over long distances and a Gaussian decays far too fast. The kernel has the form

$$K_{d}(r) = \frac{C(d)}{\big(r^2 + d^2\big)^{s}},$$

  • $r$ — horizontal distance from the kernel's centre (the event) to the evaluation point.
  • $d$ — the bandwidth (smoothing length) of this kernel.
  • $s$ — the power-law exponent controlling how fast the kernel decays at large $r$.
  • $C(d)$ — the normalization constant, chosen so the kernel integrates to 1 over the plane.

For a power-law $1/(r^2 + d^2)^{s}$ on the plane, the normalization follows from the radial integral

$$\int_{\mathbb{R}^2} \frac{1}{(r^2 + d^2)^{s}}, d\mathbf{r} = 2\pi \int_0^{\infty} \frac{r}{(r^2 + d^2)^{s}}, dr = \frac{\pi}{(s-1), d^{,2(s-1)}} \quad (s > 1),$$

so that $C(d) = (s-1), d^{,2(s-1)} / \pi$ makes $\int K_d = 1$. The integral converges only for $s &gt; 1$; for $s \le 1$ the heavy tail is non-integrable and the kernel cannot be normalized on the infinite plane (in practice the domain is finite, but $s &gt; 1$ is the well-posed regime). The near-field behaviour is regularized by $d^2$: at the event itself ($r = 0$) the kernel takes the finite value $C(d)/d^{2s}$, avoiding the singularity a bare $1/r^{2s}$ would have.

Implementation note — do not hard-code the exponent. The exponent $s$ and the matching normalization $C(d)$ are not a single universal constant across the Helmstetter–Kagan–Jackson family of papers. Both a power-$1$ form ($s = 1$, $\propto 1/(r^2 + d^2)$) and a power-$3/2$ form ($s = 3/2$, $\propto 1/(r^2 + d^2)^{3/2}$) appear, depending on the specific kernel, the normalization convention, and whether the smoothing is done on a sphere or a plane. The correct engineering practice is to pin $s$, $C(d)$ and the neighbour count to one specific reference implementation (Helmstetter–Kagan–Jackson 2007 / Werner et al. 2011, or the pyCSEP / floatCSEP reference code) and validate it on a known catalog, rather than treating $s = 3/2$ as a verified universal value. Throughout this page $s$ is kept symbolic for exactly this reason.


4. Why adaptive bandwidth, not fixed bandwidth

The defining feature of the model is that each event gets its own bandwidth $d_i$, set by the local data density. The standard rule is a $k$-nearest-neighbour bandwidth:

$$d_i = \text{(horizontal) distance from event } i \text{ to its } n\text{-th nearest neighbour},$$

with $n \approx 6$ (a hyperparameter, optimized retrospectively, not a constant). The consequences:

  • Where data are dense (an active, well-imaged fault trace), the $n$-th neighbour is close, so $d_i$ is small and the kernel is narrow — the fault is imaged sharply.
  • Where data are sparse (a quiet region, the edge of a network), the $n$-th neighbour is far, so $d_i$ is large and the kernel is broad — the rate is spread out and no single isolated event produces a spurious hot-spot.

This automatic, data-driven trade-off is exactly what a fixed-bandwidth estimator cannot do. A fixed bandwidth narrow enough to resolve a busy fault would produce noisy, spiky garbage in sparse regions; a fixed bandwidth broad enough to smooth the sparse regions would blur every fault into mush. The adaptive bandwidth resolves the resolution-vs-stability tension locally, and it is the single reason the method is competitive.

flowchart TB
  subgraph Dense["Dense region (active fault)"]
    D1["n-th neighbour is CLOSE"] --> D2["small bandwidth d_i"] --> D3["NARROW kernel<br/>sharp fault image"]
  end
  subgraph Sparse["Sparse region (quiet area)"]
    S1["n-th neighbour is FAR"] --> S2["large bandwidth d_i"] --> S3["BROAD kernel<br/>smooth, stable floor"]
  end
Loading

A subtlety worth stating: a fixed bandwidth makes the smoothing time-independent and stationary by construction, but the adaptive bandwidth introduces a mild dependence on the catalog used to define the neighbours. For the canonical time-independent forecast, the bandwidths are computed once from a long learning catalog and then held fixed, preserving stationarity. Werner et al. (2011) deliberately relax this to build a time-dependent adaptive-smoothing variant that updates as new events arrive — a different, short-term product.


5. From a spatial density to a space–rate–magnitude forecast

A CSEP-testable forecast is a rate in every space–magnitude bin over a fixed horizon. The smoothed density $\mu(x,y)$ supplies the spatial factor; it is multiplied by a magnitude factor (Gutenberg–Richter) and a temporal factor (the horizon length) under the assumption of separability — that the magnitude distribution is the same everywhere and does not depend on location or time.

The expected number of events in spatial cell $s_j$, magnitude bin $m_i$, over horizon $T$ is

$$\lambda(s_j, m_i) = \underbrace{\Big[\textstyle\int_{s_j} \mu(x,y), dx, dy\Big]}_{\text{spatial weight of cell}} ;\times; \underbrace{N_{\text{tot}}(T)}_{\text{total rate} \ge M_c} ;\times; \underbrace{\big[,10^{-b(m_i - M_c)} - 10^{-b(m_{i}+\Delta M - M_c)},\big]}_{\text{GR fraction in bin } m_i},$$

where $N_{\text{tot}}(T)$ is the total forecast number of events $\ge M_c$ over the horizon (the temporal scaling), and the bracketed magnitude factor is the Gutenberg–Richter probability that an event falls in bin $[m_i, m_i + \Delta M)$ — see Gutenberg–Richter. The spatial weights are normalized so that $\sum_j \int_{s_j}\mu = 1$, so $\mu$ acts as a spatial probability and $N_{\text{tot}}$ carries the absolute rate.

The single number a user sees is then the exceedance probability at a region-appropriate target magnitude $M^*$, obtained by treating the events in the region over the horizon as a non-homogeneous Poisson process:

$$N_{\ge M^_} = \int_0^T!!\int_A \mu(x,y), N_{\text{tot}}, 10^{-b(M^_ - M_c)}, dx, dy, \tfrac{dt}{T}, \qquad P(\ge 1 \text{ event} \ge M^_) = 1 - e^{-N_{\ge M^_}}.$$

This is the same $P = 1 - e^{-N}$ formula used everywhere in the system; only the quality of the rate $\lambda$ differs between models. For the pure smoothed-seismicity null, $\lambda$ is constant in time, so on a quiet day it reads exactly at climatology — which is precisely the behaviour we want from a null.


6. Declustering: the long-term map must not be a snapshot of a burst

A smoothed-seismicity background must reflect the long-term average rate of independent events, not the transient inflation of an ongoing aftershock sequence. If the raw catalog is smoothed directly, a single recent large mainshock and its thousands of aftershocks would paint a huge, spurious hot-spot that has nothing to do with the long-run loading — the map would be a photograph of last month's burst, not the decade's tectonics. Therefore the catalog is declustered first: aftershock and foreshock bursts are stripped so that only (approximately) independent mainshocks remain, and the smoothing then images the stationary spatial structure.

This is one half of the system's dual-catalog rule, which is worth stating crisply because it is the most common pipeline mistake in the field:

Quantity Catalog used Why
Smoothed background $\mu(x,y)$ Declustered Must reflect independent long-term rate, not aftershock inflation.
ETAS / triggering term Full (un-declustered) Aftershock/foreshock triggering is the predictable short-term signal.

Declustering methods range from the classic Gardner–Knopoff space–time windowing (transparent, easy to audit) to the modern Zaliapin–Ben-Zion nearest-neighbour method (statistically principled, based on a rescaled space–time–magnitude proximity). The choice of declustering is itself an assumption that propagates into the background map, so it is treated as a documented, versioned decision and ideally cross-checked across methods. (Declustering is deceptively dangerous: it is the step where it is easiest to silently delete real signal.)


7. Parameter estimation and retrospective optimization

The smoothed-seismicity model is refreshingly parameter-light. The quantities to set are:

  1. The neighbour count $n$ (which controls all bandwidths $d_i$). Typically $n \approx 6$, but it is optimized retrospectively: the learning catalog is split into a fitting period and a target period, the forecast is generated for a grid of candidate $n$ values, and the $n$ that maximizes the out-of-sample log-likelihood of the target events is selected.
  2. The kernel exponent $s$ (and the matching normalization $C(d)$) — pinned to the chosen reference implementation (§3).
  3. The completeness magnitude $M_c$ — the lowest magnitude included in the smoothing catalog (small events dominate the count, so a stable, low $M_c$ is what gives the map its resolution).
  4. The Gutenberg–Richter $b$-value — for the magnitude factor; estimated by the binning-corrected Aki–Utsu MLE, never hard-coded to 1.

The optimization objective is the point-process log-likelihood of the held-out target events under the forecast intensity. For a Poisson forecast with binned rates $\lambda_k$ and observed counts $\omega_k$,

$$\ln L = \sum_{k} \big[, -\lambda_k + \omega_k \ln \lambda_k - \ln(\omega_k!) ,\big],$$

summed over space–magnitude bins $k$. Maximizing this over $n$ (and any other free hyperparameters) selects the smoothing that best predicts events it has never seen. This retrospective optimization under an out-of-sample likelihood is what protects the model from over-fitting — a fixed bandwidth tuned in-sample would look great in-sample and fail prospectively, which is exactly the kind of self-deception the evaluation framework is designed to expose.

Crucially, all hyperparameters must be frozen on data strictly before the forecast period — the forecast-clock causal cutoff — so that the reported skill is genuinely prospective and not a hindsight artifact.


8. The dual role: background for ETAS, and the mandatory null

Smoothed seismicity earns its central place because it serves two structurally different purposes.

Role 1 — the spatial background $\mu(x,y)$ of ETAS. The ETAS conditional intensity is $\lambda(t,x,y\mid\mathcal{H}_t) = \mu(x,y) + (\text{triggering sum})$. The background term $\mu$ is where independent events nucleate, and smoothed seismicity (on a declustered catalog) is the standard, best-available estimate of exactly that. A poorly-chosen background degrades ETAS everywhere the triggering sum is small — i.e. on the quiet majority of the map and the quiet majority of days. So the quality of the background is not a side detail; it dominates the all-period score.

Role 2 — the stationary Poisson null. A time-independent forecast is, by definition, the hypothesis "the future rate equals the smoothed past rate, with no time-dependence." This is the honest null hypothesis for the whole product. Any model that adds time-dependence — ETAS, a neural point process, a tidal covariate — must show, in prospective CSEP comparison tests, a positive and statistically significant information gain per earthquake over this null. If it cannot, the added machinery is adding noise, not skill, and must not touch a public number.

flowchart LR
  SS["Smoothed seismicity<br/>μ(x,y)"]
  SS -->|"Role 1: background term"| ETAS["ETAS conditional intensity<br/>λ = μ + triggering"]
  SS -->|"Role 2: stationary null"| NULL["Time-independent<br/>Poisson reference"]
  ETAS -->|"must beat (prospective IGPE > 0)"| NULL
  ML["Neural / hybrid model"] -->|"must beat"| NULL
  ML -->|"must beat"| ETAS
Loading

This is a sharp, honest bar. In prospective California CSEP, even the spread of time-independent models against a smoothed-seismicity reference is only of order a fraction of a nat per earthquake — the null is genuinely hard to beat, and saying so openly is part of the product's credibility.


9. Assumptions, strengths and limitations

Core assumptions.

  • The future resembles the smoothed past locations. Future large events occur (in a smoothed sense) where past small events have occurred.
  • Stationarity. The spatial rate does not change over the forecast horizon (true to good approximation over years, by construction in the time-independent variant).
  • Separability of magnitude. The Gutenberg–Richter distribution is the same everywhere (a simplification — $b$ genuinely varies in space, which a refined version maps locally).
  • Isotropy of the kernel. Each event smears out in a circularly symmetric bump, whereas real seismicity is elongated along fault strike (an anisotropic kernel does better but adds parameters).

Strengths.

  • Best-performing time-independent forecast in the RELM/CSEP California experiments — a strong, hard-to-beat baseline despite its simplicity.
  • Parameter-light and transparent. One main hyperparameter ($n$), no hidden tuning, fully auditable. This is exactly the property you want in a null.
  • Naturally regularized. The adaptive bandwidth prevents both over-fitting in dense regions and spurious hot-spots in sparse ones.
  • Reusable. The same map is both the ETAS background and the null, so it is computed once and serves two purposes.

Limitations (stated honestly).

  • Blind to silent faults. A fault that is fully locked and has produced no recent microseismicity gets a low rate — yet it may be precisely where the next great earthquake strikes. This is the model's deepest weakness, and it is why geodetic strain (which images loading on silent faults) is a valuable complementary input to the background term.
  • No time information. By design it cannot raise the rate during an active sequence — that is ETAS's job. On its own it under-warns during aftershock storms and over-warns on the day after they end.
  • Catalog-quality sensitive. A change in network detection capability changes $M_c$ and the apparent spatial density; an apparent "hot-spot" can be a network artifact, so $M_c(x,y,t)$ must be monitored and homogenized.
  • Declustering-dependent. The background inherits whatever assumptions the declustering method makes; an over-aggressive declusterer erases real clustered loading.

10. A worked illustration

Consider a simplified region with a declustered catalog of $N = 10{,}000$ independent events $\ge M_c = 2.0$ over a $T = 20$-year learning window, and suppose we want the annual probability of at least one $M \ge 6$ in a $0.1° \times 0.1°$ cell located on a busy fault trace.

  1. Bandwidths. For an event on the busy trace, its $n = 6$-th nearest neighbour sits at, say, $d_i \approx 2$ km, so its kernel is narrow. An isolated event off-fault might have $d_i \approx 30$ km, smearing its contribution broadly.

  2. Spatial weight. Summing the kernels and integrating over our target cell gives a spatial probability weight, say $\int_{\text{cell}}\mu \approx 3\times 10^{-3}$ (the cell holds 0.3 % of the region's long-term rate). On a quiet cell far from any trace this might be $10^{-6}$ — small, but never exactly zero, which is the point of smoothing.

  3. Temporal rate. The total rate $\ge M_c$ is $N/T = 10{,}000/20 = 500$ independent events per year region-wide.

  4. Magnitude factor. With $b = 1.0$, the Gutenberg–Richter fraction of events reaching $M \ge 6$ relative to $M \ge M_c = 2$ is $10^{-b(6-2)} = 10^{-4}$.

  5. Expected number in the cell, per year. $$N_{\ge 6} = \underbrace{3\times 10^{-3}}{\text{spatial}} \times \underbrace{500}{\text{rate/yr}} \times \underbrace{10^{-4}}_{\text{GR}} = 1.5\times 10^{-4}.$$

  6. Annual probability. $$P(\ge 1\ M!\ge!6 \text{ in the cell, 1 yr}) = 1 - e^{-1.5\times 10^{-4}} \approx 1.5\times 10^{-4} ;;(\approx 0.015%).$$

The number is deliberately tiny — a single small cell over a single year is a rare-event problem, and an honest time-independent forecast says so. The value of the model is not any single cell's number but the relative field: this busy-trace cell carries a rate three orders of magnitude higher than a quiet off-fault cell, and that spatial contrast — calibrated and testable — is the information content. A genuine time-dependent model is then judged by whether it can sharpen this field and track when, beating the flat-in-time null it started from.


11. Role in operational earthquake forecasting

In an operational system that issues one forecast per day, smoothed seismicity is the always-on floor:

  • It is the cold-start / quiescent-cell rate. Where recent seismicity is sparse or absent, the conditional rate does not collapse to zero and does not jump to an arbitrary hard-coded floor — it floors to the long-term smoothed background $\mu(x,y)$. This is what makes a "quiet" cell read as low-but-real climatology rather than impossible (and, critically, a blank cell must never be allowed to read as "safe").
  • It is the denominator of every skill claim. The daily pipeline scores the live forecaster's information gain against this null. Reporting "we beat climatology by X nats during the sequence" is only meaningful because the null is fixed, transparent and pre-registered.
  • It anchors the public number. Every probability is shown next to its smoothed-seismicity / climatological baseline, so a user reads "today: X % vs. baseline Y %," never an unanchored figure that invites alarm.

Because most of any map and most of any year are quiet, the calibration of the product is dominated by these background-rate cells. The reliability diagram ("when we said 2 %, it happened ~2 % of the time") must therefore be validated specifically in the cold-start regime, not only during the rare, dramatic active sequences. A model that is well-calibrated only during aftershock storms but mis-calibrated on quiet days is mis-calibrated most of the time. Getting the boring background right is, quietly, most of the job.


References

  1. Helmstetter, A., Kagan, Y.Y. & Jackson, D.D. (2007). High-resolution time-independent grid-based forecast for M ≥ 5 earthquakes in California. Seismological Research Letters 78(1), 78–86. doi:10.1785/gssrl.78.1.78
  2. Werner, M.J., Helmstetter, A., Jackson, D.D. & Kagan, Y.Y. (2011). High-resolution long-term and short-term earthquake forecasts for California. Bulletin of the Seismological Society of America 101(4), 1630–1648. doi:10.1785/0120090340
  3. Zechar, J.D., Schorlemmer, D., Werner, M.J., Gerstenberger, M.C., Rhoades, D.A. & Jordan, T.H. (2013). Regional Earthquake Likelihood Models I: First-order results. Bulletin of the Seismological Society of America 103(2A), 787–798. doi:10.1785/0120120186
  4. Schorlemmer, D., Gerstenberger, M.C., Wiemer, S., Jackson, D.D. & Rhoades, D.A. (2007). Earthquake likelihood model testing. Seismological Research Letters 78(1), 17–29. doi:10.1785/gssrl.78.1.17
  5. Wiemer, S. & Wyss, M. (2000). Minimum magnitude of completeness in earthquake catalogs: examples from Alaska, the western United States, and Japan. Bulletin of the Seismological Society of America 90(4), 859–869. doi:10.1785/0119990114
  6. Ogata, Y. (1998). Space–time point-process models for earthquake occurrences. Annals of the Institute of Statistical Mathematics 50(2), 379–402. doi:10.1023/A:1003403601725
  7. Zaliapin, I. & Ben-Zion, Y. (2020). Earthquake declustering using the nearest-neighbor approach in space–time–magnitude domain. Journal of Geophysical Research: Solid Earth 125, e2018JB017120. doi:10.1029/2018JB017120
  8. Jordan, T.H., Chen, Y.-T., Gasparini, P., Madariaga, R., Main, I., Marzocchi, W., Papadopoulos, G., Sobolev, G., Yamaoka, K. & Zschau, J. (2011). Operational Earthquake Forecasting: state of knowledge and guidelines for utilization. Annals of Geophysics 54(4), 315–391. doi:10.4401/ag-5350

See also: Models — Classical · Brownian Passage Time · Rate-and-State & Coulomb · Evaluation & Tests · Data Sources · Methodology & History. All equations are transcribed from the primary/authoritative sources cited above.

Clone this wiki locally