Pipeline

The versioned, deterministic pipeline that turns raw global data into one compact daily forecast artifact. Every stage reads a versioned input manifest, writes an output manifest, and is fully re-runnable from manifests + code + configs — raw data is rebuildable and never committed. Every stage stamps provenance (source catalog versions, Mc grid version, declustering choice, config hash, code git SHA, issue timestamp), so any past forecast is byte-reproducible months later and pseudo-prospective CSEP scoring is honest.

The pipeline is a DAG, not a script. It is structured so that temporal leakage is structurally impossible (the forecast clock, §4), the most common pipeline mistakes are guarded by design (the dual-catalog rule, §3), and every artifact is reproducible from a manifest (§6).

1. The DAG at a glance

flowchart TD
    CFG["configs/ (VERSIONED)<br/>region/global · mc · declustering · etas · grid"]
    A["(A) FETCH<br/>ComCat delta + regional FDSN + EMSC<br/>ISC-GEM / GCMT / enrichers (periodic/static)<br/>tidal stress computed"]
    B["(B) CLEAN / HOMOGENIZE<br/>dedupe across providers · keep magType<br/>homogenize -> Mw (TLS, ISC-GEM/GCMT anchor)"]
    C["(C) Mc + DECLUSTER<br/>Mc(x,y,t) grid · cut < Mc<br/>GK background + ZBZ eta/T/R (features+labels)"]
    D["(D) FEATURE BUILD<br/>recent-window counts · ETAS intensities · eta/T/R<br/>slab/fault/strain joins · tidal dCFS / Mf envelope"]
    E["(E) TRAIN / FIT<br/>smoothed-seismicity null · regime-tiled ETAS<br/>gated neural challenger + covariates"]
    F["(F) DAILY INFERENCE<br/>forecast clock: catalog slice (-inf, t)<br/>per-cell rate per horizon · bounds · calibration"]
    G["(G) COMPACT ARTIFACT<br/>sparsity floor + H3 binning + quantize + gzip<br/>ONE file: rates + baseline + bounds + CSEP + provenance"]

    CFG --> A --> B --> C --> D --> E --> F --> G
    A -. fetch_manifest .-> MAN[(manifests/<br/>VERSIONED)]
    B -. clean_manifest .-> MAN
    C -. mc_decluster_manifest .-> MAN
    D -. feature_manifest .-> MAN
    E -. model_manifest .-> MAN
    G --> RES[(results/<br/>forecast-YYYY-MM-DD.json.gz<br/>+ index.json · VERSIONED)]

Stages A–G are the pipeline. The dashed arrows are the provenance manifests written by stages A–E; the compact artifact from G carries its own provenance block. The nodes marked VERSIONED (configs, manifests, compact results) are the only pipeline data committed to git.

2. Stages

(A) Fetch

Pull the inputs and persist them locally.

ComCat updatedafter delta — the daily incremental catalog (only events updated since the last successful run).
Regional FDSN + EMSC cross-check — higher-resolution context where available; EMSC for independent dedup.
Periodic / static — ISC-GEM and GCMT on a slow refresh; Slab2, GEM faults, Bird PB2002 plate model, and NGL strain loaded once.
Tidal stress computed — pygtide body tide + SPOTL/TPXO ocean loading, resolved to tidal Coulomb stress.

Outputs: a raw event store (Parquet, partitioned by region/year) [never versioned] and a fetch_manifest.json recording per-source URL, query params, retrieved-at timestamp, row counts, and checksums [versioned].
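A minimal sketch of the ComCat delta query and one fetch-manifest record, assuming the public ComCat FDSN event endpoint. The function names, the manifest field names, and the `min_magnitude` argument are illustrative assumptions, not the project's actual schema.

```python
import hashlib
from datetime import datetime, timezone
from urllib.parse import urlencode

# Public ComCat FDSN event endpoint.
COMCAT_QUERY = "https://earthquake.usgs.gov/fdsnws/event/1/query"


def comcat_delta_params(last_success: datetime, min_magnitude: float) -> dict:
    """Query params for events updated since the last successful run."""
    stamp = last_success.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
    return {
        "format": "geojson",
        "updatedafter": stamp,          # the incremental-delta filter
        "minmagnitude": min_magnitude,  # hypothetical threshold, set per config
        "orderby": "time-asc",
    }


def manifest_entry(params: dict, payload: bytes, row_count: int,
                   retrieved_at: datetime) -> dict:
    """One per-source record: URL, params, timestamp, row count, checksum.

    Field names here are assumptions for illustration.
    """
    return {
        "url": f"{COMCAT_QUERY}?{urlencode(params)}",
        "params": params,
        "retrieved_at": retrieved_at.isoformat(),
        "rows": row_count,
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
```

The checksum is taken over the raw response bytes, so a later re-fetch that returns a revised catalog is detectable by comparing hashes.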

See Data Sources for every source, its access call, license, and cadence.

(B) Clean / homogenize

Dedupe across providers by preferred-origin id; keep magType; homogenize magnitudes to Mw. Catalogs mix ML / mb / Ms / Md / Mw — different saturation, different physics — so where Mw is missing for small events, fit regional total-least-squares ML→Mw and mb→Mw (both axes carry error, so not OLS), anchored on the ISC-GEM / GCMT overlap. Store both native and Mw-homogenized magnitudes and version the conversion — a wrong conversion shifts the whole Gutenberg–Richter relation and every rate forecast.

Outputs: a clean catalog (Parquet, native + Mw) [never versioned] and a clean_manifest.json (dedupe stats, conversion coefficients + version) [versioned].
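The total-least-squares conversion described above can be sketched with the closed-form orthogonal-regression slope. This sketch assumes equal error variance in both magnitude scales, which may not hold for a given region; the function name and the sample values are hypothetical.

```python
import math


def fit_tls(x, y):
    """Orthogonal (total-least-squares) line fit y = slope * x + intercept.

    Assumes equal error variance on both axes (an assumption of this sketch).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxy == 0:
        raise ValueError("no covariance between the two magnitude scales")
    # Closed-form slope minimizing perpendicular distances to the line.
    slope = (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    return slope, my - slope * mx


# Hypothetical overlap sample: events carrying both ML and an anchor Mw.
ml = [2.0, 3.0, 4.0, 5.0]
mw = [0.8 * m + 0.5 for m in ml]
slope, intercept = fit_tls(ml, mw)
```

Unlike OLS, swapping the roles of the two axes gives the same fitted line, which is why the fit suits the case where both magnitudes carry error.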

(C) Mc + dual-catalog declustering

Order matters in this stage: a model trained on a dirty catalog learns the network's detection changes, not the Earth.

Mc(x,y,t) — estimate the magnitude of completeness per spatial cell and time epoch (MAXC with the +0.2 correction, cross-checked with the Goodness-of-Fit test and EMR for uncertainty). A single global Mc injects fake non-stationarity. The +0.2 MAXC correction is California-tuned — re-validate per region. Store the Mc grid as a first-class versioned artifact and monitor Mc(t) and b(t). Cut events below local Mc before declustering and feature-building.
Declustering — the DUAL view (central rule).
- Gardner–Knopoff declustered catalog → only for the stationary Poisson background rate $\mu(x,y)$ and Poisson-baseline calibration.
- Full, un-declustered catalog → fed to the conditional / ETAS model, because aftershock/foreshock triggering is the predictable signal. Declustering the conditional input is the single most common pipeline mistake — guarded by design and documented loudly.
- Zaliapin–Ben-Zion nearest-neighbor $\eta / T / R$ computed as ML features (not just keep/drop labels) and as the principled cluster labeler.
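The MAXC step in the Mc bullet above can be sketched for a single cell and epoch as follows. The bin width of 0.1, the tie-breaking rule, and the function names are assumptions of this sketch; the Goodness-of-Fit and EMR cross-checks are not shown.

```python
from collections import Counter


def mc_maxc(mags, dm=0.1, correction=0.2):
    """Maximum-curvature Mc: the most populated magnitude bin plus a correction.

    dm is the bin width (assumed 0.1 here); the +0.2 correction should be
    re-validated per region, as noted in the text.
    """
    bins = Counter(round(m / dm) for m in mags)
    # Ties go to the higher magnitude bin (an arbitrary choice for this sketch).
    mode_bin = max(bins, key=lambda k: (bins[k], k))
    return mode_bin * dm + correction


def cut_below_mc(mags, mc):
    """Keep only events at or above the local Mc (small tolerance for floats)."""
    return [m for m in mags if m >= mc - 1e-9]


# Hypothetical cell: the non-cumulative count peaks at M 1.5.
sample = [1.0] * 3 + [1.5] * 10 + [2.0] * 5
mc = mc_maxc(sample)
complete = cut_below_mc(sample, mc)
```

In the full pipeline this would run once per spatial cell and time epoch to fill the Mc(x,y,t) grid.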

Canonical equations (preserved):

Gutenberg–Richter: $\log_{10} N(\ge M) = a - bM$
Aki–Utsu b-value MLE (binning-corrected): $\hat{b} = \dfrac{\log_{10} e}{\bar{M} - (M_c - \Delta M/2)}$
Gardner–Knopoff windows (OpenQuake hmtk coefficients): $L(M) = 10^{\,0.1238M + 0.983}$ km; $T(M) = 10^{\,0.032M + 2.7389}$ d for $M \ge 6.5$, else $10^{\,0.5409M - 0.547}$ d.
Baiesi–Paczuski / Zaliapin nearest-neighbor proximity: $\eta_{ij} = t_{ij}\,(r_{ij})^{d_f}\,10^{-b\,m_i}$, decomposed $T_j = t_{ij}\,10^{-q b m_i}$, $R_j = (r_{ij})^{d_f}\,10^{-(1-q) b m_i}$ ($q \approx 0.5$); $\log_{10}\eta$ is bimodal (background vs clustered).
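The canonical equations translate directly into code. The sketch below evaluates the b-value estimator, the Gardner–Knopoff windows, and the nearest-neighbor proximity for a single parent-child pair. The default values `b=1.0` and `d_f=1.6` are illustrative assumptions, and the function names are hypothetical.

```python
import math


def b_value_aki_utsu(mags, mc, dm=0.1):
    """Binning-corrected Aki-Utsu maximum-likelihood b-value."""
    above = [m for m in mags if m >= mc - 1e-9]
    mean_m = sum(above) / len(above)
    return math.log10(math.e) / (mean_m - (mc - dm / 2))


def gk_window(m):
    """Gardner-Knopoff (distance km, time days) window for a mainshock of magnitude m."""
    dist_km = 10 ** (0.1238 * m + 0.983)
    if m >= 6.5:
        days = 10 ** (0.032 * m + 2.7389)
    else:
        days = 10 ** (0.5409 * m - 0.547)
    return dist_km, days


def nn_proximity(dt, r_km, m_parent, b=1.0, d_f=1.6, q=0.5):
    """Return (eta, T, R) for one pair; b and d_f defaults are assumptions."""
    t_resc = dt * 10 ** (-q * b * m_parent)
    r_resc = r_km ** d_f * 10 ** (-(1 - q) * b * m_parent)
    return t_resc * r_resc, t_resc, r_resc
```

Because the exponents $q$ and $1-q$ sum to one, the product of the rescaled time and distance recovers $\eta_{ij}$ exactly, which is what makes the $(T, R)$ plane a decomposition rather than a separate metric.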

Outputs: Mc grid + declustered + ZBZ-labeled catalogs [never versioned] and a mc_decluster_manifest.json (Mc method, grid hash, b(t), decluster params) [versioned].

Short-term aftershock incompleteness — the highest-stakes operational moment. Right after a large mainshock — exactly when the forecast matters most and is most consumed — Mc spikes for hours-to-days and a naive ETAS under-forecasts productivity. The daily update uses an incompleteness-aware likelihood with a time-dependent Mc(t) (detrended / incompleteness-aware ETAS), not the static-Mc fit. This is a decided method, flagged in product copy as a known limitation of the most important real-time updates.

(D) Feature build

Build the conditional features on the forecast grid: recent-window counts, ETAS intensities, $\eta / T / R$ clustering features, slab / fault / strain joins, and the tidal $\Delta\mathrm{CFS}$ / fortnightly Mf envelope. Outputs: a feature store (Parquet on the forecast grid) [never versioned] and a feature_manifest.json (feature list, grid spec, enricher versions) [versioned].
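As one example of these features, recent-window counts per grid cell might be computed as below. This is a sketch only: the `(time, cell_id)` event layout, the cell identifiers, and the window lengths are assumptions, and the upper bound is strictly before the issue time so the feature respects the forecast clock.

```python
from collections import Counter


def recent_window_counts(events, issue_time, window_days):
    """Count events per cell in [issue_time - window_days, issue_time).

    events: iterable of (time_in_days, cell_id) pairs (a hypothetical layout).
    """
    start = issue_time - window_days
    return Counter(cell for t, cell in events if start <= t < issue_time)


# Hypothetical events; times in days, cell ids are placeholders.
events = [(1.0, "a"), (8.5, "a"), (9.0, "b"), (9.9, "a"), (10.0, "a")]
counts_7d = recent_window_counts(events, issue_time=10.0, window_days=7)
```

The event at exactly `issue_time` is excluded, so nothing from the forecast interval leaks into the feature.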

(E) Train / fit

Smoothed-seismicity Poisson — the mandatory null (declustered catalog).
Regime-tiled space-time ETAS — the primary estimator and the mandatory reference, fit per tectonic-regime tile (see Technical Architecture).
Gated neural challenger + covariates — a Hawkes-biased Neural Temporal Point Process and the tidal / GNSS covariates, behind a feature flag, shipping only on a positive, significant prospective-CSEP information gain over ETAS plus passing calibration.

Outputs: model weights / fitted params [never versioned] and a model_manifest.json (model id, fit window, params, CSEP scores, config hash) [versioned].

(F) Daily inference

Run under the forecast clock: hand the model only the catalog slice $(-\infty, t)$, cut below Mc, compute per-cell rate per horizon {1 d, 2 d, 7 d}, with expected value and bounds (P10 / median / P90), and the calibrated public probability vs the long-term baseline. Emit both representations — a gridded-rate forecast (Poisson CSEP tests) and a catalog-based forecast (≥10k Monte-Carlo synthetic catalogs, relaxing the Poisson assumption). Output: a raw daily forecast object [never versioned].

The public scalar is the exceedance probability:

$$P(\geq 1 \text{ event} \geq M^*) = 1 - e^{-N_{\geq M^*}}, \qquad N_{\geq M^*} = \iiint \lambda\,\Phi(M^*)\,dx\,dy\,dt, \qquad \Phi(M^*) = 10^{-b(M^* - M_c)}$$

with an explicit per-region Mmax bounding the exceedance integral for the rare large events that dominate impact.
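A worked instance of the exceedance formula, assuming the rate and b-value are constant within one cell and horizon so the integral reduces to a product. The function names are hypothetical and the Mmax truncation is not shown. With 2.0 expected events at or above $M_c = 3$, $b = 1$, and $M^* = 5$, the Gutenberg–Richter fraction is $10^{-2}$, so $N = 0.02$ and the probability is about 1.98%.

```python
import math


def gr_fraction(m_star, mc, b):
    """Fraction of events at or above Mc that also reach M* (untruncated GR)."""
    return 10.0 ** (-b * (m_star - mc))


def exceedance_probability(expected_above_mc, m_star, mc, b):
    """P(at least one event >= M*) under a Poisson count with mean N.

    expected_above_mc is the rate integrated over the cell and horizon.
    """
    n = expected_above_mc * gr_fraction(m_star, mc, b)
    return -math.expm1(-n)  # 1 - exp(-n), accurate for small n


p = exceedance_probability(expected_above_mc=2.0, m_star=5.0, mc=3.0, b=1.0)
```

For small $N$ the probability is close to $N$ itself, which is why the two are easy to confuse in low-rate cells.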

(G) Compact artifact

Compact the raw forecast into one served file: apply a sparsity floor, H3-bin, quantize (log-binned uint8/uint16 + legend lookup), and gzip — targeting a few hundred KB to a few MB. The artifact carries, per cell: $\lambda$ for {1 d, 2 d, 7 d} × {lo, exp, hi}; the baseline $\lambda$; an N/S/M/L test summary + reliability points; the coverage mask; and metadata (source catalog versions, Mc map version, declustering choice, model version, issue timestamp).
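The sparsity floor, log-binned quantization, and gzip steps could look like the sketch below. It assumes a positive floor, reserves code 0 for cells below the floor, and uses 255 log-spaced levels so codes fit in a uint8. The legend field names and function names are assumptions, and H3 binning is not shown.

```python
import gzip
import json
import math


def quantize_log(rates, floor, levels=255):
    """Map rates to codes 1..levels on a log10 scale; 0 means below the floor."""
    kept = [r for r in rates if r >= floor]
    if not kept:
        return [0] * len(rates), None
    lo, hi = math.log10(min(kept)), math.log10(max(kept))
    span = (hi - lo) or 1.0  # avoid division by zero when all rates are equal
    codes = [
        0 if r < floor else 1 + round((math.log10(r) - lo) / span * (levels - 1))
        for r in rates
    ]
    return codes, {"log10_min": lo, "log10_span": span, "levels": levels}


def dequantize(code, legend):
    """Legend lookup: recover the approximate rate for one code."""
    if code == 0:
        return 0.0
    frac = (code - 1) / (legend["levels"] - 1)
    return 10.0 ** (legend["log10_min"] + frac * legend["log10_span"])


def pack(obj) -> bytes:
    """Canonical JSON + gzip with a fixed mtime so the bytes are reproducible."""
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return gzip.compress(blob, mtime=0)


codes, legend = quantize_log([0.0, 1e-6, 1e-3, 1.0], floor=1e-8)
artifact = pack({"codes": codes, "legend": legend})
```

Fixing `mtime` and sorting keys matters here because the gzip header otherwise embeds the wall-clock time, which would break byte-identical reproduction.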

Outputs: results/forecast-YYYY-MM-DD.json.gz and results/index.json (latest pointer + rolling CSEP calibration) [versioned — compact only].

3. The dual-catalog rule (guarded by design)

flowchart TD
    CLEAN["Clean, homogenized catalog (full)"] --> CUT["Cut events < Mc(x,y,t)"]
    CUT --> GK["Gardner-Knopoff decluster"]
    CUT --> FULL["Keep FULL un-declustered catalog"]
    GK --> BG["Stationary background mu(x,y)<br/>Poisson-baseline calibration"]
    FULL --> COND["Conditional / ETAS model<br/>(triggering IS the signal)"]
    CUT --> ZBZ["Zaliapin-Ben-Zion eta/T/R"]
    ZBZ --> COND
    BG --> NULL["Smoothed-seismicity null<br/>(the mandatory baseline)"]

The declustered catalog drives only the background and the Poisson null; the full catalog drives the conditional model. Scoring is on the non-declustered catalog (the target events are mostly aftershocks), and skill is measured against a real ETAS baseline — a model that merely reproduces Omori is not skillful. See the evaluation pages for the CSEP protocol.
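One way to make the rule "guarded by design" is to tag each catalog with its declustering state and have each consumer refuse the wrong one. The sketch below illustrates that idea; the `Catalog` type and function names are hypothetical and are not taken from the project's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Catalog:
    """A catalog view tagged with whether it has been declustered."""
    events: tuple
    declustered: bool


def background_input(catalog: Catalog) -> Catalog:
    """Background / Poisson-null fits accept only the declustered view."""
    if not catalog.declustered:
        raise ValueError("background rate needs the declustered catalog")
    return catalog


def conditional_input(catalog: Catalog) -> Catalog:
    """Conditional / ETAS fits accept only the full, un-declustered view."""
    if catalog.declustered:
        raise ValueError("conditional model must see the full catalog")
    return catalog


full = Catalog(events=(1, 2, 3), declustered=False)
mainshocks = Catalog(events=(1,), declustered=True)
```

Passing `mainshocks` to `conditional_input` raises immediately, so the mistake fails at the call site rather than surfacing later as a quietly under-forecasting model.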

4. Cross-cutting properties

Forecast clock (no temporal leakage). At each daily issue time $t$, the model is handed only the catalog slice $(-\infty, t)$, the forecast is sealed, then the clock advances. Temporal leakage — the primary pseudo-prospective failure mode — is structurally impossible, not a matter of discipline.
Cold-start / quiescent cells. Floor the conditional rate to the long-term smoothed-seismicity Poisson background (not an arbitrary hard floor); borrow strength spatially via regionalized / hierarchical priors. The UI distinguishes "low but poorly-constrained," "genuinely quiescent," and "no data / out-of-coverage" (the coverage mask — blank ≠ safe). Calibration is dominated by these cells, so they are handled honestly.
Input-state snapshots for reproducibility. The fetch manifest snapshots the exact catalog state used per issue (ComCat continuously revises magnitudes/locations and retracts events). A past forecast must be byte-reproducible months later; otherwise pseudo-prospective CSEP scoring runs against a retroactively improved catalog (optimistic leakage). Immutable issue-time logging of both the forecast and its input manifest is mandatory.
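The forecast clock and the cold-start floor above can each be expressed in a few lines. This sketch assumes event times are kept sorted and that rates are plain floats per cell; the function names are hypothetical.

```python
from bisect import bisect_left


def catalog_slice(sorted_times, issue_time):
    """Return only events strictly before issue_time, i.e. the slice (-inf, t)."""
    # bisect_left finds the first index with time >= issue_time.
    return sorted_times[:bisect_left(sorted_times, issue_time)]


def floored_rate(conditional_rate, background_rate):
    """Never forecast below the long-term smoothed-seismicity background."""
    return max(conditional_rate, background_rate)


times = [1.0, 2.0, 3.0, 3.0, 5.0]
visible = catalog_slice(times, issue_time=3.0)
```

If the model only ever receives the return value of `catalog_slice`, an event stamped at or after the issue time cannot reach it, whatever the downstream code does.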

5. What is versioned vs never versioned

VERSIONED (committed to git):

Pipeline code (scripts/, the Python package, the static-preview server).
Configs (configs/*.yaml: region/global, grid / H3 resolution, Mc method, declustering params, ETAS, horizons, magnitude thresholds, Mmax assumption).
Manifests (fetch_manifest, clean_manifest, mc_decluster_manifest, feature_manifest, model_manifest) — the provenance/reproducibility record: source URLs, query params, retrieved-at timestamps, row counts, checksums, conversion coefficients, model params, CSEP scores, code SHA.
Compact daily results (results/forecast-YYYY-MM-DD.json.gz) + results/index.json. These are small and are the public app's data.
requirements.txt / lock, .env.example, README, license/credits page.

NEVER versioned (.gitignore, rebuildable from manifests + code):

Raw downloaded catalogs / waveforms / enricher grids (ComCat JSON, ISC-GEM CSV, GCMT ndk, Slab2 .grd, NGL .tenv3, TPXO model files).
The clean / declustered / Mc-grid / feature Parquet stores.
Model weights / fitted-parameter binaries.
The .venv/, caches, the working .env, and all secrets.
Any registration-gated raw files or derived products under their providers' agreements.

.venv/
.env
data/raw/
data/clean/
data/features/
data/mc/
models/
*.grd
*.tenv3
*.parquet
*.ndk
__pycache__/
.cache/
# keep: configs/, manifests/, results/*.json.gz, results/index.json
!results/*.json.gz

The result: the git repo stays small — only configs, manifests, code, and compact gzipped results are committed, growing by a few-hundred-KB-to-few-MB artifact per day. The working set on the build host is rebuildable in full from the manifests.

6. Reproducibility and manifests

Every stage emits a manifest; the chain of manifests is the reproducibility contract. To reconstruct any past forecast:

flowchart LR
    M["manifests/ for issue date t<br/>(fetch -> clean -> mc/decluster -> feature -> model)"] --> RB["Re-run stages A-G<br/>from code @ recorded git SHA<br/>+ configs @ recorded hash"]
    RB --> OUT["Byte-identical<br/>forecast-t.json.gz"]

A manifest pins everything that affects the output: source catalog versions and query params, the Mc grid hash, the declustering choice, the magnitude-conversion coefficients, the model id + fitted params, the config hash, the code git SHA, and the issue timestamp.
Raw data is never the source of truth — the manifest is. Raw inputs are gitignored and rebuildable; if a re-fetch returns a revised catalog, the manifest's snapshot is what governs the reproduction, so scoring is never silently improved by retroactive catalog revisions.
The daily QA gate (catalog freshness, event-count sanity, no duplicate/retracted spike near the magnitude threshold, rolling N-test drift, sane artifact size) prevents a corrupted or stale artifact from being committed; on failure the job commits nothing and the last-good artifact stays up. See Technical Architecture §7.
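A sketch of how a manifest chain might pin its inputs: each manifest records a hash of the config, the code SHA, and a hash of the upstream manifest. The field names and the `stamp_manifest` helper are assumptions for illustration, not the project's actual manifest schema.

```python
import hashlib
import json


def canonical_hash(obj) -> str:
    """SHA-256 over canonical JSON, so key order does not change the hash."""
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def stamp_manifest(stage, payload, config, git_sha, issued_at, parent=None):
    """Build one stage manifest that pins config, code, and its upstream manifest."""
    return {
        "stage": stage,
        "issued_at": issued_at,
        "git_sha": git_sha,
        "config_hash": canonical_hash(config),
        "parent_hash": canonical_hash(parent) if parent is not None else None,
        "payload": payload,  # stage-specific details (row counts, params, ...)
    }


# Hypothetical values throughout.
cfg = {"mc": {"method": "maxc"}, "grid": {"h3_res": 5}}
fetch_m = stamp_manifest("fetch", {"rows": 1200}, cfg, "abc1234", "2024-01-02T03:00:00Z")
clean_m = stamp_manifest("clean", {"deduped": 17}, cfg, "abc1234",
                         "2024-01-02T03:00:00Z", parent=fetch_m)
```

Because each manifest embeds the hash of its parent, changing any upstream input changes every downstream hash, which makes silent drift visible.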

7. Reproducible scripts

Each human-run convenience command ships as parallel .ps1 + .sh with identical subcommands; server-side / timer scripts are .sh only.

| Script | Runs | Does |
| --- | --- | --- |
| `scripts/setup` | n/a | create the `.venv`, install pinned deps, materialize the gitignored `.env` from the secrets vault |
| `scripts/fetch` | stage (A) | ComCat delta + regional FDSN + EMSC + periodic ISC-GEM/GCMT/enricher refresh; write raw store + `fetch_manifest.json` |
| `scripts/build-features` | (B)→(D) | clean/dedupe/homogenize, `Mc` + declustering, feature build; write feature store + manifests |
| `scripts/train` | (E) | fit smoothed null + regime-tiled ETAS (+ gated challengers); run pyCSEP consistency + comparison tests; write weights + `model_manifest.json` |
| `scripts/infer` | (F)→(G) | forecast-clock inference, bounds, calibration, gridded + catalog-based forecasts, compact gzipped artifact + `index.json` |
| `scripts/daily` | (A)→(G) + publish | the cron entry: fetch → build-features → infer, QA gate, then scoped `git add results/ manifests/` → commit → push |
| `scripts/dev` | n/a | local static-preview server for the latest artifact (never a deployed backend) |
| `scripts/check` | smoke | lint + a fast pipeline smoke run on a tiny fixture window (CI-friendly, minimal network) |

All FDSN calls retry with backoff on 204/400/413/429/503 (treat 413 as "tile smaller").
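The retry rule above might be structured as in this sketch, where a 413 splits the time window in half instead of waiting. The `fetch(start, end)` callable returning `(status, body)` is a hypothetical stand-in for the real HTTP call, and the retry count and delays are assumptions.

```python
import time

RETRYABLE = {204, 400, 413, 429, 503}  # status codes named in the text


def fetch_window(fetch, start, end, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Fetch [start, end) with exponential backoff; split the window on 413.

    `fetch(start, end)` is a hypothetical callable returning (status, body).
    """
    delay = base_delay
    for _ in range(max_tries):
        status, body = fetch(start, end)
        if status == 200:
            return [body]
        if status == 413:
            # Request too large: retry as two smaller tiles instead of waiting.
            mid = (start + end) / 2
            return (fetch_window(fetch, start, mid, max_tries, base_delay, sleep)
                    + fetch_window(fetch, mid, end, max_tries, base_delay, sleep))
        if status not in RETRYABLE:
            raise RuntimeError(f"non-retryable status {status}")
        sleep(delay)
        delay *= 2  # exponential backoff
    raise RuntimeError(f"gave up on [{start}, {end}) after {max_tries} tries")
```

Injecting `sleep` keeps the function testable without real delays; a production version would likely also cap the recursion depth for windows that keep returning 413.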

See also: Data Sources — every input this pipeline fetches · Technical Architecture — the compute host, regime tiling, git-as-data publishing, and the daily 03:00 job that drives this DAG.

⚠️ Disclaimer — read this. CAOS_SEISMIC produces probabilistic forecasts, not predictions. It is an independent research and education tool. It is NOT an official earthquake early-warning or civil-protection system, it does NOT predict when, where, or how large an earthquake will be, and it must NOT be used for life-safety, emergency, or evacuation decisions. Every number it publishes is a bounded, calibrated probability conditioned on the present state of seismicity — never an alarm, a countdown, or a "safe" state. A single outcome neither confirms nor refutes a probabilistic forecast.

It complements, and does not replace or speak for, official agencies; always follow your national seismological and civil-protection authorities (e.g. USGS, INGV, CSN for seismology and SENAPRED for civil protection in Chile, GeoNet, JMA). The software is provided "as is", without warranty of any kind (MIT License); the authors accept no liability for its use. Data are courtesy of their providers (USGS/ANSS, ISC/ISC-GEM, Global CMT, EMSC, CSN, and others) under their respective licenses and attribution terms. See Honest-Limits for the full epistemic context.

CAOS_SEISMIC · seismic.fasl-work.com · source · MIT
