Pipeline
The versioned, deterministic pipeline that turns raw global data into one compact daily forecast
artifact. Every stage reads a versioned input manifest, writes an output manifest, and is fully
re-runnable from manifests + code + configs — raw data is rebuildable and never committed.
Every stage stamps provenance (source catalog versions, Mc grid version, declustering choice,
config hash, code git SHA, issue timestamp), so any past forecast is byte-reproducible months later
and pseudo-prospective CSEP scoring is honest.
The pipeline is a DAG, not a script. It is structured so that temporal leakage is structurally impossible (the forecast clock, §4), the most common pipeline mistakes are guarded by design (the dual-catalog rule, §3), and every artifact is reproducible from a manifest (§6).
```mermaid
flowchart TD
CFG["configs/ (VERSIONED)<br/>region/global · mc · declustering · etas · grid"]
A["(A) FETCH<br/>ComCat delta + regional FDSN + EMSC<br/>ISC-GEM / GCMT / enrichers (periodic/static)<br/>tidal stress computed"]
B["(B) CLEAN / HOMOGENIZE<br/>dedupe across providers · keep magType<br/>homogenize -> Mw (TLS, ISC-GEM/GCMT anchor)"]
C["(C) Mc + DECLUSTER<br/>Mc(x,y,t) grid · cut < Mc<br/>GK background + ZBZ eta/T/R (features+labels)"]
D["(D) FEATURE BUILD<br/>recent-window counts · ETAS intensities · eta/T/R<br/>slab/fault/strain joins · tidal dCFS / Mf envelope"]
E["(E) TRAIN / FIT<br/>smoothed-seismicity null · regime-tiled ETAS<br/>gated neural challenger + covariates"]
F["(F) DAILY INFERENCE<br/>forecast clock: catalog slice (-inf, t)<br/>per-cell rate per horizon · bounds · calibration"]
G["(G) COMPACT ARTIFACT<br/>sparsity floor + H3 binning + quantize + gzip<br/>ONE file: rates + baseline + bounds + CSEP + provenance"]
CFG --> A --> B --> C --> D --> E --> F --> G
A -. fetch_manifest .-> MAN[(manifests/<br/>VERSIONED)]
B -. clean_manifest .-> MAN
C -. mc_decluster_manifest .-> MAN
D -. feature_manifest .-> MAN
E -. model_manifest .-> MAN
G --> RES[(results/<br/>forecast-YYYY-MM-DD.json.gz<br/>+ index.json · VERSIONED)]
```
Stages A–G are the pipeline; the dashed arrows are the provenance manifests every stage stamps; the cylinders are the only things committed to git (configs, manifests, compact results).
Pull the inputs and persist them locally.
- ComCat `updatedafter` delta: the daily incremental catalog (only events updated since the last successful run).
- Regional FDSN + EMSC cross-check: higher-resolution context where available; EMSC for independent dedup.
- Periodic / static: ISC-GEM and GCMT on a slow refresh; Slab2, GEM faults, Bird PB2002 plate model, and NGL strain loaded once.
- Tidal stress computed: `pygtide` body tide + SPOTL/TPXO ocean loading, resolved to tidal Coulomb stress.
Outputs: a raw event store (Parquet, partitioned by region/year) [never versioned] and a
fetch_manifest.json recording per-source URL, query params, retrieved-at timestamp, row counts,
and checksums [versioned].
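A minimal sketch of how one per-source record in `fetch_manifest.json` could be assembled is shown here. The field names and the helpers `manifest_entry` and `dump_manifest` are hypothetical, chosen only to mirror the items listed above (URL, query params, retrieved-at timestamp, row count, checksum); the real manifest schema may differ.

```python
import hashlib
import json
from datetime import datetime, timezone


def manifest_entry(source, url, params, payload, row_count, retrieved_at):
    """One per-source record: what was asked, when, and what came back."""
    return {
        "source": source,
        "url": url,
        # Sort params so the same query always serializes the same way.
        "params": dict(sorted(params.items())),
        "retrieved_at": retrieved_at.astimezone(timezone.utc).isoformat(),
        "row_count": row_count,
        # Checksum of the raw response body, not of the parsed rows.
        "sha256": hashlib.sha256(payload).hexdigest(),
    }


def dump_manifest(entries):
    """Serialize deterministically so the manifest diffs cleanly in git."""
    return json.dumps({"sources": entries}, sort_keys=True, indent=2)


# Illustrative call with a tiny, made-up response body.
body = b'{"type":"FeatureCollection","features":[]}'
entry = manifest_entry(
    "comcat",
    "https://earthquake.usgs.gov/fdsnws/event/1/query",
    {"format": "geojson", "updatedafter": "2024-01-01T00:00:00"},
    body,
    row_count=0,
    retrieved_at=datetime(2024, 1, 2, 3, 0, tzinfo=timezone.utc),
)
manifest_text = dump_manifest([entry])
```

Hashing the raw bytes rather than the parsed table means a later re-fetch that returns a revised catalog is detected even when the row count is unchanged.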
See Data Sources for every source, its access call, license, and cadence.
Dedupe across providers by preferred-origin id; keep magType; homogenize magnitudes to Mw.
Catalogs mix ML / mb / Ms / Md / Mw — different saturation, different physics — so where Mw is
missing for small events, fit regional total-least-squares ML→Mw and mb→Mw (both axes carry
error, so not OLS), anchored on the ISC-GEM / GCMT overlap. Store both native and Mw-homogenized
magnitudes and version the conversion — a wrong conversion shifts the whole Gutenberg–Richter
relation and every rate forecast.
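As an illustration of why total least squares differs from OLS, the sketch below fits a straight line by orthogonal regression using an SVD. It assumes equal error variance on both magnitude axes, which is a simplification; the function name `tls_line` and the synthetic coefficients are hypothetical and not the project's conversion.

```python
import numpy as np


def tls_line(x, y):
    """Orthogonal (total-least-squares) fit of y = intercept + slope * x.

    Minimizes perpendicular distances, so errors in x and y are treated
    symmetrically (assumes equal error variance on both axes).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xm, ym = x.mean(), y.mean()
    centered = np.column_stack([x - xm, y - ym])
    # The first right-singular vector is the direction of largest variance,
    # i.e. the direction of the fitted line through the centroid.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    dx, dy = vt[0]
    slope = dy / dx  # undefined for a perfectly vertical cloud (dx == 0)
    intercept = ym - slope * xm
    return slope, intercept


# Synthetic, noise-free ML -> Mw pairs (coefficients are made up).
ml = np.array([2.0, 2.5, 3.0, 3.5, 4.0])
mw = 0.7 * ml + 1.2
slope, intercept = tls_line(ml, mw)
```

With noisy data the orthogonal slope is generally steeper than the OLS slope of y on x, because OLS attributes all scatter to the dependent axis.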
Outputs: a clean catalog (Parquet, native + Mw) [never versioned] and a clean_manifest.json
(dedupe stats, conversion coefficients + version) [versioned].
This stage is order-load-bearing — a model trained on a dirty catalog learns the network's detection changes, not the Earth.
- `Mc(x,y,t)`: estimate the magnitude of completeness per spatial cell and time epoch (MAXC with the +0.2 correction, cross-checked with the Goodness-of-Fit test and EMR for uncertainty). A single global `Mc` injects fake non-stationarity. The +0.2 MAXC correction is California-tuned, so re-validate it per region. Store the `Mc` grid as a first-class versioned artifact and monitor `Mc(t)` and `b(t)`. Cut events below local `Mc` before declustering and feature-building.
- Declustering: the DUAL view (central rule).
  - Gardner–Knopoff declustered catalog → only for the stationary Poisson background rate $\mu(x,y)$ and Poisson-baseline calibration.
  - Full, un-declustered catalog → fed to the conditional / ETAS model, because aftershock/foreshock triggering is the predictable signal. Declustering the conditional input is the single most common pipeline mistake; it is guarded by design and documented loudly.
  - Zaliapin–Ben-Zion nearest-neighbor $\eta / T / R$ computed as ML features (not just keep/drop labels) and as the principled cluster labeler.
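The MAXC estimate mentioned above can be sketched for a single cell and epoch as follows. This is a minimal illustration, assuming magnitudes binned at 0.1 units; the function name `mc_maxc` and the toy sample are hypothetical, and the +0.2 correction is exposed as a parameter because it needs regional re-validation.

```python
import numpy as np


def mc_maxc(mags, bin_width=0.1, correction=0.2):
    """Maximum-curvature Mc: the most populated magnitude bin plus a correction."""
    mags = np.asarray(mags, dtype=float)
    # Work with integer bin indices to avoid floating-point bin-edge issues.
    idx = np.round(mags / bin_width).astype(int)
    values, counts = np.unique(idx, return_counts=True)
    mode_bin = values[np.argmax(counts)] * bin_width
    return float(round(mode_bin + correction, 10))


# Toy frequency-magnitude sample: the peak of the histogram is at M 1.5.
sample = [1.0] * 5 + [1.5] * 20 + [1.7] * 10 + [2.0] * 5 + [2.5] * 2
mc = mc_maxc(sample)
```

Running the same function over every (cell, epoch) pair would give the `Mc(x,y,t)` grid; cells with too few events to locate a stable histogram peak need a separate fallback that is not shown here.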
Canonical equations (preserved):
- Gutenberg–Richter: $\log_{10} N(\ge M) = a - bM$
- Aki–Utsu b-value MLE (binning-corrected): $\hat{b} = \dfrac{\log_{10} e}{\bar{M} - (M_c - \Delta M/2)}$
- Gardner–Knopoff windows (OpenQuake hmtk coefficients): $L(M) = 10^{\,0.1238M + 0.983}$ km; $T(M) = 10^{\,0.032M + 2.7389}$ d for $M \ge 6.5$, else $10^{\,0.5409M - 0.547}$ d.
- Baiesi–Paczuski / Zaliapin nearest-neighbor proximity: $\eta_{ij} = t_{ij}\,(r_{ij})^{d_f}\,10^{-b\,m_i}$, decomposed as $T_j = t_{ij}\,10^{-qbm_i}$ and $R_j = (r_{ij})^{d_f}\,10^{-(1-q)bm_i}$ ($q \approx 0.5$); $\log_{10}\eta$ is bimodal (background vs clustered).
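The equations above translate almost line for line into code. The sketch below is illustrative only: the function names are hypothetical, and the defaults `b=1.0` and `d_f=1.6` are assumed placeholder values, not the project's fitted parameters. Time and distance units are whatever the caller uses consistently.

```python
import math


def b_value_aki_utsu(mags, mc, bin_width=0.1):
    """Aki-Utsu maximum-likelihood b-value with the half-bin correction."""
    kept = [m for m in mags if m >= mc - 1e-9]
    mean_mag = sum(kept) / len(kept)
    return math.log10(math.e) / (mean_mag - (mc - bin_width / 2.0))


def gk_window(mag):
    """Gardner-Knopoff space (km) and time (days) windows for a mainshock."""
    dist_km = 10 ** (0.1238 * mag + 0.983)
    if mag >= 6.5:
        time_d = 10 ** (0.032 * mag + 2.7389)
    else:
        time_d = 10 ** (0.5409 * mag - 0.547)
    return dist_km, time_d


def nn_proximity(dt, r, m_parent, b=1.0, d_f=1.6, q=0.5):
    """Nearest-neighbor proximity eta and its rescaled components T and R.

    dt: time from candidate parent i to event j (must be > 0)
    r:  distance between the two events
    """
    t_resc = dt * 10 ** (-q * b * m_parent)
    r_resc = r ** d_f * 10 ** (-(1 - q) * b * m_parent)
    return t_resc * r_resc, t_resc, r_resc


b_hat = b_value_aki_utsu([1.7] * 10 + [2.0] * 5 + [2.5] * 2, mc=1.7)
dist_km, time_d = gk_window(5.0)
eta, T, R = nn_proximity(dt=2.0, r=10.0, m_parent=4.0)
```

For each event, the parent is the earlier event that minimizes `eta`; the `(T, R)` pair of that parent is what gets stored as features, and the bimodality of $\log_{10}\eta$ is what separates background from clustered events.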
Outputs: Mc grid + declustered + ZBZ-labeled catalogs [never versioned] and a
mc_decluster_manifest.json (Mc method, grid hash, b(t), decluster params) [versioned].
Short-term aftershock incompleteness: the highest-stakes operational moment. Right after a large mainshock (exactly when the forecast matters most and is most consumed), `Mc` spikes for hours to days and a naive ETAS under-forecasts productivity. The daily update uses an incompleteness-aware likelihood with a time-dependent `Mc(t)` (detrended / incompleteness-aware ETAS), not the static-`Mc` fit. This is a decided method, flagged in product copy as a known limitation of the most important real-time updates.
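Build the conditional features on the forecast grid: recent-window counts, ETAS intensities, ZBZ $\eta / T / R$, slab / fault / strain joins, and the tidal dCFS / Mf envelope.
Outputs: a feature store (Parquet) [never versioned] and a feature_manifest.json (feature list, grid spec, enricher versions) [versioned].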
- Smoothed-seismicity Poisson — the mandatory null (declustered catalog).
- Regime-tiled space-time ETAS — the primary estimator and the mandatory reference, fit per tectonic-regime tile (see Technical Architecture).
- Gated neural challenger + covariates — a Hawkes-biased Neural Temporal Point Process and the tidal / GNSS covariates, behind a feature flag, shipping only on a positive, significant prospective-CSEP information gain over ETAS plus passing calibration.
Outputs: model weights / fitted params [never versioned] and a model_manifest.json (model id,
fit window, params, CSEP scores, config hash) [versioned].
Run under the forecast clock: hand the model only the catalog slice $(-\infty, t)$, cut at the local `Mc`, and compute the per-cell rate per horizon {1 d, 2 d, 7 d}, with expected value and bounds
(P10 / median / P90), and the calibrated public probability vs the long-term baseline. Emit both
representations — a gridded-rate forecast (Poisson CSEP tests) and a catalog-based forecast
(≥10k Monte-Carlo synthetic catalogs, relaxing the Poisson assumption). Output: a raw daily forecast
object [never versioned].
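One way the bounds could be read off the catalog-based forecast is sketched below, assuming the synthetic catalogs have already been reduced to an array of event counts per catalog and cell. The function name `summarize_counts` and the tiny array are hypothetical; a real run would use the ≥10k catalogs mentioned above.

```python
import numpy as np


def summarize_counts(counts):
    """Per-cell summary of simulated event counts.

    counts: array of shape (n_catalogs, n_cells), one row per synthetic catalog.
    """
    counts = np.asarray(counts)
    p10, p50, p90 = np.percentile(counts, [10, 50, 90], axis=0)
    return {
        "expected": counts.mean(axis=0),
        "p10": p10,
        "median": p50,
        "p90": p90,
        # Empirical probability of at least one event; no Poisson assumption.
        "p_any": (counts >= 1).mean(axis=0),
    }


# Four toy catalogs over two cells.
summary = summarize_counts([[0, 2], [0, 1], [1, 3], [0, 0]])
```

Because `p_any` is counted directly from the simulations, it can differ from the Poisson value implied by the mean rate when the counts are over-dispersed, which is the point of emitting the catalog-based representation.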
The public scalar is the exceedance probability, with an explicit per-region `Mmax` bounding the exceedance integral for the rare large events that dominate impact.
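For reference, under the Poisson assumption used for the gridded forecast, a standard way to write such an exceedance probability is sketched below. The symbols $M_0$ (public magnitude threshold), $N_c$ (expected count in cell $c$ over the horizon $\Delta t$) and $\lambda$ (conditional rate density) are notation assumed here for illustration, not taken from the project's code.

```math
P_c\big(\ge 1 \text{ event with } M \ge M_0 \text{ in } [t,\, t+\Delta t)\big) = 1 - e^{-N_c},
\qquad
N_c = \int_{t}^{t+\Delta t} \int_{c} \int_{M_0}^{M_{\max}} \lambda(s, x, y, m)\, dm\, dx\, dy\, ds
```

The upper limit $M_{\max}$ is where the per-region bound enters: without it the magnitude integral would extend the Gutenberg–Richter tail indefinitely.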
Compact the raw forecast into one served file: apply a sparsity floor, H3-bin, quantize
(log-binned uint8/uint16 + legend lookup), and gzip — targeting a few hundred KB to a few MB.
The single file carries the per-cell rates, baseline, and bounds, plus CSEP scores and provenance (`Mc` map version, declustering choice, model version, issue timestamp).
Outputs: results/forecast-YYYY-MM-DD.json.gz and results/index.json (latest pointer + rolling
CSEP calibration) [versioned — compact only].
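The compaction steps can be sketched as follows, leaving out the H3 binning (cell ids are treated as opaque strings). Everything here is an assumption for illustration: the function names, the legend layout, and the choice of code 0 as "below floor" are hypothetical, not the project's artifact format.

```python
import gzip
import json
import math


def quantize_log_uint8(rates, floor, levels=255):
    """Drop cells below the sparsity floor, then map rates to log-spaced codes.

    Codes run from 1 to `levels` so they fit in a uint8; code 0 is left free
    to mean 'below floor / not stored'.
    """
    kept = {cell: r for cell, r in rates.items() if r >= floor}
    if not kept:
        return {}, {"floor": floor, "log10_min": None, "log10_step": None}
    lo = math.log10(min(kept.values()))
    hi = math.log10(max(kept.values()))
    step = (hi - lo) / (levels - 1) if hi > lo else 1.0
    codes = {c: 1 + round((math.log10(r) - lo) / step) for c, r in kept.items()}
    return codes, {"floor": floor, "log10_min": lo, "log10_step": step}


def dequantize(code, legend):
    """Legend lookup: recover the approximate rate for a stored code."""
    return 10 ** (legend["log10_min"] + (code - 1) * legend["log10_step"])


def pack(codes, legend, provenance):
    """Deterministic JSON + gzip: sorted keys and a zeroed gzip timestamp."""
    doc = {"cells": codes, "legend": legend, "provenance": provenance}
    raw = json.dumps(doc, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return gzip.compress(raw, compresslevel=9, mtime=0)


rates = {"cellA": 1e-6, "cellB": 1e-3, "cellC": 1e-9}  # made-up rates
codes, legend = quantize_log_uint8(rates, floor=1e-7)
blob = pack(codes, legend, {"model_version": "example"})
```

Setting `mtime=0` matters for the reproducibility goal: by default the gzip header embeds the current time, so two otherwise identical runs would produce different bytes. Identical output across different zlib builds is a separate question that this sketch does not address.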
```mermaid
flowchart TD
CLEAN["Clean, homogenized catalog (full)"] --> CUT["Cut events < Mc(x,y,t)"]
CUT --> GK["Gardner-Knopoff decluster"]
CUT --> FULL["Keep FULL un-declustered catalog"]
GK --> BG["Stationary background mu(x,y)<br/>Poisson-baseline calibration"]
FULL --> COND["Conditional / ETAS model<br/>(triggering IS the signal)"]
CUT --> ZBZ["Zaliapin-Ben-Zion eta/T/R"]
ZBZ --> COND
BG --> NULL["Smoothed-seismicity null<br/>(the mandatory baseline)"]
```
The declustered catalog drives only the background and the Poisson null; the full catalog drives the conditional model. Scoring is on the non-declustered catalog (the target events are mostly aftershocks), and skill is measured against a real ETAS baseline — a model that merely reproduces Omori is not skillful. See the evaluation pages for the CSEP protocol.
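One way a "guarded by design" rule like this could look in code is to tag each catalog view and have every fitting entry point refuse the wrong one. This is a hypothetical sketch; the class and function names are not from the project, and the fitting bodies are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogView:
    """A catalog plus an explicit flag saying which view it is."""
    events: tuple
    declustered: bool


def fit_background(view):
    """Background / Poisson null: accepts only the declustered view."""
    if not view.declustered:
        raise ValueError("background fit requires the declustered catalog")
    return len(view.events)  # placeholder for the real fit


def fit_conditional(view):
    """Conditional / ETAS model: accepts only the full catalog."""
    if view.declustered:
        raise ValueError("conditional model must see the FULL catalog")
    return len(view.events)  # placeholder for the real fit


full = CatalogView(events=("m1", "a1", "a2", "m2"), declustered=False)
mains = CatalogView(events=("m1", "m2"), declustered=True)
n_bg = fit_background(mains)
n_cond = fit_conditional(full)
```

The check is cheap, and it turns a silent modelling mistake into an immediate failure at the call site.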
- Forecast clock (no temporal leakage). At each daily issue time $t$, the model is handed only the catalog slice $(-\infty, t)$, the forecast is sealed, then the clock advances. Temporal leakage, the primary pseudo-prospective failure mode, is structurally impossible, not a matter of discipline.
- Cold-start / quiescent cells. Floor the conditional rate to the long-term smoothed-seismicity Poisson background (not an arbitrary hard floor); borrow strength spatially via regionalized / hierarchical priors. The UI distinguishes "low but poorly-constrained," "genuinely quiescent," and "no data / out-of-coverage" (the coverage mask: blank ≠ safe). Calibration is dominated by these cells, so they are handled honestly.
- Input-state snapshots for reproducibility. The fetch manifest snapshots the exact catalog state used per issue (ComCat continuously revises magnitudes/locations and retracts events). A past forecast must be byte-reproducible months later; otherwise pseudo-prospective CSEP scoring is scored against a retroactively-improved catalog (optimistic leakage). Immutable issue-time logging of both the forecast and its input manifest is mandatory.
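The first two rules above can be sketched in a few lines. The event layout (a dict with a `time` key) and both function names are assumptions made for illustration. The slice here filters on origin time only; handling later catalog revisions relies on the input-state snapshot described in the third rule.

```python
from datetime import datetime, timezone


def catalog_slice(events, issue_time):
    """Return only events strictly before the issue time: the slice (-inf, t)."""
    return [e for e in events if e["time"] < issue_time]


def floored_rate(conditional_rate, background_rate):
    """Cold-start rule: never report less than the long-term background."""
    return max(conditional_rate, background_rate)


t = datetime(2024, 1, 2, tzinfo=timezone.utc)
events = [
    {"id": "before", "time": datetime(2024, 1, 1, 12, tzinfo=timezone.utc)},
    {"id": "at_t", "time": t},
    {"id": "after", "time": datetime(2024, 1, 2, 6, tzinfo=timezone.utc)},
]
visible = catalog_slice(events, t)
rate = floored_rate(conditional_rate=1e-9, background_rate=3e-7)
```

The strict inequality is deliberate: an event stamped exactly at the issue time belongs to the period being forecast, not to the model's input.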
VERSIONED (committed to git):
- Pipeline code (`scripts/`, the Python package, the static-preview server).
- Configs (`configs/*.yaml`: region/global, grid / H3 resolution, `Mc` method, declustering params, ETAS, horizons, magnitude thresholds, `Mmax` assumption).
- Manifests (`fetch_manifest`, `clean_manifest`, `mc_decluster_manifest`, `feature_manifest`, `model_manifest`): the provenance/reproducibility record, covering source URLs, query params, retrieved-at timestamps, row counts, checksums, conversion coefficients, model params, CSEP scores, code SHA.
- Compact daily results (`results/forecast-YYYY-MM-DD.json.gz`) + `results/index.json`. These are small and are the public app's data.
- `requirements.txt` / lock, `.env.example`, README, license/credits page.
NEVER versioned (.gitignore, rebuildable from manifests + code):
- Raw downloaded catalogs / waveforms / enricher grids (ComCat JSON, ISC-GEM CSV, GCMT ndk, Slab2 `.grd`, NGL `.tenv3`, TPXO model files).
- The clean / declustered / `Mc`-grid / feature Parquet stores.
- Model weights / fitted-parameter binaries.
- The `.venv/`, caches, the working `.env`, and all secrets.
- Any registration-gated raw files or derived products under their providers' agreements.
```gitignore
.venv/
.env
data/raw/
data/clean/
data/features/
data/mc/
models/
*.grd
*.tenv3
*.parquet
*.ndk
__pycache__/
.cache/
# keep: configs/, manifests/, results/*.json.gz, results/index.json
!results/*.json.gz
```

The result: the git repo stays small. Only configs, manifests, code, and compact gzipped results are committed, growing by one artifact of a few hundred KB to a few MB per day. The working set on the build host is rebuildable in full from the manifests.
Every stage emits a manifest; the chain of manifests is the reproducibility contract. To reconstruct any past forecast:
```mermaid
flowchart LR
M["manifests/ for issue date t<br/>(fetch -> clean -> mc/decluster -> feature -> model)"] --> RB["Re-run stages A-G<br/>from code @ recorded git SHA<br/>+ configs @ recorded hash"]
RB --> OUT["Byte-identical<br/>forecast-t.json.gz"]
```
- A manifest pins everything that affects the output: source catalog versions and query params, the `Mc` grid hash, the declustering choice, the magnitude-conversion coefficients, the model id + fitted params, the config hash, the code git SHA, and the issue timestamp.
- Raw data is never the source of truth; the manifest is. Raw inputs are gitignored and rebuildable; if a re-fetch returns a revised catalog, the manifest's snapshot is what governs the reproduction, so scoring is never silently improved by retroactive catalog revisions.
- The daily QA gate (catalog freshness, event-count sanity, no duplicate/retracted spike near the magnitude threshold, rolling N-test drift, sane artifact size) prevents a corrupted or stale artifact from being committed; on failure the job commits nothing and the last-good artifact stays up. See Technical Architecture §7.
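A QA gate of this kind can be sketched as a pure function that returns the list of failed checks, so the caller commits only when the list is empty. All metric names, limit names, and numbers below are hypothetical placeholders; the real thresholds live in the project's configs.

```python
def qa_gate(metrics, limits):
    """Return human-readable failures; an empty list means 'safe to commit'."""
    failures = []
    if metrics["catalog_age_hours"] > limits["max_catalog_age_hours"]:
        failures.append("catalog is stale")
    lo, hi = limits["event_count_range"]
    if not lo <= metrics["event_count"] <= hi:
        failures.append("event count outside sane range")
    if metrics["duplicates_near_threshold"] > limits["max_duplicates_near_threshold"]:
        failures.append("duplicate/retracted spike near magnitude threshold")
    if abs(metrics["n_test_drift"]) > limits["max_n_test_drift"]:
        failures.append("rolling N-test drift")
    lo_b, hi_b = limits["artifact_bytes_range"]
    if not lo_b <= metrics["artifact_bytes"] <= hi_b:
        failures.append("artifact size out of range")
    return failures


# Placeholder limits and one passing / one failing set of metrics.
limits = {
    "max_catalog_age_hours": 6,
    "event_count_range": (100, 100_000),
    "max_duplicates_near_threshold": 5,
    "max_n_test_drift": 0.2,
    "artifact_bytes_range": (50_000, 10_000_000),
}
good = {
    "catalog_age_hours": 1,
    "event_count": 4_200,
    "duplicates_near_threshold": 0,
    "n_test_drift": 0.05,
    "artifact_bytes": 800_000,
}
bad = dict(good, catalog_age_hours=30, artifact_bytes=12)
ok_failures = qa_gate(good, limits)
bad_failures = qa_gate(bad, limits)
```

Collecting every failure instead of stopping at the first makes the job log more useful when a bad upstream fetch trips several checks at once.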
Each human-run convenience command ships as parallel .ps1 + .sh with identical subcommands;
server-side / timer scripts are .sh only.
| Script | Runs | Does |
|---|---|---|
| `scripts/setup` | n/a | create the `.venv`, install pinned deps, materialize the gitignored `.env` from the secrets vault |
| `scripts/fetch` | stage (A) | ComCat delta + regional FDSN + EMSC + periodic ISC-GEM/GCMT/enricher refresh; write raw store + `fetch_manifest.json` |
| `scripts/build-features` | (B)→(D) | clean/dedupe/homogenize, Mc + declustering, feature build; write feature store + manifests |
| `scripts/train` | (E) | fit smoothed null + regime-tiled ETAS (+ gated challengers); run pyCSEP consistency + comparison tests; write weights + `model_manifest.json` |
| `scripts/infer` | (F)→(G) | forecast-clock inference, bounds, calibration, gridded + catalog-based forecasts, compact gzipped artifact + `index.json` |
| `scripts/daily` | (A)→(G) + publish | the cron entry: fetch → build-features → infer, QA gate, then scoped `git add results/ manifests/` → commit → push |
| `scripts/dev` | n/a | local static-preview server for the latest artifact (never a deployed backend) |
| `scripts/check` | smoke | lint + a fast pipeline smoke run on a tiny fixture window (CI-friendly, minimal network) |
All FDSN calls retry with backoff on 204/400/413/429/503 (treat 413 as "tile smaller").
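The retry rule can be sketched independently of any HTTP library by passing in a `fetch(start, end)` callable that returns `(status, body)`. The function name `fetch_window`, its parameters, and the fake service below are hypothetical; "tile smaller" is interpreted here as halving the requested window, which is one possible reading.

```python
import time

RETRYABLE = {204, 400, 429, 503}  # 413 is handled separately by splitting


def fetch_window(fetch, start, end, max_attempts=5, base_delay=1.0,
                 min_width=1e-6, sleep=time.sleep):
    """Return the list of response bodies covering [start, end)."""
    for attempt in range(max_attempts):
        status, body = fetch(start, end)
        if status == 200:
            return [body]
        if status == 413:
            # Request too large: split the window and fetch each half.
            if end - start <= min_width:
                raise RuntimeError("window cannot be split further")
            mid = (start + end) / 2
            args = (max_attempts, base_delay, min_width, sleep)
            return (fetch_window(fetch, start, mid, *args)
                    + fetch_window(fetch, mid, end, *args))
        if status not in RETRYABLE:
            raise RuntimeError(f"non-retryable HTTP status {status}")
        sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise RuntimeError(f"gave up on [{start}, {end}) after {max_attempts} attempts")


# Fake service: rejects wide windows with 413, fails once with 503, then succeeds.
calls = []


def fake_fetch(start, end):
    calls.append((start, end))
    if end - start > 5:
        return 413, None
    if calls.count((start, end)) == 1:
        return 503, None
    return 200, f"{start}-{end}"


delays = []
bodies = fetch_window(fake_fetch, 0, 10, sleep=delays.append)
```

Injecting `sleep` keeps the example testable without waiting; in production the default `time.sleep` applies, and adding jitter to the delay is a common refinement not shown here.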
See also: Data Sources — every input this pipeline fetches · Technical Architecture — the compute host, regime tiling, git-as-data publishing, and the daily 03:00 job that drives this DAG.
⚠️ Disclaimer: read this. CAOS_SEISMIC produces probabilistic forecasts, not predictions. It is an independent research and education tool. It is NOT an official earthquake early-warning or civil-protection system, it does NOT predict when, where, or how large an earthquake will be, and it must NOT be used for life-safety, emergency, or evacuation decisions. Every number it publishes is a bounded, calibrated probability conditioned on the present state of seismicity, never an alarm, a countdown, or a "safe" state. A single outcome neither confirms nor refutes a probabilistic forecast. It complements, and does not replace or speak for, official agencies: always follow your national seismological and civil-protection authorities (e.g. USGS, INGV, CSN (Chile, SENAPRED for civil protection), GeoNet, JMA). The software is provided "as is", without warranty of any kind (MIT License); the authors accept no liability for its use. Data are courtesy of their providers (USGS/ANSS, ISC/ISC-GEM, Global CMT, EMSC, CSN, and others) under their respective licenses and attribution terms. See Honest-Limits for the full epistemic context.
CAOS_SEISMIC · seismic.fasl-work.com · source · MIT
Conditional probabilistic seismic forecasting — forecasts, never predictions.