# Practical Project: USGS Earthquakes Pipeline

In this project we will build a small but real **scientific data pipeline** using the **USGS Earthquake Catalog API**.  

### What will be produced:
Artifacts (all under a project folder):
* `data/raw/`: raw API snapshots + metadata (query params, timestamps)
* `data/staged/`: parsed/normalized table (deduped, typed)
* `data/warehouse/`: curated table (Parquet; optionally partitioned by day)
* `data/reference/validation_report.json`: contracts + anomaly rates + canaries
* `data/reference/pipeline_runs/`: run logs for reproducibility

## 0 - Setup

The project is created at a path relative to this notebook:

`work/m2_project/`

In [1]:
from __future__ import annotations

from pathlib import Path
from datetime import datetime, timedelta, timezone
import json
import hashlib
import math

import numpy as np
import pandas as pd

from IPython.display import display

pd.set_option("display.max_columns", 160)
pd.set_option("display.width", 180)

WORK_DIR = Path("work")
PROJECT_DIR = WORK_DIR / "m2_project"

DATA_DIR = PROJECT_DIR / "data"
RAW_DIR = DATA_DIR / "raw"
STAGED_DIR = DATA_DIR / "staged"
WH_DIR = DATA_DIR / "warehouse"
REF_DIR = DATA_DIR / "reference"
RUN_DIR = REF_DIR / "pipeline_runs"

for p in [RAW_DIR, STAGED_DIR, WH_DIR, REF_DIR, RUN_DIR]:
    p.mkdir(parents=True, exist_ok=True)

print("Project:", PROJECT_DIR)
print("Raw:", RAW_DIR)
print("Staged:", STAGED_DIR)
print("Warehouse:", WH_DIR)
print("Reference:", REF_DIR)
print("Runs:", RUN_DIR)

Project: work/m2_project
Raw: work/m2_project/data/raw
Staged: work/m2_project/data/staged
Warehouse: work/m2_project/data/warehouse
Reference: work/m2_project/data/reference
Runs: work/m2_project/data/reference/pipeline_runs


### Helper Utilities  

These helpers keep the notebook focused on pipeline thinking rather than boilerplate.

In [2]:
class PipelineError(RuntimeError):
    pass

def utc_now_iso() -> str:
    return datetime.now(timezone.utc).isoformat()

def sha16(x: str) -> str:
    return hashlib.sha256(x.encode("utf-8")).hexdigest()[:16]

def write_json(path: Path, obj: dict) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(obj, indent=2, default=str))

def read_json(path: Path) -> dict:
    return json.loads(path.read_text())

def require_columns(df: pd.DataFrame, cols: list[str], context: str) -> None:
    missing = [c for c in cols if c not in df.columns]
    if missing:
        raise PipelineError(f"[{context}] Missing required columns: {missing}")

def require_unique(df: pd.DataFrame, key: str, context: str) -> None:
    if key not in df.columns:
        raise PipelineError(f"[{context}] Missing key column '{key}'")
    dupes = int(df[key].duplicated().sum())
    if dupes:
        raise PipelineError(f"[{context}] Key '{key}' has {dupes} duplicates")

print("Helpers ready.")

Helpers ready.


## 1 - Ingest: Pull Earthquakes From USGS API (Paginated)

We will use the USGS **event** endpoint (GeoJSON):
* base: `https://earthquake.usgs.gov/fdsnws/event/1/query`
* params: `format=geojson`, `starttime`, `endtime`, `minmagnitude`, plus pagination (`limit`, `offset`)

### Choose a query window
Set:
* `DAYS_BACK` = 30
* `MIN_MAG` = 2.5

Then build `starttime` and `endtime` in **UTC** as ISO dates.

**Note:** The API accepts `YYYY-MM-DD` strings.

In [3]:
# Set query window
DAYS_BACK = 30
MIN_MAG = 2.5

# compute starttime/endtime as YYYY-MM-DD (UTC-based)
endtime_dt = datetime.now(timezone.utc)
starttime_dt = endtime_dt - timedelta(days=DAYS_BACK)

starttime = starttime_dt.strftime('%Y-%m-%d')
endtime = endtime_dt.strftime('%Y-%m-%d')

print(f"starttime: {starttime}")
print(f"endtime: {endtime}")
print(f"minmagnitude: {MIN_MAG}")

starttime: 2026-01-04
endtime: 2026-02-03
minmagnitude: 2.5


### Implement pagination

Write `fetch_usgs_pages(...)` that:
* requests pages using `limit` and `offset`
* stops when a page returns fewer than `limit` features
* returns a list of page dictionaries

**Constraints:**  
* Use a small `limit` while testing (e.g., 200) to see pagination work.
* Add a polite `sleep` if wanted, but keep it simple

**Note:**
* `obj["features"]` = list of events
* `obj["metadata"]["count"]` = count for the query (not always equal to returned features)
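
The stop rule can be checked offline before touching the network. The sketch below is illustrative only: `fake_page` and `paginate` are hypothetical stand-ins (not part of the USGS API or of this notebook), and the assumption is that offsets are 1-based and that `metadata.count` reflects the size of the returned page.

```python
def fake_page(events: list[int], limit: int, offset: int) -> dict:
    """Hypothetical stand-in for one API response (offset is 1-based)."""
    start = offset - 1
    chunk = events[start:start + limit]
    return {"features": chunk, "metadata": {"count": len(chunk)}}


def paginate(events: list[int], limit: int = 200, max_pages: int = 50) -> list[dict]:
    """Collect pages until one is empty or shorter than `limit`."""
    pages = []
    offset = 1
    for _ in range(max_pages):
        page = fake_page(events, limit, offset)
        features = page["features"]
        if not features:
            # Happens when the total is an exact multiple of `limit`
            break
        pages.append(page)
        if len(features) < limit:
            # A partial page is the last page
            break
        offset += limit
    return pages


# 1572 simulated events with limit=200: seven full pages plus one partial page
pages_demo = paginate(list(range(1572)), limit=200)
print(len(pages_demo), len(pages_demo[-1]["features"]))
```

The empty-page check matters when the number of events is an exact multiple of `limit`: the last full page does not trigger the partial-page rule, so one extra request is needed to discover that nothing is left.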

In [4]:
import time
import requests

USGS_URL = "https://earthquake.usgs.gov/fdsnws/event/1/query"

def fetch_usgs_pages(starttime: str, endtime: str, minmag: float, limit: int=200, max_pages: int=50) -> list[dict]:
    pages = []
    offset = 1  # USGS uses 1-based offsets
    for page_i in range(max_pages):
        params = {
            "format": "geojson",
            "starttime": starttime,
            "endtime": endtime,
            "minmagnitude": minmag,
            "limit": limit,
            "offset": offset,
            "orderby": "time",
        }

        # Request the page, check the status, parse JSON
        print(f"Fetching page {page_i + 1} (offset={offset})...", end=" ")
        # A timeout keeps a stalled connection from hanging the notebook
        response = requests.get(USGS_URL, params=params, timeout=30)

        # Check for HTTP errors
        response.raise_for_status()

        data = response.json()
        features = data.get("features", [])
        total_count = data.get("metadata", {}).get("count", 0)
        print(f"got {len(features)} events (total in query: {total_count})")

        # Quick stop if there are no results
        if not features:
            print("No more results, stopping")
            break

        pages.append(data)

        # Stop if we get fewer results than the limit (last page)
        if len(features) < limit:
            print("Partial page received, stopping")
            break

        # Increment offset for next page
        offset += limit

        # Short polite break for API
        time.sleep(0.5)

    return pages

print("Fetcher ready.")
        

Fetcher ready.


### Fetch data and write a raw snapshot

Run the fetch and write:  
* raw pages: `data/raw/usgs_pages_<runid>.jsonl`
* raw_metadata: `data/raw/usgs_meta_<runid>.json`

**Note:**  
`.jsonl` means "JSON lines": one JSON object per line
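
As a minimal sketch of the format, the round trip below writes two made-up page objects to a temporary `.jsonl` file and reads them back line by line. The file name and the page contents are illustrative assumptions, not real USGS data.

```python
import json
import tempfile
from pathlib import Path

# Two made-up "pages"; real pages would be full GeoJSON responses
demo_pages = [
    {"features": [{"id": "a"}, {"id": "b"}]},
    {"features": [{"id": "c"}]},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "pages_demo.jsonl"

    # Write: one JSON object per line
    with path.open("w", encoding="utf-8") as f:
        for page in demo_pages:
            f.write(json.dumps(page) + "\n")

    # Read: parse each non-empty line independently
    with path.open("r", encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), loaded == demo_pages)
```

Because each line is parsed on its own, a raw snapshot can be streamed page by page instead of being loaded as one large JSON document.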

In [5]:
from pprint import pprint

# Fetch data and write raw snapshot
run_id = sha16(utc_now_iso())
print(f"Run ID: {run_id}\n")

# Fetch pages
pages = fetch_usgs_pages(starttime, endtime, MIN_MAG, limit=200)
print(f"\nFetched {len(pages)} pages")

# Count total events
total_events = sum(len(p['features']) for p in pages)
print(f"Total events: {total_events}")

# Write raw pages as JSONL (one JSON object per line)
pages_path = RAW_DIR / f"usgs_pages_{run_id}.jsonl"
pages_path.parent.mkdir(parents=True, exist_ok=True)

with pages_path.open('w') as f:
    for page in pages:
        f.write(json.dumps(page) + '\n')

print(f"Wrote pages: {pages_path}")

# Write metadata
metadata = {
    "run_id": run_id,
    "generated_at_utc": utc_now_iso(),
    "query": {
        "starttime": starttime,
        "endtime": endtime,
        "minmagnitude": MIN_MAG,
    },
    "n_pages": len(pages),
    "n_features_total": total_events,
    "source": "USGS Earthquake Catalog (GeoJSON)",
    "endpoint": USGS_URL,
}

meta_path = RAW_DIR / f"usgs_meta_{run_id}.json"
write_json(meta_path, metadata)
print(f"Wrote metadata: {meta_path}")


# Pretty print the metadata
print("\nMetadata:")
pprint(metadata, sort_dicts=False)
      

Run ID: 53565fc37978a749

Fetching page 1 (offset=1)... got 200 events (total in query: 200)
Fetching page 2 (offset=201)... got 200 events (total in query: 200)
Fetching page 3 (offset=401)... got 200 events (total in query: 200)
Fetching page 4 (offset=601)... got 200 events (total in query: 200)
Fetching page 5 (offset=801)... got 200 events (total in query: 200)
Fetching page 6 (offset=1001)... got 200 events (total in query: 200)
Fetching page 7 (offset=1201)... got 200 events (total in query: 200)
Fetching page 8 (offset=1401)... got 172 events (total in query: 172)
Partial page received, stopping

Fetched 8 pages
Total events: 1572
Wrote pages: work/m2_project/data/raw/usgs_pages_53565fc37978a749.jsonl
Wrote metadata: work/m2_project/data/raw/usgs_meta_53565fc37978a749.json

Metadata:
{'run_id': '53565fc37978a749',
 'generated_at_utc': '2026-02-03T12:20:39.529741+00:00',
 'query': {'starttime': '2026-01-04',
           'endtime': '2026-02-03',
           'minmagnitude': 2.5},
 '

## 2 - Stage: Normalize GeoJSON &rarr; Table

USGS GeoJSON structure:
* `feature["id"]` is a stable event id
* `feature["properties"]` contains magnitude, place, time, etc.
* `feature["geometry"]["coordinates"]` is `[longitude, latitude, depth_km]`

### Flatten features into a DataFrame  

Implement `features_to_df(pages)` that returns one DataFrame with one row per event.

**Note:**  
`pd.json_normalize(features)` will help
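
To see what the flattening produces, here is a small sketch on a single hand-written feature. The id and values are made up for illustration; only the nesting mirrors the structure described above.

```python
import pandas as pd

# One hand-written feature with the same nesting as a USGS GeoJSON feature
feature = {
    "type": "Feature",
    "id": "demo0001",  # made-up id
    "properties": {"mag": 3.1, "place": "somewhere", "time": 1770075292434},
    "geometry": {"type": "Point", "coordinates": [-121.9, 37.7, 7.8]},
}

flat = pd.json_normalize([feature])

# Nested dicts become dotted column names; the coordinates list stays a list
print(list(flat.columns))

lon, lat, depth_km = flat.loc[0, "geometry.coordinates"]
print(lon, lat, depth_km)
```

Nested dictionaries are expanded into dotted names such as `properties.mag`, while list values such as `geometry.coordinates` are kept as a single cell, which is why the coordinates need a separate extraction step.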

In [8]:
# Create a DataFrame with one row per event
def features_to_df(pages: list[dict]) -> pd.DataFrame:
    # 1. Extract all features from all pages
    all_features = []
    for page in pages:
        all_features.extend(page.get("features", []))

    # 2. Use pd.json_normalize to flatten nested structures, automatically
    df = pd.json_normalize(all_features)

    # 3. Extract coordinates into separate columns
    # geometry.coordinates is [longitude, latitude, depth_km]
    df['longitude'] = df['geometry.coordinates'].apply(lambda x: x[0] if x else None)
    df['latitude'] = df['geometry.coordinates'].apply(lambda x: x[1] if x else None)
    df['depth_km'] = df['geometry.coordinates'].apply(lambda x: x[2] if x else None)

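    # 4. Clean up column names (remove properties prefix)
    # Note: 'properties.type' then collides with the top-level 'type' column,
    # so the frame ends up with two columns named 'type'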
    df.columns = df.columns.str.replace('properties.', '', regex=False)

    # 5. Drop the original nested column (geometry.coordinates)
    df = df.drop(columns=['geometry.coordinates'], errors='ignore')

    return df

print("flattening function ready")

flattening function ready


In [17]:
df_raw = features_to_df(pages)
print(f"Raw flattened shape: {df_raw.shape}")
display(df_raw.head(5))

Raw flattened shape: (1572, 32)


Unnamed: 0,type,id,mag,place,time,updated,tz,url,detail,felt,cdi,mmi,alert,status,tsunami,sig,net,code,ids,sources,types,nst,dmin,rms,gap,magType,type.1,title,geometry.type,longitude,latitude,depth_km
0,Feature,us6000s620,4.1,"16 km E of Calingasta, Argentina",1770075292434,1770080742040,,https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/fdsnws/event/1/que...,,,,,reviewed,0,259,us,6000s620,",us6000s620,",",us,",",origin,phase-data,",27,1.322,0.79,89.0,mb,earthquake,"M 4.1 - 16 km E of Calingasta, Argentina",Point,-69.2517,-31.3176,162.179
1,Feature,uw62216847,2.8,"5 km ESE of Benton City, Washington",1770075173110,1770096410010,,https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/fdsnws/event/1/que...,,,,,reviewed,0,121,uw,62216847,",uw62216847,",",uw,",",origin,phase-data,",14,0.161,0.28,202.0,ml,explosion,"M 2.8 Explosion - 5 km ESE of Benton City, Was...",Point,-119.422,46.2345,-0.24
2,Feature,nc75306276,2.94,"5 km SE of San Ramon, CA",1770073598570,1770101775717,,https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/fdsnws/event/1/que...,10.0,3.3,,,automatic,0,136,nc,75306276,",nc75306276,us6000s61u,",",nc,us,",",dyfi,focal-mechanism,nearby-cities,origin,pha...",64,0.1036,0.1,37.0,ml,earthquake,"M 2.9 - 5 km SE of San Ramon, CA",Point,-121.935837,37.754501,7.37
3,Feature,nc75306271,3.0,"4 km ESE of San Ramon, CA",1770073458430,1770101755307,,https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/fdsnws/event/1/que...,24.0,3.1,,,reviewed,0,146,nc,75306271,",nc75306271,us6000s61t,",",nc,us,",",dyfi,focal-mechanism,nearby-cities,origin,pha...",74,0.1056,0.1,39.0,ml,earthquake,"M 3.0 - 4 km ESE of San Ramon, CA",Point,-121.935333,37.763332,8.38
4,Feature,nc75306256,3.14,"4 km ESE of San Ramon, CA",1770072973340,1770098210597,,https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/fdsnws/event/1/que...,97.0,3.4,,,reviewed,0,185,nc,75306256,",nc75306256,us6000s61s,",",nc,us,",",dyfi,focal-mechanism,nearby-cities,origin,pha...",80,0.1033,0.12,26.0,ml,earthquake,"M 3.1 - 4 km ESE of San Ramon, CA",Point,-121.937164,37.765167,7.82


### Parse types and extract coordinates  

Create a staged table with columns:  
* `event_id`: string
* `time_utc`: datetime
* `updated_utc`: datetime
* `mag`: float
* `place`: string
* `longitude`: float
* `latitude`: float
* `depth_km`: float
* `tsunami`: int
* `status`: string

USGS times are often **milliseconds since epoch**.

**Note:** `pd.to_datetime(ms, unit="ms", utc=True)`
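
A quick sketch of that conversion, using two `time` values taken from the preview table above (the variable names are illustrative):

```python
import pandas as pd

# Epoch milliseconds, as found in the `time` column
ms = pd.Series([1770075292434, 1770075173110])

# unit="ms" interprets the integers as milliseconds; utc=True makes the result tz-aware
time_utc = pd.to_datetime(ms, unit="ms", utc=True)

print(time_utc.dtype)
print(time_utc.iloc[0].isoformat())
```

Keeping the integers as integers avoids floating-point rounding at sub-millisecond precision during the conversion.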

Also:
* Deduplicate by `event_id` (keep first)
* Normalize missing-like strings (e.g., empty or whitespace-only values) to proper missing values

In [19]:
staged = pd.DataFrame()

# Build staged dataframe
def staged_dataframe(data: pd.DataFrame) -> pd.DataFrame:
    return data

print("function to create staged DataFrame not ready")

function to create staged DataFrame not ready
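
The stub above leaves the staging logic open. One possible way to fill it in is sketched below under stated assumptions: `build_staged` and `MISSING_TOKENS` are hypothetical names, the token list is a guess at what counts as "missing-like", and `tsunami` uses the nullable `Int64` dtype so that missing values do not force a float column. The input is a tiny made-up frame, not real catalog data.

```python
import pandas as pd

# Assumption: these lowercase tokens count as "missing-like"
MISSING_TOKENS = ["", "null", "none", "nan", "n/a"]


def build_staged(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical staging step: select, type, clean, and dedupe."""
    out = pd.DataFrame({
        "event_id": raw["id"].astype("string"),
        "time_utc": pd.to_datetime(raw["time"], unit="ms", utc=True),
        "updated_utc": pd.to_datetime(raw["updated"], unit="ms", utc=True),
        "place": raw["place"].astype("string"),
        "tsunami": pd.to_numeric(raw["tsunami"], errors="coerce").astype("Int64"),
        "status": raw["status"].astype("string"),
    })
    # Numeric columns: anything unparseable becomes NaN
    for col in ["mag", "longitude", "latitude", "depth_km"]:
        out[col] = pd.to_numeric(raw[col], errors="coerce").astype("float64")
    # String columns: trim whitespace, then map missing-like tokens to <NA>
    for col in ["place", "status"]:
        cleaned = out[col].str.strip()
        is_missing = cleaned.str.lower().isin(MISSING_TOKENS)
        out[col] = cleaned.mask(is_missing, pd.NA)
    # Keep the first occurrence of each event id
    return out.drop_duplicates(subset="event_id", keep="first").reset_index(drop=True)


# Made-up input: one duplicated event and one row with missing-like strings
raw_demo = pd.DataFrame({
    "id": ["ev1", "ev1", "ev2"],
    "time": [1770075292434, 1770075292434, 1770075173110],
    "updated": [1770080742040, 1770080742040, 1770096410010],
    "mag": [4.1, 4.1, 2.8],
    "place": ["near somewhere", "near somewhere", " "],
    "longitude": [-69.25, -69.25, -119.42],
    "latitude": [-31.32, -31.32, 46.23],
    "depth_km": [162.2, 162.2, -0.24],
    "tsunami": [0, 0, 0],
    "status": ["reviewed", "reviewed", "N/A"],
})

staged_demo = build_staged(raw_demo)
print(staged_demo.dtypes)
print(staged_demo[["event_id", "place", "status"]])
```

Selecting columns by name sidesteps the duplicated `type` column in the flattened frame, and deduplicating last means the type conversions run once per raw row rather than hiding parse problems in dropped duplicates.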


In [20]:
df_staged = staged_dataframe(df_raw)
print(f"Staged shape: {df_staged.shape}")
display(df_staged.head(5))

Staged shape: (1572, 32)


Unnamed: 0,type,id,mag,place,time,updated,tz,url,detail,felt,cdi,mmi,alert,status,tsunami,sig,net,code,ids,sources,types,nst,dmin,rms,gap,magType,type.1,title,geometry.type,longitude,latitude,depth_km
0,Feature,us6000s620,4.1,"16 km E of Calingasta, Argentina",1770075292434,1770080742040,,https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/fdsnws/event/1/que...,,,,,reviewed,0,259,us,6000s620,",us6000s620,",",us,",",origin,phase-data,",27,1.322,0.79,89.0,mb,earthquake,"M 4.1 - 16 km E of Calingasta, Argentina",Point,-69.2517,-31.3176,162.179
1,Feature,uw62216847,2.8,"5 km ESE of Benton City, Washington",1770075173110,1770096410010,,https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/fdsnws/event/1/que...,,,,,reviewed,0,121,uw,62216847,",uw62216847,",",uw,",",origin,phase-data,",14,0.161,0.28,202.0,ml,explosion,"M 2.8 Explosion - 5 km ESE of Benton City, Was...",Point,-119.422,46.2345,-0.24
2,Feature,nc75306276,2.94,"5 km SE of San Ramon, CA",1770073598570,1770101775717,,https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/fdsnws/event/1/que...,10.0,3.3,,,automatic,0,136,nc,75306276,",nc75306276,us6000s61u,",",nc,us,",",dyfi,focal-mechanism,nearby-cities,origin,pha...",64,0.1036,0.1,37.0,ml,earthquake,"M 2.9 - 5 km SE of San Ramon, CA",Point,-121.935837,37.754501,7.37
3,Feature,nc75306271,3.0,"4 km ESE of San Ramon, CA",1770073458430,1770101755307,,https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/fdsnws/event/1/que...,24.0,3.1,,,reviewed,0,146,nc,75306271,",nc75306271,us6000s61t,",",nc,us,",",dyfi,focal-mechanism,nearby-cities,origin,pha...",74,0.1056,0.1,39.0,ml,earthquake,"M 3.0 - 4 km ESE of San Ramon, CA",Point,-121.935333,37.763332,8.38
4,Feature,nc75306256,3.14,"4 km ESE of San Ramon, CA",1770072973340,1770098210597,,https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/fdsnws/event/1/que...,97.0,3.4,,,reviewed,0,185,nc,75306256,",nc75306256,us6000s61s,",",nc,us,",",dyfi,focal-mechanism,nearby-cities,origin,pha...",80,0.1033,0.12,26.0,ml,earthquake,"M 3.1 - 4 km ESE of San Ramon, CA",Point,-121.937164,37.765167,7.82
