# Step 1 - HDX Metadata Crawler

## Purpose
This notebook crawls the **Humanitarian Data Exchange (HDX)** catalogue and downloads HDX metadata as **JSON**, supporting two export levels:
- **Resource-level metadata** (one JSON per resource file)
- **Dataset-level metadata** (one JSON per dataset, including its resources list)

## What this notebook does
1) Enumerates datasets using the **CKAN Action API** (`package_search`).
2) For each dataset, downloads:
   - **Resource-level exports** via:
     `/dataset/{dataset_id}/resource/{resource_id}/download_metadata?format=json`
   - **Dataset-level export** via:
     `/dataset/{dataset_id}/download_metadata?format=json`
3) Writes JSON outputs using **stable, unique filenames** (UUID-based) and optional human-readable slugs.
4) Produces audit logs:
   - `manifest*.jsonl` (successful downloads)
   - `errors*.jsonl` (failures, HTTP errors, bot checks, parse errors)
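
The enumeration in step 1 can be sketched in isolation. This is a minimal illustration rather than the notebook's own code: `fetch_page` is a hypothetical callable standing in for a `package_search` request, and `fake_search` with its five-item catalogue exists only so the loop can be exercised offline.

```python
from typing import Any, Callable, Dict, Iterator, List


def paginate(
    fetch_page: Callable[[int, int], Dict[str, Any]], rows: int
) -> Iterator[Dict[str, Any]]:
    """Yield every result from a rows/start paginated search."""
    start = 0
    while True:
        page = fetch_page(start, rows)
        results = page.get("results", [])
        if not results:
            break
        yield from results
        # Advance by what was actually returned, not by what was requested.
        start += len(results)
        if start >= page.get("count", 0):
            break


# Hypothetical stand-in for package_search over a five-dataset catalogue.
CATALOGUE: List[Dict[str, Any]] = [{"id": f"ds-{i}"} for i in range(5)]


def fake_search(start: int, rows: int) -> Dict[str, Any]:
    return {"count": len(CATALOGUE), "results": CATALOGUE[start:start + rows]}


ids = [ds["id"] for ds in paginate(fake_search, rows=2)]
print(ids)
```

Advancing `start` by the number of results received keeps the loop correct even if a server returns fewer rows than requested.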

## Design notes
- **Resume-safe**: existing files are skipped (safe to re-run).
- **Polite crawling**: configurable throttling, retries, and exponential backoff.
- **Robustness**: if `download_metadata` returns non-JSON or is blocked, the workflow can fall back to CKAN:
  - `resource_show` for resource-level metadata
  - `package_show` for dataset-level metadata
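
To make the backoff behaviour concrete, the sketch below reproduces the delay expression used in the notebook's `get_json` (a power of two plus up to one second of random jitter, capped at 60 seconds). The helper name `backoff_delay` is an assumption introduced for illustration; the notebook computes the delay inline.

```python
import random


def backoff_delay(attempt: int, cap: float = 60.0) -> float:
    """Seconds to wait before retrying after failed attempt `attempt` (0-based)."""
    return min(cap, (2 ** attempt) + random.random())


random.seed(0)  # only to make this illustration repeatable
delays = [backoff_delay(a) for a in range(6)]
for attempt, delay in enumerate(delays):
    print(f"attempt {attempt}: wait about {delay:.1f}s")
```

With the default `MAX_RETRIES = 6`, the waits grow from roughly 1 second to roughly 32 seconds, so a request that keeps failing can take about a minute of waiting before the error is raised.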

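## Outputs (typical)
- `hdx_metadata_dump/resources/`: resource-level metadata JSON exports (one per resource)
- `hdx_metadata_dump/datasets/`: dataset snapshots taken from the `package_search` results during the resource-level crawl (one per dataset)
- `hdx_dataset_metadata_dump/dataset_metadata/`: dataset-level metadata JSON exports (one per dataset)
- `manifest.jsonl` / `manifest_datasets.jsonl`: one line per successful download (resource-level and dataset-level crawl, respectively)
- `errors.jsonl` / `errors_datasets.jsonl`: one line per failure, with identifiers and the error message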
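
Because each manifest line is a standalone JSON object, the logs are easy to summarize after a run. The sketch below counts manifest entries per `metadata_source` value, which shows how often the CKAN fallback was used. The function name `count_sources` and the demo rows are made up for illustration; only the `metadata_source` field and its values come from the notebook.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def count_sources(manifest_path: Path) -> Counter:
    """Count manifest lines per `metadata_source` value."""
    counts: Counter = Counter()
    with manifest_path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                counts[json.loads(line).get("metadata_source", "unknown")] += 1
    return counts


# Demonstration on a throwaway manifest with made-up rows.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "manifest.jsonl"
    rows = [
        {"dataset_id": "a", "metadata_source": "download_metadata"},
        {"dataset_id": "b", "metadata_source": "download_metadata"},
        {"dataset_id": "c", "metadata_source": "ckan_package_show_fallback"},
    ]
    demo.write_text("".join(json.dumps(r) + "\n" for r in rows), encoding="utf-8")
    summary = count_sources(demo)

print(dict(summary))
```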

## Configuration
Adjust before running:
- Pagination: `ROWS`
- Network behavior: `REQUESTS_PER_SECOND`, `MAX_RETRIES`, `TIMEOUT`
- Test limits: `MAX_DATASETS`, `MAX_RESOURCES`
- Filenames: `ADD_SLUG_TO_FILENAME`, `SLUG_MAXLEN`
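
For a first smoke test, values along the following lines keep the run short and gentle on the server. The specific numbers are illustrative assumptions, not recommendations from HDX; only the variable names come from the notebook.

```python
from typing import Optional

# Illustrative smoke-test values; tune them for your own run.
ROWS = 50                     # small pages keep the first request quick
REQUESTS_PER_SECOND = 1.0     # at most one request per second
MAX_RETRIES = 3
TIMEOUT = 30                  # seconds per request
MAX_DATASETS: Optional[int] = 5
MAX_RESOURCES: Optional[int] = 25
ADD_SLUG_TO_FILENAME = True
SLUG_MAXLEN = 40

# Minimum spacing between requests implied by the rate setting.
min_interval = 1.0 / REQUESTS_PER_SECOND
print(f"at most {MAX_DATASETS} datasets, {min_interval:.1f}s or more between requests")
```

Setting `MAX_DATASETS` and `MAX_RESOURCES` back to `None` removes the limits for a full crawl.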


## Resource-level metadata (per file)

Downloads one JSON per **resource** (i.e., per downloadable file attached to a dataset).  
Endpoint pattern:
`/dataset/{dataset_id}/resource/{resource_id}/download_metadata?format=json`  
Output naming uses `{dataset_uuid}__{resource_uuid}[__slug].json` to guarantee uniqueness and stable reruns. Use this mode when you need metadata tied to individual files (CSV, GeoPackage, Shapefile ZIPs, etc.).
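
The naming scheme can be tried on its own. The sketch below mirrors the notebook's `slugify` and filename logic in a self-contained form; `resource_basename` is a hypothetical helper name, and the IDs are placeholders rather than real HDX identifiers.

```python
import re

_slug_re = re.compile(r"[^a-zA-Z0-9]+")


def slugify(s: str) -> str:
    """Collapse runs of non-alphanumeric characters into single hyphens."""
    return _slug_re.sub("-", (s or "").strip()).strip("-").lower()


def resource_basename(
    dataset_id: str, resource_id: str, name: str = "", maxlen: int = 80
) -> str:
    """Build `{dataset_id}__{resource_id}[__slug].json`."""
    base = f"{dataset_id}__{resource_id}"
    slug = slugify(name)[:maxlen]
    return f"{base}__{slug}.json" if slug else f"{base}.json"


# Placeholder IDs, not real HDX identifiers.
example = resource_basename("1111-aaaa", "2222-bbbb", "Admin Boundaries (CSV)")
print(example)  # 1111-aaaa__2222-bbbb__admin-boundaries-csv.json
```

The two IDs alone make the filename unique, so the slug can be truncated or omitted without risking collisions.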


### Cell 1 — Config + imports

In [None]:
from __future__ import annotations

import json
import random
import re
import time
from pathlib import Path
from typing import Any, Dict, Iterable, Optional, Tuple

import requests

# ----------------------------
# User config
# ----------------------------
BASE = "https://data.humdata.org"
CKAN_API = f"{BASE}/api/3/action"

OUT_DIR = Path("hdx_metadata_dump")
DATASET_DIR = OUT_DIR / "datasets"
RESOURCE_DIR = OUT_DIR / "resources"
OUT_DIR.mkdir(parents=True, exist_ok=True)
DATASET_DIR.mkdir(parents=True, exist_ok=True)
RESOURCE_DIR.mkdir(parents=True, exist_ok=True)

MANIFEST_PATH = OUT_DIR / "manifest.jsonl"
ERRORS_PATH = OUT_DIR / "errors.jsonl"

# Pagination: CKAN package_search supports rows/start. Rows is often capped (commonly 1000).
ROWS = 500

# Throttling (be nice)
REQUESTS_PER_SECOND = 2.0  # lower = gentler
MAX_RETRIES = 6
TIMEOUT = 60

# For testing, you can stop early:
MAX_DATASETS: Optional[int] = None     # e.g., 50
MAX_RESOURCES: Optional[int] = None    # e.g., 2000

# Optional: include a readable name in filename (still unique due to IDs)
ADD_SLUG_TO_FILENAME = True
SLUG_MAXLEN = 80


### Cell 2 — HTTP helpers + CKAN wrapper

In [None]:
session = requests.Session()
session.headers.update({
    "User-Agent": "hdx-metadata-crawler/1.0 (contact: you@example.com)",
    "Accept": "application/json,text/plain,*/*",
})

def _looks_like_bot_check(text: str) -> bool:
    t = text.lower()
    return ("verify that you're not a robot" in t) or ("javascript is disabled" in t)

def _sleep_rate_limited(last_ts: float) -> float:
    """Sleep to respect REQUESTS_PER_SECOND. Returns new timestamp."""
    if REQUESTS_PER_SECOND <= 0:
        return time.time()
    min_dt = 1.0 / REQUESTS_PER_SECOND
    now = time.time()
    dt = now - last_ts
    if dt < min_dt:
        time.sleep(min_dt - dt)
    return time.time()

def get_json(url: str, params: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """
    GET JSON with retries, backoff, and basic bot-check detection.
    Returns parsed JSON dict.
    Raises RuntimeError on repeated failure.
    """
    last_ts = getattr(get_json, "_last_ts", 0.0)
    for attempt in range(MAX_RETRIES):
        last_ts = _sleep_rate_limited(last_ts)
        get_json._last_ts = last_ts

        try:
            r = session.get(url, params=params, timeout=TIMEOUT)
        except requests.RequestException as e:
            wait = min(60, (2 ** attempt) + random.random())
            time.sleep(wait)
            continue

        # Retry on transient errors / rate limiting
        if r.status_code in (429, 500, 502, 503, 504):
            retry_after = r.headers.get("Retry-After")
            if retry_after and retry_after.isdigit():
                time.sleep(int(retry_after))
            else:
                time.sleep(min(60, (2 ** attempt) + random.random()))
            continue

        # Hard failure
        if r.status_code >= 400:
            raise RuntimeError(f"HTTP {r.status_code} for {r.url}")

        # Bot-check HTML sometimes comes back 200
        ctype = (r.headers.get("Content-Type") or "").lower()
        if "json" not in ctype:
            text = r.text[:5000]
            if _looks_like_bot_check(text):
                raise RuntimeError(f"Bot-check page returned for {r.url}")
            # Sometimes servers omit JSON content-type; try parse anyway.
        try:
            return r.json()
        except Exception:
            text = r.text[:5000]
            raise RuntimeError(f"Non-JSON response for {r.url}: {text[:200]}")

    raise RuntimeError(f"Failed after {MAX_RETRIES} retries: {url}")

def ckan_action(action: str, **params: Any) -> Dict[str, Any]:
    """
    Call CKAN Action API:
      GET {CKAN_API}/{action}?...
    CKAN responses have {success: bool, result: ..., error: ...}.
    """
    url = f"{CKAN_API}/{action}"
    resp = get_json(url, params=params)
    if not resp.get("success", False):
        raise RuntimeError(f"CKAN action failed: {action} params={params} error={resp.get('error')}")
    return resp["result"]

_slug_re = re.compile(r"[^a-zA-Z0-9]+")
def slugify(s: str) -> str:
    s = (s or "").strip()
    s = _slug_re.sub("-", s).strip("-").lower()
    return s


### Cell 3 — Dataset iterator + download logic

In [None]:
def iter_datasets(q: str = "*:*", rows: int = ROWS) -> Iterable[Dict[str, Any]]:
    """
    Stream all datasets using package_search pagination.
    """
    start = 0
    yielded = 0
    while True:
        result = ckan_action(
            "package_search",
            q=q,
            rows=rows,
            start=start,
            sort="metadata_modified desc",
            facet="false",
        )
        count = result.get("count", 0)
        datasets = result.get("results", [])

        if not datasets:
            break

        for ds in datasets:
            yield ds
            yielded += 1
            if MAX_DATASETS is not None and yielded >= MAX_DATASETS:
                return

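        # Advance by the number of rows actually returned, in case the server
        # caps rows below the requested page size.
        start += len(datasets)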
        if start >= count:
            break

def dataset_filename(ds: Dict[str, Any]) -> Path:
    ds_id = ds.get("id", "unknown-dataset-id")
    ds_name = slugify(ds.get("name", "")) or "dataset"
    return DATASET_DIR / f"{ds_id}__{ds_name}.json"

def resource_filename(dataset_id: str, resource_id: str, resource_name: str = "") -> Path:
    base = f"{dataset_id}__{resource_id}"
    if ADD_SLUG_TO_FILENAME:
        s = slugify(resource_name)[:SLUG_MAXLEN]
        if s:
            base += f"__{s}"
    return RESOURCE_DIR / f"{base}.json"

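def write_json(path: Path, obj: Any) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file, then rename: an interrupted run must not leave a
    # truncated JSON file that a later run would skip as "already done".
    tmp = path.with_name(path.name + ".tmp")
    with tmp.open("w", encoding="utf-8") as f:
        json.dump(obj, f, ensure_ascii=False, indent=2)
    tmp.replace(path)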

def append_jsonl(path: Path, obj: Dict[str, Any]) -> None:
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(obj, ensure_ascii=False) + "\n")

def download_resource_metadata(dataset_id: str, resource: Dict[str, Any]) -> Tuple[Dict[str, Any], str]:
    """
    Prefer HDX's download_metadata export. If blocked, fall back to CKAN resource_show.
    Returns (metadata_json, source_label).
    """
    resource_id = resource["id"]
    export_url = f"{BASE}/dataset/{dataset_id}/resource/{resource_id}/download_metadata"
    try:
        meta = get_json(export_url, params={"format": "json"})
        return meta, "download_metadata"
    except Exception as e:
        # Fallback: CKAN resource_show is still structured metadata for the resource
        # (not necessarily identical to HDX export format).
        meta = ckan_action("resource_show", id=resource_id)
        return {
            "_fallback_reason": str(e),
            "_note": "Fallback used: CKAN resource_show (may differ from HDX download_metadata export).",
            "dataset_id": dataset_id,
            "resource": meta
        }, "ckan_resource_show_fallback"


### Cell 4 — Main runner (resume-friendly)

In [None]:
def run_full_crawl(q: str = "*:*") -> None:
    total_resources = 0

    for ds in iter_datasets(q=q):
        ds_id = ds.get("id")
        if not ds_id:
            continue

        # Save dataset snapshot (from package_search results)
        ds_path = dataset_filename(ds)
        if not ds_path.exists():
            write_json(ds_path, ds)

        resources = ds.get("resources") or []
        for res in resources:
            if "id" not in res:
                continue

            res_id = res["id"]
            name_for_slug = res.get("name") or res.get("description") or ""
            out_path = resource_filename(ds_id, res_id, name_for_slug)

            if out_path.exists():
                continue  # resume

            try:
                meta, source = download_resource_metadata(ds_id, res)
                write_json(out_path, meta)

                append_jsonl(MANIFEST_PATH, {
                    "dataset_id": ds_id,
                    "dataset_name": ds.get("name"),
                    "dataset_title": ds.get("title"),
                    "resource_id": res_id,
                    "resource_name": res.get("name"),
                    "resource_format": res.get("format"),
                    "metadata_source": source,
                    "metadata_file": str(out_path),
                    "metadata_url": f"{BASE}/dataset/{ds_id}/resource/{res_id}/download_metadata?format=json",
                })

            except Exception as e:
                append_jsonl(ERRORS_PATH, {
                    "dataset_id": ds_id,
                    "dataset_name": ds.get("name"),
                    "resource_id": res_id,
                    "error": str(e),
                })

            total_resources += 1
            if MAX_RESOURCES is not None and total_resources >= MAX_RESOURCES:
                return

print("Ready. Next: run_full_crawl()")


### Cell 5 — Run it

In [None]:
# Full crawl of all public datasets:
# run_full_crawl(q="*:*")

# Or target a subset first (recommended). Note that q is a free-text search
# query, so it can match more than one dataset:
run_full_crawl(q="cod-ab-global")


## Dataset-level metadata (overall dataset)

Downloads one JSON per **dataset** representing the dataset’s overall metadata (title, tags, license, maintainer info, and a list of associated resources).  
Endpoint pattern:
`/dataset/{dataset_id}/download_metadata?format=json`  
Output naming uses `{dataset_uuid}[__slug].json`. Use this mode when you want a single “master” metadata record per dataset, without splitting into per-resource JSON files.
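
Files written by this mode come in two shapes: the export itself, or, when the CKAN fallback was used, a wrapper object with `_fallback_reason`, `_note`, and the `package_show` result under `dataset`. A downstream reader may want to treat both alike. The sketch below is one way to do that; `unwrap_dataset_record` is a hypothetical helper, the sample records are made up, and the exact shape of the HDX export is not assumed beyond it being a JSON object.

```python
from typing import Any, Dict


def unwrap_dataset_record(obj: Dict[str, Any]) -> Dict[str, Any]:
    """Return the dataset metadata dict whichever source produced the file."""
    if "_fallback_reason" in obj and "dataset" in obj:
        return obj["dataset"]  # fallback wrapper written by the crawler
    return obj  # direct export, returned unchanged


# Made-up records for illustration only.
direct = {"id": "1111-aaaa", "title": "Example"}
fallback = {
    "_fallback_reason": "HTTP 403 for some URL",
    "_note": "Fallback used: CKAN package_show",
    "dataset": {"id": "1111-aaaa", "title": "Example"},
}

print(unwrap_dataset_record(direct)["title"])
print(unwrap_dataset_record(fallback)["title"])
```

Since the export and the `package_show` result may use different field names, any further normalization would need to be checked against real files.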


### Cell 1 — Config

In [1]:
from __future__ import annotations

import json
import random
import re
import time
from pathlib import Path
from typing import Any, Dict, Iterable, Optional, Tuple

import requests

BASE = "https://data.humdata.org"
CKAN_API = f"{BASE}/api/3/action"

OUT_DIR = Path("hdx_dataset_metadata_dump")
DATASET_META_DIR = OUT_DIR / "dataset_metadata"
OUT_DIR.mkdir(parents=True, exist_ok=True)
DATASET_META_DIR.mkdir(parents=True, exist_ok=True)

MANIFEST_PATH = OUT_DIR / "manifest_datasets.jsonl"
ERRORS_PATH = OUT_DIR / "errors_datasets.jsonl"

# CKAN pagination
ROWS = 500  # often safe; CKAN instances may cap max rows

# Politeness + robustness
REQUESTS_PER_SECOND = 2.0
MAX_RETRIES = 6
TIMEOUT = 60

# For testing
MAX_DATASETS: Optional[int] = None  # e.g. 50

# Filename readability
ADD_SLUG_TO_FILENAME = True
SLUG_MAXLEN = 80


### Cell 2 — HTTP + CKAN helpers

In [2]:
session = requests.Session()
session.headers.update({
    "User-Agent": "hdx-dataset-metadata-crawler/1.0",
    "Accept": "application/json,text/plain,*/*",
})

def _looks_like_bot_check(text: str) -> bool:
    t = text.lower()
    return ("verify that you're not a robot" in t) or ("javascript is disabled" in t)

def _sleep_rate_limited(last_ts: float) -> float:
    if REQUESTS_PER_SECOND <= 0:
        return time.time()
    min_dt = 1.0 / REQUESTS_PER_SECOND
    now = time.time()
    dt = now - last_ts
    if dt < min_dt:
        time.sleep(min_dt - dt)
    return time.time()

def get_json(url: str, params: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    last_ts = getattr(get_json, "_last_ts", 0.0)

    for attempt in range(MAX_RETRIES):
        last_ts = _sleep_rate_limited(last_ts)
        get_json._last_ts = last_ts

        try:
            r = session.get(url, params=params, timeout=TIMEOUT)
        except requests.RequestException:
            time.sleep(min(60, (2 ** attempt) + random.random()))
            continue

        if r.status_code in (429, 500, 502, 503, 504):
            retry_after = r.headers.get("Retry-After")
            if retry_after and retry_after.isdigit():
                time.sleep(int(retry_after))
            else:
                time.sleep(min(60, (2 ** attempt) + random.random()))
            continue

        if r.status_code >= 400:
            raise RuntimeError(f"HTTP {r.status_code} for {r.url}")

        ctype = (r.headers.get("Content-Type") or "").lower()
        if "json" not in ctype:
            text = r.text[:5000]
            if _looks_like_bot_check(text):
                raise RuntimeError(f"Bot-check page returned for {r.url}")
            # Sometimes JSON is returned with odd content-type; try parse anyway.

        try:
            return r.json()
        except Exception:
            text = r.text[:2000]
            raise RuntimeError(f"Non-JSON response for {r.url}: {text[:200]}")

    raise RuntimeError(f"Failed after {MAX_RETRIES} retries: {url}")

def ckan_action(action: str, **params: Any) -> Dict[str, Any]:
    url = f"{CKAN_API}/{action}"
    resp = get_json(url, params=params)
    if not resp.get("success", False):
        raise RuntimeError(f"CKAN action failed: {action} params={params} error={resp.get('error')}")
    return resp["result"]

_slug_re = re.compile(r"[^a-zA-Z0-9]+")
def slugify(s: str) -> str:
    s = (s or "").strip()
    s = _slug_re.sub("-", s).strip("-").lower()
    return s


### Cell 3 — Enumerate datasets + download dataset-level metadata

In [3]:
def iter_datasets(q: str = "*:*", rows: int = ROWS) -> Iterable[Dict[str, Any]]:
    start = 0
    yielded = 0

    while True:
        result = ckan_action(
            "package_search",
            q=q,
            rows=rows,
            start=start,
            sort="metadata_modified desc",
            facet="false",
        )
        count = result.get("count", 0)
        datasets = result.get("results", [])
        if not datasets:
            break

        for ds in datasets:
            yield ds
            yielded += 1
            if MAX_DATASETS is not None and yielded >= MAX_DATASETS:
                return

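        # Advance by the number of rows actually returned, in case the server
        # caps rows below the requested page size.
        start += len(datasets)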
        if start >= count:
            break

def dataset_meta_filename(dataset_id: str, dataset_name: str = "") -> Path:
    base = dataset_id
    if ADD_SLUG_TO_FILENAME:
        s = slugify(dataset_name)[:SLUG_MAXLEN]
        if s:
            base += f"__{s}"
    return DATASET_META_DIR / f"{base}.json"

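def write_json(path: Path, obj: Any) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file, then rename: an interrupted run must not leave a
    # truncated JSON file that a later run would skip as "already done".
    tmp = path.with_name(path.name + ".tmp")
    with tmp.open("w", encoding="utf-8") as f:
        json.dump(obj, f, ensure_ascii=False, indent=2)
    tmp.replace(path)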

def append_jsonl(path: Path, obj: Dict[str, Any]) -> None:
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(obj, ensure_ascii=False) + "\n")

def download_dataset_metadata(dataset_id: str) -> Tuple[Dict[str, Any], str]:
    """
    Prefer HDX dataset export: /dataset/{dataset_id}/download_metadata?format=json
    Fallback: CKAN package_show (reliable automation API)
    """
    export_url = f"{BASE}/dataset/{dataset_id}/download_metadata"
    try:
        meta = get_json(export_url, params={"format": "json"})
        return meta, "download_metadata"
    except Exception as e:
        # Fallback to CKAN dataset metadata
        pkg = ckan_action("package_show", id=dataset_id)
        return {
            "_fallback_reason": str(e),
            "_note": "Fallback used: CKAN package_show (may differ from HDX download_metadata export).",
            "dataset": pkg,
        }, "ckan_package_show_fallback"


### Cell 4 — Main runner (dataset-level only, resume-safe)

In [4]:
def run_dataset_metadata_crawl(q: str = "*:*") -> None:
    for ds in iter_datasets(q=q):
        ds_id = ds.get("id")
        ds_name = ds.get("name") or ""
        ds_title = ds.get("title") or ""

        if not ds_id:
            continue

        out_path = dataset_meta_filename(ds_id, ds_name)
        if out_path.exists():
            continue  # resume-safe

        try:
            meta, source = download_dataset_metadata(ds_id)
            write_json(out_path, meta)

            append_jsonl(MANIFEST_PATH, {
                "dataset_id": ds_id,
                "dataset_name": ds_name,
                "dataset_title": ds_title,
                "metadata_source": source,
                "metadata_file": str(out_path),
                "metadata_url": f"{BASE}/dataset/{ds_id}/download_metadata?format=json",
            })
        except Exception as e:
            append_jsonl(ERRORS_PATH, {
                "dataset_id": ds_id,
                "dataset_name": ds_name,
                "dataset_title": ds_title,
                "error": str(e),
            })

print("Ready: run_dataset_metadata_crawl()")


Ready: run_dataset_metadata_crawl()


### Cell 5 — Run

In [5]:
# Full crawl:
run_dataset_metadata_crawl(q="*:*")

# Or test with a narrower free-text query (it may match more than one dataset):
# run_dataset_metadata_crawl(q="cod-ab-global")


## End of Code