# Step 2 - OSM Policy Exclusion Index (HDX → RDLS Pipeline)

## Purpose
This notebook scans **HDX dataset-level metadata JSON files** (downloaded in Step 1) and flags datasets that are
**derived from OpenStreetMap (OSM)** (including HOT exports and non-HOT OSM-derived datasets like Healthsites).

Your team policy is to **exclude OSM-derived datasets** from RDLS conversion for now (to avoid producing a large number
of near-duplicate metadata records whose upstream source is OSM). This notebook:

- Detects OSM-derived datasets using multiple strong signals (not just tags)
- Produces a reproducible **exclusion list** (dataset UUIDs)
- Produces an auditable **exclusion report** (CSV) with detection reasons
- Optionally prepares a **pilot candidate list** for controlled OSM experiments later

## Inputs
- Folder of dataset-level metadata JSON files, e.g.:
  - `data/raw/hdx_dataset_metadata/*.json`

The notebook is compatible with:
- HDX `download_metadata?format=json` output (fields like `dataset_source`, `license_title`, `resources[*].download_url`)
- CKAN `package_show` output (fallback shape), if your Step 1 stored that format (tags/resources as dicts)

## Outputs
- `data/policy/osm_excluded_dataset_ids.txt`: one dataset UUID per line
- `data/policy/osm_exclusion_report.csv`: dataset_id, title, org, key signals, and detection reasons (files that fail to parse appear as `ERROR:` rows)
- `data/policy/osm_candidates_for_pilot.csv` (optional): a shortlist for future OSM pilot work, capped per (organization, theme) bucket
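
A later pipeline step could consume the exclusion list roughly as follows. This is a minimal sketch, not part of this notebook: `parse_excluded_ids`, `load_excluded_ids`, and `filter_dataset_ids` are hypothetical helper names, and the only assumption about the file is the one stated above (one UUID per line).

```python
from pathlib import Path
from typing import Iterable, List, Set


def parse_excluded_ids(text: str) -> Set[str]:
    """Parse 'one UUID per line' text into a set, skipping blank lines."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def load_excluded_ids(path: Path) -> Set[str]:
    """Read the exclusion list; a missing file raises instead of silently excluding nothing."""
    return parse_excluded_ids(path.read_text(encoding="utf-8"))


def filter_dataset_ids(dataset_ids: Iterable[str], excluded: Set[str]) -> List[str]:
    """Keep only the dataset IDs that are not on the exclusion list (input order preserved)."""
    return [d for d in dataset_ids if d not in excluded]
```

Letting a missing file raise is a deliberate choice in this sketch: treating it as an empty list would quietly disable the policy.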

## Notes on detection approach
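OSM/HOT datasets are detected using a *rule-based* approach with traceable reasons. Any one of these strong signals is enough to flag a dataset:
- `dataset_source` mentions OpenStreetMap (for example “OpenStreetMap contributors”)
- license indicates ODbL together with an OpenStreetMap mention in the title, notes, or `dataset_source`
- resource URLs include `hotosm.org` / `export.hotosm.org` / `openstreetmap.org`
- tags include `openstreetmap`

Organization, title, and notes mentions of HOT/OSM count only as supporting evidence: a dataset with no strong signal is flagged when at least two supporting signals fire.

The rules are intended to favor precision for policy exclusion while keeping the logic transparent and reviewable.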
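
The decision logic can be sketched in a few lines. This is a simplified illustration rather than the detection cell itself: `classify`, the flat record layout (`resource_urls`, `organization`, and `tags` as plain values), and the shortened rule IDs are assumptions made for the example. It also encodes the fallback used in the detection cell, where two or more supporting hits flag a dataset that has no strong hit.

```python
from typing import Any, Dict, List, Tuple

# Rule IDs treated as sufficient on their own (shortened, illustrative names).
STRONG_RULES = {"dataset_source", "odbl_plus_osm_cue", "resource_url", "tag"}


def classify(record: Dict[str, Any]) -> Tuple[bool, List[str]]:
    """Return (is_osm, reasons) for a simplified, flat metadata record."""
    source = str(record.get("dataset_source") or "").lower()
    license_title = str(record.get("license_title") or "").lower()
    title = str(record.get("title") or "").lower()
    org = str(record.get("organization") or "").lower()
    urls = [str(u).lower() for u in record.get("resource_urls") or []]
    tags = [str(t).lower() for t in record.get("tags") or []]

    reasons: List[str] = []
    if "openstreetmap" in source:
        reasons.append("dataset_source")
    if "odbl" in license_title and ("openstreetmap" in title or "openstreetmap" in source):
        reasons.append("odbl_plus_osm_cue")
    if any("hotosm.org" in u or "openstreetmap.org" in u for u in urls):
        reasons.append("resource_url")
    if "openstreetmap" in tags:
        reasons.append("tag")
    # Supporting evidence only: never sufficient alone.
    if "hotosm" in org or "openstreetmap" in org:
        reasons.append("organization")
    if "openstreetmap" in title:
        reasons.append("title")

    strong_hit = any(r in STRONG_RULES for r in reasons)
    supporting = [r for r in reasons if r not in STRONG_RULES]
    return (strong_hit or len(supporting) >= 2), reasons
```

Returning the reasons alongside the boolean is what makes each exclusion auditable: a reviewer can see which rule fired without re-running the scan.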


In [1]:
"""
OSM Policy Exclusion Index (HDX dataset-level metadata)

This module (used within the notebook) scans HDX dataset-level metadata JSON files and produces:
1) A text file of dataset IDs to exclude (OSM-derived datasets)
2) A CSV report explaining why each dataset was flagged

Design principles:
- Deterministic, transparent rules (no ML black box)
- Strong-signal detection (dataset_source/license/resource URLs)
- Auditability: each flagged dataset includes a list of reasons
- Minimal dependencies: standard library only

Author: <YOUR NAME/ORG>
License: <YOUR LICENSE>
"""

from __future__ import annotations

import csv
import json
import re
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict, Iterable, List, Optional, Sequence, Tuple


In [2]:
# =========================
# Configuration (EDIT HERE)
# =========================

# Folder containing HDX dataset-level metadata JSON files (from Step 1).
# Example:
#   INPUT_DIR = Path("data/raw/hdx_dataset_metadata")
INPUT_DIR = Path("../hdx_dataset_metadata_dump/dataset_metadata")

# Output folder for policy artifacts.
OUTPUT_DIR = Path("../hdx_dataset_metadata_dump/policy")

# Output files.
OUT_IDS_TXT = OUTPUT_DIR / "osm_excluded_dataset_ids.txt"
OUT_REPORT_CSV = OUTPUT_DIR / "osm_exclusion_report.csv"
OUT_PILOT_CSV = OUTPUT_DIR / "osm_candidates_for_pilot.csv"  # optional

# If True, prefilter files by scanning text for OSM markers before parsing JSON.
# This is faster when you have many files.
USE_FAST_PREFILTER = True

# When generating the pilot file, limit candidates per (org/theme) bucket to keep it small.
PILOT_MAX_PER_BUCKET = 10

# Create outputs folder
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print("INPUT_DIR:", INPUT_DIR.resolve())
print("OUTPUT_DIR:", OUTPUT_DIR.resolve())


INPUT_DIR: C:\Users\benny\OneDrive\Documents\Github\hdx-metadata-crawler\hdx_dataset_metadata_dump\dataset_metadata
OUTPUT_DIR: C:\Users\benny\OneDrive\Documents\Github\hdx-metadata-crawler\hdx_dataset_metadata_dump\policy


In [3]:
# =========================
# Helpers: robust JSON access
# =========================

def read_text(path: Path) -> str:
    """Read text safely (UTF-8), ignoring decoding errors."""
    return path.read_text(encoding="utf-8", errors="ignore")

def load_json(path: Path) -> Dict[str, Any]:
    """Load a JSON file into a dict."""
    return json.loads(read_text(path))

def normalize_dataset_record(raw: Dict[str, Any]) -> Dict[str, Any]:
    """
    Normalize dataset record shape.

    Step 1 may store:
    - HDX dataset export JSON directly (has 'id' at top-level)
    - CKAN package_show fallback wrapped as {'dataset': {...}} (id inside 'dataset')

    This function returns the dataset dict that contains canonical fields like 'id', 'title', etc.
    """
    if isinstance(raw, dict) and "id" in raw:
        return raw
    if isinstance(raw, dict) and "dataset" in raw and isinstance(raw["dataset"], dict):
        return raw["dataset"]
    return raw

def norm_str(x: Any) -> str:
    """Normalize any value to a lowercase stripped string (falsy values become "")."""
    return str(x or "").strip().lower()

def get_org_title(ds: Dict[str, Any]) -> str:
    """Get organization title/name across HDX export and CKAN shapes."""
    org = ds.get("organization")
    if isinstance(org, dict):
        org = org.get("title") or org.get("name")
    # str() guards against non-string values, which would otherwise break .strip().
    return str(org or "").strip()

def get_tags(ds: Dict[str, Any]) -> List[str]:
    """Return tags as a list of lowercase strings."""
    tags = ds.get("tags") or []
    out: List[str] = []
    if isinstance(tags, list):
        for t in tags:
            if isinstance(t, dict):
                name = t.get("name") or ""
                if name:
                    out.append(name.strip().lower())
            elif isinstance(t, str):
                out.append(t.strip().lower())
    return out

def get_resources(ds: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return resources as a list of dicts, skipping any malformed (non-dict) entries."""
    res = ds.get("resources") or []
    if not isinstance(res, list):
        return []
    return [r for r in res if isinstance(r, dict)]

def get_license_title(ds: Dict[str, Any]) -> str:
    """Return a normalized license string."""
    # HDX export uses license_title, CKAN may use license_title and/or license_id.
    lt = ds.get("license_title") or ds.get("license_id") or ""
    return (lt or "").strip()
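
The two input shapes can be illustrated with small synthetic records. The snippet below is a self-contained sketch: it re-implements the unwrapping and tag handling in miniature (`unwrap` and `tag_names` are hypothetical names), and the sample records are invented for illustration rather than taken from HDX.

```python
from typing import Any, Dict, List


def unwrap(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Return the dataset dict, whether it is top-level or nested under 'dataset'."""
    if "id" in raw:
        return raw
    inner = raw.get("dataset")
    return inner if isinstance(inner, dict) else raw


def tag_names(ds: Dict[str, Any]) -> List[str]:
    """Accept tags as plain strings or as {'name': ...} dicts; drop empty names."""
    tags = ds.get("tags")
    if not isinstance(tags, list):
        return []
    names: List[str] = []
    for t in tags:
        name = t.get("name") if isinstance(t, dict) else t
        if isinstance(name, str) and name.strip():
            names.append(name.strip().lower())
    return names


# Invented records: one flat, one wrapped under 'dataset' with dict-style tags.
flat_record = {"id": "abc", "tags": ["OpenStreetMap", "roads"]}
wrapped_record = {"dataset": {"id": "def", "tags": [{"name": "Health"}, {"name": ""}]}}

for raw in (flat_record, wrapped_record):
    ds = unwrap(raw)
    print(ds["id"], tag_names(ds))
```

Normalizing once at the boundary keeps the detection rules free of shape checks.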


In [4]:
# ===================================
# OSM detection rules (policy signals)
# ===================================

@dataclass(frozen=True)
class OSMDetectionResult:
    is_osm: bool
    reasons: Tuple[str, ...]
    signals: Dict[str, Any]

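# Substrings checked by the fast prefilter. Together they must cover everything
# detect_osm() can match; otherwise the prefilter would silently drop true positives.
# A bare '"dataset_source":' marker is deliberately absent: it matches every record
# that has the key, whatever its value, which would defeat the prefilter.
FAST_MARKERS = (
    "openstreetmap",            # also covers "openstreetmap contributors" and "openstreetmap.org"
    "hotosm",                   # also covers export.hotosm.org and exports-stage.hotosm.org
    "production-raw-data-api",  # URL marker listed in OSM_URL_MARKERS
)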

# URL/domain markers for OSM/HOT.
OSM_URL_MARKERS = (
    "openstreetmap.org",
    "hotosm.org",
    "export.hotosm.org",
    "exports-stage.hotosm.org",
    "production-raw-data-api",
)

# Organization/title markers (used as supporting evidence).
OSM_ORG_MARKERS = (
    "humanitarian openstreetmap",
    "hotosm",
    "openstreetmap",
)

OSM_TITLE_MARKERS = (
    "openstreetmap export",
    "(openstreetmap export)",
    "openstreetmap",
)

# Notes markers (supporting evidence).
OSM_NOTES_MARKERS = (
    "openstreetmap",
    "wiki.openstreetmap.org",
    "osm",
)

def prefilter_maybe_osm(text: str) -> bool:
    """Quick text scan: returns True if file likely contains OSM indicators."""
    t = text.lower()
    return any(m in t for m in FAST_MARKERS)

def detect_osm(ds: Dict[str, Any]) -> OSMDetectionResult:
    """
    Detect whether a dataset record is derived from OpenStreetMap.

    The rules are designed for *policy exclusion* (high precision), using:
    1) dataset_source (strongest)
    2) license evidence (ODbL) + OSM textual cues
    3) resource URL evidence
    4) organization/title/notes evidence

    Returns an OSMDetectionResult with:
    - is_osm: bool
    - reasons: tuple of fired rule IDs (human-readable)
    - signals: selected fields for auditing
    """
    title = ds.get("title") or ""
    name = ds.get("name") or ""
    notes = ds.get("notes") or ""
    dataset_source = ds.get("dataset_source") or ""
    org_title = get_org_title(ds)
    license_title = get_license_title(ds)
    tags = get_tags(ds)
    resources = get_resources(ds)

    title_l = norm_str(title)
    notes_l = norm_str(notes)
    dataset_source_l = norm_str(dataset_source)
    org_l = norm_str(org_title)
    license_l = norm_str(license_title)

    reasons: List[str] = []

    # Rule 1: dataset_source mentions OpenStreetMap (e.g. "OpenStreetMap contributors").
    if "openstreetmap" in dataset_source_l:
        reasons.append("dataset_source_mentions_openstreetmap")

    # Rule 2: ODbL license with OSM cues (covers cases where dataset_source is absent).
    # ODbL alone is not sufficient, because non-OSM datasets can also be published
    # under ODbL; combined with a textual cue it is a strong indicator of OSM-derived data.
    has_odbl = "odbl" in license_l or "open database license" in license_l
    if has_odbl and any("openstreetmap" in s for s in (title_l, notes_l, dataset_source_l)):
        reasons.append("odbl_license_plus_osm_cue")

    # Rule 3: Resource URLs point to HOT/OSM infrastructure.
    for r in resources:
        url = norm_str(r.get("download_url") or r.get("url") or "")
        if url and any(m in url for m in OSM_URL_MARKERS):
            reasons.append("resource_url_osm_domain")
            break

    # Rule 4: Organization suggests OSM/HOT.
    if any(m in org_l for m in OSM_ORG_MARKERS):
        reasons.append("organization_mentions_osm_or_hot")

    # Rule 5: Title suggests OSM export.
    if any(m in title_l for m in OSM_TITLE_MARKERS):
        reasons.append("title_mentions_osm_export")

    # Rule 6: Tags include openstreetmap (not always present, but strong when present).
    if "openstreetmap" in tags:
        reasons.append("tag_openstreetmap_present")

    # Rule 7: Notes mention OpenStreetMap (weak; keep only as supporting evidence).
    # The explicit "openstreetmap" check means the short "osm" marker never fires on its own.
    if any(m in notes_l for m in OSM_NOTES_MARKERS) and "openstreetmap" in notes_l:
        reasons.append("notes_mentions_openstreetmap")

    # Policy decision: mark OSM if any of the strong rules fire.
    # Strong rules: dataset_source, resource URL domain, ODbL+cue, openstreetmap tag.
    strong = {
        "dataset_source_mentions_openstreetmap",
        "resource_url_osm_domain",
        "odbl_license_plus_osm_cue",
        "tag_openstreetmap_present",
    }
    is_osm = any(r in strong for r in reasons)

    # Allow HOT/org/title evidence to upgrade borderline cases:
    # If we have 2+ supporting rules, treat as OSM (helps when dataset_source missing).
    supporting = set(reasons) - strong
    if not is_osm and len(supporting) >= 2:
        is_osm = True
        reasons.append("supporting_evidence_threshold_met")

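    # Sample URL uses the same download_url -> url fallback as Rule 3.
    first_url = (resources[0].get("download_url") or resources[0].get("url")) if resources else None
    signals = {
        "dataset_source": dataset_source,
        "license_title": license_title,
        "organization": org_title,
        "tags": tags,
        "resource_url_sample": first_url,
    }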

    return OSMDetectionResult(is_osm=is_osm, reasons=tuple(sorted(set(reasons))), signals=signals)
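
Plain substring checks can over-match on short markers: `osm`, for instance, also occurs inside words such as "cosmos" and inside `hotosm`. If a short marker is ever used as evidence on its own, a whole-word match is one option. The helper below is a sketch under that assumption; `contains_token` is a hypothetical name and is not used by the rules above.

```python
import re


def contains_token(text: str, token: str) -> bool:
    """True if `token` occurs as a whole word in `text` (case-insensitive)."""
    # Lookarounds reject matches that are glued to other word characters.
    pattern = r"(?<!\w)" + re.escape(token) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


print(contains_token("Data extracted from OSM, 2023", "osm"))
print(contains_token("Cosmos household survey", "osm"))
```

The trade-off is that whole-word matching no longer catches the marker inside URLs or compound names, so it suits free-text fields such as notes better than resource URLs.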


In [5]:
# ==========================================
# Scan dataset JSON folder and write outputs
# ==========================================

def iter_json_files(folder: Path) -> Iterable[Path]:
    """Yield JSON files in a folder (non-recursive)."""
    if not folder.exists():
        raise FileNotFoundError(f"Input folder not found: {folder}")
    yield from sorted(folder.glob("*.json"))

def scan_folder_for_osm(input_dir: Path) -> Tuple[List[Dict[str, Any]], List[str]]:
    """
    Scan a folder of dataset-level metadata JSON files.

    Returns:
    - report_rows: list of dict rows for CSV
    - excluded_ids: list of dataset UUID strings
    """
    report_rows: List[Dict[str, Any]] = []
    excluded_ids: List[str] = []

    files = list(iter_json_files(input_dir))
    total = len(files)
    print(f"Scanning {total:,} JSON files in: {input_dir}")

    for i, path in enumerate(files, start=1):
        # Minimal progress indicator (avoids external deps).
        if i % 500 == 0 or i == total:
            print(f"  processed {i:,}/{total:,}")

        try:
            # Read once; the same text feeds both the prefilter and the JSON parser.
            txt = read_text(path)
            if USE_FAST_PREFILTER and not prefilter_maybe_osm(txt):
                continue

            raw = json.loads(txt)
            ds = normalize_dataset_record(raw)

            ds_id = ds.get("id") or ""
            title = ds.get("title") or ""
            name = ds.get("name") or ""
            org = get_org_title(ds)

            result = detect_osm(ds)

            if result.is_osm:
                excluded_ids.append(ds_id)

                report_rows.append({
                    "dataset_id": ds_id,
                    "name": name,
                    "title": title,
                    "organization": org,
                    "dataset_source": ds.get("dataset_source"),
                    "license_title": get_license_title(ds),
                    "reasons": ";".join(result.reasons),
                    "tags": ";".join(get_tags(ds)),
                    "n_resources": len(get_resources(ds)),
                    "file": str(path),
                })

        except Exception as e:
            # Keep scanning; the failure is recorded as an ERROR row in the report for auditing.
            report_rows.append({
                "dataset_id": "",
                "name": "",
                "title": "",
                "organization": "",
                "dataset_source": "",
                "license_title": "",
                "reasons": f"ERROR:{type(e).__name__}:{e}",
                "tags": "",
                "n_resources": "",
                "file": str(path),
            })

    # De-duplicate IDs and sort them so the output file is deterministic.
    excluded_ids = sorted({x for x in excluded_ids if x})

    return report_rows, excluded_ids

report_rows, excluded_ids = scan_folder_for_osm(INPUT_DIR)

print(f"Flagged OSM-derived datasets: {len(excluded_ids):,}")


Scanning 26,246 JSON files in: ..\hdx_dataset_metadata_dump\dataset_metadata
  processed 500/26,246
  processed 1,000/26,246
  processed 1,500/26,246
  processed 2,000/26,246
  processed 2,500/26,246
  processed 3,000/26,246
  processed 3,500/26,246
  processed 4,000/26,246
  processed 4,500/26,246
  processed 5,000/26,246
  processed 5,500/26,246
  processed 6,000/26,246
  processed 6,500/26,246
  processed 7,000/26,246
  processed 7,500/26,246
  processed 8,000/26,246
  processed 8,500/26,246
  processed 9,000/26,246
  processed 9,500/26,246
  processed 10,000/26,246
  processed 10,500/26,246
  processed 11,000/26,246
  processed 11,500/26,246
  processed 12,000/26,246
  processed 12,500/26,246
  processed 13,000/26,246
  processed 13,500/26,246
  processed 14,000/26,246
  processed 14,500/26,246
  processed 15,000/26,246
  processed 15,500/26,246
  processed 16,000/26,246
  processed 16,500/26,246
  processed 17,000/26,246
  processed 17,500/26,246
  processed 18,000/26,246
  proces

In [6]:
# ==========================
# Write outputs (CSV + TXT)
# ==========================

def write_ids_txt(path: Path, ids: Sequence[str]) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for x in ids:
            f.write(f"{x}\n")

def write_report_csv(path: Path, rows: Sequence[Dict[str, Any]]) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    # Fixed header for consistent downstream usage
    header = [
        "dataset_id", "name", "title", "organization",
        "dataset_source", "license_title", "reasons",
        "tags", "n_resources", "file"
    ]
    with path.open("w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=header)
        w.writeheader()
        for r in rows:
            w.writerow({k: r.get(k, "") for k in header})

write_ids_txt(OUT_IDS_TXT, excluded_ids)
write_report_csv(OUT_REPORT_CSV, report_rows)

print("Wrote:", OUT_IDS_TXT)
print("Wrote:", OUT_REPORT_CSV)


Wrote: ..\hdx_dataset_metadata_dump\policy\osm_excluded_dataset_ids.txt
Wrote: ..\hdx_dataset_metadata_dump\policy\osm_exclusion_report.csv
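
Once the report exists, a quick way to audit it is to count how often each rule fired. The sketch below assumes only the `reasons` column format written above (rule IDs joined by `;`, error rows prefixed with `ERROR:`); `count_reasons` is a hypothetical helper, and the three-row CSV is synthetic data standing in for the real report.

```python
import csv
import io
from collections import Counter
from typing import Iterable, Mapping


def count_reasons(rows: Iterable[Mapping[str, str]]) -> Counter:
    """Count rule IDs in the ';'-separated 'reasons' column; tally error rows separately."""
    counts: Counter = Counter()
    for row in rows:
        reasons = row.get("reasons") or ""
        # Error messages may themselves contain ';', so handle them before splitting.
        if reasons.startswith("ERROR:"):
            counts["<error rows>"] += 1
            continue
        counts.update(r for r in reasons.split(";") if r)
    return counts


# Synthetic report; replace io.StringIO(...) with an open file handle for real data.
sample_csv = (
    "dataset_id,reasons\n"
    "id-1,dataset_source_mentions_openstreetmap;resource_url_osm_domain\n"
    "id-2,resource_url_osm_domain\n"
    ",ERROR:JSONDecodeError:bad file\n"
)
counts = count_reasons(csv.DictReader(io.StringIO(sample_csv)))
for reason, n in counts.most_common():
    print(f"{n:>3}  {reason}")
```

A rule that dominates the counts, or one that never fires, is a useful prompt for a manual spot-check.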


In [7]:
# ==================================
# Optional: create a pilot shortlist
# ==================================
# The pilot shortlist is useful if you later want to test a controlled strategy for OSM within RDLS,
# without generating RDLS metadata for every OSM-derived dataset on HDX.
#
# Strategy:
# - Group by organization (and optionally by tag themes)
# - Keep only a small number of examples per group

def derive_theme(tags: List[str]) -> str:
    """Simple theme heuristic for pilot grouping."""
    if any(t in tags for t in ("roads", "railways", "transportation", "aviation")):
        return "transport"
    if any(t in tags for t in ("health facilities", "health")):
        return "health_facilities"
    if any(t in tags for t in ("waterways", "rivers", "hydrology")):
        return "hydrology"
    if any(t in tags for t in ("administrative boundaries-divisions", "gazetteer")):
        return "boundaries_gazetteer"
    return "other"

def make_pilot_shortlist(rows: List[Dict[str, Any]], max_per_bucket: int = 10) -> List[Dict[str, Any]]:
    # Keep only datasets with valid IDs (exclude error rows). Rows are copied so that
    # adding the theme column does not mutate the caller's report rows.
    clean = [dict(r) for r in rows if r.get("dataset_id")]
    # Add theme column
    for r in clean:
        tags = [t.strip().lower() for t in (r.get("tags") or "").split(";") if t.strip()]
        r["theme"] = derive_theme(tags)

    # Bucket by (organization, theme)
    buckets: Dict[Tuple[str, str], List[Dict[str, Any]]] = {}
    for r in clean:
        key = ((r.get("organization") or "unknown").strip(), (r.get("theme") or "other").strip())
        buckets.setdefault(key, []).append(r)

    pilot: List[Dict[str, Any]] = []
    # Largest buckets first; ties are broken by the (organization, theme) key.
    for _key, items in sorted(buckets.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        # Deterministic order within a bucket: sort by title.
        items_sorted = sorted(items, key=lambda r: (r.get("title") or ""))
        pilot.extend(items_sorted[:max_per_bucket])

    # minimal header for pilot
    out = []
    for r in pilot:
        out.append({
            "dataset_id": r["dataset_id"],
            "title": r["title"],
            "organization": r["organization"],
            "theme": r["theme"],
            "reasons": r["reasons"],
        })
    return out

pilot_rows = make_pilot_shortlist(report_rows, max_per_bucket=PILOT_MAX_PER_BUCKET)

# Write pilot CSV
if pilot_rows:
    with OUT_PILOT_CSV.open("w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=["dataset_id", "title", "organization", "theme", "reasons"])
        w.writeheader()
        w.writerows(pilot_rows)
    print("Wrote:", OUT_PILOT_CSV, f"(rows={len(pilot_rows)})")
else:
    print("No pilot rows produced (no OSM datasets detected or all rows were errors).")


Wrote: ..\hdx_dataset_metadata_dump\policy\osm_candidates_for_pilot.csv (rows=138)


In [8]:
# ==========================
# Quick sanity-check outputs
# ==========================

print("Example excluded IDs (first 10):")
print("\n".join(excluded_ids[:10]))

print("\nIf you want to spot-check a few exclusions, open:")
print(" -", OUT_REPORT_CSV)


Example excluded IDs (first 10):
003b676c-3e72-4ab2-a33b-2d3b9f9d7857
004790b4-4ddd-4d7e-9288-73ae46e643a3
0048d3d1-50eb-4428-a506-23882f4ce7ac
00503aef-1f17-44fd-8664-679c5ad6e05c
00a377de-920d-43c4-8fa1-590e2b369dcf
00ad2859-55c5-4e91-bbe6-d79d3fde2dc2
00b5e80c-1e32-4603-ae37-04b54655389c
00d7616a-c748-41c1-ae80-e65669293924
00e423c7-70dd-4a0c-ace7-8996e782a99e
00e5fec7-756d-4b1d-8816-21fb89320e48

If you want to spot-check a few exclusions, open:
 - ..\hdx_dataset_metadata_dump\policy\osm_exclusion_report.csv
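
Beyond eyeballing the first few IDs, the two main outputs can be cross-checked against each other: every ID in the TXT list should appear in the CSV report, and every non-error report row should appear in the TXT list. The helper below is a sketch operating on in-memory data; `compare_outputs` is a hypothetical name, and loading the two files is left to the caller.

```python
from typing import Iterable, Mapping, Set, Tuple


def compare_outputs(
    txt_ids: Iterable[str],
    report_rows: Iterable[Mapping[str, str]],
) -> Tuple[Set[str], Set[str]]:
    """Return (IDs only in the TXT list, IDs only in the CSV report)."""
    txt = {i.strip() for i in txt_ids if i.strip()}
    # Error rows carry an empty dataset_id and are ignored here.
    csv_ids = {(r.get("dataset_id") or "").strip() for r in report_rows}
    csv_ids.discard("")
    return txt - csv_ids, csv_ids - txt


only_txt, only_csv = compare_outputs(
    ["a", "b\n", ""],
    [{"dataset_id": "a"}, {"dataset_id": ""}, {"dataset_id": "c"}],
)
print("only in TXT:", sorted(only_txt))
print("only in CSV:", sorted(only_csv))
```

Both sets being empty indicates the exclusion list and the report describe the same datasets.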
