# Step 3 - Define RDLS Mapping Model (Inclusive)

This notebook creates the **mapping configuration** that Step 4 (classifier) will use to label HDX datasets as:
- `hazard`
- `exposure`
- `vulnerability_proxy`
- `loss_impact`

It is **inclusive** by design (so indicator datasets can be treated as vulnerability proxies), and it respects your team’s policy by **excluding OSM-derived datasets** using the Step 2 exclusion list.

---

## Inputs
- Dataset metadata folder from Step 1  
  `hdx_dataset_metadata_dump/dataset_metadata/*.json`
- OSM exclusion list from Step 2  
  `hdx_dataset_metadata_dump/policy/osm_excluded_dataset_ids.txt`

## Outputs
This notebook writes configuration and documentation files used downstream (all paths are relative to `hdx_dataset_metadata_dump/`):
- `config/tag_to_rdls_component.yaml` — weighted mapping from HDX tags → RDLS components
- `config/keyword_to_rdls_component.yaml` — keyword patterns over title/notes → RDLS components
- `config/org_hints.yaml` — organization/source hints → RDLS components (especially for indicator packs)
- `docs/mapping_rules.md` — human-readable explanation of the mapping logic
- `docs/samples_for_mapping.csv` — a review sample of datasets to calibrate mapping

## How you iterate
1. Run this notebook once to generate draft mapping files and tag statistics.
2. Open the YAML files and refine weights / add missing tags and keywords.
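3. Re-run this notebook to refresh the printed statistics and the review sample, or move directly to Step 4. Existing YAML files and `docs/mapping_rules.md` are left untouched unless `OVERWRITE_CONFIG = True`, so manual edits are not lost.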

> Tip: Start with conservative mapping rules, then expand them. The QA loop (Step 5) is where you lock down edge cases.


In [1]:
"""
03_define_rdls_mapping.ipynb

Build a transparent, reviewable mapping model from HDX metadata to RDLS components.

This notebook is designed to create *configuration artifacts* (YAML + docs) for use in later steps.
It does not attempt to fully classify all datasets—that happens in Step 4.

Design principles:
- Inclusive RDLS labeling (supports vulnerability proxy indicators)
- Policy-aware: excludes OSM-derived datasets via Step 2 list
- Deterministic outputs: stable ordering and fixed random seed for samples
- Audit-friendly: produces statistics and a review sample for calibration
- Minimal dependencies: standard library only (writes YAML via a tiny safe serializer)

Author: <YOUR NAME/ORG>
License: <YOUR LICENSE>
"""

from __future__ import annotations

import csv
import json
import random
import re
from collections import Counter, defaultdict
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict, Iterable, List, Optional, Sequence, Tuple


In [2]:
# =========================
# Configuration (EDIT HERE)
# =========================

# Root directory from Step 1 output (relative to notebooks/)
DUMP_DIR = Path("../hdx_dataset_metadata_dump")
DATASET_DIR = DUMP_DIR / "dataset_metadata"

# Step 2 output (OSM exclusion list) — inside the dump folder
POLICY_DIR = DUMP_DIR / "policy"
OSM_EXCLUDED_IDS_TXT = POLICY_DIR / "osm_excluded_dataset_ids.txt"

# Where to write mapping config files
CONFIG_DIR = DUMP_DIR / "config"
DOCS_DIR = DUMP_DIR / "docs"


# Sampling
RANDOM_SEED = 42
SAMPLE_SIZE = 400   # review sample size (200–500 is typical)
MAX_NOTES_CHARS = 350  # keep review CSV readable

# Overwrite existing YAML/docs?
OVERWRITE_CONFIG = False

CONFIG_DIR.mkdir(parents=True, exist_ok=True)
DOCS_DIR.mkdir(parents=True, exist_ok=True)

print("DATASET_DIR:", DATASET_DIR.resolve())
print("OSM_EXCLUDED_IDS_TXT:", OSM_EXCLUDED_IDS_TXT.resolve())
print("CONFIG_DIR:", CONFIG_DIR.resolve())
print("DOCS_DIR:", DOCS_DIR.resolve())


DATASET_DIR: C:\Users\benny\OneDrive\Documents\Github\hdx-metadata-crawler\hdx_dataset_metadata_dump\dataset_metadata
OSM_EXCLUDED_IDS_TXT: C:\Users\benny\OneDrive\Documents\Github\hdx-metadata-crawler\hdx_dataset_metadata_dump\policy\osm_excluded_dataset_ids.txt
CONFIG_DIR: C:\Users\benny\OneDrive\Documents\Github\hdx-metadata-crawler\hdx_dataset_metadata_dump\config
DOCS_DIR: C:\Users\benny\OneDrive\Documents\Github\hdx-metadata-crawler\hdx_dataset_metadata_dump\docs


In [3]:
# =========================
# Helpers: JSON + extraction
# =========================

def iter_json_files(folder: Path) -> Iterable[Path]:
    """Yield JSON files in a folder (non-recursive), sorted for determinism."""
    if not folder.exists():
        raise FileNotFoundError(f"Dataset folder not found: {folder}")
    yield from sorted(folder.glob("*.json"))

def read_json(path: Path) -> Dict[str, Any]:
    """Read JSON with UTF-8; ignore decoding errors."""
    return json.loads(path.read_text(encoding="utf-8", errors="ignore"))

def normalize_dataset_record(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Handle possible wrapper {'dataset': {...}}."""
    if isinstance(raw, dict) and "id" in raw:
        return raw
    if isinstance(raw, dict) and "dataset" in raw and isinstance(raw["dataset"], dict):
        return raw["dataset"]
    return raw

def get_org_title(ds: Dict[str, Any]) -> str:
    org = ds.get("organization")
    if isinstance(org, dict):
        return (org.get("title") or org.get("name") or "").strip()
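    # Some records carry the organization as a plain string; anything else yields "".
    return org.strip() if isinstance(org, str) else ""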

def get_tags(ds: Dict[str, Any]) -> List[str]:
    tags = ds.get("tags") or []
    out: List[str] = []
    if isinstance(tags, list):
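        for t in tags:
            # HDX tags are usually dicts ({"name": ...}); tolerate plain strings too.
            name = t.get("name") if isinstance(t, dict) else t
            # Skip missing or whitespace-only names so empty tags never reach the counters.
            if isinstance(name, str) and name.strip():
                out.append(name.strip().lower())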
    return out

def get_resource_formats(ds: Dict[str, Any]) -> List[str]:
    formats: List[str] = []
    for r in (ds.get("resources") or []):
        if isinstance(r, dict):
            fmt = (r.get("format") or "").strip().lower()
            if fmt:
                formats.append(fmt)
    return formats

def short_text(s: str, max_len: int) -> str:
    s = (s or "").strip()
    s = re.sub(r"\s+", " ", s)
    return s[:max_len] + ("…" if len(s) > max_len else "")

def load_excluded_ids(path: Path) -> set[str]:
    if not path.exists():
        # For safety: if policy file missing, treat as no exclusions but warn.
        print(f"WARNING: OSM exclusion list not found: {path}. Proceeding with empty exclusion set.")
        return set()
    ids = set()
    for line in path.read_text(encoding="utf-8", errors="ignore").splitlines():
        line = line.strip()
        if line:
            ids.add(line)
    return ids
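
As a quick sanity check of the helpers above, the sketch below restates condensed versions of `normalize_dataset_record` and `get_tags` and runs them on a hand-made record. The record is invented for illustration, and real HDX metadata carries many more fields.

```python
from typing import Any, Dict, List

def normalize_dataset_record(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Unwrap {'dataset': {...}} if present; otherwise return the record unchanged."""
    if isinstance(raw, dict) and "id" not in raw and isinstance(raw.get("dataset"), dict):
        return raw["dataset"]
    return raw

def get_tags(ds: Dict[str, Any]) -> List[str]:
    """Collect tag names (dict or plain-string form), stripped and lower-cased."""
    out: List[str] = []
    tags = ds.get("tags") or []
    if isinstance(tags, list):
        for t in tags:
            name = t.get("name") if isinstance(t, dict) else t
            if isinstance(name, str) and name.strip():
                out.append(name.strip().lower())
    return out

# Hypothetical record, wrapped the way some dumps wrap their payload.
wrapped = {
    "dataset": {
        "id": "abc-123",
        "tags": [{"name": " Flooding "}, "Hydrology", {"name": ""}],
    }
}

ds = normalize_dataset_record(wrapped)
tags = get_tags(ds)
print(ds["id"], tags)  # abc-123 ['flooding', 'hydrology']
```

The empty tag name is dropped, and both the dict form and the plain-string form end up normalized the same way.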


In [4]:
# =========================
# Scan corpus (excluding OSM)
# =========================

excluded_ids = load_excluded_ids(OSM_EXCLUDED_IDS_TXT)

tag_counter = Counter()
org_counter = Counter()
fmt_counter = Counter()

records_for_sampling: List[Dict[str, Any]] = []

files = list(iter_json_files(DATASET_DIR))
total = len(files)
kept = 0
skipped_osm = 0

for i, path in enumerate(files, start=1):
    if i % 2000 == 0 or i == total:
        print(f"processed {i:,}/{total:,}")

    raw = read_json(path)
    ds = normalize_dataset_record(raw)

    ds_id = (ds.get("id") or "").strip()
    if not ds_id:
        continue

    if ds_id in excluded_ids:
        skipped_osm += 1
        continue

    kept += 1

    tags = get_tags(ds)
    org = get_org_title(ds)
    fmts = get_resource_formats(ds)

    tag_counter.update(tags)
    if org:
        org_counter.update([org])
    fmt_counter.update(fmts)

    # Keep minimal fields for later sampling export
    records_for_sampling.append({
        "dataset_id": ds_id,
        "title": ds.get("title") or "",
        "name": ds.get("name") or "",
        "organization": org,
        "dataset_source": ds.get("dataset_source") or "",
        "license_title": ds.get("license_title") or ds.get("license_id") or "",
        "tags": tags,
        "formats": fmts,
        "notes": ds.get("notes") or "",
    })

print("\nSummary (non-OSM corpus):")
print(f"  total files:     {total:,}")
print(f"  excluded (OSM):  {skipped_osm:,}")
print(f"  kept (non-OSM):  {kept:,}")
print(f"  unique tags:     {len(tag_counter):,}")
print(f"  unique orgs:     {len(org_counter):,}")
print(f"  unique formats:  {len(fmt_counter):,}")


processed 2,000/26,246
processed 4,000/26,246
processed 6,000/26,246
processed 8,000/26,246
processed 10,000/26,246
processed 12,000/26,246
processed 14,000/26,246
processed 16,000/26,246
processed 18,000/26,246
processed 20,000/26,246
processed 22,000/26,246
processed 24,000/26,246
processed 26,000/26,246
processed 26,246/26,246

Summary (non-OSM corpus):
  total files:     26,246
  excluded (OSM):  3,649
  kept (non-OSM):  22,597
  unique tags:     142
  unique orgs:     357
  unique formats:  47
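
The scan above only prints its counters. If the tag statistics should be reviewed outside the notebook, one option is to render them as CSV. The sketch below uses a toy counter in place of `tag_counter`; the helper name `counter_to_csv` and the file name `tag_stats.csv` are assumptions, not part of this notebook.

```python
import csv
import io
from collections import Counter

# Toy counter standing in for tag_counter from the scan above.
tag_counter = Counter({"hxl": 3, "indicators": 2, "geodata": 2, "health": 1})

def counter_to_csv(counter: Counter, total: int) -> str:
    """Render a counter as CSV text, ordered by count (descending), then name."""
    buf = io.StringIO()
    w = csv.writer(buf, lineterminator="\n")
    w.writerow(["tag", "count", "share_of_datasets"])
    for tag, count in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0])):
        w.writerow([tag, count, f"{count / total:.3f}"])
    return buf.getvalue()

# total is the number of kept datasets (4 in this toy case).
csv_text = counter_to_csv(tag_counter, total=4)
print(csv_text)

# To persist it next to the other docs, something like:
# (DOCS_DIR / "tag_stats.csv").write_text(csv_text, encoding="utf-8")
```

Sorting ties by name keeps the file stable between runs, which matches the notebook's goal of deterministic outputs.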


In [5]:
# =========================
# Inspect top tags/orgs/formats
# =========================

TOP_N = 60

print("Top tags (non-OSM):")
for t, c in tag_counter.most_common(TOP_N):
    print(f"  {t:<40} {c:>6,}")

print("\nTop organizations (non-OSM):")
for o, c in org_counter.most_common(30):
    print(f"  {o:<55} {c:>6,}")

print("\nTop resource formats (non-OSM):")
for f, c in fmt_counter.most_common(30):
    print(f"  {f:<12} {c:>6,}")


Top tags (non-OSM):
  hxl                                       9,764
  indicators                                7,387
  geodata                                   5,761
  health                                    2,925
  baseline population                       2,301
  food security                             2,147
  economics                                 2,113
  education                                 2,092
  development                               1,496
  environment                               1,483
  demographics                              1,393
  facilities-infrastructure                 1,349
  conflict-violence                         1,265
  climate-weather                           1,169
  population                                1,141
  internally displaced persons-idp          1,134
  socioeconomics                            1,099
  nutrition                                 1,066
  funding                                     932
  transportation              

In [6]:
# =========================
# Create review sample CSV
# =========================

random.seed(RANDOM_SEED)

# 1) Random sample
sample = random.sample(records_for_sampling, k=min(SAMPLE_SIZE, len(records_for_sampling)))

# 2) Ensure we include some examples for the top tags (calibration)
top_tags = [t for t, _ in tag_counter.most_common(40)]
tag_to_examples: Dict[str, int] = defaultdict(int)

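# For each top tag, append the first dataset (in file order) that carries the tag
# and is not already in the sample. This runs even when the tag is already
# represented, so the sample can grow by up to len(top_tags) rows.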
present_ids = {r["dataset_id"] for r in sample}
for tag in top_tags:
    for r in records_for_sampling:
        if r["dataset_id"] in present_ids:
            continue
        if tag in r["tags"]:
            sample.append(r)
            present_ids.add(r["dataset_id"])
            tag_to_examples[tag] += 1
            break

out_sample_csv = DOCS_DIR / "samples_for_mapping.csv"
with out_sample_csv.open("w", newline="", encoding="utf-8") as f:
    w = csv.DictWriter(
        f,
        fieldnames=[
            "dataset_id", "title", "organization", "dataset_source", "license_title",
            "tags", "formats", "notes_snippet"
        ],
    )
    w.writeheader()
    for r in sample:
        w.writerow({
            "dataset_id": r["dataset_id"],
            "title": r["title"],
            "organization": r["organization"],
            "dataset_source": r["dataset_source"],
            "license_title": r["license_title"],
            "tags": ";".join(r["tags"]),
            "formats": ";".join(sorted(set(r["formats"]))),
            "notes_snippet": short_text(r["notes"], MAX_NOTES_CHARS),
        })

print("Wrote sample review file:", out_sample_csv.resolve())
print("Sample rows:", len(sample))


Wrote sample review file: C:\Users\benny\OneDrive\Documents\Github\hdx-metadata-crawler\hdx_dataset_metadata_dump\docs\samples_for_mapping.csv
Sample rows: 440
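
The row count of 440 follows from the logic above: 400 random rows plus one top-up row for each of the 40 top tags, because the top-up loop appends a not-yet-sampled dataset for every tag whether or not that tag already appears in the random draw. The sketch below reproduces the same pattern on invented toy data, using a local `random.Random` instance (an assumption made here so the global random state is left alone).

```python
import random
from collections import Counter

# Toy corpus: 50 records, each carrying one of five tags (invented data).
records = [{"dataset_id": f"ds-{i:02d}", "tags": [f"tag-{i % 5}"]} for i in range(50)]
tag_counter = Counter(t for r in records for t in r["tags"])

rng = random.Random(42)  # local generator with a fixed seed
sample = rng.sample(records, k=8)
present = {r["dataset_id"] for r in sample}

# Top-up: for each top tag, append the first record (in corpus order) not yet sampled.
for tag, _ in tag_counter.most_common(3):
    extra = next(
        (r for r in records if tag in r["tags"] and r["dataset_id"] not in present),
        None,
    )
    if extra is not None:
        sample.append(extra)
        present.add(extra["dataset_id"])

print(len(sample))  # 11: 8 random rows + 3 top-up rows
```

If the intent is to add rows only for tags that the random draw missed, the loop would need an extra check that no sampled record already carries the tag.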


In [7]:
# =========================
# Minimal YAML writer (safe, limited)
# =========================

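def yaml_escape(s: str) -> str:
    """Return s as a YAML scalar, double-quoting it when a plain scalar would be misread."""
    needs_quotes = (
        s == ""
        or s.strip() != s
        or s[0] in "&*!|>'\"%@`"          # indicator characters at the start
        or s.startswith(("- ", "? "))
        or any(ch in s for ch in ":#{}[],\n\r\t")
        # YAML 1.1 parsers also read yes/no/on/off as booleans
        or s.lower() in {"true", "false", "yes", "no", "on", "off", "null", "~"}
    )
    if not needs_quotes:
        return s
    # Escape backslashes first so regex snippets such as \b survive double quoting.
    escaped = s.replace("\\", "\\\\").replace('"', '\\"')
    escaped = escaped.replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t")
    return '"' + escaped + '"'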

def dump_yaml(obj: Any, indent: int = 0) -> str:
    sp = "  " * indent
    if isinstance(obj, dict):
        lines = []
        for k in sorted(obj.keys()):
            v = obj[k]
            key = yaml_escape(str(k))
            if isinstance(v, (dict, list)):
                lines.append(f"{sp}{key}:")
                lines.append(dump_yaml(v, indent + 1))
            else:
                lines.append(f"{sp}{key}: {yaml_escape(str(v))}")
        return "\n".join(lines)
    if isinstance(obj, list):
        lines = []
        for item in obj:
            if isinstance(item, (dict, list)):
                lines.append(f"{sp}-")
                lines.append(dump_yaml(item, indent + 1))
            else:
                lines.append(f"{sp}- {yaml_escape(str(item))}")
        return "\n".join(lines)
    return f"{sp}{yaml_escape(str(obj))}"

def write_yaml(path: Path, obj: Any, overwrite: bool = False) -> None:
    if path.exists() and not overwrite:
        print(f"SKIP (exists): {path}")
        return
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(dump_yaml(obj) + "\n", encoding="utf-8")
    print("Wrote:", path.resolve())
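
Hand-written escaping is easy to get subtly wrong, especially for regex snippets that contain backslashes. One standard-library alternative, sketched here under the assumption that the consumer is a YAML 1.2 parser (where double-quoted scalars accept JSON string escapes), is to let `json.dumps` produce the quoted form. The helper name `quote_for_yaml` is hypothetical.

```python
import json

def quote_for_yaml(s: str) -> str:
    """Return s as a double-quoted scalar using JSON escaping rules."""
    # ensure_ascii=False keeps non-ASCII text readable instead of \uXXXX escapes.
    return json.dumps(s, ensure_ascii=False)

pattern = r"\bflood(s|ing)?\b"
quoted = quote_for_yaml(pattern)
print(quoted)  # "\\bflood(s|ing)?\\b"

# The backslashes survive a round trip through a JSON-compatible parser.
assert json.loads(quoted) == pattern
```

The trade-off is that every string becomes quoted, which is slightly noisier to edit by hand than plain scalars.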


In [8]:
# =========================
# Draft mapping configs (starter set)
# =========================
# These are intentionally minimal, meant to be refined by your team after reviewing docs/samples_for_mapping.csv.
# Weights: 1 (weak) .. 5 (strong). You can use fractional weights if you want, but keep it simple at first.

tag_to_rdls = {
    "hazard": {
        # classic hazard event tags
        "flooding": 5,
        "drought": 5,
        "cyclones-hurricanes-typhoons": 5,
        "earthquake-tsunami": 5,
        "climate hazards": 4,
        "hydrology": 3,  # often hazard-supporting
        "natural disasters": 3,
        "hazards and risk": 3,
        "forecasting": 2,  # could be hazard services
        "topography": 2,   # hazard-supporting
    },
    "loss_impact": {
        "damage assessment": 5,
        "casualties": 5,
        "fatalities": 5,
        "mortality": 4,
        "affected population": 4,
        "affected area": 4,
        "people in need-pin": 3,
        "severity": 3,
    },
    "exposure": {
        "facilities-infrastructure": 5,
        "populated places-settlements": 4,
        "population": 4,
        "roads": 4,
        "railways": 3,
        "ports": 3,
        "aviation": 3,
        "points of interest-poi": 3,
        "health facilities": 4,
        "education facilities-schools": 4,
        "energy": 3,
        "gazetteer": 2,
        "geodata": 2,
    },
    "vulnerability_proxy": {
        "demographics": 4,
        "poverty": 4,
        "socioeconomics": 4,
        "disability": 3,
        "gender": 3,
        "food security": 3,
        "health": 2,
        "education": 2,
        "livelihoods": 2,
        "nutrition": 2,
    },
}

# Keyword patterns applied to (title + notes); these are regex snippets.
keyword_to_rdls = {
    "hazard": [
        r"\bflood(s|ing)?\b",
        r"\bdrought\b",
        r"\bcyclone(s)?\b",
        r"\bhurricane(s)?\b",
        r"\btyphoon(s)?\b",
        r"\bearthquake(s)?\b",
        r"\btsunami\b",
        r"\breturn period\b",
        r"\bhazard\b",
    ],
    "loss_impact": [
        r"\bdamage\b",
        r"\bloss(es)?\b",
        r"\bcost(s)?\b",
        r"\bfatalit(y|ies)\b",
        r"\bcasualt(y|ies)\b",
        r"\baffected\b",
    ],
    "exposure": [
        r"\bairport(s)?\b",
        r"\broad(s)?\b",
        r"\bbridge(s)?\b",
        r"\bport(s)?\b",
        r"\bhospital(s)?\b",
        r"\bschool(s)?\b",
        r"\bfacilit(y|ies)\b",
        r"\binfrastructure\b",
        r"\bbuildings?\b",
        r"\bsettlement(s)?\b",
    ],
    "vulnerability_proxy": [
        r"\bpoverty\b",
        r"\bdisabilit(y|ies)\b",
        r"\bmalnutrition\b",
        r"\bfood security\b",
        r"\bhealth indicator(s)?\b",
        r"\bdemographic(s)?\b",
        r"\bvulnerability\b",
        r"\bhousehold(s)?\b",
    ],
}

# Organization hints (optional, but very useful for indicator packs)
org_hints = {
    # Examples to refine; add more after inspecting org_counter
    "World Bank Group": {"vulnerability_proxy": 2},
    "The DHS Program": {"vulnerability_proxy": 4},
    "Food and Agriculture Organization": {"vulnerability_proxy": 3},
    "UNICEF": {"vulnerability_proxy": 3},
}

# Where to write
path_tag_yaml = CONFIG_DIR / "tag_to_rdls_component.yaml"
path_kw_yaml = CONFIG_DIR / "keyword_to_rdls_component.yaml"
path_org_yaml = CONFIG_DIR / "org_hints.yaml"

write_yaml(path_tag_yaml, tag_to_rdls, overwrite=OVERWRITE_CONFIG)
write_yaml(path_kw_yaml, keyword_to_rdls, overwrite=OVERWRITE_CONFIG)
write_yaml(path_org_yaml, org_hints, overwrite=OVERWRITE_CONFIG)


Wrote: C:\Users\benny\OneDrive\Documents\Github\hdx-metadata-crawler\hdx_dataset_metadata_dump\config\tag_to_rdls_component.yaml
Wrote: C:\Users\benny\OneDrive\Documents\Github\hdx-metadata-crawler\hdx_dataset_metadata_dump\config\keyword_to_rdls_component.yaml
Wrote: C:\Users\benny\OneDrive\Documents\Github\hdx-metadata-crawler\hdx_dataset_metadata_dump\config\org_hints.yaml
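
To make the role of the three config files concrete, here is a minimal sketch of how a classifier might combine them into per-component scores. Step 4 is not shown in this notebook, so the additive scheme, the function name `score_dataset`, and the fixed `KEYWORD_WEIGHT` are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Trimmed copies of the draft configs above, for illustration only.
tag_to_rdls = {"hazard": {"flooding": 5, "hydrology": 3}, "exposure": {"population": 4}}
keyword_to_rdls = {"hazard": [r"\bflood(s|ing)?\b"], "loss_impact": [r"\bdamage\b"]}
org_hints = {"UNICEF": {"vulnerability_proxy": 3}}

KEYWORD_WEIGHT = 2  # assumed weight per matching pattern; not defined in this notebook

def score_dataset(tags: List[str], text: str, org: str) -> Dict[str, int]:
    """Sum tag weights, keyword hits, and organization hints per RDLS component."""
    scores: Dict[str, int] = defaultdict(int)
    for component, weights in tag_to_rdls.items():
        scores[component] += sum(w for tag, w in weights.items() if tag in tags)
    for component, patterns in keyword_to_rdls.items():
        hits = sum(1 for p in patterns if re.search(p, text, flags=re.IGNORECASE))
        scores[component] += KEYWORD_WEIGHT * hits
    for component, weight in org_hints.get(org, {}).items():
        scores[component] += weight
    # Drop components with no evidence at all.
    return {c: s for c, s in scores.items() if s > 0}

result = score_dataset(
    tags=["flooding", "population"],
    text="Flood damage assessment, 2022",
    org="UNICEF",
)
print(result)
# {'hazard': 7, 'exposure': 4, 'loss_impact': 2, 'vulnerability_proxy': 3}
```

Because the labeling is inclusive, a dataset can score on several components at once; how those scores turn into labels (thresholds, top-k, ties) is left to the classifier.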


In [9]:
# =========================
# Write mapping rules doc
# =========================

rules_md = DOCS_DIR / "mapping_rules.md"

if (not rules_md.exists()) or OVERWRITE_CONFIG:
    content = f"""# Mapping Rules (Draft)

This document describes the *draft* mapping used to translate HDX dataset metadata into inclusive RDLS components.

## Components
- **hazard**: datasets describing hazard events/intensity/footprints or hazard model inputs (e.g., flood, drought).
- **exposure**: datasets describing exposed elements (population, facilities, infrastructure).
- **vulnerability_proxy**: indicator datasets used as proxies for vulnerability/sensitivity (health, poverty, food security).
- **loss_impact**: datasets describing observed/estimated impacts (damage, fatalities, affected population).

## Evidence sources (in priority order)
1. **HDX tags** (weighted)
2. **Keywords** in title/notes (regex patterns)
3. **Organization hints** (helps with indicator series)

## OSM policy
OSM-derived datasets are excluded from downstream RDLS translation using the exclusion list in:
`{OSM_EXCLUDED_IDS_TXT.as_posix()}`

## Next steps
1. Review `docs/samples_for_mapping.csv` to calibrate tag/keyword weights.
2. Expand `config/tag_to_rdls_component.yaml` to cover common tags in your corpus.
3. Add organization mappings for frequent publishers (World Bank, FAO, DHS, etc.).
"""
    rules_md.write_text(content, encoding="utf-8")
    print("Wrote:", rules_md.resolve())
else:
    print("SKIP (exists):", rules_md.resolve())


Wrote: C:\Users\benny\OneDrive\Documents\Github\hdx-metadata-crawler\hdx_dataset_metadata_dump\docs\mapping_rules.md
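
The second of the next steps above asks for the tag mapping to cover the common tags in the corpus. A small coverage check can show where the gaps are. The sketch below uses four counts copied from the "Top tags" output earlier and a trimmed mapping; in the notebook itself the real `tag_counter` and `tag_to_rdls` would be used instead.

```python
from collections import Counter

# Counts copied from the "Top tags" output above; mapping trimmed for illustration.
tag_counter = Counter({"hxl": 9764, "indicators": 7387, "geodata": 5761, "health": 2925})
tag_to_rdls = {"exposure": {"geodata": 2}, "vulnerability_proxy": {"health": 2}}

# Every tag that appears under at least one component.
mapped = {tag for weights in tag_to_rdls.values() for tag in weights}

# Frequent tags with no mapping, most common first.
unmapped = [(tag, n) for tag, n in tag_counter.most_common() if tag not in mapped]

covered = sum(n for tag, n in tag_counter.items() if tag in mapped)
print(f"tag occurrences covered: {covered / sum(tag_counter.values()):.1%}")
for tag, n in unmapped:
    print(f"  unmapped: {tag:<12} {n:>6,}")
```

Not every unmapped tag is a gap: a tag such as `hxl` describes how the data is encoded rather than what it is about, so leaving it out of the mapping may well be deliberate.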


In [10]:
# =========================
# Next: where this plugs in
# =========================

print("\nStep 3 complete.")
print("Next notebooks:")
print("  - Step 4 will read YAML configs from:", CONFIG_DIR.resolve())
print("  - Step 4 will use policy exclusion list from:", OSM_EXCLUDED_IDS_TXT.resolve())
print("\nFiles created:")
print("  -", (CONFIG_DIR / 'tag_to_rdls_component.yaml').resolve())
print("  -", (CONFIG_DIR / 'keyword_to_rdls_component.yaml').resolve())
print("  -", (CONFIG_DIR / 'org_hints.yaml').resolve())
print("  -", (DOCS_DIR / 'mapping_rules.md').resolve())
print("  -", (DOCS_DIR / 'samples_for_mapping.csv').resolve())



Step 3 complete.
Next notebooks:
  - Step 4 will read YAML configs from: C:\Users\benny\OneDrive\Documents\Github\hdx-metadata-crawler\hdx_dataset_metadata_dump\config
  - Step 4 will use policy exclusion list from: C:\Users\benny\OneDrive\Documents\Github\hdx-metadata-crawler\hdx_dataset_metadata_dump\policy\osm_excluded_dataset_ids.txt

Files created:
  - C:\Users\benny\OneDrive\Documents\Github\hdx-metadata-crawler\hdx_dataset_metadata_dump\config\tag_to_rdls_component.yaml
  - C:\Users\benny\OneDrive\Documents\Github\hdx-metadata-crawler\hdx_dataset_metadata_dump\config\keyword_to_rdls_component.yaml
  - C:\Users\benny\OneDrive\Documents\Github\hdx-metadata-crawler\hdx_dataset_metadata_dump\config\org_hints.yaml
  - C:\Users\benny\OneDrive\Documents\Github\hdx-metadata-crawler\hdx_dataset_metadata_dump\docs\mapping_rules.md
  - C:\Users\benny\OneDrive\Documents\Github\hdx-metadata-crawler\hdx_dataset_metadata_dump\docs\samples_for_mapping.csv
