# Step 6 - Translate HDX metadata to RDLS v0.3 JSON

This notebook translates **HDX dataset-level metadata exports** into **RDLS v0.3** metadata records.

## Inputs
- Step 5 outputs:
  - `hdx_dataset_metadata_dump/derived/classification_final.csv`
  - `hdx_dataset_metadata_dump/derived/rdls_included_dataset_ids_final.txt`
- HDX dataset metadata JSON corpus:
  - `hdx_dataset_metadata_dump/dataset_metadata/*.json`
- RDLS v0.3 assets (provide local paths below):
  - `rdls_schema_v0.3.json`
  - `rdls_template_v03.json`

## Outputs
- `hdx_dataset_metadata_dump/rdls/records/*.json`: one RDLS record per included HDX dataset
- `hdx_dataset_metadata_dump/rdls/index/rdls_index.jsonl`: index of written records
- `hdx_dataset_metadata_dump/rdls/reports/translation_blocked.csv`: datasets blocked by policy/required-field gaps
- `hdx_dataset_metadata_dump/rdls/reports/schema_validation.csv`: JSON Schema validation results (pass/fail)
- `hdx_dataset_metadata_dump/rdls/reports/translation_qa.csv`: per-record QA fields (spatial scale, country count, raw license, hazard suffix)

## Strictness & policy
- **Schema-first:** required RDLS fields are always populated; optional fields are omitted unless we can fill them safely.
- **No extra fields:** the output contains **only fields defined in the RDLS schema**.
- **Do not invent content:** if a value cannot be mapped from HDX, we **leave the RDLS optional field absent** (not an empty string), to avoid violating schema `minLength` constraints.
- **Open codelists:** for schema fields marked `openCodelist: true`, values may be kept as-is if not in suggestions.
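- **Component combination rule (team policy):**
  - hazard-only and exposure-only are allowed
  - vulnerability must accompany hazard or exposure
  - loss must accompany hazard or exposure
  - if this is violated and `AUTO_REPAIR_COMPONENTS` is `False`, the dataset is **blocked** (until resolved via overrides in Step 5); with `AUTO_REPAIR_COMPONENTS = True` (the setting in Cell 5), `exposure` is added automatically and the repair is recorded in the gate reasons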

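The "leave optional fields absent" policy above can be illustrated with a small, self-contained sketch. The helper name `drop_empty_optionals` and the `version` key are hypothetical and only serve the illustration; the notebook's own pruning happens in Cell 6.

```python
from typing import Any, Dict, Set


def drop_empty_optionals(record: Dict[str, Any], required: Set[str]) -> Dict[str, Any]:
    """Keep required keys as-is; drop optional keys whose value is None or blank."""
    cleaned: Dict[str, Any] = {}
    for key, value in record.items():
        is_blank = value is None or (isinstance(value, str) and not value.strip())
        if key in required or not is_blank:
            cleaned[key] = value
    return cleaned


# Hypothetical draft record: "description" and "version" could not be mapped.
draft = {"id": "rdls_hzd-example", "title": "Example", "description": "", "version": None}
cleaned = drop_empty_optionals(draft, required={"id", "title"})
print(cleaned)
```

Dropping the key entirely, rather than writing `""`, keeps the record clear of `minLength` violations on optional string fields.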
## Naming convention
The RDLS `datasets[0].id` equals the output filename stem, following:

`{prefix}{entity_token}{optional_hazard_suffix}.json`

Prefix precedence (first match wins; Cell 6 then appends the provenance marker `hdx_`, e.g. `rdls_lss-hdx_`):
- loss → `rdls_lss-`
- vulnerability → `rdls_vln-`
- exposure → `rdls_exp-`
- hazard → `rdls_hzd-`

Collision-proofing:
- if a filename already exists, append `__{dataset_uuid[:8]}` (the first 8 characters of the HDX dataset UUID).

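As a rough sketch of the naming convention, the snippet below picks a prefix by precedence and disambiguates a repeated stem. It tracks used stems in an in-memory set instead of checking the filesystem, and `build_stem`, `collision_marker` and the sample tokens are assumptions made for the illustration only.

```python
from typing import Iterable, Set

# Precedence order: the first component found in the dataset decides the prefix.
PREFIX_BY_COMPONENT = [
    ("loss", "rdls_lss-"),
    ("vulnerability", "rdls_vln-"),
    ("exposure", "rdls_exp-"),
    ("hazard", "rdls_hzd-"),
]


def build_stem(
    components: Iterable[str],
    entity_token: str,
    hazard_suffix: str,
    dataset_uuid: str,
    taken: Set[str],
    collision_marker: str = "__",  # assumption: the exact marker is a project choice
) -> str:
    present = set(components)
    prefix = next((p for c, p in PREFIX_BY_COMPONENT if c in present), "rdls_hzd-")
    stem = f"{prefix}{entity_token}{hazard_suffix}"
    if stem in taken:
        # Deterministic disambiguation using the first 8 characters of the UUID.
        stem = f"{stem}{collision_marker}{dataset_uuid[:8]}"
    taken.add(stem)
    return stem


taken: Set[str] = set()
first = build_stem(["exposure", "loss"], "org_abc_data", "_flood", "1234abcd-ef00", taken)
second = build_stem(["exposure", "loss"], "org_abc_data", "_flood", "1234abcd-ef00", taken)
print(first)
print(second)
```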

### Cell 1

In [1]:
# ======================
# Configuration
# ======================
from __future__ import annotations

import json
import re
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, Iterable, List, Optional, Tuple

import pandas as pd

# --- Paths (assumes you run this notebook from the `notebook/` folder) ---
DUMP_DIR = (Path("..") / "hdx_dataset_metadata_dump").resolve()
DATASET_DIR = (DUMP_DIR / "dataset_metadata").resolve()

DERIVED_DIR = (DUMP_DIR / "derived").resolve()
CLASSIFICATION_FINAL_CSV = (DERIVED_DIR / "classification_final.csv").resolve()
INCLUDED_IDS_TXT = (DERIVED_DIR / "rdls_included_dataset_ids_final.txt").resolve()

# --- RDLS assets (edit if you store them elsewhere) ---
# Recommended folder layout:
#   hdx_dataset_metadata_dump/rdls/schema/rdls_schema_v0.3.json
#   hdx_dataset_metadata_dump/rdls/template/rdls_template_v03.json
RDLS_DIR = (DUMP_DIR / "rdls").resolve()
RDLS_SCHEMA_PATH = (RDLS_DIR / "schema" / "rdls_schema_v0.3.json").resolve()
RDLS_TEMPLATE_PATH = (RDLS_DIR / "template" / "rdls_template_v03.json").resolve()

# =========================
# Output mode configuration
# =========================
# Choose ONE:
# - "run_folder": write to rdls/runs/<RUN_ID>/{records,index,reports}
# - "in_place"  : write to rdls/{records,index,reports}
OUTPUT_MODE = "in_place"  # "run_folder" | "in_place"

# If OUTPUT_MODE == "in_place", optionally clean existing outputs before writing
CLEAN_IN_PLACE_BEFORE_RUN = True  # True | False

# Safety: never allow cleaning when using run_folder mode
if OUTPUT_MODE == "run_folder" and CLEAN_IN_PLACE_BEFORE_RUN:
    raise ValueError("Invalid config: CLEAN_IN_PLACE_BEFORE_RUN cannot be True when OUTPUT_MODE='run_folder'")

# Resolve run root (either rdls/ or rdls/runs/<RUN_ID>/)
if OUTPUT_MODE == "run_folder":
    RUN_ID = datetime.now().strftime("%Y%m%d_%H%M%S")
    RDLS_RUN_DIR = (RDLS_DIR / "runs" / RUN_ID).resolve()
else:
    RUN_ID = "in_place"
    RDLS_RUN_DIR = RDLS_DIR

# --- Outputs (under the chosen run directory) ---
OUT_RECORDS_DIR = RDLS_RUN_DIR / "records"
OUT_INDEX_DIR = RDLS_RUN_DIR / "index"
OUT_REPORTS_DIR = RDLS_RUN_DIR / "reports"

OUT_INDEX_JSONL = OUT_INDEX_DIR / "rdls_index.jsonl"
OUT_BLOCKED_CSV = OUT_REPORTS_DIR / "translation_blocked.csv"
OUT_VALIDATION_CSV = OUT_REPORTS_DIR / "schema_validation.csv"
OUT_QA_CSV = OUT_REPORTS_DIR / "translation_qa.csv"

# --- Runtime controls ---
# MAX_DATASETS: Optional[int] = None     # set e.g. 200 for a test run
MAX_DATASETS = 50
SKIP_EXISTING: bool = False             # set True for resume-safe runs (skip records already written)
WRITE_PRETTY_JSON: bool = True          # pretty-print for readability (slower, larger)

# --- Hazard inference / filename alias ---
# RDLS hazard types are fixed enums (e.g., "strong_wind"). Filenames can use friendlier aliases if you want.
HAZARD_FILENAME_ALIASES = {
    "strong_wind": "windstorm",  # example alias: friendlier filename token
    # Keep others as-is by default; add aliases if you want:
    # "extreme_temperature": "extreme_temperature",
    # "coastal_flood": "coastal_flood",
}

# --- Resource format mapping (HDX -> RDLS data_format enum label) ---
HDX_FORMAT_TO_RDLS = {
    "CSV": "CSV (csv)",
    "XLS": "Excel (xlsx)",
    "XLSX": "Excel (xlsx)",
    "EXCEL": "Excel (xlsx)",
    "JSON": "JSON (json)",
    "GEOJSON": "GeoJSON (geojson)",
    "SHP": "Shapefile (shp)",
    "SHAPEFILE": "Shapefile (shp)",
    "GPKG": "GeoPackage (gpkg)",
    "GEOPACKAGE": "GeoPackage (gpkg)",
    "KML": "KML (kml)",
    "PDF": "PDF (pdf)",
    "NC": "NetCDF (nc)",
    "NETCDF": "NetCDF (nc)",
    "TIF": "GeoTIFF (tif)",
    "TIFF": "GeoTIFF (tif)",
    "COG": "Cloud Optimized GeoTIFF (cog)",
    "PARQUET": "Parquet (parquet)",
    "XML": "XML (xml)",
}

# --- License mapping (HDX license_title -> preferred RDLS suggestions when possible) ---
HDX_LICENSE_TO_RDLS = {
    "public domain": "PDDL-1.0",
    "odbl": "ODbL-1.0",
    "cc-by-4.0": "CC-BY-4.0",
    "cc by 4.0": "CC-BY-4.0",
    "cc-by": "CC-BY-4.0",
    "cc0": "CC0-1.0",
    "cc0-1.0": "CC0-1.0",
    "copyright": "Copyright",
}

# Ensure output folders exist
for p in [OUT_RECORDS_DIR, OUT_INDEX_DIR, OUT_REPORTS_DIR]:
    p.mkdir(parents=True, exist_ok=True)

def _safe_clean_folder(folder: Path, pattern: str) -> int:
    """
    Delete files matching pattern inside an expected RDLS subfolder.
    Guardrails:
      - only cleans inside RDLS_DIR
      - only allows folder names: records/index/reports
    """
    folder = folder.resolve()
    if RDLS_DIR not in folder.parents:  # path-aware check; a string prefix would also match e.g. rdls_backup/
        raise ValueError(f"Refusing to clean outside rdls/: {folder}")
    if folder.name not in {"records", "index", "reports"}:
        raise ValueError(f"Unexpected folder name for cleaning: {folder.name}")
    n = 0
    for f in folder.glob(pattern):
        try:
            f.unlink()
            n += 1
        except Exception as e:
            print(f"WARNING: failed to delete {f}: {e}")
    return n

if OUTPUT_MODE == "in_place" and CLEAN_IN_PLACE_BEFORE_RUN:
    removed_records = _safe_clean_folder(OUT_RECORDS_DIR, "*.json")
    removed_index = _safe_clean_folder(OUT_INDEX_DIR, "*.jsonl")
    removed_reports = _safe_clean_folder(OUT_REPORTS_DIR, "*.csv")
    print(f"Cleaned in-place: records={removed_records}, index={removed_index}, reports={removed_reports}")
else:
    print("No cleaning performed.")

print("DUMP_DIR:", DUMP_DIR)
print("DATASET_DIR:", DATASET_DIR)
print("RDLS_SCHEMA_PATH:", RDLS_SCHEMA_PATH)
print("RDLS_TEMPLATE_PATH:", RDLS_TEMPLATE_PATH)
print("OUTPUT_MODE:", OUTPUT_MODE)
print("RUN_ID:", RUN_ID)
print("RDLS_RUN_DIR:", RDLS_RUN_DIR)
print("OUT_RECORDS_DIR:", OUT_RECORDS_DIR)
print("OUT_INDEX_DIR:", OUT_INDEX_DIR)
print("OUT_REPORTS_DIR:", OUT_REPORTS_DIR)


Cleaned in-place: records=50, index=1, reports=3
DUMP_DIR: C:\Users\benny\OneDrive\Documents\Github\hdx-metadata-crawler\hdx_dataset_metadata_dump
DATASET_DIR: C:\Users\benny\OneDrive\Documents\Github\hdx-metadata-crawler\hdx_dataset_metadata_dump\dataset_metadata
RDLS_SCHEMA_PATH: C:\Users\benny\OneDrive\Documents\Github\hdx-metadata-crawler\hdx_dataset_metadata_dump\rdls\schema\rdls_schema_v0.3.json
RDLS_TEMPLATE_PATH: C:\Users\benny\OneDrive\Documents\Github\hdx-metadata-crawler\hdx_dataset_metadata_dump\rdls\template\rdls_template_v03.json
OUTPUT_MODE: in_place
RUN_ID: in_place
RDLS_RUN_DIR: C:\Users\benny\OneDrive\Documents\Github\hdx-metadata-crawler\hdx_dataset_metadata_dump\rdls
OUT_RECORDS_DIR: C:\Users\benny\OneDrive\Documents\Github\hdx-metadata-crawler\hdx_dataset_metadata_dump\rdls\records
OUT_INDEX_DIR: C:\Users\benny\OneDrive\Documents\Github\hdx-metadata-crawler\hdx_dataset_metadata_dump\rdls\index
OUT_REPORTS_DIR: C:\Users\benny\OneDrive\Documents\Github\hdx-metadata-c

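The cleaning guard in this cell depends on deciding whether one path lies inside another. The following is a minimal sketch, using hypothetical POSIX paths and a hypothetical helper `is_inside`, of how a path-aware containment test differs from a plain string-prefix comparison.

```python
from pathlib import PurePosixPath


def is_inside(child: PurePosixPath, root: PurePosixPath) -> bool:
    """True if `child` is `root` itself or located somewhere below it."""
    return child == root or root in child.parents


root = PurePosixPath("/data/rdls")
inside = PurePosixPath("/data/rdls/records")
sibling = PurePosixPath("/data/rdls_backup/records")

print(is_inside(inside, root))    # True
print(is_inside(sibling, root))   # False
# A string-prefix test accepts the sibling directory as well:
print(str(sibling).startswith(str(root)))  # True
```

The path-aware form compares whole path components, so a directory whose name merely begins with `rdls` is not treated as being inside `rdls/`.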
### Cell 2

In [2]:
# ======================
# Load RDLS schema + template (and required field list)
# ======================
from copy import deepcopy

if not RDLS_SCHEMA_PATH.exists():
    raise FileNotFoundError(f"RDLS schema not found: {RDLS_SCHEMA_PATH}")

if not RDLS_TEMPLATE_PATH.exists():
    raise FileNotFoundError(f"RDLS template not found: {RDLS_TEMPLATE_PATH}")

rdls_schema = json.loads(RDLS_SCHEMA_PATH.read_text(encoding="utf-8"))
rdls_template = json.loads(RDLS_TEMPLATE_PATH.read_text(encoding="utf-8"))

# RDLS dataset object allowed keys (schema properties)
RDLS_ALLOWED_KEYS = set(rdls_schema["properties"].keys())

# Required keys for a dataset record (from schema)
RDLS_REQUIRED_KEYS = list(rdls_schema.get("required", []))

print("RDLS required keys:", RDLS_REQUIRED_KEYS)
print("RDLS allowed keys count:", len(RDLS_ALLOWED_KEYS))


RDLS required keys: ['id', 'title', 'risk_data_type', 'attributions', 'spatial', 'license', 'resources']
RDLS allowed keys count: 20

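To show how the allowed and required key lists are meant to be used, here is a small sketch against a toy schema. The schema contents, the `source_system` key and the helper `filter_and_check` are made up for the example; the real lists come from `rdls_schema_v0.3.json`.

```python
from typing import Any, Dict, List, Tuple

# Toy schema for illustration only; not the real RDLS schema.
toy_schema = {
    "properties": {"id": {}, "title": {}, "license": {}},
    "required": ["id", "title"],
}


def filter_and_check(
    candidate: Dict[str, Any], schema: Dict[str, Any]
) -> Tuple[Dict[str, Any], List[str]]:
    """Drop keys the schema does not define and list required keys that are missing."""
    allowed = set(schema.get("properties", {}))
    kept = {k: v for k, v in candidate.items() if k in allowed}
    missing = [k for k in schema.get("required", []) if k not in kept]
    return kept, missing


kept, missing = filter_and_check({"id": "abc", "license": "CC-BY-4.0", "source_system": "HDX"}, toy_schema)
print(kept)
print(missing)
```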

### Cell 3

In [3]:
# ======================
# Load Step 5 outputs + build dataset_id -> JSON path index
# ======================
def read_ids_txt(path: Path) -> List[str]:
    if not path.exists():
        raise FileNotFoundError(f"Missing IDs list: {path}")
    out: List[str] = []
    for line in path.read_text(encoding="utf-8").splitlines():
        s = line.strip()
        if s:
            out.append(s)
    return out

if not CLASSIFICATION_FINAL_CSV.exists():
    raise FileNotFoundError(f"Missing Step 5 output: {CLASSIFICATION_FINAL_CSV}")
if not INCLUDED_IDS_TXT.exists():
    raise FileNotFoundError(f"Missing Step 5 output: {INCLUDED_IDS_TXT}")
if not DATASET_DIR.exists():
    raise FileNotFoundError(f"Missing HDX dataset metadata folder: {DATASET_DIR}")

df = pd.read_csv(CLASSIFICATION_FINAL_CSV)
included_ids = read_ids_txt(INCLUDED_IDS_TXT)

# Fast lookups
df = df.set_index("dataset_id", drop=False)
included_set = set(included_ids)

print("classification_final rows:", len(df))
print("included ids:", len(included_ids))
print("included ids present in classification:", sum(did in df.index for did in included_ids))

# Build a dataset_id -> file path mapping once (avoid N=10k glob calls)
dataset_file_index: Dict[str, Path] = {}
for fp in sorted(DATASET_DIR.glob("*.json")):
    # expected filename pattern: {dataset_uuid}__{slug}.json
    stem = fp.stem
    if "__" in stem:
        dataset_uuid = stem.split("__", 1)[0]
    else:
        dataset_uuid = stem
    dataset_file_index[dataset_uuid] = fp

print("dataset files indexed:", len(dataset_file_index))

# Optional: limit for testing
if MAX_DATASETS is not None:
    included_ids = included_ids[:MAX_DATASETS]
    included_set = set(included_ids)
    print("TEST RUN: MAX_DATASETS =", MAX_DATASETS, "-> using", len(included_ids), "ids")


classification_final rows: 26246
included ids: 10759
included ids present in classification: 10759
dataset files indexed: 26246
TEST RUN: MAX_DATASETS = 50 -> using 50 ids

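The file index assumes filenames of the form `{dataset_uuid}__{slug}.json`. A minimal sketch of that parsing step, with invented filenames and a hypothetical helper `uuid_from_filename`, plus a quick check for included ids that have no metadata file:

```python
from pathlib import PurePath
from typing import Dict, List


def uuid_from_filename(name: str) -> str:
    """Return the part of the filename stem before the first '__' (or the whole stem)."""
    stem = PurePath(name).stem
    return stem.partition("__")[0]


# Invented filenames and ids, for illustration only.
files = ["aaa111__flood-extent.json", "bbb222.json"]
index: Dict[str, str] = {uuid_from_filename(f): f for f in files}

included = ["aaa111", "ccc333"]
missing: List[str] = [i for i in included if i not in index]
print(index)
print(missing)
```

Reporting `missing` before translation starts makes gaps in the JSON corpus visible early instead of surfacing as per-dataset lookup failures.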

### Cell 4

In [4]:
# ======================
# Helpers: parsing, slugifying, hazard inference, mappings
# ======================
def slugify_token(s: str, max_len: int = 32) -> str:
    s = (s or "").strip().lower()
    s = re.sub(r"[^a-z0-9]+", "_", s)
    s = re.sub(r"_+", "_", s).strip("_")
    if not s:
        return "unknown"
    return s[:max_len].strip("_") or "unknown"

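def split_semicolon_list(s: Any) -> List[str]:
    """Split a delimited value into trimmed, non-empty tokens.

    Despite the name, both ';' and ',' act as separators, so a value that itself
    contains a comma (e.g. "Congo, Dem. Rep.") is split into two tokens.
    """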
    if s is None or (isinstance(s, float) and pd.isna(s)):
        return []
    if isinstance(s, list):
        return [str(x).strip() for x in s if str(x).strip()]
    s = str(s).strip()
    if not s:
        return []
    return [x.strip() for x in re.split(r"[;,]", s) if x.strip()]

def looks_like_url(s: str) -> bool:
    return bool(re.match(r"^https?://", (s or "").strip(), flags=re.I))

def safe_load_json(path: Path) -> Dict[str, Any]:
    return json.loads(path.read_text(encoding="utf-8"))

# --- ISO3 inference (best-effort, no external dependencies required) ---
def try_import_pycountry():
    try:
        import pycountry  # type: ignore
        return pycountry
    except Exception:
        return None

_pycountry = try_import_pycountry()

COMMON_COUNTRY_FIXES = {
    "cote d'ivoire": "CIV",
    "ivory coast": "CIV",
    "democratic republic of the congo": "COD",
    "dr congo": "COD",
    "republic of the congo": "COG",
    "congo, rep.": "COG",
    "congo, dem. rep.": "COD",
    "lao pdr": "LAO",
    "viet nam": "VNM",
    "korea, rep.": "KOR",
    "korea, dem. rep.": "PRK",
    "syrian arab republic": "SYR",
    "iran, islamic republic of": "IRN",
    "tanzania, united republic of": "TZA",
    "venezuela, bolivarian republic of": "VEN",
    "bolivia, plurinational state of": "BOL",
    "moldova, republic of": "MDA",
    "palestine": "PSE",
    "russia": "RUS",
    "united states": "USA",
    "united kingdom": "GBR",
}


# Optional: a lightweight country name -> ISO3 table persisted to disk.
# This avoids forcing `pycountry` at runtime while still enabling country inference at scale.
COUNTRY_ISO3_CSV = (DUMP_DIR / "config" / "country_name_to_iso3.csv")

def _norm_country_key(s: str) -> str:
    s = (s or "").strip().lower()
    s = re.sub(r"[\(\)\[\]\{\}\.\,\;\:]", " ", s)
    s = s.replace("&", " and ")
    s = re.sub(r"[^a-z0-9\s\-']", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

COMMON_COUNTRY_FIXES_NORM: Dict[str, str] = {_norm_country_key(k): v for k, v in COMMON_COUNTRY_FIXES.items()}

def load_country_iso3_table(path: Path) -> Dict[str, str]:
    if not path.exists():
        return {}
    try:
        df_iso = pd.read_csv(path)
    except Exception:
        return {}
    cols = {c.lower(): c for c in df_iso.columns}
    name_col = cols.get("name") or cols.get("country") or cols.get("country_name")
    iso3_col = cols.get("iso3") or cols.get("alpha_3") or cols.get("code")
    if not name_col or not iso3_col:
        return {}
    out: Dict[str, str] = {}
    for _, r in df_iso.iterrows():
        name = str(r.get(name_col, "")).strip()
        iso3 = str(r.get(iso3_col, "")).strip().upper()
        if name and iso3 and len(iso3) == 3:
            out[_norm_country_key(name)] = iso3
    return out

COUNTRY_ISO3_TABLE: Dict[str, str] = load_country_iso3_table(COUNTRY_ISO3_CSV)

def maybe_generate_country_iso3_table(path: Path) -> None:
    """Generate a country name -> ISO3 table if pycountry is available and file is absent."""
    if path.exists():
        return
    if _pycountry is None:
        print(
            "NOTE: pycountry not installed; country inference may default to 'global'. "
            "To enable ISO3 mapping, install pycountry or provide: " + str(path)
        )
        return
    rows = []
    seen = set()
    for c in list(_pycountry.countries):  # type: ignore
        iso3 = getattr(c, "alpha_3", None)
        if not iso3:
            continue
        for attr in ["name", "official_name", "common_name"]:
            nm = getattr(c, attr, None)
            if nm:
                k = _norm_country_key(str(nm))
                if k and k not in seen:
                    seen.add(k)
                    rows.append({"name": str(nm), "iso3": str(iso3)})
    # Add common fixes/synonyms
    for k, iso3 in COMMON_COUNTRY_FIXES.items():
        k2 = _norm_country_key(k)
        if k2 and k2 not in seen:
            seen.add(k2)
            rows.append({"name": k, "iso3": iso3})
    path.parent.mkdir(parents=True, exist_ok=True)
    pd.DataFrame(rows).to_csv(path, index=False)
    print("Generated:", path)

maybe_generate_country_iso3_table(COUNTRY_ISO3_CSV)
# Reload after generation
if not COUNTRY_ISO3_TABLE and COUNTRY_ISO3_CSV.exists():
    COUNTRY_ISO3_TABLE = load_country_iso3_table(COUNTRY_ISO3_CSV)

def country_name_to_iso3(name: str) -> Optional[str]:
    n = (name or "").strip()
    if not n:
        return None

    # If the group already looks like an ISO3, accept it
    if len(n) == 3 and n.isalpha():
        return n.upper()

    key = _norm_country_key(n)

    # First: explicit fixes
    if key in COMMON_COUNTRY_FIXES_NORM:
        return COMMON_COUNTRY_FIXES_NORM[key]

    # Second: persisted table (if available)
    iso3 = COUNTRY_ISO3_TABLE.get(key)
    if iso3:
        return iso3

    # Third: pycountry lookup (optional dependency)
    if _pycountry is not None:
        try:
            c = _pycountry.countries.lookup(n)  # type: ignore
            return getattr(c, "alpha_3", None)
        except Exception:
            return None

    return None

def infer_spatial(groups: List[str]) -> Dict[str, Any]:
    """Infer RDLS spatial block from HDX country groups (best-effort)."""
    iso3s: List[str] = []
    for g in groups:
        iso3 = country_name_to_iso3(g)
        if iso3:
            iso3s.append(iso3)

    iso3s = sorted(set(iso3s))
    if len(iso3s) == 1:
        return {"scale": "national", "countries": iso3s}
    if len(iso3s) > 1:
        return {"scale": "regional", "countries": iso3s}
    return {"scale": "global"}

# ============================
# Hazard keyword → hazard_type
# ============================
# Used only for filename suffix inference (optional).
# Keep conservative; default to empty if no matches.

HAZARD_KEYWORDS_TO_TYPE = {
    # hydro
    "flood": "flood",
    "flooding": "flood",
    "river flood": "flood",
    "flash flood": "flood",
    "inundation": "flood",

    # drought
    "drought": "drought",
    "dry spell": "drought",
    "water scarcity": "drought",

    # storms / wind
    "cyclone": "windstorm",
    "hurricane": "windstorm",
    "typhoon": "windstorm",
    "windstorm": "windstorm",
    "storm": "windstorm",

    # heat / wildfire
    "heatwave": "heat",
    "extreme heat": "heat",
    "wildfire": "wildfire",
    "fire": "wildfire",

    # seismic
    "earthquake": "earthquake",
    "tsunami": "tsunami",

    # landslide
    "landslide": "landslide",
    "mudslide": "landslide",
    "avalanche": "landslide",
}

def infer_hazard_types(tags: List[str], title: str = "", notes: str = "") -> List[str]:
    text = " ".join([*tags, title or "", notes or ""]).lower()
    hits = set()
    for k, ht in HAZARD_KEYWORDS_TO_TYPE.items():
        if k in text:
            hits.add(ht)
    return sorted(hits)

def hazard_suffix_for_filename(hazard_types: List[str]) -> str:
    if not hazard_types:
        return ""
    if len(hazard_types) > 1:
        return "_multihazard"
    ht = hazard_types[0]
    ht_alias = HAZARD_FILENAME_ALIASES.get(ht, ht)
    return "_" + slugify_token(ht_alias, max_len=24)

# --- Risk components mapping (Step 5 -> RDLS enum) ---
COMPONENT_MAP = {
    "hazard": "hazard",
    "exposure": "exposure",
    "vulnerability_proxy": "vulnerability",
    "loss_impact": "loss",
    "vulnerability": "vulnerability",
    "loss": "loss",
}

def parse_components(s: Any) -> List[str]:
    parts = split_semicolon_list(s)
    out = []
    for p in parts:
        p2 = COMPONENT_MAP.get(p.strip().lower(), None)
        if p2:
            out.append(p2)
    # De-duplicate while preserving input order (no precedence sorting is applied here)
    seen = set()
    final = []
    for x in out:
        if x not in seen:
            final.append(x)
            seen.add(x)
    return final

# --- Naming prefix precedence ---
def choose_prefix(risk_data_type: List[str]) -> str:
    rset = set(risk_data_type)
    if "loss" in rset:
        return "rdls_lss-"
    if "vulnerability" in rset:
        return "rdls_vln-"
    if "exposure" in rset:
        return "rdls_exp-"
    return "rdls_hzd-"

def map_license(hdx_license_title: str) -> str:
    """Map HDX license strings into RDLS schema license suggestions when possible.

    Policy:
    - If we can confidently map to a schema-aligned identifier (e.g., CC-BY-4.0, ODbL-1.0), return that.
    - Otherwise, keep the original HDX string (openCodelist allows new values).
    """
    raw = (hdx_license_title or "").strip()
    if not raw:
        return ""

    key = raw.lower().strip()
    key = re.sub(r"\s+", " ", key)

    # High-confidence pattern mappings
    # CC0
    if re.search(r"\bcc0\b", key) or ("public domain" in key and "cc0" in key):
        return "CC0-1.0"

    # ODbL
    if "odbl" in key or "open database license" in key:
        return "ODbL-1.0"

    # ODC / PDDL
    if "pddl" in key or "public domain dedication" in key:
        return "PDDL-1.0"
    if "odc-by" in key or "odc by" in key:
        return "ODC-By-1.0"

    # Creative Commons variants
    # Normalize "creative commons" prefix
    k2 = key.replace("creative commons", "cc")

    # CC-BY
    if re.search(r"\bcc\s*by\b", k2) and "sa" not in k2 and "nd" not in k2 and "nc" not in k2:
        # Prefer explicit version 4.0 if present
        if "4.0" in k2 or "v4" in k2:
            return "CC-BY-4.0"
        if "3.0" in k2:
            return "CC-BY-3.0"
        return "CC-BY-4.0"  # no version stated: default to 4.0

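    # CC-BY-NC-SA (checked before BY-SA and BY-NC, whose patterns would otherwise match first)
    if "by-nc-sa" in k2 or re.search(r"\bcc\s*by\s*nc\s*sa\b", k2):
        if "4.0" in k2:
            return "CC-BY-NC-SA-4.0"
        if "3.0" in k2:
            return "CC-BY-NC-SA-3.0"
        return "CC-BY-NC-SA-4.0"

    # CC-BY-SA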
    if ("by-sa" in k2) or re.search(r"\bcc\s*by\s*sa\b", k2):
        if "4.0" in k2:
            return "CC-BY-SA-4.0"
        if "3.0" in k2:
            return "CC-BY-SA-3.0"
        return "CC-BY-SA-4.0"

    # CC-BY-NC
    if ("by-nc" in k2) or re.search(r"\bcc\s*by\s*nc\b", k2):
        if "4.0" in k2:
            return "CC-BY-NC-4.0"
        if "3.0" in k2:
            return "CC-BY-NC-3.0"
        return "CC-BY-NC-4.0"

    # CC-BY-ND
    if ("by-nd" in k2) or re.search(r"\bcc\s*by\s*nd\b", k2):
        if "4.0" in k2:
            return "CC BY-ND 4.0"
        if "3.0" in k2:
            return "CC BY-ND 3.0"
        return "CC BY-ND 4.0"

    # CC-BY-NC-SA
    if "by-nc-sa" in k2 or ("nc" in k2 and "sa" in k2 and "by" in k2):
        if "4.0" in k2:
            return "CC-BY-NC-SA-4.0"
        if "3.0" in k2:
            return "CC-BY-NC-SA-3.0"
        return "CC-BY-NC-SA-4.0"

    # Fallback to explicit mapping dict, if provided
    k3 = re.sub(r"\s+", " ", k2).strip()
    return HDX_LICENSE_TO_RDLS.get(k3, raw)


def map_data_format(hdx_fmt: str, url: str = "") -> Optional[str]:
    s = (hdx_fmt or "").strip().upper()
    if s in HDX_FORMAT_TO_RDLS:
        return HDX_FORMAT_TO_RDLS[s]
    # Guess from URL extension if needed
    u = (url or "").lower()
    for ext, rdls in [
        (".geojson", "GeoJSON (geojson)"),
        (".json", "JSON (json)"),
        (".csv", "CSV (csv)"),
        (".xlsx", "Excel (xlsx)"),
        (".xls", "Excel (xlsx)"),
        (".shp", "Shapefile (shp)"),
        (".zip", "Shapefile (shp)"),
        (".tif", "GeoTIFF (tif)"),
        (".tiff", "GeoTIFF (tif)"),
        (".nc", "NetCDF (nc)"),
        (".pdf", "PDF (pdf)"),
        (".parquet", "Parquet (parquet)"),
        (".gpkg", "GeoPackage (gpkg)"),
    ]:
        if u.endswith(ext):
            return rdls
    return None

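Keyword lookups based on plain substring tests can fire inside unrelated words (for example `fire` inside `ceasefire`). The sketch below shows one possible word-boundary variant; the keyword table is a shortened, hypothetical subset and `match_hazards` is not a function of this notebook.

```python
import re
from typing import Dict, List

# Shortened, illustrative keyword table.
KEYWORDS: Dict[str, str] = {"flood": "flood", "fire": "wildfire", "storm": "windstorm"}


def match_hazards(text: str, keywords: Dict[str, str]) -> List[str]:
    """Match keywords as whole words, allowing a simple plural or -ing ending."""
    lowered = text.lower()
    hits = set()
    for keyword, hazard_type in keywords.items():
        pattern = r"\b" + re.escape(keyword) + r"(?:s|es|ing)?\b"
        if re.search(pattern, lowered):
            hits.add(hazard_type)
    return sorted(hits)


no_hits = match_hazards("Ceasefire monitoring and brainstorm notes", KEYWORDS)
two_hits = match_hazards("Flash floods and fires in 2023", KEYWORDS)
print(no_hits)
print(two_hits)
```

The trade-off is that compound spellings without a word break (for example a hypothetical `floodplain`) no longer match, so the keyword table may need a few extra entries.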

### Cell 5

In [5]:
# ==========================
# Component gating / repair
# ==========================

AUTO_REPAIR_COMPONENTS = True  # set False if you want strict blocking

@dataclass(frozen=True)
class ComponentGateResult:
    ok: bool
    reasons: Tuple[str, ...]
    risk_data_type: List[str]  # must be subset of: hazard, exposure, vulnerability, loss

def apply_component_gate(components: List[str]) -> ComponentGateResult:
    """
    Enforce RDLS component combination rules for risk_data_type.

    Rules:
    - vulnerability must co-occur with hazard or exposure
    - loss must co-occur with hazard or exposure
    - risk_data_type must be non-empty and only include allowed values

    If AUTO_REPAIR_COMPONENTS=True:
      - vulnerability-only -> add exposure
      - loss-only -> add exposure
    """
    allowed = {"hazard", "exposure", "vulnerability", "loss"}
    rset = {c for c in (components or []) if c in allowed}

    if not rset:
        return ComponentGateResult(
            ok=False,
            reasons=("empty_or_unrecognized_components",),
            risk_data_type=[],
        )

    reasons: List[str] = []
    ok = True

    if "vulnerability" in rset and not ({"hazard", "exposure"} & rset):
        if AUTO_REPAIR_COMPONENTS:
            rset.add("exposure")
            reasons.append("auto_added_exposure_for_vulnerability")
        else:
            ok = False
            reasons.append("vulnerability_without_hazard_or_exposure")

    if "loss" in rset and not ({"hazard", "exposure"} & rset):
        if AUTO_REPAIR_COMPONENTS:
            rset.add("exposure")
            reasons.append("auto_added_exposure_for_loss")
        else:
            ok = False
            reasons.append("loss_without_hazard_or_exposure")

    return ComponentGateResult(ok=ok, reasons=tuple(reasons), risk_data_type=sorted(rset))

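The behaviour of the gate in both modes can be summarised with a compact, standalone re-statement of the rules. This is a sketch for experimentation, not a replacement for `apply_component_gate`; the function name `check_combination` and its tuple return value are assumptions.

```python
from typing import Iterable, List, Tuple

ALLOWED = {"hazard", "exposure", "vulnerability", "loss"}


def check_combination(components: Iterable[str], auto_repair: bool) -> Tuple[bool, List[str], List[str]]:
    """Return (ok, risk_data_type, reasons) under the combination rules."""
    present = set(components) & ALLOWED
    if not present:
        return False, [], ["empty_or_unrecognized_components"]
    ok = True
    reasons: List[str] = []
    for dependent in ("vulnerability", "loss"):
        if dependent in present and not present & {"hazard", "exposure"}:
            if auto_repair:
                present.add("exposure")
                reasons.append(f"auto_added_exposure_for_{dependent}")
            else:
                ok = False
                reasons.append(f"{dependent}_without_hazard_or_exposure")
    return ok, sorted(present), reasons


strict = check_combination(["loss"], auto_repair=False)
repaired = check_combination(["loss"], auto_repair=True)
print(strict)
print(repaired)
```

With auto-repair enabled, a loss-only dataset is written with `exposure` added, so the reasons list is the only trace that the original classification was incomplete.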

### Cell 6

In [6]:
# ======================
# Build RDLS dataset record (minimal, schema-safe)
# ======================
def build_attributions(hdx: Dict[str, Any], dataset_id: str, dataset_page_url: str) -> List[Dict[str, Any]]:
    org = (hdx.get("organization") or "").strip() or "Unknown publisher"
    src = (hdx.get("dataset_source") or "").strip() or org

    # Prefer dataset_source URL if it is a URL; otherwise use the dataset landing page
    creator_url = src if looks_like_url(src) else dataset_page_url

    return [
        {
            "id": "attribution_publisher",
            "role": "publisher",
            "entity": {"name": org, "url": dataset_page_url},
        },
        {
            "id": "attribution_creator",
            "role": "creator",
            "entity": {"name": src, "url": creator_url},
        },
        {
            "id": "attribution_contact",
            "role": "contact_point",
            "entity": {"name": org, "url": dataset_page_url},
        },
    ]

def build_resources(hdx: Dict[str, Any], dataset_id: str) -> List[Dict[str, Any]]:
    # Always include at least one safe resource: HDX metadata export JSON
    meta_url = f"https://data.humdata.org/dataset/{dataset_id}/download_metadata?format=json"
    resources: List[Dict[str, Any]] = [
        {
            "id": "hdx_dataset_metadata_json",
            "title": "HDX dataset metadata (JSON)",
            "description": "Dataset-level metadata exported from HDX.",
            "data_format": "JSON (json)",
            "access_modality": "file_download",
            "download_url": meta_url,
        }
    ]

    for r in hdx.get("resources", []) or []:
        rid = (r.get("id") or "").strip()
        rname = (r.get("name") or "").strip() or rid[:8] or "resource"
        desc = (r.get("description") or "").strip() or f"HDX resource: {rname}"
        dl = (r.get("download_url") or "").strip()
        fmt = map_data_format(r.get("format") or "", dl)
        if not dl or not fmt:
            # skip if we cannot provide required fields safely
            continue

        resources.append(
            {
                "id": f"hdx_res_{rid[:8] or slugify_token(rname, 8)}",
                "title": rname,
                "description": desc,
                "data_format": fmt,
                "access_modality": "file_download",
                "download_url": dl,
            }
        )

    # Ensure resources are unique by id
    seen = set()
    deduped = []
    for rr in resources:
        if rr["id"] not in seen:
            deduped.append(rr)
            seen.add(rr["id"])
    return deduped

def build_rdls_record(
    hdx: Dict[str, Any],
    class_row: pd.Series,
) -> Tuple[Optional[Dict[str, Any]], Dict[str, Any]]:
    """Return (rdls_record_or_none, info_dict). If blocked, rdls_record is None and info has reasons."""

    dataset_id = str(class_row["dataset_id"])
    title = (hdx.get("title") or class_row.get("title") or "").strip()
    notes = (hdx.get("notes") or "").strip()

    # Parse components (from Step 5) and apply gate
    components = parse_components(class_row.get("rdls_components"))
    gate = apply_component_gate(components)
    if not gate.ok:
        return None, {
            "dataset_id": dataset_id,
            "blocked": True,
            "blocked_reasons": ";".join(gate.reasons),
            "risk_data_type": ";".join(gate.risk_data_type),
        }

    # Spatial inference from groups
    groups = split_semicolon_list(class_row.get("groups"))
    spatial = infer_spatial(groups)

    # Hazard inference for naming (optional)
    tags = split_semicolon_list(class_row.get("tags"))
    hazard_types = infer_hazard_types(tags, title=title, notes=notes)

    dataset_page_url = f"https://data.humdata.org/dataset/{dataset_id}"

    # Entity token for naming:
    # Prefer HDX dataset slug (`name`) because it is usually descriptive and unique within HDX.
    # Fall back to title if needed.
    hdx_slug = slugify_token(str(hdx.get("name") or ""), max_len=48)
    title_slug = slugify_token(title, max_len=48)
    dataset_slug = hdx_slug if hdx_slug != "unknown" else title_slug

    # Organization token (short, stable)
    org_token = slugify_token(str(class_row.get("organization") or hdx.get("organization") or "unknown"), max_len=20)

    # Optional ISO3 token (only if exactly one country inferred)
    iso3_tok = ""
    if spatial.get("countries") and len(spatial["countries"]) == 1:
        iso3_tok = str(spatial["countries"][0]).lower()

    # Compose an informative identifier: org + optional iso3 + dataset slug
    parts = [org_token]
    if iso3_tok:
        parts.append(iso3_tok)
    parts.append(dataset_slug)
    entity_token = "_".join([p for p in parts if p])

    # Prefix follows your component priority rules, with explicit HDX provenance marker.
    prefix = choose_prefix(gate.risk_data_type) + "hdx_"

    # Hazard suffix is helpful mainly when hazard/loss is present.
    hz_suffix = hazard_suffix_for_filename(hazard_types) if ("hazard" in set(gate.risk_data_type) or "loss" in set(gate.risk_data_type)) else ""

    stem_base = f"{prefix}{entity_token}{hz_suffix}"
    stem = stem_base

    # Collision-proofing (deterministic, short)
    out_path = OUT_RECORDS_DIR / f"{stem}.json"
    if out_path.exists():
        stem = f"{stem_base}__{dataset_id[:8]}"
        out_path = OUT_RECORDS_DIR / f"{stem}.json"

    license_raw = str(class_row.get("license_title") or hdx.get("license_title") or "").strip()
    license_mapped = map_license(license_raw or "Custom")

    # Build minimal RDLS dataset record with schema-safe keys only
    rdls_ds: Dict[str, Any] = {
        "id": stem,
        "title": title or f"HDX dataset {dataset_id}",
        "description": notes or None,  # optional; omit if None below
        "risk_data_type": gate.risk_data_type,  # required
        "spatial": spatial,  # required
        "license": license_mapped,
        "attributions": build_attributions(hdx, dataset_id, dataset_page_url),  # required (minItems=3)
        "resources": build_resources(hdx, dataset_id),  # required (>=1)
        "links": [
            {
                "href": rdls_schema.get("$id") or "https://docs.riskdatalibrary.org/en/latest/reference/rdls_schema/",  # best-effort
                "rel": "describedby",
            }
        ],
    }

    # Remove optional keys with None (avoid violating minLength constraints on optional fields)
    rdls_ds = {k: v for k, v in rdls_ds.items() if v is not None}

    # Filter strictly to schema-allowed keys (no extras)
    rdls_ds = {k: v for k, v in rdls_ds.items() if k in RDLS_ALLOWED_KEYS}

    # Wrap in top-level structure (template style)
    rdls_record = {"datasets": [rdls_ds]}

    info = {
        "dataset_id": dataset_id,
        "rdls_id": stem,  # reported as output_id in the QA report
        "filename": f"{stem}.json",
        "risk_data_type": ";".join(gate.risk_data_type),
        # QA fields
        "spatial_scale": spatial.get("scale", ""),
        "countries_count": len(spatial.get("countries", []) or []),
        "license_raw": license_raw,
        "orgtoken": org_token,
        "hazard_suffix": hz_suffix.lstrip("_"),
        # Index fields
        "organization_token": org_token,  # same value as "orgtoken"; both keys are kept
        "iso3": iso3_tok,
        "hazard_types": ";".join(hazard_types),
        "blocked": False,
        "blocked_reasons": "",
    }

    return rdls_record, info

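The collision handling above depends on what is already on disk. As a point of comparison, here is a minimal sketch of the same suffix rule driven by an in-memory set of stems used during the current run. The helper name `unique_stem`, the `used_stems` set, and the example stems and ids are hypothetical and are not part of this notebook.

```python
from typing import Set


def unique_stem(stem_base: str, dataset_id: str, used_stems: Set[str]) -> str:
    """Return stem_base, or stem_base plus a short dataset-id suffix if already used in this run."""
    stem = stem_base
    if stem in used_stems:
        # Same fallback as in build_rdls_record: first 8 characters of the dataset id
        stem = f"{stem_base}__{dataset_id[:8]}"
    used_stems.add(stem)
    return stem


# Hypothetical example: two datasets that produce the same base stem
used: Set[str] = set()
first = unique_stem("rdls_exp-hdx_ocha_ken_roads", "a1b2c3d4-0000", used)
second = unique_stem("rdls_exp-hdx_ocha_ken_roads", "e5f6a7b8-1111", used)
print(first)   # rdls_exp-hdx_ocha_ken_roads
print(second)  # rdls_exp-hdx_ocha_ken_roads__e5f6a7b8
```

With this variant the stem assigned to a dataset does not change between runs as long as the order of `included_ids` is stable, which keeps `SKIP_EXISTING` meaningful. Whether that trade-off suits this pipeline is a judgment call.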

### Cell 7

In [7]:
# ======================
# Validate + write outputs
# ======================

from typing import List, Dict, Any, Tuple

qa_rows: List[Dict[str, Any]] = []

def try_import_jsonschema():
    try:
        import jsonschema  # type: ignore
        return jsonschema
    except Exception:
        return None

_jsonschema = try_import_jsonschema()

validator = None
if _jsonschema is not None:
    try:
        validator = _jsonschema.Draft202012Validator(rdls_schema)  # type: ignore
        print("jsonschema validation enabled (Draft2020-12).")
    except Exception as e:
        print("WARNING: jsonschema available but validator init failed:", e)
        validator = None
else:
    print("WARNING: jsonschema not installed; schema validation will be skipped.")

def validate_record(rec: Dict[str, Any]) -> Tuple[bool, str]:
    """Validate the dataset inside an RDLS wrapper of the form {'datasets': [...]}.

    Returns (ok, message). The schema is applied to rec['datasets'][0]. If no
    validator is available, the record is reported as valid with an empty message.
    """
    if validator is None:
        return True, ""
    errors = sorted(validator.iter_errors(rec["datasets"][0]), key=lambda e: e.path)
    if not errors:
        return True, ""
    # Report at most the first 10 errors as "path: message"
    msgs = []
    for e in errors[:10]:
        path = ".".join(str(p) for p in e.path)
        msgs.append(f"{path}: {e.message}")
    return False, " | ".join(msgs)

def append_jsonl(path: Path, obj: Dict[str, Any]) -> None:
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(obj, ensure_ascii=False) + "\n")

# Start each run with an empty index. Records skipped via SKIP_EXISTING are not re-appended below.
OUT_INDEX_JSONL.write_text("", encoding="utf-8")

blocked_rows: List[Dict[str, Any]] = []
validation_rows: List[Dict[str, Any]] = []

written = 0
skipped_existing = 0
blocked = 0
validated_ok = 0

for dataset_id in included_ids:
    fp = dataset_file_index.get(dataset_id)
    if fp is None or not fp.exists():
        blocked += 1
        blocked_rows.append(
            {
                "dataset_id": dataset_id,
                "status": "blocked_missing_hdx_dataset_json",
                "reason": "missing_hdx_dataset_json",
                "risk_data_type": "",
            }
        )
        qa_rows.append(
            {
                "dataset_id": dataset_id,
                "output_id": "",
                "filename": "",
                "risk_data_type": "",
                "spatial_scale": "",
                "countries_count": 0,
                "license_raw": "",
                "orgtoken": "",
                "hazard_suffix": "",
                "status": "blocked_missing_hdx_dataset_json",
                "reason": "missing_hdx_dataset_json",
            }
        )
        continue

    hdx = safe_load_json(fp)
    row = df.loc[dataset_id]

    rdls_rec, info = build_rdls_record(hdx, row)
    if rdls_rec is None:
        blocked += 1
        reason = info.get("blocked_reasons") or "blocked_by_policy"
        rdt = info.get("risk_data_type") or ""
        blocked_rows.append(
            {
                "dataset_id": dataset_id,
                "status": "blocked_by_policy",
                "reason": reason,
                "risk_data_type": rdt,
            }
        )
        qa_rows.append(
            {
                "dataset_id": dataset_id,
                "output_id": "",
                "filename": "",
                "risk_data_type": rdt,
                "spatial_scale": "",
                "countries_count": 0,
                "license_raw": "",
                "orgtoken": "",
                "hazard_suffix": "",
                "status": "blocked_by_policy",
                "reason": reason,
            }
        )
        continue

    out_path = OUT_RECORDS_DIR / info["filename"]
    if SKIP_EXISTING and out_path.exists():
        skipped_existing += 1
        qa_rows.append(
            {
                "dataset_id": dataset_id,
                "output_id": info.get("rdls_id", ""),
                "filename": info.get("filename", ""),
                "risk_data_type": info.get("risk_data_type", ""),
                "spatial_scale": info.get("spatial_scale", ""),
                "countries_count": info.get("countries_count", 0),
                "license_raw": info.get("license_raw", ""),
                "orgtoken": info.get("orgtoken", ""),
                "hazard_suffix": info.get("hazard_suffix", ""),
                "status": "skipped_existing",
                "reason": "",
            }
        )
        continue

    # Validate
    ok, msg = validate_record(rdls_rec)
    validation_rows.append(
        {
            "dataset_id": dataset_id,
            "rdls_id": info["rdls_id"],
            "filename": info["filename"],
            "valid": ok,
            "message": msg,
        }
    )
    if ok:
        validated_ok += 1

    # Write JSON (indented when WRITE_PRETTY_JSON is set, compact otherwise)
    indent = 2 if WRITE_PRETTY_JSON else None
    out_path.write_text(json.dumps(rdls_rec, indent=indent, ensure_ascii=False) + "\n", encoding="utf-8")

    append_jsonl(OUT_INDEX_JSONL, info)
    written += 1

    qa_rows.append(
        {
            "dataset_id": dataset_id,
            "output_id": info.get("rdls_id", ""),
            "filename": info.get("filename", ""),
            "risk_data_type": info.get("risk_data_type", ""),
            "spatial_scale": info.get("spatial_scale", ""),
            "countries_count": info.get("countries_count", 0),
            "license_raw": info.get("license_raw", ""),
            "orgtoken": info.get("orgtoken", ""),
            "hazard_suffix": info.get("hazard_suffix", ""),
            "status": "written",
            "reason": "",
        }
    )

print("Written:", written)
print("Skipped existing:", skipped_existing)
print("Blocked:", blocked)
print("Schema valid:", validated_ok, "of", len(validation_rows))

# Save reports (always write headers, even if empty)
blocked_df = pd.DataFrame(blocked_rows, columns=["dataset_id", "status", "reason", "risk_data_type"])
blocked_df.to_csv(OUT_BLOCKED_CSV, index=False)

val_df = pd.DataFrame(validation_rows, columns=["dataset_id", "rdls_id", "filename", "valid", "message"])
val_df.to_csv(OUT_VALIDATION_CSV, index=False)

qa_df = pd.DataFrame(
    qa_rows,
    columns=[
        "dataset_id",
        "output_id",
        "filename",
        "risk_data_type",
        "spatial_scale",
        "countries_count",
        "license_raw",
        "orgtoken",
        "hazard_suffix",
        "status",
        "reason",
    ],
)
qa_df.to_csv(OUT_QA_CSV, index=False)

print("Wrote:", OUT_INDEX_JSONL)
print("Wrote:", OUT_BLOCKED_CSV)
print("Wrote:", OUT_VALIDATION_CSV)
print("Wrote:", OUT_QA_CSV)


jsonschema validation enabled (Draft2020-12).
Written: 50
Skipped existing: 0
Blocked: 0
Schema valid: 50 of 50
Wrote: C:\Users\benny\OneDrive\Documents\Github\hdx-metadata-crawler\hdx_dataset_metadata_dump\rdls\index\rdls_index.jsonl
Wrote: C:\Users\benny\OneDrive\Documents\Github\hdx-metadata-crawler\hdx_dataset_metadata_dump\rdls\reports\translation_blocked.csv
Wrote: C:\Users\benny\OneDrive\Documents\Github\hdx-metadata-crawler\hdx_dataset_metadata_dump\rdls\reports\schema_validation.csv
Wrote: C:\Users\benny\OneDrive\Documents\Github\hdx-metadata-crawler\hdx_dataset_metadata_dump\rdls\reports\translation_qa.csv

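Cell 7 builds the same QA row in four places (two blocked cases, skipped, written). The repetition could be folded into one helper, sketched below. `make_qa_row` and `QA_COLUMNS` are hypothetical names; the column list simply mirrors the columns passed to `qa_df` above.

```python
from typing import Any, Dict, Optional

# Same column order as the qa_df DataFrame in Cell 7
QA_COLUMNS = [
    "dataset_id", "output_id", "filename", "risk_data_type", "spatial_scale",
    "countries_count", "license_raw", "orgtoken", "hazard_suffix", "status", "reason",
]


def make_qa_row(
    dataset_id: str,
    status: str,
    reason: str = "",
    info: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build one QA row; fields missing from `info` fall back to blank values."""
    info = info or {}
    return {
        "dataset_id": dataset_id,
        "output_id": info.get("rdls_id", ""),
        "filename": info.get("filename", ""),
        "risk_data_type": info.get("risk_data_type", ""),
        "spatial_scale": info.get("spatial_scale", ""),
        "countries_count": info.get("countries_count", 0),
        "license_raw": info.get("license_raw", ""),
        "orgtoken": info.get("orgtoken", ""),
        "hazard_suffix": info.get("hazard_suffix", ""),
        "status": status,
        "reason": reason,
    }


# A blocked dataset with no info, and a written one with partial info
blocked_row = make_qa_row("abc", "blocked_missing_hdx_dataset_json", "missing_hdx_dataset_json")
written_row = make_qa_row("def", "written", info={"rdls_id": "rdls_exp-hdx_x", "countries_count": 1})
```

In the loop, each `qa_rows.append({...})` would then become a single call such as `qa_rows.append(make_qa_row(dataset_id, "written", info=info))`. For the policy-blocked case, passing `info={"risk_data_type": rdt}` reproduces the current row.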

### Cell 8

In [8]:
# ======================
# Quick QA: summarize what happened
# ======================
from pandas.errors import EmptyDataError

def safe_read_csv(path: Path) -> pd.DataFrame:
    """
    Read a CSV safely:
    - returns empty DataFrame if file is empty or has no parsable columns.
    """
    try:
        return pd.read_csv(path)
    except EmptyDataError:
        return pd.DataFrame()

idx_lines = OUT_INDEX_JSONL.read_text(encoding="utf-8").strip().splitlines()
print("Index lines:", len(idx_lines))
print("Records on disk:", len(list(OUT_RECORDS_DIR.glob("*.json"))))

if OUT_BLOCKED_CSV.exists():
    blocked_df = safe_read_csv(OUT_BLOCKED_CSV)
    print("Blocked rows:", len(blocked_df))
    if not blocked_df.empty and "reason" in blocked_df.columns:
        print("Blocked reasons (top 10):")
        print(blocked_df["reason"].value_counts().head(10))

if OUT_VALIDATION_CSV.exists():
    val_df = safe_read_csv(OUT_VALIDATION_CSV)
    if not val_df.empty:
        failures = val_df.loc[val_df["valid"] == False, "message"]
        print("Validation failures:", len(failures))
        if len(failures) > 0:
            print("Validation failures (top 10):")
            print(failures.value_counts().head(10))

# QA summary (translation_qa.csv is written in Cell 7)
if "OUT_QA_CSV" in globals() and OUT_QA_CSV.exists():
    qa_df = safe_read_csv(OUT_QA_CSV)
    if not qa_df.empty and "status" in qa_df.columns:
        print("QA status counts:")
        print(qa_df["status"].value_counts())


Index lines: 50
Records on disk: 50
Blocked rows: 0
Validation failures: 0
QA status counts:
status
written    50
Name: count, dtype: int64

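Matching counts ("Index lines: 50", "Records on disk: 50") do not by themselves show that the index and the records directory refer to the same files. A minimal cross-check, assuming each index line carries a `filename` key as written by `append_jsonl` in Cell 7, could look like this. The function name `index_vs_disk` and the demo files are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Set, Tuple


def index_vs_disk(index_path: Path, records_dir: Path) -> Tuple[Set[str], Set[str]]:
    """Return (filenames indexed but missing on disk, filenames on disk but not indexed)."""
    indexed: Set[str] = set()
    for line in index_path.read_text(encoding="utf-8").splitlines():
        if line.strip():
            indexed.add(json.loads(line)["filename"])
    on_disk = {p.name for p in records_dir.glob("*.json")}
    return indexed - on_disk, on_disk - indexed


# Demo on a throwaway directory: "b.json" is indexed but absent, "stale.json" is unindexed
with tempfile.TemporaryDirectory() as tmp:
    records = Path(tmp) / "records"
    records.mkdir()
    (records / "a.json").write_text("{}", encoding="utf-8")
    (records / "stale.json").write_text("{}", encoding="utf-8")
    index = Path(tmp) / "rdls_index.jsonl"
    index.write_text(
        json.dumps({"filename": "a.json"}) + "\n" + json.dumps({"filename": "b.json"}) + "\n",
        encoding="utf-8",
    )
    missing, unindexed = index_vs_disk(index, records)

print("Indexed but missing:", missing)
print("On disk but not indexed:", unindexed)
```

In this notebook the call would be `index_vs_disk(OUT_INDEX_JSONL, OUT_RECORDS_DIR)`. Two empty sets mean the index and the records directory agree.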