# Step 6 - Translate HDX Metadata to RDLS v0.3 JSON

**Purpose:** Transform HDX dataset-level metadata exports into RDLS v0.3 metadata records.

**Process:**
1. Load classification results and included dataset IDs from Step 5
2. Build RDLS-compliant records with proper attributions, resources, and spatial info
3. Apply component gating rules: vulnerability (V) and loss (L) must be accompanied by hazard (H) or exposure (E)
4. Validate against JSON Schema and write outputs

**Author**: Benny Istanto/Risk Data Librarian/GFDRR  
**Version**: 2026.1

---

## Inputs
- Step 5 outputs:
  - `hdx_dataset_metadata_dump/derived/classification_final.csv`
  - `hdx_dataset_metadata_dump/derived/rdls_included_dataset_ids_final.txt`
- HDX dataset metadata JSON corpus:
  - `hdx_dataset_metadata_dump/dataset_metadata/*.json`
- RDLS v0.3 assets:
  - `rdls_schema_v0.3.json`
  - `rdls_template_v03.json`

## Outputs
- `hdx_dataset_metadata_dump/rdls/records/*.json` — one RDLS record per included HDX dataset
- `hdx_dataset_metadata_dump/rdls/index/rdls_index.jsonl` — index of written records
- `hdx_dataset_metadata_dump/rdls/reports/translation_blocked.csv` — datasets blocked by policy/required-field gaps
- `hdx_dataset_metadata_dump/rdls/reports/schema_validation.csv` — JSON Schema validation results

## Strictness & Policy
- **Schema-first:** required RDLS fields are always populated; optional fields are omitted unless they can be filled safely
- **No extra fields:** output contains only fields defined in the RDLS schema
- **No invented content:** when a value is missing, the optional field is left out rather than written as an empty string
- **Open codelists:** values that are not among the schema's suggested entries are kept as-is
- **Component combination rule:**
  - hazard-only and exposure-only are allowed
  - vulnerability must accompany hazard or exposure
  - loss must accompany hazard or exposure
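
The combination rule can be sketched as a small gating function. This is a minimal illustration under stated assumptions, not the notebook's own implementation: the name `gate_components` and the status labels are hypothetical, inputs are assumed to be already normalized to the four RDLS component names, and the repair branch follows the behaviour described for `auto_repair_components` (adding exposure to a standalone vulnerability or loss record).

```python
from typing import List, Tuple

PRIMARY = {"hazard", "exposure"}        # allowed on their own
DEPENDENT = {"vulnerability", "loss"}   # need a primary companion


def gate_components(components: List[str], auto_repair: bool = False) -> Tuple[List[str], str]:
    """Return (components, status); status is 'ok', 'repaired', or 'blocked' (hypothetical labels)."""
    present = list(dict.fromkeys(components))  # de-duplicate, keep first-seen order
    if not present:
        return present, "blocked"
    has_primary = any(c in PRIMARY for c in present)
    has_dependent = any(c in DEPENDENT for c in present)
    if has_dependent and not has_primary:
        if auto_repair:
            # Assumed repair policy: prepend exposure so the record passes the rule
            return ["exposure", *present], "repaired"
        return present, "blocked"
    return present, "ok"


print(gate_components(["hazard"]))
print(gate_components(["vulnerability"]))
print(gate_components(["loss"], auto_repair=True))
```

Returning a status alongside the component list keeps the decision auditable, which fits a workflow that writes blocked datasets to a report.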

## 1. Setup and Configuration

In [1]:
"""
Setup: Import libraries and configure paths.

Configuration Options:
    MAX_DATASETS: Limit number of datasets to process (None for all, 50 for testing)
    OUTPUT_MODE: 'in_place' or 'run_folder' for versioned outputs
    SKIP_EXISTING: Resume-safe mode to skip already processed records
    WRITE_PRETTY_JSON: Pretty-print JSON for readability
"""
from __future__ import annotations

import json
import re
from copy import deepcopy
from dataclasses import dataclass, field
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, Iterable, List, Optional, Tuple

import pandas as pd

# --- tqdm with graceful fallback ---
try:
    from tqdm.auto import tqdm
    TQDM_AVAILABLE = True
except ImportError:
    TQDM_AVAILABLE = False
    def tqdm(iterable, **kwargs):
        """Fallback: return iterable unchanged if tqdm not installed."""
        return iterable

print(f"tqdm available: {TQDM_AVAILABLE}")


@dataclass
class TranslationConfig:
    """
    Configuration for HDX to RDLS translation.
    
    Attributes:
        dump_dir: Root directory for HDX metadata dump
        max_datasets: Limit number of datasets (None for all)
        output_mode: 'in_place' or 'run_folder'
        clean_before_run: Clean existing outputs before writing (in_place only)
        skip_existing: Skip already processed records
        write_pretty_json: Pretty-print JSON output
        auto_repair_components: Auto-add exposure for standalone V/L
    """
    dump_dir: Path = field(default_factory=lambda: (Path("..") / "hdx_dataset_metadata_dump").resolve())
    max_datasets: Optional[int] = None  # Set to None for production
    output_mode: str = "in_place"  # "in_place" | "run_folder"
    clean_before_run: bool = True
    skip_existing: bool = False
    write_pretty_json: bool = True
    auto_repair_components: bool = True
    
    def __post_init__(self):
        """Validate configuration."""
        if self.output_mode == "run_folder" and self.clean_before_run:
            raise ValueError("clean_before_run cannot be True when output_mode='run_folder'")


# Initialize configuration
config = TranslationConfig()

# ── Output cleanup mode ───────────────────────────────────────────────
# Controls what happens to old output files when this notebook is re-run.
#   "replace" - Auto-delete old outputs and continue (default)
#   "prompt"  - Show what will be deleted, ask user to confirm
#   "skip"    - Keep old files, write new on top (may leave orphans)
#   "abort"   - Stop if old outputs exist (for CI/automated runs)
CLEANUP_MODE = "replace"

# --- Resolve paths ---
DUMP_DIR = config.dump_dir
DATASET_DIR = (DUMP_DIR / "dataset_metadata").resolve()
DERIVED_DIR = (DUMP_DIR / "derived").resolve()
CLASSIFICATION_FINAL_CSV = (DERIVED_DIR / "classification_final.csv").resolve()
INCLUDED_IDS_TXT = (DERIVED_DIR / "rdls_included_dataset_ids_final.txt").resolve()

# RDLS assets
RDLS_DIR = (DUMP_DIR / "rdls").resolve()
RDLS_SCHEMA_PATH = (RDLS_DIR / "schema" / "rdls_schema_v0.3.json").resolve()
RDLS_TEMPLATE_PATH = (RDLS_DIR / "template" / "rdls_template_v03.json").resolve()

# Resolve run directory
if config.output_mode == "run_folder":
    RUN_ID = datetime.now().strftime("%Y%m%d_%H%M%S")
    RDLS_RUN_DIR = (RDLS_DIR / "runs" / RUN_ID).resolve()
else:
    RUN_ID = "in_place"
    RDLS_RUN_DIR = RDLS_DIR

# Output directories
OUT_RECORDS_DIR = RDLS_RUN_DIR / "records"
OUT_INDEX_DIR = RDLS_RUN_DIR / "index"
OUT_REPORTS_DIR = RDLS_RUN_DIR / "reports"

OUT_INDEX_JSONL = OUT_INDEX_DIR / "rdls_index.jsonl"
OUT_BLOCKED_CSV = OUT_REPORTS_DIR / "translation_blocked.csv"
OUT_VALIDATION_CSV = OUT_REPORTS_DIR / "schema_validation.csv"
OUT_QA_CSV = OUT_REPORTS_DIR / "translation_qa.csv"

# Ensure output folders exist
for p in [OUT_RECORDS_DIR, OUT_INDEX_DIR, OUT_REPORTS_DIR]:
    p.mkdir(parents=True, exist_ok=True)

print("Configuration:")
print(f"  DUMP_DIR: {DUMP_DIR}")
print(f"  DATASET_DIR: {DATASET_DIR}")
print(f"  OUTPUT_MODE: {config.output_mode}")
print(f"  RUN_ID: {RUN_ID}")
print(f"  MAX_DATASETS: {config.max_datasets}")
print(f"  SKIP_EXISTING: {config.skip_existing}")
print(f"  CLEANUP_MODE: {CLEANUP_MODE}")

tqdm available: True
Configuration:
  DUMP_DIR: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump
  DATASET_DIR: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump/dataset_metadata
  OUTPUT_MODE: in_place
  RUN_ID: in_place
  MAX_DATASETS: None
  SKIP_EXISTING: False
  CLEANUP_MODE: replace
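
The interaction between `output_mode` and `clean_before_run` is easiest to see in isolation. The sketch below uses `RunConfig`, a hypothetical two-field stand-in for `TranslationConfig`, and a hypothetical helper `resolve_run_dir` that mirrors the path logic of the cell above. The timestamp is passed in explicitly so the result is reproducible.

```python
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path


@dataclass
class RunConfig:
    """Hypothetical, trimmed stand-in for TranslationConfig (two fields only)."""
    output_mode: str = "in_place"
    clean_before_run: bool = True

    def __post_init__(self) -> None:
        # Same guard as the full config: a fresh run folder has nothing to clean
        if self.output_mode == "run_folder" and self.clean_before_run:
            raise ValueError("clean_before_run cannot be True when output_mode='run_folder'")


def resolve_run_dir(rdls_dir: Path, cfg: RunConfig, now: datetime) -> Path:
    """Return a timestamped subfolder for 'run_folder', otherwise the RDLS root."""
    if cfg.output_mode == "run_folder":
        return rdls_dir / "runs" / now.strftime("%Y%m%d_%H%M%S")
    return rdls_dir


cfg = RunConfig(output_mode="run_folder", clean_before_run=False)
print(resolve_run_dir(Path("rdls"), cfg, datetime(2026, 1, 2, 3, 4, 5)))
```

Because `clean_before_run` defaults to `True`, switching to `run_folder` requires setting it to `False` in the same call; otherwise the constructor raises.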


## 2. Mapping Configuration

In [2]:
"""
Mapping configurations for HDX to RDLS translation.

Includes:
    - Hazard filename aliases
    - Resource format mapping (HDX -> RDLS)
    - License mapping (HDX -> RDLS)
    - Hazard keywords for inference
"""

# --- Hazard inference / filename alias ---
HAZARD_FILENAME_ALIASES: Dict[str, str] = {
    "strong_wind": "windstorm",
    # Add more aliases as needed
}

# --- Resource format mapping (HDX -> RDLS data_format enum) ---
# RDLS data_format is a CLOSED codelist with 21 values.
# All HDX formats must map to one of these or be skipped.
HDX_FORMAT_TO_RDLS: Dict[str, str] = {
    # Direct matches
    "CSV": "CSV (csv)",
    "XLS": "Excel (xlsx)",
    "XLSX": "Excel (xlsx)",
    "EXCEL": "Excel (xlsx)",
    "JSON": "JSON (json)",
    "GEOJSON": "GeoJSON (geojson)",
    "SHP": "Shapefile (shp)",
    "SHAPEFILE": "Shapefile (shp)",
    "GPKG": "GeoPackage (gpkg)",
    "GEOPACKAGE": "GeoPackage (gpkg)",
    "KML": "KML (kml)",
    "PDF": "PDF (pdf)",
    "NC": "NetCDF (nc)",
    "NETCDF": "NetCDF (nc)",
    "TIF": "GeoTIFF (tif)",
    "TIFF": "GeoTIFF (tif)",
    "COG": "Cloud Optimized GeoTIFF (cog)",
    "PARQUET": "Parquet (parquet)",
    "APACHE PARQUET": "Parquet (parquet)",
    "XML": "XML (xml)",
    # Extended mappings for previously-unmapped HDX formats
    "GEOTIFF": "GeoTIFF (tif)",
    "GEODATABASE": "File Geodatabase (gdb)",
    "GDB": "File Geodatabase (gdb)",
    "TOPOJSON": "GeoJSON (geojson)",          # TopoJSON is a GeoJSON extension
    "MBTILES": "GeoPackage (gpkg)",            # Closest tile-based spatial format
    "KMZ": "KML (kml)",                        # KMZ is compressed KML
    "TSV": "CSV (csv)",                         # Tab-separated is CSV variant
    "TXT": "CSV (csv)",                         # Text tabular data
    "GOOGLE SHEET": "Excel (xlsx)",             # Spreadsheet equivalent
    "ESRI MAP PACKAGE": "File Geodatabase (gdb)",  # Contains geodata
    "GDAL VIRTUAL FORMAT": "GeoTIFF (tif)",    # Virtual raster reference
    "SQL": "CSV (csv)",                         # Tabular data
    "DOCX": "PDF (pdf)",                        # Document format
    "GRIB": "GRIB (grib)",
    "HDF5": "HDF5 (hdf5)",
    "HDF": "HDF5 (hdf5)",
    "ZARR": "Zarr (zarr)",
    "FGB": "FlatGeobuf (fgb)",
    "FLATGEOBUF": "FlatGeobuf (fgb)",
    "LAS": "LAS (las)",
    "LAZ": "LAS (las)",
    "COPC": "COPC (copc)",
    "GRD": "GRID (grd)",
    "GRID": "GRID (grd)",
}

# Formats to skip entirely (not downloadable data files)
HDX_FORMATS_SKIP: set = {
    "HTML", "PNG", "JPEG", "JPG", "GIF", "EMF",   # Images/web pages
    "WEB APP",                                       # Not a data resource
    "RAR",                                           # Archive container
    "QGIS", "ESRI ARCMAP PROJECT FILE",             # Project files
}

# Service format -> (data_format, access_modality)
# Previously in HDX_FORMATS_SKIP, now handled properly
HDX_SERVICE_FORMATS: Dict[str, Tuple[str, str]] = {
    'GEOSERVICE': ('GeoJSON (geojson)', 'REST'),
    'API':        ('JSON (json)', 'API'),
}

# --- License mapping (HDX -> RDLS) ---
HDX_LICENSE_TO_RDLS: Dict[str, str] = {
    "public domain": "PDDL-1.0",
    "odbl": "ODbL-1.0",
    "cc-by-4.0": "CC-BY-4.0",
    "cc by 4.0": "CC-BY-4.0",
    "cc-by": "CC-BY-4.0",
    "cc0": "CC0-1.0",
    "cc0-1.0": "CC0-1.0",
    "copyright": "Copyright",
}

# --- Hazard keyword to type mapping ---
HAZARD_KEYWORDS_TO_TYPE: Dict[str, str] = {
    # Hydro
    "flood": "flood", "flooding": "flood", "river flood": "flood",
    "flash flood": "flood", "inundation": "flood",
    # Drought
    "drought": "drought", "dry spell": "drought", "water scarcity": "drought",
    # Storms / wind
    "cyclone": "windstorm", "hurricane": "windstorm", "typhoon": "windstorm",
    "windstorm": "windstorm", "storm": "windstorm",
    # Heat / wildfire
    "heatwave": "heat", "extreme heat": "heat",
    "wildfire": "wildfire", "fire": "wildfire",
    # Seismic
    "earthquake": "earthquake", "tsunami": "tsunami",
    # Landslide
    "landslide": "landslide", "mudslide": "landslide", "avalanche": "landslide",
}

# --- Risk components mapping (Step 5 -> RDLS enum) ---
COMPONENT_MAP: Dict[str, str] = {
    "hazard": "hazard",
    "exposure": "exposure",
    "vulnerability_proxy": "vulnerability",
    "loss_impact": "loss",
    "vulnerability": "vulnerability",
    "loss": "loss",
}

print(f"Format mappings loaded: {len(HDX_FORMAT_TO_RDLS)} entries (+ {len(HDX_FORMATS_SKIP)} skip formats)")
print(f"Hazard keywords loaded: {len(HAZARD_KEYWORDS_TO_TYPE)} entries")

Format mappings loaded: 44 entries (+ 10 skip formats)
Hazard keywords loaded: 22 entries
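
Short keys in the hazard keyword table, such as `fire` and `storm`, also occur inside unrelated words (`ceasefire`, `brainstorm`) when matched as plain substrings. The sketch below shows one possible alternative, whole-word matching with an optional plural `s`. It is an illustration only: `KEYWORDS` is a small excerpt of the table, and `match_hazards` is a hypothetical name, not a function from this notebook.

```python
import re
from typing import Dict, List

# Small excerpt of the keyword table, for illustration only
KEYWORDS: Dict[str, str] = {
    "flood": "flood", "flash flood": "flood",
    "storm": "windstorm",
    "fire": "wildfire", "wildfire": "wildfire",
}

# Longest keys first so multi-word phrases are tried before their shorter parts
_ALTERNATION = "|".join(re.escape(k) for k in sorted(KEYWORDS, key=len, reverse=True))
_PATTERN = re.compile(rf"\b({_ALTERNATION})s?\b")


def match_hazards(text: str) -> List[str]:
    """Return sorted hazard types whose keywords occur as whole words in text."""
    return sorted({KEYWORDS[m.group(1)] for m in _PATTERN.finditer(text.lower())})


print(match_hazards("Flash floods and wildfire extents"))
print(match_hazards("Ceasefire monitoring and brainstorm notes"))
```

The trade-off is recall: whole-word matching misses inflections other than the simple plural, so the keyword list may need a few extra entries if this approach is adopted.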


## 3. Load RDLS Schema and Template

In [3]:
"""
Load RDLS schema and template for validation and record building.

Raises:
    FileNotFoundError: If schema or template files are missing.
"""

def safe_load_json(path: Path) -> Dict[str, Any]:
    """Load JSON file with UTF-8 encoding."""
    return json.loads(path.read_text(encoding="utf-8"))


# Validate schema/template existence
if not RDLS_SCHEMA_PATH.exists():
    raise FileNotFoundError(f"RDLS schema not found: {RDLS_SCHEMA_PATH}")

if not RDLS_TEMPLATE_PATH.exists():
    raise FileNotFoundError(f"RDLS template not found: {RDLS_TEMPLATE_PATH}")

rdls_schema = safe_load_json(RDLS_SCHEMA_PATH)
rdls_template = safe_load_json(RDLS_TEMPLATE_PATH)

# RDLS dataset object allowed and required keys
RDLS_ALLOWED_KEYS = set(rdls_schema["properties"].keys())
RDLS_REQUIRED_KEYS = list(rdls_schema.get("required", []))

print(f"RDLS required keys: {RDLS_REQUIRED_KEYS}")
print(f"RDLS allowed keys count: {len(RDLS_ALLOWED_KEYS)}")

RDLS required keys: ['id', 'title', 'risk_data_type', 'attributions', 'spatial', 'license', 'resources']
RDLS allowed keys count: 20
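
The allowed and required key sets support two of the policies stated at the top: no extra fields, and no empty placeholders. A minimal sketch of how they might be applied to a candidate record follows. `TOY_SCHEMA` is a made-up three-field schema, not the RDLS schema, and `prune_and_check` is a hypothetical helper name.

```python
from typing import Any, Dict, List, Tuple

# Toy schema for illustration; the real RDLS schema defines many more properties
TOY_SCHEMA: Dict[str, Any] = {
    "properties": {"id": {}, "title": {}, "license": {}},
    "required": ["id", "title", "license"],
}

EMPTY_VALUES = (None, "", [], {})


def prune_and_check(record: Dict[str, Any], schema: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Drop keys the schema does not define and empty values; list missing required keys."""
    allowed = set(schema.get("properties", {}))
    required = list(schema.get("required", []))
    pruned = {k: v for k, v in record.items() if k in allowed and v not in EMPTY_VALUES}
    missing = [k for k in required if k not in pruned]
    return pruned, missing


candidate = {"id": "rdls_hzd-example", "title": "Example", "license": "", "hdx_extra": 1}
pruned, missing = prune_and_check(candidate, TOY_SCHEMA)
print(pruned)
print(missing)
```

A non-empty `missing` list is the kind of required-field gap that would send a dataset to the blocked report instead of the records folder.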


## 4. Load Step 5 Outputs

In [4]:
"""
Load classification results and build dataset index.

Loads:
    - classification_final.csv from Step 5
    - included dataset IDs list
    - Builds dataset_id -> file path index
"""

def read_ids_txt(path: Path) -> List[str]:
    """
    Read dataset IDs from text file (one per line).
    
    Parameters:
        path: Path to the IDs file
        
    Returns:
        List of dataset IDs
    """
    if not path.exists():
        raise FileNotFoundError(f"Missing IDs list: {path}")
    return [line.strip() for line in path.read_text(encoding="utf-8").splitlines() if line.strip()]


# Validate inputs exist
for path, name in [
    (CLASSIFICATION_FINAL_CSV, "Classification CSV"),
    (INCLUDED_IDS_TXT, "Included IDs list"),
    (DATASET_DIR, "Dataset metadata folder"),
]:
    if not path.exists():
        raise FileNotFoundError(f"Missing required input: {name} at {path}")

# Load classification data
df = pd.read_csv(CLASSIFICATION_FINAL_CSV)
included_ids = read_ids_txt(INCLUDED_IDS_TXT)

# Fast lookups
df = df.set_index("dataset_id", drop=False)
included_set = set(included_ids)

print(f"Classification rows: {len(df):,}")
print(f"Included IDs: {len(included_ids):,}")
print(f"Included IDs in classification: {sum(did in df.index for did in included_ids):,}")

# Build dataset_id -> file path mapping (avoid N glob calls)
print("\nBuilding dataset file index...")
dataset_file_index: Dict[str, Path] = {}
for fp in tqdm(sorted(DATASET_DIR.glob("*.json")), desc="Indexing files"):
    stem = fp.stem
    dataset_uuid = stem.split("__", 1)[0] if "__" in stem else stem
    dataset_file_index[dataset_uuid] = fp

print(f"Dataset files indexed: {len(dataset_file_index):,}")

# Apply MAX_DATASETS limit for testing
if config.max_datasets is not None:
    included_ids = included_ids[:config.max_datasets]
    included_set = set(included_ids)
    print(f"\nTEST MODE: Limited to {len(included_ids)} datasets")

Classification rows: 26,246
Included IDs: 13,152
Included IDs in classification: 13,152

Building dataset file index...
Dataset files indexed: 26,246
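
The index is keyed on the part of each file stem before the first `__`, so two files that share a prefix silently overwrite each other. The sketch below shows the same parsing with duplicate reporting added. The helper names are hypothetical, and the `<id>__<slug>.json` naming pattern is inferred from the split logic rather than documented here.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Tuple


def dataset_id_from_path(fp: Path) -> str:
    """Take the part of the file stem before the first '__' (the whole stem if absent)."""
    return fp.stem.split("__", 1)[0]


def build_index(paths: Iterable[Path]) -> Tuple[Dict[str, Path], List[str]]:
    """Return (index, duplicate_ids); later files win, as with plain dict assignment."""
    index: Dict[str, Path] = {}
    duplicates: List[str] = []
    for fp in sorted(paths):
        did = dataset_id_from_path(fp)
        if did in index:
            duplicates.append(did)
        index[did] = fp
    return index, duplicates


files = [Path("a1__flood-extent.json"), Path("b2.json"), Path("a1__old-title.json")]
index, duplicates = build_index(files)
print(index)
print(duplicates)
```

In the run above the indexed file count equals the number of classification rows, which suggests no collisions occurred, but the check is cheap to keep.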


## 5. Helper Functions

In [5]:
"""
Helper functions for parsing, slugifying, and mapping.

Includes:
    - Text parsing utilities
    - ISO3 country inference (expanded for HDX group names)
    - Regional group -> member countries mapping
    - Hazard type inference
    - License and format mapping
"""


def sanitize_text(text: str) -> str:
    """Clean text for RDLS JSON: fix encoding, strip HTML, normalize characters.

    Handles: mojibake (double-encoded UTF-8), HTML tags/entities,
    smart quotes, em/en dashes, non-breaking spaces, zero-width spaces,
    control characters, and internal double quotes.
    """
    if not text:
        return text

    # 1. Fix mojibake (double-encoded UTF-8 via CP1252, ~240 UNESCO files)
    try:
        clean = text.encode('cp1252').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        clean = text

    # 2. Strip HTML tags
    clean = re.sub(r'<[^>]+>', ' ', clean)

    # 3. Decode HTML entities
    clean = clean.replace('&nbsp;', ' ').replace('&amp;', '&')
    clean = clean.replace('&lt;', '<').replace('&gt;', '>')
    clean = clean.replace('&quot;', "'").replace('&#39;', "'")
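    def _decode_numeric_entity(m, base):
        """Decode one numeric entity; keep the raw text if the code point is invalid."""
        try:
            return chr(int(m.group(1), base))
        except (ValueError, OverflowError):
            return m.group(0)

    clean = re.sub(r'&#(\d+);', lambda m: _decode_numeric_entity(m, 10), clean)
    clean = re.sub(r'&#x([0-9a-fA-F]+);', lambda m: _decode_numeric_entity(m, 16), clean)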

    # 4. Normalize Unicode to ASCII-safe equivalents
    clean = clean.replace('\u2018', "'").replace('\u2019', "'")
    clean = clean.replace('\u201C', "'").replace('\u201D', "'")
    clean = clean.replace('\u2013', '-').replace('\u2014', '-')
    clean = clean.replace('\u2026', '...').replace('\u00A0', ' ')
    clean = clean.replace('\u200B', '').replace('\u2022', '-')

    # 5. Remove/replace control characters
    clean = clean.replace('\t', ' ').replace('\r', ' ')

    # 6. Replace internal double quotes with single quotes
    clean = clean.replace('"', "'")

    # 7. Collapse whitespace to single space
    clean = re.sub(r'\s+', ' ', clean)

    return clean.strip()


def slugify_token(s: str, max_len: int = 32) -> str:
    """Convert string to URL-safe slug token."""
    s = (s or "").strip().lower()
    s = re.sub(r"[^a-z0-9]+", "_", s)
    s = re.sub(r"_+", "_", s).strip("_")
    return (s[:max_len].strip("_") or "unknown")


def split_semicolon_list(s: Any) -> List[str]:
    """Split semicolon/comma separated string into list."""
    if s is None or (isinstance(s, float) and pd.isna(s)):
        return []
    if isinstance(s, list):
        return [str(x).strip() for x in s if str(x).strip()]
    s = str(s).strip()
    if not s:
        return []
    return [x.strip() for x in re.split(r"[;,]", s) if x.strip()]


def looks_like_url(s: str) -> bool:
    """Check if string looks like a URL."""
    return bool(re.match(r"^https?://", (s or "").strip(), flags=re.I))


# --- ISO3 inference ---
def try_import_pycountry():
    """Try to import pycountry for country code lookups."""
    try:
        import pycountry
        return pycountry
    except ImportError:
        return None

_pycountry = try_import_pycountry()

COMMON_COUNTRY_FIXES: Dict[str, str] = {
    # Standard variants
    "cote d'ivoire": "CIV", "ivory coast": "CIV",
    "democratic republic of the congo": "COD", "dr congo": "COD",
    "republic of the congo": "COG", "congo, rep.": "COG", "congo, dem. rep.": "COD",
    "lao pdr": "LAO", "viet nam": "VNM",
    "korea, rep.": "KOR", "korea, dem. rep.": "PRK",
    "syrian arab republic": "SYR", "iran, islamic republic of": "IRN",
    "tanzania, united republic of": "TZA", "venezuela, bolivarian republic of": "VEN",
    "bolivia, plurinational state of": "BOL", "moldova, republic of": "MDA",
    "palestine": "PSE", "russia": "RUS",
    "united states": "USA", "united kingdom": "GBR",
    # HDX-specific unresolved group names
    "state of palestine": "PSE",
    "republic of korea": "KOR", "south korea": "KOR",
    "holy see": "VAT",
    "kosovo": "XKX",  # Not in RDLS schema enum; accept validation warning
    "united states virgin islands": "VIR",
    "sint maarten": "SXM",
    "saint martin": "MAF",
    "svalbard and jan mayen islands": "SJM", "svalbard and jan mayen": "SJM",
    "bonaire, sint eustatius and saba": "BES", "bonaire; sint eustatius and saba": "BES",
    "wallis and futuna islands": "WLF", "wallis and futuna": "WLF",
    "french southern and antarctic territories": "ATF", "french southern territories": "ATF",
    "saint helena": "SHN",
    "eswatini": "SWZ", "kingdom of eswatini": "SWZ", "swaziland": "SWZ",
    "north macedonia": "MKD",
    "timor-leste": "TLS", "east timor": "TLS",
    "cabo verde": "CPV", "cape verde": "CPV",
    "micronesia": "FSM", "micronesia (federated states of)": "FSM",
    "brunei": "BRN", "brunei darussalam": "BRN",
    "turkiye": "TUR",
    "czechia": "CZE",
}

# HDX group names that are NOT countries (skip without warning)
NON_COUNTRY_GROUPS: set = {
    "world", "global",
    "nepal earthquake",  # Event name, not a country
}

# Regional group name -> list of member ISO3 codes
REGION_TO_COUNTRIES: Dict[str, List[str]] = {
    "africa": [
        "DZA", "AGO", "BEN", "BWA", "BFA", "BDI", "CPV", "CMR", "CAF", "TCD",
        "COM", "COG", "COD", "CIV", "DJI", "EGY", "GNQ", "ERI", "SWZ", "ETH",
        "GAB", "GMB", "GHA", "GIN", "GNB", "KEN", "LSO", "LBR", "LBY", "MDG",
        "MWI", "MLI", "MRT", "MUS", "MAR", "MOZ", "NAM", "NER", "NGA", "RWA",
        "STP", "SEN", "SYC", "SLE", "SOM", "ZAF", "SSD", "SDN", "TZA", "TGO",
        "TUN", "UGA", "ZMB", "ZWE",
    ],
    "east africa": [
        "BDI", "COM", "DJI", "ERI", "ETH", "KEN", "MDG", "MWI", "MUS", "MOZ",
        "RWA", "SYC", "SOM", "SSD", "TZA", "UGA", "ZMB", "ZWE",
    ],
    "west africa": [
        "BEN", "BFA", "CPV", "CIV", "GMB", "GHA", "GIN", "GNB", "LBR", "MLI",
        "MRT", "NER", "NGA", "SEN", "SLE", "TGO",
    ],
    "southern africa": [
        "BWA", "SWZ", "LSO", "MOZ", "NAM", "ZAF", "ZMB", "ZWE",
    ],
    "asia": [
        "AFG", "BGD", "BTN", "BRN", "KHM", "CHN", "IND", "IDN", "IRN", "IRQ",
        "ISR", "JPN", "JOR", "KAZ", "KWT", "KGZ", "LAO", "LBN", "MYS", "MDV",
        "MNG", "MMR", "NPL", "PRK", "OMN", "PAK", "PSE", "PHL", "QAT", "SAU",
        "SGP", "KOR", "LKA", "SYR", "TWN", "TJK", "THA", "TLS", "TUR", "TKM",
        "ARE", "UZB", "VNM", "YEM",
    ],
    "southeast asia": [
        "BRN", "KHM", "IDN", "LAO", "MYS", "MMR", "PHL", "SGP", "THA", "TLS", "VNM",
    ],
    "south asia": [
        "AFG", "BGD", "BTN", "IND", "MDV", "NPL", "PAK", "LKA",
    ],
    "central asia": [
        "KAZ", "KGZ", "TJK", "TKM", "UZB",
    ],
    "europe": [
        "ALB", "AND", "AUT", "BLR", "BEL", "BIH", "BGR", "HRV", "CYP", "CZE",
        "DNK", "EST", "FIN", "FRA", "DEU", "GRC", "HUN", "ISL", "IRL", "ITA",
        "LVA", "LIE", "LTU", "LUX", "MLT", "MDA", "MCO", "MNE", "NLD", "MKD",
        "NOR", "POL", "PRT", "ROU", "RUS", "SMR", "SRB", "SVK", "SVN", "ESP",
        "SWE", "CHE", "UKR", "GBR", "VAT",
    ],
    "middle east": [
        "BHR", "EGY", "IRN", "IRQ", "ISR", "JOR", "KWT", "LBN", "OMN", "PSE",
        "QAT", "SAU", "SYR", "TUR", "ARE", "YEM",
    ],
    "americas": [
        "ATG", "ARG", "BHS", "BRB", "BLZ", "BOL", "BRA", "CAN", "CHL", "COL",
        "CRI", "CUB", "DMA", "DOM", "ECU", "SLV", "GRD", "GTM", "GUY", "HTI",
        "HND", "JAM", "MEX", "NIC", "PAN", "PRY", "PER", "KNA", "LCA", "VCT",
        "SUR", "TTO", "URY", "USA", "VEN",
    ],
    "latin america": [
        "ARG", "BOL", "BRA", "CHL", "COL", "CRI", "CUB", "DOM", "ECU", "SLV",
        "GTM", "HTI", "HND", "MEX", "NIC", "PAN", "PRY", "PER", "URY", "VEN",
    ],
    "latin america and the caribbean": [
        "ATG", "ARG", "BHS", "BRB", "BLZ", "BOL", "BRA", "CHL", "COL", "CRI",
        "CUB", "DMA", "DOM", "ECU", "SLV", "GRD", "GTM", "GUY", "HTI", "HND",
        "JAM", "MEX", "NIC", "PAN", "PRY", "PER", "KNA", "LCA", "VCT", "SUR",
        "TTO", "URY", "VEN",
    ],
    "caribbean": [
        "ATG", "BHS", "BRB", "CUB", "DMA", "DOM", "GRD", "HTI", "JAM", "KNA",
        "LCA", "VCT", "TTO",
    ],
    "central america": [
        "BLZ", "CRI", "SLV", "GTM", "HND", "NIC", "PAN",
    ],
    "oceania": [
        "AUS", "FJI", "KIR", "MHL", "FSM", "NRU", "NZL", "PLW", "PNG", "WSM",
        "SLB", "TON", "TUV", "VUT",
    ],
    "pacific": [
        "FJI", "KIR", "MHL", "FSM", "NRU", "PLW", "PNG", "WSM", "SLB", "TON",
        "TUV", "VUT",
    ],
}

# Normalize region keys for lookup
REGION_TO_COUNTRIES_NORM = {name.strip().lower(): codes for name, codes in REGION_TO_COUNTRIES.items()}


def _norm_country_key(s: str) -> str:
    """Normalize country name for lookup."""
    s = (s or "").strip().lower()
    s = re.sub(r"[\(\)\[\]\{\}\.\,\;\:]", " ", s)
    s = s.replace("&", " and ")
    s = re.sub(r"[^a-z0-9\s\-']", " ", s)
    return re.sub(r"\s+", " ", s).strip()

COMMON_COUNTRY_FIXES_NORM = {_norm_country_key(k): v for k, v in COMMON_COUNTRY_FIXES.items()}

# Load country ISO3 table if available
COUNTRY_ISO3_CSV = (DUMP_DIR / "config" / "country_name_to_iso3.csv")

def load_country_iso3_table(path: Path) -> Dict[str, str]:
    """Load country name to ISO3 mapping from CSV."""
    if not path.exists():
        return {}
    try:
        df_iso = pd.read_csv(path)
        cols = {c.lower(): c for c in df_iso.columns}
        name_col = cols.get("name") or cols.get("country") or cols.get("country_name")
        iso3_col = cols.get("iso3") or cols.get("alpha_3") or cols.get("code")
        if not name_col or not iso3_col:
            return {}
        return {
            _norm_country_key(str(r.get(name_col, ""))): str(r.get(iso3_col, "")).strip().upper()
            for _, r in df_iso.iterrows()
            if str(r.get(name_col, "")).strip() and len(str(r.get(iso3_col, "")).strip()) == 3
        }
    except Exception:
        return {}

COUNTRY_ISO3_TABLE = load_country_iso3_table(COUNTRY_ISO3_CSV)


def country_name_to_iso3(name: str) -> Optional[str]:
    """
    Convert country name to ISO3 code.
    
    Parameters:
        name: Country name or ISO3 code
        
    Returns:
        ISO3 code or None if not found
    """
    n = (name or "").strip()
    if not n:
        return None
    
    # Already ISO3?
    if len(n) == 3 and n.isalpha():
        return n.upper()
    
    key = _norm_country_key(n)
    
    # Check fixes first
    if key in COMMON_COUNTRY_FIXES_NORM:
        return COMMON_COUNTRY_FIXES_NORM[key]
    
    # Check table
    if key in COUNTRY_ISO3_TABLE:
        return COUNTRY_ISO3_TABLE[key]
    
    # Try pycountry
    if _pycountry is not None:
        try:
            c = _pycountry.countries.lookup(n)
            return getattr(c, "alpha_3", None)
        except Exception:
            pass
    
    return None


def infer_spatial(groups: List[str]) -> Dict[str, Any]:
    """
    Infer RDLS spatial block from HDX country groups.
    
    Logic:
      - "World"/"global" -> scale: "global", no countries
      - Regional name (Africa, Europe, etc.) -> scale: "regional", countries: [members]
      - Single country -> scale: "national", countries: [iso3]
      - Multiple countries -> scale: "regional", countries: [iso3s]
      - Non-country names (events) -> silently skip
      - Completely unresolvable -> scale: "global"
    """
    if not groups:
        return {"scale": "global"}
    
    # Normalize group names for matching
    norm_groups = [g.strip().lower() for g in groups]
    
    # Check for "World" / "global" -> global scale
    if any(g in NON_COUNTRY_GROUPS for g in norm_groups):
        # Filter out non-country groups, keep any actual countries
        country_groups = [g for g, ng in zip(groups, norm_groups) if ng not in NON_COUNTRY_GROUPS]
        if not country_groups:
            return {"scale": "global"}
        groups = country_groups
        norm_groups = [g.strip().lower() for g in groups]
    
    # Check for regional group names
    all_iso3s = []
    remaining_groups = []
    is_regional = False
    
    for g, ng in zip(groups, norm_groups):
        if ng in REGION_TO_COUNTRIES_NORM:
            all_iso3s.extend(REGION_TO_COUNTRIES_NORM[ng])
            is_regional = True
        else:
            remaining_groups.append(g)
    
    # Resolve remaining groups as country names
    for g in remaining_groups:
        iso3 = country_name_to_iso3(g)
        if iso3:
            all_iso3s.append(iso3)
    
    # Deduplicate and sort
    iso3s = sorted(set(all_iso3s))
    
    if is_regional and iso3s:
        return {"scale": "regional", "countries": iso3s}
    if len(iso3s) == 1:
        return {"scale": "national", "countries": iso3s}
    if len(iso3s) > 1:
        return {"scale": "regional", "countries": iso3s}
    
    # Nothing resolved — default to global without noisy warning
    return {"scale": "global"}


def infer_hazard_types(tags: List[str], title: str = "", notes: str = "") -> List[str]:
    """Infer hazard types from tags and text content."""
    text = " ".join([*tags, title or "", notes or ""]).lower()
    hits = set()
    for k, ht in HAZARD_KEYWORDS_TO_TYPE.items():
        if k in text:
            hits.add(ht)
    return sorted(hits)


def hazard_suffix_for_filename(hazard_types: List[str]) -> str:
    """Generate hazard suffix for filename."""
    if not hazard_types:
        return ""
    if len(hazard_types) > 1:
        return "_multihazard"
    ht = hazard_types[0]
    return "_" + slugify_token(HAZARD_FILENAME_ALIASES.get(ht, ht), max_len=24)


def parse_components(s: Any) -> List[str]:
    """Parse risk components from semicolon-separated string."""
    parts = split_semicolon_list(s)
    seen = set()
    result = []
    for p in parts:
        mapped = COMPONENT_MAP.get(p.strip().lower())
        if mapped and mapped not in seen:
            result.append(mapped)
            seen.add(mapped)
    return result


def choose_prefix(risk_data_type: List[str]) -> str:
    """Choose RDLS filename prefix based on risk data type."""
    rset = set(risk_data_type)
    if "loss" in rset:
        return "rdls_lss-"
    if "vulnerability" in rset:
        return "rdls_vln-"
    if "exposure" in rset:
        return "rdls_exp-"
    return "rdls_hzd-"


def map_license(hdx_license_title: str) -> str:
    """
    Map HDX license strings to RDLS schema suggestions.
    
    Parameters:
        hdx_license_title: License title from HDX
        
    Returns:
        RDLS-compatible license identifier
    """
    raw = (hdx_license_title or "").strip()
    if not raw:
        return ""
    
    key = re.sub(r"\s+", " ", raw.lower().strip())
    
    # Pattern mappings
    if re.search(r"\bcc0\b", key) or ("public domain" in key and "cc0" in key):
        return "CC0-1.0"
    if "odbl" in key or "open database license" in key:
        return "ODbL-1.0"
    if "pddl" in key or "public domain dedication" in key:
        return "PDDL-1.0"
    
    # Creative Commons variants
    k2 = key.replace("creative commons", "cc")
    
    if re.search(r"\bcc\s*by\b", k2) and "sa" not in k2 and "nd" not in k2 and "nc" not in k2:
        return "CC-BY-4.0" if "4.0" in k2 or "v4" in k2 else ("CC-BY-3.0" if "3.0" in k2 else "CC-BY-4.0")
    if "by-sa" in k2 or re.search(r"\bcc\s*by\s*sa\b", k2):
        return "CC-BY-SA-4.0" if "4.0" in k2 else ("CC-BY-SA-3.0" if "3.0" in k2 else "CC-BY-SA-4.0")
    if "by-nc" in k2 or re.search(r"\bcc\s*by\s*nc\b", k2):
        return "CC-BY-NC-4.0" if "4.0" in k2 else ("CC-BY-NC-3.0" if "3.0" in k2 else "CC-BY-NC-4.0")
    
    return HDX_LICENSE_TO_RDLS.get(re.sub(r"\s+", " ", k2).strip(), raw)


# Track unmapped HDX formats for reporting
unmapped_hdx_formats: set = set()


def _infer_format_from_name(name: str, url: str = "") -> Optional[str]:
    """Infer RDLS data_format from filename keywords when HDX format is ZIP or unknown."""
    text = f"{name} {url}".lower()
    # Ordered: more specific patterns first to avoid false matches
    HINTS = [
        ('geotiff', 'GeoTIFF (tif)'),    ('geotif', 'GeoTIFF (tif)'),
        ('.tif', 'GeoTIFF (tif)'),        ('shapefile', 'Shapefile (shp)'),
        ('.shp', 'Shapefile (shp)'),      ('geopackage', 'GeoPackage (gpkg)'),
        ('.gpkg', 'GeoPackage (gpkg)'),   ('geodatabase', 'File Geodatabase (gdb)'),
        ('.gdb', 'File Geodatabase (gdb)'),
        ('geojson', 'GeoJSON (geojson)'), ('.geojson', 'GeoJSON (geojson)'),
        ('flatgeobuf', 'FlatGeobuf (fgb)'),
        ('netcdf', 'NetCDF (nc)'),        ('.nc.', 'NetCDF (nc)'),
        ('parquet', 'Parquet (parquet)'),
        ('_csv', 'CSV (csv)'),            ('.csv', 'CSV (csv)'),
        ('excel', 'Excel (xlsx)'),        ('_xlsx', 'Excel (xlsx)'),
        ('.xlsx', 'Excel (xlsx)'),        ('.xls', 'Excel (xlsx)'),
        ('_json', 'JSON (json)'),         ('.json', 'JSON (json)'),
        ('.xml', 'XML (xml)'),            ('.kml', 'KML (kml)'),
        ('.pdf', 'PDF (pdf)'),
    ]
    for hint, rdls_fmt in HINTS:
        if hint in text:
            return rdls_fmt
    return None


# Service URL patterns: (regex, data_format, access_modality)
SERVICE_URL_PATTERNS = [
    (r'arcgis\.com/.*/rest/services/', 'GeoJSON (geojson)', 'REST'),
    (r'/rest/services/.*(?:Feature|Map)Server', 'GeoJSON (geojson)', 'REST'),
    (r'geoserver.*(?:/wms|/wfs|/ows)', 'GeoJSON (geojson)', 'WFS'),
    (r'/wms\b', 'XML (xml)', 'WMS'),
    (r'/wfs\b', 'GeoJSON (geojson)', 'WFS'),
    (r'/wcs\b', 'GeoTIFF (tif)', 'WCS'),
]


def detect_service_url(url: str) -> Optional[Tuple[str, str]]:
    """Detect service URLs and return (data_format, access_modality) or None."""
    if not url:
        return None
    for pattern, fmt, modality in SERVICE_URL_PATTERNS:
        if re.search(pattern, url, re.IGNORECASE):
            return (fmt, modality)
    return None


def map_data_format(hdx_fmt: str, url: str = "", name: str = "") -> Optional[str]:
    """
    Map HDX format to RDLS data_format enum value.
    
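    Returns None for formats that should be skipped (non-data formats) and
    for formats that cannot be mapped; unmapped non-empty formats that reach
    the final fallback are recorded in unmapped_hdx_formats.
    Handles archives (ZIP, 7Z, TAR, GZ, GZIP) by inferring the format from
    the filename or URL.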
    """
    s = (hdx_fmt or "").strip().upper()
    
    # Check skip list first (non-data formats)
    if s in HDX_FORMATS_SKIP:
        return None
    
    # Direct dictionary lookup
    if s in HDX_FORMAT_TO_RDLS:
        return HDX_FORMAT_TO_RDLS[s]
    
    # ZIP/archive: infer format from filename or URL
    if s in ('ZIP', '7Z', 'TAR', 'GZ', 'GZIP'):
        return _infer_format_from_name(name, url)
    
    # Guess from URL extension
    u = (url or "").lower().split("?")[0]  # Strip query params
    ext_map = [
        (".geojson", "GeoJSON (geojson)"), (".json", "JSON (json)"),
        (".csv", "CSV (csv)"), (".xlsx", "Excel (xlsx)"), (".xls", "Excel (xlsx)"),
        (".shp", "Shapefile (shp)"),
        (".tif", "GeoTIFF (tif)"), (".tiff", "GeoTIFF (tif)"),
        (".nc", "NetCDF (nc)"), (".pdf", "PDF (pdf)"),
        (".parquet", "Parquet (parquet)"), (".gpkg", "GeoPackage (gpkg)"),
        (".kml", "KML (kml)"), (".kmz", "KML (kml)"),
        (".xml", "XML (xml)"), (".gdb", "File Geodatabase (gdb)"),
    ]
    for ext, rdls in ext_map:
        if u.endswith(ext):
            return rdls
    
    # Last resort: try inferring from filename for unknown formats
    inferred = _infer_format_from_name(name, url)
    if inferred:
        return inferred

    # Truly unmapped — skip resource (return None)
    if s:
        unmapped_hdx_formats.add(s.lower())
    return None


print("Helper functions loaded successfully.")
print(f"  Country fixes: {len(COMMON_COUNTRY_FIXES)} entries")
print(f"  Regional mappings: {len(REGION_TO_COUNTRIES)} regions")
print(f"  Country ISO3 table: {len(COUNTRY_ISO3_TABLE)} entries")

Helper functions loaded successfully.
  Country fixes: 52 entries
  Regional mappings: 17 regions
  Country ISO3 table: 436 entries
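
The keyword-based inference above returns the first hint that matches, so the order of the hint table matters. A minimal, self-contained sketch of that behavior follows. It uses a reduced hint table, and the names `HINTS_SKETCH` and `infer_format_sketch` are hypothetical, introduced only for illustration; the notebook's own `_infer_format_from_name` uses the fuller list.

```python
from typing import Optional

# Reduced, hypothetical hint table (illustration only). Order matters
# because the first substring hit is returned.
HINTS_SKETCH = [
    ("geojson", "GeoJSON (geojson)"),
    (".shp", "Shapefile (shp)"),
    (".csv", "CSV (csv)"),
    (".json", "JSON (json)"),
]


def infer_format_sketch(name: str, url: str = "") -> Optional[str]:
    """Return the first format whose hint appears in the name or URL."""
    text = f"{name} {url}".lower()
    for hint, fmt in HINTS_SKETCH:
        if hint in text:
            return fmt
    return None


print(infer_format_sketch("admin_boundaries.geojson.zip"))
print(infer_format_sketch("population_2020.csv.zip"))
print(infer_format_sketch("photos.zip"))
```

Because `geojson` is listed before `.json`, an archive named `admin_boundaries.geojson.zip` resolves to GeoJSON rather than plain JSON, and a name with no recognizable keyword yields `None`, which leaves the resource out of the record.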


## 6. Component Gating Logic

In [6]:
"""
Component gating logic for RDLS risk_data_type validation.

Rules:
    - vulnerability must co-occur with hazard or exposure
    - loss must co-occur with hazard or exposure
    - risk_data_type must be non-empty
"""

@dataclass(frozen=True)
class ComponentGateResult:
    """
    Result of component gating validation.
    
    Attributes:
        ok: Whether validation passed
        reasons: Tuple of reason codes
        risk_data_type: Validated/repaired risk data type list
    """
    ok: bool
    reasons: Tuple[str, ...]
    risk_data_type: List[str]


def apply_component_gate(components: List[str]) -> ComponentGateResult:
    """
    Enforce RDLS component combination rules.
    
    Parameters:
        components: List of risk components
        
    Returns:
        ComponentGateResult with validation status
    """
    allowed = {"hazard", "exposure", "vulnerability", "loss"}
    rset = {c for c in (components or []) if c in allowed}
    
    if not rset:
        return ComponentGateResult(
            ok=False,
            reasons=("empty_or_unrecognized_components",),
            risk_data_type=[],
        )
    
    reasons: List[str] = []
    ok = True
    
    # Vulnerability requires hazard or exposure
    if "vulnerability" in rset and not ({"hazard", "exposure"} & rset):
        if config.auto_repair_components:
            rset.add("exposure")
            reasons.append("auto_added_exposure_for_vulnerability")
        else:
            ok = False
            reasons.append("vulnerability_without_hazard_or_exposure")
    
    # Loss requires hazard or exposure
    if "loss" in rset and not ({"hazard", "exposure"} & rset):
        if config.auto_repair_components:
            rset.add("exposure")
            reasons.append("auto_added_exposure_for_loss")
        else:
            ok = False
            reasons.append("loss_without_hazard_or_exposure")
    
    return ComponentGateResult(ok=ok, reasons=tuple(reasons), risk_data_type=sorted(rset))


print(f"Component gating configured (auto_repair={config.auto_repair_components})")

Component gating configured (auto_repair=True)
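
To make the gating rules concrete, here is a self-contained sketch that mirrors the logic of `apply_component_gate` without depending on the notebook's `config` object. The function name `gate_sketch` and its `auto_repair` parameter are assumptions made for this illustration; the reason codes are the ones used in the cell above.

```python
from typing import List, Tuple

ALLOWED = {"hazard", "exposure", "vulnerability", "loss"}


def gate_sketch(
    components: List[str], auto_repair: bool = True
) -> Tuple[bool, List[str], List[str]]:
    """Return (ok, reasons, risk_data_type) following the rules above."""
    rset = {c for c in components if c in ALLOWED}
    if not rset:
        return False, ["empty_or_unrecognized_components"], []

    reasons: List[str] = []
    ok = True
    # Vulnerability and loss each need hazard or exposure alongside them.
    for dependent in ("vulnerability", "loss"):
        if dependent in rset and not ({"hazard", "exposure"} & rset):
            if auto_repair:
                rset.add("exposure")
                reasons.append(f"auto_added_exposure_for_{dependent}")
            else:
                ok = False
                reasons.append(f"{dependent}_without_hazard_or_exposure")
    return ok, reasons, sorted(rset)


print(gate_sketch(["vulnerability"]))
print(gate_sketch(["loss"], auto_repair=False))
print(gate_sketch(["hazard", "loss"]))
```

With auto-repair on, a vulnerability-only dataset passes with `exposure` added and a reason code recording the repair. With auto-repair off, the same kind of input is blocked and ends up in the blocked report.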


## 7. RDLS Record Builder

In [7]:
"""
Build RDLS dataset records from HDX metadata.

Creates minimal, schema-safe records with:
    - Required attributions (publisher, creator, contact)
    - Resources with proper format mapping
    - Spatial information inferred from groups
"""

def build_attributions(hdx: Dict[str, Any], dataset_id: str, dataset_page_url: str) -> List[Dict[str, Any]]:
    """
    Build RDLS attributions from HDX metadata.
    
    Parameters:
        hdx: HDX dataset metadata
        dataset_id: Dataset UUID
        dataset_page_url: HDX dataset landing page URL
        
    Returns:
        List of attribution objects (minItems=3)
    """
    org = (hdx.get("organization") or "").strip() or "Not specified"
    src = (hdx.get("dataset_source") or "").strip() or org
    creator_url = src if looks_like_url(src) else dataset_page_url
    
    return [
        {"id": "attribution_publisher", "role": "publisher", "entity": {"name": org, "url": dataset_page_url}},
        {"id": "attribution_creator", "role": "creator", "entity": {"name": src, "url": creator_url}},
        {"id": "attribution_contact", "role": "contact_point", "entity": {"name": org, "url": dataset_page_url}},
    ]


def build_resources(hdx: Dict[str, Any], dataset_id: str) -> List[Dict[str, Any]]:
    """
    Build RDLS resources from HDX resources.
    
    Parameters:
        hdx: HDX dataset metadata
        dataset_id: Dataset UUID
        
    Returns:
        List of resource objects (minItems=1)
    """
    # Always include HDX metadata export
    meta_url = f"https://data.humdata.org/dataset/{dataset_id}/download_metadata?format=json"
    resources = [{
        "id": "hdx_dataset_metadata_json",
        "title": "HDX dataset metadata (JSON)",
        "description": "Dataset-level metadata exported from HDX.",
        "data_format": "JSON (json)",
        "access_modality": "file_download",
        "download_url": meta_url,
    }]
    
    for r in hdx.get("resources", []) or []:
        rid = (r.get("id") or "").strip()
        rname = sanitize_text((r.get("name") or "").strip()) or rid[:8] or "resource"
        desc = sanitize_text((r.get("description") or "").strip()) or f"HDX resource: {rname}"
        dl = (r.get("download_url") or "").strip()
        hdx_format = (r.get("format") or "").strip().upper()

        # Determine data_format and access_modality
        fmt = None
        access_modality = "file_download"

        # Check if it is a known service format (GEOSERVICE, API)
        if hdx_format in HDX_SERVICE_FORMATS:
            fmt, access_modality = HDX_SERVICE_FORMATS[hdx_format]
            # Refine using URL patterns (e.g., ArcGIS REST)
            svc = detect_service_url(dl)
            if svc:
                fmt, access_modality = svc
        else:
            fmt = map_data_format(hdx_format, dl, rname)
            # Check if download URL is actually a service endpoint
            svc = detect_service_url(dl)
            if svc:
                _, access_modality = svc

        if not dl or not fmt:
            continue

        resources.append({
            "id": f"hdx_res_{rid[:8] or slugify_token(rname, 8)}",
            "title": rname,
            "description": desc,
            "data_format": fmt,
            "access_modality": access_modality,
            "download_url": dl,
        })
    
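    # Deduplicate by id, keeping the first occurrence of each
    seen: set = set()
    unique: List[Dict[str, Any]] = []
    for res in resources:
        if res["id"] not in seen:
            seen.add(res["id"])
            unique.append(res)
    return unique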




def build_details(hdx: Dict[str, Any]) -> Optional[str]:
    """
    Build RDLS 'details' field from HDX metadata fields that have
    no direct RDLS equivalent.

    Composites: caveats, methodology, methodology_other, dataset_date,
    data_update_frequency, last_modified.

    Returns None if no meaningful content is available.
    """
    parts = []

    # Caveats (data quality / limitations)
    caveats = sanitize_text((hdx.get('caveats') or '').strip())
    if caveats:
        parts.append(f"Caveats: {caveats}")

    # Methodology
    methodology = sanitize_text((hdx.get('methodology') or '').strip())
    methodology_other = sanitize_text((hdx.get('methodology_other') or '').strip())
    if methodology_other:
        parts.append(f"Methodology: {methodology_other}")
    elif methodology and methodology.lower() not in ('other', ''):
        parts.append(f"Methodology: {methodology}")

    # Temporal coverage
    dataset_date = (hdx.get('dataset_date') or '').strip()
    if dataset_date:
        parts.append(f"Temporal coverage: {dataset_date}")

    # Update frequency
    frequency = (hdx.get('data_update_frequency') or '').strip()
    if frequency:
        parts.append(f"Update frequency: {frequency}")

    # Last modified
    last_modified = (hdx.get('last_modified') or hdx.get('metadata_modified') or '').strip()
    if last_modified:
        # Keep only the date part for readability: 2025-11-18T16:54:54.268514 -> 2025-11-18
        date_part = last_modified[:10]
        parts.append(f"Last modified: {date_part}")

    if not parts:
        return None

    return ' | '.join(parts)


def build_rdls_record(
    hdx: Dict[str, Any],
    class_row: pd.Series,
) -> Tuple[Optional[Dict[str, Any]], Dict[str, Any]]:
    """
    Build RDLS record from HDX metadata and classification.
    
    Parameters:
        hdx: HDX dataset metadata
        class_row: Classification row from Step 5
        
    Returns:
        Tuple of (rdls_record or None if blocked, info dict)
    """
    dataset_id = str(class_row["dataset_id"])
    title = sanitize_text((hdx.get("title") or class_row.get("title") or "").strip())
    notes = sanitize_text((hdx.get("notes") or "").strip())
    details = build_details(hdx)
    
    # Parse and validate components
    components = parse_components(class_row.get("rdls_components"))
    gate = apply_component_gate(components)
    
    if not gate.ok:
        return None, {
            "dataset_id": dataset_id,
            "blocked": True,
            "blocked_reasons": ";".join(gate.reasons),
            "risk_data_type": ";".join(gate.risk_data_type),
        }
    
    # Infer spatial from groups
    groups = split_semicolon_list(class_row.get("groups"))
    spatial = infer_spatial(groups)
    
    # Infer hazard types for naming
    tags = split_semicolon_list(class_row.get("tags"))
    hazard_types = infer_hazard_types(tags, title=title, notes=notes)
    
    dataset_page_url = f"https://data.humdata.org/dataset/{dataset_id}"
    
    # Build entity token for naming
    hdx_slug = slugify_token(str(hdx.get("name") or ""), max_len=48)
    title_slug = slugify_token(title, max_len=48)
    dataset_slug = hdx_slug if hdx_slug != "unknown" else title_slug
    
    org_token = slugify_token(str(class_row.get("organization") or hdx.get("organization") or "unknown"), max_len=20)
    iso3_tok = str(spatial["countries"][0]).lower() if spatial.get("countries") and len(spatial["countries"]) == 1 else ""
    
    # Compose identifier
    parts = [p for p in [org_token, iso3_tok, dataset_slug] if p]
    entity_token = "_".join(parts)
    
    prefix = choose_prefix(gate.risk_data_type) + "hdx_"
    hz_suffix = hazard_suffix_for_filename(hazard_types) if ({"hazard", "loss"} & set(gate.risk_data_type)) else ""
    
    stem_base = f"{prefix}{entity_token}{hz_suffix}"
    stem = stem_base
    
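    # Collision-proofing: if a file with this stem is already on disk
    # (including one left by an earlier run), append a short dataset-id suffix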
    out_path = OUT_RECORDS_DIR / f"{stem}.json"
    if out_path.exists():
        stem = f"{stem_base}__{dataset_id[:8]}"
    
    # Map license
    license_raw = str(class_row.get("license_title") or hdx.get("license_title") or "").strip()
    license_mapped = map_license(license_raw or "Custom")
    
    # Build RDLS dataset record
    rdls_ds: Dict[str, Any] = {
        "id": stem,
        "title": title or hdx.get("name", "") or dataset_id,
        "description": notes or None,
        "risk_data_type": gate.risk_data_type,
        "details": details,
        "spatial": spatial,
        "license": license_mapped,
        "attributions": build_attributions(hdx, dataset_id, dataset_page_url),
        "resources": build_resources(hdx, dataset_id),
        "links": [
            {
                "href": "https://docs.riskdatalibrary.org/en/0__3__0/rdls_schema.json",
                "rel": "describedby",
            },
            {
                "href": dataset_page_url,
                "rel": "source",
            },
        ],
    }
    
    # Remove None values and filter to allowed keys
    rdls_ds = {k: v for k, v in rdls_ds.items() if v is not None and k in RDLS_ALLOWED_KEYS}
    
    # Wrap in top-level structure
    rdls_record = {"datasets": [rdls_ds]}
    
    info = {
        "dataset_id": dataset_id,
        "rdls_id": stem,
        "filename": f"{stem}.json",
        "risk_data_type": ";".join(gate.risk_data_type),
        "spatial_scale": spatial.get("scale", ""),
        "countries_count": len(spatial.get("countries", []) or []),
        "license_raw": license_raw,
        "orgtoken": org_token,
        "hazard_suffix": hz_suffix.lstrip("_"),
        "organization_token": org_token,
        "iso3": iso3_tok,
        "hazard_types": ";".join(hazard_types),
        "blocked": False,
        "blocked_reasons": "",
    }
    
    return rdls_record, info


print("Record builder functions loaded.")

Record builder functions loaded.
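
The "no invented content" and "no extra fields" policies come down to the final pruning step in `build_rdls_record`. The sketch below isolates that step. `ALLOWED_KEYS_SKETCH` is a hypothetical subset chosen for the example (the notebook's `RDLS_ALLOWED_KEYS` is defined elsewhere and is not reproduced here), and `prune_record` is an illustrative name.

```python
from typing import Any, Dict, Set

# Hypothetical subset of allowed top-level keys, for illustration only.
ALLOWED_KEYS_SKETCH: Set[str] = {
    "id", "title", "description", "risk_data_type", "details",
}


def prune_record(raw: Dict[str, Any], allowed: Set[str]) -> Dict[str, Any]:
    """Drop None-valued fields and any key not in the allowed set."""
    return {k: v for k, v in raw.items() if v is not None and k in allowed}


draft = {
    "id": "rdls_example",
    "title": "Example dataset",
    "description": None,       # missing in the source: omitted, not ""
    "details": None,
    "risk_data_type": ["hazard"],
    "internal_note": "x",      # not a schema field: dropped
}
print(prune_record(draft, ALLOWED_KEYS_SKETCH))
```

Optional fields with no source value disappear from the output entirely, which is why the builder assigns `notes or None` and a possibly-`None` `details` instead of empty strings.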


## 8. Clean Previous Outputs (Optional)

In [8]:
"""
8.1 Clean Previous Outputs

Removes stale output files before writing new ones.
Controlled by CLEANUP_MODE in cell 1 above.
"""

def clean_previous_outputs(output_dir, patterns, label, mode="replace"):
    """
    Remove previous output files matching the given glob patterns.

    Parameters
    ----------
    output_dir : Path
        Directory containing old outputs.
    patterns : list[str]
        Glob patterns to match.
    label : str
        Human-readable label for log messages.
    mode : str
        One of: "replace" (auto-delete), "prompt" (ask user),
        "skip" (keep old files), "abort" (error if stale files exist).

    Returns
    -------
    dict  with keys 'deleted' (int) and 'skipped' (bool)
    """
    result = {'deleted': 0, 'skipped': False}
    targets = {}
    for pattern in patterns:
        matches = sorted(output_dir.glob(pattern))
        if matches:
            targets[pattern] = matches
    total = sum(len(files) for files in targets.values())

    if total == 0:
        print(f'Output cleanup [{label}]: Directory is clean.')
        return result

    summary = []
    for pattern, files in targets.items():
        summary.append(f'  {pattern:40s}: {len(files):,} files')

    if mode == 'skip':
        print(f'Output cleanup [{label}]: SKIPPED ({total:,} existing files kept)')
        result['skipped'] = True
        return result

    if mode == 'abort':
        raise RuntimeError(
            f'Output cleanup [{label}]: ABORT -- {total:,} stale files found. '
            f'Delete manually or change CLEANUP_MODE.'
        )

    if mode == 'prompt':
        print(f'Output cleanup [{label}]: Found {total:,} existing output files:')
        for line in summary:
            print(line)
        choice = input('Choose [R]eplace / [S]kip / [A]bort: ').strip().lower()
        if choice in ('s', 'skip'):
            print('  Skipped.')
            result['skipped'] = True
            return result
        elif choice in ('a', 'abort'):
            raise RuntimeError('User chose to abort.')
        elif choice not in ('r', 'replace', ''):
            print('  Unknown choice, defaulting to Replace.')

    # Mode: replace (default)
    print(f'Output cleanup [{label}]:')
    for line in summary:
        print(line)
    for pattern, files in targets.items():
        for f in files:
            try:
                f.unlink()
                result['deleted'] += 1
            except Exception as e:
                print(f'  WARNING: Could not delete {f.name}: {e}')
    deleted_count = result['deleted']
    print(f'  Cleaned {deleted_count:,} files. Ready for fresh output.')
    print()
    return result


# ── Run cleanup for NB 06 outputs ─────────────────────────────
clean_previous_outputs(
    OUT_RECORDS_DIR,
    patterns=["rdls_*.json"],
    label="NB 06 Records",
    mode=CLEANUP_MODE,
)

clean_previous_outputs(
    OUT_INDEX_DIR,
    patterns=["rdls_index.jsonl"],
    label="NB 06 Index",
    mode=CLEANUP_MODE,
)

clean_previous_outputs(
    OUT_REPORTS_DIR,
    patterns=[
        "translation_blocked.csv",
        "schema_validation.csv",
        "translation_qa.csv",
    ],
    label="NB 06 Reports",
    mode=CLEANUP_MODE,
)


Output cleanup [NB 06 Records]:
  rdls_*.json                             : 13,152 files
  Cleaned 13,152 files. Ready for fresh output.

Output cleanup [NB 06 Index]:
  rdls_index.jsonl                        : 1 files
  Cleaned 1 files. Ready for fresh output.

Output cleanup [NB 06 Reports]: Directory is clean.


{'deleted': 0, 'skipped': False}
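
The "replace" mode only deletes files that match the given glob patterns, so unrelated files in the same directory survive. A small sketch of that pattern-scoped deletion, run against a temporary directory, is shown below. The file names are made up for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    # Two files that match the pattern and one that does not.
    (out_dir / "rdls_a.json").write_text("{}", encoding="utf-8")
    (out_dir / "rdls_b.json").write_text("{}", encoding="utf-8")
    (out_dir / "notes.txt").write_text("keep me", encoding="utf-8")

    # Collect matches first, then delete, as the cleanup helper does.
    stale = sorted(out_dir.glob("rdls_*.json"))
    for path in stale:
        path.unlink()

    deleted = len(stale)
    remaining = sorted(p.name for p in out_dir.iterdir())

print(deleted, remaining)
```

Collecting the matches before deleting also makes it possible to print a per-pattern summary before anything is removed.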

## 9. Validate and Write Records

In [9]:
"""
Process all datasets: build records, validate, and write outputs.

Outputs:
    - Individual RDLS JSON records
    - Index JSONL file
    - Blocked datasets report
    - Validation results report
    - QA summary report
"""

# --- JSON Schema validation setup ---
def try_import_jsonschema():
    try:
        import jsonschema
        return jsonschema
    except ImportError:
        return None

_jsonschema = try_import_jsonschema()
validator = None

if _jsonschema is not None:
    try:
        validator = _jsonschema.Draft202012Validator(rdls_schema)
        print("jsonschema validation enabled (Draft2020-12)")
    except Exception as e:
        print(f"WARNING: jsonschema init failed: {e}")
else:
    print("WARNING: jsonschema not installed; validation will be skipped")


def validate_record(rec: Dict[str, Any]) -> Tuple[bool, str]:
    """Validate RDLS record against schema."""
    if validator is None:
        return True, ""
    errors = sorted(validator.iter_errors(rec["datasets"][0]), key=lambda e: e.path)
    if not errors:
        return True, ""
    msgs = [f"{'.'.join(str(p) for p in e.path)}: {e.message}" for e in errors[:10]]
    return False, " | ".join(msgs)


def append_jsonl(path: Path, obj: Dict[str, Any]) -> None:
    """Append object to JSONL file."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(obj, ensure_ascii=False) + "\n")


# Initialize output file
OUT_INDEX_JSONL.write_text("", encoding="utf-8")

# Tracking lists
blocked_rows: List[Dict[str, Any]] = []
validation_rows: List[Dict[str, Any]] = []
qa_rows: List[Dict[str, Any]] = []

# Counters
written = 0
skipped_existing = 0
blocked = 0
validated_ok = 0

print(f"\nProcessing {len(included_ids):,} datasets...")

for dataset_id in tqdm(included_ids, desc="Translating to RDLS"):
    # Find dataset file
    fp = dataset_file_index.get(dataset_id)
    
    if fp is None or not fp.exists():
        blocked += 1
        blocked_rows.append({
            "dataset_id": dataset_id,
            "status": "blocked_missing_hdx_dataset_json",
            "reason": "missing_hdx_dataset_json",
            "risk_data_type": "",
        })
        qa_rows.append({
            "dataset_id": dataset_id, "output_id": "", "filename": "",
            "risk_data_type": "", "spatial_scale": "", "countries_count": 0,
            "license_raw": "", "orgtoken": "", "hazard_suffix": "",
            "status": "blocked_missing_hdx_dataset_json", "reason": "missing_hdx_dataset_json",
        })
        continue
    
    # Load HDX metadata
    hdx = safe_load_json(fp)
    row = df.loc[dataset_id]
    
    # Build RDLS record
    rdls_rec, info = build_rdls_record(hdx, row)
    
    if rdls_rec is None:
        blocked += 1
        reason = info.get("blocked_reasons") or "blocked_by_policy"
        rdt = info.get("risk_data_type") or ""
        blocked_rows.append({
            "dataset_id": dataset_id,
            "status": "blocked_by_policy",
            "reason": reason,
            "risk_data_type": rdt,
        })
        qa_rows.append({
            "dataset_id": dataset_id, "output_id": "", "filename": "",
            "risk_data_type": rdt, "spatial_scale": "", "countries_count": 0,
            "license_raw": "", "orgtoken": "", "hazard_suffix": "",
            "status": "blocked_by_policy", "reason": reason,
        })
        continue
    
    out_path = OUT_RECORDS_DIR / info["filename"]
    
    # Skip if exists and configured
    if config.skip_existing and out_path.exists():
        skipped_existing += 1
        qa_rows.append({
            "dataset_id": dataset_id, "output_id": info.get("rdls_id", ""),
            "filename": info.get("filename", ""), "risk_data_type": info.get("risk_data_type", ""),
            "spatial_scale": info.get("spatial_scale", ""), "countries_count": info.get("countries_count", 0),
            "license_raw": info.get("license_raw", ""), "orgtoken": info.get("orgtoken", ""),
            "hazard_suffix": info.get("hazard_suffix", ""),
            "status": "skipped_existing", "reason": "",
        })
        continue
    
    # Validate
    ok, msg = validate_record(rdls_rec)
    validation_rows.append({
        "dataset_id": dataset_id,
        "rdls_id": info["rdls_id"],
        "filename": info["filename"],
        "valid": ok,
        "message": msg,
    })
    if ok:
        validated_ok += 1
    
    # Write JSON
    if config.write_pretty_json:
        out_path.write_text(json.dumps(rdls_rec, indent=2, ensure_ascii=False) + "\n", encoding="utf-8")
    else:
        out_path.write_text(json.dumps(rdls_rec, ensure_ascii=False) + "\n", encoding="utf-8")
    
    append_jsonl(OUT_INDEX_JSONL, info)
    written += 1
    
    qa_rows.append({
        "dataset_id": dataset_id, "output_id": info.get("rdls_id", ""),
        "filename": info.get("filename", ""), "risk_data_type": info.get("risk_data_type", ""),
        "spatial_scale": info.get("spatial_scale", ""), "countries_count": info.get("countries_count", 0),
        "license_raw": info.get("license_raw", ""), "orgtoken": info.get("orgtoken", ""),
        "hazard_suffix": info.get("hazard_suffix", ""),
        "status": "written", "reason": "",
    })

# Summary
print("\n" + "="*50)
print("TRANSLATION COMPLETE")
print("="*50)
print(f"Written: {written:,}")
print(f"Skipped (existing): {skipped_existing:,}")
print(f"Blocked: {blocked:,}")
print(f"Schema valid: {validated_ok:,} of {len(validation_rows):,}")

jsonschema validation enabled (Draft2020-12)

Processing 13,152 datasets...




TRANSLATION COMPLETE
Written: 13,152
Skipped (existing): 0
Blocked: 0
Schema valid: 13,143 of 13,152
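
The index is a JSON Lines file: one JSON object per line, appended as each record is written. As a rough sketch of how such a file can be written and read back with the standard library alone, consider the following. The helper names and the sample rows are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def append_jsonl_sketch(path: Path, obj: Dict[str, Any]) -> None:
    """Append one object as a single JSON line."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(obj, ensure_ascii=False) + "\n")


def read_jsonl_sketch(path: Path) -> List[Dict[str, Any]]:
    """Read all non-empty lines back as dictionaries."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    index_path = Path(tmp) / "rdls_index.jsonl"
    index_path.write_text("", encoding="utf-8")  # start from an empty index
    append_jsonl_sketch(index_path, {"dataset_id": "a1", "blocked": False})
    append_jsonl_sketch(index_path, {"dataset_id": "b2", "blocked": False})
    rows = read_jsonl_sketch(index_path)

print(len(rows), rows[0]["dataset_id"])
```

Appending line by line means a partially completed run still leaves a readable index for the records written so far.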


## 10. Save Reports

In [10]:
"""
Save translation reports to CSV files.
"""

# Save blocked datasets report
blocked_df = pd.DataFrame(blocked_rows, columns=["dataset_id", "status", "reason", "risk_data_type"])
blocked_df.to_csv(OUT_BLOCKED_CSV, index=False)
print(f"Wrote: {OUT_BLOCKED_CSV}")

# Save validation report
val_df = pd.DataFrame(validation_rows, columns=["dataset_id", "rdls_id", "filename", "valid", "message"])
val_df.to_csv(OUT_VALIDATION_CSV, index=False)
print(f"Wrote: {OUT_VALIDATION_CSV}")

# Save QA report
qa_df = pd.DataFrame(qa_rows, columns=[
    "dataset_id", "output_id", "filename", "risk_data_type",
    "spatial_scale", "countries_count", "license_raw", "orgtoken",
    "hazard_suffix", "status", "reason",
])
qa_df.to_csv(OUT_QA_CSV, index=False)
print(f"Wrote: {OUT_QA_CSV}")

print(f"\nIndex file: {OUT_INDEX_JSONL}")

Wrote: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump/rdls/reports/translation_blocked.csv
Wrote: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump/rdls/reports/schema_validation.csv
Wrote: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump/rdls/reports/translation_qa.csv

Index file: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump/rdls/index/rdls_index.jsonl


## 11. QA Summary

In [11]:
"""
Quick QA summary of translation results.
"""
from pandas.errors import EmptyDataError


def safe_read_csv(path: Path) -> pd.DataFrame:
    """Read CSV safely, returning empty DataFrame if file is empty."""
    try:
        return pd.read_csv(path)
    except EmptyDataError:
        return pd.DataFrame()


print("\n" + "="*50)
print("QA SUMMARY")
print("="*50)

# Index stats
idx_lines = OUT_INDEX_JSONL.read_text(encoding="utf-8").strip().splitlines()
print(f"Index lines: {len(idx_lines):,}")
print(f"Records on disk: {len(list(OUT_RECORDS_DIR.glob('*.json'))):,}")

# Blocked stats
if OUT_BLOCKED_CSV.exists():
    blocked_df = safe_read_csv(OUT_BLOCKED_CSV)
    print(f"\nBlocked datasets: {len(blocked_df):,}")
    if not blocked_df.empty and "reason" in blocked_df.columns:
        print("Top blocked reasons:")
        for reason, count in blocked_df["reason"].value_counts().head(5).items():
            print(f"  - {reason}: {count:,}")

# Validation stats
if OUT_VALIDATION_CSV.exists():
    val_df = safe_read_csv(OUT_VALIDATION_CSV)
    if not val_df.empty:
        failures = val_df.loc[~val_df["valid"].astype(bool), "message"]
        print(f"\nValidation failures: {len(failures):,}")
        if len(failures) > 0:
            print("Top validation errors:")
            for msg, count in failures.value_counts().head(5).items():
                print(f"  - {msg[:60]}...: {count:,}")

# QA status
if OUT_QA_CSV.exists():
    qa_df = safe_read_csv(OUT_QA_CSV)
    if not qa_df.empty and "status" in qa_df.columns:
        print("\nQA Status Distribution:")
        for status, count in qa_df["status"].value_counts().items():
            print(f"  - {status}: {count:,}")

# --- Nested required fields validation (M8) ---
print("\nNested Required Fields Check:")
print("-" * 40)

REQUIRED_ATTRIBUTION_ROLES = {"publisher", "creator", "contact_point"}
REQUIRED_RESOURCE_FIELDS = {"id", "title", "data_format", "access_modality"}

nested_warnings = []

for rec_path in sorted(OUT_RECORDS_DIR.glob("*.json")):
    try:
        rec = json.loads(rec_path.read_text(encoding="utf-8"))
    except Exception:
        continue
    datasets = rec.get("datasets")
    ds = datasets[0] if isinstance(datasets, list) and datasets else rec
    rdls_id = ds.get("id", rec_path.stem)

    # Check attributions: must have all 3 required roles
    attributions = ds.get("attributions", [])
    attr_roles = {a.get("role") for a in attributions if isinstance(a, dict)}
    missing_roles = REQUIRED_ATTRIBUTION_ROLES - attr_roles
    if missing_roles:
        nested_warnings.append({
            "rdls_id": rdls_id,
            "check": "attribution_roles",
            "detail": f"Missing roles: {sorted(missing_roles)}",
        })

    # Check resources: each must have required sub-fields
    resources = ds.get("resources", [])
    for i, r in enumerate(resources):
        if not isinstance(r, dict):
            continue
        missing_res = {k for k in REQUIRED_RESOURCE_FIELDS if not r.get(k)}
        if missing_res:
            nested_warnings.append({
                "rdls_id": rdls_id,
                "check": f"resource[{i}]_fields",
                "detail": f"Missing fields: {sorted(missing_res)}",
            })

if nested_warnings:
    print(f"  Nested field warnings: {len(nested_warnings)}")
    for w in nested_warnings[:10]:
        print(f"    [{w['rdls_id']}] {w['check']}: {w['detail']}")
    if len(nested_warnings) > 10:
        print(f"    ... and {len(nested_warnings) - 10} more")
else:
    print("  All records have complete attributions (3 roles) and resource sub-fields.")

# Report unmapped formats if any were tracked
if unmapped_hdx_formats:
    print(f"\nUnmapped HDX Formats ({len(unmapped_hdx_formats)}):")
    for fmt in sorted(unmapped_hdx_formats):
        print(f"  - {fmt}")

print("\n" + "="*50)
print("Step 6 complete. Proceed to Step 7 for validation and packaging.")
print("="*50)


QA SUMMARY
Index lines: 13,152
Records on disk: 13,152

Blocked datasets: 0

Validation failures: 9
Top validation errors:
  - spatial.countries.0: 'XKX' is not one of ['AFG', 'ALB', 'DZA...: 5
  - spatial.countries.137: 'XKX' is not one of ['AFG', 'ALB', 'D...: 1
  - spatial.countries.176: 'XKX' is not one of ['AFG', 'ALB', 'D...: 1
  - spatial.countries.87: 'XKX' is not one of ['AFG', 'ALB', 'DZ...: 1
  - spatial.countries.203: 'XKX' is not one of ['AFG', 'ALB', 'D...: 1

QA Status Distribution:
  - written: 13,152

Nested Required Fields Check:
----------------------------------------
  All records have complete attributions (3 roles) and resource sub-fields.

Unmapped HDX Formats (6):
  - arc/info grid
  - doc
  - erdas image
  - stata data file
  - zipped jpeg
  - zipped tif

Step 6 complete. Proceed to Step 7 for validation and packaging.
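
In the summary above, the same `'XKX'` error is counted separately for each array position (`spatial.countries.0`, `spatial.countries.137`, and so on). One possible way to group such messages is to replace numeric indices in the leading path with a wildcard before counting. The sketch below assumes messages of the form `path: text` as produced by `validate_record`. The function names are hypothetical and the sample messages are abbreviated stand-ins, not actual rows from the report.

```python
import re
from collections import Counter
from typing import Iterable


def normalize_error_sketch(message: str) -> str:
    """Replace numeric array indices in the leading path with '*'."""
    path, sep, rest = message.partition(": ")
    path = re.sub(r"(?<=\.)\d+(?=\.|$)", "*", path)
    return f"{path}{sep}{rest}"


def count_errors_sketch(messages: Iterable[str]) -> Counter:
    """Count messages after index normalization."""
    return Counter(normalize_error_sketch(m) for m in messages)


# Illustrative messages modeled on the output above (enum list abbreviated).
sample = [
    "spatial.countries.0: 'XKX' is not one of [...]",
    "spatial.countries.137: 'XKX' is not one of [...]",
    "spatial.countries.87: 'XKX' is not one of [...]",
]
print(count_errors_sketch(sample).most_common(1))
```

This only normalizes the first path in a message, so a record with several errors joined by ` | ` would need to be split first.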


## End of Code