# Notebook 11: RDLS Vulnerability Block Extractor

**Purpose**: Extract and populate RDLS v0.3 Vulnerability component blocks from HDX metadata.

**RDLS Vulnerability Block Structure (v0.3)**:
```
vulnerability:
  functions:
    vulnerability: [...]        # vulnerability curves (damage ratio vs intensity)
    fragility: [...]            # fragility curves (P(damage state) vs intensity)
    damage_to_loss: [...]       # consequence functions (loss given damage)
    engineering_demand: [...]   # engineering demand functions
  socio_economic: [...]         # socio-economic vulnerability indicators/indices
```

**Two Extraction Pathways**:
1. **Functions** (vulnerability/fragility/damage_to_loss/engineering_demand):
   - Require 10 mandatory fields per entry from closed codelists
   - Rare in HDX metadata â€” typically only in specialized risk model datasets
2. **Socio-economic indicators**:
   - Common in HDX â€” poverty, displacement, food security, health, education indices
   - Require indicator_name, indicator_code, description, reference_year

**Author**: Benny Istanto/Risk Data Librarian/GFDRR    
**Version**: 2026.2

---

## 1. Setup and Configuration

In [1]:
"""
1.1 Import Dependencies
"""

import json
import os
import re
import yaml
from pathlib import Path
from collections import Counter, defaultdict
from datetime import datetime
from typing import Dict, List, Tuple, Optional, Any, Set
from dataclasses import dataclass, field, asdict
from copy import deepcopy

import pandas as pd
import numpy as np

try:
    from tqdm.notebook import tqdm
    HAS_TQDM = True
except ImportError:
    HAS_TQDM = False

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 120)

print(f"Notebook started: {datetime.now().isoformat()}")

Notebook started: 2026-02-11T17:41:17.346813


In [2]:
"""
1.2 Paths and Configuration
"""

NOTEBOOK_DIR = Path.cwd()
BASE_DIR = NOTEBOOK_DIR.parent if NOTEBOOK_DIR.name == 'notebook' else NOTEBOOK_DIR

# Input paths
DATASET_METADATA_DIR = BASE_DIR / 'hdx_dataset_metadata_dump' / 'dataset_metadata'
SIGNAL_DICT_PATH = BASE_DIR / 'hdx_dataset_metadata_dump' / 'config' / 'signal_dictionary.yaml'
RDLS_SCHEMA_PATH = BASE_DIR / 'hdx_dataset_metadata_dump' / 'rdls' / 'schema' / 'rdls_schema_v0.3.json'

# Output paths
OUTPUT_DIR = BASE_DIR / 'hdx_dataset_metadata_dump' / 'rdls' / 'extracted'
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# â”€â”€ Output cleanup mode â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€
# Controls what happens to old output files when this notebook is re-run.
#   "replace" - Auto-delete old outputs and continue (default)
#   "prompt"  - Show what will be deleted, ask user to confirm
#   "skip"    - Keep old files, write new on top (may leave orphans)
#   "abort"   - Stop if old outputs exist (for CI/automated runs)
CLEANUP_MODE = "replace"

# Hazard extraction results from NB 09 (for cross-referencing)
HAZARD_CSV_PATH = OUTPUT_DIR / 'hazard_extraction_results.csv'

# Verify paths
assert DATASET_METADATA_DIR.exists(), f"Not found: {DATASET_METADATA_DIR}"
assert SIGNAL_DICT_PATH.exists(), f"Not found: {SIGNAL_DICT_PATH}"
assert RDLS_SCHEMA_PATH.exists(), f"Not found: {RDLS_SCHEMA_PATH}"

# Load configs
with open(SIGNAL_DICT_PATH, 'r', encoding='utf-8') as f:
    SIGNAL_DICT = yaml.safe_load(f)

with open(RDLS_SCHEMA_PATH, 'r', encoding='utf-8') as f:
    RDLS_SCHEMA = json.load(f)

# Load hazard extraction cross-reference (optional)
HAZARD_XREF = {}
if HAZARD_CSV_PATH.exists():
    _hdf = pd.read_csv(HAZARD_CSV_PATH)
    for _, row in _hdf.iterrows():
        if row.get('has_hazard') and pd.notna(row.get('hazard_types')):
            HAZARD_XREF[str(row['id'])] = {
                'hazard_types': str(row['hazard_types']).split('|'),
                'process_types': str(row.get('process_types', '')).split('|') if pd.notna(row.get('process_types')) else [],
                'analysis_type': str(row.get('analysis_type', '')) if pd.notna(row.get('analysis_type')) else None,
                'intensity_measures': str(row.get('intensity_measures', '')).split('|') if pd.notna(row.get('intensity_measures')) else [],
            }
    print(f"Loaded hazard cross-reference: {len(HAZARD_XREF):,} records")
else:
    print("WARNING: Hazard extraction CSV not found â€” will infer hazard context from text")

print(f"Configuration loaded.")
print(f"  Signal Dictionary sections: {list(SIGNAL_DICT.keys())}")
print(f"  Base: {BASE_DIR}")
print(f"  Output: {OUTPUT_DIR}")
print(f"  Cleanup mode: {CLEANUP_MODE}")

Loaded hazard cross-reference: 3,208 records
Configuration loaded.
  Signal Dictionary sections: ['hazard_type', 'process_type', 'exposure_category', 'analysis_type', 'return_period', 'spatial_scale', 'vulnerability_indicators', 'loss_indicators', 'format_hints', 'organization_hints', 'exclusion_patterns']
  Base: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler
  Output: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump/rdls/extracted
  Cleanup mode: replace


In [3]:
"""
1.3 Load Schema Constants

Load all closed codelist enums from the RDLS v0.3 schema for validation.
"""

# --- Closed codelist enums ---
VALID_FUNCTION_APPROACHES: Set[str] = set(RDLS_SCHEMA['$defs']['function_approach']['enum'])
VALID_RELATIONSHIP_TYPES: Set[str] = set(RDLS_SCHEMA['$defs']['relationship_type']['enum'])
VALID_HAZARD_TYPES: Set[str] = set(RDLS_SCHEMA['$defs']['hazard_type']['enum'])
VALID_PROCESS_TYPES: Set[str] = set(RDLS_SCHEMA['$defs']['process_type']['enum'])
VALID_ANALYSIS_TYPES: Set[str] = set(RDLS_SCHEMA['$defs']['analysis_type']['enum'])
VALID_EXPOSURE_CATEGORIES: Set[str] = set(RDLS_SCHEMA['$defs']['exposure_category']['enum'])
VALID_IMPACT_TYPES: Set[str] = set(RDLS_SCHEMA['$defs']['VulnerabilityCommonFields']['properties']['impact_type']['enum'])
VALID_CALCULATION_TYPES: Set[str] = set(RDLS_SCHEMA['$defs']['data_calculation_type']['enum'])
VALID_IMPACT_METRICS: Set[str] = set(RDLS_SCHEMA['$defs']['impact_metric']['enum'])
VALID_TAXONOMIES: Set[str] = set(RDLS_SCHEMA['$defs']['taxonomy']['enum'])

# --- Hazard type -> process type mappings ---
HAZARD_PROCESS_MAPPINGS: Dict[str, List[str]] = RDLS_SCHEMA.get('hazard_process_mappings', {})

# --- Intensity measure mappings (if available) ---
INTENSITY_MEASURE_MAPPINGS: Dict[str, List[str]] = RDLS_SCHEMA.get('intensity_measure_mappings', {})

# --- Default hazard process per hazard type ---
HAZARD_PROCESS_DEFAULT = {
    'flood': 'fluvial_flood',
    'earthquake': 'ground_motion',
    'tsunami': 'tsunami',
    'drought': 'meteorological_drought',
    'landslide': 'landslide_general',
    'wildfire': 'wildfire',
    'volcanic': 'ashfall',
    'extreme_temperature': 'extreme_heat',
    'strong_wind': 'tropical_cyclone',
    'convective_storm': 'tornado',
    'coastal_flood': 'storm_surge',
}

# --- Default intensity measure per hazard type ---
DEFAULT_INTENSITY_MEASURE = {}
for ht in VALID_HAZARD_TYPES:
    measures = INTENSITY_MEASURE_MAPPINGS.get(ht, [])
    if measures:
        DEFAULT_INTENSITY_MEASURE[ht] = measures[0]

# Fallback defaults if schema doesn't have intensity_measure_mappings
_IM_FALLBACK = {
    'flood': 'wd:m', 'coastal_flood': 'wd:m', 'earthquake': 'PGA:g',
    'tsunami': 'Rh_tsi:m', 'drought': 'SPI:-', 'wildfire': 'FWI:-',
    'convective_storm': 'sws_10m:m/s', 'strong_wind': 'sws_10m:m/s',
    'landslide': 'PGA:g', 'volcanic': 'AirTemp:C', 'extreme_temperature': 'AirTemp:C',
}
for ht, im in _IM_FALLBACK.items():
    if ht not in DEFAULT_INTENSITY_MEASURE:
        DEFAULT_INTENSITY_MEASURE[ht] = im

print("Schema constants loaded:")
print(f"  Function approaches ({len(VALID_FUNCTION_APPROACHES)}): {sorted(VALID_FUNCTION_APPROACHES)}")
print(f"  Relationship types ({len(VALID_RELATIONSHIP_TYPES)}): {sorted(VALID_RELATIONSHIP_TYPES)}")
print(f"  Hazard types ({len(VALID_HAZARD_TYPES)}): {len(VALID_HAZARD_TYPES)} values")
print(f"  Impact types ({len(VALID_IMPACT_TYPES)}): {sorted(VALID_IMPACT_TYPES)}")
print(f"  Calculation types ({len(VALID_CALCULATION_TYPES)}): {sorted(VALID_CALCULATION_TYPES)}")
print(f"  Impact metrics ({len(VALID_IMPACT_METRICS)}): {len(VALID_IMPACT_METRICS)} values")
print(f"  Taxonomies ({len(VALID_TAXONOMIES)}): {sorted(VALID_TAXONOMIES)}")
print(f"  Exposure categories ({len(VALID_EXPOSURE_CATEGORIES)}): {sorted(VALID_EXPOSURE_CATEGORIES)}")

Schema constants loaded:
  Function approaches (4): ['analytical', 'empirical', 'hybrid', 'judgement']
  Relationship types (3): ['discrete', 'math_bespoke', 'math_parametric']
  Hazard types (11): 11 values
  Impact types (3): ['direct', 'indirect', 'total']
  Calculation types (3): ['inferred', 'observed', 'simulated']
  Impact metrics (20): 20 values
  Taxonomies (12): ['CDC-SVI', 'Custom', 'EMDAT', 'EMS-98', 'GED4ALL', 'GLIDE', 'HAZUS', 'INFORM', 'MOVER', 'OED', 'PAGER', 'USGS_EHP']
  Exposure categories (7): ['agriculture', 'buildings', 'development_index', 'economic_indicator', 'infrastructure', 'natural_environment', 'population']


## 2. Vulnerability Detection Patterns

In [4]:
"""
2.1 Define Detection Patterns

Patterns for detecting vulnerability function types, socio-economic indicators,
and inferring mandatory field values from HDX metadata text.
"""

# --- Function type detection ---
FUNCTION_TYPE_PATTERNS = {
    'vulnerability': [
        r'\b(vulnerability[\s._-]?curve|vulnerability[\s._-]?function)\b',
        r'\b(damage[\s._-]?curve|damage[\s._-]?function)\b',
        r'\b(mean[\s._-]?damage[\s._-]?ratio|mdr)\b',
        r'\b(damage[\s._-]?ratio[\s._-]?(?:vs|versus|function))\b',
        r'\b(depth[\s._-]?damage)\b',
    ],
    'fragility': [
        r'\b(fragility[\s._-]?curve|fragility[\s._-]?function)\b',
        r'\b(probability[\s._-]?of[\s._-]?damage|failure[\s._-]?probability)\b',
        r'\b(capacity[\s._-]?spectrum|pushover)\b',
        r'\b(damage[\s._-]?state[\s._-]?(?:ds|probability))\b',
        r'\b(lognormal[\s._-]?fragility)\b',
    ],
    'damage_to_loss': [
        r'\b(damage[\s._-]?to[\s._-]?loss|consequence[\s._-]?function)\b',
        r'\b(loss[\s._-]?function|loss[\s._-]?model)\b',
        r'\b(repair[\s._-]?cost[\s._-]?(?:function|ratio|curve))\b',
        r'\b(replacement[\s._-]?cost[\s._-]?(?:function|ratio))\b',
    ],
    'engineering_demand': [
        r'\b(engineering[\s._-]?demand)\b',
        r'\b(interstorey[\s._-]?drift|inter[\s._-]?storey[\s._-]?drift)\b',
        r'\b(floor[\s._-]?acceleration|peak[\s._-]?floor)\b',
        r'\b(spectral[\s._-]?displacement|demand[\s._-]?capacity[\s._-]?ratio)\b',
    ],
}

# --- Approach inference ---
APPROACH_PATTERNS = {
    'analytical': [
        r'\b(analytical|numerical|finite[\s._-]?element|simulation[\s._-]?based)\b',
        r'\b(capacity[\s._-]?spectrum|pushover[\s._-]?analysis|nonlinear[\s._-]?analysis)\b',
        r'\b(time[\s._-]?history[\s._-]?analysis|dynamic[\s._-]?analysis)\b',
    ],
    'empirical': [
        r'\b(empirical|observed|survey[\s._-]?based|field[\s._-]?data)\b',
        r'\b(post[\s._-]?disaster|post[\s._-]?event|damage[\s._-]?survey)\b',
        r'\b(historical[\s._-]?data|real[\s._-]?event)\b',
    ],
    'hybrid': [
        r'\b(hybrid|combined|mixed[\s._-]?method)\b',
    ],
    'judgement': [
        r'\b(expert[\s._-]?judg[e]?ment|expert[\s._-]?opinion|elicitation)\b',
        r'\b(heuristic|rule[\s._-]?based)\b',
    ],
}

# --- Relationship type inference ---
RELATIONSHIP_PATTERNS = {
    'math_parametric': [
        r'\b(parametric|lognormal|normal[\s._-]?distribution|cumulative[\s._-]?distribution)\b',
        r'\b(cdf|probability[\s._-]?distribution|log[\s._-]?normal)\b',
        r'\b(median[\s._-]?and[\s._-]?dispersion|mu[\s._-]?and[\s._-]?sigma)\b',
    ],
    'math_bespoke': [
        r'\b(bespoke|custom[\s._-]?function|non[\s._-]?standard)\b',
        r'\b(piecewise|polynomial|spline)\b',
    ],
    'discrete': [
        r'\b(discrete|tabular|lookup[\s._-]?table|step[\s._-]?function)\b',
        r'\b(depth[\s._-]?damage[\s._-]?table|damage[\s._-]?matrix)\b',
    ],
}

# --- Impact type inference ---
IMPACT_TYPE_PATTERNS = {
    'direct': [r'\b(direct[\s._-]?(?:loss|damage|impact))\b'],
    'indirect': [r'\b(indirect[\s._-]?(?:loss|damage)|business[\s._-]?interruption|downtime)\b'],
    'total': [r'\b(total[\s._-]?(?:loss|damage|impact)|combined[\s._-]?loss)\b'],
}

# --- Impact modelling inference ---
IMPACT_MODELLING_PATTERNS = {
    'simulated': [r'\b(simulat|model(?:led|ed)|scenario[\s._-]?based)\b'],
    'observed': [r'\b(observed|recorded|actual|measured|field[\s._-]?survey)\b'],
    'inferred': [r'\b(inferred|derived|estimated|statistical)\b'],
}

# --- Default impact_metric per function type ---
DEFAULT_IMPACT_METRIC = {
    'vulnerability': 'damage_ratio',
    'fragility': 'probability',
    'damage_to_loss': 'loss_ratio',
    'engineering_demand': 'damage_index',
}

# --- Socio-economic indicator detection ---
SOCIOECONOMIC_INDICATORS = [
    {
        'patterns': [r'\b(poverty[\s._-]?headcount|poverty[\s._-]?ratio|below[\s._-]?poverty[\s._-]?line)\b',
                     r'\b(poverty[\s._-]?index|poverty[\s._-]?rate|poor[\s._-]?population)\b'],
        'indicator_name': 'Poverty headcount ratio',
        'indicator_code': 'POV_HEADCOUNT',
        'scheme': 'Custom',
        'description': 'Population living below the poverty line with limited resources for disaster preparedness and recovery',
    },
    {
        'patterns': [r'\b(human[\s._-]?development[\s._-]?index|hdi)\b'],
        'indicator_name': 'Human Development Index',
        'indicator_code': 'HDI',
        'scheme': 'Custom',
        'description': 'Composite index measuring average achievement in health, education, and standard of living',
    },
    {
        'patterns': [r'\b(social[\s._-]?vulnerability[\s._-]?index|svi)\b',
                     r'\b(socio[\s._-]?economic[\s._-]?vulnerability[\s._-]?index)\b'],
        'indicator_name': 'Social Vulnerability Index',
        'indicator_code': 'SVI_OVERALL',
        'scheme': 'CDC-SVI',
        'description': 'Overall social vulnerability score combining socioeconomic status, household composition, minority status, and housing factors',
    },
    {
        'patterns': [r'\b(food[\s._-]?security|food[\s._-]?insecurity|ipc[\s._-]?phase|ipc[\s._-]?classification)\b',
                     r'\b(food[\s._-]?crisis|famine[\s._-]?early[\s._-]?warning)\b'],
        'indicator_name': 'Food security classification',
        'indicator_code': 'FOOD_SECURITY',
        'scheme': 'Custom',
        'description': 'Food security status indicating population vulnerability to food crises and famine',
    },
    {
        'patterns': [r'\b(population[\s._-]?density)\b'],
        'indicator_name': 'Population density',
        'indicator_code': 'POP_DENSITY',
        'scheme': 'Custom',
        'description': 'Number of people per unit area, indicating exposure concentration and potential vulnerability',
    },
    {
        'patterns': [r'\b(elderly[\s._-]?population|aging[\s._-]?population|aged[\s._-]?(?:65|over))\b',
                     r'\b(population[\s._-]?(?:over|above)[\s._-]?65)\b'],
        'indicator_name': 'Elderly population percentage',
        'indicator_code': 'AGE_65_PLUS',
        'scheme': 'Custom',
        'description': 'Population aged 65 years and older, more vulnerable to hazard-related health impacts',
    },
    {
        'patterns': [r'\b(education[\s._-]?attainment|literacy[\s._-]?rate|school[\s._-]?enrollment)\b',
                     r'\b(out[\s._-]?of[\s._-]?school|educational[\s._-]?level)\b'],
        'indicator_name': 'Educational attainment',
        'indicator_code': 'EDU_ATTAINMENT',
        'scheme': 'Custom',
        'description': 'Level of educational attainment indicating capacity to understand and respond to hazard warnings',
    },
    {
        'patterns': [r'\b(health[\s._-]?(?:access|facility|service|indicator))\b',
                     r'\b(healthcare[\s._-]?access|medical[\s._-]?facility)\b'],
        'indicator_name': 'Access to healthcare facilities',
        'indicator_code': 'HEALTH_ACCESS',
        'scheme': 'Custom',
        'description': 'Proximity and access to healthcare services affecting disaster response and recovery capacity',
    },
    {
        'patterns': [r'\b(inform[\s._-]?(?:risk|index|severity))\b'],
        'indicator_name': 'INFORM Risk Index',
        'indicator_code': 'INFORM_RISK',
        'scheme': 'INFORM',
        'description': 'Composite risk index combining hazard exposure, vulnerability, and lack of coping capacity',
    },
    {
        'patterns': [r'\b(displaced|displacement|idp|internally[\s._-]?displaced)\b',
                     r'\b(refugee[\s._-]?(?:population|camp|settlement))\b'],
        'indicator_name': 'Displacement indicator',
        'indicator_code': 'DISPLACEMENT',
        'scheme': 'Custom',
        'description': 'Population displaced by conflict or disaster, indicating heightened vulnerability',
    },
    {
        'patterns': [r'\b(coping[\s._-]?capacity|adaptive[\s._-]?capacity)\b',
                     r'\b(lack[\s._-]?of[\s._-]?coping[\s._-]?capacity)\b'],
        'indicator_name': 'Coping capacity index',
        'indicator_code': 'COPING_CAPACITY',
        'scheme': 'Custom',
        'description': 'Capacity of communities to cope with and recover from hazard impacts',
    },
    {
        'patterns': [r'\b(resilience[\s._-]?index|community[\s._-]?resilience)\b'],
        'indicator_name': 'Resilience index',
        'indicator_code': 'RESILIENCE',
        'scheme': 'Custom',
        'description': 'Composite index measuring community resilience to natural hazards and disasters',
    },
    {
        'patterns': [r'\b(deprivation[\s._-]?index|multi[\s._-]?dimensional[\s._-]?poverty)\b'],
        'indicator_name': 'Deprivation index',
        'indicator_code': 'DEPRIVATION',
        'scheme': 'Custom',
        'description': 'Multi-dimensional deprivation index measuring socio-economic disadvantage',
    },
    {
        'patterns': [r'\b(malnutrition|stunting|wasting|underweight|nutrition[\s._-]?status)\b',
                     r'\b(global[\s._-]?acute[\s._-]?malnutrition|gam)\b'],
        'indicator_name': 'Malnutrition prevalence',
        'indicator_code': 'MALNUTRITION',
        'scheme': 'Custom',
        'description': 'Prevalence of malnutrition indicating population health vulnerability to hazard impacts',
    },
    {
        'patterns': [r'\b(vulnerability[\s._-]?index|nexus[\s._-]?risk)\b',
                     r'\b(climate[\s._-]?vulnerability|climate[\s._-]?risk[\s._-]?index)\b'],
        'indicator_name': 'Vulnerability index',
        'indicator_code': 'VULN_INDEX',
        'scheme': 'Custom',
        'description': 'Composite vulnerability index combining multiple socio-economic and environmental factors',
    },
    {
        'patterns': [r'\b(gender[\s._-]?(?:inequality|index|gap))\b',
                     r'\b(women[\s._-]?(?:vulnerability|empowerment))\b'],
        'indicator_name': 'Gender inequality index',
        'indicator_code': 'GENDER_INEQUALITY',
        'scheme': 'Custom',
        'description': 'Gender inequality indicator reflecting differential vulnerability to hazard impacts',
    },
    {
        'patterns': [r'\b(disability[\s._-]?(?:prevalence|rate|population))\b',
                     r'\b(persons[\s._-]?with[\s._-]?disabilities)\b'],
        'indicator_name': 'Disability prevalence',
        'indicator_code': 'DISABILITY',
        'scheme': 'Custom',
        'description': 'Prevalence of disability in the population indicating heightened vulnerability to hazard impacts',
    },
    {
        'patterns': [r'\b(livelihood[\s._-]?(?:zone|index|vulnerability))\b',
                     r'\b(livelihood[\s._-]?coping[\s._-]?strategy)\b'],
        'indicator_name': 'Livelihood vulnerability',
        'indicator_code': 'LIVELIHOOD',
        'scheme': 'Custom',
        'description': 'Livelihood vulnerability indicator measuring economic susceptibility to hazard disruption',
    },
]

# --- Generic socio-economic catch-all patterns ---
GENERIC_SOCIOECONOMIC_PATTERNS = [
    r'\b(socio[\s._-]?economic[\s._-]?vulnerability)\b',
    r'\b(socioeconomic[\s._-]?(?:indicator|index|data))\b',
    r'\b(vulnerability[\s._-]?assessment[\s._-]?(?:data|indicator))\b',
]

# --- Hazard type inference patterns (reuse from signal dictionary) ---
HAZARD_TYPE_PATTERNS = {}
for hazard, config in SIGNAL_DICT.get('hazard_type', {}).items():
    HAZARD_TYPE_PATTERNS[hazard] = [
        re.compile(p, re.IGNORECASE) for p in config.get('patterns', [])
    ]

# --- Exposure category patterns (reuse from signal dictionary) ---
EXPOSURE_CATEGORY_PATTERNS = {}
for cat, config in SIGNAL_DICT.get('exposure_category', {}).items():
    EXPOSURE_CATEGORY_PATTERNS[cat] = [
        re.compile(p, re.IGNORECASE) for p in config.get('patterns', [])
    ]
# Note: All 7 exposure categories (including economic_indicator and
# development_index) are now defined in signal_dictionary.yaml

print("Detection patterns defined.")
print(f"  Function types: {list(FUNCTION_TYPE_PATTERNS.keys())}")
print(f"  Approach patterns: {list(APPROACH_PATTERNS.keys())}")
print(f"  Socio-economic indicators: {len(SOCIOECONOMIC_INDICATORS)}")
print(f"  Hazard type patterns: {len(HAZARD_TYPE_PATTERNS)}")
print(f"  Exposure category patterns: {len(EXPOSURE_CATEGORY_PATTERNS)}")

# =============================================================================
# VULNERABILITY CONSTRAINT TABLES
# =============================================================================
#
# Derived from RDLS v0.3 schema. Used by VulnerabilityExtractor to validate
# field combinations for vulnerability function entries.
#
# Group 1: FUNCTION_TYPE_CONSTRAINTS
#   function_type -> default impact_metric, quantity_kind, allowed metrics
#
# Group 2: VULN_CATEGORY_DEFAULTS
#   exposure_category -> default function_type + metric overrides
#   (Which function types / metrics make sense for each asset category)
#
# Group 3: Reuses IMPACT_METRIC_CONSTRAINTS from Loss section (cell 23)
#   impact_metric -> (quantity_kind, allowed impact_types)
#
# Group 4: FUNCTION_TYPE_APPROACH_DEFAULTS
#   function_type -> (typical_approach, typical_relationship)
# =============================================================================

# --- SHARED: impact_metric -> (quantity_kind, allowed_impact_types) ---
# Shared between Vulnerability (cell 8) and Loss (cell 24) extractors.
# Maps each of the 20 impact_metric values to its expected quantity_kind
# and which impact_types it logically applies to.
# quantity_kind is open codelist so we add 'ratio' (used in Chattogram example).
IMPACT_METRIC_CONSTRAINTS = {
    # Physical damage metrics (direct only)
    'damage_ratio':               ('ratio',    {'direct'}),
    'mean_damage_ratio':          ('ratio',    {'direct'}),
    'damage_index':               ('count',    {'direct'}),
    # Loss ratio metrics
    'loss_ratio':                 ('ratio',    {'direct', 'indirect', 'total'}),
    'mean_loss_ratio':            ('ratio',    {'direct', 'indirect', 'total'}),
    # Probability
    'probability':                ('ratio',    {'direct', 'indirect', 'total'}),
    # Vulnerability metrics
    'downtime_vulnerability':     ('time',     {'indirect'}),
    'casualty_ratio_vulnerability': ('ratio',  {'direct'}),
    # Economic/monetary loss metrics
    'economic_loss_value':        ('monetary', {'direct', 'indirect', 'total'}),
    'insured_loss_value':         ('monetary', {'direct', 'indirect', 'total'}),
    'loss_annual_average_value':  ('monetary', {'total'}),
    'loss_probable_maximum_value': ('monetary', {'total'}),
    'at_risk_value':              ('monetary', {'total'}),
    'at_risk_tail_value':         ('monetary', {'total'}),
    'asset_loss':                 ('monetary', {'direct'}),
    # Time-based loss
    'downtime_loss':              ('time',     {'indirect'}),
    # Count-based metrics
    'casualty_count':             ('count',    {'direct'}),
    'casualty_ratio_loss':        ('ratio',    {'direct'}),
    'displaced_count':            ('count',    {'direct'}),
    'exposure_to_hazard':         ('count',    {'direct'}),
}

# --- Group 1: function_type -> impact_metric + quantity_kind constraints ---
# Each function type has a natural default metric. The 'allowed' set lists
# all impact_metrics that are semantically valid for that function type.
FUNCTION_TYPE_CONSTRAINTS = {
    'vulnerability': {
        'default_metric':    'damage_ratio',
        'default_qty':       'ratio',
        'allowed_metrics':   {
            'damage_ratio', 'mean_damage_ratio',       # physical damage
            'casualty_ratio_vulnerability',             # human vulnerability
            'downtime_vulnerability',                   # service disruption
        },
    },
    'fragility': {
        'default_metric':    'probability',
        'default_qty':       'ratio',
        'allowed_metrics':   {
            'probability',                              # P(damage_state | intensity)
            'damage_index',                             # ordinal damage state
            'damage_ratio', 'mean_damage_ratio',        # continuous damage
        },
    },
    'damage_to_loss': {
        'default_metric':    'loss_ratio',
        'default_qty':       'ratio',
        'allowed_metrics':   {
            'loss_ratio', 'mean_loss_ratio',            # fractional loss
            'economic_loss_value', 'insured_loss_value', # absolute monetary
            'asset_loss',                                # asset replacement
            'downtime_loss',                             # service interruption
            'casualty_count', 'displaced_count',         # human consequences
        },
    },
    'engineering_demand': {
        'default_metric':    'damage_index',
        'default_qty':       'count',
        'allowed_metrics':   {
            'damage_index',                             # ordinal state
            'damage_ratio', 'mean_damage_ratio',        # continuous damage
            'probability',                              # exceedance
        },
    },
}

# --- Group 2: category -> default function type & metric overrides ---
# When a category is detected, what function behaviour makes most sense?
VULN_CATEGORY_DEFAULTS = {
    'buildings': {
        'typical_function':  'fragility',
        'metric_override':   'damage_ratio',
        'qty_override':      'ratio',
    },
    'infrastructure': {
        'typical_function':  'vulnerability',
        'metric_override':   'downtime_vulnerability',
        'qty_override':      'time',
    },
    'population': {
        'typical_function':  'vulnerability',
        'metric_override':   'casualty_ratio_vulnerability',
        'qty_override':      'ratio',
    },
    'agriculture': {
        'typical_function':  'vulnerability',
        'metric_override':   'damage_ratio',
        'qty_override':      'ratio',
    },
    'natural_environment': {
        'typical_function':  'vulnerability',
        'metric_override':   'damage_ratio',
        'qty_override':      'ratio',
    },
    'economic_indicator': {
        'typical_function':  'damage_to_loss',
        'metric_override':   'economic_loss_value',
        'qty_override':      'monetary',
    },
    'development_index': {
        'typical_function':  'vulnerability',
        'metric_override':   'damage_index',
        'qty_override':      'count',
    },
}

# --- Group 4: function_type -> typical approach + relationship ---
# Fragility functions are usually parametric (lognormal CDF);
# vulnerability functions are often discrete (depth-damage tables).
FUNCTION_TYPE_APPROACH_DEFAULTS = {
    'vulnerability': {
        'typical_approach':     'empirical',
        'typical_relationship': 'discrete',
    },
    'fragility': {
        'typical_approach':     'analytical',
        'typical_relationship': 'math_parametric',
    },
    'damage_to_loss': {
        'typical_approach':     'empirical',
        'typical_relationship': 'discrete',
    },
    'engineering_demand': {
        'typical_approach':     'analytical',
        'typical_relationship': 'math_parametric',
    },
}

print("\nVulnerability constraint tables defined.")
print(f"  Group 1 - Function type constraints: {len(FUNCTION_TYPE_CONSTRAINTS)} types")
print(f"  Group 2 - Category defaults: {len(VULN_CATEGORY_DEFAULTS)} categories")
print(f"  Group 3 - Reuses IMPACT_METRIC_CONSTRAINTS from Loss section")
print(f"  Group 4 - Approach defaults: {len(FUNCTION_TYPE_APPROACH_DEFAULTS)} types")


Detection patterns defined.
  Function types: ['vulnerability', 'fragility', 'damage_to_loss', 'engineering_demand']
  Approach patterns: ['analytical', 'empirical', 'hybrid', 'judgement']
  Socio-economic indicators: 18
  Hazard type patterns: 11
  Exposure category patterns: 7

Vulnerability constraint tables defined.
  Group 1 - Function type constraints: 4 types
  Group 2 - Category defaults: 7 categories
  Group 3 - Reuses IMPACT_METRIC_CONSTRAINTS from Loss section
  Group 4 - Approach defaults: 4 types


In [5]:
"""
2.2 Data Classes
"""

@dataclass
class FunctionExtraction:
    """Extraction result for a single vulnerability function."""
    function_type: str                  # vulnerability, fragility, damage_to_loss, engineering_demand
    approach: str = 'empirical'         # function_approach codelist
    relationship: str = 'discrete'      # relationship_type codelist
    hazard_primary: Optional[str] = None  # hazard_type codelist
    hazard_secondary: Optional[str] = None
    hazard_process_primary: Optional[str] = None
    hazard_process_secondary: Optional[str] = None
    hazard_analysis_type: str = 'empirical'  # analysis_type codelist
    intensity_measure: Optional[str] = None
    category: Optional[str] = None       # exposure_category codelist
    impact_type: str = 'direct'          # impact_type codelist
    impact_modelling: str = 'observed'   # data_calculation_type codelist
    impact_metric: Optional[str] = None  # impact_metric codelist
    quantity_kind: str = 'ratio'         # open codelist
    taxonomy: Optional[str] = None       # taxonomy codelist
    analysis_details: Optional[str] = None
    damage_scale_name: Optional[str] = None
    damage_states_names: Optional[List[str]] = None
    parameter: Optional[str] = None      # engineering_demand only
    confidence: float = 0.0

@dataclass
class SocioEconomicExtraction:
    """Extraction result for a socio-economic indicator."""
    indicator_name: str
    indicator_code: str
    description: str
    reference_year: Optional[int] = None
    scheme: str = 'Custom'
    threshold: Optional[str] = None
    uri: Optional[str] = None
    analysis_details: Optional[str] = None
    confidence: float = 0.0

@dataclass
class VulnerabilityExtraction:
    """Complete vulnerability extraction for a dataset."""
    functions: List[FunctionExtraction] = field(default_factory=list)
    socio_economic: List[SocioEconomicExtraction] = field(default_factory=list)
    overall_confidence: float = 0.0

    def has_any_signal(self) -> bool:
        return len(self.functions) > 0 or len(self.socio_economic) > 0

print("Data classes defined.")

Data classes defined.


In [6]:
"""
2.3 VulnerabilityExtractor Class

Main extraction engine using signal dictionary + schema constants.
All inferred field combinations are validated against constraint tables:
  Group 1: FUNCTION_TYPE_CONSTRAINTS (function_type -> metrics)
  Group 2: VULN_CATEGORY_DEFAULTS (category -> function type + metric overrides)
  Group 3: IMPACT_METRIC_CONSTRAINTS (metric -> quantity_kind + impact_type)
  Group 4: FUNCTION_TYPE_APPROACH_DEFAULTS (function_type -> approach + relationship)
"""

class VulnerabilityExtractor:
    """
    Extracts RDLS Vulnerability block components from HDX metadata.

    Two pathways:
    1. Functions: Detect vulnerability/fragility/damage_to_loss/engineering_demand
       curves, infer all mandatory fields from text + hazard cross-reference.
    2. Socio-economic: Detect indicator datasets (poverty, HDI, SVI, etc.),
       extract indicator_name, indicator_code, description, reference_year.
    """

    def __init__(self, signal_dict: Dict[str, Any], hazard_xref: Dict[str, Dict]):
        self.signal_dict = signal_dict
        self.hazard_xref = hazard_xref
        self._compile_patterns()

    def _compile_patterns(self) -> None:
        """Pre-compile regex patterns."""
        self.func_type_patterns = {}
        for ftype, patterns in FUNCTION_TYPE_PATTERNS.items():
            self.func_type_patterns[ftype] = [re.compile(p, re.IGNORECASE) for p in patterns]

        self.approach_patterns = {}
        for approach, patterns in APPROACH_PATTERNS.items():
            self.approach_patterns[approach] = [re.compile(p, re.IGNORECASE) for p in patterns]

        self.relationship_patterns = {}
        for rel, patterns in RELATIONSHIP_PATTERNS.items():
            self.relationship_patterns[rel] = [re.compile(p, re.IGNORECASE) for p in patterns]

        self.impact_type_patterns = {}
        for itype, patterns in IMPACT_TYPE_PATTERNS.items():
            self.impact_type_patterns[itype] = [re.compile(p, re.IGNORECASE) for p in patterns]

        self.impact_modelling_patterns = {}
        for mod, patterns in IMPACT_MODELLING_PATTERNS.items():
            self.impact_modelling_patterns[mod] = [re.compile(p, re.IGNORECASE) for p in patterns]

        self.socio_indicator_patterns = []
        for ind in SOCIOECONOMIC_INDICATORS:
            compiled = [re.compile(p, re.IGNORECASE) for p in ind['patterns']]
            self.socio_indicator_patterns.append({**ind, 'compiled': compiled})

        self.generic_socio_patterns = [re.compile(p, re.IGNORECASE) for p in GENERIC_SOCIOECONOMIC_PATTERNS]

    def _get_all_text(self, record: Dict[str, Any]) -> str:
        """Concatenate all searchable text fields for pattern matching.

        Note: methodology_other is deliberately excluded. It describes
        how analysis was performed (e.g. 'vulnerability models used in
        risk calculation'), not what data the dataset contains. Including
        it causes false positives where methodology text about risk
        assessment triggers vulnerability/loss detection on hazard-only
        datasets.
        """
        parts = [
            record.get('title', ''),
            record.get('name', ''),
            record.get('notes', ''),
        ]
        for tag in record.get('tags', []):
            if isinstance(tag, dict):
                parts.append(tag.get('name', ''))
            elif isinstance(tag, str):
                parts.append(tag)
        for r in record.get('resources', []):
            parts.append(r.get('name', '') or '')
            parts.append(r.get('description', '') or '')
        return ' '.join(filter(None, parts))

    def _detect_function_types(self, text: str) -> List[str]:
        """Detect which vulnerability function types are present."""
        detected = []
        for ftype, patterns in self.func_type_patterns.items():
            for p in patterns:
                if p.search(text):
                    detected.append(ftype)
                    break
        return detected

    def _infer_approach(self, text: str, function_type: str) -> str:
        """
        Infer function approach from text, with function-type defaults.
        Uses Group 4 defaults when text provides no signal.
        """
        scores = {k: 0 for k in VALID_FUNCTION_APPROACHES}
        for approach, patterns in self.approach_patterns.items():
            for p in patterns:
                if p.search(text):
                    scores[approach] += 1
        best = max(scores, key=scores.get)
        if scores[best] > 0:
            return best
        # Fallback to Group 4 default for this function type
        defaults = FUNCTION_TYPE_APPROACH_DEFAULTS.get(function_type, {})
        return defaults.get('typical_approach', 'empirical')

    def _infer_relationship(self, text: str, function_type: str) -> str:
        """
        Infer relationship type from text, with function-type defaults.
        Uses Group 4 defaults when text provides no signal.
        """
        scores = {k: 0 for k in VALID_RELATIONSHIP_TYPES}
        for rel, patterns in self.relationship_patterns.items():
            for p in patterns:
                if p.search(text):
                    scores[rel] += 1
        best = max(scores, key=scores.get)
        if scores[best] > 0:
            return best
        # Fallback to Group 4 default
        defaults = FUNCTION_TYPE_APPROACH_DEFAULTS.get(function_type, {})
        return defaults.get('typical_relationship', 'discrete')

    def _infer_hazard_context(self, record: Dict[str, Any], text: str) -> Dict[str, Optional[str]]:
        """Infer hazard context from cross-reference or text."""
        dataset_id = record.get('id', '')

        if dataset_id in self.hazard_xref:
            xref = self.hazard_xref[dataset_id]
            ht_list = [h for h in xref['hazard_types'] if h in VALID_HAZARD_TYPES]
            pt_list = [p for p in xref['process_types'] if p in VALID_PROCESS_TYPES]
            at = xref.get('analysis_type')
            im_list = xref.get('intensity_measures', [])

            result = {
                'hazard_primary': ht_list[0] if ht_list else None,
                'hazard_secondary': ht_list[1] if len(ht_list) > 1 else None,
                'hazard_process_primary': pt_list[0] if pt_list else None,
                'hazard_analysis_type': at if at in VALID_ANALYSIS_TYPES else 'empirical',
                'intensity_measure': im_list[0] if im_list else None,
            }
            if not result['hazard_process_primary'] and result['hazard_primary']:
                result['hazard_process_primary'] = HAZARD_PROCESS_DEFAULT.get(result['hazard_primary'])
            if not result['intensity_measure'] and result['hazard_primary']:
                result['intensity_measure'] = DEFAULT_INTENSITY_MEASURE.get(result['hazard_primary'])
            return result

        text_lower = text.lower()
        hazard_primary = None
        for ht, patterns in HAZARD_TYPE_PATTERNS.items():
            for p in patterns:
                if p.search(text_lower):
                    hazard_primary = ht
                    break
            if hazard_primary:
                break

        return {
            'hazard_primary': hazard_primary,
            'hazard_secondary': None,
            'hazard_process_primary': HAZARD_PROCESS_DEFAULT.get(hazard_primary) if hazard_primary else None,
            'hazard_analysis_type': 'empirical',
            'intensity_measure': DEFAULT_INTENSITY_MEASURE.get(hazard_primary) if hazard_primary else None,
        }

    def _infer_category(self, text: str) -> Optional[str]:
        """Infer exposure category from text."""
        text_lower = text.lower()
        for cat, patterns in EXPOSURE_CATEGORY_PATTERNS.items():
            for p in patterns:
                if p.search(text_lower):
                    if cat in VALID_EXPOSURE_CATEGORIES:
                        return cat
        return 'buildings'  # Default for vulnerability functions

    def _infer_impact_type(self, text: str) -> str:
        """Infer impact type from text."""
        text_lower = text.lower()
        for itype, patterns in self.impact_type_patterns.items():
            for p in patterns:
                if p.search(text_lower):
                    return itype
        return 'direct'

    def _infer_impact_modelling(self, text: str) -> str:
        """Infer impact modelling method from text."""
        text_lower = text.lower()
        scores = {k: 0 for k in VALID_CALCULATION_TYPES}
        for mod, patterns in self.impact_modelling_patterns.items():
            for p in patterns:
                if p.search(text_lower):
                    scores[mod] += 1
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else 'observed'

    def _validate_function_metrics(self, function_type: str, category: str,
                                    impact_metric: str, quantity_kind: str,
                                    impact_type: str) -> Dict[str, str]:
        """
        Validate impact_metric + quantity_kind against constraint tables.

        Checks:
        1. Group 1: Is impact_metric allowed for this function_type?
        2. Group 3: Does quantity_kind match the metric? (IMPACT_METRIC_CONSTRAINTS)
        3. Group 3: Does impact_type match the metric?

        Falls back to Group 2 category defaults if function_type constraints fail.
        """
        ftype_constraints = FUNCTION_TYPE_CONSTRAINTS.get(function_type)

        if ftype_constraints:
            # Check if the impact_metric is in the allowed set for this function type
            if impact_metric not in ftype_constraints['allowed_metrics']:
                # Try category override (Group 2)
                cat_defaults = VULN_CATEGORY_DEFAULTS.get(category, {})
                override_metric = cat_defaults.get('metric_override')
                if override_metric and override_metric in ftype_constraints['allowed_metrics']:
                    impact_metric = override_metric
                    quantity_kind = cat_defaults.get('qty_override', ftype_constraints['default_qty'])
                else:
                    # Fall back to function type default
                    impact_metric = ftype_constraints['default_metric']
                    quantity_kind = ftype_constraints['default_qty']

        # Group 3: validate metric -> quantity_kind + impact_type
        # (IMPACT_METRIC_CONSTRAINTS is defined in cell 23 for Loss; reuse it)
        metric_constraint = IMPACT_METRIC_CONSTRAINTS.get(impact_metric)
        if metric_constraint:
            expected_qty, allowed_types = metric_constraint
            if quantity_kind != expected_qty:
                quantity_kind = expected_qty
            if impact_type not in allowed_types:
                impact_type = 'direct' if 'direct' in allowed_types else sorted(allowed_types)[0]

        return {
            'impact_metric': impact_metric,
            'quantity_kind': quantity_kind,
            'impact_type': impact_type,
        }

    def _extract_reference_year(self, record: Dict[str, Any]) -> Optional[int]:
        """Extract reference year from dataset metadata."""
        dataset_date = record.get('dataset_date', '') or ''
        year_match = re.search(r'(\d{4})', dataset_date)
        if year_match:
            year = int(year_match.group(1))
            if 1900 <= year <= 2100:
                return year
        last_mod = record.get('last_modified', '') or record.get('metadata_modified', '') or ''
        year_match = re.search(r'(\d{4})', last_mod)
        if year_match:
            year = int(year_match.group(1))
            if 1900 <= year <= 2100:
                return year
        return None

    def _extract_functions(self, record: Dict[str, Any], text: str) -> List[FunctionExtraction]:
        """Extract vulnerability function information with constraint validation."""
        func_types = self._detect_function_types(text)
        if not func_types:
            return []

        # Infer shared context
        hazard_ctx = self._infer_hazard_context(record, text)
        category = self._infer_category(text)
        text_impact_type = self._infer_impact_type(text)
        impact_modelling = self._infer_impact_modelling(text)

        title = record.get('title', '')
        notes = record.get('notes', '')
        analysis_details = f"Extracted from HDX dataset: {title[:200]}"
        if notes:
            analysis_details += f". {notes[:300]}"

        functions = []
        for ftype in func_types:
            # Group 4: approach + relationship defaults per function type
            approach = self._infer_approach(text, ftype)
            relationship = self._infer_relationship(text, ftype)

            # Group 1: default metric for this function type
            ftype_constraints = FUNCTION_TYPE_CONSTRAINTS.get(ftype, {})
            default_metric = ftype_constraints.get('default_metric', 'damage_ratio')
            default_qty = ftype_constraints.get('default_qty', 'ratio')

            # Group 2: category may override metric
            cat_defaults = VULN_CATEGORY_DEFAULTS.get(category, {})
            impact_metric = default_metric
            quantity_kind = default_qty

            # If category has a metric override that's valid for this function type
            cat_metric = cat_defaults.get('metric_override')
            allowed = ftype_constraints.get('allowed_metrics', set())
            if cat_metric and cat_metric in allowed:
                impact_metric = cat_metric
                quantity_kind = cat_defaults.get('qty_override', default_qty)

            # Group 1+3: validate the full combination
            validated = self._validate_function_metrics(
                ftype, category, impact_metric, quantity_kind, text_impact_type
            )

            func = FunctionExtraction(
                function_type=ftype,
                approach=approach,
                relationship=relationship,
                hazard_primary=hazard_ctx['hazard_primary'],
                hazard_secondary=hazard_ctx['hazard_secondary'],
                hazard_process_primary=hazard_ctx['hazard_process_primary'],
                hazard_analysis_type=hazard_ctx['hazard_analysis_type'] or 'empirical',
                intensity_measure=hazard_ctx['intensity_measure'],
                category=category,
                impact_type=validated['impact_type'],
                impact_modelling=impact_modelling,
                impact_metric=validated['impact_metric'],
                quantity_kind=validated['quantity_kind'],
                analysis_details=analysis_details[:500],
                confidence=0.8 if hazard_ctx['hazard_primary'] else 0.6,
            )
            functions.append(func)

        return functions

    def _extract_socio_economic(self, record: Dict[str, Any], text: str) -> List[SocioEconomicExtraction]:
        """Extract socio-economic indicator information."""
        text_lower = text.lower()
        indicators = []
        matched_codes = set()

        for ind_def in self.socio_indicator_patterns:
            for p in ind_def['compiled']:
                if p.search(text_lower):
                    code = ind_def['indicator_code']
                    if code not in matched_codes:
                        matched_codes.add(code)
                        ref_year = self._extract_reference_year(record)
                        title = record.get('title', '')
                        description = ind_def['description']
                        if title:
                            description = f"{description}. Source: {title[:150]}"

                        indicators.append(SocioEconomicExtraction(
                            indicator_name=ind_def['indicator_name'],
                            indicator_code=code,
                            description=description[:500],
                            reference_year=ref_year,
                            scheme=ind_def['scheme'],
                            confidence=0.7,
                        ))
                    break

        if not indicators:
            for p in self.generic_socio_patterns:
                if p.search(text_lower):
                    title = record.get('title', '')
                    ref_year = self._extract_reference_year(record)
                    indicators.append(SocioEconomicExtraction(
                        indicator_name='Socio-economic vulnerability indicator',
                        indicator_code='SOCIO_VULN',
                        description=f"Socio-economic vulnerability data from: {title[:200]}",
                        reference_year=ref_year,
                        scheme='Custom',
                        confidence=0.5,
                    ))
                    break

        # --- Single-indicator false positive filter ---
        # A single generic indicator is insufficient evidence for vulnerability.
        # DISPLACEMENT alone belongs in loss, POP_DENSITY alone belongs in
        # exposure, and standalone SOCIO_VULN is too ambiguous.
        # Require either >=2 distinct indicators or at least 1 specific
        # composite indicator to flag as vulnerability.
        SINGLE_INDICATOR_INSUFFICIENT = {
            'DISPLACEMENT', 'POP_DENSITY', 'SOCIO_VULN',
        }
        if len(indicators) == 1 and indicators[0].indicator_code in SINGLE_INDICATOR_INSUFFICIENT:
            return []  # Not enough evidence for vulnerability

        return indicators

    def extract(self, record: Dict[str, Any]) -> VulnerabilityExtraction:
        """Extract vulnerability information from HDX record."""
        text = self._get_all_text(record)

        functions = self._extract_functions(record, text)
        socio_economic = self._extract_socio_economic(record, text)

        confidences = [f.confidence for f in functions] + [s.confidence for s in socio_economic]
        overall = float(np.mean(confidences)) if confidences else 0.0

        return VulnerabilityExtraction(
            functions=functions,
            socio_economic=socio_economic,
            overall_confidence=overall,
        )

# Initialize
vuln_extractor = VulnerabilityExtractor(SIGNAL_DICT, HAZARD_XREF)
print("VulnerabilityExtractor initialized (constraint-validated).")
print(f"  Function type patterns: {len(vuln_extractor.func_type_patterns)}")
print(f"  Socio-economic indicators: {len(vuln_extractor.socio_indicator_patterns)}")


VulnerabilityExtractor initialized (constraint-validated).
  Function type patterns: 4
  Socio-economic indicators: 18


## 3. RDLS Vulnerability Block Builder

In [7]:
"""
3.1 Build RDLS Vulnerability Block

Builds schema-compliant vulnerability block with:
- functions: vulnerability, fragility, damage_to_loss, engineering_demand
- socio_economic: indicator entries with all required fields
- All function entries validated against constraint tables:
  Group 1: impact_metric valid for function_type
  Group 3: quantity_kind valid for impact_metric
"""

def build_vulnerability_block(
    extraction: VulnerabilityExtraction,
    dataset_id: str,
) -> Optional[Dict[str, Any]]:
    """
    Build RDLS vulnerability block from extraction results.

    All function entries include 10+ mandatory fields validated against schema
    AND constraint tables. All socio-economic entries include indicator_name,
    indicator_code, description, reference_year, and id.

    Parameters
    ----------
    extraction : VulnerabilityExtraction
        Extraction results (already constraint-validated by extractor)
    dataset_id : str
        Dataset identifier for building unique IDs

    Returns
    -------
    Optional[Dict[str, Any]]
        RDLS vulnerability block or None if no data
    """
    if not extraction.has_any_signal():
        return None

    block = {}

    # --- Build functions ---
    if extraction.functions:
        functions_dict = {
            'vulnerability': [],
            'fragility': [],
            'damage_to_loss': [],
            'engineering_demand': [],
        }

        for idx, func in enumerate(extraction.functions):
            ftype = func.function_type
            if ftype not in functions_dict:
                continue

            # --- Codelist validation ---
            approach = func.approach if func.approach in VALID_FUNCTION_APPROACHES else 'empirical'
            relationship = func.relationship if func.relationship in VALID_RELATIONSHIP_TYPES else 'discrete'
            hazard_analysis_type = func.hazard_analysis_type if func.hazard_analysis_type in VALID_ANALYSIS_TYPES else 'empirical'
            impact_type = func.impact_type if func.impact_type in VALID_IMPACT_TYPES else 'direct'
            impact_modelling = func.impact_modelling if func.impact_modelling in VALID_CALCULATION_TYPES else 'observed'
            impact_metric = func.impact_metric if func.impact_metric in VALID_IMPACT_METRICS else DEFAULT_IMPACT_METRIC.get(ftype, 'damage_ratio')
            category = func.category if func.category in VALID_EXPOSURE_CATEGORIES else None

            # --- Group 1 re-validation: metric valid for function type ---
            ftype_constraints = FUNCTION_TYPE_CONSTRAINTS.get(ftype)
            if ftype_constraints and impact_metric not in ftype_constraints['allowed_metrics']:
                impact_metric = ftype_constraints['default_metric']
                quantity_kind = ftype_constraints['default_qty']
            else:
                quantity_kind = func.quantity_kind or 'ratio'

            # --- Group 3 re-validation: quantity_kind for metric ---
            metric_constraint = IMPACT_METRIC_CONSTRAINTS.get(impact_metric)
            if metric_constraint:
                expected_qty, allowed_types = metric_constraint
                quantity_kind = expected_qty
                if impact_type not in allowed_types:
                    impact_type = 'direct' if 'direct' in allowed_types else sorted(allowed_types)[0]

            # --- Group 4: approach/relationship defaults if not text-detected ---
            ft_defaults = FUNCTION_TYPE_APPROACH_DEFAULTS.get(ftype, {})

            entry = {
                'approach': approach,
                'relationship': relationship,
                'hazard_primary': func.hazard_primary if func.hazard_primary in VALID_HAZARD_TYPES else None,
                'hazard_analysis_type': hazard_analysis_type,
                'intensity_measure': func.intensity_measure or DEFAULT_INTENSITY_MEASURE.get(func.hazard_primary or 'flood', 'wd:m'),
                'category': category,
                'impact_type': impact_type,
                'impact_modelling': impact_modelling,
                'impact_metric': impact_metric,
                'quantity_kind': quantity_kind,
                'id': f"vuln_func_{dataset_id[:8]}_{ftype}_{idx + 1}",
            }

            # P1+P2 fix: Skip entries with no determinable hazard or category
            if entry['hazard_primary'] is None or entry['category'] is None:
                continue

            # Optional standard fields
            if func.hazard_secondary and func.hazard_secondary in VALID_HAZARD_TYPES:
                entry['hazard_secondary'] = func.hazard_secondary
            if func.hazard_process_primary and func.hazard_process_primary in VALID_PROCESS_TYPES:
                entry['hazard_process_primary'] = func.hazard_process_primary
            if func.hazard_process_secondary and func.hazard_process_secondary in VALID_PROCESS_TYPES:
                entry['hazard_process_secondary'] = func.hazard_process_secondary
            if func.taxonomy and func.taxonomy in VALID_TAXONOMIES:
                entry['taxonomy'] = func.taxonomy
            if func.analysis_details:
                entry['analysis_details'] = func.analysis_details

            # Type-specific fields
            if ftype in ('fragility', 'damage_to_loss', 'engineering_demand'):
                if func.damage_scale_name:
                    entry['damage_scale_name'] = func.damage_scale_name
                if func.damage_states_names:
                    entry['damage_states_names'] = func.damage_states_names
            if ftype == 'engineering_demand' and func.parameter:
                entry['parameter'] = func.parameter

            functions_dict[ftype].append(entry)

        non_empty = {k: v for k, v in functions_dict.items() if v}
        if non_empty:
            block['functions'] = non_empty

    # --- Build socio_economic ---
    if extraction.socio_economic:
        socio_list = []
        for idx, se in enumerate(extraction.socio_economic):
            entry = {
                'indicator_name': se.indicator_name,
                'indicator_code': se.indicator_code,
                'description': se.description,
                'id': f"socio_{dataset_id[:8]}_{idx + 1}",
            }

            if se.reference_year and 1900 <= se.reference_year <= 2100:
                entry['reference_year'] = se.reference_year
            else:
                entry['reference_year'] = datetime.now().year

            if se.scheme and se.scheme in VALID_TAXONOMIES:
                entry['scheme'] = se.scheme
            elif se.scheme == 'Custom':
                entry['scheme'] = 'Custom'

            if se.threshold:
                entry['threshold'] = se.threshold
            if se.uri:
                entry['uri'] = se.uri
            if se.analysis_details:
                entry['analysis_details'] = se.analysis_details

            socio_list.append(entry)

        block['socio_economic'] = socio_list

    return block if block else None


print("Vulnerability block builder defined (constraint-validated).")
print("  - Group 1: impact_metric validated for function_type")
print("  - Group 3: quantity_kind + impact_type validated for impact_metric")
print("  - Group 4: approach + relationship defaults per function type")
print("  - All entries validated against schema codelists")


Vulnerability block builder defined (constraint-validated).
  - Group 1: impact_metric validated for function_type
  - Group 3: quantity_kind + impact_type validated for impact_metric
  - Group 4: approach + relationship defaults per function type
  - All entries validated against schema codelists


## 4. Test Extraction

In [8]:
"""
4.1 Load Curated Test Samples

Organized by expected vulnerability type to stress-test extraction.
"""

VULN_TEST_SAMPLES = {
    'vulnerability_function': [
        # Datasets likely to have vulnerability/fragility function signals
        ('vulnerability', 'Vulnerability function signals'),
        ('fragility', 'Fragility function signals'),
        ('damage', 'Damage assessment/function signals'),
    ],
    'socioeconomic_poverty': [
        ('poverty', 'Poverty indicators'),
        ('deprivation', 'Deprivation/inequality'),
    ],
    'socioeconomic_health': [
        ('nutrition', 'Nutrition/health vulnerability'),
        ('malnutrition', 'Malnutrition signals'),
    ],
    'socioeconomic_displacement': [
        ('displacement', 'Displacement data'),
        ('idp', 'IDP settlement data'),
    ],
    'socioeconomic_index': [
        ('vulnerability-index', 'Vulnerability indices'),
        ('resilience', 'Resilience indices'),
        ('inform', 'INFORM risk index'),
    ],
    'socioeconomic_food': [
        ('food-security', 'Food security assessments'),
        ('ipc', 'IPC classification'),
    ],
    'edge_cases': [
        ('risk', 'Risk datasets â€” may or may not have vulnerability'),
        ('climate', 'Climate datasets â€” edge case'),
    ],
}

# Load samples by searching filenames
sample_records = []
sample_meta = []
loaded_ids = set()

for category, keyword_list in VULN_TEST_SAMPLES.items():
    for keyword, note in keyword_list:
        files = sorted(DATASET_METADATA_DIR.glob(f'*{keyword}*.json'))[:5]
        for fp in files:
            try:
                with open(fp, 'r', encoding='utf-8') as f:
                    record = json.load(f)
                rid = record.get('id', fp.stem)
                if rid not in loaded_ids:
                    loaded_ids.add(rid)
                    sample_records.append(record)
                    sample_meta.append({
                        'category': category,
                        'note': note,
                        'filename': fp.name,
                    })
            except Exception:
                pass

print(f"Loaded {len(sample_records)} unique test samples across {len(VULN_TEST_SAMPLES)} categories.")
print(f"\nSamples per category:")
for cat in VULN_TEST_SAMPLES:
    count = sum(1 for m in sample_meta if m['category'] == cat)
    print(f"  {cat}: {count}")

Loaded 57 unique test samples across 7 categories.

Samples per category:
  vulnerability_function: 10
  socioeconomic_poverty: 5
  socioeconomic_health: 5
  socioeconomic_displacement: 9
  socioeconomic_index: 8
  socioeconomic_food: 10
  edge_cases: 10


In [9]:
"""
4.2 Run Extraction on Test Samples
"""

print("=" * 90)
print("VULNERABILITY EXTRACTION TEST RESULTS")
print(f"Testing {len(sample_records)} samples")
print("=" * 90)

test_results = []
func_count = 0
socio_count = 0

for record, meta in zip(sample_records, sample_meta):
    extraction = vuln_extractor.extract(record)

    test_results.append({
        'id': record.get('id'),
        'title': record.get('title', '')[:70],
        'category': meta['category'],
        'extraction': extraction,
    })

    if extraction.has_any_signal():
        func_count += len(extraction.functions)
        socio_count += len(extraction.socio_economic)

        print(f"\n{'â”€' * 90}")
        print(f"[{meta['category']}] {record.get('title', '')[:75]}")

        if extraction.functions:
            for func in extraction.functions:
                print(f"  FUNCTION: {func.function_type} | approach={func.approach} | "
                      f"hazard={func.hazard_primary} | category={func.category} | "
                      f"metric={func.impact_metric} | conf={func.confidence:.2f}")

        if extraction.socio_economic:
            for se in extraction.socio_economic:
                print(f"  SOCIO-ECON: {se.indicator_name} ({se.indicator_code}) | "
                      f"scheme={se.scheme} | year={se.reference_year} | conf={se.confidence:.2f}")

print(f"\n{'=' * 90}")
print("TEST SUMMARY")
print(f"{'=' * 90}")
total = len(test_results)
with_signal = sum(1 for r in test_results if r['extraction'].has_any_signal())
print(f"  Total samples: {total}")
print(f"  With vulnerability signal: {with_signal} ({with_signal/total*100:.1f}%)")
print(f"  Function extractions: {func_count}")
print(f"  Socio-economic extractions: {socio_count}")

VULNERABILITY EXTRACTION TEST RESULTS
Testing 57 samples

â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€
[vulnerability_function] Afghanistan Displacement Data - Climate Vulnerability Assessment [IOM DTM]
  SOCIO-ECON: Displacement indicator (DISPLACEMENT) | scheme=Custom | year=2024 | conf=0.70
  SOCIO-ECON: Vulnerability index (VULN_INDEX) | scheme=Custom | year=2024 | conf=0.70

â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€
[vulnerability_function] Malawi National Vulnerability Index
  SOCIO-ECON: Coping capacity index (COPING_CAPACITY) | sch

In [10]:
"""
4.3 Build Blocks and Verify Structural Compliance

Verify all mandatory fields, codelist values, and structural requirements.
"""

print("=" * 90)
print("STRUCTURAL COMPLIANCE VERIFICATION")
print("=" * 90)

total_blocks = 0
total_functions = 0
total_socio = 0

# Compliance trackers
func_field_violations = []
socio_field_violations = []
codelist_violations = []

FUNCTION_MANDATORY_FIELDS = [
    'approach', 'relationship', 'hazard_primary', 'hazard_analysis_type',
    'intensity_measure', 'category', 'impact_type', 'impact_modelling',
    'impact_metric', 'quantity_kind', 'id'
]
SOCIO_MANDATORY_FIELDS = ['indicator_name', 'indicator_code', 'description', 'reference_year', 'id']

for result in test_results:
    extraction = result['extraction']
    if not extraction.has_any_signal():
        continue

    block = build_vulnerability_block(extraction, result['id'])
    if not block:
        continue

    total_blocks += 1

    # Check functions
    for ftype_key in ['vulnerability', 'fragility', 'damage_to_loss', 'engineering_demand']:
        for entry in block.get('functions', {}).get(ftype_key, []):
            total_functions += 1

            # Check mandatory fields
            for field in FUNCTION_MANDATORY_FIELDS:
                if field not in entry or not entry[field]:
                    func_field_violations.append(f"{result['id'][:8]}/{ftype_key}: missing {field}")

            # Check codelist values
            if entry.get('approach') and entry['approach'] not in VALID_FUNCTION_APPROACHES:
                codelist_violations.append(f"{result['id'][:8]}: approach='{entry['approach']}'")
            if entry.get('relationship') and entry['relationship'] not in VALID_RELATIONSHIP_TYPES:
                codelist_violations.append(f"{result['id'][:8]}: relationship='{entry['relationship']}'")
            if entry.get('hazard_primary') and entry['hazard_primary'] not in VALID_HAZARD_TYPES:
                codelist_violations.append(f"{result['id'][:8]}: hazard_primary='{entry['hazard_primary']}'")
            if entry.get('hazard_analysis_type') and entry['hazard_analysis_type'] not in VALID_ANALYSIS_TYPES:
                codelist_violations.append(f"{result['id'][:8]}: hazard_analysis_type='{entry['hazard_analysis_type']}'")
            if entry.get('category') and entry['category'] not in VALID_EXPOSURE_CATEGORIES:
                codelist_violations.append(f"{result['id'][:8]}: category='{entry['category']}'")
            if entry.get('impact_type') and entry['impact_type'] not in VALID_IMPACT_TYPES:
                codelist_violations.append(f"{result['id'][:8]}: impact_type='{entry['impact_type']}'")
            if entry.get('impact_modelling') and entry['impact_modelling'] not in VALID_CALCULATION_TYPES:
                codelist_violations.append(f"{result['id'][:8]}: impact_modelling='{entry['impact_modelling']}'")
            if entry.get('impact_metric') and entry['impact_metric'] not in VALID_IMPACT_METRICS:
                codelist_violations.append(f"{result['id'][:8]}: impact_metric='{entry['impact_metric']}'")

    # Check socio-economic
    for entry in block.get('socio_economic', []):
        total_socio += 1

        for field in SOCIO_MANDATORY_FIELDS:
            if field not in entry or (field != 'reference_year' and not entry[field]):
                socio_field_violations.append(f"{result['id'][:8]}: missing {field}")
            elif field == 'reference_year':
                yr = entry.get('reference_year')
                if not isinstance(yr, int) or yr < 1900 or yr > 2100:
                    socio_field_violations.append(f"{result['id'][:8]}: invalid reference_year={yr}")

        if entry.get('scheme') and entry['scheme'] not in VALID_TAXONOMIES:
            codelist_violations.append(f"{result['id'][:8]}: scheme='{entry['scheme']}'")

    # Show first 3 block previews
    if total_blocks <= 3:
        print(f"\n{'â”€' * 90}")
        print(f"Block preview: {result['title']}")
        print(json.dumps(block, indent=2)[:2000])

# --- Compliance Report ---
print(f"\n{'=' * 90}")
print("COMPLIANCE REPORT")
print(f"{'=' * 90}")
print(f"  Total blocks built:          {total_blocks}")
print(f"  Total function entries:       {total_functions}")
print(f"  Total socio-economic entries: {total_socio}")
print()
print(f"  Function mandatory fields:    {'PASS' if not func_field_violations else f'FAIL ({len(func_field_violations)} violations)'}")
for v in func_field_violations[:5]:
    print(f"    - {v}")
print(f"  Socio-economic mandatory:     {'PASS' if not socio_field_violations else f'FAIL ({len(socio_field_violations)} violations)'}")
for v in socio_field_violations[:5]:
    print(f"    - {v}")
print(f"  Codelist compliance:          {'PASS' if not codelist_violations else f'FAIL ({len(codelist_violations)} violations)'}")
for v in codelist_violations[:5]:
    print(f"    - {v}")

# ID uniqueness check
all_ids = []
for result in test_results:
    extraction = result['extraction']
    if not extraction.has_any_signal():
        continue
    block = build_vulnerability_block(extraction, result['id'])
    if not block:
        continue
    for ftype_key in ['vulnerability', 'fragility', 'damage_to_loss', 'engineering_demand']:
        for entry in block.get('functions', {}).get(ftype_key, []):
            all_ids.append(entry['id'])
    for entry in block.get('socio_economic', []):
        all_ids.append(entry['id'])

dup_ids = [id for id in all_ids if all_ids.count(id) > 1]
print(f"  ID uniqueness:                {'PASS' if not dup_ids else f'FAIL ({len(set(dup_ids))} duplicate IDs)'}")

STRUCTURAL COMPLIANCE VERIFICATION

â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€
Block preview: Afghanistan Displacement Data - Climate Vulnerability Assessment [IOM 
{
  "socio_economic": [
    {
      "indicator_name": "Displacement indicator",
      "indicator_code": "DISPLACEMENT",
      "description": "Population displaced by conflict or disaster, indicating heightened vulnerability. Source: Afghanistan Displacement Data - Climate Vulnerability Assessment [IOM DTM]",
      "id": "socio_1e6fc369_1",
      "reference_year": 2024,
      "scheme": "Custom"
    },
    {
      "indicator_name": "Vulnerability index",
      "indicator_code": "VULN_INDEX",
      "description": "Composite vulnerability index combining multiple socio-economic and environmental factors.

## 5. Batch Processing

In [11]:
"""
5.1 Process Full Corpus
"""

def process_vulnerability_extraction(
    metadata_dir: Path,
    extractor: VulnerabilityExtractor,
    limit: Optional[int] = None
) -> pd.DataFrame:
    """Process all records for vulnerability extraction."""
    json_files = sorted(metadata_dir.glob('*.json'))
    if limit:
        json_files = json_files[:limit]

    results = []
    iterator = tqdm(json_files, desc="Extracting vulnerability") if HAS_TQDM else json_files

    for filepath in iterator:
        try:
            with open(filepath, 'r', encoding='utf-8') as f:
                record = json.load(f)

            extraction = extractor.extract(record)

            results.append({
                'id': record.get('id'),
                'title': record.get('title'),
                'organization': record.get('organization'),
                'has_functions': len(extraction.functions) > 0,
                'function_types': [f.function_type for f in extraction.functions],
                'has_socio_economic': len(extraction.socio_economic) > 0,
                'socio_indicators': [s.indicator_code for s in extraction.socio_economic],
                'overall_confidence': extraction.overall_confidence,
                'has_vulnerability': extraction.has_any_signal(),
                'extraction': extraction,
            })
        except Exception as e:
            results.append({'id': filepath.stem, 'error': str(e)})

    return pd.DataFrame(results)

PROCESS_LIMIT = None  # Set to e.g. 2000 for testing, None for full corpus

print(f"Processing {'all' if PROCESS_LIMIT is None else PROCESS_LIMIT} records...")
df_vuln = process_vulnerability_extraction(DATASET_METADATA_DIR, vuln_extractor, limit=PROCESS_LIMIT)

Processing all records...


Extracting vulnerability:   0%|          | 0/26246 [00:00<?, ?it/s]

In [12]:
"""
5.2 Extraction Statistics
"""

print("=" * 60)
print("VULNERABILITY EXTRACTION STATISTICS")
print("=" * 60)

total = len(df_vuln)
with_vuln = df_vuln['has_vulnerability'].sum()
with_func = df_vuln['has_functions'].sum()
with_socio = df_vuln['has_socio_economic'].sum()

print(f"\nTotal records processed: {total:,}")
print(f"  With any vulnerability signal: {with_vuln:,} ({with_vuln/total*100:.1f}%)")
print(f"  With function detection:       {with_func:,} ({with_func/total*100:.1f}%)")
print(f"  With socio-economic detection: {with_socio:,} ({with_socio/total*100:.1f}%)")

# Function type distribution
func_type_counts = Counter()
for ftypes in df_vuln['function_types'].dropna():
    if isinstance(ftypes, list):
        func_type_counts.update(ftypes)

if func_type_counts:
    print(f"\nFunction Type Distribution:")
    for ft, count in func_type_counts.most_common():
        print(f"  {ft}: {count}")

# Socio-economic indicator distribution
indicator_counts = Counter()
for indicators in df_vuln['socio_indicators'].dropna():
    if isinstance(indicators, list):
        indicator_counts.update(indicators)

if indicator_counts:
    print(f"\nSocio-Economic Indicator Distribution:")
    for ind, count in indicator_counts.most_common(15):
        print(f"  {ind}: {count}")

# Confidence distribution
conf = df_vuln[df_vuln['has_vulnerability']]['overall_confidence']
if len(conf) > 0:
    print(f"\nConfidence Distribution:")
    print(f"  Mean: {conf.mean():.2f}")
    print(f"  Median: {conf.median():.2f}")
    print(f"  High (>=0.7): {(conf >= 0.7).sum()}")
    print(f"  Medium (0.5-0.7): {((conf >= 0.5) & (conf < 0.7)).sum()}")
    print(f"  Low (<0.5): {(conf < 0.5).sum()}")

VULNERABILITY EXTRACTION STATISTICS

Total records processed: 26,246
  With any vulnerability signal: 5,327 (20.3%)
  With function detection:       393 (1.5%)
  With socio-economic detection: 5,327 (20.3%)

Function Type Distribution:
  vulnerability: 393

Socio-Economic Indicator Distribution:
  FOOD_SECURITY: 2448
  POV_HEADCOUNT: 1346
  EDU_ATTAINMENT: 1342
  MALNUTRITION: 1273
  HEALTH_ACCESS: 599
  DEPRIVATION: 401
  DISPLACEMENT: 291
  AGE_65_PLUS: 247
  HDI: 218
  COPING_CAPACITY: 68
  DISABILITY: 64
  POP_DENSITY: 28
  INFORM_RISK: 15
  VULN_INDEX: 9
  LIVELIHOOD: 9

Confidence Distribution:
  Mean: 0.70
  Median: 0.70
  High (>=0.7): 4099
  Medium (0.5-0.7): 1228
  Low (<0.5): 0


## 6. Export Results

In [13]:
"""
6.0 Clean Previous Outputs (Vulnerability + Loss)

Removes stale output files before writing new ones.
Controlled by CLEANUP_MODE in cell 1.2 above.
"""

def clean_previous_outputs(output_dir, patterns, label, mode="replace"):
    """
    Remove previous output files matching the given glob patterns.

    Parameters
    ----------
    output_dir : Path
        Directory containing old outputs.
    patterns : list[str]
        Glob patterns to match.
    label : str
        Human-readable label for log messages.
    mode : str
        One of: "replace" (auto-delete), "prompt" (ask user),
        "skip" (keep old files), "abort" (error if stale files exist).

    Returns
    -------
    dict  with keys 'deleted' (int) and 'skipped' (bool)
    """
    result = {'deleted': 0, 'skipped': False}
    targets = {}
    for pattern in patterns:
        matches = sorted(output_dir.glob(pattern))
        if matches:
            targets[pattern] = matches
    total = sum(len(files) for files in targets.values())

    if total == 0:
        print(f'Output cleanup [{label}]: Directory is clean.')
        return result

    summary = []
    for pattern, files in targets.items():
        summary.append(f'  {pattern:40s}: {len(files):,} files')

    if mode == 'skip':
        print(f'Output cleanup [{label}]: SKIPPED ({total:,} existing files kept)')
        result['skipped'] = True
        return result

    if mode == 'abort':
        raise RuntimeError(
            f'Output cleanup [{label}]: ABORT -- {total:,} stale files found. '
            f'Delete manually or change CLEANUP_MODE.'
        )

    if mode == 'prompt':
        print(f'Output cleanup [{label}]: Found {total:,} existing output files:')
        for line in summary:
            print(line)
        choice = input('Choose [R]eplace / [S]kip / [A]bort: ').strip().lower()
        if choice in ('s', 'skip'):
            print('  Skipped.')
            result['skipped'] = True
            return result
        elif choice in ('a', 'abort'):
            raise RuntimeError('User chose to abort.')
        elif choice not in ('r', 'replace', ''):
            print(f'  Unknown choice "{choice}", defaulting to Replace.')

    # Mode: replace (default)
    print(f'Output cleanup [{label}]:')
    for line in summary:
        print(line)
    for pattern, files in targets.items():
        for f in files:
            try:
                f.unlink()
                result['deleted'] += 1
            except Exception as e:
                print(f'  WARNING: Could not delete {f.name}: {e}')
    deleted_count = result['deleted']
    print(f'  Cleaned {deleted_count:,} files. Ready for fresh output.')
    print()
    return result


# -- Run cleanup for NB 11 Vulnerability + Loss Extraction outputs --
clean_previous_outputs(
    OUTPUT_DIR,
    patterns=[
        "rdls_vln-hdx_*.json",
        "vulnerability_extraction_results.csv",
        "vulnerability_detected_records.csv",
        "rdls_lss-hdx_*.json",
        "loss_extraction_results.csv",
        "loss_detected_records.csv",
    ],
    label="NB 11 Vulnerability + Loss Extraction",
    mode=CLEANUP_MODE,
)


Output cleanup [NB 11 Vulnerability + Loss Extraction]:
  rdls_vln-hdx_*.json                     : 5,327 files
  vulnerability_extraction_results.csv    : 1 files
  vulnerability_detected_records.csv      : 1 files
  rdls_lss-hdx_*.json                     : 821 files
  loss_extraction_results.csv             : 1 files
  loss_detected_records.csv               : 1 files
  Cleaned 6,152 files. Ready for fresh output.



{'deleted': 6152, 'skipped': False}

In [14]:
"""
6.1 Export Results and Generate RDLS Vulnerability Block JSONs
"""

# Prepare export DataFrame
export_df = df_vuln[[
    'id', 'title', 'organization', 'has_functions', 'function_types',
    'has_socio_economic', 'socio_indicators', 'overall_confidence', 'has_vulnerability'
]].copy()

# Convert lists to pipe-separated for CSV
for col in ['function_types', 'socio_indicators']:
    export_df[col] = export_df[col].apply(
        lambda x: '|'.join(x) if isinstance(x, list) else ''
    )

# Save full results
output_file = OUTPUT_DIR / 'vulnerability_extraction_results.csv'
export_df.to_csv(output_file, index=False)
print(f"Saved: {output_file}")

# Save records with vulnerability signals
vuln_records = export_df[export_df['has_vulnerability']]
vuln_file = OUTPUT_DIR / 'vulnerability_detected_records.csv'
vuln_records.to_csv(vuln_file, index=False)
print(f"Saved: {vuln_file} ({len(vuln_records)} records)")

# --- Generate RDLS vulnerability block JSONs for ALL flagged datasets ---
all_vuln = df_vuln[
    df_vuln['has_vulnerability'] &
    (df_vuln['overall_confidence'] >= 0.5)
].copy()

print(f"\nGenerating RDLS vulnerability block JSONs for {len(all_vuln):,} datasets...")

generated = 0
skipped = 0

iterator = tqdm(all_vuln.iterrows(), total=len(all_vuln), desc="Building vuln JSONs") if HAS_TQDM else all_vuln.iterrows()

for idx, row in iterator:
    extraction = row['extraction']
    vuln_block = build_vulnerability_block(extraction, row['id'])

    if vuln_block:
        rdls_record = {
            'datasets': [{
                'id': f"rdls_vln-hdx_{row['id'][:8]}",
                'title': row['title'],
                'risk_data_type': ['vulnerability'],
                'vulnerability': vuln_block,
                'links': [{
                    'href': 'https://docs.riskdatalibrary.org/en/0__3__0/rdls_schema.json',
                    'rel': 'describedby'
                }]
            }]
        }

        output_path = OUTPUT_DIR / f"rdls_vln-hdx_{row['id'][:8]}.json"
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(rdls_record, f, indent=2, ensure_ascii=False)

        generated += 1
    else:
        skipped += 1

print(f"\nDone.")
print(f"  Generated: {generated:,} vulnerability block JSONs")
print(f"  Skipped (no valid block): {skipped:,}")
print(f"  Output: {OUTPUT_DIR}")

Saved: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump/rdls/extracted/vulnerability_extraction_results.csv
Saved: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump/rdls/extracted/vulnerability_detected_records.csv (5327 records)

Generating RDLS vulnerability block JSONs for 5,327 datasets...


Building vuln JSONs:   0%|          | 0/5327 [00:00<?, ?it/s]


Done.
  Generated: 5,327 vulnerability block JSONs
  Skipped (no valid block): 0
  Output: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump/rdls/extracted


## 8. RDLS Loss Block Extractor

**Purpose**: Extract and populate RDLS v0.3 Loss component blocks from HDX metadata.

**RDLS Loss Block Structure (v0.3)**:
```
loss:
  losses:
    - id: string (required)
      hazard_type: hazard_type codelist (required)
      hazard_process: process_type codelist (optional)
      asset_category: exposure_category codelist (required)
      asset_dimension: metric_dimension codelist (required)
      impact_and_losses:  (required, 7 required sub-fields)
        impact_type: direct | indirect | total
        impact_modelling: inferred | observed | simulated
        impact_metric: impact_metric codelist
        quantity_kind: open codelist
        loss_type: ground_up | insured | gross | count | net_precat | net_postcat
        loss_approach: analytical | empirical | hybrid | judgement
        loss_frequency_type: probabilistic | deterministic | empirical
        currency: ISO 4217 (optional, when monetary)
      lineage: (optional)
        hazard_dataset: string
        exposure_dataset: string
        vulnerability_dataset: string
      description: string (optional)
```

**Extraction Strategy**:
- Detect loss signals from HDX metadata (damage, fatality, casualty, economic loss, AAL)
- Cross-reference hazard extraction from NB 09 for hazard_type and hazard_process
- Infer asset_category from exposure patterns, asset_dimension from context
- Populate all 7 required impact_and_losses sub-fields with codelist-valid values

---

### 8.1 Loss Detection Patterns and Constants

In [15]:
"""
8.1 Loss Detection Patterns and Data Classes

Patterns for detecting loss types, asset dimensions, and inferring
mandatory field values from HDX metadata text.
"""

# --- Load additional schema constants for Loss ---
VALID_LOSS_TYPES: Set[str] = set(RDLS_SCHEMA['$defs']['Losses']['properties']['impact_and_losses']['properties']['loss_type']['enum'])
VALID_METRIC_DIMENSIONS: Set[str] = set(RDLS_SCHEMA['$defs']['metric_dimension']['enum'])
# Currency: large ISO 4217 codelist â€” load from schema
VALID_CURRENCIES: Set[str] = set(RDLS_SCHEMA['$defs']['Losses']['properties']['impact_and_losses']['properties']['currency']['enum'])

print(f"Loss schema constants loaded:")
print(f"  Loss types ({len(VALID_LOSS_TYPES)}): {sorted(VALID_LOSS_TYPES)}")
print(f"  Metric dimensions ({len(VALID_METRIC_DIMENSIONS)}): {sorted(VALID_METRIC_DIMENSIONS)}")
print(f"  Currencies: {len(VALID_CURRENCIES)} ISO 4217 codes")
print(f"  (Reusing from vulnerability: hazard_types, process_types, analysis_types, etc.)")

# --- Loss signal detection patterns ---
LOSS_SIGNAL_PATTERNS = {
    'human_loss': [
        re.compile(r'\b(casualt(?:y|ies)|fatalit(?:y|ies)|mortalit(?:y|ies)|death)\b', re.IGNORECASE),
        re.compile(r'\b(killed|dead|perished|deceased)\b', re.IGNORECASE),
        re.compile(r'\b(injur(?:y|ies|ed)|wounded|hospitalized)\b', re.IGNORECASE),
        re.compile(r'\b(missing[\s._-]?persons?|unaccounted)\b', re.IGNORECASE),
    ],
    'displacement': [
        re.compile(r'\b(displaced|displacement|evacuated|evacuation)\b', re.IGNORECASE),
        re.compile(r'\b(homeless|shelter[\s._-]?(?:less|need))\b', re.IGNORECASE),
        re.compile(r'\b(internally[\s._-]?displaced|idp)\b', re.IGNORECASE),
        re.compile(r'\b(refugee[\s._-]?(?:flow|movement|crisis))\b', re.IGNORECASE),
    ],
    'affected_population': [
        re.compile(r'\b(affected[\s._-]?(?:population|people|person|household|communit))\b', re.IGNORECASE),
        re.compile(r'\b(people[\s._-]?(?:affected|impacted|in[\s._-]?need))\b', re.IGNORECASE),
        re.compile(r'\b(population[\s._-]?(?:affected|exposed|at[\s._-]?risk))\b', re.IGNORECASE),
    ],
    'economic_loss': [
        re.compile(r'\b(economic[\s._-]?loss|financial[\s._-]?loss|monetary[\s._-]?loss)\b', re.IGNORECASE),
        re.compile(r'\b(damage[\s._-]?cost|repair[\s._-]?cost|replacement[\s._-]?cost)\b', re.IGNORECASE),
        re.compile(r'\b(insured[\s._-]?loss|insurance[\s._-]?claim)\b', re.IGNORECASE),
        re.compile(r'\b(aal|average[\s._-]?annual[\s._-]?loss)\b', re.IGNORECASE),
        re.compile(r'\b(expected[\s._-]?loss|probable[\s._-]?maximum[\s._-]?loss|pml)\b', re.IGNORECASE),
    ],
    'structural_damage': [
        re.compile(r'\b(building[\s._-]?(?:damage|destroyed|collapsed|affected))\b', re.IGNORECASE),
        re.compile(r'\b(structural[\s._-]?damage|house[\s._-]?(?:damage|destroyed))\b', re.IGNORECASE),
        re.compile(r'\b(infrastructure[\s._-]?(?:damage|destroyed|loss))\b', re.IGNORECASE),
        re.compile(r'\b(damage[\s._-]?(?:state|ratio|assessment|survey))\b', re.IGNORECASE),
    ],
    'agricultural_loss': [
        re.compile(r'\b(crop[\s._-]?(?:loss|damage|failure|destroyed))\b', re.IGNORECASE),
        re.compile(r'\b(agricultural[\s._-]?(?:loss|damage|impact))\b', re.IGNORECASE),
        re.compile(r'\b(livestock[\s._-]?(?:loss|death|mortality))\b', re.IGNORECASE),
        re.compile(r'\b(harvest[\s._-]?(?:loss|failure|damage))\b', re.IGNORECASE),
    ],
    'catastrophe_model': [
        re.compile(r'\b(cat[\s._-]?model|catastrophe[\s._-]?model)\b', re.IGNORECASE),
        re.compile(r'\b(risk[\s._-]?model|loss[\s._-]?model)\b', re.IGNORECASE),
        re.compile(r'\b(loss[\s._-]?exceedance|ep[\s._-]?curve)\b', re.IGNORECASE),
    ],
    'general_loss': [
        re.compile(r'\b(disaster[\s._-]?(?:loss|damage|impact|incident))\b', re.IGNORECASE),
        re.compile(r'\b(natural[\s._-]?disaster[\s._-]?(?:loss|damage|impact|incident))\b', re.IGNORECASE),
        re.compile(r'\b(damage[\s._-]?and[\s._-]?loss(?:es)?)\b', re.IGNORECASE),
        re.compile(r'\b(post[\s._-]?disaster[\s._-]?(?:need|assessment|damage))\b', re.IGNORECASE),
        re.compile(r'\b(pdna|dala|rapid[\s._-]?damage[\s._-]?assessment)\b', re.IGNORECASE),
    ],
}

# --- Exclusion patterns: things that look like loss but aren't ---
LOSS_EXCLUSION_PATTERNS = [
    re.compile(r'\b(data[\s._-]?loss|packet[\s._-]?loss|signal[\s._-]?loss)\b', re.IGNORECASE),
    re.compile(r'\b(weight[\s._-]?loss|hair[\s._-]?loss|blood[\s._-]?loss)\b', re.IGNORECASE),
    re.compile(r'\b(loss[\s._-]?of[\s._-]?(?:data|signal|connectivity|precision))\b', re.IGNORECASE),
    re.compile(r'\b(profit[\s._-]?and[\s._-]?loss|p&l)\b', re.IGNORECASE),
]


# --- Currency detection patterns ---
CURRENCY_PATTERNS = [
    (re.compile(r'\b(usd|us[\s._-]?dollar|united[\s._-]?states[\s._-]?dollar)\b', re.IGNORECASE), 'USD'),
    (re.compile(r'\b(eur|euro)\b', re.IGNORECASE), 'EUR'),
    (re.compile(r'\b(gbp|british[\s._-]?pound|pound[\s._-]?sterling)\b', re.IGNORECASE), 'GBP'),
    (re.compile(r'\b(jpy|japanese[\s._-]?yen)\b', re.IGNORECASE), 'JPY'),
    (re.compile(r'\b(cny|chinese[\s._-]?yuan|rmb|renminbi)\b', re.IGNORECASE), 'CNY'),
    (re.compile(r'\b(inr|indian[\s._-]?rupee)\b', re.IGNORECASE), 'INR'),
    (re.compile(r'\b(aud|australian[\s._-]?dollar)\b', re.IGNORECASE), 'AUD'),
    (re.compile(r'\b(cad|canadian[\s._-]?dollar)\b', re.IGNORECASE), 'CAD'),
    (re.compile(r'\b(chf|swiss[\s._-]?franc)\b', re.IGNORECASE), 'CHF'),
    (re.compile(r'\b(bdt|bangladeshi[\s._-]?taka|taka)\b', re.IGNORECASE), 'BDT'),
    (re.compile(r'\b(pkr|pakistani[\s._-]?rupee)\b', re.IGNORECASE), 'PKR'),
    (re.compile(r'\b(php|philippine[\s._-]?peso)\b', re.IGNORECASE), 'PHP'),
    (re.compile(r'\b(idr|indonesian[\s._-]?rupiah|rupiah)\b', re.IGNORECASE), 'IDR'),
    (re.compile(r'\b(kes|kenyan[\s._-]?shilling)\b', re.IGNORECASE), 'KES'),
    (re.compile(r'\b(ngn|nigerian[\s._-]?naira|naira)\b', re.IGNORECASE), 'NGN'),
    (re.compile(r'\b(etb|ethiopian[\s._-]?birr|birr)\b', re.IGNORECASE), 'ETB'),
    (re.compile(r'\b(mmk|myanmar[\s._-]?kyat|kyat)\b', re.IGNORECASE), 'MMK'),
    (re.compile(r'\b(afn|afghani)\b', re.IGNORECASE), 'AFN'),
    (re.compile(r'\b(htg|haitian[\s._-]?gourde|gourde)\b', re.IGNORECASE), 'HTG'),
    (re.compile(r'\b(ssp|south[\s._-]?sudanese[\s._-]?pound)\b', re.IGNORECASE), 'SSP'),
    (re.compile(r'\b(yer|yemeni[\s._-]?rial)\b', re.IGNORECASE), 'YER'),
    (re.compile(r'\b(sdg|sudanese[\s._-]?pound)\b', re.IGNORECASE), 'SDG'),
    (re.compile(r'\b(syp|syrian[\s._-]?pound)\b', re.IGNORECASE), 'SYP'),
    (re.compile(r'\b(cdf|congolese[\s._-]?franc)\b', re.IGNORECASE), 'CDF'),
    (re.compile(r'\b(mzn|mozambican[\s._-]?metical|metical)\b', re.IGNORECASE), 'MZN'),
]

# --- Insured loss detection ---
INSURED_LOSS_PATTERNS = [
    re.compile(r'\b(insured[\s._-]?loss|insurance[\s._-]?claim|insured[\s._-]?damage)\b', re.IGNORECASE),
    re.compile(r'\b(insurance[\s._-]?payout|claim[\s._-]?amount)\b', re.IGNORECASE),
]

# --- Loss approach inference (parallels function_approach) ---
LOSS_APPROACH_PATTERNS = {
    'analytical': [
        re.compile(r'\b(analytical|simulation[\s._-]?based|modelled|modeled|cat[\s._-]?model)\b', re.IGNORECASE),
        re.compile(r'\b(catastrophe[\s._-]?model|risk[\s._-]?model)\b', re.IGNORECASE),
    ],
    'empirical': [
        re.compile(r'\b(empirical|observed|survey|historical|field[\s._-]?data)\b', re.IGNORECASE),
        re.compile(r'\b(post[\s._-]?disaster|post[\s._-]?event|damage[\s._-]?survey)\b', re.IGNORECASE),
        re.compile(r'\b(actual|recorded|reported|pdna|dala)\b', re.IGNORECASE),
    ],
    'hybrid': [
        re.compile(r'\b(hybrid|combined|mixed[\s._-]?method)\b', re.IGNORECASE),
    ],
    'judgement': [
        re.compile(r'\b(expert[\s._-]?judg[e]?ment|expert[\s._-]?opinion|estimated)\b', re.IGNORECASE),
        re.compile(r'\b(rapid[\s._-]?assessment|preliminary[\s._-]?estimate)\b', re.IGNORECASE),
    ],
}

# --- Loss frequency type inference (parallels analysis_type) ---
LOSS_FREQUENCY_PATTERNS = {
    'probabilistic': [
        re.compile(r'\b(probabilistic|stochastic|return[\s._-]?period|aal)\b', re.IGNORECASE),
        re.compile(r'\b(average[\s._-]?annual[\s._-]?loss|expected[\s._-]?loss)\b', re.IGNORECASE),
        re.compile(r'\b(ep[\s._-]?curve|loss[\s._-]?exceedance|exceedance[\s._-]?probability)\b', re.IGNORECASE),
        re.compile(r'\b(probable[\s._-]?maximum[\s._-]?loss|pml|annual[\s._-]?exceedance)\b', re.IGNORECASE),
    ],
    'deterministic': [
        re.compile(r'\b(deterministic|scenario[\s._-]?based|single[\s._-]?event)\b', re.IGNORECASE),
        re.compile(r'\b(worst[\s._-]?case|maximum[\s._-]?credible)\b', re.IGNORECASE),
    ],
    'empirical': [
        re.compile(r'\b(empirical|historical|observed|actual[\s._-]?event)\b', re.IGNORECASE),
        re.compile(r'\b(recorded|reported|real[\s._-]?event|past[\s._-]?event)\b', re.IGNORECASE),
        re.compile(r'\b(disaster[\s._-]?incident|event[\s._-]?based)\b', re.IGNORECASE),
    ],
}

print("Loss detection patterns defined.")
print(f"  Loss signal categories: {list(LOSS_SIGNAL_PATTERNS.keys())}")
print(f"  Exclusion patterns: {len(LOSS_EXCLUSION_PATTERNS)}")
print(f"  Currency detection: {len(CURRENCY_PATTERNS)} currencies")

# =============================================================================
# LOSS CONSTRAINT TABLES
# =============================================================================
#
# Three constraint groups derived from RDLS v0.3 schema + Chattogram example.
# Used by LossExtractor to validate field combinations.
#
# Group 1: VALID_ASSET_TRIPLETS â€” asset_category -> allowed asset_dimensions
#           (mirrors exposure VALID_TRIPLETS dimension column)
#
# Group 2: IMPACT_METRIC_CONSTRAINTS â€” impact_metric -> (quantity_kind, impact_types)
#           Maps each of the 19 impact_metric values to its expected quantity_kind
#           and which impact_types it logically applies to.
#
# Group 3: LOSS_SIGNAL_DEFAULTS â€” loss_signal_type -> full field defaults
#           Single lookup replacing the 4 separate LOSS_SIGNAL_TO_* dicts.
#           Each signal type maps to a complete, validated set of defaults.
# =============================================================================

# --- Group 1: asset_category -> allowed asset_dimensions ---
# First entry is the default for each category.
VALID_ASSET_TRIPLETS = {
    'agriculture':          ['product', 'structure', 'content'],
    'buildings':            ['structure', 'content'],
    'infrastructure':       ['structure', 'disruption'],
    'population':           ['population'],
    'natural_environment':  ['structure', 'index'],
    'economic_indicator':   ['product', 'index'],
    'development_index':    ['index'],
}

# --- Group 2: impact_metric -> (quantity_kind, impact_types) ---
# IMPACT_METRIC_CONSTRAINTS is defined in cell 6 (shared with vulnerability).
# It is available here as a global variable.


# --- Group 3: Unified signal defaults ---
# Replaces LOSS_SIGNAL_TO_LOSS_TYPE, LOSS_SIGNAL_TO_IMPACT_METRIC,
# LOSS_SIGNAL_TO_QUANTITY_KIND, and LOSS_SIGNAL_TO_ASSET with one table.
# Each signal type provides a complete, internally-consistent set of defaults
# that are guaranteed valid against Groups 1 and 2.
LOSS_SIGNAL_DEFAULTS = {
    'human_loss': {
        'asset_category':    'population',
        'asset_dimension':   'population',
        'impact_type':       'direct',
        'impact_metric':     'casualty_count',
        'quantity_kind':     'count',
        'loss_type':         'count',
    },
    'displacement': {
        'asset_category':    'population',
        'asset_dimension':   'population',
        'impact_type':       'direct',
        'impact_metric':     'displaced_count',
        'quantity_kind':     'count',
        'loss_type':         'count',
    },
    'affected_population': {
        'asset_category':    'population',
        'asset_dimension':   'population',
        'impact_type':       'direct',
        'impact_metric':     'exposure_to_hazard',
        'quantity_kind':     'count',
        'loss_type':         'count',
    },
    'economic_loss': {
        'asset_category':    'buildings',
        'asset_dimension':   'structure',
        'impact_type':       'direct',
        'impact_metric':     'economic_loss_value',
        'quantity_kind':     'monetary',
        'loss_type':         'ground_up',
    },
    'structural_damage': {
        'asset_category':    'buildings',
        'asset_dimension':   'structure',
        'impact_type':       'direct',
        'impact_metric':     'damage_ratio',
        'quantity_kind':     'ratio',
        'loss_type':         'ground_up',
    },
    'agricultural_loss': {
        'asset_category':    'agriculture',
        'asset_dimension':   'product',
        'impact_type':       'direct',
        'impact_metric':     'asset_loss',
        'quantity_kind':     'monetary',
        'loss_type':         'ground_up',
    },
    'catastrophe_model': {
        'asset_category':    'buildings',
        'asset_dimension':   'structure',
        'impact_type':       'total',
        'impact_metric':     'loss_annual_average_value',
        'quantity_kind':     'monetary',
        'loss_type':         'ground_up',
    },
    'general_loss': {
        'asset_category':    'population',
        'asset_dimension':   'population',
        'impact_type':       'direct',
        'impact_metric':     'asset_loss',
        'quantity_kind':     'count',
        'loss_type':         'ground_up',
    },
}

# --- Loss-type inference rules ---
# loss_type values that require specific loss_approach combinations.
# net_precat/net_postcat only valid for analytical (cat model) approaches.
LOSS_TYPE_APPROACH_RULES = {
    'net_precat':  {'analytical'},
    'net_postcat': {'analytical'},
    'insured':     {'analytical', 'empirical', 'hybrid'},
    'gross':       {'analytical', 'empirical', 'hybrid'},
    'ground_up':   {'analytical', 'empirical', 'hybrid', 'judgement'},
    'count':       {'analytical', 'empirical', 'hybrid', 'judgement'},
}

print("\nLoss constraint tables defined.")
print(f"  Group 1 - Asset triplets: {len(VALID_ASSET_TRIPLETS)} categories")
print(f"  Group 2 - Impact metric constraints: {len(IMPACT_METRIC_CONSTRAINTS)} metrics")
print(f"  Group 3 - Signal defaults: {len(LOSS_SIGNAL_DEFAULTS)} signal types")
print(f"  Loss-type approach rules: {len(LOSS_TYPE_APPROACH_RULES)} loss types")


Loss schema constants loaded:
  Loss types (6): ['count', 'gross', 'ground_up', 'insured', 'net_postcat', 'net_precat']
  Metric dimensions (6): ['content', 'disruption', 'index', 'population', 'product', 'structure']
  Currencies: 302 ISO 4217 codes
  (Reusing from vulnerability: hazard_types, process_types, analysis_types, etc.)
Loss detection patterns defined.
  Loss signal categories: ['human_loss', 'displacement', 'affected_population', 'economic_loss', 'structural_damage', 'agricultural_loss', 'catastrophe_model', 'general_loss']
  Exclusion patterns: 4
  Currency detection: 25 currencies

Loss constraint tables defined.
  Group 1 - Asset triplets: 7 categories
  Group 2 - Impact metric constraints: 20 metrics
  Group 3 - Signal defaults: 8 signal types
  Loss-type approach rules: 6 loss types


In [16]:
"""
8.2 Loss Data Classes and Extractor

Updated to use VALID_LOSS_TRIPLETS constraint tables for consistent
field combinations. All inferred values are validated against the 3 groups:
  Group 1: asset_category -> asset_dimension
  Group 2: impact_metric -> quantity_kind + impact_type
  Group 3: loss_signal -> unified defaults
"""

# Re-import 'field' in case it was shadowed by loop variables in earlier cells
from dataclasses import field

@dataclass
class LossEntryExtraction:
    """Extraction result for a single loss entry."""
    loss_signal_type: str                    # human_loss, economic_loss, etc.
    hazard_type: Optional[str] = None        # hazard_type codelist
    hazard_process: Optional[str] = None     # process_type codelist
    asset_category: str = 'population'       # exposure_category codelist
    asset_dimension: str = 'population'      # metric_dimension codelist
    impact_type: str = 'direct'              # impact_type codelist
    impact_modelling: str = 'observed'       # data_calculation_type codelist
    impact_metric: str = 'asset_loss'        # impact_metric codelist
    quantity_kind: str = 'count'             # open codelist
    loss_type: str = 'ground_up'             # loss_type codelist
    loss_approach: str = 'empirical'         # function_approach codelist
    loss_frequency_type: str = 'empirical'   # analysis_type codelist
    currency: Optional[str] = None           # ISO 4217
    description: Optional[str] = None
    # lineage
    hazard_dataset: Optional[str] = None
    exposure_dataset: Optional[str] = None
    vulnerability_dataset: Optional[str] = None
    confidence: float = 0.0

@dataclass
class LossExtraction:
    """Complete loss extraction for a dataset."""
    losses: List[LossEntryExtraction] = field(default_factory=list)
    overall_confidence: float = 0.0

    def has_any_signal(self) -> bool:
        return len(self.losses) > 0


class LossExtractor:
    """
    Extracts RDLS Loss block components from HDX metadata.

    Detects loss signals (human loss, economic loss, structural damage, etc.),
    infers hazard context from NB09 cross-reference, and populates all required
    fields with codelist-valid values.

    All field combinations are validated against the 3 constraint groups:
      Group 1: VALID_ASSET_TRIPLETS (asset_category -> asset_dimension)
      Group 2: IMPACT_METRIC_CONSTRAINTS (metric -> quantity_kind + impact_type)
      Group 3: LOSS_SIGNAL_DEFAULTS (unified per-signal defaults)
    """

    def __init__(self, signal_dict: Dict[str, Any], hazard_xref: Dict[str, Dict]):
        self.signal_dict = signal_dict
        self.hazard_xref = hazard_xref

    def _get_all_text(self, record: Dict[str, Any]) -> str:
        """Concatenate all searchable text fields for pattern matching.

        Note: methodology_other is deliberately excluded. It describes
        how analysis was performed (e.g. 'produced AAL and PML values'),
        not what data the dataset contains. Including it causes false
        positives where methodology text about loss calculations triggers
        loss detection on hazard-only datasets.
        """
        parts = [
            record.get('title', ''),
            record.get('name', ''),
            record.get('notes', ''),
        ]
        for tag in record.get('tags', []):
            if isinstance(tag, dict):
                parts.append(tag.get('name', ''))
            elif isinstance(tag, str):
                parts.append(tag)
        for r in record.get('resources', []):
            parts.append(r.get('name', '') or '')
            parts.append(r.get('description', '') or '')
        return ' '.join(filter(None, parts))

    def _check_exclusions(self, text: str) -> bool:
        """Check if text matches exclusion patterns (false positive filter)."""
        for p in LOSS_EXCLUSION_PATTERNS:
            if p.search(text):
                return True
        return False

    def _detect_loss_signals(self, text: str) -> List[str]:
        """Detect which loss signal categories are present."""
        detected = []
        for signal_type, patterns in LOSS_SIGNAL_PATTERNS.items():
            for p in patterns:
                if p.search(text):
                    detected.append(signal_type)
                    break
        return detected

    def _infer_hazard_context(self, record: Dict[str, Any], text: str) -> Dict[str, Optional[str]]:
        """Infer hazard context from cross-reference or text."""
        dataset_id = record.get('id', '')

        # Try cross-reference first (from NB 09)
        if dataset_id in self.hazard_xref:
            xref = self.hazard_xref[dataset_id]
            ht_list = [h for h in xref['hazard_types'] if h in VALID_HAZARD_TYPES]
            pt_list = [p for p in xref['process_types'] if p in VALID_PROCESS_TYPES]
            return {
                'hazard_type': ht_list[0] if ht_list else None,
                'hazard_process': pt_list[0] if pt_list else None,
            }

        # Fallback: infer from text using signal dictionary patterns
        text_lower = text.lower()
        for ht, patterns in HAZARD_TYPE_PATTERNS.items():
            for p in patterns:
                if p.search(text_lower):
                    return {
                        'hazard_type': ht,
                        'hazard_process': HAZARD_PROCESS_DEFAULT.get(ht),
                    }

        return {'hazard_type': None, 'hazard_process': None}

    def _infer_asset_context(self, text: str, signal_type: str) -> Dict[str, str]:
        """
        Infer asset_category and asset_dimension from text + signal type.

        Validates against Group 1 (VALID_ASSET_TRIPLETS): the inferred
        asset_dimension must be allowed for the asset_category. If not,
        falls back to the category's default dimension (first in list).
        """
        text_lower = text.lower()

        # Try to detect specific asset category from text
        detected_category = None
        for cat, patterns in EXPOSURE_CATEGORY_PATTERNS.items():
            for p in patterns:
                if p.search(text_lower):
                    if cat in VALID_EXPOSURE_CATEGORIES:
                        detected_category = cat
                        break
            if detected_category:
                break

        # Get defaults from Group 3
        defaults = LOSS_SIGNAL_DEFAULTS.get(signal_type, {
            'asset_category': 'population',
            'asset_dimension': 'population',
        })

        asset_category = detected_category or defaults['asset_category']

        # --- Group 1 validation: asset_dimension must be valid for category ---
        allowed_dims = VALID_ASSET_TRIPLETS.get(asset_category, ['structure'])

        # Try signal-type default dimension first
        default_dim = defaults.get('asset_dimension', allowed_dims[0])

        if default_dim in allowed_dims:
            asset_dimension = default_dim
        else:
            # Dimension not valid for this category; use category's default
            asset_dimension = allowed_dims[0]

        # Ensure asset_dimension is valid codelist value
        if asset_dimension not in VALID_METRIC_DIMENSIONS:
            asset_dimension = allowed_dims[0] if allowed_dims[0] in VALID_METRIC_DIMENSIONS else 'structure'

        return {
            'asset_category': asset_category,
            'asset_dimension': asset_dimension,
        }

    def _validate_impact_metric(self, impact_metric: str, quantity_kind: str,
                                 impact_type: str) -> Dict[str, str]:
        """
        Validate impact_metric + quantity_kind + impact_type against Group 2.

        Returns corrected values if the combination is invalid.
        """
        constraints = IMPACT_METRIC_CONSTRAINTS.get(impact_metric)

        if constraints is None:
            # Unknown metric â€” keep as-is (open codelist may extend)
            return {
                'impact_metric': impact_metric,
                'quantity_kind': quantity_kind,
                'impact_type': impact_type,
            }

        expected_qty, allowed_types = constraints

        # Fix quantity_kind if wrong
        if quantity_kind != expected_qty:
            quantity_kind = expected_qty

        # Fix impact_type if not allowed for this metric
        if impact_type not in allowed_types:
            # Pick first allowed type (prefer 'direct')
            impact_type = 'direct' if 'direct' in allowed_types else sorted(allowed_types)[0]

        return {
            'impact_metric': impact_metric,
            'quantity_kind': quantity_kind,
            'impact_type': impact_type,
        }

    def _validate_loss_approach(self, loss_type: str, loss_approach: str) -> str:
        """
        Validate loss_type + loss_approach against LOSS_TYPE_APPROACH_RULES.

        If the approach is invalid for the loss_type, returns a valid one.
        """
        allowed = LOSS_TYPE_APPROACH_RULES.get(loss_type)
        if allowed and loss_approach not in allowed:
            # Pick empirical as a safe default, or first allowed
            return 'empirical' if 'empirical' in allowed else sorted(allowed)[0]
        return loss_approach

    def _infer_loss_approach(self, text: str) -> str:
        """Infer loss approach from text."""
        scores = {k: 0 for k in VALID_FUNCTION_APPROACHES}
        for approach, patterns in LOSS_APPROACH_PATTERNS.items():
            for p in patterns:
                if p.search(text):
                    scores[approach] += 1
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else 'empirical'

    def _infer_loss_frequency(self, text: str) -> str:
        """Infer loss frequency type from text."""
        scores = {k: 0 for k in VALID_ANALYSIS_TYPES}
        for freq, patterns in LOSS_FREQUENCY_PATTERNS.items():
            for p in patterns:
                if p.search(text):
                    scores[freq] += 1
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else 'empirical'

    def _infer_impact_type(self, text: str) -> str:
        """Infer impact type from text."""
        text_lower = text.lower()
        for itype, patterns in IMPACT_TYPE_PATTERNS.items():
            for p in patterns:
                if isinstance(p, str):
                    if re.search(p, text_lower, re.IGNORECASE):
                        return itype
                else:
                    if p.search(text_lower):
                        return itype
        return 'direct'

    def _infer_impact_modelling(self, text: str) -> str:
        """Infer impact modelling method from text."""
        text_lower = text.lower()
        scores = {k: 0 for k in VALID_CALCULATION_TYPES}
        for mod, patterns in IMPACT_MODELLING_PATTERNS.items():
            for p in patterns:
                if isinstance(p, str):
                    if re.search(p, text_lower, re.IGNORECASE):
                        scores[mod] += 1
                else:
                    if p.search(text_lower):
                        scores[mod] += 1
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else 'observed'

    def _detect_currency(self, text: str) -> Optional[str]:
        """Detect currency from text."""
        for pattern, currency_code in CURRENCY_PATTERNS:
            if pattern.search(text):
                if currency_code in VALID_CURRENCIES:
                    return currency_code
        return None

    def _detect_insured(self, text: str) -> bool:
        """Detect if losses are insured losses."""
        for p in INSURED_LOSS_PATTERNS:
            if p.search(text):
                return True
        return False

    def _extract_reference_year(self, record: Dict[str, Any]) -> Optional[int]:
        """Extract reference year from dataset metadata."""
        dataset_date = record.get('dataset_date', '') or ''
        year_match = re.search(r'(\d{4})', dataset_date)
        if year_match:
            year = int(year_match.group(1))
            if 1900 <= year <= 2100:
                return year
        last_mod = record.get('last_modified', '') or record.get('metadata_modified', '') or ''
        year_match = re.search(r'(\d{4})', last_mod)
        if year_match:
            year = int(year_match.group(1))
            if 1900 <= year <= 2100:
                return year
        return None

    def extract(self, record: Dict[str, Any]) -> LossExtraction:
        """
        Extract loss information from HDX record.

        All inferred field combinations are validated against the 3 constraint
        groups before building LossEntryExtraction objects.
        """
        text = self._get_all_text(record)

        # Check for exclusion patterns
        has_exclusion = self._check_exclusions(text)

        # Detect loss signal types
        signal_types = self._detect_loss_signals(text)
        if not signal_types:
            return LossExtraction()

        # P5 fix: If exclusion patterns matched and only weak/generic signals remain,
        # filter out the generic signals to reduce false positives
        if has_exclusion:
            strong_signals = [s for s in signal_types if s not in ('general_loss',)]
            if not strong_signals:
                return LossExtraction()
            signal_types = strong_signals

        # Get shared context
        hazard_ctx = self._infer_hazard_context(record, text)
        loss_approach = self._infer_loss_approach(text)
        loss_frequency = self._infer_loss_frequency(text)
        text_impact_type = self._infer_impact_type(text)
        impact_modelling = self._infer_impact_modelling(text)
        currency = self._detect_currency(text)
        is_insured = self._detect_insured(text)
        title = record.get('title', '')

        # Deduplicate: group signals by (asset_category, impact_metric)
        seen_keys = set()
        losses = []

        for signal_type in signal_types:
            # --- Group 3: Get unified defaults for this signal ---
            sig_defaults = LOSS_SIGNAL_DEFAULTS.get(signal_type, LOSS_SIGNAL_DEFAULTS['general_loss'])

            # --- Group 1: Validate asset_category -> asset_dimension ---
            asset_ctx = self._infer_asset_context(text, signal_type)

            # Start with signal defaults for impact fields
            impact_metric = sig_defaults['impact_metric']
            quantity_kind = sig_defaults['quantity_kind']
            impact_type = sig_defaults.get('impact_type', text_impact_type)
            loss_type = sig_defaults['loss_type']

            # Override impact_type from text if more specific
            if text_impact_type != 'direct' and impact_type == 'direct':
                impact_type = text_impact_type

            # --- Group 2: Validate impact_metric + quantity_kind + impact_type ---
            validated = self._validate_impact_metric(impact_metric, quantity_kind, impact_type)
            impact_metric = validated['impact_metric']
            quantity_kind = validated['quantity_kind']
            impact_type = validated['impact_type']

            # Override loss_type for insured losses
            if is_insured and signal_type in ('economic_loss', 'structural_damage', 'catastrophe_model'):
                loss_type = 'insured'

            # --- Validate loss_type + loss_approach ---
            validated_approach = self._validate_loss_approach(loss_type, loss_approach)

            # Dedup key
            key = (asset_ctx['asset_category'], impact_metric)
            if key in seen_keys:
                continue
            seen_keys.add(key)

            # Build description
            description = f"Loss data from HDX dataset: {title[:200]}"

            # Confidence based on hazard context availability
            confidence = 0.8 if hazard_ctx['hazard_type'] else 0.6

            entry = LossEntryExtraction(
                loss_signal_type=signal_type,
                hazard_type=hazard_ctx['hazard_type'],
                hazard_process=hazard_ctx['hazard_process'],
                asset_category=asset_ctx['asset_category'],
                asset_dimension=asset_ctx['asset_dimension'],
                impact_type=impact_type,
                impact_modelling=impact_modelling,
                impact_metric=impact_metric,
                quantity_kind=quantity_kind,
                loss_type=loss_type,
                loss_approach=validated_approach,
                loss_frequency_type=loss_frequency,
                currency=currency if quantity_kind == 'monetary' else None,
                description=description[:500],
                confidence=confidence,
            )
            losses.append(entry)

        confidences = [e.confidence for e in losses]
        overall = float(np.mean(confidences)) if confidences else 0.0

        return LossExtraction(losses=losses, overall_confidence=overall)


# Initialize
loss_extractor = LossExtractor(SIGNAL_DICT, HAZARD_XREF)
print(f"\nLossExtractor initialized (constraint-validated).")



LossExtractor initialized (constraint-validated).


### 8.3 Loss Block Builder

In [17]:
"""
8.3 Build RDLS Loss Block

Builds schema-compliant loss block with:
- losses[]: array of loss entries, each with all required fields
- impact_and_losses: nested sub-object with 7 required fields
- All values validated against closed codelists AND constraint tables
- Group 1: asset_dimension valid for asset_category
- Group 2: quantity_kind valid for impact_metric
"""

def build_loss_block(
    extraction: LossExtraction,
    dataset_id: str,
) -> Optional[Dict[str, Any]]:
    """
    Build RDLS loss block from extraction results.

    Parameters
    ----------
    extraction : LossExtraction
        Extraction results (already constraint-validated by LossExtractor)
    dataset_id : str
        Dataset identifier for building unique IDs

    Returns
    -------
    Optional[Dict[str, Any]]
        RDLS loss block or None if no data
    """
    if not extraction.has_any_signal():
        return None

    losses_list = []
    for idx, entry in enumerate(extraction.losses):
        # --- Final codelist validation ---
        impact_type = entry.impact_type if entry.impact_type in VALID_IMPACT_TYPES else 'direct'
        impact_modelling = entry.impact_modelling if entry.impact_modelling in VALID_CALCULATION_TYPES else 'observed'
        impact_metric = entry.impact_metric if entry.impact_metric in VALID_IMPACT_METRICS else 'asset_loss'
        loss_type = entry.loss_type if entry.loss_type in VALID_LOSS_TYPES else 'ground_up'
        loss_approach = entry.loss_approach if entry.loss_approach in VALID_FUNCTION_APPROACHES else 'empirical'
        loss_freq = entry.loss_frequency_type if entry.loss_frequency_type in VALID_ANALYSIS_TYPES else 'empirical'

        # --- Group 2 re-validation at build time ---
        metric_constraint = IMPACT_METRIC_CONSTRAINTS.get(impact_metric)
        if metric_constraint:
            expected_qty, allowed_types = metric_constraint
            quantity_kind = expected_qty  # enforce correct quantity_kind
            if impact_type not in allowed_types:
                impact_type = 'direct' if 'direct' in allowed_types else sorted(allowed_types)[0]
        else:
            quantity_kind = entry.quantity_kind or 'count'

        # Build impact_and_losses sub-object (7 required fields)
        impact_and_losses = {
            'impact_type': impact_type,
            'impact_modelling': impact_modelling,
            'impact_metric': impact_metric,
            'quantity_kind': quantity_kind,
            'loss_type': loss_type,
            'loss_approach': loss_approach,
            'loss_frequency_type': loss_freq,
        }

        # Optional: currency (only when quantity_kind is monetary)
        if entry.currency and entry.currency in VALID_CURRENCIES and quantity_kind == 'monetary':
            impact_and_losses['currency'] = entry.currency

        # --- Group 1 re-validation: asset_dimension for asset_category ---
        asset_category = entry.asset_category if entry.asset_category in VALID_EXPOSURE_CATEGORIES else None
        asset_dimension = entry.asset_dimension if entry.asset_dimension in VALID_METRIC_DIMENSIONS else None

        if asset_category and asset_dimension:
            allowed_dims = VALID_ASSET_TRIPLETS.get(asset_category, [])
            if allowed_dims and asset_dimension not in allowed_dims:
                asset_dimension = allowed_dims[0]

        # Build the loss entry (5 required top-level fields)
        loss_entry = {
            'id': f"loss_{dataset_id[:8]}_{idx + 1}",
            'hazard_type': entry.hazard_type if entry.hazard_type in VALID_HAZARD_TYPES else None,
            'asset_category': asset_category,
            'asset_dimension': asset_dimension,
            'impact_and_losses': impact_and_losses,
        }

        # P1+P2 fix: Skip loss entries with no determinable hazard or asset
        if loss_entry['hazard_type'] is None or asset_category is None or asset_dimension is None:
            continue

        # Optional: hazard_process
        if entry.hazard_process and entry.hazard_process in VALID_PROCESS_TYPES:
            loss_entry['hazard_process'] = entry.hazard_process

        # Optional: lineage
        lineage = {}
        if entry.hazard_dataset:
            lineage['hazard_dataset'] = entry.hazard_dataset
        if entry.exposure_dataset:
            lineage['exposure_dataset'] = entry.exposure_dataset
        if entry.vulnerability_dataset:
            lineage['vulnerability_dataset'] = entry.vulnerability_dataset
        if lineage:
            loss_entry['lineage'] = lineage

        # Optional: description
        if entry.description:
            loss_entry['description'] = entry.description

        losses_list.append(loss_entry)

    return {'losses': losses_list} if losses_list else None


print("Loss block builder defined (constraint-validated).")
print("  - Group 1: asset_dimension validated for asset_category")
print("  - Group 2: quantity_kind + impact_type validated for impact_metric")
print("  - All entries validated against schema codelists")
print("  - impact_and_losses includes all 7 required fields")


Loss block builder defined (constraint-validated).
  - Group 1: asset_dimension validated for asset_category
  - Group 2: quantity_kind + impact_type validated for impact_metric
  - All entries validated against schema codelists
  - impact_and_losses includes all 7 required fields


### 8.4 Test Loss Extraction

In [18]:
"""
8.4 Load Loss Test Samples

Organized by expected loss signal type to stress-test extraction.
"""

LOSS_TEST_SAMPLES = {
    'human_loss': [
        ('casualty', 'Casualty/fatality data'),
        ('mortality', 'Mortality data'),
        ('fatality', 'Fatality records'),
    ],
    'displacement': [
        ('displaced', 'Displacement data'),
        ('evacuation', 'Evacuation data'),
        ('idp', 'IDP data'),
    ],
    'structural_damage': [
        ('damage-assessment', 'Damage assessment'),
        ('building-damage', 'Building damage'),
        ('infrastructure-damage', 'Infrastructure damage'),
    ],
    'economic_loss': [
        ('economic-loss', 'Economic loss data'),
        ('insurance', 'Insurance claim data'),
    ],
    'agricultural_loss': [
        ('crop-loss', 'Crop loss data'),
        ('livestock-loss', 'Livestock loss data'),
    ],
    'general_loss': [
        ('disaster-loss', 'Disaster loss records'),
        ('disaster-incidents', 'Disaster incidents'),
        ('post-disaster', 'Post disaster assessment'),
        ('pdna', 'Post-Disaster Needs Assessment'),
    ],
    'edge_cases': [
        ('flood', 'Flood datasets (may have loss signals)'),
        ('earthquake', 'Earthquake datasets (may have loss signals)'),
    ],
}

# Load samples by searching filenames
loss_sample_records = []
loss_sample_meta = []
loss_loaded_ids = set()

for category, keyword_list in LOSS_TEST_SAMPLES.items():
    for keyword, note in keyword_list:
        files = sorted(DATASET_METADATA_DIR.glob(f'*{keyword}*.json'))[:5]
        for fp in files:
            try:
                with open(fp, 'r', encoding='utf-8') as f:
                    record = json.load(f)
                rid = record.get('id', fp.stem)
                if rid not in loss_loaded_ids:
                    loss_loaded_ids.add(rid)
                    loss_sample_records.append(record)
                    loss_sample_meta.append({
                        'category': category,
                        'note': note,
                        'filename': fp.name,
                    })
            except Exception:
                pass

print(f"Loaded {len(loss_sample_records)} unique Loss test samples across {len(LOSS_TEST_SAMPLES)} categories.")
print(f"\nSamples per category:")
for cat in LOSS_TEST_SAMPLES:
    count = sum(1 for m in loss_sample_meta if m['category'] == cat)
    print(f"  {cat}: {count}")

Loaded 47 unique Loss test samples across 7 categories.

Samples per category:
  human_loss: 7
  displacement: 12
  structural_damage: 9
  economic_loss: 0
  agricultural_loss: 0
  general_loss: 10
  edge_cases: 9


In [19]:
"""
8.5 Run Loss Extraction on Test Samples
"""

print("=" * 90)
print("LOSS EXTRACTION TEST RESULTS")
print(f"Testing {len(loss_sample_records)} samples")
print("=" * 90)

loss_test_results = []
loss_entry_count = 0

for record, meta in zip(loss_sample_records, loss_sample_meta):
    extraction = loss_extractor.extract(record)

    loss_test_results.append({
        'id': record.get('id'),
        'title': record.get('title', '')[:70],
        'category': meta['category'],
        'extraction': extraction,
    })

    if extraction.has_any_signal():
        loss_entry_count += len(extraction.losses)

        print(f"\n{'â”€' * 90}")
        print(f"[{meta['category']}] {record.get('title', '')[:75]}")

        for entry in extraction.losses:
            print(f"  LOSS: signal={entry.loss_signal_type} | hazard={entry.hazard_type} | "
                  f"asset={entry.asset_category}/{entry.asset_dimension} | "
                  f"metric={entry.impact_metric} | loss_type={entry.loss_type} | "
                  f"approach={entry.loss_approach} | freq={entry.loss_frequency_type} | "
                  f"conf={entry.confidence:.2f}")
            if entry.currency:
                print(f"         currency={entry.currency}")

print(f"\n{'=' * 90}")
print("LOSS TEST SUMMARY")
print(f"{'=' * 90}")
total = len(loss_test_results)
with_signal = sum(1 for r in loss_test_results if r['extraction'].has_any_signal())
print(f"  Total samples: {total}")
print(f"  With loss signal: {with_signal} ({with_signal/total*100:.1f}%)")
print(f"  Total loss entries: {loss_entry_count}")

LOSS EXTRACTION TEST RESULTS
Testing 47 samples

â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€
[human_loss] Ethiopia-Infant Mortality Rate
  LOSS: signal=human_loss | hazard=None | asset=population/population | metric=casualty_count | loss_type=count | approach=empirical | freq=empirical | conf=0.60

â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€
[human_loss] Perú: Covid19 Mortality Rate in Lima
  LOSS: signal=human_loss | hazard=None | asset=population/population | metric=casualty_count | loss_type=count | approach=empirical | freq=empirical | 

In [20]:
"""
8.6 Build Loss Blocks and Verify Structural Compliance
"""

print("=" * 90)
print("LOSS STRUCTURAL COMPLIANCE VERIFICATION")
print("=" * 90)

loss_total_blocks = 0
loss_total_entries = 0

# Compliance trackers
loss_field_violations = []
loss_impact_violations = []
loss_codelist_violations = []

LOSS_TOP_REQUIRED = ['id', 'hazard_type', 'asset_category', 'asset_dimension', 'impact_and_losses']
IMPACT_LOSSES_REQUIRED = [
    'impact_type', 'impact_modelling', 'impact_metric', 'quantity_kind',
    'loss_type', 'loss_approach', 'loss_frequency_type'
]

for result in loss_test_results:
    extraction = result['extraction']
    if not extraction.has_any_signal():
        continue

    block = build_loss_block(extraction, result['id'])
    if not block:
        continue

    loss_total_blocks += 1

    for entry in block.get('losses', []):
        loss_total_entries += 1

        # Check top-level required fields
        for fld in LOSS_TOP_REQUIRED:
            if fld not in entry or not entry[fld]:
                loss_field_violations.append(f"{result['id'][:8]}: missing {fld}")

        # Check impact_and_losses required fields
        ial = entry.get('impact_and_losses', {})
        for fld in IMPACT_LOSSES_REQUIRED:
            if fld not in ial or not ial[fld]:
                loss_impact_violations.append(f"{result['id'][:8]}: impact_and_losses missing {fld}")

        # Check codelist values
        if entry.get('hazard_type') and entry['hazard_type'] not in VALID_HAZARD_TYPES:
            loss_codelist_violations.append(f"{result['id'][:8]}: hazard_type='{entry['hazard_type']}'")
        if entry.get('asset_category') and entry['asset_category'] not in VALID_EXPOSURE_CATEGORIES:
            loss_codelist_violations.append(f"{result['id'][:8]}: asset_category='{entry['asset_category']}'")
        if entry.get('asset_dimension') and entry['asset_dimension'] not in VALID_METRIC_DIMENSIONS:
            loss_codelist_violations.append(f"{result['id'][:8]}: asset_dimension='{entry['asset_dimension']}'")
        if entry.get('hazard_process') and entry['hazard_process'] not in VALID_PROCESS_TYPES:
            loss_codelist_violations.append(f"{result['id'][:8]}: hazard_process='{entry['hazard_process']}'")

        # Check impact_and_losses codelist values
        if ial.get('impact_type') and ial['impact_type'] not in VALID_IMPACT_TYPES:
            loss_codelist_violations.append(f"{result['id'][:8]}: impact_type='{ial['impact_type']}'")
        if ial.get('impact_modelling') and ial['impact_modelling'] not in VALID_CALCULATION_TYPES:
            loss_codelist_violations.append(f"{result['id'][:8]}: impact_modelling='{ial['impact_modelling']}'")
        if ial.get('impact_metric') and ial['impact_metric'] not in VALID_IMPACT_METRICS:
            loss_codelist_violations.append(f"{result['id'][:8]}: impact_metric='{ial['impact_metric']}'")
        if ial.get('loss_type') and ial['loss_type'] not in VALID_LOSS_TYPES:
            loss_codelist_violations.append(f"{result['id'][:8]}: loss_type='{ial['loss_type']}'")
        if ial.get('loss_approach') and ial['loss_approach'] not in VALID_FUNCTION_APPROACHES:
            loss_codelist_violations.append(f"{result['id'][:8]}: loss_approach='{ial['loss_approach']}'")
        if ial.get('loss_frequency_type') and ial['loss_frequency_type'] not in VALID_ANALYSIS_TYPES:
            loss_codelist_violations.append(f"{result['id'][:8]}: loss_frequency_type='{ial['loss_frequency_type']}'")
        if ial.get('currency') and ial['currency'] not in VALID_CURRENCIES:
            loss_codelist_violations.append(f"{result['id'][:8]}: currency='{ial['currency']}'")

    # Show first 3 block previews
    if loss_total_blocks <= 3:
        print(f"\n{'â”€' * 90}")
        print(f"Loss block preview: {result['title']}")
        print(json.dumps(block, indent=2)[:2000])

# --- Compliance Report ---
print(f"\n{'=' * 90}")
print("LOSS COMPLIANCE REPORT")
print(f"{'=' * 90}")
print(f"  Total blocks built:           {loss_total_blocks}")
print(f"  Total loss entries:            {loss_total_entries}")
print()
print(f"  Top-level required fields:     {'PASS' if not loss_field_violations else f'FAIL ({len(loss_field_violations)} violations)'}")
for v in loss_field_violations[:5]:
    print(f"    - {v}")
print(f"  impact_and_losses required:    {'PASS' if not loss_impact_violations else f'FAIL ({len(loss_impact_violations)} violations)'}")
for v in loss_impact_violations[:5]:
    print(f"    - {v}")
print(f"  Codelist compliance:           {'PASS' if not loss_codelist_violations else f'FAIL ({len(loss_codelist_violations)} violations)'}")
for v in loss_codelist_violations[:5]:
    print(f"    - {v}")

# ID uniqueness check
all_loss_ids = []
for result in loss_test_results:
    extraction = result['extraction']
    if not extraction.has_any_signal():
        continue
    block = build_loss_block(extraction, result['id'])
    if not block:
        continue
    for entry in block.get('losses', []):
        all_loss_ids.append(entry['id'])

dup_loss_ids = [id for id in all_loss_ids if all_loss_ids.count(id) > 1]
print(f"  ID uniqueness:                 {'PASS' if not dup_loss_ids else f'FAIL ({len(set(dup_loss_ids))} duplicates)'}")

LOSS STRUCTURAL COMPLIANCE VERIFICATION

â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€
Loss block preview: Internally Displaced Persons Shelters in Daraa and Quneitra Governorat
{
  "losses": [
    {
      "id": "loss_1b24eb18_1",
      "hazard_type": "flood",
      "asset_category": "population",
      "asset_dimension": "population",
      "impact_and_losses": {
        "impact_type": "direct",
        "impact_modelling": "inferred",
        "impact_metric": "displaced_count",
        "quantity_kind": "count",
        "loss_type": "count",
        "loss_approach": "judgement",
        "loss_frequency_type": "empirical"
      },
      "description": "Loss data from HDX dataset: Internally Displaced Persons Shelters in Daraa and Quneitra Governorate"
    }
  ]
}

â

### 8.7 Loss Batch Processing

In [21]:
"""
8.7 Process Full Corpus for Loss Extraction
"""

def process_loss_extraction(
    metadata_dir: Path,
    extractor: LossExtractor,
    limit: Optional[int] = None
) -> pd.DataFrame:
    """Process all records for loss extraction."""
    json_files = sorted(metadata_dir.glob('*.json'))
    if limit:
        json_files = json_files[:limit]

    results = []
    iterator = tqdm(json_files, desc="Extracting loss") if HAS_TQDM else json_files

    for filepath in iterator:
        try:
            with open(filepath, 'r', encoding='utf-8') as f:
                record = json.load(f)

            extraction = extractor.extract(record)

            loss_signal_types = [e.loss_signal_type for e in extraction.losses]
            hazard_types = list(set(e.hazard_type for e in extraction.losses if e.hazard_type))
            asset_categories = list(set(e.asset_category for e in extraction.losses))

            results.append({
                'id': record.get('id'),
                'title': record.get('title'),
                'organization': record.get('organization'),
                'has_loss': extraction.has_any_signal(),
                'loss_count': len(extraction.losses),
                'loss_signal_types': loss_signal_types,
                'hazard_types': hazard_types,
                'asset_categories': asset_categories,
                'overall_confidence': extraction.overall_confidence,
                'extraction': extraction,
            })
        except Exception as e:
            results.append({'id': filepath.stem, 'error': str(e), 'has_loss': False})

    return pd.DataFrame(results)

LOSS_PROCESS_LIMIT = None  # Set to e.g. 2000 for testing, None for full corpus

print(f"Processing {'all' if LOSS_PROCESS_LIMIT is None else LOSS_PROCESS_LIMIT} records for Loss extraction...")
df_loss = process_loss_extraction(DATASET_METADATA_DIR, loss_extractor, limit=LOSS_PROCESS_LIMIT)

Processing all records for Loss extraction...


Extracting loss:   0%|          | 0/26246 [00:00<?, ?it/s]

In [22]:
"""
8.8 Loss Extraction Statistics
"""

print("=" * 60)
print("LOSS EXTRACTION STATISTICS")
print("=" * 60)

total = len(df_loss)
with_loss = df_loss['has_loss'].sum()
avg_entries = df_loss[df_loss['has_loss']]['loss_count'].mean() if with_loss > 0 else 0

print(f"\nTotal records processed: {total:,}")
print(f"  With any loss signal: {with_loss:,} ({with_loss/total*100:.1f}%)")
print(f"  Average loss entries per record: {avg_entries:.1f}")

# Loss signal type distribution
signal_counts = Counter()
for signals in df_loss['loss_signal_types'].dropna():
    if isinstance(signals, list):
        signal_counts.update(signals)

if signal_counts:
    print(f"\nLoss Signal Type Distribution:")
    for sig, count in signal_counts.most_common():
        print(f"  {sig}: {count:,}")

# Hazard type distribution among loss records
hazard_counts = Counter()
for htypes in df_loss['hazard_types'].dropna():
    if isinstance(htypes, list):
        hazard_counts.update(htypes)

if hazard_counts:
    print(f"\nHazard Types in Loss Records:")
    for ht, count in hazard_counts.most_common(10):
        print(f"  {ht}: {count:,}")

# Asset category distribution
asset_counts = Counter()
for cats in df_loss['asset_categories'].dropna():
    if isinstance(cats, list):
        asset_counts.update(cats)

if asset_counts:
    print(f"\nAsset Category Distribution:")
    for ac, count in asset_counts.most_common():
        print(f"  {ac}: {count:,}")

# Confidence distribution
conf = df_loss[df_loss['has_loss']]['overall_confidence']
if len(conf) > 0:
    print(f"\nConfidence Distribution:")
    print(f"  Mean: {conf.mean():.2f}")
    print(f"  Median: {conf.median():.2f}")
    print(f"  High (>=0.7): {(conf >= 0.7).sum():,}")
    print(f"  Medium (0.5-0.7): {((conf >= 0.5) & (conf < 0.7)).sum():,}")
    print(f"  Low (<0.5): {(conf < 0.5).sum():,}")

LOSS EXTRACTION STATISTICS

Total records processed: 26,246
  With any loss signal: 5,771 (22.0%)
  Average loss entries per record: 1.2

Loss Signal Type Distribution:
  human_loss: 3,093
  displacement: 1,908
  affected_population: 1,179
  structural_damage: 482
  general_loss: 39
  agricultural_loss: 7
  economic_loss: 2
  catastrophe_model: 1

Hazard Types in Loss Records:
  flood: 637
  earthquake: 73
  convective_storm: 72
  drought: 33
  strong_wind: 2
  volcanic: 2
  coastal_flood: 1
  landslide: 1

Asset Category Distribution:
  population: 2,371
  infrastructure: 2,002
  buildings: 1,144
  economic_indicator: 245
  agriculture: 27
  natural_environment: 1

Confidence Distribution:
  Mean: 0.63
  Median: 0.60
  High (>=0.7): 821
  Medium (0.5-0.7): 4,950
  Low (<0.5): 0


### 8.9 Loss Export

In [23]:
"""
8.9 Export Loss Results and Generate RDLS Loss Block JSONs
"""

# Prepare export DataFrame
loss_export_df = df_loss[[
    'id', 'title', 'organization', 'has_loss', 'loss_count',
    'loss_signal_types', 'hazard_types', 'asset_categories', 'overall_confidence'
]].copy()

# Convert lists to pipe-separated for CSV
for col in ['loss_signal_types', 'hazard_types', 'asset_categories']:
    loss_export_df[col] = loss_export_df[col].apply(
        lambda x: '|'.join(str(v) for v in x) if isinstance(x, list) else ''
    )

# Save full results
loss_output_file = OUTPUT_DIR / 'loss_extraction_results.csv'
loss_export_df.to_csv(loss_output_file, index=False)
print(f"Saved: {loss_output_file}")

# Save records with loss signals
loss_records = loss_export_df[loss_export_df['has_loss']]
loss_detected_file = OUTPUT_DIR / 'loss_detected_records.csv'
loss_records.to_csv(loss_detected_file, index=False)
print(f"Saved: {loss_detected_file} ({len(loss_records)} records)")

# --- Generate RDLS loss block JSONs for ALL flagged datasets ---
all_loss = df_loss[
    df_loss['has_loss'] &
    (df_loss['overall_confidence'] >= 0.5)
].copy()

print(f"\nGenerating RDLS loss block JSONs for {len(all_loss):,} datasets...")

generated = 0
skipped = 0

iterator = tqdm(all_loss.iterrows(), total=len(all_loss), desc="Building loss JSONs") if HAS_TQDM else all_loss.iterrows()

for idx, row in iterator:
    extraction = row['extraction']
    loss_block = build_loss_block(extraction, row['id'])

    if loss_block:
        rdls_record = {
            'datasets': [{
                'id': f"rdls_lss-hdx_{row['id'][:8]}",
                'title': row['title'],
                'risk_data_type': ['loss'],
                'loss': loss_block,
                'links': [{
                    'href': 'https://docs.riskdatalibrary.org/en/0__3__0/rdls_schema.json',
                    'rel': 'describedby'
                }]
            }]
        }

        output_path = OUTPUT_DIR / f"rdls_lss-hdx_{row['id'][:8]}.json"
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(rdls_record, f, indent=2, ensure_ascii=False)

        generated += 1
    else:
        skipped += 1

print(f"\nDone.")
print(f"  Generated: {generated:,} loss block JSONs")
print(f"  Skipped (no valid block): {skipped:,}")
print(f"  Output: {OUTPUT_DIR}")

Saved: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump/rdls/extracted/loss_extraction_results.csv
Saved: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump/rdls/extracted/loss_detected_records.csv (5771 records)

Generating RDLS loss block JSONs for 5,771 datasets...


Building loss JSONs:   0%|          | 0/5771 [00:00<?, ?it/s]


Done.
  Generated: 821 loss block JSONs
  Skipped (no valid block): 4,950
  Output: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump/rdls/extracted


In [24]:
print(f"\nLoss extraction complete: {datetime.now().isoformat()}")


Loss extraction complete: 2026-02-11T18:06:04.028728


## 9. Next Steps

This notebook produces both Vulnerability and Loss extraction results that feed into:

1. **Notebook 12**: HEVL Integration â€” merges all four components with general metadata
2. **Notebook 13**: Validation QA â€” validates complete RDLS records

### CSV Backward Compatibility
The CSV outputs are compatible with Notebook 12:
- `vulnerability_extraction_results.csv` â€” Vulnerability signals
- `vulnerability_detected_records.csv` â€” Records with vulnerability
- `loss_extraction_results.csv` â€” Loss signals
- `loss_detected_records.csv` â€” Records with loss

New columns are additive and will be consumed when Notebook 12 is updated.

## End of Code