# Notebook 08: HDX Metadata Signal Analysis for HEVL Inference

**Purpose**: Analyze 26,246 HDX metadata files to identify extractable signals for populating RDLS v0.3 HEVL (Hazard, Exposure, Vulnerability, Loss) component blocks.

**Outputs**:
- Signal frequency analysis (hazard types, exposure categories, etc.)
- Duplication pattern report
- HEVL coverage potential assessment
- Signal dictionary foundation for downstream extraction

**Author**: Benny Istanto/Risk Data Librarian/GFDRR  
**Version**: 2026.1

---

## 1. Setup and Configuration

In [1]:
"""
1.1 Import Dependencies

Standard data science stack for metadata analysis.
All packages are commonly available via pip/conda.
"""

import json
import os
import re
from pathlib import Path
from collections import Counter, defaultdict
from datetime import datetime
from typing import Dict, List, Tuple, Optional, Any

import pandas as pd
import numpy as np

# Optional: progress bar for long operations
try:
    from tqdm.notebook import tqdm
    HAS_TQDM = True
except ImportError:
    HAS_TQDM = False
    print("Note: tqdm not available. Install with 'pip install tqdm' for progress bars.")

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.width', None)

print(f"Analysis started: {datetime.now().isoformat()}")
print("Python packages loaded successfully.")

Analysis started: 2026-02-11T06:37:39.927614
Python packages loaded successfully.


In [2]:
"""
1.2 Define Paths and Constants

All paths are relative to the repository root for reproducibility.
Adjust BASE_DIR if running from a different location.
"""

# ============================================================================
# PATH CONFIGURATION - Adjust if needed
# ============================================================================

# Repository root (parent of 'notebook' folder)
NOTEBOOK_DIR = Path.cwd()
BASE_DIR = NOTEBOOK_DIR.parent if NOTEBOOK_DIR.name == 'notebook' else NOTEBOOK_DIR

# Input paths
DATASET_METADATA_DIR = BASE_DIR / 'hdx_dataset_metadata_dump' / 'dataset_metadata'
RDLS_SCHEMA_PATH = BASE_DIR / 'hdx_dataset_metadata_dump' / 'rdls' / 'schema' / 'rdls_schema_v0.3.json'

# Output paths
OUTPUT_DIR = BASE_DIR / 'hdx_dataset_metadata_dump' / 'analysis'
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Verify paths exist
assert DATASET_METADATA_DIR.exists(), f"Dataset metadata directory not found: {DATASET_METADATA_DIR}"
assert RDLS_SCHEMA_PATH.exists(), f"RDLS schema not found: {RDLS_SCHEMA_PATH}"

print(f"Base directory: {BASE_DIR}")
print(f"Dataset metadata: {DATASET_METADATA_DIR}")
print(f"Output directory: {OUTPUT_DIR}")

# ── Output cleanup mode ───────────────────────────────────────────────
# One of "replace", "prompt", "skip", "abort" (see clean_previous_outputs in 1.3)
CLEANUP_MODE = "replace"


Base directory: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler
Dataset metadata: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump/dataset_metadata
Output directory: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump/analysis


In [3]:
"""
1.3 Clean Previous Outputs

Remove stale output files from previous runs (controlled by CLEANUP_MODE).
"""

def clean_previous_outputs(output_dir, patterns, label, mode="replace"):
    """
    Remove previous output files matching the given glob patterns.

    Parameters
    ----------
    output_dir : Path
        Directory containing old outputs.
    patterns : list[str]
        Glob patterns to match.
    label : str
        Human-readable label for log messages.
    mode : str
        One of: "replace" (auto-delete), "prompt" (ask user),
        "skip" (keep old files), "abort" (error if stale files exist).

    Returns
    -------
    dict  with keys 'deleted' (int) and 'skipped' (bool)
    """
    result = {'deleted': 0, 'skipped': False}
    targets = {}
    for pattern in patterns:
        matches = sorted(output_dir.glob(pattern))
        if matches:
            targets[pattern] = matches
    total = sum(len(files) for files in targets.values())

    if total == 0:
        print(f'Output cleanup [{label}]: Directory is clean.')
        return result

    summary = []
    for pattern, files in targets.items():
        summary.append(f'  {pattern:40s}: {len(files):,} files')

    if mode == 'skip':
        print(f'Output cleanup [{label}]: SKIPPED ({total:,} existing files kept)')
        result['skipped'] = True
        return result

    if mode == 'abort':
        raise RuntimeError(
            f'Output cleanup [{label}]: ABORT -- {total:,} stale files found. '
            f'Delete manually or change CLEANUP_MODE.'
        )

    if mode == 'prompt':
        print(f'Output cleanup [{label}]: Found {total:,} existing output files:')
        for line in summary:
            print(line)
        choice = input('Choose [R]eplace / [S]kip / [A]bort: ').strip().lower()
        if choice in ('s', 'skip'):
            print('  Skipped.')
            result['skipped'] = True
            return result
        elif choice in ('a', 'abort'):
            raise RuntimeError('User chose to abort.')
        elif choice not in ('r', 'replace', ''):
            print('  Unknown choice, defaulting to Replace.')

    # Mode: replace (default). In prompt mode the summary was already printed.
    print(f'Output cleanup [{label}]:')
    if mode != 'prompt':
        for line in summary:
            print(line)
    for pattern, files in targets.items():
        for f in files:
            try:
                f.unlink()
                result['deleted'] += 1
            except Exception as e:
                print(f'  WARNING: Could not delete {f.name}: {e}')
    deleted_count = result['deleted']
    print(f'  Cleaned {deleted_count:,} files. Ready for fresh output.')
    print()
    return result

# ── Run cleanup ────────────────────────────────────────────────────────
clean_previous_outputs(
    OUTPUT_DIR,
    patterns=[
        "hdx_hevl_signal_analysis.csv",
        "hdx_hevl_signal_summary.json",
        "hdx_high_signal_records.csv",
    ],
    label="NB 08 Signal Analysis",
    mode=CLEANUP_MODE,
)


Output cleanup [NB 08 Signal Analysis]:
  hdx_hevl_signal_analysis.csv            : 1 files
  hdx_hevl_signal_summary.json            : 1 files
  hdx_high_signal_records.csv             : 1 files
  Cleaned 3 files. Ready for fresh output.



{'deleted': 3, 'skipped': False}
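
Before running with `CLEANUP_MODE = "replace"`, it can help to see which files the glob patterns would select. The following is a minimal dry-run sketch, not part of the notebook: `list_stale_outputs` is a hypothetical helper, and the wildcard pattern and temporary directory are only for illustration.

```python
import tempfile
from pathlib import Path

def list_stale_outputs(output_dir, patterns):
    """Return {pattern: sorted matches} without deleting anything."""
    found = {}
    for pattern in patterns:
        matches = sorted(output_dir.glob(pattern))
        if matches:
            found[pattern] = matches
    return found

# Demonstrate on a throwaway directory instead of the real OUTPUT_DIR
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "hdx_hevl_signal_analysis.csv").write_text("id\n")
    (tmp_dir / "unrelated.txt").write_text("keep me\n")

    stale = list_stale_outputs(
        tmp_dir,
        ["hdx_hevl_signal_*.csv", "hdx_high_signal_records.csv"],
    )
    stale_names = {p: [f.name for f in files] for p, files in stale.items()}

print(stale_names)
```

Only patterns with at least one match appear in the result, which mirrors how `clean_previous_outputs` builds its `targets` dictionary.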

In [4]:
"""
1.4 Load RDLS Schema Codelists

Extract closed codelists from RDLS v0.3 schema to use as reference
for signal matching. These are the valid values we need to map to.
"""

def load_rdls_codelists(schema_path: Path) -> Dict[str, List[str]]:
    """
    Extract codelist values from RDLS schema.
    
    Parameters
    ----------
    schema_path : Path
        Path to rdls_schema_v0.3.json
        
    Returns
    -------
    Dict[str, List[str]]
        Dictionary mapping codelist names to their valid values
    """
    with open(schema_path, 'r', encoding='utf-8') as f:
        schema = json.load(f)
    
    codelists = {}
    defs = schema.get('$defs', {})
    
    # Extract enum values from $defs
    for name, definition in defs.items():
        if 'enum' in definition:
            codelists[name] = definition['enum']
    
    return codelists

# Load codelists
RDLS_CODELISTS = load_rdls_codelists(RDLS_SCHEMA_PATH)

# Display key codelists for HEVL
key_codelists = ['hazard_type', 'process_type', 'exposure_category', 'analysis_type', 'risk_data_type']
print("=" * 60)
print("RDLS Key Codelists (closed - must match exactly)")
print("=" * 60)
for cl in key_codelists:
    if cl in RDLS_CODELISTS:
        print(f"\n{cl}:")
        print(f"  {RDLS_CODELISTS[cl]}")

RDLS Key Codelists (closed - must match exactly)

hazard_type:
  ['coastal_flood', 'convective_storm', 'drought', 'extreme_temperature', 'flood', 'wildfire', 'strong_wind', 'earthquake', 'landslide', 'tsunami', 'volcanic']

process_type:
  ['coastal_flood', 'storm_surge', 'tornado', 'agricultural_drought', 'hydrological_drought', 'meteorological_drought', 'socioeconomic_drought', 'primary_rupture', 'secondary_rupture', 'ground_motion', 'liquefaction', 'extreme_cold', 'extreme_heat', 'fluvial_flood', 'pluvial_flood', 'groundwater_flood', 'snow_avalanche', 'landslide_general', 'landslide_rockslide', 'landslide_mudflow', 'landslide_rockfall', 'tsunami', 'ashfall', 'volcano_ballistics', 'lahar', 'lava', 'pyroclastic_flow', 'wildfire', 'extratropical_cyclone', 'tropical_cyclone']

exposure_category:
  ['agriculture', 'buildings', 'infrastructure', 'population', 'natural_environment', 'economic_indicator', 'development_index']

analysis_type:
  ['probabilistic', 'deterministic', 'empirical']
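
The extraction in 1.4 relies on each codelist being a `$defs` entry that carries an `enum` array. The sketch below illustrates that assumption on a made-up miniature schema; `mini_schema` and `extract_codelists` are hypothetical names, and the real `rdls_schema_v0.3.json` is much larger.

```python
# Hypothetical miniature schema with the same shape the loader expects
mini_schema = {
    "$defs": {
        "hazard_type": {"type": "string", "enum": ["flood", "earthquake"]},
        "Period": {"type": "object", "properties": {}},  # no enum, so skipped
    }
}

def extract_codelists(schema):
    """Return {name: enum values} for every $defs entry that declares an enum."""
    return {
        name: definition["enum"]
        for name, definition in schema.get("$defs", {}).items()
        if "enum" in definition
    }

codelists = extract_codelists(mini_schema)
print(codelists)
```

Definitions without an `enum` key, such as object types, are simply ignored, so a codelist name that is missing from the result most likely means the schema defines it some other way.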

## 2. Data Loading and Initial Statistics

In [5]:
"""
2.1 Load All HDX Metadata Files

Read all JSON files from dataset_metadata directory.
This may take a few minutes for 26,000+ files.
"""

def load_hdx_metadata(
    metadata_dir: Path, limit: Optional[int] = None
) -> Tuple[List[Dict[str, Any]], List[Dict[str, str]]]:
    """
    Load HDX metadata JSON files from directory.
    
    Parameters
    ----------
    metadata_dir : Path
        Directory containing HDX metadata JSON files
    limit : Optional[int]
        Maximum number of files to load (for testing). None = all files.
        
    Returns
    -------
    Tuple[List[Dict[str, Any]], List[Dict[str, str]]]
        (records, errors): parsed metadata dictionaries, plus one
        {'file', 'error'} entry for each file that failed to load
    """
    # Sort so that `limit` selects the same files on every run
    json_files = sorted(metadata_dir.glob('*.json'))
    
    if limit is not None:
        json_files = json_files[:limit]
    
    records = []
    errors = []
    
    iterator = tqdm(json_files, desc="Loading metadata") if HAS_TQDM else json_files
    
    for filepath in iterator:
        try:
            with open(filepath, 'r', encoding='utf-8') as f:
                data = json.load(f)
                data['_source_file'] = filepath.name
                records.append(data)
        except Exception as e:
            errors.append({'file': filepath.name, 'error': str(e)})
    
    print(f"\nLoaded: {len(records):,} records")
    print(f"Errors: {len(errors):,} files")
    
    return records, errors

# Load all metadata (set limit=1000 for faster testing)
LOAD_LIMIT = None  # Set to integer for testing, None for full load

print(f"Loading HDX metadata files...")
print(f"Limit: {'All files' if LOAD_LIMIT is None else f'{LOAD_LIMIT:,} files'}")

hdx_records, load_errors = load_hdx_metadata(DATASET_METADATA_DIR, limit=LOAD_LIMIT)

# Store for later use
print(f"\nTotal records available for analysis: {len(hdx_records):,}")

Loading HDX metadata files...
Limit: All files


Loaded: 26,246 records
Errors: 0 files

Total records available for analysis: 26,246


In [6]:
"""
2.2 Convert to DataFrame for Analysis

Flatten key fields into a DataFrame for efficient analysis.
"""

def extract_flat_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """
    Extract key fields from HDX record into flat dictionary.
    
    Parameters
    ----------
    record : Dict[str, Any]
        Raw HDX metadata record
        
    Returns
    -------
    Dict[str, Any]
        Flattened record with key fields
    """
    return {
        'id': record.get('id', ''),
        'name': record.get('name', ''),
        'title': record.get('title', ''),
        'notes': record.get('notes', ''),
        'organization': record.get('organization', ''),
        'dataset_source': record.get('dataset_source', ''),
        'groups': '|'.join(record.get('groups', [])),
        'tags': '|'.join(record.get('tags', [])),
        'license_title': record.get('license_title', ''),
        'methodology': record.get('methodology', ''),
        'methodology_other': record.get('methodology_other', ''),
        'caveats': record.get('caveats', ''),
        'dataset_date': record.get('dataset_date', ''),
        'last_modified': record.get('last_modified', ''),
        'data_update_frequency': record.get('data_update_frequency', ''),
        'resource_count': len(record.get('resources', [])),
        'resource_formats': '|'.join(sorted(set(r.get('format', '') for r in record.get('resources', [])))),
        'resource_names': '|'.join(r.get('name', '') for r in record.get('resources', [])),
        '_source_file': record.get('_source_file', ''),
        # Concatenate all text fields for pattern matching
        '_all_text': ' '.join(filter(None, [
            record.get('title', ''),
            record.get('name', ''),
            record.get('notes', ''),
            ' '.join(record.get('tags', [])),
            ' '.join(r.get('name', '') for r in record.get('resources', [])),
            ' '.join(r.get('description', '') for r in record.get('resources', []))
        ])).lower()
    }

# Create DataFrame
df = pd.DataFrame([extract_flat_record(r) for r in hdx_records])

print(f"DataFrame shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nMemory usage: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

DataFrame shape: (26246, 20)

Columns: ['id', 'name', 'title', 'notes', 'organization', 'dataset_source', 'groups', 'tags', 'license_title', 'methodology', 'methodology_other', 'caveats', 'dataset_date', 'last_modified', 'data_update_frequency', 'resource_count', 'resource_formats', 'resource_names', '_source_file', '_all_text']

Memory usage: 399.3 MB


In [7]:
"""
2.3 Basic Statistics Overview

Summary statistics for the HDX metadata corpus.
"""

print("=" * 60)
print("HDX METADATA CORPUS OVERVIEW")
print("=" * 60)

print(f"\nTotal datasets: {len(df):,}")
print(f"Unique dataset IDs: {df['id'].nunique():,}")
print(f"Unique dataset names: {df['name'].nunique():,}")
print(f"Unique organizations: {df['organization'].nunique():,}")

# Top organizations
print(f"\n--- Top 15 Organizations by Dataset Count ---")
org_counts = df['organization'].value_counts().head(15)
for org, count in org_counts.items():
    print(f"  {org}: {count:,}")

# Resource format distribution
print(f"\n--- Resource Format Distribution ---")
all_formats = []
for formats in df['resource_formats'].dropna():
    all_formats.extend(formats.split('|'))
format_counts = Counter(f for f in all_formats if f)
for fmt, count in format_counts.most_common(15):
    print(f"  {fmt}: {count:,}")

HDX METADATA CORPUS OVERVIEW

Total datasets: 26,246
Unique dataset IDs: 26,246
Unique dataset names: 26,246
Unique organizations: 358

--- Top 15 Organizations by Dataset Count ---
  World Bank Group: 4,792
  Humanitarian OpenStreetMap Team (HOT): 2,593
  WorldPop: 1,569
  United Nations Satellite Centre (UNOSAT): 1,452
  UNHCR - The UN Refugee Agency: 1,132
  FEWS NET: 833
  World Health Organization: 678
  HeiGIT (Heidelberg Institute for Geoinformation Technology): 661
  HDX: 603
  Kontur: 502
  WFP - World Food Programme: 492
  Copernicus: 478
  Food and Agriculture Organization (FAO) of the United Nations: 441
  Internal Displacement Monitoring Centre (IDMC): 426
  UNICEF Data and Analytics (HQ): 292

--- Resource Format Distribution ---
  CSV: 12,603
  SHP: 6,248
  GeoJSON: 4,770
  Geopackage: 3,892
  KML: 3,337
  XLSX: 3,336
  GeoTIFF: 2,433
  Geodatabase: 1,310
  PDF: 1,199
  Web App: 925
  Garmin IMG: 560
  XML: 293
  XLS: 289
  JSON: 275
  PNG: 107


## 3. HEVL Signal Pattern Extraction

In [8]:
"""
3.1 Define HEVL Signal Patterns

Regular expression patterns to detect HEVL-relevant signals in text.
These patterns are designed to map to RDLS codelist values.

NOTE: These patterns should be validated with unit tests in a future iteration
to ensure precision/recall against a labeled ground truth set.
"""

# ============================================================================
# HAZARD TYPE PATTERNS
# Maps to: RDLS hazard_type codelist
# ============================================================================
HAZARD_TYPE_PATTERNS = {
    'flood': r'\b(flood|flooding|fluvial|pluvial|inundation)\b',
    'coastal_flood': r'\b(coastal.?flood|storm.?surge|tidal.?flood|sea.?level)\b',
    'earthquake': r'\b(earthquake|seismic|quake|tremor|ground.?motion)\b',
    'tsunami': r'\b(tsunami|tidal.?wave)\b',
    'landslide': r'\b(landslide|mudslide|rockfall|debris.?flow|mass.?movement)\b',
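    # NOTE: with \b on both sides, 'volcan' matches only that exact token,
    # not 'volcano' or 'volcanic'; volcan\w* would match the stem.
    'volcanic': r'\b(volcan|lava|pyroclastic|ash.?fall|eruption)\b',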
    'drought': r'\b(drought|water.?scarcity|aridity)\b',
    'wildfire': r'\b(wildfire|forest.?fire|bushfire|fire.?hazard)\b',
    'strong_wind': r'\b(wind|gust|gale)\b',
    'convective_storm': r'\b(cyclone|typhoon|hurricane|tropical.?storm|tornado|thunderstorm)\b',
    'extreme_temperature': r'\b(heat.?wave|cold.?wave|extreme.?temperature|frost|freeze)\b',
}

# ============================================================================
# HAZARD PROCESS TYPE PATTERNS (more specific)
# Maps to: RDLS process_type codelist
# ============================================================================
PROCESS_TYPE_PATTERNS = {
    'fluvial_flood': r'\b(fluvial|river.?flood|riverine)\b',
    'pluvial_flood': r'\b(pluvial|flash.?flood|surface.?water|urban.?flood)\b',
    'coastal_flood': r'\b(coastal.?flood|storm.?surge|tidal)\b',
    'ground_motion': r'\b(ground.?motion|pga|pgv|shaking|intensity)\b',
    'liquefaction': r'\b(liquefaction)\b',
    'tornado': r'\b(tornado|twister)\b',
    'tropical_cyclone': r'\b(tropical.?cyclone|typhoon|hurricane|cyclone)\b',
    'storm_surge': r'\b(storm.?surge|sea.?surge)\b',
    'meteorological_drought': r'\b(meteorological.?drought|rainfall.?deficit|precipitation.?deficit)\b',
    'agricultural_drought': r'\b(agricultural.?drought|crop.?drought|soil.?moisture.?deficit)\b',
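    # NOTE: the two keys below are not RDLS process_type values. The codelist
    # uses 'primary_rupture'/'secondary_rupture' and 'ashfall', so these keys
    # need mapping before they are written to RDLS records.
    'surface_rupture': r'\b(surface.?rupture|fault.?rupture)\b',
    'ash_fall': r'\b(ash.?fall|tephra|volcanic.?ash)\b',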
}

# ============================================================================
# EXPOSURE CATEGORY PATTERNS
# Maps to: RDLS exposure_category codelist
# ============================================================================
EXPOSURE_CATEGORY_PATTERNS = {
    'buildings': r'\b(building|structure|dwelling|house|residential|commercial|industrial)\b',
    'infrastructure': r'\b(infrastructure|road|bridge|railway|transport|power.?line|utility|airport|port)\b',
    'population': r'\b(population|people|inhabitant|resident|demographic|census)\b',
    'agriculture': r'\b(agriculture|crop|farm|livestock|agricultural|cultivation)\b',
    'natural_environment': r'\b(environment|ecosystem|forest|wetland|biodiversity|natural.?resource)\b',
    'economic_indicator': r'\b(gdp|gnp|gni|economic.?indicator|economic.?loss|damage.?cost|financial.?indicator|economic.?value|trade|export|import)\b',
    'development_index': r'\b(hdi|human.?development|poverty.?index|vulnerability.?index|svi|social.?vulnerability|development.?indicator|gini|inequality)\b',
}

# ============================================================================
# ANALYSIS TYPE PATTERNS
# Maps to: RDLS analysis_type codelist
# ============================================================================
ANALYSIS_TYPE_PATTERNS = {
    'probabilistic': r'\b(probabilistic|return.?period|rp\d+|annual.?exceedance|aep|frequency|stochastic|\d+.?year.?event)\b',
    'deterministic': r'\b(deterministic|index|susceptibility|ranking|score|classification)\b',
    'empirical': r'\b(empirical|historical|observed|actual|recorded|past.?event)\b',
}

# ============================================================================
# RETURN PERIOD EXTRACTION PATTERN
# ============================================================================
RETURN_PERIOD_PATTERN = r'(?:return.?period|rp|recurrence).?(?:of)?\s*(\d+)\s*(?:year|yr)?|(?:(\d+).?year.?(?:return|event|flood|storm))|(\d+)\s*yr'

# ============================================================================
# VULNERABILITY/LOSS INDICATORS
# ============================================================================
VULNERABILITY_PATTERNS = {
    'vulnerability': r'\b(vulnerability|fragility|damage.?function|loss.?function|susceptibility)\b',
    'loss': r'\b(loss|damage|impact|economic.?loss|casualty|fatality|injury)\b',
    'risk_assessment': r'\b(risk.?assessment|risk.?analysis|risk.?model|cat.?model)\b',
}

print("HEVL signal patterns defined (hardcoded):")
print(f"  - Hazard types: {len(HAZARD_TYPE_PATTERNS)} patterns")
print(f"  - Process types: {len(PROCESS_TYPE_PATTERNS)} patterns")
print(f"  - Exposure categories: {len(EXPOSURE_CATEGORY_PATTERNS)} patterns")
print(f"  - Analysis types: {len(ANALYSIS_TYPE_PATTERNS)} patterns")
print(f"  - Vulnerability/Loss indicators: {len(VULNERABILITY_PATTERNS)} patterns")

# ============================================================================
# LOAD SIGNAL DICTIONARY (centralized pattern source)
# Merges additional patterns from signal_dictionary.yaml without overwriting
# existing hardcoded patterns. This ensures UNOSAT event codes (H8) and other
# extended patterns are available for signal detection.
# ============================================================================
SIGNAL_DICT_PATH = BASE_DIR / 'hdx_dataset_metadata_dump' / 'config' / 'signal_dictionary.yaml'
if SIGNAL_DICT_PATH.exists():
    try:
        import yaml
        with open(SIGNAL_DICT_PATH, 'r', encoding='utf-8') as f:
            signal_dict = yaml.safe_load(f)

        # Merge hazard type patterns
        if 'hazard_type' in signal_dict:
            for htype, info in signal_dict['hazard_type'].items():
                if htype not in HAZARD_TYPE_PATTERNS:
                    # Combine patterns into one OR group
                    combined = '|'.join(info.get('patterns', []))
                    if combined:
                        HAZARD_TYPE_PATTERNS[htype] = combined

        # Merge process type patterns
        if 'process_type' in signal_dict:
            for ptype, info in signal_dict['process_type'].items():
                if ptype not in PROCESS_TYPE_PATTERNS:
                    combined = '|'.join(info.get('patterns', []))
                    if combined:
                        PROCESS_TYPE_PATTERNS[ptype] = combined

        # Merge exposure category patterns
        if 'exposure_category' in signal_dict:
            for cat, info in signal_dict['exposure_category'].items():
                if cat not in EXPOSURE_CATEGORY_PATTERNS:
                    combined = '|'.join(info.get('patterns', []))
                    if combined:
                        EXPOSURE_CATEGORY_PATTERNS[cat] = combined

        # Load exclusion patterns for false positive filtering
        EXCLUSION_PATTERNS = {}
        if 'exclusion_patterns' in signal_dict:
            for category, patterns in signal_dict['exclusion_patterns'].items():
                EXCLUSION_PATTERNS[category] = [re.compile(p, re.IGNORECASE) for p in patterns]

        print(f"\nLoaded signal dictionary: {SIGNAL_DICT_PATH.name}")
        print(f"  Hazard types: {len(HAZARD_TYPE_PATTERNS)}")
        print(f"  Process types: {len(PROCESS_TYPE_PATTERNS)}")
        print(f"  Exposure categories: {len(EXPOSURE_CATEGORY_PATTERNS)}")
        print(f"  Exclusion patterns: {len(EXCLUSION_PATTERNS)}")
    except Exception as e:
        print(f"WARNING: Could not load signal dictionary: {e}")
        EXCLUSION_PATTERNS = {}
else:
    print(f"\nNOTE: Signal dictionary not found at {SIGNAL_DICT_PATH}")
    EXCLUSION_PATTERNS = {}

HEVL signal patterns defined (hardcoded):
  - Hazard types: 11 patterns
  - Process types: 12 patterns
  - Exposure categories: 7 patterns
  - Analysis types: 3 patterns
  - Vulnerability/Loss indicators: 3 patterns

Loaded signal dictionary: signal_dictionary.yaml
  Hazard types: 11
  Process types: 12
  Exposure categories: 7
  Exclusion patterns: 3
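
Because the RDLS codelists are closed, every pattern key should be a valid codelist value. A quick check along the following lines can flag keys that need mapping. This is a sketch only: `keys_outside_codelist` is a hypothetical helper, and the two inputs are excerpts of the `process_type` codelist and `PROCESS_TYPE_PATTERNS` shown earlier.

```python
from typing import Dict, Iterable, List

def keys_outside_codelist(patterns: Dict[str, str], codelist: Iterable[str]) -> List[str]:
    """Return pattern keys that are not valid values of the given codelist."""
    allowed = set(codelist)
    return sorted(key for key in patterns if key not in allowed)

# Excerpt of the process_type codelist printed in section 1
process_type_codelist = [
    'primary_rupture', 'secondary_rupture', 'ground_motion',
    'liquefaction', 'ashfall', 'tornado',
]

# Excerpt of PROCESS_TYPE_PATTERNS
process_patterns_excerpt = {
    'ground_motion': r'\b(ground.?motion|pga|pgv|shaking|intensity)\b',
    'liquefaction': r'\b(liquefaction)\b',
    'surface_rupture': r'\b(surface.?rupture|fault.?rupture)\b',
    'ash_fall': r'\b(ash.?fall|tephra|volcanic.?ash)\b',
}

unmapped = keys_outside_codelist(process_patterns_excerpt, process_type_codelist)
print(unmapped)  # ['ash_fall', 'surface_rupture']
```

In the notebook itself the same check could be run as `keys_outside_codelist(PROCESS_TYPE_PATTERNS, RDLS_CODELISTS['process_type'])`, assuming that codelist was loaded.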


In [9]:
"""
3.2 Apply Pattern Matching to Corpus

Scan all records for HEVL signals using defined patterns.
"""

def extract_patterns(text: str, patterns: Dict[str, str]) -> List[str]:
    """
    Find all matching pattern names in text.
    
    Parameters
    ----------
    text : str
        Text to search (matching is case-insensitive)
    patterns : Dict[str, str]
        Dictionary of {name: regex_pattern}
        
    Returns
    -------
    List[str]
        List of matched pattern names
    """
    matches = []
    for name, pattern in patterns.items():
        if re.search(pattern, text, re.IGNORECASE):
            matches.append(name)
    return matches

def extract_return_periods(text: str) -> List[int]:
    """
    Extract return period values from text.
    
    Parameters
    ----------
    text : str
        Text to search
        
    Returns
    -------
    List[int]
        List of extracted return period values (years)
    """
    # Filter out year-like values (1900-2100) that are likely dates, not return periods
    YEAR_RANGE = range(1900, 2101)

    rp_values = []
    for match in re.finditer(RETURN_PERIOD_PATTERN, text, re.IGNORECASE):
        for group in match.groups():
            if group:
                try:
                    rp = int(group)
                    if 1 <= rp <= 100000:  # Reasonable range
                        if rp not in YEAR_RANGE:
                            rp_values.append(rp)
                except ValueError:
                    pass
    return sorted(set(rp_values))

# Apply pattern matching to all records
print("Extracting HEVL signals from metadata...")

df['hazard_types'] = df['_all_text'].apply(lambda x: extract_patterns(x, HAZARD_TYPE_PATTERNS))
df['process_types'] = df['_all_text'].apply(lambda x: extract_patterns(x, PROCESS_TYPE_PATTERNS))
df['exposure_categories'] = df['_all_text'].apply(lambda x: extract_patterns(x, EXPOSURE_CATEGORY_PATTERNS))
df['analysis_types'] = df['_all_text'].apply(lambda x: extract_patterns(x, ANALYSIS_TYPE_PATTERNS))
df['vuln_loss_indicators'] = df['_all_text'].apply(lambda x: extract_patterns(x, VULNERABILITY_PATTERNS))
df['return_periods'] = df['_all_text'].apply(extract_return_periods)

# Create binary flags
df['has_hazard'] = df['hazard_types'].apply(lambda x: len(x) > 0)
df['has_exposure'] = df['exposure_categories'].apply(lambda x: len(x) > 0)
df['has_vulnerability'] = df['vuln_loss_indicators'].apply(lambda x: 'vulnerability' in x)
df['has_loss'] = df['vuln_loss_indicators'].apply(lambda x: 'loss' in x or 'risk_assessment' in x)
df['has_return_period'] = df['return_periods'].apply(lambda x: len(x) > 0)

print("Signal extraction complete.")

Extracting HEVL signals from metadata...
Signal extraction complete.
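
The patterns wrap each alternative in `\b...\b`, so only exact tokens match: plurals and derived forms are missed unless listed explicitly. The sketch below shows the effect on two patterns copied from `HAZARD_TYPE_PATTERNS`. The `*_STEM` variants are hypothetical alternatives, not what the notebook uses, and the sample strings are made up.

```python
import re

# Copied verbatim from HAZARD_TYPE_PATTERNS
FLOOD = r'\b(flood|flooding|fluvial|pluvial|inundation)\b'
VOLCANIC = r'\b(volcan|lava|pyroclastic|ash.?fall|eruption)\b'

# Hypothetical stem-tolerant variants (not used by the notebook)
FLOOD_STEM = r'\b(flood\w*|fluvial|pluvial|inundation)\b'
VOLCANIC_STEM = r'\b(volcan\w*|lava|pyroclastic|ash.?fall|eruption)\b'

def hits(pattern: str, text: str) -> bool:
    """True if the pattern matches anywhere in the text, ignoring case."""
    return re.search(pattern, text, re.IGNORECASE) is not None

samples = ['flood extent 2022', 'floods in sindh province', 'volcanic hazard zones']

for text in samples:
    print(
        f"{text!r}: flood={hits(FLOOD, text)} flood_stem={hits(FLOOD_STEM, text)} "
        f"volcanic={hits(VOLCANIC, text)} volcanic_stem={hits(VOLCANIC_STEM, text)}"
    )
```

Stem matching trades precision for recall: `flood\w*` would also accept words such as "floodplain", so any change of this kind should be checked against labeled examples before the reported counts are compared.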


In [10]:
"""
3.3 Signal Detection Summary Statistics

Analyze coverage of HEVL signals across the corpus.
"""

print("=" * 70)
print("HEVL SIGNAL DETECTION SUMMARY")
print("=" * 70)

total = len(df)

# Overall component detection rates
print(f"\n--- Component Detection Rates ---")
print(f"{'Component':<20} {'Count':>10} {'Percentage':>12}")
print("-" * 45)
print(f"{'Hazard signal':<20} {df['has_hazard'].sum():>10,} {df['has_hazard'].mean()*100:>11.1f}%")
print(f"{'Exposure signal':<20} {df['has_exposure'].sum():>10,} {df['has_exposure'].mean()*100:>11.1f}%")
print(f"{'Vulnerability signal':<20} {df['has_vulnerability'].sum():>10,} {df['has_vulnerability'].mean()*100:>11.1f}%")
print(f"{'Loss signal':<20} {df['has_loss'].sum():>10,} {df['has_loss'].mean()*100:>11.1f}%")
print(f"{'Return period found':<20} {df['has_return_period'].sum():>10,} {df['has_return_period'].mean()*100:>11.1f}%")

# HEVL combination analysis
print(f"\n--- HEVL Component Combinations ---")
df['hevl_combo'] = df.apply(
    lambda r: ''.join([
        'H' if r['has_hazard'] else '-',
        'E' if r['has_exposure'] else '-',
        'V' if r['has_vulnerability'] else '-',
        'L' if r['has_loss'] else '-'
    ]), axis=1
)

combo_counts = df['hevl_combo'].value_counts()
print(f"{'Combination':<15} {'Count':>10} {'Percentage':>12}  Description")
print("-" * 70)
for combo, count in combo_counts.head(15).items():
    desc = []
    if combo[0] == 'H': desc.append('Hazard')
    if combo[1] == 'E': desc.append('Exposure')
    if combo[2] == 'V': desc.append('Vulnerability')
    if combo[3] == 'L': desc.append('Loss')
    desc_str = '+'.join(desc) if desc else 'No HEVL signals'
    print(f"{combo:<15} {count:>10,} {count/total*100:>11.1f}%  {desc_str}")

HEVL SIGNAL DETECTION SUMMARY

--- Component Detection Rates ---
Component                 Count   Percentage
---------------------------------------------
Hazard signal             2,517         9.6%
Exposure signal          18,673        71.1%
Vulnerability signal        205         0.8%
Loss signal               2,407         9.2%
Return period found          56         0.2%

--- HEVL Component Combinations ---
Combination          Count   Percentage  Description
----------------------------------------------------------------------
-E--                15,356        58.5%  Exposure
----                 6,298        24.0%  No HEVL signals
-E-L                 1,667         6.4%  Exposure+Loss
HE--                 1,240         4.7%  Hazard+Exposure
H---                   776         3.0%  Hazard
HE-L                   250         1.0%  Hazard+Exposure+Loss
---L                   241         0.9%  Loss
H--L                   213         0.8%  Hazard+Loss
-EV-                   103         0.4%  Exposure+Vulnerability
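
The combination code is built row by row with `df.apply(..., axis=1)`. For larger corpora, a column-wise construction may be faster. The sketch below shows one way to do it on a made-up three-row frame that uses the same flag column names; it is an alternative formulation, not the notebook's method.

```python
import pandas as pd

# Hypothetical three-row sample with the same flag columns as df
flags = pd.DataFrame({
    'has_hazard':        [True, False, True],
    'has_exposure':      [True, True, False],
    'has_vulnerability': [False, False, False],
    'has_loss':          [False, True, True],
})

letters = {
    'has_hazard': 'H',
    'has_exposure': 'E',
    'has_vulnerability': 'V',
    'has_loss': 'L',
}

# One string column per component, then concatenate them position by position
parts = [flags[col].map({True: letter, False: '-'}) for col, letter in letters.items()]
combo = parts[0].str.cat(parts[1:])

print(combo.tolist())  # ['HE--', '-E-L', 'H--L']
```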

In [11]:
"""
3.4 Detailed Hazard Type Distribution

Frequency analysis of specific hazard types detected.
"""

print("=" * 60)
print("HAZARD TYPE DISTRIBUTION")
print("=" * 60)

# Count each hazard type
hazard_counter = Counter()
for hazards in df['hazard_types']:
    hazard_counter.update(hazards)

print(f"\n{'Hazard Type':<25} {'Count':>10} {'% of Corpus':>12} {'RDLS Code':<20}")
print("-" * 70)
for hazard, count in hazard_counter.most_common():
    # Check if in RDLS codelist
    rdls_code = hazard if hazard in RDLS_CODELISTS.get('hazard_type', []) else f"{hazard}*"
    print(f"{hazard:<25} {count:>10,} {count/total*100:>11.1f}% {rdls_code:<20}")

print("\n* = May need mapping to official RDLS hazard_type code")

HAZARD TYPE DISTRIBUTION

Hazard Type                    Count  % of Corpus RDLS Code           
----------------------------------------------------------------------
flood                          1,315         5.0% flood               
convective_storm                 534         2.0% convective_storm    
drought                          370         1.4% drought             
earthquake                       365         1.4% earthquake          
tsunami                          312         1.2% tsunami             
strong_wind                      282         1.1% strong_wind         
landslide                         76         0.3% landslide           
volcanic                          26         0.1% volcanic            
coastal_flood                      5         0.0% coastal_flood       
wildfire                           5         0.0% wildfire            
extreme_temperature                4         0.0% extreme_temperature 

* = May need mapping to official RDLS hazard_type code

In [12]:
"""
3.5 Detailed Exposure Category Distribution
"""

print("=" * 60)
print("EXPOSURE CATEGORY DISTRIBUTION")
print("=" * 60)

# Count each exposure category
exposure_counter = Counter()
for categories in df['exposure_categories']:
    exposure_counter.update(categories)

print(f"\n{'Exposure Category':<25} {'Count':>10} {'% of Corpus':>12}")
print("-" * 50)
for category, count in exposure_counter.most_common():
    print(f"{category:<25} {count:>10,} {count/total*100:>11.1f}%")

EXPOSURE CATEGORY DISTRIBUTION

Exposure Category              Count  % of Corpus
--------------------------------------------------
population                    11,213        42.7%
infrastructure                 6,750        25.7%
economic_indicator             6,490        24.7%
buildings                      5,489        20.9%
agriculture                    3,637        13.9%
natural_environment            3,063        11.7%
development_index              1,482         5.6%
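
Short, common tokens in the exposure patterns can match text that has nothing to do with exposure, which may inflate these rates. The sketch below shows one such case and how an exclusion pattern could suppress it. The infrastructure pattern is copied from `EXPOSURE_CATEGORY_PATTERNS`; the exclusion entry, the helper name, and the sample titles are assumptions for illustration, since the actual contents of `signal_dictionary.yaml` are not shown here.

```python
import re

# Copied verbatim from EXPOSURE_CATEGORY_PATTERNS
INFRASTRUCTURE = r'\b(infrastructure|road|bridge|railway|transport|power.?line|utility|airport|port)\b'

# Hypothetical exclusion list; real entries would come from signal_dictionary.yaml
INFRASTRUCTURE_EXCLUSIONS = [re.compile(r'\bport[- ]au[- ]prince\b', re.IGNORECASE)]

def has_infrastructure_signal(text: str) -> bool:
    """Match the category pattern after blanking out excluded phrases."""
    cleaned = text
    for exclusion in INFRASTRUCTURE_EXCLUSIONS:
        cleaned = exclusion.sub(' ', cleaned)
    return re.search(INFRASTRUCTURE, cleaned, re.IGNORECASE) is not None

place_only = 'port-au-prince administrative boundaries'
place_and_roads = 'port-au-prince road network'

raw_hit = re.search(INFRASTRUCTURE, place_only, re.IGNORECASE) is not None
print(raw_hit)                                     # True: 'port' matches the place name
print(has_infrastructure_signal(place_only))       # False once the phrase is excluded
print(has_infrastructure_signal(place_and_roads))  # True: 'road' is still a real signal
```

Blanking the excluded phrase, rather than rejecting the whole record, keeps genuine signals that appear elsewhere in the same text.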


In [13]:
"""
3.6 Return Period Analysis

Distribution of extracted return period values.
"""

print("=" * 60)
print("RETURN PERIOD EXTRACTION ANALYSIS")
print("=" * 60)

# Flatten all return periods
all_rp = []
for rps in df['return_periods']:
    all_rp.extend(rps)

rp_counter = Counter(all_rp)

print(f"\nDatasets with return period: {df['has_return_period'].sum():,}")
print(f"Total return period values found: {len(all_rp):,}")
print(f"Unique return period values: {len(rp_counter):,}")

print(f"\n{'Return Period (years)':<25} {'Occurrences':>12}")
print("-" * 40)
for rp, count in rp_counter.most_common(20):
    print(f"{rp:<25} {count:>12,}")

RETURN PERIOD EXTRACTION ANALYSIS

Datasets with return period: 56
Total return period values found: 89
Unique return period values: 15

Return Period (years)      Occurrences
----------------------------------------
100                                 27
5                                   12
4                                   11
2                                    9
1                                    8
3                                    8
21                                   3
500                                  2
25                                   2
24                                   2
50                                   1
250                                  1
1000                                 1
26                                   1
16                                   1
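
Several values in the table above (1, 3, 4, 16, 21, 24, 26) are unusual as return periods and may be false positives of the extraction pattern; this has not been verified against the source metadata. As a rough sketch, the counts can be split against a whitelist of commonly reported return periods. `STANDARD_RETURN_PERIODS` and `split_return_periods` are hypothetical names introduced here, and the whitelist is an illustrative assumption, not an RDLS codelist.

```python
from collections import Counter

# Hypothetical whitelist of commonly reported return periods (years).
# Illustrative assumption only, not an RDLS codelist.
STANDARD_RETURN_PERIODS = {2, 5, 10, 20, 25, 50, 100, 200, 250, 475, 500, 1000, 2475}

def split_return_periods(counts):
    """Split a Counter of return periods into (standard, needs_review)."""
    standard = Counter({rp: n for rp, n in counts.items()
                        if rp in STANDARD_RETURN_PERIODS})
    needs_review = Counter({rp: n for rp, n in counts.items()
                            if rp not in STANDARD_RETURN_PERIODS})
    return standard, needs_review

# Occurrence counts copied from the output above
rp_counts = Counter({100: 27, 5: 12, 4: 11, 2: 9, 1: 8, 3: 8, 21: 3, 500: 2,
                     25: 2, 24: 2, 50: 1, 250: 1, 1000: 1, 26: 1, 16: 1})

standard, needs_review = split_return_periods(rp_counts)
print("standard:", sorted(standard), "occurrences:", sum(standard.values()))
print("needs review:", sorted(needs_review), "occurrences:", sum(needs_review.values()))
```

Values in the review bucket are not necessarily wrong; they are candidates for manual inspection before being written into a Hazard block.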


## 4. Duplication and Clustering Analysis

In [14]:
"""
4.1 Identify Potential Duplicates

Detect datasets that appear to be versions/variants of each other.
Uses title similarity and organization matching.
"""

def normalize_title(title: str) -> str:
    """
    Normalize title for comparison by removing common variations.
    
    Parameters
    ----------
    title : str
        Original title
        
    Returns
    -------
    str
        Normalized title
    """
    if not title:
        return ''
    
    # Lowercase
    t = title.lower()
    
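    # Remove country-specific suffixes (e.g. "... for Kenya" in GAR15-type titles).
    # Note: this strips everything from the first " for " to the end of the title,
    # so unrelated titles that share a prefix before " for " will also be grouped.
    t = re.sub(r'\s+for\s+[\w\s-]+$', '', t)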
    
    # Remove year references
    t = re.sub(r'\b(19|20)\d{2}\b', '', t)
    
    # Remove common version indicators
    t = re.sub(r'\b(v\d+|version\s*\d+|rev\s*\d+)\b', '', t)
    
    # Remove extra whitespace
    t = ' '.join(t.split())
    
    return t.strip()

# Create normalized title column
df['title_normalized'] = df['title'].apply(normalize_title)

# Count duplicates by normalized title
title_counts = df['title_normalized'].value_counts()
duplicate_titles = title_counts[title_counts > 1]

print("=" * 60)
print("DUPLICATION ANALYSIS")
print("=" * 60)

print(f"\nTotal datasets: {len(df):,}")
print(f"Unique normalized titles: {df['title_normalized'].nunique():,}")
print(f"Potential duplicate groups: {len(duplicate_titles):,}")
print(f"Records in duplicate groups: {df[df['title_normalized'].isin(duplicate_titles.index)].shape[0]:,}")

# Show largest duplicate groups
print(f"\n--- Largest Duplicate Groups (by normalized title) ---")
print(f"{'Normalized Title (truncated)':<50} {'Count':>8}")
print("-" * 60)
for title, count in duplicate_titles.head(20).items():
    display_title = title[:47] + '...' if len(title) > 50 else title
    print(f"{display_title:<50} {count:>8}")

DUPLICATION ANALYSIS

Total datasets: 26,246
Unique normalized titles: 24,194
Potential duplicate groups: 895
Records in duplicate groups: 2,947

--- Largest Duplicate Groups (by normalized title) ---
Normalized Title (truncated)                          Count
------------------------------------------------------------
gar15 global exposure dataset                           166
hdx hapi data                                           153
daily summaries of precipitation indicators              52
kenya medium term projection fews net acute foo...       12
uganda current situation fews net acute food in...       12
kenya near term projection fews net acute food ...       12
zimbabwe medium term projection fews net acute ...       12
mozambique near term projection fews net acute ...       12
chad medium term projection fews net acute food...       12
malawi current situation fews net acute food in...       12
kenya current situation fews net acute food ins...       12
guatemala near ter
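
To make the grouping behaviour concrete, here is a small sketch that applies the same normalization steps as `normalize_title` to a few hypothetical titles (they are illustrative and not taken from the corpus). It shows the intended collapse of country variants as well as two edge cases: a trailing parenthesis prevents the `for ...` suffix from being stripped, and a generic `for` phrase collapses more than a country name.

```python
import re

def normalize_title(title: str) -> str:
    """Same steps as the notebook function, repeated to keep this sketch self-contained."""
    if not title:
        return ''
    t = title.lower()
    t = re.sub(r'\s+for\s+[\w\s-]+$', '', t)                   # trailing "for ..." suffix
    t = re.sub(r'\b(19|20)\d{2}\b', '', t)                      # four-digit years
    t = re.sub(r'\b(v\d+|version\s*\d+|rev\s*\d+)\b', '', t)   # version markers
    return ' '.join(t.split()).strip()

# Hypothetical example titles
examples = [
    "GAR15 Global Exposure Dataset for Kenya",
    "GAR15 Global Exposure Dataset for Viet Nam",
    "Flood Extent v2 2019",
    "Population Data for Kenya (2020)",       # "(" blocks the suffix rule
    "Shelter Needs for Households in Yemen",  # suffix rule removes more than a country
]

for title in examples:
    print(f"{title!r:50} -> {normalize_title(title)!r}")
```

The first two titles normalize to the same string, which is how the `gar15 global exposure dataset` group in the output above arises.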

In [15]:
"""
4.2 Identify Dataset Series (Country Variants)

Detect systematic series like "GAR15 Global Exposure Dataset for [Country]".
"""

# Known series patterns
SERIES_PATTERNS = [
    (r'^gar15\s+global\s+exposure\s+dataset', 'GAR15 Exposure'),
    (r'^\w{3}\s+requirements\s+and\s+funding\s+data', 'Requirements & Funding'),
    (r'level\s+1\s+exposure\s+data', 'Level 1 Exposure'),
    (r'admin\s*\d+\s+(boundaries|administrative)', 'Admin Boundaries'),
    (r'flood\s+hazard.*return\s+period', 'Flood Hazard RP'),
    (r'earthquake.*hazard.*pga', 'Earthquake PGA'),
    (r'population\s+(density|count|statistics)', 'Population Data'),
]

def identify_series(title: str) -> Optional[str]:
    """
    Identify if title belongs to a known dataset series.
    
    Parameters
    ----------
    title : str
        Dataset title
        
    Returns
    -------
    Optional[str]
        Series name if matched, None otherwise
    """
    title_lower = title.lower() if title else ''
    for pattern, series_name in SERIES_PATTERNS:
        if re.search(pattern, title_lower):
            return series_name
    return None

df['series'] = df['title'].apply(identify_series)

print("=" * 60)
print("DATASET SERIES ANALYSIS")
print("=" * 60)

series_counts = df['series'].value_counts()
print(f"\nDatasets identified as part of series: {df['series'].notna().sum():,}")
print(f"\n{'Series Name':<30} {'Count':>10}")
print("-" * 45)
for series, count in series_counts.items():
    print(f"{series:<30} {count:>10,}")

DATASET SERIES ANALYSIS

Datasets identified as part of series: 1,360

Series Name                         Count
---------------------------------------------
Population Data                     1,132
GAR15 Exposure                        181
Level 1 Exposure                       47
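
Only three of the seven entries in `SERIES_PATTERNS` appear in the output above, because `value_counts()` omits series that never matched and `identify_series` returns only the first matching pattern. A minimal sketch for reporting hits per pattern, including zeros, is shown below. `count_pattern_hits` is a hypothetical helper and the sample titles are invented for illustration.

```python
import re
from collections import Counter

SERIES_PATTERNS = [
    (r'^gar15\s+global\s+exposure\s+dataset', 'GAR15 Exposure'),
    (r'^\w{3}\s+requirements\s+and\s+funding\s+data', 'Requirements & Funding'),
    (r'level\s+1\s+exposure\s+data', 'Level 1 Exposure'),
    (r'admin\s*\d+\s+(boundaries|administrative)', 'Admin Boundaries'),
    (r'flood\s+hazard.*return\s+period', 'Flood Hazard RP'),
    (r'earthquake.*hazard.*pga', 'Earthquake PGA'),
    (r'population\s+(density|count|statistics)', 'Population Data'),
]

def count_pattern_hits(titles):
    """Count titles matched by each pattern (every match, not only the first)."""
    hits = Counter({name: 0 for _, name in SERIES_PATTERNS})
    for title in titles:
        lowered = (title or '').lower()
        for pattern, name in SERIES_PATTERNS:
            if re.search(pattern, lowered):
                hits[name] += 1
    return hits

# Hypothetical sample titles
sample_titles = [
    "GAR15 Global Exposure Dataset for Kenya",
    "Kenya - Population Density",
    "Flood hazard map, 100 year return period",
    "Kenya admin 1 boundaries",
    "Kenya - Requirements and Funding Data",  # not matched: pattern expects a 3-character prefix
]

hits = count_pattern_hits(sample_titles)
for name, count in hits.items():
    print(f"{name:<30} {count:>5}")
```

Patterns that report zero hits on the real corpus are candidates for loosening or removal before the series labels are used for deduplication.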


## 5. Organization and Source Analysis

In [16]:
"""
5.1 Risk-Relevant Organizations

Identify organizations that publish HEVL-relevant data.
"""

print("=" * 70)
print("RISK-RELEVANT ORGANIZATIONS ANALYSIS")
print("=" * 70)

# Organizations with high HEVL signal rates
org_hevl_stats = df.groupby('organization').agg({
    'id': 'count',
    'has_hazard': 'sum',
    'has_exposure': 'sum',
    'has_vulnerability': 'sum',
    'has_loss': 'sum',
}).rename(columns={'id': 'total_datasets'})

# Calculate rates
org_hevl_stats['hazard_rate'] = org_hevl_stats['has_hazard'] / org_hevl_stats['total_datasets']
org_hevl_stats['exposure_rate'] = org_hevl_stats['has_exposure'] / org_hevl_stats['total_datasets']

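# Filter to orgs with significant HEVL content (at least 10 datasets and 30% HEVL rate)
# NOTE: any_hevl adds the hazard and exposure counts, so a dataset carrying both
# signals is counted twice and hevl_rate can exceed 1.0 (i.e. HEVL% above 100).
org_hevl_stats['any_hevl'] = org_hevl_stats['has_hazard'] + org_hevl_stats['has_exposure']
org_hevl_stats['hevl_rate'] = org_hevl_stats['any_hevl'] / org_hevl_stats['total_datasets']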

risk_orgs = org_hevl_stats[
    (org_hevl_stats['total_datasets'] >= 10) & 
    (org_hevl_stats['hevl_rate'] >= 0.3)
].sort_values('any_hevl', ascending=False)

print(f"\nOrganizations with significant risk data (>=10 datasets, >=30% HEVL rate):")
print(f"\n{'Organization':<45} {'Total':>8} {'Hazard':>8} {'Exposure':>8} {'HEVL%':>8}")
print("-" * 80)
for org, row in risk_orgs.head(20).iterrows():
    org_display = org[:42] + '...' if len(org) > 45 else org
    print(f"{org_display:<45} {row['total_datasets']:>8,.0f} {row['has_hazard']:>8,.0f} {row['has_exposure']:>8,.0f} {row['hevl_rate']*100:>7.1f}%")

RISK-RELEVANT ORGANIZATIONS ANALYSIS

Organizations with significant risk data (>=10 datasets, >=30% HEVL rate):

Organization                                     Total   Hazard Exposure    HEVL%
--------------------------------------------------------------------------------
World Bank Group                                 4,792        0    4,506    94.0%
Humanitarian OpenStreetMap Team (HOT)            2,593       10    2,587   100.2%
United Nations Satellite Centre (UNOSAT)         1,452    1,019      724   120.0%
WorldPop                                         1,569        3    1,247    79.7%
UNHCR - The UN Refugee Agency                    1,132        9      873    77.9%
HeiGIT (Heidelberg Institute for Geoinform...      661      192      661   129.0%
Copernicus                                         478      240      478   150.2%
Kontur                                             502        2      502   100.4%
Internal Displacement Monitoring Centre (I...      426       56    
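
Several HEVL% values in the table above exceed 100% because the hazard and exposure counts are added together, so datasets with both signals are counted twice. One possible alternative, sketched here on hypothetical toy data, is to flag each dataset once when it has a hazard or an exposure signal and aggregate that flag. The column `has_any_hevl` and the frame `toy` are assumptions introduced for this example.

```python
import pandas as pd

# Hypothetical toy data: Org A has exposure on every dataset and hazard on half
toy = pd.DataFrame({
    'organization': ['Org A'] * 4 + ['Org B'] * 2,
    'has_hazard':   [True, True, False, False, False, False],
    'has_exposure': [True, True, True, True, True, False],
})

# Flag each dataset once if it has a hazard OR an exposure signal
toy['has_any_hevl'] = toy['has_hazard'] | toy['has_exposure']

stats = toy.groupby('organization').agg(
    total_datasets=('has_any_hevl', 'size'),
    has_hazard=('has_hazard', 'sum'),
    has_exposure=('has_exposure', 'sum'),
    any_hevl=('has_any_hevl', 'sum'),
)

# Summed rate (as in the cell above) versus union-based rate
stats['summed_rate'] = (stats['has_hazard'] + stats['has_exposure']) / stats['total_datasets']
stats['union_rate'] = stats['any_hevl'] / stats['total_datasets']

print(stats[['total_datasets', 'summed_rate', 'union_rate']])
```

The union-based rate stays within 0 to 1, which makes the 30% threshold easier to interpret; the ranking of organizations may change slightly compared with the summed version.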

In [17]:
"""
5.2 Tag Analysis for HEVL Signals

Analyze HDX tags to identify risk-relevant categorization.
"""

print("=" * 60)
print("TAG ANALYSIS")
print("=" * 60)

# Count all tags
all_tags = []
for tags_str in df['tags'].dropna():
    all_tags.extend(tags_str.split('|'))

tag_counter = Counter(t.strip() for t in all_tags if t.strip())

# Risk-relevant tags
risk_keywords = ['hazard', 'risk', 'disaster', 'flood', 'earthquake', 'cyclone', 
                 'drought', 'exposure', 'vulnerability', 'tsunami', 'storm']

risk_tags = {tag: count for tag, count in tag_counter.items() 
             if any(kw in tag.lower() for kw in risk_keywords)}

print(f"\nTotal unique tags: {len(tag_counter):,}")
print(f"Risk-relevant tags: {len(risk_tags):,}")

print(f"\n--- Top Risk-Relevant Tags ---")
print(f"{'Tag':<45} {'Count':>10}")
print("-" * 58)
for tag, count in sorted(risk_tags.items(), key=lambda x: x[1], reverse=True)[:25]:
    print(f"{tag:<45} {count:>10,}")

TAG ANALYSIS

Total unique tags: 143
Risk-relevant tags: 11

--- Top Risk-Relevant Tags ---
Tag                                                Count
----------------------------------------------------------
flooding                                             843
cyclones-hurricanes-typhoons                         806
natural disasters                                    592
hazards and risk                                     567
disaster risk reduction-drr                          357
earthquake-tsunami                                   306
drought                                              298
climate hazards                                      158
crisis-myanmar-earthquake                             33
libya-floods                                          20
morocco-earthquake                                    14


## 6. Sample High-Quality HEVL Records

In [18]:
"""
6.1 Identify High-Quality Records for Each Component

Find records with strong HEVL signals for manual review and pattern validation.
"""

def calculate_signal_strength(row: pd.Series) -> int:
    """
    Calculate overall HEVL signal strength score.
    
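    NOTE: These signal strength scores are heuristic weights and have not been
    calibrated against ground truth labels. The weights (2x for hazard types,
    exposure categories, and analysis types; 1x for process types; 3x for
    return periods) reflect assumed relative informativeness but should be
    validated empirically in a future iteration. Because the score is capped
    at 20, records with many return periods can reach the cap on that signal
    alone.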
    
    Parameters
    ----------
    row : pd.Series
        DataFrame row
        
    Returns
    -------
    int
        Signal strength score (0-20, heuristic)
    """
    score = 0
    score += len(row['hazard_types']) * 2  # Weight hazard signals
    score += len(row['exposure_categories']) * 2
    score += len(row['process_types'])
    score += len(row['analysis_types']) * 2
    score += len(row['return_periods']) * 3  # Return periods are very specific
    return min(score, 20)  # Cap at 20

df['signal_strength'] = df.apply(calculate_signal_strength, axis=1)

# Find top records for each component
print("=" * 70)
print("HIGH-QUALITY HEVL RECORD SAMPLES")
print("=" * 70)

HIGH-QUALITY HEVL RECORD SAMPLES
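
As a worked example of the scoring rule, the sketch below recomputes the score from plain dictionaries. The first record uses the list values printed for the Tanzania ICA dataset in the 6.2 and 6.3 outputs; the second uses the values printed for "Global Drought Hazard", with an empty exposure list as an assumption because its exposure categories are not shown. `score_record` is a hypothetical stand-in for `calculate_signal_strength` that also returns the uncapped score.

```python
def score_record(record, cap=20):
    """Return (raw_score, capped_score) using the notebook's heuristic weights."""
    raw = (
        2 * len(record['hazard_types'])
        + 2 * len(record['exposure_categories'])
        + 1 * len(record['process_types'])
        + 2 * len(record['analysis_types'])
        + 3 * len(record['return_periods'])
    )
    return raw, min(raw, cap)

tanzania_ica = {
    'hazard_types': ['flood', 'landslide', 'drought'],
    'exposure_categories': ['population', 'agriculture'],
    'process_types': [],
    'analysis_types': ['probabilistic', 'deterministic', 'empirical'],
    'return_periods': [100],
}

global_drought = {
    'hazard_types': ['drought'],
    'exposure_categories': [],  # assumption: not shown in the printed output
    'process_types': [],
    'analysis_types': ['probabilistic', 'deterministic'],
    'return_periods': [25, 50, 100, 250, 500, 1000],
}

print("Tanzania ICA:", score_record(tanzania_ica))           # 6 + 4 + 0 + 6 + 3
print("Global Drought Hazard:", score_record(global_drought))  # 2 + 0 + 0 + 4 + 18, then capped
```

The Tanzania record reproduces the printed score of 19, while the drought record exceeds the cap on return periods, hazard type, and analysis types alone.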


In [19]:
"""
6.2 Sample Hazard-Rich Records
"""

hazard_rich = df[df['has_hazard'] & df['has_return_period']].nlargest(10, 'signal_strength')

print("\n--- Top 10 Hazard-Rich Records (with return periods) ---\n")
for idx, row in hazard_rich.iterrows():
    print(f"Title: {row['title'][:80]}")
    print(f"  Organization: {row['organization']}")
    print(f"  Hazard types: {row['hazard_types']}")
    print(f"  Process types: {row['process_types']}")
    print(f"  Analysis types: {row['analysis_types']}")
    print(f"  Return periods: {row['return_periods']}")
    print(f"  Signal strength: {row['signal_strength']}")
    print()


--- Top 10 Hazard-Rich Records (with return periods) ---

Title: Global Drought Hazard
  Organization: Institute for International Law of Peace and Armed Conflict
  Hazard types: ['drought']
  Process types: []
  Analysis types: ['probabilistic', 'deterministic']
  Return periods: [25, 50, 100, 250, 500, 1000]
  Signal strength: 20

Title: United Republic of Tanzania: Integrated Context Analysis (ICA), 2015
  Organization: WFP - World Food Programme
  Hazard types: ['flood', 'landslide', 'drought']
  Process types: []
  Analysis types: ['probabilistic', 'deterministic', 'empirical']
  Return periods: [100]
  Signal strength: 19

Title: Burkina Faso: Integrated Context Analysis (ICA), 2018
  Organization: WFP - World Food Programme
  Hazard types: ['flood', 'drought']
  Process types: []
  Analysis types: ['probabilistic', 'deterministic', 'empirical']
  Return periods: [100]
  Signal strength: 17

Title: Zambia: Response Plan projects
  Organization: OCHA Humanitarian Programme Cycle 

In [20]:
"""
6.3 Sample Exposure-Rich Records
"""

exposure_rich = df[df['has_exposure'] & (df['exposure_categories'].apply(len) >= 2)].nlargest(10, 'signal_strength')

print("\n--- Top 10 Exposure-Rich Records (multiple categories) ---\n")
for idx, row in exposure_rich.iterrows():
    print(f"Title: {row['title'][:80]}")
    print(f"  Organization: {row['organization']}")
    print(f"  Exposure categories: {row['exposure_categories']}")
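    # 'tags' can be missing (NaN) for some records, so guard before slicing
    print(f"  Tags: {str(row['tags'])[:60] if pd.notna(row['tags']) else ''}")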
    print(f"  Signal strength: {row['signal_strength']}")
    print()


--- Top 10 Exposure-Rich Records (multiple categories) ---

Title: United Republic of Tanzania: Integrated Context Analysis (ICA), 2015
  Organization: WFP - World Food Programme
  Exposure categories: ['population', 'agriculture']
  Tags: geodata|hazards and risk
  Signal strength: 19

Title: Mozambique - Health Indicators
  Organization: World Health Organization
  Exposure categories: ['buildings', 'infrastructure', 'population', 'natural_environment', 'economic_indicator', 'development_index']
  Tags: disability|disease|environment|health|hxl|indicators|malaria
  Signal strength: 18

Title: Mauritius - Health Indicators
  Organization: World Health Organization
  Exposure categories: ['buildings', 'infrastructure', 'population', 'natural_environment', 'economic_indicator', 'development_index']
  Tags: disability|disease|environment|health|hxl|indicators|malaria
  Signal strength: 18

Title: Tonga - Health Indicators
  Organization: World Health Organization
  Exposure categories: 

## 7. Export Analysis Results

In [21]:
"""
7.1 Export Analysis DataFrames

Save analysis results for downstream notebooks.
"""

# Prepare export DataFrame (exclude large text columns)
export_cols = [
    'id', 'name', 'title', 'organization', 'groups', 'tags',
    'hazard_types', 'process_types', 'exposure_categories', 'analysis_types',
    'vuln_loss_indicators', 'return_periods',
    'has_hazard', 'has_exposure', 'has_vulnerability', 'has_loss',
    'has_return_period', 'hevl_combo', 'signal_strength', 'series'
]

df_export = df[export_cols].copy()

# Convert lists to pipe-separated strings for CSV compatibility
list_cols = ['hazard_types', 'process_types', 'exposure_categories', 
             'analysis_types', 'vuln_loss_indicators', 'return_periods']
for col in list_cols:
    df_export[col] = df_export[col].apply(lambda x: '|'.join(map(str, x)) if x else '')

# Save full analysis
output_file = OUTPUT_DIR / 'hdx_hevl_signal_analysis.csv'
df_export.to_csv(output_file, index=False)
print(f"Saved: {output_file}")
print(f"  Records: {len(df_export):,}")
print(f"  Columns: {len(df_export.columns)}")

Saved: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump/analysis/hdx_hevl_signal_analysis.csv
  Records: 26,246
  Columns: 20
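
Downstream notebooks that read this CSV need to turn the pipe-separated strings back into lists. The following sketch shows one way to do the round trip with the standard library; `join_list` and `split_list` are hypothetical helper names and the two rows are invented for illustration. Note that return periods come back as strings and need an explicit cast.

```python
import csv
import io

def join_list(values):
    """Serialize a list as a pipe-separated string (empty list -> empty string)."""
    return '|'.join(map(str, values)) if values else ''

def split_list(cell, cast=str):
    """Parse a pipe-separated cell back into a list (empty string -> empty list)."""
    return [cast(v) for v in cell.split('|')] if cell else []

# Hypothetical rows with list-valued columns
rows = [
    {'id': 'a', 'hazard_types': ['flood', 'drought'], 'return_periods': [25, 100]},
    {'id': 'b', 'hazard_types': [], 'return_periods': []},
]

# Write to an in-memory CSV, mirroring the export step above
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=['id', 'hazard_types', 'return_periods'])
writer.writeheader()
for row in rows:
    writer.writerow({
        'id': row['id'],
        'hazard_types': join_list(row['hazard_types']),
        'return_periods': join_list(row['return_periods']),
    })

# Read back and restore the lists
buffer.seek(0)
restored = [
    {
        'id': rec['id'],
        'hazard_types': split_list(rec['hazard_types']),
        'return_periods': split_list(rec['return_periods'], cast=int),
    }
    for rec in csv.DictReader(buffer)
]

print(restored)
```

When the file is read with `pandas.read_csv` instead, empty cells are parsed as NaN by default; passing `keep_default_na=False` keeps them as empty strings so the same `split_list` logic applies.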


In [22]:
"""
7.2 Export Summary Statistics
"""

def convert_to_native(obj):
    """
    Recursively convert NumPy/pandas types to native Python types for JSON serialization.
    
    Parameters
    ----------
    obj : Any
        Object to convert (can be dict, list, numpy type, etc.)
        
    Returns
    -------
    Any
        Object with all NumPy types converted to native Python types
    """
    if isinstance(obj, dict):
        return {k: convert_to_native(v) for k, v in obj.items()}
    elif isinstance(obj, (list, tuple)):
        return [convert_to_native(item) for item in obj]
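    elif isinstance(obj, np.bool_):
        # NumPy booleans are not JSON serializable either
        return bool(obj)
    elif isinstance(obj, np.integer):
        # np.integer covers np.int32, np.int64, and the other integer widths
        return int(obj)
    elif isinstance(obj, np.floating):
        # np.floating covers np.float32 and np.float64
        return float(obj)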
    elif isinstance(obj, np.ndarray):
        return convert_to_native(obj.tolist())
    elif isinstance(obj, pd.Series):
        return convert_to_native(obj.to_dict())
    else:
        return obj

summary = {
    'analysis_date': datetime.now().isoformat(),
    'total_datasets': int(len(df)),
    'unique_organizations': int(df['organization'].nunique()),
    
    'hevl_detection': {
        'hazard_signal_count': int(df['has_hazard'].sum()),
        'hazard_signal_rate': float(round(df['has_hazard'].mean(), 4)),
        'exposure_signal_count': int(df['has_exposure'].sum()),
        'exposure_signal_rate': float(round(df['has_exposure'].mean(), 4)),
        'vulnerability_signal_count': int(df['has_vulnerability'].sum()),
        'vulnerability_signal_rate': float(round(df['has_vulnerability'].mean(), 4)),
        'loss_signal_count': int(df['has_loss'].sum()),
        'loss_signal_rate': float(round(df['has_loss'].mean(), 4)),
        'return_period_count': int(df['has_return_period'].sum()),
        'return_period_rate': float(round(df['has_return_period'].mean(), 4)),
    },
    
    'hazard_type_counts': convert_to_native(dict(hazard_counter.most_common())),
    'exposure_category_counts': convert_to_native(dict(exposure_counter.most_common())),
    'return_period_counts': convert_to_native(dict(rp_counter.most_common(20))),
    
    'hevl_combinations': convert_to_native(dict(df['hevl_combo'].value_counts().head(10))),
    
    'duplication': {
        'unique_normalized_titles': int(df['title_normalized'].nunique()),
        'duplicate_groups': int(len(duplicate_titles)),
    },
    
    'series_counts': convert_to_native(dict(series_counts)) if len(series_counts) > 0 else {},
}

summary_file = OUTPUT_DIR / 'hdx_hevl_signal_summary.json'
with open(summary_file, 'w', encoding='utf-8') as f:
    json.dump(summary, f, indent=2)
print(f"\nSaved: {summary_file}")


Saved: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump/analysis/hdx_hevl_signal_summary.json


In [23]:
"""
7.3 Export High-Signal Records for Manual Review
"""

# Top 500 records by signal strength
high_signal = df.nlargest(500, 'signal_strength')[[
    'id', 'title', 'organization', 'hazard_types', 'exposure_categories',
    'analysis_types', 'return_periods', 'signal_strength', 'hevl_combo'
]].copy()

for col in ['hazard_types', 'exposure_categories', 'analysis_types', 'return_periods']:
    high_signal[col] = high_signal[col].apply(lambda x: '|'.join(map(str, x)) if x else '')

high_signal_file = OUTPUT_DIR / 'hdx_high_signal_records.csv'
high_signal.to_csv(high_signal_file, index=False)
print(f"Saved: {high_signal_file}")
print(f"  Records: {len(high_signal):,}")

Saved: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump/analysis/hdx_high_signal_records.csv
  Records: 500


## 8. Conclusions and Next Steps

In [24]:
"""
8.1 Analysis Summary Report
"""

print("=" * 70)
print("ANALYSIS SUMMARY REPORT")
print("=" * 70)

print(f"""
CORPUS OVERVIEW
---------------
Total HDX datasets analyzed: {len(df):,}
Unique organizations: {df['organization'].nunique():,}

HEVL SIGNAL COVERAGE
--------------------
Datasets with Hazard signals:       {df['has_hazard'].sum():>6,} ({df['has_hazard'].mean()*100:.1f}%)
Datasets with Exposure signals:     {df['has_exposure'].sum():>6,} ({df['has_exposure'].mean()*100:.1f}%)
Datasets with Vulnerability signals:{df['has_vulnerability'].sum():>6,} ({df['has_vulnerability'].mean()*100:.1f}%)
Datasets with Loss signals:         {df['has_loss'].sum():>6,} ({df['has_loss'].mean()*100:.1f}%)
Datasets with Return Period info:   {df['has_return_period'].sum():>6,} ({df['has_return_period'].mean()*100:.1f}%)

INFERENCE POTENTIAL
-------------------
Datasets suitable for Hazard block population: ~{df['has_hazard'].sum():,}
  - With specific hazard type: {(df['hazard_types'].apply(len) >= 1).sum():,}
  - With process type detail:  {(df['process_types'].apply(len) >= 1).sum():,}
  - With analysis type:        {(df['analysis_types'].apply(len) >= 1).sum():,}
  - With return periods:       {df['has_return_period'].sum():,}

Datasets suitable for Exposure block population: ~{df['has_exposure'].sum():,}
  - With category detected:    {(df['exposure_categories'].apply(len) >= 1).sum():,}

DUPLICATION STATUS
------------------
Potential duplicate groups: {len(duplicate_titles):,}
Records in duplicate groups: {df[df['title_normalized'].isin(duplicate_titles.index)].shape[0]:,}
Identified dataset series: {df['series'].notna().sum():,}

KEY INSIGHTS
------------
1. Hazard data is well-represented ({df['has_hazard'].mean()*100:.0f}% of corpus)
2. Return period extraction is feasible for {df['has_return_period'].sum():,} datasets
3. Major risk data publishers: UNDRR, GEM, OCHA, WFP, UNOSAT
4. Significant series duplication exists (GAR15 Exposure: {(df['series'] == 'GAR15 Exposure').sum():,} country variants)

RECOMMENDED NEXT STEPS
----------------------
1. Build Signal Dictionary with confident mappings to RDLS codelists
2. Develop Hazard block extractor (highest coverage potential)
3. Handle series deduplication to avoid redundant processing
4. Develop Exposure block extractor
""")

print(f"\nAnalysis completed: {datetime.now().isoformat()}")
print(f"Output files saved to: {OUTPUT_DIR}")

ANALYSIS SUMMARY REPORT

CORPUS OVERVIEW
---------------
Total HDX datasets analyzed: 26,246
Unique organizations: 358

HEVL SIGNAL COVERAGE
--------------------
Datasets with Hazard signals:        2,517 (9.6%)
Datasets with Exposure signals:     18,673 (71.1%)
Datasets with Vulnerability signals:   205 (0.8%)
Datasets with Loss signals:          2,407 (9.2%)
Datasets with Return Period info:       56 (0.2%)

INFERENCE POTENTIAL
-------------------
Datasets suitable for Hazard block population: ~2,517
  - With specific hazard type: 2,517
  - With process type detail:  1,320
  - With analysis type:        7,095
  - With return periods:       56

Datasets suitable for Exposure block population: ~18,673
  - With category detected:    18,673

DUPLICATION STATUS
------------------
Potential duplicate groups: 895
Records in duplicate groups: 2,947
Identified dataset series: 1,360

KEY INSIGHTS
------------
1. Hazard data is well-represented (10% of corpus)
2. Return period extraction is fea

## End of Code