# Notebook 13: RDLS Validation and Quality Assurance

**Purpose**: Validate integrated RDLS records (from Notebook 12) against the v0.3 JSON schema, score composite confidence, and produce tiered output packages.

**Process**:
1. Load all integrated RDLS records from `rdls/integrated/`
2. Validate against RDLS v0.3 JSON schema
3. Check HEVL block completeness and structure
4. Compute composite confidence score per record
5. Sort records into 3 confidence tiers: `high/`, `medium/`, `low/` (schema-invalid records go to the same three tiers under `invalid/`)
6. Generate manifests, reports, and ZIP archive

**Confidence Tiers**:
- **`high/`** (score >= 0.8): Production-ready records. Have HEVL JSON blocks with high extraction confidence and pass schema validation.
- **`medium/`** (0.5 <= score < 0.8): Review-needed records. May have partial HEVL blocks, lower confidence, or minor validation issues.
- **`low/`** (score < 0.5): Curation-needed records. Flags-only (no HEVL JSON blocks), low confidence, or validation failures.
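
The thresholds above can be sketched as a small mapping function. This is a minimal illustration; the name `assign_tier` is hypothetical and is not defined elsewhere in this notebook.

```python
# Thresholds taken from the tier definitions above.
THRESHOLD_HIGH = 0.8
THRESHOLD_MEDIUM = 0.5


def assign_tier(score: float) -> str:
    """Map a composite confidence score in [0, 1] to a tier name."""
    if score >= THRESHOLD_HIGH:
        return "high"
    if score >= THRESHOLD_MEDIUM:
        return "medium"
    return "low"


# Boundary values belong to the upper tier because both comparisons use >=.
for s in (0.85, 0.8, 0.79, 0.5, 0.49):
    print(f"{s:.2f} -> {assign_tier(s)}")
```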

**Author**: Benny Istanto/Risk Data Librarian/GFDRR  
**Version**: 2026.2

---

## 1. Setup

In [1]:
"""
1.1 Import Dependencies
"""

import json
import re
import shutil
import zipfile
from pathlib import Path
from datetime import datetime
from typing import Dict, List, Optional, Any, Tuple
from collections import Counter

import pandas as pd
import numpy as np

try:
    import jsonschema
    from jsonschema import Draft202012Validator, ValidationError
    HAS_JSONSCHEMA = True
except ImportError:
    try:
        from jsonschema import Draft7Validator as Draft202012Validator, ValidationError
        HAS_JSONSCHEMA = True
        print("Note: Draft202012Validator not available, falling back to Draft7Validator")
    except ImportError:
        HAS_JSONSCHEMA = False
        print("Warning: jsonschema not installed. Install with: pip install jsonschema")

try:
    from tqdm.notebook import tqdm
    HAS_TQDM = True
except ImportError:
    HAS_TQDM = False

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

print(f"Notebook started: {datetime.now().isoformat()}")
print(f"JSON Schema validation: {'Available' if HAS_JSONSCHEMA else 'Not available'}")

Notebook started: 2026-02-11T18:29:45.906586
JSON Schema validation: Available


In [2]:
"""
1.2 Configure Paths
"""

NOTEBOOK_DIR = Path.cwd()
BASE_DIR = NOTEBOOK_DIR.parent if NOTEBOOK_DIR.name == 'notebook' else NOTEBOOK_DIR

# RDLS schema
RDLS_SCHEMA_PATH = BASE_DIR / 'hdx_dataset_metadata_dump' / 'rdls' / 'schema' / 'rdls_schema_v0.3.json'

# Input: integrated records from NB 12 only
INTEGRATED_DIR = BASE_DIR / 'hdx_dataset_metadata_dump' / 'rdls' / 'integrated'

# Output: reports and tiered dist
REPORTS_DIR = BASE_DIR / 'hdx_dataset_metadata_dump' / 'rdls' / 'reports'
DIST_DIR = BASE_DIR / 'hdx_dataset_metadata_dump' / 'rdls' / 'dist'

# Confidence tier folders (under dist/) -- schema-VALID records only
TIER_HIGH_DIR = DIST_DIR / 'high'
TIER_MEDIUM_DIR = DIST_DIR / 'medium'
TIER_LOW_DIR = DIST_DIR / 'low'

# Invalid tier folders (schema-INVALID records, sub-tiered by confidence)
INVALID_DIR = DIST_DIR / 'invalid'
INVALID_HIGH_DIR = INVALID_DIR / 'high'
INVALID_MEDIUM_DIR = INVALID_DIR / 'medium'
INVALID_LOW_DIR = INVALID_DIR / 'low'

# Confidence thresholds
THRESHOLD_HIGH = 0.8
THRESHOLD_MEDIUM = 0.5

# Remove old flat 'records/' folder and stale files
stale_records_dir = DIST_DIR / 'records'
if stale_records_dir.exists():
    shutil.rmtree(stale_records_dir)
    print(f"Removed stale: {stale_records_dir}")

for stale_file in ['rdls_index.csv', 'rdls_index.jsonl', 'rdls_metadata_bundle.zip']:
    stale_path = DIST_DIR / stale_file
    if stale_path.exists():
        stale_path.unlink()
        print(f"Removed stale: {stale_path}")

# Remove old invalid/ tree from previous runs
if INVALID_DIR.exists():
    shutil.rmtree(INVALID_DIR)
    print(f"Removed stale: {INVALID_DIR}")

# Also remove old ZIP archives from parent dir
for old_zip in (DIST_DIR.parent).glob('rdls_hdx_package_*.zip'):
    old_zip.unlink()
    print(f"Removed stale: {old_zip}")

# Create fresh directories
REPORTS_DIR.mkdir(parents=True, exist_ok=True)
DIST_DIR.mkdir(parents=True, exist_ok=True)
TIER_HIGH_DIR.mkdir(parents=True, exist_ok=True)
TIER_MEDIUM_DIR.mkdir(parents=True, exist_ok=True)
TIER_LOW_DIR.mkdir(parents=True, exist_ok=True)
INVALID_HIGH_DIR.mkdir(parents=True, exist_ok=True)
INVALID_MEDIUM_DIR.mkdir(parents=True, exist_ok=True)
INVALID_LOW_DIR.mkdir(parents=True, exist_ok=True)

print(f"\nBase: {BASE_DIR}")
print(f"Integrated: {INTEGRATED_DIR}")
print(f"Reports: {REPORTS_DIR}")
print(f"Dist: {DIST_DIR}")
print(f"  high/            (valid, score >= {THRESHOLD_HIGH})")
print(f"  medium/          (valid, {THRESHOLD_MEDIUM} <= score < {THRESHOLD_HIGH})")
print(f"  low/             (valid, score < {THRESHOLD_MEDIUM})")
print(f"  invalid/high/    (invalid, score >= {THRESHOLD_HIGH})")
print(f"  invalid/medium/  (invalid, {THRESHOLD_MEDIUM} <= score < {THRESHOLD_HIGH})")
print(f"  invalid/low/     (invalid, score < {THRESHOLD_MEDIUM})")


# Cleanup behaviour for the helper in cell 1.3: "replace", "prompt", "skip", or "abort"
CLEANUP_MODE = "replace"

Removed stale: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump/rdls/dist/invalid
Removed stale: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump/rdls/rdls_hdx_package_20260211.zip

Base: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler
Integrated: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump/rdls/integrated
Reports: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump/rdls/reports
Dist: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump/rdls/dist
  high/            (valid, score >= 0.8)
  medium/          (valid, 0.5 <= score < 0.8)
  low/             (valid, score < 0.5)
  invalid/high/    (invalid, score >= 0.8)
  invalid/medium/  (invalid, 0.5 <= score < 0.8)
  invalid/low/     (invalid, score < 0.5)
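
The folder layout printed above implies a two-step routing rule: first by schema validity, then by confidence tier. A minimal sketch of that rule follows. The helper name `destination_dir` and the `dist_dir` argument are assumptions for illustration, not the notebook's actual packaging code.

```python
from pathlib import Path

VALID_TIERS = {"high", "medium", "low"}


def destination_dir(dist_dir: Path, is_valid: bool, tier: str) -> Path:
    """Return the output folder for a record (hypothetical helper).

    Schema-valid records go to dist/<tier>/, schema-invalid records
    go to dist/invalid/<tier>/.
    """
    if tier not in VALID_TIERS:
        raise ValueError(f"Unknown tier: {tier!r}")
    return dist_dir / tier if is_valid else dist_dir / "invalid" / tier


print(destination_dir(Path("dist"), True, "high"))
print(destination_dir(Path("dist"), False, "medium"))
```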


In [3]:
"""
1.3 Clean Previous Outputs

Remove stale output files from previous runs (controlled by CLEANUP_MODE).
"""
import shutil

def clean_previous_outputs(output_dir, patterns, label, mode="replace"):
    """
    Remove previous output files matching the given glob patterns.

    Parameters
    ----------
    output_dir : Path
        Directory containing old outputs.
    patterns : list[str]
        Glob patterns to match.
    label : str
        Human-readable label for log messages.
    mode : str
        One of: "replace" (auto-delete), "prompt" (ask user),
        "skip" (keep old files), "abort" (error if stale files exist).

    Returns
    -------
    dict  with keys 'deleted' (int) and 'skipped' (bool)
    """
    result = {'deleted': 0, 'skipped': False}
    targets = {}
    for pattern in patterns:
        matches = sorted(output_dir.glob(pattern))
        if matches:
            targets[pattern] = matches
    total = sum(len(files) for files in targets.values())

    if total == 0:
        print(f'Output cleanup [{label}]: Directory is clean.')
        return result

    summary = []
    for pattern, files in targets.items():
        summary.append(f'  {pattern:40s}: {len(files):,} files')

    if mode == 'skip':
        print(f'Output cleanup [{label}]: SKIPPED ({total:,} existing files kept)')
        result['skipped'] = True
        return result

    if mode == 'abort':
        raise RuntimeError(
            f'Output cleanup [{label}]: ABORT -- {total:,} stale files found. '
            f'Delete manually or change CLEANUP_MODE.'
        )

    if mode == 'prompt':
        print(f'Output cleanup [{label}]: Found {total:,} existing output files:')
        for line in summary:
            print(line)
        choice = input('Choose [R]eplace / [S]kip / [A]bort: ').strip().lower()
        if choice in ('s', 'skip'):
            print('  Skipped.')
            result['skipped'] = True
            return result
        elif choice in ('a', 'abort'):
            raise RuntimeError('User chose to abort.')
        elif choice not in ('r', 'replace', ''):
            print(f'  Unknown choice, defaulting to Replace.')

    # Mode: replace (default)
    print(f'Output cleanup [{label}]:')
    for line in summary:
        print(line)
    for pattern, files in targets.items():
        for f in files:
            try:
                f.unlink()
                result['deleted'] += 1
            except Exception as e:
                print(f'  WARNING: Could not delete {f.name}: {e}')
    deleted_count = result['deleted']
    print(f'  Cleaned {deleted_count:,} files. Ready for fresh output.')
    print()
    return result

# ── Run cleanup ────────────────────────────────────────────────────────
# Clean tier directories (valid + invalid)
for tier_dir in [TIER_HIGH_DIR, TIER_MEDIUM_DIR, TIER_LOW_DIR,
                 INVALID_HIGH_DIR, INVALID_MEDIUM_DIR, INVALID_LOW_DIR]:
    if tier_dir.exists():
        clean_previous_outputs(
            tier_dir,
            patterns=["*.json", "manifest.csv"],
            label=f"NB 13 {tier_dir.relative_to(DIST_DIR)}",
            mode=CLEANUP_MODE,
        )

# Clean reports
clean_previous_outputs(
    REPORTS_DIR,
    patterns=["*.csv", "*.md", "*.json"],
    label="NB 13 Reports",
    mode=CLEANUP_MODE,
)

# Clean dist-level files
clean_previous_outputs(
    DIST_DIR,
    patterns=["master_manifest.csv", "README.md"],
    label="NB 13 Dist Files",
    mode=CLEANUP_MODE,
)

# Clean ZIP archives
clean_previous_outputs(
    DIST_DIR.parent,
    patterns=["rdls_hdx_package_*.zip"],
    label="NB 13 ZIP Archives",
    mode=CLEANUP_MODE,
)

# Remove stale invalid/ tree and recreate
if INVALID_DIR.exists() and CLEANUP_MODE == "replace":
    shutil.rmtree(INVALID_DIR)
    print(f"Removed stale: {INVALID_DIR}")

# Ensure all output directories exist
for d in [TIER_HIGH_DIR, TIER_MEDIUM_DIR, TIER_LOW_DIR,
          INVALID_HIGH_DIR, INVALID_MEDIUM_DIR, INVALID_LOW_DIR,
          REPORTS_DIR]:
    d.mkdir(parents=True, exist_ok=True)

print("Output directories ready.")


Output cleanup [NB 13 high]:
  *.json                                  : 9,797 files
  manifest.csv                            : 1 files
  Cleaned 9,798 files. Ready for fresh output.

Output cleanup [NB 13 medium]:
  manifest.csv                            : 1 files
  Cleaned 1 files. Ready for fresh output.

Output cleanup [NB 13 low]:
  manifest.csv                            : 1 files
  Cleaned 1 files. Ready for fresh output.

Output cleanup [NB 13 invalid/high]: Directory is clean.
Output cleanup [NB 13 invalid/medium]: Directory is clean.
Output cleanup [NB 13 invalid/low]: Directory is clean.
Output cleanup [NB 13 Reports]:
  *.csv                                   : 6 files
  *.md                                    : 1 files
  Cleaned 7 files. Ready for fresh output.

Output cleanup [NB 13 Dist Files]:
  master_manifest.csv                     : 1 files
  README.md                               : 1 files
  Cleaned 2 files. Ready for fresh output.

Output cleanup [NB 13 ZIP Arc

In [4]:
"""
1.4 Load RDLS Schema
"""

with open(RDLS_SCHEMA_PATH, 'r', encoding='utf-8') as f:
    RDLS_SCHEMA = json.load(f)

print(f"RDLS Schema loaded: {RDLS_SCHEMA.get('$id', 'unknown')}")
print(f"Schema draft: {RDLS_SCHEMA.get('$schema', 'unknown')}")
print(f"Required top-level fields: {RDLS_SCHEMA.get('required', [])}")

RDLS Schema loaded: https://docs.riskdatalibrary.org/en/0__3__0/rdls_schema.json
Schema draft: https://json-schema.org/draft/2020-12/schema
Required top-level fields: ['id', 'title', 'risk_data_type', 'attributions', 'spatial', 'license', 'resources']


## 2. Load RDLS Records

In [5]:
"""
2.1 Find and Load Integrated RDLS Records

Only load from integrated/ directory (NB 12 output).
Do NOT mix with extracted/ samples to avoid duplicates.
"""

def load_rdls_records(directory: Path) -> List[Dict[str, Any]]:
    """Load all RDLS JSON files from directory."""
    records = []
    
    for filepath in sorted(directory.glob('rdls_*.json')):
        entry = {
            'filepath': filepath,
            'filename': filepath.name,
        }
        try:
            with open(filepath, 'r', encoding='utf-8') as f:
                entry['data'] = json.load(f)
        except Exception as e:
            entry['load_error'] = str(e)
        
        records.append(entry)
    
    return records

rdls_records = load_rdls_records(INTEGRATED_DIR)
loaded = sum(1 for r in rdls_records if 'data' in r)
errors = sum(1 for r in rdls_records if 'load_error' in r)

print(f"Found {len(rdls_records)} RDLS files in {INTEGRATED_DIR.name}/")
print(f"  Loaded: {loaded}")
print(f"  Load errors: {errors}")

Found 12577 RDLS files in integrated/
  Loaded: 12577
  Load errors: 0
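
Since the cell above loads from `integrated/` only in order to avoid duplicates, a quick check on dataset IDs can confirm that none slipped through. The sketch below assumes the entry structure produced by `load_rdls_records` (a `data` key holding a `{"datasets": [...]}` wrapper). The helper name `find_duplicate_ids` is hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List


def find_duplicate_ids(records: List[Dict[str, Any]]) -> Dict[str, int]:
    """Return {dataset_id: count} for IDs that occur more than once."""
    ids = []
    for entry in records:
        data = entry.get("data") or {}
        datasets = data.get("datasets") or []
        # Only the first dataset of each file is inspected, as in the rest of the notebook.
        if datasets and isinstance(datasets[0], dict):
            ids.append(datasets[0].get("id", ""))
    return {i: n for i, n in Counter(ids).items() if i and n > 1}


# Toy records for illustration only.
toy_records = [
    {"filename": "a.json", "data": {"datasets": [{"id": "rdls_a"}]}},
    {"filename": "b.json", "data": {"datasets": [{"id": "rdls_b"}]}},
    {"filename": "a_copy.json", "data": {"datasets": [{"id": "rdls_a"}]}},
    {"filename": "broken.json", "load_error": "bad JSON"},
]
print(find_duplicate_ids(toy_records))
```

In the notebook, the same call would be `find_duplicate_ids(rdls_records)`.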


In [6]:
"""
2.2 Quick Data Preview
"""

if rdls_records and 'data' in rdls_records[0]:
    sample = rdls_records[0]['data']['datasets'][0]
    print(f"Sample record: {rdls_records[0]['filename']}")
    print(f"  id: {sample.get('id', '')}")
    print(f"  title: {(sample.get('title', '') or '')[:80]}")
    print(f"  risk_data_type: {sample.get('risk_data_type', [])}")
    print(f"  Top-level keys: {list(sample.keys())}")

Sample record: rdls_exp-hdx_3is_col_calculo_de_personas_en_necesidad_pin_del_cluster.json
  id: rdls_exp-hdx_3is_col_calculo_de_personas_en_necesidad_pin_del_cluster
  title: Colombia: Cálculo de Personas en Necesidad (PiN) del Clúster de Agua, Saneamient
  risk_data_type: ['exposure']
  Top-level keys: ['id', 'title', 'description', 'risk_data_type', 'details', 'spatial', 'license', 'attributions', 'resources', 'exposure', 'links']


## 3. Schema Validation

In [7]:
"""
3.1 Schema Validator

The RDLS v0.3 schema root validates a single Dataset object.
Our files wrap datasets in {"datasets": [...]}, so we validate datasets[0].

Schema uses draft/2020-12; we try Draft202012Validator first,
falling back to Draft7Validator if the newer one isn't available.
"""

def validate_rdls_record(dataset: Dict[str, Any], schema: Dict[str, Any]) -> Tuple[bool, List[str]]:
    """
    Validate a single RDLS dataset object against the schema.
    
    Returns (is_valid, error_messages).
    """
    if not HAS_JSONSCHEMA:
        return True, ['Schema validation skipped (jsonschema not installed)']
    
    errors = []
    try:
        validator = Draft202012Validator(schema)
        for error in validator.iter_errors(dataset):
            path = '/'.join(str(p) for p in error.absolute_path)
            msg = f"{path}: {error.message}" if path else error.message
            errors.append(msg)
    except Exception as e:
        errors.append(f"Validation exception: {e}")
    
    return len(errors) == 0, errors

print("Schema validator defined.")

Schema validator defined.
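
Because the schema root describes a single dataset, only `datasets[0]` is validated, so any further datasets in a wrapper go unchecked. The sketch below unwraps a record and also reports how many datasets were left out. The helper name `extract_dataset` is hypothetical and assumes the wrapper structure described in the docstring above.

```python
from typing import Any, Dict, Tuple


def extract_dataset(data: Dict[str, Any]) -> Tuple[Dict[str, Any], int]:
    """Return (dataset, ignored_count) for a loaded RDLS file.

    If the file uses the {"datasets": [...]} wrapper, the first dataset is
    returned along with the number of additional datasets that would not be
    validated. Otherwise the object itself is treated as the dataset.
    """
    datasets = data.get("datasets")
    if isinstance(datasets, list) and datasets:
        return datasets[0], len(datasets) - 1
    return data, 0


wrapped = {"datasets": [{"id": "rdls_a"}, {"id": "rdls_b"}]}
bare = {"id": "rdls_c"}
print(extract_dataset(wrapped))
print(extract_dataset(bare))
```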


In [8]:
"""
3.2 Run Schema Validation
"""

validation_results = []

iterator = tqdm(rdls_records, desc="Validating") if HAS_TQDM else rdls_records

for record in iterator:
    result = {
        'filename': record['filename'],
        'filepath': str(record['filepath']),
    }
    
    if 'load_error' in record:
        result['status'] = 'load_error'
        result['errors'] = [record['load_error']]
        result['error_count'] = 1
    else:
        data = record['data']
        # Extract dataset object from wrapper
        if 'datasets' in data and data['datasets']:
            dataset = data['datasets'][0]
        else:
            dataset = data
        
        is_valid, errors = validate_rdls_record(dataset, RDLS_SCHEMA)
        result['status'] = 'valid' if is_valid else 'invalid'
        result['errors'] = errors
        result['error_count'] = len(errors)
        result['id'] = dataset.get('id', '')
        result['title'] = (dataset.get('title', '') or '')[:80]
        result['risk_data_type'] = '|'.join(dataset.get('risk_data_type', []))
    
    validation_results.append(result)

df_validation = pd.DataFrame(validation_results)

valid_count = (df_validation['status'] == 'valid').sum()
invalid_count = (df_validation['status'] == 'invalid').sum()
total_count = len(df_validation)

print(f"\n{'='*60}")
print("SCHEMA VALIDATION RESULTS")
print(f"{'='*60}")
print(f"Total records:  {total_count}")
print(f"Valid:          {valid_count} ({valid_count/total_count*100:.1f}%)" if total_count else "")
print(f"Invalid:        {invalid_count} ({invalid_count/total_count*100:.1f}%)" if total_count else "")


SCHEMA VALIDATION RESULTS
Total records:  12577
Valid:          9797 (77.9%)
Invalid:        2780 (22.1%)


In [9]:
"""
3.3 Analyze Validation Errors
"""

invalid_records = df_validation[df_validation['status'] == 'invalid']

if len(invalid_records) > 0:
    # Collect all errors
    all_errors = []
    for errors in invalid_records['errors']:
        if isinstance(errors, list):
            all_errors.extend(errors)
    
    print(f"\n{'='*60}")
    print(f"VALIDATION ERRORS ({len(invalid_records)} records, {len(all_errors)} total errors)")
    print(f"{'='*60}")
    
    # Categorize errors
    missing_required = [e for e in all_errors if 'is a required property' in e]
    invalid_enum = [e for e in all_errors if 'is not one of' in e or 'enum' in e.lower()]
    invalid_type = [e for e in all_errors if 'is not of type' in e]
    invalid_anyof = [e for e in all_errors if 'anyOf' in e or 'is not valid under any' in e]
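    # Use a set so the membership test does not rebuild a list for every error
    categorized = set(missing_required + invalid_enum + invalid_type + invalid_anyof)
    other_errors = [e for e in all_errors if e not in categorized]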
    
    print(f"\nError Categories:")
    print(f"  Missing required fields:  {len(missing_required)}")
    print(f"  Invalid enum values:      {len(invalid_enum)}")
    print(f"  Type errors:              {len(invalid_type)}")
    print(f"  AnyOf constraint failures:{len(invalid_anyof)}")
    print(f"  Other errors:             {len(other_errors)}")
    
    # Top 20 unique errors
    error_counts = Counter(all_errors)
    print(f"\nTop 20 errors:")
    for err, count in error_counts.most_common(20):
        display = err[:120] + '...' if len(err) > 120 else err
        print(f"  [{count:>4}] {display}")
    
    # Extract unique missing field names
    if missing_required:
        unique_missing = set()
        for e in missing_required:
            match = re.search(r"'([^']+)' is a required property", e)
            if match:
                unique_missing.add(match.group(1))
        print(f"\n  Missing required fields: {sorted(unique_missing)}")
else:
    print(f"\n{'='*60}")
    print("ALL RECORDS PASSED SCHEMA VALIDATION!")
    print(f"{'='*60}")


VALIDATION ERRORS (2780 records, 3184 total errors)

Error Categories:
  Missing required fields:  0
  Invalid enum values:      6
  Type errors:              0
  AnyOf constraint failures:0
  Other errors:             3178

Top 20 errors:
  [2774] hazard/event_sets/0/events/0/occurrence: {} should be non-empty
  [ 366] hazard/event_sets/1/events/0/occurrence: {} should be non-empty
  [  34] hazard/event_sets/2/events/0/occurrence: {} should be non-empty
  [   4] spatial/countries/0: 'XKX' is not one of ['AFG', 'ALB', 'DZA', 'ASM', 'AND', 'AGO', 'AIA', 'ATA', 'ATG', 'ARG', 'ARM', '...
  [   3] hazard/event_sets/3/events/0/occurrence: {} should be non-empty
  [   1] spatial/countries/203: 'XKX' is not one of ['AFG', 'ALB', 'DZA', 'ASM', 'AND', 'AGO', 'AIA', 'ATA', 'ATG', 'ARG', 'ARM',...
  [   1] spatial/countries/137: 'XKX' is not one of ['AFG', 'ALB', 'DZA', 'ASM', 'AND', 'AGO', 'AIA', 'ATA', 'ATG', 'ARG', 'ARM',...
  [   1] hazard/event_sets/4/events/0/occurrence: {} should be non-e
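
Almost all errors listed above point at an empty `occurrence` object inside hazard events. A minimal sketch for locating these directly in a dataset, without running the schema validator, follows. The helper name `find_empty_occurrences` is hypothetical, and the nesting (`hazard` > `event_sets` > `events` > `occurrence`) is inferred from the error paths shown in the output.

```python
from typing import Any, Dict, List


def find_empty_occurrences(dataset: Dict[str, Any]) -> List[str]:
    """Return slash-separated paths of events whose 'occurrence' is present but empty."""
    paths = []
    hazard = dataset.get("hazard") or {}
    for i, event_set in enumerate(hazard.get("event_sets") or []):
        for j, event in enumerate(event_set.get("events") or []):
            # Flags {} (and None); events without the key are not reported.
            if "occurrence" in event and not event["occurrence"]:
                paths.append(f"hazard/event_sets/{i}/events/{j}/occurrence")
    return paths


# Toy dataset for illustration only.
toy_dataset = {
    "hazard": {
        "event_sets": [
            {"events": [{"occurrence": {}}]},
            {"events": [{"occurrence": {"empirical": {}}}, {"occurrence": {}}]},
        ]
    }
}
print(find_empty_occurrences(toy_dataset))
```

Since these errors originate upstream, the likely fix belongs in the notebook that builds the hazard blocks (either populate `occurrence` or omit it), not in this validation step.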

## 4. HEVL Completeness Check

In [10]:
"""
4.1 Check HEVL Block Completeness
"""

def check_hevl_completeness(data: Dict[str, Any]) -> Dict[str, Any]:
    """Check presence and structure of HEVL blocks in a record."""
    if 'datasets' not in data or not data['datasets']:
        return {'error': 'No datasets array'}
    
    ds = data['datasets'][0]
    result = {
        'has_hazard': 'hazard' in ds,
        'has_exposure': 'exposure' in ds,
        'has_vulnerability': 'vulnerability' in ds,
        'has_loss': 'loss' in ds,
    }
    
    # Count how many HEVL components have actual blocks (not just risk_data_type flags)
    result['hevl_block_count'] = sum([
        result['has_hazard'],
        result['has_exposure'],
        result['has_vulnerability'],
        result['has_loss'],
    ])
    
    # Count declared risk_data_type components
    risk_types = ds.get('risk_data_type', [])
    result['declared_component_count'] = len(risk_types) if isinstance(risk_types, list) else 0
    
    # Hazard detail
    if result['has_hazard']:
        hazard = ds['hazard']
        event_sets = hazard.get('event_sets', [])
        result['hazard_event_sets'] = len(event_sets)
        if event_sets:
            es = event_sets[0]
            result['hazard_analysis_type'] = es.get('analysis_type', '')
            result['hazard_hazards_count'] = len(es.get('hazards', []))
            result['hazard_events_count'] = len(es.get('events', []))
            # Check if events have return periods (probabilistic)
            result['hazard_has_return_periods'] = any(
                ev.get('occurrence', {}).get('probabilistic', {}).get('return_period')
                for ev in es.get('events', [])
            )
        else:
            result['hazard_event_sets'] = 0
    
    # Exposure detail
    if result['has_exposure']:
        exposure = ds['exposure']
        if isinstance(exposure, list):
            result['exposure_items'] = len(exposure)
            categories = [item.get('category', '') for item in exposure]
            result['exposure_categories'] = '|'.join(categories)
            # Check if metrics are populated
            result['exposure_has_metrics'] = any(
                item.get('metrics') for item in exposure
            )
        else:
            result['exposure_items'] = 0
            result['exposure_has_metrics'] = False
    
    # Vulnerability detail
    if result['has_vulnerability']:
        vuln = ds['vulnerability']
        result['vuln_has_functions'] = bool(vuln.get('functions'))
        socio = vuln.get('socio_economic', [])
        result['vuln_socio_count'] = len(socio) if isinstance(socio, list) else 0
    
    # Loss detail
    if result['has_loss']:
        loss = ds['loss']
        losses = loss.get('losses', [])
        result['loss_items'] = len(losses)
        if losses:
            # `or ''` guards against an explicit null hazard_type, which would break sorted()
            hazard_types = {item.get('hazard_type') or '' for item in losses}
            result['loss_hazard_types'] = '|'.join(sorted(hazard_types - {''}))
    
    return result

completeness_results = []
for record in rdls_records:
    if 'data' in record:
        res = check_hevl_completeness(record['data'])
        res['filename'] = record['filename']
        completeness_results.append(res)

df_completeness = pd.DataFrame(completeness_results)

print(f"{'='*60}")
print("HEVL BLOCK COMPLETENESS")
print(f"{'='*60}")
print(f"Records analyzed: {len(df_completeness)}")
print(f"\nBlock presence:")
for col in ['has_hazard', 'has_exposure', 'has_vulnerability', 'has_loss']:
    if col in df_completeness:
        count = df_completeness[col].sum()
        pct = count / len(df_completeness) * 100
        print(f"  {col:25s}: {count:>6} ({pct:>5.1f}%)")

# Block count distribution
if 'hevl_block_count' in df_completeness:
    print(f"\nHEVL blocks per record:")
    for n, count in df_completeness['hevl_block_count'].value_counts().sort_index().items():
        pct = count / len(df_completeness) * 100
        print(f"  {n} blocks: {count:>6} ({pct:>5.1f}%)")

HEVL BLOCK COMPLETENESS
Records analyzed: 12577

Block presence:
  has_hazard               :   2788 ( 22.2%)
  has_exposure             :  11517 ( 91.6%)
  has_vulnerability        :   3429 ( 27.3%)
  has_loss                 :    703 (  5.6%)

HEVL blocks per record:
  1 blocks:   7363 ( 58.5%)
  2 blocks:   4599 ( 36.6%)
  3 blocks:    584 (  4.6%)
  4 blocks:     31 (  0.2%)
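
The block counts above only show what is present. Comparing them against the declared `risk_data_type` values shows which components are flags only. A minimal sketch, using the same key-presence test as `check_hevl_completeness`, follows. The helper name `flags_without_blocks` is hypothetical.

```python
from typing import Any, Dict, List

HEVL_COMPONENTS = ("hazard", "exposure", "vulnerability", "loss")


def flags_without_blocks(dataset: Dict[str, Any]) -> List[str]:
    """Return declared risk_data_type components that have no matching HEVL block."""
    declared = dataset.get("risk_data_type") or []
    return [c for c in declared if c in HEVL_COMPONENTS and c not in dataset]


# Toy dataset: declares hazard and exposure but only carries an exposure block.
toy = {"risk_data_type": ["hazard", "exposure"], "exposure": [{"category": "population"}]}
print(flags_without_blocks(toy))
```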


In [11]:
"""
4.2 Hazard Block Quality
"""

hazard_records = df_completeness[df_completeness.get('has_hazard', pd.Series(dtype=bool)) == True]

if len(hazard_records) > 0:
    print(f"\n{'='*60}")
    print(f"HAZARD BLOCK QUALITY ({len(hazard_records)} records)")
    print(f"{'='*60}")
    
    if 'hazard_analysis_type' in hazard_records:
        print(f"\nAnalysis types:")
        print(hazard_records['hazard_analysis_type'].value_counts().to_string())
    if 'hazard_hazards_count' in hazard_records:
        print(f"\nAvg hazards per event_set: {hazard_records['hazard_hazards_count'].mean():.1f}")
    if 'hazard_events_count' in hazard_records:
        with_events = (hazard_records['hazard_events_count'] > 0).sum()
        print(f"Records with events: {with_events}/{len(hazard_records)}")
else:
    print("\nNo hazard blocks found.")


HAZARD BLOCK QUALITY (2788 records)

Analysis types:
hazard_analysis_type
empirical        2744
probabilistic      22
deterministic      22

Avg hazards per event_set: 1.0
Records with events: 2788/2788


## 5. Composite Confidence Scoring

In [12]:
"""
5.1 Compute Composite Confidence Score

The composite score combines:
1. HEVL block presence (do we have actual structured blocks, not just flags?)
2. Block richness (how detailed are the blocks?)
3. Schema validity (does the record pass validation?)
4. General metadata quality (description, spatial, attributions populated?)

Scoring weights:
  - HEVL block coverage:    40%  (most important - actual data content)
  - Block richness:         25%  (depth of HEVL detail)
  - Schema validity:        20%  (structural compliance)
  - Metadata completeness:  15%  (general metadata quality)
"""

def compute_composite_confidence(
    record_data: Dict[str, Any],
    completeness: Dict[str, Any],
    validation_status: str,
    validation_error_count: int,
) -> Tuple[float, Dict[str, float]]:
    """
    Compute composite confidence score for a single RDLS record.
    
    Returns (composite_score, component_scores_dict)
    """
    # `or [{}]` also covers an empty datasets array, which would otherwise raise IndexError
    ds = (record_data.get('datasets') or [{}])[0]
    risk_types = ds.get('risk_data_type', [])
    declared_count = len(risk_types) if isinstance(risk_types, list) else 0
    
    # --- 1. HEVL Block Coverage (0-1) ---
    # Ratio of actual HEVL blocks present vs. declared risk_data_types
    block_count = completeness.get('hevl_block_count', 0)
    if declared_count > 0:
        hevl_coverage = min(block_count / declared_count, 1.0)
    else:
        hevl_coverage = 0.0
    
    # --- 2. Block Richness (0-1) ---
    # How detailed/populated are the HEVL blocks?
    richness_signals = []
    
    if completeness.get('has_hazard'):
        h_score = 0.5  # base: block exists
        if completeness.get('hazard_event_sets', 0) > 0:
            h_score += 0.2
        if completeness.get('hazard_events_count', 0) > 0:
            h_score += 0.15
        if completeness.get('hazard_has_return_periods', False):
            h_score += 0.15
        richness_signals.append(min(h_score, 1.0))
    
    if completeness.get('has_exposure'):
        e_score = 0.5
        if completeness.get('exposure_items', 0) > 0:
            e_score += 0.25
        if completeness.get('exposure_has_metrics', False):
            e_score += 0.25
        richness_signals.append(min(e_score, 1.0))
    
    if completeness.get('has_vulnerability'):
        v_score = 0.5
        if completeness.get('vuln_has_functions', False):
            v_score += 0.25
        if completeness.get('vuln_socio_count', 0) > 0:
            v_score += 0.25
        richness_signals.append(min(v_score, 1.0))
    
    if completeness.get('has_loss'):
        l_score = 0.5
        if completeness.get('loss_items', 0) > 0:
            l_score += 0.3
        if completeness.get('loss_hazard_types', ''):
            l_score += 0.2
        richness_signals.append(min(l_score, 1.0))
    
    block_richness = np.mean(richness_signals) if richness_signals else 0.0
    
    # --- 3. Schema Validity (0-1) ---
    if validation_status == 'valid':
        schema_score = 1.0
    elif validation_error_count <= 2:
        schema_score = 0.7  # Minor issues
    elif validation_error_count <= 5:
        schema_score = 0.4  # Moderate issues
    else:
        schema_score = 0.1  # Many issues
    
    # --- 4. Metadata Completeness (0-1) ---
    meta_signals = []
    # Description present and non-trivial
    desc = ds.get('description', '') or ''
    meta_signals.append(1.0 if len(desc) > 20 else 0.3)
    # Spatial countries populated
    countries = ds.get('spatial', {}).get('countries', [])
    meta_signals.append(1.0 if countries else 0.3)
    # Attributions cover all three required roles (publisher, creator, contact_point)
    attributions = ds.get('attributions', [])
    roles = set(a.get('role', '') for a in attributions)
    required_roles = {'publisher', 'creator', 'contact_point'}
    meta_signals.append(1.0 if required_roles.issubset(roles) else 0.5)
    # Resources present with download URLs
    resources = ds.get('resources', [])
    has_download = any(r.get('download_url') or r.get('access_url') for r in resources)
    meta_signals.append(1.0 if has_download else 0.3)
    
    metadata_score = np.mean(meta_signals)
    
    # --- Weighted Composite ---
    composite = (
        0.40 * hevl_coverage +
        0.25 * block_richness +
        0.20 * schema_score +
        0.15 * metadata_score
    )
    
    components = {
        'hevl_coverage': round(hevl_coverage, 3),
        'block_richness': round(block_richness, 3),
        'schema_score': round(schema_score, 3),
        'metadata_score': round(metadata_score, 3),
    }
    
    return round(composite, 3), components

print("Composite confidence scorer defined.")
print("Weights: HEVL coverage 40%, Block richness 25%, Schema 20%, Metadata 15%")

Composite confidence scorer defined.
Weights: HEVL coverage 40%, Block richness 25%, Schema 20%, Metadata 15%
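
A worked example of the weighted sum may help when reading tier results. The component values below are illustrative, not taken from a real record: full HEVL coverage, a hazard block with event sets and events but no return periods (0.5 + 0.2 + 0.15 = 0.85), a single schema error (0.7), and metadata with no download URL (mean of 1.0, 1.0, 1.0, 0.3 = 0.825).

```python
# Weights as documented in the scorer above.
WEIGHTS = {
    "hevl_coverage": 0.40,
    "block_richness": 0.25,
    "schema_score": 0.20,
    "metadata_score": 0.15,
}


def weighted_composite(components: dict) -> float:
    """Combine the four component scores using the documented weights."""
    return round(sum(WEIGHTS[k] * components[k] for k in WEIGHTS), 3)


example = {
    "hevl_coverage": 1.0,
    "block_richness": 0.85,
    "schema_score": 0.7,
    "metadata_score": 0.825,
}
score = weighted_composite(example)
# 0.40*1.0 + 0.25*0.85 + 0.20*0.7 + 0.15*0.825 = 0.87625, rounded to 0.876
print(score)

# Upper bound for a record with no HEVL blocks (coverage 0, richness 0).
flags_only_max = weighted_composite(
    {"hevl_coverage": 0.0, "block_richness": 0.0, "schema_score": 1.0, "metadata_score": 1.0}
)
print(flags_only_max)
```

Two consequences follow from the weights. A schema-invalid record with one or two errors can still score at or above 0.8, which is why an `invalid/high/` folder is needed. A record with no HEVL blocks can score at most 0.20 + 0.15 = 0.35, so it always lands in the low tier.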


## 6. Score All Records and Assign Tiers

In [13]:
"""
6.1 Compute Scores and Assign Confidence Tiers
"""

# Build lookup dicts for validation and completeness by filename
def _summarize_errors(errors):
    """Categorize validation errors into a concise summary string."""
    if not errors or not isinstance(errors, list) or len(errors) == 0:
        return ''
    cats = {'missing_field': [], 'invalid_enum': [], 'type_error': [],
            'anyOf': [], 'other': []}
    for e in errors:
        if 'is a required property' in e:
            # Extract field name (re is already imported in cell 1.1)
            m = re.search(r"'([^']+)' is a required property", e)
            field = m.group(1) if m else '?'
            cats['missing_field'].append(field)
        elif 'is not one of' in e or 'enum' in str(e).lower():
            cats['invalid_enum'].append(e.split(':')[0] if ':' in e else e[:50])
        elif 'is not of type' in e:
            cats['type_error'].append(e.split(':')[0] if ':' in e else e[:50])
        elif 'anyOf' in e or 'is not valid under any' in e:
            cats['anyOf'].append(e.split(':')[0] if ':' in e else e[:50])
        else:
            cats['other'].append(e[:50])
    parts = []
    if cats['missing_field']:
        fields = sorted(set(cats['missing_field']))
        parts.append(f"missing:{','.join(fields)}")
    if cats['invalid_enum']:
        paths = sorted(set(cats['invalid_enum']))[:3]
        parts.append(f"enum:{','.join(paths)}")
    if cats['type_error']:
        paths = sorted(set(cats['type_error']))[:3]
        parts.append(f"type:{','.join(paths)}")
    if cats['anyOf']:
        paths = sorted(set(cats['anyOf']))[:3]
        parts.append(f"anyOf:{','.join(paths)}")
    if cats['other']:
        parts.append(f"other:{len(cats['other'])}")
    return '; '.join(parts)

val_lookup = {}
for _, row in df_validation.iterrows():
    errors_list = row.get('errors', [])
    if not isinstance(errors_list, list):
        errors_list = []
    val_lookup[row['filename']] = {
        'status': row['status'],
        'error_count': row.get('error_count', 0),
        'error_summary': _summarize_errors(errors_list),
    }

comp_lookup = {}
for _, row in df_completeness.iterrows():
    comp_lookup[row['filename']] = row.to_dict()

# Score all records
scored_records = []

iterator = tqdm(rdls_records, desc="Scoring") if HAS_TQDM else rdls_records

for record in iterator:
    filename = record['filename']

    if 'data' not in record:
        scored_records.append({
            'filename': filename,
            'composite_score': 0.0,
            'tier': 'low',
            'hevl_coverage': 0.0,
            'block_richness': 0.0,
            'schema_score': 0.0,
            'metadata_score': 0.0,
        })
        continue

    val_info = val_lookup.get(filename, {'status': 'unknown', 'error_count': 0})
    comp_info = comp_lookup.get(filename, {})

    composite, components = compute_composite_confidence(
        record['data'],
        comp_info,
        val_info['status'],
        val_info['error_count'],
    )

    # Assign tier
    if composite >= THRESHOLD_HIGH:
        tier = 'high'
    elif composite >= THRESHOLD_MEDIUM:
        tier = 'medium'
    else:
        tier = 'low'

    # Fall back to an empty dict when 'datasets' is missing or an empty list
    ds = (record['data'].get('datasets') or [{}])[0]

    scored_records.append({
        'filename': filename,
        'id': ds.get('id', ''),
        'title': (ds.get('title', '') or '')[:100],
        'risk_data_type': '|'.join(ds.get('risk_data_type', [])),
        'composite_score': composite,
        'tier': tier,
        'hevl_coverage': components['hevl_coverage'],
        'block_richness': components['block_richness'],
        'schema_score': components['schema_score'],
        'metadata_score': components['metadata_score'],
        'has_hazard_block': comp_info.get('has_hazard', False),
        'has_exposure_block': comp_info.get('has_exposure', False),
        'has_vulnerability_block': comp_info.get('has_vulnerability', False),
        'has_loss_block': comp_info.get('has_loss', False),
        'validation_status': val_info['status'],
        'validation_errors': val_info['error_count'],
        'validation_error_summary': val_info.get('error_summary', ''),
    })

df_scored = pd.DataFrame(scored_records)

# --- Compute dist_folder: two-dimensional tier x validity ---
def compute_dist_folder(row):
    """Return the distribution folder path for a record."""
    tier = row['tier']
    if row.get('validation_status') == 'valid':
        return tier  # "high", "medium", or "low"
    else:
        return f"invalid/{tier}"  # "invalid/high", "invalid/medium", "invalid/low"

df_scored['dist_folder'] = df_scored.apply(compute_dist_folder, axis=1)

# --- Summary ---
print(f"\n{'='*60}")
print("COMPOSITE CONFIDENCE SCORING")
print(f"{'='*60}")
print(f"Total records scored: {len(df_scored):,}")
print(f"\nScore distribution:")
print(f"  Mean:   {df_scored['composite_score'].mean():.3f}")
print(f"  Median: {df_scored['composite_score'].median():.3f}")
print(f"  Min:    {df_scored['composite_score'].min():.3f}")
print(f"  Max:    {df_scored['composite_score'].max():.3f}")

print(f"\nTier distribution:")
for tier in ['high', 'medium', 'low']:
    count = (df_scored['tier'] == tier).sum()
    pct = count / len(df_scored) * 100 if len(df_scored) > 0 else 0
    print(f"  {tier:8s}: {count:>6,} ({pct:>5.1f}%)")

print(f"\nTwo-dimensional distribution (confidence x validity):")
for tier in ['high', 'medium', 'low']:
    tier_mask = df_scored['tier'] == tier
    valid_n = (tier_mask & (df_scored['validation_status'] == 'valid')).sum()
    invalid_n = (tier_mask & (df_scored['validation_status'] != 'valid')).sum()
    total_n = tier_mask.sum()
    print(f"  {tier:8s}: {valid_n:>6,} valid + {invalid_n:>6,} invalid = {total_n:>6,} total")

print(f"\nComponent score averages:")
for col in ['hevl_coverage', 'block_richness', 'schema_score', 'metadata_score']:
    print(f"  {col:20s}: {df_scored[col].mean():.3f}")




COMPOSITE CONFIDENCE SCORING
Total records scored: 12,577

Score distribution:
  Mean:   0.973
  Median: 0.974
  Min:    0.816
  Max:    1.000

Tier distribution:
  high    : 12,577 (100.0%)
  medium  :      0 (  0.0%)
  low     :      0 (  0.0%)

Two-dimensional distribution (confidence x validity):
  high    :  9,797 valid +  2,780 invalid = 12,577 total
  medium  :      0 valid +      0 invalid =      0 total
  low     :      0 valid +      0 invalid =      0 total

Component score averages:
  hevl_coverage       : 1.000
  block_richness      : 0.946
  schema_score        : 0.933
  metadata_score      : 0.997


In [14]:
"""
6.2 Distribute Records into Tier Folders and Generate Manifests

Two-dimensional routing:
  - Schema-valid records   -> dist/{tier}/
  - Schema-invalid records -> dist/invalid/{tier}/
Each folder gets its own manifest.csv.
"""

def distribute_tiered_records(
    df_scored: pd.DataFrame,
    rdls_records: List[Dict],
    folder_dirs: Dict[str, Path],
) -> Dict[str, int]:
    """
    Copy records into distribution folders based on dist_folder column
    and generate a manifest.csv in each populated folder.
    """
    # Build filename -> record lookup
    record_lookup = {r['filename']: r for r in rdls_records if 'data' in r}

    stats = {folder: 0 for folder in folder_dirs}
    folder_manifests = {folder: [] for folder in folder_dirs}

    iterator = (
        tqdm(df_scored.iterrows(), total=len(df_scored), desc="Distributing")
        if HAS_TQDM else df_scored.iterrows()
    )

    for _, row in iterator:
        dist_folder = row['dist_folder']
        filename = row['filename']

        if filename not in record_lookup:
            continue

        record = record_lookup[filename]
        target_dir = folder_dirs[dist_folder]

        # Copy JSON file
        shutil.copy2(record['filepath'], target_dir / filename)
        stats[dist_folder] += 1

        # Collect manifest entry
        folder_manifests[dist_folder].append({
            'filename': filename,
            'id': row.get('id', ''),
            'title': row.get('title', ''),
            'risk_data_type': row.get('risk_data_type', ''),
            'composite_score': row.get('composite_score', 0.0),
            'has_hazard_block': row.get('has_hazard_block', False),
            'has_exposure_block': row.get('has_exposure_block', False),
            'has_vulnerability_block': row.get('has_vulnerability_block', False),
            'has_loss_block': row.get('has_loss_block', False),
            'validation_status': row.get('validation_status', ''),
            'validation_errors': row.get('validation_errors', 0),
            'validation_error_summary': row.get('validation_error_summary', ''),
        })

    # Write manifest CSV per folder
    for folder, entries in folder_manifests.items():
        manifest_path = folder_dirs[folder] / 'manifest.csv'
        if entries:
            manifest_df = pd.DataFrame(entries)
            manifest_df = manifest_df.sort_values('composite_score', ascending=False)
            manifest_df.to_csv(manifest_path, index=False)
        else:
            # Write empty manifest with headers for consistency
            pd.DataFrame(columns=[
                'filename', 'id', 'title', 'risk_data_type', 'composite_score',
                'has_hazard_block', 'has_exposure_block', 'has_vulnerability_block',
                'has_loss_block', 'validation_status', 'validation_errors',
                'validation_error_summary',
            ]).to_csv(manifest_path, index=False)

    return stats


# Map dist_folder values to directory paths
folder_dirs = {
    'high': TIER_HIGH_DIR,
    'medium': TIER_MEDIUM_DIR,
    'low': TIER_LOW_DIR,
    'invalid/high': INVALID_HIGH_DIR,
    'invalid/medium': INVALID_MEDIUM_DIR,
    'invalid/low': INVALID_LOW_DIR,
}

# Clean existing contents in ALL 6 folders
for folder_dir in folder_dirs.values():
    for f in folder_dir.glob('*.json'):
        f.unlink()
    for f in folder_dir.glob('*.csv'):
        f.unlink()

dist_stats = distribute_tiered_records(df_scored, rdls_records, folder_dirs)

print(f"\n{'='*60}")
print("TIERED DISTRIBUTION")
print(f"{'='*60}")

total_records = len(df_scored)

print("Schema-VALID records:")
for tier in ['high', 'medium', 'low']:
    count = dist_stats[tier]
    pct = count / total_records * 100 if total_records > 0 else 0
    label = {
        'high': f'Production-ready (>= {THRESHOLD_HIGH})',
        'medium': f'Needs review ({THRESHOLD_MEDIUM} - {THRESHOLD_HIGH})',
        'low': f'Needs curation (< {THRESHOLD_MEDIUM})',
    }[tier]
    print(f"  {tier + '/':20s} {count:>6,} records  ({pct:>5.1f}%)  -- {label}")

valid_total = sum(dist_stats[t] for t in ['high', 'medium', 'low'])
print(f"  {'':20s} {valid_total:>6,} total valid")

print("\nSchema-INVALID records:")
for tier in ['high', 'medium', 'low']:
    folder = f'invalid/{tier}'
    count = dist_stats[folder]
    pct = count / total_records * 100 if total_records > 0 else 0
    print(f"  {folder + '/':20s} {count:>6,} records  ({pct:>5.1f}%)")

invalid_total = sum(dist_stats[f'invalid/{t}'] for t in ['high', 'medium', 'low'])
print(f"  {'':20s} {invalid_total:>6,} total invalid")

print(f"\nManifest CSVs written to each folder.")




TIERED DISTRIBUTION
Schema-VALID records:
  high/                 9,797 records  ( 77.9%)  -- Production-ready (>= 0.8)
  medium/                   0 records  (  0.0%)  -- Needs review (0.5 - 0.8)
  low/                      0 records  (  0.0%)  -- Needs curation (< 0.5)
                        9,797 total valid

Schema-INVALID records:
  invalid/high/         2,780 records  ( 22.1%)
  invalid/medium/           0 records  (  0.0%)
  invalid/low/              0 records  (  0.0%)
                        2,780 total invalid

Manifest CSVs written to each folder.
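
The two-dimensional routing (confidence tier x schema validity) can also be written without a row-wise `apply`. A minimal sketch on a made-up three-row frame: the column names follow `df_scored`, but the rows are invented for illustration, and a missing `validation_status` is treated as invalid, as it is in `compute_dist_folder`.

```python
import numpy as np
import pandas as pd

# Invented demo rows; only the column names mirror df_scored
demo = pd.DataFrame({
    'filename': ['a.json', 'b.json', 'c.json'],
    'tier': ['high', 'high', 'low'],
    'validation_status': ['valid', 'invalid', None],
})

# Anything that is not exactly 'valid' (including missing) counts as invalid
is_valid = demo['validation_status'].eq('valid')

# Valid records keep the bare tier name; the rest get the 'invalid/' prefix
demo['dist_folder'] = np.where(is_valid, demo['tier'], 'invalid/' + demo['tier'])

counts = demo['dist_folder'].value_counts().to_dict()
print(demo[['filename', 'dist_folder']])
print(counts)
```

On a frame the size of `df_scored` the difference is mostly readability; either form yields the same six possible folder labels.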


In [15]:
"""
6.3 Save Reports and Scoring Data
"""

# Schema validation report
validation_export = df_validation.copy()
validation_export['errors'] = validation_export['errors'].apply(
    lambda x: '|'.join(x) if isinstance(x, list) else str(x)
)
validation_file = REPORTS_DIR / 'schema_validation_report.csv'
validation_export.to_csv(validation_file, index=False)
print(f"Saved: {validation_file}")

# Completeness report
completeness_file = REPORTS_DIR / 'hevl_completeness_report.csv'
df_completeness.to_csv(completeness_file, index=False)
print(f"Saved: {completeness_file}")

# Full scored records (master manifest)
# Reorder columns so dist_folder appears right after tier
cols = list(df_scored.columns)
if 'dist_folder' in cols and 'tier' in cols:
    cols.remove('dist_folder')
    tier_idx = cols.index('tier')
    cols.insert(tier_idx + 1, 'dist_folder')
    df_scored = df_scored[cols]

scored_file = REPORTS_DIR / 'confidence_scored_records.csv'
df_scored.to_csv(scored_file, index=False)
print(f"Saved: {scored_file}")

# Also save master manifest to dist/
master_manifest = DIST_DIR / 'master_manifest.csv'
df_scored.to_csv(master_manifest, index=False)
print(f"Saved: {master_manifest}")

# Copy reports to dist
dist_reports = DIST_DIR / 'reports'
dist_reports.mkdir(exist_ok=True)
for report_file in REPORTS_DIR.glob('*'):
    # shutil.copy2 handles files only, so skip any subdirectories
    if report_file.is_file():
        shutil.copy2(report_file, dist_reports / report_file.name)

print(f"\nReports copied to: {dist_reports}")


Saved: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump/rdls/reports/schema_validation_report.csv
Saved: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump/rdls/reports/hevl_completeness_report.csv
Saved: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump/rdls/reports/confidence_scored_records.csv
Saved: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump/rdls/dist/master_manifest.csv

Reports copied to: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump/rdls/dist/reports
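
The validation report stores each record's error list as a single `|`-joined string, so reading it back requires splitting on `|`, and a message that itself contains `|` would not round-trip cleanly. One possible alternative, sketched here on invented data, is to serialize each list as JSON before writing the CSV. This is a suggestion rather than what the notebook does, and the demo frame and its error messages are made up.

```python
import io
import json

import pandas as pd

# Invented demo data shaped like df_validation's 'filename' and 'errors' columns
demo = pd.DataFrame({
    'filename': ['a.json', 'b.json'],
    'errors': [
        ["'title' is a required property", "'x' is not of type 'string'"],
        [],
    ],
})

# Serialize each error list as a JSON string before export
export = demo.copy()
export['errors'] = export['errors'].apply(json.dumps)

# Write to an in-memory buffer instead of a real file for this example
buffer = io.StringIO()
export.to_csv(buffer, index=False)
buffer.seek(0)

# Read back and decode the JSON strings into Python lists again
restored = pd.read_csv(buffer)
restored['errors'] = restored['errors'].apply(json.loads)
print(restored['errors'].tolist())
```

The trade-off is that the CSV cell becomes slightly less readable in a spreadsheet, in exchange for a lossless round trip.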


In [16]:
"""
6.4 Generate Summary Report and ZIP Archive
"""

# --- Summary Report ---
total = len(df_scored)
valid = (df_validation['status'] == 'valid').sum()
invalid = (df_validation['status'] == 'invalid').sum()

h_count = df_completeness.get('has_hazard', pd.Series(dtype=bool)).sum()
e_count = df_completeness.get('has_exposure', pd.Series(dtype=bool)).sum()
v_count = df_completeness.get('has_vulnerability', pd.Series(dtype=bool)).sum()
l_count = df_completeness.get('has_loss', pd.Series(dtype=bool)).sum()

high_n = (df_scored['tier'] == 'high').sum()
med_n = (df_scored['tier'] == 'medium').sum()
low_n = (df_scored['tier'] == 'low').sum()

# Two-dimensional counts
high_valid = (df_scored['dist_folder'] == 'high').sum()
high_invalid = (df_scored['dist_folder'] == 'invalid/high').sum()
med_valid = (df_scored['dist_folder'] == 'medium').sum()
med_invalid = (df_scored['dist_folder'] == 'invalid/medium').sum()
low_valid = (df_scored['dist_folder'] == 'low').sum()
low_invalid = (df_scored['dist_folder'] == 'invalid/low').sum()

report = f"""# RDLS Validation and QA Report

**Generated**: {datetime.now().isoformat()}
**Source**: HDX Dataset Metadata (Humanitarian Data Exchange)

## Summary

| Metric | Value |
|--------|-------|
| Total Records | {total:,} |
| Schema Valid | {valid:,} ({valid/total*100:.1f}%) |
| Schema Invalid | {invalid:,} ({invalid/total*100:.1f}%) |

## Confidence Tiers (two-dimensional: confidence x schema validity)

| Tier | Valid | Invalid | Total | Description |
|------|-------|---------|-------|-------------|
| **high/** | {high_valid:,} | {high_invalid:,} | {high_n:,} | Score >= {THRESHOLD_HIGH} |
| **medium/** | {med_valid:,} | {med_invalid:,} | {med_n:,} | {THRESHOLD_MEDIUM} <= score < {THRESHOLD_HIGH} |
| **low/** | {low_valid:,} | {low_invalid:,} | {low_n:,} | Score < {THRESHOLD_MEDIUM} |
| **Total** | {valid:,} | {invalid:,} | {total:,} | |

## HEVL Block Coverage

| Component | Records with Block | Percentage |
|-----------|-------------------|------------|
| Hazard | {h_count:,} | {h_count/total*100:.1f}% |
| Exposure | {e_count:,} | {e_count/total*100:.1f}% |
| Vulnerability | {v_count:,} | {v_count/total*100:.1f}% |
| Loss | {l_count:,} | {l_count/total*100:.1f}% |

## Composite Score Components (Averages)

| Component | Weight | Avg Score |
|-----------|--------|-----------|
| HEVL Coverage | 40% | {df_scored['hevl_coverage'].mean():.3f} |
| Block Richness | 25% | {df_scored['block_richness'].mean():.3f} |
| Schema Validity | 20% | {df_scored['schema_score'].mean():.3f} |
| Metadata Completeness | 15% | {df_scored['metadata_score'].mean():.3f} |

## Output Structure

```
dist/
  high/              {high_valid:,} records (schema-valid, production-ready)
    manifest.csv
    rdls_*.json
  medium/            {med_valid:,} records (schema-valid, needs review)
    manifest.csv
    rdls_*.json
  low/               {low_valid:,} records (schema-valid, needs curation)
    manifest.csv
    rdls_*.json
  invalid/           ALL schema-invalid records
    high/            {high_invalid:,} records (high confidence, schema invalid)
      manifest.csv
      rdls_*.json
    medium/          {med_invalid:,} records (medium confidence, schema invalid)
      manifest.csv
      rdls_*.json
    low/             {low_invalid:,} records (low confidence, schema invalid)
      manifest.csv
      rdls_*.json
  reports/           QA reports
  master_manifest.csv   All records with scores, tiers, and dist_folder
```

## Notes

- Metadata was automatically extracted from HDX (https://data.humdata.org)
- HEVL blocks generated by pattern-based extraction (NB 09-11)
- Confidence scores are composite: HEVL coverage (40%) + block richness (25%) + schema validity (20%) + metadata completeness (15%)
- Schema-valid records go to top-level tier folders; schema-invalid records go to invalid/ sub-tiers

---
*Report generated by HDX-RDLS Pipeline Notebook 13*
"""

summary_file = REPORTS_DIR / 'rdls_validation_summary.md'
with open(summary_file, 'w', encoding='utf-8') as f:
    f.write(report)
shutil.copy2(summary_file, DIST_DIR / 'README.md')
print(f"Saved: {summary_file}")
print(f"Copied as: {DIST_DIR / 'README.md'}")

# --- ZIP Archive ---
def create_zip_archive(dist_dir: Path, output_name: str = 'rdls_hdx_package') -> Path:
    """Create ZIP archive of distribution."""
    timestamp = datetime.now().strftime('%Y%m%d')
    zip_path = dist_dir.parent / f"{output_name}_{timestamp}.zip"

    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        for file_path in dist_dir.rglob('*'):
            if file_path.is_file():
                arcname = file_path.relative_to(dist_dir)
                zf.write(file_path, arcname)

    return zip_path

zip_path = create_zip_archive(DIST_DIR)
size_mb = zip_path.stat().st_size / (1024 * 1024)
print(f"\nZIP archive: {zip_path}")
print(f"Size: {size_mb:.1f} MB")


Saved: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump/rdls/reports/rdls_validation_summary.md
Copied as: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump/rdls/dist/README.md

ZIP archive: /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump/rdls/rdls_hdx_package_20260211.zip
Size: 76.0 MB
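
After the archive is written, it can be useful to confirm that its contents match the per-folder counts reported earlier. A minimal sketch, assuming the archive layout produced by `create_zip_archive` (paths relative to the distribution directory); `count_zip_entries_by_folder` is a hypothetical helper, and the demo builds a tiny throwaway archive rather than reading the real one.

```python
import tempfile
import zipfile
from collections import Counter
from pathlib import Path, PurePosixPath


def count_zip_entries_by_folder(zip_path):
    """Count files in a ZIP archive, grouped by their parent folder ('.' = root)."""
    counts = Counter()
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if name.endswith('/'):
                continue  # skip explicit directory entries
            counts[PurePosixPath(name).parent.as_posix()] += 1
    return counts


# Build a tiny demo distribution in a temporary directory (invented files)
with tempfile.TemporaryDirectory() as tmp:
    dist = Path(tmp) / 'dist'
    for rel in ['high/a.json', 'invalid/high/b.json', 'README.md']:
        target = dist / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text('{}', encoding='utf-8')

    # Archive it the same way as create_zip_archive: names relative to dist/
    demo_zip = Path(tmp) / 'demo.zip'
    with zipfile.ZipFile(demo_zip, 'w', zipfile.ZIP_DEFLATED) as zf:
        for file_path in dist.rglob('*'):
            if file_path.is_file():
                zf.write(file_path, file_path.relative_to(dist).as_posix())

    counts = count_zip_entries_by_folder(demo_zip)

print(dict(counts))
```

Against the real archive, each tier folder's count would include its `manifest.csv` in addition to the JSON records, so the expected value is the record count plus one.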


## 7. Final Summary

In [17]:
"""
7.1 Pipeline Summary
"""

print("=" * 70)
print("HDX -> RDLS PIPELINE COMPLETE")
print("=" * 70)

total = len(df_scored)
valid = (df_validation['status'] == 'valid').sum()
invalid = (df_validation['status'] == 'invalid').sum()
high_n = (df_scored['tier'] == 'high').sum()
med_n = (df_scored['tier'] == 'medium').sum()
low_n = (df_scored['tier'] == 'low').sum()

# Two-dimensional counts
high_valid = (df_scored['dist_folder'] == 'high').sum()
high_invalid = (df_scored['dist_folder'] == 'invalid/high').sum()
med_valid = (df_scored['dist_folder'] == 'medium').sum()
med_invalid = (df_scored['dist_folder'] == 'invalid/medium').sum()
low_valid = (df_scored['dist_folder'] == 'low').sum()
low_invalid = (df_scored['dist_folder'] == 'invalid/low').sum()

print(f"""
VALIDATION
----------
Total RDLS records:    {total:,}
Schema valid:          {valid:,} ({valid/total*100:.1f}%)
Schema invalid:        {invalid:,} ({invalid/total*100:.1f}%)

CONFIDENCE TIERS (confidence x validity)
-----------------------------------------
                         Valid   Invalid    Total
High (>= {THRESHOLD_HIGH}):       {high_valid:>6,}   {high_invalid:>6,}   {high_n:>6,}   Production-ready
Medium ({THRESHOLD_MEDIUM}-{THRESHOLD_HIGH}):    {med_valid:>6,}   {med_invalid:>6,}   {med_n:>6,}   Needs review
Low (< {THRESHOLD_MEDIUM}):        {low_valid:>6,}   {low_invalid:>6,}   {low_n:>6,}   Needs curation
                       ------   ------   ------
Total:                 {valid:>6,}   {invalid:>6,}   {total:>6,}

HEVL COVERAGE
-------------
With Hazard block:        {df_completeness.get('has_hazard', pd.Series(dtype=bool)).sum():>6,}
With Exposure block:      {df_completeness.get('has_exposure', pd.Series(dtype=bool)).sum():>6,}
With Vulnerability block: {df_completeness.get('has_vulnerability', pd.Series(dtype=bool)).sum():>6,}
With Loss block:          {df_completeness.get('has_loss', pd.Series(dtype=bool)).sum():>6,}

OUTPUT
------
Distribution:  {DIST_DIR}
  high/              {high_valid:,} records  (valid, production-ready)
  medium/            {med_valid:,} records  (valid, needs review)
  low/               {low_valid:,} records  (valid, needs curation)
  invalid/high/      {high_invalid:,} records  (invalid, high confidence)
  invalid/medium/    {med_invalid:,} records  (invalid, medium confidence)
  invalid/low/       {low_invalid:,} records  (invalid, low confidence)
Reports:       {REPORTS_DIR}
ZIP Archive:   {zip_path}
""")

print(f"Notebook completed: {datetime.now().isoformat()}")


HDX -> RDLS PIPELINE COMPLETE

VALIDATION
----------
Total RDLS records:    12,577
Schema valid:          9,797 (77.9%)
Schema invalid:        2,780 (22.1%)

CONFIDENCE TIERS (confidence x validity)
-----------------------------------------
                         Valid   Invalid    Total
High (>= 0.8):        9,797    2,780   12,577   Production-ready
Medium (0.5-0.8):         0        0        0   Needs review
Low (< 0.5):             0        0        0   Needs curation
                       ------   ------   ------
Total:                  9,797    2,780   12,577

HEVL COVERAGE
-------------
With Hazard block:         2,788
With Exposure block:      11,517
With Vulnerability block:  3,429
With Loss block:             703

OUTPUT
------
Distribution:  /mnt/c/Users/benny/OneDrive/Documents/Github/hdx-metadata-crawler/hdx_dataset_metadata_dump/rdls/dist
  high/              9,797 records  (valid, production-ready)
  medium/            0 records  (valid, needs review)
  low/          

## End of Code