# 1. Data Preparation for GRI Calculation with Global Dialogues Data

This notebook demonstrates how to prepare data for Global Representativeness Index (GRI) calculations using the **new GRI module structure**.

## What's New

This updated notebook showcases the improved data loading workflow:
- **`gri.data_loader`** module for unified data loading
- **`gri.validation`** module for comprehensive data validation
- **`load_benchmark_suite()`** to load all benchmarks at once
- **`load_gd_survey()`** for automated Global Dialogues processing

## Overview

The GRI system requires two types of data:
1. **Benchmark data**: Global population demographics from the UN and Pew Research
2. **Survey data**: Participant demographics from surveys like Global Dialogues
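
As a rough illustration of how the two inputs relate, the toy tables below mimic their shape. The `population_proportion` column and the `country`/`gender` column names match those used later in this notebook; the values and the `toy_*` names are made up for illustration and are not real benchmark figures.

```python
import pandas as pd

# Toy benchmark: one row per stratum, proportions sum to 1.0 (made-up values)
toy_benchmark = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "gender": ["Male", "Female", "Male", "Female"],
    "population_proportion": [0.30, 0.30, 0.20, 0.20],
})

# Toy survey: one row per participant
toy_survey = pd.DataFrame({
    "country": ["A", "A", "A", "B"],
    "gender": ["Male", "Female", "Female", "Male"],
})

# Share of participants in each stratum, comparable to population_proportion
toy_sample = (
    toy_survey.groupby(["country", "gender"])
    .size()
    .div(len(toy_survey))
    .rename("sample_proportion")
    .reset_index()
)

# Left-join onto the benchmark so strata with no participants show up as 0
toy_compare = toy_benchmark.merge(toy_sample, on=["country", "gender"], how="left")
toy_compare["sample_proportion"] = toy_compare["sample_proportion"].fillna(0.0)
print(toy_compare)
```

Both tables share the same category columns, which is what allows survey shares to be compared against population shares stratum by stratum.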

## Prerequisites

First, ensure you have processed the benchmark data:
```bash
make process-data
```

This creates benchmark files for all 13 configured dimensions.

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
import sys

# Add the gri module to the path
sys.path.append('..')

# Import the new GRI module functions
from gri.data_loader import load_benchmark_suite, load_gd_survey
from gri.validation import validate_benchmark_data, validate_survey_data
from gri.config import GRIConfig

# Set pandas display options for better output
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

# Initialize configuration
config = GRIConfig()
print("✅ GRI module loaded successfully")
print(f"📋 Configured dimensions: {len(config.get_all_dimensions())}")
print(f"📁 Data directory: {Path('../data/processed').absolute()}")

## 1. Load All Benchmark Data with `load_benchmark_suite()`

The new `load_benchmark_suite()` function loads all benchmark data files in one call and returns a dictionary keyed by dimension name, so you no longer need to load each file individually.
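
Conceptually, a suite loader of this kind reads one file per dimension and collects the results in a dictionary. The sketch below is not the implementation of `load_benchmark_suite()`; the helper name `load_csv_suite`, the one-CSV-per-dimension layout, and the use of the file stem as the key are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv_suite(data_dir):
    """Hypothetical helper: load every CSV in data_dir into a dict keyed by file stem."""
    suite = {}
    for csv_path in sorted(Path(data_dir).glob("*.csv")):
        suite[csv_path.stem] = pd.read_csv(csv_path)
    return suite


# Usage with a throwaway directory and made-up data
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame(
        {"country": ["A", "B"], "population_proportion": [0.6, 0.4]}
    ).to_csv(Path(tmp) / "country.csv", index=False)
    toy_suite = load_csv_suite(tmp)

print({name: len(df) for name, df in toy_suite.items()})
```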

In [None]:
# Load all benchmark data at once
print("Loading benchmark suite...")
benchmarks = load_benchmark_suite(data_dir='../data/processed')

print(f"\n✅ Loaded {len(benchmarks)} benchmark dimensions:")
for i, (dimension, df) in enumerate(benchmarks.items(), 1):
    print(f"{i:2d}. {dimension}: {len(df):,} strata")

# Verify all proportions sum to 1.0
print("\n🔍 Verifying benchmark data quality:")
for dimension, df in benchmarks.items():
    prop_sum = df['population_proportion'].sum()
    status = "✅" if abs(prop_sum - 1.0) < 0.001 else "⚠️"
    print(f"{status} {dimension}: sum = {prop_sum:.6f}")

In [None]:
# Examine structure of core benchmark dimensions
core_dimensions = ['Country × Gender × Age', 'Country × Religion', 'Country × Environment']

for dimension in core_dimensions:
    if dimension in benchmarks:
        df = benchmarks[dimension]
        print(f"\n📊 {dimension} Structure:")
        print(f"   Columns: {list(df.columns)}")
        print(f"   Sample data:")
        print(df.head(3).to_string())
        
        # Show unique values for each categorical column
        categorical_cols = [col for col in df.columns if col != 'population_proportion']
        for col in categorical_cols[:3]:  # Show first 3 categorical columns
            unique_vals = df[col].nunique()
            print(f"   Unique {col}: {unique_vals}")

## 2. Validate Benchmark Data

The new `validate_benchmark_data()` function checks for common data quality issues:
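
The kinds of checks involved can be sketched as follows. This is an illustrative stand-in, not the code of `validate_benchmark_data()`: the function name `check_benchmark`, the tolerance, and the exact set of checks are assumptions. It returns an `(is_valid, issues)` pair, the same shape the validation cell in this section unpacks.

```python
import pandas as pd


def check_benchmark(df, tol=1e-3):
    """Hypothetical checks on a benchmark table; returns (is_valid, issues)."""
    issues = []
    if "population_proportion" not in df.columns:
        return False, ["missing 'population_proportion' column"]

    props = df["population_proportion"]
    if props.isna().any():
        issues.append("missing proportions")
    if ((props < 0) | (props > 1)).any():
        issues.append("proportions outside [0, 1]")
    if abs(props.sum() - 1.0) > tol:
        issues.append(f"proportions sum to {props.sum():.4f}, expected 1.0")

    # Each stratum (combination of category columns) should appear only once
    strata_cols = [c for c in df.columns if c != "population_proportion"]
    if strata_cols and df.duplicated(subset=strata_cols).any():
        issues.append("duplicate strata")

    return len(issues) == 0, issues


# Made-up tables: one clean, one with a bad sum and a repeated stratum
good = pd.DataFrame({"country": ["A", "B"], "population_proportion": [0.6, 0.4]})
bad = pd.DataFrame({"country": ["A", "A"], "population_proportion": [0.6, 0.6]})
print(check_benchmark(good))
print(check_benchmark(bad))
```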

In [None]:
# Validate each benchmark dimension
print("🔍 Running benchmark validation...\n")

validation_results = {}
for dimension, df in benchmarks.items():
    is_valid, issues = validate_benchmark_data(df)
    validation_results[dimension] = (is_valid, issues)
    
    if is_valid:
        print(f"✅ {dimension}: Valid")
    else:
        print(f"❌ {dimension}: {len(issues)} issues found")
        for issue in issues:
            print(f"   - {issue}")

# Summary
valid_count = sum(1 for is_valid, _ in validation_results.values() if is_valid)
print(f"\n📊 Validation Summary: {valid_count}/{len(benchmarks)} dimensions passed validation")

## 3. Load Global Dialogues Survey Data with `load_gd_survey()`

The new `load_gd_survey()` function handles all the complexity of loading Global Dialogues data:
- Detects and handles format quirks (malformed headers, etc.)
- Applies segment mappings automatically
- Adds geographic hierarchies (region, continent)
- Standardizes column names
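
To make these steps concrete, the sketch below applies them to a toy table. It does not reproduce `load_gd_survey()`; the raw column names, the `COLUMN_NAMES`, `SEGMENT_MAP`, and `REGION_BY_COUNTRY` dictionaries, and the helper `standardize_survey` are all hypothetical.

```python
import pandas as pd

# Hypothetical mappings; the real ones live in the project's configuration
COLUMN_NAMES = {"What country do you live in?": "country", "Gender": "gender"}
SEGMENT_MAP = {"gender": {"M": "Male", "F": "Female"}}
REGION_BY_COUNTRY = {"Kenya": "Eastern Africa", "Brazil": "South America"}


def standardize_survey(raw):
    """Rename columns, map raw segment labels, and add a region column."""
    df = raw.rename(columns=COLUMN_NAMES)
    for col, mapping in SEGMENT_MAP.items():
        if col in df.columns:
            # Labels without a mapping are kept as-is rather than turned into NaN
            df[col] = df[col].replace(mapping)
    # Countries missing from the lookup get NaN in the region column
    df["region"] = df["country"].map(REGION_BY_COUNTRY)
    return df


raw = pd.DataFrame({
    "What country do you live in?": ["Kenya", "Brazil"],
    "Gender": ["F", "M"],
})
toy_processed = standardize_survey(raw)
print(toy_processed)
```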

In [None]:
# Load Global Dialogues survey data using the new function
gd_path = Path("../data/raw/survey_data/global-dialogues/Data/GD3/GD3_participants.csv")

if gd_path.exists():
    print(f"📂 Loading GD3 data from: {gd_path.name}")
    print("🔄 Processing with load_gd_survey()...")
    
    # The new function handles everything automatically!
    survey_data = load_gd_survey(gd_path, gd_version=3, config=config)
    
    print(f"\n✅ Successfully loaded and processed GD3 data:")
    print(f"   Participants: {len(survey_data):,}")
    print(f"   Columns: {list(survey_data.columns)}")
    print(f"\n📊 Data sample:")
    print(survey_data.head())
else:
    print("❌ GD3 data file not found. Please ensure the global-dialogues submodule is initialized:")
    print("   git submodule update --init --recursive")
    survey_data = None

## 4. Explore the Processed Survey Data

Let's examine what `load_gd_survey()` did for us automatically:

In [None]:
if survey_data is not None:
    print("📊 Survey Data Analysis:\n")
    
    # Demographic distributions
    for col in ['country', 'gender', 'age_group', 'religion', 'environment', 'region', 'continent']:
        if col in survey_data.columns:
            print(f"\n{col.replace('_', ' ').title()} Distribution:")
            value_counts = survey_data[col].value_counts()
            
            # Show top values
            show_n = 10 if col == 'country' else len(value_counts)
            for value, count in value_counts.head(show_n).items():
                print(f"  {value}: {count} ({count/len(survey_data)*100:.1f}%)")
            
            if col == 'country' and len(value_counts) > 10:
                print(f"  ... and {len(value_counts) - 10} more countries")
    
    # Check which dimensions we can analyze
    print("\n📋 Analyzable GRI Dimensions:")
    for dimension in config.get_all_dimensions():
        required_cols = dimension['columns']
        if all(col in survey_data.columns for col in required_cols):
            print(f"  ✅ {dimension['name']}")
        else:
            missing = [col for col in required_cols if col not in survey_data.columns]
            print(f"  ❌ {dimension['name']} (missing: {missing})")

## 5. Validate Survey Data

The `validate_survey_data()` function checks for data quality issues:

In [None]:
if survey_data is not None:
    print("🔍 Validating survey data...\n")
    
    # Validate with core demographic columns
    required_cols = ['country', 'gender', 'age_group']
    is_valid, issues = validate_survey_data(survey_data, required_columns=required_cols)
    
    if is_valid:
        print("✅ Survey data passed validation!")
    else:
        print(f"⚠️  Survey data has {len(issues)} issues:")
        for issue in issues:
            print(f"   - {issue}")
    
    # Additional quality checks
    print("\n📊 Data Quality Metrics:")
    print(f"   Total participants: {len(survey_data):,}")
    print(f"   Complete records: {survey_data.notna().all(axis=1).sum():,}")
    print(f"   Countries represented: {survey_data['country'].nunique()}")
    print(f"   Gender categories: {survey_data['gender'].nunique()}")
    print(f"   Age groups: {survey_data['age_group'].nunique()}")
    
    # Check for non-binary gender representation
    if 'gender' in survey_data.columns:
        gender_counts = survey_data['gender'].value_counts()
        non_binary_count = sum(count for gender, count in gender_counts.items() 
                              if gender not in ['Male', 'Female'])
        if non_binary_count > 0:
            print(f"\n📝 Note: {non_binary_count} participants with non-binary gender")
            print("   These will be excluded from gender-stratified GRI calculations")
            print("   (UN benchmark data only available for Male/Female)")

## 6. Check Alignment Between Survey and Benchmark Data

Let's verify that the survey categories align with benchmark categories:

In [None]:
def check_category_alignment(survey_df, benchmark_df, columns):
    """Check how well survey categories align with benchmark categories."""
    results = {}
    
    for col in columns:
        if col in survey_df.columns and col in benchmark_df.columns:
            survey_cats = set(survey_df[col].dropna().unique())
            benchmark_cats = set(benchmark_df[col].dropna().unique())
            
            matched = survey_cats.intersection(benchmark_cats)
            unmatched = survey_cats - benchmark_cats
            
            results[col] = {
                'survey_count': len(survey_cats),
                'benchmark_count': len(benchmark_cats),
                'matched': len(matched),
                'unmatched': list(unmatched),
                'coverage': len(matched) / len(survey_cats) * 100 if survey_cats else 0
            }
    
    return results

In [None]:
if survey_data is not None:
    print("🔍 Checking alignment between survey and benchmark data...\n")
    
    # Check alignment for key dimensions
    dimensions_to_check = [
        ('Country × Gender × Age', ['country', 'gender', 'age_group']),
        ('Country × Religion', ['country', 'religion']),
        ('Country × Environment', ['country', 'environment'])
    ]
    
    for dimension_name, columns in dimensions_to_check:
        if dimension_name in benchmarks:
            print(f"\n📊 {dimension_name} Alignment:")
            alignment = check_category_alignment(survey_data, benchmarks[dimension_name], columns)
            
            for col, stats in alignment.items():
                print(f"\n   {col.replace('_', ' ').title()}:")
                print(f"      Survey categories: {stats['survey_count']}")
                print(f"      Matched with benchmark: {stats['matched']} ({stats['coverage']:.1f}%)")
                
                if stats['unmatched']:
                    print(f"      Unmatched: {stats['unmatched'][:5]}")
                    if len(stats['unmatched']) > 5:
                        print(f"                 ... and {len(stats['unmatched']) - 5} more")
    
    # Overall alignment summary
    print("\n" + "="*60)
    print("📊 ALIGNMENT SUMMARY")
    print("="*60)
    
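    # Key insights: count survey countries that also appear in the Country benchmark
    if 'country' in survey_data.columns and 'Country' in benchmarks:
        survey_countries = set(survey_data['country'].dropna())
        benchmark_countries = set(benchmarks['Country']['country'])
        country_coverage = len(survey_countries & benchmark_countries)
        print(f"\n✅ Survey countries found in the Country benchmark: {country_coverage}")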
        
    print("\n📝 Notes:")
    print("- Some countries may lack religious composition data (Pew Research gaps)")
    print("- Some countries may lack urban/rural data (UN data gaps)")
    print("- Non-binary gender categories excluded (no UN benchmark available)")
    print("\n✅ Data is ready for GRI calculation across all supported dimensions!")

## 7. Save Processed Survey Data

Let's save the processed survey data for use in GRI calculations:

In [None]:
if survey_data is not None:
    # Save the processed data
    output_path = Path('../data/processed/gd3_survey_data_processed.csv')
    output_path.parent.mkdir(parents=True, exist_ok=True)
    
    survey_data.to_csv(output_path, index=False)
    
    print("💾 Saved processed survey data")
    print(f"📁 Location: {output_path}")
    print(f"📊 Size: {len(survey_data):,} participants × {len(survey_data.columns)} columns")
    
    # Create a summary of what was processed
    print("\n📋 Processing Summary:")
    print("✅ Applied segment mappings from config/segments.yaml")
    print("✅ Added geographic hierarchies (region, continent)")
    print("✅ Standardized column names")
    print("✅ Validated data quality")
    print("✅ Ready for GRI calculation!")
else:
    print("❌ No survey data to save")

## Summary

This notebook demonstrated the **new GRI module capabilities** for data preparation:

### 🎯 Key Features Used

1. **`load_benchmark_suite()`** - Loaded all 13 benchmark dimensions at once
2. **`load_gd_survey()`** - Automatically processed Global Dialogues data with:
   - Format detection and handling
   - Segment mapping application
   - Geographic hierarchy addition
   - Column standardization
3. **`validate_benchmark_data()`** - Validated benchmark data quality
4. **`validate_survey_data()`** - Checked survey data for issues

### 📊 Results

- ✅ Loaded complete benchmark suite covering all configured dimensions
- ✅ Processed the Global Dialogues GD3 survey (the participant count is printed when the data is loaded in Section 3)
- ✅ Validated all data for quality and completeness
- ✅ Verified alignment between survey and benchmark categories
- ✅ Saved processed data ready for GRI calculation

### 🚀 Next Steps

1. **Calculate GRI scores** - Use `2-gri-calculation-example.ipynb`
2. **Perform analysis** - Use `3-advanced-analysis.ipynb`
3. **Command line** - Run `make calculate-gri GD=3`

### 💡 Benefits of the New Module Structure

- **Less code** - Complex operations handled by module functions
- **More reliable** - Consistent processing and validation
- **Educational** - Clear separation of concerns
- **Reusable** - Same functions work for any survey data