# 1. Data Preparation for GRI Calculation

This notebook demonstrates how to prepare data for Global Representativeness Index (GRI) calculations.

## Overview

The GRI requires two types of data:
1. **Benchmark data**: Global population demographics from the UN and Pew Research
2. **Survey data**: Participant demographics from your survey

## Prerequisites

First, ensure you have processed the benchmark data by running:
```bash
make process-data
```

This command reads `config/dimensions.yaml` and creates every benchmark file defined there, including:
- All 13 configured dimensions
- Regional and continental aggregations
- Single-dimension benchmarks

The configuration system ensures consistent processing across all dimensions.
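
Before loading anything, it can help to confirm that the processed files are actually on disk. The following is a minimal sketch, assuming the files live under `../data/processed` with the names used in the loading cell of this notebook; `find_missing_benchmarks` is a hypothetical helper, not part of the `gri` package.

```python
from pathlib import Path

def find_missing_benchmarks(processed_dir, filenames):
    """Return the benchmark filenames that are not present in processed_dir."""
    base = Path(processed_dir)
    return [name for name in filenames if not (base / name).is_file()]

# File names taken from the loading cell in this notebook
expected_files = [
    "benchmark_country_gender_age.csv",
    "benchmark_country_religion.csv",
    "benchmark_country_environment.csv",
    "benchmark_gender.csv",
    "benchmark_age_group.csv",
]

missing = find_missing_benchmarks("../data/processed", expected_files)
if missing:
    print("Missing benchmark files (run `make process-data`):", missing)
else:
    print("All expected benchmark files found.")
```

If the list is non-empty, re-running `make process-data` is the first thing to try.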

In [ ]:
import pandas as pd
import numpy as np
import sys
import os

# Add the gri module to the path
sys.path.append('..')
from gri.utils import load_data
from gri.config import GRIConfig

# Set pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

# Load configuration to understand available dimensions
config = GRIConfig()
print("Configuration-driven GRI system initialized")
print(f"Available dimensions: {len(config.get_all_dimensions())}")

# Show the dimensions we'll be working with
dimensions = config.get_all_dimensions()
print("\nConfigured dimensions:")
for i, dim in enumerate(dimensions[:5], 1):  # Show first 5
    print(f"{i}. {dim['name']}: {dim['columns']}")
if len(dimensions) > 5:
    print(f"... and {len(dimensions) - 5} more dimensions")

## 1. Load Processed Benchmark Data

The benchmark data is processed using the configuration system, which creates files for all dimensions defined in `config/dimensions.yaml`. Let's load the core dimensions and explore what's available:

In [ ]:
# Load processed benchmark data - configuration-driven approach creates all dimensions
benchmark_age_gender = load_data('../data/processed/benchmark_country_gender_age.csv')
benchmark_religion = load_data('../data/processed/benchmark_country_religion.csv')
benchmark_environment = load_data('../data/processed/benchmark_country_environment.csv')

# Also show some of the additional dimensions created by configuration system
benchmark_gender = load_data('../data/processed/benchmark_gender.csv')
benchmark_age_group = load_data('../data/processed/benchmark_age_group.csv')

print("Core Benchmark Data Summary:")
print(f"Country × Gender × Age: {len(benchmark_age_gender):,} strata")
print(f"Country × Religion: {len(benchmark_religion):,} strata")
print(f"Country × Environment: {len(benchmark_environment):,} strata")

print("\nSingle-Dimension Benchmarks:")
print(f"Gender: {len(benchmark_gender):,} strata")
print(f"Age Group: {len(benchmark_age_group):,} strata")

# Verify proportions sum to 1.0
print("\nProportion sums (should be 1.0):")
print(f"Age/Gender: {benchmark_age_gender['population_proportion'].sum():.6f}")
print(f"Religion: {benchmark_religion['population_proportion'].sum():.6f}")
print(f"Environment: {benchmark_environment['population_proportion'].sum():.6f}")
print(f"Gender: {benchmark_gender['population_proportion'].sum():.6f}")
print(f"Age Group: {benchmark_age_group['population_proportion'].sum():.6f}")

print(f"\nConfiguration system created benchmark files for all {len(config.get_all_dimensions())} configured dimensions!")
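
Printing the sums leaves the comparison to the reader. As a sketch, the same check can be made explicit with a tolerance, since floating-point sums rarely equal 1.0 exactly. The helper name `proportions_sum_to_one` and the tolerance of `1e-6` are assumptions for illustration, and the toy frame stands in for a real benchmark file.

```python
import numpy as np
import pandas as pd

def proportions_sum_to_one(df, column="population_proportion", tol=1e-6):
    """Return True when the proportion column sums to 1.0 within tol."""
    total = df[column].sum()
    return bool(np.isclose(total, 1.0, rtol=0.0, atol=tol))

# Toy benchmark standing in for a real processed file
toy_benchmark = pd.DataFrame({
    "gender": ["Male", "Female"],
    "population_proportion": [0.5, 0.5],
})
print(proportions_sum_to_one(toy_benchmark))
```

The same function could be applied to each loaded benchmark, for example `proportions_sum_to_one(benchmark_gender)`.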

In [3]:
# Preview benchmark data structures
print("Country x Gender x Age Benchmark:")
print(benchmark_age_gender.head())
print("\nUnique age groups:", sorted(benchmark_age_gender['age_group'].unique()))
print("Unique genders:", sorted(benchmark_age_gender['gender'].unique()))

Country x Gender x Age Benchmark:
   country  gender age_group  population_proportion
0  Burundi    Male     18-25               0.000219
1  Burundi  Female     18-25               0.000219
2  Burundi    Male     26-35               0.000147
3  Burundi  Female     26-35               0.000149
4  Burundi    Male     36-45               0.000120

Unique age groups: ['18-25', '26-35', '36-45', '46-55', '56-65', '65+']
Unique genders: ['Female', 'Male']


In [4]:
print("Country x Religion Benchmark:")
print(benchmark_religion.head())
print("\nUnique religions:", sorted(benchmark_religion['religion'].unique()))

Country x Religion Benchmark:
       country      religion  population_proportion
0  Afghanistan  Christianity           1.543859e-06
1  Afghanistan         Islam           1.539228e-03
2  Afghanistan      Hinduism           7.719297e-07
3  Afghanistan      Buddhism           7.719297e-07
4  Afghanistan       Judaism           7.719297e-07

Unique religions: ['Buddhism', 'Christianity', 'Hinduism', 'I do not identify with any religious group or faith', 'Islam', 'Judaism', 'Other religious group']


In [5]:
print("Country x Environment Benchmark:")
print(benchmark_environment.head())
print("\nUnique environments:", sorted(benchmark_environment['environment'].unique()))

Country x Environment Benchmark:
              country environment  population_proportion
0  Sub-Saharan Africa       Urban               0.016595
1  Sub-Saharan Africa       Rural               0.024510
2              AFRICA       Urban               0.021434
3              AFRICA       Rural               0.028977
4      Eastern Africa       Urban               0.004749

Unique environments: ['Rural', 'Urban']


## 2. Load and Examine Survey Data

Now let's load sample survey data from the Global Dialogues project:

In [6]:
# Load Global Dialogues participant data (GD4 as example)
survey_path = '../data/raw/survey_data/global-dialogues/Data/GD4/GD4_participants.csv'

if os.path.exists(survey_path):
    # Load survey data - the file has some formatting issues, so we'll handle them
    survey_raw = pd.read_csv(survey_path, skiprows=1)  # Skip the empty first row
    
    print(f"Raw survey data shape: {survey_raw.shape}")
    print("\nColumn names:")
    for i, col in enumerate(survey_raw.columns[:10]):  # Show first 10 columns
        print(f"{i}: {col}")
else:
    print(f"Survey file not found at {survey_path}")
    print("Note: Global Dialogues data is in a Git submodule.")
    print("You may need to initialize the submodule or use your own survey data.")

Raw survey data shape: (600, 234)

Column names:
0: Unnamed: 0
1: Unnamed: 1
2: Participant Id
3: Sample Provider Id
4: Please select your preferred language:
5: How old are you?
6: What is your gender?
7: What best describes where you live?
8: Overall, would you say the increased use of artificial intelligence (AI) in daily life makes you feel…
9: What religious group or faith do you most identify with?
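
The demographic questions appear as full question text in the column names. One way to work with them is to select those columns and rename them to the short names used by the benchmark files. This is a sketch only: the question strings are copied from the listing above, `extract_demographics` is a hypothetical helper, and the answer values may still need recoding to match benchmark categories. No country column appears among the first ten columns, so it would have to be located separately.

```python
import pandas as pd

# Question text copied from the column listing; short names match the benchmark columns
QUESTION_TO_COLUMN = {
    "How old are you?": "age_group",
    "What is your gender?": "gender",
    "What best describes where you live?": "environment",
    "What religious group or faith do you most identify with?": "religion",
}

def extract_demographics(raw_df, question_map=QUESTION_TO_COLUMN):
    """Select the demographic question columns that exist and rename them."""
    present = {q: name for q, name in question_map.items() if q in raw_df.columns}
    return raw_df[list(present)].rename(columns=present)

# Toy frame standing in for survey_raw
toy_raw = pd.DataFrame({
    "Participant Id": [1, 2],
    "How old are you?": ["18-25", "26-35"],
    "What is your gender?": ["Female", "Male"],
})
toy_demographics = extract_demographics(toy_raw)
print(toy_demographics)
```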


## 3. Clean and Standardize Survey Data

For GRI calculation, we need to map survey demographics to match the benchmark categories:

In [7]:
# Build the survey table used in the rest of this notebook.
# Real Global Dialogues data needs dataset-specific column mapping,
# so both branches below end up using random sample data.
def make_sample_survey(n_participants=500, seed=42):
    """Create random sample survey data using the benchmark category labels."""
    np.random.seed(seed)

    sample_countries = ['United States', 'India', 'Brazil', 'Germany', 'Nigeria', 'Japan']
    sample_ages = ['18-25', '26-35', '36-45', '46-55', '56-65', '65+']
    sample_genders = ['Male', 'Female']
    sample_religions = ['Christianity', 'Islam', 'Hinduism', 'Buddhism', 'Judaism',
                        'I do not identify with any religious group or faith', 'Other religious group']
    sample_environments = ['Urban', 'Rural']

    return pd.DataFrame({
        'country': np.random.choice(sample_countries, n_participants),
        'age_group': np.random.choice(sample_ages, n_participants),
        'gender': np.random.choice(sample_genders, n_participants),
        'religion': np.random.choice(sample_religions, n_participants),
        'environment': np.random.choice(sample_environments, n_participants)
    })

if 'survey_raw' not in locals() or survey_raw.empty:
    print("Creating sample survey data for demonstration...")
    survey_data = make_sample_survey()
    print(f"Sample survey data created with {len(survey_data)} participants")

else:
    print("Processing real Global Dialogues data...")

    # Extract relevant demographic columns from Global Dialogues data
    # Note: Column names may vary - adjust as needed based on actual data structure
    survey_data = pd.DataFrame()

    # Find the country column by name (its position varies between files)
    country_col = None
    for col in survey_raw.columns:
        if 'country' in str(col).lower() or 'region' in str(col).lower():
            country_col = col
            break

    if country_col is not None:
        survey_data['country'] = survey_raw[country_col]

    # Add other demographic mappings as needed...
    print("Real data processing would require specific column mapping")
    print("Using sample data for this demonstration...")

    # Fall back to sample data for the demo (this overwrites the partial mapping above)
    survey_data = make_sample_survey()

Processing real Global Dialogues data...
Real data processing would require specific column mapping
Using sample data for this demonstration...
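
One common mapping step is converting numeric ages into the benchmark's age brackets. The sketch below assumes the survey records age as a number, which may not be true of the Global Dialogues file. It also assumes that age 65 belongs to `56-65` rather than `65+`, since the benchmark labels overlap at that value. `bin_ages` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

AGE_LABELS = ["18-25", "26-35", "36-45", "46-55", "56-65", "65+"]
# Right-inclusive edges: (17, 25], (25, 35], ..., (55, 65], (65, inf)
AGE_EDGES = [17, 25, 35, 45, 55, 65, np.inf]

def bin_ages(ages):
    """Map numeric ages to benchmark age-group labels; ages under 18 become NaN."""
    return pd.cut(pd.Series(ages), bins=AGE_EDGES, labels=AGE_LABELS)

binned = bin_ages([18, 25, 26, 65, 66, 17])
print(binned.tolist())
```

Participants younger than 18 fall outside every bracket and come back as missing values, so they can be dropped or reported explicitly.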


In [8]:
# Display survey data summary
print("Survey Data Summary:")
print(f"Total participants: {len(survey_data)}")
print("\nCountry distribution:")
print(survey_data['country'].value_counts())

print("\nAge group distribution:")
print(survey_data['age_group'].value_counts())

print("\nGender distribution:")
print(survey_data['gender'].value_counts())

Survey Data Summary:
Total participants: 500

Country distribution:
country
Germany          97
United States    90
Japan            86
Brazil           77
Nigeria          75
India            75
Name: count, dtype: int64

Age group distribution:
age_group
56-65    97
18-25    91
26-35    89
46-55    77
36-45    77
65+      69
Name: count, dtype: int64

Gender distribution:
gender
Male      259
Female    241
Name: count, dtype: int64


## 4. Data Validation and Alignment

Before calculating GRI, we need to ensure the survey categories align with benchmark categories:

In [9]:
# Check alignment between survey and benchmark categories
def check_category_alignment(survey_df, benchmark_df, columns):
    """Check which survey categories are present in benchmark data."""
    results = {}
    
    for col in columns:
        if col in survey_df.columns and col in benchmark_df.columns:
            survey_cats = set(survey_df[col].unique())
            benchmark_cats = set(benchmark_df[col].unique())
            
            matched = survey_cats.intersection(benchmark_cats)
            unmatched = survey_cats - benchmark_cats
            
            results[col] = {
                'total_survey': len(survey_cats),
                'matched': len(matched),
                'unmatched': unmatched
            }
    
    return results

# Check age/gender alignment
age_gender_alignment = check_category_alignment(
    survey_data, benchmark_age_gender, ['country', 'gender', 'age_group']
)

print("Age/Gender Category Alignment:")
for col, stats in age_gender_alignment.items():
    print(f"\n{col}:")
    print(f"  Survey categories: {stats['total_survey']}")
    print(f"  Matched with benchmark: {stats['matched']}")
    if stats['unmatched']:
        print(f"  Unmatched: {stats['unmatched']}")

Age/Gender Category Alignment:

country:
  Survey categories: 6
  Matched with benchmark: 5
  Unmatched: {'United States'}

gender:
  Survey categories: 2
  Matched with benchmark: 2

age_group:
  Survey categories: 6
  Matched with benchmark: 6


In [10]:
# Check religion alignment
religion_alignment = check_category_alignment(
    survey_data, benchmark_religion, ['country', 'religion']
)

print("Religion Category Alignment:")
for col, stats in religion_alignment.items():
    print(f"\n{col}:")
    print(f"  Survey categories: {stats['total_survey']}")
    print(f"  Matched with benchmark: {stats['matched']}")
    if stats['unmatched']:
        print(f"  Unmatched: {stats['unmatched']}")

Religion Category Alignment:

country:
  Survey categories: 6
  Matched with benchmark: 6

religion:
  Survey categories: 7
  Matched with benchmark: 7


In [11]:
# Check environment alignment
environment_alignment = check_category_alignment(
    survey_data, benchmark_environment, ['country', 'environment']
)

print("Environment Category Alignment:")
for col, stats in environment_alignment.items():
    print(f"\n{col}:")
    print(f"  Survey categories: {stats['total_survey']}")
    print(f"  Matched with benchmark: {stats['matched']}")
    if stats['unmatched']:
        print(f"  Unmatched: {stats['unmatched']}")

Environment Category Alignment:

country:
  Survey categories: 6
  Matched with benchmark: 5
  Unmatched: {'United States'}

environment:
  Survey categories: 2
  Matched with benchmark: 2
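
The checks above compare one column at a time, so they cannot detect a combination (for example, a country paired with an age group) that is missing from the benchmark even though each value exists on its own. A sketch of a stratum-level check using a left merge with `indicator=True` follows; `find_unmatched_strata` is a hypothetical helper and the toy frames are illustrative.

```python
import pandas as pd

def find_unmatched_strata(survey_df, benchmark_df, columns):
    """Return survey strata (unique column combinations) absent from the benchmark."""
    survey_strata = survey_df[columns].drop_duplicates()
    benchmark_strata = benchmark_df[columns].drop_duplicates()
    merged = survey_strata.merge(
        benchmark_strata, on=columns, how="left", indicator=True
    )
    # "left_only" marks strata that appear in the survey but not in the benchmark
    return merged.loc[merged["_merge"] == "left_only", columns].reset_index(drop=True)

toy_survey = pd.DataFrame({
    "country": ["India", "India", "United States"],
    "gender": ["Male", "Female", "Male"],
})
toy_benchmark = pd.DataFrame({
    "country": ["India", "India", "United States of America"],
    "gender": ["Male", "Female", "Male"],
    "population_proportion": [0.4, 0.4, 0.2],
})
unmatched = find_unmatched_strata(toy_survey, toy_benchmark, ["country", "gender"])
print(unmatched)
```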


## 5. Country Name Standardization

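Country names often need to be standardized between survey and benchmark data. In the alignment checks above, `United States` was unmatched in the age/gender and environment benchmarks but matched in the religion benchmark, so the benchmark files do not all use the same spelling. Re-run the alignment check against each benchmark after applying a mapping: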

In [12]:
# Show sample of countries in benchmark vs survey
print("Sample benchmark countries:")
print(sorted(benchmark_age_gender['country'].unique())[:10])

print("\nSurvey countries:")
print(sorted(survey_data['country'].unique()))

# Create country mapping if needed
country_mapping = {
    'United States': 'United States of America',
    'Germany': 'Germany',
    'Brazil': 'Brazil',
    'India': 'India',
    'Nigeria': 'Nigeria',
    'Japan': 'Japan'
}

# Apply country mapping
survey_data_clean = survey_data.copy()
survey_data_clean['country'] = survey_data_clean['country'].map(country_mapping).fillna(survey_data_clean['country'])

print("\nAfter mapping:")
print(survey_data_clean['country'].value_counts())

Sample benchmark countries:
['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antigua and Barbuda', 'Argentina', 'Armenia']

Survey countries:
['Brazil', 'Germany', 'India', 'Japan', 'Nigeria', 'United States']

After mapping:
country
Germany                     97
United States of America    90
Japan                       86
Brazil                      77
Nigeria                     75
India                       75
Name: count, dtype: int64
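
Because a single mapping may fix one benchmark and break another, it is worth checking the mapped names against each benchmark separately. The sketch below uses hypothetical benchmark name lists to show both outcomes; `unmatched_countries` is an illustrative helper, not part of the `gri` package.

```python
def unmatched_countries(survey_countries, benchmark_countries, mapping=None):
    """Return survey country names not found in the benchmark after applying mapping."""
    mapping = mapping or {}
    mapped = {mapping.get(name, name) for name in survey_countries}
    return sorted(mapped - set(benchmark_countries))

survey_names = ["Brazil", "United States"]
name_mapping = {"United States": "United States of America"}

# Hypothetical benchmark name lists illustrating two different spellings
benchmark_a = ["Brazil", "United States of America"]
benchmark_b = ["Brazil", "United States"]

unmatched_a = unmatched_countries(survey_names, benchmark_a, name_mapping)
unmatched_b = unmatched_countries(survey_names, benchmark_b, name_mapping)
print("Benchmark A unmatched:", unmatched_a)
print("Benchmark B unmatched:", unmatched_b)
```

With real data, the benchmark lists would come from calls such as `benchmark_religion['country'].unique()`. A non-empty result suggests that benchmark needs its own mapping.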


## 6. Save Cleaned Survey Data

Save the cleaned survey data for use in GRI calculations:

In [13]:
# Save cleaned survey data
os.makedirs('../data/processed', exist_ok=True)
survey_data_clean.to_csv('../data/processed/sample_survey_data.csv', index=False)

print("Cleaned survey data saved to: data/processed/sample_survey_data.csv")
print(f"Final dataset shape: {survey_data_clean.shape}")
print("\nData is now ready for GRI calculation!")

Cleaned survey data saved to: data/processed/sample_survey_data.csv
Final dataset shape: (500, 5)

Data is now ready for GRI calculation!
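
A quick round-trip check can confirm that the saved file reads back with the same columns and values. This sketch writes to a temporary directory rather than `data/processed`; `roundtrip_matches` is a hypothetical helper and compares values as strings, which suits the categorical columns used here.

```python
import os
import tempfile

import pandas as pd

def roundtrip_matches(df, path):
    """Write df to CSV, read it back, and compare columns and string values."""
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    same_columns = list(df.columns) == list(reloaded.columns)
    same_values = df.astype(str).values.tolist() == reloaded.astype(str).values.tolist()
    return same_columns and same_values

toy = pd.DataFrame({"country": ["India", "Japan"], "gender": ["Male", "Female"]})
with tempfile.TemporaryDirectory() as tmp_dir:
    ok = roundtrip_matches(toy, os.path.join(tmp_dir, "sample_survey_data.csv"))
print(ok)
```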


## Summary

In this notebook, we have:

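1. ✅ Loaded processed benchmark data for the three core GRI benchmarks (country × gender × age, country × religion, country × environment) and two single-dimension benchmarks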
2. ✅ Loaded and examined survey data structure
3. ✅ Cleaned and standardized survey demographics
4. ✅ Validated category alignment between survey and benchmark
5. ✅ Applied country name standardization
6. ✅ Saved cleaned data for GRI calculation

**Next Steps:**
- Proceed to `2-gri-calculation-example.ipynb` to calculate GRI scores
- Use `3-advanced-analysis.ipynb` for detailed representativeness analysis

**Key Files Created:**
- `data/processed/sample_survey_data.csv` - Cleaned survey data ready for GRI calculation