# 4. Coarse Dimensions Demonstration

This notebook demonstrates the expanded GRI scorecard with coarser-grained dimensions for broader representativeness analysis.

## Overview

The **expanded GRI scorecard** now includes both fine-grained and coarse-grained dimensions:

### Fine-grained (Original)
- Country × Gender × Age
- Country × Religion  
- Country × Environment

### Coarse-grained (New)
- **Country**: Country-level representativeness
- **Region**: Regional representativeness (UN M49 regions)
- **Continent**: Continental representativeness
- **Gender**: Global gender representativeness
- **Age Group**: Global age representativeness
- **Religion**: Global religious representativeness
- **Environment**: Global urban/rural representativeness

**Why coarser dimensions matter:**
- Show representativeness at several levels of aggregation, from country-level cells up to whole continents
- Help separate geographic imbalances from demographic ones
- Let you match the level of detail to the analysis goal or reporting audience
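
To make the geographic-versus-demographic point concrete, here is a minimal sketch of how coarsening can hide an imbalance. It assumes the GRI is computed as one minus the total variation distance between sample and benchmark proportions. The notebook does not restate the formula, so treat `gri_score` as an illustrative stand-in rather than the module's implementation. The country labels, region labels, and proportions are all made up.

```python
import pandas as pd

def gri_score(sample: pd.Series, benchmark: pd.Series) -> float:
    """Assumed definition: 1 - total variation distance between two distributions."""
    categories = sample.index.union(benchmark.index)
    s = sample.reindex(categories, fill_value=0.0)
    b = benchmark.reindex(categories, fill_value=0.0)
    return 1.0 - 0.5 * (s - b).abs().sum()

# Hypothetical proportions for four countries in two regions
sample = pd.Series({'A': 0.50, 'B': 0.00, 'C': 0.30, 'D': 0.20})
benchmark = pd.Series({'A': 0.25, 'B': 0.25, 'C': 0.25, 'D': 0.25})
region_of = {'A': 'North', 'B': 'North', 'C': 'South', 'D': 'South'}

# Score at the fine (country) level
country_gri = gri_score(sample, benchmark)

# Score at the coarse (region) level: sum proportions within each region first
region_gri = gri_score(
    sample.groupby(region_of).sum(),
    benchmark.groupby(region_of).sum(),
)

print(f"Country-level score: {country_gri:.3f}")
print(f"Region-level score:  {region_gri:.3f}")
```

In this toy case the sample over-draws country A and misses B entirely. Both sit in the same region, so the regional score shows no problem while the country score does.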

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import the new GRI module
import sys
sys.path.append('..')
from gri import GRIAnalysis

# Set plotting style
plt.style.use('default')
sns.set_palette('husl')
pd.set_option('display.max_columns', None)

## 1. Create Sample Data with Geographic Diversity

In [ ]:
# Create comprehensive sample data with geographic diversity
np.random.seed(42)
n_participants = 1000

# Include diverse countries from different regions and continents
countries = [
    'United States', 'Canada',                # North America
    'Brazil', 'Argentina',                    # South America  
    'Germany', 'France', 'Spain', 'Poland',   # Europe
    'Nigeria', 'South Africa', 'Egypt',       # Africa
    'India', 'China', 'Japan', 'Indonesia',   # Asia
    'Australia'                               # Oceania
]

# Create sample survey data
sample_survey = pd.DataFrame({
    'country': np.random.choice(countries, n_participants, 
                               p=[0.15, 0.05, 0.10, 0.05, 0.08, 0.07, 0.06, 0.04,
                                  0.08, 0.07, 0.05, 0.10, 0.05, 0.03, 0.01, 0.01]),
    'gender': np.random.choice(['Male', 'Female'], n_participants, p=[0.52, 0.48]), 
    'age_group': np.random.choice([
        '18-25', '26-35', '36-45', '46-55', '56-65', '65+'
    ], n_participants, p=[0.20, 0.25, 0.20, 0.15, 0.12, 0.08]),
    'religion': np.random.choice([
        'Christianity', 'Islam', 'Hinduism', 'Buddhism', 
        'Judaism', 'I do not identify with any religious group or faith',
        'Other religious group'
    ], n_participants, p=[0.30, 0.20, 0.15, 0.10, 0.02, 0.18, 0.05]),
    'environment': np.random.choice(['Urban', 'Rural'], n_participants, p=[0.55, 0.45])
})

print(f"Sample survey: {len(sample_survey):,} participants from {sample_survey['country'].nunique()} countries")
print("\nGeographic distribution (top 10 countries):")
print(sample_survey['country'].value_counts().head(10))

## 2. Initialize GRI Analysis and Calculate Multiple Dimension Granularities

In [ ]:
# Initialize GRI analysis
analysis = GRIAnalysis(
    sample_survey,
    survey_source='global_dialogues',
    benchmark_dir='../data/processed'
)

# Calculate scorecard with different dimension granularities
print("Calculating GRI scores for different dimension granularities...")
print("=" * 60)

# Define dimensions from fine to coarse
dimension_sets = {
    'Fine-grained': ['Country × Gender × Age', 'Country × Religion', 'Country × Environment'],
    'Regional': ['Region', 'Region × Gender × Age', 'Region × Religion', 'Region × Environment'],
    'Continental': ['Continent'],
    'Global Demographics': ['Gender', 'Age Group', 'Religion', 'Environment'],
    'Geographic Only': ['Country', 'Region', 'Continent']
}

# Calculate and display each set
all_results = []
for granularity, dimensions in dimension_sets.items():
    print(f"\n{granularity}:")
    scorecard = analysis.calculate_scorecard(dimensions=dimensions)
    
    for _, row in scorecard.iterrows():
        if row['Dimension'] != 'AVERAGE':
            print(f"  • {row['Dimension']:.<35} GRI: {row['GRI Score']:.3f}, Diversity: {row['Diversity Score']:.3f}")
            all_results.append({
                'Granularity': granularity,
                'Dimension': row['Dimension'],
                'GRI Score': row['GRI Score'],
                'Diversity Score': row['Diversity Score']
            })

results_df = pd.DataFrame(all_results)

## 3. Demonstrate Geographic Mapping Capabilities

In [ ]:
# Show how the module handles geographic mappings
print("GEOGRAPHIC MAPPING DEMONSTRATION")
print("=" * 60)

# Display sample with geographic mappings
sample_with_geo = analysis.data.copy()
print("\nSample data with geographic mappings (first 10 rows):")
print(sample_with_geo[['country', 'region', 'continent']].head(10))

# Show geographic aggregation
print("\n\nGeographic distribution at different levels:")
print("\nBy Continent:")
print(sample_with_geo['continent'].value_counts())
print("\nBy Region (top 5):")
print(sample_with_geo['region'].value_counts().head())

# Compare scores at different geographic levels
geo_comparison = analysis.calculate_scorecard(dimensions=['Country', 'Region', 'Continent'])
print("\n\nGeographic Representativeness Comparison:")
print(geo_comparison[geo_comparison['Dimension'] != 'AVERAGE'][['Dimension', 'GRI Score', 'Diversity Score']])

## 4. Visualize Dimension Comparisons Using Module Capabilities

In [ ]:
# Compare dimensions at several granularities with a custom matplotlib chart
# built from the module's scorecard output
dimensions_to_compare = [
    'Country × Gender × Age',
    'Region × Gender × Age', 
    'Country',
    'Region',
    'Continent',
    'Gender',
    'Age Group'
]

# Calculate scores for visualization
viz_scorecard = analysis.calculate_scorecard(dimensions=dimensions_to_compare)

# Create custom visualization showing granularity levels
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8))

# Define colors by granularity level
granularity_colors = {
    'Country × Gender × Age': '#1f77b4',  # Fine-grained
    'Region × Gender × Age': '#ff7f0e',   # Medium-grained
    'Country': '#2ca02c',                 # Geographic
    'Region': '#d62728',                  # Geographic
    'Continent': '#9467bd',               # Geographic
    'Gender': '#8c564b',                  # Demographic
    'Age Group': '#e377c2'                # Demographic
}

# Filter out AVERAGE row
viz_data = viz_scorecard[viz_scorecard['Dimension'] != 'AVERAGE'].copy()

# Plot GRI scores
colors = [granularity_colors[dim] for dim in viz_data['Dimension']]
bars1 = ax1.bar(range(len(viz_data)), viz_data['GRI Score'], color=colors, alpha=0.8)
ax1.set_xlabel('Dimension', fontsize=12)
ax1.set_ylabel('GRI Score', fontsize=12)
ax1.set_title('GRI Scores: Fine to Coarse Dimensions', fontsize=14)
ax1.set_xticks(range(len(viz_data)))
ax1.set_xticklabels(viz_data['Dimension'], rotation=45, ha='right')
ax1.set_ylim(0, 1)
ax1.grid(axis='y', alpha=0.3)

# Add reference line and show its label in a legend
ax1.axhline(y=0.5, color='red', linestyle='--', alpha=0.5, label='Midpoint (0.5)')
ax1.legend(loc='upper right')

# Plot Diversity scores
bars2 = ax2.bar(range(len(viz_data)), viz_data['Diversity Score'], color=colors, alpha=0.8)
ax2.set_xlabel('Dimension', fontsize=12)
ax2.set_ylabel('Diversity Score', fontsize=12)
ax2.set_title('Diversity Scores: Fine to Coarse Dimensions', fontsize=14)
ax2.set_xticks(range(len(viz_data)))
ax2.set_xticklabels(viz_data['Dimension'], rotation=45, ha='right')
ax2.set_ylim(0, 1)
ax2.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

# Summary statistics by granularity
print("\nSUMMARY BY GRANULARITY:")
print("=" * 50)
for _, row in viz_data.iterrows():
    dim = row['Dimension']
    gri = row['GRI Score']
    div = row['Diversity Score']
    
    if '×' in dim:
        gran_type = "Multi-factor"
    elif dim in ['Country', 'Region', 'Continent']:
        gran_type = "Geographic"
    else:
        gran_type = "Demographic"
    
    print(f"{dim:<25} ({gran_type:<12}): GRI={gri:.3f}, Diversity={div:.3f}")

## 5. Educational Insights: Why Dimension Choices Matter

In [ ]:
# Demonstrate the impact of dimension choices
print("UNDERSTANDING DIMENSION CHOICES")
print("=" * 60)

# Compare fine vs coarse geographic dimensions
geo_fine = analysis.calculate_dimension('Country')
geo_medium = analysis.calculate_dimension('Region') 
geo_coarse = analysis.calculate_dimension('Continent')

print("\n📍 GEOGRAPHIC GRANULARITY COMPARISON:")
print(f"Country-level GRI:   {geo_fine['gri_score']:.3f} (captures {geo_fine['coverage']:.1%} of countries)")
print(f"Region-level GRI:    {geo_medium['gri_score']:.3f} (captures {geo_medium['coverage']:.1%} of regions)")
print(f"Continent-level GRI: {geo_coarse['gri_score']:.3f} (captures {geo_coarse['coverage']:.1%} of continents)")

# Show how aggregation works
print("\n\n🔄 AGGREGATION EXAMPLE:")
print("When we aggregate from Country to Region, the module:")
print("1. Maps each country to its UN region")
print("2. Sums proportions within each region")
print("3. Compares to regional benchmarks")

# Example: Show countries in a specific region
example_region = 'Northern America'
countries_in_region = analysis.data[analysis.data['region'] == example_region]['country'].unique()
print(f"\nExample - Countries in '{example_region}':")
for country in sorted(countries_in_region):
    count = len(analysis.data[analysis.data['country'] == country])
    print(f"  • {country}: {count} participants")

# Educational comparison of dimension types
print("\n\n📊 DIMENSION TYPE COMPARISON:")

# Multi-factor vs single-factor
multi_factor = analysis.calculate_dimension('Country × Gender × Age')
single_country = analysis.calculate_dimension('Country')
single_gender = analysis.calculate_dimension('Gender')

print("\nMulti-factor (Country × Gender × Age):")
print(f"  • GRI: {multi_factor['gri_score']:.3f}")
print("  • Captures interactions between demographics")
print("  • Most detailed, but high scores are hardest to achieve")

print("\nSingle-factor Geographic (Country):")
print(f"  • GRI: {single_country['gri_score']:.3f}")
print("  • Shows pure geographic representation")
print("  • Easier to interpret and to target for improvement")

print("\nSingle-factor Demographic (Gender):")
print(f"  • GRI: {single_gender['gri_score']:.3f}")
print("  • Shows global demographic balance")
print("  • High scores are often easier to achieve")

## 6. Practical Applications and Recommendations

In [ ]:
# Generate actionable insights using different dimension granularities
print("PRACTICAL APPLICATIONS OF DIMENSION CHOICES")
print("=" * 60)

# 1. Identify recruitment priorities
print("\n1️⃣ RECRUITMENT PRIORITIES:")

# Get all dimension scores
all_dims = analysis.calculate_scorecard()
low_scoring = all_dims[(all_dims['GRI Score'] < 0.5) & (all_dims['Dimension'] != 'AVERAGE')].sort_values('GRI Score')

print("\nLowest scoring dimensions (< 0.5 GRI):")
for _, row in low_scoring.head(5).iterrows():
    print(f"  • {row['Dimension']:.<40} GRI: {row['GRI Score']:.3f}")

# 2. Geographic vs Demographic analysis
print("\n\n2️⃣ GEOGRAPHIC VS DEMOGRAPHIC PATTERNS:")

geo_dims = ['Country', 'Region', 'Continent']
demo_dims = ['Gender', 'Age Group', 'Religion', 'Environment']

geo_scores = all_dims[all_dims['Dimension'].isin(geo_dims)]['GRI Score'].mean()
demo_scores = all_dims[all_dims['Dimension'].isin(demo_dims)]['GRI Score'].mean()

print(f"\nAverage Geographic GRI: {geo_scores:.3f}")
print(f"Average Demographic GRI: {demo_scores:.3f}")

if geo_scores < demo_scores:
    print("\n→ Geographic representation needs more attention than demographics")
    print("  Recommendation: Focus on recruiting from underrepresented regions/countries")
else:
    print("\n→ Demographic representation needs more attention than geography")
    print("  Recommendation: Target specific age groups, genders, or religious groups")

# 3. Use module's built-in analysis methods
print("\n\n3️⃣ TARGETED IMPROVEMENT STRATEGIES:")

# Get detailed dimension analysis
country_analysis = analysis.calculate_dimension('Country', return_details=True)

# Find most underrepresented countries
if 'details' in country_analysis:
    # Work on a copy so the added column does not mutate the module's result
    details = country_analysis['details'].copy()
    # Gap = benchmark share minus sample share; positive means underrepresented
    details['gap'] = details['population_proportion'] - details['sample_proportion']
    underrep = details[details['gap'] > 0].sort_values('gap', ascending=False).head(5)
    
    print("\nMost underrepresented countries:")
    for _, row in underrep.iterrows():
        print(f"  • {row['country']:.<30} Gap: {row['gap']:.3%}")

# 4. Reporting recommendations
print("\n\n4️⃣ REPORTING RECOMMENDATIONS:")
print("\nFor different audiences, use different dimension sets:")
print("  • Executive Summary: Use 'Continent' and single demographics")
print("  • Regional Teams: Use 'Region' and 'Region × Demographics'")
print("  • Detailed Analysis: Use full 'Country × Gender × Age' breakdown")
print("  • Public Reports: Balance geographic and demographic dimensions")

In [ ]:
# Create a comprehensive comparison visualization
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))

# 1. Geographic hierarchy comparison
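geo_dims = ['Country', 'Region', 'Continent']
# Sort rows into fine-to-coarse order so bars and colors match the title
geo_scores = (
    all_dims[all_dims['Dimension'].isin(geo_dims)]
    .sort_values('Dimension', key=lambda s: s.map(geo_dims.index))
)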

ax1.bar(geo_scores['Dimension'], geo_scores['GRI Score'], color=['#2ca02c', '#d62728', '#9467bd'])
ax1.set_title('Geographic Hierarchy: Fine to Coarse', fontsize=14)
ax1.set_ylabel('GRI Score')
ax1.set_ylim(0, 1)
ax1.grid(axis='y', alpha=0.3)

# 2. Dimension complexity comparison
complexity_order = ['Gender', 'Age Group', 'Country', 'Country × Gender × Age']
complexity_scores = all_dims[all_dims['Dimension'].isin(complexity_order)]

ax2.bar(range(len(complexity_order)), 
        [complexity_scores[complexity_scores['Dimension'] == d]['GRI Score'].values[0] for d in complexity_order],
        color=['#8c564b', '#e377c2', '#2ca02c', '#1f77b4'])
ax2.set_title('Dimension Complexity: Simple to Complex', fontsize=14)
ax2.set_xticks(range(len(complexity_order)))
ax2.set_xticklabels(complexity_order, rotation=45, ha='right')
ax2.set_ylabel('GRI Score')
ax2.set_ylim(0, 1)
ax2.grid(axis='y', alpha=0.3)

# 3. Scatter plot: GRI vs Diversity
ax3.scatter(all_dims[all_dims['Dimension'] != 'AVERAGE']['GRI Score'], 
           all_dims[all_dims['Dimension'] != 'AVERAGE']['Diversity Score'],
           s=100, alpha=0.6)
ax3.set_xlabel('GRI Score')
ax3.set_ylabel('Diversity Score')
ax3.set_title('GRI vs Diversity Scores Across All Dimensions', fontsize=14)
ax3.grid(True, alpha=0.3)

# Add annotations for interesting points
for _, row in all_dims[all_dims['Dimension'] != 'AVERAGE'].iterrows():
    if row['GRI Score'] > 0.8 or row['Diversity Score'] > 0.8:
        ax3.annotate(row['Dimension'], (row['GRI Score'], row['Diversity Score']), 
                    xytext=(5, 5), textcoords='offset points', fontsize=8)

# 4. Category comparison
categories = {
    'Fine-grained': ['Country × Gender × Age', 'Country × Religion', 'Country × Environment'],
    'Geographic': ['Country', 'Region', 'Continent'],
    'Demographic': ['Gender', 'Age Group', 'Religion', 'Environment']
}

cat_means = []
cat_names = []
for cat_name, dims in categories.items():
    mean_score = all_dims[all_dims['Dimension'].isin(dims)]['GRI Score'].mean()
    cat_means.append(mean_score)
    cat_names.append(cat_name)

ax4.bar(cat_names, cat_means, color=['#1f77b4', '#ff7f0e', '#2ca02c'])
ax4.set_title('Average GRI by Dimension Category', fontsize=14)
ax4.set_ylabel('Average GRI Score')
ax4.set_ylim(0, 1)
ax4.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

## Summary

This notebook demonstrated how the **GRI module** handles different dimension granularities:

### ✅ Key Module Features Demonstrated

1. **Built-in Geographic Mapping**
   - Automatic country → region → continent mapping
   - Uses UN M49 standard regional classifications
   - No manual mapping required

2. **Flexible Dimension Calculation**
   - `calculate_scorecard(dimensions=[...])` for custom dimension sets
   - `calculate_dimension()` for individual dimension analysis
   - Support for fine-grained to coarse dimensions

3. **Aggregation Handling**
   - Module automatically aggregates from fine to coarse dimensions
   - Maintains proper proportion calculations at each level
   - Handles missing data gracefully

4. **Analysis Capabilities**
   - Compare scores across dimension granularities
   - Identify geographic vs demographic gaps
   - Generate actionable recruitment insights
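
The three aggregation steps described in the notebook (map each country to its region, sum proportions within each region, compare to regional benchmarks) can be sketched with plain pandas. This is an illustration under stated assumptions, not the module's internals: the `country_to_region` dictionary covers only four countries, and the `regional_benchmark` values are hypothetical.

```python
import pandas as pd

# Partial, hand-written mapping for illustration only
country_to_region = {
    'United States': 'Northern America',
    'Canada': 'Northern America',
    'Brazil': 'Latin America and the Caribbean',
    'Argentina': 'Latin America and the Caribbean',
}

survey = pd.DataFrame({'country': [
    'United States', 'United States', 'Canada', 'Brazil',
    'Argentina', 'Brazil', 'United States', 'Canada',
]})

# Step 1: map each country to its region
survey['region'] = survey['country'].map(country_to_region)

# Step 2: sum country-level sample proportions within each region
country_share = survey['country'].value_counts(normalize=True)
region_share = country_share.groupby(country_to_region).sum()

# Step 3: compare to regional benchmarks (hypothetical numbers)
regional_benchmark = pd.Series({
    'Northern America': 0.4,
    'Latin America and the Caribbean': 0.6,
})
gap = (regional_benchmark - region_share).sort_values(ascending=False)

print(region_share)
print("\nBenchmark minus sample (positive = underrepresented):")
print(gap)
```

Summing country proportions gives the same regional shares as counting participants by the mapped `region` column directly. The two-step form is shown because it mirrors the description above.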

### ✅ Educational Takeaways

- **Why dimension choices matter**: Different granularities reveal different patterns
- **Trade-offs**: Fine dimensions are more precise, but high scores are harder to achieve on them
- **Practical applications**: Choose dimensions based on audience and goals
- **Strategic insights**: Use coarse dimensions to identify broad patterns, fine dimensions for specific gaps
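
One reason fine dimensions tend to score lower is sampling noise alone. The sketch below draws random samples directly from a benchmark distribution, so the sampling process is unbiased by construction, and scores them at different numbers of cells. It assumes the score is one minus the total variation distance, which is an assumption about the GRI formula rather than something stated in this notebook. The uniform benchmarks and the cell counts (2, 16, 192) are illustrative choices, with 192 standing in for a 16 × 2 × 6 cross of country, gender, and age group.

```python
import numpy as np

rng = np.random.default_rng(0)
n_participants = 1000
n_trials = 200

mean_scores = {}
for n_cells in (2, 16, 192):
    # Uniform benchmark over n_cells categories (illustrative)
    benchmark = np.full(n_cells, 1.0 / n_cells)

    # Draw n_trials unbiased samples and convert counts to proportions
    counts = rng.multinomial(n_participants, benchmark, size=n_trials)
    sample_props = counts / n_participants

    # Assumed score: 1 - total variation distance, computed per trial
    scores = 1.0 - 0.5 * np.abs(sample_props - benchmark).sum(axis=1)
    mean_scores[n_cells] = scores.mean()

for n_cells, score in mean_scores.items():
    print(f"{n_cells:>4} cells: mean score {score:.3f}")
```

Under these assumptions the average score falls as the number of cells grows, even though every sample comes from the benchmark itself. A lower fine-grained score therefore does not by itself show a recruitment bias.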

### ✅ Code Reduction Benefits

The new module structure significantly reduces code complexity:
- No manual benchmark loading
- No manual geographic mapping
- Built-in visualization support
- Standardized dimension configurations
- Consistent API across all operations

The module makes it easy to experiment with different dimension granularities while maintaining robust analysis capabilities.