# 2. GRI Calculation Example

This notebook demonstrates how to calculate the Global Representativeness Index (GRI) using the new `GRIAnalysis` class.

## Overview

The **GRI Scorecard** provides a comprehensive assessment of survey representativeness across three key dimensions:
1. **Country × Gender × Age**
2. **Country × Religion**
3. **Country × Environment (Urban/Rural)**

For each dimension, we calculate:
- **GRI Score**: How closely the sample's proportions match the benchmark proportions (0.0 to 1.0; higher is better)
- **Diversity Score**: How broadly the sample covers the segments in each dimension (0.0 to 1.0)
- **Max Possible Score**: The theoretical maximum score given the current participants
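
To make the GRI Score concrete, here is a minimal sketch that treats it as one minus the total variation distance (TVD) between sample and benchmark proportions. This definition is an assumption inferred from the "TVD contribution" comments in the cells further down; the `gri_score` helper and the proportions are hypothetical, and the library's own implementation may differ.

```python
def gri_score(sample: dict[str, float], benchmark: dict[str, float]) -> float:
    """Return 1 - TVD between two proportion tables keyed by segment.

    Assumption: GRI = 1 - TVD. This is inferred from the notebook's
    "TVD contribution" comments, not taken from the library source.
    """
    # Segments missing from one table count as proportion 0.0 there.
    segments = set(sample) | set(benchmark)
    tvd = 0.5 * sum(
        abs(sample.get(seg, 0.0) - benchmark.get(seg, 0.0)) for seg in segments
    )
    return 1.0 - tvd


# Hypothetical proportions for three segments (illustrative values only).
benchmark = {"A": 0.5, "B": 0.3, "C": 0.2}
sample = {"A": 0.7, "B": 0.3}  # segment C has no participants

score = gri_score(sample, benchmark)
print(f"GRI: {score:.2f}")
```

Under this assumed definition, a sample identical to the benchmark scores 1.0, and every segment that is over- or under-represented lowers the score by half of its absolute deviation.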

The notebook also includes:
- Top contributing segments analysis
- Visualizations of over/under-represented groups
- Impact analysis showing potential improvements

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from gri import GRIAnalysis

# Set plotting style
plt.style.use('default')
sns.set_palette('husl')

# Set pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

## 1. Initialize GRI Analysis

In [None]:
# Initialize GRI Analysis with GD3 survey data
# First check if we have the raw data from the submodule
raw_path = Path('../data/raw/survey_data/global-dialogues/Data/GD3/GD3_participants.csv')
processed_path = Path('../data/processed/gd3_demographics.csv')

if raw_path.exists():
    gri = GRIAnalysis.from_survey_file(
        filepath=str(raw_path),
        survey_type='gd'
    )
elif processed_path.exists():
    # Load from processed file
    survey_data = pd.read_csv(processed_path)
    gri = GRIAnalysis(survey_data=survey_data, survey_name='GD3')
else:
    raise FileNotFoundError("No GD3 data found. Please run 'git submodule update --init --recursive'")

# Show summary
print("Survey loaded successfully!")
print(f"Participants: {len(gri.survey_data):,}")
print(f"Countries: {gri.survey_data['country'].nunique()}")
print(f"Top 5 countries: {', '.join(gri.survey_data['country'].value_counts().head().index)}")

## 2. Calculate GRI Scorecard

In [None]:
# Calculate complete scorecard with maximum possible scores
scorecard = gri.calculate_scorecard(include_max_possible=True)

# Display results
print("=== GRI SCORECARD ===\n")
print(scorecard.to_string(index=False, float_format='%.4f'))

## 3. Visualize Results

In [None]:
# Create visualization
fig = gri.plot_scorecard()
plt.show()

## 4. Top Contributing Segments Analysis

Let's identify which demographic segments contribute most to non-representativeness.
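
As a rough illustration of what "contributing to non-representativeness" means, the sketch below ranks segments by absolute deviation on a toy table. The column names mirror those used in the next cell, but the data, the `tvd_contribution` column, and the sign convention (sample minus benchmark) are assumptions made for this example rather than the documented behavior of `get_top_segments`.

```python
import pandas as pd

# Hypothetical segment table (illustrative values only).
segments = pd.DataFrame({
    "country": ["Country A", "Country A", "Country B", "Country C"],
    "gender": ["Female", "Male", "Female", "Male"],
    "age_group": ["18-25", "26-35", "18-25", "36-45"],
    "sample_proportion": [0.10, 0.40, 0.30, 0.20],
    "benchmark_proportion": [0.30, 0.30, 0.20, 0.20],
})

# Assumed sign convention: positive = over-represented, negative = under-represented.
segments["deviation"] = (
    segments["sample_proportion"] - segments["benchmark_proportion"]
)
segments["abs_deviation"] = segments["deviation"].abs()

# Each segment contributes half of its absolute deviation to the TVD.
segments["tvd_contribution"] = segments["abs_deviation"] / 2

# Keep the two segments with the largest absolute deviation.
top = segments.sort_values("abs_deviation", ascending=False).head(2)
print(top[["country", "gender", "age_group", "deviation", "tvd_contribution"]])
```

Summing `tvd_contribution` over all segments gives the total variation distance, so the segments at the top of this ranking are the ones that pull the score down the most.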

In [None]:
# Get top contributing segments for Country × Gender × Age
top_age_gender = gri.get_top_segments('Country × Gender × Age', n=10)

print("=== TOP CONTRIBUTING SEGMENTS: Country × Gender × Age ===\n")
print("Top 10 segments contributing to non-representativeness:")

# Only format the table if every column used below is present
display_cols = ['country', 'gender', 'age_group', 'deviation',
                'sample_proportion', 'benchmark_proportion']
required_cols = display_cols + ['abs_deviation']

if all(col in top_age_gender.columns for col in required_cols):
    print(top_age_gender[display_cols].to_string(index=False))
    
    # Estimate an upper bound on the improvement: each segment contributes
    # half of its absolute deviation to the total variation distance (TVD)
    current_gri = scorecard[scorecard['dimension'] == 'Country × Gender × Age']['gri_score'].values[0]
    potential_improvement = top_age_gender['abs_deviation'].sum() / 2
    improved_gri = min(1.0, current_gri + potential_improvement)  # GRI cannot exceed 1.0
    print(f"\nCurrent GRI: {current_gri:.4f}")
    print(f"If top 10 deviations were fixed (upper bound): {improved_gri:.4f} (+{improved_gri - current_gri:.4f})")
else:
    print(top_age_gender.head(10))

## 5. Visualize Over/Under-Represented Groups

In [None]:
# Visualize top deviations
gri.plot_top_deviations('Country × Gender × Age', n=15)
plt.show()

## 6. Country-Level Analysis

Let's analyze representation at the country level to identify geographic gaps.
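
A top-n list covers only some of each country's segments, so country totals computed from it are partial. The sketch below shows the same aggregation over a complete segment table, with a third label for countries that are neither over- nor under-represented. The data, the `label` helper, and the `net_deviation` column are hypothetical and only meant to illustrate the idea.

```python
import pandas as pd

# Hypothetical Country × Religion segments (illustrative values only).
segments = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C"],
    "religion": ["X", "Y", "X", "Y", "X"],
    "sample_proportion": [0.30, 0.20, 0.10, 0.10, 0.30],
    "benchmark_proportion": [0.20, 0.10, 0.25, 0.15, 0.30],
})

# Sum the proportions of every segment within each country.
by_country = segments.groupby("country", as_index=False)[
    ["sample_proportion", "benchmark_proportion"]
].sum()

by_country["net_deviation"] = (
    by_country["sample_proportion"] - by_country["benchmark_proportion"]
)


def label(net: float, tol: float = 1e-9) -> str:
    """Classify a country's net deviation, treating tiny values as balanced."""
    if net > tol:
        return "Over"
    if net < -tol:
        return "Under"
    return "Balanced"


by_country["representation"] = by_country["net_deviation"].map(label)
print(by_country.to_string(index=False))
```

Using the signed net deviation rather than the summed absolute deviation lets over- and under-represented segments within a country offset each other, which answers a different question: whether the country as a whole has too many or too few participants.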

In [None]:
# Get the 20 largest segment-level deviations for the religion dimension
country_religion = gri.get_top_segments('Country × Religion', n=20)

# Aggregate those segments by country. Note that the totals cover only
# the top 20 segments, not every segment in each country.
if 'country' in country_religion.columns:
    country_summary = country_religion.groupby('country').agg({
        'abs_deviation': 'sum',
        'sample_proportion': 'sum',
        'benchmark_proportion': 'sum'
    }).reset_index()
    
    # Add representation status
    country_summary['representation'] = np.where(
        country_summary['sample_proportion'] > country_summary['benchmark_proportion'],
        'Over', 'Under'
    )
    
    # Sort by impact
    country_summary = country_summary.sort_values('abs_deviation', ascending=False).head(10)
    
    print("=== TOP COUNTRY-LEVEL DEVIATIONS ===\n")
    print(country_summary.to_string(index=False, float_format='%.4f'))
else:
    print("Country-level analysis not available with current data structure")
    print(country_religion.head(10))

## 7. Impact Analysis

Let's analyze the cumulative impact of fixing deviations to understand the path to better representativeness.
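
The projection used in this section can be sketched on its own with toy numbers. It assumes that fixing a segment removes half of its absolute deviation from the TVD; because proportions must sum to one, correcting one segment also shifts others, so the result is best read as an optimistic upper bound. The values and the `segments_needed` helper are hypothetical.

```python
from itertools import accumulate
from typing import Optional

current_gri = 0.60  # hypothetical current score

# Hypothetical absolute deviations of the largest segments, sorted descending.
abs_deviations = [0.20, 0.10, 0.06, 0.04]

# Running total of the assumed gain: half of each absolute deviation.
cumulative_gain = list(accumulate(d / 2 for d in abs_deviations))

# A GRI score cannot exceed 1.0, so cap the projection.
projected = [min(1.0, current_gri + gain) for gain in cumulative_gain]


def segments_needed(target: float) -> Optional[int]:
    """Return how many top segments must be fixed to reach target, if any."""
    for n, score in enumerate(projected, start=1):
        if score >= target:
            return n
    return None  # target not reachable with the listed segments


for target in (0.65, 0.72, 0.90):
    print(target, segments_needed(target))
```

Returning `None` for unreachable targets makes it explicit when the listed segments are not enough, instead of silently reporting the first segment.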

In [None]:
# Create impact analysis for all dimensions
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

dimensions = ['Country × Gender × Age', 'Country × Religion', 'Country × Environment']

for idx, dim in enumerate(dimensions):
    title = dim  # dimension names double as plot titles
    
    # Get top deviations
    deviations = gri.get_top_segments(dim, n=20)
    
    # Calculate cumulative improvement
    current_gri = scorecard[scorecard['dimension'] == title]['gri_score'].values[0]
    
    if 'abs_deviation' in deviations.columns:
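        # Upper-bound estimate: each segment contributes half its absolute deviation to the TVD
        cumulative_impact = (deviations['abs_deviation'] / 2).cumsum()
        improved_gri = (current_gri + cumulative_impact).clip(upper=1.0)  # GRI cannot exceed 1.0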
        
        # Plot
        ax = axes[idx]
        ax.plot(range(1, len(cumulative_impact) + 1), improved_gri, 'b-', linewidth=2)
        ax.axhline(y=current_gri, color='r', linestyle='--', alpha=0.7, label=f'Current: {current_gri:.3f}')
        
        if 'max_possible_score' in scorecard.columns:
            max_score = scorecard[scorecard['dimension'] == title]['max_possible_score'].values[0]
            ax.axhline(y=max_score, color='g', linestyle='--', alpha=0.7, label='Max Possible')
        
        ax.set_xlabel('Number of Segments Fixed')
        ax.set_ylabel('GRI Score')
        ax.set_title(f'Impact Analysis: {title}')
        ax.set_ylim(0, 1)
        ax.grid(True, alpha=0.3)
        ax.legend()
        
        # Annotate the first point at which each not-yet-reached target is met
        for target in [0.5, 0.6, 0.7, 0.8]:
            if current_gri < target <= improved_gri.max():
                segments_needed = int((improved_gri >= target).to_numpy().argmax()) + 1
                ax.annotate(f'{target:.1f}', xy=(segments_needed, target), 
                           xytext=(segments_needed + 1, target + 0.02),
                           arrowprops=dict(arrowstyle='->', alpha=0.5))
    else:
        axes[idx].text(0.5, 0.5, 'Data structure not compatible', 
                      ha='center', va='center', transform=axes[idx].transAxes)

plt.tight_layout()
plt.show()

## 8. Save Results

Save the complete analysis for reporting and future reference.

In [None]:
# Make sure the output directory exists before writing
output_dir = Path('../data/processed')
output_dir.mkdir(parents=True, exist_ok=True)

# Export the complete results as JSON
gri.export_results(format='json', filepath=str(output_dir / 'gd3_gri_analysis.json'))

# Save the scorecard summary
scorecard.to_csv(output_dir / 'gd3_gri_scorecard.csv', index=False)

# Save top deviations for each dimension
for dim, suffix in zip(['Country × Gender × Age', 'Country × Religion', 'Country × Environment'],
                       ['age_gender', 'religion', 'environment']):
    top_devs = gri.get_top_segments(dim, n=20)
    top_devs.to_csv(output_dir / f'gd3_top_deviations_{suffix}.csv', index=False)

print("Results saved:")
print("  - gd3_gri_analysis.json (complete results)")
print("  - gd3_gri_scorecard.csv (scorecard summary)")
print("  - gd3_top_deviations_*.csv (top contributing segments)")

## Summary

This notebook demonstrated the new streamlined GRI analysis workflow:

1. **Simple Initialization**: Just one line to create a GRIAnalysis instance
2. **Complete Scorecard**: All metrics calculated with `calculate_scorecard()`
3. **Top Contributing Segments**: Identified which demographics need better representation
4. **Impact Analysis**: Showed how fixing top deviations would improve scores
5. **Max Possible Scores**: Revealed theoretical limits given current participants

**Key Insights from the Analysis:**
- The survey shows moderate representativeness (Average GRI: ~0.43)
- Country × Religion has the highest GRI score, while Country × Gender × Age needs the most improvement
- Fixing just the top 10-15 demographic segments could significantly improve representativeness
- The max possible scores show there's room for improvement even with optimal weighting

**Next Steps:**
- Use insights to guide targeted recruitment for under-represented segments
- Apply sample weights based on the deviation analysis
- Compare results across different surveys to track improvement over time
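
As one possible starting point for the sample-weighting step, the sketch below computes simple post-stratification weights (benchmark proportion divided by sample proportion). This is a standard technique swapped in for illustration, not a method provided by `GRIAnalysis`, and the data is hypothetical. Segments with no participants cannot be weighted at all, which is one reason targeted recruitment remains necessary.

```python
import pandas as pd

# Hypothetical segment proportions (illustrative values only).
segments = pd.DataFrame({
    "segment": ["A", "B", "C", "D"],
    "sample_proportion": [0.50, 0.30, 0.20, 0.00],
    "benchmark_proportion": [0.35, 0.35, 0.20, 0.10],
})

# Segments without participants cannot be reweighted; set them aside.
covered = segments[segments["sample_proportion"] > 0].copy()
uncovered = segments[segments["sample_proportion"] == 0]

# Weight > 1 boosts under-represented segments, weight < 1 shrinks
# over-represented ones.
covered["weight"] = covered["benchmark_proportion"] / covered["sample_proportion"]

# After weighting, covered segments match their benchmark proportions.
covered["weighted_proportion"] = covered["sample_proportion"] * covered["weight"]

print(covered[["segment", "weight", "weighted_proportion"]].to_string(index=False))
print("Unweightable segments:", list(uncovered["segment"]))
```

In this toy case the covered segments can be matched to their benchmark exactly, while segment D's share of the benchmark stays unreachable through weighting alone.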