# 3. Advanced GRI Analysis

This notebook demonstrates advanced analysis techniques built on the GRI module:
- Monte Carlo simulations to understand achievable GRI scores
- Comprehensive reporting and visualization capabilities
- Alignment checking and segment-level analysis
- Survey comparison tools
- Sample size analysis and optimization

## Overview

The GRI module provides tools to:
1. **Simulate maximum achievable GRI scores** given real-world constraints
2. **Identify specific over/under-represented groups** with detailed segment analysis
3. **Compare multiple surveys** to track progress over time
4. **Generate comprehensive reports** with actionable insights
5. **Optimize sample sizes** for different representativeness targets

In [None]:
import json
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Import the GRI module components from the parent directory
sys.path.append('..')
from gri import GRIAnalysis
from gri.simulation import monte_carlo_max_scores, generate_sample_size_curve
from gri.reports import generate_comparison_report
from gri.visualization import create_comparison_plot

# Set plotting style
plt.style.use('default')
sns.set_palette('RdYlBu_r')

# Set pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_rows', 20)

## 1. Initialize GRI Analysis

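The `GRIAnalysis` class is the single entry point for GRI calculations and analysis: it loads the survey and benchmark data and exposes the scoring, segment, alignment, and reporting methods used in the rest of this notebook.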

In [None]:
# Initialize GRI analysis for multiple surveys
surveys = {}
for survey_name in ['GD1', 'GD2', 'GD3']:
    try:
        # Create GRIAnalysis instance
        analysis = GRIAnalysis(
            survey_file=f'../data/processed/{survey_name.lower()}_demographics.csv',
            benchmark_dir='../data/processed'
        )
        
        # Calculate GRI scores
        results = analysis.calculate_gri()
        
        surveys[survey_name] = {
            'analysis': analysis,
            'results': results,
            'n_participants': len(analysis.survey_data)
        }
        
        print(f"{survey_name}:")
        print(f"  Participants: {surveys[survey_name]['n_participants']}")
        print(f"  Overall GRI: {results['average_gri']:.4f}")
        print(f"    - Country×Gender×Age: {results['gri_country_gender_age']:.4f}")
        print(f"    - Country×Religion: {results['gri_country_religion']:.4f}")
        print(f"    - Country×Environment: {results['gri_country_environment']:.4f}")
        print()
        
    except FileNotFoundError:
        print(f"{survey_name}: Data file not found")
        print()

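# Select primary survey for detailed analysis
primary_survey = 'GD3'
if primary_survey not in surveys:
    # Without this check, `analysis` and `results` would silently keep the
    # values from the last survey loaded in the loop above (or be undefined)
    raise RuntimeError(f"{primary_survey} data not loaded; cannot continue")

analysis = surveys[primary_survey]['analysis']
results = surveys[primary_survey]['results']
print(f"Selected {primary_survey} for detailed analysis")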

## 2. Monte Carlo Simulation: Understanding Achievable GRI Scores

One key question is: "What's the maximum GRI score we could achieve given real-world constraints?"

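The `monte_carlo_max_scores` function simulates many random samples (1,000 in the cell below) to estimate the maximum achievable score. Because the result comes from a finite number of simulations, it is an empirical estimate rather than a proven upper bound.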
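
To make the idea concrete, here is a minimal sketch of such a simulation. It is not the `gri.simulation` implementation. It assumes GRI is computed as 1 minus the total variation distance between sample and benchmark proportions, draws each simulated sample directly from the benchmark distribution (so it ignores the real-world constraints mentioned above), and uses a hypothetical four-segment benchmark. The helper names `gri_score` and `simulate_gri` are illustrative only.

```python
import numpy as np


def gri_score(sample_props, benchmark_props):
    """GRI as 1 minus total variation distance (an assumed definition)."""
    sample_props = np.asarray(sample_props, dtype=float)
    benchmark_props = np.asarray(benchmark_props, dtype=float)
    return 1.0 - 0.5 * np.abs(sample_props - benchmark_props).sum()


def simulate_gri(benchmark_props, sample_size, n_simulations=1000, seed=0):
    """Draw random samples from the benchmark and score each one."""
    rng = np.random.default_rng(seed)
    benchmark_props = np.asarray(benchmark_props, dtype=float)
    # Each row holds the segment counts of one simulated sample.
    counts = rng.multinomial(sample_size, benchmark_props, size=n_simulations)
    props = counts / sample_size
    return 1.0 - 0.5 * np.abs(props - benchmark_props).sum(axis=1)


# Hypothetical benchmark with four segments (illustrative numbers only).
benchmark = np.array([0.4, 0.3, 0.2, 0.1])
scores = simulate_gri(benchmark, sample_size=500, n_simulations=1000)

# Score of a hypothetical observed sample and its rank among the simulations.
actual_gri = gri_score([0.5, 0.3, 0.15, 0.05], benchmark)
percentile = (scores <= actual_gri).mean() * 100

print(f"Simulated max: {scores.max():.4f}, mean: {scores.mean():.4f}")
print(f"Actual GRI: {actual_gri:.4f} (percentile {percentile:.1f})")
```

Even samples drawn from the benchmark itself typically score below 1.0 because of sampling noise, which is why a simulated maximum is a more realistic reference point than a perfect score.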

In [None]:
# Run Monte Carlo simulation to find maximum achievable GRI scores
print("Running Monte Carlo simulation (this may take a moment)...")

# Simulate with current sample size
simulation_results = monte_carlo_max_scores(
    analysis.survey_data,
    analysis.benchmarks,
    n_simulations=1000,
    sample_size=len(analysis.survey_data)
)

print("\n🎯 MONTE CARLO SIMULATION RESULTS")
print("=" * 60)
print(f"Sample size: {len(analysis.survey_data)} participants")
print(f"Simulations run: {simulation_results['n_simulations']}")
print(f"\nActual {primary_survey} GRI: {results['average_gri']:.4f}")
print(f"Simulated maximum GRI: {simulation_results['max_average_gri']:.4f}")
print(f"Gap to maximum: {simulation_results['max_average_gri'] - results['average_gri']:.4f}")
print(f"\nPercentile of actual score: {simulation_results['actual_percentile']:.1f}%")

# Show dimension-specific results
print("\nDimension-specific maximum scores:")
for dim, score in simulation_results['max_scores_by_dimension'].items():
    actual = results[f'gri_{dim}']
    print(f"  {dim}: max={score:.4f}, actual={actual:.4f}, gap={score-actual:.4f}")

# Visualize the distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Distribution of simulated scores
ax1.hist(simulation_results['simulated_scores'], bins=30, alpha=0.7, color='skyblue', edgecolor='black')
ax1.axvline(results['average_gri'], color='red', linestyle='--', linewidth=2, label=f'Actual GRI ({results["average_gri"]:.3f})')
ax1.axvline(simulation_results['max_average_gri'], color='green', linestyle='--', linewidth=2, label=f'Max GRI ({simulation_results["max_average_gri"]:.3f})')
ax1.set_xlabel('Average GRI Score')
ax1.set_ylabel('Frequency')
ax1.set_title('Distribution of Simulated GRI Scores')
ax1.legend()

# Compare actual vs maximum by dimension
dimensions = list(simulation_results['max_scores_by_dimension'].keys())
actual_scores = [results[f'gri_{dim}'] for dim in dimensions]
max_scores = [simulation_results['max_scores_by_dimension'][dim] for dim in dimensions]

x = np.arange(len(dimensions))
width = 0.35

ax2.bar(x - width/2, actual_scores, width, label='Actual', alpha=0.8)
ax2.bar(x + width/2, max_scores, width, label='Maximum', alpha=0.8)
ax2.set_xlabel('Dimension')
ax2.set_ylabel('GRI Score')
ax2.set_title('Actual vs Maximum GRI by Dimension')
ax2.set_xticks(x)
ax2.set_xticklabels([d.replace('_', '\n') for d in dimensions], rotation=0)
ax2.legend()
ax2.set_ylim(0, 1)

plt.tight_layout()
plt.show()

print(f"\n💡 Insight: The {primary_survey} survey's GRI falls at percentile {simulation_results['actual_percentile']:.0f}")
print(f"of the simulated samples of the same size, i.e. {'above' if simulation_results['actual_percentile'] > 50 else 'below'} the simulated median.")

## 3. Segment Analysis: Identifying Key Representativeness Gaps

The `GRIAnalysis` class provides segment-level analysis to identify which specific demographic groups drive representativeness gaps.
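
As a rough illustration of what a segment-level breakdown involves, the sketch below computes per-segment deviations with pandas. The column names mirror those printed in the next cell, but the formulas are assumptions rather than the module's confirmed definitions: deviation is the difference in share in percentage points, the GRI contribution is half the absolute deviation (each segment's part of the total variation distance), and segments absent from the benchmark are treated as infinitely over-represented. The data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical segment shares (illustrative numbers only).
segments = pd.DataFrame({
    "segment_id": ["A", "B", "C", "D"],
    "sample_prop": [0.50, 0.30, 0.15, 0.05],
    "benchmark_prop": [0.45, 0.30, 0.25, 0.00],
})

diff = segments["sample_prop"] - segments["benchmark_prop"]

# Deviation from the benchmark share, in percentage points.
segments["deviation_pp"] = diff * 100

# Assumed contribution: half the absolute deviation, in percentage points.
segments["gri_contribution"] = 0.5 * diff.abs() * 100

# Relative over/under-representation; a zero benchmark share gives infinity.
relative = diff / segments["benchmark_prop"].replace(0, np.nan) * 100
segments["over_under_representation_pct"] = relative.fillna(np.inf)

# Rank segments by how much they pull the score down.
top = segments.sort_values("gri_contribution", ascending=False)
print(top.to_string(index=False))
```

Under these assumptions the contributions sum to the total gap between the sample's GRI and a perfect score, so the ranking shows directly where recruitment effort would matter most.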

In [None]:
# Get top contributing segments using the analysis module
top_segments = analysis.get_top_segments(n=20)

print("TOP 10 SEGMENTS CONTRIBUTING TO GRI DEVIATION")
print("=" * 80)

for i, segment in enumerate(top_segments.head(10).itertuples(), 1):
    # Format representation
    if segment.over_under_representation_pct == float('inf'):
        rep_str = "∞% (new segment)"
    elif segment.over_under_representation_pct > 1000:
        rep_str = f"+{segment.over_under_representation_pct:.0f}%"
    else:
        rep_str = f"{segment.over_under_representation_pct:+.1f}%"
    
    print(f"{i:2d}. {segment.segment_id}")
    print(f"    Dimension: {segment.dimension}")
    print(f"    GRI Impact: {segment.gri_contribution:.3f} percentage points")
    print(f"    Deviation: {segment.deviation_pp:+.2f} pp from expected")
    print(f"    Representation: {rep_str}")
    print(f"    Category: {segment.category}")
    print()

# Visualize segment analysis
analysis.plot_segment_analysis(top_n=15)

# Get actionable recommendations
print("\n🎯 ACTIONABLE RECOMMENDATIONS")
print("=" * 80)

recommendations = analysis.get_recommendations()
for rec_type, segments in recommendations.items():
    if segments:
        print(f"\n{rec_type.upper()}:")
        for seg in segments[:5]:  # Show top 5 per category
            print(f"  • {seg['segment']}: {seg['action']}")

## 4. Sample Size Analysis: How Many Participants Do We Need?

Understanding the relationship between sample size and achievable GRI scores helps in planning future surveys.
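
The shape of that relationship can be previewed with a small standalone simulation. This is a sketch under stated assumptions, not the `generate_sample_size_curve` implementation: GRI is taken to be 1 minus total variation distance, samples are drawn at random from a hypothetical benchmark of 50 equally sized segments, and the function name `sample_size_curve` is illustrative.

```python
import numpy as np


def sample_size_curve(benchmark_props, sample_sizes, n_simulations=200, seed=0):
    """Summarize simulated GRI scores for each candidate sample size."""
    rng = np.random.default_rng(seed)
    benchmark_props = np.asarray(benchmark_props, dtype=float)
    rows = []
    for n in sample_sizes:
        # Simulate random samples of size n drawn from the benchmark.
        counts = rng.multinomial(n, benchmark_props, size=n_simulations)
        scores = 1.0 - 0.5 * np.abs(counts / n - benchmark_props).sum(axis=1)
        p5, p50, p95 = np.percentile(scores, [5, 50, 95])
        rows.append({
            "sample_size": n,
            "mean": scores.mean(),
            "max": scores.max(),
            "p5": p5,
            "p50": p50,
            "p95": p95,
        })
    return rows


# Hypothetical benchmark: 50 equally sized segments.
benchmark = np.full(50, 1 / 50)
curve = sample_size_curve(benchmark, [100, 500, 2000])

for row in curve:
    print(f"n={row['sample_size']:>5}: mean={row['mean']:.3f}, "
          f"5th-95th percentile=[{row['p5']:.3f}, {row['p95']:.3f}]")
```

Sampling noise shrinks roughly with the square root of the sample size, so each additional participant buys less improvement than the one before. The more segments a dimension has, the larger the sample needed to reach a given score.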

In [None]:
# Generate sample size curve
print("Analyzing sample size vs achievable GRI...")
sample_size_analysis = generate_sample_size_curve(
    analysis.survey_data,
    analysis.benchmarks,
    sample_sizes=[100, 250, 500, 750, 1000, 1500, 2000, 3000, 5000],
    n_simulations_per_size=100
)

# Visualize the results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Plot average and max GRI vs sample size
ax1.plot(sample_size_analysis['sample_sizes'], 
         sample_size_analysis['average_gri_scores'], 
         'o-', label='Average GRI', linewidth=2, markersize=8)
ax1.plot(sample_size_analysis['sample_sizes'], 
         sample_size_analysis['max_gri_scores'], 
         's-', label='Maximum GRI', linewidth=2, markersize=8)
ax1.axhline(y=results['average_gri'], color='red', linestyle='--', 
            label=f'Current {primary_survey} GRI')
ax1.set_xlabel('Sample Size')
ax1.set_ylabel('GRI Score')
ax1.set_title('GRI Score vs Sample Size')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_xscale('log')

# Plot the 5th to 95th percentile band of simulated scores
ax2.fill_between(sample_size_analysis['sample_sizes'],
                 sample_size_analysis['percentile_5'],
                 sample_size_analysis['percentile_95'],
                 alpha=0.3, label='5th to 95th percentile')
ax2.plot(sample_size_analysis['sample_sizes'],
         sample_size_analysis['percentile_50'],
         'o-', label='Median GRI', linewidth=2)
ax2.axhline(y=results['average_gri'], color='red', linestyle='--',
            label=f'Current {primary_survey} GRI')
ax2.set_xlabel('Sample Size')
ax2.set_ylabel('GRI Score')
ax2.set_title('GRI Score Distribution by Sample Size')
ax2.legend()
ax2.grid(True, alpha=0.3)
ax2.set_xscale('log')

plt.tight_layout()
plt.show()

# Find optimal sample size for target GRI
target_gri = 0.7
optimal_size = None
for i, size in enumerate(sample_size_analysis['sample_sizes']):
    if sample_size_analysis['max_gri_scores'][i] >= target_gri:
        optimal_size = size
        break

print(f"\n💡 SAMPLE SIZE INSIGHTS:")
print(f"  • Current sample size: {len(analysis.survey_data)}")
print(f"  • Current GRI: {results['average_gri']:.4f}")
if optimal_size:
    print(f"  • Minimum sample size for GRI ≥ {target_gri}: ~{optimal_size:,} participants")
else:
    print(f"  • Even with {max(sample_size_analysis['sample_sizes']):,} participants, GRI {target_gri} may be challenging")
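# Estimate where returns diminish instead of hard-coding a sample size:
# the first size after which the average GRI gains less than 0.01 (assumed cutoff)
avg_scores = np.asarray(sample_size_analysis['average_gri_scores'])
small_gains = np.flatnonzero(np.diff(avg_scores) < 0.01)
if small_gains.size:
    print(f"  • Diminishing returns appear after ~{sample_size_analysis['sample_sizes'][small_gains[0]]:,} participants")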

## 5. Survey Comparison: Tracking Progress Over Time

Compare multiple surveys to understand trends and improvements in representativeness.
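
A comparison of this kind boils down to lining up the score dictionaries and differencing them against a baseline. The sketch below does this with plain pandas. The result keys match those returned by `calculate_gri()` earlier in this notebook, but the wave names and all numbers are hypothetical placeholders, not real survey scores.

```python
import pandas as pd

# Hypothetical scores for three survey waves (illustrative numbers only).
example_results = {
    "Wave 1": {"average_gri": 0.50, "gri_country_gender_age": 0.40,
               "gri_country_religion": 0.55, "gri_country_environment": 0.55},
    "Wave 2": {"average_gri": 0.55, "gri_country_gender_age": 0.45,
               "gri_country_religion": 0.60, "gri_country_environment": 0.60},
    "Wave 3": {"average_gri": 0.60, "gri_country_gender_age": 0.50,
               "gri_country_religion": 0.65, "gri_country_environment": 0.65},
}

# One row per survey, one column per score.
scores = pd.DataFrame.from_dict(example_results, orient="index")

# Absolute and relative change against the first wave.
baseline = scores.loc["Wave 1"]
change = scores.sub(baseline, axis="columns")
pct_change = change.div(baseline, axis="columns") * 100

best = scores["average_gri"].idxmax()

print(scores.round(3))
print(change.round(3))
print(f"Best average GRI: {best}")
```

Reporting both the absolute change and the percentage change is useful because a gain of 0.05 means more for a dimension that started at 0.40 than for one that started at 0.80. The percentage form is undefined when the baseline score is zero.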

In [None]:
# Compare all available surveys
if len(surveys) > 1:
    # Create comparison visualization
    comparison_fig = create_comparison_plot(
        [s['results'] for s in surveys.values()],
        survey_names=list(surveys.keys()),
        plot_type='both'  # Shows both radar and bar charts
    )
    
    # Generate detailed comparison report
    comparison_data = []
    for name, data in surveys.items():
        comparison_data.append({
            'survey_name': name,
            'results': data['results'],
            'n_participants': data['n_participants'],
            'survey_data': data['analysis'].survey_data
        })
    
    report = generate_comparison_report(comparison_data)
    
    print("\n📊 SURVEY COMPARISON REPORT")
    print("=" * 80)
    print(report)
    
    # Identify best practices from highest-scoring survey
    best_survey = max(surveys.items(), key=lambda x: x[1]['results']['average_gri'])
    print(f"\n🏆 Best performing survey: {best_survey[0]} (GRI: {best_survey[1]['results']['average_gri']:.4f})")
    
    # Show improvements over time
    if 'GD1' in surveys and 'GD3' in surveys:
        gd1_gri = surveys['GD1']['results']['average_gri']
        gd3_gri = surveys['GD3']['results']['average_gri']
        improvement = gd3_gri - gd1_gri
        print(f"\n📈 Progress from GD1 to GD3: {improvement:+.4f} GRI points ({improvement/gd1_gri*100:+.1f}%)")
else:
    print("Only one survey available. Load multiple surveys to see comparisons.")

## 6. Alignment Analysis: Understanding Representation Patterns

Check how well the survey aligns with the global population across different demographic cuts.
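
The logic of a threshold-based alignment check can be sketched in a few lines. This is an assumption about how such a check might work, not the `check_alignment` implementation: the threshold is applied to the absolute difference in shares on the proportion scale (so 0.02 corresponds to 2 percentage points), the deviation is reported in percentage points, and the segment names and shares are hypothetical.

```python
def check_alignment_sketch(sample_props, benchmark_props, threshold=0.02):
    """Flag each segment as aligned if its share is within the threshold."""
    results = []
    for segment, expected in benchmark_props.items():
        # Segments missing from the sample count as a share of zero.
        observed = sample_props.get(segment, 0.0)
        results.append({
            "segment": segment,
            "deviation": (observed - expected) * 100,  # percentage points
            "aligned": abs(observed - expected) <= threshold,
        })
    return results


# Hypothetical age-group shares (illustrative numbers only).
benchmark = {"18-25": 0.20, "26-35": 0.25, "36-45": 0.20, "46+": 0.35}
sample = {"18-25": 0.21, "26-35": 0.30, "36-45": 0.19, "46+": 0.30}

alignment = check_alignment_sketch(sample, benchmark, threshold=0.02)
n_aligned = sum(r["aligned"] for r in alignment)
share_aligned = n_aligned / len(alignment) * 100 if alignment else 0.0

for r in alignment:
    status = "aligned" if r["aligned"] else "misaligned"
    print(f"{r['segment']}: {r['deviation']:+.1f} pp ({status})")
print(f"Aligned: {share_aligned:.0f}%")
```

Keeping the threshold and the reported deviation on clearly labelled scales avoids a common mix-up between proportions (0.02) and percentage points (2).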

In [None]:
# Check alignment across different demographic cuts
alignment_results = analysis.check_alignment(threshold=0.02)  # 2 percentage point threshold

print("ALIGNMENT ANALYSIS")
print("=" * 80)
print(f"Checking segments that deviate by more than 2 percentage points...\n")

# Summarize alignment by dimension
for dimension, segments in alignment_results.items():
    aligned = [s for s in segments if s['aligned']]
    misaligned = [s for s in segments if not s['aligned']]
    
    print(f"{dimension}:")
    print(f"  Aligned segments: {len(aligned)}/{len(segments)} ({len(aligned)/len(segments)*100:.1f}%)")
    print(f"  Misaligned segments: {len(misaligned)}")
    
    if misaligned:
        print("  Top misalignments:")
        # Sort by deviation magnitude
        sorted_misaligned = sorted(misaligned, key=lambda x: abs(x['deviation']), reverse=True)
        for seg in sorted_misaligned[:3]:
            print(f"    • {seg['segment']}: {seg['deviation']:+.2f} pp "
                  f"({'over' if seg['deviation'] > 0 else 'under'}-represented)")
    print()

# Visualize alignment patterns
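# One panel per dimension, so no dimension is silently dropped by zip()
n_dims = len(alignment_results)
fig, axes = plt.subplots(1, n_dims, figsize=(5 * n_dims, 5), squeeze=False)

for dimension, ax in zip(alignment_results.keys(), axes.ravel()):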
    segments = alignment_results[dimension]
    
    # Extract data for plotting
    segment_names = [s['segment'] for s in segments]
    deviations = [s['deviation'] for s in segments]
    colors = ['green' if s['aligned'] else 'red' for s in segments]
    
    # Sort by deviation for better visualization
    sorted_indices = sorted(range(len(deviations)), key=lambda i: deviations[i])
    
    # Plot horizontal bars
    y_pos = range(len(segments))
    bars = ax.barh(y_pos, [deviations[i] for i in sorted_indices])
    
    # Color bars based on alignment
    for bar, i in zip(bars, sorted_indices):
        bar.set_color(colors[i])
    
    # Customize plot
    ax.set_yticks(y_pos)
    ax.set_yticklabels([segment_names[i][:20] + '...' if len(segment_names[i]) > 20 
                        else segment_names[i] for i in sorted_indices], fontsize=8)
    ax.axvline(x=0, color='black', linewidth=0.8)
    ax.axvline(x=2, color='gray', linestyle='--', alpha=0.5)
    ax.axvline(x=-2, color='gray', linestyle='--', alpha=0.5)
    ax.set_xlabel('Deviation (pp)')
    ax.set_title(f'{dimension.replace("_", " ").title()}')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Overall alignment score
total_segments = sum(len(segments) for segments in alignment_results.values())
total_aligned = sum(len([s for s in segments if s['aligned']]) 
                   for segments in alignment_results.values())
# Guard against an empty result set to avoid a division by zero
overall_alignment = total_aligned / total_segments * 100 if total_segments else 0.0

print(f"\n📊 OVERALL ALIGNMENT SCORE: {overall_alignment:.1f}%")
print(f"   {total_aligned} out of {total_segments} segments are within ±2pp of expected values")

## 7. Generate Comprehensive Report

The GRI module can generate detailed reports in multiple formats for sharing with stakeholders.
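
When saving results alongside a report, note that `json.dump` cannot encode NumPy arrays or NumPy integers directly. The cell below handles this with `default=str`, which turns an array into a string such as `"[0.61 0.64]"`. The sketch below shows an alternative converter that keeps numeric structure intact. The function name `to_jsonable` and the sample values are hypothetical.

```python
import json

import numpy as np
import pandas as pd


def to_jsonable(value):
    """Convert NumPy and pandas objects to plain Python types for json."""
    if isinstance(value, np.integer):
        return int(value)
    if isinstance(value, np.floating):
        return float(value)
    if isinstance(value, np.ndarray):
        return value.tolist()
    if isinstance(value, pd.DataFrame):
        return value.to_dict("records")
    if isinstance(value, pd.Series):
        return value.to_dict()
    # Fall back to a string for anything else json cannot encode.
    return str(value)


# Hypothetical summary mixing NumPy and pandas values.
summary = {
    "n_participants": np.int64(1000),
    "average_gri": np.float64(0.62),
    "simulated_scores": np.array([0.61, 0.64]),
    "top_segments": pd.DataFrame({"segment_id": ["A"], "deviation_pp": [5.0]}),
}

text = json.dumps(summary, indent=2, default=to_jsonable)
restored = json.loads(text)
print(text)
```

With this converter the saved file can be loaded back into lists and numbers without parsing strings, which makes the results easier to reuse in later notebooks.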

In [None]:
# Generate and save comprehensive report
report_path = analysis.generate_report(
    output_dir='../analysis_output',
    format='html',  # Can be 'html', 'pdf', or 'markdown'
    include_visualizations=True
)

print(f"✅ Comprehensive report generated: {report_path}")

# Also save key results for future use
results_summary = {
    'survey': primary_survey,
    'n_participants': len(analysis.survey_data),
    'gri_scores': results,
    'top_segments': top_segments.head(20).to_dict('records'),
    'monte_carlo': simulation_results,
    'recommendations': recommendations,
    'alignment_score': overall_alignment
}

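output_file = f'../analysis_output/{primary_survey.lower()}_advanced_analysis.json'
# Create the output directory if it does not exist yet
import os
os.makedirs(os.path.dirname(output_file), exist_ok=True)
with open(output_file, 'w', encoding='utf-8') as f:
    # default=str stringifies values json cannot encode (e.g. NumPy arrays)
    json.dump(results_summary, f, indent=2, default=str)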

print(f"✅ Analysis results saved: {output_file}")

# Display summary dashboard
print("\n" + "="*80)
print(f"ADVANCED ANALYSIS SUMMARY - {primary_survey}")
print("="*80)
print(f"Overall GRI Score: {results['average_gri']:.4f}")
print(f"Percentile Rank: {simulation_results['actual_percentile']:.0f}")
print(f"Alignment Score: {overall_alignment:.1f}%")
print(f"Top Impact Segment: {top_segments.iloc[0]['segment_id']}")
print(f"Maximum Achievable GRI: {simulation_results['max_average_gri']:.4f}")
print(f"Gap to Maximum: {simulation_results['max_average_gri'] - results['average_gri']:.4f}")
print("\n🎯 Key Actions:")
for i, rec in enumerate(recommendations.get('high_priority', [])[:3], 1):
    print(f"  {i}. {rec['action']}")

## Summary

This notebook demonstrated the advanced capabilities of the GRI module:

1. **Monte Carlo Simulations** estimated the maximum achievable GRI and showed how the current survey compares
2. **Segment Analysis** identified specific demographic groups driving representativeness gaps
3. **Sample Size Analysis** showed the relationship between participant count and achievable GRI scores
4. **Survey Comparisons** tracked progress across multiple survey iterations
5. **Alignment Analysis** provided a detailed view of representation patterns
6. **Comprehensive Reporting** generated shareable reports with actionable insights

### Key Takeaways

- The GRI module wraps scoring, simulation, segment analysis, and reporting in a handful of calls, so the analysis needs little custom code
- Monte Carlo simulations help set realistic expectations for achievable representativeness
- Segment-level analysis provides specific, actionable recruitment targets
- Built-in visualization and reporting features make it easy to share findings

### Next Steps

1. Use the generated reports to plan recruitment strategies
2. Focus on the high-priority segments identified in the analysis
3. Consider the sample size recommendations for future surveys
4. Track progress by comparing GRI scores across survey iterations

For more details, see the comprehensive report generated in the `analysis_output` directory.