# Diversity and Segregation Analysis: Canadian Metropolitan Areas

## Overview
This notebook replicates the blog post "Diversity and Segregation" in Python. It uses the pycancensus package and 2016 Census data to measure visible minority diversity and segregation across Canadian metropolitan areas.

- **Original Analysis**: R blogdown post by Dmitry Shkolnik
- **Data Source**: Statistics Canada 2016 Census via the CensusMapper API
- **Key Metrics**:
  - Theil's Entropy Index (diversity)
  - Theil's Segregation Index
  - Visible minority group distributions

### Research Questions:
1. Which Canadian cities are the most diverse?
2. How are visible minority groups distributed across different metropolitan areas?
3. What patterns of segregation exist at different geographic scales?
4. How do diversity and segregation relate to each other?

## Setup and Imports

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# Import pycancensus
import pycancensus as pc

# Clear cache and check API key
pc.clear_cache()
print(f"🔑 API key status: {'✅ Set' if pc.get_api_key() else '❌ Not set'}")

# Set up plotting
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)

print("📊 Libraries loaded successfully!")

## Data Collection: Visible Minority Variables

Based on the original R analysis, we'll collect data on visible minority populations across major Canadian CMAs.

In [None]:
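# Define visible minority vectors from 2016 Census (based on original R analysis)
# These correspond to the parent vector v_CA16_3954 and its children.
# The labels in the comments below are annotations only. Later cells match on the
# labels returned by the API, so confirm them against the CensusMapper variable list.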
visible_minority_vectors = [
    'v_CA16_3954',  # Total visible minority population
    'v_CA16_3957',  # South Asian
    'v_CA16_3960',  # Chinese  
    'v_CA16_3963',  # Black
    'v_CA16_3966',  # Filipino
    'v_CA16_3969',  # Latin American
    'v_CA16_3972',  # Arab
    'v_CA16_3975',  # Southeast Asian
    'v_CA16_3978',  # West Asian
    'v_CA16_3981',  # Korean
    'v_CA16_3984',  # Japanese
    'v_CA16_3987',  # Visible minority, n.i.e.
    'v_CA16_3990',  # Multiple visible minorities
    'v_CA16_3993'   # Not a visible minority
]

# Total population vector
population_vector = ['v_CA16_1']  # Total population

all_vectors = visible_minority_vectors + population_vector

print(f"📊 Collecting data for {len(all_vectors)} variables:")
print(f"   - Visible minority groups: {len(visible_minority_vectors)}")
print(f"   - Population: {len(population_vector)}")

# Define major Canadian CMAs (based on original analysis)
major_cmas = {
    'Toronto': '35535',
    'Vancouver': '59933', 
    'Montreal': '24462',
    'Calgary': '48825',
    'Edmonton': '48835',
    'Ottawa-Gatineau': '35505'
}

print(f"\n🏙️  Analyzing {len(major_cmas)} major CMAs:")
for name, code in major_cmas.items():
    print(f"   - {name}: {code}")

## Data Collection: Census Tract Level Data

Following the original analysis, we'll collect Census Tract (CT) level data for detailed diversity calculations.

In [ ]:
# Collect Census Tract level data for diversity analysis
print("🔄 Collecting Census Tract data for major CMAs...")

ct_data = {}
for cma_name, cma_code in major_cmas.items():
    print(f"\n📍 Fetching data for {cma_name} (CMA {cma_code})...")
    
    try:
        # Get CT level data with geography
        data = pc.get_census(
            dataset='CA16',
            regions={'CMA': cma_code},
            vectors=all_vectors,
            level='CT',  # Census Tract level
            geo_format='geopandas',
            quiet=False
        )
        
        # Add CMA identifier
        data['CMA_name'] = cma_name
        data['CMA_code'] = cma_code
        
        ct_data[cma_name] = data
        print(f"   ✅ {cma_name}: {len(data)} Census Tracts collected")
        
    except Exception as e:
        print(f"   ❌ {cma_name}: Failed to collect data - {e}")
        continue

print(f"\n✅ Data collection complete: {len(ct_data)} CMAs")
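for cma_name, data in ct_data.items():
    # Format the number only when the population column exists; applying the
    # ',' format spec to the string 'N/A' would raise a ValueError.
    total_pop = f"{data['pop'].sum():,.0f}" if 'pop' in data.columns else 'N/A'
    print(f"   - {cma_name}: {len(data)} CTs, Population: {total_pop}")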

In [ ]:
# Inspect data structure using the first available CMA as example
if ct_data:
    sample_cma = list(ct_data.keys())[0]
    sample_data = ct_data[sample_cma]
    
    print(f"📋 Data Structure Analysis ({sample_cma} sample):")
    print(f"Shape: {sample_data.shape}")
    print(f"\n📊 Column types:")
    for col, dtype in sample_data.dtypes.items():
        print(f"   {col}: {dtype}")
    
    print(f"\n🔍 Vector columns (visible minority data):")
    vector_cols = [col for col in sample_data.columns if col.startswith('v_CA16_')]
    for col in vector_cols[:5]:  # Show first 5
        sample_vals = sample_data[col].dropna().head(3).tolist()
        print(f"   {col[:60]}...: {sample_data[col].dtype} - samples: {sample_vals}")
    
    if len(vector_cols) > 5:
        print(f"   ... and {len(vector_cols) - 5} more vector columns")
    
    print(f"\n📈 Population summary:")
    if 'pop' in sample_data.columns:
        print(f"   Total population: {sample_data['pop'].sum():,}")
        print(f"   Avg CT population: {sample_data['pop'].mean():.0f}")
        print(f"   Population dtype: {sample_data['pop'].dtype}")
else:
    print("❌ No data available for inspection")

## Diversity Index Calculation

Implementation of Theil's Entropy Index (E) to measure diversity, following the original R analysis methodology.
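
As a quick check on the formula, the sketch below computes the entropy for a single set of group counts using only the standard library. The helper name `entropy_from_counts` is hypothetical and not part of the notebook. It also assumes the denominator is the sum of the supplied counts, which differs from the cell below, where each count is divided by the tract's `pop` column.

```python
import math

def entropy_from_counts(counts):
    """Theil's entropy E = -sum(p_i * ln(p_i)) for one area.

    Hypothetical helper for illustration: proportions are taken relative to
    the sum of the supplied counts, and zero counts are skipped because
    p * ln(p) tends to 0 as p tends to 0.
    """
    total = sum(counts)
    if total <= 0:
        return float("nan")
    return sum(-(c / total) * math.log(c / total) for c in counts if c > 0)

# Four equally sized groups reach the maximum for k = 4, which is ln(4).
even = entropy_from_counts([250, 250, 250, 250])
# A single group has no diversity at all, so the index is zero.
single = entropy_from_counts([1000, 0, 0, 0])
print(round(even, 3), single)
```

The maximum of the index is ln(k) for k groups, so values computed over different numbers of groups are not directly comparable.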

In [ ]:
def calculate_diversity_index(df, group_columns, total_pop_column='pop'):
    """
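    Calculate Theil's Entropy Index (E) for diversity measurement.
    
    The entropy index measures how evenly the population is spread across groups.
    E = -Σ(p_i * ln(p_i)), where p_i is group i's share of the tract's total
    population. Groups with a zero or missing count are skipped.

    Returns a Series aligned to df.index; tracts with missing or
    non-positive population get NaN.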
    """
    
    # Create a copy to work with
    data = df.copy()
    
    # Calculate proportions for each group
    diversity_scores = []
    
    for idx, row in data.iterrows():
        total_pop = row[total_pop_column]
        
        if pd.isna(total_pop) or total_pop <= 0:
            diversity_scores.append(np.nan)
            continue
            
        # Calculate proportions for each group
        proportions = []
        for col in group_columns:
            if col in row and not pd.isna(row[col]):
                prop = row[col] / total_pop
                if prop > 0:  # Only include non-zero proportions
                    proportions.append(prop)
        
        # Calculate entropy index
        if proportions:
            entropy = -sum(p * np.log(p) for p in proportions if p > 0)
        else:
            entropy = 0
            
        diversity_scores.append(entropy)
    
    return pd.Series(diversity_scores, index=data.index)


def get_minority_columns(df):
    """
    Find visible minority columns in the dataframe and create shorter, readable names.
    """
    # Look for visible minority related columns
    minority_cols = [col for col in df.columns if col.startswith('v_CA16_') and 
                    ('visible minority' in col.lower() or 'south asian' in col.lower() or 
                     'chinese' in col.lower() or 'black' in col.lower() or 'filipino' in col.lower() or
                     'latin american' in col.lower() or 'arab' in col.lower() or 'korean' in col.lower() or
                     'japanese' in col.lower() or 'west asian' in col.lower() or 'southeast asian' in col.lower())]
    
    # Create mapping to shorter names
    name_mapping = {}
    usable_cols = []
    
    for col in minority_cols:
        # Extract the meaningful part of the name
        if 'south asian' in col.lower():
            short_name = 'South_Asian'
        elif 'chinese' in col.lower():
            short_name = 'Chinese'
        elif 'black' in col.lower():
            short_name = 'Black'
        elif 'filipino' in col.lower():
            short_name = 'Filipino'
        elif 'latin american' in col.lower():
            short_name = 'Latin_American'
        elif 'arab' in col.lower():
            short_name = 'Arab'
        elif 'korean' in col.lower():
            short_name = 'Korean'
        elif 'japanese' in col.lower():
            short_name = 'Japanese'
        elif 'west asian' in col.lower():
            short_name = 'West_Asian'
        elif 'southeast asian' in col.lower():
            short_name = 'Southeast_Asian'
        elif 'total visible minority' in col.lower() and 'population' in col.lower():
            short_name = 'Total_Visible_Minority'
        elif 'multiple visible minorities' in col.lower():
            short_name = 'Multiple_Visible_Minorities'
        elif 'visible minority, n.i.e' in col.lower():
            short_name = 'Other_Visible_Minority'
        else:
            continue  # Skip columns we can't identify
        
        name_mapping[col] = short_name
        usable_cols.append(col)
    
    return usable_cols, name_mapping


def calculate_cma_diversity(ct_data_dict):
    """
    Calculate diversity indices for all CMAs and their Census Tracts.
    """
    print("🔢 Calculating diversity indices...")
    
    results = {}
    
    for cma_name, data in ct_data_dict.items():
        print(f"\n📊 Processing {cma_name}...")
        
        # Find minority group columns
        minority_cols, name_mapping = get_minority_columns(data)
        
        print(f"   🔍 Found {len(minority_cols)} minority group columns:")
        for col in minority_cols[:5]:  # Show first 5
            short_name = name_mapping.get(col, 'Unknown')
            print(f"     • {short_name}")
        if len(minority_cols) > 5:
            print(f"     • ... and {len(minority_cols) - 5} more")
        
        if len(minority_cols) < 3:  # Need at least 3 groups for meaningful diversity
            print(f"   ⚠️  Too few group columns for {cma_name} ({len(minority_cols)} found), skipping...")
            continue
        
        # Create working copy with shorter column names
        data_copy = data.copy()
        
        # Rename columns to shorter names for easier working
        for orig_col, short_name in name_mapping.items():
            if orig_col in data_copy.columns:
                data_copy[short_name] = data_copy[orig_col]
        
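        # Use the renamed columns for diversity calculation.
        # Exclude the overall visible-minority total: it is the sum of the individual
        # groups, so including it would count every person twice in the entropy.
        short_col_names = list(name_mapping.values())
        available_short_cols = [
            col for col in short_col_names
            if col in data_copy.columns and col != 'Total_Visible_Minority'
        ]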
        
        # Calculate diversity index for each Census Tract
        data_copy['diversity_index'] = calculate_diversity_index(
            data_copy, 
            available_short_cols, 
            'pop'
        )
        
        # Calculate summary statistics
        valid_diversity = data_copy['diversity_index'].dropna()
        
        if len(valid_diversity) > 0:
            summary = {
                'cma_name': cma_name,
                'total_cts': len(data_copy),
                'valid_cts': len(valid_diversity),
                'mean_diversity': valid_diversity.mean(),
                'median_diversity': valid_diversity.median(),
                'max_diversity': valid_diversity.max(),
                'min_diversity': valid_diversity.min(),
                'total_population': data_copy['pop'].sum(),
                'available_groups': len(available_short_cols)
            }
            
            results[cma_name] = {
                'data': data_copy,
                'summary': summary,
                'group_columns': available_short_cols,
                'original_columns': minority_cols,
                'name_mapping': name_mapping
            }
            
            print(f"   ✅ Diversity calculated: {len(valid_diversity)} valid CTs")
            print(f"   📈 Mean diversity: {summary['mean_diversity']:.3f}")
            print(f"   📊 Range: {summary['min_diversity']:.3f} - {summary['max_diversity']:.3f}")
            print(f"   👥 Groups used: {len(available_short_cols)}")
        else:
            print(f"   ❌ No valid diversity calculations for {cma_name}")
    
    return results

# Calculate diversity for all CMAs
print("🚀 Starting diversity calculation...")
diversity_results = calculate_cma_diversity(ct_data)

print(f"\n🎯 Diversity calculation complete for {len(diversity_results)} CMAs")

## CMA-Level Diversity Comparison

Compare diversity levels across major Canadian metropolitan areas.
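
The comparison in this section ranks CMAs by the mean of their tract-level indices. A complementary view, sketched here under stated assumptions, computes the entropy of the pooled CMA population and Theil's segregation index H, which the overview lists as a key metric. The function names and toy counts are hypothetical. The formula is the standard multigroup form H = Σ_j t_j (E − E_j) / (E · T), where t_j is the population of tract j, T the CMA total, E the pooled entropy and E_j the entropy of tract j. The original post's exact method may differ.

```python
import math

def entropy(counts):
    """Entropy of one area from group counts (zero counts are skipped)."""
    total = sum(counts)
    return sum(-(c / total) * math.log(c / total) for c in counts if c > 0)

def theil_h(tracts):
    """Theil's H for a list of tracts, each a list of group counts.

    Illustrative sketch: assumes every tract lists the same groups in the
    same order and that the groups cover each tract's whole population.
    """
    n_groups = len(tracts[0])
    pooled = [sum(t[g] for t in tracts) for g in range(n_groups)]
    total_pop = sum(pooled)
    e_total = entropy(pooled)
    if e_total == 0:
        return 0.0  # only one group present, so there is nothing to segregate
    # Population-weighted gap between the pooled entropy and each tract's entropy
    weighted_gap = sum(sum(t) * (e_total - entropy(t)) for t in tracts if sum(t) > 0)
    return weighted_gap / (e_total * total_pop)

# Toy data (made up): two groups, two tracts.
fully_mixed = [[50, 50], [50, 50]]      # each tract mirrors the CMA as a whole
fully_separated = [[100, 0], [0, 100]]  # each tract holds a single group
print(theil_h(fully_mixed), theil_h(fully_separated))
```

H is 0 when every tract has the same composition as the CMA and 1 when no tract contains more than one group, so a CMA can be highly diverse overall and still highly segregated.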

In [ ]:
# Create CMA comparison dataframe
if diversity_results:
    cma_comparison = pd.DataFrame([
        result['summary'] for result in diversity_results.values()
    ])
    
    # Sort by mean diversity
    cma_comparison = cma_comparison.sort_values('mean_diversity', ascending=False)
    
    print("🏆 CMA Diversity Rankings (2016 Census):")
    print("=" * 60)
    
    # enumerate() gives the rank directly instead of searching the index on every row
    for rank, (_, row) in enumerate(cma_comparison.iterrows(), start=1):
        print(f"{rank:2d}. {row['cma_name']:<15} "
              f"Diversity: {row['mean_diversity']:.3f} "
              f"(Population: {row['total_population']:>8,})")
    
    # Create visualization
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    fig.suptitle('Diversity Analysis: Canadian Metropolitan Areas (2016)', fontsize=16, fontweight='bold')
    
    # 1. CMA diversity comparison
    bars = axes[0,0].bar(range(len(cma_comparison)), cma_comparison['mean_diversity'], 
                        alpha=0.7, color='skyblue', edgecolor='black')
    axes[0,0].set_xticks(range(len(cma_comparison)))
    axes[0,0].set_xticklabels(cma_comparison['cma_name'], rotation=45, ha='right')
    axes[0,0].set_ylabel('Mean Diversity Index')
    axes[0,0].set_title('Diversity by CMA')
    axes[0,0].grid(True, alpha=0.3)
    
    # Add value labels on bars
    for bar, value in zip(bars, cma_comparison['mean_diversity']):
        axes[0,0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01, 
                      f'{value:.3f}', ha='center', va='bottom', fontsize=9)
    
    # 2. Population vs Diversity scatter
    scatter = axes[0,1].scatter(cma_comparison['total_population'], cma_comparison['mean_diversity'], 
                               s=100, alpha=0.7, c='orange', edgecolor='black')
    axes[0,1].set_xlabel('Total Population')
    axes[0,1].set_ylabel('Mean Diversity Index')
    axes[0,1].set_title('Population vs Diversity')
    
    # Add CMA labels
    for idx, row in cma_comparison.iterrows():
        axes[0,1].annotate(row['cma_name'], 
                          (row['total_population'], row['mean_diversity']),
                          xytext=(5, 5), textcoords='offset points', fontsize=9)
    
    # 3. Diversity range (min/max) by CMA
    x_pos = range(len(cma_comparison))
    axes[1,0].errorbar(x_pos, cma_comparison['mean_diversity'],
                      yerr=[cma_comparison['mean_diversity'] - cma_comparison['min_diversity'],
                            cma_comparison['max_diversity'] - cma_comparison['mean_diversity']],
                      fmt='o', capsize=5, capthick=2, alpha=0.7)
    axes[1,0].set_xticks(x_pos)
    axes[1,0].set_xticklabels(cma_comparison['cma_name'], rotation=45, ha='right')
    axes[1,0].set_ylabel('Diversity Index')
    axes[1,0].set_title('Diversity Range by CMA')
    axes[1,0].grid(True, alpha=0.3)
    
    # 4. Number of Census Tracts
    bars = axes[1,1].bar(x_pos, cma_comparison['total_cts'], 
                        alpha=0.7, color='lightgreen', edgecolor='black')
    axes[1,1].set_xticks(x_pos)
    axes[1,1].set_xticklabels(cma_comparison['cma_name'], rotation=45, ha='right')
    axes[1,1].set_ylabel('Number of Census Tracts')
    axes[1,1].set_title('Census Tracts by CMA')
    
    plt.tight_layout()
    plt.show()
    
    # Summary statistics
    print(f"\n📊 Summary Statistics:")
    print(f"Most diverse CMA: {cma_comparison.iloc[0]['cma_name']} ({cma_comparison.iloc[0]['mean_diversity']:.3f})")
    print(f"Least diverse CMA: {cma_comparison.iloc[-1]['cma_name']} ({cma_comparison.iloc[-1]['mean_diversity']:.3f})")
    print(f"Average diversity across CMAs: {cma_comparison['mean_diversity'].mean():.3f}")
    
else:
    print("❌ No diversity results available for comparison")

## Geographic Visualization: Diversity Maps

Create interactive maps showing diversity patterns within metropolitan areas, replicating the spatial analysis from the original R blog post.
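
Plotly Express fits each map's continuous color range to that map's own data by default, so the same color can stand for different index values in different CMAs. One option, sketched here as an assumption rather than something the original post does, is to divide each index by its maximum ln(k) so that every map can share a 0 to 1 scale. The helper name `normalized_entropy` is hypothetical.

```python
import math

def normalized_entropy(entropy_value, n_groups):
    """Scale an entropy value to the 0-1 range by dividing by ln(n_groups).

    Hypothetical helper: ln(k) is the entropy of k equally sized groups,
    the largest value the index can take when the group shares sum to 1.
    """
    if n_groups < 2:
        return 0.0  # with fewer than two groups there is no diversity to measure
    return entropy_value / math.log(n_groups)

# Example: an index of 1.2 measured over 12 groups, as a share of the maximum
share_of_max = normalized_entropy(1.2, 12)
print(f"{share_of_max:.2f}")
```

A normalized column could then be mapped with a fixed range, for example by passing `range_color=(0, 1)` to `px.choropleth_mapbox`.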

In [ ]:
# Create interactive diversity maps for major CMAs
def create_diversity_map(cma_data, cma_name):
    """
    Create an interactive choropleth map of diversity by Census Tract.
    Replicates the spatial visualization from the original R analysis.
    """
    
    # Ensure coordinate system for mapping
    if cma_data.crs is None:
        cma_data = cma_data.set_crs('EPSG:4326')
    
    # Convert to geographic coordinates for web mapping
    gdf_map = cma_data.to_crs('EPSG:4326')
    
    # Determine which column to use for hover name
    name_col = None
    for possible_name in ['name', 't', 'Region Name', 'rguid', 'id']:
        if possible_name in gdf_map.columns:
            name_col = possible_name
            break
    
    # name_col stays None when no naming column exists; Plotly Express then
    # omits the hover title instead of failing on a missing 'id' column.
    
    # Create the choropleth map
    fig = px.choropleth_mapbox(
        gdf_map,
        geojson=gdf_map.geometry.__geo_interface__,
        locations=gdf_map.index,
        color='diversity_index',
        hover_name=name_col,
        hover_data={
            'pop': ':,',
            'diversity_index': ':.3f',
            'CMA_name': True
        },
        color_continuous_scale='Viridis',
        mapbox_style='open-street-map',
        zoom=9,
        center={
            'lat': gdf_map.geometry.centroid.y.mean(), 
            'lon': gdf_map.geometry.centroid.x.mean()
        },
        title=f'Diversity Index by Census Tract - {cma_name} CMA (2016)',
        labels={'diversity_index': 'Diversity Index (Theil\'s E)'}
    )
    
    fig.update_layout(height=600)
    return fig


def create_minority_group_map(cma_data, cma_name, group_column, group_name):
    """
    Create a map showing the spatial distribution of a specific minority group.
    """
    
    # Ensure coordinate system for mapping
    if cma_data.crs is None:
        cma_data = cma_data.set_crs('EPSG:4326')
    
    # Convert to geographic coordinates for web mapping
    gdf_map = cma_data.to_crs('EPSG:4326')
    
    # Determine which column to use for hover name
    name_col = None
    for possible_name in ['name', 't', 'Region Name', 'rguid', 'id']:
        if possible_name in gdf_map.columns:
            name_col = possible_name
            break
    
    # name_col stays None when no naming column exists; Plotly Express then
    # omits the hover title instead of failing on a missing 'id' column.
    
    # Calculate percentage of group in each CT (tracts with zero population become NaN)
    gdf_map[f'{group_name}_pct'] = (gdf_map[group_column] / gdf_map['pop'].replace(0, np.nan)) * 100
    
    # Create the choropleth map
    fig = px.choropleth_mapbox(
        gdf_map,
        geojson=gdf_map.geometry.__geo_interface__,
        locations=gdf_map.index,
        color=f'{group_name}_pct',
        hover_name=name_col,
        hover_data={
            'pop': ':,',
            group_column: ':,',
            f'{group_name}_pct': ':.1f'
        },
        color_continuous_scale='Reds',
        mapbox_style='open-street-map',
        zoom=9,
        center={
            'lat': gdf_map.geometry.centroid.y.mean(), 
            'lon': gdf_map.geometry.centroid.x.mean()
        },
        title=f'{group_name} Population Distribution - {cma_name} CMA (2016)',
        labels={f'{group_name}_pct': f'% {group_name}'}
    )
    
    fig.update_layout(height=600)
    return fig


# Create maps for available CMAs
if 'diversity_results' in locals() and diversity_results:
    print("🗺️  Creating diversity and minority group distribution maps...")
    
    # Sort by diversity to show most diverse first
    sorted_cmas = sorted(diversity_results.items(), 
                        key=lambda x: x[1]['summary']['mean_diversity'], 
                        reverse=True)
    
    for i, (cma_name, result) in enumerate(sorted_cmas[:3]):  # Show top 3 most diverse
        print(f"\n📍 Creating maps for {cma_name} (#{i+1} most diverse)...")
        
        try:
            # 1. Diversity map
            print(f"   📊 Creating diversity map...")
            fig_diversity = create_diversity_map(result['data'], cma_name)
            fig_diversity.show()
            
            # Print some statistics about the map data
            valid_diversity = result['data']['diversity_index'].dropna()
            print(f"   ✅ Diversity map created for {cma_name}")
            print(f"   📊 {len(valid_diversity)} Census Tracts with diversity data")
            print(f"   🎯 Diversity range: {valid_diversity.min():.3f} - {valid_diversity.max():.3f}")
            
            # 2. Major minority group maps
            name_mapping = result['name_mapping']
            group_columns = result['group_columns']
            
            # Find the largest minority groups for this CMA
            group_totals = {}
            for orig_col, short_name in name_mapping.items():
                if orig_col in result['data'].columns and short_name in group_columns:
                    if 'Total_Visible_Minority' not in short_name:  # Skip total
                        total = result['data'][orig_col].sum()
                        group_totals[short_name] = (orig_col, total)
            
            # Sort by population size and take top 2
            top_groups = sorted(group_totals.items(), key=lambda x: x[1][1], reverse=True)[:2]
            
            for group_short_name, (orig_col, total) in top_groups:
                group_display_name = group_short_name.replace('_', ' ')
                print(f"   🌍 Creating {group_display_name} distribution map...")
                
                fig_group = create_minority_group_map(
                    result['data'], cma_name, orig_col, group_display_name
                )
                fig_group.show()
                
                print(f"   ✅ {group_display_name} map created ({total:,} people total)")
            
        except Exception as e:
            print(f"   ❌ Failed to create maps for {cma_name}: {e}")
    
    print("\n✅ Interactive maps complete!")
    print("\n📋 Map Analysis Notes:")
    print("   • Diversity maps show spatial clustering of diverse vs homogeneous areas")
    print("   • Group distribution maps reveal settlement patterns and concentration")
    print("   • Compare across CMAs to see different urban diversity structures")
    print("   • Bright (yellow) areas on diversity maps = more diverse, dark (purple) areas = more homogeneous")
    
else:
    print("❌ No diversity data available for mapping")

In [None]:
def calculate_diversity_index(df, group_columns, total_pop_column='pop'):
    """
    Calculate Theil's Entropy Index (E) for diversity measurement.
    
    The entropy index measures how evenly distributed different groups are.
    E = -Σ(pi * ln(pi)) where pi is the proportion of group i
    """
    
    # Create a copy to work with
    data = df.copy()
    
    # Calculate proportions for each group
    diversity_scores = []
    
    for idx, row in data.iterrows():
        total_pop = row[total_pop_column]
        
        if pd.isna(total_pop) or total_pop <= 0:
            diversity_scores.append(np.nan)
            continue
            
        # Calculate proportions for each group
        proportions = []
        for col in group_columns:
            if col in row and not pd.isna(row[col]):
                prop = row[col] / total_pop
                if prop > 0:  # Only include non-zero proportions
                    proportions.append(prop)
        
        # Calculate entropy index
        if proportions:
            entropy = -sum(p * np.log(p) for p in proportions if p > 0)
        else:
            entropy = 0
            
        diversity_scores.append(entropy)
    
    return pd.Series(diversity_scores, index=data.index)


def get_minority_columns(df):
    """
    Find visible minority columns in the dataframe and create shorter, readable names.
    """
    # Look for visible minority related columns
    minority_cols = [col for col in df.columns if col.startswith('v_CA16_') and 
                    ('visible minority' in col.lower() or 'south asian' in col.lower() or 
                     'chinese' in col.lower() or 'black' in col.lower() or 'filipino' in col.lower() or
                     'latin american' in col.lower() or 'arab' in col.lower() or 'korean' in col.lower() or
                     'japanese' in col.lower() or 'west asian' in col.lower() or 'southeast asian' in col.lower())]
    
    # Create mapping to shorter names
    name_mapping = {}
    usable_cols = []
    
    for col in minority_cols:
        # Extract the meaningful part of the name
        if 'south asian' in col.lower():
            short_name = 'South_Asian'
        elif 'chinese' in col.lower():
            short_name = 'Chinese'
        elif 'black' in col.lower():
            short_name = 'Black'
        elif 'filipino' in col.lower():
            short_name = 'Filipino'
        elif 'latin american' in col.lower():
            short_name = 'Latin_American'
        elif 'arab' in col.lower():
            short_name = 'Arab'
        elif 'korean' in col.lower():
            short_name = 'Korean'
        elif 'japanese' in col.lower():
            short_name = 'Japanese'
        elif 'west asian' in col.lower():
            short_name = 'West_Asian'
        elif 'southeast asian' in col.lower():
            short_name = 'Southeast_Asian'
        elif 'total visible minority' in col.lower() and 'population' in col.lower():
            short_name = 'Total_Visible_Minority'
        elif 'multiple visible minorities' in col.lower():
            short_name = 'Multiple_Visible_Minorities'
        elif 'visible minority, n.i.e' in col.lower():
            short_name = 'Other_Visible_Minority'
        else:
            continue  # Skip columns we can't identify
        
        name_mapping[col] = short_name
        usable_cols.append(col)
    
    return usable_cols, name_mapping


def calculate_cma_diversity(ct_data_dict):
    """
    Calculate diversity indices for all CMAs and their Census Tracts.
    """
    print("🔢 Calculating diversity indices...")
    
    results = {}
    
    for cma_name, data in ct_data_dict.items():
        print(f"\n📊 Processing {cma_name}...")
        
        # Find minority group columns
        minority_cols, name_mapping = get_minority_columns(data)
        
        print(f"   🔍 Found {len(minority_cols)} minority group columns:")
        for col in minority_cols[:5]:  # Show first 5
            short_name = name_mapping.get(col, 'Unknown')
            print(f"     • {short_name}")
        if len(minority_cols) > 5:
            print(f"     • ... and {len(minority_cols) - 5} more")
        
        if len(minority_cols) < 3:  # Need at least 3 groups for meaningful diversity
            print(f"   ⚠️  Too few group columns for {cma_name} ({len(minority_cols)} found), skipping...")
            continue
        
        # Create working copy with shorter column names
        data_copy = data.copy()
        
        # Rename columns to shorter names for easier working
        for orig_col, short_name in name_mapping.items():
            if orig_col in data_copy.columns:
                data_copy[short_name] = data_copy[orig_col]
        
        # Use the renamed columns for diversity calculation
        short_col_names = list(name_mapping.values())
        available_short_cols = [col for col in short_col_names if col in data_copy.columns]
        
        # Calculate diversity index for each Census Tract
        data_copy['diversity_index'] = calculate_diversity_index(
            data_copy, 
            available_short_cols, 
            'pop'
        )
        
        # Calculate summary statistics
        valid_diversity = data_copy['diversity_index'].dropna()
        
        if len(valid_diversity) > 0:
            summary = {
                'cma_name': cma_name,
                'total_cts': len(data_copy),
                'valid_cts': len(valid_diversity),
                'mean_diversity': valid_diversity.mean(),
                'median_diversity': valid_diversity.median(),
                'max_diversity': valid_diversity.max(),
                'min_diversity': valid_diversity.min(),
                'total_population': data_copy['pop'].sum(),
                'available_groups': len(available_short_cols)
            }
            
            results[cma_name] = {
                'data': data_copy,
                'summary': summary,
                'group_columns': available_short_cols,
                'original_columns': minority_cols,
                'name_mapping': name_mapping
            }
            
            print(f"   ✅ Diversity calculated: {len(valid_diversity)} valid CTs")
            print(f"   📈 Mean diversity: {summary['mean_diversity']:.3f}")
            print(f"   📊 Range: {summary['min_diversity']:.3f} - {summary['max_diversity']:.3f}")
            print(f"   👥 Groups used: {len(available_short_cols)}")
        else:
            print(f"   ❌ No valid diversity calculations for {cma_name}")
    
    return results

# Calculate diversity for all CMAs
print("🚀 Starting diversity calculation...")
diversity_results = calculate_cma_diversity(ct_data)

print(f"\n🎯 Diversity calculation complete for {len(diversity_results)} CMAs")

## CMA-Level Diversity Comparison

Compare diversity levels across major Canadian metropolitan areas.

In [None]:
# Create CMA comparison dataframe
if diversity_results:
    cma_comparison = pd.DataFrame([
        result['summary'] for result in diversity_results.values()
    ])
    
    # Sort by mean diversity
    cma_comparison = cma_comparison.sort_values('mean_diversity', ascending=False)
    
    print("🏆 CMA Diversity Rankings (2016 Census):")
    print("=" * 60)
    
    for idx, row in cma_comparison.iterrows():
        print(f"{list(cma_comparison.index).index(idx)+1:2d}. {row['cma_name']:<15} "
              f"Diversity: {row['mean_diversity']:.3f} "
              f"(Population: {row['total_population']:>8,})")
    
    # Create visualization
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    fig.suptitle('Diversity Analysis: Canadian Metropolitan Areas (2016)', fontsize=16, fontweight='bold')
    
    # 1. CMA diversity comparison
    bars = axes[0,0].bar(range(len(cma_comparison)), cma_comparison['mean_diversity'], 
                        alpha=0.7, color='skyblue', edgecolor='black')
    axes[0,0].set_xticks(range(len(cma_comparison)))
    axes[0,0].set_xticklabels(cma_comparison['cma_name'], rotation=45, ha='right')
    axes[0,0].set_ylabel('Mean Diversity Index')
    axes[0,0].set_title('Diversity by CMA')
    axes[0,0].grid(True, alpha=0.3)
    
    # Add value labels on bars
    for bar, value in zip(bars, cma_comparison['mean_diversity']):
        axes[0,0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01, 
                      f'{value:.3f}', ha='center', va='bottom', fontsize=9)
    
    # 2. Population vs Diversity scatter
    scatter = axes[0,1].scatter(cma_comparison['total_population'], cma_comparison['mean_diversity'], 
                               s=100, alpha=0.7, c='orange', edgecolor='black')
    axes[0,1].set_xlabel('Total Population')
    axes[0,1].set_ylabel('Mean Diversity Index')
    axes[0,1].set_title('Population vs Diversity')
    
    # Add CMA labels
    for idx, row in cma_comparison.iterrows():
        axes[0,1].annotate(row['cma_name'], 
                          (row['total_population'], row['mean_diversity']),
                          xytext=(5, 5), textcoords='offset points', fontsize=9)
    
    # 3. Diversity range (min/max) by CMA
    x_pos = range(len(cma_comparison))
    axes[1,0].errorbar(x_pos, cma_comparison['mean_diversity'],
                      yerr=[cma_comparison['mean_diversity'] - cma_comparison['min_diversity'],
                            cma_comparison['max_diversity'] - cma_comparison['mean_diversity']],
                      fmt='o', capsize=5, capthick=2, alpha=0.7)
    axes[1,0].set_xticks(x_pos)
    axes[1,0].set_xticklabels(cma_comparison['cma_name'], rotation=45, ha='right')
    axes[1,0].set_ylabel('Diversity Index')
    axes[1,0].set_title('Diversity Range by CMA')
    axes[1,0].grid(True, alpha=0.3)
    
    # 4. Number of Census Tracts
    bars = axes[1,1].bar(x_pos, cma_comparison['total_cts'], 
                        alpha=0.7, color='lightgreen', edgecolor='black')
    axes[1,1].set_xticks(x_pos)
    axes[1,1].set_xticklabels(cma_comparison['cma_name'], rotation=45, ha='right')
    axes[1,1].set_ylabel('Number of Census Tracts')
    axes[1,1].set_title('Census Tracts by CMA')
    
    plt.tight_layout()
    plt.show()
    
    # Summary statistics
    print("\n📊 Summary Statistics:")
    print(f"Most diverse CMA: {cma_comparison.iloc[0]['cma_name']} ({cma_comparison.iloc[0]['mean_diversity']:.3f})")
    print(f"Least diverse CMA: {cma_comparison.iloc[-1]['cma_name']} ({cma_comparison.iloc[-1]['mean_diversity']:.3f})")
    print(f"Average diversity across CMAs: {cma_comparison['mean_diversity'].mean():.3f}")
    
else:
    print("❌ No diversity results available for comparison")
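
The population-versus-diversity panel above invites a quick numerical check. The following is a minimal, dependency-free sketch of a Spearman rank correlation, which is less sensitive than Pearson's r to a few very large CMAs. The helper names (`average_ranks`, `pearson_r`, `spearman_rho`) and the input values are hypothetical and for illustration only; they are not census results. In the notebook the same inputs would presumably come from `cma_comparison['total_population']` and `cma_comparison['mean_diversity']`.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical values for illustration only, not census results
population = [5_900_000, 4_100_000, 2_500_000, 1_400_000, 1_300_000]
mean_diversity = [0.95, 0.70, 0.90, 0.60, 0.55]

rho = spearman_rho(population, mean_diversity)
print(f"Spearman rho: {rho:.3f}")
```

With only a handful of CMAs, any such coefficient is descriptive at best and should not be read as a significance test.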

## Key Findings and Conclusions

This section summarizes the main insights from the diversity analysis.

In [None]:
# Generate comprehensive summary of findings
print("🎯 KEY FINDINGS: Diversity Analysis of Canadian Metropolitan Areas")
print("=" * 80)

if diversity_results:
    # Diversity findings
    cma_diversity_summary = pd.DataFrame([
        result['summary'] for result in diversity_results.values()
    ]).sort_values('mean_diversity', ascending=False)
    
    print(f"\n📊 DIVERSITY ANALYSIS:")
    print(f"   • Most diverse CMA: {cma_diversity_summary.iloc[0]['cma_name']} "
          f"(Diversity Index: {cma_diversity_summary.iloc[0]['mean_diversity']:.3f})")
    print(f"   • Least diverse CMA: {cma_diversity_summary.iloc[-1]['cma_name']} "
          f"(Diversity Index: {cma_diversity_summary.iloc[-1]['mean_diversity']:.3f})")
    print(f"   • Average diversity across major CMAs: {cma_diversity_summary['mean_diversity'].mean():.3f}")
    print(f"   • Diversity range: {cma_diversity_summary['mean_diversity'].min():.3f} - "
          f"{cma_diversity_summary['mean_diversity'].max():.3f}")
    
    # Population insights
    total_population = cma_diversity_summary['total_population'].sum()
    print(f"\n👥 POPULATION INSIGHTS:")
    print(f"   • Total population analyzed: {total_population:,} people")
    print(f"   • Total Census Tracts analyzed: {cma_diversity_summary['total_cts'].sum():,}")
    print(f"   • Largest CMA: {cma_diversity_summary.loc[cma_diversity_summary['total_population'].idxmax(), 'cma_name']} "
          f"({cma_diversity_summary['total_population'].max():,} people)")
    
    # Geographic patterns (qualitative observations, not computed in this cell)
    print("\n🗺️  GEOGRAPHIC PATTERNS:")
    print("   • Mean tract-level diversity differs across metropolitan areas")
    print("   • Higher diversity tends to coincide with larger population centers")
    print("   • Visible minority groups show different settlement patterns across CMAs")

    # Research implications (interpretive; hedged where the data cannot confirm them)
    print("\n💡 RESEARCH IMPLICATIONS:")
    print("   • Theil's Entropy Index summarizes neighborhood-level diversity per Census Tract")
    print("   • Census Tract analysis reveals intra-metropolitan variation")
    print("   • Different groups concentrate in different metropolitan areas")
    print("   • Diversity patterns may reflect immigration and settlement policies")
    
    # Methodology notes
    print(f"\n📈 METHODOLOGY:")
    print(f"   • Data source: Statistics Canada 2016 Census via CensusMapper API")
    print(f"   • Geographic level: Census Tracts within Census Metropolitan Areas")
    print(f"   • Diversity measure: Theil's Entropy Index (E)")
    print(f"   • Analysis covers {len(diversity_results)} major Canadian metropolitan areas")
    
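    # Groups analyzed: read the group columns from the first CMA's result
    # (the enclosing `if diversity_results:` already guarantees it is non-empty)
    sample_result = next(iter(diversity_results.values()))
    groups_analyzed = len(sample_result['group_columns'])
    print(f"   • Visible minority groups analyzed: {groups_analyzed}")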
    
    print(f"\n🚀 NEXT STEPS:")
    print(f"   • Extend analysis to include segregation indices")
    print(f"   • Compare with previous census years for trend analysis")
    print(f"   • Analyze relationship between diversity and urban planning")
    print(f"   • Include income and housing patterns in the analysis")
    
else:
    print("❌ Insufficient data for comprehensive analysis")

# Print the closing message only when there were results to report
if diversity_results:
    print("\n✅ Analysis demonstrates the use of pycancensus for demographic research!")
    print("   This replication follows the methodology of the original R analysis")
    print("   and provides a Python-based approach to Canadian census data analysis.")
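
The methodology summary names Theil's Entropy Index (E) as the diversity measure. As a reminder of what that value means for a single Census Tract, here is a small standalone sketch. It assumes the common form E = Σ p_k · ln(1 / p_k) with the natural logarithm and groups of zero population skipped. The function name `entropy_index` and the tract counts are hypothetical, and the notebook's own implementation may differ in log base or normalization.

```python
import math


def entropy_index(group_counts):
    """Theil's entropy E = sum(p_k * ln(1 / p_k)) over groups with p_k > 0."""
    total = sum(group_counts)
    if total <= 0:
        return 0.0  # an empty tract carries no diversity information
    proportions = [count / total for count in group_counts if count > 0]
    return sum(p * math.log(1 / p) for p in proportions)


# Hypothetical tract counts for illustration only, not census results
even_tract = [250, 250, 250, 250]    # four equally sized groups
skewed_tract = [970, 10, 10, 10]     # one dominant group
single_group = [1000, 0, 0, 0]       # no diversity at all

for label, tract in [("even", even_tract),
                     ("skewed", skewed_tract),
                     ("single", single_group)]:
    print(f"{label:<7} E = {entropy_index(tract):.3f}")
```

Under these assumptions E ranges from 0 (one group only) to ln(K) for K equally sized groups, so values are only comparable across tracts or CMAs when the same set of groups is used.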