# Instagram Popularity Data Collection

This notebook guides you through collecting Instagram follower counts for DWTS celebrities to use as a social popularity metric.

## Why This Matters
- Judge scores explain ~73% of placement variance
- The remaining ~27% is unexplained (likely due to celebrity/social factors)
- Instagram follower count is a simple proxy for celebrity popularity
- It may account for part of that unexplained ~27%

## Step 1: Setup and Get Celebrity List

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path

# Load DWTS data
DATA_PATH = Path('../2026_MCM_Problem_C_Data.csv')
df = pd.read_csv(DATA_PATH)

# Get unique celebrities
celebrities = df['celebrity_name'].unique()
print(f"Total celebrities in dataset: {len(celebrities)}")
print(f"\nCelebrities: {sorted(celebrities)}")

## Step 2: Manual Collection Template

**RECOMMENDED APPROACH:** Manual collection from Instagram is the most reliable option.

Here's how to collect the data:
1. Run the cell below to create a CSV template
2. Open the CSV in Excel/Google Sheets
3. For each celebrity, visit their Instagram profile
4. Copy their follower count
5. Paste it into the CSV
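
Instagram shows follower counts in abbreviated form (for example "2.5M" or "150K"), so each value has to be converted to a plain number before it goes into the CSV. As a minimal sketch, the helper below does that conversion; `parse_follower_count` is a hypothetical name introduced here for illustration and is not used elsewhere in this notebook.

```python
import re

# Hypothetical helper (not part of the original notebook): converts the
# abbreviated counts shown on an Instagram profile into plain integers.
_SUFFIX_MULTIPLIERS = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_follower_count(text: str) -> int:
    """Convert strings such as '2.5M', '150K', or '23,456' to an integer."""
    cleaned = text.strip().upper().replace(",", "")
    # Drop a trailing "FOLLOWERS" if it was copied along with the number.
    cleaned = re.sub(r"\s*FOLLOWERS?$", "", cleaned)
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMB]?)", cleaned)
    if match is None:
        raise ValueError(f"Unrecognized follower count: {text!r}")
    number, suffix = match.groups()
    multiplier = _SUFFIX_MULTIPLIERS.get(suffix, 1)
    return round(float(number) * multiplier)


for raw in ["2.5M", "150K", "23,456", "2.5M followers"]:
    print(raw, "->", parse_follower_count(raw))
```

Abbreviated counts are rounded by Instagram, so a value such as "2.5M" is only accurate to the nearest 100,000.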

In [None]:
# Create template for manual data entry
template_df = pd.DataFrame({
    'celebrity_name': sorted(celebrities),
    'instagram_handle': ['@' for _ in celebrities],  # User fills this in
    'follower_count': [np.nan for _ in celebrities],  # User fills this in
    'collection_date': [pd.Timestamp.now().strftime('%Y-%m-%d') for _ in celebrities],
    'notes': ['' for _ in celebrities]
})

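# Save template (skip if the file already exists, so re-running this cell
# does not overwrite follower counts you have already entered)
template_path = Path('../instagram_followers_template.csv')
if template_path.exists():
    print(f"Template already exists, not overwriting: {template_path}")
else:
    template_df.to_csv(template_path, index=False)
    print(f"✓ Template created: {template_path}")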
print("\nTemplate preview:")
print(template_df.head(10))

print("\n" + "="*80)
print("MANUAL COLLECTION INSTRUCTIONS:")
print("="*80)
print("""
1. Open the file: instagram_followers_template.csv

2. For EACH celebrity:
   a. Google their name + "instagram"
   b. Click their Instagram profile
   c. Look at their follower count (e.g., "2.5M followers")
   d. Enter their IG handle in the 'instagram_handle' column (e.g., @zendaya)
   e. Enter the follower count as a number in 'follower_count'
      - If it says "2.5M" enter: 2500000
      - If it says "150K" enter: 150000
      - If it says "23,456" enter: 23456

3. Save the file

4. Run the next cell to load it back into Python

Expected time: ~10-15 minutes for ~30 celebrities
""")

## Step 3: Load Collected Data

In [None]:
# After you've filled in the CSV, load it here
ig_data = pd.read_csv('../instagram_followers_template.csv')

print(f"Loaded {len(ig_data)} celebrities")
print(f"\nCelebrities with data: {ig_data['follower_count'].notna().sum()}")
print(f"Missing data: {ig_data['follower_count'].isna().sum()}")

print("\nSample:")
print(ig_data.head(10))

# Basic statistics
print(f"\nFollower Statistics:")
print(ig_data['follower_count'].describe())

## Step 4: Merge with Placement Data

In [None]:
# Merge Instagram data with DWTS data
df_with_ig = df.merge(
    ig_data[['celebrity_name', 'follower_count', 'instagram_handle']], 
    on='celebrity_name', 
    how='left'
)

print(f"Dataset shape: {df_with_ig.shape}")
print(f"Rows with Instagram data: {df_with_ig['follower_count'].notna().sum()}")
print(f"Rows missing Instagram data: {df_with_ig['follower_count'].isna().sum()}")

# Show sample
print("\nSample:")
print(df_with_ig[['celebrity_name', 'placement', 'follower_count']].head(10))

## Step 5: Create Popularity Metric

Options for creating a popularity variable:
1. **Raw followers** - Just use the follower count directly
2. **Log followers** - Use log scale (better for ML models)
3. **Normalized (0-1)** - Scale followers between 0 and 1
4. **Popularity tier** - Categorize into A/B/C list

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Option 1 (raw followers) is already available as 'follower_count'

# Option 2: Log transformation (good for correlation analysis)
df_with_ig['log_followers'] = np.log10(df_with_ig['follower_count'] + 1)

# Option 3: Normalized (0-1) for use in ML models
scaler = MinMaxScaler()
df_with_ig['normalized_followers'] = scaler.fit_transform(
    df_with_ig[['follower_count']]
)

# Option 4: Popularity tiers (A/B/C list categorization)
# include_lowest=True keeps a follower count of exactly 0 in the C-List bin
follower_quantiles = df_with_ig['follower_count'].quantile([0.33, 0.67])
df_with_ig['popularity_tier'] = pd.cut(
    df_with_ig['follower_count'],
    bins=[0, follower_quantiles.iloc[0], follower_quantiles.iloc[1], np.inf],
    labels=['C-List', 'B-List', 'A-List'],
    include_lowest=True
)

print("Popularity metrics created:")
print(df_with_ig[[
    'celebrity_name', 'follower_count', 'log_followers', 
    'normalized_followers', 'popularity_tier'
]].head(10))

# Show distribution
print("\nPopularity Tier Distribution:")
print(df_with_ig['popularity_tier'].value_counts())

## Step 6: Test Correlation with Placement
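
Placement is a rank, so the relationship with followers may be monotone without being linear. Pearson's r measures linear association, while Spearman's rho is Pearson's r computed on ranks. The sketch below illustrates the difference with standard-library code and hypothetical values; the helper functions are written out here for illustration only, and the cell that follows uses `scipy.stats` instead.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """Rank values from 1..n (assumes no ties, which holds for this toy data)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical values: placement improves (gets smaller) as followers grow,
# but not along a straight line.
log_followers = [4.1, 5.2, 6.4, 7.6, 7.9]
placement = [12, 11, 9, 4, 1]

print(f"Pearson r    = {pearson(log_followers, placement):.4f}")
print(f"Spearman rho = {spearman(log_followers, placement):.4f}")
```

In this toy data the ordering is perfectly monotone, so Spearman's rho reaches -1 (up to floating-point rounding) while Pearson's r is weaker. Real data will contain ties in placement, which `scipy.stats.spearmanr` handles by assigning average ranks.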

In [None]:
from scipy.stats import pearsonr, spearmanr
import matplotlib.pyplot as plt
import seaborn as sns

# Remove NaN values
valid_data = df_with_ig.dropna(subset=['placement', 'log_followers'])

# Calculate correlations
pearson_r, pearson_p = pearsonr(valid_data['log_followers'], valid_data['placement'])
spearman_r, spearman_p = spearmanr(valid_data['log_followers'], valid_data['placement'])

print("="*80)
print("INSTAGRAM FOLLOWERS vs PLACEMENT")
print("="*80)
print(f"\nPearson Correlation: r = {pearson_r:.4f}, p-value = {pearson_p:.6f}")
print(f"Spearman Correlation: rho = {spearman_r:.4f}, p-value = {spearman_p:.6f}")
print(f"\nR² = {pearson_r**2:.4f} ({pearson_r**2*100:.2f}% variance explained)")

if pearson_p < 0.05:
    print(f"\n✓ SIGNIFICANT: Instagram followers correlate with placement (p < 0.05)")
else:
    print(f"\n✗ NOT SIGNIFICANT: Instagram followers don't significantly predict placement (p >= 0.05)")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Scatter plot
ax1 = axes[0]
ax1.scatter(valid_data['log_followers'], valid_data['placement'], alpha=0.5, edgecolors='black', s=50)
z = np.polyfit(valid_data['log_followers'], valid_data['placement'], 1)
p = np.poly1d(z)
x_line = np.linspace(valid_data['log_followers'].min(), valid_data['log_followers'].max(), 100)
ax1.plot(x_line, p(x_line), "r--", linewidth=2, label=f'r = {pearson_r:.4f}')
ax1.set_xlabel('Log Instagram Followers', fontsize=12, fontweight='bold')
ax1.set_ylabel('Placement (Lower is Better)', fontsize=12, fontweight='bold')
ax1.set_title('Instagram Followers vs Placement', fontsize=13, fontweight='bold')
ax1.invert_yaxis()
ax1.grid(True, alpha=0.3)
ax1.legend(fontsize=11)

# Distribution by tier
ax2 = axes[1]
tier_order = ['A-List', 'B-List', 'C-List']
tier_colors = {'A-List': 'gold', 'B-List': 'silver', 'C-List': '#CD7F32'}
valid_data_copy = valid_data.copy()
valid_data_copy = valid_data_copy.dropna(subset=['popularity_tier'])
sns.boxplot(
    data=valid_data_copy, 
    x='popularity_tier', 
    y='placement',
    order=tier_order,
    palette=tier_colors,
    ax=ax2
)
ax2.set_xlabel('Celebrity Tier', fontsize=12, fontweight='bold')
ax2.set_ylabel('Placement', fontsize=12, fontweight='bold')
ax2.set_title('Placement by Celebrity Tier (A/B/C)', fontsize=13, fontweight='bold')
ax2.invert_yaxis()
ax2.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

# Tier analysis
print("\n" + "-"*80)
print("PLACEMENT BY CELEBRITY TIER:")
print("-"*80)
# observed=True reports only tiers that actually occur in the data
tier_stats = valid_data_copy.groupby('popularity_tier', observed=True)['placement'].agg(['mean', 'median', 'count'])
print(tier_stats)

## Step 7: Save Enhanced Dataset

In [None]:
# Save the enhanced dataset with Instagram data
output_path = '../2026_MCM_with_instagram.csv'
df_with_ig.to_csv(output_path, index=False)

print(f"✓ Enhanced dataset saved: {output_path}")
print(f"\nColumns in new dataset: {df_with_ig.columns.tolist()}")