# Intermediate Level: Statistical Analysis of DEI in the Music Industry

Welcome to the intermediate workshop on diversity, equity, and inclusion (DEI) in the music industry! This notebook builds on the beginner level with hypothesis tests, correlation analysis, and multi-variable visualizations.

## Learning Objectives:
- Perform statistical tests to identify significant differences
- Create advanced visualizations with multiple variables
- Analyze correlation patterns in the data
- Apply grouping and aggregation techniques
- Understand confidence intervals and statistical significance

## Prerequisites:
You should have completed the beginner level or be familiar with basic pandas and matplotlib operations.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Set up plotting style
plt.style.use('default')
sns.set_palette("Set2")
plt.rcParams['figure.figsize'] = (12, 8)

print("Libraries imported successfully!")

In [None]:
# Load the dataset
df = pd.read_csv('../data/music_industry_dei_data.csv')

print(f"Dataset loaded: {df.shape[0]} artists, {df.shape[1]} features")
df.head()

## Advanced Demographic Analysis

Let's create more sophisticated visualizations that show multiple dimensions of our data simultaneously.

In [None]:
# Create a comprehensive demographic overview
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Gender distribution
df['gender'].value_counts().plot(kind='pie', ax=axes[0,0], autopct='%1.1f%%')
axes[0,0].set_title('Gender Distribution')
axes[0,0].set_ylabel('')

# Ethnicity distribution
df['ethnicity'].value_counts().plot(kind='bar', ax=axes[0,1])
axes[0,1].set_title('Ethnicity Distribution')
axes[0,1].tick_params(axis='x', rotation=45)

# Label type by gender
pd.crosstab(df['gender'], df['label_type']).plot(kind='bar', ax=axes[1,0], stacked=True)
axes[1,0].set_title('Label Type by Gender')
axes[1,0].tick_params(axis='x', rotation=0)

# Genre diversity
top_genres = df['genre'].value_counts().head(8)
top_genres.plot(kind='bar', ax=axes[1,1])
axes[1,1].set_title('Top 8 Genres')
axes[1,1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

## Statistical Testing for Gender Differences

Let's use two-sample t-tests to check whether the mean of each success metric differs between male and female artists. We treat a p-value below 0.05 as statistically significant.

In [None]:
# Separate data by gender
male_artists = df[df['gender'] == 'Male']
female_artists = df[df['gender'] == 'Female']

print(f"Male artists: {len(male_artists)}")
print(f"Female artists: {len(female_artists)}")

# Statistical tests for different metrics
metrics = ['monthly_listeners', 'total_streams', 'spotify_followers', 'soundcloud_followers', 'award_wins']

results = []
for metric in metrics:
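    # Drop missing values so the test does not return NaN
    male_values = male_artists[metric].dropna()
    female_values = female_artists[metric].dropna()
    
    # Perform Welch's t-test (does not assume equal group variances)
    statistic, p_value = stats.ttest_ind(male_values, female_values, equal_var=False)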
    
    # Calculate means
    male_mean = male_values.mean()
    female_mean = female_values.mean()
    
    results.append({
        'Metric': metric,
        'Male_Mean': male_mean,
        'Female_Mean': female_mean,
        'T_Statistic': statistic,
        'P_Value': p_value,
        'Significant': 'Yes' if p_value < 0.05 else 'No'
    })

results_df = pd.DataFrame(results)
print("Statistical Test Results (Male vs Female):")
print(results_df.round(4))
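
A p-value says whether a difference is distinguishable from noise, not how large it is. As a complement to the t-tests above, the sketch below computes Cohen's d, a standardized mean difference. The helper name `cohens_d` and the synthetic samples are assumptions for illustration and are not part of the workshop dataset.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Ignore missing values
    a = a[~np.isnan(a)]
    b = b[~np.isnan(b)]
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Synthetic data for illustration only (not the workshop dataset)
rng = np.random.default_rng(42)
group_a = rng.normal(loc=100, scale=15, size=200)
group_b = rng.normal(loc=95, scale=15, size=200)

d = cohens_d(group_a, group_b)
print(f"Cohen's d: {d:.3f}")
```

To apply it here, you could call `cohens_d(male_artists[metric], female_artists[metric])` inside the loop above. Common rules of thumb treat |d| near 0.2 as small, 0.5 as medium, and 0.8 as large, though these thresholds are conventions rather than hard cutoffs.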

In [None]:
# Visualize the differences with box plots
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.flatten()

for i, metric in enumerate(metrics):
    sns.boxplot(data=df, x='gender', y=metric, ax=axes[i])
    axes[i].set_title(f'{metric.replace("_", " ").title()} by Gender')
    axes[i].tick_params(axis='y', which='major', labelsize=8)
    
    # Annotate with the p-value from the Male vs Female t-test above
    p_val = results_df[results_df['Metric'] == metric]['P_Value'].iloc[0]
    if p_val < 0.001:
        sig_text = 'p < 0.001***'
    elif p_val < 0.01:
        sig_text = 'p < 0.01**'
    elif p_val < 0.05:
        sig_text = 'p < 0.05*'
    else:
        sig_text = f'p = {p_val:.3f} (ns)'
    
    axes[i].text(0.5, 0.95, sig_text, transform=axes[i].transAxes, 
                ha='center', va='top', fontsize=10, 
                bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

# Remove the empty subplot
fig.delaxes(axes[5])

plt.tight_layout()
plt.show()
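
Running five tests at the 0.05 level raises the chance of at least one false positive. One common remedy is the Holm-Bonferroni adjustment, sketched here with NumPy only. The function name `holm_adjust` and the example p-values are made up for illustration; they are not results from the dataset.

```python
import numpy as np

def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                      # smallest p-value first
    scaled = (m - np.arange(m)) * p[order]     # multiply by m, m-1, ..., 1
    scaled = np.maximum.accumulate(scaled)     # keep adjusted values non-decreasing
    scaled = np.minimum(scaled, 1.0)           # p-values cannot exceed 1
    adjusted = np.empty(m)
    adjusted[order] = scaled                   # restore the original order
    return adjusted

# Hypothetical raw p-values, one per metric
raw_p = [0.001, 0.04, 0.03, 0.20, 0.012]
adjusted = holm_adjust(raw_p)

for raw, adj in zip(raw_p, adjusted):
    print(f"raw p = {raw:.3f} -> Holm-adjusted p = {adj:.3f}")
```

With the notebook's own results, `holm_adjust(results_df['P_Value'])` would give adjusted values that can be compared against 0.05 in place of the raw ones.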

## Correlation Analysis

Let's examine correlations between the numeric success and career metrics, using the Pearson correlation that pandas computes by default.

In [None]:
# Select numeric columns for correlation analysis
numeric_cols = ['monthly_listeners', 'total_streams', 'album_count', 'years_active', 
                'award_wins', 'spotify_followers', 'soundcloud_followers']

# Calculate correlation matrix
correlation_matrix = df[numeric_cols].corr()

# Create heatmap
plt.figure(figsize=(12, 10))
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
sns.heatmap(correlation_matrix, mask=mask, annot=True, cmap='RdBu_r', center=0,
            square=True, linewidths=0.5, cbar_kws={"shrink": .8})
plt.title('Correlation Matrix of Success Metrics')
plt.tight_layout()
plt.show()

# Find strongest correlations
corr_pairs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        corr_pairs.append({
            'Variable 1': correlation_matrix.columns[i],
            'Variable 2': correlation_matrix.columns[j],
            'Correlation': correlation_matrix.iloc[i, j]
        })

corr_df = pd.DataFrame(corr_pairs).sort_values('Correlation', key=abs, ascending=False)
print("\nStrongest Correlations:")
print(corr_df.head(10))
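
Pearson correlation measures linear association and can be dominated by a few very large values, which is common for streaming counts. A rank-based alternative is Spearman correlation. The sketch below compares the two on synthetic, right-skewed data; the column names `listeners` and `streams` and the generated values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic, right-skewed data with a monotonic but nonlinear relationship
rng = np.random.default_rng(0)
listeners = rng.lognormal(mean=10, sigma=1.5, size=300)
streams = listeners ** 2 * rng.lognormal(mean=0, sigma=0.3, size=300)
demo = pd.DataFrame({'listeners': listeners, 'streams': streams})

# Pearson (linear) vs Spearman (rank-based) correlation
pearson_r = demo['listeners'].corr(demo['streams'])
spearman_matrix = demo.corr(method='spearman')

# scipy also returns a p-value for the Spearman coefficient
rho, p_value = stats.spearmanr(demo['listeners'], demo['streams'])

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {rho:.3f} (p = {p_value:.3g})")
```

On the workshop data, `df[numeric_cols].corr(method='spearman')` would produce a matrix that can be passed to the same heatmap code.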

## Interactive Visualizations with Plotly

Let's create interactive visualizations to explore multiple dimensions of the data.

In [None]:
# Interactive scatter plot: Monthly Listeners vs Total Streams
fig = px.scatter(df, x='monthly_listeners', y='total_streams',
                 color='gender', symbol='ethnicity', size='award_wins',
                 hover_data=['artist_name', 'genre', 'country', 'label_type'],
                 title='Monthly Listeners vs Total Streams (Interactive)',
                 labels={'monthly_listeners': 'Monthly Listeners',
                        'total_streams': 'Total Streams'})

fig.update_layout(height=600, width=900)
fig.show()

print("Hover over points to see artist details!")

In [None]:
# Interactive bar chart: Success metrics by ethnicity
ethnicity_stats = df.groupby('ethnicity').agg({
    'monthly_listeners': 'mean',
    'total_streams': 'mean',
    'award_wins': 'mean',
    'spotify_followers': 'mean'
}).reset_index()

fig = make_subplots(rows=2, cols=2, 
                    subplot_titles=('Average Monthly Listeners', 'Average Total Streams',
                                  'Average Award Wins', 'Average Spotify Followers'),
                    vertical_spacing=0.12)

# Add traces for each metric
fig.add_trace(go.Bar(x=ethnicity_stats['ethnicity'], y=ethnicity_stats['monthly_listeners'],
                     name='Monthly Listeners'), row=1, col=1)

fig.add_trace(go.Bar(x=ethnicity_stats['ethnicity'], y=ethnicity_stats['total_streams'],
                     name='Total Streams'), row=1, col=2)

fig.add_trace(go.Bar(x=ethnicity_stats['ethnicity'], y=ethnicity_stats['award_wins'],
                     name='Award Wins'), row=2, col=1)

fig.add_trace(go.Bar(x=ethnicity_stats['ethnicity'], y=ethnicity_stats['spotify_followers'],
                     name='Spotify Followers'), row=2, col=2)

fig.update_layout(height=800, title_text="Success Metrics by Ethnicity (Interactive)", 
                 showlegend=False)
fig.show()

## Genre Analysis by Demographics

Let's analyze how different demographic groups are represented across music genres.

In [None]:
# Create genre-demographic analysis
genre_demo = df.groupby(['genre', 'gender', 'ethnicity']).size().reset_index(name='count')
genre_totals = df.groupby('genre').size().reset_index(name='total')
genre_demo = genre_demo.merge(genre_totals, on='genre')
genre_demo['percentage'] = (genre_demo['count'] / genre_demo['total']) * 100

# Focus on top genres
top_genres = df['genre'].value_counts().head(6).index
genre_demo_filtered = genre_demo[genre_demo['genre'].isin(top_genres)]

# Create heatmap for gender representation across genres
gender_genre = df[df['genre'].isin(top_genres)].groupby(['genre', 'gender']).size().unstack(fill_value=0)
gender_genre_pct = gender_genre.div(gender_genre.sum(axis=1), axis=0) * 100

plt.figure(figsize=(10, 6))
sns.heatmap(gender_genre_pct, annot=True, fmt='.1f', cmap='RdYlBu_r', 
            cbar_kws={'label': 'Percentage'})
plt.title('Gender Representation Across Top Genres (%)')
plt.xlabel('Gender')
plt.ylabel('Genre')
plt.tight_layout()
plt.show()

print("\nGender Distribution by Genre (Raw Numbers):")
print(gender_genre)

In [None]:
# Ethnicity representation across genres
ethnicity_genre = df[df['genre'].isin(top_genres)].groupby(['genre', 'ethnicity']).size().unstack(fill_value=0)
ethnicity_genre_pct = ethnicity_genre.div(ethnicity_genre.sum(axis=1), axis=0) * 100

plt.figure(figsize=(12, 8))
sns.heatmap(ethnicity_genre_pct, annot=True, fmt='.1f', cmap='viridis', 
            cbar_kws={'label': 'Percentage'})
plt.title('Ethnic Representation Across Top Genres (%)')
plt.xlabel('Ethnicity')
plt.ylabel('Genre')
plt.tight_layout()
plt.show()

print("\nEthnicity Distribution by Genre (Raw Numbers):")
print(ethnicity_genre)
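
The heatmaps above show how representation varies by genre, but not whether that variation is larger than chance would produce. A chi-square test of independence on the raw count table is one way to check. The sketch below uses a small made-up contingency table; the genre labels and counts are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical genre-by-gender counts (rows: genres, columns: genders)
counts = pd.DataFrame(
    {'Female': [30, 45, 20], 'Male': [70, 55, 80]},
    index=['Genre A', 'Genre B', 'Genre C'],
)

# Chi-square test of independence between genre and gender
chi2, p_value, dof, expected = stats.chi2_contingency(counts)

# Cramer's V: effect size between 0 (no association) and 1 (perfect association)
n = counts.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n * (min(counts.shape) - 1)))

print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
print(f"Cramer's V = {cramers_v:.3f}")
```

The same call should work on the `gender_genre` or `ethnicity_genre` tables built above. The chi-square approximation becomes unreliable when many expected counts fall below about 5, so it is worth inspecting `expected` before interpreting the p-value.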

## Success Gap Analysis

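Let's quantify success gaps between demographic groups. For each metric we compare the highest and lowest group means, both as a ratio (max / min) and as an absolute difference.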

In [None]:
# Calculate success gaps
def calculate_gaps(df, group_col, metric_cols):
    """Calculate success gaps between groups for given metrics"""
    gaps = {}
    group_means = df.groupby(group_col)[metric_cols].mean()
    
    for metric in metric_cols:
        max_val = group_means[metric].max()
        min_val = group_means[metric].min()
        gaps[f'{metric}_gap_ratio'] = max_val / min_val if min_val > 0 else np.inf
        gaps[f'{metric}_gap_absolute'] = max_val - min_val
    
    return gaps, group_means

success_metrics = ['monthly_listeners', 'total_streams', 'spotify_followers', 'award_wins']

# Gender gaps
gender_gaps, gender_means = calculate_gaps(df, 'gender', success_metrics)
print("SUCCESS GAPS BY GENDER:")
print("Group Averages:")
print(gender_means[success_metrics].round(0))
print("\nGap Ratios (Max/Min):")
for metric in success_metrics:
    print(f"{metric}: {gender_gaps[f'{metric}_gap_ratio']:.2f}x")

print("\n" + "="*50 + "\n")

# Ethnicity gaps
ethnicity_gaps, ethnicity_means = calculate_gaps(df, 'ethnicity', success_metrics)
print("SUCCESS GAPS BY ETHNICITY:")
print("Group Averages:")
print(ethnicity_means[success_metrics].round(0))
print("\nGap Ratios (Max/Min):")
for metric in success_metrics:
    print(f"{metric}: {ethnicity_gaps[f'{metric}_gap_ratio']:.2f}x")
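
A gap ratio is a single point estimate, and with small groups it can swing widely from sample to sample. One way to gauge that uncertainty is a percentile bootstrap, sketched below. The function name `bootstrap_gap_ratio`, the number of resamples, and the synthetic data are assumptions for illustration; the sketch also assumes all group means are positive.

```python
import numpy as np
import pandas as pd

def bootstrap_gap_ratio(data, group_col, metric, n_boot=1000, seed=0):
    """Percentile bootstrap 95% CI for the max/min ratio of group means.

    Resamples with replacement within each group (stratified bootstrap).
    """
    rng = np.random.default_rng(seed)
    groups = [g[metric].dropna().to_numpy() for _, g in data.groupby(group_col)]
    ratios = np.empty(n_boot)
    for i in range(n_boot):
        means = [rng.choice(v, size=len(v), replace=True).mean() for v in groups]
        ratios[i] = max(means) / min(means)
    low, high = np.percentile(ratios, [2.5, 97.5])
    return low, high

# Synthetic data for illustration only (not the workshop dataset)
rng = np.random.default_rng(7)
demo = pd.DataFrame({
    'group': np.repeat(['A', 'B'], 150),
    'listeners': np.concatenate([
        rng.lognormal(mean=10.0, sigma=1.0, size=150),
        rng.lognormal(mean=10.3, sigma=1.0, size=150),
    ]),
})

means = demo.groupby('group')['listeners'].mean()
observed_ratio = means.max() / means.min()
ci_low, ci_high = bootstrap_gap_ratio(demo, 'group', 'listeners')

print(f"Observed gap ratio: {observed_ratio:.2f}x")
print(f"95% bootstrap CI:   [{ci_low:.2f}x, {ci_high:.2f}x]")
```

Because a max/min ratio can never fall below 1, it is biased upward: even groups with identical true means will show a ratio above 1 in any finite sample. A wide interval whose lower end sits close to 1 is therefore weak evidence of a real gap.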

## Advanced Statistical Analysis: ANOVA

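Let's perform a one-way Analysis of Variance (ANOVA) to test whether mean monthly listeners differ across ethnicity groups. Unlike a t-test, ANOVA compares more than two groups at once, but it does not identify which groups differ.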

In [None]:
# ANOVA test for ethnicity differences
ethnicity_groups = [group['monthly_listeners'].dropna().values for name, group in df.groupby('ethnicity')]
f_stat, p_value = stats.f_oneway(*ethnicity_groups)

print("ANOVA Test - Monthly Listeners by Ethnicity:")
print(f"F-statistic: {f_stat:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Significant difference: {'Yes' if p_value < 0.05 else 'No'}")

# 95% confidence intervals for each group mean (descriptive; not a formal post-hoc test)
print("\n95% Confidence Intervals for Monthly Listeners by Ethnicity:")
for ethnicity in df['ethnicity'].dropna().unique():
    subset = df[df['ethnicity'] == ethnicity]['monthly_listeners'].dropna()
    mean = subset.mean()
    sem = stats.sem(subset)
    ci = stats.t.interval(0.95, len(subset)-1, loc=mean, scale=sem)
    print(f"{ethnicity}: {mean:,.0f} [{ci[0]:,.0f}, {ci[1]:,.0f}]")
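
One common follow-up to a significant ANOVA is a set of pairwise comparisons with a correction for the number of pairs tested. The sketch below runs Welch's t-test on every pair of groups and applies a Bonferroni correction. The group labels and values are synthetic, and Bonferroni is chosen here for simplicity; it is a conservative option, not necessarily the best one for this dataset.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data for illustration only: groups A and B share a mean, C is higher
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    'group': np.repeat(['A', 'B', 'C'], 50),
    'value': np.concatenate([
        rng.normal(100, 10, 50),
        rng.normal(100, 10, 50),
        rng.normal(120, 10, 50),
    ]),
})

# One array of observations per group
samples = {name: g['value'].dropna().to_numpy() for name, g in demo.groupby('group')}
pairs = list(combinations(samples, 2))

rows = []
for a, b in pairs:
    # Welch's t-test for this pair of groups
    t_stat, p_raw = stats.ttest_ind(samples[a], samples[b], equal_var=False)
    rows.append({
        'Group 1': a,
        'Group 2': b,
        'T_Statistic': t_stat,
        'P_Raw': p_raw,
        # Bonferroni: multiply by the number of comparisons, cap at 1
        'P_Bonferroni': min(p_raw * len(pairs), 1.0),
    })

posthoc = pd.DataFrame(rows)
print(posthoc.round(4))
```

For the workshop data, replacing `demo`, `'group'`, and `'value'` with `df`, `'ethnicity'`, and `'monthly_listeners'` would give the corresponding table.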

## Key Statistical Insights

Based on your statistical analysis, document your findings:

### Your Statistical Insights:

1. **Gender Differences**: [Interpret the t-test results and what they mean for gender equity]

2. **Strongest Correlations**: [Discuss the most significant correlations you found]

3. **Genre Representation**: [Analyze patterns in demographic representation across genres]

4. **Success Gaps**: [Interpret the gap ratios and what they suggest about equity]

5. **Statistical Significance**: [Discuss which differences are statistically significant and their practical implications]

6. **Limitations**: [Consider limitations of this analysis and what additional data might help]

## Next Steps

Excellent work! You've completed intermediate-level statistical analysis including:

- Statistical hypothesis testing
- Correlation analysis
- Interactive visualizations
- Success gap quantification
- ANOVA testing
- Confidence intervals

### Ready for the advanced level?
Move on to the **Advanced Level** notebook for machine learning approaches, predictive modeling, and advanced statistical techniques!