# [SOLUTION] Beginner Level: Exploring DEI in the Music Industry

Welcome to the beginner workshop on Diversity, Equity, and Inclusion (DEI) in the music industry! This notebook will guide you through basic data exploration and visualization techniques to understand how music listening differs across genders, countries, and genres.

## Learning Objectives:

- Load and explore a dataset on listening behaviour

- Investigate creator and listener demographics

- Create simple visualizations to identify patterns

- Calculate basic statistics about representation

## 🚀 Getting Started
Let's start by importing the libraries we'll need and loading our data.

In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")

print("Libraries imported successfully!")

## Helpers in this Document

To guide you through this workshop, each section includes pointers that show where you can experiment with the data.

#### 💡 shows you general information or tips 
#### 📖 points you to documentation that helps you with specific syntax that can be used 
#### 🚀 shows you where you can explore more on your own 

## 📂 Load the Dataset

We have 4 datasets, each representing a different (anonymized) genre, and a legend describing the columns.

In [None]:
# Explicitly list the 4 dataset files
data_files = [
    "../../data/1 Creators and their Listeners with gender and 6 locations.csv",
    "../../data/2 Creators and their Listeners with gender and 6 locations.csv",
    "../../data/3 Creators and their Listeners with gender and 6 locations.csv",
    "../../data/5 Creators and their Listeners with gender and 6 locations.csv"
]

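# Load all datasets into a dictionary.
# Genres are labelled by their position in the list, not by the number in the
# file name, so the file starting with "5" is stored as "Genre 4".
datasets = {}
for i, file in enumerate(data_files, start=1):
    name = f"Genre {i}"
    datasets[name] = pd.read_csv(file)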

# Load legend
legend = pd.read_csv("../../data/legend.csv")

## 🔍 Explore the Data

Let's check the shape and columns of each dataset to get a sense of what we're working with.

In [None]:
genre = "Genre 1"   # You can change this!
df = datasets[genre]
print(f"{genre}: {df.shape[0]} rows, {df.shape[1]} columns")
df.head(2)
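
The cell above inspects one genre at a time. To compare the size of every dataset at once, looping over the dictionary is one option. The sketch below is a minimal, self-contained illustration: the two tiny DataFrames in `toy_datasets` are invented for demonstration and stand in for the real `datasets` dictionary loaded earlier.

```python
import pandas as pd

# Hypothetical stand-in for the `datasets` dictionary loaded above.
# The values are invented purely for illustration.
toy_datasets = {
    "Genre 1": pd.DataFrame({"creator_gender": ["female", "male", None],
                             "total_play_count": [120, 300, 45]}),
    "Genre 2": pd.DataFrame({"creator_gender": ["male", "custom"],
                             "total_play_count": [80, 10]}),
}

# Collect one summary row per genre: row count, column count, missing cells
overview = pd.DataFrame({
    name: {
        "rows": frame.shape[0],
        "columns": frame.shape[1],
        "missing_values": int(frame.isna().sum().sum()),
    }
    for name, frame in toy_datasets.items()
}).T

print(overview)
```

To run this on the workshop data, replace `toy_datasets` with `datasets`.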

We can always check column definitions in our legend.

In [None]:
# Look up column description in our legend
column = 'creator_gender'   # You can change this!
row = legend[legend['Column'] == column]
description = row['Description']
description.values[0]

Or feel free to open it in your favorite spreadsheet program (Excel, Numbers, Google Sheets) to explore the descriptions!
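
If you look up columns often, the lookup from the cell above can be wrapped in a small function. The following is only a sketch: `describe_column` is a hypothetical helper, and `toy_legend` is an invented miniature legend with placeholder descriptions. It assumes the legend has `Column` and `Description` columns, as used in the cell above.

```python
import pandas as pd

# Hypothetical miniature legend; the real one is loaded from legend.csv.
# The descriptions are placeholders, not the actual legend text.
toy_legend = pd.DataFrame({
    "Column": ["creator_gender", "total_play_count"],
    "Description": ["Placeholder description A", "Placeholder description B"],
})

def describe_column(legend_df, column):
    """Return the description for `column`, or a fallback message if absent."""
    match = legend_df.loc[legend_df["Column"] == column, "Description"]
    if match.empty:
        return f"No description found for '{column}'"
    return match.iloc[0]

print(describe_column(toy_legend, "creator_gender"))
print(describe_column(toy_legend, "not_a_column"))
```

Unlike indexing with `.values[0]`, this version does not raise an error when the column name is misspelled or missing from the legend.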

## 🌍 (Example) Step 1: Listening Across Countries

Each dataset includes plays from 6 countries. Let's compare total plays by country for one genre.

In [None]:
# Choose the genre to explore
genre = 'Genre 1'   # You can change this!
df = datasets[genre]

# Define the countries
countries = ['BR', 'DE', 'EG', 'GB', 'US', 'VN']

# Select columns for total plays in each country (if they exist)
country_cols = [f'total_play_count_{c}' for c in countries if f'total_play_count_{c}' in df.columns]

# Calculate total plays per country
country_totals = df[country_cols].sum()

# Display basic statistics
print(f"Total Plays by Country ({genre}):")
print(country_totals)

In [None]:
# Plot the results as a bar chart
country_totals.plot(kind='bar')
plt.title(f"Total Plays by Country ({genre})")
plt.ylabel("Total Plays")
plt.show()
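
Raw totals can be hard to compare at a glance. One option is to express each country's plays as a share of all plays in the genre. The sketch below uses an invented Series, `toy_country_totals`, in place of the real `country_totals`; only the column naming pattern is taken from the cells above.

```python
import pandas as pd

# Hypothetical totals standing in for `country_totals`; numbers are invented.
toy_country_totals = pd.Series({
    "total_play_count_BR": 200,
    "total_play_count_DE": 100,
    "total_play_count_US": 500,
    "total_play_count_VN": 200,
})

# Share of all plays contributed by each country, in percent
country_shares = toy_country_totals / toy_country_totals.sum() * 100

# Shorten the labels to the two-letter country code and sort descending
country_shares.index = country_shares.index.str.replace(
    "total_play_count_", "", regex=False
)
country_shares = country_shares.sort_values(ascending=False).round(1)

print(country_shares)
```

To try it on the workshop data, replace `toy_country_totals` with `country_totals`.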

## 🎨 Step 2: Gender Representation Among Listeners (per Country)

Let's zoom into one country (e.g., the US) and see how plays are distributed among listener genders.

In [None]:
# Choose the country and genre to explore
country = 'US'   # Try changing to BR, DE, EG, GB, or VN
genre = 'Genre 1'
df = datasets[genre]

# Calculate total plays by listener gender for the selected country
listener_gender_totals = {
    'Male': df[f'plays_by_males_{country}'].sum(),
    'Female': df[f'plays_by_females_{country}'].sum(),
    'Custom': df[f'plays_by_custom_gender_{country}'].sum(),
    'Null': df[f'plays_by_null_gender_{country}'].sum(),
}

# Display total plays per listener gender
print(f"Total plays by listener gender ({country}):")
print(pd.Series(listener_gender_totals))

In [None]:
# Create a pie chart
plt.figure()
plt.pie(
    list(listener_gender_totals.values()),   # convert dict views to lists for matplotlib
    labels=list(listener_gender_totals.keys()),
    autopct='%1.1f%%',
)
plt.title(f"Listener Gender Distribution in {country}: {genre}")
plt.show()
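
Instead of changing `country` by hand, the same calculation can be repeated for several countries and collected in one table. This is a minimal sketch under stated assumptions: `toy_df` is an invented two-row dataset restricted to two countries, and it reuses the `plays_by_..._{country}` column naming from the cell above.

```python
import pandas as pd

# Hypothetical dataset following the column naming pattern used above.
# All numbers are invented.
toy_df = pd.DataFrame({
    "plays_by_males_US": [60, 40], "plays_by_females_US": [50, 30],
    "plays_by_custom_gender_US": [5, 5], "plays_by_null_gender_US": [5, 5],
    "plays_by_males_DE": [30, 30], "plays_by_females_DE": [20, 10],
    "plays_by_custom_gender_DE": [0, 5], "plays_by_null_gender_DE": [5, 0],
})

gender_prefixes = {
    "Male": "plays_by_males_",
    "Female": "plays_by_females_",
    "Custom": "plays_by_custom_gender_",
    "Null": "plays_by_null_gender_",
}

# One row per country, one column per listener gender (raw play totals)
totals = pd.DataFrame({
    country: {label: toy_df[prefix + country].sum()
              for label, prefix in gender_prefixes.items()}
    for country in ["US", "DE"]
}).T

# Convert each row to percentages so countries of different size are comparable
shares = totals.div(totals.sum(axis=1), axis=0) * 100
print(shares.round(1))
```

With the real data, `toy_df` would be replaced by `datasets[genre]` and the country list extended to all six codes.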

## 🧭 Step 3: Understanding Creator Demographics

Let's explore how many creators identify as male, female, custom, or null in each genre.

In [None]:
# Choose the genre to explore
genre = "Genre 1"   # You can change this!
df = datasets[genre]

# Column to analyze
column = 'creator_gender'

# Counts per gender
gender_counts = df[column].value_counts(dropna=False)

# Display statistics
print("Genre:", genre)
print("\nCounts by gender:")
print(gender_counts)

In [None]:
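# Calculate the percentage of creators per gender for every genre
# (reuses `column` from the previous cell; the loop variable is named
# `genre_df` so it does not overwrite `df`)
creator_gender_summary = {}
for name, genre_df in datasets.items():
    gender_pcts = genre_df[column].value_counts(normalize=True, dropna=False) * 100
    creator_gender_summary[name] = gender_pcts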

# Transpose the DataFrame so that genres are rows and genders are columns
creator_gender_df = pd.DataFrame(creator_gender_summary).T

# Plot a stacked bar chart of creator gender distribution
creator_gender_df.plot(kind='bar', stacked=True)
plt.title("Creator Gender Distribution by Genre")
plt.ylabel("Percentage of Creators")
plt.legend(title="Gender", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

## 📊 (Optional) Step 4: Calculating Representation Ratios

Let's compare the percentage of female creators with the percentage of plays by female listeners in each genre. The gap between the two is a simple proxy for representation.

In [None]:
# Choose the genre to explore
genre = 'Genre 1'   # You can change this!
df = datasets[genre]

# Percentage of female creators and listeners
pct_creators_female = (df['creator_gender'].eq('female').mean()) * 100
pct_listeners_female = (df['total_play_count_female'].sum() / df['total_play_count'].sum()) * 100

print(f"Percentage of female creators: {pct_creators_female:.1f}%")
print(f"Percentage of plays by female listeners: {pct_listeners_female:.1f}%")
print(f"Difference (listeners - creators): {pct_listeners_female - pct_creators_female:.1f} percentage points")

In [None]:
# Initialize an empty list to store representation data
rep_data = []

# Loop through each genre dataset
# (the loop variable is named `genre_df` so it does not overwrite `df`)
for name, genre_df in datasets.items():

    # Calculate the percentage of creators who are female
    pct_creators_female = (genre_df['creator_gender'].eq('female').mean()) * 100

    # Calculate the percentage of plays that come from female listeners
    pct_listeners_female = (genre_df['total_play_count_female'].sum() / genre_df['total_play_count'].sum()) * 100

    # Add a list of genre name, female creators %, and female listeners % to the list
    rep_data.append([name, pct_creators_female, pct_listeners_female])

# Convert the list of lists into a DataFrame
rep_df = pd.DataFrame(rep_data, columns=['Genre', 'Female Creators (%)', 'Female Listeners (%)'])

# Plot a bar chart comparing female creators vs female listeners for each genre
rep_df.plot(x='Genre', kind='bar', figsize=(8,5))
plt.title("Representation of Female Creators vs Female Listeners")
plt.ylabel("Percentage")
plt.show()
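
The cells above place the two percentages side by side. If a single number per genre is preferred, one option is to divide the creator share by the listener share. This ratio is an assumption made here for illustration, not a metric defined by the dataset. The sketch uses an invented table, `toy_rep_df`, with the same column names as `rep_df`.

```python
import pandas as pd

# Hypothetical percentages standing in for `rep_df`; the numbers are invented.
toy_rep_df = pd.DataFrame({
    "Genre": ["Genre 1", "Genre 2"],
    "Female Creators (%)": [20.0, 45.0],
    "Female Listeners (%)": [40.0, 45.0],
})

# Ratio of the female creator share to the female listener share.
# 1.0 means the two shares match; below 1.0 means the creator share is
# smaller than the listener share.
toy_rep_df["Representation Ratio"] = (
    toy_rep_df["Female Creators (%)"] / toy_rep_df["Female Listeners (%)"]
)

print(toy_rep_df)
```

In this toy table, the first genre has a ratio of 0.5 (the creator share is half the listener share) and the second has a ratio of 1.0.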

## 💡 Key Insights

Based on your analysis above, write down 3-5 key insights you've discovered about DEI in the music industry:

### Your Insights:

1. **Gender Representation**: [Write your observation about gender distribution among creators and listeners]

2. **Listener Diversity by Country**: [Write your observation about how listener demographics differ across countries]

3. **Genre-Specific Trends**: [Write your observation about differences across genres]

4. **Data Limitations & Biases**: [Write your observation about missing or anonymized data and its implications]

5. **Additional Insight**: [Any other pattern you noticed]

## Next Steps

Congratulations! You've completed the beginner level analysis. You've learned how to:

- Load and explore a dataset

- Calculate basic statistics

- Create visualizations to understand data patterns

- Analyze representation across different demographics

### Ready for more?
Move on to the **Intermediate Level** notebook to dive deeper into statistical analysis and more advanced visualizations!