# Beginner Level: Exploring DEI in the Music Industry

Welcome to the beginner workshop on Diversity, Equity, and Inclusion (DEI) in the music industry! This notebook will guide you through basic data exploration and visualization techniques to understand how music listening differs across genders, countries, and genres.

## Learning Objectives:

- Load and explore a dataset on listening behaviour

- Investigate creator and listener demographics

- Create simple visualizations to identify patterns

- Calculate basic statistics about representation

## 🚀 Getting Started
Let's start by importing the libraries we'll need and loading our data.

In [3]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set up plotting style
# 📖 Matplotlib styles: https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html
# 📖 Seaborn color palettes: https://seaborn.pydata.org/tutorial/color_palettes.html
plt.style.use('default') 
sns.set_palette("husl") 

print("Libraries imported successfully!")

Libraries imported successfully!


## 📂 Load the Dataset

We have 4 datasets, each representing a different (anonymized) genre, and a legend describing the columns.
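
Before loading the real files, the load-into-a-dictionary pattern can be tried on a small scale. This is only a sketch: the in-memory CSV text, the genre labels, and the `toy_` variable names are made up for illustration and are not part of the workshop data.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV text standing in for files on disk
csv_sources = {
    "Genre A": "creator_gender,total_play_count\nfemale,10\nmale,20\n",
    "Genre B": "creator_gender,total_play_count\nmale,5\n",
}

# Read each source into a DataFrame and store it under a readable label
toy_datasets = {}
for label, text in csv_sources.items():
    toy_datasets[label] = pd.read_csv(io.StringIO(text))

# Print the size of each table
for label, frame in toy_datasets.items():
    print(f"{label}: {frame.shape[0]} rows, {frame.shape[1]} columns")
```

Keeping the tables in a dictionary means each genre can later be picked by name, for example `toy_datasets["Genre A"]`.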

In [None]:
# Explicitly list the 4 dataset files
# 💡 Store the file paths in a list called data_files.
# 📖 Docs: Working with file paths in Python: https://docs.python.org/3/library/os.path.html
data_files = [
    "../../data/1 Creators and their Listeners with gender and 6 locations.csv",
    "../../data/2 Creators and their Listeners with gender and 6 locations.csv",
    "../../data/3 Creators and their Listeners with gender and 6 locations.csv",
    "../../data/5 Creators and their Listeners with gender and 6 locations.csv"
]

# Load all datasets into a dictionary
# 💡 Read the CSV file for each dataset listed above
# 📖 Docs: Reading CSV files with pandas: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
datasets = {}
for i, file in enumerate(data_files):
    name = f"Genre {i+1}"
    datasets[name] = pd.read_csv(file)

# Load legend
legend = pd.read_csv("../../data/legend.csv")

## 🔍 Explore the Data

Let's check the shape and columns of each dataset to get a sense of what we're working with.

In [7]:
genre = "Genre 1"   # You can change this!

df = datasets[genre]
print(f"{genre}: {df.shape[0]} rows, {df.shape[1]} columns")
df.head(2)

# 🚀 Try accessing other genres



We can always check column definitions in our legend.
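
A lookup like this can be wrapped in a small helper that also handles names missing from the legend. The sketch below assumes a legend with `Column` and `Description` headers, as used in this notebook. The helper name `describe_column` and the toy descriptions are invented for illustration.

```python
import pandas as pd

# Toy legend with the same two headers the lookup relies on;
# the descriptions are invented for illustration
toy_legend = pd.DataFrame({
    "Column": ["creator_gender", "total_play_count"],
    "Description": ["Gender reported by the creator", "Total number of plays"],
})

def describe_column(legend_df, column_name):
    """Return the description for column_name, or None if it is not listed."""
    match = legend_df.loc[legend_df["Column"] == column_name, "Description"]
    return match.iloc[0] if not match.empty else None

print(describe_column(toy_legend, "creator_gender"))
print(describe_column(toy_legend, "not_a_column"))
```

Returning `None` for an unknown name avoids the `IndexError` that `.values[0]` raises when the filter matches no rows.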

In [None]:
# Look up column description in our legend
column = 'creator_gender'   # You can change this!
row = legend[legend['Column'] == column]
description = row['Description']
description.values[0]

Or feel free to open it in your favorite spreadsheet program (Excel, Numbers, Google Sheets) to explore the descriptions!

## 🌍 (Example) Step 1: Listening Across Countries

Each dataset includes plays from 6 countries. Let's compare total plays by country for one genre.

In [None]:
# Choose the genre to explore
genre = 'Genre 1'   # You can change this!
df = datasets[genre]

# Define the countries
countries = ['BR', 'DE', 'EG', 'GB', 'US', 'VN']

# Select columns for total plays in each country (if they exist)
country_cols = [f'total_play_count_{c}' for c in countries if f'total_play_count_{c}' in df.columns]

# Calculate total plays per country
country_totals = df[country_cols].sum()

# Display basic statistics
print(f"Total Plays by Country ({genre}):")
print(country_totals)

In [None]:
# Plot the results as a bar chart
country_totals.plot(kind='bar')
plt.title(f"Total Plays by Country ({genre})")
plt.ylabel("Total Plays")
plt.show()

## 🎨 Step 2: Gender Representation Among Listeners (per Country)

Let's zoom into one country (e.g., the US) and see how plays are distributed among listener genders.
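
The general technique here is to build a column name from variables with an f-string and then sum that column. The sketch below uses a toy table whose column naming pattern (`plays_<gender>_<country>`) is hypothetical. Check the legend for the real column names before adapting it.

```python
import pandas as pd

# Toy table; the column naming pattern is hypothetical,
# so check the legend for the real column names
toy_plays = pd.DataFrame({
    "plays_male_US": [10, 5],
    "plays_female_US": [7, 3],
    "plays_male_BR": [2, 2],
})

toy_country = "US"

# Build each column name from variables, then add up that column
male_col = f"plays_male_{toy_country}"
female_col = f"plays_female_{toy_country}"

toy_totals = {
    "Male": toy_plays[male_col].sum(),
    "Female": toy_plays[female_col].sum(),
}
print(pd.Series(toy_totals))
```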

In [1]:
# Choose the country and genre to explore
# 🚀  Try changing to BR, DE, EG, GB, or VN
# 💡 df stands for DataFrame, a table-like structure

country = 'US'   
genre = 'Genre 1' 
df = datasets[genre] 

# Calculate total plays by listener gender for the selected country
# 💡 Have a look at the legend. Which column describes both gender and country?
# 💡 You can have a look in 🔍 Explore the Data or open the legend from the repo directly

# 📖 docs about accessing a column: https://pandas.pydata.org/docs/user_guide/indexing.html#basics
# 📖 docs about adding variables to a string: https://www.w3schools.com/python/python_string_formatting.asp
# 📖 docs about adding all values of one df column together: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html

listener_gender_totals = {
    'Male': # Your code goes here!
    'Female': # Your code goes here!
    'Custom': # Your code goes here!
    'Null': # Your code goes here!
}

# Display counts
print("Listener gender counts:")
print(pd.Series(listener_gender_totals))

# 🚀 Try changing the genre and see what other genres look like.


In [5]:
# Create a pie chart
# 💡 Pie charts are useful for visualizing the proportion of categories in a dataset.
# 💡 plt.pie() creates a pie chart. You can pass values (e.g., listener_gender_totals.values()) and set labels.

# 📖 Pie chart docs: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.pie.html
# 📖 More on plotting: https://www.w3schools.com/python/matplotlib_pie_charts.asp
# 📖 Docs on using .values() and .keys() for dictionaries: https://www.w3schools.com/python/python_dictionaries.asp

# ⚠️ Don't forget to call plt.show() to display your chart!


# Your code goes here!


# 🚀 Try changing the genre or country in earlier cells and see how the chart updates.
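
For reference, the pie chart calls described in the hints can be sketched on invented numbers. The totals below are not taken from the dataset. They only show how a dictionary's values and keys map to wedge sizes and labels.

```python
import matplotlib.pyplot as plt

# Invented totals, used only to show the plotting calls
toy_gender_totals = {"Male": 15, "Female": 10, "Custom": 1, "Null": 4}

fig, ax = plt.subplots()
ax.pie(
    list(toy_gender_totals.values()),        # wedge sizes
    labels=list(toy_gender_totals.keys()),   # wedge labels
    autopct="%1.1f%%",                       # print each share as a percentage
)
ax.set_title("Plays by listener gender (toy data)")
plt.show()
```

Very small categories can produce overlapping labels in a pie chart. In that case a bar chart may be easier to read.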

## 🧭 Step 3: Understanding Creator Demographics

Let's explore how many creators identify as male, female, custom, or null in each genre.
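
To see what `value_counts()` does before applying it to the real data, here is a minimal sketch on a made-up Series containing one missing value. The values are invented for illustration.

```python
import pandas as pd

# Toy column with one missing value (None)
toy_genders = pd.Series(["female", "male", "male", None, "custom", "male"])

# Raw counts, keeping the missing value as its own category
toy_counts = toy_genders.value_counts(dropna=False)
print(toy_counts)

# Proportions (0 to 1) turned into percentages
toy_pcts = toy_genders.value_counts(normalize=True, dropna=False) * 100
print(toy_pcts.round(1))
```

With `dropna=False` the missing entry is part of the total, so the percentages describe all rows rather than only the rows with a known gender.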

In [6]:
# Choose the genre to explore
genre = "Genre 1"   # You can change this!
df = datasets[genre]

# Column to analyze
# 💡 What is the name of the column representing the creator's gender? Check the data or the legend if needed.
# 📖 Docs: How to select a column: https://pandas.pydata.org/docs/user_guide/indexing.html#basics

column = # Your code goes here!

# Counts per gender
# 💡 value_counts() counts unique values in a Series (e.g., a DataFrame column).
# 💡 Setting dropna=False means missing (NA) values are also counted.
# 📖 value_counts() reference: https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html

gender_counts = # Your code goes here!

# Display statistics
print("Genre:", genre)
print("\nCounts by gender:")
print(gender_counts)

# 🚀 Try changing the genre and see how the counts change.


In [None]:
# Calculate the percentage for every genre

# 💡 We loop through all genres in the datasets dictionary using .items()
# 📖 Docs on dictionary iteration: https://www.w3schools.com/python/python_dictionaries_loop.asp

# 💡 Remember: normalize=True converts counts to proportions (decimals between 0 and 1)
# 💡 Multiply by 100 to convert decimals to percentages
# 📖 value_counts() with normalize: https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html
# 📖 Getting percentages instead of counts: https://www.statology.org/pandas-value_counts-percentage/

creator_gender_summary = {}
for name, df in datasets.items():
    gender_pcts = # Your code goes here! Keep in mind we need percentages rather than counts.
    creator_gender_summary[name] = gender_pcts

# Transpose the DataFrame so that genres are rows and genders are columns

creator_gender_df = pd.DataFrame(creator_gender_summary).T

# Plot a stacked bar chart of creator gender distribution
# 💡 Stacked bar charts show composition—each bar's segments represent different categories
# 💡 Use .plot() on the DataFrame with kind='bar' and stacked=True
# 💡 Don't forget to add a title, ylabel, legend, and call plt.show()!
# 📖 Plotting bar charts with pandas: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.bar.html
# 📖 More on bar charts: https://www.w3schools.com/python/matplotlib_bars.asp

# Your code goes here!
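
The stacked bar chart step can be sketched on invented percentages. The genre labels and numbers below are made up. The point is how a dictionary of Series becomes a table with genres as rows, and how `stacked=True` draws one bar per row.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented percentages per genre; each inner Series sums to 100
toy_summary = {
    "Genre A": pd.Series({"male": 60.0, "female": 30.0, "custom": 10.0}),
    "Genre B": pd.Series({"male": 45.0, "female": 50.0, "custom": 5.0}),
}

# After transposing, genres are rows and genders are columns
toy_summary_df = pd.DataFrame(toy_summary).T

ax = toy_summary_df.plot(kind="bar", stacked=True)
ax.set_title("Creator gender distribution (toy data)")
ax.set_ylabel("Share of creators (%)")
ax.legend(title="Gender")
plt.show()
```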

## 📊 (Optional) Step 4: Calculating Representation Ratios

Let's compare the share of female creators with the share of plays from female listeners in each genre, a simple proxy for representation.
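
Both percentages can be worked through on a toy table first. The numbers below are invented. The column names `creator_gender`, `total_play_count`, and `total_play_count_female` follow the hints in this notebook, but confirm them against the legend.

```python
import pandas as pd

# Toy table; values are invented
toy_rep = pd.DataFrame({
    "creator_gender": ["female", "male", "male", "female", None],
    "total_play_count": [100, 300, 200, 350, 50],
    "total_play_count_female": [60, 90, 50, 280, 20],
})

# .eq('female') gives True/False per row; the mean of booleans is the share of True
pct_creators = toy_rep["creator_gender"].eq("female").mean() * 100

# Share of all plays that came from female listeners
pct_plays = (
    toy_rep["total_play_count_female"].sum()
    / toy_rep["total_play_count"].sum()
    * 100
)

print(f"Female creators: {pct_creators:.1f}%")
print(f"Plays by female listeners: {pct_plays:.1f}%")
```

Note that the row with a missing gender still counts in the denominator of the creator percentage, so missing data lowers the reported share.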

In [None]:
# Choose the genre to explore
genre = 'Genre 1'  # You can change this!
df = datasets[genre]

# Percentage of female creators and listeners
# 💡 To get the percentage of female creators: count rows where creator_gender is 'female', then divide by total rows. Multiply by 100 for percentage.
# 📖 Docs on filtering and mean: https://pandas.pydata.org/docs/user_guide/groupby.html
# 💡 Use df['creator_gender'].eq('female').mean() * 100 to calculate the percentage of female creators.
pct_creators_female = # Your code goes here!

# 💡 To get the percentage of plays by female listeners: sum the total_play_count_female column, divide by the sum of total_play_count, then multiply by 100.
# 📖 Docs on sum: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html
# 📖 General percentage calculation: https://www.geeksforgeeks.org/python/how-to-calculate-the-percentage-of-a-column-in-pandas/
pct_listeners_female = # Your code goes here!

print(f"Percentage of female creators: {pct_creators_female:.1f}%")
print(f"Percentage of plays by female listeners: {pct_listeners_female:.1f}%")
print(f"Difference (listeners - creators): {pct_listeners_female - pct_creators_female:.1f}%")

# 🚀 Try changing the genre and see how these percentages change!

In [None]:
# Initialize an empty list to store representation data
rep_data = []

# Loop through each genre dataset
for name, df in datasets.items():
    # Calculate the percentage of creators who are female
    # 💡 Count the creators whose gender is 'female', divide by total creators, multiply by 100.
    # 📖 Docs: Filtering and mean: https://pandas.pydata.org/docs/reference/api/pandas.Series.mean.html
    pct_creators_female = # Your code goes here!

    # Calculate the percentage of plays that come from female listeners
    # 💡 Sum total_play_count_female, divide by sum of total_play_count (all listeners), and multiply by 100.
    # 📖 Docs: Summing DataFrame columns: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html
    # 📖 Percentage calculation: https://www.geeksforgeeks.org/python/how-to-calculate-the-percentage-of-a-column-in-pandas/
    pct_listeners_female = # Your code goes here!

    # Add a list of genre name, female creators %, and female listeners % to the list
    rep_data.append([name, pct_creators_female, pct_listeners_female])


# Convert the list of lists into a DataFrame
# 💡 pd.DataFrame(...) can build a DataFrame from a list of lists. Don't forget to specify column names!
# 📖 Example with DataFrame from lists: https://www.geeksforgeeks.org/python/creating-pandas-dataframe-using-list-of-lists/
rep_df = pd.DataFrame(rep_data, columns=['Genre', 'Female Creators (%)', 'Female Listeners (%)'])

# Plot a bar chart comparing female creators vs female listeners for each genre
# 💡 Use rep_df.plot with kind='bar' and x='Genre' to visualize the comparison.
# 💡 Add a title and ylabel, and call plt.show() to display your chart.
# 📖 Plotting with pandas: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html

# Your code goes here!
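
A comparison chart of this kind can be sketched with invented rows. The percentages below are made up and only show how a list of lists becomes a DataFrame and then a side-by-side bar chart.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented rows: [genre, female creators %, female listeners %]
toy_rows = [
    ["Genre A", 30.0, 48.0],
    ["Genre B", 50.0, 41.0],
]

toy_rep_df = pd.DataFrame(
    toy_rows,
    columns=["Genre", "Female Creators (%)", "Female Listeners (%)"],
)

# x='Genre' puts genre names on the axis; the other columns become side-by-side bars
ax = toy_rep_df.plot(kind="bar", x="Genre")
ax.set_title("Female creators vs. female listeners (toy data)")
ax.set_ylabel("Percentage")
plt.show()
```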

## 💡 Key Insights

Based on your analysis above, write down 3-5 key insights you've discovered about DEI in the music industry:

### Your Insights:

1. **Gender Representation**: [Write your observation about gender distribution among creators and listeners]

2. **Listener Diversity by Country**: [Write your observation about how listener demographics differ across countries]

3. **Genre-Specific Trends**: [Write your observation about differences across genres]

4. **Data Limitations & Biases**: [Write your observation about missing or anonymized data and its implications]

5. **Additional Insight**: [Any other pattern you noticed]

## Next Steps

Congratulations! You've completed the beginner level analysis. You've learned how to:

- Load and explore a dataset

- Calculate basic statistics

- Create visualizations to understand data patterns

- Analyze representation across different demographics

### Ready for more?
Move on to the **Intermediate Level** notebook to dive deeper into statistical analysis and more advanced visualizations!