# Advanced Level: Statistical Analysis of DEI in Music Industry

Welcome to the advanced workshop on DEI in the music industry! This notebook is an extension of the intermediate workshop with more advanced analysis and visualization techniques.

## Learning Objectives:
- Understand the importance of DEI and how it affects the music industry
- Create advanced visualizations with multiple variables
- Analyze correlation patterns in the data
- Apply grouping and aggregation techniques
- Combine multiple datasets

## 🚀 Getting Started
Let's start by importing the libraries we'll need and loading our data.

In [None]:
# Enable inline plots
%matplotlib inline
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Set up plotting style
plt.style.use('default')
sns.set_palette("Set2")
plt.rcParams['figure.figsize'] = (12, 8)

print("Libraries imported successfully!")

## 📂 Load the Dataset

We have 5 datasets to work with:
* 4 datasets, each containing listening stats for a different (anonymized) genre (a.k.a. the "listeners datasets")
* 1 dataset containing engagement stats for all genres (a.k.a. the "comments dataset")

In addition, we have a legend describing the columns.

In [2]:
# Explicitly list the 4 dataset files
# 💡 Store the file paths in a list called data_files.
# 📖 Docs: Working with file paths in Python: https://docs.python.org/3/library/os.path.html
data_files = [
    "../../data/1 Creators and their Listeners with gender and 6 locations.csv",
    "../../data/2 Creators and their Listeners with gender and 6 locations.csv",
    "../../data/3 Creators and their Listeners with gender and 6 locations.csv",
    "../../data/5 Creators and their Listeners with gender and 6 locations.csv"
]

# Load all datasets into a dictionary
# 💡 Read the CSV file for each dataset listed above
# 📖 Docs: Reading CSV files with pandas: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
# The files cover genres 1, 2, 3 and 5, so label each dataset with its actual genre number
genre_ids = [1, 2, 3, 5]
datasets = {}
for genre_id, file in zip(genre_ids, data_files):
    name = f"Genre {genre_id}"
    datasets[name] = pd.read_csv(file)


# Load comments dataset
comments_data = pd.read_csv("../../data/creator and listener comments by gender and anonymised genre.csv")

# Load legend
legend = pd.read_csv("../../data/legend.csv")

In [3]:
# Combine the listeners datasets per genre into one large dataset
listeners_data = pd.concat(datasets.values()).reset_index(drop=True)

## 🔍 Explore the Data
Familiarize yourself with the listeners & comments datasets before jumping to the next step!
- What does the data look like?
- What are the columns? What do they represent?
- What are the data types?
- What are the missing values?
- Are there any duplicates?

Is there anything you would like to transform in the data? (e.g. change data types, remove certain columns or rows, replace values, deal with null values, etc.)
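The questions above can be answered with a handful of pandas calls. Here is a minimal sketch on a small made-up table; the rows are invented for illustration, and only the column names `creator_gender`, `creator_signup_country` and `total_play_count` come from the workshop data.

```python
import pandas as pd

# Hypothetical toy stand-in for the listeners data (values are made up)
toy = pd.DataFrame({
    "creator_gender": ["female", "male", None, "male", "male"],
    "creator_signup_country": ["US", "DE", "BR", "DE", "DE"],
    "total_play_count": [120, 300, 45, 300, 80],
})

n_rows, n_cols = toy.shape               # shape is an attribute, not a method
column_types = toy.dtypes                # data type of each column
missing_per_column = toy.isna().sum()    # number of missing values per column
n_duplicates = toy.duplicated().sum()    # rows identical to an earlier row

print(n_rows, n_cols)
print(column_types)
print(missing_per_column)
print(n_duplicates)
```

The same calls should work unchanged on `listeners_data` and `comments_data`; whether duplicates or missing values are a problem depends on what they mean in context.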

In [None]:
# Explore what the genre datasets look like
## 🧐 Understand what the data represents
## 🔍 Look up column descriptions in the legend
## 📖 Docs: Helpful pandas tools include the attributes DataFrame.shape and DataFrame.columns, and the methods DataFrame.head(), DataFrame.tail(), DataFrame.describe()
## 📖 Docs: Check out this tutorial with a good overview of an initial data exploration & cleaning process: https://miamioh.edu/centers-institutes/center-for-analytics-data-science/students/coding-tutorials/python/data-cleaning.html


In [None]:
# Check for missing values.
# 🧐 Which columns have missing values? What does the missing data represent?
# 🔍 What do we want to do with it? We can drop the columns, fill the missing values, or keep them as is.
# 📖 Docs: You can find more information on how to work with missing data (e.g. drop it or fill it) here: https://pandas.pydata.org/docs/user_guide/missing_data.html



In [None]:
# Explore the categorical columns "creator_gender" and "creator_signup_country"
# Hint: Series.value_counts() can be used to count values in one series


In [None]:
# Explore what the comments dataset looks like
## 🧐 Understand what the data represents
## 🔍 Look up column descriptions in the legend


In [16]:
# 📊 Do any data transformations you consider necessary before jumping to the next step


<details>
<summary>Click here to get a brief overview of the different datasets</summary>

**Listeners dataset:**
- one row per creator
- contains listening data from 4 different genres
- creator gender (`creator_gender`)
- genders represented: female, male, custom, unknown (NA)
- countries represented: US, GB, DE, EG, BR, VN
- `total_play_count` as an indicator of the popularity of the creator
- `total_play_count_<gender>` as an indicator of the popularity of the creator among listeners of each gender
- `total_play_count_<country>` as an indicator of the popularity of the creator in different countries
- `plays_by_<gender>_<country>` and `pct_plays_by_<gender>_<country>` as an indicator of the popularity of the creator among listeners of each gender in each country


**Comments dataset:**
- one row per genre
- `new_comments_by_<gender>_creator` and `pct_new_comments_by_<gender>_creator` as an indicator for the commenting behavior of creators based on creator's gender
- `total_new_creator_comment_count` displaying the total amount of new comments by creators in a genre
- `listener_new_comment_count` as an indicator of the listeners commenting behavior
- `responses_by_<gender>_creator` as an indicator of the response behavior of creators, broken down by creator gender
- `total_reponse_count` as an indicator of the response behavior in different genres
- `total_responses_by_listener` as an indicator of the response behavior by listeners in different genres
- `total_responses_by_creator` as an indicator of the response behavior by creators in different genres

</details>
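Because the column names follow a `<prefix>_<suffix>` pattern, it can be handy to build the column lists programmatically instead of typing them out. The sketch below is one possible approach: the helper `column_family` is hypothetical, and the gender suffixes are an assumption, so check the legend for the exact spellings used in the real data.

```python
import pandas as pd

# Assumed suffixes: verify against the legend before using them on the real data
LISTENER_GENDERS = ["female", "male", "custom"]
COUNTRIES = ["BR", "DE", "EG", "GB", "US", "VN"]

def column_family(prefix, suffixes):
    """Build column names such as total_play_count_female from a prefix and suffixes."""
    return [f"{prefix}_{suffix}" for suffix in suffixes]

gender_cols = column_family("total_play_count", LISTENER_GENDERS)
country_cols = column_family("total_play_count", COUNTRIES)

# Hypothetical toy frame with only some of the expected columns
toy = pd.DataFrame(columns=[
    "creator_gender", "total_play_count",
    "total_play_count_female", "total_play_count_male", "total_play_count_US",
])

# Check which expected columns are absent before selecting them
missing = [c for c in gender_cols + country_cols if c not in toy.columns]
print(missing)
```

Explicit lists are used here rather than a prefix match because `total_play_count_<gender>` and `total_play_count_<country>` share the same prefix, so a prefix filter would mix both families.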



Before moving on to the next step, we’d like to emphasize that this workshop is all about fostering creativity and having fun! Feel free to create additional visualizations whenever you feel they enhance your storytelling. Some of our favorite visualization types include:

- Bar plots
- Pie charts
- Heatmaps
- Scatter plots
- Box plots
- Sankey diagrams

That said, you’re not limited to these options. Experiment and explore other visualization methods that align with your data and the story you want to convey!

Check out the seaborn gallery if you need inspiration: https://seaborn.pydata.org/examples/index.html

## 🎧 Step 1: Creator-Listener Gender Dynamics
*Let's investigate whether creator gender identity influences listener demographics and consumption patterns, establishing baseline understanding of potential gender-based preferences, biases, or barriers in music discovery and consumption across genres and global markets.*

### a) 📊 Overall Creator-Listener Gender Influence
Let's examine whether creator gender creates systematic patterns in listener demographics, identifying potential gender-based consumption biases that could impact creator visibility and success.

- Do listeners show preferential consumption patterns based on creator gender?
- How strong is the overall creator-listener gender correlation and what does this mean for equity?
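One way to approach the first question is to sum the plays per listener gender within each creator gender and normalize each row to shares. This is only a sketch on made-up numbers: the helper `listener_share_by_creator_gender` is hypothetical, and the two play-count columns assume the `total_play_count_<gender>` naming with `female` and `male` suffixes.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({
    "creator_gender": ["female", "female", "male", "male"],
    "total_play_count_female": [60, 40, 30, 20],
    "total_play_count_male": [40, 60, 70, 80],
})
play_cols = ["total_play_count_female", "total_play_count_male"]

def listener_share_by_creator_gender(df, play_cols):
    """Sum plays per creator gender, then normalize each row so it sums to 1."""
    totals = df.groupby("creator_gender", dropna=False)[play_cols].sum()
    return totals.div(totals.sum(axis=1), axis=0)

shares = listener_share_by_creator_gender(toy, play_cols)

# Long format (one row per creator gender and listener group) is easier to plot
shares_long = shares.reset_index().melt(
    id_vars="creator_gender", var_name="listener_group", value_name="share"
)
print(shares)
print(shares_long)
```

`dropna=False` keeps creators with a missing gender as their own group instead of silently dropping them, which matters when missing values carry meaning.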

In [None]:
# Before we dive into the detailed gender dynamics in this dataset, let's first check out the overall distribution of creator genders and plays in this dataset
# 💡 Boxplots can be nice to analyze categorical data and get an initial understanding
sns.boxplot(data=listeners_data, x="total_play_count", y="creator_gender", showfliers=False, showmeans=True)

In [None]:
# Calculate total listeners gender proportions by creator gender



In [35]:
# Visualize the results (use the plots you find most appropriate)
# 💡: DataFrame.melt() can help to change the format of the dataframe: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.melt.html#pandas-dataframe-melt


### b) 📚 Creator-Listener Gender Influence by Genre
We'll analyze how musical genres may amplify or mitigate gender-based listening patterns, identifying genres that promote cross-gender consumption versus those that may reinforce gender disparities.

- Which genres demonstrate the highest cross-gender listening patterns?
- Which genres show less visibility of different genders across creators and listeners?

Idea: Look at the ratio of female creators to female listeners for each genre - a simple proxy for representation. 
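The idea above can be sketched with a split-apply-combine step. The definition used here is an assumption, not the only possible one: `creator_ratio` is the share of creators in a genre who are female, `listener_ratio` is the share of plays in that genre coming from female listeners, and `representation_ratio` divides the first by the second. The helper name and the toy values are made up.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({
    "most_uploaded_genre": ["Genre 1"] * 4 + ["Genre 2"] * 2,
    "creator_gender": ["female", "male", "male", "male", "female", "male"],
    "total_play_count": [100, 100, 100, 100, 50, 150],
    "total_play_count_female": [50, 50, 50, 50, 40, 60],
})

def female_representation(df):
    """Compare the female share of creators with the female share of plays per genre."""
    flagged = df.assign(is_female_creator=df["creator_gender"].eq("female"))
    out = flagged.groupby("most_uploaded_genre").agg(
        n_creators=("creator_gender", "size"),
        n_female_creators=("is_female_creator", "sum"),
        plays=("total_play_count", "sum"),
        plays_by_female=("total_play_count_female", "sum"),
    ).reset_index()
    out["creator_ratio"] = out["n_female_creators"] / out["n_creators"]
    out["listener_ratio"] = out["plays_by_female"] / out["plays"]
    out["representation_ratio"] = out["creator_ratio"] / out["listener_ratio"]
    return out

ratios = female_representation(toy)
print(ratios)
```

Under this definition, a value below 1 means female creators make up a smaller share of the genre than female listeners do of its plays.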

In [None]:
sns.countplot(data=listeners_data, x="most_uploaded_genre", hue="creator_gender")

In [46]:
# For the representation ratios, we first calculate the percentage of plays by gender and creators by gender
# 📖 Docs: Pandas grouping follows the "split-apply-combine" approach, which can be useful here to calculate the representation ratios
#          You first start grouping by a set of columns, then apply multiple aggregations. 
#          Don't forget to call reset_index() at the end to combine it to a new dataset. 
#          https://pandas.pydata.org/docs/user_guide/groupby.html

# Group the data and aggregate total play counts per gender
representation_ratios = ... # Your code goes here

# Calculate the representation ratios  for the creator and all different genders
representation_ratios['creator_ratio'] = ... 


In [None]:
# Visualize the results
# 💡 It can be useful to reshape the data for visualization using DataFrame.melt()
# 💡 sns.catplot allows you to create categorical plots, see documentation for more info: https://seaborn.pydata.org/generated/seaborn.catplot.html 


### c) 🌍 Creator-Listener Gender Influence by Country
This analysis will reveal how cultural contexts and regional attitudes influence creator-listener gender dynamics.

- Do certain countries show more equitable cross-gender listening patterns?
- How does gender diversity of creators change in the different markets?

Idea: As before, look at the ratio of creators from a country to listeners in that country, for each genre.
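A country-level version of the ratio could look like the sketch below. It assumes that the per-country play columns are named `total_play_count_<country>` with upper-case country codes, uses a shortened country list for the toy data, and leaves out the genre split for brevity; the helper `country_representation` is hypothetical.

```python
import pandas as pd

# Shortened list for the toy example; the workshop data covers six countries
TOY_COUNTRIES = ["BR", "DE"]

# Hypothetical toy data (values are made up)
toy = pd.DataFrame({
    "creator_signup_country": ["BR", "DE", "DE", "DE"],
    "total_play_count_BR": [30, 10, 10, 10],
    "total_play_count_DE": [10, 40, 50, 40],
})

def country_representation(df, countries):
    """Compare each country's share of creators with its share of plays."""
    creator_share = (
        df["creator_signup_country"]
        .value_counts(normalize=True)
        .reindex(countries, fill_value=0.0)
    )
    plays = df[[f"total_play_count_{c}" for c in countries]].sum()
    plays.index = countries  # relabel the columns with their country codes
    listener_share = plays / plays.sum()
    return pd.DataFrame({
        "creator_share": creator_share,
        "listener_share": listener_share,
        "ratio": creator_share / listener_share,
    })

country_ratios = country_representation(toy, TOY_COUNTRIES)
print(country_ratios)
```

To include the genre, the same computation could be applied per group, for example inside a loop over `df.groupby("most_uploaded_genre")`.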

In [None]:
# Before we dive into the detailed country representations in this dataset, let's first check out the overall distribution of creator's country and plays in this dataset
# 💡 Boxplots can be nice to analyze categorical data and get an initial understanding


In [None]:
# Calculate creator gender distribution (how many male/female/custom/null creators per country)
# Optional: you can include the genre too


# Calculate Listener gender % by creator gender per country of the listener (optional: and genre)


# Calculate Listener gender % by creator gender per country of the creator (optional: and genre)


In [92]:
# For the representation ratios, we first calculate the percentage of plays by country and creators by country
# 📖 Docs: Pandas grouping follows the "split-apply-combine" approach, which can be useful here to calculate the representation ratios
#          You first start grouping by a set of columns, then apply multiple aggregations. 
#          Don't forget to call reset_index() at the end to combine it to a new dataset. 
#          https://pandas.pydata.org/docs/user_guide/groupby.html

COUNTRY_VALUES = ["BR", "DE", "EG", "GB", "US", "VN"]


In [None]:
# Visualization time!


## 💬 Step 2 (Optional): Creator Engagement Equity Analysis
*Let's analyze gender representation and inclusive participation patterns in creator-driven community engagement, identifying barriers to equitable voice and participation opportunities across gender identities and musical genres.*

### a) 👥 Gender Equity in Creators' Engagement
Let's analyze whether there are equitable participation opportunities across creator genders and identify any barriers to engagement that may disproportionately affect certain gender groups.
- Do all creator genders have equal representation in platform engagement?
- How do you think gender identity impacts creator engagement opportunities?

💡 If you need a refresher of the comments dataset, have an additional look at your dataset and start with simple visualizations of the distributions represented in the dataset.

💡 For genres 1, 2, 3, and 5, you have additional listening statistics in the listeners dataset. These genres are also among the top 5 genres with the most listener comments.
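As a possible starting point, the counts in the `new_comments_by_<gender>_creator` columns can be turned into shares, both overall and per genre. The sketch uses made-up numbers, and the name of the genre column (`genre`) and the `female`/`male` suffixes are assumptions; look up the real names in the legend.

```python
import pandas as pd

# Hypothetical toy stand-in for the comments data (one row per genre)
toy_comments = pd.DataFrame({
    "genre": ["Genre 1", "Genre 2"],  # assumed column name
    "new_comments_by_female_creator": [20, 10],
    "new_comments_by_male_creator": [60, 30],
})

# Pick the count columns; the pct_ columns start with "pct_" and are skipped
comment_cols = [c for c in toy_comments.columns if c.startswith("new_comments_by_")]

# Share of all new creator comments per creator gender, across genres
totals = toy_comments[comment_cols].sum()
overall_share = totals / totals.sum()

# The same shares, computed separately for each genre
per_genre = toy_comments.set_index("genre")[comment_cols]
per_genre_share = per_genre.div(per_genre.sum(axis=1), axis=0)

print(overall_share)
print(per_genre_share)
```

Raw shares only show who comments most; comparing them with the share of creators per gender shows whether a group comments more or less than its size would suggest.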

In [None]:
# Find out which creator genders post the most new comments across all genres


In [None]:
# Visualization time!


In [None]:
# Find out which creator genders respond the most across genres


### b) 🎧 Engagements vs Creator Equity Gap Analysis
Let's take a detailed look at the 4 genres for which we also have listeners data and examine which genders are represented in which genre and how this ratio compares to the engagement ratio per gender.

For this section, we combine both datasets: listeners data and comments data. For this, we first reduce the comments data to the 4 genres for which we also have listeners data.
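The subset-then-merge step described above might look like the following sketch. Everything in it is illustrative: the genre labels, the `genre` column name, and the precomputed `creator_ratio` values are assumptions, and the real comments data may label genres differently.

```python
import pandas as pd

# Hypothetical comments data: one row per genre (values are made up)
comments = pd.DataFrame({
    "genre": ["Genre 1", "Genre 2", "Genre 3", "Genre 4", "Genre 5"],
    "new_comments_by_female_creator": [10, 20, 30, 40, 50],
    "total_new_creator_comment_count": [100, 100, 100, 100, 100],
})

# Hypothetical per-genre listening statistics (e.g. from Step 1b)
listening = pd.DataFrame({
    "most_uploaded_genre": ["Genre 1", "Genre 2", "Genre 3", "Genre 5"],
    "creator_ratio": [0.2, 0.2, 0.2, 0.2],
})

# Keep only the genres that also appear in the listening statistics
subset = comments[comments["genre"].isin(listening["most_uploaded_genre"])]

# validate="one_to_one" raises an error if a genre appears twice on either side
merged = subset.merge(
    listening,
    left_on="genre",
    right_on="most_uploaded_genre",
    how="inner",
    validate="one_to_one",
)

# Gap between the female share of comments and the female share of creators
merged["female_comment_share"] = (
    merged["new_comments_by_female_creator"] / merged["total_new_creator_comment_count"]
)
merged["gap"] = merged["female_comment_share"] - merged["creator_ratio"]
print(merged[["genre", "female_comment_share", "creator_ratio", "gap"]])
```

A positive gap in this sketch would mean female creators account for a larger share of comments than of creators in that genre.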

In [None]:
# Select the subset of the comments dataset for genres 1, 2, 3, and 5


In [None]:
# To put the engagement data into perspective, we first calculate the listening statistics per genre (or reuse them from an earlier section)


In [53]:
# Let's merge both datasets
# 📖 Docs: https://pandas.pydata.org/docs/user_guide/merging.html#merge-join-concatenate-and-compare


In [None]:
# Visualization time!


## 💡 Key Insights

Based on your analysis above, write down 3-5 key insights you've discovered about DEI in the music industry:

### Your Insights:

1. **Gender Representation**: [Write your observation about gender distribution among creators and listeners]

2. **Listener Diversity by Country**: [Write your observation about how listener demographics differ across countries]

3. **Genre-Specific Trends**: [Write your observation about differences across genres]

4. **Data Limitations & Biases**: [Write your observation about missing or anonymized data and its implications]

5. **Additional Insight**: [Any other pattern you noticed]

## Next Steps

Excellent work! You've completed advanced-level statistical analysis including:

- Extracting insights from data related to creators and listeners in the music industry.
- Visualizing relationships within the data to uncover underlying patterns.