In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import csv

In [None]:
# Load the dataset
df = pd.read_csv('mxmh_survey_results.csv')


Exploratory Data Analysis (EDA)
This project explores the relationship between music listening habits and self-reported mental health. Using a survey dataset, we will clean and process the data, perform exploratory data analysis (EDA), and create visualizations to uncover potential patterns and insights. The primary goal is to understand how factors like favorite genre, listening duration, and streaming service might correlate with anxiety, depression, insomnia, and OCD levels. Colors are kept consistent across all visualizations so that each condition is easy to follow from one plot to the next.

Data Cleaning and Wrangling
We will clean the column names, handle missing values, and correct any erroneous data to prepare the dataset for analysis.


Standardize Column Names

All column names are converted to lowercase, and spaces are replaced with underscores, for easier access.


In [None]:
#Standardize Column Names
df.columns = df.columns.str.lower().str.replace(' ', '_')
print("Cleaned column names:")
print(df.columns)


Handle Missing Values

music_effects: This is a key categorical variable. Missing values are filled with "No effect" as a neutral default.
Other categorical columns: For columns such as primary_streaming_service and while_working, missing values are filled with the mode (the most frequent value) as a reasonable assumption.
Numerical columns (age, bpm): Missing values are filled with the median, which is less sensitive to outliers than the mean.
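
To illustrate why the median is the safer fill value, here is a minimal sketch on made-up ages. The values are hypothetical and are not taken from the survey.

```python
import pandas as pd

# Hypothetical ages with one extreme entry and one missing value (not survey data)
ages = pd.Series([18, 21, 22, 25, 89, None])

mean_age = ages.mean()      # pulled upward by the extreme value 89
median_age = ages.median()  # stays close to the typical respondent

# Fill the missing entry with the median, as described above
filled = ages.fillna(median_age)

print(f"mean = {mean_age}, median = {median_age}")
print(filled.tolist())
```

A single extreme value moves the mean well away from most respondents, while the median barely changes, so a median fill is less likely to distort the column.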


In [None]:
# Fill missing 'music_effects' with 'No effect'
df['music_effects'] = df['music_effects'].fillna('No effect')

# Fill other categorical NaNs with each column's own mode
for col in ['primary_streaming_service', 'while_working', 'instrumentalist', 'composer']:
    mode_value = df[col].mode()[0]
    df[col] = df[col].fillna(mode_value)

# Fill numerical NaNs with the median
for col in ['age', 'bpm']:
    df[col] = df[col].fillna(df[col].median())

# Verify that most missing values are handled
print(df.isnull().sum())

Correct Erroneous Data
During inspection of the numerical data, an extremely high BPM value stood out; it is almost certainly a data entry error.
BPM in this survey describes the tempo of the music rather than a heart rate, and tempos above roughly 250 BPM are rare in any genre, so 250 serves as a reasonable upper limit.

The descriptive statistics below show a maximum BPM of 999999999.0. Any BPM over 250 is replaced with the median BPM.

In [None]:
# Display descriptive statistics to spot outliers
print(df['bpm'].describe())


In [None]:
# Find the median BPM to use for replacement
median_bpm = df['bpm'].median()



In [None]:
# Replace impossibly high BPM values
df.loc[df['bpm'] > 250, 'bpm'] = median_bpm

In [None]:
# Verify the correction
print("\nBPM statistics after correction:")
print(df['bpm'].describe())

In [None]:
# Final Data Check
print("Data cleaning complete. Here's the info for the cleaned DataFrame:")
df.info()

Correlation Analysis
The following analysis examines how the numerical columns relate to one another. A correlation matrix gives a quick overview of the relationships between variables such as age, listening hours, and mental health scores.


In [None]:
# Mental Illness List of Disorders
mental_health_cols = ['anxiety', 'depression', 'insomnia', 'ocd']

# Group by favorite genre and calculate mean scores
genre_mental_health = df.groupby('fav_genre')[mental_health_cols].mean().sort_values(by='anxiety', ascending=False)

# Select numerical columns for correlation matrix
numerical_cols = ['age', 'hours_per_day', 'bpm'] + mental_health_cols

# Calculate the correlation matrix
corr_matrix = df[numerical_cols].corr()

# Plot a heatmap to visualize the correlations
plt.figure(figsize=(10, 7))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Numerical Features')
plt.show()





Key Finding: The correlation matrix shows fairly weak relationships between most variables. Anxiety, depression, insomnia, and OCD scores are moderately correlated with each other, which is expected as these conditions often coexist. There appears to be a very weak positive correlation between hours_per_day and the mental health metrics, but it's not strong enough to be conclusive from this chart alone.
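
To read the strongest relationships off a correlation matrix without scanning the heatmap by eye, the pairs can be ranked by absolute correlation. The sketch below uses a small made-up table (hypothetical values, not survey data) with column names that mirror the cleaned dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical scores for six respondents (not survey data)
toy = pd.DataFrame({
    'hours_per_day': [4, 1, 6, 2, 3, 5],
    'anxiety':       [2, 4, 5, 4, 7, 8],
    'depression':    [1, 3, 5, 5, 6, 9],
    'ocd':           [3, 0, 2, 1, 4, 0],
})

corr = toy.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (corr.where(mask)
             .stack()
             .dropna()
             .sort_values(key=abs, ascending=False))

print(pairs)
```

Applied to the real `corr_matrix`, the same ranking would show at a glance whether the mental health scores correlate more with each other than with listening hours.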


Visualizations

Visualization 1: Average Anxiety by Favorite Music Genre

Story: I want to see if people who prefer certain music genres report higher levels of anxiety on average. This helps explore the stereotype that certain genres, like Metal or Rock, are associated with negative emotions, while others, like Classical, are associated with calmness.

Chart Justification: A bar chart fits here because it compares a numerical value (the average score) across categories (music genres). The bars are sorted in descending order of average anxiety, so the genres with the highest and lowest anxiety scores stand out immediately. For colors, I used Blue for Depression, Red for Anxiety, Violet for Insomnia, and Orange for OCD, so each condition keeps one distinct color in every plot.

In [None]:
# Prepare data for plotting
genre_anxiety = df.groupby('fav_genre')['anxiety'].mean().sort_values(ascending=False)

In [None]:
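# Bar Plot: Emotional Scores Across Favorite Genres
# Average each score by favorite genre, ordered by mean anxiety (highest first)
grouped = (df.groupby('fav_genre')[['depression', 'anxiety', 'insomnia', 'ocd']]
             .mean()
             .sort_values(by='anxiety', ascending=False))

# Colors follow the column order: depression, anxiety, insomnia, ocd
grouped.plot(kind='bar', stacked=True, color=['#1f77b4', '#d62728', '#9467bd', '#ff7f0e'], figsize=(10,6))
plt.title('Emotional Scores Across Favorite Genres')
plt.ylabel('Average Score')
plt.xlabel('Favorite Genre')
plt.legend(title='Disorder')
plt.tight_layout()
plt.show()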

Key Finding: The chart shows that fans of Rock and Metal music reported the highest average anxiety levels. Conversely, those who favor Lofi, Latin, and Country music reported the lowest levels. This doesn't imply causation but is an interesting pattern worth noting.
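
Because a genre average can rest on only a handful of respondents, it may help to check group sizes alongside the means before reading much into the ranking. This is a minimal sketch on made-up responses (hypothetical values, not survey data).

```python
import pandas as pd

# Hypothetical responses (not survey data); column names follow the cleaned dataset
toy = pd.DataFrame({
    'fav_genre': ['Rock', 'Rock', 'Rock', 'Metal', 'Metal', 'Lofi'],
    'anxiety':   [7, 6, 8, 9, 7, 3],
})

# Mean anxiety and number of respondents per genre, highest mean first
summary = (toy.groupby('fav_genre')['anxiety']
              .agg(['mean', 'count'])
              .sort_values('mean', ascending=False))

print(summary)
```

A genre with a very small `count` can land at either end of the chart by chance, so its position deserves less weight than that of a well-represented genre.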

Visualization 2: Relationship Between Listening Hours and Depression

Story: Does listening to more music correlate with higher or lower self-reported depression? I want to investigate the pattern between the number of hours someone listens to music per day and their depression score.

Chart Justification: Regression lines (seaborn's regplot) are well suited to showing the trend between two continuous variables. Each line is the line of best fit for one score against daily listening hours. The individual points are hidden (scatter=False) because many respondents share the same values (e.g., 2 hours, depression score of 7) and would overplot. For colors, I used Blue for Depression, Red for Anxiety, Violet for Insomnia, and Orange for OCD, so each condition keeps one distinct color in every plot.

In [None]:
# Set up Regression Plot
plt.figure(figsize=(10,6))
colors = {'Depression': '#1f77b4', 'Anxiety': '#d62728', 'Insomnia': '#9467bd', 'OCD': '#ff7f0e'}

# Plot each emotional score
for emotion, color in colors.items():
    # Column names were lowercased during cleaning, so look up e.g. 'depression'
    sns.regplot(x='hours_per_day', y=emotion.lower(), data=df, scatter=False, label=emotion, color=color)

plt.title('Regression: Hours Listening vs. Emotional Scores')
plt.xlabel('Hours per Day Listening to Music')
plt.ylabel('Score')
plt.legend()
plt.tight_layout()
plt.show()


Key Finding: The regression line for depression has a slight positive slope, suggesting a weak positive correlation: as listening hours increase, the depression score tends to increase slightly. However, the underlying responses are widely dispersed, so the relationship is not strong. Many respondents listen for many hours yet report low depression scores, and vice versa. Listening time alone is therefore not a strong predictor of depression.
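
One way to put a number on "weak" is the Pearson coefficient r together with r squared, the share of variance a straight-line fit would explain. The sketch below uses made-up values (hypothetical, not survey data).

```python
import pandas as pd

# Hypothetical listening hours and depression scores (not survey data)
toy = pd.DataFrame({
    'hours_per_day': [4, 1, 6, 2, 3, 5],
    'depression':    [1, 3, 5, 5, 6, 9],
})

# Series.corr computes the Pearson correlation by default
r = toy['hours_per_day'].corr(toy['depression'])
r_squared = r ** 2

print(f"r = {r:.2f}, r^2 = {r_squared:.2f}")
```

A positive slope with a small r squared matches the reading above: the trend exists, but listening hours account for only a small part of the variation in scores.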

Visualization 3: Distribution of Mental Health Scores

Story: What is the overall landscape of mental health in this survey's population? I want to compare the four scores (Anxiety, Depression, Insomnia, and OCD) to see their central tendencies and spread.

Chart Justification: Box plots are an excellent tool for comparing the distributions of several numerical variables at once. For each condition, the box plot clearly shows the median (the line inside the box), the interquartile range (the box itself, representing the middle 50% of data), and the spread of the data (the whiskers). This makes it easy to see, for example, whether the median anxiety score is higher than the median insomnia score and how their spreads differ. For colors, I used Blue for Depression, Red for Anxiety, Violet for Insomnia, and Orange for OCD, so each condition keeps one distinct color in every plot.
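
The quantities a box plot draws can also be computed directly. This is a minimal sketch on made-up scores (hypothetical, not survey data), assuming the default whisker rule of 1.5 times the interquartile range.

```python
import pandas as pd

# Hypothetical scores on a 0-10 scale (not survey data)
scores = pd.Series([0, 2, 3, 4, 5, 6, 7, 8, 10])

q1 = scores.quantile(0.25)      # bottom edge of the box
median = scores.quantile(0.50)  # line inside the box
q3 = scores.quantile(0.75)      # top edge of the box
iqr = q3 - q1                   # height of the box (middle 50% of the data)

# Whiskers reach the most extreme data points that still lie inside these fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(q1, median, q3, iqr, lower_fence, upper_fence)
```

Any score outside the fences would be drawn as an individual outlier point rather than as part of a whisker.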


In [None]:
# Reshape the data for easier plotting with seaborn
melted_df = df[mental_health_cols].melt(var_name='Condition', value_name='Score')

In [None]:
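# Melt the dataframe for boxplot (column names were lowercased during cleaning)
labels = {'depression': 'Depression', 'anxiety': 'Anxiety', 'insomnia': 'Insomnia', 'ocd': 'OCD'}
melted = df.melt(id_vars=['fav_genre'], value_vars=list(labels),
                 var_name='Disorder', value_name='Score')
# Restore display labels so they match the palette keys
melted['Disorder'] = melted['Disorder'].map(labels)

plt.figure(figsize=(12,6))
sns.boxplot(x='fav_genre', y='Score', hue='Disorder', data=melted,
            palette={'Depression': '#1f77b4', 'Anxiety': '#d62728', 'Insomnia': '#9467bd', 'OCD': '#ff7f0e'})
plt.title('Emotional Scores by Favorite Genre')
plt.xlabel('Favorite Genre')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()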


Key Finding: The box plots show relatively high median scores for most conditions (roughly between 4 and 6). Anxiety and Depression have very similar distributions, with the highest medians of the four conditions. Insomnia scores are slightly lower and more spread out. OCD is the exception: its scores are the lowest, with a median around 2 and a distribution skewed toward the low end of the scale.


Key Findings:
Data Quality: The initial dataset required significant cleaning, including standardizing column names, handling numerous missing values, and correcting outlier data points.

Genre & Anxiety: A clear pattern emerged where favorite genres like Rock and Metal were associated with higher average anxiety scores, while Lofi and Latin were associated with lower scores.

Listening Hours: There is a very weak positive correlation between the number of hours spent listening to music and self-reported scores for depression, anxiety, and insomnia. The weakness of this correlation suggests that listening time is not a major predictor of mental health status.