<a href="https://colab.research.google.com/github/bderdz/music_mental_health/blob/main/music_mental_health.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analysis of the Music & Mental Health Survey dataset

Dataset: [Music & Mental Health Survey Results](https://www.kaggle.com/datasets/catherinerasgaitis/mxmh-survey-results)


In [68]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from google.colab import drive

%matplotlib inline

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [69]:
df = pd.read_csv('/content/drive/MyDrive/big_data/mxmh_survey_results.csv')

df.head()

Unnamed: 0,Timestamp,Age,Primary streaming service,Hours per day,While working,Instrumentalist,Composer,Fav genre,Exploratory,Foreign languages,...,Frequency [R&B],Frequency [Rap],Frequency [Rock],Frequency [Video game music],Anxiety,Depression,Insomnia,OCD,Music effects,Permissions
0,8/27/2022 19:29:02,18.0,Spotify,3.0,Yes,Yes,Yes,Latin,Yes,Yes,...,Sometimes,Very frequently,Never,Sometimes,3.0,0.0,1.0,0.0,,I understand.
1,8/27/2022 19:57:31,63.0,Pandora,1.5,Yes,No,No,Rock,Yes,No,...,Sometimes,Rarely,Very frequently,Rarely,7.0,2.0,2.0,1.0,,I understand.
2,8/27/2022 21:28:18,18.0,Spotify,4.0,No,No,No,Video game music,No,Yes,...,Never,Rarely,Rarely,Very frequently,7.0,7.0,10.0,2.0,No effect,I understand.
3,8/27/2022 21:40:40,61.0,YouTube Music,2.5,Yes,No,Yes,Jazz,Yes,Yes,...,Sometimes,Never,Never,Never,9.0,7.0,3.0,3.0,Improve,I understand.
4,8/27/2022 21:54:47,18.0,Spotify,4.0,Yes,No,No,R&B,Yes,No,...,Very frequently,Very frequently,Never,Rarely,7.0,2.0,5.0,9.0,Improve,I understand.


# Cleaning the dataset

In [70]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 736 entries, 0 to 735
Data columns (total 33 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Timestamp                     736 non-null    object 
 1   Age                           735 non-null    float64
 2   Primary streaming service     735 non-null    object 
 3   Hours per day                 736 non-null    float64
 4   While working                 733 non-null    object 
 5   Instrumentalist               732 non-null    object 
 6   Composer                      735 non-null    object 
 7   Fav genre                     736 non-null    object 
 8   Exploratory                   736 non-null    object 
 9   Foreign languages             732 non-null    object 
 10  BPM                           629 non-null    float64
 11  Frequency [Classical]         736 non-null    object 
 12  Frequency [Country]           736 non-null    object 
 13  Frequ

The **Timestamp** and **Permissions** columns are purely administrative and do not contribute to the analysis, so they can be dropped.

In [71]:
df.drop(columns=['Timestamp', 'Permissions'], inplace=True)

In [72]:
df.dtypes

Unnamed: 0,0
Age,float64
Primary streaming service,object
Hours per day,float64
While working,object
Instrumentalist,object
Composer,object
Fav genre,object
Exploratory,object
Foreign languages,object
BPM,float64


* Columns such as **Age** and **BPM** have type **float64** but should rather be **int64** ( *the missing values have to be handled first* )
* Most of the remaining columns hold strings (dtype **object**), which I convert to **category**


---


*To avoid repeating code, I select the columns of type* **object** *and replace them with the same columns converted to type* **category**

In [73]:
df[df.select_dtypes(include='object').columns] = df.select_dtypes(include='object').apply(lambda col : col.astype("category"))
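The same conversion can be illustrated on a small frame. This is a minimal sketch: the `toy` frame and its values are made up for illustration, and `astype("category")` is applied to all selected columns at once instead of through a lambda.

```python
import pandas as pd

# Hypothetical toy frame standing in for the survey data.
# dtype="object" is set explicitly so the sketch does not depend on
# the default string dtype of the installed pandas version.
toy = pd.DataFrame({
    "Age": [18.0, 63.0, 18.0],
    "Fav genre": pd.Series(["Latin", "Rock", "Latin"], dtype="object"),
    "While working": pd.Series(["Yes", "Yes", "No"], dtype="object"),
})

# Pick every object column and convert all of them in a single call.
object_cols = toy.select_dtypes(include="object").columns
toy[object_cols] = toy[object_cols].astype("category")

print(toy.dtypes)
```

Numeric columns such as `Age` are left untouched, because only the selected columns are reassigned.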

## Missing data

In [74]:
df.isna().mean() * 100

Unnamed: 0,0
Age,0.13587
Primary streaming service,0.13587
Hours per day,0.0
While working,0.407609
Instrumentalist,0.543478
Composer,0.13587
Fav genre,0.0
Exploratory,0.0
Foreign languages,0.543478
BPM,14.538043


The share of missing values per column is:
* Age < 1%
* Primary streaming service < 1%
* While working < 1%
* Instrumentalist < 1%
* Composer < 1%
* Foreign languages < 1%
* BPM ~ 14.5%
* Music effects ~ 1%

In [75]:
df_backup = df.copy()

In [76]:
df['Age'].isna().sum()

1

**Age** - I drop the records with a missing age: only 1 record is affected, and age is important for the analysis

In [77]:
df.dropna(subset=['Age'], inplace=True)

In [78]:
df.groupby('Primary streaming service')['Primary streaming service'].count()





Unnamed: 0_level_0,Primary streaming service
Primary streaming service,Unnamed: 1_level_1
Apple Music,51
I do not use a streaming service.,71
Other streaming service,50
Pandora,11
Spotify,457
YouTube Music,94


**Primary streaming service** - I fill the gaps with Spotify, because it is the most frequently chosen platform

In [79]:
df['Primary streaming service'] = df['Primary streaming service'].fillna('Spotify')

**While working** - I fill the gaps with **Yes**, because it is the most frequent answer and the more realistic one for many people

In [80]:
df['While working'].mode()[0]

'Yes'

In [81]:
df['While working'] = df['While working'].fillna('Yes')
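Mode imputation on a categorical column can be sketched on toy data as follows. The `answers` series is made up for illustration; only `mode()` and `fillna()` mirror the calls used above.

```python
import pandas as pd

# Hypothetical answers with one missing value, stored as a categorical.
answers = pd.Series(["Yes", "No", "Yes", None, "Yes"], dtype="category")

# mode() returns a Series (there may be ties), so take the first entry.
most_common = answers.mode()[0]

# The fill value already exists among the categories, so fillna works
# without adding a new category first.
filled = answers.fillna(most_common)

print(most_common)
print(filled.tolist())
```

With a tie, `mode()[0]` picks the first of the tied values, so it is worth checking the counts before relying on it.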

In [82]:
df.groupby(['Instrumentalist', 'Composer'])['Composer'].count()





Unnamed: 0_level_0,Unnamed: 1_level_0,Composer
Instrumentalist,Composer,Unnamed: 2_level_1
No,No,463
No,Yes,33
Yes,No,143
Yes,Yes,92


**Instrumentalist, Composer** - I fill the gaps with **No**, because most people neither make their own music nor play an instrument

In [83]:
df['Composer'] = df['Composer'].fillna('No')
df['Instrumentalist'] = df['Instrumentalist'].fillna('No')

**Foreign languages** - filled using the **backward fill** method

In [84]:
df['Foreign languages'] = df['Foreign languages'].bfill()
df['Foreign languages'].isna().sum()

0
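How backward fill behaves can be seen on a short made-up series (the `langs` values below are illustrative, not taken from the survey). Each gap takes the next valid value further down the column, so the result depends on row order, and a gap at the very end stays empty.

```python
import pandas as pd

# Hypothetical column with gaps in the middle and at the end.
langs = pd.Series(["Yes", None, None, "No", None])

# Each missing value is replaced by the next valid value below it.
filled = langs.bfill()

print(filled.tolist())
```

This is why checking `isna().sum()` afterwards, as done above, is useful: a trailing gap would survive the fill.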

**BPM** - because of the large share of missing values, I try to fill the gaps by nearest-value interpolation

In [85]:
df['BPM'] = df['BPM'].interpolate(method='nearest')
df['BPM'].isna().sum()

0

**Music effects** - to avoid losing records, the missing values can be treated as **No effect**

In [86]:
df['Music effects'] = df['Music effects'].fillna('No effect')

In [87]:
df.isna().mean() * 100

Unnamed: 0,0
Age,0.0
Primary streaming service,0.0
Hours per day,0.0
While working,0.0
Instrumentalist,0.0
Composer,0.0
Fav genre,0.0
Exploratory,0.0
Foreign languages,0.0
BPM,0.0


Now that there are no missing values, the data type of the **Age** and **BPM** columns can be changed

In [88]:
df['Age'] = df['Age'].astype('int64')
df['BPM'] = df['BPM'].astype('int64')

## Outliers
The **BPM** column contains outliers, so I use **IQR-based outlier detection**

In [89]:
bpm_box = px.box(
    df,
    y="BPM",
    width=500,
    height=400)

bpm_box.show()

In [90]:
df['BPM'].mean()

1360668.0353741497

In [91]:
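def remove_outlier(column):
  # Despite the name, outliers are capped at the IQR fences, not dropped.
  Q1 = column.quantile(0.25)
  Q3 = column.quantile(0.75)

  IQR = Q3 - Q1
  max_value = Q3 + 1.5 * IQR  # upper fence
  min_value = Q1 - 1.5 * IQR  # lower fence

  # Values outside the fences are replaced by the nearest fence.
  return column.apply(lambda v: min_value if v < min_value else max_value if v > max_value else v)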
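The arithmetic behind the IQR fences can be checked on a small made-up series. This is a sketch under assumptions: the `bpm` values are invented, and `Series.clip` is used as an equivalent way to cap values at the fences.

```python
import pandas as pd

# Hypothetical BPM values with one absurd entry.
bpm = pd.Series([60.0, 100.0, 110.0, 120.0, 130.0, 140.0, 999999.0])

q1 = bpm.quantile(0.25)   # 105.0
q3 = bpm.quantile(0.75)   # 135.0
iqr = q3 - q1             # 30.0

lower = q1 - 1.5 * iqr    # 60.0
upper = q3 + 1.5 * iqr    # 180.0

# Cap everything outside [lower, upper] at the nearest fence.
capped = bpm.clip(lower=lower, upper=upper)

print(lower, upper)
print(capped.tolist())
```

The extreme entry is pulled down to the upper fence of 180, while the row itself is kept, so the number of records does not change.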

In [92]:
df['BPM'] = remove_outlier(df['BPM'])

df['BPM'] = df['BPM'].astype('int64')  # cast to int because BPM does not need a floating-point value

In [93]:
df['BPM'].mean()

123.71156462585034

In [94]:
bpm_box = px.box(
    df,
    y="BPM",
    width=500,
    height=400)

bpm_box.show()

The **maximum** value in the **Hours per day** column makes it doubtful that anyone could listen to music that much, so, just as with **BPM**, I think we should **apply the IQR-based method and cap the values**

In [95]:
hours_box = px.box(
    df,
    y="Hours per day",
    width=500,
    height=400)

hours_box.show()

In [96]:
df['Hours per day'] = remove_outlier(df['Hours per day'])

df['Hours per day'].describe()

Unnamed: 0,Hours per day
count,735.0
mean,3.407551
std,2.40948
min,0.0
25%,2.0
50%,3.0
75%,5.0
max,9.5


## End of cleaning
Final result:

In [97]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 735 entries, 0 to 735
Data columns (total 31 columns):
 #   Column                        Non-Null Count  Dtype   
---  ------                        --------------  -----   
 0   Age                           735 non-null    int64   
 1   Primary streaming service     735 non-null    category
 2   Hours per day                 735 non-null    float64 
 3   While working                 735 non-null    category
 4   Instrumentalist               735 non-null    category
 5   Composer                      735 non-null    category
 6   Fav genre                     735 non-null    category
 7   Exploratory                   735 non-null    category
 8   Foreign languages             735 non-null    category
 9   BPM                           735 non-null    int64   
 10  Frequency [Classical]         735 non-null    category
 11  Frequency [Country]           735 non-null    category
 12  Frequency [EDM]               735 non-null    category


# EDA

## General characteristics of the respondents


### Age

In [98]:
age_hist = px.histogram(df, x='Age', nbins=20)

age_charts = make_subplots(rows=2, cols=2, subplot_titles=['Age', 'Hours per day', 'Hours per day by Age'])

age_charts.add_trace(go.Histogram(x=df['Age'], name='Age', nbinsx=20),row=1,col=1)
age_charts.add_trace(go.Histogram(x=df['Hours per day'], name='Hours per day', nbinsx=10, ),row=1,col=2)

age_charts.add_trace(go.Histogram(x=df['Age'], y=df['Hours per day'], name='age vs hours', histfunc='avg', nbinsx=20),row=2,col=1)
age_charts.update_xaxes(title_text="Age", row=2, col=1)
age_charts.update_yaxes(title_text="Mean listen time", row=2, col=1)

age_charts.update_layout(
    showlegend=False,
    height = 650,
    width = 800
)

age_charts.show()

### Traits

In [99]:
person_info = df[['Instrumentalist','Composer','While working', 'Exploratory', 'Foreign languages']]

figure = make_subplots(rows=2, cols=3, subplot_titles=person_info.columns)

for i, col_name in enumerate(person_info.columns):
  row = i // 3 + 1
  column = i % 3 + 1
  figure.add_trace(
    go.Histogram(x=person_info[col_name],name=col_name),
    row=row, col=column)

figure.update_layout(
    title='Histogram of traits',
    showlegend=False,
    height=600,
    width=800
)

## Music


### Streaming services

In [100]:
streaming_services = df['Primary streaming service'].value_counts().reset_index()
streaming_services.columns = ['service', 'count']

services_pie = px.pie(
    streaming_services,
    values='count',
    names='service',
    title='Popularity of streaming services',
    hole=0.2,
    height=400,
    width=600)

age_services_hist = px.histogram(
    df,
    x='Primary streaming service',
    y='Age',
    histfunc='avg',
    title='Streaming service usage by average age',
    height=500,
    width=600)

age_services_hist.update_layout(
    yaxis=dict(
        tickformat='d'
    )
)

services_pie.show()
age_services_hist.show()

### Music genres

In [101]:
fav_genres = df.groupby('Fav genre').size().reset_index(name='count')

genres_pie = px.pie(
    fav_genres,
    values='count',
    names='Fav genre',
    hole=0.2,
    title='Most popular music genre',
    height=500,
    width=800)

genres_pie.show()





In [102]:
genre_age_chart = px.scatter(
    df,
    x='Fav genre',
    y='Age',
    color='Fav genre',
    title='Favourite music genres by age',
    labels={'Fav genre': '', 'Age': 'Age'},
    height=550,
    width=800)

genre_age_chart.update_traces(marker_size=8)
genre_age_chart.update_layout(showlegend=False)

genre_age_chart.show()

In [103]:
avg_bpm = df.groupby('Fav genre')['BPM'].mean().sort_values().reset_index()

genre_bpm_chart = px.scatter(
    avg_bpm,
    x='Fav genre',
    y='BPM',
    color='Fav genre',
    title='Average BPM by favourite music genre',
    labels={'Fav genre': '', 'BPM': 'BPM'},
    height=550,
    width=800)

genre_bpm_chart.update_traces(marker_size=15)

genre_bpm_chart.update_layout(
    yaxis=dict(
        tickformat='d'
    )
)

genre_bpm_chart.show()





### Listening frequency by genre

In [104]:
freq_df = df.filter(like='Frequency', axis=1)

freq_df = freq_df.melt(var_name='genre', value_name='frequency')
freq_df['genre'] = freq_df['genre'].apply(lambda g: re.search(r'Frequency \[(.*?)\]', str(g)).group(1))

freq_hist = px.histogram(
    freq_df,
    x='genre',
    color='frequency',
    title='Listening frequency by genre',
)

freq_hist.update_xaxes(title='')
freq_hist.update_yaxes(title='Number of respondents')
freq_hist.update_layout(barmode='group', bargap=0.35, legend_title='Frequency')

freq_hist.show()
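The reshaping step above (wide `Frequency [...]` columns into one long table) can be sketched on a toy frame. The data below is made up, and `str.extract` is used here as an alternative to applying `re.search` row by row.

```python
import pandas as pd

# Hypothetical two-genre excerpt of the frequency columns.
toy = pd.DataFrame({
    "Frequency [Rock]": ["Never", "Sometimes"],
    "Frequency [R&B]": ["Rarely", "Very frequently"],
})

# One row per (respondent, genre) pair instead of one column per genre.
long_df = toy.melt(var_name="genre", value_name="frequency")

# Keep only the text between the square brackets as the genre name.
long_df["genre"] = long_df["genre"].str.extract(r"Frequency \[(.*?)\]", expand=False)

print(long_df)
```

The long format is what lets a single histogram call split the bars by `frequency` for every genre at once.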

## Mental health conditions



In [105]:
mental_illness = df[['Anxiety','Depression','Insomnia', 'OCD']]

illness_hists = make_subplots(rows=2, cols=2, subplot_titles=mental_illness.columns)

for i, col_name in enumerate(mental_illness.columns):
  row = i // 2 + 1
  column = i % 2 + 1
  illness_hists.add_trace(
    go.Histogram(x=mental_illness[col_name],name=col_name,nbinsx=20),
    row=row, col=column)

illness_hists.update_layout(
    title='Histogram of mental health conditions',
    showlegend=False,
    width=800,
    height=600
)

#### Mental health conditions and age

In [106]:
illness_age = make_subplots(rows=2, cols=2, vertical_spacing=0.2)

for i, col_name in enumerate(mental_illness.columns):
  row = i // 2 + 1
  column = i % 2 + 1
  illness_age.add_trace(
    go.Scatter(x=df['Age'], y=df[col_name], mode='markers'),
    row=row, col=column)
  illness_age.update_yaxes(row=row, col=column, title=col_name)


illness_age.update_xaxes(title='Age')
illness_age.update_traces(marker_size=5)
illness_age.update_layout(
    title='Level of each condition by age',
    showlegend=False,
    height=600,
    width=900
)
illness_age.show()

### Correlation between conditions

In [107]:
illness_correlation = mental_illness.corr()

illness_heatmap = px.imshow(illness_correlation,
                text_auto=True,
                color_continuous_scale='Burgyl',
                title='Heatmap of correlation between conditions')

illness_heatmap.show()
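`DataFrame.corr()` computes Pearson correlation by default. Since the survey scores are ordinal self-ratings, a rank-based coefficient may be a reasonable cross-check. The sketch below uses made-up numbers only to show how the two measures can differ on a monotone but non-linear relationship.

```python
import pandas as pd

# Hypothetical scores: y grows with x, but not linearly.
scores = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [1.0, 4.0, 9.0, 16.0],
})

pearson = scores.corr().loc["x", "y"]                    # linear association
spearman = scores.corr(method="spearman").loc["x", "y"]  # rank association

print(round(pearson, 3), round(spearman, 3))
```

Spearman reaches 1 because the ranks agree perfectly, while Pearson stays slightly below 1 because the relationship is not a straight line.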

In [108]:
mental_illness = df[['Anxiety','Depression', 'OCD']]

insomnia_hists = make_subplots(rows=2, cols=2, horizontal_spacing=0.15, vertical_spacing=0.2)

for i, column_name in enumerate(mental_illness.columns):
    row = i // 2 + 1
    column = i % 2 + 1

    insomnia_hists.add_trace(go.Histogram(
        x=df['Insomnia'],
        y=mental_illness[column_name],
        histfunc='avg',
        nbinsx=10,
        name=column_name), row=row, col=column)

    insomnia_hists.update_xaxes(title_text="Insomnia level", row=row, col=column)
    insomnia_hists.update_yaxes(title_text=f"{column_name} level", row=row, col=column)

insomnia_hists.update_layout(
    title='Effect of mental health conditions on sleep',
    showlegend=False,
    width=800,
    height=600
)

insomnia_hists.show()

### Traits and mental health

In [109]:
# Divide the sum by 4 so the column holds the mean that its name promises.
df['mean_mental_health'] = (df['Anxiety'] + df['Depression'] + df['Insomnia'] + df['OCD']) / 4
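A composite score can be built either as a total or as a row-wise average of the four columns. The two differ only by a constant factor of 4, so the shape of every plot based on the score is the same. A minimal sketch, using the scores of the first two rows shown by `df.head()` (the `toy` frame and the `total`/`mean` column names are illustrative):

```python
import pandas as pd

# Scores of the first two respondents from df.head().
toy = pd.DataFrame({
    "Anxiety": [3.0, 7.0],
    "Depression": [0.0, 2.0],
    "Insomnia": [1.0, 2.0],
    "OCD": [0.0, 1.0],
})

cols = ["Anxiety", "Depression", "Insomnia", "OCD"]

toy["total"] = toy[cols].sum(axis=1)   # ranges over four times the original scale
toy["mean"] = toy[cols].mean(axis=1)   # stays on the scale of the original columns

print(toy[["total", "mean"]])
```

The average is easier to read next to the individual scores, because it stays on their scale.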

In [110]:
health_map = px.density_heatmap(
    df,
    x='Age',
    y='mean_mental_health',
    color_continuous_scale="Sunsetdark",
    labels={'Age': 'Age', 'mean_mental_health': 'Level'})

health_map.update_layout(
    title='Level of mental health problems by age',
)
health_map.update_xaxes(title='Age')
health_map.update_yaxes(title='Mental health problems')

health_map.show()

In [111]:
musician_health = make_subplots(
    rows=1,
    cols=2,
    horizontal_spacing=0.3,
    subplot_titles=['Instrumentalist', 'Composer']
  )

musician_health.add_trace(
    go.Histogram(x=df['Instrumentalist'], y=df['mean_mental_health'], histfunc='avg'), row=1, col=1)
musician_health.add_trace(
    go.Histogram(x=df['Composer'], y=df['mean_mental_health'], histfunc='avg'), row=1, col=2)

musician_health.update_yaxes(title='Mental health problems')
musician_health.update_layout(
    showlegend=False,
    width=700,
    height=400
)

In [112]:
freq_means = pd.DataFrame()

frequency_mapping = {"Never": 0, "Rarely": 1, "Sometimes": 2, "Very frequently": 3}
frequency_name = {str(v): k for k, v in frequency_mapping.items()}

for i, col_name in enumerate(df.filter(like='Frequency', axis=1)):
  freq_mean = df.groupby(col_name, observed=True)['mean_mental_health'].mean().reset_index()
  freq_mean['genre'] = re.search(r'Frequency \[(.*?)\]', col_name).group(1)
  freq_mean['frequency'] = freq_mean[col_name].map(frequency_mapping)

  freq_means= pd.concat([freq_means, freq_mean])

freq_mean_health = px.scatter(freq_means, x='genre', y='mean_mental_health', color='frequency')
freq_mean_health.for_each_trace(lambda t: t.update(name = frequency_name[t.name]))

freq_mean_health.update_traces(mode='lines+markers', marker_size=10)
freq_mean_health.update_layout(showlegend=True)


## Effect on improving health

In [113]:
fig = px.histogram(df, x='Music effects', y='mean_mental_health', histfunc='avg')
fig.show()

#### Playing instruments and composing

In [132]:
musician_cols = ['Instrumentalist', 'Composer']
musician_effects = make_subplots(rows=1, cols=2, subplot_titles=musician_cols)

for i,col_name in enumerate(musician_cols):
  grouped = df.groupby([col_name, 'Music effects'], observed=True).size().reset_index(name='count')
  musician_hist = px.histogram(grouped, x=col_name, y='count', color='Music effects', barmode='group')

  for j, trace in enumerate(musician_hist.data):
      trace.showlegend = (i == 0)
      musician_effects.add_trace(trace, row=1, col=i + 1)


musician_effects.update_xaxes(categoryorder="array", categoryarray=['No', 'Yes'], row=1, col=1)
musician_effects.update_xaxes(categoryorder="array", categoryarray=['No', 'Yes'], row=1, col=2)
musician_effects.update_layout(
    width=800,
    height=400
)
musician_effects.show()

#### Exploring new genres/artists and foreign-language tracks