### ✔ Data Processing

#### The column recording the level of insomnia was replaced with a binary column indicating whether the respondent reported any insomnia (1) or none (0). After examining the relationships between variables, I kept the factors that correlated most strongly with insomnia and excluded the rest from the study.
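
The correlation analysis itself is not shown in this notebook. As a rough sketch of what such a screen could look like, the snippet below binarizes a toy insomnia score and ranks the remaining columns by the absolute value of their Pearson correlation with it. The `toy` frame and its values are made up for illustration and are not taken from `music.csv`.

```python
import pandas as pd

# Toy stand-in for the survey data; the values are made up for illustration.
toy = pd.DataFrame({
    'Insomnia':   [0, 3, 7, 0, 5, 2, 0, 8],
    'Depression': [1, 4, 8, 0, 6, 3, 2, 9],
    'OCD':        [0, 2, 5, 1, 3, 0, 1, 6],
    'Anxiety':    [5, 5, 4, 6, 5, 5, 6, 4],
})

# Binarize the target: any non-zero insomnia score counts as "has insomnia".
toy['Insomnia'] = (toy['Insomnia'] != 0).astype(int)

# Pearson correlation of every candidate feature with the binary target,
# ranked by absolute strength.
correlations = toy.drop(columns='Insomnia').corrwith(toy['Insomnia'])
ranked = correlations.abs().sort_values(ascending=False)
print(ranked)
```

Where to draw the line between kept and excluded factors is a judgment call; the notebook does not state a threshold, so none is assumed here.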

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

data1 = pd.read_csv('./music.csv')

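# Drop rows with missing values and the columns that are not analysed
data1 = data1.dropna()
data1 = data1.drop(columns = ['Timestamp', 'Permissions'])
# Remove respondents aged 70 or older
data1.drop(data1[data1['Age'] >= 70].index, inplace = True)
# Remove BPM entries above 300 or equal to 170
data1.drop(data1[(data1['BPM'] > 300) | (data1['BPM'] == 170)].index, inplace = True)
# Remove respondents whose primary streaming service is Pandora
data1.drop(data1[data1['Primary streaming service'] == 'Pandora'].index, inplace = True)
# Remove the half-point scores 7.5 (Anxiety) and 3.5 (Depression)
data1.drop(data1[data1['Anxiety'] == 7.5].index, inplace = True)
data1.drop(data1[data1['Depression'] == 3.5].index, inplace = True)
# Remove the listed 'Hours per day' values
conditions = [9, 0, 24, 15, 0.25, 2.5, 16, 18, 0.7, 11, 14]
data1 = data1[~data1['Hours per day'].isin(conditions)]
# Remove respondents whose favourite genre is Gospel or Latin
data1 = data1[~data1['Fav genre'].isin(['Gospel', 'Latin'])]
# Remove BPM values that occur four times or fewer
bpm_counts = data1['BPM'].value_counts()
bpm_to_drop = bpm_counts[bpm_counts <= 4].index
data1 = data1[~data1['BPM'].isin(bpm_to_drop)]
# Order respondents by age
data1 = data1.sort_values('Age')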

# Re-number the rows from 1 after sorting
data1.index = range(1, len(data1) + 1)
# Encode the Yes/No survey answers as 1/0
for column1 in ['While working', 'Instrumentalist', 'Composer', 'Exploratory', 'Foreign languages']:
    data1[column1] = (data1[column1] == 'Yes').astype(int)

old_column_names = ['Frequency [Classical]', 'Frequency [Country]', 'Frequency [EDM]', 'Frequency [Folk]', 'Frequency [Gospel]', 'Frequency [Hip hop]', 'Frequency [Jazz]', 'Frequency [K pop]', 'Frequency [Latin]', 'Frequency [Lofi]', 'Frequency [Metal]', 'Frequency [Pop]', 'Frequency [R&B]', 'Frequency [Rap]', 'Frequency [Rock]', 'Frequency [Video game music]']
new_column_names = ['Classical', 'Country', 'EDM', 'Folk', 'Gospel', 'Hip hop', 'Jazz', 'K pop', 'Latin', 'Lofi', 'Metal', 'Pop', 'R&B', 'Rap', 'Rock', 'Video game music']
# Strip the 'Frequency [...]' wrapper from the genre column names
data1.rename(columns = dict(zip(old_column_names, new_column_names)), inplace=True)

# Ordinal encoding of the listening-frequency answers
frequency_levels = {'Never': 0, 'Rarely': 1, 'Sometimes': 2, 'Very frequently': 3}
for column2 in new_column_names:
    data1[column2] = data1[column2].map(frequency_levels)

# Duplicate each genre column under a lower-case name; the copies are
# turned into binary "listens at all" flags further down.
copy_column_names = [column.lower() for column in new_column_names]
for column, copy_column in zip(new_column_names, copy_column_names):
    data1[copy_column] = data1[column]

# Flag every genre listened to at all (frequency > 0) and count the flags per respondent
data1[copy_column_names] = (data1[copy_column_names] > 0).astype(int)
data1['Frequency count'] = data1[copy_column_names].sum(axis = 1)
# Mean frequency over the genres listened to (NaN when the count is 0)
data1['Average frequency'] = data1[new_column_names].sum(axis = 1)/data1['Frequency count']
data1['Average frequency'] = data1['Average frequency'].round(1)
# Remove respondents who listen to only 1, 2 or 3 genres
data1 = data1[~data1['Frequency count'].isin([1, 2, 3])]

# Ordinal encoding of the 'Music effects' answers
music_effect_levels = {'Worsen': 0, 'No effect': 1, 'Improve': 2}
data1['Music effects'] = data1['Music effects'].map(music_effect_levels)

Exp = data1.pop('Exploratory')
data1.insert(data1.columns.get_loc('Average frequency') + 1, 'Exploratory', Exp)

# Binarize the target: 0 stays 0, any non-zero insomnia level becomes 1
data1['Insomnia'] = (data1['Insomnia'] != 0).astype(int)

data1 = data1.drop(columns = ['Age', 'Primary streaming service', 'Fav genre', 'BPM',
                              'Classical', 'Country', 'EDM', 'Folk', 'Gospel', 'Hip hop', 'Jazz', 'K pop',
                              'Latin', 'Lofi', 'Metal', 'Pop', 'R&B', 'Rap', 'Rock', 'Video game music',
                              'classical', 'country', 'edm', 'folk', 'gospel', 'hip hop', 'jazz', 'k pop',
                              'latin', 'lofi', 'metal', 'pop', 'r&b', 'rap', 'rock', 'video game music'])
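
To make the two derived columns concrete, here is a small worked example on a made-up frame with three genres instead of sixteen. `Frequency count` is the number of genres with a non-zero frequency, and `Average frequency` is the summed frequency divided by that count. The sketch also shows the edge case of a respondent with no genres at all: the cell above filters out counts of 1, 2 and 3 but not 0, and for a count of 0 the average is undefined. Whether such rows actually occur in `music.csv` is not shown here.

```python
import numpy as np
import pandas as pd

genres = ['Classical', 'Jazz', 'Rock']  # shortened genre list for illustration

# Toy listening frequencies on the 0-3 scale (Never ... Very frequently).
toy = pd.DataFrame(
    {'Classical': [0, 2, 0], 'Jazz': [3, 1, 0], 'Rock': [0, 3, 0]},
    index=['one genre', 'three genres', 'no genres'],
)

# Number of genres with a non-zero listening frequency.
toy['Frequency count'] = (toy[genres] > 0).sum(axis=1)

# Mean frequency over the genres actually listened to. pandas already
# evaluates 0/0 as NaN; replacing a count of 0 with NaN makes that explicit.
toy['Average frequency'] = (
    toy[genres].sum(axis=1) / toy['Frequency count'].replace(0, np.nan)
).round(1)

print(toy)
```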

### ✔ Modelling

#### I used five factors to predict insomnia: whether the respondent composes music, whether they play an instrument, the number of music genres they listen to, and their levels of depression and OCD. Because this is a binary classification problem, I used Logistic Regression. I also considered Random Forest, SVM (Support Vector Machine), k-Nearest Neighbors (k-NN), and Decision Tree, but Logistic Regression produced the most accurate results of the five and was therefore chosen.
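
The comparison between the five algorithms is not included in this notebook. One way it could be run is with 5-fold cross-validation, as sketched below. The data comes from `make_classification` as a synthetic stand-in for the five predictors, all models use their scikit-learn defaults, and the names `X_demo`, `y_demo`, `candidates` and `scores` are introduced only for this example. The ranking it prints therefore says nothing about the survey data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the five predictors and the binary target.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=42
)

candidates = {
    'Logistic Regression': LogisticRegression(),
    'Random Forest': RandomForestClassifier(random_state=42),
    'SVM': SVC(),
    'k-NN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
}

scores = {}
for name, estimator in candidates.items():
    # Scaling lives inside the pipeline, so in every fold the scaler is
    # fitted on that fold's training part only.
    pipeline = make_pipeline(StandardScaler(), estimator)
    fold_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring='accuracy')
    scores[name] = fold_scores.mean()

# Print the candidates from highest to lowest mean accuracy.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f'{name}: {score:.3f}')
```

Averaging over folds gives a steadier comparison than a single train/test split, which matters when the differences between models are small.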

In [None]:
X = data1[['Composer', 'Instrumentalist', 'Frequency count', 'Depression', 'OCD']]
y = data1['Insomnia']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.2, random_state = 42)

model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

Accuracy: 0.7763157894736842
Precision: 0.7733333333333333
Recall: 1.0
F1-score: 0.8721804511278195
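
The cell above fits `StandardScaler` on every row before calling `train_test_split`, so the scaling statistics include the test rows. A common alternative is to split first and put the scaler and the classifier in one pipeline. The sketch below shows that pattern on synthetic data. The variable names, the `stratify=` option and the 25/75 class weights are assumptions chosen for illustration, and the printed score is not comparable with the results above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y; not the survey data.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.25, 0.75], random_state=42
)

# Split first, keeping the class ratio the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fitting the pipeline fits the scaler on the training rows only;
# predict() then reuses those training statistics on the test rows.
leak_free_model = make_pipeline(StandardScaler(), LogisticRegression())
leak_free_model.fit(X_tr, y_tr)

print('F1-score:', f1_score(y_te, leak_free_model.predict(X_te)))
```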


### ✔ Results
#### Accuracy: 0.78
#### Precision: 0.77
#### Recall: 1.0
#### F1-score: 0.87

<br>

#### To sum up, the model has a recall of 1.0: it identified every actual positive in the test set. Accuracy (0.78) and precision (0.77) are lower, which means that roughly 23% of its positive predictions were false positives, i.e. respondents without insomnia who were predicted to have it. It would therefore be worthwhile to tune the model on more data in the future to raise accuracy and precision. If the dataset becomes large enough, a more sophisticated deep learning model such as a recurrent neural network (RNN) could also be explored.
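
A recall of 1.0 alongside lower precision is easier to interpret with the confusion matrix and a trivial baseline next to it. The sketch below shows how both could be obtained. It uses synthetic, imbalanced data rather than the survey data, and the names `demo_model` and `baseline` are introduced only for this example. `DummyClassifier(strategy='most_frequent')` always predicts the majority class, so its accuracy is the score any useful model should beat.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in: about three quarters of the rows are positive.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.25, 0.75], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

demo_model = LogisticRegression().fit(X_tr, y_tr)
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(
    y_te, demo_model.predict(X_te), labels=[0, 1]
).ravel()
print(f'TN={tn} FP={fp} FN={fn} TP={tp}')

print('Model accuracy:   ', accuracy_score(y_te, demo_model.predict(X_te)))
print('Baseline accuracy:', accuracy_score(y_te, baseline.predict(X_te)))
```

If the model's accuracy is close to the baseline's and the TN count is near zero, the high recall mostly reflects predicting the positive class almost everywhere.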