# Machine Learning Preprocessing

Although the data has been cleaned, it is not yet in a format that machine learning algorithms can use. This notebook converts the remaining string columns into a numeric, algorithm-friendly format.

## Columns to be Converted

### While working, Instrumentalist, Composer, Exploratory, and Foreign languages
These columns contain Yes/No values, which will be converted into 1s and 0s respectively.

### Frequency [Genre]
These columns contain string values in the order: Never, Rarely, Sometimes, Very frequently. These will be converted to integers from 0 to 3.

### Music effects
This column contains the string values Worsen, No effect, and Improve. It will be converted into dummy variables (see Column to Add), after which the original column is dropped.

### Fav genre
Using the pandas get_dummies function, this column will be converted to dummy variables with the prefix fav.

## Column to Add

### Improve, Worsen, No effect
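To make logistic regression possible, the Music effects column will be converted into dummy variables. Because get_dummies is called with drop_first=True, only the No effect and Worsen columns are created; a row with 0 in both corresponds to Improve. This way we will be able to create models to predict whether patients improved, worsened, or stayed the same after their treatments.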

In [1]:
import pandas as pd
import numpy as np

# Preprocessing utilities

from sklearn.preprocessing import LabelEncoder

In [2]:
# Import survey data

survey_df = pd.read_csv('../data/interim/clean_survey_results.csv')
survey_df.head()

Unnamed: 0,Age,Hours per day,While working,Instrumentalist,Composer,Fav genre,Exploratory,Foreign languages,BPM,Frequency [Classical],...,Frequency [Pop],Frequency [R&B],Frequency [Rap],Frequency [Rock],Frequency [Video game music],Anxiety,Depression,Insomnia,OCD,Music effects
0,18,4.0,No,No,No,Video game music,No,Yes,132.0,Never,...,Rarely,Never,Rarely,Rarely,Very frequently,7,7,10,2,No effect
1,61,2.5,Yes,No,Yes,Jazz,Yes,Yes,84.0,Sometimes,...,Sometimes,Sometimes,Never,Never,Never,9,7,3,3,Improve
2,18,4.0,Yes,No,No,R&B,Yes,No,107.0,Never,...,Sometimes,Very frequently,Very frequently,Never,Rarely,7,2,5,9,Improve
3,18,5.0,Yes,Yes,Yes,Jazz,Yes,Yes,86.0,Rarely,...,Very frequently,Very frequently,Very frequently,Very frequently,Never,8,8,7,7,Improve
4,18,3.0,Yes,Yes,No,Video game music,Yes,Yes,66.0,Sometimes,...,Rarely,Rarely,Never,Never,Sometimes,4,8,6,0,Improve


In [3]:
survey_df.dtypes

Age                               int64
Hours per day                   float64
While working                    object
Instrumentalist                  object
Composer                         object
Fav genre                        object
Exploratory                      object
Foreign languages                object
BPM                             float64
Frequency [Classical]            object
Frequency [Country]              object
Frequency [EDM]                  object
Frequency [Folk]                 object
Frequency [Gospel]               object
Frequency [Hip hop]              object
Frequency [Jazz]                 object
Frequency [K pop]                object
Frequency [Latin]                object
Frequency [Lofi]                 object
Frequency [Metal]                object
Frequency [Pop]                  object
Frequency [R&B]                  object
Frequency [Rap]                  object
Frequency [Rock]                 object
Frequency [Video game music]     object


# Column Additions

## Music effects

In [4]:
# Convert music effects into dummy columns

effects_dummies = pd.get_dummies(survey_df['Music effects'], drop_first=True, dtype=int)
survey_df = pd.concat([survey_df, effects_dummies], axis=1)

# Drop Music effects column
survey_df = survey_df.drop('Music effects', axis=1)

# Verify conversion

survey_df.head()

Unnamed: 0,Age,Hours per day,While working,Instrumentalist,Composer,Fav genre,Exploratory,Foreign languages,BPM,Frequency [Classical],...,Frequency [R&B],Frequency [Rap],Frequency [Rock],Frequency [Video game music],Anxiety,Depression,Insomnia,OCD,No effect,Worsen
0,18,4.0,No,No,No,Video game music,No,Yes,132.0,Never,...,Never,Rarely,Rarely,Very frequently,7,7,10,2,1,0
1,61,2.5,Yes,No,Yes,Jazz,Yes,Yes,84.0,Sometimes,...,Sometimes,Never,Never,Never,9,7,3,3,0,0
2,18,4.0,Yes,No,No,R&B,Yes,No,107.0,Never,...,Very frequently,Very frequently,Never,Rarely,7,2,5,9,0,0
3,18,5.0,Yes,Yes,Yes,Jazz,Yes,Yes,86.0,Rarely,...,Very frequently,Very frequently,Very frequently,Never,8,8,7,7,0,0
4,18,3.0,Yes,Yes,No,Video game music,Yes,Yes,66.0,Sometimes,...,Rarely,Never,Never,Sometimes,4,8,6,0,0,0
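
The output above has No effect and Worsen columns but no Improve column, which is the effect of drop_first=True. The following is a minimal sketch on a toy series (the values are illustrative, not taken from the survey) showing the difference between dropping and keeping the first category. The variable names `effects`, `dropped`, and `full` are made up for this example.

```python
import pandas as pd

# Toy stand-in for the 'Music effects' column (illustrative values only)
effects = pd.Series(['No effect', 'Improve', 'Improve', 'Worsen'], name='Music effects')

# drop_first=True removes the first category in sorted order ('Improve'),
# leaving it as the implicit baseline (all remaining indicators are 0)
dropped = pd.get_dummies(effects, drop_first=True, dtype=int)

# drop_first=False keeps one indicator column per category
full = pd.get_dummies(effects, drop_first=False, dtype=int)

print(list(dropped.columns))
print(list(full.columns))
```

Keeping all three columns may be more convenient if each outcome is to be used as a separate binary target, while dropping one avoids redundant columns when the dummies are used as features.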


## While working, Instrumentalist, Composer, Exploratory, and Foreign languages

In [5]:
# Convert Yes/No values to 1s and 0s respectively

column_names = ['While working', 'Instrumentalist', 'Composer', 'Exploratory', 'Foreign languages']

survey_df[column_names] = survey_df[column_names].replace({'Yes':1, 'No':0})

In [6]:
# Confirm conversion

survey_df[column_names].dtypes

While working        int64
Instrumentalist      int64
Composer             int64
Exploratory          int64
Foreign languages    int64
dtype: object

In [7]:
# Check values

survey_df[column_names].head()

Unnamed: 0,While working,Instrumentalist,Composer,Exploratory,Foreign languages
0,0,0,0,0,1
1,1,0,1,1,1
2,1,0,0,1,0
3,1,1,1,1,1
4,1,1,0,1,1
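
As an alternative sketch of the same Yes/No conversion, Series.map can be used with a check for values outside the mapping. This is not the approach taken in the cell above; the toy frame and the names `toy_df`, `yes_no_map`, and `binary_columns` are assumptions for illustration.

```python
import pandas as pd

# Toy frame with two of the Yes/No columns (illustrative values only)
toy_df = pd.DataFrame({
    'While working': ['No', 'Yes', 'Yes'],
    'Composer': ['No', 'Yes', 'No'],
})

yes_no_map = {'Yes': 1, 'No': 0}
binary_columns = ['While working', 'Composer']

# map() turns any value missing from the dictionary into NaN
encoded = toy_df[binary_columns].apply(lambda col: col.map(yes_no_map))

# A null check therefore catches unexpected entries such as 'yes' or ''
if encoded.isna().any().any():
    raise ValueError('unexpected value in a Yes/No column')

encoded = encoded.astype(int)
print(encoded)
```

Unlike replace, which leaves unmatched strings untouched, this version fails loudly if a column contains anything other than Yes or No.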


## Frequency Columns

In [8]:
# Convert frequency columns from strings to integer values corresponding to their order

# Initialize label encoder
le = LabelEncoder()

# Create list of frequency columns
frequency_columns = list(survey_df.columns[survey_df.columns.str.startswith('Fr')])

survey_df[frequency_columns]

Unnamed: 0,Frequency [Classical],Frequency [Country],Frequency [EDM],Frequency [Folk],Frequency [Gospel],Frequency [Hip hop],Frequency [Jazz],Frequency [K pop],Frequency [Latin],Frequency [Lofi],Frequency [Metal],Frequency [Pop],Frequency [R&B],Frequency [Rap],Frequency [Rock],Frequency [Video game music]
0,Never,Never,Very frequently,Never,Never,Rarely,Rarely,Very frequently,Never,Sometimes,Sometimes,Rarely,Never,Rarely,Rarely,Very frequently
1,Sometimes,Never,Never,Rarely,Sometimes,Never,Very frequently,Sometimes,Very frequently,Sometimes,Never,Sometimes,Sometimes,Never,Never,Never
2,Never,Never,Rarely,Never,Rarely,Very frequently,Never,Very frequently,Sometimes,Sometimes,Never,Sometimes,Very frequently,Very frequently,Never,Rarely
3,Rarely,Sometimes,Never,Never,Never,Sometimes,Very frequently,Very frequently,Rarely,Very frequently,Rarely,Very frequently,Very frequently,Very frequently,Very frequently,Never
4,Sometimes,Never,Rarely,Sometimes,Rarely,Rarely,Sometimes,Never,Rarely,Rarely,Rarely,Rarely,Rarely,Never,Never,Sometimes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
715,Very frequently,Rarely,Never,Sometimes,Never,Sometimes,Rarely,Never,Sometimes,Rarely,Rarely,Very frequently,Never,Rarely,Very frequently,Never
716,Rarely,Rarely,Never,Never,Never,Never,Rarely,Never,Never,Rarely,Never,Very frequently,Never,Never,Sometimes,Sometimes
717,Rarely,Sometimes,Sometimes,Rarely,Rarely,Very frequently,Rarely,Rarely,Rarely,Sometimes,Rarely,Sometimes,Sometimes,Sometimes,Rarely,Rarely
718,Very frequently,Never,Never,Never,Never,Never,Rarely,Never,Never,Never,Never,Never,Never,Never,Never,Sometimes


In [9]:
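# Loop through frequency columns and fit LabelEncoder.
# LabelEncoder assigns codes in sorted (alphabetical) order, which here
# happens to match Never < Rarely < Sometimes < Very frequently.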
for column in frequency_columns:
    survey_df[column] = le.fit_transform(survey_df[column])

# Confirm conversion

survey_df[frequency_columns]

Unnamed: 0,Frequency [Classical],Frequency [Country],Frequency [EDM],Frequency [Folk],Frequency [Gospel],Frequency [Hip hop],Frequency [Jazz],Frequency [K pop],Frequency [Latin],Frequency [Lofi],Frequency [Metal],Frequency [Pop],Frequency [R&B],Frequency [Rap],Frequency [Rock],Frequency [Video game music]
0,0,0,3,0,0,1,1,3,0,2,2,1,0,1,1,3
1,2,0,0,1,2,0,3,2,3,2,0,2,2,0,0,0
2,0,0,1,0,1,3,0,3,2,2,0,2,3,3,0,1
3,1,2,0,0,0,2,3,3,1,3,1,3,3,3,3,0
4,2,0,1,2,1,1,2,0,1,1,1,1,1,0,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
715,3,1,0,2,0,2,1,0,2,1,1,3,0,1,3,0
716,1,1,0,0,0,0,1,0,0,1,0,3,0,0,2,2
717,1,2,2,1,1,3,1,1,1,2,1,2,2,2,1,1
718,3,0,0,0,0,0,1,0,0,0,0,0,0,0,0,2


In [10]:
# Check type conversion

survey_df[frequency_columns].dtypes

Frequency [Classical]           int64
Frequency [Country]             int64
Frequency [EDM]                 int64
Frequency [Folk]                int64
Frequency [Gospel]              int64
Frequency [Hip hop]             int64
Frequency [Jazz]                int64
Frequency [K pop]               int64
Frequency [Latin]               int64
Frequency [Lofi]                int64
Frequency [Metal]               int64
Frequency [Pop]                 int64
Frequency [R&B]                 int64
Frequency [Rap]                 int64
Frequency [Rock]                int64
Frequency [Video game music]    int64
dtype: object
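
LabelEncoder numbers its classes in sorted order, so the 0 to 3 codes above line up with the intended ranking only because Never, Rarely, Sometimes, Very frequently is already alphabetical. A minimal sketch of an explicit ordinal mapping, which does not depend on that coincidence, might look like this. The toy frame and the names `frequency_order` and `frequency_map` are assumptions for illustration.

```python
import pandas as pd

# Intended ranking, written out explicitly
frequency_order = ['Never', 'Rarely', 'Sometimes', 'Very frequently']
frequency_map = {label: rank for rank, label in enumerate(frequency_order)}

# Toy frame with two frequency columns (illustrative values only)
toy = pd.DataFrame({
    'Frequency [Jazz]': ['Rarely', 'Very frequently', 'Never'],
    'Frequency [Pop]': ['Sometimes', 'Never', 'Rarely'],
})

# Apply the same mapping to every column
encoded = toy.apply(lambda col: col.map(frequency_map))
print(encoded)
```

With an explicit mapping, adding a label such as Always would not silently reorder the codes.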

## Fav genre

In [11]:
# Convert fav genre to dummies

genre_dummies = pd.get_dummies(survey_df['Fav genre'], drop_first=True, dtype=int, prefix='fav')
survey_df = pd.concat([survey_df, genre_dummies], axis=1)

# Drop fav genre column

survey_df = survey_df.drop('Fav genre', axis=1)

# Confirm transformation
survey_df.head()

Unnamed: 0,Age,Hours per day,While working,Instrumentalist,Composer,Exploratory,Foreign languages,BPM,Frequency [Classical],Frequency [Country],...,fav_Jazz,fav_K pop,fav_Latin,fav_Lofi,fav_Metal,fav_Pop,fav_R&B,fav_Rap,fav_Rock,fav_Video game music
0,18,4.0,0,0,0,0,1,132.0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,61,2.5,1,0,1,1,1,84.0,2,0,...,1,0,0,0,0,0,0,0,0,0
2,18,4.0,1,0,0,1,0,107.0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,18,5.0,1,1,1,1,1,86.0,1,2,...,1,0,0,0,0,0,0,0,0,0
4,18,3.0,1,1,0,1,1,66.0,2,0,...,0,0,0,0,0,0,0,0,0,1


## Confirm Pre-processing transformation

In [12]:
survey_df.head()

Unnamed: 0,Age,Hours per day,While working,Instrumentalist,Composer,Exploratory,Foreign languages,BPM,Frequency [Classical],Frequency [Country],...,fav_Jazz,fav_K pop,fav_Latin,fav_Lofi,fav_Metal,fav_Pop,fav_R&B,fav_Rap,fav_Rock,fav_Video game music
0,18,4.0,0,0,0,0,1,132.0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,61,2.5,1,0,1,1,1,84.0,2,0,...,1,0,0,0,0,0,0,0,0,0
2,18,4.0,1,0,0,1,0,107.0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,18,5.0,1,1,1,1,1,86.0,1,2,...,1,0,0,0,0,0,0,0,0,0
4,18,3.0,1,1,0,1,1,66.0,2,0,...,0,0,0,0,0,0,0,0,0,1


## Export preprocessed data

In [13]:
survey_df.to_csv('../data/processed/pre_processed_survey_results.csv', index=False)
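
Before relying on the exported file, it can be worth checking that no string columns remain and that the data survives a round trip through CSV. The sketch below does this on a toy frame with an in-memory buffer instead of the real output path; the names `toy`, `non_numeric`, and `reloaded` are assumptions for illustration.

```python
import io
import pandas as pd

# Toy stand-in for the preprocessed frame (illustrative values only)
toy = pd.DataFrame({'Age': [18, 61], 'While working': [0, 1], 'fav_Jazz': [0, 1]})

# Any column still holding strings would appear in this list
non_numeric = toy.select_dtypes(exclude='number').columns.tolist()
print(non_numeric)

# Write to an in-memory buffer and read it back, mirroring to_csv(index=False)
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.equals(toy))
```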