# Machine Learning Preprocessing

Although the data has been cleaned, it is not yet in a format that machine learning algorithms can use. This notebook converts the remaining string columns into numeric values.

## Columns to be Converted

### While working, Instrumentalist, Composer, Exploratory, and Foreign Languages
These columns contain Yes/No values, which will be converted to 1 (Yes) and 0 (No).
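
As a minimal sketch of this conversion, the snippet below applies a Yes/No mapping to a small made-up frame. The frame `toy_df`, the constant `YES_NO`, and the helper `encode_yes_no` are hypothetical names used only for illustration; the column names mirror two of the survey columns.

```python
import pandas as pd

# Hypothetical toy data standing in for the survey columns
toy_df = pd.DataFrame({
    'While working': ['Yes', 'No', 'Yes'],
    'Composer': ['No', 'No', 'Yes'],
})

YES_NO = {'Yes': 1, 'No': 0}

def encode_yes_no(df, columns):
    """Return a copy of df with the given Yes/No columns mapped to 1/0."""
    out = df.copy()
    for col in columns:
        # Series.map leaves NaN for any value missing from the mapping,
        # so unexpected labels are easy to spot afterwards
        out[col] = out[col].map(YES_NO)
    return out

encoded = encode_yes_no(toy_df, ['While working', 'Composer'])
print(encoded)
```

Working on a copy keeps the original strings available for comparison while the encoding is being checked.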

### Frequency [Genre] 
These columns contain string values with a natural order: Never, Rarely, Sometimes, Very frequently. These will be converted to the integers 0 through 3 in that order.
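
One way to make this order explicit is an ordered pandas categorical, whose integer codes follow the listed category order rather than whatever labels happen to occur. The sketch below uses a made-up series; `FREQUENCY_LEVELS` and `toy_freq` are illustrative names, not part of the survey data.

```python
import pandas as pd

# Intended order, lowest to highest (labels spelled as in the survey data)
FREQUENCY_LEVELS = ['Never', 'Rarely', 'Sometimes', 'Very frequently']

# Hypothetical column in which 'Never' does not occur
toy_freq = pd.Series(['Rarely', 'Very frequently', 'Sometimes', 'Rarely'])

# Codes come from the position in FREQUENCY_LEVELS, so 'Rarely' is still 1
# even though 'Never' is absent from this column
cat = pd.Categorical(toy_freq, categories=FREQUENCY_LEVELS, ordered=True)
codes = pd.Series(cat.codes, index=toy_freq.index)
print(codes.tolist())  # [1, 3, 2, 1]
```

Values that are not in `FREQUENCY_LEVELS` receive the code -1, which makes misspelled or unexpected labels visible.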

### Music effects
This column contains string values with the order Worsen, No effect, Improve. These will be converted to the integers 0, 1, and 2 respectively.
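
Because these labels do not sort alphabetically into the intended order, it is worth checking what an encoder actually assigns. The sketch below, on made-up values, compares scikit-learn's `LabelEncoder` (which stores its classes in sorted order) with an explicit dictionary; `toy_effects` and `EFFECT_ORDER` are illustrative names.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy_effects = pd.Series(['Worsen', 'No effect', 'Improve', 'Improve'])

# LabelEncoder sorts the labels, so codes follow alphabetical order
le = LabelEncoder()
alphabetical_codes = le.fit_transform(toy_effects)
print(le.classes_.tolist())         # ['Improve', 'No effect', 'Worsen']
print(alphabetical_codes.tolist())  # [2, 1, 0, 0]

# An explicit mapping encodes the order described above
EFFECT_ORDER = {'Worsen': 0, 'No effect': 1, 'Improve': 2}
ordered_codes = toy_effects.map(EFFECT_ORDER)
print(ordered_codes.tolist())       # [0, 1, 2, 2]
```

With the alphabetical codes, a larger number means a worse outcome, which is the reverse of the order described for this column.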

In [3]:
import pandas as pd
import numpy as np

# scikit-learn preprocessing utilities
from sklearn.preprocessing import LabelEncoder

In [5]:
# Import survey data

survey_df = pd.read_csv('../data/interim/clean_survey_results.csv')
survey_df.head()

Unnamed: 0,Age,Hours per day,While working,Instrumentalist,Composer,Fav genre,Exploratory,Foreign languages,BPM,Frequency [Classical],...,Frequency [Pop],Frequency [R&B],Frequency [Rap],Frequency [Rock],Frequency [Video game music],Anxiety,Depression,Insomnia,OCD,Music effects
0,18,4.0,No,No,No,Video game music,No,Yes,132.0,Never,...,Rarely,Never,Rarely,Rarely,Very frequently,7,7,10,2,No effect
1,61,2.5,Yes,No,Yes,Jazz,Yes,Yes,84.0,Sometimes,...,Sometimes,Sometimes,Never,Never,Never,9,7,3,3,Improve
2,18,4.0,Yes,No,No,R&B,Yes,No,107.0,Never,...,Sometimes,Very frequently,Very frequently,Never,Rarely,7,2,5,9,Improve
3,18,5.0,Yes,Yes,Yes,Jazz,Yes,Yes,86.0,Rarely,...,Very frequently,Very frequently,Very frequently,Very frequently,Never,8,8,7,7,Improve
4,18,3.0,Yes,Yes,No,Video game music,Yes,Yes,66.0,Sometimes,...,Rarely,Rarely,Never,Never,Sometimes,4,8,6,0,Improve


In [6]:
survey_df.dtypes

Age                               int64
Hours per day                   float64
While working                    object
Instrumentalist                  object
Composer                         object
Fav genre                        object
Exploratory                      object
Foreign languages                object
BPM                             float64
Frequency [Classical]            object
Frequency [Country]              object
Frequency [EDM]                  object
Frequency [Folk]                 object
Frequency [Gospel]               object
Frequency [Hip hop]              object
Frequency [Jazz]                 object
Frequency [K pop]                object
Frequency [Latin]                object
Frequency [Lofi]                 object
Frequency [Metal]                object
Frequency [Pop]                  object
Frequency [R&B]                  object
Frequency [Rap]                  object
Frequency [Rock]                 object
Frequency [Video game music]     object


## While working, Instrumentalist, Composer, Exploratory, and Foreign Languages

In [None]:
# Convert Yes/No values to 1s and 0s respectively

column_names = ['While working', 'Instrumentalist', 'Composer', 'Exploratory', 'Foreign languages']

survey_df[column_names] = survey_df[column_names].replace({'Yes':1, 'No':0})

## Frequency Columns

In [None]:
# Convert frequency columns from strings to integer values corresponding to their order

# Explicit label-to-rank mapping. LabelEncoder assigns codes alphabetically,
# which matches this order only when all four labels occur in a column.
frequency_order = {'Never': 0, 'Rarely': 1, 'Sometimes': 2, 'Very frequently': 3}

# Create list of frequency columns
frequency_columns = list(survey_df.columns[survey_df.columns.str.startswith('Frequency')])

survey_df[frequency_columns]

In [None]:
# Loop through frequency columns and apply the mapping
for column in frequency_columns:
    survey_df[column] = survey_df[column].map(frequency_order)

# Confirm conversion

survey_df[frequency_columns]
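
A displayed frame is easy to misread, so a programmatic check can complement it. The sketch below defines a hypothetical helper, `check_encoded`, that verifies columns hold only the expected integer codes; it is shown on made-up data rather than on `survey_df`.

```python
import pandas as pd

def check_encoded(df, columns, allowed):
    """Raise ValueError if a listed column holds a value outside `allowed`.

    NaN is never in `allowed`, so missing values are reported as well.
    """
    allowed = set(allowed)
    for col in columns:
        unexpected = set(df[col].unique()) - allowed
        if unexpected:
            raise ValueError(f'{col!r} contains unexpected values: {unexpected}')
    return True

# Hypothetical encoded frequency columns
toy = pd.DataFrame({
    'Frequency [Jazz]': [0, 3, 2],
    'Frequency [Rock]': [1, 1, 0],
})
print(check_encoded(toy, list(toy.columns), range(4)))  # True
```

The same helper could be pointed at the Yes/No columns with `allowed=range(2)`.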

## Music effects

In [None]:
survey_df['Music effects']

In [None]:
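# Convert to int with an explicit mapping

# LabelEncoder would assign codes alphabetically (Improve=0, No effect=1, Worsen=2),
# which reverses the intended order, so the codes are spelled out instead
music_effects_order = {'Worsen': 0, 'No effect': 1, 'Improve': 2}

# Apply the mapping
survey_df['Music effects'] = survey_df['Music effects'].map(music_effects_order)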

# Confirm conversion
survey_df['Music effects']
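
If a scikit-learn transformer is preferred, for example to reuse the encoding inside a pipeline, `OrdinalEncoder` accepts an explicit category order. This is a sketch on made-up values and not necessarily the approach this notebook should take; `toy_df` is an illustrative name.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical stand-in for the 'Music effects' column
toy_df = pd.DataFrame({'Music effects': ['Improve', 'Worsen', 'No effect', 'Improve']})

# One list of categories per input column, in the intended order
encoder = OrdinalEncoder(categories=[['Worsen', 'No effect', 'Improve']])

# fit_transform expects 2-D input and returns a float array by default
encoded = encoder.fit_transform(toy_df[['Music effects']])
toy_df['Music effects'] = encoded.ravel().astype(int)
print(toy_df['Music effects'].tolist())  # [2, 0, 1, 2]
```

By default the encoder raises an error on labels outside the listed categories, which guards against silently mis-coded values.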