# Pre-processing and Training for Capstone Two: Music & Happiness

### Table of Contents

* [Introduction](#start)
    * [Import relevant libraries](#import)
* [Pre-processing](#preprocess)
    * [Encode dummy variables for countries](#dummies)
    * [Scale data using StandardScaler](#scaling)
* [Training](#train)
    * [Split data](#split)
    * [Check data and save for retrieval](#check)

## 1 - Introduction <a name="start"></a>

In this notebook, we pick up where we left off in the exploratory data analysis phase: we pre-process our data and split it into training and testing sets for a machine learning model.

### 1.1 - Import relevant libraries <a name="import"></a>

In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Retrieve dataframes stored in the EDA phase
%store -r wh_songs_country

## 2 - Pre-processing <a name="preprocess"></a>

Let's pre-process our data to prepare it for our machine learning model. We start by looking at the variables we have to decide which ones need to be encoded and which need to be scaled.

In [3]:
wh_songs_country.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71 entries, 0 to 70
Data columns (total 24 columns):
 #   Column               Non-Null Count  Dtype   
---  ------               --------------  -----   
 0   country              71 non-null     object  
 1   region               71 non-null     object  
 2   ladder_score         71 non-null     float64 
 3   gdp_per_capita       71 non-null     float64 
 4   social_support       71 non-null     float64 
 5   life_expectancy      71 non-null     float64 
 6   life_choice_freedom  71 non-null     float64 
 7   generosity           71 non-null     float64 
 8   corruption           71 non-null     float64 
 9   popularity           71 non-null     float64 
 10  is_explicit          71 non-null     float64 
 11  duration_ms          71 non-null     float64 
 12  danceability         71 non-null     float64 
 13  energy               71 non-null     float64 
 14  key                  71 non-null     object  
 15  loudness             71 n
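
One way to turn an `info()` listing into an encode-or-scale decision is to group the columns by dtype. The snippet below is a minimal sketch on a made-up toy frame (the values and the `toy` name are illustrative, not taken from `wh_songs_country`):

```python
import pandas as pd

# Hypothetical toy frame standing in for wh_songs_country (values are made up)
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "region": ["North", "North", "South"],
    "ladder_score": [0.7, 0.6, 0.5],
    "danceability": [0.61, 0.66, 0.58],
})

# Non-numeric columns are candidates for encoding; numeric columns for scaling
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

print("encode:", categorical_cols)
print("scale:", numeric_cols)
```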

### 2.1 - Encode dummy variables for countries <a name="dummies"></a>

Our dataset contains two categorical variables of interest: `country` and `region`. We will therefore create a new dataframe in which these are encoded as Boolean dummy variables. The `key` and `time_signature` columns are dropped before encoding.

To make the two sets of dummy columns easy to tell apart, we prefix them with `country` and `region`, respectively.

In [4]:
# Create dummy variables for country and region
df_encoded = pd.get_dummies(wh_songs_country.drop(['key', 'time_signature'], axis=1), columns=['country', 'region'], prefix=['country', 'region'])
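
To see what this call produces on a small scale, here is a minimal sketch on a made-up toy frame (the `toy` and `encoded` names and all values are illustrative). Passing `dtype=bool` is an assumption added here so the dummy columns are Boolean regardless of pandas version; the cell above relies on the default.

```python
import pandas as pd

# Hypothetical toy frame (values are made up)
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "region": ["North", "North", "South"],
    "energy": [0.7, 0.5, 0.6],
})

# One Boolean column per category, named <prefix>_<category>
encoded = pd.get_dummies(
    toy,
    columns=["country", "region"],
    prefix=["country", "region"],
    dtype=bool,
)

print(encoded.columns.tolist())
print(encoded.dtypes)
```

Columns that are not listed in `columns` (here `energy`) pass through unchanged, and each row has exactly one `True` per encoded variable.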

In [5]:
# Check dataframe
df_encoded.head()

| | ladder_score | gdp_per_capita | social_support | life_expectancy | life_choice_freedom | generosity | corruption | popularity | is_explicit | duration_ms | ... | country_Venezuela | country_Vietnam | region_East Asia | region_Eastern Europe | region_Latin America | region_Middle East and North Africa | region_South and South East Asia | region_Sub Saharan Africa | region_Western Europe | region_Western Offshoots |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.7804 | 0.858471 | 0.969 | 0.726205 | 0.961 | 0.286641 | 0.182 | 0.542417 | 21.896163 | 0.194135 | ... | False | False | False | False | False | False | False | False | True | False |
| 1 | 0.7586 | 0.886189 | 0.954 | 0.730671 | 0.934 | 0.485084 | 0.196 | 0.579685 | 42.307692 | 0.201592 | ... | False | False | False | False | False | False | False | False | True | False |
| 2 | 0.753 | 0.875428 | 0.983 | 0.766403 | 0.936 | 0.584955 | 0.668 | 0.46119 | 34.222222 | 0.209837 | ... | False | False | False | False | False | False | False | False | True | False |
| 3 | 0.7473 | 0.833524 | 0.943 | 0.795301 | 0.809 | 0.281453 | 0.708 | 0.503162 | 14.979757 | 0.222603 | ... | False | False | False | False | False | True | False | False | False | False |
| 4 | 0.7403 | 0.882928 | 0.93 | 0.744071 | 0.887 | 0.587549 | 0.379 | 0.650881 | 29.573935 | 0.199223 | ... | False | False | False | False | False | False | False | False | True | False |


In [6]:
# Check that the dummy variables have encoded properly
df_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71 entries, 0 to 70
Data columns (total 99 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   ladder_score                         71 non-null     float64
 1   gdp_per_capita                       71 non-null     float64
 2   social_support                       71 non-null     float64
 3   life_expectancy                      71 non-null     float64
 4   life_choice_freedom                  71 non-null     float64
 5   generosity                           71 non-null     float64
 6   corruption                           71 non-null     float64
 7   popularity                           71 non-null     float64
 8   is_explicit                          71 non-null     float64
 9   duration_ms                          71 non-null     float64
 10  danceability                         71 non-null     float64
 11  energy                            

### 2.2 - Scale data using StandardScaler <a name="scaling"></a>

We have a number of float variables that need to be scaled before we fit a model to the data. Let's do that here.

In [7]:
# Since we already encoded the dummy variables, let's exclude Booleans from our DataFrame so we can apply the 
# StandardScaler without losing our country data.
df_no_bools = df_encoded.select_dtypes(exclude='bool')
df_bools = df_encoded.select_dtypes(include='bool')

# Make scaler object
scaler = StandardScaler()

# Fit the scaler to the data and transform it
scaled_df = scaler.fit_transform(df_no_bools)
# Keep the original index so the concat below aligns row by row
scaled_df = pd.DataFrame(scaled_df, columns=df_no_bools.columns, index=df_no_bools.index)

# Combine the scaled data with df_bools to complete the DataFrame
df = pd.concat([scaled_df, df_bools], axis=1)
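
As a quick illustration of what `StandardScaler` does to each column, the sketch below compares it with the manual z-score `(x - mean) / std` on a made-up toy frame (the `toy`, `scaled`, and `manual` names and all values are illustrative). Note that `StandardScaler` uses the population standard deviation, which corresponds to `ddof=0` in pandas.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame (values are made up)
toy = pd.DataFrame({
    "duration_ms": [180000.0, 210000.0, 240000.0, 200000.0],
    "energy": [0.55, 0.70, 0.62, 0.81],
})

scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns, index=toy.index)

# Manual equivalent: subtract the column mean, divide by the population std
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(scaled.round(3))
print("matches manual z-score:", np.allclose(scaled, manual))
```

After scaling, every column has mean 0 and unit variance, so features measured in very different units (milliseconds versus a 0 to 1 score) end up on a comparable scale.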

In [8]:
df.head()

| | ladder_score | gdp_per_capita | social_support | life_expectancy | life_choice_freedom | generosity | corruption | popularity | is_explicit | duration_ms | ... | country_Venezuela | country_Vietnam | region_East Asia | region_Eastern Europe | region_Latin America | region_Middle East and North Africa | region_South and South East Asia | region_Sub Saharan Africa | region_Western Europe | region_Western Offshoots |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.945856 | 0.70535 | 1.254684 | 0.745662 | 1.490476 | -0.178546 | -2.274951 | -1.325841 | -0.882964 | -1.046849 | ... | False | False | False | False | False | False | False | False | True | False |
| 1 | 1.678419 | 0.883101 | 1.074404 | 0.769691 | 1.178856 | 0.941283 | -2.209425 | -0.82614 | 0.544376 | -0.578623 | ... | False | False | False | False | False | False | False | False | True | False |
| 2 | 1.60972 | 0.814092 | 1.422946 | 0.961922 | 1.201939 | 1.504857 | -0.000264 | -2.414943 | -0.021026 | -0.06091 | ... | False | False | False | False | False | False | False | False | True | False |
| 3 | 1.539793 | 0.545374 | 0.942198 | 1.117389 | -0.263828 | -0.207822 | 0.186953 | -1.852182 | -1.366615 | 0.740681 | ... | False | False | False | False | False | True | False | False | False | False |
| 4 | 1.453919 | 0.862189 | 0.785955 | 0.841778 | 0.636407 | 1.519495 | -1.352907 | 0.128469 | -0.346072 | -0.727367 | ... | False | False | False | False | False | False | False | False | True | False |


## 3 - Training <a name="train"></a>

### 3.1 - Split data <a name="split"></a>

Recall that our variable `ladder_score` corresponds to the ladder score in the World Happiness Report, which is based on a survey question known as the Cantril ladder. From the [World Happiness Report website's FAQ page](https://worldhappiness.report/faq/):
>The rankings in ... \[the\] World Happiness Report 2024 use data from the Gallup World Poll surveys from 2021 to 2023. They are based on answers to the main life evaluation question asked in the poll. This is called the Cantril ladder: it asks respondents to think of a ladder, with the best possible life for them being a 10 and the worst possible life being a 0. They are then asked to rate their own current lives on that 0 to 10 scale.

Of the variables we have from the World Happiness Report, `ladder_score` is the closest to a single overall measure of a country's happiness. For this reason, we will use `ladder_score` as our target variable and split our data accordingly.

Note that our dataset, `df`, contains other variables from the World Happiness Report: `'gdp_per_capita'`, `'social_support'`, `'life_expectancy'`, `'life_choice_freedom'`, `'generosity'`, and `'corruption'`. While these may be interesting target variables to explore in a separate project, we are only interested in how a country's music listening habits might predict its overall happiness score. Therefore, we will exclude these variables from our features.

In [9]:
# Define our X and y variables
X = df.drop(['ladder_score', 'gdp_per_capita', 'social_support', 'life_expectancy',
       'life_choice_freedom', 'generosity', 'corruption'], axis=1)
y = df['ladder_score']

In [10]:
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
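
The cells above scale the full dataset before splitting, and the split uses no fixed seed, so each run produces a different partition. A common alternative, shown here only as a sketch and not as what this notebook does, is to pass `random_state` for a repeatable split and to fit the scaler on the training rows only, so that no information from the test rows enters the scaling. The toy data and the names `toy_X`, `toy_y`, `X_tr`, and `X_te` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (random values, not from the real dataset)
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(rng.normal(size=(20, 3)), columns=["danceability", "energy", "loudness"])
toy_y = pd.Series(rng.normal(size=20), name="ladder_score")

# A fixed random_state makes the shuffle repeatable across runs
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)

# Fit the scaler on the training rows only, then apply it to both sets
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = pd.DataFrame(scaler.transform(X_tr), columns=X_tr.columns, index=X_tr.index)
X_te_scaled = pd.DataFrame(scaler.transform(X_te), columns=X_te.columns, index=X_te.index)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering, the training columns have mean 0 after scaling, while the test columns generally do not, which is expected because they are transformed with the training statistics.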

### 3.2 - Check data and save for retrieval <a name="check"></a>

In [11]:
# Check the shapes of the training and testing sets
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (56, 92)
X_test shape: (15, 92)
y_train shape: (56,)
y_test shape: (15,)
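
These shapes follow from simple arithmetic. The sketch below reproduces them, assuming that `train_test_split` rounds a fractional test size up to the next whole row (which is consistent with the output above) and that the 92 feature columns are the 99 columns of `df_encoded` minus the 7 World Happiness Report columns dropped from `X`.

```python
import math

n_samples = 71        # rows in the dataset, per the info() output
test_size = 0.2

# 0.2 * 71 = 14.2, rounded up to 15 test rows; the rest go to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# 99 encoded columns minus the target and the 6 other report variables
n_features = 99 - 7

print(n_train, n_test, n_features)  # 56 15 92
```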


We have completed the pre-processing phase and split our data into training and testing sets. Now, let's save our work so we can retrieve it in the next notebook, where we will create our machine learning model:

In [12]:
%store X_train
%store X_test
%store y_train
%store y_test

Stored 'X_train' (DataFrame)
Stored 'X_test' (DataFrame)
Stored 'y_train' (Series)
Stored 'y_test' (Series)
