# Pre-processing and Training for Capstone Two: Music & Happiness

### Table of Contents

* [Introduction](#start)
    * [Import relevant libraries](#import)
* [Pre-processing](#preprocess)
    * [Encode dummy variables for countries](#dummies)
    * [Split the data](#split)
    * [Scale data using StandardScaler](#scaling)
* [Finalizing](#final)
    * [Check and save data](#check)

## 1 - Introduction <a name="start"></a>

In this notebook, we pick up where the exploratory data analysis phase left off: we pre-process the data and split it into training and test sets so it is ready for a machine learning model.

### 1.1 - Import relevant libraries <a name="import"></a>

In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np

In [2]:
# Retrieve dataframes stored in the EDA phase
%store -r wh_songs_country

## 2 - Pre-processing <a name="preprocess"></a>

Let's pre-process our data to prepare it for our machine learning model. We start by looking at the variables we have and deciding which ones need to be encoded and which need to be scaled.

In [3]:
wh_songs_country.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71 entries, 0 to 70
Data columns (total 23 columns):
 #   Column               Non-Null Count  Dtype   
---  ------               --------------  -----   
 0   country              71 non-null     object  
 1   ladder_score         71 non-null     float64 
 2   gdp_per_capita       71 non-null     float64 
 3   social_support       71 non-null     float64 
 4   life_expectancy      71 non-null     float64 
 5   life_choice_freedom  71 non-null     float64 
 6   generosity           71 non-null     float64 
 7   corruption           71 non-null     float64 
 8   popularity           71 non-null     float64 
 9   is_explicit          71 non-null     float64 
 10  duration_ms          71 non-null     float64 
 11  danceability         71 non-null     float64 
 12  energy               71 non-null     float64 
 13  key                  71 non-null     object  
 14  loudness             71 non-null     float64 
 15  mode                 71 n
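As a rough sketch of how the encode-or-scale decision could be made programmatically, the snippet below separates non-numeric columns from numeric ones with `select_dtypes`. The `toy` dataframe and its values are hypothetical stand-ins, not rows from `wh_songs_country`.

```python
import pandas as pd

# Hypothetical miniature dataframe; names mirror the real columns,
# but the values are illustrative only.
toy = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C"],
    "ladder_score": [7.0, 6.5, 6.0],
    "key": ["C", "G", "A"],
    "loudness": [-7.1, -6.4, -8.0],
})

# Non-numeric columns must be encoded (or dropped) before modeling
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Numeric columns are candidates for scaling
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

print("Encode or drop:", categorical_cols)
print("Scale:", numeric_cols)
```

Run against the real dataframe, the same two calls should surface `country` and `key` as the `object` columns listed in the `info()` output above.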

### 2.1 - Encode dummy variables for countries <a name="dummies"></a>

We will one-hot encode `country` into dummy variables. The cell below also drops the `key` and `time_signature` columns before encoding.

In [4]:
# Create dummy variables for country
df = pd.get_dummies(
    wh_songs_country.drop(['key', 'time_signature'], axis=1),
    columns=['country'],
    prefix=['country'],
    drop_first=True,
)
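To make the effect of `drop_first=True` concrete, here is a minimal sketch on a hypothetical three-row dataframe (the `toy` frame and its values are assumptions for illustration). With three distinct countries, only two dummy columns are created; the dropped first category is represented by all dummies being `False`.

```python
import pandas as pd

# Hypothetical data: three distinct countries, one numeric feature
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "energy": [0.61, 0.72, 0.55],
})

# Same call pattern as the notebook cell, applied to the toy frame
encoded = pd.get_dummies(toy, columns=["country"], prefix=["country"], drop_first=True)

# Untouched columns come first, followed by one dummy per remaining category
print(encoded.columns.tolist())
```

The same arithmetic applies to the real data: 90 columns after encoding, minus the 20 remaining non-dummy columns, leaves 70 dummies, which implies 71 distinct countries across the 71 rows. Each dummy column therefore flags a single row, which is worth keeping in mind when interpreting any model fitted on these features.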

In [5]:
# Check dataframe
df.head()

Unnamed: 0,ladder_score,gdp_per_capita,social_support,life_expectancy,life_choice_freedom,generosity,corruption,popularity,is_explicit,duration_ms,...,country_Taiwan,country_Thailand,country_Turkey,country_Ukraine,country_United Arab Emirates,country_United Kingdom,country_United States,country_Uruguay,country_Venezuela,country_Vietnam
0,7.804,10.792,0.969,71.15,0.961,-0.019,0.182,54.241709,21.896163,182422.435666,...,False,False,False,False,False,False,False,False,False,False
1,7.586,10.962,0.954,71.25,0.934,0.134,0.196,57.968539,42.307692,189429.330769,...,False,False,False,False,False,False,False,False,False,False
2,7.53,10.896,0.983,72.05,0.936,0.211,0.668,46.119044,34.222222,197176.773333,...,False,False,False,False,False,False,False,False,False,False
3,7.473,10.639,0.943,72.697,0.809,-0.023,0.708,50.316184,14.979757,209172.384615,...,False,False,False,False,False,False,False,False,False,False
4,7.403,10.942,0.93,71.55,0.887,0.213,0.379,65.08814,29.573935,187203.403509,...,False,False,False,False,False,False,False,False,False,False


In [6]:
# Check that the dummy variables have encoded properly
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71 entries, 0 to 70
Data columns (total 90 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   ladder_score                  71 non-null     float64
 1   gdp_per_capita                71 non-null     float64
 2   social_support                71 non-null     float64
 3   life_expectancy               71 non-null     float64
 4   life_choice_freedom           71 non-null     float64
 5   generosity                    71 non-null     float64
 6   corruption                    71 non-null     float64
 7   popularity                    71 non-null     float64
 8   is_explicit                   71 non-null     float64
 9   duration_ms                   71 non-null     float64
 10  danceability                  71 non-null     float64
 11  energy                        71 non-null     float64
 12  loudness                      71 non-null     float64
 13  mode   

### 2.2 - Split data <a name="split"></a>

Recall that our variable `ladder_score` corresponds to the ladder score in the World Happiness Report. This score is based on a survey question known as the "Cantril ladder." From the [World Happiness Report website's FAQ page](https://worldhappiness.report/faq/):
>The rankings in ... \[the\] World Happiness Report 2024 use data from the Gallup World Poll surveys from 2021 to 2023. They are based on answers to the main life evaluation question asked in the poll. This is called the Cantril ladder: it asks respondents to think of a ladder, with the best possible life for them being a 10 and the worst possible life being a 0. They are then asked to rate their own current lives on that 0 to 10 scale.

Of the variables we have from the World Happiness Report, `ladder_score` is the closest to a single, comprehensive measure of a country's happiness. For this reason, we will use `ladder_score` as our target variable, and we will split and train our data accordingly.

Note that our dataset, `df`, contains other variables from the World Happiness Report, including `'gdp_per_capita'`, `'social_support'`, `'life_expectancy'`, `'life_choice_freedom'`, `'generosity'`, and `'corruption'`. While these may be interesting target variables to explore in a separate project, we are only interested in how a country's music listening habits might predict its overall happiness score. Therefore, we will exclude these variables from our feature set.

In [7]:
# Define our X and y variables
X = df.drop(['ladder_score', 'gdp_per_capita', 'social_support', 'life_expectancy',
       'life_choice_freedom', 'generosity', 'corruption'], axis=1)
y = df['ladder_score']

In [8]:
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
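As a quick sanity check on what `test_size=0.2` means for 71 rows, the sketch below runs the same split on placeholder arrays (`X_demo` and `y_demo` are hypothetical, not the real features). scikit-learn rounds the test fraction up, so 0.2 × 71 = 14.2 becomes 15 test rows, leaving 56 for training.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the real dataframe
X_demo = np.arange(71 * 2).reshape(71, 2)
y_demo = np.arange(71)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The test set size is the ceiling of 0.2 * 71; the rest goes to training
print(X_tr.shape, X_te.shape)
```

With so few rows, a single 15-row test set gives a noisy performance estimate, so cross-validation on the training portion may be worth considering in the modeling notebook.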

### 2.3 - Scale data using StandardScaler <a name="scaling"></a>

In [9]:
scaler = StandardScaler()

In [10]:
# Fit the scaler on the training data only, then transform the training data
X_train_scaled = scaler.fit_transform(X_train)

In [11]:
# Use the scaler to transform the test data
X_test_scaled = scaler.transform(X_test)
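The fit-on-train, transform-on-test pattern above can be illustrated with a small sketch on synthetic data (the arrays and the `demo_scaler` name are assumptions for illustration). The scaler learns each column's mean and standard deviation from the training rows only and reuses those statistics for the test rows, so no information from the test set leaks into pre-processing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the training and test feature matrices
rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(56, 3))
test = rng.normal(loc=50.0, scale=10.0, size=(15, 3))

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learn mean/std from train
test_scaled = demo_scaler.transform(test)        # reuse the train mean/std

# Training columns end up with mean ~0 and standard deviation ~1
print(np.allclose(train_scaled.mean(axis=0), 0.0))
print(np.allclose(train_scaled.std(axis=0), 1.0))

# Test columns are only approximately centered, since they use train statistics
print(test_scaled.mean(axis=0))
```

Note that `fit_transform` and `transform` return NumPy arrays, so the column names of `X_train` and `X_test` are not carried over to the scaled outputs by default.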

## 3 - Finalizing <a name="final"></a>

### 3.1 - Check data and save for retrieval <a name="check"></a>

In [12]:
# Check the shapes of the training and testing sets
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (56, 83)
X_test shape: (15, 83)
y_train shape: (56,)
y_test shape: (15,)


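We have completed the pre-processing phase: the data is encoded, split into training and test sets, and scaled. Now, let's save our work so we can retrieve it in the next notebook, where we will create our machine learning model. Note that the cell below stores the unscaled splits (`X_train`, `X_test`); `X_train_scaled` and `X_test_scaled` are not stored, so the scaling step will need to be repeated in the next notebook if scaled features are required: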

In [13]:
%store X_train
%store X_test
%store y_train
%store y_test
%store df

Stored 'X_train' (DataFrame)
Stored 'X_test' (DataFrame)
Stored 'y_train' (Series)
Stored 'y_test' (Series)
Stored 'df' (DataFrame)
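`%store` is an IPython magic, so the stored variables are only retrievable from another IPython session. As one possible alternative, a fitted scaler can be persisted with the standard-library `pickle` module so the same training statistics are available later. The sketch below is an assumption-laden illustration, not part of the original workflow: the file name `scaler.pkl`, the temporary directory, and the tiny training array are all hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit a scaler on a tiny illustrative column (mean 2.0)
demo_scaler = StandardScaler().fit(np.array([[1.0], [2.0], [3.0]]))

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "scaler.pkl"  # hypothetical file name

    # Save the fitted scaler to disk
    with path.open("wb") as f:
        pickle.dump(demo_scaler, f)

    # Load it back, as a later notebook or script might
    with path.open("rb") as f:
        restored = pickle.load(f)

# The restored scaler applies the same mean and standard deviation
print(restored.transform(np.array([[2.0]])))
```

Pickled scikit-learn objects are generally only safe to load with the same library version that wrote them, and only from trusted sources.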
