# 4 Training and Modelling the Data<a id='4_Training_and_Modeling_the_Data'></a>

## 4.1 Contents<a id='4.1_Contents'></a>
* [4 Training and Modelling the Data](#4_Training_and_Modeling_the_Data)
  * [4.1 Contents](#4.1_Contents)
  * [4.2 Imports](#4.2_Imports)
  * [4.3 Load Data](#4.3_Load_Data)
  * [4.4 Create Dummy Variables](#4.4_Create_Dummy_Variables)
  * [4.6 Train/Test Split](#4.6_Train/Test_Split)
  * [4.5 Scale the Data](#4.5_Scale_the_Data)
    * [4.5.1 Make Scaler Object](#4.5.1_Make_Scaler_Object)
    * [4.5.2 Fitting Data to Scaler](#4.5.2_Fitting_Data_to_Scalar)
  * [4.7 Initial Model](#4.7_Initial_Model)
    * [4.8.1 Imputing missing feature (predictor) values](#4.8.1_Imputing_missing_feature_(predictor)_values)
      * [4.8.1.1 Impute missing values with median](#4.8.1.1_Impute_missing_values_with_median)
        * [4.8.1.1.1 Learn the values to impute from the train set](#4.8.1.1.1_Learn_the_values_to_impute_from_the_train_set)
        * [4.8.1.1.2 Apply the imputation to both train and test splits](#4.8.1.1.2_Apply_the_imputation_to_both_train_and_test_splits)
        * [4.8.1.1.3 Scale the data](#4.8.1.1.3_Scale_the_data)
        * [4.8.1.1.4 Train the model on the train split](#4.8.1.1.4_Train_the_model_on_the_train_split)
        * [4.8.1.1.5 Make predictions using the model on both train and test splits](#4.8.1.1.5_Make_predictions_using_the_model_on_both_train_and_test_splits)
        * [4.8.1.1.6 Assess model performance](#4.8.1.1.6_Assess_model_performance)
      * [4.8.1.2 Impute missing values with the mean](#4.8.1.2_Impute_missing_values_with_the_mean)
        * [4.8.1.2.1 Learn the values to impute from the train set](#4.8.1.2.1_Learn_the_values_to_impute_from_the_train_set)
        * [4.8.1.2.2 Apply the imputation to both train and test splits](#4.8.1.2.2_Apply_the_imputation_to_both_train_and_test_splits)
        * [4.8.1.2.3 Scale the data](#4.8.1.2.3_Scale_the_data)
        * [4.8.1.2.4 Train the model on the train split](#4.8.1.2.4_Train_the_model_on_the_train_split)
        * [4.8.1.2.5 Make predictions using the model on both train and test splits](#4.8.1.2.5_Make_predictions_using_the_model_on_both_train_and_test_splits)
        * [4.8.1.2.6 Assess model performance](#4.8.1.2.6_Assess_model_performance)
    * [4.8.2 Pipelines](#4.8.2_Pipelines)
      * [4.8.2.1 Define the pipeline](#4.8.2.1_Define_the_pipeline)
      * [4.8.2.2 Fit the pipeline](#4.8.2.2_Fit_the_pipeline)
      * [4.8.2.3 Make predictions on the train and test sets](#4.8.2.3_Make_predictions_on_the_train_and_test_sets)
      * [4.8.2.4 Assess performance](#4.8.2.4_Assess_performance)
  * [4.9 Refining The Linear Model](#4.9_Refining_The_Linear_Model)
    * [4.9.1 Define the pipeline](#4.9.1_Define_the_pipeline)
    * [4.9.2 Fit the pipeline](#4.9.2_Fit_the_pipeline)
    * [4.9.3 Assess performance on the train and test set](#4.9.3_Assess_performance_on_the_train_and_test_set)
    * [4.9.4 Define a new pipeline to select a different number of features](#4.9.4_Define_a_new_pipeline_to_select_a_different_number_of_features)
    * [4.9.5 Fit the pipeline](#4.9.5_Fit_the_pipeline)
    * [4.9.6 Assess performance on train and test data](#4.9.6_Assess_performance_on_train_and_test_data)
    * [4.9.7 Assessing performance using cross-validation](#4.9.7_Assessing_performance_using_cross-validation)
    * [4.9.8 Hyperparameter search using GridSearchCV](#4.9.8_Hyperparameter_search_using_GridSearchCV)
  * [4.10 Random Forest Model](#4.10_Random_Forest_Model)
    * [4.10.1 Define the pipeline](#4.10.1_Define_the_pipeline)
    * [4.10.2 Fit and assess performance using cross-validation](#4.10.2_Fit_and_assess_performance_using_cross-validation)
    * [4.10.3 Hyperparameter search using GridSearchCV](#4.10.3_Hyperparameter_search_using_GridSearchCV)
  * [4.11 Final Model Selection](#4.11_Final_Model_Selection)
    * [4.11.1 Linear regression model performance](#4.11.1_Linear_regression_model_performance)
    * [4.11.2 Random forest regression model performance](#4.11.2_Random_forest_regression_model_performance)
    * [4.11.3 Conclusion](#4.11.3_Conclusion)
  * [4.12 Data quantity assessment](#4.12_Data_quantity_assessment)
  * [4.13 Save best model object from pipeline](#4.13_Save_best_model_object_from_pipeline)
  * [4.14 Summary](#4.14_Summary)


## 4.2 Imports<a id='4.2_Imports'></a>

In [1]:
# Standard library
import datetime
import os
import pickle

# Third-party
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import __version__ as sklearn_version
from sklearn.decomposition import PCA
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, cross_validate, learning_curve, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler, scale

## 4.3 Load Data<a id='4.3_Load_Data'></a>

In [2]:
spotify_data = pd.read_csv('../data/genre_music.csv')
spotify_data.head()

Unnamed: 0,track,artist,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_s,time_signature,chorus_hit,sections,popularity,decade,genre
0,Jealous Kind Of Fella,Garland Green,0.417,0.62,3,-7.727,1,0.0403,0.49,0.0,0.0779,0.845,185.655,173.533,3,32.94975,9,1,60s,edm
1,Initials B.B.,Serge Gainsbourg,0.498,0.505,3,-12.475,1,0.0337,0.018,0.107,0.176,0.797,101.801,213.613,4,48.8251,10,0,60s,pop
2,Melody Twist,Lord Melody,0.657,0.649,5,-13.392,1,0.038,0.846,4e-06,0.119,0.908,115.94,223.96,4,37.22663,12,0,60s,pop
3,Mi Bomba Sonó,Celia Cruz,0.59,0.545,7,-12.058,0,0.104,0.706,0.0246,0.061,0.967,105.592,157.907,4,24.75484,8,0,60s,pop
4,Uravu Solla,P. Susheela,0.515,0.765,11,-3.515,0,0.124,0.857,0.000872,0.213,0.906,114.617,245.6,4,21.79874,14,0,60s,r&b


In [3]:
spotify_data.shape

(41099, 20)

## 4.4 Create Dummy Variables<a id='4.4_Create_Dummy_Variables'></a>

In [4]:
# One-hot encode the genre column: one indicator column per genre
genre_dummy = pd.get_dummies(spotify_data["genre"])
genre_dummy.head()

Unnamed: 0,edm,latin,pop,r&b,rap,rock
0,1,0,0,0,0,0
1,0,0,1,0,0,0
2,0,0,1,0,0,0
3,0,0,1,0,0,0
4,0,0,0,1,0,0


In [5]:
# One-hot encode the decade column: one indicator column per decade
decade_dummy = pd.get_dummies(spotify_data["decade"])
decade_dummy.head()

Unnamed: 0,00s,10s,60s,70s,80s,90s
0,0,0,1,0,0,0
1,0,0,1,0,0,0
2,0,0,1,0,0,0
3,0,0,1,0,0,0
4,0,0,1,0,0,0


In [6]:
# Drop the original categorical columns now that they are encoded
spotify_data = spotify_data.drop(['genre', 'decade'], axis=1)
# Append the decade and genre indicator columns to the remaining columns
spotify_data_dummies = pd.concat([spotify_data, decade_dummy, genre_dummy], axis=1)
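
As an aside, the three steps above (encode each column, drop the originals, concatenate) can probably be collapsed into a single `pd.get_dummies` call. The sketch below runs on a small made-up frame named `toy`, not on the real data. The `prefix=""` and `prefix_sep=""` arguments are there so the new columns keep bare names such as `60s` and `edm`, and `dtype=int` requests 0/1 integers, since pandas 2.0 and later return booleans by default.

```python
import pandas as pd

# Hypothetical stand-in for spotify_data with one numeric and two categorical columns
toy = pd.DataFrame(
    {
        "danceability": [0.417, 0.498, 0.515],
        "decade": ["60s", "60s", "70s"],
        "genre": ["edm", "pop", "r&b"],
    }
)

# Encode both categorical columns in one call; the originals are dropped automatically
encoded = pd.get_dummies(
    toy,
    columns=["decade", "genre"],
    prefix="",
    prefix_sep="",
    dtype=int,
)

print(encoded.columns.tolist())
print(encoded)
```

The untouched columns come first, followed by the indicator columns in the order the encoded columns were listed, which mirrors the `pd.concat` order used in the cell above.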

In [12]:
spotify_data_dummies.head()

Unnamed: 0,track,artist,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,...,60s,70s,80s,90s,edm,latin,pop,r&b,rap,rock
0,Jealous Kind Of Fella,Garland Green,0.417,0.62,3,-7.727,1,0.0403,0.49,0.0,...,1,0,0,0,1,0,0,0,0,0
1,Initials B.B.,Serge Gainsbourg,0.498,0.505,3,-12.475,1,0.0337,0.018,0.107,...,1,0,0,0,0,0,1,0,0,0
2,Melody Twist,Lord Melody,0.657,0.649,5,-13.392,1,0.038,0.846,4e-06,...,1,0,0,0,0,0,1,0,0,0
3,Mi Bomba Sonó,Celia Cruz,0.59,0.545,7,-12.058,0,0.104,0.706,0.0246,...,1,0,0,0,0,0,1,0,0,0
4,Uravu Solla,P. Susheela,0.515,0.765,11,-3.515,0,0.124,0.857,0.000872,...,1,0,0,0,0,0,0,1,0,0


In [13]:
# Drop the free-text identifier columns, which are not used for modelling
spotify_data_dummies = spotify_data_dummies.drop(['track', 'artist'], axis=1)

In [14]:
spotify_data_dummies.head()

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,...,60s,70s,80s,90s,edm,latin,pop,r&b,rap,rock
0,0.417,0.62,3,-7.727,1,0.0403,0.49,0.0,0.0779,0.845,...,1,0,0,0,1,0,0,0,0,0
1,0.498,0.505,3,-12.475,1,0.0337,0.018,0.107,0.176,0.797,...,1,0,0,0,0,0,1,0,0,0
2,0.657,0.649,5,-13.392,1,0.038,0.846,4e-06,0.119,0.908,...,1,0,0,0,0,0,1,0,0,0
3,0.59,0.545,7,-12.058,0,0.104,0.706,0.0246,0.061,0.967,...,1,0,0,0,0,0,1,0,0,0
4,0.515,0.765,11,-3.515,0,0.124,0.857,0.000872,0.213,0.906,...,1,0,0,0,0,0,0,1,0,0


## 4.6 Train/Test Split<a id='4.6_Train/Test_Split'></a>

In [10]:
# Approximate row counts for a 70/30 train/test split
len(spotify_data_dummies) * .7, len(spotify_data_dummies) * .3

(28769.3, 12329.699999999999)

In [15]:
spotify_data_dummies.columns

Index(['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
       'duration_s', 'time_signature', 'chorus_hit', 'sections', 'popularity',
       '00s', '10s', '60s', '70s', '80s', '90s', 'edm', 'latin', 'pop', 'r&b',
       'rap', 'rock'],
      dtype='object')

In [19]:
features = ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
       'duration_s', 'time_signature', 'chorus_hit', 'sections', 'popularity',
       '00s', '10s', '60s', '70s', '80s', '90s']

genres = ['edm','latin','pop','r&b','rap','rock']

X = spotify_data_dummies[features]

y = spotify_data_dummies[genres]
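
Assuming every row has exactly one genre flag set (which holds when `genre` has no missing values), the six-column `y` can be collapsed back to a single label column if a later step expects one. A minimal sketch on made-up rows follows; `y_toy` and `labels` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical one-hot target with three of the six genre columns
y_toy = pd.DataFrame(
    {
        "edm": [1, 0, 0, 0],
        "pop": [0, 1, 1, 0],
        "r&b": [0, 0, 0, 1],
    }
)

# idxmax(axis=1) returns, for each row, the name of the column holding the
# largest value; for a one-hot row that is the column containing the 1
labels = y_toy.idxmax(axis=1)

print(labels.tolist())
```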

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=47)

In [21]:
X_train.shape, X_test.shape

((28769, 22), (12330, 22))

In [22]:
y_train.shape, y_test.shape

((28769, 6), (12330, 6))
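
The shapes above line up with the earlier 70/30 estimate once rounding is taken into account. As far as I understand scikit-learn's behaviour, a fractional `test_size` is rounded up to a whole number of rows and the training set receives the remainder. The arithmetic can be checked without the library; the variable names below are illustrative.

```python
import math

n_rows = 41099    # row count reported by spotify_data.shape
test_size = 0.3   # fraction passed to train_test_split

# Round the test share up, then give the remaining rows to the training set
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)
```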

## 4.5 Scale the Data<a id='4.5_Scale_the_Data'></a>

### 4.5.1 Make Scaler Object<a id='4.5.1_Make_Scaler_Object'></a>

In [24]:
scaler = StandardScaler()

### 4.5.2 Fitting Data to Scaler<a id='4.5.2_Fitting_Data_to_Scalar'></a>

In [26]:
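# Standardize every column (genre indicators included) to zero mean and unit variance
scaled_spotify_data = scaler.fit_transform(spotify_data_dummies)
# fit_transform returns a NumPy array, so wrap it in a DataFrame again
scaled_spotify_data = pd.DataFrame(scaled_spotify_data)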
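
The cell above fits the scaler on the whole frame, so the test rows and the genre indicator columns both influence the learned means and variances. A common alternative is to fit on the training features only and reuse those statistics for the test features. The sketch below shows that pattern on synthetic data; the `*_demo` names are made up for illustration and the random values stand in for the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix standing in for X (200 rows, 3 features)
rng = np.random.default_rng(47)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))

X_train_demo, X_test_demo = train_test_split(X_demo, test_size=0.3, random_state=47)

scaler_demo = StandardScaler()
# Learn the per-column mean and standard deviation from the training rows only
X_train_scaled = scaler_demo.fit_transform(X_train_demo)
# Apply the training statistics to the test rows without refitting
X_test_scaled = scaler_demo.transform(X_test_demo)

print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))
```

The scaled training columns have mean 0 and standard deviation 1 by construction, while the scaled test columns will only be close to those values, which is expected when the test rows are kept out of the fit.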

## 4.7 Initial Model<a id='4.7_Initial_Model'></a>