# 3 Training and Modelling the Data<a id='3_Training_and_Modelling_the_Data'></a>

## 3.1 Contents<a id='3.1_Contents'></a>
* [3 Training and Modelling the Data](#3_Training_and_Modelling_the_Data)
  * [3.1 Contents](#3.1_Contents)
  * [3.2 Imports](#3.2_Imports)
  * [3.3 Load Data](#3.3_Load_Data)
  * [3.4 Train/Test Split](#3.4_Train_Test_Split)
  * [3.5 Modelling](#3.5_Modelling)
    * [3.5.1 Linear Regression](#3.5.1_Linear_Regression)
        * [3.5.1.1 Train the model on the train split](#3.5.1.1_Train_the_model_on_the_train_split)
        * [3.5.1.2 Make predictions using the model on both train and test splits](#3.5.1.2_Make_predictions_using_the_model_on_both_train_and_test_splits)
        * [3.5.1.3 Assess model performance](#3.5.1.3_Assess_model_performance)
    * [3.5.2 Decision Tree](#3.5.2_Decision_Tree)
        * [3.5.2.1 Train the model on the train split](#3.5.2.1_Train_the_model_on_the_train_split)
        * [3.5.2.2 Make predictions using the model on both train and test splits](#3.5.2.2_Make_predictions_using_the_model_on_both_train_and_test_splits)
        * [3.5.2.3 Assess model performance](#3.5.2.3_Assess_model_performance)
  * [3.6 Final Model Selection](#3.6_Final_Model_Selection)
    * [3.6.1 Linear Regression model performance](#3.6.1_Linear_Regression_model_performance)
    * [3.6.2 Decision Tree model performance](#3.6.2_Decision_Tree_model_performance)
    * [3.6.3 Conclusion](#3.6.3_Conclusion)

## 3.2 Imports<a id='3.2_Imports'></a>

In [1]:
import pandas as pd
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve
import datetime
from sklearn import metrics
from sklearn.tree import export_text
import matplotlib.image as pltimg
from IPython.display import Image  
import pydotplus

## 3.3 Load Data<a id='3.3_Load_Data'></a>

In [2]:
spotify_data = pd.read_csv('../data/top_200_tracks_features.csv')
spotify_data.head()

| | Unnamed: 0 | Position | Track Name | Artist | Stream Count/Week | URL | source | Week Ending | id | name | ... | popularity | danceability | acousticness | energy | instrumentalness | liveness | loudness | speechiness | tempo | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | Blinding Lights | The Weeknd | 41066317 | https://open.spotify.com/track/0sf12qNH5qcw8qp... | regional-global-weekly-2020-03-06--2020-03-13.csv | 2020-03-13 | 0sf12qNH5qcw8qpgymFOqD | Blinding Lights | ... | 30 | 0.513 | 0.00147 | 0.796 | 0.000209 | 0.0938 | -4.075 | 0.0629 | 171.017 | 4 |
| 1 | 1 | 2 | The Box | Roddy Ricch | 37470185 | https://open.spotify.com/track/0nbXyq5TXYPCO7p... | regional-global-weekly-2020-03-06--2020-03-13.csv | 2020-03-13 | 0nbXyq5TXYPCO7pr3N8S4I | The Box | ... | 88 | 0.896 | 0.104 | 0.586 | 0.0 | 0.79 | -6.687 | 0.0559 | 116.971 | 4 |
| 2 | 2 | 3 | Dance Monkey | Tones And I | 36071262 | https://open.spotify.com/track/1rgnBhdG2JDFTbY... | regional-global-weekly-2020-03-06--2020-03-13.csv | 2020-03-13 | 1rgnBhdG2JDFTbYkYRZAku | Dance Monkey | ... | 69 | 0.825 | 0.688 | 0.593 | 0.000161 | 0.17 | -6.401 | 0.0988 | 98.078 | 4 |
| 3 | 3 | 4 | Don't Start Now | Dua Lipa | 32169572 | https://open.spotify.com/track/6WrI0LAC5M1Rw2M... | regional-global-weekly-2020-03-06--2020-03-13.csv | 2020-03-13 | 6WrI0LAC5M1Rw2MnX2ZvEg | Don't Start Now | ... | 85 | 0.794 | 0.0125 | 0.793 | 0.0 | 0.0952 | -4.521 | 0.0842 | 123.941 | 4 |
| 4 | 4 | 5 | La Difícil | Bad Bunny | 29598307 | https://open.spotify.com/track/6NfrH0ANGmgBXyx... | regional-global-weekly-2020-03-06--2020-03-13.csv | 2020-03-13 | 6NfrH0ANGmgBXyxgV2PeXt | La Difícil | ... | 81 | 0.685 | 0.0861 | 0.848 | 7e-06 | 0.0783 | -4.561 | 0.0858 | 179.87 | 4 |


In [3]:
spotify_data.shape

(10600, 24)

In [4]:
print(list(spotify_data.columns))

['Unnamed: 0', 'Position', 'Track Name', 'Artist', 'Stream Count/Week', 'URL', 'source', 'Week Ending', 'id', 'name', 'album', 'artist', 'release_date', 'length', 'popularity', 'danceability', 'acousticness', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'time_signature']


In [5]:
spotify_data = spotify_data.drop(['Unnamed: 0','Track Name', 'Artist','URL', 'source','id','name', 'album', 'artist'],axis = 1)
spotify_data.head()

| | Position | Stream Count/Week | Week Ending | release_date | length | popularity | danceability | acousticness | energy | instrumentalness | liveness | loudness | speechiness | tempo | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 41066317 | 2020-03-13 | 2019-11-29 | 201573 | 30 | 0.513 | 0.00147 | 0.796 | 0.000209 | 0.0938 | -4.075 | 0.0629 | 171.017 | 4 |
| 1 | 2 | 37470185 | 2020-03-13 | 2019-12-06 | 196652 | 88 | 0.896 | 0.104 | 0.586 | 0.0 | 0.79 | -6.687 | 0.0559 | 116.971 | 4 |
| 2 | 3 | 36071262 | 2020-03-13 | 2019-05-10 | 209754 | 69 | 0.825 | 0.688 | 0.593 | 0.000161 | 0.17 | -6.401 | 0.0988 | 98.078 | 4 |
| 3 | 4 | 32169572 | 2020-03-13 | 2019-10-31 | 183290 | 85 | 0.794 | 0.0125 | 0.793 | 0.0 | 0.0952 | -4.521 | 0.0842 | 123.941 | 4 |
| 4 | 5 | 29598307 | 2020-03-13 | 2020-02-28 | 163084 | 81 | 0.685 | 0.0861 | 0.848 | 7e-06 | 0.0783 | -4.561 | 0.0858 | 179.87 | 4 |


In [6]:
spotify_data.shape

(10600, 15)

In [7]:
spotify_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10600 entries, 0 to 10599
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Position           10600 non-null  int64  
 1   Stream Count/Week  10600 non-null  int64  
 2   Week Ending        10600 non-null  object 
 3   release_date       10600 non-null  object 
 4   length             10600 non-null  int64  
 5   popularity         10600 non-null  int64  
 6   danceability       10600 non-null  float64
 7   acousticness       10600 non-null  float64
 8   energy             10600 non-null  float64
 9   instrumentalness   10600 non-null  float64
 10  liveness           10600 non-null  float64
 11  loudness           10600 non-null  float64
 12  speechiness        10600 non-null  float64
 13  tempo              10600 non-null  float64
 14  time_signature     10600 non-null  int64  
dtypes: float64(8), int64(5), object(2)
memory usage: 1.2+ MB


In [8]:
spotify_data["release_date"] = pd.to_datetime(spotify_data["release_date"])
spotify_data["Week Ending"]  = pd.to_datetime(spotify_data["Week Ending"])

In [9]:
# Note: release_date - Week Ending is negative for tracks released before the chart week.
spotify_data["days elapsed"] = (spotify_data["release_date"] - spotify_data["Week Ending"]).dt.days
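
The sign of `days elapsed` depends on the operand order in the subtraction. A minimal sketch, using two date pairs copied from the `head()` output above, of what each order produces. The frame `demo` and the column `days_since_release` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Two rows copied from the head() output above; `demo` is a hypothetical frame.
demo = pd.DataFrame({
    "release_date": pd.to_datetime(["2019-11-29", "2020-02-28"]),
    "Week Ending": pd.to_datetime(["2020-03-13", "2020-03-13"]),
})

# Same operand order as the notebook: negative when the release precedes the chart week.
demo["days elapsed"] = (demo["release_date"] - demo["Week Ending"]).dt.days

# Swapped operands (hypothetical column): days since release, non-negative for these rows.
demo["days_since_release"] = (demo["Week Ending"] - demo["release_date"]).dt.days

print(demo[["days elapsed", "days_since_release"]])
```

For a linear model the two versions differ only in the sign of the fitted coefficient, so the choice mainly affects how the feature reads.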

In [10]:
spotify_data = spotify_data.set_index('Week Ending')
spotify_data.index

DatetimeIndex(['2020-03-13', '2020-03-13', '2020-03-13', '2020-03-13',
               '2020-03-13', '2020-03-13', '2020-03-13', '2020-03-13',
               '2020-03-13', '2020-03-13',
               ...
               '2021-03-19', '2021-03-19', '2021-03-19', '2021-03-19',
               '2021-03-19', '2021-03-19', '2021-03-19', '2021-03-19',
               '2021-03-19', '2021-03-19'],
              dtype='datetime64[ns]', name='Week Ending', length=10600, freq=None)

In [11]:
spotify_data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10600 entries, 2020-03-13 to 2021-03-19
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   Position           10600 non-null  int64         
 1   Stream Count/Week  10600 non-null  int64         
 2   release_date       10600 non-null  datetime64[ns]
 3   length             10600 non-null  int64         
 4   popularity         10600 non-null  int64         
 5   danceability       10600 non-null  float64       
 6   acousticness       10600 non-null  float64       
 7   energy             10600 non-null  float64       
 8   instrumentalness   10600 non-null  float64       
 9   liveness           10600 non-null  float64       
 10  loudness           10600 non-null  float64       
 11  speechiness        10600 non-null  float64       
 12  tempo              10600 non-null  float64       
 13  time_signature     10600 non-null  int64    

In [19]:
feature_names = list(spotify_data.columns[3:])
print("features:", feature_names, sep="\n")

features:
['length', 'popularity', 'danceability', 'acousticness', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'time_signature', 'days elapsed']
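
The slice `columns[3:]` depends on `Position`, `Stream Count/Week` and `release_date` being the first three columns. A possible alternative, sketched here on a small stand-in frame, is to exclude those columns by name. The names `demo` and `non_features` are assumptions made for this example, not part of the notebook.

```python
import pandas as pd

# Hypothetical stand-in with the same leading columns as spotify_data (subset of features).
cols = ["Position", "Stream Count/Week", "release_date",
        "length", "popularity", "days elapsed"]
demo = pd.DataFrame(columns=cols)

# Assumption: these three columns are the target and non-feature fields.
non_features = ["Position", "Stream Count/Week", "release_date"]

# Keep every column that is not explicitly excluded, preserving column order.
feature_names = [c for c in demo.columns if c not in non_features]
print(feature_names)
```

On this frame the result matches the positional slice, but it would keep working if the column order changed.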


## 3.4 Train/Test Split<a id='3.4_Train_Test_Split'></a>

In [20]:
len(spotify_data) * .7, len(spotify_data) * .3

(7419.999999999999, 3180.0)

In [21]:
X = spotify_data[feature_names]

y = spotify_data['Stream Count/Week']

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=47)

In [23]:
X_train.shape, X_test.shape

((7420, 12), (3180, 12))

In [24]:
y_train.shape, y_test.shape

((7420,), (3180,))
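
The shapes above follow from `test_size=0.3` on 10600 rows, and the fixed `random_state` makes the shuffle repeatable. A small sketch on synthetic arrays of the same size (the `*_demo` and `X_tr`/`X_te` names are placeholders, and the values are arbitrary) illustrates both points:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the notebook's row and feature counts; values are arbitrary.
X_demo = np.arange(10600 * 12).reshape(10600, 12)
y_demo = np.arange(10600)

# Same arguments as the notebook's split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=47
)

# Repeating the call with the same random_state gives the same partition.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=47
)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```

Rows of `X` and `y` are shuffled together, so each feature row stays paired with its own target value.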

## 3.5 Modelling<a id='3.5_Modelling'></a>

### 3.5.1 Linear Regression<a id='3.5.1_Linear_Regression'></a>

#### 3.5.1.1 Train the model on the train split<a id='3.5.1.1_Train_the_model_on_the_train_split'></a>

In [25]:
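from sklearn.linear_model import LinearRegression

# Exploratory fit on the full dataset, only to inspect the coefficients.
# The model is refit on the training split in the next cell before evaluation.
reg = LinearRegression().fit(X, y)
reg.score(X, y)  # R^2 on the full data; not displayed because it is not the last expression
reg.coef_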

array([-8.25449183e+00,  6.86426635e+04,  1.80431763e+06, -1.52692365e+05,
        3.05234708e+06, -4.70391080e+06,  1.48524317e+06, -2.78762713e+05,
       -3.33316455e+06,  1.35673206e+04, -1.08763731e+05,  3.61417459e+01])
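
The coefficient array above is positional, so it is hard to tell which value belongs to which feature. One way to make it readable, sketched here on synthetic data with a three-feature subset, is to wrap `coef_` in a `pd.Series` indexed by the feature names. `X_demo`, `y_demo` and `reg_demo` are made-up names, and the target weights are chosen only so the result can be checked.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
feature_names = ["danceability", "energy", "tempo"]  # subset used for illustration
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=feature_names)

# Synthetic, noise-free target with known weights so the recovered coefficients can be checked.
y_demo = 2.0 * X_demo["danceability"] - 3.0 * X_demo["energy"] + 0.5 * X_demo["tempo"]

reg_demo = LinearRegression().fit(X_demo, y_demo)

# Label each coefficient with its feature and sort by absolute size.
coefs = pd.Series(reg_demo.coef_, index=feature_names).sort_values(key=abs, ascending=False)
print(coefs)
```

In the notebook the same pattern would be `pd.Series(reg.coef_, index=feature_names)`. Raw coefficients are not directly comparable across features measured on different scales (for example `tempo` versus `danceability`), so the ordering is only a rough guide unless the features are standardised first.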

In [26]:
# Refit on the training split only, so the test split stays unseen during training.
reg = reg.fit(X_train, y_train)

#### 3.5.1.2 Make predictions using the model on both train and test splits<a id='3.5.1.2_Make_predictions_using_the_model_on_both_train_and_test_splits'></a>

In [27]:
reg_train_pred = reg.predict(X_train)

reg_test_pred = reg.predict(X_test)

In [28]:
reg_train_pred = pd.Series(reg_train_pred)

reg

LinearRegression()

#### 3.5.1.3 Assess model performance<a id='3.5.1.3_Assess_model_performance'></a>

##### Training Set

In [29]:
# Regression metrics
explained_variance=metrics.explained_variance_score(y_train, reg_train_pred)
mean_absolute_error=metrics.mean_absolute_error(y_train, reg_train_pred) 
mse=metrics.mean_squared_error(y_train, reg_train_pred) 
mean_squared_log_error=metrics.mean_squared_log_error(y_train, reg_train_pred)
median_absolute_error=metrics.median_absolute_error(y_train, reg_train_pred)
r2=metrics.r2_score(y_train, reg_train_pred)

print('explained_variance: ', round(explained_variance,4))    
print('mean_squared_log_error: ', round(mean_squared_log_error,4))
print('r2: ', round(r2,4))
print('MAE: ', round(mean_absolute_error,4))
print('MSE: ', round(mse,4))
print('RMSE: ', round(np.sqrt(mse),4))

explained_variance:  0.0361
mean_squared_log_error:  0.2353
r2:  0.0361
MAE:  3823156.6654
MSE:  32053182916372.875
RMSE:  5661553.0481


##### Test Set

In [31]:
# Regression metrics
explained_variance=metrics.explained_variance_score(y_test, reg_test_pred)
mean_absolute_error=metrics.mean_absolute_error(y_test, reg_test_pred) 
mse=metrics.mean_squared_error(y_test, reg_test_pred) 
mean_squared_log_error=metrics.mean_squared_log_error(y_test, reg_test_pred)
median_absolute_error=metrics.median_absolute_error(y_test, reg_test_pred)
r2=metrics.r2_score(y_test, reg_test_pred)

print('explained_variance: ', round(explained_variance,4))    
print('mean_squared_log_error: ', round(mean_squared_log_error,4))
print('r2: ', round(r2,4))
print('MAE: ', round(mean_absolute_error,4))
print('MSE: ', round(mse,4))
print('RMSE: ', round(np.sqrt(mse),4))

explained_variance:  0.0436
mean_squared_log_error:  0.2331
r2:  0.0426
MAE:  3839079.8179
MSE:  31413852182776.152
RMSE:  5604806.1682
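
The training and test cells above repeat the same metric calls. A hypothetical helper, `regression_report`, could compute them once per split; the sketch below checks it on a tiny hand-made example. It leaves out `mean_squared_log_error`, which raises an error on negative values, and a linear model does not guarantee non-negative predictions.

```python
import numpy as np
from sklearn import metrics


def regression_report(y_true, y_pred):
    """Return several of the regression metrics used above as a dict (hypothetical helper)."""
    mse = metrics.mean_squared_error(y_true, y_pred)
    return {
        "explained_variance": metrics.explained_variance_score(y_true, y_pred),
        "r2": metrics.r2_score(y_true, y_pred),
        "MAE": metrics.mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
    }


# Tiny hand-made example: two predictions are off by one, two are exact.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

report = regression_report(y_true, y_pred)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

In the notebook this would be called as `regression_report(y_train, reg_train_pred)` and `regression_report(y_test, reg_test_pred)`.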


### 3.5.2 Decision Tree<a id='3.5.2_Decision_Tree'></a>

In [None]:
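from statsmodels.tsa.arima.model import ARIMA

# Assumption: `data` was never defined in this cell. Here it is taken to be the
# total stream count per chart week, built from the 'Week Ending' index.
data = spotify_data['Stream Count/Week'].groupby(level=0).sum().to_numpy(dtype=float)

# Fit an ARIMA(1, 1, 1) model
model = ARIMA(data, order=(1, 1, 1))
model_fit = model.fit()

# One-step-ahead forecast for the period after the last observation.
# `typ='levels'` is dropped: it belonged to the older ARIMA API, and this class predicts in levels.
yhat = model_fit.predict(start=len(data), end=len(data))
print(yhat)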

