# 3 Training and Modelling the Data<a id='3_Training_and_Modelling_the_Data'></a>

## 3.1 Contents<a id='3.1_Contents'></a>
* [3 Training and Modelling the Data](#3_Training_and_Modelling_the_Data)
  * [3.1 Contents](#3.1_Contents)
  * [3.2 Imports](#3.2_Imports)
  * [3.3 Load Data](#3.3_Load_Data)
  * [3.4 Create Dummy Variables](#3.4_Create_Dummy_Variables)
  * [3.5 Train/Test Split](#3.5_Train_Test_Split)
  * [3.6 Modelling](#3.6_Modelling)
    * [3.6.1 Model 1](#3.6.1_Model_1)
        * [3.6.1.1 Train the model on the train split](#3.6.1.1_Train_the_model_on_the_train_split)
        * [3.6.1.2 Make predictions using the model on both train and test splits](#3.6.1.2_Make_predictions_using_the_model_on_both_train_and_test_splits)
        * [3.6.1.3 Assess model performance](#3.6.1.3_Assess_model_performance)
    * [3.6.2 -----](#3.6.2_Decision_Tree_without_entropy)
        * [3.6.2.1 Train the model on the train split](#3.6.2.1_Train_the_model_on_the_train_split)
        * [3.6.2.2 Make predictions using the model on both train and test splits](#3.6.2.2_Make_predictions_using_the_model_on_both_train_and_test_splits)
        * [3.6.2.3 Assess model performance](#3.6.2.3_Assess_model_performance)
  * [3.7 Final Model Selection](#3.7_Final_Model_Selection)
    * [3.7.1 --- model performance](#3.7.1_Logistic_regression_model_performance)
    * [3.7.2 --- model performance](#3.7.2_Decision_Tree_model_performance)
    * [3.7.3 Conclusion](#3.7.3_Conclusion)

## 3.2 Imports<a id='3.2_Imports'></a>

In [1]:
import pandas as pd
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve
import datetime
from sklearn import metrics
from sklearn.tree import export_text
import matplotlib.image as pltimg
from IPython.display import Image  
import pydotplus

## 3.3 Load Data<a id='3.3_Load_Data'></a>

In [2]:
spotify_data = pd.read_csv('../data/top_200_tracks_features.csv')
spotify_data.head()

| | Unnamed: 0 | Position | Track Name | Artist | Stream Count/Week | URL | source | Week Ending | id | name | ... | popularity | danceability | acousticness | energy | instrumentalness | liveness | loudness | speechiness | tempo | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | Blinding Lights | The Weeknd | 41066317 | https://open.spotify.com/track/0sf12qNH5qcw8qp... | regional-global-weekly-2020-03-06--2020-03-13.csv | 2020-03-13 | 0sf12qNH5qcw8qpgymFOqD | Blinding Lights | ... | 30 | 0.513 | 0.00147 | 0.796 | 0.000209 | 0.0938 | -4.075 | 0.0629 | 171.017 | 4 |
| 1 | 1 | 2 | The Box | Roddy Ricch | 37470185 | https://open.spotify.com/track/0nbXyq5TXYPCO7p... | regional-global-weekly-2020-03-06--2020-03-13.csv | 2020-03-13 | 0nbXyq5TXYPCO7pr3N8S4I | The Box | ... | 88 | 0.896 | 0.104 | 0.586 | 0.0 | 0.79 | -6.687 | 0.0559 | 116.971 | 4 |
| 2 | 2 | 3 | Dance Monkey | Tones And I | 36071262 | https://open.spotify.com/track/1rgnBhdG2JDFTbY... | regional-global-weekly-2020-03-06--2020-03-13.csv | 2020-03-13 | 1rgnBhdG2JDFTbYkYRZAku | Dance Monkey | ... | 69 | 0.825 | 0.688 | 0.593 | 0.000161 | 0.17 | -6.401 | 0.0988 | 98.078 | 4 |
| 3 | 3 | 4 | Don't Start Now | Dua Lipa | 32169572 | https://open.spotify.com/track/6WrI0LAC5M1Rw2M... | regional-global-weekly-2020-03-06--2020-03-13.csv | 2020-03-13 | 6WrI0LAC5M1Rw2MnX2ZvEg | Don't Start Now | ... | 85 | 0.794 | 0.0125 | 0.793 | 0.0 | 0.0952 | -4.521 | 0.0842 | 123.941 | 4 |
| 4 | 4 | 5 | La Difícil | Bad Bunny | 29598307 | https://open.spotify.com/track/6NfrH0ANGmgBXyx... | regional-global-weekly-2020-03-06--2020-03-13.csv | 2020-03-13 | 6NfrH0ANGmgBXyxgV2PeXt | La Difícil | ... | 81 | 0.685 | 0.0861 | 0.848 | 7e-06 | 0.0783 | -4.561 | 0.0858 | 179.87 | 4 |


In [3]:
spotify_data.shape

(10600, 24)

In [4]:
print(list(spotify_data.columns))

['Unnamed: 0', 'Position', 'Track Name', 'Artist', 'Stream Count/Week', 'URL', 'source', 'Week Ending', 'id', 'name', 'album', 'artist', 'release_date', 'length', 'popularity', 'danceability', 'acousticness', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'time_signature']


In [5]:
# Drop the leftover index column and the identifier/text columns that are not used as features
spotify_data = spotify_data.drop(columns=['Unnamed: 0', 'Track Name', 'Artist', 'URL', 'source', 'id', 'name', 'album', 'artist'])
spotify_data.head()

| | Position | Stream Count/Week | Week Ending | release_date | length | popularity | danceability | acousticness | energy | instrumentalness | liveness | loudness | speechiness | tempo | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 41066317 | 2020-03-13 | 2019-11-29 | 201573 | 30 | 0.513 | 0.00147 | 0.796 | 0.000209 | 0.0938 | -4.075 | 0.0629 | 171.017 | 4 |
| 1 | 2 | 37470185 | 2020-03-13 | 2019-12-06 | 196652 | 88 | 0.896 | 0.104 | 0.586 | 0.0 | 0.79 | -6.687 | 0.0559 | 116.971 | 4 |
| 2 | 3 | 36071262 | 2020-03-13 | 2019-05-10 | 209754 | 69 | 0.825 | 0.688 | 0.593 | 0.000161 | 0.17 | -6.401 | 0.0988 | 98.078 | 4 |
| 3 | 4 | 32169572 | 2020-03-13 | 2019-10-31 | 183290 | 85 | 0.794 | 0.0125 | 0.793 | 0.0 | 0.0952 | -4.521 | 0.0842 | 123.941 | 4 |
| 4 | 5 | 29598307 | 2020-03-13 | 2020-02-28 | 163084 | 81 | 0.685 | 0.0861 | 0.848 | 7e-06 | 0.0783 | -4.561 | 0.0858 | 179.87 | 4 |


In [6]:
spotify_data.shape

(10600, 15)

In [7]:
feature_names = list(spotify_data.columns[2:])
print("features:", feature_names, sep="\n")

features:
['Week Ending', 'release_date', 'length', 'popularity', 'danceability', 'acousticness', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'time_signature']
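
The positional slice `columns[2:]` works only while `Position` and `Stream Count/Week` stay in the first two positions. A minimal sketch of a name-based alternative follows; the `demo` frame and the names `target`, `excluded` and `feature_names_by_name` are hypothetical stand-ins, not part of the notebook. It assumes `Position` is left out on purpose, presumably because the chart rank is derived from the stream count.

```python
import pandas as pd

# Hypothetical stand-in for spotify_data with a few of its columns
demo = pd.DataFrame({
    "Position": [1, 2],
    "Stream Count/Week": [41066317, 37470185],
    "Week Ending": ["2020-03-13", "2020-03-13"],
    "popularity": [30, 88],
})

target = "Stream Count/Week"
# Columns that must never be used as features: the target and the rank
excluded = {"Position", target}

# Select features by name, preserving the original column order
feature_names_by_name = [c for c in demo.columns if c not in excluded]
print(feature_names_by_name)
```

Selecting by name keeps the feature list stable if columns are later reordered or new ones are added in front.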


## 3.4 Create Dummy Variables<a id='3.4_Create_Dummy_Variables'></a>

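Open question: are dummy variables required here? Judging from the `head()` output above, every remaining feature is numeric except `Week Ending` and `release_date`, which `read_csv` loads as strings because no `parse_dates` argument was passed. Those two columns need a numeric representation before scikit-learn estimators will accept them. `time_signature` is the only column that looks like a candidate for one-hot encoding.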
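
One way to answer this is to list the non-numeric columns and handle them explicitly. The sketch below is illustrative only: `sample` is a small made-up frame (the third row is invented), `days_since_release` is a hypothetical feature name, and it assumes every date is a full `YYYY-MM-DD` string. Whether `time_signature` should be one-hot encoded at all is a modelling choice, not something the data settles.

```python
import pandas as pd

# Made-up stand-in for spotify_data; only the columns relevant here
sample = pd.DataFrame({
    "Week Ending": ["2020-03-13", "2020-03-13", "2020-03-20"],
    "release_date": ["2019-11-29", "2019-12-06", "2020-02-28"],
    "time_signature": [4, 4, 3],
})

# Columns that are not numeric and so cannot be fed to an estimator as-is
non_numeric = sample.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Turn the two date strings into one numeric feature (hypothetical name)
week_ending = pd.to_datetime(sample["Week Ending"])
release_date = pd.to_datetime(sample["release_date"])
sample["days_since_release"] = (week_ending - release_date).dt.days

# Optional: one-hot encode time_signature, treating it as categorical
encoded = pd.get_dummies(sample, columns=["time_signature"])
print(encoded.columns.tolist())
```

If the real `release_date` column contains partial dates, `pd.to_datetime` may need `errors="coerce"` and a decision on how to treat the resulting missing values.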

## 3.5 Train/Test Split<a id='3.5_Train_Test_Split'></a>

In [8]:
len(spotify_data) * .7, len(spotify_data) * .3

(7419.999999999999, 3180.0)

In [10]:
X = spotify_data[feature_names]

y = spotify_data['Stream Count/Week']

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=47)

In [12]:
X_train.shape, X_test.shape

((7420, 13), (3180, 13))

In [13]:
y_train.shape, y_test.shape

((7420,), (3180,))

## 3.6 Modelling<a id='3.6_Modelling'></a>

### 3.6.1 Model 1<a id='3.6.1_Model_1'></a>

#### 3.6.1.1 Train the model on the train split<a id='3.6.1.1_Train_the_model_on_the_train_split'></a>
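
No estimator has been chosen for Model 1 yet, so this step is still empty. As a placeholder, here is a minimal sketch of the fit step, assuming a plain `LinearRegression` as a stand-in (not the notebook's final choice) and synthetic numeric data in place of `X_train` and `y_train`. The names `X_demo`, `y_demo` and `model` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic numeric features standing in for the real audio features
rng = np.random.default_rng(47)
X_demo = pd.DataFrame({
    "danceability": rng.uniform(0, 1, 200),
    "energy": rng.uniform(0, 1, 200),
    "loudness": rng.uniform(-12, -2, 200),
})
# Synthetic target on the same scale as weekly stream counts
y_demo = 1e6 * (5 + 3 * X_demo["energy"]) + rng.normal(0, 1e5, 200)

# Same split settings as section 3.5
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=47
)

# Fit on the training split only; the test split stays untouched
model = LinearRegression()
model.fit(X_tr, y_tr)
print(model.coef_.shape)
```

With the real data, the date columns in `feature_names` would first have to be converted to numeric values, because the estimator cannot fit on strings.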

#### 3.6.1.2 Make predictions using the model on both train and test splits<a id='3.6.1.2_Make_predictions_using_the_model_on_both_train_and_test_splits'></a>

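In [None]:
# `model` is a placeholder name: no estimator has been chosen for Model 1 yet
model_train_pred = model.predict(X_train)

model_test_pred = model.predict(X_test)

In [None]:
# Wrap the prediction arrays in Series for easier inspection
model_train_pred = pd.Series(model_train_pred)

model_test_pred = pd.Series(model_test_pred)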

#### 3.6.1.3 Assess model performance<a id='3.6.1.3_Assess_model_performance'></a>
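
Because the target `Stream Count/Week` is continuous, regression metrics are the natural fit for both splits. The sketch below assumes R², MAE and RMSE as the metrics of interest, which the notebook has not yet decided on. `summarize` is a hypothetical helper and the arrays are made-up values rather than real predictions.

```python
import numpy as np
from sklearn import metrics

def summarize(y_true, y_pred):
    """Return a small dict of regression metrics for one split."""
    return {
        "r2": metrics.r2_score(y_true, y_pred),
        "mae": metrics.mean_absolute_error(y_true, y_pred),
        # RMSE computed from MSE so it works across scikit-learn versions
        "rmse": float(np.sqrt(metrics.mean_squared_error(y_true, y_pred))),
    }

# Made-up targets and predictions standing in for one split
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 300.0])

scores = summarize(y_true, y_pred)
print(scores)
```

Calling the same helper once with the training targets and predictions and once with the test ones gives directly comparable numbers; a large gap between the two would suggest overfitting.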

##### Training Set

##### Test Set