# Baseline Modeling 

Before any further preprocessing or EDA, let's run some baseline models on our cleaned Spotify dataset. This gives us a reference point and shows which classifiers are worth pursuing and optimizing later.

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns

In [2]:
# Reading in our cleaned dataset
spotify= pd.read_csv('~/Desktop/CapstoneProject/spotify_cleaned_final.csv')

In [3]:
# shape of our cleaned dataset 
spotify.shape

(81343, 21)

In [4]:
# All of our features 
spotify.columns

Index(['Unnamed: 0', 'track_id', 'artists', 'album_name', 'track_name',
       'popularity', 'duration_ms', 'explicit', 'danceability', 'energy',
       'key', 'loudness', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature',
       'track_genre'],
      dtype='object')

In [5]:
# top of the data
spotify.head()

Unnamed: 0.1,Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
0,0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,Comedy,73,230666,False,0.676,0.461,...,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4,acoustic
1,1,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,Ghost (Acoustic),Ghost - Acoustic,55,149610,False,0.42,0.166,...,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4,acoustic
2,2,1iJBSr7s7jYXzM8EGcbK5b,Ingrid Michaelson;ZAYN,To Begin Again,To Begin Again,57,210826,False,0.438,0.359,...,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4,acoustic
3,3,6lfxq3CG4xtTiEg7opyCyx,Kina Grannis,Crazy Rich Asians (Original Motion Picture Sou...,Can't Help Falling In Love,71,201933,False,0.266,0.0596,...,-18.515,1,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3,acoustic
4,4,5vjLSffimiIP26QG5WcN2K,Chord Overstreet,Hold On,Hold On,82,198853,False,0.618,0.443,...,-9.681,1,0.0526,0.469,0.0,0.0829,0.167,119.949,4,acoustic


The saved index was read in as an extra column. Let's read the data in again, setting the first column as the index.

In [6]:
# Re-reading the data with the first column as the index
spotify = pd.read_csv('~/Desktop/CapstoneProject/spotify_cleaned_final.csv', index_col=0)

In [7]:
# Sanity check 
spotify.head()

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,Comedy,73,230666,False,0.676,0.461,1,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4,acoustic
1,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,Ghost (Acoustic),Ghost - Acoustic,55,149610,False,0.42,0.166,1,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4,acoustic
2,1iJBSr7s7jYXzM8EGcbK5b,Ingrid Michaelson;ZAYN,To Begin Again,To Begin Again,57,210826,False,0.438,0.359,0,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4,acoustic
3,6lfxq3CG4xtTiEg7opyCyx,Kina Grannis,Crazy Rich Asians (Original Motion Picture Sou...,Can't Help Falling In Love,71,201933,False,0.266,0.0596,0,-18.515,1,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3,acoustic
4,5vjLSffimiIP26QG5WcN2K,Chord Overstreet,Hold On,Hold On,82,198853,False,0.618,0.443,2,-9.681,1,0.0526,0.469,0.0,0.0829,0.167,119.949,4,acoustic
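
An extra `Unnamed: 0` column like this typically appears when a DataFrame is saved with its index and then read back without `index_col`. A minimal sketch on a toy frame (the toy values and the in-memory buffer are assumptions for illustration, not the project's file):

```python
import io

import pandas as pd

# Toy frame standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({"popularity": [73, 55], "tempo": [87.917, 77.489]})

# Saving with the default index=True writes the index as an unnamed first column
buffer = io.StringIO()
toy.to_csv(buffer)

# Reading back without index_col turns that column into 'Unnamed: 0'
buffer.seek(0)
with_extra = pd.read_csv(buffer)

# Reading back with index_col=0 restores it as the index instead
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(with_extra.columns))
print(list(restored.columns))
```

Writing the file with `to_csv(..., index=False)` avoids the extra column in the first place.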


In [8]:
# making a copy of the cleaned data 
spotify_original = spotify.copy()

In [9]:
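# Dropping the identifier columns (track_id, artists, album_name, track_name)
# Note: track_genre, our target, is still included at this point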
spotify = spotify.iloc[:,4:]

In [10]:
# Sanity check
spotify.head()

Unnamed: 0,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
0,73,230666,False,0.676,0.461,1,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4,acoustic
1,55,149610,False,0.42,0.166,1,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4,acoustic
2,57,210826,False,0.438,0.359,0,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4,acoustic
3,71,201933,False,0.266,0.0596,0,-18.515,1,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3,acoustic
4,82,198853,False,0.618,0.443,2,-9.681,1,0.0526,0.469,0.0,0.0829,0.167,119.949,4,acoustic


In [11]:
spotify.shape

(81343, 16)

# Baseline Models

Without scaling our data or optimizing any hyperparameters, we are going to run a few baseline classifiers: Logistic Regression, Decision Tree, Random Forest and K-Nearest Neighbors.

## Baseline Logistic Regression Model 

In [12]:
#Baseline model 
from sklearn.linear_model import LogisticRegression 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Features (all non-object columns) and target
X = spotify.select_dtypes(exclude='object')
y = spotify['track_genre']

In [13]:
X.shape

(81343, 15)

In [14]:
y.shape

(81343,)

In [15]:
# train test split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [16]:
# shape of our splits
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(65074, 15) (16269, 15) (65074,) (16269,)


In [17]:
# Instantiate model
baseline_log = LogisticRegression(max_iter=10000)

# Fit the model 
baseline_log.fit(X_train, y_train)

# Score on training data 
print(baseline_log.score(X_train, y_train))

# Score on testing data 
print(baseline_log.score(X_test, y_test))

0.050834434643636475
0.0481283422459893


For our baseline logistic regression model, the train and test scores are both extremely low and close to each other, which is a clear sign of underfitting. The baseline model has captured little, if any, of the relationship between our features and genre.

This was expected given the large number of overlapping genres. We definitely have some hyperparameters we can optimize, but given that the scores are both so low, consolidating genres would be a good first step. 
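
To judge how low these scores really are, it can help to compare them against a chance-level baseline. A minimal sketch, assuming synthetic data from `make_classification` in place of the Spotify features (the sample size and the number of classes are arbitrary choices, not values from this project):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 20 roughly balanced classes (arbitrary, for illustration)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=15, n_informative=10,
    n_classes=20, n_clusters_per_class=1, random_state=1,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# Always predicts the most frequent training class, ignoring the features
chance_model = DummyClassifier(strategy="most_frequent")
chance_model.fit(X_tr, y_tr)

chance_accuracy = chance_model.score(X_te, y_te)
print(f"Chance-level test accuracy: {chance_accuracy:.3f}")
```

On the real data, fitting the same `DummyClassifier` on `X_train` and `y_train` would give the reference point against which the 4.8% test score should be read.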

| Baseline Model      | Train Score | Test Score |
|---------------------|-------------|------------|
| Logistic Regression | 5.1%        | 4.8%       |

## Baseline Decision Tree

In [18]:
from sklearn.tree import DecisionTreeClassifier

# instantiate model 
baseline_DT = DecisionTreeClassifier()

# fit model 
baseline_DT.fit(X_train, y_train)

print(baseline_DT.score(X_train, y_train))
print(baseline_DT.score(X_test, y_test))

0.999984632879491
0.24893970127235848


Although this baseline decision tree scored better on the test data than logistic regression, the gap between its train and test scores is extremely large, which indicates overfitting. With max_depth left at its default of None, the tree keeps splitting until it has effectively memorized the training data. This parameter and min_samples_leaf will need to be optimized going forward.

| Baseline Model      | Train Score | Test Score |
|---------------------|-------------|------------|
| Logistic Regression | 5.1%        | 4.8%       |
| Decision Tree       | >99.9%      | 24.9%      |
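
One way to tune `max_depth` and `min_samples_leaf` is a cross-validated grid search. A minimal sketch, assuming synthetic data and an arbitrary parameter grid (neither comes from this project):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (arbitrary sizes, for illustration only)
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=15, n_informative=8,
    n_classes=5, n_clusters_per_class=1, random_state=1,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# Hypothetical grid: shallower trees and larger leaves both limit overfitting
param_grid = {"max_depth": [3, 5, 10, None], "min_samples_leaf": [1, 5, 20]}

# 5-fold cross-validation over every combination in the grid
search = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, cv=5)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Train accuracy:", search.score(X_tr, y_tr))
print("Test accuracy:", search.score(X_te, y_te))
```

Because the search only ever sees the training split, the held-out test score remains an honest estimate of how the tuned tree generalizes.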

## Baseline Random Forest

A random forest should improve on our decision tree's test score, since it combines the predictions of 100 decision trees, which reduces the variance of any single tree.

In [23]:
from sklearn.ensemble import RandomForestClassifier

# 100 trees; all other hyperparameters are left at their defaults
random_forest_model = RandomForestClassifier(n_estimators=100)
random_forest_model.fit(X_train, y_train)

print(random_forest_model.score(X_train, y_train))
print(random_forest_model.score(X_test, y_test))

0.999984632879491
0.40555657999877065


| Baseline Model      | Train Score | Test Score |
|---------------------|-------------|------------|
| Logistic Regression | 5.1%        | 4.8%       |
| Decision Tree       | >99.9%      | 24.9%      |
| Random Forest       | >99.9%      | 40.6%      |

As expected, the test score increased to 40.6%, our highest yet, while the train score stayed above 99.9%. The model is still overfitting heavily, though a little less than our decision tree. There is a lot of room for improvement here through genre consolidation and hyperparameter optimization.

We expected all of our classifiers to struggle, either overfitting or failing to follow any trend in the data, because of the large number of overlapping genres in the dataset.
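
Genre consolidation could be done by mapping fine-grained labels onto broader groups before refitting. A minimal sketch, assuming a hand-written mapping; the groupings and the sample labels below are hypothetical and are not the ones used in this project:

```python
import pandas as pd

# Hypothetical mapping from fine-grained genres to broader groups
genre_groups = {
    "acoustic": "folk/acoustic",
    "folk": "folk/acoustic",
    "death-metal": "metal",
    "black-metal": "metal",
}

# Toy label column standing in for spotify['track_genre']
genres = pd.Series(["acoustic", "folk", "death-metal", "black-metal", "jazz"])

# map() returns NaN for genres missing from the mapping,
# so fillna() puts the original label back for those rows
consolidated = genres.map(genre_groups).fillna(genres)

print(consolidated.tolist())
print("Number of classes:", consolidated.nunique())
```

On the real data, the same two steps would be applied to the `track_genre` column before the train/test split, so that both splits share the consolidated labels.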

## Baseline K-Nearest Neighbors

In [24]:
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)

print("Number of neighbors: ", knn_model.n_neighbors)
print("Train Accuracy: ", knn_model.score(X_train, y_train))
print("Test Accuracy: ", knn_model.score(X_test, y_test))


Number of neighbors:  5
Train Accuracy:  0.256938254909795
Test Accuracy:  0.034605691806503164
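
KNN relies on distances, so an unscaled feature with a large range (such as `duration_ms`) can dominate the neighbor search. A minimal sketch of scaling inside a pipeline, assuming synthetic data with one noise feature deliberately inflated (the data and the inflation factor are illustrative assumptions, not results from this project):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (arbitrary sizes, for illustration only)
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=10, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=1,
)

# Append a pure-noise feature on a much larger scale, mimicking a millisecond column
rng = np.random.default_rng(1)
noise = rng.normal(scale=100_000, size=(X_demo.shape[0], 1))
X_demo = np.hstack([X_demo, noise])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# Same classifier, with and without standardizing the features first
unscaled_knn = KNeighborsClassifier().fit(X_tr, y_tr)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_tr, y_tr)

unscaled_accuracy = unscaled_knn.score(X_te, y_te)
scaled_accuracy = scaled_knn.score(X_te, y_te)
print("Unscaled test accuracy:", unscaled_accuracy)
print("Scaled test accuracy:  ", scaled_accuracy)
```

Putting the scaler inside the pipeline means it is fit on the training split only, which keeps information from the test split out of the preprocessing step.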
