# Spotify Song Like Prediction Model 
 

---

## Overview - Problem Statement

We will collect data on Spotify songs played and whether a user likes each song. We will use this to build a classifier that predicts when a user will like a song from the song's features.



## Import Tools & Data

---

In [1]:
# First we import libraries. We import everything we might need, to be safe.

# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, LogisticRegressionCV, LogisticRegression
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

from sklearn.metrics import r2_score
from sklearn import metrics
import requests 
from time import sleep
from bs4 import BeautifulSoup

from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LassoCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error






%matplotlib inline

## Data Collection

---

### Objectives
- Determine where and how to access the Spotify song data
- Import data to my server for analysis

TBD Update

In [2]:
df = pd.read_csv('data/data.csv')

## Exploratory Data Analysis

---

Since we have pulled this data from an internet source and not directly through Spotify's API, we need to thoroughly examine the data to ensure we have a complete and usable set.

### Objectives
- Evaluate missing data and devise a plan to handle accordingly
- Ensure useable data-types, transform data as needed

In [3]:
df.shape

(2017, 17)

In [4]:
df.head(3)

| | Unnamed: 0 | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence | target | song_title | artist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.0102 | 0.833 | 204600 | 0.434 | 0.0219 | 2 | 0.165 | -8.795 | 1 | 0.431 | 150.062 | 4.0 | 0.286 | 1 | Mask Off | Future |
| 1 | 1 | 0.199 | 0.743 | 326933 | 0.359 | 0.00611 | 1 | 0.137 | -10.401 | 1 | 0.0794 | 160.083 | 4.0 | 0.588 | 1 | Redbone | Childish Gambino |
| 2 | 2 | 0.0344 | 0.838 | 185707 | 0.412 | 0.000234 | 2 | 0.159 | -7.148 | 1 | 0.289 | 75.044 | 4.0 | 0.173 | 1 | Xanny Family | Future |


In [5]:
df.isnull().sum()

Unnamed: 0          0
acousticness        0
danceability        0
duration_ms         0
energy              0
instrumentalness    0
key                 0
liveness            0
loudness            0
mode                0
speechiness         0
tempo               0
time_signature      0
valence             0
target              0
song_title          0
artist              0
dtype: int64

In [6]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Unnamed: 0 | 2017.0 | 1008.0 | 582.402066 | 0.0 | 504.0 | 1008.0 | 1512.0 | 2016.0 |
| acousticness | 2017.0 | 0.18759 | 0.259989 | 3e-06 | 0.00963 | 0.0633 | 0.265 | 0.995 |
| danceability | 2017.0 | 0.618422 | 0.161029 | 0.122 | 0.514 | 0.631 | 0.738 | 0.984 |
| duration_ms | 2017.0 | 246306.197323 | 81981.814219 | 16042.0 | 200015.0 | 229261.0 | 270333.0 | 1004627.0 |
| energy | 2017.0 | 0.681577 | 0.210273 | 0.0148 | 0.563 | 0.715 | 0.846 | 0.998 |
| instrumentalness | 2017.0 | 0.133286 | 0.273162 | 0.0 | 0.0 | 7.6e-05 | 0.054 | 0.976 |
| key | 2017.0 | 5.342588 | 3.64824 | 0.0 | 2.0 | 6.0 | 9.0 | 11.0 |
| liveness | 2017.0 | 0.190844 | 0.155453 | 0.0188 | 0.0923 | 0.127 | 0.247 | 0.969 |
| loudness | 2017.0 | -7.085624 | 3.761684 | -33.097 | -8.394 | -6.248 | -4.746 | -0.307 |
| mode | 2017.0 | 0.612295 | 0.487347 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |


<br>
Examining our data, we can see that it is complete: no column contains nulls. Looking at each column, the data types are also appropriate and usable.

We can drop the 'Unnamed: 0' column, which only duplicates the row index, and 'song_title', which we do not use as a feature; 'target' is removed from the features because it is the label we want to predict. 'time_signature' also looks less helpful at first glance, but note that the cell below leaves it in X.

Before any transformation, we have 17 columns and 2,017 data points to use. We will need to dummify the artist column in order to use it as a feature input into any of our models.
<br><br>
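
As a quick illustration of what dummifying does, here is a minimal sketch on a hypothetical four-row frame (the values and the choice of artists are made up for the example, not taken from the full data set). Each artist becomes its own indicator column, and `drop_first=True` removes the first category so the remaining columns are not perfectly collinear.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "energy": [0.43, 0.36, 0.41, 0.80],
    "artist": ["Future", "Childish Gambino", "Future", "alt-J"],
})

# One indicator column per artist; drop_first removes the first category
# in sorted order, so three artists yield two indicator columns
toy_dummies = pd.get_dummies(toy, columns=["artist"], drop_first=True)
print(toy_dummies.columns.tolist())
```

With many distinct artists, this is why the column count grows so quickly once the real data is dummified.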

In [7]:
X = df.drop(columns=['Unnamed: 0' , 'song_title', 'target'])

In [8]:
X = pd.get_dummies(X, columns =['artist'], drop_first=True)

In [9]:
X.head(3)

| | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | ... | artist_alt-J | artist_deadmau5 | artist_for KING & COUNTRY | artist_one sonic society | artist_tUnE-yArDs | artist_tobyMac | artist_권나무 Kwon Tree | artist_도시총각 Dosichonggak | artist_카우칩스 The CowChips | artist_플랫핏 Flat Feet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0102 | 0.833 | 204600 | 0.434 | 0.0219 | 2 | 0.165 | -8.795 | 1 | 0.431 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0.199 | 0.743 | 326933 | 0.359 | 0.00611 | 1 | 0.137 | -10.401 | 1 | 0.0794 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0.0344 | 0.838 | 185707 | 0.412 | 0.000234 | 2 | 0.159 | -7.148 | 1 | 0.289 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [10]:
X.shape

(2017, 1355)

<br>
We have created our X variable using only the feature data from each song along with the dummified data for song artist. Our original data frame had 2,017 rows and 17 columns; X now has 2,017 rows and 1,355 columns.

Since our X shape looks correct, we continue by setting our 'y' target variable. The 'target' column holds a value of 1 if the song is liked and 0 if it is not.
<br><br>

In [11]:
y = df['target']

<br>
We have successfully imported our Spotify data and explored it thoroughly to ensure a usable set without missing data or improper values. We have also set up our X and y data sets to feed our models with appropriate features and target.

Our Exploratory Data Analysis is complete. We can move on to building our models.
<br><br>

<br>

## Construct Logistic Regression Model

Let's build our first model, a logistic regression. We will test it with a 'C' value of 100 and 5-fold cross-validation. We will also set up a train-test split so we can check the model's fit as we build.

<br>

In [12]:
# Instantiate
lr = LogisticRegressionCV(
    Cs=[100],
    cv=5,
    n_jobs=-1
    
)

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                    random_state=42,
                                                    stratify=y)

In [14]:
lr.fit(X_train, y_train)

LogisticRegressionCV(Cs=[100], class_weight=None, cv=5, dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=100, multi_class='warn', n_jobs=-1, penalty='l2',
                     random_state=None, refit=True, scoring=None,
                     solver='lbfgs', tol=0.0001, verbose=0)

In [15]:
lr.score(X_train, y_train)

0.5383597883597884

In [16]:
lr.score(X_test, y_test)

0.5485148514851486

<br>
Looking at our first model, we can see an accuracy of only 54% against the training data. This is not an impressive score for a two-class problem, and we would not consider this a successful model.

Reviewing our train-test split scores, the model scores 55% against the testing data. The similar training and testing scores indicate that the model is not over-fitting to the training data.

We continue on to try other models, pipelines, and features.
<br>
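
To put these scores in context, it helps to compare them with what we would get by always predicting the more common class. The sketch below is plain arithmetic on the test-split class counts, which are read off the confusion matrix shown later in this notebook (250 disliked and 255 liked songs); the variable names are our own.

```python
# Class counts in the test split, taken from the confusion matrix
# later in this notebook (tn + fp actual 0s, fn + tp actual 1s)
n_disliked = 130 + 120
n_liked = 104 + 151

# Accuracy of a model that always predicts the more common class
majority_baseline = max(n_liked, n_disliked) / (n_liked + n_disliked)
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

A testing score of about 55% is therefore only a few points above this trivial baseline.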

<br> 
## Construct K Nearest Neighbors Classifier Model

Next, let's continue our search for the best model by testing a K Nearest Neighbors model. We will initially build this with scikit-learn's default of 5 nearest neighbors.

We will fit our model to the training data established above.
<br><br>

In [17]:
knn = KNeighborsClassifier();
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [18]:
print(f' Knn Training Score is {knn.score(X_train, y_train)}');
print(f' Knn Testing Score is {knn.score(X_test, y_test)}')

 Knn Training Score is 0.7248677248677249
 Knn Testing Score is 0.5584158415841585


<br>
Here we evaluate a K Nearest Neighbors classifier using the 5 nearest neighbors (the default). We achieve a better score here: against the training data, the model scores 72%.

When we evaluate our KNN model against testing data, our score drops to 56%. This gap means we are over-fitting to our training data.

To strengthen this KNN model, let's evaluate our score when passing different parameters.
<br><br>
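
One way to compare several neighbor counts at once is to loop over candidate values and cross-validate each. The following is a minimal sketch on synthetic stand-in data from `make_classification` (an assumption for the example, not the Spotify set); the candidate values of k are also arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the Spotify data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# Mean 5-fold cross-validated accuracy for each candidate k
cv_scores = {}
for k in [3, 5, 9, 15]:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(knn_k, X_demo, y_demo, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores, best_k)
```

Cross-validating on the training data keeps the test split untouched until a final value of k has been chosen.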

In [19]:
knn = KNeighborsClassifier(n_neighbors=9)
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=9, p=2,
                     weights='uniform')

In [20]:
print(f'Training score is {knn.score(X_train, y_train)}');
print(f'Testing score is {knn.score(X_test, y_test)}')

Training score is 0.6812169312169312
Testing score is 0.5564356435643565


<br>
Increasing the number of neighbors to 9 reduces our model's over-fitting (the training score falls from 72% to 68%), but it does not improve our testing score, which stays at about 56%. This is good information to note, though it does not improve our model.

Continuing our efforts to improve the model, I will try grid search next to find the optimal parameters. To grid search effectively, we will scale our data first.
<br><br>

In [21]:
ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)  # fit_transform learns the mean and standard deviation, then converts to z-scores
X_test_scaled = ss.transform(X_test)


In [22]:
lr.fit(X_train_scaled, y_train)



LogisticRegressionCV(Cs=[100], class_weight=None, cv=5, dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=100, multi_class='warn', n_jobs=-1, penalty='l2',
                     random_state=None, refit=True, scoring=None,
                     solver='lbfgs', tol=0.0001, verbose=0)

In [23]:
print(f'Training-Scaled score is {knn.score(X_train_scaled, y_train)}');
print(f'Testing-Scaled score is {knn.score(X_test_scaled, y_test)}')

Training-Scaled score is 0.49404761904761907
Testing-Scaled score is 0.49504950495049505


<br>
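We have tried scaling the data to investigate whether this can improve our model performance. Unfortunately, the scores above dropped to about 49%. Note, however, that they come from the knn model, which was fit on the unscaled data and is being scored here on scaled inputs, while the lr model refit on the scaled data was never scored. The testing score of 0.495 equals 250/505, the share of disliked songs in the test split (per the confusion matrix later in this notebook), which is consistent with the model predicting a single class for every song. This result therefore reflects a mismatch between the fitting and scoring inputs rather than the effect of scaling itself.

We will re-fit our lr model to the original, unscaled data and continue with other model evaluations and pipeline methods. We will re-scale data as needed to operate effectively within a pipeline; for now, the stand-alone models continue to use the untransformed data.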
<br><br>
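
When scaling is part of the workflow, one way to guarantee that a model is always fit and scored on identically transformed inputs is to wrap the scaler and the model in a single `Pipeline`. This is a minimal sketch on synthetic data (an assumption for the example); one column is inflated to mimic a large-valued feature such as duration_ms, and the step names are our own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the Spotify data)
X_syn, y_syn = make_classification(n_samples=400, n_features=8, random_state=0)
X_syn[:, 0] *= 100_000  # mimic a large-valued column like duration_ms

Xtr, Xte, ytr, yte = train_test_split(X_syn, y_syn, random_state=42, stratify=y_syn)

# The scaler is fit on the training data only, and the same transformation
# is applied automatically whenever the pipeline predicts or scores
scaled_knn = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=9)),
])
scaled_knn.fit(Xtr, ytr)
pipe_test_acc = scaled_knn.score(Xte, yte)
print(f"Pipeline testing score is {pipe_test_acc}")
```

Because the pipeline receives raw inputs in both `fit` and `score`, there is no separate scaled copy of the data to mix up with the unscaled one.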

In [24]:
lr.fit(X_train, y_train)

LogisticRegressionCV(Cs=[100], class_weight=None, cv=5, dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=100, multi_class='warn', n_jobs=-1, penalty='l2',
                     random_state=None, refit=True, scoring=None,
                     solver='lbfgs', tol=0.0001, verbose=0)

In [25]:
#Grid Search lr parameter
lr_params = {
    'C':[100,10,1,0.1,0.01]
}

In [26]:
#Instantiate a Grid search
gs = GridSearchCV(LogisticRegression(),
                  lr_params,
                  cv=5,
                  n_jobs=-1,
                  verbose=1
)

In [27]:
gs.fit(X_train, y_train)

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed:    5.4s finished


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='warn',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='warn',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='warn', n_jobs=-1, param_grid={'C': [100, 10, 1, 0.1, 0.01]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=1)

In [28]:
gs.best_params_

{'C': 100}

In [29]:
gs.best_score_

0.5357142857142857

In [30]:
gs.score(X_test, y_test)

0.5485148514851486

<br>
Completing our grid search, we see that our lr model performs best when passed 100 for the 'C' parameter, which is our initial best-guess setting.

Using this parameter, our best mean cross-validated score on the training data is 53.6%, in line with our original result. Against the testing data, the score is 54.9%, the same as before.
<br><br>
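
It is worth being precise about what `best_score_` reports: it is the mean cross-validated score of the best parameter setting, not a score on the full training set. The sketch below checks this on synthetic data (an assumption for the example); `max_iter=1000` is our own choice to help the solver converge.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the Spotify data)
X_gs, y_gs = make_classification(n_samples=300, n_features=10, random_state=1)

grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {"C": [100, 10, 1, 0.1, 0.01]},
                    cv=5)
grid.fit(X_gs, y_gs)

# One mean cross-validated score per candidate value of C
mean_cv = grid.cv_results_["mean_test_score"]
print(grid.best_params_, grid.best_score_, mean_cv)
```

Inspecting `cv_results_` also shows how close the other candidates are to the best one, which is useful when the differences are small.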


<br> 
## Evaluation - Confusion Matrix

Continuing with our latest KNN model (9 neighbors), we next evaluate its output by looking at the values of our confusion matrix.

To produce the confusion matrix we will need to generate and store our predictions.
<br><br>

In [31]:
preds = knn.predict(X_test)

In [32]:
from sklearn.metrics import confusion_matrix

In [33]:
confusion_matrix(y_test, preds) #Receive raw confusion matrix

array([[130, 120],
       [104, 151]])

In [34]:
#Setting up confusion matrix, using the ravel() function
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()

In [35]:
tn, fp, fn, tp

(130, 120, 104, 151)

In [36]:
conf_matrix = pd.DataFrame(data=[[tp, fp],[fn, tn]], 
                           columns=['Actual +', 'Actual -'], 
                           index=['Predicted +', 'Predicted -'])
conf_matrix

| | Actual + | Actual - |
|---|---|---|
| Predicted + | 151 | 120 |
| Predicted - | 104 | 130 |


In [37]:
sensitivity = conf_matrix.iloc[0,0] / conf_matrix.iloc[:, 0].sum()
specificity = conf_matrix.iloc[1,1] / conf_matrix.iloc[:, 1].sum()
accuracy = (conf_matrix.iloc[0, 0] + conf_matrix.iloc[1, 1]) / conf_matrix.iloc[:, :].sum().sum()
precision = conf_matrix.iloc[0, 0] / conf_matrix.iloc[0, :].sum()
neg_pred_val = conf_matrix.iloc[1, 1] / conf_matrix.iloc[1, :].sum()

In [40]:
#Nifty printing format code borrowed from General Assembly ATL Local-lesson
label_just = 18
spacer = ' '
sep = ':'
sep_just = 2

print('Sensitivity '.ljust(label_just, spacer)     + sep.ljust(sep_just), '{:.2f}%'.format(sensitivity*100))
print('Specificity '.ljust(label_just, spacer)     + sep.ljust(sep_just), '{:.2f}%'.format(specificity*100))
print('Accuracy '.ljust(label_just, spacer)        + sep.ljust(sep_just), '{:.2f}%'.format(accuracy*100))
print('Precision (PPV) '.ljust(label_just, spacer) + sep.ljust(sep_just), '{:.2f}%'.format(precision*100))
print('Neg PV '.ljust(label_just, spacer)          + sep.ljust(sep_just), '{:.2f}%'.format(neg_pred_val*100))

Sensitivity       :  59.22%
Specificity       :  52.00%
Accuracy          :  55.64%
Precision (PPV)   :  55.72%
Neg PV            :  55.56%


<br>


Examining our confusion matrix, sensitivity is the strongest of our metrics at 59%, which is good because it means we identify liked songs (true positives) at a higher rate than we identify disliked songs (specificity of 52%). Our model is trending in the right direction but is not yet the best it can be. We can try a bootstrap method to continue iterating toward a better model.
<br><br>
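
For reference, the five metrics above can also be written directly in terms of the four confusion-matrix counts. This sketch reuses the counts reported in this notebook and recomputes each metric from its definition.

```python
# Counts from the confusion matrix above
tn, fp, fn, tp = 130, 120, 104, 151

sensitivity = tp / (tp + fn)                 # share of liked songs we catch
specificity = tn / (tn + fp)                 # share of disliked songs we catch
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all songs classified correctly
precision = tp / (tp + fp)                   # share of predicted likes that are real
neg_pred_val = tn / (tn + fn)                # share of predicted dislikes that are real

for name, value in [("Sensitivity", sensitivity), ("Specificity", specificity),
                    ("Accuracy", accuracy), ("Precision (PPV)", precision),
                    ("Neg PV", neg_pred_val)]:
    print(f"{name}: {value * 100:.2f}%")
```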

<br>

### Create a Bootstrap Pipeline
At this point our best testing score still comes from our KNN model, at about 56%. In an effort to improve on this, we will next set up a bootstrap pipeline method using several models simultaneously. Let's evaluate this combination to understand how it impacts our performance.
<br><br>

In [41]:
ss = StandardScaler()
ss.fit(X_train)
X_train_scaled = ss.transform(X_train)
X_test_scaled = ss.transform(X_test)

In [42]:
lcv = LassoCV()
knr = KNeighborsRegressor()
dtr = DecisionTreeRegressor()
br = BaggingRegressor(base_estimator=DecisionTreeRegressor())
rfr = RandomForestRegressor()
adr = AdaBoostRegressor()
svr = SVR()

In [43]:
lcv.fit(X_train_scaled, y_train)
knr.fit(X_train_scaled, y_train)
dtr.fit(X_train_scaled, y_train)
br.fit(X_train_scaled, y_train)
rfr.fit(X_train_scaled, y_train)
adr.fit(X_train_scaled, y_train)
svr.fit(X_train_scaled, y_train)



SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
    gamma='auto_deprecated', kernel='rbf', max_iter=-1, shrinking=True,
    tol=0.001, verbose=False)

In [44]:
lcv_preds = lcv.predict(X_test_scaled)
knr_preds = knr.predict(X_test_scaled)
dtr_preds = dtr.predict(X_test_scaled)
br_preds = br.predict(X_test_scaled)
rfr_preds = rfr.predict(X_test_scaled)
adr_preds = adr.predict(X_test_scaled)
svr_preds = svr.predict(X_test_scaled)

In [45]:
#Method to split-out the 'model-name' from the model information string
#Method borrowed from General Assembly lessons
str(lcv).split('(')[0]

'LassoCV'

In [46]:
#Method to get the RMSE for each model type
#Method sourced from General Assembly lessons
for model in [lcv, knr, dtr, br, rfr, adr, svr]:
    print(str(model).split('(')[0])
    
    train_preds = model.predict(X_train_scaled)
    train_mse = mean_squared_error(y_train, train_preds)
    print(f'Train RMSE: {train_mse**0.5}')
    
    test_preds = model.predict(X_test_scaled)
    test_mse = mean_squared_error(y_test, test_preds)
    print(f'Test RMSE: {test_mse**0.5}')
    print()
          

LassoCV
Train RMSE: 0.18586838704431535
Test RMSE: 0.4019529683040417

KNeighborsRegressor
Train RMSE: 0.4372835821488514
Test RMSE: 0.542600996667126

DecisionTreeRegressor
Train RMSE: 0.018184824186332698
Test RMSE: 0.5431845956491173

BaggingRegressor
Train RMSE: 0.17100208897170424
Test RMSE: 0.44038686862735354

RandomForestRegressor
Train RMSE: 0.17226328662165794
Test RMSE: 0.43410568387718385

AdaBoostRegressor
Train RMSE: 0.4258825883287354
Test RMSE: 0.4421913405492583

SVR
Train RMSE: 0.13735581603834654
Test RMSE: 0.4271688712206993



<br>
Above we use a bootstrap method to evaluate several models in a concise process. Here we are using Lasso, K Neighbors Regressor, Decision Tree Regressor, Bagging Regressor, Random Forest Regressor, AdaBoost Regressor, and SVR.


Initial analysis shows that most of these models over-fit the training data: the training RMSE is well below the testing RMSE for most models, and the Decision Tree is the most extreme case (0.02 on training versus 0.54 on testing). LassoCV has the lowest testing RMSE at 0.40.

Because these are regressors scored with RMSE on a 0/1 target, their results are not directly comparable to the accuracy scores reported for our classifiers. None of them gives us a clear reason to switch, so our best performance continues to lie with our K Neighbors Classifier model. For now, let's take a moment to back up our data sets for safety and reproducibility.
<br><br>
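
If we want to compare a regressor with our classifiers on the same footing, one option is to threshold its continuous predictions and score the resulting labels with accuracy. The following is a minimal sketch on synthetic data (an assumption for the example); the 0.5 cutoff and the choice of `RandomForestRegressor` are illustrative, not results from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the Spotify data)
X_r, y_r = make_classification(n_samples=400, n_features=10, random_state=7)
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(X_r, y_r, random_state=42, stratify=y_r)

reg = RandomForestRegressor(n_estimators=50, random_state=42)
reg.fit(Xr_tr, yr_tr)

# Threshold the continuous predictions at 0.5 (an assumed cutoff) to get 0/1 labels
labels = (reg.predict(Xr_te) >= 0.5).astype(int)
reg_accuracy = accuracy_score(yr_te, labels)
print(f"Thresholded regressor accuracy: {reg_accuracy}")
```

Swapping each regressor for its classifier counterpart (for example `RandomForestClassifier`) would be the more direct route; thresholding is simply the smallest change to the cells above.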

In [47]:
# Save the original DataFrame and the train/test splits to CSV
df.to_csv('DF_original.csv')
X_train.to_csv('X_Train.csv')
y_train.to_csv('y_train.csv')
X_test.to_csv('X_test.csv')
y_test.to_csv('y_test.csv')





### Conclusions and Recommendations

When building a model to predict whether a user will like a song based on that user's like history alone, it will be difficult to achieve a really high performance score. This is a form of classifying human behaviour, which can be difficult, yet a successful model is possible.

For the features that we have available to construct our model, K nearest neighbors has the best performance. 

It's recommended to:
- Continue enhancing this model by engineering features
- Collect more data from this single user as well as other users
- Collect lyrics, perform sentiment analysis, and use the results as predictive features

Additionally, to expand the model into a general prediction model for any user, we will need to average and classify the liked/disliked values from all users for each song. For our purposes here with a single user, this method is fully functional. For a new user, we would simply need to load their like history to produce new results.
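
As a rough sketch of what that averaging step could look like, the example below builds a hypothetical multi-user like log (the `user_id` column, the values, and the 0.5 cutoff are all assumptions for illustration), averages the likes per song, and turns the average back into a 0/1 target.

```python
import pandas as pd

# Hypothetical multi-user like log; values are made up for illustration
likes = pd.DataFrame({
    "song_title": ["Mask Off", "Mask Off", "Mask Off", "Redbone", "Redbone"],
    "user_id": ["u1", "u2", "u3", "u1", "u2"],
    "target": [1, 1, 0, 0, 0],
})

# Share of users who liked each song
like_rate = likes.groupby("song_title")["target"].mean()

# Classify a song as generally liked when at least half of its listeners liked it
general_target = (like_rate >= 0.5).astype(int)
print(general_target)
```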