# Recommendation System

The next step in the analysis is to build a recommendation system. I will test out KNNBaseline, KNNBasic, KNNWithMeans and SVD models and move forward with whichever model has the lowest RMSE score.

In [1]:
#import necessary packages 
import pandas as pd
import numpy as np
from surprise import SVD
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline
from surprise import Dataset, Reader
from surprise import accuracy
from surprise.model_selection import cross_validate, train_test_split, GridSearchCV

In [2]:
#import dataframe 
df = pd.read_csv('modelingdata')
df.drop(['Unnamed: 0'], axis=1, inplace=True)

In [3]:
#Explore first two rows of dataframe 
df.head(2)

Unnamed: 0,artist,artist_id,album,track_name,is_explicit,track_id,danceability,energy,key,loudness,...,genre_electronic,genre_hip hop,genre_house,genre_indie,genre_pop,genre_punk,genre_r&b,genre_rap,genre_rock,genre_soul
0,Katy Perry,6jJ0s89eD6GaHleKKya26X,Katy Perry - Teenage Dream: The Complete Confe...,Firework,0,4lCv7b86sLynZbXhfScfm2,0.638,0.826,8,-4.968,...,0,0,0,0,0,0,0,0,0,0
1,Katy Perry,6jJ0s89eD6GaHleKKya26X,Katy Perry - Teenage Dream: The Complete Confe...,California Gurls,0,6tS3XVuOyu10897O3ae7bi,0.791,0.754,0,-3.729,...,0,0,0,0,0,0,0,0,0,0


In [4]:
#Look at dataframe info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7028 entries, 0 to 7027
Data columns (total 39 columns):
artist              7028 non-null object
artist_id           7028 non-null object
album               7028 non-null object
track_name          7028 non-null object
is_explicit         7028 non-null int64
track_id            7028 non-null object
danceability        7028 non-null float64
energy              7028 non-null float64
key                 7028 non-null int64
loudness            7028 non-null float64
mode                7028 non-null int64
speechiness         7028 non-null float64
acousticness        7028 non-null float64
instrumentalness    7028 non-null float64
liveness            7028 non-null float64
valence             7028 non-null float64
tempo               7028 non-null float64
duration_ms         7028 non-null int64
ratings             7028 non-null int64
decade_1960         7028 non-null int64
decade_1970         7028 non-null int64
decade_1980         7028 non-n

Note that the current dataframe we are looking at consists of the integer ratings, the same data that was used in our original EDA. The distribution of these ratings can be found below.

In [5]:
#Print the value counts of the ratings column
df['ratings'].value_counts()

3    2668
2    1905
4    1314
1     888
5     253
Name: ratings, dtype: int64

In [6]:
#See which artists have been rated the most 
df['artist'].value_counts().head()

Various Artists        464
Panic! At The Disco     63
Passion Pit             56
Bastille                51
Twenty One Pilots       43
Name: artist, dtype: int64

In [7]:
#See which tracks have been rated the most 
df['track_name'].value_counts().head()

Closer       8
Hurricane    8
Smile        8
Gold         8
Heaven       7
Name: track_name, dtype: int64

In [8]:
#Create a dataframe that contains artist ID, track ID, and ratings to then be put into surprise
dataset = df[['artist_id','track_id', 'ratings']]

In [9]:
#Check sparsity of matrix
numratings = len(dataset['ratings'])
numusers = len(dataset['artist_id'].unique())
numitems = len(dataset['track_id'].unique())

sparse = 1 - (numratings / (numusers*numitems))
sparse

0.9995661887459188

In [10]:
#print number of unique artists 
dataset['artist_id'].nunique()

2322

Our dataset is very sparse: about 99.96% of the possible artist-track pairs have no rating. Ideally, I would like the sparsity of the matrix to be below 95%. As a step in that direction, I will remove any artists who appear only once in the playlist.
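To make the sparsity figure concrete, here is a minimal sketch of the same calculation on a small made-up set of ratings. The `sparsity` helper and the toy data are illustrative assumptions and are not part of the notebook; as in the cells above, "users" are artists and "items" are tracks.

```python
def sparsity(ratings):
    """Return 1 - (number of ratings / number of possible user-item pairs).

    `ratings` is a sequence of (user_id, item_id) pairs. In this notebook
    the "users" are artists and the "items" are tracks.
    """
    ratings = list(ratings)
    users = {user for user, _ in ratings}
    items = {item for _, item in ratings}
    return 1 - len(ratings) / (len(users) * len(items))


# Toy data: 3 artists, 4 tracks, 4 ratings out of 12 possible pairs
toy_ratings = [("a1", "t1"), ("a1", "t2"), ("a2", "t3"), ("a3", "t4")]
toy_sparsity = sparsity(toy_ratings)
print(round(toy_sparsity, 4))
```

When every track is rated exactly once, the number of ratings equals the number of tracks and the formula reduces to 1 - 1/(number of artists). That is consistent with the value reported above: 1 - 1/2322 is roughly 0.9996.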

In [11]:
#Filter out artists who only appear once in dataset 
dataset = dataset.groupby('artist_id').filter(lambda x: len(x)>1)

In [12]:
#Check sparsity of matrix again 
numratings = len(dataset['ratings'])
numusers = len(dataset['artist_id'].unique())
numitems = len(dataset['track_id'].unique())

sparse = 1 - (numratings / (numusers*numitems))
sparse

0.9990326457248797

In [13]:
#print number of unique artists 
dataset['artist_id'].nunique()

1043

While the sparsity of our matrix is still very high (about 99.90%), I have reduced the number of artists that appear in the playlist from 2322 to 1043. This should help with providing meaningful recommendations once our recommendation system is built. The matrix remains very sparse because, in this particular playlist, there is only one artist listed per track. Unfortunately, the Spotify API does not have other proxies that could be used for artist_id (e.g. songwriters, producers) that would help with the sparsity of the matrix. In the meantime, I will run the model with this smaller list of artists, and will potentially re-evaluate the sparsity value later on in the analysis.

In [14]:
#Check for NA values 
dataset.isna().sum()

artist_id    0
track_id     0
ratings      0
dtype: int64

### Baseline Model

In [15]:
#Instantiate reader and data 
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(dataset, reader)

In [16]:
#Train test split with test size of 25% 
trainset, testset = train_test_split(data, test_size=.25)

In [17]:
# Print number of artists and tracks for the trainset 
print('Number of artists in train set : ', trainset.n_users, '\n')
print('Number of tracks in train set : ', trainset.n_items, '\n')

Number of artists in train set :  1017 

Number of tracks in train set :  4274 



In [18]:
#Instantiate a baseline SVD model with default hyperparameters 
baseline = SVD(random_state=42)

In [19]:
#Fit model on the trainset 
baseline.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fdb4bb9d8d0>

In [20]:
#Predict on the test set 
baselinepreds = baseline.test(testset)

In [21]:
#Check RMSE and MAE results 
accuracy.rmse(baselinepreds)
accuracy.mae(baselinepreds)

RMSE: 0.9094
MAE:  0.7187


0.7187116770014621

In [22]:
#Run 3-fold cross validation on the data and print results 
cv_baseline = cross_validate(baseline, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.8981  0.8927  0.9273  0.9060  0.0152  
MAE (testset)     0.7053  0.7065  0.7274  0.7131  0.0101  
Fit time          0.27    0.22    0.20    0.23    0.03    
Test time         0.07    0.01    0.01    0.03    0.03    


In [23]:
# Print out the RMSE score for each fold 
for i in cv_baseline.items():
    print(i)

('test_rmse', array([0.89812095, 0.892668  , 0.92725151]))
('test_mae', array([0.70528602, 0.70650744, 0.72739868]))
('fit_time', (0.2715928554534912, 0.22281599044799805, 0.2016618251800537))
('test_time', (0.06548309326171875, 0.009864091873168945, 0.009476900100708008))


In [24]:
#Find the average test RMSE from the 3-Fold cross-validation
np.mean(cv_baseline['test_rmse'])

0.9060134845722022

The mean RMSE across the 3-fold cross-validation of the baseline model is 0.91; this is the benchmark that the tuned models below will be compared against.
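For reference, here is a minimal sketch of how RMSE and MAE are computed, together with one way to put the 0.91 figure in context: the error of always predicting the mean rating. The `rmse` and `mae` helpers are illustrative and are not Surprise functions. The rating counts are copied from the `value_counts()` output earlier in the notebook, so they describe the full dataframe before single-track artists were removed.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Rating counts copied from the value_counts() output above
rating_counts = {3: 2668, 2: 1905, 4: 1314, 1: 888, 5: 253}
ratings = [rating for rating, count in rating_counts.items() for _ in range(count)]

# A naive predictor that returns the mean rating for every track
mean_rating = sum(ratings) / len(ratings)
mean_only = [mean_rating] * len(ratings)

mean_rmse = rmse(ratings, mean_only)
mean_mae = mae(ratings, mean_only)
print(f"mean rating: {mean_rating:.3f}")
print(f"RMSE of always predicting the mean: {mean_rmse:.3f}")
print(f"MAE of always predicting the mean:  {mean_mae:.3f}")
```

The RMSE of the mean-only predictor equals the standard deviation of the ratings, so it gives a rough yardstick for how much a model adds beyond the average rating.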

### Model 1

The first model I will explore is a tuned SVD model. I will include several hyperparameters, through GridSearch, to see if our results improve.
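Before running the search, it helps to know how many models it will fit. The sketch below enumerates the grid defined in the next cell using only the standard library. The fold count of five is an assumption: I believe it is Surprise's default when `cv` is not passed, but that should be checked against the installed version.

```python
from itertools import product

# Same grid as the one passed to GridSearchCV in the next cell
parameters = {'n_factors': [25, 50, 75, 100],
              'reg_all': [0.01, 0.02, 0.03, 0.04, 0.05],
              'n_epochs': [20, 30, 40, 50, 60],
              'lr_all': [.005, .01, .05, .1]}

# Every combination of one value per hyperparameter
combinations = [dict(zip(parameters, values))
                for values in product(*parameters.values())]

n_folds = 5  # assumption: number of cross-validation folds used by the search
total_fits = len(combinations) * n_folds
print(len(combinations), "combinations,", total_fits, "model fits")
```

With 4 x 5 x 5 x 4 = 400 combinations, the search is fairly large for a dataset of this size, which is why `n_jobs=-1` is useful here.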

In [25]:
#Set parameters for GridSearch on SVD model 
parameters = {'n_factors': [25, 50, 75, 100],
             'reg_all': [0.01, 0.02, 0.03, 0.04, 0.05],
             'n_epochs': [20, 30, 40, 50, 60],
             'lr_all': [.005, .01, .05, .1]}
gridsvd = GridSearchCV(SVD, param_grid=parameters, n_jobs=-1)

In [26]:
#Fit SVD model on the data
gridsvd.fit(data)

In [27]:
#Print best score and best parameters from the GridSearch 
print(gridsvd.best_score)
print(gridsvd.best_params)

{'rmse': 0.8560472745942198, 'mae': 0.6690316838689716}
{'rmse': {'n_factors': 25, 'reg_all': 0.01, 'n_epochs': 30, 'lr_all': 0.05}, 'mae': {'n_factors': 25, 'reg_all': 0.01, 'n_epochs': 40, 'lr_all': 0.05}}
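The best parameters are reported separately for each metric. A small sketch, using the dictionary printed above as a literal, shows how to see where the two sets differ and how the RMSE set could be reused without retyping it. The commented `SVD(**...)` line assumes Surprise is installed and is only a suggestion.

```python
# Copied from the gridsvd.best_params output above
best_params = {
    'rmse': {'n_factors': 25, 'reg_all': 0.01, 'n_epochs': 30, 'lr_all': 0.05},
    'mae': {'n_factors': 25, 'reg_all': 0.01, 'n_epochs': 40, 'lr_all': 0.05},
}

# Hyperparameters whose best value depends on the metric: (rmse value, mae value)
differing = {name: (best_params['rmse'][name], best_params['mae'][name])
             for name in best_params['rmse']
             if best_params['rmse'][name] != best_params['mae'][name]}
print(differing)  # {'n_epochs': (30, 40)}

# Since the models are compared on RMSE, its parameter set is the natural one to reuse
rmse_kwargs = best_params['rmse']
# With Surprise available: svdtuned = SVD(**rmse_kwargs, random_state=42)
```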


In [28]:
#Reinstantiate the model with the GridSearch parameters (n_epochs is set to 60 here; the search above selected 30 for RMSE and 40 for MAE) 
svdtuned = SVD(n_factors=25,
               reg_all=0.01,
               n_epochs=60,
               lr_all=0.05)

In [29]:
#Fit and predict the model 
svdtuned.fit(trainset)
svdpreds = svdtuned.test(testset)

In [30]:
#Print RMSE and MAE results 
accuracy.rmse(svdpreds)
accuracy.mae(svdpreds)

RMSE: 0.8791
MAE:  0.6858


0.6857739768059207

In [31]:
#Perform 3-Fold cross validation for SVD tuned model
cv_svd_tuned = cross_validate(svdtuned, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.8755  0.8766  0.8613  0.8711  0.0069  
MAE (testset)     0.6826  0.6806  0.6744  0.6792  0.0035  
Fit time          0.31    0.32    0.30    0.31    0.01    
Test time         0.01    0.01    0.01    0.01    0.00    


In [32]:
#Display the results for all 3-folds 
for i in cv_svd_tuned.items():
    print(i)

('test_rmse', array([0.87548424, 0.87659582, 0.86133352]))
('test_mae', array([0.68259161, 0.68055132, 0.67440392]))
('fit_time', (0.3076200485229492, 0.31559205055236816, 0.29521894454956055))
('test_time', (0.010340213775634766, 0.013015031814575195, 0.011862993240356445))


In [33]:
# Print out the average RMSE score for the test set
np.mean(cv_svd_tuned['test_rmse'])

0.8711378616900826

The tuned SVD model improves on the baseline on every metric: test RMSE falls from 0.9094 to 0.8791, MAE from 0.7187 to 0.6858, and the mean cross-validation RMSE from 0.9060 to 0.8711.

### Model 2

The next model I will run is the KNN Basic Model. First, I will instantiate several KNN parameters which will then be used in all of the remaining KNN models that I will run.

In [34]:
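#Set parameters to be used in KNN models 
#Note: Surprise reads similarity settings only from a nested 'sim_options' entry, so the flat
#'name', 'user_based' and 'min_support' keys below are not applied to the similarity measure
#(the GridSearch logs that follow report the default msd similarity for every fit)
knn_params = {'name': ['cosine', 'pearson'],
              'user_based': [True, False],
              'min_support': [True, False],
              'min_k': [1, 2]}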

In [35]:
#Apply GridSearch to the KNN Basic model to identify the best parameters
gsknnbasic = GridSearchCV(KNNBasic, knn_params, measures=['rmse', 'mae'], cv=3)
gsknnbasic.fit(data)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computi

In [36]:
#Display the best scores and parameters from GridSearch
print(gsknnbasic.best_score)
print(gsknnbasic.best_params)

{'rmse': 0.9940855317512179, 'mae': 0.7881871004959159}
{'rmse': {'name': 'cosine', 'user_based': True, 'min_support': True, 'min_k': 1}, 'mae': {'name': 'cosine', 'user_based': True, 'min_support': True, 'min_k': 1}}
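The GridSearch log reports the msd similarity for every fit, which suggests that the `name`, `user_based` and `min_support` entries never reached the similarity computation. As far as I understand Surprise's API, similarity settings have to be nested under a `sim_options` key in the parameter grid, while algorithm arguments such as `min_k` stay at the top level. The sketch below shows that layout and lists the combinations it describes. The `expand` helper and the integer `min_support` values are illustrative assumptions.

```python
from itertools import product

# Nested layout: similarity settings sit under 'sim_options',
# algorithm arguments such as min_k stay at the top level.
knn_params_nested = {
    'sim_options': {'name': ['cosine', 'pearson'],
                    'user_based': [True, False],
                    'min_support': [1, 2]},  # illustrative integer thresholds
    'min_k': [1, 2],
}


def expand(grid):
    """List every parameter combination described by the nested grid (illustrative helper)."""
    sim = grid['sim_options']
    sim_combos = [dict(zip(sim, values)) for values in product(*sim.values())]
    return [{'sim_options': s, 'min_k': k} for s in sim_combos for k in grid['min_k']]


combos = expand(knn_params_nested)
print(len(combos))  # 16
print(combos[0])

# With Surprise installed, the nested grid would be passed unchanged, e.g.:
# GridSearchCV(KNNBasic, knn_params_nested, measures=['rmse', 'mae'], cv=3)
```

If the search is rerun with this layout, the log should show cosine and pearson similarity matrices being computed, which is a quick way to confirm the settings are being applied.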


In [37]:
#Reinstantiate the model with the best parameters from GridSearch 
knnbasic_tuned = KNNBasic(min_k=1,
                          sim_options={'name': 'cosine',
                                       'user_based': True,
                                       'min_support': 1})

In [38]:
#Fit on the train set and predict on the test set 
knnbasic_tuned.fit(trainset)
knnbpreds = knnbasic_tuned.test(testset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


In [39]:
#Print RMSE and MAE results 
accuracy.rmse(knnbpreds)
accuracy.mae(knnbpreds)

RMSE: 0.9949
MAE:  0.7915


0.7914645040713202

In [40]:
#Conduct cross validation for the KNNBasic tuned model 
cv_knn_basic = cross_validate(knnbasic_tuned, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.9827  0.9986  0.9997  0.9937  0.0078  
MAE (testset)     0.7715  0.7955  0.7975  0.7882  0.0118  
Fit time          0.03    0.02    0.02    0.02    0.01    
Test time         0.01    0.01    0.01    0.01    0.00    


In [41]:
# Print out results from the cross-validation 
for i in cv_knn_basic.items():
    print(i)

('test_rmse', array([0.98269587, 0.99860852, 0.99972444]))
('test_mae', array([0.77149622, 0.79552529, 0.79752569]))
('fit_time', (0.03170418739318848, 0.02056884765625, 0.018851041793823242))
('test_time', (0.014935970306396484, 0.012057065963745117, 0.014061927795410156))


In [42]:
# Print out the average RMSE score for the test set
np.mean(cv_knn_basic['test_rmse'])

0.9936762742205855

The KNNBasic model has a higher RMSE, MAE and cross-validation RMSE score in comparison to the baseline model. 

### Model 3

The following model I will run is the KNNBaseline model. I will apply the same GridSearch parameters that I have instantiated above.

In [43]:
#Apply KNN GridSearch parameters on the KNNBaseline model 
gsknnbaseline = GridSearchCV(KNNBaseline, knn_params, measures=['rmse', 'mae'], cv=3)
gsknnbaseline.fit(data)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matr

In [44]:
#Display the best score and the best parameters 
print(gsknnbaseline.best_score)
print(gsknnbaseline.best_params)

{'rmse': 0.9288126434655078, 'mae': 0.7296010988599243}
{'rmse': {'name': 'cosine', 'user_based': True, 'min_support': True, 'min_k': 1}, 'mae': {'name': 'cosine', 'user_based': True, 'min_support': True, 'min_k': 1}}


In [45]:
#Reinstantiate the model with the best parameters from GridSearch 
knnbaseline_tuned = KNNBaseline(min_k=1,
                                sim_options={'name': 'cosine',
                                             'user_based': True,
                                             'min_support': 1})

In [46]:
#Fit the trainset and predict on the test set 
knnbaseline_tuned.fit(trainset)
knnbaselinepreds = knnbaseline_tuned.test(testset)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.


In [47]:
#Print the RMSE and MAE scores 
accuracy.rmse(knnbaselinepreds)
accuracy.mae(knnbaselinepreds)

RMSE: 0.9267
MAE:  0.7299


0.729935955417388

In [48]:
#Perform 3 fold cross validation 
cv_knn_baseline = cross_validate(knnbaseline_tuned, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBaseline on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.9276  0.9141  0.9355  0.9257  0.0088  
MAE (testset)     0.7300  0.7201  0.7326  0.7275  0.0054  
Fit time          0.04    0.03    0.04    0.04    0.00    
Test time         0.01    0.02    0.01    0.02    0.00    


In [49]:
#Show the mean RMSE score for the test set 
np.mean(cv_knn_baseline['test_rmse'])

0.925736296606003

Again, the KNNBaseline model performs worse than the baseline model, with a mean cross-validation RMSE of 0.9257 compared to 0.9060.

### Model 4

Our final model will look at the KNNWithMeans algorithm, and apply a GridSearch similar to the KNN models above to tune our hyperparameters further.

In [50]:
#Apply GridSearch to the KNNWithMeans model 
gsknnWM = GridSearchCV(KNNWithMeans, knn_params, measures=['rmse', 'mae'], cv=3)
gsknnWM.fit(data)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computi

In [51]:
#Display the best score and best parameters from GridSearch 
print(gsknnWM.best_score)
print(gsknnWM.best_params)

{'rmse': 0.9935392639495347, 'mae': 0.7877601133795539}
{'rmse': {'name': 'cosine', 'user_based': True, 'min_support': True, 'min_k': 1}, 'mae': {'name': 'cosine', 'user_based': True, 'min_support': True, 'min_k': 1}}


In [52]:
#Reinstantiate the model with the best parameters 
knnwm_tuned = KNNWithMeans(min_k=1,
                           sim_options={'name': 'cosine',
                                        'user_based': True,
                                        'min_support': 1})

In [53]:
#Fit on the trainset, predict on the testset 
knnwm_tuned.fit(trainset)
knnwmpreds = knnwm_tuned.test(testset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


In [54]:
#Print RMSE and MAE results
accuracy.rmse(knnwmpreds)
accuracy.mae(knnwmpreds)

RMSE: 0.9949
MAE:  0.7915


0.7914645040713202

In [55]:
#Perform 3-Fold cross validation on KNNWithMeans model 
cv_knn_wm = cross_validate(knnwm_tuned, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNWithMeans on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.9935  0.9861  1.0009  0.9935  0.0060  
MAE (testset)     0.7844  0.7836  0.7950  0.7877  0.0052  
Fit time          0.04    0.04    0.03    0.04    0.00    
Test time         0.01    0.02    0.01    0.01    0.00    


In [56]:
#Print the average RMSE score for the test set 
np.mean(cv_knn_wm['test_rmse'])

0.9934938042623603

Again, I see that the KNNWithMeans model performs worse than the baseline, with a higher mean cross-validation RMSE (0.9935 compared to 0.9060).

## All Results

In [57]:
#Create a dictionary for each model's results 
baselineresult = {'model': 'baseline','RMSE': accuracy.rmse(baselinepreds), 'MAE': accuracy.mae(baselinepreds), 'CV': np.mean(cv_baseline['test_rmse'])}
svdresult = {'model':'svd', 'RMSE': accuracy.rmse(svdpreds), 'MAE': accuracy.mae(svdpreds), 'CV': np.mean(cv_svd_tuned['test_rmse'])}
knnbasicresult = {'model':'knnbasic','RMSE': accuracy.rmse(knnbpreds), 'MAE': accuracy.mae(knnbpreds), 'CV': np.mean(cv_knn_basic['test_rmse'])}
knnbaselineresult = {'model':'knnbaseline','RMSE': accuracy.rmse(knnbaselinepreds), 'MAE': accuracy.mae(knnbaselinepreds), 'CV': np.mean(cv_knn_baseline['test_rmse'])}
knnwmresult = {'model':'knnwm','RMSE': accuracy.rmse(knnwmpreds), 'MAE': accuracy.mae(knnwmpreds), 'CV': np.mean(cv_knn_wm['test_rmse'])}

RMSE: 0.9094
MAE:  0.7187
RMSE: 0.8791
MAE:  0.6858
RMSE: 0.9949
MAE:  0.7915
RMSE: 0.9267
MAE:  0.7299
RMSE: 0.9949
MAE:  0.7915


In [58]:
#Combine all the results into a list 
result_list = [baselineresult, svdresult, knnbasicresult, knnbaselineresult, knnwmresult]

In [59]:
#Transform the results lists into a DataFrame 
df_results_updated = pd.DataFrame.from_dict(result_list, orient='columns')
df_results_updated = df_results_updated.set_index('model')

In [60]:
#Display the results for all of the models 
df_results_updated

model,RMSE,MAE,CV
baseline,0.909391,0.718712,0.906013
svd,0.879144,0.685774,0.871138
knnbasic,0.994897,0.791465,0.993676
knnbaseline,0.926693,0.729936,0.925736
knnwm,0.994897,0.791465,0.993494


Our results show that only the SVD model performs better than the baseline, in terms of RMSE, MAE, and CV. Going forward, this is the model we will use for the recommendation system, as the KNN models in this circumstance all produce higher error results.
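The comparison can be made explicit by ranking the models on their mean cross-validation RMSE. The sketch below uses the values from the results table as literals; the relative improvement figure is derived from those numbers and is not a new measurement.

```python
# Values copied from the results table above
results = {
    'baseline':    {'RMSE': 0.909391, 'MAE': 0.718712, 'CV': 0.906013},
    'svd':         {'RMSE': 0.879144, 'MAE': 0.685774, 'CV': 0.871138},
    'knnbasic':    {'RMSE': 0.994897, 'MAE': 0.791465, 'CV': 0.993676},
    'knnbaseline': {'RMSE': 0.926693, 'MAE': 0.729936, 'CV': 0.925736},
    'knnwm':       {'RMSE': 0.994897, 'MAE': 0.791465, 'CV': 0.993494},
}

# Rank models from lowest to highest mean cross-validation RMSE
ranked = sorted(results, key=lambda model: results[model]['CV'])
best = ranked[0]

# Relative reduction in CV RMSE of the best model compared to the baseline
improvement = (results['baseline']['CV'] - results[best]['CV']) / results['baseline']['CV']

for model in ranked:
    print(f"{model:<12} CV RMSE = {results[model]['CV']:.4f}")
print(f"{best} lowers CV RMSE by {improvement:.1%} relative to the baseline")
```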

### Generating New Ratings

In [61]:
#Define function that can generate new user song ratings 
def song_rater(song_df,num, genre=None):
    #Create new artist with artist_id = 1000
    artistid = 1000
    
    #Create an empty list of ratings 
    rating_list = []
    
    #For all number of ratings, provide a random song sample within the specified genre for the user to rate 
    while num > 0:
        if genre:
            song = song_df[song_df['genrecategory'].str.contains(genre)].sample(1)
        else:
            song = song_df.sample(1)
        print(song)
    
        #Provide user with a prompt to rate the song, then store the artist_id, track_id, rating, track name,
        #artist, and genre category, and append the results to the rating_list 
        rating = input('How do you rate this song on a scale of 1-5, press n if you have not listened to it :\n')
        if rating == 'n':
            continue
        else:
            rating_one_song = {'artist_id':artistid, 'track_id':song['track_id'].values[0], 
                               'ratings':rating,'track_name':song['track_name'],
                                'artist':song['artist'].values[0], 
                               'genrecategory':song['genrecategory'].values[0]}
            rating_list.append(rating_one_song) 
            num -= 1
    return rating_list
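Because `input()` returns text, the function above stores each rating as a string and accepts any value, including ones outside the 1-5 scale (a rating of 5.2 is entered later in this notebook). A minimal sketch of a validation step that could be applied to the raw input is shown below. `parse_rating` is a hypothetical helper, not part of the notebook.

```python
def parse_rating(text, low=1.0, high=5.0):
    """Return the rating as a float, or None if `text` is not a number within [low, high]."""
    try:
        value = float(text)
    except ValueError:
        return None
    return value if low <= value <= high else None


# A few example inputs, including one that is out of range and one that skips the song
for raw in ['3.2', '5.2', 'n', '4']:
    print(repr(raw), '->', parse_rating(raw))
```

Inside `song_rater`, a `None` result could be treated the same way as the `'n'` answer, so that the loop simply asks about another song.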

In [62]:
#Select relevant columns for new dataframe 
dfnew = df[['artist_id', 'track_id', 'ratings', 'track_name', 'artist','genrecategory']]
dfnew.head()

Unnamed: 0,artist_id,track_id,ratings,track_name,artist,genrecategory
0,6jJ0s89eD6GaHleKKya26X,4lCv7b86sLynZbXhfScfm2,4,Firework,Katy Perry,dance
1,6jJ0s89eD6GaHleKKya26X,6tS3XVuOyu10897O3ae7bi,4,California Gurls,Katy Perry,dance
2,6jJ0s89eD6GaHleKKya26X,455AfCsOhhLPRc68sE01D8,4,Last Friday Night (T.G.I.F.),Katy Perry,dance
3,6jJ0s89eD6GaHleKKya26X,14iN3o8ptQ8cFVZTEmyQRV,4,I Kissed A Girl,Katy Perry,dance
4,6jJ0s89eD6GaHleKKya26X,1nZzRJbFvCEct3uzu04ZoL,4,Part Of Me,Katy Perry,dance


In [63]:
#Apply the song rater function to our new dataframe to generate new ratings 
artistrating = song_rater(dfnew, 3, 'dance')

                   artist_id                track_id  ratings track_name  \
5153  0N0d3kjwdY2h7UVuTdJGfp  0lcamYchjAoGYH7Gee8kfK        2    Bad Boy   

       artist genrecategory  
5153  Cascada         dance  
How do you rate this song on a scale of 1-5, press n if you have not listened to it :
3.2
                  artist_id                track_id  ratings     track_name  \
845  5MouCg6ta7zAxsfMEbc1uh  455sO7hiKp56nQSqsPPK73        2  Promised Land   

    artist genrecategory  
845    OMI         dance  
How do you rate this song on a scale of 1-5, press n if you have not listened to it :
1.2
                   artist_id                track_id  ratings track_name  \
3273  4SqTiwOEdYrNayaGMkc7ia  6Greo7ncQmcs7XcjiEK29x        3   Treasure   

     artist genrecategory  
3273   LÉON         dance  
How do you rate this song on a scale of 1-5, press n if you have not listened to it :
4.5


In [64]:
#Investigate the response of the new ratings 
artistrating

[{'artist_id': 1000,
  'track_id': '0lcamYchjAoGYH7Gee8kfK',
  'ratings': '3.2',
  'track_name': 5153    Bad Boy
  Name: track_name, dtype: object,
  'artist': 'Cascada',
  'genrecategory': 'dance'},
 {'artist_id': 1000,
  'track_id': '455sO7hiKp56nQSqsPPK73',
  'ratings': '1.2',
  'track_name': 845    Promised Land
  Name: track_name, dtype: object,
  'artist': 'OMI',
  'genrecategory': 'dance'},
 {'artist_id': 1000,
  'track_id': '6Greo7ncQmcs7XcjiEK29x',
  'ratings': '4.5',
  'track_name': 3273    Treasure
  Name: track_name, dtype: object,
  'artist': 'LÉON',
  'genrecategory': 'dance'}]

In [65]:
#Add new ratings to our existing DataFrame
new_ratings_df = pd.concat([dataset, pd.DataFrame(artistrating)], ignore_index=True, sort=False)

In [66]:
#Drop certain columns so our dataset is ready to be put into surprise 
new_ratings_df = new_ratings_df.drop(['track_name', 'artist', 'genrecategory'], axis=1)
new_ratings_df.head()

Unnamed: 0,artist_id,track_id,ratings
0,6jJ0s89eD6GaHleKKya26X,4lCv7b86sLynZbXhfScfm2,4
1,6jJ0s89eD6GaHleKKya26X,6tS3XVuOyu10897O3ae7bi,4
2,6jJ0s89eD6GaHleKKya26X,455AfCsOhhLPRc68sE01D8,4
3,6jJ0s89eD6GaHleKKya26X,14iN3o8ptQ8cFVZTEmyQRV,4
4,6jJ0s89eD6GaHleKKya26X,1nZzRJbFvCEct3uzu04ZoL,4


### Make Predictions with New Artist Ratings 

First, we will redo the same modeling process as above in order to find predictions for the above songs.

In [67]:
#Reinstantiate the dataset object with our new ratings dataframe 
new_data = Dataset.load_from_df(new_ratings_df,reader)

In [68]:
#Rerun the SVD model with the same hyperparameters as before
svd_ = SVD(n_factors= 25, reg_all=0.01, n_epochs=60, lr_all=0.05)
#Fit the new model
svd_.fit(new_data.build_full_trainset())

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fdb4fcf32e8>

In [69]:
#Make predictions for the user based on the artist id that was generated 
list_of_tracks = []
for t_id in new_ratings_df['track_id'].unique():
    list_of_tracks.append( (t_id,svd_.predict(1000,t_id)[3]))

In [70]:
# order the predictions from highest to lowest rated
ranked_tracks = sorted(list_of_tracks, key=lambda x:x[1], reverse=True)

In [71]:
# Create a function to return the top n recommended tracks for the user 
def recommended_tracks(ranked_tracks, track_title_df, n):
    #Look up the track name and artist for each of the n highest-ranked track ids
    for idx, rec in enumerate(ranked_tracks[:n]):
        track_name = track_title_df.loc[track_title_df['track_id'] == rec[0]][['track_name', 'artist']].values
        print('Recommendation # ', idx+1, ': ', track_name, '\n')

recommended_tracks(ranked_tracks, dfnew, 5)

Recommendation #  1 :  [["i hate u, i love u (feat. olivia o'brien)" 'gnash']] 

Recommendation #  2 :  [['Call Me Maybe' 'Carly Rae Jepsen']] 

Recommendation #  3 :  [['Mr. Blue Sky' 'Electric Light Orchestra']] 

Recommendation #  4 :  [['Treasure' 'LÉON']] 

Recommendation #  5 :  [['8TEEN' 'Khalid']] 
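Note that 'Treasure' by LÉON, which the new user rated 4.5 above, appears among the recommendations because the ranking covers every track in the dataframe. A minimal sketch of excluding tracks the user has already rated is shown below. `top_n_unrated` and the toy data are illustrative assumptions, not part of the notebook.

```python
def top_n_unrated(ranked_tracks, rated_track_ids, n):
    """Return the first n (track_id, estimate) pairs whose track the user has not rated."""
    rated = set(rated_track_ids)
    unrated = [(track_id, est) for track_id, est in ranked_tracks if track_id not in rated]
    return unrated[:n]


# Toy data: (track_id, estimated rating) pairs already sorted from highest to lowest
ranked = [('t1', 4.8), ('t2', 4.6), ('t3', 4.5), ('t4', 4.1)]
already_rated = ['t2']

top_two = top_n_unrated(ranked, already_rated, 2)
print(top_two)  # [('t1', 4.8), ('t3', 4.5)]
```

In the notebook, the rated ids could be collected with `[r['track_id'] for r in artistrating]` and the filtered list passed to `recommended_tracks`.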



# Models Using Predicted Ratings

By tuning an XGBoost classification model, I was able to produce predicted ratings that can serve as a proxy for the original ratings. I will now repeat the steps that were conducted above for the SVD model on these predicted ratings, to see how this model performs and whether the recommendation results differ.

In [72]:
#Import the predictions dataframe 
dfpreds = pd.read_csv('predictionsdf')

In [73]:
#Extract relevant columns to use in surprise 
preds = dfpreds[['artist_id', 'track_id', 'predicted ratings']]

In [74]:
#Re-instantiate reader and data 
reader = Reader(rating_scale=(1, 5))
preds_data = Dataset.load_from_df(preds, reader)

In [75]:
#Train test split 
trainset, testset = train_test_split(preds_data, test_size=.2)

In [76]:
#Set parameters for GridSearch on SVD model 
parameters = {'n_factors': [25, 50, 75, 100],
             'reg_all': [0.01, 0.02, 0.03, 0.04, 0.05],
             'n_epochs': [20, 30, 40, 50, 60],
             'lr_all': [.005, .01, .05, .1]}
gridsvd2 = GridSearchCV(SVD, param_grid=parameters, n_jobs=-1)

In [77]:
#Refit the model 
gridsvd2.fit(preds_data)

In [78]:
#Print best score and best parameters from the GridSearch 
print(gridsvd2.best_score)
print(gridsvd2.best_params)

In [79]:
#Reinstantiate the model with the best parameters from GridSearch 
svdtuned2 = SVD(n_factors=25,
               reg_all=0.01,
               n_epochs=30,
               lr_all=0.05)

In [80]:
#Fit and predict the model 
svdtuned2.fit(trainset)
svdpreds2 = svdtuned2.test(testset)

In [81]:
#Print RMSE and MAE results 
accuracy.rmse(svdpreds2)
accuracy.mae(svdpreds2)

RMSE: 0.2428
MAE:  0.1829


0.18286658843524248

The RMSE is approximately 0.24 and the MAE is 0.18, significantly lower errors than those of the SVD model trained on the integer ratings. The next step is to see which tracks our recommendation system provides using these new ratings.

### Generating New Ratings for New Model 

In [82]:
#Look at the predictions dataframe again 
dfpreds.head(2)

Unnamed: 0.1,Unnamed: 0,is_explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,...,genre_rap,genre_rock,genre_soul,predicted ratings,artist,artist_id,album,track_name,track_id,genrecategory
0,0,0,0.638,0.826,8,-4.968,1,0.0479,0.139,0.0,...,0,0,0,2.879912,Katy Perry,6jJ0s89eD6GaHleKKya26X,Katy Perry - Teenage Dream: The Complete Confe...,Firework,4lCv7b86sLynZbXhfScfm2,dance
1,1,0,0.791,0.754,0,-3.729,1,0.0569,0.00446,0.0,...,0,0,0,2.969021,Katy Perry,6jJ0s89eD6GaHleKKya26X,Katy Perry - Teenage Dream: The Complete Confe...,California Gurls,6tS3XVuOyu10897O3ae7bi,dance


In [83]:
#Create a df that contains only certain columns from our original dataframe 
new_preds_rating = dfpreds[['artist_id', 'track_id', 'predicted ratings', 'track_name', 'artist','genrecategory']]

In [84]:
#Apply the song rater function to our new dataframe to generate new ratings 
artistrating2 = song_rater(new_preds_rating, 3, 'country')

                   artist_id                track_id  predicted ratings  \
1917  4xFUf1FHVy696Q1JQZMTRj  05BgC2247XGi8ySwBzOO0o           2.606889   

     track_name            artist genrecategory  
1917  Heartbeat  Carrie Underwood       country  
How do you rate this song on a scale of 1-5, press n if you have not listened to it :
3.4
                   artist_id                track_id  predicted ratings  \
4193  7z5WFjZAIYejWy0NI5lv4T  4W38RXuQNuoTSwVsQA1OGC            2.63332   

            track_name      artist genrecategory  
4193  Nothin' Like You  Dan + Shay       country  
How do you rate this song on a scale of 1-5, press n if you have not listened to it :
5.2
                   artist_id                track_id  predicted ratings  \
1915  4xFUf1FHVy696Q1JQZMTRj  10RQKVSr4rS0coExTmi4dF           2.641503   

                  track_name            artist genrecategory  
1915  Something in the Water  Carrie Underwood       country  
How do you rate this song on a scale of

In [85]:
# Investigate new ratings 
artistrating2

[{'artist_id': 1000,
  'track_id': '05BgC2247XGi8ySwBzOO0o',
  'ratings': '3.4',
  'track_name': 1917    Heartbeat
  Name: track_name, dtype: object,
  'artist': 'Carrie Underwood',
  'genrecategory': 'country'},
 {'artist_id': 1000,
  'track_id': '4W38RXuQNuoTSwVsQA1OGC',
  'ratings': '5.2',
  'track_name': 4193    Nothin' Like You
  Name: track_name, dtype: object,
  'artist': 'Dan + Shay',
  'genrecategory': 'country'},
 {'artist_id': 1000,
  'track_id': '10RQKVSr4rS0coExTmi4dF',
  'ratings': '1.1',
  'track_name': 1915    Something in the Water
  Name: track_name, dtype: object,
  'artist': 'Carrie Underwood',
  'genrecategory': 'country'}]

In [86]:
#Add new ratings to our existing DataFrame
#(DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
new_preds_df = pd.concat([dfpreds, pd.DataFrame(artistrating2)], ignore_index=True, sort=False)

In [87]:
#Drop certain columns so our dataset is ready to be put into surprise 
new_preds_df = new_preds_df[['artist_id', 'track_id', 'predicted ratings']]
new_preds_df.head()

Unnamed: 0,artist_id,track_id,predicted ratings
0,6jJ0s89eD6GaHleKKya26X,4lCv7b86sLynZbXhfScfm2,2.879912
1,6jJ0s89eD6GaHleKKya26X,6tS3XVuOyu10897O3ae7bi,2.969021
2,6jJ0s89eD6GaHleKKya26X,455AfCsOhhLPRc68sE01D8,2.768275
3,6jJ0s89eD6GaHleKKya26X,14iN3o8ptQ8cFVZTEmyQRV,3.096658
4,6jJ0s89eD6GaHleKKya26X,1nZzRJbFvCEct3uzu04ZoL,2.780582
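
One detail worth checking before handing this frame to surprise: as the `artistrating2` output above shows, `song_rater` stores each score under the key `ratings` and as a string, while the column kept here is `predicted ratings`. If the names are not aligned, the newly added rows would carry NaN in that column. A minimal sketch of aligning them, using toy data and a hypothetical helper name `ratings_to_frame` (neither is from the original notebook):

```python
import pandas as pd

def ratings_to_frame(rating_dicts, rating_col="predicted ratings"):
    """Turn song_rater-style dicts into rows whose rating sits in rating_col as a float."""
    rows = pd.DataFrame(rating_dicts)[["artist_id", "track_id", "ratings"]]
    rows = rows.rename(columns={"ratings": rating_col})
    rows[rating_col] = rows[rating_col].astype(float)
    return rows

# Toy stand-ins for dfpreds and artistrating2 (assumed shapes, not real data)
existing = pd.DataFrame({
    "artist_id": ["a1", "a2"],
    "track_id": ["t1", "t2"],
    "predicted ratings": [2.9, 3.1],
})
new_ratings = [
    {"artist_id": 1000, "track_id": "t1", "ratings": "3.4"},
    {"artist_id": 1000, "track_id": "t2", "ratings": "1.1"},
]

combined = pd.concat([existing, ratings_to_frame(new_ratings)], ignore_index=True)
print(combined)
```

A quick `combined["predicted ratings"].isna().sum()` afterwards confirms whether any of the user's ratings were lost along the way.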


### Making Recommendations on New Model

Now that these new ratings are generated, we can use our model to provide recommendations to a new user, using the ratings we have just collected.

In [88]:
#Reinstantiate the dataset object with our new ratings dataframe 
new_preds = Dataset.load_from_df(new_preds_df,reader)

In [89]:
#Predict a rating for every track for the new user (the artist_id 1000 recorded with the new ratings)
list_of_tracks = []
for t_id in new_preds_df['track_id'].unique():
    list_of_tracks.append((t_id, svd_.predict(1000, t_id).est))

In [90]:
# order the predictions from highest to lowest rated
ranked_tracks = sorted(list_of_tracks, key=lambda x:x[1], reverse=True)
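
Since only the first few entries of the ranking are used, a full sort is not strictly needed, and tracks the user has already rated can be skipped so they are not recommended back. A minimal sketch, assuming a list of `(track_id, estimated_rating)` tuples shaped like `list_of_tracks`; the helper `top_n_unrated` and the sample data are hypothetical:

```python
import heapq

def top_n_unrated(scored_tracks, rated_track_ids, n=5):
    """Return the n highest-scoring (track_id, score) pairs the user has not rated yet."""
    candidates = (pair for pair in scored_tracks if pair[0] not in rated_track_ids)
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])

# Illustrative scores, not real model output
scored = [("t1", 2.9), ("t2", 3.4), ("t3", 3.1), ("t4", 2.2)]
already_rated = {"t2"}
print(top_n_unrated(scored, already_rated, n=2))
```

The result is ordered from highest to lowest score, so it could be passed on in place of the fully sorted list.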

In [91]:
#Print recommended songs 
recommended_tracks(ranked_tracks, dfpreds, 5)

Recommendation #  1 :  [["i hate u, i love u (feat. olivia o'brien)" 'gnash']] 

Recommendation #  2 :  [['Call Me Maybe' 'Carly Rae Jepsen']] 

Recommendation #  3 :  [['Mr. Blue Sky' 'Electric Light Orchestra']] 

Recommendation #  4 :  [['Treasure' 'LÉON']] 

Recommendation #  5 :  [['8TEEN' 'Khalid']] 



# Extracting Predictions for all Artists and Tracks

Our final step is to create a DataFrame containing the estimated rating for every combination of artist_id and track_id. With this information in a single DataFrame, I can conduct some post-modeling EDA to determine whether, and how, trends in the estimated ratings differ from those in the original ratings.

In [92]:
#Create lists of unique artist_ids and track_ids
artistids = new_preds_df['artist_id'].unique()
trackids = new_preds_df['track_id'].unique()

In [93]:
#Create a list and append the artist_id, track_id, and estimated rating for every pair
estimations = []
for u in artistids:
    for m in trackids:
        predicted = svdtuned2.predict(u, m)
        estimations.append([u, m, predicted.est])

In [94]:
#Convert the list to a dataframe
df_estimated = pd.DataFrame(estimations)

In [95]:
#Rename columns of dataframe 
df_estimated = df_estimated.rename(columns={0: 'artist_id', 1: 'track_id', 2:'estimatedrating'})

In [96]:
#Investigate the first 5 rows of the new dataframe 
df_estimated.head()

Unnamed: 0,artist_id,track_id,estimatedrating
0,6jJ0s89eD6GaHleKKya26X,4lCv7b86sLynZbXhfScfm2,2.88082
1,6jJ0s89eD6GaHleKKya26X,6tS3XVuOyu10897O3ae7bi,2.884619
2,6jJ0s89eD6GaHleKKya26X,455AfCsOhhLPRc68sE01D8,2.884619
3,6jJ0s89eD6GaHleKKya26X,14iN3o8ptQ8cFVZTEmyQRV,3.07513
4,6jJ0s89eD6GaHleKKya26X,1nZzRJbFvCEct3uzu04ZoL,2.789683


In [98]:
#Investigate the shape of the new dataframe 
df_estimated.shape

(16265730, 3)

The cell below is commented out because of the size of the df_estimated dataframe (as seen above, it has 16,265,730 rows). The exported file is listed in my .gitignore, so it stays out of the repository while remaining available locally for later use.

In [97]:
# Export estimated ratings to a csv file
# df_estimated.to_csv('df_estimated')
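
Because the full artist-by-track grid has over 16 million rows, one option is to stream the predictions straight to disk instead of building the whole list in memory first. The sketch below uses only the standard library; `write_estimates` and `predict_fn` are hypothetical names, with `predict_fn` standing in for something like `lambda u, m: svdtuned2.predict(u, m).est`:

```python
import csv
import io
from itertools import product

def write_estimates(artist_ids, track_ids, predict_fn, out_file):
    """Write one (artist_id, track_id, estimatedrating) row per pair and return the row count."""
    writer = csv.writer(out_file)
    writer.writerow(["artist_id", "track_id", "estimatedrating"])
    count = 0
    for u, m in product(artist_ids, track_ids):
        writer.writerow([u, m, predict_fn(u, m)])
        count += 1
    return count

# Toy run with an in-memory buffer and a dummy predictor (assumed inputs)
buffer = io.StringIO()
n_rows = write_estimates(["a1", "a2"], ["t1", "t2", "t3"], lambda u, m: 3.0, buffer)
print(n_rows)
```

When writing to a real file, opening it with `newline=''` is what the `csv` module documentation recommends.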