# **Music Recommendation System**

# **Milestone 2**

Now that we have explored the data, let's apply different algorithms to build recommendation systems

**Note:** Use the shorter version of the data i.e. the data after the cutoffs as used in Milestone 1.

### **Popularity-Based Recommendation Systems**

Let's take the count and sum of play counts of the songs and build the popularity recommendation systems on the basis of the sum of play counts.

In [139]:
#Mounting the drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [140]:
import warnings #Used to ignore the warning given as output of the code.
warnings.filterwarnings('ignore')

import numpy as np # Basic libraries of python for numeric and dataframe computations.
import pandas as pd

import matplotlib.pyplot as plt #Basic library for data visualization.
import seaborn as sns #Slightly advanced library for data visualization

from sklearn.metrics.pairwise import cosine_similarity #To compute the cosine similarity between two vectors.
from collections import defaultdict #A dictionary output that does not raise a key error

from sklearn.metrics import mean_squared_error # A performance metrics in sklearn.

In [141]:
#importing the datasets
count_df = pd.read_csv('/content/drive/MyDrive/ColabNotebooks/capstone/count_data.csv')
song_df = pd.read_csv('/content/drive/MyDrive/ColabNotebooks/capstone/song_data.csv')


In [142]:
#loading the file with the filters and data treatment applied in milestone 1
df_final = pd.read_csv('/content/drive/MyDrive/ColabNotebooks/capstone/recSystems_df_final2.csv')

In [143]:
df_final.shape

(154377, 8)

In [144]:
df_final.nunique()

Unnamed: 0     154377
user_id          3476
song_id           695
play_count          5
title             704
release           484
artist_name       258
year               38
dtype: int64

In [145]:
df_final.drop(columns=['Unnamed: 0'], inplace=True)
df_final.drop_duplicates(inplace=True)
df_final.head()

Unnamed: 0,user_id,song_id,play_count,title,release,artist_name,year
0,39814,736,1,Stronger,Graduation,Kanye West,2007
2,57932,736,1,Stronger,Graduation,Kanye West,2007
4,19193,736,1,Stronger,Graduation,Kanye West,2007
6,3919,736,2,Stronger,Graduation,Kanye West,2007
8,51414,736,2,Stronger,Graduation,Kanye West,2007


In [146]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 148338 entries, 0 to 154375
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   user_id      148338 non-null  int64 
 1   song_id      148338 non-null  int64 
 2   play_count   148338 non-null  int64 
 3   title        148338 non-null  object
 4   release      148338 non-null  object
 5   artist_name  148338 non-null  object
 6   year         148338 non-null  int64 
dtypes: int64(4), object(3)
memory usage: 9.1+ MB


In [147]:
#Calculating average play_count
average_count = df_final.groupby('song_id').mean()['play_count'] #Hint: Use groupby function on the song_id column. 

#Calculating the frequency a song is played.
play_freq = df_final.groupby('song_id').count()['play_count']#Hint: Use groupby function on the song_id column

In [148]:
#Making a dataframe with the average_count and play_freq
final_play = pd.DataFrame({'avg_count':average_count, 'play_freq':play_freq})
final_play.tail()

Unnamed: 0_level_0,avg_count,play_freq
song_id,Unnamed: 1_level_1,Unnamed: 2_level_1
9939,1.662551,243
9942,2.47205,161
9960,1.622378,143
9981,1.88024,167
9989,1.341463,123


Now, let's create a function to find the top n songs for a recommendation based on the average play count of song. We can also add a threshold for a minimum number of playcounts for a song to be considered for recommendation.

In [149]:
#Build the function for finding top n songs
def top_n_songs(data, n, min_interaction=50):
    
    #Finding movies with minimum number of interactions
    recommendations = data[data['play_freq'] > min_interaction]
    
    #Sorting values w.r.t average rating 
    recommendations = recommendations.sort_values(by='avg_count', ascending=False)
    
    return recommendations.index[:n]

In [150]:
#Recommend top 10 songs using the function defined above
top10=list(top_n_songs(final_play, 10, 50))

In [151]:
top10songs = df_final.loc[df_final['song_id'].isin(top10)].groupby(['title','artist_name']).count()['play_count'].to_frame().sort_values(by='play_count', ascending=False)
top10songs

Unnamed: 0_level_0,Unnamed: 1_level_0,play_count
title,artist_name,Unnamed: 2_level_1
Secrets,OneRepublic,684
You're The One,Dwight Yoakam,411
Luvstruck,Southside Spinners,162
Greece 2000,Three Drives,161
Video Killed The Radio Star,The Buggles,132
Brave The Elements,Colossal,115
Transparency,White Denim,112
Victoria (LP Version),Old 97's,111
The Big Gundown,The Prodigy,103
Heaven Must Be Missing An Angel,Tavares,99


### **User User Similarity-Based Collaborative Filtering**

To build the user-user-similarity based and subsequent models we will use the "surprise" library.

In [152]:
#Install the surprise package using pip. Uncomment and run the below code to do the same. 
!pip install surprise 



In [153]:
# Import necessary libraries
# To compute the accuracy of models
from surprise import accuracy

# class is used to parse a file containing play_counts, data should be in structure - user; item ; play_count
from surprise.reader import Reader

# class for loading datasets
from surprise.dataset import Dataset

# for tuning model hyperparameters
from surprise.model_selection import GridSearchCV

# for splitting the data in train and test dataset
from surprise.model_selection import train_test_split

# for implementing similarity-based recommendation system
from surprise.prediction_algorithms.knns import KNNBasic

# for implementing matrix factorization based recommendation system
from surprise.prediction_algorithms.matrix_factorization import SVD

# for implementing KFold cross-validation
from surprise.model_selection import KFold

#For implementing clustering-based recommendation system
from surprise import CoClustering

### Some useful functions

The below is the function to calculate precision@k and recall@k, RMSE and F1_Score@k to evaluate the model performance.

**Think About It:** Which metric should be used for this problem to compare different models?

In [154]:
#The function to calulate the RMSE, precision@k, recall@k and F_1 score. 
def precision_recall_at_k(model, k=30, threshold=1.5):
    """Return precision and recall at k metrics for each user"""

    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    
    #Making predictions on the test data
    predictions=model.test(testset)
    
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])

        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We here set Precision to 0 when n_rec_k is 0.

        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0

        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We here set Recall to 0 when n_rel is 0.

        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0
    
    #Mean of all the predicted precisions are calculated.
    precision = round((sum(prec for prec in precisions.values()) / len(precisions)),3)
    #Mean of all the predicted recalls are calculated.
    recall = round((sum(rec for rec in recalls.values()) / len(recalls)),3)
    
    accuracy.rmse(predictions)
    print('Precision: ', precision) #Command to print the overall precision
    print('Recall: ', recall) #Command to print the overall recall
    print('F_1 score: ', round((2*precision*recall)/(precision+recall),3)) # Formula to compute the F-1 score.

**Think About It:** In the function precision_recall_at_k above the threshold value used is 1.5. How precision and recall are affected by changing the threshold? What is the intuition behind using the threshold value 1.5? 

Setting the threshold at 1.5 assumes that if a person plays a song more 
than once it's because they like it

---



In [155]:
# Instantiating Reader scale with expected rating scale 
reader = Reader(rating_scale=(0,5)) #use rating scale (0,5)

# loading the dataset
data = Dataset.load_from_df(df_final[['user_id', 'song_id', 'play_count']], reader) #Take only "user_id","song_id", and "play_count"

# splitting the data into train and test dataset
trainset, testset = train_test_split(data, test_size=0.4, random_state=42) # Take test_size=0.4

In [156]:
trainset.all_users()[-1]

3470

**Think About It:** How changing the test size would change the results and outputs?

Test sizes of .1-.4 would be common. If it were a very large data set we could probably do .1 but since it's a smaller one, .6 makes sense to avoid overfitting to our validation set.

In [157]:
0#Build the default user-user-similarity model
sim_options = {'name': 'cosine',
               'user_based':True}

#KNN algorithm is used to find desired similar items.
sim_user_user = KNNBasic(sim_options=sim_options, verbose=False, random_state=1) #use random_state=1 

# Train the algorithm on the trainset, and predict play_count for the testset
sim_user_user.fit(trainset)

# Let us compute precision@k, recall@k, and f_1 score with k =30.
precision_recall_at_k(sim_user_user) #Use sim_user_user model

RMSE: 1.0707
Precision:  0.403
Recall:  0.711
F_1 score:  0.514


**Observations and Insights:**


* We calculated the RMSE to check how far the overall predicted ratings are from the actual ratings.
* Intuition of Recall - We are getting a recall of 0.714, which means out of all the relevant songs 71% are recommended.
* Intuition of Precision - We are getting a precision of ~ 0.4, which means out of all the recommended songs, 40% are relevant (which is not too great!)
* The F_1 score of the baseline model is ~0.518. It indicates that mostly recommended songs were relevant and relevant movies were recommended but we can definitely seek to improve this.



In [158]:
data.df.loc[data.df['user_id']==6958].loc[data.df['song_id']==1671]

Unnamed: 0,user_id,song_id,play_count
24649,6958,1671,2


In [159]:
#predicting play_count for a sample user with a listened song.
p = sim_user_user.predict(6958, 1671, r_ui=2,verbose=True) #use user id 6958 and song_id 1671

user: 6958       item: 1671       r_ui = 2.00   est = 1.82   {'actual_k': 40, 'was_impossible': False}


In [160]:

data.df.loc[data.df['user_id']==6958].loc[data.df['song_id']==3232]

Unnamed: 0,user_id,song_id,play_count


In [161]:
#predicting play_count for a sample user with a song not-listened by the user.
sim_user_user.predict(6958,3232, verbose=True) #Use user_id 6958 and song_id 3232

user: 6958       item: 3232       r_ui = None   est = 1.78   {'actual_k': 34, 'was_impossible': False}


Prediction(uid=6958, iid=3232, r_ui=None, est=1.7791119099433816, details={'actual_k': 34, 'was_impossible': False})

In [162]:
df_final.loc[df_final['song_id']==1671].head(1)

Unnamed: 0,user_id,song_id,play_count,title,release,artist_name,year
24649,6958,1671,2,Sleeping In (Album),Give Up,Postal Service,2003


In [163]:
df_final.loc[df_final['song_id']==3232].head(1)

Unnamed: 0,user_id,song_id,play_count,title,release,artist_name,year
16732,27018,3232,2,Life In Technicolor ii,Viva La Vida - Prospekt's March Edition,Coldplay,2008


**Observations and Insights:**


* The model would appear to recommend both a song that a user likes (Sleeping In by the Postal Service), and one they are likely to enjoy by Coldplay which makes sense given the artists' styles. 
* The above output shows that the actual rating is not too far from the predicted rating for this user-item pair (it made it over the threshold at least!)



Now, let's try to tune the model and see if we can improve the model performance.

In [164]:
# setting up parameter grid to tune the hyperparameters
param_grid1 = {'k': [10, 20, 30], 'min_k': [3, 6, 9],
              'sim_options': {'name': ["cosine",'pearson',"pearson_baseline"],
                              'user_based': [True], "min_support":[2,4]}
              }

param_grid2 = {'k': [25, 30, 35, 40], 'min_k': [9, 12],
              'sim_options': {'name': ["cosine",'pearson',"pearson_baseline"],
                              'user_based': [True], "min_support":[2,4]}
              }              

param_grid3 = {'k': [25, 30, 35], 'min_k': [9, 12, 15],
              'sim_options': {'name': ["pearson_baseline"],
                              'user_based': [True], "min_support":[0,2]}
              }                  

# performing 3-fold cross validation to tune the hyperparameters
gs = GridSearchCV(KNNBasic, param_grid3, measures=['rmse'], cv=3, n_jobs=-1)

# fitting the data
gs.fit(data) #Use entire data for GridSearch

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])


1.0007398345275955
{'k': 35, 'min_k': 15, 'sim_options': {'name': 'pearson_baseline', 'user_based': True, 'min_support': 0}}


**RESULTS FROM PARAM GRID 1**

> 0.978482419730421
{'k': 30, 'min_k': 9, 'sim_options': {'name': 'pearson_baseline', 'user_based': True, 'min_support': 2}}

**RESULTS FROM PARAM GRID 2**

> 0.9999209108797182
{'k': 35, 'min_k': 12, 'sim_options': {'name': 'pearson_baseline', 'user_based': True, 'min_support': 2}}

**RESULTS FROM PARAM GRID 3**

> 0.9753996631042042
{'k': 35, 'min_k': 15, 'sim_options': {'name': 'pearson_baseline', 'user_based': True, 'min_support': 0}}




In [165]:
# Train the best model found in above gridsearch.
# using the optimal similarity measure for user-user based collaborative filtering
sim_options = {'name': 'pearson_baseline',
               'user_based': True,
               'min_support':0}

# creating an instance of KNNBasic with optimal hyperparameter values
sim_user_user_optimized = KNNBasic(sim_options=sim_options, k=35, min_k=15, random_state=1, verbose=False)

# training the algorithm on the trainset
sim_user_user_optimized.fit(trainset)

# Let us compute precision@k and recall@k also with k =10.
precision_recall_at_k(sim_user_user_optimized)

RMSE: 1.0053
Precision:  0.435
Recall:  0.771
F_1 score:  0.556


**Observations and Insights:_________**
* Tried a few different sim options and hyperparameters, managed to improve from the base model lowering the RMSE and increasing the F1 score slightly

In [166]:
#Predict the play count for a user who has listened to the song. Take user_id 6958, song_id 1671 and r_ui=2
sim_user_user_optimized.predict(6958, 1671, r_ui=2, verbose=True)

user: 6958       item: 1671       r_ui = 2.00   est = 1.91   {'actual_k': 30, 'was_impossible': False}


Prediction(uid=6958, iid=1671, r_ui=2, est=1.907135255647403, details={'actual_k': 30, 'was_impossible': False})

In [167]:
#Predict the play count for a song that is not listened by the user (with user_id 6958) -> 3232
sim_user_user_optimized.predict(6958, 3232, verbose=True)

user: 6958       item: 3232       r_ui = None   est = 1.70   {'was_impossible': True, 'reason': 'Not enough neighbors.'}


Prediction(uid=6958, iid=3232, r_ui=None, est=1.6952989820453472, details={'was_impossible': True, 'reason': 'Not enough neighbors.'})

**Observations and Insights:______________**
* The actual and predicted song play count for the song the user has listened to got closer with this optimized model. 

**Think About It:** Along with making predictions on listened and unknown songs can we get 5 nearest neighbors (most similar) to a certain user?

In [168]:
#Use inner id 0. 
sim_user_user_optimized.get_neighbors(0, k=5)

[2450, 834, 213, 1681, 86]

In [169]:
def top_n_songs_for_user(innerUserId, n=5):
  rawID=trainset.to_raw_uid(innerUserId)
  return df_final.loc[df_final['user_id']==rawID].groupby(['song_id', 'title','artist_name'])['play_count'].sum().to_frame().sort_values(by='play_count', ascending=False).head(n)
  

In [170]:
top_n_songs_for_user(0)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,play_count
song_id,title,artist_name,Unnamed: 3_level_1
2220,Sehr kosmisch,Harmonia,5
2091,Just Dance,Lady GaGa / Colby O'Donis,4
2210,Hey There Delilah,Plain White T's,4
630,'Till I Collapse,Eminem / Nate Dogg,4
3050,Terre Promise,O'Rosko Raricim,3


In [171]:
top_n_songs_for_user(575)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,play_count
song_id,title,artist_name,Unnamed: 3_level_1
7715,Savin' Me,Nickelback,8
4975,Until The End Of Time,Justin Timberlake duet with Beyonce,4
2276,Far Away (Album Version),Nickelback,4
8582,Use Somebody,Kings Of Leon,4
4448,Fireflies,Charttraxx Karaoke,4


In [172]:
top_n_songs_for_user(731)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,play_count
song_id,title,artist_name,Unnamed: 3_level_1
5728,My Name Is,Eminem,4
4298,On Repeat,LCD Soundsystem,3
1811,Ghosts 'n' Stuff (Original Instrumental Mix),Deadmau5,2
2615,She's Good For Business,MSTRKRFT,2
2747,Kut-Off,Skream,2


Below we will be implementing a function where the input parameters are - 

- data: a **song** dataset
- user_id: a user id **against which we want the recommendations**
- top_n: the **number of songs we want to recommend**
- algo: the algorithm we want to use **for predicting the play_count**
- The output of the function is a **set of top_n items** recommended for the given user_id based on the given algorithm

In [173]:
df_f=df_final[['user_id','song_id','play_count']]
df_f.head()
df_f.drop_duplicates(inplace=True)

In [174]:
def get_recommendations(data, user_id, top_n, algo):
    
    # creating an empty list to store the recommended product ids
    recommendations = []
    
    # creating an user item interactions matrix 
    user_item_interactions_matrix = data.pivot(index='user_id', columns=['song_id'], values='play_count')
    
    
    # extracting  song ids which the user_id has not played yet
    non_interacted_products = user_item_interactions_matrix.loc[user_id][user_item_interactions_matrix.loc[user_id].isnull()].index.tolist()
    
    # looping through each of the song ids which user_id has not interacted yet
    for item_id in non_interacted_products:
        
        # predicting the ratings for those non visited restaurant ids by this user
        est = algo.predict(user_id, item_id).est
        
        # appending the predicted ratings
        recommendations.append((item_id, est))

    # sorting the predicted ratings in descending order
    recommendations.sort(key=lambda x: x[1], reverse=True)

    return recommendations[:top_n] # returing top n highest predicted rating products for this user

In [175]:
#Make top 5 recommendations for user_id 6958 with a similarity-based recommendation engine.
recommendations =get_recommendations(df_f, 6958, 5, sim_user_user_optimized)

In [176]:
#Building the dataframe for above recommendations with columns "song_id" and "predicted_ratings"
pd.DataFrame(recommendations, columns=['song_id', 'predicted_ratings'])

Unnamed: 0,song_id,predicted_ratings
0,5398,2.573805
1,1348,2.519339
2,7496,2.499019
3,5943,2.492109
4,1286,2.480112


In [177]:
sFrame = pd.DataFrame()
songs = [614,8247,5943,7682,5531]
for s in songs:
  #sid = trainset.to_raw_iid(s)
  sFrame=sFrame.append(df_final.loc[df_final['song_id'] == s].head(1))

sFrame[['song_id','title','artist_name']]

Unnamed: 0,song_id,title,artist_name
8546,614,You're The One,Dwight Yoakam
53891,8247,Tighten Up,The Black Keys
92486,5943,You've Got The Love,Florence + The Machine
114669,7682,I'm Sleeping In A Submarine,Arcade Fire
10888,5531,Secrets,OneRepublic


In [178]:
top_n_songs_for_user(trainset.to_inner_uid(6958), 20)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,play_count
song_id,title,artist_name,Unnamed: 3_level_1
1050,Wet Blanket,Metric,5
5566,The Bachelor and the Bride,The Decemberists,5
9351,The Police And The Private,Metric,2
3718,The Penalty,Beirut,2
1671,Sleeping In (Album),Postal Service,2
1787,Help I'm Alive,Metric,2
8029,I CAN'T GET STARTED,Ron Carter,1
7738,Nantes,Beirut,1
8037,Gold Guns Girls,Metric,1
6305,Rhode Island Is Famous For You,Erin McKeown,1


**Observations and Insights:______________**
* Bit suprised to not see songs from artists the user enjoys but given the type of musical style still possible.
* Also suprising that all have ratings under 3 but still since our assumption is anything over 1 play count means a suer likes a song, anything with a value over 2 should be promising

### Correcting the play_counts and Ranking the above songs

In [179]:
def ranking_songs(recommendations, final_rating):
  # sort the songs based on play counts
  ranked_songs = final_rating.loc[[items[0] for items in recommendations]].sort_values('play_freq', ascending=False)[['play_freq']].reset_index()

  # merge with the recommended songs to get predicted play_count
  ranked_songs = ranked_songs.merge(pd.DataFrame(recommendations, columns=['song_id', 'predicted_ratings']), on='song_id', how='inner')

  # rank the songs based on corrected play_counts
  ranked_songs['corrected_ratings'] = ranked_songs['predicted_ratings'] - 1 / np.sqrt(ranked_songs['play_freq'])

  # sort the songs based on corrected play_counts
  ranked_songs = ranked_songs.sort_values('corrected_ratings',ascending=False)
  
  return ranked_songs
  #note to self instead of play_freq may be play_count

**Think About It:** In the above function to make the correction in the predicted play_count a quantity 1/np.sqrt(n) is subtracted. What is the intuition behind it? Is it also possible to add this quantity instead of subtracting?

Note: In the above-corrected rating formula, we can add the quantity 1/np.sqrt(n) instead of subtracting it to get more optimistic predictions, but here we don't necessarily need to encourage people to play a song, and we've eliminated songs with very few ratings "i.e" play counts, whereas we have songs with play couunts of 5 and we don't want any to have more than 5.

In [180]:
#Applying the ranking_songs function on the final_play data. 
ranking_songs(recommendations, final_play)

Unnamed: 0,song_id,play_freq,predicted_ratings,corrected_ratings
4,5398,300,2.573805,2.51607
1,1348,490,2.519339,2.474164
0,1286,940,2.480112,2.447495
2,5943,447,2.492109,2.44481
3,7496,316,2.499019,2.442765


In [181]:
sFrame = pd.DataFrame()
songs = [614,8247,5943,5531,7682]
for s in songs:
  #sid = trainset.to_raw_iid(s)
  sFrame=sFrame.append(df_final.loc[df_final['song_id'] == s].head(1))

sFrame[['song_id','title','artist_name']]

Unnamed: 0,song_id,title,artist_name
8546,614,You're The One,Dwight Yoakam
53891,8247,Tighten Up,The Black Keys
92486,5943,You've Got The Love,Florence + The Machine
10888,5531,Secrets,OneRepublic
114669,7682,I'm Sleeping In A Submarine,Arcade Fire


**Observations and Insights:**
* There were no differences between the recommendations with the corrected ratings, just the order changed slightly

### Item Item Similarity-based collaborative filtering recommendation systems 

In [182]:
#Apply the item-item similarity collaborative filtering model with random_state=1 and evaluate the model performance.
sim_options = {'name': 'cosine',
               'user_based': False}

#KNN algorithm is used to find desired similar items.
sim_item_item = KNNBasic(sim_options=sim_options, random_state=1, verbose=False)

# Train the algorithm on the trainset, and predict ratings for the testset
sim_item_item.fit(trainset)

# Let us compute precision@k, recall@k, and f_1 score with k =10.
precision_recall_at_k(sim_item_item, k=10)

RMSE: 1.0212
Precision:  0.327
Recall:  0.436
F_1 score:  0.374


**Observations and Insights:______________**
* This Item-Item recommendation engine underperforms compared to the user-user model significantly 

In [183]:
data.df.loc[data.df['user_id']==6958].loc[data.df['song_id']==1671]

Unnamed: 0,user_id,song_id,play_count
24649,6958,1671,2


In [184]:
#predicting play count for a sample user_id 6958 and song (with song_id 1671) heard by the user.
sim_item_item.predict(6958,1671,r_ui=2, verbose=True)

user: 6958       item: 1671       r_ui = 2.00   est = 1.40   {'actual_k': 28, 'was_impossible': False}


Prediction(uid=6958, iid=1671, r_ui=2, est=1.404022400965297, details={'actual_k': 28, 'was_impossible': False})

In [185]:
data.df.loc[data.df['user_id']==69587].loc[data.df['song_id']==1671]

Unnamed: 0,user_id,song_id,play_count


In [186]:
#Predict the play count for a user that has not listened to the song (with song_id 1671)
sim_item_item.predict(69587,1671, verbose=True)

user: 69587      item: 1671       r_ui = None   est = 1.46   {'actual_k': 23, 'was_impossible': False}


Prediction(uid=69587, iid=1671, r_ui=None, est=1.45698954895159, details={'actual_k': 23, 'was_impossible': False})

**Observations and Insights:______________**
* While still predicting that this user may like these songs, the estimates are lower for both. Given we know this user "likes" song 1671 (the postal service one) just something to be wary of

In [187]:
#Apply grid search for enhancing model performance

# setting up parameter grid to tune the hyperparameters
param_grid1 = {'k': [10, 20, 30], 'min_k': [3, 6, 9],
              'sim_options': {'name': ["cosine",'pearson',"pearson_baseline"],
                              'user_based': [False], "min_support":[2,4]}
              }

param_grid2 = {'k': [30], 'min_k': [6, 9],
              'sim_options': {'name': ['pearson',"pearson_baseline"],
                              'user_based': [False], "min_support":[2],'shrinkage':[50,80]}
              }   

param_grid3 = {'k': [30], 'min_k': [6],
              'sim_options': {'name': ["pearson_baseline"],
                              'user_based': [False], 
                              "min_support":[0,2],
                              'shrinkage':[80,90,100]}
              }                               

# performing 3-fold cross validation to tune the hyperparameters
gs = GridSearchCV(KNNBasic, param_grid3, measures=['rmse'], cv=3, n_jobs=-1)

# fitting the data
gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])


0.9851707945374399
{'k': 30, 'min_k': 6, 'sim_options': {'name': 'pearson_baseline', 'user_based': False, 'min_support': 0, 'shrinkage': 100}}


**Results with param grid 1** are
0.9857056770027918

> {'k': 30, 'min_k': 6, 'sim_options': {'name': 'pearson_baseline', 'user_based': False, 'min_support': 2}}

**Results with param grid 2** are
0.9861578252605048
>  {'k': 30, 'min_k': 6, 'sim_options': {'name': 'pearson_baseline', 'user_based': False, 'min_support': 2, 'shrinkage': 80}}


**Results with param grid 3** are
0.9636829605330276
> {'k': 30, 'min_k': 6, 'sim_options': {'name': 'pearson_baseline', 'user_based': False, 'min_support': 0, 'shrinkage': 100}}

**Think About It:** How do the parameters affect the performance of the model? Can we improve the performance of the model further? Check the list of hyperparameter [here](https://surprise.readthedocs.io/en/stable/knn_inspired.html).

Fromt the documentation 

> **Similarity measure configuration**
Many algorithms use a similarity measure to estimate a rating. The way they can be configured is done in a similar fashion as for baseline ratings: you just need to pass a sim_options argument at the creation of an algorithm. This argument is a dictionary with the following (all optional) keys:

* `'name': `The name of the similarity to use, as defined in the similarities module. Default is 'MSD'.
* `'user_based':` Whether similarities will be computed between users or between items. This has a huge impact on the performance of a prediction algorithm. Default is True.
*`'min_support':` The minimum number of common items (when 'user_based' is 'True') or minimum number of common users (when 'user_based' is 'False') for the similarity not to be zero. Simply put, if |𝐼𝑢𝑣|<min_support then sim(𝑢,𝑣)=0. 
The same goes for items.

* `'shrinkage':` Shrinkage parameter to apply (only relevant for pearson_baseline similarity). Default is 100.


In [188]:
#Apply the best model found in the grid search.

# using the optimal similarity measure for item-item based collaborative filtering
#TODO FILL IN THE RIGHT SIM OPTIONS
sim_options = {'name': 'pearson_baseline',
               'user_based': False,
               'min_support': 0,
               'shrinkage':100}

#TODO FILL IN THE RIGHT SIM OPTIONS
# creating an instance of KNNBasic with optimal hyperparameter values
sim_item_item_optimized = KNNBasic(sim_options=sim_options, k=30, min_k=6, random_state=1, verbose=False)

# training the algorithm on the trainset
sim_item_item_optimized.fit(trainset)

# Let us compute precision@k and recall@k also with k =10.
precision_recall_at_k(sim_item_item_optimized, k=10)


RMSE: 0.9912
Precision:  0.455
Recall:  0.583
F_1 score:  0.511


**Observations and Insights:**
* achieved a lower RMSE and significantly improved the F_1 score compared to the baseline item-item similarity model. 

In [189]:
#Predict the play_count by a user(user_id 6958) for the song (song_id 1671)
sim_item_item_optimized.predict(uid=6958, iid=1671, r_ui=2, verbose=True)

user: 6958       item: 1671       r_ui = 2.00   est = 2.02   {'actual_k': 12, 'was_impossible': False}


Prediction(uid=6958, iid=1671, r_ui=2, est=2.0185070905732365, details={'actual_k': 12, 'was_impossible': False})

In [190]:
#predicting play count for a sample user_id 6958 with song_id 3232 which is not heard by the user.
sim_item_item_optimized.predict(uid=6958, iid=3232, verbose=True)

user: 6958       item: 3232       r_ui = None   est = 1.03   {'actual_k': 12, 'was_impossible': False}


Prediction(uid=6958, iid=3232, r_ui=None, est=1.025218552138185, details={'actual_k': 12, 'was_impossible': False})

**Observations and Insights:______________**
* The estimate for the known song improved (closer to actual value) and for the unkown song to the user decreased.

In [191]:
#Find five most similar users to the user with inner id 0
sim_item_item_optimized.get_neighbors(0, k=5)


[426, 85, 197, 174, 242]

In [192]:
top_n_songs_for_user(0, 10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,play_count
song_id,title,artist_name,Unnamed: 3_level_1
2220,Sehr kosmisch,Harmonia,5
2091,Just Dance,Lady GaGa / Colby O'Donis,4
2210,Hey There Delilah,Plain White T's,4
630,'Till I Collapse,Eminem / Nate Dogg,4
3050,Terre Promise,O'Rosko Raricim,3
1334,Hey_ Soul Sister,Train,3
8099,Toxic,Britney Spears,3
1828,Times Like These,Jack Johnson,3
2616,Gives You Hell,The All-American Rejects,3
5291,Bring Me To Life,Evanescence,2


In [193]:
top_n_songs_for_user(229, 10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,play_count
song_id,title,artist_name,Unnamed: 3_level_1
7796,Love Story,Taylor Swift,8
8582,Use Somebody,Kings Of Leon,8
7998,Nothin' On You [feat. Bruno Mars] (Album Version),B.o.B,6
4270,Don't Stop The Music,Rihanna,6
657,Luvstruck,Southside Spinners,5
6448,Wild World,Cat Stevens,5
6450,Brave The Elements,Colossal,5
7969,Savior,Rise Against,5
6860,Mercy:The Laundromat,Pavement,5
352,Dog Days Are Over (Radio Edit),Florence + The Machine,5


In [194]:
top_n_songs_for_user(239,10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,play_count
song_id,title,artist_name,Unnamed: 3_level_1
6293,Yellow,Coldplay,6
1286,Somebody To Love,Justin Bieber,6
7998,Nothin' On You [feat. Bruno Mars] (Album Version),B.o.B,6
2048,Already Gone,Kelly Clarkson,5
5733,Face Down (Album Version),The Red Jumpsuit Apparatus,5
8698,Whatcha Say,Jason Derulo,4
3310,Yeah!,Usher Featuring Lil' Jon & Ludacris,4
1223,The Way I Are,Timbaland / Keri Hilson / D.O.E.,4
1118,Clocks,Coldplay,4
7911,Heartbreak Warfare,John Mayer,4


In [195]:
#197,
top_n_songs_for_user(286,10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,play_count
song_id,title,artist_name,Unnamed: 3_level_1
6293,Yellow,Coldplay,6
8612,Fix You,Coldplay,5
7989,Drive,Incubus,4
7796,Love Story,Taylor Swift,4
9081,Take Me Out,Franz Ferdinand,4
4639,Bulletproof,La Roux,4
2220,Sehr kosmisch,Harmonia,4
4522,Sparks,Coldplay,4
3165,Mockingbird,Eminem,4
7212,Hide & Seek,Imogen Heap,3


In [196]:
#Making top 5 recommendations for user_id 6958 with item_item_similarity-based recommendation engine.

recommendations =get_recommendations(df_f, 6958, 5, sim_item_item_optimized)

In [197]:
#Building the dataframe for above recommendations with columns "song_id" and "predicted_play_count"

pd.DataFrame(recommendations, columns=['song_id', 'predicted_play_count'])

Unnamed: 0,song_id,predicted_play_count
0,2914,3.243846
1,318,2.93585
2,2234,2.794041
3,5101,2.663005
4,3207,2.635741


In [198]:
#Applying the ranking_songs function. 
ranking_songs(recommendations, final_play)

Unnamed: 0,song_id,play_freq,predicted_ratings,corrected_ratings
4,2914,117,3.243846,3.151395
3,318,124,2.93585,2.846047
1,2234,154,2.794041,2.713458
0,3207,365,2.635741,2.583399
2,5101,135,2.663005,2.576939


In [234]:
top_n_songs_for_user(trainset.to_inner_uid(6958), 10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,play_count
song_id,title,artist_name,Unnamed: 3_level_1
1050,Wet Blanket,Metric,5
5566,The Bachelor and the Bride,The Decemberists,5
9351,The Police And The Private,Metric,2
3718,The Penalty,Beirut,2
1671,Sleeping In (Album),Postal Service,2
1787,Help I'm Alive,Metric,2
8029,I CAN'T GET STARTED,Ron Carter,1
7738,Nantes,Beirut,1
8037,Gold Guns Girls,Metric,1
6305,Rhode Island Is Famous For You,Erin McKeown,1


In [200]:
sFrame = pd.DataFrame()
songs = [2914,318,2234,3207,5101]
for s in songs:
  #sid = trainset.to_raw_iid(s)
  sFrame=sFrame.append(df_final.loc[df_final['song_id'] == s].head(1))

sFrame[['song_id','title','artist_name']]

Unnamed: 0,song_id,title,artist_name
138299,2914,Billy Liar,The Decemberists
40222,318,Hilarious Movie Of The 90s,Four Tet
42918,2234,Your Touch,The Black Keys
87028,3207,Black,Pearl Jam
131771,5101,White Sky,Vampire Weekend


**Observations and Insights:_________**
* Recommendations include some songs by artists that user seems to like so that's promising - even though we have a better F-1 score for user-user, something to keep in mind

### Model Based Collaborative Filtering - Matrix Factorization

Model-based Collaborative Filtering is a **personalized recommendation system**, the recommendations are based on the past behavior of the user and it is not dependent on any additional information. We use **latent features** to find recommendations for each user.

In [201]:

# using SVD matrix factorization
svd = SVD(random_state=1)

# training the algorithm on the trainset
svd.fit(trainset)

# Let us compute precision@k and recall@k with k =10.
precision_recall_at_k(svd)

RMSE: 0.9929
Precision:  0.428
Recall:  0.655
F_1 score:  0.518


In [202]:
# Making prediction for user (with user_id 6958) to song (with song_id 1671), take r_ui=2
svd.predict(6958, 1671, r_ui=2, verbose=True)

user: 6958       item: 1671       r_ui = 2.00   est = 1.43   {'was_impossible': False}


Prediction(uid=6958, iid=1671, r_ui=2, est=1.42705967197869, details={'was_impossible': False})

In [203]:
# Making prediction for user who has not listened the song (song_id 3232)
svd.predict(6958, 3232, verbose=True)

user: 6958       item: 3232       r_ui = None   est = 1.80   {'was_impossible': False}


Prediction(uid=6958, iid=3232, r_ui=None, est=1.7952112223609782, details={'was_impossible': False})

#### Improving matrix factorization based recommendation system by tuning its hyperparameters

In [204]:
# set the parameter space to tune
param_grid = {'n_epochs': [10, 20, 30], 'lr_all': [0.01, 0.015, 0.02],
              'reg_all': [0.2, 0.4, 0.6]}

# set the parameter space to tune
param_grid2 = {'n_epochs': [40], 'lr_all': [0.01],
              'reg_all': [0.1, 0.2, 0.5], 'biased': [True,False]}              

# performing 3-fold gridsearch cross validation
gs_ = GridSearchCV(SVD, param_grid2, measures=['rmse'], cv=3, n_jobs=-1)

# fitting data
gs_.fit(data)

# best RMSE score
print(gs_.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs_.best_params['rmse'])

0.9733067646002294
{'n_epochs': 40, 'lr_all': 0.01, 'reg_all': 0.1, 'biased': True}


**Think About It**: How do the parameters affect the performance of the model? Can we improve the performance of the model further? Check the available hyperparameters [here](https://surprise.readthedocs.io/en/stable/matrix_factorization.html).

param grid results: 

`svd_optimized = SVD(n_epochs=30, lr_all=0.01, reg_all=0.2, random_state=1)`

param grid 2 results:
0.9715271646427835
> {'n_epochs': 40, 'lr_all': 0.01, 'reg_all': 0.1}

In [205]:
# building the optimized SVD model using optimal hyperparameter search
svd_optimized = SVD(n_epochs=40, lr_all=0.01, reg_all=0.1, random_state=1)

# training the algorithm on the trainset
svd_optimized=svd_optimized.fit(trainset)

# Let us compute precision@k and recall@k also with k =10.
precision_recall_at_k(svd_optimized)

RMSE: 0.9753
Precision:  0.442
Recall:  0.642
F_1 score:  0.524


**Observations and Insights:**
* We were able to ever so slightly reduce the RMSE and improve the F_1 score compared to the baseline item-item but this model in terms of F-1 scores still underperforms compared to the optimized user-user similarity model. 

In [206]:
#Using svd_algo_optimized model to recommend for userId 6958 and song_id 1671.

svd_optimized.predict(6958, 1671, r_ui=2, verbose=True)

user: 6958       item: 1671       r_ui = 2.00   est = 1.62   {'was_impossible': False}


Prediction(uid=6958, iid=1671, r_ui=2, est=1.6179843391556392, details={'was_impossible': False})

In [207]:
#Using svd_algo_optimized model to recommend for userId 6958 and song_id 3232 with unknown baseline rating.
svd_optimized.predict(6958, 3232, verbose=True)

user: 6958       item: 3232       r_ui = None   est = 1.72   {'was_impossible': False}


Prediction(uid=6958, iid=3232, r_ui=None, est=1.7238359836427828, details={'was_impossible': False})

**Observations and Insights:**
* we are still underestimating with this model the likelihood a user will like a song that we know they do like which is congruent with the higher RMSE and F-1 score compared to the optimized user-user similarity based model

In [208]:
# Getting top 5 recommendations for user_id 6958 using "svd_optimized" algorithm.

recommendationsSVD =get_recommendations(df_f, 6958, 5, svd_optimized)

In [236]:
#Ranking songs based on above recommendations

ranking_songs(recommendationsSVD, final_play)

Unnamed: 0,song_id,play_freq,predicted_ratings,corrected_ratings
3,7224,111,3.094069,2.999153
0,7889,183,2.64852,2.574598
2,5653,112,2.548044,2.453553
4,8777,78,2.397416,2.284188
1,6450,115,2.325181,2.231931


In [237]:
sFrame = pd.DataFrame()
songs = [7224,7889,5653,8777,6450]
for s in songs:
  #sid = trainset.to_raw_iid(s)
  sFrame=sFrame.append(df_final.loc[df_final['song_id'] == s].head(1))

sFrame[['song_id','title','artist_name']]

Unnamed: 0,song_id,title,artist_name
134529,7224,Victoria (LP Version),Old 97's
37482,7889,Make Love To Your Mind,Bill Withers
132569,5653,Transparency,White Denim
152986,8777,Sugar_ We're Goin Down,Fall Out Boy
133467,6450,Brave The Elements,Colossal


In [235]:
top_n_songs_for_user(trainset.to_inner_uid(6958), 10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,play_count
song_id,title,artist_name,Unnamed: 3_level_1
1050,Wet Blanket,Metric,5
5566,The Bachelor and the Bride,The Decemberists,5
9351,The Police And The Private,Metric,2
3718,The Penalty,Beirut,2
1671,Sleeping In (Album),Postal Service,2
1787,Help I'm Alive,Metric,2
8029,I CAN'T GET STARTED,Ron Carter,1
7738,Nantes,Beirut,1
8037,Gold Guns Girls,Metric,1
6305,Rhode Island Is Famous For You,Erin McKeown,1


**Observations and Insights:**
* IWhile the most listened to artists do not appear in the recommendations, we know we got a lower RMSE and ok F-1, and given the nature of the SVD algorithm that helps uncover latent features, it's possible the recommended songs will appeal to the user at the %s suggested by the precision and recall scores.
* Listening to some snippets of the songs they all sound similar ;) 

### Cluster Based Recommendation System

In **clustering-based recommendation systems**, we explore the **similarities and differences** in people's tastes in songs based on how they rate different songs. We cluster similar users together and recommend songs to a user based on play_counts from other users in the same cluster.

In [210]:
# Make baseline clustering model
# using CoClustering algorithm.
clust_baseline = CoClustering(random_state=1)

# training the algorithm on the trainset
clust_baseline.fit(trainset)

# Let us compute precision@k and recall@k with k =10.
precision_recall_at_k(clust_baseline)


RMSE: 1.0382
Precision:  0.397
Recall:  0.584
F_1 score:  0.473


In [211]:
#Making prediction for user_id 6958 and song_id 1671.
clust_baseline.predict(6958, 1671, r_ui=2, verbose=True)

user: 6958       item: 1671       r_ui = 2.00   est = 1.27   {'was_impossible': False}


Prediction(uid=6958, iid=1671, r_ui=2, est=1.2663258829147817, details={'was_impossible': False})

In [212]:
#Making prediction for user (userid 6958) for a song(song_id 3232) not heard by the user.
clust_baseline.predict(6958, 3232, verbose=True)

user: 6958       item: 3232       r_ui = None   est = 1.28   {'was_impossible': False}


Prediction(uid=6958, iid=3232, r_ui=None, est=1.2767753869678147, details={'was_impossible': False})

#### Improving clustering-based recommendation system by tuning its hyper-parameters

In [213]:
# set the parameter space to tune
param_grid = {'n_cltr_u':[5,6,7,8], 'n_cltr_i': [5,6,7,8], 'n_epochs': [10,20,30]}
param_grid2 = {'n_cltr_u':[4, 5,6,7,8], 'n_cltr_i': [4,6, 9], 'n_epochs': [5, 10,15]}

# performing 3-fold gridsearch cross validation
gs = GridSearchCV(CoClustering, param_grid2, measures=['rmse'], cv=3, n_jobs=-1)

# fitting data
gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

1.04009423451208
{'n_cltr_u': 4, 'n_cltr_i': 4, 'n_epochs': 10}


param grid 1 
> 1.0475849298801967
{'n_cltr_u': 5, 'n_cltr_i': 6, 'n_epochs': 10}

**Think About It**: How do the parameters affect the performance of the model? Can we improve the performance of the model further? Check the available hyperparameters [here](https://surprise.readthedocs.io/en/stable/co_clustering.html).

In [214]:
# Train the tuned Coclustering algorithm

#TODO use the opt output
# using tuned Coclustering algorithm
clust_tuned = CoClustering(n_cltr_u=5,n_cltr_i=6, n_epochs=10, random_state=1)

# training the algorithm on the trainset
clust_tuned.fit(trainset)

# Let us compute precision@k and recall@k with k =10.
precision_recall_at_k(clust_tuned)

RMSE: 1.0491
Precision:  0.39
Recall:  0.565
F_1 score:  0.461


**Observations and Insights:**


In [215]:
#Using co_clustering_optimized model to recommend for userId 6958 and song_id 1671.
clust_tuned.predict(6958, 1671, r_ui=2, verbose=True)

user: 6958       item: 1671       r_ui = 2.00   est = 0.90   {'was_impossible': False}


Prediction(uid=6958, iid=1671, r_ui=2, est=0.8984024923526686, details={'was_impossible': False})

In [216]:
#Use Co_clustering based optimized model to recommend for userId 6958 and song_id 3232 with unknown baseline rating.
clust_tuned.predict(6958, 1671, verbose=True)

user: 6958       item: 1671       r_ui = None   est = 0.90   {'was_impossible': False}


Prediction(uid=6958, iid=1671, r_ui=None, est=0.8984024923526686, details={'was_impossible': False})

**Observations and Insights:**
* Unfortunately thi smodel eresulted in the largest discrepancy between the estimated value and the real value of the known played song. And comared to the other models it also doesn't seem to be recommending an song that is likely to be liked by the user according to most other models we tried thus far, which together with the lower F-1 score, may indicate this model is not ideal for this use case.

#### Implementing the recommendation algorithm based on optimized CoClustering model

In [217]:
#Getting top 5 recommendations for user_id 6958 using "Co-clustering based optimized" algorithm.
clustering_recommendations = get_recommendations(df_f, 6958, 5, clust_tuned)

### Correcting the play_count and Ranking the above songs

In [218]:
#Ranking songs based on above recommendations

ranking_songs(clustering_recommendations, final_play)

Unnamed: 0,song_id,play_freq,predicted_ratings,corrected_ratings
4,7224,111,3.686119,3.591204
2,657,162,2.488407,2.40984
3,6450,115,2.466144,2.372894
0,5531,684,2.40998,2.371744
1,1664,418,2.416377,2.367465


In [238]:
sFrame = pd.DataFrame()
songs = [7224,657,6450,5531,1664]
for s in songs:
  #sid = trainset.to_raw_iid(s)
  sFrame=sFrame.append(df_final.loc[df_final['song_id'] == s].head(1))

sFrame[['song_id','title','artist_name']]

Unnamed: 0,song_id,title,artist_name
134529,7224,Victoria (LP Version),Old 97's
117693,657,Luvstruck,Southside Spinners
133467,6450,Brave The Elements,Colossal
10888,5531,Secrets,OneRepublic
9808,1664,Horn Concerto No. 4 in E flat K495: II. Romanc...,Barry Tuckwell/Academy of St Martin-in-the-Fie...


**Observations and Insights:**
* Very interestingly we see some recommended songs overlapping with the Matrix SVD based model (Victoria (LP Version) and Brave The Elements for instance)

### Content Based Recommendation Systems

**Think About It:** So far we have only used the play_count of songs to find recommendations but we have other information/features on songs as well. Can we take those song features into account?

In [219]:
df_small=df_final

In [220]:
df_final.nunique()

user_id        3476
song_id         695
play_count        5
title           704
release         484
artist_name     258
year             38
dtype: int64

In [221]:
# Concatenate the "title","release","artist_name" columns to create a different column named "text"
df_small['text']=df_small['title']+ ' ' + df_small['release']+ ' ' + df_small['artist_name']
df_small.nunique()

user_id        3476
song_id         695
play_count        5
title           704
release         484
artist_name     258
year             38
text            759
dtype: int64

In [222]:
#Select the columns 'user_id', 'song_id', 'play_count', 'title', 'text' from df_small data
df_small = df_small[['user_id', 'song_id', 'play_count', 'title', 'text']]

#drop the duplicates from the title column
df_small = df_small.drop_duplicates(subset=['title'])

#Set the title column as the index
df_small = df_small.set_index('title')

# see the first 5 records of the df_small dataset
df_small.head()

Unnamed: 0_level_0,user_id,song_id,play_count,text
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Stronger,39814,736,1,Stronger Graduation Kanye West
Constellations,57932,750,1,Constellations In Between Dreams Jack Johnson
Learn To Fly,75901,1188,2,Learn To Fly There Is Nothing Left To Lose Foo...
Paper Gangsta,33280,1536,2,Paper Gangsta The Fame Monster Lady GaGa
Sehr kosmisch,56576,2220,2,Sehr kosmisch Musik von Harmonia Harmonia


In [223]:
df_small.info()

<class 'pandas.core.frame.DataFrame'>
Index: 704 entries, Stronger to Synchronicity II
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   user_id     704 non-null    int64 
 1   song_id     704 non-null    int64 
 2   play_count  704 non-null    int64 
 3   text        704 non-null    object
dtypes: int64(3), object(1)
memory usage: 27.5+ KB


In [224]:
# Create the series of indices from the data
#indices = pd.Series(final_ratings.index)
indices =pd.Series(df_small.index)
indices[:5]

0          Stronger
1    Constellations
2      Learn To Fly
3     Paper Gangsta
4     Sehr kosmisch
Name: title, dtype: object

In [225]:
#Importing necessary packages to work with text data
import nltk
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")
import re
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


We will create a **function to pre-process the text data:**

In [226]:
# Function to tokenize the text
def tokenize(text):
    text = re.sub(r"[^a-zA-Z]"," ",text.lower())
    tokens = word_tokenize(text)
    words = [word for word in tokens if word not in stopwords.words("english")] #Use stopwords of english words("english")
    text_lems = [WordNetLemmatizer().lemmatize(lem).strip() for lem in words]

    return text_lems

In [227]:
#Create tfidf vectorizer 

# Fit_transfrom the above vectorizer on the text column and then convert the output into an array.

tfidf = TfidfVectorizer(tokenizer=tokenize)
songs_tfidf = tfidf.fit_transform(df_small['text'].values).toarray()

In [228]:
pd.DataFrame(songs_tfidf)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1598,1599,1600,1601,1602,1603,1604,1605,1606,1607
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
699,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
700,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
701,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
702,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [229]:
# Compute the cosine similarity for the tfidf above output
similar_songs = cosine_similarity(songs_tfidf, songs_tfidf)
similar_songs

array([[1.        , 0.        , 0.        , ..., 0.        , 0.65559925,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.65559925, 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])

 Finally, let's create a function to find most similar songs to recommend for a given song

In [230]:
# function that takes in song title as input and returns the top 10 recommended songs
def recommendations(title, similar_songs):
    
    recommended_songs = []
    
    # gettin the index of the song that matches the title
    idx = indices[indices == title].index[0]

    # creating a Series with the similarity scores in descending order
    score_series = pd.Series(similar_songs[idx]).sort_values(ascending = False)

    # getting the indexes of the 10 most similar songs
    top_10_indexes = list(score_series.iloc[1:11].index)
    print(top_10_indexes)
    
    # populating the list with the titles of the best 10 matching songs
    for i in top_10_indexes:
        recommended_songs.append(list(df_small.index)[i])
        
    return recommended_songs

Recommending 10 songs similar to Learn to Fly

In [231]:
# Make the recommendation for the song with title 'Learn To Fly'
recommendations('Learn To Fly', similar_songs)

[559, 431, 312, 22, 469, 400, 626, 478, 477, 466]


['Big Me',
 'Everlong',
 'The Pretender',
 'Just Lose It',
 'Nothing Better (Album)',
 'From Left To Right',
 'Lifespan Of A Fly',
 'Campus (Album)',
 'Last Day Of Our Love',
 'Lump Sum']

In [232]:
recommendations('Starlight', similar_songs)

[301, 666, 574, 677, 606, 363, 103, 605, 292, 293]


['Do We Need This?',
 'Unintended',
 'Stockholm Syndrome',
 'Invincible',
 "Can't Take My Eyes Off You",
 'Resistance',
 'Uprising',
 'Map Of The Problematique',
 'Supermassive Black Hole (Twilight Soundtrack Version)',
 'Supermassive Black Hole (Album Version)']

In [233]:
recommendations('Toxic', similar_songs)

[265, 115, 279, 619, 100, 604, 600, 385, 507, 469]


['Rehab',
 'Rianna',
 "If I Ain't Got You",
 'Flashing Lights',
 'Balloons (Single version)',
 'Bad Moon Rising',
 'Billy Liar',
 'First Day Of My Life (Single Version)',
 'Strut (1993 Digital Remaster)',
 'Nothing Better (Album)']

**Observations and Insights:**
* Spot checking for recommendtions, makes sense but probably because this content model is looking heavily at artist and title so it will often recommend songs from the same artist 

## **Conclusion and Recommendations:** 

- **Refined Insights -** What are the most meaningful insights from the data relevant to the problem?

- **Comparison of various techniques and their relative performance -** How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?

- **Proposal for the final solution design -** What model do you propose to be adopted? Why is this the best solution to adopt?

For this problem formulation, we built recommendation systems using several different algorithms. They are as follows:

* rank-based using averages
* User-user-similarity-based collaborative filtering
* Item-item-similarity-based collaborative filtering
* model-based (matrix factorization) collaborative filtering

To demonstrate "user-user-similarity-based collaborative filtering","item-item-similarity-based collaborative filtering", and "model-based (matrix factorization) collaborative filtering", we used the surprise library 

For these algorithms, grid search cross-validation is used to find optimal hyperparameters. Once we found the optimal parameters, we made the corresponding predictions. 

We also used clustering-based recommendation systems
content-based recommendation systems

For performance evaluation of these models, we used precision@k and recall@k.

Using these two metrics, the F_1 score is calculated for each working model.


Overall, the *user-user similarity-based* recommendation system has given the best performance in terms of the F1-Score.

We can try to further improve the performance of these models using hyperparameter tuning or build hybrid recommendation systems.

**we propose to further tune the user - user similarity collaborative filtering model and explore a hybrid model with SVD**

**We will explore that in the subsequent stages and ultimately see which model has best performance based on f-1 and RMSE metrics**