<a href="https://colab.research.google.com/github/erinjsoto/capstone_recommendation_system/blob/main/Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Music Recommendation System**

## **Milestone 1**

## **Problem Definition**

**The context:** Why is this problem important to solve?<br>
**The objectives:** What is the intended goal?<br>
**The key questions:** What are the key questions that need to be answered?<br>
**The problem formulation:** What are we trying to solve using data science?


## **Data Dictionary**

The core data is the Taste Profile Subset released by the Echo Nest as part of the Million Song Dataset. There are two files in this dataset. The first file contains the details about the song id, titles, release, artist name, and the year of release. The second file contains the user id, song id, and the play count of users.

song_data

song_id - A unique id given to every song

title - Title of the song

release - Name of the released album

artist_name - Name of the artist

year - Year of release

count_data

user_id - A unique id given to the user

song_id - A unique id given to the song

play_count - Number of times the song was played

## **Data Source**
http://millionsongdataset.com/

## **Important Notes**

- This notebook can be considered a guide to refer to while solving the problem. The evaluation will be as per the Rubric shared for each Milestone. Unlike previous courses, it does not follow the pattern of the graded questions in different sections. This notebook would give you a direction on what steps need to be taken to get a feasible solution to the problem. Please note that this is just one way of doing this. **There can be other 'creative' ways to solve the problem, and we encourage you to feel free and explore them as an 'optional' exercise**. 

- In the notebook, there are markdown cells called Observations and Insights. It is a good practice to provide observations and extract insights from the outputs.

- The naming convention for different variables can vary. **Please consider the code provided in this notebook as a sample code.**

- All the outputs in the notebook are just for reference and can be different if you follow a different approach.

- There are sections called **Think About It** in the notebook that will help you get a better understanding of the reasoning behind a particular technique/step. Interested learners can take alternative approaches if they want to explore different techniques.

In [None]:
%%shell
jupyter nbconvert --to html /content/drive/MyDrive/Colab_Notebooks/spotify_music/ipynb/Final.ipynb

[NbConvertApp] Converting notebook /content/drive/MyDrive/Colab_Notebooks/spotify_music/ipynb/Final.ipynb to html
[NbConvertApp] Writing 548743 bytes to /content/drive/MyDrive/Colab_Notebooks/spotify_music/ipynb/Final.html




### **Importing Libraries and the Dataset**

In [None]:
# Mounting the drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Used to ignore the warning given as output of the code
import warnings
warnings.filterwarnings('ignore')

# Basic libraries of python for numeric and dataframe computations
import numpy as np
import pandas as pd

# Basic library for data visualization
import matplotlib.pyplot as plt

# Slightly advanced library for data visualization
import seaborn as sns

# To compute the cosine similarity between two vectors
from sklearn.metrics.pairwise import cosine_similarity

# A dictionary output that does not raise a key error
from collections import defaultdict

# A performance metrics in sklearn
from sklearn.metrics import mean_squared_error

# To do label encoding
from sklearn.preprocessing import LabelEncoder

### **Load the dataset**

In [None]:
# Importing the datasets
count_df = pd.read_csv('/content/drive/MyDrive/Colab_Notebooks/spotify_music/count_data.csv')
song_df = pd.read_csv('/content/drive/MyDrive/Colab_Notebooks/spotify_music/song_data.csv')

### **Understanding the data by viewing a few observations**

In [None]:
# See top 10 records of count_df data
# Top 10 records based on play_count column
count_df.sort_values(by=['play_count'], ascending=False).head(10)

Unnamed: 0.1,Unnamed: 0,user_id,song_id,play_count
1228366,1228366,d13609d62db6df876d3cc388225478618bb7b912,SOFCGSE12AF72A674F,2213
1048310,1048310,50996bbabb6f7857bf0c8019435b5246a0e45cfd,SOUAGPQ12A8AE47B3A,920
1586780,1586780,5ea608df0357ec4fda191cb9316fe8e6e65e3777,SOKOSPK12A8C13C088,879
31179,31179,bb85bb79612e5373ac714fcd4469cabeb5ed94e1,SOZQSVB12A8C13C271,796
1875121,1875121,c012ec364329bb08cbe3e62fe76db31f8c5d8ec3,SOBONKR12A58A7A7E0,683
1644909,1644909,70caceccaa745b6f7bc2898a154538eb1ada4d5a,SOPREHY12AB01815F9,676
1731945,1731945,972cce803aa7beceaa7d0039e4c7c0ff097e4d55,SOJRFWQ12AB0183582,664
1374693,1374693,d2232ac7a1ec17b283b5dff243161902b2cb706c,SOLGIWB12A58A77A05,649
1819571,1819571,f5363481018dc87e8b06f9451e99804610a594fa,SOVRIPE12A6D4FEA19,605
515442,515442,f1bdbb9fb7399b402a09fa124210dedf78e76034,SOZPMJT12AAF3B40D1,585


In [None]:
# See 10 records of song_df data
song_df.head(10)

Unnamed: 0,song_id,title,release,artist_name,year
0,SOQMMHC12AB0180CB8,Silent Night,Monster Ballads X-Mas,Faster Pussy cat,2003
1,SOVFVAK12A8C1350D9,Tanssi vaan,Karkuteillä,Karkkiautomaatti,1995
2,SOGTUKN12AB017F4F1,No One Could Ever,Butter,Hudson Mohawke,2006
3,SOBNYVR12A8C13558C,Si Vos Querés,De Culo,Yerba Brava,2003
4,SOHSBXH12A8C13B0DF,Tangle Of Aspens,Rene Ablaze Presents Winter Sessions,Der Mystic,0
5,SOZVAPQ12A8C13B63C,"Symphony No. 1 G minor ""Sinfonie Serieuse""/All...",Berwald: Symphonies Nos. 1/2/3/4,David Montgomery,0
6,SOQVRHI12A6D4FB2D7,We Have Got Love,Strictly The Best Vol. 34,Sasha / Turbulence,0
7,SOEYRFT12AB018936C,2 Da Beat Ch'yall,Da Bomb,Kris Kross,1993
8,SOPMIYT12A6D4F851E,Goodbye,Danny Boy,Joseph Locke,0
9,SOJCFMH12A8C13B0C2,Mama_ mama can't you see ?,March to cadence with the US marines,The Sun Harbor's Chorus-Documentary Recordings,0


In [None]:
# Testing
song_df.groupby(['song_id']).count().sum()

title           999985
release         999995
artist_name    1000000
year           1000000
dtype: int64

In [None]:
# See the 10 song_ids that appear most often in song_df (duplicated ids)
a = song_df['song_id'].value_counts()
a.head(10)

SOUYQYY12AF72A000F    3
SOKUAGP12A8C133B94    3
SOFQIZF12A67ADE730    3
SOBPICV12A8151CDF1    3
SORBGBD12A8C141CEA    3
SONBEKD12AB01894DC    3
SOBPAEP12A58A77F49    3
SOUWROC12A8C141CF3    3
SOQNMCD12A8C1383D4    3
SOODBWM12A6D4F6B0E    3
Name: song_id, dtype: int64

In [None]:
# See top 10 artists of song_df data
a = song_df['artist_name'].value_counts()
a.head(10)

Michael Jackson       194
Johnny Cash           193
Beastie Boys          187
Joan Baez             181
Neil Diamond          176
Duran Duran           175
Radiohead             173
Franz Ferdinand       173
Aerosmith             173
The Rolling Stones    171
Name: artist_name, dtype: int64

### **Let us check the data types and missing values of each column**

In [None]:
# See the info of the count_df data
count_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000000 entries, 0 to 1999999
Data columns (total 4 columns):
 #   Column      Dtype 
---  ------      ----- 
 0   Unnamed: 0  int64 
 1   user_id     object
 2   song_id     object
 3   play_count  int64 
dtypes: int64(2), object(2)
memory usage: 61.0+ MB


In [None]:
# See the info of the song_df data
song_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 5 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   song_id      1000000 non-null  object
 1   title        999985 non-null   object
 2   release      999995 non-null   object
 3   artist_name  1000000 non-null  object
 4   year         1000000 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 38.1+ MB


#### **Observations and Insights:**

*   There are currently two separate datasets (song_df and count_df)
*   song_df contains song-level data such as title, release (album), artist name, and release year
*   count_df contains user-level data: the user id, the song id, and the play_count
*   Both datasets have a song_id column, which can be used to merge them
*   Both datasets contain object and int data types
*   song_df has some missing values: 15 titles and 5 releases are null

In [None]:
# Left merge the count_df and song_df data on "song_id". Drop duplicates from song_df data simultaneously
#df = count_df.merge(song_df.drop_duplicates(['song_id']), how="left", on="song_id")
df = pd.merge(count_df, song_df.drop_duplicates(['song_id']), on="song_id", how="left")
df.head()
# Drop the column 'Unnamed: 0'
df = df.drop(['Unnamed: 0'], axis=1)

**Think About It:** Since the user_id and song_id are encrypted, can they be encoded as numeric features?
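
The idea behind such an encoding can be sketched in plain Python: each distinct id is replaced by an integer, and the mapping is kept so the original id can be recovered. The ids below are invented for illustration, and the sorted ordering only mirrors how `LabelEncoder` orders its classes; this is a sketch of the concept, not the library's implementation.

```python
# Toy ids standing in for the hashed user_id strings (made-up values)
raw_ids = ["c9f2", "a41b", "c9f2", "0e77", "a41b"]

# Map each distinct id to an integer, following the sorted order of the ids
classes = sorted(set(raw_ids))
id_to_code = {raw: code for code, raw in enumerate(classes)}

# Encode every row, then decode again to confirm nothing is lost
encoded = [id_to_code[raw] for raw in raw_ids]
decoded = [classes[code] for code in encoded]

print(encoded)
print(decoded == raw_ids)
```

Because the mapping is one-to-one, the encoding changes only how the ids are stored, not which rows belong to which user or song.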

In [None]:
# Label Encoding
le = LabelEncoder()

# Fit transform the user_id column
df['user_id'] = le.fit_transform(df['user_id'])

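# Fit transform the song_id column
# Note: the encoded values are stored in a new column named 'business_id'; 'song_id' keeps the original string ids
df['business_id'] = le.fit_transform(df['song_id'])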

**Think About It:** As the data also contains users who have listened to very few songs and vice versa, is it required to filter the data so that it contains users who have listened to a good count of songs and vice versa?
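
As a rough illustration of this kind of filtering, here is a small pandas sketch on toy data. The DataFrame, its ids, and the cutoffs of 2 are made up for the example (the notebook uses 90 and 120), and the vectorized `groupby(...).transform("count")` approach is an alternative to counting with a dictionary, not the method used in the cells below.

```python
import pandas as pd

# Toy interaction log (made-up ids); one row per (user, song) pair
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "song_id": ["a", "b", "c", "a", "b", "a"],
    "play_count": [1, 2, 1, 5, 1, 3],
})

MIN_SONGS_PER_USER = 2  # hypothetical cutoff for the toy data
MIN_USERS_PER_SONG = 2  # hypothetical cutoff for the toy data

# Step 1: keep users who have listened to enough songs
songs_per_user = toy.groupby("user_id")["song_id"].transform("count")
kept = toy[songs_per_user >= MIN_SONGS_PER_USER]

# Step 2: among the remaining rows, keep songs heard by enough users
users_per_song = kept.groupby("song_id")["user_id"].transform("count")
kept = kept[users_per_song >= MIN_USERS_PER_SONG]

print(kept)
```

The order matters: the song counts are computed after the user filter, so a song can fall below its cutoff once sparse users are removed.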

In [None]:
# Get the column containing the users
users = df.user_id

# Create a dictionary from users to their number of songs
ratings_count = dict()

for user in users:
    # If we already have the user, just add 1 to their rating count
    if user in ratings_count:
        ratings_count[user] += 1
    
    # Otherwise, set their rating count to 1
    else:
        ratings_count[user] = 1    

In [None]:
# We want our users to have listened to at least 90 songs
RATINGS_CUTOFF = 90

# Create a list of users who need to be removed
remove_users = []

for user, num_ratings in ratings_count.items():
    
    if num_ratings < RATINGS_CUTOFF:
        remove_users.append(user)

df = df.loc[ ~ df.user_id.isin(remove_users)]

In [None]:
# Get the column containing the songs
songs = df.song_id

# Create a dictionary from songs to their number of users
ratings_count = dict()

for song in songs:
    # If we already have the song, just add 1 to their rating count
    if song in ratings_count:
        ratings_count[song] += 1
    
    # Otherwise, set their rating count to 1
    else:
        ratings_count[song] = 1    

In [None]:
# We want a song to have been listened to by at least 120 users to be considered
RATINGS_CUTOFF = 120

remove_songs = []

for song, num_ratings in ratings_count.items():
    if num_ratings < RATINGS_CUTOFF:
        remove_songs.append(song)

df_final= df.loc[ ~ df.song_id.isin(remove_songs)]

In [None]:
df_final.head()

Unnamed: 0,user_id,song_id,play_count,title,release,artist_name,year,business_id
196,6958,SOAARXR12A8C133D15,1,Aunt Eggma Blowtorch,Everything Is,Neutral Milk Hotel,1995,12
197,6958,SOACPBY12A8C13FEF9,1,Full Circle,Breakout,Miley Cyrus,2008,40
198,6958,SOAKHOF12A8C13C72A,2,Poor Jackie,Rabbit Habits,Man Man,2008,151
199,6958,SOAVIJW12AB018269B,1,Hot N Cold (Manhattan Clique Remix Radio Edit),Hot N Cold,Katy Perry,2008,326
200,6958,SOBDVAK12AC90759A2,1,Daisy And Prudence,Distillation,Erin McKeown,2000,447


In [None]:
# Drop records with play_count more than(>) 5
df_final.drop(df_final[df_final['play_count']>5].index, inplace=True)
#df_final.drop(df_final.index[df_final['play_count'] > 5], inplace = True)

# Get the index labels of rows where play_count is exactly 5
indexNames = df_final[ df_final['play_count']==5 ].index
# Delete these rows from the DataFrame (drop with inplace=True returns None, so the result is not assigned)
df_final.drop(indexNames , inplace=True)

In [None]:
# Check the shape of the pre narrowed down dataset to compare to the narrowed down df_final
df.shape

(438390, 8)

In [None]:
# Check the shape of the narrowed down data
df_final.shape

(383454, 8)

## **Exploratory Data Analysis**

### **Checking the total number of unique users, songs, artists in the data**

Total number of unique user id

In [None]:
# Display total number of unique user_id
print("Total number of unique user_ids:", df_final.user_id.nunique())

Total number of unique song id

In [None]:
# Display total number of unique song_id
print("Total number of unique song_ids:", df_final.song_id.nunique())

Total number of unique artists

In [None]:
# Display total number of unique artists
print("Total number of unique artists:",df_final.artist_name.nunique())

#### **Observations and Insights:**


*   The total number of unique users is 3155
*   The total number of unique songs is 563
*   The total number of unique artists is 232
*   After narrowing down the dataset to df_final, the dataset went from 438390 rows and 8 columns to 117876 rows and 8 columns
*   To narrow the data, only users who have listened to 90 or more songs were kept in the df_final dataset. Any song that had been listened to by fewer than 120 users was also removed.

### **Let's find out about the most interacted songs and interacted users**

Most interacted songs

In [None]:
# Finding the most played songs in the dataset
df_final['song_id'].value_counts().head(10)

In [None]:
# See the top artists of df_final data (head() returns the first 5)
df_final['artist_name'].value_counts().head()

In [None]:
# Display some songs by the most popular artist_name
df_final.loc[df_final['artist_name'] == 'Coldplay']

Most interacted users

In [None]:
# Finding the most active users in the dataset
df_final['user_id'].value_counts().head(10)

In [None]:
df_final.to_csv(r'/content/drive/MyDrive/Colab_Notebooks/spotify_music/my_data.csv', index=False)

#### **Observations and Insights:**


*   The most interacted song id is SOWCKVR12A8C142411, being played 751 times
*   The most interacted user id is 61472






Songs played in a year

In [None]:
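# Count the rows recorded for each release year
count_songs = df_final.groupby('year').count()['title']

count = pd.DataFrame(count_songs)

# Drop the first index entry, which is expected to be the placeholder year 0 (unknown release year)
count.drop(count.index[0], inplace = True)

count.tail()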

In [None]:
# Create the plot

# Set the figure size
plt.figure(figsize = (30, 20))

sns.barplot(x = count.index,
            y = 'title',
            data = count,
            estimator = np.median)

# Set the y label of the plot
plt.ylabel('number of titles played') 

# Show the plot
plt.show()

#### **Observations and Insights:** 

*   The year with the most songs played is 2009, with 16351 counted songs
*   Song popularity appears to be related to the year of release.
*   There is a significant increase in songs listened to per year starting around 2000. This indicates that users are listening to more recent music compared to music released before 2000
*   It would be interesting to see the users' ages. The spike starting around 2000 might be because users who are more familiar with technology are younger and more tech savvy than older generations. Therefore, these users are more likely to listen to recent music than to music released in earlier years.

**Think About It:** What other insights can be drawn using exploratory data analysis?

# **Music Recommendation System**

# **Milestone 2**

Now that we have explored the data, let's apply different algorithms to build recommendation systems.

**Note:** Use the shorter version of the data, i.e., the data after the cutoffs as used in Milestone 1.

## **Load the dataset**

In [None]:
# Mounting the drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Used to ignore the warning given as output of the code
import warnings
warnings.filterwarnings('ignore')

# Basic libraries of python for numeric and dataframe computations
import numpy as np
import pandas as pd

# Basic library for data visualization
import matplotlib.pyplot as plt

# Slightly advanced library for data visualization
import seaborn as sns

# To compute the cosine similarity between two vectors
from sklearn.metrics.pairwise import cosine_similarity

# A dictionary output that does not raise a key error
from collections import defaultdict

# A performance metrics in sklearn
from sklearn.metrics import mean_squared_error

# To do label encoding
from sklearn.preprocessing import LabelEncoder

In [None]:

# Load the dataset you have saved at the end of milestone 1

df_final = pd.read_csv('/content/drive/MyDrive/Colab_Notebooks/spotify_music/my_data.csv')


In [None]:
df_final.head()

### **Popularity-Based Recommendation Systems**

Let's take the average and the count of play counts for each song and build the popularity recommendation system based on the average play count.

In [None]:
# Calculating average play_count
average_count = df_final.groupby('song_id')['play_count'].mean()        # Hint: Use groupby function on the song_id column

# Calculating the frequency a song is played
play_freq = df_final.groupby('song_id')['play_count'].count()           # Hint: Use groupby function on the song_id column

In [None]:
# Making a dataframe with the average_count and play_freq
final_play = pd.DataFrame({'avg_count': average_count, 'play_freq': play_freq})

# Let us see the first five records of the final_play dataset
final_play.head()

Now, let's create a function to find the top n songs for a recommendation based on the average play count of song. We can also add a threshold for a minimum number of playcounts for a song to be considered for recommendation.
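
To make the idea concrete before building the function, here is a minimal sketch on toy data. The DataFrame and the threshold of 1 are invented for illustration; the column names `avg_count` and `play_freq` follow the ones used above.

```python
import pandas as pd

# Toy play counts (made-up values); one row per (user, song) interaction
toy = pd.DataFrame({
    "song_id": ["a", "a", "a", "b", "b", "c"],
    "play_count": [1, 2, 3, 5, 4, 5],
})

# Average play count and number of interactions per song
stats = toy.groupby("song_id")["play_count"].agg(avg_count="mean", play_freq="count")

# Keep songs with more than the minimum number of interactions (hypothetical threshold)
min_interactions = 1
eligible = stats[stats["play_freq"] > min_interactions]

# Rank the remaining songs by average play count and take the top 2
top = eligible.sort_values("avg_count", ascending=False).index[:2].tolist()
print(top)
```

Song `c` has the highest average but only one interaction, so the threshold removes it; this is the reason for combining the average with a minimum interaction count.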

In [None]:
# Build the function to find top n songs

def top_n_products(final_play, n, min_interaction):
    
    # Finding songs with more than the minimum number of interactions
    recommendations = final_play[final_play['play_freq'] > min_interaction]
    
    # Sorting values with respect to average play count 
    recommendations = recommendations.sort_values(by='avg_count', ascending=False)
    
    return recommendations.index[:n]

In [None]:
# Recommend top 10 songs using the function defined above
list(top_n_products(final_play, 10, 50))

### **User User Similarity-Based Collaborative Filtering**

To build the user-user-similarity-based and subsequent models we will use the "surprise" library.

In [None]:
# Install the surprise package using pip.
!pip install surprise 

In [None]:
# Import necessary libraries

# To compute the accuracy of models
from surprise import accuracy

# This class is used to parse a file containing play_counts, data should be in structure - user; item; play_count
from surprise.reader import Reader

# Class for loading datasets
from surprise.dataset import Dataset

# For tuning model hyperparameters
from surprise.model_selection import GridSearchCV

# For splitting the data in train and test dataset
from surprise.model_selection import train_test_split

# For implementing similarity-based recommendation system
from surprise.prediction_algorithms.knns import KNNBasic

# For implementing matrix factorization based recommendation system
from surprise.prediction_algorithms.matrix_factorization import SVD

# For implementing KFold cross-validation
from surprise.model_selection import KFold

# For implementing clustering-based recommendation system
from surprise import CoClustering

### Some useful functions

Below is the function to calculate precision@k and recall@k, RMSE and F1_Score@k to evaluate the model performance.

**Think About It:** Which metric should be used for this problem to compare different models?
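
For reference when comparing metrics, RMSE is the square root of the mean squared difference between actual and predicted play counts. A small plain-Python sketch, using made-up values, shows the calculation; the helper name `rmse` is only for this example.

```python
import math

def rmse(true_values, predicted_values):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(true_values, predicted_values)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy play counts (made-up values)
true_counts = [1, 2, 3, 1]
predicted_counts = [2, 2, 1, 1]

print(rmse(true_counts, predicted_counts))
```

RMSE measures how close the predicted play counts are to the actual ones, while precision@k and recall@k measure how good the top of the ranked list is, so the two kinds of metric can disagree about which model is better.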

In [None]:
# The function to calculate the RMSE, precision@k, recall@k, and F_1 score
def precision_recall_at_k(model, k = 30, threshold = 1.5):
    """Return precision and recall at k metrics for each user"""

    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    
    # Making predictions on the test data
    predictions=model.test(testset)
    
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key = lambda x : x[0], reverse = True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[ : k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[ : k])

        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We here set Precision to 0 when n_rec_k is 0

        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0

        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We here set Recall to 0 when n_rel is 0

        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0
    
    # Mean of all the predicted precisions are calculated
    precision = round((sum(prec for prec in precisions.values()) / len(precisions)), 3)

    # Mean of all the predicted recalls are calculated
    recall = round((sum(rec for rec in recalls.values()) / len(recalls)), 3)
    
    accuracy.rmse(predictions)

    # Command to print the overall precision
    print('Precision: ', precision)

    # Command to print the overall recall
    print('Recall: ', recall)
    
    # Formula to compute the F-1 score
    print('F_1 score: ', round((2 * precision * recall) / (precision + recall), 3))

**Think About It:** In the function precision_recall_at_k above, the threshold value used is 1.5. How are precision and recall affected by changing the threshold? What is the intuition behind using the threshold value of 1.5?
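
One way to explore the threshold question is to apply the same per-user logic to a single toy user. The sketch below reuses the counting rules from `precision_recall_at_k`, but the helper name and the (estimated, true) pairs are made up for illustration.

```python
def precision_recall_for_user(user_ratings, k, threshold):
    """user_ratings is a list of (estimated, true) play counts for one user."""
    # Rank the user's items by estimated play count, highest first
    ranked = sorted(user_ratings, key=lambda pair: pair[0], reverse=True)

    # Relevant items: true play count at or above the threshold
    n_rel = sum(true_r >= threshold for _, true_r in ranked)

    # Recommended items: estimate at or above the threshold, within the top k
    n_rec_k = sum(est >= threshold for est, _ in ranked[:k])

    # Items in the top k that are both relevant and recommended
    n_rel_and_rec_k = sum(
        (true_r >= threshold) and (est >= threshold) for est, true_r in ranked[:k]
    )

    precision = n_rel_and_rec_k / n_rec_k if n_rec_k else 0
    recall = n_rel_and_rec_k / n_rel if n_rel else 0
    return precision, recall

# One toy user: (estimated play count, true play count)
toy_user = [(2.6, 3), (2.1, 1), (1.8, 2), (1.4, 2), (1.2, 1)]

for threshold in (1.5, 2.5):
    print(threshold, precision_recall_for_user(toy_user, k=3, threshold=threshold))
```

For this toy user, raising the threshold shrinks both the set of relevant items and the set of recommended items, so the two metrics can move in either direction depending on the data.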

In [None]:
# Instantiating Reader scale with expected rating scale 
reader = Reader(rating_scale = (0, 5)) #use rating scale (0, 5)

# Loading the dataset
data = Dataset.load_from_df(df_final[['user_id', 'song_id', 'play_count']], reader) # Take only "user_id","song_id", and "play_count"

# Splitting the data into train and test dataset
trainset, testset = train_test_split(data, test_size=0.4, random_state = 42) # Take test_size = 0.4

**Think About It:** How would changing the test size change the results and outputs?

In [None]:
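# Build the default user-user-similarity model
sim_options = {'name': 'cosine',
               'user_based':True}

# KNN algorithm is used to find similar users
# Note: sim_options is not passed to KNNBasic here, so Surprise's default similarity settings (MSD, user-based) are used
sim_user_user = KNNBasic(random_state=1) # Use random_state = 1 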

# Train the algorithm on the trainset, and predict play_count for the testset
sim_user_user.fit(trainset)

# Let us compute precision@k, recall@k, and f_1 score with k = 30
precision_recall_at_k(sim_user_user) # Use sim_user_user model

💡 **Observations and Insights:**


*   The F_1 score is 0.492, meaning this model's predictions leave considerable room for improvement
*   The RMSE is 1.0626, also indicating that this model has a lot of room for improvement
*   The precision is 0.413, meaning that on average about 41.3% of the items recommended in the top 30 were relevant to the user
*   The recall is 0.608, meaning that on average about 60.8% of the relevant items were recommended in the top 30



In [None]:
# Predicting play_count for a sample user with a listened song
sim_user_user.predict('6958', '1671', r_ui = 2, verbose = True) # Use user id 6958 and song_id 1671

In [None]:
# Predicting play_count for a sample user with a song not-listened by the user
sim_user_user.predict('6958', '3232', verbose = True) # Use user_id 6958 and song_id 3232


💡 **Observations and Insights:**

*   For user 6958 and song 1671, the actual play count (r_ui) is 2 while the model predicted 1.6989, which is consistent with the observation above that the model has room for improvement.
*   There is a 0.3011 discrepancy between r_ui and the user-user prediction
*   For the second prediction, r_ui is unknown because song 3232 has not been listened to by the user; the model predicted a play count of 1.70
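
Since both predictions are close to 1.70, one thing worth checking (this is an assumption, not something the outputs above confirm) is whether the ids passed to `predict` match the ids in the data in both value and type. Surprise looks up raw ids as given, and when it does not recognise a user or item it falls back to a default estimate and reports `was_impossible` as `True` in the details printed by `verbose = True`. The plain-Python sketch below uses a toy set of ids to show why an integer id and its string form are not interchangeable.

```python
# Toy stand-in for the encoded user_id column (integers after label encoding);
# the values are illustrative, not read from df_final
known_user_ids = {6958, 27018, 61472}

# The integer id is found
print(6958 in known_user_ids)

# The string form is a different value, so the lookup fails
print("6958" in known_user_ids)
```

If the lookup fails in the same way inside the model, every such prediction would share the same fallback value regardless of the user or song requested.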


Now, let's try to tune the model and see if we can improve the model performance.

In [None]:
# Setting up parameter grid to tune the hyperparameters
param_grid = {'k': [10, 20, 30], 'min_k': [3, 6, 9],
              'sim_options': {'name': ["cosine", 'pearson', "pearson_baseline"],
                              'user_based': [True], "min_support": [2, 4]}
              }

# Performing 3-fold cross-validation to tune the hyperparameters
gs = GridSearchCV(KNNBasic, param_grid, measures = ['rmse'], cv = 3, n_jobs = -1)

# Fitting the data
gs.fit(data) # Use entire data for GridSearch

# Best RMSE score
print(gs.best_score['rmse'])

# Combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

In [None]:
# Using the optimal similarity measure for user-user based collaborative filtering
sim_options = {'name': 'pearson_baseline',
               'user_based': True}

# Creating an instance of KNNBasic with optimal hyperparameter values
sim_user_user_optimized = KNNBasic(sim_options=sim_options, k =30 , min_k =6, random_state = 1, verbose = True)

# Training the algorithm on the trainset
sim_user_user_optimized.fit(trainset)

# Let us compute precision@k and recall@k also with k = 30
precision_recall_at_k(sim_user_user_optimized)

💡 **Observations and Insights:**

*   The F_1 score went from 0.492 to 0.521 after tuning the hyperparameters
*   After tuning, precision and recall both increased (precision from 0.413 to 0.414, recall from 0.608 to 0.703), and RMSE decreased from 1.0626 to 1.0534
*   Tuning the model therefore improved model performance
*   RMSE: 1.0534
*   Precision: 0.414
*   Recall: 0.703




In [None]:
# Predict the play count for a user who has listened to the song. Take user_id 6958, song_id 1671 and r_ui = 2
sim_user_user_optimized.predict("6958", "1671", r_ui = 2, verbose = True)

In [None]:
# Predict the play count for a song that is not listened to by the user (with user_id 6958) 
sim_user_user_optimized.predict("6958", "3232", verbose = True) # Use user_id 6958 and song_id 3232

💡 **Observations and Insights:**

*   For user 6958 and song 1671, the actual play count (r_ui) is 2 while the model predicted 1.70
*   There is a 0.3 discrepancy between r_ui and the user-user prediction
*   For the second prediction, r_ui is unknown because song 3232 has not been listened to by the user; the model predicted a play count of 1.70, similar to the user-user model prior to tuning



**Think About It:** Along with making predictions on listened and unknown songs can we get 5 nearest neighbors (most similar) to a certain song?
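
To build intuition for what "nearest neighbors" means here, the following plain-Python sketch ranks toy songs by cosine similarity of their play-count vectors. The song names and vectors are made up, and treating unplayed songs as 0 is a simplification for this example; it is not how the Surprise model computes similarities internally.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Each song is described by the play counts of four toy users (0 = not played)
item_vectors = {
    "song_a": [1, 2, 0, 3],
    "song_b": [2, 4, 0, 6],
    "song_c": [0, 0, 5, 0],
    "song_d": [1, 0, 0, 3],
}

def nearest_neighbors(target, vectors, k):
    """Return the k songs most similar to target, most similar first."""
    scores = [
        (other, cosine(vectors[target], vec))
        for other, vec in vectors.items()
        if other != target
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in scores[:k]]

print(nearest_neighbors("song_a", item_vectors, k=2))
```

`song_b` ranks first because its vector points in the same direction as `song_a`, even though its play counts are twice as large; cosine similarity ignores the overall scale.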

In [None]:
# Use inner id 0
# Note: get_neighbors takes and returns inner ids; for a user-based model these refer to users
sim_user_user_optimized.get_neighbors(0, k = 5)

Below we will be implementing a function where the input parameters are:

- data: A **song** dataset
- user_id: A user-id **against which we want the recommendations**
- top_n: The **number of songs we want to recommend**
- algo: The algorithm we want to use **for predicting the play_count**
- The output of the function is a **set of top_n items** recommended for the given user_id based on the given algorithm
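
The first step of such a function, finding the songs a user has not interacted with, can be sketched on toy data as follows. The DataFrame is invented for illustration; only `pivot_table` and `isnull` are used, in the same way as in the function below.

```python
import pandas as pd

# Toy interactions (made-up ids)
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "song_id": ["a", "b", "b", "c", "a"],
    "play_count": [2, 1, 3, 1, 4],
})

# Rows are users, columns are songs; missing interactions become NaN
matrix = toy.pivot_table(index="user_id", columns="song_id", values="play_count")

# Songs with NaN in the user's row have not been played by that user
user_row = matrix.loc[1]
not_played = user_row[user_row.isnull()].index.tolist()

print(not_played)
```

These are the candidate songs for which a play count is then predicted and ranked.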

In [None]:
def get_recommendations(data, user_id, top_n, algo):
    
    # Creating an empty list to store the recommended song ids
    recommendations = []
    
    # Creating a user-item interactions matrix 
    user_item_interactions_matrix = data.pivot_table(index = 'user_id', columns = 'song_id', values = 'play_count')
    
    # Extracting the song ids which the user_id has not listened to yet
    non_interacted_products = user_item_interactions_matrix.loc[user_id][user_item_interactions_matrix.loc[user_id].isnull()].index.tolist()
    
    # Looping through each of the song ids which user_id has not interacted with yet
    for item_id in non_interacted_products:
        
        # Predicting the play count for the songs this user has not listened to
        est = algo.predict(user_id, item_id).est
        
        # Appending the predicted play counts
        recommendations.append((item_id, est))

    # Sorting the predicted play counts in descending order
    recommendations.sort(key = lambda x : x[1], reverse = True)

    return recommendations[:top_n] # Returning the top n songs with the highest predicted play count for this user

In [None]:
# Make top 5 recommendations for user_id 6958 with a similarity-based recommendation engine
recommendations = get_recommendations(df_final, 6958, 5, sim_user_user_optimized)

In [None]:
# Building the dataframe for above recommendations with columns "song_id" and "predicted_ratings"
pd.DataFrame(recommendations, columns = ['song_id', 'predicted_ratings'])

💡 **Observations and Insights:**


*   With these input parameters, the function returns a list of song_ids with their predicted ratings
*   This still uses the user-user model, so the recommendations are based on songs listened to by similar users

### Correcting the play_counts and Ranking the above songs

In [None]:
def ranking_songs(recommendations, final_rating):
  # Sort the songs based on play counts
  ranked_songs = final_rating.loc[[items[0] for items in recommendations]].sort_values('play_freq', ascending = False)[['play_freq']].reset_index()

  # Merge with the recommended songs to get predicted play_count
  ranked_songs = ranked_songs.merge(pd.DataFrame(recommendations, columns = ['song_id', 'predicted_ratings']), on = 'song_id', how = 'inner')

  # Rank the songs based on corrected play_counts
  ranked_songs['corrected_ratings'] = ranked_songs['predicted_ratings'] - 1 / np.sqrt(ranked_songs['play_freq'])

  # Sort the songs based on corrected play_counts
  ranked_songs = ranked_songs.sort_values('corrected_ratings', ascending = False)
  
  return ranked_songs

**Think About It:** In the above function to correct the predicted play_count a quantity 1/np.sqrt(n) is subtracted. What is the intuition behind it? Is it also possible to add this quantity instead of subtracting?
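
A quick numeric sketch can help with the intuition. The predicted value of 2.5 and the frequencies below are made up; the formula is the one used in `ranking_songs`, written here with `math.sqrt` instead of `np.sqrt`.

```python
import math

def corrected_rating(predicted, play_freq):
    """Subtract 1 / sqrt(play_freq) from the predicted play count."""
    return predicted - 1 / math.sqrt(play_freq)

# Same predicted play count, different numbers of interactions
for freq in (4, 100, 625):
    print(freq, round(corrected_rating(2.5, freq), 3))
```

The penalty shrinks as the number of interactions grows, so a prediction backed by many interactions is adjusted less than one backed by only a few.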

In [None]:
# Applying the ranking_songs function on the final_play data
ranking_songs(recommendations, final_play)

💡 **Observations and Insights:**

*   The corrected ratings are lower than the predicted ratings, because the correction subtracts 1 / sqrt(play_freq) from each prediction
*   It is interesting to note that these song_ids are supposed to be the top n songs, yet their ratings are lower than I would expect. I originally thought that users would have top songs that they play over and over, but song_id SONYKOW12AB01849C9, for example, has a play_freq of roughly 618 with a predicted rating of 2.55 and a corrected rating of 2.51




### Item Item Similarity-based collaborative filtering recommendation systems 

In [None]:
# Apply the item-item similarity collaborative filtering model with random_state = 1 and evaluate the model performance
# Declaring the similarity options
sim_options = {'name': 'cosine',
               'user_based': False}

# KNN algorithm is used to find desired similar items
sim_item_item = KNNBasic(sim_options = sim_options, random_state = 1, verbose = True)

# Train the algorithm on the trainset, and predict ratings for the test set
sim_item_item.fit(trainset)

# Let us compute precision@k, recall@k, and f_1 score with k = 30
precision_recall_at_k(sim_item_item)

💡 **Observations and Insights:**

*   Root mean square error (RMSE) is 0.9576, meaning the algorithm works but could be better.
*   The precision is 0.307, meaning that about 30.7% of the recommended items were relevant to the user.
*   The recall is 0.562, meaning that about 56.2% of the relevant items were recommended.
*   The F_1 score, which combines precision and recall, is 0.397.
*   Based on the F_1 score, this model does not perform well
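
As a sanity check, the F_1 scores reported so far can be recomputed from the reported precision and recall with the same formula used in `precision_recall_at_k`. The helper name `f1_score` and the dictionary labels are only for this example; the numbers are the ones quoted in the observations.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs quoted in the observations above
reported = {
    "user-user (baseline)": (0.413, 0.608),
    "user-user (tuned)": (0.414, 0.703),
    "item-item (baseline)": (0.307, 0.562),
}

for name, (p, r) in reported.items():
    print(name, round(f1_score(p, r), 3))
```

Because F_1 is a harmonic mean, it stays close to the smaller of the two values, which is why the low precision keeps each score well below the recall.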

In [None]:
# Predicting play count for a sample user_id 6958 and song (with song_id 1671) heard by the user
sim_item_item.predict("6958", "1671", r_ui =2, verbose = True)

In [None]:
# Predict the play count for a song (song_id 3232) that the user has not listened to
sim_item_item.predict("6958", "3232", verbose = True) # Use user_id 6958 and song_id 3232

💡 **Observations and Insights:**

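*    The predicted play count for song_id 1671 is 1.70 (against r_ui = 2), which is identical to the prediction for song_id 3232 for the same user. Identical estimates for a heard and an unheard song can occur when Surprise does not recognize the raw ids (for example, string ids passed against integer ids in the data) and falls back to the global mean, so the `was_impossible` flag in the verbose output is worth checking.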
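
One possible reason for two identical estimates is an id-type mismatch. The sketch below is a toy stand-in for a raw-id lookup, not Surprise's actual code. It assumes the ids in the data are integers and that 1.70 is the fallback value. The names `raw_to_inner`, `GLOBAL_MEAN`, and `toy_predict`, and the placeholder estimate 2.1, are hypothetical.

```python
# Toy mapping from raw user ids to inner ids (assumed to use integer keys)
raw_to_inner = {6958: 0}
GLOBAL_MEAN = 1.70  # assumed fallback value, matching the estimate reported above

def toy_predict(raw_uid, known_estimate=2.1):
    """Return (estimate, was_impossible) for a raw user id."""
    if raw_uid not in raw_to_inner:
        # Unknown id: fall back to the global mean
        return GLOBAL_MEAN, True
    # Known id: return an arbitrary placeholder estimate
    return known_estimate, False

print(toy_predict("6958"))  # the string id is not found among integer keys
print(toy_predict(6958))    # the integer id is found
```

If this is what happened, every song would get the same fallback estimate for that user, whether it was heard or not.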

In [None]:
# Apply grid search for enhancing model performance

# Setting up parameter grid to tune the hyperparameters
param_grid = {'k': [10, 20, 30], 'min_k': [3, 6, 9],
              'sim_options': {'name': ["cosine", 'pearson', "pearson_baseline"],
                              'user_based': [False], "min_support": [2, 4]}
              }

# Performing 3-fold cross-validation to tune the hyperparameters
gs = GridSearchCV(KNNBasic, param_grid, measures = ['rmse'], cv = 3, n_jobs = -1)

# Fitting the data
gs.fit(data)

# Find the best RMSE score
print(gs.best_score['rmse'])

# Extract the combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

**Think About It:** How do the parameters affect the performance of the model? Can we improve the performance of the model further? Check the list of hyperparameters [here](https://surprise.readthedocs.io/en/stable/knn_inspired.html).
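
Before widening the grid, it helps to know how many models the search above already fits. This is a small sketch that flattens the `sim_options` entries into the same dictionary purely for counting. The variable names `combinations` and `n_fits` are my own.

```python
from itertools import product

# Same values as the grid above, flattened for counting
param_grid = {
    'k': [10, 20, 30],
    'min_k': [3, 6, 9],
    'name': ['cosine', 'pearson', 'pearson_baseline'],
    'user_based': [False],
    'min_support': [2, 4],
}

# Every combination of hyperparameter values
combinations = list(product(*param_grid.values()))

# Each combination is fitted once per fold in 3-fold cross-validation
n_fits = len(combinations) * 3
print(len(combinations), n_fits)
```

Each extra value added to one list multiplies the total, so the grid is best extended one hyperparameter at a time.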

In [None]:
# Apply the tuned item-item model
# Note: 'msd' is not among the similarity measures searched in the grid above;
# replace these values with gs.best_params['rmse'] to use the grid-search result
sim_options = {'name': 'msd', 'user_based': False}

# Creating an instance of KNNBasic with the chosen hyperparameter values
sim_item_item_optimized = KNNBasic(sim_options = sim_options, k = 30, min_k = 6, random_state = 1, verbose = True)

# Training the algorithm on the trainset
sim_item_item_optimized.fit(trainset)

# Let us compute precision@k and recall@k, f1_score and RMSE
precision_recall_at_k(sim_item_item_optimized)

💡 **Observations and Insights:**


*   Root mean square error (RMSE) is 1.0423, which is higher than the baseline item-item model's 0.9576.
*   The precision is 0.34, meaning that 34% of the recommended songs were relevant to the user (up from 0.307).
*   The recall is 0.563, meaning that 56.3% of the relevant songs were recommended (essentially unchanged from 0.562).
*   The F_1 score, which balances precision and recall, is 0.424 (up from 0.397).

In [None]:
# Predict the play_count by a user(user_id 6958) for the song (song_id 1671)
sim_item_item_optimized.predict("6958", "1671", verbose = True)

In [None]:
# Predicting play count for a sample user_id 6958 with song_id 3232 which is not heard by the user
sim_item_item_optimized.predict("6958", "3232", verbose = True)

In [None]:
# Find five most similar items to the item with inner id 0
sim_item_item_optimized.get_neighbors(0, k = 5)

In [None]:
# Making top 5 recommendations for user_id 6958 with item_item_similarity-based recommendation engine
recommendations = get_recommendations(df_final, 6958, 5, sim_item_item_optimized)

In [None]:
# Building the dataframe for above recommendations with columns "song_id" and "predicted_play_count"
pd.DataFrame(recommendations, columns = ['song_id', 'predicted_play_count'])

In [None]:
# Applying the ranking_songs function
ranking_songs(recommendations, final_play)

💡 **Observations and Insights:**

*    The larger the play frequency, the smaller the gap between the predicted and corrected ratings.
*    There are no significant outliers between the predicted and corrected ratings.



### Model Based Collaborative Filtering - Matrix Factorization

Model-based collaborative filtering is a **personalized recommendation system**: the recommendations are based on the user's past behavior and do not depend on any additional information. We use **latent features** to find recommendations for each user.
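
To make the idea of latent features concrete, an SVD-style model estimates a play count as the global mean plus a user bias, an item bias, and the dot product of the user's and the song's latent vectors. This is a minimal sketch with a hypothetical helper `predict_svd`. All numbers are illustrative and are not taken from the fitted model.

```python
def predict_svd(global_mean, user_bias, item_bias, user_factors, item_factors):
    """Estimate = mu + b_u + b_i + dot(q_i, p_u)."""
    interaction = sum(p * q for p, q in zip(user_factors, item_factors))
    return global_mean + user_bias + item_bias + interaction

# Illustrative values only (two latent features)
estimate = predict_svd(
    global_mean=1.5,
    user_bias=0.25,
    item_bias=-0.5,
    user_factors=[0.5, 1.0],
    item_factors=[1.0, 0.5],
)
print(estimate)
```

The dot product is large when the user and the song score highly on the same latent features, which is what drives the personalized part of the estimate.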

In [None]:
# Build baseline model using svd
# Using SVD matrix factorization
svd = SVD(random_state=1)

# Training the algorithm on the trainset
svd.fit(trainset)

# Use the function precision_recall_at_k to compute precision@k, recall@k, F1-Score, and RMSE
precision_recall_at_k(svd)

In [None]:
# Making prediction for user (with user_id 6958) to song (with song_id 1671), take r_ui = 2
svd.predict("6958", 1671, r_ui = 2, verbose = True)

In [None]:
# Making a prediction for the user who has not listened to the song (song_id 3232)
svd.predict("6958", 3232, verbose = True)

#### Improving matrix factorization based recommendation system by tuning its hyperparameters

In [None]:
# Set the parameter space to tune
param_grid = {'n_epochs': [10, 20, 30], 'lr_all': [0.001, 0.005, 0.01],
              'reg_all': [0.2, 0.4, 0.6]}

# Perform 3-fold grid-search cross-validation
gs = GridSearchCV(SVD, param_grid, measures = ['rmse'], cv = 3, n_jobs = -1)

# Fitting data
gs.fit(data)

# Best RMSE score
print(gs.best_score['rmse'])

# Combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

**Think About It**: How do the parameters affect the performance of the model? Can we improve the performance of the model further? Check the available hyperparameters [here](https://surprise.readthedocs.io/en/stable/matrix_factorization.html).
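
To see what `lr_all` and `reg_all` do, here is a minimal sketch of a single stochastic gradient descent update of the kind an SVD-style model performs for one observed play count. The function name `sgd_step` and all numeric values are illustrative assumptions. `n_epochs` would control how many passes of such updates are made over the training data.

```python
def sgd_step(r, mu, b_u, b_i, p_u, q_i, lr_all, reg_all):
    """One SGD update for a single (user, item) observation r."""
    # Prediction error with the current parameters
    err = r - (mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i)))
    # Each parameter moves along the error and is shrunk by regularization
    new_b_u = b_u + lr_all * (err - reg_all * b_u)
    new_b_i = b_i + lr_all * (err - reg_all * b_i)
    new_p_u = [p + lr_all * (err * q - reg_all * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr_all * (err * p - reg_all * q) for p, q in zip(p_u, q_i)]
    return err, new_b_u, new_b_i, new_p_u, new_q_i

err, b_u, b_i, p_u, q_i = sgd_step(
    r=3.0, mu=2.0, b_u=0.0, b_i=0.0, p_u=[0.5], q_i=[0.0],
    lr_all=0.01, reg_all=0.2)
print(err, b_u, b_i, p_u, q_i)
```

A larger `lr_all` takes bigger steps toward the observed value, while a larger `reg_all` pulls the parameters back toward zero and so limits overfitting.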

In [None]:
# Building the optimized SVD model using optimal hyperparameters
svd_optimized = SVD(n_epochs = 30, lr_all = 0.01, reg_all = 0.2, random_state = 1)

# Train the algorithm on the trainset
svd_optimized.fit(trainset)

# Use the function precision_recall_at_k to compute precision@k, recall@k, F1-Score, and RMSE
precision_recall_at_k(svd_optimized)


💡 **Observations and Insights:**

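*   Root mean square error (RMSE) is 1.0141.
*   The precision is 0.415, meaning that 41.5% of the recommended songs were relevant to the user.
*   The recall is 0.635, meaning that 63.5% of the relevant songs were recommended.
*   The F_1 score, which balances precision and recall, is 0.502, higher than the tuned item-item model's 0.424.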

In [None]:
# Using the svd_optimized model to predict the play count for user_id 6958 and song_id 1671
svd_optimized.predict("6958", 1671, verbose = True)

In [None]:
# Using the svd_optimized model to predict the play count for user_id 6958 and song_id 3232 with unknown baseline rating
svd_optimized.predict("6958", 3232, verbose = True)

💡 **Observations and Insights:**

*    The estimated rating is the same for both song_ids.

In [None]:
# Getting top 5 recommendations for user_id 6958 using "svd_optimized" algorithm
svd_recommendations = get_recommendations(df_final, 6958, 5, svd_optimized )

In [None]:
# Ranking songs based on above recommendations
ranking_songs(svd_recommendations, final_play)

💡 **Observations and Insights:**

*   The song whose predicted rating is closest to its corrected rating is SOZVVRE12A8C143150.
*   The larger the play frequency, the smaller the gap between the predicted and corrected ratings.
*   It would be interesting to see whether we could calculate a confidence interval or score for the predicted ratings.



### Cluster Based Recommendation System

In **clustering-based recommendation systems**, we explore the **similarities and differences** in people's tastes in songs based on how they rate different songs. We cluster similar users together and recommend songs to a user based on play_counts from other users in the same cluster.
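
As a rough illustration of how a co-clustering estimate is assembled, the prediction starts from the average play count of the co-cluster that the user and the song fall into, then adjusts for how the user and the song differ from their own clusters. The helper name `predict_coclustering` and the numbers are hypothetical.

```python
def predict_coclustering(cocluster_mean, user_mean, user_cluster_mean,
                         item_mean, item_cluster_mean):
    """Estimate = C_ui + (mu_u - C_u) + (mu_i - C_i)."""
    return (cocluster_mean
            + (user_mean - user_cluster_mean)
            + (item_mean - item_cluster_mean))

# Illustrative values only
estimate = predict_coclustering(
    cocluster_mean=2.0,
    user_mean=2.5, user_cluster_mean=2.25,
    item_mean=1.75, item_cluster_mean=1.5,
)
print(estimate)
```

Here the user plays songs slightly more than their cluster does and the song is played slightly more than its cluster, so both adjustments raise the estimate.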

In [None]:
# Make baseline clustering model
# Using Co-Clustering algorithm
clust_baseline = CoClustering(random_state = 1)

# Training the algorithm on the trainset
clust_baseline.fit(trainset)

# Let us compute precision@k and recall@k with k = 10
precision_recall_at_k(clust_baseline)

In [None]:
# Making prediction for user_id 6958 and song_id 1671
clust_baseline.predict('6958', 1671, verbose = True)

In [None]:
# Making prediction for user (userid 6958) for a song(song_id 3232) not heard by the user
clust_baseline.predict('6958', 3232, verbose = True)

#### Improving clustering-based recommendation system by tuning its hyper-parameters

In [None]:
# Set the parameter space to tune
param_grid = {'n_cltr_u': [5, 6, 7, 8], 'n_cltr_i': [5, 6, 7, 8], 'n_epochs': [10, 20, 30]}

# Performing 3-fold gridsearch cross validation
gs = GridSearchCV(CoClustering, param_grid, measures = ['rmse'], cv = 3, n_jobs = -1)

# Fitting data
gs.fit(data)

# Best RMSE score
print(gs.best_score['rmse'])

# Combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

**Think About It**: How do the parameters affect the performance of the model? Can we improve the performance of the model further? Check the available hyperparameters [here](https://surprise.readthedocs.io/en/stable/co_clustering.html).

In [None]:
# Train the tuned Co-Clustering algorithm
clust_tuned = CoClustering(n_cltr_u = 5, n_cltr_i = 5, n_epochs = 10, random_state = 1)

# Training the algorithm on the trainset
clust_tuned.fit(trainset)

# Let us compute precision@k and recall@k with k = 10
precision_recall_at_k(clust_tuned)


💡 **Observations and Insights:**

*   Root mean square error (RMSE) is 1.0654.
*   The precision is 0.394, meaning that 39.4% of the recommended songs were relevant to the user.
*   The recall is 0.566, meaning that 56.6% of the relevant songs were recommended.
*   The F_1 score, which balances precision and recall, is 0.465, between the tuned item-item model (0.424) and the tuned SVD model (0.502).

In [None]:
# Using the clust_tuned model to predict the play count for user_id 6958 and song_id 1671
clust_tuned.predict(6958, 1671, verbose = True)

In [None]:
# Using the clust_tuned model to predict the play count for user_id 6958 and song_id 3232 with unknown baseline rating
clust_tuned.predict(6958, 3232, verbose = True)

💡 **Observations and Insights:**

*    The estimated rating is the same for both song_ids.

#### Implementing the recommendation algorithm based on optimized CoClustering model

In [None]:
def get_recommendations(data, user_id, top_n, algo):
    
    # Creating an empty list to store the recommended song ids
    recommendations = []
    
    # Creating a user-item interactions matrix
    user_item_interactions_matrix = data.pivot_table(index = 'user_id', columns = 'song_id', values = 'play_count')
    
    # Extracting the song ids which the user_id has not interacted with yet
    non_interacted_songs = user_item_interactions_matrix.loc[user_id][user_item_interactions_matrix.loc[user_id].isnull()].index.tolist()
    
    # Looping through each of the song ids which user_id has not interacted with yet
    for item_id in non_interacted_songs:
        
        # Predicting the play count of each non-interacted song for this user
        est = algo.predict(user_id, item_id).est
        
        # Appending the predicted play count
        recommendations.append((item_id, est))

    # Sorting the predicted play counts in descending order
    recommendations.sort(key = lambda x: x[1], reverse = True)

    # Returning the top n songs with the highest predicted play counts for this user
    return recommendations[:top_n]

In [None]:
# Getting top 5 recommendations for user_id 6958 using "Co-clustering based optimized" algorithm
clustering_recommendations = get_recommendations(df_final, 6958, 5, clust_tuned)

### Correcting the play_count and Ranking the above songs

In [None]:
# Ranking songs based on the above recommendations
ranking_songs(clustering_recommendations, final_play)

💡 **Observations and Insights:**


*    The average play_freq of these songs is lower than that of the songs recommended by the item-item model.
*    The highest predicted rating is 3.7, which is close to the corrected rating of 3.6.

### Content Based Recommendation Systems

**Think About It:** So far we have only used the play_count of songs to find recommendations but we have other information/features on songs as well. Can we take those song features into account?
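
One way to take text features into account is to turn each song's text into a vector of word counts and compare songs by cosine similarity. The sketch below is a simplified bag-of-words version that uses raw term counts rather than TF-IDF weights. The strings in `texts` are toy examples, not rows from the dataset, and the helper `cosine` is my own.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two token-count dictionaries."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy "title release artist_name" strings, not rows from the dataset
texts = {
    'song_a': 'night drive city lights band one',
    'song_b': 'city lights live band one',
    'song_c': 'morning rain quiet album singer two',
}

# One word-count vector per song
vectors = {name: Counter(text.split()) for name, text in texts.items()}

sim_ab = cosine(vectors['song_a'], vectors['song_b'])
sim_ac = cosine(vectors['song_a'], vectors['song_c'])
print(round(sim_ab, 3), round(sim_ac, 3))
```

Songs that share an album or artist share many words and score close to 1, while songs with no words in common score 0.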

In [None]:
# Work on a copy so that the changes below do not modify df_final
df_small = df_final.copy()

In [None]:
# Concatenate the "title", "release", "artist_name" columns to create a different column named "text"
df_small['text'] = df_small['title'] + ' ' + df_small['release'] + ' ' + df_small['artist_name']

In [None]:
# Select the columns 'user_id', 'song_id', 'play_count', 'title', 'text' from df_small data
df_small = df_small[['user_id', 'song_id', 'play_count', 'title', 'text']]

# Drop the duplicates from the title column
df_small = df_small.drop_duplicates(subset = ['title'])

# Set the title column as the index
df_small = df_small.set_index('title')

# See the first 5 records of the df_small dataset
df_small.head()


In [None]:
# Create the series of indices from the data
indices = pd.Series(df_small.index)

indices[ : 5]

In [None]:
# Importing necessary packages to work with text data
import nltk
nltk.download('omw-1.4')

# Download punkt library
nltk.download("punkt")

# Download stopwords library
nltk.download("stopwords")

# Download wordnet 
nltk.download("wordnet")

# Import regular expression
import re

# Import word_tokenizer
from nltk import word_tokenize

# Import WordNetLemmatizer
from nltk.stem import WordNetLemmatizer

# Import stopwords
from nltk.corpus import stopwords

# Import CountVectorizer and TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

We will create a **function to pre-process the text data:**

In [None]:
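# Build the stopword set and the lemmatizer once, rather than once per token
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Function to tokenize the text
def tokenize(text):
    
    # Lowercase the text and replace non-alphabetic characters with spaces
    text = re.sub(r"[^a-zA-Z]", " ", text.lower())
    
    # Split the text into tokens
    tokens = word_tokenize(text)
    
    # Remove English stopwords
    words = [word for word in tokens if word not in stop_words]
    
    # Lemmatize each remaining word
    text_lems = [lemmatizer.lemmatize(word).strip() for word in words]

    return text_lems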


In [None]:
# Create tfidf vectorizer 
tfidf = TfidfVectorizer(tokenizer = tokenize)

# Fit and transform the above vectorizer on the text column and then convert the output into an array
song_tfidf = tfidf.fit_transform(df_small['text'].values).toarray()

In [None]:
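# Import cosine_similarity (needed if it has not already been imported earlier in the notebook)
from sklearn.metrics.pairwise import cosine_similarity

# Compute the cosine similarity for the above tfidf output
similar_songs = cosine_similarity(song_tfidf, song_tfidf)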

similar_songs

Finally, let's create a function to find the most similar songs to recommend for a given song.

In [None]:
# Function that takes in song title as input and returns the top 10 recommended songs
def recommendations(title, similar_songs):
    
    recommended_songs = []
    
    # Getting the index of the song that matches the title
    idx = indices[indices == title].index[0]

    # Creating a Series with the similarity scores in descending order
    score_series = pd.Series(similar_songs[idx]).sort_values(ascending = False)

    # Getting the indexes of the 10 most similar songs
    top_10_indexes = list(score_series.iloc[1 : 11].index)
    print(top_10_indexes)
    
    # Populating the list with the titles of the best 10 matching songs
    for i in top_10_indexes:
        recommended_songs.append(df_small.index[i])
        
    return recommended_songs

Recommending 10 songs similar to 'Learn To Fly':

In [None]:
# Make the recommendation for the song with title 'Learn To Fly'
recommendations('Learn To Fly', similar_songs)