# Recommendation System

A recommender system, or recommendation system (sometimes with "system" replaced by a synonym such as platform or engine), is a subclass of information filtering system that seeks to predict the "rating" or "preference" a user would give to an item. Such systems are primarily used in commercial applications.

Recommender systems are used in a variety of areas and are most commonly recognized as playlist generators for video and music services such as Netflix, YouTube, and Spotify, product recommenders for services such as Amazon, or content recommenders for social media platforms such as Facebook and Twitter. These systems can operate on a single kind of input, such as music, or on multiple inputs within and across platforms, such as news, books, and search queries. There are also popular recommender systems for specific topics such as restaurants and online dating. Recommender systems have also been developed for exploring research articles, experts, collaborators, financial services, and life insurance.

In [3]:
import pandas as pd
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import re
import html
from html.parser import HTMLParser
import os
import warnings
warnings.filterwarnings('ignore')

In [6]:
os.chdir(r'C:\Users\ganes\Downloads\ML JOB\Coding')

In [7]:
artists = pd.read_csv('artist.csv', encoding='latin-1')
artists = artists.rename(columns={'id': 'artistID'})
artists.head()

Unnamed: 0,artistID,name
0,1,MALICE MIZER
1,2,Diary of Dreams
2,3,Carpathian Forest
3,4,Moi dix Mois
4,5,Bella Morte


In [8]:
artists.shape

(17632, 2)

In [9]:
# Cleaning artist name
def clean_column(column):
    new_text = html.unescape(column)  # converts HTML character references (e.g. &amp;) to the characters they stand for
    lower_case = new_text.lower()  # converts all the letters to lower case
    spaces = re.sub(r'\s+', ' ', lower_case)  # collapses runs of whitespace into a single space
    link = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', '', spaces)  # removes URLs starting with www. or http(s)://
    # Keeps only ASCII letters and whitespace: digits, punctuation and accented
    # letters are dropped, so a name such as "P!nk" becomes "pnk"
    alpha_numeric = re.sub(r'[^a-zA-Z\s]+', '', link)
    return alpha_numeric

# Apply the cleaning function to every artist name in the data set
artists['cleaned_artist_name'] = artists['name'].apply(clean_column)

# Replacing empty or whitespace-only strings with NaN
artists['cleaned_artist_name'] = artists['cleaned_artist_name'].replace(r'^\s*$', np.nan, regex=True)
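The character filter in `clean_column` keeps only ASCII letters, so accented names lose characters (the cleaned data contains `ryksopp`, for instance). Below is a minimal sketch of one alternative, assuming it is acceptable to transliterate accented letters to their base letters before filtering. The helper names `strip_accents` and `clean_name` are hypothetical and not part of this notebook; only the standard-library `re` and `unicodedata` modules are used.

```python
import re
import unicodedata

def strip_accents(text):
    """Hypothetical helper: map accented letters to their unaccented base letters."""
    # NFKD splits a character such as "é" into "e" plus a combining accent mark
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep everything else
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def clean_name(name):
    """Lowercase, transliterate accents, then keep ASCII letters and single spaces."""
    text = strip_accents(name).lower()
    text = re.sub(r"[^a-z\s]+", "", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_name("Beyoncé"))   # beyonce
print(clean_name("Röyksopp"))  # royksopp
```

Symbols that carry meaning, such as the "!" in "P!nk", are still removed by this sketch; handling them would need a hand-written mapping.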

In [10]:
artists['cleaned_artist_name'].head(10)

0         malice mizer
1      diary of dreams
2    carpathian forest
3         moi dix mois
4          bella morte
5            moonspell
6       marilyn manson
7          dir en grey
8          combichrist
9              grendel
Name: cleaned_artist_name, dtype: object

In [11]:
artists['cleaned_artist_name'].shape

(17632,)

In [12]:
# artists['cleaned_artist_name'].isnull().sum(axis=0)

In [46]:
user_artist = pd.read_csv('user_artist.csv')
user_artist.head(10)

Unnamed: 0,userID,artistID,weight
0,2,51,13883
1,2,52,11690
2,2,53,11351
3,2,54,10300
4,2,55,8983
5,2,56,6152
6,2,57,5955
7,2,58,4616
8,2,59,4337
9,2,60,4147


In [14]:
# Merging two dataframes
df_ratings = user_artist.join(artists.set_index('artistID'), on='artistID')
df_ratings = df_ratings.rename(columns={'name': 'artist_name'})
df_ratings.head()

Unnamed: 0,userID,artistID,weight,artist_name,cleaned_artist_name
0,2,51,13883,Duran Duran,duran duran
1,2,52,11690,Morcheeba,morcheeba
2,2,53,11351,Air,air
3,2,54,10300,Hooverphonic,hooverphonic
4,2,55,8983,Kylie Minogue,kylie minogue


In [15]:
drop_columns = ['weight', 'artist_name']
df_ratings = df_ratings.drop(drop_columns, axis = 1)
df_ratings.head()

Unnamed: 0,userID,artistID,cleaned_artist_name
0,2,51,duran duran
1,2,52,morcheeba
2,2,53,air
3,2,54,hooverphonic
4,2,55,kylie minogue


In [16]:
df_ratings['cleaned_artist_name'].value_counts()

lady gaga                                   611
britney spears                              525
rihanna                                     485
the beatles                                 480
katy perry                                  474
madonna                                     429
avril lavigne                               417
christina aguilera                          407
muse                                        400
paramore                                    400
beyonc                                      397
radiohead                                   393
coldplay                                    369
keha                                        362
shakira                                     319
pnk                                         305
the killers                                 304
black eyed peas                             304
kylie minogue                               298
miley cyrus                                 287
depeche mode                            

I would like to rate the artists by how many times users have rated them. Lady Gaga has been rated the most, 611 times, and many other artists were rated more than 200 times. Artists rated 400 or more times are treated as 5-star artists, 300 to 399 times as 4-star artists, 200 to 299 times as 3-star artists, 100 to 199 times as 2-star artists, and fewer than 100 times as 1-star artists.

In [17]:
pd.set_option('display.max_rows', 500)
df_ratings['# artist_rated'] = df_ratings.groupby(['artistID'])['cleaned_artist_name'].transform('count')
df_ratings.head()

Unnamed: 0,userID,artistID,cleaned_artist_name,# artist_rated
0,2,51,duran duran,111
1,2,52,morcheeba,23
2,2,53,air,75
3,2,54,hooverphonic,18
4,2,55,kylie minogue,298


Now let's assign each artist a star rating based on the number of times it was rated.

In [18]:
df_ratings['rating'] = df_ratings['# artist_rated'].apply(lambda x:5 if x >= 400 else 4 if x >= 300 else 3 if x >= 200 
                                                          else 2 if x >= 100 else 1) 
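The same thresholds can also be expressed as bins. The sketch below uses `pd.cut` on a small made-up Series (`counts` is illustrative, standing in for the `# artist_rated` column) with left-closed intervals, so that a count of exactly 400 lands in the 5-star bin just as it does with `x >= 400`.

```python
import pandas as pd

# Toy counts standing in for the '# artist_rated' column
counts = pd.Series([611, 400, 399, 298, 111, 99, 18])

# Bin edges are left-closed: [0, 100) -> 1, [100, 200) -> 2, ..., [400, inf) -> 5
stars = pd.cut(
    counts,
    bins=[0, 100, 200, 300, 400, float("inf")],
    labels=[1, 2, 3, 4, 5],
    right=False,
).astype(int)

print(stars.tolist())  # [5, 5, 4, 3, 2, 1, 1]
```

Keeping the edges in one list makes it easier to see and change the thresholds than a chain of conditionals.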

In [19]:
df_ratings.head()

Unnamed: 0,userID,artistID,cleaned_artist_name,# artist_rated,rating
0,2,51,duran duran,111,2
1,2,52,morcheeba,23,1
2,2,53,air,75,1
3,2,54,hooverphonic,18,1
4,2,55,kylie minogue,298,3


In [20]:
df_ratings['rating'].value_counts()

1    67925
2    11224
3     6310
5     4223
4     3152
Name: rating, dtype: int64

### Building a simple recommendation system

One approach to the design of recommender systems that has wide use is collaborative filtering. Collaborative filtering is based on the assumption that people who agreed in the past will agree in the future, and that they will like similar kinds of items as they liked in the past. The system generates recommendations using only information about rating profiles for different users or items.

### User-User Collaborative filtering

This algorithm first finds the similarity score between users. Based on this similarity score, it then picks out the most similar users and recommends products which these similar users have liked or bought previously.

![image](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/05/0o0zVW2O6Rv-LI5Mu1-850x466.png)

### Item-Item Collaborative filtering

In this algorithm, we compute the similarity between each pair of items and recommend items that are similar to the ones a user has already rated.

![image](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/05/1skK2fqWiBF7weHU8SjuCzw.png)
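To make the two variants concrete, here is a minimal NumPy sketch on a made-up 3 x 4 ratings matrix. The function `cosine_similarity_rows` is a hypothetical helper written for this illustration. User-user similarity compares rows of the matrix, and item-item similarity compares its columns, which is the same computation on the transpose.

```python
import numpy as np

def cosine_similarity_rows(matrix):
    """Cosine similarity between every pair of rows; all-zero rows get similarity 0."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty rows
    unit = matrix / norms
    return unit @ unit.T

# Toy ratings: 3 users (rows) x 4 artists (columns), 0 means "not rated"
ratings = np.array([
    [5.0, 3.0, 0.0, 0.0],
    [4.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 5.0, 4.0],
])

user_sim = cosine_similarity_rows(ratings)    # shape (3, 3): user-user
item_sim = cosine_similarity_rows(ratings.T)  # shape (4, 4): item-item

print(np.round(user_sim, 2))
print(np.round(item_sim, 2))
```

In this toy matrix the first two users rated the same artists and come out as highly similar, while the first and third users share no rated artist and get a similarity of 0.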

In [21]:
n_users = df_ratings.userID.unique().shape[0]
n_items = df_ratings.artistID.unique().shape[0]

In [22]:
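# Rows are userID - 1 and columns are artistID - 1; IDs that never occur stay as all-zero rows or columns
data_matrix = np.zeros((df_ratings['userID'].max(), df_ratings['artistID'].max()))

# Populate the matrix from the dataset
# (in each tuple, position 0 is the index, 1 is userID, 2 is artistID and 5 is rating)
for line in df_ratings.itertuples():
    data_matrix[line[1]-1, line[2]-1] = line[5]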
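The row-by-row loop can also be written as a single NumPy assignment with integer-array indexing. The sketch below uses a small made-up table (`toy` and its values are illustrative, standing in for `df_ratings`) and the same "ID minus one" convention for rows and columns.

```python
import numpy as np
import pandas as pd

# Toy ratings table with 1-based IDs, standing in for df_ratings
toy = pd.DataFrame({
    "userID":   [1, 1, 2, 3],
    "artistID": [1, 3, 2, 3],
    "rating":   [5, 1, 2, 4],
})

matrix = np.zeros((toy["userID"].max(), toy["artistID"].max()))

# Integer-array indexing assigns every (row, column) pair at once
rows = toy["userID"].to_numpy() - 1
cols = toy["artistID"].to_numpy() - 1
matrix[rows, cols] = toy["rating"].to_numpy()

print(matrix)
```

Referring to columns by name also avoids depending on the position of `rating` within each tuple.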


In [23]:
# data_matrix.unique.shape[0]
np.unique(data_matrix)

array([0., 1., 2., 3., 4., 5.])

In [24]:
data_matrix_df = pd.DataFrame(data_matrix)
data_matrix_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,18735,18736,18737,18738,18739,18740,18741,18742,18743,18744
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


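Now, we will calculate the similarity. We can use the `pairwise_distances` function from sklearn with `metric='cosine'`. Note that this function returns the cosine distance, which is 1 minus the cosine similarity: 0 means two rows point in the same direction, and 1 means they have no rated artists in common. This gives us the item-item and user-user scores in array form. The next step is to make predictions based on these arrays. Let’s define a function to do just that.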

In [25]:
from sklearn.metrics.pairwise import pairwise_distances 
user_similarity = pairwise_distances(data_matrix, metric='cosine')
item_similarity = pairwise_distances(data_matrix.T, metric='cosine')
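Because `pairwise_distances` with `metric='cosine'` returns distances, smaller values mean "more alike". The sketch below uses a made-up 4 x 4 distance matrix (the numbers are illustrative, not taken from the data) to show the conversion to similarity and the sort direction that puts the nearest neighbours first.

```python
import numpy as np

# Toy cosine-distance matrix for 4 items (0 on the diagonal, 1 = nothing in common)
distance = np.array([
    [0.0, 0.2, 1.0, 0.6],
    [0.2, 0.0, 0.9, 0.5],
    [1.0, 0.9, 0.0, 0.3],
    [0.6, 0.5, 0.3, 0.0],
])

similarity = 1.0 - distance  # cosine similarity = 1 - cosine distance

item = 0
# Ascending distance puts the item itself first, then its nearest neighbours
by_distance = np.argsort(distance[item])
# Descending similarity gives the same order
by_similarity = np.argsort(-similarity[item])

print(by_distance.tolist())    # [0, 1, 3, 2]
print(by_similarity.tolist())  # [0, 1, 3, 2]
```

Sorting a distance row in descending order would instead list the least similar items first.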

In [26]:
user_similarity_dataframe = pd.DataFrame(user_similarity)
user_similarity_dataframe.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2090,2091,2092,2093,2094,2095,2096,2097,2098,2099
0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1,1.0,0.0,1.0,0.777676,0.994236,0.989058,0.732963,0.855662,1.0,1.0,...,0.913207,1.0,0.990247,0.843848,1.0,0.972507,0.862081,1.0,1.0,1.0
2,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.98,1.0
3,1.0,0.777676,1.0,0.0,0.69243,0.989385,1.0,1.0,1.0,0.817761,...,1.0,0.936908,0.990538,1.0,1.0,0.586584,0.719654,1.0,1.0,1.0
4,1.0,0.994236,1.0,0.69243,0.0,1.0,1.0,1.0,1.0,0.644492,...,1.0,0.53816,1.0,1.0,1.0,0.549292,0.66746,1.0,1.0,1.0


In [27]:
item_similarity_dataframe =  pd.DataFrame(item_similarity)
item_similarity_dataframe.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,18735,18736,18737,18738,18739,18740,18741,18742,18743,18744
0,0.0,1.0,1.0,1.0,1.0,1.0,0.949937,0.833333,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1,1.0,0.0,1.0,1.0,0.795876,1.0,0.924906,1.0,0.646447,0.711325,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2,1.0,1.0,0.0,1.0,1.0,0.452277,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
3,1.0,1.0,1.0,0.0,1.0,1.0,0.938686,0.795876,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
4,1.0,0.795876,1.0,1.0,0.0,1.0,0.938686,1.0,0.855662,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [28]:
def predict(data_matrix, similarity, type='user'):
    if type == 'user':
        mean_user_rating = data_matrix.mean(axis=1)
        # np.newaxis turns mean_user_rating into a column so that it broadcasts across each user's row.
        # The ratings are used as-is here; the mean-centered alternative would be
        # ratings_diff = data_matrix - mean_user_rating[:, np.newaxis]
        ratings_diff = data_matrix
        # Each user's mean plus the weighted average of all users' ratings
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        # Weighted average of the user's own ratings across all items
        pred = data_matrix.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    else:
        raise ValueError("type must be 'user' or 'item'")
    return pred
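For comparison, here is a minimal sketch of the mean-centered form of user-based prediction: each user's mean rating plus the similarity-weighted average of the other users' deviations from their own means. The function name `predict_user_based` and the toy matrices are assumptions made for illustration. As in the notebook, unrated entries are stored as 0 and therefore count toward each mean.

```python
import numpy as np

def predict_user_based(ratings, similarity):
    """Mean-centered, similarity-weighted prediction (a sketch, not the notebook's exact formula)."""
    mean_user = ratings.mean(axis=1, keepdims=True)   # shape (n_users, 1)
    centered = ratings - mean_user                    # deviation from each user's mean
    weights = np.abs(similarity).sum(axis=1, keepdims=True)
    weights[weights == 0] = 1.0                       # guard against users with no neighbours
    return mean_user + (similarity @ centered) / weights

ratings = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 2.0, 1.0],
    [1.0, 0.0, 5.0],
])
# Assumed similarity values (higher = more alike), made up for illustration
similarity = np.array([
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])

pred = predict_user_based(ratings, similarity)
print(pred.shape)  # (3, 3)
print(np.round(pred, 2))
```

Centering removes the effect of users who rate everything high or low, so the weighted sum only transfers how much each neighbour liked an item relative to their own average.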

In [29]:
user_prediction = predict(data_matrix, user_similarity, type='user')
item_prediction = predict(data_matrix, item_similarity, type='item')

In [30]:
user_prediction.shape

(2100, 18745)

In [31]:
item_prediction.shape

(2100, 18745)

In [32]:
user_prediction_df = pd.DataFrame(user_prediction)
user_prediction_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,18735,18736,18737,18738,18739,18740,18741,18742,18743,18744
0,0.001429,0.005717,0.001429,0.000953,0.000953,0.004764,0.126727,0.005717,0.011434,0.004288,...,0.000476,0.000476,0.000476,0.000476,0.000476,0.000476,0.000476,0.000476,0.000476,0.000476
1,0.005235,0.009589,0.005338,0.004766,0.004761,0.00897,0.12887,0.009592,0.015298,0.008339,...,0.004269,0.004269,0.004269,0.004269,0.004269,0.004269,0.004269,0.004269,0.004269,0.004208
2,0.004097,0.008387,0.004097,0.003621,0.003621,0.007434,0.129455,0.008387,0.014107,0.006957,...,0.003144,0.003144,0.003144,0.003144,0.003144,0.003144,0.003144,0.003144,0.003144,0.003144
3,0.005197,0.009375,0.005494,0.004839,0.004705,0.0088,0.124106,0.009524,0.014496,0.008276,...,0.004429,0.004429,0.004429,0.004429,0.004429,0.004429,0.004429,0.004429,0.004429,0.004262
4,0.005902,0.010415,0.006293,0.005601,0.005499,0.009725,0.120328,0.010401,0.015905,0.009025,...,0.005171,0.005171,0.005171,0.005171,0.005171,0.005171,0.005171,0.005171,0.005171,0.00494


In [33]:
item_prediction_df = pd.DataFrame(item_prediction)
item_prediction_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,18735,18736,18737,18738,18739,18740,18741,18742,18743,18744
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.003716,0.003698,0.003743,0.003734,0.003732,0.003738,0.003555,0.00371,0.003665,0.003732,...,0.00374,0.00374,0.00374,0.00374,0.00374,0.00374,0.00374,0.00374,0.00374,0.003708
2,0.002672,0.002677,0.002674,0.002672,0.002671,0.002676,0.002689,0.002676,0.002681,0.002677,...,0.002671,0.002671,0.002671,0.002671,0.002671,0.002671,0.002671,0.002671,0.002671,0.00267
3,0.003847,0.003819,0.0039,0.003856,0.003829,0.003857,0.003615,0.003834,0.003754,0.003848,...,0.0039,0.0039,0.0039,0.0039,0.0039,0.0039,0.0039,0.0039,0.0039,0.003856
4,0.00457,0.004593,0.004653,0.004615,0.004598,0.004624,0.004201,0.004564,0.00454,0.004601,...,0.004638,0.004638,0.004638,0.004638,0.004638,0.004638,0.004638,0.004638,0.004638,0.00457


### Getting recommendations based on item and user similarity

In [49]:
# Build a list of unique artist names; a name's position in this list is used as its index
indices = df_ratings['cleaned_artist_name'].drop_duplicates().tolist()

In [50]:
# Function that takes a cleaned artist name as input and returns 10 artist names ranked by their scores
def get_recommendations(cleaned_artist_name, cosine_sim=item_similarity):
    # Get the position of the artist in the list of unique names
    # (caveat: the matrix columns are ordered by artistID - 1, not by this position)
    idx = indices.index(cleaned_artist_name)

    # Get the pairwise scores of all artists with that artist
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort by score, largest first (caveat: the scores are cosine distances,
    # so the largest values belong to the least similar artists)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep 10 entries, skipping the first one
    sim_scores = sim_scores[1:11]

    # Get the artist indices
    artist_indices = [i[0] for i in sim_scores]

    # Look up the names by row position in df_ratings
    return df_ratings['cleaned_artist_name'].iloc[artist_indices]

In [53]:
get_recommendations('coldplay')

1         morcheeba
3      hooverphonic
4     kylie minogue
7         goldfrapp
13          ryksopp
15        faithless
18             sade
19             moby
20             dido
21     depeche mode
Name: cleaned_artist_name, dtype: object
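For comparison, here is a minimal sketch of an artist lookup that ranks by smallest cosine distance and uses one shared list to map matrix columns to names. Everything in it is an assumption made for illustration: the function name `similar_artists`, the four-artist list, and the distance values, which are made up and not computed from the data set.

```python
import numpy as np

# Toy stand-ins: artist_names[j] is the artist stored in column j of the matrix
artist_names = ["coldplay", "radiohead", "muse", "lady gaga"]
# Assumed cosine distances between the columns (0 = identical, 1 = unrelated)
item_distance = np.array([
    [0.0, 0.2, 0.3, 0.9],
    [0.2, 0.0, 0.4, 1.0],
    [0.3, 0.4, 0.0, 0.8],
    [0.9, 1.0, 0.8, 0.0],
])

def similar_artists(name, names, distance, top_n=2):
    """Return the top_n names whose columns are closest to `name`."""
    idx = names.index(name)                 # column index of the query artist
    order = np.argsort(distance[idx])       # smallest distance first
    order = [j for j in order if j != idx]  # drop the artist itself
    return [names[j] for j in order[:top_n]]

print(similar_artists("coldplay", artist_names, item_distance))
# ['radiohead', 'muse']
```

The key design choice is that the same index `j` refers to a column of the distance matrix and to an entry of the name list, so no lookup goes through row positions of the ratings table.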

In [55]:
# Build a list of unique user IDs; a user's position in this list is used as its index
indices_user = df_ratings['userID'].drop_duplicates().tolist()

In [56]:
# Function that takes a userID as input and returns 10 artist names
def get_recommendations(userID, cosine_sim=user_similarity):
    # Get the position of the userID in the list of unique users
    # (caveat: the matrix rows are ordered by userID - 1, not by this position)
    idx = indices_user.index(userID)

    # Get the pairwise scores of all users with that user
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort by score, largest first (caveat: the scores are cosine distances,
    # so the largest values belong to the least similar users)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep 10 entries, skipping the first one
    sim_scores = sim_scores[1:11]

    # Get the user indices
    user_indices = [i[0] for i in sim_scores]

    # Look up artist names by row position in df_ratings, using the user indices as positions
    return df_ratings['cleaned_artist_name'].iloc[user_indices]

In [57]:
get_recommendations(2100)

2                air
5          daft punk
15         faithless
18              sade
19              moby
20              dido
21      depeche mode
27    the adventures
32        cock robin
34    spandau ballet
Name: cleaned_artist_name, dtype: object
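A user-based recommendation usually goes one step further than listing neighbours: it finds the most similar users and then suggests artists they rated that the target user has not. The sketch below shows that idea on a made-up 3 x 4 matrix. The function name `recommend_for_user`, the parameter `k`, and the data are all assumptions for illustration.

```python
import numpy as np

# Toy ratings: rows are users, columns are artists, 0 means "not rated"
artist_names = ["air", "moby", "dido", "sade"]
ratings = np.array([
    [5.0, 4.0, 0.0, 0.0],   # target user
    [5.0, 5.0, 3.0, 0.0],   # similar taste, also rated "dido"
    [0.0, 0.0, 0.0, 4.0],   # no overlap with the target user
])

def recommend_for_user(user, ratings, names, k=1):
    """Suggest unrated artists, scored by the ratings of the k most similar users."""
    norms = np.linalg.norm(ratings, axis=1)
    norms[norms == 0] = 1.0
    sims = ratings @ ratings[user] / (norms * norms[user])  # cosine similarity to every user
    sims[user] = -np.inf                                    # exclude the user themself
    neighbours = np.argsort(-sims)[:k]
    scores = sims[neighbours] @ ratings[neighbours]         # similarity-weighted ratings
    scores[ratings[user] > 0] = -np.inf                     # skip artists already rated
    ranked = np.argsort(-scores)
    return [names[j] for j in ranked if scores[j] > 0]

print(recommend_for_user(0, ratings, artist_names))  # ['dido']
```

Here the second user is the nearest neighbour of the first, so the one artist that neighbour rated and the target has not is the only suggestion.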