# COLLABORATIVE FILTERING RS

Edison Alexander Mosquera -
Luis Fernando Valencia

## Implementing a recommender system based on collaborative filtering

Download the anime dataset available [here](https://drive.google.com/drive/folders/1F8e7Dwt-On2apF6pQsEAKARqVqSA1d8a?usp=sharing).


Import the necessary libraries

In [19]:
import pandas as pd
import numpy as np
from math import sqrt


Load the anime and rating datasets. A rating of -1 means the user watched the item but didn't rate it.

In [20]:
anime_df = pd.read_csv('anime.csv')
ratings_df = pd.read_csv('rating.csv')
ratings_df.head()

Unnamed: 0,user_id,anime_id,rating
0,1,20,-1
1,1,24,-1
2,1,79,-1
3,1,226,-1
4,1,241,-1
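
One way to make the -1 placeholder harder to misuse later is to convert it to a missing value right after loading. The snippet below is a minimal sketch on a hypothetical toy frame (`toy_ratings` and its values are illustrative, not taken from `rating.csv`); the notebook itself keeps the -1 values and filters them in a later step.

```python
import numpy as np
import pandas as pd

# Toy stand-in for rating.csv; the values are illustrative only.
toy_ratings = pd.DataFrame({
    'user_id':  [1, 1, 2, 2],
    'anime_id': [20, 24, 20, 79],
    'rating':   [-1, 8, 10, -1],
})

# Treat -1 ("watched but not rated") as a missing value.
toy_ratings['rating'] = toy_ratings['rating'].replace(-1, np.nan)

# Keep only the rows that carry an explicit rating.
explicit = toy_ratings.dropna(subset=['rating'])
print(explicit)
```

With missing values in place, pandas aggregations such as `mean()` skip the unrated rows by default instead of averaging in a -1.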


In [21]:
ratings_df.tail()

Unnamed: 0,user_id,anime_id,rating
7813732,73515,16512,7
7813733,73515,17187,9
7813734,73515,22145,10
7813735,73516,790,9
7813736,73516,8074,9


Remove irrelevant columns

In [22]:
# Pass the labels through `columns=`; the positional axis argument is not accepted by recent pandas versions
anime_df = anime_df.drop(columns=['genre', 'type', 'episodes', 'members'])
anime_df.head()

Unnamed: 0,anime_id,name,rating
0,32281,Kimi no Na wa.,9.37
1,5114,Fullmetal Alchemist: Brotherhood,9.26
2,28977,Gintama°,9.25
3,9253,Steins;Gate,9.17
4,9969,Gintama&#039;,9.16


In [23]:
ratings_df.head()

Unnamed: 0,user_id,anime_id,rating
0,1,20,-1
1,1,24,-1
2,1,79,-1
3,1,226,-1
4,1,241,-1


We create a new user with five items and their respective ratings. 

In [24]:
userInput = [
            {'name':'Fullmetal Alchemist: Brotherhood', 'rating':4},
            {'name':'Cowboy Bebop', 'rating':5},
            {'name':'Death Note', 'rating':4.5},
            {'name':'Clannad: After Story', 'rating':3},
            {'name':'Koe no Katachi', 'rating':2}             
]
inputItems = pd.DataFrame(userInput)
inputItems

Unnamed: 0,name,rating
0,Fullmetal Alchemist: Brotherhood,4.0
1,Cowboy Bebop,5.0
2,Death Note,4.5
3,Clannad: After Story,3.0
4,Koe no Katachi,2.0


Find the anime_id of every anime rated by the new user.

In [25]:
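# Drop the catalogue's average rating so it does not clash with the user's own 'rating' column
animetemp_df = anime_df.drop(columns='rating')
inputId = animetemp_df[animetemp_df['name'].isin(inputItems['name'].tolist())]
# Merge on the title to attach the anime_id to each input rating
inputItems = pd.merge(inputId, inputItems, on='name')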
inputItems

Unnamed: 0,anime_id,name,rating
0,5114,Fullmetal Alchemist: Brotherhood,4.0
1,4181,Clannad: After Story,3.0
2,28851,Koe no Katachi,2.0
3,1,Cowboy Bebop,5.0
4,1535,Death Note,4.5
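
Because the lookup matches on the exact title, a misspelled name would be dropped silently by the merge. A minimal sketch of a check, assuming hypothetical toy frames `catalog` and `user_input` (the ids are the ones shown above; the misspelled title is invented for illustration):

```python
import pandas as pd

# Small stand-in for anime_df with ids taken from the output above.
catalog = pd.DataFrame({
    'anime_id': [5114, 1, 1535],
    'name': ['Fullmetal Alchemist: Brotherhood', 'Cowboy Bebop', 'Death Note'],
})

# Hypothetical user input with one deliberately misspelled title.
user_input = pd.DataFrame({
    'name': ['Cowboy Bebop', 'Death Note', 'Deth Note'],
    'rating': [5.0, 4.5, 3.0],
})

# A left merge with indicator=True marks rows that found no match in the catalogue.
checked = user_input.merge(catalog, on='name', how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'name'].tolist()
print(unmatched)
```

Any title listed in `unmatched` would need to be corrected before computing similarities.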


Select the users who have rated the same items as the new user. We skip ratings of -1 (users who watched the anime but didn't rate it).

In [26]:
userSubset = ratings_df[ratings_df['anime_id'].isin(inputItems['anime_id'].tolist())]
# Keep only explicit ratings; .copy() avoids pandas' SettingWithCopyWarning
userSubset = userSubset[userSubset['rating'] != -1].copy()
userSubset.head()

Unnamed: 0,user_id,anime_id,rating
173,3,1535,10
183,3,5114,10
396,5,1535,4
849,7,1535,9
876,7,4181,9


Group the rows by user_id and sort the groups so that users who have rated more of the new user's anime come first

In [27]:
userSubsetGroup = userSubset.groupby('user_id')
userSubsetGroup = sorted(userSubsetGroup, key=lambda x: len(x[1]), reverse=True)
userSubsetGroup[0:3]

[(18051,          user_id  anime_id  rating
  1863694    18051         1       7
  1863729    18051      1535       8
  1863756    18051      4181      10
  1863763    18051      5114       9
  1864056    18051     28851      10),
 (20049,          user_id  anime_id  rating
  2070899    20049         1       8
  2070930    20049      1535       7
  2070959    20049      4181       7
  2070969    20049      5114       7
  2071099    20049     28851       9),
 (27153,          user_id  anime_id  rating
  2913426    27153         1       9
  2913444    27153      1535       9
  2913451    27153      4181       9
  2913454    27153      5114       9
  2913568    27153     28851       9)]
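
The same ranking by overlap can be sketched with `value_counts`, which also makes it easy to require a minimum number of shared anime. This matters because a correlation computed from only two shared ratings is always +1, -1 or undefined. The frame `toy_subset` and the threshold of 2 are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for userSubset; the rows are illustrative only.
toy_subset = pd.DataFrame({
    'user_id':  [3, 3, 5, 7, 7, 7],
    'anime_id': [1535, 5114, 1535, 1535, 4181, 1],
    'rating':   [10, 10, 4, 9, 9, 8],
})

# Number of shared anime per user, sorted from most to fewest.
overlap = toy_subset['user_id'].value_counts()
ranked_users = overlap.index.tolist()

# Hypothetical threshold: keep users sharing at least two anime with the new user.
enough_overlap = overlap[overlap >= 2].index.tolist()
print(ranked_users, enough_overlap)
```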

Take a subset of these users (the first 100) and calculate the Pearson correlation coefficient (PCC) between the new user and each of them. The results are stored in a dict whose keys are user_id values and whose values are the PCC.
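
For reference, the quantities `Sxx`, `Syy` and `Sxy` in the next cell match the computational form of the Pearson correlation. The notation here is an assumption introduced for this note: $x_i$ are the new user's ratings, $y_i$ the other user's ratings, and $n$ the number of anime both have rated.

$$
S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}, \qquad
S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}
$$

$$
S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}, \qquad
r = \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}}
$$

When $S_{xx} = 0$ or $S_{yy} = 0$ (one of the users gave the same rating to every shared anime), $r$ is undefined and the code stores 0 instead.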

In [28]:
userSubsetGroup = userSubsetGroup[0:100]
pearsonCorrelationDict = {}
# Sort once so the new user's ratings line up with each group's ratings by anime_id
inputItems = inputItems.sort_values(by='anime_id')
for name, group in userSubsetGroup:
  group = group.sort_values(by='anime_id')
  nRatings = len(group)
  # Ratings of the new user restricted to the anime this user has also rated
  temp_df = inputItems[inputItems['anime_id'].isin(group['anime_id'].tolist())]
  tempRatingList = temp_df['rating'].tolist()
  tempGroupList = group['rating'].tolist()
  # Sums of squares and cross-products used by the Pearson formula
  Sxx = sum(i**2 for i in tempRatingList) - pow(sum(tempRatingList), 2)/float(nRatings)
  Syy = sum(i**2 for i in tempGroupList) - pow(sum(tempGroupList), 2)/float(nRatings)
  Sxy = sum(i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)

  # A constant rating vector has zero variance, so the correlation is undefined; store 0
  if Sxx != 0 and Syy != 0:
    pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)
  else:
    pearsonCorrelationDict[name] = 0

pearsonCorrelationDict.items()

dict_items([(18051, -0.9235481451828003), (20049, -0.48745026271476083), (27153, 0), (28550, 0.8907652012052085), (40915, -0.6064784348631236), (41677, -0.7960029457578504), (45583, -0.7892051872524735), (56426, 0.4218479169376418), (64354, -0.22742941307366998), (21, 0.7171371656006361), (46, 0.9683296637314885), (51, 0.50709255283711), (81, -0.5988617490341906), (173, -0.50709255283711), (191, -0.2548235957188128), (226, 0.5606119105813882), (235, -0.2548235957188128), (250, -0.29277002188455997), (261, -0.50709255283711), (294, -0.41403933560541256), (352, 0.2548235957188128), (392, -0.9561828874675149), (403, 0.9561828874675149), (446, -0.8451542547285166), (530, -0.2548235957188128), (563, -0.8664002254439634), (565, -0.8142198690509739), (578, 0.1690308509457033), (582, 0.050964719143762556), (610, -0.8919017444789035), (614, -0.4621247905424446), (618, 0.8783100656536799), (687, -0.8451542547285166), (694, -0.6831300510639733), (702, 0), (771, -0.9683296637314885), (795, -0.6831
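
As a sanity check, the computation in the loop can be reproduced for a single user. The sketch below wraps the same formula in a hypothetical helper `pearson` (not part of the notebook) and applies it to the new user and user 18051, using the ratings shown earlier ordered by anime_id (1, 1535, 4181, 5114, 28851). It also cross-checks the result against `numpy.corrcoef`.

```python
from math import sqrt

import numpy as np


def pearson(x, y):
    """Pearson correlation of two equally long rating lists; 0.0 if either is constant."""
    n = len(x)
    sxx = sum(v * v for v in x) - sum(x) ** 2 / n
    syy = sum(v * v for v in y) - sum(y) ** 2 / n
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


# Ratings ordered by anime_id: 1, 1535, 4181, 5114, 28851
new_user = [5, 4.5, 3, 4, 2]
user_18051 = [7, 8, 10, 9, 10]

r = pearson(new_user, user_18051)
r_numpy = np.corrcoef(new_user, user_18051)[0, 1]
print(r, r_numpy)

# User 27153 rated every shared anime 9, so the variance is zero and the helper returns 0.0
r_constant = pearson(new_user, [9, 9, 9, 9, 9])
```

The value of `r` should agree with the entry for user 18051 in the dict above (about -0.9235), and the constant-rating case corresponds to the 0 stored for user 27153.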

Convert the dict to a DataFrame and name its columns

In [29]:
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearsonDF.columns = ['similarityIndex']
pearsonDF['user_id'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
pearsonDF.head()

Unnamed: 0,similarityIndex,user_id
0,-0.923548,18051
1,-0.48745,20049
2,0.0,27153
3,0.890765,28550
4,-0.606478,40915


Select the 50 users most similar to the new user

In [30]:
topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:50]
topUsers.head()

Unnamed: 0,similarityIndex,user_id
10,0.96833,46
22,0.956183,403
3,0.890765,28550
50,0.87831,1176
70,0.87831,1705
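
The same selection can be sketched with `nlargest`. When the number of neighbours requested is larger than the number of positively correlated users, users with zero or negative similarity are selected as well, so the sketch also shows an optional filter. The frame `toy_pearson` holds rounded values from the dict above; the filter is an assumption, not something the notebook does.

```python
import pandas as pd

# Rounded similarity values for five of the users listed above.
toy_pearson = pd.DataFrame({
    'similarityIndex': [-0.92, 0.89, 0.0, 0.97, 0.42],
    'user_id': [18051, 28550, 27153, 46, 56426],
})

# Equivalent to sort_values(ascending=False)[0:3]
top_k = toy_pearson.nlargest(3, 'similarityIndex')

# Asking for four neighbours pulls in a user with zero similarity.
top_4 = toy_pearson.nlargest(4, 'similarityIndex')

# Optional: restrict to positively correlated users before taking the top k.
positive_only = toy_pearson[toy_pearson['similarityIndex'] > 0].nlargest(4, 'similarityIndex')
print(top_k['user_id'].tolist(), len(positive_only))
```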


## Recommend items to the new user

To compute a weighted average of the anime ratings with the PCC as the weight, first merge topUsers with ratings_df, so that every rating given by a top user sits next to that user's similarityIndex.

In [31]:
topUsersRating = topUsers.merge(ratings_df, on='user_id', how='inner')
topUsersRating.head()

Unnamed: 0,similarityIndex,user_id,anime_id,rating
0,0.96833,46,1,10
1,0.96833,46,20,7
2,0.96833,46,45,8
3,0.96833,46,149,7
4,0.96833,46,150,8


Multiply each rating by its weight (the similarity index of the user who gave it)

In [32]:
topUsersRating['weightedRating']=topUsersRating['similarityIndex']*topUsersRating['rating']
topUsersRating.head()

Unnamed: 0,similarityIndex,user_id,anime_id,rating,weightedRating
0,0.96833,46,1,10,9.683297
1,0.96833,46,20,7,6.778308
2,0.96833,46,45,8,7.746637
3,0.96833,46,149,7,6.778308
4,0.96833,46,150,8,7.746637


Group the rows by anime_id and sum the similarity indices and weighted ratings of the top users

In [33]:
tempTopUsersRating = topUsersRating.groupby('anime_id').sum()[['similarityIndex','weightedRating']]
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
tempTopUsersRating.head()

anime_id,sum_similarityIndex,sum_weightedRating
1,15.015417,135.855019
5,7.620794,66.646254
6,7.069864,58.31946
7,0.333572,1.742119
15,0.989294,8.474963


Create a new DataFrame with the weighted average for each anime

In [34]:
recommendation_df = pd.DataFrame()
recommendation_df['weighted average recommendation score'] = \
tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
recommendation_df['anime_id']=tempTopUsersRating.index
recommendation_df.head()

anime_id,weighted average recommendation score,anime_id
1,9.047702,1
5,8.745316,5
6,8.249022,6
7,5.222617,7
15,8.566679,15
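
The merge, weighting, grouping and division steps above can be followed on a tiny worked example. Everything in this sketch is hypothetical (two neighbours with similarities 0.9 and 0.5, three ratings), chosen so the arithmetic is easy to check by hand: anime 10 gets (0.9·10 + 0.5·6) / (0.9 + 0.5) ≈ 8.57 and anime 20 gets (0.9·6) / 0.9 = 6.

```python
import pandas as pd

# Hypothetical neighbours and ratings, chosen only to keep the arithmetic simple.
toy_top_users = pd.DataFrame({'user_id': [1, 2], 'similarityIndex': [0.9, 0.5]})
toy_ratings = pd.DataFrame({
    'user_id':  [1, 1, 2],
    'anime_id': [10, 20, 10],
    'rating':   [10, 6, 6],
})

# Attach each neighbour's similarity to the ratings they gave.
merged = toy_top_users.merge(toy_ratings, on='user_id', how='inner')
merged['weightedRating'] = merged['similarityIndex'] * merged['rating']

# Per anime: sum of weights and sum of weighted ratings.
sums = merged.groupby('anime_id').agg(
    sum_similarityIndex=('similarityIndex', 'sum'),
    sum_weightedRating=('weightedRating', 'sum'),
)

# Weighted average = sum of weighted ratings / sum of weights.
scores = sums['sum_weightedRating'] / sums['sum_similarityIndex']
print(scores)
```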


Sort by score and show the first ten anime recommended by the CF algorithm

In [35]:
recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
recommendation_df.head(10)

anime_id,weighted average recommendation score,anime_id
4107,402.583175,4107
7559,272.381549,7559
7902,84.639776,7902
10842,83.639776,10842
2952,53.603635,2952
5675,49.96025,5675
13263,40.696147,13263
82,27.505875,82
29223,17.83051,29223
6768,17.431369,6768


<font color='blue'>Some weighted average recommendation scores fall outside the valid rating range: above the maximum possible score of 10 or below the minimum of 0. A score above 10 can only arise when the summed similarity indices have mixed signs, which lets the denominator get close to zero. Here we handle it by setting scores over 10 to 10 and scores under 0 to 0.
We also reset the index, since anime_id is already available as a column.</font>

In [36]:
recommendation_df.index = range(len(recommendation_df))
score_col = 'weighted average recommendation score'
# Clamp scores to the valid range: values under 0 become 0, values over 10 become 10
recommendation_df[score_col] = recommendation_df[score_col].clip(lower=0, upper=10)
recommendation_df.head(10)

Unnamed: 0,weighted average recommendation score,anime_id
0,10.0,4107
1,10.0,7559
2,10.0,7902
3,10.0,10842
4,10.0,2952
5,10.0,5675
6,10.0,13263
7,10.0,82
8,10.0,29223
9,10.0,6768
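
Clipping hides the out-of-range scores but leaves every clipped anime tied at 10. A possible alternative, sketched here on hypothetical toy data, is to address the two causes before averaging: neighbours with non-positive similarity, and the -1 placeholder rows that are still present in ratings_df when it is merged with topUsers. The helper `weighted_scores` and the frame `toy` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical merged frame: similarity of each neighbour next to their ratings.
toy = pd.DataFrame({
    'user_id':         [1, 2, 1, 2, 3],
    'similarityIndex': [0.51, -0.50, 0.51, -0.50, 0.30],
    'anime_id':        [10, 10, 20, 20, 20],
    'rating':          [8, 7, 9, -1, 6],
})


def weighted_scores(df):
    """Similarity-weighted average rating per anime_id."""
    sums = (
        df.assign(weighted=df['similarityIndex'] * df['rating'])
          .groupby('anime_id')[['similarityIndex', 'weighted']]
          .sum()
    )
    return sums['weighted'] / sums['similarityIndex']


# Mixed-sign weights make the denominator tiny, so the score leaves the 1-10 range.
naive = weighted_scores(toy)

# Keep only positively correlated neighbours and explicit ratings.
keep = (toy['similarityIndex'] > 0) & (toy['rating'] != -1)
filtered = weighted_scores(toy[keep])
print(naive, filtered, sep='\n')
```

With only positive weights and explicit ratings, each score is a convex combination of ratings between 1 and 10, so it stays in that range without clipping.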


Look up the recommended ids in anime_df to get the title of each anime

In [37]:
anime_df.loc[anime_df['anime_id'].isin(recommendation_df.head(10)['anime_id'].tolist())]


Unnamed: 0,anime_id,name,rating
292,4107,Tengen Toppa Gurren Lagann Movie: Gurren-hen,8.22
333,2952,Final Fantasy VII: Advent Children Complete,8.17
574,6768,Code Geass: Hangyaku no Lelouch R2 Special Edi...,7.96
767,82,Mobile Suit Gundam 0080: War in the Pocket,7.85
1030,7902,Fullmetal Alchemist: Brotherhood - 4-Koma Theater,7.71
1999,5675,Basquash!,7.38
2397,10842,Fullmetal Alchemist: The Sacred Star of Milos ...,7.27
3331,7559,Fate/stay night TV Reproduction,7.02
3359,29223,Aldnoah.Zero Extra Archives,7.01
4276,13263,Fate/Zero: Onegai! Einzbern Soudanshitsu,6.76
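
Note that filtering anime_df with `isin` returns the rows in catalogue order, not in recommendation order. A minimal sketch of a lookup that keeps the ranking, using three ids and titles from the outputs above, their scores before clipping, and a hypothetical short column name `score`:

```python
import pandas as pd

# Top recommendations in ranked order (ids and pre-clipping scores from the outputs above).
top_recs = pd.DataFrame({
    'anime_id': [4107, 7559, 7902],
    'score': [402.583175, 272.381549, 84.639776],
})

# Stand-in for anime_df, in the order the catalogue lists these titles.
catalog = pd.DataFrame({
    'anime_id': [4107, 7902, 7559],
    'name': [
        'Tengen Toppa Gurren Lagann Movie: Gurren-hen',
        'Fullmetal Alchemist: Brotherhood - 4-Koma Theater',
        'Fate/stay night TV Reproduction',
    ],
})

# isin keeps the catalogue order ...
by_isin = catalog[catalog['anime_id'].isin(top_recs['anime_id'])]

# ... while a left merge keeps the order of the recommendations.
by_merge = top_recs.merge(catalog, on='anime_id', how='left')
print(by_merge[['anime_id', 'name', 'score']])
```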
