# Recommender system with collaborative filtering

This notebook builds a recommender system on the MovieLens dataset.
It uses the smallest version of the dataset (ml-latest-small.zip): about 100,000 ratings and 3,600 tag applications applied to about 9,000 movies by about 600 users. The dataset was last updated in September 2018.

## 1. Loading the dataset

The dataset is obtained from the MovieLens website and read from two CSV files.
- The movies file lists the title and genres of each movie.
- The ratings file contains the ratings that about 600 users gave to the movies they have watched.

In [63]:
import pandas as pd
import numpy as np
from math import sqrt
import matplotlib.pyplot as plt
%matplotlib inline
import sklearn

In [33]:
movies_df = pd.read_csv('Data/movies.csv')
ratings_df = pd.read_csv('Data/ratings.csv')

In [34]:
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [35]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


## 2. Pre-processing the data

Let's trim the data down to two dataframes:
- movies_df: dataframe containing the titles of the movies
- ratings_df: dataframe containing the user ratings for the movies

In [36]:
# Drop the year information from the titles and strip any surrounding whitespace
movies_df['title'] = movies_df.title.str.replace(r'\(\d{4}\)', '', regex=True).str.strip()
movies_df = movies_df.drop(labels='genres', axis=1)
movies_df.head()

Unnamed: 0,movieId,title
0,1,Toy Story
1,2,Jumanji
2,3,Grumpier Old Men
3,4,Waiting to Exhale
4,5,Father of the Bride Part II
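
To see what the regular expression does in isolation, here is a minimal sketch on a few hand-written titles. The `sample_titles` series is made up for illustration and is not read from the dataset. The sketch also shows an optional variant, not used in this notebook, that keeps the year in a separate series via `str.extract`.

```python
import pandas as pd

# Hypothetical sample titles in the same "Title (YYYY)" format as movies.csv
sample_titles = pd.Series(["Toy Story (1995)", "Jumanji (1995) ", "Boogie Nights (1997)"])

# Remove the four-digit year in parentheses, then strip leftover whitespace
clean_titles = sample_titles.str.replace(r"\(\d{4}\)", "", regex=True).str.strip()

# Optional (an assumption, not part of the notebook): capture the year instead of discarding it
years = sample_titles.str.extract(r"\((\d{4})\)", expand=False)

print(clean_titles.tolist())
print(years.tolist())
```

Keeping the year could help to tell apart remakes that share a title, but the rest of this notebook only needs the cleaned title.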


In [37]:
ratings_df = ratings_df.drop(labels='timestamp', axis=1)
ratings_df.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


## 3. Collaborative filtering

The collaborative filtering used here is the so-called user-to-user (user-based) variant. The idea is to find other users whose ratings and preferences are similar to those of the target user. The recommender then makes recommendations to the target user based on the preferences of these similar users: if they liked items that the target user has not yet tried, those items can be recommended to him/her.

The steps needed to create a collaborative filtering algorithm:
- Create a user that we want to give recommendations to, and provide a list of movies and ratings for him/her
- Calculate the similarity between every other user and this user, based on the movies both have rated and the ratings they gave
- Select the users most similar to this user (this notebook keeps the top 50)
- Score each movie by the similarity-weighted average of these users' ratings, and recommend the highest-scoring movies to the user in question
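
The steps above can be sketched end to end on a tiny, invented ratings table. This is a simplified illustration rather than the notebook's implementation: the data is made up, similarity is computed with `np.corrcoef`, only the single most similar user is kept, and movies the target user has already rated are excluded (the notebook itself does not filter them out).

```python
import numpy as np
import pandas as pd

# Toy ratings, invented for illustration (not taken from MovieLens)
ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "movieId": [10, 20, 30, 10, 20, 30, 40, 10, 20, 30, 50],
    "rating":  [5.0, 3.0, 1.0, 4.5, 3.0, 1.5, 5.0, 1.0, 3.0, 5.0, 4.0],
})
# Step 1: the target user (userId 1) and everybody else
target = ratings[ratings["userId"] == 1]
others = ratings[ratings["userId"] != 1]

def similarity(group):
    """Pearson correlation with the target on co-rated movies (0.0 if undefined)."""
    common = group.merge(target, on="movieId", suffixes=("", "_target"))
    if len(common) < 2 or common["rating"].std() == 0 or common["rating_target"].std() == 0:
        return 0.0
    return float(np.corrcoef(common["rating"], common["rating_target"])[0, 1])

# Step 2: similarity of every other user to the target
sims = {user_id: similarity(group) for user_id, group in others.groupby("userId")}

# Step 3: keep the k most similar users (k=1 here; the notebook keeps 50)
top = pd.Series(sims, name="similarityScore").nlargest(1).rename_axis("userId").reset_index()

# Step 4: similarity-weighted average rating for movies the target has not rated
cand = top.merge(others, on="userId")
cand = cand[~cand["movieId"].isin(target["movieId"])].copy()
cand["weightedRating"] = cand["similarityScore"] * cand["rating"]
sums = cand.groupby("movieId")[["similarityScore", "weightedRating"]].sum()
scores = (sums["weightedRating"] / sums["similarityScore"]).sort_values(ascending=False)
print(scores)
```

In this toy data user 2 rates the shared movies in the same order as the target and user 3 in the opposite order, so only user 2's unseen movie (movieId 40) ends up recommended.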

In [38]:
movies_df["title"].shape

(9742,)

### 3.1 Create an example user to whom to give recommendations

In [125]:
# Randomly select 6 movies for our user1
user1 = movies_df.copy().sample(n=6,random_state=31).reset_index(drop=True)
# Create ratings that user1 has given for these movies
rating1=[5,4.5,2,3,4,5]
user1["rating"]=rating1
user1.head(6)

Unnamed: 0,movieId,title,rating
0,44022,Ice Age 2: The Meltdown,5.0
1,1305,"Paris, Texas",4.5
2,38061,Kiss Kiss Bang Bang,2.0
3,6370,"Spanish Apartment, The (L'auberge espagnole)",3.0
4,3500,Mr. Saturday Night,4.0
5,1673,Boogie Nights,5.0


### 3.2 Find other users who have seen the same movies as user1

In [126]:
userList = ratings_df[ratings_df['movieId'].isin(user1['movieId'].tolist())]
userList.head()

Unnamed: 0,userId,movieId,rating
2071,18,38061,4.0
2073,18,44022,3.5
2920,19,3500,2.0
3431,21,44022,3.5
3973,24,38061,4.0


In [127]:
# Group the dataframe by user; a scalar key keeps each group name a plain userId
userListGroup = userList.groupby('userId')
# Sort the groups so that the users with the most movies in common with user1 come first
userListGroup = sorted(userListGroup, key=lambda x: len(x[1]), reverse=True)
userListGroup[0:3]

[(414,
         userId  movieId  rating
  62883     414     1673     4.0
  64134     414     6370     5.0
  64455     414    38061     4.0
  64497     414    44022     2.5),
 (68,
         userId  movieId  rating
  10637      68     1673     3.0
  11252      68    38061     3.5
  11271      68    44022     3.5),
 (480,
         userId  movieId  rating
  76439     480     1673     4.5
  76976     480    38061     4.0
  77008     480    44022     3.0)]
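
The overlap counts behind this ordering can also be inspected directly. The sketch below is an illustrative alternative, not the notebook's code: it rebuilds a few rows of `userList` by hand from the output above, counts the shared movies per user with `groupby(...).size()`, and applies a minimum-overlap threshold of 3 that is chosen purely for illustration.

```python
import pandas as pd

# A few rows copied from the userList output above (users 18, 19 and 414)
user_list = pd.DataFrame({
    "userId":  [18, 18, 19, 414, 414, 414, 414],
    "movieId": [38061, 44022, 3500, 1673, 6370, 38061, 44022],
    "rating":  [4.0, 3.5, 2.0, 4.0, 5.0, 4.0, 2.5],
})

# Number of movies each user shares with user1, largest first
overlap = user_list.groupby("userId").size().sort_values(ascending=False)
print(overlap)

# Keep only users with at least 3 movies in common (hypothetical threshold)
enough = overlap[overlap >= 3].index.tolist()
print(enough)
```

Such a threshold matters because a correlation computed from one or two shared movies carries very little information.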

In [128]:
#Select the first 100 users from this list
userListGroup = userListGroup[0:100]

### 3.3 Calculate similarity score between the selected users and user1

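Pearson correlation is used as the similarity score. For the $n$ movies rated by both user1 ($x$) and another user ($y$), the coefficient is $r = S_{xy} / \sqrt{S_{xx} S_{yy}}$, where $S_{xx} = \sum x_i^2 - (\sum x_i)^2 / n$, $S_{yy} = \sum y_i^2 - (\sum y_i)^2 / n$ and $S_{xy} = \sum x_i y_i - (\sum x_i)(\sum y_i) / n$.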
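
As a sanity check on the similarity measure, the following sketch wraps the sums-of-squares form of the Pearson coefficient in a standalone helper and compares it with `np.corrcoef`. The helper name `pearson` is introduced here for illustration. The rating lists are the ones that user1 and users 414 and 18 gave to their shared movies, taken from the outputs above.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation via the sums-of-squares form; returns 0.0 if undefined."""
    n = len(x)
    sxx = sum(i**2 for i in x) - sum(x)**2 / n
    syy = sum(j**2 for j in y) - sum(y)**2 / n
    sxy = sum(i * j for i, j in zip(x, y)) - sum(x) * sum(y) / n
    # A zero variance on either side makes the coefficient undefined
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / np.sqrt(sxx * syy)

# Ratings for movies 1673, 6370, 38061, 44022 (sorted by movieId)
user1_ratings = [5.0, 3.0, 2.0, 5.0]
user414_ratings = [4.0, 5.0, 4.0, 2.5]

r_manual = pearson(user1_ratings, user414_ratings)
r_numpy = np.corrcoef(user1_ratings, user414_ratings)[0, 1]
print(r_manual, r_numpy)

# Ratings for movies 38061 and 44022: user1 versus user 18
r_two_points = pearson([2.0, 5.0], [4.0, 3.5])
# A single shared movie has no variance, so the helper falls back to 0.0
r_one_point = pearson([4.0], [2.0])
print(r_two_points, r_one_point)
```

With only two shared movies the coefficient is always exactly +1 or -1 whenever it is defined, which is worth keeping in mind when reading the similarity scores.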

In [129]:
# Maps each userId to its Pearson correlation coefficient with user1
pearsonCorrelationDict = {}

# Sort user1's ratings by movieId once so that they line up with each group's ratings
user1 = user1.sort_values(by='movieId')

# For every user group in our subset
for name, group in userListGroup:
    # Sort this user's ratings by movieId as well
    group = group.sort_values(by='movieId')
    # Number of movies rated by both this user and user1
    nRatings = len(group)
    # Get user1's ratings for the movies that both users have rated
    user1Ratings = user1[user1['movieId'].isin(group['movieId'].tolist())]
    user1RatingList = user1Ratings['rating'].tolist()
    # Put this user's ratings for the same movies in a list as well
    groupRatingList = group['rating'].tolist()
    # Calculate the Pearson correlation between the two rating lists, x (user1) and y (this user)
    Sxx = sum([i**2 for i in user1RatingList]) - pow(sum(user1RatingList),2)/float(nRatings)
    Syy = sum([i**2 for i in groupRatingList]) - pow(sum(groupRatingList),2)/float(nRatings)
    Sxy = sum( i*j for i, j in 
              zip(user1RatingList, groupRatingList)) - sum(user1RatingList)*sum(groupRatingList)/float(nRatings)
    
    # If the denominator is non-zero, divide; otherwise set the correlation to 0
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/np.sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[name] = 0


In [130]:
pearsonCorrelationDict.items()

dict_items([(414, -0.5659164584181102), (68, -0.5000000000000036), (480, -0.1889822365046138), (606, 0.9707253433941623), (18, -1.0), (57, 0), (105, 1.0), (132, -1.0), (232, -1.0), (239, -1.0), (249, -1.0), (288, -1.0), (307, 1.0), (318, 1.0), (356, -1.0), (370, -1.0), (474, 1.0), (483, -1.0), (560, -1.0), (580, -1.0), (599, 1.0), (19, 0), (21, 0), (24, 0), (28, 0), (33, 0), (42, 0), (50, 0), (62, 0), (64, 0), (74, 0), (84, 0), (89, 0), (111, 0), (119, 0), (122, 0), (177, 0), (178, 0), (182, 0), (187, 0), (195, 0), (200, 0), (202, 0), (218, 0), (219, 0), (221, 0), (254, 0), (256, 0), (260, 0), (268, 0), (274, 0), (279, 0), (293, 0), (298, 0), (330, 0), (335, 0), (352, 0), (361, 0), (368, 0), (380, 0), (381, 0), (385, 0), (387, 0), (388, 0), (405, 0), (448, 0), (450, 0), (475, 0), (506, 0), (509, 0), (510, 0), (514, 0), (522, 0), (546, 0), (554, 0), (555, 0), (561, 0), (590, 0), (596, 0), (597, 0), (600, 0), (603, 0), (608, 0), (610, 0)])

### 3.4 Get top 50 users by their similarity score

In [131]:
pearson_df = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearson_df.columns = ['similarityScore']
pearson_df['userId'] = pearson_df.index
pearson_df.index = range(len(pearson_df))
top50=pearson_df.sort_values(by='similarityScore', ascending=False)[0:50]
top50.head()

Unnamed: 0,similarityScore,userId
12,1.0,307
6,1.0,105
20,1.0,599
16,1.0,474
13,1.0,318


### 3.5 Create recommendations for user1 based on top 50 similar users

For each movie, calculate the weighted average of the ratings given by the top 50 similar users, using each user's Pearson correlation coefficient as the weight.
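
A small worked example may help to show how the weights behave. The sketch below uses three hypothetical neighbours rating a single movie; the user IDs, similarity scores and ratings are invented for illustration. It also computes a variant that keeps only positively correlated users, which is an assumption made here for comparison and not something this notebook does.

```python
import pandas as pd

# Hypothetical neighbours: similarity to user1 and their rating of one movie
neighbours = pd.DataFrame({
    "userId": [105, 606, 414],
    "similarityScore": [1.0, 0.97, -0.57],
    "rating": [4.0, 3.0, 5.0],
})

def weighted_average(df):
    """Similarity-weighted mean rating: sum(w * r) / sum(w)."""
    return (df["similarityScore"] * df["rating"]).sum() / df["similarityScore"].sum()

# All neighbours, including the negatively correlated one
all_users = weighted_average(neighbours)

# Variant (an assumption, not the notebook's approach): positive similarities only
positive_only = weighted_average(neighbours[neighbours["similarityScore"] > 0])

print(all_users, positive_only)
```

With a negative weight in the mix, the score can fall outside the range of the ratings that went into it, and weights that cancel out to zero leave the score undefined.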

In [132]:
# Merge the top 50 users with their ratings
top50Rating = top50.merge(ratings_df, on='userId', how='inner')
# Multiply each rating by the similarity score of the user who gave it
top50Rating['weightedRating'] = top50Rating['similarityScore'] * top50Rating['rating']
top50Rating.head()

Unnamed: 0,similarityScore,userId,movieId,rating,weightedRating
0,1.0,307,1,4.0,4.0
1,1.0,307,2,2.5,2.5
2,1.0,307,3,3.5,3.5
3,1.0,307,10,2.5,2.5
4,1.0,307,16,4.5,4.5


First, group the dataframe by movieId and sum the similarityScore and weightedRating columns for each movie.

In [134]:
# After grouping by movieId, sum the similarity scores and the weighted ratings of the top 50 users
Top50Rating_temp = top50Rating.groupby('movieId')[['similarityScore','weightedRating']].sum()
Top50Rating_temp.columns = ['sum_similarityScore','sum_weightedRating']
Top50Rating_temp.head()

movieId,sum_similarityScore,sum_weightedRating
1,3.970725,13.426813
2,4.0,11.5
3,2.0,5.0
4,0.0,0.0
5,1.0,1.5


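After that, calculate the recommendation score of each movie as the weighted average: the sum of the weighted ratings divided by the sum of the similarity scores. A movie whose similarity scores sum to zero gets NaN, as movieId 4 shows.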

In [143]:
recommendation_df = pd.DataFrame() 
recommendation_df['recommendation score'] = \
    Top50Rating_temp['sum_weightedRating']/Top50Rating_temp['sum_similarityScore']
recommendation_df['movieId'] = Top50Rating_temp.index
recommendation_df.head()

movieId,recommendation score,movieId
1,3.381451,1
2,2.875,2
3,2.5,3
4,,4
5,1.5,5


Sort by recommendation score to get the top recommendations.

In [144]:
recommendation_df = recommendation_df.sort_values(by='recommendation score', ascending=False)
recommendation_df.head(10)

movieId,recommendation score,movieId
7767,5.0,7767
117531,5.0,117531
3549,5.0,3549
69134,5.0,69134
147250,5.0,147250
170597,5.0,170597
92259,5.0,92259
134796,5.0,134796
49347,5.0,49347
5034,5.0,5034


In [145]:
movies_df.loc[movies_df['movieId'].isin(recommendation_df.head(10)['movieId'].tolist())]

Unnamed: 0,movieId,title
2651,3549,Guys and Dolls
3660,5034,"Truly, Madly, Deeply"
5013,7767,"Best of Youth, The (La meglio gioventù)"
6354,49347,Fast Food Nation
7045,69134,Antichrist
7802,92259,Intouchables
8591,117531,Watermark
8896,134796,Bitter Lake
9138,147250,The Adventures of Sherlock Holmes and Doctor W...
9495,170597,A Plasticine Crow
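
The `isin` lookup returns the titles in the order of `movies_df`, not in the order of the recommendation scores. One possible way to keep the ranking is a left merge, sketched below on small stand-in frames. The two frames are rebuilt by hand, and the scores 4.8 and 4.5 are invented so that the ordering is visible. In the notebook itself `recommendation_df` carries `movieId` both as index name and as a column, so it would probably need `reset_index(drop=True)` before such a merge.

```python
import pandas as pd

# Hypothetical stand-ins for recommendation_df (already sorted by score) and movies_df
recommendation_df = pd.DataFrame({
    "movieId": [7767, 3549, 5034],
    "recommendation score": [5.0, 4.8, 4.5],
})
movies_df = pd.DataFrame({
    "movieId": [3549, 5034, 7767],
    "title": ["Guys and Dolls", "Truly, Madly, Deeply", "Best of Youth, The (La meglio gioventù)"],
})

# A left merge keeps the row order of the left frame, so titles stay sorted by score
top_titles = recommendation_df.head(10).merge(movies_df, on="movieId", how="left")
print(top_titles)
```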
