Collaborative filtering, here in its user-to-user (user-based) form, is a type of recommendation system that recommends an item to a user based on the ratings of other users who have similar preferences and opinions.
This tutorial uses the MovieLens dataset from https://grouplens.org/datasets/movielens/

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [15]:
movies_data = pd.read_csv('movies.csv')
ratings_data = pd.read_csv('ratings.csv')

print(ratings_data.head())
print("\n")
print(movies_data.head())

   userId  movieId  rating     timestamp
0       1        2     3.5  1.112486e+09
1       1       29     3.5  1.112485e+09
2       1       32     3.5  1.112485e+09
3       1       47     3.5  1.112485e+09
4       1       50     3.5  1.112485e+09


   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  


In [16]:
#preprocessing the data
# Splitting the year from the movie titles

# Extract the "(YYYY)" part first, then the four digits inside it
movies_data['year'] = movies_data.title.str.extract(r'(\(\d\d\d\d\))', expand=False)
movies_data['year'] = movies_data.year.str.extract(r'(\d\d\d\d)', expand=False)

# Remove the "(YYYY)" part from the title; regex=True is required in pandas 2.0+
movies_data['title'] = movies_data.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)
movies_data['title'] = movies_data['title'].apply(lambda x: x.strip())

# Dropping the genres column
movies_data = movies_data.drop(columns='genres')


# Dropping the timestamp column from ratings
ratings_data = ratings_data.drop(columns='timestamp')

print(movies_data.head())
print(ratings_data.head())

   movieId                        title  year
0        1                    Toy Story  1995
1        2                      Jumanji  1995
2        3             Grumpier Old Men  1995
3        4            Waiting to Exhale  1995
4        5  Father of the Bride Part II  1995
   userId  movieId  rating
0       1        2     3.5
1       1       29     3.5
2       1       32     3.5
3       1       47     3.5
4       1       50     3.5
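
The title clean-up can be tried on a few hand-written rows before running it on the full file. The sketch below is a minimal illustration, assuming a recent pandas version; the `sample` frame is made up for the example and is not part of the MovieLens data.

```python
import pandas as pd

# Hypothetical titles in the MovieLens "Title (YYYY)" format
sample = pd.DataFrame({'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Akira (1988)']})

# Capture the four digits inside the parentheses (the year stays a string)
sample['year'] = sample['title'].str.extract(r'\((\d{4})\)', expand=False)

# Remove the "(YYYY)" part and trim the leftover whitespace
sample['title'] = sample['title'].str.replace(r'\(\d{4}\)', '', regex=True).str.strip()

print(sample)
```

Note that the extracted year is a string, so it would need a cast (for example with `pd.to_numeric`) before any numeric comparison.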


Similar users, i.e. users with similar preferences, can be found using the __Pearson correlation coefficient__. We calculate a similarity score for each user and then recommend the movies with the highest weighted scores.
</br>
Let's provide the user input as a list of dictionaries, which is then converted to a DataFrame.

In [17]:
MyInput = [
            {'title':'Breakfast Club, The', 'rating':5},
            {'title':'Toy Story', 'rating':3.5},
            {'title':'Jumanji', 'rating':2},
            {'title':"Pulp Fiction", 'rating':5},
            {'title':'Akira', 'rating':4.5}
         ] 
inputMovies = pd.DataFrame(MyInput)
inputMovies

Unnamed: 0,rating,title
0,5.0,"Breakfast Club, The"
1,3.5,Toy Story
2,2.0,Jumanji
3,5.0,Pulp Fiction
4,4.5,Akira


Next, look up the input movies in the movies dataset to get the movieId of each one.
</br>
Make sure the capitalization of each title matches the dataset exactly, or normalize the case before matching; the nltk library could also be used to match other title variants. (Hint: word.upper() or word.lower())
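
One way to follow the hint is to compare lower-cased copies of the titles on both sides. This is only a sketch under assumptions: the `movies` and `user_input` frames and the `title_key` column are hypothetical stand-ins, not names used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the movies dataset and the user's input
movies = pd.DataFrame({'movieId': [1, 2, 296],
                       'title': ['Toy Story', 'Jumanji', 'Pulp Fiction']})
user_input = pd.DataFrame({'title': ['toy story', 'PULP FICTION'],
                           'rating': [3.5, 5.0]})

# Build a lower-cased key on both sides so the match ignores capitalization
movies['title_key'] = movies['title'].str.lower()
user_input['title_key'] = user_input['title'].str.lower()

# Keep the dataset's spelling of the title and take the rating from the input
matched = pd.merge(movies, user_input[['title_key', 'rating']], on='title_key')
matched = matched.drop(columns='title_key')
print(matched)
```

This handles case differences only; spelling variants or a moved article ("The Breakfast Club" versus "Breakfast Club, The") would still need separate handling.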

In [18]:
inputId = movies_data[movies_data['title'].isin(inputMovies['title'].tolist())]
inputMovies = pd.merge(inputId, inputMovies)

inputMovies = inputMovies.drop(columns='year')
inputMovies

Unnamed: 0,movieId,title,rating
0,1,Toy Story,3.5
1,2,Jumanji,2.0
2,296,Pulp Fiction,5.0
3,1274,Akira,4.5
4,1968,"Breakfast Club, The",5.0


Next, find the users who have rated the same movies; they can be found in the ratings dataset. Then group their ratings by user and sort the groups so that the users with the most movies in common with the input come first and get priority in the similarity calculation.


In [21]:
userSubset = ratings_data[ratings_data['movieId'].isin(inputMovies['movieId'].tolist())]
userSubset.head()

Unnamed: 0,userId,movieId,rating
0,1,2,3.5
11,1,296,4.0
236,3,1,4.0
451,5,2,3.0
517,6,1,5.0


In [32]:
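# Group by a single column name so each key is a plain userId (a list key gives tuple keys in pandas 2.0+)
userGroup = userSubset.groupby('userId')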
userGroup = sorted(userGroup,  key=lambda x: len(x[1]), reverse=True)
userGroup[0:3]

[(91,       userId  movieId  rating
  9621      91        1     4.0
  9622      91        2     3.5
  9669      91      296     3.5
  9826      91     1274     2.5
  9903      91     1968     4.0), (294,        userId  movieId  rating
  37452     294        1     4.5
  37453     294        2     4.5
  37504     294      296     4.5
  37648     294     1274     4.5
  37731     294     1968     5.0), (586,        userId  movieId  rating
  81164     586        1     2.5
  81165     586        2     3.0
  81226     586      296     5.0
  81390     586     1274     4.0
  81499     586     1968     3.0)]
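
The grouping and sorting step can be checked on a tiny example. The sketch below uses made-up ratings (the `toy` frame and its user ids are hypothetical) to show that users who share more movies with the input end up first.

```python
import pandas as pd

# Hypothetical ratings: user 10 rated three input movies, user 30 two, user 20 one
toy = pd.DataFrame({'userId':  [10, 10, 10, 20, 30, 30],
                    'movieId': [1, 2, 296, 1, 2, 296],
                    'rating':  [4.0, 3.5, 5.0, 2.0, 3.0, 4.5]})

# Iterating a groupby yields (userId, sub-frame) pairs; sort them by sub-frame size
groups = sorted(toy.groupby('userId'), key=lambda pair: len(pair[1]), reverse=True)

# Users ordered by how many input movies they have rated
order = [int(user_id) for user_id, _ in groups]
print(order)  # [10, 30, 20]
```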

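Now, from these groups, find the users most similar to the input user by using the Pearson correlation coefficient.
</br>
The coefficient ranges from -1 to 1. A value of 1 means the two users' ratings are perfectly positively correlated (similar tastes); -1 means they are perfectly negatively correlated (opposite tastes).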


In [0]:
from math import sqrt

#Storing Pearson Correlation in a dictionary, where the key: user Id and the value: coefficient
pearsonCorrelationDict = {}

#Sort the input once by movieId so its ratings line up with each user's ratings
inputMovies = inputMovies.sort_values(by='movieId')

for name, group in userGroup:
    #Sort the current user group the same way so the values aren't mixed up later on
    group = group.sort_values(by='movieId')
    nRatings = len(group)
    
    #Get the review scores for the movies that they both have in common
    temp_data = inputMovies[inputMovies['movieId'].isin(group['movieId'].tolist())]
    
    #And then store them in a temporary buffer variable in a list format to facilitate future calculations
    tempRatingList = temp_data['rating'].tolist()
    
    #Let's also put the current user group reviews in a list format
    tempGroupList = group['rating'].tolist()
    
    # pearson correlation between two users
    Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(nRatings)
    Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(nRatings)
    Sxy = sum( i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)
    
    #If the denominator is different than zero, then divide, else, 0 correlation.
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[name] = 0
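
As a sanity check, the same computational formula can be wrapped in a small function and compared with `numpy.corrcoef`. This is a sketch rather than part of the pipeline: the `pearson` helper is a hypothetical name, and the two rating lists are copied by hand from the input and from user 91's group shown earlier, ordered by movieId.

```python
import numpy as np
from math import sqrt

def pearson(x, y):
    """Pearson correlation via the computational formula Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    sxx = sum(i ** 2 for i in x) - sum(x) ** 2 / n
    syy = sum(j ** 2 for j in y) - sum(y) ** 2 / n
    sxy = sum(i * j for i, j in zip(x, y)) - sum(x) * sum(y) / n
    # A constant rating list has zero variance; treat that as no correlation
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

# Ratings for movieId 1, 2, 296, 1274, 1968
input_ratings = [3.5, 2.0, 5.0, 4.5, 5.0]
user_91 = [4.0, 3.5, 3.5, 2.5, 4.0]

r = pearson(input_ratings, user_91)
print(round(r, 6))  # -0.080064

# numpy's implementation should agree up to floating-point error
print(np.isclose(r, np.corrcoef(input_ratings, user_91)[0, 1]))
```

The result matches the similarity index reported for user 91 in the next cell's output.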


In [37]:
# Similarity index
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearsonDF.columns = ['similarityIndex']
pearsonDF['userId'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
pearsonDF.head()


Unnamed: 0,similarityIndex,userId
0,-0.080064,91
1,0.438529,294
2,0.539319,586
3,0.688021,648
4,0.836242,775


Let's get the top 100 most similar users in the list.

In [38]:
topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:100]
topUsers.head()


Unnamed: 0,similarityIndex,userId
3268,1.0,18309
2422,1.0,10179
4121,1.0,27630
3860,1.0,24945
2340,1.0,9300


Now recommend movies to the input user.
</br>
The recommendation score of a movie is the weighted average of the ratings given by the top users, with each user's Pearson correlation as the weight.
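
To make the weighting concrete, here is a small worked example for a single movie. The `neighbours` frame and its numbers are invented for illustration; only the formula, sum(similarity * rating) / sum(similarity), reflects what this tutorial computes.

```python
import pandas as pd

# Hypothetical neighbours: similarity to the input user and their rating of one movie
neighbours = pd.DataFrame({'similarityIndex': [1.0, 0.8, 0.5],
                           'rating':          [4.0, 3.0, 5.0]})

# score = sum(similarity * rating) / sum(similarity)
weighted_sum = (neighbours['similarityIndex'] * neighbours['rating']).sum()
score = weighted_sum / neighbours['similarityIndex'].sum()

print(round(score, 4))              # 3.8696
print(neighbours['rating'].mean())  # 4.0, the unweighted average for comparison
```

The weighted score sits below the plain average because the least similar neighbour gave the highest rating and therefore counts for less.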


In [39]:
topUsersRating=topUsers.merge(ratings_data, left_on='userId', right_on='userId', how='inner')
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating
0,1.0,18309,1,3.5
1,1.0,18309,6,3.5
2,1.0,18309,32,4.0
3,1.0,18309,39,4.0
4,1.0,18309,47,4.0


In [40]:
#Multiplying the similarity by the user's ratings
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']  *  topUsersRating['rating']
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating,weightedRating
0,1.0,18309,1,3.5,3.5
1,1.0,18309,6,3.5,3.5
2,1.0,18309,32,4.0,4.0
3,1.0,18309,39,4.0,4.0
4,1.0,18309,47,4.0,4.0


In [41]:
# Sum the similarity weights and the weighted ratings per movie
tempTopUsersRating = topUsersRating.groupby('movieId').sum()[['similarityIndex','weightedRating']]

tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
tempTopUsersRating.head()

movieId,sum_similarityIndex,sum_weightedRating
1,77.0,265.0
2,29.0,84.0
3,10.0,32.0
4,8.0,22.5
5,11.0,31.0


In [43]:
#empty dataframe for merging the recommendation score
recommendation_data = pd.DataFrame()

#weighted average
recommendation_data['weighted_avg_recommendation_score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
recommendation_data['movieId'] = tempTopUsersRating.index
recommendation_data.head()

movieId (index),weighted_avg_recommendation_score,movieId
1,3.441558,1
2,2.896552,2
3,3.2,3
4,2.8125,4
5,2.818182,5


In [45]:
#recommending top 30 movies
recommendation_data = recommendation_data.sort_values(by='weighted_avg_recommendation_score', ascending=False)
recommendation_data.head(30)

movieId (index),weighted_avg_recommendation_score,movieId
861,5.0,861
4100,5.0,4100
8398,5.0,8398
5641,5.0,5641
5647,5.0,5647
5648,5.0,5648
2879,5.0,2879
443,5.0,443
43652,5.0,43652
2885,5.0,2885


In [48]:
movies_data.loc[movies_data['movieId'].isin(recommendation_data.head(30)['movieId'].tolist())]

Unnamed: 0,movieId,title,year
439,443,"Endless Summer 2, The",1994
508,512,"Puppet Masters, The",1994
522,526,"Savage Nights (Nuits fauves, Les)",1992
585,591,Tough and Deadly,1995
611,617,"Flower of My Secret, The (La flor de mi secreto)",1995
846,861,Supercop (Police Story 3: Supercop) (Jing cha ...,1992
2044,2128,Safe Men,1998
2082,2166,Return to Paradise,1998
2128,2212,"Man Who Knew Too Much, The",1934
2793,2879,Armour of God II: Operation Condor (Operation ...,1991
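
Note that the `isin` lookup returns rows in the order of `movies_data` (by movieId), not in order of recommendation score. If the ranking should be kept, one option is a left merge from the sorted recommendations onto the movie titles. The sketch below uses hypothetical stand-in frames (`recs`, `movies`) with made-up scores.

```python
import pandas as pd

# Hypothetical stand-ins: recommendations already sorted by score, and a movie table
recs = pd.DataFrame({'movieId': [2128, 443, 2],
                     'weighted_avg_recommendation_score': [5.0, 4.8, 2.9]})
movies = pd.DataFrame({'movieId': [2, 443, 2128],
                       'title': ['Jumanji', 'Endless Summer 2, The', 'Safe Men'],
                       'year': ['1995', '1994', '1998']})

# A left merge keeps the row order of the left frame, i.e. the score ranking
ranked = recs.merge(movies, on='movieId', how='left')
print(ranked[['title', 'weighted_avg_recommendation_score']])
```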


Above are the top 30 movies recommended to the input user.
</br>
The advantage of collaborative filtering is that it draws on other users' information and adapts as user interests change. The disadvantages are, first, privacy concerns around using other users' data; second, poor recommendations when there are too few users or ratings; and third, the heavy computational load of processing large amounts of data.