This project finds movies that a user might like by building recommendation systems. We are going to create an item-based recommendation system and a user-based recommendation system. The dataset can be downloaded at https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/moviedataset.zi

# Preprocessing
Import all the libraries we need and load the data we want to process.

In [1]:
import pandas as pd
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df_movie= pd.read_csv('data-rec/movies.csv')
df_rating = pd.read_csv('data-rec/ratings.csv')
df_movie.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [3]:
df_movie.shape

(34208, 3)

In [4]:
df_movie.isna().sum()

movieId    0
title      0
genres     0
dtype: int64

First, we need to:
- extract the year from the 'title' column and store it in a new column named 'year'
- split the values in the 'genres' column

In [5]:
#We require the parentheses around the year so we don't conflict with movies that have years in their titles
df_movie['year'] = df_movie.title.str.extract(r'(\(\d\d\d\d\))', expand=False)
#Removing the parentheses
df_movie['year'] = df_movie.year.str.extract(r'(\d\d\d\d)', expand=False)
#Removing the years from the 'title' column (regex=True is needed in recent pandas versions)
df_movie['title'] = df_movie.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)
#Applying the strip function to get rid of any trailing whitespace characters that may have appeared
df_movie['title'] = df_movie['title'].apply(lambda x: x.strip())
#Splitting the genres on |
df_movie['genres'] = df_movie.genres.str.split('|')
df_movie.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995
2,3,Grumpier Old Men,"[Comedy, Romance]",1995
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995
4,5,Father of the Bride Part II,[Comedy],1995
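
To see why the pattern insists on parentheses, here is a small standard-library sketch. The sample titles are hypothetical (not read from the dataset) and `split_title_year` is an illustrative helper name, not part of the notebook. It does the extraction and the removal in one pass, and should be equivalent to the two-step extraction above.

```python
import re

# Four digits wrapped in parentheses, with the digits captured.
YEAR_IN_PARENS = re.compile(r'\((\d{4})\)')

def split_title_year(raw_title):
    """Return (title_without_year, year_or_None) for a raw title string."""
    match = YEAR_IN_PARENS.search(raw_title)
    year = match.group(1) if match else None
    title = YEAR_IN_PARENS.sub('', raw_title).strip()
    return title, year

# Hypothetical sample titles; the second has a year-like number in the title itself.
samples = ['Toy Story (1995)', '2001: A Space Odyssey (1968)', 'Untitled Movie']
results = [split_title_year(t) for t in samples]
for raw, (title, year) in zip(samples, results):
    print(raw, '->', title, year)
```

A title without a parenthesised year keeps its text and gets `None` as the year, which mirrors the NaN that `str.extract` produces in the dataframe.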


We also need to convert the list of genres to a vector where each column corresponds to one possible genre (1 if the movie belongs to that genre, 0 if not).

In [6]:
df_movgen = df_movie.copy()
for index, row in df_movgen.iterrows():
    for gen in row['genres']:
        df_movgen.at[index, gen] = 1
#Filling in the NaN values with 0 to show that a movie doesn't have that column's genre
df_movgen = df_movgen.fillna(0)
df_movgen.head()

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men,"[Comedy, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II,[Comedy],1995,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
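
Looping with `iterrows` can be slow on a table of this size. As a possible alternative, the sketch below builds the same kind of 0/1 genre columns with `Series.str.join` and `Series.str.get_dummies`. The `toy` dataframe is a hypothetical three-row stand-in for df_movie; note that the columns come out in alphabetical order and as integers rather than floats.

```python
import pandas as pd

# Toy stand-in for df_movie after the genres column has been split into lists.
toy = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story', 'Jumanji', 'Grumpier Old Men'],
    'genres': [['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy'],
               ['Adventure', 'Children', 'Fantasy'],
               ['Comedy', 'Romance']],
})

# Re-join each list with '|' and let pandas build one 0/1 column per genre.
genre_flags = toy['genres'].str.join('|').str.get_dummies(sep='|')

# Attach the indicator columns next to the original ones.
toy_movgen = pd.concat([toy, genre_flags], axis=1)
print(toy_movgen.drop(columns='genres'))
```

Because no NaN values are created along the way, this variant does not need the `fillna(0)` step.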


Prepare df_rating by deleting the column that we don't need.

In [7]:
df_rating = df_rating.drop('timestamp', axis=1)
df_rating.head()

Unnamed: 0,userId,movieId,rating
0,1,169,2.5
1,1,2471,3.0
2,1,48516,5.0
3,2,2571,3.5
4,2,109487,4.0


In [8]:
df_rating.shape

(22884377, 3)

In [54]:
df_movgen_show=df_movgen.set_index('movieId')
df_movgen_show=df_movgen_show.drop(['title', 'year', 'genres'], axis=1)

In [57]:
df_movgen_show.shape

(34208, 20)

In [70]:
df_movgen_show.sum()

Adventure              2763.0
Animation              1387.0
Children               1609.0
Comedy                10124.0
Fantasy                1692.0
Romance                4875.0
Drama                 15774.0
Action                 4445.0
Crime                  3446.0
Thriller               5300.0
Horror                 3365.0
Mystery                1837.0
Sci-Fi                 2156.0
IMAX                    196.0
Documentary            3040.0
War                    1345.0
Musical                1052.0
Western                 779.0
Film-Noir               338.0
(no genres listed)     1145.0
dtype: float64

# Item based filter

An item-based filter attempts to figure out the input user's favorite genres from the movies and ratings given.

Let's begin by creating an input user to recommend movies to:

Notice: To add more movies, simply add more elements to __user_input__. Feel free to add more! Just be sure to capitalise each title as it appears in the dataset, and if a movie starts with "The", like "The Breakfast Club", write it like this: 'Breakfast Club, The'.

In [9]:
user_input = [
            {'title':'Breakfast Club, The', 'rating':5},
            {'title':'Toy Story', 'rating':3.5},
            {'title':'Jumanji', 'rating':2},
            {'title':"Pulp Fiction", 'rating':5},
            {'title':'Akira', 'rating':4.5}
         ] 
input_movies = pd.DataFrame(user_input)

With the input complete, let's extract the input movies' IDs from the movies dataframe and add them to the input.

We can achieve this by first filtering out the rows of df_movie whose title appears in input_movies and then merging this subset with the input_movies dataframe. We also drop unnecessary columns from the input to save memory.

In [10]:
#Filtering out the movies by title
input_id = df_movie[df_movie['title'].isin(input_movies['title'].tolist())]
#Then merging it so we can get the movieId. It's implicitly merging it by title.
input_movies = pd.merge(input_id, input_movies)
#Dropping information we won't use from the input dataframe
input_movies = input_movies.drop(['genres', 'year'], axis=1)
#Final input dataframe
#If a movie you added in above isn't here, then it might not be in the original 
#dataframe or it might be spelled differently, please check capitalisation.
input_movies

Unnamed: 0,movieId,title,rating
0,1,Toy Story,3.5
1,2,Jumanji,2.0
2,296,Pulp Fiction,5.0
3,1274,Akira,4.5
4,1968,"Breakfast Club, The",5.0
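
Because the merge above is an inner join, a misspelled title simply disappears from the result. One way to surface such titles, sketched here on hypothetical miniature data (`catalog` and `wanted` are made-up names, not notebook variables), is a left merge with `indicator=True`:

```python
import pandas as pd

# Hypothetical miniature catalog and user input; 'Pulp Fictoin' is misspelled on purpose.
catalog = pd.DataFrame({'movieId': [1, 2, 296],
                        'title': ['Toy Story', 'Jumanji', 'Pulp Fiction']})
wanted = pd.DataFrame([{'title': 'Toy Story', 'rating': 3.5},
                       {'title': 'Pulp Fictoin', 'rating': 5.0}])

# A left merge keeps every input row; the _merge column says whether a match was found.
checked = wanted.merge(catalog, on='title', how='left', indicator=True)
missing_titles = checked.loc[checked['_merge'] == 'left_only', 'title'].tolist()
matched = checked.loc[checked['_merge'] == 'both', ['movieId', 'title', 'rating']]

print('Not found in catalog:', missing_titles)
print(matched)
```

Unmatched rows carry NaN in movieId, so that column becomes float until the unmatched rows are dropped and it is cast back to an integer type.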


We're going to start by learning the input's preferences, so let's get the subset of movies that the input has watched from the Dataframe containing genres defined with binary values.

In [11]:
#Filtering out the movies from the input
user_movies = df_movgen[df_movgen['movieId'].isin(input_movies['movieId'].tolist())]
user_movies

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
293,296,Pulp Fiction,"[Comedy, Crime, Drama, Thriller]",1994,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1246,1274,Akira,"[Action, Adventure, Animation, Sci-Fi]",1988,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1885,1968,"Breakfast Club, The","[Comedy, Drama]",1985,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We'll only need the actual genre table, so let's clean this up a bit by resetting the index and dropping the __movieId, title, genres__ and __year__ columns.

In [12]:
#Resetting the index to avoid future issues
user_movies = user_movies.reset_index(drop=True)
#Dropping unnecessary columns to save memory and to avoid issues
userGenreTable = user_movies.drop(['movieId', 'title', 'genres', 'year'], axis=1)
userGenreTable

Unnamed: 0,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now we're ready to start learning the input's preferences!

To do this, we're going to turn each genre into a weight. We can do this by multiplying the input's ratings into the input's genre table and then summing up the resulting table by column. This operation is a dot product between a matrix and a vector, so we can accomplish it by calling Pandas's "dot" function.

In [13]:
input_movies['rating']

0    3.5
1    2.0
2    5.0
3    4.5
4    5.0
Name: rating, dtype: float64

In [14]:
#Dot product to get weights
userProfile = userGenreTable.transpose().dot(input_movies['rating'])
#The user profile
userProfile

Adventure             10.0
Animation              8.0
Children               5.5
Comedy                13.5
Fantasy                5.5
Romance                0.0
Drama                 10.0
Action                 4.5
Crime                  5.0
Thriller               5.0
Horror                 0.0
Mystery                0.0
Sci-Fi                 4.5
IMAX                   0.0
Documentary            0.0
War                    0.0
Musical                0.0
Western                0.0
Film-Noir              0.0
(no genres listed)     0.0
dtype: float64
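
The dot product can be checked by hand. The plain-Python sketch below copies the genre indicators and ratings of the five input movies from the tables above, restricted to five genres to keep it short; the variable names are illustrative only.

```python
# Genre indicators for the five input movies, copied from the genre table above
# (restricted to five genres to keep the example short).
genres = ['Adventure', 'Animation', 'Children', 'Comedy', 'Drama']
genre_rows = [
    [1, 1, 1, 1, 0],  # Toy Story
    [1, 0, 1, 0, 0],  # Jumanji
    [0, 0, 0, 1, 1],  # Pulp Fiction
    [1, 1, 0, 0, 0],  # Akira
    [0, 0, 0, 1, 1],  # Breakfast Club, The
]
ratings = [3.5, 2.0, 5.0, 4.5, 5.0]

# Weight of a genre = sum of the ratings of the input movies that carry it.
profile = {
    genre: sum(row[col] * rating for row, rating in zip(genre_rows, ratings))
    for col, genre in enumerate(genres)
}
print(profile)
```

For instance, Comedy collects the ratings of Toy Story, Pulp Fiction and Breakfast Club, The: 3.5 + 5.0 + 5.0 = 13.5, which agrees with the userProfile output.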

Now we have a weight for each of the user's preferences. This is known as the User Profile ( __userProfile__ ). Using this, we can recommend movies that satisfy the user's preferences.

Let's start by extracting the genre table from the original dataframe:

In [15]:
#Now let's get the genres of every movie in our original dataframe
genreTable = df_movgen.set_index(df_movgen['movieId'])
#And drop the unnecessary information
genreTable = genreTable.drop(['movieId', 'title', 'genres', 'year'], axis=1)
genreTable.head()

Unnamed: 0_level_0,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
1,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [16]:
genreTable.shape

(34208, 20)

In [17]:
#Multiply the genres by the weights and then take the weighted average
recommendationTable_df = ((genreTable*userProfile).sum(axis=1))/(userProfile.sum())
recommendationTable_df.head()

movieId
1    0.594406
2    0.293706
3    0.188811
4    0.328671
5    0.188811
dtype: float64
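
The first two scores can be reproduced by hand. This is a minimal plain-Python sketch that copies the non-zero profile weights from the userProfile output above; `score` is an illustrative helper, not a notebook function.

```python
# User profile weights copied from the userProfile output (zero-weight genres omitted).
user_profile = {
    'Adventure': 10.0, 'Animation': 8.0, 'Children': 5.5, 'Comedy': 13.5,
    'Fantasy': 5.5, 'Drama': 10.0, 'Action': 4.5, 'Crime': 5.0,
    'Thriller': 5.0, 'Sci-Fi': 4.5,
}

def score(movie_genres, profile):
    """Share of the total profile weight that falls on the movie's genres."""
    total = sum(profile.values())
    return sum(profile.get(g, 0.0) for g in movie_genres) / total

toy_story = score(['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy'], user_profile)
jumanji = score(['Adventure', 'Children', 'Fantasy'], user_profile)
print(round(toy_story, 6), round(jumanji, 6))
```

The weights sum to 71.5, so Toy Story scores 42.5 / 71.5 and Jumanji 21 / 71.5. Since every matching genre adds weight and nothing is subtracted, movies tagged with many genres tend to score higher under this measure.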

With the input's profile and the complete list of movies and their genres in hand, we're going to take the weighted average of every movie based on the input profile and recommend the top 5 movies that most satisfy it.

In [18]:
#Sort our recommendations in descending order
recommendationTable_df = recommendationTable_df.sort_values(ascending=False)
#Just a peek at the values
recommendationTable_df.head()

movieId
5018      0.748252
26093     0.734266
27344     0.720280
148775    0.685315
6902      0.678322
dtype: float64

In [19]:
#The final recommendation table
x=recommendationTable_df.head(5).keys()
## df_movie.query('movieId in @x')
## df_movie[df_movie['movieId'].isin(recommendationTable_df.head(5).keys())]
##### BY ORDER
mov_byitem=df_movie.copy()
mov_byitem=mov_byitem.set_index('movieId').loc[x].reset_index(inplace=False)

In [20]:
mov_byitem

Unnamed: 0,movieId,title,genres,year
0,5018,Motorama,"[Adventure, Comedy, Crime, Drama, Fantasy, Mys...",1991
1,26093,"Wonderful World of the Brothers Grimm, The","[Adventure, Animation, Children, Comedy, Drama...",1962
2,27344,Revolutionary Girl Utena: Adolescence of Utena...,"[Action, Adventure, Animation, Comedy, Drama, ...",1999
3,148775,Wizards of Waverly Place: The Movie,"[Adventure, Children, Comedy, Drama, Fantasy, ...",2009
4,6902,Interstate 60,"[Adventure, Comedy, Drama, Fantasy, Mystery, S...",2002


In [21]:
rate=df_rating[df_rating['movieId'].isin(mov_byitem['movieId'].tolist())].groupby('movieId')['rating'].mean()

In [22]:
rate_pd=pd.DataFrame(rate)

In [23]:
rate_pd.head()

Unnamed: 0_level_0,rating
movieId,Unnamed: 1_level_1
5018,3.130435
6902,3.866979
26093,3.5
27344,3.333333
148775,3.166667


In [24]:
mov_byitem=mov_byitem.merge(rate_pd, left_on='movieId', right_on='movieId')

In [25]:
print('MOVIES YOU MIGHT LIKE')
mov_byitem

MOVIES YOU MIGHT LIKE


Unnamed: 0,movieId,title,genres,year,rating
0,5018,Motorama,"[Adventure, Comedy, Crime, Drama, Fantasy, Mys...",1991,3.130435
1,26093,"Wonderful World of the Brothers Grimm, The","[Adventure, Animation, Children, Comedy, Drama...",1962,3.5
2,27344,Revolutionary Girl Utena: Adolescence of Utena...,"[Action, Adventure, Animation, Comedy, Drama, ...",1999,3.333333
3,148775,Wizards of Waverly Place: The Movie,"[Adventure, Children, Comedy, Drama, Fantasy, ...",2009,3.166667
4,6902,Interstate 60,"[Adventure, Comedy, Drama, Fantasy, Mystery, S...",2002,3.866979


# User based filter

A user-based filter attempts to find users that have similar preferences and opinions to the input user and then recommends items that they have liked. We will be using a method based on the __Pearson Correlation Function__.


The process for creating a User Based recommendation system is as follows:
- Select a user along with the movies the user has watched
- Based on the user's movie ratings, find the top X neighbours
- Get the watched-movie records of each neighbour
- Calculate a similarity score using some formula
- Recommend the items with the highest score


Let's begin by creating an input user to recommend movies to:

In [26]:
userInput = [
            {'title':'Breakfast Club, The', 'rating':5},
            {'title':'Toy Story', 'rating':3.5},
            {'title':'Jumanji', 'rating':2},
            {'title':"Pulp Fiction", 'rating':5},
            {'title':'Akira', 'rating':4.5}
         ] 
inputMovies = pd.DataFrame(userInput)

Add movieId to inputMovies


In [27]:
#Filtering out the movies by title
inputId = df_movie[df_movie['title'].isin(inputMovies['title'].tolist())]
#Then merging it so we can get the movieId. It's implicitly merging it by title.
inputMovies = pd.merge(inputId, inputMovies)
#Dropping information we won't use from the input dataframe
inputMovies = inputMovies.drop(['year', 'genres'], axis=1)
#Final input dataframe
#If a movie you added in above isn't here, then it might not be in the original 
#dataframe or it might be spelled differently, please check capitalisation.
inputMovies

Unnamed: 0,movieId,title,rating
0,1,Toy Story,3.5
1,2,Jumanji,2.0
2,296,Pulp Fiction,5.0
3,1274,Akira,4.5
4,1968,"Breakfast Club, The",5.0


With the movie IDs in our input, we can now get the subset of users that have watched and reviewed the movies in our input.



In [28]:
#Filtering out users that have watched movies that the input has watched and storing it
userSubset = df_rating[df_rating['movieId'].isin(inputMovies['movieId'].tolist())]
userSubset.head()

Unnamed: 0,userId,movieId,rating
19,4,296,4.0
441,12,1968,3.0
479,13,2,2.0
531,13,1274,5.0
681,14,296,2.0


We now group up the rows by user ID.

In [29]:
userSubsetGroup = userSubset.groupby('userId')

In [30]:
len(userSubsetGroup)

116140

Let's look at one of the users, e.g. the one with userId=1607:

In [31]:
userSubsetGroup.get_group(1607)

Unnamed: 0,userId,movieId,rating
144611,1607,1,5.0
144612,1607,2,4.0
144664,1607,296,3.0
145034,1607,1968,3.0


Let's also sort these groups so the users that share the most movies in common with the input have higher priority. This provides a richer recommendation since we won't go through every single user.

In [32]:
userSubsetGroup = sorted(userSubsetGroup,  key=lambda x: len(x[1]), reverse=True)

Now let's look at the first 5 users.

In [33]:
userSubsetGroup[0:5]

[(75,       userId  movieId  rating
  7507      75        1     5.0
  7508      75        2     3.5
  7540      75      296     5.0
  7633      75     1274     4.5
  7673      75     1968     5.0), (106,       userId  movieId  rating
  9083     106        1     2.5
  9084     106        2     3.0
  9115     106      296     3.5
  9198     106     1274     3.0
  9238     106     1968     3.5), (686,        userId  movieId  rating
  61336     686        1     4.0
  61337     686        2     3.0
  61377     686      296     4.0
  61478     686     1274     4.0
  61569     686     1968     5.0), (815,        userId  movieId  rating
  73747     815        1     4.5
  73748     815        2     3.0
  73922     815      296     5.0
  74362     815     1274     3.0
  74678     815     1968     4.5), (1040,        userId  movieId  rating
  96689    1040        1     3.0
  96690    1040        2     1.5
  96733    1040      296     3.5
  96859    1040     1274     3.0
  96922    1040     1968  

#### Similarity of users to input user
Next, we are going to compare some users to our specified user and find the ones that are most similar. We measure how similar each user is to the input through the __Pearson Correlation Coefficient__, which measures the strength of a linear association between two variables.

![](pearson.png)

In our case, a 1 means that the two users have similar tastes while a -1 means the opposite.
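
Assuming the figure shows the standard sample Pearson formula, it can be written in the computational form that the calculation below uses, where $x_i$ are the input user's ratings, $y_i$ the other user's ratings for the same movies, and $n$ the number of movies they share:

$$
S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}, \qquad
S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}, \qquad
S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}
$$

$$
r = \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}}
$$

The coefficient is undefined when $S_{xx}$ or $S_{yy}$ is zero, i.e. when one of the users gave the same rating to every shared movie.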

We will select a subset of users to iterate through. This limit is imposed because we don't want to waste too much time going through every single user.

In [34]:
userSubsetGroup = userSubsetGroup[0:5000]

Now we calculate the Pearson Correlation between the input user and each user in the subset, and store it in a dictionary.



In [35]:
#Store the Pearson Correlation in a dictionary, where the key is the user Id and the value is the coefficient
pearsonCorrelationDict = {}
#For every user group in our subset
for name, group in userSubsetGroup:
    #Let's start by sorting the input and current user group so the values aren't mixed up later on
    group = group.sort_values(by='movieId')
    inputMovies = inputMovies.sort_values(by='movieId')
    #Get the N for the formula
    nRatings = len(group)
    #Get the review scores for the movies that they both have in common
    temp_df = inputMovies[inputMovies['movieId'].isin(group['movieId'].tolist())]
    #And then store them in a temporary buffer variable in a list format to facilitate future calculations
    tempRatingList = temp_df['rating'].tolist()
    #Let's also put the current user group reviews in a list format
    tempGroupList = group['rating'].tolist()
    #Now let's calculate the pearson correlation between two users, so called, x and y
    Sxx = sum([i**2 for i in tempRatingList]) -(pow(sum(tempRatingList),2)/float(nRatings))
    Syy = sum([i**2 for i in tempGroupList]) - (pow(sum(tempGroupList),2)/float(nRatings))
    Sxy = sum( i*j for i, j in zip(tempRatingList, tempGroupList)) - (sum(tempRatingList)*sum(tempGroupList)/float(nRatings))
    
    #If the denominator is different than zero, then divide, else, 0 correlation.
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[name] = 0
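

The same calculation can be isolated in a small function and checked against the first users listed earlier. This is only a sketch: `pearson` is an illustrative name, and the rating lists are copied by hand from the outputs above, ordered by movieId.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length rating lists; 0.0 if either has no variance."""
    n = len(x)
    sxx = sum(v * v for v in x) - sum(x) ** 2 / n
    syy = sum(v * v for v in y) - sum(y) ** 2 / n
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

# Ratings ordered by movieId (1, 2, 296, 1274, 1968), copied from the outputs above.
input_ratings = [3.5, 2.0, 5.0, 4.5, 5.0]
user_75 = [5.0, 3.5, 5.0, 4.5, 5.0]
user_106 = [2.5, 3.0, 3.5, 3.0, 3.5]

r_75 = pearson(input_ratings, user_75)
r_106 = pearson(input_ratings, user_106)
print(round(r_75, 6), round(r_106, 6))
```

Both lists must be in the same movie order, which is why the loop above sorts the input and each group by movieId before comparing them.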



In [36]:
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearsonDF.columns = ['similarityIndex']
pearsonDF['userId'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
pearsonDF.head()

Unnamed: 0,similarityIndex,userId
0,0.827278,75
1,0.586009,106
2,0.83205,686
3,0.576557,815
4,0.943456,1040



Now let's get the top 50 users that are most similar to the input.

In [37]:
topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:50]
topUsers.head()

Unnamed: 0,similarityIndex,userId
3156,1.0,72209
3472,1.0,86075
1705,1.0,12091
4526,1.0,129566
4356,1.0,121972


Now, let's start recommending movies to the input user.

We're going to do this by taking the weighted average of the ratings of the movies, using the Pearson Correlation as the weight. To do this, we first need to get the movies watched by the users in __topUsers__ from the ratings dataframe, so that each rating sits next to its user's correlation in the _similarityIndex_ column. This is achieved below by merging these two tables.

In [38]:
topUsersRating=topUsers.merge(df_rating, left_on='userId', right_on='userId', how='inner')
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating
0,1.0,72209,1,4.0
1,1.0,72209,2,3.0
2,1.0,72209,6,5.0
3,1.0,72209,8,4.0
4,1.0,72209,16,4.5


Now all we need to do is multiply each movie rating by its weight (the similarity index), then sum up the weighted ratings and divide by the sum of the weights.

We can do this by multiplying two columns, grouping the dataframe by movieId, and then dividing two columns.

The first step gives every similar user's weighted rating for each candidate movie:

In [39]:
#Multiplies the similarity by the user's ratings
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['rating']
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating,weightedRating
0,1.0,72209,1,4.0,4.0
1,1.0,72209,2,3.0,3.0
2,1.0,72209,6,5.0,5.0
3,1.0,72209,8,4.0,4.0
4,1.0,72209,16,4.5,4.5


In [40]:
#Applies a sum to topUsersRating after grouping it by movieId
tempTopUsersRating = topUsersRating.groupby('movieId').sum()[['similarityIndex','weightedRating']]
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
tempTopUsersRating.head()

Unnamed: 0_level_0,sum_similarityIndex,sum_weightedRating
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,49.994622,179.986485
2,49.994622,135.495932
3,19.0,51.0
4,5.0,12.5
5,11.0,31.0


In [41]:
#Creates an empty dataframe
df_rec = pd.DataFrame()
#Now we take the weighted average
df_rec['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
# df_rec['movieId'] = tempTopUsersRating.index
df_rec.head()

Unnamed: 0_level_0,weighted average recommendation score
movieId,Unnamed: 1_level_1
1,3.600117
2,2.71021
3,2.684211
4,2.5
5,2.818182
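
To make the weighted average concrete, here is a plain-Python sketch on hypothetical numbers (the rows below are invented, not taken from the dataset). It accumulates the same two sums per movie and divides them.

```python
from collections import defaultdict

# Hypothetical (similarity, movieId, rating) rows for a few neighbours.
rows = [
    (1.0, 10, 4.0),
    (0.5, 10, 2.0),
    (0.8, 20, 5.0),
    (1.0, 20, 3.0),
    (0.5, 30, 4.5),
]

sum_similarity = defaultdict(float)
sum_weighted = defaultdict(float)
for similarity, movie_id, rating in rows:
    sum_similarity[movie_id] += similarity
    sum_weighted[movie_id] += similarity * rating

# Weighted average per movie: sum(similarity * rating) / sum(similarity).
scores = {m: sum_weighted[m] / sum_similarity[m] for m in sum_similarity}
print(scores)
```

In this toy data, movie 30 was rated by a single neighbour, so its score is just that neighbour's rating, and it ends up ranked first. A movie rated by only one or two neighbours can reach the maximum score the same way, so a minimum number of raters may be worth considering.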


Now let's sort it and see the top 10 movies that the algorithm recommended!

In [42]:
df_rec = df_rec.sort_values(by='weighted average recommendation score', ascending=False)
df_rec.head(10)

Unnamed: 0_level_0,weighted average recommendation score
movieId,Unnamed: 1_level_1
921,5.0
1162,5.0
1446,5.0
6100,5.0
98491,5.0
73344,5.0
3304,5.0
4529,5.0
8799,5.0
4432,5.0


In [43]:
# df_movie.loc[df_movie['movieId'].isin(df_rec.head(10)['movieId'].tolist())]
## BY ORDER
# xy=df_rec.head(5).index.tolist()
# df_mvx=df_movie.copy()
# df_mvx=df_mvx[df_mvx['movieId'].isin(xy)]

In [44]:
df_mvu=df_movie.copy()
df_mvu=df_rec.merge(df_mvu,left_on='movieId',right_on='movieId',how='inner')
df_mvu.head()

Unnamed: 0,movieId,weighted average recommendation score,title,genres,year
0,921,5.0,My Favorite Year,[Comedy],1982
1,1162,5.0,"Ruling Class, The","[Comedy, Drama]",1972
2,1446,5.0,Kolya (Kolja),"[Comedy, Drama]",1996
3,6100,5.0,"Midsummer Night's Sex Comedy, A","[Comedy, Romance]",1982
4,98491,5.0,Paperman,"[Animation, Comedy, Romance]",2012


In [45]:
rat=df_rating[df_rating['movieId'].isin(df_mvu['movieId'].tolist())].groupby('movieId')['rating'].mean()

In [46]:
rate_us=pd.DataFrame(rat)

In [47]:
df_mvu=df_mvu.merge(rate_us,left_on='movieId',right_on='movieId')

In [48]:
df_mvu_new=df_mvu.sort_values(by=['weighted average recommendation score','rating'],ascending=[False,False]).reset_index(drop=True)

In [49]:
df_mvu_rec=df_mvu_new.loc[0:5,['movieId','title','genres','year','rating']]
print('Other people who watched the movie(s) also watched and liked')
df_mvu_rec.head()

Other people who watched the movie(s) also watched and liked


Unnamed: 0,movieId,title,genres,year,rating
0,139096,Unmatched,[Documentary],2010,4.5
1,139114,The Price of Gold,[Documentary],2014,4.5
2,139116,Requiem For The Big East,[Documentary],2014,4.333333
3,139100,Once Brothers,[Documentary],2010,4.230769
4,139098,Four Days in October,[Documentary],2010,4.2


Conclusion:
The item-based technique is highly personalised for the user, but it does not take into account what others think of the item. The user-based technique takes other users' ratings into consideration and adapts to the user's interests, which might change over time, but it may take longer to compute.
