## Introduction

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

In [3]:
df1 = pd.read_csv("movies_cleaned.csv")  
df2 = pd.read_csv("ratings_small.csv")     

In [4]:
df1.head()        # movies dataset

Unnamed: 0,budget,id,original_language,original_title,overview,popularity,release_date,revenue,runtime,tagline,...,production_countries,year,month,date,spoken_languages,cast,director,exec_prod,screenplay,composer
0,237000000,19995,en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,2009-12-10,2787965087,162.0,Enter the World of Pandora.,...,"['United States of America', 'United Kingdom']",2009.0,12.0,10.0,"['English', 'Español']","['Sam Worthington', 'Zoe Saldana', 'Sigourney ...",James Cameron,Laeta Kalogridis,James Cameron,James Horner
1,300000000,285,en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,2007-05-19,961000000,169.0,"At the end of the world, the adventure begins.",...,['United States of America'],2007.0,5.0,19.0,['English'],"['Johnny Depp', 'Orlando Bloom', 'Keira Knight...",Gore Verbinski,Mike Stenson,Ted Elliott,Hans Zimmer
2,245000000,206647,en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,2015-10-26,880674609,148.0,A Plan No One Escapes,...,"['United Kingdom', 'United States of America']",2015.0,10.0,26.0,"['Français', 'English', 'Español', 'Italiano',...","['Daniel Craig', 'Christoph Waltz', 'Léa Seydo...",Sam Mendes,Callum McDougall,John Logan,Thomas Newman
3,250000000,49026,en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,2012-07-16,1084939099,165.0,The Legend Ends,...,['United States of America'],2012.0,7.0,16.0,['English'],"['Christian Bale', 'Michael Caine', 'Gary Oldm...",Christopher Nolan,Michael Uslan,Christopher Nolan,Hans Zimmer
4,260000000,49529,en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,2012-03-07,284139100,132.0,"Lost in our world, found in another.",...,['United States of America'],2012.0,3.0,7.0,['English'],"['Taylor Kitsch', 'Lynn Collins', 'Samantha Mo...",Andrew Stanton,,Andrew Stanton,


In [5]:
df2.head()        # ratings dataset

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [6]:
df1.columns

Index(['budget', 'id', 'original_language', 'original_title', 'overview',
       'popularity', 'release_date', 'revenue', 'runtime', 'tagline', 'title',
       'vote_average', 'vote_count', 'genres', 'keywords',
       'production_companies', 'production_countries', 'year', 'month', 'date',
       'spoken_languages', 'cast', 'director', 'exec_prod', 'screenplay',
       'composer'],
      dtype='object')

In this notebook, we will explore three types of recommender systems:  
- Demographic Filtering: Groups users by shared attributes (e.g. same age group, same country) and makes recommendations based on those demographic classes 
- Content Based Filtering: Learns a profile of a user's interests from the features of the items the user has rated. In our movies dataframe, these features could be 'overview', 'tagline', 'genres', 'cast', 'director'  
- Collaborative Filtering: Matches users with similar movie tastes on the basis of their ratings, then provides new recommendations based on inter-user comparisons

## Demographic Filtering  
Looking at the ratings dataframe, we don't have any information on users other than what movies they viewed and how much they liked each movie. Thus, we cannot really perform any demographic filtering. Nevertheless, we will treat all the users as a single large demographic and implement a recommendation system for them. The system implemented here will give the same very simple and generalized recommendation to all users based on movie popularity.  

First we need to determine an appropriate scoring system to determine the most 'popular' movies. I will use two popularity scoring systems: First is the Internet Movie Database's (IMDB) weighted rating formula and second is the popularity calculated by The Movie Database (TMDB) and already included in the movies dataset.  

### IMDB's Weighted Rating (WR) Formula  
The formula is given by:  
$\text{Weighted Rating (WR)} = \left(\frac{v}{v+m} \cdot R\right) + \left(\frac{m}{v+m} \cdot C\right)$  
where  
- v: number of votes for the movie  
- m: minimum votes required to be listed in the chart  
- R: average rating of the movie  
- C: mean vote across the whole report  
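
To get a feel for how the formula behaves, here is a minimal sketch with made-up numbers. The values of `m` and `C`, the three hypothetical movies and the function name `weighted_rating_example` are assumptions for illustration and are not taken from the dataset. The point is that a movie with few votes is pulled toward the global mean C, while a movie with many votes keeps a score close to its own average R.

```python
def weighted_rating_example(v, R, m, C):
    """IMDB-style weighted rating: blend a movie's own mean R with the global mean C."""
    return v / (v + m) * R + m / (v + m) * C

# Illustrative values only (not from the dataset)
m, C = 1000, 6.0

few_votes = weighted_rating_example(v=50, R=9.0, m=m, C=C)        # pulled strongly toward C
many_votes = weighted_rating_example(v=50_000, R=9.0, m=m, C=C)   # stays close to R
at_threshold = weighted_rating_example(v=1000, R=9.0, m=m, C=C)   # v == m: halfway between R and C

print(round(few_votes, 3), round(many_votes, 3), round(at_threshold, 3))
```

When v equals m, the two weights are both 1/2, so the score is the midpoint of R and C.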

We already have v and R from the movies dataset.  
C can be easily calculated as:

In [7]:
C = df1['vote_average'].mean()
C

6.092171559442011

Now let's set a value for m. We'll say a movie has to be in the 90th percentile, meaning it must have more votes than at least 90% of the movies in the list, to be considered.

In [8]:
m = df1['vote_count'].quantile(0.9)
m

1838.4000000000015

Now we can filter down to the movies that qualify for our recommendations list:

In [9]:
qualified = df1[df1['vote_count'] >= m]
len(qualified)

481

In [10]:
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    return ( v/(v+m)*R + m/(v+m)*C )

In [11]:
# work on an explicit copy so the new column is not assigned to a slice of df1
# (assigning to the slice directly raises pandas' SettingWithCopyWarning)
qualified = qualified.copy()
qualified['score'] = qualified.apply(weighted_rating, axis=1)


Now let's see the top 10 movies based on the WR formula:

In [12]:
qualified = qualified.sort_values('score', ascending = False)

qualified[['title', 'vote_average', 'score']].head(10)

Unnamed: 0,title,vote_average,score
1881,The Shawshank Redemption,8.5,8.059258
662,Fight Club,8.3,7.939256
65,The Dark Knight,8.2,7.92002
3232,Pulp Fiction,8.3,7.904645
96,Inception,8.1,7.863239
3337,The Godfather,8.4,7.851236
95,Interstellar,8.1,7.809479
809,Forrest Gump,8.2,7.803188
329,The Lord of the Rings: The Return of the King,8.1,7.727243
1990,The Empire Strikes Back,8.2,7.697884


### Popularity (Calculated by TMDB)
Alternatively, we could use the movies' popularity as a basis for recommendations.  
To do so, we simply sort the movies in order of descending popularity:

In [13]:
df1.sort_values('popularity', ascending = False)[['title','popularity']].head(10)

Unnamed: 0,title,popularity
546,Minions,875.581305
95,Interstellar,724.247784
788,Deadpool,514.569956
94,Guardians of the Galaxy,481.098624
127,Mad Max: Fury Road,434.278564
28,Jurassic World,418.708552
199,Pirates of the Caribbean: The Curse of the Bla...,271.972889
82,Dawn of the Planet of the Apes,243.791743
200,The Hunger Games: Mockingjay - Part 1,206.227151
88,Big Hero 6,203.73459


## Content Based Filtering  
In this recommender system, the features of the movie (overview, keywords, genres, cast, etc) are used to find its similarity with other movies. 

### Plot Description Based Recommender

First we need to compute Term Frequency-Inverse Document Frequency (TF-IDF) vectors for each overview.  

Term frequency (TF) is the relative frequency of a word in a document and is given by (occurrences of the term / total number of terms in the document).  
Inverse document frequency (IDF) measures how rare a term is across the documents and is given by log(number of documents / number of documents containing the term).  
The overall importance of a word to a document in which it appears is equal to TF * IDF.  

The result is a matrix where each column represents a word in the overview vocabulary and each row represents a movie.
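
As a rough illustration of the definitions above, the sketch below computes TF * IDF by hand on a three-sentence toy corpus. The corpus and the helper name `tf_idf` are invented for this example. Note that scikit-learn's `TfidfVectorizer` uses raw counts, a smoothed IDF and L2 normalization by default, so its numbers will differ from this textbook version.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration
docs = [
    "a marine explores an alien moon",
    "a pirate explores the sea",
    "a spy chases a villain",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# document frequency: in how many documents does each term appear?
df_counts = Counter(term for doc in tokenized for term in set(doc))

def tf_idf(doc):
    """Textbook TF * IDF for every term in one tokenized document."""
    counts = Counter(doc)
    total = len(doc)
    return {t: (c / total) * math.log(n_docs / df_counts[t]) for t, c in counts.items()}

weights = [tf_idf(doc) for doc in tokenized]
print(weights[0])
```

The word "a" appears in every document, so its IDF is log(1) = 0 and it carries no weight, whereas "marine" (one document) outweighs "explores" (two documents).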

In [14]:
# define a TF-IDF Vectorizer Object. Remove all english stop words
tfidf = TfidfVectorizer(stop_words='english')

tfidf_matrix = tfidf.fit_transform(df1['overview'].fillna(''))

tfidf_matrix.shape  # there were 20,978 words used to describe the 4803 movies in the dataset

(4803, 20978)

Now we have to compute a similarity score matrix for the tfidf matrix.  
I will be using the cosine similarity score metric. The formula is given by:  
$similarity = cos(\theta) = \frac{A.B}{||A||||B||} = \frac{\sum\limits_{i=1}^{n}A_i B_i}{\sqrt{\sum\limits_{i=1}^{n}{A_i}^2}  \sqrt{\sum\limits_{i=1}^{n}{B_i}^2} }$
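
The formula above can be checked on two small vectors. This is a minimal pure-Python sketch (the vectors `A`, `B` and the helper names are made up for illustration); it also shows that once both vectors are L2-normalized, the plain dot product already equals the cosine similarity, because the denominator becomes 1.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def l2_normalize(a):
    """Scale a vector so that its L2 norm is 1."""
    norm = math.sqrt(sum(x * x for x in a))
    return [x / norm for x in a]

A = [1.0, 2.0, 0.0]
B = [2.0, 1.0, 1.0]

full = cosine(A, B)
# after L2 normalization, the dot product alone is the cosine similarity
shortcut = sum(x * y for x, y in zip(l2_normalize(A), l2_normalize(B)))
print(full, shortcut)
```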

In [15]:
# TfidfVectorizer L2-normalizes each row by default, so the dot product computed by linear_kernel() equals the cosine similarity, and it is faster than cosine_similarity()
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
cosine_sim

array([[1.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.02160533, 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.01488159, 0.        ,
        0.        ],
       ...,
       [0.        , 0.02160533, 0.01488159, ..., 1.        , 0.01609091,
        0.00701914],
       [0.        , 0.        , 0.        , ..., 0.01609091, 1.        ,
        0.01171696],
       [0.        , 0.        , 0.        , ..., 0.00701914, 0.01171696,
        1.        ]])

Now we are all set to implement our content-based recommender system. Our system operates as follows:  

- The user inputs a movie title, optionally the cosine similarity matrix (default value = cosine similarity based on tfidf of movie overview) and the number of recommendations m (default value = 10)
- Get the index of the movie using its title
- Get the list of cosine similarity scores for that particular movie with all movies. Convert it into a list of tuples where the first element is its position and the second is the similarity score.  
- Sort the aforementioned list of tuples based on the similarity scores
- Get the top m elements of this list. Ignore the first element as it refers to the input movie (the movie most similar to a particular movie is the movie itself).    
- Return the titles corresponding to the indices of the top elements

In [16]:
# Create a series that can be used to identify the index of a movie given its title
indices = pd.Series(df1.index, index = df1['title'])
indices

title
Avatar                                         0
Pirates of the Caribbean: At World's End       1
Spectre                                        2
The Dark Knight Rises                          3
John Carter                                    4
                                            ... 
El Mariachi                                 4798
Newlyweds                                   4799
Signed, Sealed, Delivered                   4800
Shanghai Calling                            4801
My Date with Drew                           4802
Length: 4803, dtype: int64

In [17]:
# function that takes in movie title as input and outputs most similar movies
def get_movie_recommendations(title, cosine_sim=cosine_sim, num_recom=10):
    
    # get the index of the movie that matches the title
    index = indices[title]

    # get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[index]))

    # sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # get the ids and scores of the num_recom most similar movies, excluding the input movie
    top_movies = sim_scores[1:num_recom+1]

    # get the indices of those top movies
    top_movies_indices = [i[0] for i in top_movies]

    return df1[['id','title','vote_average']].iloc[top_movies_indices]

In [18]:
get_movie_recommendations('Alice in Wonderland')

Unnamed: 0,id,title,vote_average
105,241259,Alice Through the Looking Glass,6.5
3455,10160,A Nightmare on Elm Street 5: The Dream Child,5.5
4330,27374,Lake Mungo,6.1
4732,13538,Straightheads,5.2
3329,45650,The Hole,5.6
2294,129,Spirited Away,8.3
799,35791,Resident Evil: Afterlife,5.8
3589,284293,Still Alice,7.5
4655,77934,The Christmas Bunny,5.7
2733,10131,A Nightmare on Elm Street 4: The Dream Master,5.8


In [19]:
get_movie_recommendations('Avengers: Age of Ultron')

Unnamed: 0,id,title,vote_average
16,24428,The Avengers,7.4
79,10138,Iron Man 2,6.6
68,1726,Iron Man,7.4
26,271110,Captain America: Civil War,7.1
227,37834,Knight and Day,5.9
31,68721,Iron Man 3,6.8
1868,10623,Cradle 2 the Grave,5.8
344,44048,Unstoppable,6.3
1922,10655,Gettysburg,6.6
531,203801,The Man from U.N.C.L.E.,7.1


Our system finds movies with similar plot descriptions. Notice in the second set of recommendations that most of the recommendations are also part of the Avengers franchise.  
Next we'll build a more elaborate recommender that uses a larger set of metadata.

### Cast, Crew, Genres and Keywords Based Recommender  
We will use the following metadata: the top 5 actors, the director, executive producer, genres and keywords.

We need to convert the names and keyword instances into lowercase and strip spaces between them. This is done so that our vectorizer doesn't count the Johnny of "Johnny Depp" and "Johnny Galecki" as the same.

In [20]:
# function to convert strings to lower case and strip names of spaces
# note: columns loaded with pd.read_csv hold plain strings, so a list stored as text
# (e.g. "['a', 'b']") takes the string branch below unless it is parsed into a list first
def clean_features(x):
    # if the value is a list i.e. 'genres', 'keywords' and 'cast' columns
    if isinstance(x, list):
        return [i.replace(" ", "").lower() for i in x]
    # if the value is a string i.e. 'director', 'exec_prod' columns
    elif isinstance(x, str):
        return x.replace(" ", "").lower()
    # if there is no value, return empty string
    else:
        return ''

In [21]:
features = ['genres','keywords','cast','director','exec_prod']
            
for i in features:
    df1[i] = df1[i].apply(clean_features)
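
One thing worth checking at this point: list-valued columns that were saved to CSV typically come back from `pd.read_csv` as strings such as `"['Johnny Depp', 'Orlando Bloom']"`, not as Python lists. The sketch below shows one possible way to parse such a cell with the standard library before cleaning it. The helper name `parse_list_cell` is hypothetical, and whether it is needed depends on how `movies_cleaned.csv` was written.

```python
import ast

def parse_list_cell(value):
    """Hypothetical helper: turn "['A B', 'C D']" into ['A B', 'C D']; pass real lists through."""
    if isinstance(value, list):
        return value
    if isinstance(value, str) and value.strip().startswith('['):
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return []          # malformed text: treat as no values
        return parsed if isinstance(parsed, list) else []
    return []                  # NaN or any other non-list value

raw = "['Johnny Depp', 'Orlando Bloom']"
names = parse_list_cell(raw)
# same cleaning as above: drop spaces, lower-case
cleaned = [n.replace(" ", "").lower() for n in names]
print(cleaned)
```

Under that assumption, applying the helper to the 'genres', 'keywords' and 'cast' columns before calling `clean_features` would hand it real lists, so each name becomes a single token.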

We create a metadata soup which is essentially a long string containing all the strings inside our various metadata.

In [22]:
# note: 'exec_prod' is cleaned above but is not added to the soup here
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

df1['soup'] = df1.apply(create_soup, axis=1)

Here we use CountVectorizer instead of TF-IDF because we do not want to down-weight an actor, director or executive producer simply for having worked on many movies.

In [23]:
# calculate the count matrix
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(df1['soup'])

In [24]:
# compute the cosine similarity matrix based on count matrix
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

We can get our new recommendations using the get_movie_recommendations() function we made earlier, but passing cosine_sim2 instead of cosine_sim as the second argument.

In [25]:
get_movie_recommendations('Alice in Wonderland', cosine_sim2)

Unnamed: 0,id,title,vote_average
117,118,Charlie and the Chocolate Factory,6.7
133,62213,Dark Shadows,5.7
278,869,Planet of the Apes,5.6
428,364,Batman Returns,6.6
473,75,Mars Attacks!,6.1
583,587,Big Fish,7.6
1293,62214,Frankenweenie,6.6
1359,268,Batman,7.0
1594,3933,Corpse Bride,7.2
2108,162,Edward Scissorhands,7.5


In [26]:
get_movie_recommendations('Avengers: Age of Ultron', cosine_sim2)

Unnamed: 0,id,title,vote_average
16,24428,The Avengers,7.4
1294,16320,Serenity,7.4
0,19995,Avatar,7.2
1,285,Pirates of the Caribbean: At World's End,6.9
2,206647,Spectre,6.3
3,49026,The Dark Knight Rises,7.6
4,49529,John Carter,6.1
5,559,Spider-Man 3,5.9
6,38757,Tangled,7.4
8,767,Harry Potter and the Half-Blood Prince,7.4


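As we can see, the recommendations for both movies have changed, but the results deserve a closer look. Every recommendation for 'Alice in Wonderland' is a Tim Burton film, and for 'Avengers: Age of Ultron' only the first two results (both directed by Joss Whedon) look related; the remaining rows appear in plain index order, which suggests tied similarity scores rather than genuine variety. A likely cause is that the director is the only feature contributing usable tokens, for example because the list-valued columns were loaded from the CSV as strings, so it is worth verifying the contents of the 'soup' column before trusting these recommendations.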

## Collaborative Filtering  (CF)
There are two main types of collaborative filtering:  
1. User-based CF: These systems recommend products to a user that similar users have liked (rated highly)  
    A main issue with user-based CF is that users' preferences can change over time. This means that a precomputed similarity matrix may perform badly in the future. We can mitigate this problem by using item-based CF.   
2. Item-based CF: Recommends items based on their similarity with the items that the target user has previously rated  
    Some issues arise for this method. The first is scalability: the computation grows with both the number of users and the number of items, and the worst-case complexity is O(mn) for m users and n items. The second is sparsity: even with millions of users, most pairs of movies share very few raters, so the similarity between two very different movies can come out very high simply because one or a few users gave the two movies similar ratings. 

### User-Based CF  

First we need to create a ratings matrix.

We start from every user's rating for each movie (ind_rating); avg_rating is then the mean of that single user's ratings across all the movies he/she has rated.  
  
Lastly, norm_rating is the user's rating for a movie centered on that user's own average, which removes the difference between harsh and generous raters:  
norm_rating = ind_rating - avg_rating
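
To see why this centering helps, consider the following minimal sketch on invented ratings (the two users and three movies are assumptions for illustration). A harsh rater and a generous rater who rank the movies in the same order end up with norm_rating values of the same sign, even though their raw ratings look quite different.

```python
from collections import defaultdict

# Toy ratings, invented: (userId, movieId, rating)
ratings = [
    (1, 10, 2.0), (1, 20, 3.0), (1, 30, 4.0),   # a harsh rater
    (2, 10, 4.0), (2, 20, 4.5), (2, 30, 5.0),   # a generous rater
]

# avg_rating: mean of each user's own ratings
by_user = defaultdict(list)
for user, _, rating in ratings:
    by_user[user].append(rating)
avg_rating = {user: sum(r) / len(r) for user, r in by_user.items()}

# norm_rating = ind_rating - avg_rating
norm_rating = {(user, movie): rating - avg_rating[user] for user, movie, rating in ratings}
print(avg_rating)
print(norm_rating)
```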

In [27]:
rating_mean = df2.groupby(by = "userId")['rating'].mean()
rating_avg = df2.merge(rating_mean, on = 'userId')

# because the ratings dataset (df2) and rating_mean both have a column named 'rating', the merge renames them to rating_x and rating_y
# we rename them to 'ind_rating' and 'avg_rating' for clarity
rating_avg = rating_avg.rename(columns={'rating_x':'ind_rating', 'rating_y':'avg_rating'})

rating_avg['norm_rating'] = rating_avg['ind_rating'] - rating_avg['avg_rating']   
rating_avg.head()

Unnamed: 0,userId,movieId,ind_rating,timestamp,avg_rating,norm_rating
0,1,31,2.5,1260759144,2.55,-0.05
1,1,1029,3.0,1260759179,2.55,0.45
2,1,1061,3.0,1260759182,2.55,0.45
3,1,1129,2.0,1260759185,2.55,-0.55
4,1,1172,4.0,1260759205,2.55,1.45


In [28]:
# this pivot table will be our ratings matrix for normalized ratings
ratings_matrix = pd.pivot_table(rating_avg, values='norm_rating', index='userId', columns='movieId')
ratings_matrix.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,161084,161155,161594,161830,161918,161944,162376,162542,162672,163949
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,0.513158,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,-0.348039,...,,,,,,,,,,
5,,,0.09,,,,,,,,...,,,,,,,,,,


Impute missing values

In [29]:
# Replace NaN by movie average over the column
ratings_matrix = ratings_matrix.fillna(ratings_matrix.mean(axis=0))
ratings_matrix.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,161084,161155,161594,161830,161918,161944,162376,162542,162672,163949
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.225976,-0.155981,-0.421958,-1.248102,-0.322901,0.24794,-0.22842,-0.149791,-0.466768,-0.101602,...,-0.866792,-2.121765,-0.374224,-1.894236,-1.394236,0.468504,1.125776,1.328571,-0.671429,1.633208
2,0.225976,-0.155981,-0.421958,-1.248102,-0.322901,0.24794,-0.22842,-0.149791,-0.466768,0.513158,...,-0.866792,-2.121765,-0.374224,-1.894236,-1.394236,0.468504,1.125776,1.328571,-0.671429,1.633208
3,0.225976,-0.155981,-0.421958,-1.248102,-0.322901,0.24794,-0.22842,-0.149791,-0.466768,-0.101602,...,-0.866792,-2.121765,-0.374224,-1.894236,-1.394236,0.468504,1.125776,1.328571,-0.671429,1.633208
4,0.225976,-0.155981,-0.421958,-1.248102,-0.322901,0.24794,-0.22842,-0.149791,-0.466768,-0.348039,...,-0.866792,-2.121765,-0.374224,-1.894236,-1.394236,0.468504,1.125776,1.328571,-0.671429,1.633208
5,0.225976,-0.155981,0.09,-1.248102,-0.322901,0.24794,-0.22842,-0.149791,-0.466768,-0.101602,...,-0.866792,-2.121765,-0.374224,-1.894236,-1.394236,0.468504,1.125776,1.328571,-0.671429,1.633208


Now we have to create the similarity matrix. As before, we will use cosine similarity as the similarity metric.

In [30]:
# user similarity matrix 
cosine_sim3 = cosine_similarity(ratings_matrix, ratings_matrix)
# replace values on diagonal (1) with 0 so when we get our most similar users, we won't get the input user
np.fill_diagonal(cosine_sim3, 0)
  
# name the index 'userId'                            
cosine_sim3 = pd.DataFrame(cosine_sim3, index = ratings_matrix.index) 
# name the column 'userId'
cosine_sim3.columns = ratings_matrix.index  
                                
cosine_sim3.head()

userId,1,2,3,4,5,6,7,8,9,10,...,662,663,664,665,666,667,668,669,670,671
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.993278,0.996512,0.988383,0.994534,0.994378,0.993484,0.993246,0.996332,0.996581,...,0.995185,0.998248,0.986868,0.968298,0.995535,0.995306,0.997498,0.99747,0.994596,0.993397
2,0.993278,0.0,0.992341,0.983894,0.990455,0.990142,0.990065,0.988928,0.992027,0.992153,...,0.990727,0.994029,0.982806,0.962527,0.991146,0.989652,0.993086,0.992976,0.989205,0.989623
3,0.996512,0.992341,0.0,0.987406,0.993557,0.992872,0.992655,0.99221,0.994993,0.995528,...,0.993802,0.997248,0.985862,0.966684,0.99526,0.994572,0.996308,0.996766,0.993372,0.992337
4,0.988383,0.983894,0.987406,0.0,0.985509,0.985428,0.984978,0.984434,0.987147,0.987446,...,0.986179,0.989202,0.977682,0.961241,0.986209,0.986664,0.98857,0.988482,0.985895,0.984291
5,0.994534,0.990455,0.993557,0.985509,0.0,0.991397,0.990951,0.990023,0.993226,0.993758,...,0.992437,0.995277,0.984533,0.966469,0.992392,0.992207,0.994178,0.994566,0.99157,0.990364


Now we create a function to find n neighbors:

In [31]:
# this function finds the n closest neighbors for each user based on a similarity matrix df
def find_n_neighbors(df,n):
    # for each row (user), sort the similarities in descending order and keep the labels of the first n
    df = df.apply(lambda x: pd.Series(x.sort_values(ascending=False).iloc[:n].index, 
         index=['neighbor{}'.format(i) for i in range(1, n+1)]), axis=1)
    return df
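
The same idea can be sketched without pandas, which may make the logic easier to follow. The nested dictionary `sim` and the function name `top_n_neighbors` below are invented for illustration; the sketch simply picks, for each user, the n other users with the highest similarity.

```python
import heapq

# Toy similarity scores, invented: sim[u][v] is the similarity between users u and v
sim = {
    1: {2: 0.91, 3: 0.40, 4: 0.75},
    2: {1: 0.91, 3: 0.55, 4: 0.10},
    3: {1: 0.40, 2: 0.55, 4: 0.80},
    4: {1: 0.75, 2: 0.10, 3: 0.80},
}

def top_n_neighbors(sim, n):
    """Return {user: [n most similar other users, most similar first]}."""
    return {
        user: heapq.nlargest(n, others, key=others.get)
        for user, others in sim.items()
    }

neighbors = top_n_neighbors(sim, 2)
print(neighbors)
```

Because each user's own entry is left out of `sim`, there is no need to zero the diagonal as we did for the DataFrame version.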

In [32]:
# top 30 neighbors for each user
users_neighbors = find_n_neighbors(cosine_sim3,30)
users_neighbors.head()

Unnamed: 0_level_0,neighbor1,neighbor2,neighbor3,neighbor4,neighbor5,neighbor6,neighbor7,neighbor8,neighbor9,neighbor10,...,neighbor21,neighbor22,neighbor23,neighbor24,neighbor25,neighbor26,neighbor27,neighbor28,neighbor29,neighbor30
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,446,663,335,438,280,181,566,449,443,109,...,657,413,71,513,459,319,249,448,651,302
2,64,449,357,443,633,280,663,446,438,233,...,100,583,526,435,632,506,24,459,511,218
3,280,438,663,24,446,335,249,526,633,181,...,435,630,71,329,413,583,459,651,495,669
4,319,280,446,513,438,663,335,181,566,221,...,459,583,526,209,632,24,71,651,218,538
5,229,438,446,221,663,280,181,24,566,443,...,657,448,319,459,335,249,513,218,71,413


Now we are all set to implement our user-based recommender system. Our system operates as follows:  

- The user inputs their userId, optionally the cosine similarity matrix (default value = cosine similarity based on user ratings), the number of neighbors n (default value = 10) and the number of recommendations m (default value = 10)
- Using the similarity matrix between users based off of their past ratings, we will look through the n nearest neighbors and all the movies they rated 
- We will then calculate the average of the neighbors' ratings for each movie that at least one neighbor has rated (e.g. if n=10 but only 3 neighbors rated a movie, the average is taken over those 3 ratings)  
- Sort the neighbors' rated movies dictionary in order of decreasing value (average neighbors' rating)  
- Get the first m movies from the sorted neighbors' rated movies dict  
- Get the average rating of all users for each movie in the top m movies list
- Get the average rating of user's n neighbors for each movie in the top m movies list. This is also our prediction for the rating that input user will give each movie
- Return top m movies with their movieId, title, average rating and predicted user rating
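
The averaging and ranking steps in the list above can be sketched on a tiny invented example (three hypothetical neighbors and three hypothetical movies). Each movie is averaged only over the neighbors who actually rated it, and the averages are then sorted in descending order.

```python
from collections import defaultdict

# Toy data, invented: ratings given by a user's 3 nearest neighbors
neighbor_ratings = {
    "n1": {"movie_a": 5.0, "movie_b": 3.0},
    "n2": {"movie_a": 4.0},
    "n3": {"movie_b": 4.0, "movie_c": 4.5},
}

# collect every available rating per movie
collected = defaultdict(list)
for ratings in neighbor_ratings.values():
    for movie, rating in ratings.items():
        collected[movie].append(rating)

# average only over the neighbors who actually rated each movie
predicted = {movie: sum(r) / len(r) for movie, r in collected.items()}

# highest predicted rating first, then keep the top m
m = 2
top_m = sorted(predicted.items(), key=lambda item: item[1], reverse=True)[:m]
print(top_m)
```

Note that a movie rated by a single neighbor (movie_c here) can rank as high as one rated by several, which is a known weakness of a plain average.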

In [33]:
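# create a dataframe containing movies that are contained in both 'metadata' and 'ratings' datasets
# note: this merge assumes df2's movieId uses the same id scheme as df1's id column;
# if the ratings file uses a different scheme, the ids must be mapped through a links table first
df1.rename(columns={'id': 'movieId'}, inplace=True)
ratings_valid = df2.merge(df1, how='inner', on='movieId')
ratings_valid = ratings_valid[['userId','movieId','rating','title']]
len(ratings_valid)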

18571

In [76]:
def get_user_cf_recommendations(uid, cosine_sim=cosine_sim3, n=10, num_recom=10):
    
    # get the n closest neighbors and their user ids
    users_neighbors = find_n_neighbors(cosine_sim, n)
    neighbors_uid = users_neighbors.loc[uid]       # look up by userId label rather than by position
    

    # map each movieId rated by any of the neighbors to the list of ratings those neighbors gave it
    considered_movies = {}  
    for i in range(n):
        temp = ratings_valid[ ratings_valid.userId == neighbors_uid.iloc[i] ]    
        # convert movieId and rating columns to lists for easier subsetting
        list_movieId = (temp.movieId).tolist()     
        list_rating = (temp.rating).tolist()
        
        for j in range(len(list_movieId)):    
            if (list_movieId[j] in considered_movies.keys()):
                considered_movies[list_movieId[j]].append(list_rating[j])
            else:
                considered_movies[list_movieId[j]] = [list_rating[j]]         
    
    
    # replace each list of ratings with its average (a single rating is its own average)
    for k in considered_movies.keys():
        considered_movies[k] = sum(considered_movies[k]) / len(considered_movies[k])  


            
    # sort the movies by average neighbor rating (descending) and keep the first num_recom
    considered_movies = dict(sorted(considered_movies.items(), key = lambda item: item[1], reverse=True))
    top_movies_id = list(considered_movies.keys())[:num_recom] 
    top_movies_rating = list(considered_movies.values())[:num_recom]
    # get movieId and title by matching each top movieId against the master movies dataframe
    # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    recom_df = pd.concat([df1[df1.movieId == movie_id][['movieId','title']] for movie_id in top_movies_id])
    

    
    # add the average rating given by all users to the dataframe
    avg_ratings = []
    for i in range(len(top_movies_id)):
        avg_ratings.append( df2[df2.movieId == top_movies_id[i]].rating.mean() )
    recom_df['average_rating'] = avg_ratings
        
    
    # add the average rating given by n-nearest neighbors to the dataframe
    recom_df['predicted_user_rating'] = top_movies_rating
    
        
    return(recom_df)

In [77]:
get_user_cf_recommendations(345,cosine_sim3)


Unnamed: 0,movieId,title,average_rating,predicted_user_rating
1850,111,Scarface,4.224576,5.0
1210,4970,Gothika,3.722222,5.0
1170,1213,The Talented Mr. Ripley,4.20229,5.0
2530,4011,Beetlejuice,4.036842,5.0
2661,454,Romeo + Juliet,3.472727,5.0
378,8665,K-19: The Widowmaker,3.790541,5.0
1028,593,Solaris,4.138158,4.75
2063,1073,Arlington Road,3.753378,4.5
93,296,Terminator 3: Rise of the Machines,4.256173,4.5
1869,590,The Hours,3.717822,4.333333


In [79]:
get_user_cf_recommendations(88,cosine_sim3)


Unnamed: 0,movieId,title,average_rating,predicted_user_rating
2216,5971,We're No Angels,4.230769,5.0
378,8665,K-19: The Widowmaker,3.790541,5.0
2790,319,True Romance,3.973684,5.0
1850,111,Scarface,4.224576,5.0
4467,838,American Graffiti,3.817073,5.0
1525,28,Apocalypse Now,4.083333,5.0
1028,593,Solaris,4.138158,4.25
2063,1073,Arlington Road,3.753378,4.0
1120,509,Notting Hill,3.75,4.0
846,1792,Stuck on You,3.363636,4.0


### Item-Based CF  
Item-based CF is typically used to predict a user's rating on an unrated item based on ratings he/she has given other items. We will incorporate these predictions as a scoring metric for our recommender system.

#### Singular Value Decomposition  
We will use singular value decomposition (SVD), a latent factor model, to capture the similarity between users and items. We first cross-validate SVD, then train it on the entire dataset and use it to predict ratings for any user and item.  
Applying SVD to our dataset helps with the scalability and sparsity issues of CF.

A latent factor is a property or concept that characterizes a user or an item. For instance, one latent factor of a song could be the genre of music it belongs to. SVD reduces the dimension of the utility matrix by extracting its latent factors: we map each user and each item into a latent space of dimension r, where they become directly comparable. This helps us better understand the relationship between users and items.
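
To make the latent space a little more concrete, here is a minimal sketch of how a biased matrix-factorization model turns factors into a predicted rating: the global mean, plus a user bias, plus an item bias, plus the dot product of the user's and the item's factor vectors. All numbers and names below are invented, with r = 2 factors; to the best of my knowledge this is also the form of estimate used by Surprise's SVD, but treat that as an assumption and check its documentation.

```python
# Toy latent-factor model with r = 2 factors; every number below is invented
global_mean = 3.5

user_bias = {"u1": -0.4}
item_bias = {"i1": 0.3, "i2": -0.2}

user_factors = {"u1": [0.8, -0.1]}
item_factors = {"i1": [0.5, 0.2], "i2": [-0.6, 0.9]}

def predict(user, item):
    """Biased matrix-factorization estimate: mu + b_u + b_i + q_i . p_u"""
    interaction = sum(p * q for p, q in zip(user_factors[user], item_factors[item]))
    return global_mean + user_bias[user] + item_bias[item] + interaction

est_i1 = predict("u1", "i1")
est_i2 = predict("u1", "i2")
print(est_i1, est_i2)
```

The user's first factor is strongly positive, so the item whose first factor is also positive (i1) receives the higher estimate.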

In [89]:
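# Reader() assumes a 1-5 rating scale by default; pass rating_scale=(0.5, 5) if the ratings include half-star values below 1
reader = Reader()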

data = Dataset.load_from_df(df2[['userId','movieId','rating']], reader)
svd = SVD()
svd_errors = cross_validate(svd, data, measures=['rmse', 'mae'], cv=5)
svd_errors

{'test_rmse': array([0.90047261, 0.89369735, 0.89322698, 0.88393298, 0.91294274]),
 'test_mae': array([0.69179404, 0.69049327, 0.68759207, 0.68219508, 0.7006963 ]),
 'fit_time': (6.874392032623291,
  7.2251293659210205,
  8.453959226608276,
  8.133680582046509,
  9.60582423210144),
 'test_time': (0.3378283977508545,
  0.19579648971557617,
  0.1988229751586914,
  0.26306724548339844,
  0.3948044776916504)}

In [88]:
print(svd_errors['test_rmse'].mean())
print(svd_errors['test_mae'].mean())

0.8981837533270893
0.6910832447801247


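Our root mean square error (RMSE) of about 0.90 and mean absolute error (MAE) of about 0.69 mean that, on a 5-point scale, predictions are typically off by less than one star, which is reasonably good.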
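
For reference, both error measures are easy to compute by hand. The sketch below uses four invented pairs of true and predicted ratings (not taken from the dataset). RMSE squares the errors before averaging, so it penalizes large misses more heavily than MAE does.

```python
import math

# Invented true and predicted ratings, for illustration only
actual    = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

errors = [p - a for p, a in zip(predicted, actual)]

# root mean square error: square, average, then take the square root
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
# mean absolute error: average of the absolute errors
mae = sum(abs(e) for e in errors) / len(errors)

print(rmse, mae)
```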

Now we can train on our entire dataset and then make predictions:

In [91]:
trainset = data.build_full_trainset()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x2735144d2c8>

Let's pick user 1 and check the ratings he/she has given:

In [40]:
df2[df2.userId==1]

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205
5,1,1263,2.0,1260759151
6,1,1287,2.0,1260759187
7,1,1293,2.0,1260759148
8,1,1339,3.5,1260759125
9,1,1343,2.0,1260759131


User 1 hasn't rated the movie with id 5000 yet, so let's predict the rating he/she would give it.

In [92]:
# predict user 1's rating for movie with movieId=5000
svd.predict(1,5000).est

2.663492043522158
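For intuition about where such a number comes from: Surprise's SVD estimates a rating as the global mean rating plus a user bias, an item bias, and the dot product of the user's and the item's latent factor vectors, then clips the result to the rating scale. The sketch below reproduces that formula with made-up numbers. The helper `predict_rating` and all of its inputs are hypothetical and are not the values learned by our model.

```python
import numpy as np

def predict_rating(mu, b_u, b_i, p_u, q_i, low=1.0, high=5.0):
    """Hypothetical helper: mu + b_u + b_i + q_i . p_u, clipped to [low, high]."""
    est = mu + b_u + b_i + float(np.dot(q_i, p_u))
    return float(np.clip(est, low, high))

# Made-up parameters, for illustration only
mu = 3.5                       # global mean rating
b_u = -0.8                     # this user rates lower than average
b_i = 0.2                      # this movie is rated higher than average
p_u = np.array([0.1, -0.3])    # user latent factors
q_i = np.array([0.5, 0.4])     # item latent factors

est = predict_rating(mu, b_u, b_i, p_u, q_i)
print(est)

# Estimates that overshoot the scale are clipped to its upper bound
clipped = predict_rating(4.5, 0.5, 0.5, p_u, q_i)
print(clipped)
```

A user with a strongly negative bias, such as a generally harsh rater, will therefore receive low predictions for most movies regardless of the latent factors.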

User 1's predicted rating for movie 5000 doesn't tell us much on its own. As a point of comparison, let's look at user 1's average rating across all the movies they have rated.

In [81]:
df2[df2.userId==1].rating.mean()

2.55

User 1's average rating is 2.55. Since their predicted rating for movie 5000 is 2.66, slightly higher than their average, they might enjoy watching movie 5000.

Now we are all set to create our collaborative filtering recommender system on top of the SVD model. Our system operates as follows:  

- The user inputs their userId, and optionally the number of recommendations m (default value = 10)  
- Get the ids of all the movies the user has previously rated  
- Using the ids of rated movies, loop through the master dataframe to find the movies the user has not rated and calculate the user's predicted rating for each. Store this information in a dictionary 
- Sort the unrated movies dictionary in order of decreasing value (predicted user rating)
- Get the first m movies from the sorted unrated movies dict  
- Get the average rating of all users for each movie in the top m movies list
- Return top m movies with their movieId, title, average rating and predicted user rating

In [57]:
def get_item_cf_recommendations(uid, num_recom=10):
    
    # movie ids of all the movies that input user has already rated
    # (stored as a set so that `in` tests the ids themselves; `in` on a Series tests its index labels)
    rated_movies_id = set(df2[df2.userId == uid].movieId)
    
    # unrated_movies is a dictionary whose keys are the movie ids of movies input user has not rated, and values are input user's predicted rating for each unrated movie
    unrated_movies = {}
    for cur_movieId in df1['movieId']:
        if cur_movieId not in rated_movies_id:
            unrated_movies[cur_movieId] = svd.predict(uid, cur_movieId).est
    
    # sort unrated movies in order of decreasing predicted rating
    unrated_movies = dict(sorted(unrated_movies.items(), key=lambda item: item[1], reverse=True))
    
    # get the top movies' ids and predicted ratings
    top_movies_id = list(unrated_movies.keys())[:num_recom]
    top_movies_rating = list(unrated_movies.values())[:num_recom]
    
    # get movieId and title by matching movieId in top movies list with master movies dataframe
    # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    recom_df = pd.concat([df1[df1.movieId == mid][['movieId','title']] for mid in top_movies_id])
    
    # add the average rating given by all users for each movie in the recommended df
    recom_df['average_rating'] = [df2[df2.movieId == mid].rating.mean() for mid in top_movies_id]
    
    # add the predicted rating given by svd predict for each movie in the recommended df
    recom_df['predicted_user_rating'] = top_movies_rating
    
    return recom_df
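One pandas detail matters when excluding already-rated movies: the `in` operator applied to a Series tests its index labels, not its values. The sketch below demonstrates this on a tiny made-up ratings frame (the data is hypothetical). Converting the ids to a set, or testing against `.values`, checks the movie ids themselves.

```python
import pandas as pd

# Tiny made-up ratings frame, for illustration only
ratings = pd.DataFrame({"userId": [1, 1, 2], "movieId": [31, 1029, 31]})

# Series of movie ids rated by user 1: index labels 0 and 1, values 31 and 1029
rated = ratings[ratings.userId == 1].movieId

in_series = 31 in rated          # tests the index labels (0, 1)
in_index = 0 in rated            # 0 is an index label, not a movie id
in_values = 31 in rated.values   # tests the underlying values
in_set = 31 in set(rated)        # tests the values, with fast lookups

print(in_series, in_index, in_values, in_set)
```

A set also makes each membership test constant-time on average, which helps when looping over every movie in the master dataframe.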

In [58]:
get_item_cf_recommendations(345)

Unnamed: 0,movieId,title,average_rating,predicted_user_rating
93,296,Terminator 3: Rise of the Machines,4.256173,5.0
1391,2959,License to Wed,4.178218,4.950014
1850,111,Scarface,4.224576,4.889558
150,608,Men in Black II,4.256696,4.827689
1170,1213,The Talented Mr. Ripley,4.20229,4.745247
1025,913,The Thomas Crown Affair,4.387097,4.742042
1782,951,Kindergarten Cop,4.211538,4.713529
217,1250,Ghost Rider,4.138462,4.667371
1028,593,Solaris,4.138158,4.577013
425,954,Mission: Impossible,4.225806,4.475125


In [59]:
get_item_cf_recommendations(88)

Unnamed: 0,movieId,title,average_rating,predicted_user_rating
878,3683,Flags of Our Fathers,4.291667,4.352462
1170,1213,The Talented Mr. Ripley,4.20229,4.265306
1666,6016,The Good Thief,4.297101,4.231482
150,608,Men in Black II,4.256696,4.177172
1748,1259,Notes on a Scandal,4.09375,4.148123
2944,766,Army of Darkness,4.222222,4.138643
4281,223,Rebecca,3.963303,4.062127
1025,913,The Thomas Crown Affair,4.387097,4.057445
2530,4011,Beetlejuice,4.036842,4.040999
1120,509,Notting Hill,3.75,4.03399
