In [4]:
import numpy as np
import pandas as pd

# Movie Recommendation System

In this project we will use the MovieLens dataset to develop four recommendation algorithms. Two recommendation schemes will be based on the user's favorite genre, and the remaining two will use collaborative filtering algorithms, namely user-based KNN and SVD. These models will be productionalized and hosted via Dash, the Python web framework for data apps. For system one, users will input their desired genre and receive a set of movie recommendations. For system two, users will rate as many movies as they can and receive recommendations from our collaborative algorithm.

### MovieLens Dataset

The MovieLens dataset was compiled by the GroupLens research group at the University of Minnesota and comprises 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users.


F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872

----

## Table Of Contents: <a class="anchor" id="back-to-top"></a>
* [Exploratory Data Analysis](#data-analysis)
* [System 1](#system-1)
	* [System 1 - Popular by Genre](#system-1-popular)
	* [System 1 - Highly Rated by Genre](#system-1-highly-rated)
	* [System 1 - Summary](#system-1-summary)
* [System 2](#system-2)
	* [System 2 - KNN](#system-2-KNN)
	* [System 2 - SVD](#system-2-SVD)
	* [System 2 - Measuring Performance](#system-2-measuring-performance)
	* [System 2 - Summary](#system-2-summary)
* [Performance Summary](#algorithm-performance-summary)
	* [Choice of Best Algorithm for App](#choice-of-best-algo)
* [App Functionality Prototype](#prototyping-app-functionality)
* [Conclusion](#conclusion)
* [References](#references)


----

## Exploratory Data Analysis  <a class="anchor" id="data-analysis"></a>

In [6]:
# UserID::MovieID::Rating::Timestamp
ratings = pd.read_csv('./ml-1m/ratings.dat', sep = '::', names = ['UserID', 'MovieID', 'Rating', 'Timestamp'], engine='python')
ratings.head()

Unnamed: 0,UserID,MovieID,Rating,Timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [9]:
# UserID::Gender::Age::Occupation::Zip-code
users = pd.read_csv('./ml-1m/users.dat', sep = '::', names = ['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code'], engine='python')
users.head()

Unnamed: 0,UserID,Gender,Age,Occupation,Zip-code
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


In [7]:
# MovieID::Title::Genres
movies = pd.read_csv('./ml-1m/movies.dat', sep = '::', names = ['MovieID', 'Title', 'Genres'], engine='python')
movies.head()

Unnamed: 0,MovieID,Title,Genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


Let's merge our ratings and movies dataframes so that every rating sits alongside the details of the movie it belongs to. This will allow us to check things like (1) the most popular movie, (2) the most popular movie by genre, and (3) the most highly rated movies.

Here 'popular' means the movies with the highest number of ratings.

In [8]:
# MovieID::Title::Genres::UserID::Rating::Timestamp
movies_and_ratings = pd.merge(movies, ratings, on = 'MovieID')
movies_and_ratings.head()

Unnamed: 0,MovieID,Title,Genres,UserID,Rating,Timestamp
0,1,Toy Story (1995),Animation|Children's|Comedy,1,5,978824268
1,1,Toy Story (1995),Animation|Children's|Comedy,6,4,978237008
2,1,Toy Story (1995),Animation|Children's|Comedy,8,4,978233496
3,1,Toy Story (1995),Animation|Children's|Comedy,9,5,978225952
4,1,Toy Story (1995),Animation|Children's|Comedy,10,5,978226474


#### List the Most Popular Movies 

In [23]:
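# count the rows (ratings) per movie; as_index=False is dropped so that size() returns a Series that sort_values can order directly
popular_movies = movies_and_ratings.groupby(['MovieID', 'Title', 'Genres']).size().sort_values(ascending=False)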
popular_movies.head()

MovieID  Title                                                  Genres                             
2858     American Beauty (1999)                                 Comedy|Drama                           3428
260      Star Wars: Episode IV - A New Hope (1977)              Action|Adventure|Fantasy|Sci-Fi        2991
1196     Star Wars: Episode V - The Empire Strikes Back (1980)  Action|Adventure|Drama|Sci-Fi|War      2990
1210     Star Wars: Episode VI - Return of the Jedi (1983)      Action|Adventure|Romance|Sci-Fi|War    2883
480      Jurassic Park (1993)                                   Action|Adventure|Sci-Fi                2672
dtype: int64

----

## System 1: Listing the most Popular and most Highly Rated Movies by Genre <a class="anchor" id="system-1"></a>

For system 1, we want to take a selected genre as input and return movie recommendations. We will develop two schemes for doing this: (1) list the most popular movies by genre, and (2) list the most highly rated movies by genre, with the rating weighted based on the number of reviews.

- *'popular' - means the movies with the highest number of reviews*
- *'highly rated' - means the movie with the highest weighted mean review score*

#### List the Most Popular Movies by Genre  <a class="anchor" id="system-1-popular"></a>

In [44]:
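def get_popular_movies_by_genre(movies, genre, count=10):
	# keep only the rows whose genre string contains the requested genre
	popular = movies[movies["Genres"].str.contains(genre)]
	# NOTE: "sum" adds up the star values of every rating; it is not a count of ratings.
	# Use "count" instead to rank strictly by the number of ratings.
	popular = popular.groupby(['MovieID', 'Title', 'Genres'], as_index=False).agg({"Rating": ["sum"]})
	popular['RatingCount'] = popular['Rating']['sum']
	return popular.sort_values(by='RatingCount', ascending=False).head(count)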

In [45]:
get_popular_movies_by_genre(movies_and_ratings, "Romance")

Unnamed: 0_level_0,MovieID,Title,Genres,Rating,RatingCount
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,sum,Unnamed: 5_level_1
161,1210,Star Wars: Episode VI - Return of the Jedi (1983),Action|Adventure|Romance|Sci-Fi|War,11598,11598
160,1197,"Princess Bride, The (1987)",Action|Adventure|Comedy|Romance,9976,9976
307,2396,Shakespeare in Love (1998),Comedy|Romance,9778,9778
169,1265,Groundhog Day (1993),Comedy|Romance,9005,9005
60,356,Forrest Gump (1994),Comedy|Romance|War,8969,8969
126,912,Casablanca (1942),Drama|Romance|War,7365,7365
174,1307,When Harry Met Sally... (1989),Comedy|Romance,6387,6387
65,377,Speed (1994),Action|Romance|Thriller,5883,5883
231,1721,Titanic (1997),Drama|Romance,5540,5540
165,1230,Annie Hall (1977),Comedy|Romance,5525,5525
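
Since 'popular' is defined above as the number of ratings, while the function above ranks by the summed rating values, the two can disagree. The following is a minimal sketch of a count-based variant on toy data. The function name `popular_by_count` and the toy dataframe are hypothetical, and the named-aggregation syntax assumes pandas 0.25 or later.

```python
import pandas as pd

# Toy stand-in for movies_and_ratings: one row per (movie, user) rating.
toy = pd.DataFrame({
    "MovieID": [1, 1, 1, 2, 2, 3],
    "Title": ["A", "A", "A", "B", "B", "C"],
    "Genres": ["Drama", "Drama", "Drama", "Drama|War", "Drama|War", "Comedy"],
    "Rating": [2, 2, 2, 5, 5, 4],
})

def popular_by_count(movies, genre, count=10):
    """Rank the movies in a genre by how many ratings they received."""
    # regex=False treats the genre as a literal substring, not a pattern
    in_genre = movies[movies["Genres"].str.contains(genre, regex=False)]
    counts = (
        in_genre.groupby(["MovieID", "Title", "Genres"])["Rating"]
        .agg(RatingCount="count", RatingSum="sum")
        .reset_index()
    )
    return counts.sort_values("RatingCount", ascending=False).head(count)

result = popular_by_count(toy, "Drama")
print(result)
```

In this toy data movie A has three low ratings and movie B has two high ones, so A ranks first by count even though B has the larger rating sum.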


#### List the Most Highly Rated Movies by Genre  <a class="anchor" id="system-1-highly-rated"></a>

In order to list the most highly rated movies by genre we must relate movie ratings and number of reviews. A movie with a rating of 5 and one review should rank lower than a movie with a rating of 3 but thousands of reviews. To do this I'll create a rating count weight. We can then generate an overall score for each movie in a genre by multiplying the rating count weight by the mean of the movie rating. 

##### Movie Rating Count Weight

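I wanted a movie's overall score to be predominantly based on its mean rating without neglecting the number of reviews. To do this, I generated a rating count weight from the log of a review-volume measure. The code below uses the `sum` aggregate of the `Rating` column as that measure, which adds up star values rather than counting reviews, so it grows with both the number of reviews and their values. For example, the ratings for Star Wars: Episode IV sum to 13,321 (across 2,991 individual ratings, per the popularity list above), and its corresponding rating count weight is `9.497`, or `log(13321)`.

We use the log to dampen the influence of review volume, so that a movie with a very high number of ratings but a lower average rating does not dominate the ranking.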
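
As a small worked example of the scoring idea, the sketch below multiplies a mean rating by the log of a volume measure. The helper name `rating_score` and the two toy movies are hypothetical; the Star Wars figures are copied from the output table further down.

```python
import math

def rating_score(mean_rating, volume):
    """Score = mean rating scaled by the natural log of a review-volume measure."""
    return mean_rating * math.log(volume)

# One 5-star review: mean 5.0, rating sum 5
single_review = rating_score(5.0, 5)

# A thousand 3-star reviews: mean 3.0, rating sum 3000
many_reviews = rating_score(3.0, 3000)

# Mean and rating sum taken from the Star Wars: Episode IV row
star_wars = rating_score(4.453694, 13321)

print(single_review, many_reviews, star_wars)
```

The heavily reviewed 3-star movie scores higher than the single 5-star review, which is the ordering described above.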

In [79]:
def get_highly_rated_movies_by_genre(movies, genre, count=10):
	# keep only the rows whose genre string contains the requested genre
	highly_rated = movies[movies["Genres"].str.contains(genre)]
	highly_rated = highly_rated.groupby(['MovieID', 'Title', 'Genres'], as_index=False).agg({"Rating": ["mean", "sum"]})
	# NOTE: the weight is the log of the summed star values, not of the number of ratings
	highly_rated['RatingCountWeight'] = np.log(highly_rated['Rating']['sum'])
	# overall score = log weight * mean rating
	highly_rated['RatingScore'] = highly_rated['RatingCountWeight'] * highly_rated['Rating']['mean']
	return highly_rated.sort_values(by='RatingScore', ascending=False).head(count)

In [80]:
get_highly_rated_movies_by_genre(movies_and_ratings, "Action")

Unnamed: 0_level_0,MovieID,Title,Genres,Rating,Rating,RatingCountWeight,RatingScore
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,mean,sum,Unnamed: 6_level_1,Unnamed: 7_level_1
31,260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Fantasy|Sci-Fi,4.453694,13321,9.497097,42.297168
137,1198,Raiders of the Lost Ark (1981),Action|Adventure,4.477725,11257,9.328745,41.771554
111,858,"Godfather, The (1972)",Action|Crime|Drama,4.524966,10059,9.216223,41.703098
135,1196,Star Wars: Episode V - The Empire Strikes Back...,Action|Adventure|Drama|Sci-Fi|War,4.292977,12836,9.460009,40.611597
260,2028,Saving Private Ryan (1998),Action|Drama|War,4.337354,11507,9.350711,40.557342
330,2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller,4.31583,11178,9.321703,40.230886
136,1197,"Princess Bride, The (1987)",Action|Adventure|Comedy|Romance,4.30371,9976,9.207937,39.628294
13,110,Braveheart (1995),Action|Drama|War,4.234957,10346,9.244355,39.149447
146,1221,"Godfather: Part II, The (1974)",Action|Crime|Drama,4.357565,7373,8.90558,38.806644
86,589,Terminator 2: Judgment Day (1991),Action|Sci-Fi|Thriller,4.058513,10751,9.282754,37.674175


#### Measuring the Performance of Our Schemes

In [161]:
import time

genres = pd.DataFrame(movies_and_ratings.Genres.str.split('|').tolist()).stack().unique()

popular_perf = []
highly_rated_perf = []
for genre in genres:
	# measure execution time of getting popular movies by genre
	start_time = time.time()
	get_popular_movies_by_genre(movies_and_ratings, genre)
	popular_perf.append(time.time() - start_time)

	# measure execution time of getting highly rated movies by genre
	start_time = time.time()
	get_highly_rated_movies_by_genre(movies_and_ratings, genre)
	highly_rated_perf.append(time.time() - start_time)

genre_performance = pd.DataFrame(list(zip(genres, popular_perf, highly_rated_perf)), columns=['Genre', 'Popular By Genre Runtime (in seconds)', 'High Rated By Genre Runtime (in seconds)'])
genre_performance

Unnamed: 0,Genre,Popular By Genre Runtime (in seconds),High Rated By Genre Runtime (in seconds)
0,Animation,0.43875,0.435747
1,Children's,0.404281,0.410389
2,Comedy,0.506095,0.488236
3,Adventure,0.416077,0.430543
4,Fantasy,0.39329,0.393887
5,Romance,0.424102,0.421579
6,Drama,0.455264,0.473974
7,Action,0.440879,0.446181
8,Crime,0.396466,0.413674
9,Thriller,0.42006,0.436535


### System 1 Summary <a class="anchor" id="system-1-summary"></a>

For system 1 we developed two methods of recommending movies by genre. The first was to simply recommend the most popular movies, i.e., those with the most ratings. The second involved developing a rating score, computed as the product of the mean movie rating and a log-based rating count weight. This score allowed us to return the most highly rated movies by genre.

----

## System 2: Collaborative Recommendation System <a class="anchor" id="system-2"></a>

A collaborative filtering system can recommend movies on the basis of similarities between users or items. We will develop two algorithms for collaborative filtering: (1) User-based or item-based KNN (2) SVD. Let's start with KNN.

### KNN User-based Collaborative Filtering <a class="anchor" id="system-2-KNN"></a>

KNN is a memory-based algorithm, and in particular we want to use centered KNN. Centered KNN subtracts each user's mean rating from every rating that user has given, so each rating is expressed relative to that user's own average. The advantage is that missing values (in our case, movies a user has not reviewed) can be treated as zero: on a mean-centered scale, zero corresponds to the user's average rating, so an empty entry does not skew that user's overall rating pattern.
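
The centering idea can be illustrated with a minimal NumPy sketch on a toy rating matrix. The matrix and variable names are hypothetical, and this shows the general concept rather than how Surprise implements `KNNWithMeans` internally. The `nan` keyword of `np.nan_to_num` assumes NumPy 1.17 or later.

```python
import numpy as np

# Rows are users, columns are movies; np.nan marks a movie the user has not rated.
ratings_matrix = np.array([
    [5.0, 4.0, np.nan],
    [2.0, np.nan, 1.0],
])

# Each user's mean over the movies they actually rated
user_means = np.nanmean(ratings_matrix, axis=1, keepdims=True)

# Subtract the mean, then fill the gaps with 0, i.e. "exactly average for this user"
centered = np.nan_to_num(ratings_matrix - user_means, nan=0.0)

print(centered)
```

After centering, both users show the same relative pattern on the first movie (+0.5) even though one rates generously and the other harshly.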

We will be using the Surprise library to build our collaborative filtering models. Let's start by defining a parameter grid. This parameter grid will allow us to test different combinations of neighborhood size and similarity options for KNN.

In [155]:
from surprise import KNNWithMeans
from surprise import SVD 
from surprise import Dataset
from surprise import Reader 
from surprise import accuracy
from surprise.model_selection import GridSearchCV
from surprise.model_selection import train_test_split

In [137]:
# User-based cosine similarity
param_grid = {
	"k": [40, 50],
	"sim_options" : {
		"name": ["cosine"],
		"user_based": [True],
		"min_support": [5, 7]
	}
}

- *min_support - "The minimum number of common items (when 'user_based' is 'True') or minimum number of common users (when 'user_based' is 'False') for the similarity not to be zero"* 
- *cosine similarity - returns a higher similarity when the angle between users is lower* (see the following diagram)
- *k - is the neighborhood size*

----
![Cosine Similarity](../report_assets/cosine_similarity.jpg)
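
To make the cosine measure concrete, here is a minimal sketch that computes it for two rating vectors. The function name `cosine_similarity` and the toy vectors are hypothetical, and the vectors are assumed to hold both users' ratings for the same set of movies.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two rating vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Same direction (one user simply rates twice as high): angle of 0
same_direction = cosine_similarity([1, 2], [2, 4])

# Perpendicular vectors: angle of 90 degrees
orthogonal = cosine_similarity([1, 0], [0, 1])

print(same_direction, orthogonal)
```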

----

### Building our Dataset

Our productionalized movie recommendation system will take a new user's movie ratings as input and use them in our user-based KNN model. Let's start by hard-coding a set of 10 ratings for randomly sampled action movies. Our aim is for the new user to be found similar to other users who have reviewed action movies and given similar ratings. When building a Surprise dataset from a dataframe, the dataframe must have three columns holding the user IDs, the item IDs, and the ratings, in that order; here we name them "userID", "itemID", and "rating".

In [131]:
ratings = pd.read_csv('./ml-1m/ratings.dat', sep = '::', names = ['UserID', 'MovieID', 'Rating', 'Timestamp'], engine='python')

# The dataframe must have three columns, corresponding to the user (raw) ids, the item (raw) ids, and the ratings in this order
ratings = ratings.drop('Timestamp', axis=1)
ratings = ratings.rename({ 
	'UserID': 'userID', 
	'MovieID': 'itemID', 
	'Rating': 'rating'
}, axis=1)

# hard-code 10 additional movie ratings
new_user_id = np.repeat(ratings['userID'].unique()[-1] + 1, 10)
new_item_ids = movies[movies["Genres"].str.contains("Action")].sample(10)['MovieID']
new_ratings = [3, 4, 5, 4, 2, 5, 5, 2, 3, 1]

new_user = pd.DataFrame({
	'userID': new_user_id,
	'itemID': new_item_ids,
	'rating': new_ratings
})

# combine new and old rating datasets
ratings = pd.concat([ratings, new_user], ignore_index=True)
ratings.tail(10)

Unnamed: 0,userID,itemID,rating
1000209,6041,1291,3
1000210,6041,227,4
1000211,6041,1785,5
1000212,6041,1681,4
1000213,6041,2370,2
1000214,6041,2411,5
1000215,6041,2196,5
1000216,6041,2817,2
1000217,6041,3372,3
1000218,6041,2986,1


#### Finding Optimal Parameters Using GridSearchCV

In [None]:
# A reader is still needed but only the rating_scale param is required.
reader = Reader(rating_scale=(1, 5))

# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(ratings[['userID', 'itemID', 'rating']], reader)

# GridSearch, 3-fold cross validation, KNNWithMeans
gs = GridSearchCV(KNNWithMeans, param_grid, measures=["rmse"], cv=3)
gs.fit(data)

#### Output Best KNN Parameters from GridSearch

In [140]:
# best RMSE score
gs.best_score['rmse']

0.9294202850790415

In [141]:
# combination of parameters that gave the best RMSE score
gs.best_params['rmse']

{'k': 50,
 'sim_options': {'name': 'cosine', 'user_based': True, 'min_support': 7}}

### SVD Collaborative Filtering <a class="anchor" id="system-2-SVD"></a>

Our second collaborative filtering algorithm is SVD, or Singular Value Decomposition, a matrix factorization approach that rose to prominence after performing well in the Netflix Prize competition. Again, we will use the Surprise package for SVD. SVD minimizes the regularized squared error shown in the diagram below.

![](../report_assets/SVD_minimizer.png)

A note on SVD:
- "When baselines are not used, this is equivalent to Probabilistic Matrix Factorization" 
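
For reference, the regularized squared error that the Surprise documentation gives for its `SVD` algorithm can be written out as below. That this matches the diagram above is an assumption, since the image is not reproduced here.

$$
\sum_{r_{ui} \in R_{train}} \left(r_{ui} - \hat{r}_{ui}\right)^2 + \lambda\left(b_i^2 + b_u^2 + \lVert q_i \rVert^2 + \lVert p_u \rVert^2\right),
\qquad
\hat{r}_{ui} = \mu + b_u + b_i + q_i^T p_u
$$

Here $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases, $p_u$ and $q_i$ are the user and item factor vectors, and $\lambda$ is the regularization strength.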

In [146]:
param_grid = {
	"lr_all": [0.003, 0.005, 0.007], 
	"reg_all": [0.1, 0.2, 0.3]
}

- *lr_all - The learning rate for all parameters*
- *reg_all - The regularization term for all parameters*

### Building Our Dataset

To reiterate, our production system will take user input as movie ratings, to find similarities with other users. We will use the same `ratings` dataset compiled above.

#### Finding Optimal Parameters Using GridSearchCV

In [147]:
# GridSearch, 3-fold cross validation, SVD
gs = GridSearchCV(SVD, param_grid, measures=["rmse"], cv=3)
gs.fit(data)

#### Output Best SVD Parameters from GridSearch

In [148]:
# best RMSE score
gs.best_score['rmse']

0.8917625848955452

In [149]:
# combination of parameters that gave the best RMSE score
gs.best_params['rmse']

{'lr_all': 0.007, 'reg_all': 0.1}

----

### Measuring Performance of our Collaborative Recommendation Algorithms <a class="anchor" id="system-2-measuring-performance"></a>

For both algorithms we will use RMSE over 10 iterations to evaluate the prediction performance. In each iteration, we create a training and test split, train a recommender system on the training data and record prediction accuracy/error on the test data.
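
For readers unfamiliar with the metric, RMSE is the square root of the mean squared difference between true and predicted ratings. The sketch below computes it by hand; the function name `rmse` and the toy ratings are hypothetical, and the notebook itself relies on Surprise's `accuracy.rmse` rather than this helper.

```python
import numpy as np

def rmse(true_ratings, predicted_ratings):
    """Root mean squared error between two equal-length rating sequences."""
    errors = np.asarray(true_ratings, dtype=float) - np.asarray(predicted_ratings, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

# Perfect predictions give an error of zero
perfect = rmse([4, 3, 5], [4, 3, 5])

# Errors of 3 and 4 stars: sqrt((9 + 16) / 2)
off = rmse([3, 4], [0, 0])

print(perfect, off)
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.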

In [None]:
svd_rmse, svd_perf, knn_rmse, knn_perf = [], [], [], []
for i in range(10):
    trainset, testset = train_test_split(data, test_size=.25)

    print(f'measuring performance of SVD on iteration {i}')
    start_time = time.time()
    svd = SVD(lr_all = 0.007, reg_all = 0.1)
    svd.fit(trainset)
    predictions = svd.test(testset)
    svd_rmse.append(accuracy.rmse(predictions))
    svd_perf.append(time.time() - start_time)
    print('done...')

    print(f'measuring performance of KNNWithMeans on iteration {i}')
    start_time = time.time()
    knn = KNNWithMeans(k=50, sim_options={ 'name': 'cosine', 'user_based': True, 'min_support': 7})
    knn.fit(trainset)
    predictions = knn.test(testset)
    knn_rmse.append(accuracy.rmse(predictions))
    knn_perf.append(time.time() - start_time)
    print('done...')


In [162]:
collaborative_performance = pd.DataFrame(list(zip(svd_rmse, svd_perf, knn_rmse, knn_perf)), columns=['SVD RMSE', 'SVD Runtime(in seconds)', 'KNN RMSE', 'KNN Runtime(in seconds)'])
collaborative_performance

Unnamed: 0,SVD RMSE,SVD Runtime(in seconds),KNN RMSE,KNN Runtime(in seconds)
0,0.888406,46.157597,0.927374,199.087592
1,0.888938,45.262638,0.928809,194.873127
2,0.888619,46.176504,0.928158,196.035637
3,0.889471,45.848254,0.927686,196.510817
4,0.890093,45.388963,0.929192,189.429975
5,0.889594,45.131195,0.927482,187.742282
6,0.888394,44.955453,0.92795,193.109865
7,0.892229,44.159585,0.929555,192.540238
8,0.886828,45.013372,0.925655,197.378771
9,0.886811,46.252055,0.926901,192.486122


### System 2 Summary <a class="anchor" id="system-2-summary"></a>

For system 2 we developed two methods of recommending movies with collaborative filtering. First, we used centered KNN (`KNNWithMeans`) and found that the best neighborhood size was `k=50` and the best min support was `7` (these parameters are defined in the system-2 KNN section). Second, we used the same GridSearch method to determine the best parameters for SVD. We found the best learning rate to be `lr_all=0.007` and the best regularization term to be `reg_all=0.1`. We used the default `n_epochs` for SVD.

The RMSE and runtime of both algorithms over the 10 iterations are shown in the table above.

----
## Algorithm Performance Summary <a class="anchor" id="algorithm-performance-summary"></a>

In [164]:
genre_performance

Unnamed: 0,Genre,Popular By Genre Runtime (in seconds),High Rated By Genre Runtime (in seconds)
0,Animation,0.43875,0.435747
1,Children's,0.404281,0.410389
2,Comedy,0.506095,0.488236
3,Adventure,0.416077,0.430543
4,Fantasy,0.39329,0.393887
5,Romance,0.424102,0.421579
6,Drama,0.455264,0.473974
7,Action,0.440879,0.446181
8,Crime,0.396466,0.413674
9,Thriller,0.42006,0.436535


In [165]:
collaborative_performance

Unnamed: 0,SVD RMSE,SVD Runtime(in seconds),KNN RMSE,KNN Runtime(in seconds)
0,0.888406,46.157597,0.927374,199.087592
1,0.888938,45.262638,0.928809,194.873127
2,0.888619,46.176504,0.928158,196.035637
3,0.889471,45.848254,0.927686,196.510817
4,0.890093,45.388963,0.929192,189.429975
5,0.889594,45.131195,0.927482,187.742282
6,0.888394,44.955453,0.92795,193.109865
7,0.892229,44.159585,0.929555,192.540238
8,0.886828,45.013372,0.925655,197.378771
9,0.886811,46.252055,0.926901,192.486122


#### Choice of Best Algorithms for App  <a class="anchor" id="choice-of-best-algo"></a>

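For genre-based recommendations we will show users both the most popular and the most highly rated movies, meaning we will use both genre-based algorithms; their runtimes are comparable, at roughly 0.4 to 0.5 seconds per genre. For collaborative filtering we will use SVD, which has both the lower RMSE (about 0.89 versus about 0.93 for KNN) and the shorter runtime (about 45 seconds versus about 194 seconds per iteration).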

----

## Prototyping App Functionality <a class="anchor" id="prototyping-app-functionality"></a>

In [2]:
def read_data():
	ratings = pd.read_csv('./ml-1m/ratings.dat', sep = '::', names = ['UserID', 'MovieID', 'Rating', 'Timestamp'], engine='python')
	movies = pd.read_csv('./ml-1m/movies.dat', sep = '::', names = ['MovieID', 'Title', 'Genres'], engine='python')
	movies_and_ratings = pd.merge(movies, ratings, on = 'MovieID')

	return ratings, movies, movies_and_ratings

In [46]:
def get_popular_movies_by_genre(movies, genre, count=10):
	popular = movies[movies["Genres"].str.contains(genre)]
	# NOTE: "sum" adds up the star values of every rating; it is not a count of ratings
	popular = popular.groupby(['MovieID', 'Title', 'Genres'], as_index=False).agg({"Rating": ["sum"]})
	popular['RatingCount'] = popular['Rating']['sum']
	return popular.sort_values(by='RatingCount', ascending=False).head(count)

def get_highly_rated_movies_by_genre(movies, genre, count=10):
	highly_rated = movies[movies["Genres"].str.contains(genre)]
	highly_rated = highly_rated.groupby(['MovieID', 'Title', 'Genres'], as_index=False).agg({"Rating": ["mean", "sum"]})
	# the weight is the log of the summed star values; the score is weight * mean rating
	highly_rated['RatingCountWeight'] = np.log(highly_rated['Rating']['sum'])
	highly_rated['RatingScore'] = highly_rated['RatingCountWeight'] * highly_rated['Rating']['mean']
	return highly_rated.sort_values(by='RatingScore', ascending=False).head(count)

In [6]:
def get_genres(movies):
	return pd.DataFrame(movies.Genres.str.split('|').tolist()).stack().unique()

_, movies, _ = read_data()
get_genres(movies)

array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)
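
A slightly more direct way to pull out the distinct genres is to explode the split lists instead of building an intermediate dataframe. This is a sketch on toy data: the function name `get_genres_exploded` and the three-row dataframe are hypothetical, and `Series.explode` assumes pandas 0.25 or later.

```python
import pandas as pd

# Three rows in the same shape as movies.dat
toy_movies = pd.DataFrame({
    "Title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "Genres": ["Animation|Children's|Comedy", "Adventure|Children's|Fantasy", "Comedy|Romance"],
})

def get_genres_exploded(movies):
    """Return the distinct genres in order of first appearance."""
    # split "A|B|C" into ["A", "B", "C"], then give each genre its own row
    return movies["Genres"].str.split("|").explode().unique()

genres = get_genres_exploded(toy_movies)
print(genres)
```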

In [179]:
def build_train_and_test(ratings, new_user_ratings):
	"""
	Args:
		ratings - ratings dataframe from ratings.dat
		new_user_ratings - dataframe of the form userID, itemID, rating
	"""
	ratings = ratings.drop('Timestamp', axis=1)
	ratings = ratings.rename({ 
		'UserID': 'userID', 
		'MovieID': 'itemID', 
		'Rating': 'rating'
	}, axis=1)

	ratings = pd.concat([ratings, new_user_ratings], ignore_index=True)

	reader = Reader(rating_scale=(1, 5))
	data = Dataset.load_from_df(ratings[['userID', 'itemID', 'rating']], reader)
	trainset = data.build_full_trainset()
	testset = trainset.build_anti_testset()

	return trainset, testset


def train_and_predict(trainset, testset):
	algo = SVD(lr_all = 0.007, reg_all = 0.1)
	algo.fit(trainset)

	# Predict ratings for all pairs (u, i) that are NOT in the training set.
	predictions = algo.test(testset)

	return predictions
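
To clarify what "all pairs (u, i) that are NOT in the training set" means, here is a plain-Python sketch of the idea on toy data. The dictionary `known` and the variable names are hypothetical, and this illustrates the concept only. As far as the Surprise documentation describes it, `build_anti_testset` additionally attaches a fill value (the global mean by default) to each pair.

```python
from itertools import product

# Known ratings, keyed by (user, item)
known = {("u1", "m1"): 5, ("u1", "m2"): 3, ("u2", "m1"): 4}

users = sorted({u for u, _ in known})
items = sorted({i for _, i in known})

# Every (user, item) combination that has no rating yet
unrated_pairs = [(u, i) for u, i in product(users, items) if (u, i) not in known]

print(unrated_pairs)
```

The number of such pairs grows with users times items, so for a single new user it may be worth filtering the test set down to that user's pairs before predicting.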

In [192]:
from collections import defaultdict

# source: https://github.com/NicolasHug/Surprise/blob/master/examples/top_n_recommendations.py
def get_top_n(predictions, n=10):
    """Return the top-N recommendation for each user from a set of predictions.
    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 10.
    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    """

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n


def get_user_top_n(predictions, user_id, n=10):
    """Return the top-N recommendation for a specific user from a set of predictions.
    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        user_id: id of user
        n(int): The number of recommendation to output for each user. Default
            is 10.
    Returns:
        A list where values are tuples
        [(raw item id, rating estimation), ...] of size n.
    """

    top_n = [] 
    for uid, iid, _, est, _ in predictions:
        if uid == user_id:
            top_n.append((iid, est))

    top_n.sort(key=lambda x: x[1], reverse=True)
    return top_n[:n]

#### Piecing it All Together

In [49]:
ratings, movies, movies_and_ratings = read_data()

In [50]:
# prompt for user input
get_popular_movies_by_genre(movies_and_ratings, genre="Drama")

Unnamed: 0_level_0,MovieID,Title,Genres,Rating,RatingCount
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,sum,Unnamed: 5_level_1
1058,2858,American Beauty (1999),Comedy|Drama,14800,14800
487,1196,Star Wars: Episode V - The Empire Strikes Back...,Action|Adventure|Drama|Sci-Fi|War,12836,12836
785,2028,Saving Private Ryan (1998),Action|Drama|War,11507,11507
267,593,"Silence of the Lambs, The (1991)",Drama|Thriller,11219,11219
271,608,Fargo (1996),Crime|Drama|Thriller,10692,10692
238,527,Schindler's List (1993),Drama|War,10392,10392
52,110,Braveheart (1995),Action|Drama|War,10346,10346
152,318,"Shawshank Redemption, The (1994)",Drama,10143,10143
358,858,"Godfather, The (1972)",Action|Crime|Drama,10059,10059
140,296,Pulp Fiction (1994),Crime|Drama,9288,9288


In [182]:
# prompt for user input
get_highly_rated_movies_by_genre(movies_and_ratings, genre="Sci-Fi")

Unnamed: 0_level_0,MovieID,Title,Genres,Rating,Rating,RatingCountWeight,RatingScore
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,mean,sum,Unnamed: 6_level_1,Unnamed: 7_level_1
12,260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Fantasy|Sci-Fi,4.453694,13321,9.497097,42.297168
55,1196,Star Wars: Episode V - The Empire Strikes Back...,Action|Adventure|Drama|Sci-Fi|War,4.292977,12836,9.460009,40.611597
162,2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller,4.31583,11178,9.321703,40.230886
41,750,Dr. Strangelove or: How I Learned to Stop Worr...,Sci-Fi|War,4.44989,6083,8.713253,38.773021
28,541,Blade Runner (1982),Film-Noir|Sci-Fi,4.273333,7692,8.947936,38.237514
30,589,Terminator 2: Judgment Day (1991),Action|Sci-Fi|Thriller,4.058513,10751,9.282754,37.674175
64,1240,"Terminator, The (1984)",Action|Sci-Fi|Thriller,4.15205,8711,9.072342,37.668813
60,1210,Star Wars: Episode VI - Return of the Jedi (1983),Action|Adventure|Romance|Sci-Fi|War,4.022893,11598,9.358588,37.648596
61,1214,Alien (1979),Action|Horror|Sci-Fi|Thriller,4.159585,8419,9.038246,37.595354
66,1270,Back to the Future (1985),Comedy|Sci-Fi,3.990321,10307,9.240579,36.872878


In [None]:
# prompt for user input

# hard-code 10 additional movie ratings
new_user_id = np.repeat(ratings['UserID'].unique()[-1] + 1, 10)
new_item_ids = movies[movies["Genres"].str.contains("Action")].sample(10)['MovieID']
new_ratings = [3, 4, 5, 4, 2, 5, 5, 2, 3, 1]

new_user = pd.DataFrame({
	'userID': new_user_id,
	'itemID': new_item_ids,
	'rating': new_ratings
})

train, test = build_train_and_test(ratings, new_user)
predictions = train_and_predict(train, test)

top_n = get_top_n(predictions, n=10)

# Print the recommended items for each user
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])

# output omitted

In [197]:
uid = ratings['UserID'].unique()[-1] + 1
top_n = get_user_top_n(predictions, uid)
top_n

[(2905, 4.6998205553684675),
 (318, 4.6593108874882825),
 (2503, 4.616659739341537),
 (1117, 4.574145018407659),
 (527, 4.572543170173383),
 (53, 4.556078872464665),
 (670, 4.512521759929467),
 (745, 4.509700184774657),
 (2762, 4.467797716005988),
 (1148, 4.462175180717853)]

In [196]:
movie_ids = [recommendation[0] for recommendation in top_n] 
movies.loc[movies['MovieID'].isin(movie_ids)]

Unnamed: 0,MovieID,Title,Genres
52,53,Lamerica (1994),Drama
315,318,"Shawshank Redemption, The (1994)",Drama
523,527,Schindler's List (1993),Drama|War
664,670,"World of Apu, The (Apur Sansar) (1959)",Drama
735,745,"Close Shave, A (1995)",Animation|Comedy|Thriller
1101,1117,"Eighth Day, The (Le Huitième jour ) (1996)",Drama
1132,1148,"Wrong Trousers, The (1993)",Animation|Comedy
2434,2503,"Apple, The (Sib) (1998)",Drama
2693,2762,"Sixth Sense, The (1999)",Thriller
2836,2905,Sanjuro (1962),Action|Adventure
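
Note that filtering with `isin` returns the movies in dataframe order, not in order of predicted rating. If the ranking matters for display, one option is to merge the recommendations onto the movie table. This is a sketch on toy data: the names `toy_movies`, `toy_top_n`, and `ranked` are hypothetical, and the predicted ratings are rounded from the output above.

```python
import pandas as pd

# A small slice of the movies table
toy_movies = pd.DataFrame({
    "MovieID": [53, 318, 2905],
    "Title": ["Lamerica (1994)", "Shawshank Redemption, The (1994)", "Sanjuro (1962)"],
})

# (movie id, predicted rating) tuples, already sorted best first
toy_top_n = [(2905, 4.70), (318, 4.66), (53, 4.56)]

recs = pd.DataFrame(toy_top_n, columns=["MovieID", "PredictedRating"])

# A left merge keeps the row order of the left dataframe, i.e. the ranking
ranked = recs.merge(toy_movies, on="MovieID", how="left")

print(ranked)
```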


## Conclusion <a class="anchor" id="conclusion"></a>

We developed four algorithms for recommending movies to users: (1) recommend popular movies by genre, (2) recommend highly rated movies by genre, (3) use KNN for collaborative filtering, and (4) use SVD for collaborative filtering. We discussed the details of each algorithm, analyzed their performance, and prototyped our app.

## References <a class="anchor" id="references"></a>

*the following are hyperlinks to my references*

- Surprise 
	- [Documentation](https://surprise.readthedocs.io/en/stable/getting_started.html)
	- [Examples](https://github.com/NicolasHug/Surprise/tree/master/examples)
- [Build a Recommendation Engine With Collaborative Filtering](https://realpython.com/build-recommendation-engine-collaborative-filtering/#how-to-find-similar-users-on-the-basis-of-ratings)


[Back To Top](#back-to-top)