# Movie Recommendation System

## Introduction
Recommendation systems are everywhere today, whether you shop on Amazon, watch a movie on Netflix, or read an article on Medium. They work constantly to surface the items a user is most likely to enjoy. Simply put, a recommendation system is a program that predicts the rating a user would give to an item.  

Recommending a movie is challenging because so much information is available: attributes of the movie such as genre, actors and language, and user behaviour such as whether the user watched the trailer, clicked on the movie, or rated it.  

## Recommendation System Techniques
### I Content Based Filtering
This technique bases recommendations on data about the items themselves. The algorithm recommends items whose characteristics are similar to those of items the user liked in the past. For example, if a user liked Spider-Man, the system recommends Spider-Man 2 because it has the same genre and cast.
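
As a minimal sketch of the idea, the snippet below ranks movies by the cosine similarity of their feature vectors. The genre flags and the third title are hypothetical values chosen for illustration; a real system would derive features from genre, cast and other metadata.

```python
import math

# Hypothetical genre flags per movie: (action, superhero, romance, comedy)
movie_features = {
    "Spider-Man": (1, 1, 1, 0),
    "Spider-Man 2": (1, 1, 1, 0),
    "Notting Hill": (0, 0, 1, 1),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(liked, features):
    """Rank every other movie by feature similarity to the liked movie."""
    scores = {
        title: cosine_similarity(features[liked], vec)
        for title, vec in features.items()
        if title != liked
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = most_similar("Spider-Man", movie_features)
print(ranking[0][0])  # the title most similar to the liked movie
```

Because the two Spider-Man entries share the same feature vector, the sequel is ranked first.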


### II Collaborative Filtering
This approach bases recommendations on a user's behaviour compared with the behaviour of other users, so the history of all users plays an important role. The algorithm searches a large group of users and finds a smaller set whose tastes closely match those of the target user. **Collaborative filtering** is one of the most popular recommendation techniques and is used by Amazon, Netflix and YouTube.

There are two types of collaborative filtering.

#### a. User-based Collaborative Filtering
The basic idea is to find users whose preference patterns are similar to those of a given user, say A, and then recommend to A the items those similar users liked but A has not yet viewed. This is done by building a matrix of the items each user has rated, clicked or liked, computing a similarity score between users, and recommending items based on those scores.  
For example, if user A liked Batman Begins, Justice League and Thor, and user B liked Batman Begins, Justice League and Avengers, then they have similar interests. We can then recommend Thor to user B and Avengers to user A.
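
The example above can be sketched in a few lines. This version uses Jaccard similarity on sets of liked movies, which is only one possible similarity score and is chosen here for simplicity; user C and the title "Titanic" are hypothetical additions so that there is more than one candidate neighbour.

```python
likes = {
    "A": {"Batman Begins", "Justice League", "Thor"},
    "B": {"Batman Begins", "Justice League", "Avengers"},
    "C": {"Titanic"},  # hypothetical third user, not part of the example in the text
}

def jaccard(s, t):
    """Share of liked items two users have in common (0.0 to 1.0)."""
    union = s | t
    return len(s & t) / len(union) if union else 0.0

def recommend(user, likes):
    """Recommend what the most similar user liked that `user` has not seen."""
    others = [u for u in likes if u != user]
    nearest = max(others, key=lambda u: jaccard(likes[user], likes[u]))
    return sorted(likes[nearest] - likes[user])

print(recommend("A", likes))
print(recommend("B", likes))
```

With these sets, A's nearest neighbour is B and vice versa, so A is recommended Avengers and B is recommended Thor.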

#### b. Item-based Collaborative Filtering
This approach was popularized by Amazon. Here we look for similar items instead of similar users and then recommend those similar items to the user. This is done by taking the items that a given user liked, clicked or rated, measuring how similar each of them is to other items across all users who rated, viewed or clicked both, and then recommending the items with the highest similarity scores.  
For example, suppose user A rated both Money Heist and Ozark 5, and user B rated both 4. A new user C watches Money Heist and rates it 5. The recommender takes the two shows and checks the ratings from all users who rated both (here A and B). Since A and B each gave the two shows the same rating, the shows are likely similar, so the system recommends Ozark to user C.

Put simply: for an item I, a set of similar items is determined from rating vectors made up of the ratings each item received from users. The rating of I by a user U who has not rated it is then estimated by picking the N items from that similarity list that U has rated and computing a prediction from those N ratings.
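
The Money Heist and Ozark example can be sketched as follows. The similarity measure (cosine over users who rated both items) and the similarity-weighted average are assumptions made for illustration; the function names are hypothetical and this is not how any particular library implements it.

```python
import math

ratings = {
    "A": {"Money Heist": 5, "Ozark": 5},
    "B": {"Money Heist": 4, "Ozark": 4},
    "C": {"Money Heist": 5},
}

def item_similarity(i, j, ratings):
    """Cosine similarity of items i and j over the users who rated both."""
    common = [u for u, r in ratings.items() if i in r and j in r]
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[u][j] for u in common)
    norm_i = math.sqrt(sum(ratings[u][i] ** 2 for u in common))
    norm_j = math.sqrt(sum(ratings[u][j] ** 2 for u in common))
    return dot / (norm_i * norm_j)

def predict(user, item, ratings, n=2):
    """Similarity-weighted average of the user's ratings on the n most similar items."""
    sims = [
        (item_similarity(item, other, ratings), r)
        for other, r in ratings[user].items()
        if other != item
    ]
    top = sorted(sims, key=lambda pair: pair[0], reverse=True)[:n]
    total = sum(s for s, _ in top)
    if total == 0:
        return None  # no similar item rated by this user
    return sum(s * r for s, r in top) / total

print(predict("C", "Ozark", ratings))
```

Since A and B rated the two shows identically, the similarity is 1 and the predicted rating for C on Ozark equals C's rating of Money Heist.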

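#### Pros and Cons
Item-based filtering has several advantages over the user-based approach:
1. People are fickle and their tastes change over time, whereas items stay the same.
2. There are usually fewer items than users, so item-based similarities are cheaper to compute.
3. User-based systems can be manipulated with shilling attacks, in which fake user profiles are created to bias the recommendations.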

For this project I will use item-based collaborative filtering to build the recommender system.

## Steps in Collaborative Filtering
To build a system that automatically recommends new movies to a user based on the ratings or clicks of other users, two steps are involved:
1. Find similar users (for the user-based approach).
2. Predict ratings for the items the user has not yet rated or viewed.

To do this successfully, we need methods that answer the following questions:
1. How do we find similar users?
2. Given similar users, how do we predict the rating a user will give to an item from the ratings those similar users gave it?
3. How do we evaluate the performance of the prediction model?

There are many algorithms for each step. For example, we can use Euclidean distance or cosine distance to find similar users and then take the average of their ratings as the predicted rating for the suggested movie. To evaluate the model, we can compute the RMSE between the predictions and the actual ratings in a test set. Since users do not rate every movie they watch, the user-item matrix is usually sparse, i.e. mostly empty, so more sophisticated algorithms add steps such as dimensionality reduction or matrix factorization (e.g. SVD or PCA).  

ref: https://realpython.com/build-recommendation-engine-collaborative-filtering/  
ref: http://www.mmds.org/
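
The three questions above can be illustrated with a small sketch: Euclidean distance to find neighbours, a plain average of the neighbours' ratings as the prediction, and RMSE to score it. All ratings below are hypothetical, and the helper names are made up for this example.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two rating vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_from_neighbours(target, neighbours, k=2):
    """Average the unseen-movie ratings of the k users closest to the target."""
    ranked = sorted(neighbours, key=lambda n: euclidean_distance(target, n[0]))
    nearest = ranked[:k]
    return sum(rating for _, rating in nearest) / len(nearest)

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

# Target user's ratings on two movies everyone has rated (hypothetical values)
target = (4.0, 5.0)
# Each neighbour: (ratings on the same two movies, rating of the unseen movie)
neighbours = [((4.0, 4.0), 3.0), ((1.0, 2.0), 5.0), ((5.0, 5.0), 4.0)]

prediction = predict_from_neighbours(target, neighbours)
print(prediction)
print(rmse([prediction, 4.0], [4.0, 4.0]))
```

The second neighbour is far from the target and is ignored with `k=2`, so the prediction is the mean of the other two neighbours' ratings.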

## Matrix Factorization
The user-item matrix has two dimensions: users and items. When the matrix is mostly empty, we can improve the model by reducing its dimensionality through factorization.  
**Matrix factorization** is a technique that breaks a large matrix down into a product of smaller matrices. For example, an **m x n** matrix can be factored into two matrices of shape **m x p and p x n**.

**Remember:** a matrix A can be multiplied by a matrix B only if the number of columns in A equals the number of rows in B.
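
To make the shapes concrete, here is a small sketch in plain Python that multiplies an m x p matrix by a p x n matrix. The factor values are hypothetical; in a recommender the rows of `P` would be user factors and the columns of `Q` item factors.

```python
def matmul(A, B):
    """Multiply A (m x p) by B (p x n) and return the m x n product."""
    if len(A[0]) != len(B):
        raise ValueError("columns of A must equal rows of B")
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Hypothetical factors: 3 users x 2 latent factors, 2 latent factors x 4 items
P = [[1.0, 0.5], [0.0, 2.0], [1.5, 1.0]]
Q = [[4.0, 2.0, 0.0, 1.0], [1.0, 0.0, 2.0, 2.0]]

R = matmul(P, Q)           # reconstructed 3 x 4 user-item matrix
print(len(R), len(R[0]))   # number of rows and columns of the product
print(R[0])                # predicted ratings of the first user for all items
```

Calling `matmul(Q, P)` raises a `ValueError`, because `Q` has 4 columns while `P` has 3 rows.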

# Recommendation system using Surprise Python Library
Surprise is one of the most popular Python libraries for recommender systems. It comes with various recommendation algorithms as well as built-in datasets such as MovieLens 100k (`ml-100k`).

In [76]:
# Load Libraries
import pandas as pd
from surprise import Dataset
from surprise import Reader

In [77]:
# Load Data
movielens = Dataset.load_builtin('ml-100k')

In [84]:
ratings_dict = {
    "item": [1, 2, 1, 2, 1, 2, 1, 2, 1],
    "user": ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'E'],
    "rating": [1, 2, 2, 4, 2.5, 4, 4.5, 5, 3],
}

df = pd.DataFrame(ratings_dict)
reader = Reader(rating_scale=(1, 5))

# Loads Pandas dataframe
data = Dataset.load_from_df(df[["user", "item", "rating"]], reader)
# Load the built-in MovieLens 100k data.
# load_builtin() offers to download the dataset if it has not already been downloaded
# and saves it in the .surprise_data folder in your home directory.
movielens = Dataset.load_builtin('ml-100k')

## Model 1 Using KNN Algorithm (Distance based similarity approach)

The `sim_options` dictionary configures how similarities are computed:

**name** is the similarity metric to use. The options are cosine, msd, pearson and pearson_baseline. The default is msd (mean squared difference).

**user_based** is a boolean that selects between the user-based and item-based approaches. The default is True, which means the user-based approach is used.

**min_support** is the minimum number of common items two users must share for their similarity to be considered. For the item-based approach, it is the minimum number of common users for two items.

In [85]:
from surprise import KNNWithMeans
from surprise import Dataset
from surprise.model_selection import GridSearchCV

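# Load the built-in MovieLens 100k dataset
data = Dataset.load_builtin("ml-100k")

# Similarity settings to search over
sim_options = {
    "name": ["msd", "cosine"],
    "min_support": [3, 4, 5],
    "user_based": [False, True],
}

param_grid = {"sim_options": sim_options}

# 3-fold cross-validated grid search, scored by RMSE and MAE
gs = GridSearchCV(KNNWithMeans, param_grid, measures=["rmse", "mae"], cv=3)
gs.fit(data)

# Best RMSE and the parameter combination that achieved it
print(gs.best_score["rmse"])
print(gs.best_params["rmse"])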

Computing the msd similarity matrix...
Done computing similarity matrix.
... (these two lines repeat for each fit; remaining output truncated)

## Conclusion
For the MovieLens 100k dataset and the settings searched, the centered KNN algorithm (`KNNWithMeans`) works best with the **item-based** approach, **msd** as the similarity metric and a **minimum support of 3**.



# Train-test split and the fit() method
If you don't want to run a full cross-validation procedure, you can use train_test_split() to sample a trainset and a testset of given sizes, and then apply the accuracy metric of your choice. You'll need the fit() method, which trains the algorithm on the trainset, and the test() method, which returns the predictions made on the testset.

ref: https://surprise.readthedocs.io/en/stable/getting_started.html




In [66]:
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split

# Load the movielens-100k dataset (download it if needed).
data = Dataset.load_builtin('ml-100k')

# Sample a random trainset and testset;
# the test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=.25)

# We'll use the famous SVD algorithm.
algo = SVD()

# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)


RMSE: 0.9421


0.9421139885167851

## Model 2 - Using SVD Algorithm - Matrix factorization approach

**n_epochs** is the number of iterations of stochastic gradient descent (SGD), an iterative optimization method used to minimize a function.  

**lr_all** is the learning rate for all parameters; it controls how much the parameters are adjusted in each iteration.  

**reg_all** is the regularization term for all parameters, a penalty added to prevent overfitting.  

**Note:** matrix factorization algorithms do not use similarity metrics, because the latent factors capture the similarity among users or items.
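
To show how these three parameters interact, here is a simplified sketch of biased matrix factorization trained with SGD. It is written in plain Python for illustration and is not Surprise's implementation; the function names, the `n_factors` and `seed` arguments, and the initialisation are assumptions. The sample ratings are the same small user/item/rating triples used in the custom dataframe elsewhere in this notebook.

```python
import random

def train_mf(ratings, n_factors=2, n_epochs=10, lr_all=0.005, reg_all=0.4, seed=0):
    """Fit a small biased matrix-factorization model with SGD; return a predictor."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)   # global mean rating
    bu = {u: 0.0 for u in users}                         # user biases
    bi = {i: 0.0 for i in items}                         # item biases
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def estimate(u, i):
        return mu + bu[u] + bi[i] + sum(pf * qf for pf, qf in zip(p[u], q[i]))

    for _ in range(n_epochs):                # n_epochs: passes over all ratings
        for u, i, r in ratings:
            err = r - estimate(u, i)
            # lr_all scales each step; reg_all pulls parameters back toward zero
            bu[u] += lr_all * (err - reg_all * bu[u])
            bi[i] += lr_all * (err - reg_all * bi[i])
            for f in range(n_factors):
                pf, qf = p[u][f], q[i][f]
                p[u][f] += lr_all * (err * qf - reg_all * pf)
                q[i][f] += lr_all * (err * pf - reg_all * qf)
    return estimate

def squared_error(predict, ratings):
    """Sum of squared errors of a predictor on (user, item, rating) triples."""
    return sum((r - predict(u, i)) ** 2 for u, i, r in ratings)

sample_ratings = [
    ("A", 1, 1.0), ("A", 2, 2.0), ("B", 1, 2.0), ("B", 2, 4.0), ("C", 1, 2.5),
    ("C", 2, 4.0), ("D", 1, 4.5), ("D", 2, 5.0), ("E", 1, 3.0),
]

untrained = train_mf(sample_ratings, n_epochs=0)
trained = train_mf(sample_ratings, n_epochs=10)
print(squared_error(untrained, sample_ratings))
print(squared_error(trained, sample_ratings))
```

A larger `lr_all` or more epochs fits the training ratings more closely, while a larger `reg_all` keeps the biases and factors small, which is the trade-off the grid search explores.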

In [47]:
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import GridSearchCV

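# Load the built-in MovieLens 100k dataset
data = Dataset.load_builtin("ml-100k")

# SVD hyperparameters to search over
param_grid = {
    "n_epochs": [5, 10],
    "lr_all": [0.002, 0.005],
    "reg_all": [0.4, 0.6]
}

# 3-fold cross-validated grid search, scored by RMSE and MAE
gs = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], cv=3)
gs.fit(data)

# Best RMSE and the parameter combination that achieved it
print(gs.best_score["rmse"])
print(gs.best_params["rmse"])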

0.9635502293746322
{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}


## Conclusion
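So, for the MovieLens 100k dataset and among the values searched, the SVD algorithm works best with **10 epochs**, a **learning rate of 0.005** and **0.4 regularization**. The best cross-validated RMSE (0.9636) is still higher than the 0.9421 obtained earlier with the default SVD parameters on a single train-test split, so a wider grid may be worth searching.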

# Making Predictions on a custom dataframe using KNN

In [86]:
ratings_dict = {
    "item": [1, 2, 1, 2, 1, 2, 1, 2, 1],
    "user": ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'E'],
    "rating": [1, 2, 2, 4, 2.5, 4, 4.5, 5, 3],
}

df = pd.DataFrame(ratings_dict)
reader = Reader(rating_scale=(1, 5))

# Loads Pandas dataframe
data = Dataset.load_from_df(df[["user", "item", "rating"]], reader)

In [87]:
from surprise import KNNWithMeans

# To use item-based cosine similarity
sim_options = {
    "name": "cosine",
    "user_based": False,  # Compute  similarities between items
}
algo = KNNWithMeans(sim_options=sim_options)

In [88]:
# Train on every rating in the custom dataframe
trainingSet = data.build_full_trainset()
algo.fit(trainingSet)

# Predict the rating user 'E' would give to item 2
prediction = algo.predict('E', 2)
prediction.est

Computing the cosine similarity matrix...
Done computing similarity matrix.


4.15

## Making predictions using SVD

In [94]:
ratings_dict = {
    "item": [1, 2, 1, 2, 1, 2, 1, 2, 1],
    "user": ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'E'],
    "rating": [1, 2, 2, 4, 2.5, 4, 4.5, 5, 3],
}

df = pd.DataFrame(ratings_dict)
reader = Reader(rating_scale=(1, 5))

# Loads Pandas dataframe
data = Dataset.load_from_df(df[["user", "item", "rating"]], reader)

In [121]:
from surprise import SVD

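# Use the best parameters found by the grid search
algo = SVD(n_epochs=10, lr_all=0.005, reg_all=0.4)
trainingSet = data.build_full_trainset()
algo.fit(trainingSet)

# Note: item 540 does not appear in the custom dataframe (its items are 1 and 2),
# so this is a prediction for an item the model has never seen
prediction = algo.predict('E', 540)
prediction.est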

3.52986

## Training on the whole dataset and making predictions

In [128]:
from surprise import SVD
from surprise import Dataset

data = Dataset.load_builtin("ml-100k")

# SVD with the best parameters found by the grid search
algo = SVD(n_epochs=10, lr_all=0.005, reg_all=0.4)

# Train on the whole dataset instead of holding out a test set
trainset = data.build_full_trainset()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fa750564bd0>

In [134]:
# predict() uses raw ids. Because this dataset was read from a file,
# the raw ids are strings (even if they represent numbers).
uid = str(196)  # raw user id (as in the ratings file)
iid = str(302)  # raw item id (as in the ratings file)

In [135]:
algo.predict(uid, iid, verbose=True)

user: 196        item: 302        r_ui = None   est = 4.02   {'was_impossible': False}


Prediction(uid='196', iid='302', r_ui=None, est=4.015285886856125, details={'was_impossible': False})

In [138]:

prediction = algo.predict(uid=str(20), iid=str(120))
prediction.est

2.291272980108429

## References
1. https://surprise.readthedocs.io/en/stable/getting_started.html
2. https://realpython.com/build-recommendation-engine-collaborative-filtering/