# Movies Recommender System

<img src='http://labs.criteo.com/wp-content/uploads/2017/08/CustomersWhoBought3.jpg' width=500>

## Collaborative Filtering

The content-based recommender we developed earlier suffers from some severe limitations. It is only capable of suggesting movies that are *close* to a certain movie. That is, it cannot capture a user's tastes or provide recommendations across genres.

Also, the engine that we built is not really personal in that it doesn't capture the personal tastes and biases of a user. Anyone querying our engine for recommendations based on a movie will receive the same recommendations for that movie, regardless of who s/he is.

Here we will use collaborative filtering to make recommendations. Specifically, we will use matrix decomposition to factorize our user ratings matrix and generate recommendations.

We will work with a dataset of user IDs, movie IDs and movie ratings. These make up the entries of our ratings matrix **R**.
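
To make the idea of factorizing **R** concrete, here is a minimal NumPy sketch on a made-up 3x3 toy matrix (not the notebook's data). It learns two small factor matrices, `P` for users and `Q` for movies, by stochastic gradient descent on the observed entries only, so that `P @ Q.T` approximates the known ratings and fills in the missing ones. The toy ratings, the number of factors, the learning rate and the regularization strength are all illustrative assumptions, and the sketch is simpler than what Surprise's `SVD` does (it leaves out the bias terms).

```python
import numpy as np

# Toy ratings matrix R (rows = users, columns = movies); np.nan marks a missing rating.
# These numbers are made up for illustration only.
R = np.array([
    [5.0, 4.0, np.nan],
    [4.0, np.nan, 1.0],
    [np.nan, 2.0, 5.0],
])

def factorize(R, n_factors=2, lr=0.02, reg=0.02, n_epochs=1000, seed=0):
    """Learn P (users x factors) and Q (movies x factors) so that R ~ P @ Q.T."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(scale=0.1, size=(n_users, n_factors))
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))
    # Only the observed (non-missing) entries contribute to the updates.
    observed = [(u, i) for u in range(n_users) for i in range(n_items)
                if not np.isnan(R[u, i])]
    for _ in range(n_epochs):
        for u, i in observed:
            err = R[u, i] - P[u] @ Q[i]
            p_old = P[u].copy()  # keep the old user vector for the item update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return P, Q

def observed_rmse(R, P, Q):
    """RMSE between R and P @ Q.T, measured on the observed entries only."""
    mask = ~np.isnan(R)
    diff = (R - P @ Q.T)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

P, Q = factorize(R)
R_hat = P @ Q.T  # dense matrix: every user/movie pair now has an estimate
print(np.round(R_hat, 2))
print("RMSE on observed entries:", observed_rmse(R, P, Q))
```

The entries of `R_hat` that were `nan` in `R` are the model's guesses for ratings that were never given, which is what a recommender ranks on.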

### Load libraries
We will use the `surprise` library to perform the matrix decomposition with a technique called SVD (singular value decomposition). Surprise is a Python library built specifically for recommender systems. Find out more about [surprise](https://surprise.readthedocs.io/en/stable/). Find out more about SVD for recommender systems [here](https://medium.com/@m_n_malaeb/singular-value-decomposition-svd-in-recommender-systems-for-non-math-statistics-programming-4a622de653e9).
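
For reference, the Surprise documentation describes the prediction made by its `SVD` algorithm as follows, where $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases, and $p_u$ and $q_i$ are the learned user and item factor vectors:

$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^{T} p_u$$

The parameters are fitted by stochastic gradient descent on the regularized squared error over the known ratings, with $\lambda$ as the regularization strength:

$$\sum_{r_{ui} \in R_{train}} \left(r_{ui} - \hat{r}_{ui}\right)^2 + \lambda \left(b_i^2 + b_u^2 + \lVert q_i \rVert^2 + \lVert p_u \rVert^2\right)$$

So, despite the name, this is a regularized matrix factorization rather than a classical SVD of a fully observed matrix.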

You may need to install the surprise library. With anaconda, run: `conda install -c conda-forge scikit-surprise`  (more info here: https://anaconda.org/conda-forge/scikit-surprise)

In [None]:
import pandas as pd
import numpy as np
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

### Read the data into a pandas dataframe

In [None]:
ratings = pd.read_csv('ratings_small.csv')
ratings.head()

### Read the data from the dataframe into Surprise

In [None]:
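# Reader's default rating_scale is (1, 5); this assumes the file uses
# MovieLens-style half-star ratings from 0.5 to 5.
reader = Reader(rating_scale=(0.5, 5))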
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

We will use cross-validation to estimate how well the matrix decomposition generalizes to unseen ratings, which helps us detect overfitting. We split the data into 5 folds (`cv=5`).
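
As a rough illustration of what `cv=5` means, the sketch below splits a list of row indices into 5 folds and, in each round, holds one fold out for testing while training on the other four. It is a simplified stand-in, not Surprise's actual splitting code; the function name `k_fold_indices` and the fixed seed are assumptions made for this example.

```python
import random

def k_fold_indices(n_rows, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each row lands in a test fold exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_splits folds, round-robin.
    folds = [indices[k::n_splits] for k in range(n_splits)]
    for k in range(n_splits):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx

for train_idx, test_idx in k_fold_indices(n_rows=10, n_splits=5):
    print(len(train_idx), "train rows,", len(test_idx), "test rows")
```

Each rating is used for testing exactly once and for training four times, so the averaged error reflects performance on data the model did not see during fitting.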

Evaluate how well the SVD fits our data. Lower RMSE is better: if the RMSE is high (>0.8), our estimated ratings may not be very good.

In [None]:
svd = SVD()
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5)
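
The two measures requested above have simple definitions: RMSE is the square root of the mean squared difference between true and estimated ratings, and MAE is the mean absolute difference. The sketch below computes both on a handful of made-up rating pairs (not output from the model), so that the numbers `cross_validate` reports are easier to interpret.

```python
import math

def rmse(true_ratings, estimates):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - e) ** 2 for t, e in zip(true_ratings, estimates)]
    return math.sqrt(sum(squared) / len(squared))

def mae(true_ratings, estimates):
    """Mean absolute error between two equal-length sequences."""
    absolute = [abs(t - e) for t, e in zip(true_ratings, estimates)]
    return sum(absolute) / len(absolute)

# Made-up ratings for illustration only.
true_ratings = [4.0, 3.0, 5.0, 2.0]
estimates = [3.0, 3.0, 4.0, 4.0]

print("RMSE:", rmse(true_ratings, estimates))
print("MAE: ", mae(true_ratings, estimates))
```

Because errors are squared before averaging, RMSE penalizes large misses more heavily than MAE and is never smaller than it. In the notebook, the dictionary returned by `cross_validate` holds the per-fold values under the keys `test_rmse` and `test_mae`.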

If we are satisfied with the accuracy we can train on our dataset and arrive at predictions.

In [None]:
trainset = data.build_full_trainset()
svd.fit(trainset) # svd.train(trainset) in older versions

Let us pick user 1 and check the ratings they have given.

In [None]:
ratings[ratings['userId'] == 1]

First, let us use our fitted model to predict the rating for a movie for which we know the true rating given by user 1: movie 31, with rating 2.5.

In [None]:
svd.predict(1, 31, r_ui=2.5)  # r_ui is the known true rating, stored alongside the estimate

Now let us use our fitted model to predict the rating for a movie they have not yet rated, movie 302.

In [None]:
svd.predict(1, 302)
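
A single predicted rating becomes a recommendation list once we score every movie the user has not rated and keep the highest estimates. The helper below is a hypothetical sketch of that step: `top_n_for_user` is not part of Surprise, and it accepts any function that maps `(user_id, movie_id)` to an estimated rating. The `toy_estimate` function and its scores are made up so that the example runs on its own.

```python
def top_n_for_user(estimate, user_id, candidate_movie_ids, rated_movie_ids, n=3):
    """Return the n unrated movies with the highest estimated rating, best first."""
    rated = set(rated_movie_ids)
    scored = [(movie_id, estimate(user_id, movie_id))
              for movie_id in candidate_movie_ids if movie_id not in rated]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Made-up scores standing in for a fitted model.
toy_scores = {10: 3.1, 20: 4.6, 30: 2.4, 40: 4.0, 50: 3.8}

def toy_estimate(user_id, movie_id):
    """Hypothetical estimator: looks up a fixed score, ignoring the user."""
    return toy_scores[movie_id]

recommendations = top_n_for_user(toy_estimate, user_id=1,
                                 candidate_movie_ids=toy_scores.keys(),
                                 rated_movie_ids=[20], n=3)
print(recommendations)
```

With the fitted model from this notebook, `estimate` could be `lambda u, i: svd.predict(u, i).est`, the candidates could be `ratings['movieId'].unique()`, and the rated movies could be `ratings.loc[ratings['userId'] == 1, 'movieId']`.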

A notable property of this recommender system is that it doesn't care what the movie is (or what it contains). It works purely on the basis of an assigned movie ID and tries to predict ratings based on how other users with similar taste have rated the movie.

### Your turn

We compared our model's prediction for user 1 and movie 31 to the observed rating. This comparison isn't really fair, as that observation was in our training data. It would be better to split the data into training and test sets from the outset and evaluate model accuracy on the test set. We could then do a proper comparison. Hint: you might want to use the imports  
**from surprise.model_selection import train_test_split** and  
**from surprise import accuracy**