## Intelie Recruitee - Challenge 2

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from surprise import KNNBasic
from surprise import SVD
from surprise import Reader
from surprise import Dataset
from surprise.model_selection import cross_validate

import warnings; warnings.simplefilter('ignore')

## The MovieLens dataset
MovieLens is a common, publicly available dataset for building recommender systems. This version contains 1,000,209 anonymous ratings of about 3,900 movies from 6,040 MovieLens users who joined in 2000.
This 1M version was released in 2003. Every selected user rated at least 20 movies. Each user is represented by an id; no other user information is used in this notebook.

The original data consists of three files: movies.dat, ratings.dat and users.dat. All three were converted to CSV files.
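The conversion step itself is not shown in this notebook. A minimal sketch of how it could be done with pandas follows, assuming that `ratings.dat` uses the `::` separator with the column order user, movie, rating, timestamp. The column names are taken from the CSV read below, and the zero-based `user_emb_id` and `movie_emb_id` columns are assumed to be the ids minus one, which matches the sample rows. An in-memory sample stands in for the real file.

```python
import io
import pandas as pd

# Two sample lines in the '::'-separated layout assumed for ratings.dat.
sample_dat = io.StringIO("1::1193::5::978300760\n1::661::3::978302109\n")

columns = ["user_id", "movie_id", "rating", "timestamp"]
# A multi-character separator requires the Python parsing engine.
converted = pd.read_csv(sample_dat, sep="::", engine="python", names=columns)

# Assumption: zero-based ids derived from the original one-based ids.
converted["user_emb_id"] = converted["user_id"] - 1
converted["movie_emb_id"] = converted["movie_id"] - 1

# Tab-separated text, matching the sep='\t' used when reading ratings.csv.
csv_text = converted.to_csv(sep="\t")
print(csv_text)
```

Passing a file path to `to_csv` instead of printing would write the result to disk.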


### Reading ratings file

In [2]:
ratings = pd.read_csv('ratings.csv', sep='\t', encoding='latin-1')

In [3]:
ratings.head()

Unnamed: 0.1,Unnamed: 0,user_id,movie_id,rating,timestamp,user_emb_id,movie_emb_id
0,0,1,1193,5,978300760,0,1192
1,1,1,661,3,978302109,0,660
2,2,1,914,3,978301968,0,913
3,3,1,3408,4,978300275,0,3407
4,4,1,2355,5,978824291,0,2354


### **Collaborative Filtering**

Our content-based engine has some severe limitations. It can only suggest movies that are close to a given movie, so it cannot capture a user's tastes or recommend across genres.

The engine we built is also not personal: it does not capture the individual tastes and biases of a user. Anyone querying it for recommendations based on a movie receives the same recommendations for that movie, regardless of who they are.

Therefore, in this section we use a technique called Collaborative Filtering to make recommendations to movie watchers. Collaborative Filtering is based on the idea that users similar to me can be used to predict how much I will like a particular product or service that those users have used or experienced but I have not.

I will not implement Collaborative Filtering from scratch. Instead, I will use the Surprise library, which provides algorithms such as Singular Value Decomposition (SVD) that are trained to minimise RMSE (Root Mean Square Error) and give good recommendations.
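For reference, and as far as I understand the Surprise documentation, its `SVD` algorithm predicts a rating from a global mean $\mu$, a user bias $b_u$, an item bias $b_i$, and the dot product of latent factor vectors $p_u$ and $q_i$:

$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u$$

The parameters are learned by minimising a regularised squared error over the known ratings, where $\lambda$ is the regularisation strength:

$$\sum_{r_{ui} \in R_{\text{train}}} \left(r_{ui} - \hat{r}_{ui}\right)^2 + \lambda\left(b_i^2 + b_u^2 + \lVert q_i \rVert^2 + \lVert p_u \rVert^2\right)$$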

In [4]:
# Default Reader: rating scale (1, 5), which matches the 1-5 ratings in this dataset
reader = Reader()

In [5]:
ratings = pd.read_csv('ratings.csv', sep='\t', encoding='latin-1')
ratings.head()

Unnamed: 0.1,Unnamed: 0,user_id,movie_id,rating,timestamp,user_emb_id,movie_emb_id
0,0,1,1193,5,978300760,0,1192
1,1,1,661,3,978302109,0,660
2,2,1,914,3,978301968,0,913
3,3,1,3408,4,978300275,0,3407
4,4,1,2355,5,978824291,0,2354


### Getting Started with Surprise

In [6]:
# Surprise expects the columns in the order: user id, item id, rating
data = Dataset.load_from_df(ratings[['user_id', 'movie_id', 'rating']], reader)

In [7]:
algo = SVD()
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8724  0.8742  0.8745  0.8727  0.8747  0.8737  0.0009  
MAE (testset)     0.6851  0.6861  0.6860  0.6849  0.6867  0.6857  0.0007  
Fit time          59.46   60.57   52.89   63.28   59.68   59.17   3.42    
Test time         3.37    2.45    3.62    3.17    2.52    3.03    0.47    


{'test_rmse': array([0.87243468, 0.87416423, 0.87447968, 0.87274125, 0.87465316]),
 'test_mae': array([0.68509193, 0.68606877, 0.68601652, 0.68486662, 0.68669479]),
 'fit_time': (59.457765102386475,
  60.567166328430176,
  52.89252519607544,
  63.276416540145874,
  59.68015670776367),
 'test_time': (3.3690383434295654,
  2.4467992782592773,
  3.6198818683624268,
  3.174246072769165,
  2.515197992324829)}
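The five folds reported above come from splitting the ratings into five disjoint parts and testing on each part in turn. A minimal sketch of that idea with NumPy follows; it uses a toy size and a fixed seed chosen for illustration, and Surprise's own splitting logic may differ in detail.

```python
import numpy as np

n_ratings = 10                       # toy size; the real dataset has about one million rows
rng = np.random.default_rng(0)       # fixed seed so the split is reproducible
shuffled = rng.permutation(n_ratings)
folds = np.array_split(shuffled, 5)  # five disjoint groups of row indices

for fold_number, test_idx in enumerate(folds, start=1):
    # Everything outside the current fold is used for training.
    train_idx = np.setdiff1d(shuffled, test_idx)
    print(f"Fold {fold_number}: train on {len(train_idx)} rows, test on {len(test_idx)} rows")
```

Each rating is used for testing exactly once, which is why the table reports one RMSE and one MAE per fold.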

The predicted ratings are evaluated with **Root Mean Squared Error (RMSE)** and **Mean Absolute Error (MAE)**, both computed by Surprise's `cross_validate`. The RMSE is the square root of the **Mean Squared Error (MSE)**.
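To make the metrics concrete, here is a small worked example with NumPy. The rating values are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
import numpy as np

# Hypothetical true and predicted ratings, for illustration only.
actual = np.array([5.0, 3.0, 4.0, 1.0])
predicted = np.array([4.0, 3.0, 5.0, 3.0])

errors = predicted - actual
mse = np.mean(errors ** 2)      # mean of the squared errors
rmse = np.sqrt(mse)             # square root of the MSE
mae = np.mean(np.abs(errors))   # mean of the absolute errors

print(f"MSE={mse:.4f} RMSE={rmse:.4f} MAE={mae:.4f}")
```

Because the errors are squared before averaging, RMSE penalises the single large error (2 stars) more heavily than MAE does.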


We get a mean Root Mean Square Error of 0.8737, which is more than good enough for our case. Let us now train on the full dataset and arrive at predictions.

In [8]:
trainset = data.build_full_trainset()

### Build an algorithm, and train it

In [9]:
algo = KNNBasic()
algo.fit(trainset)

Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x7f8bc40b0b50>

* **Neighbourhood-based algorithm (KNN)**: `KNNBasic` follows the same idea as other memory-based recommendation systems. It computes similarities between users and/or items and uses them as weights to predict a rating for a user and an item. As the log above shows, the similarity used here is the mean squared difference (MSD) rather than Pearson correlation or cosine similarity.
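The log above reports an `msd` similarity matrix. A minimal sketch of that idea on a toy matrix follows, assuming the similarity is `1 / (MSD + 1)` over commonly rated items and that the prediction is a similarity-weighted mean over the `k` nearest users. The data and the helper names `msd_similarity` and `knn_predict` are hypothetical, and this is not Surprise's actual implementation.

```python
import numpy as np

# Toy ratings: rows are users, columns are items, np.nan marks a missing rating.
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [5.0, 5.0, 4.0,    1.0],
    [1.0, 2.0, 1.0,    5.0],
])

def msd_similarity(a, b):
    """Return 1 / (MSD + 1), with the MSD taken over items both users rated."""
    common = ~np.isnan(a) & ~np.isnan(b)
    if not common.any():
        return 0.0
    msd = np.mean((a[common] - b[common]) ** 2)
    return 1.0 / (msd + 1.0)

def knn_predict(R, user, item, k=2):
    """Similarity-weighted mean over the k most similar users who rated the item."""
    neighbours = [
        (msd_similarity(R[user], R[v]), R[v, item])
        for v in range(R.shape[0])
        if v != user and not np.isnan(R[v, item])
    ]
    top_k = sorted(neighbours, key=lambda pair: pair[0], reverse=True)[:k]
    total_sim = sum(sim for sim, _ in top_k)
    if total_sim == 0:
        return float("nan")
    return sum(sim * rating for sim, rating in top_k) / total_sim

sim_01 = msd_similarity(R[0], R[1])
estimate = knn_predict(R, user=0, item=2)
print(sim_01, estimate)
```

User 1 agrees closely with user 0 and therefore dominates the weighted mean, so the estimate lands near user 1's rating for the item.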

In [10]:
ratings[ratings['user_id'] == 1]

Unnamed: 0.1,Unnamed: 0,user_id,movie_id,rating,timestamp,user_emb_id,movie_emb_id
0,0,1,1193,5,978300760,0,1192
1,1,1,661,3,978302109,0,660
2,2,1,914,3,978301968,0,913
3,3,1,3408,4,978300275,0,3407
4,4,1,2355,5,978824291,0,2354
5,5,1,1197,3,978302268,0,1196
6,6,1,1287,5,978302039,0,1286
7,7,1,2804,5,978300719,0,2803
8,8,1,594,4,978302268,0,593
9,9,1,919,4,978301368,0,918


In [11]:
algo.predict(1, 302, 3)

Prediction(uid=1, iid=302, r_ui=3, est=4.008512766936974, details={'actual_k': 40, 'was_impossible': False})

For the movie with ID 302, I get an estimated rating of about 4.009 (the third argument, 3, is only the reference rating `r_ui` and does not affect the estimate). One notable feature of this recommender system is that it doesn't care what the movie is (or what it contains). It works purely on the basis of an assigned movie ID and tries to predict ratings based on how other users have rated the movie.
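Single predictions like this one can be turned into a recommendation list by scoring every movie the user has not rated and keeping the highest estimates. The sketch below is one way to do that; `top_n_for_user` and `stub_predict` are hypothetical names, and the stub with made-up estimates stands in for a trained model so the example runs without Surprise.

```python
def top_n_for_user(predict_fn, user_id, all_item_ids, rated_item_ids, n=3):
    """Rank items the user has not rated by estimated rating, highest first."""
    candidates = [i for i in all_item_ids if i not in rated_item_ids]
    scored = [(item_id, predict_fn(user_id, item_id)) for item_id in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Hypothetical stand-in for a trained model, with made-up estimates.
fake_estimates = {10: 3.2, 20: 4.6, 30: 4.1, 40: 2.0}

def stub_predict(user_id, item_id):
    return fake_estimates[item_id]

recommendations = top_n_for_user(
    stub_predict,
    user_id=1,
    all_item_ids=[10, 20, 30, 40],
    rated_item_ids={10},
    n=2,
)
print(recommendations)  # [(20, 4.6), (30, 4.1)]
```

With the model trained above, one could pass `lambda u, i: algo.predict(u, i).est` as `predict_fn`.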

Collaborative Filtering: I used the Surprise library to build collaborative filters. SVD (singular value decomposition) reached a cross-validated RMSE below 1, and a KNNBasic model trained on the full dataset gave estimated ratings for a given user and movie.