# Machine Learning HW 5 - Cintia Zuccon Buffon

1. Recommender systems are a hot topic. They can be formulated as a matrix completion task in machine learning: the aim is to predict the
rating that a user will give to an item (e.g., a restaurant, a movie, a product).
2. Download the movie rating dataset from: https://www.kaggle.com/rounakbanik/the-movies-dataset. These files contain metadata for all 45,000 movies listed in the Full
MovieLens Dataset. The dataset consists of movies released on or before July 2017. Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages,
production companies, countries, TMDB vote counts and vote averages. This dataset also has files containing 26 million ratings from 270,000 users for all 45,000 movies. Ratings are
on a scale of 1-5 and have been obtained from the official GroupLens website.
3. Build a small recommender system with the matrix data in “ratings.csv”. You can use the recommender system library Surprise (http://surpriselib.com),
use other recommender system libraries, or implement one from scratch.

In [1]:
import pandas as pd
from surprise import Dataset
from surprise import Reader
from surprise import SVD
from surprise import KNNBasic
from surprise.model_selection import cross_validate

3.a. Read data from “ratings.csv” with line format: 'userID movieID rating timestamp'.

In [2]:
# Read as a .csv file : 

data_movies = pd.read_csv(r'C:\Users\cinti\Documents\PythonF\ratings_small.csv')
#print(data_movies.head())

In [3]:
#Reader - for the rating scale parameter

reader = Reader(rating_scale = (0.5, 5.0))

data = Dataset.load_from_df(data_movies[['userId', 'movieId', 'rating']], reader)

b. MAE and RMSE are two widely used metrics for evaluating the performance of a recommender system. The definition of MAE can be found via:
https://en.wikipedia.org/wiki/Mean_absolute_error. The definition of RMSE can be found via: https://en.wikipedia.org/wiki/Root-mean-square_deviation.
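
As a quick illustration of the two definitions, here is a minimal pure-Python sketch. The function names `mae` and `rmse` and the sample ratings are made up for illustration; Surprise computes both metrics internally during cross-validation, so this is not how the results below were produced.

```python
import math

def mae(true_ratings, predicted_ratings):
    """Mean absolute error: the average of |r - r_hat|."""
    errors = [abs(r - p) for r, p in zip(true_ratings, predicted_ratings)]
    return sum(errors) / len(errors)

def rmse(true_ratings, predicted_ratings):
    """Root-mean-square error: the square root of the average of (r - r_hat)^2."""
    squared = [(r - p) ** 2 for r, p in zip(true_ratings, predicted_ratings)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical true and predicted ratings for three (user, movie) pairs.
true_r = [4.0, 3.0, 5.0]
pred_r = [3.5, 3.0, 4.0]

print("MAE :", mae(true_r, pred_r))
print("RMSE:", rmse(true_r, pred_r))
```

Because RMSE squares each error before averaging, a single large miss raises it more than it raises MAE, which is why RMSE is never smaller than MAE on the same predictions.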

c. Compute the average MAE and RMSE of Probabilistic Matrix Factorization (PMF), User based Collaborative Filtering, and Item based Collaborative Filtering
under 5-fold cross-validation.

In [4]:
# Probabilistic Matrix Factorization (PMF) 
# SVD without bias 

print('PMF:')
algo = SVD(biased=False)
cross_validate(algo, data, measures = ['RMSE', 'MAE'], cv= 5, verbose=True)

PMF:
Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0125  1.0152  1.0143  1.0105  1.0158  1.0137  0.0019  
MAE (testset)     0.7754  0.7833  0.7825  0.7785  0.7843  0.7808  0.0034  
Fit time          5.50    5.71    5.42    5.81    5.70    5.63    0.14    
Test time         0.21    0.17    0.10    0.15    0.09    0.15    0.04    


{'test_rmse': array([1.01251381, 1.01515485, 1.01426878, 1.01050876, 1.01580767]),
 'test_mae': array([0.77537658, 0.78326117, 0.78253784, 0.7784981 , 0.78431705]),
 'fit_time': (5.499775409698486,
  5.713310718536377,
  5.422422885894775,
  5.812021017074585,
  5.696417331695557),
 'test_time': (0.21146202087402344,
  0.17309784889221191,
  0.09870648384094238,
  0.15244698524475098,
  0.09373164176940918)}
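
For reference, the reason `SVD(biased=False)` stands in for PMF here: the Surprise documentation describes the unbiased variant as predicting a rating from the latent factors alone and notes that this is equivalent to Probabilistic Matrix Factorization. In that documentation's notation (user factors $p_u$, item factors $q_i$, global mean $\mu$, user and item biases $b_u$ and $b_i$), the two prediction rules are:

```latex
\hat{r}_{ui} = q_i^{\top} p_u
\qquad \text{(\texttt{SVD(biased=False)}, used here as PMF)}

\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u
\qquad \text{(default biased \texttt{SVD})}
```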

In [5]:
# Item-Based CF

print('Item-Based CF:\n')
algo = KNNBasic(sim_options = {'user_based': False})
cross_validate(algo, data, measures = ['RMSE', 'MAE'], cv= 5, verbose=True)

Item-Based CF:

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9337  0.9242  0.9371  0.9314  0.9463  0.9345  0.0072  
MAE (testset)     0.7213  0.7130  0.7228  0.7205  0.7268  0.7209  0.0045  
Fit time          4.80    4.77    5.73    4.84    4.50    4.93    0.42    
Test time         7.47    7.41    8.55    6.89    6.76    7.42    0.63    


{'test_rmse': array([0.933737  , 0.92416262, 0.9371251 , 0.93143   , 0.94626968]),
 'test_mae': array([0.72126604, 0.71303842, 0.72283436, 0.72053326, 0.72682809]),
 'fit_time': (4.795452117919922,
  4.770017862319946,
  5.731219053268433,
  4.844723463058472,
  4.502467393875122),
 'test_time': (7.466063737869263,
  7.4095234870910645,
  8.553166627883911,
  6.888579607009888,
  6.764756441116333)}

In [6]:
# User-based CF

print('User-Based CF:\n')
algo = KNNBasic(sim_options = {'user_based': True})
cross_validate(algo, data, measures = ['RMSE', 'MAE'], cv= 5, verbose=True)

User-Based CF:

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9668  0.9807  0.9677  0.9707  0.9623  0.9696  0.0061  
MAE (testset)     0.7433  0.7506  0.7444  0.7445  0.7417  0.7449  0.0030  
Fit time          0.26    0.27    0.26    0.26    0.30    0.27    0.01    
Test time         1.52    1.69    1.64    1.64    1.60    1.62    0.06    


{'test_rmse': array([0.96679951, 0.98068154, 0.96768541, 0.97074042, 0.96232233]),
 'test_mae': array([0.74331456, 0.75060037, 0.74436611, 0.74447527, 0.74173126]),
 'fit_time': (0.26047658920288086,
  0.266848087310791,
  0.2587890625,
  0.2550492286682129,
  0.29558658599853516),
 'test_time': (1.520571231842041,
  1.6924231052398682,
  1.639120101928711,
  1.6413354873657227,
  1.5968151092529297)}

In [7]:
#SVD - just curiosity 
'''print('SVD:')
algo = SVD()
cross_validate(algo, data, measures = ['RMSE', 'MAE'], cv= 5, verbose=True)'''

"print('SVD:')\nalgo = SVD()\ncross_validate(algo, data, measures = ['RMSE', 'MAE'], cv= 5, verbose=True)"

d. Compare the average (mean) performances of User-based collaborative filtering, item-based collaborative filtering, and PMF with respect to RMSE and MAE. Which
ML model is the best on the movie rating data?

A: Comparing the three models' average performances, the best model is Item-based CF, which has the lowest mean RMSE (0.9345) and MAE (0.7209). User-based CF comes second (RMSE 0.9696, MAE 0.7449), and PMF is last (RMSE 1.0137, MAE 0.7808).
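
The comparison can be reproduced with a small sketch. The numbers are copied from the "Mean" column of the three cross-validation tables above (cells In [4], In [5], and In [6]); the dictionary layout and variable names are illustrative choices.

```python
# Mean test-set errors copied from the "Mean" column of the
# cross-validation tables printed by cells In [4], In [5], and In [6].
mean_scores = {
    "PMF": {"rmse": 1.0137, "mae": 0.7808},
    "Item-based CF": {"rmse": 0.9345, "mae": 0.7209},
    "User-based CF": {"rmse": 0.9696, "mae": 0.7449},
}

# Lower is better for both metrics.
best_by_rmse = min(mean_scores, key=lambda name: mean_scores[name]["rmse"])
best_by_mae = min(mean_scores, key=lambda name: mean_scores[name]["mae"])

# Print the models from best to worst mean RMSE.
for name, s in sorted(mean_scores.items(), key=lambda kv: kv[1]["rmse"]):
    print(f"{name:<14} RMSE={s['rmse']:.4f}  MAE={s['mae']:.4f}")
print("Lowest RMSE:", best_by_rmse)
print("Lowest MAE :", best_by_mae)
```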

e. Examine how the cosine, MSD (Mean Squared Difference), and Pearson similarities impact the performances of User based Collaborative Filtering and Item based Collaborative Filtering. 
Plot your results. Is the impact of the three metrics on User based Collaborative Filtering consistent with the impact of the three metrics on Item based Collaborative Filtering?
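
To make the three similarity measures concrete, here is a simplified pure-Python sketch that compares two users over the movies both have rated. The function names, the dictionary representation (`movie id -> rating`), and the sample ratings are assumptions made for illustration. Surprise's own implementations work on the whole rating matrix and handle details such as `min_support` that are left out here.

```python
import math

def _common(u, v):
    """Movies rated by both users; u and v map movie id -> rating."""
    return [i for i in u if i in v]

def msd_similarity(u, v):
    """1 / (MSD + 1), where MSD is the mean squared rating difference."""
    common = _common(u, v)
    if not common:
        return 0.0
    msd = sum((u[i] - v[i]) ** 2 for i in common) / len(common)
    return 1.0 / (msd + 1.0)

def cosine_similarity(u, v):
    """Cosine of the angle between the two co-rated rating vectors."""
    common = _common(u, v)
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def pearson_similarity(u, v):
    """Cosine similarity after subtracting each user's mean co-rated rating."""
    common = _common(u, v)
    if not common:
        return 0.0
    mean_u = sum(u[i] for i in common) / len(common)
    mean_v = sum(v[i] for i in common) / len(common)
    num = sum((u[i] - mean_u) * (v[i] - mean_v) for i in common)
    den = math.sqrt(sum((u[i] - mean_u) ** 2 for i in common)) * \
          math.sqrt(sum((v[i] - mean_v) ** 2 for i in common))
    return num / den if den else 0.0

# Hypothetical users; only m1 and m2 are rated by both.
alice = {"m1": 4.0, "m2": 2.0, "m3": 5.0}
bob = {"m1": 5.0, "m2": 1.0}

print("MSD    :", msd_similarity(alice, bob))
print("cosine :", cosine_similarity(alice, bob))
print("pearson:", pearson_similarity(alice, bob))
```

The same functions apply to item-based CF if `u` and `v` instead map user ids to the ratings two movies received.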

In [8]:
# User-Based CF:

# 1. Cosine 

print('User-Based CF - Cosine :\n')
algo = KNNBasic(sim_options = {'name':'cosine','user_based': True})
cross_validate(algo, data, measures = ['RMSE', 'MAE'], cv= 5, verbose=True)

#2. MSD (It is the default)

print('User-Based CF - MSD :\n')
algo = KNNBasic(sim_options = {'name':'MSD','user_based': True})
cross_validate(algo, data, measures = ['RMSE', 'MAE'], cv= 5, verbose=True)

#3. Pearson

print('User-Based CF - Pearson:\n')
algo = KNNBasic(sim_options = {'name':'pearson','user_based': True})
cross_validate(algo, data, measures = ['RMSE', 'MAE'], cv= 5, verbose=True)

User-Based CF - Cosine :

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9965  0.9952  0.9885  1.0000  0.9939  0.9948  0.0037  
MAE (testset)     0.7690  0.7653  0.7627  0.7743  0.7690  0.7681  0.0039  
Fit time          0.82    0.97    0.73    0.88    0.71    0.82    0.10    
Test time         1.65    1.71    1.64    1.61    1.60    1.64    0.04    
User-Based CF - MSD :

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity

{'test_rmse': array([0.99911756, 0.99506238, 0.99793914, 1.00801538, 0.99931748]),
 'test_mae': array([0.77235711, 0.76893715, 0.77251625, 0.77891504, 0.77410474]),
 'fit_time': (1.002840518951416,
  0.9247431755065918,
  1.0507705211639404,
  0.9064345359802246,
  1.0197665691375732),
 'test_time': (1.4597039222717285,
  1.4489178657531738,
  1.4543347358703613,
  1.5223376750946045,
  1.4887654781341553)}

In [9]:
# Item-Based CF:

# 1. Cosine 

print('Item-Based CF - Cosine :\n')
algo = KNNBasic(sim_options = {'name':'cosine','user_based': False})
cross_validate(algo, data, measures = ['RMSE', 'MAE'], cv= 5, verbose=True)

#2. MSD (It is the default)

print('Item-Based CF - MSD :\n')
algo = KNNBasic(sim_options = {'name':'MSD','user_based': False})
cross_validate(algo, data, measures = ['RMSE', 'MAE'], cv= 5, verbose=True)

#3. Pearson

print('Item-Based CF - Pearson:\n')
algo = KNNBasic(sim_options = {'name':'pearson','user_based': False})
cross_validate(algo, data, measures = ['RMSE', 'MAE'], cv= 5, verbose=True)

Item-Based CF - Cosine :

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9960  0.9968  1.0043  0.9888  0.9894  0.9951  0.0057  
MAE (testset)     0.7754  0.7752  0.7825  0.7712  0.7699  0.7749  0.0044  
Fit time          15.93   17.99   17.64   18.35   16.39   17.26   0.94    
Test time         6.86    7.47    7.98    8.93    6.62    7.57    0.83    
Item-Based CF - MSD :

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity

{'test_rmse': array([0.98852416, 0.98922914, 0.9795615 , 0.98853671, 1.00062096]),
 'test_mae': array([0.7637972 , 0.76986762, 0.76025872, 0.76824297, 0.77740646]),
 'fit_time': (24.695643186569214,
  24.19694447517395,
  27.181466579437256,
  22.050141096115112,
  22.313855409622192),
 'test_time': (7.210363149642944,
  7.850782155990601,
  7.098848819732666,
  6.43154764175415,
  6.4165425300598145)}
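
One way to turn the logged numbers into the requested comparison is sketched below, with these sources and assumptions:

- The cosine means are read from the tables above.
- The MSD means come from cells In [5] and In [6], because the MSD output here is truncated. They are therefore from different random splits.
- The Pearson means are averaged from the result dicts printed at the end of In [8] and In [9], assuming each dict is the return value of the last (Pearson) call in its cell.
- The helper `plot_rmse_by_similarity` is a hypothetical name and needs matplotlib.

```python
from statistics import mean

# Per-fold test RMSE copied from the dicts printed at the end of In [8] and
# In [9]; assumed to belong to the Pearson runs (the last call in each cell).
pearson_user_folds = [0.99911756, 0.99506238, 0.99793914, 1.00801538, 0.99931748]
pearson_item_folds = [0.98852416, 0.98922914, 0.9795615, 0.98853671, 1.00062096]

# Mean RMSE per similarity measure. Cosine: tables in In [8] / In [9].
# MSD: tables in In [6] / In [5] (separate cross-validation runs).
mean_rmse = {
    "User-based CF": {"cosine": 0.9948, "MSD": 0.9696,
                      "pearson": mean(pearson_user_folds)},
    "Item-based CF": {"cosine": 0.9951, "MSD": 0.9345,
                      "pearson": mean(pearson_item_folds)},
}

# Rank the similarity measures from lowest to highest mean RMSE.
rankings = {model: sorted(by_sim, key=by_sim.get)
            for model, by_sim in mean_rmse.items()}
for model, order in rankings.items():
    print(model, "->", order)

def plot_rmse_by_similarity(scores):
    """Grouped bar chart of mean RMSE per similarity measure and model."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    sims = ["cosine", "MSD", "pearson"]
    width = 0.35
    fig, ax = plt.subplots()
    for offset, (model, by_sim) in enumerate(scores.items()):
        xs = [i + offset * width for i in range(len(sims))]
        ax.bar(xs, [by_sim[s] for s in sims], width=width, label=model)
    ax.set_xticks([i + width / 2 for i in range(len(sims))])
    ax.set_xticklabels(sims)
    ax.set_ylabel("Mean RMSE (5-fold CV)")
    ax.legend()
    plt.show()

# plot_rmse_by_similarity(mean_rmse)  # uncomment to draw the chart
```

Comparing the two printed rankings shows whether the three measures order themselves the same way for user-based and item-based CF.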

f. Examine how the number of neighbors impacts the performances of User based Collaborative Filtering and Item based Collaborative Filtering. Plot your results.

In [10]:
# User-Based CF

print('User-Based CF:\n')
for i in range(1,40):
    algo = KNNBasic(k=i, sim_options = {'user_based': True})
    print('KNN= ', i)
    error = cross_validate(algo, data, measures = ['RMSE', 'MAE'], verbose = True)
    

User-Based CF:

KNN=  1
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.2156  1.2070  1.2194  1.2182  1.2159  1.2152  0.0044  
MAE (testset)     0.9048  0.8981  0.9081  0.9043  0.9074  0.9045  0.0035  
Fit time          0.32    0.27    0.27    0.25    0.32    0.29    0.03    
Test time         0.86    0.89    0.78    0.85    0.79    0.83    0.04    
KNN=  2
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd simil

In [11]:
# Item-Based CF

print('Item-Based CF:\n')
for i in range(1,40):
    algo = KNNBasic(k=i, sim_options = {'user_based': False})
    print('KNN= ', i)
    error = cross_validate(algo, data, measures = ['RMSE', 'MAE'], verbose = True)


Item-Based CF:

KNN=  1
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.3204  1.3094  1.3207  1.3149  1.3046  1.3140  0.0063  
MAE (testset)     0.9742  0.9700  0.9771  0.9694  0.9664  0.9714  0.0038  
Fit time          4.59    4.77    4.50    4.58    4.51    4.59    0.10    
Test time         4.92    5.08    5.15    5.22    5.17    5.11    0.10    
KNN=  2
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd simil
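
The loops above print each table but do not keep the per-K means, which are needed for plotting. A minimal sketch of one way to collect them follows. The helper names `sweep_k`, `best_k`, and `plot_rmse_vs_k` are hypothetical. `sweep_k` assumes `data` is the Surprise dataset built in In [3], and the plot needs matplotlib. The demo dictionaries contain only values visible in this notebook: the K = 1 means from the tables above and the best-K values reported in part g.

```python
def sweep_k(data, user_based, ks=range(1, 40)):
    """Return {k: mean 5-fold test RMSE} for KNNBasic with MSD similarity."""
    # Imported lazily so the other helpers work without Surprise installed.
    from surprise import KNNBasic
    from surprise.model_selection import cross_validate

    rmse_by_k = {}
    for k in ks:
        algo = KNNBasic(k=k, sim_options={"user_based": user_based}, verbose=False)
        results = cross_validate(algo, data, measures=["RMSE"], cv=5, verbose=False)
        rmse_by_k[k] = float(results["test_rmse"].mean())
    return rmse_by_k

def best_k(rmse_by_k):
    """K with the lowest mean RMSE."""
    return min(rmse_by_k, key=rmse_by_k.get)

def plot_rmse_vs_k(curves):
    """Line plot of mean RMSE against K; curves maps label -> {k: rmse}."""
    import matplotlib.pyplot as plt

    for label, rmse_by_k in curves.items():
        ks = sorted(rmse_by_k)
        plt.plot(ks, [rmse_by_k[k] for k in ks], marker="o", label=label)
    plt.xlabel("Number of neighbors K")
    plt.ylabel("Mean RMSE (5-fold CV)")
    plt.legend()
    plt.show()

# Partial curves: only the points that are visible in this notebook.
user_points = {1: 1.2152, 17: 0.9615}
item_points = {1: 1.3140, 37: 0.9349}
print("User-based best K:", best_k(user_points))
print("Item-based best K:", best_k(item_points))

# Full sweep and plot (slow; requires Surprise, matplotlib, and `data`):
# user_curve = sweep_k(data, user_based=True)
# item_curve = sweep_k(data, user_based=False)
# plot_rmse_vs_k({"User-based CF": user_curve, "Item-based CF": item_curve})
```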

g. Identify the best number of neighbors (denoted by K) for User/Item based collaborative filtering in terms of RMSE. Is the best K of User based collaborative
filtering the same as the best K of Item based collaborative filtering?

A:

User-Based best K = 17 with RMSE 0.9615
Item-Based best K = 37 with RMSE 0.9349
No, the best K is not the same for User-based and Item-based collaborative filtering. However, the graphs for both models have the same shape.
