# Collaborative Filtering [60 points]
<br>
• Read the attached paper on Empirical Analysis of Predictive Algorithms for Collaborative
Filtering. You need to read up to Section 2.1, and are encouraged to read further if you have
time.
<br>
• The dataset we will be using is a subset of the movie ratings data from the Netflix Prize.
You need to download it via Elearning. It contains a training set, a test set, a movies file,
a dataset description file, and a README file. The training and test sets are both subsets
of the Netflix training data. You will use the ratings provided in the training set to predict
those in the test set. You will compare your predictions with the actual ratings provided in
the test set. The evaluation metrics you need to use are the Mean Absolute Error and the
Root Mean Squared Error. The dataset description file further describes the dataset, and
will help you get started. The README file is from the original set of Netflix files, and has
been included to comply with the terms of use for this data.
<br>
• Implement (use Python3 and numpy; the latter is a must for this part) the collaborative
filtering algorithm described in Section 2.1 of the paper (Equations 1 and 2; ignore Section
2.1.2) for making the predictions.
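
For reference, this is how I read the two equations from Section 2.1 of the paper. The notation is meant to follow the paper: $v_{i,j}$ is the vote of user $i$ on item $j$, $\bar{v}_i$ is the mean vote of user $i$ over the items that user has voted on, $n$ is the number of users with non-zero weights, and $\kappa$ is a normalizing factor chosen so that the absolute values of the weights sum to one. In the correlation, the sums over $j$ run over the items both users $a$ and $i$ have voted on.

```latex
% Equation 1: predicted vote of the active user a on item j
p_{a,j} = \bar{v}_a + \kappa \sum_{i=1}^{n} w(a,i)\,\bigl(v_{i,j} - \bar{v}_i\bigr)

% Equation 2: Pearson correlation between users a and i
w(a,i) = \frac{\sum_{j} (v_{a,j} - \bar{v}_a)(v_{i,j} - \bar{v}_i)}
              {\sqrt{\sum_{j} (v_{a,j} - \bar{v}_a)^2 \; \sum_{j} (v_{i,j} - \bar{v}_i)^2}}
```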


In [6]:
import pandas as pd
import os
from sklearn.neighbors import NearestNeighbors
import numpy as np
import math

In [9]:
path=r"/content/drive/MyDrive/netflix" # @TA/Grader : Change path to the location with the unzipped datasets.
os.chdir(path)
columns = ['Movie','User','Rating']
# The ratings files have no header row, so the column names are supplied explicitly
test_df = pd.read_csv(os.path.join(path,'TestingRatings.txt'),header=None,names=columns)
train_df = pd.read_csv(os.path.join(path,'TrainingRatings.txt'),header=None,names=columns)

In [10]:
train_df

,Movie,User,Rating
0,8,1744889,1.0
1,8,1395430,2.0
2,8,1205593,4.0
3,8,1488844,4.0
4,8,1447354,1.0
...,...,...,...
3255347,17742,46222,3.0
3255348,17742,2534701,1.0
3255349,17742,208724,3.0
3255350,17742,483107,2.0


In [11]:
test_df


,Movie,User,Rating
0,8,573364,1.0
1,8,2149668,3.0
2,8,1089184,3.0
3,8,2465894,3.0
4,8,534508,1.0
...,...,...,...
100473,17742,1898310,2.0
100474,17742,716096,4.0
100475,17742,38115,3.0
100476,17742,2646347,5.0


In [13]:
test_data = test_df.pivot(index = 'User',columns='Movie')['Rating']
test_data = test_data.fillna(0)
test_data

Movie,8,28,43,48,61,64,66,92,96,111,122,123,127,145,154,156,174,185,192,207,214,218,222,229,237,259,267,276,287,305,318,323,336,359,361,380,395,398,409,417,...,17334,17337,17338,17344,17348,17358,17394,17411,17423,17447,17454,17466,17515,17522,17523,17534,17536,17551,17554,17556,17558,17561,17574,17616,17624,17626,17635,17640,17642,17650,17653,17654,17689,17693,17706,17725,17728,17734,17741,17742
User
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
79,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
199,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
481,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
769,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2648869,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2648885,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2649120,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2649267,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [14]:
train_data = train_df.pivot(index = 'User',columns='Movie')['Rating']
train_data = train_data.fillna(0)
train_data

Movie,8,28,43,48,61,64,66,92,96,111,122,123,127,140,145,154,156,174,185,192,207,214,218,222,229,237,259,267,276,287,305,318,323,336,359,361,380,395,398,409,...,17337,17338,17344,17348,17358,17394,17411,17423,17447,17454,17466,17515,17522,17523,17534,17536,17551,17554,17556,17558,17561,17574,17616,17624,17626,17635,17640,17642,17650,17653,17654,17660,17689,17693,17706,17725,17728,17734,17741,17742
User
7,5.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
79,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
199,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
481,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
769,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,...,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2648869,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0
2648885,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2649120,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2649267,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
train_data.describe()

Movie,8,28,43,48,61,64,66,92,96,111,122,123,127,140,145,154,156,174,185,192,207,214,218,222,229,237,259,267,276,287,305,318,323,336,359,361,380,395,398,409,...,17337,17338,17344,17348,17358,17394,17411,17423,17447,17454,17466,17515,17522,17523,17534,17536,17551,17554,17556,17558,17561,17574,17616,17624,17626,17635,17640,17642,17650,17653,17654,17660,17689,17693,17706,17725,17728,17734,17741,17742
count,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,...,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0,28978.0
mean,0.298468,1.588757,0.004624,0.208158,0.004693,0.00383,0.004797,0.007143,0.015184,0.901926,0.143143,0.010111,0.161433,0.003382,0.014563,0.005521,0.393264,0.011906,0.166092,0.010353,0.005211,0.006212,0.003347,0.003347,0.02426,0.020912,0.023846,0.068017,0.004072,0.006798,0.764407,0.003658,0.037408,0.010525,0.272862,0.842881,0.032024,0.132721,0.089033,0.029505,...,0.003969,0.316412,0.013493,0.059908,0.769204,0.010146,0.43885,0.233626,0.026123,0.025709,0.04003,0.0137,0.059769,0.010732,0.003589,0.21047,0.00635,0.053247,0.028642,0.927221,0.106288,0.806232,0.005936,0.251501,0.007626,0.021775,0.009732,0.120574,0.081993,0.189937,0.278763,0.002554,0.002209,0.250949,0.0137,0.23718,0.006315,0.004141,0.112844,0.023501
std,0.991727,1.963537,0.116514,0.882948,0.120301,0.107774,0.110581,0.168991,0.230931,1.486562,0.668878,0.175258,0.684246,0.096826,0.223839,0.135384,1.134171,0.201531,0.733908,0.192062,0.127386,0.143522,0.10779,0.097361,0.307724,0.281691,0.288781,0.523952,0.099611,0.154943,1.517229,0.102361,0.332181,0.187598,1.058647,1.617233,0.322419,0.728005,0.532231,0.338068,...,0.121326,0.989781,0.205502,0.491428,1.456794,0.176922,1.239514,0.913832,0.298691,0.298612,0.332187,0.239603,0.467108,0.20381,0.104035,0.91976,0.121367,0.424629,0.294444,1.582263,0.594711,1.535187,0.144492,0.936487,0.172206,0.27209,0.181755,0.68671,0.526064,0.794401,0.933665,0.087885,0.075656,0.868658,0.227183,0.9187,0.142188,0.117418,0.622673,0.267855
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0


# Evaluating KNN using Facebook AI Similarity Search


I tried FAISS (Facebook AI Similarity Search) for the nearest-neighbour search over our dense rating vectors in order to reduce the runtime. However, the neighbours I got from it were only good approximations (good guesses) rather than the exact neighbours, and the RMSE suffered. I ultimately went back to scikit-learn's exact kNN with a tree-based index (kd_tree; the `train_knn` helper further down uses ball_tree, which is also exact) and achieved an RMSE of ~1.173129858633675.
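
One way to check how far an approximate search drifts from the exact answer is to compare it against a brute-force search on a small sample. The sketch below is only an illustration in plain numpy. The helpers `exact_knn` and `recall_at_k` and the toy matrix are hypothetical names and data, not part of the experiment above.

```python
import numpy as np

def exact_knn(X, queries, k):
    """Indices of the k nearest rows of X (Euclidean distance) for each query row."""
    # Squared distances via ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2
    d2 = ((queries ** 2).sum(axis=1)[:, None]
          - 2.0 * queries @ X.T
          + (X ** 2).sum(axis=1)[None, :])
    # Stable sort so that ties are broken by row order
    return np.argsort(d2, axis=1, kind="stable")[:, :k]

def recall_at_k(approx_idx, exact_idx):
    """Fraction of the exact neighbours that the other search also returned."""
    hits = [len(set(a) & set(e)) for a, e in zip(approx_idx, exact_idx)]
    return sum(hits) / exact_idx.size

rng = np.random.default_rng(0)
X = rng.integers(0, 6, size=(200, 30)).astype(np.float64)  # toy ratings matrix (hypothetical data)
exact = exact_knn(X, X[:10], k=5)   # each query is a row of X, so its nearest row is itself
```

Passing the index array returned by any other search as `approx_idx` gives the share of true neighbours it recovered. A value well below 1.0 on a sample would be consistent with the RMSE drop described above.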


In [1]:

# References: 
#     https://github.com/facebookresearch/faiss/wiki/Getting-started
#     https://davidefiocco.github.io/nearest-neighbor-search-with-faiss/
#     https://towardsdatascience.com/make-knn-300-times-faster-than-scikit-learns-in-20-lines-5e29d74e76bb
import numpy as np
import faiss
class FaissKNeighbors:
    def __init__(self, k=15):
        self.index = None
        self.y = None
        self.k = k

    def fit(self, X, y):
        # Build an L2 index over vectors of dimension X.shape[1]; FAISS expects float32 input
        self.index = faiss.IndexFlatL2(X.shape[1])
        self.index.add(X.astype(np.float32))
        self.y = y

    def predict(self, X):
        # indices[r] holds the positions of the k nearest training rows for query row r
        distances, indices = self.index.search(X.astype(np.float32), k=self.k)
        votes = self.y[indices]
        # Majority vote among the neighbours' labels (np.bincount needs non-negative integers)
        predictions = np.array([np.argmax(np.bincount(x)) for x in votes])
        return predictions

# Storing K neighbours:

In [17]:
test_data.shape[0]

27555

In [18]:
knn_neighbors=np.zeros((test_data.shape[0],15),dtype=int)
knn_neighbors

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [19]:
knn_neighbors.shape

(27555, 15)

In [4]:
#Store knn
def store_knn(knn_neighbors,ind,indices):
    knn_neighbors[ind]=indices
    
#Find K-nearest-neighbors of a record
def train_knn(mat,k):
  nbrs = NearestNeighbors(n_neighbors=k+1, algorithm='ball_tree').fit(mat)
  return nbrs

#@TA/Grader:
# Even with memoization, computing the kNN of ~28,000 users as part of the
# ~100,000 predictions took me around 1 hour 40 mins.
# So I have precomputed the 15 nearest neighbours of every user in the train set
# and saved them in a .csv file.
#
#     def find_knn(mat,a,k,knn_neighbors,nbrs,pre_trainned_knn,pre_trained=True):
#
# pre_trained=False --> the neighbours are computed at runtime (longer run time).
# pre_trained=True  --> the neighbours are fetched from the stored values (shorter run time;
#                       not recommended in a production environment, where a user's neighbours change dynamically).


def find_knn(mat,a,k,knn_neighbors,nbrs,pre_trainned_knn,pre_trained=True):
    if pre_trained:
        return pre_trainned_knn[a]
    if not knn_neighbors[a].any():
        #An all-zero row means the neighbours have not been computed for this user yet
        distances, indices = nbrs.kneighbors(mat[a].reshape(1,-1))
        #Drop the first hit: the index was fitted with k+1 neighbours and the nearest row is normally the user itself
        store_knn(knn_neighbors,a,indices[0][1:])
        return indices[0][1:]
    #Already memoized
    return knn_neighbors[a]

#Find the average rating given by a user over the movies they rated
def calculate_average_rating(I):                 #I is the user's row of ratings (0 = not rated)
    rated=np.count_nonzero(I)
    if rated==0:                                 #no ratings at all: avoid dividing by zero
        return 0.0
    return np.sum(I)/rated

#Locate the rating for a movie 'j' given by a user 'i'
def find_rating(mat,i,j):     #i-->User  j-->Movie     #take mat as np array
    return mat[i][j]

#Pearson coefficient (mean-centred cosine similarity) between users a and i
def evaluate_pearson_coeff(mat,a,i,j_movies):        #a--> row index of the active user
                                                     #i--> row index of one of the k nearest neighbours
    v_a_avg=calculate_average_rating(mat[a])
    v_i_avg=calculate_average_rating(mat[i])
    nmr=0
    den_A=0
    den_B=0
    for j in range(0,j_movies):        #j_movies--> number of movie columns (1821)
        v_a_j=find_rating(mat,a,j)
        v_i_j=find_rating(mat,i,j)
        if  (v_a_j!=0 and v_i_j!=0):   #only movies rated by both users contribute
            A=v_a_j-v_a_avg
            B=v_i_j-v_i_avg
            nmr+=A*B
            den_A+=A*A
            den_B+=B*B
    den=math.sqrt(den_A*den_B)
    if den==0:                         #no co-rated movies, or all deviations are zero: treat as uncorrelated
        return 0.0
    return nmr/den
    
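#Evaluate kappa, the normalising factor 1 / (sum of |w(a,i)| over the neighbours)
def kappa(knn_indices,a,j_movies):
    mat=np.asarray(train_data)      #convert once instead of once per neighbour
    total=0
    for ind in knn_indices:
        total+=abs(evaluate_pearson_coeff(mat,a,ind,j_movies))
    if total==0:
        return 0
    return 1/total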

#Predict the rating of user a for movie m
def pred_vote(mat,m,a,j_movies,nbrs,pre_trainned_knn,neighbours=15):
    v_a_avg=calculate_average_rating(mat[a])
    #knn_neighbors is the module-level memo table created earlier
    knn_indices=find_knn(mat,a,neighbours,knn_neighbors,nbrs,pre_trainned_knn)
    kapp_coeff=kappa(knn_indices,a,j_movies)

    total=0
    for ind in knn_indices:
        v_i_m=find_rating(mat,ind,m)
        if v_i_m!=0:                   #only neighbours who rated movie m contribute
            total+=evaluate_pearson_coeff(mat,a,ind,j_movies)*(v_i_m-calculate_average_rating(mat[ind]))
    return v_a_avg+(kapp_coeff*total)

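def ceiling_floor_prediction(val):
    #Round half up. val-math.floor(val) is the fractional part; val%int(val) divides by zero when val<1
    if val-math.floor(val)>=.5:
        return math.ceil(val)
    else:
        return math.floor(val)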
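
The per-movie Python loops above are easy to follow but slow. As a rough sketch, and not the code that produced the results in this notebook, the same weight and prediction can be computed with numpy array operations. The names `user_means`, `pearson_weights` and `predict` are hypothetical. The sketch assumes a float matrix with 0 meaning "not rated" and, like the functions above, normalises by the absolute weights of all the neighbours.

```python
import numpy as np

def user_means(mat):
    """Mean rating per user over rated (non-zero) movies; 0 for users with no ratings."""
    counts = np.count_nonzero(mat, axis=1)
    return np.divide(mat.sum(axis=1), counts,
                     out=np.zeros(mat.shape[0]), where=counts > 0)

def pearson_weights(mat, means, a, neighbours):
    """Correlation between user a and each neighbour over the movies both have rated."""
    both = (mat[a] != 0)[None, :] & (mat[neighbours] != 0)   # co-rated mask, shape (k, movies)
    dev_a = np.where(both, mat[a] - means[a], 0.0)
    dev_i = np.where(both, mat[neighbours] - means[neighbours][:, None], 0.0)
    num = (dev_a * dev_i).sum(axis=1)
    den = np.sqrt((dev_a ** 2).sum(axis=1) * (dev_i ** 2).sum(axis=1))
    # A zero denominator means no usable overlap: weight 0
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

def predict(mat, means, a, m, neighbours):
    """Mean of user a plus the normalised, weighted deviations of the neighbours on movie m."""
    w = pearson_weights(mat, means, a, neighbours)
    total = np.abs(w).sum()
    kappa = 1.0 / total if total > 0 else 0.0
    rated_m = mat[neighbours, m] != 0                        # neighbours who rated movie m
    dev = np.where(rated_m, mat[neighbours, m] - means[neighbours], 0.0)
    return means[a] + kappa * (w * dev).sum()

# Toy example (hypothetical data): 3 users, 3 movies, every movie rated
toy = np.array([[1.0, 2.0, 3.0],
                [2.0, 4.0, 6.0],
                [3.0, 2.0, 1.0]])
toy_means = user_means(toy)
toy_weights = pearson_weights(toy, toy_means, 0, np.array([1, 2]))
toy_pred = predict(toy, toy_means, 0, 2, np.array([1, 2]))
```

In the toy matrix user 1 is perfectly correlated with user 0 and user 2 is perfectly anti-correlated, so the weights have equal magnitude and opposite sign.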


In [29]:
test_data_np=np.asarray(test_data)
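
The prediction loop addresses the training matrix with row and column positions taken from the test matrix. The two pivots above do not have the same shape (27555 versus 28978 users, and movie 140 appears only among the training columns), so a position is only comparable when both frames share the same index and columns. A minimal sketch of aligning them with `DataFrame.reindex`, on toy data; it assumes every test user and movie also occurs in the training set, which I have not checked here. The names `train_toy`, `test_toy` and `test_aligned` are illustrative.

```python
import pandas as pd

# Toy long-format ratings (hypothetical values, same columns as train_df / test_df)
train_toy = pd.DataFrame({"Movie": [8, 8, 28, 140],
                          "User": [7, 79, 7, 199],
                          "Rating": [5.0, 3.0, 4.0, 2.0]})
test_toy = pd.DataFrame({"Movie": [28, 8],
                         "User": [79, 199],
                         "Rating": [1.0, 4.0]})

train_mat = train_toy.pivot(index="User", columns="Movie", values="Rating").fillna(0)
test_mat = test_toy.pivot(index="User", columns="Movie", values="Rating").fillna(0)

# Give the test matrix exactly the rows and columns of the training matrix,
# so that position (i, j) means the same user and movie in both
test_aligned = test_mat.reindex(index=train_mat.index,
                                columns=train_mat.columns,
                                fill_value=0)
```

After this step `np.asarray(test_aligned)` and `np.asarray(train_mat)` can be indexed with the same `(i, j)` pair.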

In [31]:
import time

count=0
current_time1 = time.strftime("%H:%M:%S", time.localtime())
train_np=np.asarray(train_data)          #convert once instead of once per prediction
nbrs=train_knn(train_np,15)

#@TA/Grader: Kindly change the path to fetch the precomputed neighbours

pre_trainned_knn=np.asarray(pd.read_csv(r"C:\Users\ab1997\Desktop\CS6375 HW2\results\knn_historical.csv"))
print(pre_trainned_knn)
with open("prediction_file.txt", "a+") as wf:      #open the output file once for the whole run
    for i in range(0,test_data_np.shape[0]):
        for j in range(0,test_data_np.shape[1]):
            if (test_data_np[i][j]!=0):            #only cells with a rating in the test set
                pred=pred_vote(train_np,j,i,train_np.shape[1],nbrs,pre_trainned_knn,15)
                cf_pred=ceiling_floor_prediction(pred)
                print(f"{count} User:{i} Movie:{j} Actual:{test_data_np[i][j]} Pred:{pred} CFPred:{cf_pred}")
                wf.write(f"{test_data_np[i][j]},{pred},{cf_pred}\n")
                count+=1

current_time2 = time.strftime("%H:%M:%S", time.localtime())



Streaming output truncated to the last 5000 lines.
23545 User:6525 Movie:1125 Actual:3.0 Pred:3.75 CFPred:4
23546 User:6526 Movie:487 Actual:3.0 Pred:2.982142857142857 CFPred:3
23547 User:6526 Movie:546 Actual:4.0 Pred:2.982142857142857 CFPred:3
23548 User:6526 Movie:841 Actual:4.0 Pred:2.982142857142857 CFPred:3
23549 User:6526 Movie:1039 Actual:1.0 Pred:2.982142857142857 CFPred:3
23550 User:6526 Movie:1078 Actual:2.0 Pred:2.982142857142857 CFPred:3
23551 User:6526 Movie:1659 Actual:3.0 Pred:2.982142857142857 CFPred:3
23552 User:6527 Movie:814 Actual:5.0 Pred:3.415929203539823 CFPred:3
23553 User:6527 Movie:905 Actual:3.0 Pred:3.415929203539823 CFPred:3
23554 User:6527 Movie:1643 Actual:4.0 Pred:3.415929203539823 CFPred:3
23555 User:6528 Movie:433 Actual:4.0 Pred:3.4863013698630136 CFPred:3
23556 User:6528 Movie:464 Actual:4.0 Pred:3.4863013698630136 CFPred:3
23557 User:6529 Movie:402 Actual:3.0 Pred:4.413043478260869 CFPred:4
23558 User:6529 Movie:905 Actual:5.0 Pred:4.



Streaming output truncated to the last 5000 lines.
95478 User:26185 Movie:1403 Actual:1.0 Pred:3.272727272727273 CFPred:3
95479 User:26186 Movie:557 Actual:2.0 Pred:3.013157894736842 CFPred:3
95480 User:26187 Movie:130 Actual:3.0 Pred:2.9826589595375723 CFPred:3
95481 User:26187 Movie:952 Actual:5.0 Pred:2.9826589595375723 CFPred:3
95482 User:26188 Movie:139 Actual:4.0 Pred:2.9285714285714284 CFPred:3
95483 User:26188 Movie:464 Actual:5.0 Pred:2.9285714285714284 CFPred:3
95484 User:26188 Movie:596 Actual:3.0 Pred:2.9285714285714284 CFPred:3
95485 User:26188 Movie:656 Actual:5.0 Pred:2.9285714285714284 CFPred:3
95486 User:26188 Movie:785 Actual:4.0 Pred:2.9285714285714284 CFPred:3
95487 User:26189 Movie:464 Actual:5.0 Pred:2.7325581395348837 CFPred:3
95488 User:26189 Movie:1000 Actual:3.0 Pred:3.4815100243747112 CFPred:3
95489 User:26189 Movie:1023 Actual:3.0 Pred:2.7325581395348837 CFPred:3
95490 User:26190 Movie:284 Actual:4.0 Pred:3.9215686274509802 CFPred:4
95491 User:

In [32]:
print(f"Start time:{current_time1} \n End Time:{current_time2} Total {count}")

Start time:21:12:42 
 End Time:22:43:56 Total 100478
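
The run above took roughly an hour and a half for 100478 predictions. Part of the loop's work is scanning every cell of the test matrix just to find the rated ones. The sketch below shows, on a toy matrix, how `np.argwhere` could visit only the rated cells, in the same row-by-row order as the nested loops. The names `test_toy`, `rated_cells` and `visited` are illustrative and not from the notebook.

```python
import numpy as np

test_toy = np.array([[0.0, 3.0, 0.0],
                     [0.0, 0.0, 0.0],
                     [5.0, 0.0, 1.0]])   # toy matrix, 0 = not rated

# One (row, column) pair per rated cell, in row-major order
rated_cells = np.argwhere(test_toy != 0)

visited = [(int(i), int(j), float(test_toy[i, j])) for i, j in rated_cells]
```

Each `(i, j)` pair could then be passed to `pred_vote` in place of the two nested `range` loops.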


# Evaluating RMSE

Evaluating the RMSE and MAE for the regression predictions and for the ceil/floor-rounded regression predictions:

In [50]:
from sklearn.metrics import mean_squared_error,mean_absolute_error
    

In [34]:
predictions = pd.read_csv (r"prediction_file.txt",header=None)
predictions.columns = ['True','Prediction','Ceil_Floor']
predictions

,True,Prediction,Ceil_Floor
0,5.0,3.903846,4
1,5.0,3.903846,4
2,4.0,3.713710,4
3,3.0,3.630952,4
4,4.0,3.665038,4
...,...,...,...
100473,3.0,3.398439,3
100474,4.0,2.896945,3
100475,1.0,3.588832,4
100476,3.0,3.588832,4


In [48]:
#RMSE value for the actual regression predictions
math.sqrt(mean_squared_error(np.asarray(predictions["True"]),np.asarray(predictions["Prediction"])))

1.173129858633675

In [52]:
#Mean absolute error value for the actual regression predictions
mean_absolute_error(np.asarray(predictions["True"]),np.asarray(predictions["Prediction"]))

0.9487771711722169

In [49]:
#RMSE value for the adjusted (Ceil/Floor) regression predictions
math.sqrt(mean_squared_error(np.asarray(predictions["True"]),np.asarray(predictions["Ceil_Floor"])))

1.2196432299945366

In [51]:
#Mean absolute error value for the adjusted (Ceil/Floor) regression predictions
mean_absolute_error(np.asarray(predictions["True"]),np.asarray(predictions["Ceil_Floor"]))

0.9299747208344115
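
Rounding raises the RMSE (1.2196 against 1.1731) but lowers the MAE (0.9300 against 0.9488). Both metrics are simple enough to write directly in numpy, which makes the difference easier to see: MAE averages the absolute errors, while RMSE averages the squared errors before taking the root, so it penalises large misses more. A small sketch with toy values; `mae`, `rmse` and the two lists are hypothetical and are not the notebook's data.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: average of |actual - predicted|."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the average squared error."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

actual = [5.0, 5.0, 4.0, 3.0]        # toy ratings
predicted = [4.0, 4.0, 4.0, 4.0]     # toy predictions

toy_mae = mae(actual, predicted)     # errors are 1, 1, 0, 1
toy_rmse = rmse(actual, predicted)
```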