# User-Based-Collaborative-Filtering-Movie-Recommender
User-based collaborative filtering is a popular way to make recommendations because it is easy to understand and implement. It works by finding other users whose tastes are similar to those of the user we want to recommend for, and then recommending items that those similar users liked. The underlying assumption is that similar users will have similar preferences for items.
This recommender system follows the steps below.
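
Before the walkthrough, the user-based idea described above can be illustrated on a toy example. This is only a minimal sketch and is separate from the notebook's own code: the user names, movie names and ratings are hypothetical, 0 is assumed to mean "not rated", and cosine similarity is just one possible similarity measure.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: rows are users, columns are movies, 0 means "not rated".
toy_ratings = pd.DataFrame(
    {
        "Movie A": [5.0, 4.0, 1.0],
        "Movie B": [4.0, 5.0, 1.0],
        "Movie C": [0.0, 5.0, 2.0],
        "Movie D": [0.0, 1.0, 5.0],
    },
    index=["alice", "bob", "carol"],
)

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

target = "alice"
target_row = toy_ratings.loc[target]

# Similarity of every other user to the target user.
similarities = pd.Series(
    {
        other: cosine(target_row, toy_ratings.loc[other])
        for other in toy_ratings.index
        if other != target
    }
)

# Movies the target user has not rated yet.
unseen = target_row[target_row == 0].index

# Score each unseen movie by a similarity-weighted average of the other users' ratings.
scores = (
    toy_ratings.loc[similarities.index, unseen].mul(similarities, axis=0).sum()
    / similarities.sum()
)
print(scores.sort_values(ascending=False))
```

In this toy data, bob's ratings are closer to alice's than carol's are, so the movie bob rated highly ends up with the higher score for alice.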

# Importing Necessary Packages
The libraries used throughout the program are imported first.

In [94]:
#importing necessary packages for the program
import pandas as pd
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import mean_absolute_error, mean_squared_error
from math import sqrt

# Importing Dataset 
After importing the libraries, the dataset the system is built on is loaded. Two CSV files obtained from Kaggle are read and then merged into a single DataFrame so that the later steps can work with one table.

In [95]:
# reading the csv files available and merging them into one for easy processing of dataset
user_ratings = pd.read_csv('ratings_user-item.csv')
movies = pd.read_csv('tagsngenres.csv')
user_ratings = pd.merge(movies,user_ratings)
# print(user_ratings.shape)
user_ratings.head()

Unnamed: 0,movieId,title,genres,tags,userId,rating
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,pixar,1,4.0
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,pixar,5,4.0
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,pixar,7,4.5
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,pixar,15,2.5
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,pixar,17,4.5


# Pivoting Table
After importing the dataset, the merged table is pivoted into a matrix with one row per user and one column per movie title, where each cell holds that user's rating for that movie. Collaborative filtering relies on this layout because similarities and correlations between users or items can then be computed directly on the rows or columns. Movies with fewer than five ratings are dropped, and the remaining missing ratings are filled with 0.

In [96]:
# pivoting the dataset into a user x movie matrix and checking its shape before and after filtering
Ratings = user_ratings.pivot_table(index=['userId'],columns=['title'],values='rating')
print("Before: ",Ratings.shape)
# keep only movies with at least 5 ratings, then fill the remaining missing ratings with 0
Ratings = Ratings.dropna(thresh=5, axis=1).fillna(0,axis=1)
print("After: ",Ratings.shape)

Before:  (610, 1545)
After:  (610, 1119)
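
The pivot and filtering steps can be tried on a small table. The sketch below uses hypothetical long-format ratings and a threshold of 2 instead of the notebook's 5, purely so that the effect is visible on six rows.

```python
import pandas as pd

# Hypothetical long-format ratings, one row per (user, movie) pair.
long_ratings = pd.DataFrame(
    {
        "userId": [1, 1, 2, 2, 3, 3],
        "title": ["Movie A", "Movie B", "Movie A", "Movie C", "Movie A", "Movie B"],
        "rating": [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
    }
)

# One row per user, one column per movie; missing ratings become NaN.
wide = long_ratings.pivot_table(index="userId", columns="title", values="rating")
print(wide)

# Keep only movies rated by at least 2 users (the notebook uses thresh=5),
# then fill the remaining gaps with 0.
filtered = wide.dropna(thresh=2, axis=1).fillna(0)
print(filtered)
```

"Movie C" has a single rating, so it is dropped, and user 2's missing rating for "Movie B" becomes 0.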


# Correlation of Table
Next, the pairwise correlation between the columns of the ratings matrix is computed with the Pearson method. Because the columns of `Ratings` are movie titles, the result is a movie-by-movie matrix in which each entry measures how similarly two movies were rated across all users.
Pearson correlation is a memory-based collaborative filtering technique: it needs no model training and measures the similarity between pairs of items rated by the users, or between pairs of users who rate the same items.

In [97]:
# computing the pairwise correlation of columns in a DataFrame
corrMatrix = Ratings.corr(method='pearson')
corrMatrix.head(10)

title,(500) Days of Summer (2009),10 Cloverfield Lane (2016),10 Things I Hate About You (1999),101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),12 Angry Men (1957),127 Hours (2010),13 Going on 30 (2004),2001: A Space Odyssey (1968),21 Grams (2003),...,Yojimbo (1961),You Can Count on Me (2000),You've Got Mail (1998),Young Frankenstein (1974),Zack and Miri Make a Porno (2008),Zelig (1983),Zero Dark Thirty (2012),Zombieland (2009),Zoolander (2001),eXistenZ (1999)
(500) Days of Summer (2009),1.0,0.142471,0.273989,0.148903,0.142141,0.159756,0.200135,0.297152,0.113616,0.094029,...,0.029789,0.049562,0.140981,0.066077,0.374515,0.051286,0.178655,0.355723,0.252226,0.053614
10 Cloverfield Lane (2016),0.142471,1.0,-0.005799,0.006139,-0.016835,0.031704,0.272943,-0.027835,0.09931,-0.030166,...,0.036295,-0.017991,0.041093,0.005034,0.242663,-0.01616,0.099059,0.241751,0.195054,0.177846
10 Things I Hate About You (1999),0.273989,-0.005799,1.0,0.223481,0.211473,0.011784,0.043383,0.321071,0.085974,0.092169,...,0.044099,0.14773,0.24794,0.144038,0.243118,0.101622,0.104858,0.158637,0.281934,0.121029
101 Dalmatians (1996),0.148903,0.006139,0.223481,1.0,0.285112,0.119843,0.029967,0.188467,0.110844,0.065939,...,-0.015241,0.045174,0.162952,0.177214,0.114968,0.091345,0.077232,0.113224,0.184324,0.047804
101 Dalmatians (One Hundred and One Dalmatians) (1961),0.142141,-0.016835,0.211473,0.285112,1.0,0.134037,-0.046277,0.218406,0.10223,0.077724,...,0.154346,0.142805,0.209313,0.180318,0.120302,0.150069,0.125816,0.171654,0.27426,0.085606
12 Angry Men (1957),0.159756,0.031704,0.011784,0.119843,0.134037,1.0,0.058862,-0.027672,0.195909,0.008812,...,0.138834,0.066496,0.078347,0.135876,0.104518,0.13448,0.028415,0.144652,0.122107,-0.001708
127 Hours (2010),0.200135,0.272943,0.043383,0.029967,-0.046277,0.058862,1.0,0.043314,0.140796,0.070118,...,0.053763,-0.02036,0.015838,0.003485,0.223135,-0.018288,0.154299,0.198926,0.091416,0.128638
13 Going on 30 (2004),0.297152,-0.027835,0.321071,0.188467,0.218406,-0.027672,0.043314,1.0,0.040458,0.084832,...,-0.026422,0.058544,0.138332,0.01859,0.103892,0.048055,0.171578,0.10823,0.225685,0.010007
2001: A Space Odyssey (1968),0.113616,0.09931,0.085974,0.110844,0.10223,0.195909,0.140796,0.040458,1.0,0.152689,...,0.220379,0.112141,0.171466,0.313117,0.150649,0.217489,0.097214,0.196175,0.148409,0.306034
21 Grams (2003),0.094029,-0.030166,0.092169,0.065939,0.077724,0.008812,0.070118,0.084832,0.152689,1.0,...,0.111872,0.145272,0.053785,0.109025,0.124965,0.229073,0.032101,0.055743,0.111366,0.242192
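
To make the Pearson step concrete, the sketch below computes the coefficient by hand (covariance divided by the product of the standard deviations) and compares it with `DataFrame.corr`. The toy matrix and the helper `pearson` are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings matrix: rows are users, columns are movies, 0 = not rated.
toy = pd.DataFrame(
    {
        "Movie A": [5.0, 4.0, 0.0, 1.0],
        "Movie B": [4.0, 5.0, 0.0, 2.0],
        "Movie C": [0.0, 1.0, 5.0, 4.0],
    }
)

def pearson(x, y):
    """Pearson correlation between two columns, computed from centered values."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return float(numerator / denominator)

manual = pearson(toy["Movie A"], toy["Movie B"])
matrix = toy.corr(method="pearson")

# The hand-computed value should agree with the pandas result.
print(manual, matrix.loc["Movie A", "Movie B"])
```

Movies rated alike by the same users ("Movie A" and "Movie B") get a positive coefficient, while movies rated in opposite ways ("Movie A" and "Movie C") get a negative one.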


# Creating Function to Recommend Movies
The function recommender() produces the recommendations. It takes two inputs, a movie name and the rating given to that movie. First, it looks up the input movie's column in the Pearson correlation matrix computed above, which gives a correlation score between the input movie and every movie in the dataset.

It then multiplies these correlation scores by the difference between the input rating and 2.5. A rating above 2.5 gives a positive weight, so movies similar to one the user liked move up the list; a rating below 2.5 gives a negative weight, so movies similar to one the user disliked move down.

Finally, the function sorts the weighted scores in descending order and returns them.

In [98]:
#creating function to carry out recommending
def recommender(movie_name,rating):
    # weight the movie's correlations by how far the rating is from 2.5
    similar_ratings = corrMatrix[movie_name]*(rating-2.5)
    # highest weighted scores first
    similar_ratings = similar_ratings.sort_values(ascending=False)
    return similar_ratings
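
The effect of the `(rating-2.5)` weight is easiest to see on a small example. The sketch below uses a hypothetical 3x3 correlation matrix in place of `corrMatrix`; `toy_recommender` mirrors the function above but is not part of the notebook.

```python
import pandas as pd

# Hypothetical 3x3 correlation matrix standing in for corrMatrix.
titles = ["Romance A", "Romance B", "Action C"]
toy_corr = pd.DataFrame(
    [[1.0, 0.6, -0.2],
     [0.6, 1.0, 0.1],
     [-0.2, 0.1, 1.0]],
    index=titles,
    columns=titles,
)

def toy_recommender(movie_name, rating, corr=toy_corr, midpoint=2.5):
    """Scale a movie's correlation column by how far the rating is from the midpoint."""
    return (corr[movie_name] * (rating - midpoint)).sort_values(ascending=False)

liked = toy_recommender("Romance A", 5)     # weight = +2.5
disliked = toy_recommender("Romance A", 1)  # weight = -1.5
print(liked)
print(disliked)
```

With a rating of 5 the most correlated titles come first; with a rating of 1 the sign flips, so the title least correlated with the disliked movie is ranked first.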

# Checking the Result with Dummy Data
The function is now tested on a small set of dummy ratings. Romantic movies are rated high and the other movies are rated low, so the recommendations should lean toward romantic movies. The top of the output below matches this expectation, which suggests the system is working as intended. Because `pd.concat` stacks the per-movie results without combining them, the first rows shown are the scores derived from the first rated movie, (500) Days of Summer (2009).

In [99]:
# creating dummy data to test the system
dummy_data = [("(500) Days of Summer (2009)",5),("Leaving Las Vegas (1995)",4),("Aliens (1986)",1),("2001: A Space Odyssey (1968)",2)]
# run the recommender once per rated movie and stack the resulting Series
recommended = pd.concat([recommender(movie,rating) for movie,rating in dummy_data], axis=0)

recommended.head(10)

title
(500) Days of Summer (2009)           2.500000
Scott Pilgrim vs. the World (2010)    1.032022
Hangover                              1.015052
I Love You                            1.001149
Moonrise Kingdom (2012)               0.972756
Burn After Reading (2008)             0.956260
Notebook                              0.952151
Juno (2007)                           0.947093
Spanglish (2004)                      0.941843
Zack and Miri Make a Porno (2008)     0.936288
dtype: float64
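
The stacked result contains one score per title for every rated movie, and the movie that was rated appears in its own list. One possible refinement, which is not what the notebook does, is to sum the scores per title and drop the titles that were already rated. The sketch below shows this on hypothetical score Series shaped like the output of recommender().

```python
import pandas as pd

# Hypothetical per-movie score Series, shaped like the output of recommender().
scores_from_a = pd.Series({"Movie A": 2.5, "Movie B": 1.0, "Movie C": 0.2, "Movie D": -0.3})
scores_from_b = pd.Series({"Movie B": 1.5, "Movie C": 0.9, "Movie A": 0.6, "Movie D": 0.4})
already_rated = ["Movie A", "Movie B"]

# Stacking keeps one row per (rated movie, candidate title) pair.
stacked = pd.concat([scores_from_a, scores_from_b], axis=0)

# Sum the scores that share a title, drop titles the user has already rated,
# and rank what is left.
combined = (
    stacked.groupby(level=0).sum()
    .drop(labels=already_rated, errors="ignore")
    .sort_values(ascending=False)
)
print(combined)
```

Summing is only one way to combine the scores; averaging them would be an equally reasonable choice.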

# Performance Metrics
User-based collaborative filtering predicts numeric ratings, so it can be evaluated as a regression problem. The metrics used are <b>MAE</b> (Mean Absolute Error), <b>MSE</b> (Mean Squared Error) and <b>RMSE</b> (Root Mean Squared Error), computed in the function performance_metrics() from the predicted ratings and the actual ratings. In the cell below, the first 3000 values of `recommended` stand in for the predictions and are paired by position with the first 3000 matching rows of `user_ratings`. Since `recommended` holds weighted correlation scores rather than ratings on the original scale, the resulting errors should be read as a rough check only.

In [100]:
def performance_metrics(predicted_ratings, actual_ratings):
    mae = mean_absolute_error(predicted_ratings, actual_ratings)
    mse = mean_squared_error(predicted_ratings, actual_ratings)
    rmse = sqrt(mse)
    return {'MAE': mae, 'MSE': mse, 'RMSE': rmse}

# Using the function
predicted_ratings = recommended[:3000]
print (predicted_ratings.shape)

actual_ratings = user_ratings[user_ratings['title'].isin(recommended.index)]['rating'][:3000]
print (actual_ratings.shape)

metrics = performance_metrics(predicted_ratings, actual_ratings)

print("Mean Absolute Error (MAE): ", metrics['MAE'])
print("Mean Squared Error (MSE): ", metrics['MSE'])
print("Root Mean Squared Error (RMSE): ", metrics['RMSE'])


(3000,)
(3000,)
Mean Absolute Error (MAE):  3.6521624320978368
Mean Squared Error (MSE):  14.237474478057697
Root Mean Squared Error (RMSE):  3.77325780699619
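
For reference, the three metrics can also be computed by hand. The sketch below does so with NumPy on hypothetical actual and predicted ratings; `mean_absolute_error` and `mean_squared_error` from scikit-learn compute the same MAE and MSE.

```python
import numpy as np

# Hypothetical actual and predicted ratings for five (user, movie) pairs.
actual = np.array([4.0, 3.5, 5.0, 2.0, 4.5])
predicted = np.array([3.5, 3.0, 4.0, 2.5, 4.5])

errors = predicted - actual
mae = np.mean(np.abs(errors))  # average absolute error
mse = np.mean(errors ** 2)     # average squared error
rmse = np.sqrt(mse)            # square root of MSE, in the same units as the ratings

print({"MAE": mae, "MSE": mse, "RMSE": rmse})
```

Because the errors are squared before averaging, MSE and RMSE penalize a few large errors more heavily than MAE does.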


# Confusion Matrix
A confusion matrix is used to evaluate the accuracy of a classification model. This system does not solve a classification problem; it solves a collaborative filtering (regression) problem. A confusion matrix therefore does not apply here.