# Fellowship.AI Challenge
### Denzil Sikka

I'm building a basic recommender system based on the Lab41 MovieLens dataset. I used the MovieLens 100K dataset, which contains 100,000 ratings from 943 users on 1,682 movies.


In [35]:
import pandas as pd
import numpy as np
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn import model_selection as cv
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.metrics import mean_squared_error
from math import sqrt

I used the above libraries to build this recommendation engine.

I then read in the data file u.data. The full data set contains 100,000 ratings by 943 users on 1,682 items. Each user has rated at least 20 movies. Users and items are numbered consecutively from 1, and the data is randomly ordered. The file is a tab-separated list of:
user id | item id | rating | timestamp

In [36]:
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('u.data', sep='\t', names=r_cols,
 encoding='latin-1')


num_users = ratings.user_id.unique().shape[0]
num_items = ratings.movie_id.unique().shape[0]
print('There are ' + str(num_users) + ' users and ' + str(num_items) + ' movies in this dataset.')


There are 943 users and 1682 movies in this dataset.


I used the scikit-learn library to split the dataset into a training set (75%) and a testing set (25%).

In [37]:
training_ratings_data, testing_ratings_data = cv.train_test_split(ratings, test_size=0.25)

print("")
print("Training Ratings Data - 75%")
print(training_ratings_data.shape)
print(training_ratings_data.head())
print("")
print("Testing Ratings Data - 25%")
print(testing_ratings_data.shape)
print(testing_ratings_data.head())


Training Ratings Data - 75%
(75000, 4)
       user_id  movie_id  rating  unix_timestamp
35787       23       230       4       874785809
47269      505       202       3       889333508
63077        9       201       5       886960055
41787      648       118       4       882212200
97897      847       133       3       878941027

Testing Ratings Data - 25%
(25000, 4)
       user_id  movie_id  rating  unix_timestamp
10295        9       340       4       886958715
9969       262       195       2       879791755
98526      870      1014       2       884789665
83713      453       202       4       877553999
49993      407         8       5       875042425


I then created a user-item matrix and calculated two types of similarity: item-item and user-user. In item-item collaborative filtering, the similarity between two items is measured from the ratings of users who have rated both items. In user-user collaborative filtering, the similarity between two users is measured from the ratings of items that both users have rated.

In [38]:
training_ratings_matrix = np.zeros((num_users, num_items))

for row in training_ratings_data.itertuples():
    training_ratings_matrix[row[1]-1, row[2]-1] = row[3]

testing_ratings_matrix = np.zeros((num_users, num_items))

for row in testing_ratings_data.itertuples():
    testing_ratings_matrix[row[1]-1, row[2]-1] = row[3]

print("")
print("User-Item Matrix")
print("")
print(training_ratings_matrix)


User-Item Matrix

[[ 0.  3.  0. ...,  0.  0.  0.]
 [ 4.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 5.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  5.  0. ...,  0.  0.  0.]]


There are several similarity metrics used in recommender systems. Here I used cosine similarity, which treats each set of ratings as a vector in n-dimensional space and measures similarity as the cosine of the angle between two vectors. The smaller the angle between the two vectors, the larger the cosine value.

To calculate similarity between two vectors, you take their dot product and then divide by the product of the Euclidean lengths of the vectors.
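
As a minimal sketch of that formula, the snippet below computes pairwise cosine similarity between the rows of a small made-up matrix using only NumPy. The helper name `cosine_similarity_matrix` and the `toy` ratings are hypothetical and exist only for illustration. Note that scikit-learn's `pairwise_distances` with `metric='cosine'` returns the cosine distance, which is 1 minus this similarity, so identical vectors score 0 there rather than 1.

```python
import numpy as np

def cosine_similarity_matrix(m):
    """Pairwise cosine similarity between the rows of m."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for all-zero rows
    unit = m / norms         # scale every row to unit Euclidean length
    return unit @ unit.T     # dot products of unit vectors are cosines

# Hypothetical ratings for three users on four movies (0 = not rated).
toy = np.array([[5.0, 3.0, 0.0, 1.0],
                [4.0, 0.0, 0.0, 1.0],
                [0.0, 0.0, 5.0, 0.0]])

similarity = cosine_similarity_matrix(toy)
distance = 1.0 - similarity  # the quantity a cosine *distance* reports

print(similarity)
print(distance)
```

Users 1 and 3 share no rated movies, so their similarity is 0 and their distance is 1.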

In [39]:
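# Note: pairwise_distances with metric='cosine' returns the cosine distance
# (1 - cosine similarity), so identical vectors score 0 rather than 1.
user_similarity = pairwise_distances(training_ratings_matrix, metric='cosine')
item_similarity = pairwise_distances(training_ratings_matrix.T, metric='cosine')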
print("")
print("User Similarity")
print(user_similarity)
print("")
print("Item Similarity")
print("")
print(item_similarity)


User Similarity
[[ -1.99840144e-15   8.89420763e-01   9.74215339e-01 ...,   9.41423031e-01
    8.65712798e-01   7.57741299e-01]
 [  8.89420763e-01   1.11022302e-16   9.11208786e-01 ...,   9.15725017e-01
    9.00824165e-01   9.75989936e-01]
 [  9.74215339e-01   9.11208786e-01   0.00000000e+00 ...,   9.55166416e-01
    8.90081850e-01   9.84672217e-01]
 ..., 
 [  9.41423031e-01   9.15725017e-01   9.55166416e-01 ...,   1.11022302e-16
    8.94927475e-01   9.54415769e-01]
 [  8.65712798e-01   9.00824165e-01   8.90081850e-01 ...,   8.94927475e-01
    1.11022302e-16   8.55512219e-01]
 [  7.57741299e-01   9.75989936e-01   9.84672217e-01 ...,   9.54415769e-01
    8.55512219e-01   0.00000000e+00]]

Item Similarity

[[  9.99200722e-16   7.23215593e-01   7.60700484e-01 ...,   1.00000000e+00
    1.00000000e+00   9.46859981e-01]
 [  7.23215593e-01   2.22044605e-16   8.32015536e-01 ...,   1.00000000e+00
    1.00000000e+00   9.06432571e-01]
 [  7.60700484e-01   8.32015536e-01   1.11022302e-16 ...,   1

I used the similarity between two users as a weight, multiplied by the other user's rating, to predict a user's rating. I also corrected for each user's average rating, because not every user rates on the same scale. This is a rough way to correct for that subjectivity, and it only applies to the user-user prediction.
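
To make the mean-centering step concrete, here is a small sketch on a made-up 3x3 matrix. The `ratings` and `weights` arrays are hypothetical values chosen for illustration. It differs from the notebook cell in one respect, which is an assumption on my part about what is usually intended: it averages each user's ratings over rated movies only, whereas `training_ratings_matrix.mean(axis=1)` averages over every column, zeros included (which is why the printed mean ratings are below 1).

```python
import numpy as np

# Hypothetical ratings: 3 users x 3 movies, 0 = not rated.
ratings = np.array([[5.0, 3.0, 0.0],
                    [4.0, 0.0, 2.0],
                    [0.0, 1.0, 5.0]])

# Hypothetical user-user weights (larger = more alike).
weights = np.array([[1.0, 0.5, 0.1],
                    [0.5, 1.0, 0.2],
                    [0.1, 0.2, 1.0]])

rated = ratings > 0

# Mean over rated entries only (guarding against users with no ratings).
user_mean = ratings.sum(axis=1) / np.maximum(rated.sum(axis=1), 1)

# For comparison: the mean when unrated zeros are counted as ratings.
mean_with_zeros = ratings.mean(axis=1)

# Center observed ratings on each user's mean; leave unrated cells at 0.
centered = np.where(rated, ratings - user_mean[:, np.newaxis], 0.0)

# Weighted average of the other users' deviations, added back to the mean.
prediction = user_mean[:, np.newaxis] + (
    weights.dot(centered) / np.abs(weights).sum(axis=1)[:, np.newaxis]
)

print(user_mean)
print(mean_with_zeros)
print(prediction)
```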

In [40]:
item_prediction = training_ratings_matrix.dot(item_similarity) / np.array([np.abs(item_similarity).sum(axis=1)])
print("")
print("Item Prediction")
print(item_prediction)
print("")
# Note: this mean is taken over all movies, so unrated (zero) entries pull it down
mean_user_rating = training_ratings_matrix.mean(axis=1)
print("Mean User Rating")
print(mean_user_rating[0:10])
ratings_diff = (training_ratings_matrix - mean_user_rating[:, np.newaxis])
user_prediction = mean_user_rating[:, np.newaxis] + user_similarity.dot(ratings_diff) / np.array([np.abs(user_similarity).sum(axis=1)]).T
print("")
print("User Prediction")
print(user_prediction)


Item Prediction
[[ 0.34135254  0.35083897  0.37325794 ...,  0.41121571  0.40844233
   0.39707148]
 [ 0.08759883  0.10248107  0.09951639 ...,  0.10285924  0.10463734
   0.10510163]
 [ 0.07074828  0.07400544  0.07215868 ...,  0.06849997  0.0725327
   0.0735303 ]
 ..., 
 [ 0.03241162  0.04129133  0.03947997 ...,  0.04523056  0.0451843
   0.04525505]
 [ 0.1135916   0.12297656  0.13049496 ...,  0.13698888  0.1373365
   0.13692521]
 [ 0.19475647  0.19363828  0.21354984 ...,  0.24458693  0.24256837
   0.2362278 ]]

Mean User Rating
[ 0.40844233  0.10463734  0.0725327   0.0332937   0.22235434  0.34066587
  0.72770511  0.09036861  0.04221165  0.35909631]

User Prediction
[[ 1.63894444  0.52528916  0.46802603 ...,  0.25810904  0.25562256
   0.25785514]
 [ 1.39544438  0.27495301  0.17152398 ..., -0.06434232 -0.06624518
  -0.06276104]
 [ 1.41344689  0.23519124  0.14342942 ..., -0.09601814 -0.09740923
  -0.09397189]
 ..., 
 [ 1.26853077  0.20031025  0.102637   ..., -0.11937305 -0.12157668
  -0.118

I only wanted to evaluate predictions for ratings that appear in the test set, so I filtered out all other elements. I then rescaled the predictions back to a 5-point scale.
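
The filtering step relies on NumPy's `nonzero()`, which returns the row and column indices of the non-zero entries. A small sketch with made-up `predictions` and `test_matrix` arrays (both hypothetical) shows how the same index tuple pulls out matching predicted and actual values:

```python
import numpy as np

# Hypothetical predicted ratings for 2 users x 3 movies.
predictions = np.array([[4.2, 3.1, 2.0],
                        [1.5, 4.8, 3.3]])

# Hypothetical test matrix: only non-zero cells are held-out ratings.
test_matrix = np.array([[0.0, 3.0, 0.0],
                        [2.0, 0.0, 5.0]])

# Tuple of (row indices, column indices) for the held-out ratings.
mask = test_matrix.nonzero()

# Indexing both arrays with the same tuple keeps the values aligned.
predicted_on_test = predictions[mask]
actual_on_test = test_matrix[mask]

print(predicted_on_test)
print(actual_on_test)
```

Indexing with the tuple already yields a 1-D array, so no further flattening is needed in this sketch.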

In [41]:
user_ratings_prediction = user_prediction[testing_ratings_matrix.nonzero()].flatten()
ratings_five = [min(round(i*5), 5) for i in user_ratings_prediction]
user_ratings_prediction = ratings_five
user_testing_ratings_prediction = testing_ratings_matrix[testing_ratings_matrix.nonzero()].flatten()
print("")
print("User Ratings Prediction for Test Data Set")
print(user_ratings_prediction[0:10])
print("")
print("User Test Data Set")
print(user_testing_ratings_prediction[0:10])
print("")

item_ratings_prediction = item_prediction[testing_ratings_matrix.nonzero()].flatten()
ratings_five = [min(round(i*5), 5) for i in item_ratings_prediction]
item_ratings_prediction = ratings_five
item_testing_ratings_prediction = testing_ratings_matrix[testing_ratings_matrix.nonzero()].flatten()
print("")
print("Item Ratings Prediction for Test Data Set")
print(item_ratings_prediction[0:10])
print("")
print("Item Test Data Set")
print(item_testing_ratings_prediction[0:10])


User Ratings Prediction for Test Data Set
[5, 2.0, 5, 4.0, 2.0, 4.0, 3.0, 5.0, 2.0, 1.0]

User Test Data Set
[ 5.  4.  4.  5.  3.  4.  3.  4.  5.  2.]


Item Ratings Prediction for Test Data Set
[2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0]

Item Test Data Set
[ 5.  4.  4.  5.  3.  4.  3.  4.  5.  2.]


I used the popular metric Root Mean Squared Error (RMSE) to evaluate the accuracy of the predicted ratings. 
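
For reference, RMSE is the square root of the mean of the squared differences between predicted and actual ratings. The sketch below writes it out with NumPy; the helper name `rmse` is my own, and the inputs are the first four predicted and actual values from the user-based output above.

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

predicted = [5, 2, 5, 4]
actual = [5, 4, 4, 5]

# Differences are 0, -2, 1, -1, so the mean squared error is 6 / 4 = 1.5.
error = rmse(predicted, actual)
print(error)
```

Because the errors are squared before averaging, a single prediction that is off by 2 contributes as much as four predictions that are each off by 1.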

In [42]:
user_prediction_error_eval = sqrt(mean_squared_error(user_ratings_prediction, user_testing_ratings_prediction))
print('User-based CF RMSE')
print(user_prediction_error_eval)
print("")
item_prediction_error_eval = sqrt(mean_squared_error(item_ratings_prediction, item_testing_ratings_prediction))
print('Item-based CF RMSE')
print(item_prediction_error_eval)

User-based CF RMSE
1.8255300600099686

Item-based CF RMSE
2.654648752660133


This memory-based algorithm is easy to implement and produced reasonable predictions; however, it does not address the cold-start problem (when a new user or new item enters the system and there is nothing to judge by) and it is difficult to scale. It is not the best method for the sparsity level of the MovieLens dataset:

In [43]:
sparsity=round(1.0-len(ratings)/float(num_users*num_items),3)
print('The sparsity level of MovieLens100K is ' +  str(sparsity*100) + '%')

The sparsity level of MovieLens100K is 93.7%


Model-based collaborative filtering is more scalable and copes better with high sparsity than this memory-based approach, but it still suffers from the cold-start problem.