In this notebook, we use the MovieLens 100K dataset, which contains 100,000 movie ratings from 943 users on 1,682 movies.

First, let's import some useful libraries.

In [None]:
import numpy as np
import pandas as pd

Read the file that contains the full dataset. It is tab-separated and has no header row, so we pass sep='\t' and supply the column names ourselves.

In [None]:
column_names = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('u.data', sep='\t', names=column_names)

View the data.

In [None]:
df.head()

Unnamed: 0,user_id,item_id,rating,timestamp
0,0,50,5,881250949
1,0,172,5,881250949
2,0,133,1,881250949
3,196,242,3,881250949
4,186,302,3,891717742


Now, use the Movie_Id_Titles file to get the movie names.

In [None]:
movie_titles=pd.read_csv('Movie_Id_Titles')

In [None]:
movie_titles.head()

Unnamed: 0,item_id,title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


We then merge these dataframes.

In [None]:
df = pd.merge(df,movie_titles,on='item_id')
df.head()

Unnamed: 0,user_id,item_id,rating,timestamp,title
0,0,50,5,881250949,Star Wars (1977)
1,290,50,5,880473582,Star Wars (1977)
2,79,50,4,891271545,Star Wars (1977)
3,2,50,5,888552084,Star Wars (1977)
4,8,50,5,879362124,Star Wars (1977)


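Count the number of unique users and movies. Note that this copy of the data contains a user_id of 0 (visible in the rows above), so the user count comes out at 944 rather than 943.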

In [None]:
n_users=df.user_id.nunique()
n_items=df.item_id.nunique()

print('No. of Users: '+str(n_users))
print('No. of Movies: '+str(n_items))

No. of Users: 944
No. of Movies: 1682


Now, let's apply train-test split.

In [None]:
from sklearn.model_selection import train_test_split
train_data,test_data=train_test_split(df,test_size=0.25)

Memory-based collaborative filtering.

In [None]:
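# Build two user-item matrices, one for training and one for testing.
# itertuples() yields (index, user_id, item_id, rating, ...), so line[1] is the user,
# line[2] the item and line[3] the rating. IDs are shifted by 1 to get 0-based indices.
# The user_id 0 seen in the data becomes index -1, which wraps around to the last row.
train_data_matrix = np.zeros((n_users, n_items))
for line in train_data.itertuples():
    train_data_matrix[line[1]-1, line[2]-1] = line[3]

test_data_matrix = np.zeros((n_users, n_items))
for line in test_data.itertuples():
    test_data_matrix[line[1]-1, line[2]-1] = line[3]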
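
The row-by-row loop above can also be written as a single NumPy indexing assignment. The sketch below uses a small made-up ratings table (the arrays `user_ids`, `item_ids` and `ratings`, and the sizes `n_users_toy` and `n_items_toy`, are illustrative and not taken from u.data). It assumes IDs start at 1 and that each (user, item) pair appears at most once.

```python
import numpy as np

# Toy ratings: three users and four items with 1-based IDs (illustrative values only).
user_ids = np.array([1, 1, 2, 3, 3])
item_ids = np.array([1, 3, 2, 1, 4])
ratings = np.array([5, 3, 4, 2, 1])

n_users_toy, n_items_toy = 3, 4
matrix = np.zeros((n_users_toy, n_items_toy))

# Subtract 1 to turn the 1-based IDs into 0-based row and column indices,
# then write all ratings in one vectorized assignment.
matrix[user_ids - 1, item_ids - 1] = ratings
print(matrix)
```

On the real data the same idea would apply to the `user_id`, `item_id` and `rating` columns of the training and test frames.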

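Use the pairwise_distances function with metric='cosine'. Note that it returns the cosine distance, that is 1 minus the cosine similarity, rather than the similarity itself, so the variables named user_similarity and item_similarity below actually hold distances. Since the ratings are all non-negative, the values range from 0 to 1.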

In [None]:
from sklearn.metrics.pairwise import pairwise_distances
user_similarity = pairwise_distances(train_data_matrix, metric='cosine')
item_similarity = pairwise_distances(train_data_matrix.T, metric='cosine')
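
To make the relationship between cosine similarity and cosine distance concrete, here is a minimal NumPy sketch on a made-up 3x3 matrix. The helper `cosine_similarity_matrix` and the array `toy` are hypothetical names introduced for illustration; they are not part of scikit-learn or of the notebook's data.

```python
import numpy as np

def cosine_similarity_matrix(m):
    """Pairwise cosine similarity between the rows of m (illustrative helper)."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for all-zero rows
    unit = m / norms
    return unit @ unit.T

# Rows 0 and 1 rate the same items similarly; row 2 shares no rated item with them.
toy = np.array([[5.0, 0.0, 3.0],
                [4.0, 0.0, 3.0],
                [0.0, 2.0, 0.0]])

similarity = cosine_similarity_matrix(toy)
distance = 1.0 - similarity  # cosine distance = 1 - cosine similarity

print(similarity.round(3))
print(distance.round(3))
```

Rows with similar rating patterns get a similarity close to 1 and a distance close to 0, while rows with no rated item in common get a similarity of 0 and a distance of 1.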

Let's make predictions now.

In [None]:
def predict(ratings, similarity, type='user'):
    if type == 'user':
        mean_user_rating = ratings.mean(axis=1)
        # np.newaxis turns mean_user_rating into a column vector so that it broadcasts across the columns of ratings
        ratings_diff = (ratings - mean_user_rating[:, np.newaxis])
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    return pred

In [None]:
item_prediction = predict(train_data_matrix, item_similarity, type='item')
user_prediction = predict(train_data_matrix, user_similarity, type='user')
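
Written as formulas, the two branches of `predict` compute the following. The notation is introduced here for illustration: $r_{u,i}$ is the entry of `ratings` for user $u$ and item $i$, $\bar{r}_u$ is the mean of row $u$ of `ratings` (taken over all items, so unrated entries count as 0, as in the code), and $s$ denotes the entries of whichever matrix is passed as `similarity`.

```latex
% User-based branch (type == 'user')
\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v} s_{u,v}\,\bigl(r_{v,i} - \bar{r}_v\bigr)}{\sum_{v} \lvert s_{u,v} \rvert}

% Item-based branch (type == 'item')
\hat{r}_{u,i} = \frac{\sum_{j} r_{u,j}\, s_{j,i}}{\sum_{j} \lvert s_{i,j} \rvert}
```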

Do the evaluation using root mean squared error.

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt
def rmse(prediction, ground_truth):
    # Evaluate only on the cells that hold a rating in the ground truth
    rated = ground_truth.nonzero()
    prediction = prediction[rated].flatten()
    ground_truth = ground_truth[rated].flatten()
    # mean_squared_error expects (y_true, y_pred)
    return sqrt(mean_squared_error(ground_truth, prediction))
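
As a small check of what this metric measures, the sketch below computes the same masked RMSE by hand with NumPy on a made-up 2x2 example. The helper `masked_rmse` and the arrays `truth` and `pred` are hypothetical names used only for this illustration.

```python
import numpy as np

def masked_rmse(prediction, ground_truth):
    """RMSE over the cells where ground_truth is non-zero (illustrative helper)."""
    mask = ground_truth != 0
    errors = prediction[mask] - ground_truth[mask]
    return float(np.sqrt(np.mean(errors ** 2)))

truth = np.array([[5.0, 0.0],
                  [0.0, 3.0]])
pred = np.array([[4.0, 2.5],
                 [1.0, 5.0]])

# Only cells (0, 0) and (1, 1) are rated, so the errors are -1 and 2;
# the predictions for the unrated cells are ignored.
print(masked_rmse(pred, truth))
```

The result is the square root of (1 + 4) / 2, which shows that predictions for unrated cells have no influence on the score.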

In [None]:
print('User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix)))
print('Item-based CF RMSE: ' + str(rmse(item_prediction, test_data_matrix)))

User-based CF RMSE: 3.13688723434184
Item-based CF RMSE: 3.466514038901858


The issue with memory-based CF is that it does not scale well to real-world datasets, because the similarity matrix grows quadratically with the number of users or items. It also cannot handle a new user or a new item entering the system (the cold-start problem).

Model-based CF methods, on the other hand, scale better, but they also struggle when new users or items without any ratings enter the system.

Let us try model-based CF now.

Calculate the sparsity level of the MovieLens dataset, that is, the fraction of user-item pairs that have no rating.

In [None]:
sparsity=round(1.0-len(df)/float(n_users*n_items),3)
print('The sparsity level of MovieLens100K is ' +  str(sparsity*100) + '%')

The sparsity level of MovieLens100K is 93.7%


Collaborative filtering can be formulated as approximating the user-item matrix with a low-rank factorization, for example by using singular value decomposition (SVD). Let's apply this factorization method.

In [None]:
import scipy.sparse as sp
from scipy.sparse.linalg import svds

u, s, vt = svds(train_data_matrix, k = 20) # k is the number of singular values (latent factors) to keep; we choose it by hand
s_diag_matrix=np.diag(s)
X_pred = np.dot(np.dot(u, s_diag_matrix), vt)
print('SVD-based CF RMSE: ' + str(rmse(X_pred, test_data_matrix)))

SVD-based CF RMSE: 2.7330642019806413
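
To illustrate how the number of retained singular values affects the reconstruction, here is a minimal sketch on a small random matrix. It swaps in NumPy's dense `np.linalg.svd` instead of `svds`, and the matrix `toy` and the helper `rank_k_approximation` are made-up names for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
toy = rng.integers(0, 6, size=(6, 5)).astype(float)  # made-up ratings in 0..5

# Dense SVD; np.linalg.svd returns the singular values in descending order.
u, s, vt = np.linalg.svd(toy, full_matrices=False)

def rank_k_approximation(k):
    """Rebuild the matrix from the k largest singular values (illustrative helper)."""
    return u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

# Frobenius-norm reconstruction error for k = 1 .. number of singular values.
errors = [np.linalg.norm(toy - rank_k_approximation(k)) for k in range(1, len(s) + 1)]
print(errors)
```

The error shrinks as k grows and reaches (numerically) zero once all singular values are kept. In a recommender the goal is not a perfect reconstruction of the training matrix, so k is kept small and would normally be tuned against held-out ratings.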


Memory-based models rely on the similarity between items or users, which we measure here with the cosine metric.

Model-based CF relies on matrix factorization, and here we use SVD to factorize the matrix.