# Lab 8: Recommender System

In this assignment, we implement user-based and item-based collaborative filtering and evaluate each by its mean absolute error (MAE) on a held-out test set.

## 1. Dataset

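In this assignment, we use the MovieLens-100K dataset. It contains 100,000 ratings from 943 users on 1,682 movies. The rating matrix built below is indexed by movie title, so movies that share a title collapse into a single column.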

In [36]:
from math import sqrt
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.metrics.pairwise import linear_kernel
from sklearn.neighbors import NearestNeighbors

# 1. Load the ratings and the movie titles.
rating_cols = ['user_id', 'movie_id', 'rating']
user_ratings_train = pd.read_csv('./ml-100k/u1.base', sep='\t', names=rating_cols, usecols=[0, 1, 2])
user_ratings_test = pd.read_csv('./ml-100k/u1.test', sep='\t', names=rating_cols, usecols=[0, 1, 2])
movie_info = pd.read_csv('./ml-100k/u.item', sep='|', names=['movie_id', 'title'], usecols=[0, 1], encoding='ISO-8859-1')
user_ratings_train = pd.merge(movie_info, user_ratings_train)
user_ratings_test = pd.merge(movie_info, user_ratings_test)

# 2. Build the rating matrices. Each row is a user, and each column is a movie title.
user_ratings_train = user_ratings_train.pivot_table(index=['user_id'], columns=['title'], values='rating')
user_ratings_test = user_ratings_test.pivot_table(index=['user_id'], columns=['title'], values='rating')

# 3. Reindex both matrices to the union of users and titles so they share one shape.
all_users = user_ratings_train.index.union(user_ratings_test.index)
all_titles = user_ratings_train.columns.union(user_ratings_test.columns)
user_ratings_train = user_ratings_train.reindex(index=all_users, columns=all_titles)
user_ratings_test = user_ratings_test.reindex(index=all_users, columns=all_titles)

print(user_ratings_train.shape)
print(user_ratings_test.shape)

(943, 1664)
(943, 1664)
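
The pivot-and-reindex step above can be illustrated on a tiny example. This is a minimal sketch on made-up data (the frames `train_long` and `test_long` are hypothetical, not taken from MovieLens); it shows why both matrices end up with the same shape even when a user or title appears in only one split.

```python
import pandas as pd

# Toy long-format ratings (hypothetical data, not from MovieLens).
train_long = pd.DataFrame({
    "user_id": [1, 1, 2],
    "title": ["A", "B", "A"],
    "rating": [5, 3, 4],
})
test_long = pd.DataFrame({
    "user_id": [2, 3],
    "title": ["B", "C"],
    "rating": [2, 1],
})

# One row per user, one column per title; unrated cells become NaN.
train_mat = train_long.pivot_table(index="user_id", columns="title", values="rating")
test_mat = test_long.pivot_table(index="user_id", columns="title", values="rating")

# Reindex both matrices to the union of users and titles so that
# position (i, j) refers to the same user and movie in each matrix.
all_users = train_mat.index.union(test_mat.index)
all_titles = train_mat.columns.union(test_mat.columns)
train_mat = train_mat.reindex(index=all_users, columns=all_titles)
test_mat = test_mat.reindex(index=all_users, columns=all_titles)

print(train_mat.shape, test_mat.shape)
```

User 3 and title "C" appear only in the toy test split, so the corresponding row and column of `train_mat` are entirely NaN after reindexing.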


## Task 1. User-based CF

* Use Pearson correlation to compute the similarity between users.
* Predict ratings from the similarity scores, using either 5 or 10 nearest neighbors.
* Compute the MAE on the test set.
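
The steps above can be sketched on a small matrix. This is one possible formulation under stated assumptions, not necessarily the one used in the cell below: unrated entries are NaN, similarity is computed only over co-rated items, only positively correlated neighbors are kept, and the prediction is the user's mean plus a similarity-weighted average of the neighbors' mean-centered ratings. The helper names `pearson_sim` and `predict_user_based` are hypothetical.

```python
import numpy as np

def pearson_sim(a, b):
    """Pearson correlation over the items both users rated (NaN = unrated)."""
    both = ~np.isnan(a) & ~np.isnan(b)
    if both.sum() < 2:
        return 0.0
    x = a[both] - a[both].mean()
    y = b[both] - b[both].mean()
    denom = np.sqrt((x ** 2).sum() * (y ** 2).sum())
    return float((x * y).sum() / denom) if denom > 0 else 0.0

def predict_user_based(R, u, i, k=5):
    """Predict R[u, i] from the k most similar users who rated item i."""
    user_means = np.nanmean(R, axis=1)  # means over rated items only
    # Candidate neighbors: other users who have a rating for item i.
    candidates = [v for v in range(R.shape[0]) if v != u and not np.isnan(R[v, i])]
    sims = [pearson_sim(R[u], R[v]) for v in candidates]
    # Assumption: keep only positively correlated neighbors, top k by similarity.
    ranked = [(s, v) for s, v in sorted(zip(sims, candidates), reverse=True) if s > 0][:k]
    if not ranked:
        return float(user_means[u])  # fall back to the user's own mean
    weights = np.array([s for s, _ in ranked])
    deviations = np.array([R[v, i] - user_means[v] for _, v in ranked])
    return float(user_means[u] + weights @ deviations / weights.sum())

# Toy ratings: rows are users, columns are movies, NaN means unrated.
R = np.array([
    [5.0, 3.0, 4.0, np.nan],
    [4.0, 2.0, 3.0, 5.0],
    [1.0, 5.0, 2.0, 1.0],
])
pred = predict_user_based(R, u=0, i=3, k=2)
print(pred)
```

In this toy matrix the second user's ratings move in step with the first user's, while the third user's move in the opposite direction, so only the second user contributes to the prediction.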

In [37]:
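from sklearn.metrics import mean_absolute_error as mae

# Treat unrated movies as 0 so that NearestNeighbors can process the matrix.
user_ratings_train = user_ratings_train.fillna(0)

# Find each user's 6 nearest neighbors under correlation distance (1 - Pearson r).
# Column 0 is normally the user itself, which leaves 5 neighbors.
knn = NearestNeighbors(metric='correlation')
knn.fit(user_ratings_train.values)
distances, indices = knn.kneighbors(user_ratings_train.values, n_neighbors=6)

# Per-user mean over all movies; unrated movies count as 0 after fillna(0).
users_avg = user_ratings_train.mean(axis=1).values
train = user_ratings_train.values
test = user_ratings_test.fillna(-1).values  # -1 marks "no test rating"

pred_arr = []
real_arr = []
for i in range(test.shape[0]):
    for j in range(test.shape[1]):
        if test[i][j] != -1:
            real_arr.append(test[i][j])
            # Note: these weights are distances, so larger values mean less similar users.
            simScores = distances[i, 1:]
            nnUsers = indices[i, 1:]
            # Each neighbor's rating of movie j, centered on that neighbor's mean.
            rDiff = train[nnUsers, j] - users_avg[nnUsers]
            pred = users_avg[i] + np.sum(np.multiply(simScores, rDiff)) / np.sum(simScores)
            pred_arr.append(pred)

loss = mae(real_arr, pred_arr)
print(loss)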


2.2758162801671507
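
Note that `kneighbors` returns distances rather than similarities: with `metric='correlation'` the value is one minus the Pearson correlation, so a smaller number means a more similar user. The sketch below, on a made-up matrix `X`, shows one way to convert the returned distances back into correlations before using them as weights. It is an illustration under that assumption and is not the computation that produced the MAE above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy dense ratings (hypothetical): rows are users, columns are movies.
X = np.array([
    [5.0, 3.0, 4.0, 1.0],
    [4.0, 2.0, 3.0, 1.0],
    [1.0, 5.0, 2.0, 4.0],
])

knn = NearestNeighbors(metric="correlation", algorithm="brute")
knn.fit(X)
distances, indices = knn.kneighbors(X, n_neighbors=3)

# Correlation distance is 1 - r, so the similarity is 1 - distance.
similarities = 1.0 - distances

# Cross-check against NumPy's row-wise Pearson correlation matrix.
corr = np.corrcoef(X)
for i in range(X.shape[0]):
    for col in range(indices.shape[1]):
        assert np.isclose(similarities[i, col], corr[i, indices[i, col]])

print(similarities)
```

Because the similarities can be negative, a prediction formula that uses them as weights typically normalizes by the sum of absolute weights or drops negative neighbors.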


## Task 2. Item-based CF

* Use cosine similarity to compute the similarity between items.
* Predict ratings from the similarity scores, using either 5 or 10 nearest neighbors.
* Compute the MAE on the test set.
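
As a rough illustration of these steps, the sketch below computes item-item cosine similarities on a small zero-filled matrix and predicts a rating as the similarity-weighted average of the user's ratings on the most similar items. This is the plain weighted-average variant, which differs from the mean-centered formula used in the cells below; the function names `cosine_sim_matrix` and `predict_item_based` are hypothetical.

```python
import numpy as np

def cosine_sim_matrix(R):
    """Cosine similarity between the item columns of a zero-filled rating matrix."""
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for items with no ratings
    normalized = R / norms
    return normalized.T @ normalized

def predict_item_based(R, u, i, k=5):
    """Weighted average of user u's ratings on the k rated items most similar to i."""
    sims = cosine_sim_matrix(R)[i]
    rated = np.flatnonzero(R[u] > 0)  # items user u has rated (0 = unrated)
    rated = rated[rated != i]
    if rated.size == 0:
        return None  # nothing to base a prediction on
    top = rated[np.argsort(-sims[rated])[:k]]
    weights = sims[top]
    if weights.sum() == 0:
        return None  # no similar item carries any weight
    return float(weights @ R[u, top] / weights.sum())

# Toy ratings: rows are users, columns are movies, 0 means unrated.
R = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 1.0],
    [1.0, 0.0, 5.0],
])
S = cosine_sim_matrix(R)
pred = predict_item_based(R, u=0, i=2, k=2)
print(pred)
```

Since the ratings are non-negative, the cosine similarities are non-negative too, and the prediction always falls between the lowest and highest rating the user gave to the selected neighbor items.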

In [38]:
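user_ratings_train = user_ratings_train.fillna(0)

# Transpose so that each row is a movie and each column is a user.
user_ratings_train = user_ratings_train.transpose()
user_ratings_test = user_ratings_test.transpose()

# Find each movie's 6 nearest neighbors under cosine distance (1 - cosine similarity).
knn = NearestNeighbors(metric='cosine')
knn.fit(user_ratings_train.values)
distances, indices = knn.kneighbors(user_ratings_train.values, n_neighbors=6)

# After the transpose, users_avg holds per-movie means (unrated entries count as 0).
users_avg = user_ratings_train.mean(axis=1).values
train = user_ratings_train.values
test = user_ratings_test.fillna(-1).values  # -1 marks "no test rating"
pred_arr = []
real_arr = []
print(train.shape)
print(test.shape)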

(1664, 943)
(1664, 943)


In [39]:
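# Rows are movies (i) and columns are users (j) after the transpose above.
for i in range(test.shape[0]):
    for j in range(test.shape[1]):
        if test[i][j] != -1:
            label = test[i][j]
            # Cosine distances to the 5 neighboring movies (column 0 is normally movie i itself).
            simScores = distances[i, 1:]
            nnUsers = indices[i, 1:]  # indices of neighboring movies, despite the name
            # User j's rating of each neighboring movie, centered on that movie's mean.
            rDiff = train[nnUsers, j] - users_avg[nnUsers]
            if np.sum(simScores) == 0:
                # All five neighbor distances are 0; skip to avoid dividing by zero.
                print("divide by 0 error at i = " + str(i) + ", j = " + str(j))
            else:
                pred = users_avg[i] + np.sum(np.multiply(simScores, rDiff)) / np.sum(simScores)
                pred_arr.append(pred)
                real_arr.append(label)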

divide by 0 error at i = 163, j = 12
divide by 0 error at i = 253, j = 57
divide by 0 error at i = 253, j = 113
divide by 0 error at i = 253, j = 304
divide by 0 error at i = 267, j = 166
divide by 0 error at i = 323, j = 404
divide by 0 error at i = 357, j = 206
divide by 0 error at i = 401, j = 166
divide by 0 error at i = 615, j = 397
divide by 0 error at i = 750, j = 180
divide by 0 error at i = 900, j = 12
divide by 0 error at i = 1006, j = 206
divide by 0 error at i = 1272, j = 404
divide by 0 error at i = 1358, j = 180
divide by 0 error at i = 1541, j = 194
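
A zero denominator here means that all five neighbor weights are zero. Since the weights are cosine distances, this can happen when several movies have proportional rating vectors, for example movies rated only by the same single user. Instead of skipping such test ratings, one option is to fall back to a baseline prediction. The helper below is a minimal sketch of that idea; the name `safe_weighted_prediction` and the `eps` threshold are assumptions, not part of the lab code.

```python
import numpy as np

def safe_weighted_prediction(baseline, weights, deviations, eps=1e-12):
    """Return baseline + weighted mean deviation, or the baseline alone
    when the weights sum to (almost) zero."""
    weights = np.asarray(weights, dtype=float)
    deviations = np.asarray(deviations, dtype=float)
    total = np.abs(weights).sum()
    if total < eps:
        return float(baseline)  # no usable neighbors: fall back to the baseline
    return float(baseline + weights @ deviations / total)

# All weights zero: the baseline is returned unchanged.
fallback = safe_weighted_prediction(3.0, [0.0, 0.0], [1.0, -1.0])
# Usual case: the baseline is shifted by the weighted mean deviation.
shifted = safe_weighted_prediction(3.0, [1.0, 1.0], [1.0, 0.0])
print(fallback, shifted)
```

With a fallback in place, every test rating receives a prediction, so the MAE is computed over the full test set rather than a slightly smaller subset.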


In [40]:
loss = mae(real_arr, pred_arr)
print(loss)


2.555891827810941
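
For reference, MAE is the mean of the absolute differences between the true and predicted ratings, taken only over entries that actually have a test rating. The sketch below computes it directly with NumPy on made-up arrays, using NaN instead of the `-1` sentinel to mark missing test ratings; `masked_mae` is a hypothetical helper name.

```python
import numpy as np

def masked_mae(actual, predicted):
    """Mean absolute error over the entries where `actual` is not NaN."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    observed = ~np.isnan(actual)
    return float(np.mean(np.abs(actual[observed] - predicted[observed])))

# Toy example: two observed test ratings, two missing ones.
actual = np.array([[5.0, np.nan], [np.nan, 2.0]])
predicted = np.array([[4.0, 1.0], [3.0, 4.0]])
toy_mae = masked_mae(actual, predicted)
print(toy_mae)
```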
