# Lab 8: Recommender System

In this assignment, we will implement user-based and item-based collaborative filtering and evaluate both on a held-out test set.

## 1. Dataset

In this assignment, we will use the MovieLens-100K dataset. It contains about 100,000 ratings from about 1,000 users on about 1,700 movies.

In [3]:
from math import sqrt
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.metrics.pairwise import linear_kernel
from sklearn.neighbors import NearestNeighbors

# 1. load data
user_ratings_train = pd.read_csv('./ml-100k/u1.base', sep='\t',names=['user_id','movie_id','rating'], usecols=[0,1,2])
user_ratings_test = pd.read_csv('./ml-100k/u1.test', sep='\t',names=['user_id','movie_id','rating'], usecols=[0,1,2])
movie_info =  pd.read_csv('./ml-100k/u.item', sep='|', names=['movie_id','title'], usecols=[0,1], encoding="ISO-8859-1")
user_ratings_train = pd.merge(movie_info, user_ratings_train)
user_ratings_test = pd.merge(movie_info, user_ratings_test)

# 2. Build the rating matrices. Each row is a user, and each column is a movie title.
#    pivot_table averages the ratings when a user rated several movies sharing one title.
user_ratings_train = user_ratings_train.pivot_table(index=['user_id'], columns=['title'], values='rating')
user_ratings_test = user_ratings_test.pivot_table(index=['user_id'], columns=['title'], values='rating')

# 3. Align both matrices on the same users and movies. Missing ratings stay NaN.
all_users = user_ratings_train.index.union(user_ratings_test.index)
all_movies = user_ratings_train.columns.union(user_ratings_test.columns)
user_ratings_train = user_ratings_train.reindex(index=all_users, columns=all_movies)
user_ratings_test = user_ratings_test.reindex(index=all_users, columns=all_movies)

print(user_ratings_train.shape)
print(user_ratings_test.shape)

(943, 1664)
(943, 1664)


## Task 1. User-based CF

* Use Pearson correlation to get the similarity between different users.
* Based on the obtained similarity score, predict the ratings. You can use 5 nearest neighbors or 10 nearest neighbors.
* Compute MAE for the testing set.
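The steps above can be illustrated on a small example. The following is a minimal sketch, not the reference solution: the toy matrix `toy` and the helper `predict_user_based` are hypothetical names introduced here. It assumes Pearson correlation is computed only over the movies that both users rated, and that a prediction is the target user's mean rating plus a similarity-weighted average of the neighbours' mean-centred ratings.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rating matrix: rows are users, columns are movies, NaN = not rated.
toy = pd.DataFrame(
    [[5, 4, np.nan, 1],
     [4, 5, 1, 2],
     [1, 2, 5, 4]],
    index=['u1', 'u2', 'u3'], columns=['m1', 'm2', 'm3', 'm4'], dtype=float)

# DataFrame.corr correlates columns, so transpose to put users in the columns.
# Each pair is correlated over co-rated movies only; min_periods=2 requires
# at least two co-rated movies, otherwise the similarity is NaN.
user_sim = toy.T.corr(method='pearson', min_periods=2)

def predict_user_based(ratings, sim, user, movie, k=5):
    """Mean-centred weighted average over the k most similar users who rated `movie`."""
    user_mean = ratings.mean(axis=1)                    # NaN-aware per-user mean
    candidates = sim[user].drop(user).dropna()          # similarities to all other users
    candidates = candidates[ratings.loc[candidates.index, movie].notna()]
    top = candidates.sort_values(ascending=False).head(k)
    if top.empty or top.abs().sum() == 0:
        return user_mean[user]                          # fall back to the user's own mean
    deviation = ratings.loc[top.index, movie] - user_mean[top.index]
    return user_mean[user] + (top * deviation).sum() / top.abs().sum()

# Predict the rating of user u1 for movie m3 from the 2 nearest neighbours.
pred = predict_user_based(toy, user_sim, 'u1', 'm3', k=2)
print(user_sim.round(3))
print(pred)
```

On the full dataset, the same idea applies to `user_ratings_train` before any zero-filling, although computing all pairwise correlations this way is slower than a nearest-neighbour search on a zero-filled matrix.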

In [5]:
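# Zero-fill missing ratings so that NearestNeighbors can compute distances.
# Note: the correlation distance (1 - Pearson r) is taken over the zero-filled rows.
user_ratings_train = user_ratings_train.fillna(0)
train = user_ratings_train.values
knn = NearestNeighbors(metric='correlation')
knn.fit(train)
distances, indices = knn.kneighbors(train, n_neighbors=6)  # each user plus 5 neighbours
print(distances, indices)

rated = train > 0
# Mean rating of each user over the movies they actually rated
users_avg = train.sum(axis=1) / np.maximum(rated.sum(axis=1), 1)
test = user_ratings_test.values
pred_arr = []
real_arr = []
for u, p in zip(*np.where(~np.isnan(test))):
    # Skip column 0 (normally the user itself) and keep neighbours who rated movie p
    nbrs = indices[u, 1:]
    sims = 1 - distances[u, 1:]
    mask = rated[nbrs, p]
    pred = users_avg[u]
    if mask.any() and np.abs(sims[mask]).sum() > 0:
        # Similarity-weighted average of the neighbours' mean-centred ratings
        pred += np.dot(sims[mask], train[nbrs[mask], p] - users_avg[nbrs[mask]]) / np.abs(sims[mask]).sum()
    pred_arr.append(np.clip(pred, 1, 5))
    real_arr.append(test[u, p])

# Mean absolute error over all test ratings
print('MAE:', np.mean(np.abs(np.array(pred_arr) - np.array(real_arr))))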

[[1.11022302e-16 6.47544752e-01 6.60925605e-01 6.87676851e-01
  6.91347417e-01 6.93769310e-01]
 [0.00000000e+00 5.33992742e-01 5.57585434e-01 5.59853814e-01
  5.64183989e-01 5.71700215e-01]
 [2.22044605e-16 6.01102428e-01 6.35592747e-01 6.50536175e-01
  6.55089862e-01 6.61778133e-01]
 ...
 [0.00000000e+00 5.06089531e-01 5.44251320e-01 5.85830872e-01
  6.03092022e-01 6.04274489e-01]
 [0.00000000e+00 6.27357479e-01 6.42508716e-01 6.68873411e-01
  6.82356423e-01 6.83230542e-01]
 [1.11022302e-16 5.13492714e-01 5.39363404e-01 5.52326991e-01
  5.56149507e-01 5.58960736e-01]] [[  0 822 513 863 912 520]
 [  1 519 677 734 700 265]
 [  2 655 751 610 783 586]
 ...
 [940 688 816 729 581 741]
 [941 453 779 473 733 487]
 [942 681 932 708 631 585]]


## Task 2. Item-based CF
* Use cosine similarity to get the similarity between different items.
* Based on the obtained similarity score, predict the ratings. You can use 5 nearest neighbors or 10 nearest neighbors.
* Compute MAE for the testing set.

In [23]:
# your code
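One possible approach to this task is sketched below on toy data. It is a starting point under stated assumptions, not the expected solution: the helper names (`item_cosine_similarity`, `predict_item_based`, `mean_absolute_error`) and the matrices `R_train` and `R_test` are hypothetical. The sketch assumes missing ratings are stored as 0, computes cosine similarity between item columns of the zero-filled matrix, and predicts a rating as the similarity-weighted average of the user's ratings on the k most similar items they have rated.

```python
import numpy as np

def item_cosine_similarity(R):
    """Cosine similarity between the columns (items) of a zero-filled rating matrix."""
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0                  # items with no ratings get similarity 0
    sim = (R.T @ R) / np.outer(norms, norms)
    np.fill_diagonal(sim, 0.0)               # an item is not its own neighbour
    return sim

def predict_item_based(R, sim, user, item, k=5):
    """Weighted average of the user's ratings on the k rated items most similar to `item`."""
    rated = np.flatnonzero(R[user] > 0)
    rated = rated[rated != item]
    if rated.size == 0:
        return np.nan                        # no information about this user
    top = rated[np.argsort(sim[item, rated])[::-1][:k]]
    weights = sim[item, top]
    if weights.sum() == 0:
        return R[user, top].mean()           # fall back to a plain average
    return np.dot(weights, R[user, top]) / weights.sum()

def mean_absolute_error(R_train, R_test, k=5):
    """MAE over all non-zero entries of R_test, skipping entries with no prediction."""
    sim = item_cosine_similarity(R_train)
    users, items = np.nonzero(R_test)
    preds = np.array([predict_item_based(R_train, sim, u, i, k)
                      for u, i in zip(users, items)])
    ok = ~np.isnan(preds)
    return np.abs(preds[ok] - R_test[users, items][ok]).mean()

# Hypothetical toy data: 4 users x 4 movies, 0 = not rated.
R_train = np.array([[5, 4, 0, 1],
                    [4, 5, 1, 2],
                    [1, 2, 5, 4],
                    [0, 1, 4, 5]], dtype=float)
R_test = np.zeros_like(R_train)
R_test[0, 2] = 1.0                           # held-out rating of user 0 on movie 2
R_test[3, 0] = 1.0                           # held-out rating of user 3 on movie 0

sim = item_cosine_similarity(R_train)
pred = predict_item_based(R_train, sim, user=0, item=2, k=2)
mae = mean_absolute_error(R_train, R_test, k=2)
print(pred, mae)
```

To apply this to the lab data, `R_train` and `R_test` would be the zero-filled `.values` of `user_ratings_train` and `user_ratings_test`. Treating missing ratings as 0 in the cosine computation is a simplification; restricting the similarity to co-rated entries is a common alternative.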