# Collaborative Filtering

# 1. Recommendations with User Ratings 

In this first part,  we still focus on the rating prediction recommendation task with explicit feedback. We will need to build **personalized** models as opposed to non personalized ones from before.

For this part, we will:

* load and process the MovieLens 1M dataset, 
* build a baseline estimation model,
* build a user-user collaborative filtering model,
* improve the user-user collaborative filtering model, and
* evaluate and compare these different models.

Preprocess as before:

In [1]:
import pandas as pd
import numpy as np
from scipy.sparse import coo_matrix

data_df = pd.read_csv('./ratings.dat', sep='::', names=["UserID", "MovieID", "Rating", "Timestamp"])

# First, generate dictionaries for mapping old id to new id for users and movies
unique_MovieID = data_df['MovieID'].unique()
unique_UserID = data_df['UserID'].unique()
j = 0
user_old2new_id_dict = dict()
for u in unique_UserID:
    user_old2new_id_dict[u] = j
    j += 1
j = 0
movie_old2new_id_dict = dict()
for i in unique_MovieID:
    movie_old2new_id_dict[i] = j
    j += 1
    
# Then, use the generated dictionaries to reindex UserID and MovieID in the data_df
user_list = data_df['UserID'].values
movie_list = data_df['MovieID'].values
for j in range(len(data_df)):
    user_list[j] = user_old2new_id_dict[user_list[j]]
    movie_list[j] = movie_old2new_id_dict[movie_list[j]]
data_df['UserID'] = user_list
data_df['movieID'] = movie_list

# generate train_df with 70% samples and test_df with 30% samples, and there should have no overlap between them.
train_index = np.random.random(len(data_df)) <= 0.7
train_df = data_df[train_index]
test_df = data_df[~train_index]

# generate train_mat and test_mat
num_user = len(data_df['UserID'].unique())
num_movie = len(data_df['MovieID'].unique())

train_mat = coo_matrix((train_df['Rating'].values, (train_df['UserID'].values, train_df['MovieID'].values)), shape=(num_user, num_movie)).astype(float).toarray()
test_mat = coo_matrix((test_df['Rating'].values, (test_df['UserID'].values, test_df['MovieID'].values)), shape=(num_user, num_movie)).astype(float).toarray()

  data_df = pd.read_csv('./ratings.dat', sep='::', names=["UserID", "MovieID", "Rating", "Timestamp"])


## 1a: Build the Baseline Estimation Model

First, let's implement a simple personalized recommendation model -- the baseline estimate : $b_{u,i}=\mu+b_i+b_u$, where $\mu$ is the overall mean rating for all items, $b_u$ = average rating of user $u-\mu$, $b_i$ = average rating of item $i-\mu$. We store the prediction as a numpy array variable 'prediction_mat' of size (#users, #movies) with each entry showing the predicted rating for the corresponding user-movie pair.

In [2]:
# calculate the prediction_mat by the baseline estimation recommendation algorithm

#  Convert to nans for convenience
train_mat[train_mat==0] = np.nan

total_rating_avg = np.nanmean(train_mat)
#prediction_mat = np.empty([num_user, num_movie])

u = np.empty([num_user, num_movie])
u.fill(total_rating_avg)
b_i = np.nanmean(train_mat, axis = 0).reshape(1,num_movie) - u[0].reshape(1,num_movie)
b_u = np.nanmean(train_mat, axis = 1).reshape(num_user, 1) - u[:,0].reshape(num_user, 1)

prediction_mat = u + b_u + b_i

prediction_mat, prediction_mat.shape

  b_i = np.nanmean(train_mat, axis = 0).reshape(1,num_movie) - u[0].reshape(1,num_movie)


(array([[4.95751623, 4.04074232, 4.70622401, ..., 1.56084956, 5.56084956,
                nan],
        [4.53044856, 3.61367465, 4.27915634, ..., 1.13378189, 5.13378189,
                nan],
        [4.57223484, 3.65546094, 4.32094262, ..., 1.17556818, 5.17556818,
                nan],
        ...,
        [4.81465908, 3.89788518, 4.56336687, ..., 1.41799242, 5.41799242,
                nan],
        [4.68645396, 3.76968005, 4.43516174, ..., 1.28978729, 5.28978729,
                nan],
        [4.34684793, 3.43007402, 4.09555571, ..., 0.95018126, 4.95018126,
                nan]]),
 (6040, 3706))

Now, with this prediction_mat based on the baseline estimate, let's use RMSE to evaluate the quality of the baseline estimate model. 

In [3]:
# calculate and print out the RMSE for your prediction_df and the test_df

prediction_mat[np.isnan(prediction_mat)] = 0

indicator_mat = (test_mat > 0).astype(float)
test_rmse = (np.sum(((prediction_mat - test_mat) * indicator_mat) ** 2) / np.sum(indicator_mat)) ** 0.5

test_rmse

0.938653322433489

## 1b: User-user Collaborative Filtering with Jaccard Similarity 

In this part, we need to build a user-user collaborative filtering recommendation model with **Jaccard similarity** to predict user-movie ratings. 

The prediction of the score for a user-item pair $(u,i)$ should use the formulation: $p_{u,i}=\bar{r}_u+\frac{\sum_{u^\prime\in N}s(u,u^\prime)(r_{u^\prime,i}-\bar{r}_{u^\prime})}{\sum_{u^\prime\in N}|s(u, u^\prime)|}$, where $s(u, u^\prime)$ is the Jaccard similarity. We set the size of $N$ as 10.


In [4]:
# calculate the prediction_mat by your user-user collaborative filtering recommendation algorithm

train_mat[train_mat==0] = np.nan

bool_mat = (~np.isnan(train_mat)).astype(float)
    
num_rating_per_user = np.sum(bool_mat, axis=1, keepdims=True)
numerator = np.matmul(indicator_mat, indicator_mat.T)
denominator = num_rating_per_user + num_rating_per_user.T - numerator
denominator[denominator==0] = 1
jaccard_table = numerator/denominator

jaccard_table, jaccard_table.shape


(array([[0.34615385, 0.        , 0.        , ..., 0.        , 0.00892857,
         0.00374532],
        [0.        , 0.21794872, 0.        , ..., 0.00952381, 0.00581395,
         0.00923077],
        [0.        , 0.        , 0.375     , ..., 0.        , 0.        ,
         0.00377358],
        ...,
        [0.        , 0.00952381, 0.        , ..., 0.69230769, 0.01136364,
         0.00826446],
        [0.00892857, 0.00581395, 0.        , ..., 0.01136364, 0.40540541,
         0.01967213],
        [0.00374532, 0.00923077, 0.00377358, ..., 0.00826446, 0.01967213,
         0.30167598]]),
 (6040, 6040))

In [5]:
# will contain avg rating for each user

train_mat[train_mat==0] = np.nan
user_means = np.nanmean(train_mat, axis = 1).T

prediction_mat = np.empty([num_user, num_movie])

for user_id in range(num_user):

    neighborhood = (-jaccard_table[user_id]).argsort()[:10]
    p_u = np.zeros([1, num_movie])
    denominator = 0

    for other_user in neighborhood:

        r_u_prime = train_mat[other_user]
        r_u_prime[np.isnan(r_u_prime)] = user_means[other_user]

        jaccard_sim = jaccard_table[user_id][other_user]
        diff_mean = r_u_prime - np.full((1,num_movie),user_means[other_user])
        
        p_u += jaccard_sim * diff_mean
        denominator += abs(jaccard_sim)


    prediction_mat[user_id] = p_u/denominator + user_means[user_id]

prediction_mat.shape

(6040, 3706)

In [6]:
# calculate and print out the RMSE for prediction_df and the test_df

prediction_mat[np.isnan(prediction_mat)] = 0

indicator_mat = (test_mat > 0).astype(float)
test_rmse = (np.sum(((prediction_mat - test_mat) * indicator_mat) ** 2) / np.sum(indicator_mat)) ** 0.5

test_rmse

1.0122187582515219

## 1c: Improve the Collaborative Filtering Model

Now we try to see if we can improve Collaborative Filtering with different similarity metrics + item-item comparisons


In [7]:
# Create item item cosine table
from sklearn.metrics.pairwise import cosine_similarity

cosine_movie = np.zeros(shape=(num_movie, num_movie))

for movie_id in range(num_movie):
    for other_movie in range(num_movie):
        cosine_movie[movie_id][other_movie] = (np.dot(train_mat[movie_id], train_mat[other_movie])/(np.linalg.norm(train_mat[movie_id])*np.linalg.norm(train_mat[other_movie])))    

In [8]:
train_mat[train_mat==0] = np.nan
train_mat = train_mat.T
movie_means = np.nanmean(train_mat, axis = 0)
prediction_mat = np.empty([num_movie, num_user])

for movie_id in range(num_movie):

    neighborhood = (-cosine_movie[movie_id]).argsort()[:10]
    numerator = np.zeros([1, num_user])
    denominator = 0

    for other_movie in neighborhood:

        r_m_prime = train_mat[other_movie]
        r_m_prime[np.isnan(r_m_prime)] = movie_means[other_movie]

        cosine_sim = cosine_movie[movie_id][other_movie]
        numerator += cosine_sim * r_m_prime

        denominator += abs(cosine_sim)


    prediction_mat[movie_id] = numerator/denominator 


prediction_mat = prediction_mat.T
train_mat = train_mat.T
prediction_mat.shape



(6040, 3706)

In [9]:
prediction_mat[np.isnan(prediction_mat)] = 0

indicator_mat = (test_mat > 0).astype(float)
test_rmse = (np.sum(((prediction_mat - test_mat) * indicator_mat) ** 2) / np.sum(indicator_mat)) ** 0.5

test_rmse

1.0390849247045915

# 2. Recommendations with implicit feedback

 Now we implement a user-user collaborative filtering algorithm for recommendation.

Explicit -> Implicit

In [10]:
train_mat = (train_mat > 0).astype(float)
test_mat = (test_mat > 0).astype(float)

The predicted preference score from user $u$ to movie $i$ can be calculated as: $p_{u,i}=\frac{\sum_{u^\prime\in N}s(u,u^\prime)r_{u^\prime,i}}{\sum_{u^\prime\in N}|s(u,u^\prime)|}$, where $s(u,u^\prime)$ is the cosine similarity, and we set the size of $N$ as 10.

Using this knowledge, generate our ranked lists for top 50 movies/user

In [13]:
# Create user user cosine table
from sklearn.metrics.pairwise import cosine_similarity

cosine_user = np.zeros(shape=(num_user, num_user))

for user_id in range(num_user):
    for other_user in range(num_user):
        cosine_user[user_id][other_user] = (np.dot(train_mat[user_id], train_mat[other_user])/(np.linalg.norm(train_mat[user_id])*np.linalg.norm(train_mat[other_user])))    

In [14]:
cosine_user

array([[1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       ...,
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.]])

In [22]:
prediction_mat = np.empty([num_user, num_movie])
#user_means = np.nanmean(train_mat, axis = 1).T

for user_id in range(num_user):

    neighborhood = (-cosine_user[user_id]).argsort()[:10]
    numerator = np.zeros([1, num_movie])
    denominator = 0

    for other_user in neighborhood:

        r_u_prime = train_mat[other_user]
        #r_u_prime[np.isnan(r_u_prime)] = user_means[other_user]

        cosine_sim = cosine_user[user_id][other_user]
        numerator += cosine_sim * r_u_prime

        denominator += abs(cosine_sim)


    prediction_mat[user_id] = numerator/denominator 


In [23]:
prediction_mat

array([[1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       ...,
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.]])

In [19]:
prediction_mat.astype(int)

top_50_ranked = np.empty([num_user, 50])

for user_id in range(num_user):
    top_50_ranked[user_id] = np.argpartition(prediction_mat[user_id], -50)[-50:]

top_50_ranked

array([[1259., 1260., 1251., ..., 1220., 1219., 3705.],
       [1259., 1260., 1251., ..., 1220., 1219., 3705.],
       [1259., 1260., 1251., ..., 1220., 1219., 3705.],
       ...,
       [1259., 1260., 1251., ..., 1220., 1219., 3705.],
       [1259., 1260., 1251., ..., 1220., 1219., 3705.],
       [1259., 1260., 1251., ..., 1220., 1219., 3705.]])

Evaulate our recommender with recall and precision at k

In [20]:
# Calculate recall@k, precision@k with k=5, 20, 50 and print out the average over all users for these 6 metrics.

def recall_k(user_id, k):

    num_relevant = 0

    for i in range(k):

        # get top k movie ids for each user
        movie_id = int(top_50_ranked[user_id][i])

        # count all relevant values that are in top k
        if test_mat[user_id][movie_id] == 1:
            num_relevant += 1
    
    # count all relevant values
    total_relevant = np.count_nonzero(test_mat[user_id]==1)

    return num_relevant/total_relevant

def precision_k(user_id, k):

    num_relevant = 0

    for rank in range(k):

        # get top k movie ids for each user
        movie_id =int(top_50_ranked[user_id][rank])

        # count all relevant values in top k
        if test_mat[user_id][movie_id] ==1:
            num_relevant += 1
    
    return num_relevant/k
validation_df = {}

for k in [5,20,50]:

        # make empty lists to contain precision and recal scores
        recall_list = []
        precision_list = []

        for user in range(num_user):

            # check that a user in test_mat does not have all nan values
            if np.all(test_mat[user]==0) == False:

                # add recall and precision scores to list
                recall_list.append(recall_k(user,k))
                precision_list.append(precision_k(user,k))

        # add lists to dictionary
        validation_df[f"recall_{k}"] = recall_list
        validation_df[f"precision_{k}"] = precision_list

validation_df = pd.DataFrame(validation_df)
validation_df.mean(axis=0)

recall_5        0.001675
precision_5     0.020927
recall_20       0.006730
precision_20    0.018932
recall_50       0.016907
precision_50    0.018821
dtype: float64