# Basic recommender

We implement a very simple recommender system on the MovieLens dataset.

# 1. Recommendations with User Ratings

In this first part, we build a non-personalized recommender based on user ratings. On many online platforms, such as Amazon, IMDb, and MovieLens, users can express their preferences for items through explicit ratings (for example, by assigning a 1-5 star rating to a movie). We use those ratings to generate recommendations.

For this part, we will:

* load and process the MovieLens 1M dataset, 
* build the non-personalized recommender, and 
* evaluate the recommender.

## 1a: Load and process the data

In [1]:
import pandas as pd

# a multi-character separator such as '::' requires the Python parsing engine
data_df = pd.read_csv('./ratings.dat', sep='::', engine='python', names=["UserID", "MovieID", "Rating", "Timestamp"])
data_df.head()


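|   | UserID | MovieID | Rating | Timestamp |
|---|--------|---------|--------|-----------|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |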


Next, we check the dimensions of the dataset:

In [2]:
# count and print how many unique users, unique movies, and ratings in this dataset
# Your Code Here...
print("unique users: ", data_df['UserID'].unique().size)
print("unique movies: ", data_df['MovieID'].unique().size)
print("total ratings: ", data_df['Rating'].size)


unique users:  6040
unique movies:  3706
total ratings:  1000209
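
These counts also tell us how sparse the user-movie rating matrix will be. Below is a quick back-of-the-envelope calculation using the numbers printed above; the variable names are illustrative and not part of the notebook.

```python
# Counts copied from the output above
num_users = 6040
num_movies = 3706
num_ratings = 1000209

# Fraction of all possible (user, movie) pairs that actually have a rating
density = num_ratings / (num_users * num_movies)
print(f"density: {density:.4f}")
```

Only about 4.5% of the possible user-movie pairs carry a rating, so most entries of a dense matrix built from this data will be missing.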


Because Python indexing starts at 0, it is more convenient for user and movie ids to start at 0 as well. We also need UserID and MovieID to be contiguous (no gaps) so that they can serve directly as row and column indices, so in the next cell we reindex both.

In [3]:
# First, generate dictionaries mapping old ids to new ids for users and movies
unique_MovieID = data_df['MovieID'].unique()
unique_UserID = data_df['UserID'].unique()
user_old2new_id_dict = {u: j for j, u in enumerate(unique_UserID)}
movie_old2new_id_dict = {i: j for j, i in enumerate(unique_MovieID)}

# Then, use the generated dictionaries to reindex UserID and MovieID in data_df
data_df['UserID'] = data_df['UserID'].map(user_old2new_id_dict)
data_df['MovieID'] = data_df['MovieID'].map(movie_old2new_id_dict)
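
As an alternative to the dictionary-based reindexing above, the same kind of gap-free, 0-based ids can be obtained with `np.unique`. This is a minimal sketch on made-up ids, not the notebook's method. It assigns new ids in sorted order of the old ids rather than in order of first appearance; either choice yields contiguous ids.

```python
import numpy as np

# Toy ids with gaps, standing in for a raw id column (made-up values)
raw_ids = np.array([10, 42, 10, 7, 42, 99])

# np.unique returns the sorted unique ids; return_inverse gives, for each
# element of raw_ids, its position in that sorted array, which is a
# 0-based id without gaps
unique_ids, new_ids = np.unique(raw_ids, return_inverse=True)

print(unique_ids)  # [ 7 10 42 99]
print(new_ids)     # [1 2 1 0 2 3]
```

Indexing `unique_ids[new_ids]` recovers the original ids, which is handy when recommendations need to be mapped back to real movie ids.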

Now we can randomly split the data into train and test sets.

In [224]:
import numpy as np
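# generate train_df with 70% of the samples and test_df with the remaining 30%, with no overlap between them.
# Your Code Here...

train_df = data_df.sample(frac=0.7, random_state=0)
test_df = data_df.drop(train_df.index)

# note: DataFrame.size counts elements (rows x 4 columns), not rows
print(train_df.size)
print(test_df.size)

# the two sets share no rows and together cover the whole dataset
if train_df.index.intersection(test_df.index).empty and train_df.size + test_df.size == data_df.size:
    print("no overlap!")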



2800584
1200252
no overlap!


We want to work with NumPy arrays, so we convert the ratings into dense user-by-movie matrices in which missing entries are filled with zeros.

In [238]:
from scipy.sparse import coo_matrix

num_user = len(data_df['UserID'].unique())
num_movie = len(data_df['MovieID'].unique())

train_mat = coo_matrix((train_df['Rating'].values, (train_df['UserID'].values, train_df['MovieID'].values)), shape=(num_user, num_movie)).toarray().astype(float)
test_mat = coo_matrix((test_df['Rating'].values, (test_df['UserID'].values, test_df['MovieID'].values)), shape=(num_user, num_movie)).toarray().astype(float)
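
To make the layout of these matrices concrete, here is a small sketch that builds the same kind of dense user-by-movie array from (user, movie, rating) triples using plain NumPy indexing. The triples are made up for illustration. One difference is worth noting: `coo_matrix` sums duplicate (user, movie) entries on conversion, whereas the indexed assignment below keeps the last one. With one rating per pair, as assumed here, the two give the same matrix.

```python
import numpy as np

# Toy (user, movie, rating) triples with already-reindexed ids (made-up values)
users = np.array([0, 0, 1, 2])
movies = np.array([0, 2, 1, 2])
ratings = np.array([5, 3, 4, 1])

# Dense user-by-movie matrix; entries without a rating stay 0
toy_mat = np.zeros((3, 3), dtype=float)
toy_mat[users, movies] = ratings
print(toy_mat)
```

Row `u` holds all ratings of user `u`, and column `m` holds all ratings of movie `m`.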

## 1b: Build the non-personalized recommender

This model is very simple: for each movie, we calculate the average rating of this movie in the training dataset, and use this average rating as the prediction for all users with respect to this movie. In this way, the prediction will be the same across all users, i.e., it is non-personalized.


In [241]:

# mark unrated entries (stored as 0) as NaN so that the nan-aware functions ignore them
train_mat[train_mat==0] = np.nan
test_mat[test_mat==0] = np.nan

# global average over all observed training ratings
total_rating_avg = np.nanmean(train_mat)

# movies with no training ratings have an all-NaN column; fill it with the global average
for movie_id in range(num_movie):
    movie_column = train_mat[:,movie_id]
    if np.count_nonzero(np.isnan(movie_column)) == num_user:
        train_mat[:,movie_id] = total_rating_avg

# initialize the per-movie prediction array to be filled
prediction_row = np.zeros((num_movie),dtype=float)

# average rating of each movie (column), ignoring NaNs
for column in range(num_movie):
    rating_mean = np.nanmean(train_mat[:,column])
    prediction_row[column] = rating_mean

# repeat the same row for every user
prediction_mat = np.array([prediction_row for i in range(num_user)])

prediction_mat.shape


(6040, 3706)
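
The per-movie average can also be computed without explicit loops. The sketch below works on a small made-up matrix and, unlike the cell above, leaves the training matrix unmodified; the names `toy_train`, `movie_mean`, and `toy_pred` are illustrative only.

```python
import numpy as np

# Toy training matrix: 3 users x 3 movies, NaN marks "no rating" (made-up values)
toy_train = np.array([
    [5.0,    np.nan, np.nan],
    [3.0,    4.0,    np.nan],
    [np.nan, 2.0,    np.nan],
])

# Global mean over all observed ratings, used for movies nobody rated
global_mean = np.nanmean(toy_train)

# Per-movie count and sum of observed ratings
counts = (~np.isnan(toy_train)).sum(axis=0)
sums = np.nansum(toy_train, axis=0)

# Per-movie mean, falling back to the global mean where the count is 0
movie_mean = np.where(counts > 0, sums / np.maximum(counts, 1), global_mean)

# The same prediction row is repeated for every user
toy_pred = np.tile(movie_mean, (toy_train.shape[0], 1))
print(movie_mean)  # [4.  3.  3.5]
```

The third movie has no ratings, so its prediction is the global mean of 3.5.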

Top 5 movie ids by predicted rating:

In [217]:
# sort prediction_row in descending order and grab the first 5 indices
# (prediction_row is used instead of prediction_mat because every row of the matrix is identical)
top_5_indices = (-prediction_row).argsort()[:5]

for movie_id in top_5_indices:
    print(movie_id, prediction_row[movie_id])


3610 5.0
3701 5.0
3418 5.0
3693 5.0
2562 5.0


## 1c: Evaluate the non-personalized recommender

In [242]:
# get rmse for predictions

def get_rmse(prediction_mat, test_mat):
    # sqrt of average of squared differences

    sum_squared_diff = 0
    n = 0

    for user_id in range(num_user):
        for movie_id in range(num_movie):

            actual = test_mat[user_id][movie_id]
            predict = prediction_mat[user_id][movie_id]

            if not np.isnan(actual):

                sum_squared_diff += (actual - predict)**2
                n += 1

    return np.sqrt(sum_squared_diff/n)

rmse = get_rmse(prediction_mat,test_mat)
rmse

0.9807496351980143
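
The nested loops above visit every user-movie pair, which is slow for a 6040 x 3706 matrix. A vectorized variant using a boolean mask is sketched below. `rmse_masked` is a hypothetical helper name, and because the summation order differs, its result may differ from the loop version in the last few floating-point digits.

```python
import numpy as np

def rmse_masked(pred, actual):
    """RMSE over the entries of `actual` that are not NaN."""
    mask = ~np.isnan(actual)          # True where a test rating exists
    diff = actual[mask] - pred[mask]  # errors on observed entries only
    return np.sqrt(np.mean(diff ** 2))

# Toy check (made-up values): two observed test ratings with errors 1 and 0
toy_actual = np.array([[4.0, np.nan], [np.nan, 2.0]])
toy_pred = np.array([[3.0, 5.0], [1.0, 2.0]])
print(rmse_masked(toy_pred, toy_actual))
```

In the toy check the squared errors are 1 and 0, so the result is the square root of 0.5, about 0.707.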

# 2. Recommendations with implicit feedback (50 points total)

In many scenarios, we may not have explicit ratings, but we often have plenty of implicit feedback. For this part, we're going to build a simple non-personalized implicit recommendation algorithm. Since feedback like user clicks, purchases, and views is much more widespread than explicit ratings, implicit recommenders offer great opportunities for far-reaching impact.

Concretely, the task of implicit recommendation is to recommend items to users based on implicit signals from users, i.e., we only know what items a user is interested in, but have no idea what items the user dislikes. So for this case, the dataset we could use for this implicit recommendation experiment only contains binary data with 1 representing that the user likes the item, and with 0 representing that we don't know the user's preference towards the item. Because of this, we cannot use the same evaluation method as explicit recommendation. Instead, we need to evaluate the implicit recommendation quality by a ranking task.


## 2a: Process the data

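We treat every observed rating as positive implicit feedback: wherever a rating exists, we set the entry to 1. Unrated entries remain NaN.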



In [243]:
# Your Code Here...

# every observed (non-NaN) entry becomes 1
train_mat[~np.isnan(train_mat)] = 1
test_mat[~np.isnan(test_mat)] = 1

train_mat.shape

(6040, 3706)

## 2b: Build the non-personalized recommender

In this part, we build a non-personalized recommendation model that provides a ranked list of 50 movies for each user. The model is very simple: for each user, we rank the unwatched movies by their **popularity**, where a movie's popularity is the number of implicit feedback signals it received. Although the recommender is non-personalized, the results may differ across users because each user has a different set of unwatched movies.

In [1]:
# Generate a ranked list of movies with the popularity-based recommendation algorithm.

# for each user, 50 (movie id, popularity) pairs
train_user_ranking = np.empty((num_user,50),dtype=(int,2))

# popularity of each movie: number of users with feedback, ignoring NaNs
movie_ranking = np.nansum(train_mat, axis=0)
movie_ranking = movie_ranking.astype(int)

# np.sort has no descending option, so each user's candidates are sorted
# as a Python list with reverse=True
# possible optimization: avoid the full sort, e.g. with insort or a partial sort

for user_index in range(num_user):

    # indices of unwatched (NaN) movies; np.where returns a tuple, hence the [0]
    user_unwatched = np.where(np.isnan(train_mat[user_index]))[0]

    # list of (movie id, popularity) for all unwatched movies, sorted by popularity in descending order
    user_unwatched_ranked = [(movie_id, movie_ranking[movie_id]) for movie_id in user_unwatched]
    user_unwatched_ranked.sort(key = lambda x: x[1], reverse=True)

    # keep the top 50 pairs and store them in this user's row
    train_user_ranking[user_index] = np.asarray(user_unwatched_ranked[:50],dtype=(int,2))


# print out the id and popularity of the top5 movies for the first user.

for i in range(5):
    print(train_user_ranking[0][i])

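
The popularity ranking can be illustrated on a tiny matrix. The sketch below is not the notebook's implementation: `recommend` is a hypothetical helper that scores watched movies as negative infinity so that they sort last, then returns the ids of the top-k unwatched movies. All data values are made up.

```python
import numpy as np

# Toy implicit matrix: 1 = feedback seen in training, NaN = unknown (made-up values)
toy_train = np.array([
    [1.0,    np.nan, np.nan, 1.0],
    [1.0,    1.0,    np.nan, np.nan],
    [np.nan, 1.0,    np.nan, 1.0],
    [1.0,    np.nan, 1.0,    np.nan],
])

# Popularity = number of users with feedback for each movie
popularity = np.nansum(toy_train, axis=0)

def recommend(user_row, popularity, k):
    """Ids of the k most popular movies the user has not interacted with."""
    unwatched = np.isnan(user_row)
    # Watched movies get -inf so they end up at the bottom of the ranking
    scores = np.where(unwatched, popularity, -np.inf)
    # Stable sort on negated scores: descending popularity, ties by lower id
    order = np.argsort(-scores, kind="stable")
    # Never return more items than there are unwatched movies
    return order[:min(k, int(unwatched.sum()))]

print(recommend(toy_train[2], popularity, 2))  # [0 2]
```

User 2 has not seen movies 0 and 2; movie 0 has popularity 3 and movie 2 has popularity 1, so movie 0 is ranked first.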

## 2c: Evaluate the non-personalized recommender 

In this part, we evaluate the non-personalized recommender on the held-out test set test_mat for each user. For implicit recommendation, two typical metrics are recall@k and precision@k. Here, we calculate recall@k and precision@k for k=5, 20, 50 for each user, i.e., six metrics for every user.
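
As a concrete illustration of the two metrics: precision@k is the fraction of the top-k recommended items that appear in the user's held-out items, and recall@k is the fraction of the user's held-out items that appear in the top k. The sketch below uses made-up item ids and a hypothetical helper name.

```python
def precision_recall_at_k(ranked_items, relevant_items, k):
    """Return (precision@k, recall@k) for a single user."""
    # Number of top-k recommendations that are actually relevant
    hits = len(set(ranked_items[:k]) & set(relevant_items))
    precision = hits / k
    # Guard against users with no relevant items (assumption: score them 0)
    recall = hits / len(relevant_items) if relevant_items else 0.0
    return precision, recall

ranked = [10, 3, 7, 1, 8]   # hypothetical recommendation list, best first
relevant = {3, 8, 42, 99}   # hypothetical held-out items for this user

print(precision_recall_at_k(ranked, relevant, 5))  # (0.4, 0.5)
```

Two of the five recommendations are relevant (precision 2/5), and they cover two of the four held-out items (recall 2/4). The notebook's evaluation skips users without any held-out items instead of scoring them as 0.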

In [256]:
# Calculate recall@k and precision@k with k=5, 20, 50 and print out the average over all users for these 6 metrics.

def recall_k(user_id, k):

    num_relevant = 0

    for i in range(k):

        # get top k movie ids for each user
        predicted = train_user_ranking[user_id][i]
        movie_id = predicted[0]

        # count all relevant values that are in top k
        if test_mat[user_id][movie_id] == 1:
            num_relevant += 1
    
    # count all relevant values
    total_relevant = np.count_nonzero(test_mat[user_id]==1)

    return num_relevant/total_relevant

        
def precision_k(user_id, k):

    num_relevant = 0

    for rank in range(k):

        # get top k movie ids for each user
        predicted = train_user_ranking[user_id][rank]
        movie_id = predicted[0]

        # count all relevant values in top k
        if test_mat[user_id][movie_id] ==1:
            num_relevant += 1
    
    return num_relevant/k


validation_df = {}

for k in [5,20,50]:

    # empty lists to hold precision and recall scores
    recall_list = []
    precision_list = []

    for user in range(num_user):

        # skip users with no ratings in test_mat (all NaN)
        if not np.all(np.isnan(test_mat[user])):

            # add recall and precision scores to the lists
            recall_list.append(recall_k(user,k))
            precision_list.append(precision_k(user,k))

    # add lists to dictionary
    validation_df[f"recall_{k}"] = recall_list
    validation_df[f"precision_{k}"] = precision_list

validation_df = pd.DataFrame(validation_df)
validation_df.mean(axis=0)

recall_5        0.037529
precision_5     0.273808
recall_20       0.108869
precision_20    0.211283
recall_50       0.194742
precision_50    0.157930
dtype: float64

