#### CSCE 670 :: Information Storage and Retrieval :: Texas A&M University :: Spring 2020


# Homework 3:   Recommender System Practice: Rating Prediction and Top-K Item Recommendation

### 100 points [6% of your final grade]

### Due: April 10, 2020

*Goals of this homework:* Understand matrix factorization (MF) using explicit feedback and Bayesian Personalized Ranking (BPR) using implicit feedback for recommendation. Explore different methods for two real-world recommendation scenarios: rating prediction and top-K item recommendation.

*Submission instructions (eCampus):* To submit your homework, rename this notebook as `UIN_hw3.ipynb`. For example, my homework submission would be something like `555001234_hw3.ipynb`. Submit this notebook via eCampus (look for the homework 3 assignment there). Your notebook should be completely self-contained, with the results visible in the notebook. We should not have to run any code from the command line, nor should we have to run your code within the notebook (though we reserve the right to do so). So please run all the cells for us, and then submit.

*Late submission policy:* For this homework, you may use as many late days as you like (up to the total late days you have remaining).

*Collaboration policy:* You are expected to complete each homework independently. Your solution should be written by you without the direct aid or help of anyone else. However, we believe that collaboration and teamwork are important for facilitating learning, so we encourage you to discuss problems and general problem approaches (but not actual solutions) with your classmates. You may post on Piazza, search StackOverflow, etc. But if you do get help in this way, you must inform us by **filling out the Collaboration Declarations at the bottom of this notebook**.

*Example: I found helpful code on stackoverflow at https://stackoverflow.com/questions/11764539/writing-fizzbuzz that helped me solve Problem 2.*

The basic rule is that no student should explicitly share a solution with another student (and thereby circumvent the basic learning process), but it is okay to share general approaches, directions, and so on. If you feel like you have an issue that needs clarification, feel free to contact either me or the TA.

# Part 1. Matrix Factorization for Rating Prediction (70 points total)

On some platforms, such as MovieLens, users express their preferences for items using explicit feedback like ratings.

In this part, you will implement matrix factorization to predict ratings on MovieLens data. After removing users who left fewer than 20 ratings and movies with fewer than 20 ratings, the provided dataset has only ~1,200 items and ~500 users. You can also check the title and genres of each movie in *movies_info.csv*.
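The filtering described here could be reproduced roughly as follows. This is only a sketch: the helper `filter_min_ratings`, the column names `user`, `movie`, `rating`, and the choice to repeat the filter until nothing more is removed are all assumptions for illustration, since the preprocessing actually used for the provided files is not shown in this notebook.

```python
import pandas as pd

def filter_min_ratings(ratings, min_count=20):
    """Drop users and movies with fewer than min_count ratings (hypothetical helper).

    Removing a movie can push a user below the threshold (and vice versa),
    so the filter is repeated until no more rows are dropped.
    """
    while True:
        user_counts = ratings["user"].map(ratings["user"].value_counts())
        movie_counts = ratings["movie"].map(ratings["movie"].value_counts())
        kept = ratings[(user_counts >= min_count) & (movie_counts >= min_count)]
        if len(kept) == len(ratings):
            return kept
        ratings = kept

# Toy data with a threshold of 2 instead of 20.
toy = pd.DataFrame({
    "user":   [1, 1, 2, 2, 3],
    "movie":  [10, 11, 10, 11, 12],
    "rating": [4, 5, 3, 4, 2],
})
filtered = filter_min_ratings(toy, min_count=2)
print(filtered)
```

On the toy data, user 3 and movie 12 each have a single rating, so that row is the only one removed.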

## Part 1a: Load the Data (5 points)

Please download the dataset from Piazza. There are about 65,000 ratings in total. We split the rating data into two sets. You will train with 70% of the data (in *train_movie.csv*) and test on the remaining 30% of the data (in *test_movie.csv*). Each line of the train and test files has this format: UserID, MovieID, Rating.

First you will need to load the data and store it with any structure you like. Please report the numbers of unique users and movies in the dataset. 

In [2]:
# load the data, then print out the number of
# movies and users in each of train and test sets.
# Your Code Here...

import pandas as pd

def load_data(file_path):
    data = pd.read_csv(file_path, sep=',', header=None, skiprows=1)
    data.columns = ['user','movie','rating']
    return data

train = load_data('train_movie.csv')
test = load_data('test_movie.csv')

number_users = len(train.user.unique())
number_movies = len(train.movie.unique())
print("Number of users in train set is:" , number_users)
print("Number of movies in train set is:" , number_movies)

number_users = len(test.user.unique())
number_movies = len(test.movie.unique())
print("Number of users in test set is:" , number_users)
print("Number of movies in test set is:" , number_movies)

Number of users in train set is: 541
Number of movies in train set is: 1211
Number of users in test set is: 541
Number of movies in test set is: 1211


## Part 1b: Matrix Factorization (40 points)

In class, we introduced how matrix factorization works for recommendation. Now it is your turn to implement it. There are different methods to obtain the latent factor matrices **P** and **Q**, like gradient descent, Alternating Least Squares (ALS), and so on. Pick one of them and implement your MF model. *You can refer to tutorials and resources online. Remember our **collaboration policy**: you need to inform us of the resources you refer to.* 

Please report MAE and RMSE of your MF model for the test set.
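As a reference point for the two metrics, here is a minimal NumPy sketch. The function names `mae` and `rmse` and the toy numbers are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def mae(actual, predicted):
    """Mean Absolute Error: average of |actual - predicted|."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs(actual - predicted))

def rmse(actual, predicted):
    """Root Mean Squared Error: square root of the average squared error."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean((actual - predicted) ** 2))

# Toy ratings: the errors are 0.5, 0 and 1.
toy_actual = [4, 3, 5]
toy_predicted = [3.5, 3, 4]
toy_mae = mae(toy_actual, toy_predicted)
toy_rmse = rmse(toy_actual, toy_predicted)
print("MAE =", toy_mae, "RMSE =", toy_rmse)
```

Because RMSE squares each error before averaging, it penalizes large misses more heavily than MAE does.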

In [4]:
# Your Code Here...
# Report Mean Absolute Error and Root Mean Squared Error for test

data_matrix = train.pivot_table(index='movie', columns='user', values='rating').fillna(0)

In [6]:
import numpy as np

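def MF(dataMat, k, alpha, beta, maxIter):
    # Factorize dataMat (movies x users) as p @ q by gradient descent.
    # p (m x k) and q (k x n) are the module-level arrays initialized in the
    # next cell; they are updated in place. alpha is the learning rate and
    # beta the L2 regularization weight. Zero entries count as unobserved.
    m, n = dataMat.shape
    for step in range(maxIter):
        for i in range(m):
            for j in range(n):
                if dataMat[i, j] > 0:
                    # Prediction error on this observed rating.
                    error = dataMat[i, j]
                    for r in range(k):
                        error = error - p[i, r] * q[r, j]
                    # Gradient step on both factor vectors.
                    for r in range(k):
                        p[i, r] = p[i, r] + alpha * (2 * error * q[r, j] - beta * p[i, r])
                        q[r, j] = q[r, j] + alpha * (2 * error * p[i, r] - beta * q[r, j])
        # Regularized squared error, accumulated over all observed ratings.
        loss = 0.0
        for i in range(m):
            for j in range(n):
                if dataMat[i, j] > 0:
                    pred = 0.0
                    for r in range(k):
                        pred = pred + p[i, r] * q[r, j]
                    loss = loss + np.power((dataMat[i, j] - pred), 2)
                    for r in range(k):
                        loss = loss + beta * (p[i, r] * p[i, r] + q[r, j] * q[r, j]) / 2
        if loss < 0.001:
            break
    return p, q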


In [7]:
matrix = np.array(data_matrix)
m, n = np.shape(data_matrix)
k = 2

p = np.random.rand(m, k)
q = np.random.rand(k, n)


P, Q = MF(matrix, k, 0.0003, 0.04, 100)
R = np.dot(P, Q)

In [14]:
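def predict_ratings():
    # Look up the predicted rating for every (movie, user) pair in the test set.
    # R is indexed by raw ID here, which assumes that movie and user IDs
    # coincide with the row and column positions of data_matrix.
    actual_ratings = []
    predicted_ratings = []
    for index, row in test.iterrows():
        actual_ratings.append(row["rating"])
        predicted_ratings.append(R[int(row["movie"]), int(row["user"])])
    return predicted_ratings, actual_ratings

predicted_ratings, actual_ratings = predict_ratings()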

In [20]:
import math

def RMS(act, pred):
    t = 0
    for i in range(len(act)):
        t = t + pow((act[i] - pred[i]),2)
    t = t / len(act)
    t = math.sqrt(t)
    return t

def MAE(act, pred):
    t = 0
    for i in range(len(act)):
        t = t + abs((act[i] - pred[i]))
    t = t / len(act)
    return t

print("RMSE =", RMS(actual_ratings, predicted_ratings))
print("MAE =", MAE(actual_ratings, predicted_ratings))

RMSE = 0.8873527538813737
MAE = 0.6784261394149478


Which method did you use to obtain **P** and **Q**? What are the advantages and disadvantages of the method you picked? *Provide a brief (1-2 paragraph) discussion based on these questions.*

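I used gradient descent with L2 regularization to obtain P and Q, updating the factors once for each observed rating.

The advantage of gradient descent is that it needs only first-order gradients, with no Hessian matrix to compute, so each iteration is cheap and the method is simple to implement.

The disadvantage is its sensitivity to the learning rate: if the rate is too small, convergence takes many iterations, and if it is too large, the updates can diverge.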
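For reference, the update lines in `MF` can be read as gradient steps on a regularized squared error. The notation below is an assumption chosen to match the code: $r_{ij}$ is the observed rating of movie $i$ by user $j$, $e_{ij}$ the prediction error, $\alpha$ the learning rate, and $\beta$ the regularization weight.

$$
\begin{aligned}
e_{ij} &= r_{ij} - \sum_{r=1}^{k} p_{ir}\, q_{rj},\\
L &= \sum_{(i,j)\,:\, r_{ij} > 0} \Big[\, e_{ij}^2 + \frac{\beta}{2} \sum_{r=1}^{k} \big(p_{ir}^2 + q_{rj}^2\big) \Big],\\
p_{ir} &\leftarrow p_{ir} + \alpha\,\big(2\, e_{ij}\, q_{rj} - \beta\, p_{ir}\big),\\
q_{rj} &\leftarrow q_{rj} + \alpha\,\big(2\, e_{ij}\, p_{ir} - \beta\, q_{rj}\big).
\end{aligned}
$$

Each update moves a factor against the partial derivative of the bracketed term for one observed rating. In the code, the update of $q_{rj}$ uses the value of $p_{ir}$ that was just updated rather than the previous one, which is a small departure from a simultaneous step.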

## Part 1c: Improve MF (25 points)

Given your results in the previous part, can you do better? For this last part you should report on your best attempt at improving MAE and RMSE. Provide code, results, plus a brief discussion of your approach. Hints: You may consider using the title or genres information, trying other algorithms, designing a hybrid system, or incorporating a neighborhood model as in the paper [Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model](https://www.cs.rochester.edu/twiki/pub/Main/HarpSeminar/Factorization_Meets_the_Neighborhood-_a_Multifaceted_Collaborative_Filtering_Model.pdf). *You can do anything you like to improve MAE and RMSE.*

You will get full marks for this part if you get better results than your MF results (of course we will also judge whether what you do here is reasonable or not). You will get partial marks for a reasonable effort even if you do not improve your MF results. Additionally, you will get 5 points as bonus if your model performs the best among the whole class.

In [21]:
# Your Code Here...
# Report Mean Absolute Error and Root Mean Squared Error for test

# Copy train_movie.csv to train_movie_new.csv, dropping blank lines.
with open("train_movie.csv", encoding='UTF-8') as f, \
        open("train_movie_new.csv", 'w', encoding='UTF-8') as f2:
    for line in f:
        if line != "\n":
            f2.write(line)

In [22]:
import surprise

reader = surprise.Reader(sep=',', skip_lines=1)
data = surprise.Dataset.load_from_file('train_movie_new.csv', reader)

alg = surprise.SVDpp()
output = alg.fit(data.build_full_trainset())

In [28]:
def SVDpp():
    # Predict every test rating with the fitted SVD++ model.
    act_ratings = []
    pred_svdpp = []
    for index, row in test.iterrows():
        act_ratings.append(row["rating"])
        # Surprise keeps the raw IDs read from the file as strings.
        predict = alg.predict(str(int(row["user"])), str(int(row["movie"])))
        pred_svdpp.append(predict.est)
    return pred_svdpp, act_ratings

pred_svdpp, act_svdpp = SVDpp()


print("RMSE =", RMS(act_svdpp, pred_svdpp))
print("MAE =", MAE(act_svdpp, pred_svdpp))

RMSE = 0.8745614362635963
MAE = 0.6720675480371989


Please explain what you do to improve the recommendation in 1-2 paragraphs.

I used SVD++ to improve the recommendation, via the implementation in the `surprise` package.

SVD++ is an extension of SVD that also takes implicit ratings into account. Here, the implicit rating is the fact that a user rated a given movie at all, regardless of the rating value.
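For reference, the SVD++ prediction rule in its usual formulation (Koren's model, which the `surprise` documentation also describes) is written below. The symbols are the conventional ones and are not defined elsewhere in this notebook: $\mu$ is the global mean rating, $b_u$ and $b_i$ are user and item biases, $p_u$ and $q_i$ are the user and item factors, $I_u$ is the set of items rated by user $u$, and $y_j$ is an extra factor vector per item that captures the implicit signal.

$$
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} \Big( p_u + |I_u|^{-1/2} \sum_{j \in I_u} y_j \Big)
$$

The sum over $I_u$ is what distinguishes SVD++ from plain biased SVD: it depends only on which items the user rated, not on the rating values.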

# Part 2. Bayesian Personalized Ranking (BPR) for Top-K Item Recommendation (30 points)

Compared with rating prediction in Part 1, a more popular scenario recently is personalized top-K item ranking for each user based on the user's implicit feedback. Examples include ranking videos on YouTube and ranking products on Amazon. In practice, users tend to provide implicit feedback (e.g., the user clicked a product URL on Amazon or played a video on YouTube) rather than explicit feedback (e.g., ratings or reviews) in most cases.

In this part, you will experiment with Bayesian Personalized Ranking (BPR) to rank items on a [Spotify Playlist Recommendation Dataset](http://people.tamu.edu/~yunhe/pubs/AttListCIKM2019.pdf). If a user ever followed a playlist, this interaction is treated as implicit feedback. In our sampled dataset, there are ~10,000 users and ~7,000 playlists.

BPR can generate scores of items for each user. You should rank all items based on the scores for each user and evaluate the ranking performance.

For example, suppose user 0 has two interacted playlists, 23 and 78, in test.txt. If the top-10 playlists returned by BPR for user 0 are [12,45,78,34,23,90,134,33,46,9], then precision@10 for user 0 is 0.2, because both playlists from test.txt appear in the top 10: 2/10 = 0.2. Please report NDCG@10 in this part.
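The example for user 0 can be worked through in code. This is a sketch that assumes binary relevance (a playlist is either in test.txt for the user or not) and the common log2 discount; the function names `precision_at_k` and `ndcg_at_k_binary` are hypothetical and are not the evaluation routine of any particular package.

```python
import math

def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked items that are relevant."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k

def ndcg_at_k_binary(ranked, relevant, k=10):
    """NDCG@k with binary relevance and a 1 / log2(rank + 1) discount."""
    # enumerate() is 0-based, so position `pos` is rank pos + 1.
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    # Ideal ranking: all relevant items placed at the very top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

top10 = [12, 45, 78, 34, 23, 90, 134, 33, 46, 9]
held_out = {23, 78}
p10 = precision_at_k(top10, held_out)
n10 = ndcg_at_k_binary(top10, held_out)
print("precision@10 =", p10)
print("NDCG@10 =", n10)
```

Unlike precision@10, NDCG@10 rewards placing the held-out playlists near the top: here they sit at ranks 3 and 5, so the score is below the value of 1 that ranks 1 and 2 would give.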

## Load the Data

Please download the dataset from Piazza. There are about 90,000 interactions in total, which are split into train.txt, validation.txt and test.txt. You will train on train.txt, tune hyperparameters on validation.txt and report the final result on test.txt in terms of NDCG@10.
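One way to organize the tuning step is a plain grid search that keeps whichever setting scores best on the validation split. The sketch below is generic and makes several assumptions: `grid_search` is a hypothetical helper, the hyperparameter names in the grid are only examples, and `toy_validation_score` is a stand-in so that the code runs on its own. In the homework, the scoring function would train BPR on train.txt with the given parameters and return NDCG@10 on validation.txt.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Return the best (params, score) over every combination in param_grid."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)  # higher is better, e.g. validation NDCG@10
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_validation_score(params):
    # Stand-in scorer (assumption): peaks at factors=64, learning_rate=0.01.
    return -(params["factors"] - 64) ** 2 - 1000 * abs(params["learning_rate"] - 0.01)

grid = {"factors": [32, 64, 128], "learning_rate": [0.001, 0.01, 0.1]}
best_params, best_score = grid_search(grid, toy_validation_score)
print(best_params, best_score)
```

Only the selected setting is then evaluated on test.txt, so the test set plays no role in choosing hyperparameters.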

Each of the train and test files has lines having this format: UserID, PlaylistID, 1.0. 

First you will need to load the data and store it with any structure you like. Please report the numbers of unique users and playlists in the dataset.

In [29]:
# load the data, then print out the number of
# playlists and users in each of train and test sets.
# Your Code Here...


def load_data(file_path):
    # NOTE: skiprows=1 drops the first line of the file, whereas the manual
    # readers in the cells below parse every line; confirm whether the files
    # really start with a header row.
    data = pd.read_csv(file_path, sep='\t', header=None, skiprows=1)
    data.columns = ['UserID','PlaylistID','Value']
    return data

train=load_data('train.txt')
test=load_data('test.txt')

number_users = len(train.UserID.unique())
number_playlists = len(train.PlaylistID.unique())
print("Number of users in train set is:" , number_users)
print("Number of playlists in train set is:" , number_playlists)

number_users = len(test.UserID.unique())
number_playlists = len(test.PlaylistID.unique())
print("Number of users in test set is:" , number_users)
print("Number of playlists in test set is:" , number_playlists)

Number of users in train set is: 10183
Number of playlists in train set is: 7787
Number of users in test set is: 5846
Number of playlists in test set is: 3604


## BPR by Using Package

Compared with MF, BPR is more complicated to implement. In this part, you can use a BPR package to experiment with top-K item recommendation. Some good packages include https://github.com/benfred/implicit.
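To give a sense of what such a package does internally, here is a minimal NumPy sketch of a single BPR stochastic gradient step for a matrix factorization model, following the standard formulation: for a user u, an interacted item i, and a non-interacted item j, increase ln σ(x_ui − x_uj) under L2 regularization. The function `bpr_step`, the toy sizes, and the hyperparameter values are assumptions for illustration. Sampling of (u, i, j) triples is omitted, and this is not how the `implicit` package is implemented.

```python
import numpy as np

def bpr_step(P, Q, u, i, j, lr=0.05, reg=0.01):
    """One gradient-ascent step on ln sigmoid(x_ui - x_uj) with L2 regularization.

    P holds user factors and Q item factors; both are updated in place.
    """
    # Copy the old values so all three updates use the same starting point.
    p_u, q_i, q_j = P[u].copy(), Q[i].copy(), Q[j].copy()
    x_uij = p_u @ (q_i - q_j)          # score difference x_ui - x_uj
    g = 1.0 / (1.0 + np.exp(x_uij))    # d/dx ln sigmoid(x) = sigmoid(-x)
    P[u] += lr * (g * (q_i - q_j) - reg * p_u)
    Q[i] += lr * (g * p_u - reg * q_i)
    Q[j] += lr * (-g * p_u - reg * q_j)

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(3, 4))  # 3 hypothetical users, 4 latent factors
Q = rng.normal(scale=0.1, size=(5, 4))  # 5 hypothetical playlists

before = P[0] @ (Q[1] - Q[2])
for _ in range(200):
    bpr_step(P, Q, u=0, i=1, j=2)       # user 0 followed playlist 1, not playlist 2
after = P[0] @ (Q[1] - Q[2])
print("score gap before:", before, "after:", after)
```

Repeating the step on the same triple widens the score gap between the followed and the unfollowed playlist, which is the pairwise ranking behavior BPR optimizes for.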

In [1]:
# your code to call other BPR packages for top-K recommendation.
# Report average NDCG@10 for all users on test.txt


from scipy.sparse import coo_matrix

# Read (user, playlist) pairs from the tab-separated training file.
UserID = []
PlaylistID = []
with open("train.txt", encoding='UTF-8') as f:
    for line in f:
        line_list = line.split("\t")
        UserID.append(int(line_list[0]))
        PlaylistID.append(int(line_list[1]))

# Item-user matrix: rows are playlists, columns are users.
data_coo_matrix = coo_matrix(([1.0] * len(UserID), (PlaylistID, UserID)))

In [2]:
import implicit
model = implicit.bpr.BayesianPersonalizedRanking()
model.fit(data_coo_matrix)


In [3]:
# Re-read the training pairs to build the user-item matrix for evaluation.
UserID = []
PlaylistID = []
with open("train.txt", encoding='UTF-8') as f:
    for line in f:
        line_list = line.split("\t")
        UserID.append(int(line_list[0]))
        PlaylistID.append(int(line_list[1]))

# User-item matrix: rows are users, columns are playlists.
data_csr_matrix_train = coo_matrix(([1.0] * len(UserID), (UserID, PlaylistID))).tocsr()

In [4]:
# Read the held-out test pairs in the same way.
UserID = []
PlaylistID = []
with open("test.txt", encoding='UTF-8') as f:
    for line in f:
        line_list = line.split("\t")
        UserID.append(int(line_list[0]))
        PlaylistID.append(int(line_list[1]))

# User-item matrix: rows are users, columns are playlists.
data_csr_matrix_test = coo_matrix(([1.0] * len(UserID), (UserID, PlaylistID))).tocsr()

In [7]:
from implicit.evaluation import ndcg_at_k
ndcg = ndcg_at_k(model, data_csr_matrix_train, data_csr_matrix_test, K=10)
print("NDCG@10 =", ndcg)


NDCG@10 = 0.09944746866128336


## Collaboration declarations

*If you collaborated with anyone (see Collaboration policy at the top of this homework), you can put your collaboration declarations here.*