# Advanced Collaborative Filtering

Note: this lab has been tested with Python 3.10. We recommend using the same Python version if there are problems with libraries used in this lab.

In [1]:
# Load data generated in W6 Lab or the provided data splits (see Absalon, W7 Lab)
import pandas as pd

df_train = pd.read_pickle("train_dataframe.pkl")
df_test = pd.read_pickle("test_dataframe.pkl")

## Exercise 1

For this exercise, you can use the Python library Scikit-Surprise. Please find the documentation here: https://surprise.readthedocs.io/en/stable/getting_started.html.

Define an SVD model with user and item biases that uses Stochastic Gradient Descend (SGD) to estimate the low-rank matrix based on only observed ratings.

Fit the model on the full training set with $30$ latent factors and $100$ epochs. Keep Scikit-Surprise's default setting for all other parameters, but set the random state to $0$ for comparable results.

Use the model to predict the unobserved ratings for the users in the training set. How many predictions are there and what is the average of all the predictions? Round the average of all predictions to the third decimal point.

In [2]:
import random
import numpy as np
from surprise import Reader, Dataset, SVD

# Convert train data format
reader = Reader(rating_scale=(1, 5))
training_matrix = Dataset.load_from_df(df_train[['reviewerID', 'asin', 'overall']], reader)

my_seed = 0
random.seed(my_seed)
np.random.seed(my_seed)

trainset = training_matrix.build_full_trainset()
testset = trainset.build_anti_testset()

# We'll use the famous SVD algorithm.
algo = SVD(n_factors=30, n_epochs=100)

# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(training_matrix.build_full_trainset())
predictions = algo.test(testset)

print("Number of predictions: ", len(predictions))
print("Average of predictions: ", round(np.mean([pred.est for pred in predictions]), 3))

Number of predictions:  54746
Average of predictions:  4.413


## Exercise 2

We will implement the Neural Matrix Factorization model using the Python library RecBole.
Please find the documentation here: https://recbole.io/docs/

In [3]:
#Uncomment and run the following line if you need to install RecBole
!pip install recbole



In [4]:
#Uncomment and run the following line if you need to install ray. This is needed when calling run_recbole
!pip install ray



In [5]:
import os
import pandas as pd
from recbole.quick_start import run_recbole

### Exercise 2.1

Convert the dataset to the format which can be read by RecBole.

More information regarding the input data format can be found here:
https://recbole.io/docs/user_guide/usage/running_new_dataset.html

In [6]:
#We are creating a dictionary that maps the column names in our dataset to the column names required by RecBole

#Fill this dictionary with keys that are column names in our dataset that correspond to user_id, item_id, rating, and timestamp
#Fill the values of the dictionary according to the given documentation

col_name_dict = {
    'reviewerID': 'user_id:token',
    'asin': 'item_id:token',
    'overall': 'rating:float',
    'unixReviewTime': 'timestamp:float'
}

In [7]:
#this method converts a dataframe to a .inter file, and saves it in the folder "data" under the name 'file_name'
def convert_df_to_inter(df:pd.DataFrame, col_name_dict:dict, file_name:str):
    inter = df.copy()
    selected_cols = col_name_dict.keys()
    inter = inter[selected_cols]

    # Rename the columns in inter using col_name_dict
    inter.rename(columns=col_name_dict, inplace=True)

    if not os.path.exists("data"):
        os.makedirs("data")
    inter.to_csv("data/"+file_name, index=False, sep="\t")

In [8]:
#create an extra, empty dataframe with the same column names in the keys of col_name_dict
#we will use this as a dummy validation file
df_extra = pd.DataFrame(columns=col_name_dict.keys())

In [9]:
print(df_train.dtypes)

convert_df_to_inter(df_train, col_name_dict, "data.train.inter")
convert_df_to_inter(df_test, col_name_dict, "data.test.inter")
convert_df_to_inter(df_extra, col_name_dict, "data.extra.inter")

overall           float64
verified             bool
reviewTime         object
reviewerID         object
asin               object
style              object
reviewerName       object
reviewText         object
summary            object
unixReviewTime      int64
vote               object
image              object
dtype: object


### Exercise 2.2
Train the Neural Matrix Factorization model on the whole training dataset for $100$ epochs.

Evaluate the model on the test set, based on HR, MRR, Precision, MAP, and Recall at $k \in \{5, 10, 20\}$ respectively and round the scores up to 3 decimal places (It is fine if you have different results in the third decimal point).
Keep the rest of the default settings of RecBole the same.

Note: RecBole's MAP normalises the recall base by $\min\{k,G\}$, where $G$ is the recall base (see W7 lecture and homework solution)

Note: A non exhaustive list of properties that can be set using the config_dict parameter of the run_recbole() method can be found here:
https://github.com/RUCAIBox/RecBole/blob/master/recbole/properties/overall.yaml


In [12]:
result = run_recbole(
                    model="NeuMF",
                    dataset="data",
                    config_dict={
                        "data_path":"./",
                        "benchmark_filename": ['train', 'extra', 'test'],
                        "topk": [5, 10, 20],
                        "loss_decimal_place": 3,
                        "metric_decimal_place": 3,
                        "metrics": ["Hit", "MRR", "Precision", "MAP", "Recall"]
                    })

#print the results here

22 Feb 17:35    INFO  ['c:\\Users\\david\\anaconda3\\envs\\WRS\\lib\\site-packages\\ipykernel_launcher.py', '--f=c:\\Users\\david\\AppData\\Roaming\\jupyter\\runtime\\kernel-v31c0749c0bc3220c1bba7e03ac556b6002aa57885.json']
22 Feb 17:35    INFO  
General Hyper Parameters:
gpu_id = 0
use_gpu = True
seed = 2020
state = INFO
reproducibility = True
data_path = ./data
checkpoint_dir = saved
show_progress = True
save_dataset = False
dataset_save_path = None
save_dataloaders = False
dataloaders_save_path = None
log_wandb = False

Training Hyper Parameters:
epochs = 300
train_batch_size = 2048
learner = adam
learning_rate = 0.001
train_neg_sample_args = {'distribution': 'uniform', 'sample_num': 1, 'alpha': 1.0, 'dynamic': False, 'candidate_num': 0}
eval_step = 1
stopping_step = 10
clip_grad_norm = None
weight_decay = 0.0
loss_decimal_place = 3

Evaluation Hyper Parameters:
eval_args = {'split': {'RS': [0.8, 0.1, 0.1]}, 'order': 'RO', 'group_by': 'user', 'mode': {'valid': 'full', 'test': 'full'

## Exercise 3

Let's create a graph-based recommender system, defining neighbourhoods with random walks. Build a bipartite graph (i.e., edges only between users and items) where nodes are users and items; a **bidirectional** edge $(u,i)$ exists in the graph if user $u$ has rated item $i$ with a score $>3$. 

Implement the Page Rank algorithm to find the top-10 recommended items for user `ARARUVZ8RUF5T`. You can use the `pagerank` method from the library `networkx`. Assume a damping factor of $0.85$ and leave the rest of parameters by default.

In [None]:
# !pip install networkx

In [None]:
import numpy as np
import networkx as nx
from operator import itemgetter

# Prepare the data
def convert_data(df):
    df_convert = #get the rows in the df where the rating is >3
    df_convert = df_convert[["asin","reviewerID"]]
    df_convert_arr = df_convert.values
    return df_convert_arr

train_df = convert_data(df_train)

In [None]:
''' Hyper Parameters '''
def parameter_dict_from_vector(vector):
    return {
        "W_USER_ITEM" : vector[0],
        "W_USER_ITEM_BACK" : vector[1]
        }

''' Building Graph '''
class InteractionGraph:
    def __init__(self):
        self.graph = nx.MultiDiGraph()
        
    def add_nodes_from_edge_array(self, edge_array, type_1, type_2):
        nodes = [(x[0], {'type': type_1}) for x in edge_array] \
        + [(x[1], {'type': type_2}) for x in edge_array]
        self.graph.add_nodes_from(nodes)

    def add_edges_from_array(self, array, weight_front=1.0, weight_back=1.0):
        forward_edges = [(x[0], x[1], weight_front) for x in array]
        back_edges = [(x[1], x[0], weight_back) for x in array]
        self.graph.add_weighted_edges_from(forward_edges)
        self.graph.add_weighted_edges_from(back_edges)

def build_graph(parameter_dictionary, user_item_array):
    multigraph = InteractionGraph()
    multigraph.add_nodes_from_edge_array(user_item_array, 'item', 'user')
    multigraph.add_edges_from_array(user_item_array, 
                                    parameter_dictionary["W_USER_ITEM"], 
                                    parameter_dictionary["W_USER_ITEM_BACK"])
    return multigraph

class RecommendationEngine:
    def __init__(self, multigraph, damping_factor = 0.3):
        self.graph = nx.DiGraph()
        for u,v,d in multigraph.graph.edges(data=True):
            w = d['weight']
            if self.graph.has_edge(u,v):
                self.graph[u][v]['weight'] += w
            else:
                self.graph.add_edge(u,v,weight=w)
        self.nodes = list(self.graph.nodes)
        self.damping_factor = damping_factor
        
        #this part keeps track of items that have been rated by each user in the training set
        self.user_item_dict = {}
        for n in multigraph.graph.nodes.data():
            if n[1]['type'] == 'user':
                 self.user_item_dict[n[0]] = set()
        for e in multigraph.graph.edges:
            if e[0] in self.user_item_dict:
                 self.user_item_dict[e[0]].add(e[1])

    def generate_pr(self, damping_factor):
        pr = #use the pagerank method here
        pr_sorted = dict(
            #sort pr by descending probability values
            )
        pr_list = [(k, v) for k, v in pr_sorted.items()]
        return pr_list
    
    def generate_recommendations(self, user):
        pr_list = self.generate_pr(self.damping_factor)
        result = #Given the user, remove items in their recommendation list that they have rated in the training set
        #hint: you can use user_item_dict for this
        return result

In [None]:
damping_factor = #fill this

# Build the graph
graph = build_graph(parameter_dict_from_vector(np.ones(2)), train_df)
# Build the recommender system with page rank
recommender = RecommendationEngine(graph, damping_factor)

# Get top-K recommendations for the given user 
user_id = "ARARUVZ8RUF5T"
K = 10

# write your code to get the top-K recommendations

Credits: the provided codes in Exercise 3 are modified from
https://arxiv.org/pdf/2301.11009.pdf