# Advanced Collaborative Filtering

Note: this lab has been tested with Python 3.10. If you run into problems with the libraries used in this lab, we recommend switching to the same Python version.

In [19]:
# Load data generated in W6 Lab or the provided data splits (see Absalon, W7 Lab)
import pandas as pd

df_train = pd.read_pickle("train_dataframe.pkl")
df_test = pd.read_pickle("test_dataframe.pkl")

## Exercise 1

For this exercise, you can use the Python library Scikit-Surprise. Please find the documentation here: https://surprise.readthedocs.io/en/stable/getting_started.html.

Define an SVD model with user and item biases that uses Stochastic Gradient Descent (SGD) to estimate the low-rank matrix based only on the observed ratings.

Fit the model on the full training set with $30$ latent factors and $100$ epochs. Keep Scikit-Surprise's default setting for all other parameters, but set the random state to $0$ for comparable results.
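To make the model in this exercise concrete, the following is a minimal pure-Python sketch of biased matrix factorisation trained with SGD, where a rating is predicted as the global mean plus a user bias, an item bias, and the dot product of the latent factors. It is not Scikit-Surprise's implementation: the function names (`fit_biased_mf`, `rmse`), the toy ratings, and the learning-rate and regularisation values are assumptions chosen only for illustration.

```python
import random


def fit_biased_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit r_hat(u, i) = mu + b_u + b_i + p_u . q_i with SGD on observed ratings only.

    ratings: list of (user, item, rating) triples. All hyperparameter values
    here are illustrative assumptions, not the settings asked for in the lab.
    """
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    b_u = {u: 0.0 for u in users}
    b_i = {i: 0.0 for i in items}
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        return mu + b_u[u] + b_i[i] + sum(a * b for a, b in zip(p[u], q[i]))

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Gradient steps on the regularised squared error
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return predict


def rmse(predict, ratings):
    """Root mean squared error of `predict` on the given triples."""
    return (sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings)) ** 0.5


toy_ratings = [
    ("u1", "i1", 5.0), ("u1", "i2", 3.0),
    ("u2", "i1", 4.0), ("u2", "i3", 1.0),
    ("u3", "i2", 2.0), ("u3", "i3", 1.0),
]
global_mean = sum(r for _, _, r in toy_ratings) / len(toy_ratings)
baseline = rmse(lambda u, i: global_mean, toy_ratings)  # always predict the mean
model = fit_biased_mf(toy_ratings)
print(rmse(model, toy_ratings) < baseline)
```

The sketch only iterates over observed triples, which is what "based only on the observed ratings" refers to; unobserved entries never contribute to the loss.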

Use the model to predict the unobserved ratings for the users in the training set. How many predictions are there, and what is the average of all the predictions? Round the average to three decimal places.

In [20]:
import random
import numpy as np
from surprise import Reader, Dataset, SVD

In [21]:
# Convert train data format
reader = Reader(rating_scale=(1, 5))
training_matrix = Dataset.load_from_df(df_train[['reviewerID', 'asin', 'overall']], reader)

my_seed = 0
random.seed(my_seed)
np.random.seed(my_seed)

trainset = training_matrix.build_full_trainset()

alg = SVD(n_factors=30, n_epochs=100, random_state=my_seed)
alg.fit(trainset)

testset = trainset.build_anti_testset()

# # Detect users from training set that are not in test and filter them out
# all_train_users_id = df_train[df_train.reviewerID.isin(df_train.reviewerID.unique().tolist())].reviewerID.unique()
# all_train_items_id = df_train.asin.unique()

# training_user_item_combinations = set(zip(df_train.reviewerID, df_train.asin))


#unobserved_ratings = [(reviewerID,asin,0) for reviewerID in all_train_users_id for asin in all_train_items_id if (reviewerID, asin) not in training_user_item_combinations]

# Predict a rating for every (user, item) pair in the anti-testset,
# i.e. every pair of a known user and a known item that is not observed in the training set
predictions = alg.test(testset)

In [22]:
len(predictions)

54746

In [23]:
value_predictions = [pred.est for pred in predictions]
sum(value_predictions)/len(value_predictions)

4.413418116878057
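The number of predictions can be sanity-checked by hand: the anti-testset holds every pair of a known user and a known item whose rating is not in the training set, so its size should be (number of users) × (number of items) − (number of distinct observed pairs). The sketch below illustrates this on made-up data and also shows the rounding asked for in the exercise; `count_unobserved_pairs`, `toy_observed`, and `toy_estimates` are hypothetical names and values, not taken from the lab data.

```python
def count_unobserved_pairs(observed):
    """Number of (user, item) pairs over known users and items that are not observed."""
    pairs = set(observed)  # duplicates count once
    users = {u for u, _ in pairs}
    items = {i for _, i in pairs}
    return len(users) * len(items) - len(pairs)


# Toy data (assumption): 3 users, 3 items, 4 observed interactions
toy_observed = [("u1", "i1"), ("u1", "i2"), ("u2", "i1"), ("u3", "i3")]
n_missing = count_unobserved_pairs(toy_observed)
print(n_missing)  # 3 * 3 - 4 = 5

# Rounding an average of (made-up) predicted ratings to three decimal places
toy_estimates = [4.4137, 4.4131, 4.4134]
rounded_mean = round(sum(toy_estimates) / len(toy_estimates), 3)
print(rounded_mean)  # 4.413
```

Applied to the lab data, the same count could be obtained from the distinct `reviewerID`, `asin`, and `(reviewerID, asin)` values in `df_train`.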

## Exercise 2

We will implement the Neural Matrix Factorization model using the Python library RecBole.
Please find the documentation here: https://recbole.io/docs/

In [150]:
#Uncomment and run the following line if you need to install RecBole
#!pip install recbole

In [151]:
#Uncomment and run the following line if you need to install ray. This is needed when calling run_recbole
#!pip install ray

In [2]:
import os
import pandas as pd
from recbole.quick_start import run_recbole

### Exercise 2.1

Convert the dataset to the format which can be read by RecBole.

More information regarding the input data format can be found here:
https://recbole.io/docs/user_guide/usage/running_new_dataset.html

In [3]:
df_train.columns

Index(['overall', 'verified', 'reviewTime', 'reviewerID', 'asin', 'style',
       'reviewerName', 'reviewText', 'summary', 'unixReviewTime', 'vote',
       'image'],
      dtype='object')

In [4]:
#We are creating a dictionary that maps the column names in our dataset to the column names required by RecBole

#Fill this dictionary with keys that are column names in our dataset that correspond to user_id, item_id, rating, and timestamp
#Fill the values of the dictionary according to the given documentation
col_name_dict = {
    "reviewerID": "user_id:token",
    "asin": "item_id:token",
    "overall": "rating:float",
    "unixReviewTime": "timestamp:float"
                }

In [5]:
#this method converts a dataframe to a .inter file, and saves it in the folder "data" under the name 'file_name'
def convert_df_to_inter(df:pd.DataFrame, col_name_dict:dict, file_name:str):
    inter = df.copy()
    selected_cols = col_name_dict.keys()
    inter = inter[selected_cols]
    #write your code to rename the columns in inter using col_name_dict
    inter = inter.rename(columns=col_name_dict)

    if not os.path.exists("data"):
        os.makedirs("data")
    inter.to_csv("data/"+file_name, index=False, sep="\t")

In [6]:
list(col_name_dict.keys())

['reviewerID', 'asin', 'overall', 'unixReviewTime']

In [7]:
#create an extra, empty dataframe with the same column names in the keys of col_name_dict
#we will use this as a dummy validation file
df_extra = pd.DataFrame(columns=col_name_dict.keys())

In [8]:
convert_df_to_inter(df_train, col_name_dict, "data.train.inter")
convert_df_to_inter(df_test, col_name_dict, "data.test.inter")
convert_df_to_inter(df_extra, col_name_dict, "data.extra.inter")
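As a quick illustration of what the conversion above is expected to produce, the sketch below writes two made-up interactions in the same layout (tab-separated, with a `name:type` header) to an in-memory buffer and inspects the header row. It uses only the standard library; `toy_rows` and `toy_col_names` are hypothetical stand-ins for the lab's dataframe and `col_name_dict`.

```python
import csv
import io

# Made-up interactions (assumption), keyed by the original column names
toy_rows = [
    {"reviewerID": "A1", "asin": "B1", "overall": 5.0, "unixReviewTime": 1119916800},
    {"reviewerID": "A2", "asin": "B1", "overall": 3.0, "unixReviewTime": 1506038400},
]
# Same mapping idea as col_name_dict: original column -> "field:type" header
toy_col_names = {
    "reviewerID": "user_id:token",
    "asin": "item_id:token",
    "overall": "rating:float",
    "unixReviewTime": "timestamp:float",
}

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(toy_col_names.values())  # typed header row
for row in toy_rows:
    writer.writerow([row[key] for key in toy_col_names])  # columns in header order

inter_text = buffer.getvalue()
header = inter_text.splitlines()[0].split("\t")
print(header)
```

Opening one of the generated files in `data/` and comparing its first line against this header is a cheap check before starting a long training run.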

### Exercise 2.2
Train the Neural Matrix Factorization model on the whole training dataset for $100$ epochs.

Evaluate the model on the test set using HR, MRR, Precision, MAP, and Recall at $k \in \{5, 10, 20\}$, and round the scores to 3 decimal places (it is fine if your results differ in the third decimal place).
Keep all other RecBole settings at their defaults.

Note: RecBole's MAP normalises the average precision at cutoff $k$ by $\min\{k,G\}$, where $G$ is the recall base, i.e., the number of relevant items for the user (see W7 lecture and homework solution)
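The normalisation in the note above can be sketched for a single user as follows. This is a hand-written illustration of the formula as stated in the note, not RecBole's code; the function name and the toy ranking are assumptions.

```python
def average_precision_at_k(ranked_items, relevant_items, k):
    """AP@k normalised by min(k, G), where G is the number of relevant items."""
    relevant = set(relevant_items)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this hit position
    return precision_sum / min(k, len(relevant))


ranking = ["a", "b", "c", "d", "e"]
# One relevant item found at rank 2: AP@5 = (1/2) / min(5, 1)
print(average_precision_at_k(ranking, {"b"}, k=5))            # 0.5
# Three relevant items, cutoff 2: AP@2 = (1/1) / min(2, 3)
print(average_precision_at_k(ranking, {"a", "c", "x"}, k=2))  # 0.5
```

In the second call, dividing by $G = 3$ instead of $\min\{k, G\} = 2$ would give $1/3$, which is the difference the note points out. MAP is then the mean of this quantity over users.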

Note: A non-exhaustive list of properties that can be set using the config_dict parameter of the run_recbole() method can be found here:
https://github.com/RUCAIBox/RecBole/blob/master/recbole/properties/overall.yaml


In [54]:
result = run_recbole(
                    model="NeuMF",
                    dataset="data",
                    config_dict={
                        "data_path":"./",
                        "benchmark_filename": ['train', 'extra', 'test'],
                        "epochs": 100,
                        "seed": 2020,
                        "show_progress": True,
                        "metrics": ["Hit","MRR","MAP","Precision", "Recall"],
                        "topk": [5, 10, 20],
                        "metric_decimal_place": 3,
                        "eval_args": {"split":  {"KFold": 5}}
                    })


02 Mar 02:16    INFO  ['/opt/anaconda3/envs/recsys/lib/python3.10/site-packages/ipykernel_launcher.py', '--f=/Users/danielpenchev/Library/Jupyter/runtime/kernel-v39e31d61ae42945e50d964dd5c12404156b034fff.json']
02 Mar 02:16    INFO  
General Hyper Parameters:
gpu_id = 0
use_gpu = True
seed = 2020
state = INFO
reproducibility = True


data_path = ./data
checkpoint_dir = saved
show_progress = True
save_dataset = False
dataset_save_path = None
save_dataloaders = False
dataloaders_save_path = None
log_wandb = False

Training Hyper Parameters:
epochs = 100
train_batch_size = 2048
learner = adam
learning_rate = 0.001
train_neg_sample_args = {'distribution': 'uniform', 'sample_num': 1, 'alpha': 1.0, 'dynamic': False, 'candidate_num': 0}
eval_step = 1
stopping_step = 10
clip_grad_norm = None
weight_decay = 0.0
loss_decimal_place = 4

Evaluation Hyper Parameters:
eval_args = {'split': {'KFold': 5}, 'order': 'RO', 'group_by': 'user', 'mode': {'valid': 'full', 'test': 'full'}}
repeatable = False
metrics = ['Hit', 'MRR', 'MAP', 'Precision', 'Recall']
topk = [5, 10, 20]
valid_metric = MRR@10
valid_metric_bigger = True
eval_batch_size = 4096
metric_decimal_place = 3

Dataset Hyper Parameters:
field_separator = 	
seq_separator =  
USER_ID_FIELD = user_id
ITEM_ID_FIELD = item_id
RATING_FIELD = rating
TIME_FIELD = timestamp
seq_len

In [9]:
result = run_recbole(
                    model="NeuMF",
                    dataset="data",
                    config_dict={
                        "data_path":"./",
                        "benchmark_filename": ['train', 'extra', 'test'],
                        "epochs": 100,
                        "seed": 2020,
                        "show_progress": True,
                        "metrics": ["Hit","MRR","MAP","Precision", "Recall"],
                        "topk": [5, 10, 20],
                        "metric_decimal_place": 3
                    })


01 Mar 01:44    INFO  ['/opt/anaconda3/envs/recsys/lib/python3.10/site-packages/ipykernel_launcher.py', '--f=/Users/danielpenchev/Library/Jupyter/runtime/kernel-v39e31d61ae42945e50d964dd5c12404156b034fff.json']
01 Mar 01:44    INFO  
General Hyper Parameters:
gpu_id = 0
use_gpu = True
seed = 2020
state = INFO
reproducibility = True
data_path = ./data
checkpoint_dir = saved
show_progress = True
save_dataset = False
dataset_save_path = None
save_dataloaders = False
dataloaders_save_path = None
log_wandb = False

Training Hyper Parameters:
epochs = 100
train_batch_size = 2048
learner = adam
learning_rate = 0.001
train_neg_sample_args = {'distribution': 'uniform', 'sample_num': 1, 'alpha': 1.0, 'dynamic': False, 'candidate_num': 0}
eval_step = 1
stopping_step = 10
clip_grad_norm = None
weight_decay = 0.0
loss_decimal_place = 4

Evaluation Hyper Parameters:
eval_args = {'split': {'RS': [0.8, 0.1, 0.1]}, 'order': 'RO', 'group_by': 'user', 'mode': {'valid': 'full', 'test': 'full'}}
repeatable

In [160]:
result

{'best_valid_score': -inf,
 'valid_score_bigger': True,
 'best_valid_result': None,
 'test_result': OrderedDict([('hit@5', 0.808),
              ('hit@10', 0.819),
              ('hit@20', 0.857),
              ('mrr@5', 0.529),
              ('mrr@10', 0.53),
              ('mrr@20', 0.533),
              ('map@5', 0.529),
              ('map@10', 0.53),
              ('map@20', 0.533),
              ('precision@5', 0.162),
              ('precision@10', 0.082),
              ('precision@20', 0.043),
              ('recall@5', 0.808),
              ('recall@10', 0.819),
              ('recall@20', 0.857)])}
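In the results above, Hit@k equals Recall@k, MRR@k equals MAP@k, and Precision@k is close to Hit@k divided by k (for example, 0.808 / 5 ≈ 0.162). This is the pattern one would expect if every test user had exactly one relevant item; that is an assumption about the test split, not something verified here. The sketch below computes the per-user metrics for a single made-up ranking to show why they coincide in that case; `metrics_at_k` is a hypothetical helper, not a RecBole function.

```python
def metrics_at_k(ranked_items, relevant_items, k):
    """Per-user Hit, MRR, Precision and Recall at cutoff k."""
    relevant = set(relevant_items)
    top_k = ranked_items[:k]
    hit_ranks = [rank for rank, item in enumerate(top_k, start=1) if item in relevant]
    n_hits = len(hit_ranks)
    return {
        "hit": 1.0 if n_hits else 0.0,                  # at least one relevant item in the top k
        "mrr": 1.0 / hit_ranks[0] if n_hits else 0.0,   # reciprocal rank of the first hit
        "precision": n_hits / k,
        "recall": n_hits / len(relevant),
    }


# Toy ranking with a single relevant item at rank 2 (assumption)
scores = metrics_at_k(["a", "b", "c", "d", "e"], {"b"}, k=5)
print(scores)
```

With one relevant item, recall is either 0 or 1 and therefore equals the hit indicator, precision is the hit indicator divided by k, and the average precision reduces to the reciprocal rank of the single hit.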

## Exercise 3

Let's create a graph-based recommender system, defining neighbourhoods with random walks. Build a bipartite graph (i.e., edges only between users and items) where nodes are users and items; a **bidirectional** edge $(u,i)$ exists in the graph if user $u$ has rated item $i$ with a score $>3$. 

Implement the PageRank algorithm to find the top-10 recommended items for user `ARARUVZ8RUF5T`. You can use the `pagerank` method from the library `networkx`. Assume a damping factor of $0.85$ and leave the remaining parameters at their default values.
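Before turning to `networkx`, the idea can be illustrated with a small hand-rolled power iteration on a toy bipartite graph. This is a minimal sketch under stated assumptions, not the `networkx` implementation: the function `pagerank_power_iteration`, the fixed number of iterations, the uniform handling of nodes without outgoing edges, and the toy interactions are all chosen for illustration.

```python
def pagerank_power_iteration(edges, damping=0.85, n_iter=100):
    """PageRank scores for an unweighted directed graph given as (src, dst) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n_nodes = len(nodes)
    rank = {n: 1.0 / n_nodes for n in nodes}
    for _ in range(n_iter):
        # Mass of nodes without outgoing edges is spread uniformly (assumption)
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {n: (1.0 - damping) / n_nodes + damping * dangling / n_nodes
                    for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Toy positive interactions (user, item), i.e. ratings > 3 (made-up data)
likes = [("u1", "i1"), ("u1", "i2"), ("u1", "i3"),
         ("u2", "i1"), ("u3", "i1"), ("u3", "i3")]
# Bidirectional edges between users and items
edges = likes + [(item, user) for user, item in likes]
scores = pagerank_power_iteration(edges)

target_user = "u2"
all_items = {item for _, item in likes}
seen_items = {item for user, item in likes if user == target_user}
# Rank unseen items by descending PageRank score
recommendations = sorted(all_items - seen_items, key=lambda item: scores[item], reverse=True)
print(recommendations)  # ['i3', 'i2']
```

Item `i3` outranks `i2` because it receives score from two users rather than one. As in the exercise, user nodes and already-rated items are excluded before taking the top of the list.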

In [None]:
# !pip install networkx

In [49]:
import numpy as np
import networkx as nx
from operator import itemgetter

# Prepare the data
def convert_data(df):
    df_convert = df[df.overall > 3] #get the rows in the df where the rating is >3
    df_convert = df_convert[["asin","reviewerID"]]
    df_convert_arr = df_convert.values
    return df_convert_arr

train_df = convert_data(df_train)

In [50]:
''' Hyper Parameters '''
def parameter_dict_from_vector(vector):
    return {
        "W_USER_ITEM" : vector[0],
        "W_USER_ITEM_BACK" : vector[1]
        }

''' Building Graph '''
class InteractionGraph:
    def __init__(self):
        self.graph = nx.MultiDiGraph()
        
    def add_nodes_from_edge_array(self, edge_array, type_1, type_2):
        nodes = [(x[0], {'type': type_1}) for x in edge_array] \
        + [(x[1], {'type': type_2}) for x in edge_array]
        self.graph.add_nodes_from(nodes)

    def add_edges_from_array(self, array, weight_front=1.0, weight_back=1.0):
        forward_edges = [(x[0], x[1], weight_front) for x in array]
        back_edges = [(x[1], x[0], weight_back) for x in array]
        self.graph.add_weighted_edges_from(forward_edges)
        self.graph.add_weighted_edges_from(back_edges)

def build_graph(parameter_dictionary, user_item_array):
    multigraph = InteractionGraph()
    multigraph.add_nodes_from_edge_array(user_item_array, 'item', 'user')
    multigraph.add_edges_from_array(user_item_array, 
                                    parameter_dictionary["W_USER_ITEM"], 
                                    parameter_dictionary["W_USER_ITEM_BACK"])
    return multigraph

class RecommendationEngine:
    def __init__(self, multigraph, damping_factor = 0.3):
        self.graph = nx.DiGraph()
        
        # if we have multiple edges with the same source and destination, then create a single edge whose weight is the cumulative sum of those edges' weights
        for u,v,d in multigraph.graph.edges(data=True):
            w = d['weight']
            if self.graph.has_edge(u,v):
                self.graph[u][v]['weight'] += w
            else:
                self.graph.add_edge(u,v,weight=w)
        self.nodes = list(self.graph.nodes)
        self.damping_factor = damping_factor
        
        #this part keeps track of items that have been rated by each user in the training set
        self.user_item_dict = {}
        for n in multigraph.graph.nodes.data():
            if n[1]['type'] == 'user':
                self.user_item_dict[n[0]] = set()
        for e in multigraph.graph.edges:
            if e[0] in self.user_item_dict:
                self.user_item_dict[e[0]].add(e[1])

    def generate_pr(self, damping_factor):
        pr = nx.pagerank(self.graph, damping_factor)
        pr_sorted = dict(
            #sort pr by descending probability values
            sorted(pr.items(), key=lambda x: x[1], reverse=True)
            )
        pr_list = [(k, v) for k, v in pr_sorted.items()]
        return pr_list
    
    def generate_recommendations(self, user):
        pr_list = self.generate_pr(self.damping_factor)
        # Given the user, keep only item nodes (user nodes are the keys of user_item_dict)
        # and remove the items the user has already rated in the training set
        seen_items = self.user_item_dict[user]
        result = [node for (node, _) in pr_list
                  if node not in seen_items and node not in self.user_item_dict]
        return result

('B000URXP6E', 0.12199812480203007), 
('B0012Y0ZG2', 0.0734194898666046), 
('B00006L9LC', 0.06842860929512946),

In [32]:
train_df

array([['B0009RF9DW', 'A105A034ZG9EHO'],
       ['B000FI4S1E', 'A105A034ZG9EHO'],
       ['B000URXP6E', 'A105A034ZG9EHO'],
       ...,
       ['B0009RF9DW', 'AZRD4IZU6TBFV'],
       ['B000FI4S1E', 'AZRD4IZU6TBFV'],
       ['B000URXP6E', 'AZRD4IZU6TBFV']], dtype=object)

In [35]:
df_train[df_train.reviewerID=="A281NPSIMI1C2R"]

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,vote,image
487,5.0,False,"06 28, 2005",A281NPSIMI1C2R,B0002JHI1I,,Rebecca of Amazon,CoQ10 is essential for healthy youthful skin. ...,Heavenly Lavender Fragrance,1119916800,6.0,
489,5.0,False,"09 22, 2017",A281NPSIMI1C2R,B0006O10P4,"{'Size:': ' 3 oz.', 'Scent Name:': ' Frankince...",Rebecca of Amazon,"Warm, soothing and a little spicy. This zum ba...",Soothing and Spicy,1506038400,,


In [52]:
damping_factor = 0.85 #fill this

# Build the graph
graph = build_graph(parameter_dict_from_vector(np.ones(2)), train_df)
# Build the recommender system with page rank
recommender = RecommendationEngine(graph, damping_factor)

#print(recommender.graph.nodes)

# Get top-K recommendations for the given user 
user_id = "ARARUVZ8RUF5T"
K = 10

recommender.generate_recommendations(user_id)[:K]

['B000URXP6E',
 'B0012Y0ZG2',
 'B00006L9LC',
 'B0009RF9DW',
 'B000FI4S1E',
 'B001OHV1H4',
 'B00W259T7G',
 'B0010ZBORW',
 'B0013NB7DW',
 'B019FWRG3C']

Credits: the code provided in Exercise 3 is adapted from
https://arxiv.org/pdf/2301.11009.pdf