# Evaluation of Recommender Systems

Based on the same dataset used on previous weeks, let us evaluate the Collaborative Filtering (CF) model implemented last week.

In [4]:
# Load data splits from Week 6, the files are also uploaded in Absalon
import pandas as pd 
train_df = pd.read_pickle("train_dataframe.pkl") 
test_df = pd.read_pickle("test_dataframe.pkl")

Recall that `reviewerID` corresponds to user, `asin` corresponds to item, and `overall` is the user-given rating to the item.

## Exercise 1

Based on the user-based neighborhood model that was created last week, let's make a general system that can be used to generate recommendations for all users and items. The system would take into account the mean rating of each user. We can use Scikit-Surprise for this.
https://surprise.readthedocs.io/en/stable/index.html

Use cosine as similarity measure and try to vary the (maximum) number of neighbors to take into account when predicting ratings. Set the random state to $0$ for comparable results. Keep Scikit-Surprise's default settings for all other parameters. 

Is it better to use $1$ or $10$ neighbors? You should determine this based on the Root Mean Square Error (RMSE) over 3-fold cross-validation.

In [5]:
# Uncomment and run the following line if you need to install scikit-surprise, note that this library is not the same as sklearn
!pip install scikit-surprise



In [6]:
import random
import pandas as pd
import numpy as np
from surprise import Reader
from surprise import Dataset
from surprise import KNNWithMeans
from surprise.model_selection import KFold
from sklearn.metrics import mean_squared_error as mse

In [7]:
# 1. Convert train data format
reader = Reader(rating_scale=(1, 5))
training_matrix = Dataset.load_from_df(train_df[['reviewerID', 'asin', 'overall']], reader)

In [8]:
# 2. Fix the random seed
my_seed = 0
random.seed(my_seed)
np.random.seed(my_seed)

# 3. Define a cross-validation iterator
kf = KFold(n_splits=3)

rmse_result = dict()

list_neighbour = [1, 10]
for neighbour in list_neighbour:
    algo = KNNWithMeans(k=neighbour,
                        sim_options={"name":"cosine","user_based":True},
                        verbose=False,
                        random_state=0)
    rmse_result[neighbour] = {}
    
    fold = 0
    for trainset, testset in kf.split(training_matrix):

        # train and test algorithm.
        algo.fit(trainset)
        
        predictions_KNN = algo.test(testset)
        df_pred_KNN = pd.DataFrame(predictions_KNN)

        actual_ratings = df_pred_KNN['r_ui']
        predicted_ratings = df_pred_KNN['est']
        rmse_result[neighbour][fold] = np.sqrt(mse(actual_ratings, predicted_ratings))

        fold+=1

In [9]:
# Convert the RMSE results dictionary to a DataFrame
df_rmse = pd.DataFrame(rmse_result)

# Compute the average RMSE across folds for each neighbor
avg_rmse_per_neighbor = df_rmse.mean()

# Find the neighbor with the lowest average RMSE
best_neighbor = avg_rmse_per_neighbor.idxmin()

print("Lowest average RMSE:", avg_rmse_per_neighbor.min())
print('Number of neighbors with lowest validation RMSE:', best_neighbor)

Lowest average RMSE: 0.4356721705776638
Number of neighbors with lowest validation RMSE: 10


## Exercise 2

### 2.1
Fit the neigborhood-based model defined in exercise 1 on the full training set with cosine as similarity measure and either $1$ or $10$ neighbors based on what you found to be better in exercise 1. Keep Scikit-Surprise's default settings for all other parameters, but set the random state to $0$ for comparable results.

Use the model to predict the unobserved ratings for the users in the training set. Remove predictions for users that are not in the test set (`test_df`).

How many predictions are there and what is the average of all the predictions (rounded to 2 decimal places)?

*Note:* there may be items in the test set that are not present in the training set; these items are not included in counting the number of predictions

In [10]:
sim_options = {'name': 'cosine',
               'user_based': True
               }
algo = KNNWithMeans(k= 10,
                    sim_options=sim_options, 
                    random_state=0, 
                    verbose=False)

train_data = training_matrix.build_full_trainset()
algo.fit(train_data)

unobserved_ratings = train_data.build_anti_testset()
pred_KNN = algo.test(unobserved_ratings)

# Detect users from training set that are not in test
test_users = set(test_df['reviewerID'])

# Filter predictions: keep only those for users in the test set.
filtered_preds = [pred.est for pred in pred_KNN if pred.uid in test_users]

# Get the number of predictions and the average value rounded to 2 decimals.
num_predictions = len(filtered_preds)
avg_prediction = round(sum(filtered_preds) / num_predictions, 2) if num_predictions else None

print("Number of predictions:", num_predictions)
print("Average prediction:", avg_prediction)

Number of predictions: 52988
Average prediction: 4.73


### 2.2
Report the RMSE of the rating prediction of users and items in `test_df` (rounded to 3 decimal places).

Note that the documentation https://surprise.readthedocs.io/en/stable/predictions_module.html defines `r_ui` as the true rating of user $u$ for item $i$, but this can be somewhat misleading, as it depends on the input. If you run the prediction based on the anti-testset of the training set, then it won't have access to the true rating and instead use the mean rating of all users over all items, which then subsequently lands in the prediction class. 

In [11]:
df_pred_KNN = pd.DataFrame(pred_KNN)

# Merge test_df and df_pred_KNN on the corresponding key columns using an inner join.
merged_df = pd.merge(
    test_df,                                        # complete test set
    df_pred_KNN,                                    # predictions as DataFrame
    left_on=['reviewerID', 'asin'],                 # keys from test set
    right_on=['uid', 'iid'],                        # keys from predictions
    how='inner'
)

# Sort by the keys if needed
merged_df = merged_df.sort_values(by=["reviewerID", "asin"]).reset_index(drop=True)

# Extract actual and predicted ratings from the merged DataFrame.
actual_vals = merged_df["overall"]
est_vals = merged_df["est"]

print(f"Actual values shape: {actual_vals.shape}")
print(f"Predicted values shape: {est_vals.shape}")

# Compute RMSE
rmse_value = np.sqrt(mse(actual_vals, est_vals))
print(f"Test RMSE: {rmse_value:.3f}")

Actual values shape: (830,)
Predicted values shape: (830,)
Test RMSE: 0.295


## Exercise 3
Define a general method to get the top-k recommendations for each user, based on the rating predictions obtained in Exercise 2.1.

Print the top-k with $k=\{5, 10, 20\}$ recommendations for the user with ID `ARARUVZ8RUF5T` and its estimated ratings. Round the printed estimated ratings to 2 decimal places.

In [32]:
from collections import defaultdict
from surprise.prediction_algorithms.predictions import Prediction
from typing import Dict, List
import numpy as np

def get_top_k(predictions: List[Prediction], 
              k: int) -> Dict[str, List]:
    """Compute the top-K recommendation for each user from a set of predictions.
    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        k(int): The number of recommendation to output for each user.
    Returns:
        A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    """
    topk = defaultdict(list)

    # Sort first by uid, then by est in descending order
    sorted_predictions = sorted(predictions, key=lambda x: x.est, reverse=True)

    # Extract the top k predictions per user
    for pred in sorted_predictions:
        if len(topk[pred.uid]) < k:  # Ensure only top-k per user
            topk[pred.uid].append((pred.iid, pred.est))

    return topk

def print_top_k(user_id: str, topk: Dict[str, List]) -> None:
    user_ratings = topk[user_id]
    print(f"TOP-{len(user_ratings)} predictions for user {user_id}: {[(item, round(rating,2)) for (item, rating) in user_ratings]}")

In [33]:
for k in [5, 10, 20]:
    topk = get_top_k(pred_KNN, k)
    print_top_k("ARARUVZ8RUF5T", topk)

TOP-5 predictions for user ARARUVZ8RUF5T: [('B000WR2HB6', 5), ('B000FOI48G', 4.68), ('B000VV1YOY', 4.67), ('B001ET7FZE', 4.6), ('B000PKKAGO', 4.5)]
TOP-10 predictions for user ARARUVZ8RUF5T: [('B000WR2HB6', 5), ('B000FOI48G', 4.68), ('B000VV1YOY', 4.67), ('B001ET7FZE', 4.6), ('B000PKKAGO', 4.5), ('B00EF1QRMU', 4.47), ('B016V8YWBC', 4.46), ('B00W259T7G', 4.42), ('B00CZH3K1C', 4.33), ('B000GLRREU', 4.23)]
TOP-20 predictions for user ARARUVZ8RUF5T: [('B000WR2HB6', 5), ('B000FOI48G', 4.68), ('B000VV1YOY', 4.67), ('B001ET7FZE', 4.6), ('B000PKKAGO', 4.5), ('B00EF1QRMU', 4.47), ('B016V8YWBC', 4.46), ('B00W259T7G', 4.42), ('B00CZH3K1C', 4.33), ('B000GLRREU', 4.23), ('B00N2WQ2IW', 4.22), ('B00EYZY6LQ', 4.2), ('B01BNEYGQU', 4.17), ('B002GP80EU', 4.04), ('B0009RF9DW', 4.0), ('B000FI4S1E', 4.0), ('B000URXP6E', 4.0), ('B00006L9LC', 4.0), ('B0012Y0ZG2', 4.0), ('B001OHV1H4', 4.0)]


## Exercise 4
Report Precision@k (P@k), MAP@k and the MRR@k with $k=\{5, 10, 20\}$ averaged across users for the CF model. Round the scores to 3 decimal places. When computing P@k and MAP@k, we consider as relevant items those with an observed rating $\geq 4.0$ (i.e., those items from the test set with a rating $\geq$ 4.0). Thus, in this exercise, if a user receives an item that is present in the user’s test split, the item is considered relevant since the test split only contains items with ratings $\geq 4.0$. Reflect on the differences obtained between the metrics and the different cut-off $k$.

In [68]:
import numpy as np
from __future__ import (absolute_import, division, print_function, unicode_literals)
from collections import defaultdict
from surprise import Dataset


def precision_at_k(predictions: List[Prediction], 
                   df_test: pd.DataFrame,
                   k: int) -> Dict[str, float]:
    """Compute precision at k for each user
    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        df_test: Pandas DataFrame containing user-item ratings in 
            the test split.
        k(int): The number of recommendation to output for each user.
    Returns:
        A dict where keys are user ids (str)
        and values are the P@k (float) for each of them
    """

    precisions = defaultdict(float)
    
    # First map the predictions to each user.
    topk = get_top_k(predictions, k)

    # Cycle for each key in the top-k dictionary
    for user_id, raccomendation in topk.items():

        # Get the actual items the user has rated in the test set
        relevant_items = df_test[(df_test['reviewerID'] == user_id) & (df_test['overall'] >= 4)]['asin'].tolist()

        # Count the number of hits in the top-k list:
        num_hits = sum([1 for (item, _) in raccomendation if item in relevant_items])
        
        # Compute the precisiosn at k
        precisions[user_id] = num_hits / k

    return precisions



def mean_average_precision(predictions: List[Prediction], 
                           df_test: pd.DataFrame,
                           k: int) -> float:
    """Compute the mean average precision 
    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        df_test: Pandas DataFrame containing user-item ratings in 
            the test split.
        k(int): The number of recommendation to output for each user.
    Returns:
        The MAP@k (float)
    """

    average_precision_users = []
    
    for user_id, recommendations in topk.items():
        relevant_items = df_test[(df_test['reviewerID'] == user_id) & (df_test['overall'] >= 4)]['asin'].tolist()
        if not relevant_items:
            continue
        score = 0.0
        num_hits = 0
        for i, (item, _) in enumerate(recommendations, start=1):
            if item in relevant_items:
                num_hits += 1
                score += num_hits / i
        if num_hits > 0:
            average_precision_users.append(score / min(len(relevant_items), k))
        else:
            average_precision_users.append(0.0)
    
    mapk = np.mean(average_precision_users)
    return mapk
    

def mean_reciprocal_rank(predictions: List[Prediction], 
                         df_test: pd.DataFrame, 
                         k) -> float:
    """Compute the mean reciprocal rank 
    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        df_test: Pandas DataFrame containing user-item ratings in 
            the test split.
        k(int): The number of recommendation to output for each user.
    Returns:
        The MRR@k (float)
    """
    
    reciprocal_rank = []
    
    # Write your code here
    
    mean_rr = np.mean(reciprocal_rank)
    return mean_rr

In [69]:
# -------- NB BASED --------
print("Metrics for Neighborhood based CF:")
# PRECISION
precisions_nb = precision_at_k(pred_KNN, 
    test_df, k=5)
print("Averaged P@5: {:.3f}".format(sum(prec for prec in precisions_nb.values()) / len(precisions_nb)))
# MAP 
map_nb = mean_average_precision(pred_KNN, 
    test_df, k=5)
print("MAP@5: {:.3f}".format(map_nb))
# MRR
mrr_nb = mean_reciprocal_rank(pred_KNN, 
    test_df, k=5)
print("MRR@5: {:.3f}".format(mrr_nb))



# PRECISION
precisions_nb = precision_at_k(# Complete, 
    test_df, k=10)
print("Averaged P@10: {:.3f}".format(sum(prec for prec in precisions_nb.values()) / len(precisions_nb)))
# MAP 
map_nb = mean_average_precision(# Complete, 
    test_df, k=10)
print("MAP@10: {:.3f}".format(map_nb))
# MRR
mrr_nb = mean_reciprocal_rank(# Complete, 
    test_df, k=10)
print("MRR@10: {:.3f}".format(mrr_nb))



# PRECISION
precisions_nb = precision_at_k(# Complete, 
    test_df, k=20)
print("Averaged P@20: {:.3f}".format(sum(prec for prec in precisions_nb.values()) / len(precisions_nb)))
# MAP 
map_nb = mean_average_precision(# Complete, 
    test_df, k=20)
print("MAP@20: {:.3f}".format(map_nb))
# MRR
mrr_nb = mean_reciprocal_rank(# Complete, 
    test_df, k=20)
print("MRR@20: {:.3f}".format(mrr_nb))

Metrics for Neighborhood based CF:
Averaged P@5: 0.143
MAP@5: 0.174
MRR@5: nan


TypeError: precision_at_k() missing 1 required positional argument: 'df_test'

## Exercise 5

Based on the top-5, top-10 and top-20 predictions from Exercise 3, compute the system’s hit rate averaged over the total number of users in the test set.

In [None]:
def hit_rate(top_k: Dict[str, List[str]],
             df_test: pd.DataFrame) -> float:
    """Compute the hit rate
    Args:
        top_k: A dictionary where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n (output of get_top_k())
        df_test: Pandas DataFrame containing user-item ratings in 
            the test split.
    Returns:
        The average hit rate
    """
    hits = 0
    
    # Write your code here
    
    return hits

print("Hit Rate for Neighborhood based CF:")
print("Hit Rate (top-5): {:.3f}".format(hit_rate( #Complete )))
print("Hit Rate (top-10): {:.3f}".format(hit_rate( #Complete )))
print("Hit Rate (top-20): {:.3f}".format(hit_rate( #Complete )))