*UE Learning from User-generated Data, CP MMS, JKU Linz 2024*
# Exercise 4: Evaluation

In this exercise we evaluate the accuracy of the three recommender systems we have already implemented. First we implement the DCG and nDCG metrics, then we build a simple evaluation framework to compare the three recommenders in terms of nDCG. The implementations of the three recommender systems are provided in the file rec.py and are imported later in the notebook.
Please consult the lecture slides and the presentation from UE Session 4 for a recap.

Make sure to rename the notebook according to the convention:

LUD24_ex04_k<font color='red'><Matr. Number\></font>_<font color='red'><Surname-Name\></font>.ipynb

for example:

LUD24_ex04_k000007_Bond_James.ipynb

## Implementation
In this exercise, as before, you are required to write a number of functions. Only implemented functions are graded. Insert your implementations into the templates provided. Please don't change the templates even if they are not pretty. Don't forget to test your implementation for correctness and efficiency. **Make sure to try your implementations on toy examples and sanity checks.**

Please **only use libraries already imported in the notebook**.

In [1]:
import pandas as pd
import numpy as np

## <font color='red'>TASK 1/2</font>: Evaluation Metrics

Implement DCG and nDCG in the corresponding templates.

### DCG Score
Implement DCG following the input/output convention:
#### Input:
* predictions - (not an interaction matrix!) numpy array with recommendations. The row index corresponds to the user id, the column index to the rank of the item in the cell. Every cell (i,j) contains the **item id** recommended to user (i) at position (j) of the list. For example:

The predictions structure [[12, 7, 99], [0, 97, 6]] means that the user with id==1 (second row) got item **0** recommended at the top of the list, item **97** in second place and item **6** in third place.

* test_interaction_matrix - (plain interaction matrix format as before!) interaction matrix constructed from interactions held out as a test set, rows - users, columns - items, cells - 0 or 1

* topK - integer - top "how many" to consider for the evaluation. By default top 10 items are to be considered
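To make the two input formats concrete, the sketch below looks up, for every recommended item id, whether that item appears in the user's held-out interactions. The array values are made up for illustration, and `np.take_along_axis` is only one of several ways to do this lookup.

```python
import numpy as np

# Toy data (made up): 2 users, 5 items, top-3 recommendations per user.
predictions = np.array([[4, 1, 2],
                        [0, 3, 1]])
test_interaction_matrix = np.array([[0, 1, 0, 0, 1],
                                    [0, 0, 0, 1, 0]])

# hits[u, r] is 1 if the item recommended to user u at rank r is in the test set.
hits = np.take_along_axis(test_interaction_matrix, predictions, axis=1)
print(hits)
```

User 0 gets hits at the first two ranks (items 4 and 1), user 1 only at the second rank (item 3).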

#### Output:
* DCG score

Don't forget, DCG is calculated for every user separately and then the average is returned.


<font color='red'>**Attention!**</font> Use logarithm with base 2 for discounts! Remember that the top1 recommendation shouldn't get discounted!
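As a worked example of the discount convention, here is a minimal sketch for a single user with binary relevance. The hit pattern is made up; the point is that rank 1 is divided by log2(2) = 1 and therefore stays undiscounted.

```python
import numpy as np

# Relevance of one user's top-4 recommendations (made-up values): hits at ranks 1 and 3.
hits = np.array([1, 0, 1, 0])

# Rank r (1-based) is discounted by log2(r + 1), so rank 1 has discount log2(2) = 1.
ranks = np.arange(1, len(hits) + 1)
discounts = np.log2(ranks + 1)

dcg = np.sum(hits / discounts)
print(dcg)  # 1/log2(2) + 1/log2(4) = 1 + 0.5 = 1.5
```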

In [2]:
def get_dcg_score(predictions: np.ndarray, test_interaction_matrix: np.ndarray, topK: int = 10) -> float:
    """
    predictions - 2D np.ndarray, predictions of the recommendation algorithm for each user;
    test_interaction_matrix - 2D np.ndarray, test interaction matrix for each user;
    topK - int, topK recommendations should be evaluated;
    
    returns - float, mean dcg score over all users;
    """
    score = None

    # TODO: YOUR IMPLEMENTATION.
    dcg_scores = []
    for user_id in range(predictions.shape[0]):
        preds = predictions[user_id, :topK]
        actual = test_interaction_matrix[user_id, :]

        dcg = 0
        for i, item in enumerate(preds):
            if actual[item] == 1:
                dcg += 1 / np.log2(i+2)

        dcg_scores.append(dcg)

    score = np.mean(dcg_scores)

    return score

In [3]:
predictions = np.array([[0, 1, 2, 3], [3, 2, 1, 0]])
test_interaction_matrix = np.array([[1, 0, 0, 0], [0, 0, 0, 1]])

dcg_score = get_dcg_score(predictions, test_interaction_matrix, topK=4)

assert np.isclose(dcg_score, 1), "1 expected"

* Can DCG score be higher than 1?
* Can the average DCG score be higher than 1?
* Why?
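One way to probe these questions numerically is to compute the best case for a single user, where every one of the top K recommendations is a hit. This is only a sketch with binary relevance and K = 10 assumed.

```python
import numpy as np

top_k = 10
# Best case for one user: every one of the top_k recommendations is a hit.
max_dcg = np.sum(1.0 / np.log2(np.arange(top_k) + 2))
print(round(max_dcg, 4))  # 4.5436
```

Since the mean over users cannot exceed the largest per-user value, this number also bounds the average DCG at K = 10.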

### nDCG Score

Following the same parameter convention as for DCG, implement the nDCG metric.

<font color='red'>**Attention!**</font> Remember that ideal DCG is calculated separately for each user and depends on the number of tracks held out for them as a Test set! Use logarithm with base 2 for discounts! Remember that the top1 recommendation shouldn't get discounted!

<font color='red'>**Note:**</font> nDCG is calculated for **every user separately** and then the average is returned. You do not necessarily need to use the function you implemented above. Writing nDCG from scratch might be a good idea as well.
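The dependence of the ideal DCG on the number of held-out items can be sketched as follows. `ideal_dcg` is a hypothetical helper name, not part of the template, and binary relevance is assumed.

```python
import numpy as np

def ideal_dcg(num_relevant: int, top_k: int) -> float:
    """DCG of a perfect ranking: all relevant items (at most top_k) placed first."""
    n = min(int(num_relevant), top_k)
    return float(np.sum(1.0 / np.log2(np.arange(n) + 2)))

print(ideal_dcg(1, 10))   # one held-out item: 1.0
print(ideal_dcg(2, 10))   # two held-out items: 1 + 1/log2(3)
print(ideal_dcg(50, 3))   # more held-out items than positions: capped at top_k = 3
```

A user with no held-out items has an ideal DCG of 0, so the division in nDCG needs a guard for that case.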

In [4]:
def get_ndcg_score(predictions: np.ndarray, test_interaction_matrix: np.ndarray, topK=10) -> float:
    """
    predictions - np.ndarray, predictions of the recommendation algorithm for each user;
    test_interaction_matrix - np.ndarray, test interaction matrix for each user;
    topK - int, topK recommendations should be evaluated;
    
    returns - float, average ndcg score over all users;
    """
    score = None
    
    # TODO: YOUR IMPLEMENTATION.
    ndcg_scores = []
    for user_id in range(predictions.shape[0]):
        preds = predictions[user_id, :topK]
        actual = test_interaction_matrix[user_id, :]

        dcg, idcg = 0, 0
        for i, item in enumerate(preds):
            if actual[item] == 1:
                dcg += 1 / np.log2(i+2)

        # Cast to int so that range() below also works if the matrix has a float dtype.
        num_relevant_items = int(np.sum(actual))
        for i in range(min(num_relevant_items, topK)):
            idcg += 1 / np.log2(i+2)

        ndcg = dcg / idcg if idcg > 0 else 0
        ndcg_scores.append(ndcg)

    score = np.mean(ndcg_scores)
    
    return score

In [5]:
predictions = np.array([[0, 1, 2, 3], [3, 2, 1, 0]])
test_interaction_matrix = np.array([[1, 0, 0, 0], [0, 0, 0, 1]])

ndcg_score = get_ndcg_score(predictions, test_interaction_matrix, topK=4)

assert np.isclose(ndcg_score, 1), "ndcg score is not correct."

* Can nDCG score be higher than 1?
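A few extra toy cases can help when sanity-checking an nDCG implementation. The sketch below uses a vectorized variant under the name `ndcg_at_k`, which is a hypothetical helper and not the template function; it assumes binary relevance and assigns nDCG 0 to users without held-out items.

```python
import numpy as np

def ndcg_at_k(predictions: np.ndarray, test_matrix: np.ndarray, top_k: int = 10) -> float:
    """Vectorized nDCG sketch (hypothetical helper, binary relevance)."""
    preds = predictions[:, :top_k]
    hits = np.take_along_axis(test_matrix, preds, axis=1)
    discounts = np.log2(np.arange(top_k) + 2)
    dcg = (hits / discounts[:hits.shape[1]]).sum(axis=1)

    # Ideal DCG: the first min(#relevant, top_k) positions are all hits.
    n_rel = np.minimum(test_matrix.sum(axis=1), top_k).astype(int)
    cum_ideal = np.concatenate(([0.0], np.cumsum(1.0 / discounts)))
    idcg = cum_ideal[n_rel]

    # Users without test items get nDCG 0 instead of a division by zero.
    ndcg = np.divide(dcg, idcg, out=np.zeros_like(dcg, dtype=float), where=idcg > 0)
    return float(ndcg.mean())

# Single relevant item placed at rank 2: nDCG = 1/log2(3).
print(ndcg_at_k(np.array([[0, 1, 2, 3]]), np.array([[0, 1, 0, 0]]), top_k=4))
# User without held-out items: nDCG = 0.
print(ndcg_at_k(np.array([[0, 1, 2, 3]]), np.array([[0, 0, 0, 0]]), top_k=4))
```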

## <font color='red'>TASK 2/2</font>: Evaluation
Use the provided rec.py (see imports below) to build a simple evaluation framework. It should be able to evaluate TopPop, ItemKNN and SVD.

*Make sure to place the provided rec.py next to your notebook for the imports to work.*


In [6]:
from rec import svd_decompose, svd_recommend_to_list  #SVD
from rec import inter_matr_implicit
from rec import recTopK  #ItemKNN
from rec import recTopKPop  #TopPop

Load the users, the items, and both the train and test interactions
from the **new version of the lfm-tiny-tunes dataset** provided with the assignment.

In [7]:
def read(dataset, file):
    return pd.read_csv(dataset + '/' + dataset + '.' + file, sep='\t')

# TODO: YOUR IMPLEMENTATION

users = read("lfm-tiny-tunes", 'user')
items = read("lfm-tiny-tunes", 'item')
train_inters = read("lfm-tiny-tunes", 'inter_train')
test_inters = read("lfm-tiny-tunes", 'inter_test')

train_interaction_matrix = inter_matr_implicit(users=users, items=items, interactions=train_inters,
                                               dataset_name="lfm-tiny-tunes")
test_interaction_matrix = inter_matr_implicit(users=users, items=items, interactions=test_inters,
                                              dataset_name="lfm-tiny-tunes")
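For intuition, the following sketch shows how a tab-separated interaction file can be turned into a binary user-item matrix. It is not the implementation of `inter_matr_implicit` from rec.py; the column names `user_id` and `item_id` and the in-memory file are assumptions made for illustration.

```python
import io
import numpy as np
import pandas as pd

# Stand-in for a tab-separated interaction file; column names are assumptions.
tsv = "user_id\titem_id\n0\t2\n1\t0\n1\t2\n"
inters = pd.read_csv(io.StringIO(tsv), sep="\t")

n_users, n_items = 2, 3  # would normally come from the user and item files

# Binary matrix: rows are users, columns are items, 1 marks an observed interaction.
matrix = np.zeros((n_users, n_items), dtype=int)
matrix[inters["user_id"].to_numpy(), inters["item_id"].to_numpy()] = 1
print(matrix)
```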

### Get Recommendations

Implement the function below to get recommendations from all 3 recommender algorithms. Make sure you use the provided config dictionary and pay attention to the structure for the output dictionary - we will use it later.

In [8]:
config_predict = {
    #interaction matrix
    "train_inter": train_interaction_matrix,
    #topK parameter used for all algorithms
    "top_k": 10,
    #specific parameters for all algorithms
    "recommenders": {
        "SVD": {
            "n_factors": 50
        },
        "ItemKNN": {
            "n_neighbours": 5
        },
        "TopPop": {
        }
    }
}

In [9]:
def get_recommendations_for_algorithms(config: dict) -> dict:
    """
    config - dict, configuration as defined above;

    returns - dict, already predefined below with name "rec_dict";
    """

    #use this structure to return results
    # rec_dict = {"recommenders": {
    #     "SVD": {
    #         #Add your predictions here
    #         "predictions": np.array([])
    #     },
    #     "ItemKNN": {
    #         "predictions": np.array([])
    #     },
    #     "TopPop": {
    #         "predictions": np.array([])
    #     },
    # }}

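    # Note: the template above uses the key "predictions", but the assertions
    # in the next cell expect "recommendations", so that key is used here.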
    rec_dict = {"recommenders": {
        "SVD": {
            "recommendations": np.array([])
        },
        "ItemKNN": {
            "recommendations": np.array([])
        },
        "TopPop": {
            "recommendations": np.array([])
        },
    }}

    # TODO: YOUR IMPLEMENTATION.
    train_inter = config["train_inter"]
    top_k = config["top_k"]

    for key, params in config["recommenders"].items():
        # The decomposition does not depend on the user, so compute it once
        # per run instead of once per user.
        decomp = svd_decompose(train_inter, params["n_factors"]) if key == "SVD" else None

        recommendations = []
        for user_id in range(train_inter.shape[0]):
            if key == "SVD":
                seen_item_ids = np.where(train_inter[user_id] != 0)
                recom = svd_recommend_to_list(user_id, seen_item_ids, *decomp, top_k)
            elif key == "ItemKNN":
                recom = recTopK(train_inter, user_id, top_k, params["n_neighbours"])
            elif key == "TopPop":
                recom = recTopKPop(train_inter, user_id, top_k)
            else:
                # Unknown recommender key: nothing to compute.
                continue
            recommendations.append(recom)

        if recommendations:
            rec_dict["recommenders"][key]["recommendations"] = np.array(recommendations)

    return rec_dict

In [10]:
recommendations = get_recommendations_for_algorithms(config_predict)

assert "SVD" in recommendations["recommenders"] and "recommendations" in recommendations["recommenders"]["SVD"]
assert isinstance(recommendations["recommenders"]["SVD"]["recommendations"], np.ndarray)
assert "ItemKNN" in recommendations["recommenders"] and "recommendations" in recommendations["recommenders"]["ItemKNN"]
assert isinstance(recommendations["recommenders"]["ItemKNN"]["recommendations"], np.ndarray)
assert "TopPop" in recommendations["recommenders"] and "recommendations" in recommendations["recommenders"]["TopPop"]
assert isinstance(recommendations["recommenders"]["TopPop"]["recommendations"], np.ndarray)


In [48]:
# Results for first 5 users
# SVD 
svd_preds = recommendations["recommenders"]["SVD"]["recommendations"]
print(f"SVD: \n{svd_preds[:5]}")

# ItemKNN
itemknn_preds = recommendations["recommenders"]["ItemKNN"]["recommendations"]
print(f"\nItemKNN: \n{itemknn_preds[:5]}")

# TopPop
top_pop_preds = recommendations["recommenders"]["TopPop"]["recommendations"]
print(f"\nTopPop: \n{top_pop_preds[:5]}")

SVD: 
[[262 336 125 158  15 251  25  18 269 242]
 [  6 153  43  89 251 146 125  97 216  80]
 [317  15 130 251 259 335  14  93  62 233]
 [ 17 144 146  15 142  10 198  42  19  41]
 [ 29 115  97 251 156  62  68 335 142 179]]

ItemKNN: 
[[118  16 119 117  39  17  40   1 251 125]
 [153  33  40  17 251 144  30  31  97  86]
 [  4 117 317 133 257 233  80  78  58  77]
 [ 31 105  10   4  53 144  86   3  40 119]
 [118 119  29  12  40 105  16  47  35  99]]

TopPop: 
[[ 16  40  33 105  47  35 118  45 119  30]
 [ 16   3  40  33 105  47 118  35  45 119]
 [  3  40  33 105  47 118  35  45 119   4]
 [  3  40 105  47 118  45 119   4  58  72]
 [ 16  40  33 105  47  35 118  45 119   4]]


### Evaluate Recommendations

Implement the function such that it evaluates the previously generated recommendations. Read everything from the config dictionary passed to the function as an argument; **DO NOT** access the global *config_test* directly. Pay attention to the structure of the output dictionary.

In [30]:
config_test = {
    "top_k": 10,
    "test_inter": test_interaction_matrix,
    "recommenders": {}  # here you can access the recommendations from get_recommendations_for_algorithms

}
# add dictionary with recommendations to config dictionary
config_test.update(recommendations)
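The `update` call works here because both dictionaries share the top-level key `"recommenders"`. A small sketch with made-up values shows the behaviour: the value under an existing key is replaced as a whole, nested dictionaries are not merged.

```python
config = {"top_k": 10, "recommenders": {}}
results = {"recommenders": {"SVD": {"recommendations": [[1, 2, 3]]}}}

# update() replaces the value under an existing key; it does not merge nested dicts.
config.update(results)
print(list(config["recommenders"].keys()))
print(config["top_k"])  # keys that are absent from `results` stay untouched
```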

In [34]:
def evaluate_algorithms(config: dict) -> dict:
    """
    config - dict, configuration as defined above;

    returns - dict, { Recommender Key from input dict: { "ndcg": float - ndcg from evaluation for this recommender} };
    """

    metrics = {
        "SVD": {
        },
        "ItemKNN": {
        },
        "TopPop": {
        },
    }

    # TODO: YOUR IMPLEMENTATION.
    test_inter_matrix = config["test_inter"]
    top_k = config["top_k"]

    for recommender in config["recommenders"]:
        predictions = config["recommenders"][recommender]["recommendations"]
        ndcg_score = get_ndcg_score(predictions, test_inter_matrix, top_k)
        metrics[recommender]["ndcg"] = ndcg_score

    return metrics

### Evaluating Every Algorithm
Make sure everything works.
We expect ItemKNN to outperform the other algorithms on our small data sample.

In [35]:
evaluations = evaluate_algorithms(config_test)

assert "SVD" in evaluations and "ndcg" in evaluations["SVD"] and isinstance(evaluations["SVD"]["ndcg"], float)
assert "ItemKNN" in evaluations and "ndcg" in evaluations["ItemKNN"] and isinstance(evaluations["ItemKNN"]["ndcg"], float)
assert "TopPop" in evaluations and "ndcg" in evaluations["TopPop"] and isinstance(evaluations["TopPop"]["ndcg"], float)

In [36]:
for recommender in evaluations.keys():
    print(f"{recommender} ndcg: {evaluations[recommender]['ndcg']}")

SVD ndcg: 0.14300409512681314
ItemKNN ndcg: 0.20568927986328173
TopPop ndcg: 0.09429753895348715
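For a side-by-side comparison, the metrics dictionary can be turned into a small table. The sketch below hard-codes the nDCG values printed above so that it runs on its own; in the notebook the `evaluations` dictionary would be used directly.

```python
import pandas as pd

# nDCG@10 values copied from the printed output of the previous cell.
evaluations = {
    "SVD": {"ndcg": 0.14300409512681314},
    "ItemKNN": {"ndcg": 0.20568927986328173},
    "TopPop": {"ndcg": 0.09429753895348715},
}

# One row per recommender, sorted from best to worst.
table = pd.DataFrame.from_dict(evaluations, orient="index").sort_values("ndcg", ascending=False)
print(table.round(4))
```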


## Questions and Potential Future Work
* How would you try to improve the performance of all three algorithms?
* What other metrics would you consider to compare these recommender systems?
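As a starting point for the second question, precision@K and recall@K could be computed from the same inputs as nDCG. The sketch below is one possible formulation, with `precision_recall_at_k` as a hypothetical helper name; it assumes binary relevance and assigns recall 0 to users without held-out items.

```python
import numpy as np

def precision_recall_at_k(predictions: np.ndarray, test_matrix: np.ndarray, top_k: int = 10):
    """Mean precision@k and recall@k (hypothetical helper, binary relevance)."""
    preds = predictions[:, :top_k]
    hits = np.take_along_axis(test_matrix, preds, axis=1).sum(axis=1)
    n_rel = test_matrix.sum(axis=1)

    precision = hits / top_k
    # Users without held-out items contribute a recall of 0.
    recall = np.divide(hits, n_rel, out=np.zeros(len(hits)), where=n_rel > 0)
    return float(precision.mean()), float(recall.mean())

# Toy data (made up): user 0 has two held-out items, user 1 has none.
toy_preds = np.array([[0, 1], [2, 3]])
toy_test = np.array([[1, 0, 1, 0], [0, 0, 0, 0]])
print(precision_recall_at_k(toy_preds, toy_test, top_k=2))  # (0.25, 0.25)
```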

In [None]:
# The end.