# Metrics

Here are common metrics that have been designed or adapted specifically for recommendation systems.

In [1]:
import numpy as np
import pandas as pd

import unittest

from IPython.display import HTML, Latex
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

from surprise.prediction_algorithms.slope_one import SlopeOne
from surprise.model_selection import cross_validate
from surprise.dataset import Dataset
from surprise.reader import Reader
from surprise.prediction_algorithms.knns import KNNBasic

header_template = "<p style='font-size:17px'>{}</p>"

**Sources**

- Article on TSD [Evaluation Metrics for Recommendation Systems — An Overview](https://towardsdatascience.com/evaluation-metrics-for-recommendation-systems-an-overview-71290690ecba);
- [Mean average precision for ranking and classification](https://www.evidentlyai.com/ranking-metrics/mean-average-precision-map).

### Task

The following cell generates the task that we will use as an example. It has the format `<object/item> <-> relevance`.

All the sources I checked describe how to evaluate models in the binary case, where each item is either "relevant" or "non-relevant". In practice the object/item pair is often mapped to non-binary values (ratings, or even preferences expressed as money spent), and I have met such tasks myself, but for simplicity let's start with the classical variant. So we'll use the following definition:

$$r_{ij} =\begin{cases}
1, & \text{if the $i$-th item is relevant for the $j$-th object},\\
0, & \text{otherwise},
\end{cases} \quad i=\overline{1,n}, \; j=\overline{1,m}.$$

Where:
- $n$ - number of the items under consideration;
- $m$ - number of the objects under consideration.

In [2]:
r_width = 10
r_height = 30
np.random.seed(10)

R, c = make_blobs(
    n_samples=r_height,
    n_features=r_width,
    centers=3,
    random_state=10,
    cluster_std=1
)
R = np.round((R-R.min())/(R.max()-R.min())).astype(int)

# generating object/item combinations to be left empty
combination_counts = 20
nan_combinations = np.concatenate(
    [
        np.random.randint(0, R.shape[0], [combination_counts,1]),
        np.random.randint(0, R.shape[1], [combination_counts,1])
    ],
    axis=1
)

R_frame = pd.Series(
    R.ravel(),
    index = pd.MultiIndex.from_tuples(
            [
                (j,i) 
                for j in np.arange(R.shape[1]) 
                for i in np.arange(R.shape[0])
            ],
            names = ["object", "item"]
    ),
    name = "relevant"
).reset_index()

R_frame.sample(10)

,object,item,relevant
148,4,28,0
147,4,27,1
154,5,4,1
8,0,8,1
105,3,15,1
24,0,24,0
125,4,5,0
276,9,6,0
130,4,10,1
226,7,16,0


### Solutions

We need the results of some algorithms to calculate metrics for them. We'll compare the results of the models:

- Random model - just random scores for each item;
- Model provided by `surprise.prediction_algorithms.knns.KNNBasic`.

In [3]:
reader = Reader(rating_scale=(0,1))
surp_dataset = Dataset.load_from_df(
    R_frame[["object", "item", 'relevant']], 
    reader
)
my_data_set = surp_dataset.build_full_trainset()

np.random.seed(10)
R_frame["Random scores"] = np.random.normal(size=len(R_frame))

model = KNNBasic(k=25,verbose=False)
model = model.fit(my_data_set)
R_frame["KNN scores"] = R_frame[["object", "item"]].apply(
    lambda row: model.predict(
        row["object"], row["item"]
    ).est, 
    axis = 1
)

Now, for each object $j=\overline{1,m}$, we have two arrays of scores: $S_{1,j}$ and $S_{2,j}$, generated by the first and second models, respectively. The array for the $M$-th model is written as $S_{M,j} = \{s_{M,j,1}, s_{M,j,2}, ..., s_{M,j,n}\}$. If $s_{M,j,t} > s_{M,j,k}$, the $t$-th item is considered more relevant than the $k$-th item for object $j$, according to the $M$-th model.

This lets us order the items by relevance according to each model. We define the order of the items as:

$$I_{M,j}=\{i_1, i_2, ... , i_n\}: k<t \Leftrightarrow s_{M,j,i_k} > s_{M,j,i_t}.$$

Or, in simple words, in $I_{M,j}$ items go in descending order of preference for the $j$-th object according to the $M$-th model.

We also introduce the sequence of true relevance values of the items, arranged in the order given by the model:

$$R'_{M,j}=\{r_{i_1,j}, r_{i_2,j}, ..., r_{i_n,j}\}$$
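The two definitions above can be sketched for a single object as follows. The arrays are hypothetical toy data rather than rows of `R_frame`, and sorting the negated scores with a stable sort is just one possible way to handle ties.

```python
import numpy as np

# Hypothetical toy data for one object j (not taken from R_frame).
relevance = np.array([1, 1, 0, 0, 1])          # r_{ij} for items i = 0..4
scores = np.array([0.4, 0.1, 0.2, 0.5, 0.3])   # s_{M,j,i} from some model M

# I_{M,j}: item indices in descending order of score.
# A stable sort on the negated scores keeps tied items in their original order.
order = np.argsort(-scores, kind="stable")

# R'_{M,j}: the true relevance values rearranged in that order.
relevance_in_order = relevance[order]

print(order)               # [3 0 4 2 1]
print(relevance_in_order)  # [0 1 1 0 1]
```

Every metric in this page is then a function of `relevance_in_order` only.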

The following cell displays $I_{1,j}$, $I_{2,j}$ and $R'_{1,j}$, $R'_{2,j}$ for selected $j$.

In [4]:
object_ind = 8
temp_object = R_frame[R_frame["object"] == object_ind]
display(temp_object)

get_order = lambda scores_name: ",".join(
    temp_object
    .sort_values(scores_name, ascending=False)["item"]
    .astype(str)
    .to_list()
)
get_relevances_order = lambda scores_name: ", ".join(
    temp_object
    .sort_values(
        scores_name, 
        ascending=False
    )["relevant"]
    .astype("str").to_list()
)

display(HTML(header_template.format("Random model")))
random_order = get_order("Random scores")
random_rel_order = get_relevances_order("Random scores")
display(Latex(rf"$I_{{1,{object_ind}}}=\{{ {random_order} \}}$"))
display(Latex(rf"$R'_{{1,{object_ind}}}=\{{ {random_rel_order} \}}$"))

display(HTML("<hr>"))

display(HTML(header_template.format("KNN model")))
KNN_order = get_order("KNN scores")
KNN_rel_order = get_relevances_order("KNN scores")
display(Latex(rf"$I_{{2,{object_ind}}}=\{{ {KNN_order} \}}$"))
display(Latex(rf"$R'_{{2,{object_ind}}}=\{{ {KNN_rel_order} \}}$"))

del object_ind, temp_object

Unnamed: 0,object,item,relevant,Random scores,KNN scores
240,8,0,1,-2.017719,1.0
241,8,1,0,0.540541,0.357859
242,8,2,1,-1.442299,0.464464
243,8,3,1,-1.60885,1.0
244,8,4,0,-1.006569,0.563374
245,8,5,0,-0.257534,0.357859
246,8,6,0,0.730507,0.535536
247,8,7,1,-1.698401,0.559097
248,8,8,0,1.674076,0.535536
249,8,9,0,1.163724,0.535536



As you can see, according to the KNN model, relevant items are more likely to get higher scores.

## recall@k

$recall_j@k$ measures how many of the relevant items appear in the top $k$, out of all the relevant items, where $k$ is the number of recommendations generated for the $j$-th object. More formally:

$$recall_j@k = \frac{\sum_{i=1}^k r_{ij}}{\sum_{i=1}^n r_{ij}}$$

Where:
- items are sorted according to their preference for $j$-th object for the model under consideration;
- $\sum_{i=1}^k r_{ij}$ - number of relevant items in first $k$ items;
- $\sum_{i=1}^n r_{ij}$ - total number of relevant items for $j$-th object.
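To see how the formula behaves as $k$ grows, here is a minimal sketch that evaluates it for every $k$ at once. The relevance sequence is a hypothetical example, assumed to be already sorted by descending model score.

```python
import numpy as np

# Hypothetical relevance values, already sorted by descending model score.
relevance_in_order = np.array([0, 1, 1, 0, 1])

# Numerator of recall@k for k = 1..n: relevant items among the first k.
hits_at_k = np.cumsum(relevance_in_order)

# Denominator: total number of relevant items for this object.
recall_at_k = hits_at_k / relevance_in_order.sum()

print(hits_at_k)  # [0 1 2 2 3]
for k, value in enumerate(recall_at_k, start=1):
    print(f"recall@{k} = {value:.3f}")
```

Recall never decreases with $k$ and reaches 1 once every relevant item is inside the top $k$.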

### Consider a specific object

Let's examine a specific object to gain a clear understanding of the situation and calculate the recall at 3 ($recall@3$) for it. We will compare models to discern any differences. In the following cell, we have extracted a subframe for the specific object and sorted it based on the results from the models. The example has been selected to highlight the disparity in $recall@3$ between the models:

In [5]:
k = 3
obj = 4

model1_tab = R_frame.loc[
    R_frame["object"] == obj,
    [
        "item",
        "relevant",
        "Random scores"
    ]
].sort_values(
    "Random scores", 
    ascending=False
).set_index("item")

model2_tab = R_frame.loc[
    R_frame["object"] == obj,
    [
        "item",
        "relevant",
        "KNN scores"
    ]
].sort_values(
    "KNN scores", 
    ascending=False
).set_index("item")

model1_recall = (
    model1_tab["relevant"].iloc[:k].sum()/
    model1_tab["relevant"].sum()
)
model2_recall = (
    model2_tab["relevant"].iloc[:k].sum()/
    model2_tab["relevant"].sum()
)

display(HTML(
    f"""
    <div style='display: flex;justify-content: space-around;'>
        <div>
            {model1_tab.to_html()}
            <p style='font-size:20px'>
                recall@{k} - {round(model1_recall*100,2)}%
            </p>
        </div>
        <div>
            {model2_tab.to_html()}
            <p style='font-size:20px'>
                recall@{k} - {round(model2_recall*100,2)}%
            </p>
        </div>
    </div>
    """
))

item,relevant,Random scores
3,1,2.465325
18,1,1.985386
8,0,1.656717
25,0,1.614408
19,1,1.447166
4,0,1.383232
16,1,1.339926
2,1,1.236205
29,0,1.134973
6,0,1.022516

item,relevant,KNN scores
0,1,1.0
3,1,1.0
20,1,1.0
10,1,1.0
23,1,0.912074
14,1,0.908077
19,1,0.725166
18,1,0.725166
16,1,0.725166
13,1,0.72044


### Python code

The following function implements $recall@k$ in Python.

In [6]:
def recall_k(relevance_array, pred_score, k):
    '''
    Compute recall@k: the proportion of all relevant items 
    that appear within the top k recommendations. It shows 
    how well the model brings relevant items into the 
    initial recommendations.
    
    Parameters
    ----------
    relevance_array : numpy.array
        binary array marking observations that were relevant;
    pred_score : numpy.array
        predicted scores, expected to be higher 
        the more relevant the item is;
    k : int
        number of top-scored items to consider.

    Returns
    ----------
    out : float
        value of the metric.
    '''
    if len(relevance_array)!=len(pred_score):
        raise ValueError(
            "`relevance_array` and `pred_score` must be the same size"
        )
    elif len(relevance_array) < k:
        raise ValueError(
            "k is greater than the number of observations"
        )
    
    relevant_in_k = np.sum(
        relevance_array[np.argsort(pred_score)[::-1]][:k]
    )
    relevant_total = np.sum(relevance_array)
    return relevant_in_k/relevant_total

Here are some unit tests for the function above:

In [7]:
class TestRecall(unittest.TestCase):
    def test_different_sizes(self):
        '''
        We must check that if the sizes of arrays with 
        relevance and prediction differ, an error must 
        be raised.
        '''
        with self.assertRaises(ValueError):
            recall_k(
                np.array([1, 1, 0]),
                np.array([0.3, 0.2, 0.3, 0.2]),
                1
            )

    def test_k_more_obs(self):
        '''
        K cannot be more than the number of observations 
        we are considering.
        '''
        with self.assertRaises(ValueError):
            recall_k(
                np.array([1, 1, 0, 0, 1]),
                np.array([0.4, 0.1, 0.2, 0.5, 0.3]),
                10
            )
    
    def test_computions(self):
        '''
        Just basic test with known result
        '''
        real_ans = recall_k(
            np.array([1, 1, 0, 0, 1]),
            np.array([0.4, 0.1, 0.2, 0.5, 0.3]),
            3
        )
        exp_ans = 2/3
        self.assertAlmostEqual(real_ans, exp_ans, delta=0.000001)
ans = unittest.main(argv=[''], verbosity=2, exit=False)
del TestRecall

test_computions (__main__.TestRecall)
Just basic test with known result ... ok
test_different_sizes (__main__.TestRecall)
We must check that if the sizes of arrays with ... ok
test_k_more_obs (__main__.TestRecall)
K cannot be more than the number of observations ... ok

----------------------------------------------------------------------
Ran 3 tests in 0.003s

OK


The following cell calculates $recall@4$ for our example: first for each object separately, then averaged over all objects.

In [8]:
show = R_frame.groupby("object").apply(
    lambda object: pd.Series({
        "recall for model 1" : recall_k(
            relevance_array=object["relevant"].to_numpy(),
            pred_score=object["Random scores"].to_numpy(),
            k=4
        ),
        "recall for model2" : recall_k(
            relevance_array=object["relevant"].to_numpy(),
            pred_score=object["KNN scores"].to_numpy(),
            k=4
        )
    }),
    include_groups=False
)
display(show)
display(show.mean().rename("mean value").to_frame().T)

object,recall for model 1,recall for model2
0,0.1,0.2
1,0.2,0.2
2,0.05,0.2
3,0.05,0.2
4,0.142857,0.285714
5,0.2,0.266667
6,0.055556,0.222222
7,0.111111,0.222222
8,0.117647,0.235294
9,0.105263,0.210526


,recall for model 1,recall for model2
mean value,0.113243,0.224265


## precision@k

$precision_j@k$ is the fraction of relevant items among the first $k$ recommendations for the $j$-th object. More formally:

$$precision_j@k = \frac{\sum_{i=1}^k r_{ij}}{k}$$

Where:
- items are sorted according to their preference for $j$-th object for the model under consideration;
- $\sum_{i=1}^k r_{ij}$ - number of relevant items in first $k$ items for $j$-th object.
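As a small worked example (the relevance sequence is hypothetical, not taken from the dataset above), suppose the items of some object, sorted by descending model score, have relevances $R' = \{0, 1, 1, 0, 1\}$. Then:

$$precision@3 = \frac{0+1+1}{3} = \frac{2}{3}, \qquad precision@5 = \frac{0+1+1+0+1}{5} = \frac{3}{5}.$$

Unlike recall, the denominator is $k$ itself, so the value can go down as $k$ grows once irrelevant items enter the list.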

### Consider a specific object

Let's examine a specific object to gain a clear understanding of the situation and calculate the precision at 3 ($precision@3$) for it. We will compare model 1 and model 2 to discern any differences. In the following cell, we have extracted a subframe for the specific object and sorted it based on the results from the models. The example has been selected to highlight the disparity in $precision@3$ between the models:

In [9]:
k = 3
obj = 4

model1_tab = R_frame.loc[
    R_frame["object"] == obj,
    [
        "item",
        "relevant",
        "Random scores"
    ]
].sort_values(
    "Random scores", 
    ascending=False
).set_index("item")

model2_tab = R_frame.loc[
    R_frame["object"] == obj,
    [
        "item",
        "relevant",
        "KNN scores"
    ]
].sort_values(
    "KNN scores", 
    ascending=False
).set_index("item")

# precision@k: mean relevance of the k best-scored items
model1_precision = (
    model1_tab["relevant"].iloc[:k].mean()
)
model2_precision = (
    model2_tab["relevant"].iloc[:k].mean()
)

display(HTML(
    f"""
    <div style='display: flex;justify-content: space-around;'>
        <div>
            {model1_tab.to_html()}
            <p style='font-size:20px'>
                precision@{k} - {round(model1_precision*100,2)}%
            </p>
        </div>
        <div>
            {model2_tab.to_html()}
            <p style='font-size:20px'>
                precision@{k} - {round(model2_precision*100,2)}%
            </p>
        </div>
    </div>
    """
))

item,relevant,Random scores
3,1,2.465325
18,1,1.985386
8,0,1.656717
25,0,1.614408
19,1,1.447166
4,0,1.383232
16,1,1.339926
2,1,1.236205
29,0,1.134973
6,0,1.022516

item,relevant,KNN scores
0,1,1.0
3,1,1.0
20,1,1.0
10,1,1.0
23,1,0.912074
14,1,0.908077
19,1,0.725166
18,1,0.725166
16,1,0.725166
13,1,0.72044


### Python code

The following function implements $precision@k$ in Python.

In [10]:
def precision_k(relevance_array, pred_score, k):
    '''
    Compute precision@k. This metric assesses the accuracy 
    of recommendations as the proportion of relevant items 
    in the first k recommendations out of all the items 
    recommended. It shows how well the recommendation 
    system provides relevant suggestions within the 
    initial set of recommendations.

    Parameters
    ----------
    relevance_array : numpy.array
        binary array marking observations that were relevant;
    pred_score : numpy.array
        predicted scores, expected to be higher 
        the more relevant the item is;
    k : int
        number of top-scored items to consider.

    Returns
    ----------
    out : float
        value of the metric.
    '''
    if len(relevance_array)!=len(pred_score):
        raise ValueError(
            "`relevance_array` and `pred_score` must be the same size"
        )
    elif len(relevance_array) < k:
        raise ValueError(
            "k is greater than the number of observations"
        )
    return np.mean(
        relevance_array[np.argsort(pred_score)[::-1]][:k]
    )

Here are some unit tests for the function above:

In [11]:
class TestPrecision(unittest.TestCase):
    def test_different_sizes(self):
        '''
        We must check that if the sizes of arrays with 
        relevance and prediction differ, an error must 
        be raised.
        '''
        with self.assertRaises(ValueError):
            precision_k(
                np.array([1, 1, 0]),
                np.array([0.3, 0.2, 0.3, 0.2]),
                1
            )

    def test_k_more_obs(self):
        '''
        K cannot be more than the number of observations 
        we are considering.
        '''
        with self.assertRaises(ValueError):
            precision_k(
                np.array([1, 1, 0, 0, 1]),
                np.array([0.4, 0.1, 0.2, 0.5, 0.3]),
                10
            )
    
    def test_computions(self):
        '''
        Just basic test with known result
        '''
        real_ans = precision_k(
            np.array([1, 1, 0, 0, 1]),
            np.array([0.4, 0.1, 0.2, 0.5, 0.3]),
            3
        )
        exp_ans = 2/3
        self.assertAlmostEqual(real_ans, exp_ans, delta=0.000001)

ans = unittest.main(argv=[''], verbosity=2, exit=False)
del TestPrecision

test_computions (__main__.TestPrecision)
Just basic test with known result ... ok
test_different_sizes (__main__.TestPrecision)
We must check that if the sizes of arrays with ... ok
test_k_more_obs (__main__.TestPrecision)
K cannot be more than the number of observations ... ok

----------------------------------------------------------------------
Ran 3 tests in 0.002s

OK


The following cell calculates $precision@4$ for our example: first for each object separately, then averaged over all objects.

In [12]:
show = R_frame.groupby("object").apply(
    lambda object: pd.Series({
        "precision for model 1" : precision_k(
            relevance_array=object["relevant"].to_numpy(),
            pred_score=object["Random scores"].to_numpy(),
            k=4
        ),
        "precision for model2" : precision_k(
            relevance_array=object["relevant"].to_numpy(),
            pred_score=object["KNN scores"].to_numpy(),
            k=4
        )
    }),
    include_groups=False
)
display(show)
display(show.mean().rename("mean value").to_frame().T)

object,precision for model 1,precision for model2
0,0.5,1.0
1,1.0,1.0
2,0.25,1.0
3,0.25,1.0
4,0.5,1.0
5,0.75,1.0
6,0.25,1.0
7,0.5,1.0
8,0.5,1.0
9,0.5,1.0


,precision for model 1,precision for model2
mean value,0.5,1.0


## AP@k (average precision)

This metric is also computed for each object individually. For the $j$-th object it is given by:

$$AP_j@k=\frac{1}{N_j} \sum_{t=1}^k precision_j@t \cdot r_{tj}$$

Where $N_j=\sum_{i=1}^k r_{ij}$ is the number of relevant items for the $j$-th object among the $k$ best according to the model, and items are again sorted by the model's scores.
We take the $k$ best items and compute $precision@t$ for each $t=\overline{1,k}$. Only the precisions at positions holding relevant items contribute to the sum: at irrelevant positions $r_{tj}=0$ cancels the corresponding $precision@t$.
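A minimal sketch of this formula, written in the same style as the functions above. The name `average_precision_k` is hypothetical, and returning `0.0` when no relevant item falls into the top $k$ is an assumption: the formula itself is undefined for $N_j=0$.

```python
import numpy as np

def average_precision_k(relevance_array, pred_score, k):
    """AP@k sketch, where N_j counts relevant items within the top k."""
    relevance_array = np.asarray(relevance_array)
    pred_score = np.asarray(pred_score)

    # Relevance of the k best-scored items (R' truncated to k).
    top_k = relevance_array[np.argsort(pred_score)[::-1]][:k]

    n_relevant = top_k.sum()
    if n_relevant == 0:
        # Assumption: treat the undefined case N_j = 0 as zero.
        return 0.0

    # precision@t for every t = 1..k.
    precision_at_t = np.cumsum(top_k) / np.arange(1, len(top_k) + 1)

    # Keep only the precisions at relevant positions and average them.
    return float((precision_at_t * top_k).sum() / n_relevant)

ap = average_precision_k(
    np.array([1, 1, 0, 0, 1]),
    np.array([0.4, 0.1, 0.2, 0.5, 0.3]),
    k=3,
)
print(ap)
```

For this toy input the top 3 relevances are `[0, 1, 1]`, so the precisions at the relevant positions are $1/2$ and $2/3$, and their mean is $7/12$.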