# Evaluation

## Prediction Metrics (Similar to Regression Metrics)

- RMSE (root mean squared error)
- MSE (mean squared error)
- MAE (mean absolute error)

## Hit Metrics (Similar to Classification Metrics)

**Hit** - A hit is defined by relevancy: a recommended item counts as a hit when it is one of the items that are relevant to the user, for example items the user has clicked, viewed, or purchased. The more of the top "k" recommended items are hits, the better the recommender performs. Metrics such as "precision" and "recall" measure this hit accuracy.

- Precision@k
- Recall@k
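
To make the definition of a hit concrete, here is a minimal single-user sketch in plain Python. The function name `precision_recall_at_k` and the toy item ids are illustrative assumptions, not part of `recoflow`. Precision is taken as hits divided by k and recall as hits divided by the number of relevant items; other conventions exist.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Return (precision@k, recall@k) for one user.

    recommended: item ids ordered from highest to lowest predicted score.
    relevant: item ids the user actually interacted with.
    """
    top_k = recommended[:k]
    relevant = set(relevant)
    # A hit is a recommended item that is also a relevant item.
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical user: 3 recommendations, 4 relevant items, 1 item in common.
precision, recall = precision_recall_at_k(["a", "b", "c"], ["b", "x", "y", "z"], k=3)
print(precision, recall)
```

Here one of the three recommended items is relevant, so precision@3 is 1/3, and one of the four relevant items was recommended, so recall@3 is 1/4.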
  

## Ranking Metrics

**Ranking** - Ranking metrics go a step further than hit metrics: for the items that are hits, they measure whether those items are ordered the way the user would prefer. Metrics such as "mean average precision" and "NDCG" evaluate whether relevant items are ranked higher than less relevant or irrelevant ones.

- MeanReciprocalRank@k
- MeanAveragePrecision@k
- NDCG@k
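
The effect of ordering can be illustrated with a small single-user sketch. The helper names `reciprocal_rank_at_k` and `ndcg_at_k` and the toy data are assumptions made for illustration. The sketch uses binary relevance and a logarithmic discount, which is one common textbook formulation and not necessarily the exact definition used by `recoflow`.

```python
import math


def reciprocal_rank_at_k(recommended, relevant, k):
    """1 / rank of the first relevant item in the top k, or 0.0 if there is none."""
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG@k: discounted gain of the hits over the ideal ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(recommended[:k], start=1)
        if item in relevant
    )
    # The ideal ranking places all relevant items (up to k) at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


relevant = {"b", "x"}
good = ["b", "x", "c"]  # both relevant items ranked first
poor = ["c", "a", "b"]  # the only hit sits in the last position

print(reciprocal_rank_at_k(good, relevant, 3), ndcg_at_k(good, relevant, 3))
print(reciprocal_rank_at_k(poor, relevant, 3), ndcg_at_k(poor, relevant, 3))
```

Both lists contain at least one hit, but the second one scores lower on both metrics because its hit appears later in the ranking.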


In [1]:
import numpy as np
import pandas as pd

In [2]:
df_true = pd.DataFrame(
    {
        "USER": [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
        "ITEM": [1, 2, 3, 1, 4, 5, 6, 7, 2, 5, 6, 8, 9, 10, 11, 12, 13, 14],
        "RATING": [5, 4, 3, 5, 5, 3, 3, 1, 5, 5, 5, 4, 4, 3, 3, 3, 2, 1],
    }
)

df_pred = pd.DataFrame(
    {
        "USER": [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
        "ITEM": [3, 10, 12, 10, 3, 11, 5, 13, 4, 10, 7, 13, 1, 3, 5, 2, 11, 14],
        "RATING": [14, 13, 12, 14, 13, 12, 11, 10, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5]
    }
)

In [4]:
from recoflow.metrics import MeanAbsoluteError, MeanSquaredError, RootMeanSquaredError
from recoflow.metrics import PrecisionK, RecallK, NDCGK, MeanReciprocalRankK, MeanAveragePrecisionK

In [5]:
RootMeanSquaredError(df_true, df_pred)

7.106335201775948

In [6]:
MeanSquaredError(df_true, df_pred)

50.5

In [7]:
MeanAbsoluteError(df_true, df_pred)

6.25
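
The three values above can be checked by hand. The following is a minimal sketch, assuming the metrics are computed only over the (USER, ITEM) pairs that occur in both `df_true` and `df_pred`. That assumption is consistent with the outputs shown, but it is not taken from the `recoflow` source.

```python
import math

# (true rating, predicted score) for the eight (USER, ITEM) pairs that
# appear in both df_true and df_pred.
pairs = [
    (3, 14),  # user 1, item 3
    (3, 11),  # user 2, item 5
    (5, 7),   # user 3, item 2
    (5, 8),   # user 3, item 5
    (3, 13),  # user 3, item 10
    (3, 6),   # user 3, item 11
    (2, 11),  # user 3, item 13
    (1, 5),   # user 3, item 14
]

errors = [pred - true for true, pred in pairs]
mae = sum(abs(e) for e in errors) / len(errors)
mse = sum(e * e for e in errors) / len(errors)
rmse = math.sqrt(mse)

print(mae, mse, rmse)
```

The absolute errors sum to 50 and the squared errors to 404 over eight pairs, which gives 6.25 for MAE, 50.5 for MSE, and the square root of 50.5 for RMSE.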

In [8]:
PrecisionK(df_true, df_pred, 3)

0.2222222222222222

In [9]:
RecallK(df_true, df_pred, 3)

0.14444444444444446

In [10]:
NDCGK(df_true, df_pred, 3)

0.2551202123295406

In [11]:
MeanAveragePrecisionK(df_true, df_pred, 3)

0.12777777777777777

In [13]:
MeanReciprocalRankK(df_true, df_pred, 3)

0.16666666666666666

In [20]:
rating_true = df_true
rating_pred = df_pred
k = 5

In [21]:
common_users = set(rating_true["USER"]).intersection(set(rating_pred["USER"]))
rating_true_common = rating_true[rating_true["USER"].isin(common_users)]
rating_pred_common = rating_pred[rating_pred["USER"].isin(common_users)]
n_users = len(common_users)

In [35]:
rating_pred_common.head(8)

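|   | USER | ITEM | RATING |
|---|------|------|--------|
| 0 | 1 | 3 | 14 |
| 1 | 1 | 10 | 13 |
| 2 | 1 | 12 | 12 |
| 3 | 2 | 10 | 14 |
| 4 | 2 | 3 | 13 |
| 5 | 2 | 11 | 12 |
| 6 | 2 | 5 | 11 |
| 7 | 2 | 13 | 10 |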


In [23]:
from recoflow.metrics import _GetTopKItems

In [26]:
df_hit = _GetTopKItems(rating_pred_common, "USER", "RATING", k)
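
`_GetTopKItems` is a private helper and its implementation is not shown here. As a rough sketch of what a per-user top-k selection of this kind typically does, the hypothetical `top_k_items` function below works on plain tuples. It assumes items are ranked per user by descending score with ranks starting at 1; the real helper may differ.

```python
from collections import defaultdict


def top_k_items(rows, k):
    """Hypothetical stand-in for a per-user top-k selection.

    rows: iterable of (user, item, score) tuples.
    Returns (user, item, rank) tuples for each user's k highest-scoring
    items, with rank 1 assigned to the highest score.
    """
    by_user = defaultdict(list)
    for user, item, score in rows:
        by_user[user].append((score, item))

    result = []
    for user in sorted(by_user):
        ranked = sorted(by_user[user], key=lambda pair: pair[0], reverse=True)
        for rank, (_score, item) in enumerate(ranked[:k], start=1):
            result.append((user, item, rank))
    return result


# The first eight rows of df_pred as (USER, ITEM, RATING) tuples.
rows = [
    (1, 3, 14), (1, 10, 13), (1, 12, 12),
    (2, 10, 14), (2, 3, 13), (2, 11, 12), (2, 5, 11), (2, 13, 10),
]
top2 = top_k_items(rows, k=2)
print(top2)
```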

In [32]:
rating_true_common.head(6)

|   | USER | ITEM | RATING |
|---|------|------|--------|
| 0 | 1 | 1 | 5 |
| 1 | 1 | 2 | 4 |
| 2 | 1 | 3 | 3 |
| 3 | 2 | 1 | 5 |
| 4 | 2 | 4 | 5 |
| 5 | 2 | 5 | 3 |


In [33]:
df_hit.head(6)

|   | USER | ITEM | rank |
|---|------|------|------|
| 0 | 1 | 3 | 1 |
| 1 | 2 | 5 | 4 |
| 2 | 3 | 10 | 2 |
| 3 | 3 | 13 | 4 |


In [28]:
# keep only the top-k recommendations that also appear in the true data (the hits)
df_hit = pd.merge(df_hit, rating_true_common, on=["USER", "ITEM"])[
    ["USER", "ITEM", "rank"]
]

In [29]:
df_hit

|   | USER | ITEM | rank |
|---|------|------|------|
| 0 | 1 | 3 | 1 |
| 1 | 2 | 5 | 4 |
| 2 | 3 | 10 | 2 |
| 3 | 3 | 13 | 4 |


In [36]:
# count the number of hits vs actual relevant items per user
df_hit_count = pd.merge(
    # named aggregation; the dict-renaming form of .agg() on a single column
    # was removed from pandas
    df_hit.groupby("USER", as_index=False).agg(hit=("ITEM", "count")),
    rating_true_common.groupby("USER", as_index=False).agg(
        actual=("ITEM", "count")
    ),
    on="USER",
)


In [37]:
df_hit_count

|   | USER | hit | actual |
|---|------|-----|--------|
| 0 | 1 | 1 | 3 |
| 1 | 2 | 1 | 5 |
| 2 | 3 | 2 | 10 |
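
The hit counts are the raw material for the hit metrics. As a minimal sketch, assuming precision@k is the number of hits divided by k, recall@k is the number of hits divided by the number of actual items, and both are averaged over all common users, the counts above lead to the following values for k = 5. The averaging convention is an assumption; it is consistent with the k = 3 outputs shown earlier.

```python
# (USER, hit, actual) rows copied from the df_hit_count table.
hit_count = [(1, 1, 3), (2, 1, 5), (3, 2, 10)]
k = 5
n_users = 3

# Users without any hit would be missing from hit_count; dividing by n_users
# makes them contribute 0 to the average.
precision_at_k = sum(hit / k for _, hit, _ in hit_count) / n_users
recall_at_k = sum(hit / actual for _, hit, actual in hit_count) / n_users

print(round(precision_at_k, 4), round(recall_at_k, 4))
```

With these counts the sketch gives a precision@5 of about 0.2667 and a recall@5 of about 0.2444.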


In [None]:
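# `return` is only valid inside a function; at the top level of a notebook
# cell, display the three values instead
df_hit, df_hit_count, n_users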