### Ref
https://towardsdatascience.com/recommendation-systems-models-and-evaluation-84944a84fb8e

https://towardsdatascience.com/evaluation-metrics-for-recommender-systems-df56c6611093

# How to evaluate a recommendation engine?

<img src="figures/evaluation-metrics.png" width="60%">

Which Recommendation System (RS) is the best?
* **Relevant recommendations** are defined as recommendations of items that the **user has rated positively** in the test data. 
* **The goal** is **NOT** to simply **recommend** the same products that the **user has bought before**. 
  * How do you know if your model is doing a good job at suggesting products?

## The Long Tail

* Consider the number of **interactions** per item, such as clicks, ratings, or purchases.
* The items in the "**long tail**" usually **do not have enough interactions** to be recommended accurately by user-based recommender systems such as collaborative filtering, unlike the popular items in the "`head`". <img align="right" style="padding-left:10px;" src="figures/long_tail_final.png" width="70%">  
* Because there are **many** observations of **popular items** in the training data, it is not difficult for a recommender system to learn to **accurately predict these items**.
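
A minimal sketch of how the head and the long tail can be read off interaction data, assuming the interactions are available as `(user, item)` pairs. The sample log and the function name `popularity_curve` are hypothetical and only serve as an illustration.

```python
from collections import Counter

# Hypothetical interaction log: one (user, item) pair per click, rating, or purchase.
interactions = [
    ("u1", "A"), ("u2", "A"), ("u3", "A"), ("u4", "A"),
    ("u1", "B"), ("u2", "B"),
    ("u3", "C"), ("u4", "D"),
]


def popularity_curve(interactions):
    """Return (item, count, cumulative share) tuples, most popular item first."""
    counts = Counter(item for _, item in interactions)
    total = sum(counts.values())
    curve = []
    cumulative = 0
    for item, count in counts.most_common():
        cumulative += count
        curve.append((item, count, cumulative / total))
    return curve


for item, count, share in popularity_curve(interactions):
    print(item, count, round(share, 3))
```

Items at the top of this list form the head; the many items with only a handful of interactions at the bottom form the long tail.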

## Statistical accuracy metrics

Used to evaluate accuracy of a filtering technique 
* by comparing the **predicted ratings** directly with the **actual user rating**

These metrics are good to use when 
* the recommendations are based on **predicting rating** or **number of transactions**. 

They give us a sense of
* how accurate our **predicted** ratings are, and
* in turn, how accurate our **recommendations** are.

### Root Mean Square Error (RMSE)

$$RMSE= \sqrt{\frac{1}{N}\sum(predicted - actual)^2} $$
If a user has given a rating of 5 to a movie and we predicted a rating of 4, then the RMSE for that single prediction is 1.
* The lower the RMSE, the better the recommendations.

### Mean Absolute Error (MAE)

$$MAE= \frac{1}{N}\sum |predicted - actual| $$
* The lower the MAE, the better the recommendations.
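
The two formulas above can be sketched in plain Python as follows. This is only an illustration: the function names and the sample ratings are hypothetical, and both lists are assumed to hold one predicted and one actual rating per observation, in the same order.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between two equal-length lists of ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(predicted, actual):
    """Mean absolute error between two equal-length lists of ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    absolute_errors = [abs(p - a) for p, a in zip(predicted, actual)]
    return sum(absolute_errors) / len(absolute_errors)


# Hypothetical ratings for four (user, item) pairs.
predicted = [4.0, 3.0, 5.0, 2.0]
actual = [5.0, 3.0, 4.0, 4.0]

print(rmse(predicted, actual))  # sqrt(6 / 4), about 1.2247
print(mae(predicted, actual))   # (1 + 0 + 1 + 2) / 4 = 1.0
```

Because RMSE squares each error before averaging, the single error of 2 weighs more heavily in RMSE than in MAE.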

## Decision support accuracy metrics

* These metrics help users `select the items that are more similar among the available set of items`. 


### Precision and Recall
* These metrics view the prediction procedure as a **binary operation** that distinguishes `good items` from items that are `not good`. 
<img src="figures/precision-recall.png" width="60%">

* **Precision** - Out of all the recommended items, how many did the user actually like?
$$P = \frac{tp}{tp + fp} = \frac{\text{# of our recommendations that are relevant}}{\text{# of items we recommended}}$$
* **Recall** - Out of all the items that the user likes, what proportion did we actually recommend?
$$R = \frac{tp}{tp + fn} =  \frac{ \text{# of our recommendations that are relevant}}{ \text{# of all the possible relevant items}}$$
* **F-measure** - the harmonic mean of P and R.  
The best F1 value is 1 (perfect precision and recall) and the worst is 0.
$$F1=2\frac{ P\cdot R}{P+R}$$
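
As a rough sketch of these three definitions, assuming the recommendations and the relevant items are both given as collections of item IDs (the function name and the sample items are hypothetical):

```python
def precision_recall_f1(recommended, relevant):
    """Return (precision, recall, F1) for one user's recommendations."""
    recommended, relevant = set(recommended), set(relevant)
    tp = len(recommended & relevant)  # recommended items that are relevant
    precision = tp / len(recommended) if recommended else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: 4 recommendations, 3 relevant items, 2 of them hit.
p, r, f1 = precision_recall_f1(["A", "B", "C", "D"], ["A", "C", "E"])
print(p, r, f1)  # precision 2/4, recall 2/3, F1 4/7
```

Returning 0 when a denominator is empty is a convention chosen for this sketch, not something the formulas prescribe.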

Precision and recall **don't care about ordering**: they treat every recommended item the same, wherever it appears in the list. 
* So instead we use precision and recall at **cutoff k**. 
* Suppose we make **N recommendations**. Consider only the first element, then only the first two, then only the first three, and so on.
* These subsets can be indexed by k.



<img src="figures/precision-at-4.png" width="80%">

### Why Precision, Recall and F1-Measure May Fool You
* Setup: an ideal recommender (examples a – f) vs. a worst-case recommender (examples g – l).
* Each makes four recommendations (R1 – R4), so the metric is e.g. Precision@4.
* There are ten items, and the number of relevant items varies from 1 to 9.
* Precision, recall and F1-measure are very sensitive to the ratio of relevant items.
* When that ratio varies, they fail to distinguish between an ideal recommender and a worst-case recommender.
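
A small worked sketch of this effect, assuming the setup above (ten items, four recommendations). The ideal recommender is taken to fill its slots with as many relevant items as exist, and the worst-case recommender with as few as it can. The function name `precision_bounds` is hypothetical.

```python
def precision_bounds(num_relevant, num_items=10, k=4):
    """Return (ideal, worst-case) Precision@k for a given number of relevant items."""
    # The ideal recommender hits a relevant item in every slot it can.
    best_hits = min(k, num_relevant)
    # The worst-case recommender uses up all irrelevant items first.
    worst_hits = max(0, k - (num_items - num_relevant))
    return best_hits / k, worst_hits / k


for num_relevant in (1, 5, 9):
    print(num_relevant, precision_bounds(num_relevant))
# 1 (0.25, 0.0)
# 5 (1.0, 0.0)
# 9 (1.0, 0.75)
```

With 9 relevant items the worst-case recommender reaches a Precision@4 of 0.75, while with 1 relevant item the ideal recommender cannot exceed 0.25, so the raw score alone does not tell the two apart.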


### P@k and R@k

Precision and recall at cutoff $k$, written $P@k$ and $R@k$, are computed by
* considering only the `subset of your recommendations from rank 1 through k`. 
* The rank of the recommendations is determined by the predicted value. 
  * For example, the product with the highest predicted value is ranked 1, and the product with the $k$-*th* highest predicted value is ranked $k$.
<img src="figures/precision-at-k.png" width="60%">
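
A minimal sketch of $P@k$ and $R@k$, assuming `ranked` is a list of item IDs already sorted by predicted value (best first) and `relevant` is the set of items the user liked. The names and sample data are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top-k recommendations."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


ranked = ["A", "B", "C", "D", "E"]  # rank 1 is "A"
relevant = {"A", "C", "F"}

for k in range(1, 5):
    print(k, precision_at_k(ranked, relevant, k), recall_at_k(ranked, relevant, k))
```

Dividing by `k` even when fewer than `k` items were recommended is one common convention; dividing by the number of items actually returned is another.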

### Average Precision
If we have to **recommend $N$ items** and there are $m$ **relevant items** in the full space of items, Average Precision AP@N is defined as:

$$AP@N=\frac{1}{m}\sum_{k=1}^NP@k\cdot rel(k)$$

where $rel(k)$ is just an indicator (0/1) that tells us whether that $k$-th item was relevant and $P@k$ is the precision@k. 

> * AP rewards you for giving correct recommendations.
> * AP rewards you for front-loading the recommendations that are most likely to be correct.
> * AP will never penalize you for adding additional recommendations to your list; just make sure you front-load the best ones.
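
The AP@N formula can be sketched as below, assuming a ranked list without duplicate items and a set of relevant items. The function name and the sample data are hypothetical; the normalisation by $m$ follows the definition given here, while some implementations divide by $\min(m, N)$ instead.

```python
def average_precision_at_n(ranked, relevant, n):
    """AP@N = (1/m) * sum over k of P@k * rel(k), with m = number of relevant items."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for k, item in enumerate(ranked[:n], start=1):
        if item in relevant:      # rel(k) = 1
            hits += 1
            total += hits / k     # P@k at this position
    return total / len(relevant)


relevant = {"A", "C", "F"}

# Same items, different order: front-loading the relevant ones scores higher.
print(average_precision_at_n(["A", "B", "C", "D"], relevant, 4))  # 5/9, about 0.556
print(average_precision_at_n(["B", "D", "A", "C"], relevant, 4))  # 5/18, about 0.278
```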

### Mean Average Precision

$$MAP@N = \frac{1}{|U|}\sum_u(AP@N)_u$$
AP applies to a single data point, such as a single user. MAP@N goes a step further and averages the AP across all users in the set $U$.
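
A sketch of MAP@N under the same assumptions as the AP@N definition, with recommendations and relevant items stored per user in dictionaries. The function names, user IDs, and data are hypothetical.

```python
def average_precision_at_n(ranked, relevant, n):
    """AP@N for one user, normalised by the number of relevant items."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for k, item in enumerate(ranked[:n], start=1):
        if item in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)


def mean_average_precision_at_n(recommendations, relevant_by_user, n):
    """Average AP@N over every user that received recommendations."""
    if not recommendations:
        return 0.0
    scores = [
        average_precision_at_n(ranked, relevant_by_user.get(user, set()), n)
        for user, ranked in recommendations.items()
    ]
    return sum(scores) / len(scores)


recommendations = {"u1": ["A", "B", "C"], "u2": ["D", "E", "F"]}
relevant_by_user = {"u1": {"A", "C"}, "u2": {"E"}}

# AP for u1 is 5/6 and for u2 is 1/2, so MAP@3 is 2/3.
print(mean_average_precision_at_n(recommendations, relevant_by_user, 3))
```

Users without any known relevant items contribute an AP of 0 in this sketch; excluding them from the average is an equally defensible choice.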

### Reciprocal Rank

Suppose we have recommended 3 movies to a user, say A, B, C in the given order, but the user only liked movie C. As the rank of movie C is 3, the reciprocal rank will be 1/3.
* The mean reciprocal rank (MRR) averages this value across users; the larger it is, the better the recommendations.
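
The example above can be reproduced with a short sketch. It assumes the reciprocal rank is taken at the first relevant item in the list and is 0 when no relevant item was recommended; the function names and the second user are hypothetical.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none was recommended."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average reciprocal rank over users (two parallel lists, one entry per user)."""
    if not ranked_lists:
        return 0.0
    scores = [reciprocal_rank(r, rel) for r, rel in zip(ranked_lists, relevant_sets)]
    return sum(scores) / len(scores)


# The user from the text liked only movie C, which was ranked third.
print(reciprocal_rank(["A", "B", "C"], {"C"}))  # 1/3

# Adding a hypothetical second user whose first recommendation was a hit.
print(mean_reciprocal_rank([["A", "B", "C"], ["D", "E"]], [{"C"}, {"D"}]))  # (1/3 + 1) / 2
```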


### MAP at k (Mean Average Precision at cutoff k)

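MAP at k is MAP computed with each recommendation list cut off at rank k. Precision at cutoff k is the precision calculated by considering only the subset of your recommendations from rank 1 through k; AP is built from these values for each user and then averaged across users.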


### NDCG (Normalized Discounted Cumulative Gain)

The main difference between MAP and NDCG is that MAP assumes that an item is either of interest or not, while NDCG takes a graded relevance score into account.
* Let us understand it with an example: suppose that out of 10 movies, A to J, we can recommend the first five (A, B, C, D and E) while we must not recommend the other five (F, G, H, I and J). The recommendation was [A, B, C, D].
* The NDCG in this case will be 1, as all the recommended products are relevant for the user.
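
A minimal sketch of NDCG for the movie example, assuming a relevance of 1 for movies A to E and 0 for F to J, and the common linear-gain form of DCG, $\sum_k rel_k / \log_2(k + 1)$. Another widely used variant replaces $rel_k$ with $2^{rel_k} - 1$. The function names are hypothetical.

```python
import math


def dcg(gains):
    """Discounted cumulative gain of a list of relevance scores in ranked order."""
    return sum(gain / math.log2(rank + 1) for rank, gain in enumerate(gains, start=1))


def ndcg_at_k(ranked, relevance, k):
    """DCG of the top-k recommendations divided by the best achievable DCG at k."""
    gains = [relevance.get(item, 0.0) for item in ranked[:k]]
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Movies A-E are relevant (score 1), F-J are not (score 0).
relevance = {movie: 1.0 for movie in "ABCDE"}
relevance.update({movie: 0.0 for movie in "FGHIJ"})

print(ndcg_at_k(["A", "B", "C", "D"], relevance, 4))  # 1.0: every slot holds a relevant movie
print(ndcg_at_k(["F", "A", "B", "C"], relevance, 4))  # below 1: an irrelevant movie is ranked first
```

With graded scores (for example ratings from 0 to 5 instead of 0/1), the same functions reward putting the most relevant items first, which is what distinguishes NDCG from MAP.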