# Metrics

This section covers metrics: numerical measures of how well a model's predictions match the observed data.

In [43]:
import numpy as np
from IPython.display import HTML

## Cross-Entropy

The popularity of this metric stems from the fact that it is differentiable, making it suitable to be used as a loss function when fitting the parameters of machine learning models.

For more details, see the [Cross-Entropy section at MLGlossary](https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html#cross-entropy).

The cross-entropy for the $o$-th observation is given by the following formula:

$$-\sum_{c=1}^M y_{o, c} \log(p_{o,c})$$

$$
y_{o,c} =
\begin{cases}
1, & \text{if the } o\text{-th observation belongs to class } c, \\
0, & \text{otherwise}.
\end{cases}
$$

Where:

- $M$: number of possible classes.
- $o$: index of the observation.
- $c$: index of the class.
- $p_{o,c}$: predicted probability that observation $o$ belongs to class $c$; for each observation these values must form a valid probability distribution, in particular $p_{o,c} \ge 0$ and $\sum_{c=1}^M p_{o,c} = 1$.
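The formula above can be sketched in NumPy as follows. The helper name `cross_entropy`, the `eps` clipping constant (a guard against $\log(0)$), and the demo arrays are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def cross_entropy(y_onehot, probs, eps=1e-12):
    """Per-observation cross-entropy: -sum_c y[o, c] * log(p[o, c])."""
    # Clip to avoid log(0); eps is an arbitrary small constant.
    probs = np.clip(probs, eps, 1.0)
    return -np.sum(y_onehot * np.log(probs), axis=1)

# Two observations, three classes; each row of p_demo sums to 1.
y_demo = np.array([[1, 0, 0], [0, 1, 0]])
p_demo = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
ce_demo = cross_entropy(y_demo, p_demo)
```

Because each row of `y_demo` contains a single 1, each entry of `ce_demo` reduces to the negative log of the probability assigned to the true class.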

A popular special case is cross-entropy for binary classification:

$$-\left(y_o \log(p_o) + (1-y_o) \log(1-p_o)\right)$$
$$
y_o=\begin{cases}
1, & \text{if the } o\text{-th observation has the trait being predicted}, \\
0, & \text{otherwise}.
\end{cases}
$$

- $p_o$: predicted probability that the $o$-th observation has the trait under consideration.
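As a minimal sketch, the binary formula can be checked against the general one with $M = 2$, treating "no trait" and "trait" as the two classes. The function name `binary_cross_entropy`, the `eps` constant, and the sample values are assumptions made for illustration.

```python
import numpy as np

def binary_cross_entropy(y_true, p, eps=1e-12):
    """Per-observation binary cross-entropy."""
    # Keep p strictly inside (0, 1) so both logarithms are finite.
    p = np.clip(p, eps, 1 - eps)
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_bin = np.array([1, 0, 1])
p_bin = np.array([0.9, 0.2, 0.6])
bce = binary_cross_entropy(y_bin, p_bin)

# The same quantity via the general formula with two classes:
# column 0 is "no trait" (probability 1 - p), column 1 is "trait" (p).
two_class_probs = np.column_stack([1 - p_bin, p_bin])
two_class_y = np.column_stack([1 - y_bin, y_bin])
general = -np.sum(two_class_y * np.log(two_class_probs), axis=1)
```

Both `bce` and `general` hold the same three values, which shows that the binary form is the two-class instance of the general sum.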

---

Consider an example: evaluating predicted probabilities against an array of true labels in which each observation belongs to one of three classes, $\{1, 2, 2, 3, 3, 1, 3\}$.

The labels first have to be one-hot encoded into an array of the form $y_{o, c}$. The following cell shows the corresponding Python code.

In [16]:
array = np.array([1, 2, 2, 3, 3, 1, 3])
# One-hot encode via broadcasting: column j is 1 where the label
# equals the j-th class. np.unique returns the classes already sorted.
y = (array[:, None] == np.unique(array)).astype(int)
y

array([[1, 0, 0],
       [0, 1, 0],
       [0, 1, 0],
       [0, 0, 1],
       [0, 0, 1],
       [1, 0, 0],
       [0, 0, 1]])

Now suppose we have three algorithms that return probabilities $p_{o,c}$:

In [71]:
np.random.seed(10)

def gen_probs(m):
    """Draw noisy values around mean m, clipped to [0, 1]."""
    return np.clip(
        np.random.normal(loc=m, scale=0.1, size=y.shape),
        a_min=0,
        a_max=1
    )

# True-class entries are drawn around a higher mean than the others;
# the gap between the two means shrinks from "best" to "bad".
best_probs = np.round(np.where(y==1, gen_probs(0.7), gen_probs(0.3)), 3)
good_probs = np.round(np.where(y==1, gen_probs(0.6), gen_probs(0.4)), 3)
bad_probs = np.round(np.where(y==1, gen_probs(0.5), gen_probs(0.5)), 3)

for v in [best_probs, good_probs, bad_probs]:
    display(v)

array([[0.833, 0.327, 0.538],
       [0.412, 0.762, 0.31 ],
       [0.44 , 0.711, 0.361],
       [0.273, 0.245, 0.82 ],
       [0.252, 0.431, 0.723],
       [0.745, 0.266, 0.426],
       [0.227, 0.366, 0.502]])

array([[0.506, 0.392, 0.347],
       [0.505, 0.566, 0.364],
       [0.388, 0.585, 0.446],
       [0.378, 0.499, 0.671],
       [0.647, 0.249, 0.839],
       [0.692, 0.32 , 0.599],
       [0.574, 0.214, 0.713]])

array([[0.493, 0.47 , 0.585],
       [0.571, 0.522, 0.529],
       [0.453, 0.524, 0.426],
       [0.469, 0.465, 0.592],
       [0.514, 0.527, 0.583],
       [0.305, 0.368, 0.624],
       [0.747, 0.638, 0.486]])

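The results of the "algorithms" are generated so that quality decreases from the first to the third. The first and second "algorithms" both assign the highest value to the true class for every observation in this sample, but the predictions of the first "algorithm" are more confident. The third "algorithm" is essentially guessing. Note that each entry is generated independently, so the rows do not sum to 1; they serve as illustrative scores rather than proper probability distributions.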
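The difference between accuracy and confidence can be sketched on a small made-up example. The helper `argmax_accuracy` and the arrays `y_small`, `confident`, and `hesitant` are hypothetical and are not the arrays generated above.

```python
import numpy as np

def argmax_accuracy(y_onehot, probs):
    """Fraction of rows whose highest-scoring class is the true class."""
    predicted = np.argmax(probs, axis=1)
    actual = np.argmax(y_onehot, axis=1)
    return float(np.mean(predicted == actual))

y_small = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
confident = np.array([[0.8, 0.3, 0.4], [0.3, 0.8, 0.3], [0.2, 0.4, 0.7]])
hesitant = np.array([[0.5, 0.4, 0.3], [0.5, 0.6, 0.4], [0.4, 0.5, 0.7]])

acc_confident = argmax_accuracy(y_small, confident)
acc_hesitant = argmax_accuracy(y_small, hesitant)

# Average negative log of the values assigned to the true classes.
ce_confident = np.mean(-np.log(confident[y_small == 1]))
ce_hesitant = np.mean(-np.log(hesitant[y_small == 1]))
```

Both sets of predictions get every observation right, so accuracy cannot tell them apart, while the average cross-entropy is lower for the more confident set.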

Here are the components $-y_{o, c} \log(p_{o,c})$ for each of the three sets of predictions:

In [78]:
for p in [best_probs, good_probs, bad_probs]:
    display(-np.log(p)*y)

array([[0.18272164, 0.        , 0.        ],
       [0.        , 0.27180872, 0.        ],
       [0.        , 0.34108285, 0.        ],
       [0.        , 0.        , 0.19845094],
       [0.        , 0.        , 0.32434606],
       [0.29437106, 0.        , 0.        ],
       [0.        , 0.        , 0.68915516]])

array([[0.68121861, 0.        , 0.        ],
       [0.        , 0.5691612 , 0.        ],
       [0.        , 0.53614343, 0.        ],
       [0.        , 0.        , 0.39898614],
       [0.        , 0.        , 0.17554457],
       [0.36816932, 0.        , 0.        ],
       [0.        , 0.        , 0.33827386]])

array([[0.7072461 , 0.        , 0.        ],
       [0.        , 0.65008769, 0.        ],
       [0.        , 0.64626359, 0.        ],
       [0.        , 0.        , 0.52424864],
       [0.        , 0.        , 0.53956809],
       [1.1874435 , 0.        , 0.        ],
       [0.        , 0.        , 0.72154666]])

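Note that only the predicted probability of the true class (the entry where $y_{o,c} = 1$) contributes; all other terms are multiplied by zero. The more confident the prediction for the true class, the lower the cross-entropy for that observation.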
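Since only the true-class entry matters, the same per-observation values can be obtained by indexing instead of masking. The sketch below assumes zero-based integer labels (`labels`) and reuses the first three rows of the first array of predictions shown above; the variable names are illustrative.

```python
import numpy as np

labels = np.array([0, 1, 1])  # zero-based class indices (classes 1, 2, 2)
probs = np.array([[0.833, 0.327, 0.538],
                  [0.412, 0.762, 0.31],
                  [0.44, 0.711, 0.361]])

# Pick the true-class probability from each row with integer indexing.
true_class_probs = probs[np.arange(len(labels)), labels]
ce_indexed = -np.log(true_class_probs)

# Equivalent masked form using a one-hot matrix built from the labels.
onehot = np.eye(probs.shape[1], dtype=int)[labels]
ce_masked = np.sum(-np.log(probs) * onehot, axis=1)
```

The indexed form avoids computing logarithms for entries that are then multiplied by zero, which also sidesteps `log(0)` warnings for non-true classes whose value was clipped to 0.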

And just to be sure, let's compute the average cross-entropy for the entire sample:

In [80]:
for p in [best_probs, good_probs, bad_probs]:
    display(np.sum(-np.log(p)*y, axis=1, keepdims=True).mean())

0.3288480606756968

0.43821387695451447

0.7109148978408835

As expected, the higher the quality of the algorithm, the lower its average cross-entropy: lower values are better.