# Confusion matrix

The confusion matrix is a fundamental tool for evaluating classification models.

In [3]:
import numpy as np
import pandas as pd

import sklearn
from sklearn.datasets import make_classification

## Example task

Consider a binary classification problem. We have two classes, Positive and Negative.

Let:

- $P$ is the number of positive observations in the sample;
- $N$ is the number of negative observations in the sample.

The following code generates possible outputs of the classification task:

- $y$ - true targets: 0 corresponds to negative, 1 to positive;
- $t_i$ - score that indicates the probability that the $i$-th object belongs to the positive class;
- $\hat{y}_i=\left[t_i>T\right]$ - final predictions, which depend on the cut-off threshold $T$. Different values of $T$ produce different confusion matrices, which is discussed in the following sections. For the example below, $T$ is the mean of $t_i$, $i=\overline{1,n}$.

In [8]:
x, y = make_classification(
    n_features=1,
    n_informative=1,
    n_redundant=0,
    n_repeated=0,
    n_clusters_per_class=1,
    flip_y=0.3,
    random_state=2
)
x = x.ravel()
pd.DataFrame({
    "$y$" : y, 
    "$t_i$" : x, 
    r"$\hat{y}$": (x > np.mean(x)).astype("int")
}).head()

|   | $y$ | $t_i$ | $\hat{y}$ |
|:--|:---:|:-----:|:---------:|
| 0 | 0 | -0.644251 | 0 |
| 1 | 0 | -1.162201 | 0 |
| 2 | 1 | -0.624533 | 0 |
| 3 | 1 | 2.033877 | 1 |
| 4 | 1 | -1.012203 | 0 |


## Idea

Now suppose we have built a classifier. Its predictions split the observations into four groups:

- True positive - observations that were positive in the sample and that we correctly predicted as positive. We will denote their number as $TP$;
- True negative - observations that were negative in the sample and that we correctly predicted as negative. We will denote their number as $TN$;
- False positive - observations that were negative in the sample, but that we mistakenly predicted to be positive. We will denote their number as $FP$;
- False negative - observations that were positive in the sample, but that we mistakenly predicted to be negative. We will denote their number as $FN$.

If you put the actual values on the rows and the predicted values on the columns, you get the confusion matrix:

<table>
  <thead>
    <tr>
      <th></th>
      <th>Predicted $N$</th>
      <th>Predicted $P$</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Actual $N$</td>
      <td>$TN$</td>
      <td>$FP$</td>
    </tr>
    <tr>
      <td>Actual $P$</td>
      <td>$FN$</td>
      <td>$TP$</td>
    </tr>
  </tbody>
</table>
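As a minimal sketch of the layout above, the four counts can be obtained directly with boolean masks. The arrays `y_true` and `y_pred` are hypothetical toy data, not the sample generated earlier.

```python
import numpy as np

# Hypothetical toy data: true labels and predicted labels (not from the text).
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 1, 1, 0, 1, 0])

# Count each of the four groups with boolean masks.
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

# Rows are actual classes, columns are predicted classes.
matrix = np.array([[tn, fp], [fn, tp]])
print(matrix)
```

The four counts always add up to the number of observations, which is a quick sanity check for any hand-built matrix.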

It is also useful to represent the confusion matrix using relative values.

Recall that $N = TN + FP$ and $P = FN + TP$ are the numbers of actually negative and actually positive observations, i.e. the row sums of the matrix above. Let:

- $TNR = TN/N$ - true negative rate, the proportion of actual negatives that are correctly predicted as negative;
- $FPR = FP/N$ - false positive rate, the proportion of actual negatives that are mistakenly predicted as positive;
- $FNR = FN/P$ - false negative rate, the proportion of actual positives that are mistakenly predicted as negative;
- $TPR = TP/P$ - true positive rate, the proportion of actual positives that are correctly predicted as positive.

Note that these rates are normalized by the actual class sizes $N$ and $P$, not by the predicted counts $N^* = TN + FN$ and $P^* = TP + FP$, so each row of the relative matrix sums to one.

Using this notation, the confusion matrix can also be written as:

<table>
  <thead>
    <tr>
      <th></th>
      <th>Predicted $N$</th>
      <th>Predicted $P$</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Actual $N$</td>
      <td>$TNR$</td>
      <td>$FPR$</td>
    </tr>
    <tr>
      <td>Actual $P$</td>
      <td>$FNR$</td>
      <td>$TPR$</td>
    </tr>
  </tbody>
</table>

Here is an example of calculating the confusion matrix using `sklearn.metrics.confusion_matrix`:

In [9]:
from sklearn import metrics
metrics.confusion_matrix(y, x > np.mean(x))

array([[39,  7],
       [17, 37]])
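As a rough illustration, the counts printed above can be turned into rates by dividing each row by its sum. This assumes the usual convention that the rates are normalized by the actual class sizes. The names `counts` and `rates` are illustrative.

```python
import numpy as np

# Counts copied from the output above, laid out as [[TN, FP], [FN, TP]].
counts = np.array([[39, 7], [17, 37]])

# Divide each row by its sum: row 0 by N = TN + FP, row 1 by P = FN + TP.
rates = counts / counts.sum(axis=1, keepdims=True)

# Resulting layout: [[TNR, FPR], [FNR, TPR]].
print(rates.round(3))
```

`sklearn.metrics.confusion_matrix` also accepts `normalize="true"`, which should produce the same row-normalized matrix directly.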

## Confusion table

### Idea

Many classification models can return a score that indicates the probability that a particular object belongs to the positive class. You can select the threshold above which an object is considered positive. Different threshold values will consequently produce different confusion matrices.

We will call the table that maps each selected threshold to the entries of its confusion matrix the confusion table.

| $T$   | $TN$ | $FP$ | $FN$ | $TP$ |
|:-----------|:-----:|:-----:|:-----:|:-----:|
| $t_1$      | $TN_1$| $FP_1$| $FN_1$| $TP_1$|
| $t_2$      | $TN_2$| $FP_2$| $FN_2$| $TP_2$|
| ...        | ...    | ...    | ...    | ...    |
| $t_i$      | $TN_i$| $FP_i$| $FN_i$| $TP_i$|
| ...        | ...    | ...    | ...    | ...    |
| $t_n$      | $TN_n$| $FP_n$| $FN_n$| $TP_n$|

### Implementation

I did not find a ready-made implementation of this concept, so here is my own.

The first thing that comes to mind is to call `sklearn.metrics.confusion_matrix` for every threshold of interest. But this solution is slow: its complexity is $O(nT')$, where $n$ is the number of samples and $T'$ is the number of thresholds to check.
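For reference, the straightforward approach could look roughly like the sketch below. It recounts the four cells with NumPy masks instead of calling `confusion_matrix`, but has the same $O(nT')$ cost. The function name `naive_confusion_table` and the toy data are hypothetical, and treating a score equal to the threshold as positive is an assumption.

```python
import numpy as np

def naive_confusion_table(y_true, y_score, thresholds):
    """Recount TN, FP, FN, TP from scratch for every threshold: O(n * T')."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    rows = []
    for t in thresholds:
        # Assumption: a score equal to the threshold counts as positive.
        y_pred = y_score >= t
        tn = int(np.sum((y_true == 0) & ~y_pred))
        fp = int(np.sum((y_true == 0) & y_pred))
        fn = int(np.sum((y_true == 1) & ~y_pred))
        tp = int(np.sum((y_true == 1) & y_pred))
        rows.append((t, tn, fp, fn, tp))
    return rows

# Hypothetical toy data.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
table = naive_confusion_table(y_true, y_score, thresholds=[0.2, 0.5])
for row in table:
    print(row)
```

Each threshold triggers a full pass over the data, which is what makes this version a useful correctness baseline but a poor choice when every score is used as a threshold.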

The following algorithm runs in $O(n + T')$, assuming that the observations are sorted in ascending order of their scores $s_i$ and the thresholds are sorted in ascending order:

As input we have three arrays:

- $\left\{y_1, y_2, ..., y_n\right\}$ - true classes of the observations, where:
$$y_i = \begin{cases}
    0, \text{if the $i$-th observation is negative};\\
    1, \text{if the $i$-th observation is positive}.
\end{cases}$$
- $\left\{s_1, s_2, ..., s_n\right\}$ - scores of the observations;
- $\left\{t_1, t_2, ..., t_{T'}\right\}$ - the thresholds we're interested in.

Let's introduce running counters for the values we are interested in and initialize them for the lowest meaningful threshold, at which every observation is classified as positive:

- $TP'=P=\sum_{i=1}^n{y_i}$ - with the lowest threshold, all positive observations are correctly classified as positive;
- $FP'=N = n -\sum_{i=1}^n{y_i}$ - with the lowest threshold, all negative observations are mistakenly classified as positive;
- $TN'=0, FN'=0$ - with the lowest threshold, no observations are classified as negative;
- $i=1$ - index into $y_i$ and $s_i$;
- $j=1$ - index into $t_j$.

Now let's talk about the iterative procedure:

At each step we compare $s_i$ and $t_j$:

- If $s_i < t_j$:
    - If $y_i=1$, add one to $FN'$ and subtract one from $TP'$;
    - If $y_i=0$, add one to $TN'$ and subtract one from $FP'$;
    - Move to the next observation: add one to $i$;
- If $s_i \geq t_j$, or all observations have already been processed ($i > n$):
    - Write the current values of $TP', FP', TN', FN'$ to the confusion table row corresponding to $t_j$;
    - Move to the next threshold: add one to $j$;
- The algorithm stops when all thresholds have been processed, i.e. when $j>T'$.

In [7]:
def confusion_table(
    y_true,
    y_score,
    tresholds=None
):
    if tresholds is None:
        tresholds = y_score

    for t in tresholds:
        