![Banner](./img/AI_Special_Program_Banner.jpg)


# Theory of classification - example and formalization
---

First, we will look at a small example of classification and learn about the most important key figures for assessing the quality of classification. Then we will formalize the classification problem a little.

## Table of contents
---

- [An example](#An-example)
    - [Error terms and quality measures](#Error-terms-and-quality-measures)
- [Formalization of the classification problem](#Formalization-of-the-classification-problem)
    - [Calculation of basic statistics](#Calculation-of-basic-statistics)
    - [Calculation of error characteristics and quality measures](#Calculation-of-error-characteristics-and-quality-measures)

## An example
---

We consider the following example, which deals with fictitious (summarised)
data on a problem faced by a mail order company. The company is thinking
about increasing its sales by means of a marketing campaign.
The company only wants to send a voucher for further orders to those customers who are not expected to place a new order within 90 days anyway (otherwise money would be lost).

The aim is to decide on voucher dispatch based on knowledge of past orders.
The attributes *delay, customer, item* and *return* are used to predict whether
a new order will be placed within 90 days (in which case
no voucher would have to be sent).

The corresponding data is shown in the following table. The values in the 'Customer' column mean
**F**emale, **M**ale and company (**C**).

   Case | Delay | Customer | Item | Return | ***Reorder?***
  :--------:|:--------------: | :---------: |:----------:| :---------------: |:------------------
   1 | long | M | single | no | no
   2 | long | M | single | yes | no
   3 | none | M | single | no | yes
   4 | short | F | single | no | yes
   5 | short | C | several | no | yes
   6 | short | C | several | yes | no
   7 | none | C | several | yes | yes
   8 | long | F | single | no | no
   9 | long | C | several | no | yes
   10 | short | F | several | no | yes
   11 | long | F | several | yes | yes
   12 | none | F | single | yes | yes
   13 | none | M | several | no | yes
   14 | short | F | single | yes | no

This means that only *binary* decisions (yes/no) have to be made. More generally, the aim is to divide the data records into two
*classes*. With the help of a *trained model* (**fit**) based on the above training data
we then want to make *predictions* (**predict**) for new data records as to which class they probably belong to.
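The fit/predict pattern can be sketched in a few lines of Python. The following is only a minimal illustration, not the method used later in the course: `MajorityClassifier` is a hypothetical placeholder model that always predicts the most frequent class of the training data, and `new_customers` is an invented data record. The training data itself is copied from the table above.

```python
from collections import Counter

# Training data from the table: (delay, customer, item, return) -> reorder
TRAINING_DATA = [
    (("long", "M", "single", "no"), "no"),
    (("long", "M", "single", "yes"), "no"),
    (("none", "M", "single", "no"), "yes"),
    (("short", "F", "single", "no"), "yes"),
    (("short", "C", "several", "no"), "yes"),
    (("short", "C", "several", "yes"), "no"),
    (("none", "C", "several", "yes"), "yes"),
    (("long", "F", "single", "no"), "no"),
    (("long", "C", "several", "no"), "yes"),
    (("short", "F", "several", "no"), "yes"),
    (("long", "F", "several", "yes"), "yes"),
    (("none", "F", "single", "yes"), "yes"),
    (("none", "M", "several", "no"), "yes"),
    (("short", "F", "single", "yes"), "no"),
]


class MajorityClassifier:
    """Hypothetical placeholder model: always predicts the most frequent class."""

    def fit(self, samples, labels):
        # "Training" only remembers the most frequent label.
        self.majority_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, samples):
        # Every new data record receives the same class.
        return [self.majority_ for _ in samples]


samples = [x for x, _ in TRAINING_DATA]
labels = [y for _, y in TRAINING_DATA]

model = MajorityClassifier().fit(samples, labels)

# An invented new data record (not part of the table).
new_customers = [("short", "M", "several", "no")]
print(model.predict(new_customers))  # ['yes'], since 9 of the 14 cases reorder
```

Such a baseline ignores the attributes entirely; a useful classifier has to do better than this.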

### Error terms and quality measures

First, we will *describe* the general parameters in relation to the specific
problem and then look at the corresponding formulas.
Four cases must be distinguished with regard to the classification. Regardless of the semantics, a yes-decision
is called *positive* and a no-decision *negative*.

1.  *true positive*: a new order is predicted and is actually placed

2.  *true negative*: no new order is predicted and none is
    placed

3.  *false positive*: a new order is predicted but is not placed

4.  *false negative*: no new order is predicted, but one is placed nevertheless

The corresponding numbers are determined by simple counting.
They can then be represented in the so-called **confusion matrix**
as follows:

***New order?*** | | |
:------------------:|:---------------:|:---------------:|
| *prediction* / *reality* | yes | no |
| yes | true positive | false positive |
| no | false negative | true negative |
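The counting can be sketched as follows. The delays and actual reorders are taken from the table of the example; the prediction rule `predict_reorder` ("customers with a long delay do not reorder") is purely hypothetical and only serves to produce some predictions to count.

```python
# (delay, actual reorder) for the 14 cases of the example table
CASES = [
    ("long", "no"), ("long", "no"), ("none", "yes"), ("short", "yes"),
    ("short", "yes"), ("short", "no"), ("none", "yes"), ("long", "no"),
    ("long", "yes"), ("short", "yes"), ("long", "yes"), ("none", "yes"),
    ("none", "yes"), ("short", "no"),
]


def predict_reorder(delay):
    """Hypothetical rule: customers with a long delay do not reorder."""
    return "no" if delay == "long" else "yes"


def confusion_counts(actual, predicted, positive="yes"):
    """Count the four cases of the confusion matrix."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for y, y_hat in zip(actual, predicted):
        if y_hat == positive:
            counts["TP" if y == positive else "FP"] += 1
        else:
            counts["FN" if y == positive else "TN"] += 1
    return counts


actual = [reorder for _, reorder in CASES]
predicted = [predict_reorder(delay) for delay, _ in CASES]

print(confusion_counts(actual, predicted))
# {'TP': 7, 'FP': 2, 'FN': 2, 'TN': 3}
```

The four counts always add up to the number of cases, here 14.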

These basic statistics can then be used to calculate a whole range of
derived metrics. We will limit ourselves to
the following:

-   *correct classifications* is the number of correctly predicted
    cases;

-   *misclassifications* is the number of incorrectly predicted
    cases;

-   *accuracy* is the *proportion* of correctly predicted cases;

-   the *positive predictive value* (also *precision* or *relevance*) is the probability that a predicted new order will actually be placed;

-   the *negative predictive value* is the probability
    that a new order predicted not to occur will actually not occur;

-   the *true positive rate* (also *sensitivity*, *hit rate* or
    *recall*) is the probability that a new order which is actually placed is also predicted;

-   the *true negative rate* (also *specificity* or *selectivity*)
    is the probability that, when no new order is placed, none is predicted;

-   the *false positive rate* (also *fall-out*)
    is the probability that a new order is predicted although none is placed;

-   the *false negative rate* is the probability that a new order which is actually placed is not predicted.

The *probabilities* are then to be estimated by the *relative frequencies*
in the training data.
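As a rough sketch, the derived metrics can be computed from the four basic counts as relative frequencies. The function names `ratio` and `quality_measures` as well as the counts used in the call (7 true positives, 2 false positives, 2 false negatives, 3 true negatives) are illustrative assumptions, not values prescribed by the example.

```python
from fractions import Fraction


def ratio(numerator, denominator):
    """Relative frequency as an exact fraction, or None if undefined."""
    return Fraction(numerator, denominator) if denominator else None


def quality_measures(tp, fp, fn, tn):
    """Derived metrics from the four counts of the confusion matrix."""
    total = tp + fp + fn + tn
    return {
        "correct classifications": tp + tn,
        "misclassifications": fp + fn,
        "accuracy": ratio(tp + tn, total),
        "positive predictive value": ratio(tp, tp + fp),
        "negative predictive value": ratio(tn, tn + fn),
        "true positive rate": ratio(tp, tp + fn),
        "true negative rate": ratio(tn, tn + fp),
        "false positive rate": ratio(fp, fp + tn),
        "false negative rate": ratio(fn, fn + tp),
    }


# Illustrative counts (assumed for this sketch)
measures = quality_measures(tp=7, fp=2, fn=2, tn=3)
for name, value in measures.items():
    print(name, value)
```

Note that a ratio is undefined when its denominator is zero, for example the negative predictive value of a classifier that never predicts "no"; the sketch returns `None` in that case.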

## Formalization of the classification problem
---

In order to simplify the notation for the formalization, we assume
that all attribute values are real numbers which are, however, only interpreted nominally or ordinally. This can always be achieved by a transformation: nominal attribute values can be assigned *arbitrary* (distinct) numerical values,
and ordinal attribute values are assigned numerical values in such a way
that the *order is preserved*. Without loss of
generality, we can therefore consider a total of
* $n_t$ data samples $\mathbf{x}^{(1)},\mathbf{x}^{(2)},\dots,\mathbf{x}^{(n_t)}$, each with
* $m$ real-valued features (or *attributes*) $A_1,A_2,\dots,A_m$
* (i.e. $\mathbf{x}^{(i)} = (x_1^{(i)},x_2^{(i)},\dots,x_m^{(i)})^T$)

so that

$$
\mathcal{D} = \{\mathbf{x}^{(1)},\mathbf{x}^{(2)},\dots,\mathbf{x}^{(n_t)}\} \subset \mathbb{R}^m
$$

represents the *database* (the dataset under consideration). Let the attribute $A_i$ (the
$i$-th attribute) take

* $m_i$ different values
* $a_{i,j}\>,\ j=1,2,\dots,m_i$, each with
* absolute frequency $m_{i,j}$.
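For the voucher example, such a transformation and the resulting values and frequencies might look as follows. The concrete codes and the assumed ordering none < short < long of the ordinal attribute *delay* are choices made for this sketch; the attribute *customer* is nominal, so any distinct numbers would do.

```python
from collections import Counter

# Columns 'Delay' and 'Customer' of the example table (cases 1 to 14)
delay = ["long", "long", "none", "short", "short", "short", "none",
         "long", "long", "short", "long", "none", "none", "short"]
customer = ["M", "M", "M", "F", "C", "C", "C",
            "F", "C", "F", "F", "F", "M", "F"]

# Ordinal attribute: the codes preserve the (assumed) order none < short < long
DELAY_CODES = {"none": 0, "short": 1, "long": 2}
# Nominal attribute: arbitrary but distinct codes
CUSTOMER_CODES = {"F": 0, "M": 1, "C": 2}

delay_encoded = [DELAY_CODES[value] for value in delay]
customer_encoded = [CUSTOMER_CODES[value] for value in customer]

# Different values a_{i,j} of the attribute and their frequencies m_{i,j}
delay_frequencies = Counter(delay_encoded)
print(len(delay_frequencies))             # m_i = 3 different values
print(sorted(delay_frequencies.items()))  # [(0, 4), (1, 5), (2, 5)]
```

The frequencies of all values of one attribute add up to the number of data samples, here 14.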

Each data sample from $\mathcal{D}$ is assigned to one of
* $\kappa$ classes $y_k\>,\ k=1,\dots,\kappa\>,\quad \kappa\in\mathbb{N},\ \kappa\ge2$

As with the attribute values, the classes can also be assumed to be (natural) numbers,
i.e.
* $y_k=k-1\>,\ k=1,\dots,\kappa$.

The only thing that matters here is the ability to *distinguish* the classes;
no ranking is implied (nominal scale). In the case of a
binary classification as in the voucher problem, $\kappa=2$ and
the classes are 0 and 1, a representation that is also frequently encountered in practice.

In the *training data* $\mathcal{T}$ considered in the context of a classification,
each data sample $\mathbf{x}^{(l)}$ is assigned a (true or observed)
class $y^{(l)}\in\{0,1,\dots,\kappa-1\}$, i.e. the $l$-th
training sample is the tuple $(\mathbf{x}^{(l)},y^{(l)})$. The
class assignments $\mathcal{C}$ are given by
$$
\mathcal{C} = (y^{(1)},y^{(2)},\dots,y^{(n_t)})\>,\ y^{(l)} \in \{0,1,\dots,\kappa-1\}\>,\ l=1,2,\dots,n_t\>,
$$
so that the training data can be represented as
$$
\mathcal{T} =(\mathcal{D},\mathcal{C}) = \{(\mathbf{x}^{(1)},y^{(1)}),(\mathbf{x}^{(2)},y^{(2)}),\dots,(\mathbf{x}^{(n_t)},y^{(n_t)})\}
  \subset \mathbb{R}^m\times\{0,1,\dots,\kappa-1\}\>.
$$
The classification problem then consists of finding a function
$$\begin{aligned}
 f:\mathbb{R}^m&\to\{0,1,\dots,\kappa-1\}\\
 \mathbf{x}&\mapsto f(\mathbf{x})=\hat y\end{aligned}$$
which returns the desired class $\hat y$ for a data sample $\mathbf{x}$, whereby $\hat y$ should of course correspond to the true class. In this context $f$ is called a *classifier* and its application is called *classification*. Determining the function $f$ from the training data $\mathcal{T}$ is called *classifier design*.

### Calculation of basic statistics
For a class $y_k$, we use $p_k$ to denote the *absolute frequency* of the class (or the number of *positive cases*). $n_k$ is then the number of *negative cases* (the frequency with which the class $y_k$ *does not* occur). Since $n_t$ denotes the total number of data samples, the following holds for each class $y_k$:

$$
p_k + n_k = n_t\>.
$$

Similarly, we use $p_{k,i,j}$ to denote the absolute frequency with which the class $y_k$ occurs for the value $a_{i,j}$ of the attribute $A_i$ (and $n_{k,i,j}$ again denotes the absolute frequency of the opposite). Analogously, the following holds for every class $y_k$ and every value $a_{i,j}$ of the attribute $A_i$:

$$
p_{k,i,j} + n_{k,i,j} = m_{i,j}\>.
$$
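For the voucher example, these frequencies can be obtained by counting, as in the following sketch. It uses the columns *Delay* and *Reorder?* of the example table and the class "yes"; the variable names are chosen for this illustration only.

```python
from collections import Counter

# Columns 'Delay' and 'Reorder?' of the example table (cases 1 to 14)
delay = ["long", "long", "none", "short", "short", "short", "none",
         "long", "long", "short", "long", "none", "none", "short"]
reorder = ["no", "no", "yes", "yes", "yes", "no", "yes",
           "no", "yes", "yes", "yes", "yes", "yes", "no"]

total = len(reorder)                 # total number of data samples
p_yes = Counter(reorder)["yes"]      # positive cases of the class "yes"
n_yes = total - p_yes                # negative cases of the class "yes"
print(p_yes, n_yes, total)           # 9 5 14

value_counts = Counter(delay)        # frequency of each value of 'Delay'
joint = Counter(zip(delay, reorder)) # frequency of each (value, class) pair

per_value = {}
for value, frequency in value_counts.items():
    p = joint[(value, "yes")]        # class "yes" occurs for this value
    n = frequency - p                # class "yes" does not occur
    per_value[value] = (p, n)

print(per_value)
# {'long': (2, 3), 'none': (4, 0), 'short': (3, 2)}
```

In each pair, the two numbers add up to the frequency of the corresponding attribute value (5, 4 and 5).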

### Calculation of error characteristics and quality measures

Now consider a fixed class $y^*\in\{0,1,\dots,\kappa-1\}$.
An assignment to this class is again called *positive* and an assignment
to any other class *negative*. Furthermore, consider a
training sample $(\mathbf{x},y)$ and a prediction $f(\mathbf{x})=\hat y$ (classifier value).
If $y=y^*$, then the sample is a positive case, otherwise
a negative one. Likewise, if $\hat y = y^*$, then it is a
positive classification (or prediction), otherwise a negative one.
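This case distinction carries over directly to more than two classes. The following sketch counts the four cases for each class in turn; the function name `one_vs_rest_counts` and the small data set with three classes are invented for illustration.

```python
def one_vs_rest_counts(actual, predicted, target):
    """Count TP, FP, FN, TN when `target` plays the role of the positive class."""
    tp = fp = fn = tn = 0
    for y, y_hat in zip(actual, predicted):
        positive_case = (y == target)            # true class equals the target class
        positive_prediction = (y_hat == target)  # predicted class equals the target class
        if positive_case and positive_prediction:
            tp += 1
        elif positive_prediction:
            fp += 1
        elif positive_case:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Invented example with three classes 0, 1 and 2
actual = [0, 1, 2, 2, 1, 0, 2]
predicted = [0, 2, 2, 1, 1, 0, 2]

for target in range(3):
    print(target, one_vs_rest_counts(actual, predicted, target))
# 0 (2, 0, 0, 5)
# 1 (1, 1, 1, 4)
# 2 (2, 1, 1, 3)
```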

With this case distinction, the basic parameters for assessing the classification quality can be characterised and assigned symbols as shown in the following table:


![x](img/2.2.b_Classification_Table1.png)

The confusion matrix then looks like this:

![x](img/2.2.b_Classification_Table2.png)

Depending on the problem, various parameters can be derived from this for the
evaluation of the classifier. Some possible **derived
parameters** together with their characterization and symbols can
be taken from the following table.

![x](img/2.2.b_Classification_Table3.png)

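As a small worked example, consider four samples with the following ground truth and two alternative predictions:

- Ground truth: [yes, yes, no, yes]
- Prediction 1: [yes, no, no, no]
- Prediction 2: [yes, yes, yes, yes]

**Accuracy:** What fraction of all predictions were correct?

- Prediction 1: 2/4 = 0.5 (50%)
- Prediction 2: 3/4 = 0.75

**Precision:** When the positive class was predicted, how often was that correct (as a fraction of all positive predictions)?

- Prediction 1: 1/1 = 1
- Prediction 2: 3/4 = 0.75 < 1

**Recall:** How many of the actual positive cases did the model find (as a fraction of all positive cases)?

- Prediction 1: 1/3 ≈ 0.33
- Prediction 2: 3/3 = 1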
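These numbers can be checked with a few lines of Python. The helper `evaluate` is a hypothetical name used only in this sketch; it assumes that at least one positive prediction and at least one positive case exist, since precision and recall are otherwise undefined.

```python
from fractions import Fraction


def evaluate(truth, prediction, positive="yes"):
    """Accuracy, precision and recall as exact fractions."""
    pairs = list(zip(truth, prediction))
    correct = sum(y == p for y, p in pairs)
    true_positives = sum(y == positive and p == positive for y, p in pairs)
    predicted_positives = sum(p == positive for _, p in pairs)
    actual_positives = sum(y == positive for y, _ in pairs)
    return {
        "accuracy": Fraction(correct, len(pairs)),
        "precision": Fraction(true_positives, predicted_positives),
        "recall": Fraction(true_positives, actual_positives),
    }


ground_truth = ["yes", "yes", "no", "yes"]
prediction_1 = ["yes", "no", "no", "no"]
prediction_2 = ["yes", "yes", "yes", "yes"]

result_1 = evaluate(ground_truth, prediction_1)
result_2 = evaluate(ground_truth, prediction_2)

print(result_1["accuracy"], result_1["precision"], result_1["recall"])  # 1/2 1 1/3
print(result_2["accuracy"], result_2["precision"], result_2["recall"])  # 3/4 3/4 1
```

The comparison shows the typical trade-off: the cautious first prediction has perfect precision but poor recall, while predicting "yes" for everything gives perfect recall at the cost of precision.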