# **Classification - Part II**


## Model selection
Which of the available models, algorithms and parameter configurations is the best one?

**Oil slick example**: detect oil slicks from satellite images to raise an early alarm.
Manual detection is expensive and slow, but using machine learning raises some problems:
* Scarcity of training data: oil spills are rare (fortunately);
* Unbalanced nature of the data (skewed classes): non-spill samples heavily outnumber the positive ones.

An automatic hazard detection system has been developed and marketed with a manual pre-selection of images for final manual processing. 
A **tradeoff between undetected spills and false alarms** was necessary.
The evaluation of performance guides the tradeoff.

> In real life I can decide whether the classifier has to be biased in one direction or the other (cost of error).

In supervised learning the performance measured on the training set is overoptimistic: we need a lower bound for performance, obtained through independent tests.

Supervised data are usually scarce, so their use must be balanced between training, validation and test.

It's important to evaluate how well the theory fits the data and to evaluate the cost generated by prediction errors. The evaluation is independent of the algorithm used to generate the classifier.

Empirically, the more training data we use, the better the performance we should expect, since we cover a wider range of situations. But we have already seen that in this way we capture noise too (random changes).




### Error estimation
Let's suppose that the test set is a good representation, on average, of the entire dataset $\varepsilon$.
The relationship between the training set and the dataset will be subject to **probabilistic variability**.
The evaluation can be either:
* **General**: the whole performance of the classifier;
* **Local**: the local performance of a component of the model (e.g. a node of the DT).

If the test set error ratio is $x$, we should expect a runtime error $x \pm \sigma$ (**confidence interval**).

Forecasting each element of the test set is like a Bernoulli process: each prediction is either correct or an error, so the test amounts to $N$ independent binary random events of the same type. Let $S$ be the number of errors among them.

If $f=\frac{S}{N}$ is the empirical frequency of error, for $N \ge 30$ it is approximately normally distributed around the true probability of error $p$.
We choose a **confidence level** $1-\alpha$, i.e. the probability that the standardized deviation of $f$ from $p$ falls between the two tails of the gaussian:
$$P\left(z_{\alpha/2} \le \frac{f-p}{\sqrt{p(1-p)/N}} \le z_{1-\alpha/2}\right) = 1 - \alpha$$
Where:
* $z_{\alpha/2}$ and $z_{1-\alpha/2}$ are the abscissas delimiting the two tails of the gaussian;
* $p$ is the true probability of error;
* $f$ is the empirical frequency of error.

The abscissa $z$ depends on the desired confidence level $1-\alpha$: together with $-z$ it delimits the central area $1-\alpha$ of a standard normal distribution ([Wilson score interval](https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Wilson_score_interval)).
With a little algebra:
$$p = \frac{1}{1 + \frac{1}{N}z^2} \Biggl[f + \frac{1}{2N}z^2 \pm z\sqrt{\frac{1}{N}f(1-f)+\frac{1}{4N^2}z^2}\Biggr]$$

![](https://i.ibb.co/5rsFzB0/photo-2020-12-29-09-02-42.jpg)
With the empirical frequency of error $f$ held constant, increasing $N$ narrows the uncertainty on $p$.
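
The interval can be checked numerically. The following is a minimal sketch of the Wilson score interval, not a reference implementation: the function name `wilson_interval` and the sample values are assumptions made for illustration, and the quantile $z$ is taken from the standard library's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist


def wilson_interval(f, n, confidence=0.95):
    """Return (lower, upper) bounds for the true probability p,
    given the empirical frequency f measured on n records."""
    # Two-sided quantile of the standard normal for the chosen confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    denom = 1 + z**2 / n
    centre = f + z**2 / (2 * n)
    half_width = z * math.sqrt(f * (1 - f) / n + z**2 / (4 * n**2))
    return (centre - half_width) / denom, (centre + half_width) / denom


# Same empirical error frequency, growing test set: the interval narrows.
for n in (100, 1000, 10000):
    low, high = wilson_interval(0.25, n)
    print(n, round(low, 3), round(high, 3))
```

The upper bound is the pessimistic error that the following sections refer to as the maximum error.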

### Statistical pruning of the decision tree
The C4.5 strategy is:
* Consider a subtree near the leaves;
* Compute its maximum error $e_l$ as the weighted sum of the maximum errors of its leaves;
* Compute the maximum error $e_r$ of the root of the subtree transformed into a leaf;
* Prune if $e_r \le e_l$, i.e. if replacing the subtree with a single leaf does not worsen the maximum error.

With pruning, the empirical error frequency increases, but so does the number of records in the node: the confidence interval narrows, therefore the maximum error can decrease.
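
The pruning test can be sketched as follows. This is an illustration under stated assumptions, not the actual C4.5 code: the maximum error is taken as the upper bound of the confidence interval formula seen earlier, $z = 0.69$ is an assumed default (roughly a 25% one-sided tail), and the leaf counts are hypothetical.

```python
import math


def max_error(errors, n, z=0.69):
    """Pessimistic (upper) bound on the true error rate of a node
    that misclassifies `errors` of its `n` records."""
    f = errors / n
    num = f + z**2 / (2 * n) + z * math.sqrt(f * (1 - f) / n + z**2 / (4 * n**2))
    return num / (1 + z**2 / n)


def should_prune(leaves, root_errors, z=0.69):
    """leaves: list of (errors, records) pairs, one per leaf of the subtree.
    root_errors: errors made by the subtree root if turned into a leaf."""
    total = sum(n for _, n in leaves)
    # e_l: maximum errors of the leaves, weighted by the records they cover.
    e_l = sum(n / total * max_error(e, n, z) for e, n in leaves)
    # e_r: maximum error of the root once the subtree is collapsed.
    e_r = max_error(root_errors, total, z)
    return e_r <= e_l, e_r, e_l


# Hypothetical subtree: three small leaves, 14 records in total.
prune, e_r, e_l = should_prune([(2, 6), (1, 2), (2, 6)], root_errors=5)
print(prune, round(e_r, 3), round(e_l, 3))
```

In this example the collapsed root has the same empirical errors as the leaves combined, but its bound is computed on 14 records instead of 2 or 6, so it is tighter.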

### Testing a classifier
The *error frequency* (sum of the errors on any class, divided by the number of tested records) is the simplest indicator of the quality of a classifier.
From now on, for simplicity, we will consider the *empirical error frequency*, since it is what we can directly measure; in real cases the *maximum error frequency* (the pessimistic bound of the confidence interval) is used, but its computation is heavier.

Accuracy and other more sophisticated indicators are used to compare different classifiers or parameter settings, and to take the cost of errors into account.

**Hyperparameters**: every machine learning algorithm has one or more parameters that influence its behaviour (e.g. for a decision tree, the impurity measure and the maximum depth). 
Several train/test loops are in general necessary to find the best set of values.
Sometimes it is necessary to find a tradeoff between the effort spent in the optimization step and the quality of the result (the optimization can be expensive).

In each step the data should be representative of the data that will be classified at run-time. 
Several testing strategies (splitting) exist for the optimization step:
* **Holdout**: a typical split ratio is $2:1$.
The split should be as random as possible, and the proportion between classes should be maintained (sampling with **stratification**).
  * Splitting data into *training set* and *test set*;
  * Splitting data into *training set*, *validation set* (we change hyperparameters, retrain and choose the best) and *test set* (totally independent, not used in training, represents the run-time situation).
* **Cross-validation**: repeated tests with different splits (different kinds).
![](https://i.ibb.co/bHbCd0t/photo-2020-12-29-09-30-54.jpg)
  * $k$-fold: the training set is randomly partitioned into $k$ subsets (folds).
  There are $k$ iterations, each using one of the folds for test and the others for training (more computation).
  The results are then combined and the final model is generated on the entire training set. Each record is used $k-1$ times for training and once for testing (optimal use of the supervised data).
  Typically $k=10$;
  * Leave one out: extreme case of cross-validation where $k=N$, with no random partitioning and no stratification. It's very expensive.
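
The two splitting strategies above can be sketched with the standard library only. The function names `stratified_holdout` and `k_fold_indices` are assumptions made for this illustration; the $k$-fold sketch partitions at random and, for brevity, does not stratify.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, test_fraction=1 / 3, seed=0):
    """Random 2:1 holdout split that keeps the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test_idx += idxs[:n_test]
        train_idx += idxs[n_test:]
    return train_idx, test_idx


def k_fold_indices(n_records, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs, one per fold."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)        # random partition
    folds = [indices[i::k] for i in range(k)]   # k nearly equal folds
    for i in range(k):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, folds[i]


for train_idx, test_idx in k_fold_indices(25, k=5):
    print(len(train_idx), len(test_idx))
```

Leave one out corresponds to calling `k_fold_indices(n, k=n)`: every test fold then holds a single record.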

**Bootstrap**: statistical sampling technique. $N$ records are sampled with replacement to build the training set: some records are selected more than once, while those that are never selected are used for test.
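
A minimal sketch of the bootstrap split, assuming records are identified by their index (the function name `bootstrap_split` is hypothetical). Each record is left out of the sample with probability $(1-\frac{1}{N})^N \approx 0.368$ for large $N$, so roughly a third of the data ends up in the test set.

```python
import random


def bootstrap_split(n_records, seed=0):
    """Sample n_records indices with replacement for training;
    the indices that were never drawn form the test set."""
    rng = random.Random(seed)
    train_idx = [rng.randrange(n_records) for _ in range(n_records)]
    chosen = set(train_idx)
    test_idx = [i for i in range(n_records) if i not in chosen]
    return train_idx, test_idx


train_idx, test_idx = bootstrap_split(1000)
print(len(train_idx), len(set(train_idx)), len(test_idx))
```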

### Performance measure of a classifier
**Confusion matrix**: let's consider for simplicity a binary prediction.
It's possible to define the so-called confusion matrix, structured in the following way:
![](https://miro.medium.com/max/2102/1*fxiTNIgOyvAombPJx5KGeA.png)
We can define:
* **Success rate** or **accuracy** as $\frac{TP+TN}{N_{\text{test}}}$;
* **Error rate** as $1-\text{Success rate}$.

There are other performance indicators for a classifier, such as speed, robustness, scalability, etc.
A classification error may have different consequences, depending on the class of the individual (sample).
For example, when forecasting an illness a false positive can be less dangerous than a false negative (unless the treatments or further examinations are dangerous or invasive).

There are other measures, such as:
* **Precision**: the rate of true positives among the positive classifications $\frac{TP}{TP+FP}$;
* **Recall** or **sensitivity**: the rate of positives that I can catch $\frac{TP}{TP+FN}$;
* **Specificity**: the rate of negatives that I can catch $\frac{TN}{TN+FP}$;
* **Accuracy**: the weighted sum of sensitivity and specificity $\text{acc}=\text{sens}\frac{\text{pos}}{N}+\text{spec}\frac{\text{neg}}{N}$;
* **F-measure**: harmonic mean of precision and recall (for a given sum of the two, it is maximum when they're equal) $F = 2 \frac{\text{prec} \cdot \text{rec}}{\text{prec}+\text{rec}}$.
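
As a quick check of the definitions above, the measures can be computed from the four cells of a binary confusion matrix. The counts below are hypothetical, and the sketch assumes no denominator is zero.

```python
def binary_metrics(tp, fp, fn, tn):
    """Measures derived from a binary confusion matrix
    (assumes every denominator is non-zero)."""
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # sensitivity
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / n
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


m = binary_metrics(tp=40, fp=10, fn=20, tn=130)
# Accuracy as weighted sum of sensitivity and specificity.
pos, neg, n = 60, 140, 200
weighted = m["recall"] * pos / n + m["specificity"] * neg / n
print(m, weighted)
```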

The confusion matrix may be extended to the multi-class case (the correct predictions lie on the main diagonal).

It is useful to also give global definitions, which combine the values of the single classes:
* **Macro**: unweighted mean of the per-class measures; every class counts the same, so the global measure is penalized more if the minority class has a bad performance;
* **Micro**: computed from the counts pooled over all the classes; every record contributes to the same extent to the global measure;
* **Weighted**: mean of the per-class measures weighted by the class size; the performance on the majority class has a major influence on the global measure.
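
The three ways of averaging can be compared on a small example. The sketch below uses precision as the per-class measure and a hypothetical three-class confusion matrix with rows as true classes and columns as predicted classes; the function name is an assumption.

```python
def averaged_precision(confusion):
    """confusion[i][j] = records of true class i predicted as class j.
    Returns (macro, micro, weighted) precision."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = [confusion[i][i] for i in range(n_classes)]
    predicted = [sum(row[j] for row in confusion) for j in range(n_classes)]
    support = [sum(row) for row in confusion]
    per_class = [tp[i] / predicted[i] for i in range(n_classes)]

    macro = sum(per_class) / n_classes              # every class counts the same
    micro = sum(tp) / sum(predicted)                # counts pooled over classes
    weighted = sum(support[i] / total * per_class[i] for i in range(n_classes))
    return macro, micro, weighted


# Hypothetical unbalanced problem: 100, 20 and 10 records per class.
confusion = [
    [90, 5, 5],
    [10, 8, 2],
    [5, 0, 5],
]
print(averaged_precision(confusion))
```

With these counts the macro average is the lowest of the three, because the two small classes are predicted poorly.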

### $\kappa$ statistic
It's a parameter that evaluates the concordance between two classifications (e.g. predicted and true values).
For three classes $a$, $b$, $c$, where $T_x$ is the number of records whose true class is $x$ and $P_x$ is the number of records predicted as $x$, let:
* Probability of concordance $Pr(c)=\frac{TP_a + TP_b + TP_c}{N}$;
* Probability of random concordance $Pr(r)=\frac{T_a P_a + T_b P_b + T_c P_c}{N^2}$.

Then $\kappa$ is the ratio between the concordance exceeding the random component and the maximum possible surplus:
$$-1 \le \kappa = \frac{Pr(c)-Pr(r)}{1-Pr(r)} \le 1$$

We will have:
* $1$ for perfect agreement;
* $-1$ for total disagreement (rare);
* $0$ for random agreement.
![](https://i.ibb.co/yYW16Rr/photo-2020-12-29-14-46-47.jpg)
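
The statistic above can be computed directly from a confusion matrix. This is a minimal sketch (the function name and the matrices are hypothetical) with rows as true classes and columns as predicted classes.

```python
def kappa_statistic(confusion):
    """Agreement between true and predicted classes, corrected for chance."""
    n = sum(sum(row) for row in confusion)
    size = len(confusion)
    # Pr(c): observed concordance, i.e. the main diagonal.
    pr_c = sum(confusion[i][i] for i in range(size)) / n
    # Pr(r): concordance expected by chance from the marginal totals.
    true_totals = [sum(row) for row in confusion]
    pred_totals = [sum(row[j] for row in confusion) for j in range(size)]
    pr_r = sum(t * p for t, p in zip(true_totals, pred_totals)) / n**2
    return (pr_c - pr_r) / (1 - pr_r)


print(kappa_statistic([[5, 0], [0, 5]]))      # perfect agreement
print(kappa_statistic([[25, 25], [25, 25]]))  # agreement at chance level
print(kappa_statistic([[0, 5], [5, 0]]))      # total disagreement
```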

### The cost of errors
Our decisions are driven by predictions, and bad predictions imply a cost.
An easy way to account for it is to compute the weighted cost of errors. To make the classifier itself sensitive to the costs:
* Alternative 1: alter the proportion of classes in the supervised data, duplicating the examples of the classes where a classification error is more costly.
In this way the classifier becomes better at classifying the classes where the cost is higher (training is repeated on these classes);
* Alternative 2: some learning algorithms allow weights to be attached to the instances.
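
The weighted cost of errors can be sketched as the element-wise product of a confusion matrix and a cost matrix. Both the counts and the costs below are hypothetical: a missed spill is assumed to cost ten times a false alarm.

```python
def total_cost(confusion, cost):
    """Sum of confusion[i][j] * cost[i][j]; rows are true classes,
    columns are predicted classes, correct predictions cost nothing."""
    return sum(
        confusion[i][j] * cost[i][j]
        for i in range(len(confusion))
        for j in range(len(confusion[i]))
    )


def accuracy(confusion):
    n = sum(sum(row) for row in confusion)
    return sum(confusion[i][i] for i in range(len(confusion))) / n


cost = [[0, 10],   # true spill:    hit, missed spill
        [1, 0]]    # true no-spill: false alarm, correct rejection

classifier_a = [[40, 20], [10, 130]]
classifier_b = [[55, 5], [40, 100]]

for name, cm in (("A", classifier_a), ("B", classifier_b)):
    print(name, accuracy(cm), total_cost(cm, cost))
```

In this example classifier B is less accurate than A but cheaper, which is why accuracy alone is not enough when errors have different costs.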

## Evaluation of a probabilistic classifier
We've already seen the distinction between:
* **Crisp prediction**: gives label, immediate decision;
* **Probabilistic prediction**: gives probabilities of labels, soft prediction.

The adequacy of one output rather than the other depends on the application domain: when an immediate decision is required a crisp classifier is necessary, while when the classifier is part of a process that requires several steps a probabilistic classifier can be more appropriate.

Crisp values sometimes hide probabilities, for example when a leaf has some counts for the minority classes. Since it is quite common to have leaves with a small number of samples of minority classes, smoothing techniques are used to adjust probabilities.

Probabilities can be converted into crisp values with different techniques:
* Binary: a threshold for the positive class is set;
* Multiclass: output the class with maximum probability.
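
A minimal sketch of both steps, from leaf counts to probabilities and from probabilities to a crisp label. Laplace (add-one) smoothing is used here as one common choice; the notes do not say which smoothing technique is meant, so treat it as an assumption, like the function names and the leaf counts.

```python
def laplace_probabilities(counts, alpha=1):
    """Smoothed class probabilities from the class counts of a leaf."""
    total = sum(counts.values()) + alpha * len(counts)
    return {label: (c + alpha) / total for label, c in counts.items()}


def crisp_binary(p_positive, threshold=0.5):
    """Binary case: positive when the probability reaches the threshold."""
    return p_positive >= threshold


def crisp_multiclass(probabilities):
    """Multiclass case: the class with maximum probability."""
    return max(probabilities, key=probabilities.get)


leaf_counts = {"a": 8, "b": 2, "c": 0}   # hypothetical leaf
probs = laplace_probabilities(leaf_counts)
print(probs, crisp_multiclass(probs), crisp_binary(probs["b"], threshold=0.2))
```

Without smoothing class `c` would get probability zero just because the leaf is small.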

### Lift chart
It's a chart used to evaluate various scenarios:
* Apply a probabilistic classification scheme;
* Sort all the classified elements by decreasing probability of the positive class;
* Make a $2$-dimensional chart with axes:
  * $x = \text{sample size}$
  * $y = \text{number of positives in sample}$

Only the rank is important, not the specific probability. 
![](https://www.bayesserver.com/docs/images/lift-chart.png)
* The blue straight line plots the number of positives obtained with a random choice of a sample of test data;
* The red curve plots the number of positives found when the elements are taken by decreasing probability of the positive class (ideally the true positives come first);
* The orange curve is the perfect classifier.

The larger the area between the classifier curve and the random straight line, the better the classifier.
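
The points of the chart can be computed in a few lines. This is a sketch with hypothetical scores and labels (1 for positive, 0 for negative); plotting is left out.

```python
def lift_curve(scores, labels):
    """Return the (sample size, positives in sample) points obtained by
    taking the records by decreasing probability of the positive class."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points, positives = [(0, 0)], 0
    for size, (_, label) in enumerate(ranked, start=1):
        positives += label
        points.append((size, positives))
    return points


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
curve = lift_curve(scores, labels)

# Random baseline: positives grow proportionally to the sample size.
rate = sum(labels) / len(labels)
for size, positives in curve:
    print(size, positives, size * rate)
```

Only the order induced by the scores matters: rescaling them without changing the ranking gives the same curve.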

### [ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) (receiver operating characteristic)

In a noisy channel there is a tradeoff between the hit rate and the false alarm rate.
Noise can alter the recognition of the transmitted signal: in particular, it perturbs the two signal levels according to a gaussian distribution.
The goal is to set a threshold that gives the best tradeoff.
With less noise the two gaussians are better separated:
![](https://i.ibb.co/30Sbv8Y/dvsvsvse.png)

Moving the threshold so that more signals are accepted as positive increases both the rate of true positives and the rate of false positives caught.
The area between the non-discrimination line and the ROC curve is a quality index of the classifier (the bigger the better).
The maximum area is the upper left triangle.

A soft classifier can be converted into a crisp one by setting a threshold. Varying the threshold changes the behaviour of the classifier, i.e. the balance between $TP$ and $FP$.
Stepping the threshold through its range traces the ROC curve.
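
The threshold sweep can be sketched as follows, with hypothetical scores and labels (1 for positive, 0 for negative). The area under the curve is computed with the trapezoidal rule; subtracting 0.5 gives the area between the curve and the non-discrimination line.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points, one per
    threshold, from the strictest threshold to the most permissive."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def area_under_curve(points):
    """Trapezoidal rule over consecutive ROC points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
points = roc_points(scores, labels)
auc = area_under_curve(points)
print(points)
print(auc, auc - 0.5)
```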

## From a binary classifier to multi-class classification
Several learning algorithms generate only binary classification models.
There are two main approaches to deal with multi-class classification:
* Transform the training algorithm and the model (increase the size of the problem);
* Use a set of binary classifiers and combine the results (increase the number of problems):
  * One-vs-one strategies;
  * One-vs-all strategies.

### One-vs-one strategy (OVO)
Consider all the possible $\frac{C(C-1)}{2}$ pairs of classes and generate a binary classifier for each pair.
Each binary problem will consider only samples from the two selected classes.

At prediction time a voting scheme is applied: an unseen example is submitted to all the classifiers and each winning class receives a $+1$. The class with the highest count wins.

### One-vs-all (OVA)
Consider $C$ binary problems where one class provides the positive examples and all the others the negatives.
Build $C$ binary classifiers and, at prediction time, apply a voting scheme: an unseen example is submitted to all the classifiers, each returning a confidence score. The confidences are combined and the class with the highest global score wins.

OVO requires solving a higher number of problems, even if they are of smaller size. OVA is intrinsically unbalanced ($1:C-1$).
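
The two combination schemes can be sketched independently of the base learner. In the toy example below the binary classifiers are hypothetical stand-ins (distance from a 1-D class centre) rather than trained models; only the voting logic is the point.

```python
from collections import Counter
from itertools import combinations


def ovo_predict(pair_classifiers, x):
    """Each pairwise classifier votes for one of its two classes."""
    votes = Counter(clf(x) for clf in pair_classifiers.values())
    return votes.most_common(1)[0][0]   # ties: first class seen wins


def ova_predict(class_scorers, x):
    """Each class-vs-rest classifier returns a confidence; highest wins."""
    return max(class_scorers, key=lambda label: class_scorers[label](x))


# Hypothetical toy problem: three classes described by a 1-D centre.
centres = {"a": 0.0, "b": 5.0, "c": 10.0}

pair_classifiers = {
    (p, q): (lambda x, p=p, q=q:
             p if abs(x - centres[p]) <= abs(x - centres[q]) else q)
    for p, q in combinations(centres, 2)
}
class_scorers = {
    label: (lambda x, label=label: -abs(x - centres[label]))
    for label in centres
}

print(len(pair_classifiers), len(class_scorers))  # C(C-1)/2 versus C
print(ovo_predict(pair_classifiers, 4.0), ova_predict(class_scorers, 9.0))
```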

## Ensemble methods
The idea is to train a set of base classifiers rather than a single one. 
The final prediction is obtained by taking the votes of the base classifiers. Ensembles tend to perform better provided the errors of the base classifiers are uncorrelated: the ensemble is wrong only if the majority of the base classifiers is wrong (the error rate drops).
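
The drop in error rate can be quantified under an idealized assumption: if the base classifiers make fully independent errors with the same error rate, the number of wrong votes follows a binomial distribution. The sketch below (function name and numbers are hypothetical) sums the probability that a majority is wrong.

```python
from math import comb


def ensemble_error(n_classifiers, base_error):
    """Probability that more than half of n independent base classifiers,
    each wrong with probability base_error, are wrong at the same time."""
    majority = n_classifiers // 2 + 1
    return sum(
        comb(n_classifiers, k)
        * base_error**k
        * (1 - base_error) ** (n_classifiers - k)
        for k in range(majority, n_classifiers + 1)
    )


for base_error in (0.35, 0.5, 0.6):
    print(base_error, ensemble_error(25, base_error))
```

The benefit only holds when the base classifiers are better than random: with a base error above 0.5 the majority vote makes things worse.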

### Manipulating the training set (filtering rows)
Data are resampled according to some sampling strategy.
* **Bagging**: repeatedly samples with replacement according to a uniform probability distribution;
* **Boosting**: iteratively changes the distribution of the training examples so that the base classifiers focus on the examples that are harder to classify;
* **AdaBoost**: the importance of each base classifier depends on its error rate (different voting schemes).
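
One round of reweighting can be sketched as follows. This follows the usual binary AdaBoost formulation, which the notes do not spell out, so take the formulas as an assumption: the importance of the base classifier is $\frac{1}{2}\ln\frac{1-\epsilon}{\epsilon}$ and the weights of the misclassified records are increased.

```python
import math


def adaboost_round(weights, correct):
    """weights: current record weights (summing to 1).
    correct: one boolean per record, True if the base classifier got it right.
    Assumes the weighted error is strictly between 0 and 1."""
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1 - error) / error)   # importance of the classifier
    updated = [
        w * math.exp(-alpha if ok else alpha)
        for w, ok in zip(weights, correct)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


weights = [0.2] * 5
alpha, new_weights = adaboost_round(weights, [True, True, True, True, False])
print(alpha, new_weights)
```

After the update the misclassified records carry half of the total weight, so the next base classifier is pushed to focus on them.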

### Manipulating input features (filtering columns)
Subsets of input features can be chosen randomly or following the advice of domain experts.
* **Random forest**: uses decision trees as base classifiers. Frequently produces very good results.

### Manipulating class labels
Useful when the number of classes is high.
1. For each base classifier, randomly partition the class labels into two subsets and relabel the dataset;
1. Train a binary classifier with these two classes;
1. At testing time, when a subset is selected, all the classes it includes receive a vote;
1. The class with the top score wins (error-correcting output coding).

![](https://i.ibb.co/kSLbft3/photo-2020-12-29-16-27-52.jpg)
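
The voting step of this scheme can be sketched without training anything. In the example below the partitions and the outputs of the base classifiers are hypothetical; `random_partition` shows how a partition could be drawn for step 1.

```python
import random


def random_partition(classes, rng):
    """Split the class labels into two non-empty subsets at random."""
    shuffled = list(classes)
    rng.shuffle(shuffled)
    cut = rng.randrange(1, len(shuffled))
    return set(shuffled[:cut]), set(shuffled[cut:])


def vote(partitions, binary_outputs, classes):
    """partitions[i] = (subset_0, subset_1) used by base classifier i.
    binary_outputs[i] = 0 or 1, the subset it selects for the test record.
    Every class in the selected subset receives one vote."""
    votes = {c: 0 for c in classes}
    for subsets, out in zip(partitions, binary_outputs):
        for c in subsets[out]:
            votes[c] += 1
    return max(votes, key=votes.get), votes


classes = ["a", "b", "c", "d"]
partitions = [
    ({"a", "b"}, {"c", "d"}),
    ({"a", "c"}, {"b", "d"}),
    ({"a", "d"}, {"b", "c"}),
]
winner, votes = vote(partitions, [0, 0, 0], classes)
print(winner, votes)
print(random_partition(classes, random.Random(0)))
```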