# **Classification - Part I**

With the term classification we always mean *supervised classification*.

**The classification problem**: the dataset contains $N$ *individuals*, each described by $D$ *attribute values*. One of these attributes is the *class*, which takes one of a finite number $C$ of different values.

The goal is to learn how to predict the value of the class (the $D$-th attribute) for individuals that have not been examined by the experts, that is, to **learn a classification model**.

**Classification model**: an algorithm that, given an individual whose class is unknown, computes the class.
The algorithm is parametrized to optimize the results for the specific problem.

To develop a classification model:
* Choose the learning algorithm;
* Let the algorithm learn its parametrization;
* Assess the quality of the classification model, since it will be used at run-time for classification.

There are two main flavours of classification:
* **Crisp**: the classifier assigns one label;
* **Probabilistic**: the classifier assigns a probability to each possible label.
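
The relation between the two flavours can be illustrated with a minimal sketch: a probabilistic prediction can always be turned into a crisp one by picking the most probable label. The labels, the probabilities and the helper name `to_crisp` are invented for illustration.

```python
def to_crisp(probabilities):
    """Turn a probabilistic prediction into a crisp one: pick the most probable label."""
    return max(probabilities, key=probabilities.get)

# Hypothetical output of a probabilistic classifier for one individual.
probabilistic_prediction = {"spam": 0.7, "ham": 0.3}

# The crisp counterpart keeps only the winning label.
crisp_prediction = to_crisp(probabilistic_prediction)
```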

## A general workflow for classification
**Step 1**: learning the model for the given set of classes.
* A training set, obtained by a random process, is available;
* The training set contains a certain number of individuals and for each the class label is available.

Note: the training set should be as representative as possible.

**Step 2**: estimate the accuracy of the model.
* A test set is available;
* The test set contains a certain number of individuals and for each the class label is available too;
* The model is run to assign the labels to each individual of the test set;
* The labels of the test set and the ones obtained by the classifier are compared and the accuracy estimated.

**Step 3**: the model is used at run-time to label new individuals.

Note: the true labels may become available after the labelling, so that the true accuracy can be compared to the estimated one.
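
The three steps can be sketched in a few lines. This is only an illustration under strong assumptions: the data is made up, the split sizes are arbitrary, and the "model" is a trivial majority-class classifier (the hypothetical `train_majority_classifier`), used here just to keep the example short.

```python
import random
from collections import Counter

def train_majority_classifier(training_set):
    """Step 1: 'learn' a trivial model, the most frequent class label."""
    labels = [label for _, label in training_set]
    majority_label = Counter(labels).most_common(1)[0][0]
    return lambda individual: majority_label

def accuracy(model, labelled_set):
    """Step 2: fraction of individuals whose predicted label matches the true one."""
    correct = sum(1 for x, y in labelled_set if model(x) == y)
    return correct / len(labelled_set)

# Toy supervised data: (attribute values, class label), entirely made up.
data = [((i,), "yes" if i % 3 else "no") for i in range(30)]
random.Random(0).shuffle(data)          # random process, fixed seed for repeatability
training_set, test_set = data[:20], data[20:]

model = train_majority_classifier(training_set)
estimated_accuracy = accuracy(model, test_set)

# Step 3: at run-time the model labels a new individual.
new_label = model((99,))
```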




# Classification with decision trees (C4.5 algorithm)
A run-time classifier (not a learner) structured as a decision tree, i.e. a tree-shaped set of tests. The decision tree has inner nodes (decisions), leaf nodes (predictions) and a root.

## General description of the learner
Given a set $\varepsilon$ of individuals for which the class is known, grow a tree as follows:
* If all the elements belong to class $c$, or $\varepsilon$ is small enough, generate a leaf node with label $c$ (for a small but mixed $\varepsilon$, $c$ is the majority class);
* Otherwise (if we can't generate a leaf):
> 1. Choose a test based on a single attribute with two or more outcomes;
  1. Make this test the root of a tree with one branch for each of the outcomes of the test;
  1. Partition $\varepsilon$ into subsets corresponding to the outcomes and apply recursively the procedure to these subsets.

Open questions:
1. Which attribute should we test? *We need an indicator to identify the most interesting ones;*
1. Which kind of test? *Binary, multi-way, etc. (it also depends on the domains of the attributes);*
1. What does "$\varepsilon$ small enough" mean? *It is a hyperparameter: different thresholds can be used.*

**Contingency tables**: $k$-dimensional generalization of histograms (frequency counter).
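
As a small illustration, a contingency table can be built with a plain frequency counter. The records below are invented, and using `collections.Counter` is just one convenient choice.

```python
from collections import Counter

# Hypothetical records: (outlook, play) pairs, invented for illustration.
records = [
    ("sunny", "no"), ("sunny", "no"), ("sunny", "yes"),
    ("rainy", "yes"), ("rainy", "yes"), ("rainy", "no"),
    ("overcast", "yes"), ("overcast", "yes"),
]

# 1-dimensional table: an ordinary histogram of the class.
class_histogram = Counter(play for _, play in records)

# 2-dimensional contingency table: one count per (outlook, play) combination.
contingency = Counter(records)
```

Combinations that never occur, such as `("overcast", "no")`, simply have count zero.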

Our goal is to design an algorithm that finds interesting patterns for forecasting and distinguishes real patterns from illusions (i.e. chooses the useful patterns).
There are several methods to evaluate how interesting a pattern is; one of them is the **[entropy](https://it.wikipedia.org/wiki/Entropia_(teoria_dell%27informazione))**, based on information theory.

## Entropy and Information Gain
Given a source $X$ with $V$ possible values and probability distribution $$P(v_1)=p_1, P(v_2)=p_2, \dots, P(v_V)=p_V$$ the best coding allows transmission with an average number of bits per symbol given by $$H(X)= -\sum_{j} p_j \log_{2}(p_j)$$ where $H(X)$ is the entropy of the information source $X$.

High entropy means that the probabilities are mostly similar (flat histogram); low entropy means that some symbols have much higher probability (histogram with peaks).

**A higher number of allowed symbols allows a higher entropy**: the maximum, reached by the uniform distribution, is $\log_2 V$.

Example: in a binary source with symbol probabilities of $p$ and $(1-p)$, when $p$ is $0$ or $1$ the entropy goes to $0$, while it reaches its maximum of $1$ bit at $p=0.5$.
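
A minimal sketch of the entropy formula, useful to check the statements above numerically. The function name `entropy` and the example distributions are chosen for illustration; terms with $p_j = 0$ are skipped, following the usual convention that they contribute $0$.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; terms with p = 0 contribute 0 by convention."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

fair_coin = entropy([0.5, 0.5])      # flat histogram
biased_coin = entropy([0.9, 0.1])    # peaked histogram
certain = entropy([1.0, 0.0])        # no uncertainty at all
four_symbols = entropy([0.25] * 4)   # uniform over four symbols
```

The fair coin needs one bit per symbol, the biased coin needs less, and the uniform four-symbol source needs two bits.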

**Conditional specific entropy**: gives an insight into attribute $Y$ when the value of attribute $X$ is known: it is the entropy of $Y$ computed only on the individuals with $X=v$ (correlation, a sort of entropy filter) $$H(Y|X=v)$$

**Conditional entropy**: weighted average of the conditional specific entropy of $Y$ with respect to the values of $X$ $$H(Y|X) = \sum_{j} P(X=v_j) \cdot H(Y|X=v_j)$$ It's the average number of bits necessary to transmit the value of $Y$ if both ends know the value of $X$.

**Information gain**: the amount of insight that $X$ provides to forecast the values of $Y$ $$IG(Y|X)=H(Y)-H(Y|X)$$ It's the average number of bits that can be saved when transmitting the value of $Y$ if both ends know the value of $X$.

If $IG(Y|X)$ is low we can say that $X$ does not affect $Y$, so it won't be an interesting expansion.
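
The definitions of $H(Y|X)$ and $IG(Y|X)$ can be sketched on empirical frequencies as follows. The function names and the toy columns are assumptions made for illustration: `x` determines `y` exactly, while `z` carries no information about `y`.

```python
import math
from collections import Counter

def entropy_of(values):
    """Entropy (bits) of the empirical distribution of a list of values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(xs, ys):
    """H(Y|X): specific entropies H(Y|X=v) weighted by P(X=v)."""
    n = len(xs)
    total = 0.0
    for v, count in Counter(xs).items():
        ys_given_v = [y for x, y in zip(xs, ys) if x == v]
        total += (count / n) * entropy_of(ys_given_v)
    return total

def information_gain(xs, ys):
    """IG(Y|X) = H(Y) - H(Y|X)."""
    return entropy_of(ys) - conditional_entropy(xs, ys)

# Invented toy columns.
y = ["a", "a", "b", "b"]
x = ["u", "u", "w", "w"]   # knowing x removes all uncertainty on y
z = ["p", "q", "p", "q"]   # knowing z tells nothing about y

gain_x = information_gain(x, y)
gain_z = information_gain(z, y)
```

Here `gain_x` equals the full entropy of `y` (one bit), while `gain_z` is zero, so a split on `z` would not be an interesting expansion.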


## Back to decision trees
Calculate the information gain for each attribute.

**One-stump decision**: choose the attribute giving the highest $IG$, partition the dataset according to the chosen attribute, and choose the majority class as the label of each partition.

**Recursive step**: build a new tree starting from each subset where the minority is non-empty (no unanimity), using another attribute.

The generation stops when no further recursion is possible: there are no more attributes or the minority is empty.
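
The procedure can be sketched as a short recursive function. This is a simplified illustration for categorical attributes only, not the full C4.5 algorithm: the dictionary-based tree representation, the function names and the toy data are all assumptions, and there is no pruning or handling of numeric attributes.

```python
import math
from collections import Counter

def entropy_of(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    n = len(rows)
    remainder = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [y for row, y in zip(rows, labels) if row[attribute] == value]
        remainder += len(subset) / n * entropy_of(subset)
    return entropy_of(labels) - remainder

def grow_tree(rows, labels, attributes):
    majority = Counter(labels).most_common(1)[0][0]
    # Stop: unanimous subset (empty minority) or no attributes left.
    if len(set(labels)) == 1 or not attributes:
        return majority
    # One-stump decision: the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        # Recursive step on the subset selected by this outcome.
        branches[value] = grow_tree([rows[i] for i in idx],
                                    [labels[i] for i in idx], remaining)
    return {"attribute": best, "branches": branches, "default": majority}

def predict(tree, row):
    # Walk from the root to a leaf; unseen values fall back to the majority label.
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["attribute"]], tree["default"])
    return tree

# Invented toy dataset.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "play", "stay"]
tree = grow_tree(rows, labels, ["outlook", "windy"])
```

Since every row of the toy dataset is distinct, the fully grown tree reproduces all the training labels; an attribute value never seen during training is resolved with the majority label stored in the node.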


## Errors and overfitting
The supervised data is partitioned in two sets:
* Training set, used to generate the model;
* Test set, used to compute the test set error with the generated model.

**Training set error**: the discordances between the true labels of the training set and the ones forecasted by the decision tree on the training set.
It can be non-zero due to the limits of decision trees in general.
It's the error we make on the data used to generate the classification model.
It's a lower limit of the error we can expect when classifying new data, whereas we're interested in an upper bound.

**Test set error**: indicative of the expected behaviour with new data. Additional statistical reasoning can be used to infer error bounds.
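
One possible form of such statistical reasoning, shown here only as an assumption since the notes do not name a specific method, is the normal approximation to the binomial distribution: the test error is treated as an observed success rate over `n` independent trials. The function name `error_upper_bound` and the numbers are illustrative.

```python
import math

def error_upper_bound(errors, n, z=1.96):
    """Approximate upper end of a two-sided 95% interval for the true error rate.

    Uses the normal approximation to the binomial, reasonable for large n;
    z = 1.96 is the standard normal quantile for a 95% two-sided interval.
    """
    e = errors / n
    return e + z * math.sqrt(e * (1 - e) / n)

# Same observed error rate (10%), different test set sizes.
small_test_set = error_upper_bound(10, 100)
large_test_set = error_upper_bound(100, 1000)
```

With the same observed error, a larger test set gives a tighter bound.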

**Overfitting**: happens when the learning is affected by noise: the model also fits the noise in the training set, so the performance on the test set is much worse than the one on the training set.

More formally, a decision tree is a hypothesis of the relationship between the predictor attributes and the class. The hypothesis $h$ overfits the training set (it also learns the noise) if there is an alternative hypothesis $h'$ such that:
$$error_{train}(h)<error_{train}(h')$$
$$error_{\varepsilon}(h)>error_{\varepsilon}(h')$$ 

### Causes of overfitting
1. Presence of noise: individuals can have bad values in the predicting attributes and/or in the class label;
1. Lack of representative instances: some situations of the real world can be under-represented, or not represented at all, in the training set.

A good hypothesis has a low generalization error.

**Ockham's razor**
> "Everything should be made as simple as possible, but not simpler"

* All other things being equal, simple theories are preferable to complex ones;
* A long hypothesis that fits the data is more likely to be a coincidence;
* **Pruning** a decision tree is a way to simplify it.

## Pruning
In decision trees, the higher levels (near the root) mostly capture information (truth), while the lower levels mostly capture noise.
* **Pre-pruning**: early stop of the tree growth, before it perfectly classifies the training set;
* **Post-pruning**: build a complete tree, then prune some portions according to some criteria. Usually preferred since it's not easy to estimate when to stop the growing of the tree.

### Post-pruning criteria
* **Validation set**: use a distinct supervised dataset to evaluate the effect of post-pruning nodes from the tree;
* **Statistical pruning**: statistical test to estimate if pruning a particular node is likely to produce an improvement;
* **Minimum description length principle**: use an explicit measure of complexity for encoding the training set and the decision tree
$$\min_{\text{tree}} \left[\text{size}(\text{tree}) + \text{size}(\text{misclassifications}(\text{tree}, \text{training set}))\right]$$
![](https://i.ibb.co/7JJ1qWk/photo-2020-12-28-15-58-37.jpg)

### Validation set
The supervised data is partitioned in three independent sets:
* **Training set**: basis to build the model;
* **Validation set**: the model is tuned (pruned) to minimize the error on this set;
* **Test set**: assess the final expected error.
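
A minimal sketch of post-pruning driven by a validation set, in the style usually called reduced-error pruning. Everything here is an assumption made for illustration: the dictionary-based tree (with a `default` majority label stored in each node), the hand-built example tree, the validation rows, and the criterion "replace a subtree with a leaf when the validation error does not increase".

```python
def predict(tree, row):
    # Leaves are plain labels; inner nodes are dicts.
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["attribute"]], tree["default"])
    return tree

def errors(tree, rows, labels):
    return sum(predict(tree, r) != y for r, y in zip(rows, labels))

def prune(tree, rows, labels):
    """Bottom-up pruning (modifies the tree in place and returns it).

    A subtree is replaced by its majority label when this does not increase
    the error on the validation rows that reach it.
    """
    if not isinstance(tree, dict):
        return tree
    for value, subtree in list(tree["branches"].items()):
        reach = [(r, y) for r, y in zip(rows, labels)
                 if r[tree["attribute"]] == value]
        tree["branches"][value] = prune(subtree,
                                        [r for r, _ in reach],
                                        [y for _, y in reach])
    leaf = tree["default"]
    if errors(leaf, rows, labels) <= errors(tree, rows, labels):
        return leaf
    return tree

# Hand-built tree, as if grown on a (noisy) training set.
tree = {"attribute": "outlook", "default": "play", "branches": {
    "sunny": "stay",
    "rainy": {"attribute": "windy", "default": "play",
              "branches": {"yes": "stay", "no": "play"}},
}}

# Invented validation set, independent of the training data.
val_rows = [
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "sunny", "windy": "no"},
]
val_labels = ["play", "play", "stay"]

pruned = prune(tree, val_rows, val_labels)
```

In this example the test on `windy` does not help on the validation set, so the `rainy` subtree collapses into the leaf `play`, while the root test on `outlook` is kept. The final expected error should still be assessed on the separate test set.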

### Statistical pruning
It comes from inferential statistics and relies on significance testing.
* **Error estimation**: does the pruning reduce the maximum error expected?
* **Significance testing**: is the contribution of a node compatible with a random effect?

It's the most widely used criterion in practice.

### Minimum description length (MDL)
The learning process produces a theory on a set of data.
The theory is used to predict values for new data and can make errors.
Both the theory and its errors (the exceptions to the theory) can be encoded. According to the MDL principle, a theory with a shorter description is preferable.



## Decision tree learning algorithm
The base case uses a greedy strategy: at each step it chooses a local optimum and never goes back to try a different sequence of choices.
The choices to be made are:
1. **Specification of the test on the chosen attribute**: it depends on the value domain of the attribute:
> * If the domain of the attribute is discrete with $V$ values the split generates $V$ nodes; 
  * If the domain is continuous with $V$ distinct values, a split into $V$ nodes is infeasible, since the high number of branches would generate very small subsets and the significance would decrease rapidly.
  * **Discretization**: the continuous domain is converted into a set of discrete values, according to some discretization technique;
  * **Binarization**: extreme case of discretization. The domain is split in two with a threshold.
  * The threshold-based split may raise computational issues.

1. **Choice of the attribute to split the dataset**: we're looking for the split generating the maximum purity.
We need a measure of purity (information gain) or of impurity (Gini index, misclassification error):
> * A node with two classes in the same proportion has low purity;
  * A node with only one class has the highest purity.
3. **When to stop splitting** (pruning).
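
The threshold-based binarization of a numeric attribute can be sketched as follows. This is a naive version written for clarity: it recomputes the entropies for every candidate threshold, whereas an efficient implementation would sort once and update the class counts incrementally. The function names, the choice of midpoints as candidate thresholds and the toy data are assumptions.

```python
import math
from collections import Counter

def entropy_of(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, information gain) of the best binary split value <= t."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy_of(labels)
    best_t, best_gain = None, -1.0
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no boundary between equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = (base
                - (len(left) / n) * entropy_of(left)
                - (len(right) / n) * entropy_of(right))
        if gain > best_gain:
            # Candidate threshold: midpoint between two consecutive values.
            best_t, best_gain = (lo + hi) / 2, gain
    return best_t, best_gain

# Invented numeric attribute and class labels.
temperatures = [64, 65, 68, 69, 70, 71]
play = ["yes", "yes", "yes", "no", "no", "no"]
threshold, gain = best_threshold(temperatures, play)
```

On this toy data the two classes are perfectly separated by the threshold between 68 and 69, so the split recovers the whole entropy of the class.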

### Impurity functions
**Gini index**: consider a node $p$ with $C_p$ classes.
What is the frequency of wrong classifications in class $j$ given by a random assignment based only on the class frequencies in the current node?
For class $j$:
* Frequency $f_{p,j}$;
* Frequency of the other classes $1-f_{p,j}$;
* Probability of a wrong assignment $f_{p,j} \cdot (1-f_{p,j})$.

The Gini index is the total probability of wrong classification: $$\sum_j f_{p,j} \cdot (1-f_{p,j}) = \sum_j f_{p,j} - \sum_j f^2_{p,j} = 1 - \sum_j f^2_{p,j}$$

The range is from $0$ to $1 - \frac{1}{C_p}$.
The minimum value is reached when all records belong to the same class; the maximum when the records are uniformly distributed over all the classes.

We choose the split giving the maximum reduction of the Gini index (the purity increases the most).
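
A minimal sketch of the Gini index and of the reduction obtained by a split. The function names are illustrative; weighting each child by its share of the parent's records is the usual convention and is assumed here.

```python
from collections import Counter

def gini(labels):
    """Gini index of a node: 1 minus the sum of squared class frequencies."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_reduction(parent, children):
    """Gini of the parent minus the size-weighted Gini of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * gini(child) for child in children)
    return gini(parent) - weighted

pure = gini(["a", "a", "a", "a"])            # all records in one class
balanced = gini(["a", "a", "b", "b"])        # two classes, uniform
three_uniform = gini(["a", "b", "c"])        # three classes, uniform
reduction = gini_reduction(["a", "a", "b", "b"], [["a", "a"], ["b", "b"]])
```

The uniform cases reach the upper end of the range, $1 - \frac{1}{C_p}$, and a split into two pure children removes all the impurity of the parent.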

**Misclassification error**: if a node is a leaf, we find the highest label frequency.
This frequency is the accuracy of the node, and the corresponding label is the output of the node.
The misclassification error is the complement to one of the accuracy:
$$\text{ME}(p)=1-\max_{j} f_{p,j}$$
The range is from $0$ to $1 - \frac{1}{C_p}$.
The minimum value is reached when all records belong to the same class; the maximum when the records are uniformly distributed over all the classes.

The choice of the split is done in the same way as for the Gini index.

![](https://i.ibb.co/VTtnKm0/photo-2020-12-28-16-54-47.jpg)

* The behaviour of ME is linear, therefore an error in the frequency is completely transferred into the impurity computation;
* Entropy and Gini have a varying derivative, with its minimum (in absolute value) around the center: they're more robust to errors when the frequencies of the two classes are similar.
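
The three measures can be compared numerically for a node with two classes, where `f` is the frequency of one class. The function names and the sampled frequencies are chosen for illustration.

```python
import math

def entropy2(f):
    """Entropy (bits) of a two-class node where one class has frequency f."""
    return -sum(p * math.log2(p) for p in (f, 1 - f) if p > 0)

def gini2(f):
    """Gini index of a two-class node."""
    return 1 - f ** 2 - (1 - f) ** 2

def me2(f):
    """Misclassification error of a two-class node."""
    return 1 - max(f, 1 - f)

# One row per frequency: (f, entropy, Gini, ME).
table = [(f, entropy2(f), gini2(f), me2(f)) for f in (0.0, 0.1, 0.25, 0.5)]
for f, h, g, me in table:
    print(f"f={f:.2f}  entropy={h:.3f}  gini={g:.3f}  ME={me:.3f}")
```

All three are zero for a pure node and maximal at `f = 0.5`; ME grows linearly with `f` up to the center, while entropy and Gini flatten as they approach it.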


## Characteristics of DT induction
There are several variants depending on the construction strategy, the partition strategy and the pruning strategy, including tests based on linear combinations of numeric attributes.
* Non-parametric approach to building a classifier: it doesn't require any assumption on the distributions of classes and attributes;
* Finding the best tree is NP-complete; heuristic algorithms find sub-optimal solutions in reasonable time;
* The run-time use of the decision tree is extremely efficient: it costs $\mathcal{O}(h)$, where $h$ is the height of the tree;
* Robust to noise, if overfitting is managed;
* Redundant attributes are not a problem: in case of strong correlation between two attributes, if one is chosen for a split the other will never provide a good increment of purity and won't be chosen;
* Nodes at high depth are easily irrelevant, since they cover a low number of samples.


### Complexity
* $N$ instances (data-points) and $D$ attributes in $\varepsilon$. Tree height is $\mathcal{O}(\log N)$;
* Each level of the tree requires the consideration of the whole dataset;
* Each node requires the consideration of all the attributes;
* Binary split of numeric attributes costs $\mathcal{O}(N \log N)$, but doesn't increase the complexity;
* Pruning requires the consideration of each node, but there are at most $2N-1$ of them;
* Pruning requires considering all the instances globally at each level and costs $\mathcal{O}(N \log N)$, which, once again, doesn't increase the complexity.

Overall cost is $\mathcal{O}(D N \log N)$.
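
As a rough accounting of where this bound comes from (a sketch that assumes the tree height really is $\mathcal{O}(\log N)$ and that each numeric attribute is sorted only once):
$$\underbrace{\mathcal{O}(\log N)}_{\text{levels}} \cdot \underbrace{\mathcal{O}(N)}_{\text{instances per level}} \cdot \underbrace{\mathcal{O}(D)}_{\text{attributes}} = \mathcal{O}(D N \log N)$$
Under these assumptions, sorting the numeric attributes adds at most $\mathcal{O}(D N \log N)$ and pruning adds $\mathcal{O}(N \log N)$, so neither term changes the overall order.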