# **Classification - Part III**

## Naive Bayes classifier
It's a statistics-based classifier (built on Bayes' theorem) that considers the contribution of all the attributes.
It assumes that each attribute is independent of the others given the class: it's a very strong assumption (that's why it's called naive) that is almost never satisfied, but the method nevertheless works well in practice.

Probabilities are estimated with frequencies.

### Bayes' theorem
Given a hypothesis $H$ and an evidence $E$ that bears on that hypothesis, it holds: $$P(H|E) = \frac{P(E|H) \cdot P(H)}{P(E)}$$
The hypothesis is the class $c$, the evidence is the tuple of values $E_1, \dots, E_D$ of the element to be classified. Under the independence assumption, $P(E|c)$ factorizes into the product of the per-attribute probabilities: $$P(c|E) = \frac{P(E_1|c) \cdots P(E_D|c) \cdot P(c)}{P(E)}$$


### The Naive Bayes method
1. Compute the conditional probabilities from examples;
1. Apply the theorem.

The denominator is the same for all the classes and it's eliminated by the normalization step.
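The two steps above can be sketched in a few lines of Python. This is a minimal illustration for categorical attributes only; the function names and the toy dataset are hypothetical and not part of the original notes.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(attribute index, class)][value] -> absolute frequency
    value_counts = defaultdict(Counter)
    for row, c in zip(rows, labels):
        for d, v in enumerate(row):
            value_counts[(d, c)][v] += 1
    return class_counts, value_counts

def predict_proba(evidence, class_counts, value_counts):
    """Return the normalized P(c | E) for every class c."""
    n = sum(class_counts.values())
    scores = {}
    for c, count_c in class_counts.items():
        score = count_c / n  # prior P(c), estimated with a frequency
        for d, v in enumerate(evidence):
            score *= value_counts[(d, c)][v] / count_c  # P(E_d | c)
        scores[c] = score
    total = sum(scores.values())
    if total == 0:
        return scores  # every class got a zero likelihood
    # The normalization replaces the division by P(E)
    return {c: s / total for c, s in scores.items()}

# Toy dataset (made up): attributes are (outlook, temperature)
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "hot"), ("sunny", "hot")]
labels = ["no", "no", "yes", "yes", "yes"]

class_counts, value_counts = train_naive_bayes(rows, labels)
probs = predict_proba(("sunny", "hot"), class_counts, value_counts)
print(probs)  # "no" gets about 0.6, "yes" about 0.4
```

Note that $P(E)$ is never computed: the unnormalized scores are simply divided by their sum.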

**Problem**: what happens if the value $v$ of the attribute $d$ never appears in the elements of class $c$? Then $P(d=v|c)=0$, and the probability of the class for that evidence drops to zero, whatever the other attributes say.

Unfortunately this is quite common, especially in domains with many attributes and many distinct values.

### Laplace smoothing
Let's start by ignoring the details of the dataset: we consider only the value domains, and we know that a given attribute $d$ has $V_d$ distinct values.
Then a simple guess for the frequency of each distinct value of $d$ in each class is $\frac{1}{V_d}$.

In this way we consider only the a priori probabilities.
We can smooth the computation of the a posteriori probabilities of the values inside a class by balancing them with the a priori probability.

Let:
* $\alpha$ be the smoothing parameter (typically $1$; it sets the tradeoff between a priori and a posteriori information);
* $af_{d=v_i,c}$ be the absolute frequency of value $v_i$ of attribute $d$ in class $c$;
* $af_{c}$ be the absolute frequency of class $c$ in the dataset;
* $V$ be the number of distinct values of attribute $d$ in the dataset (the $V_d$ above).

Then the smoothed frequency is:
$$sf_{d=v_i,c} = \frac{af_{d=v_i,c}+\alpha}{af_c + \alpha V}$$

With $\alpha=0$ we obtain the standard (unsmoothed) case; with higher values we give more importance to the a priori probabilities.
When the frequency of a value is zero for some class, the a priori probability still gives an approximate guess.
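As a quick numeric check of the smoothed frequency, here is a small sketch; the function name and the counts are made up for illustration.

```python
def smoothed_frequency(af_value_class, af_class, n_distinct_values, alpha=1.0):
    """Laplace-smoothed estimate of P(d = v_i | c)."""
    return (af_value_class + alpha) / (af_class + alpha * n_distinct_values)

# A value never seen in a class with 3 examples, for an attribute with 2 distinct values
unseen = smoothed_frequency(0, 3, 2)          # (0 + 1) / (3 + 2) = 0.2
seen = smoothed_frequency(3, 3, 2)            # (3 + 1) / (3 + 2) = 0.8
unsmoothed = smoothed_frequency(0, 3, 2, alpha=0.0)  # back to the plain frequency: 0.0
print(unseen, seen, unsmoothed)
```

The smoothed frequencies of all the values of an attribute still sum to one inside each class, so they remain a valid probability estimate.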

### Missing values
They don't affect the model, so there is no need to discard the instances that contain them:
* In a test instance the likelihood calculation simply omits the attribute: the likelihood will be higher for all the classes, but this is compensated by the normalization;
* In a training instance the record is simply not included in the frequency counts for that attribute, since the descriptive statistics are based on the number of values that occur, not on the number of instances (frequentist approach).

### Numeric values
The method based on frequencies is inapplicable, so we need an additional assumption: within each class the values follow a Gaussian distribution.
Instead of the fraction of counts we compute the mean $\mu$ and the standard deviation $\sigma$ of the values, for each numeric attribute inside each class.

Probability and probability density are closely related, but they are not the same thing:
* On a continuous domain, the probability that a variable assumes exactly a single real value is zero;
* The value of the density function at a point, multiplied by the width of a small interval around that point, approximates the probability that the variable lies in that interval.

The values we use are, of course, rounded to some precision; if that precision is the same for all the classes, then we can disregard it.
If numeric values are missing, mean and standard deviation are based only on the values that are present.
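A minimal sketch of how a numeric attribute could be handled, assuming the Gaussian model above. The helper names are hypothetical, missing values are represented with `None`, and the sample standard deviation is used (an assumption; the notes don't say which estimator is meant).

```python
import math
from statistics import mean, stdev

def fit_gaussian(values):
    """Estimate mean and standard deviation, ignoring missing values (None)."""
    present = [v for v in values if v is not None]
    return mean(present), stdev(present)

def gaussian_density(x, mu, sigma):
    """Value of the Gaussian probability density function at x."""
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (sigma * math.sqrt(2 * math.pi))

# Values of one numeric attribute inside one class (made-up numbers)
mu, sigma = fit_gaussian([1.0, None, 3.0])
likelihood = gaussian_density(2.5, mu, sigma)
print(mu, sigma, likelihood)
```

The density value replaces the frequency-based factor $P(E_d|c)$ in the product; it can be larger than one, which is harmless because of the final normalization.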

**Note**: there is a dramatic degradation when the simplifying assumptions aren't met:
* Violation of **independence**: correlated features are effectively counted more than once, so their weight is reinforced;
* Violation of **Gaussian distribution**: use the standard probability estimation for the appropriate distribution, if known, or use estimation procedures (e.g. kernel density estimation).

## Linear perceptron (artificial neuron)
It is a linear combination of weighted inputs.
For a dataset with numeric attributes it learns a hyperplane such that all the positives lie on one side and all the negatives on the other.

The hyperplane is described by a set of weights $w_0, \dots, w_D$ in a linear equation on the data attributes $e_0, \dots, e_D$.
The fictitious attribute $e_0=1$ (bias term) is added to allow a hyperplane that does not pass through the origin.

There are either none or infinitely many such hyperplanes; a hyperplane classifies an instance according to the sign of the linear combination:
$$w_0 \cdot e_0 + w_1 \cdot e_1 + \dots + w_D \cdot e_D > 0 \to \text{positive}$$
$$w_0 \cdot e_0 + w_1 \cdot e_1 + \dots + w_D \cdot e_D < 0 \to \text{negative}$$

![](https://i.ibb.co/grTpSGD/photo-2020-12-31-15-51-10.jpg)

Each change of weights moves the hyperplane towards the misclassified instance.
For example, after the weight change for a misclassified positive instance, its output becomes:
$$(w_0 + e_0) \cdot e_0 + (w_1 + e_1) \cdot e_1 + \dots + (w_D + e_D) \cdot e_D$$
The result is increased by the positive amount $e_0^2 + \dots + e_D^2$, therefore it will be less negative or possibly even positive.

The corrections are incremental and can interfere with previous updates.
The algorithm converges if the dataset is linearly separable; otherwise it terminates only if an upper bound on the number of iterations is fixed.
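The update rule described above can be sketched as follows. This is a hedged illustration, not necessarily the exact algorithm of the figure: labels are assumed to be $+1$/$-1$, weights start at zero, and a misclassified negative instance is subtracted from the weights (the symmetric case of the positive one). The function names and the toy data are hypothetical.

```python
def train_perceptron(examples, labels, max_epochs=100):
    """examples: sequences of numeric attributes; labels: +1 or -1."""
    n_attributes = len(examples[0])
    weights = [0.0] * (n_attributes + 1)  # weights[0] is the bias weight w_0
    for _ in range(max_epochs):  # upper bound on the iterations
        errors = 0
        for x, label in zip(examples, labels):
            e = [1.0] + list(x)  # prepend the fictitious attribute e_0 = 1
            activation = sum(w * a for w, a in zip(weights, e))
            predicted = 1 if activation > 0 else -1
            if predicted != label:
                # add e for a misclassified positive, subtract it for a negative
                weights = [w + label * a for w, a in zip(weights, e)]
                errors += 1
        if errors == 0:  # every instance is on the right side: stop
            break
    return weights

def classify(weights, x):
    e = [1.0] + list(x)
    return 1 if sum(w * a for w, a in zip(weights, e)) > 0 else -1

# Linearly separable toy data: only (1, 1) is positive
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [-1, -1, -1, 1]
w = train_perceptron(points, targets)
print([classify(w, p) for p in points])  # matches targets
```

On data that are not linearly separable the loop would only stop because of `max_epochs`.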

## Support vector machines (SVM)
As we have seen, the linear perceptron is limited when datasets are not linearly separable. A solution could be to give up linearity, but this is not feasible: the method would become intractable for any reasonable number of variables (because of the large number of coefficients).
Moreover it would be extremely prone to overfitting.

**Idea**: optimization of the hyperplane (with kernels maybe) rather than greedy search.

### Maximum margin hyperplane
The linear perceptron accepts any hyperplane able to separate the classes of the training set.
Of course some hyperplanes are better than others for the classification of new items.
The **maximum margin hyperplane** gives the greatest separation between classes.

![](https://i.ibb.co/dfC0SFk/Cattura.png)

The *convex hull* of a set of points is the tightest enclosing convex polygon.
If the dataset is linearly separable, the convex hulls of the classes do not intersect and the maximum margin hyperplane is as far as possible from both hulls.

Only a subset of the points is sufficient to define the maximum margin hyperplane: the **support vectors** (which makes the method less expensive computationally).

Finding the support vectors and the maximum margin hyperplane belongs to the well-known class of *constrained quadratic optimization problems*:
$$\max_{w_0,w_1,\dots,w_D} M$$
subject to
$$\sum_{j=1}^D w_j^2=1$$
$$c^e(w_0 + w_1x_1^e + \dots + w_Dx_D^e) \geq M,\ \forall e=1,\dots,N$$
where the class $c^e$ of the example $e$ is either $+1$ or $-1$ and $M$ is the margin.

### Soft margin vs Hard margin
It's quite common that a separating hyperplane does not exist (the dataset is not linearly separable); in this case it's possible to:
* Find a hyperplane which almost separates the classes;
* Disregard examples which generate a very narrow margin.

In any case we have:
* A greater robustness to individual observations;
* A better classification of most of the training observations.

This is obtained by adding a hyperparameter to the optimization problem.
It's a regularization parameter, called **penalty parameter of the error term** $C$, that controls the amount of overfitting.

### Non-linear class boundaries and the Kernel trick
The SVM method limits the problem of overfitting, and the non-linearity of the boundaries can be overcome with a **non-linear mapping**: data are mapped into a new space (*feature space*), usually with more dimensions, where a separating hyperplane exists.
![](https://miro.medium.com/max/872/1*zWzeMGyCc7KvGD9X8lwlnQ.png)
Once the hyperplane is obtained it's possible to go back to the original *input space*.

Computing the separating hyperplane requires a series of dot products among the training data vectors.
The mapping is defined on the basis of a particular family of functions called **kernel functions**: the dot product in the feature space is obtained directly from the input vectors, so the mapping doesn't need to be computed explicitly and the computation is done in the input space.
This avoids an increase in complexity.
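A small numeric check may help to see why the explicit mapping is not needed. The example below uses a degree-2 polynomial kernel on 2-D points, chosen only for illustration (the notes don't prescribe a specific kernel); the function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def feature_map(x):
    """Explicit degree-2 mapping of a 2-D point into a 3-D feature space."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def polynomial_kernel(a, b):
    """The same dot product, computed directly in the input space."""
    return dot(a, b) ** 2

a, b = (1.0, 2.0), (3.0, 4.0)
explicit = dot(feature_map(a), feature_map(b))  # dot product in the feature space
implicit = polynomial_kernel(a, b)              # kernel evaluated in the input space
print(explicit, implicit)  # both are 121, up to floating-point rounding
```

The kernel evaluation costs one dot product in the input space, regardless of how many dimensions the feature space has.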

Kernels usually come with some hyperparameters to tune.
*Rule of thumb*: start with a simpler kernel and then try more complex ones if necessary.
![](https://i.ibb.co/Q6qpsCN/Cattura.png)


### SVM complexity
The time complexity is mainly influenced by the efficiency of the optimization library.
The popular `libSVM` library scales from $\mathcal{O}(DN^2)$ to $\mathcal{O}(DN^3)$, depending on the effectiveness of data caching in the library (the cost is lower for sparse data).

### SVM remarks
* Generally slower than simpler methods such as decision trees;
* Tuning is necessary;
* Very accurate results;
* Based on a theoretical model of learning;
* Not affected by local minima;
* Since no notion of distance is used, it doesn't suffer from the curse of dimensionality.

## Neural networks
A neural network consists of many perceptron-like elements arranged in a hierarchical structure, inspired by the complex interconnections of neurons in animal brains.

A neuron, the engine of reasoning, can be seen as a signal processor with a threshold.
The signal transmission from one neuron to another is weighted: the weights change over time, also due to learning.

The transmitted signals are modelled as real numbers and the threshold is modelled as a mathematical function that has to be:
* Continuous;
* Differentiable;
* Bounded.

Ideally, the derivative can be expressed in terms of the function itself, for simplicity.
Usually these functions are non-linear, to overcome the limit of linear decision boundaries and to reduce noise, since linear functions completely transfer the noise to the output.
The shape of the function influences the learning speed.

Some examples: sigmoid function (*squashing function*), arctangent.
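The sigmoid is a convenient example because its derivative can be written in terms of the function itself: $g'(z) = g(z)(1 - g(z))$. A minimal sketch (function names are illustrative):

```python
import math

def sigmoid(z):
    """Squashing function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    """Derivative expressed through the function itself."""
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(0.0), sigmoid_derivative(0.0))  # 0.5 0.25

# Compare with a numerical derivative at an arbitrary point
z, h = 0.7, 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
print(abs(numeric - sigmoid_derivative(z)) < 1e-8)
```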

### Neural networks training
* Inputs feed an **input layer** (one input node for each dimension in the training set);
* The input layer feeds, through weights, a **hidden layer**;
* The hidden layer feeds, through weights, an **output layer**.

The number of nodes in the hidden layer is a parameter of the network, while the number of nodes in the output layer is related to the number of different classes in the domain (different encodings can be used).

*Example*:

![](https://i.ibb.co/y5qJ0BV/photo-2021-01-01-11-27-37.jpg)
* The function $g(\cdot)$ is the **transfer function** of the node;
* The unitary input $e_0$ is added for dealing with the bias as we did for the linear perceptron;
* Weights are for each edge connecting two nodes.

**Feed-forward** defines which oriented edges are present:
* Edges connect only a node in a layer to a node in the following layer;
* Each node of one layer is connected to all nodes of the following layer.

In this way the signal flows from input to output without loops.
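A forward pass through a fully connected feed-forward network with one hidden layer could look like the sketch below. It assumes the sigmoid as transfer function $g(\cdot)$ and stores the weights as one row per node, with the first weight of each row multiplying the unitary input; these layout choices and the function names are assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weight_rows):
    """weight_rows[j] holds the weights of node j; weight_rows[j][0] is the bias weight."""
    extended = [1.0] + list(inputs)  # unitary input for the bias
    return [sigmoid(sum(w * a for w, a in zip(row, extended)))
            for row in weight_rows]

def feed_forward(inputs, hidden_weights, output_weights):
    """The signal flows input -> hidden -> output, without loops."""
    hidden = layer_forward(inputs, hidden_weights)
    return layer_forward(hidden, output_weights)

# 2 inputs, 2 hidden nodes, 1 output node; all weights set to zero for the demo
hidden_weights = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
output_weights = [[0.0, 0.0, 0.0]]
print(feed_forward([0.3, -1.2], hidden_weights, output_weights))  # [0.5]
```

With all the weights at zero every node outputs $g(0) = 0.5$, whatever the input; training is what makes the outputs depend on the data.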

![](https://i.ibb.co/6FpJB1Y/photo-2021-01-01-11-36-04.jpg)
Note: there are two loops, since the changes in the weights change the behaviour of the network.

In analogy with learning in the animal domain, the examples must be repeatedly fed to the network.
The weights encode the knowledge given by the supervised examples.
This encoding isn't easily understandable (*sub-symbolic approach*): it looks like a structured set of real numbers.

Convergence is **not** guaranteed.

Important issues:
* Computing the weight corrections;
* Preparation of the training examples (standardization to have zero mean and unit variance);
* Termination condition (stop when the network is about to overfit).


### Error computation
Let:
* $x$ be the input vector;
* $y$ be the desired output of a node;
* $w$ be the input weight vector of a node.

Then: 
$$E(w)=\frac{1}{2}(y-\text{Transfer}(w,x))^2$$
is the **quadratic error function**.
It should be convex, since optima are then easier (and faster) to find.
![](https://www.domsoria.com/wp-content/uploads/2020/03/convex_cost_function.jpg) 

### Gradient computation
Objective: move towards a (local) minimum of the error function by following the gradient (gradient descent), i.e. by computing the partial derivatives of the error as a function of the weights.

The weight is changed by subtracting the partial derivative multiplied by a **learning rate** constant $\lambda$ (hyperparameter): it influences the convergence speed and represents a tradeoff between speed and precision.
$$w_{ij} = w_{ij} - \lambda \frac{\partial E(w)}{\partial w_{ij}}$$
The subtraction moves towards smaller errors (descent).
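For a single node with a sigmoid transfer function (an assumption made for this example) and the quadratic error, the chain rule gives $\frac{\partial E}{\partial w_i} = -(y - g(w \cdot x)) \, g(w \cdot x)(1 - g(w \cdot x)) \, x_i$. One update step can then be sketched as follows; the function names and numbers are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def node_output(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def quadratic_error(w, x, y):
    return 0.5 * (y - node_output(w, x)) ** 2

def gradient_step(w, x, y, learning_rate):
    """Subtract the partial derivatives, scaled by the learning rate."""
    out = node_output(w, x)
    delta = -(y - out) * out * (1.0 - out)  # common factor of dE/dw_i
    return [wi - learning_rate * delta * xi for wi, xi in zip(w, x)]

w_old = [0.0, 0.0]
x, y = [1.0, 2.0], 1.0
w_new = gradient_step(w_old, x, y, learning_rate=0.5)
print(w_new)  # [0.0625, 0.125]
print(quadratic_error(w_old, x, y), quadratic_error(w_new, x, y))  # the error decreases
```

In a multi-layer network the same idea is applied to every weight $w_{ij}$, with the derivatives propagated backwards from the output layer.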

![](https://i.ibb.co/NxVP2mF/photo-2021-01-01-11-52-32.jpg)

### Learning modes
There are several methods for learning in neural networks:
* **Stochastic**: each forward propagation is immediately followed by a weight update. It introduces some noise in the gradient descent process, since the gradient is computed from a single data point. It reduces the chance of getting stuck in a local minimum. Good for online learning.
* **Batch**: many propagations occur before updating the weights, accumulating errors over the samples within a batch. The descent is generally faster and more stable, since the update is driven by the direction of the average error.

As for all gradient-based methods, local minima rather than global ones are usually found.
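The difference between the two modes is only in *when* the weights are updated. The sketch below shows it on a deliberately tiny model, a single weight with output $w \cdot x$ and quadratic error, which is an assumption made to keep the example short; it is not a full network.

```python
def gradient(w, x, y):
    """Derivative of 0.5 * (y - w * x) ** 2 with respect to w."""
    return -(y - w * x) * x

def stochastic_epoch(w, samples, learning_rate):
    """Update the weight right after every single sample."""
    for x, y in samples:
        w -= learning_rate * gradient(w, x, y)
    return w

def batch_epoch(w, samples, learning_rate):
    """Accumulate the gradients over the batch, then update once."""
    average = sum(gradient(w, x, y) for x, y in samples) / len(samples)
    return w - learning_rate * average

samples = [(1.0, 2.0), (2.0, 4.0)]  # generated by y = 2 * x
w_stochastic = stochastic_epoch(0.0, samples, learning_rate=0.1)
w_batch = batch_epoch(0.0, samples, learning_rate=0.1)
print(w_stochastic, w_batch)  # about 0.92 and 0.5: both move towards 2
```

After one epoch the two modes end up in different places, but on this convex toy problem repeating the epochs brings both to $w = 2$.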

### Repetitions and stop criteria
A learning round over all the samples is called an **epoch**.
Generally, after each epoch the classification capability of the network improves.
Several epochs will be necessary, and each epoch starts from the weights produced by the previous one.

The learning rate can be changed in different epochs:
* In the beginning a higher learning rate can push faster towards the desired direction;
* In later epochs a lower learning rate can push more precisely towards a minimum.

The learning process can stop when:
* Weight updates are small;
* Classification error rate goes below the chosen target;
* A timeout or out-of-memory condition is reached.

### Regularization
Overfitting is possible if the network is too complex with respect to the complexity of the problem.

Some regularization techniques can be used to improve the generalization capacity of the model, usually by modifying the performance (error) function.
In essence, generalization is improved by adding a penalty term to the loss function in order to smooth the fit to the data (the amount of regularization has to be tuned).

## Instance-based learning

### $K$-nearest neighbours classifier
The method consists in:
* Keeping all the training data (the model is the entire dataset);
* Making predictions by computing the similarity between the new sample and each training instance;
* Picking the $K$ entries in the database which are closest to the new data point and taking a majority vote.

*Parameters*: the number of neighbours to check and the metric used to compute the distances (the Mahalanobis distance usually performs well).
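A minimal sketch of the method, using the Euclidean distance for simplicity (a swapped-in choice: the Mahalanobis distance mentioned above would also need the covariance of the data). The function name and the toy points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Majority vote among the k training points closest to the query."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# The "model" is just the stored training set
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (0.5, 0.5)))  # a
print(knn_predict(train_points, train_labels, (5.5, 5.5)))  # b
```

There is no training phase: all the cost is paid at prediction time, when the distance to every stored instance is computed.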