# Review of Introductory Shallow Models in Machine Learning

These descriptions are also available throughout the text of the [Exoplanets Classification Exercise](https://github.com/brunoaugustoam/MLExercises/blob/main/ML_Application_to_Classify_Exoplanets.ipynb)

# Naive Bayes
[Documentation](https://scikit-learn.org/stable/modules/naive_bayes.html)



Naive Bayes is a classifier that chooses the class based on the conditional probability of the observed attributes. It assumes that the attributes are independent of each other given the class (no correlation between them), which is why it is called "naive". The term "Bayes" comes from Bayes' Theorem, on which the algorithm is based. It is a simple algorithm that generally performs well, though it is not optimal for many applications.

The conditional probability of event A occurring is the probability of A occurring, given that another event B has already occurred.

This probability is written in the form:

$ p(A|B) $

Bayes' Theorem for n attributes can be written in its expanded form:

$P(y|x_1, x_2, ..., x_n) = \frac{P(y) \cdot P(x_1, x_2, ..., x_n|y)}{P(x_1, x_2, ..., x_n)}$

Because the algorithm assumes that the attributes are independent of each other given the class, the probability of $x_i$ given y does not depend on the values of the other attributes, e.g. $x_{i+1}$, so the joint likelihood factorizes into a product of per-attribute terms. And, since the inputs $(x_1, x_2, ..., x_n)$ are fixed for a given example, the denominator is a constant and the theorem reduces to:

$P(y|x_1, x_2, ..., x_n) \propto P(y) \cdot \prod_{i=1}^n P(x_i|y)$

In this way, the Naive Bayes algorithm selects the class y for which the probability of y given the values of X is maximized, that is:

$\hat{y} = \arg\max_{y} P(y) \cdot \prod_{i=1}^n P(x_i|y)$
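The decision rule above can be sketched in a few lines of Python. This is only an illustration: the class names, attribute values, and every probability in the tables are made-up assumptions, not values from the exoplanet exercise.

```python
from math import prod

# Hypothetical toy model: two classes and two categorical attributes.
# All probabilities are made-up illustration values.
priors = {"planet": 0.3, "not_planet": 0.7}
likelihoods = {
    "planet": [{"short": 0.8, "long": 0.2}, {"deep": 0.6, "shallow": 0.4}],
    "not_planet": [{"short": 0.4, "long": 0.6}, {"deep": 0.1, "shallow": 0.9}],
}

def score(y, x):
    """Unnormalized posterior: P(y) * prod_i P(x_i | y)."""
    return priors[y] * prod(likelihoods[y][i][xi] for i, xi in enumerate(x))

def predict(x):
    """Return the class with the largest unnormalized posterior."""
    return max(priors, key=lambda y: score(y, x))

example = ("short", "deep")
predicted = predict(example)
```

For the example `("short", "deep")`, the scores are 0.3 · 0.8 · 0.6 = 0.144 for `planet` and 0.7 · 0.4 · 0.1 = 0.028 for `not_planet`, so `planet` is selected. The denominator of Bayes' Theorem is never computed because it is the same for both classes.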

In the exercise's dataset, the attributes are continuous rather than discrete. Therefore, it is appropriate to use the "Gaussian Naive Bayes" algorithm, which is already implemented in the scikit-learn library.

For this algorithm, $P(x_i|y)$ is modeled with a normal (Gaussian) distribution, in the form:

$P(x_i|y) = \frac{1}{\sqrt{2\pi\sigma^2_{y}}} \exp\left(-\frac{(x_i - \mu_{y})^2}{2\sigma^2_{y}}\right)$

In this context, the mean $\mu_{y}$ and standard deviation $\sigma_{y}$ are given by the mean and standard deviation of $x_i$ over the observations that belong to class y.
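As a rough illustration of the Gaussian variant, the sketch below estimates a mean and variance per class and per attribute, then classifies by the largest log-posterior. It is a minimal pure-Python version written for this review, not scikit-learn's implementation; the function names, the toy data, and the small `1e-9` variance floor are all assumptions. Logarithms are summed instead of multiplying probabilities, which gives the same argmax while avoiding numerical underflow.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate prior, per-attribute means and variances for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # The 1e-9 floor is an assumption to avoid division by zero.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model

def log_gaussian(x, mean, var):
    """Log of the normal density with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict(model, x):
    """Return the class with the largest log P(y) + sum_i log P(x_i | y)."""
    def log_posterior(label):
        prior, means, variances = model[label]
        return math.log(prior) + sum(
            log_gaussian(xi, m, v) for xi, m, v in zip(x, means, variances)
        )
    return max(model, key=log_posterior)

# Made-up data: two well-separated classes with two continuous attributes.
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
```

Calling `predict(model, [1.1, 2.1])` returns class `0` and `predict(model, [5.1, 5.9])` returns class `1` on this toy data.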

Additional notes:

* The term "naive" in Naive Bayes refers to the assumption that the attributes are independent of each other. This assumption is often violated in real-world data, but it can still be a good approximation in many cases.
* Gaussian Naive Bayes is a specific type of Naive Bayes algorithm that assumes that the attributes are normally distributed. This assumption is often reasonable for continuous attributes.


___________________________________________________________________________________________________

# Decision Tree


Decision trees are supervised machine learning algorithms that progressively reduce the dataset into smaller groups based on some attribute, ideally, until it is possible to classify these sets by labels.

They are used in both classification and regression problems. They are called classification trees when the data of the dependent variables are categorical (discrete, qualitative) and regression trees when the dependent variables are continuous (numerical, quantitative).

Trees are composed of nodes, branches, and leaves. Each node tests an attribute. The branches leaving a node separate the examples based on a threshold or category of that attribute. The leaves are the outputs of the algorithm, for example the predicted class. No further nodes are generated from a leaf.

At each node, the data are split so as to increase homogeneity, that is, to generate groups of data with fewer and fewer mixed classes. The complexity or capacity of the model increases as the depth of the tree increases. In the learning process, it is important to strike a balance: the tree needs enough attributes and nodes to classify the data (avoiding underfitting) without becoming so specific/complex that it loses generalization ability (overfitting).

A procedure used to avoid overfitting is called pruning. It is a technique that reduces the complexity of the model to improve performance on unseen data, by removing nodes that are fitting noise in the training data. There are two variants:

Pre-pruning: It terminates the growth of the tree prematurely.

Post-pruning: It allows the complete growth of the tree, then removes leaves if that removal results in an increase in performance.

For this exercise, since it is a classification task, an algorithm based on the concept of information gain will be used to build the tree. Information gain is the decrease in expected entropy obtained by splitting on a given attribute. A commonly used algorithm based on information gain is ID3 (Iterative Dichotomiser 3), which computes the following quantities:

Prior entropy of the dataset

$Info(D) = - \sum_{i=1}^m p_i \log_{2} p_i$

Expected entropy after splitting on a given attribute A

$Info_{A}(D) = \sum_{j=1}^v \frac{|D_{j}|}{|D|} \cdot Info(D_{j})$

Information gain of the attribute

$InformationGain(A) = Info(D) - Info_{A}(D)$

Where Info(D) is the average information needed to identify the class of an example in D, and $p_i$ is the proportion of examples in D that belong to class i (out of m classes).

$\frac{|D_{j}|}{|D|}$ serves as a weight for the j-th partition.

$Info_{A}(D)$ is the expected information still needed to identify the class in D after using A to split at a node.
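The entropy and information gain definitions can be checked numerically with a short sketch. The function names and the toy attribute values are assumptions made for illustration; `values` holds one categorical attribute and `labels` the class of each example.

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D): entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Info(D) minus the weighted entropy of the partitions induced by the attribute."""
    n = len(labels)
    partitions = {}
    for v, label in zip(values, labels):
        partitions.setdefault(v, []).append(label)
    expected = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - expected

labels = ["yes", "yes", "no", "no"]
informative = ["a", "a", "b", "b"]    # separates the classes perfectly
uninformative = ["a", "b", "a", "b"]  # each partition is still a 50/50 mix
```

Here `entropy(labels)` is 1 bit, the informative attribute has an information gain of 1 bit (both partitions are pure), and the uninformative attribute has a gain of 0.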

The C4.5 algorithm is an evolution of ID3. It normalizes the information gain by the split information, given by:

$SplitInfo_{A}(D) = - \sum_{j=1}^v \frac{|D_{j}|}{|D|} \cdot \log_{2}\left(\frac{|D_{j}|}{|D|}\right)$

where $\frac{|D_{j}|}{|D|}$ serves as a weight for the j-th partition

v is the number of partitions produced by splitting on attribute A

The gain ratio is given by

$GainRatio(A) = \frac{InformationGain(A)}{SplitInfo_{A}(D)}$
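A small sketch can show why the normalization matters. An attribute with one distinct value per example (such as an identifier) reaches the maximum information gain even though it cannot generalize, and the gain ratio penalizes it. The function names and data below are illustrative assumptions, and returning 0 when the split information is 0 is a convention chosen for this sketch.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def partition(values, labels):
    """Group the labels by the attribute value of each example."""
    groups = {}
    for v, label in zip(values, labels):
        groups.setdefault(v, []).append(label)
    return list(groups.values())

def gain_ratio(values, labels):
    """Information gain divided by the split information of the attribute."""
    n = len(labels)
    parts = partition(values, labels)
    info_a = sum(len(p) / n * entropy(p) for p in parts)
    gain = entropy(labels) - info_a
    split_info = -sum((len(p) / n) * math.log2(len(p) / n) for p in parts)
    # Convention for this sketch: a single-valued attribute gets ratio 0.
    return gain / split_info if split_info > 0 else 0.0

labels = ["yes", "yes", "no", "no"]
binary_attribute = ["a", "a", "b", "b"]  # two partitions, both pure
id_like_attribute = ["1", "2", "3", "4"] # four partitions of size one
```

Both attributes have an information gain of 1 bit, but the split information is 1 for the binary attribute and 2 for the identifier-like one, so their gain ratios are 1.0 and 0.5 respectively.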

Additional notes:

* The term "entropy" refers to the amount of uncertainty or randomness in a system. In the context of machine learning, entropy is used to measure the uncertainty of a class label.

* The term "information gain" refers to the reduction in entropy that is achieved by using a given attribute to classify the data.

* The term "gain ratio" is a measure of the information gain that is normalized by the complexity of the splitting rule.

___________________________________________________________________________________________________

# SVM


Support Vector Machines (SVMs) are algorithms that seek the optimal hyperplane in an N-dimensional space (N attributes) that separates the data with the largest possible margin. This margin is built from the points closest to the decision boundary (the support vectors), with the boundary equally distant from the closest points of each class. The margin can be sought while admitting different error levels (a soft margin); the trade-off between margin size and admitted error is controlled by a regularization hyperparameter.

The optimal margin hyperplane can, by itself, only separate linearly separable data. However, with the "Kernel Trick", the data are implicitly (not explicitly) mapped to a space in which they can become linearly separable. The resulting algorithm is formally similar, but it replaces the inner product with a non-linear kernel function.

Some classically used kernels are: linear, sigmoid, polynomial, and RBF.

The loss function that allows maximizing the margin is called Hinge Loss:

$c(x, y, f(x)) = \max(0, 1 - y \cdot f(x))$

The cost is zero when the example is classified correctly with a margin of at least 1, that is, $y \cdot f(x) \geq 1$, and is greater than zero otherwise. The regularization parameter λ is responsible for the trade-off between maximizing the margin and decreasing the error. With the regularization term, the objective takes the form:

$\min_{w} λ ||w||^2 + \sum_{i=1}^n \max(0, 1 - y_{i} \langle x_{i}, w \rangle)$

Parameter updates follow the gradient, given by the partial derivative of the loss with respect to the weights, with learning rate α. When an example violates the margin ($y_{i} \langle x_{i}, w \rangle < 1$), which includes every misclassified example, the weight update is:

$w = w + α(y_{i} x_{i} - 2 λ w)$

Otherwise, only the regularization term contributes and the weight update reduces to:

$w = w - α(2 λ w)$
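The two update rules can be turned into a minimal stochastic gradient descent loop for a linear SVM. This is a sketch under several assumptions: there is no bias term (as in the formulas above), labels are +1 and -1, the function names and hyperparameter defaults are illustrative, and the toy data is made up and separable by a hyperplane through the origin.

```python
def dot(a, b):
    """Inner product <a, b> of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def hinge_loss(y, score):
    """max(0, 1 - y * f(x)) for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * score)

def train_linear_svm(X, y, lr=0.01, lam=0.01, epochs=200):
    """Apply the per-example updates from the text; returns the weight vector w."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * dot(xi, w) < 1:
                # Margin violated: w = w + lr * (y_i * x_i - 2 * lam * w)
                w = [wj + lr * (yi * xj - 2 * lam * wj) for wj, xj in zip(w, xi)]
            else:
                # Margin satisfied: only the regularization term acts.
                w = [wj - lr * 2 * lam * wj for wj in w]
    return w

def predict(w, x):
    """Sign of the decision function, with ties mapped to +1."""
    return 1 if dot(x, w) >= 0 else -1

# Made-up data, linearly separable through the origin.
X = [[2.0, 1.0], [1.5, 2.0], [3.0, 2.5], [-2.0, -1.0], [-1.5, -2.5], [-3.0, -2.0]]
y = [1, 1, 1, -1, -1, -1]
w = train_linear_svm(X, y)
```

On this data every training example ends up on the correct side of the hyperplane. Data that is not centered would need a bias term, which this sketch deliberately leaves out to stay close to the formulas.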

___________________________________________________________________________________________________

# k-NN

k-NN (k-Nearest Neighbors) is an algorithm that does not assume a data distribution a priori, that is, it is a non-parametric algorithm.

In addition, it is a "lazy" algorithm: it does not require a training phase to build a model. Instead, it stores the training data and uses all of it at prediction time.

In k-NN, k is the number of neighbors. It is the most important hyperparameter. Typically, it is an odd number when used in binary classification, which avoids tied votes.

The algorithm works basically as follows: given a point that we want to classify, find its k nearest neighbors in the training data and assign the class that is most common among them.
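The procedure just described fits in a few lines. The sketch below assumes Euclidean distance (other metrics are possible) and uses made-up points; the function name `knn_predict` is illustrative.

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort all training points by Euclidean distance to the query.
    neighbors = sorted(
        (math.dist(row, query), label) for row, label in zip(X_train, y_train)
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

# Made-up data: one cluster near the origin, one near (5, 5).
X_train = [[0.0, 0.0], [0.5, 0.2], [0.1, 0.6], [5.0, 5.0], [5.3, 4.8], [4.7, 5.1]]
y_train = ["a", "a", "a", "b", "b", "b"]
```

Note that there is no fitting step: all the work happens at prediction time, which is why the cost of each prediction grows with the size of the stored training set.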

k-NN tends to perform better with a reduced number of attributes, especially when little labeled data is available. As the number of dimensions/attributes increases, there is a greater risk of overfitting.

There is no generally optimal number of neighbors; it has to be found by experiment. With few neighbors, noise has a major influence on the result, and the model has lower bias and higher variance. With many neighbors, the computational cost increases, and the model has lower variance but higher bias.
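One simple way to experiment with k is leave-one-out evaluation: each point is classified using all the other points, and the fraction of correct answers is compared across values of k. The sketch below is an illustration under assumptions: the one-dimensional data is made up, it contains one deliberately "noisy" label (the point at 1.5), and the helper names are hypothetical.

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, query, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    neighbors = sorted(
        (math.dist(row, query), label) for row, label in zip(X_train, y_train)
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

def leave_one_out_accuracy(X, y, k):
    """Classify each point from all the others and return the hit rate."""
    correct = 0
    for i in range(len(X)):
        X_rest = X[:i] + X[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        if knn_predict(X_rest, y_rest, X[i], k) == y[i]:
            correct += 1
    return correct / len(X)

# Made-up 1-D data: class "a" near 0-3, class "b" near 10-13,
# plus one noisy "b" label at 1.5 inside the "a" cluster.
X = [[0.0], [1.0], [1.5], [2.0], [3.0], [10.0], [11.0], [12.0], [13.0]]
y = ["a", "a", "b", "a", "a", "b", "b", "b", "b"]
accuracies = {k: leave_one_out_accuracy(X, y, k) for k in (1, 3, 7)}
```

On this toy data, k = 1 lets the noisy point mislead its neighbors (6 of 9 correct), k = 3 outvotes it (8 of 9), and k = 7 averages over so many points that the smaller class is swamped (4 of 9). The numbers are specific to this example and only illustrate the bias/variance trade-off described above.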

___________________________________________________________________________________________________

# Random Forest

The Random Forest model operates through ensembles of decision trees. Its intuition stems from the concept of "The Wisdom of the Crowds". Basically, several simple decision trees are generated, each of which provides a classification for the data, and the class predicted by the largest number of trees is chosen (majority vote).

A key point for the operation of Random Forest is the independence/low correlation between the models built. The combination of simple models tends to be more accurate than a single more complex tree. This is because multiple trees correct each other's individual errors, as long as they do not all err in the same direction.

Two prerequisites for random forest are defined:

* The data must be separable through the attributes
* The predictions and errors of each individual tree must have low correlation with the others

A procedure used to reduce the correlation between trees is bagging. This method consists of letting each tree in the forest draw samples from the dataset with replacement. This results in trees trained on different samples, some with overlapping data and others not, which makes the trees distinct and the ensemble less susceptible to variance errors.

Another procedure adopted to guarantee the generation of distinct trees is "Attribute Randomness". In this process, instead of choosing the best attribute for the node, out of all the available ones, the tree is forced to choose from a random subset of attributes, which increases the chance of obtaining distinct trees.

In this way, the final forest is composed of trees that not only were trained on distinct data subsets, but also used different attributes to make decisions.
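The three ingredients described in this section (bootstrap sampling, random attribute subsets, and majority voting) can be sketched independently of the tree-growing code. The helper names below are hypothetical and no trees are trained; the sketch only shows the randomization and the vote.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Bagging: draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(n_features, subset_size, rng):
    """Attribute randomness: indices a node is allowed to choose from."""
    return sorted(rng.sample(range(n_features), subset_size))

def majority_vote(predictions):
    """Final forest output: the class predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = list(range(10))  # stand-ins for ten training examples
sample = bootstrap_sample(rows, rng)
features = random_feature_subset(n_features=6, subset_size=2, rng=rng)
winner = majority_vote(["planet", "not_planet", "planet"])
```

Because sampling is done with replacement, a bootstrap sample usually repeats some rows and omits others, which is what makes each tree see a different view of the data.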

___________________________________________________________________________________________________

# Gradient Tree Boosting  

Gradient Tree Boosting is a machine learning algorithm that also uses ensembles, that is, the combination of simpler models to obtain better performance. In boosting, models (trees) are added sequentially so that each new model corrects the errors of the models before it.

In this model, the updates are made by gradient descent optimization, which works with arbitrary differentiable loss functions: each new tree is fit to the negative gradient of the loss of the current ensemble.
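A minimal sketch of the idea, under stated assumptions: it uses regression with squared loss (for which the negative gradient is simply the residual), depth-one trees ("stumps") on a single attribute, and made-up data. The function names, the learning rate, and the number of trees are illustrative choices, not the settings of any library.

```python
def fit_stump(x, residuals):
    """Find the threshold on x that best fits the residuals with two constants."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # keep both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, xi):
    threshold, left_value, right_value = stump
    return left_value if xi <= threshold else right_value

def fit_gradient_boosting(x, y, n_trees=20, learning_rate=0.5):
    """Start from the mean, then add stumps fit to the current residuals."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        # Squared loss: the negative gradient equals the residual y - F(x).
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [
            pi + learning_rate * predict_stump(stump, xi)
            for xi, pi in zip(x, predictions)
        ]
    return base, learning_rate, stumps

def predict(model, xi):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, xi) for s in stumps)

def mean_squared_error(model, x, y):
    return sum((yi - predict(model, xi)) ** 2 for xi, yi in zip(x, y)) / len(y)

# Made-up data: a step from about 1 to about 3 between x = 4 and x = 5.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
baseline = fit_gradient_boosting(x, y, n_trees=0)
boosted = fit_gradient_boosting(x, y, n_trees=20)
```

Comparing `mean_squared_error(baseline, x, y)` with `mean_squared_error(boosted, x, y)` shows the training error dropping as stumps are added. For classification, a different differentiable loss would replace the squared loss, and the residuals would be replaced by that loss's negative gradient.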

___________________________________________________________________________________________________