# 1. Regression
#### The most popular regression algorithms are:
1. Linear Regression
2. Logistic Regression
3. Ordinary Least Squares Regression (OLSR)
4. Stepwise Regression
5. Multivariate Adaptive Regression Splines (MARS)
6. Locally Estimated Scatterplot Smoothing (LOESS)

### 1.1. Linear regression
<img src="data/images/linear_regression.png" alt="xxx" title="title" width=260 height=260 />

Linear regression is a predictive technique used for trends, estimates, and the impact of price changes. It models the relationship as a straight line: <br>
$$y = mx + c$$
- m = slope of the line
- c = intercept of the line
$$ m = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$
Goodness of fit of the least-squares line ($R^2$, better when closer to 1), where $y_{pred}$ is the value predicted by the line: <br>
$$ R^2 = \frac{\sum (y_{pred} - \bar{y})^2}{\sum (y - \bar{y})^2}$$
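A minimal sketch of these formulas in plain Python. The intercept formula $c = \bar{y} - m\bar{x}$ is the standard least-squares result and is an addition to the text above; the sample data and the function names `fit_line` and `r_squared` are made up for illustration. $R^2$ is computed with squared deviations in both the numerator and the denominator.

```python
def fit_line(xs, ys):
    """Return slope m and intercept c of the least-squares line y = m*x + c."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    # The least-squares line always passes through (x_mean, y_mean).
    c = y_mean - m * x_mean
    return m, c


def r_squared(xs, ys, m, c):
    """Explained variation divided by total variation."""
    y_mean = sum(ys) / len(ys)
    explained = sum((m * x + c - y_mean) ** 2 for x in xs)
    total = sum((y - y_mean) ** 2 for y in ys)
    return explained / total


# Illustrative data (not from a real dataset).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m, c = fit_line(xs, ys)
r2 = r_squared(xs, ys, m, c)
print(m, c, r2)
```

For this toy data the slope is about 0.6, the intercept about 2.2 and $R^2$ about 0.6.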

### 1.2. Logistic Regression

<img src="data/images/logistic_regression.png" alt="xxx" title="title" width=260 height=260 />

Logistic regression predicts the probability that an observation belongs to a class. <br>
Applying a threshold to that probability gives a result in a binary format: <br>
- 0 or 1
- Yes or No
- True or False
- High or Low

Logistic regression doesn't have the same concept of residuals as linear regression, so it is not fitted with least squares. <br>
Instead, it is fitted using maximum likelihood.
<img src="data/images/logistic_regression_likelihood.png" alt="xxx" title="title" width=460 height=460 />
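A rough sketch of maximum-likelihood fitting for a single feature, assuming plain gradient ascent on the log-likelihood. The learning rate, the number of steps, the toy data, and the function names are all assumptions for illustration; real libraries use more sophisticated optimizers.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(xs, ys, w, b):
    """Log of the probability the model assigns to the observed labels."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


def fit_logistic(xs, ys, lr=0.05, steps=5000):
    """Climb the log-likelihood with gradient ascent (assumed optimizer)."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        errors = [y - sigmoid(w * x + b) for x, y in zip(xs, ys)]
        w += lr * sum(e * x for e, x in zip(errors, xs))
        b += lr * sum(errors)
    return w, b


# Illustrative data: larger x values are more often labelled 1.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
print(sigmoid(w * 0.5 + b), sigmoid(w * 4.0 + b))
```

The fitted model returns a probability; comparing it with 0.5 gives the binary answer.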

### 1.3. Ordinary Least Squares Regression (OLSR)

<img src="data/images/OLSR.png" alt="xxx" title="title" width=460 height=460 />

### 1.4. Stepwise Regression

- Is a tool used to filter out the features that don't have a big impact on our prediction model
- Stepwise attempts to find the most important variables
- E.g.: The price of a house is impacted by number of rooms and location, but not by color.

#### How it works:
 - Assume we have n independent variables
 - Step 1: we create all n possible one-variable models: E(y) = B_0 + B_1 * x_k
  - we choose the most significant x_k and call it x_1
 - Step 2: we create all n-1 possible two-variable models: E(y) = B_0 + B_1 * x_1 + B_2 * x_k
  - we choose the most significant x_k
 - We repeat the steps until adding another variable no longer improves the model noticeably

#### Methods to choose the best variable:
 - p value (smallest)
 - standard deviation
 - R squared
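A minimal sketch of the forward procedure above, using the gain in R squared as the selection criterion (one of the listed options; a p-value test would work similarly). The stopping threshold `min_gain`, the helper names, and the house-price numbers are assumptions for illustration only.

```python
def solve(a, b):
    """Solve the linear system a*x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x


def r_squared(columns, y):
    """R^2 of a least-squares fit of y on an intercept plus the given columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
                 for r, yi in zip(rows, y))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def forward_stepwise(features, y, min_gain=0.01):
    """Add one variable at a time while R^2 improves by at least min_gain."""
    selected, best_r2 = [], 0.0
    remaining = list(features)
    while remaining:
        scores = {name: r_squared([features[s] for s in selected + [name]], y)
                  for name in remaining}
        best = max(scores, key=scores.get)
        if scores[best] - best_r2 < min_gain:
            break
        selected.append(best)
        remaining.remove(best)
        best_r2 = scores[best]
    return selected, best_r2


# Made-up data: price depends on rooms and location, not on color.
features = {
    "rooms": [2, 3, 3, 4, 5, 2, 4, 5],
    "location": [1, 3, 2, 1, 3, 2, 3, 1],
    "color": [1, 2, 3, 1, 2, 3, 2, 1],
}
price = [131, 238, 212, 229, 341, 159, 290, 280]
selected, r2 = forward_stepwise(features, price)
print(selected, round(r2, 4))
```

On this toy data the procedure keeps `rooms` and `location` and stops before adding `color`.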

#### Drawbacks:
- Only linear terms are considered

### 1.5. Multivariate Adaptive Regression Splines (MARS)

- MARS fits piecewise linear models
- Hinge functions are used to "cut" the line into segments that can easily be modelled with linear functions
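A small sketch of how hinge functions combine into a piecewise linear model. The knot position (3.0) and the coefficients are hypothetical; a real MARS implementation searches for them, which this sketch does not do.

```python
def hinge(x, knot, direction=1):
    """max(0, x - knot) for direction=+1, max(0, knot - x) for direction=-1."""
    return max(0.0, direction * (x - knot))


def piecewise_model(x):
    """A hand-written model with one cut point at x = 3 (hypothetical values)."""
    return 2.0 + 1.5 * hinge(x, 3.0, +1) - 0.5 * hinge(x, 3.0, -1)


# Left of the knot the slope is 0.5, right of the knot it is 1.5.
for x in [1.0, 3.0, 5.0]:
    print(x, piecewise_model(x))
```

Each hinge is zero on one side of its knot, so every term only changes the slope on its own side of the cut point.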

<img src="data/images/spline_regression.png" alt="xxx" title="title" width=460 height=460 />

<img src="data/images/MARS.png" alt="xxx" title="title" width=460 height=460 />

<img src="data/images/1vs2_cut_points.png" alt="xxx" title="title" width=460 height=460 />

### 1.6. Locally Estimated Scatterplot Smoothing (LOESS)


#### Steps:
- The data to be fitted
<img src="data/images/LOESS1.png" alt="xxx" title="title" width=460 height=460 />



- Divide the data into smaller windows of 5 points each

<img src="data/images/LOESS2.png" alt="xxx" title="title" width=460 height=460 />

- Each point in turn becomes the focal point of its window.
- The focal point has the biggest weight; the other points in the window have smaller weights, decreasing with their distance from the focal point
<img src="data/images/LOESS3.png" alt="xxx" title="title" width=460 height=460 />

- After this, you'll have the first point of the fitted curve. It is updated afterwards according to the distance between the curve and the actual point
<img src="data/images/LOESS4.png" alt="xxx" title="title" width=460 height=460 />

- Here comes the 2nd weight, based on the distance between the curve and each actual point
<img src="data/images/LOESS5.png" alt="xxx" title="title" width=460 height=460 />

- The curve after taking both weights into consideration (distance from the focal point and distance between the curve and the point)
<img src="data/images/LOESS6.png" alt="xxx" title="title" width=460 height=460 />

#### Additional considerations:
- The local fit can use lines or parabolas
<img src="data/images/LOESS7.png" alt="xxx" title="title" width=460 height=460 />

- Difference between the two
<img src="data/images/LOESS8.png" alt="xxx" title="title" width=460 height=460 />

- The functions for the 2 weights:
<img src="data/images/LOESS9.png" alt="xxx" title="title" width=460 height=460 />
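A compact sketch of the two-weight procedure with local lines. It assumes the commonly used tricube function for the distance weight and the bisquare function for the curve-to-point weight; the functions shown in the image above may differ. The window handling, the scale of 6 times the median residual, and all names are assumptions.

```python
def tricube(u):
    """Distance weight: 1 at the focal point, falling to 0 at the window edge."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def bisquare(u):
    """Residual weight: 1 for a point on the curve, 0 for a far outlier."""
    u = abs(u)
    return (1 - u ** 2) ** 2 if u < 1 else 0.0


def weighted_line(xs, ys, ws):
    """Weighted least-squares line; returns (slope, intercept)."""
    if sum(ws) == 0:  # degenerate window: fall back to equal weights
        ws = [1.0] * len(xs)
    sw = sum(ws)
    x_mean = sum(w * x for w, x in zip(ws, xs)) / sw
    y_mean = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - x_mean) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - x_mean) * (y - y_mean) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return slope, y_mean - slope * x_mean


def loess(xs, ys, window=5, robust=None):
    """Fit a local line around every point and evaluate it at that point."""
    n = len(xs)
    robust = robust or [1.0] * n
    fitted = []
    for x0 in xs:
        # The `window` points closest to the focal point x0.
        idx = sorted(range(n), key=lambda j: abs(xs[j] - x0))[:window]
        max_d = max(abs(xs[j] - x0) for j in idx) or 1.0
        ws = [tricube((xs[j] - x0) / max_d) * robust[j] for j in idx]
        slope, intercept = weighted_line([xs[j] for j in idx],
                                         [ys[j] for j in idx], ws)
        fitted.append(slope * x0 + intercept)
    return fitted


def robust_weights(ys, fitted):
    """Second weight: shrink points that lie far from the current curve."""
    residuals = [abs(y - f) for y, f in zip(ys, fitted)]
    median = sorted(residuals)[len(residuals) // 2]
    scale = max(6 * median, 1e-9)
    return [bisquare(r / scale) for r in residuals]


# Made-up data on the line y = 2x + 1, with one outlier at x = 5.
xs = list(range(10))
clean = [2 * x + 1 for x in xs]
noisy = clean[:5] + [50] + clean[6:]
first_pass = loess(xs, noisy, window=7)
second_pass = loess(xs, noisy, window=7, robust=robust_weights(noisy, first_pass))
print(round(first_pass[5], 2), round(second_pass[5], 2))
```

The example uses a window of 7 rather than 5 so that the outlier is surrounded by enough regular points; the second pass pulls the curve at x = 5 back towards the underlying line.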

# 2. Classification
#### The most popular instance-based algorithms are:
1. k-Nearest Neighbor (kNN)
2. Learning Vector Quantization (LVQ)
3. Self-Organizing Map (SOM)
4. Locally Weighted Learning (LWL)
5. Support Vector Machines (SVM) ★

### 2.1 k-Nearest Neighbor (kNN)

- Finds the k nearest neighbors of a new element and assigns the element to the class that gets the most votes among those neighbors
<img src="data/images/knn1.png" alt="xxx" title="title" width=460 height=460 />

- When k=1, each training vector defines a region in space, defining a Voronoi partition of space
<img src="data/images/knn2.png" alt="xxx" title="title" width=460 height=460 />

#### Remarks:
- For a two-class problem, an odd k value should be chosen to avoid tied votes
- k should not be a multiple of the number of classes
- Not very scalable (for large datasets, it can be a problem because many distances have to be calculated)
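A minimal sketch of kNN voting, assuming Euclidean distance (the text does not fix a distance measure). The points, the labels and the function name `knn_predict` are made up for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # One distance per training point: this is the part that scales poorly.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, (2, 2)))
print(knn_predict(points, labels, (6, 5)))
```

With two classes and k = 3 a tie is impossible, which is the reason for choosing an odd k.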

### 2.2 Learning Vector Quantization (LVQ)

### 2.5 Support Vector Machines (SVM)

- The extreme points of 2 classes are called Support Vectors
- LSVM: linear support vector machines: classes are linearly separable

- Bad threshold
<img src="data/images/SVM1.png" alt="xxx" title="title" width=760 height=760 />

- Good threshold
<img src="data/images/SVM2.png" alt="xxx" title="title" width=760 height=760 />

Margin: the shortest distance between the threshold and the edges of the classes
- If the threshold is in the middle, the margin is as large as it can be
<img src="data/images/SVM3.png" alt="xxx" title="title" width=760 height=760 />

- Low bias: when the threshold is robust (no misclassifications)
- High bias: when we allow misclassifications --> Soft margin

When the data is not linearly separable, we add another dimension --> kernel trick

In order to make the mathematics possible, Support Vector Machines use Kernel functions to systematically find Support Vector Classifiers in higher dimensions.

Examples of Kernel functions:
 - Linear Kernel: $x \cdot y$
 - Polynomial Kernel: $(x \cdot y)^d$, d = 1, 2, 3, ...
 - Radial Kernel (Radial Basis Function kernel/RBF): finds Support Vector Classifiers in infinite dimensions: $e^{-\gamma \lVert x - y \rVert^2}$
 <img src="data/images/SVM_radial_kernel.png" alt="xxx" title="title" width=260 height=260 />
 - Sigmoid Kernel
 <img src="data/images/SVM_sigmoid.png" alt="xxx" title="title" width=260 height=260 />
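A short sketch of these kernels in plain Python. The sigmoid kernel is written in its common form tanh(gamma * x·y + r); the parameter names `gamma` and `r` and their default values are assumptions, since the text only shows it as an image. The function `phi` is a hypothetical explicit feature map added to show why a kernel corresponds to a dot product in a higher dimension.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, d=2):
    return dot(x, y) ** d


def rbf_kernel(x, y, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, y, gamma=0.1, r=0.0):
    return math.tanh(gamma * dot(x, y) + r)


def phi(x):
    """Explicit 3-D features whose dot product equals the degree-2 kernel in 2-D."""
    x1, x2 = x
    return (x1 ** 2, math.sqrt(2) * x1 * x2, x2 ** 2)


x, y = (1.0, 2.0), (3.0, 4.0)
print(linear_kernel(x, y))
print(polynomial_kernel(x, y, d=2), dot(phi(x), phi(y)))
print(rbf_kernel(x, y, gamma=0.5))
```

The polynomial kernel gives the same number as mapping both points to three dimensions first and taking the dot product there, without ever computing the mapping.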


#### Disadvantages of SVM:
 - Can perform poorly when the number of features is larger than the number of samples

# 3. Clustering
#### The most popular clustering algorithms are:
1. k-Means
2. k-Medians
3. Expectation Maximisation (EM)
4. Hierarchical Clustering

# 4. Bayesian
#### The most popular Bayesian algorithms are:
1. Naive Bayes ★
2. Gaussian Naive Bayes
3. Multinomial Naive Bayes
4. Averaged One-Dependence Estimators (AODE)
5. Bayesian Belief Network (BBN)
6. Bayesian Network (BN)

# 5. Algorithms
1. Decision Tree ★
2. Random Forest ★
3. Dimensionality Reduction Algorithms
4. Gradient Boosting algorithms
5. GBM
6. XGBoost
7. LightGBM
8. CatBoost

# 5.1 Decision Tree
It is a type of supervised learning algorithm

Common splitting criteria for decision trees:
 - Gini Index
 - Chi-Square
 - Information Gain
 - Reduction in variance
 
Example:
<img src="data/images/Decision_Trees8.png" alt="xxx" title="title" width=460 height=460 />
<img src="data/images/Decision_Trees9.png" alt="xxx" title="title" width=460 height=460 />

- The root
<img src="data/images/Decision_Trees1.png" alt="xxx" title="title" width=460 height=460 />

### Decision trees using binary data (Yes/No)
<img src="data/images/Decision_Trees2.png" alt="xxx" title="title" width=460 height=460 />

- Finding out which feature will be the root: determine which one has the lowest impurity

### Gini impurity

<img src="data/images/Decision_Trees3.png" alt="xxx" title="title" width=460 height=460 />

- for each leaf
     - Gini impurity = 1 - (probability of yes)^2 - (probability of no)^2
     
$$G.I._1 =1 - (\frac{105}{105 + 39})^2 - (\frac{39}{105 + 39})^2 = 0.395$$
$$G.I._2 =1 - (\frac{34}{34 + 125})^2 - (\frac{125}{34 + 125})^2 = 0.336$$
- After having these 2 impurities, we calculate the weighted average of them
$$G.I._t =(\frac{144}{144 + 159})*0.395 + (\frac{159}{144 + 159})*0.336 = 0.364$$
- Repeat this for each feature, and the feature with the smallest impurity will be the root
- This is repeated on each branch until a further split no longer lowers the impurity
<img src="data/images/Decision_Trees4.png" alt="xxx" title="title" width=460 height=460 />
- Here, splitting one more time after chest pain gives a bigger impurity than before, so we make it a leaf node
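The Gini calculation above can be checked with a few lines of Python. The counts (105/39 and 34/125) are the ones from the worked example; the function names are assumptions. The total impurity is the weighted average of the two leaves, so the two terms are added.

```python
def gini(yes, no):
    """Gini impurity of one leaf from its yes/no counts."""
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2


def weighted_gini(leaves):
    """Average of the leaf impurities, weighted by the number of samples."""
    n = sum(yes + no for yes, no in leaves)
    return sum((yes + no) / n * gini(yes, no) for yes, no in leaves)


left_leaf = (105, 39)
right_leaf = (34, 125)
print(round(gini(*left_leaf), 3))
print(round(gini(*right_leaf), 3))
print(round(weighted_gini([left_leaf, right_leaf]), 3))
```

Running the same function for every candidate feature and keeping the smallest result selects the root.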

### Decision trees numerical data
<img src="data/images/Decision_Trees5.png" alt="xxx" title="title" width=460 height=460 />

 - First, sort the data in ascending order
 - Calculate the average of each pair of adjacent values
 - Calculate the impurity for each average value
 - Choose the smallest impurity --> that will be the cut-off/threshold
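A minimal sketch of these four steps for a yes/no target encoded as 1/0. The values and labels are made up for illustration, and the function names are assumptions.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1 - p ** 2 - (1 - p) ** 2


def best_threshold(values, labels):
    """Return (threshold, impurity) with the lowest weighted Gini impurity."""
    pairs = sorted(zip(values, labels))          # step 1: ascending order
    xs = [v for v, _ in pairs]
    best = None
    for a, b in zip(xs, xs[1:]):
        if a == b:
            continue
        t = (a + b) / 2                          # step 2: average of neighbours
        left = [label for v, label in pairs if v < t]
        right = [label for v, label in pairs if v >= t]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if best is None or impurity < best[1]:   # steps 3 and 4
            best = (t, impurity)
    return best


# Made-up numeric feature with a yes (1) / no (0) target.
values = [220, 180, 225, 190, 155]
labels = [1, 1, 1, 0, 0]
threshold, impurity = best_threshold(values, labels)
print(threshold, round(impurity, 3))
```

For this data the candidate thresholds are 167.5, 185, 205 and 222.5, and 205 gives the lowest impurity.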

### Decision trees ranked/multiple choice data
<img src="data/images/Decision_Trees6.png" alt="xxx" title="title" width=460 height=460 />

 - <= 1
 - <= 2
 - <= 3
 
 <img src="data/images/Decision_Trees7.png" alt="xxx" title="title" width=460 height=460 />
 
 - Green
 - Red
 - Blue
 - Green or Red
 - Red or Blue
 - Blue or Green

# 5.2 Random Forest
 - One of the most powerful supervised learning algorithms
 - Capable of solving regression and classification tasks
 

Advantages:
 - Both classification and regression tasks
 - Handles missing data
 - Less prone to overfitting than a single decision tree
 - Handles large datasets with high dimensionality


Disadvantages:
 - Good at classification, but not as good at regression
 - Very little control (black box)

### Creating a random forest:
 1. Create a bootstrap dataset: randomly select samples from the original dataset (one sample can be chosen more than once)
 2. Create a decision tree using the bootstrapped data, but only using a random subset of features at each step (each node of the tree)
 3. This gives one tree. Repeat this hundreds of times to create a random forest

How it is used:
 - A new sample to be predicted is passed through all the trees of the random forest.
 - The class with the most votes is chosen


How the accuracy is calculated: the data that is not included in a tree's bootstrapped data, called out-of-bag data, is passed through the trees that were built without it and labeled.

Accuracy = the proportion of the out-of-bag samples which were correctly labeled
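A simplified sketch of bootstrapping, voting and the out-of-bag estimate. To stay short it grows one-split trees (stumps) instead of full decision trees, so the random feature subset is drawn once per tree rather than at each node; that is a deliberate simplification, not the full algorithm. All names, default values and the toy data are assumptions.

```python
import random
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0


def train_stump(rows, labels, feature_ids):
    """One-split tree: best (feature, threshold) among the allowed features."""
    best = None
    for f in feature_ids:
        values = sorted({r[f] for r in rows})
        for a, b in zip(values, values[1:]):
            t = (a + b) / 2
            left = [l for r, l in zip(rows, labels) if r[f] < t]
            right = [l for r, l in zip(rows, labels) if r[f] >= t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:  # no split possible: always predict the majority class
        majority = Counter(labels).most_common(1)[0][0]
        return (feature_ids[0], float("inf"), majority, majority)
    return best[1:]


def predict_stump(stump, row):
    f, t, left_label, right_label = stump
    return left_label if row[f] < t else right_label


def train_forest(rows, labels, n_trees=50, n_features=1, seed=0):
    rng = random.Random(seed)
    n, forest = len(rows), []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
        feature_ids = rng.sample(range(len(rows[0])), n_features)
        stump = train_stump([rows[i] for i in idx], [labels[i] for i in idx],
                            feature_ids)
        forest.append((stump, set(idx)))  # remember which rows the tree saw
    return forest


def predict(forest, row):
    votes = Counter(predict_stump(stump, row) for stump, _ in forest)
    return votes.most_common(1)[0][0]


def oob_accuracy(forest, rows, labels):
    """Each row is voted on only by the trees that did not train on it."""
    correct = counted = 0
    for i, (row, label) in enumerate(zip(rows, labels)):
        votes = Counter(predict_stump(s, row) for s, used in forest if i not in used)
        if votes:
            counted += 1
            correct += votes.most_common(1)[0][0] == label
    return correct / counted if counted else float("nan")


# Made-up, well-separated two-class data.
rows = [(1, 1), (1, 2), (2, 1), (2, 2), (1.5, 1.5),
        (5, 5), (5, 6), (6, 5), (6, 6), (5.5, 5.5)]
labels = [0] * 5 + [1] * 5
forest = train_forest(rows, labels)
print(predict(forest, (1.2, 1.8)), predict(forest, (5.8, 5.1)))
print(oob_accuracy(forest, rows, labels))
```

Changing `n_features` and comparing the out-of-bag accuracy of the resulting forests is one way to tune the forest.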

### Configuration:
 - Build the random forest
 - Estimate the accuracy
 - Change the number of variables used at each step when building the random forest
 - Estimate again
 - Choose the best model/forest