# Machine Learning Pipeline

## Preprocessing


### Feature Rescaling

* Normalization (min-max scaling): Rescale a variable to the range $[0, 1]$: $x_{new}=\frac{x_{old}-MIN}{MAX-MIN}$
* Standardization: Rescale a variable to zero mean and unit standard deviation: $x_{new}=\frac{x_{old}-\bar{x}}{\sigma}$
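
As a rough illustration of the two formulas, here is a minimal pure-Python sketch. The function names and the `heights` data are made up for this example, and the constant-column handling (returning zeros) is an assumption rather than a standard convention.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values to [0, 1] using (x - MIN) / (MAX - MIN)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature is mapped to all zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Assumption: a constant feature is mapped to all zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


heights = [150.0, 160.0, 170.0, 180.0, 190.0]  # illustrative data
normalized = min_max_normalize(heights)
standardized = standardize(heights)
```

In practice, `MIN`, `MAX`, $\bar{x}$ and $\sigma$ should be computed on the training data only and then reused to transform validation and test data.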



### Missing Value Imputation

* Remove the row that contains the missing value
* Replace the NA with the mean of that feature
* Use the rest of the data to predict the missing value
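
The first two strategies can be sketched in a few lines of plain Python. This sketch assumes missing values are encoded as `None` and that every column has at least one observed value; the function names and the `data` table are illustrative only.

```python
from statistics import mean


def drop_missing_rows(rows):
    """Strategy 1: discard every row that contains a missing value."""
    return [row for row in rows if None not in row]


def impute_mean(rows):
    """Strategy 2: replace each missing value with its column mean."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        # Assumes each column has at least one observed value.
        observed = [row[j] for row in rows if row[j] is not None]
        col_means.append(mean(observed))
    return [
        [col_means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]


data = [
    [1.0, 10.0],
    [2.0, None],  # missing value in the second feature
    [3.0, 30.0],
]
dropped = drop_missing_rows(data)
imputed = impute_mean(data)
```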

### Imbalanced Data

* Oversampling: Make multiple copies of the examples of the underrepresented class.
* Undersampling: Randomly remove examples of the majority class.
* Synthetic Creation: Take two (or $k$) examples of the underrepresented class, and build a new example in between them.
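
A minimal sketch of the three strategies follows. It is not a full SMOTE implementation: the synthetic step interpolates between two randomly chosen minority examples rather than between nearest neighbours, and all names and data here are hypothetical.

```python
import random


def oversample(minority, target_size, rng):
    """Duplicate randomly chosen minority examples until target_size is reached."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra


def undersample(majority, target_size, rng):
    """Keep a random subset of the majority class."""
    return rng.sample(majority, target_size)


def synthesize(minority, n_new, rng):
    """Create new points on the segment between two random minority examples."""
    new_points = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()  # position along the segment from a to b
        new_points.append([x + t * (y - x) for x, y in zip(a, b)])
    return new_points


rng = random.Random(0)
majority = [[float(i), float(i)] for i in range(10)]
minority = [[0.0, 10.0], [2.0, 12.0], [4.0, 14.0]]

balanced_up = oversample(minority, len(majority), rng)
balanced_down = undersample(majority, len(minority), rng)
synthetic = synthesize(minority, 5, rng)
```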


### Feature Reduction

Having too many features can harm the classifier. This is called the curse of dimensionality: with a fixed amount of data, adding features helps at first, but beyond some point each additional feature hurts the performance of the model. Reducing the number of features is useful for:

* Removing redundant (correlated) features
* Removing irrelevant (noisy) features

Some feature reduction methods are:

* Filter methods:
  * Select the variables most correlated with the target variable.
  * Select the variables with the highest information gain.
  * Others (linear discriminant analysis, ANOVA, chi-square (for categorical variables))

* Wrapper methods:
  * Step Forward selection: Use a greedy method to select features as follows:

    1. Initialize the set of selected features $F$ as empty.
    2. For each feature $f_i$ not yet in $F$, evaluate the performance of the model trained on the features $F \cup \{f_i\}$.
    3. Select the $f_i$ that gave the best performance and add it to $F$.
    4. Repeat from step 2 until the desired number of features has been selected.

  * Step Backward selection: Perform the inverse of step forward selection. Start with all the features and gradually discard them.
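
The step forward procedure can be sketched as a short greedy loop. Here `score` stands in for "train the model on this feature subset and measure its performance"; the `toy_score` function and its `USEFULNESS` table are purely hypothetical stand-ins so that the example runs without a real model.

```python
def forward_selection(features, score, n_select):
    """Greedily add the feature that most improves score(selected + [feature])."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < n_select:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical per-feature usefulness, used only to make the example runnable.
USEFULNESS = {"age": 0.30, "income": 0.25, "zip_code": 0.01, "height": 0.05}


def toy_score(subset):
    """Stand-in for model performance: total usefulness minus a size penalty."""
    return sum(USEFULNESS[f] for f in subset) - 0.02 * len(subset)


chosen = forward_selection(list(USEFULNESS), toy_score, n_select=2)
```

With a real model, `score` would typically return a validation metric, which makes wrapper methods much more expensive than filter methods because the model is retrained for every candidate feature.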



## Evaluation

### Confusion Matrix

 
- | TRUE (predicted) | FALSE (predicted)
--- | --- | ---
TRUE (actual) | TP | FN
FALSE (actual) | FP | TN



When errors are equally important:

* Accuracy: $\frac{TP+TN}{TP+TN+FP+FN}$. 

When they are not, multiply the confusion matrix by a cost matrix, or use other metrics:

* Precision: $\frac{TP}{TP+FP}$. Proportion of the retrieved (predicted positive) cases that are actually relevant.
* Recall: $\frac{TP}{TP+FN}$. Proportion of the relevant (actual positive) cases that were retrieved.
* F1-score: $2 \cdot \frac{precision \cdot recall}{precision + recall}$. Harmonic mean of precision and recall.
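
The metrics above can be computed directly from the four confusion-matrix counts. The sketch below uses made-up labels; returning `0.0` when a denominator is zero is an assumption of this example, not a universal convention.

```python
def confusion_counts(actual, predicted):
    """Return (TP, FN, FP, TN) for two equal-length lists of booleans."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    fn = sum(1 for a, p in pairs if a and not p)
    fp = sum(1 for a, p in pairs if not a and p)
    tn = sum(1 for a, p in pairs if not a and not p)
    return tp, fn, fp, tn


def classification_metrics(tp, fn, fp, tn):
    """Compute accuracy, precision, recall and F1 from the counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Assumption: undefined ratios (zero denominator) are reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative labels: 4 actual positives, 6 actual negatives.
actual = [True, True, True, True, False, False, False, False, False, False]
predicted = [True, True, True, False, True, False, False, False, False, False]

counts = confusion_counts(actual, predicted)
metrics = classification_metrics(*counts)
```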

### Area Under Curve (AUC)

The area under the ROC (Receiver Operating Characteristic) curve is also used to choose a classifier. It applies to classifiers that output a probability score, where a threshold determines whether an example is assigned to the positive or the negative class.

The first step is to normalize the confusion matrix by actual class (by row in the table above) by computing:

* True positive rate: $TPR=\frac{TP}{TP+FN}$
* False positive rate: $FPR=\frac{FP}{FP+TN}$

We then plot TPR (y-axis) against FPR (x-axis) for different thresholds.

The larger the area under the curve, the better the classifier: an area of 1 corresponds to a perfect classifier and 0.5 to random guessing.
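
To make the procedure concrete, the following sketch sweeps the threshold over every distinct score, collects the (FPR, TPR) points, and integrates them with the trapezoidal rule. The labels and scores are invented, and the code assumes both classes are present in `actual`.

```python
def roc_points(actual, scores):
    """Return (FPR, TPR) points, one per threshold, starting at (0, 0)."""
    positives = sum(actual)
    negatives = len(actual) - positives  # assumes both classes are present
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # An example is predicted positive when its score reaches the threshold.
        tp = sum(1 for a, s in zip(actual, scores) if a and s >= threshold)
        fp = sum(1 for a, s in zip(actual, scores) if not a and s >= threshold)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the piecewise-linear curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


actual = [1, 1, 0, 1, 0, 0]  # illustrative labels
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]  # illustrative classifier scores

points = roc_points(actual, scores)
area = auc(points)
```

For this data the area is $8/9$, which matches the fraction of (positive, negative) pairs in which the positive example receives the higher score.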


### Cross-validation


Cross-validation (also called rotation estimation or out-of-sample testing) is a way of estimating the quality of a model. It consists of dividing the training data into $k$ folds, then using $k-1$ folds for training and the remaining fold for validation. This is repeated $k$ times, so that each fold is used exactly once for validation, and the $k$ scores are averaged.
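
A minimal sketch of the splitting logic is shown below. The helper names are hypothetical, the folds are built without shuffling for simplicity, and `fit_and_score` is a placeholder for whatever training and evaluation routine is being assessed.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Deal indices into k folds round-robin; real code would usually shuffle first.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def cross_validate(n_samples, k, fit_and_score):
    """Average the score returned by fit_and_score(train, validation) over k folds."""
    scores = [fit_and_score(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


splits = list(k_fold_indices(n_samples=10, k=5))
```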



## Model Tuning


### Regularization
Fitting a linear regression $f(x)=\beta_0 + \beta_1x_1 + \dots + \beta_nx_n$ means minimizing a loss function called the residual sum of squares (RSS), defined as $RSS=\sum_{i=1}^{m} (y_i - f(x^{(i)}))^2$, where $m$ is the number of training examples and $x^{(i)}$ is the $i$-th example.

To prevent overfitting, a penalty on the coefficients is added (regularization):

* L1 (Lasso): $RSS + \lambda \sum_{i=1}^{n} |\beta_i |$
* L2 (Ridge): $RSS + \lambda \sum_{i=1}^{n} \beta_i^2$

The value of $\lambda$ controls how strongly the coefficients are penalized; $\lambda = 0$ recovers the unregularized fit.
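
The effect of the two penalties is easiest to see in a deliberately simplified setting: a single feature and no intercept, where both minimizers have a closed form. The sketch below covers only that special case (it is not a general solver), and the function names and data are illustrative.

```python
def ridge_coefficient(xs, ys, lam):
    """Minimizer of sum((y - b*x)^2) + lam * b^2 for a single coefficient b."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def soft_threshold(z, t):
    """Shrink z toward zero by t, clipping at zero."""
    magnitude = max(abs(z) - t, 0.0)
    return magnitude if z >= 0 else -magnitude


def lasso_coefficient(xs, ys, lam):
    """Minimizer of sum((y - b*x)^2) + lam * |b| for a single coefficient b."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return soft_threshold(sxy, lam / 2) / sxx


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x, so the unpenalized coefficient is 2

ridge_small = ridge_coefficient(xs, ys, lam=14.0)
ridge_large = ridge_coefficient(xs, ys, lam=56.0)
lasso_small = lasso_coefficient(xs, ys, lam=28.0)
lasso_large = lasso_coefficient(xs, ys, lam=56.0)
```

In this example Ridge shrinks the coefficient toward zero without reaching it, while Lasso sets it to exactly zero once $\lambda$ is large enough, which is why L1 is often used for feature selection.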

For more info check [here](https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a) and [here](https://github.com/ShuaiW/data-science-question-answer#l1-vs-l2-regularization)

### Ensemble Techniques

Ensemble techniques draw samples from the data, train several models, and combine their predictions:

* Bagging (parallel): Draw $k$ bootstrap samples and train $k$ models independently, one per sample. The result is the average (or majority vote) of their predictions.
* Boosting (sequential): Draw one sample, train a model, and classify. Increase the probability that misclassified instances are picked in the following samples. Repeat $k$ times.
* Stacking: Build models that use the output of other models as one of the features.
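
As an illustration of bagging only, the sketch below trains $k$ very simple base models on bootstrap samples of one-dimensional data and combines them by majority vote. The base learner (`train_stump`, a threshold halfway between the two class means) is a hypothetical choice made to keep the example short; any classifier could take its place.

```python
import random
from collections import Counter
from statistics import mean


def train_stump(xs, ys):
    """Hypothetical base learner: threshold halfway between the class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    if not pos or not neg:
        # The bootstrap sample contains one class only: always predict it.
        constant = 1 if pos else 0
        return lambda x: constant
    threshold = (mean(pos) + mean(neg)) / 2
    return lambda x: 1 if x > threshold else 0


def bagging(xs, ys, k, rng):
    """Train k stumps on bootstrap samples and return a majority-vote predictor."""
    n = len(xs)
    models = []
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(train_stump([xs[i] for i in idx], [ys[i] for i in idx]))

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict


xs = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]  # illustrative data
ys = [0, 0, 0, 0, 1, 1, 1, 1]
predict = bagging(xs, ys, k=25, rng=random.Random(0))
```

Using an odd $k$ avoids ties in a two-class majority vote.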

For more info check [here](https://www.pluralsight.com/guides/ensemble-methods:-bagging-versus-boosting)