# Gradient Boosting

## Evaluation Metrics

This is a quick review of metrics for evaluating machine learning models. Suppose I want to classify images into three categories: dog, cat, and human. If we focus on the dog class, there are only four possible outcomes for a classification model.

- True positive (TP): the model predicted dog and it is a picture of a dog.
- True negative (TN): the model predicted not a dog, maybe a cat or a human, and indeed it is **NOT** a
  picture of a dog.
- False positive (FP): the model predicted dog but it is **NOT** a picture of a dog.
- False negative (FN): the model predicted not a dog but it is **ACTUALLY** a picture of a dog.

Sensitivity, also called _recall_ or _true positive rate_, asks: of all dog images, how many were predicted to be dogs?

$$
\text{sensitivity}_{dog} = \frac{TP}{TP + FN}
$$

Specificity, also called _true negative rate_, asks: of all non-dog images, how many were predicted to be **NOT** dogs?

$$
\text{specificity}_{dog} = \frac{TN}{TN + FP}
$$

Precision, also called _positive predictive value_, asks: of all the images predicted to be dogs, how many are actually dogs?

$$
\text{precision}_{dog} = \frac{TP}{TP + FP}
$$
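To make the three formulas concrete, here is a minimal sketch that counts the four outcomes for one class and plugs them in. The helper names (`one_vs_rest_counts`, `class_metrics`) and the tiny label lists are made up for illustration, not taken from any library.

```python
def one_vs_rest_counts(y_true, y_pred, positive):
    """Count TP, TN, FP, FN when `positive` is the class of interest."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # predicted dog, is a dog
        elif pred != positive and truth != positive:
            tn += 1  # predicted not dog, is not a dog
        elif pred == positive:
            fp += 1  # predicted dog, is not a dog
        else:
            fn += 1  # predicted not dog, is a dog
    return tp, tn, fp, fn


def class_metrics(tp, tn, fp, fn):
    """Return (sensitivity, specificity, precision) for one class."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return sensitivity, specificity, precision


# Hypothetical labels, only for illustration.
y_true = ["dog", "dog", "dog", "cat", "cat", "human", "human", "human"]
y_pred = ["dog", "dog", "cat", "dog", "cat", "human", "human", "dog"]

tp, tn, fp, fn = one_vs_rest_counts(y_true, y_pred, positive="dog")
sensitivity, specificity, precision = class_metrics(tp, tn, fp, fn)
print(tp, tn, fp, fn)
print(sensitivity, specificity, precision)
```

Note that a cat mistaken for a human still counts as a true negative here, because only the dog class is being scored.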

Precision and recall are the two most common metrics. I personally have not seen people use specificity much; people tend to care only about producing the correct label. Specificity may be useful for fraud and disease detection. However, in those cases it is often better to have a false positive than a false negative, because a false negative could mean death for an ill person, so sensitivity usually matters more there.

## Bias vs Variance

The inability of a machine learning method to capture the true relationship is called **bias**. The difference in fits between data sets is called **variance**. High variance means the fits vary greatly between different data sets, which is often a result of overfitting to one data set. The ideal model has low bias, meaning it fits the data well because it captures the underlying relationship, and low variance, meaning it produces consistent predictions across different data sets.

We get close to this by finding the sweet spot between a simple model and a complicated model. There are three commonly used methods for finding it:

- Regularization (to prevent overfitting as explained in my neural network sections)
- Boosting
- Bagging


## Random Forest

Decision trees are easy to build, easy to use, and easy to interpret, but in practice they are not that useful.

> To quote _The Elements of Statistical Learning_: "Trees have one aspect that prevents them from being the ideal tool
  for predictive learning, namely inaccuracy."

In other words, they work great with the data used to create them, but they are not flexible when it comes to classifying new samples. Random forests solve this problem by combining the simplicity of decision trees with flexibility, resulting in a vast improvement in accuracy.

### Bagging

Start with bootstrapping. To create a bootstrapped dataset that is the same size as the original, we randomly select samples from the original dataset, and we are allowed to pick the same sample more than once. Then we create a decision tree using the bootstrapped dataset, but only consider a random subset of variables (columns) at each step. How many columns we consider at each node-splitting step is a hyperparameter that requires tuning.

In summary,

1. Build a tree using a bootstrapped dataset.
2. Only consider a random subset of variables at each step.
3. Repeat 1 & 2 until we have X trees, which together form the forest.
4. The random forest predicts by aggregating the results of all trees and taking the majority vote.
5. Use the out-of-bag data (data that didn't go into creating a given tree) to evaluate the accuracy of the forest.
6. Tune the hyperparameter by repeating steps 1 to 5.

Why is it called _bagging_? Because it **b**ootstraps the data and **agg**regates the results.
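The bookkeeping behind these steps can be sketched with the standard library alone. This is only a rough illustration: it does not train any trees, and the function names, the row and column counts, and the seed are all assumptions made for the example.

```python
import random
from collections import Counter


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (same size as the original)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def out_of_bag(n_rows, sampled):
    """Rows that were never drawn, so this tree never saw them."""
    chosen = set(sampled)
    return [i for i in range(n_rows) if i not in chosen]


def majority_vote(votes):
    """Aggregate the per-tree predictions into one label."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_rows, n_features = 10, 9  # hypothetical dataset shape

sampled = bootstrap_indices(n_rows, rng)  # rows used to grow one tree
oob = out_of_bag(n_rows, sampled)  # rows held out for evaluating that tree

# Random subset of columns considered at a single split.
# k is the hyperparameter that needs tuning; 3 is an arbitrary choice.
features_at_split = rng.sample(range(n_features), k=3)

# Pretend three trees voted on one image.
print(majority_vote(["dog", "cat", "dog"]))
```

Because rows are drawn with replacement, some rows typically appear several times in `sampled` while others land in `oob`, which is what makes out-of-bag evaluation possible without a separate validation split.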

### AdaBoost

AdaBoost is short for adaptive boosting. It focuses on classification problems and aims to convert a set of weak classifiers into a strong one. 

$$
F(x) = \text{sign}\left(\sum_{m=1}^{M} \theta_{m} f_{m}(x)\right)
$$

$f_m$ is a weak learner, $\theta_m$ is the corresponding weight, and $M$ is the number of weak learners.
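As a small sketch of how this weighted vote behaves, the snippet below combines three hand-written weak learners that each output $+1$ or $-1$. The thresholds, the weights in `thetas`, and the choice to break a tie at zero in favor of $+1$ are all assumptions for illustration.

```python
def sign(value):
    """Return +1 or -1; a total of exactly 0 goes to +1 (arbitrary choice)."""
    return 1 if value >= 0 else -1


def ensemble_predict(x, learners, thetas):
    """Compute sign(sum_m theta_m * f_m(x)) for one input x."""
    total = sum(theta * f(x) for f, theta in zip(learners, thetas))
    return sign(total)


# Three hypothetical weak learners, each looking at a single feature of x.
learners = [
    lambda x: 1 if x[0] > 2.0 else -1,
    lambda x: 1 if x[1] > 0.5 else -1,
    lambda x: 1 if x[0] > 5.0 else -1,
]
thetas = [0.9, 0.4, 0.3]  # made-up weights; the first learner has the most say

# The first learner votes +1, the other two vote -1,
# but 0.9 - 0.4 - 0.3 is still positive, so the ensemble says +1.
prediction = ensemble_predict((3.0, 0.1), learners, thetas)
print(prediction)
```

With equal weights the same input would be classified as $-1$ by a two-to-one vote, which is the difference between a plain majority vote and a weighted one.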

In a forest of trees made with AdaBoost, the trees are usually just a node with two leaves. We call this a **stump**. Stumps are not great at making accurate classifications because a stump can only use one variable to make a decision. Also, the trees of a regular random forest vote with equal weight. In contrast, some stumps of an AdaBoosted forest get more say in the final classification than others.

Order matters when we construct stumps for an AdaBoosted forest. The errors that the first stump makes influence how the second stump is made.

Key ideas,

1. AdaBoost combines a lot of weak learners to make classifications.
  The weak learners are almost always stumps.
2. Some stumps get more say in the classification than others.
3. Each stump is made by taking the previous stump's mistakes into 
  account.
  
To take the previous stump's mistakes into account, we initially assign each sample (data point) a weight. Each time a stump misclassifies a sample, the weights are adjusted accordingly. More mathematical details will come later.
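Until those details are written up, here is a rough sketch using the commonly taught AdaBoost update: a stump's say is $\frac{1}{2}\ln\frac{1 - \text{error}}{\text{error}}$, misclassified samples are scaled by $e^{\text{say}}$, correctly classified ones by $e^{-\text{say}}$, and the weights are then normalized. Whether these notes will end up using exactly this formulation is an assumption, and the function names and the eight-sample example are invented for illustration.

```python
import math


def amount_of_say(total_error, eps=1e-10):
    """Say of a stump: 0.5 * ln((1 - error) / error)."""
    # Clamp so that an error of exactly 0 or 1 does not divide by zero.
    total_error = min(max(total_error, eps), 1 - eps)
    return 0.5 * math.log((1 - total_error) / total_error)


def update_weights(weights, misclassified, say):
    """Scale weights up for mistakes, down otherwise, then normalize to sum to 1."""
    scaled = [
        w * math.exp(say if wrong else -say)
        for w, wrong in zip(weights, misclassified)
    ]
    total = sum(scaled)
    return [w / total for w in scaled]


# Eight samples with equal starting weights; the stump gets only the last one wrong.
weights = [1 / 8] * 8
misclassified = [False] * 7 + [True]

# Total error is the sum of the weights of the misclassified samples.
total_error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
say = amount_of_say(total_error)
new_weights = update_weights(weights, misclassified, say)
print(say)
print(new_weights)
```

Under this update the misclassified samples end up holding half of the total weight, so the next stump is pushed hard toward fixing them.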

TODO:
1. Mathematical formulation for stump "say value"
2. Mathematical formulation for sample weights

When we make the second stump, if we have a weighted Gini function, we use it with the sample weights. Otherwise, we use the sample weights to draw a new dataset that reflects them, sampling with replacement so that heavier samples appear more often.
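The second option, drawing a new dataset from the weights, can be sketched with `random.choices` from the standard library. The sample ids, the weights, and the step that resets every weight to be equal afterwards are assumptions made for this example.

```python
import random


def resample_by_weight(samples, weights, rng):
    """Draw a same-size dataset with replacement, proportional to the weights."""
    return rng.choices(samples, weights=weights, k=len(samples))


rng = random.Random(0)  # fixed seed so the sketch is repeatable

samples = list(range(8))  # stand-ins for eight rows of training data
weights = [0.5 / 7] * 7 + [0.5]  # hypothetical: the last row was misclassified

new_dataset = resample_by_weight(samples, weights, rng)

# Assumed convention: after resampling, every row starts again with equal weight.
reset_weights = [1 / len(new_dataset)] * len(new_dataset)
print(new_dataset)
```

The heavily weighted row tends to show up several times in `new_dataset`, so an unweighted Gini computation on the new dataset treats it roughly the way a weighted one would on the original.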