# Gradient Boost 

- Gradient Boost is very similar to the AdaBoost algorithm

AdaBoost:
- AdaBoost builds an initial stump and tries to predict the outcome. Based on the errors made by the first stump, subsequent stumps are created, each one giving more weight to the samples that the previous stumps got wrong.
- After building multiple such stumps, each having its own weight, we predict the outcome with a weighted vote (a weighted average of the predictions made by each stump).

Gradient Boost:
- In Gradient Boost, we start with a single leaf node as the predictor; it holds one initial guess of the target for all the samples.
- In regression (with squared error loss), this first value is the average of the target values
- After this step, a tree is built (larger than a stump, but restricted in size) considering the errors made by the previous predictions
- The errors are computed using (observed - predicted) = pseudo residual
- These pseudo residuals get computed for every sample
- After computing the pseudo residuals, we build a tree using all our features which predicts the pseudo residual values
- Now, to predict an output, we first predict it using the first tree (the initial leaf), then using the second (the residual tree), and add the two values
        Eg: If first tree predicts 78, and second tree predicts 12.2, our final prediction would be 90.2
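
The steps above can be sketched for a single boosting round as follows. This is a minimal illustration, not the way any particular library implements it: the toy data is made up, and scikit-learn's `DecisionTreeRegressor` with `max_depth=1` is just a convenient stand-in for the residual tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data (made up for illustration): one feature, one numeric target.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([10.0, 12.0, 15.0, 30.0, 32.0, 35.0])

# Step 1: the initial leaf predicts the average target for every sample.
initial_prediction = np.full_like(y, y.mean())

# Step 2: pseudo residual = observed - predicted, one value per sample.
residuals = y - initial_prediction

# Step 3: fit a small tree to the residuals (not to y itself).
residual_tree = DecisionTreeRegressor(max_depth=1, random_state=0)
residual_tree.fit(X, residuals)

# Step 4: add the residual tree's output to the initial prediction.
combined = initial_prediction + residual_tree.predict(X)

mse_initial = np.mean((y - initial_prediction) ** 2)
mse_combined = np.mean((y - combined) ** 2)
print(mse_initial, mse_combined)
```

The combined prediction has a lower training error than the initial leaf alone, because the residual tree corrects part of what the leaf got wrong.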

#### NOTE: Directly adding the two values is usually not an effective solution, as we might end up with a model which overfits our training data
- To solve such issues we introduce a learning_rate, which scales down the contribution of each residual tree
        Eg: new prediction = old prediction + learning_rate * residual tree prediction
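
As a small worked example of the update rule, the sketch below applies it to the numbers used earlier (78 from the first tree, 12.2 from the residual tree). The function name `boosted_prediction` and the learning rate of 0.1 are assumptions chosen for illustration.

```python
def boosted_prediction(previous_prediction, residual_tree_prediction, learning_rate=0.1):
    """Take a scaled step from the previous prediction towards the residual tree's correction."""
    return previous_prediction + learning_rate * residual_tree_prediction

# learning_rate=1.0 is the same as adding the two values directly.
naive = boosted_prediction(78.0, 12.2, learning_rate=1.0)   # about 90.2

# A smaller learning rate only takes a small step in the same direction.
damped = boosted_prediction(78.0, 12.2, learning_rate=0.1)  # about 79.22

print(naive, damped)
```

With a small learning rate, each tree contributes only a little, so many trees are needed, but the final model tends to generalize better.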

- As we get the new predicted values, we can again compute a new round of residuals, and those can be used to build further trees
#### NOTE: If multiple samples end up in the same leaf of the residual tree, we average their residuals and put that single value in the leaf
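
The averaging described in this note can be checked with a regression tree from scikit-learn, whose leaves (with the default squared error criterion) store the mean of the training targets that fall into them. The residual values below are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up feature values and pseudo residuals.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
residuals = np.array([-4.0, -2.0, -3.0, 5.0, 7.0])

# A depth-1 tree has two leaves, so several samples share each leaf.
tree = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, residuals)

# apply() returns the id of the leaf each sample lands in.
leaf_ids = tree.apply(X)

leaf_checks = []
for leaf in np.unique(leaf_ids):
    in_leaf = leaf_ids == leaf
    leaf_value = tree.predict(X[in_leaf])[0]   # the single value stored in this leaf
    mean_residual = residuals[in_leaf].mean()  # average of the residuals in this leaf
    leaf_checks.append((leaf_value, mean_residual))
    print(leaf, residuals[in_leaf], leaf_value)
```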

- After building the second residual tree, we scale the predictions from both residual trees by the learning rate and add the results to the prediction of the initial leaf

This process continues until:
1. we reach the maximum number of trees (or)
2. new trees don't improve the residuals by much
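
Putting the pieces together, the whole regression procedure with both stopping conditions might look like the sketch below. The function names, the `min_improvement` threshold and the synthetic data are all assumptions for illustration; scikit-learn's `GradientBoostingRegressor` is the practical choice and differs in many details.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boost(X, y, learning_rate=0.1, max_trees=100,
                       min_improvement=1e-4, max_depth=2):
    """Fit an initial leaf plus a sequence of residual trees (squared error loss)."""
    base = float(np.mean(y))                 # initial leaf: average of the target
    prediction = np.full(len(y), base)
    trees = []
    previous_mse = np.mean((y - prediction) ** 2)
    for _ in range(max_trees):               # stop 1: maximum number of trees
        residuals = y - prediction           # pseudo residuals
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)
        mse = np.mean((y - prediction) ** 2)
        if previous_mse - mse < min_improvement:  # stop 2: little improvement
            break
        previous_mse = mse
    return base, learning_rate, trees

def predict_gradient_boost(X, model):
    """Initial leaf plus the learning-rate-scaled output of every residual tree."""
    base, learning_rate, trees = model
    prediction = np.full(X.shape[0], base)
    for tree in trees:
        prediction = prediction + learning_rate * tree.predict(X)
    return prediction

# Synthetic data, made up for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

model = fit_gradient_boost(X, y)
baseline_mse = np.mean((y - y.mean()) ** 2)
boosted_mse = np.mean((y - predict_gradient_boost(X, model)) ** 2)
print(len(model[2]), baseline_mse, boosted_mse)
```

Here the improvement is measured on the training data for simplicity; in practice a held-out validation set is the more reliable signal.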

### Gradient Boost: Classification

- Gradient Boost for classification works in almost the same way as the Regression algorithm, but there are a few subtle differences
- In the Regression case, we used the average of the target values for our first leaf. In Classification, we calculate log(odds) instead (i.e. if 4 outputs are yes and 2 are no, then log(odds) = log(4/2) ≈ 0.7, using the natural log)
- As the leaf holds a log(odds) value, we convert it into a probability with the logistic function to determine the classification output.
        probability of yes class = e^log_odds / (1 + e^log_odds)
- Based on the probability value we can determine the output class by setting a threshold for classes (Eg: > 50% == yes)
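
The initial leaf and the conversion to a probability can be worked through in a few lines. The helper names below are hypothetical; the counts (4 yes, 2 no) and the 50% threshold come from the notes above.

```python
import math

def initial_log_odds(n_yes, n_no):
    """log(odds) of the yes class, using the natural log."""
    return math.log(n_yes / n_no)

def log_odds_to_probability(log_odds):
    """Logistic function: e^log_odds / (1 + e^log_odds)."""
    return math.exp(log_odds) / (1.0 + math.exp(log_odds))

log_odds = initial_log_odds(4, 2)            # log(2), about 0.69
p_yes = log_odds_to_probability(log_odds)    # 2/3, about 0.67
predicted_class = "yes" if p_yes > 0.5 else "no"

print(log_odds, p_yes, predicted_class)
```

Both numbers round to 0.7 in this example, which is a coincidence of the chosen counts: the log(odds) and the probability are different quantities.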

- Using the leaf generated by the above method, we can determine the predicted probability for all the samples.
- Using the predicted and original values we can determine the pseudo residuals (observed - predicted probability). Here, assume there are 2 classes, yes and no, with 4 yes samples and 2 no samples, and the predicted probability for all of them is 0.7. The yes samples are given an observed value of 1 and the no samples a value of 0, so each yes sample has a residual of 1 - 0.7 = 0.3 and each no sample has a residual of 0 - 0.7 = -0.7. The total absolute residual is 4*(1 - 0.7) + 2*(0.7 - 0) = 2.6
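
The residual arithmetic for this example can be reproduced with NumPy. This is only a check of the numbers above, with yes encoded as 1 and no as 0 and the rounded probability of 0.7 used for every sample.

```python
import numpy as np

# 4 yes samples (1) and 2 no samples (0).
observed = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0])

# The initial leaf gives every sample the same predicted probability.
predicted_probability = np.full(observed.shape, 0.7)

# Pseudo residual = observed - predicted probability.
residuals = observed - predicted_probability   # 0.3 for yes, -0.7 for no

# Sum of the absolute residuals: 4 * 0.3 + 2 * 0.7.
total_absolute_residual = np.abs(residuals).sum()
print(residuals, total_absolute_residual)
```

The next tree is then fit to these signed residuals, in the same way as in the regression case.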

# Source:
- https://www.youtube.com/watch?v=3CC4N4z3GJc
- https://www.youtube.com/watch?v=2xudPOBz-vs
- https://www.youtube.com/watch?v=jxuNLH5dXCs
- https://www.youtube.com/watch?v=StWY5QWMXCw
- https://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting

# Example

### Gradient Boost Regressor

In [1]:
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

In [2]:
X, y = make_friedman1(n_samples=1200, random_state=0, noise=1.0)
X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

In [3]:
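# loss='squared_error' is the least-squares loss (it was named 'ls' in scikit-learn releases before 1.0)
est = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=1, random_state=0, loss='squared_error').fit(X_train, y_train)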
mean_squared_error(y_test, est.predict(X_test))

5.009154859960321
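
To see how the test error changes as trees are added (the "new trees don't improve by much" stopping idea from the notes), one option is `staged_predict`, which yields the ensemble's prediction after each boosting stage. The sketch below repeats the setup above so it runs on its own; picking the stage with the lowest test error is only for illustration, since a separate validation set would normally be used for that.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

X, y = make_friedman1(n_samples=1200, random_state=0, noise=1.0)
X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

est = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                max_depth=1, random_state=0).fit(X_train, y_train)

# One test-set error per boosting stage (after 1 tree, 2 trees, ...).
test_errors = [mean_squared_error(y_test, y_pred)
               for y_pred in est.staged_predict(X_test)]

# Number of trees at which the test error was lowest.
best_n_trees = int(np.argmin(test_errors)) + 1
print(test_errors[0], test_errors[-1], best_n_trees)
```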

### Gradient Boost Classifier

In [4]:
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

In [5]:
X, y = make_hastie_10_2(random_state=0)
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]

In [6]:
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0).fit(X_train, y_train)
clf.score(X_test, y_test)

0.913
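
The log(odds) to probability step from the classification notes can also be observed on the fitted classifier. As far as I know, for a binary problem with the default loss, `decision_function` returns the raw log(odds) score and `predict_proba` is the logistic function of that score; the sketch below checks this numerically rather than taking it for granted. It repeats the setup above so it runs on its own.

```python
import numpy as np
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_hastie_10_2(random_state=0)
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
                                 max_depth=1, random_state=0).fit(X_train, y_train)

# Raw score of the positive class (last entry of clf.classes_) for a few test samples.
log_odds = clf.decision_function(X_test[:5])

# Logistic function applied by hand: 1 / (1 + e^-log_odds).
manual_probability = 1.0 / (1.0 + np.exp(-log_odds))

# Probability of the positive class as reported by the model.
model_probability = clf.predict_proba(X_test[:5])[:, 1]

print(np.allclose(manual_probability, model_probability))
```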