# Ensemble Methods - Boosting

## Objectives

- Describe boosting algorithms
- Implement boosting models with `sklearn` 
- Implement boosting models with `XGBoost`

# Intro

One of the problems with using single decision trees and random forests is that, once I make a split, I can't go back and consider how another feature varies across the whole dataset. But suppose I were to consider **my tree's errors**. The fundamental idea of ***boosting*** is to start with a **weak learner** and then to use information about its errors to build a new model that can supplement the original model.

Though the individual learners are weak, the idea is to train iteratively in order to produce a better predictor. More specifically, the first learner will be trained on the data as it stands, but future learners will be trained on modified versions of the data. The point of the modifications is to highlight the "hard-to-predict-accurately" portions of the data.

## Enter: Boosting 

Boosting is a sequential process in which each model attempts to correct the errors of the model before it, so every model depends on its predecessors. The steps below walk through how boosting works.

1. A subset is created from the original dataset.
2. Initially, all data points are given equal weights.
3. A base model is created on this subset.
4. This model is used to make predictions on the whole dataset.
![](https://cdn.analyticsvidhya.com/wp-content/uploads/2015/11/dd1-e1526989432375.png)


5. Errors are calculated using the actual values and predicted values.
6. Observations that are predicted incorrectly are given higher weights (here, the three misclassified blue-plus points will be given higher weights).
7. Another model is created and predictions are made on the dataset (this model tries to correct the errors from the previous model)
![](https://cdn.analyticsvidhya.com/wp-content/uploads/2015/11/dd2-e1526989487878.png)


8. Similarly, multiple models are created, each correcting the errors of the previous model.
9. The final model (strong learner) is the weighted mean of all the models (weak learners).
![](https://www.analyticsvidhya.com/wp-content/uploads/2015/11/boosting10-300x205.png)



Thus, the boosting algorithm combines a number of weak learners to form a strong learner. The individual models would not perform well on the entire dataset, but each works well for some part of it, so each model boosts the performance of the ensemble.
![](https://cdn.analyticsvidhya.com/wp-content/uploads/2015/11/dd4-e1526551014644.png)
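To make steps 5 and 6 concrete, here is a tiny worked example of a single reweighting round. The labels and predictions are made up, and the update rule is the classic AdaBoost one, which is an assumption here since the steps above don't commit to a particular formula.

```python
import numpy as np

# Ten toy labels in {-1, +1} and one weak learner's predictions (both made up)
y_true = np.array([1, 1, 1, -1, -1, -1, 1, -1, 1, -1])
y_pred = np.array([1, 1, -1, -1, -1, 1, 1, -1, -1, -1])  # three mistakes

# Step 2: every point starts with the same weight
weights = np.full(len(y_true), 1 / len(y_true))

# Step 5: weighted error of this learner
misclassified = y_pred != y_true
error = weights[misclassified].sum()

# The learner's say in the final vote: larger when its error is smaller
alpha = 0.5 * np.log((1 - error) / error)

# Step 6: upweight the mistakes, downweight the hits, then renormalize
new_weights = weights * np.exp(-alpha * y_true * y_pred)
new_weights /= new_weights.sum()

print(f"error = {error:.2f}, alpha = {alpha:.3f}")
print("new weight of a misclassified point:", new_weights[misclassified][0].round(4))
print("new weight of a correct point:      ", new_weights[~misclassified][0].round(4))
print("total weight now on the mistakes:   ", new_weights[misclassified].sum().round(4))
```

With this particular rule, the misclassified points end up carrying half of the total weight, so the next learner can't do well by repeating the same mistakes.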

## Set Up

Scenario: predict whether a car was made in the US.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import precision_score, recall_score, confusion_matrix

from sklearn.tree import DecisionTreeClassifier

In [None]:
df = pd.read_csv('data/cars.csv', na_values = ' ')
df['target'] = df[' brand'] == ' US.'
X = df.drop(['target', ' brand'], axis=1)
y = df['target']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42)

In [None]:
X_train.head()

In [None]:
si = SimpleImputer()

si.fit(X_train)

X_tr_im = si.transform(X_train)
X_te_im = si.transform(X_test)

### Baseline Understanding

In [None]:
# Model-less baseline: proportion of each class in the training target
y_train.value_counts(normalize=True)

In [None]:
# Decision tree with max_depth = 5 (random_state fixed for reproducibility)
dt_baseline = DecisionTreeClassifier(max_depth=5, random_state=42)

dt_baseline.fit(X_tr_im, y_train)

print(f"Train Score: {dt_baseline.score(X_tr_im, y_train)}")
print(f"Test Score: {dt_baseline.score(X_te_im, y_test)}")

## AdaBoost (Adaptive Boosting)

**AdaBoost** works by iteratively adapting two related series of weights, one attached to the datapoints and the other attached to the learners themselves. Datapoints that are incorrectly classified receive greater weights for the next learner in the sequence. That way, future learners will be more likely to focus on those datapoints. At the end of the sequence, the learners that make better predictions, especially on the datapoints that are more resistant to correct classification, receive more weight in the final "vote" that determines the ensemble's prediction.


### The Steps:

1. Initially, all observations in the dataset are given equal weights.
2. A model is built on a subset of data.
3. Using this model, predictions are made on the whole dataset.
4. Errors are calculated by comparing the predictions and actual values.
5. While creating the next model, higher weights are given to the data points which were predicted incorrectly.
6. Weights can be determined using the error value: the higher the error, the greater the weight assigned to the observation.
7. This process is repeated until the error function does not change, or the maximum limit of the number of estimators is reached.
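The steps above can be sketched from scratch in a few lines. This is a minimal sketch, not scikit-learn's implementation: it assumes binary labels recoded as -1/+1, uses depth-1 trees ("stumps") as the weak learners, passes the weights to each learner via `sample_weight` rather than drawing a subset, and runs on synthetic `make_moons` data. All variable names are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data (a stand-in for a real dataset)
X_demo, y01 = make_moons(n_samples=300, noise=0.25, random_state=42)
y_demo = np.where(y01 == 1, 1, -1)  # relabel the classes as -1 / +1

n_rounds = 50
weights = np.full(len(y_demo), 1 / len(y_demo))  # step 1: equal weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # steps 2-3: fit a weak learner using the current weights, then predict
    stump = DecisionTreeClassifier(max_depth=1, random_state=42)
    stump.fit(X_demo, y_demo, sample_weight=weights)
    pred = stump.predict(X_demo)

    # step 4: weighted error, clipped to avoid dividing by zero
    error = np.clip(weights[pred != y_demo].sum(), 1e-10, 1 - 1e-10)

    # learner weight, then steps 5-6: reweight the observations
    alpha = 0.5 * np.log((1 - error) / error)
    weights = weights * np.exp(-alpha * y_demo * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# final prediction: sign of the alpha-weighted vote of all learners
votes = sum(a * s.predict(X_demo) for a, s in zip(alphas, stumps))
ensemble_pred = np.sign(votes)

first_stump_acc = (stumps[0].predict(X_demo) == y_demo).mean()
ensemble_acc = (ensemble_pred == y_demo).mean()
print(f"First stump train accuracy: {first_stump_acc:.3f}")
print(f"Ensemble train accuracy:    {ensemble_acc:.3f}")
```

Here the loop simply stops after a fixed number of rounds (the second stopping condition in step 7).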


## AdaBoost in Scikit-Learn

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html

In [None]:
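# Import!
from sklearn.ensemble import AdaBoostClassifier

In [None]:
# Instantiate and fit, after exploring hyperparameters
# (default hyperparameters used here as a starting point)
ada = AdaBoostClassifier(random_state=42)
ada.fit(X_tr_im, y_train)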



In [None]:
print(f"Train Score: {ada.score(X_tr_im, y_train)}")
print(f"Test Score: {ada.score(X_te_im, y_test)}")

#### Hyperparameter Tuning

Let's see if we can do better by trying different hyperparameter values:

In [None]:
# Don't really need to do this, but nice to start with a clean slate
ada = AdaBoostClassifier(random_state = 123)

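# Let's define a param grid together!
# (illustrative values, an assumption rather than a recommendation)
param_grid = {
    'n_estimators': [25, 50, 100, 200],
    'learning_rate': [0.1, 0.5, 1.0]
}

# Create our grid search
gs = GridSearchCV(ada, param_grid, cv=5)

# Fit our grid search
gs.fit(X_tr_im, y_train)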

In [None]:
print(f"Train Score: {gs.best_estimator_.score(X_tr_im, y_train)}")
print(f"Test Score: {gs.best_estimator_.score(X_te_im, y_test)}")

In [None]:
# Check out the best parameters found by the search
gs.best_params_

# Gradient Boosting

> Use gradient descent to improve the model

![](images/gradient_boosting_residuals.png)

**Gradient Boosting** works by training each new learner on the residuals of the model built with the learners that have so far been constructed. That is, Model $n+1$ (with $n+1$ learners) will focus on the predictions of Model $n$ (with only $n$ learners) that were **most off the mark**. As the training process repeats, the learners learn and the residuals get smaller. I would get a sequence going:

Model 0 is very simple. Perhaps it merely predicts the mean: $\hat{y}_0 = \bar{y}$;

Model 1's predictions would then be the sum of (i) Model 0's predictions and (ii) the predictions of the model fitted to Model 0's residuals: $\hat{y}_1 = \hat{y}_0 + \widehat{(y - \hat{y}_0)}$;

Now iterate: Model 2's predictions will be the sum of (i) Model 0's predictions, (ii) the predictions of the model fitted to Model 0's residuals, and (iii) the predictions of the model fitted to Model 1's residuals: $\hat{y}_2 = \hat{y}_0 + \widehat{(y - \hat{y}_0)} + \widehat{(y - \hat{y}_1)}$

Et cetera, et cetera!
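The sequence above can be sketched by hand for a regression problem. This is a minimal sketch on made-up data: Model 0 predicts the mean, and each later learner is a shallow tree fitted to the residuals of the model so far. It adds each tree's predictions at full strength, whereas library implementations typically shrink each correction by a learning rate.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up one-feature regression data
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.2, size=200)

# Model 0: predict the mean everywhere
current_pred = np.full_like(y_toy, y_toy.mean())
mse_history = [np.mean((y_toy - current_pred) ** 2)]

trees = []
for _ in range(10):
    residuals = y_toy - current_pred       # what the model so far gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X_toy, residuals)             # fit the next learner to the residuals
    current_pred = current_pred + tree.predict(X_toy)  # Model n+1 = Model n + correction
    trees.append(tree)
    mse_history.append(np.mean((y_toy - current_pred) ** 2))

for n, mse in enumerate(mse_history):
    print(f"Model {n}: train MSE = {mse:.4f}")
```

The training error can only go down (or stay flat) from one model to the next, which is exactly why the number of learners needs to be checked against held-out data.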

$\rightarrow$ How does gradient boosting work for a classification problem? How do we even make sense of the notion of a gradient in that context? The short answer is that we appeal to the probabilities associated with the predictions for the various classes. See more on this topic [here](https://sefiks.com/2018/10/29/a-step-by-step-gradient-boosting-example-for-classification/).

$\rightarrow$ Why is this called "_gradient_ boosting"? Because using a model's residuals to build a new model is using information about the derivative of that model's loss function: for squared-error loss, the residual $y - \hat{y}$ is exactly the negative gradient of $\frac{1}{2}(y - \hat{y})^2$ with respect to the prediction $\hat{y}$. See more on this topic [here](https://www.ritchievink.com/blog/2018/11/19/algorithm-breakdown-why-do-we-call-it-gradient-boosting/).
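Both answers can be checked numerically. The sketch below uses made-up numbers and a finite-difference approximation of the derivative. For squared-error loss the negative gradient matches the residual, and for log loss (taken as a function of the log-odds, which is one common setup and an assumption here) it matches the label minus the predicted probability.

```python
import numpy as np

eps = 1e-6  # step size for the finite-difference derivative

# --- Regression: squared-error loss ---
y_obs = np.array([3.0, -1.0, 2.5, 0.0])   # made-up targets
y_hat = np.array([2.0, 0.5, 2.5, -1.0])   # made-up current predictions

def squared_error(pred):
    # per-observation loss: 1/2 * (y - y_hat)^2
    return 0.5 * (y_obs - pred) ** 2

grad_reg = (squared_error(y_hat + eps) - squared_error(y_hat - eps)) / (2 * eps)
residuals = y_obs - y_hat
print("residuals:        ", residuals)
print("negative gradient:", (-grad_reg).round(6))

# --- Classification: log loss as a function of the log-odds ---
labels = np.array([1.0, 0.0, 1.0, 0.0])    # made-up 0/1 labels
log_odds = np.array([0.3, -1.2, 2.0, 0.8]) # made-up current scores

def log_loss_from_log_odds(z):
    p = 1 / (1 + np.exp(-z))  # sigmoid turns log-odds into probabilities
    return -(labels * np.log(p) + (1 - labels) * np.log(1 - p))

grad_clf = (log_loss_from_log_odds(log_odds + eps)
            - log_loss_from_log_odds(log_odds - eps)) / (2 * eps)
probs = 1 / (1 + np.exp(-log_odds))
print("label - probability:", (labels - probs).round(6))
print("negative gradient:  ", (-grad_clf).round(6))
```

So in the classification case the "residual" each new tree chases is the gap between the true label and the current predicted probability.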

### SKLearn's Gradient Boosting

In [None]:
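# Import!
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
# Instantiate and fit - let's use just a few estimators to start
# (n_estimators=5 is an illustrative choice)
gb_sklearn = GradientBoostingClassifier(max_depth=2,
                                        n_estimators=5,
                                        random_state=123)
gb_sklearn.fit(X_tr_im, y_train)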



In [None]:
print(f"Train Score: {gb_sklearn.score(X_tr_im, y_train)}")
print(f"Test Score: {gb_sklearn.score(X_te_im, y_test)}")

## Comparing gradient boosting with many estimators 

In [None]:
# More estimators
gb_more = GradientBoostingClassifier(max_depth=2,
                                     n_estimators=100,
                                     random_state=123)
gb_more.fit(X_tr_im, y_train)

# Even more estimators
gb_evenmore = GradientBoostingClassifier(max_depth=2,
                                       n_estimators=1000,
                                       random_state=123)
gb_evenmore.fit(X_tr_im, y_train)

In [None]:
print(f"Train Score: {gb_more.score(X_tr_im, y_train)}")
print(f"Test Score: {gb_more.score(X_te_im, y_test)}")

In [None]:
print(f"Train Score: {gb_evenmore.score(X_tr_im, y_train)}")
print(f"Test Score: {gb_evenmore.score(X_te_im, y_test)}")
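Rather than refitting a separate model for each ensemble size, one fitted model can report its score after every added tree. Here is a minimal sketch using `staged_predict` on synthetic data; the dataset, the variable names, and the choice of 300 estimators are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, slightly noisy classification data
X_syn, y_syn = make_classification(n_samples=500, n_features=10,
                                   n_informative=4, flip_y=0.1,
                                   random_state=42)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, random_state=42)

gb_staged = GradientBoostingClassifier(max_depth=2,
                                       n_estimators=300,
                                       random_state=123)
gb_staged.fit(Xs_train, ys_train)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees
train_scores = [accuracy_score(ys_train, p)
                for p in gb_staged.staged_predict(Xs_train)]
test_scores = [accuracy_score(ys_test, p)
               for p in gb_staged.staged_predict(Xs_test)]

best_n = int(np.argmax(test_scores)) + 1
print(f"Best test accuracy {max(test_scores):.3f} at {best_n} estimators")
print(f"Final train accuracy: {train_scores[-1]:.3f}")
print(f"Final test accuracy:  {test_scores[-1]:.3f}")
```

Picking `best_n` from the test set is only for illustration; in practice a validation set or cross-validation should make that choice.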

# XGBoost

From [XGBoost's documentation](https://xgboost.readthedocs.io/):

>_**XGBoost** is an optimized distributed gradient boosting library designed to be highly **efficient**, **flexible** and **portable**. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solve many data science problems in a fast and accurate way. The same code runs on major distributed environment (Hadoop, SGE, MPI) and can solve problems beyond billions of examples._

XGBoost (eXtreme Gradient Boosting) is an advanced implementation of the gradient boosting algorithm. It has proved to be highly effective and is used extensively in machine learning competitions and hackathons. XGBoost has high predictive power and is often considerably faster than other gradient boosting implementations, though the size of the speedup depends on the data and hardware. It also includes several forms of regularization, which help reduce overfitting and improve overall performance. Hence it is also known as a 'regularized boosting' technique.

**Pros of XGBoost:** 

1. Regularization:
    - Standard GBM implementations do not include explicit regularization terms in the objective; XGBoost adds L1 and L2 penalties on the leaf weights.
    - This helps XGBoost reduce overfitting.
    
    
2. Parallel Processing:
    - XGBoost implements parallel processing and is typically faster than a standard GBM.
    - XGBoost also supports implementation on Hadoop.
    
    
3. High Flexibility:
    - XGBoost allows users to define custom optimization objectives and evaluation criteria adding a whole new dimension to the model.
    
    
4. Handling Missing Values:
    - XGBoost has an in-built routine to handle missing values.
    
    
5. Tree Pruning:
    - XGBoost grows each tree up to the specified `max_depth` and then prunes backwards, removing splits that yield no positive gain.
    
    
6. Built-in Cross-Validation:
    - XGBoost allows a user to run cross-validation at each iteration of the boosting process, which makes it easy to find a good number of boosting iterations in a single run.

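Note: XGBoost is not part of scikit-learn! It is a separate package (`xgboost`), although `XGBClassifier` follows the scikit-learn estimator API.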

[Documentation for the Classifier](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier)

In [None]:
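# Import!
import xgboost as xgb

In [None]:
# Instantiate and fit
# (default hyperparameters used here as a starting point)
xgb_model = xgb.XGBClassifier(random_state=42)
xgb_model.fit(X_tr_im, y_train)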



In [None]:
print(f"Train Score: {xgb_model.score(X_tr_im, y_train)}")
print(f"Test Score: {xgb_model.score(X_te_im, y_test)}")

## Discuss: Bagging vs Boosting?

How are they similar?

- 


How are they different?

- 


## Resources

- [Great short podcast on Ensemble Methods](http://lineardigressions.com/episodes/2017/1/22/ensemble-algorithms)
- [Slideshow on bagging and boosting ensemble methods](http://www2.stat.duke.edu/~rcs46/lectures_2017/08-trees/08-tree-advanced.pdf)
- [How to explain Gradient Boosting](https://explained.ai/gradient-boosting/index.html)
- [Thorough post on AdaBoost from Data Camp](https://www.datacamp.com/community/tutorials/adaboost-classifier-python)
- [Complete Guide to AdaBoost for Beginners from Analytics Vidhya](https://www.analyticsvidhya.com/blog/2021/09/adaboost-algorithm-a-complete-guide-for-beginners/)


-----

# Level Up: Move to Regression!

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
import xgboost as xgb  # needed for xgb.XGBRegressor below

### Galaxy Data

In [None]:
galaxies = pd.read_csv('data/COMBO17.csv')
galaxies.head()

This is a dataset about galaxies. The Mcz and MCzml columns are measures of redshift, which is our target. Mcz is usually understood to be a better measure, so that will be our target column. Many of the other columns have to do with various measures of galaxies' magnitudes. For more on the dataset, see [here](https://astrostatistics.psu.edu/datasets/COMBO17.html).

In [None]:
galaxies.columns

In [None]:
galaxies.isnull().sum().sum()

In [None]:
galaxies.info()

In [None]:
galaxies = galaxies.dropna()

Let's collect together the columns that have high correlation with Mcz, our target:

In [None]:
# Compute the correlations once, then keep the columns above the threshold
corr_with_target = galaxies.corr()['Mcz']
preds = corr_with_target[corr_with_target.abs() > 0.5].index.tolist()

In [None]:
galaxies[preds].corr()

These various magnitude columns all have high correlations **with one another**! Let's try a simple model with just the S280MAG column, since it has the highest correlation with Mcz.

In [None]:
galaxies_x = galaxies['S280MAG']
galaxies_y = galaxies['Mcz']

Since we only have one predictor, we can visualize the correlation with the target! We can also reshape it for modeling purposes!

In [None]:
galaxies_x_rev = galaxies_x.values.reshape(-1, 1)
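Why the reshape? Scikit-learn estimators expect a 2-D feature array of shape `(n_samples, n_features)`, while a single DataFrame column comes out 1-D. A small sketch with a made-up column shows the difference, along with an equivalent alternative:

```python
import numpy as np
import pandas as pd

# A made-up single-column predictor, standing in for one DataFrame column
one_feature = pd.Series([1.5, 2.0, 3.5, 4.0], name="feature")

flat = one_feature.values
print(flat.shape)        # 1-D: (n_samples,)

# -1 lets NumPy infer the number of rows; 1 means a single feature column
as_column = flat.reshape(-1, 1)
print(as_column.shape)   # 2-D: (n_samples, 1)

# Equivalent alternative: keep the data as a one-column DataFrame
as_frame = one_feature.to_frame()
print(as_frame.shape)
```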

In [None]:
x_train2, x_test2, y_train2, y_test2 = train_test_split(
    galaxies_x_rev, galaxies_y, random_state=42)

## AdaBoost Regression

In [None]:
abr = AdaBoostRegressor(random_state=42)

abr.fit(x_train2, y_train2)

In [None]:
abr.score(x_test2, y_test2)
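A quick note on what `.score()` means here: for scikit-learn regressors it returns $R^2$, not accuracy. The sketch below confirms this on made-up data and also reports RMSE, which is in the units of the target; all names in it are hypothetical.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Made-up one-feature regression data
rng = np.random.default_rng(0)
X_r = rng.uniform(-3, 3, size=(200, 1))
y_r = X_r[:, 0] ** 2 + rng.normal(scale=0.5, size=200)

demo_abr = AdaBoostRegressor(random_state=42).fit(X_r, y_r)
preds_r = demo_abr.predict(X_r)

print(f".score():   {demo_abr.score(X_r, y_r):.4f}")
print(f"r2_score(): {r2_score(y_r, preds_r):.4f}")
print(f"RMSE:       {np.sqrt(mean_squared_error(y_r, preds_r)):.4f}")
```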

## Gradient Boosted Regression

In [None]:
gbr = GradientBoostingRegressor(random_state=42)

gbr.fit(x_train2, y_train2)

In [None]:
gbr.score(x_test2, y_test2)

## XGBoost Regression

In [None]:
xgbr = xgb.XGBRegressor(random_state=42)

xgbr.fit(x_train2, y_train2)

In [None]:
xgbr.score(x_test2, y_test2)