<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# XGBoost

In [1]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [2]:
# Right now I am getting a useless warning every time I fit an `XGBoost` model.
# This line of code prevents warnings from being displayed. Not generally
# recommended.
warnings.filterwarnings(action='ignore')

In [3]:
%matplotlib inline

"Gradient boosting," like bagging, is a general method for training decision tree ensembles.

XGBoost ("eXtreme Gradient Boosting") is a particular implementation of gradient boosted decision trees. It is popular on Kaggle because it is both fast to train and often gives excellent predictive performance.

## Getting Started with XGBoost

We will use the `xgboost` library instead of scikit-learn for this lesson. Scikit-learn has [`GradientBoostingClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) and [`GradientBoostingRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html) classes, but they lack some of the tricks that have made the `xgboost` library popular.

`xgboost` provides estimators that use the same interface as scikit-learn's, so we will not need to change our approach.

In [None]:
# Install the xgboost library
# !pip install xgboost

In [5]:
# Import the xgboost package
import xgboost as xgb

In [None]:
# Instantiate an XGBoost regressor
xgb_reg = xgb.XGBRegressor()

**Exercise (20 mins., in pairs)**

- Load the Ames housing dataset from `assets/data/ames_train.csv` in this lesson's base directory.

- Create a feature matrix DataFrame `X` containing all of the numeric columns from the Ames dataset except "Id" and the target column "SalePrice". Drop "OverallQual" to make things more interesting -- that variable is very predictive but expensive to collect.

- Create a target vector Series `y` with the values of the variable "SalePrice".

- Do a simple train/test split on `X` and `y`.

- Fit an `XGBRegressor` on the training data.

- Get an R^2 score for the training set.

- Get an R^2 score for the test set.

- Is your model overfitting, underfitting, both, or neither? How do you know?

- Load the Titanic dataset (located in this lesson's `assets/data` directory).

- Create a feature matrix `X` by dropping "PassengerId", "Name", "Ticket", and the target column "Survived". Dummy-code string columns as needed. Do NOT drop or impute missing values -- XGBoost handles them internally!

![](https://media.giphy.com/media/12NUbkX6p4xOO4/giphy.gif)

- Create a target Series `y` with the values of "Survived".

- Instantiate an `XGBClassifier`.

- Fit and score the classifier on the entire dataset.

- Get your classifier's score on held-out data using ten-fold cross-validation on the Titanic dataset, shuffling the rows before taking the folds.

- Is your model overfitting, underfitting, both, or neither? How do you know?

$\blacksquare$

## XGBoost vs. Random Forests

### Similarities

Random forests and XGBoost both produce tree ensembles, and they provide many of the same parameters to reduce overfitting.

### Differences

#### Gradient Boosting vs. Bagging

- Bagging involves training each tree *independently* on a different *bootstrap sample*.
- Gradient boosting involves training each tree *sequentially to reduce the residual errors left by its predecessors*.

See [the official `xgboost` library documentation](https://xgboost.readthedocs.io/en/latest/tutorials/model.html) and Chapter 10 of [Elements of Statistical Learning](https://web.stanford.edu/~hastie/Papers/ESLII.pdf) for details.
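The sequential idea can be sketched in plain Python. This is a deliberately simplified illustration, not XGBoost's actual algorithm: it uses one-split "stumps" on a single made-up feature, squared error only, and no regularization. The function names, the data, and the learning rate of 0.3 are all assumptions chosen for the example.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on xs that minimizes squared error."""
    best = None
    # Skipping the smallest value guarantees both sides of the split are non-empty
    for threshold in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy one-feature dataset (illustrative only)
xs = [i / 10 for i in range(40)]
ys = [(x - 2.0) ** 2 for x in xs]

learning_rate = 0.3
preds = [sum(ys) / len(ys)] * len(xs)  # start from the mean prediction
history = [mse(ys, preds)]

for _ in range(20):
    # Each new stump is fit to the errors left by the ensemble so far
    residuals = [y - p for y, p in zip(ys, preds)]
    stump = fit_stump(xs, residuals)
    preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    history.append(mse(ys, preds))

print(f"MSE before boosting: {history[0]:.3f}")
print(f"MSE after 20 trees:  {history[-1]:.3f}")
```

By contrast, a bagged ensemble would fit every stump to the original targets on a different bootstrap sample and average the results, so no tree would depend on the ones before it.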

#### Handling Missing Values

At each split for a given variable, XGBoost simply learns whether sending items with missing values left or right gives better results. This approach has a few advantages:

- It is automatic.
- Unlike dropping rows or columns, it allows you to use all of the values you do have.
- Unlike imputation, it treats "missing" as its own value rather than replacing it with some other value that might be wrong.
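The idea of learning a direction for missing values can be sketched for a single split. This is a simplified illustration that scores each option with plain squared error, whereas XGBoost scores splits with its own gradient-based gain. The function names, the toy data, and the threshold of 5.0 are assumptions for the example, with `None` standing in for a missing value.

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_missing_direction(xs, ys, threshold):
    """Return (direction, loss) for sending rows with a missing feature left or right."""
    left = [y for x, y in zip(xs, ys) if x is not None and x < threshold]
    right = [y for x, y in zip(xs, ys) if x is not None and x >= threshold]
    missing = [y for x, y in zip(xs, ys) if x is None]

    # Try both options and keep whichever leaves less error
    loss_if_left = sse(left + missing) + sse(right)
    loss_if_right = sse(left) + sse(right + missing)
    if loss_if_left <= loss_if_right:
        return "left", loss_if_left
    return "right", loss_if_right


# Toy data: the rows with a missing feature have targets that look like the right-hand group
xs = [1.0, 2.0, None, 8.0, 9.0, None]
ys = [10.0, 12.0, 31.0, 30.0, 32.0, 29.0]

direction, loss = best_missing_direction(xs, ys, threshold=5.0)
print(direction, round(loss, 2))
```

Here the missing rows are routed right because their targets resemble the rows already on that side; no value is imputed for the feature itself.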

## Tuning XGBoost

`XGBoost` provides many options that you can tune to improve predictive performance.

### `n_estimators` and `learning_rate`

The `learning_rate` controls how "aggressive" each tree is in trying to correct the errors of its predecessors.

- If it is too low, then getting good predictive performance will require a large value for `n_estimators` (and thus a lot of time).
- If it is too high, then the algorithm will keep overshooting the target and won't converge to good results.

Unlike with a random forest, setting `n_estimators` too high can hurt predictive performance with boosting because it leads to overfitting.
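A rough back-of-the-envelope sketch shows why a low `learning_rate` calls for more trees. It rests on an idealized assumption that does not hold exactly in practice: every tree fits the current residual perfectly, so each round removes a `learning_rate` share of whatever error remains. The function name and the 1% tolerance are made up for illustration.

```python
def rounds_needed(learning_rate, tolerance=0.01):
    """Count rounds until the remaining residual fraction drops below tolerance.

    Idealized assumption: each tree fits the current residual perfectly, so
    one round shrinks the remaining error by a factor of (1 - learning_rate).
    """
    remaining = 1.0
    rounds = 0
    while remaining >= tolerance:
        remaining *= 1.0 - learning_rate
        rounds += 1
    return rounds


for lr in (0.3, 0.1, 0.03):
    print(f"learning_rate={lr}: {rounds_needed(lr)} rounds")
```

Under this toy model, cutting the learning rate by a factor of ten raises the number of rounds by roughly the same factor, which is why the two parameters are usually tuned together.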

### Addressing Overfitting

#### Reducing Model Complexity

One way to address overfitting is to restrict model complexity more or less directly. `xgboost` provides many options for this purpose:

- Restricting tree shape
    - `max_depth` / `max_leaves` put a hard limit on the depth or number of leaves in each tree
    - `gamma` is the minimum loss reduction required in order to make another split
    - `min_child_weight` is the minimum sum of observation weights (Hessians) required in each child node in order to make a split; for unweighted squared-error regression, this is simply the number of observations in the child
- Restricting sizes of weights: `reg_lambda` and `reg_alpha` provide L2 and L1 regularization on leaf weights, respectively
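To see how `gamma`, `min_child_weight`, and `reg_lambda` interact, here is a small sketch based on the split-gain formula from the `xgboost` tutorial linked above, where `G` and `H` are the sums of gradients and Hessians in a node. The helper names and the numbers are hypothetical, and `reg_alpha` is left out for simplicity, so treat this as an illustration rather than the library's implementation.

```python
def leaf_score(grad_sum, hess_sum, reg_lambda):
    """Structure score G^2 / (H + lambda) for one leaf."""
    return grad_sum ** 2 / (hess_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from splitting one leaf into two, minus the gamma penalty."""
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    children = (leaf_score(g_left, h_left, reg_lambda)
                + leaf_score(g_right, h_right, reg_lambda))
    return 0.5 * (children - parent) - gamma


def split_allowed(g_left, h_left, g_right, h_right,
                  reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """Keep a split only if both children are heavy enough and the gain is positive."""
    if min(h_left, h_right) < min_child_weight:
        return False
    return split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma) > 0.0


# Hypothetical gradient and Hessian sums for two candidate children.
# With unweighted squared error each row's Hessian is 1, so h_* is a row count.
g_l, h_l, g_r, h_r = -6.0, 4.0, 5.0, 3.0

print("gain, defaults:       ", split_gain(g_l, h_l, g_r, h_r))
print("gain, reg_lambda=100: ", split_gain(g_l, h_l, g_r, h_r, reg_lambda=100.0))
print("allowed, gamma=10:    ", split_allowed(g_l, h_l, g_r, h_r, gamma=10.0))
print("allowed, min_child=4: ", split_allowed(g_l, h_l, g_r, h_r, min_child_weight=4.0))
```

Raising any of the three parameters makes this candidate split less likely to survive, which is the sense in which they all restrict model complexity.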

#### Adding Randomness

Another way to address overfitting when ensembling is to add randomness to the process of training each item in the ensemble.

- `subsample` specifies what proportion of the data is used to train each tree.
- `colsample_bytree` and `colsample_bylevel` specify what proportion of the features are available to each tree and to each depth level within a tree, respectively. (`colsample_bynode` does the same for each individual split.)
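The sampling itself is simple to picture. The sketch below draws the rows and features that each tree would be allowed to see; the helper function, the column names, and the seeds are hypothetical, and the library's own sampling details may differ.

```python
import random


def sample_rows_and_columns(n_rows, feature_names, subsample, colsample_bytree, seed):
    """Pick the row indices and feature names one tree is allowed to see."""
    rng = random.Random(seed)
    n_keep_rows = max(1, round(subsample * n_rows))
    n_keep_cols = max(1, round(colsample_bytree * len(feature_names)))
    rows = sorted(rng.sample(range(n_rows), n_keep_rows))  # sampled without replacement
    cols = sorted(rng.sample(feature_names, n_keep_cols))
    return rows, cols


features = ["age", "fare", "pclass", "sibsp", "parch"]  # hypothetical column names

for tree_index in range(3):
    rows, cols = sample_rows_and_columns(
        n_rows=10,
        feature_names=features,
        subsample=0.8,
        colsample_bytree=0.6,
        seed=tree_index,  # a different seed per tree gives a different view of the data
    )
    print(f"tree {tree_index}: rows={rows}, features={cols}")
```

Because each tree sees a slightly different slice of the data, no single tree can memorize every row or lean on one dominant feature every time.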

### Example

We will use this general approach to tune our model:

- Find the optimal number of trees with default learning rate.
- Tune additional parameters.
- Lower learning rate and increase the number of trees.

#### Find Optimal Number of Trees with Default Learning Rate

In [None]:
# Split data by column


In [None]:
# Instantiate model


In [None]:
# Fit and score on all data


In [None]:
# Score with 5-fold CV


In [None]:
# Vary number of trees


In [None]:
# Fix number of trees


#### Tune Additional Parameters

scikit-learn has a `GridSearchCV` class that will run a model with various hyperparameter combinations and identify the combination that generated the best cross-validation scores.

In [None]:
# Try a few values for "max_depth" and "min_child_weight"


In [None]:
# Find out best parameters and their score


In [None]:
# Try a few values for "subsample"


In [None]:
# Find out best parameters and their score


In [None]:
# Get report on grid search results


The effect of one hyperparameter typically depends on the values of other hyperparameters -- for instance, increasing "max_depth" will have no effect if "min_child_weight" is sufficiently large. For this reason, it is generally valuable to do grid searches over multiple parameters simultaneously, rather than fixing one hyperparameter at a time. However, testing many combinations of many parameters can take a long time.

#### Lower Learning Rate and Increase the Number of Trees

In [None]:
# Divide the learning rate by 10 and vary number of trees


**Exercise (open-ended, in pairs)**

- Create the best XGBoost model you can for the Titanic dataset, as measured by accuracy in five-fold cross-validation. Use `GridSearchCV` at least once to search over at least two hyperparameters at a time.

$\blacksquare$

## Summary

- `XGBoost` is a popular decision tree ensemble algorithm.
- `XGBoost` uses gradient boosting, meaning that each tree attempts to correct the errors of previous trees.
- scikit-learn's `GridSearchCV` helps with testing hyperparameter values.