# XGBoost (Extreme Gradient Boosted Decision Trees)

Our next notebook looks at a popular library called `XGBoost`, which stands for "Extreme Gradient Boosting". Like CART, it is built on decision trees.  With random forests, we take a large number of decision trees and aggregate their answers.  XGBoost starts from the same premise:  create an ensemble of trees.  The key difference is that XGBoost uses a technique called "boosting":  trees are built in sequence, and each new tree tries to correct the mistakes of the trees that came before it.

In other words, we build the first model off of training data and see how it does.  Then, the second model takes the first model's results as an input and tries to correct what the first model got wrong.  Then, the third model tries to improve upon the second model, and so on until either all training predictions are correct or we have reached a maximum number of classifier models (i.e., a maximum number of iterations).
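To make the "each model corrects the previous one" idea concrete, here is a minimal sketch of boosting written by hand.  It is not how XGBoost is implemented:  it assumes a toy one-feature regression problem, uses one-split trees ("stumps"), and fits each stump to the current residuals.  The function names, the synthetic data, and the `learning_rate` value are all made up for illustration.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on x whose two-sided means best fit the residual."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_rounds=20, learning_rate=0.3):
    """Build stumps in sequence; each one is trained on what is still wrong."""
    prediction = np.full(len(y), y.mean())  # round zero: predict the average
    stumps, errors = [], []
    for _ in range(n_rounds):
        residual = y - prediction            # the mistakes made so far
        stump = fit_stump(x, residual)       # the next model targets those mistakes
        prediction = prediction + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
        errors.append(float(np.mean((y - prediction) ** 2)))
    return stumps, errors

# Synthetic data, purely for illustration
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(0, 0.1, size=200)

stumps, errors = boost(x, y)
print(f"Training MSE after round 1:  {errors[0]:.4f}")
print(f"Training MSE after round 20: {errors[-1]:.4f}")
```

The training error shrinks as rounds are added, which is also why boosted models can overfit if we allow too many rounds or overly deep trees.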

If you do not already have the package, install it with `conda install xgboost` or `pip install xgboost`.

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import xgboost as xgb

## Load Data

For this demo, we will load a dataset of individuals and whether they have a high chance of heart attack (output = 1).

In [None]:
heart_attack_data = "../data/HeartAttackData.csv"
df = pd.read_csv(heart_attack_data, header=0)

# Review the data
df

These measures aren't very self-explanatory, so let's explain them here.  These are the same explanations that we saw in the CART notebook, but they're included again here for clarity.

- `age` = Age of patient
- `sex` = Sex of the patient (0 = female, 1 = male)
- `cp` = Type of chest pain.
  - 1 = Typical angina
  - 2 = Atypical angina
  - 3 = Non-anginal pain
  - 4 = Asymptomatic
- `trtbps` = Resting blood pressure (mm Hg)
- `chol` = Cholesterol level
- `fbs` = Fasting blood sugar above 120 mg/dl
- `restecg` = Resting ECG result
  - 0 = Normal
  - 1 = ST-T wave abnormality
  - 2 = Probable or definite left ventricular hypertrophy
- `thalachh` = Maximum heart rate achieved
- `exng` = Exercise-induced angina (1 = yes, 0 = no)
- `oldpeak` = ST depression induced by exercise relative to rest
- `slp` = Slope of the peak exercise ST segment
- `caa` = Number of major vessels (0-3)
- `thall` = Thallium Stress Test result (ranges from 0-3)
- `output` = Diagnosis of heart disease (0 = < 50% diameter narrowing, 1 = > 50% diameter narrowing)


## Split Labels from Features

Let's now create two variables:  `y`, which is the thing we want to predict (output: `{ 0, 1 }`); and `X`, which is everything we can use to predict the specific value of `y`.

Selecting and dropping columns in pandas preserves row order, so `X` and `y` stay aligned row by row.  This is something we might have to worry about if we split the data up in SQL, where row order is not guaranteed.

In [None]:
y = df['output']
X = df.drop('output', axis=1)

## Split into Training & Test Datasets

The sklearn library has a function called `train_test_split` which breaks our data out into training and test datasets.  This allows us to train a model on one set of data and then see how it would perform on a completely different set of data.  This gives us a better idea of how our model might perform than simply using accuracy from the training dataset, as models tend to **overfit**:  they latch onto the peculiarities of the training dataset.  If those peculiarities do not also exist in the broader population, then the trained model may come up with the wrong answer.  Having a separate test dataset that the trained model knows nothing about gives us a better idea of realistic behavior.  It also allows us to come up with a measure of how much overfitting the trained model does, as we can compare the training accuracy to the test accuracy; if there is a substantial difference between the two, our model is overfitting quite a bit.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1740)
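The training-versus-test comparison described above can be sketched as a small helper.  The name `accuracy_gap` and the `MemorizingModel` class are hypothetical, invented for this illustration; the toy model simply memorizes its training rows, which makes it an extreme case of overfitting.  The demo data is random noise, not the heart attack dataset.

```python
import numpy as np

def accuracy_gap(model, X_train, y_train, X_test, y_test):
    """Return (training accuracy, test accuracy, gap) for any model with predict()."""
    train_acc = float(np.mean(np.asarray(model.predict(X_train)) == np.asarray(y_train)))
    test_acc = float(np.mean(np.asarray(model.predict(X_test)) == np.asarray(y_test)))
    return train_acc, test_acc, train_acc - test_acc

class MemorizingModel:
    """Toy stand-in for an overfit model: recalls training rows, guesses 0 otherwise."""
    def fit(self, X, y):
        rows = np.asarray(X).tolist()
        labels = np.asarray(y).tolist()
        self.lookup = {tuple(row): label for row, label in zip(rows, labels)}
        return self

    def predict(self, X):
        return np.array([self.lookup.get(tuple(row), 0) for row in np.asarray(X).tolist()])

# Random features and random labels: there is nothing real to learn here
rng = np.random.default_rng(1740)
X_demo = rng.normal(size=(200, 3))
y_demo = rng.integers(0, 2, size=200)
X_tr, X_te = X_demo[:140], X_demo[140:]
y_tr, y_te = y_demo[:140], y_demo[140:]

model = MemorizingModel().fit(X_tr, y_tr)
train_acc, test_acc, gap = accuracy_gap(model, X_tr, y_tr, X_te, y_te)
print(f"train={train_acc:.3f}  test={test_acc:.3f}  gap={gap:.3f}")
```

Because the helper only relies on `predict()`, it should also work with the classifier trained later in this notebook, e.g. `accuracy_gap(clf, X_train, y_train, X_test, y_test)`.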

## Perform Classification

We'll train the model on our training data, ignoring the test data for now.  `XGBClassifier` follows the sklearn estimator interface, so this is easy:  use the `fit()` method.

In [None]:
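# max_depth caps the depth of each tree; n_estimators is the number of boosting rounds (trees)
# use_label_encoder=False only matters on older XGBoost releases; newer ones deprecate or ignore it
clf = xgb.XGBClassifier(max_depth=5, n_estimators=45, use_label_encoder=False, eval_metric='logloss')
clf = clf.fit(X_train, y_train)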

## How'd we do?

Let's use the `accuracy_score` function in sklearn to see just how well we did.

In [None]:
predicted = clf.predict(X_test)
accuracy_score(y_test, predicted)

Our accuracy score is now 83.5%, whereas for CART, it was 75.8%.  This ties the random forest classifier that we saw in the prior notebook.

In practice, XGBoost is typically one of the better-performing algorithms, although random forest can often give it a run for its money.
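Accuracy is simply the share of predictions that match the true labels, so it does not tell us which kind of mistake the model makes.  As a rough sketch, the helper below (a hypothetical `confusion_counts` function, not part of sklearn or XGBoost) breaks predictions into true/false positives and negatives.  The labels are made up for illustration and are not taken from the heart attack dataset.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Count each combination of true label and predicted label for a 0/1 problem."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return {
        "tp": int(np.sum((y_true == 1) & (y_pred == 1))),  # predicted 1, actually 1
        "tn": int(np.sum((y_true == 0) & (y_pred == 0))),  # predicted 0, actually 0
        "fp": int(np.sum((y_true == 0) & (y_pred == 1))),  # predicted 1, actually 0
        "fn": int(np.sum((y_true == 1) & (y_pred == 0))),  # predicted 0, actually 1
    }

# Hypothetical labels and predictions
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true_demo, y_pred_demo)
accuracy = (counts["tp"] + counts["tn"]) / len(y_true_demo)
print(counts)     # {'tp': 3, 'tn': 3, 'fp': 1, 'fn': 1}
print(accuracy)   # 0.75
```

With the notebook's own variables, the equivalent call would be `confusion_counts(y_test, predicted)`.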


## Viewing feature importance

Instead of looking at trees, let's see a plot of feature importance.  This is built into the `XGBoost` library.

In [None]:
xgb.plot_importance(clf, max_num_features=7)

Based on our training dataset, the two most important factors in determining whether a person is likely to present with a heart attack are age and cholesterol level.  After that, it looks like the maximum heart rate achieved (`thalachh`) is the next most important factor.
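It helps to know what this plot is counting.  By default, `plot_importance` uses the "weight" importance type, which is the number of times a feature is used to split across all trees.  The sketch below reproduces that counting on two tiny hand-written trees.  The nested-dictionary layout and the `count_splits` helper are assumptions made for illustration; they are not XGBoost's internal representation.

```python
from collections import Counter

def count_splits(node, counts):
    """Walk one tree and count how many times each feature is used to split."""
    if "feature" not in node:  # leaf node: nothing to count
        return
    counts[node["feature"]] += 1
    count_splits(node["left"], counts)
    count_splits(node["right"], counts)

# Two hypothetical trees; feature names borrowed from the dataset, values invented
toy_trees = [
    {"feature": "age",
     "left": {"feature": "chol", "left": {"leaf": 0.2}, "right": {"leaf": -0.1}},
     "right": {"leaf": 0.4}},
    {"feature": "thalachh",
     "left": {"leaf": -0.3},
     "right": {"feature": "age", "left": {"leaf": 0.1}, "right": {"leaf": 0.5}}},
]

weights = Counter()
for tree in toy_trees:
    count_splits(tree, weights)

print(weights.most_common())  # [('age', 2), ('chol', 1), ('thalachh', 1)]
```

Counting splits tends to favor continuous features such as age and cholesterol, since they offer many possible thresholds.  Passing `importance_type='gain'` to `xgb.plot_importance` ranks features by how much their splits improve the model instead, which may give a different ordering.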
