# Extreme gradient boosted trees (XGBoost) for classification (and regression)

Let's take a look at how to train an `xgboost` model for classification in Python using the Titanic dataset from last week. First, load the Titanic data:

In [33]:
import pandas as pd
import numpy as np # we'll need this later!

# Read the data
df = pd.read_csv('data/titanic.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


The `xgboost` library is not installed by default in `Anaconda`, so you will need to use `pip` to install it:

In [34]:
%pip install xgboost

Note: you may need to restart the kernel to use updated packages.


We can now import the xgboost library in the typical way:

In [None]:
import xgboost as xgb

## Data preprocessing and feature selection

Compared to the `RandomForestClassifier()` model, we need to do considerably less preprocessing in advance of model training (e.g., no imputation for missing data). However, `xgboost` will still complain if you try to pass `pandas` "objects" (i.e., strings) directly to your model. As such, let's recode the `Sex` variable to be an integer:

In [35]:
# Preprocess Sex
df['female'] = (df['Sex'] == 'female').astype(int)
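
As a quick sanity check, the sketch below applies the same recoding to a small made-up data frame (the rows are hypothetical, not taken from the Titanic file) and then confirms that no `object` columns remain among the features. The names `toy` and `toy_features` are only for this illustration.

```python
import pandas as pd

# Hypothetical toy rows, for illustration only
toy = pd.DataFrame({'Sex': ['male', 'female', 'female'],
                    'Age': [22.0, 38.0, None]})

# Same recoding as above: 1 if female, 0 otherwise
toy['female'] = (toy['Sex'] == 'female').astype(int)

# List any remaining string ("object") columns among the model features
toy_features = ['Age', 'female']
object_cols = toy[toy_features].select_dtypes(include='object').columns.tolist()

print(toy[toy_features].dtypes)
print(f'Object columns among features: {object_cols}')
```

Note that the missing `Age` value is left as `NaN` here, in line with the point above that no imputation is needed before training.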

Next, in order to compare `xgboost` to our `RandomForestClassifier()` from last week, let's use the same features (**note**: again, you could include all available variables in the dataset if you want to!):

In [111]:
# select features
features = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'female']
y = 'Survived'

### Splitting into training and testing sets

Again, to facilitate comparison, let's split our data into **training** and **testing** sets in the exact same way that we did last week:

In [36]:
# split data into training and test sets
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

## XGBoost in Python

We are now ready to fit our model. Given that we are interested in binary classification (survived vs. not survived), we start by setting up an `XGBClassifier` using a `binary:logistic` objective function:

In [37]:
xgb_model = xgb.XGBClassifier(objective='binary:logistic')

As with `sklearn` models, we can train this model (using default hyperparameters) using the `fit()` method:

In [38]:
xgb_model.fit(df_train[features], df_train[y])

And we can assess out-of-sample performance in the usual way:

In [40]:
# Import the metrics from sklearn
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

# Make predictions
y_pred = xgb_model.predict(df_test[features])

# Calculate precision
precision = precision_score(df_test[y], y_pred)
print(f'Precision is {precision}')

# Calculate recall
recall = recall_score(df_test[y], y_pred)
print(f'Recall is {recall}')

# Calculate F1 score
f1 = f1_score(df_test[y], y_pred)
print(f'F1 score is {f1}')

Precision is 0.7534246575342466
Recall is 0.7432432432432432
F1 score is 0.7482993197278911
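
To make these three numbers concrete, here is a small worked example that recomputes them from confusion-matrix counts. The counts (55 true positives, 18 false positives, 19 false negatives) are an assumption: they are inferred from the ratios printed above rather than read from the notebook.

```python
# Confusion-matrix counts inferred from the ratios printed above (an assumption)
tp, fp, fn = 55, 18, 19

precision = tp / (tp + fp)  # share of predicted survivors who actually survived
recall = tp / (tp + fn)     # share of actual survivors that the model found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f'Precision is {precision}')
print(f'Recall is {recall}')
print(f'F1 score is {f1}')
```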


## Hyperparameter tuning and Bayesian optimization

So far, we've used the `XGBClassifier` default hyperparameters. As with our random forest classifier, it's easy to change these hyperparameters using the `xgboost` library:

In [44]:
# Let's lower the learning rate
xgb_model = xgb.XGBClassifier(objective='binary:logistic', learning_rate = .05)

# Fit the model
xgb_model.fit(df_train[features], df_train[y])

# Make predictions
y_pred = xgb_model.predict(df_test[features])

# Calculate precision
precision = precision_score(df_test[y], y_pred)
print(f'Precision is {precision}')

# Calculate recall
recall = recall_score(df_test[y], y_pred)
print(f'Recall is {recall}')

# Calculate F1 score
f1 = f1_score(df_test[y], y_pred)
print(f'F1 score is {f1}')

Precision is 0.8333333333333334
Recall is 0.7432432432432432
F1 score is 0.7857142857142858
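
To build some intuition for what `learning_rate` does, here is a minimal sketch of gradient boosting for squared-error regression using one-split trees ("stumps") on synthetic data. It is a toy illustration and not how `xgboost` is implemented; the function names `fit_stump` and `boost` and the data are made up for this example.

```python
import numpy as np

def fit_stump(x, residual):
    # Try every split point and keep the one with the smallest squared error
    best = None
    for threshold in np.unique(x)[:-1]:
        left = x <= threshold
        left_value = residual[left].mean()
        right_value = residual[~left].mean()
        sse = (((residual[left] - left_value) ** 2).sum()
               + ((residual[~left] - right_value) ** 2).sum())
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]

def boost(x, y, n_rounds, learning_rate):
    # Start from the mean, then repeatedly fit a stump to the current residuals
    prediction = np.full_like(y, y.mean())
    for _ in range(n_rounds):
        threshold, left_value, right_value = fit_stump(x, y - prediction)
        update = np.where(x <= threshold, left_value, right_value)
        # The learning rate shrinks each tree's contribution
        prediction = prediction + learning_rate * update
    return prediction

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = np.sin(6 * x) + rng.normal(0, 0.2, 200)

baseline_mse = np.mean((y - y.mean()) ** 2)
train_mse = {lr: np.mean((y - boost(x, y, 20, lr)) ** 2) for lr in (1.0, 0.05)}

print(f'Training MSE when predicting the mean: {baseline_mse:.4f}')
for lr, mse in train_mse.items():
    print(f'Training MSE after 20 rounds with learning_rate={lr}: {mse:.4f}')
```

A smaller learning rate takes smaller steps per tree, so it typically needs more trees to fit the training data equally well. This is why `learning_rate` and `n_estimators` are usually tuned together.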


Lowering the learning rate alone already helped performance: precision rose from about 0.75 to 0.83 and F1 from about 0.75 to 0.79, while recall was unchanged. We could try to find an even better solution using either `sklearn`'s `GridSearchCV` or `RandomizedSearchCV` as demonstrated last week. However, `GridSearchCV` quickly becomes impractical if you have a large "hyperparameter space", and `RandomizedSearchCV` can be extremely inefficient because it does not learn from the combinations it has already tried. And the benefit of `xgboost` is its flexibility: there are many hyperparameters to choose from in order to find a model suitable for your data. What's a budding data scientist to do?

The answer: **Bayesian optimization**. The details of Bayesian optimization are quite complex and probably not necessary for us to understand at this stage. In a nutshell, Bayesian optimization works as follows:

1. Start with a handful of initial samples of different hyperparameter combinations and calculate the performance (i.e., the "function" that you want to optimize) for each combination. (**Note**: you can think of this as a small `RandomizedSearchCV`.)
2. Fit a model mapping these initial hyperparameter values to performance. (**Specifically**, we fit what's called a **Gaussian process regression**.)
3. Use this model -- constructing what's called an **acquisition function** -- to choose the next hyperparameter values to try, i.e., values that the model expects to perform well or about which it is still very uncertain. Note that this is where things start to get really confusing!
4. Repeat until you are satisfied.
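
The four steps above can be sketched in a few lines of `numpy`. This is a toy illustration under several assumptions: a made-up one-dimensional `loss` function stands in for model performance, the Gaussian process uses a simple RBF kernel with a fixed length scale, and the acquisition function is a "lower confidence bound" (predicted loss minus a multiple of the uncertainty). It is not what any particular library does internally.

```python
import numpy as np

def loss(x):
    # Hypothetical "validation loss" as a function of a single hyperparameter
    return (x - 0.3) ** 2 + 0.1 * np.sin(15 * x)

def rbf_kernel(a, b, length_scale=0.1):
    # Squared-exponential kernel between two 1-D arrays of points
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_new, noise=1e-6):
    # Posterior mean and standard deviation of a GP fitted to standardized losses
    y_mean, y_scale = y_obs.mean(), y_obs.std() + 1e-12
    z = (y_obs - y_mean) / y_scale
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_on = rbf_kernel(x_obs, x_new)
    mean = k_on.T @ np.linalg.solve(k_oo, z)
    var = 1.0 - np.sum(k_on * np.linalg.solve(k_oo, k_on), axis=0)
    std = np.sqrt(np.clip(var, 0.0, None))
    return y_mean + y_scale * mean, y_scale * std

rng = np.random.default_rng(0)
grid = np.linspace(0, 1, 201)  # candidate hyperparameter values

# Step 1: a handful of random initial evaluations
x_obs = rng.uniform(0, 1, size=4)
y_obs = loss(x_obs)

for _ in range(10):
    # Step 2: fit the surrogate model to everything observed so far
    mean, std = gp_posterior(x_obs, y_obs, grid)
    # Step 3: acquisition function (low predicted loss or high uncertainty)
    acquisition = mean - 2.0 * std
    x_next = grid[np.argmin(acquisition)]
    # Step 4: evaluate the chosen point and repeat
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, loss(x_next))

best = np.argmin(y_obs)
print(f'Best value found: x = {x_obs[best]:.3f}, loss = {y_obs[best]:.4f}')
```

Each new evaluation is chosen using everything learned so far, which is the key difference from a random search.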

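This is my best attempt at making a complicated procedure simple! However, the details are less important (at least for us) than the implementation, and the implementation in Python is pretty easy. (**Note**: the `tpe.suggest` algorithm that we use below replaces the Gaussian process in step 2 with a Tree-structured Parzen Estimator, but the overall loop is the same.) There are a number of different libraries for Bayesian optimization in Python, but we are going to focus on the `hyperopt` library: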

In [None]:
%pip install hyperopt

Load the necessary functions:

In [53]:
from hyperopt import fmin, tpe, hp, STATUS_OK
from hyperopt.pyll.base import scope # for controlling data types


Similar to the `sklearn` functions, we start by setting up a dictionary with our "hyperparameter space":

In [56]:
space = {
    'max_depth': scope.int(hp.quniform('max_depth', 1, 15, 1)),
    'min_child_weight':  scope.int(hp.quniform('min_child_weight', 1, 15, 1)),
    'learning_rate': hp.loguniform('learning_rate', -5, -2),
    'subsample': hp.uniform('subsample', 0.5, 1),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.1, 1),
    'n_estimators':  scope.int(hp.quniform('n_estimators', 100, 1000, 1))
}
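
To get a feel for the values this space can produce, the sketch below mimics two of the distributions with the standard library, following the definitions in the `hyperopt` documentation: `hp.loguniform(label, low, high)` draws `exp(uniform(low, high))`, and `hp.quniform(label, low, high, q)` draws `round(uniform(low, high) / q) * q`. The helper names are made up for this illustration and are not part of `hyperopt`.

```python
import math
import random

random.seed(0)

def sample_loguniform(low, high):
    # exp(uniform(low, high)): uniform on the log scale
    return math.exp(random.uniform(low, high))

def sample_quniform(low, high, q):
    # round(uniform(low, high) / q) * q: uniform, then rounded to a multiple of q
    return float(round(random.uniform(low, high) / q) * q)

learning_rates = [sample_loguniform(-5, -2) for _ in range(1000)]
max_depths = [sample_quniform(1, 15, 1) for _ in range(1000)]

print(f'learning_rate range: {min(learning_rates):.4f} to {max(learning_rates):.4f}')
print(f'max_depth values: {sorted(set(max_depths))}')
```

So `learning_rate` is searched between `exp(-5)` (about 0.0067) and `exp(-2)` (about 0.135), with as much attention paid to small values as to large ones. The quantized draws are whole numbers stored as floats, which is why they are wrapped in `scope.int()` in the space above.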

The difference, however, is that we use various [probability distributions](http://hyperopt.github.io/hyperopt/getting-started/search_spaces/) to define the space (instead of hard-coding specific values). Next, we need to define the objective function that we want to optimize:

In [57]:
# Define the objective function to minimize
def objective(params):
    # Fit a model with the sampled hyperparameters
    xgb_model = xgb.XGBClassifier(objective='binary:logistic', **params)
    xgb_model.fit(df_train[features], df_train[y])
    # Score the model on the held-out data
    y_pred = xgb_model.predict(df_test[features])
    score = f1_score(df_test[y], y_pred)
    # fmin() minimizes, so return the negative F1 score as the loss
    return {'loss': -score, 'status': STATUS_OK}
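
One thing to keep in mind: because the objective above scores each candidate on `df_test`, the test set influences which hyperparameters are chosen, so the final test score will tend to look a little optimistic. A common alternative is to score each candidate with cross-validation on the training data only. Below is a minimal sketch of that idea under stated assumptions: it uses synthetic data and `sklearn`'s `GradientBoostingClassifier` as a stand-in model so that it runs without `xgboost` or `hyperopt`. The names `X_demo`, `y_demo`, and `cv_objective` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data (hypothetical, for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

def cv_objective(params):
    # Stand-in model; an sklearn-compatible classifier such as
    # xgb.XGBClassifier(objective='binary:logistic', **params) could be used instead
    model = GradientBoostingClassifier(random_state=0, **params)
    # Mean F1 over 3 folds of the training data; the test set is never touched
    score = cross_val_score(model, X_demo, y_demo, scoring='f1', cv=3).mean()
    # 'ok' is the value of hyperopt's STATUS_OK constant
    return {'loss': -score, 'status': 'ok'}

result = cv_objective({'max_depth': 3, 'learning_rate': 0.05, 'n_estimators': 50})
print(result)
```

The returned dictionary has the same shape as the one above, so a function like this could be passed to `fmin()` in the same way.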

Lastly, we use the `fmin()` function to minimize this objective:

In [58]:
best_params = fmin(objective, space, algo=tpe.suggest, max_evals=100)
print("Best set of hyperparameters: ", best_params)

100%|██████████| 100/100 [02:47<00:00,  1.68s/trial, best loss: -0.8137931034482759]
Best set of hyperparameters:  {'colsample_bytree': 0.6341553831562514, 'learning_rate': 0.029367273109818785, 'max_depth': 7.0, 'min_child_weight': 5.0, 'n_estimators': 890.0, 'subsample': 0.7805899657927483}


In [72]:
xgb_model

Annoyingly, `fmin()` returns floating point numbers for parameters that we need to cast as integers. So we need to quickly turn these floats back into `int`:

In [67]:
for key in best_params:
    if key in ['max_depth', 'min_child_weight', 'n_estimators']:
        best_params[key] = int(best_params[key])

print("Best set of hyperparameters: ", best_params)

## Cross-validation with `xgboost`

While the `xgboost` library has a built-in function (i.e., `xgboost.cv()`) for cross-validation, the list of metrics that are available to monitor performance is quite limited, and I prefer the flexibility of using `sklearn` to perform cross-validation "manually". Let's see how to do this. First, we need to import the `KFold` class from `sklearn`:

In [70]:
from sklearn.model_selection import KFold

In [80]:
# Define the classifier
clf = xgb.XGBClassifier(objective='binary:logistic', **best_params)

# Get the k folds
kf = KFold(n_splits=10, shuffle = True, random_state=50)

# Loop over folds and calculate performance measure
results = []
for k, (train_idx, test_idx) in enumerate(kf.split(df[features])):
    # Fit model
    cfit = clf.fit(df[features].iloc[train_idx], df[y].iloc[train_idx])
    
    # Get predictions
    y_pred = cfit.predict(df[features].iloc[test_idx])
    
    # Write results
    result = {'fold': k,
              'precision': precision_score(df[y].iloc[test_idx], y_pred),
              'recall': recall_score(df[y].iloc[test_idx], y_pred),
              'f1': f1_score(df[y].iloc[test_idx], y_pred)}
    # If we want to monitor progress
    print(result)
              
    results.append(result)
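
To see what a shuffled k-fold split is doing, here is a small `numpy` sketch that produces train/test index pairs in the same spirit as `KFold(shuffle=True)`. It is a conceptual illustration only: the function name `kfold_indices` is made up, and it will not reproduce the exact folds that `sklearn` generates.

```python
import numpy as np

def kfold_indices(n_rows, n_splits, seed):
    # Shuffle the row positions, then cut them into n_splits nearly equal test folds
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    folds = np.array_split(shuffled, n_splits)
    for k, test_idx in enumerate(folds):
        # Everything outside the current test fold is used for training
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        yield train_idx, test_idx

splits = list(kfold_indices(n_rows=25, n_splits=5, seed=50))
for k, (train_idx, test_idx) in enumerate(splits):
    print(f'fold {k}: {len(train_idx)} training rows, {len(test_idx)} test rows')
```

Every row lands in exactly one test fold, so each observation is used for evaluation once and for training in all other folds.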

In [85]:
# View results
pd.DataFrame(results)

Unnamed: 0,fold,precision,recall,f1
0,0,0.9375,0.697674,0.8
1,1,0.724138,0.6,0.65625
2,2,0.741935,0.821429,0.779661
3,3,0.916667,0.666667,0.77193
4,4,0.882353,0.882353,0.882353
5,5,0.741935,0.71875,0.730159
6,6,0.727273,0.705882,0.716418
7,7,0.705882,0.685714,0.695652
8,8,0.794118,0.72973,0.760563
9,9,0.956522,0.709677,0.814815


In [86]:
# Average precision
print(f'Average precision is {np.mean([x["precision"] for x in results])}')

# Average recall
print(f'Average recall is {np.mean([x["recall"] for x in results])}')

# Average F1
print(f'Average F1 is {np.mean([x["f1"] for x in results])}')

Average precision is 0.8128322973022717
Average recall is 0.7217876385616391
Average F1 is 0.7607800792303067
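
Averages hide how much performance varies from fold to fold, so it can also be worth looking at the spread. The sketch below does this for the F1 scores, using the rounded per-fold values copied from the results table above (so the mean matches the printed average only up to rounding).

```python
import numpy as np

# F1 scores per fold, copied from the results table above
f1_by_fold = np.array([0.8, 0.65625, 0.779661, 0.77193, 0.882353,
                       0.730159, 0.716418, 0.695652, 0.760563, 0.814815])

mean_f1 = f1_by_fold.mean()
std_f1 = f1_by_fold.std(ddof=1)  # sample standard deviation across folds

print(f'Mean F1 is {mean_f1:.3f}')
print(f'Standard deviation of F1 across folds is {std_f1:.3f}')
print(f'Lowest fold: {f1_by_fold.min():.3f}, highest fold: {f1_by_fold.max():.3f}')
```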


## Using our final model in "production"

So we have a model that we think is pretty good -- now what? We can now use our model to predict new data by, for instance, creating a "would you have survived the Titanic?" app. The steps for reusing our "final" model include:

1. Fit the final model using **all** the data.
2. `pickle` the model for later use

And then when you want to predict a new observation, you:

3. Read in the data and format it **exactly** how the training data was formatted. For us, that means reading in our data and storing it as a `pandas` dataframe with the following variables: ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'female']. (The outcome, `Survived`, is not needed to make a prediction.)

First, we fit and save the model:

In [119]:
# Define the classifier
clf = xgb.XGBClassifier(objective='binary:logistic', **best_params)

# Fit on all data
cfit = clf.fit(df[features], df[y])

# Save the model (the `with` block makes sure the file is closed afterwards)
import pickle
with open('xgb_model.pkl', 'wb') as f:
    pickle.dump(cfit, f)

We can then load the model as follows:

In [120]:
with open('xgb_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

In [121]:
# Create a new example observation as a dictionary with the variable names as keys
new_obs = {'PassengerId': 1,
            'Survived': 0,
            'Pclass': 1,
            'Name': 'Braverman, Suella',
            'Sex': 'Female', 
            'Age': 43.0,
            'SibSp': 0,
            'Parch': 0,
            'Ticket': 'A/5 21171',
            'Fare': 7.25,
            'Cabin': '',
            'Embarked': 'S',
            'female': 1}

# Convert to a dataframe
df_new_obs = pd.DataFrame([new_obs])

# Make a prediction
prob = loaded_model.predict_proba(df_new_obs[features])
print(f'Probability of survival is {prob[0][1]}')


Probability of survival is 0.6893494725227356
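
In an app, the raw input would not arrive with a ready-made `female` column, so the recoding from the preprocessing step has to be repeated before predicting. Here is a minimal sketch of a helper that does this. The names `prepare_observation` and `raw_obs` are hypothetical, and the lower-casing is an added assumption to cope with inputs such as 'Female'.

```python
import pandas as pd

FEATURES = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'female']

def prepare_observation(raw):
    # Build a one-row data frame from a dictionary of raw passenger fields
    row = pd.DataFrame([raw])
    # Recode Sex the same way as in training, ignoring case and stray spaces
    row['female'] = (row['Sex'].str.strip().str.lower() == 'female').astype(int)
    # Keep only the model features, in the same order as in training
    return row[FEATURES]

raw_obs = {'Pclass': 1, 'Sex': 'Female', 'Age': 43.0,
           'SibSp': 0, 'Parch': 0, 'Fare': 7.25}
print(prepare_observation(raw_obs))
```

The result could then be passed to the model, e.g. `loaded_model.predict_proba(prepare_observation(raw_obs))`.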


## Regression using `xgboost`

Solving regression problems with `xgboost` follows the same basic syntax. Let's start by importing our housing price data that we used to demonstrate regression analysis in Python:

In [87]:
data = pd.read_csv('data/housing.csv')
data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


And split the data into training and testing sets:

In [88]:
data_train, data_test = train_test_split(data, test_size=.3, random_state=42)
data_train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
135,136,20,RL,80.0,10400,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,5,2008,WD,Normal,174000
1452,1453,180,RM,35.0,3675,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2006,WD,Normal,145000
762,763,60,FV,72.0,8640,Pave,,Reg,Lvl,AllPub,...,0,,,,0,6,2010,Con,Normal,215200
932,933,20,RL,84.0,11670,Pave,,IR1,Lvl,AllPub,...,0,,,,0,3,2007,WD,Normal,320000
435,436,60,RL,43.0,10667,Pave,,IR2,Lvl,AllPub,...,0,,,,0,4,2009,ConLw,Normal,212000


In [95]:
from xgboost import XGBRegressor

Grab some features:

In [122]:
y = 'SalePrice'
features = ['LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'GrLivArea', 'FullBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'GarageCars', 'GarageArea']



Set up an `XGBRegressor()` model:

In [123]:
model = XGBRegressor(objective='reg:squarederror')

Train the model and get out-of-sample predictions:

In [124]:
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Fit the model
model.fit(data_train[features], data_train[y])

# Make predictions
y_pred = model.predict(data_test[features])

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(data_test[y], y_pred))
print(f'RMSE is {rmse}')

# Calculate R^2
r2 = r2_score(data_test[y], y_pred)
print(f'R^2 is {r2}')

# Calculate MAE
mae = mean_absolute_error(data_test[y], y_pred)
print(f'MAE is {mae}')

RMSE is 30918.93316626797
R^2 is 0.8630026392750184
MAE is 20326.75792843893
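
For reference, here is what these three metrics compute, worked through by hand with `numpy` on three made-up values (the numbers are hypothetical and chosen so that the arithmetic is easy to follow).

```python
import numpy as np

# Hypothetical true values and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

errors = y_true - y_hat  # 1, 0, -2

# RMSE: square root of the mean squared error
rmse_by_hand = np.sqrt(np.mean(errors ** 2))

# MAE: mean of the absolute errors
mae_by_hand = np.mean(np.abs(errors))

# R^2: 1 minus (squared error of the model / squared error of predicting the mean)
r2_by_hand = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f'RMSE is {rmse_by_hand}')
print(f'MAE is {mae_by_hand}')
print(f'R^2 is {r2_by_hand}')
```

RMSE and MAE are both in the units of the outcome (here, sale price), but RMSE penalizes large errors more heavily, which is why it is larger than MAE in the output above.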


That's it!

## The `pandas` library, redux

We've discussed `pandas` each week and used a number of `pandas` functions. However, it is worth your time to stop this week and take a closer look at `pandas`. I highly recommend the following two resources:

1. https://www.udemy.com/course/python-pandas-for-your-grandpa/?ranMID=39197&ranEAID=JVFxdTr9V80&ranSiteID=JVFxdTr9V80-LVUoSPJ.mdrAlF15115KEQ&LSNPUBID=JVFxdTr9V80&utm_source=aff-campaign&utm_medium=udemyads (my all time favorite!)
2. Any of the other free courses listed here: https://medium.com/javarevisited/5-best-free-pandas-courses-for-beginners-in-2022-d7dbe017b90c

And if you prefer to read:

3. https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
4. https://www.w3schools.com/python/pandas/default.asp
