This section develops templates for XGBoost models. The templates can later serve as starting points for building XGBoost classifiers and regressors.

## XGBoost - Classification Template

In [1]:
import pandas as pd
import numpy as np
from sklearn import datasets

In [3]:
iris = datasets.load_iris()
iris.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [4]:
# Scikit-Learn datasets are stored as NumPy arrays
print(f"Dataset shape: {iris.data.shape}")
print(f"Feature names: {iris.feature_names}")
print(f"Target names: {iris.target_names}")

Dataset shape: (150, 4)
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']


In [15]:
# Stack the feature matrix and the target into one table, one column per feature
df = pd.DataFrame(
    data=np.c_[iris.data, iris.target],
    columns=iris.feature_names + ['target']
)
df.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0.0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0.0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0.0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0.0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0.0 |
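The `target` column shows `0.0` rather than `0` because `np.c_` stacks the float feature matrix and the integer labels into a single array with one shared dtype. The sketch below reproduces that effect on two made-up rows; the values and variable names are illustrative only.

```python
import numpy as np

# Two made-up feature rows (float) and their integer class labels.
features = np.array([[5.1, 3.5], [4.9, 3.0]])
labels = np.array([0, 1])

# np.c_ concatenates along the second axis. The result has a single dtype,
# so the integer labels are upcast to float64.
combined = np.c_[features, labels]

print(combined.dtype)   # float64
print(combined[:, -1])  # [0. 1.]

# Casting the last column back recovers integer labels if they are needed.
labels_again = combined[:, -1].astype(int)
```

If integer class labels are preferred downstream, a cast like the last line can be applied to the `target` column.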


In [16]:
from sklearn.model_selection import train_test_split

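# Features are every column except the last; the last column is the target.
# No test_size is given, so scikit-learn's default split is used.
X_train, X_test, y_train, y_test = train_test_split(
    df.iloc[:, :-1], df.iloc[:, -1],
    random_state=2
)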
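Because no `test_size` is passed, `train_test_split` falls back to its default of holding out 25 percent of the rows. The arithmetic below is a sketch of how that default translates into row counts for the 150 Iris samples, on the understanding that scikit-learn rounds the test count up. It is not the library's own code.

```python
import math

n_samples = 150           # rows in the Iris dataset
default_test_size = 0.25  # scikit-learn's default when no sizes are passed

# The test count is rounded up and the training set gets the remainder.
n_test = math.ceil(default_test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 112 38
```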

In [17]:
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

xgb_cls = XGBClassifier(
    booster='gbtree', objective='multi:softprob', 
    max_depth=6, learning_rate=0.1, n_estimators=100, 
    random_state=2, n_jobs=-1
)

- `booster='gbtree'`: The booster is the base learner, that is, the machine learning model constructed during every round of boosting. *gbtree* stands for gradient boosted tree.

- `objective='multi:softprob'`: This objective is the standard alternative to *binary:logistic* when the dataset includes **multiple classes**. It produces one probability per class. If the objective is not stated explicitly, XGBoost will often find the right one for you.

- `max_depth=6`: Sets the maximum depth of each tree, which limits how many successive splits a tree can make. XGBoost uses a default of 6.

- `learning_rate=0.1`: Within XGBoost, this hyperparameter is often referred to as **eta**. It limits variance by shrinking the contribution of each tree to the given fraction. The value used here is lower than XGBoost's default of 0.3.

- `n_estimators=100`: The number of boosted trees in the model. Increasing this number while decreasing *learning_rate* can lead to more robust results.
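The interplay between `learning_rate` and `n_estimators` can be illustrated with a toy booster. The sketch below is a simplified pure-Python stand-in, not XGBoost's algorithm: it fits depth-1 trees (stumps) to squared-error residuals on made-up one-dimensional data. The helper names `fit_stump` and `boost` are hypothetical.

```python
def fit_stump(x, residuals):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # keep both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def boost(x, y, learning_rate, n_estimators):
    """Fit n_estimators stumps in sequence and return the training MSE."""
    predictions = [sum(y) / len(y)] * len(y)  # start from the mean of y
    for _ in range(n_estimators):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        threshold, left_value, right_value = fit_stump(x, residuals)
        # Shrinkage: only a fraction of each stump's output is added.
        predictions = [
            pi + learning_rate * (left_value if xi <= threshold else right_value)
            for xi, pi in zip(x, predictions)
        ]
    return sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y)


# Made-up data for illustration.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.5, 1.2, 3.8, 4.1, 3.9, 6.0, 6.3]

for learning_rate, n_estimators in [(1.0, 5), (0.1, 5), (0.1, 100)]:
    mse = boost(x, y, learning_rate, n_estimators)
    print(f"learning_rate={learning_rate}, n_estimators={n_estimators}: MSE={mse:.4f}")
```

With a small learning rate, a handful of trees barely moves the predictions, so more trees are needed to reach the same training error. The sketch reports training error only; the robustness benefit of smaller steps shows up on held-out data.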

In [19]:
import warnings
warnings.filterwarnings('ignore')

xgb_cls.fit(X_train, y_train)

y_pred = xgb_cls.predict(X_test)

score = accuracy_score(y_test, y_pred)
print(f"Score: {score}")

Score: 0.9736842105263158


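An initial score of **97.4** percent on the Iris dataset, which corresponds to 37 of the 38 test samples being classified correctly, is very good for hyperparameters that are close to XGBoost's defaults.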
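To make the path from the `multi:softprob` objective to an accuracy score concrete, the sketch below turns per-class scores into probabilities with a softmax, picks the most probable class as the label, and computes accuracy as the fraction of correct labels. The raw scores and true labels are invented for illustration, and the `softmax` helper is a plain-Python stand-in rather than XGBoost code.

```python
import math


def softmax(scores):
    """Turn raw per-class scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for four test samples and three classes.
raw_scores = [
    [2.5, 0.3, -1.0],
    [-0.8, 1.9, 0.4],
    [-1.2, 0.2, 2.1],
    [0.1, 0.9, 1.0],
]
true_labels = [0, 1, 2, 1]

probabilities = [softmax(row) for row in raw_scores]
# The predicted label is the index of the most probable class.
predicted = [row.index(max(row)) for row in probabilities]

# Accuracy is the fraction of predictions that match the true labels.
correct = sum(p == t for p, t in zip(predicted, true_labels))
accuracy = correct / len(true_labels)

print(predicted)  # [0, 1, 2, 2]
print(accuracy)   # 0.75

# The same arithmetic applied to 37 correct predictions out of 38.
print(round(37 / 38, 4))  # 0.9737
```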

## XGBoost - Regression Template

In [26]:
X, y = datasets.load_diabetes(return_X_y=True)

X.shape

(442, 10)

In [27]:
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

xgb_reg = XGBRegressor(
    booster='gbtree', objective='reg:squarederror', 
    max_depth=6, learning_rate=0.1, n_estimators=100,
    random_state=2, n_jobs=-1
)

In [28]:
scores = cross_val_score(xgb_reg, X, y, scoring="neg_mean_squared_error", cv=5)

rmse = np.sqrt(-scores)
print(f"RMSE: {np.round(rmse, 3)}")
print(f"RMSE mean: {np.round(rmse.mean(), 3)}")

RMSE: [63.033 59.689 64.538 63.699 64.661]
RMSE mean: 63.124
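scikit-learn scorers follow a higher-is-better convention, which is why the metric is named `neg_mean_squared_error` and why the scores are negated before taking the square root. A minimal sketch of that conversion, using made-up fold scores:

```python
import math

# Hypothetical per-fold scores as "neg_mean_squared_error" would report them.
neg_mse_scores = [-3600.0, -4225.0, -3969.0]

# Flip the sign to recover the MSE, then take the square root for the RMSE,
# which is expressed in the same units as the target.
rmse_per_fold = [math.sqrt(-score) for score in neg_mse_scores]
rmse_mean = sum(rmse_per_fold) / len(rmse_per_fold)

print(rmse_per_fold)        # [60.0, 65.0, 63.0]
print(round(rmse_mean, 3))  # 62.667
```

Averaging the per-fold RMSE values, as done above, is not the same as taking the square root of the average MSE; the two differ slightly whenever the folds have different errors.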


Without a baseline for comparison, it is hard to tell what that score means. A useful reference is the spread of the target column, y, which can be inspected by converting it into a pandas DataFrame and calling `describe()`:

In [29]:
pd.DataFrame(y).describe()

|   | 0 |
|---|---|
| count | 442.0 |
| mean | 152.133484 |
| std | 77.093005 |
| min | 25.0 |
| 25% | 87.0 |
| 50% | 140.5 |
| 75% | 211.5 |
| max | 346.0 |


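An RMSE of **63.124** is less than one standard deviation of the target (77.093), so the model's typical error is smaller than the spread of the target values around their mean, which is a respectable result.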
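The standard deviation works as a baseline because a model that always predicts the mean of the target has an RMSE equal to the target's population standard deviation. The sketch below checks this on a small made-up target list; the values are illustrative and the check is in-sample, whereas the RMSE reported earlier is cross-validated.

```python
import math
import statistics

# Hypothetical target values; any numeric list works.
y = [25.0, 87.0, 140.0, 152.0, 211.0, 346.0]

# Baseline model: predict the mean of y for every sample.
baseline_prediction = statistics.mean(y)
baseline_rmse = math.sqrt(
    sum((value - baseline_prediction) ** 2 for value in y) / len(y)
)

# The baseline RMSE equals the population standard deviation (ddof=0).
population_std = statistics.pstdev(y)
# pandas' describe() reports the sample standard deviation (ddof=1),
# which is slightly larger on a finite dataset.
sample_std = statistics.stdev(y)

print(round(baseline_rmse, 3), round(population_std, 3), round(sample_std, 3))
```

An RMSE below this baseline indicates that the model captures some structure beyond the mean of the target.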