In this section, we develop templates for XGBoost models. These templates can later serve as starting points for building XGBoost classifiers and regressors.

## XGBoost - Classification Template

In [1]:
import pandas as pd
import numpy as np
from sklearn import datasets

In [3]:
iris = datasets.load_iris()
iris.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [4]:
# Scikit-Learn datasets are stored as NumPy arrays
print(f"Dataset shape: {iris.data.shape}")
print(f"Feature names: {iris.feature_names}")
print(f"Target names: {iris.target_names}")

Dataset shape: (150, 4)
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']


In [15]:
df = pd.DataFrame(
    data=np.c_[iris.data, iris.target],
    columns=iris.feature_names + ['target']
)
df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0.0
1,4.9,3.0,1.4,0.2,0.0
2,4.7,3.2,1.3,0.2,0.0
3,4.6,3.1,1.5,0.2,0.0
4,5.0,3.6,1.4,0.2,0.0


In [16]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.iloc[:, :-1], df.iloc[:, -1],
    random_state=2
)

In [17]:
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

xgb_cls = XGBClassifier(
    booster='gbtree', objective='multi:softprob', 
    max_depth=6, learning_rate=0.1, n_estimators=100, 
    random_state=2, n_jobs=-1
)

- `booster='gbtree'`: The booster is the base learner, the machine learning model that is constructed during every round of boosting. *gbtree* stands for gradient boosted tree.

- `objective='multi:softprob'`: This objective is the standard alternative to *binary:logistic* when the dataset includes **multiple classes**. It outputs a probability for each class. If no objective is stated explicitly, XGBoost will often find the right one for you.

- `max_depth=6`: The maximum depth of each tree, which limits how many splits can occur along any path from the root to a leaf. XGBoost uses a default of 6.

- `learning_rate=0.1`: Within XGBoost, this hyperparameter is often referred to as **eta**. It limits the variance by shrinking the contribution of each tree to the given fraction.

- `n_estimators=100`: The number of boosted trees in the model. Increasing this number while decreasing *learning_rate* can lead to more robust results.
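To make the `multi:softprob` objective more concrete, here is a minimal NumPy sketch of the softmax transformation that such an objective is based on. The raw scores are made-up values for illustration, not output from XGBoost, and the `softmax` helper is a hypothetical name introduced only for this example.

```python
import numpy as np

def softmax(raw_scores):
    """Convert per-class raw scores into probabilities that sum to 1 per row."""
    # Subtracting the row maximum improves numerical stability
    # and does not change the resulting probabilities.
    shifted = raw_scores - raw_scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Hypothetical raw scores for 2 samples and 3 classes
# (setosa, versicolor, virginica)
raw_scores = np.array([
    [2.0, 0.5, -1.0],
    [-0.5, 1.0, 1.5],
])

probabilities = softmax(raw_scores)

# The predicted class is the one with the highest probability
predicted_classes = probabilities.argmax(axis=1)

print(np.round(probabilities, 3))
print(predicted_classes)
```

In a fitted classifier, `predict_proba` returns per-class probabilities of this kind, while `predict` returns the most probable class.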

In [19]:
import warnings
warnings.filterwarnings('ignore')

xgb_cls.fit(X_train, y_train)

y_pred = xgb_cls.predict(X_test)

score = accuracy_score(y_test, y_pred)
print(f"Score: {score}")

Score: 0.9736842105263158


An initial score of **97.4** percent on the Iris dataset without any hyperparameter tuning is very good.
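The reported score can be traced back to individual predictions. As a quick sanity check, and assuming `train_test_split` used its default test size of 25 percent (it is not set explicitly above), the test set holds 38 of the 150 samples, so the score corresponds to a single misclassified flower:

```python
import math

n_samples = 150        # rows in the Iris dataset
test_fraction = 0.25   # default test_size of train_test_split (assumed)

# scikit-learn rounds the test set size up
n_test = math.ceil(n_samples * test_fraction)

n_correct = n_test - 1  # exactly one misclassified sample
accuracy = n_correct / n_test

print(f"Test samples: {n_test}")
print(f"Accuracy with one error: {accuracy:.4f}")
```

With only 38 test samples, each error moves the accuracy by about 2.6 percentage points, so a score from a single split should be read with some caution.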

## XGBoost - Regression Template

In [26]:
X, y = datasets.load_diabetes(return_X_y=True)

X.shape

(442, 10)

In [27]:
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

xgb_reg = XGBRegressor(
    booster='gbtree', objective='reg:squarederror', 
    max_depth=6, learning_rate=0.1, n_estimators=100,
    random_state=2, n_jobs=-1
)

In [28]:
scores = cross_val_score(xgb_reg, X, y, scoring="neg_mean_squared_error", cv=5)

rmse = np.sqrt(-scores)
print(f"RMSE: {np.round(rmse, 3)}")
print(f"RMSE mean: {np.round(rmse.mean(), 3)}")

RMSE: [63.033 59.689 64.538 63.699 64.661]
RMSE mean: 63.124


Without a baseline for comparison, we have no idea what that score means. Converting the target column, y, into a pandas DataFrame and calling describe() provides one:

In [29]:
pd.DataFrame(y).describe()

Unnamed: 0,0
count,442.0
mean,152.133484
std,77.093005
min,25.0
25%,87.0
50%,140.5
75%,211.5
max,346.0


An RMSE of **63.124** is less than one standard deviation of the target (77.093), a respectable result.
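One way to read this comparison, sketched below with the numbers printed above: a model that always predicts the mean of y has an RMSE equal to the population standard deviation of y. The ratio of the model's RMSE to that baseline indicates how much of the spread the model removes. This is only a rough estimate, since the cross-validated RMSE is averaged over folds while the standard deviation is computed over the whole dataset.

```python
import math

# Per-fold RMSE values from the cross-validation output above
fold_rmse = [63.033, 59.689, 64.538, 63.699, 64.661]

# Standard deviation of y from describe() (sample std, ddof=1)
target_std = 77.093005
n_samples = 442

rmse_mean = sum(fold_rmse) / len(fold_rmse)

# RMSE of a constant "predict the mean" model equals the
# population standard deviation (ddof=0)
baseline_rmse = target_std * math.sqrt((n_samples - 1) / n_samples)

ratio = rmse_mean / baseline_rmse

print(f"Mean RMSE: {rmse_mean:.3f}")
print(f"Baseline RMSE: {baseline_rmse:.3f}")
print(f"Model RMSE / baseline RMSE: {ratio:.3f}")
```

A ratio below 1 means the model beats the constant baseline; the closer to 0, the better.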

## Case Study - Finding the Higgs Boson

This section is based on the Higgs Boson Kaggle Competition, which brought XGBoost into the machine learning spotlight.

- In popular culture, the Higgs boson is known as the *God Particle*. Theorized by Peter Higgs in 1964, it was introduced to explain why particles have mass.


The Higgs boson was discovered by smashing protons into each other at extremely high speeds and observing the results. Observations came from the ATLAS detector, which records data resulting from hundreds of millions of proton-proton collisions per second, according to the competition's technical documentation, Learning to discover: the Higgs boson machine learning challenge, https://higgsml.lal.in2p3.fr/files/2014/04/documentation_v1.8.pdf.

After discovering the Higgs boson, the next step was to precisely measure the characteristics of its decay. The ATLAS experiment found the Higgs boson decaying into two tau particles from data wrapped in background noise. To better understand the data, ATLAS called upon the machine learning community.

In [31]:
df = pd.read_csv("data/atlas-higgs-challenge-2014-v2.csv", nrows=250000)
df.head()

Unnamed: 0,EventId,DER_mass_MMC,DER_mass_transverse_met_lep,DER_mass_vis,DER_pt_h,DER_deltaeta_jet_jet,DER_mass_jet_jet,DER_prodeta_jet_jet,DER_deltar_tau_lep,DER_pt_tot,...,PRI_jet_leading_eta,PRI_jet_leading_phi,PRI_jet_subleading_pt,PRI_jet_subleading_eta,PRI_jet_subleading_phi,PRI_jet_all_pt,Weight,Label,KaggleSet,KaggleWeight
0,100000,138.47,51.655,97.827,27.98,0.91,124.711,2.666,3.064,41.928,...,2.15,0.444,46.062,1.24,-2.475,113.497,0.000814,s,t,0.002653
1,100001,160.937,68.768,103.235,48.146,-999.0,-999.0,-999.0,3.473,2.078,...,0.725,1.158,-999.0,-999.0,-999.0,46.226,0.681042,b,t,2.233584
2,100002,-999.0,162.172,125.953,35.635,-999.0,-999.0,-999.0,3.148,9.336,...,2.053,-2.028,-999.0,-999.0,-999.0,44.251,0.715742,b,t,2.347389
3,100003,143.905,81.417,80.943,0.414,-999.0,-999.0,-999.0,3.31,0.414,...,-999.0,-999.0,-999.0,-999.0,-999.0,-0.0,1.660654,b,t,5.446378
4,100004,175.864,16.915,134.805,16.405,-999.0,-999.0,-999.0,3.891,16.405,...,-999.0,-999.0,-999.0,-999.0,-999.0,0.0,1.904263,b,t,6.245333


The loaded dataset comes from the original source, which is much larger than the Kaggle dataset. To make df match the Kaggle dataset, pay attention to the following columns:

- Kaggle used different values for its weight column, which appears in the preceding output as KaggleWeight.

- The t value under KaggleSet indicates that the row is part of the training set for the Kaggle dataset.
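If the rows were not already limited to the training set, the KaggleSet column could be used to select them explicitly. The following is a minimal sketch on a tiny made-up frame; the values are illustrative and not taken from the real file, and `"v"` simply stands in for any non-training subset.

```python
import pandas as pd

# Tiny hypothetical frame with the four columns discussed above
toy = pd.DataFrame({
    "EventId": [100000, 100001, 100002, 100003],
    "Weight": [0.0008, 0.6810, 0.7157, 1.6607],
    "KaggleSet": ["t", "t", "v", "t"],
    "KaggleWeight": [0.0027, 2.2336, 2.3474, 5.4464],
})

# Keep only the rows flagged as Kaggle training data
train = toy[toy["KaggleSet"] == "t"]

# Drop the original weight and the set flag,
# then expose KaggleWeight under the name Weight
train = (
    train.drop(columns=["Weight", "KaggleSet"])
         .rename(columns={"KaggleWeight": "Weight"})
         .reset_index(drop=True)
)

print(train)
```

Filtering by value rather than by row count does not depend on the order of rows in the file.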

In [33]:
del df["Weight"]

del df["KaggleSet"]

df = df.rename(columns={"KaggleWeight": "Weight"})

# Move Label column to the end of the table
# (a newly assigned column always appears at the end)
df["Label"] = df.pop("Label")

df.head()

Unnamed: 0,EventId,DER_mass_MMC,DER_mass_transverse_met_lep,DER_mass_vis,DER_pt_h,DER_deltaeta_jet_jet,DER_mass_jet_jet,DER_prodeta_jet_jet,DER_deltar_tau_lep,DER_pt_tot,...,PRI_jet_num,PRI_jet_leading_pt,PRI_jet_leading_eta,PRI_jet_leading_phi,PRI_jet_subleading_pt,PRI_jet_subleading_eta,PRI_jet_subleading_phi,PRI_jet_all_pt,Weight,Label
0,100000,138.47,51.655,97.827,27.98,0.91,124.711,2.666,3.064,41.928,...,2,67.435,2.15,0.444,46.062,1.24,-2.475,113.497,0.002653,s
1,100001,160.937,68.768,103.235,48.146,-999.0,-999.0,-999.0,3.473,2.078,...,1,46.226,0.725,1.158,-999.0,-999.0,-999.0,46.226,2.233584,b
2,100002,-999.0,162.172,125.953,35.635,-999.0,-999.0,-999.0,3.148,9.336,...,1,44.251,2.053,-2.028,-999.0,-999.0,-999.0,44.251,2.347389,b
3,100003,143.905,81.417,80.943,0.414,-999.0,-999.0,-999.0,3.31,0.414,...,0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-0.0,5.446378,b
4,100004,175.864,16.915,134.805,16.405,-999.0,-999.0,-999.0,3.891,16.405,...,0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,0.0,6.245333,b


In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 33 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   EventId                      250000 non-null  int64  
 1   DER_mass_MMC                 250000 non-null  float64
 2   DER_mass_transverse_met_lep  250000 non-null  float64
 3   DER_mass_vis                 250000 non-null  float64
 4   DER_pt_h                     250000 non-null  float64
 5   DER_deltaeta_jet_jet         250000 non-null  float64
 6   DER_mass_jet_jet             250000 non-null  float64
 7   DER_prodeta_jet_jet          250000 non-null  float64
 8   DER_deltar_tau_lep           250000 non-null  float64
 9   DER_pt_tot                   250000 non-null  float64
 10  DER_sum_pt                   250000 non-null  float64
 11  DER_pt_ratio_lep_tau         250000 non-null  float64
 12  DER_met_phi_centrality       250000 non-null  float64
 13 

The columns beyond EventId include variables prefixed with PRI, which stands for primitives: values measured directly by the detector during collisions. By contrast, columns prefixed with DER are numerical derivations from these measurements.
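The two groups can be separated by prefix. Here is a small sketch using a handful of the column names shown above; the list is a subset chosen for illustration, not the full set of columns.

```python
# A few column names taken from the DataFrame above
columns = [
    "EventId", "DER_mass_MMC", "DER_mass_vis", "DER_pt_h",
    "PRI_jet_num", "PRI_jet_leading_pt", "PRI_jet_all_pt",
    "Weight", "Label",
]

# Derived quantities computed from the raw measurements
derived = [c for c in columns if c.startswith("DER_")]

# Primitive quantities measured directly by the detector
primitive = [c for c in columns if c.startswith("PRI_")]

# Everything else: identifier, weight, and target
other = [c for c in columns if not c.startswith(("DER_", "PRI_"))]

print(f"Derived:   {derived}")
print(f"Primitive: {primitive}")
print(f"Other:     {other}")
```

On the real DataFrame, the same comprehensions can be applied to `df.columns`.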

Column 0: EventId – irrelevant for the machine learning model.

Columns 1-30: Physics columns derived from LHC collisions. Details for these columns can be found in the technical documentation at http://higgsml.lal.in2p3.fr/documentation. These are the machine learning predictor columns.

Column 31: Weight – this column is used to scale the data. The issue here is that Higgs boson events are very rare, so a machine learning model with 99.9 percent accuracy may not be able to find them. Weights compensate for this imbalance, but weights are not available for the test data.

Column 32: Label – this is the target column, labeled s for signal and b for background. The training data has been simulated from real data, so there are many more signals than would otherwise be found. The signal is the occurrence of the Higgs boson decay.
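The accuracy problem mentioned in the description of the Weight column can be illustrated with a small NumPy sketch. The event counts are made up for illustration: 1 signal event per 1,000. A model that always predicts background reaches 99.9 percent accuracy yet finds no signal at all.

```python
import numpy as np

n_events = 100_000
n_signal = 100  # hypothetical: 0.1 percent of events are signal

# True labels: 1 for signal, 0 for background
y_true = np.zeros(n_events, dtype=int)
y_true[:n_signal] = 1

# A useless "model" that labels every event as background
y_pred = np.zeros(n_events, dtype=int)

accuracy = (y_true == y_pred).mean()

# Count the signal events the model actually identified
signals_found = int(((y_true == 1) & (y_pred == 1)).sum())
recall = signals_found / n_signal

print(f"Accuracy: {accuracy:.3f}")
print(f"Signal events found: {signals_found} of {n_signal}")
```

This is why accuracy alone is a poor guide when the events of interest are rare.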

In [35]:
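# Encode the target: signal -> 1, background -> 0.
# Assigning the result back avoids a chained in-place modification,
# which newer pandas versions warn about or ignore.
df["Label"] = df["Label"].map({"s": 1, "b": 0})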

In [36]:
# Predictor columns are indexed 1-30 (Col 0 is irrelevant)
X = df.iloc[:, 1:31]
y = df.iloc[:, -1]
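Positional slicing depends on the column order established earlier. As an alternative sketch, the same split can be written by column name, which keeps working if the columns are reordered. The toy frame below is hypothetical and contains only two of the thirty predictor columns.

```python
import pandas as pd

toy = pd.DataFrame({
    "EventId": [100000, 100001, 100002],
    "DER_mass_MMC": [138.470, 160.937, -999.0],
    "PRI_jet_all_pt": [113.497, 46.226, 44.251],
    "Weight": [0.002653, 2.233584, 2.347389],
    "Label": [1, 0, 0],
})

# Everything except the identifier, the weight, and the target is a predictor
X_toy = toy.drop(columns=["EventId", "Weight", "Label"])
y_toy = toy["Label"]

# Same result as slicing by position: skip the first column and the last two
same_as_positional = X_toy.equals(toy.iloc[:, 1:-2])

print(list(X_toy.columns))
print(same_as_positional)
```

Selecting by name also makes it explicit that Weight is excluded from the predictors.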