<a href="https://colab.research.google.com/github/andysingal/xgboost/blob/main/xgboost.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

XGBoost was specifically designed for speed. Speed gains let machine learning models train more quickly, which is especially important when dealing with millions, billions, or trillions of rows of data.

The following new design features give XGBoost a big edge in speed over comparable ensemble algorithms:

- Approximate split-finding algorithm

- Sparsity-aware split-finding

- Parallel computing

- Cache-aware access

- Block compression and sharding


- Approximate split-finding algorithm

Decision trees need optimal splits to produce optimal results. A greedy algorithm selects the best split at each step and does not backtrack to look at previous branches. Evaluating every possible split this way becomes expensive on large datasets, so XGBoost can instead evaluate a limited set of candidate splits.

The approximate split-finding algorithm uses quantiles, values that divide the sorted data into parts by percentage, to propose candidate splits. In a global proposal, the candidates are proposed once per tree and reused at every level; in a local proposal, new candidates are proposed after each split.
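The difference between global and local proposals can be sketched with NumPy. This is only an illustration under simplifying assumptions: the helper `propose_candidate_splits` is hypothetical, and it uses plain, unweighted quantiles, whereas XGBoost itself uses a weighted quantile sketch.

```python
import numpy as np

def propose_candidate_splits(feature_values, n_candidates=4):
    """Hypothetical helper: candidate thresholds at evenly spaced quantiles."""
    # Keep interior quantiles only; 0 and 1 would send every row to one side.
    quantile_levels = np.linspace(0, 1, n_candidates + 2)[1:-1]
    return np.unique(np.quantile(feature_values, quantile_levels))

rng = np.random.default_rng(0)
feature = rng.normal(size=1_000)

# Global proposal: computed once and reused at every level of the tree.
global_candidates = propose_candidate_splits(feature)

# Local proposal: recomputed on the rows that reached a child node.
left_child_rows = feature[feature < global_candidates[1]]
local_candidates = propose_candidate_splits(left_child_rows)
```

The local candidates are packed more densely into the region covered by the child node, which is why a local proposal can refine splits deeper in the tree at the cost of extra proposal steps.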





- Sparsity-aware split finding

Sparse data occurs when the majority of entries are 0 or null. Sparsity-aware split finding makes XGBoost faster on such data: when looking for splits, it only visits the non-missing entries and sends missing values along a default direction that is learned at each node.

According to the original paper, XGBoost: A Scalable Tree Boosting System, the sparsity-aware split-finding algorithm performed 50 times faster than the standard approach on the Allstate-10K dataset.
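In the paper, sparsity-aware split finding learns a default direction for missing entries at each split. A minimal sketch of that idea follows, assuming missing values are encoded as NaN and using a simplified gain with constant factors and the complexity penalty dropped. The function names are hypothetical and this is not XGBoost's actual implementation.

```python
import numpy as np

def structure_score(grad_sum, hess_sum, reg_lambda=1.0):
    """Simplified score of a node built from its gradient statistics."""
    return grad_sum ** 2 / (hess_sum + reg_lambda)

def best_default_direction(x, grad, hess, threshold, reg_lambda=1.0):
    """Choose the side that missing values (NaN) should follow at one split."""
    missing = np.isnan(x)
    present = ~missing
    # Only non-missing entries are compared against the threshold.
    # (A real implementation would iterate over stored entries only.)
    goes_left = np.zeros(x.shape, dtype=bool)
    goes_left[present] = x[present] < threshold
    goes_right = present & ~goes_left

    g_miss, h_miss = grad[missing].sum(), hess[missing].sum()
    g_left, h_left = grad[goes_left].sum(), hess[goes_left].sum()
    g_right, h_right = grad[goes_right].sum(), hess[goes_right].sum()
    parent = structure_score(grad.sum(), hess.sum(), reg_lambda)

    # Try both default directions and keep the one with the larger gain.
    gain_if_left = (structure_score(g_left + g_miss, h_left + h_miss, reg_lambda)
                    + structure_score(g_right, h_right, reg_lambda) - parent)
    gain_if_right = (structure_score(g_left, h_left, reg_lambda)
                     + structure_score(g_right + g_miss, h_right + h_miss, reg_lambda) - parent)
    if gain_if_left >= gain_if_right:
        return "left", gain_if_left
    return "right", gain_if_right

x = np.array([1.0, 2.0, np.nan, 8.0, 9.0, np.nan])
grad = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, -1.0])
hess = np.ones_like(x)
direction, gain = best_default_direction(x, grad, hess, threshold=5.0)
```

In this toy example the missing rows have the same gradient sign as the rows below the threshold, so sending them left yields the larger gain.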


- Parallel computing
Boosting is not ideal for parallel computing since each tree depends on the results of the previous tree. 

Parallel computing occurs when multiple computational units work together on the same problem at the same time. XGBoost sorts and compresses the data into blocks. These blocks may be distributed to multiple machines, or to external memory.

Within each block, every column is stored sorted by feature value, so the sorting is done once before training and reused for every tree. The split-finding algorithm scans these pre-sorted columns, and the statistics for different columns can be collected in parallel; the search for quantiles benefits from the same layout. So although the trees themselves are built one after another, XGBoost parallelizes the work inside each tree to expedite the model-building process.
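As a rough illustration of the block idea, the sketch below sorts every column once and then scans the columns concurrently. It assumes a small in-memory NumPy array; the names `sorted_rows` and `left_gradient_sums` are hypothetical, and Python threads only stand in for XGBoost's native multi-threaded code.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 4))
grad = rng.normal(size=1_000)

# Sort each column once, up front, and keep the row order (the "block").
sorted_rows = np.argsort(X, axis=0, kind="stable")

def left_gradient_sums(col):
    """Running gradient totals for one column, scanned in feature order."""
    order = sorted_rows[:, col]
    return np.cumsum(grad[order])

# Each column can be scanned independently, so the scans can run in parallel.
with ThreadPoolExecutor() as pool:
    per_column_sums = list(pool.map(left_gradient_sums, range(X.shape[1])))
```

Because `sorted_rows` never changes, the same ordering can be reused for every split search instead of re-sorting the data each time.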

- Cache-aware access
The data on your computer is separated into cache and main memory. The cache holds the data you use most often in small, high-speed memory, while data you use less often stays in larger, lower-speed main memory. The levels of this hierarchy differ in latency by orders of magnitude, as outlined here: https://gist.github.com/jboner/2841832.

When it comes to gradient statistics, XGBoost uses cache-aware prefetching. XGBoost allocates an internal buffer, fetches the gradient statistics into it, and performs accumulation in mini-batches. According to XGBoost: A Scalable Tree Boosting System, prefetching turns the direct read/write dependency into a longer one and reduces runtimes by approximately 50% for datasets with a large number of rows.

- Block compression and sharding
XGBoost delivers additional speed gains through block compression and block sharding.

Block compression helps with computationally expensive disk reading by compressing columns. Block sharding decreases read times by sharding the data onto multiple disks that are read in alternation.

- Accuracy gains
XGBoost adds built-in regularization to achieve accuracy gains beyond gradient boosting. Regularization is the process of adding information to reduce variance and prevent overfitting.
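One way to see the effect of regularization is through the optimal weight of a single leaf, which in XGBoost's formulation is -G / (H + λ), where G and H are the sums of the gradients and Hessians of the rows in the leaf and λ is the L2 penalty. The sketch below uses made-up gradient values, and the function name is hypothetical.

```python
import numpy as np

def optimal_leaf_weight(grad, hess, reg_lambda):
    """Leaf weight that minimizes the regularized second-order objective."""
    return -grad.sum() / (hess.sum() + reg_lambda)

# Made-up gradient statistics for three rows that land in the same leaf.
grad = np.array([-0.8, -0.6, -0.9])
hess = np.ones(3)

weight_no_penalty = optimal_leaf_weight(grad, hess, reg_lambda=0.0)
weight_default = optimal_leaf_weight(grad, hess, reg_lambda=1.0)
weight_strong = optimal_leaf_weight(grad, hess, reg_lambda=10.0)
```

A larger λ shrinks the leaf weight toward zero, so each individual tree contributes less and the model is less prone to fitting noise.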






**For further reading please check https://xgboost.readthedocs.io/en/latest/tutorials/model.html.**

## Analyzing XGBoost parameters
The learning objective of a machine learning model determines how well the model fits the data. In the case of XGBoost, the learning objective consists of two parts: the **loss function** and the **regularization term**.
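Written out in the notation of the XGBoost paper, the two parts look as follows. The symbols are not defined elsewhere in this notebook: $l$ is the loss function, $\hat{y}_i$ is the prediction for row $i$, $f_k$ is the $k$-th tree, $T$ is the number of leaves in a tree, $w_j$ is the weight of leaf $j$, and $\gamma$ and $\lambda$ are the regularization parameters.

```latex
\text{obj}(\theta) = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
```

The first sum measures how well the predictions fit the training data, while the second penalizes trees with many leaves or large leaf weights.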

# **Higgs Boson Event Detection**

## Backstory
Particle accelerators. To probe into the basic questions of how matter, space and time work and how they are structured, physicists focus on the simplest interactions (for example, the collision of subatomic particles) at very high energy. Particle accelerators enable physicists to explore the fundamental nature of matter by observing subatomic particles produced by high-energy collisions of particle beams. The experimental measurements from these collisions inevitably lack precision, which is where machine learning (ML) comes into the picture. The research community typically relies on standardized machine learning software packages to analyze the data obtained from such experiments, and puts a huge amount of effort into improving statistical power by extracting significant features derived from the raw measurements.

Higgs boson. The Higgs boson particle, also called the God particle in mainstream media, is the final ingredient of the Standard Model of particle physics, which sets the rules for the subatomic particles and forces. The elementary particles are supposed to be massless at very high energies, but some of them can acquire mass at low energies. The mechanism behind this remained an enigma in theoretical physics for a long time. In 1964, Peter Higgs and others proposed a mechanism that theoretically explains the origin of the mass of elementary particles. The mechanism involves a field, commonly known as the Higgs field, that the particles can interact with to gain mass. The more a particle interacts with it, the heavier it is. Some particles, like the photon, do not interact with this field at all and remain massless. The Higgs boson is the particle associated with the Higgs field (all fundamental fields have one). It is essentially the physical manifestation of the Higgs field, which gives mass to other particles. The detection of this elusive particle came almost half a century after it was theorized!

The discovery. On 4th July 2012, the ATLAS and CMS experiments at CERN's Large Hadron Collider announced that both of them had observed a new particle in the mass region around 125 GeV. This particle is consistent with the theorized Higgs boson. This experimental confirmation earned François Englert and Peter Higgs the Nobel Prize in Physics 2013 "for the theoretical discovery of a mechanism that contributes to our understanding of the origin of mass of subatomic particles, and which recently was confirmed through the discovery of the predicted fundamental particle, by the ATLAS and CMS experiments at CERN's Large Hadron Collider."

Giving mass to fermions. There are many different processes through which the Higgs boson can decay and produce other particles. In physics, the possible transformations a particle can undergo as it decays are referred to as channels. The Higgs boson was first observed to decay in three distinct decay channels, all of which are boson pairs. To establish that the Higgs field also provides the interaction which gives mass to the fundamental fermions (particles which follow Fermi-Dirac statistics, in contrast to bosons, which follow Bose-Einstein statistics), it has to be demonstrated that the Higgs boson can decay into fermion pairs through direct decay modes. Subsequently, seeking evidence of the decay of the Higgs boson into fermion pairs (such as tau leptons (τ) or b-quarks) and precisely measuring their characteristics became one of the important lines of enquiry. Among the available modes, the most promising is the decay to a pair of tau leptons, which balances a modest branching ratio with manageable backgrounds.

The first evidence of h → τ⁺τ⁻ decays was recently reported, based on the full set of proton–proton collision data recorded by the ATLAS experiment at the LHC during 2011-2012. Despite the consistency of the data with h → τ⁺τ⁻ decays, it could not be ensured that the statistical power exceeds the 5σ threshold, which is the required standard for claims of discovery in the high-energy physics community.
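To give a sense of how strict the 5σ threshold is, the short sketch below converts a significance in standard deviations into a one-sided tail probability of the standard normal distribution. It assumes the usual one-sided convention, and the function name is hypothetical.

```python
import math

def one_sided_p_value(n_sigma):
    """Upper-tail probability of a standard normal beyond n_sigma."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2))

p_five_sigma = one_sided_p_value(5.0)
print(f"{p_five_sigma:.3e}")  # prints 2.867e-07
```

In other words, a 5σ excess corresponds to roughly a 1 in 3.5 million chance of a background fluctuation at least that large.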

In [1]:
# File system management
import time, psutil, os

from IPython import display 

# Mathematical functions
import math

# Data manipulation
import numpy as np
import pandas as pd

# Plotting and visualization
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.patches as mpatches

from mpl_toolkits.mplot3d import Axes3D
from matplotlib.colors import ListedColormap
from matplotlib import cm
from mpl_toolkits.mplot3d.axes3d import get_test_data

import seaborn as sns
sns.set_theme()
import plotly.graph_objects as go
from plotly.subplots import make_subplots

**Runtime and memory usage**

In [2]:
# Recording the starting time, complemented with a stopping time check in the end to compute process runtime
start = time.time()

# Class representing the OS process and having memory_info() method to compute process memory usage
process = psutil.Process(os.getpid())

In [13]:
# Loading the training data
data_train = pd.read_csv('training.zip')
print(pd.Series({"Memory usage": "{:.2f} MB".format(data_train.memory_usage().sum()/(1024*1024)),
                 "Dataset shape": "{}".format(data_train.shape)}).to_string())
data_train.head()

Memory usage         62.94 MB
Dataset shape    (250000, 33)


| | EventId | DER_mass_MMC | DER_mass_transverse_met_lep | DER_mass_vis | DER_pt_h | DER_deltaeta_jet_jet | DER_mass_jet_jet | DER_prodeta_jet_jet | DER_deltar_tau_lep | DER_pt_tot | ... | PRI_jet_num | PRI_jet_leading_pt | PRI_jet_leading_eta | PRI_jet_leading_phi | PRI_jet_subleading_pt | PRI_jet_subleading_eta | PRI_jet_subleading_phi | PRI_jet_all_pt | Weight | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100000 | 138.47 | 51.655 | 97.827 | 27.98 | 0.91 | 124.711 | 2.666 | 3.064 | 41.928 | ... | 2 | 67.435 | 2.15 | 0.444 | 46.062 | 1.24 | -2.475 | 113.497 | 0.002653 | s |
| 1 | 100001 | 160.937 | 68.768 | 103.235 | 48.146 | -999.0 | -999.0 | -999.0 | 3.473 | 2.078 | ... | 1 | 46.226 | 0.725 | 1.158 | -999.0 | -999.0 | -999.0 | 46.226 | 2.233584 | b |
| 2 | 100002 | -999.0 | 162.172 | 125.953 | 35.635 | -999.0 | -999.0 | -999.0 | 3.148 | 9.336 | ... | 1 | 44.251 | 2.053 | -2.028 | -999.0 | -999.0 | -999.0 | 44.251 | 2.347389 | b |
| 3 | 100003 | 143.905 | 81.417 | 80.943 | 0.414 | -999.0 | -999.0 | -999.0 | 3.31 | 0.414 | ... | 0 | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | -0.0 | 5.446378 | b |
| 4 | 100004 | 175.864 | 16.915 | 134.805 | 16.405 | -999.0 | -999.0 | -999.0 | 3.891 | 16.405 | ... | 0 | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | 0.0 | 6.245333 | b |


In [9]:
# Loading the test data
data_test = pd.read_csv('test.zip')
print(pd.Series({"Memory usage": "{:.2f} MB".format(data_test.memory_usage().sum()/(1024*1024)),
                 "Dataset shape": "{}".format(data_test.shape)}).to_string())
data_test.head()

Memory usage        130.08 MB
Dataset shape    (550000, 31)


| | EventId | DER_mass_MMC | DER_mass_transverse_met_lep | DER_mass_vis | DER_pt_h | DER_deltaeta_jet_jet | DER_mass_jet_jet | DER_prodeta_jet_jet | DER_deltar_tau_lep | DER_pt_tot | ... | PRI_met_phi | PRI_met_sumet | PRI_jet_num | PRI_jet_leading_pt | PRI_jet_leading_eta | PRI_jet_leading_phi | PRI_jet_subleading_pt | PRI_jet_subleading_eta | PRI_jet_subleading_phi | PRI_jet_all_pt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 350000 | -999.0 | 79.589 | 23.916 | 3.036 | -999.0 | -999.0 | -999.0 | 0.903 | 3.036 | ... | 2.022 | 98.556 | 0 | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | -0.0 |
| 1 | 350001 | 106.398 | 67.49 | 87.949 | 49.994 | -999.0 | -999.0 | -999.0 | 2.048 | 2.679 | ... | -1.138 | 176.251 | 1 | 47.575 | -0.553 | -0.849 | -999.0 | -999.0 | -999.0 | 47.575 |
| 2 | 350002 | 117.794 | 56.226 | 96.358 | 4.137 | -999.0 | -999.0 | -999.0 | 2.755 | 4.137 | ... | -1.868 | 111.505 | 0 | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | 0.0 |
| 3 | 350003 | 135.861 | 30.604 | 97.288 | 9.104 | -999.0 | -999.0 | -999.0 | 2.811 | 9.104 | ... | 1.172 | 164.707 | 0 | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | 0.0 |
| 4 | 350004 | 74.159 | 82.772 | 58.731 | 89.646 | 1.347 | 536.663 | -0.339 | 1.028 | 77.213 | ... | -0.231 | 869.614 | 3 | 254.085 | -1.013 | -0.334 | 185.857 | 0.335 | 2.587 | 599.213 |


## Basic Data Exploration

In [10]:
# Shape of the data
print(pd.Series({"Shape of the training set": data_train.shape,
                 "Shape of the test set": data_test.shape}).to_string())

Shape of the training set    (250000, 33)
Shape of the test set        (550000, 31)


In [11]:
# Count of observations
df_obs = pd.DataFrame(index = ['Number of observations'], columns = ['Training set', 'Test set'])
df_obs['Training set'] = len(data_train)
df_obs['Test set'] = len(data_test)
df_obs

| | Training set | Test set |
|---|---|---|
| Number of observations | 250000 | 550000 |


In [14]:
# Count of columns
df_cols_count = pd.DataFrame(index = ['Number of columns'], columns = ['Training set', 'Test set'])
df_cols_count['Training set'] = len(data_train.columns)
df_cols_count['Test set'] = len(data_test.columns)
df_cols_count

| | Training set | Test set |
|---|---|---|
| Number of columns | 33 | 31 |


In [15]:
# Column names for the training dataset
data_train.columns

Index(['EventId', 'DER_mass_MMC', 'DER_mass_transverse_met_lep',
       'DER_mass_vis', 'DER_pt_h', 'DER_deltaeta_jet_jet', 'DER_mass_jet_jet',
       'DER_prodeta_jet_jet', 'DER_deltar_tau_lep', 'DER_pt_tot', 'DER_sum_pt',
       'DER_pt_ratio_lep_tau', 'DER_met_phi_centrality',
       'DER_lep_eta_centrality', 'PRI_tau_pt', 'PRI_tau_eta', 'PRI_tau_phi',
       'PRI_lep_pt', 'PRI_lep_eta', 'PRI_lep_phi', 'PRI_met', 'PRI_met_phi',
       'PRI_met_sumet', 'PRI_jet_num', 'PRI_jet_leading_pt',
       'PRI_jet_leading_eta', 'PRI_jet_leading_phi', 'PRI_jet_subleading_pt',
       'PRI_jet_subleading_eta', 'PRI_jet_subleading_phi', 'PRI_jet_all_pt',
       'Weight', 'Label'],
      dtype='object')

In [16]:
# Columns in the training dataset which are not in the test dataset
[col for col in data_train.columns if col not in data_test.columns]

['Weight', 'Label']

In [17]:
# Column datatypes for the training dataset
data_train.dtypes

EventId                          int64
DER_mass_MMC                   float64
DER_mass_transverse_met_lep    float64
DER_mass_vis                   float64
DER_pt_h                       float64
DER_deltaeta_jet_jet           float64
DER_mass_jet_jet               float64
DER_prodeta_jet_jet            float64
DER_deltar_tau_lep             float64
DER_pt_tot                     float64
DER_sum_pt                     float64
DER_pt_ratio_lep_tau           float64
DER_met_phi_centrality         float64
DER_lep_eta_centrality         float64
PRI_tau_pt                     float64
PRI_tau_eta                    float64
PRI_tau_phi                    float64
PRI_lep_pt                     float64
PRI_lep_eta                    float64
PRI_lep_phi                    float64
PRI_met                        float64
PRI_met_phi                    float64
PRI_met_sumet                  float64
PRI_jet_num                      int64
PRI_jet_leading_pt             float64
PRI_jet_leading_eta      

In [18]:
# Count of column datatypes for the training dataset
df_cols_train = pd.DataFrame(index = ['Number of columns for the training set'], columns = ['Integer', 'Float', 'Object'])
df_cols_train['Integer'] = len(data_train.columns[data_train.dtypes == 'int64'])
df_cols_train['Float'] = len(data_train.columns[data_train.dtypes == 'float64'])
df_cols_train['Object'] = len(data_train.columns[data_train.dtypes == 'object'])
df_cols_train

| | Integer | Float | Object |
|---|---|---|---|
| Number of columns for the training set | 2 | 30 | 1 |


In [19]:
# Integer columns in the training dataset
data_train.columns[data_train.dtypes == 'int64']

Index(['EventId', 'PRI_jet_num'], dtype='object')

In [20]:
# Object columns in the training dataset
data_train.columns[data_train.dtypes == 'object']

Index(['Label'], dtype='object')

In [21]:
# Count of column datatypes for the test dataset
df_cols_test = pd.DataFrame(index = ['Number of columns for the test set'], columns = ['Integer', 'Float', 'Object'])
df_cols_test['Integer'] = len(data_test.columns[data_test.dtypes == 'int64'])
df_cols_test['Float'] = len(data_test.columns[data_test.dtypes == 'float64'])
df_cols_test['Object'] = len(data_test.columns[data_test.dtypes == 'object'])
df_cols_test

| | Integer | Float | Object |
|---|---|---|---|
| Number of columns for the test set | 2 | 29 | 0 |
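The per-dtype counts for both datasets can also be produced in one step with `dtypes.value_counts()`. The sketch below is a possible alternative, not the notebook's own method: `dtype_counts` is a hypothetical helper, and the two toy frames stand in for `data_train` and `data_test` so that the example runs on its own.

```python
import pandas as pd

def dtype_counts(frames):
    """Count columns per dtype for each named DataFrame."""
    counts = {name: df.dtypes.astype(str).value_counts()
              for name, df in frames.items()}
    # Dtypes absent from one frame show up as NaN, so fill them with 0.
    return pd.DataFrame(counts).fillna(0).astype(int)

# Toy stand-ins for data_train and data_test.
toy_train = pd.DataFrame({"EventId": [1, 2],
                          "PRI_jet_num": [0, 1],
                          "DER_pt_h": [0.4, 1.2],
                          "Label": ["s", "b"]})
toy_test = toy_train.drop(columns=["Label"])

summary = dtype_counts({"Training set": toy_train, "Test set": toy_test})
```

Calling `dtype_counts({"Training set": data_train, "Test set": data_test})` would give one row per dtype and one column per dataset.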
