# Using XGBoost on Higgs data

## Overview

In this notebook, we use gradient boosted trees (specifically XGBoost) to separate signal from background for a theoretical model with new Higgs bosons, described by [Baldi et al., Nature Communications 2014, arXiv:1402.4735](https://arxiv.org/pdf/1402.4735.pdf).
The data set consists of 11M Monte Carlo samples of signal (gluon fusion, $gg \rightarrow H^{0} \rightarrow W^{\pm} H^{\mp} \rightarrow W^{\pm} W^{\mp} h^{0} \rightarrow W^{\pm} W^{\mp} b\bar{b}$, with $M_{h^{0}} \sim$ 125.3 GeV and $M_{W} \sim$ 80.4 GeV) and background (top-quark pair production, $gg \rightarrow t\bar{t} \rightarrow W^{+} W^{-} b\bar{b}$), each described by 21 low-level features.




<img src="./images/higgs_dataset.png" alt="Signal (a) vs. background (b)" width="400"/>

Simulated events are generated with the MADGRAPH event generator, assuming 8 TeV proton-proton collisions at the LHC, with showering and hadronization performed by PYTHIA and the detector response simulated by DELPHES.
As a benchmark case, we assume $M_{H^{0}} =$ 425 GeV and $M_{H^{\pm}} =$ 325 GeV.

We focus on the semi-leptonic decay mode, in which **only one W boson decays to a lepton and a neutrino ($l \nu$), and the other decays to a pair of jets ($jj$)**, giving the final state **$l\nu jj b \bar{b}$**.


We select events with:

- exactly one electron or muon, with $p_{T} >$ 20 GeV and $|\eta| <$ 2.5;

- at least four jets, each with $p_{T} >$ 20 GeV and $|\eta| <$ 2.5;

- b-tags on at least two of the jets, indicating that they are likely due to b-quarks rather than gluons or lighter quarks.
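The selection above can be sketched in a few lines of Python. This is only an illustration of the three cuts, not the code used to produce the data set: the `Particle` class, the helpers `in_acceptance` and `passes_selection`, and the toy event are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Particle:
    """Hypothetical container for a reconstructed lepton or jet."""
    pt: float           # transverse momentum in GeV
    eta: float          # pseudorapidity
    btag: bool = False  # only meaningful for jets


def in_acceptance(p, pt_min=20.0, eta_max=2.5):
    """Kinematic cut shared by leptons and jets: pT > 20 GeV and |eta| < 2.5."""
    return p.pt > pt_min and abs(p.eta) < eta_max


def passes_selection(leptons, jets):
    """Apply the three requirements listed above to one event."""
    good_leptons = [l for l in leptons if in_acceptance(l)]
    good_jets = [j for j in jets if in_acceptance(j)]
    n_btagged = sum(j.btag for j in good_jets)
    return len(good_leptons) == 1 and len(good_jets) >= 4 and n_btagged >= 2


# Toy event: one lepton, four jets, two of them b-tagged.
toy_leptons = [Particle(35.0, 0.4)]
toy_jets = [
    Particle(60.0, 1.0, btag=True),
    Particle(45.0, -0.3, btag=True),
    Particle(30.0, 2.0),
    Particle(25.0, -1.5),
]
print(passes_selection(toy_leptons, toy_jets))
```

Dropping any one jet from the toy event makes the selection fail, because fewer than four jets remain in acceptance.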


In addition, we reconstruct the missing transverse momentum in the event and have b-tagging information for each jet. Together, 21 features comprise our low-level feature set.

The low-level features show some distinguishing characteristics, but our knowledge of the different intermediate states of the two processes allows us to construct other features that better capture the differences.
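To see where the count of 21 comes from, the sketch below enumerates the low-level features. It assumes the layout documented for the public HIGGS data set: three lepton quantities, two missing-energy quantities, and four jets with three kinematic quantities plus a b-tag each. The column names are illustrative labels, not official identifiers.

```python
# Lepton kinematics and missing transverse momentum (assumed layout).
event_features = [
    "lepton_pt", "lepton_eta", "lepton_phi",
    "missing_energy_magnitude", "missing_energy_phi",
]

# Four leading jets, each with pT, eta, phi and a b-tag value.
jet_features = [
    f"jet{i}_{quantity}"
    for i in range(1, 5)
    for quantity in ("pt", "eta", "phi", "btag")
]

low_level_features = event_features + jet_features
print(len(low_level_features))  # 5 + 4 * 4 = 21
```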

For the signal process, we can check that the reconstructed invariant masses peak where expected:

- $W \rightarrow l\nu$ gives a peak in the $M_{l\nu}$ distribution at the known $W$ boson mass, $M_{W}$;

- $W \rightarrow jj$ gives a peak in the $M_{jj}$ distribution at $M_{W}$;

- $h^{0} \rightarrow b\bar{b}$ gives a peak in the $M_{b\bar{b}}$ distribution at the known Higgs boson mass, $M_{h^{0}}$;

- $H^{\pm} \rightarrow Wh^{0}$ gives a peak in the $M_{Wb\bar{b}}$ distribution at the assumed $H^{\pm}$ mass, $M_{H^{\pm}}$;

- $H^{0} \rightarrow WH^{\pm}$ gives a peak in the $M_{WW b \bar{b}}$ distribution at the assumed $H^{0}$ mass, $M_{H^{0}}$.
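Each of these quantities is the invariant mass of a sum of four-momenta, $M^{2} = (\sum E)^{2} - |\sum \vec{p}|^{2}$. The sketch below shows one way to compute it from $(p_{T}, \eta, \phi, m)$; the helper names `four_vector` and `invariant_mass` and the toy inputs are our own and are not part of the data set or of any library.

```python
import numpy as np


def four_vector(pt, eta, phi, mass=0.0):
    """Return (E, px, py, pz) in GeV from collider coordinates."""
    px = pt * np.cos(phi)
    py = pt * np.sin(phi)
    pz = pt * np.sinh(eta)
    energy = np.sqrt(px**2 + py**2 + pz**2 + mass**2)
    return np.array([energy, px, py, pz])


def invariant_mass(*vectors):
    """Invariant mass of the summed four-vectors."""
    energy, px, py, pz = np.sum(vectors, axis=0)
    m2 = energy**2 - (px**2 + py**2 + pz**2)
    # Clip small negative values caused by floating-point rounding.
    return float(np.sqrt(max(m2, 0.0)))


# Toy check: two massless, back-to-back jets with pT = 40.2 GeV each
# combine to an invariant mass of 80.4 GeV, the W mass quoted above.
jet_a = four_vector(40.2, 0.0, 0.0)
jet_b = four_vector(40.2, 0.0, np.pi)
m_jj = invariant_mass(jet_a, jet_b)
print(round(m_jj, 3))
```

The same function applies to larger systems, for example `invariant_mass(w_lep, w_had, b1, b2)` for $M_{WW b\bar{b}}$, once each object is expressed as a four-vector.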


Note that the leptonic W boson is reconstructed by combining the lepton with the neutrino, whose transverse momentum is deduced from the momentum imbalance of the final-state objects and whose rapidity is set to give the $M_{l\nu}$ closest to $M_{W} =$ 80.4 GeV.
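One way to make the rapidity choice concrete is sketched below, under the assumption that the lepton and neutrino are both massless, so that $M_{l\nu}^{2} = 2\, p_{T}^{l}\, p_{T}^{\nu} \left(\cosh\Delta\eta - \cos\Delta\phi\right)$. Setting $M_{l\nu} = M_{W}$ fixes $\cosh\Delta\eta$. When that value falls below 1, no rapidity reproduces $M_{W}$ exactly, and the closest mass is obtained at $\Delta\eta = 0$. The function names are hypothetical, and the source does not say how the two-fold ambiguity is resolved, so both candidates are returned.

```python
import math

M_W = 80.4  # GeV, value quoted in the text


def lnu_mass(lep_pt, lep_eta, lep_phi, met, met_phi, nu_eta):
    """Invariant mass of a massless lepton and a massless neutrino."""
    m2 = 2.0 * lep_pt * met * (
        math.cosh(lep_eta - nu_eta) - math.cos(lep_phi - met_phi)
    )
    return math.sqrt(max(m2, 0.0))


def neutrino_eta_candidates(lep_pt, lep_eta, lep_phi, met, met_phi, m_w=M_W):
    """Neutrino rapidities giving M_lnu closest to m_w (assumes pT, MET > 0)."""
    cosh_deta = m_w**2 / (2.0 * lep_pt * met) + math.cos(lep_phi - met_phi)
    if cosh_deta < 1.0:
        # The transverse mass already exceeds m_w; M_lnu is smallest,
        # and therefore closest to m_w, when the rapidities coincide.
        return (lep_eta,)
    d_eta = math.acosh(cosh_deta)
    return (lep_eta - d_eta, lep_eta + d_eta)


# Toy event (hypothetical numbers): both candidates reproduce M_W.
candidates = neutrino_eta_candidates(40.0, 0.5, 0.0, 35.0, 2.5)
masses = [lnu_mass(40.0, 0.5, 0.0, 35.0, 2.5, eta) for eta in candidates]
print(candidates, masses)
```

Which of the two solutions to keep is a separate design decision; common options include taking the one with the smaller $|\eta_{\nu}|$ or the one that gives the better top or Higgs mass in the full event.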

In the case of the $t\bar{t}$ background, we expect:

- $W \rightarrow l \nu$ gives a peak in $M_{l\nu}$ at $M_{W}$;

- $W \rightarrow jj$ gives a peak in $M_{jj}$ at $M_{W}$;

- $t \rightarrow Wb$ gives a peak in $M_{l\nu b}$ and $M_{jj b}$ at $M_{t}$.

In [1]:
# Load the dataset using pandas and numpy 

import pandas as pd
import numpy as np 
from sklearn.model_selection import train_test_split  
