<a href="https://colab.research.google.com/github/elisasanzani/MachineLearningProject/blob/main/Zboson_decay.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [27]:
import logging
logging.getLogger('matplotlib.font_manager').setLevel(logging.ERROR)

import numpy as np
import xgboost as xgb
import pandas as pd
import os
import time

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, auc
from sklearn.model_selection import train_test_split


np.random.seed() # reseed NumPy's global random generator from OS entropy (not reproducible across runs)



The data are downloaded from my GitHub repository.
The datasets come from [CERN Open Data](http://opendata.cern.ch/record/545).

The events were selected to obtain
* **candidate Z boson decays into 2 muons**: an event was selected if it contained two muons with pT > 20 GeV and |eta| < 2.1, and the invariant mass of the two muons was between 60 GeV and 120 GeV.
* **candidate Z boson decays into 2 electrons**: an event was selected if it contained two electrons with pT > 25 GeV, and the invariant mass of the two electrons was between 60 GeV and 120 GeV.
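
As an illustration of the invariant-mass requirement, the sketch below computes the dilepton mass from ```pt```, ```eta``` and ```phi``` and applies the 60-120 GeV window. It assumes the lepton masses can be neglected, which is a common approximation at these energies; the function names ```invariant_mass``` and ```in_z_window``` and the toy values are hypothetical and are not the code used to produce the CERN Open Data files.

```python
import numpy as np
import pandas as pd


def invariant_mass(pt1, eta1, phi1, pt2, eta2, phi2):
    """Dilepton invariant mass in GeV, neglecting the lepton masses (assumption)."""
    pt1, eta1, phi1 = (np.asarray(x, dtype=float) for x in (pt1, eta1, phi1))
    pt2, eta2, phi2 = (np.asarray(x, dtype=float) for x in (pt2, eta2, phi2))
    # M^2 = 2 * pt1 * pt2 * (cosh(delta eta) - cos(delta phi)) for massless particles
    m2 = 2.0 * pt1 * pt2 * (np.cosh(eta1 - eta2) - np.cos(phi1 - phi2))
    # Guard against tiny negative values caused by floating-point rounding
    return np.sqrt(np.maximum(m2, 0.0))


def in_z_window(df, m_min=60.0, m_max=120.0):
    """Boolean mask of events whose dilepton mass lies inside (m_min, m_max)."""
    mass = invariant_mass(df["pt1"], df["eta1"], df["phi1"],
                          df["pt2"], df["eta2"], df["phi2"])
    return (mass > m_min) & (mass < m_max)


# Toy events (hypothetical values): one back-to-back pair, one collinear pair
toy = pd.DataFrame({
    "pt1": [45.0, 45.0], "eta1": [0.0, 1.0], "phi1": [0.0, 0.5],
    "pt2": [45.0, 45.0], "eta2": [0.0, 1.0], "phi2": [np.pi, 0.5],
})
toy_mass = invariant_mass(toy["pt1"], toy["eta1"], toy["phi1"],
                          toy["pt2"], toy["eta2"], toy["phi2"])
toy_selected = toy[in_z_window(toy)]
print(toy_mass)
print(len(toy_selected), "event(s) inside the mass window")
```

Only the back-to-back pair falls inside the window; the collinear pair has a vanishing invariant mass and is rejected.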





In [28]:
!wget https://raw.githubusercontent.com/elisasanzani/MachineLearningProject/main/Zee.csv -O Zee.csv
!wget https://raw.githubusercontent.com/elisasanzani/MachineLearningProject/main/Zmumu.csv -O Zmumu.csv

--2023-10-05 13:59:38--  https://raw.githubusercontent.com/elisasanzani/MachineLearningProject/main/Zee.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1445651 (1.4M) [text/plain]
Saving to: ‘Zee.csv’


2023-10-05 13:59:38 (18.1 MB/s) - ‘Zee.csv’ saved [1445651/1445651]

--2023-10-05 13:59:38--  https://raw.githubusercontent.com/elisasanzani/MachineLearningProject/main/Zmumu.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 970550 (948K) [text/plain]
Saving to: ‘Zmumu.csv’


2023-10-05 13:59:39 (13.2 MB/s) - ‘Zmumu.csv’ save

# Let's have a look at the datasets

We can drop a few columns from each dataset to simplify the training and to keep only the features that are available for both decay modes. For each of the two leptons (suffixes ```1``` and ```2```) we will keep
* Run number (```Run```) and event number (```Event```)
* The transverse momentum of the lepton, in GeV (```pt```)
* The pseudorapidity of the lepton (```eta```)
* The phi angle of the lepton, in radians (```phi```)
* The electric charge of the lepton (```Q```)

We also need to add a column with a flag (0 for Zee, 1 for Zmumu) to identify the two decay modes before merging and shuffling the datasets.
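
The column selection and labelling can also be expressed as a keep-list rather than a drop-list, so that both datasets are guaranteed to end up with the same columns. This is only a sketch: the helper ```prepare``` and the toy frames are hypothetical, and the extra columns in the toy data merely stand in for the detector-specific ones.

```python
import pandas as pd

# Columns shared by the electron and muon datasets
KEEP = ["Run", "Event", "pt1", "eta1", "phi1", "Q1", "pt2", "eta2", "phi2", "Q2"]


def prepare(df, flag, keep=KEEP):
    """Return a copy of df restricted to `keep`, with a constant `flag` column added."""
    out = df.loc[:, keep].copy()
    out["flag"] = flag
    return out


# Toy frames (hypothetical values) with one extra, dataset-specific column each
base = {"Run": [1], "Event": [10], "pt1": [30.0], "eta1": [0.1], "phi1": [1.0], "Q1": [1],
        "pt2": [28.0], "eta2": [-0.3], "phi2": [-2.0], "Q2": [-1]}
toy_ee = prepare(pd.DataFrame({**base, "HoverE1": [0.0]}), flag=0)
toy_mumu = prepare(pd.DataFrame({**base, "iso1": [0.5]}), flag=1)
print(list(toy_ee.columns))
```

Selecting by name fails loudly with a ```KeyError``` if an expected column is missing, which can be easier to debug than a silent mismatch after dropping.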

## **Z → ee**

In [48]:
df_ee = pd.read_csv('Zee.csv') # read csv file
print('Total number of events: ', len(df_ee), '\n')
df_ee.head()

Total number of events:  10000 



Unnamed: 0,Run,Event,pt1,eta1,phi1,Q1,type1,sigmaEtaEta1,HoverE1,isoTrack1,...,pt2,eta2,phi2,Q2,type2,sigmaEtaEta2,HoverE2,isoTrack2,isoEcal2,isoHcal2
0,163286,109060857,37.5667,2.2892,2.0526,-1,EE,0.0251,0.009,0.0,...,45.4315,1.4706,-1.163,1,EB,0.0008,0.0,0.0,1.019,0.0
1,163286,109275715,36.2901,-0.8373,-1.5859,1,EB,0.0078,0.0438,0.0,...,60.5754,-0.4896,1.0496,-1,EB,0.0112,0.0,0.7185,1.8461,0.0
2,163286,109075352,25.9705,-0.6974,1.636,-1,EB,0.0097,0.0407,6.287,...,45.2954,-2.0401,3.1187,1,EE,0.026,0.028,15.217,4.5337,3.837
3,163286,109169766,41.0075,1.4619,-0.5325,1,EB,0.0088,0.0,0.0,...,45.9013,1.1561,2.4786,-1,EB,0.0086,0.0,0.0,2.4388,0.5676
4,163286,108947653,39.8985,-0.5927,-2.3947,1,EB,0.0153,0.0,2.5435,...,34.8931,-2.2444,0.6106,-1,EE,0.029,0.0,12.4229,0.4534,0.9096


In [49]:
print('Before dropping columns: \n', list(df_ee.columns))
# drop the electron-specific columns that have no counterpart in the muon dataset
electron_only = ['type1', 'sigmaEtaEta1', 'HoverE1', 'isoTrack1', 'isoEcal1', 'isoHcal1',
                 'type2', 'sigmaEtaEta2', 'HoverE2', 'isoTrack2', 'isoEcal2', 'isoHcal2']
df_ee = df_ee.drop(electron_only, axis=1)
print('After dropping columns: \n', list(df_ee.columns))
print ('\n\nAdding flag column')
df_ee['flag'] = 0
df_ee.head()

Before dropping columns: 
 ['Run', 'Event', 'pt1', 'eta1', 'phi1', 'Q1', 'type1', 'sigmaEtaEta1', 'HoverE1', 'isoTrack1', 'isoEcal1', 'isoHcal1', 'pt2', 'eta2', 'phi2', 'Q2', 'type2', 'sigmaEtaEta2', 'HoverE2', 'isoTrack2', 'isoEcal2', 'isoHcal2']
After dropping columns: 
 ['Run', 'Event', 'pt1', 'eta1', 'phi1', 'Q1', 'pt2', 'eta2', 'phi2', 'Q2']


Adding flag column



Unnamed: 0,Run,Event,pt1,eta1,phi1,Q1,pt2,eta2,phi2,Q2,flag
0,163286,109060857,37.5667,2.2892,2.0526,-1,45.4315,1.4706,-1.163,1,0
1,163286,109275715,36.2901,-0.8373,-1.5859,1,60.5754,-0.4896,1.0496,-1,0
2,163286,109075352,25.9705,-0.6974,1.636,-1,45.2954,-2.0401,3.1187,1,0
3,163286,109169766,41.0075,1.4619,-0.5325,1,45.9013,1.1561,2.4786,-1,0
4,163286,108947653,39.8985,-0.5927,-2.3947,1,34.8931,-2.2444,0.6106,-1,0


Let's have a look at the distributions of the features

## **Z → 𝜇𝜇**

In [None]:
df_mumu = pd.read_csv('Zmumu.csv') # read csv file
print('Total number of events: ', len(df_mumu), '\n')
df_mumu.head()

In [50]:
print('Before dropping columns: \n', list(df_mumu.columns))
# drop the muon-specific columns that have no counterpart in the electron dataset
df_mumu = df_mumu.drop(['dxy1', 'iso1', 'dxy2', 'iso2'], axis=1)
print('After dropping columns: \n', list(df_mumu.columns))
print ('\n\nAdding flag column')
df_mumu['flag'] = 1
df_mumu.head()

Before dropping columns: 
 ['Run', 'Event', 'pt1', 'eta1', 'phi1', 'Q1', 'dxy1', 'iso1', 'pt2', 'eta2', 'phi2', 'Q2', 'dxy2', 'iso2']
After dropping columns: 
 ['Run', 'Event', 'pt1', 'eta1', 'phi1', 'Q1', 'pt2', 'eta2', 'phi2', 'Q2']


Adding flag column



Unnamed: 0,Run,Event,pt1,eta1,phi1,Q1,pt2,eta2,phi2,Q2,flag
0,165617,74969122,54.7055,-0.4324,2.5742,1,34.2464,-0.9885,-0.4987,-1,1
1,165617,75138253,24.5872,-2.0522,2.8666,-1,28.5389,0.3852,-1.9912,1,1
2,165617,75887636,31.7386,-2.2595,-1.3323,-1,30.2344,-0.4684,1.8833,1,1
3,165617,75779415,39.7394,-0.7123,-0.3123,1,48.279,-0.1956,2.9703,-1,1
4,165617,75098104,41.2998,-0.1571,-3.0408,1,43.4508,0.591,-0.0428,-1,1


## Now we can merge the two datasets into a single one, shuffle it, and look at the overall distributions




In [58]:
df_all = pd.concat([df_ee, df_mumu], ignore_index=True)
df_all = df_all.sample(frac=1, random_state=42).reset_index(drop=True)
df_all.head()

Unnamed: 0,Run,Event,pt1,eta1,phi1,Q1,pt2,eta2,phi2,Q2,flag
0,166784,20257329,42.913,-1.3248,-3.0057,-1,42.2897,-0.6624,0.3485,1,1
1,163261,64679856,39.8744,-0.0057,-0.7744,-1,36.3997,1.371,2.2952,1,0
2,165570,196860468,72.9296,-0.4162,0.9198,1,39.541,-0.268,-1.4629,1,0
3,172163,497791581,33.4436,0.3063,-1.6034,1,27.2395,-1.6182,1.5315,-1,0
4,173692,550966077,43.9826,0.4276,-0.3876,-1,29.7856,-0.308,2.8965,1,1
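
A quick sanity check after merging and shuffling is to confirm that no events were lost and that the number of events per decay mode is unchanged. The sketch below does this on toy data; the helper ```merge_and_shuffle``` and the toy values are hypothetical, and the seed simply mirrors the ```random_state=42``` used above.

```python
import pandas as pd


def merge_and_shuffle(frames, seed=42):
    """Concatenate the frames, then shuffle the rows reproducibly."""
    merged = pd.concat(frames, ignore_index=True)
    return merged.sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy labelled datasets (hypothetical values)
toy_ee = pd.DataFrame({"pt1": [30.0, 41.0, 52.0], "flag": 0})
toy_mumu = pd.DataFrame({"pt1": [25.0, 36.0], "flag": 1})

toy_all = merge_and_shuffle([toy_ee, toy_mumu])
counts = toy_all["flag"].value_counts().to_dict()
print(len(toy_all), counts)
```

On the real data the same check would be ```df_all['flag'].value_counts()```.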


It is also worth checking whether there are zeros or NaN values.

In [60]:
df_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Run     20000 non-null  int64  
 1   Event   20000 non-null  int64  
 2   pt1     20000 non-null  float64
 3   eta1    20000 non-null  float64
 4   phi1    20000 non-null  float64
 5   Q1      20000 non-null  int64  
 6   pt2     20000 non-null  float64
 7   eta2    20000 non-null  float64
 8   phi2    20000 non-null  float64
 9   Q2      20000 non-null  int64  
 10  flag    20000 non-null  int64  
dtypes: float64(6), int64(5)
memory usage: 1.7 MB


In [71]:
features = df_all.columns[:-1]  # every column except the 'flag' label

# rows with at least one NaN among the features
rows_with_nan = df_all[df_all[features].isna().any(axis=1)]
if not rows_with_nan.empty:
    print("The DataFrame contains NaN values in these rows:")
    print(rows_with_nan)
else:
    print("The DataFrame does not contain NaN values.")

# rows with at least one feature exactly equal to zero
rows_with_zeros = df_all[df_all[features].eq(0).any(axis=1)]
if not rows_with_zeros.empty:
    print("The DataFrame contains zero values in these rows:")
    print(rows_with_zeros)
else:
    print("The DataFrame does not contain zero values.")


The DataFrame does not contain NaN values.
The DataFrame contains zero values in these rows:
          Run      Event      pt1  eta1    phi1  Q1      pt2    eta2    phi2  \
11198  173381  209714969  39.1714   0.0 -2.9318  -1  52.2589  0.1797  0.1143   

       Q2  flag  
11198   1     1  


There is one zero value, in ```eta1```. Pseudorapidity can legitimately be zero (it corresponds to a direction perpendicular to the beam pipe) and this does not appear to be a bad event, so we'll keep it.
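
To see why a pseudorapidity of zero is physical, recall the standard definition eta = -ln(tan(theta/2)), where theta is the polar angle with respect to the beam axis. The short sketch below evaluates it for a few angles; the function name ```pseudorapidity``` is only illustrative.

```python
import math


def pseudorapidity(theta):
    """eta = -ln(tan(theta / 2)), with theta the polar angle in radians (0 < theta < pi)."""
    return -math.log(math.tan(theta / 2.0))


eta_perp = pseudorapidity(math.pi / 2)          # perpendicular to the beam
eta_forward = pseudorapidity(math.radians(10))  # close to the beam direction
eta_backward = pseudorapidity(math.pi - math.radians(10))
print(eta_perp, eta_forward, eta_backward)
```

A particle emitted at 90 degrees to the beam has eta equal to zero up to floating-point rounding, while directions close to the beam give large positive or negative values.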

## Data visualisation

In [62]:
# Create a sample DataFrame
data = {'A': [1, 2, np.nan, 4],
        'B': [0, np.nan, 3, 4],
        'C': [1, 2, 3, 4]}

df = pd.DataFrame(data)

# Check for NaN values in the DataFrame
nan_check = df.isna()  # Returns a DataFrame with True for NaN values and False for non-NaN values

# Check if there are any NaN values in the entire DataFrame
has_nan = nan_check.any().any()

if has_nan:
    print("The DataFrame contains NaN values.")
else:
    print("The DataFrame does not contain NaN values.")

df.info()

The DataFrame contains NaN values.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       3 non-null      float64
 1   B       3 non-null      float64
 2   C       4 non-null      int64  
dtypes: float64(2), int64(1)
memory usage: 224.0 bytes
