# Electron Study

The aim of this study is to show that the introduction of two new track properties, `chi2rphi` and `chi2rz`, improves the performance of machine learning algorithms that classify electron-labeled track-trigger tracks as real or fake.

The Monte Carlo samples used are a QCD sample, a Z boson to muon-muon sample, and a Z boson to electron-electron sample. Each sample was produced with the D49 detector geometry and contains 1000 events, each with a pileup of about 200.

Much of the code here relies on the [`ntupledicts`](https://github.com/cqpancoast/ntupledicts) package, whose repository also includes a simple tutorial covering all code used here.
The requirements for running this notebook are the same as those listed for `ntupledicts` in its README.

In [21]:
from uproot import open as uproot_open
from matplotlib.pyplot import cla, sca, gca, savefig
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Softmax

from ntupledicts import operations as ndops
from ntupledicts.operations import select as sel
from ntupledicts import plot as ndplot
from ntupledicts.ml import data as ndmldata
from ntupledicts.ml import predict as ndmlpred
from ntupledicts.ml import models as ndmlmodels
from ntupledicts.ml import plot as ndmlplot

## Data Acquisition

Grab tracks from stored ntuples, apply cuts, and process the result into datasets.

In [22]:
# List the ntuples we want data from
input_files = ["eventsets/ZMM_PU200_D49.root",
    "eventsets/ZEE_PU200_D49.root",
    "eventsets/QCD_PU200_D49.root"]

# Create list of uproot event sets for easy data access
event_sets = []
for input_file in input_files:
    event_sets.append(next(iter(uproot_open(input_file).values()))["eventTree"])
    
# Choose which track properties to load from the ntuples.
# The subset a model actually trains on is selected later.
# The resulting ntuple dict has a single track type, "trk".
properties_by_track_type = {"trk": ["pt", "eta", "z0", "nstub", "genuine", "matchtp_pdgid",
                                    "chi2", "bendchi2", "chi2rphi", "chi2rz"]}

# Create ntuple dict from event sets
ntuple_dict = ndops.uproot_ntuples_to_ntuple_dict(event_sets, properties_by_track_type)

trk
['pt', 'eta', 'z0', 'nstub', 'genuine', 'matchtp_pdgid', 'chi2', 'bendchi2', 'chi2rphi', 'chi2rz']
trk
['pt', 'eta', 'z0', 'nstub', 'genuine', 'matchtp_pdgid', 'chi2', 'bendchi2', 'chi2rphi', 'chi2rz']
trk
['pt', 'eta', 'z0', 'nstub', 'genuine', 'matchtp_pdgid', 'chi2', 'bendchi2', 'chi2rphi', 'chi2rz']


### Apply Cuts

Balance the dataset so that it contains as many fake tracks as real tracks, then apply any additional cuts.

In [12]:
# Would we like to consider only a portion of this dataset? (Typically done for speed reasons.)
reduce_ntuple_dict = True
reduction_size = 10000  # number of tracks to reduce to
if reduce_ntuple_dict:
    ntuple_dict = ndops.reduce_ntuple_dict(ntuple_dict, reduction_size, shuffle_tracks=True)

# Downsample the genuine tracks to match the number of fake tracks,
# then concatenate and shuffle ("nd" is short for ntuple dict)
all_nd_gens = ndops.cut_ntuple_dict(ntuple_dict, {"trk": {"genuine": sel(1)}})
nd_fakes = ndops.cut_ntuple_dict(ntuple_dict, {"trk": {"genuine": sel(0)}})
nd_gens = ndops.reduce_ntuple_dict(all_nd_gens,
                                   track_limit=ndops.track_prop_dict_length(nd_fakes["trk"]),
                                   shuffle_tracks=True,
                                   seed=42)
nd_both = ndops.shuffle_ntuple_dict(ndops.add_ntuple_dicts([nd_gens, nd_fakes]), seed=42)

# Are there any other cuts that should be applied to this dataset?
additional_cuts = {"trk": {"pt": sel(2, 100)}}
ntuple_dict = ndops.cut_ntuple_dict(nd_both, additional_cuts)
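
The balancing step above can be illustrated without `ntupledicts`. The following is a minimal sketch in plain Python, not the package's implementation: the helper `balance_classes` and the toy track list are hypothetical, and, unlike the cell above, it downsamples whichever class is larger rather than assuming genuine tracks outnumber fakes.

```python
import random

def balance_classes(tracks, label_key="genuine", seed=42):
    """Downsample the larger class so both labels appear equally often.

    `tracks` is a list of dicts; this layout is an assumption made for
    illustration and is not the ntupledicts data format.
    """
    rng = random.Random(seed)
    genuine = [t for t in tracks if t[label_key] == 1]
    fake = [t for t in tracks if t[label_key] == 0]
    n = min(len(genuine), len(fake))
    # Draw n tracks from each class without replacement, then mix them
    balanced = rng.sample(genuine, n) + rng.sample(fake, n)
    rng.shuffle(balanced)
    return balanced

# Hypothetical toy data: 8 genuine tracks and 3 fake tracks
toy_tracks = ([{"pt": 2.0 + i, "genuine": 1} for i in range(8)]
              + [{"pt": 3.0 + i, "genuine": 0} for i in range(3)])
balanced_tracks = balance_classes(toy_tracks)
```

Fixing the seed, as the cell above does with `seed=42`, keeps the selected subset reproducible between runs.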

### Process into Datasets

Process the ntuple dict above into `TrackPropertiesDataset`s.

In [13]:
data_properties = ["chi2", "bendchi2", "nstub"]  # what properties do we want to train our models on?
label_property = "genuine"                       # what property are we trying to predict?
split_list = [.7, .2, .1]                        # how many datasets should we create,
                                                 #   and with what relative sizes?
train_ds, eval_ds, test_ds = ndmldata.TrackPropertiesDataset(ntuple_dict["trk"],
                                                             label_property,
                                                             data_properties).split(split_list)
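
To make the meaning of `split_list` concrete, here is a minimal sketch of splitting a collection by relative sizes. The function `split_by_fractions` is hypothetical and only illustrates the idea; how `TrackPropertiesDataset.split` handles rounding or ordering internally is not shown in this notebook and may differ.

```python
def split_by_fractions(items, fractions):
    """Split a list into consecutive chunks with the given relative sizes."""
    total = sum(fractions)
    bounds = []
    cumulative = 0.0
    # Turn the running sum of fractions into slice boundaries
    for fraction in fractions[:-1]:
        cumulative += fraction
        bounds.append(round(len(items) * cumulative / total))
    starts = [0] + bounds
    ends = bounds + [len(items)]
    return [items[start:end] for start, end in zip(starts, ends)]

# Hypothetical toy data: 100 items split 70/20/10
toy_items = list(range(100))
train_part, eval_part, test_part = split_by_fractions(toy_items, [.7, .2, .1])
```

Because the last chunk takes whatever remains, every item lands in exactly one chunk even when the fractions do not divide the dataset evenly.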

## Making Models

Build a neural network and a gradient boosted decision tree, and train both on the training data. Also define a set of predictive cuts to compare the models against.

In [19]:
NN = ndmlmodels.make_neuralnet(train_ds, eval_dataset=eval_ds, hidden_layers=[15, 8], epochs=10)
GBDT = ndmlmodels.make_gbdt(train_ds)
cuts = {"chi2rphi": sel(0, 23), "chi2rz": sel(0, 7), "chi2": sel(0, 21)}

Train on 70 samples, validate on 20 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


### Model Predictions

Use the test dataset `test_ds`, which has not been used for training, to generate predicted labels. These are probabilistic for the models `NN` and `GBDT` and exact for `cuts`. Store the predictions in `test_ds` for easy future access.

In [20]:
test_ds.add_prediction("NN", ndmlpred.predict_labels(NN, test_ds.get_data()))
test_ds.add_prediction("GBDT", ndmlpred.predict_labels(GBDT, test_ds.get_data()))
test_ds.add_prediction("cuts", ndmlpred.predict_labels_cuts(cuts, test_ds))
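
The difference between exact cut-based labels and probabilistic model outputs can be sketched in plain Python. Everything below is hypothetical illustration rather than the `ntupledicts` implementation: the helper names, the toy tracks, and the choice to treat each cut range as inclusive at both ends are all assumptions.

```python
def predict_with_cuts(tracks, cuts):
    """Label a track 1 if every cut property lies in its range, else 0.

    Ranges are treated as inclusive at both ends (an assumption).
    """
    return [int(all(low <= track[prop] <= high
                    for prop, (low, high) in cuts.items()))
            for track in tracks]

def threshold_labels(probabilities, cutoff=0.5):
    """Turn probabilistic predictions into exact labels at a cutoff."""
    return [int(p >= cutoff) for p in probabilities]

def rates(true_labels, predicted_labels):
    """Return (true positive rate, false positive rate)."""
    pairs = list(zip(true_labels, predicted_labels))
    positives = sum(1 for t, _ in pairs if t == 1)
    negatives = sum(1 for t, _ in pairs if t == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return tpr, fpr

# Same cut ranges as the notebook, written as plain (low, high) tuples
toy_cuts = {"chi2rphi": (0, 23), "chi2rz": (0, 7), "chi2": (0, 21)}
# Hypothetical toy tracks and true labels
toy_tracks = [{"chi2rphi": 5.0, "chi2rz": 1.0, "chi2": 6.0},
              {"chi2rphi": 30.0, "chi2rz": 1.0, "chi2": 31.0},
              {"chi2rphi": 5.0, "chi2rz": 9.0, "chi2": 14.0}]
toy_truth = [1, 0, 1]

cut_labels = predict_with_cuts(toy_tracks, toy_cuts)
model_labels = threshold_labels([0.9, 0.2, 0.5])
cut_tpr, cut_fpr = rates(toy_truth, cut_labels)
```

A set of cuts yields a single (TPR, FPR) point, whereas a probabilistic model yields one such point per cutoff, which is what makes a comparison between the two possible.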