# ML4P for Jet Clustering

In this notebook, we use our topo-cluster classifiers and regressors to correct the energies of topo-clusters, and then cluster the corrected topo-clusters into jets. The goal is to see how using these corrected energies (i.e. applying our ML-based calibration) affects the jet energy scale resolution.

## Setup

First we have to import a whole bunch of things we'll use.

In [1]:
# First, the generic imports.
import sys, os, glob, uuid, pathlib
import numpy as np
import h5py as h5
import pandas as pd
import ROOT as rt # Mostly useful here for plotting
import matplotlib.pyplot as plt # Alternative plotting option
import uproot as ur
import awkward as ak

# Next, import our utilities. We define our "path_prefix" from where we can find them.
path_prefix = os.getcwd() + '/../'
if(path_prefix not in sys.path): sys.path.append(path_prefix)

from util import ml_util as mu # Data preparation and wrangling for our neural networks.
from util import qol_util as qu # Quality-of-life stuff, like plot styles and progress bars.
from util import jet_util as ju # Jet-specific utilities, e.g. wrapping for FastJet & introducing our ML outputs to jet clustering.
from util import io_util as iu # Utilities for scaling regression input/output.

# Classification-specific utilities (network setup).
from util.classification import data_util as cdu
from util.classification import training_util as ctu # besides training, can be used to load network from file
import util.classification.models as classifier_models
import util.classification.models_exp as classifier_models_exp

# Regression-specific utilities (data-loading, network setup).
from util.regression import data_util as rdu
from util.regression import training_util as rtu # besides training, can be used to load network from file
import util.regression.models as regressor_models

Welcome to JupyROOT 6.24/02


2021-07-22 08:29:04.654929: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0


In [2]:
# Set up some plotting stuff.
plot_style = 'dark'
ps = qu.PlotStyle(plot_style)
ps.SetStyle() # will automatically affect ROOT plots from here on out, it sets ROOT.gStyle.
rt.gStyle.SetOptStat(0)

In [3]:
# Set up some calorimeter metadata.
# TODO: Get this from one of our libraries
layers = ["EMB1", "EMB2", "EMB3", "TileBar0", "TileBar1", "TileBar2"]
cell_size_phi = [0.098, 0.0245, 0.0245, 0.1, 0.1, 0.1]
cell_size_eta = [0.0031, 0.025, 0.05, 0.1, 0.1, 0.2]
len_phi = [4, 16, 16, 4, 4, 4]
len_eta = [128, 16, 8, 4, 4, 2]
cell_shapes = {layers[i]:(len_eta[i],len_phi[i]) for i in range(len(layers))}

In [4]:
# Perform fastjet setup. This function will download & build fastjet if it isn't found at the given location.
fastjet_dir = path_prefix + 'fastjet'
fastjet_dir =  ju.BuildFastjet(fastjet_dir, j=8)
fastjet_dir = glob.glob('{}/**/site-packages'.format(fastjet_dir),recursive=True)[0]
if(fastjet_dir not in sys.path): sys.path.append(fastjet_dir)
import fastjet as fj

## Data preparation

Let's also fetch our jet data. This is MC dijet data, so we have some light-quark jets to work with. The data contains information on topo-clusters for each event -- the same cell-level and cluster-level info we have with our network training data.

We *also* need to fetch the locations of our classification and energy regression networks, so that we can load and apply them to the data.

In [6]:
h5_name_suffix = 'jdata' # used for HDF5 files containing selected events

# We package things as a dictionary for now, since that's what our `setupPionData` function expects.
data_dir = path_prefix + 'data/jet_small'
rootfiles = {'jet':glob.glob(data_dir + '/*.root')}
branches = [
            'clusterE', 'clusterECalib', 
            'clusterPt', 'clusterEta', 'clusterPhi', 
            'cluster_nCells', 'cluster_sumCellE', 
            'cluster_ENG_CALIB_TOT', 'cluster_EM_PROBABILITY'
] 

In [7]:
# Prepare data
h5_name = '/'.join((data_dir,h5_name_suffix))

pdata,pcells = mu.setupPionData(rootfiles,
                                branches=branches,
                                layers=layers,
                                balance_data=True,
                                n_max = 1000,
                                verbose=True,
                                load=True,
                                save=True,
                                filename=h5_name,
                                cut_distributions=['cluster_ENG_CALIB_TOT','clusterEta'],
                                cut_values = [.2, (-0.7,0.7)],
                                cut_types=['lower','window']
                               )

# Get rid of one layer of keys, which is redundant in this case.
pdata = pdata['jet']
pcells = pcells['jet']

Loading pandas DataFrame and calo images from /local/home/jano/ml4pions/LCStudies/jets/../data/jet_small/jdata_frame.h5 and /local/home/jano/ml4pions/LCStudies/jets/../data/jet_small/jdata_images.h5.


We also want to fetch information on EM and LC jets. These are stored in trees called `EventTree` in our files.

In [9]:
jet_branches = [
    'AntiKt4EMTopoJetsPt',
    'AntiKt4EMTopoJetsEta',
    'AntiKt4EMTopoJetsPhi',
    'AntiKt4EMTopoJetsE',
    'AntiKt4LCTopoJetsPt',
    'AntiKt4LCTopoJetsEta',
    'AntiKt4LCTopoJetsPhi',
    'AntiKt4LCTopoJetsE'
]

jet_info = ur.lazy([':'.join((x,'EventTree')) for x in rootfiles['jet']],filter_branch=lambda x: x.name in jet_branches)

### Network Preparation

Besides loading the jet data, we also need to load our models (neural networks), which we will apply to the data in order to get the corrected topo-cluster energies.

We have a number of different classifiers and regressors available -- here we will choose which ones we use. Note that our choice of classifier/regressor may also affect *how* we have to load the data. So if we switch models, we may also have to change our data-loading code.

Also note that for the energy regression, we are using a *binned* regression -- thus we will have multiple regressors for both charged and neutral pions, each corresponding to a particular range of topo-cluster reco energies.

In [None]:
classification_dir = path_prefix + 'classifier/Models/pion3'
classifier_modelname = 'cnn_split_EMB'
classifier_file = classification_dir + '/cnn/{}.h5'.format(classifier_modelname)
classifier_model = classifier_models_exp.exp_merged_model

In [None]:
regression_dir = path_prefix + 'regression_binned/Models/split_3'
regressor_modelname = 'split_emb_cnn'
regressor_model = regressor_models.split_emb_cnn

reco_energy_bin_edges = [0.,1.,10.] # lower bin edges

regression_files = {
    'charged': glob.glob(regression_dir + '/*/{a}/{a}_charged.h5'.format(a=regressor_modelname)),
    'neutral': glob.glob(regression_dir + '/*/{a}/{a}_neutral.h5'.format(a=regressor_modelname))
}

scaler_files = glob.glob(regression_dir + '/*/scalers.save')
scaler_files.sort()

for key,val in regression_files.items():
    regression_files[key].sort()

In [None]:
# Now we explicitly check that the filenames are lined up, i.e. that we will pair up the right scalers with the regression files.
for i,scaler_file in enumerate(scaler_files):
    bin_name = scaler_file.replace(regression_dir + '/','').split('/')[0]    
    for key,val in regression_files.items():
        bin_name_2 = val[i].replace(regression_dir + '/','').split('/')[0]
        assert(bin_name == bin_name_2)
        
# Also make sure that we have the right number of reco bin edges
for key,val in regression_files.items():
    assert(len(val) == len(reco_energy_bin_edges))

Now we must define our regression variables. The energy regression takes inputs that are scaled in two steps: first a mapping (a fixed function, here a logarithm of the energy) is applied, and then a set of scalers that were derived from the training data.
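As a rough illustration of this two-step scaling, the sketch below applies a log mapping and then standardizes the result using a mean and standard deviation fitted on stand-in "training" values. The functional form `log(m*x + b)` and the helper names (`log_forward`, `log_inverse`, `fit_standard_scaler`, `apply_standard_scaler`) are assumptions made for illustration only; the actual definitions in `iu.LogMapping` and `mu.setupScalers` may differ.

```python
import numpy as np

# Hypothetical mapping, ASSUMED to be log(m*x + b).
# This is not necessarily what iu.LogMapping implements.
def log_forward(x, m=1.0, b=1.0e-5):
    return np.log(m * x + b)

def log_inverse(y, m=1.0, b=1.0e-5):
    return (np.exp(y) - b) / m

# Hypothetical stand-in for a scaler derived from training data.
def fit_standard_scaler(train_values):
    return float(np.mean(train_values)), float(np.std(train_values))

def apply_standard_scaler(values, mean, std):
    return (values - mean) / std

train_E = np.array([0.5, 2.0, 8.0, 40.0])  # made-up training energies
new_E = np.array([1.0, 10.0])              # made-up energies to be scaled

# Step 1: mapping. Step 2: scaler fitted on the *training* data only.
mean, std = fit_standard_scaler(log_forward(train_E))
scaled_train = apply_standard_scaler(log_forward(train_E), mean, std)
scaled_new = apply_standard_scaler(log_forward(new_E), mean, std)

# The mapping is invertible, so network outputs can be mapped back to energies.
recovered = log_inverse(log_forward(new_E))
```

The key point is that the scaler parameters come from the training sample and are reused unchanged on the jet data, which is why the scaler files are loaded alongside the networks.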

In [None]:
m = 1.
b = 1.0e-5
EnergyMapping = iu.LogMapping(b=b,m=m)

In [None]:
# Some regression vars.
pdata['logE'] = EnergyMapping.Forward(pdata['clusterE'].to_numpy()) # log of reco energy, possible network input
pdata['clusterEtaAbs'] = np.abs(pdata['clusterEta'].to_numpy()) # absolute value of eta, possible network input

In [None]:
#TODO: We have scalers derived from charged pion and neutral pion data. We're just using the charged pion ones -- does this make sense?
scalers = []
scaler_branches = ['logE', 'clusterEtaAbs']
scaled_variable_prefixes = ['s{}'.format(i) for i in range(len(scaler_files))]
for i,scaler_file in enumerate(scaler_files):
    scalers.append(mu.setupScalers({'pp':pdata}, scaler_branches, scaler_file, scaled_variable_prefixes[i]))

Let's load the actual networks now.

In [None]:
classifier, _ = ctu.TrainNetwork(classifier_model(), classifier_file, overwriteModel=False, finishTraining=False)

In [None]:
regressor = {}

for key,val in regression_files.items():
    regressor[key] = []
    for regressor_file in val:
        reg, _ = rtu.TrainNetwork(regressor_model(), regressor_file, overwriteModel=False, finishTraining=False)
        regressor[key].append(reg)

Since we are using multiple regressors binned by reco energy, let's fetch the indices of the topo-clusters that will be passed to each regressor.

In [None]:
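regressor_indices = []
cluster_E = pdata['clusterE'].to_numpy()

# Loop in ascending order, so that regressor_indices[i] lines up with reco_energy_bin_edges[i]
# (and with the i-th scaler/regressor files, assuming the sorted files are in ascending bin order).
for i in range(len(reco_energy_bin_edges)):
    
    # Bins are half-open, [lower, upper), so that clusters sitting exactly on an edge are not skipped.
    indices = cluster_E >= reco_energy_bin_edges[i]

    # The last bin has no upper edge.
    if(i != len(reco_energy_bin_edges)-1):
        indices &= cluster_E < reco_energy_bin_edges[i+1]
        
    indices = np.where(indices)[0]
    regressor_indices.append(indices)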
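
A compact way to sketch this kind of bin assignment is `np.digitize`, which returns a bin number for every energy in one call. The example below is a minimal sketch on made-up energies, assuming half-open bins `[lower, upper)` and an open-ended last bin; the names `bin_edges`, `cluster_E` and `indices_per_bin` are illustrative and not part of our utilities.

```python
import numpy as np

bin_edges = np.array([0.0, 1.0, 10.0])            # lower bin edges, last bin is open-ended
cluster_E = np.array([0.3, 5.0, 0.9, 42.0, 1.0])  # made-up reco energies

# np.digitize returns i such that edges[i-1] <= x < edges[i];
# subtracting 1 gives a 0-based bin index (energies below edges[0] would give -1).
bin_index = np.digitize(cluster_E, bin_edges) - 1

# One index array per bin, in the same order as bin_edges.
indices_per_bin = [np.where(bin_index == i)[0] for i in range(len(bin_edges))]
```

Keeping the index arrays in the same order as the bin edges makes it straightforward to pair each one with the matching regressor and scaler.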

Lastly, we package the data in the format that our networks will use.

In [None]:
# Classifier data.
classifier_input = cdu.ReshapeImages(pcells, cell_shapes, use_layer_names=True)

In [None]:
# Regressor data. Note that we have a set of data for each of our regression energy bins, as they
# use different scaled energies.
regressor_input = []
for i in range(len(reco_energy_bin_edges)):
    prefix = scaled_variable_prefixes[i]
    dummy_key = 'jet'
    reg_input = rdu.ResnetInput(
        {dummy_key:pdata},
        {dummy_key:pcells},
        branch_map = {
            '{}_logE'.format(prefix):'energy',
            '{}_clusterEtaAbs'.format(prefix):'eta'
        }
    )
    
    reg_input = reg_input[dummy_key]
    
    # We can immediately pare things down by removing clusters that won't be used by a particular regressor.
    for key in reg_input.keys():
        reg_input[key] = reg_input[key][regressor_indices[i]]
    
    regressor_input.append(reg_input)

## Apply Neural Networks

Now that we've loaded our data and prepared our neural networks, we want to evaluate them on the data to get our classification scores and corrected energies.

The simplest way to do this would be to evaluate every network on every topo-cluster. However, it is more efficient to skip regressors for topo-clusters outside their energy range, since we would not use those results anyway.
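
The idea of evaluating each model only on its own subset, and then scattering the results back into one array, can be sketched as follows. This is a toy example: the "regressors" are plain Python functions standing in for real networks, and `evaluate_binned` is a hypothetical helper, not something from our utilities.

```python
import numpy as np

def evaluate_binned(models, indices_per_bin, inputs, n_total):
    """Evaluate models[i] only on inputs[indices_per_bin[i]], scattering results into one array."""
    out = np.full(n_total, np.nan)  # NaN marks entries that no model covered
    for model, idx in zip(models, indices_per_bin):
        if len(idx) == 0:
            continue  # nothing to evaluate for this bin
        out[idx] = model(inputs[idx])
    return out

# Stand-in "regressors": each returns a constant score for its inputs.
models = [
    lambda x: 1.1 * np.ones_like(x),
    lambda x: 1.2 * np.ones_like(x),
]
inputs = np.array([0.5, 3.0, 0.2, 7.0])
indices_per_bin = [np.array([0, 2]), np.array([1, 3])]

scores = evaluate_binned(models, indices_per_bin, inputs, len(inputs))
```

Initializing the output with NaN rather than zero is a design choice in this sketch: it makes any cluster that was not assigned to a regressor easy to spot afterwards.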

We could also in principle define our classification score cut here -- and only apply charged/neutral regressions to each topo-cluster based on its score and that cut. But it's easier to just apply the charged and neutral regression to each topo-cluster, so that we can adjust the cut afterwards without having to recompute things.

Lastly, we will save our network outputs to a file that can be loaded, so that we don't have to re-evaluate the networks every time. We just have to make sure to move/remove this file if we change something about the evaluation, e.g. which models we're using. Ultimately we'll want to design this so that if we're loading the scores from a file, we don't even load the networks into memory as we did above.
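
A minimal sketch of this load-or-compute pattern is shown below. It uses NumPy's `.npy` format purely to keep the example dependency-free (the notebook itself uses HDF5), and `load_or_compute` and `expensive_evaluation` are hypothetical names. Passing the evaluation as a callable means it only runs, and a network would only need to be loaded, when no cached file exists.

```python
import pathlib
import tempfile
import numpy as np

def load_or_compute(cache_file, compute):
    """Return (array, loaded_from_cache). Calls compute() only if cache_file is missing."""
    cache_file = pathlib.Path(cache_file)
    if cache_file.exists():
        return np.load(cache_file), True
    result = np.asarray(compute())
    np.save(cache_file, result)
    return result, False

calls = []  # records how many times the "network" was actually evaluated

def expensive_evaluation():
    # Stand-in for loading a network and calling predict().
    calls.append(1)
    return np.array([0.1, 0.9, 0.7])

with tempfile.TemporaryDirectory() as tmp:
    cache = pathlib.Path(tmp) / 'scores.npy'
    first, first_from_cache = load_or_compute(cache, expensive_evaluation)
    second, second_from_cache = load_or_compute(cache, expensive_evaluation)
```

As with the score files in this notebook, a stale cache has to be removed by hand whenever the models or inputs change.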

In [None]:
classification_score_file = 'classification_scores.h5'
regression_score_file = 'regression_scores.h5'

In [None]:
# First get the classification scores -- this is the simple part, we just apply the same classifier to all topo-clusters.

if(not pathlib.Path(classification_score_file).exists()):
    
    # Evaluate the network.
    print('Evaluating classifier.')
    classification_scores = classifier.predict(classifier_input)[:,1]
    
    # Save these scores to a file.
    print('Saving classification scores to {}.'.format(classification_score_file))
    with h5.File(classification_score_file, 'w') as hf:
        hf.create_dataset('scores',data=classification_scores,compression='gzip', compression_opts=7)
    
else:
    
    # Load the scores from a file.
    print('Loading classification scores from {}.'.format(classification_score_file))
    with h5.File(classification_score_file,'r') as hf:
        classification_scores = hf['scores'][:]

In [None]:
# Now handle the regression scores.
regression_scores = {}

if(not pathlib.Path(regression_score_file).exists()):
    print('Evaluating regressors.')
    for key,regressor_set in regressor.items():
        scores = np.zeros(len(pdata))
        for i,reg in enumerate(regressor_set):
            scores[regressor_indices[i]] = rtu.GetPredictions(regressor=reg, model_input=regressor_input[i])
        regression_scores[key] = scores
    
    # Now save the scores to a file.
    print('Saving regression scores to {}.'.format(regression_score_file))
    with h5.File(regression_score_file, 'w') as hf:
        for key,val in regression_scores.items():
            hf.create_dataset(key,data=val,compression='gzip', compression_opts=7)
    
else:
    
    # Load the scores from a file.
    print('Loading regression scores from {}.'.format(regression_score_file))
    with h5.File(regression_score_file,'r') as hf:
        for key in hf.keys():
            regression_scores[key] = hf[key][:]

Now that we've collected our classification and regression scores, let's add the relevant data to our `pandas.DataFrame`.

Note that we are using regressors that predict the *ratio* between true and reco energy, not the reco energy itself -- so we must multiply the existing reco energy by the regression scores to get the new predicted energy.
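
As a small worked example of this ratio convention, with made-up numbers: if a regressor outputs the ratio of true to reco energy, the predicted energy is simply the product of that ratio and the reco energy.

```python
import numpy as np

reco_E = np.array([2.0, 10.0, 50.0])       # made-up reco energies
pred_ratio = np.array([1.25, 1.10, 0.98])  # made-up regressor outputs (true / reco)

# Predicted energy = (predicted true/reco ratio) * (reco energy)
pred_E = pred_ratio * reco_E
```

A ratio above 1 shifts the cluster energy upwards, and a ratio below 1 shifts it downwards.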

In [None]:
classification_key = 'score'
regression_key_prefix = 'clusterE_pred'
regression_keys = {key:regression_key_prefix + '_' + key for key in regression_scores.keys()}

pdata[classification_key] = classification_scores

for key,val in regression_keys.items():
    pdata[val] = regression_scores[key] * pdata['clusterE'].to_numpy()

## Plotting network results

Before going any further, we can plot some network results -- our classification and regression scores, the corresponding predicted energies for each regression, and the ratio of predicted energy to reco energy for each regression.

**TODO:** Note that we have a charged regression result and a neutral regression result for *each* topo-cluster -- in other words there's some double-counting going on, because ultimately we will treat each topo-cluster as only charged or neutral. Without immediately deciding on a classification score cut, we can make a 2D plot showing the distribution of charged or neutral regression scores (or predicted energies) as a function of that cut.
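
One possible starting point for such a study is to scan over candidate cuts and, for each one, compute the fraction of clusters treated as charged and the resulting mean energy correction. The sketch below does this on synthetic stand-in data (random numbers, not our network outputs); all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-ins, NOT real network outputs.
score = rng.uniform(0.0, 1.0, n)         # classification score per cluster
ratio_charged = rng.normal(1.2, 0.1, n)  # predicted/reco energy from a "charged" regression
ratio_neutral = rng.normal(1.0, 0.05, n) # predicted/reco energy from a "neutral" regression

cuts = np.linspace(0.0, 1.0, 11)

# Fraction of clusters treated as charged for each cut.
charged_fraction = np.array([(score > c).mean() for c in cuts])

# Mean correction when each cluster uses the regression selected by the cut.
mean_ratio = np.array([
    np.where(score > c, ratio_charged, ratio_neutral).mean() for c in cuts
])
```

Filling a 2D histogram of the selected ratios against the cut value, instead of taking the mean, would give the 2D plot described above.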

In [None]:
c = rt.TCanvas(qu.RN(),qu.RN(),1600,600)
c.Divide(2,1)

# ---
c.cd(1)
hist = rt.TH1F(qu.RN(),'Classification Scores;Score;Fractional Count',102,-0.01,1.01)
for entry in classification_scores: hist.Fill(entry)
hist.SetLineColor(ps.main)
hist.SetFillColorAlpha(ps.main, 0.5)
hist.Scale(1./hist.Integral())
hist.Draw('HIST')
# ---
c.cd(2)
stack = rt.THStack(qu.RN(),'Regression Scores;Score;Fractional Count')
leg = rt.TLegend(0.75,0.8,0.9,0.9)
h = {key:rt.TH1F(qu.RN(),'',100,0.,2.0) for key in regression_scores.keys()}
colors = [ps.curve, ps.text]

for i,key in enumerate(regression_scores.keys()):
    for entry in regression_scores[key]:
        h[key].Fill(entry)
    h[key].SetLineColor(colors[i])
    h[key].SetFillColorAlpha(colors[i],0.1)
    h[key].Scale(1./h[key].Integral())
    stack.Add(h[key])
    leg.AddEntry(h[key],key,'lf')
stack.Draw('NOSTACK HIST')
leg.SetTextColor(ps.text)
leg.Draw()
# ---

c.Draw()

From the simple plots above, things look sensible -- it looks like our topo-clusters are predominantly predicted to be charged pions. And consistent with our results in training, we see that the neutral pion regression doesn't shift the energy very much, whereas the charged regression will typically shift the energy slightly upwards. Of course, these plots include *charged pion* regression scores for clusters likely to be *neutral* pions and vice-versa, but it's nonetheless a sensible result.

## Defining a classification score cut

At this point, we might want to choose a classification score cut. All topo-clusters with scores below this cut will be treated as neutral pions, and all topo-clusters with scores above this cut will be treated as charged pions.

**TODO:** We ultimately might want to define this within our jet-clustering function/routine, so that we can easily see how shifting this cut affects output.

In [None]:
classification_score_cut = 0.5
predicted_energy_key = 'clusterE_pred'

# Start from the neutral-regression energies, then overwrite the clusters classified as charged.
clusterE_pred = np.array(pdata['clusterE_pred_neutral'].to_numpy()) # make a copy
charged_idxs = (pdata[classification_key] > classification_score_cut).to_numpy()
clusterE_pred[charged_idxs] = pdata['clusterE_pred_charged'][charged_idxs].to_numpy()

pdata[predicted_energy_key] = clusterE_pred
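
The same kind of selection can also be sketched in a single step with `np.where`, shown here on made-up arrays rather than on `pdata`. Clusters with a score above the cut take the charged prediction, and all others take the neutral one.

```python
import numpy as np

score = np.array([0.2, 0.8, 0.5, 0.95])       # made-up classification scores
E_neutral = np.array([1.0, 2.0, 3.0, 4.0])    # made-up neutral-regression energies
E_charged = np.array([1.5, 2.5, 3.5, 4.5])    # made-up charged-regression energies
cut = 0.5

# Strictly above the cut -> charged; otherwise -> neutral.
E_pred = np.where(score > cut, E_charged, E_neutral)
```

Because both predictions are kept for every cluster, changing `cut` only requires re-running this one line.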

## Jet clustering

Now we want to cluster our topo-clusters into jets. As a first step, we inspect the `EventTree` of one of our input files.
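
Jet clustering operates on four-momenta, so the corrected energies have to be turned into `(px, py, pz, E)` for each topo-cluster at some point. The sketch below shows one way this could be done, under the assumption (ours, for illustration) that the correction rescales the whole four-vector and leaves the cluster direction unchanged. `cluster_four_momenta` is a hypothetical helper and is not taken from `jet_util`.

```python
import numpy as np

def cluster_four_momenta(pt, eta, phi, e_reco, e_pred):
    """Return an (N, 4) array of (px, py, pz, E) with energies rescaled to e_pred.

    ASSUMPTION: the correction scales the full four-vector by e_pred / e_reco,
    so eta and phi are unchanged.
    """
    scale = e_pred / e_reco
    pt_new = pt * scale
    px = pt_new * np.cos(phi)
    py = pt_new * np.sin(phi)
    pz = pt_new * np.sinh(eta)
    return np.stack([px, py, pz, e_pred], axis=-1)

# Made-up single cluster at eta = 0, phi = 0, corrected from 10 to 12 (arbitrary units).
p4 = cluster_four_momenta(
    pt=np.array([10.0]),
    eta=np.array([0.0]),
    phi=np.array([0.0]),
    e_reco=np.array([10.0]),
    e_pred=np.array([12.0]),
)
```

Each row could then serve as the `(px, py, pz, E)` input for a FastJet `PseudoJet`.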

In [None]:
rootfiles

In [None]:
rfile = rootfiles['jet'][0] # rootfiles['jet'] is already a list of paths, so we just take the first one

In [None]:
f = ur.open(rfile)
t = f['EventTree']

In [None]:
t

In [None]:
t.keys()

In [None]:
t['AntiKt4EMTopoJetsPt'].array()