### When running this notebook via the Galaxy portal
You can access your data via the dataset number. Using a Python kernel, you can access dataset number 42 with ``handle = open(get(42), 'r')``.
To save data, write your data to a file, and then call ``put('filename.txt')``. The dataset will then be available in your Galaxy history.
<br><br>Note that if you are putting to or getting from a history other than your default history, you must also provide the history-id.
<br><br>More information, including the available Galaxy-related environment variables, can be found at https://github.com/bgruening/docker-jupyter-notebook. This notebook is running in a Docker container based on the Docker Jupyter container described in that link.

# Ntuple to data frame conversion

The following notebook converts ntuples to a pandas data frame and writes the output to HDF5 files. Events are selected while running over the ntuples, and new variables are computed and stored in the data frame. The code also adds the background category, a flag telling whether the event comes from a signal simulation (useful when training a BDT or NN), and the weight used to scale the MC to data.
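
The weight mentioned above is built in the event loop further down as the product of the generator weight and the scale factors, multiplied by cross section times luminosity and divided by the sum of weights. Below is a minimal sketch of that arithmetic, using a hypothetical helper `mc_event_weight` and made-up numbers. The units (pb and pb^-1) are an assumption; the only requirement is that cross section and luminosity use consistent units.

```python
def mc_event_weight(mc_weight, scale_factors, xsection, lumi, sum_weights):
    """Per-event weight that scales a simulated event to the data luminosity."""
    sf = 1.0
    for s in scale_factors:
        sf *= s
    return mc_weight * sf * (xsection * lumi) / sum_weights


# Hypothetical numbers, chosen only for illustration.
w = mc_event_weight(
    mc_weight=1.0,
    scale_factors=[1.0, 0.98, 1.0, 1.0, 1.02],  # pileup, electron, muon, b-tag, trigger
    xsection=2.0,        # assumed to be in pb
    lumi=10000.0,        # assumed to be in pb^-1
    sum_weights=40000.0,
)
print(w)
```

Data events are not reweighted; the loop stores a weight of 1.0 for them.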

The current code takes about 4-5 hours to run over the simulated 2Lep background and signal samples, i.e. to process about 118 million events.
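
As a back-of-the-envelope check, the numbers quoted above translate into a processing rate that can be used to estimate how long a smaller test run might take. The 1-million-event test sample below is hypothetical.

```python
n_events = 118_000_000          # number of events quoted in the text
hours_low, hours_high = 4, 5    # quoted running time in hours

rate_fast = n_events / (hours_low * 3600)   # events per second for a 4 h run
rate_slow = n_events / (hours_high * 3600)  # events per second for a 5 h run
print(f"{rate_slow:.0f} to {rate_fast:.0f} events/s")

# Rough duration of a hypothetical 1-million-event test run at the slower rate.
test_events = 1_000_000
test_minutes = test_events / rate_slow / 60
print(f"about {test_minutes:.1f} minutes")
```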

First import some of the needed modules.

In [None]:
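import os            # used below for os.path.isfile
import ROOT as R
import pandas as pd  # used below for pd.DataFrame and pd.read_hdf
import import_ipynb  # allows importing Jupyter notebooks as modules
import setPath
from os import listdir
from os.path import isfile, join
from Input.OpenDataPandaFramework13TeV import *
%jsroot on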

In [None]:
%matplotlib inline
import matplotlib
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt

Load the shared library used for the MT2 calculation in the event loop, then set the path to the open data ntuples and choose which skim you are interested in:

In [None]:
R.gSystem.Load("/storage/shared/software/Input/CalcGenericMT2/src/libBinnedLik.so")

In [None]:
opendatadir = "/storage/shared/data/fys5555/ATLAS_opendata/"
analysis = "2lep"

Create the ROOT `TChain` objects that will hold all the ROOT files and later be used to loop over all the events.

In [None]:
background = R.TChain("mini")
data = R.TChain("mini")

Get all the MC and data files available for the selected data set and make lists of the background and signal categories (useful information to add to the data frame later).

In [None]:
mcfiles = initialize(opendatadir+"/"+analysis+"/MC")
datafiles = initialize(opendatadir+"/"+analysis+"/Data")
allfiles = {**mcfiles, **datafiles}
Backgrounds = getBkgCategories()
Signals = getSignalCategories()

A few more preparatory steps: list the signal categories and build a lookup table from dataset ID (DSID) to category, which is used to classify the individual backgrounds.

In [None]:
getSignalCategories()

In [None]:
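MCcat = {}  # maps integer DSID -> category name
for cat in allfiles:
    for dsid in allfiles[cat]["dsid"]:
        try:
            MCcat[int(dsid)] = cat
        except (ValueError, TypeError):
            # skip entries whose DSID cannot be converted to an integer
            continue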
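
To illustrate what the lookup above ends up containing, here is a small self-contained sketch. The dictionary `toy_files` is a hypothetical stand-in for the output of `initialize()`; the category names, DSIDs and file names are made up, and the assumption that data entries carry a non-numeric DSID is only a guess at why the `try`/`except` is needed.

```python
# Hypothetical stand-in for the dictionary returned by initialize().
toy_files = {
    "Zjets": {"dsid": ["361106", "361107"], "files": ["a.root", "b.root"]},
    "ttbar": {"dsid": ["410000"], "files": ["c.root"]},
    "data": {"dsid": ["dataA"], "files": ["d.root"]},  # not an integer
}

dsid_to_category = {}
for category, info in toy_files.items():
    for dsid in info["dsid"]:
        try:
            dsid_to_category[int(dsid)] = category
        except ValueError:
            # Entries without a numeric DSID are skipped.
            continue

print(dsid_to_category)
```

In the event loop the category of an MC event is then a single dictionary lookup on `channelNumber`.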

Add the background and signal files to the TChain and check the number of events.

In [None]:
dataset_IDs = []
background.Reset()
for b in Backgrounds+Signals:
    if b not in mcfiles: continue
    # enumerate keeps the DSID index aligned with the file list,
    # also when a missing file is skipped
    for i, mc in enumerate(mcfiles[b]["files"]):
        if not os.path.isfile(mc): continue
        try:
            dataset_IDs.append(int(mcfiles[b]["dsid"][i]))
            background.Add(mc)
        except (ValueError, TypeError, IndexError):
            print("Could not get DSID for %s. Skipping"%mc)
nen = background.GetEntries()
print("Added %i entries for backgrounds and signals"%(nen))
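
When some files are missing on disk, each DSID has to stay paired with its own file. Here is a minimal sketch of one way to guarantee that, assuming the `"files"` and `"dsid"` lists are parallel. The helper `existing_files_with_dsid` and the file names are hypothetical.

```python
import os
import tempfile


def existing_files_with_dsid(files, dsids):
    """Return (path, dsid) pairs for files on disk, keeping the pairing intact."""
    pairs = []
    for path, dsid in zip(files, dsids):
        if not os.path.isfile(path):
            continue  # skipping here cannot shift the DSID of later files
        pairs.append((path, int(dsid)))
    return pairs


# Demonstration with temporary files; names and DSIDs are made up.
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "present.root")
    missing = os.path.join(tmp, "missing.root")
    also_present = os.path.join(tmp, "also_present.root")
    for p in (present, also_present):
        open(p, "w").close()
    pairs = existing_files_with_dsid(
        [present, missing, also_present], ["1", "2", "3"]
    )
    selected_dsids = [dsid for _, dsid in pairs]

print(selected_dsids)
```

Iterating over the zipped lists avoids a manual counter, which would go out of step as soon as one file is skipped.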

Adding all the available data into the TChain.

In [None]:
data.Reset()
for d in datafiles["data"]["files"]:  
    if not os.path.isfile(d): continue
    data.Add(d)
nen = data.GetEntries()
print("Added %i entries for data"%(nen))

These are the variables/features we want to add to our data frame; they are filled during the loop over events. You can add or remove variables here, depending on what you will use the resulting data frame for.

In [None]:
columns = {"lep_pt1":[],"lep_eta1":[],"lep_phi1":[],"lep_E1":[],"lep_flav1":[],
           "lep_pt2":[],"lep_eta2":[],"lep_phi2":[],"lep_E2":[],"lep_flav2":[],
           "met":[], "mll":[], "njet20":[], "njet60":[], "nbjet60":[],"nbjet70":[],
           "nbjet77":[],"nbjet85":[],"mt2_80":[],"mt2_0":[],
           "isSF":[], "isOS":[], "weight":[],"category":[],"isSignal":[],
           "lep_z01":[], "lep_z02":[], "lep_trackd0pvunbiased1":[],
           "lep_trackd0pvunbiased2":[], "lep_tracksigd0pvunbiased1":[], "lep_tracksigd0pvunbiased2":[],
           "lep_etcone201":[],"lep_etcone202":[], "lep_ptcone301":[], "lep_ptcone302":[]}

This is the event loop (it needs to be run twice, once for MC and once for data, if you are interested in both). It applies some selections, creates new variables and fills the lists in the dictionary defined above.

In [None]:
%%time
isData = 0  # set to 1 to run over data instead of MC

if isData == 1: ds = data
else: ds = background

l1 = R.TLorentzVector() 
l2 = R.TLorentzVector() 
met = R.TLorentzVector() 
dileptons = R.TLorentzVector() 
    
i = 0   
for event in ds: 
    
    if i%100000 == 0 and i>0: 
        print("Total events %i/%i"%(i,ds.GetEntries()))
        #break
    i += 1 
    
    sig_lep_idx = []
    nsig_lep = 0
    for j in range(ds.lep_n):
        if ds.lep_etcone20[j]/ds.lep_pt[j] > 0.15: continue
        if ds.lep_ptcone30[j]/ds.lep_pt[j] > 0.15: continue
        sig_lep_idx.append(j)
        nsig_lep += 1
        
    if not nsig_lep == 2: continue 
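    # Jet counters. Jet pT is stored in MeV, so 20000 and 60000 mean 20 and 60 GeV.
    # The nbjetXX counters use MV2c10 thresholds; XX labels the b-tagging
    # working point (60, 70, 77 or 85), not a pT cut.
    njet20 = 0
    njet60 = 0
    nbjet60 = 0
    nbjet70 = 0
    nbjet77 = 0
    nbjet85 = 0
    for j in range(ds.jet_n):
        if ds.jet_pt[j] > 20000:
            njet20 += 1
            if ds.jet_MV2c10[j] > 0.9349:
                nbjet60 += 1
            if ds.jet_MV2c10[j] > 0.8244:
                nbjet70 += 1
            if ds.jet_MV2c10[j] > 0.6459:
                nbjet77 += 1
            if ds.jet_MV2c10[j] > 0.1758:
                nbjet85 += 1
        if ds.jet_pt[j] > 60000:
            njet60 += 1
        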
        
    ## Indices of the two selected ("good") leptons: 
    idx1 = sig_lep_idx[0]
    idx2 = sig_lep_idx[1]
    
    ## Variables are stored in the TTree with unit MeV, so we need to divide by 1000 
    ## to get GeV, which is a more practical and commonly used unit. 
    
    ## Set Lorentz vectors: 
    l1.SetPtEtaPhiE(ds.lep_pt[idx1]/1000., ds.lep_eta[idx1], ds.lep_phi[idx1], ds.lep_E[idx1]/1000.)
    l2.SetPtEtaPhiE(ds.lep_pt[idx2]/1000., ds.lep_eta[idx2], ds.lep_phi[idx2], ds.lep_E[idx2]/1000.)
    
    met.SetPtEtaPhiE(ds.met_et/1000., 0.0, ds.met_phi, 0.0)
    
    dileptons = l1 + l2
    
    # The stransverse mass!
    mycalc_80 = R.ComputeMT2(l1,l2,met,0.,80.)
    mycalc_0 = R.ComputeMT2(l1,l2,met,0.,0.)
    
    columns["lep_pt1"].append(ds.lep_pt[idx1]/1000.0)
    columns["lep_eta1"].append(ds.lep_eta[idx1])
    columns["lep_phi1"].append(ds.lep_phi[idx1])
    columns["lep_E1"].append(ds.lep_E[idx1]/1000.0)
    columns["lep_flav1"].append(ds.lep_charge[idx1]*ds.lep_type[idx1])
    columns["lep_z01"].append(ds.lep_z0[idx1])
    columns["lep_tracksigd0pvunbiased1"].append(ds.lep_tracksigd0pvunbiased[idx1])
    columns["lep_trackd0pvunbiased1"].append(ds.lep_trackd0pvunbiased[idx1])
    columns["lep_etcone201"].append(ds.lep_etcone20[idx1])
    columns["lep_ptcone301"].append(ds.lep_ptcone30[idx1])
    
    columns["lep_pt2"].append(ds.lep_pt[idx2]/1000.0)
    columns["lep_eta2"].append(ds.lep_eta[idx2])
    columns["lep_phi2"].append(ds.lep_phi[idx2])
    columns["lep_E2"].append(ds.lep_E[idx2]/1000.0)
    columns["lep_flav2"].append(ds.lep_charge[idx2]*ds.lep_type[idx2])
    columns["lep_z02"].append(ds.lep_z0[idx2])
    columns["lep_tracksigd0pvunbiased2"].append(ds.lep_tracksigd0pvunbiased[idx2])
    columns["lep_trackd0pvunbiased2"].append(ds.lep_trackd0pvunbiased[idx2])
    columns["lep_etcone202"].append(ds.lep_etcone20[idx2])
    columns["lep_ptcone302"].append(ds.lep_ptcone30[idx2])
    
    columns["mt2_80"].append(mycalc_80.Compute())
    columns["mt2_0"].append(mycalc_0.Compute())
    columns["met"].append(ds.met_et/1000.0)
    columns["mll"].append(dileptons.M())
    
    columns["njet20"].append(njet20)
    columns["njet60"].append(njet60)
    
    columns["nbjet60"].append(nbjet60)
    columns["nbjet70"].append(nbjet70)
    columns["nbjet77"].append(nbjet77)
    columns["nbjet85"].append(nbjet85)
    
    Type = ""
    if not isData:
        Type = MCcat[ds.channelNumber]
        # print("Type",Type)
        columns["category"].append(Type)
    else:
        columns["category"].append("data")
    
    if Type in Backgrounds:
        columns["isSignal"].append(0)
    elif Type in Signals:
        columns["isSignal"].append(1)
    else:
        columns["isSignal"].append(0)
    
    if ds.lep_charge[idx1] == ds.lep_charge[idx2]: columns["isOS"].append(0)
    else: columns["isOS"].append(1)
        
    if ds.lep_type[idx1] == ds.lep_type[idx2]: columns["isSF"].append(1)
    else: columns["isSF"].append(0)
        
    if isData:
        columns["weight"].append(1.0)
    else:
        W = ((ds.mcWeight)*(ds.scaleFactor_PILEUP)*
             (ds.scaleFactor_ELE)*(ds.scaleFactor_MUON)*
             (ds.scaleFactor_BTAG)*(ds.scaleFactor_LepTRIGGER))*((ds.XSection*lumi)/ds.SumWeights)
        columns["weight"].append(W)
        
print("Done!")
if isData == 0:
    print("Remembered to run over data? No? Set isData = 1 at the top and run again")
else:
    print("Remembered to run over MC? No? Set isData = 0 at the top and run again")
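
The `mll` column filled in the loop is the invariant mass of the sum of the two lepton four-vectors. As an illustration of what `dileptons.M()` computes, here is a pure-Python sketch that does not depend on ROOT. The helpers `four_momentum` and `invariant_mass` are hypothetical, and the clamp for negative mass squared is a simplification that may differ from ROOT's convention.

```python
import math


def four_momentum(pt, eta, phi, energy):
    """Convert (pt, eta, phi, E) to Cartesian (px, py, pz, E)."""
    return (pt * math.cos(phi), pt * math.sin(phi), pt * math.sinh(eta), energy)


def invariant_mass(p1, p2):
    """Invariant mass of the sum of two four-momenta."""
    px, py, pz, e = (a + b for a, b in zip(p1, p2))
    m2 = e * e - (px * px + py * py + pz * pz)
    return math.sqrt(max(m2, 0.0))  # clamp tiny negative values from rounding


# Two hypothetical back-to-back massless leptons with pt = 45 GeV at eta = 0.
lep1 = four_momentum(45.0, 0.0, 0.0, 45.0)
lep2 = four_momentum(45.0, 0.0, math.pi, 45.0)
mll = invariant_mass(lep1, lep2)
print(mll)
```

For this back-to-back configuration the momenta cancel, so the invariant mass equals the summed energy of 90 GeV.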

Finally, convert the dictionary to a data frame.

In [None]:
df = pd.DataFrame(data=columns)

In [None]:
df["category"]

... and write it to a file for later use. Many other file formats are possible. Have a look at the pandas documentation (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_hdf.html) for the options.

In [None]:
for c in columns.keys():
    print(c,len(columns[c]))
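
The printout above is useful because pandas raises a `ValueError` when a data frame is built from a dictionary whose lists differ in length. This can happen, for example, if the event loop is interrupted halfway through an event. Below is a small sketch of an automated version of this check; the helper `check_equal_lengths` and the toy columns are hypothetical.

```python
def check_equal_lengths(cols):
    """Return the common length of all lists, or raise ValueError if they differ."""
    lengths = {name: len(values) for name, values in cols.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Unequal column lengths: {lengths}")
    return next(iter(lengths.values()), 0)


# Hypothetical toy columns for illustration.
good = {"met": [10.0, 20.0], "mll": [91.0, 85.0]}
n_rows = check_equal_lengths(good)
print(n_rows)
```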

In [None]:
df.to_hdf("/storage/shared/data/2lep_df_forML_bkg_signal_inclusive.hdf5", key="mini")

In [None]:
df['category']

In [None]:
plt.figure()
df[df['met'] < 1000].plot(y='lep_tracksigd0pvunbiased2',kind='hist',logy=True)
plt.show()

In [None]:
df[df['met'] > 1000]

In [None]:
reread = pd.read_hdf("/storage/shared/data/2lep_df_forML_bkg_signal_inclusive.hdf5")

In [None]:
reread.shape