# Physics analysis with uproot (and some coffea)


## What we are trying to do

This session is similar to the tutorial notebook that's already available, but I want to focus a bit more on how to use our physics knowledge to implement useful variables, and on verifying that what we calculate is actually meaningful.

You can run this notebook in Jupyter yourself, or use IPython instead. If you use an ssh session with X11 forwarding, you'll also see the plots.


## What you should take home from this session

- Explore data in a Jupyter notebook (or IPython session).
- Filter data and calculate variables.
- Verify your selections and variables qualitatively and quantitatively.

## The physics case

Suppose we want to find top quark pair production with an associated W boson (ttW) in events with 3 leptons. In this case, all the involved W bosons decay leptonically. If one swaps the associated W boson for a Z boson (ttZ), a similar final state is also present.

The production cross sections of the two processes are of the same order (0.6 pb vs. 1.0 pb). However, because of the different branching fractions, the ttZ process will usually be dominant in the 3l final state.

The leptonic Z boson decay from ttZ gives us a handle to distinguish the two processes. If we succeed in identifying $Z \rightarrow \ell \ell$ events, we can reject those events and end up with an event sample that's enriched in ttW.

![](https://cds.cern.ch/record/2264544/files/Figure_001-a.pdf)

![](http://cds.cern.ch/record/2264544/files/Figure_001-b.pdf)


Our strategy is therefore:

- Find lepton pairs with the same flavor and opposite charge
- Calculate their invariant mass
- Find lepton pairs that have an invariant mass that's compatible with the Z boson mass, e.g. |m(ll)-m(Z)|<10 GeV
- Last but not least, make sure the variable we calculate makes sense and the implementation is bug free
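
The invariant mass in the steps above can be written out by hand. The following is a minimal numpy sketch, not the implementation used by coffea: the function name `invariant_mass` and the toy lepton values are made up for illustration. It builds the four-momenta from $(p_T, \eta, \phi, m)$ and evaluates $m^2 = (E_1+E_2)^2 - |\vec{p}_1+\vec{p}_2|^2$.

```python
import numpy as np

def invariant_mass(pt1, eta1, phi1, m1, pt2, eta2, phi2, m2):
    """Invariant mass of a two-particle system from (pt, eta, phi, mass)."""
    # Cartesian momentum components
    px1, py1, pz1 = pt1 * np.cos(phi1), pt1 * np.sin(phi1), pt1 * np.sinh(eta1)
    px2, py2, pz2 = pt2 * np.cos(phi2), pt2 * np.sin(phi2), pt2 * np.sinh(eta2)
    # Energies from E^2 = p^2 + m^2
    e1 = np.sqrt(px1**2 + py1**2 + pz1**2 + m1**2)
    e2 = np.sqrt(px2**2 + py2**2 + pz2**2 + m2**2)
    mass_squared = (e1 + e2)**2 - (px1 + px2)**2 - (py1 + py2)**2 - (pz1 + pz2)**2
    # Guard against tiny negative values from rounding
    return np.sqrt(np.maximum(mass_squared, 0.0))

# Toy example: two massless, back-to-back leptons with pT = 45.6 GeV each
m_ll = invariant_mass(45.6, 0.0, 0.0, 0.0, 45.6, 0.0, np.pi, 0.0)

# Z candidate requirement from the list above: |m(ll) - m(Z)| < 10 GeV
is_on_z = abs(m_ll - 91.2) < 10
```

A hand-made case like this, where we know the answer in advance, is a cheap way to check that a variable does what we think it does.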




## Load modules and files



In [None]:
import uproot4
import numpy as np
from uproot_methods import TLorentzVectorArray

from coffea.processor import LazyDataFrame
from coffea.analysis_objects import JaggedCandidateArray

To start some exploration we can just use a single ROOT file of a sample that represents a process we're interested in. Because we want to reconstruct leptonic Z decays and eventually want to reject ttZ events, we load a file from a ttZ sample to start this session.

In [None]:
fn_ttZ = '/hadoop/cms/store/user/dspitzba/nanoAOD/ttw_samples/0p1p11/TTZToLLNuNu_M-10_TuneCP5_13TeV-amcatnlo-pythia8_RunIIAutumn18NanoAODv6-Nano25Oct2019_102X_upgrade2018_realistic_v20_ext1-v1/nanoSkim_1.root'
fn_ttW = '/hadoop/cms/store/user/dspitzba/nanoAOD/ttw_samples/0p1p11/TTWJetsToLNu_TuneCP5_13TeV-amcatnloFXFX-madspin-pythia8_RunIIAutumn18NanoAODv6-Nano25Oct2019_102X_upgrade2018_realistic_v20_ext1-v1/nanoSkim_1.root'

file = uproot4.open(fn_ttZ)

tree = file['Events']

df = LazyDataFrame(tree, flatten=False)

In [None]:
df['MET_pt']

## Get physics objects

Tau leptons are very special (and difficult). Therefore we'll focus on electrons and muons in this analysis. We can get muon and electron candidate arrays using `JaggedCandidateArray.candidatesfromcounts`.


In [None]:
muon = JaggedCandidateArray.candidatesfromcounts(
    # The first part is essential: number of muons, and kinematic properties
    df['nMuon'],
    pt = df['Muon_pt'].content,
    eta = df['Muon_eta'].content,
    phi = df['Muon_phi'].content,
    mass = df['Muon_mass'].content,
    # all below is optional
    mediumId = df['Muon_mediumId'].content,
    miniPFRelIso_all = df['Muon_miniPFRelIso_all'].content,
    charge = df['Muon_charge'].content,
)

In [None]:
electron = JaggedCandidateArray.candidatesfromcounts(
    # The first part is essential: number of electrons, and kinematic properties
    df['nElectron'],
    pt=df['Electron_pt'].content,
    eta=df['Electron_eta'].content,
    phi=df['Electron_phi'].content,
    mass=df['Electron_mass'].content,
    # all below is optional
    cutBased=df['Electron_cutBased'].content,
    miniPFRelIso_all=df['Electron_miniPFRelIso_all'].content,
    charge=df['Electron_charge'].content,
)

We can look at the properties of the muons, e.g. the transverse momentum, eta, medium ID etc.

We can already see the structure of the muons: they are of a JaggedArray type, which means muons from every event are contained in a sub-array. The structure is jagged because you can have any number of muons. If there's no muon in the event, the array is empty.   
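
To make the jagged structure more concrete, here is a small sketch in plain numpy (not the awkward implementation itself). The toy values in `counts` and `content` are invented; the idea is that a jagged array is a flat content array plus the number of entries per event.

```python
import numpy as np

# Hypothetical toy data: muon pT for 4 events with 2, 0, 1 and 3 muons
counts = np.array([2, 0, 1, 3])
content = np.array([45.2, 12.1, 33.0, 60.5, 25.3, 11.8])

# Offsets mark where each event starts and stops in the flat content
offsets = np.concatenate([[0], np.cumsum(counts)])
events = [content[start:stop] for start, stop in zip(offsets[:-1], offsets[1:])]

n_events = len(events)                   # one entry per event, even if empty
n_muons = sum(len(ev) for ev in events)  # total number of muons
```

The second event has no muons, so its sub-array is empty, but it still occupies a slot.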


In [None]:
muon.mediumId

Every event has an entry in both arrays (possibly an empty one), so the electron and muon arrays should have the same length: the number of events.

In [None]:
print("Is the number of events the same?", len(muon) == len(electron))
print("Is the total number of electrons and muons the same?", muon.counts.sum() == electron.counts.sum())

The number of muons and electrons per event is of course not the same. We can look at the number of muons or electrons per event with `.counts`:

In [None]:
muon.counts

Noticed the change of format from JaggedArray to array? Why?

Because some of our electrons and muons are wrongly reconstructed, we impose some requirements on them:

In [None]:
muon = muon[
    (muon.pt>10) & 
    (abs(muon.eta)<2.4) & 
    (muon.mediumId) & 
    (muon.miniPFRelIso_all<0.2) 
]

electron = electron[
    (electron.pt>10) & 
    (abs(electron.eta)<2.4) & 
    (electron.cutBased>=3) & 
    (electron.miniPFRelIso_all<0.1) 
]

Let's check if those requirements actually worked. This is already one of the most important lessons and takeaways from this tutorial: always check if your code makes sense!
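
Besides looking at the output by eye, such a check can be made quantitative. A minimal sketch with plain numpy and invented toy values (the arrays `counts`, `pt` and `eta` below are hypothetical, and only the kinematic cuts are shown):

```python
import numpy as np

# Hypothetical toy muons: flat per-muon values plus the number of muons per event
counts = np.array([2, 0, 1, 3])
pt = np.array([45.2, 8.1, 33.0, 60.5, 9.3, 11.8])
eta = np.array([0.3, -1.2, 2.6, 1.1, 0.2, -2.0])

# Per-muon selection mask
mask = (pt > 10) & (np.abs(eta) < 2.4)

# Event index of every muon, used to recount the survivors per event
event_index = np.repeat(np.arange(len(counts)), counts)
selected_counts = np.bincount(event_index[mask], minlength=len(counts))

# Quantitative checks: no survivor violates the cuts, and no event gains muons
assert pt[mask].min() > 10
assert np.abs(eta[mask]).max() < 2.4
assert (selected_counts <= counts).all()
```

Assertions like these cost almost nothing and catch flipped inequalities or wrong thresholds early.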

In [None]:
muon.mediumId

For today, let's focus on events with three leptons, either electrons or muons. Because in real data we'd rely on a trigger to select those events, we need to make sure that our selected leptons also pass any potential trigger threshold. Lepton triggers usually have a minimum requirement on the transverse momentum, e.g. 30 GeV. Only one of the electrons or muons has to pass that threshold, so we use the `.any()` function.

In [None]:
baseline = ((electron.counts+muon.counts)==3) & ((electron.pt>30).any() | (muon.pt>30).any())

Does this .any() function actually do what we think it does? Let's check!

In [None]:
electron.pt

In [None]:
(electron.pt>30).any()

Looks pretty good! We do have at least one electron with pT>30 GeV in the first two events, but none in the third.


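A common mistake is to ask for `electron.pt>30` without proper counting: the `.counts` of a boolean mask is the number of entries per event, not the number of `True` values. An alternative to `.any()` that counts properly is to filter first and count afterwards:

In [None]:
electron[electron.pt>30].counts>0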
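
The difference between counting mask entries and counting passing electrons is easy to see in a toy example. This is a plain numpy sketch with invented values, not the awkward API:

```python
import numpy as np

# Hypothetical toy electrons: 3 events with 2, 1 and 0 electrons
counts = np.array([2, 1, 0])
pt = np.array([41.0, 15.0, 22.0])
event_index = np.repeat(np.arange(len(counts)), counts)

passes = pt > 30

# Correct: count the electrons that pass the cut in each event
n_passing = np.bincount(event_index[passes], minlength=len(counts))
has_hard_electron = n_passing > 0

# Mistake: counting the entries of the mask ignores whether they are True
n_entries = np.bincount(event_index, minlength=len(counts))
has_any_electron = n_entries > 0
```

The second event has one electron with 22 GeV: it passes the mistaken version but not the correct one.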

NB: Always check the behaviour of a function. I've also seen something like

In [None]:
electron[electron>1] # don't try this at home!

This clearly does not select on the number of electrons, as one might have thought. Let's apply the baseline selection we defined above, and just work with events that pass our requirements from here on.

In [None]:
muon = muon[baseline]
electron = electron[baseline]

## Constructing dileptons

`JaggedCandidateArray` has some nice functionality to get pairs (or triplets etc) of different kinds of objects. We can use `.choose(2)` to get any combination of two muons or electrons.

If one wants to get combinations of e.g. electrons and muons one can use the `.cross()` method like `electron.cross(muon)`.

Because Z bosons decay into leptons of the same flavor and opposite-sign charge (OS), we only select those dimuons or dielectrons with a charge product that is smaller than 0.
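
For a single event, the pairing and the opposite-sign requirement can be pictured with the standard library alone. The toy muons below are invented, and this only illustrates the combinatorics, not how `.choose(2)` is implemented:

```python
from itertools import combinations

# Hypothetical toy event: three muons given as (pT, charge)
muons = [(52.0, +1), (34.0, -1), (18.0, +1)]

# All unordered pairs, the analogue of choose(2) for a single event
pairs = list(combinations(muons, 2))

# Keep only pairs whose charge product is negative (opposite sign)
os_pairs = [(a, b) for a, b in pairs if a[1] * b[1] < 0]
```

Three muons give three pairs, and with charges (+, -, +) two of them are opposite sign.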


In [None]:
dimuon = muon.choose(2)
OS_dimuon = dimuon[(dimuon.i0.charge*dimuon.i1.charge < 0)]

dielectron = electron.choose(2)
OS_dielectron = dielectron[(dielectron.i0.charge*dielectron.i1.charge < 0)]

We can calculate an invariant mass of these dileptons. Notice the empty arrays for events where no opposite sign dimuon is found in the event!

In [None]:
OS_dimuon.mass

## Reconstructing Z boson candidates

The invariant mass resolution is finite, therefore we allow the dimuon or dielectron mass to be within 10 GeV of the Z boson mass to find a good $Z\rightarrow \ell \ell$ candidate.

Then, we calculate how many of our ttZ events with 3 leptons also have a good Z boson candidate per our definitions (or events that are so-called on-Z).

In [None]:
nEvents_all = len(muon)

nEvents_onZmumu = len(muon[(abs(OS_dimuon.mass-91.2)<10).any()])
nEvents_onZee = len(electron[(abs(OS_dielectron.mass-91.2)<10).any()])

(nEvents_onZmumu+nEvents_onZee)/nEvents_all


Conversely, we can also get the number of events that have no good Z boson candidate ("off-Z"). The on-Z and off-Z fractions should of course add up to 1, so that's a good way to check that our selections make sense.

In [None]:
nEvents_offZ = len(muon[(abs(OS_dimuon.mass-91.2)>10).all() & (abs(OS_dielectron.mass-91.2)>10).all()])

(nEvents_onZmumu + nEvents_onZee + nEvents_offZ)/nEvents_all

Why do we not have double counting of events here? Can you think of a case where double counting of events with a $Z \rightarrow \mu \mu$ and $Z \rightarrow e e$ candidate might become an issue?

Noticed the `.all()` function here? In contrast to `.any()` which requires at least one of the dileptons to be close to the Z mass, our `.all()` requirement forces all dimuon or dielectron candidates in an event to be at least 10 GeV away from the Z boson mass. 
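
The complementarity of the two selections can be checked on a handful of invented events. This sketch uses plain Python lists and the built-in `any` and `all`; the per-event masses are hypothetical:

```python
M_Z = 91.2     # GeV, value used throughout this notebook
WINDOW = 10.0  # GeV

# Hypothetical per-event lists of OS dilepton masses (empty: no OS pair)
masses = [[88.5, 43.0], [], [120.3], [95.0], [30.2, 61.7]]

on_z = [any(abs(m - M_Z) < WINDOW for m in ev) for ev in masses]
off_z = [all(abs(m - M_Z) > WINDOW for m in ev) for ev in masses]

# Every event should be in exactly one of the two categories
frac_on = sum(on_z) / len(masses)
frac_off = sum(off_z) / len(masses)
```

Note that `all` over an empty list is `True` and `any` is `False`, so events without any OS pair count as off-Z. A pair exactly 10 GeV away from the Z mass would fall into neither category, which is negligible in practice but worth knowing.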

Below we ask for the "best" Z candidate: if we have 3 muons, we will find two OS dimuons, but one of the pairs will have a mass that's closer to the Z boson. We can find the `OS_dimuon` that has the closest mass with the `argmin()` method.

In [None]:
OS_dimuon_bestZmumu = OS_dimuon[abs(OS_dimuon.mass-91.2).argmin()]
OS_dielectron_bestZee = OS_dielectron[abs(OS_dielectron.mass-91.2).argmin()]

`argmin()` returns the index of the dimuon pair that is our best Z candidate. As an example, we can look at events with 3 muons:

In [None]:
OS_dimuon[muon.counts==3].mass

In [None]:
abs(OS_dimuon[muon.counts==3].mass-91.2).argmin()

`argmin` and `argmax` are very useful methods, e.g. if you want to get the transverse momentum of the muon with the largest $|\eta|$: `muon[abs(muon.eta).argmax()].pt`.
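
Written out event by event, the best-candidate selection looks roughly like this. It is a sketch with numpy and invented masses; the jagged `argmin` does the same thing without the explicit loop:

```python
import numpy as np

M_Z = 91.2  # GeV

# Hypothetical per-event OS dimuon masses
masses = [np.array([43.0, 89.9]), np.array([]), np.array([120.3])]

best = []
for ev in masses:
    if len(ev) == 0:
        best.append(None)  # no candidate in this event
    else:
        i = np.argmin(np.abs(ev - M_Z))  # index of the pair closest to m(Z)
        best.append(float(ev[i]))
```

The "best" candidate is not necessarily a good one: in the third event the only pair is almost 30 GeV away from the Z mass.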

We can use matplotlib to make histograms of our variables. The `.flatten()` method is needed in order to get a flat array (in contrast to a jagged one).

In [None]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm


figure=plt.figure(1)
plt.hist(
    [OS_dimuon_bestZmumu.mass.flatten(), OS_dielectron_bestZee.mass.flatten(), OS_dimuon.mass.flatten(), OS_dielectron.mass.flatten()], 
    bins=20, 
    range=[0, 200], 
    label=[r'$Z \rightarrow \mu \mu$ (best)', r'$Z \rightarrow e e$ (best)', r'$Z \rightarrow \mu \mu$ (any)', r'$Z \rightarrow e e$ (any)'], 
    histtype='step'
)

plt.xlabel(r'$M_{ll}$ (GeV)')
plt.ylabel('Number of candidates')
plt.legend()

plt.show()


So, how could we best veto these events? The easiest way is to just force all dilepton candidates to have an invariant mass away from the Z boson mass, like we have done earlier:

In [None]:
(abs(OS_dimuon.mass-91.2)>10).all() & (abs(OS_dielectron.mass-91.2)>10).all()

In [None]:
abs(OS_dimuon[muon.counts==3].mass-91.2)>10

In [None]:
(abs(OS_dimuon[muon.counts==3].mass-91.2)>10).all()

Let's quickly look at the best Z candidates again. We now have separate arrays for muons and electrons, but we might want to have them combined, too.

In [None]:
OS_dimuon_bestZmumu.mass

In [None]:
OS_dielectron_bestZee.mass

So, how can we combine two JaggedArrays? It's not exactly straightforward, but it can be done.

In [None]:
from Tools.helpers import mergeArray

OS_dilepton_mass = mergeArray(OS_dimuon_bestZmumu.mass, OS_dielectron_bestZee.mass)

In [None]:
OS_dilepton_mass

What does `mergeArray` do?
```
def mergeArray(a1, a2):
    '''
    Merge two arrays into one, e.g. electrons and muons
    '''
    import awkward
    a1_tags = awkward.JaggedArray(a1.starts, a1.stops, np.full(len(a1.content), 0, dtype=np.int64))
    a1_index = awkward.JaggedArray(a1.starts, a1.stops, np.arange(len(a1.content), dtype=np.int64))
    a2_tags = awkward.JaggedArray(a2.starts, a2.stops, np.full(len(a2.content), 1, dtype=np.int64))
    a2_index = awkward.JaggedArray(a2.starts, a2.stops, np.arange(len(a2.content), dtype=np.int64))
    tags = awkward.JaggedArray.concatenate([a1_tags, a2_tags], axis=1)
    index = awkward.JaggedArray.concatenate([a1_index, a2_index], axis=1)
    return awkward.JaggedArray(tags.starts, tags.stops, awkward.UnionArray(tags.content, index.content, [a1.content, a2.content]))
```
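
Conceptually, the merge is just an event-by-event concatenation. With plain Python lists and invented values (not the awkward implementation shown above) it reduces to:

```python
# Hypothetical best-Z masses per event: at most one entry per flavour
best_mumu = [[89.9], [], [], [120.3]]
best_ee = [[], [92.4], [], []]

# Concatenate event by event (what axis=1 does for jagged arrays)
best_ll = [mm + ee for mm, ee in zip(best_mumu, best_ee)]

# Largest number of merged candidates found in any single event
max_per_event = max(len(ev) for ev in best_ll)
```

The tags and indices in the real helper are bookkeeping that lets awkward do the same without copying the contents.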


Since we have trilepton events and we already chose the *best* Z boson candidate, our new JaggedArray, `OS_dilepton_mass`, should not have an event with more than one entry. As a sanity check we can still do:

In [None]:
(OS_dilepton_mass.counts>1).any()

In [None]:
figure=plt.figure(1)
plt.hist(
    [OS_dilepton_mass.flatten()], 
    bins=20, 
    range=[0, 200], 
    label=[r'$Z \rightarrow \ell \ell$ (best)'], 
    histtype='step'
)

plt.xlabel(r'$M_{ll}$ (GeV)')
plt.ylabel('Number of candidates')
plt.legend()

plt.show()

We can also use some coffea tools to make some histograms. The syntax isn't very different, but since we'll use coffea later on to process larger amounts of data, it makes sense to familiarize ourselves with them.

In [None]:
from coffea import hist

In [None]:
muon_pt = hist.Hist("Counts", hist.Bin("pt", r"$p_{T} \ (GeV)$", 25, 0, 250))
muon_pt.fill( pt = muon.pt.flatten() )
ax = hist.plot1d(muon_pt, overflow='over')

We fill the histogram with `muon.pt.flatten()` because the histogram expects a flat array rather than a jagged one.

## Coffea processor - larger scale data processing

Coffea processors are convenient for processing larger amounts of data, e.g. more than just a single file. Once you have an idea of what variables you want to calculate and have made sure they work as intended, you can put everything together in a processor.

The processor splits up the events into chunks and then either runs sequentially or in parallel on your local CPU or on a cluster. The results of the different chunks (usually just plain numbers or histograms) are collected in the accumulator that is defined at initialization.
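
The chunk-and-accumulate idea can be illustrated without coffea. In this toy sketch the function `process`, the MET values, the bin edges and the chunk size are all made up, and the chunks are simply processed one after the other:

```python
import numpy as np

def process(chunk, edges):
    """Toy 'process' step: turn one chunk of events into a partial histogram."""
    hist_counts, _ = np.histogram(chunk, bins=edges)
    return hist_counts

# Hypothetical input: MET values for 10 events, processed in chunks of 4
met = np.array([12.0, 55.0, 130.0, 80.0, 5.0, 210.0, 95.0, 40.0, 160.0, 20.0])
edges = np.array([0.0, 50.0, 100.0, 150.0, 250.0])
chunksize = 4

# The accumulator starts empty and partial results are added chunk by chunk
accumulator = np.zeros(len(edges) - 1, dtype=np.int64)
for start in range(0, len(met), chunksize):
    accumulator += process(met[start:start + chunksize], edges)

# Processing everything in one go must give the same answer
reference = process(met, edges)
```

Because histograms simply add, the order in which chunks finish does not matter, which is what makes running them in parallel straightforward.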


In [None]:
from tqdm.auto import tqdm
import coffea.processor as processor
from coffea.processor.accumulator import AccumulatorABC

from Tools.helpers import loadConfig, getCutFlowTable, mergeArray
from Tools.objects import Collections
from Tools.cutflow import Cutflow

import copy

class exampleProcessor(processor.ProcessorABC):
    """Dummy processor used to demonstrate the processor principle"""
    def __init__(self):

        # we can use a large number of bins and rebin later
        dataset_axis        = hist.Cat("dataset",   "Primary dataset")
        pt_axis             = hist.Bin("pt",        r"$p_{T}$ (GeV)", 1000, 0, 1000)
        mass_axis           = hist.Bin("mass",      r"M (GeV)", 1000, 0, 2000)
        multiplicity_axis   = hist.Bin("multiplicity",         r"N", 20, -0.5, 19.5)

        self._accumulator = processor.dict_accumulator({
            "dimuon_mass" :     hist.Hist("Counts", dataset_axis, mass_axis),
            "dimuon_mass_offZ" : hist.Hist("Counts", dataset_axis, mass_axis),
            "diele_mass" :      hist.Hist("Counts", dataset_axis, mass_axis),
            "dilep_mass" :      hist.Hist("Counts", dataset_axis, mass_axis),
            
            "MET_pt" :          hist.Hist("Counts", dataset_axis, pt_axis),
            "MET_pt_onZ" :      hist.Hist("Counts", dataset_axis, pt_axis),
            "MET_pt_offZ" :     hist.Hist("Counts", dataset_axis, pt_axis),
            
            "N_ele" :           hist.Hist("Counts", dataset_axis, multiplicity_axis),
            "N_mu" :            hist.Hist("Counts", dataset_axis, multiplicity_axis),
            "N_jet" :           hist.Hist("Counts", dataset_axis, multiplicity_axis),
            'TTW':              processor.defaultdict_accumulator(int),
            'TTZ':              processor.defaultdict_accumulator(int),
            'totalEvents':      processor.defaultdict_accumulator(int),
        })
        
    @property
    def accumulator(self):
        return self._accumulator

    def process(self, df):
        """
        Processing function. This is where the actual analysis happens.
        """
        output = self.accumulator.identity()
        dataset = df["dataset"]
        cfg = loadConfig()

        # load all the default candidates (jets, leptons, ...)
        
        ## Jets
        jet = JaggedCandidateArray.candidatesfromcounts(
            df['nJet'],
            pt = df['Jet_pt'].content,
            eta = df['Jet_eta'].content,
            phi = df['Jet_phi'].content,
            mass = df['Jet_mass'].content,
            jetId = df['Jet_jetId'].content,
            puId = df['Jet_puId'].content,
            btagDeepB = df['Jet_btagDeepB'].content,
            
        )
        jet       = jet[(jet.pt>30) & (abs(jet.eta)<2.4) & (jet.jetId>0)]
        
        ## Muons.
        muon = JaggedCandidateArray.candidatesfromcounts(
            # The first part is essential: number of muons, and kinematic properties
            df['nMuon'],
            pt = df['Muon_pt'].content,
            eta = df['Muon_eta'].content,
            phi = df['Muon_phi'].content,
            mass = df['Muon_mass'].content,
            # all below is optional
            mediumId = df['Muon_mediumId'].content,
            miniPFRelIso_all = df['Muon_miniPFRelIso_all'].content,
            charge = df['Muon_charge'].content,
        )
        
        ## Electrons.
        electron = JaggedCandidateArray.candidatesfromcounts(
            # The first part is essential: number of electrons, and kinematic properties
            df['nElectron'],
            pt=df['Electron_pt'].content,
            eta=df['Electron_eta'].content,
            phi=df['Electron_phi'].content,
            mass=df['Electron_mass'].content,
            # all below is optional
            cutBased=df['Electron_cutBased'].content,
            miniPFRelIso_all=df['Electron_miniPFRelIso_all'].content,
            charge=df['Electron_charge'].content,
        )
        
        ## Selections
        muon = muon[
            (muon.pt>10) & 
            (abs(muon.eta)<2.4) & 
            (muon.mediumId) & 
            (muon.miniPFRelIso_all<0.2) 
        ]

        electron = electron[
            (electron.pt>10) & 
            (abs(electron.eta)<2.4) & 
            (electron.cutBased>=3) & 
            (electron.miniPFRelIso_all<0.1) 
        ]   

        ## MET
        met_pt  = df["MET_pt"]
        met_phi = df["MET_phi"]
        
        dimuon = muon.choose(2)
        OS_dimuon = dimuon[(dimuon.i0.charge*dimuon.i1.charge < 0)]

        dielectron = electron.choose(2)
        OS_dielectron = dielectron[(dielectron.i0.charge*dielectron.i1.charge < 0)]

        OS_dimuon_bestZmumu = OS_dimuon[abs(OS_dimuon.mass-91.2).argmin()]
        OS_dielectron_bestZee = OS_dielectron[abs(OS_dielectron.mass-91.2).argmin()]
        
        OS_dilepton_bestZll_mass = mergeArray(OS_dimuon_bestZmumu.mass, OS_dielectron_bestZee.mass)

        baseline = ((electron.counts+muon.counts)==3) & ((electron.pt>30).any() | (muon.pt>30).any())
        offZ_selection = (abs(OS_dimuon.mass-91.2)>15).all() & (abs(OS_dielectron.mass-91.2)>15).all()
        onZ_selection = ((abs(OS_dimuon.mass-91.2)<15).any() | (abs(OS_dielectron.mass-91.2)<15).any())
        
        
        output['totalEvents']['all'] += len(df['weight'])
        
        
        
        ## And fill the histograms
        
        # just some multiplicities
        output['N_ele'].fill(dataset=dataset, multiplicity=electron[baseline].counts, weight=df['weight'][baseline]*cfg['lumi'])
        output['N_mu'].fill(dataset=dataset, multiplicity=muon[baseline].counts, weight=df['weight'][baseline]*cfg['lumi'])
        output['N_jet'].fill(dataset=dataset, multiplicity=jet[baseline].counts, weight=df['weight'][baseline]*cfg['lumi'])
                
        output['MET_pt'].fill(
            dataset=dataset,
            pt=met_pt[baseline].flatten(),
            weight=df['weight'][baseline]*cfg['lumi']
        )
        
        output['MET_pt_offZ'].fill(
            dataset=dataset,
            pt=met_pt[baseline & offZ_selection].flatten(),
            weight=df['weight'][baseline & offZ_selection]*cfg['lumi']
        )
        
        output['MET_pt_onZ'].fill(
            dataset=dataset,
            pt=met_pt[baseline & onZ_selection].flatten(),
            weight=df['weight'][baseline & onZ_selection]*cfg['lumi']
        )
        
        output['dimuon_mass'].fill(
            dataset=dataset,
            mass = OS_dimuon_bestZmumu[baseline].mass.flatten(),
            weight=df['weight'][baseline & (OS_dimuon_bestZmumu.counts>0)]*cfg['lumi']
        )
        
        output['dimuon_mass_offZ'].fill(
            dataset=dataset,
            mass = OS_dimuon_bestZmumu[baseline & offZ_selection ].mass.flatten(),
            weight=df['weight'][baseline & (OS_dimuon_bestZmumu.counts>0) & offZ_selection]*cfg['lumi']
        )
        
        output['diele_mass'].fill(
            dataset=dataset,
            mass = OS_dielectron_bestZee[baseline].mass.flatten(),
            weight=df['weight'][baseline & (OS_dielectron_bestZee.counts>0)]*cfg['lumi']
        )
        
        output['dilep_mass'].fill(
            dataset=dataset,
            # usage of np.array() is pretty awkward, but UnionArray doesn't work with the histograms (yet)
            mass = np.array(OS_dilepton_bestZll_mass[baseline & (OS_dilepton_bestZll_mass.counts>0)].flatten()),
            weight=df['weight'][baseline & (OS_dilepton_bestZll_mass.counts>0)]*cfg['lumi']
        )
        
        return output

    def postprocess(self, accumulator):
        return accumulator

The main part of the above processor code is pretty similar to what we've done before. We do fill histograms in the end, and we have to be careful that the length of the arrays for the variable of interest and the weight are aligned.
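
The alignment issue comes from the weights being stored per event while the masses are stored per candidate. A toy sketch in plain numpy, with all values invented, shows why the weight mask needs the extra `counts > 0` requirement:

```python
import numpy as np

# Hypothetical toy inputs for 5 events
weight = np.array([0.5, 1.2, 0.8, 1.0, 0.7])          # one weight per event
baseline = np.array([True, True, False, True, True])  # event selection
n_best = np.array([1, 0, 1, 1, 0])                    # best-Z candidates per event
best_mass = np.array([89.9, 91.0, 120.3])             # flat masses, one per candidate

# Masses of events passing the baseline: this drops the candidate from event 2
cand_event = np.repeat(np.arange(len(n_best)), n_best)
mass_sel = best_mass[baseline[cand_event]]

# Weights must be restricted to the same events that contribute a mass
weight_sel = weight[baseline & (n_best > 0)]
```

Using `weight[baseline]` alone would give four weights for two masses here, and the fill would fail or, worse, silently misalign.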


In [None]:
import glob

# Run the processor
fileset = {
    'TTZ': glob.glob('/hadoop/cms/store/user/dspitzba/nanoAOD/ttw_samples/0p1p11/TTZToLLNuNu_M-10_TuneCP5_13TeV-amcatnlo-pythia8_RunIIAutumn18NanoAODv6-Nano25Oct2019_102X_upgrade2018_realistic_v20_ext1-v1/*.root'),
    'TTW': glob.glob('/hadoop/cms/store/user/dspitzba/nanoAOD/ttw_samples/0p1p11/TTWJetsToLNu_TuneCP5_13TeV-amcatnloFXFX-madspin-pythia8_RunIIAutumn18NanoAODv6-Nano25Oct2019_102X_upgrade2018_realistic_v20_ext1-v1/*.root'),
}
workers = 12
output = processor.run_uproot_job(fileset,
                                      treename='Events',
                                      processor_instance=exampleProcessor(),
                                      executor=processor.futures_executor,
                                      executor_args={'workers': workers, 'function_args': {'flatten': False}},
                                      chunksize=500000,
                                     )

In [None]:
output

The histogram output looks as follows. We chose to have a large number of bins when running the processor, this way we are more flexible in the choice of the final binning later.
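
Why fine binning keeps us flexible: neighbouring bins can always be merged afterwards, as long as the new edges coincide with existing ones. A minimal numpy sketch with invented counts (this is the idea behind rebinning, not the coffea `rebin` implementation):

```python
import numpy as np

# Hypothetical fine histogram: 8 bins of width 2 GeV covering 0 to 16 GeV
fine_counts = np.array([1, 3, 0, 2, 5, 4, 1, 0])
fine_edges = np.linspace(0, 16, 9)

# Merge groups of 4 neighbouring bins into 2 coarse bins
group = 4
coarse_counts = fine_counts.reshape(-1, group).sum(axis=1)
coarse_edges = fine_edges[::group]
```

Going the other way, from coarse to fine bins, is not possible without rerunning the processor.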

In [None]:
histogram = output['dilep_mass']
ax = hist.plot1d(histogram,overlay="dataset", stack=True)

Let's rebin to a more sensible range.

In [None]:
histogram = output['dilep_mass'].copy()
histogram = histogram.rebin('mass', hist.Bin('mass', r'$M(ll) \ (GeV)$', 50, 0, 200))

ax = hist.plot1d(histogram,overlay="dataset", stack=True)

In [None]:
histogram = output['dimuon_mass_offZ'].copy()
histogram = histogram.rebin('mass', hist.Bin('mass', r'$M(ll) \ (GeV)$', 50, 0, 200))

ax = hist.plot1d(histogram,overlay="dataset", stack=True, overflow='over')

In [None]:
histogram = output['MET_pt'].copy()
histogram = histogram.rebin('pt', hist.Bin('pt', r'$p_T^{miss} \ (GeV)$', 50, 0, 500))

ax = hist.plot1d(histogram,overlay="dataset", stack=True)

In [None]:
histogram = output['MET_pt_offZ'].copy()
histogram = histogram.rebin('pt', hist.Bin('pt', r'$p_T^{miss} \ (GeV)$', 50, 0, 500))

ax = hist.plot1d(histogram,overlay="dataset", stack=True, overflow='over')

In [None]:
histogram = output['MET_pt_onZ'].copy()
histogram = histogram.rebin('pt', hist.Bin('pt', r'$p_T^{miss} \ (GeV)$', 50, 0, 500))

ax = hist.plot1d(histogram,overlay="dataset", stack=True, overflow='over')

We can also easily access the counts of the histogram...

In [None]:
histogram.values(overflow='over')

... also with the sum of the squared weights (that we need for the statistical uncertainties in the MC)

In [None]:
histogram.values(overflow='over', sumw2=True)
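
For a single bin, the relation between the weights, the bin content and its statistical uncertainty is short enough to write out. The weights below are invented:

```python
import numpy as np

# Hypothetical weighted MC events falling into one bin
weights = np.array([0.5, 0.5, 1.2, 0.8])

sumw = weights.sum()          # bin content
sumw2 = (weights ** 2).sum()  # sum of squared weights
stat_unc = np.sqrt(sumw2)     # statistical uncertainty on the bin content
```

For unweighted events (all weights equal to 1) this reduces to the familiar $\sqrt{N}$.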

What's the ratio of ttW/ttZ events without the Z veto?

In [None]:
output['MET_pt']['TTW'].sum('dataset').values(overflow='over')[()].sum()/output['MET_pt']['TTZ'].sum('dataset').values(overflow='over')[()].sum()

And now just off-Z:

In [None]:
output['MET_pt_offZ']['TTW'].sum('dataset').values(overflow='over')[()].sum()/output['MET_pt_offZ']['TTZ'].sum('dataset').values(overflow='over')[()].sum()

# Advanced topics

## Dask

We can use Dask to submit jobs to a cluster. This just serves as an example for now, since I couldn't start a cluster this morning...

In principle, a cluster can be started by running `ipython -i start_cluster.py`


In [None]:
from Tools.helpers import get_scheduler_address

from dask.distributed import Client, progress

scheduler_address = get_scheduler_address()

c = Client(scheduler_address)

c

In [None]:
exe_args = {
    'client': c,
}
exe = processor.dask_executor

output = processor.run_uproot_job(fileset,
                                      treename='Events',
                                      processor_instance=exampleProcessor(),
                                      executor=exe,
                                      executor_args=exe_args,
                                      chunksize=500000,
                                     )


## Matching

Matching can be interesting for efficiency studies, overlap removal of different collections etc. Let's first do some gen matching.
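
Matching is usually based on the angular distance $\Delta R = \sqrt{\Delta\eta^2 + \Delta\phi^2}$. The sketch below spells this out for one toy event; the helper `delta_r` and the particle coordinates are invented for illustration and are not the coffea `match` implementation. The one subtlety is that $\Delta\phi$ has to be wrapped around at $\pm\pi$.

```python
import numpy as np

def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance, with the phi difference wrapped into [-pi, pi)."""
    dphi = (phi1 - phi2 + np.pi) % (2 * np.pi) - np.pi
    return np.hypot(eta1 - eta2, dphi)

# Hypothetical toy event: two generated muons and one reconstructed muon
gen = [(0.50, 3.10), (-1.20, 0.40)]  # (eta, phi)
reco = [(0.52, -3.12)]               # close to the first one across the phi boundary

# A generated muon is matched if any reconstructed muon lies within dR < 0.4
matched = [
    any(delta_r(g_eta, g_phi, r_eta, r_phi) < 0.4 for r_eta, r_phi in reco)
    for g_eta, g_phi in gen
]
```

Without the wrapping, the first pair would have a $\Delta\phi$ of about 6.2 and would wrongly be called unmatched.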


In [None]:
# The loaded samples contain a branch that is called GenL (generated leptons). This is non-standard for nanoAOD.

gen_lep = JaggedCandidateArray.candidatesfromcounts(
    df['nGenL'],
    pt = df['GenL_pt'].content,
    eta = df['GenL_eta'].content,
    phi = df['GenL_phi'].content,
    mass = ((df['GenL_pt']>0)*0).content,
    pdgId =  df['GenL_pdgId'].content,
)

In [None]:
gen_lep = gen_lep[baseline]
gen_muon = gen_lep[abs(gen_lep.pdgId)==13]

In [None]:
matched_gen_muon = gen_muon[gen_muon.match(muon, deltaRCut=0.4)]
unmatched_gen_muon = gen_muon[~gen_muon.match(muon, deltaRCut=0.4)]

In [None]:
(matched_gen_muon.pt>20).flatten().sum()/(gen_muon.pt>20).flatten().sum()

In [None]:
(unmatched_gen_muon.pt>20).flatten().sum()/(gen_muon.pt>20).flatten().sum()

Now let's do some cross cleaning, namely remove jets that overlap with leptons. First, let's load jets similarly to electrons and muons, and define some baseline requirements.

In [None]:
jet = JaggedCandidateArray.candidatesfromcounts(
    df['nJet'],
    pt = df['Jet_pt'].content,
    eta = df['Jet_eta'].content,
    phi = df['Jet_phi'].content,
    mass = df['Jet_mass'].content,
    jetId = df['Jet_jetId'].content,
    puId = df['Jet_puId'].content,
    btagDeepB = df['Jet_btagDeepB'].content,
    
)
jet = jet[(jet.pt>30) & (abs(jet.eta)<2.4) & (jet.jetId>0)]
jet = jet[baseline]

In [None]:
jet = jet[~jet.match(muon, deltaRCut=0.4)]
jet = jet[~jet.match(electron, deltaRCut=0.4)]