## **Introduction**

We will study basic tools and wrappers that enable a not-too-alien syntax when running columnar collider HEP analyses. These tools are part of a Pythonic HEP ecosystem.

![](https://i.imgur.com/3a11SeT.png)

To make things easier to find, they're cataloged under a common name at [scikit-hep.org](https://scikit-hep.org/).

## **ROOT file structure and terminology**

A ROOT file is like a little filesystem containing nested directories. Any class instance (like `ROOT TObject`) can be stored in a directory.

One of these classes, TTree, is a gateway to large datasets. A TTree is roughly like a Pandas DataFrame in that it represents a table of data. The columns are called TBranches, which can be nested (unlike Pandas), and the data can have any C++ type (unlike Pandas, which can store any Python type).

A TTree is often too large to fit in memory, and sometimes (rarely) even a single TBranch is too large to fit in memory. Each TBranch is therefore broken down into TBaskets, which are “batches” of data. TBaskets are the smallest unit that can be read from a TTree: if you want to read the first entry, you have to read the first TBasket.
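To make the basket granularity concrete, here is a minimal pure-Python sketch (not Uproot's actual implementation). The basket boundaries in `basket_starts`, the entry count, and both helper functions are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right

# Hypothetical layout of one TBranch: first entry index of each TBasket
basket_starts = [0, 1000, 2000, 3000]
num_entries = 3500  # hypothetical total number of entries in the branch


def basket_for_entry(entry):
    """Return the index of the basket that holds the given entry."""
    if not 0 <= entry < num_entries:
        raise IndexError(f"entry {entry} is out of range")
    # The last basket whose first entry is <= entry contains it
    return bisect_right(basket_starts, entry) - 1


def entries_read(entry):
    """Return the (start, stop) entry range that must be read to get one entry."""
    i = basket_for_entry(entry)
    start = basket_starts[i]
    stop = basket_starts[i + 1] if i + 1 < len(basket_starts) else num_entries
    return start, stop


# Reading only the first entry still means reading the whole first basket
print(entries_read(0))     # (0, 1000)
print(entries_read(2500))  # (2000, 3000)
```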

As a data analyst, you’ll likely be concerned with TTrees and TBranches first-hand, and with TBaskets only when efficiency issues come up.

![](https://i.imgur.com/gkN0q9f.png)

CMS data has this structure and is available in several formats. One of them is the so-called **NanoAOD** format. A NanoAOD file contains a main TTree named **Events**. A dump of the content documentation for different releases is available [here](https://cms-nanoaod-integration.web.cern.ch/autoDoc/).

## **Reading and manipulating data**

[Uproot](https://uproot.readthedocs.io/en/latest/) is a Python package that reads and writes ROOT files and is only concerned with reading and writing (no analysis, no plotting, etc.). It interacts with NumPy, Awkward Array, and Pandas for computations, boost-histogram/hist for histogram manipulation and plotting, Vector for Lorentz vector functions and transformations, Coffea for scale-up, etc.

Uproot is implemented using only Python and Python libraries.

![](https://i.imgur.com/pt8y1c1.png)

To access a remote file via XRootD, we use a `root://...` URL:



In [None]:
# file name
fname = "root://eospublic.cern.ch//eos/root-eos/benchmark/Run2012B_SingleMu.root"

We can use `uproot.open()` to open a ROOT file. Instead, we will use `coffea.nanoevents.NanoEventsFactory` (which uses `uproot.open` internally) to read the ROOT file and create a `NanoEvents` object (with a `NanoAODSchema`). To limit the amount of data read, set `entry_stop` to the entry index at which reading should stop; here, `entry_stop=10` reads only the first 10 events:

In [None]:
import awkward as ak  # used below for ak.any, ak.flatten, ak.pad_none
from coffea.nanoevents import NanoEventsFactory, NanoAODSchema

# load events
events = NanoEventsFactory.from_root(fname, schemaclass=NanoAODSchema, entry_stop=10).events()

Browsing through the nanoAOD event content is useful to understand each branch. We can list its content using the `fields` attribute:

In [None]:
# events fields
events.fields

In [None]:
# jet fields
events.Jet.fields

In [None]:
# jet transverse momentum
events.Jet.pt

Arrays like these are sometimes called "jagged arrays". A jagged array is an irregular array: an array of arrays whose member arrays can have different lengths.

[Awkward](https://awkward-array.org/doc/main/index.html) is a library, part of the pythonic HEP ecosystem, used for manipulating these jagged arrays using NumPy-like idioms. Basically, it is a generalization of NumPy for irregular arrays.





**operations:** Arrays are dynamically typed, but operations on them are compiled and fast.

In [None]:
# jet transverse momentum squared
events.Jet.pt ** 2

**slicing:** Basic slices generalize NumPy's: they do what NumPy would do if it had variable-length lists (Boolean and integer slices work too).

In [None]:
# jet transverse momentum
jet_pt = events.Jet.pt

In [None]:
# first event
jet_pt[0]

In [None]:
# second event
jet_pt[1]

In [None]:
# first element of the first two events
jet_pt[:2, 0]

Note that this type of slicing fails if any event has no elements (for example, an event without jets):

In [None]:
# first element of all events
jet_pt[:, 0]

In such cases, we can use the Awkward `pad_none()` function to increase the lengths of lists to a target length by adding `None` values, so that we can perform the slicing:

In [None]:
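# pad None's so that every event has at least one element
jet_pt_padded = ak.pad_none(jet_pt, 1)

In [None]:
# first element of all events
jet_pt_padded[:, 0]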

**Application: selecting particles and events**

**particle-level cut**: This jagged array of booleans selects all muons with more than 20 GeV in transverse momentum:

In [None]:
particle_cut = events.Muon.pt > 20

In [None]:
muons = events.Muon[particle_cut]

**event-level cut**: This non-jagged array of booleans (made with `ak.any`) selects all events that have at least one muon with more than 20 GeV in transverse momentum:

In [None]:
event_cut = ak.any(events.Muon.pt > 20, axis=1)

In [None]:
muons = events.Muon[event_cut]

## **Boost-histogram and Hist**

High-energy physicists usually want to fill histograms with more data than can fit in memory, which means setting bin intervals on an empty container and filling it in batches (sequentially or in parallel). Boost-histogram is a library designed for that purpose. 

A more user-friendly layer is provided by a library called [Hist](https://hist.readthedocs.io/en/latest/), a powerful histogramming tool for analysis within the Python HEP ecosystem, built on top of Boost-histogram. It provides:

* N-dimensional histograms
* Discrete and dense axes: Regular, Boolean, Variable, Integer, IntCategory and StrCategory.
* Useful methods to transform and index histograms
* Plotting via matplotlib or mplhep: stacked and normalized plots, ratio plots, 2D plots, etc.

**1-dimensional Histogram**

In [None]:
import hist

# definition of a regular axis
jet_pt_axis = hist.axis.Regular(
    bins=40, 
    start=20, 
    stop=1000, 
    name="jet_pt", 
    label="Jet $p_T$ [GeV]"
)

# 1D histogram
jet_pt_histogram = hist.Hist(jet_pt_axis)

jet_pt_histogram

Since `jet_pt` is a jagged array, it can't be directly histogrammed. Histograms take a flat set of numbers as input, but this array contains lists.

You can use `ak.flatten()` to flatten one level of lists or `ak.ravel()` to flatten all levels.

In [None]:
# filling the histogram
jet_pt_histogram.fill(jet_pt=ak.flatten(jet_pt))

jet_pt_histogram

We can plot the histogram using the `plot1d()` method:

In [None]:
jet_pt_histogram.plot1d();

**N-dimensional Histogram**

In [None]:
import numpy as np

# jet axes
jet_pt_axis = hist.axis.Variable(
    edges=[20, 60, 90, 120, 150, 180, 210, 240, 300, 500],
    name="jet_pt",
    label="Jet $p_T$ [GeV]"
)
jet_eta_axis = hist.axis.Regular(
    bins=50,
    start=-2.4,
    stop=2.4,
    name="jet_eta",
    label=r"Jet $\eta$"
)
jet_phi_axis = hist.axis.Regular(
    bins=50,
    start=-np.pi,
    stop=np.pi,
    name="jet_phi",
    label=r"Jet $\phi$"
)

# 3D histogram
jet_histogram = hist.Hist(
    jet_pt_axis,
    jet_eta_axis,
    jet_phi_axis,
)
jet_histogram

In [None]:
# filling the histogram
jet_histogram.fill(
    jet_pt=ak.flatten(events.Jet.pt),
    jet_eta=ak.flatten(events.Jet.eta),
    jet_phi=ak.flatten(events.Jet.phi)
)
jet_histogram

We can inspect an individual axis by first using the `project()` method:

In [None]:
# project into an axis
jet_histogram.project("jet_pt")

We could also visualize 2D histograms using the `plot2d()` method:

In [None]:
# select all entries for the first two axes
jet_pt_eta_histogram = jet_histogram.project("jet_pt", "jet_eta")

In [None]:
# plot 2D histogram
jet_pt_eta_histogram.plot2d();

See more about histogram manipulation and transformations [here](https://github.com/CoffeaTeam/coffea/discussions/705).

**Quiz 1 (particle-level cut):** Select electrons that satisfy the following conditions

* $p_T \geq 30$ GeV
* $|\eta| < 2.5$
* electrons with a medium MVA identification working point (`mvaFall17V2Iso_WP80`)

**Quiz 2 (event-level cut):** Select the missing transverse energy (MET) for events in which the condition $p_T(j) \geq 30$ GeV is met by at least two jets

Hint: use the `ak.sum()` function

**Quiz 3:** Using the full dataset:

* Select muons that satisfy the following conditions

  * $p_T \geq 35$ GeV
  * $|\eta| < 2.4$
  * muons with a tight cut-based identification working point (`tightId`)
  * muons with a tight PF Relative Isolation working point (pfRelIso04_all $\leq 0.15$)

* Select the leading and subleading muons $\mu_1$ and $\mu_2$ (use the `ak.firsts()` function)
* Compute the invariant mass $m(\mu_1, \mu_2)$ using the following formula: $$m^2(\mu_1, \mu_2) = 2 p_T(\mu_1)p_T(\mu_2)f(\eta, \phi)$$

  where $$f(\eta, \phi) = \cosh[\eta(\mu_1) - \eta(\mu_2)] - \cos[\phi(\mu_1) - \phi(\mu_2)]$$
* Select events such that $60 < m < 120$ (GeV)
* Create and plot a histogram of the invariant mass (hint: use `ak.fill_none()` if needed)
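The invariant mass formula above holds when the muon mass is neglected. As a sanity check, the following sketch compares it against an explicit four-vector sum for a pair of massless particles. Both helper functions and the input values are hypothetical and work on plain floats, not on Awkward arrays.

```python
import math


def dimuon_mass(pt1, eta1, phi1, pt2, eta2, phi2):
    """Invariant mass from the formula above (massless approximation)."""
    f = math.cosh(eta1 - eta2) - math.cos(phi1 - phi2)
    return math.sqrt(2.0 * pt1 * pt2 * f)


def four_vector_mass(pt1, eta1, phi1, pt2, eta2, phi2):
    """Invariant mass from the summed four-vectors of two massless particles."""
    energy = px = py = pz = 0.0
    for pt, eta, phi in ((pt1, eta1, phi1), (pt2, eta2, phi2)):
        energy += pt * math.cosh(eta)  # E = |p| for a massless particle
        px += pt * math.cos(phi)
        py += pt * math.sin(phi)
        pz += pt * math.sinh(eta)
    mass_squared = energy**2 - px**2 - py**2 - pz**2
    return math.sqrt(max(mass_squared, 0.0))


# Made-up kinematics for two muons (pT in GeV)
muon_pair = (45.0, 0.3, 0.1, 40.0, -0.5, 2.9)
mass_formula = dimuon_mass(*muon_pair)
mass_vectors = four_vector_mass(*muon_pair)
print(math.isclose(mass_formula, mass_vectors, rel_tol=1e-9))  # True

# Back-to-back muons at eta = 0 with pT = 45 GeV each give m = 90 GeV
back_to_back_mass = dimuon_mass(45.0, 0.0, 0.0, 45.0, 0.0, math.pi)
```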