In [1]:
#Set up notebook
!pip3 install awkward uproot
import awkward as ak
import uproot
import numpy as np



# Reading ROOT Files and Basic Analysis Techniques
In this exercise, we will use `uproot` to open a TFile, learn to navigate its structure, and import a TTree as an awkward array. Once we have read the TTree "Events", we can begin to analyze the data.

## Reading ROOT Files
The library `uproot` handles reading and writing ROOT files in Python. It can read the branches of a TTree into NumPy arrays, pandas DataFrames, or awkward arrays, depending on the variable `uproot.default_library`. We will set it to awkward array for this exercise.

In [2]:
uproot.default_library = "ak" 
TFile = uproot.open("SMS_T2tt_mStop250_mLSP75_fastsim_2016_Skim_070650_37_082253_36.root")
print(TFile)

<ReadOnlyDirectory '/' at 0x7f45e4090520>


Printing the TFile object does not give much information. To access its contents we need to use the methods defined by `uproot.ReadOnlyDirectory`. The class reference can be found [here](https://uproot.readthedocs.io/en/latest/uproot.reading.ReadOnlyDirectory.html). To see the contents of `TFile` we call:

In [3]:
TFile.keys()

['Events;1', 'Runs;1', 'LuminosityBlocks;1', 'untagged;1']

To obtain the class types and number of branches of these objects we call:

In [4]:
TFile.items()

[('Events;1', <TTree 'Events' (860 branches) at 0x7f45cafd8490>),
 ('Runs;1', <TTree 'Runs' (8 branches) at 0x7f4513622520>),
 ('LuminosityBlocks;1',
  <TTree 'LuminosityBlocks' (2 branches) at 0x7f45135b9a00>),
 ('untagged;1', <TObjString 'untagged' at 0x7f45138eccf0>)]

We can see that our ROOT file contains a TTree called "Events" with 860 branches, which holds our data. Accessing the TTree does not load all of its data into memory; branches are read only when requested. We can call `.keys()` and `.items()` on the TTree as well to list its branches. Let's get "Events" by passing its key in square brackets to `TFile` and check the first four branches.

In [5]:
Events = TFile["Events"]
print(Events.keys()[0:4])

['run', 'luminosityBlock', 'event', 'btagWeight_CSVV2']


The contents of these branches can be read as an awkward array by calling `TBranch.array()`. Let's read the branch `Muon_pt` into an awkward array.

In [6]:
Muon_pT = Events["Muon_pt"].array()
Muon_pT

We read `type: 847873 * var * float32` as follows. There are 847873 entries (events), and all values are of type `float32`. The `var` in between means that each entry is a variable-length list, because the number of muons differs from event to event. The empty entries are the events with no muons. We can also create an awkward array containing multiple branches if convenient. Let's create an awkward array containing the pT, eta, and phi of the muons using `ak.zip`.

In [7]:
muons = ak.zip(
            {"pT": Muon_pT, #key is the name of the array and the value is the associated array
             "eta": Events["Muon_eta"].array(), 
             "phi": Events["Muon_phi"].array(),
             }
        )
muons

We can read the fields of our constructed array by passing the key in square brackets.

In [8]:
print(muons["eta"])

[[], [0.723], [-0.927], [0.145, 0.686], [0.33], ..., [...], [0.137], [], [], []]


## Basic Selection Techniques 
### Efficiency Calculation
It is often convenient to evaluate multiple selection criteria at once. Let's calculate the efficiency of the following selection: "An event must contain at least one muon with pT > 50 and |eta| < 2.5". Let's compare the time needed to complete this calculation with a row-wise and with a columnar analysis.

In [13]:
%%time
#Row wise analysis
muons_subset = muons[0:10000]
events_pass = 0

#Check each event for at least one muon satisfying the quality criteria
for i, event in enumerate(muons_subset["pT"]):
    for j, muon_pT in enumerate(event):
        #index the same subset that we are looping over
        if (muon_pT > 50) and (np.abs(muons_subset["eta"][i,j]) < 2.5):
            events_pass += 1
            break #go to next event when we find a good muon
            
efficiency = events_pass/len(muons_subset)
print("Efficiency is: ", efficiency)

Efficiency is:  0.0881
CPU times: user 606 ms, sys: 31.2 ms, total: 638 ms
Wall time: 623 ms


Now using awkward array to perform a columnar analysis of the problem:

In [10]:
%%time
#Columnar analysis
selection = ak.any((muons_subset["pT"] > 50) & (np.abs(muons_subset["eta"]) < 2.5), axis = 1)
muons_pass = muons_subset[selection]
efficiency = len(muons_pass)/len(muons_subset)
print("Efficiency is: ", efficiency)

Efficiency is:  0.0881
CPU times: user 4.6 ms, sys: 0 ns, total: 4.6 ms
Wall time: 4.32 ms


You should see the runtime drop from hundreds of milliseconds to a few milliseconds, a reduction of more than 100-fold (623 ms to 4.32 ms in the run above). Python for loops are slow because every iteration is executed by the interpreter and each element is accessed individually (here, per muon per event), whereas the awkward array operations run as compiled loops over whole columns. The awkward array syntax is also more readable, and it leaves us with the object `muons_pass`, which contains the events that satisfy the selection criteria.

### Scale Factor Calculation