**LHC data from a ROOT file**

Particle physicists need structures like these; in fact, they have been a staple of particle physics analyses for decades. The ROOT file format was developed in the mid-1990s to serialize arbitrary C++ data structures in a columnar way (replacing ZEBRA and similar Fortran projects that date back to the 1970s). The PyROOT library dynamically wraps these objects to present them in Python, though with a performance penalty. The uproot library reads columnar data directly from ROOT files in Python without intermediary C++.

In [1]:
import uproot
events = uproot.open("http://scikit-hep.org/uproot/examples/HZZ-objects.root")["events"].lazyarrays()
events

<Table [<Row 0> <Row 1> <Row 2> ... <Row 2418> <Row 2419> <Row 2420>] at 0x7fa5492b8370>

In [2]:
events.columns

['jetp4',
 'jetbtag',
 'jetid',
 'muonp4',
 'muonq',
 'muoniso',
 'electronp4',
 'electronq',
 'electroniso',
 'photonp4',
 'photoniso',
 'MET',
 'MC_bquarkhadronic',
 'MC_bquarkleptonic',
 'MC_wdecayb',
 'MC_wdecaybbar',
 'MC_lepton',
 'MC_leptonpdgid',
 'MC_neutrino',
 'num_primaryvertex',
 'trigger_isomu24',
 'eventweight']

This is a typical particle physics dataset (though small!) in that it represents the momentum and energy ("p4" for Lorentz 4-momentum) of several different species of particles: "jet", "muon", "electron", and "photon". Each collision can produce a different number of particles in each species. Other variables, such as missing transverse energy or "MET", have one value per collision event. Events with zero particles in a species must still be kept, because their event-level data remain meaningful.
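One common way to hold a variable number of particles per event in columns is a flat content array plus an offsets array marking where each event starts and stops. The sketch below uses plain Python lists and invented muon momenta; the names `to_jagged`, `event`, `content`, and `offsets` are illustrative assumptions, not the uproot or awkward-array API. It shows that events with zero particles stay in the structure as empty ranges.

```python
# Hypothetical per-event muon transverse momenta (values invented for illustration).
muon_pt_by_event = [[54.2, 37.7], [], [31.1], [], [23.6, 45.5, 9.8]]

def to_jagged(lists):
    """Flatten nested lists into a (content, offsets) pair."""
    content, offsets = [], [0]
    for items in lists:
        content.extend(items)
        offsets.append(len(content))  # where the next event begins
    return content, offsets

def event(content, offsets, i):
    """Recover the particles of event i as a slice of the flat content."""
    return content[offsets[i]:offsets[i + 1]]

content, offsets = to_jagged(muon_pt_by_event)

# Number of muons in each event, computed from the offsets alone.
counts = [stop - start for start, stop in zip(offsets[:-1], offsets[1:])]

print(content)  # all muons from all events, back to back
print(offsets)  # one more entry than there are events
print(counts)   # zero-muon events appear as 0, they are not dropped
```

Because the empty events are encoded as repeated offsets, the number of events is always `len(offsets) - 1`, regardless of how many particles each one contains.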

In [7]:
# The first event has two muons.
events.muonp4
# The first event has zero jets.
events.jetp4
# Every event has exactly one MET.
events.MET

<ChunkedArray [TVector2(5.9128, 2.5636) TVector2(24.765, -16.349) TVector2(-25.785, 16.237) ... TVector2(18.102, 50.291) TVector2(79.875, -52.351) TVector2(19.714, -3.5954)] at 0x7fa5492a5ac0>

Unlike the exoplanet data, these events cannot be represented as a DataFrame because of the different numbers of particles in each species and because zero-particle events have value. Even with just "muonp4", "jetp4", and "MET", there is no translation.

In [6]:
import awkward

try:
    # Attempt to flatten three columns with different per-event multiplicities into one DataFrame.
    awkward.topandas(events[["muonp4", "jetp4", "MET"]], flatten=True)
except Exception as err:
    print(type(err), str(err))


It could be described as a collection of DataFrames, in which every operation relating particles in the same event would require a join. But that would make analysis harder, not easier. An event has meaning on its own.
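To make the cost of the multi-table approach concrete, here is a minimal sketch in plain Python (no pandas) with invented values. Each species is flattened into its own table of rows keyed by a hypothetical `event` index. Relating muons to jets then requires a join on that key, and any event lacking muons or jets falls out of an inner join even though its MET is still meaningful.

```python
from collections import defaultdict

# Invented data for three events; not taken from the HZZ file.
muons = [[50.1, 38.2], [], [27.4]]
jets = [[], [88.0], [41.3, 30.9]]
met = [5.9, 24.8, 25.8]

# One "table" per species: one row per particle, tagged with its event index.
muon_rows = [(i, pt) for i, pts in enumerate(muons) for pt in pts]
jet_rows = [(i, pt) for i, pts in enumerate(jets) for pt in pts]

# Inner join on the event index to pair every muon with every jet of the same event.
jets_by_event = defaultdict(list)
for i, pt in jet_rows:
    jets_by_event[i].append(pt)
joined = [(i, mu, jet) for i, mu in muon_rows for jet in jets_by_event.get(i, [])]

# Events that survive the join, and the MET values of the events that do not.
events_in_join = sorted({i for i, _, _ in joined})
met_lost = [value for i, value in enumerate(met) if i not in events_in_join]

print(joined)
print(events_in_join)
print(met_lost)
```

Only the third event appears in `joined`; the first two vanish because one species is empty, which is the sense in which zero-particle events get lost in a purely tabular layout.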

In [8]:
events[0].tolist()

{'jetp4': [],
 'jetbtag': [],
 'jetid': [],
 'muonp4': [TLorentzVector(x=-52.899, y=-11.655, z=-8.1608, t=54.779),
  TLorentzVector(x=37.738, y=0.69347, z=-11.308, t=39.402)],
 'muonq': [1, -1],
 'muoniso': [4.200153350830078, 2.1510612964630127],
 'electronp4': [],
 'electronq': [],
 'electroniso': [],
 'photonp4': [],
 'photoniso': [],
 'MET': TVector2(5.9128, 2.5636),
 'MC_bquarkhadronic': TVector3(0, 0, 0),
 'MC_bquarkleptonic': TVector3(0, 0, 0),
 'MC_wdecayb': TVector3(0, 0, 0),
 'MC_wdecaybbar': TVector3(0, 0, 0),
 'MC_lepton': TVector3(0, 0, 0),
 'MC_leptonpdgid': 0,
 'MC_neutrino': TVector3(0, 0, 0),
 'num_primaryvertex': 6,
 'trigger_isomu24': True,
 'eventweight': 0.009271008893847466}
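As a small illustration of an event having meaning on its own, the two muon four-momenta printed above can be combined into an invariant mass, m² = E² − |p|², using the summed components. This is a sketch in plain Python: the helper `invariant_mass` is hypothetical (it is not a method of the objects shown), the components are copied from the rounded printout, and the units are assumed to be GeV.

```python
import math

# Four-momenta (x, y, z, t) of the two muons printed for events[0].
muon1 = (-52.899, -11.655, -8.1608, 54.779)
muon2 = (37.738, 0.69347, -11.308, 39.402)

def invariant_mass(*vectors):
    """Invariant mass of the sum of (px, py, pz, E) four-vectors."""
    px, py, pz, energy = (sum(component) for component in zip(*vectors))
    mass_squared = energy * energy - (px * px + py * py + pz * pz)
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(mass_squared, 0.0))

mass = invariant_mass(muon1, muon2)
print(round(mass, 1))
```

With these inputs the result is about 90.2, close to the Z boson mass of roughly 91 GeV if the GeV assumption holds. The calculation only makes sense because both muons belong to the same event, which is exactly the relationship a flat table would obscure.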

Particle physics isn't alone in this: analyzing JSON-formatted log files in production systems and working with allele likelihoods in genomics are two other cases where variable-length, nested structures can help. Arbitrary data structures are useful, and working with them in columns provides a new way to do exploratory data analysis: one array at a time.
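The same idea carries over to the log-file case. The following sketch, using only the standard library and invented log lines (the field names `path` and `retries_ms` are hypothetical), turns nested JSON records into one array per field and then answers a question from a single column.

```python
import json

# Invented log lines; each request has a variable number of retry latencies.
log_lines = [
    '{"path": "/home", "retries_ms": [12, 40]}',
    '{"path": "/login", "retries_ms": []}',
    '{"path": "/search", "retries_ms": [7]}',
]
records = [json.loads(line) for line in log_lines]

# One array per field: a flat column for "path" ...
paths = [record["path"] for record in records]

# ... and a flat content array plus offsets for the nested "retries_ms" field.
retries_content = [ms for record in records for ms in record["retries_ms"]]
retries_offsets = [0]
for record in records:
    retries_offsets.append(retries_offsets[-1] + len(record["retries_ms"]))

# Total retry time per request, computed from the retries column alone.
total_ms = [
    sum(retries_content[start:stop])
    for start, stop in zip(retries_offsets[:-1], retries_offsets[1:])
]

print(paths)
print(total_ms)
```

Once the records are split this way, an analysis that only concerns retry latencies never has to touch the `paths` column at all.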