# Rootfiles and uproot

- Rootfiles contain particle physics data organized in a
column-style format.

- Specifically, a Rootfile dataset is composed of events, and
each event is composed of fields of data.

    - Events correspond to rows.
    - Fields correspond to columns.

- To access the Rootfile data, a user reads in
the columns/fields from the dataset (because that is how it
was chosen to be organized).

    - You cannot read in rows/individual events.
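As a rough illustration of the column-versus-row distinction, here is a sketch in plain Python (not `uproot`). The dictionary layout, the `get_event` helper, and the `Muon_size` values are assumptions made up for the example. A columnar dataset stores one sequence per field, so an event is only recovered by picking the same index out of every column:

```python
# Toy columnar dataset: one list per field, one entry per event.
# Field names and values are illustrative, not read from a real file.
columns = {
    "Jet_size": [4, 2, 4],
    "Muon_size": [0, 1, 0],
}

# Reading a column is a single lookup.
jet_size_column = columns["Jet_size"]


def get_event(columns, i):
    """Assemble event i by taking index i from every column (hypothetical helper)."""
    return {field: values[i] for field, values in columns.items()}


first_event = get_event(columns, 0)
print(jet_size_column)
print(first_event)
```

The column lookup touches one field only, while building a row touches every field, which is why column access is the natural operation for this layout.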


- `Uproot` is a Python library that allows direct access to root files
without dealing with the `ROOT` program.

- `Uproot4` is currently being developed to replace `Uproot`, but it has
some issues dealing with certain characters in the column names
(hereafter referred to as field names). The same general ideas
for `Uproot` apply to `Uproot4`.

    - Note: `Uproot4` does make reading in data easier, and I have already
    written some code to transition to uproot4/awkward1 easily once the bugs
    are fixed.

- This notebook will focus on Uproot, not Uproot4.


In [1]:
import uproot


Opening rootfiles returns a `ROOTDirectory` object.


In [2]:
file = uproot.open("../data/test_ttbar_1000.root")
print(type(file))

<class 'uproot.rootio.ROOTDirectory'>


The `ROOTDirectory` should contain a `TTree` object, which contains the
data we're interested in.

In [3]:
for k, v in file.items():
    print("{} {}".format(k, v))


b'ProcessID0;1' <TProcessID b'ProcessID0' at 0x000105d6d4a8>
b'Delphes;1' <TTree b'Delphes' at 0x000105d6d588>


## TTree

- The method .keys() returns a list of all the fields at the top of the
rootfile data structure.

- Some of these fields are associated with subtrees, which contain
subfields within their own tree structure.

- The method .allkeys() returns a list of all the fields in the entire
rootfile data structure, down through all subtrees.

In [4]:
tree = file['Delphes']
print("Keys at top of tree structure")
for k in tree.keys():
    print(k)
print("\n total # of fields in rootfile: {}".format(len(tree.allkeys())))


Keys at top of tree structure
b'Event'
b'Event_size'
b'Particle'
b'Particle_size'
b'Track'
b'Track_size'
b'Tower'
b'Tower_size'
b'EFlowTrack'
b'EFlowTrack_size'
b'EFlowPhoton'
b'EFlowPhoton_size'
b'EFlowNeutralHadron'
b'EFlowNeutralHadron_size'
b'Jet'
b'Jet_size'
b'Electron'
b'Electron_size'
b'Photon'
b'Photon_size'
b'Muon'
b'Muon_size'
b'FatJet'
b'FatJet_size'
b'MissingET'
b'MissingET_size'
b'ScalarHT'
b'ScalarHT_size'

 total # of fields in rootfile: 299
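The difference between listing only the top level and listing every level can be mimicked with a toy nested dictionary. This is a sketch in plain Python, not the `uproot` implementation; `toy_tree`, `top_level_keys`, and `all_keys` are hypothetical names, and the toy structure is far smaller than the real tree:

```python
# Toy stand-in for a tree: each field maps to a dict of its subfields.
toy_tree = {
    "Jet": {"Jet.PT": {}, "Jet.Eta": {}, "Jet.Phi": {}},
    "Jet_size": {},
}


def top_level_keys(tree):
    """Names at the top of the structure only (analogous to .keys())."""
    return list(tree)


def all_keys(tree):
    """Names at every depth, parents before children (analogous to .allkeys())."""
    names = []
    for name, subtree in tree.items():
        names.append(name)
        names.extend(all_keys(subtree))
    return names


print(top_level_keys(toy_tree))
print(all_keys(toy_tree))
```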


### Jet subtree

- Because there are so many fields in this tree, we're going to focus on
the subtree `Jet` in this notebook.


In [5]:
jet = tree['Jet']



- You can see the list of fields associated with the `Jet` field by
calling .keys(), just as in the above cell.

- You can see more information about `Jet` by calling the .show() method
(`tree.show()` does the same for the whole tree). It displays:

    - the keys
    - the `Uproot` interpretation
    - the primitive data type (e.g. int, float, etc.)

In [6]:
jet.show()

Jet                        TStreamerInfo              asdtype('>i4')
Jet.fUniqueID              TStreamerBasicType         asjagged(asdtype('>u4'))
Jet.fBits                  TStreamerBasicType         asjagged(asdtype('>u4'))
Jet.PT                     TStreamerBasicType         asjagged(asdtype('>f4'))
Jet.Eta                    TStreamerBasicType         asjagged(asdtype('>f4'))
Jet.Phi                    TStreamerBasicType         asjagged(asdtype('>f4'))
Jet.T                      TStreamerBasicType         asjagged(asdtype('>f4'))
Jet.Mass                   TStreamerBasicType         asjagged(asdtype('>f4'))
Jet.DeltaEta               TStreamerBasicType         asjagged(asdtype('>f4'))
Jet.DeltaPhi               TStreamerBasicType         asjagged(asdtype('>f4'))
Jet.Flavor                 TStreamerBasicType         asjagged(asdtype('>u4'))
Jet.FlavorAlgo             TStreamerBasicType         asjagged(asdtype('>u4'))
Jet.FlavorPhys             TStreamerBasicType         asjagged

## Uproot interpretations

- ROOT data can be interpreted in different ways in Python (as opposed
to C++).

- A branch may be interpreted as:
    - a Numpy array,
    - an array of unequal length arrays,
    - an array of classes defined by ROOT streamers,
    - an array of custom classes,
    - a string,
    - etc.

- Currently, AwkwardNN supports training on 4 different interpretations:

    - a fixed data structure interpretation,

    - a jagged data structure interpretation,

    - an object data structure interpretation,

    - a nested data structure interpretation,

        - Note: in `uproot` you have to check if a field is nested by
        looking at its .keys() list of subfields, but in `uproot4`
        you could check in the same way as for the other interpretations.

In [7]:
jet_size_interp = type(tree['Jet_size'].interpretation)
print("Example of fixed interpretation: `Jet_size`: {}\n".format(jet_size_interp))

jet_mass_interp = type(tree['Jet.Mass'].interpretation)
print("Example of jagged interpretation: `Jet.Mass`: {}\n".format(jet_mass_interp))

jet_const_interp = type(tree['Jet.Constituents'].interpretation)
print("Example of object interpretation: `Jet.Constituents`: {}\n".format(jet_const_interp))

jet_interp = type(tree['Jet'].interpretation)
print("Example of nested interpretation: `Jet`: {}\n".format(jet_interp))
print("Note: in `uproot4`, this nested field would be interpreted "
      "as a \"group\".")


Example of fixed interpretation: `Jet_size`: <class 'uproot.interp.numerical.asdtype'>

Example of jagged interpretation: `Jet.Mass`: <class 'uproot.interp.jagged.asjagged'>

Example of object interpretation: `Jet.Constituents`: <class 'uproot.interp.objects.asgenobj'>

Example of nested interpretation: `Jet`: <class 'uproot.interp.numerical.asdtype'>

Note: in `uproot4`, this nested field would be interpreted as a "group".


### Fixed interpretation

- Each branch/column is a Numpy array.

- Each element in the Numpy array is a number that is associated to an event.


In [8]:
# Note: we are reading in data in the column manner as mentioned above.
# To read in a column, specify the column name, e.g.

jet_size = tree['Jet_size']

print("Type of fixed interpreted branch: {}".format(type(jet_size.array())))
print()
for i in range(3):
    print("Event {}: {}".format(i+1, jet_size.array()[i]))


Type of fixed interpreted branch: <class 'numpy.ndarray'>

Event 1: 4
Event 2: 2
Event 3: 4


### Jagged interpretation

- Each branch/column is a jagged array.

- Each element in the jagged array is also an array that is associated to an
 event. These subarrays may have either numbers or more arrays as elements.

- Each subarray has a different length, hence the name jagged.

- The elements of subarrays are associated with the objects that make up the event.

    - So if I am looking at a jagged interpreted branch, "Jet.Mass" for
    example, then each element in a subarray is associated with a different
    jet within the event.

- Across different fields within the same subtree, the subarrays for a given
event have the same length, up to a fixed per-field multiple.

    - E.g., the first event in jet mass, `tree['Jet.Mass'].array()[0]`, has
    the same size as the first event in jet charge, `tree['Jet.Charge'].array()[0]`.

    - E.g. if an event has 4 jets, then an event in jagged subbranch can have
    length 4 (Jet.Mass) or say length 20 (Jet.Tau[5]), where the first
    5 points are associated to the first jet, the next 5 to second jet, and so on.


- Note: fields from different subtrees do not have the same length. This is why
they cannot be trained within the same AwkwardNN block.

    - E.g. the first event in jet mass, `tree['Jet.Mass'].array()[0]`, is not
    the same size as the first event in particle energy,
    `tree['Particle.E'].array()[0]`.

In [9]:
jet_mass = tree['Jet.Mass']
print("Type of jagged interpreted branch: {}".format(type(jet_mass.array())))
print()
for i in range(3):
    print("Event {}: {}".format(i+1, jet_mass.array()[i]))

Type of jagged interpreted branch: <class 'awkward.array.jagged.JaggedArray'>

Event 1: [206.97768  179.3755    10.840576  14.236657]
Event 2: [180.59798 133.46812]
Event 3: [216.01949   167.80676     8.562021    7.6187215]
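The `Jet.Tau[5]` example above, where an event with 4 jets has a subarray of length 20, can be sketched in plain Python. The helper `split_per_object` is hypothetical and the values are placeholders; the sketch only assumes that the values are laid out jet by jet, as described in the bullet:

```python
def split_per_object(flat_values, n_objects):
    """Split one event's flat values into equal chunks, one chunk per object."""
    if n_objects == 0:
        return []
    if len(flat_values) % n_objects != 0:
        raise ValueError("length is not a multiple of the number of objects")
    width = len(flat_values) // n_objects
    return [flat_values[i * width:(i + 1) * width] for i in range(n_objects)]


# Placeholder event: 4 jets with 5 values per jet, 20 values in total.
tau_event = list(range(20))
per_jet = split_per_object(tau_event, 4)
print(per_jet)
```

The first chunk belongs to the first jet, the second chunk to the second jet, and so on, which is what lets such a field be lined up with a length-4 field like `Jet.Mass`.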


### Object interpretation

- Each branch/column is an object array.

- Each element in the object array is also an array that is associated to an
 event. Furthermore, these subarrays have arrays as elements.

- Like the jagged interpreted branches, each event/subarray has a different length.

- The elements of events/subarrays are associated with the objects that make up
the event.

    - So if I am looking at an object interpreted branch, "Jet.Particles" for
    example, then each element in an event/subarray is also an array that is
    associated with a particle jet within the event.

    - Each of these second nested arrays can have different lengths.


In [10]:
jet_const = tree['Jet.Constituents']

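# Note: this line inspects `jet_mass`, so the type printed below is that of the
# jagged branch; use `jet_const.array()` to inspect the object branch itself.
print("Type of object interpreted branch: {}".format(type(jet_mass.array())))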
print()
for i in range(3):
    print("Event {}: number of jets = {}".format(i+1, len(jet_const.array()[i])))
    for j, k in enumerate(jet_const.array()[i]):
        print("Jet {} has length {}".format(j+1, len(k)))
    print()


Type of object interpreted branch: <class 'awkward.array.jagged.JaggedArray'>

Event 1: number of jets = 4
Jet 1 has length 48
Jet 2 has length 68
Jet 3 has length 10
Jet 4 has length 11

Event 2: number of jets = 2
Jet 1 has length 60
Jet 2 has length 61

Event 3: number of jets = 4
Jet 1 has length 68
Jet 2 has length 83
Jet 3 has length 8
Jet 4 has length 9





### Nested interpretation

- `uproot` doesn't assign a special interpretation to nested fields.

- To check whether or not a field is a subtree, i.e. is nested, you
have to check the length of the list that .keys() returns.

- However, `uproot4` does interpret nested fields as "Groups".

In [11]:
nested_key_length = len(jet.keys())
non_nested_key_length = len(jet_mass.keys())
print("Length of .keys() for nested field \"jet\": {}".format(nested_key_length))
print("Length of .keys() for non-nested field \"jet_mass\": {}".format(non_nested_key_length))

Length of .keys() for nested field "jet": 35
Length of .keys() for non-nested field "jet_mass": 0
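Putting the four cases together, one possible way to sort a field into "nested", "fixed", "jagged", or "object" is sketched below. This is an assumption about how such a check could look, not AwkwardNN's actual code. `FakeBranch` and the empty `asdtype`/`asjagged`/`asgenobj` classes are stand-ins so the sketch runs without a Rootfile; only the class names are taken from the interpretation output printed earlier.

```python
class FakeBranch:
    """Stand-in for a branch: exposes .keys() and .interpretation only."""

    def __init__(self, subfield_names, interpretation):
        self._subfield_names = subfield_names
        self.interpretation = interpretation

    def keys(self):
        return list(self._subfield_names)


# Empty stand-ins named after the interpretation classes shown earlier.
class asdtype:
    pass


class asjagged:
    pass


class asgenobj:
    pass


CATEGORY_BY_NAME = {"asdtype": "fixed", "asjagged": "jagged", "asgenobj": "object"}


def classify(branch):
    """Return 'nested', 'fixed', 'jagged', 'object', or 'unknown'."""
    # Check subfields first: a nested field such as `Jet` still reports a
    # numerical interpretation, so the interpretation alone is not enough.
    if len(branch.keys()) > 0:
        return "nested"
    name = type(branch.interpretation).__name__
    return CATEGORY_BY_NAME.get(name, "unknown")


print(classify(FakeBranch(["Jet.PT", "Jet.Eta"], asdtype())))
print(classify(FakeBranch([], asjagged())))
```

Checking `.keys()` before the interpretation matters because `Jet` and `Jet_size` both print as `asdtype` above, yet only `Jet` has subfields.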


## Primitives

- Each field, regardless of its interpretation, has an associated
primitive data type. This is again visible in the output of the
.show() function above.

In [12]:
jet_charge = tree['Jet.Charge']
prim0 = jet_charge.interpretation.type.to
prim1 = jet_mass.interpretation.type.to
jet_pruned = tree['Jet.PrunedP4[5]']
prim2 = jet_pruned.interpretation.type.to
muon_particle = tree['Muon.Particle']
prim3 = muon_particle.interpretation.type.to

print("Jet charge primitive: {}".format(prim0))
print("Jet mass primitive: {}".format(prim1))
print("Jet pruned primitive: {}".format(prim2))
print("Muon Particle primitive: {}".format(prim3))

Jet charge primitive: int32
Jet mass primitive: float32
Jet pruned primitive: <class 'uproot_methods.classes.TLorentzVector.Methods'>
Muon Particle primitive: <class 'uproot.rootio.TRef'>


- Most of the primitives are numerical types: `int`, `float`, etc.

- But some of them are custom types specific to particle physics:
    - `TLorentzVector`, and
    - `TRef`

- Because a neural network can't process these custom types, I call one of
their methods to convert them to numerical types. This, though, is rather
ad hoc, just so I can train them.

    - Question: How should I be dealing with this?

    - Perhaps, add a field in the yaml file so the user can specify what to do?
    Or something else?

In [13]:
print("Custom data types before being modified")
print(jet_pruned.array()[0])
print(muon_particle.array()[275])
print("\n Custom data types are being modified")
print(jet_pruned.array().E[0])
print(muon_particle.array().id[275])


Custom data types before being modified
[TLorentzVector(x=0, y=0, z=0, t=0) TLorentzVector(x=0, y=0, z=0, t=0) TLorentzVector(x=0, y=0, z=0, t=0) ... TLorentzVector(x=0, y=0, z=0, t=0) TLorentzVector(x=0, y=0, z=0, t=0) TLorentzVector(x=0, y=0, z=0, t=0)]
[<TRef 424>]

 Custom data types are being modified
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[424]
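One way to make the conversion above less ad hoc is an explicit table that says which attribute turns each custom type into a number. The sketch below is only an assumption about how that could be organized. `FakeLorentzVector` and `FakeRef` are stand-ins for the real classes so it runs without `uproot`, and the stand-in's `E` simply returns `t`. The attribute names `E` and `id` are the ones used in the cell above.

```python
from dataclasses import dataclass


@dataclass
class FakeLorentzVector:
    """Stand-in for a Lorentz vector; E is assumed to be the t component."""
    x: float
    y: float
    z: float
    t: float

    @property
    def E(self):
        return self.t


@dataclass
class FakeRef:
    """Stand-in for a reference object carrying an integer id."""
    id: int


# Which attribute converts each custom type to a number. Hard-coded here;
# in practice the mapping could come from user configuration.
NUMERIC_ATTRIBUTE = {FakeLorentzVector: "E", FakeRef: "id"}


def to_number(value):
    """Return numeric values unchanged, else read the configured attribute."""
    if isinstance(value, (int, float)):
        return value
    return getattr(value, NUMERIC_ATTRIBUTE[type(value)])


print(to_number(FakeRef(424)))
print(to_number(FakeLorentzVector(0.0, 0.0, 0.0, 0.0)))
```

A type missing from the table raises `KeyError`, which surfaces unsupported types instead of silently passing them to the network.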


## Processing data

- It makes more sense for a neural network to process a single event
at a time, not a single field/branch at a time. That is, it feeds
forward rows, not columns.

- So, given a set of fields, I process the data from those fields such
that it is organized as rows instead of columns.

In [14]:
# Example

fields = ['Jet.PT', 'Jet.Eta', 'Jet.Phi']

print("before processing")
for col_dict in tree.iterate(branches=fields, entrysteps=3, namedecode='ascii'):
    for k, v in col_dict.items():
        print("{}: {}".format(k, v))
    break
print()

print("after processing")
from awkwardNN.utils.dataset_utils_uproot import get_events_from_tree
data = get_events_from_tree(tree, fields)
for i in range(3):
    print("Event {}: {}\n".format(i+1, data[i]))


before processing
Jet.PT: [[827.8882 769.4987 97.594574 70.50537] [936.3517 898.72076] [918.127 844.6479 21.60765 20.497742]]
Jet.Eta: [[0.27329388 -0.33246595 0.49002475 3.0309205] [-0.15793906 0.4833028] [0.70078456 0.37452587 3.4654589 0.18378289]]
Jet.Phi: [[0.88772887 -2.410344 -0.22583866 -1.910816] [1.7673397 -1.3502347] [3.0060477 -0.14007255 -2.3441007 -1.2875067]]

after processing
Event 1: tensor([[[ 8.2789e+02,  2.7329e-01,  8.8773e-01]],

        [[ 7.6950e+02, -3.3247e-01, -2.4103e+00]],

        [[ 9.7595e+01,  4.9002e-01, -2.2584e-01]],

        [[ 7.0505e+01,  3.0309e+00, -1.9108e+00]]])

Event 2: tensor([[[ 9.3635e+02, -1.5794e-01,  1.7673e+00]],

        [[ 8.9872e+02,  4.8330e-01, -1.3502e+00]]])

Event 3: tensor([[[ 9.1813e+02,  7.0078e-01,  3.0060e+00]],

        [[ 8.4465e+02,  3.7453e-01, -1.4007e-01]],

        [[ 2.1608e+01,  3.4655e+00, -2.3441e+00]],

        [[ 2.0498e+01,  1.8378e-01, -1.2875e+00]]])



- Assumption: within the context of an event, the i^th index of a subfield,
say `Jet.PT`, is associated with the i^th subobject, in this case jet, of
an event.

    - That is, the elements in fields are ordered, as opposed to being
    unordered.

    - E.g. if there are 4 jets in an event, then the first element in both
    `Jet.PT` and `Jet.Mass` are associated with the same jet and should
    therefore be trained together.

- So when training an RNN or LSTM, it will process each event by processing
each subobject (e.g. jet or particle) in an event sequentially.
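The column-to-row reorganization, together with the ordering assumption, can be sketched in plain Python. This is not the implementation of `get_events_from_tree`; `columns_to_events` is a hypothetical helper that returns nested lists rather than tensors, and the numbers are rounded from the first two events printed above.

```python
def columns_to_events(columns, fields):
    """Turn {field: per-event lists of per-object values} into per-event rows.

    Each event becomes a list with one feature vector per object, and each
    feature vector follows the order of `fields`.
    """
    n_events = len(columns[fields[0]])
    events = []
    for i in range(n_events):
        per_field = [columns[f][i] for f in fields]
        # zip pairs the j-th value of every field: one tuple per object.
        events.append([list(obj) for obj in zip(*per_field)])
    return events


columns = {
    "Jet.PT": [[827.89, 769.50, 97.59, 70.51], [936.35, 898.72]],
    "Jet.Eta": [[0.273, -0.332, 0.490, 3.031], [-0.158, 0.483]],
    "Jet.Phi": [[0.888, -2.410, -0.226, -1.911], [1.767, -1.350]],
}
events = columns_to_events(columns, ["Jet.PT", "Jet.Eta", "Jet.Phi"])
for i, event in enumerate(events):
    print("Event {}: {}".format(i + 1, event))
```

The `zip` step is where the ordering assumption enters: it only makes sense if the j-th entry of every field in an event refers to the same jet.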


## Note: bad keys

- Sometimes `uproot` can't read in some fields because of an error related
to how data is read in (I think).

- The fields that return errors in the rootfiles `test_qcd_1000.root` and
`test_ttbar_1000.root` are:
    - `EFlowNeutralHadron.fBits`,
    - `EFlowPhoton.fBits`,
    - `EFlowTrack.fBits`,
    - `Muon.fBits`,
    - `Particle.fBits`,
    - `Tower.fBits`, and
    - `Track.fBits`.

- I filter these keys out during the yaml file processing.


In [15]:
# Example of error
tree["EFlowNeutralHadron.fBits"].array()

AssertionError: 
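A minimal sketch of the filtering step follows, assuming the bad keys are known in advance. The helper names `drop_bad_keys` and `find_unreadable` are hypothetical and do not show how the yaml processing is actually written. The requested field list is illustrative.

```python
# Fields listed above as unreadable in the two test files.
BAD_KEYS = frozenset([
    "EFlowNeutralHadron.fBits",
    "EFlowPhoton.fBits",
    "EFlowTrack.fBits",
    "Muon.fBits",
    "Particle.fBits",
    "Tower.fBits",
    "Track.fBits",
])


def drop_bad_keys(fields, bad_keys=BAD_KEYS):
    """Keep the requested fields, in order, skipping any known-bad ones."""
    return [f for f in fields if f not in bad_keys]


def find_unreadable(fields, read):
    """Return the fields for which read(field) raises an AssertionError."""
    unreadable = []
    for field in fields:
        try:
            read(field)
        except AssertionError:
            unreadable.append(field)
    return unreadable


requested = ["Muon.PT", "Muon.fBits", "Jet.fBits"]
print(drop_bad_keys(requested))
```

With a real tree, `read` could be something like `lambda f: tree[f].array()`, which would let the list of bad keys be discovered per file instead of hard-coded.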