# Rootfiles and uproot

- Rootfiles contain particle physics data organized in a
columnar layout

- Specifically, a Rootfile dataset consists of events, and each event
consists of fields of information.
The events correspond to rows and the fields correspond
to columns.
Because the data is stored column by column, you access it by
reading whole columns/fields from the dataset.
Rows (individual events) are not stored as units, so you get a single
event by reading the relevant columns and indexing into them.


- `Uproot` is a Python library that reads Rootfiles directly,
without depending on ROOT

- `Uproot4` is currently being developed to replace `Uproot`, but it has
some issues dealing with certain characters in the column names
(hereafter referred to as field names). The same general ideas
apply to `Uproot4` as to `Uproot`.

    - Note: `Uproot4` does make reading in data easier, and I have already
    written some code to transition to uproot4/awkward1 easily once the bugs
    are fixed.

- This notebook will focus on Uproot, not Uproot4
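
The column-versus-row layout described above can be sketched with plain Python lists. This is only an illustration, not how `uproot` stores data internally; the field names `n_jets` and `missing_et`, the values, and the helper `get_event` are all made up.

```python
# Toy column-oriented dataset: one list per field, one entry per event.
# Field names and values are invented for illustration.
columns = {
    "n_jets": [4, 2, 4],
    "missing_et": [35.2, 12.8, 60.1],
}


def get_event(columns, index):
    """Rebuild one row (event) by indexing every column at the same position."""
    return {name: values[index] for name, values in columns.items()}


# Reading a column is a single lookup.
n_jets = columns["n_jets"]

# Reading a row means touching every column.
first_event = get_event(columns, 0)

print(n_jets)
print(first_event)
```

Reading one field is cheap in this layout, while reconstructing one event requires a lookup in every column.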


In [2]:
import uproot


Opening rootfiles returns a `ROOTDirectory` object


In [3]:
file = uproot.open("../data/test_ttbar_1000.root")
print(type(file))

<class 'uproot.rootio.ROOTDirectory'>


The `ROOTDirectory` should contain a `TTree` object, which contains all of the
data we're interested in.

In [4]:
for k, v in file.items():
    print("{} {}".format(k, v))


b'ProcessID0;1' <TProcessID b'ProcessID0' at 0x000107ef2438>
b'Delphes;1' <TTree b'Delphes' at 0x000107ef2518>


## TTree

- The method .keys() returns a list of all the fields at the top of the
rootfile data structure

- Some of these fields are subtrees, which contain subfields within their
tree structure.

- The method .allkeys() returns a list of all the fields in the entire
rootfile data structure, down through all subtrees
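
The difference between a top-level listing and a recursive one can be illustrated on a toy nested dictionary. The structure `mock_tree` and the helpers `top_level_keys` and `all_keys` are hypothetical stand-ins, not part of `uproot`, and the ordering of the real `.allkeys()` may differ.

```python
# Hypothetical stand-in for a tree: each key maps to a dict of subfields,
# and a field without subfields maps to an empty dict.
mock_tree = {
    "Jet": {"Jet.PT": {}, "Jet.Eta": {}, "Jet.Phi": {}},
    "Jet_size": {},
    "Muon": {"Muon.PT": {}},
    "Muon_size": {},
}


def top_level_keys(tree):
    """List only the fields at the top of the structure."""
    return list(tree)


def all_keys(tree):
    """List every field, descending through all subtrees."""
    found = []
    for name, subtree in tree.items():
        found.append(name)
        found.extend(all_keys(subtree))
    return found


print(top_level_keys(mock_tree))
print(len(all_keys(mock_tree)))
```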

In [5]:
tree = file['Delphes']
for k in tree.keys():
    print(k)
print("\n total # of fields in rootfile: {}".format(len(tree.allkeys())))


b'Event'
b'Event_size'
b'Particle'
b'Particle_size'
b'Track'
b'Track_size'
b'Tower'
b'Tower_size'
b'EFlowTrack'
b'EFlowTrack_size'
b'EFlowPhoton'
b'EFlowPhoton_size'
b'EFlowNeutralHadron'
b'EFlowNeutralHadron_size'
b'Jet'
b'Jet_size'
b'Electron'
b'Electron_size'
b'Photon'
b'Photon_size'
b'Muon'
b'Muon_size'
b'FatJet'
b'FatJet_size'
b'MissingET'
b'MissingET_size'
b'ScalarHT'
b'ScalarHT_size'

 total # of fields in rootfile: 299


Because there are so many fields in this tree, we're going to focus on
the subtree `Jet` in this notebook


In [6]:
jet = tree['Jet']


- You can list the fields associated with the `Jet` field by
calling `.keys()`.

- You can see more information, including the type of primitive stored in
each field, by calling the `.show()` method (this also works on the
whole tree, via `tree.show()`).

In [7]:
jet.show()

Jet                        TStreamerInfo              asdtype('>i4')
Jet.fUniqueID              TStreamerBasicType         asjagged(asdtype('>u4'))
Jet.fBits                  TStreamerBasicType         asjagged(asdtype('>u4'))
Jet.PT                     TStreamerBasicType         asjagged(asdtype('>f4'))
Jet.Eta                    TStreamerBasicType         asjagged(asdtype('>f4'))
Jet.Phi                    TStreamerBasicType         asjagged(asdtype('>f4'))
Jet.T                      TStreamerBasicType         asjagged(asdtype('>f4'))
Jet.Mass                   TStreamerBasicType         asjagged(asdtype('>f4'))
Jet.DeltaEta               TStreamerBasicType         asjagged(asdtype('>f4'))
Jet.DeltaPhi               TStreamerBasicType         asjagged(asdtype('>f4'))
Jet.Flavor                 TStreamerBasicType         asjagged(asdtype('>u4'))
Jet.FlavorAlgo             TStreamerBasicType         asjagged(asdtype('>u4'))
Jet.FlavorPhys             TStreamerBasicType         asjagged

## Uproot interpretations

- The right-hand column shows that a field can have one of several
combinations of datatypes

- As I see it (based on the examples I have seen), there are
4 different types of data structure interpretation that a field
can have:

    - a fixed data structure interpretation

    - a jagged data structure interpretation

    - an object data structure interpretation

    - a nested data structure interpretation

        - Note: in `uproot` you have to check if a field is nested by
        looking at its .keys() list of subfields, but in `uproot4`
        you could check in the same way as for the other interpretations

In [8]:
jet_size_interp = type(tree['Jet_size'].interpretation)
print("Example of fixed interpretation: `Jet_size`: {}\n".format(jet_size_interp))

jet_mass_interp = type(tree['Jet.Mass'].interpretation)
print("Example of jagged interpretation: `Jet.Mass`: {}\n".format(jet_mass_interp))

jet_const_interp = type(tree['Jet.Constituents'].interpretation)
print("Example of object interpretation: `Jet.Constituents`: {}\n".format(jet_const_interp))

jet_interp = type(tree['Jet'].interpretation)
print("Example of nested interpretation: `Jet`: {}\n".format(jet_interp))


Example of fixed interpretation: `Jet_size`: <class 'uproot.interp.numerical.asdtype'>

Example of jagged interpretation: `Jet.Mass`: <class 'uproot.interp.jagged.asjagged'>

Example of object interpretation: `Jet.Constituents`: <class 'uproot.interp.objects.asgenobj'>

Example of nested interpretation: `Jet`: <class 'uproot.interp.numerical.asdtype'>
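
The four cases printed above can be folded into one small helper. This is a minimal sketch of the classification described in this section: `classify_field` is a hypothetical function, it works on the interpretation class name as a string rather than on a real branch, and anything it does not recognize is reported as `"unknown"`.

```python
def classify_field(interpretation_name, num_subfields):
    """Map an interpretation class name and a subfield count to a category."""
    # Nested fields are detected by their subfields, not by their interpretation.
    if num_subfields > 0:
        return "nested"
    if interpretation_name == "asdtype":
        return "fixed"
    if interpretation_name == "asjagged":
        return "jagged"
    if interpretation_name == "asgenobj":
        return "object"
    return "unknown"


print(classify_field("asdtype", 0))    # e.g. Jet_size
print(classify_field("asjagged", 0))   # e.g. Jet.Mass
print(classify_field("asgenobj", 0))   # e.g. Jet.Constituents
print(classify_field("asdtype", 35))   # e.g. Jet
```

With a real branch, the two arguments would presumably come from `type(branch.interpretation).__name__` and `len(branch.keys())`, both of which are used elsewhere in this notebook.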



In [9]:
# Note: we are reading in data in the column manner as mentioned above.
# To read in a column, specify the column name, e.g.

jet_size = tree['Jet_size']

# Can check the number of events by checking the length of the column

print("# of events: {}".format(len(jet_size)))

# of events: 1000


### Fixed interpretation

- Each event in a fixed column is associated with a single number.

- It is called fixed because each event has a fixed number of data points associated
with it; in this case, 1.

In [10]:
jet_size = tree['Jet_size']
print(type(jet_size.array()))
for i in jet_size.array()[:3]:
    print(i)

<class 'numpy.ndarray'>
4
2
4


### Jagged interpretation

- Each event in a jagged column is associated with an array.

- It is called jagged because each event has a variable number of
data points associated with it.

- The interpretation of these fields is that each event has a variable number of,
say, jets, and that each jet in each event has an associated data point in the
jagged array.

- Note: across different fields within the same subtree, the jagged arrays for a given
event have the same size. For example, the first event in jet mass, `tree['Jet.Mass'].array()[0]`,
has the same size as the first event in jet charge, `tree['Jet.Charge'].array()[0]`.

    - Caveat: this is not strictly true. Across different fields within a
    subtree, the array length for an event can be an integer multiple of the number of objects in that event.
    For example, if an event has 4 jets, then a jagged field can have length 4
    (`Jet.Mass`) or, say, length 20 (`Jet.Tau[5]`), where the first 5 points are associated
    with the first jet, the next 5 with the second jet, and so on.

- Note: however, across different fields from different subtrees, this is not the case.
E.g. the first event in jet mass, `tree['Jet.Mass'].array()[0]`, is not the same size
as the first event in particle energy, `tree['Particle.E'].array()[0]`.
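
The caveat about integer multiples can be made concrete by splitting a flat per-event array into one chunk per jet. This is a sketch with made-up values; `split_per_object` is a hypothetical helper, and it assumes the values for each jet are stored contiguously, as described above.

```python
def split_per_object(flat_values, num_objects):
    """Split a flat per-event array into equal chunks, one chunk per object."""
    if num_objects == 0:
        return []
    if len(flat_values) % num_objects != 0:
        raise ValueError("length is not an integer multiple of the object count")
    width = len(flat_values) // num_objects
    return [flat_values[i * width:(i + 1) * width] for i in range(num_objects)]


# Made-up event: 4 jets with 5 numbers each, stored as one length-20 array.
flat = list(range(20))
per_jet = split_per_object(flat, 4)

print(per_jet[0])  # values belonging to the first jet
print(per_jet[1])  # values belonging to the second jet
```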

In [11]:
jet_mass = tree['Jet.Mass']
print(type(jet_mass.array()))
for i in jet_mass.array()[:3]:
    print(i)

<class 'awkward.array.jagged.JaggedArray'>
[206.97768  179.3755    10.840576  14.236657]
[180.59798 133.46812]
[216.01949   167.80676     8.562021    7.6187215]


### Object interpretation

- Each event in an object column is associated with an array.

- Like jagged arrays, object arrays have a varying length that is related to, say, the
number of jets in each event.

- Unlike jagged interpretations, object interpretations seem to have two levels of
variation: for example, each of the four jets in the first event has its own
array of varying length.

In [12]:
jet_const = tree['Jet.Constituents']
print(type(jet_const.array()))
for i in jet_const.array()[:3]:
    for j in i:
        print(j)
    print()

<class 'awkward.array.objects.ObjectArray'>
[1398, 1428, 1275, 1427, 1413, 1268, 1452, 1260, 1443, 1441, 1448, 1246, 1442, 1250, 1439, 1446, 1254, 1444, 1445, 1415, 1421, 1227, 1230, 1419, 1417, 1430, 1393, 1207, 1396, 1391, 1395, 1411, 1400, 1394, 1224, 1222, 1403, 1219, 1205, 1402, 1406, 1212, 1210, 1404, 1217, 1408, 1407, 1215]
[1324, 1322, 1321, 1334, 1464, 1319, 1326, 1389, 1381, 1385, 1379, 1380, 1179, 1383, 1384, 1183, 1100, 1197, 1328, 1102, 1337, 1330, 1113, 1346, 1366, 1387, 1163, 1194, 1167, 1172, 1159, 1169, 1118, 1364, 1150, 1121, 1123, 1342, 1340, 1133, 1135, 1339, 1141, 1131, 1349, 1335, 1138, 1344, 1143, 1351, 1377, 1187, 1372, 1376, 1371, 1153, 1347, 1357, 1368, 1370, 1374, 1359, 1353, 1355, 1157, 1361, 1360, 1362]
[1454, 1266, 1270, 1450, 1423, 1433, 1425, 1437, 1432, 1435]
[1286, 1488, 1471, 1469, 1485, 1473, 1477, 1290, 1475, 1481, 1483]

[1375, 1373, 1236, 1326, 1332, 1322, 1288, 1324, 1128, 1327, 1320, 1328, 1323, 1314, 1309, 1286, 1301, 1316, 1094, 1312, 1298, 10



### Nested interpretation

- As mentioned above, `uproot` doesn't assign a special interpretation to nested fields.
Instead, you have to check the length of the list that `.keys()` returns: it is nonzero
for a nested field.

- However, `uproot4` does interpret nested fields as `Groups`.

In [13]:
nested_key_length = len(jet.keys())
non_nested_key_length = len(jet_mass.keys())
print("Length of .keys() for nested field: {}".format(nested_key_length))
print("Length of .keys() for non-nested field: {}".format(non_nested_key_length))

Length of .keys() for nested field: 35
Length of .keys() for non-nested field: 0


## Primitives

- Each field, whether it is fixed, jagged, or object, has an associated
primitive data type. This is again visible in the output of `.show()` above.

In [22]:
prim1 = jet_mass.interpretation.type.to
jet_pruned = tree['Jet.PrunedP4[5]']
prim2 = jet_pruned.interpretation.type.to
muon_particle = tree['Muon.Particle']
prim3 = muon_particle.interpretation.type.to

print("Jet mass primitive: {}".format(prim1))
print("Jet pruned primitive: {}".format(prim2))
print("Muon Particle primitive: {}".format(prim3))

Jet mass primitive: float32
Jet pruned primitive: <class 'uproot_methods.classes.TLorentzVector.Methods'>
Muon Particle primitive: <class 'uproot.rootio.TRef'>


- Most of the primitives are numerical types: `int`, `float`, etc.

- But some of them are custom types specific to particle physics: `TLorentzVector`
and `TRef` are the only two I have come across so far. These custom types
are not really primitives; they are objects with methods and fields you can call.

- Because a neural network can't process these custom types, I call one of their methods
to convert them to numerical types. This is rather ad hoc, though, and is done just
so I can train on them.

    - Question: How should I be dealing with this?

    - Perhaps, add a field in the yaml file so the user can specify what to do?
    Or something else?
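
One possible shape for the user-specified option is a lookup table from type name to a numeric extractor. The sketch below is only an illustration of that idea, not a recommendation or an existing feature: `MockLorentzVector`, `MockRef`, `CONVERTERS`, and `to_number` are all hypothetical, and the mock classes merely imitate the `.E` and `.id` attributes used in this notebook.

```python
from dataclasses import dataclass


@dataclass
class MockLorentzVector:
    """Hypothetical stand-in for a four-vector object with an energy attribute."""
    E: float


@dataclass
class MockRef:
    """Hypothetical stand-in for a reference object with an integer id."""
    id: int


# One numeric extractor per custom type; this table could be filled from a config file.
CONVERTERS = {
    "MockLorentzVector": lambda obj: float(obj.E),
    "MockRef": lambda obj: float(obj.id),
}


def to_number(value):
    """Return plain numbers as floats and convert known custom objects."""
    if isinstance(value, (int, float)):
        return float(value)
    converter = CONVERTERS.get(type(value).__name__)
    if converter is None:
        raise TypeError("no converter registered for {}".format(type(value).__name__))
    return converter(value)


print(to_number(MockRef(424)))
print(to_number(MockLorentzVector(0.0)))
```

Keeping the choice of extractor in one table makes the ad hoc conversion explicit and easy to change per field type.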

In [29]:
print(jet_pruned.array()[0])
print(muon_particle.array()[275])
print()
print(jet_pruned.array().E[0])
print(muon_particle.array().id[275])


[TLorentzVector(x=0, y=0, z=0, t=0) TLorentzVector(x=0, y=0, z=0, t=0) TLorentzVector(x=0, y=0, z=0, t=0) ... TLorentzVector(x=0, y=0, z=0, t=0) TLorentzVector(x=0, y=0, z=0, t=0) TLorentzVector(x=0, y=0, z=0, t=0)]
[<TRef 424>]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[424]


## Converting to something a neural network can read

- `uproot` reads in data organized in columns.

- But a neural network would presumably process a single event at a time, not a single
field at a time. That is, it would process rows, not columns.

- So, given a set of fields that are to be used as inputs in a neural network,
I process the data from those fields such that it is organized as rows instead
of columns.

In [40]:
# Example

fields = ['Jet.PT', 'Jet.Eta', 'Jet.Phi']

print("before processing")
for col_dict in tree.iterate(branches=fields, entrysteps=2, namedecode='ascii'):
    for k, v in col_dict.items():
        print("{}: {}".format(k, v))
    break
print()

print("after processing")
from awkwardNN.utils.dataset_utils_uproot import get_events_from_tree
data = get_events_from_tree(tree, fields)
for i in range(2):
    print("Event {}: {}\n".format(i+1, data[i]))


before processing
Jet.PT: [[827.8882 769.4987 97.594574 70.50537] [936.3517 898.72076]]
Jet.Eta: [[0.27329388 -0.33246595 0.49002475 3.0309205] [-0.15793906 0.4833028]]
Jet.Phi: [[0.88772887 -2.410344 -0.22583866 -1.910816] [1.7673397 -1.3502347]]

after processing
Event 1: [[ 8.2788818e+02  2.7329388e-01  8.8772887e-01]
 [ 7.6949872e+02 -3.3246595e-01 -2.4103439e+00]
 [ 9.7594574e+01  4.9002475e-01 -2.2583866e-01]
 [ 7.0505371e+01  3.0309205e+00 -1.9108160e+00]]

Event 2: [[ 9.3635168e+02 -1.5793906e-01  1.7673397e+00]
 [ 8.9872076e+02  4.8330280e-01 -1.3502347e+00]]
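
The reshaping shown above can be sketched with plain lists. This is not the implementation of `get_events_from_tree`; `columns_to_events` is a hypothetical helper that only illustrates the column-to-row transposition, using values rounded from the output above.

```python
def columns_to_events(columns, fields):
    """Turn per-field jagged columns into per-event lists of per-object rows."""
    num_events = len(columns[fields[0]])
    events = []
    for event_index in range(num_events):
        per_field = [columns[name][event_index] for name in fields]
        # zip pairs up the i-th element of every field, giving one row per object.
        events.append([list(row) for row in zip(*per_field)])
    return events


# Values rounded from the first two events printed above.
columns = {
    "Jet.PT": [[827.9, 769.5, 97.6, 70.5], [936.4, 898.7]],
    "Jet.Eta": [[0.27, -0.33, 0.49, 3.03], [-0.16, 0.48]],
    "Jet.Phi": [[0.89, -2.41, -0.23, -1.91], [1.77, -1.35]],
}
fields = ["Jet.PT", "Jet.Eta", "Jet.Phi"]
events = columns_to_events(columns, fields)

for number, event in enumerate(events, start=1):
    print("Event {}: {}".format(number, event))
```

Each event becomes a list with one row per jet and one entry per field, which matches the shape of the processed output above.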



- Assumption: the $i$th element of a field, say `Jet.PT`, in a given event is
associated with the $i$th element of `Jet.Eta` for the same event.
That is to say, if there are 4 jets in an event, then each jagged field
for that event will be of length 4, and I assume that, for every field,
the first element of the jagged array corresponds to the same first jet,
the second element to the same second jet, and so on.

    - The alternative would be that each element of a jagged array belongs
    to one of the 4 jets, but that we can't tell which jet because of how the data
    was originally processed.

- Then, when this goes through a neural network, say an RNN or LSTM, it will
process each event sequentially by processing each jet or each particle
in an event.

- Note: The way I organize events and fields means that fixed fields, jagged fields,
and object fields should not be trained together on a single neural network.

- Note: fields from different subtrees shouldn't be trained together on
the same neural network because there is no natural way to preprocess them in the same
manner as the above example. E.g. the $i$th event in the `Particle` subtree may have 3
particles with 6 fields, but the same event in the `Jet` subtree may have 5 jets with
4 fields. It's not obvious to me how to combine these in a way that a neural network
can process, especially if the fields are different sizes.


## Note: bad keys

- Sometimes `uproot` can't read in some fields because of an error related
to the way the data is organized in memory (I think).

- The fields that return errors in the rootfiles `test_qcd_1000.root` and
`test_ttbar_1000.root` are: `EFlowNeutralHadron.fBits`, `EFlowPhoton.fBits`,
`EFlowTrack.fBits`, `Muon.fBits`, `Particle.fBits`, `Tower.fBits`, `Track.fBits`

- These keys are listed in the yaml file and are filtered out during the
yaml file processing.
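
Filtering out the bad keys can be sketched as a simple set lookup. This is an assumption about the general approach, not the actual yaml processing code; `BAD_KEYS` copies the field names listed above, while `drop_bad_keys` and the `requested` list are hypothetical.

```python
# Field names that raised errors in the two test files, as listed above.
BAD_KEYS = {
    "EFlowNeutralHadron.fBits",
    "EFlowPhoton.fBits",
    "EFlowTrack.fBits",
    "Muon.fBits",
    "Particle.fBits",
    "Tower.fBits",
    "Track.fBits",
}


def drop_bad_keys(fields, bad_keys=BAD_KEYS):
    """Return the requested fields in order, skipping any known-bad ones."""
    return [name for name in fields if name not in bad_keys]


# Example request mixing readable and unreadable fields.
requested = ["Muon.PT", "Muon.fBits", "Jet.PT", "Track.fBits"]
print(drop_bad_keys(requested))
```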


In [42]:
tree["EFlowNeutralHadron.fBits"].array()

AssertionError: 