<br><br><br><br><br>

# Advanced Uproot

<br><br><br><br><br>

<br><br>

## Cache management

<br>

**Uproot does not automatically cache arrays.** (Remote backends cache raw bytes, but that's different.)

  * **Disadvantage:** unless you opt into caching, uproot reads and decompresses the data every time you ask for it.
  * **Advantage:** you control how much memory your process uses.

<br>

In this sense and others, uproot is a _low-level_ library.

<br><br>

In [7]:
import uproot

# any dict-like object may be used as a cache
cache = {}

arrays = uproot.open("data/Zmumu.root")["events"].arrays("*", cache=cache)

# cache contains UUID;treename;branchname;interpretation;entryrange → arrays
cache

{'AAGUS3fQmKsR56dpAQAAf77v;events;Type;asstring();0-2304': <ObjectArray [b'GT' b'TT' b'GT' ... b'TT' b'GT' b'GG'] at 0x7f56543036d8>,
 'AAGUS3fQmKsR56dpAQAAf77v;events;Run;asdtype(Bi4(),Li4());0-2304': array([148031, 148031, 148031, ..., 148029, 148029, 148029], dtype=int32),
 'AAGUS3fQmKsR56dpAQAAf77v;events;Event;asdtype(Bi4(),Li4());0-2304': array([10507008, 10507008, 10507008, ..., 99991333, 99991333, 99991333],
       dtype=int32),
 'AAGUS3fQmKsR56dpAQAAf77v;events;E1;asdtype(Bf8(),Lf8());0-2304': array([82.20186639, 62.34492895, 62.34492895, ..., 81.27013558,
        81.27013558, 81.56621735]),
 'AAGUS3fQmKsR56dpAQAAf77v;events;px1;asdtype(Bf8(),Lf8());0-2304': array([-41.19528764,  35.11804977,  35.11804977, ...,  32.37749196,
         32.37749196,  32.48539387]),
 'AAGUS3fQmKsR56dpAQAAf77v;events;py1;asdtype(Bf8(),Lf8());0-2304': array([ 17.4332439 , -16.57036233, -16.57036233, ...,   1.19940578,
          1.19940578,   1.2013503 ]),
 'AAGUS3fQmKsR56dpAQAAf77v;events;pz1;asdtyp

In [11]:
# The next time you make this exact request, the arrays come from the cache, not from disk.

arrays = uproot.open("data/Zmumu.root")["events"].arrays("*", cache=cache)
arrays

{b'Type': <ObjectArray [b'GT' b'TT' b'GT' ... b'TT' b'GT' b'GG'] at 0x7f56680889b0>,
 b'Run': array([148031, 148031, 148031, ..., 148029, 148029, 148029], dtype=int32),
 b'Event': array([10507008, 10507008, 10507008, ..., 99991333, 99991333, 99991333],
       dtype=int32),
 b'E1': array([82.20186639, 62.34492895, 62.34492895, ..., 81.27013558,
        81.27013558, 81.56621735]),
 b'px1': array([-41.19528764,  35.11804977,  35.11804977, ...,  32.37749196,
         32.37749196,  32.48539387]),
 b'py1': array([ 17.4332439 , -16.57036233, -16.57036233, ...,   1.19940578,
          1.19940578,   1.2013503 ]),
 b'pz1': array([-68.96496181, -48.77524654, -48.77524654, ..., -74.53243061,
        -74.53243061, -74.80837247]),
 b'pt1': array([44.7322, 38.8311, 38.8311, ..., 32.3997, 32.3997, 32.3997]),
 b'eta1': array([-1.21769, -1.05139, -1.05139, ..., -1.57044, -1.57044, -1.57044]),
 b'phi1': array([ 2.74126  , -0.440873 , -0.440873 , ...,  0.0370275,  0.0370275,
         0.0370275]),
 b'Q1': 

In [18]:
# Using a dict as a cache keeps everything in memory forever (until you call dict.clear()).

# More realistically, you should use an ArrayCache with a memory upper limit.

cache = uproot.cache.ArrayCache(100*1024)   # 100*1024 bytes is 100 KiB

arrays = uproot.open("data/Zmumu.root")["events"].arrays("*", cache=cache)

# Now we only have the last ones that fit into cache.
list(cache.keys())

['AAGUS3fQmKsR56dpAQAAf77v;events;pz2;asdtype(Bf8(),Lf8());0-2304',
 'AAGUS3fQmKsR56dpAQAAf77v;events;pt2;asdtype(Bf8(),Lf8());0-2304',
 'AAGUS3fQmKsR56dpAQAAf77v;events;eta2;asdtype(Bf8(),Lf8());0-2304',
 'AAGUS3fQmKsR56dpAQAAf77v;events;phi2;asdtype(Bf8(),Lf8());0-2304',
 'AAGUS3fQmKsR56dpAQAAf77v;events;Q2;asdtype(Bi4(),Li4());0-2304',
 'AAGUS3fQmKsR56dpAQAAf77v;events;M;asdtype(Bf8(),Lf8());0-2304']
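To make the idea of a memory-limited cache concrete, here is a minimal, standard-library sketch of a dict-like object that evicts its oldest entries once a byte budget is exceeded. It is not uproot's `ArrayCache`: the class name `ByteLimitedCache`, its parameters, and the least-recently-used eviction policy are assumptions chosen for illustration, and `bytes` objects stand in for arrays.

```python
from collections import OrderedDict
from collections.abc import MutableMapping


class ByteLimitedCache(MutableMapping):
    """Hypothetical dict-like cache that holds at most `limit_bytes` of values."""

    def __init__(self, limit_bytes, sizeof=len):
        self._limit = limit_bytes
        self._sizeof = sizeof          # how to measure a value (len works for bytes)
        self._data = OrderedDict()     # insertion order doubles as usage order
        self._used = 0

    def __getitem__(self, key):
        value = self._data[key]
        self._data.move_to_end(key)    # mark as most recently used
        return value

    def __setitem__(self, key, value):
        if key in self._data:
            del self[key]              # replace an existing entry cleanly
        size = self._sizeof(value)
        if size > self._limit:
            return                     # a value larger than the budget is never stored
        self._data[key] = value
        self._used += size
        while self._used > self._limit:
            oldest = next(iter(self._data))
            del self[oldest]           # evict least recently used entries first

    def __delitem__(self, key):
        value = self._data.pop(key)
        self._used -= self._sizeof(value)

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


small = ByteLimitedCache(limit_bytes=24)
for name in ["px", "py", "pz", "E"]:
    small[name] = bytes(8)             # each fake "array" occupies 8 bytes

print(list(small.keys()))              # only the entries that still fit
```

Because it implements the mapping protocol, an object like this can be dropped in wherever a plain `dict` cache would be used; only the last entries that fit in the budget survive, which mirrors the behavior seen in the output above.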

<br><br><br><br>

**Question:** couldn't you manage arrays in memory yourself?

Yes, but inserting `cache=whatever` into your function calls minimally changes your analysis script, which avoids cluttering it up with technical details.
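The pattern behind a `cache=` argument can be sketched without uproot at all. In the example below, `load_column` is a hypothetical loader whose body stands in for an expensive read-and-decompress step; the key format and the placeholder data are made up for illustration.

```python
def load_column(name, cache=None):
    """Hypothetical loader: the body stands in for reading and decompressing data."""
    key = f"example-file;events;{name}"       # made-up key, not uproot's real format
    if cache is not None and key in cache:
        return cache[key]                     # cache hit: skip the expensive work
    load_column.reads += 1                    # count how often we really "read"
    values = [float(i) for i in range(5)]     # placeholder data, not from a ROOT file
    if cache is not None:
        cache[key] = values                   # remember the result for next time
    return values


load_column.reads = 0

cache = {}
first = load_column("E1", cache=cache)        # reads and fills the cache
second = load_column("E1", cache=cache)       # served from the dict
uncached = load_column("E1")                  # no cache passed, so it reads again

print(load_column.reads)
```

The analysis code calls the same function either way; whether results are reused depends only on whether a cache object is passed in and kept alive between calls.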

<br><br><br><br>

In [34]:
# To see the caching in action, let's subclass an interpretation and override its methods so that it prints when used.

class CustomAsDtype(uproot.asdtype):
    @property
    def identifier(self):
        out = super(CustomAsDtype, self).identifier
        print(out, "identifier")
        return out
    def fromroot(self, *args):
        print(self.identifier, "fromroot (first step in interpreting data from a ROOT file)")
        return super(CustomAsDtype, self).fromroot(*args)
    def finalize(self, *args):
        print(self.identifier, "finalize (puts finishing touches on array and returns it)")
        return super(CustomAsDtype, self).finalize(*args)

custom_asdtype = uproot.open("data/Zmumu.root")["events"]["E1"].interpretation
custom_asdtype.__class__ = CustomAsDtype
custom_asdtype

asdtype('>f8')

In [35]:
# Exercise: modify this cell so that evaluating it draws from the cache, instead of reading
# fromroot and finalizing the array.

# You should see it print only one message: identifier.

cache = {}

arrays = uproot.open("data/Zmumu.root")["events"]["E1"].array(custom_asdtype, cache=cache)

asdtype(Bf8(),Lf8()) identifier
asdtype(Bf8(),Lf8()) identifier
asdtype(Bf8(),Lf8()) fromroot (first step in interpreting data from a ROOT file)
asdtype(Bf8(),Lf8()) identifier
asdtype(Bf8(),Lf8()) finalize (puts finishing touches on array and returns it)


<br>

## Interpretations

<br>

Uproot performs two tasks:

   * it recognizes class objects like TH1F and TTree, using the latter to navigate to raw physics data (in TBaskets)
   * it provides tools to interpret the raw physics data however you need.

Most users don't mess with the default interpretations, but it's worth peeking inside to see how it works. Uproot provides the tools to investigate the TBasket data deeply.

<br>

This is another sense in which uproot is a _low-level_ library.
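As a small illustration of what "interpreting" raw data means, the sketch below reads the same eight bytes three different ways using Python's standard `struct` module. It does not involve uproot; the value `1.5` is an arbitrary example.

```python
import struct

raw = struct.pack(">d", 1.5)                  # eight bytes holding a big-endian double

as_float = struct.unpack(">d", raw)[0]        # interpret as one 64-bit float
as_bytes = list(struct.unpack("8B", raw))     # interpret as eight unsigned bytes
as_ints = struct.unpack(">2i", raw)           # interpret as two big-endian 32-bit ints

print(as_float)
print(as_bytes)
print(as_ints)
```

The bytes never change; only the rule for reading them does. An interpretation object plays the role of the format string here.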

<br>

In [42]:
branch = uproot.open("data/Zmumu.root")["events"]["Type"]

# The default interpretation for a /C branch is "asstring."

# But if we interpret it asjagged(asdtype('uint8')), we can see raw bytes, separated by event.

# Can you see what those bytes mean?

print(f"\nbranch.title = {branch.title}")
print(f"\nbranch.interpretation = {branch.interpretation}")
print(f"\nuproot.asdebug = {uproot.asdebug}")
print(f"\nbranch.array() = {branch.array()}")
print(f"\nbranch.array(uproot.asdebug) = {branch.array(uproot.asdebug)}")


branch.title = b'Type/C'

branch.interpretation = asstring()

uproot.asdebug = asjagged(asdtype('uint8'))

branch.array() = [b'GT' b'TT' b'GT' ... b'TT' b'GT' b'GG']

branch.array(uproot.asdebug) = [[2 71 84] [2 84 84] [2 71 84] ... [2 84 84] [2 71 84] [2 71 71]]
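One way to read the debug output above, assuming the first byte of each entry is a length prefix and the rest are ASCII characters, is sketched below. The rows are copied by hand from the printed output rather than read from the file, and `decode_entry` is a hypothetical helper.

```python
rows = [[2, 71, 84], [2, 84, 84], [2, 71, 71]]    # copied from the debug output


def decode_entry(row):
    """Treat the first byte as the string length and the rest as characters."""
    length = row[0]
    return bytes(row[1:1 + length])


decoded = [decode_entry(row) for row in rows]
print(decoded)
```

Under that assumption, `71` and `84` are the ASCII codes for `G` and `T`, which matches the strings returned by the default `asstring()` interpretation.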


In [76]:
branch = uproot.open("data/HZZ-objects.root")["events"]["muonp4"]

print(f"\nbranch.interpretation = {branch.interpretation}")
print(f"""\nbranch.interpretation.content.content.content.fromdtype =
        {repr(branch.interpretation.content.content.content.fromdtype)}""")
print(f"\nbranch.array(entrystop=1)[0] = {branch.array(entrystop=1)[0]}\n")

import pandas
pandas.DataFrame(branch.array(uproot.asjagged(uproot.asdtype(
    branch.interpretation.content.content.content.fromdtype), skipbytes=10), entrystop=1)[0])


branch.interpretation = asjagged(asobj(<uproot_methods.classes.TLorentzVector.Methods>), 10)

branch.interpretation.content.content.content.fromdtype =
        dtype([(' fBits', '>u8'), (' fUniqueID', '>u8'), (' fBits2', '>u8'), (' fUniqueID2', '>u8'), ('fX', '>f8'), ('fY', '>f8'), ('fZ', '>f8'), ('fE', '>f8')])

branch.array(entrystop=1)[0] = [TLorentzVector(-52.899, -11.655, -8.1608, 54.779) TLorentzVector(37.738, 0.69347, -11.308, 39.402)]



|   | fBits | fUniqueID | fBits2 | fUniqueID2 | fX | fY | fZ | fE |
|---|-------|-----------|--------|------------|----|----|----|----|
| 0 | 4611686276125687809 | 33554432 | 4611686173046407169 | 33554432 | -52.899456 | -11.654672 | -8.160793 | 54.779499 |
| 1 | 4611686276125687809 | 33554432 | 4611686173046407169 | 33554432 | 37.737782 | 0.693474 | -11.307582 | 39.401695 |


<br>

Read as one flat record, this TLorentzVector has the structure:

```c++
// 10 bytes of std::vector header...
struct {
    unsigned long fBits;
    unsigned long fUniqueID;

    unsigned long fBits2;
    unsigned long fUniqueID2;
    double fX;
    double fY;
    double fZ;

    double fE;
};
```

<br>

The second `fBits`/`fUniqueID` pair is there because a TLorentzVector contains a TVector3, and both inherit from TObject:

```c++
// 10 bytes of std::vector header...
struct {
    unsigned long fBits;          // TLorentzVector's TObject superclass
    unsigned long fUniqueID;
    struct {
        unsigned long fBits;      // TVector3's TObject superclass
        unsigned long fUniqueID;
        double fX;
        double fY;
        double fZ;
    };
    double fE;
};
```

All of this was derived from the streamers, but we can inspect it.
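The layout above can be mimicked with the standard `struct` module: skip a 10-byte header, then read fixed-size big-endian records of four 8-byte unsigned integers followed by four doubles. This is only a sketch on synthetic bytes. The payload is built by packing the values from the table above, with zeros in place of the header and the TObject fields, so nothing here is read from the ROOT file.

```python
import struct

HEADER_BYTES = 10                         # stand-in for the std::vector header
RECORD = struct.Struct(">4Q4d")           # 4 x '>u8' followed by 4 x '>f8' = 64 bytes


def pack_muon(x, y, z, e):
    """Build one synthetic record; the four integer fields are zeroed placeholders."""
    return RECORD.pack(0, 0, 0, 0, x, y, z, e)


# One synthetic "entry": a dummy header followed by two records.
payload = (
    bytes(HEADER_BYTES)
    + pack_muon(-52.899456, -11.654672, -8.160793, 54.779499)
    + pack_muon(37.737782, 0.693474, -11.307582, 39.401695)
)

body = payload[HEADER_BYTES:]             # same idea as skipbytes=10
muons = [fields[4:] for fields in RECORD.iter_unpack(body)]   # keep fX, fY, fZ, fE

for fx, fy, fz, fe in muons:
    print(fx, fy, fz, fe)
```

A NumPy structured dtype does the same job in bulk, which is what lets the whole branch be viewed as columns instead of being unpacked record by record.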

<br>

In [88]:
# This Python code was automatically generated from streamer info in the ROOT file:
print(uproot.open("data/HZZ-objects.root")._context.classes["TVector3"]._pycode)

class TVector3(uproot_methods.classes.TVector3.Methods, TObject):
    _methods = uproot_methods.classes.TVector3.Methods
    _bases = [TObject]
    @classmethod
    def _recarray(cls):
        out = []
        out.append((' cnt', 'u4'))
        out.append((' vers', 'u2'))
        for base in cls._bases:
            out.extend(base._recarray())
        out.append(('fX', numpy.dtype('>f8')))
        out.append(('fY', numpy.dtype('>f8')))
        out.append(('fZ', numpy.dtype('>f8')))
        return out
    _fields = ['fX', 'fY', 'fZ']
    _classname = b'TVector3'
    _versions = versions
    _classversion = 3
    @classmethod
    def _readinto(cls, self, source, cursor, context, parent, asclass=None):
        start, cnt, classversion = _startcheck(source, cursor)
        if cls._classversion != classversion:
            cursor.index = start
            if classversion in cls._versions:
                return cls._versions[classversion]._readinto(self, source, cursor, context, parent)
   

In [84]:
array = uproot.open("data/HZZ-objects.root")["events"]["muonp4"].array()
print(repr(array), end="\n\n")

print(array.columns, end="\n\n")

print(array["fX"], end="\n\n")    # get the x values directly as a Table column

print(array.x, end="\n\n")        # get the x values using a high-level Python method

<JaggedArrayMethods [[TLorentzVector(-52.899, -11.655, -8.1608, 54.779) TLorentzVector(37.738, 0.69347, -11.308, 39.402)] [TLorentzVector(-0.81646, -24.404, 20.2, 31.69)] [TLorentzVector(48.988, -21.723, 11.168, 54.74) TLorentzVector(0.82757, 29.801, 36.965, 47.489)] ... [TLorentzVector(-29.757, -15.304, -52.664, 62.395)] [TLorentzVector(1.1419, 63.61, 162.18, 174.21)] [TLorentzVector(23.913, -35.665, 54.719, 69.556)]] at 0x7f5634fa2048>

[' fBits', ' fUniqueID', ' fBits2', ' fUniqueID2', 'fX', 'fY', 'fZ', 'fE']

[[-52.89945602416992 37.7377815246582] [-0.8164593577384949] [48.987831115722656 0.8275666832923889] ... [-29.756786346435547] [1.1418697834014893] [23.913206100463867]]

[[-52.89945602416992 37.7377815246582] [-0.8164593577384949] [48.987831115722656 0.8275666832923889] ... [-29.756786346435547] [1.1418697834014893] [23.913206100463867]]



In [94]:
# All of these high-level methods are defined in uproot-methods, not uproot.

# Uproot itself is strictly about file I/O, so histogram-handling and kinematics are exiled here.

import uproot_methods

# You can create your own ROOT-inspired objects directly with uproot-methods...

myarray = uproot_methods.TLorentzVectorArray([1, 2, 3], [1, 2, 3], [1, 2, 3], 10)
print(f"\nmyarray    = {myarray}")

print(f"\nmyarray.pt**2 = {myarray.pt**2}")


myarray    = [TLorentzVector(1, 1, 1, 10) TLorentzVector(2, 2, 2, 10) TLorentzVector(3, 3, 3, 10)]

myarray.pt**2 = [ 2.  8. 18.]


In [105]:
# This is just a bunch of column arrays...
print(f"\nmyarray.content.contents['fX'] = {repr(myarray.content.contents['fX'])}")

# Wrapped up as a table...
print(f"\nmyarray.content                = {repr(myarray.content)}")

# With high-level physics methods on top of that.
print(f"\nmyarray                        = {repr(myarray)}")


myarray.content.contents['fX'] = array([1, 2, 3])

myarray.content                = <Table [<Row 0> <Row 1> <Row 2>] at 0x7f5634fa2b38>

myarray                        = <TLorentzVectorArray [TLorentzVector(1, 1, 1, 10) TLorentzVector(2, 2, 2, 10) TLorentzVector(3, 3, 3, 10)] at 0x7f5634fa2940>
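The same three layers (plain columns, a table that groups them, and physics methods on top) can be sketched in a few lines of pure Python. The class `FourVectorColumns` is hypothetical and far simpler than what uproot-methods provides; it uses lists instead of NumPy arrays and defines only `pt`.

```python
import math


class FourVectorColumns:
    """Hypothetical stand-in: a table of columns with one derived physics quantity."""

    def __init__(self, x, y, z, e):
        xs = list(x)
        # A scalar energy is repeated for every row, like the 10 in the cell above.
        es = [e] * len(xs) if isinstance(e, (int, float)) else list(e)
        self.columns = {"fX": xs, "fY": list(y), "fZ": list(z), "fE": es}

    @property
    def pt(self):
        # Transverse momentum: sqrt(fX**2 + fY**2), computed row by row.
        return [math.hypot(x, y) for x, y in zip(self.columns["fX"], self.columns["fY"])]


vectors = FourVectorColumns([1, 2, 3], [1, 2, 3], [1, 2, 3], 10)

print(vectors.columns["fX"])               # the raw column
print([p ** 2 for p in vectors.pt])        # a derived quantity, squared
```

Keeping the data as columns means a derived quantity is a single pass over two lists (or two arrays), rather than a method call on each of many small objects.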
