<br><br><br><br><br>

# Advanced Uproot

<br><br><br><br><br>

<br><br>

## Cache management

<br>

**Uproot does not automatically cache arrays.** (Remote backends cache raw bytes, but that's different.)

  * **Disadvantage:** unless you opt into caching, uproot reads and decompresses the data every time you ask for it.
  * **Advantage:** you control how much memory your process uses.

<br>

In this sense and others, uproot is a _low-level_ library.

<br><br>

In [1]:
import uproot

# any dict-like object may be used as a cache
cache = {}

arrays = uproot.open("data/Zmumu.root")["events"].arrays("*", cache=cache)

# cache contains UUID;treename;branchname;interpretation;entryrange → arrays
cache

{'AAGUS3fQmKsR56dpAQAAf77v;events;Type;asstring();0-2304': <ObjectArray [b'GT' b'TT' b'GT' ... b'TT' b'GT' b'GG'] at 0x7f9e60f50390>,
 'AAGUS3fQmKsR56dpAQAAf77v;events;Run;asdtype(Bi4(),Li4());0-2304': array([148031, 148031, 148031, ..., 148029, 148029, 148029], dtype=int32),
 'AAGUS3fQmKsR56dpAQAAf77v;events;Event;asdtype(Bi4(),Li4());0-2304': array([10507008, 10507008, 10507008, ..., 99991333, 99991333, 99991333],
       dtype=int32),
 'AAGUS3fQmKsR56dpAQAAf77v;events;E1;asdtype(Bf8(),Lf8());0-2304': array([82.20186639, 62.34492895, 62.34492895, ..., 81.27013558,
        81.27013558, 81.56621735]),
 'AAGUS3fQmKsR56dpAQAAf77v;events;px1;asdtype(Bf8(),Lf8());0-2304': array([-41.19528764,  35.11804977,  35.11804977, ...,  32.37749196,
         32.37749196,  32.48539387]),
 'AAGUS3fQmKsR56dpAQAAf77v;events;py1;asdtype(Bf8(),Lf8());0-2304': array([ 17.4332439 , -16.57036233, -16.57036233, ...,   1.19940578,
          1.19940578,   1.2013503 ]),
 'AAGUS3fQmKsR56dpAQAAf77v;events;pz1;asdtyp

In [2]:
# So that the next time you make this exact request, the arrays come from cache, not disk.

arrays = uproot.open("data/Zmumu.root")["events"].arrays("*", cache=cache)
arrays

{b'Type': <ObjectArray [b'GT' b'TT' b'GT' ... b'TT' b'GT' b'GG'] at 0x7f9e60f50390>,
 b'Run': array([148031, 148031, 148031, ..., 148029, 148029, 148029], dtype=int32),
 b'Event': array([10507008, 10507008, 10507008, ..., 99991333, 99991333, 99991333],
       dtype=int32),
 b'E1': array([82.20186639, 62.34492895, 62.34492895, ..., 81.27013558,
        81.27013558, 81.56621735]),
 b'px1': array([-41.19528764,  35.11804977,  35.11804977, ...,  32.37749196,
         32.37749196,  32.48539387]),
 b'py1': array([ 17.4332439 , -16.57036233, -16.57036233, ...,   1.19940578,
          1.19940578,   1.2013503 ]),
 b'pz1': array([-68.96496181, -48.77524654, -48.77524654, ..., -74.53243061,
        -74.53243061, -74.80837247]),
 b'pt1': array([44.7322, 38.8311, 38.8311, ..., 32.3997, 32.3997, 32.3997]),
 b'eta1': array([-1.21769, -1.05139, -1.05139, ..., -1.57044, -1.57044, -1.57044]),
 b'phi1': array([ 2.74126  , -0.440873 , -0.440873 , ...,  0.0370275,  0.0370275,
         0.0370275]),
 b'Q1': 

In [3]:
# Using a dict as a cache keeps everything in memory forever (until you call dict.clear()).

# More realistically, you should use an ArrayCache with a memory upper limit.

cache = uproot.cache.ArrayCache(100*1024)   # 100*1024 bytes is 100 kB

arrays = uproot.open("data/Zmumu.root")["events"].arrays("*", cache=cache)

# Now we only have the last ones that fit into cache.
list(cache.keys())

['AAGUS3fQmKsR56dpAQAAf77v;events;pz2;asdtype(Bf8(),Lf8());0-2304',
 'AAGUS3fQmKsR56dpAQAAf77v;events;pt2;asdtype(Bf8(),Lf8());0-2304',
 'AAGUS3fQmKsR56dpAQAAf77v;events;eta2;asdtype(Bf8(),Lf8());0-2304',
 'AAGUS3fQmKsR56dpAQAAf77v;events;phi2;asdtype(Bf8(),Lf8());0-2304',
 'AAGUS3fQmKsR56dpAQAAf77v;events;Q2;asdtype(Bi4(),Li4());0-2304',
 'AAGUS3fQmKsR56dpAQAAf77v;events;M;asdtype(Bf8(),Lf8());0-2304']

<br><br><br><br>

**Question:** couldn't you manage arrays in memory yourself?

Yes, but inserting `cache=whatever` into your function calls minimally changes your analysis script, which avoids cluttering it up with technical details.
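
Since any dict-like object can serve as a cache, a memory-limited cache can be sketched in a few lines. The class below is a hypothetical illustration, not uproot's `ArrayCache` implementation: the name `ByteLimitedCache`, its `getsize` parameter, and the least-recently-used eviction policy are all assumptions made for the example.

```python
from collections import OrderedDict
from collections.abc import MutableMapping


class ByteLimitedCache(MutableMapping):
    """Hypothetical dict-like cache that evicts least-recently-used items
    once the total size of stored values exceeds `limitbytes`."""

    def __init__(self, limitbytes, getsize=len):
        self.limitbytes = limitbytes
        self.getsize = getsize          # for NumPy arrays: lambda a: a.nbytes
        self._data = OrderedDict()
        self.usedbytes = 0

    def __getitem__(self, key):
        value = self._data[key]         # raises KeyError on a cache miss
        self._data.move_to_end(key)     # mark as most recently used
        return value

    def __setitem__(self, key, value):
        if key in self._data:
            self.usedbytes -= self.getsize(self._data.pop(key))
        self._data[key] = value
        self.usedbytes += self.getsize(value)
        # Evict the oldest entries until we fit (always keep the newest one).
        while self.usedbytes > self.limitbytes and len(self._data) > 1:
            _, evicted = self._data.popitem(last=False)
            self.usedbytes -= self.getsize(evicted)

    def __delitem__(self, key):
        self.usedbytes -= self.getsize(self._data.pop(key))

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


cache = ByteLimitedCache(limitbytes=100)
for name in ["px1", "py1", "pz1", "E1"]:
    cache[name] = bytes(40)             # stand-in for a 40-byte array

print(list(cache))                      # ['pz1', 'E1']
```

Only the entries that fit under the limit survive, which mirrors what the `ArrayCache` example above shows: the last arrays read are the ones still in the cache.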

<br><br><br><br>

In [4]:
# To see the caching in action, let's subclass an interpretation so that it prints when used.

class CustomAsDtype(uproot.asdtype):
    @property
    def identifier(self):
        out = super(CustomAsDtype, self).identifier
        print(out, "identifier")
        return out
    def fromroot(self, *args):
        print(self.identifier, "fromroot (first step in interpreting data from a ROOT file)")
        return super(CustomAsDtype, self).fromroot(*args)
    def finalize(self, *args):
        print(self.identifier, "finalize (puts finishing touches on array and returns it)")
        return super(CustomAsDtype, self).finalize(*args)

custom_asdtype = uproot.open("data/Zmumu.root")["events"]["E1"].interpretation
custom_asdtype.__class__ = CustomAsDtype
custom_asdtype

asdtype('>f8')

In [5]:
# Exercise: modify this cell so that evaluating it draws from the cache, instead of reading
# fromroot and finalizing the array.

# You should see it print only one message: identifier.

cache = {}

arrays = uproot.open("data/Zmumu.root")["events"]["E1"].array(custom_asdtype, cache=cache)

asdtype(Bf8(),Lf8()) identifier
asdtype(Bf8(),Lf8()) identifier
asdtype(Bf8(),Lf8()) fromroot (first step in interpreting data from a ROOT file)
asdtype(Bf8(),Lf8()) identifier
asdtype(Bf8(),Lf8()) finalize (puts finishing touches on array and returns it)


<br>

## Interpretations

<br>

Uproot performs two tasks:

   * it recognizes class objects like TH1F and TTree, using the latter to navigate to raw physics data (in TBaskets)
   * it provides tools to interpret the raw physics data however you need.

Most users don't mess with the default interpretations, but it's worth peeking inside to see how it works. Uproot provides the tools to investigate the TBasket data deeply.

<br>

This is another sense in which uproot is a _low-level_ library.

<br>

In [6]:
branch = uproot.open("data/Zmumu.root")["events"]["Type"]

# The default interpretation for a /C branch is "asstring."

# But if we interpret it asjagged(asdtype('uint8')), we can see raw bytes, separated by event.

# Can you see what those bytes mean?

print(f"\nbranch.title = {branch.title}")
print(f"\nbranch.interpretation = {branch.interpretation}")
print(f"\nuproot.asdebug = {uproot.asdebug}")
print(f"\nbranch.array() = {branch.array()}")
print(f"\nbranch.array(uproot.asdebug) = {branch.array(uproot.asdebug)}")


branch.title = b'Type/C'

branch.interpretation = asstring()

uproot.asdebug = asjagged(asdtype('uint8'))

branch.array() = [b'GT' b'TT' b'GT' ... b'TT' b'GT' b'GG']

branch.array(uproot.asdebug) = [[2 71 84] [2 84 84] [2 71 84] ... [2 84 84] [2 71 84] [2 71 71]]
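
One way to read those bytes: the first byte of each entry is a length, and the remaining bytes are ASCII characters. The helper below is a minimal sketch under that assumption; `decode_short_string` is a hypothetical name, and it handles only the short form with a single length byte.

```python
def decode_short_string(raw):
    """Decode one entry of a length-prefixed string branch.

    Assumes the short form only: a single length byte (< 255) followed by
    that many bytes of character data.
    """
    length = raw[0]
    if length == 255:
        raise ValueError("long-string form (4-byte length) not handled in this sketch")
    payload = bytes(raw[1:1 + length])
    if len(payload) != length:
        raise ValueError("entry is shorter than its length byte claims")
    return payload


# Three of the entries printed above.
entries = [[2, 71, 84], [2, 84, 84], [2, 71, 71]]
decoded = [decode_short_string(e) for e in entries]
print(decoded)   # [b'GT', b'TT', b'GG']
```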


In [7]:
branch = uproot.open("data/HZZ-objects.root")["events"]["muonp4"]

print(f"\nbranch.interpretation = {branch.interpretation}")
print(f"""\nbranch.interpretation.content.content.content.fromdtype =
        {repr(branch.interpretation.content.content.content.fromdtype)}""")
print(f"\nbranch.array(entrystop=1)[0] = {branch.array(entrystop=1)[0]}\n")

import pandas
pandas.DataFrame(branch.array(uproot.asjagged(uproot.asdtype(
    branch.interpretation.content.content.content.fromdtype), skipbytes=10), entrystop=1)[0])


branch.interpretation = asjagged(asobj(<uproot_methods.classes.TLorentzVector.Methods>), 10)

branch.interpretation.content.content.content.fromdtype =
        dtype([(' fBits', '>u8'), (' fUniqueID', '>u8'), (' fBits2', '>u8'), (' fUniqueID2', '>u8'), ('fX', '>f8'), ('fY', '>f8'), ('fZ', '>f8'), ('fE', '>f8')])

branch.array(entrystop=1)[0] = [TLorentzVector(-52.899, -11.655, -8.1608, 54.779) TLorentzVector(37.738, 0.69347, -11.308, 39.402)]



Unnamed: 0,fBits,fUniqueID,fBits2,fUniqueID2,fX,fY,fZ,fE
0,4611686276125687809,33554432,4611686173046407169,33554432,-52.899456,-11.654672,-8.160793,54.779499
1,4611686276125687809,33554432,4611686173046407169,33554432,37.737782,0.693474,-11.307582,39.401695


<br>

This TLorentzVector has structure:

```c++
// 10 bytes of std::vector header...
struct {
    unsigned long fBits;
    unsigned long fUniqueID;

    unsigned long fBits2;
    unsigned long fUniqueID2;
    double fX;
    double fY;
    double fZ;

    double fE;
};
```

<br>

With the nesting made explicit, the same TLorentzVector structure is:

```c++
// 10 bytes of std::vector header...
struct {
    unsigned long fBits;          // TLorentzVector's TObject superclass
    unsigned long fUniqueID;
    struct {
        unsigned long fBits;      // TVector3's TObject superclass
        unsigned long fUniqueID;
        double fX;
        double fY;
        double fZ;
    };
    double fE;
};
```

All of this was derived from the streamers, but we can inspect it.
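
To make the layout concrete, here is a sketch that unpacks records of this shape with Python's `struct` module. It assumes the big-endian layout shown in `fromdtype` (four 8-byte unsigned integers followed by four 8-byte floats) and a 10-byte header per entry. The bytes are synthetic, the numbers are made up, and `read_vectors` is a hypothetical helper rather than anything uproot provides.

```python
import struct

HEADER_BYTES = 10                       # the std::vector header mentioned above
RECORD = struct.Struct(">QQQQdddd")     # 4 big-endian u8, then fX, fY, fZ, fE
FIELDS = ["fBits", "fUniqueID", "fBits2", "fUniqueID2", "fX", "fY", "fZ", "fE"]


def read_vectors(raw):
    """Skip the header, then unpack fixed-size records until the bytes run out."""
    body = raw[HEADER_BYTES:]
    return [dict(zip(FIELDS, values)) for values in RECORD.iter_unpack(body)]


# Synthetic entry: a placeholder header plus two made-up records
# (illustrative numbers, not read from a ROOT file).
raw = bytes(HEADER_BYTES)
raw += RECORD.pack(1, 2, 3, 4, -52.5, -11.25, -8.0, 54.75)
raw += RECORD.pack(1, 2, 3, 4, 37.5, 0.5, -11.0, 39.25)

vectors = read_vectors(raw)
for vec in vectors:
    print(vec["fX"], vec["fY"], vec["fZ"], vec["fE"])
```

This is the same idea as passing `skipbytes=10` to `asjagged` in the cell above: skip the header, then view the rest as fixed-size records.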

<br>

In [8]:
# This Python code was automatically generated from streamer info in the ROOT file:
print(uproot.open("data/HZZ-objects.root")._context.classes["TVector3"]._pycode)

class TVector3(uproot_methods.classes.TVector3.Methods, TObject):
    _methods = uproot_methods.classes.TVector3.Methods
    _bases = [TObject]
    @classmethod
    def _recarray(cls):
        out = []
        out.append((' cnt', 'u4'))
        out.append((' vers', 'u2'))
        for base in cls._bases:
            out.extend(base._recarray())
        out.append(('fX', numpy.dtype('>f8')))
        out.append(('fY', numpy.dtype('>f8')))
        out.append(('fZ', numpy.dtype('>f8')))
        return out
    _fields = ['fX', 'fY', 'fZ']
    _classname = b'TVector3'
    _versions = versions
    _classversion = 3
    @classmethod
    def _readinto(cls, self, source, cursor, context, parent, asclass=None):
        start, cnt, classversion = _startcheck(source, cursor)
        if cls._classversion != classversion:
            cursor.index = start
            if classversion in cls._versions:
                return cls._versions[classversion]._readinto(self, source, cursor, context, parent)
   

In [9]:
array = uproot.open("data/HZZ-objects.root")["events"]["muonp4"].array()
print(repr(array), end="\n\n")

print(array.columns, end="\n\n")

print(array["fX"], end="\n\n")    # get the x values directly as a Table column

print(array.x, end="\n\n")        # get the x values using a high-level Python method

<JaggedArrayMethods [[TLorentzVector(-52.899, -11.655, -8.1608, 54.779) TLorentzVector(37.738, 0.69347, -11.308, 39.402)] [TLorentzVector(-0.81646, -24.404, 20.2, 31.69)] [TLorentzVector(48.988, -21.723, 11.168, 54.74) TLorentzVector(0.82757, 29.801, 36.965, 47.489)] ... [TLorentzVector(-29.757, -15.304, -52.664, 62.395)] [TLorentzVector(1.1419, 63.61, 162.18, 174.21)] [TLorentzVector(23.913, -35.665, 54.719, 69.556)]] at 0x7f9e5f0145f8>

[' fBits', ' fUniqueID', ' fBits2', ' fUniqueID2', 'fX', 'fY', 'fZ', 'fE']

[[-52.89945602416992 37.7377815246582] [-0.8164593577384949] [48.987831115722656 0.8275666832923889] ... [-29.756786346435547] [1.1418697834014893] [23.913206100463867]]

[[-52.89945602416992 37.7377815246582] [-0.8164593577384949] [48.987831115722656 0.8275666832923889] ... [-29.756786346435547] [1.1418697834014893] [23.913206100463867]]



In [10]:
# All of these high-level methods are defined in uproot-methods, not uproot.

# Uproot itself is strictly about file I/O, so histogram-handling and kinematics are exiled here.

import uproot_methods

# You can create your own ROOT-inspired objects directly with uproot-methods...

myarray = uproot_methods.TLorentzVectorArray([1, 2, 3], [1, 2, 3], [1, 2, 3], 10)
print(f"\nmyarray    = {myarray}")

print(f"\nmyarray.pt**2 = {myarray.pt**2}")


myarray    = [TLorentzVector(1, 1, 1, 10) TLorentzVector(2, 2, 2, 10) TLorentzVector(3, 3, 3, 10)]

myarray.pt**2 = [ 2.  8. 18.]


In [11]:
# This is just a bunch of column arrays...
print(f"\nmyarray.content.contents['fX'] = {repr(myarray.content.contents['fX'])}")

# Wrapped up as a table...
print(f"\nmyarray.content                = {repr(myarray.content)}")

# With high-level physics methods on top of that.
print(f"\nmyarray                        = {repr(myarray)}")


myarray.content.contents['fX'] = array([1, 2, 3])

myarray.content                = <Table [<Row 0> <Row 1> <Row 2>] at 0x7f9e5f0f6518>

myarray                        = <TLorentzVectorArray [TLorentzVector(1, 1, 1, 10) TLorentzVector(2, 2, 2, 10) TLorentzVector(3, 3, 3, 10)] at 0x7f9e5f0f64a8>


<br><br>

## Detail about baskets

<br>

Uproot needs to access details about TBaskets (number of bytes, number of entries, etc.), and it exposes them through public methods. You can use them to diagnose issues with your ROOT files, most importantly **too many small baskets**.

<br>

Yet another sense in which uproot is a _low-level_ library.

<br><br>

<img src="img/terminology.png" width="90%">

In [12]:
tree = uproot.open("data/HZZ-objects.root")["events"]
branch = tree["muonp4"]

# TTree number of entries SHOULD be equal to the TBranch number of entries...
print(f"\ntree.numentries = {tree.numentries}, branch.numentries = {branch.numentries}")

print(f"\nbranch.numbaskets = {branch.numbaskets}")

# If you have multiple values per event (entry), the number of items != number of entries
print(f"\nbranch.numitems() = {branch.numitems()}")

print(f"\nbranch.basket_numentries(0) = {branch.basket_numentries(0)}")

print(f"\nbranch.basket_numitems(0) = {branch.basket_numitems(0)}")


tree.numentries = 2421, branch.numentries = 2421

branch.numbaskets = 10

branch.numitems() = 3825

branch.basket_numentries(0) = 262

branch.basket_numitems(0) = 423


In [13]:
# Read one TBasket at a time
for basket in branch.iterate_baskets():
    print(len(basket), basket[0], sep="\n")

262
[TLorentzVector(-52.899, -11.655, -8.1608, 54.779) TLorentzVector(37.738, 0.69347, -11.308, 39.402)]
268
[TLorentzVector(-32.974, -52.461, 46.334, 77.371) TLorentzVector(22.022, 22.085, 7.9787, 32.193)]
269
[TLorentzVector(-31.721, 26.438, 30.693, 51.452) TLorentzVector(-12.316, -38.976, -25.216, 48.028)]
268
[TLorentzVector(61.393, -29.656, -114.65, 133.4) TLorentzVector(-5.3458, 40.936, -54.989, 68.761)]
271
[TLorentzVector(-29.59, -23.43, 77.423, 86.133)]
262
[TLorentzVector(53.614, -35.479, 105.72, 123.74) TLorentzVector(13.618, 38.494, 150.76, 156.2)]
269
[TLorentzVector(-35.162, 37.467, -127.09, 137.08) TLorentzVector(10.536, -47.655, -178.63, 185.18)]
272
[TLorentzVector(30.471, 0.62969, 40.417, 50.62)]
263
[TLorentzVector(22.242, -12.982, -11.336, 28.138)]
17
[TLorentzVector(-31.072, -55.729, 149.66, 162.7)]
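
The per-basket entry counts printed above are enough to work out where each basket starts and which basket holds a given entry. The following is a plain-Python sketch of that bookkeeping, not uproot code; `basket_of` is a hypothetical helper.

```python
from bisect import bisect_right
from itertools import accumulate

# Entries per basket, copied from the printout above.
basket_numentries = [262, 268, 269, 268, 271, 262, 269, 272, 263, 17]

# Entry number at which each basket starts, plus the total at the end.
offsets = [0] + list(accumulate(basket_numentries))


def basket_of(entry):
    """Return the index of the basket that holds the given entry."""
    if not 0 <= entry < offsets[-1]:
        raise IndexError(entry)
    return bisect_right(offsets, entry) - 1


print(offsets[-1])                                    # 2421, the branch's numentries
print(basket_of(0), basket_of(262), basket_of(2420))  # 0 1 9
```

The final basket holds only 17 entries, which is normal for the end of a file. Many baskets that small throughout a branch would be the symptom to look for.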


In [14]:
# Uproot operationally defines "clusters" as the entry ranges where the baskets of a set of branches
# line up. (These are good places to slice entry ranges, to avoid reading unnecessary baskets.)

# Exercise: what happens if you include "fTemperature" in the selection? Why?

tree = uproot.open("http://scikit-hep.org/uproot/examples/Event.root")["T"]
list(tree.clusters(["fMatrix[4][4]"]))

[(0, 221), (221, 442), (442, 663), (663, 800), (800, 1000)]
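
The idea of baskets "lining up" can be sketched as a set intersection of basket boundaries. This illustrates the concept and is not uproot's implementation: the function names are hypothetical, and the second branch's basket sizes are invented for the example.

```python
from itertools import accumulate


def boundaries(basket_numentries):
    """Entry numbers at which a branch's baskets start and end."""
    return [0] + list(accumulate(basket_numentries))


def clusters(*per_branch_numentries):
    """Entry ranges bounded by points where every branch has a basket edge."""
    common = set(boundaries(per_branch_numentries[0]))
    for numentries in per_branch_numentries[1:]:
        common &= set(boundaries(numentries))
    edges = sorted(common)
    return list(zip(edges[:-1], edges[1:]))


fmatrix = [221, 221, 221, 137, 200]   # basket sizes implied by the output above
other = [442, 358, 200]               # hypothetical second branch with larger baskets

print(clusters(fmatrix))
# [(0, 221), (221, 442), (442, 663), (663, 800), (800, 1000)]
print(clusters(fmatrix, other))
# [(0, 442), (442, 800), (800, 1000)]
```

Adding a branch whose baskets are sized differently can only remove shared boundaries, so the clusters get coarser.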

In [15]:
# This is why the length of the arrays in iterate depends on which branches you're reading.

# Exercise: control it explicitly with entrysteps.

for arrays in tree.iterate(["fMatrix[4][4]", ]):      # "fTemperature"
    print([(n, len(x)) for n, x in arrays.items()])

[(b'fMatrix[4][4]', 221)]
[(b'fMatrix[4][4]', 221)]
[(b'fMatrix[4][4]', 221)]
[(b'fMatrix[4][4]', 137)]
[(b'fMatrix[4][4]', 200)]


<br><br>

## Lazy arrays

<br>

Iteration over arrays helps you avoid running out of memory, but you have to write your analysis in a loop over batches.

Lazy arrays let you pretend the whole array is in memory, but only access it a little at a time.

<br>

This is actually a _high-level_ feature.
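
As a rough mental model, a lazy array is a list of chunk loaders plus the bookkeeping to decide which chunk an index falls into. The class below is a hypothetical sketch of that idea in plain Python. It is not how uproot or awkward-array implement `ChunkedArray` and `VirtualArray`, and the chunk lengths are chosen only for illustration.

```python
from bisect import bisect_right
from itertools import accumulate


class LazyChunks:
    """Hypothetical lazy sequence: each chunk is produced by a zero-argument
    function that is called only when an item from that chunk is requested."""

    def __init__(self, loaders, lengths):
        self.loaders = loaders
        self.offsets = [0] + list(accumulate(lengths))
        self.loaded = {}                 # chunk index -> materialized chunk

    def __len__(self):
        return self.offsets[-1]

    def __getitem__(self, i):
        if i < 0:
            i += len(self)
        if not 0 <= i < len(self):
            raise IndexError(i)
        chunk = bisect_right(self.offsets, i) - 1
        if chunk not in self.loaded:     # first touch: actually read the chunk
            print(f"loading chunk {chunk}")
            self.loaded[chunk] = self.loaders[chunk]()
        return self.loaded[chunk][i - self.offsets[chunk]]


lengths = [221, 221, 221, 137, 200]
starts = [0] + list(accumulate(lengths))[:-1]
loaders = [lambda s=s, n=n: list(range(s, s + n)) for s, n in zip(starts, lengths)]

lazy = LazyChunks(loaders, lengths)      # nothing loaded yet
print(lazy[442])                         # loads chunk 2 only
print(lazy[443])                         # served from the already-loaded chunk
print(sorted(lazy.loaded))               # [2]
```

Keeping every loaded chunk in `self.loaded` is the simplest policy. Swapping that dict for a size-limited cache bounds memory use at the cost of re-reading evicted chunks.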

<br><br>

In [16]:
# uproot.lazyarray(s), TTree.lazyarray(s), and TBranch.lazyarray all give you lazy-loading arrays.

arrays = uproot.open("data/HZZ-objects.root")["events"].lazyarrays()

# It's a chunked awkward.Table, which looks like "<Row>, <Row>, <Row>" so that you don't accidentally
# read in data by looking at it.
arrays

<ChunkedArray [<Row 0> <Row 1> <Row 2> ... <Row 2418> <Row 2419> <Row 2420>] at 0x7f9e880e9ac8>

In [17]:
print(arrays.type)    # or .columns

[0, 2421) -> 'jetp4'             -> [0, inf) -> <class 'uproot_methods.classes.TLorentzVector.Methods'>
             'jetbtag'           -> [0, inf) -> float32
             'jetid'             -> [0, inf) -> bool
             'muonp4'            -> [0, inf) -> <class 'uproot_methods.classes.TLorentzVector.Methods'>
             'muonq'             -> [0, inf) -> int32
             'muoniso'           -> [0, inf) -> float32
             'electronp4'        -> [0, inf) -> <class 'uproot_methods.classes.TLorentzVector.Methods'>
             'electronq'         -> [0, inf) -> int32
             'electroniso'       -> [0, inf) -> float32
             'photonp4'          -> [0, inf) -> <class 'uproot_methods.classes.TLorentzVector.Methods'>
             'photoniso'         -> [0, inf) -> float32
             'MET'               -> <class 'uproot_methods.classes.TVector2.Methods'>
             'MC_bquarkhadronic' -> <class 'uproot_methods.classes.TVector3.Methods'>
             'MC_bquarklept

In [18]:
# Accessing muonp4 causes it to be read (in this case, just the first and last baskets).

arrays["muonp4"]

<ChunkedArrayMethods [[TLorentzVector(-52.899, -11.655, -8.1608, 54.779) TLorentzVector(37.738, 0.69347, -11.308, 39.402)] [TLorentzVector(-0.81646, -24.404, 20.2, 31.69)] [TLorentzVector(48.988, -21.723, 11.168, 54.74) TLorentzVector(0.82757, 29.801, 36.965, 47.489)] ... [TLorentzVector(-29.757, -15.304, -52.664, 62.395)] [TLorentzVector(1.1419, 63.61, 162.18, 174.21)] [TLorentzVector(23.913, -35.665, 54.719, 69.556)]] at 0x7f9e880e9a20>

In [19]:
# Let's reuse that custom interpretation that prints out whenever ROOT data are read.

branch = uproot.open("http://scikit-hep.org/uproot/examples/Event.root")["T"]["fMatrix[4][4]"]
print("number of entries per basket:", [branch.basket_numentries(i) for i in range(branch.numbaskets)])

custom_asdtype = branch.interpretation
custom_asdtype.__class__ = CustomAsDtype
custom_asdtype

lazy = branch.lazyarray(custom_asdtype)
# (Note: no print-outs yet!)

number of entries per basket: [221, 221, 221, 137, 200]


In [20]:
# Exercise: what do you see when you access lazy[0], lazy[221], lazy[442]? Why?

lazy[442]

asdtype(Bf4(4,4),Lf8(4,4)) identifier
asdtype(Bf4(4,4),Lf8(4,4)) fromroot (first step in interpreting data from a ROOT file)
asdtype(Bf4(4,4),Lf8(4,4)) identifier
asdtype(Bf4(4,4),Lf8(4,4)) finalize (puts finishing touches on array and returns it)


array([[-0.57938826,  0.04526387, -0.5436005 ,  0.        ],
       [ 0.62756741,  0.45790792,  1.73868799,  0.        ],
       [-0.7387197 ,  1.20050085,  3.7554934 ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ]])

In [21]:
# Control the cache, as before.

cache = {}

lazy = branch.lazyarray(custom_asdtype, cache=cache)
lazy[0:441]

cache.keys()

asdtype(Bf4(4,4),Lf8(4,4)) identifier
asdtype(Bf4(4,4),Lf8(4,4)) fromroot (first step in interpreting data from a ROOT file)
asdtype(Bf4(4,4),Lf8(4,4)) identifier
asdtype(Bf4(4,4),Lf8(4,4)) finalize (puts finishing touches on array and returns it)
asdtype(Bf4(4,4),Lf8(4,4)) identifier
asdtype(Bf4(4,4),Lf8(4,4)) fromroot (first step in interpreting data from a ROOT file)
asdtype(Bf4(4,4),Lf8(4,4)) identifier
asdtype(Bf4(4,4),Lf8(4,4)) finalize (puts finishing touches on array and returns it)


dict_keys([<VirtualArray.TransientKey 140318728592464>, <VirtualArray.TransientKey 140318728591792>])

In [52]:
import os
import numpy
import awkward

# Load lazy arrays with persistvirtual=True (persistence methods remember that the chunks of the lazy array are virtual)
arrays = uproot.open("data/HZZ.root")["events"].lazyarrays(persistvirtual=True)

# Add new data to the Table, not previously in the ROOT file
arrays["Muon_Pt"] = numpy.sqrt(arrays["Muon_Px"]**2 + arrays["Muon_Py"]**2)

# Save as an awkward-array file
awkward.save("tmp.awkd", arrays, mode="w")

# The file contains Muon_Pt and only INSTRUCTIONS for reading data from ROOT
os.path.getsize("tmp.awkd") // 1024, os.path.getsize("data/HZZ.root") // 1024

(45, 212)

In [53]:
# So when we read it back, we can pretend it's one dataset, but original columns come from ROOT and derived columns come from the awkd file.

arrays = awkward.load("tmp.awkd")

print(f"\n\nread from data/HZZ.root:\narrays['Muon_Pz'] = {arrays['Muon_Pz']}")
print(f"\n\nread from tmp.awkd:\narrays['Muon_Pt'] = {arrays['Muon_Pt']}")

# Note: won't work if the original ROOT files ever get moved...



read from data/HZZ.root:
arrays['Muon_Pz'] = [[-8.160793 -11.307582] [20.199968] [11.168285 36.96519] ... [-52.66375] [162.17632] [54.719437]]


read from tmp.awkd:
arrays['Muon_Pt'] = [[54.168106 37.744152] [24.417913] [53.58827 29.811996] ... [33.461536] [63.619816] [42.93995]]


In [54]:
# Apply a cut to get a new collection (lazy mask; not a big copy)
selected = arrays[arrays['Muon_Pt'].max() > 60]
print(selected)

# Save the lazily masked data
awkward.save("tmp2.awkd", selected, mode="w")

# Still not a very big file (essentially an event list)
os.path.getsize("tmp2.awkd") // 1024, os.path.getsize("data/HZZ.root") // 1024

# (Nearly) zero-cost skims!

[<Row 3> <Row 4> <Row 8> <Row 9> <Row 11> <Row 12> <Row 14> ...]


(53, 212)

In [55]:
# The filtered file contains filtered data
arrays2 = awkward.load("tmp2.awkd")

print(f"\n\nfile has selected contents:\narrays2 = {arrays2}")
print(f"\n\nread from data/HZZ.root:\narrays2['Muon_Pz'] = {arrays2['Muon_Pz']}")
print(f"\n\nread from tmp2.awkd:\narrays2['Muon_Pt'] = {arrays2['Muon_Pt']}")



file has selected contents:
arrays2 = [<Row 3> <Row 4> <Row 8> ... <Row 2406> <Row 2411> <Row 2419>]


read from data/HZZ.root:
arrays2['Muon_Pz'] = [[403.84845 335.0942] [-89.69573 20.115053] [35.638836 -17.473787] ... [-113.74551 -113.811455] [-24.587757 -0.38994783] [162.17632]]


read from tmp2.awkd:
arrays2['Muon_Pt'] = [[88.63194 77.951485] [81.011406 47.175045] [106.28356 12.311636] ... [76.18734 31.307274] [61.645054 28.647495] [63.619816]]


In [26]:
# Dask's arrays are an abstraction over lazy arrays

uproot.daskarray("data/Zmumu*.root", "events", "E1")

dask.array<array, shape=(2304,), dtype=float64, chunksize=(2304,)>

In [27]:
# So are Dask DataFrames

uproot.daskframe("data/Zmumu*.root", "events")

Unnamed: 0_level_0,Type,Run,Event,E1,px1,py1,pz1,pt1,eta1,phi1,Q1,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M
npartitions=1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
0,object,int32,int32,float64,float64,float64,float64,float64,float64,float64,int32,float64,float64,float64,float64,float64,float64,float64,int32,float64
2303,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...


<br><br><br><br><br>

## Remote reading

<br>

We've already seen examples of HTTP; XRootD works the same way if you have it.

<br><br><br><br><br>

In [41]:
# Raw bytes from HTTP and XRootD are automatically cached (the only exception is the "no-cache" policy).
# 
# You can control it with httpsource and xrootdsource options:

print("xrootdsource defaults:", uproot.XRootDSource.defaults)
print("httpsource defaults:  ", uproot.HTTPSource.defaults)

file = uproot.open("http://scikit-hep.org/uproot/examples/Event.root",
                   httpsource={"chunkbytes": 4*1024,    # read in 4 kB chunks
                               "limitbytes": 1024**2,   # keep at most 1 MB in memory
                               "parallel":   12         # allow 12 simultaneous threads
                              })
file.keys()

# This internal cache consists of tiles of raw data, labeled by location in the file.
list(file._context.source.cache.keys())

xrootdsource defaults: {'timeout': None, 'chunkbytes': 32768, 'limitbytes': 33554432, 'parallel': True}
httpsource defaults:   {'chunkbytes': 32768, 'limitbytes': 33554432, 'parallel': 96}


[0, 9161, 9162, 9163]
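
The integer keys above look like chunk indices: with fixed-size chunks, byte position `p` lives in chunk `p // chunkbytes`. The sketch below shows that bookkeeping with an in-memory stand-in for the remote file. `TiledSource` and its `fetch` callback are hypothetical, and unlike the real sources it never evicts anything (there is no `limitbytes`).

```python
class TiledSource:
    """Hypothetical byte source that fetches fixed-size chunks on demand and
    keeps them in a dict keyed by chunk index."""

    def __init__(self, fetch, chunkbytes):
        self.fetch = fetch               # fetch(start, stop) -> bytes
        self.chunkbytes = chunkbytes
        self.cache = {}

    def read(self, start, stop):
        first = start // self.chunkbytes
        last = (stop - 1) // self.chunkbytes
        pieces = []
        for index in range(first, last + 1):
            if index not in self.cache:  # only go to the network on a miss
                lo = index * self.chunkbytes
                self.cache[index] = self.fetch(lo, lo + self.chunkbytes)
            pieces.append(self.cache[index])
        skip = start - first * self.chunkbytes
        return b"".join(pieces)[skip:skip + (stop - start)]


remote = bytes(range(256)) * 64          # 16 kB of fake "remote file"
source = TiledSource(lambda lo, hi: remote[lo:hi], chunkbytes=4 * 1024)

header = source.read(0, 100)             # touches chunk 0
footer = source.read(16000, 16384)       # touches the last chunk
print(sorted(source.cache))              # [0, 3]
```

Reading a little from the start and a little from the end touches only the first and last chunks, which is the same pattern as the `[0, 9161, 9162, 9163]` keys above.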

<br><br><br><br>

## Parallel decompression/interpretation

<br>

Raw bytes are read in parallel and automatically cached (explicitly for remote files, implicitly through memory-mapping local files), but the decompression is serial and uncached unless requested.

<br><br><br><br>

In [48]:
import concurrent.futures                            # part of the Python 3 standard library

executor = concurrent.futures.ThreadPoolExecutor(12) # 12 parallel threads

tree = uproot.open("data/HZZ-lzma.root")["events"]
arrays = tree.arrays("*",
                     executor=executor,              # do all the work on the executor
                     blocking=False)                 # do NOT wait until it's done: return immediately

print(f"\narrays:      {arrays}")                    # it returned a function object, not the data

real_arrays = arrays()                               # evaluating function means, "wait until done"

print(f"\nreal_arrays: {real_arrays}")               # now we have the data


arrays:      <function TTreeMethods.arrays.<locals>.wait at 0x7f9e7ff53d90>

real_arrays: {b'NJet': array([0, 1, 0, ..., 1, 2, 0], dtype=int32), b'Jet_Px': <JaggedArray [[] [-38.874714] [] ... [-3.7148185] [-36.361286 -15.256871] []] at 0x7f9e5c267dd8>, b'Jet_Py': <JaggedArray [[] [19.863453] [] ... [-37.202377] [10.173571 -27.175364] []] at 0x7f9e5c26d9b0>, b'Jet_Pz': <JaggedArray [[] [-0.8949416] [] ... [41.012222] [226.42921 12.119683] []] at 0x7f9e5c26dcf8>, b'Jet_E': <JaggedArray [[] [44.137363] [] ... [55.95058] [229.57799 33.92035] []] at 0x7f9e5c26d6d8>, b'Jet_btag': <JaggedArray [[] [-1.0] [] ... [-1.0] [-1.0 -1.0] []] at 0x7f9e5c2671d0>, b'Jet_ID': <JaggedArray [[] [True] [] ... [True] [True True] []] at 0x7f9e5c3b22e8>, b'NMuon': array([2, 1, 2, ..., 1, 1, 1], dtype=int32), b'Muon_Px': <JaggedArray [[-52.899456 37.73778] [-0.81645936] [48.98783 0.8275667] ... [-29.756786] [1.1418698] [23.913206]] at 0x7f9e5c26d320>, b'Muon_Py': <JaggedArray [[-11.654672 0.6934736] [-24.4042

<br>

**Parallel decompression/interpretation has only really been useful for large LZMA-compressed datasets.**

For most purposes, I wouldn't bother (only worth trying if you think you might be in this situation).
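
To see the `executor`/`blocking=False` pattern without a ROOT file, here is a sketch that decompresses several LZMA blobs on a thread pool using only the standard library. The `decompress_all` helper is hypothetical. It imitates the "return a function that waits" behavior shown above, and it relies on the assumption that CPython's `lzma` module releases the GIL while decompressing, so threads can overlap.

```python
import concurrent.futures
import lzma


def decompress_all(blobs, executor, blocking=True):
    """Decompress every blob on the executor.

    With blocking=False, return a function that waits for the results
    when called, instead of returning the results themselves.
    """
    futures = [executor.submit(lzma.decompress, blob) for blob in blobs]

    def wait():
        return [future.result() for future in futures]

    return wait() if blocking else wait


# Stand-ins for compressed baskets.
originals = [bytes([i]) * 100_000 for i in range(8)]
blobs = [lzma.compress(data) for data in originals]

with concurrent.futures.ThreadPoolExecutor(4) as executor:
    pending = decompress_all(blobs, executor, blocking=False)
    print(callable(pending))             # True: a function, not the data
    results = pending()                  # wait until every task is done

print(results == originals)              # True
```

Whether this is faster than a serial loop depends on blob size and core count, so it is worth timing on your own data before adopting it.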

<br><br>

Parallel speedup vs number of threads _for a large, LZMA-compressed dataset:_

<center><img src="img/scaling.png" width="50%"></center>

<br><br><br><br><br><br>

<center><b>That's all!</b></center>

<br><br><br><br><br><br>