# Demo: `DAOD_PHYSLITE` analysis with uproot/awkward on jupyterhub on GCP

<div class="alert alert-info">
Note: This tutorial is targeted at users interested in R&D and technical details. Much of this is still in early development/prototyping.
</div>

## Read and process PHYSLITE using uproot/awkward

First, let's start with some general notes on reading `DAOD_PHYSLITE`

The PHYSLITE ROOT files currently follow a structure similar to that of regular ATLAS xAODs.

They contain several trees, where the one holding the actual data is called `CollectionTree`. The others contain various forms of metadata.

In [None]:
import uproot
import awkward as ak

In [None]:
f = uproot.open("data/DAOD_PHYSLITE_21.2.108.0.art.pool.root")

In [None]:
f.keys()

### 1-D vectors
* All branches are stored with the **highest split level**
* In most cases data is stored in branches called `Aux.<something>` or `AuxDyn.<something>`
* Typically **vectors of fundamental types**, e.g. pt/eta/phi of particle collections
* **can be read into numpy arrays efficiently using uproot** since the data is stored as contiguous blocks  
(except for the 10-byte vector headers, whose positions are known from ROOT's event offsets)
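
The idea of stripping the headers can be sketched with a toy buffer. This is not uproot's actual implementation: the header layout (4-byte byte count, 2-byte version, 4-byte element count) and the helper `make_entry` are assumptions made for illustration only.

```python
import struct
import numpy as np

HEADER_SIZE = 10  # assumed: 4-byte byte count + 2-byte version + 4-byte element count

def make_entry(values):
    """Serialize one toy entry: a 10-byte header followed by big-endian float32 values."""
    payload = np.asarray(values, dtype=">f4").tobytes()
    header = struct.pack(">IHI", 0, 0, len(values))  # placeholder byte count and version
    return header + payload

entries = [[1.0, 2.0], [], [3.0, 4.0, 5.0]]
chunks = [make_entry(values) for values in entries]
buffer = np.frombuffer(b"".join(chunks), dtype=np.uint8)

# Byte offset of each entry, playing the role of ROOT's event offsets
byte_offsets = np.cumsum([0] + [len(chunk) for chunk in chunks])

# Mask out the header bytes, then reinterpret what is left as one flat float array
starts = byte_offsets[:-1]
keep = np.ones(len(buffer), dtype=bool)
for i in range(HEADER_SIZE):
    keep[starts + i] = False
flat = buffer[keep].view(">f4")

# The number of elements per entry follows from the byte offsets alone
counts = (np.diff(byte_offsets) - HEADER_SIZE) // 4

print(flat, counts)
```

No Python loop over events is needed: the only loop runs over the 10 header bytes, and everything else is a whole-array operation.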

In [None]:
f["CollectionTree"].show("/AnalysisElectronsAuxDyn.(pt|eta|phi)$/i", name_width=30, interpretation_width=50)

### ElementLinks

The most relevant exception to this are `ElementLink` branches:

* provide cross references into other collections
* **often 2-dimensional** (`vector<vector<ElementLink<...>>>`)
* data part (`ElementLink`) is serialized as a **structure of two 32-bit unsigned integers**:
  * hash `m_persKey`, identifying the target collection
  * index `m_persIndex`, identifying the array index of the corresponding particle in the target collection.
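
A minimal sketch of how such a structure maps onto a numpy structured dtype. It assumes big-endian storage with `m_persKey` preceding `m_persIndex`; the key value and the target array `target_pt` are made up for illustration.

```python
import numpy as np

# Assumed layout: two big-endian unsigned 32-bit integers per link
link_dtype = np.dtype([("m_persKey", ">u4"), ("m_persIndex", ">u4")])

# Toy byte stream holding two links that point into the same (hypothetical) collection
raw = np.array([(123456789, 0), (123456789, 2)], dtype=link_dtype).tobytes()

# Interpret the bytes as an array of (key, index) records without copying
links = np.frombuffer(raw, dtype=link_dtype)

# Hypothetical target collection for one event
target_pt = np.array([10.5, 20.5, 30.5])

# Following the links is plain integer indexing into the target
linked_pt = target_pt[links["m_persIndex"].astype(np.int64)]
print(linked_pt)
```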

In [None]:
f["CollectionTree/AnalysisElectronsAuxDyn.trackParticleLinks"].typename

In [None]:
for element in f.file.streamer_named("ElementLinkBase").elements:
    print(f"{element.member('fName')}: {element.member('fTypeName')}")

Uproot can read this, but the loop that deserializes the data runs in Python and is therefore slow.

This is not relevant for this very small file, but becomes important for larger files.

This can be handled by [AwkwardForth](https://doi.org/10.1051/epjconf/202125103002), which, however, is not yet integrated with uproot (as of November 2021).

For now we can use a custom function `branch_to_array` to do this:

In [None]:
from physlite_experiments.deserialization_hacks import branch_to_array

In [None]:
branch_to_array(f["CollectionTree/AnalysisElectronsAuxDyn.trackParticleLinks"])

One can actually see a significant improvement already for the small file with only 40 events!

In [None]:
%%timeit
# using standard uproot
f.file.array_cache.clear()
f["CollectionTree/AnalysisElectronsAuxDyn.trackParticleLinks"].array()

In [None]:
%%timeit
# using numba
f.file.array_cache.clear()
branch_to_array(f["CollectionTree/AnalysisElectronsAuxDyn.trackParticleLinks"])

In [None]:
%%timeit
# using awkward forth
f.file.array_cache.clear()
branch_to_array(f["CollectionTree/AnalysisElectronsAuxDyn.trackParticleLinks"], use_forth=True)

## Integration with `coffea.nanoevents`

The PHYSLITE schema and the corresponding behavior classes are still under development - [CoffeaTeam/coffea#540](https://github.com/CoffeaTeam/coffea/issues/540) tracks the progress of some TODO items.

For more information on `NanoEvents` see the [NanoEvents tutorial](https://github.com/CoffeaTeam/coffea/blob/master/binder/nanoevents.ipynb) or [Nick Smith's presentation](https://youtu.be/udzkE6t4Mck) at [PyHEP 2020](https://indico.cern.ch/event/882824).

<div class="alert alert-block alert-success">
    <b>The Goal:</b>
    <ul>
        <li>Work with object-oriented event data models, but stick to the array-at-a-time processing paradigm.<br> → Struct/Object of arrays instead of Array of structs/objects</li>
        <li>Hide the details from the user</li>
    </ul>
</div>
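
The difference between the two layouts can be sketched with plain numpy. The electron values below are invented for illustration.

```python
import numpy as np

# Array of structs: one Python object per electron, processed in a loop
electrons_aos = [
    {"pt": 25000.0, "eta": 0.3},
    {"pt": 8000.0, "eta": -1.2},
    {"pt": 41000.0, "eta": 2.1},
]
selected_aos = [electron for electron in electrons_aos if electron["pt"] > 10000]

# Struct of arrays: one array per property, processed array-at-a-time
electrons_soa = {
    "pt": np.array([25000.0, 8000.0, 41000.0]),
    "eta": np.array([0.3, -1.2, 2.1]),
}
mask = electrons_soa["pt"] > 10000
selected_soa = {name: values[mask] for name, values in electrons_soa.items()}

print(selected_aos)
print(selected_soa)
```

Both selections keep the same electrons, but the second one never iterates over individual objects in Python.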

In [None]:
from coffea.nanoevents import NanoEventsFactory, PHYSLITESchema

# patch nanoevents to use the custom branch_to_array function
from physlite_experiments.deserialization_hacks import patch_nanoevents
patch_nanoevents()

In [None]:
factory = NanoEventsFactory.from_root(
    "data/DAOD_PHYSLITE_21.2.108.0.art.pool.root",
    "CollectionTree",
    schemaclass=PHYSLITESchema
)
events = factory.events()

This groups particles and the available properties conveniently under one central `events` array:

* everything is lazy loading
* cross referencing via ElementLinks already implemented for some collections
* particles behave as LorentzVectors (can add them, calculate invariant masses and much more)

See [my tutorial at the IRIS-HEP AGC tools workshop 2021](https://github.com/nikoladze/agc-tools-workshop-2021-physlite) for more technical details.
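
To give an idea of what the LorentzVector behavior computes under the hood, here is a numpy-only sketch of a pairwise invariant mass in the massless approximation. It does not use the coffea behaviors themselves; the function name and the input values are chosen for illustration.

```python
import numpy as np

def invariant_mass_massless(pt1, eta1, phi1, pt2, eta2, phi2):
    """Invariant mass of particle pairs, neglecting the individual particle masses."""
    return np.sqrt(2 * pt1 * pt2 * (np.cosh(eta1 - eta2) - np.cos(phi1 - phi2)))

# Two toy pairs; the first one is back-to-back in phi at equal eta
pt1, eta1, phi1 = np.array([45000.0, 30000.0]), np.array([0.0, 1.0]), np.array([0.0, 0.5])
pt2, eta2, phi2 = np.array([45000.0, 20000.0]), np.array([0.0, -0.5]), np.array([np.pi, 2.0])

mass = invariant_mass_massless(pt1, eta1, phi1, pt2, eta2, phi2)
print(mass)  # same units as pt
```

The calculation acts on whole arrays of pairs at once, which is the same array-at-a-time pattern that `NanoEvents` exposes through object-like syntax.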

In [None]:
events.Electrons

In [None]:
events.Electrons.fields

In [None]:
events.Electrons.trackParticles

In [None]:
events.Electrons.trackParticles.z0

In [None]:
events.Electrons[events.Electrons.pt > 10000].trackParticles

In [None]:
events.TruthElectrons.parents

In [None]:
events.TruthElectrons.parents.children

In [None]:
events.TruthElectrons.parents.children.parents

In [None]:
events.TruthElectrons.parents.children.parents.children.pdgId

In [None]:
events.TruthElectrons.parents.children.parents.children.pdgId.ndim

## Read data via HTTPS from google cloud storage (authentication via rucio)

*Now we are going to do something a bit unusual: instead of importing some utility functions, we will directly execute a Python file containing them. This is because we later want dask to serialize the functions and send them to the workers (which don't have access to our local directory on the submission node). It's a workaround for interactively developing functions that are sent to dask workers on a dask gateway cluster (which is used here). This issue does not occur in a setting where all workers share a filesystem.*

**Let me know if you know a better approach - one alternative is dask's `upload_file` method, but that has its own issues.**
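
The underlying serialization behavior can be illustrated with the standard library alone. The sketch below shows that `pickle` stores an importable function as a reference (module and name) rather than as code, so the receiving process must be able to import that module. Functions defined interactively, for example via `%run`, live in `__main__`, which `cloudpickle` (used by dask) serializes by value instead. `json.dumps` serves purely as a stand-in for a utility function.

```python
import json
import pickle

# An importable function is pickled as "module name + function name", not as code
payload = pickle.dumps(json.dumps)
print(len(payload), payload)

# Unpickling therefore requires that the module can be imported on the receiving side
restored = pickle.loads(payload)
print(restored is json.dumps)
```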

In [None]:
%run utils.py

This gives us the following functions:

In [None]:
setup_rucio_and_proxy, get_signed_url, get_signed_url_worker

We will use them to authenticate to rucio and get signed URLs on Google Cloud Storage (GCS).

For that we have to provide a VOMS proxy. To avoid needing the grid certificate and the VOMS tools on this JupyterHub instance, we create the VOMS proxy elsewhere (on a machine that has the VOMS tools and our grid certificate) and upload it to this notebook:

In [None]:
from ipywidgets import FileUpload
upload = FileUpload()
display(upload)

Then we set up the necessary environment variables (fill in your CERN account name):

In [None]:
setup_rucio_and_proxy(upload.data[-1], rucio_account="nihartma")

Now we should be able to query rucio:

In [None]:
import rucio.client
rucio_client = rucio.client.Client()

Let's get a list of all files in one data period, corresponding to around 10% of the whole Run 2 data - around 10 TB in total:

In [None]:
files = list(rucio_client.list_files("data17_13TeV", "data17_13TeV.periodK.physics_Main.PhysCont.DAOD_PHYSLITE.grp17_v01_p4309"))

In [None]:
files[0]

In [None]:
sum(file["bytes"] for file in files) / 1024 ** 4

The full Run 2 dataset is replicated to GCS. To access it via HTTPS we can ask rucio for a signed URL. Uproot can directly deal with HTTP(S) URLs:

In [None]:
url = get_signed_url(rucio_client, files[0]["scope"], files[0]["name"])

In [None]:
f_remote = uproot.open(url)

In [None]:
f_remote["CollectionTree/AnalysisElectronsAuxDyn.pt"].array()

Some notes on this:

* GCS does not support multi-range requests (equivalent to xrootd vector reads); only single-range requests are allowed
* Single-range requests with the uproot `MultithreadedHTTPSource` are suboptimal
* GCS seems fine with a huge number of parallel requests - this can be done with asyncio
* However, downloading the whole file is often still faster than asynchronous reading of partial chunks (but needs lots of memory)
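
The pattern of many concurrent single-range requests can be sketched with asyncio and a semaphore that caps the number of simultaneous requests. No network is involved: `REMOTE_FILE`, `fetch_range` and `fetch_all` are hypothetical stand-ins, and a real request would send a `Range: bytes=<start>-<stop - 1>` header.

```python
import asyncio

REMOTE_FILE = bytes(range(256)) * 4  # stand-in for a remote object

async def fetch_range(start, stop, semaphore):
    """Simulate one single-range request for the bytes [start, stop)."""
    async with semaphore:
        await asyncio.sleep(0.01)  # pretend network latency
        return REMOTE_FILE[start:stop]

async def fetch_all(ranges, limit=100):
    """Request all ranges concurrently, with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)
    tasks = [fetch_range(start, stop, semaphore) for start, stop in ranges]
    return await asyncio.gather(*tasks)

ranges = [(0, 10), (100, 120), (500, 512)]
chunks = asyncio.run(fetch_all(ranges))
print([len(chunk) for chunk in chunks])
```

Inside a notebook, where an event loop is already running, `await fetch_all(ranges)` would be used instead of `asyncio.run`.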

In [None]:
import requests

def download(url):
    return requests.get(url).content

In [None]:
data = download(url)

In [None]:
import io

uproot.open(io.BytesIO(data))["CollectionTree/AnalysisElectronsAuxDyn.pt"].array()

I have an experimental implementation of an asyncio `HTTPSource` for uproot (I should probably make a PR for uproot at some point, or consider using an interface to fsspec, which has a `cat_ranges` method that might be used for this).

GCS seems fine with 100 parallel TCP connections (even for each worker on a larger cluster):

In [None]:
from physlite_experiments.io import AIOHTTPSource

class AIOHTTP100Source(AIOHTTPSource):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, tcp_connection_limit=100, **kwargs)

In [None]:
uproot.open(url, http_handler=AIOHTTP100Source)["CollectionTree/AnalysisElectronsAuxDyn.pt"].array()

## Run an actual analysis with this

## Run on a dask cluster