<pre> &#8593;&#8593;&#8593;&#8593;&#8593;&#8593;</pre>

**To view this notebook as a slideshow click on the deck icon ![deck](https://raw.githubusercontent.com/deathbeds/jupyterlab-deck/main/docs/_static/deck.svg) above.**

For a better slideshow experience, set font sizes to 24px by going to Settings -> Fonts -> Code/Content -> Size.

<center>
<h1>RNTuple in Uproot</h1>
<h2>Andres Rios-Tascon</h2>
<img src="images/PU_lockup.png" style="height:50px;"/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<img src="images/Iris-hep-4-no-long-name.png" style="height:50px;"/>
</center>

## Outline

- Introduction and motivation for RNTuple.
- Status of RNTuple reading and writing support in Uproot.
- Quick demo.
- Future work and outlook.

## What is RNTuple and why should we care?

- `RNTuple` is a modern serialization format that will replace `TTree`.

- `TTree` has become outdated and bloated.
  - Inefficient storage and reading of nested and/or jagged collections.
  - Lots of special cases and hacky implementations.
  - Virtually impossible to fully support in `uproot`.
 
<center><img src="images/ttree_current_status.svg" style="height:400px;"/></center>

- `RNTuple` will bring many improvements.
  - Simple and modern design (and has a formal spec).
  - Focuses on native data types.
  - Columnar layout very similar to `awkward`.
  - Much better performance, and designed for parallelization.
  - Simpler design should allow for almost 100% support in `uproot`.

## RNTuple performance comparison

<br/>
<br/>
<br/>

<center>
<img src="images/rntuple_comparison.png" style="height:200px;"/>
</center>

<br/>
<br/>
<br/>

Image taken from [arXiv:2204.09043](https://arxiv.org/abs/2204.09043).

## RNTuple timeline

<br/>
<br/>
<br/>

<center>
<img src="images/rntuple_timeline.png" style="height:100px;"/>
</center>

<br/>
<br/>
<br/>

Image taken from <https://doi.org/10.1051/epjconf/202429506020>.

**Version 1.0.0 of the specification was released at the beginning of the year!** ([pdf here](https://cds.cern.ch/record/2923186/files/CERN-OPEN-2025-001.pdf))

<center><img src="images/rntuple_v1.png" style="height:400px;"/></center>

The specification will continue to be updated to add more features, but we already have a stable binary format that we can start using.

## RNTuple in Uproot

- Initial implementation was written by Jerry Ling.
  - Basic reading support.
  - Some scaffolding for writing support.

- The `RNTuple` spec has changed significantly since Jerry worked on it, which broke the existing implementation.

- We have fixed and reworked the reading functionality and, to the best of my knowledge, now cover **100%** of the current specification!

- Writing support is also already in good shape.

<center><img src="images/rntuple_current_status.svg" style="height:400px;"/></center>

- The current focus is on implementing functionality that exists for `TTree` to ensure a smooth transition.

<center>
<h2>Let's look at a concrete example</h2>
</center>

## Example RNTuple

Let's consider an example<a name="cite_ref-1"></a>[<sup>[1]</sup>](#cite_note-1) where we have the following data:

| Trigger (bool) | Missing ET {float, float} | Lepton ids (vector) |
| -------------- | ------------------------- | ------------------- |
| False          | {et: 79.7, phi: 2.83}     | []                  |
| True           | {et: 78, phi: 0.62}       | [11, -11]           |
| False          | {et: 10, phi: -2.78}      | [-13, -11]          |
| True           | {et: 14.3, phi: 1.31}     | [11, 11, -13]       |
| True           | {et: 83.2, phi: 2.76}     | [11]                |

<br/>
<br/>

<a name="cite_note-1"></a>1. [^](#cite_ref-1) This example is based on [this talk](https://indico.cern.ch/event/1222943/) by Jerry Ling.

## Data layout

```mermaid
flowchart BT
    A[\"(top level)"/]
    B("trig (bool)") --> A
    C("met (struct)") --> A
    D("lep_pid (std::vector&lt;int&gt;)") --> A
    E[("column (data)")] --> B
    F("et (float)") --> C
    G("phi (float)") --> C
    H[("column (data)")] --> F
    I[("column (data)")] --> G
    J[("column (offset)")] --> D
    K("_0 (int)") --> D
    L[("column (data)")] --> K
```

We will see that this very closely matches the data layout in `awkward`!
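To make the layout concrete, here is a minimal pure-Python sketch of how the three fields from the example table break down into flat columns. The variable names (`events`, `lep_offsets`, `lep_data`, and so on) are illustrative only, and the offsets follow the `awkward` convention of a leading zero; the exact on-disk encoding is defined by the RNTuple specification and may differ in detail.

```python
# The five example events, written row by row (as in the example table).
events = [
    {"trig": False, "met": {"et": 79.7, "phi": 2.83}, "lep_pid": []},
    {"trig": True, "met": {"et": 78.0, "phi": 0.62}, "lep_pid": [11, -11]},
    {"trig": False, "met": {"et": 10.0, "phi": -2.78}, "lep_pid": [-13, -11]},
    {"trig": True, "met": {"et": 14.3, "phi": 1.31}, "lep_pid": [11, 11, -13]},
    {"trig": True, "met": {"et": 83.2, "phi": 2.76}, "lep_pid": [11]},
]

# Primitive field: a single data column.
trig_data = [ev["trig"] for ev in events]

# Struct field: no column of its own, one data column per member.
met_et_data = [ev["met"]["et"] for ev in events]
met_phi_data = [ev["met"]["phi"] for ev in events]

# Vector field: an offset column plus a data column with the flattened items.
lep_offsets = [0]
lep_data = []
for ev in events:
    lep_data.extend(ev["lep_pid"])
    lep_offsets.append(len(lep_data))


def lep_pid(i):
    """Recover the lepton ids of event i from the two columns."""
    return lep_data[lep_offsets[i]:lep_offsets[i + 1]]


print(lep_offsets)
print(lep_data)
print(lep_pid(3))
```

Reading a single event only needs two neighbouring offsets and one slice of the data column, which is what makes this layout cheap to read in bulk.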

## ROOT code

We can create an RNTuple with this data by using the following ROOT code.

In [None]:
from IPython.display import Code
with open("example_rntuple.C") as f:
    code = f.read()
Code(code, language='cpp')

## Using `uproot` to read this RNTuple

As a quick reminder, `uproot` can be installed with `pip install uproot` or `conda install -c conda-forge uproot`.

In [None]:
# Only run this cell if you're using JupyterLite
import sys
if sys.platform == "emscripten":
    import awkward_cpp
    %pip install awkward==2.7.2
    %pip install uproot

Let's start by importing `uproot`.

In [None]:
import uproot

Let's now open this example file and see what's inside.

In [None]:
f = uproot.open("data/example_rntuple.root")
f.classnames()

Let's now open this RNTuple and briefly inspect the data layout that we discussed before.

In [None]:
ntpl = f["ntpl"]

In [None]:
ntpl.show()

In [None]:
for i, fr in enumerate(ntpl.field_records):
    print(f"field_name={fr.field_name:<7} type_name={fr.type_name:<25} idx={i} parent_idx={fr.parent_field_id}")

In [None]:
for cr in ntpl.column_records:
    print(f"idx={cr.idx}, field_id={cr.field_id}, type={cr.type:0>2}, nbits={cr.nbits:0>2}")
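The parent indices of the field records are enough to rebuild the field tree. The sketch below uses a hand-written list of `(field_name, parent_idx)` pairs rather than the real `field_records`; the ordering is an assumption made for illustration, and it relies on the convention that a top-level field lists itself as its own parent.

```python
# Hypothetical (field_name, parent_idx) pairs; position in the list = field id.
fields = [
    ("trig", 0),
    ("met", 1),
    ("lep_pid", 2),
    ("et", 1),
    ("phi", 1),
    ("_0", 2),
]


def build_tree(fields):
    """Return the top-level field ids and a map from field id to child ids."""
    children = {i: [] for i in range(len(fields))}
    top_level = []
    for i, (_, parent) in enumerate(fields):
        if parent == i:
            # A field that is its own parent sits directly under the top level.
            top_level.append(i)
        else:
            children[parent].append(i)
    return top_level, children


def render(fields, top_level, children):
    """Produce one indented line per field, depth first."""
    lines = []

    def visit(i, depth):
        lines.append("  " * depth + fields[i][0])
        for child in children[i]:
            visit(child, depth + 1)

    for i in top_level:
        visit(i, 0)
    return lines


top_level, children = build_tree(fields)
tree_lines = render(fields, top_level, children)
print("\n".join(tree_lines))
```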

### Let's now actually read the data and put it into arrays!

In [None]:
arrays = ntpl.arrays()
arrays

Now everything works in the usual `awkward` fashion.

In [None]:
arrays.lep_pid

### We can already read complex files

Here is an example of a file produced by an ATLAS workflow.

In [None]:
filename = "data/uproot-physlite-rntuple_v1-0-0-0.root"

f = uproot.open(filename)

In [None]:
f.classnames()

In [None]:
ntpl = f["EventData"]
ntpl.fields

In [None]:
arrays = ntpl.arrays(filter_name="AnalysisSiHitElectronsAuxDyn*")
arrays

Since we can already read 100% of the spec, we can handle advanced features such as:

- Schema evolution (i.e. fields and columns that were added after some data was already written)
- Multiple representations (i.e. columns whose data type changes over the course of writing)
- Variable-length floating points (i.e. truncated or quantized floats used to save space)
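As a rough illustration of the last item, the sketch below shows the two ideas in plain Python: truncation keeps only the most significant bits of a 32-bit float, and quantization maps a value in a known range onto a small integer. The function names and the `lo`/`hi` range parameters are hypothetical, and this is not the exact encoding used by RNTuple or `uproot`.

```python
import struct


def truncate_float32(x, nbits):
    """Keep only the nbits most significant bits of the float32 encoding of x."""
    if not 10 <= nbits <= 32:
        raise ValueError("nbits must be between 10 and 32")
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    mask = (0xFFFFFFFF << (32 - nbits)) & 0xFFFFFFFF
    (y,) = struct.unpack("<f", struct.pack("<I", bits & mask))
    return y


def quantize(x, lo, hi, nbits):
    """Map x in [lo, hi] onto an integer in [0, 2**nbits - 1]."""
    levels = (1 << nbits) - 1
    return round((x - lo) / (hi - lo) * levels)


def dequantize(q, lo, hi, nbits):
    """Approximate inverse of quantize."""
    levels = (1 << nbits) - 1
    return lo + q / levels * (hi - lo)


# 16 bits leave 1 sign bit, 8 exponent bits and 7 mantissa bits.
print(truncate_float32(79.7, 16))

# An angle stored with 8 bits over an assumed range of [-3.15, 3.15].
q = quantize(0.62, -3.15, 3.15, 8)
print(q, dequantize(q, -3.15, 3.15, 8))
```

Both approaches trade precision for size: truncation bounds the relative error, while quantization bounds the absolute error to half a step of the chosen range.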

### Writing support is also in good shape

In [None]:
import awkward as ak
import numpy as np

data = ak.Array(
    {
        "bool": [True, False, True],
        "int": [1, 2, 3],
        "float": [1.1, 2.2, 3.3],
        "jagged_list": [[1], [2, 3], [4, 5, 6]],
        "nested_list": [[[1], []], [[2], [3, 3]], [[4, 5, 6]]],
        "string": ["one", "two", "three"],
        "utf8_string": ["こんにちは", "⚛️💫🎆😀", "ǧ̸̛̫͍̰͖̟̈͛͑͆̆̌̃̉̅̄̔̈́̀̔͆̄͋̍͐͂̎͗̈́͒͘͝ͅö̴̮̝̪̬͎͚̜̖̜͖̞̤͕̙͂̀̀̊͛͑̈́͛͐͊͂͂̇͛̾̔͐͆͑͂̓̅̀͘͘͘̕͝͠͝͝ơ̶͍̙̻̾̈́̓̈́̀̅͑ḑ̷͚̠̹̗͉͙̞͇͕̼̲̥͉̯̞͕̲̻̞͗̓̃̊̅͗͊͊́̑̈́̎͋̇̓͛̅͜͜͠͝ͅb̷̢̢̨̨̛̛̘̠̞̰̺̘̰̖̺̞̱͇̰̙̲̱̪͕͎͉̖̞͇̹̮͙͋̀͑͂̈́̇͛̐͊̀̇͆̓̋̀̿̋̂̅̀̌̑̓̽͊̂͑̈̇̚͜͝y̶̗͇̠̞͚̦̮̦͈̹̥̋̓̓̈́̐̆̀̄̋̂̀̇͋̎̚͜͝ȩ̷̢̡͇̮̩̹̥̬̰͎͔̬̩̰̯͍̲͎̭͉̬̣̻̖͍̥̟̪͕̫̟̋̔̀͆̑̈́̐̃͐͌̍͒̔̈́̃̈́̐̔̾͊̿̓͆͑̚͜͝͝͝ͅ"],
        "regular": ak.Array(
            ak.contents.RegularArray(
                ak.contents.NumpyArray([1, 2, 3, 4, 5, 6, 7, 8, 9]), 3
            )
        ),
        "numpy_regular": np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
        "struct": [{"x": 1, "y": 2}, {"x": 3, "y": 4}, {"x": 5, "y": 6}],
        "struct_list": [
            [{"x": 1}, {"x": 2}],
            [{"x": 3}, {"x": 4}],
            [{"x": 5}, {"x": 6}],
        ],
        "tuple": [(1, 2), (3, 4), (5, 6)],
        "tuple_list": [[(1,), (2,)], [(3,), (4,)], [(5,), (6,)]],
        "optional": [1, None, 2],
        "union": [1, 2, "three"],
        "optional_union": [1, None, "three"],
    }
)

with uproot.recreate("my_file.root") as file:
    obj = file.mkrntuple("ntuple", data)

In [None]:
f = uproot.open("my_file.root")
f.classnames()

In [None]:
ntpl = f["ntuple"]
arrays = ntpl.arrays()
arrays

## Tracking status of RNTuple support in Uproot

I opened [this issue](https://github.com/scikit-hep/uproot5/issues/1382) as a central place to track the current status of RNTuple support. I keep it up to date and add new items as they come up.

## Future work and outlook

- Although `RNTuple` reading and writing are already in good shape, there is still a significant amount of work that needs to be done to get to the level of support offered for `TTree`s.

- Reading with `Dask` already works at a basic level, but writing with `Dask` and reading/writing with `coffea` still need to be implemented.

- `uproot` will become even more useful than it already is, as `RNTuple` becomes the primary data format. Having support for virtually 100% of the `RNTuple` spec means that, in many cases, it can fully replace `ROOT` for reading and writing.

<center><img src="images/rntuple_goal.svg" style="height:400px;"/></center>