# dask-awkward IO Tutorial

_last updated 2024-01-26_

In [1]:
from __future__ import annotations
import awkward
import dask_awkward
print("package versions:")
print(f"awkward:       {awkward.__version__}")
print(f"dask-awkward:  {dask_awkward.__version__}")

## I. The Basics

### Data on disk

`dask-awkward` supports a number of file formats out-of-the-box. Those include:

- Parquet
- JSON
- Plain text


> _Note_: The [uproot project](https://github.com/scikit-hep/uproot5) provides the `uproot.dask` module for reading the ROOT file format into dask-awkward collections. This tutorial will focus on file formats that have support baked into `dask-awkward`.

Since this is Dask, data on disk is _staged_ for reading when using a dask-awkward read function. Data will not be read until a `compute` (or `persist`) call is reached.

**Note**: we can also create dask-awkward `Array` collections from other Dask collections (bag, array, dataframe, delayed) or from data that already exists in memory (awkward arrays and Python lists).

Let's jump into a quick example reading a Parquet dataset.

In [2]:
import os
import dask_awkward as dak
# directory containing the toy Parquet files
pq_dir = os.path.join("data", "parquet")
dataset = dak.from_parquet(pq_dir)

In [3]:
dataset

dask.awkward<from-parquet, npartitions=4>

By default the `dak.from_parquet` function will partition by file. Since the directory has four files, we will have four partitions in our Dask collection.

Without computing anything, the `from_parquet` function has extracted some metadata, so we are able to get a peek at the structure of what the eventual awkward array would be if we did compute this collection:

In [4]:
dataset._meta

You'll notice this dataset has two top level fields:
- `type`
- `scoring`

Inside of the `scoring` field (or Parquet column) are three subfields (`scoring` is a `Record` in awkward array terminology):

- `player`
- `basket`
- `distance`

We can also see that for each element in the top level array, we have exactly one entry for the `type` field and a variable number of `scoring` entries (this is the raggedness of the array).

The data we have here is some made up data about basketball games/matches. Each game is labeled as either a "friendly" match or a "league" match. Each game has some number of total scores, each score being made by some player as some type of basket at some distance. The raggedness of the array comes from each match having a different total number of scores.

Since this first section of the tutorial is meant to show the basics of the IO functions, we won't worry too much about the details of the dataset, but we will revisit the structure in the next section!

Since this tutorial is using a small toy dataset we can easily compute it quickly to see a concrete awkward array:

In [5]:
computed_dataset = dataset.compute()

In [6]:
computed_dataset

With Parquet, we can restrict our data reading to grab only a specific set of columns from the files. In this toy dataset, if we only care about the players who did some scoring, we can specify that:

In [7]:
dataset = dak.from_parquet(pq_dir, columns=["scoring.player"])

In [8]:
dataset._meta

Notice that when we peek at the metadata now, we see our array is going to contain less information, as expected! If we tried to access one of the fields we didn't request, we'd hit an `AttributeError` (before compute time!). Since we are able to track metadata at graph construction time, we can fail as early as possible:

In [9]:
dataset.scoring.distance

AttributeError: distance not in fields.

Let's go back to the original dataset and save it to JSON after repartitioning the collection:

In [14]:
dataset = dak.from_parquet(pq_dir)
smaller_partition_dataset = dataset.repartition(15)
dak.to_json(smaller_partition_dataset, os.path.join("data", "json"))

`dask-awkward`'s `to_*` functions get a bit of special treatment compared to other dask-awkward functions. They are the only parts of dask-awkward that are _eagerly_ computed by default: the `to_*` functions have a `compute=` argument that defaults to `True`. If you'd like to stage a data writing step without computing it, you can write:

In [15]:
write_it = dak.to_json(smaller_partition_dataset, os.path.join("data", "json2"), compute=False)

In [16]:
write_it

dask.awkward<to-json, type=Scalar, dtype=float64>

Notice that the `write_it` object is a dask-awkward `Scalar` collection that can be computed.

In [17]:
write_it.compute()

Now we can reload our data with `dak.from_json`. In practice, converting data stored in Parquet to JSON just to read it back later is likely a bad idea! We're doing it here only to show example usage of the dask-awkward API.

In [18]:
dataset = dak.from_json(os.path.join("data", "json"))

In [19]:
dataset

dask.awkward<from-json-files, npartitions=15>

## II. Column (buffer) optimization

Dask workflows can be separated into two stages: task graph construction, followed by task graph execution. During task graph construction we track metadata about our awkward array collections; with that metadata we are able, just before execution time, to know which parts of the Array are necessary to complete a computation. This is possible by running the task graph on a metadata-only version of the arrays. When we run the metadata task graph, components of the data-less array are "touched" by the execution of the graph, and each touched component marks a part of the data on disk that needs to be read.
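The idea of "touching" can be illustrated with a toy model in plain Python. This is not how dask-awkward implements it; `TouchRecorder` is a hypothetical helper written only for this sketch, and it records which leaf fields a computation accesses during a data-free dry run.

```python
class TouchRecorder:
    """Hypothetical data-less stand-in for a record array that logs accessed fields."""

    def __init__(self, fields, touched, prefix=""):
        self._fields = fields    # nested dict: field name -> subfields, or None for a leaf
        self._touched = touched  # shared set collecting dotted column names
        self._prefix = prefix

    def __getattr__(self, name):
        # Only called for names that are not regular attributes.
        if name.startswith("_") or name not in self._fields:
            raise AttributeError(f"{name} not in fields.")
        path = f"{self._prefix}{name}"
        sub = self._fields[name]
        if sub is None:
            self._touched.add(path)  # leaf column: would need to be read from disk
            return path
        return TouchRecorder(sub, self._touched, prefix=f"{path}.")


# Layout with the same field names as the tutorial dataset.
layout = {"type": None, "scoring": {"player": None, "basket": None, "distance": None}}
touched = set()
meta = TouchRecorder(layout, touched)


def computation(array):
    # A computation that only looks at two of the four columns.
    _ = array.scoring.basket
    _ = array.scoring.distance


computation(meta)  # dry run: no data is involved
necessary = sorted(touched)
```

After the dry run, `necessary` holds only the columns the computation accessed; `type` and `scoring.player` were never touched.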

Let's look at a quick example with Parquet. Recall the dataset from the previous section. We have these columns:

- `type`
- `scoring.player`
- `scoring.basket`
- `scoring.distance`

If we want to calculate the average distance of the scored baskets in each game, ignoring all free throws, we can do so like this:

In [20]:
dataset = dak.from_parquet(pq_dir)
free_throws = dak.str.match_substring(dataset.scoring.basket, "freethrow")
distances = dataset.scoring.distance[free_throws == False]
result = dak.mean(distances, axis=1)

The `result` will be the average distance of the non-free-throw shots in each game. Notice we only used two of the four columns: `scoring.basket` and `scoring.distance`. If we wanted to be explicit about it, we could use the `columns=` argument in the `dak.from_parquet` call. But we can also rely on dask-awkward to do this for us! The column/buffer optimization detects that the graph only needs those columns and rewrites the internal `ak.from_parquet` call at the node in the task graph that actually reads the data from disk. We can see this logic without running the compute by using the `dak.necessary_columns` function:

In [21]:
dak.necessary_columns(result)

{'from-parquet-b7916bd949c3744cf0ec38dea00d0bd6': frozenset({'scoring.basket',
            'scoring.distance'})}

We see the name of the input layer, and the names of the columns that are going to be read by that input layer.

This also works with JSON. Awkward Array's `from_json` has a feature that allows users to pass in a JSONSchema that instructs the reader which parts of the JSON dataset should be read. The reader still has to process all of the bytes in the text-based file format, but with a schema declared, the reader can skip over unneeded keys in the JSON, saving memory and time during array building.

Here's the same computation but starting with a JSON dataset:

In [22]:
dataset = dak.from_json(os.path.join("data", "json"))
free_throws = dak.str.match_substring(dataset.scoring.basket, "freethrow")
distances = dataset.scoring.distance[free_throws == False]
result = dak.mean(distances, axis=1)

In [23]:
dak.necessary_columns(result)

{'from-json-files-6eebaf87f3a09a08c1234137dd381b61': frozenset({'scoring.basket',
            'scoring.distance'})}

We see the exact same necessary columns.

A final detail: the JSON schema that is passed to the reading node is generated with `dak.layout_to_jsonschema`. Once the column/buffer optimization has determined which fields will be necessary, we can select those fields from the awkward array form that we start with after the `dak.from_json` call. We then generate an awkward array layout from the sub-form produced by selecting that subset of columns. Finally, we create a JSONSchema from that layout.

In our small example here, we know the columns are `scoring.basket` and `scoring.distance`. We can show this step manually (starting with the first array collection created with the `dak.from_json` call):

In [24]:
# create the subform based on the columns we need:
subform = dataset.form.select_columns(["scoring.basket", "scoring.distance"])
# create an awkward array layout:
sublayout = subform.length_zero_array(highlevel=False)
# and convert that to JSONSchema:
necessary_schema = dak.layout_to_jsonschema(sublayout)

In [25]:
necessary_schema

{'title': 'untitled',
 'description': 'Auto generated by dask-awkward',
 'type': 'object',
 'properties': {'scoring': {'type': 'array',
   'items': {'type': 'object',
    'properties': {'basket': {'type': 'string'},
     'distance': {'type': 'number'}}}}}}
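To see what such a schema selects, here is a toy projection in plain Python. It is only an illustration: `project` is a hypothetical helper that filters a record after decoding, whereas the real reader skips unneeded keys while parsing. The sample record is invented.

```python
import json

# The schema shown above, copied verbatim.
necessary_schema = {
    "title": "untitled",
    "description": "Auto generated by dask-awkward",
    "type": "object",
    "properties": {
        "scoring": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "basket": {"type": "string"},
                    "distance": {"type": "number"},
                },
            },
        }
    },
}


def project(value, schema):
    """Keep only the parts of a decoded JSON value that the schema names."""
    if schema.get("type") == "object":
        props = schema.get("properties", {})
        return {key: project(value[key], sub) for key, sub in props.items() if key in value}
    if schema.get("type") == "array":
        return [project(item, schema["items"]) for item in value]
    return value  # leaf (string, number, ...): keep as-is


# One invented line of a line-delimited JSON file.
line = json.dumps(
    {"type": "league", "scoring": [{"player": "ann", "basket": "layup", "distance": 2.0}]}
)
kept = project(json.loads(line), necessary_schema)
```

The `type` and `scoring.player` keys are dropped because the schema does not mention them.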

The column optimization can be turned off when running dask-awkward graphs with the config parameter `awkward.optimization.enabled`. By default this setting is `True`. We can run the same compute with the optimization turned off via:

In [26]:
import dask

with dask.config.set({"awkward.optimization.enabled": False}):
    result.compute()

This can be useful for debugging. If the compute fails with the optimization enabled but succeeds with it disabled, then there is likely a bug in dask-awkward or awkward-array that should be reported!