# Array Scheduling Optimization

In this example, we'll perform a typical workflow:

* Load batches of data
* Stack batches into a single Dask Array
* Rechunk the data for downstream processing
* Write rechunked data to disk, using [Zarr](https://zarr.readthedocs.io)

Depending on the rechunking pattern, this is an embarrassingly parallel operation. However,
the Dask scheduler doesn't necessarily know that, which may result in sub-optimal scheduling. We'll configure the optimization settings so that tasks are scheduled in a way that minimizes data transfer across the cluster.

In [None]:
!conda install -y -c conda-forge zarr

In [None]:
from dask.distributed import Client
import dask.array as da
import zarr

client = Client(processes=False, threads_per_worker=4,
                n_workers=1, memory_limit='2GB')
client

Rather than loading data from disk, we'll generate random data in memory.
We'll have 25 batches, each of shape `(2_000_000,)`, split into chunks of 90,000 elements each.

We rechunk along just the first axis (preserving the chunking on the second axis).
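The chunk layout these numbers imply can be worked out with plain arithmetic. The sketch below is only illustrative and uses no Dask at all; it assumes that stacking turns each 1-D batch into a single row (so the stacked array has one row-chunk per batch) and that a requested row chunk larger than the array is capped at the number of rows. All variable names are made up for this example.

```python
import math

n_batches = 25          # number of 1-D batches
batch_len = 2_000_000   # elements per batch
col_chunk = 90_000      # chunk length along the data axis
row_chunk = 50          # requested chunk length along the stacked axis

# Chunks per batch: 22 full chunks plus one shorter trailing chunk.
chunks_per_batch = math.ceil(batch_len / col_chunk)
last_chunk_len = batch_len - (chunks_per_batch - 1) * col_chunk

# Assumption: after stacking, every batch is its own row-chunk.
blocks_before = n_batches * chunks_per_batch

# Assumption: the requested row chunk is capped at the number of rows.
rows_per_block = min(row_chunk, n_batches)
row_blocks = math.ceil(n_batches / rows_per_block)
blocks_after = row_blocks * chunks_per_batch

print(chunks_per_batch, last_chunk_len)  # 23 20000
print(blocks_before, blocks_after)       # 575 23
print(rows_per_block)                    # 25
```

Under these assumptions, each of the 23 output blocks is assembled from 25 input blocks, one per batch.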

In [None]:
inputs = [da.random.random(size=2_000_000, chunks=90_000)
          for _ in range(25)]
inputs_stacked = da.vstack(inputs)
inputs_rechunked = inputs_stacked.rechunk((50, 90_000))

In [None]:
inputs_stacked

In [None]:
inputs_rechunked

And we'll set up the writing to Zarr.

In [None]:
store = zarr.DirectoryStore('spike.zarr')
root = zarr.group(store, overwrite=True)
dest = root.empty_like(name='dest', data=inputs_rechunked, 
                       chunks=inputs_rechunked.chunksize,
                       overwrite=True)

Examining the structure of the stacking and rechunking, we can see that the problem is embarrassingly parallel. We'll take a look at the task graph for two of the blocks.

In [None]:
inputs_rechunked.blocks[0, :2].visualize(optimize_graph=True)

These different chains of computation -- from data loading to stacking to rechunking (and eventually writing) -- are completely independent. There are no shared dependencies between chains.

Dask prefers to execute graphs depth first. If you zoom in on the visualization below, which is colored by the order in which Dask prefers to execute tasks, you'll notice that Dask wants to execute all the tasks from the first chain, then all the tasks from the second, and so on. (In practice, Dask will notice that the data-generating tasks from the second chain are ready to execute before the concatenate or rechunk tasks in the first chain.)

In [None]:
inputs_rechunked.blocks[0, :2].visualize(color='order', optimize_graph=True)

What does this imply for the distribution of tasks across a cluster of machines?

Because Dask would like to complete all the tasks from a single chain as quickly as possible, the tasks in that chain end up being scheduled on different machines. For the (perhaps more typical) non-embarrassingly parallel workload, this is fine. But for embarrassingly parallel workloads it's not optimal. The rechunk / reduction step may end up needing data from two different machines, requiring a data transfer. But we know that shouldn't be necessary in this case, since it's an embarrassingly parallel problem at a higher level.
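To make the cost of spreading a chain across machines concrete, here is a toy placement model. It is not Dask's scheduler: the worker count, the round-robin placement rule, and every name in it are assumptions chosen purely for illustration. It counts how many input blocks would have to move to the worker that runs each chain's rechunk step.

```python
n_chains = 23          # independent output blocks (hypothetical count)
inputs_per_chain = 25  # input tasks feeding each output block
n_workers = 4          # hypothetical cluster size

def transfers(placement):
    """Count inputs placed on a different worker than their chain's reducer."""
    moved = 0
    for chain in range(n_chains):
        reducer_worker = chain % n_workers  # where the rechunk step runs
        for i in range(inputs_per_chain):
            if placement(chain, i) != reducer_worker:
                moved += 1
    return moved

# Toy rule 1: spread each chain's inputs across workers round-robin.
spread = transfers(lambda chain, i: (chain + i) % n_workers)

# Toy rule 2: keep every input on the worker that runs its chain's reducer.
colocated = transfers(lambda chain, i: chain % n_workers)

print(spread, colocated)  # 414 0
```

In this toy model, co-locating each chain removes every transfer, which is the outcome we want from the real scheduler.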

Let's actually write the data, and monitor the dashboard while writing.

In [None]:
%time _ = inputs_rechunked.store(dest, lock=False)

In this case we end up with many small tasks. Depending on several factors (the ratio of machines to cores on the cluster, the memory available on the cluster, bandwidth between machines, network latency), tasks from a single chain may end up on different machines, requiring transfer (highlighted in red on the dashboard). See https://github.com/dask/dask/issues/5105 for an example.

We can achieve the desired scheduling by having Dask *fuse* chains of tasks. See `dask.optimization.fuse` for more; the default values aren't aggressive enough for this computation. We want to ensure that the 25 original inputs (which we created with `da.random.random`) are fused into a single task. This ensures they'll be executed on a single machine, so the following rechunking will happen on the same machine.
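The cells that follow pass the setting as the keyword `fuse_ave_width`. Recent Dask releases document the same setting under the namespaced key `optimization.fuse.ave-width`. Assuming such a release, a minimal sketch of setting and reading back that key looks like this:

```python
import dask

# Namespaced form of the fusion-width setting (assumes a recent Dask release).
with dask.config.set({"optimization.fuse.ave-width": 25}):
    inside = dask.config.get("optimization.fuse.ave-width")

print(inside)  # 25
```

Because `dask.config.set` is used as a context manager, the override only applies inside the `with` block.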

In [None]:
import dask

In [None]:
with dask.config.set(fuse_ave_width=25):
    display(inputs_rechunked.blocks[0, :2].visualize(optimize_graph=True))

Let's perform the write, and again observe the dashboard.

In [None]:
%%time

with dask.config.set(fuse_ave_width=25):
    write = inputs_rechunked.store(dest, lock=False)

Overall, we see larger tasks and no communication. Both contribute to a faster runtime.

The necessary value for `fuse_ave_width` depends strongly on the computation. If we increase the number of batches to, say, 75, we need to increase `fuse_ave_width` accordingly.

In [None]:
inputs = [da.random.random(size=2_000_000, chunks=90_000)
          for _ in range(75)]  # increased from 25 -> 75
inputs_stacked = da.vstack(inputs)
inputs_rechunked = inputs_stacked.rechunk((50, 90_000))

store = zarr.DirectoryStore('spike.zarr')
root = zarr.group(store, overwrite=True)
dest = root.empty_like(name='dest', data=inputs_rechunked, 
                       chunks=inputs_rechunked.chunksize,
                       overwrite=True)

In [None]:
inputs_stacked

In [None]:
inputs_rechunked

In this case, the largest number of tasks being reduced into a single block is 50, the size of the blocks along the first axis produced by `inputs_stacked.rechunk((50, 90_000))`. So we need to increase `fuse_ave_width` to at least 50.
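The width needed for a given number of batches can be estimated ahead of time. The helpers below are hypothetical (they are not part of Dask) and assume that each batch contributes exactly one row, and that rechunking splits the rows into consecutive groups of at most `row_chunk` rows.

```python
def row_chunk_sizes(n_rows, row_chunk):
    """Split n_rows into consecutive chunks of at most row_chunk rows."""
    full, rest = divmod(n_rows, row_chunk)
    return [row_chunk] * full + ([rest] if rest else [])

def min_fuse_width(n_batches, row_chunk=50):
    """Largest number of single-row input blocks merged into one output block."""
    return max(row_chunk_sizes(n_batches, row_chunk))

print(row_chunk_sizes(25, 50))                 # [25]
print(row_chunk_sizes(75, 50))                 # [50, 25]
print(min_fuse_width(25), min_fuse_width(75))  # 25 50
```

This matches the two values used in this example: 25 for the original 25 batches and 50 for 75 batches.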

In [None]:
with dask.config.set(fuse_ave_width=50):
    display(inputs_rechunked.blocks[0, :2].visualize(optimize_graph=True))

Without fusion.

In [None]:
%time write = inputs_rechunked.store(dest, lock=False)

With fusion.

In [None]:
%%time

with dask.config.set(fuse_ave_width=50):
    write = inputs_rechunked.store(dest, lock=False)

Fusing tasks is not an unambiguously good thing. As the docs for `dask.optimization.fuse` state:

> This trades parallelism opportunities for faster scheduling by making tasks less granular.

In this case, *we* know that we can already achieve ideal parallelism thanks to the coarse-grained, embarrassingly parallel nature of the problem. We're happy to have less fine-grained parallelism since we know we'll still be able to saturate the cluster.
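A rough check of the saturation claim: if every output block becomes one fused task (an assumption about how the fused graph looks, not something this sketch verifies), we can compare the number of independent tasks with the 4 threads configured on the `Client` at the top of this example. The function name is made up for illustration.

```python
import math

n_threads = 4  # threads_per_worker * n_workers from the Client above

def independent_chains(n_batches, batch_len=2_000_000,
                       row_chunk=50, col_chunk=90_000):
    """Number of output blocks, each assumed to be one fused task."""
    row_blocks = math.ceil(n_batches / row_chunk)
    col_blocks = math.ceil(batch_len / col_chunk)
    return row_blocks * col_blocks

for n_batches in (25, 75):
    chains = independent_chains(n_batches)
    print(n_batches, chains, chains >= n_threads)
# 25 23 True
# 75 46 True
```

In both configurations there are several times more independent tasks than threads, so coarser tasks still leave enough work to keep every thread busy.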