# Dask integration

recursive-diff supports {class}`xarray.DataArray` and {class}`xarray.Dataset` objects backed by [Dask](https://dask.org). When it compares two such objects, the comparison is optimized to maximize parallelism and minimize memory usage.

In this example, we're going to compare two arrays that together hold about 3 GiB of data.
However, because they're lazily defined, the whole comparison will use only a few MiB of RAM and will run on all available threads:

In [None]:
import sys

# Put the parent directory first on the import path
sys.path.insert(0, "..")

import dask.array as da
import xarray
from recursive_diff import display_diffs

a = xarray.DataArray(da.ones((200_000, 1_000)), name="ones")
b = xarray.DataArray(da.ones((200_000, 1_000)), name="ones")
a[123_456, 789] = 1.01  # Above tolerance
b[133_700, 333] = 1.0000000001  # Below tolerance

display_diffs(a, b)
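The "3 GiB" figure can be checked from array metadata alone, without computing anything. The following is a minimal sketch; the variable names are illustrative, and the number of chunks depends on your Dask configuration, so it is printed rather than stated:

```python
import dask.array as da

# Same shape as the arrays above; float64 by default. Nothing is computed here:
# nbytes and npartitions are derived from the array's metadata.
x = da.ones((200_000, 1_000))

# Two arrays of identical shape and dtype
total_bytes = 2 * x.nbytes
print(f"{total_bytes / 2**30:.2f} GiB in total")

# How many chunks Dask split each array into (depends on the configured chunk size)
print(f"{x.npartitions} chunks per array")
```

Because the arrays are split into chunks that can be processed independently, only a small part of the data needs to be in memory at any one time.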

## Dask clusters

If a Dask client is active when you compare chunked Xarray objects, the comparison runs on the Dask cluster.

In this example we're using a ``LocalCluster``, but this works with remote clusters as well as [Coiled](https://coiled.io) clusters!

You may use {func}`xarray.open_zarr` or {func}`xarray.open_dataset` to open Zarr or NetCDF files on S3. The chunks are read by the Dask workers rather than by the client, so if your cluster runs inside AWS and your client is outside of it, the data won't transfer over the internet and you won't pay egress charges.
S3 access is not yet supported by {func}`~recursive_diff.recursive_open`.

In [None]:
import dask.distributed

with dask.distributed.LocalCluster() as cluster:
    with dask.distributed.Client(cluster):
        display_diffs(a, b)
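The S3 workflow mentioned above could look roughly like the sketch below. It is an illustration under assumptions, not part of recursive-diff: `diff_remote_zarr` is a hypothetical helper, the bucket URLs in the usage note are placeholders, and reading `s3://` URLs assumes an fsspec-compatible backend such as `s3fs` is installed.

```python
import xarray


def diff_remote_zarr(lhs_url, rhs_url, storage_options=None):
    """Hypothetical helper: lazily open two Zarr stores and display their differences."""
    # Imported here so that the helper can be defined without recursive-diff installed
    from recursive_diff import display_diffs

    # open_zarr returns Dask-backed variables when Dask is installed; the bulk of the
    # data is only read when the comparison is computed.
    lhs = xarray.open_zarr(lhs_url, storage_options=storage_options)
    rhs = xarray.open_zarr(rhs_url, storage_options=storage_options)

    # With an active dask.distributed.Client, the chunks are read by the workers
    display_diffs(lhs, rhs)
```

You would call it inside a `dask.distributed.Client` context, for instance `diff_remote_zarr("s3://example-bucket/lhs.zarr", "s3://example-bucket/rhs.zarr")`, where both URLs are placeholders for your own stores.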