<img src="https://xarray.dev/dataset-diagram-logo.png"
     align="right"
     width="30%"/>

# Geospatial Dataset Rechunking

This notebook rechunks data from the NOAA National Water Model (NWM) retrospective archive: https://registry.opendata.aws/nwm-archive/

## Set up cluster

In [None]:
import dask

dask.config.set({
    "array.rechunk.method": "p2p",
    "optimization.fuse.active": False,
    "distributed.comm.retry.count": 20,
    "distributed.comm.timeouts.connect": 120,
});

In [None]:
import coiled

cluster = coiled.Cluster(
    n_workers=100,
    region="us-east-1",
)
client = cluster.get_client()
client

## Load NWM data

In [None]:
import xarray as xr

ds = xr.open_zarr(
    "s3://noaa-nwm-retrospective-2-1-zarr-pds/rtout.zarr",
    consolidated=True,
).drop_encoding()
ds

In [None]:
ds.nbytes / 1e12  # size in TB; roughly half a petabyte in total

## Time-optimized rechunking

Let's look at two months' worth of data (~1 TB) and rechunk it so that selections along the time dimension are efficient.

In [None]:
data = ds.zwattablrt.sel(time=slice("2020-01-01", "2020-03-01"))  # ~1 TB of data
data
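Label-based slices in xarray include the end label, so the selection above covers all of 2020-03-01 as well as January and February. As a rough sketch of how many time steps that is, the snippet below counts them with the standard library only. The 3-hourly step is an assumption made for illustration; check `ds.time` for the actual spacing.

```python
from datetime import datetime, timedelta

# Assumption for illustration: one time step every 3 hours.
STEP = timedelta(hours=3)

start = datetime(2020, 1, 1)
# The end label "2020-03-01" is inclusive, so the first timestamp that is
# NOT selected is midnight on 2020-03-02.
stop_exclusive = datetime(2020, 3, 2)

n_days = (stop_exclusive - start).days
n_steps = (stop_exclusive - start) // STEP

print(f"{n_days} days, {n_steps} time steps")
```

Under that assumption the selection spans 61 days (2020 is a leap year) rather than exactly two calendar months.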

In [None]:
result = data.chunk({"time": 1, "x": "auto", "y": "auto"})
result
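To get a feel for what one-time-step-per-chunk means in practice, here is a minimal sketch that counts chunks and bytes per chunk for a regular chunk grid. The array shape, the 8-byte item size, and the spatial chunk sizes are all hypothetical; they are not the real NWM grid, nor what `"auto"` would pick. The helper `chunk_layout` is made up for this example and is not part of dask or xarray.

```python
import math


def chunk_layout(shape, chunks, itemsize):
    """Return (number of chunks, bytes in one full-size chunk)."""
    # Each dimension is split into ceil(dim / chunk) pieces.
    n_chunks = math.prod(math.ceil(dim / c) for dim, c in zip(shape, chunks))
    chunk_bytes = math.prod(chunks) * itemsize
    return n_chunks, chunk_bytes


# Hypothetical (time, y, x) shape with 8-byte values; not the real NWM grid.
shape = (480, 4000, 5000)
itemsize = 8

# One time step per chunk, with illustrative spatial chunk sizes.
n_chunks, chunk_bytes = chunk_layout(shape, (1, 2000, 2500), itemsize)

print(f"{n_chunks} chunks of {chunk_bytes / 1e6:.0f} MB each")
```

With these made-up numbers the layout has 1920 chunks of 40 MB each. The actual values for this dataset are shown in the `result` repr above.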

In [None]:
result.to_zarr("s3://oss-scratch-space/nwm-time-optimized.zarr", mode="w")

In [None]:
import fsspec

fs = fsspec.filesystem("s3")
fs.ls("s3://oss-scratch-space/nwm-time-optimized.zarr/")