# Chunking with xarray and dask

A basic example of loading a netCDF dataset using xarray, chunking it using dask, and saving it to disk in zarr format

Author: Charles Blackmon-Luca

## Getting started

To load in our netCDF datasets, we will require xarray and netCDF4; to chunk them into dask arrays, we will require dask.

In [None]:
import xarray as xr
import dask

To get a grasp of dask's functionality beyond cloud computing, we will also use a local distributed scheduler, which we can start by creating a `Client` from `dask.distributed`:

In [None]:
from dask.distributed import Client

client = Client()
client

We can check the progress of this scheduler by following the Dashboard link in the output above. If this link does not work, we may need to forward port `8787` to our local machine using `ssh`:

```
ssh -L 8787:localhost:8787 tracmip@weather.rsmas.miami.edu
```

With the dashboard open, let's try opening an example dataset: CAM4's Aqua4xCO2 experiment with a monthly timestep. Monthly data like this is commonly used for climatologies, so we may prefer to chunk by time, since any given plot or computation will usually need the entire spatial field at once. When working with pressure-sensitive data, we may also prefer to chunk by pressure level. Overall, we want our chunks to be roughly 10-100 MB in size. When in doubt, pass `'auto'` for any dimension you *think* should be chunked but don't know a good size for, and dask will choose a size based on its `array.chunk-size` setting:

In [None]:
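# Upper limit that dask aims for when it picks 'auto' chunk sizes
dask.config.set({'array.chunk-size' : '128MiB'})

# Let dask choose the chunk sizes along time and pressure level
monthly = xr.open_mfdataset('/data2/tracmip/CAM4/Aqua4xCO2/Aqua4xCO2.monthly.nc', chunks={'time' : 'auto', 'lev' : 'auto'})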
monthly
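
To judge whether a chunk falls in that size range, we can estimate its uncompressed size from its shape and the number of bytes per value. The following is only a minimal sketch: the chunk shape and the float32 (4-byte) assumption are hypothetical and not taken from the CAM4 files, and the helper names are our own. Substitute the chunk shape and dtype shown in the dataset summary above.

```python
from math import prod

def chunk_nbytes(chunk_shape, itemsize=4):
    """Return the uncompressed size in bytes of one chunk."""
    return prod(chunk_shape) * itemsize

def in_target_range(nbytes, low=10e6, high=100e6):
    """Check whether a chunk size falls in the rough 10-100 MB range."""
    return low <= nbytes <= high

# Hypothetical (time, lev, lat, lon) chunk of float32 values
chunk_shape = (120, 26, 96, 144)
size = chunk_nbytes(chunk_shape, itemsize=4)

print(f"{size / 1e6:.1f} MB, within target range: {in_target_range(size)}")
```

If the estimate comes out too large, shrinking the chunk along one dimension (for example, fewer time steps per chunk) brings it down proportionally.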

Once we have the data loaded and chunked, we can convert it to zarr, where the chunking will be retained. This process shouldn't take very long, as the amount of data is relatively small. The progress of this conversion can be viewed from our dashboard:

In [None]:
monthly.to_zarr('/data2/tracmip/zarr/CAM4/Aqua4xCO2/monthly/')
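
One way to confirm that the chunking was retained is to read the array metadata that zarr writes alongside the chunks. The sketch below assumes the store was written in zarr format version 2, where each array directory holds a JSON file named `.zarray` with a `chunks` entry. Stores written in format version 3 keep their metadata in a different file, so treat this as illustrative. The helper name `stored_chunk_shape` is our own.

```python
import json
from pathlib import Path

def stored_chunk_shape(array_dir):
    """Read the chunk shape recorded in a zarr v2 array's .zarray file."""
    meta = json.loads(Path(array_dir, '.zarray').read_text())
    return tuple(meta['chunks'])

# Only runs if the store from this notebook exists in zarr v2 layout
array_dir = Path('/data2/tracmip/zarr/CAM4/Aqua4xCO2/monthly/CLDICE/')
if (array_dir / '.zarray').is_file():
    print(stored_chunk_shape(array_dir))
```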

Now that our data is saved to disk, we can inspect the size of the chunk files for one variable (here `CLDICE`) from the terminal to make sure they are reasonable:

In [None]:
!ls -lh /data2/tracmip/zarr/CAM4/Aqua4xCO2/monthly/CLDICE/
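
The same check can be done from Python, which is handy when there are too many chunk files to read by eye. This is a rough sketch that assumes a zarr format version 2 directory layout, in which chunk files sit directly inside the array directory and metadata files (such as `.zarray`) start with a dot. The helper name `chunk_file_sizes` is our own. Keep in mind that sizes on disk reflect any compression the store applies, so they can be smaller than the in-memory chunk sizes.

```python
from pathlib import Path

def chunk_file_sizes(array_dir):
    """Map each chunk file name in an array directory to its size in bytes."""
    sizes = {}
    for path in sorted(Path(array_dir).iterdir()):
        # Skip subdirectories and dot-prefixed metadata files
        if path.is_file() and not path.name.startswith('.'):
            sizes[path.name] = path.stat().st_size
    return sizes

# Only runs if the store from this notebook exists
array_dir = Path('/data2/tracmip/zarr/CAM4/Aqua4xCO2/monthly/CLDICE/')
if array_dir.is_dir():
    sizes = chunk_file_sizes(array_dir)
    if sizes:
        values = list(sizes.values())
        print(f"{len(values)} chunks, "
              f"smallest {min(values) / 1e6:.1f} MB, "
              f"largest {max(values) / 1e6:.1f} MB")
```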

The data is chunked well, but it is not representative of the data that would benefit from cloud computing. A better example is the Aqua4xCO2 data with a daily timestep, where we have nearly 6 times more data! With data at a finer timestep, our use cases tend to revolve around checking trends over a long period of time at a specific location. Here we benefit from chunking by latitude and longitude (and, as before, pressure level) while leaving time unchunked:

In [None]:
daily = xr.open_mfdataset('/data2/tracmip/CAM4/Aqua4xCO2/Aqua4xCO2.daily.nc', chunks={'lat' : 'auto', 'lon' : 'auto', 'lev' : 'auto'})
daily
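
To see why the chunk layout matters for this access pattern, we can count how many chunks a selection overlaps, since each overlapped chunk has to be read in full. The sketch below is plain Python with a hypothetical 3-D array of shape (time=3600, lat=96, lon=144) and two hypothetical chunk layouts. None of these numbers come from the CAM4 files, and `chunks_touched` is our own helper.

```python
def chunks_touched(selection, chunk_shape):
    """Count the chunks overlapping a selection.

    selection is a sequence of (start, stop) index pairs, one per dimension,
    with stop exclusive; chunk_shape gives the chunk length per dimension.
    """
    count = 1
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size        # index of the first chunk overlapped
        last = (stop - 1) // size    # index of the last chunk overlapped
        count *= last - first + 1
    return count

# Two hypothetical layouts for a (time=3600, lat=96, lon=144) array
chunked_by_time = (30, 96, 144)     # whole maps, 30 time steps per chunk
chunked_by_space = (3600, 24, 36)   # whole time series, small spatial tiles

# Full time series at a single grid point
time_series = ((0, 3600), (10, 11), (20, 21))
# Full map at a single time step
single_map = ((0, 1), (0, 96), (0, 144))

for name, layout in [('by time', chunked_by_time), ('by space', chunked_by_space)]:
    print(name,
          '| time series:', chunks_touched(time_series, layout),
          '| single map:', chunks_touched(single_map, layout))
```

In this toy setup the time series touches 120 chunks when chunked by time but only 1 when chunked by space, while the single map touches 1 and 16 chunks respectively, which is the trade-off behind choosing a layout to match the expected use.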

As before, we then save the chunked data to disk in zarr format:

In [None]:
daily.to_zarr('/data2/tracmip/zarr/CAM4/Aqua4xCO2/daily/')