# A short introduction to Xarray (& friends)

In short, [Xarray](https://xarray.dev/) is a Python library for handling n-dimensional, **labelled** arrays.

In this notebook, we'll see how Xarray compares to (or integrates with) other Python array libraries such as Numpy, Dask and Zarr. We'll also see how to plot Xarray datasets.

## Environment setup

In [None]:
import dask.array as da
import xarray as xr
import matplotlib.pyplot as plt
import numpy as np
import zarr

from dask.distributed import LocalCluster, Client

## Numpy arrays

Let's start with something familiar:

In [None]:
# a 3-d array where dimensions are, e.g., "time", "x" and "y".

arr3d = np.random.uniform(size=(3, 2, 4))

arr3d

Indexing (slicing) numpy arrays by position (integers) and axis:

In [None]:
# extract the 1st time slice:

arr3d[0]

In [None]:
# extract cross-sections along the "y" dimension

arr3d[:, 0, :]

Broadcasting (axis position is important):

<img src="assets/broadcast.png" alt="broadcast" width="600"/>

In [None]:
# extract 1st time slice and apply multiplication factors
# along the "time" dimension

time0 = arr3d[0]
time_factors = np.array([1, 2, 3])

time0 * time_factors[:, None, None]
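
To make the role of axis position more concrete, here is a minimal sketch of the same broadcasting step with the intermediate shapes spelled out. The names `cube`, `first_slice`, `factors` and `scaled` are illustrative only and are not used elsewhere in this notebook.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
cube = rng.uniform(size=(3, 2, 4))   # axes are, by convention, ("time", "x", "y")

first_slice = cube[0]                # shape (2, 4), i.e. ("x", "y")
factors = np.array([1, 2, 3])        # one factor per time step, shape (3,)

# Without new axes, numpy tries to match the last axes: 4 ("y") against 3.
try:
    first_slice * factors
    naive_failed = False
except ValueError:
    naive_failed = True

# Adding two trailing axes turns the factors into shape (3, 1, 1),
# so they line up with a new leading "time" axis.
scaled = first_slice * factors[:, None, None]
print(scaled.shape)
```

Nothing in the arrays themselves records which axis is "time": the correctness of the result depends entirely on where the `None` entries are placed.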

Reductions (axis position is important):

In [None]:
# compute the mean along the "y" dimension

arr3d.mean(axis=2)

Is there a way to use the dimension names directly instead? Relying on axis positions is a common source of bugs, e.g., when handling square arrays (or matrices):

In [None]:
# both arr1 and arr2 are 2-d arrays with "x" and "y" dimensions

arr1 = np.array([[0, 2], [1, 3]])
arr2 = np.array([[0, 1], [3, 4]])

arr1 + arr2

In [None]:
# really sure that arr1 and arr2 have the same dimension order?

arr1 + arr2.transpose()

## xarray.DataArray

We provide the dimension names explicitly:

In [None]:
da3d = xr.DataArray(arr3d, dims=("time", "x", "y"))

da3d

This `DataArray` is a lightweight wrapper around the numpy array:

In [None]:
da3d.data

In [None]:
da3d.data is arr3d

Indexing (slicing) by position (integers) and **dimension name**:

In [None]:
# extract the 1st time slice:

da3d.isel(time=0)

In [None]:
# extract cross-sections along the "y" dimension

da3d.isel(x=0)

Broadcasting by **dimension name**:

In [None]:
# extract 1st time slice and apply multiplication factors
# along the "time" dimension

da_time0 = da3d.isel(time=0)
da_time_factors = xr.DataArray([1, 2, 3], dims="time")

da_time_factors * da_time0

Reduction **by dimension name**:

In [None]:
# compute the mean along the "y" dimension

da3d.mean("y")
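
As a quick sanity check, the sketch below compares a reduction by name with the equivalent reduction by axis position, using a small array with known values. The variable names are illustrative, and `get_axis_num` is used only to show which axis a name maps to.

```python
import numpy as np
import xarray as xr

values = np.arange(24.0).reshape(3, 2, 4)
labelled = xr.DataArray(values, dims=("time", "x", "y"))

# Position of the "y" dimension in the underlying array
y_axis = labelled.get_axis_num("y")

by_name = labelled.mean("y")
by_axis = values.mean(axis=y_axis)

# Reordering the dimensions changes the axis positions,
# but the reduction by name still targets the same dimension.
transposed_mean = labelled.transpose("y", "time", "x").mean("y")
print(by_name.dims, transposed_mean.dims)
```

The name-based call gives the same values whatever the dimension order, whereas `axis=2` would silently reduce a different dimension after the transpose.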

Handling square arrays is less error-prone:

In [None]:
da1 = xr.DataArray([[0, 2], [1, 3]], dims=("x", "y"))
da2 = xr.DataArray([[0, 1], [3, 4]], dims=("y", "x"))

da1 + da2
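
The following sketch makes explicit what happens in the sum above: Xarray matches the operands by dimension name, so the result equals adding the *transposed* second array, not the positional sum of the raw data. The names `by_name`, `by_position` and `expected` are illustrative.

```python
import numpy as np
import xarray as xr

da1 = xr.DataArray([[0, 2], [1, 3]], dims=("x", "y"))
da2 = xr.DataArray([[0, 1], [3, 4]], dims=("y", "x"))

by_name = da1 + da2                  # aligned on dimension names
by_position = da1.data + da2.data    # what plain numpy would compute
expected = da1.data + da2.data.T     # manual transpose to match ("x", "y")

print(by_name.dims)
print(np.array_equal(by_name.values, expected))
print(np.array_equal(by_name.values, by_position))
```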

### Coordinates!

There is more than dimension names: Xarray supports defining labels along each dimension as coordinates.

In [None]:
da3d = da3d.assign_coords(
    time=[2020, 2021, 2022],
    x=[10, 20],
    y=[100, 110, 120, 130],
)

da3d

By default, those "dimension" coordinates are backed by an index, which means that we can use those coordinates to perform data selection **by label**:

In [None]:
da3d.sel(time=2020, x=10)

We don't need to provide the exact labels:

In [None]:
da3d.sel(y=118, method="nearest")
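
Here is a small, self-contained sketch of two common forms of inexact label selection: nearest-neighbour lookup and label slices. The array `profile` is a made-up example. Note that label-based slices include both end points, unlike positional slices.

```python
import numpy as np
import xarray as xr

profile = xr.DataArray(
    [1.0, 2.0, 3.0, 4.0],
    coords={"y": [100, 110, 120, 130]},
    dims="y",
)

# 118 is not a label, so the closest one (120) is picked
nearest = profile.sel(y=118, method="nearest")

# Label slices are inclusive on both sides: 110 and 120 are selected
window = profile.sel(y=slice(110, 120))

print(nearest.y.item(), window.y.values)
```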

Xarray also supports automatic alignment between indexed coordinates:

<img src="assets/align.png" alt="alignment" width="400"/>

In [None]:
da_time_factors = xr.DataArray(
    [0.5, 0.1],
    coords={"time": [2020, 2022]},
    dims="time",
)

da3d * da_time_factors
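
The sketch below shows what alignment does to the labels, assuming Xarray's default settings: arithmetic keeps only the labels common to both operands (an "inner" join). `xr.align` with `join="outer"` is shown as an alternative that keeps all labels and fills the gaps with NaN. The arrays are small made-up examples.

```python
import numpy as np
import xarray as xr

field = xr.DataArray(
    np.ones((3, 2)),
    coords={"time": [2020, 2021, 2022], "x": [10, 20]},
    dims=("time", "x"),
)
factors = xr.DataArray(
    [0.5, 0.1],
    coords={"time": [2020, 2022]},
    dims="time",
)

# Default arithmetic: only the common "time" labels survive
product = field * factors
print(product.time.values)

# Explicit outer alignment: 2021 is kept and filled with NaN in `factors`
field_outer, factors_outer = xr.align(field, factors, join="outer")
print(factors_outer.values)
```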

## xarray.Dataset

An `xarray.Dataset` is a collection of (data) variables sharing common dimensions (and coordinates).

<img src="assets/xarray-dataset-diagram.png" alt="xarray data model" width="600"/>


(https://xarray.pydata.org)


Let's load a dataset from Xarray's tutorial:

In [None]:
ds = xr.tutorial.load_dataset('air_temperature')

ds

We can access the variables simply like this (returns a DataArray):

In [None]:
ds.time

DataArrays and Datasets may have attributes too:

In [None]:
ds.attrs

In [None]:
ds.time.attrs

Generally, operations like indexing, reductions, arithmetic, etc. work the same way for both DataArray and Dataset objects. For the latter, operations are applied to all the (data) variables:

In [None]:
ds.sel(time="2014-02")

In [None]:
ds.mean(["lat", "lon"])
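
To see this on a dataset small enough to check by hand, here is a sketch that builds a Dataset from scratch rather than loading the tutorial data. The variable names `temperature` and `pressure` and all values are made up for illustration.

```python
import numpy as np
import xarray as xr

ds_small = xr.Dataset(
    data_vars={
        "temperature": (("time", "x"), np.array([[10.0, 12.0], [11.0, 13.0], [12.0, 14.0]])),
        "pressure": (("time", "x"), np.array([[1000.0, 1002.0], [1001.0, 1003.0], [1002.0, 1004.0]])),
    },
    coords={"time": [2020, 2021, 2022], "x": [10, 20]},
)

# The reduction is applied to every data variable at once
spatial_mean = ds_small.mean("x")

# So is label-based selection
one_year = ds_small.sel(time=2021)

print(spatial_mean)
```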

## Plotting

Xarray has powerful plotting capabilities built on top of matplotlib. See:

- https://docs.xarray.dev/en/stable/gallery.html
- https://docs.xarray.dev/en/stable/user-guide/plotting.html

Example: plot time series at a given location:

In [None]:
# note the matplotlib tick, axis labels and title automatically generated from metadata

ds.air.sel(lat=50, lon=225).plot();

In [None]:
# compare that with raw numpy and matplotlib code

air_raw = ds.air.data

lat = 50
lon = 225

ilat = 10
ilon = 10

ts = air_raw[:, ilat, ilon]
plt.plot(ts)
plt.gca().set_title(f"lat = {lat}, lon = {lon}")
plt.gca().set_ylabel("4xDaily Air temperature")
plt.gca().set_xlabel("Time");

# I don't remember how to properly format time tick labels

Or plot time series at multiple locations:

In [None]:
# xarray is smart enough to figure out that we want to plot time series at three locations (note the automatic legend)

ds.air.sel(lat=[50, 55, 60], lon=225).plot.line(x="time");

A more advanced example: compute seasonal averages and make a facet plot

In [None]:
ds.air.groupby('time.season').mean().plot(x="lon", y="lat", col="season", col_wrap=2);
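
The `'time.season'` string above groups the data by a virtual "season" variable derived from the datetime coordinate. The sketch below shows the grouping step alone, on a synthetic daily series (an assumption made here so that the example does not depend on the tutorial data).

```python
import numpy as np
import pandas as pd
import xarray as xr

# One value per day of 2021: the value is simply the day index (0..364)
time = pd.date_range("2021-01-01", "2021-12-31", freq="D")
series = xr.DataArray(
    np.arange(time.size, dtype=float),
    coords={"time": time},
    dims="time",
)

# Group by meteorological season (DJF, MAM, JJA, SON) and average each group
seasonal = series.groupby("time.season").mean()

print(seasonal.season.values)
```

The result has a new `season` dimension in place of `time`, which is what the `col="season"` argument of the facet plot relies on.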

## Xarray integrates well with some libraries for interactive visualization

Example with [hvPlot](https://hvplot.holoviz.org/) / [HoloViews](https://holoviews.org/):

In [None]:
import hvplot.xarray

In [None]:
ds.air.hvplot.image(groupby='time', frame_width=400, frame_height=400)

In [None]:
(ds.air
 .sel(lat=[50, 60, 70])
 .hvplot.line(x='lon', y='air', groupby='time', by='lat')
)

## Dask arrays

[Dask arrays](https://docs.dask.org/en/stable/array.html) are large arrays that are formed by (many) smaller arrays (most of the time those are numpy arrays). This makes it possible to operate on arrays that don't fit in memory while executing the computations in parallel.

<img src="assets/dask-array.png" alt="dask array" width="400"/>

(https://docs.dask.org)

In [None]:
darr3d = da.random.uniform(size=(300, 1000, 2000))

darr3d

Dask allows users to handle those arrays just like Numpy, e.g.,

In [None]:
time0 = darr3d[0]
time_factors = np.arange(300)

result = (time0 * time_factors[:, None, None]).mean(axis=0)

Unlike Numpy, the result is not computed immediately. Instead, it returns another dask array:

In [None]:
result

Dask arrays are "lazy" arrays, i.e., their actual element values are not computed yet. Instead a dask array holds a graph of computations:

In [None]:
result.dask

In [None]:
result.visualize()

To compute the actual values, we have to call `.compute()` explicitly:

In [None]:
# returns a numpy array in this case

result.compute()
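
The same lazy-then-compute pattern can be checked on an array small enough to verify by hand. This is a minimal sketch; the shape and chunk sizes are arbitrary choices made for illustration.

```python
import dask.array as da

# A 4 x 6 array of ones split into 2 x 3 chunks, i.e. a 2 x 2 grid of blocks
small = da.ones((4, 6), chunks=(2, 3))
print(small.numblocks)

# Building the expression only records tasks in a graph
lazy_total = (small * 2).sum()
print(type(lazy_total).__name__)

# The values are only evaluated here
total = lazy_total.compute()
print(total)
```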

Dask provides the computation graph to one of its schedulers, which executes it in parallel. For dask arrays, the default scheduler is "threads". Alternatively, we can use multiple processes:

In [None]:
result.compute(scheduler="processes")

There's also an advanced (distributed) scheduler, which can be used with a monitoring dashboard (within JupyterLab, see [dask-labextension](https://github.com/dask/dask-labextension)):

In [None]:
# start a new local dask (distributed) cluster

cluster = LocalCluster()
cluster

In [None]:
# create a new client and connect it to the cluster

client = Client(cluster)
client

In [None]:
result.compute(scheduler=client)

### Xarray + Dask integration

Xarray integrates well with Dask.

In [None]:
# Let's go back to the tutorial xarray Dataset, and "chunk" the data variables
# along the time dimension:

dsd = ds.chunk({"time": 100})

dsd

The "air" variable (DataArray) is here a lightweight wrapper around a dask array:

In [None]:
dsd.air.data

Most Xarray operations (reduction, indexing, arithmetic...) work seamlessly with dask arrays:

In [None]:
result = dsd.air.sel(lat=70).mean("time")

result

Here too, we need to call `.compute()` explicitly:

In [None]:
result.compute()
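
For a version that does not depend on the tutorial data, the sketch below chunks a small synthetic DataArray and inspects the resulting chunk layout. The array size (250 time steps) is chosen arbitrarily so that the last chunk is smaller than the others.

```python
import numpy as np
import xarray as xr

eager = xr.DataArray(np.ones((250, 4)), dims=("time", "x"))

# Split "time" into blocks of (at most) 100 elements; "x" stays in one block
chunked = eager.chunk({"time": 100})
print(chunked.chunks)

# Still lazy: the mean is only a dask graph at this point
lazy_mean = chunked.mean("time")

# Evaluate it and get a numpy-backed DataArray back
computed = lazy_mean.compute()
print(computed.values)
```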

## Zarr arrays

[Zarr](https://zarr.readthedocs.io) arrays are chunked arrays (like Dask) that can be stored somewhere (in memory, on disk, in a database, on the cloud, etc.) most often after applying some compression filter.

For example, we can create new Zarr arrays in-memory:

In [None]:
# create a 2-d array in-memory

z = zarr.zeros((10000, 10000), chunks=(1000, 1000))

z.info

Like dask arrays, zarr arrays are "lazy". In the example above, no memory has been allocated yet. Memory fills up as we assign data to (subsets of the) array, e.g., 

In [None]:
z[0:100, 100:200] = 1.0

z.info

We can also create a new array that is stored on disk:

In [None]:
# this will create an "example.zarr" folder in this notebook directory

z1 = zarr.open(
    "example.zarr",
    mode="w",
    shape=(10000, 10000),
    chunks=(1000, 1000),
)

z1.info

In [None]:
# This will create some data files in the "example.zarr" folder
# (one file per chunk)

z1[0:100, 100:200] = 1.0

It is also possible with Zarr to store a group (hierarchy) of arrays: 

In [None]:
# this will create a "dataset.zarr" folder in this notebook directory

group = zarr.group(store="dataset.zarr")

In [None]:
# This will create a "z" subfolder in "dataset.zarr"

group.create_dataset("z", shape=(10000, 10000), chunks=(1000, 1000))

In [None]:
# This will create some data files in the "dataset.zarr/z" directory

group.z[0:100, 100:200] = 1.0

In [None]:
group.z.info

### Xarray + Zarr (+ Dask) integration

Xarray integrates well with Zarr, i.e., it is possible to read (write) a Zarr dataset into (from) an Xarray Dataset:

In [None]:
# write the (chunked) tutorial Dataset to the Zarr format
# (mode="w" overwrites an existing "air.zarr" store, so that this cell can be re-run)

dsd.to_zarr("air.zarr", mode="w")

In [None]:
# Open the Zarr tutorial dataset into a new Xarray Dataset

ds_air = xr.open_dataset("air.zarr", engine="zarr", chunks={})

ds_air

In the latter Dataset, the `air` variable is not yet loaded into memory ("lazy" dask array). Data will be loaded on-demand, e.g., during the execution of the dask graph:

In [None]:
ds_air.mean().compute()

## "Duck arrays"

We can see in the examples above that Numpy, Dask, Zarr arrays and Xarray DataArray / Dataset objects all expose a similar API (e.g., methods like `.mean()`, indexing, operators, etc.). We usually call those arrays "duck" arrays, in reference to [duck typing](https://en.wikipedia.org/wiki/Duck_typing), which roughly means that the type of an object is less important than the properties and actions (methods) it defines.

Numpy actually defines a protocol so that it is possible to reuse its API with other array types, e.g.,

In [None]:
# passing a numpy array

np.sqrt(arr3d)

In [None]:
# passing a dask array

np.sqrt(darr3d)

In [None]:
# passing a chunked xarray Dataset loaded from a Zarr store !!

np.sqrt(ds_air)
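
As a minimal check of this behaviour on data small enough to verify, the sketch below applies a Numpy function to a tiny DataArray. The result is still a DataArray with its dimension names, because the object implements Numpy's `__array_ufunc__` hook, which is one of the protocols Numpy uses to dispatch to other array types. The variable names are illustrative.

```python
import numpy as np
import xarray as xr

squares = xr.DataArray([1.0, 4.0, 9.0], dims="x")

# The object advertises that it can handle numpy "universal functions" itself
print(hasattr(squares, "__array_ufunc__"))

# numpy delegates to xarray, which returns a labelled result
roots = np.sqrt(squares)
print(type(roots).__name__, roots.dims)
```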