# Xarray: a very short demo

Xarray provides convenient Python objects for working with sets of n-dimensional arrays (1-d, 2-d, 3-d, ...) that share common, named dimensions. Those dimensions often carry a physical meaning (e.g., time, latitude, longitude).

Think of it as a netCDF file loaded as a Python object with many extra capabilities...

![Xarray data model](assets/xarray-dataset-diagram.png "Xarray data model")

(https://xarray.pydata.org)
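To make the data model above more concrete, here is a minimal sketch that builds a small `xarray.Dataset` by hand. The variable names (`temperature`, `elevation`), the values and the coordinates are all made up for illustration; only the `xr.Dataset` constructor itself comes from Xarray.

```python
import numpy as np
import xarray as xr

# Hypothetical toy data: 4 time steps on a 3 x 2 latitude/longitude grid.
temperature = 280.0 + np.arange(24, dtype=float).reshape(4, 3, 2)
elevation = np.array([[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]])

toy = xr.Dataset(
    data_vars={
        # 3-d variable defined on (time, lat, lon)
        "temperature": (("time", "lat", "lon"), temperature, {"units": "K"}),
        # 2-d variable that shares the lat and lon dimensions
        "elevation": (("lat", "lon"), elevation, {"units": "m"}),
    },
    coords={
        "time": np.arange(4),
        "lat": [50.0, 47.5, 45.0],
        "lon": [200.0, 202.5],
    },
)

# Each dimension has one size, shared by all variables that use it.
print(dict(toy.sizes))
```

Both variables live in the same object, and the `lat` and `lon` dimensions are defined only once.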

---

Let's import xarray and numpy...

In [None]:
import xarray as xr
import numpy as np

Let's load some data as an `xarray.Dataset` object...


In [None]:
ds = xr.tutorial.load_dataset('air_temperature')

ds

## Xarray wraps numpy arrays...

In [None]:
# the `ds.air.data` property returns the underlying numpy array of the `air` variable.

arr = ds.air.data

In [None]:
#arr

## ...with labels and metadata! Way less error-prone!

Example: select (slice) the array at a given latitude of 50 degrees

In [None]:
# numpy: we need to know that latitude is the 2nd axis
#        and that 50 degrees is the 11th element on that axis
#
# --> error-prone!

arr[:, 10, :]

In [None]:
# xarray: just use labels!

ds.air.sel(lat=50)
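The comparison above can also be checked without downloading anything. The sketch below uses a synthetic array (random values, and a latitude axis assumed to run from 75 to 15 degrees in steps of 2.5) and verifies that positional and label-based selection return the same values. It also shows `method="nearest"`, which helps when the requested label is not exactly on the grid.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the `air` variable (values are random).
lat = np.linspace(75.0, 15.0, 25)    # 75, 72.5, ..., 15
lon = np.linspace(200.0, 330.0, 53)  # 200, 202.5, ..., 330
data = np.random.default_rng(0).normal(size=(6, lat.size, lon.size))

air = xr.DataArray(
    data,
    dims=("time", "lat", "lon"),
    coords={"lat": lat, "lon": lon},
    name="air",
)

# numpy: positional indexing, we must know that 50 degrees sits at index 10
by_position = air.data[:, 10, :]

# xarray: label-based indexing, no index bookkeeping needed
by_label = air.sel(lat=50.0)
print(np.array_equal(by_position, by_label.data))

# 51 degrees is not on the grid: pick the closest available latitude
nearest = air.sel(lat=51.0, method="nearest")
print(float(nearest.lat))
```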

Example: compute the mean along the latitude axis

In [None]:
# numpy: we need to know that latitude is the 2nd axis
#        in the returned result, it is hard to know
#        which axis is time and which one is longitude

arr.mean(axis=1)

In [None]:
# xarray: just use labels!

ds.air.mean(dim='lat')
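Here is a small self-contained sketch of the same idea, using a made-up array so that the result is easy to inspect. The numpy result is a bare array, while the Xarray result still knows the names of its remaining dimensions.

```python
import numpy as np
import xarray as xr

# Made-up values on a tiny (time, lat, lon) grid.
data = np.arange(24, dtype=float).reshape(2, 3, 4)
air = xr.DataArray(
    data,
    dims=("time", "lat", "lon"),
    coords={"lat": [60.0, 55.0, 50.0], "lon": [200.0, 202.5, 205.0, 207.5]},
)

numpy_result = data.mean(axis=1)      # which axis is which now?
xarray_result = air.mean(dim="lat")   # remaining dimensions keep their names

print(numpy_result.shape)
print(xarray_result.dims)

# Reducing over several dimensions at once also works by name.
time_series = air.mean(dim=("lat", "lon"))
print(time_series.dims)
```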

## Xarray integrates well with numpy

Wait, we can use a NumPy function with an Xarray object and it returns an Xarray object? Where is the magic?

In [None]:
np.mean(ds.air, axis=1)

Advantages:

- Reuse the same code with different array libraries, each having its own implementation of arrays (e.g., with data stored in RAM or in distributed memory, or with computation run on a CPU, a GPU, etc.)
- Use those libraries together -> interoperability!

More info on how it is possible: https://numpy.org/neps/nep-0018-array-function-protocol.html
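To give a feel for the protocol described in the link above, here is a minimal sketch of a container that intercepts `np.mean` through `__array_function__`. The `LabeledArray` class is hypothetical and written only for this demo. It is a generic illustration of the protocol, not a description of how Xarray is implemented internally.

```python
import numpy as np


class LabeledArray:
    """Hypothetical toy container (not part of Xarray or NumPy)."""

    def __init__(self, data, dims):
        self.data = np.asarray(data)
        self.dims = tuple(dims)

    def __array_function__(self, func, types, args, kwargs):
        # This sketch only supports np.mean called as np.mean(obj, axis=...).
        if func is not np.mean:
            return NotImplemented
        wrapped = args[0]
        axis = kwargs.get("axis")
        reduced = np.mean(wrapped.data, axis=axis)
        if axis is None:
            kept = ()  # all dimensions are reduced
        else:
            kept = tuple(d for i, d in enumerate(wrapped.dims) if i != axis)
        return LabeledArray(reduced, kept)


labeled = LabeledArray(np.ones((2, 3, 4)), dims=("time", "lat", "lon"))

# NumPy sees that `labeled` defines __array_function__ and hands control to it.
result = np.mean(labeled, axis=1)
print(type(result).__name__, result.dims)
```

The call goes through the regular `np.mean` entry point, yet the returned object is of the container's own type and keeps its labels.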

## There is more: Xarray integrates well with Dask

In [None]:
# recall the dataset loaded above

ds

Let's cut the arrays (data variables) in the dataset into multiple blocks (chunks)

In [None]:
# create chunks along the time dimension, each having 500 elements along that dimension

dsd = ds.chunk({'time': 500})

dsd
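To see what chunking does, the sketch below applies the same `.chunk()` call to a small synthetic dataset (all zeros, with 1200 time steps chosen arbitrarily) and prints the resulting block sizes. It assumes that Dask is installed.

```python
import numpy as np
import xarray as xr

# Synthetic dataset: 1200 time steps on a 5 x 4 grid.
toy = xr.Dataset({"air": (("time", "lat", "lon"), np.zeros((1200, 5, 4)))})

print(toy.air.chunks)  # a plain numpy-backed variable has no chunks

chunked = toy.chunk({"time": 500})

# One tuple of block sizes per dimension; the last block along time
# holds the remaining 200 elements.
print(chunked.air.chunks)
```

Dimensions that are not listed in the `.chunk()` call are kept as a single block.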

Computing the mean along the latitude axis returns another Xarray object that wraps a Dask array...

In [None]:
dsd.air.mean(dim='lat')

This Dask array does not contain any actual values (it is a "lazy" array). Instead, it contains a graph of computing tasks that can be executed in parallel.

In [None]:
dsd.air.mean(dim='lat').data.visualize()

We need to call `.compute()` to trigger the computation of the task graph and get the actual values.

In [None]:
dsd.air.mean(dim='lat').compute()
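The difference between the lazy and the computed result can be checked by looking at the type of the wrapped array. This is a small sketch on made-up data, assuming Dask is installed.

```python
import dask.array
import numpy as np
import xarray as xr

values = np.arange(24, dtype=float).reshape(4, 3, 2)
toy = xr.DataArray(values, dims=("time", "lat", "lon")).chunk({"time": 2})

# Nothing is computed here: the result wraps a Dask array (a task graph).
lazy = toy.mean(dim="lat")
print(isinstance(lazy.data, dask.array.Array))

# .compute() executes the graph and returns a numpy-backed object.
computed = lazy.compute()
print(isinstance(computed.data, np.ndarray))
```

Note that `.compute()` returns a new object; `lazy` itself stays lazy.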

Let's compare execution times.

Here is the reference computation time based on the numpy array (not chunked):

In [None]:
%timeit ds.air.mean(dim='lat')

Here, a graph of computation tasks is built by Dask. This is very cheap compared to the reference time above.

In [None]:
%timeit dsd.air.mean(dim='lat')

Here, the Dask graph is computed. We get some speed-up due to parallel execution, but not a large one: parallel computation introduces some overhead and, in this case, the amount of data is quite small.

Note: the execution time may depend on different things (e.g., the hardware on which this notebook is run, the chosen Dask scheduler, and how it is configured). For example, the `distributed` scheduler may introduce some overhead, but it comes with a visual dashboard with useful diagnostics. For more info, see https://docs.dask.org/en/latest/scheduling.html.

In [None]:
%timeit dsd.air.mean(dim='lat').compute()

In [None]:
%timeit dsd.air.mean(dim='lat').compute(scheduler='threads')
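Outside of IPython (where `%timeit` is not available), a rough comparison of schedulers can be sketched with `time.perf_counter`. The data below is random and its shape is arbitrary. The single-shot timings are only indicative, so no particular ordering should be expected.

```python
import time

import numpy as np
import xarray as xr

rng = np.random.default_rng(42)
toy = xr.DataArray(
    rng.normal(size=(2000, 25, 53)), dims=("time", "lat", "lon")
).chunk({"time": 500})

lazy = toy.mean(dim="lat")

results = {}
timings = {}
for scheduler in ("synchronous", "threads"):
    start = time.perf_counter()
    # "synchronous" runs all tasks in the current thread (no parallelism),
    # "threads" uses a local thread pool.
    results[scheduler] = lazy.compute(scheduler=scheduler)
    timings[scheduler] = time.perf_counter() - start

for scheduler, seconds in timings.items():
    print(f"{scheduler}: {seconds:.4f} s")
```

Whatever the scheduler, the computed values are the same; only the way the tasks are executed changes.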

## Both Xarray and Dask play nicely with numpy

So we can use a `numpy` function with an Xarray object that wraps Dask arrays... and get back an Xarray object that wraps Dask arrays?

In [None]:
np.sqrt(dsd)
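A quick way to confirm this is to inspect the result on a small synthetic dataset (constant values chosen so that the square root is obvious). The sketch assumes Dask is installed.

```python
import dask.array
import numpy as np
import xarray as xr

toy = xr.Dataset({"air": (("time", "lat"), np.full((6, 3), 4.0))})
chunked = toy.chunk({"time": 3})

root = np.sqrt(chunked)

# Still an Xarray Dataset, and its variable is still a lazy Dask array.
print(type(root).__name__)
print(isinstance(root.air.data, dask.array.Array))

# The actual values only appear once we ask for them.
values = root.air.compute()
print(float(values.max()))
```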

## Xarray has powerful plotting capabilities built on top of matplotlib

Example: plot time series at a given location:

In [None]:
# note the matplotlib ticks, axis labels and title automatically generated from metadata

ds.air.sel(lat=50, lon=225).plot();
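The automatic labels come from the attributes attached to the data. Here is a minimal sketch with a made-up time series whose `long_name` and `units` attributes are set by hand; the `Agg` backend is selected only so that the example also runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, for running outside a notebook

import matplotlib.pyplot as plt
import numpy as np
import xarray as xr

series = xr.DataArray(
    np.linspace(270.0, 280.0, 10),
    dims="time",
    coords={"time": np.arange(10)},
    name="air",
    # metadata that Xarray picks up when labelling the plot
    attrs={"long_name": "Air temperature", "units": "K"},
)

fig, ax = plt.subplots()
series.plot(ax=ax)

# The y-axis label is built from the `long_name` and `units` attributes.
print(ax.get_ylabel())
print(ax.get_xlabel())
```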

Or plot time series at multiple locations:

In [None]:
# xarray is smart enough to understand that we want to plot time series at three locations (note the automatic legend)

ds.air.sel(lat=[50, 55, 60], lon=225).plot.line(x="time");

A more advanced example: compute seasonal averages and make a facet plot

In [None]:
ds.air.groupby('time.season').mean().plot(x="lon", y="lat", col="season", col_wrap=2);
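Before plotting, it can help to look at what the group-by step alone produces. The sketch below uses one year of made-up daily data and passes `dim="time"` explicitly to make the reduced dimension visible.

```python
import numpy as np
import pandas as pd
import xarray as xr

# One year of synthetic daily values on a 3 x 4 grid.
time = pd.date_range("2013-01-01", "2013-12-31", freq="D")
rng = np.random.default_rng(0)
air = xr.DataArray(
    rng.normal(280.0, 5.0, size=(time.size, 3, 4)),
    dims=("time", "lat", "lon"),
    coords={"time": time},
    name="air",
)

# "time.season" groups the time steps by meteorological season
# (DJF, MAM, JJA, SON), then each group is averaged over time.
seasonal = air.groupby("time.season").mean(dim="time")

print(seasonal.dims)
print(seasonal.season.values)
```

The `time` dimension is replaced by a `season` dimension of length 4, which is what the `col="season"` argument of the facet plot relies on.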

## Xarray integrates well with some libraries for interactive visualization

Example with HoloViews/hvPlot:

In [None]:
# importing this module registers the `.hvplot` accessor on xarray objects
import hvplot.xarray

In [None]:
ds.air.hvplot.image(groupby='time', frame_width=400, frame_height=400)

In [None]:
ds.time.data

In [None]:
(ds.air
 .sel(lat=[50, 60, 70])
 .hvplot.line(x='lon', y='air', groupby='time', by='lat')
)