# Introduction to xarray

https://docs.xarray.dev/en/stable/index.html

xarray (formerly xray) is an open source project and Python package that makes working with labelled multi-dimensional arrays simple, efficient, and fun!

Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like arrays, which allows for a more intuitive, more concise, and less error-prone developer experience. The package includes a large and growing library of domain-agnostic functions for advanced analytics and visualization with these data structures.

Xarray is inspired by and borrows heavily from pandas, the popular data analysis package focused on labelled tabular data. It is particularly tailored to working with netCDF files, which were the source of xarray’s data model, and integrates tightly with dask for parallel computing.

If you want to get a feel for xarray first, have a look at the following slides:

https://fabienmaussion.info/acinn_xarray_workshop/#/

In [None]:
!pip install xarray

## 1) xarray for labelled multidimensional data & relationship with numpy

Let's load a sample image as a numpy array, using scikit-image

In [None]:
import numpy as np
import skimage.data
import xarray as xr
from matplotlib import pyplot as plt

In [None]:
cat = skimage.data.chelsea()

In [None]:
plt.imshow(cat)
plt.show()

Let's wrap it in an xarray DataArray.

As you can see, it "wraps" a numpy array and allows you to name dimensions.

In [None]:
cat = xr.DataArray(data=cat, dims=["y", "x", "band"])

In [None]:
cat

With xarray:

- Data is stored as numpy arrays.
- Dimensions have names.
- The coordinates along each dimension can represent geographical coordinates, categories, dates, ... instead of just an integer index.
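
To make the three points above concrete, here is a minimal sketch on a small synthetic array (the values, the coordinate labels and the `description` attribute are made up for illustration, they are not part of the cat image):

```python
import numpy as np
import xarray as xr

# A tiny synthetic "image": 2 rows (y), 3 columns (x), 3 colour bands.
values = np.arange(18, dtype=np.uint8).reshape(2, 3, 3)

image = xr.DataArray(
    data=values,
    dims=["y", "x", "band"],                     # named dimensions
    coords={"band": ["red", "green", "blue"]},   # labels instead of 0, 1, 2
    attrs={"description": "synthetic example"},  # free-form metadata
)

print(image.dims)          # names of the dimensions
print(image.sizes)         # mapping from dimension name to length
print(type(image.values))  # the underlying numpy array
```

Dimensions without explicit coordinates (here `y` and `x`) simply keep positional indexing.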



Xarray’s labels make working with multidimensional data much easier:

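In numpy, to transpose data (here, swapping the first two axes), you would use

```python
# cat is a plain numpy array here; axes are given by position
cat = cat.transpose((1, 0, 2))
```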

However, if you don't remember the order of the dimensions, or don't know whether they are already in the correct order, this quickly becomes painful.

Using xarray, you can directly use the dimension names:

In [None]:
trcat = cat.transpose("x", "y", "band")

In [None]:
trcat

In [None]:
# It works like a numpy array
plt.imshow(trcat)
plt.show()
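
As a side-by-side comparison of positional and named transposes, here is a small sketch on a synthetic array (the shapes are arbitrary assumptions chosen to make the axes easy to tell apart):

```python
import numpy as np
import xarray as xr

values = np.zeros((4, 5, 3))
arr = xr.DataArray(values, dims=["y", "x", "band"])

# Positional transpose in numpy: you must know that y, x, band are axes 0, 1, 2.
swapped_np = values.transpose((1, 0, 2))

# Named transpose in xarray: the current axis order does not matter.
swapped_xr = arr.transpose("x", "y", "band")

# An Ellipsis stands for "all remaining dimensions, in their current order".
band_first = arr.transpose("band", ...)

print(swapped_np.shape, swapped_xr.shape, band_first.dims)
```

The named version gives the same target layout no matter what order the input dimensions were in.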

Let's take the mean across the channels to get a grayscale cat. With numpy, you would write
```python
gray_cat = cat.mean(axis=-1)
```
This implies, as usual, remembering the order of the dimensions. For now it's fairly easy, but it gets harder on datasets with many dimensions. With xarray, you name the dimension instead:

In [None]:
gray_cat = cat.mean(dim="band")
gray_cat

In [None]:
plt.imshow(gray_cat, cmap="gray")
plt.show()
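
To check that reducing by name matches reducing by axis number, here is a short sketch on random synthetic data (the array shape and the seed are arbitrary assumptions):

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
values = rng.integers(0, 256, size=(4, 5, 3)).astype(np.uint8)
arr = xr.DataArray(values, dims=["y", "x", "band"])

# Same reduction, once by axis position and once by dimension name.
gray_np = values.mean(axis=-1)
gray_xr = arr.mean(dim="band")
same = np.allclose(gray_np, gray_xr.values)

# The named reduction still works after the axes have been reordered.
gray_reordered = arr.transpose("band", "y", "x").mean(dim="band")

# Several dimensions can be reduced at once.
per_band_mean = arr.mean(dim=["y", "x"])

print(same, gray_xr.dims, per_band_mean.dims)
```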

You can slice the array by dimension name, like a pandas dataframe, instead of by position like a numpy array

In [None]:
# xarray
cropped_cat = cat.isel(dict(x=slice(0, 256), y=slice(0, 256)))
cropped_cat

# but you can still use numpy style indexing
numpy_cat = cat[:256, :256, :]

np.all(cropped_cat == numpy_cat)

In [None]:
plt.imshow(cropped_cat)
plt.show()

You can also attach labels ("coordinates") to a dimension

In [None]:
cat = cat.assign_coords(dict(band=["red", "green", "blue"]))
cat

In [None]:
# You can now select by label
blue_cat = cat.sel(band="blue")
blue_cat
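
The difference between position-based `isel` and label-based `sel` is easier to see on a small array. This is only a sketch: the `x` coordinate values (10, 20, 30, 40) are invented for the example.

```python
import numpy as np
import xarray as xr

values = np.arange(2 * 4 * 3).reshape(2, 4, 3)
arr = xr.DataArray(
    values,
    dims=["y", "x", "band"],
    coords={"x": [10, 20, 30, 40], "band": ["red", "green", "blue"]},
)

# isel: select by integer position (end of the slice is excluded).
first_two_cols = arr.isel(x=slice(0, 2))

# sel: select by coordinate label (label-based slices include both ends).
cols_10_to_30 = arr.sel(x=slice(10, 30))

# A list of labels keeps the dimension, a single label drops it.
red_and_blue = arr.sel(band=["red", "blue"])
blue_only = arr.sel(band="blue")

print(first_two_cols.sizes["x"], cols_10_to_30.sizes["x"])
print(red_and_blue.dims, blue_only.dims)
```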

## 2) xarray + dask

The magic of xarray is that it interfaces not only with numpy but also with dask, providing a single front end to both

In [None]:
import glob

import dask
import dask.array as da
import numpy as np
import skimage.io

In [None]:
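lazy_cat_fn = dask.delayed(skimage.data.chelsea, pure=True)  # Lazy version of skimage.data.chelsea

# The declared shape must match what the function returns: 300 rows (y), 451 columns (x), 3 bands
lazy_cat = da.from_delayed(lazy_cat_fn(), dtype=np.uint8, shape=(300, 451, 3))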

In [None]:
new_cat = xr.DataArray(lazy_cat, dims=("y", "x", "band"))

In [None]:
# Our DataArray is lazy!
new_cat

In [None]:
new_gray_cat = new_cat.mean(dim="band")

In [None]:
new_gray_cat

In [None]:
type(new_gray_cat.data)

In [None]:
# Converting to a numpy array triggers the computation
arr = np.asarray(new_gray_cat)

In [None]:
plt.imshow(arr, cmap="gray")
plt.show()
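
The same lazy workflow can be sketched without any image, using a chunked dask array of ones (the chunk size of 100 rows is an arbitrary assumption). Operations only build a task graph until `.compute()` is called:

```python
import dask.array as da
import numpy as np
import xarray as xr

# A chunked dask array: nothing is computed yet.
lazy_values = da.ones((300, 451, 3), chunks=(100, 451, 3), dtype=np.uint8)
lazy_arr = xr.DataArray(lazy_values, dims=["y", "x", "band"])

# Operations only extend the task graph.
lazy_gray = lazy_arr.mean(dim="band")
print(type(lazy_gray.data))  # still a dask array

# .compute() runs the graph and returns a numpy-backed DataArray.
gray = lazy_gray.compute()
print(type(gray.data), gray.shape)
```

Compared with `np.asarray`, `.compute()` keeps the dimension names and coordinates on the result.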

## 3) How do I... ?

You can look at the "How do I..." guide https://docs.xarray.dev/en/stable/howdoi.html to get more ideas about using xarray

## What's next

For more information about xarray, you can read the user guide: https://xarray.pydata.org/en/stable/user-guide/index.html

You can also play with the toy data provided in the gallery: https://xarray.pydata.org/en/stable/gallery.html to get a feel for xarray's capabilities, most notably with time series data

## Going further: Pangeo, a community platform for Big Data geoscience

![](https://pangeo.io/_images/pangeo_tech_1.png)

Website: https://pangeo.io/index.html

They have a gallery with many interesting examples, many of them using this combination of xarray and dask.

Pangeo focuses primarily on cloud computing (storing the big datasets in cloud-native file formats and also doing the computations in the cloud), but all the tools like xarray and dask developed by this community and shown in the examples also work on your laptop or university's cluster.

They use a technical stack based on the modern Python ecosystem

![](https://pangeo.io/_images/interop.jpeg)

You can look at some of the examples provided by Pangeo, many of which use dask & xarray: http://gallery.pangeo.io/

For example the excellent dask & xarray tutorials:
- http://gallery.pangeo.io/repos/pangeo-data/pangeo-tutorial-gallery/dask.html
- http://gallery.pangeo.io/repos/pangeo-data/pangeo-tutorial-gallery/xarray.html