# Xarray

This document describes the basics of the xarray library, which provides the datatypes most commonly used when working with the Open Data Cube.

## Background
`Xarray` is a Python library that simplifies working with labelled multi-dimensional arrays. It introduces labels in the form of dimensions, coordinates and attributes on top of raw `numpy` arrays, allowing for more intuitive and concise development.

First we will load data from the Data Cube to have some example data to work with.

In [1]:
import datacube

# Allow importing of our utilities.
import sys
sys.path.append('../../Scripts')
from deafrica_datahandling import load_ard

In [2]:
dc = datacube.Datacube(app="intro_to_xarray")

In [3]:
# Dar es Salaam, Tanzania - 2018
query = dict(dc=dc,
             min_gooddata=0.7,
             x=(39.20, 39.37),
             y=(-6.90, -6.70),
             time=("2018-01-01", "2018-12-31"),
             output_crs="EPSG:4326",
             resolution=(-0.00027, 0.00027),
             group_by='solar_day')

In [4]:
landsat_ds = load_ard(products=["ls7_usgs_sr_scene", "ls8_usgs_sr_scene"],
                      **query)

Using pixel quality parameters for USGS Collection 1
Finding datasets
    ls7_usgs_sr_scene
    ls8_usgs_sr_scene
Counting good quality pixels for each time step
Filtering to 4 out of 41 time steps with at least 70.0% good quality pixels
Applying pixel quality/cloud mask
Loading 4 time steps


As we saw in the document on loading data, the output of `load_ard()` (here, `landsat_ds`) is an `xarray.Dataset` object, as shown below.

In [5]:
landsat_ds

### Interpreting the resulting `xarray.Dataset`
The variable `landsat_ds` contains all data that matched the query parameters (spatial, temporal, product, etc.) passed to `load_ard()`.

All `xarray.Dataset` objects have these properties:

* Dimensions: These note the sizes of the data along each dimension (the shape).
* Coordinates: These allow indexing of the data by coordinate values.
* Data variables: These are what contain the data. For data loaded from the Data Cube, each `measurement` specified in the query will be a data variable in the returned `xarray.Dataset`.
* Attributes: These contain miscellaneous information about the data.
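
The four properties listed above can be inspected programmatically. The following is a minimal sketch that uses a small synthetic dataset rather than data loaded from the Data Cube; the variable names, coordinate values and the `crs` attribute are made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for a small load: 3 time steps of 2 x 4 pixels.
# All values below are invented for illustration.
times = pd.to_datetime(
    ["2018-01-31 07:30:00", "2018-04-05 07:29:00", "2018-07-18 07:31:00"]
)
lat = np.array([-6.70, -6.80])
lon = np.array([39.20, 39.25, 39.30, 39.35])
dims = ("time", "latitude", "longitude")
shape = (len(times), len(lat), len(lon))

ds = xr.Dataset(
    data_vars={
        "red": (dims, np.ones(shape)),
        "nir": (dims, np.full(shape, 2.0)),
    },
    coords={"time": times, "latitude": lat, "longitude": lon},
    attrs={"crs": "EPSG:4326"},
)

print(dict(ds.sizes))      # dimension name -> length
print(list(ds.coords))     # coordinate names
print(list(ds.data_vars))  # data variable names
print(ds.attrs)            # miscellaneous metadata
```

The same four accessors (`sizes`, `coords`, `data_vars`, `attrs`) should work on any `xarray.Dataset`, including `landsat_ds`.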

Here is a description of the significance of these properties for this data:

*Dimensions*

* Identifies the number of timesteps returned in the search (`time`) as well as the number of pixels in the x and y dimensions of the data query. In this case, using a CRS of EPSG:4326, the x and y dimensions are called `longitude` and `latitude`, respectively. However, for many coordinate systems, they will be called `x` and `y`.

*Coordinates* 

* `time` identifies the date attributed to each returned timestep.
* `longitude` and `latitude` are the coordinates for each pixel within the spatial bounds of your query.

*Data variables*

* These are the measurements available for the nominated product. 
For every date (`time`) returned by the query, the measured value at each pixel (`latitude`, `longitude`) is returned as an array for each measurement.
Each data variable is itself an `xarray.DataArray` object (see below). So this data really has 4 dimensions - 2 for space (`latitude`, `longitude`), 1 for time (`time`), and 1 for the data variables (`red`, `green`, `blue`, ...).

*Attributes*

* `crs` identifies the coordinate reference system (CRS) of the loaded data. 
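
The remark above that the data effectively has 4 dimensions can be made concrete with `Dataset.to_array()`, which stacks the data variables along a new dimension. This is a sketch on synthetic data; the dimension name `band` is an arbitrary choice for this example.

```python
import numpy as np
import xarray as xr

dims = ("time", "latitude", "longitude")
ds = xr.Dataset(
    {
        "red": (dims, np.zeros((4, 2, 3))),
        "green": (dims, np.ones((4, 2, 3))),
    }
)

# Each data variable on its own is a 3D DataArray.
print(ds["red"].dims)

# Stacking the variables adds a fourth dimension ("band" is a made-up name).
stacked = ds.to_array(dim="band")
print(stacked.dims)
print(stacked.shape)
```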



### Inspecting an individual `xarray.DataArray`
The `xarray.Dataset` object we loaded above is a collection of individual `xarray.DataArray` objects that hold the actual data for each data variable/measurement. 
For example, all measurements listed under _Data variables_ above (e.g. `blue`, `green`, `red`, `nir`, `swir1`, `swir2`) are `xarray.DataArray` objects.

We can inspect the data in these `xarray.DataArray` objects for an `xarray.Dataset` named `ds` using either of the following syntaxes:
```
ds["measurement_name"]
```
or:
```
ds.measurement_name
```
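
As a quick check that the two syntaxes above return the same object, here is a minimal sketch on a synthetic dataset with a single made-up `nir` variable:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"nir": (("latitude", "longitude"), np.arange(6).reshape(2, 3))}
)

by_key = ds["nir"]  # dictionary-style access
by_attr = ds.nir    # attribute-style access

print(type(by_key).__name__)
print(by_key.identical(by_attr))
```

The bracket syntax is the safer choice when a measurement name is not a valid Python identifier or clashes with an existing `Dataset` method or attribute.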

Being able to access data from individual data variables/measurements allows us to manipulate and analyse data from individual satellite bands or specific layers in a dataset. 
For example, we can access data from the near infra-red satellite band (i.e. `nir`):

In [6]:
landsat_ds.nir

Note that the object header informs us that it is an `xarray.DataArray` containing data for the `nir` satellite band. 

Like an `xarray.Dataset`, the array also includes information about the data's dimensions, coordinates and attributes.

> **Note**: For a more in-depth introduction to `xarray` data structures, refer to the [official xarray documentation](http://xarray.pydata.org/en/stable/data-structures.html)

## Indexing

Indexing data in an `xarray.Dataset` or `xarray.DataArray` can be done in 2 ways: integer indexing and label indexing.

### Integer indexing
Integer indexing is the selection of data by coordinate **indexes**. The first index in a coordinate array (e.g. `time`) is 0. The next index is 1, and so on. In xarray, this is achieved with the [`isel()`](http://xarray.pydata.org/en/stable/generated/xarray.Dataset.isel.html) method.

Here is an example of integer indexing with the `time` dimension. The following code selects the first time slice. Notice that there is no `time` dimension in the output. There is still a `time` coordinate, but it is a single value.

In [7]:
landsat_ds.isel(time=0)

Multiple coordinate values of a dimension can be selected with `slice`. The lower value is inclusive and the upper value is exclusive, so `slice(0,3)` selects the first 3 elements.

In [8]:
landsat_ds.isel(time=slice(0,3))
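
The behaviour described above can be verified on a small synthetic dataset. This sketch assumes nothing about the Data Cube; the `red` variable and its dates are invented.

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2018-01-01", periods=4, freq="D")
ds = xr.Dataset(
    {"red": (("time", "latitude", "longitude"), np.zeros((4, 2, 3)))},
    coords={"time": times},
)

# Selecting a single integer index drops the dimension
# but keeps a scalar coordinate.
first = ds.isel(time=0)
print("time" in first.sizes)
print("time" in first.coords)

# Integer slices exclude the upper bound, so this keeps indexes 0, 1 and 2.
first_three = ds.isel(time=slice(0, 3))
print(first_three.sizes["time"])
```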

### Label indexing
Label indexing is the selection of data by coordinate **values** (or "labels"). Coordinate values for x and y dimensions are numbers. Coordinate values for the `time` dimension are datetime values (typically `numpy.datetime64`), though we can use date strings of the format `YYYY-MM-DD` to index by time. In xarray, this is achieved with the [`sel()`](http://xarray.pydata.org/en/stable/generated/xarray.Dataset.sel.html) method.

Here is an example of label indexing with the `time` dimension. The following code selects the first time slice.

In [9]:
landsat_ds.sel(time="2018-01-31")

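Notice that, in this case, there is a `time` dimension in the output. This is because the date string is less precise than the timestamps stored in the `time` coordinate, so it is treated as a range covering the whole day rather than as a single exact value. As a result, the `time` coordinate is a single value, but that value is within an array (indicated by its **bold** font).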

In other words, the data is still 3D. To achieve the same result as we did when using `isel()`, call [`squeeze()`](http://xarray.pydata.org/en/stable/generated/xarray.Dataset.squeeze.html), which removes dimensions of length 1.

In [10]:
landsat_ds.sel(time="2018-01-31").squeeze()

Label indexing also supports `slice`. Unlike integer slices, label-based slices include both the lower and the upper value.

In [11]:
landsat_ds.sel(time=slice("2018-01-31", "2018-07-18"))
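
The label indexing behaviour described in this section can be reproduced on synthetic data. The sketch below invents four acquisition timestamps with a time-of-day component, which is what makes a plain date string match a range rather than a single exact value.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Made-up acquisition times with second-level precision.
times = pd.to_datetime(
    [
        "2018-01-31 07:30:12",
        "2018-04-05 07:29:48",
        "2018-07-18 07:31:05",
        "2018-10-06 07:30:40",
    ]
)
ds = xr.Dataset(
    {"red": (("time", "latitude", "longitude"), np.zeros((4, 2, 3)))},
    coords={"time": times},
)

# A date string selects every timestamp on that day,
# so the time dimension is kept.
one_day = ds.sel(time="2018-01-31")
print(one_day.sizes["time"])

# squeeze() drops dimensions of length 1; naming the dimension is more explicit.
squeezed = one_day.squeeze("time")
print("time" in squeezed.sizes)

# Label slices include both endpoints.
ranged = ds.sel(time=slice("2018-01-31", "2018-07-18"))
print(ranged.sizes["time"])
```

Passing a dimension name to `squeeze()` avoids accidentally dropping other dimensions that happen to have length 1, such as a single row or column of pixels.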

## Learn more

More information about `xarray` data structures, functions, and more can be found [here](http://xarray.pydata.org/en/stable/).

See the ["How do I..."](http://xarray.pydata.org/en/stable/howdoi.html) xarray webpage for guidance about what xarray methods to use in different scenarios.

This [external notebook](https://rabernat.github.io/research_computing/xarray.html) introduces more uses of xarray and may help you advance your skills further.