<img align="right" src="../../additional_data/banner_siegel.png" style="width:1100px;">

# Advanced xArray

* [**Sign up to the JupyterHub**](https://www.phenocube.org/) to run this notebook interactively from your browser
* **Compatibility:** Notebook currently compatible with the Open Data Cube environments of the University of Wuerzburg
* **Products used**: `s2_l2a_bavaria`
* **Prerequisites**:  Users of this notebook should have a basic understanding of:
    * How to run a [Jupyter notebook](01_jupyter_introduction.ipynb)
    * The basic structure of the eo2cube [satellite datasets](02_eo2cube.ipynb)
    * How to browse through the available [products and measurements](03_products_and_measurements.ipynb) of the eo2cube datacube 
    * How to [load data from the eo2cube datacube](04_loading_data_and_basic_xarray.ipynb) 

## Background

The Python library `xarray` simplifies working with labelled multi-dimensional arrays. The library introduces labels in the form of dimensions, coordinates and attributes on top of `numpy` arrays. This structure allows easier and more effective handling of remote sensing raster data in a Python environment, so it is essential to fully understand the structure of an `xarray` object. A first introduction to the usage of `xarray` within the eo2cube environment was given in ["04_loading_data_and_basic_xarray"](04_loading_data_and_basic_xarray.ipynb). This notebook builds on that knowledge and aims to give a deeper understanding of the `xarray` data structure for raster data. Since the `xarray.Dataset` within the datacube environment is specialised for remote sensing raster data, it differs slightly from the original `xarray` library. If you are interested in learning more about the basic structures of the original `xarray`, have a look at this [**"introduction to xarray" notebook**](intro_to_xarray.ipynb) within the "intro_to_python" directory.
To get more information about the `xarray` package, visit the [official documentation website](http://xarray.pydata.org/en/stable/).

## Description

This notebook introduces users to the `xarray` library within the datacube environment. It aims to deepen the understanding of the `xarray` structure as a container for remote sensing raster data. It also introduces useful `xarray` functions for working effectively with raster data in the eo2cube environment. The following topics are covered in this notebook:

* Definition of the `xarray.Dataset` structure for eo2cube remote sensing data
    * Access of `xarray.Dataset` dimensions, measurements and metadata
    * Inspection of `xarray.DataArray` structure and data values
* Indexing and slicing of `xarray.Dataset`
* Application of built-in `xarray` functions for analyzing raster data

***

## Load packages

The `datacube` package is required to query the eo2cube datacube database and load the requested data. The `with_ui_cbk` function from `odc.ui` enables a progress bar when loading large amounts of data. The `xarray` and `numpy` packages are needed for the different methods and analysis steps within this notebook.

In [119]:
import datacube
from odc.ui import with_ui_cbk
import xarray as xr
import numpy as np

## Datacube connection and load data

First we connect to the datacube and load a dummy dataset from the eo2cube. For this we will use the `s2_l2a_bavaria` product. An area around Würzburg is loaded for the first week of April 2020. For more information about how to use the `dc.load()` function, check out [notebook 04](04_loading_data_and_basic_xarray.ipynb).

In [120]:
dc = datacube.Datacube(app = '05_advanced_xarray', config = '/home/datacube/.datacube.conf')

In [121]:
data = dc.load(product= "s2_l2a_bavaria",
               measurements= ["blue", "green", "red"],
               x= (9.8506165, 11.273325),
               y= (49.7352601, 50.191334),
               time= ("2020-04-01", "2020-04-07"),
               group_by = "solar_day",
               progress_cbk=with_ui_cbk())

data

## `xarray.Dataset` structure

The variable `data` is an `xarray.Dataset`. An `xarray.Dataset` is essentially a dictionary-like data container that packs the raster dataset into "dimensions", "coordinates", "data variables" and "attributes".

The **"dimensions"** represent the absolute dimensions of the data, i.e. the number of time steps (how many different scenes are available in the dataset) and the number of pixels in x and y (lon and lat) direction of the dataset.

The **"coordinates"** represent the actual values of the dimensions. These are stored in `xarray.DataArrays`. At their core, `xarray.DataArrays` are built on raw `numpy` arrays. To see a preview of the contained values, click the database symbol on the right of the `xarray.DataArray`. [The section below](#index_xarray1) focuses on how to select and actually work with the values of the different `xarray.DataArrays` of a dataset.
The `time` variable within the "coordinates" section displays an array with the time steps available in the dataset. In the example of `data`, the dataset contains four time steps.
The `y` and `x` variables display the pixel coordinate values for each pixel within the spatial bounds of the dataset. In our example, the `y` array holds 5284 values and the `x` array holds 10310 values (as defined in the "dimensions"). For every unique combination of `y` and `x` coordinate, a value for each measurement/band exists.
The `spatial_ref` variable contains only a single value, which defines the EPSG code of the CRS.

The **"data variables"** display all loaded measurements/bands of the dataset. The measurements are labelled with the band names of the raster data. The actual values of a measurement are stored in a multi-dimensional `xarray.DataArray`. This array consists of three dimensions. The first dimension represents the `time` variable. In our example, the `xarray.DataArray` of a band stores four arrays in the first dimension, each representing one time step. The second dimension represents the `y` coordinates. Therefore, a band `xarray.DataArray` in `data` stores 5284 values in the second dimension, each representing a single pixel in the y direction. The third dimension represents the `x` coordinates. For the `data` dataset, a band `xarray.DataArray` stores 10310 values in the third dimension, each representing a single pixel in the x direction. In [the following section](#index_array2) we will learn how to select and work with the band `xarray.DataArrays` and thereby obtain a deeper understanding of the multi-dimensional `xarray.DataArray` structure.

The **"attributes"** section stores the defined CRS of the loaded data.
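
The four sections described above can be reproduced on a small scale without a datacube connection. The following is a minimal sketch, assuming a hypothetical miniature dataset `toy` with random values; the coordinate values, array sizes and the name `toy` are made up for illustration and are not loaded from the eo2cube.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical miniature stand-in for `data`: 4 time steps, 3 x 5 pixels.
time = pd.date_range("2020-04-01", periods=4, freq="2D")
y = np.array([5562000.0, 5561990.0, 5561980.0])  # descending, like a north-up raster
x = np.array([560700.0, 560710.0, 560720.0, 560730.0, 560740.0])

# One (time, y, x) array of random values per band.
rng = np.random.default_rng(0)
bands = {
    name: (("time", "y", "x"), rng.integers(0, 3000, size=(4, 3, 5), dtype=np.int16))
    for name in ["blue", "green", "red"]
}

toy = xr.Dataset(
    data_vars=bands,
    coords={"time": time, "y": y, "x": x, "spatial_ref": np.int32(25832)},
    attrs={"crs": "EPSG:25832", "grid_mapping": "spatial_ref"},
)

print(toy)                  # shows dimensions, coordinates, data variables, attributes
print(dict(toy.sizes))      # length of each dimension
print(list(toy.data_vars))  # names of the bands
```

Printing `toy` gives the same four-part layout as the `data` dataset above, only with far fewer pixels.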

## Inspection of an `xarray.Dataset`

As discussed above, the `xarray.Dataset` packs the raster data into different sections ("dimensions", "coordinates", "data variables", "attributes"). The individual variables stored in these sections can be addressed by using either of the following syntaxes:
```python
data["measurement_name"]
```
or
```python
data.measurement_name
```
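
As a minimal sketch of the two syntaxes, assuming a hypothetical two-by-three pixel dataset `toy` with a single `red` band (all names and values here are illustrative):

```python
import numpy as np
import xarray as xr

# Hypothetical tiny dataset with one band.
toy = xr.Dataset(
    {"red": (("y", "x"), np.arange(6).reshape(2, 3))},
    coords={"y": [20.0, 10.0], "x": [0.0, 10.0, 20.0]},
)

# Both syntaxes return the same DataArray.
same = toy["red"].identical(toy.red)
print(same)

# The bracket syntax also accepts a list of names and returns a smaller Dataset.
subset = toy[["red"]]

# New variables are created with the bracket syntax.
toy["red_scaled"] = toy["red"] / 10000
print(list(toy.data_vars))
```

The attribute syntax is convenient for interactive inspection. The bracket syntax is the safer choice in scripts, because it also works for names that clash with existing methods of the dataset and for assigning new variables.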

The following code selects the "dimensions" of an `xarray.Dataset`. The result is a dict-like container that maps each dimension name to its length.

In [122]:
data.dims

Frozen(SortedKeysDict({'time': 4, 'y': 5284, 'x': 10310}))

The following code selects the "coordinates" of an `xarray.Dataset`. The result is a dict-like container of arrays (coordinates) that label each point.

In [123]:
data.coords

Coordinates:
  * time         (time) datetime64[ns] 2020-04-01T10:26:54 ... 2020-04-06T10:...
  * y            (y) float64 5.562e+06 5.562e+06 5.562e+06 ... 5.51e+06 5.51e+06
  * x            (x) float64 5.607e+05 5.607e+05 ... 6.638e+05 6.638e+05
    spatial_ref  int32 25832

The following code selects the "attributes" of an `xarray.Dataset`. The result is a dict holding metadata.

In [124]:
data.attrs

{'crs': 'EPSG:25832', 'grid_mapping': 'spatial_ref'}

<a id='index_xarray1'></a>
### Inspecting an `xarray.DataArray`

To inspect an `xarray.DataArray` stored in "coordinates" or "data variables", use the same syntax as above. The ability to access every `xarray.DataArray` individually is useful for inspecting or manipulating the data stored in the array. Like an `xarray.Dataset`, the `xarray.DataArray` also includes information about the data's "dimensions", "coordinates" and "attributes". The following code selects the `xarray.DataArray` containing the data for the `red` band of the `s2_l2a_bavaria` product:

In [125]:
data.red

Compared to the `xarray.Dataset`, the `xarray.DataArray` of a specific band stores some additional metadata in the "attributes" section, such as the `nodata` value or the `units` of the values.

As you can see, the `xarray.DataArray` is still a container for the data (band) values and their related metadata ("coordinates", "attributes"). To access the actual band values stored in the `xarray.DataArray`, append the attribute `.values`. This returns the raw (single- or multi-dimensional) `numpy` array holding the actual band values.

In [126]:
data.red.values

array([[[ 805,  459,  438, ...,  423,  505,  796],
        [ 885,  658,  550, ...,  532,  877, 1278],
        [1074,  866,  714, ...,  919, 1372, 1410],
        ...,
        [ 928,  987, 1050, ...,  366,  496,  580],
        [1062, 1128, 1252, ...,  370,  327,  449],
        [1108, 1310, 1492, ...,  455,  397,  413]],

       [[   0,    0,    0, ...,    0,    0,    0],
        [   0,    0,    0, ...,    0,    0,    0],
        [   0,    0,    0, ...,    0,    0,    0],
        ...,
        [   0,    0,    0, ...,    0,    0,    0],
        [   0,    0,    0, ...,    0,    0,    0],
        [   0,    0,    0, ...,    0,    0,    0]],

       [[3038, 2890, 2878, ...,    0,    0,    0],
        [3216, 3038, 2932, ...,    0,    0,    0],
        [3442, 3320, 3060, ...,    0,    0,    0],
        ...,
        [ 901,  924,  995, ...,    0,    0,    0],
        [ 976, 1090, 1196, ...,    0,    0,    0],
        [1013, 1240, 1366, ...,    0,    0,    0]],

       [[ 733,  420,  469, ...,  434,
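
The relationship between the container and the raw array can be sketched on a small example. The array `red` below is a hypothetical two-scene, two-by-two pixel band; its attribute values are assumptions for illustration, not the metadata of the `s2_l2a_bavaria` product.

```python
import numpy as np
import xarray as xr

# Hypothetical band: 2 time steps, 2 x 2 pixels, second scene contains only nodata.
red = xr.DataArray(
    np.array([[[805, 459], [885, 658]], [[0, 0], [0, 0]]], dtype=np.int16),
    dims=("time", "y", "x"),
    attrs={"nodata": 0, "units": "1"},  # assumed metadata values
    name="red",
)

raw = red.values  # plain numpy array without labels or metadata
print(type(raw).__name__, raw.shape, raw.dtype)
print(red.attrs["nodata"])  # metadata stays on the DataArray
```

Note that the labels and attributes are lost once `.values` is used, so it is usually best to stay within `xarray` as long as possible.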

<a id='index_array2'></a>
### Multi-dimensional `xarray.DataArray` structure

The structure of a multi-dimensional measurement/band `xarray.DataArray` consists of three dimensions. The first dimension describes the `time` and holds one array for each unique time step. To select only the values of the first time step, run the code below. [The next section](#index_array3) demonstrates some other `xarray` indexing functions which are easier and more effective. However, to better understand the structure of these multi-dimensional arrays, we first make use of this method.

In [127]:
data.red[0] #data values for first time step of the red band

The `xarray.DataArray` above contains all data values of the `red` band for the first time step. The values are still arranged along the y and x dimensions. To access the second dimension (y coordinates), run the following code.

In [128]:
data.red[0][0]

Notice that within the "coordinates" section the y-coordinate only contains a single value. That is because we indexed the `red` band to the first pixel in y direction. Therefore, the resulting `xarray.DataArray` now contains the data values of all x pixels in the first row (first y pixel, 2nd dimension) for the first time step (1st dimension).

To select the first pixel in x direction, run the following code. This gives us a unique combination of the three dimensions `time`, `y` and `x`.

In [129]:
data.red[0][0][0]

The resulting `xarray.DataArray` contains the single data value for the first pixel (y[0], x[0]) of the raster dataset in the first scene (time). Based on the `xarray` structure, every data value of a measurement/band in a multi-dimensional array can be assigned to a pixel by a unique combination of the three dimensions.
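
The step-by-step indexing used above is equivalent to indexing all three dimensions at once. A minimal sketch, assuming a hypothetical band `red` with 2 time steps and 3 x 4 pixels filled with consecutive numbers:

```python
import numpy as np
import xarray as xr

# Hypothetical band: values 0..23 arranged as (time, y, x).
red = xr.DataArray(
    np.arange(24).reshape(2, 3, 4),
    dims=("time", "y", "x"),
    name="red",
)

chained = red[0][0][0]                # one dimension at a time
combined = red[0, 0, 0]               # numpy-style, all dimensions at once
by_name = red.isel(time=0, y=0, x=0)  # by dimension name

# Each indexing step removes the leading dimension.
print(red[0].dims)
print(red[0][0].dims)
print(chained.dims)
print(int(chained), int(combined), int(by_name))
```

All three variants return the same value. The named variant does not depend on the order of the dimensions, which makes it the most robust of the three.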

<a id='index_array3'></a>
## Indexing

The `xarray` library offers two convenient methods for selecting data. You can either use the function `isel()` to select a scene from your dataset by its integer position (as in `numpy`), or you can use the `sel()` function to select data based on the dimension labels.
The following example uses the positional indexing method to select the first scene of the `data` dataset.

In [130]:
data.isel(time=0)

The argument `time` within the `isel()` function also accepts a list of indices. The example below selects the first three scenes of `data`:

In [131]:
data.isel(time=[0,1,2])

The function `sel()` is a very powerful indexing method when working with big datasets. The argument `time` accepts full or partial date strings that match the labels of the time dimension. You can use:
* `YYYY` to select all scenes in this year
* `YYYY-MM` to select all scenes of this month
* `YYYY-MM-DD` to select all scenes of this day

To demonstrate the indexing by label method we need a bigger dummy dataset (`data` only includes four scenes in April 2020). `data_1` contains scenes of the `s2_l2a_bavaria` product from December 2019 to February 2020.

In [132]:
data_1 = dc.load(product= "s2_l2a_bavaria",
                 measurements= ["blue", "green", "red"],
                 x= (9.8506165, 11.273325),
                 y= (49.7352601, 50.191334),
                 time= ("2019-12-01", "2020-02-28"),
                 group_by = "solar_day",
                 progress_cbk=with_ui_cbk())

data_1

First we select all scenes of `data_1` in the year 2020 using the `sel()` function. Note that the total number of scenes is lower because the scenes from December 2019 were dropped. To see all the remaining time steps, click the database symbol on the right of the `time` array.

In [133]:
data_1.sel(time="2020")

This example selects all scenes from January 2020 included in `data_1`.

In [134]:
data_1.sel(time="2020-01")

This example selects all scenes from 25 December 2019 included in `data_1`. Since we used the `group_by = "solar_day"` argument in the `dc.load()` function, only one scene is available for this date.

In [135]:
data_1.sel(time = "2019-12-25")
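
The three label formats can be tried out without loading data from the datacube. The following is a minimal sketch, assuming a hypothetical time series `toy` with one acquisition every five days starting on 2019-12-01; the dates and values are made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# 18 hypothetical acquisitions, one every 5 days from 2019-12-01 to 2020-02-24.
time = pd.date_range("2019-12-01 10:26:54", periods=18, freq="5D")
toy = xr.Dataset(
    {"red": (("time",), np.arange(18))},
    coords={"time": time},
)

n_2020 = toy.sel(time="2020").sizes["time"]    # all scenes of the year 2020
n_jan = toy.sel(time="2020-01").sizes["time"]  # all scenes of January 2020
one_day = toy.sel(time="2019-12-26")           # all scenes of a single day

print(n_2020, n_jan)
print(one_day.time.values)
```

The seven scenes from December 2019 are dropped by the first selection, and the second selection keeps only the six scenes from January.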

For both methods (`isel()` and `sel()`) a **slicing** operator exists. If a `slice()` object is passed to the index function, a range of the dataset is selected.
The first example uses the slicing by position method to select the first five scenes in `data_1`. The start value is included and the stop value is excluded.
The second example uses the slicing by label method to select the scenes between "2019-12-08" and "2019-12-25". Note that when using `slice()` with the `sel()` method, both start and stop value are included.

In [136]:
data_1.isel(time=slice(0,5))

In [137]:
data_1.sel(time=slice("2019-12-08","2019-12-25"))
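
The different treatment of the stop value can be checked on a small example. This sketch again assumes a hypothetical time series `toy` with one acquisition every five days; it is not the `data_1` dataset.

```python
import numpy as np
import pandas as pd
import xarray as xr

# 18 hypothetical acquisitions: 2019-12-01, 2019-12-06, 2019-12-11, ...
time = pd.date_range("2019-12-01 10:26:54", periods=18, freq="5D")
toy = xr.Dataset({"red": (("time",), np.arange(18))}, coords={"time": time})

# Positional slice: index 0 is included, index 5 is excluded.
by_position = toy.isel(time=slice(0, 5))

# Label slice: both 2019-12-06 and 2019-12-21 are included.
by_label = toy.sel(time=slice("2019-12-06", "2019-12-21"))

print(by_position.red.values)
print(by_label.red.values)
```

The positional slice returns the scenes at positions 0 to 4, whereas the label slice returns the scenes of 6, 11, 16 and 21 December, including the scene that falls on the stop date.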

`xarray` also includes some useful features for inspecting the time dimension, which make it easy to extract additional information from a dataset. The following code automatically assigns each time step to a season ("DJF", "MAM", "JJA", "SON"). Since `data_1` only contains scenes from winter months, only the label "DJF" will appear.

In [138]:
data_1.time.dt.season

It is also possible to extract the "day of year" for a time step.

In [139]:
data_1.time.dt.dayofyear
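
A minimal sketch of both accessors on a hypothetical time coordinate (one acquisition every five days from 2019-12-01 to 2020-02-24, not the actual acquisition dates of `data_1`):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical winter time series.
time = pd.date_range("2019-12-01 10:26:54", periods=18, freq="5D")
toy = xr.Dataset({"red": (("time",), np.arange(18))}, coords={"time": time})

seasons = toy.time.dt.season  # one season label per time step
doy = toy.time.dt.dayofyear   # one day-of-year number per time step

print(np.unique(seasons.values))
print(doy.values)
```

Both results are `xarray.DataArrays` that share the `time` dimension of the dataset, so they can be used directly for grouping or filtering the scenes.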

It is also possible to index and **slice within the x and y dimensions**. The following example selects the value of each band for the pixel at index position `x=2, y=5`. Since indexing is zero-based, this is the pixel in the third column and the sixth row of the raster.

In [140]:
data_1.isel(x=2, y= 5)

Again, this method can be combined with the `slice()` operator to create a spatial subset of the dataset based on the position of the pixels. If you know the actual coordinate values (x, y) of the extent of the subset area, use the `sel()` function instead.

The following example subsets `data_1` by the position of the pixels. Only the pixels in the first five columns and the first five rows are included in the output. In addition, the subset is sliced in the time dimension so that only the first five time steps remain.

In [141]:
data_1.isel(time=slice(0,5), x= slice(0,5), y=slice(0,5))
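
A spatial subset by coordinate values with `sel()` can be sketched as follows. The dataset `toy` and its coordinate values are hypothetical. The sketch assumes, as in the `data` dataset above, that the `y` coordinates are stored in descending order, so the label slice for `y` has to run from the larger to the smaller value.

```python
import numpy as np
import xarray as xr

# Hypothetical 10 x 10 pixel raster with 10 m pixel spacing.
y = np.arange(5562000, 5561900, -10)  # 10 rows, descending (north to south)
x = np.arange(560700, 560800, 10)     # 10 columns, ascending (west to east)
toy = xr.Dataset(
    {"red": (("y", "x"), np.arange(100).reshape(10, 10))},
    coords={"y": y, "x": x},
)

# Subset by pixel position: first five rows and columns.
by_position = toy.isel(x=slice(0, 5), y=slice(0, 5))

# Same subset by coordinate values; the y slice follows the descending order.
by_label = toy.sel(x=slice(560700, 560740), y=slice(5562000, 5561960))

# A y slice written in ascending order selects nothing on a descending coordinate.
wrong_order = toy.sel(y=slice(5561960, 5562000))

print(dict(by_label.sizes), dict(wrong_order.sizes))
```

An unexpectedly empty result after a `sel()` with `slice()` is therefore often a sign that the slice bounds are given in the opposite order of the coordinate values.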

## Built-in `xarray` functions for data manipulation 

This notebook presents some basic built-in functions of the `xarray` library to manipulate and transform data in an `xarray.Dataset`. [Notebook 07](07_basic_analysis.ipynb) will cover this topic with a focus on an application-oriented remote sensing approach.

The simple built-in functions allow the user to do simple calculations with a `xarray.Dataset`.
The **basic math** built-in `xarray` functions are:
* `min()`, `max()`
* `mean()`, `median()`
* `sum()`
* `std()`

The following code demonstrates the easy use of the `max()` function to extract the maximum value of the red band in the `data` dataset.

In [151]:
data.red.max()

To apply a function along a specific dimension (e.g. to calculate the mean of every time step), the `dim` argument of the basic math function must be set to the label of the dimension(s) that should be reduced.

This example calculates the mean of the `red` band over all pixels (`x`, `y`) for each time step, resulting in one value per scene.

In [164]:
data.red.mean(dim=["x", "y"])

This example works the other way around. It calculates the standard deviation of every pixel (`x`, `y`) over all time steps of the dataset `data`.

In [173]:
data.red.std(dim="time")

Remember, to access the raw `numpy` array that stores the values of the resulting `xarray.DataArrays`, the suffix `.values` is needed. This allows you to work with the "actual" data.

In [177]:
data.blue.sum(dim=["x","y"]).values

array([26225165045,           0, 34554443445, 29890138833])
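
The effect of the `dim` argument is easiest to follow on an array that is small enough to check by hand. A minimal sketch, assuming a hypothetical band `red` with three time steps and 2 x 2 pixels:

```python
import numpy as np
import xarray as xr

# Hypothetical band: 3 time steps, 2 x 2 pixels.
red = xr.DataArray(
    np.array(
        [[[1, 2], [3, 4]],
         [[5, 6], [7, 8]],
         [[9, 10], [11, 12]]],
        dtype=float,
    ),
    dims=("time", "y", "x"),
    name="red",
)

overall_max = red.max()                     # reduces all dimensions to one value
per_scene_mean = red.mean(dim=["x", "y"])   # reduces x and y, keeps time
per_pixel_std = red.std(dim="time")         # reduces time, keeps y and x

print(float(overall_max))
print(per_scene_mean.values)  # one value per time step
print(per_pixel_std.dims)     # one value per pixel
```

The dimensions named in `dim` disappear from the result, while all other dimensions are kept.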

The `where()` function provides the option to **mask** an `xarray.Dataset` based on a logical condition. By default, the function keeps all values for which the condition is true and converts all other values to NaN. This is extremely useful in combination with a binary mask to restrict your data to the desired values. The argument `other` lets you define a replacement value for all values that do not match the condition (default is `nan`). Setting the argument `drop=True` additionally removes coordinate labels for which no value matches the condition.
The following example masks the dataset `data` so that only those values are kept where the reflectance value in the `red` band is greater than 700.

In [187]:
data.where(data.red > 700)
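
The behaviour of the `other` and `drop` arguments can be sketched on a hypothetical 2 x 2 pixel band `red` (values and coordinates are made up for illustration):

```python
import numpy as np
import xarray as xr

# Hypothetical band with 2 x 2 pixels.
red = xr.DataArray(
    np.array([[500, 800], [0, 1200]]),
    dims=("y", "x"),
    coords={"y": [20, 10], "x": [0, 10]},
    name="red",
)

# Values that do not fulfil the condition become NaN (the result is a float array).
kept = red.where(red > 700)

# Values that do not fulfil the condition are replaced by -1 instead of NaN.
filled = red.where(red > 700, other=-1)

# Rows and columns without any matching value are removed from the result.
cropped = red.where(red > 1000, drop=True)

print(kept.values)
print(filled.values)
print(dict(cropped.sizes))
```

In all three cases the values that fulfil the condition are the ones that remain unchanged.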

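Other useful built-in `xarray` functions that are not covered in this notebook include `isin()`, `resample()`, `groupby()` and the interpolation functions `interp()` and `interpolate_na()`.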

## Recommended next steps

To continue with the beginner's guide, the following notebooks are designed to be worked through in the following order:

1. [Jupyter Notebooks](01_jupyter_introduction.ipynb)
2. [eo2cube](02_eo2cube.ipynb)
3. [Products and Measurements](03_products_and_measurements.ipynb)
4. [Loading data and introduction to xarrays](04_loading_data_and_basic_xarray.ipynb)
5. **Advanced xarray operations (this notebook)**
6. [Plotting data](06_plotting.ipynb)
7. [Basic analysis of remote sensing data](07_basic_analysis.ipynb)
8. [Parallel processing with Dask](08_parallel_processing_with_dask.ipynb)

***

## Additional information

<font size="2">This notebook, intended for use in the Open Data Cube entities of the [Department of Remote Sensing](http://remote-sensing.org/), [University of Wuerzburg](https://www.uni-wuerzburg.de/startseite/), is adapted from [Geoscience Australia](https://github.com/GeoscienceAustralia/dea-notebooks), published under the Apache License, Version 2.0. Thanks! </font>

**License:** The code in this notebook is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). 
Digital Earth Australia data is licensed under the [Creative Commons by Attribution 4.0](https://creativecommons.org/licenses/by/4.0/) license.


**Contact:** If you would like to report an issue with this notebook, you can file one on [Github](https://github.com).

**Last modified:** February 2021