<img align="right" src="../../additional_data/banner_siegel.png" style="width:1100px;">

# Xarray for eo2cube

* [**Sign up to the JupyterHub**](https://www.phenocube.org/) to run this notebook interactively from your browser
* **Compatibility:** Notebook currently compatible with the Open Data Cube environments of the University of Wuerzburg
* **Prerequisites**: There is no prerequisite learning required.


## Background
Xarray is an open source project and Python package that offers a toolkit for working with
multi-dimensional arrays of data. An xarray `Dataset` is an in-memory representation of a netCDF (Network Common Data Form) file. NetCDF is a set of libraries and machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.
Xarray provides the basic data structures for Open Data Cube, as well as powerful tools for computation and visualization of the data from different satellite sensors. Since the `xarray.Dataset` within the datacube environment is specialised for the use of remote sensing raster data, it differs slightly from the original `xarray` library. However, if you are interested in learning more about the structures of the original `xarray`, have a look at this [**"introduction to xarray" notebook**](intro_to_xarray.ipynb) within the "intro_to_python" directory.
To get more information about the `xarray` package, visit the [official documentation website](http://xarray.pydata.org/en/stable/).


## Description

This notebook introduces users to the `xarray` library within the datacube environment. It aims to build up the understanding of the `xarray` structure as a container for remote sensing raster data. Within this notebook the following topics are covered:

* Definition of the `xarray.Dataset` structure for eo2cube remote sensing data
    * Access of `xarray.Dataset` dimensions, measurements and metadata
    * Inspection of `xarray.DataArray` structure and data values
* Indexing and slicing of `xarray.Dataset`
***

## Introduction to Xarray

To work with the Open Data Cube, we load the data as an `xarray.Dataset`, which contains all data that matched our basic query. This data format stores the satellite data in an efficient and convenient way. To analyze eo2cube data, it is essential to understand the basic structure of an `xarray.Dataset`. Because `xarray` is so important, we devote this notebook to the topic.

All data products in eo2cube are organized and stored in the xarray framework. Imagine we have an optical satellite image with only three bands: Red, NIR and SWIR, which represent signals in different ranges of the electromagnetic spectrum. Each band is represented as a 2-dimensional numpy array, with one dimension for the latitude and one for the longitude. The image also comes with metadata that tells us, for example, the spatial resolution, the coordinate reference system (CRS) and the units used in the dataset. An xarray raster dataset packs this information into four parts: **"dimensions"**, **"coordinates"**, **"data variables"** and **"attributes"**. For analysing satellite data, the most important part is the measurements of the different bands, which are listed under **"data variables"** and labeled with the band names.
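As a minimal sketch of this structure, the following block builds a small synthetic `xarray.Dataset` by hand. The array sizes, coordinate values, random pixel values and the `EPSG:4326` attribute are assumptions made up for illustration; they are not eo2cube data.

```python
import numpy as np
import xarray as xr

# Hypothetical 4 x 5 pixel scene with three bands (random values, not real data).
rng = np.random.default_rng(0)
latitude = np.linspace(50.0, 49.7, 4)    # north to south
longitude = np.linspace(9.8, 10.2, 5)    # west to east

toy = xr.Dataset(
    data_vars={
        band: (("latitude", "longitude"), rng.integers(0, 3000, size=(4, 5)))
        for band in ("red", "nir", "swir")
    },
    coords={"latitude": latitude, "longitude": longitude},
    attrs={"crs": "EPSG:4326"},           # assumed metadata entry
)

print(dict(toy.sizes))        # dimension names and lengths
print(list(toy.coords))       # coordinate names
print(list(toy.data_vars))    # band names
print(toy.attrs)              # metadata dictionary
```

Each band shares the same two dimensions and coordinates, which is what allows the bands to be combined pixel by pixel later on.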

## Setting up
### Load packages
Loading the packages and connecting to the ODC are necessary steps for the following demonstrations.

In [2]:
import datacube
from odc.ui import with_ui_cbk
import xarray as xr
import numpy as np

### Datacube connection and load data

We connect to the datacube and load a dummy dataset from eo2cube using the `s2_l2a_bavaria` product. An area around Würzburg is loaded for the first week of April 2020. For more information about how to use the `dc.load()` function, check out [notebook 04](04_loading_data_and_basic_xarray.ipynb).

In [3]:
dc = datacube.Datacube(app = '05_advanced_xarray', config = '/home/datacube/.datacube.conf')

In [4]:
data = dc.load(product= "s2_l2a_bavaria",
               measurements= ["blue", "green", "red"],
               x= (9.8506165, 11.273325),
               y= (49.7352601, 50.191334),
               time= ("2020-04-01", "2020-04-07"),
               group_by = "solar_day",
               progress_cbk=with_ui_cbk())

data


## What is **`xarray.Dataset`**?
A Dataset can be seen as a dictionary structure packing up the data, dimensions and attributes. Variables in a Dataset object are called DataArrays and they share dimensions with the higher level Dataset. The figure below provides an illustrative example:

<img align="centre" src="dataset-diagram.png" style="width:900px;">

### - Inspecting `xarray.Dataset`

The different variables of an `xarray.Dataset` can be accessed individually, either with the following syntax, as if the dataset were a Python dictionary, or with the `.` notation.
```python
data["measurement_name"]
```
or
```python
data.measurement_name
```
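The two notations can be compared on a small synthetic dataset. This is a sketch only: the variable names `red` and `mean` are hypothetical, and `mean` is chosen on purpose because it clashes with an existing `Dataset` method.

```python
import numpy as np
import xarray as xr

toy = xr.Dataset(
    {
        "red": (("y", "x"), np.arange(6).reshape(2, 3)),
        # Deliberately awkward, hypothetical name that matches a Dataset method.
        "mean": (("y", "x"), np.ones((2, 3))),
    }
)

# Both notations return the same DataArray.
same = toy["red"].identical(toy.red)
print(same)

# Dictionary-style access always returns the variable ...
print(isinstance(toy["mean"], xr.DataArray))
# ... while dot notation resolves to the Dataset.mean method here.
print(callable(toy.mean))
```

Dictionary-style access is therefore the safer choice in scripts, while the `.` notation is convenient for interactive exploration.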

For example, we can select the "dimensions" of an `xarray.Dataset` with the following syntax:

In [5]:
data.dims

Frozen(SortedKeysDict({'time': 4, 'y': 5284, 'x': 10310}))

The following code selects the "coordinates" of an `xarray.Dataset`. The result is a dict-like container of arrays (coordinates) that label each point in the 2D plane.

In [6]:
data.coords

Coordinates:
  * time         (time) datetime64[ns] 2020-04-01T10:26:54 ... 2020-04-06T10:...
  * y            (y) float64 5.562e+06 5.562e+06 5.562e+06 ... 5.51e+06 5.51e+06
  * x            (x) float64 5.607e+05 5.607e+05 ... 6.638e+05 6.638e+05
    spatial_ref  int32 25832

The following code selects the "attributes". The result is a dict holding metadata information.

In [7]:
data.attrs

{'crs': 'EPSG:25832', 'grid_mapping': 'spatial_ref'}

<a id='index_array2'></a>
## What is **`xarray.DataArray`**?

In the datacube, each measurement/band is an `xarray.DataArray` with three dimensions. The first dimension is `time` and holds one 2-dimensional array (`y` by `x`) for each unique time step.

<a id='index_xarray1'></a>
### - Inspecting `xarray.DataArray`

Similarly, we can access individual bands of the satellite image within the `xarray.Dataset`. Without selecting along the time dimension, the result includes information for all time stamps. The following code extracts the `red` band of the dataset, which is an `xarray.DataArray`. Like an `xarray.Dataset`, the `xarray.DataArray` also includes information about the data's "dimensions", "coordinates" and "attributes".

In [8]:
data.red

The following code extracts the red band as a `numpy` array, leaving out the other metadata in the container. This returns the raw (single- or multi-dimensional) array holding the actual band values.

In [8]:
data.red.values

array([[[ 805,  459,  438, ...,  423,  505,  796],
        [ 885,  658,  550, ...,  532,  877, 1278],
        [1074,  866,  714, ...,  919, 1372, 1410],
        ...,
        [ 928,  987, 1050, ...,  366,  496,  580],
        [1062, 1128, 1252, ...,  370,  327,  449],
        [1108, 1310, 1492, ...,  455,  397,  413]],

       [[   0,    0,    0, ...,    0,    0,    0],
        [   0,    0,    0, ...,    0,    0,    0],
        [   0,    0,    0, ...,    0,    0,    0],
        ...,
        [   0,    0,    0, ...,    0,    0,    0],
        [   0,    0,    0, ...,    0,    0,    0],
        [   0,    0,    0, ...,    0,    0,    0]],

       [[3038, 2890, 2878, ...,    0,    0,    0],
        [3216, 3038, 2932, ...,    0,    0,    0],
        [3442, 3320, 3060, ...,    0,    0,    0],
        ...,
        [ 901,  924,  995, ...,    0,    0,    0],
        [ 976, 1090, 1196, ...,    0,    0,    0],
        [1013, 1240, 1366, ...,    0,    0,    0]],

       [[ 733,  420,  469, ...,  434,

Similarly, we can extract a smaller subset of the data by choosing the time stamps we want to focus on. The code below selects only the values of the first time step:

In [10]:
data.red[0] #data values for first time step of the red band

The result includes the y and x dimensions which define the values. The code below also subsets the second dimension (the first time stamp and the first y coordinate):

In [11]:
data.red[0][0]

Notice that within the "coordinates" section the y-coordinate only contains a single value. That is because we indexed the `red` band to the first row of pixels in the y direction.

This code also selects the first pixel in the x direction (the first `time`, the first `y` and the first `x`).

In [12]:
data.red[0][0][0]

Based on the `xarray` structure, every data value of a measurement/band in a multidimensional array can be assigned to a pixel by a unique combination of the three dimensions.
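The positional indexing steps above can be reproduced on a synthetic band. This is a sketch under assumptions: the array shape and its counting values are made up, and the array carries no coordinates.

```python
import numpy as np
import xarray as xr

# Synthetic (time, y, x) band; the values simply count upwards from 0.
red = xr.DataArray(
    np.arange(2 * 3 * 4).reshape(2, 3, 4),
    dims=("time", "y", "x"),
    name="red",
)

first_scene = red[0]          # drops time, leaves (y, x)
first_row = red[0][0]         # drops time and y, leaves (x,)
first_pixel = red[0][0][0]    # zero-dimensional DataArray

print(first_scene.dims, first_row.dims, first_pixel.dims)

# A single index with three positions is equivalent to the chained form.
print(int(red[0, 0, 0]) == int(first_pixel))

# .values strips the labels and returns the underlying numpy array.
print(type(red.values), red.values.shape)
```

Each integer index removes one dimension, so three indices leave a single value.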

<a id='index_array3'></a>
## Basic Indexing

### **1) isel(): Using index number**

The `xarray` library offers two convenient methods of selecting data. You can use the `isel()` method to select scenes from your dataset by integer position (as in `numpy`). Alternatively, you can use the `sel()` method to select from your dataset based on the dimension labels. The following code selects only the first time stamp:

In [13]:
data.isel(time=0)

The argument for `time` does not need to be a single number. It can also be a list of positions, written with `[]`. The following code selects the first two time stamps:

In [9]:
data.isel(time=[0,1])
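The difference between passing an integer and passing a list to `isel()` can be seen on a small synthetic dataset. In this sketch the dates and the all-zero band are assumptions used only for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical dataset with four daily scenes of 2 x 3 pixels.
times = pd.date_range("2020-04-01", periods=4, freq="D")
toy = xr.Dataset(
    {"red": (("time", "y", "x"), np.zeros((4, 2, 3)))},
    coords={"time": times},
)

single = toy.isel(time=0)       # integer: the time dimension is dropped
pair = toy.isel(time=[0, 1])    # list: the time dimension is kept, length 2
last = toy.isel(time=-1)        # negative positions count from the end

print(dict(single.sizes))
print(dict(pair.sizes))
print(last.time.values)
```

Passing a one-element list such as `[0]` keeps `time` as a dimension of length 1, which can be useful when later steps expect three dimensions.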

### **2) sel(): Using time labels**

The function `sel()` is different from `isel()`. It is a very powerful indexing method when working with big datasets, because its `time` argument takes a label of the time dimension instead of a position. You can use:
* `YYYY` to select all scenes of this year
* `YYYY-MM` to select all scenes of this month
* `YYYY-MM-DD` to select all scenes of this day

*Further setup: To demonstrate indexing by label, we need a bigger dummy dataset. `data_1` contains the `s2_l2a_bavaria` product from December 2019 to February 2020.*

In [15]:
data_1 = dc.load(product= "s2_l2a_bavaria",
                 measurements= ["blue", "green", "red"],
                 x= (9.8506165, 11.273325),
                 y= (49.7352601, 50.191334),
                 time= ("2019-12-01", "2020-02-28"),
                 group_by = "solar_day",
                 progress_cbk=with_ui_cbk())

data_1


Here we select all scenes of `data_1` in the year "2020" using the `sel()` function. (Note: the scenes from December 2019 were dropped.)

In [16]:
data_1.sel(time="2020")

This example selects all scenes from January 2020.

In [17]:
data_1.sel(time="2020-01")

This example selects all scenes from 25 December 2019 that are included in `data_1`. Since we used the `group_by = "solar_day"` argument in the `dc.load()` function, which combines the scenes of each day into one, only one scene is available.

In [18]:
data_1.sel(time = "2019-12-25")
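Label-based selection with partial date strings can also be tried without a datacube connection. The following sketch uses a synthetic stand-in for `data_1`; the five-day revisit interval, the acquisition time of 10:26:54 and the all-zero band are assumptions.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical acquisition times: one scene every five days at 10:26:54.
times = pd.date_range("2019-12-01 10:26:54", "2020-02-28", freq="5D")
toy = xr.Dataset(
    {"red": (("time", "y", "x"), np.zeros((len(times), 2, 3)))},
    coords={"time": times},
)

year_2020 = toy.sel(time="2020")        # every scene in 2020
january = toy.sel(time="2020-01")       # every scene in January 2020
one_day = toy.sel(time="2019-12-26")    # every scene on that day

for label, subset in [("2020", year_2020),
                      ("2020-01", january),
                      ("2019-12-26", one_day)]:
    print(label, subset.sizes["time"])
```

Because the time stamps in this sketch include a time of day, a date string matches the whole day and the `time` dimension is kept, even when only one scene is returned.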

For both methods (`isel()` and `sel()`) a **slicing** operator exists. If a `slice()` object is passed to the index function, the dataset is sliced along that dimension.
Slicing by position, for example `data_1.isel(time=slice(0, 5))`, selects the first five scenes in `data_1`. The start value is included and the stop value is excluded.
Slicing by label, for example `data_1.sel(time=slice("2019-12-08", "2019-12-25"))`, selects the scenes between "2019-12-08" and "2019-12-25". Note that when using `slice()` with the `sel()` method, both the start and the stop value are included.
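Both kinds of slicing can be sketched on a synthetic stand-in for `data_1`. The five-day revisit interval, the acquisition time and the all-zero band are assumptions used for illustration only.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical acquisition times: one scene every five days at 10:26:54.
times = pd.date_range("2019-12-01 10:26:54", "2020-02-28", freq="5D")
toy = xr.Dataset(
    {"red": (("time", "y", "x"), np.zeros((len(times), 2, 3)))},
    coords={"time": times},
)

# Slicing by position: start included, stop excluded (positions 0 to 4).
first_five = toy.isel(time=slice(0, 5))

# Slicing by label: both the start and the stop date are included.
by_label = toy.sel(time=slice("2019-12-08", "2019-12-25"))
inclusive = toy.sel(time=slice("2019-12-11", "2019-12-21"))

print(first_five.sizes["time"])
print(by_label.time.values)
print(inclusive.sizes["time"])
```

In this sketch `by_label` and `inclusive` return the same scenes, because the scenes on the two boundary dates of `inclusive` are part of the result.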

Xarray is a very powerful Python package. Users can easily extract specific data along different dimensions, perform raster algebra such as calculating summary statistics and spectral indices, and manipulate data by resampling and rescaling. If you are interested in learning more about Xarray operations, have a look at the advanced material in the [**"Advanced Xarray" notebook**](06_advanced_xarray.ipynb).

## Recommended next steps

The notebooks in this beginner's guide are designed to be worked through in the following order:

1. [Jupyter Notebook](01_jupyter_introduction.ipynb)
2. [eo2cube](02_eo2cube.ipynb)
3. **Xarray basics (this notebook)**
4. [Products and Measurements](03_products_and_measurements.ipynb)
5. [Loading data](04_loading_data.ipynb)
6. [Advanced xarray operations](05_advanced_xarray.ipynb)
7. [Plotting data](06_plotting.ipynb)
8. [Basic analysis of remote sensing data](07_basic_analysis.ipynb)
9. [Parallel processing with Dask](08_parallel_processing_with_dask.ipynb)

Once you have worked through the beginner's guide, you can join advanced users by exploring:

* The "DEA datasets" directory in the repository, where you can explore DEA products in depth.
* The "Frequently used code" directory, which contains a recipe book of common techniques and methods for analysing DEA data.
* The "Real-world examples" directory, which provides more complex workflows and analysis case studies.

***
## Additional information

This notebook was prepared for use with the Jupyter Notebook environment of the [Department of Remote Sensing](http://remote-sensing.org/), [University of Wuerzburg](https://www.uni-wuerzburg.de/startseite/), and is partly adapted from [Earth Lab](https://www.earthdatascience.org/courses/intro-to-earth-data-science/), published under the CC BY-NC-ND 4.0 License. Thanks!

**License:** The code in this notebook is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). 


**Contact:** If you would like to report an issue with this notebook, you can file one on [Github](https://github.com).

**Last modified:** March 2021