<img align="right" src="../../additional_data/banner_siegel.png" style="width:1100px;">

# Xarray-I: Data Structure 

* [**Sign up to the JupyterHub**](https://www.phenocube.org/) to run this notebook interactively from your browser
* **Compatibility:** Notebook currently compatible with the Open Data Cube environments of the University of Wuerzburg
* **Prerequisites**: There is no prerequisite learning required.


## Background

In the previous notebook, we learnt that the data we access are loaded as an **`xarray.Dataset`**. This is the form in which earth observation data are usually provided by a datacube.

Xarray is an open source project and Python package which offers a toolkit for working with multi-dimensional arrays of data. An **`xarray.Dataset`** is an in-memory representation of a netCDF (network Common Data Form) file; netCDF is a set of libraries and machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. Since the **`xarray.Dataset`** within the datacube environment is specialised for the use of remote sensing raster data, it differs slightly from the original **`xarray`** library.

Understanding the structure of an **`xarray.Dataset`** is the key to working with these data, so this notebook focuses on explaining that structure. First, let's return to the final step of the previous notebook, where we loaded a data product. The data product "s2_l2a_bavaria" is used as an example.

## Description

This notebook introduces users to the `xarray` library within the datacube environment. Within this notebook the following topics are covered:

* Definition of the `xarray.Dataset` structure for eo2cube remote sensing data
    * Access of `xarray.Dataset` dimensions, measurements and metadata
    * Inspection of `xarray.DataArray` structure and data values
    
* Indexing and slicing of `xarray.Dataset`

***

In [2]:
import datacube
# To access and work with available data

import pandas as pd
# To format tables

from odc.ui import DcViewer 
# Provides an interface for interactively exploring the products available in the datacube

from odc.ui import with_ui_cbk
# Enables a progress bar when loading large amounts of data.

import xarray as xr

import matplotlib.pyplot as plt

# Set config for displaying tables nicely
# !! USEFUL !! otherwise parts of longer infos won't be displayed in tables
pd.set_option("display.max_colwidth", 200)
pd.set_option("display.max_rows", None)

# Connect to DataCube
# argument "app" --- user defined name for a session (e.g. choose one matching the purpose of this notebook)
dc = datacube.Datacube(app = "nb_understand_ndArrays", config = '/home/datacube/.datacube.conf')

# Load Data Product
ds = dc.load(product = "s2_l2a_bavaria",
             measurements = ["blue", "green", "red"],
             longitude = [12.493, 12.509],
             latitude = [47.861, 47.868],
             time = ("2018-10-01", "2019-03-31"))

print(ds)

<xarray.Dataset>
Dimensions:      (time: 165, x: 124, y: 84)
Coordinates:
  * time         (time) datetime64[ns] 2018-10-01T10:15:58 ... 2019-03-30T10:...
  * y            (y) float64 5.308e+06 5.308e+06 ... 5.307e+06 5.307e+06
  * x            (x) float64 7.612e+05 7.612e+05 ... 7.624e+05 7.624e+05
    spatial_ref  int32 25832
Data variables:
    blue         (time, y, x) int16 8744 8704 8608 8600 8432 ... 519 525 499 473
    green        (time, y, x) int16 8632 8624 8560 8504 8496 ... 736 760 738 726
    red          (time, y, x) int16 8608 8608 8544 8544 8440 ... 450 469 448 433
Attributes:
    crs:           EPSG:25832
    grid_mapping:  spatial_ref


## **What is inside an `xarray.Dataset`?**
The figure below is a diagram depicting the structure of the **`xarray.Dataset`** we've just loaded.

![xarray data structure](https://live.staticflickr.com/65535/51083605166_70dd29baa8_k.jpg)

This dataset has three ***Data Variables***, "blue", "green" and "red" (shown in colour in the diagram), each of which refers to an individual spectral band. Each data variable can be regarded as a **multi-dimensional *Data Array*** of the same structure; in this case, it is a **three-dimensional array** (shown as a 3D cube in the diagram) where `time`, `x` and `y` are its ***Dimensions*** (shown as the axes along each cube in the diagram).

In this dataset, there are 165 ***Coordinates*** under the `time` dimension, which means there are 165 time steps along the `time` axis. There are 124 coordinates under the `x` dimension and 84 coordinates under the `y` dimension, indicating that there are 124 pixels along the `x` axis and 84 pixels along the `y` axis.

As for the term ***Dataset***, it is like a *container* holding all the multi-dimensional arrays of the same structure (shown as the red-lined box holding all 3D cubes in the diagram).

So this example dataset has a spatial extent of 124 by 84 pixels at the given lon/lat locations, spans 165 time stamps and contains 3 spectral bands.

In summary, an ***`xarray.Dataset`*** is a dictionary-like container of ***`DataArrays`***, each of which is a labelled, n-dimensional array with 4 properties:
* **Data Variables (`values`)**: A `numpy.ndarray` holding values *(e.g. reflectance values of spectral bands)*.
* **Dimensions (`dims`)**: Dimension names for each axis *(e.g. 'x', 'y', 'time')*.
* **Coordinates (`coords`)**: Coordinates of each value along each axis *(e.g. longitudes along the 'x' axis, latitudes along the 'y' axis, datetime objects along the 'time' axis)*.
* **Attributes (`attrs`)**: A dictionary (`dict`) containing metadata.
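
The four properties listed above can be tried out on a small hand-made dataset. The following is only a sketch: the array sizes, coordinate values and pixel values are invented for illustration and do not come from the datacube; only the band names, dimension names and CRS string mirror the product loaded above.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for a loaded product: 3 time steps, 4 rows (y), 5 columns (x).
# All sizes and values are illustrative assumptions, not datacube output.
times = pd.date_range("2018-10-01", periods=3, freq="D")
y = np.linspace(5308000.0, 5307970.0, 4)  # decreasing, as in the loaded data
x = np.linspace(761200.0, 761240.0, 5)

rng = np.random.default_rng(0)
toy = xr.Dataset(
    data_vars={
        band: (("time", "y", "x"),
               rng.integers(0, 10000, size=(3, 4, 5)).astype("int16"))
        for band in ["blue", "green", "red"]
    },
    coords={"time": times, "y": y, "x": x},
    attrs={"crs": "EPSG:25832"},
)

blue = toy["blue"]         # one DataArray taken out of the container
print(type(blue.values))   # values: the raw numpy.ndarray
print(blue.dims)           # dims: ('time', 'y', 'x')
print(list(blue.coords))   # coords: names of the coordinate arrays
print(toy.attrs)           # attrs: metadata dictionary of the Dataset
```

Building the container by hand makes it visible that every data variable shares the same dimensions and coordinates, which is what lets the dataset behave like a dictionary of aligned arrays.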

Now let's further deconstruct the dataset we have just loaded to get things clear!   :D

* **To check existing dimensions of a dataset**

In [17]:
ds.dims

Frozen(SortedKeysDict({'time': 165, 'y': 84, 'x': 124}))
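
The length of a single dimension can also be read directly by name. The sketch below uses a made-up dataset (not the loaded product) and `Dataset.sizes`, which in more recent xarray releases is the documented mapping from dimension names to lengths; the printed form of `ds.dims` may therefore look different from the output above depending on the installed version.

```python
import numpy as np
import xarray as xr

# Toy dataset with the same dimension names as the loaded product (values are made up).
toy = xr.Dataset(
    {"blue": (("time", "y", "x"), np.zeros((2, 3, 4), dtype="int16"))}
)

n_time = toy.sizes["time"]        # length of one dimension, looked up by name
band_names = list(toy.data_vars)  # names of the data variables in the container
print(n_time, band_names)
print(toy["blue"].shape)          # shape follows the dimension order (time, y, x)
```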

* **To check the coordinates of a dataset**

In [20]:
ds.coords

Coordinates:
  * time         (time) datetime64[ns] 2018-10-01T10:15:58 ... 2019-03-30T10:...
  * y            (y) float64 5.308e+06 5.308e+06 ... 5.307e+06 5.307e+06
  * x            (x) float64 7.612e+05 7.612e+05 ... 7.624e+05 7.624e+05
    spatial_ref  int32 25832

* **To check all coordinates along the time dimension**
<br>
<img src="https://live.staticflickr.com/65535/51115452191_ec160d4514_o.png" width="450">

In [5]:
ds.time
# OR
#ds.coords['time']


* **To check attributes of the dataset**

In [23]:
ds.attrs

{'crs': 'EPSG:25832', 'grid_mapping': 'spatial_ref'}

<a id='index_array2'></a>
## **What is an `xarray.DataArray`?**

Each measurement/band is stored as a multi-dimensional `xarray.DataArray` with three dimensions: `time`, `y` and `x`. The first dimension is `time`, so the array holds one two-dimensional (`y`, `x`) array for each unique time step.

The following code extracts the `blue` band of the dataset, which is an `xarray.DataArray`. Like an `xarray.Dataset`, the `xarray.DataArray` also includes information about the data's "dimensions", "coordinates" and "attributes".

* **To select all data of "blue" band**
<br>
<img src="https://live.staticflickr.com/65535/51115092614_366cb774a8_o.png" width="350">

In [26]:
print(ds.blue)
# OR
#ds['blue']

<xarray.DataArray 'blue' (time: 165, y: 84, x: 124)>
array([[[8744, 8704, 8608, ..., 8928, 8928, 8928],
        [8720, 8632, 8536, ..., 8888, 8888, 8896],
        [8680, 8672, 8528, ..., 8928, 8856, 8928],
        ...,
        [9088, 9056, 9080, ..., 9712, 9712, 9552],
        [9048, 9008, 9056, ..., 9720, 9680, 9640],
        [9096, 9056, 9032, ..., 9672, 9696, 9672]],

       [[   0,    0,    0, ...,    0,    0,    0],
        [   0,    0,    0, ...,    0,    0,    0],
        [   0,    0,    0, ...,    0,    0,    0],
        ...,
        [   0,    0,    0, ...,    0,    0,    0],
        [   0,    0,    0, ...,    0,    0,    0],
        [   0,    0,    0, ...,    0,    0,    0]],

       [[   0,    0,    0, ...,    0,    0,    0],
        [   0,    0,    0, ...,    0,    0,    0],
        [   0,    0,    0, ...,    0,    0,    0],
        ...,
...
        ...,
        [   0,    0,    0, ...,    0,    0,    0],
        [   0,    0,    0, ...,    0,    0,    0],
        [   0,    0,

The following code extracts the blue band as a `numpy` array, leaving out the other metadata in the container. This returns the raw (single- or multi-dimensional) array holding the actual band values.

In [25]:
# Only print pixel values
print(ds.blue.values)

[[[8744 8704 8608 ... 8928 8928 8928]
  [8720 8632 8536 ... 8888 8888 8896]
  [8680 8672 8528 ... 8928 8856 8928]
  ...
  [9088 9056 9080 ... 9712 9712 9552]
  [9048 9008 9056 ... 9720 9680 9640]
  [9096 9056 9032 ... 9672 9696 9672]]

 [[   0    0    0 ...    0    0    0]
  [   0    0    0 ...    0    0    0]
  [   0    0    0 ...    0    0    0]
  ...
  [   0    0    0 ...    0    0    0]
  [   0    0    0 ...    0    0    0]
  [   0    0    0 ...    0    0    0]]

 [[   0    0    0 ...    0    0    0]
  [   0    0    0 ...    0    0    0]
  [   0    0    0 ...    0    0    0]
  ...
  [   0    0    0 ...    0    0    0]
  [   0    0    0 ...    0    0    0]
  [   0    0    0 ...    0    0    0]]

 ...

 [[   0    0    0 ...    0    0    0]
  [   0    0    0 ...    0    0    0]
  [   0    0    0 ...    0    0    0]
  ...
  [   0    0    0 ...    0    0    0]
  [   0    0    0 ...    0    0    0]
  [   0    0    0 ...    0    0    0]]

 [[ 430  433  429 ...  584  577  575]
  [ 441  435
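
To make the difference between a `DataArray` and its `.values` concrete, here is a minimal sketch on a made-up array (the values and the `nodata` attribute are assumptions for illustration). It shows that the result is a plain `numpy.ndarray` that no longer carries labels or attributes.

```python
import numpy as np
import xarray as xr

# Small labelled array standing in for ds.blue (values are made up).
da = xr.DataArray(
    np.arange(24, dtype="int16").reshape(2, 3, 4),
    dims=("time", "y", "x"),
    name="blue",
    attrs={"nodata": 0},
)

arr = da.values               # plain numpy.ndarray; labels and attributes are gone
print(type(arr), arr.shape, arr.dtype)
print(hasattr(arr, "attrs"))  # False: the metadata stays with the DataArray
print(arr[1, 2, 3])           # purely positional access -> 23
```

Because the coordinates are dropped, it is up to the user to remember which axis is `time`, `y` or `x` once working on the raw array.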

* **To select blue band data at the first time stamp**

Similarly, we can extract a smaller subset of the data by choosing the time stamps we want to focus on. The code below selects only the values of the first time step:

<br>
<img src="https://live.staticflickr.com/65535/51116131265_8464728bc1_o.png" width="350">

In [28]:
print(ds.blue[0])

<xarray.DataArray 'blue' (y: 84, x: 124)>
array([[8744, 8704, 8608, ..., 8928, 8928, 8928],
       [8720, 8632, 8536, ..., 8888, 8888, 8896],
       [8680, 8672, 8528, ..., 8928, 8856, 8928],
       ...,
       [9088, 9056, 9080, ..., 9712, 9712, 9552],
       [9048, 9008, 9056, ..., 9720, 9680, 9640],
       [9096, 9056, 9032, ..., 9672, 9696, 9672]], dtype=int16)
Coordinates:
    time         datetime64[ns] 2018-10-01T10:15:58
  * y            (y) float64 5.308e+06 5.308e+06 ... 5.307e+06 5.307e+06
  * x            (x) float64 7.612e+05 7.612e+05 ... 7.624e+05 7.624e+05
    spatial_ref  int32 25832
Attributes:
    units:         reflectance
    nodata:        0
    crs:           EPSG:25832
    grid_mapping:  spatial_ref


* **To select blue band data at the first time stamp and the first row in the y direction (the largest latitude in the defined spatial extent)**

The result above still includes the x and y dimensions. The code below additionally subsets the second dimension, y:

<img src="https://live.staticflickr.com/65535/51115337046_aeb75d0d03_o.png" width="350">

In [29]:
print(ds.blue[0][0])

<xarray.DataArray 'blue' (x: 124)>
array([8744, 8704, 8608, 8600, 8432, 8320, 8328, 8312, 8200, 8200, 8188,
       8224, 8280, 8312, 8288, 8312, 8312, 8304, 8288, 8400, 8392, 8456,
       8456, 8552, 8552, 8536, 8504, 8528, 8488, 8536, 8488, 8488, 8536,
       8512, 8504, 8600, 8624, 8600, 8640, 8712, 8656, 8680, 8752, 8768,
       8768, 8824, 8856, 8888, 8952, 9008, 9032, 8968, 8992, 9000, 9024,
       8984, 9016, 8992, 9096, 9080, 9032, 9008, 9040, 9032, 9032, 8992,
       8864, 8872, 8832, 8864, 8864, 8872, 8944, 8864, 8960, 8960, 8992,
       9120, 9072, 8976, 8992, 9040, 9032, 9072, 9120, 9040, 9080, 9120,
       9040, 9040, 9088, 9088, 8944, 8944, 8936, 8888, 8888, 8888, 8872,
       8808, 8872, 8832, 8792, 8832, 8840, 8864, 8832, 8712, 8832, 8856,
       8800, 8880, 8840, 8832, 8920, 8912, 8800, 8880, 8856, 8880, 8912,
       8928, 8928, 8928], dtype=int16)
Coordinates:
    time         datetime64[ns] 2018-10-01T10:15:58
    y            float64 5.308e+06
  * x            (x) fl

Notice that within the "Coordinates" section the y-coordinate contains only a single value. That is because we indexed the `blue` band to the first pixel in the y direction.

* **To select the upper-left corner pixel**

This code additionally selects the first pixel in the x direction (the first `x`, the first `y` and the first `time`).

<br>
<img src="https://live.staticflickr.com/65535/51116131235_b0cca9589f_o.png" width="350">

In [30]:
print(ds.blue[0][0][0])

<xarray.DataArray 'blue' ()>
array(8744, dtype=int16)
Coordinates:
    time         datetime64[ns] 2018-10-01T10:15:58
    y            float64 5.308e+06
    x            float64 7.612e+05
    spatial_ref  int32 25832
Attributes:
    units:         reflectance
    nodata:        0
    crs:           EPSG:25832
    grid_mapping:  spatial_ref


Based on the `xarray` structure, every value of a measurement/band in the multi-dimensional array can be assigned to a pixel by a unique combination of the three dimensions `time`, `y` and `x`.
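
The chained brackets used in the cells above are one of several equivalent ways to address a single pixel. As a small sketch on a made-up array (not the loaded product), the same value can be reached with one bracket holding all three indices, or with `isel()` and explicit dimension names:

```python
import numpy as np
import xarray as xr

# Made-up array with the same dimension order as the bands above.
da = xr.DataArray(
    np.arange(24).reshape(2, 3, 4),
    dims=("time", "y", "x"),
    name="blue",
)

chained = da[1][2][3]                # one index per step, as in the cells above
positional = da[1, 2, 3]             # all three indices in a single bracket
by_name = da.isel(time=1, y=2, x=3)  # same pixel, independent of dimension order

print(int(chained), int(positional), int(by_name))
```

Naming the dimensions, as in the last variant, avoids having to remember that `time` comes first and `x` last.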

### ***To subset a dataset with `isel` vs. `sel`***
* Use `isel` when subsetting by **index** (integer position)
* Use `sel` when subsetting by **labels** (coordinate values)

* **To select data of all spectral bands at the first time stamp**
<br>
<img src="https://live.staticflickr.com/65535/51114879732_7d62db54f4_o.png" width="750">

### **1) isel(): Using index number**

In [4]:
ds.isel(time=[0])


The `time` argument does not have to be a single index; it can also be a list of indices written in square brackets `[]`. The following code selects the first two time stamps:

In [3]:
ds.isel(time=[0,1])

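
Whether the index is passed as a plain integer or as a list changes the shape of the result. The sketch below, on a made-up dataset rather than the loaded product, illustrates this: an integer drops the `time` dimension, while a list or a slice keeps it.

```python
import numpy as np
import xarray as xr

# Made-up dataset: 3 time steps, 2 x 2 pixels.
toy = xr.Dataset({"blue": (("time", "y", "x"), np.zeros((3, 2, 2)))})

single = toy.isel(time=0)               # integer: the time dimension is dropped
kept = toy.isel(time=[0])               # list: time is kept with length 1
first_two = toy.isel(time=slice(0, 2))  # slice: same selection as time=[0, 1]

print("time" in single.sizes)           # False
print(kept.sizes["time"])               # 1
print(first_two.sizes["time"])          # 2
```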

### **2) sel(): Using labels**

The function `sel()` is different from `isel()`. It is a very powerful indexing method when working with big datasets: its `time` argument takes not an index but a label of the time dimension, written in full or in part. You can use:

**1)** `YYYY` to select all scenes of this year

**2)** `YYYY-MM` to select all scenes of this month

**3)** `YYYY-MM-DD` to select all scenes of this day

* **To select data of all spectral bands of year 2019** 

Here we select all 2019 scenes using the `sel()` function.
<br>
<img src="https://live.staticflickr.com/65535/51116281070_75f1b46a9c_o.png" width="750">

In [41]:
ds.sel(time='2019')

<xarray.Dataset>
Dimensions:      (time: 90, x: 124, y: 84)
Coordinates:
  * time         (time) datetime64[ns] 2019-01-01T10:07:21 ... 2019-03-30T10:...
  * y            (y) float64 5.308e+06 5.308e+06 ... 5.307e+06 5.307e+06
  * x            (x) float64 7.612e+05 7.612e+05 ... 7.624e+05 7.624e+05
    spatial_ref  int32 25832
Data variables:
    blue         (time, y, x) int16 0 0 0 0 0 0 0 ... 518 515 519 525 499 473
    green        (time, y, x) int16 0 0 0 0 0 0 0 ... 756 752 736 760 738 726
    red          (time, y, x) int16 0 0 0 0 0 0 0 ... 428 433 450 469 448 433
Attributes:
    crs:           EPSG:25832
    grid_mapping:  spatial_ref


This example selects all scenes from January 2019.

In [17]:
ds.sel(time="2019-01")

This example selects all scenes from 25 December 2018, a date inside the loaded time range (2018-10-01 to 2019-03-31). The `dc.load()` call at the top of this notebook does not set `group_by`; when `group_by = "solar_day"` is used, all scenes of a single day are combined, so at most one scene per day is available.

In [18]:
ds.sel(time = "2018-12-25")
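
The label formats can be compared side by side on a synthetic time axis. This is a sketch under stated assumptions: one made-up acquisition per day at 10:15 over the period loaded above, which is not the real acquisition schedule. It also shows `slice()` for selecting an arbitrary period between two labels.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic time axis: one acquisition per day at 10:15 (an assumption for illustration).
times = pd.date_range("2018-10-01 10:15", "2019-03-31 10:15", freq="D")
toy = xr.Dataset(
    {"blue": (("time",), np.arange(times.size))},
    coords={"time": times},
)

year = toy.sel(time="2019")        # YYYY: all scenes of that year
month = toy.sel(time="2019-01")    # YYYY-MM: all scenes of that month
day = toy.sel(time="2018-12-25")   # YYYY-MM-DD: all scenes of that day
period = toy.sel(time=slice("2018-12-01", "2018-12-31"))  # both ends inclusive

for name, subset in [("year", year), ("month", month), ("day", day), ("period", period)]:
    print(name, subset.sizes["time"])
```

A label that lies outside the time range of the dataset cannot be matched, so it is worth checking `ds.time` before selecting by date.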

***

## Further Information
To get more information about the `xarray` package, visit the [official documentation website](http://xarray.pydata.org/en/stable/).

***

## Recommended next steps

If you now understand the **data structure** of `xarray.Dataset` and the **basic indexing** methods illustrated in this notebook, you are ready to move on to the next notebook, where you will learn more about **advanced indexing** and calculating some **basic statistical parameters** of the n-dimensional arrays! If you would like to explore the world of **xarray** further, you can dive into the [Xarray user guide](http://xarray.pydata.org/en/stable/index.html). :D

The notebooks in this beginner's guide are designed to be worked through in the following order:

1. [Jupyter Notebooks](01_jupyter_introduction.ipynb)
2. [eo2cube](02_eo2cube_introduction.ipynb)
3. [Search and Load Data](03_data_lookup_and_loading.ipynb)
4. **Xarray I: Data Structure (this notebook)**
5. [Xarray II: Index and Statistics](05_xarrayII.ipynb)
6. [Plot](06_plotting_basics.ipynb)
7. [Basic analysis of remote sensing data](07_basic_analysis.ipynb)
8. [Parallel processing with Dask](08_parallel_processing_with_dask.ipynb)

***
## Additional information

This notebook is intended for use with the Jupyter Notebook environment of the [Department of Remote Sensing](http://remote-sensing.org/), [University of Wuerzburg](https://www.uni-wuerzburg.de/startseite/).

**License:** The code in this notebook is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). 


**Contact:** If you would like to report an issue with this notebook, you can file one on [Github](https://github.com).

**Last modified:** April 2021