<img align="right" src="../../additional_data/banner_siegel.png" style="width:1100px;">

# Xarray-I: Data Structure 

* [**Sign up to the JupyterHub**](https://www.phenocube.org/) to run this notebook interactively from your browser
* **Compatibility:** Notebook currently compatible with the Open Data Cube environments of the University of Wuerzburg
* **Prerequisites**: Users of this notebook should have a basic understanding of:
    * How to run a [Jupyter notebook](01_jupyter_introduction.ipynb)
    * The basic structure of the eo2cube [satellite datasets](02_eo2cube_introduction.ipynb)
    * How to [lookup and load data](03_data_lookup_and_loading.ipynb)

## Background

In the previous notebook, we saw that the data we want to access are loaded as an **`xarray.Dataset`**. This is the form in which Earth observation data are usually stored in a datacube.

**`xarray`** is an open-source project and Python package that offers a toolkit for working with ***multi-dimensional arrays*** of data. An **`xarray.Dataset`** is an in-memory representation of a netCDF (network Common Data Form) file. Understanding the structure of an **`xarray.Dataset`** is the key to working with these data, so this notebook is mainly dedicated to helping users of our datacube understand that structure.

First, let's return to the final stage of the previous notebook, where we loaded a data product. The data product "s2_l2a_bavaria" is used as the example in this notebook.

## Description

The following topics are covered in this notebook:
* What is inside an `xarray.Dataset` (the structure)?
* (Basic) Subset Dataset / DataArray
* Reshape a Dataset

In [1]:
import datacube
# To access and work with available data

import pandas as pd
# To format tables

from odc.ui import DcViewer 
# Provides an interface for interactively exploring the products available in the datacube

from odc.ui import with_ui_cbk
# Enables a progress bar when loading large amounts of data.

import xarray as xr

import matplotlib.pyplot as plt

# Configure pandas so that tables are displayed in full
# Useful: otherwise long entries are truncated when tables are displayed
pd.set_option("display.max_colwidth", 200)
pd.set_option("display.max_rows", None)

# Connect to DataCube
# argument "app" --- user defined name for a session (e.g. choose one matching the purpose of this notebook)
dc = datacube.Datacube(app = "nb_understand_ndArrays", config = '/home/datacube/.datacube.conf')

# Load Data Product
ds = dc.load(product = "s2_l2a_bavaria",
             measurements = ["blue", "green", "red"],
             longitude = [12.493, 12.509],
             latitude = [47.861, 47.868],
             time = ("2018-10-01", "2019-03-31"))

print(ds)

<xarray.Dataset>
Dimensions:      (time: 165, x: 124, y: 84)
Coordinates:
  * time         (time) datetime64[ns] 2018-10-01T10:15:58 ... 2019-03-30T10:...
  * y            (y) float64 5.308e+06 5.308e+06 ... 5.307e+06 5.307e+06
  * x            (x) float64 7.612e+05 7.612e+05 ... 7.624e+05 7.624e+05
    spatial_ref  int32 25832
Data variables:
    blue         (time, y, x) int16 8744 8704 8608 8600 8432 ... 519 525 499 473
    green        (time, y, x) int16 8632 8624 8560 8504 8496 ... 736 760 738 726
    red          (time, y, x) int16 8608 8608 8544 8544 8440 ... 450 469 448 433
Attributes:
    crs:           EPSG:25832
    grid_mapping:  spatial_ref


## **What is inside an `xarray.Dataset`?**
The figure below is a diagram depicting the structure of the **`xarray.Dataset`** we've just loaded. Read together with the diagram, the text below should make the data structure of an **`xarray.Dataset`** easier to interpret.

![xarray data structure](https://live.staticflickr.com/65535/51083605166_70dd29baa8_k.jpg)

As the output block shows, this dataset has three ***Data Variables***, "blue", "green" and "red" (shown in color in the diagram), each referring to an individual spectral band.

Each data variable can be regarded as a **multi-dimensional *Data Array*** of the same structure; in this case, it is a **three-dimensional array** (shown as a 3D cube in the diagram) where `time`, `x` and `y` are its ***Dimensions*** (shown as the axes along each cube in the diagram).

In this dataset, there are 165 ***Coordinates*** under the `time` dimension, which means there are 165 time steps along the `time` axis. There are 124 coordinates under the `x` dimension and 84 coordinates under the `y` dimension, indicating 124 pixels along the `x` axis and 84 pixels along the `y` axis.

As for the term ***Dataset***, it is like a *container* holding all the multi-dimensional arrays of the same structure (shown as the red-lined box holding all 3D cubes in the diagram).

So this example dataset has a spatial extent of 124 by 84 pixels at the given lon/lat locations, spans 165 time stamps and contains 3 spectral bands.

In summary, an ***`xarray.Dataset`*** is a dictionary-like container of ***`DataArrays`***, each of which is a labeled, n-dimensional array with 4 properties:
* **Data Variables (`values`)**: A `numpy.ndarray` holding values *(e.g. reflectance values of spectral bands)*.
* **Dimensions (`dims`)**: Dimension names for each axis *(e.g. 'x', 'y', 'time')*.
* **Coordinates (`coords`)**: Coordinates of each value along each axis *(e.g. longitudes along 'x'-axis, latitudes along 'y'-axis, datetime objects along 'time'-axis)*
* **Attributes (`attrs`)**: A dictionary(`dict`) containing Metadata.
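
To make these four properties concrete without access to the datacube, here is a minimal sketch that builds a small synthetic dataset and inspects each property. The name `toy`, the coordinate values and the random pixel values are assumptions made up for illustration; they are not data from "s2_l2a_bavaria".

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for a loaded product (hypothetical values):
# 4 time steps, 3 rows (y) and 5 columns (x)
times = pd.date_range("2019-01-01", periods=4, freq="D")
y = np.array([5308000.0, 5307990.0, 5307980.0])  # descending, like a north-up raster
x = np.array([761200.0, 761210.0, 761220.0, 761230.0, 761240.0])

rng = np.random.default_rng(0)
shape = (len(times), len(y), len(x))

toy = xr.Dataset(
    data_vars={
        band: (("time", "y", "x"), rng.integers(0, 10000, size=shape).astype("int16"))
        for band in ("blue", "green", "red")
    },
    coords={"time": times, "y": y, "x": x},
    attrs={"crs": "EPSG:25832"},
)

print(list(toy.data_vars))                 # names of the data variables
print(dict(toy.sizes))                     # dimension name -> length
print(toy.blue.dims)                       # dimension order of one DataArray
print(toy.blue.coords["time"].values)      # coordinate labels along "time"
print(type(toy.blue.values), toy.blue.values.shape)  # the underlying numpy array
print(toy.attrs)                           # metadata dictionary
```

Every data variable shares the same dimensions and coordinates, which is what allows the Dataset to act as a single container for all bands.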

Now let's deconstruct the dataset we have just loaded a bit further to make things clearer! :D

* **To check existing dimensions of a dataset**

In [2]:
ds.dims

Frozen(SortedKeysDict({'time': 165, 'y': 84, 'x': 124}))

* **To check the coordinates of a dataset**

In [3]:
ds.coords

Coordinates:
  * time         (time) datetime64[ns] 2018-10-01T10:15:58 ... 2019-03-30T10:...
  * y            (y) float64 5.308e+06 5.308e+06 ... 5.307e+06 5.307e+06
  * x            (x) float64 7.612e+05 7.612e+05 ... 7.624e+05 7.624e+05
    spatial_ref  int32 25832

* **To check all coordinates along a specific dimension**
<br>
<img src="https://live.staticflickr.com/65535/51115452191_ec160d4514_o.png" width="450">

In [4]:
print(ds.time)
# OR
#ds.coords['time']

<xarray.DataArray 'time' (time: 165)>
array(['2018-10-01T10:15:58.000000000', '2018-10-03T10:00:23.000000000',
       '2018-10-04T10:20:18.000000000', '2018-10-06T10:16:51.000000000',
       '2018-10-08T10:00:40.000000000', '2018-10-09T10:25:59.000000000',
       '2018-10-11T10:10:20.000000000', '2018-10-13T10:00:23.000000000',
       '2018-10-14T10:26:27.000000000', '2018-10-16T10:10:21.000000000',
       '2018-10-18T10:02:15.000000000', '2018-10-19T10:20:29.000000000',
       '2018-10-21T10:12:01.000000000', '2018-10-23T10:01:59.000000000',
       '2018-10-24T10:24:28.000000000', '2018-10-26T10:11:08.000000000',
       '2018-10-28T10:01:17.000000000', '2018-10-29T10:21:31.000000000',
       '2018-10-31T10:11:40.000000000', '2018-11-02T10:01:45.000000000',
       '2018-11-03T10:23:38.000000000', '2018-11-05T10:14:04.000000000',
       '2018-11-07T10:07:24.000000000', '2018-11-08T10:27:12.000000000',
       '2018-11-08T10:27:18.000000000', '2018-11-10T10:17:22.000000000',
       '2018-

* **To check attributes of the dataset**

In [5]:
ds.attrs

{'crs': 'EPSG:25832', 'grid_mapping': 'spatial_ref'}

## **Subset Dataset / DataArray**

* **To select all data of "blue" band**
<br>
<img src="https://live.staticflickr.com/65535/51115092614_366cb774a8_o.png" width="350">

In [6]:
print(ds.blue)
# OR
#ds['blue']

<xarray.DataArray 'blue' (time: 165, y: 84, x: 124)>
array([[[8744, 8704, 8608, ..., 8928, 8928, 8928],
        [8720, 8632, 8536, ..., 8888, 8888, 8896],
        [8680, 8672, 8528, ..., 8928, 8856, 8928],
        ...,
        [9088, 9056, 9080, ..., 9712, 9712, 9552],
        [9048, 9008, 9056, ..., 9720, 9680, 9640],
        [9096, 9056, 9032, ..., 9672, 9696, 9672]],

       [[   0,    0,    0, ...,    0,    0,    0],
        [   0,    0,    0, ...,    0,    0,    0],
        [   0,    0,    0, ...,    0,    0,    0],
        ...,
        [   0,    0,    0, ...,    0,    0,    0],
        [   0,    0,    0, ...,    0,    0,    0],
        [   0,    0,    0, ...,    0,    0,    0]],

       [[   0,    0,    0, ...,    0,    0,    0],
        [   0,    0,    0, ...,    0,    0,    0],
        [   0,    0,    0, ...,    0,    0,    0],
        ...,
...
        ...,
        [   0,    0,    0, ...,    0,    0,    0],
        [   0,    0,    0, ...,    0,    0,    0],
        [   0,    0,

In [7]:
# Only print pixel values
print(ds.blue.values)

[[[8744 8704 8608 ... 8928 8928 8928]
  [8720 8632 8536 ... 8888 8888 8896]
  [8680 8672 8528 ... 8928 8856 8928]
  ...
  [9088 9056 9080 ... 9712 9712 9552]
  [9048 9008 9056 ... 9720 9680 9640]
  [9096 9056 9032 ... 9672 9696 9672]]

 [[   0    0    0 ...    0    0    0]
  [   0    0    0 ...    0    0    0]
  [   0    0    0 ...    0    0    0]
  ...
  [   0    0    0 ...    0    0    0]
  [   0    0    0 ...    0    0    0]
  [   0    0    0 ...    0    0    0]]

 [[   0    0    0 ...    0    0    0]
  [   0    0    0 ...    0    0    0]
  [   0    0    0 ...    0    0    0]
  ...
  [   0    0    0 ...    0    0    0]
  [   0    0    0 ...    0    0    0]
  [   0    0    0 ...    0    0    0]]

 ...

 [[   0    0    0 ...    0    0    0]
  [   0    0    0 ...    0    0    0]
  [   0    0    0 ...    0    0    0]
  ...
  [   0    0    0 ...    0    0    0]
  [   0    0    0 ...    0    0    0]
  [   0    0    0 ...    0    0    0]]

 [[ 430  433  429 ...  584  577  575]
  [ 441  435

* **To select blue band data at the first time stamp**
<br>
<img src="https://live.staticflickr.com/65535/51116131265_8464728bc1_o.png" width="350">

In [8]:
print(ds.blue[0])

<xarray.DataArray 'blue' (y: 84, x: 124)>
array([[8744, 8704, 8608, ..., 8928, 8928, 8928],
       [8720, 8632, 8536, ..., 8888, 8888, 8896],
       [8680, 8672, 8528, ..., 8928, 8856, 8928],
       ...,
       [9088, 9056, 9080, ..., 9712, 9712, 9552],
       [9048, 9008, 9056, ..., 9720, 9680, 9640],
       [9096, 9056, 9032, ..., 9672, 9696, 9672]], dtype=int16)
Coordinates:
    time         datetime64[ns] 2018-10-01T10:15:58
  * y            (y) float64 5.308e+06 5.308e+06 ... 5.307e+06 5.307e+06
  * x            (x) float64 7.612e+05 7.612e+05 ... 7.624e+05 7.624e+05
    spatial_ref  int32 25832
Attributes:
    units:         reflectance
    nodata:        0
    crs:           EPSG:25832
    grid_mapping:  spatial_ref


* **To select blue band data at the first time stamp for the row with the largest latitude in the defined spatial extent**
<img src="https://live.staticflickr.com/65535/51115337046_aeb75d0d03_o.png" width="350">

In [9]:
print(ds.blue[0][0])

<xarray.DataArray 'blue' (x: 124)>
array([8744, 8704, 8608, 8600, 8432, 8320, 8328, 8312, 8200, 8200, 8188,
       8224, 8280, 8312, 8288, 8312, 8312, 8304, 8288, 8400, 8392, 8456,
       8456, 8552, 8552, 8536, 8504, 8528, 8488, 8536, 8488, 8488, 8536,
       8512, 8504, 8600, 8624, 8600, 8640, 8712, 8656, 8680, 8752, 8768,
       8768, 8824, 8856, 8888, 8952, 9008, 9032, 8968, 8992, 9000, 9024,
       8984, 9016, 8992, 9096, 9080, 9032, 9008, 9040, 9032, 9032, 8992,
       8864, 8872, 8832, 8864, 8864, 8872, 8944, 8864, 8960, 8960, 8992,
       9120, 9072, 8976, 8992, 9040, 9032, 9072, 9120, 9040, 9080, 9120,
       9040, 9040, 9088, 9088, 8944, 8944, 8936, 8888, 8888, 8888, 8872,
       8808, 8872, 8832, 8792, 8832, 8840, 8864, 8832, 8712, 8832, 8856,
       8800, 8880, 8840, 8832, 8920, 8912, 8800, 8880, 8856, 8880, 8912,
       8928, 8928, 8928], dtype=int16)
Coordinates:
    time         datetime64[ns] 2018-10-01T10:15:58
    y            float64 5.308e+06
  * x            (x) fl

* **To select the upper-left corner pixel**
<br>
<img src="https://live.staticflickr.com/65535/51116131235_b0cca9589f_o.png" width="350">

In [10]:
print(ds.blue[0][0][0])

<xarray.DataArray 'blue' ()>
array(8744, dtype=int16)
Coordinates:
    time         datetime64[ns] 2018-10-01T10:15:58
    y            float64 5.308e+06
    x            float64 7.612e+05
    spatial_ref  int32 25832
Attributes:
    units:         reflectance
    nodata:        0
    crs:           EPSG:25832
    grid_mapping:  spatial_ref
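
The chained integer indexing used in the last three cells can be sketched on a small made-up array. Here `blue` is a hypothetical toy array, not the loaded band. Each integer index removes the leading dimension, and `isel` expresses the same selection with named dimensions, which is usually easier to read.

```python
import numpy as np
import xarray as xr

# Hypothetical toy array: 2 time steps, 3 rows (y), 4 columns (x)
blue = xr.DataArray(
    np.arange(24, dtype="int16").reshape(2, 3, 4),
    dims=("time", "y", "x"),
    name="blue",
)

first_scene = blue[0]       # drops "time", leaves (y, x)
first_row = blue[0][0]      # drops "time" and "y", leaves (x,)
corner = blue[0][0][0]      # 0-dimensional DataArray holding a single pixel

# The same pixel, selected by naming each dimension explicitly
corner_by_name = blue.isel(time=0, y=0, x=0)

print(first_scene.dims, first_row.dims, corner.dims)
print(corner.item() == corner_by_name.item())
```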


### **Subset Dataset with `isel` vs. `sel`**
* Use `isel` when subsetting by integer **index** (position)
* Use `sel` when subsetting by coordinate **labels**

* **To select data of all spectral bands at the first time stamp**
<br>
<img src="https://live.staticflickr.com/65535/51114879732_7d62db54f4_o.png" width="750">

In [11]:
print(ds.isel(time=[0]))

<xarray.Dataset>
Dimensions:      (time: 1, x: 124, y: 84)
Coordinates:
  * time         (time) datetime64[ns] 2018-10-01T10:15:58
  * y            (y) float64 5.308e+06 5.308e+06 ... 5.307e+06 5.307e+06
  * x            (x) float64 7.612e+05 7.612e+05 ... 7.624e+05 7.624e+05
    spatial_ref  int32 25832
Data variables:
    blue         (time, y, x) int16 8744 8704 8608 8600 ... 9808 9672 9696 9672
    green        (time, y, x) int16 8632 8624 8560 8504 ... 9680 9656 9616 9528
    red          (time, y, x) int16 8608 8608 8544 8544 ... 9632 9560 9528 9480
Attributes:
    crs:           EPSG:25832
    grid_mapping:  spatial_ref


* **To select data of all spectral bands of year 2019** 
<br>
<img src="https://live.staticflickr.com/65535/51116281070_75f1b46a9c_o.png" width="750">

In [12]:
print(ds.sel(time='2019'))

<xarray.Dataset>
Dimensions:      (time: 90, x: 124, y: 84)
Coordinates:
  * time         (time) datetime64[ns] 2019-01-01T10:07:21 ... 2019-03-30T10:...
  * y            (y) float64 5.308e+06 5.308e+06 ... 5.307e+06 5.307e+06
  * x            (x) float64 7.612e+05 7.612e+05 ... 7.624e+05 7.624e+05
    spatial_ref  int32 25832
Data variables:
    blue         (time, y, x) int16 0 0 0 0 0 0 0 ... 518 515 519 525 499 473
    green        (time, y, x) int16 0 0 0 0 0 0 0 ... 756 752 736 760 738 726
    red          (time, y, x) int16 0 0 0 0 0 0 0 ... 428 433 450 469 448 433
Attributes:
    crs:           EPSG:25832
    grid_mapping:  spatial_ref
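
As a compact comparison of the two methods, the sketch below applies `isel` and `sel` to a small synthetic dataset. The name `toy` and all of its coordinate values are assumptions chosen for illustration. It also shows a detail that is easy to miss: passing a scalar to `isel` drops the dimension, while passing a list keeps it with length 1.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy dataset spanning the turn of the year
times = pd.to_datetime(["2018-12-30", "2018-12-31", "2019-01-01", "2019-01-02"])
toy = xr.Dataset(
    {"blue": (("time", "y", "x"), np.zeros((4, 2, 2), dtype="int16"))},
    coords={"time": times, "y": [20.0, 10.0], "x": [100.0, 110.0]},
)

by_index_scalar = toy.isel(time=0)    # integer position; "time" dimension is dropped
by_index_list = toy.isel(time=[0])    # list of positions; "time" is kept with length 1
by_label = toy.sel(time="2019")       # partial date string; all time stamps in 2019
by_exact = toy.sel(x=110.0, y=20.0)   # exact coordinate labels; "x" and "y" are dropped

print(dict(by_index_scalar.sizes))
print(dict(by_index_list.sizes))
print(dict(by_label.sizes))
print(dict(by_exact.sizes))
```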


***Tip: More about indexing and subsetting a Dataset or DataArray is presented in [Notebook_05](https://github.com/eo2cube/eo2cube_notebooks/blob/main/get_started/intro_to_eo2cube/05_xarrayII.ipynb).***

## **Reshape Dataset**

* **Convert the Dataset (subset to 2019) to a *4-dimensional* DataArray**

In [15]:
ds19 = ds.sel(time='2019').to_array().rename({"variable":"band"})
print(ds19)

<xarray.DataArray (band: 3, time: 90, y: 84, x: 124)>
array([[[[   0,    0,    0, ...,    0,    0,    0],
         [   0,    0,    0, ...,    0,    0,    0],
         [   0,    0,    0, ...,    0,    0,    0],
         ...,
         [   0,    0,    0, ...,    0,    0,    0],
         [   0,    0,    0, ...,    0,    0,    0],
         [   0,    0,    0, ...,    0,    0,    0]],

        [[   0,    0,    0, ...,    0,    0,    0],
         [   0,    0,    0, ...,    0,    0,    0],
         [   0,    0,    0, ...,    0,    0,    0],
         ...,
         [   0,    0,    0, ...,    0,    0,    0],
         [   0,    0,    0, ...,    0,    0,    0],
         [   0,    0,    0, ...,    0,    0,    0]],

        [[   0,    0,    0, ...,    0,    0,    0],
         [   0,    0,    0, ...,    0,    0,    0],
         [   0,    0,    0, ...,    0,    0,    0],
         ...,
...
         ...,
         [   0,    0,    0, ...,    0,    0,    0],
         [   0,    0,    0, ...,    0,    0,    0]

* **Convert the *4-dimensional* DataArray back to a Dataset, using each "time" step as a Data Variable (reshaped)**

![ds_reshaped](https://live.staticflickr.com/65535/51151694092_ca550152d6_o.png)

In [16]:
ds_reshp = ds19.to_dataset(dim="time")
print(ds_reshp)

<xarray.Dataset>
Dimensions:              (band: 3, x: 124, y: 84)
Coordinates:
  * y                    (y) float64 5.308e+06 5.308e+06 ... 5.307e+06 5.307e+06
  * x                    (x) float64 7.612e+05 7.612e+05 ... 7.624e+05 7.624e+05
    spatial_ref          int32 25832
  * band                 (band) <U5 'blue' 'green' 'red'
Data variables:
    2019-01-01 10:07:21  (band, y, x) int16 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
    2019-01-02 10:27:14  (band, y, x) int16 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
    2019-01-02 10:27:20  (band, y, x) int16 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
    2019-01-04 10:17:19  (band, y, x) int16 10360 10272 10264 ... 9744 9784 9784
    2019-01-04 10:17:21  (band, y, x) int16 10800 10760 10720 ... 9968 9984 9944
    2019-01-06 10:07:25  (band, y, x) int16 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
    2019-01-07 10:27:12  (band, y, x) int16 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
    2019-01-07 10:27:18  (band, y, x) int16 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
    20
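
The two reshaping steps can also be tried on a small synthetic dataset, where the effect on the dimensions is easier to follow. This is a minimal sketch: `toy` and its constant fill values are made up for illustration, and `by_band` is an extra step not shown above that splits along `band` to recover the original layout.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy dataset: each band is filled with a constant value
times = pd.date_range("2019-01-01", periods=2, freq="D")
toy = xr.Dataset(
    {
        band: (("time", "y", "x"), np.full((2, 3, 4), fill, dtype="int16"))
        for band, fill in (("blue", 1), ("green", 2), ("red", 3))
    },
    coords={"time": times},
)

# Dataset -> 4-D DataArray: the variables are stacked along a new dimension,
# named "variable" by default and renamed to "band" here
arr = toy.to_array().rename({"variable": "band"})
print(arr.dims)

# 4-D DataArray -> Dataset: one data variable per label of the chosen dimension
by_band = arr.to_dataset(dim="band")  # variables blue / green / red again
by_time = arr.to_dataset(dim="time")  # one variable per time stamp

print(list(by_band.data_vars))
print(len(by_time.data_vars), dict(by_time.sizes))
```

Because `to_dataset(dim=...)` creates one variable per label, splitting along `time` produces as many data variables as there are time stamps.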

## Recommended next steps

If you now understand the **data structure** of an `xarray.Dataset` and the **basic indexing** methods illustrated in this notebook, you are ready to move on to the next notebook, where you will learn more about **advanced indexing** and how to calculate some **basic statistical parameters** of n-dimensional arrays! :D

If you are interested in exploring the world of **xarray** further, you can dive into the [Xarray user guide](http://xarray.pydata.org/en/stable/index.html).

The notebooks in this beginner's guide are designed to be worked through in the following order:

1. [Jupyter Notebooks](01_jupyter_introduction.ipynb)
2. [eo2cube](02_eo2cube_introduction.ipynb)
3. [Loading Data](03_data_lookup_and_loading.ipynb)
4. **Xarray I: Data Structure (this notebook)**
5. [Xarray II: Index and Statistics](05_xarrayII.ipynb)
6. [Plotting data](06_plotting_basics.ipynb)
7. [Spatial analysis](07_basic_analysis.ipynb)
8. [Parallel processing with Dask](08_parallel_processing_with_dask.ipynb)

The additional notebooks help users build both basic and advanced skills that are not covered by the beginner's guide. They complement the guide, and self-motivated users can work through them according to their own needs:
<br>

1. [Python's file management tools](I_file_management.ipynb)
2. [Image Processing basics using NumPy and Matplotlib](II_numpy_image_processing.ipynb)
3. [Vector Processing](III_basic_vector_processing.ipynb)
4. [Advanced Plotting](IV_advanced_plotting.ipynb)

Once you have worked through the beginner's guide, you can join advanced users by exploring:

* The "DEA datasets" directory in the repository, where you can explore DEA products in depth.
* The "Frequently used code" directory, which contains a recipe book of common techniques and methods for analysing DEA data.
* The "Real-world examples" directory, which provides more complex workflows and analysis case studies.

***
## Additional information

This notebook is intended for use in the Jupyter Notebook environment of the [Department of Remote Sensing](http://remote-sensing.org/), [University of Wuerzburg](https://www.uni-wuerzburg.de/startseite/).

**License:** The code in this notebook is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). 


**Contact:** If you would like to report an issue with this notebook, you can file one on [GitHub](https://github.com).

**Last modified:** April 2021