<img align="right" src="../../additional_data/banner_siegel.png" style="width:1000px;">

# Xarray-I: Data Structure 

* [**Sign up to the JupyterHub**](https://www.phenocube.org/) to run this notebook interactively from your browser
* **Compatibility:** Notebook currently compatible with the Open Data Cube environments of the University of Wuerzburg
* **Prerequisites**: No prerequisite learning is required.


## Background

Earth observation data loaded from a datacube are usually represented with the Python library **`xarray`**.
It is an open-source Python package which offers a toolkit for working with ***multi-dimensional arrays*** of data. **`xarray.Dataset`** is an in-memory representation of a netCDF (network Common Data Form) file. Understanding the structure of an **`xarray.Dataset`** is the key to working with these data. This notebook therefore focuses on helping users of our datacube understand its data structure.

## Description

In this notebook, topics covered include:
* **What is inside an `xarray.Dataset` (the structure)?**
* **(Basic) Subset Dataset / DataArray**
* **Reshape a Dataset**

In [2]:
import datacube
import pandas as pd
from odc.ui import DcViewer 
from odc.ui import with_ui_cbk
import xarray as xr
import matplotlib.pyplot as plt

# Set config for displaying tables nicely
# !! OTHERWISE !! parts of longer infos won't be displayed in tables
pd.set_option("display.max_colwidth", 200)
pd.set_option("display.max_rows", None)

# Connect to DataCube
# argument "app" --- user defined name for a session (e.g. choose one matching the purpose of this notebook)
dc = datacube.Datacube(app = "nb_understand_ndArrays")

In [11]:
# Load Data Product
ds = dc.load(product="s2_l2a",
             x=(24.78, 24.88),
             y=(-28.90, -28.81),
             output_crs="EPSG:32734",
             time=("2019-12-01", "2020-03-31"),
             measurements=["blue", "green", "red"],
             resolution=(-10, 10),
             group_by="solar_day",
             progress_cbk=with_ui_cbk())

ds

## **What is inside an `xarray.Dataset`?**
The figure below depicts the structure of the **`xarray.Dataset`** we've just loaded. Refer to it while reading the explanation of the data structure that follows.

![xarray data structure](https://live.staticflickr.com/65535/51083605166_70dd29baa8_k.jpg)

As we can read from the output block, this dataset has three ***Data Variables***, "blue", "green", and "red" (shown with colors in the diagram), each referring to an individual spectral band.

Each data variable can be regarded as a **multi-dimensional *Data Array*** with the same structure. It is a **three-dimensional array** (shown as a 3D cube in the diagram) where `time`, `x`, and `y` are its ***Dimensions*** (shown as the axes along each cube in the diagram).

In this dataset, there are 49 ***coordinates*** under the `time` dimension, which means there are 49 time steps along the `time` axis. There are 1010 coordinates under the `x` dimension and 1031 coordinates under the `y` dimension, indicating 1010 pixels along the `x` axis and 1031 pixels along the `y` axis.

The term ***dataset*** is like a *container* holding all the multi-dimensional arrays of the same structure (shown as the red-lined box containing all 3D Cubes in the diagram).

So this example dataset has a spatial extent of 1010 by 1031 pixels at given x/y locations (projected coordinates in EPSG:32734), spans 49 time stamps and includes 3 spectral bands.

**In summary, *`xarray.Dataset`* is essentially a container for multi-dimensional *`DataArray`* objects with common attributes (e.g., crs) attached:**
* **Data Variables (`data_vars`)**: It's generally the first/highest dimension to subset from a high-dimensional array. Each `data variable` contains a multi-dimensional array of all other dimensions.
* **Dimensions (`dims`)**: Other dimensions arranged in hierarchical order *(e.g. 'time', 'y', 'x')*.
* **Coordinates (`coords`)**: Coordinates along each `Dimension` *(e.g. time steps along the 'time' dimension, y coordinates (northings) along the 'y' dimension, x coordinates (eastings) along the 'x' dimension)*
* **Attributes (`attrs`)**: A dictionary (`dict`) containing metadata.
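
The four components listed above can be inspected on a small, self-contained dataset. The sketch below is only an illustration: the names `toy`, `times` and all values are made up and do not come from the datacube.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy data: 2 time steps, 3 rows (y), 4 columns (x)
times = pd.date_range("2020-01-01", periods=2, freq="D")
y = np.array([30.0, 20.0, 10.0])            # descending, as in a north-up raster
x = np.array([100.0, 110.0, 120.0, 130.0])
shape = (len(times), len(y), len(x))

toy = xr.Dataset(
    data_vars={
        "blue": (("time", "y", "x"), np.zeros(shape, dtype="uint16")),
        "red": (("time", "y", "x"), np.ones(shape, dtype="uint16")),
    },
    coords={"time": times, "y": y, "x": x},
    attrs={"crs": "EPSG:32734"},
)

print(list(toy.data_vars))   # names of the data variables
print(dict(toy.sizes))       # length of each dimension
print(list(toy.coords))      # names of the coordinates
print(toy.attrs)             # metadata dictionary
```

Every data variable shares the same dimensions and coordinates, which is what allows the dataset to act as a single container for all bands.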

Now let's take the dataset we have just loaded apart a bit further to make things clearer! :D

* **To check the structure of the dataset**

In [12]:
print(ds)

<xarray.Dataset>
Dimensions:      (time: 49, y: 1031, x: 1010)
Coordinates:
  * time         (time) datetime64[ns] 2019-12-01T08:28:13 ... 2020-03-30T08:...
  * y            (y) float64 6.807e+06 6.807e+06 ... 6.797e+06 6.797e+06
  * x            (x) float64 8.687e+05 8.687e+05 ... 8.787e+05 8.788e+05
    spatial_ref  int32 32734
Data variables:
    blue         (time, y, x) uint16 1122 1094 1082 1060 ... 454 301 356 450
    green        (time, y, x) uint16 1596 1544 1516 1520 ... 711 614 620 762
    red          (time, y, x) uint16 2244 2188 2156 2132 ... 857 572 703 907
Attributes:
    crs:           EPSG:32734
    grid_mapping:  spatial_ref

* **To check existing dimensions of the dataset**

In [13]:
ds.dims

Frozen({'time': 49, 'y': 1031, 'x': 1010})
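
The dimension lengths also give a quick estimate of how much data the dataset holds. The following is a rough back-of-the-envelope sketch in plain Python; the variable names are illustrative, and the figures are copied from the dimensions, the three loaded bands and the `uint16` data type reported for this dataset.

```python
import math

# Dimension lengths copied from the `ds.dims` output
sizes = {"time": 49, "y": 1031, "x": 1010}
n_bands = 3              # blue, green, red
bytes_per_value = 2      # a uint16 value occupies 2 bytes

values_per_band = math.prod(sizes.values())
total_values = values_per_band * n_bands
total_megabytes = total_values * bytes_per_value / 1e6

print(f"{values_per_band:,} values per band")
print(f"{total_values:,} values in total, about {total_megabytes:.0f} MB in memory")
```

An estimate like this is useful before loading a larger area or a longer time range, since the memory footprint grows with the product of all dimension lengths.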

* **To check the coordinates of the dataset**

In [14]:
ds.coords

Coordinates:
  * time         (time) datetime64[ns] 2019-12-01T08:28:13 ... 2020-03-30T08:...
  * y            (y) float64 6.807e+06 6.807e+06 ... 6.797e+06 6.797e+06
  * x            (x) float64 8.687e+05 8.687e+05 ... 8.787e+05 8.788e+05
    spatial_ref  int32 32734

* **To check all coordinates along a specific dimension**
<br>
<img src="https://live.staticflickr.com/65535/51115452191_ec160d4514_o.png" width="450">

In [15]:
ds.time
# OR
#ds.coords['time']

* **To check attributes of the dataset**

In [16]:
ds.attrs

{'crs': 'EPSG:32734', 'grid_mapping': 'spatial_ref'}

## **Subset Dataset / DataArray**

* **To select all data of "blue" band**
<br>
<img src="https://live.staticflickr.com/65535/51115092614_366cb774a8_o.png" width="350">

In [17]:
ds.blue
# OR
#ds['blue']

In [18]:
# Only print pixel values
ds.blue.values

array([[[1122, 1094, 1082, ...,  834,  839,  875],
        [1118, 1080, 1108, ...,  835,  839,  846],
        [1108, 1098, 1112, ...,  789,  828,  821],
        ...,
        [ 945,  944,  987, ...,  683,  772,  922],
        [ 982, 1036, 1042, ...,  658,  727,  866],
        [ 982, 1070, 1096, ...,  622,  700,  880]],

       [[1112, 1096, 1074, ...,    0,    0,    0],
        [1110, 1074, 1090, ...,    0,    0,    0],
        [1080, 1050, 1072, ...,    0,    0,    0],
        ...,
        [   0,    0,    0, ...,    0,    0,    0],
        [   0,    0,    0, ...,    0,    0,    0],
        [   0,    0,    0, ...,    0,    0,    0]],

       [[7328, 7304, 7296, ..., 9064, 8952, 8824],
        [7368, 7296, 7296, ..., 9000, 8888, 8704],
        [7364, 7340, 7328, ..., 9000, 8920, 8760],
        ...,
        [6608, 6588, 6592, ..., 7208, 7216, 7184],
        [6596, 6644, 6620, ..., 7252, 7228, 7216],
        [6608, 6632, 6620, ..., 7260, 7220, 7252]],

       ...,

       [[ 465,  449,  44

* **To select blue band data at the first time stamp**
<br>
<img src="https://live.staticflickr.com/65535/51116131265_8464728bc1_o.png" width="350">

In [19]:
ds.blue[0]

* **To select blue band data at the first time stamp and in the first row along `y` (the largest `y` coordinate, i.e. the northern edge of the defined spatial extent)**
<br>
<img src="https://live.staticflickr.com/65535/51115337046_aeb75d0d03_o.png" width="350">

In [20]:
ds.blue[0][0]

* **To select the upper-left corner pixel**
<br>
<img src="https://live.staticflickr.com/65535/51116131235_b0cca9589f_o.png" width="350">

In [21]:
ds.blue[0][0][0]
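
Chained brackets such as `[0][0][0]` index one dimension at a time, in the order of the array's dimensions. As a minimal sketch on a hypothetical toy array (`toy_blue` and its values are made up for illustration), the same pixel can also be reached with a single bracket or with `isel` and explicit dimension names:

```python
import numpy as np
import xarray as xr

# Hypothetical toy array with dims (time, y, x) = (2, 3, 4)
toy_blue = xr.DataArray(
    np.arange(24).reshape(2, 3, 4),
    dims=("time", "y", "x"),
    name="blue",
)

chained = toy_blue[0][0][0]                 # three successive lookups
single = toy_blue[0, 0, 0]                  # one lookup with three indices
named = toy_blue.isel(time=0, y=0, x=0)     # same pixel, dimensions named explicitly

print(chained.item(), single.item(), named.item())
print(toy_blue[0].dims)       # dimensions left after fixing the first time step
print(toy_blue[0][0].dims)    # dimensions left after also fixing the first row
```

Naming the dimensions tends to be less error-prone, because it does not depend on remembering the dimension order.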

### **Subset Dataset with `isel` vs. `sel`**
* Use `isel` when subsetting by integer **index** (position along a dimension)
* Use `sel` when subsetting by coordinate **labels** (e.g. a date or an x/y value)

* **To select data of all spectral bands at the first time stamp**
<br>
<img src="https://live.staticflickr.com/65535/51114879732_7d62db54f4_o.png" width="750">

In [18]:
ds.isel(time=[0])

* **To select data of all spectral bands of year 2020** 
<br>
<img src="https://live.staticflickr.com/65535/51116281070_75f1b46a9c_o.png" width="750">

In [7]:
ds.sel(time='2020')
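
The difference between `isel` and `sel` can be reproduced without datacube access. The sketch below assumes a hypothetical toy dataset (`toy`, its dates and its values are invented for illustration) and also shows how a list index differs from a scalar index.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy dataset: 4 time steps straddling the 2019/2020 boundary
times = pd.to_datetime(["2019-12-21", "2019-12-31", "2020-01-10", "2020-01-20"])
toy = xr.Dataset(
    {"blue": (("time", "y", "x"), np.arange(16).reshape(4, 2, 2))},
    coords={"time": times, "y": [20.0, 10.0], "x": [100.0, 110.0]},
)

first_kept = toy.isel(time=[0])    # list index: "time" stays as a length-1 dimension
first_dropped = toy.isel(time=0)   # scalar index: "time" is no longer a dimension
year_2020 = toy.sel(time="2020")   # label: a partial date string selects the whole year
window = toy.sel(time=slice("2019-12-25", "2020-01-15"))  # label slice, ends inclusive

print(dict(first_kept.sizes))
print(dict(first_dropped.sizes))
print(year_2020.time.values)
print(window.time.values)
```

Label-based slices include both end points, unlike integer slices in plain Python, which is worth keeping in mind when selecting date ranges.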

## **Reshape Dataset**

* **Convert the Dataset (subset to 2019) to a *4-dimensional* DataArray**

In [22]:
da = ds.sel(time='2019').to_array().rename({"variable":"band"})
da

* **Convert the *4-dimensional* DataArray back to a Dataset in which each "time" step becomes a data variable (reshaped)**

![ds_reshaped](https://live.staticflickr.com/65535/51151694092_ca550152d6_o.png)

In [23]:
ds_reshp = da.to_dataset(dim="time")
print(ds_reshp)

<xarray.Dataset>
Dimensions:              (band: 3, y: 1031, x: 1010)
Coordinates:
  * y                    (y) float64 6.807e+06 6.807e+06 ... 6.797e+06 6.797e+06
  * x                    (x) float64 8.687e+05 8.687e+05 ... 8.787e+05 8.788e+05
    spatial_ref          int32 32734
  * band                 (band) <U5 'blue' 'green' 'red'
Data variables: (12/13)
    2019-12-01 08:28:13  (band, y, x) uint16 1122 1094 1082 ... 1268 1554 1914
    2019-12-04 08:38:08  (band, y, x) uint16 1112 1096 1074 1050 ... 0 0 0 0
    2019-12-06 08:28:08  (band, y, x) uint16 7328 7304 7296 ... 7304 7284 7304
    2019-12-09 08:38:04  (band, y, x) uint16 1686 1716 1694 1678 ... 0 0 0 0
    2019-12-11 08:28:10  (band, y, x) uint16 1046 1048 1024 ... 4408 4408 4288
    2019-12-14 08:38:05  (band, y, x) uint16 896 896 866 860 846 ... 0 0 0 0 0
    ...                   ...
    2019-12-19 08:38:04  (band, y, x) uint16 951 965 955 944 986 ... 0 0 0 0 0
    2019-12-21 08:28:08  (band, y, x) uint16 865 928 860 7
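
The reshaping round trip can be followed more easily on a tiny example. This is a minimal sketch on a hypothetical toy dataset (names and values are made up); it passes `dim="band"` to `to_array` directly instead of renaming the default `"variable"` dimension afterwards.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy dataset: 2 bands, 3 time steps, 2 x 2 pixels
times = pd.date_range("2019-12-01", periods=3, freq="D")
dims = ("time", "y", "x")
toy = xr.Dataset(
    {
        "blue": (dims, np.full((3, 2, 2), 1, dtype="uint16")),
        "red": (dims, np.full((3, 2, 2), 2, dtype="uint16")),
    },
    coords={"time": times, "y": [20.0, 10.0], "x": [100.0, 110.0]},
)

# Stack the data variables along a new leading dimension called "band"
da = toy.to_array(dim="band")
print(da.dims)

# Split along "time" instead: each time step becomes one data variable
by_time = da.to_dataset(dim="time")
print(len(by_time.data_vars), dict(by_time.sizes))

# Splitting along "band" again recovers the original layout
restored = da.to_dataset(dim="band")
print(restored.equals(toy))
```

Which dimension is turned into data variables is purely a matter of layout; the pixel values themselves are unchanged.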

## Recommended next steps

If you now understand the **data structure** of `xarray.Dataset` and the **basic indexing** methods illustrated in this notebook, you are ready to move on to the next notebook, where you will learn more about **advanced indexing** and calculating some **basic statistical parameters** of the n-dimensional arrays! :D

If you would like to explore **xarray** further, the [Xarray user guide](http://xarray.pydata.org/en/stable/index.html) is a good place to start.

<br>
To continue with this beginner's guide, work through the notebooks in the following order:

1. [Jupyter Notebooks](https://github.com/eo2cube/eo2cube_notebooks/blob/main/get_started/intro_to_eo2cube/01_jupyter_introduction.ipynb)
2. [eo2cube](https://github.com/eo2cube/eo2cube_notebooks/blob/main/get_started/intro_to_eo2cube/02_eo2cube_introduction.ipynb)
3. [Loading Data](https://github.com/eo2cube/eo2cube_notebooks/blob/main/get_started/intro_to_eo2cube/03_data_lookup_and_loading.ipynb)
4. ***Xarray I: Data Structure (this notebook)***
5. [Xarray II: Index and Statistics](https://github.com/eo2cube/eo2cube_notebooks/blob/main/get_started/intro_to_eo2cube/05_xarrayII.ipynb)
6. [Plotting data](https://github.com/eo2cube/eo2cube_notebooks/blob/main/get_started/intro_to_eo2cube/06_plotting_basics.ipynb)
7. [Spatial analysis](https://github.com/eo2cube/eo2cube_notebooks/blob/main/get_started/intro_to_eo2cube/07_basic_analysis.ipynb)
8. [Parallel processing with Dask](https://github.com/eo2cube/eo2cube_notebooks/blob/main/get_started/intro_to_eo2cube/08_parallel_processing_with_dask.ipynb)

The additional notebooks help users build both basic and advanced skills that are not covered by the beginner's guide. They complement the guide, and self-motivated users can work through them according to their own needs:
<br>

1. [Python's file management tools](https://github.com/eo2cube/eo2cube_notebooks/blob/main/get_started/intro_to_eo2cube/I_file_management.ipynb)
2. [Image Processing basics using NumPy and Matplotlib](https://github.com/eo2cube/eo2cube_notebooks/blob/main/get_started/intro_to_eo2cube/II_numpy_image_processing.ipynb)
3. [Vector Processing](https://github.com/eo2cube/eo2cube_notebooks/blob/main/get_started/intro_to_eo2cube/III_process_vector_data.ipynb)
4. [Advanced Plotting](https://github.com/eo2cube/eo2cube_notebooks/blob/main/get_started/intro_to_eo2cube/IV_advanced_plotting.ipynb)

***
## Additional information

This notebook is intended for use in the Jupyter Notebook environment of the [Department of Remote Sensing](http://remote-sensing.org/), [University of Wuerzburg](https://www.uni-wuerzburg.de/startseite/).

**License:** The code in this notebook is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). 


**Contact:** If you would like to report an issue with this notebook, you can file one on [GitHub](https://github.com).

**Last modified:** May 2022