<img align="right" src="../../additional_data/banner_siegel.png" style="width:1100px;">

# Xarray-I: Data Structure 

* [**Sign up to the JupyterHub**](https://www.phenocube.org/) to run this notebook interactively from your browser
* **Compatibility:** Notebook currently compatible with the Open Data Cube environments of the University of Wuerzburg
* **Prerequisites**: There is no prerequisite learning required.


## Background

In the previous notebook, we saw that the data we want to access are loaded as an **`xarray.Dataset`**. This is the form in which earth observation data are usually stored in a datacube.

**`xarray`** is an open source project and Python package which offers a toolkit for working with ***multi-dimensional arrays*** of data. An **`xarray.Dataset`** is an in-memory representation of a netCDF (network Common Data Form) file. Understanding the structure of an **`xarray.Dataset`** is the key to working with these data. This notebook therefore focuses on helping users of our datacube understand that data structure.

First, let's return to the final step of the previous notebook, where we loaded a data product. The data product "s2_l2a_namibia" is used as the example in this notebook.

## Description

The following topics are covered in this notebook:
* **What is inside an `xarray.Dataset` (the structure)?**
* **(Basic) Subset Dataset / DataArray**
* **Reshape a Dataset**

In [1]:
import datacube
# To access and work with available data

import pandas as pd
# To format tables

#from odc.ui import DcViewer 
# Provides an interface for interactively exploring the products available in the datacube

#from odc.ui import with_ui_cbk
# Enables a progress bar when loading large amounts of data.

import xarray as xr

import matplotlib.pyplot as plt

# Set config for displaying tables nicely
# !! USEFUL !! otherwise parts of longer infos won't be displayed in tables
pd.set_option("display.max_colwidth", 200)
pd.set_option("display.max_rows", None)

# Connect to DataCube
# argument "app" --- user defined name for a session (e.g. choose one matching the purpose of this notebook)
dc = datacube.Datacube(app = "nb_understand_ndArrays", config = '/home/datacube/.datacube.conf')

In [3]:
# Load Data Product
ds = dc.load(product = "s2_l2a_namibia",
             measurements = ["blue", "green", "red"],
             longitude = [17.793, 17.809],
             latitude = [-24.564, -24.557],
             time = ("2020-10-01", "2021-03-31"),
             group_by = "solar_day")

#ds = dc.load(product = "s2_l2a_bavaria",
#             measurements = ["blue", "green", "red"],
#             longitude = [12.493, 12.509],
#             latitude = [47.861, 47.868],
#             time = ("2018-10-01", "2019-03-31"))

print(ds)

<xarray.Dataset>
Dimensions:      (time: 35, x: 164, y: 82)
Coordinates:
  * time         (time) datetime64[ns] 2020-10-03T09:07:26 ... 2020-12-29T08:...
  * y            (y) float64 7.28e+06 7.28e+06 7.28e+06 ... 7.28e+06 7.28e+06
  * x            (x) float64 1.751e+05 1.751e+05 ... 1.767e+05 1.768e+05
    spatial_ref  int32 32734
Data variables:
    blue         (time, y, x) int16 817 927 879 696 664 ... 5928 6032 6068 6040
    green        (time, y, x) int16 1112 1284 1132 1023 ... 5576 5600 5576 5628
    red          (time, y, x) int16 1636 1802 1650 1500 ... 5372 5328 5312 5348
Attributes:
    crs:           EPSG:32734
    grid_mapping:  spatial_ref


## **What is inside an `xarray.Dataset`?**
The figure below is a diagram depicting the structure of the **`xarray.Dataset`** we've just loaded. Together with the diagram, the text below should help you interpret the data structure of an **`xarray.Dataset`**.

![xarray data structure](https://live.staticflickr.com/65535/51083605166_70dd29baa8_k.jpg)

As the output block shows, this dataset has three ***Data Variables***, "blue", "green" and "red" (shown with colors in the diagram), each referring to an individual spectral band.

Each data variable can be regarded as a **multi-dimensional *Data Array*** of the same structure; in this case, it is a **three-dimensional array** (shown as a 3D cube in the diagram) where `time`, `x` and `y` are its ***Dimensions*** (shown as the axes along each cube in the diagram).

In this dataset, there are 35 ***Coordinates*** under the `time` dimension, which means there are 35 time steps along the `time` axis. There are 164 coordinates under the `x` dimension and 82 coordinates under the `y` dimension, indicating that there are 164 pixels along the `x` axis and 82 pixels along the `y` axis.

As for the term ***Dataset***, it is like a *container* holding all the multi-dimensional arrays of the same structure (shown as the red-lined box holding all 3D cubes in the diagram).

So this example dataset has a spatial extent of 164 by 82 pixels at the given lon/lat location, spans 35 time stamps, and contains 3 spectral bands.

In summary, an ***`xarray.Dataset`*** is a dictionary-like container of ***`DataArrays`***, each of which is a labeled, n-dimensional array with 4 properties:
* **Data Variables (`values`)**: A `numpy.ndarray` holding values *(e.g. reflectance values of spectral bands)*.
* **Dimensions (`dims`)**: Dimension names for each axis *(e.g. 'x', 'y', 'time')*.
* **Coordinates (`coords`)**: Coordinates of each value along each axis *(e.g. longitudes along 'x'-axis, latitudes along 'y'-axis, datetime objects along 'time'-axis)*
* **Attributes (`attrs`)**: A dictionary(`dict`) containing Metadata.
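
The four properties listed above can be inspected on a small synthetic dataset. The sketch below is not datacube output: the name `demo`, the array sizes, and all coordinate values are made up for illustration, and it only assumes that `numpy`, `pandas` and `xarray` are installed.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy data: 2 time steps, 3 rows (y), 4 columns (x)
time = pd.date_range("2020-10-01", periods=2, freq="D")
y = np.array([30.0, 20.0, 10.0])
x = np.array([100.0, 110.0, 120.0, 130.0])
values = np.arange(24, dtype="int16").reshape(2, 3, 4)

# The Dataset is the container; each entry in data_vars is one DataArray
demo = xr.Dataset(
    data_vars={
        "blue": (("time", "y", "x"), values),
        "red": (("time", "y", "x"), values + 100),
    },
    coords={"time": time, "y": y, "x": x},
    attrs={"crs": "EPSG:32734"},
)

blue = demo["blue"]             # pick one DataArray out of the container
print(type(blue.values))        # the raw values as a numpy.ndarray
print(blue.dims)                # dimension names, here ('time', 'y', 'x')
print(sorted(blue.coords))      # coordinate names: ['time', 'x', 'y']
print(demo.attrs)               # metadata dictionary
print(sorted(demo.data_vars))   # ['blue', 'red']
```

Both data variables share the same dimensions and coordinates, which is what lets the Dataset hold them side by side.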

Now let's take apart the dataset we have just loaded to make things clearer! :D

* **To check existing dimensions of a dataset**

In [4]:
ds.dims

Frozen(SortedKeysDict({'time': 35, 'y': 82, 'x': 164}))

* **To check the coordinates of a dataset**

In [5]:
ds.coords

Coordinates:
  * time         (time) datetime64[ns] 2020-10-03T09:07:26 ... 2020-12-29T08:...
  * y            (y) float64 7.28e+06 7.28e+06 7.28e+06 ... 7.28e+06 7.28e+06
  * x            (x) float64 1.751e+05 1.751e+05 ... 1.767e+05 1.768e+05
    spatial_ref  int32 32734

* **To check all coordinates along a specific dimension**
<br>
<img src="https://live.staticflickr.com/65535/51115452191_ec160d4514_o.png" width="450">

In [6]:
print(ds.time)
# OR
#ds.coords['time']

<xarray.DataArray 'time' (time: 35)>
array(['2020-10-03T09:07:26.000000000', '2020-10-08T09:07:28.000000000',
       '2020-10-10T08:57:30.000000000', '2020-10-13T09:07:26.000000000',
       '2020-10-15T08:57:32.000000000', '2020-10-18T09:07:28.000000000',
       '2020-10-20T08:57:30.000000000', '2020-10-23T09:07:26.000000000',
       '2020-10-25T08:57:32.000000000', '2020-10-28T09:07:27.000000000',
       '2020-10-30T08:57:29.000000000', '2020-11-02T09:07:25.000000000',
       '2020-11-04T08:57:31.000000000', '2020-11-07T09:07:27.000000000',
       '2020-11-09T08:57:28.000000000', '2020-11-12T09:07:23.000000000',
       '2020-11-14T08:57:30.000000000', '2020-11-17T09:07:25.000000000',
       '2020-11-19T08:57:27.000000000', '2020-11-22T09:07:23.000000000',
       '2020-11-24T08:57:28.000000000', '2020-11-27T09:07:23.000000000',
       '2020-11-29T08:57:26.000000000', '2020-12-02T09:07:21.000000000',
       '2020-12-04T08:57:25.000000000', '2020-12-07T09:07:20.000000000',
       '2020-1

* **To check attributes of the dataset**

In [7]:
ds.attrs

{'crs': 'EPSG:32734', 'grid_mapping': 'spatial_ref'}

## **Subset Dataset / DataArray**

* **To select all data of "blue" band**
<br>
<img src="https://live.staticflickr.com/65535/51115092614_366cb774a8_o.png" width="350">

In [8]:
print(ds.blue)
# OR
#ds['blue']

<xarray.DataArray 'blue' (time: 35, y: 82, x: 164)>
array([[[ 817,  927,  879, ...,  677,  702,  698],
        [ 799,  812,  792, ...,  673,  710,  703],
        [ 739,  722,  750, ...,  661,  704,  658],
        ...,
        [ 738,  758,  841, ...,  722,  693,  727],
        [ 731,  785,  864, ...,  768,  756,  745],
        [ 767,  833,  842, ...,  829,  792,  771]],

       [[ 815,  805,  821, ...,  673,  679,  691],
        [ 802,  734,  741, ...,  626,  668,  676],
        [ 794,  784,  711, ...,  601,  634,  638],
        ...,
        [ 772,  735,  808, ...,  792,  785,  749],
        [ 789,  777,  858, ...,  846,  822,  778],
        [ 765,  846,  875, ...,  810,  761,  764]],

       [[ 976,  989, 1070, ...,  833,  797,  827],
        [ 980,  943,  933, ...,  770,  776,  827],
        [ 944,  901,  866, ...,  725,  782,  802],
        ...,
...
        ...,
        [1390, 1360, 1382, ..., 1106, 1086, 1064],
        [1384, 1348, 1422, ..., 1094, 1104, 1092],
        [1396, 1360, 

In [9]:
# Only print pixel values
print(ds.blue.values)

[[[ 817  927  879 ...  677  702  698]
  [ 799  812  792 ...  673  710  703]
  [ 739  722  750 ...  661  704  658]
  ...
  [ 738  758  841 ...  722  693  727]
  [ 731  785  864 ...  768  756  745]
  [ 767  833  842 ...  829  792  771]]

 [[ 815  805  821 ...  673  679  691]
  [ 802  734  741 ...  626  668  676]
  [ 794  784  711 ...  601  634  638]
  ...
  [ 772  735  808 ...  792  785  749]
  [ 789  777  858 ...  846  822  778]
  [ 765  846  875 ...  810  761  764]]

 [[ 976  989 1070 ...  833  797  827]
  [ 980  943  933 ...  770  776  827]
  [ 944  901  866 ...  725  782  802]
  ...
  [ 951  860  935 ...  946  926  929]
  [ 924  868  943 ...  985  970  940]
  [ 924  935  996 ... 1038 1032  971]]

 ...

 [[1240 1256 1338 ... 1120 1122 1122]
  [1254 1252 1284 ... 1092 1120 1120]
  [1248 1234 1256 ... 1046 1110 1122]
  ...
  [1390 1360 1382 ... 1106 1086 1064]
  [1384 1348 1422 ... 1094 1104 1092]
  [1396 1360 1378 ... 1178 1160 1088]]

 [[ 944  984 1064 ...  974  910  940]
  [ 979  886

* **To select blue band data at the first time stamp**
<br>
<img src="https://live.staticflickr.com/65535/51116131265_8464728bc1_o.png" width="350">

In [10]:
print(ds.blue[0])

<xarray.DataArray 'blue' (y: 82, x: 164)>
array([[817, 927, 879, ..., 677, 702, 698],
       [799, 812, 792, ..., 673, 710, 703],
       [739, 722, 750, ..., 661, 704, 658],
       ...,
       [738, 758, 841, ..., 722, 693, 727],
       [731, 785, 864, ..., 768, 756, 745],
       [767, 833, 842, ..., 829, 792, 771]], dtype=int16)
Coordinates:
    time         datetime64[ns] 2020-10-03T09:07:26
  * y            (y) float64 7.28e+06 7.28e+06 7.28e+06 ... 7.28e+06 7.28e+06
  * x            (x) float64 1.751e+05 1.751e+05 ... 1.767e+05 1.768e+05
    spatial_ref  int32 32734
Attributes:
    units:         reflectance
    nodata:        0
    crs:           EPSG:32734
    grid_mapping:  spatial_ref


* **To select blue band data at the first time stamp while the latitude is the largest in the defined spatial extent**
<img src="https://live.staticflickr.com/65535/51115337046_aeb75d0d03_o.png" width="350">

In [11]:
print(ds.blue[0][0])

<xarray.DataArray 'blue' (x: 164)>
array([ 817,  927,  879,  696,  664,  657,  650,  718,  750,  797,  775,
        810,  791,  781,  772,  757,  762,  779,  785,  814,  810,  856,
        859,  822,  828,  866,  839,  775,  753,  783,  877,  905,  843,
        843,  892,  871,  840,  886,  940,  905,  816,  785,  813,  822,
        853,  818,  825,  898,  912,  929,  888,  951,  947,  921,  919,
        919,  916,  935,  938,  908,  885,  887,  857,  893,  932,  833,
        762,  769,  775,  798,  743,  694,  604,  741,  836,  831,  810,
        779,  827,  883,  879,  836,  825,  841,  804,  783,  860,  876,
        851,  934,  926,  905, 1012,  942, 1017, 1062, 1000, 1088, 1146,
       1158, 1202, 1170, 1166, 1118, 1066, 1084, 1094, 1072, 1066,  974,
        996,  971,  964, 1009, 1052, 1034,  884,  883,  850,  866, 1078,
       1032,  921,  962,  907,  882,  854,  833,  777,  739,  749,  739,
        710,  803,  752,  724,  736,  741,  685,  703,  724,  808,  886,
        818,  83

* **To select the upper-left corner pixel**
<br>
<img src="https://live.staticflickr.com/65535/51116131235_b0cca9589f_o.png" width="350">

In [12]:
print(ds.blue[0][0][0])

<xarray.DataArray 'blue' ()>
array(817, dtype=int16)
Coordinates:
    time         datetime64[ns] 2020-10-03T09:07:26
    y            float64 7.28e+06
    x            float64 1.751e+05
    spatial_ref  int32 32734
Attributes:
    units:         reflectance
    nodata:        0
    crs:           EPSG:32734
    grid_mapping:  spatial_ref
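
Chained positional indexing such as `ds.blue[0][0][0]` depends on knowing that the dimension order is `(time, y, x)`. As a minimal sketch on a made-up array (the name `toy` and its values are illustrative only), the same pixel can also be reached with a single bracket or with named dimensions:

```python
import numpy as np
import xarray as xr

# Hypothetical 3-D array using the same dimension order as the notebook data
toy = xr.DataArray(
    np.arange(100, 124).reshape(2, 3, 4),
    dims=("time", "y", "x"),
    name="blue",
)

chained = toy[0][0][0]                 # one dimension removed per bracket
single = toy[0, 0, 0]                  # same result with one bracket
named = toy.isel(time=0, y=0, x=0)     # same result, independent of dim order

print(toy[0].dims)       # ('y', 'x')
print(toy[0][0].dims)    # ('x',)
print(int(chained), int(single), int(named))
```

Naming the dimensions keeps the selection readable and still works if the array is later transposed.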


### **Subset a dataset with `isel` vs. `sel`**
* Use `isel` when subsetting by **index** (integer position along a dimension)
* Use `sel` when subsetting by **label** (coordinate value along a dimension)
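
The difference between the two methods can be sketched on a tiny made-up dataset. The name `toy` and all values are hypothetical. Note that a scalar index drops the dimension, while a one-element list keeps it with length 1.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical dataset: 3 daily time steps over a 2 x 2 pixel grid
times = pd.date_range("2020-10-01", periods=3, freq="D")
toy = xr.Dataset(
    {"blue": (("time", "y", "x"), np.arange(12).reshape(3, 2, 2))},
    coords={"time": times, "y": [20.0, 10.0], "x": [100.0, 110.0]},
)

by_index = toy.isel(time=0)                          # integer position
by_label = toy.sel(time=pd.Timestamp("2020-10-01"))  # coordinate label, same step
kept = toy.isel(time=[0])                            # list keeps the "time" dimension
column = toy.sel(x=110.0)                            # label along the x dimension

print(by_index.blue.values.tolist())   # [[0, 1], [2, 3]]
print(by_label.blue.values.tolist())   # [[0, 1], [2, 3]]
print(kept.sizes["time"])              # 1
print(column.blue.dims)                # ('time', 'y')
```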

* **To select data of all spectral bands at the first time stamp**
<br>
<img src="https://live.staticflickr.com/65535/51114879732_7d62db54f4_o.png" width="750">

In [13]:
print(ds.isel(time=[0]))

<xarray.Dataset>
Dimensions:      (time: 1, x: 164, y: 82)
Coordinates:
  * time         (time) datetime64[ns] 2020-10-03T09:07:26
  * y            (y) float64 7.28e+06 7.28e+06 7.28e+06 ... 7.28e+06 7.28e+06
  * x            (x) float64 1.751e+05 1.751e+05 ... 1.767e+05 1.768e+05
    spatial_ref  int32 32734
Data variables:
    blue         (time, y, x) int16 817 927 879 696 664 ... 860 804 829 792 771
    green        (time, y, x) int16 1112 1284 1132 1023 ... 1144 1160 1092 1044
    red          (time, y, x) int16 1636 1802 1650 1500 ... 1696 1708 1648 1610
Attributes:
    crs:           EPSG:32734
    grid_mapping:  spatial_ref


* **To select data of all spectral bands of December 2020**
<br>
<img src="https://live.staticflickr.com/65535/51116281070_75f1b46a9c_o.png" width="750">

In [15]:
print(ds.sel(time='2020-12'))
#print(ds.sel(time='2019'))

<xarray.Dataset>
Dimensions:      (time: 12, x: 164, y: 82)
Coordinates:
  * time         (time) datetime64[ns] 2020-12-02T09:07:21 ... 2020-12-29T08:...
  * y            (y) float64 7.28e+06 7.28e+06 7.28e+06 ... 7.28e+06 7.28e+06
  * x            (x) float64 1.751e+05 1.751e+05 ... 1.767e+05 1.768e+05
    spatial_ref  int32 32734
Data variables:
    blue         (time, y, x) int16 867 948 949 807 755 ... 5928 6032 6068 6040
    green        (time, y, x) int16 1154 1250 1204 1098 ... 5576 5600 5576 5628
    red          (time, y, x) int16 1648 1736 1682 1578 ... 5372 5328 5312 5348
Attributes:
    crs:           EPSG:32734
    grid_mapping:  spatial_ref
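
Label-based selection along `time` accepts partial date strings, so a month, a whole year, or a date range can be requested in the same way. A minimal sketch with made-up acquisition times (the `toy` dataset and its values are hypothetical):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical acquisition times spanning November 2020 to January 2021
times = pd.to_datetime(
    ["2020-11-27T09:07:23", "2020-12-02T09:07:21",
     "2020-12-29T08:57:24", "2021-01-03T08:57:20"]
)
toy = xr.Dataset(
    {"blue": (("time",), np.array([10, 20, 30, 40], dtype="int16"))},
    coords={"time": times},
)

december = toy.sel(time="2020-12")                         # one month
year_2020 = toy.sel(time="2020")                           # one year
window = toy.sel(time=slice("2020-12-01", "2021-01-31"))   # date range, both ends included

print(december.sizes["time"], year_2020.sizes["time"], window.sizes["time"])  # 2 3 3
```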


***Tip: More about indexing and subsetting a Dataset or DataArray is presented in [Notebook_05](https://github.com/eo2cube/eo2cube_notebooks/blob/main/get_started/intro_to_eo2cube/05_xarrayII.ipynb).***

## **Reshape Dataset**

* **Convert the Dataset (subset to December 2020) to a *4-dimension* DataArray**

In [19]:
da = ds.sel(time='2020-12').to_array().rename({"variable":"band"})
print(da)

<xarray.DataArray (band: 3, time: 12, y: 82, x: 164)>
array([[[[ 867,  948,  949, ...,  767,  758,  775],
         [ 871,  840,  857, ...,  772,  789,  778],
         [ 816,  792,  819, ...,  726,  771,  743],
         ...,
         [ 846,  864,  855, ...,  788,  760,  794],
         [ 816,  847,  888, ...,  833,  822,  797],
         [ 851,  898,  916, ...,  861,  846,  846]],

        [[1116, 1158, 1230, ...,  919,  899,  936],
         [1130, 1036, 1036, ...,  867,  919,  931],
         [1080, 1005,  999, ...,  846,  872,  848],
         ...,
         [1050, 1026, 1056, ...,  989,  951,  961],
         [1046, 1054, 1108, ..., 1076, 1060, 1007],
         [1042, 1088, 1130, ..., 1100, 1068, 1013]],

        [[ 882,  868,  871, ...,  736,  725,  734],
         [ 845,  760,  768, ...,  714,  749,  714],
         [ 822,  807,  756, ...,  673,  693,  662],
         ...,
...
         ...,
         [1954, 1922, 2010, ..., 1998, 1972, 1942],
         [1964, 1968, 2060, ..., 2004, 1998, 1972]

* **Convert the *4-dimension* DataArray back to a Dataset by turning each "time" step into a data variable (reshaped)**

![ds_reshaped](https://live.staticflickr.com/65535/51151694092_ca550152d6_o.png)

In [20]:
ds_reshp = da.to_dataset(dim="time")
print(ds_reshp)

<xarray.Dataset>
Dimensions:              (band: 3, x: 164, y: 82)
Coordinates:
  * y                    (y) float64 7.28e+06 7.28e+06 ... 7.28e+06 7.28e+06
  * x                    (x) float64 1.751e+05 1.751e+05 ... 1.767e+05 1.768e+05
    spatial_ref          int32 32734
  * band                 (band) <U5 'blue' 'green' 'red'
Data variables:
    2020-12-02 09:07:21  (band, y, x) int16 867 948 949 807 ... 1776 1698 1636
    2020-12-04 08:57:25  (band, y, x) int16 1116 1158 1230 ... 2138 2126 2064
    2020-12-07 09:07:20  (band, y, x) int16 882 868 871 747 ... 1770 1734 1654
    2020-12-09 08:57:24  (band, y, x) int16 1210 1328 1310 ... 2136 2078 2020
    2020-12-12 09:07:19  (band, y, x) int16 877 973 901 748 ... 1768 1694 1648
    2020-12-14 08:57:24  (band, y, x) int16 1084 1172 1114 ... 2160 2088 2018
    2020-12-17 09:07:20  (band, y, x) int16 860 878 889 816 ... 1792 1738 1668
    2020-12-19 08:57:22  (band, y, x) int16 1234 1290 1382 ... 2156 2104 2026
    2020-12-22 09:07:19 
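
The reshaping round trip can be sketched on a small made-up dataset (the name `toy` and its values are hypothetical). Here the new dimension is named directly through the `dim` argument of `to_array`, which is an alternative to renaming `"variable"` afterwards as done above.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical dataset: 2 bands, 2 time steps, 3 x 4 pixels
times = pd.date_range("2020-12-01", periods=2, freq="D")
toy = xr.Dataset(
    {
        "blue": (("time", "y", "x"), np.zeros((2, 3, 4), dtype="int16")),
        "red": (("time", "y", "x"), np.ones((2, 3, 4), dtype="int16")),
    },
    coords={"time": times},
)

# Stack the data variables along a new dimension called "band"
da = toy.to_array(dim="band")
print(da.dims)                       # ('band', 'time', 'y', 'x')

# Split along "time": every time step becomes its own data variable
by_time = da.to_dataset(dim="time")
print(len(by_time.data_vars))        # 2

# Split along "band" instead to get the original layout back
restored = da.to_dataset(dim="band")
print(sorted(restored.data_vars))    # ['blue', 'red']
```

Which dimension is turned into data variables depends on the analysis: one variable per band suits per-band statistics, while one variable per time step suits comparing individual acquisitions.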

## Recommended next steps

If you now understand the **data structure** of an `xarray.Dataset` and the **basic indexing** methods illustrated in this notebook, you are ready to move on to the next notebook, where you will learn more about **advanced indexing** and calculating some **basic statistical parameters** of the n-dimensional arrays! :D

If you are interested in exploring the world of **xarrays** further, you can dive into the [Xarray user guide](http://xarray.pydata.org/en/stable/index.html).

<br>
The notebooks in this beginner's guide are designed to be worked through in the following order:

1. [Jupyter Notebooks](https://github.com/eo2cube/eo2cube_notebooks/blob/main/get_started/intro_to_eo2cube/01_jupyter_introduction.ipynb)
2. [eo2cube](https://github.com/eo2cube/eo2cube_notebooks/blob/main/get_started/intro_to_eo2cube/02_eo2cube_introduction.ipynb)
3. [Loading Data](https://github.com/eo2cube/eo2cube_notebooks/blob/main/get_started/intro_to_eo2cube/03_data_lookup_and_loading.ipynb)
4. ***Xarray I: Data Structure (this notebook)***
5. [Xarray II: Index and Statistics](https://github.com/eo2cube/eo2cube_notebooks/blob/main/get_started/intro_to_eo2cube/05_xarrayII.ipynb)
6. [Plotting data](https://github.com/eo2cube/eo2cube_notebooks/blob/main/get_started/intro_to_eo2cube/06_plotting_basics.ipynb)
7. [Spatial analysis](https://github.com/eo2cube/eo2cube_notebooks/blob/main/get_started/intro_to_eo2cube/07_basic_analysis.ipynb)
8. [Parallel processing with Dask](https://github.com/eo2cube/eo2cube_notebooks/blob/main/get_started/intro_to_eo2cube/08_parallel_processing_with_dask.ipynb)

The additional notebooks are designed for users to build up both basic and advanced skills which are not covered by the beginner's guide. Self-motivated users can go through them according to their own needs. They act as complements for the guide:
<br>

1. [Python's file management tools](https://github.com/eo2cube/eo2cube_notebooks/blob/main/get_started/intro_to_eo2cube/I_file_management.ipynb)
2. [Image Processing basics using NumPy and Matplotlib](https://github.com/eo2cube/eo2cube_notebooks/blob/main/get_started/intro_to_eo2cube/II_numpy_image_processing.ipynb)
3. [Vector Processing](https://github.com/eo2cube/eo2cube_notebooks/blob/main/get_started/intro_to_eo2cube/III_process_vector_data.ipynb)
4. [Advanced Plotting](https://github.com/eo2cube/eo2cube_notebooks/blob/main/get_started/intro_to_eo2cube/IV_advanced_plotting.ipynb)

***
## Additional information

This notebook is intended for use in the Jupyter Notebook environment of the [Department of Remote Sensing](http://remote-sensing.org/), [University of Wuerzburg](https://www.uni-wuerzburg.de/startseite/).

**License:** The code in this notebook is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). 


**Contact:** If you would like to report an issue with this notebook, you can file one on [Github](https://github.com).

**Last modified:** April 2021