## Background

In the previous notebook, we saw that the data we want to access are loaded as an **`xarray.Dataset`**. This is the form in which Earth observation data are usually stored in a datacube. Understanding the structure of an **`xarray.Dataset`** is the key to working with these data, so this notebook is dedicated to helping users of our datacube understand that structure.

First, let's return to the final step of the previous notebook, where we loaded a data product. The data product "s2_l2a_bavaria" is used as the example in this notebook.

In [4]:
import datacube
# To access and work with available data

import pandas as pd
# To format tables

from odc.ui import DcViewer 
# Provides an interface for interactively exploring the products available in the datacube

from odc.ui import with_ui_cbk
# Enables a progress bar when loading large amounts of data.

import xarray as xr

import matplotlib.pyplot as plt

# Configure pandas to display tables in full
# !! USEFUL !! otherwise long cell contents are truncated in tables
pd.set_option("display.max_colwidth", 200)
pd.set_option("display.max_rows", None)

# Connect to DataCube
# argument "app" --- user defined name for a session (e.g. choose one matching the purpose of this notebook)
dc = datacube.Datacube(app = "nb_understand_ndArrays", config = '/home/datacube/.datacube.conf')

# Load Data Product
ds = dc.load(product = "s2_l2a_bavaria",
             measurements = ["blue", "green", "red"],
             longitude = [12.493, 12.509],
             latitude = [47.861, 47.868],
             time = ("2018-04-01", "2019-03-31"))

print(ds)

<xarray.Dataset>
Dimensions:      (time: 269, x: 124, y: 84)
Coordinates:
  * time         (time) datetime64[ns] 2018-04-01T10:00:22 ... 2019-03-30T10:...
  * y            (y) float64 5.308e+06 5.308e+06 ... 5.307e+06 5.307e+06
  * x            (x) float64 7.612e+05 7.612e+05 ... 7.624e+05 7.624e+05
    spatial_ref  int32 25832
Data variables:
    blue         (time, y, x) int16 0 0 0 0 0 0 0 ... 518 515 519 525 499 473
    green        (time, y, x) int16 0 0 0 0 0 0 0 ... 756 752 736 760 738 726
    red          (time, y, x) int16 0 0 0 0 0 0 0 ... 428 433 450 469 448 433
Attributes:
    crs:           EPSG:25832
    grid_mapping:  spatial_ref


![xarray data structure](https://live.staticflickr.com/65535/51083605166_70dd29baa8_k.jpg)

The figure above is a diagram of the structure of the **`xarray.Dataset`** we've just loaded. Read together with the diagram, the explanation of the data structure below should be easier to follow.

As the output block shows, this dataset has three ***Data Variables***: "blue", "green" and "red" (shown with colors in the diagram). They correspond to the spectral bands and are chosen with the `measurements` argument of the `load` function.

Each data variable is a **multi-dimensional *Data Array*** with the same structure; in this case, a **three-dimensional array** (shown as a 3D cube in the diagram). `time`, `x` and `y` are the ***Dimensions*** (shown as the axes along each cube in the diagram), whose extents are set by the `time`, `longitude` and `latitude` arguments of the `load` function, respectively. Note that `x` and `y` hold projected coordinates in the dataset's CRS (EPSG:25832, as listed under the attributes), not the longitude/latitude values passed to `load`.

In this dataset, there are 269 ***Coordinates*** along the `time` dimension, which means there are 269 steps along the `time` axis (as shown in the diagram). By the same rule, there are 124 coordinates along the `x` dimension and 84 along the `y` dimension, so there are 124 pixels along the `x` axis and 84 pixels along the `y` axis.
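To make the terms above concrete without needing a datacube connection, here is a minimal sketch that builds a miniature dataset with the same layout and inspects it. The name `toy`, the random values, and the small sizes (3 time steps, 4 rows, 5 columns) are assumptions made purely for illustration; they are not the data loaded above.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical miniature stand-in for the loaded dataset:
# 3 time steps, 4 rows along y, 5 columns along x
times = pd.date_range("2018-04-01", periods=3, freq="D")
y = np.linspace(5_308_000, 5_307_970, 4)  # projected northings (made up)
x = np.linspace(761_200, 761_240, 5)      # projected eastings (made up)
rng = np.random.default_rng(0)

toy = xr.Dataset(
    data_vars={
        band: (("time", "y", "x"),
               rng.integers(0, 1000, size=(3, 4, 5)).astype("int16"))
        for band in ["blue", "green", "red"]
    },
    coords={"time": times, "y": y, "x": x},
)

print(list(toy.data_vars))  # names of the data variables
print(dict(toy.sizes))      # number of coordinates per dimension
print(toy["blue"].dims)     # dimension order of one data variable
print(toy["blue"].shape)    # array shape in that same order
```

Running the same four `print` calls on the real `ds` should report the three bands and the sizes 269, 84 and 124 shown in the output block above.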

As for the term ***Dataset***, it is simply a *container* holding all the multi-dimensional arrays of the same structure (shown as the red-lined box holding all 3D cubes in the diagram).

Thus we know that our dataset has a spatial extent of 124 by 84 pixels within the given lon/lat range and spans 269 time stamps for each spectral band.

Now let's deconstruct the dataset we've just loaded to make things (hopefully) clearer! :D

## 2 Core Data Structures of "xarray"

### I. `DataArray`
* **Labeled, N-dimensional array**
* Analogous to **`pandas.Series`**, generalized to N dimensions

Its key properties:
* **Values (`values`)**: A `numpy.ndarray` holding the values *(e.g. reflectance values of a spectral band)*.
* **Dimensions (`dims`)**: Dimension names for each axis *(e.g. 'x', 'y', 'time')*.
* **Coordinates (`coords`)**: Coordinates of each value along each axis *(e.g. longitudes along the 'x' axis, latitudes along the 'y' axis, datetime objects along the 'time' axis)*.
* **Attributes (`attrs`)**: A dictionary (`dict`) containing metadata.
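The four properties listed above can be sketched on a tiny, hand-built `DataArray`. The variable name `band`, the 2 x 2 x 3 shape, and all values and coordinates here are hypothetical and chosen only so the output is easy to read.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical 2 x 2 x 3 array standing in for one spectral band
band = xr.DataArray(
    data=np.arange(12, dtype="int16").reshape(2, 2, 3),
    dims=("time", "y", "x"),
    coords={
        "time": pd.to_datetime(["2018-04-01", "2018-04-06"]),
        "y": [5_308_000.0, 5_307_990.0],
        "x": [761_200.0, 761_210.0, 761_220.0],
    },
    attrs={"crs": "EPSG:25832"},
    name="blue",
)

print(type(band.values))        # the underlying numpy.ndarray
print(band.dims)                # dimension names, one per axis
print(band.coords["x"].values)  # coordinate labels along the x axis
print(band.attrs)               # metadata dictionary
```

The same attributes are available on any band of the loaded dataset, for example `ds["blue"].dims`.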


### II. `Dataset`
* N-dimensional array database
* **Container of `DataArray`**
* **Dictionary-like**
* Analogous to **`pandas.DataFrame`**
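The "dictionary-like container" idea can be illustrated with a short sketch. The dataset `toy` and its constant values are assumptions for illustration only; the point is that data variables behave like dictionary entries whose values are `DataArray` objects.

```python
import numpy as np
import xarray as xr

# Hypothetical dataset with two bands of identical shape (time, y, x)
data = np.zeros((2, 3, 4), dtype="int16")
toy = xr.Dataset({
    "blue": (("time", "y", "x"), data),
    "red": (("time", "y", "x"), data + 1),
})

print("blue" in toy)     # membership test, as with a dict
print(list(toy.keys()))  # names of the data variables

# Iterating yields (name, DataArray) pairs
for name, array in toy.items():
    print(name, type(array).__name__, array.shape)

# Adding a new variable works like a dict assignment
toy["green"] = toy["blue"] + 2
print(list(toy.keys()))
```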

With these two data structures, "xarray" is able to
* **Pull out arrays by name**
* **Select or combine data along a dimension across all arrays simultaneously** *(e.g. select NIR-band images for time stamps 1-12 over an area X, assuming there are 12 time stamps in total, then calculate the median of the 12 time stamps for each pixel)*
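Both capabilities can be sketched on synthetic data. The `nir` band, the 12 monthly time stamps, and the values are all hypothetical (the dataset loaded in this notebook contains only blue, green and red); each time step is filled with its own index so the median is easy to check by hand.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical data: 12 monthly time stamps over a 2 x 3 pixel area.
# Every pixel at time step t holds the value t.
times = pd.date_range("2018-01-01", periods=12, freq="MS")
nir_values = np.broadcast_to(
    np.arange(12, dtype="float64")[:, None, None], (12, 2, 3)
).copy()

toy = xr.Dataset(
    {
        "nir": (("time", "y", "x"), nir_values),
        "red": (("time", "y", "x"), nir_values * 2),
    },
    coords={"time": times},
)

nir = toy["nir"]                         # pull out one array by name
first_half = toy.isel(time=slice(0, 6))  # select along time, all variables at once
median = toy.median(dim="time")          # per-pixel median over time, all variables

print(first_half.sizes["time"])
print(median["nir"].values)
```

Because the reduction runs along `time`, the result keeps only the `y` and `x` dimensions: one median value per pixel and per band.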

![RGB](https://live.staticflickr.com/65535/51014414875_615bb016ea_k.jpg)

Let's look at the structure that the graphic above depicts:

In [51]:
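# Positional indexing follows the dimension order (time, y, x):
# [1] selects the second time step, [79] then selects one row along y,
# leaving a 1-D array along x.
# y has only 84 rows in this dataset, so the row index must stay below 84.
blue = ds["blue"][1][79]

print(len(blue))  # number of pixels along x
blue

124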


Now, if you feel you understand the structure of multi-dimensional data, you are ready to learn how to select the subset of data you need for your analysis.