# Deep dive into the Zarr format: Inside Zarr

## Introduction
This tutorial introduces the structure of a Zarr sample for **Sentinel 2 L2A** data. We will demonstrate how to visualise the `zarr` encoding structure, explore embedded information, and retrieve metadata for further processing.

### Prerequisites
A sample dataset for this tutorial can be obtained from the [EOPF available Samples](https://common.s3.sbg.perf.cloud.ovh.net/product.html). To explore a different dataset, substitute its name where the code comments indicate.

For local **Sentinel 2 L2A** data exploration, the resource with the format `S02MSIL2A_....zarr` should be downloaded and placed in the same directory as this example.

> **Note:**
> Further sample descriptions will be included in subsequent notebook updates.

To manage the required libraries, it is recommended to work within a dedicated and stable environment. To ensure package compatibility and avoid conflicts, the following virtual environment setup is suggested:

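For Conda:

`conda create --name zarr_explore python=3.11 xarray zarr dask numpy jupyter`

For pip:

`pip install xarray zarr dask numpy jupyter`

The `os` module is part of the Python standard library and does not need to be installed. `dask` is listed because the lazy, chunked loading used in this tutorial relies on it.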
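
Once the environment is active, a quick check can confirm that the third-party packages are importable. The following is a minimal sketch using only the standard library; the helper name `check_packages` is hypothetical and not part of any of the listed libraries.

```python
import importlib.util


def check_packages(names):
    """Map each top-level package name to True if it can be found for import."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Third-party packages used in this tutorial
status = check_packages(["xarray", "zarr", "dask", "numpy"])
for name, ok in status.items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```

Any package reported as `MISSING` should be installed in the active environment before continuing.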

### Setting up the environment
The `xarray` library facilitates the handling of labeled multi-dimensional arrays, enabling more efficient processing. This library will be explored in detail in [Chapter 3](), but its [documentation](https://docs.xarray.dev/en/stable/) provides additional resources.

We then import the specific dependencies.

In [1]:
import os
import xarray as xr
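
As one possible use of `os`, the sketch below lists the `.zarr` directories found in a folder, which can help confirm that the downloaded sample sits where the notebook expects it. The function name `find_zarr_stores` and the store name used in the demonstration are hypothetical, and the sketch assumes the store is an unpacked directory rather than a zipped archive.

```python
import os
import tempfile


def find_zarr_stores(directory="."):
    """Return the sorted names of sub-directories of `directory` ending in '.zarr'."""
    return sorted(
        name
        for name in os.listdir(directory)
        if name.endswith(".zarr") and os.path.isdir(os.path.join(directory, name))
    )


# Demonstration on a throwaway directory, so the sketch runs without the sample
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "S02MSIL2A_example.zarr"))  # hypothetical store name
    os.mkdir(os.path.join(tmp, "not_a_store"))
    found = find_zarr_stores(tmp)

print(found)
```

In the notebook itself, calling `find_zarr_stores()` with no argument would list the stores next to this example.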

From `xarray`, the `open_datatree()` function opens and decodes a `DataTree` from a file-like object (in this case, the stored `.zarr`), creating a tree node for each group within the file.

In [2]:
# Open the Zarr store with xarray as a DataTree
zarr_sample = xr.open_datatree(
    "S02MSIL2A_20230629T120347_0000_A064_TC64.zarr",  # replace with the sample you downloaded
    engine="zarr",  # backend used to read the store
    chunks={},  # load lazily as dask arrays, using the chunking stored on disk
)

To retrieve the names of the groups stored inside the `zarr`, we define a function that walks the tree recursively and prints the name of each node, indented according to its depth.
This gives a general overview of the stored elements without the detailed description.

In [3]:
def print_gen_structure(node, indent=""):
    print(f"{indent}{node.name}")  # print the name of the current node
    for child_name, child_node in node.children.items():  # loop over the direct children of this node
        print_gen_structure(child_node, indent + "  ")  # recurse into each child with a deeper indent

The following output lists the main groups of the store: `conditions`, `measurements`, and `quality`. The first line reads `None` because the root node has no name. This structure was outlined in [Chapter 1]().

In [4]:
print("Zarr Sentinel 2 L2A Structure")
print_gen_structure(zarr_sample.root) 
print("-" * 30)

Zarr Sentinel 2 L2A Structure
None
  conditions
    geometry
    mask
      detector_footprint
        r10m
        r20m
        r60m
      l1c_classification
        r60m
      l2a_classification
        r20m
        r60m
    meteorology
      cams
      ecmwf
  measurements
    reflectance
      r10m
      r20m
      r60m
  quality
    atmosphere
      r10m
      r20m
      r60m
    l2a_quicklook
      r10m
      r20m
      r60m
    mask
      r10m
      r20m
      r60m
    probability
      r20m
------------------------------
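
The same traversal can return the group paths as a list instead of printing them, which is convenient when a path is reused later as a key. The sketch below is one way to do this; `collect_paths` and `StandInNode` are hypothetical names, and the stand-in class only mimics the `.name` and `.children` attributes used above so that the example runs without the sample.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StandInNode:
    """Hypothetical stand-in exposing `.name` and `.children` like a tree node."""
    name: Optional[str]
    children: dict = field(default_factory=dict)


def collect_paths(node, prefix="", max_depth=None, _depth=0):
    """Return the paths of all descendants of `node`, optionally limited in depth."""
    paths = []
    for child_name, child_node in node.children.items():
        path = f"{prefix}/{child_name}"
        paths.append(path)
        # Descend only while the depth limit (if any) has not been reached
        if max_depth is None or _depth + 1 < max_depth:
            paths.extend(collect_paths(child_node, path, max_depth, _depth + 1))
    return paths


# A small excerpt of the structure printed above
reflectance = StandInNode("reflectance", {r: StandInNode(r) for r in ("r10m", "r20m", "r60m")})
probability = StandInNode("probability", {"r20m": StandInNode("r20m")})
root = StandInNode(None, {
    "measurements": StandInNode("measurements", {"reflectance": reflectance}),
    "quality": StandInNode("quality", {"probability": probability}),
})

all_paths = collect_paths(root)
top_level = collect_paths(root, max_depth=1)
print(all_paths)
print(top_level)
```

Because the function relies only on `.children`, passing `zarr_sample` instead of `root` should work in the same way.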


If we follow the tree structure, we can see that it also displays complementary information such as the available resolutions and embedded data, corresponding to the diagram.

![Sentinel-2 L2A zarr structure](img/s2l2a.jpg)

To have a finer visualisation of the `zarr` element, `xarray` also allows us to access a representation of the entire data content within the `.zarr` object. This visualisation displays each group defined inside the `.zarr` file and its respective arrays, including detailed information such as general metadata, dimensions, chunking geometry, and chunk size.

In [9]:
# Print the detailed structure of the opened Zarr store.
# Uncomment the lines below if you want to print the whole dataset.
# print("Dataset Structure:")
# print(zarr_sample)
# print("-" * 30)

If we want to extract specific information from a group, `xarray`'s labels allow us to retrieve it by group path.
Let's say we want to visualise only the reflectance values inside this asset.
We need to remember that three resolutions are available (10 m, 20 m and 60 m); by following the `measurements/reflectance` path inside the `zarr`, we can retrieve the `reflectance` group structure:

In [6]:
# Retrieving the reflectance groups:
print(zarr_sample["measurements/reflectance"])

<xarray.DataTree 'reflectance'>
Group: /measurements/reflectance
├── Group: /measurements/reflectance/r10m
│       Dimensions:  (y: 10980, x: 10980)
│       Coordinates:
│         * x        (x) int32 44kB 300005 300015 300025 300035 ... 409775 409785 409795
│         * y        (y) int32 44kB 4600015 4600005 4599995 ... 4490245 4490235 4490225
│       Data variables:
│           b02      (y, x) float64 964MB dask.array<chunksize=(1830, 1830), meta=np.ndarray>
│           b03      (y, x) float64 964MB dask.array<chunksize=(1830, 1830), meta=np.ndarray>
│           b04      (y, x) float64 964MB dask.array<chunksize=(1830, 1830), meta=np.ndarray>
│           b08      (y, x) float64 964MB dask.array<chunksize=(1830, 1830), meta=np.ndarray>
├── Group: /measurements/reflectance/r20m
│       Dimensions:  (y: 5490, x: 5490)
│       Coordinates:
│         * x        (x) int32 22kB 300010 300030 300050 300070 ... 409750 409770 409790
│         * y        (y) int32 22kB 4600010 4599990 4599970 .

Inside this element, we can see the three groups (child nodes of the parent node `reflectance`): `r10m`, `r20m` and `r60m`, each listed under a **Group** entry.
Looking further inside each of them, we find the chunked arrays that hold the reflectance values.
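
The chunking figures reported for the `r10m` arrays can be checked with a little arithmetic. The sketch below takes the shape, chunk size and data type from the printed representation and derives the number of chunks and the sizes in bytes; these are sizes of the decoded arrays in memory, and the stored chunks are typically smaller once compressed.

```python
import math

shape = (10980, 10980)      # (y, x) dimensions of the r10m arrays
chunk_shape = (1830, 1830)  # chunksize reported in the representation
itemsize = 8                # bytes per float64 value

# Number of chunks along each axis and in total
chunks_per_axis = tuple(math.ceil(n / c) for n, c in zip(shape, chunk_shape))
n_chunks = math.prod(chunks_per_axis)

# Size of one full band and of one chunk
array_bytes = math.prod(shape) * itemsize
chunk_bytes = math.prod(chunk_shape) * itemsize

print(f"chunks per axis: {chunks_per_axis}, total: {n_chunks}")
print(f"array size: {array_bytes / 1e6:.0f} MB")
print(f"chunk size: {chunk_bytes / 1e6:.1f} MB")
```

Each 10 m band is therefore split into a 6 × 6 grid of chunks of roughly 26.8 MB each, adding up to the 964 MB shown per variable.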

Also, through `zarr_sample.attrs` we can access both the `stac_discovery` and `other_metadata` dictionaries.
For example, the properties inside `stac_discovery`:

In [7]:
# STAC metadata style:
zarr_sample.attrs["stac_discovery"]['properties']

{'created': '2023-06-29T12:03:47+00:00',
 'datetime': '2018-08-20T08:36:01.024Z',
 'end_datetime': '2018-08-20T08:36:01.024000+00:00',
 'eo:bands': [{'center_wavelength': 442.7,
   'common_name': 'coastal',
   'full_width_half_max': 0.02,
   'name': 'b01',
   'solar_illumination': 1884.69},
  {'center_wavelength': 492.7,
   'common_name': 'blue',
   'full_width_half_max': 0.065,
   'name': 'b02',
   'solar_illumination': 1959.66},
  {'center_wavelength': 559.8,
   'common_name': 'green',
   'full_width_half_max': 0.035,
   'name': 'b03',
   'solar_illumination': 1823.24},
  {'center_wavelength': 664.6,
   'common_name': 'red',
   'full_width_half_max': 0.03,
   'name': 'b04',
   'solar_illumination': 1512.06},
  {'center_wavelength': 704.1,
   'common_name': 'rededge',
   'full_width_half_max': 0.015,
   'name': 'b05',
   'solar_illumination': 1424.64},
  {'center_wavelength': 740.5,
   'common_name': 'rededge',
   'full_width_half_max': 0.015,
   'name': 'b06',
   'solar_illumination'
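
Because these properties are plain Python dictionaries and lists, they can be reshaped with ordinary comprehensions. As a sketch, the excerpt below copies a few fields of the first five `eo:bands` entries shown above and builds a lookup from band name to central wavelength; the variable names are illustrative.

```python
# Excerpt of the 'eo:bands' entries shown above (fields reduced for brevity)
eo_bands = [
    {"name": "b01", "common_name": "coastal", "center_wavelength": 442.7},
    {"name": "b02", "common_name": "blue", "center_wavelength": 492.7},
    {"name": "b03", "common_name": "green", "center_wavelength": 559.8},
    {"name": "b04", "common_name": "red", "center_wavelength": 664.6},
    {"name": "b05", "common_name": "rededge", "center_wavelength": 704.1},
]

# Lookup table: band name -> central wavelength
wavelength_by_band = {band["name"]: band["center_wavelength"] for band in eo_bands}

# Names of the bands whose common name is one of the visible colours
visible = [b["name"] for b in eo_bands if b["common_name"] in {"blue", "green", "red"}]

print(wavelength_by_band["b04"])
print(visible)
```

With the sample opened, the same list is available as `zarr_sample.attrs["stac_discovery"]["properties"]["eo:bands"]`.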

And inside `other_metadata`, the information from a band, for example `b06`:

In [8]:
# Complementing metadata:
zarr_sample.attrs["other_metadata"]['band_description']['b06']

{'bandwidth': 13.0,
 'central_wavelength': 740.5,
 'onboard_compression_rate': '2.655',
 'onboard_integration_time': '2.7251472',
 'physical_gain': '4.86464934',
 'spectral_response_step': '1',
 'spectral_response_values': '0.00171088 0.05467153 0.25806676 0.64722098 0.89218999 0.90232877 0.91508768 0.94115846 0.96299993 0.97510481 0.9770217 0.98736251 1 0.98880277 0.97179916 0.90126739 0.60672391 0.20520227 0.0267569',
 'units': 'nm',
 'wavelength_max': 749.0,
 'wavelength_min': 731.0}
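
Several of these fields are stored as strings, so they need converting before numerical use. The sketch below parses the spectral response of `b06` from the values shown above. It assumes that the samples start at `wavelength_min` and are spaced by `spectral_response_step` in nm; that interpretation is not stated in the metadata itself, although the 19 values do match the 731 to 749 nm range at a step of 1.

```python
# Fields copied from the 'b06' band description shown above
b06 = {
    "spectral_response_step": "1",
    "spectral_response_values": (
        "0.00171088 0.05467153 0.25806676 0.64722098 0.89218999 0.90232877 "
        "0.91508768 0.94115846 0.96299993 0.97510481 0.9770217 0.98736251 1 "
        "0.98880277 0.97179916 0.90126739 0.60672391 0.20520227 0.0267569"
    ),
    "wavelength_max": 749.0,
    "wavelength_min": 731.0,
}

# Convert the space-separated string into floats
values = [float(v) for v in b06["spectral_response_values"].split()]
step = float(b06["spectral_response_step"])

# Assumption: samples start at wavelength_min and are spaced by `step` nm
wavelengths = [b06["wavelength_min"] + i * step for i in range(len(values))]

# Wavelength at which the response is highest
peak_index = max(range(len(values)), key=values.__getitem__)
peak_wavelength = wavelengths[peak_index]

print(len(values), wavelengths[0], wavelengths[-1])
print(peak_wavelength)
```

Under this assumption the response peaks at 743 nm, slightly above the reported central wavelength of 740.5 nm.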

## Conclusion
This tutorial provides an initial understanding of the `zarr` structure for Sentinel 2 L2A data. 
By using the `xarray` library, one can effectively navigate and inspect the different components within the `zarr` format, including its metadata and array organisation. 
This foundation will help in understanding the data analysis and processing workflows covered later in the series.

For a deeper description of the metadata structure, follow the [metadata structure]() tutorial.