# STAC to `zarr`: How to access information

### Introduction

In this tutorial we will demonstrate how to discover, access, and analyse Earth Observation data using the [EOPF Sentinel Zarr Sample Service STAC Catalog](https://stac.browser.user.eopf.eodc.eu/?.language=en) and EOPF `zarr` datasets. 
This step-by-step guide is aimed at beginners in Earth Observation data processing.

### What we will learn

- ☁️ How to open cloud-optimised datasets through the EOPF STAC Catalog
- 🏗️ How EOPF Zarr products are organised, with visualisations
- 🔎 Common techniques for examining datasets
- 📊 How to perform simple data analysis

### Prerequisites

This tutorial requires the `xarray-eopf` extension for data manipulation. To find out more about the library, access the [documentation](https://eopf-sample-service.github.io/xarray-eopf/).

The [EOPF Sentinel Zarr API connection tutorial](03_eopf_stac_conection.ipynb) gives an introduction to the workflow for accessing the STAC collection we are interested in.

<hr>

#### Import libraries

In [1]:
import requests
import numpy as np
import matplotlib.pyplot as plt
from typing import List, Optional, cast
from pystac import Collection, MediaType
from pystac_client import Client, CollectionClient
from datetime import datetime
import xarray as xr

# xr.set_options(display_expand_attrs=False)

#### Helper functions

##### `list_found_elements`
Because we expect to work with several search results stored in lists, we define a function that collects the `id` of each Item and the `id` of its Collection for later retrieval.

In [2]:
def list_found_elements(search_result):
    """Return the item ids and collection ids found in a search result."""
    ids = []
    collections = []
    for item in search_result.items():  # iterate over the Items returned by the search
        ids.append(item.id)
        collections.append(item.collection_id)
    return ids, collections
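
To see what the helper returns without querying the live catalogue, the sketch below runs the same loop on stand-in objects. `FakeSearch` and the item ids are hypothetical placeholders that only mimic the `items()`, `id`, and `collection_id` names used above; they are not part of `pystac_client`.

```python
from types import SimpleNamespace


def list_found_elements(search_result):
    """Collect item ids and collection ids from a search result."""
    ids, collections = [], []
    for item in search_result.items():
        ids.append(item.id)
        collections.append(item.collection_id)
    return ids, collections


class FakeSearch:
    """Hypothetical stand-in for a pystac_client search result."""

    def __init__(self, items):
        self._items = items

    def items(self):
        # The real object yields pystac Items; here we yield simple namespaces.
        return iter(self._items)


fake_search = FakeSearch(
    [
        SimpleNamespace(id="item-a", collection_id="sentinel-2-l2a"),
        SimpleNamespace(id="item-b", collection_id="sentinel-2-l2a"),
    ]
)

found_ids, found_collections = list_found_elements(fake_search)
print(found_ids)
print(found_collections)
```

The function returns two parallel lists, so the id at a given position belongs to the collection at the same position.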

## Establish the connection

Our first step is to create our connection to interact with the EOPF STAC Catalog.<br>
This involves defining the starting point for the data we wish to retrieve.<br>

The API's base URL is available through the 🔗**Source** ([click here](https://stac.core.eopf.eodc.eu/)), which can be found in the **API & URL** tab of the [EOPF Sentinel Zarr Sample Service STAC Catalog](https://stac.browser.user.eopf.eodc.eu/?.language=en).

![EOPF API url for connection](img/api_connection.png)

With the `Client.open()` function, we can access the starting point of the Catalogue by providing its URL.

In [3]:
max_description_length = 100

eopf_stac_api_root_endpoint = "https://stac.core.eopf.eodc.eu/" #root starting point
eopf_catalog = Client.open(url=eopf_stac_api_root_endpoint)

Verifying the catalog we have just accessed:

In [4]:
print(
    "Connected to Catalog {id}: {description}".format(
        id=eopf_catalog.id,
        description=eopf_catalog.description
        if len(eopf_catalog.description) <= max_description_length
        else eopf_catalog.description[: max_description_length - 3] + "...",
    )
)

## Accessing Items of interest

For this tutorial, we will focus on the Sentinel-2 L2A Collection. Its id in the EOPF STAC Catalog is `sentinel-2-l2a`.

As we are interested in retrieving and exploring an Item from the collection, we will again focus on the Innsbruck area defined in the [previous tutorial]().

In [5]:
innsbruck_s2 = eopf_catalog.search( # searching in the Catalog
    collections= 'sentinel-2-l2a', # interest Collection,
    bbox=(11.124756, 47.311058, # AOI extent
          11.459839,47.463624),
    datetime='2020-05-01T00:00:00Z/2025-05-31T23:59:59.999999Z' # interest period
)

combined_ins =list_found_elements(innsbruck_s2)

print("Search Results:")
print('Total Items Found for Sentinel-2 L2A over Innsbruck:  ',len(combined_ins[0]))
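
The search above takes a bounding box and a time interval as plain values. As a minimal sketch, assuming the bounding box is ordered `(west, south, east, north)` in longitude and latitude and that the interval consists of two UTC timestamps separated by `/`, the helpers below check the first and build the second. The names `check_bbox` and `build_interval` are illustrative and not part of `pystac_client`.

```python
from datetime import datetime, timezone


def check_bbox(bbox):
    """Return True if bbox is ordered (west, south, east, north) within lon/lat ranges.

    This simple check does not handle boxes that cross the antimeridian.
    """
    west, south, east, north = bbox
    return (
        -180 <= west < east <= 180
        and -90 <= south < north <= 90
    )


def build_interval(start, end):
    """Format two timezone-aware datetimes as 'start/end' in UTC."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    start_utc = start.astimezone(timezone.utc)
    end_utc = end.astimezone(timezone.utc)
    return f"{start_utc.strftime(fmt)}/{end_utc.strftime(fmt)}"


innsbruck_bbox = (11.124756, 47.311058, 11.459839, 47.463624)
period = build_interval(
    datetime(2020, 5, 1, tzinfo=timezone.utc),
    datetime(2025, 5, 31, 23, 59, 59, tzinfo=timezone.utc),
)

print(check_bbox(innsbruck_bbox))
print(period)
```

Checking the coordinate order before searching helps catch swapped longitude and latitude values, which would otherwise point the search at a different area.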

In [6]:
first_item_id=combined_ins[0][0]
print(first_item_id)

Here, we retrieve the sentinel-2-l2a collection object from the EOPF STAC catalog. This object provides access to details and items within that specific collection.

In [7]:
c_sentinel2 = eopf_catalog.get_collection('sentinel-2-l2a')
#Choosing the first item available to be opened:
item= c_sentinel2.get_item(id=first_item_id)
item_assets = item.get_assets(media_type=MediaType.ZARR)

cloud_storage = item_assets['product'].href

print('Item cloud storage URL for retrieval:',cloud_storage)
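
Conceptually, filtering by media type keeps only the assets whose declared type matches the requested one. The sketch below imitates that on a plain dictionary. The asset keys other than `product`, the URLs, and the media type strings are hypothetical placeholders, so treat it as an illustration of the idea rather than of `pystac` internals.

```python
# Assumed media type string for Zarr assets, used here for illustration only.
ZARR_TYPE = "application/vnd+zarr"

# Hypothetical assets: key -> (media type, href). Only 'product' appears in the tutorial.
assets = {
    "product": (ZARR_TYPE, "https://example.com/data/product.zarr"),
    "thumbnail": ("image/png", "https://example.com/data/thumbnail.png"),
    "metadata": ("application/json", "https://example.com/data/metadata.json"),
}


def filter_assets(assets, media_type):
    """Keep only the assets whose media type equals the requested one."""
    return {
        key: (asset_type, href)
        for key, (asset_type, href) in assets.items()
        if asset_type == media_type
    }


zarr_assets = filter_assets(assets, ZARR_TYPE)
_, product_href = zarr_assets["product"]
print(product_href)
```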

## Examining Dataset Structure

In the following step, we open the cloud-optimised Zarr dataset using `xarray.open_datatree`, supported by the `xarray-eopf` extension.

The subsequent loop then prints out all the available groups within the opened `DataTree`, providing a comprehensive overview of the `zarr`'s hierarchical structure.

In [8]:
dt = xr.open_datatree(
    cloud_storage,        # the cloud storage url from the Item we are interested in
    engine="eopf-zarr",   # xarray-eopf defined engine 
    op_mode="native",     # native mode: keeps the product's original group structure
    chunks={})            # default eopf chunking size

for dt_group in sorted(dt.groups):
    print("DataTree group {group_name}".format(group_name=dt_group)) # getting the available groups
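
Group paths behave like folder paths, so a flat list of them can be turned into an indented outline. The following sketch uses only the standard library. The list of paths is a hypothetical example modelled on the groups mentioned in this tutorial, not the actual output of the cell above.

```python
def outline(group_paths):
    """Return indented lines showing each level of the group paths once."""
    lines = []
    seen = set()
    for path in sorted(group_paths):
        parts = [part for part in path.split("/") if part]
        for depth in range(len(parts)):
            prefix = "/".join(parts[: depth + 1])
            if prefix not in seen:
                seen.add(prefix)
                lines.append("  " * depth + parts[depth])
    return lines


# Hypothetical group paths for illustration only.
example_groups = [
    "/",
    "/measurements/reflectance/r20m",
    "/measurements/reflectance/r10m",
    "/quality/l2a_quicklook/r20m",
]

for line in outline(example_groups):
    print(line)
```

Reading the outline top-down makes it easier to spot which resolution groups sit under the same parent.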

## Root Dataset Metadata


We specifically look for groups containing data variables under `/measurements/reflectance/r20m` (which corresponds to Sentinel-2 bands at 20m resolution). The output provides key information about the selected group, including its dimensions, available data variables (the different spectral bands), and coordinates.

In [9]:
# Get /measurements/reflectance/r20m group
groups = list(dt.groups)
interesting_groups = [
    group for group in groups if group.startswith('/measurements/reflectance/r20m')
    and dt[group].ds.data_vars
]
print(f"\n🔍 Searching for groups with data variables in '/measurements/reflectance/r20m'...")


In [10]:
if interesting_groups:
    sample_group = interesting_groups[0]
    group_ds = dt[sample_group].ds
    
    print(f"📊 Group '{sample_group}' Information")
    print("=" * 50)
    print(f"🔢 Dimensions: {dict(group_ds.sizes)}")
    print(f"📏 Data Variables: {list(group_ds.data_vars.keys())}")
    print(f"🗺️  Coordinates: {list(group_ds.coords.keys())}")

else:
    print("No groups with data variables found under '/measurements/reflectance/r20m'.")

This cell inspects the attributes of the root dataset within the DataTree. Attributes often contain important high-level metadata about the entire product, such as processing details, STAC discovery information, and more. We print the first few attributes to give an idea of the available metadata.


In [11]:
# Examine the root dataset
root_dataset = dt.ds

print("📊 Root Dataset Metadata")
print("=" * 40)

if root_dataset.attrs:
    print(f"\n📝 Attributes (first 3):")
    for key, value in list(root_dataset.attrs.items())[:3]:
        print(f"   {key}: {str(value)[:80]}{'...' if len(str(value)) > 80 else ''}")
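
Attributes are exposed as a dictionary, and some values may themselves be dictionaries. As a rough sketch, the helper below describes the first few top-level entries by type instead of printing truncated values. The attribute names and contents are hypothetical and do not reflect the actual metadata of the product.

```python
def summarise_attrs(attrs, max_keys=3):
    """Describe the first few top-level attributes without printing full values."""
    summary = {}
    for key, value in list(attrs.items())[:max_keys]:
        if isinstance(value, dict):
            # For nested dictionaries, list the child keys instead of the content.
            summary[key] = f"dict with keys {sorted(value)}"
        else:
            summary[key] = type(value).__name__
    return summary


# Hypothetical attributes for illustration only.
example_attrs = {
    "stac_discovery": {"id": "example-item", "bbox": [11.1, 47.3, 11.5, 47.5]},
    "other_metadata": {"processing": "example"},
    "title": "Example product",
    "history": "not shown because of max_keys",
}

for key, description in summarise_attrs(example_attrs).items():
    print(f"{key}: {description}")
```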

## Visualising the RGB quicklook composite

EOPF `zarr` assets include an RGB quicklook composite. <br>

We open the Zarr dataset again, but this time we specifically target the `quality/l2a_quicklook/r20m` group and its variables.<br>
This group typically contains a true-colour (RGB) quicklook composite, which is a readily viewable representation of the satellite image. 

We use `xr.open_dataset` with the `variables` parameter provided by `xarray-eopf` to load only the relevant data for the quicklook.

In [12]:
## Visualising the RGB quicklook composite:
ds = xr.open_dataset(
    cloud_storage,        # the cloud storage url from the Item we are interested in
    engine="eopf-zarr",   # xarray-eopf defined engine 
    op_mode="native",     # native mode: keeps the product's original group structure
    chunks={},            # default eopf chunking size
    group_sep="/",
    variables="quality/l2a_quicklook/r20m/*",
)

We can now plot the RGB composite:

In [13]:
ds["quality/l2a_quicklook/r20m/tci"].plot.imshow()
plt.title('RGB Quicklook')
plt.xlabel('X-coordinate')
plt.ylabel('Y-coordinate')
plt.grid(False) # Turn off grid for image plots
plt.axis('tight') # Ensure axes fit the data tightly

## Simple Data Analysis: Calculating NDVI

This section demonstrates a common Earth Observation data analysis technique: calculating the Normalised Difference Vegetation Index (NDVI).<br>

First, we open the Zarr dataset specifically for the **red** (B04) and **Near-Infrared** (B08) bands, which are crucial for NDVI calculation. We also specify `resolution=20` to ensure we are working with the 20-meter resolution bands.

In [14]:
red_nir = xr.open_dataset(
    cloud_storage,
    engine="eopf-zarr",
    chunks={},
    spline_orders=0,
    variables=['b04', 'b08'],
    resolution= 20
)

Here, we cast the red (B04) and Near-Infrared (B08) bands to floating-point numbers. This is important for accurate mathematical operations, especially division, in the subsequent **NDVI** calculation.

In [15]:
red_f = red_nir.b04.astype(float)
nir_f = red_nir.b08.astype(float)
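
To illustrate why the cast matters, the sketch below assumes the bands are stored as unsigned 16-bit integers. This is a common encoding for reflectance, but it may not match what `xarray-eopf` returns for this product. With such integers a negative difference wraps around instead of going below zero, while floats behave as expected. The pixel values are made up.

```python
import numpy as np

# Made-up pixel values; in the first pixel red is brighter than NIR.
red = np.array([200, 1500], dtype=np.uint16)
nir = np.array([100, 3000], dtype=np.uint16)

# Unsigned integers cannot hold negative numbers, so 100 - 200 wraps around.
diff_int = nir - red

# After casting to float the difference is computed correctly.
diff_float = nir.astype(float) - red.astype(float)

print(diff_int)
print(diff_float)
```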

Now, we compute **NDVI** in three steps:
- `sum_bands`: Calculates the sum of the Near-Infrared and Red bands.
- `diff_bands`: Calculates the difference between the Near-Infrared and Red bands.
- `ndvi`: Divides `diff_bands` by `sum_bands`.
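
For reference, the standard definition of the index, with $\mathrm{NIR}$ and $\mathrm{Red}$ denoting the reflectance in B08 and B04 as used in this tutorial, is:

$$
\mathrm{NDVI} = \frac{\mathrm{NIR} - \mathrm{Red}}{\mathrm{NIR} + \mathrm{Red}}
$$

For non-negative reflectances the result lies between $-1$ and $1$, and it is undefined where $\mathrm{NIR} + \mathrm{Red} = 0$.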

In [16]:
sum_bands = nir_f + red_f
diff_bands = nir_f - red_f
ndvi = diff_bands / sum_bands

In pixels where both the red and NIR bands are zero (e.g., water bodies or clouds), the sum is zero and the division above yields **NaN** rather than a valid index. The following line sets NDVI to 0 wherever the sum of the bands is zero, which gives a clean and robust NDVI product.

In [17]:
ndvi = ndvi.where(sum_bands != 0, 0)
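
The same masking idea can be sketched with plain NumPy arrays. The values are made up, and `np.where` stands in for the `xarray` `.where` call used above. The second pixel is zero in both bands, so the raw division gives NaN there.

```python
import numpy as np

# Made-up reflectance values; the second pixel is zero in both bands.
red = np.array([0.05, 0.0, 0.20])
nir = np.array([0.45, 0.0, 0.20])

sum_bands = nir + red
diff_bands = nir - red

# 0 / 0 yields NaN; silence the runtime warning for this demonstration.
with np.errstate(divide="ignore", invalid="ignore"):
    raw_ndvi = diff_bands / sum_bands

# Keep the ratio where the sum is non-zero, otherwise use 0.
ndvi = np.where(sum_bands != 0, raw_ndvi, 0.0)

print(raw_ndvi)
print(ndvi)
```

Replacing the undefined pixels with 0 is one convention among several; keeping them as NaN is an equally valid choice when missing data should stay visible in the plot.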

Finally, this cell visualises the calculated **NDVI**.

In [18]:
ndvi.plot(cmap='RdYlGn', vmin=-1, vmax=1)
plt.title('Normalized Difference Vegetation Index (NDVI)')
plt.xlabel('X-coordinate')
plt.ylabel('Y-coordinate')
plt.grid(False) # Turn off grid for image plots
plt.axis('tight') # Ensure axes fit the data tightly

# Display the plot
plt.show()

## 💪 Now it is your turn

With the foundations laid, you're now equipped to dive deeper into Earth Observation data. These are your tasks:
#### **1. Explore 5 additional Sentinel-2 Items for Innsbruck**
Reproduce the RGB quicklook for each item and compare the results to get an overview of the spatial changes.<br>

#### **2. Investigating NDVI**:
Replicate the NDVI calculation for the additional Innsbruck items.

#### **3. Applying more advanced analysis techniques**: 
The EOPF STAC Catalog offers a wealth of data beyond Sentinel-2. Try searching for and analysing data from different sensors or products.<br>

With `xarray` and its rich ecosystem, you can perform more sophisticated operations like time-series analysis, anomaly detection, or machine learning applications on these cloud-optimised datasets.

<hr>

## Conclusion

In this chapter, we walked through the process of accessing and analysing Earth Observation data, starting with a connection to the [EOPF Sentinel Zarr Sample Service STAC Catalog](https://stac.browser.user.eopf.eodc.eu/?.language=en). This initial step provides access to a comprehensive catalog of Earth Observation data, an essential resource for discovering available satellite imagery. <br>

A key step was identifying and locating cloud-optimised Zarr assets, which is fundamental to efficient data access in contemporary Earth Observation workflows. We then opened these hierarchical satellite datasets with `xarray`'s `DataTree`, a tool designed for managing complex data structures, and used them to plot an RGB quicklook and compute a simple NDVI.


<hr>

## What's next?

In the following tutorial, we will explore how to open a series of items to perform a time series analysis.
We will continue...