<a href="https://jupyterhub.user.eopf.eodc.eu/hub/user-redirect/git-pull?repo=https://github.com/eopf-toolkit/eopf-101&branch=main&urlpath=lab/tree/eopf-101/65_create_overviews.ipynb" target="_blank">
  <button style="background-color:#0072ce; color:white; padding:0.6em 1.2em; font-size:1rem; border:none; border-radius:6px; margin-top:1em;">
    🚀 Launch this notebook in JupyterLab
  </button>
</a>

### Introduction

In this notebook, we will demonstrate how to create **overviews** (also called multiscale pyramids) for large Earth Observation datasets stored in Zarr format. Overviews are downscaled representations of gridded data that optimize visualization and enable scalable access to large datasets.

### What we will learn

- 🗂️ How to compute multiscale overview levels from high-resolution satellite data
- 📊 How to attach GeoZarr-compliant metadata to datasets
- 💾 How to write overview pyramids to Zarr storage

### Prerequisites

This notebook uses:
- **Dataset**: Sentinel-2 L2A reflectance data from EODC object storage
- **Resolution**: 10m spatial resolution (10980 × 10980 pixels)
- **Bands**: Blue (b02), Green (b03), Red (b04), NIR (b08)

The workflow follows a **compute-then-write pattern** that separates in-memory computation from disk persistence, allowing validation before committing changes.

---

#### Import libraries

In [None]:
import os
import warnings
import xarray as xr
import json
from pathlib import Path
import zarr
import dask
warnings.filterwarnings("ignore")

## Copy Remote Dataset to Local Storage

Before we can add overviews, we need to copy the entire remote Zarr dataset to local storage. This creates a local copy that we can modify by adding overview levels and metadata.

**Why copy the entire dataset:**
- The remote dataset is read-only (object storage), so we need a writable local copy to add new groups (L1 to L7, one per scale factor used below)
- This preserves the complete original structure


To make sure that we use a convenient scene, we select a source URL from the catalogue.

In [None]:
# remote_url = "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:202506-s02msil2a/10/products/cpm_v256/S2C_MSIL2A_20250610T103641_N0511_R008_T32UMD_20250610T132001.zarr"
remote_url = "https://objects.eodc.eu/e05ab01a9d56408d82ac32d69a5aae2a:202508-s02msil2a/31/products/cpm_v256/S2A_MSIL2A_20250831T135741_N0511_R010_T26WPD_20250831T185012.zarr"
local_zarr_path = "output/S2A_MSIL2A_20250831T135741_N0511_R010_T26WPD_20250831T185012.zarr"

As a first step, we download the remote Zarr dataset, split it into smaller 512-pixel chunks, and save it as a local Zarr copy ready for exploration.
Chunking helps large datasets load faster, especially in visualisation tools that read data in small spatial tiles.

In [None]:
print(f"Copying remote Zarr to {local_zarr_path}... (may take several minutes)")
os.makedirs("output", exist_ok=True)
s2l2a_remote = xr.open_datatree(remote_url, engine="zarr")
# Rechunk all datasets in the datatree
for node in s2l2a_remote.subtree:
    if node.has_data:
        node.ds = node.ds.chunk({dim: 512 for dim in node.ds.dims})
s2l2a_remote.to_zarr(
    local_zarr_path,
    mode="w",
    zarr_version=2,
    consolidated=False,
    align_chunks=True,
)
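
To get a feel for what 512-pixel chunking means for the 10 m arrays, the following sketch counts the chunks needed to cover one band. It assumes the 10980 × 10980 pixel size stated in the prerequisites; `chunks_along_axis` is a hypothetical helper written for this illustration, not part of xarray or zarr.

```python
import math

# Assumed values taken from this notebook: a 10980-pixel axis and 512-pixel chunks.
AXIS_SIZE = 10980
CHUNK_SIZE = 512

def chunks_along_axis(axis_size: int, chunk_size: int) -> int:
    """Number of chunks needed to cover one axis (the last chunk may be partial)."""
    return math.ceil(axis_size / chunk_size)

chunks_per_axis = chunks_along_axis(AXIS_SIZE, CHUNK_SIZE)
# Width of the final, partial chunk along the axis.
last_chunk_size = AXIS_SIZE - (chunks_per_axis - 1) * CHUNK_SIZE
chunks_per_band = chunks_per_axis ** 2

print(f"{chunks_per_axis} chunks per axis, last one {last_chunk_size} pixels wide")
print(f"{chunks_per_band} chunks per 10 m band")
```

Because 10980 is not a multiple of 512, the last chunk along each axis is narrower than the others.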

Now we open the local copy of the dataset and look inside the group that contains the 10-metre reflectance data to understand which variables, dimensions, and coordinates it contains.

In [None]:
# Load the local dataset and inspect its structure
variable_group_path = "measurements/reflectance/r10m"
dataset = xr.open_dataset(f"{local_zarr_path}/{variable_group_path}", engine="zarr")
dataset

## Compute Overviews (In-Memory)

Now we compute the overview levels **in memory only** - no data is written to disk at this stage.

We extract the reflectance group at 10m resolution and automatically discover variables and dimensions.

**Key operations:**
- Reuse the 10 m reflectance group opened above as an xarray Dataset
- Automatically discover all variables and dimensions
- Identify spatial coordinate names (x, y)


In this step, we identify the spatial dimensions and variables in the dataset and define the scale levels that will be used to generate lower-resolution overviews.

In [None]:
scales = [2, 4, 8, 16, 32, 64, 128]  # Scale factors for each level
variables = [var for var in dataset.data_vars]  # Discover variables
spatial_dims = [dim for dim in dataset.dims]  # Discover dimensions
x_dim = next((d for d in spatial_dims if d in ['x', 'X', 'lon', 'longitude']), 'x')  # Identify x dimension
y_dim = next((d for d in spatial_dims if d in ['y', 'Y', 'lat', 'latitude']), 'y')  # Identify y dimension
print(f"Variables: {variables} | Dims: {spatial_dims} | Shape: {dataset[variables[0]].shape} | Using: x_dim='{x_dim}', y_dim='{y_dim}'\n")

Now we generate a series of lower-resolution overview datasets directly in memory.

For each scale factor, we use xarray's `coarsen()` method to average groups of pixels along the spatial dimensions (x, y). With `boundary="trim"`, pixels at the edge that do not fill a complete block are dropped. Each coarsened version is stored under a level name such as L1 or L2, representing progressively coarser spatial resolutions.

In [None]:
overviews = {}  # Generate in-memory overview datasets
for i, factor in enumerate(scales):
    level_id = f"L{i+1}"
    coarsened = dataset.coarsen({x_dim: factor, y_dim: factor}, boundary="trim").mean()
    overviews[level_id] = coarsened[variables]

print(f"Created {len(overviews)} overview levels:")
for level_id, level_ds in overviews.items():
    print(f"  {level_id}: shape {level_ds[variables[0]].shape}, dims {dict(level_ds.sizes)}")
print("\nOverview datasets created successfully (in memory only, not written to disk)")
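
The block averaging performed in the cell above can be illustrated on a tiny grid. This is a plain-Python sketch of the idea only, not xarray's implementation; `block_mean` is a hypothetical helper, and the 5 × 5 grid is made-up data chosen so that the trimming of incomplete edge blocks is visible.

```python
def block_mean(grid, factor):
    """Average non-overlapping factor x factor blocks, dropping incomplete edge blocks."""
    n_rows = len(grid) // factor      # complete blocks along y
    n_cols = len(grid[0]) // factor   # complete blocks along x
    out = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            # Collect the factor x factor pixels that belong to this output cell.
            block = [
                grid[r * factor + dy][c * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            ]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

# Illustrative 5 x 5 grid: the last row and column do not fill a 2 x 2 block.
grid = [
    [1, 2, 3, 4, 0],
    [5, 6, 7, 8, 0],
    [9, 10, 11, 12, 0],
    [13, 14, 15, 16, 0],
    [0, 0, 0, 0, 0],
]
coarse = block_mean(grid, 2)
print(coarse)  # [[3.5, 5.5], [11.5, 13.5]]
```

The fifth row and column are discarded, so the 5 × 5 input becomes a 2 × 2 output.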

## Attach Multiscales Metadata

With the overviews computed, we now attach **GeoZarr-compliant metadata** to the dataset. This metadata describes:

- **Version**: Schema version ("1.0")
- **Resampling method**: How data was aggregated ("average")
- **Variables**: Which bands have overviews
- **Layout**: The complete hierarchy including L0 (base) and all derived levels

The metadata is stored in `dataset.attrs["multiscales"]` following the GeoZarr Overviews specification. This ensures interoperability with GeoZarr-aware tools and libraries.

Here we prepare the information needed to describe the overview hierarchy in the GeoZarr metadata. We set `overview_path` to indicate where the overview groups will be stored, record the `resampling_method` ("average") used to create them, and compute the base spatial resolutions (`x_res` and `y_res`) from the coordinate spacing.

In [None]:
overview_path = "overviews"  # Where overviews are written ("." for direct children)
resampling_method = "average"
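# Use the discovered dimension names so this also works when they are not literally 'x' and 'y'
x_res = abs(float(dataset[x_dim].values[1] - dataset[x_dim].values[0]))
y_res = abs(float(dataset[y_dim].values[1] - dataset[y_dim].values[0]))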

Now, we build the multiscales layout metadata that describes how all overview levels relate to the base dataset.

The first entry (L0) represents the original data, including its spatial resolution (cell_size). Each subsequent level (L1, L2, …) is added to the layout with information about its path, the level it was derived from, the scale factors applied, the resampling method, and its corresponding cell size.

Finally, this complete structure is stored in `dataset.attrs["multiscales"]` following the GeoZarr Overviews specification (draft). The printed JSON summary shows the final metadata layout that GeoZarr-aware tools can use to identify and navigate between resolution levels.

In [None]:
layout = [{"id": "L0", "path": ".", "cell_size": [x_res, y_res]}]  # Base level (native data at current group)
for i, factor in enumerate(scales):
    level_id = f"L{i+1}"
    level_path = level_id if overview_path == "." else f"{overview_path}/{level_id}"
    # Cell size at this level = base resolution multiplied by the scale factor
    level_cell_size = [x_res * factor, y_res * factor]
    layout.append({"id": level_id, "path": level_path, "derived_from": "L0" if i == 0 else f"L{i}", "factors": [factor, factor], "resampling_method": resampling_method, "cell_size": level_cell_size})
dataset.attrs["multiscales"] = {"version": "1.0", "resampling_method": resampling_method, "variables": variables, "layout": layout}
print("Metadata structure:")
print(json.dumps(dataset.attrs["multiscales"], indent=2))
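
Before writing anything to disk, it can be useful to sanity-check the layout. The sketch below is one possible check, assuming a layout list shaped like the one built above. `check_layout` is a hypothetical helper, and the rules it enforces (unique ids, parents listed before children, two-element `cell_size`) are assumptions chosen for this illustration rather than requirements quoted from the GeoZarr specification.

```python
def check_layout(layout):
    """Return a list of problems found in a multiscales layout (empty if none)."""
    problems = []
    seen_ids = set()
    for entry in layout:
        level_id = entry.get("id")
        if level_id in seen_ids:
            problems.append(f"duplicate id: {level_id}")
        parent = entry.get("derived_from")
        # Every derived level must point at a level listed earlier in the layout.
        if parent is not None and parent not in seen_ids:
            problems.append(f"{level_id}: unknown parent {parent}")
        if len(entry.get("cell_size", [])) != 2:
            problems.append(f"{level_id}: cell_size must have two values")
        seen_ids.add(level_id)
    return problems

# Small sample layout in the same shape as the notebook's metadata (values are illustrative).
sample_layout = [
    {"id": "L0", "path": ".", "cell_size": [10.0, 10.0]},
    {"id": "L1", "path": "overviews/L1", "derived_from": "L0",
     "factors": [2, 2], "resampling_method": "average", "cell_size": [20.0, 20.0]},
    {"id": "L2", "path": "overviews/L2", "derived_from": "L1",
     "factors": [4, 4], "resampling_method": "average", "cell_size": [40.0, 40.0]},
]
print(check_layout(sample_layout))  # []
```

In the notebook, the same helper could be called as `check_layout(dataset.attrs["multiscales"]["layout"])`.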



## Write Overviews to Local Zarr Store

Now we add the computed overviews to the local Zarr store without duplicating the native data.

**Overview path options:**
- `overview_path="."` - Write overviews as direct children (L1, L2, L3, ...)
- `overview_path="overviews"` - Write overviews in a subfolder (overviews/L1, overviews/L2, ...)

**Write operations:**
1. **Write L1-L7** - Add one overview level per scale factor as subgroups
2. **Add metadata** - Update group attributes with multiscales metadata

**Key point:** Native data stays at the group level. The multiscales metadata uses `path: "."` for L0 to reference the existing native data without duplication.

**Layout with `overview_path="."`:**
```
measurements/reflectance/r10m/
├── b02, b03, b04, b08  # Native data (L0 via path=".")
├── x, y                # Coordinates
├── L1/                 # Overview levels (direct children)
├── L2/
├── L3/
├── L4/
├── L5/
├── L6/
├── L7/
└── .zattrs             # multiscales metadata
```

**Layout with `overview_path="overviews"` (the setting used in this notebook):**
```
measurements/reflectance/r10m/
├── b02, b03, b04, b08  # Native data (L0 via path=".")
├── x, y                # Coordinates
├── overviews/          # Overview levels in subfolder
│   ├── L1/
│   ├── L2/
│   ├── L3/
│   ├── L4/
│   ├── L5/
│   ├── L6/
│   └── L7/
└── .zattrs             # multiscales metadata
```

In [None]:

target_group_path = os.path.join(local_zarr_path, variable_group_path)
print(f"Adding overviews to {target_group_path} | Variables: {variables} | Path: '{overview_path}'\n")

base_path = Path(target_group_path) / overview_path
print(f"Creating Zarr group: {overview_path}/")
zarr.open_group(str(base_path.absolute()), mode='a', zarr_version=2)  # Create proper Zarr group

print(f"Writing {len(overviews)} overview levels...")
for level_id, level_dataset in overviews.items():
    level_dataset.to_zarr(str((base_path / level_id).absolute()), mode="a", zarr_version=2)

coords_only = xr.Dataset(coords=dataset.coords, attrs=dataset.attrs)  # Metadata update (no data vars)
coords_only.to_zarr(str(Path(target_group_path).absolute()), mode="a", zarr_version=2)

print(f"Generating consolidated metadata for {overview_path}/")
zarr.consolidate_metadata(str(base_path.absolute()))  # Create .zmetadata

print(f"\nSuccessfully added overviews to {target_group_path}\n")
print(f"Final structure:\n  {variable_group_path}/")
print(f"    ├── {', '.join(variables)} (native data - L0 via path '.')")
print(f"    ├── {x_dim}, {y_dim} (coordinates)")
if overview_path == ".":
    for level_id in overviews.keys(): print(f"    ├── {level_id}/")
else:
    print(f"    ├── {overview_path}/")
    for level_id in overviews.keys(): print(f"    │   ├── {level_id}/")
print(f"    └── .zattrs (with multiscales metadata)")
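
As a quick check after writing, the metadata can be read back directly from disk. The sketch below assumes a local directory store in Zarr v2 format (as written above with `zarr_version=2`), where group attributes live in a `.zattrs` JSON file. `read_multiscales` and `missing_levels` are hypothetical helpers, and the temporary directory only mimics a group so that the example runs on its own.

```python
import json
import tempfile
from pathlib import Path

def read_multiscales(group_dir):
    """Load the multiscales entry from a Zarr v2 group's .zattrs file."""
    attrs = json.loads((Path(group_dir) / ".zattrs").read_text())
    return attrs["multiscales"]

def missing_levels(group_dir):
    """List layout paths that do not exist as directories under the group."""
    group = Path(group_dir)
    layout = read_multiscales(group)["layout"]
    return [e["path"] for e in layout if not (group / e["path"]).is_dir()]

# Stand-in group: L1 exists on disk, L2 is declared in the metadata but never written.
with tempfile.TemporaryDirectory() as tmp:
    group = Path(tmp)
    (group / "overviews" / "L1").mkdir(parents=True)
    meta = {"multiscales": {"layout": [
        {"id": "L0", "path": "."},
        {"id": "L1", "path": "overviews/L1"},
        {"id": "L2", "path": "overviews/L2"},
    ]}}
    (group / ".zattrs").write_text(json.dumps(meta))
    demo_missing = missing_levels(group)

print(demo_missing)  # ['overviews/L2']
```

For the store written in this notebook, the call would be `missing_levels(Path(local_zarr_path) / variable_group_path)`; an empty list means every level declared in the metadata has a matching directory.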


## 💪 Now it is your turn

With everything we have learnt so far, you are now able to create multiscale overviews for your own datasets.

### Task 1: Experiment with Different Scale Factors

Try modifying the `scales` list to create different pyramid structures. For example:
- **Fewer levels**: `scales = [2, 4, 8]` for a smaller pyramid
- **More aggressive downsampling**: `scales = [4, 16, 64]` for rapid zoom levels
- **Fine-grained levels**: `scales = [2, 3, 4, 6, 8]` for smoother transitions
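
Before recomputing anything, the effect of a candidate list can be estimated with simple arithmetic. The sketch below assumes the 10980-pixel axis stated in the prerequisites and the `boundary="trim"` behaviour used earlier; `level_sizes` is a hypothetical helper for this illustration.

```python
def level_sizes(axis_size, scales):
    """For each factor, return (factor, coarsened size, pixels dropped by trimming)."""
    return [(f, axis_size // f, axis_size % f) for f in scales]

AXIS_SIZE = 10980  # pixels per axis at 10 m, as stated in the prerequisites

candidates = {
    "fewer levels": [2, 4, 8],
    "aggressive": [4, 16, 64],
    "fine-grained": [2, 3, 4, 6, 8],
}
for name, candidate in candidates.items():
    print(name)
    for factor, size, dropped in level_sizes(AXIS_SIZE, candidate):
        print(f"  factor {factor:>3}: {size} pixels per axis, {dropped} trimmed")
```

Factors that do not divide 10980 evenly (8, 16 and 64 in these lists) lose a few edge pixels per axis when trimmed.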

### Task 2: Apply to Your Own Dataset

Use this notebook as a template for your own Earth Observation data:
1. Replace the URL with your own Zarr dataset
2. Let the code discover variables and dimensions automatically
3. Adjust scale factors based on your data resolution
4. Validate and write the results

---

## Conclusion

This tutorial demonstrated the complete workflow for creating GeoZarr-compliant multiscale overviews:

1. ✅ Load and discover dataset structure automatically
2. ✅ Compute overview levels in memory (no disk I/O)
3. ✅ Attach specification-compliant metadata
4. ✅ Write to Zarr storage

**Key takeaways:**
- The **compute-then-write pattern** separates computation from I/O
- **Dynamic discovery** makes code adaptable to different datasets

## What's next?

- **Visualization**: Use the created overviews with mapping libraries (Leaflet, Mapbox) for progressive rendering
- **Performance**: Experiment with different chunking strategies for optimal I/O performance