# Enhanced GeoZarr for EOPF: Full Multiscale Zarr V3 Store with RGB Visualization and Performance Analysis

This enhanced notebook demonstrates how to transform EOPF Zarr stores into complete GeoZarr V3 compliant datasets with:

- **Full EOPF structure preservation**: All resolution groups and variables
- **Complete multiscale support**: COG-style overviews for all bands
- **Enhanced visualization**: RGB composite plots using overview levels
- **Performance analysis**: Overview processing time and resolution comparison
- **Modular code organization**: Most functionality lives in the `geozarr_examples.cog_multiscales` helper module

Following COG conventions, the overviews keep the native projection and halve the resolution at each level (/2 downsampling).
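
As a rough illustration of the /2 logic, the sketch below halves an array shape until the next level would fall below a minimum size. The function name `overview_shapes` and the exact stopping rule are assumptions made for illustration; the helper module may round or stop differently. The 10980-pixel edge is the size of a Sentinel-2 10 m tile.

```python
def overview_shapes(height, width, min_dimension=256):
    """Return [(level, height, width), ...] for a /2 overview pyramid.

    Level 0 is the native shape. Each further level floor-divides both
    dimensions by 2; generation stops before either dimension would drop
    below min_dimension. (Stopping rule assumed for illustration.)
    """
    shapes = [(0, height, width)]
    level = 0
    while min(height // 2, width // 2) >= min_dimension:
        height, width = height // 2, width // 2
        level += 1
        shapes.append((level, height, width))
    return shapes


shapes = overview_shapes(10980, 10980, min_dimension=256)
for level, h, w in shapes:
    print(f"level {level}: {h} x {w} (scale 1:{2 ** level})")
```

Under these assumptions a 10 m band yields levels 0 through 5, which is consistent with the six levels plotted later in this notebook.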

## Setup and Data Loading

First, we import the required libraries and set up the environment. The `ZARR_V3_EXPERIMENTAL_API` environment variable is set before `zarr` is imported so that the experimental Zarr V3 API is enabled.

In [None]:
import os
os.environ["ZARR_V3_EXPERIMENTAL_API"] = "1"

import json
import cf_xarray  # noqa
import dask.array as da
import matplotlib.pyplot as plt
import morecantile
import numpy as np
import panel
import rasterio
import numcodecs
import rioxarray  # noqa
import xarray as xr
import zarr
import dask
import time
from rio_tiler.io.xarray import XarrayReader

from geozarr_examples.cog_multiscales import (
    setup_eopf_metadata,
    create_full_eopf_zarr_store,
    plot_rgb_overview,
    get_sentinel2_rgb_bands,
    verify_overview_coordinates,
    plot_overview_levels
)

### Configuration

Set up paths and parameters for our data processing:
- `spatial_chunk`: Size of spatial chunks for efficient data access
- `min_dimension`: Minimum size for overview levels
- `tileWidth`: Base tile width for TMS compatibility
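
To get a feel for what `spatial_chunk` implies, the following sketch counts how many chunks cover an array along each axis. `chunk_grid` is a hypothetical helper, not part of the notebook's module, and the shapes are assumed sizes for a Sentinel-2 10 m band and one of its small overviews.

```python
import math


def chunk_grid(shape, chunk):
    """Number of chunks needed along each axis for a given chunk edge."""
    return tuple(math.ceil(n / chunk) for n in shape)


native = chunk_grid((10980, 10980), 4096)   # native 10 m band
coarse = chunk_grid((343, 343), 4096)       # a small overview level
print(native, coarse)
```

With a 4096-pixel chunk the native band splits into a 3 x 3 grid, while small overview levels fit in a single chunk.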

In [None]:
# Set up paths and parameters
fp_base = "S2B_MSIL1C_20250113T103309_N0511_R108_T32TLQ_20250113T122458"
input_url = f"https://objectstore.eodc.eu:2222/e05ab01a9d56408d82ac32d69a5aae2a:sample-data/tutorial_data/cpm_v253/{fp_base}.zarr"
v3_output = f"../output/v3/{fp_base}_full_multiscales.zarr"

spatial_chunk = 4096  # Size of spatial chunks
min_dimension = 256   # Minimum dimension for overviews
tileWidth = 256      # Base tile width

### Setting up Dask Client

Initialize a Dask client for parallel processing capabilities. This will help with processing large datasets efficiently.

In [None]:
from dask.distributed import Client
client = Client()  # set up local cluster
client

### Loading the EOPF Data

Load the Earth Observation Processing Framework (EOPF) data from the remote Zarr store. This data follows the EOPF structure with multiple resolution groups.

In [None]:
# Load the EOPF DataTree
dt = xr.open_datatree(input_url, engine="zarr", chunks={})
print("EOPF DataTree structure:")
print(dt)

## Creating the Enhanced GeoZarr Store

Now we'll create a full EOPF Zarr store with multiscales. This process:
1. Preserves all resolution groups (r10m, r20m, r60m)
2. Sets up proper CF metadata and CRS information
3. Creates COG-style overview levels for all bands
4. Collects timing data for performance analysis

The `create_full_eopf_zarr_store` function handles this transformation while maintaining the original EOPF structure.
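
The performance cells later in this notebook read one timing record per overview level, with the keys `level`, `scale_factor`, `pixels`, and `time`. How the helper module produces these records is not shown here. A minimal sketch, assuming a hypothetical `timed_level` wrapper around whatever callable builds a level, might look like this:

```python
import time


def timed_level(level, shape, build):
    """Run build() and return a timing record for one overview level.

    `build` is any zero-argument callable that creates the level; the
    record layout mirrors the keys read by the analysis cell.
    """
    start = time.perf_counter()
    build()
    elapsed = time.perf_counter() - start
    height, width = shape
    return {
        "level": level,
        "scale_factor": 2 ** level,   # assumed: /2 per level
        "pixels": height * width,
        "time": elapsed,
    }


# Stand-in workload: summing a range instead of writing an overview.
record = timed_level(2, (2745, 2745), lambda: sum(range(100_000)))
print(record)
```

`time.perf_counter()` is used here because it is monotonic and intended for measuring short intervals.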

In [None]:
# Create the full EOPF Zarr store with multiscales
print("Creating full EOPF Zarr store with multiscales...")
print("This will process all resolution groups and create overview levels for all bands.")
print("This may take several minutes depending on data size and number of bands.")

try:
    # Create store and proceed with timing analysis
    result = create_full_eopf_zarr_store(
        dt=dt,
        output_path=v3_output,
        spatial_chunk=spatial_chunk,
        min_dimension=min_dimension,
        tileWidth=tileWidth,
        load_data=False,  # Use lazy loading for large datasets
        max_retries=3     # Retry failed operations
    )
    print("\n✅ Full EOPF Zarr store created successfully!")
    
except Exception as e:
    print(f"\n❌ Failed: {e}")
    raise

## Overview Creation Performance Analysis

Let's analyze the timing data for overview creation across different resolution groups and bands:

In [None]:
# Analyze timing data from overview creation
for group_name, group_overviews in result['overview_levels'].items():
    print(f"\n=== Performance Analysis for {group_name} ===\n")
    
    for var, var_data in group_overviews.items():
        if 'timing' in var_data:
            timing_data = var_data['timing']
            print(f"\nAnalyzing {var}:")
            print(f"{'Level':>6} {'Scale':>8} {'Pixels':>12} {'Time (s)':>10} {'MP/s':>10}")
            print("-" * 50)
            
            for timing in timing_data:
                level = timing['level']
                pixels = timing['pixels']
                proc_time = timing['time']
                megapixels = pixels / 1e6
                mp_per_second = megapixels / proc_time if proc_time > 0 else 0
                
                print(f"{level:6d} {timing['scale_factor']:8d} {pixels:12,d} {proc_time:10.2f} {mp_per_second:10.2f}")
            
            # Create a performance plot for this variable
            plt.figure(figsize=(10, 6))
            levels = [t['level'] for t in timing_data]
            times = [t['time'] for t in timing_data]
            pixels = [t['pixels'] for t in timing_data]
            
            # Plot processing time
            ax1 = plt.gca()
            line1 = ax1.plot(levels, times, 'b-o', label='Processing Time')
            ax1.set_xlabel('Overview Level')
            ax1.set_ylabel('Processing Time (s)', color='b')
            ax1.tick_params(axis='y', labelcolor='b')
            
            # Plot pixel count on secondary y-axis
            ax2 = ax1.twinx()
            line2 = ax2.plot(levels, pixels, 'r-s', label='Pixel Count')
            ax2.set_ylabel('Pixel Count', color='r')
            ax2.tick_params(axis='y', labelcolor='r')
            
            # Add legend
            lines = line1 + line2
            labels = [l.get_label() for l in lines]
            ax1.legend(lines, labels, loc='upper right')
            
            plt.title(f'Overview Creation Performance: {group_name}/{var}')
            plt.grid(True, alpha=0.3)
            plt.tight_layout()
            plt.show()

## Performance Analysis and Overview Comparison

Now we'll analyze how different overview levels affect:
1. Processing time
2. Memory usage (through pixel count)
3. Visual quality

This analysis helps understand the trade-offs between resolution and performance, guiding the choice of appropriate overview levels for different use cases.

In [None]:
# Find the best resolution group for RGB visualization
rgb_group = None
for preferred_group in ['/measurements/reflectance/r10m', '/measurements/reflectance/r20m', '/measurements/reflectance/r60m']:
    if preferred_group in result['processed_groups']:
        available_bands = list(result['processed_groups'][preferred_group].data_vars)
        red_band, green_band, blue_band = get_sentinel2_rgb_bands(preferred_group)
        
        if all(band in available_bands for band in [red_band, green_band, blue_band]):
            rgb_group = preferred_group
            break

if rgb_group:
    print(f"Using {rgb_group} for RGB visualization")
    red_band, green_band, blue_band = get_sentinel2_rgb_bands(rgb_group)
    print(f"RGB bands: R={red_band}, G={green_band}, B={blue_band}")

    # Create figure for overview comparison and timing analysis:
    # two rows of RGB panels plus a third row for the timing plot
    fig = plt.figure(figsize=(20, 15))
    gs = plt.GridSpec(3, 3)
    
    # Plot overview levels 0-5 with processing times
    timing_data = []
    for i in range(6):  # 0 through 5
        ax = plt.subplot(gs[i//3, i%3])
        
        start_time = time.time()
        if i == 0:
            # Native resolution
            ds = xr.open_zarr(f"{v3_output}/{rgb_group}")
            red_data = ds[red_band].values
            green_data = ds[green_band].values
            blue_data = ds[blue_band].values
        else:
            # Overview level
            ds = xr.open_zarr(f"{v3_output}/{rgb_group}/{red_band}_overviews", group=str(i))
            red_data = ds[red_band].values
            ds = xr.open_zarr(f"{v3_output}/{rgb_group}/{green_band}_overviews", group=str(i))
            green_data = ds[green_band].values
            ds = xr.open_zarr(f"{v3_output}/{rgb_group}/{blue_band}_overviews", group=str(i))
            blue_data = ds[blue_band].values
        
        # Create RGB array and measure performance
        rgb_array = np.stack([red_data, green_data, blue_data], axis=-1)
        proc_time = time.time() - start_time
        pixel_count = rgb_array.shape[0] * rgb_array.shape[1]
        
        # Apply contrast stretching for better visualization.
        # Work in float so the 0-1 result is not truncated for integer inputs.
        rgb_stretched = np.zeros(rgb_array.shape, dtype=np.float32)
        for j in range(3):
            band = rgb_array[:, :, j].astype(np.float32)
            p_low, p_high = np.percentile(band[~np.isnan(band)], (2, 98))
            rgb_stretched[:, :, j] = np.clip((band - p_low) / (p_high - p_low), 0, 1)
        
        # Plot with performance metrics
        ax.imshow(rgb_stretched)
        scale = 2**i  # level 0 is native (1:1)
        ax.set_title(f'Overview Level {i} (1:{scale})\nTime: {proc_time:.2f}s\nPixels: {pixel_count:,}')
        ax.axis('off')
        
        # Store timing data for analysis
        timing_data.append({
            'level': i,
            'time': proc_time,
            'pixels': pixel_count
        })
    
    # Create performance analysis graph in the third row, below the RGB panels
    timing_ax = plt.subplot(gs[2, :])
    levels = [d['level'] for d in timing_data]
    times = [d['time'] for d in timing_data]
    pixels = [d['pixels'] for d in timing_data]
    
    # Plot processing time
    color1 = 'tab:blue'
    timing_ax.set_xlabel('Overview Level')
    timing_ax.set_ylabel('Processing Time (s)', color=color1)
    line1 = timing_ax.plot(levels, times, color=color1, marker='o', label='Processing Time')
    timing_ax.tick_params(axis='y', labelcolor=color1)
    
    # Plot pixel count on secondary y-axis
    ax2 = timing_ax.twinx()
    color2 = 'tab:orange'
    ax2.set_ylabel('Pixel Count', color=color2)
    line2 = ax2.plot(levels, pixels, color=color2, marker='s', label='Pixel Count')
    ax2.tick_params(axis='y', labelcolor=color2)
    
    # Add legend and formatting
    lines = line1 + line2
    labels = [l.get_label() for l in lines]
    timing_ax.legend(lines, labels, loc='upper right')
    timing_ax.grid(True, alpha=0.3)
    timing_ax.set_title('Processing Time and Pixel Count vs Overview Level')
    
    plt.suptitle('RGB Composite Overview Levels with Performance Analysis', y=1.02, fontsize=16)
    plt.tight_layout()
    plt.show()


## Analysis and Interpretation

The visualization and performance analysis above reveals several key insights:

### 1. Overview Creation Performance
- Each overview level requires less processing time than the previous one
- Processing efficiency (MP/s) tends to improve for smaller overview levels
- Processing time is expected to fall roughly geometrically with level, tracking the 4x drop in pixel count, rather than logarithmically

### 2. Visual Quality vs Resolution Trade-off
- **Level 0 (Native)**: Full resolution, maximum detail
- **Level 1-2**: Good balance of detail and performance
- **Level 3-5**: Progressive reduction in detail, but still useful for overview purposes

### 3. Performance Characteristics
- **Processing Time**: Decreases significantly with each overview level
  - Level 0 (native) has the highest processing time
  - Each subsequent level shows marked improvement
- **Pixel Count**: Decreases exponentially (1/4 with each level)
  - Directly correlates with memory usage
  - Impacts both processing time and storage requirements
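
The 1/4 ratio also bounds the cost of the whole pyramid. Here is a short worked calculation, assuming each dimension is floor-halved per level and a 10980 x 10980 native band; `pyramid_pixel_counts` is a hypothetical helper name.

```python
def pyramid_pixel_counts(height, width, n_levels):
    """Pixel count per level, halving each dimension per level (floor)."""
    counts = []
    for _ in range(n_levels):
        counts.append(height * width)
        height, width = height // 2, width // 2
    return counts


counts = pyramid_pixel_counts(10980, 10980, n_levels=6)
overhead = sum(counts[1:]) / counts[0]   # overview pixels relative to native
print([f"{c:,}" for c in counts])
print(f"overviews add {overhead:.1%} on top of the native level")
```

Because the counts form a geometric series with ratio 1/4, the overviews together never exceed one third of the native pixel count.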

### 4. Practical Applications
- **Interactive Visualization**: Use higher (coarser) overview levels for initial display and navigation
- **Detailed Analysis**: Switch to lower (finer) overview levels when zoomed in
- **Memory Management**: Choose overview level based on available system resources
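
One way to act on this is to pick the coarsest level that still has at least as many pixels as the display needs. The sketch below is illustrative only: `pick_overview_level` is a hypothetical function, and it assumes each level halves the native width.

```python
def pick_overview_level(native_width, display_width, max_level):
    """Coarsest level whose width is still >= display_width.

    Falls back to level 0 when the display needs more pixels than any
    overview provides.
    """
    level = 0
    while level < max_level and native_width // 2 ** (level + 1) >= display_width:
        level += 1
    return level


# A 10980-pixel-wide band with overview levels 1..5
zoomed_out = pick_overview_level(10980, 800, max_level=5)    # map-style view
zoomed_in = pick_overview_level(10980, 6000, max_level=5)    # near full resolution
print(zoomed_out, zoomed_in)
```

For an 800-pixel-wide view this selects level 3 (1372 pixels wide), and for a 6000-pixel request it falls back to the native level.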

### 5. Processing Efficiency
- Block averaging provides efficient downsampling
- Memory usage scales well with overview levels
- Parallel processing capabilities help with large datasets
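
For reference, 2x2 block averaging can be written in a few lines of NumPy. This is a minimal sketch of the general technique, not the helper module's implementation; `block_average_2x2` is a hypothetical name, and trimming an odd trailing row or column is one of several possible choices.

```python
import numpy as np


def block_average_2x2(arr):
    """Downsample a 2D array by averaging non-overlapping 2x2 blocks.

    An odd trailing row or column is trimmed so the shape divides evenly.
    """
    h, w = arr.shape
    h2, w2 = h // 2, w // 2
    trimmed = arr[: h2 * 2, : w2 * 2].astype(np.float64)
    # Split each axis into (block index, position within block), then
    # average over the two within-block axes.
    return trimmed.reshape(h2, 2, w2, 2).mean(axis=(1, 3))


band = np.arange(16, dtype=np.uint16).reshape(4, 4)
downsampled = block_average_2x2(band)
print(downsampled)
```

The reshape-and-mean form avoids Python loops, and the same pattern applies chunk by chunk when the input is a Dask array.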

This analysis demonstrates the value of having multiple overview levels, allowing applications to choose the appropriate level based on their specific needs for performance vs. detail.

## Technical Implementation Notes

The overview system implements several key features:

1. **Efficient Downsampling**
   - Uses block averaging, which reduces aliasing compared with simple subsampling
   - Preserves data characteristics across resolutions

2. **Coordinate System Handling**
   - Maintains proper georeferencing at all levels
   - Preserves CRS and transformation information

3. **Performance Optimizations**
   - Chunked storage for efficient access
   - Parallel processing capabilities
   - Memory-efficient processing of large datasets

4. **Standards Compliance**
   - Follows COG conventions for overviews
   - Compatible with GeoZarr specifications
   - Maintains EOPF structural requirements
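
The georeferencing rule for a /2 overview is simple: the top-left origin stays fixed and the pixel size doubles with each level. Below is a minimal sketch using a plain GDAL-style geotransform tuple. `overview_geotransform` is a hypothetical helper, and the example coordinates are made up for illustration.

```python
def overview_geotransform(gt, level):
    """Scale a GDAL-style geotransform for overview `level`.

    gt = (x_origin, pixel_width, row_rot, y_origin, col_rot, pixel_height).
    The origin is unchanged; pixel sizes and rotation terms scale by 2**level.
    """
    factor = 2 ** level
    x0, dx, rx, y0, ry, dy = gt
    return (x0, dx * factor, rx * factor, y0, ry * factor, dy * factor)


native = (300000.0, 10.0, 0.0, 5000040.0, 0.0, -10.0)   # 10 m pixels, north-up
level3 = overview_geotransform(native, 3)
print(level3)
```

At level 3 the pixel size becomes 80 m while the origin is unchanged. Pixel-centre coordinates therefore shift by half of the new pixel size relative to the origin, which is worth checking when comparing coordinate arrays across levels.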

These technical choices ensure the resulting dataset is both efficient and standards-compliant.