---
title: "Creating and Optimizing Multi-temporal EOPF Zarr Datasets"
format:
  html:
    code-fold: false
jupyter: python3
---

## Introduction

In this notebook, we will create a multi-temporal EOPF Zarr dataset from multiple Sentinel-2 acquisitions, focusing on the 10-meter resolution bands. We will explore different chunking strategies and demonstrate their impact on storage efficiency and access performance. This hands-on approach will help you understand how to optimize Zarr chunking for your specific Earth Observation workflows.

## What we will learn

- 🛰️ How to create a reduced EOPF Zarr dataset from multiple Sentinel-2 acquisitions
- 📊 How to implement different chunking strategies for multi-temporal data
- ⚡ How to measure and compare performance metrics for different chunk sizes
- 🔧 How to optimize chunking for specific access patterns (spatial vs temporal)
- 💾 How to evaluate storage efficiency with different compression settings

## Prerequisites

This notebook builds upon the concepts introduced in the [Zarr Chunking Introduction](251_zarr_chunking_intro.qmd). You should be familiar with:

- Basic Zarr concepts and structure
- STAC catalog navigation
- Xarray operations

::: {.callout-note}
**Note:** This notebook uses utility functions from `zarr_chunking_utils.py` to keep the code focused on the key concepts. You can explore the utility functions to understand the implementation details.
:::

<hr>

### Import libraries

In [4]:
import numpy as np
import xarray as xr
import zarr
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Import our utility functions
from zarr_chunking_utils import (
    create_dask_client,
    get_sentinel2_data,
    create_sample_data,
    create_multitemporal_zarr,
    create_multitemporal_zarr_from_eopf,
    analyze_zarr_performance,
    compare_chunking_strategies,
    visualize_chunk_layout
)

# Set up plotting style
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries imported successfully!")

Libraries imported successfully!


<hr>

## 1. Setting up the environment

First, we will initialize our Dask client for parallel processing and define our area of interest.

In [5]:
# Initialize Dask client for parallel processing
client = create_dask_client(n_workers=4, threads_per_worker=2, memory_limit='4GB')
print(f"\n✓ Dask client initialized with {len(client.nthreads())} workers")

Dask dashboard available at: http://127.0.0.1:42591/status

✓ Dask client initialized with 4 workers


In [6]:
# Define area of interest and time period
# Example: Agricultural area in Netherlands
bbox = [5.0, 52.0, 5.5, 52.5]  # [min_lon, min_lat, max_lon, max_lat]
start_date = "2024-06-01"
end_date = "2024-09-30"

print(f"Area of Interest: {bbox}")
print(f"Time period: {start_date} to {end_date}")

Area of Interest: [5.0, 52.0, 5.5, 52.5]
Time period: 2024-06-01 to 2024-09-30
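
To get a feel for how large the arrays for this area of interest might be, the bounding box can be converted into an approximate pixel grid at 10 m resolution. The sketch below is only an order-of-magnitude estimate: it assumes a spherical Earth with about 111.32 km per degree, whereas Sentinel-2 data is delivered in UTM, and the helper name `approx_pixel_grid` is hypothetical and not part of `zarr_chunking_utils`.

```python
import math

bbox = [5.0, 52.0, 5.5, 52.5]  # [min_lon, min_lat, max_lon, max_lat]
PIXEL_SIZE_M = 10.0            # resolution of the 10 m bands
METERS_PER_DEGREE = 111_320.0  # spherical approximation (assumption)


def approx_pixel_grid(bbox, pixel_size_m=PIXEL_SIZE_M):
    """Return the approximate (n_x, n_y) pixel counts covered by a lon/lat bbox."""
    min_lon, min_lat, max_lon, max_lat = bbox
    # Longitude degrees shrink with latitude, so scale by cos(mid-latitude)
    mid_lat = math.radians((min_lat + max_lat) / 2)
    width_m = (max_lon - min_lon) * METERS_PER_DEGREE * math.cos(mid_lat)
    height_m = (max_lat - min_lat) * METERS_PER_DEGREE
    return round(width_m / pixel_size_m), round(height_m / pixel_size_m)


n_x, n_y = approx_pixel_grid(bbox)
print(f"Approximate grid at 10 m: {n_x} x {n_y} pixels (x by y)")
```

Under these assumptions the area spans a few thousand pixels in each direction, which is the scale to keep in mind when choosing spatial chunk sizes later in the notebook.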


## 2. Retrieving Sentinel-2 data from EOPF STAC

We will first attempt to retrieve real Sentinel-2 data from the EOPF STAC catalog. If the catalog cannot be reached or returns no matching acquisitions, we will fall back to sample data for demonstration purposes.

In [None]:
# Try to retrieve real data from EOPF STAC
items = get_sentinel2_data(
    bbox=bbox,
    start_date=start_date,
    end_date=end_date,
    max_items=20,  # Retrieve at most 20 acquisitions
    cloud_cover=20
)

# Display acquisition information
print("\n✓ Found EOPF data:")
for item in items:
    date = pd.to_datetime(item['datetime']).strftime('%Y-%m-%d')
    print(f"  - {item['id']}: {date}")
# Fall back to sample data when the search returns no acquisitions
use_real_data = len(items) > 0

# Define the 10m bands we want to work with
# (lowercase, matching the band names passed to create_multitemporal_zarr_from_eopf below)
bands_10m = ['b02', 'b03', 'b04', 'b08']  # Blue, Green, Red, NIR


Searching for Sentinel-2 data from EOPF STAC Catalog...
  Area: [5.0, 52.0, 5.5, 52.5]
  Period: 2024-06-01 to 2024-09-30
  Max cloud cover: 20%
Found 0 Sentinel-2 acquisitions with cloud storage URLs

✓ Found EOPF data:


## 3. Defining chunking strategies

We will test three different chunking strategies, each optimized for different access patterns:

1. **Spatial-optimized**: Large spatial chunks (1024×1024), single time steps
2. **Temporal-optimized**: Small spatial chunks (256×256), all time steps together
3. **Balanced**: Medium chunks (512×512), 2 time steps per chunk

In [None]:
# Define chunking strategies
chunking_strategies = {
    'spatial_optimized': {'time': 1, 'y': 1024, 'x': 1024},
    'temporal_optimized': {'time': 5, 'y': 256, 'x': 256},
    'balanced': {'time': 2, 'y': 512, 'x': 512}
}

# Display chunking strategies
print("Chunking Strategies:")
print("=" * 50)
for name, config in chunking_strategies.items():
    chunk_size_mb = (config['time'] * config['y'] * config['x'] * 2 * 4) / (1024**2)  # 2 bytes per pixel, 4 bands
    print(f"\n{name}:")
    print(f"  Time chunks: {config['time']} steps")
    print(f"  Spatial chunks: {config['y']} × {config['x']} pixels")
    print(f"  Estimated chunk size: ~{chunk_size_mb:.1f} MB")
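
The estimate above multiplies by four bands. If each band is stored as its own Zarr array (an assumption about the layout that `create_multitemporal_zarr_from_eopf` produces), the chunk that is actually read from disk is a quarter of that size. The sketch below computes the uncompressed per-array chunk size and the number of chunks for a hypothetical band of 5 × 2048 × 2048 `uint16` pixels. The helper `chunk_stats` is illustrative and not part of `zarr_chunking_utils`.

```python
import math

# Same strategies as defined in the cell above
chunking_strategies = {
    "spatial_optimized": {"time": 1, "y": 1024, "x": 1024},
    "temporal_optimized": {"time": 5, "y": 256, "x": 256},
    "balanced": {"time": 2, "y": 512, "x": 512},
}

# Hypothetical shape of a single band (assumption, for illustration only)
array_shape = {"time": 5, "y": 2048, "x": 2048}
BYTES_PER_PIXEL = 2  # uint16


def chunk_stats(chunks, shape, itemsize=BYTES_PER_PIXEL):
    """Return (uncompressed chunk size in MiB, number of chunks) for one array."""
    chunk_bytes = itemsize * math.prod(chunks.values())
    # Ceiling division: a partially filled chunk at the array edge still counts
    n_chunks = math.prod(math.ceil(shape[dim] / chunks[dim]) for dim in shape)
    return chunk_bytes / 1024**2, n_chunks


for name, chunks in chunking_strategies.items():
    size_mib, n_chunks = chunk_stats(chunks, array_shape)
    print(f"{name}: {size_mib:.3f} MiB per chunk, {n_chunks} chunks per band")
```

For this shape, the spatial-optimized strategy gives 20 chunks of 2 MiB, the temporal-optimized strategy 64 chunks of 0.625 MiB, and the balanced strategy 48 chunks of 1 MiB per band. The balanced count includes a partially filled last chunk along time, because 5 time steps do not divide evenly by 2.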

### Visualizing chunk layouts

Let us visualize how the spatial-optimized and temporal-optimized strategies divide a data array of 5 × 2048 × 2048 pixels:

In [None]:
# Visualize spatial-optimized chunking
print("Spatial-Optimized Chunking:")
visualize_chunk_layout(
    chunking_strategies['spatial_optimized'],
    {'time': 5, 'y': 2048, 'x': 2048}
)

In [None]:
# Visualize temporal-optimized chunking
print("Temporal-Optimized Chunking:")
visualize_chunk_layout(
    chunking_strategies['temporal_optimized'],
    {'time': 5, 'y': 2048, 'x': 2048}
)
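
The layouts above matter because every read has to fetch and decompress each chunk it overlaps, even when only one pixel of that chunk is needed. As a rough sketch, the number of chunks touched by two typical access patterns can be counted for the same 5 × 2048 × 2048 array. The helper `chunks_touched` and the two example selections are hypothetical and only illustrate the arithmetic.

```python
# Same strategies and array shape as used in the visualizations above
chunking_strategies = {
    "spatial_optimized": {"time": 1, "y": 1024, "x": 1024},
    "temporal_optimized": {"time": 5, "y": 256, "x": 256},
    "balanced": {"time": 2, "y": 512, "x": 512},
}


def chunks_touched(chunks, selection):
    """Count chunks overlapping a selection given as {dim: (start, stop)}, stop exclusive."""
    count = 1
    for dim, (start, stop) in selection.items():
        first_chunk = start // chunks[dim]
        last_chunk = (stop - 1) // chunks[dim]
        count *= last_chunk - first_chunk + 1
    return count


# Time series of a single pixel across all 5 time steps
pixel_series = {"time": (0, 5), "y": (1000, 1001), "x": (1000, 1001)}
# Full spatial extent at a single date
full_scene = {"time": (2, 3), "y": (0, 2048), "x": (0, 2048)}

for name, chunks in chunking_strategies.items():
    n_series = chunks_touched(chunks, pixel_series)
    n_scene = chunks_touched(chunks, full_scene)
    print(f"{name}: pixel time series -> {n_series} chunks, full scene -> {n_scene} chunks")
```

Here the pixel time series touches 5, 1, and 3 chunks for the spatial-optimized, temporal-optimized, and balanced strategies, while the full scene touches 4, 64, and 16 chunks. Chunk counts are only a proxy for read cost, since chunk size and compression also play a role, but they show why neither extreme suits both access patterns.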

## 4. Creating Zarr datasets with different strategies

Now we will write the dataset (real EOPF data when available, sample data otherwise) using each chunking strategy and compare the resulting stores.

In [None]:
# Create output directory
output_dir = Path('zarr_chunking_examples')
output_dir.mkdir(exist_ok=True)

# Compare all strategies
print("Creating Zarr datasets with different chunking strategies...\n")

# Process each strategy with real EOPF data
results = {}
for strategy_name, chunk_config in chunking_strategies.items():
    print(f"\nProcessing {strategy_name} strategy...")
    output_path = output_dir / f'sentinel2_{strategy_name}.zarr'
    
    # Remove existing if present
    if output_path.exists():
        import shutil
        shutil.rmtree(output_path)
    
    # Create Zarr from EOPF data
    create_multitemporal_zarr_from_eopf(
        items=items,
        output_path=str(output_path),
        chunk_strategy=chunk_config,
        bands_10m=['b02', 'b03', 'b04', 'b08'],
        use_sample_data=not use_real_data
    )
    
    # Analyze performance
    metrics = analyze_zarr_performance(str(output_path))
    results[strategy_name] = metrics

# Create comparison dataframe
comparison_data = []
for strategy, metrics in results.items():
    comparison_data.append({
        'Strategy': strategy,
        'Total Size (MB)': round(metrics['total_size_mb'], 2),
        'Avg Chunk Size (MB)': round(metrics['avg_chunk_size_mb'], 2),
        'Compression Ratio': round(metrics['avg_compression_ratio'], 2),
        'Number of Chunks': metrics['num_chunks'],
        'Spatial Read (s)': round(metrics.get('spatial_read_time_s', 0), 3),
        'Temporal Read (s)': round(metrics.get('temporal_read_time_s', 0), 3)
    })
comparison_df = pd.DataFrame(comparison_data)

print("\n" + "=" * 80)
print("Performance Comparison Results:")
print("=" * 80)
print(comparison_df.to_string(index=False))
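
The read times in this table come from `analyze_zarr_performance`, whose implementation is not shown here. If you want to time your own access patterns, one possible approach is a small helper like the one below, which repeats a read and keeps the fastest run. The name `time_read` is hypothetical, and the lambda is a stand-in for a real read such as loading one time step of one band.

```python
import time


def time_read(read_fn, repeats=3):
    """Call read_fn several times and return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        read_fn()
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload; replace with a real read of your Zarr store
elapsed = time_read(lambda: sum(range(100_000)))
print(f"Fastest of 3 runs: {elapsed:.6f} s")
```

Taking the fastest of several runs reduces noise, but it also means later runs may benefit from the operating system's file cache. For cold-read behaviour, a single run on a freshly opened store is more representative.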

## 5. Analyzing performance results

Let us visualize the performance characteristics of each chunking strategy.

In [None]:
# Create performance visualization
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# Plot 1: Storage size comparison
ax = axes[0, 0]
comparison_df.plot(x='Strategy', y='Total Size (MB)', kind='bar', ax=ax, color='steelblue', legend=False)
ax.set_title('Total Storage Size')
ax.set_ylabel('Size (MB)')
ax.set_xlabel('')
ax.tick_params(axis='x', rotation=45)

# Plot 2: Average chunk size
ax = axes[0, 1]
comparison_df.plot(x='Strategy', y='Avg Chunk Size (MB)', kind='bar', ax=ax, color='coral', legend=False)
ax.set_title('Average Chunk Size')
ax.set_ylabel('Size (MB)')
ax.set_xlabel('')
ax.tick_params(axis='x', rotation=45)

# Plot 3: Compression ratio
ax = axes[0, 2]
comparison_df.plot(x='Strategy', y='Compression Ratio', kind='bar', ax=ax, color='green', legend=False)
ax.set_title('Compression Ratio')
ax.set_ylabel('Ratio')
ax.set_xlabel('')
ax.tick_params(axis='x', rotation=45)

# Plot 4: Number of chunks
ax = axes[1, 0]
comparison_df.plot(x='Strategy', y='Number of Chunks', kind='bar', ax=ax, color='purple', legend=False)
ax.set_title('Total Number of Chunks')
ax.set_ylabel('Count')
ax.set_xlabel('Strategy')
ax.tick_params(axis='x', rotation=45)

# Plot 5: Read performance comparison
ax = axes[1, 1]
x = np.arange(len(comparison_df))
width = 0.35
ax.bar(x - width/2, comparison_df['Spatial Read (s)'], width, label='Spatial Read', color='#2E86AB')
ax.bar(x + width/2, comparison_df['Temporal Read (s)'], width, label='Temporal Read', color='#A23B72')
ax.set_xlabel('Strategy')
ax.set_ylabel('Time (seconds)')
ax.set_title('Read Performance Comparison')
ax.set_xticks(x)
ax.set_xticklabels(comparison_df['Strategy'], rotation=45)
ax.legend()

# Plot 6: Efficiency score (custom metric)
ax = axes[1, 2]
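# Calculate efficiency score: lower is better (1.0 = best on every metric)
# Averages storage size and both read times, each normalized to the best strategy
# Note: assumes every metric is non-zero; a read time of 0 would cause a division by zero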
efficiency = (
    comparison_df['Total Size (MB)'] / comparison_df['Total Size (MB)'].min() +
    comparison_df['Spatial Read (s)'] / comparison_df['Spatial Read (s)'].min() +
    comparison_df['Temporal Read (s)'] / comparison_df['Temporal Read (s)'].min()
) / 3

ax.bar(comparison_df['Strategy'], efficiency, color='gold')
ax.set_title('Overall Efficiency Score\n(Lower is Better)')
ax.set_ylabel('Score')
ax.set_xlabel('Strategy')
ax.tick_params(axis='x', rotation=45)

plt.suptitle('Zarr Chunking Strategy Performance Analysis', fontsize=16, y=1.02)
plt.tight_layout()
plt.show()
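
The efficiency score in the last panel is a simple average of ratios: each metric is divided by the best (smallest) value across strategies, so a strategy that is best on every metric scores exactly 1.0. The sketch below reproduces that arithmetic in plain Python with a guard against zero values. The function name `efficiency_scores` and the numbers in `rows` are made up for illustration and are not measured results.

```python
def efficiency_scores(rows, metrics):
    """Average each metric's ratio to the best (smallest) value; 1.0 is the best possible score."""
    best = {metric: min(row[metric] for row in rows) for metric in metrics}
    if any(value <= 0 for value in best.values()):
        raise ValueError("every metric must be positive to be normalized")
    return {
        row["Strategy"]: sum(row[metric] / best[metric] for metric in metrics) / len(metrics)
        for row in rows
    }


metrics = ["Total Size (MB)", "Spatial Read (s)", "Temporal Read (s)"]

# Made-up numbers, for illustration only
rows = [
    {"Strategy": "a", "Total Size (MB)": 10.0, "Spatial Read (s)": 1.0, "Temporal Read (s)": 1.0},
    {"Strategy": "b", "Total Size (MB)": 20.0, "Spatial Read (s)": 2.0, "Temporal Read (s)": 4.0},
]

scores = efficiency_scores(rows, metrics)
for strategy, score in scores.items():
    print(f"{strategy}: {score:.2f}")
```

Because all three metrics carry equal weight, the score is only a rough summary. If one access pattern dominates your workflow, weighting that read time more heavily may be a better fit.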

## 6. Testing different compression levels

Let us explore how compression levels affect storage size and performance for the balanced chunking strategy.

In [None]:
# Test different compression levels
compression_levels = [1, 3, 5, 7, 9]
compression_results = []

print("Testing different compression levels...\n")

for level in compression_levels:
    output_path = output_dir / f'sentinel2_compression_level_{level}.zarr'
    
    # Remove existing if present
    if output_path.exists():
        import shutil
        shutil.rmtree(output_path)
    
    # Create Zarr with specific compression level
    create_multitemporal_zarr_from_eopf(
        items=items,
        output_path=str(output_path),
        chunk_strategy=chunking_strategies['balanced'],
        compression='zstd',
        compression_level=level,
        use_sample_data=not use_real_data
    )
    
    # Analyze performance
    metrics = analyze_zarr_performance(str(output_path), test_reads=False)
    
    compression_results.append({
        'Compression Level': level,
        'Total Size (MB)': round(metrics['total_size_mb'], 2),
        'Compression Ratio': round(metrics['avg_compression_ratio'], 2)
    })

compression_df = pd.DataFrame(compression_results)
print("\nCompression Level Comparison:")
print("=" * 50)
print(compression_df.to_string(index=False))
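
To see the size and ratio relationship in isolation, the sketch below compresses one synthetic chunk at the same five levels. It uses `zlib` from the Python standard library as a stand-in for the `zstd` codec used above, so the absolute numbers will differ from those of the Zarr stores. The data is a deterministic gradient, not Sentinel-2 imagery, and `compression_ratios` is a hypothetical helper.

```python
import zlib
from array import array

# Deterministic, smooth uint16 "image" of 256 x 256 pixels (synthetic data)
pixels = array("H", (x + y for y in range(256) for x in range(256)))
raw = pixels.tobytes()


def compression_ratios(data, levels=(1, 3, 5, 7, 9)):
    """Return {level: uncompressed size / compressed size} for each zlib level."""
    return {level: len(data) / len(zlib.compress(data, level)) for level in levels}


ratios = compression_ratios(raw)
for level, ratio in ratios.items():
    print(f"zlib level {level}: compression ratio {ratio:.1f}")
```

Real reflectance data is far noisier than this gradient and will compress much less. The point is only the mechanics: the ratio is the uncompressed size divided by the compressed size, and higher levels trade extra CPU time for a smaller output.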

In [None]:
# Visualize compression impact
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Plot storage size vs compression level
ax = axes[0]
ax.plot(compression_df['Compression Level'], compression_df['Total Size (MB)'], 
        marker='o', linewidth=2, markersize=8, color='steelblue')
ax.set_xlabel('Compression Level')
ax.set_ylabel('Total Size (MB)')
ax.set_title('Storage Size vs Compression Level')
ax.grid(True, alpha=0.3)

# Plot compression ratio vs compression level
ax = axes[1]
ax.plot(compression_df['Compression Level'], compression_df['Compression Ratio'], 
        marker='s', linewidth=2, markersize=8, color='green')
ax.set_xlabel('Compression Level')
ax.set_ylabel('Compression Ratio')
ax.set_title('Compression Ratio vs Compression Level')
ax.grid(True, alpha=0.3)

plt.suptitle('Impact of Compression Level on Storage Efficiency', fontsize=14)
plt.tight_layout()
plt.show()

<hr>

## 💪 Now it is your turn

The following exercises will help you master Zarr chunking strategies for your own Earth Observation workflows.

### Task 1: Explore Your Own Chunking Strategy

* Define a new chunking strategy that might work better for your specific use case
* Add it to the `chunking_strategies` dictionary
* Run the comparison again to see how it performs

### Task 2: Test with Different Data Dimensions

* Modify the `data_shape` to simulate a longer time series (e.g., 20 time steps)
* How does this affect the optimal chunking strategy?
* Which strategy performs best for time series analysis?

### Task 3: Optimize for Your Access Pattern

* If you primarily need to extract time series for individual pixels, which strategy would you choose?
* If you need to process entire scenes at specific dates, which strategy is optimal?
* Create a custom chunking strategy that balances both requirements

### Task 4: Experiment with Different Compression Algorithms

* Modify the `create_multitemporal_zarr_from_eopf` function call to test different compression algorithms (e.g., 'lz4', 'zlib', 'blosclz')
* Compare the trade-offs between compression ratio and read performance
* Which algorithm works best for your data?

In [None]:
# Space for your experiments
# Example: Add your custom chunking strategy

# my_custom_strategy = {
#     'time': 3,
#     'y': 768,
#     'x': 768
# }

# Add your code here...
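
When designing a custom strategy, it can help to start from a target chunk size and derive the spatial dimensions from it. The sketch below picks the largest square chunk side that keeps the uncompressed chunk of one `uint16` band within a target size for a given number of time steps. The 1 MiB target, the rounding to multiples of 64, and the helper name `spatial_side_for_target` are all arbitrary choices for illustration.

```python
import math


def spatial_side_for_target(target_mib, time_steps, itemsize=2, multiple=64):
    """Largest square chunk side (a multiple of `multiple`) whose uncompressed size fits target_mib."""
    target_bytes = int(target_mib * 1024**2)
    # Pixels available per time step, then the side of the largest square that fits
    side = math.isqrt(target_bytes // (itemsize * time_steps))
    # Round down to a multiple, but never below one multiple
    return max(multiple, side // multiple * multiple)


for time_steps in (1, 3, 5):
    side = spatial_side_for_target(1, time_steps)
    print(f"time={time_steps}: y={side}, x={side}")
```

For a 1 MiB target this suggests sides of 704, 384, and 320 pixels for 1, 3, and 5 time steps. Treat these as starting points and confirm them by running the comparison above with your own access pattern.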

## Conclusion

In this notebook, we have demonstrated how to create multi-temporal EOPF Zarr datasets with different chunking strategies. We explored three main approaches—spatial-optimized, temporal-optimized, and balanced chunking—and analyzed their performance characteristics.

Key takeaways:

- **Spatial-optimized chunking** (large spatial chunks, small temporal chunks) excels for spatial analysis workflows
- **Temporal-optimized chunking** (small spatial chunks, large temporal chunks) is ideal for time series extraction
- **Balanced chunking** provides reasonable performance for mixed access patterns
- Compression level affects storage size, but higher levels typically show diminishing returns (often beyond level 5-7); check the comparison table for your own data
- The optimal strategy depends on your specific access patterns and computational constraints

Remember that chunk size selection is one of the most critical optimization decisions in Earth Observation data processing. Always profile your specific workflows to determine the optimal configuration.

## What's next?

In the next notebook, we will explore advanced Zarr features including hierarchical storage, multi-resolution pyramids, and cloud-optimized access patterns for large-scale Earth Observation analysis.