# Mockup of use cases and vision for CEDA adoption of GeoCroissant

The Centre for Environmental Data Analysis (CEDA), and its partner UK environmental data centres, are working on multiple projects aimed at making their data more _AI-ready_. What we mean by _AI-readiness_ is that the data should be:
- easy to find
- easy to access
- efficient to process/load at scale
- integrated with local/remote performant caching
- easy to transform and load into Machine Learning workflows
- easy for Agentic AI to interact with
- self-describing in terms of its characteristics in relation to usage, such as:
  - caveats on usage
  - consideration of data quality and uncertainty
  - clarification of biases in the collection and construction of the data

These characteristics are highlighted in the following sections of this Notebook:
1. Discover, search and query
2. Interrogate the contents of a dataset
3. Filter and subset
4. Extract, transform and load
5. Copying data to a local cache
6. Usage warnings and caveats (at _global_ and _variable_ levels)
7. Agentic access (via MCP)
8. Accessing local and/or remote data (file system vs S3/HTTP)
9. Handling restricted data with access control
10. Benchmarking

### Firstly, we'll make some imports to set up the Notebook

**NOTE: this is a synthetic notebook that uses _mock_ packages. It is intended as a useful tool for describing (and proposing) a narrative on how `geocroissant` might work.**

In [1]:
# Import libraries from the external mock module
import sys
import os

# Add current directory to Python path to find mymock.py
current_dir = os.path.abspath('.')
if current_dir not in sys.path:
    sys.path.insert(0, current_dir)

# Import from our mock module
from mymock import (
    croissant, torch, xr, STACIntegration, DataLoader, Dataset,
    torch_nn as nn, matplotlib_pyplot as plt, cartopy_crs as ccrs, 
    cartopy_feature as cfeature, pystac_client as Client
)

# Standard libraries (these are real)
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

print("✅ All libraries imported successfully!\n")
print(f"Croissant version: {croissant.__version__}")
print(f"PyTorch version: {torch.__version__}")
print(f"Xarray version: {xr.__version__}")
print(f"Cartopy version: 0.22.0")
print(f"STAC Client version: 0.7.0")

✅ All libraries imported successfully!

Croissant version: 1.2.3
PyTorch version: 2.1.0
Xarray version: 2023.10.1
Cartopy version: 0.22.0
STAC Client version: 0.7.0


## 1. Discover, search and query

At the top level, users should have a single Python API from which they can explore _all data_. In this example, we imagine that there is a `GeoCroissant` object imported from `croissant` that you can create an instance of by giving it the URL to a (Geo-)Croissant catalogue.

The end-point serves up _geo-aware_ dataset records that can be interrogated.

Note that the `GeoCroissant` object can be interrogated in multiple ways:
1. Using built-in operations for space and time:
  - `spatial_coverage`
  - `temporal_coverage`
2. By keywords - based on those tagged in the datasets
3. By _facets_:
  - Picking up domain-specific vocabularies for different datasets, such as:
    - Satellite data: `sensor_id`, `platform`
    - Climate simulations: `ensemble_member`, `grid_type`, `frequency`
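
The space, time and keyword criteria listed above could be evaluated with simple intersection tests over catalogue records. The sketch below is only an illustration under stated assumptions: the `Record` class, its fields, the `search` function and the two sample records are all hypothetical, and none of this is the `geocroissant` implementation. It treats keywords as an any-match and does not handle bounding boxes that cross the antimeridian.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical catalogue record (illustrative fields only)."""
    name: str
    bbox: tuple        # (min_lon, min_lat, max_lon, max_lat)
    time_range: tuple  # (start, end) as ISO 8601 date strings
    keywords: set = field(default_factory=set)


def bboxes_intersect(a, b):
    """True if two (min_lon, min_lat, max_lon, max_lat) boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def ranges_intersect(a, b):
    """True if two (start, end) ranges overlap; ISO dates sort as strings."""
    return a[0] <= b[1] and b[0] <= a[1]


def search(records, spatial_coverage=None, temporal_range=None, keywords=None):
    """Return the records that satisfy every criterion that was supplied."""
    hits = []
    for rec in records:
        if spatial_coverage and not bboxes_intersect(rec.bbox, spatial_coverage):
            continue
        if temporal_range and not ranges_intersect(rec.time_range, temporal_range):
            continue
        if keywords and not rec.keywords & set(keywords):
            continue
        hits.append(rec)
    return hits


# Invented sample records, used only to exercise the sketch.
records = [
    Record("global_projection", (-180, -90, 180, 90),
           ("2015-01-01", "2100-12-31"), {"climate", "temperature"}),
    Record("pacific_buoys", (150, -20, 180, 20),
           ("1990-01-01", "2010-12-31"), {"ocean"}),
]

matches = search(
    records,
    spatial_coverage=[-30, -10, 40, 30],
    temporal_range=("2015-01-01", "2100-12-31"),
    keywords=["climate"],
)
print([rec.name for rec in matches])
```

Domain-specific facets could be added in the same way, as one more optional predicate inside `search`.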


In [2]:
# Initialize GeoCroissant with multiple data sources
geocat = croissant.GeoCroissant(provider="https://catalogue.ceda.ac.uk/croissant/")

# Search for climate datasets
datasets = geocat.search(
    spatial_coverage=[-30, -10, 40, 30],  # Example bounding box [min_lon, min_lat, max_lon, max_lat]
    temporal_range=("2015-01-01", "2100-12-31"),
    keywords=["climate", "temperature", "precipitation"],
    # facets={"model": ["UKESM1-0-LL", "HadGEM3-GC31-LL"]},
)

print(f"Found {len(datasets)} matching datasets:\n")
for i, dataset in enumerate(datasets[:5]):  # Show first 5
    print(f"{i+1}. {dataset.name}")
    print(f"     Description: {dataset.description}")
    print(f"     Provider: {dataset.provider}")
    print(f"     Variables: {', '.join(dataset.variables[:3])}...")
    print(f"     Spatial Resolution: {dataset.spatial_resolution}")
    print(f"     Temporal Resolution: {dataset.temporal_resolution}")
    print()

Initializing GeoCroissant client using provider: https://catalogue.ceda.ac.uk/croissant/
Found 3 matching datasets:

1. CMIP6_Global_Climate_Projections
     Description: Multi-model ensemble of global climate projections from CMIP6
     Provider: ESGF Data Nodes
     Variables: temperature, precipitation, pressure...
     Spatial Resolution: 1.25° x 1.25°
     Temporal Resolution: monthly

2. ERA5_Reanalysis_Global
     Description: ECMWF ERA5 atmospheric reanalysis dataset
     Provider: Copernicus Climate Data Store
     Variables: temperature, wind, pressure...
     Spatial Resolution: 0.25° x 0.25°
     Temporal Resolution: hourly

3. MODIS_Land_Surface_Temperature
     Description: MODIS satellite-derived land surface temperature
     Provider: NASA EARTHDATA
     Variables: land_surface_temperature, emissivity...
     Spatial Resolution: 1km
     Temporal Resolution: daily



## 2. Interrogate the contents of a dataset

The catalogue and dataset objects expose methods that allow the user to directly interrogate them regarding their contents. 

Initially, `<dataset>.get_props("__available__")` returns a list of the possible properties (or _facets_) that the dataset exposes. After that call, the user can use `<dataset>.get_props("<prop_name>")` to find out which values can be selected for each property.
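
One way such an interrogation interface might be backed is a plain mapping from property name to allowed values, with `"__available__"` reserved for listing the property names. The sketch below assumes exactly that; the `DatasetProps` class and its sample facets are hypothetical and are not part of any real `geocroissant` package.

```python
class DatasetProps:
    """Hypothetical holder for a dataset's facets (illustration only)."""

    def __init__(self, facets):
        self._facets = dict(facets)

    def get_props(self, name):
        # The reserved name "__available__" lists the facet names themselves.
        if name == "__available__":
            return sorted(self._facets)
        try:
            return list(self._facets[name])
        except KeyError:
            available = ", ".join(sorted(self._facets))
            raise KeyError(
                f"Unknown property {name!r}; available: {available}"
            ) from None


# Invented facet values, used only to exercise the sketch.
dataset = DatasetProps({
    "models": ["CESM2", "UKESM1-0-LL"],
    "experiments": ["ssp126", "ssp585"],
})

print(dataset.get_props("__available__"))
print(dataset.get_props("models"))
```

Raising an error that names the valid properties keeps the interface self-describing even when a caller guesses wrongly.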

**NOTE: A warning appears to provide guidance on how the data can/cannot be used.**

In [3]:
# Load a dataset (e.g., CMIP6)
cmip6_dataset = geocat.load_dataset("CMIP6_Global_Climate_Projections")

# Use the generic interrogation API to explore the dataset
print("🔍 Interrogating CMIP6 dataset contents...")

# List available properties
available_props = cmip6_dataset.get_props("__available__")
print(f"🔧 Other Available Properties:")
print(f"Available props: {', '.join(available_props)}")
print(f"Use cmip6_dataset.get_props('property_name') to explore any of these:")

# Get available climate models using generic props interface
models = cmip6_dataset.get_props("models")
print(f"📊 Available Climate Models ({len(models)}):")
for model in models[:8]:
    print(f"  - {model.name}: {model.institution}")
    print(f"    Resolution: {model.nominal_resolution}")
    print(f"    Experiments: {len(model.experiments)}")
    print()

# Get available experiments
experiments = cmip6_dataset.get_props("experiments")
print(f"🧪 Available Experiments ({len(experiments)}):")
for exp in experiments[:5]:
    print(f"  - {exp.experiment_id}: {exp.description}")
    print(f"    Activity: {exp.activity_id}")
    print(f"    Models: {len(exp.participating_models)}")
    print()

# Get available variables
variables = cmip6_dataset.get_props("variables")
print(f"🌡️ Available Variables ({len(variables)}):")
for var in variables[:10]:
    print(f"  - {var.variable_id}: {var.long_name}")
    print(f"    Units: {var.units}")
    print(f"    Frequency: {var.frequency}")
    print(f"    Dimensions: {var.dimensions}")
    print()

# Show other available properties that can be interrogated
available_props = cmip6_dataset.get_props("__available__")
print(f"🔧 Other Available Properties:")
print(f"Available props: {', '.join(available_props)}")
print(f"Use cmip6_dataset.get_props('property_name') to explore any of these:")

🔍 Interrogating CMIP6 dataset contents...
🔧 Other Available Properties:
Available props: models, experiments, variables, frequencies, realms, institutions, grids, time_ranges
Use cmip6_dataset.get_props('property_name') to explore any of these:
📊 Available Climate Models (5):
  - CESM2: NCAR
    Resolution: 0.9x1.25 deg
    Experiments: 3

  - GFDL-ESM4: NOAA-GFDL
    Resolution: 0.5 deg
    Experiments: 4

  - UKESM1-0-LL: MOHC
    Resolution: 1.25x1.875 deg
    Experiments: 3

  - IPSL-CM6A-LR: IPSL
    Resolution: 1.27x2.5 deg
    Experiments: 4

  - MPI-ESM1-2-HR: MPI-M
    Resolution: 0.94x0.94 deg
    Experiments: 3

🧪 Available Experiments (5):
  - ssp126: Low emissions scenario
    Activity: ScenarioMIP
    Models: 12

  - ssp245: Medium emissions scenario
    Activity: ScenarioMIP
    Models: 15

  - ssp370: Medium-high emissions scenario
    Activity: ScenarioMIP
    Models: 8

  - ssp585: High emissions scenario
    Activity: ScenarioMIP
    Models: 18

  - histo

## 3. Filter and subset

Before any data is actually loaded, the contents of the required dataset can be filtered. Filtering uses _lazy loading_: the software stores a graph of the required operations, which is only executed when the data arrays themselves are needed (e.g. for model training, analysis or visualisation).

Again, this allows the specification of _generic_ properties, such as _space_ and _time_, along with _dataset-specific_ facets such as `model`.
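
The lazy-loading idea can be sketched with a small wrapper that queues filter steps and only reads data when asked to compute. This is a minimal illustration, assuming a linear chain of filters (the simplest form of operation graph); the `LazyDataset` class, the `load_rows` function and the sample rows are hypothetical.

```python
class LazyDataset:
    """Hypothetical lazy wrapper: records operations, runs them on compute()."""

    def __init__(self, loader, operations=()):
        self._loader = loader  # callable that actually reads the data
        self._operations = tuple(operations)

    def filter(self, predicate):
        # Return a new object with the extra step queued; nothing is read yet.
        return LazyDataset(self._loader, self._operations + (predicate,))

    def compute(self):
        # Only now is the loader called and the queued filters applied in order.
        rows = self._loader()
        for predicate in self._operations:
            rows = [row for row in rows if predicate(row)]
        return rows


load_calls = []  # records each time the data is really read


def load_rows():
    """Stand-in for an expensive read; the rows are invented."""
    load_calls.append("read")
    return [
        {"model": "CESM2", "year": 2020},
        {"model": "CESM2", "year": 2040},
        {"model": "UKESM1-0-LL", "year": 2040},
    ]


lazy = (
    LazyDataset(load_rows)
    .filter(lambda row: row["model"] == "CESM2")
    .filter(lambda row: row["year"] >= 2030)
)
calls_before = len(load_calls)  # still zero: building the chain reads nothing
result = lazy.compute()

print(calls_before, len(load_calls))
print(result)
```

Because each `filter` call returns a new object, a partially filtered dataset can be reused as the starting point for several different selections.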


In [4]:
# Load the CMIP6 dataset with filter options
cmip6_dataset = geocat.load_dataset(
    "CMIP6_Global_Climate_Projections",
    spatial_subset=[-20, 10, 30, 50],  # [min_lon, min_lat, max_lon, max_lat]
    temporal_subset=("2020-01-01", "2050-12-31"),
    variables=["tas", "pr", "psl"],  # Surface air temperature, precipitation and pressure
    facets={"model": ["UKESM1-0-LL", "HadGEM3-GC31-LL"]},
    suppress_warnings=True,
)

# The dataset is loaded with STAC integration
print("✅ CMIP6 dataset loaded successfully!")
print(f"Dataset ID: {cmip6_dataset.id}")
print(f"Title: {cmip6_dataset.title}")
print(f"Description: {cmip6_dataset.description}")
print(f"License: {cmip6_dataset.license}")
print(f"Extent: {cmip6_dataset.spatial_extent}")
print(f"Time Range: {cmip6_dataset.temporal_extent}")

# Show STAC catalog structure
print(f"\n📁 STAC Catalog Structure:")
print(f"Collections: {len(cmip6_dataset.collections)}")
for collection in cmip6_dataset.collections[:3]:
    print(f"  - {collection.id}: {collection.title}")
    print(f"    Items: {len(collection.items)}")
    print(f"    Variables: {', '.join(collection.summaries.get('variables', [])[:5])}")
    print()


✅ CMIP6 dataset loaded successfully!
Dataset ID: cmip6_global_climate
Title: CMIP6 Global Climate Projections
Description: Comprehensive climate model data from CMIP6 including temperature, precipitation, and atmospheric variables
License: CC-BY-4.0
Extent: {'bbox': [-20, 10, 30, 50]}
Time Range: {'interval': [['2020-01-01', '2050-12-31']]}

📁 STAC Catalog Structure:
Collections: 3
  - temperature: Surface Temperature
    Items: 120
    Variables: tas, tasmax, tasmin, pr, huss

  - precipitation: Precipitation
    Items: 120
    Variables: pr, prc, prsn, prw, evspsbl

  - atmospheric: Atmospheric Variables
    Items: 120
    Variables: psl, ua, va, zg, hus



Or, alternatively, **apply filters after loading a dataset**...

In [5]:
# Load a dataset
cmip6_dataset = geocat.load_dataset("CMIP6_Global_Climate_Projections", suppress_warnings=True)

# Define filtering selection criteria
selection_criteria = {
    'model': 'CESM2',  # Community Earth System Model
    'experiment': 'ssp585',  # High emissions scenario
    'variable': 'tas',  # Near-surface air temperature
    'frequency': 'monthly',
    'spatial_bounds': {
        'lat': [30, 70],  # Northern hemisphere focus
        'lon': [-130, -60]  # North America
    },
    'temporal_bounds': {
        'start': '2020-01-01',
        'end': '2050-12-31'
    }
}

print("🎯 Applying selection criteria:")
for key, value in selection_criteria.items():
    print(f"  {key}: {value}")

# Apply the filters using GeoCroissant's filtering API
print("\n🔄 Filtering dataset...")
filtered_dataset = cmip6_dataset.filter(**selection_criteria)

# Display summary of the filtered dataset
print(f"✅ Filtered dataset created!")
print(f"Original size: {cmip6_dataset.estimated_size_gb:.1f} GB")
print(f"Filtered size: {filtered_dataset.estimated_size_gb:.1f} GB")
print(f"Reduction: {(1 - filtered_dataset.estimated_size_gb/cmip6_dataset.estimated_size_gb)*100:.1f}%")

# Show the structure of the filtered dataset
print(f"\n📋 Filtered Dataset Structure:")
print(f"Variables: {filtered_dataset.variables}")
print(f"Spatial shape: {filtered_dataset.spatial_shape}")
print(f"Temporal shape: {filtered_dataset.temporal_shape}")
print(f"Total timesteps: {filtered_dataset.n_timesteps}")
print(f"Data format: {filtered_dataset.data_format}")  # xarray or tensor ready

🎯 Applying selection criteria:
  model: CESM2
  experiment: ssp585
  variable: tas
  frequency: monthly
  spatial_bounds: {'lat': [30, 70], 'lon': [-130, -60]}
  temporal_bounds: {'start': '2020-01-01', 'end': '2050-12-31'}

🔄 Filtering dataset...
✅ Filtered dataset created!
Original size: 1250.0 GB
Filtered size: 85.2 GB
Reduction: 93.2%

📋 Filtered Dataset Structure:
Variables: ['tas']
Spatial shape: (40, 70)
Temporal shape: (372,)
Total timesteps: 372
Data format: xarray


## 4. Extract, transform and load

For use in Machine Learning workflows, the data will often need to be transformed in structure. 

Transformers can be applied to the `load_dataset(...)` operation, or applied afterwards. In this example, the data is regridded to a 1 degree grid and converted from 64-bit floats (_double_) to 32-bit floats.

Additionally, masked (missing) values are replaced with the mean of the corresponding variable.
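
A transformer chain of this kind can be thought of as a list of functions applied in order. The sketch below illustrates only the type coercion and mean imputation steps on a plain Python list; regridding is omitted. It assumes missing values are represented as NaN, and the function names and sample values are hypothetical rather than taken from any real package.

```python
import math
import struct


def to_float32(values):
    """Round each value to the nearest 32-bit float (NaN stays NaN)."""
    return [struct.unpack("f", struct.pack("f", v))[0] for v in values]


def impute_mean(values):
    """Replace NaN entries with the mean of the non-missing values."""
    present = [v for v in values if not math.isnan(v)]
    if not present:
        return list(values)  # nothing to compute a mean from
    mean = sum(present) / len(present)
    return [mean if math.isnan(v) else v for v in values]


def apply_transformers(values, transformers):
    """Apply each transformer in turn, feeding its output to the next."""
    for transform in transformers:
        values = transform(values)
    return values


# Invented sample with one missing value.
tas = [280.0, float("nan"), 284.0]
transformed = apply_transformers(tas, [to_float32, impute_mean])
print(transformed)
```

The order matters: imputing after coercion means the fill value is computed from the values the model will actually see.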

In [6]:
# Load a dataset and apply transformations during the conversion
from mymock import croissant


cmip6_dataset = geocat.load_dataset(
    "CMIP6_Global_Climate_Projections",
    spatial_subset=[-20, 10, 30, 50],  # [min_lon, min_lat, max_lon, max_lat]
    temporal_subset=("2020-01-01", "2050-12-31"),
    variables=["tas", "pr", "psl"],  # Surface air temperature, precipitation and pressure
    facets={"model": ["UKESM1-0-LL", "HadGEM3-GC31-LL"]},
    transformers=[
        croissant.transformers.RegridTransformer(target_grid="1deg"),  # Regrid to 1 degree
        croissant.transformers.TypeCoercionTransformer(dtype="float32"),  # Convert to 32-bit floats
        croissant.transformers.MissingValueImputer(strategy="mean")  # Impute missing values with mean
    ]
)

print("✅ CMIP6 dataset prepared to load with transformations applied!")
print("Transformations: ")
for transformer in cmip6_dataset.transformers:
    print(transformer)


✅ CMIP6 dataset prepared to load with transformations applied!
Transformations: 
Transformer type: RegridTransformer, Specification: {'target_grid': '1deg'}
Transformer type: TypeCoercionTransformer, Specification: {'dtype': 'float32'}
Transformer type: MissingValueImputer, Specification: {'strategy': 'mean'}


## 5. Copying data to a local cache

Since large geospatial datasets may be used for many epochs/iterations of model training, it is sometimes necessary to cache the data on local disk. This can be done by providing a cache directory, for example via `geocat.set_cache_directory(...)`.

Explain caching strategies to optimize repeated access:
- Local on-disk and in-memory caches
- Remote cache/backing store (S3, HTTP cache-control)
- Versioned cache keys and eviction policies
- Integration with tooling like fsspec and zarr
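
The versioned cache key idea above can be sketched with the standard library alone. This is a minimal illustration, assuming a local on-disk cache with one file per key; the `cache_key` and `fetch_with_cache` functions, the `"v1"` version string and the example bytes are hypothetical, and eviction is not covered.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cache_key(dataset_id, version, selection):
    """Stable key that changes when the dataset version or selection changes."""
    payload = json.dumps(
        {"id": dataset_id, "version": version, "selection": selection},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def fetch_with_cache(cache_dir, key, fetch):
    """Return (data, was_cached); call fetch() only on a cache miss."""
    path = Path(cache_dir) / f"{key}.bin"
    if path.exists():
        return path.read_bytes(), True
    data = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return data, False


with tempfile.TemporaryDirectory() as cache_dir:
    key = cache_key("cmip6_global_climate", "v1", {"variable": "tas"})
    first, hit_first = fetch_with_cache(cache_dir, key, lambda: b"example-bytes")
    second, hit_second = fetch_with_cache(cache_dir, key, lambda: b"example-bytes")

print(hit_first, hit_second)
```

Including the dataset version in the key means a new release never serves stale cached bytes; old entries simply stop being requested and can be evicted later.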

In [None]:
# Set a cache directory for storing downloaded data
geocat.set_cache_directory("/disks/storage/data_cache/")



## 6. Usage warnings and caveats (at _global_ and _variable_ levels)

Show how metadata and warnings are surfaced to users:
- Global dataset-level warnings (licence, known biases)
- Variable-level caveats (known gaps, quality flags, uncertainty)
- Programmatic APIs and human-readable displays for caveats and provenance
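
One possible way to surface caveats at both levels is through Python's standard `warnings` module, with an opt-out flag in the spirit of the `suppress_warnings=True` argument used earlier in this Notebook. The sketch below assumes that approach; the `DataUsageWarning` category, the `CAVEATS` structure, the function names and the caveat texts are all hypothetical.

```python
import warnings


class DataUsageWarning(UserWarning):
    """Hypothetical warning category for dataset caveats."""


# Invented caveats: one at the global level, one attached to a variable.
CAVEATS = {
    "global": ["Example caveat: check the licence before redistribution."],
    "variables": {"pr": ["Example caveat: known gaps in the record."]},
}


def caveats_for(variable=None):
    """Collect the global caveats plus any for the requested variable."""
    messages = list(CAVEATS["global"])
    if variable is not None:
        messages += CAVEATS["variables"].get(variable, [])
    return messages


def warn_on_load(variable, suppress_warnings=False):
    """Emit each caveat as a warning unless the caller opted out."""
    messages = caveats_for(variable)
    if not suppress_warnings:
        for message in messages:
            warnings.warn(message, DataUsageWarning, stacklevel=2)
    return messages


# Capture the warnings so they can be inspected programmatically.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_on_load("pr")

print([str(w.message) for w in caught])
```

Returning the messages as well as warning means the same caveats are available to code (for example an agent) and to a human reading the console.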


## 7. Agentic access (via MCP)

Outline patterns for agent/assistant-driven workflows:
- Model Context Protocol (MCP) servers describing capabilities and constraints
- Safe, auditable endpoints for agent queries and transformations
- Example agent workflows and allowed operations
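
The "allowed operations" and "auditable" ideas can be sketched as an allow-list dispatcher that logs every request. This does not implement MCP or any other agent protocol; it only illustrates the gating pattern, and every name in it (`ALLOWED_OPERATIONS`, `handle_agent_request`, `search_catalogue`, the agent identifier) is hypothetical.

```python
from datetime import datetime, timezone


def search_catalogue(keywords):
    """Stand-in for a read-only catalogue query."""
    return [f"dataset matching {kw}" for kw in keywords]


# Only operations listed here may be invoked on behalf of an agent.
ALLOWED_OPERATIONS = {"search": search_catalogue}

audit_log = []


def handle_agent_request(agent_id, operation, **params):
    """Log the request, then run it only if the operation is allow-listed."""
    allowed = operation in ALLOWED_OPERATIONS
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "agent": agent_id,
        "operation": operation,
        "allowed": allowed,
    })
    if not allowed:
        raise PermissionError(f"Operation {operation!r} is not permitted for agents")
    return ALLOWED_OPERATIONS[operation](**params)


result = handle_agent_request("agent-1", "search", keywords=["climate"])
try:
    handle_agent_request("agent-1", "delete", dataset="example")
except PermissionError as error:
    print(f"Refused: {error}")

print(result)
print([(entry["operation"], entry["allowed"]) for entry in audit_log])
```

Logging before the permission check ensures that refused requests are audited too.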

## 8. Accessing local and/or remote data (file system vs S3/HTTP)

Provide guidance for unified access:
- fsspec-backed paths for local, S3, HTTP, and authenticated stores
- Handling credentials and environment-sensitive configs
- Performance considerations for remote vs local access
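
Unified access usually comes down to dispatching on the URL scheme, which is the kind of routing that libraries such as fsspec provide. The standard-library sketch below illustrates only the dispatch step: it assumes POSIX-style local paths, implements just the local reader, and uses hypothetical names (`classify_location`, `READERS`, `read_bytes`) and an invented example bucket path.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlsplit


def classify_location(url):
    """Map a path or URL to a (backend, location) pair based on its scheme."""
    parts = urlsplit(url)
    if parts.scheme in ("", "file"):
        return "local", parts.path
    if parts.scheme == "s3":
        return "s3", f"{parts.netloc}{parts.path}"
    if parts.scheme in ("http", "https"):
        return "http", url
    raise ValueError(f"Unsupported scheme: {parts.scheme!r}")


def read_local(location):
    return Path(location).read_bytes()


# Remote readers (S3, HTTP) would be registered here in a fuller version.
READERS = {"local": read_local}


def read_bytes(url):
    """Read from whichever backend the URL points at, if one is registered."""
    backend, location = classify_location(url)
    if backend not in READERS:
        raise NotImplementedError(f"No reader registered for backend {backend!r}")
    return READERS[backend](location)


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"abc")
    local_bytes = read_bytes(str(sample))

print(local_bytes)
print(classify_location("s3://example-bucket/cmip6/tas.zarr"))
```

Keeping the calling code identical across backends is what lets the same workflow run against a local file system or an object store.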

## 9. Handling restricted data with access control

Describe access control patterns and provenance:
- Authentication and authorization flows (OAuth, tokens)
- Row/column-level and dataset-level access policies
- Auditing, logging and secure compute patterns for sensitive data
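
A dataset-level policy check might look like the sketch below. It is deliberately simplified: tokens are looked up in an in-memory table rather than validated against an OAuth provider, and the dataset identifiers, role names, token value and `can_access` function are all hypothetical.

```python
# Roles required per dataset; an empty set means the dataset is open.
POLICIES = {
    "open_dataset": set(),
    "restricted_dataset": {"licensed_user"},
}

# Roles granted to each token (a real service would verify signed tokens).
TOKEN_ROLES = {
    "example-token": {"licensed_user"},
}


def can_access(dataset_id, token=None):
    """True if the dataset is open or the token carries a required role."""
    if dataset_id not in POLICIES:
        return False  # unknown datasets are denied by default
    required = POLICIES[dataset_id]
    if not required:
        return True
    granted = TOKEN_ROLES.get(token, set())
    return bool(required & granted)


print(can_access("open_dataset"))
print(can_access("restricted_dataset"))
print(can_access("restricted_dataset", token="example-token"))
```

Denying by default for unknown datasets and unknown tokens keeps a misconfigured policy from silently exposing restricted data.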

## 10. Benchmarking

Define benchmarks and reproducible tests for performance:
- Common read/load/transform benchmarks (throughput, latency, memory)
- Dataset and hardware profiling guidance
- Reproducible scripts and CI-friendly performance checks
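
A reproducible benchmark can start from the standard library. The sketch below times a callable several times with `time.perf_counter` and then measures peak Python memory allocation in a separate run with `tracemalloc`, since tracing slows execution and would distort the timings. The `benchmark` function and the `workload` stand-in are hypothetical; a real benchmark would call the data loading step instead.

```python
import statistics
import time
import tracemalloc


def benchmark(func, repeats=5):
    """Time func over several runs, then measure peak memory in one more run."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)

    # Separate traced run, so tracing overhead does not affect the timings.
    tracemalloc.start()
    func()
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {
        "repeats": repeats,
        "min_s": min(durations),
        "median_s": statistics.median(durations),
        "peak_bytes": peak_bytes,
    }


def workload():
    """Stand-in for a read/transform step: build a list and average it."""
    data = [float(i) for i in range(50_000)]
    return sum(data) / len(data)


report = benchmark(workload)
print(report)
```

Reporting the minimum and median rather than the mean makes the result less sensitive to one-off interference from other processes, which helps when the check runs in CI.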