# Mockup of use cases and vision for CEDA adoption of GeoCroissant

The Centre for Environmental Data Analysis (CEDA), and its partner UK environmental data centres, are working on multiple projects aimed at making their data more _AI-ready_. What we mean by _AI-readiness_ is that the data should be:
- easy to find
- easy to access
- efficient to process/load at scale
- integrated with local/remote performant caching
- easy to transform and load into Machine Learning workflows
- easy for Agentic AI to interact with
- self-describing in terms of its characteristics in relation to usage, such as:
  - caveats on usage
  - consideration of data quality and uncertainty
  - clarification of biases in the collection and construction of the data

These characteristics are highlighted in the following sections of this Notebook:
1. Discover, search and query
2. Interrogate the contents of a dataset
3. Filter and subset
4. Extract, transform and load
5. Copying data to a local cache
6. Usage warnings and caveats (at _global_ and _variable_ levels)
7. Integration with ML packages (PyTorch)
8. Agentic access (via MCP)
9. Accessing local and/or remote data (file system vs S3/HTTP)
10. Handling restricted data with access control
11. Benchmarking

### Firstly, we'll make some imports to set up the Notebook

**NOTE: this is a synthetic notebook that uses _mock_ packages. It is intended as a useful tool for describing (and proposing) a narrative on how `geocroissant` might work.**

In [16]:
# Import libraries from the external mock module
import sys
import os

# Add current directory to Python path to find mocklib.py
current_dir = os.path.abspath('.')
if current_dir not in sys.path:
    sys.path.insert(0, current_dir)

# Import from our mock module
from mocklib import (
    croissant, torch, xr, STACIntegration, DataLoader, Dataset,
    torch_nn as nn, matplotlib_pyplot as plt, cartopy_crs as ccrs, 
    cartopy_feature as cfeature, pystac_client as Client, ceda_auth
)

# Standard libraries (these are real)
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

print("✅ All libraries imported successfully!\n")
print(f"Croissant version: {croissant.__version__}")
print(f"PyTorch version: {torch.__version__}")
print(f"Xarray version: {xr.__version__}")
print(f"Cartopy version: 0.22.0")
print(f"STAC Client version: 0.7.0")

✅ All libraries imported successfully!

Croissant version: 1.2.3
PyTorch version: 2.1.0
Xarray version: 2023.10.1
Cartopy version: 0.22.0
STAC Client version: 0.7.0


## 1. Discover, search and query

At the top level, users should have a single Python API from which they can explore _all data_. In this example, we imagine that there is a `GeoCroissant` object imported from `croissant` that you can create an instance of by giving it the URL to a (Geo-)Croissant catalogue.

The end-point serves up _geo-aware_ dataset records that can be interrogated.

Note that the `GeoCroissant` object can be interrogated in multiple ways:
1. Using built-in operations for space and time:
  - `spatial_coverage`
  - `temporal_coverage`
2. By keywords - based on those tagged in the datasets
3. By _facets_:
  - Picking up domain-specific vocabularies for different datasets, such as:
    - Satellite data: `sensor_id`, `platform`
    - Climate simulations: `ensemble_member`, `grid_type`, `frequency`


In [17]:
# Initialize GeoCroissant with multiple data sources
geocat = croissant.GeoCroissant(provider="https://catalogue.ceda.ac.uk/croissant/")  # type: ignore

# Search for climate datasets
datasets = geocat.search(
    spatial_coverage=[-30, -10, 40, 30],  # Example bounding box [min_lon, min_lat, max_lon, max_lat]
    temporal_range=("2015-01-01", "2100-12-31"),
    keywords=["climate", "temperature", "precipitation"],
    # facets={"model": ["UKESM1-0-LL", "HadGEM3-GC31-LL"]},
)

print(f"Found {len(datasets)} matching datasets:\n")
for i, dataset in enumerate(datasets[:5]):  # Show first 5
    print(f"{i+1}. {dataset.name}")
    print(f"     Description: {dataset.description}")
    print(f"     Provider: {dataset.provider}")
    print(f"     Variables: {', '.join(dataset.variables[:3])}...")
    print(f"     Spatial Resolution: {dataset.spatial_resolution}")
    print(f"     Temporal Resolution: {dataset.temporal_resolution}")
    print()

Initializing GeoCroissant client using provider: https://catalogue.ceda.ac.uk/croissant/
Found 3 matching datasets:

1. CMIP6_Global_Climate_Projections
     Description: Multi-model ensemble of global climate projections from CMIP6
     Provider: ESGF Data Nodes
     Variables: temperature, precipitation, pressure...
     Spatial Resolution: 1.25° x 1.25°
     Temporal Resolution: monthly

2. ERA5_Reanalysis_Global
     Description: ECMWF ERA5 atmospheric reanalysis dataset
     Provider: Copernicus Climate Data Store
     Variables: temperature, wind, pressure...
     Spatial Resolution: 0.25° x 0.25°
     Temporal Resolution: hourly

3. MODIS_Land_Surface_Temperature
     Description: MODIS satellite-derived land surface temperature
     Provider: NASA EARTHDATA
     Variables: land_surface_temperature, emissivity...
     Spatial Resolution: 1km
     Temporal Resolution: daily
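
A search like the one above combines three tests: bounding-box intersection, time-range overlap and keyword matching. The following is a minimal sketch of that logic, assuming a hypothetical `Record` structure and `search` function (neither is part of any real `croissant` API), and ignoring bounding boxes that cross the antimeridian.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Record:
    """Hypothetical catalogue record; field names are illustrative only."""
    name: str
    bbox: tuple        # (min_lon, min_lat, max_lon, max_lat)
    start: date
    end: date
    keywords: frozenset


def bbox_intersects(a, b):
    """True if two (min_lon, min_lat, max_lon, max_lat) boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(records, bbox, time_range, keywords):
    """Return records overlapping the box and period, sharing any keyword."""
    start, end = time_range
    wanted = set(keywords)
    return [
        r for r in records
        if bbox_intersects(r.bbox, bbox)
        and r.start <= end and start <= r.end
        and wanted & r.keywords
    ]


records = [
    Record("global_projection", (-180, -90, 180, 90),
           date(2015, 1, 1), date(2100, 12, 31),
           frozenset({"climate", "temperature"})),
    Record("pacific_historical", (150, -20, 179, 20),
           date(1950, 1, 1), date(2014, 12, 31),
           frozenset({"climate"})),
]

hits = search(records, (-30, -10, 40, 30),
              (date(2015, 1, 1), date(2100, 12, 31)),
              ["temperature", "precipitation"])
print([r.name for r in hits])
```

Only the first record is returned: the second lies outside both the bounding box and the requested period.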



## 2. Interrogate the contents of a dataset

The catalogue and dataset objects expose methods that allow the user to directly interrogate them regarding their contents. 

Initially, `<dataset>.get_props("__available__")` returns a list of the possible properties (or _facets_) that the dataset exposes. After that call, the user can use `<dataset>.get_props("<prop_name>")` to find out which values can be selected for each property.

**NOTE: A warning appears to provide guidance on how the data can/cannot be used.**
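
The two-step pattern described above (list the property names, then list the values of one property) could be backed by something as simple as a dictionary lookup. This is a sketch only: `FacetedDataset` is a hypothetical stand-in and does not reflect how a real GeoCroissant dataset object would be implemented.

```python
class FacetedDataset:
    """Hypothetical stand-in for a dataset object exposing `get_props`."""

    def __init__(self, props):
        # props maps a property name to the values that can be selected
        self._props = dict(props)

    def get_props(self, name):
        """Return property names for "__available__", else that property's values."""
        if name == "__available__":
            return sorted(self._props)
        try:
            return list(self._props[name])
        except KeyError:
            raise KeyError(
                f"Unknown property {name!r}; choose from {sorted(self._props)}"
            ) from None


ds = FacetedDataset({
    "models": ["CESM2", "UKESM1-0-LL"],
    "experiments": ["ssp126", "ssp585"],
})

# First discover what can be asked, then ask for each property in turn
for prop in ds.get_props("__available__"):
    print(prop, "->", ds.get_props(prop))
```

Raising an error that lists the valid names keeps the interface self-describing, even when a caller (human or agent) guesses a property that does not exist.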

In [18]:
# Load a dataset (e.g., CMIP6)
cmip6_dataset = geocat.load_dataset("CMIP6_Global_Climate_Projections")

# Use the generic interrogation API to explore the dataset
print("🔍 Interrogating CMIP6 dataset contents...")

# List available properties
available_props = cmip6_dataset.get_props("__available__")
print(f"🔧 Other Available Properties:")
print(f"Available props: {', '.join(available_props)}")
print(f"Use cmip6_dataset.get_props('property_name') to explore any of these:")

# Get available climate models using generic props interface
models = cmip6_dataset.get_props("models")
print(f"📊 Available Climate Models ({len(models)}):")
for model in models[:8]:
    print(f"  - {model.name}: {model.institution}")
    print(f"    Resolution: {model.nominal_resolution}")
    print(f"    Experiments: {len(model.experiments)}")
    print()

# Get available experiments
experiments = cmip6_dataset.get_props("experiments")
print(f"🧪 Available Experiments ({len(experiments)}):")
for exp in experiments[:5]:
    print(f"  - {exp.experiment_id}: {exp.description}")
    print(f"    Activity: {exp.activity_id}")
    print(f"    Models: {len(exp.participating_models)}")
    print()

# Get available variables
variables = cmip6_dataset.get_props("variables")
print(f"🌡️ Available Variables ({len(variables)}):")
for var in variables[:10]:
    print(f"  - {var.variable_id}: {var.long_name}")
    print(f"    Units: {var.units}")
    print(f"    Frequency: {var.frequency}")
    print(f"    Dimensions: {var.dimensions}")
    print()

# Show other available properties that can be interrogated
available_props = cmip6_dataset.get_props("__available__")
print(f"🔧 Other Available Properties:")
print(f"Available props: {', '.join(available_props)}")
print(f"Use cmip6_dataset.get_props('property_name') to explore any of these:")

🔍 Interrogating CMIP6 dataset contents...
🔧 Other Available Properties:
Available props: models, experiments, variables, frequencies, realms, institutions, grids, time_ranges
Use cmip6_dataset.get_props('property_name') to explore any of these:
📊 Available Climate Models (5):
  - CESM2: NCAR
    Resolution: 0.9x1.25 deg
    Experiments: 3

  - GFDL-ESM4: NOAA-GFDL
    Resolution: 0.5 deg
    Experiments: 4

  - UKESM1-0-LL: MOHC
    Resolution: 1.25x1.875 deg
    Experiments: 3

  - IPSL-CM6A-LR: IPSL
    Resolution: 1.27x2.5 deg
    Experiments: 4

  - MPI-ESM1-2-HR: MPI-M
    Resolution: 0.94x0.94 deg
    Experiments: 3

🧪 Available Experiments (5):
  - ssp126: Low emissions scenario
    Activity: ScenarioMIP
    Models: 12

  - ssp245: Medium emissions scenario
    Activity: ScenarioMIP
    Models: 15

  - ssp370: Medium-high emissions scenario
    Activity: ScenarioMIP
    Models: 8

  - ssp585: High emissions scenario
    Activity: ScenarioMIP
    Models: 18

  - histo

## 3. Filter and subset

Before any data is actually loaded, the contents of the required dataset can be filtered. This all uses _lazy loading_, which means that the software stores a graph of the required operations and only executes them when the data arrays themselves are needed (e.g. for model training, analysis or visualisation).

Again, this allows the specification of _generic_ properties, such as _space_ and _time_, along with _dataset-specific_ facets such as `model`.
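
To illustrate what lazy loading means here, the sketch below records operations and defers all reading and computation until `compute()` is called. `LazyArray` is a hypothetical class written for this illustration, and it stores a simple linear chain of operations rather than a full graph; in practice, libraries such as xarray with dask provide this behaviour.

```python
class LazyArray:
    """Records operations and runs them only when `compute()` is called."""

    def __init__(self, loader, ops=()):
        self._loader = loader      # zero-argument callable that reads the data
        self._ops = tuple(ops)     # pending operations, in application order

    def map(self, func):
        # Return a new lazy object; nothing is read or computed here
        return LazyArray(self._loader, self._ops + (func,))

    def select(self, start, stop):
        return self.map(lambda values: values[start:stop])

    def compute(self):
        values = self._loader()
        for op in self._ops:
            values = op(values)
        return values


reads = []


def load_from_disk():
    reads.append("read")           # track when the (pretend) I/O happens
    return [280.0, 281.5, 283.0, 284.5, 286.0]


lazy = LazyArray(load_from_disk).select(1, 4).map(
    lambda values: [v - 273.15 for v in values]   # Kelvin to Celsius
)
reads_before = len(reads)          # still zero: nothing has been read yet
result = lazy.compute()
print(reads_before, len(reads), [round(v, 2) for v in result])
```

The subset and the unit conversion are both declared before any I/O takes place, so only the work that is actually needed is carried out.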


In [19]:
# Load the CMIP6 dataset with filter options
cmip6_dataset = geocat.load_dataset(
    "CMIP6_Global_Climate_Projections",
    spatial_subset=[-20, 10, 30, 50],  # [min_lon, min_lat, max_lon, max_lat]
    temporal_subset=("2020-01-01", "2050-12-31"),
    variables=["tas", "pr", "psl"],  # Surface air temperature, precipitation and pressure
    facets={"model": ["UKESM1-0-LL", "HadGEM3-GC31-LL"]},
    suppress_warnings=True,
)

# The dataset is loaded with STAC integration
print("✅ CMIP6 dataset loaded successfully!")
print(f"Dataset ID: {cmip6_dataset.id}")
print(f"Title: {cmip6_dataset.title}")
print(f"Description: {cmip6_dataset.description}")
print(f"License: {cmip6_dataset.license}")
print(f"Extent: {cmip6_dataset.spatial_extent}")
print(f"Time Range: {cmip6_dataset.temporal_extent}")

# Show STAC catalog structure
print(f"\n📁 STAC Catalog Structure:")
print(f"Collections: {len(cmip6_dataset.collections)}")
for collection in cmip6_dataset.collections[:3]:
    print(f"  - {collection.id}: {collection.title}")
    print(f"    Items: {len(collection.items)}")
    print(f"    Variables: {', '.join(collection.summaries.get('variables', [])[:5])}")
    print()


✅ CMIP6 dataset loaded successfully!
Dataset ID: cmip6_global_climate
Title: CMIP6 Global Climate Projections
Description: Comprehensive climate model data from CMIP6 including temperature, precipitation, and atmospheric variables
License: CC-BY-4.0
Extent: {'bbox': [-20, 10, 30, 50]}
Time Range: {'interval': [['2020-01-01', '2050-12-31']]}

📁 STAC Catalog Structure:
Collections: 3
  - temperature: Surface Temperature
    Items: 120
    Variables: tas, tasmax, tasmin, pr, huss

  - precipitation: Precipitation
    Items: 120
    Variables: pr, prc, prsn, prw, evspsbl

  - atmospheric: Atmospheric Variables
    Items: 120
    Variables: psl, ua, va, zg, hus



Or, alternatively, **apply filters after loading a dataset**...

In [20]:
# Load a dataset
cmip6_dataset = geocat.load_dataset("CMIP6_Global_Climate_Projections", suppress_warnings=True)

# Define filtering selection criteria
selection_criteria = {
    'model': 'CESM2',  # Community Earth System Model
    'experiment': 'ssp585',  # High emissions scenario
    'variable': 'tas',  # Near-surface air temperature
    'frequency': 'monthly',
    'spatial_bounds': {
        'lat': [30, 70],  # Northern hemisphere focus
        'lon': [-130, -60]  # North America
    },
    'temporal_bounds': {
        'start': '2020-01-01',
        'end': '2050-12-31'
    }
}

print("🎯 Applying selection criteria:")
for key, value in selection_criteria.items():
    print(f"  {key}: {value}")

# Apply the filters using GeoCroissant's filtering API
print("\n🔄 Filtering dataset...")
filtered_dataset = cmip6_dataset.filter(**selection_criteria)

# Display summary of the filtered dataset
print(f"✅ Filtered dataset created!")
print(f"Original size: {cmip6_dataset.estimated_size_gb:.1f} GB")
print(f"Filtered size: {filtered_dataset.estimated_size_gb:.1f} GB")
print(f"Reduction: {(1 - filtered_dataset.estimated_size_gb/cmip6_dataset.estimated_size_gb)*100:.1f}%")

# Show the structure of the filtered dataset
print(f"\n📋 Filtered Dataset Structure:")
print(f"Variables: {filtered_dataset.variables}")
print(f"Spatial shape: {filtered_dataset.spatial_shape}")
print(f"Temporal shape: {filtered_dataset.temporal_shape}")
print(f"Total timesteps: {filtered_dataset.n_timesteps}")
print(f"Data format: {filtered_dataset.data_format}")  # xarray or tensor ready

🎯 Applying selection criteria:
  model: CESM2
  experiment: ssp585
  variable: tas
  frequency: monthly
  spatial_bounds: {'lat': [30, 70], 'lon': [-130, -60]}
  temporal_bounds: {'start': '2020-01-01', 'end': '2050-12-31'}

🔄 Filtering dataset...
✅ Filtered dataset created!
Original size: 1250.0 GB
Filtered size: 85.2 GB
Reduction: 93.2%

📋 Filtered Dataset Structure:
Variables: ['tas']
Spatial shape: (40, 70)
Temporal shape: (372,)
Total timesteps: 372
Data format: xarray


## 4. Extract, transform and load

For use in Machine Learning workflows, the data will often need to be transformed in structure. 

Transformers can be applied to the `load_dataset(...)` operation, or applied afterwards. In this example, the data is regridded to a 1 degree grid and converted from 64-bit floats (_double_) to 32-bit floats.

Additionally, masked (missing) values are replaced with the mean of the valid values of each variable.
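
As a rough illustration of how such a chain of transformers might behave, the sketch below applies type coercion, mean imputation and a crude block-average "regrid" to a small grid. The class names are hypothetical and the block average is a simplification; real regridding would account for coordinates and cell areas.

```python
import math


class TypeCoercion:
    """Convert every value with the given callable (stand-in for a dtype)."""
    def __init__(self, cast=float):
        self.cast = cast

    def __call__(self, grid):
        return [[self.cast(v) for v in row] for row in grid]


class MeanImputer:
    """Replace missing values (NaN) with the mean of the valid values."""
    def __call__(self, grid):
        valid = [v for row in grid for v in row if not math.isnan(v)]
        mean = sum(valid) / len(valid)
        return [[mean if math.isnan(v) else v for v in row] for row in grid]


class Coarsen:
    """Crude regridding: average non-overlapping factor x factor blocks."""
    def __init__(self, factor):
        self.factor = factor

    def __call__(self, grid):
        f = self.factor
        return [
            [
                sum(grid[i + di][j + dj]
                    for di in range(f) for dj in range(f)) / (f * f)
                for j in range(0, len(grid[0]) - f + 1, f)
            ]
            for i in range(0, len(grid) - f + 1, f)
        ]


def apply_transformers(grid, transformers):
    """Apply each transformer in order, feeding the output to the next."""
    for transform in transformers:
        grid = transform(grid)
    return grid


field = [
    [1, 2, 3, 4],
    [5, math.nan, 7, 8],
]
out = apply_transformers(field, [TypeCoercion(float), MeanImputer(), Coarsen(2)])
print(out)
```

The order of the steps matters: in this sketch imputation runs before coarsening so that a single missing value does not turn a whole block mean into NaN.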

In [21]:
# Load a dataset and apply transformations during the conversion
from mocklib import croissant


cmip6_dataset = geocat.load_dataset(
    "CMIP6_Global_Climate_Projections",
    spatial_subset=[-20, 10, 30, 50],  # [min_lon, min_lat, max_lon, max_lat]
    temporal_subset=("2020-01-01", "2050-12-31"),
    variables=["tas", "pr", "psl"],  # Surface air temperature, precipitation and pressure
    facets={"model": ["UKESM1-0-LL", "HadGEM3-GC31-LL"]},
    transformers=[
        croissant.transformers.RegridTransformer(target_grid="1deg"),  # Regrid to 1 degree
        croissant.transformers.TypeCoercionTransformer(dtype="float32"),  # Convert to 32-bit floats
        croissant.transformers.MissingValueImputer(strategy="mean")  # Impute missing values with mean
    ]
)

print("✅ CMIP6 dataset prepared to load with transformations applied!")
print("Transformations: ")
for transformer in cmip6_dataset.transformers:
    print(transformer)


✅ CMIP6 dataset prepared to load with transformations applied!
Transformations: 
Transformer type: RegridTransformer, Specification: {'target_grid': '1deg'}
Transformer type: TypeCoercionTransformer, Specification: {'dtype': 'float32'}
Transformer type: MissingValueImputer, Specification: {'strategy': 'mean'}


## 5. Copying data to a local cache

Since large geospatial datasets may be used for many epochs/iterations of model training, it is sometimes necessary to cache the data on local disk. This can be done by providing a cache directory to the dataset's `preload(...)` method (the `cache_dir` argument in the example below).

Caching strategies that could optimise repeated access include:
- Local on-disk and in-memory caches
- Remote cache/backing store (S3, HTTP cache-control)
- Versioned cache keys and eviction policies
- Integration with tooling like fsspec and zarr
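
To make the idea of versioned cache keys concrete, here is a minimal on-disk cache sketch using only the standard library. The function names, the key scheme and the placeholder URL are assumptions for illustration; eviction is not shown. Existing tools, such as the caching filesystems in fsspec (for example its `simplecache` and `filecache` protocols), already cover much of this ground.

```python
import hashlib
import tempfile
from pathlib import Path


def cache_key(url, version):
    """Derive a file name from the source URL and a dataset version string."""
    digest = hashlib.sha256(f"{version}|{url}".encode("utf-8")).hexdigest()
    return digest[:32]


def fetch_cached(url, version, cache_dir, download):
    """Return (path, was_hit); call `download(url)` only on a cache miss."""
    path = Path(cache_dir) / cache_key(url, version)
    if path.exists():
        return path, True
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".part")
    tmp.write_bytes(download(url))   # write under a temporary name first
    tmp.replace(path)                # then rename, so partial files are never read
    return path, False


calls = []


def fake_download(url):
    calls.append(url)                # stands in for an HTTP or S3 read
    return b"pretend file contents"


url = "https://example.org/data/tas_2020.nc"   # placeholder URL
with tempfile.TemporaryDirectory() as cache_dir:
    _, hit_first = fetch_cached(url, "v1", cache_dir, fake_download)
    _, hit_second = fetch_cached(url, "v1", cache_dir, fake_download)
    _, hit_new_version = fetch_cached(url, "v2", cache_dir, fake_download)

print(hit_first, hit_second, hit_new_version, len(calls))
```

Because the version is part of the key, publishing a new dataset version naturally bypasses stale entries without needing to clear the cache.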

In [22]:
# Set a cache directory for storing downloaded data
CACHE_DIR = "/disks/storage/data_cache"

# Pre-load all data into the local cache
cmip6_dataset.preload(cache_dir=CACHE_DIR, n_workers=16)

Preparing to download 500000 data files to cache directory: /disks/storage/data_cache
  using 16 worker processes.


  5%|▌         | 1/20 [00:00<00:05,  3.33it/s]
Caching files: 0 to 25,000

 10%|█         | 2/20 [00:00<00:05,  3.32it/s]
Caching files: 25,000 to 50,000

 15%|█▌        | 3/20 [00:00<00:05,  3.32it/s]
Caching files: 50,000 to 75,000

 20%|██        | 4/20 [00:01<00:04,  3.32it/s]
Caching files: 75,000 to 100,000

 25%|██▌       | 5/20 [00:01<00:04,  3.32it/s]
Caching files: 100,000 to 125,000

 30%|███       | 6/20 [00:01<00:04,  3.32it/s]
Caching files: 125,000 to 150,000

 35%|███▌      | 7/20 [00:02<00:03,  3.31it/s]
Caching files: 150,000 to 175,000

 40%|████      | 8/20 [00:02<00:03,  3.31it/s]
Caching files: 175,000 to 200,000

 45%|████▌     | 9/20 [00:02<00:03,  3.31it/s]
Caching files: 200,000 to 225,000

 50%|█████     | 10/20 [00:03<00:03,  3.31it/s]
Caching files: 225,000 to 250,000

 55%|█████▌    | 11/20 [00:03<00:02,  3.31it/s]
Caching files: 250,000 to 275,000

 60%|██████    | 12/20 [00:03<00:02,  3.31it/s]
Caching files: 275,000 to 300,000

 65%|██████▌   | 13/20 [00:03<00:02,  3.31it/s]
Caching files: 300,000 to 325,000

 70%|███████   | 14/20 [00:04<00:01,  3.31it/s]
Caching files: 325,000 to 350,000

 75%|███████▌  | 15/20 [00:04<00:01,  3.31it/s]
Caching files: 350,000 to 375,000

 80%|████████  | 16/20 [00:04<00:01,  3.32it/s]
Caching files: 375,000 to 400,000

 85%|████████▌ | 17/20 [00:05<00:00,  3.24it/s]
Caching files: 400,000 to 425,000

 90%|█████████ | 18/20 [00:05<00:00,  3.25it/s]
Caching files: 425,000 to 450,000

 95%|█████████▌| 19/20 [00:05<00:00,  3.27it/s]
Caching files: 450,000 to 475,000

100%|██████████| 20/20 [00:06<00:00,  3.30it/s]
Caching files: 475,000 to 500,000


Caching completed. 240TiB downloaded.





## 6. Usage warnings and caveats (at _global_ and _variable_ levels)

When building APIs like this, it is important that provenance and usage metadata, including caveats and warnings, are provided to users at the:
- Global dataset-level (licence, known biases)
- Variable-level (known gaps, quality flags, uncertainty)

By default, these are extracted from the metadata records and are exposed to users within the environment they are working in. When using a Jupyter Notebook, they are highlighted as follows.
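
One way to surface such caveats, sketched below, is to raise them through Python's standard `warnings` machinery so that notebooks, scripts and logs all see them. The `DataUsageWarning` category, the `emit_caveats` helper and the metadata layout are hypothetical; the caveat texts are abbreviated from the example warnings shown later in this section.

```python
import warnings


class DataUsageWarning(UserWarning):
    """Hypothetical warning category for data caveats."""


def emit_caveats(metadata, scope):
    """Raise one Python warning per caveat recorded in a metadata dict."""
    for caveat in metadata.get("caveats", []):
        warnings.warn(f"[{scope}] {caveat}", DataUsageWarning, stacklevel=2)


dataset_meta = {"caveats": ["Models differ in temporal coverage and resolution."]}
variable_meta = {"ua": {"caveats": ["Provided on a staggered grid."]}}

# Capture the warnings here so they can be inspected; normally they would
# simply be displayed to the user
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    emit_caveats(dataset_meta, scope="dataset")
    emit_caveats(variable_meta["ua"], scope="variable ua")

for w in caught:
    print(w.category.__name__, "-", w.message)
```

With a dedicated category, users who have read the caveats can silence them selectively, for example with `warnings.filterwarnings("ignore", category=DataUsageWarning)`, which is one possible mechanism behind an option like `suppress_warnings=True`.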


In [23]:
print("A dataset-level warning:")
cmip6_dataset = geocat.load_dataset("CMIP6_Global_Climate_Projections")



In [24]:
print("A variable-level warning:")
ua = cmip6_dataset.variables["ua"]



The warnings and metadata can also be accessed as properties of the dataset object:

In [25]:
from pprint import pprint
print("Warnings:\n---------")

print("\nDataset-level warnings:")
pprint(cmip6_dataset.warnings)

print("\nVariable-level warnings:")
pprint(ua.warnings)

---------

[{'message': 'The CMIP6 Dataset has the following important factors:\n'
             '\n'
             '    - It is a multi-model ensemble of global climate '
             'projections.\n'
             '    - The dataset includes variables such as temperature, '
             'precipitation, and wind.\n'
             '    - It is available at a spatial resolution of 1.25° x 1.25°.\n'
             '    - Different models will have varying temporal coverages and '
             'spatial resolutions.\n'
             '    See: more information at <a '
             'href="https://esgf-node.llnl.gov/projects/cmip6/">https://esgf-node.llnl.gov/projects/cmip6/</a>\n'
             '        ',
  'title': 'Important information about the CMIP6 Dataset'}]

[{'message': "The Eastward Near-Surface Wind ('ua') variable:\n"
             '            - is provided on a staggered grid when compared to '
             'non-wind surface variables.\n'
             '            - has the units m/s

## 7. Integration with ML packages (PyTorch)

The integration with Machine Learning packages should be as seamless as possible, allowing transformations, batching, normalisation and other operations to be defined. The API should allow the user to convert a dataset object directly into a PyTorch `torch.utils.data.Dataset` or a TensorFlow `tf.data.Dataset`, ready for use in model training, evaluation or inference.

For example, to convert to a PyTorch dataset:

In [26]:
# Convert to PyTorch Dataset using GeoCroissant's ML integration
climate_dataset = filtered_dataset.to_pytorch_dataset(
    target_variable='tas',  # Temperature as target
    feature_variables=['tas'],  # Using same variable for demo (can add more)
    sequence_length=12,  # 12-month sequences
    stride=1,  # Monthly stride
    normalize=True,  # Apply standardization
    transform='spatiotemporal'  # Prepare for spatiotemporal ML
)

print(f"✅ PyTorch Dataset created!")
print(f"Dataset length: {len(climate_dataset)}")
print(f"Sample shape: {climate_dataset[0][0].shape}")  # [features, time, lat, lon]
print(f"Target shape: {climate_dataset[0][1].shape}")  # [time, lat, lon]
print(f"Data type: {climate_dataset[0][0].dtype}")

# Create DataLoader for training
batch_size = 4
train_loader = DataLoader(
    climate_dataset, 
    batch_size=batch_size, 
    shuffle=True,
    num_workers=2
)

print(f"\n📦 DataLoader created with batch size {batch_size}")
print(f"Number of batches: {len(train_loader)}")

# Inspect a batch
sample_batch = next(iter(train_loader))
features, targets = sample_batch
print(f"\nBatch shapes:")
print(f"Features: {features.shape}")  # [batch, features, time, lat, lon]
print(f"Targets: {targets.shape}")    # [batch, time, lat, lon]
print(f"Features range: [{features.min():.3f}, {features.max():.3f}]")
print(f"Targets range: [{targets.min():.3f}, {targets.max():.3f}]")

✅ PyTorch Dataset created!
Dataset length: 360
Sample shape: (1, 12, 40, 70)
Target shape: (12, 40, 70)
Data type: float32

📦 DataLoader created with batch size 4
Number of batches: 90

Batch shapes:
Features: (4, 1, 12, 40, 70)
Targets: (4, 12, 40, 70)
Features range: [-4.312, 4.120]
Targets range: [-4.197, 4.806]
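
The mock output reports 360 samples from 372 monthly timesteps with `sequence_length=12` and `stride=1`. The notebook does not say how the windows are built; one scheme consistent with those numbers, shown here purely as an assumption, pairs each 12-month feature window with a target window shifted one step forward. The `window_indices` helper and its `shift` parameter are hypothetical.

```python
def window_indices(n_timesteps, sequence_length, stride=1, shift=1):
    """Return (feature_slice, target_slice) pairs for sliding windows.

    The target window is the feature window moved `shift` steps forward.
    """
    last_start = n_timesteps - sequence_length - shift
    return [
        (slice(i, i + sequence_length),
         slice(i + shift, i + shift + sequence_length))
        for i in range(0, last_start + 1, stride)
    ]


pairs = window_indices(n_timesteps=372, sequence_length=12, stride=1, shift=1)
print(len(pairs))
print(pairs[0], pairs[-1])
```

The final target window ends exactly at timestep 372, so no sample reaches beyond the filtered period.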


Once the data is converted, it can be directly included in a model training run:

In [27]:
# Define a simple CNN model for climate prediction
class ClimateCNNModel(nn.Module):
    def __init__(self, input_channels=1, hidden_dim=64):
        super(ClimateCNNModel, self).__init__()
        # Mock layers - would normally be actual PyTorch layers
        self.conv1 = nn.Conv2d(input_channels, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(64, hidden_dim, batch_first=True)
        self.output_layer = nn.Linear(hidden_dim, 1)
        self.relu = nn.ReLU()
    
    def forward(self, x):
        # Mock forward pass - returns random values in the target shape (12, 40, 70)
        batch_size = x.shape[0]
        return torch.Tensor(np.random.randn(batch_size, 12, 40, 70))
    
    def to(self, device):
        return self  # Mock .to() method
    
    def __call__(self, x):
        return self.forward(x)

# Initialize model, loss, and optimizer
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = ClimateCNNModel(input_channels=1).to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

print(f"✅ Model initialized on device: {device}")
print(f"Model architecture: ClimateCNNModel")
print(f"Loss function: MSE Loss")
print(f"Optimizer: Adam (lr=0.001)")

# Simulate training loop
print(f"\n Starting training simulation...")
n_epochs = 10
train_losses = []

for epoch in range(n_epochs):
    epoch_loss = 0.0
    n_batches = 0
    
    # Simulate training over a few batches
    for batch_idx, (features, targets) in enumerate(train_loader):
        if batch_idx >= 3:  # Just simulate 3 batches per epoch
            break
            
        # Mock training step
        optimizer.zero_grad()
        
        # Forward pass (mock)
        predictions = model(features)
        loss = criterion(predictions, targets)
        
        # Mock backward pass
        # loss.backward()  # Would normally do backprop
        optimizer.step()
        
        epoch_loss += loss.min()  # Use min as mock loss value
        n_batches += 1

    avg_loss = epoch_loss / n_batches if n_batches > 0 else 0
    train_losses.append(avg_loss)
    
    print(f"Epoch {epoch+1}/{n_epochs} - Average Loss: {avg_loss:.4f}")

print(f"\n✅ Training simulation completed!")
print(f"Final training loss: {train_losses[-1]:.4f}")
print(f"Model ready for climate prediction tasks")

✅ Model initialized on device: cpu
Model architecture: ClimateCNNModel
Loss function: MSE Loss
Optimizer: Adam (lr=0.001)

 Starting training simulation...
Epoch 1/10 - Average Loss: 0.5000
Epoch 2/10 - Average Loss: 0.5000
Epoch 3/10 - Average Loss: 0.5000
Epoch 4/10 - Average Loss: 0.5000
Epoch 5/10 - Average Loss: 0.5000
Epoch 6/10 - Average Loss: 0.5000
Epoch 7/10 - Average Loss: 0.5000
Epoch 8/10 - Average Loss: 0.5000
Epoch 9/10 - Average Loss: 0.5000
Epoch 10/10 - Average Loss: 0.5000

✅ Training simulation completed!
Final training loss: 0.5000
Model ready for climate prediction tasks


## 8. Agentic access (via MCP)

The Model Context Protocol, or MCP (https://modelcontextprotocol.io/docs/getting-started/intro), is an emerging open standard that lets agentic AI systems connect to external tools and data sources.

When thinking about data discovery and access, we might choose to expose GeoCroissant functionality within MCP servers, to provide:
- Search capability
- Extract, Transform and Load capabilities

**NOTE: This part of the mock-up has not been fully considered yet. More work to come here!**
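
As a starting point for discussion, the sketch below shows the general shape of exposing a search capability as an MCP tool. MCP messages are JSON-RPC 2.0; a server advertises tools (each with a `name`, `description` and JSON Schema `inputSchema`) via `tools/list` and runs them via `tools/call`. The `search_datasets` tool, its schema and the tiny in-memory catalogue are hypothetical, and the sketch omits the initialisation handshake and transport that a real server would get from an MCP SDK.

```python
import json

# Hypothetical tool definition; the name and fields are illustrative only
SEARCH_TOOL = {
    "name": "search_datasets",
    "description": "Search the catalogue by bounding box and keywords.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "bbox": {"type": "array", "items": {"type": "number"}},
            "keywords": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["keywords"],
    },
}


def search_datasets(keywords, bbox=None):
    # Stand-in for a real catalogue query (bbox is accepted but unused here)
    catalogue = {"CMIP6_Global_Climate_Projections": {"climate", "temperature"}}
    return [name for name, tags in catalogue.items() if tags & set(keywords)]


def error(request, code, message):
    """Build a JSON-RPC error response."""
    return {"jsonrpc": "2.0", "id": request.get("id"),
            "error": {"code": code, "message": message}}


def handle_request(request):
    """Answer `tools/list` and `tools/call` JSON-RPC requests."""
    method, params = request["method"], request.get("params", {})
    if method == "tools/list":
        result = {"tools": [SEARCH_TOOL]}
    elif method == "tools/call":
        if params.get("name") != SEARCH_TOOL["name"]:
            return error(request, -32602, "Unknown tool")
        names = search_datasets(**params.get("arguments", {}))
        result = {"content": [{"type": "text", "text": json.dumps(names)}]}
    else:
        return error(request, -32601, "Method not found")
    return {"jsonrpc": "2.0", "id": request.get("id"), "result": result}


response = handle_request({
    "jsonrpc": "2.0", "id": 1, "method": "tools/call",
    "params": {"name": "search_datasets",
               "arguments": {"keywords": ["temperature"]}},
})
print(json.dumps(response, indent=2))
```

Because the tool declares its inputs as a JSON Schema, an agent can discover how to call it without any GeoCroissant-specific knowledge.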

In [28]:
# Mockup for MCP integration coming soon!
# It should include:
#   - Exposing GeoCroissant capabilities via MCP profiles
#   - Enabling agentic search and data retrieval workflows
#   - Demonstrating example agent workflows, using LLMs

## 9. Accessing local and/or remote data (file system vs S3/HTTP)

The model for the GeoCroissant interface is that it should _work the same_ (although the performance will vary) for data that is:
- stored **at different sites/services**:
  - if data files are on the local file system - it should use the fastest route to the data
  - if data files are remote, then it should download them using the appropriate protocol:
    - `http(s)`
    - `s3`
    - other...
- stored **in different formats**:
  - supported formats will include:
    - `NetCDF`
    - `GRIB`
    - `Zarr`
    - `Kerchunk` / `VirtualiZarr` (as aggregation layers over other formats)

The most important aspect is that the **recipient format** should match what is needed by the user:
- `xarray.DataArray`, `xarray.Dataset` or `xarray.DataTree`.
- `numpy.ndarray` objects (easily converted to `torch.Tensor` objects)
- others...?
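
The "works the same everywhere" goal implies some dispatch step that inspects a location and decides how to reach and decode it. A minimal sketch of that step follows, assuming hypothetical lookup tables keyed on URL scheme and file suffix; the paths and bucket name are placeholders, and a real implementation would likely inspect file contents or catalogue metadata rather than rely on suffixes.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical lookup tables; the labels are placeholders for real readers
ACCESS_BY_SCHEME = {
    "": "local", "file": "local",
    "http": "http", "https": "http",
    "s3": "s3",
}
FORMAT_BY_SUFFIX = {
    ".nc": "NetCDF", ".grib": "GRIB", ".grib2": "GRIB", ".zarr": "Zarr",
}


def plan_access(location):
    """Return (access_method, data_format) for a path or URL."""
    parsed = urlparse(location)
    try:
        access = ACCESS_BY_SCHEME[parsed.scheme]
    except KeyError:
        raise ValueError(f"Unsupported protocol: {parsed.scheme!r}") from None
    suffix = PurePosixPath(parsed.path.rstrip("/")).suffix.lower()
    return access, FORMAT_BY_SUFFIX.get(suffix, "unknown")


for location in [
    "/data/cmip6/tas_2020.nc",
    "https://example.org/data/era5.grib",
    "s3://example-bucket/store.zarr/",
]:
    print(location, "->", plan_access(location))
```

Keeping this decision in one place means the rest of the API can hand back the same recipient format regardless of where the bytes came from.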


## 10. Handling restricted data with access control

Since some data may be restricted in access, the API needs to handle access/API tokens and potentially other authorisation tokens that might be passed to the underlying service through:
- environment variables
- HTTP(S) headers
- parameters in Python calls

At the simplest level, this should look something like:

In [29]:
# Get a token from an authentication service, in this case CEDA's token service
token = ceda_auth.get_access_token(refresh=True)

# Pass the token with the filter request so the underlying service can authorise access
ds = cmip6_dataset.filter(**selection_criteria, auth_token=token)

print("Data loaded with authorization token successfully!")

Data loaded with authorization token successfully!
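
The token routes listed at the start of this section could be resolved in a fixed order of precedence. The sketch below assumes that an explicit Python argument wins over an environment variable, and that the token is then sent as a standard HTTP bearer header; the `DATA_ACCESS_TOKEN` variable name and both helper functions are hypothetical.

```python
import os


def resolve_token(explicit=None, env_var="DATA_ACCESS_TOKEN", environ=None):
    """Prefer an explicitly passed token, then fall back to the environment."""
    environ = os.environ if environ is None else environ
    token = explicit or environ.get(env_var)
    if not token:
        raise PermissionError(
            f"No access token supplied and {env_var} is not set"
        )
    return token


def auth_headers(token):
    """Build the HTTP header for bearer-token authorisation."""
    return {"Authorization": f"Bearer {token}"}


fake_env = {"DATA_ACCESS_TOKEN": "token-from-env"}   # stands in for os.environ
print(auth_headers(resolve_token(environ=fake_env)))
print(auth_headers(resolve_token("token-from-call", environ=fake_env)))
```

Failing early with a clear error when no token is available is usually friendlier than letting the underlying service return an opaque authorisation failure part-way through a load.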


## 11. Benchmarking

As part of the GeoCroissant API, we need to be able to measure the performance of different parts of the system, to ensure the interfaces and transfer mechanisms are optimised.

**More to come here about benchmarking.**

TO-DO:
- Define benchmarks and reproducible tests for performance:
  - Common read/load/transform benchmarks (throughput, latency, memory)
  - Dataset and hardware profiling guidance
  - Reproducible scripts and CI-friendly performance checks
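
As a seed for the read/load/transform benchmarks listed above, the following sketch times a callable over several runs and reports latency, throughput and peak memory using only the standard library. The `benchmark` helper and the `fake_load` workload are illustrative assumptions; real benchmarks would wrap actual GeoCroissant calls and record hardware details alongside the results.

```python
import time
import tracemalloc


def benchmark(func, payload_bytes, repeats=5):
    """Time `func` over several runs; report latency, throughput and memory."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)

    # Measure memory in a separate run, because tracing slows execution
    tracemalloc.start()
    func()
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    best = min(durations)
    return {
        "best_latency_s": best,
        "throughput_mb_s": payload_bytes / best / 1e6 if best > 0 else float("inf"),
        "peak_memory_mb": peak_bytes / 1e6,
    }


def fake_load():
    # Stand-in for a read/transform step: build and sum a 1 MB buffer
    return sum(bytearray(1_000_000))


report = benchmark(fake_load, payload_bytes=1_000_000)
for key, value in report.items():
    print(f"{key}: {value:.4f}")
```

Reporting the best of several runs reduces noise from other processes, which helps when the same script is run in CI to catch performance regressions.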