# GeoCroissant CMIP6 Dataset Mockup

This notebook demonstrates how to use the **GeoCroissant** library to discover, interrogate, and load CMIP6 environmental datasets for machine learning applications.

**Key Features Demonstrated:**
- Dataset discovery and search
- CMIP6 dataset loading with STAC integration
- Data interrogation and filtering
- PyTorch Dataset creation for ML training
- Visualization of geospatial climate data

> **Note:** This is a mockup demonstration. The actual `croissant` library and extensions shown here are conceptual and not yet implemented.

## 1. Import Required Libraries

First, let's import all the necessary libraries for working with geospatial environmental datasets.

In [None]:
# Import libraries from the external mock module
import sys
import os

# Add the current working directory to the Python path so mymock.py can be found
current_dir = os.path.abspath('.')
if current_dir not in sys.path:
    sys.path.insert(0, current_dir)

# Import from our mock module
from mymock import (
    croissant, torch, xr, GeoCroissant, STACIntegration, DataLoader, Dataset,
    torch_nn as nn, matplotlib_pyplot as plt, cartopy_crs as ccrs, 
    cartopy_feature as cfeature, pystac_client as Client
)

# Standard libraries (these are real)
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

print("✅ All libraries imported successfully!")
print(f"Croissant version: {croissant.__version__}")
print(f"PyTorch version: {torch.__version__}")
print(f"Xarray version: {xr.__version__}")
print(f"Cartopy version: 0.22.0")
print(f"STAC Client version: 0.7.0")

✅ All libraries imported successfully!
Croissant version: 1.2.3
PyTorch version: 2.1.0
Xarray version: 2023.10.1
Cartopy version: 0.22.0
STAC Client version: 0.7.0


## 2. Discover Available Datasets

Use GeoCroissant to search for available environmental datasets in various repositories.

In [None]:
# Initialize GeoCroissant against a single catalogue endpoint (CEDA)
geocat = GeoCroissant(provider="https://catalogue.ceda.ac.uk/croissant/")

# Search for climate datasets
climate_datasets = geocat.search(
    keywords=["climate", "CMIP6", "temperature", "precipitation"],
    spatial_coverage="global",
    temporal_range=("2015-01-01", "2100-12-31")
)

print(f"Found {len(climate_datasets)} climate datasets:")
for i, dataset in enumerate(climate_datasets[:5]):  # Show first 5
    print(f"{i+1}. {dataset.name}")
    print(f"   Description: {dataset.description}")
    print(f"   Provider: {dataset.provider}")
    print(f"   Variables: {', '.join(dataset.variables[:3])}...")
    print(f"   Spatial Resolution: {dataset.spatial_resolution}")
    print(f"   Temporal Resolution: {dataset.temporal_resolution}")
    print()

Found 3 climate datasets:
1. CMIP6_Global_Climate_Projections
   Description: Multi-model ensemble of global climate projections from CMIP6
   Provider: ESGF Data Nodes
   Variables: temperature, precipitation, pressure...
   Spatial Resolution: 1.25° x 1.25°
   Temporal Resolution: monthly

2. ERA5_Reanalysis_Global
   Description: ECMWF ERA5 atmospheric reanalysis dataset
   Provider: Copernicus Climate Data Store
   Variables: temperature, wind, pressure...
   Spatial Resolution: 0.25° x 0.25°
   Temporal Resolution: hourly

3. MODIS_Land_Surface_Temperature
   Description: MODIS satellite-derived land surface temperature
   Provider: NASA EARTHDATA
   Variables: land_surface_temperature, emissivity...
   Spatial Resolution: 1km
   Temporal Resolution: daily
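
The matching logic behind a call like `geocat.search(...)` is not shown in this mockup. As a rough sketch only, a keyword and time-range search over catalogue records could work as below. `DatasetRecord`, the `search` function, and the two extra catalogue entries are hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DatasetRecord:
    """Hypothetical catalogue entry; field names are illustrative only."""
    name: str
    keywords: set
    start: date
    end: date


def search(records, keywords, temporal_range):
    """Return records that share a keyword and overlap the requested time range."""
    wanted = {k.lower() for k in keywords}
    t0, t1 = (date.fromisoformat(t) for t in temporal_range)
    hits = []
    for rec in records:
        matches_keyword = bool(wanted & {k.lower() for k in rec.keywords})
        # Two closed intervals overlap when each starts before the other ends.
        overlaps = rec.start <= t1 and rec.end >= t0
        if matches_keyword and overlaps:
            hits.append(rec)
    return hits


# Placeholder records (the second and third names are invented for this sketch).
catalogue = [
    DatasetRecord("CMIP6_Global_Climate_Projections", {"climate", "CMIP6"},
                  date(2015, 1, 1), date(2100, 12, 31)),
    DatasetRecord("Historical_Station_Archive", {"climate", "temperature"},
                  date(1900, 1, 1), date(1999, 12, 31)),
    DatasetRecord("Global_Bathymetry_Grid", {"ocean", "elevation"},
                  date(2015, 1, 1), date(2100, 12, 31)),
]

results = search(catalogue, ["climate", "CMIP6"], ("2015-01-01", "2100-12-31"))
print([r.name for r in results])
```

Here the station archive is rejected on the time range and the bathymetry grid on keywords, so only the CMIP6 record is returned.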



## 3. Load CMIP6 Dataset

Load the specific CMIP6 dataset using the integrated STAC client.

In [None]:
# Load the CMIP6 dataset by name
cmip6_dataset = geocat.load_dataset("CMIP6_Global_Climate_Projections")

# The dataset is loaded with STAC integration
print("✅ CMIP6 dataset loaded successfully!")
print(f"Dataset ID: {cmip6_dataset.id}")
print(f"Title: {cmip6_dataset.title}")
print(f"Description: {cmip6_dataset.description}")
print(f"License: {cmip6_dataset.license}")
print(f"Extent: {cmip6_dataset.spatial_extent}")
print(f"Time Range: {cmip6_dataset.temporal_extent}")

# Show STAC catalog structure
print(f"\n📁 STAC Catalog Structure:")
print(f"Collections: {len(cmip6_dataset.collections)}")
for collection in cmip6_dataset.collections[:3]:
    print(f"  - {collection.id}: {collection.title}")
    print(f"    Items: {len(collection.items)}")
    print(f"    Variables: {', '.join(collection.summaries.get('variables', [])[:5])}")
    print()

✅ CMIP6 dataset loaded successfully!
Dataset ID: cmip6_global_climate
Title: CMIP6 Global Climate Projections
Description: Comprehensive climate model data from CMIP6 including temperature, precipitation, and atmospheric variables
License: CC-BY-4.0
Extent: {'bbox': [-180, -90, 180, 90]}
Time Range: {'interval': [['2015-01-01', '2100-12-31']]}

📁 STAC Catalog Structure:
Collections: 3
  - temperature: Surface Temperature
    Items: 120
    Variables: tas, tasmax, tasmin, pr, huss

  - precipitation: Precipitation
    Items: 120
    Variables: pr, prc, prsn, prw, evspsbl

  - atmospheric: Atmospheric Variables
    Items: 120
    Variables: psl, ua, va, zg, hus
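
For comparison with the mock output above, a STAC Collection document nests its extent under `extent.spatial.bbox` (a list of bounding boxes) and `extent.temporal.interval` (a list of start/end pairs). The sketch below parses a hand-written collection record using only the standard library. The record itself is illustrative and does not come from a real catalogue.

```python
import json

# Illustrative STAC Collection record (values chosen to mirror the mock output).
collection_json = """
{
  "type": "Collection",
  "stac_version": "1.0.0",
  "id": "temperature",
  "title": "Surface Temperature",
  "description": "Illustrative collection record",
  "license": "CC-BY-4.0",
  "extent": {
    "spatial": {"bbox": [[-180, -90, 180, 90]]},
    "temporal": {"interval": [["2015-01-01T00:00:00Z", "2100-12-31T23:59:59Z"]]}
  },
  "links": []
}
"""

collection = json.loads(collection_json)

# The first bbox and first interval describe the overall extent.
west, south, east, north = collection["extent"]["spatial"]["bbox"][0]
start, end = collection["extent"]["temporal"]["interval"][0]

print(f"{collection['id']}: lon {west}..{east}, lat {south}..{north}")
print(f"time {start} .. {end}")
```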



## 4. Interrogate Dataset Contents

Use GeoCroissant's extensions to deeply explore the dataset structure and metadata.

In [28]:
# Use the generic interrogation API to explore the dataset
print("🔍 Interrogating CMIP6 dataset contents...")

# Get available climate models using generic props interface
models = cmip6_dataset.get_props("models")
print(f"📊 Available Climate Models ({len(models)}):")
for model in models[:8]:
    print(f"  - {model.name}: {model.institution}")
    print(f"    Resolution: {model.nominal_resolution}")
    print(f"    Experiments: {len(model.experiments)}")
    print()

# Get available experiments
experiments = cmip6_dataset.get_props("experiments")
print(f"🧪 Available Experiments ({len(experiments)}):")
for exp in experiments[:5]:
    print(f"  - {exp.experiment_id}: {exp.description}")
    print(f"    Activity: {exp.activity_id}")
    print(f"    Models: {len(exp.participating_models)}")
    print()

# Get available variables
variables = cmip6_dataset.get_props("variables")
print(f"🌡️ Available Variables ({len(variables)}):")
for var in variables[:10]:
    print(f"  - {var.variable_id}: {var.long_name}")
    print(f"    Units: {var.units}")
    print(f"    Frequency: {var.frequency}")
    print(f"    Dimensions: {var.dimensions}")
    print()

# Show other available properties that can be interrogated
available_props = cmip6_dataset.get_props("__available__")
print(f"🔧 Other Available Properties:")
print(f"Available props: {', '.join(available_props)}")
print(f"Use cmip6_dataset.get_props('property_name') to explore any of these.")

🔍 Interrogating CMIP6 dataset contents...
📊 Available Climate Models (5):
  - CESM2: NCAR
    Resolution: 0.9x1.25 deg
    Experiments: 3

  - GFDL-ESM4: NOAA-GFDL
    Resolution: 0.5 deg
    Experiments: 4

  - UKESM1-0-LL: MOHC
    Resolution: 1.25x1.875 deg
    Experiments: 3

  - IPSL-CM6A-LR: IPSL
    Resolution: 1.27x2.5 deg
    Experiments: 4

  - MPI-ESM1-2-HR: MPI-M
    Resolution: 0.94x0.94 deg
    Experiments: 3

🧪 Available Experiments (5):
  - ssp126: Low emissions scenario
    Activity: ScenarioMIP
    Models: 12

  - ssp245: Medium emissions scenario
    Activity: ScenarioMIP
    Models: 15

  - ssp370: Medium-high emissions scenario
    Activity: ScenarioMIP
    Models: 8

  - ssp585: High emissions scenario
    Activity: ScenarioMIP
    Models: 18

  - historical: Historical simulation
    Activity: CMIP
    Models: 25

🌡️ Available Variables (10):
  - tas: Near-Surface Air Temperature
    Units: K
    Frequency: mon
    Dimensions: ['time', 'lat', 'lon']
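
The generic `get_props` interface, including the `"__available__"` sentinel used above, could be backed by a simple name-to-value mapping. The following is a minimal sketch under that assumption. `PropertyRegistry` is a hypothetical class and not part of any existing library.

```python
class PropertyRegistry:
    """Hypothetical stand-in for a dataset's property lookup."""

    def __init__(self, **props):
        self._props = dict(props)

    def get_props(self, name):
        # The sentinel "__available__" lists property names instead of values.
        if name == "__available__":
            return sorted(self._props)
        try:
            return self._props[name]
        except KeyError:
            known = ", ".join(sorted(self._props))
            raise KeyError(f"Unknown property {name!r}; available: {known}") from None


registry = PropertyRegistry(
    models=["CESM2", "GFDL-ESM4"],
    experiments=["ssp126", "ssp585"],
    variables=["tas"],
)

print(registry.get_props("__available__"))
print(registry.get_props("models"))
```

Routing discovery through the same method keeps the interface to a single entry point, at the cost of reserving one property name.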

## 5. Filter and Subset Data

Apply filters to select specific climate model, experiment, variable, and spatial/temporal subsets.

In [32]:
# Define our selection criteria
selection_criteria = {
    'model': 'CESM2',  # Community Earth System Model
    'experiment': 'ssp585',  # High emissions scenario
    'variable': 'tas',  # Near-surface air temperature
    'frequency': 'monthly',
    'spatial_bounds': {
        'lat': [30, 70],  # Northern hemisphere focus
        'lon': [-130, -60]  # North America
    },
    'temporal_bounds': {
        'start': '2020-01-01',
        'end': '2050-12-31'
    }
}

print("🎯 Applying selection criteria:")
for key, value in selection_criteria.items():
    print(f"  {key}: {value}")

# Apply the filters using GeoCroissant's filtering API
print("\n🔄 Filtering dataset...")
filtered_dataset = cmip6_dataset.filter(**selection_criteria)

print(f"✅ Filtered dataset created!")
print(f"Original size: {cmip6_dataset.estimated_size_gb:.1f} GB")
print(f"Filtered size: {filtered_dataset.estimated_size_gb:.1f} GB")
print(f"Reduction: {(1 - filtered_dataset.estimated_size_gb/cmip6_dataset.estimated_size_gb)*100:.1f}%")

# Show the structure of the filtered dataset
print(f"\n📋 Filtered Dataset Structure:")
print(f"Variables: {filtered_dataset.variables}")
print(f"Spatial shape: {filtered_dataset.spatial_shape}")
print(f"Temporal shape: {filtered_dataset.temporal_shape}")
print(f"Total timesteps: {filtered_dataset.n_timesteps}")
print(f"Data format: {filtered_dataset.data_format}")  # xarray or tensor ready

🎯 Applying selection criteria:
  model: CESM2
  experiment: ssp585
  variable: tas
  frequency: monthly
  spatial_bounds: {'lat': [30, 70], 'lon': [-130, -60]}
  temporal_bounds: {'start': '2020-01-01', 'end': '2050-12-31'}

🔄 Filtering dataset...
✅ Filtered dataset created!
Original size: 1250.0 GB
Filtered size: 85.2 GB
Reduction: 93.2%

📋 Filtered Dataset Structure:
Variables: ['tas']
Spatial shape: (40, 70)
Temporal shape: (372,)
Total timesteps: 372
Data format: xarray
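
The numbers in this mock output can be checked by hand. The sketch below reproduces the spatial shape, the timestep count, and the size reduction from the selection criteria. The 1° grid spacing is an assumption chosen because it matches the `(40, 70)` shape shown above; the mock does not state the spacing of the filtered grid. Both helper functions are hypothetical.

```python
def grid_cells(bounds, resolution_deg):
    """Number of grid cells spanning [lo, hi] at the given spacing in degrees."""
    lo, hi = bounds
    return round((hi - lo) / resolution_deg)


def months_inclusive(start, end):
    """Count calendar months from start to end, both given as 'YYYY-MM-DD'."""
    y0, m0 = int(start[:4]), int(start[5:7])
    y1, m1 = int(end[:4]), int(end[5:7])
    return (y1 - y0) * 12 + (m1 - m0) + 1


n_lat = grid_cells([30, 70], 1.0)       # assumed 1 degree spacing
n_lon = grid_cells([-130, -60], 1.0)    # assumed 1 degree spacing
n_months = months_inclusive("2020-01-01", "2050-12-31")

# Size reduction from the mock figures: 1250.0 GB down to 85.2 GB.
reduction_pct = (1 - 85.2 / 1250.0) * 100

print((n_lat, n_lon), n_months, f"{reduction_pct:.1f}%")
```

The 31 years from 2020 through 2050 give 372 monthly timesteps, which agrees with the reported temporal shape.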


## 6. Create PyTorch Dataset

Convert the filtered climate data into a PyTorch Dataset for machine learning applications.

In [37]:
# Convert to PyTorch Dataset using GeoCroissant's ML integration
climate_dataset = filtered_dataset.to_pytorch_dataset(
    target_variable='tas',  # Temperature as target
    feature_variables=['tas'],  # Using same variable for demo (can add more)
    sequence_length=12,  # 12-month sequences
    stride=1,  # Monthly stride
    normalize=True,  # Apply standardization
    transform='spatiotemporal'  # Prepare for spatiotemporal ML
)

print(f"✅ PyTorch Dataset created!")
print(f"Dataset length: {len(climate_dataset)}")
print(f"Sample shape: {climate_dataset[0][0].shape}")  # [features, time, lat, lon]
print(f"Target shape: {climate_dataset[0][1].shape}")  # [time, lat, lon]
print(f"Data type: {climate_dataset[0][0].dtype}")

# Create DataLoader for training
batch_size = 4
train_loader = DataLoader(
    climate_dataset, 
    batch_size=batch_size, 
    shuffle=True,
    num_workers=2
)

print(f"\n📦 DataLoader created with batch size {batch_size}")
print(f"Number of batches: {len(train_loader)}")

# Inspect a batch
sample_batch = next(iter(train_loader))
features, targets = sample_batch
print(f"\nBatch shapes:")
print(f"Features: {features.shape}")  # [batch, features, time, lat, lon]
print(f"Targets: {targets.shape}")    # [batch, time, lat, lon]
print(f"Features range: [{features.min():.3f}, {features.max():.3f}]")
print(f"Targets range: [{targets.min():.3f}, {targets.max():.3f}]")

✅ PyTorch Dataset created!
Dataset length: 360
Sample shape: (1, 12, 40, 70)
Target shape: (12, 40, 70)
Data type: float32

📦 DataLoader created with batch size 4
Number of batches: 90

Batch shapes:
Features: (4, 1, 12, 40, 70)
Targets: (4, 12, 40, 70)
Features range: [-4.110, 4.534]
Targets range: [-4.498, 4.450]
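
The dataset length of 360 follows from sliding-window arithmetic. The mockup does not say how targets are offset from inputs, so the sketch below assumes the target is the input window shifted forward by one month, which is one scheme that yields 360 samples from 372 timesteps. `window_starts` is a hypothetical helper.

```python
import math


def window_starts(n_timesteps, sequence_length, stride=1):
    """Start indices i where input [i, i+L) and target [i+1, i+1+L) both fit."""
    last_start = n_timesteps - sequence_length - 1
    return list(range(0, last_start + 1, stride))


starts = window_starts(n_timesteps=372, sequence_length=12, stride=1)
n_samples = len(starts)

# With batch_size=4 and no dropped remainder, batches = ceil(samples / batch size).
n_batches = math.ceil(n_samples / 4)

print(n_samples, n_batches)
```

If the target covered the same window as the input, the count would be 361 instead, so the reported length is a hint about the offset rather than proof of it.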


## 7. ML Training Simulation

Demonstrate how to use the climate dataset in a machine learning training loop.

In [None]:
# Add missing torch methods for the training demo
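torch.device = lambda x: 'cpu'  # Mock device
# staticmethod is needed so that is_available() can be called on the instance without a self argument
torch.cuda = type('cuda', (), {'is_available': staticmethod(lambda: False)})()  # Mock CUDA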

# Define a simple CNN model for climate prediction
class ClimateCNNModel(nn.Module):
    def __init__(self, input_channels=1, hidden_dim=64):
        super().__init__()
        # Mock layers - would normally be actual PyTorch layers
        self.conv1 = nn.Conv2d(input_channels, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(64, hidden_dim, batch_first=True)
        self.output_layer = nn.Linear(hidden_dim, 1)
        self.relu = nn.ReLU()
    
    def forward(self, x):
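        # Mock forward pass: ignore the input values and return random data
        # shaped like the targets (batch, time, lat, lon)
        batch_size = x.shape[0]
        # MockTorch is not imported above; this assumes the mock `torch` exposes the same Tensor class
        return torch.Tensor(np.random.randn(batch_size, 12, 40, 70))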
    
    def to(self, device):
        return self  # Mock .to() method

# Initialize model, loss, and optimizer
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = ClimateCNNModel(input_channels=1).to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

print(f"✅ Model initialized on device: {device}")
print(f"Model architecture: ClimateCNNModel")
print(f"Loss function: MSE Loss")
print(f"Optimizer: Adam (lr=0.001)")

# Simulate training loop
print(f"\nStarting training simulation...")
n_epochs = 3
train_losses = []

for epoch in range(n_epochs):
    epoch_loss = 0.0
    n_batches = 0
    
    # Simulate training over a few batches
    for batch_idx, (features, targets) in enumerate(train_loader):
        if batch_idx >= 3:  # Just simulate 3 batches per epoch
            break
            
        # Mock training step
        optimizer.zero_grad()
        
        # Forward pass (mock)
        predictions = model(features)
        loss = criterion(predictions, targets)
        
        # Mock backward pass
        # loss.backward()  # Would normally do backprop
        optimizer.step()
        
        epoch_loss += loss.min()  # Use min as mock loss value
        n_batches += 1
    
    avg_loss = epoch_loss / n_batches if n_batches > 0 else 0
    train_losses.append(avg_loss)
    
    print(f"Epoch {epoch+1}/{n_epochs} - Average Loss: {avg_loss:.4f}")

print(f"\n✅ Training simulation completed!")
print(f"Final training loss: {train_losses[-1]:.4f}")
print(f"Model ready for climate prediction tasks")
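
Because the loop above mocks both the forward and backward passes, it does not show what a training step actually computes. As a dependency-free illustration, the sketch below fits a one-parameter linear model with plain gradient descent on a mean squared error loss. It is a toy stand-in: it uses neither PyTorch nor Adam, and the data are synthetic.

```python
# Toy data: y = 2x, so the ideal weight is 2.0.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * x for x in xs]

w = 0.0       # single trainable parameter
lr = 0.05     # learning rate
losses = []

for epoch in range(50):
    # Forward pass and MSE loss (the role of model(...) and criterion(...)).
    preds = [w * x for x in xs]
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)

    # Backward pass: analytic gradient d(loss)/dw (the role of loss.backward()).
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)

    # Parameter update (the role of optimizer.step(), here plain gradient descent).
    w -= lr * grad
    losses.append(loss)

print(f"w = {w:.3f}, first loss = {losses[0]:.3f}, last loss = {losses[-1]:.6f}")
```

Each commented stage corresponds to one call in the mocked loop, which may help when replacing the mock objects with real PyTorch components.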

## 8. Visualize Single Variable Layer

Extract and visualize a single layer of climate data using the GeoCroissant visualization tools.

In [None]:
# Extract a specific time slice for visualization
print("🎨 Extracting data for visualization...")

# Get a single timestep from the original xarray data
sample_data = filtered_dataset.to_xarray().isel(time=0)  # First timestep
temperature_data = sample_data.tas  # Temperature variable

# Create the visualization
fig = plt.figure(figsize=(12, 8))
ax = plt.axes(projection=ccrs.PlateCarree())

# Plot the temperature data
im = ax.contourf(
    temperature_data.lon, 
    temperature_data.lat, 
    temperature_data.values,
    levels=20,
    cmap='RdYlBu_r',
    transform=ccrs.PlateCarree()
)

# Add geographic features
ax.add_feature(cfeature.COASTLINE, linewidth=0.5)
ax.add_feature(cfeature.BORDERS, linewidth=0.3)
ax.add_feature(cfeature.OCEAN, color='lightblue', alpha=0.3)
ax.add_feature(cfeature.LAND, color='lightgray', alpha=0.3)

# Add gridlines
gl = ax.gridlines(draw_labels=True, dms=True, x_inline=False, y_inline=False)
gl.top_labels = False
gl.right_labels = False

# Add colorbar
cbar = plt.colorbar(im, ax=ax, shrink=0.7, pad=0.02)
cbar.set_label('Temperature (K)', rotation=270, labelpad=15)

# Set title and labels
plt.title(f'CMIP6 Surface Temperature - {sample_data.time.dt.strftime("%Y-%m").values}\n'
          f'Model: CESM2, Experiment: SSP5-8.5', fontsize=14, pad=20)

# Set extent to match our filtered region
ax.set_extent([-130, -60, 30, 70], crs=ccrs.PlateCarree())

plt.tight_layout()
plt.show()

# Show data statistics
print(f"\n📊 Data Statistics:")
print(f"Temperature range: {temperature_data.min().values:.1f}K to {temperature_data.max().values:.1f}K")
print(f"Mean temperature: {temperature_data.mean().values:.1f}K")
print(f"Spatial resolution: {abs(temperature_data.lat.diff('lat').mean().values):.3f}° lat x {abs(temperature_data.lon.diff('lon').mean().values):.3f}° lon")
print(f"Grid points: {len(temperature_data.lat)} lat x {len(temperature_data.lon)} lon")
print(f"Date: {sample_data.time.dt.strftime('%Y-%m-%d').values}")
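
The spatial resolution printed above is the mean absolute difference between neighbouring coordinate values. The same calculation can be sketched without xarray, as below. The cell-centre coordinates are assumed values on a 1° grid, chosen to match the filtered region, and `mean_spacing` and `kelvin_to_celsius` are hypothetical helpers.

```python
def mean_spacing(coords):
    """Mean absolute difference between consecutive coordinate values."""
    diffs = [abs(b - a) for a, b in zip(coords, coords[1:])]
    return sum(diffs) / len(diffs)


def kelvin_to_celsius(temp_k):
    """Convert a temperature from kelvin to degrees Celsius."""
    return temp_k - 273.15


# Assumed cell-centre coordinates for the 30..70 N, 130..60 W region.
lats = [30.5 + i for i in range(40)]
lons = [-129.5 + j for j in range(70)]

print(f"Spatial resolution: {mean_spacing(lats):.3f}° lat x {mean_spacing(lons):.3f}° lon")
print(f"Grid points: {len(lats)} lat x {len(lons)} lon")
print(f"288.15 K is {kelvin_to_celsius(288.15):.1f} °C")
```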

## Summary

This notebook demonstrated the complete workflow for using **GeoCroissant** with CMIP6 environmental datasets:

### Key Capabilities Shown:
1. **🔍 Dataset Discovery**: Search across multiple climate data repositories
2. **📦 Easy Loading**: Load CMIP6 datasets with STAC integration
3. **🔍 Deep Interrogation**: Explore models, experiments, variables, and metadata
4. **🎯 Smart Filtering**: Apply complex spatial, temporal, and variable filters
5. **🤖 ML Integration**: Convert to PyTorch datasets for training
6. **🏃‍♂️ Training Ready**: Feed the dataset into a (simulated) ML training loop
7. **🎨 Visualization**: Plot geospatial climate data with cartographic projections

### Next Steps:
- **Implement** the actual GeoCroissant library and extensions
- **Integrate** with real STAC catalogs (CEDA, ESGF, etc.)
- **Add** more sophisticated ML dataset transformations
- **Extend** to other environmental datasets (ERA5, satellite data, etc.)
- **Optimize** for large-scale distributed computing

> This mockup provides a clear vision for how environmental scientists and ML researchers could seamlessly work with massive climate datasets! 🌍