# Multi-Dataset Analysis: Combining Ocean Data Sources

This notebook demonstrates how to combine data from multiple Ocean Data Platform (ODP) datasets for integrated analysis.

**What you'll learn:**
- Load and explore multiple datasets
- Join data by common keys (temporal, spatial, categorical)
- Perform cross-dataset analysis
- Handle different data resolutions and formats

**Prerequisites:**
- Running in ODP Workspace (auto-authenticated)
- Completed previous tutorials (01-03)

## Why Combine Datasets?

Ocean science often requires integrating multiple data sources:
- **Bathymetry + Biology**: Correlate species distribution with seafloor depth
- **Temperature + Chemistry**: Link physical and chemical oceanography
- **Observations + Models**: Validate model predictions against measurements
- **Multi-temporal**: Compare conditions across time periods

## 1. Setup

In [None]:
from odp.client import Client
import pandas as pd
import numpy as np
import requests

# Initialize ODP client
client = Client()

# STAC API for dataset discovery
STAC_API = "https://api.hubocean.earth/api/stac"

print("Client initialized")

# Optional visualization
try:
    import matplotlib.pyplot as plt
    HAS_MPL = True
except ImportError:
    HAS_MPL = False

## 2. Discover Related Datasets

Use the STAC API to find datasets that cover a common geographic region.

In [None]:
# Search for datasets in Norwegian waters
norwegian_bbox = [0, 55, 35, 75]  # [west, south, east, north]

search_params = {
    "bbox": norwegian_bbox,
    "limit": 50
}

# A timeout keeps the cell from hanging if the API is unreachable
response = requests.post(f"{STAC_API}/search", json=search_params, timeout=30)

if response.status_code == 200:
    results = response.json()
    features = results.get('features', [])
    print(f"Found {len(features)} datasets in Norwegian waters\n")
    
    # Display available datasets
    for i, f in enumerate(features[:15]):
        props = f.get('properties', {})
        # `or` also covers keys that are present but null
        title = props.get('title') or f['id'][:40]
        desc = props.get('description') or 'No description'
        print(f"{i+1:2}. {title}")
        print(f"    {desc[:60]}{'...' if len(desc) > 60 else ''}")
else:
    print(f"Search failed: {response.status_code}")
    features = []
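
The search above filters by bounding box only. As a minimal sketch, the helper below builds a search body that also carries a time window, using the `datetime` interval syntax from the STAC API item-search specification. `build_stac_search` is a hypothetical helper, not part of the ODP SDK, and whether this endpoint honors the `datetime` field is an assumption to verify against the ODP documentation.

```python
from datetime import datetime, timezone


def build_stac_search(bbox, start=None, end=None, limit=50):
    """Build a STAC search body from a [west, south, east, north] bbox.

    Hypothetical helper for illustration; it only assembles the payload.
    """
    west, south, east, north = bbox
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError("longitudes must be within [-180, 180]")
    if not (-90 <= south <= north <= 90):
        raise ValueError("latitudes must satisfy -90 <= south <= north <= 90")

    payload = {"bbox": [west, south, east, north], "limit": limit}

    def _fmt(t):
        # STAC expects RFC 3339 timestamps; ".." marks an open end.
        # Pass timezone-aware datetimes: naive ones are read as local time.
        if t is None:
            return ".."
        return t.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    if start is not None or end is not None:
        payload["datetime"] = f"{_fmt(start)}/{_fmt(end)}"
    return payload


payload = build_stac_search(
    [0, 55, 35, 75],
    start=datetime(2020, 1, 1, tzinfo=timezone.utc),
    end=datetime(2020, 12, 31, tzinfo=timezone.utc),
)
print(payload)
```

The returned dictionary can be passed as the `json=` argument of the `requests.post` call in place of `search_params`.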

In [None]:
# Helper function to probe dataset type
def get_dataset_info(dataset_id):
    """Get dataset type and basic info."""
    try:
        ds = client.dataset(dataset_id)
        schema = ds.table.schema()
        files = ds.files.list()
        
        info = {
            'id': dataset_id,
            'is_tabular': schema is not None,
            'columns': [f.name for f in schema] if schema else [],
            'file_count': len(files) if files else 0
        }
        
        if schema:
            stats = ds.table.stats()
            info['row_count'] = stats.num_rows if stats else 0
        
        return info
    except Exception as e:
        return {'id': dataset_id, 'error': str(e)}

# Probe a few datasets
print("Probing dataset types...\n")
for f in features[:5]:
    info = get_dataset_info(f['id'])
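    # Report probe failures explicitly instead of labelling them as file datasets
    dtype = 'ERROR' if 'error' in info else ('TABULAR' if info.get('is_tabular') else 'FILE')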
    title = f.get('properties', {}).get('title', f['id'][:30])
    print(f"  {title}: {dtype}")
    if info.get('columns'):
        print(f"    Columns: {info['columns'][:5]}...")

## 3. Load Datasets for Analysis

We'll work with two complementary dataset types. If tabular datasets aren't available, we'll demonstrate the pattern with synthetic data.

In [None]:
# Select datasets to combine
# Option 1: Use discovered tabular datasets
# Option 2: Use user's personal datasets
# Option 3: Generate synthetic demo data

print("Dataset Selection Options:")
print("1. Enter two dataset UUIDs to combine")
print("2. Use synthetic demo data (recommended for learning)")

choice = input("\nChoice (1 or 2): ").strip()

USE_SYNTHETIC = choice != '1'

if not USE_SYNTHETIC:
    DATASET_1 = input("Dataset 1 UUID: ").strip()
    DATASET_2 = input("Dataset 2 UUID: ").strip()
    print(f"\nUsing datasets: {DATASET_1[:8]}... and {DATASET_2[:8]}...")
else:
    print("\nUsing synthetic demo data")

In [None]:
# Generate synthetic oceanographic data for demonstration
if USE_SYNTHETIC:
    np.random.seed(42)
    
    # Dataset 1: Temperature observations from monitoring stations
    stations = [
        {"id": "ST001", "lat": 60.5, "lon": 5.0, "name": "Bergen Offshore"},
        {"id": "ST002", "lat": 63.0, "lon": 8.5, "name": "Trondheim Deep"},
        {"id": "ST003", "lat": 66.5, "lon": 13.0, "name": "Arctic Gateway"},
        {"id": "ST004", "lat": 70.0, "lon": 20.0, "name": "Barents Entry"},
    ]
    
    temp_records = []
    for month in range(1, 13):
        for station in stations:
            # Temperature varies by latitude and season
            base_temp = 12 - (station['lat'] - 58) * 0.2
            seasonal = 4 * np.sin((month - 3) * np.pi / 6)  # Peaks at month 6 (June)
            temp = base_temp + seasonal + np.random.normal(0, 1)
            
            temp_records.append({
                'station_id': station['id'],
                'station_name': station['name'],
                'latitude': station['lat'],
                'longitude': station['lon'],
                'month': month,
                'temperature_c': round(temp, 2),
                'salinity_psu': round(34.5 + np.random.normal(0, 0.3), 2)
            })
    
    df_temperature = pd.DataFrame(temp_records)
    print(f"Temperature dataset: {len(df_temperature)} records")
    print(f"Stations: {df_temperature['station_name'].unique().tolist()}")
    # A bare expression inside an if-block is not displayed, so print it
    print(df_temperature.head())

In [None]:
if USE_SYNTHETIC:
    # Dataset 2: Species observations (biota)
    species_list = [
        {"name": "Atlantic Cod", "temp_min": 2, "temp_max": 12},
        {"name": "Atlantic Herring", "temp_min": 4, "temp_max": 14},
        {"name": "Mackerel", "temp_min": 8, "temp_max": 18},
        {"name": "Capelin", "temp_min": 0, "temp_max": 8},
    ]
    
    bio_records = []
    for month in range(1, 13):
        for station in stations:
            # Get approximate temperature at this station/month
            base_temp = 12 - (station['lat'] - 58) * 0.2
            seasonal = 4 * np.sin((month - 3) * np.pi / 6)
            local_temp = base_temp + seasonal
            
            # Species presence depends on temperature preference
            for species in species_list:
                if species['temp_min'] <= local_temp <= species['temp_max']:
                    # Species is present - generate observation
                    abundance = np.random.poisson(50)  # Count
                    if abundance > 0:
                        bio_records.append({
                            'station_id': station['id'],
                            'month': month,
                            'species': species['name'],
                            'abundance': abundance,
                            'biomass_kg': round(abundance * np.random.uniform(0.5, 5.0), 1)
                        })
    
    df_biology = pd.DataFrame(bio_records)
    print(f"Biology dataset: {len(df_biology)} records")
    print(f"Species: {df_biology['species'].unique().tolist()}")
    # A bare expression inside an if-block is not displayed, so print it
    print(df_biology.head())

In [None]:
# Load from real ODP datasets if available
if not USE_SYNTHETIC:
    try:
        ds1 = client.dataset(DATASET_1)
        ds2 = client.dataset(DATASET_2)
        
        # Try to load tabular data
        schema1 = ds1.table.schema()
        schema2 = ds2.table.schema()
        
        if schema1:
            df_dataset1 = ds1.table.select().all(max_rows=10000).dataframe()
            print(f"Dataset 1: {len(df_dataset1)} rows, columns: {list(df_dataset1.columns)}")
        else:
            print(f"Dataset 1 is file-based, not tabular")
            df_dataset1 = None
            
        if schema2:
            df_dataset2 = ds2.table.select().all(max_rows=10000).dataframe()
            print(f"Dataset 2: {len(df_dataset2)} rows, columns: {list(df_dataset2.columns)}")
        else:
            print(f"Dataset 2 is file-based, not tabular")
            df_dataset2 = None
            
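    except Exception as e:
        print(f"Error loading datasets: {e}")
        # The synthetic cells above were skipped, so they must be re-run
        print("Falling back to synthetic data: re-run the two synthetic data cells above before continuing")
        USE_SYNTHETIC = True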
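
When two real datasets are loaded, the join keys are not known in advance. A minimal sketch of one way to shortlist them: list the columns both frames share and flag dtype mismatches, which would otherwise make a merge silently match nothing. `candidate_join_keys` and the two demo frames are hypothetical stand-ins for `df_dataset1` and `df_dataset2`.

```python
import pandas as pd


def candidate_join_keys(left, right):
    """List columns present in both frames, with dtypes and a match flag."""
    rows = []
    for col in left.columns.intersection(right.columns):
        rows.append({
            "column": col,
            "left_dtype": str(left[col].dtype),
            "right_dtype": str(right[col].dtype),
            "same_dtype": bool(left[col].dtype == right[col].dtype),
        })
    return pd.DataFrame(
        rows, columns=["column", "left_dtype", "right_dtype", "same_dtype"]
    )


# Hypothetical stand-ins: note that 'month' is numeric in one and text in the other
demo_a = pd.DataFrame({"station_id": ["ST001"], "month": [1], "temperature_c": [7.5]})
demo_b = pd.DataFrame({"station_id": ["ST001"], "month": ["1"], "abundance": [48]})

keys = candidate_join_keys(demo_a, demo_b)
print(keys)
```

Columns flagged with `same_dtype` equal to `False` should be cast to a common type (for example with `astype`) before merging.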

## 4. Join Strategies

Common approaches for combining oceanographic datasets:

| Join Type | Key Column(s) | Use Case |
|-----------|---------------|----------|
| **Exact** | station_id, timestamp | Same sampling locations |
| **Temporal** | date/month/year | Time-aligned data |
| **Spatial** | lat/lon bins, H3 hex | Nearby observations |
| **Categorical** | region, species | Group-based analysis |
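
The temporal row of the table can be sketched with `pd.merge_asof`, which matches each row to the nearest key instead of requiring equality. The two frames below are hypothetical: regular sensor readings and irregularly timed net hauls. The 3-hour tolerance is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical sensor readings every 6 hours
readings = pd.DataFrame({
    "time": pd.to_datetime([
        "2024-06-01 00:00", "2024-06-01 06:00",
        "2024-06-01 12:00", "2024-06-01 18:00",
    ]),
    "temperature_c": [9.8, 10.1, 11.4, 10.9],
})

# Hypothetical net hauls at irregular times
hauls = pd.DataFrame({
    "time": pd.to_datetime([
        "2024-06-01 05:10", "2024-06-01 13:45", "2024-06-02 09:00",
    ]),
    "abundance": [42, 57, 31],
})

# Both inputs must be sorted by the key column
aligned = pd.merge_asof(
    hauls.sort_values("time"),
    readings.sort_values("time"),
    on="time",
    direction="nearest",             # closest reading before or after the haul
    tolerance=pd.Timedelta("3h"),    # leave NaN when nothing is close enough
)
print(aligned)
```

The last haul has no reading within three hours, so its temperature stays missing rather than borrowing a value from the previous day.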

In [None]:
# Strategy 1: Exact key join (station + time)
if USE_SYNTHETIC:
    # Join temperature and biology on station_id + month
    df_combined = pd.merge(
        df_biology,
        df_temperature,
        on=['station_id', 'month'],
        how='left'
    )
    
    print(f"Combined dataset: {len(df_combined)} records")
    print(f"Columns: {list(df_combined.columns)}")
    # A bare expression inside an if-block is not displayed, so print it
    print(df_combined.head(10))
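
A left join keeps every biology row even when no temperature record matches, so gaps can go unnoticed. As a sketch, the `validate` and `indicator` arguments of `pd.merge` make such problems visible. The small frames are hypothetical, and `ST009` is deliberately missing from the temperature side.

```python
import pandas as pd

bio = pd.DataFrame({
    "station_id": ["ST001", "ST001", "ST002", "ST009"],
    "month": [1, 1, 1, 1],
    "species": ["Atlantic Cod", "Capelin", "Atlantic Cod", "Capelin"],
})
temp = pd.DataFrame({
    "station_id": ["ST001", "ST002"],
    "month": [1, 1],
    "temperature_c": [7.2, 6.1],
})

checked = pd.merge(
    bio,
    temp,
    on=["station_id", "month"],
    how="left",
    validate="many_to_one",  # raises MergeError if temp has duplicate keys
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

unmatched = checked[checked["_merge"] == "left_only"]
print(f"{len(unmatched)} of {len(checked)} biology rows have no temperature match")
```

The same two arguments can be added to the `pd.merge` call above to check the synthetic join.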

In [None]:
# Strategy 2: Spatial binning join
# When datasets don't share exact coordinates, bin to a common grid

def bin_coordinates(lat, lon, resolution=1.0):
    """Bin lat/lon to grid cells."""
    lat_bin = np.floor(lat / resolution) * resolution
    lon_bin = np.floor(lon / resolution) * resolution
    return f"{lat_bin:.1f}_{lon_bin:.1f}"

if USE_SYNTHETIC:
    # Add spatial bins to temperature data
    df_temperature['spatial_bin'] = df_temperature.apply(
        lambda row: bin_coordinates(row['latitude'], row['longitude'], resolution=2.0),
        axis=1
    )
    
    print("Spatial bins in temperature data:")
    print(df_temperature['spatial_bin'].value_counts())
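
The cell above bins one dataset but stops before the join. A minimal sketch of the full pattern: bin both datasets to the same grid, average the temperature per cell, then merge on the bin label. `add_grid_cell` is a hypothetical vectorized variant of `bin_coordinates`, and both frames are made-up examples whose coordinates never coincide exactly.

```python
import numpy as np
import pandas as pd


def add_grid_cell(df, resolution=2.0, lat_col="latitude", lon_col="longitude"):
    """Return a copy with a 'spatial_bin' label for the cell's south-west corner."""
    out = df.copy()
    lat_bin = np.floor(out[lat_col] / resolution) * resolution
    lon_bin = np.floor(out[lon_col] / resolution) * resolution
    out["spatial_bin"] = (
        lat_bin.map("{:.1f}".format) + "_" + lon_bin.map("{:.1f}".format)
    )
    return out


# Hypothetical station temperatures and trawl positions
temps = pd.DataFrame({
    "latitude": [60.5, 63.0],
    "longitude": [5.0, 8.5],
    "temperature_c": [9.1, 8.2],
})
trawls = pd.DataFrame({
    "latitude": [61.2, 62.9, 70.3],
    "longitude": [4.3, 9.9, 20.1],
    "abundance": [40, 55, 23],
})

# One temperature per cell, so the merge cannot duplicate trawl rows
cell_temp = (
    add_grid_cell(temps)
    .groupby("spatial_bin", as_index=False)["temperature_c"]
    .mean()
)
binned = add_grid_cell(trawls).merge(cell_temp, on="spatial_bin", how="left")
print(binned[["spatial_bin", "abundance", "temperature_c"]])
```

The grid resolution controls the trade-off: coarser cells match more observations but blur real spatial differences.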

## 5. Cross-Dataset Analysis

Now we can analyze relationships between temperature and species distribution.

In [None]:
if USE_SYNTHETIC:
    # Analyze: How does temperature affect species abundance?
    species_temp_analysis = df_combined.groupby('species').agg({
        'temperature_c': ['mean', 'min', 'max'],
        'abundance': ['sum', 'mean'],
        'station_id': 'count'
    }).round(2)
    
    species_temp_analysis.columns = ['_'.join(col) for col in species_temp_analysis.columns]
    species_temp_analysis = species_temp_analysis.rename(columns={'station_id_count': 'observations'})
    
    print("Species Temperature Preferences (from combined data):")
    print(species_temp_analysis)

In [None]:
if USE_SYNTHETIC:
    # Seasonal patterns by species
    seasonal_abundance = df_combined.groupby(['species', 'month']).agg({
        'abundance': 'sum',
        'temperature_c': 'mean'
    }).reset_index()
    
    print("Seasonal abundance patterns:")
    for species in df_combined['species'].unique():
        sp_data = seasonal_abundance[seasonal_abundance['species'] == species]
        peak_month = sp_data.loc[sp_data['abundance'].idxmax(), 'month']
        peak_temp = sp_data.loc[sp_data['abundance'].idxmax(), 'temperature_c']
        print(f"  {species}: Peak in month {peak_month} (temp: {peak_temp:.1f}°C)")
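
The long-format seasonal table can also be reshaped into a species-by-month grid, which makes absent months explicit as zeros. This is a sketch on a hypothetical miniature of `seasonal_abundance`, not on the notebook's own data.

```python
import pandas as pd

# Hypothetical miniature of seasonal_abundance
seasonal = pd.DataFrame({
    "species": ["Atlantic Cod", "Atlantic Cod", "Capelin", "Capelin", "Mackerel"],
    "month": [1, 6, 1, 2, 6],
    "abundance": [48, 51, 55, 47, 60],
})

wide = (
    seasonal
    .pivot_table(index="species", columns="month",
                 values="abundance", aggfunc="sum", fill_value=0)
    .reindex(columns=range(1, 13), fill_value=0)  # show all 12 months
)
print(wide)

# Month with the highest abundance per species
peak_months = wide.idxmax(axis=1)
print(peak_months)
```

When several months tie for the maximum, `idxmax` returns the first of them.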

In [None]:
if USE_SYNTHETIC and HAS_MPL:
    # Visualize temperature-abundance relationship
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    
    for idx, species in enumerate(df_combined['species'].unique()):
        ax = axes[idx // 2, idx % 2]
        sp_data = df_combined[df_combined['species'] == species]
        
        ax.scatter(sp_data['temperature_c'], sp_data['abundance'], alpha=0.6)
        ax.set_xlabel('Temperature (°C)')
        ax.set_ylabel('Abundance')
        ax.set_title(species)
        ax.grid(True, alpha=0.3)
    
    plt.suptitle('Species Abundance vs Temperature', fontsize=14)
    plt.tight_layout()
    plt.show()
elif USE_SYNTHETIC:
    print("Install matplotlib for visualization: pip install matplotlib")

## 6. Spatial Correlation Analysis

Analyze how patterns vary across stations.

In [None]:
if USE_SYNTHETIC:
    # Station-level summary
    station_summary = df_combined.groupby(['station_id', 'station_name', 'latitude']).agg({
        'temperature_c': 'mean',
        'abundance': 'sum',
        'species': 'nunique'
    }).reset_index()
    
    station_summary.columns = ['station_id', 'station_name', 'latitude', 
                                'avg_temp', 'total_abundance', 'species_richness']
    
    station_summary = station_summary.sort_values('latitude')
    
    print("Station Summary (sorted by latitude):")
    print(station_summary.to_string(index=False))

In [None]:
if USE_SYNTHETIC:
    # Correlation between temperature and biodiversity
    monthly_biodiversity = df_combined.groupby(['station_id', 'month']).agg({
        'temperature_c': 'first',
        'species': 'nunique',
        'abundance': 'sum'
    }).reset_index()
    
    # Calculate correlation
    temp_richness_corr = monthly_biodiversity['temperature_c'].corr(
        monthly_biodiversity['species']
    )
    temp_abundance_corr = monthly_biodiversity['temperature_c'].corr(
        monthly_biodiversity['abundance']
    )
    
    print("Temperature-Biodiversity Correlations:")
    print(f"  Temperature vs Species Richness: {temp_richness_corr:.3f}")
    print(f"  Temperature vs Total Abundance:  {temp_abundance_corr:.3f}")
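
The pooled Pearson correlation above mixes two effects: seasonal change within a station and the latitude gradient between stations. As a sketch on hypothetical numbers, the block below adds a rank-based (Spearman) correlation, computed as the Pearson correlation of the ranks, and a per-station breakdown.

```python
import pandas as pd

# Hypothetical station/month summary: one warm and one cold station
monthly = pd.DataFrame({
    "station_id": ["ST001"] * 4 + ["ST004"] * 4,
    "temperature_c": [6.0, 8.0, 10.0, 12.0, 2.0, 4.0, 6.0, 8.0],
    "richness": [2, 3, 3, 4, 1, 2, 2, 3],
})

pearson = monthly["temperature_c"].corr(monthly["richness"])

# Spearman: Pearson correlation of the ranks (ties receive average ranks)
spearman = monthly["temperature_c"].rank().corr(monthly["richness"].rank())

# Correlation within each station removes the between-station gradient
per_station = {
    station: grp["temperature_c"].corr(grp["richness"])
    for station, grp in monthly.groupby("station_id")
}

print(f"Pooled Pearson:  {pearson:.3f}")
print(f"Pooled Spearman: {spearman:.3f}")
for station, r in per_station.items():
    print(f"  {station}: {r:.3f}")
```

With only 12 monthly values per station in the synthetic data, these coefficients describe the sample and should not be read as significance tests.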

## 7. Export Combined Data

Save the combined dataset for further analysis or upload to ODP.

In [None]:
if USE_SYNTHETIC:
    # Export to CSV
    output_file = "combined_temp_biology_analysis.csv"
    df_combined.to_csv(output_file, index=False)
    print(f"Exported {len(df_combined)} records to {output_file}")
    
    # Show file size
    import os
    size_kb = os.path.getsize(output_file) / 1024
    print(f"File size: {size_kb:.1f} KB")

In [None]:
# Optional: Upload combined data to your ODP dataset
upload_choice = input("Upload combined data to ODP? Enter dataset UUID or 'skip': ").strip()

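# df_combined only exists once the join cells have run
if 'df_combined' not in globals():
    print("No combined data to upload: run the join cells first")
elif upload_choice and upload_choice.lower() != 'skip':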
    try:
        import io
        
        target_ds = client.dataset(upload_choice)
        
        # Convert to CSV bytes
        csv_buffer = io.BytesIO()
        df_combined.to_csv(csv_buffer, index=False)
        csv_bytes = csv_buffer.getvalue()
        
        # Upload
        file_id = target_ds.files.upload("combined_analysis.csv", csv_bytes)
        print(f"Uploaded! File ID: {file_id}")
        
        # Optionally ingest to table
        ingest = input("Ingest to table? (yes/no): ").strip().lower()
        if ingest == 'yes':
            target_ds.files.ingest(file_id, opt="drop")
            print("Ingested to table!")
            
    except Exception as e:
        print(f"Upload failed: {e}")
else:
    print("Skipped upload")

## Summary

This notebook demonstrated multi-dataset analysis techniques:

1. **Dataset Discovery**: Find related datasets using STAC API spatial search
2. **Data Loading**: Load from ODP or use synthetic demo data
3. **Join Strategies**: Exact keys, temporal alignment, spatial binning
4. **Cross-Dataset Analysis**: Temperature-species relationships, seasonal patterns
5. **Spatial Correlation**: Station-level summaries, biodiversity metrics
6. **Export**: Save combined data locally or upload to ODP

## Join Strategy Reference

| Strategy | When to Use | Pandas Method |
|----------|-------------|---------------|
| Exact key | Same sampling scheme | `pd.merge(on='key')` |
| Temporal | Different time resolution | `pd.merge_asof()` |
| Spatial bin | Different locations | Custom binning + merge |
| H3 hexagon | Large-scale spatial | `h3.latlng_to_cell()` + merge |

## Next Steps

- **05_dask_distributed_processing.ipynb**: Scale to large datasets with Dask

## Resources

- [Pandas merge documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html)
- [ODP Python SDK](https://docs.hubocean.earth/python_sdk/intro/)
- [H3 Spatial Indexing](https://h3geo.org/)