# Catalog Discovery with STAC API

This notebook demonstrates how to programmatically discover and explore datasets available on the Ocean Data Platform using the STAC (SpatioTemporal Asset Catalog) API.

**What you'll learn:**
- Query the STAC API to list available collections
- Search for datasets by spatial extent and keywords
- Retrieve dataset metadata before loading data
- Connect discovered datasets to the Python SDK

**Prerequisites:**
- Access to an ODP Workspace (auto-authenticated) or an API key
- `odp-sdk` installed (`pip install -U odp-sdk`)

## 1. Setup and Configuration

In [None]:
import requests
import json
from pprint import pprint

# STAC API base URL
STAC_BASE_URL = "https://api.hubocean.earth/api/stac"

# Helper function for STAC requests
def stac_get(endpoint):
    """GET request to STAC API endpoint."""
    url = f"{STAC_BASE_URL}{endpoint}"
    # Set a timeout so a stalled connection cannot hang the notebook
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()

def stac_post(endpoint, payload):
    """POST request to STAC API endpoint."""
    url = f"{STAC_BASE_URL}{endpoint}"
    # Set a timeout so a stalled connection cannot hang the notebook
    response = requests.post(url, json=payload, timeout=30)
    response.raise_for_status()
    return response.json()

## 2. Explore the Root Catalog

The STAC root catalog provides links to collections and search endpoints.
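
Links are matched by their `rel` value rather than by position. As a minimal sketch, assuming the catalog follows the usual STAC API conventions (`data` for the collections endpoint, `search` for item search), a small helper can pull out the endpoints of interest. The `sample_catalog` dictionary and the `links_by_rel` name are hypothetical and only illustrate the shape of the response.

```python
# Hypothetical sample of a STAC root catalog; a real response carries more links.
sample_catalog = {
    "id": "example-catalog",
    "links": [
        {"rel": "self", "href": "https://example.org/stac/"},
        {"rel": "data", "href": "https://example.org/stac/collections"},
        {"rel": "search", "href": "https://example.org/stac/search", "method": "GET"},
        {"rel": "search", "href": "https://example.org/stac/search", "method": "POST"},
    ],
}


def links_by_rel(catalog, rel):
    """Return the hrefs of all links whose relation type equals `rel`."""
    return [
        link["href"]
        for link in catalog.get("links", [])
        if link.get("rel") == rel and "href" in link
    ]


collections_urls = links_by_rel(sample_catalog, "data")
search_urls = links_by_rel(sample_catalog, "search")
print("Collections endpoint(s):", collections_urls)
print("Search endpoint(s):", search_urls)
```

The same helper should work on the `root_catalog` dictionary returned by the API, provided the server advertises those relation types.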

In [None]:
# Get the root catalog
root_catalog = stac_get("/")

print("Catalog ID:", root_catalog.get("id"))
print("Description:", root_catalog.get("description"))
print("\nAvailable links:")
for link in root_catalog.get("links", []):
    print(f"  - {link.get('rel')}: {link.get('href')}")

## 3. List All Collections

Collections represent datasets in the STAC model. Each collection has metadata describing its spatial/temporal extent and available assets.
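
The STAC collection schema also allows `title`, `description`, and `keywords` fields, which is enough for a simple client-side keyword search. The sketch below assumes the collections have already been fetched as a list of dictionaries; `filter_collections` and the sample records are hypothetical, and some collections may omit these fields.

```python
# Hypothetical collection records, trimmed to the fields used for matching.
sample_collections = [
    {"id": "c1", "title": "Global Bathymetry Grid", "keywords": ["bathymetry", "seafloor"]},
    {"id": "c2", "title": "Ocean Chemistry Bottle Data",
     "description": "Carbon and nutrient measurements."},
    {"id": "c3", "title": None, "description": None},
]


def filter_collections(collections, term):
    """Case-insensitive match of `term` against title, description and keywords."""
    needle = term.lower()
    matches = []
    for coll in collections:
        # Missing or null fields are treated as empty text
        haystack = [coll.get("title") or "", coll.get("description") or ""]
        haystack.extend(coll.get("keywords") or [])
        if any(needle in text.lower() for text in haystack):
            matches.append(coll)
    return matches


for coll in filter_collections(sample_collections, "bathymetry"):
    print(coll["id"], "-", coll["title"])
```

Filtering on the client keeps the example independent of any server-side text search, which this notebook does not assume is available.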

In [None]:
# List all available collections
collections_response = stac_get("/collections")
collections = collections_response.get("collections", [])

print(f"Found {len(collections)} collections:\n")

for coll in collections:
    print(f"ID: {coll.get('id')}")
    print(f"  Title: {coll.get('title', 'N/A')}")
    print(f"  Description: {(coll.get('description') or 'N/A')[:100]}...")
    
    # Spatial extent
    extent = coll.get("extent", {})
    spatial = extent.get("spatial", {}).get("bbox", [])
    if spatial:
        print(f"  Bounding Box: {spatial[0]}")
    
    print()

## 4. Search by Spatial Extent

The STAC search endpoint allows filtering by:
- **bbox**: Bounding box `[west, south, east, north]`
- **intersects**: GeoJSON geometry
- **datetime**: RFC 3339 timestamp (a profile of ISO 8601) or a `start/end` interval, e.g. `2020-01-01T00:00:00Z/2020-12-31T23:59:59Z`
- **collections**: List of collection IDs to search within
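
These filters can be combined in a single request body. The following is a minimal sketch, assuming the endpoint follows the STAC API item-search conventions: `bbox` and `intersects` are mutually exclusive, and an open-ended interval is written with `..`. The `build_search_payload` helper and the collection ID `example-collection` are hypothetical.

```python
def build_search_payload(bbox=None, intersects=None, start=None, end=None,
                         collections=None, limit=10):
    """Assemble a STAC search body from optional filters."""
    if bbox is not None and intersects is not None:
        raise ValueError("Use either bbox or intersects, not both")

    payload = {"limit": limit}

    if bbox is not None:
        west, south, east, north = bbox
        # west > east is allowed (antimeridian crossing), south > north is not
        if south > north:
            raise ValueError("bbox south must not exceed north")
        payload["bbox"] = [west, south, east, north]

    if intersects is not None:
        payload["intersects"] = intersects

    if start or end:
        # ".." marks an open end of the interval
        payload["datetime"] = f"{start or '..'}/{end or '..'}"

    if collections:
        payload["collections"] = list(collections)

    return payload


example_payload = build_search_payload(
    bbox=[-5, 55, 30, 75],
    start="2020-01-01T00:00:00Z",
    collections=["example-collection"],
)
print(example_payload)
```

The resulting dictionary can be passed to a POST helper such as `stac_post("/search", example_payload)`; whether every filter is honored depends on the server.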

In [None]:
# Search for datasets covering Norwegian waters
# Approximate bounding box for Norwegian Sea
norwegian_sea_bbox = [-5, 55, 30, 75]  # [west, south, east, north]

search_payload = {
    "bbox": norwegian_sea_bbox,
    "limit": 10
}

search_results = stac_post("/search", search_payload)

print(f"Found {len(search_results.get('features', []))} items in Norwegian waters:\n")

for feature in search_results.get("features", []):
    props = feature.get("properties", {})
    print(f"ID: {feature.get('id')}")
    print(f"  Collection: {feature.get('collection')}")
    print(f"  Datetime: {props.get('datetime', 'N/A')}")
    print()

## 5. Search with GeoJSON Polygon

For more precise spatial queries, use a GeoJSON polygon with the `intersects` parameter.
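
When the area of interest is still rectangular, the polygon can be derived from a bounding box instead of typing the ring by hand. This sketch uses a hypothetical helper, `bbox_to_polygon`, and assumes the box does not cross the antimeridian. GeoJSON requires the ring to be closed (first and last positions identical), and exterior rings run counterclockwise.

```python
def bbox_to_polygon(bbox):
    """Convert [west, south, east, north] into a closed GeoJSON Polygon."""
    west, south, east, north = bbox
    ring = [
        [west, south],   # SW corner
        [east, south],   # SE corner
        [east, north],   # NE corner
        [west, north],   # NW corner
        [west, south],   # Repeat the first position to close the ring
    ]
    return {"type": "Polygon", "coordinates": [ring]}


example_polygon = bbox_to_polygon([-5, 51, 15, 72])
print(example_polygon)
```

Irregular areas, such as a coastline or an exclusive economic zone, still need a hand-built or imported geometry.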

In [None]:
# Define a polygon around the North Sea / Norwegian waters
north_sea_polygon = {
    "type": "Polygon",
    "coordinates": [[
        [-5, 51],   # SW corner
        [15, 51],   # SE corner
        [15, 72],   # NE corner
        [-5, 72],   # NW corner
        [-5, 51]    # Close polygon
    ]]
}

search_payload = {
    "intersects": north_sea_polygon,
    "limit": 100  # Return up to 100 items
}

search_results = stac_post("/search", search_payload)

features = search_results.get("features", [])
print(f"Found {len(features)} items intersecting search polygon\n")

# Collect unique collection IDs from spatial search
# (sorted so the numbered list below is stable between runs)
discovered_collection_ids = sorted({
    feature.get('collection') for feature in features if feature.get('collection')
})

# Include known working datasets that may not appear in spatial search
# (global datasets or those with different spatial indexing)
known_datasets = [
    "b960c80e-7ead-47af-b6c8-e92a9b5ac659",  # PGS Brazil - Biota and Physics (TABULAR)
    "15dac249-4e3d-474b-a246-ba95cffc8807",  # GLODAP - ocean chemistry
    "5070af58-6d8a-4636-a6a0-8ca9298fb3ab",  # GEBCO Bathymetry
]

for kd in known_datasets:
    if kd not in discovered_collection_ids:
        discovered_collection_ids.append(kd)

print(f"Total: {len(discovered_collection_ids)} unique collections to probe")
print("  (includes known working datasets)")
print("=" * 60)

In [None]:
from odp.client import Client

# Initialize ODP client to probe dataset types
client = Client()

# Probe each collection to determine type (tabular vs file-based)
collection_info = []

print("Probing collections to detect types...")
for coll_id in discovered_collection_ids:
    try:
        # Get STAC metadata for title
        coll_meta = stac_get(f"/collections/{coll_id}")
        title = coll_meta.get('title', 'Unknown')[:50]
        
        # Probe ODP to detect type
        ds = client.dataset(coll_id)
        schema = ds.table.schema()
        
        if schema:
            dtype = "TABULAR"
            stats = ds.table.stats()
            if stats and stats.num_rows:
                detail = f"{stats.num_rows:,} rows"
            else:
                detail = f"{len(schema)} columns"
        else:
            dtype = "FILE"
            files = ds.files.list()
            if files:
                detail = f"{len(files)} files"
            else:
                # No files accessible - might be permissions or different access pattern
                detail = "no direct access"
        
        collection_info.append({
            'id': coll_id,
            'title': title,
            'type': dtype,
            'detail': detail
        })
        print(f"  ✓ {title[:30]}... ({dtype}: {detail})")
    except Exception as e:
        print(f"  ✗ {coll_id[:20]}... (skipped: {str(e)[:30]})")

# Display numbered list for selection
print(f"\n{'='*70}")
print(f"Discovered {len(collection_info)} accessible collections:\n")
print(f"{'#':<3} {'Type':<8} {'Details':<18} {'Title'}")
print("-" * 70)

for i, info in enumerate(collection_info):
    print(f"{i+1:<3} {info['type']:<8} {info['detail']:<18} {info['title']}")

# Note about FILE datasets
file_datasets = [c for c in collection_info if c['type'] == 'FILE']
if file_datasets:
    print("\nNote: FILE datasets with 'no direct access' may require different")
    print("      access methods or permissions. Try TABULAR datasets first.")

In [None]:
# Select a dataset to explore
print("Enter a number from the list above (or press Enter for #1):")
choice = input("Selection: ").strip()

# Default to first item if empty
if not choice:
    choice = "1"

try:
    idx = int(choice) - 1
    if 0 <= idx < len(collection_info):
        selected = collection_info[idx]
        collection_id = selected['id']
        is_tabular = selected['type'] == 'TABULAR'
        
        print(f"\nSelected: {selected['title']}")
        print(f"Type: {selected['type']} ({selected['detail']})")
        print(f"ID: {collection_id}")
    else:
        print(f"Invalid selection. Please choose 1-{len(collection_info)}")
except ValueError:
    print("Please enter a valid number")

## 6. Select and Explore Dataset

Connect to the dataset selected above and examine its structure.
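
Interactive prompts block when a notebook is executed unattended, for example by a scheduler. As one possible alternative, the selection logic can be factored into a pure function that is easy to test. `resolve_selection` is a hypothetical name and is not part of the ODP SDK; the sample entries mimic the `collection_info` records built earlier.

```python
def resolve_selection(choice, items, default_index=0):
    """Map a 1-based menu choice (as text) to an item, or None if invalid."""
    text = (choice or "").strip()
    if not text:
        # Empty input falls back to the default entry, if there is one
        return items[default_index] if items else None
    try:
        idx = int(text) - 1
    except ValueError:
        return None
    return items[idx] if 0 <= idx < len(items) else None


# Hypothetical entries in the same shape as `collection_info`.
sample_info = [
    {"id": "id-1", "title": "First dataset", "type": "TABULAR"},
    {"id": "id-2", "title": "Second dataset", "type": "FILE"},
]

picked = resolve_selection("2", sample_info)
print(picked["title"] if picked else "Invalid selection")
```

Returning `None` for invalid input leaves it to the caller to decide whether to prompt again or stop.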

In [None]:
# Connect to selected dataset (client already initialized)
if 'collection_id' not in dir() or collection_id is None:
    print("No dataset selected. Run the selection cell above first.")
else:
    dataset = client.dataset(collection_id)

    print(f"Dataset: {selected['title']}")
    print(f"Type: {selected['type']}")
    print(f"ID: {collection_id}")
    print()

    if is_tabular:
        schema = dataset.table.schema()
        print(f"Schema ({len(schema)} columns):")
        for field in schema:
            print(f"  {field.name}: {field.type}")
    else:
        files = dataset.files.list()
        print(f"Contains {len(files)} files")

In [None]:
# Get dataset statistics (branched by type)
if 'is_tabular' not in dir():
    print("No dataset selected. Run the selection cell above first.")
elif is_tabular:
    stats = dataset.table.stats()
    if stats:
        print("Table Statistics:")
        print(f"  Total rows: {stats.num_rows:,}")
        print(f"  Size: {stats.size:,} bytes")
else:
    # File-based dataset: list files
    files = dataset.files.list()
    print(f"Files in dataset ({len(files)} total):\n")
    for f in files[:10]:  # Show first 10
        print(f"  ID: {f.get('id')}")
        print(f"    Name: {f.get('name', 'N/A')}")
        print(f"    Size: {f.get('size', 'N/A')} bytes")
        print(f"    MIME: {f.get('mime-type', 'N/A')}")
        print()
    if len(files) > 10:
        print(f"  ... and {len(files) - 10} more files")

In [None]:
# Preview data (branched by type)
from IPython.display import display

if 'is_tabular' not in dir():
    print("No dataset selected. Run the selection cell above first.")
elif is_tabular:
    # Preview first few rows of tabular data
    preview_df = dataset.table.select().all(max_rows=5).dataframe()
    print("Preview (first 5 rows):")
    display(preview_df)
else:
    # For file-based: show how to download a file
    files = dataset.files.list()
    if files:
        first_file = files[0]
        file_id = first_file.get('id')
        file_name = first_file.get('name', 'downloaded_file')
        
        print(f"Example: Download first file '{file_name}'")
        print(f"File ID: {file_id}")
        print(f"\nTo download, run:")
        print(f"  with open('{file_name}', 'wb') as f:")
        print(f"      for chunk in dataset.files.download('{file_id}'):")
        print(f"          f.write(chunk)")
    else:
        print("No files found in this dataset.")

## 7. Build a Dataset Inventory

Create a summary inventory of available datasets for reference.

In [None]:
import pandas as pd

# Build inventory from collections
inventory = []

for coll in collections:
    extent = coll.get("extent") or {}
    # The first bbox / interval describes the overall extent of the collection
    bboxes = (extent.get("spatial") or {}).get("bbox") or []
    intervals = (extent.get("temporal") or {}).get("interval") or []
    spatial = bboxes[0] if bboxes else None
    temporal = intervals[0] if intervals else None
    # Treat a missing or null description as empty
    description = coll.get("description") or ""

    inventory.append({
        "id": coll.get("id"),
        "title": coll.get("title", "N/A"),
        "description": description[:100] + "..." if description else "N/A",
        "license": coll.get("license", "N/A"),
        "bbox": str(spatial) if spatial else "N/A",
        "temporal_start": temporal[0] if temporal else "N/A",
        "temporal_end": temporal[1] if temporal and len(temporal) > 1 else "N/A",
        "keywords": ", ".join(coll.get("keywords") or [])
    })

inventory_df = pd.DataFrame(inventory)
print(f"Dataset Inventory ({len(inventory_df)} collections):")
inventory_df

In [None]:
# Save inventory to CSV for reference
inventory_df.to_csv("odp_dataset_inventory.csv", index=False)
print("Inventory saved to odp_dataset_inventory.csv")

## Next Steps

Now that you've discovered available datasets, continue with:

- **02_geospatial_analysis.ipynb**: Query and visualize data using H3 hexagonal aggregation
- **03_data_pipeline.ipynb**: Ingest files and transform into tabular data
- **04_multi_dataset_join.ipynb**: Combine multiple datasets for analysis

## Resources

- [ODP Documentation](https://docs.hubocean.earth/)
- [STAC Specification](https://stacspec.org/)
- [Python SDK Reference](https://docs.hubocean.earth/python_sdk/intro/)
- [ODP Catalog (Web UI)](https://app.hubocean.earth/catalog)