# STAC

## stackstac

## Cloud-Optimized GeoTIFF (COG) STAC Catalogs

### Catalogs by Sensor or Organization (Earth Search by Element 84 and Others)
<!-- 1. Sentinel
    - [Sentinel-2 Pre-Collection 1 Level-2A](https://stacindex.org/catalogs/earth-search#/WYEJqXnNPKBm6QrPX82kX2Q894FF8MQ3BzoXBFBzdbmeY?t=3):
    - [Sentinel-2 Level-2A](https://stacindex.org/catalogs/earth-search#/43bjKKcJQfxYaT1ir3Ep6uENfjEoQrjkzhd2?t=3)
    - [Sentinel-2 Level-1C](https://stacindex.org/catalogs/earth-search#/43bjKKcJQfxYaT1ir3Ep6uENfjEoQrjkzhYe?t=3)
    - [Sentinel-2 Collection 1 Level-2A](https://stacindex.org/catalogs/earth-search#/5WpJuuvfexLDmorjkAGQ8X6oLkYtawu8NG3NGezU?t=3) -->
<!-- 6. [China-Brazil Earth Resources Satellite (CBERS)](https://stacindex.org/catalogs/cbers) ([AWS link](https://registry.opendata.aws/cbers/)) -->
<!-- for future reference: -->
<!-- https://registry.opendata.aws/openaerialmap/ -->
<!-- https://registry.opendata.aws/pgc-earthdem/ -->
<!-- https://registry.opendata.aws/seefar/ https://coastalcarbon.ai/seefar -->


1. **Maxar Open Data**
    - [STAC index](https://stacindex.org/catalogs/maxar-open-data-catalog-ard-format#/) ([AWS link](https://registry.opendata.aws/maxar-open-data/))
2. **Harmonized Landsat and Sentinel-2 (HLS)**
    - Sentinel data: [MS Planetary Computer](https://planetarycomputer.microsoft.com/dataset/hls2-s30)
    - Landsat data: [MS Planetary Computer](https://planetarycomputer.microsoft.com/dataset/hls2-l30)
3. **Earthview**
    - [Dataset](https://satellogic-earthview.s3.us-west-2.amazonaws.com/index.html#data_access) ([AWS link](https://registry.opendata.aws/satellogic-earthview/))
4. **National Agriculture Imagery Program (NAIP)**
    - [STAC index](https://stacindex.org/catalogs/earth-search#/DHAWSgxTJP2jRyydLqfVMh?t=3) ([AWS link](https://registry.opendata.aws/naip/))
5. **Planet Labs**
    - SpaceNet7 STAC collection: [STAC index](https://www.planet.com/data/stac/browser/spacenet7/catalog.json?.language=en)

### Maxar Open Data STAC Catalog

## Save matching STAC items as hive partitioned geoparquets

1. 

In [None]:
def div_to
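
As a rough illustration of what "hive partitioned" means for STAC items, the sketch below groups item dictionaries by the year and month of `properties.datetime` and derives `key=value` directory paths. The partition keys (`year`, `month`), the `items` root folder, and the helper names `hive_partition_path` and `group_items_by_partition` are hypothetical choices, not taken from this notebook, and the actual GeoParquet write is left out.

```python
from collections import defaultdict
from datetime import datetime
from pathlib import PurePosixPath


def hive_partition_path(item: dict, root: str = "items") -> PurePosixPath:
    """Return a hive-style directory (key=value segments) for one STAC item."""
    # STAC datetimes are RFC 3339 strings; normalise a trailing "Z" for fromisoformat.
    stamp = datetime.fromisoformat(item["properties"]["datetime"].replace("Z", "+00:00"))
    return PurePosixPath(root) / f"year={stamp.year}" / f"month={stamp.month:02d}"


def group_items_by_partition(items: list[dict]) -> dict[str, list[str]]:
    """Map each partition directory to the ids of the items that belong in it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for item in items:
        groups[str(hive_partition_path(item))].append(item["id"])
    return dict(groups)


# Hypothetical, hand-written items standing in for a STAC search result.
example_items = [
    {"id": "a", "properties": {"datetime": "2023-01-15T10:00:00Z"}},
    {"id": "b", "properties": {"datetime": "2023-01-20T10:00:00Z"}},
    {"id": "c", "properties": {"datetime": "2023-03-02T10:00:00Z"}},
]
partitions = group_items_by_partition(example_items)
```

Each key of `partitions` is a directory into which one Parquet file (or several) for that group could be written; query engines that understand hive partitioning can then filter on `year` and `month` without opening every file.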

## Use Cubo to fetch STAC assets at coordinate point

[cubo](https://cubo.readthedocs.io/en/latest/index.html) is a Python library that builds on-demand data cubes from STAC (SpatioTemporal Asset Catalog) collections: given a latitude/longitude point, an edge size, and a date range, it queries a STAC API and returns the matching imagery as an `xarray.DataArray`.
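
A cube requested this way is a square window centred on the point. As a back-of-the-envelope check, the sketch below estimates the window's pixel dimensions from an edge length in metres and a pixel resolution. It assumes the edge length is simply divided by the resolution and rounded up, which is an assumption about how cubo treats its `edge_size`, `units`, and `resolution` parameters rather than something confirmed here. The helper name `cube_shape_px` is hypothetical.

```python
import math


def cube_shape_px(edge_size_m: float, resolution_m: float) -> tuple[int, int]:
    """Estimate the (rows, cols) of a square window from its edge length in metres."""
    if resolution_m <= 0:
        raise ValueError("resolution_m must be positive")
    # Assumption: pixels per edge = edge length / pixel size, rounded up to whole pixels.
    n = math.ceil(edge_size_m / resolution_m)
    return (n, n)


# A 1280 m window at 10 m resolution (the defaults used in the code further down).
shape = cube_shape_px(1280, 10)
print(shape)
```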

To do: look into managing raster data with DuckDB or similar:
- https://www.linkedin.com/pulse/querying-stac-load-satellite-imagery-sentinel2-duckdb-alvaro-huarte-yjuzf/?trackingId=wfMPNnd%2BDh2zi1AoiLG2vg%3D%3D


In [None]:
import cubo
import geopandas as gpd
import rioxarray  # noqa: F401 (importing registers the `.rio` accessor used below)
import xarray as xr

def percentile_normalize(da, lower_percentile=5, upper_percentile=95):
    """Normalize using percentiles to improve visualization contrast"""
    # Calculate percentiles per band
    mins = da.quantile(lower_percentile/100, dim=('x', 'y'))
    maxs = da.quantile(upper_percentile/100, dim=('x', 'y'))
    
    # Apply normalization
    normalized = (da - mins) / (maxs - mins)
    return normalized.clip(0, 1)

# use cubo to fetch a series of 4 tiles and visualize per their tutorial: https://cubo.readthedocs.io/en/latest/tutorials/getting_started.html
def fetch_cubo_stac_rasters_samples(
    pv_gdf: gpd.GeoDataFrame,
    stac_url='https://planetarycomputer.microsoft.com/api/stac/v1',
    collection='sentinel-2-l2a',
    bands=['B02', 'B03', 'B04'],
    start_date='2023-01-01',
    end_date='2023-03-31',
    units="m",
    edge_size=1280,
    resolution=10,
    sample_size=5,
    visualize_set=False,
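):
    """
    Fetch raster samples from a STAC collection using cubo and optionally visualize them.

    Args:
        pv_gdf (GeoDataFrame): Input GeoDataFrame with a geometry column and an
            'area_m2' column; expected to be in EPSG:4326.
        stac_url (str): STAC API URL.
        collection (str): STAC collection name.
        bands (list[str]): Band names to fetch.
        start_date (str): Start date for filtering (YYYY-MM-DD).
        end_date (str): End date for filtering (YYYY-MM-DD).
        units (str): Units of edge_size (e.g. "m" or "px").
        edge_size (int): Edge length of each square sample, expressed in `units`.
        resolution (int): Pixel size of the raster data.
        sample_size (int): Number of locations to sample.
        visualize_set (bool): If True, plot an RGB composite of the last fetched sample.

    Returns:
        xr.DataArray | list[xr.DataArray] | None: The samples concatenated along
        'time'; the list of individual samples if concatenation fails; or None
        if nothing could be fetched.
    """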

    # Keep the 1000 largest polygons by area, then draw a reproducible random sample
    sampled_gdf = pv_gdf.sort_values(by='area_m2', ascending=False)
    sampled_gdf = sampled_gdf[:1000].sample(sample_size, random_state=42)
    # Centroids are read as lat/lon below, so pv_gdf is expected to be in EPSG:4326
    default_query = {"eo:cloud_cover": {"lt": 25}}
    
    stac_items = []
    print(f"Fetching samples for {len(sampled_gdf)} locations...")
    for idx, row in sampled_gdf.iterrows():
        pv_lat, pv_long = row.geometry.centroid.y, row.geometry.centroid.x
        print(f"Fetching samples for {idx}: {pv_lat}, {pv_long} with area {row['area_m2']:.2f} m2")
        try:
            # Build a cube centred on this point; `query` restricts the search
            # to scenes with less than 25% cloud cover
            pv_da = cubo.create(
                lat=pv_lat,
                lon=pv_long,
                stac=stac_url,
                collection=collection,
                bands=bands,
                start_date=start_date,
                end_date=end_date,
                edge_size=edge_size,
                units=units,
                resolution=resolution,
                # To force a specific projection, pass stackstac keyword arguments
                # with an integer EPSG code (not an "EPSG:"-prefixed string), e.g.:
                # stackstac_kw=dict(epsg=32611),
                query=default_query,
            )

            stac_items.append(pv_da)

        except Exception as e:
            print(f"Error fetching data for {idx}: {e}")
    

    if not stac_items:
        print("No valid data items fetched.")
        return None


    # Visualize only the last successfully fetched sample for now
    if visualize_set:
        last_da = stac_items[-1]
        # Check that the 'time' coordinate exists and has data
        if "time" in last_da.coords and len(last_da.time) > 0:
            # Keep one image per timestamp, then cap the plot grid at 4 columns
            last_da = last_da.groupby('time').first()
            n_cols = min(4, len(last_da.time))
            # Build an RGB composite (red, green, blue) with a percentile stretch
            rgb_da = percentile_normalize(last_da.sel(band=["B04", "B03", "B02"]))
            rgb_da.plot.imshow(col="time", col_wrap=n_cols)
        else:
            print("No time steps available to visualize for the last sample")

    
    try:
        # Make sure every sample carries a CRS before combining
        aligned_items = []
        for item in stac_items:
            if item.rio.crs is None:
                # Use the EPSG code stored in the cube's attributes when present;
                # otherwise fall back to EPSG:4326
                item = item.rio.write_crs(item.attrs.get("epsg", "EPSG:4326"))
            aligned_items.append(item)

        # Stack the samples along 'time'. Samples from different locations do not
        # share x/y coordinates, so xarray pads the non-overlapping areas with NaN.
        merged_data = xr.concat(aligned_items, dim='time')

        # For spatial mosaicking of overlapping tiles, consider instead:
        # from rioxarray.merge import merge_arrays
        # merged_data = merge_arrays(aligned_items)
        
        return merged_data
        
    except Exception as e:
        print(f"Error merging datasets: {e}")
        # If merge fails, return the list of items instead
        print("Returning list of individual DataArrays without merging")
        return stac_items
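
To make the percentile stretch in `percentile_normalize` concrete without fetching any imagery, here is a NumPy-only sketch of the same idea on a synthetic `(band, y, x)` array. It is not the function above, which works on an `xarray.DataArray`; the name `percentile_stretch`, the guard for flat bands, and the random test data are illustrative assumptions.

```python
import numpy as np


def percentile_stretch(arr: np.ndarray, lower: float = 5, upper: float = 95) -> np.ndarray:
    """Rescale each band of a (band, y, x) array to [0, 1] between two percentiles."""
    # Per-band percentiles over the spatial axes; keepdims lets them broadcast back.
    lo = np.nanpercentile(arr, lower, axis=(1, 2), keepdims=True)
    hi = np.nanpercentile(arr, upper, axis=(1, 2), keepdims=True)
    # Guard against flat bands, where hi == lo would divide by zero.
    span = np.where(hi > lo, hi - lo, 1.0)
    return np.clip((arr - lo) / span, 0.0, 1.0)


# Synthetic reflectance-like data: 3 bands of 64 x 64 pixels with different value ranges.
rng = np.random.default_rng(42)
scales = np.array([1000.0, 3000.0, 6000.0]).reshape(3, 1, 1)
bands = rng.uniform(0, 1, size=(3, 64, 64)) * scales
stretched = percentile_stretch(bands)
```

Values below the lower percentile clip to 0 and values above the upper percentile clip to 1, so a few very bright or very dark pixels no longer dominate the contrast of the RGB composite.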



# Xarray

## Zarr and ARCO storage formats

## Kerchunk and VirtualiZarr