# Multi-Year Lake Ice Phenology Data Export
## Part 2: GEE Processing and Export (2019-2021)

**Goal:** Export S1 + S2 + ERA5 data for North Slope lakes across 3 years

**Strategy:**
- Process chunks independently (spatial parallelization)
- Export one year at a time per chunk
- Use efficient spatial filtering (only process S2 images that overlap lakes)
- Total exports: 13 chunks × 3 years × 2 datasets (S1, S2) = 78 export tasks; ERA5 is downloaded separately
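
The export plan in the list above can be sketched as a simple enumeration. This is a minimal illustration, assuming the 13 chunks reported by the chunk statistics later in the notebook and the `S1_chunkNN_YYYY` task naming used in the export cells; `plan_exports` is a hypothetical helper, not part of the notebook.

```python
from itertools import product

def plan_exports(n_chunks, years, datasets=("S1", "S2")):
    """Return one task description per (chunk, year, dataset) combination."""
    return [
        f"{ds}_chunk{chunk_id:02d}_{year}"
        for chunk_id, year, ds in product(range(n_chunks), years, datasets)
    ]

# Assumed values: 13 chunks (from chunk_statistics.csv) and the three study years
tasks = plan_exports(13, [2019, 2020, 2021])
print(len(tasks), tasks[0], tasks[-1])
```

Listing the tasks up front makes it easy to compare the expected count against the number of tasks actually queued.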

**Data sources:**
- Sentinel-1 GRD (SAR)
- Sentinel-2 SR Harmonized (optical, for NDSI)
- ERA5-Land (temperature)

**Years:** 2019, 2020, 2021 (match ALPOD temporal coverage)

---
## Setup

In [1]:
import ee
import pandas as pd
import numpy as np
import geopandas as gpd
from datetime import datetime
import time
import xarray as xr

# Initialize Earth Engine
ee.Initialize()

print("Imports successful!")
print(f"Earth Engine initialized: {ee.String('GEE Initialized').getInfo()}")

httplib2 transport does not support per-request timeout. Set the timeout when constructing the httplib2.Http instance.


Imports successful!
Earth Engine initialized: GEE Initialized


In [2]:
# Configuration
BUCKET = 'wustl-eeps-geospatial'
BASE_PATH = 'thermokarst_lakes'
CHUNKS_PATH = f'gs://{BUCKET}/{BASE_PATH}/processed/chunks'
OUTPUT_PATH = f'{BASE_PATH}/exports'  # No gs:// prefix for GEE exports

# Years to process (match ALPOD coverage)
YEARS = [2019, 2020, 2021]

# Processing parameters
SCALE = 10  # Sentinel-1 resolution
S2_NDSI_THRESHOLD = 0.4  # NDSI > 0.4 = ice
S2_CLOUD_THRESHOLD = 30  # Maximum cloud cover for S2 images (scene-level)
S2_CLOUD_PROB_THRESHOLD = 40  # s2cloudless probability threshold (pixel-level)
S2_TIME_WINDOW = 3  # Days before/after S1 acquisition to look for S2

# Projection
ALASKA_ALBERS = 'EPSG:3338'

print(f"Configuration:")
print(f"  Years: {YEARS}")
print(f"  Chunks path: {CHUNKS_PATH}")
print(f"  Output: gs://{BUCKET}/{OUTPUT_PATH}")
print(f"  S2 scene cloud threshold: {S2_CLOUD_THRESHOLD}%")
print(f"  S2 pixel cloud probability threshold: {S2_CLOUD_PROB_THRESHOLD}%")
print(f"  S2 time window: ±{S2_TIME_WINDOW} days")

Configuration:
  Years: [2019, 2020, 2021]
  Chunks path: gs://wustl-eeps-geospatial/thermokarst_lakes/processed/chunks
  Output: gs://wustl-eeps-geospatial/thermokarst_lakes/exports
  S2 scene cloud threshold: 30%
  S2 pixel cloud probability threshold: 40%
  S2 time window: ±3 days
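
`S2_TIME_WINDOW` is described as the number of days before or after an S1 acquisition in which to look for S2 imagery, but the pairing step itself is not shown in this notebook. The sketch below is one way it could be applied to the exported tables, assuming ISO date strings like those written to `s1_date` and `s2_date`; `match_s2_to_s1` is a hypothetical helper.

```python
from datetime import date

S2_TIME_WINDOW = 3  # days, same value as the configuration cell

def match_s2_to_s1(s1_dates, s2_dates, window_days=S2_TIME_WINDOW):
    """For each S1 date, return the nearest S2 date within ±window_days, else None."""
    s2_parsed = [date.fromisoformat(d) for d in s2_dates]
    matches = {}
    for s1 in s1_dates:
        s1_day = date.fromisoformat(s1)
        # Keep only S2 acquisitions inside the window, then pick the closest one
        candidates = [d for d in s2_parsed if abs((d - s1_day).days) <= window_days]
        best = min(candidates, key=lambda d: abs((d - s1_day).days), default=None)
        matches[s1] = best.isoformat() if best else None
    return matches

pairs = match_s2_to_s1(
    ["2019-05-10", "2019-05-22"],
    ["2019-05-09", "2019-05-12", "2019-06-01"],
)
print(pairs)
```

S1 acquisitions with no S2 scene inside the window map to `None`, so they can be kept as unlabeled observations rather than dropped.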


---
## Upload ALPOD Data

In [None]:
# =============================================================================
# UPLOAD ALPOD TO GEE ASSET (ONE-TIME ONLY)
# =============================================================================
# Set to False after asset is uploaded
UPLOAD_ALPOD_ASSET = True

if UPLOAD_ALPOD_ASSET:
    print("Loading ALPOD shapefile from GCS...")
    alpod_gdf = gpd.read_file('gs://wustl-eeps-geospatial/thermokarst_lakes/ALPODlakes/ALPODlakes.shp')
    print(f"Loaded {len(alpod_gdf):,} lakes")

    # Fix invalid geometries
    print("Fixing invalid geometries...")
    invalid_count = (~alpod_gdf.is_valid).sum()
    print(f"  Found {invalid_count} invalid geometries")
    
    alpod_gdf['geometry'] = alpod_gdf.geometry.buffer(0)  # Standard fix for invalid geometries
    
    # Remove any null/empty geometries
    alpod_gdf = alpod_gdf[~alpod_gdf.geometry.is_empty & alpod_gdf.geometry.notna()]
    print(f"  After cleaning: {len(alpod_gdf):,} lakes")

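    # Earth Engine reads client-side GeoJSON as lon/lat (EPSG:4326), so reproject first
    alpod_gdf = alpod_gdf.to_crs('EPSG:4326')

    print("Converting to GEE FeatureCollection...")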
    alpod_fc = ee.FeatureCollection(alpod_gdf.__geo_interface__)
    print("Conversion complete")

    print("Starting asset upload...")
    task = ee.batch.Export.table.toAsset(
        collection=alpod_fc,
        description='ALPOD_full_upload',
        assetId='projects/eeps-geospatial/assets/ALPOD_full'
    )
    task.start()

    print("\n" + "="*60)
    print("ASSET UPLOAD STARTED!")
    print("="*60)
    print("Check progress: https://code.earthengine.google.com/tasks")
    print("This may take 30-60 minutes for 800k lakes...")
    print("\n⚠️  After upload completes, set UPLOAD_ALPOD_ASSET = False")
else:
    print("Skipping ALPOD upload (already done)")
    print("Asset location: projects/eeps-geospatial/assets/ALPOD_full")

Loading ALPOD shapefile from GCS...
Loaded 801,895 lakes
Fixing invalid geometries...
  Found 209 invalid geometries
  After cleaning: 801,894 lakes
Converting to GEE FeatureCollection...


---
## Helper Functions

In [3]:
def load_chunk_from_bucket(chunk_id):
    """
    Load a chunk GeoJSON from the bucket and convert it to an ee.FeatureCollection.
    Downloads to a local temp file first to avoid GCS streaming issues.

    Returns (fc, gdf): the EE FeatureCollection (lon/lat) and the GeoDataFrame as loaded.
    """
    import subprocess
    
    chunk_file = f'{CHUNKS_PATH}/chunk_{chunk_id:02d}.geojson'
    
    # Download to local temp file
    local_path = f'/tmp/chunk_{chunk_id:02d}.geojson'
    
    # Use gsutil to download (reliable in Vertex AI); check=True raises if the copy fails
    subprocess.run(['gsutil', '-q', 'cp', chunk_file, local_path], check=True)
    
    # Load from local file
    gdf = gpd.read_file(local_path)
    
    print(f"  Loaded chunk {chunk_id}: {len(gdf)} lakes")
    print(f"    Bounds: {gdf.total_bounds}")
    
    # Keep only essential properties to reduce payload size
    essential_cols = ['geometry']
    
    if 'lake_area_km2' in gdf.columns:
        essential_cols.append('lake_area_km2')
    
    # Assign a sequential ID (unique within this chunk only)
    gdf['lake_id'] = range(len(gdf))
    essential_cols.append('lake_id')
    
    gdf_simplified = gdf[essential_cols].copy()
    
    # Simplify geometries to reduce size (5 m tolerance; assumes a metric CRS such as EPSG:3338)
    print("  Simplifying geometries...")
    gdf_simplified['geometry'] = gdf_simplified.geometry.simplify(
        tolerance=5, preserve_topology=True
    )
    
    # Earth Engine reads client-side GeoJSON as lon/lat (EPSG:4326), so reproject before converting
    geojson = gdf_simplified.to_crs('EPSG:4326').__geo_interface__
    
    print("  Creating EE FeatureCollection...")
    fc = ee.FeatureCollection(geojson)
    print("  FeatureCollection created successfully")
    
    return fc, gdf

In [4]:
def add_lake_geometry_metrics(lakes_fc, region_bounds):
    """
    Add lake interior and landscape ring geometries
    Uses GEE Asset for ALPOD (avoids payload bloat)
    """
    # Load ALPOD from GEE Asset (tiny reference, not 800k geometries!)
    all_alpod = ee.FeatureCollection('projects/eeps-geospatial/assets/ALPOD_full')
    
    # Filter to region (with buffer for edge lakes)
    search_bounds = region_bounds.buffer(500)
    all_alpod_nearby = all_alpod.filterBounds(search_bounds)
    
    # Dissolve all nearby lakes for landscape masking
    all_lakes_dissolved = all_alpod_nearby.geometry().dissolve(maxError=10)
    
    def add_geometries(lake):
        lake_geom = lake.geometry()
        lake_id = lake.get('lake_id')
        
        # Lake interior: 10m inward buffer
        lake_interior = lake_geom.buffer(-10)
        
        # Landscape ring: 100m outward buffer, minus ALL lakes
        landscape_ring = lake_geom.buffer(100).difference(all_lakes_dissolved)
        
        return lake.set({
            'lake_id': lake_id,
            'lake_interior': lake_interior,
            'landscape_ring': landscape_ring
        })
    
    return lakes_fc.map(add_geometries)

---
## Sentinel-1 Processing

In [5]:
def process_sentinel1(lakes_fc, year, region_bounds):
    """
    Extract per-lake Sentinel-1 backscatter features for one year.
    Output features carry no geometry to avoid payload bloat.
    """
    s1 = (ee.ImageCollection('COPERNICUS/S1_GRD')
          # filterDate's end is exclusive, so stop at Jan 1 of the following year
          .filterDate(f'{year}-01-01', f'{year + 1}-01-01')
          .filterBounds(region_bounds)
          .filter(ee.Filter.listContains('transmitterReceiverPolarisation', 'VV'))
          .filter(ee.Filter.listContains('transmitterReceiverPolarisation', 'VH'))
          .filter(ee.Filter.eq('instrumentMode', 'IW'))
          .filter(ee.Filter.eq('orbitProperties_pass', 'DESCENDING'))
          .select(['VV', 'VH']))
    
    # reduceRegions reduces over each feature's own geometry and ignores
    # geometry-valued properties, so promote each zone to the feature geometry
    def zone_features(prop):
        return lakes_fc.map(lambda f: ee.Feature(
            ee.Geometry(f.get(prop)), {'lake_id': f.get('lake_id')}))
    
    lake_interiors = zone_features('lake_interior')
    landscape_rings = zone_features('landscape_ring')
    
    def extract_s1_features(img):
        # Sample lake interiors; drop lakes with no valid pixels in this image
        lake_samples = img.reduceRegions(
            collection=lake_interiors,
            reducer=ee.Reducer.mean(),
            scale=SCALE
        ).filter(ee.Filter.notNull(['VV', 'VH']))
        
        # Sample landscape rings
        landscape_samples = img.reduceRegions(
            collection=landscape_rings,
            reducer=ee.Reducer.mean(),
            scale=SCALE
        )
        
        # Join and create output features
        def create_s1_feature(lake_feat):
            lake_id = lake_feat.get('lake_id')
            
            # Find matching landscape sample
            land_feat = landscape_samples.filter(
                ee.Filter.eq('lake_id', lake_id)
            ).first()
            
            # Lake values (dB)
            vv_lake = ee.Number(lake_feat.get('VV'))
            vh_lake = ee.Number(lake_feat.get('VH'))
            
            # Landscape values (dB)
            vv_land = ee.Number(land_feat.get('VV'))
            vh_land = ee.Number(land_feat.get('VH'))
            
            # Derived features; a VV/VH ratio in linear power is a difference in dB
            vv_vh_ratio = vv_lake.subtract(vh_lake)
            lake_R = vv_lake.unitScale(-20, -5).multiply(255)
            lake_G = vh_lake.unitScale(-28, -12).multiply(255)
            lake_B = vv_vh_ratio.unitScale(8, 18).multiply(255)
            land_R = vv_land.unitScale(-20, -5).multiply(255)
            land_G = vh_land.unitScale(-28, -12).multiply(255)
            land_vv_vh = vv_land.subtract(vh_land)
            land_B = land_vv_vh.unitScale(8, 18).multiply(255)
            
            # Get date from image
            date = img.date()
            
            # Return simple feature with NO geometry and NO complex references
            return ee.Feature(None, {
                'lake_id': lake_id,
                's1_date': date.format('YYYY-MM-dd'),
                # getRelative is 0-based (Jan 1 = 0); the ERA5 'doy' is 1-based
                's1_doy': date.getRelative('day', 'year'),
                'vv_db': vv_lake,
                'vh_db': vh_lake,
                'vv_vh_ratio': vv_vh_ratio,
                'lake_R': lake_R,
                'lake_G': lake_G,
                'lake_B': lake_B,
                'land_R': land_R,
                'land_G': land_G,
                'land_B': land_B,
                'vv_land_db': vv_land,
                'vh_land_db': vh_land
            })
        
        return lake_samples.map(create_s1_feature)
    
    s1_features = s1.map(extract_s1_features).flatten()
    return s1_features
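
The RGB features in `process_sentinel1` rescale backscatter in dB to a 0-255 range. The arithmetic can be checked in plain Python; this sketch mirrors the ranges used for `lake_R`, `lake_G`, and `lake_B`. It assumes, as the Earth Engine documentation states for `unitScale`, that values outside the range are not clamped. `unit_scale` and `s1_rgb` are illustrative names only.

```python
def unit_scale(value, low, high):
    """Linearly map [low, high] to [0, 1] without clamping."""
    return (value - low) / (high - low)

def s1_rgb(vv_db, vh_db):
    """Return (R, G, B) from VV and VH backscatter in dB."""
    ratio_db = vv_db - vh_db  # a ratio in linear power is a difference in dB
    r = unit_scale(vv_db, -20, -5) * 255
    g = unit_scale(vh_db, -28, -12) * 255
    b = unit_scale(ratio_db, 8, 18) * 255
    return r, g, b

# VV - VH = 7.5 dB lies below the 8 dB lower bound, so B comes out negative
rgb = s1_rgb(vv_db=-12.5, vh_db=-20.0)
print(rgb)
```

Because out-of-range inputs produce values below 0 or above 255, downstream code may want to clip these columns before treating them as 8-bit colors.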

---
## Sentinel-2 Processing

In [6]:
def compute_ndsi(img):
    """
    Compute Normalized Difference Snow Index (NDSI)
    NDSI = (Green - SWIR1) / (Green + SWIR1)
    """
    green = img.select('B3')
    swir1 = img.select('B11')
    
    ndsi = green.subtract(swir1).divide(green.add(swir1)).rename('ndsi')
    
    return img.addBands(ndsi)

def mask_s2_clouds(img):
    """
    Mask clouds using QA60 band (basic cloud mask)
    """
    qa = img.select('QA60')
    
    # Bits 10 and 11 are clouds and cirrus
    cloud_bit_mask = 1 << 10
    cirrus_bit_mask = 1 << 11
    
    # Both should be zero (clear conditions)
    mask = qa.bitwiseAnd(cloud_bit_mask).eq(0).And(
           qa.bitwiseAnd(cirrus_bit_mask).eq(0))
    
    return img.updateMask(mask)

def add_s2cloudless_mask(img):
    s2_cloudless = ee.ImageCollection('COPERNICUS/S2_CLOUD_PROBABILITY')
    cloud_prob_collection = s2_cloudless.filter(
        ee.Filter.eq('system:index', img.get('system:index'))
    )
    
    has_cloud_data = cloud_prob_collection.size().gt(0)
    
    def apply_s2cloudless_mask():
        cloud_prob = cloud_prob_collection.first().select('probability')
        is_clear = cloud_prob.lt(S2_CLOUD_PROB_THRESHOLD)
        return img.updateMask(is_clear)
    
    def use_qa60_only():
        return img  # Already has QA60 mask
    
    return ee.Image(ee.Algorithms.If(
        has_cloud_data,
        apply_s2cloudless_mask(),
        use_qa60_only()
    ))
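
The QA60 test in `mask_s2_clouds` is a pair of bit checks. A plain-Python sketch of the same logic on integer QA values (the sample values are chosen for illustration):

```python
CLOUD_BIT = 1 << 10   # bit 10: opaque clouds
CIRRUS_BIT = 1 << 11  # bit 11: cirrus

def is_clear(qa60):
    """True when neither the cloud bit nor the cirrus bit is set."""
    return (qa60 & CLOUD_BIT) == 0 and (qa60 & CIRRUS_BIT) == 0

for qa in (0, 1024, 2048, 3072):
    print(qa, is_clear(qa))
```

Only a QA value with both bits unset counts as clear, which matches the `.eq(0).And(...)` expression in the Earth Engine version.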

In [7]:
def process_sentinel2(lakes_fc, year, region_bounds):
    """
    Extract per-lake Sentinel-2 NDSI ice flags for one year.
    Output features carry no geometry to avoid payload bloat.
    """
    s2 = (ee.ImageCollection('COPERNICUS/S2_SR_HARMONIZED')
          # filterDate's end is exclusive, so stop at Jan 1 of the following year
          .filterDate(f'{year}-01-01', f'{year + 1}-01-01')
          .filterBounds(region_bounds)
          .filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', S2_CLOUD_THRESHOLD)))
    
    # QA60 mask first, then s2cloudless where a probability image exists
    # (both functions are defined in the previous cell)
    s2 = s2.map(mask_s2_clouds).map(add_s2cloudless_mask)
    
    def add_ndsi(img):
        ndsi = img.normalizedDifference(['B3', 'B11']).rename('ndsi')
        return img.addBands(ndsi)
    
    s2 = s2.map(add_ndsi)
    
    # reduceRegions reduces over each feature's own geometry and ignores
    # geometry-valued properties, so promote 'lake_interior' to the feature geometry
    lake_interiors = lakes_fc.map(lambda f: ee.Feature(
        ee.Geometry(f.get('lake_interior')), {'lake_id': f.get('lake_id')}))
    
    def extract_s2_features(img):
        # Sample lake interiors only; drop lakes with no clear pixels in this image
        samples = img.select('ndsi').reduceRegions(
            collection=lake_interiors,
            reducer=ee.Reducer.mean(),
            scale=20
        ).filter(ee.Filter.notNull(['mean']))
        
        def create_s2_feature(feat):
            ndsi_mean = ee.Number(feat.get('mean'))
            # Binary flag from the lake-mean NDSI (1 = ice), not a per-pixel fraction
            ice_fraction = ee.Algorithms.If(
                ndsi_mean.gte(S2_NDSI_THRESHOLD),
                1.0,
                0.0
            )
            
            date = img.date()
            
            return ee.Feature(None, {
                'lake_id': feat.get('lake_id'),
                's2_date': date.format('YYYY-MM-dd'),
                # getRelative is 0-based (Jan 1 = 0); the ERA5 'doy' is 1-based
                's2_doy': date.getRelative('day', 'year'),
                's2_ice_fraction': ice_fraction
            })
        
        return samples.map(create_s2_feature)
    
    s2_features = s2.map(extract_s2_features).flatten()
    return s2_features
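
`s2_ice_fraction` is computed by thresholding the lake-mean NDSI, so it is a 0/1 flag per lake and date. The plain-Python sketch below contrasts that with a per-pixel fraction. The reflectance values are made up for illustration, and both helper functions are hypothetical.

```python
S2_NDSI_THRESHOLD = 0.4  # same value as the configuration cell

def ndsi(green, swir1):
    """Normalized Difference Snow Index from green (B3) and SWIR1 (B11) reflectance."""
    return (green - swir1) / (green + swir1)

def ice_flag_from_mean(ndsi_values, threshold=S2_NDSI_THRESHOLD):
    """Threshold the mean NDSI: the approach used in process_sentinel2."""
    mean = sum(ndsi_values) / len(ndsi_values)
    return 1.0 if mean >= threshold else 0.0

def ice_fraction_per_pixel(ndsi_values, threshold=S2_NDSI_THRESHOLD):
    """Alternative: share of pixels whose own NDSI passes the threshold."""
    return sum(v >= threshold for v in ndsi_values) / len(ndsi_values)

# Two bright ice-like pixels and two dark water-like pixels (illustrative values)
pixels = [ndsi(0.8, 0.1), ndsi(0.7, 0.1), ndsi(0.1, 0.1), ndsi(0.05, 0.15)]
flag = ice_flag_from_mean(pixels)
fraction = ice_fraction_per_pixel(pixels)
print(flag, fraction)
```

For a half-frozen lake like this one, the mean-based flag reports no ice while the per-pixel fraction reports 0.5, which is worth keeping in mind when interpreting the column during break-up and freeze-up.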

---
## ERA5 Temperature Processing

In [8]:
def process_era5_temperature(lakes_fc, year):
    """
    Process ERA5-Land temperature data - OPTIMIZED VERSION
    Pre-computes daily means once, then samples all lakes in batch
    Much faster than computing daily mean separately for each lake
    """
    print(f"  Loading ERA5 hourly data for {year}...")
    
    # Load ERA5-Land hourly data
    era5_hourly = (ee.ImageCollection('ECMWF/ERA5_LAND/HOURLY')
                   # filterDate's end is exclusive; ending on Dec 31 would drop that day's data
                   .filterDate(f'{year}-01-01', f'{year + 1}-01-01')
                   .select('temperature_2m'))
    
    # Convert to Celsius
    def to_celsius(img):
        temp_c = img.subtract(273.15).rename('temp_c')
        return temp_c.copyProperties(img, ['system:time_start'])
    
    era5_hourly = era5_hourly.map(to_celsius)
    
    print(f"  Pre-computing daily means...")
    
    # Determine number of days in year (handle leap years)
    is_leap = ee.Number(year).mod(4).eq(0).And(
        ee.Number(year).mod(100).neq(0).Or(
            ee.Number(year).mod(400).eq(0)
        )
    )
    n_days = ee.Number(ee.Algorithms.If(is_leap, 366, 365))
    
    # Pre-compute daily means for entire year (365 or 366 images)
    def compute_daily_mean(day):
        day = ee.Number(day)
        date = ee.Date.fromYMD(year, 1, 1).advance(day.subtract(1), 'day')
        next_date = date.advance(1, 'day')
        
        # Get all hourly images for this day
        daily_collection = era5_hourly.filterDate(date, next_date)
        
        # Check if we have data
        has_data = daily_collection.size().gt(0)
        
        # Compute mean if data exists, otherwise use missing flag
        daily_mean = ee.Image(ee.Algorithms.If(
            has_data,
            daily_collection.mean(),
            ee.Image.constant(-9999).rename('temp_c')
        ))
        
        return daily_mean.set({
            'system:time_start': date.millis(),
            'doy': day,
            'date': date.format('YYYY-MM-dd')
        })
    
    days = ee.List.sequence(1, n_days)
    era5_daily = ee.ImageCollection.fromImages(days.map(compute_daily_mean))
    
    print(f"  Sampling all lakes from daily means...")
    
    # Now sample ALL lakes from each daily mean (batch operation)
    def sample_all_lakes(daily_img):
        doy = daily_img.get('doy')
        date = daily_img.get('date')
        
        # Sample ALL lakes at once using reduceRegions
        samples = daily_img.reduceRegions(
            collection=lakes_fc,
            reducer=ee.Reducer.first(),  # First pixel value found within each lake polygon
            scale=11000  # ERA5-Land resolution
        )
        
        # Add date info to each sampled feature
        def add_date_info(feat):
            # Get the temperature value (from 'first' property created by reducer)
            # Use ee.Algorithms.If to provide default if missing
            temp_value = ee.Algorithms.If(
                feat.propertyNames().contains('first'),
                feat.get('first'),
                -9999
            )
            
            return feat.set({
                'era5_date': date,
                'era5_doy': doy,
                'temp_c': temp_value
            })
        
        return samples.map(add_date_info)
    
    # Process all daily images
    era5_features = era5_daily.map(sample_all_lakes).flatten()
    
    return era5_features
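
Two date conventions matter when joining these tables. Earth Engine's `filterDate` treats the end date as exclusive, and `ee.Date.getRelative('day', 'year')` (used for `s1_doy` and `s2_doy`) is 0-based, while the `doy` sequence in `process_era5_temperature` starts at 1. A plain-Python sketch of both points, using only the standard library:

```python
import calendar
from datetime import date

def days_in_range(start, end_exclusive):
    """Number of whole days in the half-open interval [start, end_exclusive)."""
    return (end_exclusive - start).days

year = 2020
n_days = 366 if calendar.isleap(year) else 365

# Ending the range on Dec 31 silently drops the last day of the year
short = days_in_range(date(year, 1, 1), date(year, 12, 31))
full = days_in_range(date(year, 1, 1), date(year + 1, 1, 1))

def doy_one_based(d):
    """Day of year with Jan 1 = 1 (the ERA5 'doy' convention here)."""
    return d.timetuple().tm_yday

def doy_zero_based(d):
    """Day of year with Jan 1 = 0 (the getRelative convention)."""
    return doy_one_based(d) - 1

print(n_days, short, full)
print(doy_one_based(date(year, 12, 31)), doy_zero_based(date(year, 12, 31)))
```

Joining on the date strings (`s1_date`, `s2_date`, `era5_date`) rather than on the day-of-year columns sidesteps the off-by-one between the two conventions.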

---
## Export Functions

In [9]:
def export_chunk_year(chunk_id, year, lakes_fc, region_bounds):
    """
    Process and export S1/S2 data for one chunk and one year
    NOTE: ERA5 is handled separately (too large for GEE export)
    """
    print(f"\n{'='*60}")
    print(f"Processing Chunk {chunk_id}, Year {year}")
    print(f"{'='*60}")
    
    # Add lake geometries
    lakes_with_geom = add_lake_geometry_metrics(lakes_fc, region_bounds)
    print("Lakes processed with geometries")
    
    # Process S1
    print("\nProcessing Sentinel-1...")
    s1_features = process_sentinel1(lakes_with_geom, year, region_bounds)
    print("  S1 processing complete")
    
    # Process S2
    print("\nProcessing Sentinel-2...")
    s2_features = process_sentinel2(lakes_with_geom, year, region_bounds)
    print("  S2 processing complete")
    
    # ERA5 will be downloaded separately
    print("\nERA5 temperature...")
    print("  (Will be downloaded separately - more efficient than GEE export)")
    
    # Create (but do not start) the S1 and S2 export tasks
    
    # S1 export
    s1_task = ee.batch.Export.table.toCloudStorage(
        collection=s1_features,
        description=f'S1_chunk{chunk_id:02d}_{year}',
        bucket=BUCKET,
        fileNamePrefix=f'{OUTPUT_PATH}/{year}/chunk_{chunk_id:02d}/s1_data',
        fileFormat='CSV'
    )
    
    # S2 export
    s2_task = ee.batch.Export.table.toCloudStorage(
        collection=s2_features,
        description=f'S2_chunk{chunk_id:02d}_{year}',
        bucket=BUCKET,
        fileNamePrefix=f'{OUTPUT_PATH}/{year}/chunk_{chunk_id:02d}/s2_data',
        fileFormat='CSV'
    )
    
    # Return only S1 and S2 tasks (no ERA5)
    exports = [
        {'task': s1_task, 'type': 'S1', 'count': 'N/A'},
        {'task': s2_task, 'type': 'S2', 'count': 'N/A'}
    ]
    
    return exports
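
The `fileNamePrefix` values follow a `{OUTPUT_PATH}/{year}/chunk_NN/{sensor}_data` layout. A small sketch of that naming, useful when locating the outputs later. `export_prefix` and `export_uri` are hypothetical helpers, and the `.csv` suffix assumes Earth Engine appends the extension to the prefix for CSV table exports.

```python
BUCKET = "wustl-eeps-geospatial"
OUTPUT_PATH = "thermokarst_lakes/exports"

def export_prefix(year, chunk_id, sensor, output_path=OUTPUT_PATH):
    """Build the object prefix used for one chunk-year export."""
    return f"{output_path}/{year}/chunk_{chunk_id:02d}/{sensor.lower()}_data"

def export_uri(year, chunk_id, sensor, bucket=BUCKET):
    """Expected GCS location of the exported CSV (assumes a '.csv' suffix)."""
    return f"gs://{bucket}/{export_prefix(year, chunk_id, sensor)}.csv"

prefix = export_prefix(2019, 0, "S1")
uri = export_uri(2021, 12, "S2")
print(prefix)
print(uri)
```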

---
## Main Export Loop

In [10]:
# Load chunk statistics to know how many chunks we have
chunk_stats = pd.read_csv(f'gs://{BUCKET}/{BASE_PATH}/processed/chunk_statistics.csv')
n_chunks = len(chunk_stats)

print(f"Total chunks to process: {n_chunks}")
print(f"Years to process: {YEARS}")
print(f"Total exports: {n_chunks * len(YEARS) * 3} (chunks × years × 3 datasets)")
print("\nChunk statistics:")
print(chunk_stats[['chunk_id', 'n_lakes', 'lat_min', 'lat_max', 'lon_min', 'lon_max']])

Total chunks to process: 13
Years to process: [2019, 2020, 2021]
Total exports: 117 (chunks × years × 3 datasets)

Chunk statistics:
    chunk_id  n_lakes    lat_min    lat_max     lon_min     lon_max
0          0     2083  69.160813  70.902079 -153.130795 -151.727647
1          1     1452  69.006262  70.827977 -159.368810 -157.856992
2          2     1327  69.040671  70.501113 -150.347535 -148.888885
3          3     2518  69.006075  71.117631 -155.556146 -154.195153
4          4      780  69.027426  70.294397 -163.578774 -160.999169
5          5      223  69.061209  70.121377 -144.849551 -141.047786
6          6     1987  69.022434  71.334582 -157.932951 -156.619012
7          7      910  69.000834  70.388435 -149.010972 -147.408194
8          8     2044  69.016535  70.903127 -154.464362 -153.026821
9          9     1383  69.000146  70.478816 -151.904233 -150.237160
10        10      903  69.021021  70.839688 -161.140899 -159.325726
11        11     2148  69.298632  71.360948 -156.62

In [11]:
# Test with one chunk and one year first
TEST_CHUNK = 0
TEST_YEAR = 2019

print(f"\n{'#'*60}")
print(f"TEST RUN: Chunk {TEST_CHUNK}, Year {TEST_YEAR}")
print(f"{'#'*60}")

# Load chunk
test_fc, test_gdf = load_chunk_from_bucket(TEST_CHUNK)
test_gdf_wgs84 = test_gdf.to_crs('EPSG:4326') 
test_bounds = ee.Geometry.Rectangle(test_gdf_wgs84.total_bounds.tolist())  # wgs84

# Process and export
test_exports = export_chunk_year(TEST_CHUNK, TEST_YEAR, test_fc, test_bounds)

print(f"\n{'='*60}")
print("TEST EXPORTS PREPARED (NOT STARTED)")
print(f"{'='*60}")
for exp in test_exports:
    print(f"  {exp['type']}: {exp['count']} observations ready")
print("\nTo start exports, run the cells below.")


############################################################
TEST RUN: Chunk 0, Year 2019
############################################################


ERROR 1: PROJ: proj_create_from_database: Open of /opt/conda/envs/gee/share/proj failed


  Loaded chunk 0: 2083 lakes
    Bounds: [  32045.19937452 2133331.56155297   87259.03944725 2325197.90976039]
  Simplifying geometries...
  Creating EE FeatureCollection...
  FeatureCollection created successfully

Processing Chunk 0, Year 2019
  Loading full ALPOD dataset for landscape masking...
  CRS Strategy: All geometric operations in Alaska Albers (EPSG:3338)
    ALPOD loaded: 801895 lakes, CRS: EPSG:3338
    Chunk bounds (WGS84): [-153.15, -151.71] × [69.15, 70.93]
    Converting ALPOD from EPSG:3338 to EPSG:4326 for bbox filtering...
    Found 12900 total lakes in chunk region (all sizes)
    Converting filtered lakes back to Alaska Albers for geometric operations...
    Simplified geometries (5m tolerance in Alaska Albers)
    Created EE FeatureCollection with 12900 lakes
    EE will respect embedded CRS (Alaska Albers EPSG:3338)
  Creating lake interior buffers and landscape rings...
    Using EPSG:3338 (Alaska Albers) for all geometric operations
  Geometric transformation

In [12]:
# TEST EXPORTS
# Use this to test the export on one chunk. Can skip it in future runs once working
SKIP_TEST = True  #True to skip this cell. 

if not SKIP_TEST:
    print("Starting test exports...")
    for exp in test_exports:
        exp['task'].start()
        print(f"  Started: {exp['task'].status()['description']}")
    
    print("\nTest exports started! Monitor at: https://code.earthengine.google.com/tasks")
    print("\nOnce test completes successfully, proceed to full export below.")
else:
    print("⏭️  Skipping test exports (SKIP_TEST = True)")
    print("Test exports already completed. Proceed to full export below.")

⏭️  Skipping test exports (SKIP_TEST = True)
Test exports already completed. Proceed to full export below.


---
## Full Export (All Chunks, All Years)

**WARNING:** This will prep 78 export tasks (13 chunks × 3 years × 2 datasets each)

Only run after test export completes successfully!

In [13]:
# Prep S1/S2 exports with geometry caching
# ERA5 will be downloaded separately

all_exports = []
geometry_cache = {}  # Cache fc_with_geom by chunk_id

total_start = time.time()

for chunk_id in range(n_chunks):
    # ============================================================
    # STEP 1: Process geometries ONCE per chunk (not per year)
    # ============================================================
    if chunk_id not in geometry_cache:
        print(f"\n{'='*60}")
        print(f"Processing Chunk {chunk_id} geometries (first time)")
        print(f"{'='*60}")
        
        chunk_start = time.time()
        
        # Load chunk
        chunk_fc, chunk_gdf = load_chunk_from_bucket(chunk_id)
        chunk_gdf_wgs84 = chunk_gdf.to_crs('EPSG:4326')
        chunk_bounds = ee.Geometry.Rectangle(chunk_gdf_wgs84.total_bounds.tolist())
        
        # Add lake geometries (THE SLOW PART - only do once per chunk!)
        print("  Adding lake geometry metrics (interior buffers + landscape rings)...")
        geom_start = time.time()
        fc_with_geom = add_lake_geometry_metrics(chunk_fc, chunk_bounds)
        geom_time = time.time() - geom_start
        print(f"  ✓ Geometries computed in {geom_time:.1f}s")
        
        # Cache for reuse across years
        geometry_cache[chunk_id] = {
            'fc_with_geom': fc_with_geom,
            'chunk_bounds': chunk_bounds,
            'chunk_gdf': chunk_gdf
        }
        
        chunk_time = time.time() - chunk_start
        print(f"  ✓ Chunk {chunk_id} geometry processing complete ({chunk_time:.1f}s)")
    
    # ============================================================
    # STEP 2: Process each year using cached geometries
    # ============================================================
    for year in YEARS:
        print(f"\nProcessing Chunk {chunk_id}, Year {year}...")
        year_start = time.time()
        
        # Retrieve cached geometries
        cached = geometry_cache[chunk_id]
        fc_with_geom = cached['fc_with_geom']
        chunk_bounds = cached['chunk_bounds']
        
        # Process sensors (year-specific data)
        print("  Processing Sentinel-1...")
        s1_features = process_sentinel1(fc_with_geom, year, chunk_bounds)
        print("  ✓ S1 complete")
        
        print("  Processing Sentinel-2...")
        s2_features = process_sentinel2(fc_with_geom, year, chunk_bounds)
        print("  ✓ S2 complete")
        
        print("  (ERA5 will be downloaded separately)")
        
        # Create export tasks (S1 and S2 only)
        s1_task = ee.batch.Export.table.toCloudStorage(
            collection=s1_features,
            description=f'S1_chunk{chunk_id:02d}_{year}',
            bucket=BUCKET,
            fileNamePrefix=f'{OUTPUT_PATH}/{year}/chunk_{chunk_id:02d}/s1_data',
            fileFormat='CSV'
        )
        
        s2_task = ee.batch.Export.table.toCloudStorage(
            collection=s2_features,
            description=f'S2_chunk{chunk_id:02d}_{year}',
            bucket=BUCKET,
            fileNamePrefix=f'{OUTPUT_PATH}/{year}/chunk_{chunk_id:02d}/s2_data',
            fileFormat='CSV'
        )
        
        # Store export info (S1 and S2 only)
        all_exports.append({
            'chunk_id': chunk_id,
            'year': year,
            'type': 'S1',
            'task': s1_task,
            'count': 'N/A'
        })
        all_exports.append({
            'chunk_id': chunk_id,
            'year': year,
            'type': 'S2',
            'task': s2_task,
            'count': 'N/A'
        })
        
        year_time = time.time() - year_start
        print(f"  ✓ Year {year} complete ({year_time:.1f}s)")

total_time = time.time() - total_start

print(f"\n{'='*60}")
print(f"ALL EXPORTS PREPARED: {len(all_exports)} tasks (S1 + S2 only)")
print(f"{'='*60}")
print(f"Total preparation time: {total_time/60:.1f} minutes")
print(f"Average time per chunk-year: {total_time/(n_chunks * len(YEARS)):.1f} seconds")
print(f"\nGeometry cache stats:")
print(f"  Unique chunks processed: {len(geometry_cache)}")
print(f"  Cache hits (years using cached geometries): {n_chunks * len(YEARS) - len(geometry_cache)}")
print(f"\nNote: ERA5 data will be downloaded separately (see Notebook 03)")
print(f"\nReady to start. Run next cell to begin {len(all_exports)} exports.")


Processing Chunk 0 geometries (first time)
  Loaded chunk 0: 2083 lakes
    Bounds: [  32045.19937452 2133331.56155297   87259.03944725 2325197.90976039]
  Simplifying geometries...
  Creating EE FeatureCollection...
  FeatureCollection created successfully
  Adding lake geometry metrics (interior buffers + landscape rings)...
  Loading full ALPOD dataset for landscape masking...
  CRS Strategy: All geometric operations in Alaska Albers (EPSG:3338)
    ALPOD loaded: 801895 lakes, CRS: EPSG:3338
    Chunk bounds (WGS84): [-153.15, -151.71] × [69.15, 70.93]
    Converting ALPOD from EPSG:3338 to EPSG:4326 for bbox filtering...
    Found 12900 total lakes in chunk region (all sizes)
    Converting filtered lakes back to Alaska Albers for geometric operations...
    Simplified geometries (5m tolerance in Alaska Albers)
    Created EE FeatureCollection with 12900 lakes
    EE will respect embedded CRS (Alaska Albers EPSG:3338)
  Creating lake interior buffers and landscape rings...
    Usi



    ALPOD loaded: 801895 lakes, CRS: EPSG:3338
    Chunk bounds (WGS84): [-163.60, -160.98] × [69.01, 70.31]
    Converting ALPOD from EPSG:3338 to EPSG:4326 for bbox filtering...
    Found 5423 total lakes in chunk region (all sizes)
    Converting filtered lakes back to Alaska Albers for geometric operations...
    Simplified geometries (5m tolerance in Alaska Albers)
    Created EE FeatureCollection with 5423 lakes
    EE will respect embedded CRS (Alaska Albers EPSG:3338)
  Creating lake interior buffers and landscape rings...
    Using EPSG:3338 (Alaska Albers) for all geometric operations
  Geometric transformations complete
    All buffers computed in Alaska Albers (EPSG:3338)
    Buffer distances: -10m interior, +100m landscape ring
  ✓ Geometries computed in 57.4s
  ✓ Chunk 4 geometry processing complete (60.0s)

Processing Chunk 4, Year 2019...
  Processing Sentinel-1...
  ✓ S1 complete
  Processing Sentinel-2...
  ✓ S2 complete
  (ERA5 will be downloaded separately)
  ✓ Year



    ALPOD loaded: 801895 lakes, CRS: EPSG:3338
    Chunk bounds (WGS84): [-149.02, -147.39] × [68.99, 70.40]
    Converting ALPOD from EPSG:3338 to EPSG:4326 for bbox filtering...
    Found 11959 total lakes in chunk region (all sizes)
    Converting filtered lakes back to Alaska Albers for geometric operations...
    Simplified geometries (5m tolerance in Alaska Albers)
    Created EE FeatureCollection with 11959 lakes
    EE will respect embedded CRS (Alaska Albers EPSG:3338)
  Creating lake interior buffers and landscape rings...
    Using EPSG:3338 (Alaska Albers) for all geometric operations
  Geometric transformations complete
    All buffers computed in Alaska Albers (EPSG:3338)
    Buffer distances: -10m interior, +100m landscape ring
  ✓ Geometries computed in 66.4s
  ✓ Chunk 7 geometry processing complete (69.1s)

Processing Chunk 7, Year 2019...
  Processing Sentinel-1...
  ✓ S1 complete
  Processing Sentinel-2...
  ✓ S2 complete
  (ERA5 will be downloaded separately)
  ✓ Ye

---
## Monitor GEE Export Progress

In [None]:
# Check status of GEE exports
def check_export_status():
    status_summary = {
        'READY': 0,
        'RUNNING': 0,
        'COMPLETED': 0,
        'FAILED': 0,
        'CANCELLED': 0
    }
    
    for exp in all_exports:
        status = exp['task'].status()['state']
        status_summary[status] = status_summary.get(status, 0) + 1
    
    print(f"Export Status Summary:")
    print(f"  Total tasks: {len(all_exports)}")
    for state, count in status_summary.items():
        if count > 0:
            print(f"    {state}: {count}")
    
    return status_summary

# Run this cell periodically to check progress
check_export_status()
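
The status check above boils down to counting states and deciding whether anything is still pending. A minimal sketch of that logic follows, using a hypothetical list of state strings in place of live `exp['task'].status()['state']` calls. The set of terminal states is an assumption based on the state names used in this notebook.

```python
from collections import Counter

# States after which a task will not change again (assumed from the
# Earth Engine task state names used in this notebook).
TERMINAL_STATES = {'COMPLETED', 'FAILED', 'CANCELLED'}

def summarize_states(states):
    """Count how many tasks are in each state."""
    return Counter(states)

def all_finished(states):
    """Return True once every task has reached a terminal state."""
    return all(state in TERMINAL_STATES for state in states)

# Hypothetical snapshot, standing in for
# [exp['task'].status()['state'] for exp in all_exports]
snapshot = ['COMPLETED', 'RUNNING', 'FAILED', 'CANCELLED', 'COMPLETED']

print(summarize_states(snapshot))
print(all_finished(snapshot))                    # one task is still RUNNING
print(all_finished(['COMPLETED', 'CANCELLED']))  # nothing left to wait for
```

Treating `CANCELLED` as terminal matters for any polling loop built on this: a loop that only waits for completed plus failed tasks would never exit if a task were cancelled.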

---
## ERA5 Export

In [None]:
# ============================================================
# Wait for GEE Exports & Download ERA5
# ============================================================

import xarray as xr

print("\n" + "="*60)
print("WAITING FOR GEE EXPORTS TO COMPLETE")
print("="*60)
print(f"Total tasks: {len(all_exports)} (S1 + S2)")
print("Monitor progress at: https://code.earthengine.google.com/tasks")
print("\nThis cell will poll GEE every 5 minutes and automatically")
print("download ERA5 data when all exports complete.")
print("")

# ============================================================
# Poll GEE Export Status
# ============================================================

def check_export_status():
    """Check status of all export tasks"""
    statuses = {'COMPLETED': 0, 'RUNNING': 0, 'READY': 0, 'FAILED': 0, 'CANCELLED': 0}
    for exp in all_exports:
        state = exp['task'].status()['state']
        statuses[state] = statuses.get(state, 0) + 1
    return statuses

print("Checking export status...\n")

poll_count = 0
while True:
    statuses = check_export_status()
    completed = statuses['COMPLETED']
    failed = statuses['FAILED']
    cancelled = statuses['CANCELLED']
    running = statuses['RUNNING']
    total = len(all_exports)
    
    print(f"[{time.strftime('%H:%M:%S')}] Status: {completed}/{total} complete, {running} running, {failed} failed, {cancelled} cancelled")
    
    # CANCELLED is terminal too; counting it keeps the loop from waiting forever
    if completed + failed + cancelled == total:
        print("\n✓ All GEE tasks finished!")
        break
    
    poll_count += 1
    if poll_count == 1:
        print("(Polling every 5 minutes...)\n")
    
    time.sleep(300)

# Check for failures (a cancelled task leaves the same gap as a failed one)
failed_tasks = [exp for exp in all_exports if exp['task'].status()['state'] in ('FAILED', 'CANCELLED')]

if failed_tasks:
    print(f"\n⚠️  {len(failed_tasks)} tasks FAILED or were CANCELLED:")
    for exp in failed_tasks[:10]:
        status = exp['task'].status()
        error = status.get('error_message', 'Unknown error')
        print(f"  - Chunk {exp['chunk_id']}, {exp['year']}, {exp['type']}: {error}")
    if len(failed_tasks) > 10:
        print(f"  ... and {len(failed_tasks)-10} more")
    print("\n❌ Cannot proceed to ERA5 download with failed tasks")
    
else:
    print("\n✓ All S1/S2 exports succeeded!")
    print("\n" + "="*60)
    print("DOWNLOADING ERA5 TEMPERATURE DATA")
    print("="*60)
    print("This will take 5-10 minutes for all chunks and years...")
    print("")
    
    era5_start = time.time()
    
    try:
        # Load chunk info
        print("Step 1: Loading lake locations by chunk...")
        chunks_info = []
        
        for chunk_id in range(n_chunks):
            chunk_file = f'{CHUNKS_PATH}/chunk_{chunk_id:02d}.geojson'
            chunk_gdf = gpd.read_file(chunk_file)
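            # Compute centroids in a projected CRS (Alaska Albers), then convert the
            # points to WGS84; centroids taken directly in EPSG:4326 are distorted
            centroids_wgs84 = chunk_gdf.to_crs('EPSG:3338').centroid.to_crs('EPSG:4326')
            chunk_gdf_wgs84 = chunk_gdf.to_crs('EPSG:4326')
            chunk_gdf_wgs84['centroid_lon'] = centroids_wgs84.x
            chunk_gdf_wgs84['centroid_lat'] = centroids_wgs84.y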
            chunk_gdf_wgs84['lake_id'] = range(len(chunk_gdf_wgs84))
            
            chunks_info.append({
                'chunk_id': chunk_id,
                'gdf': chunk_gdf_wgs84,
                'n_lakes': len(chunk_gdf_wgs84)
            })
        
        total_lakes = sum(c['n_lakes'] for c in chunks_info)
        print(f"  ✓ Loaded {total_lakes:,} lakes across {n_chunks} chunks\n")
        
        # Download ERA5 (this store is the 0.25° ARCO-ERA5 reanalysis, not ERA5-Land)
        print("Step 2: Accessing ERA5 dataset (ARCO-ERA5, 0.25°)...")
        ERA5_URL = 'gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3'
        ds = xr.open_zarr(ERA5_URL, chunks='auto', consolidated=True)
        temp_var = '2m_temperature' if '2m_temperature' in ds else 't2m'
        print(f"  ✓ Dataset opened\n")
        
        print("Step 3: Selecting North Slope region and time period...")
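        # ARCO-ERA5 is assumed to store longitude as 0-360, so the western-hemisphere
        # bounds are shifted by 360. The southern bound is padded below 69°N because
        # chunk bounds reach ~68.99°N and interpolation needs a grid row to the south.
        ds_subset = ds[temp_var].sel(
            latitude=slice(72, 68.5),
            longitude=slice(360 - 164, 360 - 140),
            time=slice(f'{YEARS[0]}-01-01', f'{YEARS[-1]}-12-31T23:59:59')
        )
        # Shift longitudes back to -180..180 so they line up with the lake centroids
        ds_subset = ds_subset.assign_coords(longitude=ds_subset.longitude - 360)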
        print(f"  ✓ Selected: {len(ds_subset.latitude)} lats × {len(ds_subset.longitude)} lons, {len(ds_subset.time)} hours\n")
        
        print("Step 4: Computing daily means (2-3 minutes)...")
        ds_daily = ds_subset.resample(time='1D').mean()
        ds_daily_celsius = ds_daily - 273.15
        print(f"  ✓ {len(ds_daily_celsius.time)} daily means computed\n")
        
        print("Step 5: Processing each chunk and exporting...")
        
        for chunk_info in chunks_info:
            chunk_id = chunk_info['chunk_id']
            chunk_gdf = chunk_info['gdf']
            n_lakes = chunk_info['n_lakes']
            
            print(f"  Chunk {chunk_id:02d} ({n_lakes} lakes)...", end=' ')
            
            # Interpolate
            lake_lats = xr.DataArray(chunk_gdf['centroid_lat'].values, dims=['lake'])
            lake_lons = xr.DataArray(chunk_gdf['centroid_lon'].values, dims=['lake'])
            lake_temps = ds_daily_celsius.interp(latitude=lake_lats, longitude=lake_lons, method='linear')
            
            # Convert to DataFrame
            df_chunk = lake_temps.to_dataframe(name='temp_c').reset_index()
            df_chunk['lake_id'] = df_chunk['lake']
            df_chunk['era5_date'] = df_chunk['time'].dt.strftime('%Y-%m-%d')
            df_chunk['era5_doy'] = df_chunk['time'].dt.dayofyear
            df_chunk['year'] = df_chunk['time'].dt.year
            df_chunk = df_chunk[['lake_id', 'era5_date', 'era5_doy', 'temp_c', 'year']]
            
            # Export by year
            for year in YEARS:
                df_year = df_chunk[df_chunk['year'] == year].copy()
                df_year = df_year.drop(columns=['year']).sort_values(['lake_id', 'era5_doy'])
                # OUTPUT_PATH has no gs:// prefix (GEE convention), so add the bucket
                # here; writing to gs:// from pandas assumes gcsfs is installed
                output_file = f'gs://{BUCKET}/{OUTPUT_PATH}/{year}/chunk_{chunk_id:02d}/era5_data.csv'
                df_year.to_csv(output_file, index=False)
            
            print(f"✓")
        
        era5_time = time.time() - era5_start
        
        print(f"\n{'='*60}")
        print("ERA5 DOWNLOAD COMPLETE!")
        print(f"{'='*60}")
        print(f"Time: {era5_time/60:.1f} minutes")
        print(f"Files: {n_chunks * len(YEARS)} CSV files created")
        
        # Validation
        print(f"\nValidation check:")
        test_file = f'gs://{BUCKET}/{OUTPUT_PATH}/{YEARS[0]}/chunk_00/era5_data.csv'
        df_test = pd.read_csv(test_file)
        print(f"  Chunk 00, {YEARS[0]}: {len(df_test):,} rows")
        print(f"  Lakes: {df_test['lake_id'].nunique()}, Temp range: {df_test['temp_c'].min():.1f}°C to {df_test['temp_c'].max():.1f}°C")
        print(f"  ✓ Data looks good!\n")
        
        # Final summary
        total_time = (time.time() - total_start) / 60
        print(f"{'='*60}")
        print("ALL EXPORTS COMPLETE!")
        print(f"{'='*60}")
        print(f"Total time: {total_time:.1f} minutes")
        print(f"  GEE exports: ~{total_time - era5_time/60:.0f} min")
        print(f"  ERA5 download: {era5_time/60:.1f} min")
        print(f"\nData location: gs://{BUCKET}/{OUTPUT_PATH}/[YEAR]/chunk_[XX]/")
        print(f"  - s1_data.csv")
        print(f"  - s2_data.csv")
        print(f"  - era5_data.csv")
        print(f"\n✓ Ready for analysis!")
        
    except Exception as e:
        print(f"\n❌ ERROR: {e}")
        print("\nYou can download ERA5 separately using Notebook 03")
        raise
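
The interpolation step in the cell above samples a gridded temperature field at each lake centroid. The sketch below shows what linear (bilinear) interpolation on a regular grid does for a single point, in plain Python rather than xarray. The grid values and the lake centroid are made up for illustration, the helper names `to_lon_360` and `bilinear` are hypothetical, and the grid is assumed to use ascending coordinates with 0-360 longitudes (a common convention for global reanalysis grids).

```python
import bisect

def to_lon_360(lon):
    """Map a longitude in -180..180 onto the 0..360 convention."""
    return lon % 360

def bilinear(lats, lons, grid, lat, lon):
    """Bilinear interpolation on a regular grid.

    lats and lons must be ascending; grid[i][j] is the value at (lats[i], lons[j]).
    Points outside the grid return NaN.
    """
    if not (lats[0] <= lat <= lats[-1] and lons[0] <= lon <= lons[-1]):
        return float('nan')
    # Index of the grid cell that contains the point
    i = min(bisect.bisect_right(lats, lat) - 1, len(lats) - 2)
    j = min(bisect.bisect_right(lons, lon) - 1, len(lons) - 2)
    # Fractional position inside the cell along each axis
    ty = (lat - lats[i]) / (lats[i + 1] - lats[i])
    tx = (lon - lons[j]) / (lons[j + 1] - lons[j])
    # Interpolate along longitude on both rows, then along latitude
    south = grid[i][j] * (1 - tx) + grid[i][j + 1] * tx
    north = grid[i + 1][j] * (1 - tx) + grid[i + 1][j + 1] * tx
    return south * (1 - ty) + north * ty

# Made-up 0.25-degree grid of daily mean temperatures in °C
lats = [70.0, 70.25]
lons = [211.0, 211.25]
grid = [[-10.0, -12.0],
        [-14.0, -16.0]]

lake_lat, lake_lon = 70.125, -148.875  # hypothetical lake centroid
temp_c = bilinear(lats, lons, grid, lake_lat, to_lon_360(lake_lon))
print(temp_c)  # -13.0, the mean of the four corners at the cell centre
```

A centroid that falls outside the selected grid yields NaN, so the selected region should extend at least one grid cell beyond the lakes in every direction.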

---
## Summary

This notebook exports:
- **Sentinel-1**: VV/VH backscatter for each lake interior
- **Sentinel-2**: NDSI-based ice fraction (with cloud filtering)
- **ERA5**: Daily mean temperature at lake locations

**For:**
- ~19 spatial chunks
- 3 years (2019, 2020, 2021)
- ~28,000 North Slope lakes

**Output structure:**
```
gs://wustl-eeps-geospatial/thermokarst_lakes/exports/
├── 2019/
│   ├── chunk_00/
│   │   ├── s1_data.csv
│   │   ├── s2_data.csv
│   │   └── era5_data.csv
│   ├── chunk_01/
│   └── ...
├── 2020/
└── 2021/
```
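
The layout above can be turned into a checklist of expected files, which is handy for spotting missing exports. This is a minimal sketch: the helper names `export_path` and `expected_files` are hypothetical, and the chunk count of 19 is the approximate figure quoted in this notebook, not a verified value.

```python
def export_path(bucket, base_path, year, chunk_id, name):
    """Build the object path for one exported CSV."""
    return f'gs://{bucket}/{base_path}/exports/{year}/chunk_{chunk_id:02d}/{name}'

def expected_files(bucket, base_path, years, n_chunks,
                   names=('s1_data.csv', 's2_data.csv', 'era5_data.csv')):
    """List every CSV the export is expected to produce."""
    return [
        export_path(bucket, base_path, year, chunk_id, name)
        for year in years
        for chunk_id in range(n_chunks)
        for name in names
    ]

# n_chunks=19 is an assumption taken from the "~19 chunks" estimate
paths = expected_files('wustl-eeps-geospatial', 'thermokarst_lakes',
                       [2019, 2020, 2021], n_chunks=19)
print(len(paths))  # 3 years x 19 chunks x 3 files = 171
print(paths[0])
```

Comparing this list against a bucket listing shows which chunk and year combinations still need to be re-exported.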

**Next step:** Combine CSVs and run ice detection algorithm (Notebook 03)

In [16]:
# Load the ALPOD shapefile
alpod_gdf = gpd.read_file(f'gs://{BUCKET}/{BASE_PATH}/source/ALPODlakes.shp')

# Convert to GEE FeatureCollection
alpod_fc = ee.FeatureCollection(alpod_gdf.__geo_interface__)

# Export as GEE Asset
task = ee.batch.Export.table.toAsset(
    collection=alpod_fc,
    description='ALPOD_full_upload',
    assetId='projects/eeps-geospatial/assets/ALPOD_full'  # Adjust project name
)
task.start()

print("Asset upload started!")
print("Check progress: https://code.earthengine.google.com/tasks")
print("This may take 30-60 minutes for 800k lakes...")

DataSourceError: HTTP response code: 404