# Multi-Year Lake Ice Phenology Data Export
## Part 2: GEE Processing and Export (2019-2021)

**Goal:** Export S1 + S2 + ERA5 data for North Slope lakes across 3 years

**Strategy:**
- Process chunks independently (spatial parallelization)
- Export one year at a time per chunk
- Use efficient spatial filtering (only process S2 images that overlap lakes)
- Total exports: 21 chunks × 3 years = 63 chunk-year combinations, each with 3 datasets (S1, S2, ERA5) = 189 export tasks
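
The export count follows from the loop structure used later in this notebook: every chunk-year combination produces one task per dataset. A minimal sketch of that arithmetic, assuming 21 chunks (the count printed from `chunk_statistics.csv` further down) and reusing the task-description pattern from the export cells:

```python
from itertools import product

N_CHUNKS = 21  # assumption: matches the chunk count printed later in this notebook
YEARS = [2019, 2020, 2021]
DATASETS = ["S1", "S2", "ERA5"]

# One export task per (chunk, year, dataset) combination
task_names = [
    f"{dataset}_chunk{chunk_id:02d}_{year}"
    for chunk_id, year, dataset in product(range(N_CHUNKS), YEARS, DATASETS)
]

n_chunk_years = N_CHUNKS * len(YEARS)
print(f"{n_chunk_years} chunk-year combinations, {len(task_names)} export tasks")
print(task_names[:3])
```

Enumerating the names up front also gives a checklist to compare against the Earth Engine task list.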

**Data sources:**
- Sentinel-1 GRD (SAR)
- Sentinel-2 SR Harmonized (optical, for NDSI)
- ERA5-Land (temperature)

**Years:** 2019, 2020, 2021 (match ALPOD temporal coverage)

---
## Setup

In [9]:
import ee
import pandas as pd
import numpy as np
import geopandas as gpd
from datetime import datetime
import time
import xarray as xr
import subprocess
import sys
import warnings

# Initialize Earth Engine
ee.Initialize()

print("Imports successful!")
print(f"Earth Engine initialized: {ee.String('GEE Initialized').getInfo()}")

httplib2 transport does not support per-request timeout. Set the timeout when constructing the httplib2.Http instance.


Imports successful!
Earth Engine initialized: GEE Initialized


In [10]:
# Configuration
BUCKET = 'wustl-eeps-geospatial'
BASE_PATH = 'thermokarst_lakes'
CHUNKS_PATH = f'gs://{BUCKET}/{BASE_PATH}/processed/chunks'
OUTPUT_PATH = f'{BASE_PATH}/exports'  # No gs:// prefix for GEE exports

# Years to process (match ALPOD coverage)
YEARS = [2019, 2020, 2021]

# Processing parameters
SCALE = 10  # Sentinel-1 resolution
S2_NDSI_THRESHOLD = 0.4  # NDSI > 0.4 = ice
S2_CLOUD_THRESHOLD = 30  # Maximum cloud cover for S2 images (scene-level)
S2_CLOUD_PROB_THRESHOLD = 40  # s2cloudless probability threshold (pixel-level)
S2_TIME_WINDOW = 3  # Days before/after S1 acquisition to look for S2

# Projection
ALASKA_ALBERS = 'EPSG:3338'

print(f"Configuration:")
print(f"  Years: {YEARS}")
print(f"  Chunks path: {CHUNKS_PATH}")
print(f"  Output: gs://{BUCKET}/{OUTPUT_PATH}")
print(f"  S2 scene cloud threshold: {S2_CLOUD_THRESHOLD}%")
print(f"  S2 pixel cloud probability threshold: {S2_CLOUD_PROB_THRESHOLD}%")
print(f"  S2 time window: ±{S2_TIME_WINDOW} days")

Configuration:
  Years: [2019, 2020, 2021]
  Chunks path: gs://wustl-eeps-geospatial/thermokarst_lakes/processed/chunks
  Output: gs://wustl-eeps-geospatial/thermokarst_lakes/exports
  S2 scene cloud threshold: 30%
  S2 pixel cloud probability threshold: 40%
  S2 time window: ±3 days
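
`S2_TIME_WINDOW` is defined in the configuration but is not applied in the Earth Engine code of this notebook, so the S1-to-S2 pairing presumably happens after export. The sketch below shows one way a ±3-day match could work on exported dates. The function name `nearest_s2_within_window` and the sample dates are hypothetical.

```python
from datetime import date

S2_TIME_WINDOW = 3  # days, same value as the configuration above

def nearest_s2_within_window(s1_date, s2_dates, window_days=S2_TIME_WINDOW):
    """Return the S2 date closest to s1_date within +/- window_days, or None."""
    candidates = [d for d in s2_dates if abs((d - s1_date).days) <= window_days]
    if not candidates:
        return None
    return min(candidates, key=lambda d: abs((d - s1_date).days))

# Hypothetical acquisition dates for one lake
s2_dates = [date(2019, 6, 1), date(2019, 6, 6), date(2019, 6, 20)]
print(nearest_s2_within_window(date(2019, 6, 4), s2_dates))   # 2 days from 2019-06-06
print(nearest_s2_within_window(date(2019, 6, 12), s2_dates))  # nothing within 3 days
```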


## Upload ALPOD data as GEE Asset

In [11]:
# =============================================================================
# UPLOAD ALPOD TO GEE ASSET (ONE-TIME ONLY)
# =============================================================================
# Set to False after asset is uploaded
UPLOAD_ALPOD_ASSET = False

if UPLOAD_ALPOD_ASSET:
    print("Uploading ALPOD to GEE Asset...")
    print("(Direct from GCS)\n")
    
    !/opt/conda/bin/earthengine upload table \
        --asset_id=projects/eeps-geospatial/assets/ALPOD_full \
        gs://wustl-eeps-geospatial/thermokarst_lakes/ALPODlakes/ALPODlakes.shp
    
    print("\n" + "="*60)
    print("Check progress: https://code.earthengine.google.com/tasks")
    print("Look for an 'Ingestion' task")
    print("\n After upload completes, set UPLOAD_ALPOD_ASSET = False")
else:
    print("Skipping ALPOD upload (already done)")
    print("Asset location: projects/eeps-geospatial/assets/ALPOD_full")

Skipping ALPOD upload (already done)
Asset location: projects/eeps-geospatial/assets/ALPOD_full


---
## Helper Functions

In [12]:
def load_chunk_from_bucket(chunk_id):
    """
    Load a chunk and return ee.FeatureCollection from GEE Asset
    Only sends lake IDs to GEE, not geometries (avoids payload limits)
    """
    chunk_file = f'{CHUNKS_PATH}/chunk_{chunk_id:02d}.geojson'
    
    # Load locally to get ALPOD IDs and metadata
    chunk_gdf = gpd.read_file(chunk_file)
    chunk_gdf_wgs84 = chunk_gdf.to_crs('EPSG:4326')
    
    # Get ALPOD IDs
    alpod_ids = chunk_gdf['id'].tolist()
    
    # Load from GEE Asset and filter to these lakes
    all_alpod = ee.FeatureCollection('projects/eeps-geospatial/assets/ALPOD_full')
    chunk_fc = all_alpod.filter(ee.Filter.inList('id', alpod_ids))
    
    # Add lake_id field (use ALPOD id)
    chunk_fc = chunk_fc.map(lambda f: f.set('lake_id', f.get('id')))
    
    return chunk_fc, chunk_gdf

In [13]:
def add_lake_geometry_metrics(lakes_fc, region_bounds):
    """
    Add lake interior and landscape ring geometries
    Uses efficient spatial filtering - only masks lakes within 100m of each target lake
    """
    # Load ALPOD from GEE Asset
    all_alpod = ee.FeatureCollection('projects/eeps-geospatial/assets/ALPOD_full')
    
    def add_geometries(lake):
        lake_geom = lake.geometry()
        lake_id = lake.get('lake_id')
        
        # Lake interior: 10m inward buffer
        lake_interior = lake_geom.buffer(-10)
        
        # Landscape ring: 100m outward buffer
        ring_outer = lake_geom.buffer(100)
        
        # Only find lakes that intersect THIS lake's 100m buffer
        nearby_lakes = all_alpod.filterBounds(ring_outer)
        nearby_dissolved = nearby_lakes.geometry().dissolve(maxError=10)
        
        # Subtract only the nearby lakes
        landscape_ring = ring_outer.difference(nearby_dissolved)
        
        return lake.set({
            'lake_id': lake_id,
            'lake_interior': lake_interior,
            'landscape_ring': landscape_ring
        })
    
    return lakes_fc.map(add_geometries)

---
## Sentinel-1 Processing

In [14]:
def process_sentinel1(lakes_fc, year, region_bounds):
    """
    Extract S1 features for both lake interiors and landscape rings.
    
    For each zone (lake and landscape), exports:
    - VV, VH backscatter (raw dB)
    - VV-VH ratio (dB)
    - RGB scaled features (normalized 0-255)
    """
    
    # Load S1 collection
    s1 = (ee.ImageCollection('COPERNICUS/S1_GRD')
          .filterDate(f'{year}-01-01', f'{year + 1}-01-01')  # end date is exclusive, so this includes Dec 31
          .filterBounds(region_bounds)
          .filter(ee.Filter.eq('instrumentMode', 'IW'))
          .filter(ee.Filter.listContains('transmitterReceiverPolarisation', 'VV'))
          .filter(ee.Filter.listContains('transmitterReceiverPolarisation', 'VH'))
          .filter(ee.Filter.eq('resolution_meters', 10)))
    
    def apply_angle_mask(img):
        """Mask pixels outside 25-50 degree incidence angle range"""
        angle = img.select('angle')
        angle_mask = angle.gt(25).And(angle.lt(50))
        return img.select(['VV', 'VH']).updateMask(angle_mask).copyProperties(img, img.propertyNames())
    
    s1 = s1.map(apply_angle_mask)
    
    def extract_s1_features(img):
        """Extract lake and landscape statistics from each S1 image"""
        
        vv_img = img.select('VV')
        vh_img = img.select('VH')
        
        # Compute VV-VH at image level (before reduction)
        vv_vh_img = vv_img.subtract(vh_img).rename('VV_VH')
        
        # Compute RGB bands (scaled/normalized features)
        r_band = vv_img.unitScale(-20, -5).multiply(255).rename('R')
        g_band = vh_img.unitScale(-28, -12).multiply(255).rename('G')
        b_band = vv_vh_img.unitScale(8, 18).multiply(255).rename('B')
        
        # Stack all bands for efficient extraction
        all_bands = vv_img.addBands(vh_img).addBands(vv_vh_img).addBands(r_band).addBands(g_band).addBands(b_band)
        
        date = img.date()
        orbit = img.get('orbitProperties_pass')
        
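        # reduceRegions reduces over each feature's own geometry, not over
        # geometry-valued properties, so swap in the zone geometry first
        lake_zones = lakes_fc.map(lambda f: ee.Feature(
            ee.Geometry(f.get('lake_interior')), {'lake_id': f.get('lake_id')}))
        land_zones = lakes_fc.map(lambda f: ee.Feature(
            ee.Geometry(f.get('landscape_ring')), {'lake_id': f.get('lake_id')}))
        
        # Sample lake interiors
        lake_samples = all_bands.reduceRegions(
            collection=lake_zones,
            reducer=ee.Reducer.mean(),
            scale=10
        )
        
        # Sample landscape rings
        land_samples = all_bands.reduceRegions(
            collection=land_zones,
            reducer=ee.Reducer.mean(),
            scale=10
        )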
        
        def create_s1_feature(lake_feat):
            lake_id = lake_feat.get('lake_id')
            
            # Find matching landscape sample
            land_feat = land_samples.filter(
                ee.Filter.eq('lake_id', lake_id)
            ).first()
            
            # Lake values - VV-VH already computed at image level
            lake_vv = lake_feat.get('VV')
            lake_vh = lake_feat.get('VH')
            lake_vv_vh = lake_feat.get('VV_VH')
            
            # Landscape values
            land_vv = land_feat.get('VV')
            land_vh = land_feat.get('VH')
            land_vv_vh = land_feat.get('VV_VH')
            
            # RGB values (already computed by reducer)
            lake_r = lake_feat.get('R')
            lake_g = lake_feat.get('G')
            lake_b = lake_feat.get('B')
            land_r = land_feat.get('R')
            land_g = land_feat.get('G')
            land_b = land_feat.get('B')
            
            return ee.Feature(None, {
                'lake_id': lake_id,
                's1_date': date.format('YYYY-MM-dd'),
                's1_doy': date.getRelative('day', 'year'),  # 0-based (Jan 1 = 0); era5_doy is 1-based
                's1_orbit': orbit,
                # Lake features
                'lake_vv_db': lake_vv,
                'lake_vh_db': lake_vh,
                'lake_vv_vh_db': lake_vv_vh,
                'lake_r': lake_r,
                'lake_g': lake_g,
                'lake_b': lake_b,
                # Landscape features
                'land_vv_db': land_vv,
                'land_vh_db': land_vh,
                'land_vv_vh_db': land_vv_vh,
                'land_r': land_r,
                'land_g': land_g,
                'land_b': land_b
            })
        
        return lake_samples.map(create_s1_feature)
    
    s1_features = s1.map(extract_s1_features).flatten()
    
    # Filter out observations where lake had no valid pixels
    s1_features = s1_features.filter(ee.Filter.notNull(['lake_vv_db']))
    
    return s1_features
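
The R, G and B features rely on `unitScale(low, high)`, which maps `low` to 0 and `high` to 1 linearly and, per the Earth Engine documentation, does not clamp values outside that range. The plain-Python sketch below reproduces the arithmetic for the three dB ranges used above. `scale_to_255` is a hypothetical helper, and the optional clamp is an assumption about what a strict 0-255 feature would need, not something the notebook does.

```python
def scale_to_255(value_db, low, high, clamp=False):
    """Linear rescale of a dB value so that low -> 0 and high -> 255."""
    scaled = (value_db - low) / (high - low) * 255
    if clamp:
        scaled = max(0.0, min(255.0, scaled))
    return scaled

# Ranges taken from the S1 cell above
RANGES = {"R": (-20, -5), "G": (-28, -12), "B": (8, 18)}

vv, vh = -12.5, -20.0  # hypothetical lake-mean backscatter in dB
vv_vh = vv - vh        # difference in dB (a ratio in linear units)

r = scale_to_255(vv, *RANGES["R"])
g = scale_to_255(vh, *RANGES["G"])
b = scale_to_255(vv_vh, *RANGES["B"])
print(r, g, b)
```

With these sample values the B feature comes out slightly negative because the 7.5 dB difference falls below the 8 dB lower bound, so exported R/G/B columns can fall outside 0-255 unless they are clamped downstream.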

---
## Sentinel-2 Processing

In [15]:
def compute_ndsi(img):
    """
    Compute Normalized Difference Snow Index (NDSI)
    NDSI = (Green - SWIR1) / (Green + SWIR1)
    """
    green = img.select('B3')
    swir1 = img.select('B11')
    
    ndsi = green.subtract(swir1).divide(green.add(swir1)).rename('ndsi')
    
    return img.addBands(ndsi)

def mask_s2_clouds(img):
    """
    Mask clouds using QA60 band (basic cloud mask)
    """
    qa = img.select('QA60')
    
    # Bits 10 and 11 are clouds and cirrus
    cloud_bit_mask = 1 << 10
    cirrus_bit_mask = 1 << 11
    
    # Both should be zero (clear conditions)
    mask = qa.bitwiseAnd(cloud_bit_mask).eq(0).And(
           qa.bitwiseAnd(cirrus_bit_mask).eq(0))
    
    return img.updateMask(mask)

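def add_s2cloudless_mask(img):
    """
    Mask pixels whose s2cloudless probability is at or above S2_CLOUD_PROB_THRESHOLD
    Falls back to the unmodified image when no matching probability image exists
    """
    s2_cloudless = ee.ImageCollection('COPERNICUS/S2_CLOUD_PROBABILITY')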
    cloud_prob_collection = s2_cloudless.filter(
        ee.Filter.eq('system:index', img.get('system:index'))
    )
    
    has_cloud_data = cloud_prob_collection.size().gt(0)
    
    def apply_s2cloudless_mask():
        cloud_prob = cloud_prob_collection.first().select('probability')
        is_clear = cloud_prob.lt(S2_CLOUD_PROB_THRESHOLD)
        return img.updateMask(is_clear)
    
    def use_qa60_only():
        return img  # Already has QA60 mask
    
    return ee.Image(ee.Algorithms.If(
        has_cloud_data,
        apply_s2cloudless_mask(),
        use_qa60_only()
    ))
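
The QA60 test above is ordinary bit arithmetic: bit 10 flags opaque cloud and bit 11 flags cirrus, and a pixel counts as clear only when both are zero. A minimal plain-Python sketch of the same check on integer QA values (`qa60_is_clear` is a hypothetical name used only for this illustration):

```python
CLOUD_BIT = 1 << 10   # 1024: opaque clouds
CIRRUS_BIT = 1 << 11  # 2048: cirrus

def qa60_is_clear(qa_value):
    """True when both the cloud and cirrus bits of a QA60 value are zero."""
    return (qa_value & CLOUD_BIT) == 0 and (qa_value & CIRRUS_BIT) == 0

for qa in (0, 1024, 2048, 3072):
    print(qa, qa60_is_clear(qa))
```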

In [16]:
# Sentinel-2 Processing

def process_sentinel2(lakes_fc, year, region_bounds):
    """
    Extract S2 features - FIXED VERSION
    
    Exports:
    - s2_ice_fraction: fraction of lake pixels with NDSI > threshold (0-1 continuous)
    - s2_ndsi_mean: mean NDSI value across lake pixels (continuous)
    - s2_cloud_pct: percentage of lake pixels masked by clouds (for QC)
    """
    
    # Load S2 collection with scene-level cloud filter
    s2 = (ee.ImageCollection('COPERNICUS/S2_SR_HARMONIZED')
          .filterDate(f'{year}-01-01', f'{year + 1}-01-01')  # end date is exclusive
          .filterBounds(region_bounds)
          .filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', S2_CLOUD_THRESHOLD)))
    
    # Load s2cloudless for pixel-level cloud masking
    s2_cloudless = (ee.ImageCollection('COPERNICUS/S2_CLOUD_PROBABILITY')
                    .filterDate(f'{year}-01-01', f'{year + 1}-01-01')  # end date is exclusive
                    .filterBounds(region_bounds))
    
    def add_cloud_and_ndsi_bands(img):
        """Add NDSI, binary ice mask, and cloud probability bands"""
        
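        # Get the cloud probability image for the same granule.
        # Matching on system:index avoids picking up a different tile
        # acquired on the same day within the chunk bounds.
        cloud_prob_img = s2_cloudless.filter(
            ee.Filter.eq('system:index', img.get('system:index'))
        ).first()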
        
        # Cloud probability (0-100)
        cloud_prob = ee.Image(ee.Algorithms.If(
            cloud_prob_img,
            ee.Image(cloud_prob_img).select('probability'),
            ee.Image.constant(0).rename('probability')
        ))
        
        # QA60 cloud mask
        qa = img.select('QA60')
        cloud_bit_mask = 1 << 10
        cirrus_bit_mask = 1 << 11
        qa_clear = qa.bitwiseAnd(cloud_bit_mask).eq(0).And(
                   qa.bitwiseAnd(cirrus_bit_mask).eq(0))
        
        # Combined cloud mask: QA60 clear AND s2cloudless < threshold
        s2cloudless_clear = cloud_prob.lt(S2_CLOUD_PROB_THRESHOLD)
        combined_clear = qa_clear.And(s2cloudless_clear)
        
        # NDSI with cloud mask applied
        ndsi = img.normalizedDifference(['B3', 'B11']).rename('ndsi')
        ndsi_masked = ndsi.updateMask(combined_clear)
        
        # Binary ice mask (1 where NDSI > threshold, 0 otherwise)
        # This gets averaged to give ice_fraction
        ice_binary = ndsi_masked.gt(S2_NDSI_THRESHOLD).rename('ice_binary')
        
        # Cloud mask as 0/1 for computing cloud percentage
        cloud_mask = combined_clear.Not().rename('is_cloud')
        
        return img.addBands(ndsi_masked).addBands(ice_binary).addBands(cloud_mask)
    
    s2 = s2.map(add_cloud_and_ndsi_bands)
    
    def extract_s2_features(img):
        """Extract lake-level statistics from each S2 image"""
        
        date = img.date()
        
        # Reduce NDSI over lake interiors (mean of continuous values)
        ndsi_stats = img.select('ndsi').reduceRegions(
            collection=lakes_fc.select(['lake_id', 'lake_interior']),
            reducer=ee.Reducer.mean().setOutputs(['ndsi_mean']),
            scale=20
        )
        
        # Reduce ice_binary over lake interiors (mean = fraction of ice pixels)
        ice_stats = img.select('ice_binary').reduceRegions(
            collection=lakes_fc.select(['lake_id', 'lake_interior']),
            reducer=ee.Reducer.mean().setOutputs(['ice_fraction']),
            scale=20
        )
        
        # Reduce cloud mask to get cloud percentage
        cloud_stats = img.select('is_cloud').reduceRegions(
            collection=lakes_fc.select(['lake_id', 'lake_interior']),
            reducer=ee.Reducer.mean().setOutputs(['cloud_fraction']),
            scale=20
        )
        
        def create_s2_feature(ndsi_feat):
            lake_id = ndsi_feat.get('lake_id')
            
            # Get matching ice and cloud stats
            ice_feat = ice_stats.filter(ee.Filter.eq('lake_id', lake_id)).first()
            cloud_feat = cloud_stats.filter(ee.Filter.eq('lake_id', lake_id)).first()
            
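            # Get values; reduceRegions omits the property when every pixel is masked
            ndsi_mean = ndsi_feat.get('ndsi_mean')
            ice_fraction = ice_feat.get('ice_fraction')
            cloud_fraction = cloud_feat.get('cloud_fraction')
            
            # Convert cloud fraction to percentage (0-100)
            # Test for the property explicitly: ee.Algorithms.If treats 0 as false,
            # so a cloud-free lake (fraction 0) would otherwise be flagged as missing
            cloud_pct = ee.Algorithms.If(
                cloud_feat.propertyNames().contains('cloud_fraction'),
                ee.Number(cloud_fraction).multiply(100),
                ee.Number(-1)  # Flag for missing data
            )
            
            # Handle nulls: if ndsi_mean is missing, set ice_fraction to null too
            # This happens when all pixels are masked by clouds
            ice_fraction_safe = ee.Algorithms.If(
                ndsi_feat.propertyNames().contains('ndsi_mean'),
                ice_fraction,
                None
            )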
            
            return ee.Feature(None, {
                'lake_id': lake_id,
                's2_date': date.format('YYYY-MM-dd'),
                's2_doy': date.getRelative('day', 'year'),  # 0-based (Jan 1 = 0); era5_doy is 1-based
                's2_ndsi_mean': ndsi_mean,
                's2_ice_fraction': ice_fraction_safe,
                's2_cloud_pct': cloud_pct
            })
        
        return ndsi_stats.map(create_s2_feature)
    
    s2_features = s2.map(extract_s2_features).flatten()
    
    # Filter out features where all data is null (completely cloudy)
    # This reduces export size and avoids null-heavy CSVs
    s2_features = s2_features.filter(ee.Filter.notNull(['s2_ndsi_mean']))
    
    return s2_features


# =============================================================================
# OUTPUT OF process_sentinel2:
# =============================================================================
# The exported CSV has columns:
#   - lake_id
#   - s2_date
#   - s2_doy  
#   - s2_ndsi_mean     (continuous, -1 to 1)
#   - s2_ice_fraction  (continuous, 0 to 1 = fraction of pixels with NDSI > threshold)
#   - s2_cloud_pct     (0-100, percentage of lake pixels that were cloudy)
#
# In notebook 03, you can then create training labels like:
#   - HIGH CONFIDENCE ICE:   ice_fraction > 0.8 AND cloud_pct < 20
#   - HIGH CONFIDENCE WATER: ice_fraction < 0.2 AND cloud_pct < 20
#   - PARTIAL/AMBIGUOUS:     everything else
# =============================================================================
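
The labelling rules listed in the comments above can be written as a small function. This is a sketch of how notebook 03 might apply them, not the actual implementation: the function name, the label strings, and the choice to treat missing values (including the `-1` cloud flag) as ambiguous are all assumptions.

```python
def label_observation(ice_fraction, cloud_pct, max_cloud_pct=20):
    """Classify one S2 observation using the thresholds from the cell comments."""
    # Assumption: missing values, including the -1 cloud flag, count as ambiguous
    if ice_fraction is None or cloud_pct is None or cloud_pct < 0:
        return "ambiguous"
    if cloud_pct < max_cloud_pct:
        if ice_fraction > 0.8:
            return "ice"    # high-confidence ice
        if ice_fraction < 0.2:
            return "water"  # high-confidence water
    return "ambiguous"      # partial cover or too cloudy

# Hypothetical rows: (s2_ice_fraction, s2_cloud_pct)
rows = [(0.95, 5.0), (0.05, 10.0), (0.5, 0.0), (0.95, 40.0), (None, -1)]
print([label_observation(ice, cloud) for ice, cloud in rows])
```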

---
## ERA5 Temperature Processing

In [17]:
def process_era5_temperature(lakes_fc, year):
    """
    Process ERA5-Land temperature data - OPTIMIZED VERSION
    Pre-computes daily means once, then samples all lakes in batch
    Much faster than computing daily mean separately for each lake
    """
    print(f"  Loading ERA5 hourly data for {year}...")
    
    # Load ERA5-Land hourly data
    era5_hourly = (ee.ImageCollection('ECMWF/ERA5_LAND/HOURLY')
                   .filterDate(f'{year}-01-01', f'{year + 1}-01-01')  # end date is exclusive; keeps Dec 31
                   .select('temperature_2m'))
    
    # Convert to Celsius
    def to_celsius(img):
        temp_c = img.subtract(273.15).rename('temp_c')
        return temp_c.copyProperties(img, ['system:time_start'])
    
    era5_hourly = era5_hourly.map(to_celsius)
    
    print(f"  Pre-computing daily means...")
    
    # Determine number of days in year (handle leap years)
    is_leap = ee.Number(year).mod(4).eq(0).And(
        ee.Number(year).mod(100).neq(0).Or(
            ee.Number(year).mod(400).eq(0)
        )
    )
    n_days = ee.Number(ee.Algorithms.If(is_leap, 366, 365))
    
    # Pre-compute daily means for entire year (365 or 366 images)
    def compute_daily_mean(day):
        day = ee.Number(day)
        date = ee.Date.fromYMD(year, 1, 1).advance(day.subtract(1), 'day')
        next_date = date.advance(1, 'day')
        
        # Get all hourly images for this day
        daily_collection = era5_hourly.filterDate(date, next_date)
        
        # Check if we have data
        has_data = daily_collection.size().gt(0)
        
        # Compute mean if data exists, otherwise use missing flag
        daily_mean = ee.Image(ee.Algorithms.If(
            has_data,
            daily_collection.mean(),
            ee.Image.constant(-9999).rename('temp_c')
        ))
        
        return daily_mean.set({
            'system:time_start': date.millis(),
            'doy': day,
            'date': date.format('YYYY-MM-dd')
        })
    
    days = ee.List.sequence(1, n_days)
    era5_daily = ee.ImageCollection.fromImages(days.map(compute_daily_mean))
    
    print(f"  Sampling all lakes from daily means...")
    
    # Now sample ALL lakes from each daily mean (batch operation)
    def sample_all_lakes(daily_img):
        doy = daily_img.get('doy')
        date = daily_img.get('date')
        
        # Sample ALL lakes at once using reduceRegions
        samples = daily_img.reduceRegions(
            collection=lakes_fc,
            reducer=ee.Reducer.first(),  # First pixel value within each lake polygon (not necessarily the centroid)
            scale=11000  # ERA5-Land resolution
        )
        
        # Add date info to each sampled feature
        def add_date_info(feat):
            # Get the temperature value (from 'first' property created by reducer)
            # Use ee.Algorithms.If to provide default if missing
            temp_value = ee.Algorithms.If(
                feat.propertyNames().contains('first'),
                feat.get('first'),
                -9999
            )
            
            return feat.set({
                'era5_date': date,
                'era5_doy': doy,
                'temp_c': temp_value
            })
        
        return samples.map(add_date_info)
    
    # Process all daily images
    era5_features = era5_daily.map(sample_all_lakes).flatten()
    
    return era5_features
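
The daily-mean step can be checked locally on a small series. The sketch below mirrors the same two operations (Kelvin to Celsius, then averaging all hourly values that fall on one calendar day) in plain Python. The hourly values are made up, and `daily_mean_celsius` is a hypothetical helper rather than part of the notebook.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def daily_mean_celsius(hourly_kelvin):
    """Group (timestamp, kelvin) pairs by calendar day and average in Celsius."""
    by_day = defaultdict(list)
    for timestamp, kelvin in hourly_kelvin:
        by_day[timestamp.date()].append(kelvin - 273.15)
    return {day: sum(vals) / len(vals) for day, vals in sorted(by_day.items())}

# Hypothetical hourly series: 48 hours starting 2019-01-01 00:00
start = datetime(2019, 1, 1)
hourly = [(start + timedelta(hours=h), 253.15 + (h % 24) * 0.5) for h in range(48)]

daily = daily_mean_celsius(hourly)
for day, mean_c in daily.items():
    doy = day.timetuple().tm_yday  # 1-based, like era5_doy above
    print(day, doy, round(mean_c, 2))
```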

---
## Export Functions

In [18]:
def export_chunk_year(chunk_id, year, lakes_fc, region_bounds):
    """
    Process and export S1/S2/ERA5 data for one chunk and one year
    """
    print(f"\n{'='*60}")
    print(f"Processing Chunk {chunk_id}, Year {year}")
    print(f"{'='*60}")
    
    # Add lake geometries
    lakes_with_geom = add_lake_geometry_metrics(lakes_fc, region_bounds)
    print("Lakes processed with geometries")
    
    # Process S1
    print("\nProcessing Sentinel-1...")
    s1_features = process_sentinel1(lakes_with_geom, year, region_bounds)
    print("  S1 processing complete")
    
    # Process S2
    print("\nProcessing Sentinel-2...")
    s2_features = process_sentinel2(lakes_with_geom, year, region_bounds)
    print("  S2 processing complete")
    
    # Process ERA5
    print("\nProcessing ERA5 temperature...")
    era5_features = process_era5_temperature(lakes_fc, year)
    print("  ERA5 processing complete")
    
    # Create one export task per dataset (tasks are not started here)
    
    # S1 export
    s1_task = ee.batch.Export.table.toCloudStorage(
        collection=s1_features,
        description=f'S1_chunk{chunk_id:02d}_{year}',
        bucket=BUCKET,
        fileNamePrefix=f'{OUTPUT_PATH}/{year}/chunk_{chunk_id:02d}/s1_data',
        fileFormat='CSV'
    )
    
    # S2 export
    s2_task = ee.batch.Export.table.toCloudStorage(
        collection=s2_features,
        description=f'S2_chunk{chunk_id:02d}_{year}',
        bucket=BUCKET,
        fileNamePrefix=f'{OUTPUT_PATH}/{year}/chunk_{chunk_id:02d}/s2_data',
        fileFormat='CSV'
    )
    
    # ERA5 export
    era5_task = ee.batch.Export.table.toCloudStorage(
        collection=era5_features,
        description=f'ERA5_chunk{chunk_id:02d}_{year}',
        bucket=BUCKET,
        fileNamePrefix=f'{OUTPUT_PATH}/{year}/chunk_{chunk_id:02d}/era5_data',
        fileFormat='CSV'
    )
    
    # Return all three tasks
    exports = [
        {'task': s1_task, 'type': 'S1', 'count': 'N/A'},
        {'task': s2_task, 'type': 'S2', 'count': 'N/A'},
        {'task': era5_task, 'type': 'ERA5', 'count': 'N/A'}
    ]
    
    return exports

---
## Main Export Loop

In [19]:
# Load chunk statistics to know how many chunks we have
chunk_stats = pd.read_csv(f'gs://{BUCKET}/{BASE_PATH}/processed/chunk_statistics.csv')
n_chunks = len(chunk_stats)

print(f"Total chunks to process: {n_chunks}")
print(f"Years to process: {YEARS}")
print(f"Total exports: {n_chunks * len(YEARS) * 3} (chunks × years × 3 datasets)")
print("\nChunk statistics:")
print(chunk_stats[['chunk_id', 'n_lakes', 'lat_min', 'lat_max', 'lon_min', 'lon_max']])

Total chunks to process: 21
Years to process: [2019, 2020, 2021]
Total exports: 189 (chunks × years × 3 datasets)

Chunk statistics:
    chunk_id  n_lakes    lat_min    lat_max     lon_min     lon_max
0          0     1659  69.003753  70.805772 -157.460685 -156.323643
1          1     1402  69.003327  70.498668 -150.588202 -149.501159
2          2     2172  69.000561  70.506201 -149.575529 -148.460811
3          3     1002  69.002549  70.662131 -161.092265 -159.724530
4          4     2243  70.139174  71.157997 -155.626734 -154.362639
5          5      408  69.061209  70.123854 -144.857630 -141.021667
6          6     1726  69.886316  70.909059 -153.572246 -152.259761
7          7     1704  69.002644  70.811790 -158.909885 -157.856992
8          8      753  69.019065  70.186629 -147.158786 -144.887360
9          9      460  69.027426  70.075681 -163.578774 -162.004458
10        10     1440  69.013772  70.328129 -153.976923 -152.575637
11        11     1392  69.000974  70.218418 -155.66

In [20]:
# Test with one chunk and one year first
TEST_CHUNK = 0
TEST_YEAR = 2019

print(f"\n{'#'*60}")
print(f"TEST RUN: Chunk {TEST_CHUNK}, Year {TEST_YEAR}")
print(f"{'#'*60}")

# Load chunk
test_fc, test_gdf = load_chunk_from_bucket(TEST_CHUNK)
test_gdf_wgs84 = test_gdf.to_crs('EPSG:4326') 
test_bounds = ee.Geometry.Rectangle(test_gdf_wgs84.total_bounds.tolist())  # wgs84

# Process and export
test_exports = export_chunk_year(TEST_CHUNK, TEST_YEAR, test_fc, test_bounds)

print(f"\n{'='*60}")
print("TEST EXPORTS PREPARED (NOT STARTED)")
print(f"{'='*60}")
for exp in test_exports:
    print(f"  {exp['type']}: {exp['count']} observations ready")
print("\nTo start exports, run the cells below.")


############################################################
TEST RUN: Chunk 0, Year 2019
############################################################


ERROR 1: PROJ: proj_create_from_database: Open of /opt/conda/envs/gee/share/proj failed



Processing Chunk 0, Year 2019
Lakes processed with geometries

Processing Sentinel-1...
  S1 processing complete

Processing Sentinel-2...
  S2 processing complete

Processing ERA5 temperature...
  Loading ERA5 hourly data for 2019...
  Pre-computing daily means...
  Sampling all lakes from daily means...
  ERA5 processing complete

TEST EXPORTS PREPARED (NOT STARTED)
  S1: N/A observations ready
  S2: N/A observations ready
  ERA5: N/A observations ready

To start exports, run the cells below.


In [21]:
# TEST EXPORTS
# Runs the export for one chunk as a test. Skip it in later runs once it works.
SKIP_TEST = True  # Set to False to run the test exports

if not SKIP_TEST:
    print("Starting test exports...")
    for exp in test_exports:
        exp['task'].start()
        print(f"  Started: {exp['task'].status()['description']}")
    
    print("\nTest exports started! Monitor at: https://code.earthengine.google.com/tasks")
    print("\nOnce test completes successfully, proceed to full export below.")
else:
    print("⏭️  Skipping test exports (SKIP_TEST = True)")
    print("Test exports already completed. Proceed to full export below.")

⏭️  Skipping test exports (SKIP_TEST = True)
Test exports already completed. Proceed to full export below.


---
## Full Export (All Chunks, All Years)

**WARNING:** This prepares 189 export tasks (21 chunks × 3 years × 3 datasets)

Only run after test export completes successfully!

In [22]:
# Prep S1/S2/ERA5 exports
all_exports = []
total_start = time.time()
for chunk_id in range(n_chunks):
    print(f"\n{'='*60}")
    print(f"Processing Chunk {chunk_id}")
    print(f"{'='*60}")
    
    chunk_start = time.time()
    
    # Load chunk (now uses GEE Asset - no payload issues!)
    chunk_fc, chunk_gdf = load_chunk_from_bucket(chunk_id)
    chunk_bounds = chunk_fc.geometry().bounds()
    
    print(f"  {len(chunk_gdf)} lakes")
    
    # Add lake geometries
    print("  Adding geometries...")
    fc_with_geom = add_lake_geometry_metrics(chunk_fc, chunk_bounds)
    
    for year in YEARS:
        print(f"  Preparing {year}...", end=' ')
        
        # Process sensors
        s1_features = process_sentinel1(fc_with_geom, year, chunk_bounds)
        s2_features = process_sentinel2(fc_with_geom, year, chunk_bounds)
        era5_features = process_era5_temperature(chunk_fc, year)
        
        # Create export tasks
        s1_task = ee.batch.Export.table.toCloudStorage(
            collection=s1_features,
            description=f'S1_chunk{chunk_id:02d}_{year}',
            bucket=BUCKET,
            fileNamePrefix=f'{OUTPUT_PATH}/{year}/chunk_{chunk_id:02d}/s1_data',
            fileFormat='CSV'
        )
        
        s2_task = ee.batch.Export.table.toCloudStorage(
            collection=s2_features,
            description=f'S2_chunk{chunk_id:02d}_{year}',
            bucket=BUCKET,
            fileNamePrefix=f'{OUTPUT_PATH}/{year}/chunk_{chunk_id:02d}/s2_data',
            fileFormat='CSV'
        )
        
        era5_task = ee.batch.Export.table.toCloudStorage(
            collection=era5_features,
            description=f'ERA5_chunk{chunk_id:02d}_{year}',
            bucket=BUCKET,
            fileNamePrefix=f'{OUTPUT_PATH}/{year}/chunk_{chunk_id:02d}/era5_data',
            fileFormat='CSV'
        )
        
        all_exports.append({
            'chunk_id': chunk_id,
            'year': year,
            'type': 'S1',
            'task': s1_task
        })
        all_exports.append({
            'chunk_id': chunk_id,
            'year': year,
            'type': 'S2',
            'task': s2_task
        })
        all_exports.append({
            'chunk_id': chunk_id,
            'year': year,
            'type': 'ERA5',
            'task': era5_task
        })
        
        print("Done")
    
    chunk_time = time.time() - chunk_start
    print(f"  Chunk {chunk_id} complete ({chunk_time:.1f}s)")

total_time = time.time() - total_start
print(f"\n{'='*60}")
print(f"ALL EXPORTS PREPARED: {len(all_exports)} tasks")
print(f"{'='*60}")
print(f"Total preparation time: {total_time/60:.1f} minutes")
print(f"\nReady to start. Run next cell to begin {len(all_exports)} exports.")


Processing Chunk 0
  1659 lakes
  Adding geometries...
  Preparing 2019... ✓
  Preparing 2020... ✓
  Preparing 2021... ✓
  Chunk 0 complete (1.2s)

Processing Chunk 1
  1402 lakes
  Adding geometries...
  Preparing 2019... ✓
  Preparing 2020... ✓
  Preparing 2021... ✓
  Chunk 1 complete (1.7s)

Processing Chunk 2
  2172 lakes
  Adding geometries...
  Preparing 2019... ✓
  Preparing 2020... ✓
  Preparing 2021... ✓
  Chunk 2 complete (4.0s)

Processing Chunk 3
  1002 lakes
  Adding geometries...
  Preparing 2019... ✓
  Preparing 2020... ✓
  Preparing 2021... ✓
  Chunk 3 complete (1.3s)

Processing Chunk 4
  2243 lakes
  Adding geometries...
  Preparing 2019... ✓
  Preparing 2020... ✓
  Preparing 2021... ✓
  Chunk 4 complete (3.5s)

Processing Chunk 5
  408 lakes
  Adding geometries...
  Preparing 2019... ✓
  Preparing 2020... ✓
  Preparing 2021... ✓
  Chunk 5 complete (0.7s)

Processing Chunk 6
  1726 lakes
  Adding geometries...
  Preparing 2019... ✓
  Preparing 2020... ✓
  Preparing 2

In [23]:
# Start all exports
started = 0
for i, export in enumerate(all_exports):
    try:
        task = export['task']
        if task.status()['state'] == 'UNSUBMITTED':
            task.start()
            started += 1
            
            # Small delay every 10 tasks to avoid rate limiting
            if started % 10 == 0:
                print(f"  Started {started} tasks...")
                time.sleep(1)
    except Exception as e:
        print(f"Error starting task {i} ({export['type']} chunk {export['chunk_id']} {export['year']}): {e}")

print(f"\nStarted {started} tasks")
print("Monitor at: https://code.earthengine.google.com/tasks")

  Started 10 tasks...
  Started 20 tasks...
  Started 30 tasks...
  Started 40 tasks...
  Started 50 tasks...
  Started 60 tasks...
  Started 70 tasks...
  Started 80 tasks...
  Started 90 tasks...
  Started 100 tasks...
  Started 110 tasks...
  Started 120 tasks...

Started 126 tasks
Monitor at: https://code.earthengine.google.com/tasks
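
The start loop above logs a failure and moves on. If transient errors such as rate limits turn out to be common, one option is to retry each `start()` call with exponential backoff. The sketch below is an assumption about how that could look, not something the notebook does. `FlakyTask` is a stand-in used only so the example runs without Earth Engine; the real tasks are only assumed to expose a `start()` method, as in the cell above.

```python
import time

def start_with_retry(task, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call task.start(), retrying with exponential backoff on any exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            task.start()
            return attempt  # number of attempts it took
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))

class FlakyTask:
    """Stand-in for an export task: fails a fixed number of times, then starts."""
    def __init__(self, failures):
        self.failures = failures
        self.started = False

    def start(self):
        if self.failures > 0:
            self.failures -= 1
            raise RuntimeError("simulated rate-limit error")
        self.started = True

delays = []
task = FlakyTask(failures=2)
attempts = start_with_retry(task, sleep=delays.append)  # record delays instead of sleeping
print(attempts, delays, task.started)
```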


---
## Monitor GEE Export Progress

In [24]:
# Check status of GEE exports
def check_export_status():
    status_summary = {
        'READY': 0,
        'RUNNING': 0,
        'COMPLETED': 0,
        'FAILED': 0,
        'CANCELLED': 0
    }
    
    for exp in all_exports:
        status = exp['task'].status()['state']
        status_summary[status] = status_summary.get(status, 0) + 1
    
    print(f"Export Status Summary:")
    print(f"  Total tasks: {len(all_exports)}")
    for state, count in status_summary.items():
        if count > 0:
            print(f"    {state}: {count}")
    
    return status_summary

# Run this cell periodically to check progress
check_export_status()

Export Status Summary:
  Total tasks: 126
    READY: 125
    RUNNING: 1


{'READY': 125, 'RUNNING': 1, 'COMPLETED': 0, 'FAILED': 0, 'CANCELLED': 0}
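
For unattended runs it can help to know when every task has reached a terminal state. The sketch below works on a plain list of state strings, such as those returned by `task.status()['state']`, so it runs without Earth Engine. `summarize_states` and the set of terminal states are assumptions made for this illustration.

```python
from collections import Counter

# Assumption: these are the states after which a task no longer changes
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}

def summarize_states(states):
    """Count task states and report whether every task has finished."""
    counts = Counter(states)
    n_done = sum(counts[state] for state in TERMINAL_STATES)
    return counts, n_done == len(states)

# Same situation as the output above: 125 queued, 1 running
states = ["READY"] * 125 + ["RUNNING"]
counts, finished = summarize_states(states)
print(dict(counts), "all finished:", finished)
```

A `Counter` also tallies any state that is not listed in advance, so an unexpected value shows up in the summary instead of being dropped.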

---
## Summary

This notebook exports:

**Sentinel-1**: 
- Lake interior: VV, VH backscatter (dB), VV-VH ratio, RGB features (normalized 0-255)
- Landscape ring: VV, VH backscatter (dB), VV-VH ratio, RGB features (normalized 0-255)
- Both ASCENDING and DESCENDING orbits
- Angle mask applied (25-50 degrees)

**Sentinel-2**: 
- s2_ndsi_mean: Mean NDSI value across lake pixels (continuous, -1 to 1)
- s2_ice_fraction: Fraction of lake pixels with NDSI > 0.4 (continuous, 0 to 1)
- s2_cloud_pct: Percentage of lake masked by clouds (0-100, for QC filtering)
- Dual cloud masking: QA60 + s2cloudless

**ERA5**: 
- Daily mean 2m air temperature sampled from the ERA5-Land grid at each lake (Celsius)

**For:**
- 21 spatial chunks
- 3 years (2019, 2020, 2021)
- ~31,000 North Slope lakes (>0.02 km²)

**Output structure:**
```
gs://wustl-eeps-geospatial/thermokarst_lakes/exports/
├── 2019/
│   ├── chunk_00/
│   │   ├── s1_data.csv
│   │   ├── s2_data.csv
│   │   └── era5_data.csv
│   ├── chunk_01/
│   └── ...
├── 2020/
└── 2021/
```
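
Because the layout is regular, the full set of expected files can be generated and compared against a bucket listing before moving on. A minimal sketch, assuming the bucket and prefix from the configuration cell and a listing supplied by the caller (for example from `gsutil ls`); `expected_outputs` and `missing_outputs` are hypothetical helper names.

```python
BUCKET = "wustl-eeps-geospatial"
OUTPUT_PATH = "thermokarst_lakes/exports"
DATASETS = ["s1", "s2", "era5"]

def expected_outputs(n_chunks, years):
    """Build the set of CSV URIs implied by the output structure above."""
    return {
        f"gs://{BUCKET}/{OUTPUT_PATH}/{year}/chunk_{chunk_id:02d}/{name}_data.csv"
        for year in years
        for chunk_id in range(n_chunks)
        for name in DATASETS
    }

def missing_outputs(present, n_chunks, years):
    """Return expected URIs that are absent from the listing, sorted."""
    return sorted(expected_outputs(n_chunks, years) - set(present))

# Hypothetical listing in which one file has not arrived yet
all_files = expected_outputs(21, [2019, 2020, 2021])
late_file = f"gs://{BUCKET}/{OUTPUT_PATH}/2020/chunk_07/s2_data.csv"
listing = all_files - {late_file}
print(len(all_files), missing_outputs(listing, 21, [2019, 2020, 2021]))
```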

**Next step:** Combine CSVs and run ice detection algorithm (Notebook 03)