# Multi-Year Lake Ice Phenology Data Export
## Part 2: GEE Processing and Export (2019-2021)

**Goal:** Export Sentinel-1 (S1) and Sentinel-2 (S2) data for North Slope lakes across 3 years from GEE (ERA5-Land temperature is downloaded separately; see Notebook 03)

**Strategy:**
- Process chunks independently (spatial parallelization)
- Export one year at a time per chunk
- Use efficient spatial filtering (for each S2 image, only process the lakes it overlaps)
- Total exports: 13 chunks × 3 years × 2 datasets (S1, S2) = 78 exports
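
The export count follows from the loop structure: one task per chunk, per year, per exported dataset. A minimal sketch, assuming the 13 chunks reported by `chunk_statistics.csv` later in this notebook and two exported datasets (S1 and S2); `count_exports` is a hypothetical helper, not part of the pipeline:

```python
def count_exports(n_chunks, years, datasets=("S1", "S2")):
    """Number of export tasks: one per chunk, per year, per dataset."""
    return n_chunks * len(years) * len(datasets)

# 13 chunks is taken from the chunk statistics printed later in this notebook
n_tasks = count_exports(13, [2019, 2020, 2021])
print(n_tasks)
```

If the chunking changes, only the first argument needs updating.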

**Data sources:**
- Sentinel-1 GRD (SAR)
- Sentinel-2 SR Harmonized (optical, for NDSI)
- ERA5-Land (temperature)

**Years:** 2019, 2020, 2021 (match ALPOD temporal coverage)

---
## Setup

In [3]:
import ee
import pandas as pd
import numpy as np
import geopandas as gpd
from datetime import datetime
import time

# Initialize Earth Engine
ee.Initialize()

print("Imports successful!")
print(f"Earth Engine initialized: {ee.String('GEE Initialized').getInfo()}")

httplib2 transport does not support per-request timeout. Set the timeout when constructing the httplib2.Http instance.


Imports successful!
Earth Engine initialized: GEE Initialized


In [4]:
# Configuration
BUCKET = 'wustl-eeps-geospatial'
BASE_PATH = 'thermokarst_lakes'
CHUNKS_PATH = f'gs://{BUCKET}/{BASE_PATH}/processed/chunks'
OUTPUT_PATH = f'{BASE_PATH}/exports'  # No gs:// prefix for GEE exports

# Years to process (match ALPOD coverage)
YEARS = [2019, 2020, 2021]

# Processing parameters
SCALE = 10  # Sentinel-1 resolution
S2_CLOUD_THRESHOLD = 30  # Maximum cloud cover for S2 images (scene-level)
S2_CLOUD_PROB_THRESHOLD = 40  # s2cloudless probability threshold (pixel-level)
S2_TIME_WINDOW = 3  # Days before/after S1 acquisition to look for S2

# Projection
ALASKA_ALBERS = 'EPSG:3338'

print(f"Configuration:")
print(f"  Years: {YEARS}")
print(f"  Chunks path: {CHUNKS_PATH}")
print(f"  Output: gs://{BUCKET}/{OUTPUT_PATH}")
print(f"  S2 scene cloud threshold: {S2_CLOUD_THRESHOLD}%")
print(f"  S2 pixel cloud probability threshold: {S2_CLOUD_PROB_THRESHOLD}%")
print(f"  S2 time window: ±{S2_TIME_WINDOW} days")

Configuration:
  Years: [2019, 2020, 2021]
  Chunks path: gs://wustl-eeps-geospatial/thermokarst_lakes/processed/chunks
  Output: gs://wustl-eeps-geospatial/thermokarst_lakes/exports
  S2 scene cloud threshold: 30%
  S2 pixel cloud probability threshold: 40%
  S2 time window: ±3 days
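
`S2_TIME_WINDOW` is defined here but the pairing step itself is not part of this notebook. As a rough sketch of how a ±3-day window could be applied downstream when matching S2 observations to an S1 acquisition (the function name and the nearest-date rule are assumptions for illustration, not the pipeline's actual logic):

```python
from datetime import date, timedelta

S2_TIME_WINDOW = 3  # days, same value as the configuration above

def nearest_s2_within_window(s1_date, s2_dates, window_days=S2_TIME_WINDOW):
    """Return the S2 date closest to s1_date within ±window_days, or None."""
    window = timedelta(days=window_days)
    candidates = [d for d in s2_dates if abs(d - s1_date) <= window]
    if not candidates:
        return None
    return min(candidates, key=lambda d: abs(d - s1_date))

s2_dates = [date(2019, 6, 1), date(2019, 6, 9)]
# June 3 is 2 days from June 1, so a match is found
match = nearest_s2_within_window(date(2019, 6, 3), s2_dates)
# June 5 is 4 days from both S2 dates, so nothing falls inside the window
no_match = nearest_s2_within_window(date(2019, 6, 5), s2_dates)
```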


---
## Helper Functions

In [5]:
def load_chunk_from_bucket(chunk_id):
    """
    Load a chunk GeoJSON from bucket and convert to ee.FeatureCollection
    Downloads to local temp file first to avoid GCS streaming issues
    """
    import subprocess
    
    chunk_file = f'{CHUNKS_PATH}/chunk_{chunk_id:02d}.geojson'
    
    # Download to local temp file
    local_path = f'/tmp/chunk_{chunk_id:02d}.geojson'
    
    # Use gsutil to download (reliable in Vertex AI); check=True raises if the copy fails
    subprocess.run(['gsutil', '-q', 'cp', chunk_file, local_path], check=True)
    
    # Load from local file
    gdf = gpd.read_file(local_path)
    
    print(f"  Loaded chunk {chunk_id}: {len(gdf)} lakes")
    print(f"    Bounds: {gdf.total_bounds}")
    
    # Keep only essential properties to reduce payload size
    essential_cols = ['geometry']
    
    if 'lake_area_km2' in gdf.columns:
        essential_cols.append('lake_area_km2')
    
    # Create a simple sequential ID only if the chunk does not already have one
    if 'lake_id' not in gdf.columns:
        gdf['lake_id'] = range(len(gdf))
    essential_cols.append('lake_id')
    
    gdf_simplified = gdf[essential_cols].copy()
    
    # Simplify geometries to reduce size (5m tolerance)
    print(f"  Simplifying geometries...")
    gdf_simplified['geometry'] = gdf_simplified.geometry.simplify(
        tolerance=5, preserve_topology=True
    )
    
    # Convert to GeoJSON dict
    geojson = gdf_simplified.__geo_interface__
    
    print(f"  Creating EE FeatureCollection...")
    fc = ee.FeatureCollection(geojson)
    print(f"  FeatureCollection created successfully")
    
    return fc, gdf

In [6]:
def add_lake_geometry_metrics(fc, chunk_bounds):
    """
    Add buffered interior and landscape ring geometries with EXPLICIT CRS handling
    
    PUBLICATION-READY VERSION:
    - All geometric operations in Alaska Albers (EPSG:3338)
    - Explicit reprojection with documentation
    - Masks ALL lakes from landscape ring
    
    Args:
        fc: FeatureCollection of lakes to process (>0.05 km²)
        chunk_bounds: ee.Geometry.Rectangle defining spatial extent
    """
    print("  Loading full ALPOD dataset for landscape masking...")
    print("  CRS Strategy: All geometric operations in Alaska Albers (EPSG:3338)")
    
    # Download ALPOD shapefile components (.cpg is optional, so its errors are silenced)
    import os
    import geopandas as gpd
    
    alpod_prefix = f'gs://{BUCKET}/{BASE_PATH}/ALPODlakes/ALPODlakes'
    for ext in ('shp', 'shx', 'dbf', 'prj'):
        os.system(f'gsutil -q cp {alpod_prefix}.{ext} /tmp/')
    os.system(f'gsutil -q cp {alpod_prefix}.cpg /tmp/ 2>/dev/null')
    
    # Load ALPOD - comes in EPSG:3338
    all_alpod = gpd.read_file('/tmp/ALPODlakes.shp')
    print(f"    ALPOD loaded: {len(all_alpod)} lakes, CRS: {all_alpod.crs}")
    
    # ============================================================
    # EXPLICIT CRS HANDLING
    # ============================================================
    # Get chunk bounds from GEE (these come in WGS84 from ee.Geometry.Rectangle)
    bounds_coords = chunk_bounds.bounds().getInfo()['coordinates'][0]
    minx_wgs84 = min(c[0] for c in bounds_coords) - 0.01
    maxx_wgs84 = max(c[0] for c in bounds_coords) + 0.01
    miny_wgs84 = min(c[1] for c in bounds_coords) - 0.01
    maxy_wgs84 = max(c[1] for c in bounds_coords) + 0.01
    
    print(f"    Chunk bounds (WGS84): [{minx_wgs84:.2f}, {maxx_wgs84:.2f}] × [{miny_wgs84:.2f}, {maxy_wgs84:.2f}]")
    
    # EXPLICIT: Convert ALPOD to WGS84 for spatial filtering
    print(f"    Converting ALPOD from {all_alpod.crs} to EPSG:4326 for bbox filtering...")
    all_alpod_wgs84 = all_alpod.to_crs('EPSG:4326')
    
    # Filter ALPOD to chunk region (in WGS84)
    nearby_lakes_wgs84 = all_alpod_wgs84.cx[minx_wgs84:maxx_wgs84, miny_wgs84:maxy_wgs84]
    print(f"    Found {len(nearby_lakes_wgs84)} total lakes in chunk region (all sizes)")
    
    if len(nearby_lakes_wgs84) == 0:
        print(f"    WARNING: No lakes found! Check bounds overlap.")
        print(f"      ALPOD extent: {all_alpod_wgs84.total_bounds}")
        print(f"      Chunk extent: [{minx_wgs84}, {miny_wgs84}, {maxx_wgs84}, {maxy_wgs84}]")
    
    # EXPLICIT: Convert filtered lakes BACK to Alaska Albers for GEE
    print(f"    Converting filtered lakes back to Alaska Albers for geometric operations...")
    nearby_lakes_albers = nearby_lakes_wgs84.to_crs('EPSG:3338')
    
    # Simplify in Alaska Albers (tolerance in meters)
    nearby_lakes_albers['geometry'] = nearby_lakes_albers.geometry.simplify(
        tolerance=5,  # 5 meters in Alaska Albers
        preserve_topology=True
    )
    
    print(f"    Simplified geometries (5m tolerance in Alaska Albers)")
    
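    # Convert to EE FeatureCollection
    # NOTE: this relies on EE reading the CRS from the GeoJSON. Without a CRS
    # member, EE treats GeoJSON coordinates as EPSG:4326, so verify the result
    # (e.g. inspect all_lakes_ee.first().geometry().projection()) before exporting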
    all_lakes_geojson = nearby_lakes_albers[['geometry']].__geo_interface__
    all_lakes_ee = ee.FeatureCollection(all_lakes_geojson)
    
    # Verify CRS was preserved
    print(f"    Created EE FeatureCollection with {len(nearby_lakes_albers)} lakes")
    print(f"    EE will respect embedded CRS (Alaska Albers EPSG:3338)")
    
    # PRE-COMPUTE: Union of ALL lake geometries
    all_lakes_union = all_lakes_ee.geometry().dissolve(maxError=10)  # 10m tolerance
    
    print("  Creating lake interior buffers and landscape rings...")
    print("    Using EPSG:3338 (Alaska Albers) for all geometric operations")
    
    def buffer_interior(feat):
        """
        Create 10m interior buffer
        All operations in Alaska Albers (EPSG:3338)
        """
        geom = feat.geometry()
        
        # No reprojection is performed here: the geometry is assumed to already be
        # in EPSG:3338 from the input, and the buffer distance below is in meters
        geom_albers = geom
        
        area = geom_albers.area(maxError=10)  # Area in m² (Alaska Albers)
        
        # 10m negative buffer in Alaska Albers
        buffered = geom_albers.buffer(
            distance=-10,  # meters (EPSG:3338)
            maxError=1     # 1m tolerance
        )
        
        # For very small lakes (< 0.01 km²), use full geometry
        buffered = ee.Algorithms.If(
            area.lt(10000),  # 10,000 m² = 0.01 km²
            geom_albers,
            buffered
        )
        
        return feat.set({'lake_interior': buffered})
    
    def add_landscape_ring(feat):
        """
        Create 100m landscape ring around lake
        Removes ALL lakes from ring
        All operations in Alaska Albers (EPSG:3338)
        """
        geom = feat.geometry()
        geom_albers = geom  # Already in EPSG:3338
        
        # Create 100m ring in Alaska Albers
        outer = geom_albers.buffer(
            distance=100,   # meters (EPSG:3338)
            maxError=1      # 1m tolerance
        )
        
        # Remove lake itself
        ring = outer.difference(geom_albers, maxError=1)
        
        # Remove ALL other lakes from ring (big AND small)
        clean_ring = ring.difference(all_lakes_union, maxError=10)
        
        return feat.set({
            'landscape_ring': clean_ring,
            'landscape_area_m2': ring.area(maxError=10)  # For QC
        })
    
    # Apply transformations
    fc_with_interior = fc.map(buffer_interior)
    fc_with_landscape = fc_with_interior.map(add_landscape_ring)
    
    print("  Geometric transformations complete")
    print("    All buffers computed in Alaska Albers (EPSG:3338)")
    print("    Buffer distances: -10m interior, +100m landscape ring")
    
    return fc_with_landscape
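
The 10 m inward buffer removes a larger share of a small lake than of a large one, which is why `buffer_interior` falls back to the full geometry below 0.01 km². A minimal sketch of that effect, assuming an idealized circular lake (real lakes are irregular and lose more area); `interior_area_fraction` is a hypothetical helper:

```python
import math

def interior_area_fraction(lake_area_m2, buffer_m=10.0):
    """Fraction of a circular lake's area left after an inward buffer."""
    radius = math.sqrt(lake_area_m2 / math.pi)
    if radius <= buffer_m:
        return 0.0  # the buffer consumes the whole lake
    return ((radius - buffer_m) / radius) ** 2

for area_km2 in (0.01, 0.05, 1.0):
    frac = interior_area_fraction(area_km2 * 1e6)
    print(f"{area_km2} km²: {frac:.0%} of area retained")
```

Under this idealization a 0.01 km² lake keeps roughly two thirds of its area, while a 1 km² lake keeps nearly all of it.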

---
## Sentinel-1 Processing

In [7]:
def process_sentinel1(lakes_fc, year, region_bounds):
    """
    Extract S1 features from BOTH lake interior AND landscape ring
    Matches original pilot study feature set
    """
    s1 = (ee.ImageCollection('COPERNICUS/S1_GRD')
          .filterDate(f'{year}-01-01', f'{year + 1}-01-01')  # end date is exclusive
          .filterBounds(region_bounds)
          .filter(ee.Filter.listContains('transmitterReceiverPolarisation', 'VV'))
          .filter(ee.Filter.listContains('transmitterReceiverPolarisation', 'VH'))
          .filter(ee.Filter.eq('instrumentMode', 'IW'))
          .filter(ee.Filter.eq('orbitProperties_pass', 'DESCENDING'))
          .select(['VV', 'VH']))
    
    def extract_s1_features(img):
        date = img.date()
        
        def lake_s1_stats(lake):
            lake_geom = ee.Geometry(lake.get('lake_interior'))
            landscape_geom = ee.Geometry(lake.get('landscape_ring'))
            
            # Extract from LAKE INTERIOR
            lake_stats = img.reduceRegion(
                reducer=ee.Reducer.mean(),
                geometry=lake_geom,
                scale=SCALE,
                maxPixels=1e9
            )
            
            # Extract from LANDSCAPE RING
            landscape_stats = img.reduceRegion(
                reducer=ee.Reducer.mean(),
                geometry=landscape_geom,
                scale=SCALE,
                maxPixels=1e9
            )
            
            # Get raw values
            vv_lake = ee.Number(lake_stats.get('VV'))
            vh_lake = ee.Number(lake_stats.get('VH'))
            vv_land = ee.Number(landscape_stats.get('VV'))
            vh_land = ee.Number(landscape_stats.get('VH'))
            
            # Calculate derived features (matching original)
            # Backscatter is in dB, so VV - VH is the VV/VH ratio in linear power
            vv_vh_ratio = vv_lake.subtract(vh_lake)
            
            # RGB-style scaled features (matching original lines 398-403)
            lake_R = vv_lake.unitScale(-20, -5).multiply(255)
            lake_G = vh_lake.unitScale(-28, -12).multiply(255)
            lake_B = vv_vh_ratio.unitScale(8, 18).multiply(255)
            
            land_R = vv_land.unitScale(-20, -5).multiply(255)
            land_G = vh_land.unitScale(-28, -12).multiply(255)
            land_vv_vh = vv_land.subtract(vh_land)
            land_B = land_vv_vh.unitScale(8, 18).multiply(255)
            
            return lake.set({
                's1_date': date.format('YYYY-MM-dd'),
                's1_doy': date.getRelative('day', 'year'),  # 0-based (Jan 1 = 0)
                # Raw backscatter
                'vv_db': vv_lake,
                'vh_db': vh_lake,
                'vv_vh_ratio': vv_vh_ratio,
                # Lake RGB features (for RF)
                'lake_R': lake_R,
                'lake_G': lake_G,
                'lake_B': lake_B,
                # Landscape RGB features (for RF)
                'land_R': land_R,
                'land_G': land_G,
                'land_B': land_B,
                # Also export landscape raw (for completeness)
                'vv_land_db': vv_land,
                'vh_land_db': vh_land
            })
        
        return lakes_fc.map(lake_s1_stats)
    
    s1_features = s1.map(extract_s1_features).flatten()
    return s1_features
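
The RGB-style features are plain linear rescalings of the dB values. A minimal pure-Python sketch of the same arithmetic, assuming (as the Earth Engine documentation describes for `unitScale`) that values outside the range are not clamped; `unit_scale` and `s1_rgb` are illustrative names:

```python
def unit_scale(value, low, high):
    """Linearly map [low, high] to [0, 1]; values outside the range are not clamped."""
    return (value - low) / (high - low)

def s1_rgb(vv_db, vh_db):
    """Scale VV, VH, and VV - VH (all in dB) with the ranges used above."""
    diff_db = vv_db - vh_db  # a difference in dB is a ratio in linear power
    return (
        unit_scale(vv_db, -20, -5) * 255,
        unit_scale(vh_db, -28, -12) * 255,
        unit_scale(diff_db, 8, 18) * 255,
    )

# VV - VH = 7.5 dB lies just below the 8 dB lower bound, so B comes out negative
r, g, b = s1_rgb(vv_db=-12.5, vh_db=-20.0)
```

Because nothing is clamped, the exported `lake_B` and `land_B` columns can fall outside 0 to 255; clip them downstream if the classifier expects that range.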

---
## Sentinel-2 Processing

In [8]:
def compute_ndsi(img):
    """
    Compute Normalized Difference Snow Index (NDSI)
    NDSI = (Green - SWIR1) / (Green + SWIR1)
    """
    green = img.select('B3')
    swir1 = img.select('B11')
    
    ndsi = green.subtract(swir1).divide(green.add(swir1)).rename('ndsi')
    
    return img.addBands(ndsi)

def mask_s2_clouds(img):
    """
    Mask clouds using QA60 band (basic cloud mask)
    """
    qa = img.select('QA60')
    
    # Bits 10 and 11 are clouds and cirrus
    cloud_bit_mask = 1 << 10
    cirrus_bit_mask = 1 << 11
    
    # Both should be zero (clear conditions)
    mask = qa.bitwiseAnd(cloud_bit_mask).eq(0).And(
           qa.bitwiseAnd(cirrus_bit_mask).eq(0))
    
    return img.updateMask(mask)

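def add_s2cloudless_mask(img):
    """
    Additionally mask pixels whose s2cloudless cloud probability is at or above
    S2_CLOUD_PROB_THRESHOLD; images without a matching probability image keep
    only the QA60 mask
    """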
    s2_cloudless = ee.ImageCollection('COPERNICUS/S2_CLOUD_PROBABILITY')
    cloud_prob_collection = s2_cloudless.filter(
        ee.Filter.eq('system:index', img.get('system:index'))
    )
    
    has_cloud_data = cloud_prob_collection.size().gt(0)
    
    def apply_s2cloudless_mask():
        cloud_prob = cloud_prob_collection.first().select('probability')
        is_clear = cloud_prob.lt(S2_CLOUD_PROB_THRESHOLD)
        return img.updateMask(is_clear)
    
    def use_qa60_only():
        return img  # Already has QA60 mask
    
    return ee.Image(ee.Algorithms.If(
        has_cloud_data,
        apply_s2cloudless_mask(),
        use_qa60_only()
    ))
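
The QA60 test in `mask_s2_clouds` is an ordinary bit check. A minimal pure-Python sketch of the same logic on single integer values (`qa60_is_clear` is an illustrative name):

```python
CLOUD_BIT = 1 << 10   # bit 10: opaque clouds
CIRRUS_BIT = 1 << 11  # bit 11: cirrus clouds

def qa60_is_clear(qa_value):
    """True when neither the cloud bit nor the cirrus bit is set."""
    return (qa_value & CLOUD_BIT) == 0 and (qa_value & CIRRUS_BIT) == 0

# 0 = clear, 1024 = cloud, 2048 = cirrus, 3072 = both
flags = {qa: qa60_is_clear(qa) for qa in (0, 1024, 2048, 3072)}
```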

In [9]:
def process_sentinel2(lakes_fc, year, region_bounds):
    """
    Extract S2 NDSI from lake interior ONLY
    No landscape extraction needed - not used in original RF
    """
    s2 = (ee.ImageCollection('COPERNICUS/S2_SR_HARMONIZED')
          .filterDate(f'{year}-01-01', f'{year + 1}-01-01')  # end date is exclusive
          .filterBounds(region_bounds)
          .filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', S2_CLOUD_THRESHOLD))
          .map(mask_s2_clouds)
          .map(add_s2cloudless_mask)
          .map(compute_ndsi))
    
    def extract_s2_for_date(s2_img):
        s2_date = s2_img.date()
        s2_bounds = s2_img.geometry()
        cloud_pct = s2_img.get('CLOUDY_PIXEL_PERCENTAGE')
        
        lakes_in_image = lakes_fc.filterBounds(s2_bounds)
        
        def lake_s2_stats(lake):
            lake_geom = ee.Geometry(lake.get('lake_interior'))
            
            # LAKE INTERIOR ONLY: NDSI for ice detection
            ndsi = s2_img.select('ndsi')
            ice_mask = ndsi.gt(0.4)
            
            lake_stats = ice_mask.reduceRegion(
                reducer=ee.Reducer.mean(),
                geometry=lake_geom,
                scale=20,
                maxPixels=1e9
            )
            
            ice_fraction = lake_stats.get('ndsi')
            
            return lake.set({
                's2_date': s2_date.format('YYYY-MM-dd'),
                's2_doy': s2_date.getRelative('day', 'year'),  # 0-based (Jan 1 = 0)
                's2_cloud_pct': cloud_pct,
                's2_ice_fraction': ice_fraction
            })
        
        return lakes_in_image.map(lake_s2_stats)
    
    s2_features = s2.map(extract_s2_for_date).flatten()
    return s2_features
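
`s2_ice_fraction` is the mean of a 0/1 mask, that is, the share of unmasked lake pixels with NDSI above 0.4. A minimal pure-Python sketch of that calculation on a handful of made-up reflectance pairs; it ignores the fractional pixel weighting that `reduceRegion` applies at polygon edges:

```python
def ndsi(green, swir1):
    """Normalized Difference Snow Index for one pixel (reflectance inputs)."""
    return (green - swir1) / (green + swir1)

def ice_fraction(pixels, threshold=0.4):
    """Share of (green, swir1) pixels whose NDSI exceeds the threshold."""
    flags = [1 if ndsi(g, s) > threshold else 0 for g, s in pixels]
    return sum(flags) / len(flags)

# Illustrative values only: two bright snow-like pixels, two dark water-like pixels
pixels = [(0.80, 0.10), (0.70, 0.15), (0.05, 0.04), (0.06, 0.05)]
fraction = ice_fraction(pixels)
```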

---
## ERA5 Temperature Processing

In [10]:
def process_era5_temperature(lakes_fc, year):
    """
    Process ERA5-Land temperature data - OPTIMIZED VERSION
    Pre-computes daily means once, then samples all lakes in batch
    Much faster than computing daily mean separately for each lake
    """
    print(f"  Loading ERA5 hourly data for {year}...")
    
    # Load ERA5-Land hourly data
    era5_hourly = (ee.ImageCollection('ECMWF/ERA5_LAND/HOURLY')
                   .filterDate(f'{year}-01-01', f'{year + 1}-01-01')  # end date is exclusive
                   .select('temperature_2m'))
    
    # Convert to Celsius
    def to_celsius(img):
        temp_c = img.subtract(273.15).rename('temp_c')
        return temp_c.copyProperties(img, ['system:time_start'])
    
    era5_hourly = era5_hourly.map(to_celsius)
    
    print(f"  Pre-computing daily means...")
    
    # Determine number of days in year (handle leap years)
    is_leap = ee.Number(year).mod(4).eq(0).And(
        ee.Number(year).mod(100).neq(0).Or(
            ee.Number(year).mod(400).eq(0)
        )
    )
    n_days = ee.Number(ee.Algorithms.If(is_leap, 366, 365))
    
    # Pre-compute daily means for entire year (365 or 366 images)
    def compute_daily_mean(day):
        day = ee.Number(day)
        date = ee.Date.fromYMD(year, 1, 1).advance(day.subtract(1), 'day')
        next_date = date.advance(1, 'day')
        
        # Get all hourly images for this day
        daily_collection = era5_hourly.filterDate(date, next_date)
        
        # Check if we have data
        has_data = daily_collection.size().gt(0)
        
        # Compute mean if data exists, otherwise use missing flag
        daily_mean = ee.Image(ee.Algorithms.If(
            has_data,
            daily_collection.mean(),
            ee.Image.constant(-9999).rename('temp_c')
        ))
        
        return daily_mean.set({
            'system:time_start': date.millis(),
            'doy': day,
            'date': date.format('YYYY-MM-dd')
        })
    
    days = ee.List.sequence(1, n_days)
    era5_daily = ee.ImageCollection.fromImages(days.map(compute_daily_mean))
    
    print(f"  Sampling all lakes from daily means...")
    
    # Now sample ALL lakes from each daily mean (batch operation)
    def sample_all_lakes(daily_img):
        doy = daily_img.get('doy')
        date = daily_img.get('date')
        
        # Sample ALL lakes at once using reduceRegions
        samples = daily_img.reduceRegions(
            collection=lakes_fc,
            reducer=ee.Reducer.first(),  # First pixel value within each lake polygon
            scale=11000  # Approximate ERA5-Land grid spacing in meters (~0.1°)
        )
        
        # Add date info to each sampled feature
        def add_date_info(feat):
            # Get the temperature value (from 'first' property created by reducer)
            # Use ee.Algorithms.If to provide default if missing
            temp_value = ee.Algorithms.If(
                feat.propertyNames().contains('first'),
                feat.get('first'),
                -9999
            )
            
            return feat.set({
                'era5_date': date,
                'era5_doy': doy,
                'temp_c': temp_value
            })
        
        return samples.map(add_date_info)
    
    # Process all daily images
    era5_features = era5_daily.map(sample_all_lakes).flatten()
    
    return era5_features
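
The daily-mean loop depends on two details: the day count changes in leap years, and `filterDate` excludes its end date, so each daily window must end at the start of the next day. A minimal standard-library sketch of the same windows (`daily_windows` is an illustrative name):

```python
import calendar
from datetime import date, timedelta

def daily_windows(year):
    """Yield (doy, start, end) with doy starting at 1 and end exclusive."""
    n_days = 366 if calendar.isleap(year) else 365
    first = date(year, 1, 1)
    for doy in range(1, n_days + 1):
        start = first + timedelta(days=doy - 1)
        yield doy, start, start + timedelta(days=1)

windows_2020 = list(daily_windows(2020))
# The last window of a leap year starts on Dec 31 and ends on Jan 1 of the next year
last_doy, last_start, last_end = windows_2020[-1]
```

The same exclusivity applies to the year-level filter: a collection filtered to end on December 31 contains no images from that day.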

---
## Export Functions

In [11]:
def export_chunk_year(chunk_id, year, lakes_fc, region_bounds):
    """
    Process and export S1/S2 data for one chunk and one year
    NOTE: ERA5 is handled separately (too large for GEE export)
    """
    print(f"\n{'='*60}")
    print(f"Processing Chunk {chunk_id}, Year {year}")
    print(f"{'='*60}")
    
    # Add lake geometries
    lakes_with_geom = add_lake_geometry_metrics(lakes_fc, region_bounds)
    print("Lakes processed with geometries")
    
    # Process S1
    print("\nProcessing Sentinel-1...")
    s1_features = process_sentinel1(lakes_with_geom, year, region_bounds)
    print("  S1 processing complete")
    
    # Process S2
    print("\nProcessing Sentinel-2...")
    s2_features = process_sentinel2(lakes_with_geom, year, region_bounds)
    print("  S2 processing complete")
    
    # ERA5 will be downloaded separately
    print("\nERA5 temperature...")
    print("  (Will be downloaded separately - more efficient than GEE export)")
    
    # Create S1 and S2 export tasks only
    
    # S1 export
    s1_task = ee.batch.Export.table.toCloudStorage(
        collection=s1_features,
        description=f'S1_chunk{chunk_id:02d}_{year}',
        bucket=BUCKET,
        fileNamePrefix=f'{OUTPUT_PATH}/{year}/chunk_{chunk_id:02d}/s1_data',
        fileFormat='CSV'
    )
    
    # S2 export
    s2_task = ee.batch.Export.table.toCloudStorage(
        collection=s2_features,
        description=f'S2_chunk{chunk_id:02d}_{year}',
        bucket=BUCKET,
        fileNamePrefix=f'{OUTPUT_PATH}/{year}/chunk_{chunk_id:02d}/s2_data',
        fileFormat='CSV'
    )
    
    # Return only S1 and S2 tasks (no ERA5)
    exports = [
        {'task': s1_task, 'type': 'S1', 'count': 'N/A'},
        {'task': s2_task, 'type': 'S2', 'count': 'N/A'}
    ]
    
    return exports
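
Each task is identified by a description and a Cloud Storage prefix built from the chunk ID, year, and sensor. A small sketch of that naming scheme, useful for checking which files to expect in the bucket; `export_names` is a hypothetical helper that mirrors the f-strings above:

```python
OUTPUT_PATH = "thermokarst_lakes/exports"  # same value as the configuration cell

def export_names(chunk_id, year, sensor):
    """Return (description, fileNamePrefix) for one export task."""
    tag = sensor.lower()
    description = f"{sensor}_chunk{chunk_id:02d}_{year}"
    prefix = f"{OUTPUT_PATH}/{year}/chunk_{chunk_id:02d}/{tag}_data"
    return description, prefix

desc, prefix = export_names(3, 2020, "S1")
```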

---
## Main Export Loop

In [12]:
# Load chunk statistics to know how many chunks we have
chunk_stats = pd.read_csv(f'gs://{BUCKET}/{BASE_PATH}/processed/chunk_statistics.csv')
n_chunks = len(chunk_stats)

print(f"Total chunks to process: {n_chunks}")
print(f"Years to process: {YEARS}")
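# NOTE: this count includes ERA5; only S1 and S2 are exported from GEE in this
# notebook (ERA5 is downloaded separately), so the GEE task count is chunks × years × 2
print(f"Total exports: {n_chunks * len(YEARS) * 3} (chunks × years × 3 datasets)")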
print("\nChunk statistics:")
print(chunk_stats[['chunk_id', 'n_lakes', 'lat_min', 'lat_max', 'lon_min', 'lon_max']])

Total chunks to process: 13
Years to process: [2019, 2020, 2021]
Total exports: 117 (chunks × years × 3 datasets)

Chunk statistics:
    chunk_id  n_lakes    lat_min    lat_max     lon_min     lon_max
0          0     2083  69.160813  70.902079 -153.130795 -151.727647
1          1     1452  69.006262  70.827977 -159.368810 -157.856992
2          2     1327  69.040671  70.501113 -150.347535 -148.888885
3          3     2518  69.006075  71.117631 -155.556146 -154.195153
4          4      780  69.027426  70.294397 -163.578774 -160.999169
5          5      223  69.061209  70.121377 -144.849551 -141.047786
6          6     1987  69.022434  71.334582 -157.932951 -156.619012
7          7      910  69.000834  70.388435 -149.010972 -147.408194
8          8     2044  69.016535  70.903127 -154.464362 -153.026821
9          9     1383  69.000146  70.478816 -151.904233 -150.237160
10        10      903  69.021021  70.839688 -161.140899 -159.325726
11        11     2148  69.298632  71.360948 -156.62

In [15]:
# Test with one chunk and one year first
TEST_CHUNK = 0
TEST_YEAR = 2019

print(f"\n{'#'*60}")
print(f"TEST RUN: Chunk {TEST_CHUNK}, Year {TEST_YEAR}")
print(f"{'#'*60}")

# Load chunk
test_fc, test_gdf = load_chunk_from_bucket(TEST_CHUNK)
test_gdf_wgs84 = test_gdf.to_crs('EPSG:4326') 
test_bounds = ee.Geometry.Rectangle(test_gdf_wgs84.total_bounds.tolist())  # wgs84

# Process and export
test_exports = export_chunk_year(TEST_CHUNK, TEST_YEAR, test_fc, test_bounds)

print(f"\n{'='*60}")
print("TEST EXPORTS PREPARED (NOT STARTED)")
print(f"{'='*60}")
for exp in test_exports:
    print(f"  {exp['type']}: {exp['count']} observations ready")
print("\nTo start exports, run the cells below.")


############################################################
TEST RUN: Chunk 0, Year 2019
############################################################
  Loaded chunk 0: 2083 lakes
    Bounds: [  32045.19937452 2133331.56155297   87259.03944725 2325197.90976039]
  Simplifying geometries...
  Creating EE FeatureCollection...
  FeatureCollection created successfully

Processing Chunk 0, Year 2019
  Loading full ALPOD dataset for landscape masking...
  CRS Strategy: All geometric operations in Alaska Albers (EPSG:3338)
    ALPOD loaded: 801895 lakes, CRS: EPSG:3338
    Chunk bounds (WGS84): [-153.15, -151.71] × [69.15, 70.93]
    Converting ALPOD from EPSG:3338 to EPSG:4326 for bbox filtering...
    Found 12900 total lakes in chunk region (all sizes)
    Converting filtered lakes back to Alaska Albers for geometric operations...
    Simplified geometries (5m tolerance in Alaska Albers)
    Created EE FeatureCollection with 12900 lakes
    EE will respect embedded CRS (Alaska Albers EPSG:3

In [16]:
# TEST EXPORTS
# Use this to test the export on one chunk. It can be skipped in later runs once the export works.
SKIP_TEST = True  # Set to False to run the test exports

if not SKIP_TEST:
    print("Starting test exports...")
    for exp in test_exports:
        exp['task'].start()
        print(f"  Started: {exp['task'].status()['description']}")
    
    print("\nTest exports started! Monitor at: https://code.earthengine.google.com/tasks")
    print("\nOnce test completes successfully, proceed to full export below.")
else:
    print("⏭️  Skipping test exports (SKIP_TEST = True)")
    print("Test exports already completed. Proceed to full export below.")

⏭️  Skipping test exports (SKIP_TEST = True)
Test exports already completed. Proceed to full export below.


---
## Full Export (All Chunks, All Years)

**WARNING:** This will prepare 78 export tasks (13 chunks × 3 years × 2 datasets each: S1 and S2)

Only run after test export completes successfully!

In [17]:
# Prep S1/S2 exports with geometry caching
# ERA5 will be downloaded separately (see Notebook 03)

all_exports = []
geometry_cache = {}  # Cache fc_with_geom by chunk_id

total_start = time.time()

for chunk_id in range(n_chunks):
    # ============================================================
    # STEP 1: Process geometries ONCE per chunk (not per year)
    # ============================================================
    if chunk_id not in geometry_cache:
        print(f"\n{'='*60}")
        print(f"Processing Chunk {chunk_id} geometries (first time)")
        print(f"{'='*60}")
        
        chunk_start = time.time()
        
        # Load chunk
        chunk_fc, chunk_gdf = load_chunk_from_bucket(chunk_id)
        chunk_gdf_wgs84 = chunk_gdf.to_crs('EPSG:4326')
        chunk_bounds = ee.Geometry.Rectangle(chunk_gdf_wgs84.total_bounds.tolist())
        
        # Add lake geometries (THE SLOW PART - only do once per chunk!)
        print("  Adding lake geometry metrics (interior buffers + landscape rings)...")
        geom_start = time.time()
        fc_with_geom = add_lake_geometry_metrics(chunk_fc, chunk_bounds)
        geom_time = time.time() - geom_start
        print(f"  ✓ Geometries computed in {geom_time:.1f}s")
        
        # Cache for reuse across years
        geometry_cache[chunk_id] = {
            'fc_with_geom': fc_with_geom,
            'chunk_bounds': chunk_bounds,
            'chunk_gdf': chunk_gdf
        }
        
        chunk_time = time.time() - chunk_start
        print(f"  ✓ Chunk {chunk_id} geometry processing complete ({chunk_time:.1f}s)")
    
    # ============================================================
    # STEP 2: Process each year using cached geometries
    # ============================================================
    for year in YEARS:
        print(f"\nProcessing Chunk {chunk_id}, Year {year}...")
        year_start = time.time()
        
        # Retrieve cached geometries
        cached = geometry_cache[chunk_id]
        fc_with_geom = cached['fc_with_geom']
        chunk_bounds = cached['chunk_bounds']
        
        # Process sensors (year-specific data)
        print("  Processing Sentinel-1...")
        s1_features = process_sentinel1(fc_with_geom, year, chunk_bounds)
        print("  ✓ S1 complete")
        
        print("  Processing Sentinel-2...")
        s2_features = process_sentinel2(fc_with_geom, year, chunk_bounds)
        print("  ✓ S2 complete")
        
        print("  (ERA5 will be downloaded separately)")
        
        # Create export tasks (S1 and S2 only)
        s1_task = ee.batch.Export.table.toCloudStorage(
            collection=s1_features,
            description=f'S1_chunk{chunk_id:02d}_{year}',
            bucket=BUCKET,
            fileNamePrefix=f'{OUTPUT_PATH}/{year}/chunk_{chunk_id:02d}/s1_data',
            fileFormat='CSV'
        )
        
        s2_task = ee.batch.Export.table.toCloudStorage(
            collection=s2_features,
            description=f'S2_chunk{chunk_id:02d}_{year}',
            bucket=BUCKET,
            fileNamePrefix=f'{OUTPUT_PATH}/{year}/chunk_{chunk_id:02d}/s2_data',
            fileFormat='CSV'
        )
        
        # Store export info (S1 and S2 only)
        all_exports.append({
            'chunk_id': chunk_id,
            'year': year,
            'type': 'S1',
            'task': s1_task,
            'count': 'N/A'
        })
        all_exports.append({
            'chunk_id': chunk_id,
            'year': year,
            'type': 'S2',
            'task': s2_task,
            'count': 'N/A'
        })
        
        year_time = time.time() - year_start
        print(f"  ✓ Year {year} complete ({year_time:.1f}s)")

total_time = time.time() - total_start

print(f"\n{'='*60}")
print(f"ALL EXPORTS PREPARED: {len(all_exports)} tasks (S1 + S2 only)")
print(f"{'='*60}")
print(f"Total preparation time: {total_time/60:.1f} minutes")
n_chunk_years = n_chunks * len(YEARS)
print(f"Average time per chunk-year: {total_time/n_chunk_years:.1f} seconds")
print(f"\nGeometry cache stats:")
print(f"  Unique chunks processed: {len(geometry_cache)}")
print(f"  Cache hits (years using cached geometries): {n_chunk_years - len(geometry_cache)}")
print(f"\nNote: ERA5 data will be downloaded separately (see Notebook 03)")
print(f"\nReady to start. Run next cell to begin {len(all_exports)} exports.")


Preparing Chunk 0, Year 2019...
  Loaded chunk 0: 2083 lakes
    Bounds: [  32045.19937452 2133331.56155297   87259.03944725 2325197.90976039]
  Simplifying geometries...
  Creating EE FeatureCollection...
  FeatureCollection created successfully

Processing Chunk 0, Year 2019
  Loading full ALPOD dataset for landscape masking...
  CRS Strategy: All geometric operations in Alaska Albers (EPSG:3338)
    ALPOD loaded: 801895 lakes, CRS: EPSG:3338
    Chunk bounds (WGS84): [-153.15, -151.71] × [69.15, 70.93]
    Converting ALPOD from EPSG:3338 to EPSG:4326 for bbox filtering...
    Found 12900 total lakes in chunk region (all sizes)
    Converting filtered lakes back to Alaska Albers for geometric operations...
    Simplified geometries (5m tolerance in Alaska Albers)
    Created EE FeatureCollection with 12900 lakes
    EE will respect embedded CRS (Alaska Albers EPSG:3338)
  Creating lake interior buffers and landscape rings...
    Using EPSG:3338 (Alaska Albers) for all geometric oper

    ALPOD loaded: 801895 lakes, CRS: EPSG:3338
    Chunk bounds (WGS84): [-155.57, -154.18] × [68.99, 71.13]
    Converting ALPOD from EPSG:3338 to EPSG:4326 for bbox filtering...
    Found 14588 total lakes in chunk region (all sizes)
    Converting filtered lakes back to Alaska Albers for geometric operations...
    Simplified geometries (5m tolerance in Alaska Albers)
    Created EE FeatureCollection with 14588 lakes
    EE will respect embedded CRS (Alaska Albers EPSG:3338)
  Creating lake interior buffers and landscape rings...
    Using EPSG:3338 (Alaska Albers) for all geometric operations
  Geometric transformations complete
    All buffers computed in Alaska Albers (EPSG:3338)
    Buffer distances: -10m interior, +100m landscape ring
Lakes processed with geometries

Processing Sentinel-1...
  S1 processing complete

Processing Sentinel-2...
  S2 processing complete

Processing ERA5 temperature...
  Loading ERA5 hourly data for 2019...
  Pre-computing daily means...
  Sampling 



    ALPOD loaded: 801895 lakes, CRS: EPSG:3338
    Chunk bounds (WGS84): [-156.65, -155.37] × [69.29, 71.37]
    Converting ALPOD from EPSG:3338 to EPSG:4326 for bbox filtering...
    Found 11110 total lakes in chunk region (all sizes)
    Converting filtered lakes back to Alaska Albers for geometric operations...
    Simplified geometries (5m tolerance in Alaska Albers)
    Created EE FeatureCollection with 11110 lakes
    EE will respect embedded CRS (Alaska Albers EPSG:3338)
  Creating lake interior buffers and landscape rings...
    Using EPSG:3338 (Alaska Albers) for all geometric operations
  Geometric transformations complete
    All buffers computed in Alaska Albers (EPSG:3338)
    Buffer distances: -10m interior, +100m landscape ring
Lakes processed with geometries

Processing Sentinel-1...
  S1 processing complete

Processing Sentinel-2...
  S2 processing complete

Processing ERA5 temperature...
  Loading ERA5 hourly data for 2020...
  Pre-computing daily means...
  Sampling 

In [18]:
# START ALL EXPORTS
# Run this cell only after a single-chunk test export has succeeded.

print(f"Starting {len(all_exports)} export tasks...")
print("This may take a few minutes to submit all tasks.\n")

for i, exp in enumerate(all_exports):
    exp['task'].start()
    
    if (i+1) % 10 == 0:
        print(f"  Started {i+1}/{len(all_exports)} tasks...")
        time.sleep(2)  # Brief pause to avoid overwhelming GEE

print(f"\n{'='*60}")
print(f"ALL {len(all_exports)} EXPORTS STARTED!")
print(f"{'='*60}")
print("\nMonitor progress at: https://code.earthengine.google.com/tasks")
print(f"\nOutputs will be in: gs://{BUCKET}/{OUTPUT_PATH}/YEAR/chunk_XX/")

Starting 117 export tasks...
This may take a few minutes to submit all tasks.



EEException: Request payload size exceeds the limit: 10485760 bytes.
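
The exception above reports that a request exceeded Earth Engine's 10485760-byte (10 MB) payload limit. One plausible cause, stated here as an assumption because the traceback is not shown, is that large client-side lake geometries are serialized into each export request. The sketch below is a rough, Earth Engine-independent way to estimate the JSON size of GeoJSON-like features and to split them into smaller batches. The helper names (`geojson_size_bytes`, `split_under_limit`) and the half-limit safety margin are hypothetical and not part of this notebook.

```python
import json

# Limit quoted in the EEException message above
EE_PAYLOAD_LIMIT_BYTES = 10_485_760


def geojson_size_bytes(features):
    """Approximate serialized size of GeoJSON-like feature dicts, in bytes."""
    collection = {"type": "FeatureCollection", "features": features}
    return len(json.dumps(collection).encode("utf-8"))


def split_under_limit(features, limit_bytes=EE_PAYLOAD_LIMIT_BYTES // 2):
    """Greedily group features into batches whose summed JSON size stays
    at or below limit_bytes (a single oversized feature gets its own batch)."""
    batches, current, current_size = [], [], 0
    for feat in features:
        size = len(json.dumps(feat).encode("utf-8")) + 1  # +1 for a separator
        if current and current_size + size > limit_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(feat)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

If the estimated size of a chunk's features is close to the limit, possible mitigations include stronger geometry simplification, smaller spatial chunks, or uploading the lakes as an Earth Engine asset so the request refers to them by ID. Which of these fits this workflow has not been verified here.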

---
## Monitor Export Progress

In [None]:
# Check status of all exports
def check_export_status():
    """Print and return a count of export tasks by Earth Engine task state."""
    status_summary = {
        'READY': 0,
        'RUNNING': 0,
        'COMPLETED': 0,
        'FAILED': 0,
        'CANCELLED': 0
    }
    
    for exp in all_exports:
        # Each status() call is a separate request to Earth Engine
        status = exp['task'].status()['state']
        # .get() also counts states not pre-listed above (e.g. CANCEL_REQUESTED)
        status_summary[status] = status_summary.get(status, 0) + 1
    
    print("Export Status Summary:")
    print(f"  Total tasks: {len(all_exports)}")
    for state, count in status_summary.items():
        if count > 0:
            print(f"    {state}: {count}")
    
    return status_summary

# Run this cell periodically to check progress
check_export_status()
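
As an alternative to re-running the cell by hand, the status check can be wrapped in a polling loop. The following is a minimal sketch, assuming each entry of `all_exports` is a dict with a `'task'` whose `status()` returns a dict containing `'state'`, as in the cell above. The function name `wait_for_exports`, the default intervals, and the choice of terminal states are illustrative assumptions.

```python
import time

# States after which a task will not change again (assumed terminal)
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}


def wait_for_exports(exports, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll task states until all are terminal or max_polls is reached.

    Returns a dict mapping state -> count from the final poll.
    """
    counts = {}
    for _ in range(max_polls):
        counts = {}
        for exp in exports:
            state = exp["task"].status()["state"]
            counts[state] = counts.get(state, 0) + 1
        # Stop as soon as every observed state is terminal
        if all(state in TERMINAL_STATES for state in counts):
            break
        sleep(poll_seconds)
    return counts
```

Calling `wait_for_exports(all_exports)` blocks the notebook until the exports finish or the poll budget runs out, so a longer `poll_seconds` is probably preferable for exports that run for hours.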

---
## Summary

This notebook exports:
- **Sentinel-1**: VV/VH backscatter for each lake interior
- **Sentinel-2**: NDSI-based ice fraction (with cloud filtering)
- **ERA5**: Daily mean temperature at lake locations

**For:**
- ~19 spatial chunks
- 3 years (2019, 2020, 2021)
- ~28,000 North Slope lakes

**Output structure:**
```
gs://wustl-eeps-geospatial/thermokarst_lakes/exports/
├── 2019/
│   ├── chunk_00/
│   │   ├── s1_data.csv
│   │   ├── s2_data.csv
│   │   └── era5_data.csv
│   ├── chunk_01/
│   └── ...
├── 2020/
└── 2021/
```
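
The layout above can be generated from a year, a chunk index, and a dataset name. The helper below is a hypothetical sketch (the name `export_file_prefix` is not defined elsewhere in this notebook). It returns a path without the `gs://` prefix and without the `.csv` extension, on the assumption that the export call receives the bucket separately and appends the extension itself.

```python
def export_file_prefix(year, chunk_id, dataset, base_path="thermokarst_lakes/exports"):
    """Build the object path prefix for one export, e.g.
    thermokarst_lakes/exports/2019/chunk_00/s1_data
    """
    if dataset not in ("s1", "s2", "era5"):
        raise ValueError(f"Unknown dataset: {dataset!r}")
    # Zero-pad the chunk index to two digits to match chunk_00, chunk_01, ...
    return f"{base_path}/{year}/chunk_{chunk_id:02d}/{dataset}_data"


# Example: list the three expected prefixes for chunk 0 in 2019
for name in ("s1", "s2", "era5"):
    print(export_file_prefix(2019, 0, name))
```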

**Next step:** Combine the exported CSVs and run the ice detection algorithm (Notebook 03)