# Multi-File AlphaEarth Embedding Download Notebook (Improved Version)

This notebook downloads AlphaEarth embeddings for **only the first 3 CSV files** with different temporal strategies, processing **only the first 3 features** from each file:

## Files and Strategies (Limited to 3 files):

1. **Unified_Peak_Data_2016_2017_with_ID(1006).csv** → embedding_1Y_later (download 2017, 2018) - **3 samples only**
2. **Unified_Peak_Data_2018_and_later_with_ID(1006).csv** → embedding_1Y_early (download previous year) - **3 samples only**
3. **matched_records_1947_with_ID_2016_2017(1006).csv** → embedding_1Y_later (download 2017, 2018) - **3 samples only**

## Improvements:
- **Adaptive region sizing**: Starts with 250x250 pixels, reduces to 128x128 or 64x64 if needed
- **Retry mechanism**: Up to 3 attempts per feature with different strategies
- **Better error handling**: Detailed logging and specific error type detection
- **Sequential processing**: Avoids GEE quota issues
- **Data availability diagnosis**: Tests AlphaEarth data availability before processing

## Data Format:
- **Structure**: 64×H×W arrays (64 bands A00-A63)
- **Naming**: gage_[ID] for Unified Peak Data, HWM_[ID] for matched records
- **Output**: Compressed .npz files with metadata
- **Limit**: Maximum 3 features per file (9 total features)

## Authentication:
Uses service account: `zhouwenlc@windy-winter-456502-b1.iam.gserviceaccount.com`


In [1]:
!pip install earthengine-api

Defaulting to user installation because normal site-packages is not writeable


In [2]:
import ee
import pandas as pd
import numpy as np
import os
import logging
from datetime import datetime
from concurrent.futures import ThreadPoolExecutor, as_completed
from dateutil import parser
import time

# Setup logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)


In [3]:
def initialize_gee():
    """Initialize Google Earth Engine using service account"""
    try:
        SERVICE_ACCOUNT = 'zhouwenlc@windy-winter-456502-b1.iam.gserviceaccount.com'
        KEY_FILE = 'Flood_dataset/windy-winter-456502-b1-e3f770db867c.json'
        credentials = ee.ServiceAccountCredentials(SERVICE_ACCOUNT, KEY_FILE)
        ee.Initialize(credentials)
        logger.info('Google Earth Engine initialized successfully with service account')
        return True
    except Exception as e:
        logger.error(f'Failed to initialize GEE: {e}')
        return False


In [4]:
def parse_date_flexible(date_str):
    """Flexible date parsing"""
    if pd.isna(date_str):
        return None
    
    try:
        # Try parsing with dateutil
        parsed_date = parser.parse(str(date_str))
        return parsed_date
    except Exception as e:
        logger.warning(f'Failed to parse date: {date_str}, error: {e}')
        return None

def get_download_year(peak_date, strategy):
    """Determine download year based on strategy"""
    if pd.isna(peak_date):
        return None
    
    parsed_date = parse_date_flexible(peak_date)
    if parsed_date is None:
        return None
    
    peak_year = parsed_date.year
    
    if strategy == 'early':
        # Download year prior to peak date
        download_year = peak_year - 1
    elif strategy == 'later':
        # Download year after peak date
        download_year = peak_year + 1
    else:
        return None
    
    # AlphaEarth data starts from 2017
    if download_year < 2017:
        return None
    
    return download_year


## Important Fix: Using filterBounds()

**Critical Note**: The key to successful embedding extraction is using `filterBounds(region)` before `sampleRectangle()`. This prevents the "Too many pixels in sample" error by:

1. **First filtering** the image collection to only images that intersect with our region
2. **Then sampling** from the filtered (smaller) image

**Without filterBounds**: Direct sampling from full-year images → Too many pixels → Error
**With filterBounds**: Filter to region first → Sample from smaller image → Success

This matches the approach used in the successful `/u/wz53/alphaearth/csv_embedding_extractor.py` script.


In [5]:
def extract_250x250_patch(latitude, longitude, year, max_retries=3):
    """Extract 250x250 pixel patch from AlphaEarth data with improved error handling"""
    
    # Validate coordinates
    if not (-90 <= latitude <= 90) or not (-180 <= longitude <= 180):
        logger.error(f'Invalid coordinates: lat={latitude}, lon={longitude}')
        return None
    
    for attempt in range(max_retries):
        try:
            logger.info(f'Attempt {attempt + 1}/{max_retries} for ({latitude:.4f}, {longitude:.4f}) in {year}')
            
            point = ee.Geometry.Point([longitude, latitude])
            
            # Try different region sizes if the first attempt fails
            if attempt == 0:
                # Standard 250x250 pixels (2500m x 2500m)
                half_size_meters = 1250
            elif attempt == 1:
                # Smaller 128x128 pixels (1280m x 1280m)
                half_size_meters = 640
            else:
                # Even smaller 64x64 pixels (640m x 640m)
                half_size_meters = 320
            
            lat_rad = np.radians(latitude)
            meters_per_deg_lat = 111320
            meters_per_deg_lon = 111320 * np.cos(lat_rad)
            
            half_size_lat = half_size_meters / meters_per_deg_lat
            half_size_lon = half_size_meters / meters_per_deg_lon
            
            west = longitude - half_size_lon
            east = longitude + half_size_lon
            south = latitude - half_size_lat
            north = latitude + half_size_lat
            
            region = ee.Geometry.Rectangle([west, south, east, north])
            
            # Load AlphaEarth dataset and filter by bounds first
            embedding_collection = ee.ImageCollection('GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL')
            filtered_collection = embedding_collection.filterBounds(region).filterDate(
                f'{year}-01-01', f'{year+1}-01-01'
            )
            
            count = filtered_collection.size().getInfo()
            if count == 0:
                logger.warning(f'No AlphaEarth data found for year {year} at ({latitude:.4f}, {longitude:.4f})')
                return None
            
            logger.info(f'Found {count} images for year {year}')
            
            # Get the first image from the filtered collection
            image = filtered_collection.first()
            
            # Sample the image using sampleRectangle with timeout
            pixel_data = image.sampleRectangle(
                region=region,
                defaultValue=0,
                properties=[]
            )
            
            # Get the values with timeout
            pixel_dict = pixel_data.getInfo()
            if not pixel_dict or 'properties' not in pixel_dict:
                logger.warning(f'No data found for point ({latitude:.4f}, {longitude:.4f}) in year {year}')
                if attempt < max_retries - 1:
                    logger.info(f'Retrying with smaller region...')
                    continue
                return None
            
            # Extract embedding bands data
            properties = pixel_dict['properties']
            bands_data = {}
            for i in range(64):
                band_name = f'A{i:02d}'
                if band_name in properties:
                    band_array = np.array(properties[band_name])
                    # Apply flipud for correct display
                    band_array = np.flipud(band_array)
                    bands_data[band_name] = band_array
            
            if len(bands_data) == 0:
                logger.warning(f'No embedding bands found for point ({latitude:.4f}, {longitude:.4f}) in year {year}')
                if attempt < max_retries - 1:
                    logger.info(f'Retrying with smaller region...')
                    continue
                return None
            
            logger.info(f'Successfully extracted {len(bands_data)} bands')
            
            # Stack all 64 bands into a 64×H×W array
            band_names = [f'A{i:02d}' for i in range(64)]
            image_stack = []
            
            for band_name in band_names:
                if band_name in bands_data:
                    image_stack.append(bands_data[band_name])
                else:
                    # Fill missing bands with zeros
                    if bands_data:
                        image_shape = list(bands_data.values())[0].shape
                        image_stack.append(np.zeros(image_shape))
                    else:
                        return None
            
            # Stack to create 64×H×W array
            patch = np.stack(image_stack, axis=0)
            
            logger.info(f'Successfully created patch with shape: {patch.shape}')
            return patch
            
        except Exception as e:
            error_msg = str(e)
            logger.error(f'Attempt {attempt + 1} failed for ({latitude:.4f}, {longitude:.4f}) in {year}: {error_msg}')
            
            # Check for specific error types
            if "Too many pixels" in error_msg:
                logger.info(f'Too many pixels error - will try smaller region on next attempt')
            elif "timeout" in error_msg.lower():
                logger.info(f'Timeout error - will retry')
            elif "quota" in error_msg.lower():
                logger.warning(f'Quota exceeded - waiting before retry')
                time.sleep(5)
            
            if attempt < max_retries - 1:
                time.sleep(2)  # Wait before retry
                continue
            else:
                logger.error(f'All {max_retries} attempts failed for ({latitude:.4f}, {longitude:.4f}) in {year}')
                return None
    
    return None


In [6]:
def save_patch_as_numpy(patch, output_path, feature_id, latitude, longitude, year):
    """Save patch as compressed numpy file with metadata (matching original format)"""
    try:
        os.makedirs(os.path.dirname(output_path), exist_ok=True)
        
        # Create band names array (matching original format)
        band_names = [f'A{i:02d}' for i in range(64)]
        
        np.savez_compressed(
            output_path,
            image_data=patch,  # 64×H×W array
            feature_id=feature_id,
            centroid_lon=longitude,  # Note: order should be lon, lat (matching original)
            centroid_lat=latitude,
            year=year,
            num_images=1,
            band_names=band_names,
            flipud_applied=True  # 标记已应用flipud
        )
        return True
    except Exception as e:
        logger.error(f'Error saving patch to {output_path}: {e}')
        return False


In [7]:
#def process_single_file(file_config, max_features=None, start_idx=0):
# setting max_features=3, and only process the first 3 features
def process_single_file(file_config, max_features=3, start_idx=0):
    """Process a single file for embedding download"""
    logger.info(f'   Processing: {file_config["description"]}')
    logger.info(f'   File: {file_config["file"]}')
    logger.info(f'   Strategy: {file_config["strategy"]}')
    logger.info(f'   Output: {file_config["output_dir"]}')
    logger.info(f'   Prefix: {file_config["prefix"]}')
    logger.info(f'   Max features: {max_features}')
    
    # Read CSV file
    try:
        df = pd.read_csv(file_config['file'])
        logger.info(f' Records: {len(df):,}')
    except Exception as e:
        logger.error(f' Failed to read file: {e}')
        return {'successful': 0, 'failed': 0, 'skipped': 0}
    
    # Prepare features for download
    features = []
    for idx, row in df.iterrows():
        if max_features and len(features) >= max_features:
            break
        
        if idx < start_idx:
            continue
        
        # Get coordinates
        lat = row.get('latitude')
        lon = row.get('longitude')
        
        if pd.isna(lat) or pd.isna(lon):
            continue
        
        # Get date field based on file type
        if 'Peak_Data' in file_config['file']:
            date_field = 'peak_date'
        else:
            date_field = 'matched_peak_date'
        
        peak_date = row.get(date_field)
        download_year = get_download_year(peak_date, file_config['strategy'])
        
        if download_year is None:
            continue
        
        # Get feature ID
        if 'ID' in row:
            feature_id = row['ID']
        else:
            feature_id = idx
        
        # Create filename
        filename = f'{file_config["prefix"]}{feature_id}.npz'
        output_path = os.path.join(file_config['output_dir'], filename)
        
        # Skip if exists
        if os.path.exists(output_path):
            continue
        
        features.append({
            'feature_id': feature_id,
            'latitude': lat,
            'longitude': lon,
            'download_year': download_year,
            'filename': filename,
            'output_path': output_path
        })
    
    logger.info(f' Valid features for download: {len(features):,}')
    
    if not features:
        logger.info(' No valid features to download')
        return {'successful': 0, 'failed': 0, 'skipped': 0}
    
    # Download embeddings with improved error handling
    stats = {'successful': 0, 'failed': 0, 'skipped': 0}
    
    # Process features sequentially to avoid GEE quota issues
    for i, feature in enumerate(features):
        logger.info(f'Processing feature {i+1}/{len(features)}: {feature["filename"]}')
        
        try:
            # Extract patch with retry mechanism
            patch = extract_250x250_patch(
                feature['latitude'],
                feature['longitude'],
                feature['download_year'],
                max_retries=3
            )
            
            if patch is not None:
                # Save patch
                if save_patch_as_numpy(patch, feature['output_path'], feature['feature_id'], 
                                   feature['latitude'], feature['longitude'], feature['download_year']):
                    stats['successful'] += 1
                    logger.info(f' Successfully saved: {feature["filename"]}')
                else:
                    stats['failed'] += 1
                    logger.error(f' Failed to save: {feature["filename"]}')
            else:
                stats['failed'] += 1
                logger.error(f' Failed to extract patch: {feature["filename"]}')
                
        except Exception as e:
            logger.error(f' Error processing {feature["filename"]}: {e}')
            stats['failed'] += 1
        
        # Rate limiting between features to avoid quota issues
        time.sleep(1)
    
    return stats


In [8]:
def diagnose_embedding_availability(latitude, longitude, year):
    """Diagnose AlphaEarth data availability for a specific location and year"""
    try:
        logger.info(f' Diagnosing AlphaEarth data for ({latitude:.4f}, {longitude:.4f}) in {year}')
        
        # Create a small test region
        point = ee.Geometry.Point([longitude, latitude])
        half_size_meters = 100  # Small 200m x 200m region for testing
        lat_rad = np.radians(latitude)
        meters_per_deg_lat = 111320
        meters_per_deg_lon = 111320 * np.cos(lat_rad)
        
        half_size_lat = half_size_meters / meters_per_deg_lat
        half_size_lon = half_size_meters / meters_per_deg_lon
        
        west = longitude - half_size_lon
        east = longitude + half_size_lon
        south = latitude - half_size_lat
        north = latitude + half_size_lat
        
        region = ee.Geometry.Rectangle([west, south, east, north])
        
        # Check AlphaEarth collection
        embedding_collection = ee.ImageCollection('GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL')
        
        # Check total collection size
        total_size = embedding_collection.size().getInfo()
        logger.info(f'Total AlphaEarth images: {total_size}')
        
        # Check filtered collection
        filtered_collection = embedding_collection.filterBounds(region).filterDate(
            f'{year}-01-01', f'{year+1}-01-01'
        )
        
        filtered_size = filtered_collection.size().getInfo()
        logger.info(f'Filtered images for {year}: {filtered_size}')
        
        if filtered_size > 0:
            # Get image info
            image = filtered_collection.first()
            image_info = image.getInfo()
            logger.info(f'Image properties: {list(image_info.keys())}')
            
            # Check band names
            if 'bands' in image_info:
                band_names = [band['id'] for band in image_info['bands']]
                logger.info(f'Available bands: {band_names[:10]}... (showing first 10)')
            
            return True
        else:
            logger.warning(f'No AlphaEarth data available for {year}')
            return False
            
    except Exception as e:
        logger.error(f'Diagnosis failed: {e}')
        return False


In [9]:
print(' Testing AlphaEarth data availability for first feature...')

# set relative path
base_dir = os.getcwd()
test_file = os.path.join(base_dir, 'csv_data', 'Unified_Peak_Data_2016_2017_with_ID(1006).csv')

# read CSV file
df = pd.read_csv(test_file)

# acquiring data of the first row
first_row = df.iloc[0]
lat = first_row['latitude']
lon = first_row['longitude']
peak_date = first_row['peak_date']

# calculate which year to download
download_year = get_download_year(peak_date, 'later')
print(f'First feature: lat={lat}, lon={lon}, peak_date={peak_date}, download_year={download_year}')

# Diagnose AlphaEarth data availability
if download_year:
    diagnose_embedding_availability(lat, lon, download_year)
else:
    print(' Could not determine download year')

2025-10-22 08:43:43,771 - INFO -  Diagnosing AlphaEarth data for (42.9575, -91.6240) in 2017
2025-10-22 08:43:43,772 - ERROR - Diagnosis failed: Earth Engine client library not initialized. See http://goo.gle/ee-auth.


 Testing AlphaEarth data availability for first feature...
First feature: lat=42.95753, lon=-91.62403, peak_date=2016-08-25 12:00:00, download_year=2017


In [10]:
# Initialize GEE
if not initialize_gee():
    raise Exception("Failed to initialize Google Earth Engine")

print(' GEE initialized successfully!')


2025-10-22 08:43:44,347 - INFO - Google Earth Engine initialized successfully with service account


 GEE initialized successfully!


In [11]:
base_dir = os.getcwd()

# setting output path
output_dirs = [
    os.path.join(base_dir, 'Flood_dataset', 'embedding_1Y_early'),
    os.path.join(base_dir, 'Flood_dataset', 'embedding_1Y_later')
]

# creat folder
for output_dir in output_dirs:
    os.makedirs(output_dir, exist_ok=True)
    print(f'Created directory: {output_dir}')

Created directory: /home/jovyan/Downloading-Google-Earth-Engine-data-through-Python/flooding_event_processing/Flood_dataset/embedding_1Y_early
Created directory: /home/jovyan/Downloading-Google-Earth-Engine-data-through-Python/flooding_event_processing/Flood_dataset/embedding_1Y_later


In [12]:
# Define a file configuration list (for embedded downloads)
file_configs = [
    {
        'description': '2016–2017 Peak Data (later strategy)',
        'file': os.path.join(base_dir, 'csv_data', 'Unified_Peak_Data_2016_2017_with_ID(1006).csv'),
        'strategy': 'later',
        'output_dir': os.path.join(base_dir, 'Flood_dataset', 'embedding_1Y_later_example'),
        'prefix': 'gage_'
    },
    {
        'description': '2018+ Peak Data (early strategy)',
        'file': os.path.join(base_dir, 'csv_data', 'Unified_Peak_Data_2018_and_later_with_ID(1006).csv'),
        'strategy': 'early',
        'output_dir': os.path.join(base_dir, 'Flood_dataset', 'embedding_1Y_early_example'),
        'prefix': 'gage_'
    },
    {
        'description': '2016–2017 Matched Records (later strategy)',
        'file': os.path.join(base_dir, 'csv_data', 'matched_records_1947_with_ID_2016_2017(1006).csv'),
        'strategy': 'later',
        'output_dir': os.path.join(base_dir, 'Flood_dataset', 'embedding_1Y_later_example'),
        'prefix': 'HWM_'
    },
    {
        'description': '2018+ Matched Records (early strategy)',
        'file': os.path.join(base_dir, 'csv_data', 'matched_records_698_with_ID_2018_and_later(1006).csv'),
        'strategy': 'early',
        'output_dir': os.path.join(base_dir, 'Flood_dataset', 'embedding_1Y_early_example'),
        'prefix': 'HWM_'
    }
]

# Print Configuration Overview
print('File configurations defined:')
for i, config in enumerate(file_configs, 1):
    print(f'   {i}. {config["description"]}')
    print(f'      File: {config["file"]}')
    print(f'      Strategy: {config["strategy"]}')
    print(f'      Output: {config["output_dir"]}')
    print(f'      Prefix: {config["prefix"]}\n')


File configurations defined:
   1. 2016–2017 Peak Data (later strategy)
      File: /home/jovyan/Downloading-Google-Earth-Engine-data-through-Python/flooding_event_processing/csv_data/Unified_Peak_Data_2016_2017_with_ID(1006).csv
      Strategy: later
      Output: /home/jovyan/Downloading-Google-Earth-Engine-data-through-Python/flooding_event_processing/Flood_dataset/embedding_1Y_later_example
      Prefix: gage_

   2. 2018+ Peak Data (early strategy)
      File: /home/jovyan/Downloading-Google-Earth-Engine-data-through-Python/flooding_event_processing/csv_data/Unified_Peak_Data_2018_and_later_with_ID(1006).csv
      Strategy: early
      Output: /home/jovyan/Downloading-Google-Earth-Engine-data-through-Python/flooding_event_processing/Flood_dataset/embedding_1Y_early_example
      Prefix: gage_

   3. 2016–2017 Matched Records (later strategy)
      File: /home/jovyan/Downloading-Google-Earth-Engine-data-through-Python/flooding_event_processing/csv_data/matched_records_1947_with_ID_

In [13]:
# Start downloading embeddings
#print('Starting Multi-File AlphaEarth Embedding Download')
#print('=' * 50)
# Start downloading embeddings (only first 3 files, 3 features each)
print('Starting Multi-File AlphaEarth Embedding Download (Limited to 3 files, 3 features each)')
print('=' * 70)

total_stats = {'successful': 0, 'failed': 0, 'skipped': 0}

for i, file_config in enumerate(file_configs, 1):
    print(f'\n Processing file {i}/4: {file_config["description"]}')
    
    # Process file
    #stats = process_single_file(file_config)
    # Process file (max 3 features per file)
    stats = process_single_file(file_config, max_features=3)    
    # Update totals
    for key in total_stats:
        total_stats[key] += stats[key]
    
    print(f' Results: {stats["successful"]} successful, {stats["failed"]} failed')

print(f'\n All embedding downloads completed!')
print(f' Final Summary:')
print(f'   Total successful: {total_stats["successful"]:,}')
print(f'   Total failed: {total_stats["failed"]:,}')
print(f'   Total skipped (existing): {total_stats["skipped"]:,}')

# Calculate success rate with zero division protection
total_attempted = total_stats["successful"] + total_stats["failed"]
if total_attempted > 0:
    success_rate = total_stats["successful"] / total_attempted * 100
    print(f'   Overall success rate: {success_rate:.1f}%')
else:
    print(f'   Overall success rate: N/A (no downloads attempted)')




2025-10-22 08:43:44,414 - INFO -    Processing: 2016–2017 Peak Data (later strategy)
2025-10-22 08:43:44,414 - INFO -    File: /home/jovyan/Downloading-Google-Earth-Engine-data-through-Python/flooding_event_processing/csv_data/Unified_Peak_Data_2016_2017_with_ID(1006).csv
2025-10-22 08:43:44,415 - INFO -    Strategy: later
2025-10-22 08:43:44,415 - INFO -    Output: /home/jovyan/Downloading-Google-Earth-Engine-data-through-Python/flooding_event_processing/Flood_dataset/embedding_1Y_later_example
2025-10-22 08:43:44,416 - INFO -    Prefix: gage_
2025-10-22 08:43:44,417 - INFO -    Max features: 3
2025-10-22 08:43:44,421 - INFO - 📊 Records: 3
2025-10-22 08:43:44,424 - INFO - 📋 Valid features for download: 0
2025-10-22 08:43:44,424 - INFO -  No valid features to download
2025-10-22 08:43:44,425 - INFO -    Processing: 2018+ Peak Data (early strategy)
2025-10-22 08:43:44,425 - INFO -    File: /home/jovyan/Downloading-Google-Earth-Engine-data-through-Python/flooding_event_processing/csv_dat

Starting Multi-File AlphaEarth Embedding Download (Limited to 3 files, 3 features each)

 Processing file 1/4: 2016–2017 Peak Data (later strategy)
 Results: 0 successful, 0 failed

 Processing file 2/4: 2018+ Peak Data (early strategy)
 Results: 0 successful, 0 failed

 Processing file 3/4: 2016–2017 Matched Records (later strategy)
 Results: 0 successful, 0 failed

 Processing file 4/4: 2018+ Matched Records (early strategy)
 Results: 0 successful, 0 failed

 All embedding downloads completed!
 Final Summary:
   Total successful: 0
   Total failed: 0
   Total skipped (existing): 0
   Overall success rate: N/A (no downloads attempted)


In [14]:
# Check downloaded files
print('\n  Checking downloaded files:')
base_dir = os.getcwd()
output_dirs = [
    os.path.join(base_dir, 'Flood_dataset', 'embedding_1Y_early_example'),
    os.path.join(base_dir, 'Flood_dataset', 'embedding_1Y_later_example')
]

for output_dir in output_dirs:
    if os.path.exists(output_dir):
        files = [f for f in os.listdir(output_dir) if f.endswith('.npz')]
        print(f'   {output_dir}: {len(files):,} files')
    else:
        print(f'   {output_dir}: Directory not found')



  Checking downloaded files:
   /home/jovyan/Downloading-Google-Earth-Engine-data-through-Python/flooding_event_processing/Flood_dataset/embedding_1Y_early: 0 files
   /home/jovyan/Downloading-Google-Earth-Engine-data-through-Python/flooding_event_processing/Flood_dataset/embedding_1Y_later: 0 files


In [15]:
# Verify file format (sample check)
print('\n Verifying file format (sample check):')
base_dir = os.getcwd()
output_dirs = [
    os.path.join(base_dir, 'Flood_dataset', 'embedding_1Y_early_example'),
    os.path.join(base_dir, 'Flood_dataset', 'embedding_1Y_later_example')
]
sample_files = []
for output_dir in output_dirs:
    if os.path.exists(output_dir):
        files = [f for f in os.listdir(output_dir) if f.endswith('.npz')]
        if files:
            sample_files.append(os.path.join(output_dir, files[0]))

for file_path in sample_files[:2]:
    print(f'\n File: {file_path}')
    try:
        data = np.load(file_path)
        print(f'   Keys: {list(data.keys())}')
        if 'image_data' in data:
            print(f'   Image shape: {data["image_data"].shape}')
            print(f'   Data type: {data["image_data"].dtype}')
        if 'feature_id' in data:
            print(f'   Feature ID: {data["feature_id"]}')
        if 'year' in data:
            print(f'   Year: {data["year"]}')
        if 'band_names' in data:
            print(f'   Band names: {data["band_names"][:5]}... (first 5)')
    except Exception as e:
        print(f'   Error: {e}')




 Verifying file format (sample check):

 File: /home/jovyan/Downloading-Google-Earth-Engine-data-through-Python/flooding_event_processing/Flood_dataset/embedding_1Y_early_example/gage_12003.npz
   Keys: ['image_data', 'feature_id', 'centroid_lon', 'centroid_lat', 'year', 'num_images', 'band_names', 'flipud_applied']
   Image shape: (64, 256, 256)
   Data type: float64
   Feature ID: 12003
   Year: 2017
   Band names: ['A00' 'A01' 'A02' 'A03' 'A04']... (first 5)

 File: /home/jovyan/Downloading-Google-Earth-Engine-data-through-Python/flooding_event_processing/Flood_dataset/embedding_1Y_later_example/gage_10002.npz
   Keys: ['image_data', 'feature_id', 'centroid_lon', 'centroid_lat', 'year', 'num_images', 'band_names', 'flipud_applied']
   Image shape: (64, 254, 255)
   Data type: float64
   Feature ID: 10002
   Year: 2017
   Band names: ['A00' 'A01' 'A02' 'A03' 'A04']... (first 5)
