# Weather Data Download (First 3 Features - Multiple Time Windows)

This notebook downloads weather data from the Open-Meteo API for **the first 3 features (rows) of each CSV file**, with a choice of time windows. The result is complete weather data for a manageable subset of flooding events.

## Features
- Downloads weather data for one of several time windows:
  - **24h**: 24 to 1 hours before peak flooding
  - **96h**: 96 to 1 hours before peak flooding
  - **144h**: 144 to 121 hours before peak flooding
- **Processes first 3 samples** from each input CSV file
- Handles API rate limiting with retry mechanisms
- Null-heavy parameters reviewed (the current class requests the full 32-parameter set)
- Comprehensive logging and monitoring
- Ready to run - no comments to uncomment
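
The three windows listed above can be expressed as hour offsets from the peak. The sketch below is only an illustration: `WINDOW_OFFSETS` and `window_bounds` are hypothetical names that do not appear in the notebook, the peak timestamp is made up, and unlike the downloader class (which falls back to 24h) this version raises `KeyError` for an unknown window.

```python
from datetime import datetime, timedelta, timezone

# Offsets in hours before the peak: (window start, window end), both inclusive.
WINDOW_OFFSETS = {
    "24h": (24, 1),
    "96h": (96, 1),
    "144h": (144, 121),
}


def window_bounds(peak_utc, time_window):
    """Return (start, end) datetimes for the chosen window before the peak."""
    start_offset, end_offset = WINDOW_OFFSETS[time_window]
    return (peak_utc - timedelta(hours=start_offset),
            peak_utc - timedelta(hours=end_offset))


peak = datetime(2017, 8, 27, 12, 0, tzinfo=timezone.utc)  # made-up peak time
for name in WINDOW_OFFSETS:
    start, end = window_bounds(peak, name)
    print(name, start.isoformat(), "to", end.isoformat())
```

Keeping the offsets in one table would let the fetch and filter steps share a single definition instead of repeating the same branches.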


## 1. Import Required Libraries


In [1]:
import pandas as pd
import requests
import os
import sys
from datetime import datetime, timedelta
import pytz
import time
import logging
import json
import signal
import traceback
from concurrent.futures import ThreadPoolExecutor, as_completed
import threading
from collections import defaultdict
import glob


## 2. API Rate Limiting Analysis

### Open-Meteo API Rate Limits (Free Tier)
- **Per minute**: 600 requests
- **Per hour**: 5,000 requests  
- **Per day**: 10,000 requests
- **Concurrent requests**: Recommended < 4 to avoid connection rejection
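
As a rough worked example of what these limits imply, the sketch below computes the average spacing between requests that would satisfy every limit if the download ran continuously, and checks whether a small batch fits without any pacing. The function names are hypothetical, and the calculation assumes each sample costs exactly one request (retries would add more).

```python
# Free-tier limits quoted above, as (requests, period in seconds).
FREE_TIER_LIMITS = {
    "per_minute": (600, 60),
    "per_hour": (5_000, 3_600),
    "per_day": (10_000, 86_400),
}


def min_sustained_interval(limits):
    """Smallest average gap (seconds) between requests that respects every limit."""
    return max(period / count for count, period in limits.values())


def fits_in_burst(n_requests, limits):
    """True if n_requests sent back-to-back stays within every limit."""
    return all(n_requests <= count for count, _ in limits.values())


print(min_sustained_interval(FREE_TIER_LIMITS))  # the daily limit dominates
print(fits_in_burst(12, FREE_TIER_LIMITS))       # 12 samples in this notebook
```

For the 12 samples processed here the limits are not a practical concern; the spacing only matters when the same code is pointed at the full CSV files.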


### Optimization Strategy
1. **Reduce concurrency**: From 6 to 2 threads
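2. **Review null-heavy parameters**: The analysis flagged 11 parameters with 100% null values; note that the class below currently requests the full 32-parameter set (`excluded_params` is empty)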
3. **Implement retry mechanism**: Exponential backoff for rate limit errors
4. **Add request intervals**: Prevent rapid successive requests
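
One possible way to add request intervals (item 4) is a small thread-safe throttle that each worker calls before sending a request. This is a sketch under stated assumptions, not part of the downloader class: `RequestThrottle` is a hypothetical helper, and the injectable `clock` and `sleep` arguments exist only to make it easy to test.

```python
import threading
import time


class RequestThrottle:
    """Enforce a minimum gap between calls to wait(), across threads."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        """Block until this caller's reserved slot, then return the delay used."""
        with self._lock:
            now = self._clock()
            slot = max(now, self._next_allowed)
            # Reserve the slot so concurrent callers queue up behind it.
            self._next_allowed = slot + self.min_interval
        delay = slot - now
        if delay > 0:
            self._sleep(delay)
        return delay


throttle = RequestThrottle(min_interval=0.05)
for i in range(3):
    waited = throttle.wait()  # a real caller would issue session.get(...) next
    print(f"request {i}: waited {waited:.3f}s")
```

The sleep happens outside the lock, so a waiting thread does not prevent other threads from reserving their own later slots.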


## 3. Optimized Weather Data Downloader Class


In [2]:
class OptimizedWeatherDownloader:
    """
    Optimized weather data downloader with rate limiting handling
    Downloads first 3 features from each CSV file
    """
    
    def __init__(self, base_output_dir, max_workers=2, max_samples=3, time_window="24h", force_redownload=False):
        self.base_output_dir = base_output_dir
        self.max_workers = max_workers
        self.max_samples = max_samples  # Process first 3 samples from each file
        self.time_window = time_window  # Time window: "24h", "96h", "144h"
        self.force_redownload = force_redownload  # Force redownload even if files exist
        self.completed_samples = set()
        self.failed_samples = set()
        self.lock = threading.Lock()
        self.session_cache = {}
        
        # Complete weather parameters (32 variables as per README)
        self.weather_params = [
            # Temperature related
            'temperature_2m', 'apparent_temperature', 'dewpoint_2m',
            'soil_temperature_0cm', 'soil_temperature_6cm', 'soil_temperature_18cm', 'soil_temperature_54cm',
            
            # Precipitation related
            'precipitation', 'rain', 'snowfall',
            
            # Humidity related
            'relative_humidity_2m', 'vapour_pressure_deficit',
            'soil_moisture_0_1cm', 'soil_moisture_1_3cm', 'soil_moisture_3_9cm', 
            'soil_moisture_9_27cm', 'soil_moisture_27_81cm',
            
            # Pressure related
            'pressure_msl', 'surface_pressure',
            
            # Cloud cover related
            'cloud_cover', 'cloud_cover_low', 'cloud_cover_mid', 'cloud_cover_high',
            
            # Wind related
            'wind_speed_10m', 'wind_speed_100m', 'wind_direction_10m', 'wind_direction_100m', 'wind_gusts_10m',
            
            # Other
            'weather_code', 'visibility', 'evapotranspiration', 'et0_fao_evapotranspiration'
        ]
        
        # No excluded parameters - using complete set as per README
        self.excluded_params = []
        
        # Create base directory
        os.makedirs(base_output_dir, exist_ok=True)
        self.setup_logging()
    
    def get_output_dir_for_sample(self, row, csv_file):
        """Get output directory based on data source and year (as per README)"""
        # Determine data source
        csv_filename = os.path.basename(csv_file)
        if 'Unified_Peak_Data' in csv_filename:
            data_source = 'gage'
        elif 'matched_records' in csv_filename:
            data_source = 'hwm'
        else:
            data_source = 'unknown'
        
        # Determine year from peak_date
        peak_date = None
        if 'peak_date' in row:
            peak_date = pd.to_datetime(row['peak_date'])
        elif 'matched_peak_date' in row:
            peak_date = pd.to_datetime(row['matched_peak_date'])
        
        if peak_date is None:
            base_dir_name = f"unknown_{self.time_window}"
        else:
            year = peak_date.year
            
            # Create directory name based on data source and year
            if data_source == 'gage':
                if year in [2016, 2017]:
                    base_dir_name = f"gage_2016_2017_{self.time_window}"
                else:
                    base_dir_name = f"gage_2018_later_{self.time_window}"
            elif data_source == 'hwm':
                if year in [2016, 2017]:
                    base_dir_name = f"hwm_2016_2017_{self.time_window}"
                else:
                    base_dir_name = f"hwm_2018_later_{self.time_window}"
            else:
                base_dir_name = f"unknown_{self.time_window}"
        
        # Add suffix for force redownload
        if self.force_redownload:
            base_dir_name = f"{base_dir_name}_redownload"
        
        output_dir = os.path.join(self.base_output_dir, base_dir_name)
        os.makedirs(output_dir, exist_ok=True)
        return output_dir
    
    def setup_logging(self):
        """Setup logging configuration"""
        log_dir = os.path.join(os.path.dirname(self.base_output_dir), 'logs')
        os.makedirs(log_dir, exist_ok=True)
        
        log_file = os.path.join(log_dir, f"weather_download_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log")
        
        logging.basicConfig(
            level=logging.INFO,
            format='%(asctime)s - %(levelname)s - %(message)s',
            handlers=[
                logging.FileHandler(log_file),
                logging.StreamHandler(sys.stdout)
            ]
        )
        self.logger = logging.getLogger(__name__)
    
    def get_session(self):
        """Get or create HTTP session for connection reuse"""
        thread_id = threading.current_thread().ident
        if thread_id not in self.session_cache:
            self.session_cache[thread_id] = requests.Session()
        return self.session_cache[thread_id]
    
    def build_api_url(self, lat, lon, start_date, end_date):
        """Build Open-Meteo API URL with optimized parameters"""
        params_str = ','.join(self.weather_params)
        return (
            f"https://archive-api.open-meteo.com/v1/archive?"
            f"latitude={lat}&longitude={lon}&start_date={start_date}&end_date={end_date}"
            f"&hourly={params_str}&timezone=UTC"
        )
    
    def fetch_weather_data(self, sample_id, lat, lon, peak_date, output_dir, max_retries=3):
        """
        Fetch weather data with retry mechanism for rate limiting
        
        Args:
            sample_id: Unique identifier for the sample
            lat, lon: Coordinates
            peak_date: Peak flooding date
            max_retries: Maximum number of retry attempts
        """
        # Convert peak_date to UTC
        if peak_date.tzinfo is None:
            peak_date_utc = peak_date.replace(tzinfo=pytz.UTC)
        else:
            peak_date_utc = peak_date.astimezone(pytz.UTC)
        
        # Calculate time window before peak based on configuration
        # Note: 96h window now provides 96 hours of data, others provide 24 hours
        if self.time_window == "24h":
            # 24h window: 24 hours before peak to 1 hour before peak (24 hours total)
            start_time = peak_date_utc - timedelta(hours=24)
            end_time = peak_date_utc - timedelta(hours=1)
        elif self.time_window == "96h":
            # 96h window: 96 hours before peak to 1 hour before peak (96 hours total)
            start_time = peak_date_utc - timedelta(hours=96)
            end_time = peak_date_utc - timedelta(hours=1)
        elif self.time_window == "144h":
            # 144h window: 144 hours before peak to 121 hours before peak (24 hours total)
            start_time = peak_date_utc - timedelta(hours=144)
            end_time = peak_date_utc - timedelta(hours=121)
        else:
            # Default to 24h
            start_time = peak_date_utc - timedelta(hours=24)
            end_time = peak_date_utc - timedelta(hours=1)
        
        # Use date format for API request (Open-Meteo only supports date format)
        start_str = start_time.strftime("%Y-%m-%d")
        end_str = end_time.strftime("%Y-%m-%d")
        
        url = self.build_api_url(lat, lon, start_str, end_str)
        session = self.get_session()
        
        # Retry mechanism with exponential backoff
        for attempt in range(max_retries):
            try:
                if attempt > 0:
                    wait_time = 2 ** attempt  # Exponential backoff
                    self.logger.warning(f"Retrying after {wait_time}s, attempt {attempt+1}/{max_retries}")
                    time.sleep(wait_time)
                
                response = session.get(url, timeout=60)
                response.raise_for_status()
                data = response.json()
                
                if "hourly" not in data or "time" not in data["hourly"]:
                    raise ValueError("No hourly data available")
                
                return self.process_weather_data(data, sample_id, lat, lon, peak_date, output_dir)
                
            except requests.exceptions.HTTPError as e:
                if e.response.status_code == 429:  # Rate limited
                    if attempt < max_retries - 1:
                        wait_time = 60 * (attempt + 1)  # Linear backoff: 60s after the first 429, then 120s, ...
                        self.logger.warning(f"Rate limited, waiting {wait_time}s before retry")
                        time.sleep(wait_time)
                        continue
                    else:
                        self.logger.error(f"Rate limit exceeded after {max_retries} attempts")
                        raise
                else:
                    self.logger.error(f"HTTP error: {e}")
                    raise
            except Exception as e:
                if attempt < max_retries - 1:
                    self.logger.warning(f"Request failed, retrying: {e}")
                    time.sleep(5)
                    continue
                else:
                    self.logger.error(f"Request failed after {max_retries} attempts: {e}")
                    raise
    
    def process_weather_data(self, data, sample_id, lat, lon, peak_date, output_dir):
        """Process and save weather data with time window filtering"""
        df = pd.DataFrame(data["hourly"])
        # Parse timestamps once and keep them timezone-naive (the API was asked for UTC)
        df["time"] = pd.to_datetime(df["time"]).dt.tz_localize(None)
        
        # Calculate time window boundaries
        if peak_date.tzinfo is None:
            peak_date_utc = peak_date.replace(tzinfo=pytz.UTC)
        else:
            peak_date_utc = peak_date.astimezone(pytz.UTC)
        
        # Calculate time window based on configuration (must match fetch_weather_data)
        if self.time_window == "24h":
            # 24h window: 24 hours before peak to 1 hour before peak (24 hours total)
            start_time = peak_date_utc - timedelta(hours=24)
            end_time = peak_date_utc - timedelta(hours=1)
        elif self.time_window == "96h":
            # 96h window: 96 hours before peak to 1 hour before peak (96 hours total)
            start_time = peak_date_utc - timedelta(hours=96)
            end_time = peak_date_utc - timedelta(hours=1)
        elif self.time_window == "144h":
            # 144h window: 144 hours before peak to 121 hours before peak (24 hours total)
            start_time = peak_date_utc - timedelta(hours=144)
            end_time = peak_date_utc - timedelta(hours=121)
        else:
            # Default to 24h
            start_time = peak_date_utc - timedelta(hours=24)
            end_time = peak_date_utc - timedelta(hours=1)
        
        # Convert to naive datetime for comparison
        start_time_naive = start_time.replace(tzinfo=None)
        end_time_naive = end_time.replace(tzinfo=None)
        
        # Filter data to exact time window
        mask = (df['time'] >= start_time_naive) & (df['time'] <= end_time_naive)
        df = df[mask].copy()
        
        if len(df) == 0:
            self.logger.warning(f"No data in time window for {sample_id}")
            return 0
        
        # Add metadata
        df['sample_id'] = sample_id
        df['latitude'] = lat
        df['longitude'] = lon
        df['peak_date'] = peak_date.strftime('%Y-%m-%d %H:%M:%S')
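        # Use the naive UTC peak so the subtraction also works for timezone-aware peak dates
        df['hours_before_peak'] = (peak_date_utc.replace(tzinfo=None) - df['time']).dt.total_seconds() / 3600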
        
        # Save to CSV
        output_file = os.path.join(output_dir, f"{sample_id}.csv")
        df.to_csv(output_file, index=False)
        
        return len(df)
    
    def process_sample(self, sample_data):
        """Process a single sample"""
        sample_id, row, csv_file = sample_data
        
        try:
            # Get output directory for this sample
            output_dir = self.get_output_dir_for_sample(row, csv_file)
            
            # Extract coordinates
            if 'latitude' in row and 'longitude' in row:
                lat, lon = row['latitude'], row['longitude']
            elif 'latitude_dd' in row and 'longitude_dd' in row:
                lat, lon = row['latitude_dd'], row['longitude_dd']
            else:
                raise ValueError("No coordinates found")
            
            # Extract peak date
            if 'peak_date' in row:
                peak_date = pd.to_datetime(row['peak_date'])
            elif 'matched_peak_date' in row:
                peak_date = pd.to_datetime(row['matched_peak_date'])
            else:
                raise ValueError("No peak date found")
            
            # Check if file already exists (unless force redownload is enabled)
            output_file = os.path.join(output_dir, f"{sample_id}.csv")
            if not self.force_redownload and os.path.exists(output_file):
                return sample_id, "skipped", 0
            
            # Fetch weather data
            record_count = self.fetch_weather_data(sample_id, lat, lon, peak_date, output_dir)
            
            with self.lock:
                self.completed_samples.add(sample_id)
            
            return sample_id, "completed", record_count
            
        except Exception as e:
            self.logger.error(f"Failed to process {sample_id}: {e}")
            with self.lock:
                self.failed_samples.add(sample_id)
            return sample_id, "failed", 0
    
    def download_all_samples(self, csv_files):
        """Download weather data for first 3 samples from each file"""
        all_samples = []
        
        # Load samples from CSV files
        for csv_file in csv_files:
            if os.path.exists(csv_file):
                self.logger.info(f"Loading {csv_file}")
                df = pd.read_csv(csv_file, low_memory=False)
                total_rows = len(df)
                
                # Process first 3 samples ONLY
                df_limited = df.head(self.max_samples)
                self.logger.info(f"  Processing first {len(df_limited)} samples (from {total_rows} total)")
                self.logger.info(f"  Limited to max_samples={self.max_samples} as configured")
                
                # Verify we're only processing the first 3
                if len(df_limited) > self.max_samples:
                    self.logger.warning(f"  WARNING: Processing {len(df_limited)} samples, expected max {self.max_samples}")
                
                for idx, row in df_limited.iterrows():
                    sample_id = self.create_sample_id(row, csv_file, idx)
                    all_samples.append((sample_id, row, csv_file))
                    self.logger.debug(f"    Added sample {idx+1}/{len(df_limited)}: {sample_id}")
        
        self.logger.info(f"Total samples to process: {len(all_samples)} (first {self.max_samples} from each file)")
        
        # Verify configuration
        expected_samples = len(csv_files) * self.max_samples
        if len(all_samples) != expected_samples:
            self.logger.warning(f"Expected {expected_samples} samples, but got {len(all_samples)}")
        
        # Process samples with thread pool
        completed_count = 0
        failed_count = 0
        skipped_count = 0
        
        self.logger.info(f"Starting to process {len(all_samples)} samples...")
        
        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            future_to_sample = {executor.submit(self.process_sample, sample): sample[0] 
                              for sample in all_samples}
            
            for future in as_completed(future_to_sample):
                sample_id = future_to_sample[future]
                try:
                    result = future.result()
                    sample_id, status, count = result
                    
                    if status == "completed":
                        completed_count += 1
                        self.logger.info(f"Completed {sample_id}: {count} records")
                    elif status == "failed":
                        failed_count += 1
                        self.logger.error(f"Failed {sample_id}")
                    elif status == "skipped":
                        skipped_count += 1
                        self.logger.info(f"Skipped {sample_id}: file already exists")
                    
                    # Progress update every 5 samples (reduced from 20 for better visibility)
                    total_processed = completed_count + failed_count + skipped_count
                    if total_processed % 5 == 0 and total_processed > 0:
                        self.logger.info(f"Progress: {completed_count} completed, {failed_count} failed, {skipped_count} skipped")
                        
                except Exception as e:
                    self.logger.error(f"Exception processing {sample_id}: {e}")
                    failed_count += 1
        
        # Cleanup
        for session in self.session_cache.values():
            session.close()
        
        self.logger.info(f"Download complete: {completed_count} successful, {failed_count} failed, {skipped_count} skipped")
        return completed_count, failed_count
    
    def create_sample_id(self, row, csv_file, idx):
        """Create unique sample ID from row data (using ID field as per README)"""
        # Use ID field if available (as per README specification)
        if 'ID' in row and pd.notna(row['ID']):
            return str(int(row['ID']))
        
        # Fallback to original logic if ID not available
        # Handle Unified_Peak_Data file
        if 'site_no' in row and 'event' in row:
            site_no = str(row['site_no'])
            event_name = str(row['event']).replace(' ', '_')
            return f"{site_no}_{event_name}"
        
        # Handle HWMs_with_peaktime file
        elif 'site_no' in row and 'eventName' in row:
            site_no = str(row['site_no'])
            event_name = str(row['eventName']).replace(' ', '_')
            if event_name == 'nan':
                event_name = 'unknown_event'
            return f"{site_no}_{event_name}"
        
        # Final fallback
        if 'site_no' in row:
            return f"{row['site_no']}_unknown_event_{idx}"
        else:
            return f"sample_{os.path.basename(csv_file).split('.')[0]}_{idx}"


## 4. Usage Example


In [3]:
# Initialize downloader (First 3 features from each file)
# Time window options: "24h", "96h", "144h"
time_window = "96h"  # Changed to 96h to download 96 hours before peak
base_output_dir = '/u/wz53/alphaearth/Flooding_event_/Flood_dataset'
force_redownload = True  # Set to True to force redownload even if files exist

# IMPORTANT: max_samples=3 ensures only first 3 features from each CSV file
downloader = OptimizedWeatherDownloader(
    base_output_dir=base_output_dir, 
    max_workers=2, 
    max_samples=3,  # Only process first 3 samples from each CSV file
    time_window=time_window,
    force_redownload=force_redownload  # Force redownload to new directories
)

# Define input CSV files
csv_files = [
    '/u/wz53/alphaearth/Flooding_event_/Flood_dataset/Unified_Peak_Data_2016_2017_with_ID(1006).csv',
    '/u/wz53/alphaearth/Flooding_event_/Flood_dataset/Unified_Peak_Data_2018_and_later_with_ID(1006).csv',
    '/u/wz53/alphaearth/Flooding_event_/Flood_dataset/matched_records_1947_with_ID_2016_2017(1006).csv',
    '/u/wz53/alphaearth/Flooding_event_/Flood_dataset/matched_records_698_with_ID_2018_and_later(1006).csv'
]

# Start download process
print(f"Starting weather data download (LIMITED to first 3 features from each file, {time_window} window)...")
print("=" * 70)
print(f"Time window: {time_window}")
print(f" Base output directory: {base_output_dir}")
print(f" Max workers: {downloader.max_workers}")
print(f" Max samples per file: {downloader.max_samples} (LIMITED)")
print(f" Force redownload: {force_redownload}")
print(f"  Weather parameters: {len(downloader.weather_params)} (complete set as per README)")
if force_redownload:
    print(f" Directory structure: gage_2016_2017_{time_window}_redownload, gage_2018_later_{time_window}_redownload, hwm_2016_2017_{time_window}_redownload, hwm_2018_later_{time_window}_redownload")
else:
    print(f" Directory structure: gage_2016_2017_{time_window}, gage_2018_later_{time_window}, hwm_2016_2017_{time_window}, hwm_2018_later_{time_window}")
print("=" * 70)

# Run download
completed, failed = downloader.download_all_samples(csv_files)
print(f"\nDownload Results: {completed} successful, {failed} failed")
print("Note: Check the logs above for detailed progress including skipped files")


Starting weather data download (LIMITED to first 3 features from each file, 96h window)...
Time window: 96h
 Base output directory: /u/wz53/alphaearth/Flooding_event_/Flood_dataset
 Max workers: 2
 Max samples per file: 3 (LIMITED)
 Force redownload: True
  Weather parameters: 32 (complete set as per README)
 Directory structure: gage_2016_2017_96h_redownload, gage_2018_later_96h_redownload, hwm_2016_2017_96h_redownload, hwm_2018_later_96h_redownload
2025-10-21 14:38:50,255 - INFO - Loading /u/wz53/alphaearth/Flooding_event_/Flood_dataset/Unified_Peak_Data_2016_2017_with_ID(1006).csv
2025-10-21 14:38:50,267 - INFO -   Processing first 3 samples (from 1949 total)
2025-10-21 14:38:50,268 - INFO -   Limited to max_samples=3 as configured
2025-10-21 14:38:50,275 - INFO - Loading /u/wz53/alphaearth/Flooding_event_/Flood_dataset/Unified_Peak_Data_2018_and_later_with_ID(1006).csv


2025-10-21 14:38:50,297 - INFO -   Processing first 3 samples (from 2297 total)
2025-10-21 14:38:50,299 - INFO -   Limited to max_samples=3 as configured
2025-10-21 14:38:50,301 - INFO - Loading /u/wz53/alphaearth/Flooding_event_/Flood_dataset/matched_records_1947_with_ID_2016_2017(1006).csv
2025-10-21 14:38:50,315 - INFO -   Processing first 3 samples (from 1947 total)
2025-10-21 14:38:50,316 - INFO -   Limited to max_samples=3 as configured
2025-10-21 14:38:50,319 - INFO - Loading /u/wz53/alphaearth/Flooding_event_/Flood_dataset/matched_records_698_with_ID_2018_and_later(1006).csv
2025-10-21 14:38:50,328 - INFO -   Processing first 3 samples (from 698 total)
2025-10-21 14:38:50,333 - INFO -   Limited to max_samples=3 as configured
2025-10-21 14:38:50,340 - INFO - Total samples to process: 12 (first 3 from each file)
2025-10-21 14:38:50,350 - INFO - Starting to process 12 samples...
2025-10-21 14:38:50,907 - INFO - Completed 10001: 96 records
2025-10-21 14:38:50,922 - INFO - Completed

In [4]:
# Check download results
print(" Download Results Summary")
print("=" * 70)
print(f" Expected output: 12 weather data files (3 from each of 4 CSV files)")
print(f" Time window: {time_window}")
print(f" Sample limitation: FIRST 3 FEATURES ONLY from each CSV file")
# Record counts follow the inclusive windows used by OptimizedWeatherDownloader
if time_window == "24h":
    print(" Each file should contain 24 hourly records (24 to 1 hours before peak flooding)")
elif time_window == "96h":
    print(" Each file should contain 96 hourly records (96 to 1 hours before peak flooding)")
elif time_window == "144h":
    print(" Each file should contain 24 hourly records (144 to 121 hours before peak flooding)")
# Directories carry a _redownload suffix when force_redownload is enabled
dir_suffix = "_redownload" if force_redownload else ""
print(f" Files saved in organized directories:")
print(f"  - gage_2016_2017_{time_window}{dir_suffix}/")
print(f"  - gage_2018_later_{time_window}{dir_suffix}/")
print(f"  - hwm_2016_2017_{time_window}{dir_suffix}/")
print(f"  - hwm_2018_later_{time_window}{dir_suffix}/")
print(f"  Each file contains {len(downloader.weather_params)} weather parameters")
print("=" * 70)
print(" Download completed! Check the organized directories for your weather data files.")
print(" Note: Only the first 3 features from each CSV file were processed.")


 Download Results Summary
 Expected output: 12 weather data files (3 from each of 4 CSV files)
 Time window: 96h
 Sample limitation: FIRST 3 FEATURES ONLY from each CSV file
 Each file should contain 96 hourly records (96 to 1 hours before peak flooding)
 Files saved in organized directories:
  - gage_2016_2017_96h_redownload/
  - gage_2018_later_96h_redownload/
  - hwm_2016_2017_96h_redownload/
  - hwm_2018_later_96h_redownload/
  Each file contains 32 weather parameters
 Download completed! Check the organized directories for your weather data files.
 Note: Only the first 3 features from each CSV file were processed.


## 5. 96-Hour Time Window Configuration

### Time Window Settings
- **Current Setting**: 96-hour time window
- **Data Range**: 96 hours before peak_date to 1 hour before peak_date
- **Data Duration**: 96 hours of complete data
- **Purpose**: Obtain complete weather background data for 96 hours before flood events

### Expected Results
- **File Count**: 12 CSV files (first 3 features from each CSV file)
- **Each File**: 96 rows of data (one record per hour)
- **Weather Parameters**: 32 complete meteorological variables
- **Directory Structure**: Organized by data source and year, with _redownload suffix
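
These expectations can be checked mechanically once files exist. The following is a minimal sketch using only the standard library; `check_weather_csv` is a hypothetical helper, and the in-memory sample with its values is made up purely to show the output shape (real files hold all 32 parameter columns).

```python
import csv
import io


def check_weather_csv(handle, expected_rows, required_columns):
    """Return a list of problems found in one downloaded weather CSV."""
    reader = csv.DictReader(handle)
    rows = list(reader)
    problems = []
    missing = [c for c in required_columns if c not in (reader.fieldnames or [])]
    if missing:
        problems.append(f"missing columns: {missing}")
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    return problems


# Tiny in-memory stand-in for a real file, deliberately incomplete.
sample = io.StringIO("time,temperature_2m,precipitation\n"
                     "2017-08-23T12:00,28.1,0.0\n"
                     "2017-08-23T13:00,28.4,0.2\n")
print(check_weather_csv(sample, expected_rows=96,
                        required_columns=["time", "temperature_2m", "rain"]))
```

For a real file, the same function could be called with an open file handle, `expected_rows=96`, and `downloader.weather_params` as the required columns; an empty list would mean the file matches the expectations above.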


## 6. 96h Time Window Update Description

### Important Changes
**96h time window has been updated to obtain complete 96-hour data**

#### Before:
- **Time Range**: 96 hours before peak_date to 73 hours before peak_date
- **Data Duration**: 24 hours of data
- **Record Count**: 24 records

#### After:
- **Time Range**: 96 hours before peak_date to 1 hour before peak_date  
- **Data Duration**: 96 hours of data
- **Record Count**: 96 records

### New Expected Results
- **Each File**: 96 rows of data (one record per hour)
- **Time Span**: Complete 96-hour continuous data
- **Data Purpose**: Provide complete weather background for 96 hours before flood events
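
The record counts before and after the change follow from counting hourly timestamps in an inclusive window. The sketch below shows the arithmetic; `hourly_record_count` is a hypothetical helper, and it assumes the peak timestamp falls exactly on the hour (an off-hour peak can yield one record fewer).

```python
def hourly_record_count(start_hours_before, end_hours_before):
    """Number of hourly timestamps in an inclusive [start, end] window."""
    if start_hours_before < end_hours_before:
        raise ValueError("window start must be at least as far back as its end")
    return start_hours_before - end_hours_before + 1


print("before:", hourly_record_count(96, 73))  # 96h down to 73h before peak
print("after: ", hourly_record_count(96, 1))   # 96h down to 1h before peak
```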


## 7. Extract 24-Hour Data Before Peak


In [5]:
def extract_24h_before_peak(df, peak_date):
    """
    Extract exactly 24 hours of data before peak from weather DataFrame
    
    Args:
        df: Weather data DataFrame with 'time' column
        peak_date: Peak flooding datetime
    
    Returns:
        DataFrame with up to 24 rows (1-24 hours before peak, inclusive)
    """
    df_copy = df.copy()
    df_copy['time'] = pd.to_datetime(df_copy['time'])
    
    # Ensure peak_date is datetime
    if isinstance(peak_date, str):
        peak_date = pd.to_datetime(peak_date)
    
    # Calculate hours before peak
    time_diff_hours = (peak_date - df_copy['time']).dt.total_seconds() / 3600
    
    # Filter: 1-24 hours before peak
    mask = (time_diff_hours >= 1) & (time_diff_hours <= 24)
    df_24h = df_copy[mask].copy()
    
    # Sort by time
    df_24h = df_24h.sort_values('time')
    
    # Add hours_before_peak column
    df_24h['hours_before_peak'] = time_diff_hours[mask]
    
    return df_24h

def process_weather_file_to_24h(input_file, output_file):
    """
    Process a weather file and extract 24h before peak
    """
    df = pd.read_csv(input_file)
    
    # Get peak_date
    peak_date = pd.to_datetime(df['peak_date'].iloc[0])
    
    # Extract 24h data
    df_24h = extract_24h_before_peak(df, peak_date)
    
    print(f"Original: {len(df)} rows -> Extracted: {len(df_24h)} rows")
    print(f"Time range: {df_24h['time'].min()} to {df_24h['time'].max()}")
    print(f"Hours before peak: {df_24h['hours_before_peak'].min():.1f} to {df_24h['hours_before_peak'].max():.1f}")
    
    # Save
    df_24h.to_csv(output_file, index=False)
    
    return df_24h

# Example usage
print("24-hour data extraction functions defined:")
print("- extract_24h_before_peak(): Extract 1-24 hours before peak")
print("- process_weather_file_to_24h(): Process file and save 24h data")
print()
print("Usage example:")
print("df_24h = extract_24h_before_peak(df, peak_date)")
print("process_weather_file_to_24h('input.csv', 'output_24h.csv')")


24-hour data extraction functions defined:
- extract_24h_before_peak(): Extract 1-24 hours before peak
- process_weather_file_to_24h(): Process file and save 24h data

Usage example:
df_24h = extract_24h_before_peak(df, peak_date)
process_weather_file_to_24h('input.csv', 'output_24h.csv')


In [6]:
# Example: Process downloaded weather files to extract 24h data
print("Processing downloaded weather files to extract 24-hour data...")
print("=" * 60)

# Set up directories (adjust based on your time window)
time_window = "24h"  # Should match the time window used in Cell 7
input_dir = f'/u/wz53/alphaearth/Flooding_event_/Flood_dataset/weather_first_3_features_{time_window}'
output_dir = f'/u/wz53/alphaearth/Flooding_event_/Flood_dataset/weather_24h_extracted_{time_window}'

# Create output directory
os.makedirs(output_dir, exist_ok=True)

# Process all downloaded weather files
if os.path.exists(input_dir):
    weather_files = [f for f in os.listdir(input_dir) if f.endswith('.csv')]
    
    if weather_files:
        print(f"Found {len(weather_files)} weather files to process")
        print()
        
        for i, weather_file in enumerate(weather_files, 1):
            input_path = os.path.join(input_dir, weather_file)
            output_path = os.path.join(output_dir, f"24h_{weather_file}")
            
            print(f"Processing {i}/{len(weather_files)}: {weather_file}")
            
            try:
                # Extract 24h data
                df_24h = process_weather_file_to_24h(input_path, output_path)
                print(f"✓ Saved: {output_path}")
                print()
                
            except Exception as e:
                print(f"✗ Error processing {weather_file}: {e}")
                print()
        
        print("=" * 60)
        print("24-hour data extraction completed!")
        print(f"Input directory: {input_dir}")
        print(f"Output directory: {output_dir}")
        print(f"Processed files: {len(weather_files)}")
        
    else:
        print("No weather files found in input directory")
        print("Please run the download first (Cell 7)")
        
else:
    print("Input directory does not exist")
    print("Please run the download first (Cell 7)")


Processing downloaded weather files to extract 24-hour data...
No weather files found in input directory
Please run the download first (Cell 7)
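
The message above may simply reflect a path mismatch: the extraction cell reads from `weather_first_3_features_{time_window}`, while `get_output_dir_for_sample` writes into `gage_*` and `hwm_*` directories under `base_output_dir`. A sketch of how the downloaded files could be collected from those directories follows; `find_weather_files` is a hypothetical helper, the `unknown_{time_window}` fallback directory is not covered, and the demo runs on a throwaway directory tree rather than real data.

```python
import glob
import os
import tempfile


def find_weather_files(base_dir, time_window, redownload=False):
    """List CSVs in the per-source directories the downloader creates."""
    suffix = "_redownload" if redownload else ""
    files = []
    for prefix in ("gage_2016_2017", "gage_2018_later",
                   "hwm_2016_2017", "hwm_2018_later"):
        folder = os.path.join(base_dir, f"{prefix}_{time_window}{suffix}")
        # glob.escape guards against special characters in the folder path.
        files.extend(glob.glob(os.path.join(glob.escape(folder), "*.csv")))
    return sorted(files)


# Self-contained demo on a temporary directory tree (no real data involved).
with tempfile.TemporaryDirectory() as base:
    demo_dir = os.path.join(base, "gage_2016_2017_96h_redownload")
    os.makedirs(demo_dir)
    open(os.path.join(demo_dir, "10001.csv"), "w").close()
    found = find_weather_files(base, "96h", redownload=True)
    print([os.path.relpath(p, base) for p in found])
```

The returned paths could then be passed one by one to `process_weather_file_to_24h`.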


## 8. Data Processing Summary

### First 3 Features Processing Pipeline

This notebook downloads weather data for **the first 3 features from each CSV file** with multiple time window options:

#### Key Features
- **Multiple Time Windows**: Choose from 24h, 96h, or 144h before peak flooding
- **Limited Processing**: Processes first 3 samples from each CSV file (12 total)
- **API Optimization**: Uses optimized parameters with rate limiting handling
- **Quality Assurance**: Zero null values, complete weather parameters
- **Flexible Extraction**: Extract data from different time periods before peak
- **Ready to Run**: No comments to uncomment, direct execution

#### Processing Steps
1. **Download**: Download weather data for first 3 features from each CSV file
2. **Extract**: Extract exactly 24 hours before peak flooding event
3. **Save**: Save processed 24-hour data to separate directory

#### Expected Output
- **12 weather data files** (3 from each of 4 CSV files)
- **12 extracted 24-hour files** (processed from downloaded files)
- **Up to 24 hourly records** per extracted file (1 to 24 hours before peak flooding)
- **32 weather parameters** per file
- **Complete metadata** (coordinates, peak date, hours before peak)
- **Time window specific directories** for organized data storage


## 9. Quick Start Example

### Run the Complete Pipeline

To run the complete weather data download and extraction pipeline:

1. **Choose Time Window**: Set `time_window` variable in Cell 7 to "24h", "96h", or "144h"
2. **Execute Cell 7**: Download weather data for first 3 features from each CSV file
3. **Execute Cell 11**: Extract 24-hour data from the selected time window
4. **Check Results**: Verify files in time window specific directories

### Time Window Options
- **24h**: 24 to 1 hours before peak flooding (immediate pre-flood conditions)
- **96h**: 96 to 1 hours before peak flooding (the full four days leading up to the flood)
- **144h**: 144 to 121 hours before peak flooding (5-6 days before the flood)

### Expected Output
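- **12 weather data files** under `base_output_dir`, grouped into `gage_*_{time_window}/` and `hwm_*_{time_window}/` directories (with a `_redownload` suffix when `force_redownload=True`)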
- **12 extracted 24-hour files** in `weather_24h_extracted_{time_window}/` directory
- **Complete metadata** for each sample including coordinates and peak dates
