## Two-Stage Downscaling Approach

### Downscaling Model Explanations by Variable

A concise explanation of the statistical downscaling approach used for each key climate variable. For each variable, we train a profile of **24 independent linear regression models**, one for each **hour of the day**.

---

#### Temperature Profile (`air_temperature_k`)

- **Approach:** Multivariate Linear Regression  
- **Predictors (Inputs):**
  - `tas` – daily mean temperature  
  - `tasmin` – daily minimum temperature  
  - `tasmax` – daily maximum temperature  

- **Logic:**  
  Each hourly model learns a unique equation based on these three inputs. For example:
  - The **2:00 AM** model tends to give the highest weight to `tasmin`
  - The **2:00 PM** model emphasizes `tasmax`

  This allows the model to accurately reconstruct the **diurnal temperature curve** based on the day's specific characteristics.

---

#### Wind Profile (`wind_speed_ms`)

- **Approach:** Two-Stage Multivariate Linear Regression  
- **Challenge:** NASA data provides only one daily wind predictor (`sfcWind`), so we break it into two stages.

##### Stage 1: Daily Characteristics Estimation
Two simple models are trained on historical ERA5 data to infer additional daily characteristics:

- `model_max_from_mean` → predicts `daily_max_wind`  
- `model_std_from_mean` → predicts `daily_std_dev` (gustiness)

##### Stage 2: Hourly Value Generation
The 24 hourly models are then trained using:
- `daily_mean` (original NASA value)  
- `predicted_max` (from Stage 1)  
- `predicted_std_dev` (from Stage 1)

This produces more realistic wind profiles:
- **Gusty days** → spiky, dynamic curves  
- **Calm days** → flatter, smoother curves

---

####  Humidity,  Solar &  Thermal Radiation Profiles

- **Approach:** Univariate Linear Regression  
- **Predictor (Input):** Single daily mean value from NASA data  
  - `hurs` for humidity  
  - `rsds` for solar radiation  
  - Similar single-variable inputs for other radiation variables

- **Logic:**  
  While simpler than the temperature or wind models, these still capture **diurnal shape** across 24 hours.

  **Example (Humidity):**
  - The **2:00 AM** model predicts values *higher than the daily mean*
  - The **2:00 PM** model predicts values *lower than the daily mean*

  This behavior mimics real-world daily humidity cycles.

---

####  Research Materials and Resources

This approach applies widely accepted **statistical downscaling** and **weather generation** techniques.

####  Key Academic Papers

##### 1. On the relationship between daily mean and other statistics:
- **Pryor, S. C., Barthelmie, R. J., & Kjellström, E. (2005)**  
  *A method for statistical downscaling of daily wind speed data*.  
  _Journal of Applied Meteorology, 44(12), 1871–1884_  
  - **Relevance:** Foundational method for inferring wind variability from daily means.

##### 2. On regression-based approaches for temporal downscaling:
- **Gleason, K. L. (2007)**  
  *A daily U.S. data set for meteorological and climatological applications*.  
  _8th Conference on Applied Climatology, AMS_  
  - **Relevance:** Validates use of regression to connect daily summaries to hourly values.

##### 3. On weather generators and statistical models:
- **Wilks, D. S., & Wilby, R. L. (1999)**  
  *The weather generation game: a review of stochastic weather models*.  
  _Progress in Physical Geography, 23(3), 329–357_  
  - **Relevance:** Explains conditional generation of sub-daily data — similar in spirit to the two-stage wind model.

---

#### General Resources

- **IPCC Data Distribution Centre**  
  - Offers high-level guidelines on downscaling techniques and usage.
- **IPCC Scenario Data Guidelines**  
  - See sections on **statistical downscaling** for best practices.
  - https://www.ipcc-data.org/guidelines/index.html

---

*This document is part of ongoing efforts to make climate projection data more granular, realistic, and usable in localized impact assessments and scenario modeling.*


## Climate Downscaling Model Training & Temporal Validation (2000–2020)

This script trains hourly downscaling models using daily climate statistics, then validates them on a held-out test period (2019–2020). It's built to handle multiple climate variables and captures diurnal variation through **24 separate hourly models per variable**.

---

#### 1. Configuration

- **Input:** `MODELING_Train_2000-2020.csv` – Daily summary + hourly targets  
- **Output Directory:** Trained models saved to `/trained_models_temporal_holdout/`  
- **Validation Output:** Predictions saved to `TEMPORAL_VALIDATION_Predictions_2019-2020.csv`

---

#### 2. Load & Split Data

- Loads full 2000–2020 dataset
- Splits into:
  - **Training set:** 2000–2018  
  - **Test set:** 2019–2020 (temporal holdout)

---

#### 3. Model Training

#### Wind Speed (Two-Stage)
- **Stage 1:**  
  Trains two regression models to predict:
  - `wind_speed_ms_max`  
  - `wind_speed_ms_std`  
  from `wind_speed_ms_mean`

- Saves these models for later use.

#### Hourly Models (Per Variable)
Trains **24 hourly Linear Regression models** for each variable:

| Variable                   | Predictors |
|----------------------------|------------|
| `air_temperature_k`        | Mean, Min, Max |
| `wind_speed_ms`            | Mean, Max, Std |
| `relative_humidity_percent`| Mean |
| `solar_radiation_w_m2`     | Mean |
| `thermal_radiation_w_m2`   | Mean |
| `precip_hourly_mm`         | Sum |

All trained models are saved individually using `joblib`.

---

#### 4. Prediction: Temporal Test Period (2019–2020)

- Uses **actual daily stats** from the test set (no synthetic input).
- For each variable:
  - Predicts 24 hourly values using the trained models
  - Reconstructs continuous hourly time series

---

#### 5. Validation

- Actual and predicted hourly values are compared
- Error metrics reported:
  - **MAE** (Mean Absolute Error)
  - **RMSE** (Root Mean Squared Error)


---

###  6. Output

- Saves merged prediction vs. actuals dataset as:
TEMPORAL_VALIDATION_Predictions_2019-2020.csv



---

##### Tools Used

- `pandas`, `numpy`, `sklearn`, `joblib`
- Linear Regression for all modeling steps

---

*This pipeline enables accurate, high-resolution climate variable reconstruction by learning consistent relationships between daily summaries and hourly patterns.*  
Perfect for building baseline models, downscaling ensembles, or running future scenario projections.



In [2]:
import pandas as pd
import numpy as np
import os
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import joblib

# --- 1. CONFIGURATION ---
# Define the location of your modeling-ready data file and where to save results.
ROOT_DATA_DIR = r"C:\Users\91788\Downloads\ERA5 Data\Extracted"
MODELING_FILE_FULL = os.path.join(ROOT_DATA_DIR, "MODELING_Train_2000-2020.csv") # The master file
MODEL_SAVE_DIR = os.path.join(ROOT_DATA_DIR, "trained_models_temporal_holdout")
VALIDATION_OUTPUT_FILE = os.path.join(ROOT_DATA_DIR, "TEMPORAL_VALIDATION_Predictions_2019-2020.csv")

os.makedirs(MODEL_SAVE_DIR, exist_ok=True)


# --- 2. LOAD AND SPLIT THE DATA ---
print("--- Step 1: Loading and splitting the master dataset ---")
try:
    master_df = pd.read_csv(MODELING_FILE_FULL, index_col=0, parse_dates=True)
    print("  Successfully loaded master modeling data (2000-2020).")
except FileNotFoundError as e:
    print(f"Error: Could not find data file: {e.filename}.")
    exit()

# Perform the temporal train/test split
train_df = master_df[master_df.index.year <= 2018]
test_df = master_df[master_df.index.year > 2018]

print(f"  Training data shape (2000-2018): {train_df.shape}")
print(f"  Validation data shape (2019-2020):  {test_df.shape}")


# --- 3. TRAIN ALL MODELS (using 2000-2012 data) ---
print("\n--- Step 2: Training all downscaling models on 2000-2018 data ---")
trained_models = {}
predictor_map = {
    'air_temperature_k': ['air_temperature_k_mean', 'air_temperature_k_min', 'air_temperature_k_max'],
    'wind_speed_ms': ['wind_speed_ms_mean', 'wind_speed_ms_max', 'wind_speed_ms_std'],
    'relative_humidity_percent': ['relative_humidity_percent_mean'],
    'solar_radiation_w_m2': ['solar_radiation_w_m2_mean'],
    'thermal_radiation_w_m2': ['thermal_radiation_w_m2_mean'],
    'precip_hourly_mm': ['precip_hourly_mm_sum']
}

# Train Stage-1 wind models
print("  Training wind characteristic models...")
wind_predictors_train = train_df[['wind_speed_ms_mean']]
model_wind_max = LinearRegression().fit(wind_predictors_train, train_df['wind_speed_ms_max'])
model_wind_std = LinearRegression().fit(wind_predictors_train, train_df['wind_speed_ms_std'])

# --- THE FIX: Explicitly save the Stage-1 wind models ---
print("  Saving Stage-1 wind models...")
joblib.dump(model_wind_max, os.path.join(MODEL_SAVE_DIR, 'model_wind_max.pkl'))
joblib.dump(model_wind_std, os.path.join(MODEL_SAVE_DIR, 'model_wind_std.pkl'))
# We can add them to our dictionary for immediate use if needed
trained_models['wind_max_from_mean'] = model_wind_max
trained_models['wind_std_from_mean'] = model_wind_std


# Train main hourly models
for var_name, predictors in predictor_map.items():
    print(f"  Training 24 hourly models for: {var_name}...")
    trained_models[var_name] = {}
    if not all(p in train_df.columns for p in predictors): continue
    for hour in range(24):
        target_col = f"{var_name}_{hour}"
        if target_col not in train_df.columns: continue
        model = LinearRegression().fit(train_df[predictors], train_df[target_col])
        trained_models[var_name][hour] = model
    # Save each variable's model dictionary
    joblib.dump(trained_models[var_name], os.path.join(MODEL_SAVE_DIR, f'models_{var_name}.pkl'))

print("--- All models have been trained successfully. ---")


# --- 4. MAKE PREDICTIONS ON THE TEMPORAL TEST SET (2013-2014) ---
print("\n--- Step 3: Generating hourly predictions for the test period (2019-2020) ---")

# For this ERA5-only validation, we use the actual daily stats from the test set.
X_test_base = test_df.copy()

final_predictions = {}
for var_name, predictors in predictor_map.items():
    if not all(p in X_test_base.columns for p in predictors): continue
    
    print(f"  Predicting hourly values for: {var_name}...")
    hourly_preds_list = []
    X_predict = X_test_base[predictors]
    for hour in range(24):
        model = trained_models[var_name].get(hour)
        if model:
            preds = model.predict(X_predict)
            hourly_preds_list.append(pd.Series(preds, index=X_predict.index, name=hour))
            
    if hourly_preds_list:
        var_df_wide = pd.concat(hourly_preds_list, axis=1)
        var_stacked = var_df_wide.stack()
        var_stacked.index = var_stacked.index.map(lambda x: x[0] + pd.to_timedelta(x[1], unit='h'))
        final_predictions[f'predicted_{var_name}'] = var_stacked

predictions_df = pd.DataFrame(final_predictions)
print("--- Hourly predictions generated successfully. ---")


# --- 5. VALIDATE PREDICTIONS ---
print("\n--- Step 4: Validating predictions against actual 2019-2020 data ---")

# Re-structure the actual 2013-2014 data to be comparable
actuals_df = pd.DataFrame()
for var_name in predictor_map.keys():
    actual_cols = [f"{var_name}_{h}" for h in range(24) if f"{var_name}_{h}" in test_df.columns]
    if not actual_cols: continue
    actual_hourly = test_df[actual_cols]
    actual_hourly.columns = [int(c.split('_')[-1]) for c in actual_cols]
    actual_stacked = actual_hourly.stack()
    actual_stacked.index = actual_stacked.index.map(lambda x: x[0] + pd.to_timedelta(x[1], unit='h'))
    actuals_df[f'actual_{var_name}'] = actual_stacked

# Merge actuals and predictions
validation_df = pd.merge(actuals_df, predictions_df, left_index=True, right_index=True, how="inner")

# Calculate and print final error metrics
print("  Temporal Hold-Out Test Results (2019-2020):")
for var_name in predictor_map.keys():
    actual_col, predicted_col = f'actual_{var_name}', f'predicted_{var_name}'
    if actual_col in validation_df.columns and predicted_col in validation_df.columns:
        temp_compare_df = validation_df[[actual_col, predicted_col]].dropna()
        if not temp_compare_df.empty:
            mae = mean_absolute_error(temp_compare_df[actual_col], temp_compare_df[predicted_col])
            rmse = np.sqrt(mean_squared_error(temp_compare_df[actual_col], temp_compare_df[predicted_col]))
            print(f"    - {var_name}:")
            print(f"        Mean Absolute Error (MAE):  {mae:.4f}")
            print(f"        Root Mean Squared Error (RMSE): {rmse:.4f}")

# --- 6. SAVE FINAL RESULTS ---
print(f"\n--- Step 5: Saving final validation results to {VALIDATION_OUTPUT_FILE} ---")
validation_df.to_csv(VALIDATION_OUTPUT_FILE)
print("Save complete.")


--- Step 1: Loading and splitting the master dataset ---
  Successfully loaded master modeling data (2000-2020).
  Training data shape (2000-2018): (6940, 88)
  Validation data shape (2019-2020):  (731, 88)

--- Step 2: Training all downscaling models on 2000-2018 data ---
  Training wind characteristic models...
  Saving Stage-1 wind models...
  Training 24 hourly models for: air_temperature_k...
  Training 24 hourly models for: wind_speed_ms...
  Training 24 hourly models for: relative_humidity_percent...
  Training 24 hourly models for: solar_radiation_w_m2...
  Training 24 hourly models for: thermal_radiation_w_m2...
  Training 24 hourly models for: precip_hourly_mm...
--- All models have been trained successfully. ---

--- Step 3: Generating hourly predictions for the test period (2019-2020) ---
  Predicting hourly values for: air_temperature_k...
  Predicting hourly values for: wind_speed_ms...
  Predicting hourly values for: relative_humidity_percent...
  Predicting hourly value

In [7]:
import pandas as pd
import matplotlib.pyplot as plt
import os

# --- 1. CONFIGURATION ---
# Define the location of your validation file and where to save plots.
ROOT_DATA_DIR = r"C:\Users\91788\Downloads\ERA5 Data\Extracted"
VALIDATION_FILE = os.path.join(ROOT_DATA_DIR, "TEMPORAL_VALIDATION_Predictions_2019-2020.csv")
PLOT_SAVE_DIR = os.path.join(ROOT_DATA_DIR, "temporal_validation_plots")

# Create a directory to save the plots
os.makedirs(PLOT_SAVE_DIR, exist_ok=True)

# --- 2. DATA LOADING ---
print("--- Loading temporal validation data ---")
try:
    # Use index_col=0 to specify the first column is the index.
    validation_df = pd.read_csv(VALIDATION_FILE, index_col=0, parse_dates=True)
    print("  Successfully loaded validation data.")
except FileNotFoundError:
    print(f"Error: Validation file not found at {VALIDATION_FILE}")
    exit()

# --- 3. PLOTTING FUNCTION ---

def plot_validation_timeseries(df, var_name, start_date, end_date, resample_freq=None):
    """
    Creates and saves a time series plot comparing actual vs. predicted values
    for a specific variable and date range. Can resample data for longer periods.
    """
    actual_col = f'actual_{var_name}'
    predicted_col = f'predicted_{var_name}'
    
    # Check if both required columns exist in the DataFrame
    if not all(col in df.columns for col in [actual_col, predicted_col]):
        print(f"\nSkipping plot for '{var_name}': one or both columns not found.")
        return

    # Filter the DataFrame for the desired date range
    plot_df = df.loc[start_date:end_date].copy()
    
    if plot_df.empty:
        print(f"\nNo data found for '{var_name}' in the date range {start_date} to {end_date}.")
        return

    # Optional: Resample data for longer time periods (e.g., daily mean for a yearly plot)
    if resample_freq:
        plot_df = plot_df[[actual_col, predicted_col]].resample(resample_freq).mean()
        plot_title_suffix = f"({resample_freq} Resample)"
    else:
        plot_title_suffix = "(Hourly)"

    print(f"\nGenerating plot for '{var_name}' from {start_date} to {end_date} {plot_title_suffix}...")

    # Create the plot
    plt.style.use('seaborn-v0_8-whitegrid')
    fig, ax = plt.subplots(figsize=(18, 8))
    
    # Plot the actual and predicted lines
    ax.plot(plot_df.index, plot_df[actual_col], label='Actual (ERA5)', color='blue', linewidth=2.5, alpha=0.8)
    ax.plot(plot_df.index, plot_df[predicted_col], label='Predicted (Model)', color='red', linewidth=1.5, linestyle='--')
    
    # Formatting the plot
    plt.title(f'Temporal Validation: Actual vs. Predicted {var_name.replace("_", " ").title()}\n({start_date} to {end_date}) {plot_title_suffix}', fontsize=18)
    plt.ylabel(var_name.split('_')[-1].upper(), fontsize=14)
    plt.xlabel('Date and Time', fontsize=14)
    plt.legend(fontsize=12)
    plt.xticks(rotation=30, ha='right')
    plt.tight_layout()
    
    # Save the plot to a file
    plot_filename = f"Temporal_Validation_{var_name}_{start_date}_to_{end_date}_{resample_freq or 'hourly'}.png"
    save_path = os.path.join(PLOT_SAVE_DIR, plot_filename)
    plt.savefig(save_path, dpi=150)
    print(f"  Plot saved to: {save_path}")
    plt.close(fig) # Close the figure to free up memory


# --- 4. EXECUTION ---
if __name__ == "__main__":
    
    variables_to_plot = [
        'air_temperature_k',
        'wind_speed_ms',
        'relative_humidity_percent',
        'precip_hourly_mm',
        'solar_radiation_w_m2',
        'thermal_radiation_w_m2'
    ]

    for variable in variables_to_plot:
        # Generate a 3-day plot (hourly resolution)
        plot_validation_timeseries(validation_df, variable, '2019-07-01', '2019-07-03')
        
        # Generate a 1-month plot (hourly resolution)
        plot_validation_timeseries(validation_df, variable, '2019-07-01', '2019-07-31')

        # Generate a 1-year plot (resampled to daily mean for clarity)
        plot_validation_timeseries(validation_df, variable, '2019-01-01', '2019-12-31', resample_freq='D')
        
    print("\n--- All plots generated successfully. ---")


--- Loading temporal validation data ---
  Successfully loaded validation data.

Generating plot for 'air_temperature_k' from 2019-07-01 to 2019-07-03 (Hourly)...
  Plot saved to: C:\Users\91788\Downloads\ERA5 Data\Extracted\temporal_validation_plots\Temporal_Validation_air_temperature_k_2019-07-01_to_2019-07-03_hourly.png

Generating plot for 'air_temperature_k' from 2019-07-01 to 2019-07-31 (Hourly)...
  Plot saved to: C:\Users\91788\Downloads\ERA5 Data\Extracted\temporal_validation_plots\Temporal_Validation_air_temperature_k_2019-07-01_to_2019-07-31_hourly.png

Generating plot for 'air_temperature_k' from 2019-01-01 to 2019-12-31 (D Resample)...
  Plot saved to: C:\Users\91788\Downloads\ERA5 Data\Extracted\temporal_validation_plots\Temporal_Validation_air_temperature_k_2019-01-01_to_2019-12-31_D.png

Generating plot for 'wind_speed_ms' from 2019-07-01 to 2019-07-03 (Hourly)...
  Plot saved to: C:\Users\91788\Downloads\ERA5 Data\Extracted\temporal_validation_plots\Temporal_Validation

## Final Validation: NASA-Based Downscaling vs. ERA5 Ground Truth (2021–2024)

This script performs **final validation** of previously trained downscaling models using **daily NASA data (2021–2024)** to generate hourly predictions, which are then compared to **ERA5 hourly ground truth** data. The trained models were developed on historical data from 2000–2018.

---

#### Workflow Overview

##### 1. **Configuration**

- **Input:**
  - NASA Daily Data → `NASA_Standardized_Minnesota_2021-2024.csv`
  - ERA5 Hourly Ground Truth → `ERA5_Test_2021-2024.csv`
- **Models:** Loaded from `/trained_models_temporal_holdout/`  
- **Output:** Final validation file → `FINAL_VALIDATION_NASA_vs_ERA5_2021-2024.csv`

---

##### 2. **Data & Model Loading**

- Loads daily NASA inputs and hourly ERA5 truth values
- Loads trained Linear Regression models (one per hour per variable)
- Applies two-stage wind models:
  - `wind_speed_ms_max` and `wind_speed_ms_std` predicted from `wind_speed_ms_mean`

---

##### 3. **Generate Hourly Predictions**

- Using daily predictors, the script reconstructs 24 hourly values for each variable
- Predictions are made for:
  - `air_temperature_k`
  - `wind_speed_ms`
  - `relative_humidity_percent`
- Radiation and precipitation variables are **excluded** from this final run (due to low performance in earlier stages)

---

##### 4. **Validation: NASA Predictions vs. ERA5**

- Merges predicted values with ERA5 hourly data
- Computes key error metrics:



In [11]:
import pandas as pd
import numpy as np
import os
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import joblib

# --- 1. CONFIGURATION ---
# Define the paths for your prepared data files and saved models.
ROOT_DATA_DIR = r"C:\Users\91788\Downloads\ERA5 Data\Extracted"
# Use the models trained on the older climate data (2000-2018)
MODEL_SAVE_DIR = os.path.join(ROOT_DATA_DIR, "trained_models_temporal_holdout")

# Input files for this final validation run
# Assumes this CSV contains data for multiple grid points including lat/lon columns
NASA_DAILY_INPUT_FILE = os.path.join(ROOT_DATA_DIR, "NASA_Standardized_Minnesota_2021-2024.csv")
ERA5_HOURLY_GROUND_TRUTH_FILE = os.path.join(ROOT_DATA_DIR, "ERA5_Test_2021-2024.csv")

# Output file for the final validation results
FINAL_VALIDATION_OUTPUT_FILE = os.path.join(ROOT_DATA_DIR, "FINAL_VALIDATION_NASA_vs_ERA5_2021-2024.csv")

# --- NEW: Grid Point Configuration ---
# Define the specific NASA grid point for Minneapolis to filter for.
# Using the coordinates confirmed as correct for the NASA grid.
TARGET_LAT = 45.125
TARGET_LON = 266.625


# --- 2. LOAD DATA AND TRAINED MODELS ---
print("--- Step 1: Loading data and trained models for FINAL validation ---")

# Load the multi-point, daily NASA data
try:
    # Important: We do not set index_col here yet, so we can filter by lat/lon first
    nasa_df_full = pd.read_csv(NASA_DAILY_INPUT_FILE, parse_dates=['time'])
    print(f"  Successfully loaded NASA daily data (2021-2024) with {len(nasa_df_full)} total records.")
except FileNotFoundError:
    print(f"Error: NASA data file not found at {NASA_DAILY_INPUT_FILE}.")
    exit()

# --- NEW: Filtering for the Minneapolis Grid ---
print(f"\n--- Filtering for Minneapolis Grid (Lat: {TARGET_LAT}, Lon: {TARGET_LON}) ---")
nasa_df = nasa_df_full[
    (nasa_df_full['lat'] == TARGET_LAT) & (nasa_df_full['lon'] == TARGET_LON)
].copy() # Use .copy() to avoid SettingWithCopyWarning

if nasa_df.empty:
    print(f"Error: No data found for the specified grid point. Please check your input file and coordinates.")
    exit()

# Set 'time' as the index now that filtering is complete
nasa_df.set_index('time', inplace=True)
print(f"  Filtering complete. Found {len(nasa_df)} records for the Minneapolis grid.")
# --- End of New Section ---


# Load the single-point, hourly ERA5 data (our ground truth)
try:
    era5_df = pd.read_csv(ERA5_HOURLY_GROUND_TRUTH_FILE, index_col='time', parse_dates=True)
    print("  Successfully loaded ERA5 hourly ground truth data (2021-2024).")
except FileNotFoundError:
    print(f"Error: ERA5 ground truth file not found at {ERA5_HOURLY_GROUND_TRUTH_FILE}.")
    exit()

# Load the library of trained models
try:
    print("  Loading pre-trained models (trained on 2000-2018)...")
    # This map uses the exact feature names the models were trained on
    predictor_map = {
        'air_temperature_k': ['air_temperature_k_mean', 'air_temperature_k_min', 'air_temperature_k_max'],
        'wind_speed_ms': ['wind_speed_ms_mean', 'wind_speed_ms_max', 'wind_speed_ms_std'],
        'relative_humidity_percent': ['relative_humidity_percent_mean']
        # We exclude radiation/precip models as the simple linear approach was ineffective
    }
    trained_models = {}
    for var_name in predictor_map.keys():
        model_path = os.path.join(MODEL_SAVE_DIR, f'models_{var_name}.pkl')
        trained_models[var_name] = joblib.load(model_path)
    
    trained_models['wind_max_from_mean'] = joblib.load(os.path.join(MODEL_SAVE_DIR, 'model_wind_max.pkl'))
    trained_models['wind_std_from_mean'] = joblib.load(os.path.join(MODEL_SAVE_DIR, 'model_wind_std.pkl'))
    print("  All models loaded successfully.")
except FileNotFoundError as e:
    print(f"Error loading model file: {e.filename}. Please ensure training was successful.")
    exit()


# --- 3. PREPARE NASA DATA FOR PREDICTION ---
print("\n--- Step 2: Preparing NASA data for prediction ---")

# Create the predictor DataFrame by renaming the NASA columns to match the training columns.
nasa_predictors_df = nasa_df.rename(columns={
    'tas': 'air_temperature_k_mean',
    'tasmin': 'air_temperature_k_min',
    'tasmax': 'air_temperature_k_max',
    'sfcWind': 'wind_speed_ms_mean',
    'hurs': 'relative_humidity_percent_mean'
})

# Apply two-stage model to generate wind characteristics
print("  Applying Stage-1 models to generate wind characteristics...")
X_wind_mean = nasa_predictors_df[['wind_speed_ms_mean']]
nasa_predictors_df['wind_speed_ms_max'] = trained_models['wind_max_from_mean'].predict(X_wind_mean)
nasa_predictors_df['wind_speed_ms_std'] = trained_models['wind_std_from_mean'].predict(X_wind_mean)
print("  NASA predictor data prepared.")


# --- 4. GENERATE HOURLY PREDICTIONS ---
print("\n--- Step 3: Generating hourly predictions from NASA daily data ---")
final_predictions = {}
for var_name, predictors in predictor_map.items():
    if not all(p in nasa_predictors_df.columns for p in predictors): continue
    
    print(f"  Predicting hourly values for: {var_name}...")
    hourly_preds_list = []
    X_predict = nasa_predictors_df[predictors]
    for hour in range(24):
        model = trained_models[var_name].get(hour)
        if model:
            preds = model.predict(X_predict)
            hourly_preds_list.append(pd.Series(preds, index=X_predict.index, name=hour))
    if hourly_preds_list:
        var_df_wide = pd.concat(hourly_preds_list, axis=1)
        var_stacked = var_df_wide.stack()
        var_stacked.index = var_stacked.index.map(lambda x: x[0] + pd.to_timedelta(x[1], unit='h'))
        final_predictions[f'predicted_{var_name}'] = var_stacked

predictions_df = pd.DataFrame(final_predictions)
print("--- Hourly predictions generated successfully. ---")


# --- 5. VALIDATE PREDICTIONS AND SAVE ---
print("\n--- Step 4: Validating predictions against ERA5 ground truth ---")
# Merge predictions with the actual hourly ERA5 data
validation_df = pd.merge(
    era5_df.rename(columns=lambda c: f"actual_{c}"),
    predictions_df,
    left_index=True,
    right_index=True,
    how="inner"
)

# Calculate and print final error metrics
print("  Final Validation Results (NASA-based Predictions vs. ERA5 Actuals for 2021-2024):")
for var_name in predictor_map.keys():
    actual_col, predicted_col = f'predicted_{var_name}', f'actual_{var_name}' # Corrected column order
    if actual_col in validation_df.columns and predicted_col in validation_df.columns:
        temp_compare_df = validation_df[[actual_col, predicted_col]].dropna()
        if not temp_compare_df.empty:
            mae = mean_absolute_error(temp_compare_df[actual_col], temp_compare_df[predicted_col])
            rmse = np.sqrt(mean_squared_error(temp_compare_df[actual_col], temp_compare_df[predicted_col]))
            r2 = r2_score(temp_compare_df[actual_col], temp_compare_df[predicted_col])
            
            print(f"    - {var_name}:")
            print(f"        Mean Absolute Error (MAE):    {mae:.4f}")
            print(f"        Root Mean Squared Error (RMSE): {rmse:.4f}")
            print(f"        R-squared (R²):               {r2:.4f}")

# Save the final validation results to a CSV file
print(f"\n--- Step 5: Saving final validation results to {FINAL_VALIDATION_OUTPUT_FILE} ---")
validation_df.to_csv(FINAL_VALIDATION_OUTPUT_FILE)
print("Save complete.")

--- Step 1: Loading data and trained models for FINAL validation ---
  Successfully loaded NASA daily data (2021-2024) with 1028544 total records.

--- Filtering for Minneapolis Grid (Lat: 45.125, Lon: 266.625) ---
  Filtering complete. Found 1461 records for the Minneapolis grid.
  Successfully loaded ERA5 hourly ground truth data (2021-2024).
  Loading pre-trained models (trained on 2000-2018)...
  All models loaded successfully.

--- Step 2: Preparing NASA data for prediction ---
  Applying Stage-1 models to generate wind characteristics...
  NASA predictor data prepared.

--- Step 3: Generating hourly predictions from NASA daily data ---
  Predicting hourly values for: air_temperature_k...
  Predicting hourly values for: wind_speed_ms...
  Predicting hourly values for: relative_humidity_percent...
--- Hourly predictions generated successfully. ---

--- Step 4: Validating predictions against ERA5 ground truth ---
  Final Validation Results (NASA-based Predictions vs. ERA5 Actuals fo

####  Summary: Why the Model Performed Poorly

The high errors and negative R² values are primarily due to **systematic bias between the two datasets** — ERA5 and NASA NEX-GDDP-CMIP6, which are fundamentally different in purpose and construction:

---

##### 1. Different Dataset Natures

- **ERA5 (Reanalysis - "Corrected Past")**:  
  Assimilates real-world observations (from satellites, stations, etc.) into a physical model to reconstruct the most accurate historical record.

- **NASA NEX-GDDP-CMIP6 (Climate Projection - "Simulated Past")**:  
  A statistically downscaled climate model based on future-focused simulations. Even historical runs are not constrained by real-time observations.

---

##### 2. Systematic Biases Between Datasets

- **Spatial Resolution Mismatch**:  
  NASA data is derived from coarser-resolution global models. Even with downscaling, it smooths over fine-grained, local effects that ERA5 captures (e.g., urban heat island, lakes).

- **Different Physical Models & Parametrizations**:  
  The core physics, assumptions, and parameter choices vary, introducing consistent biases (e.g., systematically warmer or windier).

- **Extreme Events Are Muted**:  
  NASA data tends to underrepresent the magnitude of extremes (e.g., strong wind gusts), leading to large prediction errors compared to ERA5.

---

##### 3. Linear Model Breakdown

- **Linear Regression Assumption Fails**:  
  The model assumes a stable, linear relationship (e.g., `ERA5 = m * NASA + c`) — but the real relationship is complex, biased, and not linear.

- **Negative R² for Wind & Humidity**:  
  This means the model performed *worse* than simply predicting the average — a strong indicator of structural model failure.

---

##### 4. Temperature's "Okay" R² is Misleading

- **R² = 0.41** looks decent, but it mostly captures the **daily cycle** (diurnal pattern), not true predictive accuracy.

- **MAE = 7.72°C** is still high, revealing major mismatches due to dataset bias — the model gets the *shape* of the day, but not the *actual values*.

---

#####  TL;DR

We're translating between two fundamentally different "languages" of climate data using a basic dictionary (linear regression). But these datasets differ in grammar, vocabulary, and purpose.

> **You need a smarter translator** — like anomaly correction or bias-aware models — that understands the context and systematic differences between datasets.
