# Climate Data Bias Correction & Downscaling Workflow

This plan outlines a structured approach to improving the accuracy of hourly climate predictions using bias correction and downscaling methods.

#### Phase 1: Data Preparation & Bias Correction (**Current Phase**)

> **Goal**: Create two corrected daily NASA datasets (2021-2024) using bias correction techniques, based on historical calibration data (2000-2014).


##### Task 1.1: Consolidate Historical Data

- **NASA**:  
  - Load and process `NASA_Standardized_Minnesota_2000-2014.csv`

- **ERA5**:  
  - From `ERA5_Train_2000-2020.csv`, extract daily summaries for 2000-2014 to align with NASA data.

##### Task 1.2: Implement Workflow A – *Delta Change Method*

- **Method**:  
  - Calculate historical monthly *differences* (for temperature) or *ratios* (for wind/precipitation) between ERA5 and NASA data (2000-2014).
  - Apply these deltas to raw NASA data from 2021-2024.

- **Output**:  
  - `NASA_Corrected_Delta_2021-2024.csv`

##### Task 1.3: Implement Workflow B – *Quantile Mapping Method*

- **Method**:  
  - Build a quantile mapping function per variable (temperature, wind, etc.) to align NASA data distributions with ERA5 (2000-2014).
  - Apply the mapping functions to the 2021-2024 NASA data.

- **Output**:  
  - `NASA_Corrected_Quantile_2021-2024.csv`
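
The quantile mapping step in Task 1.3 could look roughly like the sketch below. It is a minimal empirical version and makes several assumptions: the function names `fit_quantile_map` and `apply_quantile_map` are hypothetical, the arrays are synthetic stand-ins for the NASA and ERA5 calibration data, and a single mapping is fitted per variable with no split by month or season.

```python
import numpy as np

def fit_quantile_map(model_hist, ref_hist, n_quantiles=101):
    """Return matching quantile arrays for the model and reference samples."""
    probs = np.linspace(0.0, 1.0, n_quantiles)
    model_q = np.quantile(model_hist, probs)
    ref_q = np.quantile(ref_hist, probs)
    return model_q, ref_q

def apply_quantile_map(values, model_q, ref_q):
    """Map values from the model distribution onto the reference distribution."""
    # np.interp clamps values outside the calibration range to the end quantiles.
    return np.interp(values, model_q, ref_q)

# Synthetic stand-ins (assumption): a reference series and a biased model series.
rng = np.random.default_rng(0)
ref_hist = rng.normal(loc=10.0, scale=2.0, size=5000)    # plays the role of ERA5
model_hist = rng.normal(loc=12.0, scale=3.0, size=5000)  # plays the role of NASA

model_q, ref_q = fit_quantile_map(model_hist, ref_hist)
corrected = apply_quantile_map(model_hist, model_q, ref_q)

print("reference mean/std:", ref_hist.mean(), ref_hist.std())
print("corrected mean/std:", corrected.mean(), corrected.std())
```

Unlike a single monthly delta, this mapping adjusts the whole distribution, so the spread of the corrected series also moves toward the reference. Values beyond the calibration range are clamped here; a real workflow may prefer an explicit extrapolation rule.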

---

#### Phase 2: Downscaling & Prediction

> **Goal**: Test whether the corrected input data improves the performance of the existing trained models.

##### Task 2.1: Run Downscaling on Corrected Data

- Run the notebook `DownScaling.ipynb` twice:
  - **Run 1**: Use `NASA_Corrected_Delta_2021-2024.csv`
  - **Run 2**: Use `NASA_Corrected_Quantile_2021-2024.csv`

- **Output**:  
  - Two sets of hourly predictions for 2021-2024.


#### Phase 3: Evaluation & Decision

> **Goal**: Evaluate model performance and decide which of the two correction methods works better.


##### Task 3.1: Compare Against Ground Truth

- Compare both sets of hourly predictions with actual hourly ERA5 data from `ERA5_Test_2021-2024.csv`.
- Calculate standard metrics:
  - **MAE**
  - **RMSE**
  - **R²**
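
For reference, the three metrics can be written out directly with NumPy. This is a small sketch with made-up numbers; the helper names `mae_score`, `rmse_score`, and `r2_value` are illustrative and are not part of the project code.

```python
import numpy as np

def mae_score(actual, predicted):
    """Mean absolute error: average size of the errors."""
    return float(np.mean(np.abs(actual - predicted)))

def rmse_score(actual, predicted):
    """Root mean squared error: penalizes large errors more strongly."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def r2_value(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - np.mean(actual)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up values: three perfect predictions and one that is off by 2.
actual = np.array([1.0, 2.0, 3.0, 4.0])
predicted = np.array([1.0, 2.0, 3.0, 6.0])

print(f"MAE={mae_score(actual, predicted):.2f} "
      f"RMSE={rmse_score(actual, predicted):.2f} "
      f"R2={r2_value(actual, predicted):.2f}")
# prints: MAE=0.50 RMSE=1.00 R2=0.20
```

Note that R² is not bounded below by zero: it becomes negative whenever the predictions fit worse than simply predicting the mean of the actual values.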


# Bias Correction with the Delta Change Method

## Context & The Problem

Our initial downscaling model, though structurally robust, produced **hourly predictions with large errors**, in particular high **Mean Absolute Error (MAE)** and **Root Mean Squared Error (RMSE)**, when compared against **ERA5 ground-truth data**.

This discrepancy pointed to a **systematic bias** between:

- The **daily outputs of the NASA climate model**, and
- The **daily summaries derived from the ERA5 reanalysis**, which this project treats as ground truth

> To improve the accuracy of the final **hourly downscaled predictions**, we must **first correct the daily input data**.


## Proposed Solution: The Delta Change Method

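The **Delta Change Method** is a simple, widely used **bias correction technique** in climate data analysis. It applies historical bias adjustments to future model outputs in a **month-by-month** manner. When the monthly mean bias is applied directly to model output, as it is here, the approach is also commonly called *linear scaling*.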


### What is the Delta Change Method?

The method assumes that the **average model error** (or "delta") for a given calendar month, observed during a **historical calibration period**, will remain **consistent in the future**.

Instead of trying to re-learn a complex statistical relationship, this method simply **corrects future projections** using **historical monthly averages** of model bias.


### How Does It Work?

The Delta Method is applied in two steps:

#### 1. **Learn (Calibration Phase)**

- Use a historical period (e.g., **2000–2020**)  
- For each calendar month, compute the **mean bias of NASA relative to ERA5**:
  - **Additive bias** for variables like temperature  
    (e.g., mean NASA_temp - mean ERA5_temp)
  - **Multiplicative bias** for variables like wind speed  
    (e.g., mean NASA_wind / mean ERA5_wind)

#### 2. **Correct (Application Phase)**

- Remove the monthly bias from future model outputs (e.g., **2021–2024**): subtract an additive bias, divide by a multiplicative one
- Example corrections:
  - If January temperatures were historically **+2°C too warm**, subtract 2°C from all **future January temperatures**
  - If July wind speeds were **10% too high** (ratio 1.10), divide all future July wind speeds by **1.10**, which is the same as multiplying by about **0.91**
- The code in this notebook stores the inverse quantities (ERA5 - NASA and ERA5 / NASA), so it *adds* and *multiplies* instead; the result is the same

> This correction brings the **monthly mean climatology of future NASA data** more in line with **ERA5**; variance and extremes are not explicitly corrected.
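
The two steps can be traced on a toy example. The sketch below uses invented values and hypothetical column names (`temp`, `wind`), with a model that runs warm and 10% too windy. It follows the convention of the text above: bias is model minus reference for temperature and model divided by reference for wind.

```python
import pandas as pd

# Toy calibration data: two Januaries and two Julys (all values are made up).
hist_index = pd.to_datetime(["2000-01-15", "2001-01-15", "2000-07-15", "2001-07-15"])
model_hist = pd.DataFrame({"temp": [272.0, 274.0, 296.0, 298.0],
                           "wind": [5.5, 5.5, 3.3, 3.3]}, index=hist_index)
ref_hist = pd.DataFrame({"temp": [270.0, 272.0, 295.0, 297.0],
                         "wind": [5.0, 5.0, 3.0, 3.0]}, index=hist_index)

# Learn: one bias value per calendar month.
model_mean = model_hist.groupby(model_hist.index.month).mean()
ref_mean = ref_hist.groupby(ref_hist.index.month).mean()
temp_bias = model_mean["temp"] - ref_mean["temp"]  # additive bias in K
wind_bias = model_mean["wind"] / ref_mean["wind"]  # multiplicative bias (ratio)

# Correct: remove the bias of the matching calendar month from new data.
future = pd.DataFrame({"temp": [275.0, 300.0], "wind": [6.6, 4.4]},
                      index=pd.to_datetime(["2021-01-10", "2021-07-10"]))
months = pd.Series(future.index.month, index=future.index)
corrected = future.copy()
corrected["temp"] = future["temp"] - months.map(temp_bias)
corrected["wind"] = future["wind"] / months.map(wind_bias)

print(corrected)
```

In this toy case the January temperature is lowered by 2 K, the July temperature by 1 K, and both wind values are divided by 1.1.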

---

#### Goal of This Notebook

The purpose of this notebook is to:

- Apply the **Delta Change Method** to the **daily NASA input data**
- Use the corrected daily data as input to our **hourly downscaling models**
- Evaluate whether this bias correction step improves the accuracy of the final **hourly predictions** when compared to results from uncorrected data

> Correcting the daily inputs is intended to keep the **downscaled data physically plausible and statistically closer** to the ERA5 data it is meant to replicate.

In [31]:
import pandas as pd
import numpy as np
import os

# --- 1. CONFIGURATION: Define All File Paths and Settings ---
print("--- Initializing Configuration ---")
ROOT_DATA_DIR = r"C:\Users\91788\Downloads\ERA5 Data\Extracted" # IMPORTANT: Use your actual path

# Target coordinates for Minneapolis
TARGET_LAT = 45.125
TARGET_LON = 266.625

# Input files for calculating the bias (2000-2020)
HISTORICAL_NASA_FILE_P1 = os.path.join(ROOT_DATA_DIR, "NASA_Standardized_Minnesota_2000-2014.csv")
HISTORICAL_NASA_FILE_P2 = os.path.join(ROOT_DATA_DIR, "NASA_Standardized_Minnesota_2015-2020.csv")
HISTORICAL_ERA5_FILE = os.path.join(ROOT_DATA_DIR, "ERA5_Train_2000-2020.csv")

# Input file to be corrected (2021-2024)
VALIDATION_NASA_FILE = os.path.join(ROOT_DATA_DIR, "NASA_Standardized_Minnesota_2021-2024.csv")

# Final output file
OUTPUT_CORRECTED_FILE = os.path.join(ROOT_DATA_DIR, "NASA_Corrected_Delta_Minneapolis_2021-2024.csv") # Renamed output for clarity

# Define the full calibration period
CALIBRATION_START = '2000-01-01'
CALIBRATION_END = '2020-12-31'


# --- 2. DATA LOADING AND FILTERING ---
print("\n--- Step 2: Loading and Filtering all necessary data files for Minneapolis ---")

# Define a helper function to load and filter NASA data
def load_and_filter_nasa(file_path, lat, lon):
    df = pd.read_csv(file_path, parse_dates=['time'])
    # Filter for the specific grid point
    filtered_df = df[
        (np.isclose(df['lat'], lat)) &
        (np.isclose(df['lon'], lon))
    ].copy()
    # Drop the now-redundant lat/lon columns and set time index
    return filtered_df.drop(columns=['lat', 'lon']).set_index('time')

# Load and filter historical NASA data
nasa_hist_p1_df = load_and_filter_nasa(HISTORICAL_NASA_FILE_P1, TARGET_LAT, TARGET_LON)
nasa_hist_p2_df = load_and_filter_nasa(HISTORICAL_NASA_FILE_P2, TARGET_LAT, TARGET_LON)
nasa_hist_df = pd.concat([nasa_hist_p1_df, nasa_hist_p2_df])
print(f"  Successfully loaded and filtered historical NASA data for Minneapolis ({nasa_hist_df.index.min().year}-{nasa_hist_df.index.max().year}).")

# Load and filter validation NASA data
nasa_val_df = load_and_filter_nasa(VALIDATION_NASA_FILE, TARGET_LAT, TARGET_LON)
print("  Successfully loaded and filtered validation NASA data for Minneapolis.")

# Load the ERA5 file (it's already for Minneapolis)
era5_hist_df_hourly = pd.read_csv(HISTORICAL_ERA5_FILE, parse_dates=['time']).set_index('time')
print("  All data loading complete.")


# --- 3. PREPARE CALIBRATION DATA ---
print(f"\n--- Step 3: Preparing daily summaries for the calibration period ({CALIBRATION_START} to {CALIBRATION_END}) ---")

# Resample ERA5 hourly data to daily summaries
era5_daily_summary = era5_hist_df_hourly.resample('D').agg(
    air_temperature_k=('air_temperature_k', 'mean'),
    tasmin=('air_temperature_k', 'min'),
    tasmax=('air_temperature_k', 'max'),
    wind_speed_ms=('wind_speed_ms', 'mean'),
    relative_humidity_percent=('relative_humidity_percent', 'mean'),
    solar_radiation_w_m2=('solar_radiation_w_m2', 'mean'),
    thermal_radiation_w_m2=('thermal_radiation_w_m2', 'mean'),
    precip_hourly_mm=('precip_hourly_mm', 'sum')
).loc[CALIBRATION_START:CALIBRATION_END]

# Map NASA column names onto the ERA5 summary column names.
# Note: after the daily 'sum' above, 'precip_hourly_mm' holds the daily total, so it is comparable to 'precip_daily_mm'.
COLUMN_MAP = {
    'tas': 'air_temperature_k',
    'sfcWind': 'wind_speed_ms',
    'hurs': 'relative_humidity_percent',
    'rsds': 'solar_radiation_w_m2',
    'rlds': 'thermal_radiation_w_m2',
    'precip_daily_mm': 'precip_hourly_mm'
}
nasa_hist_df_renamed = nasa_hist_df.rename(columns=COLUMN_MAP)
print("  Daily summaries created and columns aligned.")


# --- 4. CALCULATE MONTHLY DELTAS ---
print("\n--- Step 4: Calculating historical monthly deltas based on 2000-2020 data ---")

mean_monthly_era5 = era5_daily_summary.groupby(era5_daily_summary.index.month).mean()
mean_monthly_nasa = nasa_hist_df_renamed.groupby(nasa_hist_df_renamed.index.month).mean()

ADDITIVE_VARS = ['air_temperature_k', 'tasmin', 'tasmax']
MULTIPLICATIVE_VARS = ['wind_speed_ms', 'relative_humidity_percent', 'solar_radiation_w_m2', 'thermal_radiation_w_m2', 'precip_hourly_mm']

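# Additive delta = ERA5 - NASA (added to NASA values in Step 5)
additive_deltas = mean_monthly_era5[ADDITIVE_VARS] - mean_monthly_nasa[ADDITIVE_VARS]
# Multiplicative delta = ERA5 / NASA (NASA values are multiplied by it in Step 5)
multiplicative_deltas = mean_monthly_era5[MULTIPLICATIVE_VARS] / mean_monthly_nasa[MULTIPLICATIVE_VARS]
# A zero NASA mean gives inf or NaN; fall back to a neutral factor of 1 (no correction)
multiplicative_deltas = multiplicative_deltas.replace([np.inf, -np.inf], 1).fillna(1)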

deltas_df = pd.concat([additive_deltas, multiplicative_deltas], axis=1)
deltas_df.index.name = 'month'
print("  Monthly deltas calculated successfully:")
print(deltas_df)


# --- 5. APPLY DELTAS TO THE VALIDATION DATASET ---
print("\n--- Step 5: Applying deltas to the Minneapolis validation dataset ---")
nasa_corrected_df = nasa_val_df.copy()
nasa_corrected_df['month'] = nasa_corrected_df.index.month

# Attach each row's monthly delta as a temporary '<variable>_delta' column
for var in ADDITIVE_VARS + MULTIPLICATIVE_VARS:
    nasa_corrected_df[var + '_delta'] = nasa_corrected_df['month'].map(deltas_df[var])

nasa_corrected_df['tas'] = nasa_corrected_df['tas'] + nasa_corrected_df['air_temperature_k_delta']
nasa_corrected_df['tasmin'] = nasa_corrected_df['tasmin'] + nasa_corrected_df['tasmin_delta']
nasa_corrected_df['tasmax'] = nasa_corrected_df['tasmax'] + nasa_corrected_df['tasmax_delta']
nasa_corrected_df['sfcWind'] = nasa_corrected_df['sfcWind'] * nasa_corrected_df['wind_speed_ms_delta']
nasa_corrected_df['hurs'] = nasa_corrected_df['hurs'] * nasa_corrected_df['relative_humidity_percent_delta']
nasa_corrected_df['rsds'] = nasa_corrected_df['rsds'] * nasa_corrected_df['solar_radiation_w_m2_delta']
nasa_corrected_df['rlds'] = nasa_corrected_df['rlds'] * nasa_corrected_df['thermal_radiation_w_m2_delta']
nasa_corrected_df['precip_daily_mm'] = nasa_corrected_df['precip_daily_mm'] * nasa_corrected_df['precip_hourly_mm_delta']

nasa_corrected_df['hurs'] = nasa_corrected_df['hurs'].clip(0, 100)
nasa_corrected_df = nasa_corrected_df[nasa_val_df.columns]
print("  Deltas applied successfully.")


# --- 6. SAVE THE CORRECTED FILE ---
print(f"\n--- Step 6: Saving bias-corrected data to {OUTPUT_CORRECTED_FILE} ---")
nasa_corrected_df.to_csv(OUTPUT_CORRECTED_FILE)
print(f"\nSave complete. The file '{os.path.basename(OUTPUT_CORRECTED_FILE)}' is ready.")

--- Initializing Configuration ---

--- Step 2: Loading and Filtering all necessary data files for Minneapolis ---
  Successfully loaded and filtered historical NASA data for Minneapolis (2000-2020).
  Successfully loaded and filtered validation NASA data for Minneapolis.
  All data loading complete.

--- Step 3: Preparing daily summaries for the calibration period (2000-01-01 to 2020-12-31) ---
  Daily summaries created and columns aligned.

--- Step 4: Calculating historical monthly deltas based on 2000-2020 data ---
  Monthly deltas calculated successfully:
       air_temperature_k    tasmin    tasmax  wind_speed_ms  \
month                                                         
1              -0.900845  0.578580 -1.459750       0.965596   
2              -0.262087  0.899105 -0.363040       0.980524   
3               1.198082  2.416696  1.031728       0.979971   
4              -0.697369  0.822795 -0.945179       1.022300   
5              -0.214768  1.407489 -0.772635       1.05

# Evaluation 

In [33]:
import pandas as pd
import numpy as np
import os
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import joblib

# --- 1. CONFIGURATION ---
print("--- Step 1: Configuring file paths for evaluation ---")
# Use the exact filenames from your directory
ROOT_DATA_DIR = r"C:\Users\91788\Downloads\ERA5 Data\Extracted" # IMPORTANT: Use your actual path

# *** KEY CHANGE: Use the Minneapolis-specific bias-corrected file as the input ***
NASA_DAILY_INPUT_FILE = os.path.join(ROOT_DATA_DIR, "NASA_Corrected_Delta_Minneapolis_2021-2024.csv")

# Other required files
ERA5_HOURLY_GROUND_TRUTH_FILE = os.path.join(ROOT_DATA_DIR, "ERA5_Test_2021-2024.csv")
MODEL_SAVE_DIR = os.path.join(ROOT_DATA_DIR, "trained_models_temporal_holdout")

# Define the output file for this specific validation run
FINAL_VALIDATION_OUTPUT_FILE = os.path.join(ROOT_DATA_DIR, "DELTA_METHOD_VALIDATION_NASA_vs_ERA5_2021-2024.csv")


# --- 2. LOAD DATA AND TRAINED MODELS ---
print("\n--- Step 2: Loading data and pre-trained models ---")

# Load the bias-corrected daily NASA data
try:
    nasa_df = pd.read_csv(NASA_DAILY_INPUT_FILE, parse_dates=['time']).set_index('time')
    print("  Successfully loaded bias-corrected NASA daily data for Minneapolis.")
except FileNotFoundError:
    print(f"Error: Could not find the input file: {NASA_DAILY_INPUT_FILE}")
    print("Please ensure you have run the Delta Change correction script successfully.")
    raise  # Re-raise to stop the cell with a traceback (exit() would end the whole session)

# Load the hourly ERA5 ground truth data
try:
    era5_df = pd.read_csv(ERA5_HOURLY_GROUND_TRUTH_FILE, index_col='time', parse_dates=True)
    print("  Successfully loaded ERA5 hourly ground truth data.")
except FileNotFoundError:
    print(f"Error: ERA5 ground truth file not found at {ERA5_HOURLY_GROUND_TRUTH_FILE}.")
    raise  # Re-raise to stop the cell with a traceback (exit() would end the whole session)

# Load the library of trained models
try:
    print("  Loading pre-trained models (trained on 2000-2018 ERA5)...")
    predictor_map = {
        'air_temperature_k': ['air_temperature_k_mean', 'air_temperature_k_min', 'air_temperature_k_max'],
        'wind_speed_ms': ['wind_speed_ms_mean', 'wind_speed_ms_max', 'wind_speed_ms_std'],
        'relative_humidity_percent': ['relative_humidity_percent_mean']
    }
    trained_models = {}
    for var_name in predictor_map.keys():
        model_path = os.path.join(MODEL_SAVE_DIR, f'models_{var_name}.pkl')
        trained_models[var_name] = joblib.load(model_path)
    
    trained_models['wind_max_from_mean'] = joblib.load(os.path.join(MODEL_SAVE_DIR, 'model_wind_max.pkl'))
    trained_models['wind_std_from_mean'] = joblib.load(os.path.join(MODEL_SAVE_DIR, 'model_wind_std.pkl'))
    print("  All models loaded successfully.")
except FileNotFoundError as e:
    print(f"Error loading model file: {e.filename}. Please ensure training was successful.")
    raise  # Re-raise to stop the cell with a traceback (exit() would end the whole session)


# --- 3. PREPARE NASA DATA FOR PREDICTION ---
print("\n--- Step 3: Preparing bias-corrected NASA data for prediction ---")

# Rename the NASA columns to match the names the models were trained on
nasa_predictors_df = nasa_df.rename(columns={
    'tas': 'air_temperature_k_mean',
    'tasmin': 'air_temperature_k_min',
    'tasmax': 'air_temperature_k_max',
    'sfcWind': 'wind_speed_ms_mean',
    'hurs': 'relative_humidity_percent_mean'
})

# Apply the two-stage model to generate wind characteristics
print("  Applying Stage-1 models to generate wind characteristics...")
X_wind_mean = nasa_predictors_df[['wind_speed_ms_mean']]
nasa_predictors_df['wind_speed_ms_max'] = trained_models['wind_max_from_mean'].predict(X_wind_mean)
nasa_predictors_df['wind_speed_ms_std'] = trained_models['wind_std_from_mean'].predict(X_wind_mean)
print("  Predictor data prepared.")


# --- 4. GENERATE HOURLY PREDICTIONS ---
print("\n--- Step 4: Generating hourly predictions from corrected daily data ---")
final_predictions = {}
for var_name, predictors in predictor_map.items():
    print(f"  Predicting hourly values for: {var_name}...")
    hourly_preds_list = []
    # Ensure all required predictors exist
    if not all(p in nasa_predictors_df.columns for p in predictors):
        print(f"    Skipping {var_name}, missing one or more predictors: {predictors}")
        continue
        
    X_predict = nasa_predictors_df[predictors]
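    for hour in range(24):
        model = trained_models[var_name].get(hour)
        # Compare against None explicitly: some estimators define __len__, so truthiness is unreliable
        if model is not None:
            preds = model.predict(X_predict)
            hourly_preds_list.append(pd.Series(preds, index=X_predict.index, name=hour))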
            
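    if hourly_preds_list:
        # Wide frame: one row per day, one column per modeled hour
        var_df_wide = pd.concat(hourly_preds_list, axis=1)
        # Stack to long form, then turn each (date, hour) pair into an hourly timestamp
        var_stacked = var_df_wide.stack()
        var_stacked.index = var_stacked.index.map(lambda x: x[0] + pd.to_timedelta(x[1], unit='h'))
        final_predictions[f'predicted_{var_name}'] = var_stacked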

predictions_df = pd.DataFrame(final_predictions)
print("--- Hourly predictions generated successfully. ---")


# --- 5. VALIDATE PREDICTIONS AGAINST ERA5 GROUND TRUTH ---
print("\n--- Step 5: Validating predictions against ERA5 ground truth ---")
# Merge predictions with the actual hourly ERA5 data
validation_df = pd.merge(
    era5_df.rename(columns=lambda c: f"actual_{c}"),
    predictions_df,
    left_index=True,
    right_index=True,
    how="inner"
)

# Calculate and print final error metrics
print("\n  Final Validation Results (DELTA METHOD vs. ERA5 Actuals for 2021-2024):")
for var_name in predictor_map.keys():
    actual_col = f'actual_{var_name}'
    predicted_col = f'predicted_{var_name}'
    if actual_col in validation_df.columns and predicted_col in validation_df.columns:
        temp_compare_df = validation_df[[actual_col, predicted_col]].dropna()
        if not temp_compare_df.empty:
            mae = mean_absolute_error(temp_compare_df[actual_col], temp_compare_df[predicted_col])
            rmse = np.sqrt(mean_squared_error(temp_compare_df[actual_col], temp_compare_df[predicted_col]))
            r2 = r2_score(temp_compare_df[actual_col], temp_compare_df[predicted_col])
            
            print(f"    - {var_name}:")
            print(f"        Mean Absolute Error (MAE):    {mae:.4f}")
            print(f"        Root Mean Squared Error (RMSE): {rmse:.4f}")
            print(f"        R-squared (R²):               {r2:.4f}")

# --- 6. SAVE THE FINAL VALIDATION RESULTS ---
print(f"\n--- Step 6: Saving final validation results to {FINAL_VALIDATION_OUTPUT_FILE} ---")
validation_df.to_csv(FINAL_VALIDATION_OUTPUT_FILE)
print("Save complete.")

--- Step 1: Configuring file paths for evaluation ---

--- Step 2: Loading data and pre-trained models ---
  Successfully loaded bias-corrected NASA daily data for Minneapolis.
  Successfully loaded ERA5 hourly ground truth data.
  Loading pre-trained models (trained on 2000-2018 ERA5)...
  All models loaded successfully.

--- Step 3: Preparing bias-corrected NASA data for prediction ---
  Applying Stage-1 models to generate wind characteristics...
  Predictor data prepared.

--- Step 4: Generating hourly predictions from corrected daily data ---
  Predicting hourly values for: air_temperature_k...
  Predicting hourly values for: wind_speed_ms...
  Predicting hourly values for: relative_humidity_percent...
--- Hourly predictions generated successfully. ---

--- Step 5: Validating predictions against ERA5 ground truth ---

  Final Validation Results (DELTA METHOD vs. ERA5 Actuals for 2021-2024):
    - air_temperature_k:
        Mean Absolute Error (MAE):    7.2459
        Root Mean Squa