### NASA NEX-GDDP Data Processing Script (Minnesota-Focused)

**Source data (S3 bucket):** https://nex-gddp-cmip6.s3.us-west-2.amazonaws.com/index.html#NEX-GDDP-CMIP6/CNRM-CM6-1/

This Python script converts daily NASA NEX-GDDP-CMIP6 climate data, stored as **NetCDF files on AWS S3**, into a standardized tabular dataset for **Minnesota**. It selects variables, filters spatially, converts units, and writes a combined CSV file ready for modeling or analysis.

---

#### Key Configuration

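- **Bounding Box:** Latitude 43.5–49.0°N, longitude 265.0–273.0°E (0–360 convention), covering the Minnesota region  
- **Model / Scenario:** CNRM-CM6-1, `ssp126`, member `r1i1p1f2`  
- **Years Processed:** 2015–2020 (as set in `SELECTED_YEARS`)  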
- **Variables Extracted:**
  - `hurs` – Relative humidity  
  - `pr` – Precipitation  
  - `rlds`, `rsds` – Radiation (longwave and shortwave)  
  - `sfcWind` – Surface wind speed  
  - `tas`, `tasmax`, `tasmin` – Temperature metrics  

---

#### Core Logic & Workflow

#####  1. Data Fetching from S3
- Connects anonymously to the **NASA NEX-GDDP-CMIP6** bucket on S3 via `fsspec`
- Downloads one NetCDF file of daily data per variable and year to a temporary local file
- Opens each file lazily with `xarray` (`engine='netcdf4'`, `chunks='auto'`)
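
The path layout behind the fetch step can be sketched as follows. This is a minimal illustration that assumes the bucket layout and file naming pattern used by the script itself; `build_s3_path` is a hypothetical helper name, not part of any library, and it performs no network access.

```python
# Hypothetical helper: builds the S3 URL for one variable and one year.
# The layout mirrors the pattern used in this script (model CNRM-CM6-1,
# member r1i1p1f2, grid label gr); other models may use different labels.
BUCKET_PREFIX = "s3://nex-gddp-cmip6/NEX-GDDP-CMIP6"


def build_s3_path(var_name, year, scenario="ssp126",
                  model="CNRM-CM6-1", member="r1i1p1f2", grid="gr"):
    """Return the s3:// URL of the yearly file that holds daily data."""
    filename = f"{var_name}_day_{model}_{scenario}_{member}_{grid}_{year}.nc"
    return f"{BUCKET_PREFIX}/{model}/{scenario}/{member}/{var_name}/{filename}"


example_path = build_s3_path("pr", 2015)
print(example_path)
```

Keeping the scenario, model, and member as parameters makes it easier to switch between `historical` and `ssp126` without editing the path template.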

#####  2. Concatenation & Merging
- Combines all years per variable into a single `xarray.Dataset`
- Merges all variables into one comprehensive dataset
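
The two-stage combination (concatenate years, then merge variables) can be illustrated on synthetic data. The sketch below assumes tiny made-up grids and values; `make_year` is a hypothetical helper used only to fabricate inputs for the demonstration.

```python
import numpy as np
import pandas as pd
import xarray as xr


def make_year(var_name, year, value):
    """Build a tiny synthetic dataset: 3 daily steps on a 2x2 grid."""
    time = pd.date_range(f"{year}-01-01", periods=3, freq="D")
    data = np.full((3, 2, 2), float(value))
    return xr.Dataset(
        {var_name: (("time", "lat", "lon"), data)},
        coords={"time": time, "lat": [45.0, 46.0], "lon": [266.0, 267.0]},
    )


# Stage 1: concatenate the years of each variable along the time dimension.
tas_all = xr.concat([make_year("tas", 2015, 280.0),
                     make_year("tas", 2016, 281.0)], dim="time")
pr_all = xr.concat([make_year("pr", 2015, 1e-5),
                    make_year("pr", 2016, 2e-5)], dim="time")

# Stage 2: merge the per-variable datasets into one dataset on shared coords.
combined = xr.merge([tas_all, pr_all])
print(combined)
```

Concatenating per variable first keeps each merge input aligned on the same time axis, so the merge only has to line up variables, not years.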

#####  3. Spatial Filtering (Minnesota)
- Applies latitude & longitude masks to isolate Minnesota
- Converts spatial data to tabular format with `stack` + `to_dataframe()`
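
The bounding-box filter can be tried on a small synthetic grid. This is a sketch under stated assumptions: the 1-degree grid and constant values are invented for the example, and it calls `to_dataframe()` directly without the `stack` step, which is a simplification rather than the script's exact route.

```python
import numpy as np
import pandas as pd
import xarray as xr

BOUNDS = {"min_lat": 43.5, "max_lat": 49.0, "min_lon": 265.0, "max_lon": 273.0}

# Synthetic 1-degree grid that extends beyond the bounding box.
lat = np.arange(42.0, 51.0, 1.0)    # 42 .. 50
lon = np.arange(263.0, 276.0, 1.0)  # 263 .. 275
time = pd.date_range("2015-01-01", periods=2, freq="D")
ds = xr.Dataset(
    {"tas": (("time", "lat", "lon"), np.full((2, lat.size, lon.size), 280.0))},
    coords={"time": time, "lat": lat, "lon": lon},
)

# Boolean masks on the 1-D coordinates; their product is a 2-D box mask.
lat_mask = (ds.lat >= BOUNDS["min_lat"]) & (ds.lat <= BOUNDS["max_lat"])
lon_mask = (ds.lon >= BOUNDS["min_lon"]) & (ds.lon <= BOUNDS["max_lon"])
boxed = ds.where(lat_mask & lon_mask, drop=True)

# One row per (time, lat, lon) combination, coordinates as ordinary columns.
df = boxed.to_dataframe().reset_index()
print(df.shape)
```

Because the mask is rectangular, `drop=True` trims the grid to the box without leaving NaN-filled cells inside it.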

#####  4. Unit Conversion & Cleaning
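- Converts `pr` → `precip_daily_mm` (kg/m²/s → mm/day, multiplying by 86,400 s/day)
- Leaves the other variables in their source units (K, m/s, %, W/m²)
- Drops rows in which every variable is null and resets the index for clean output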
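
The precipitation conversion rests on a simple identity: 1 kg of water spread over 1 m² forms a layer 1 mm deep, so a flux in kg/m²/s equals mm/s, and multiplying by the 86,400 seconds in a day gives mm/day. A minimal sketch on made-up values (the column contents are assumptions for illustration):

```python
import numpy as np
import pandas as pd

SECONDS_PER_DAY = 86_400

df = pd.DataFrame({
    "pr": [1.0e-4, 0.0, np.nan],    # precipitation flux, kg m-2 s-1
    "tas": [280.0, 275.5, np.nan],  # air temperature, K
})

# Drop rows in which every listed variable is missing, then renumber.
df = df.dropna(subset=["pr", "tas"], how="all").reset_index(drop=True)

# kg m-2 s-1 equals mm/s for liquid water, so scale by seconds per day.
df["precip_daily_mm"] = df["pr"] * SECONDS_PER_DAY
print(df)
```

Using `how="all"` keeps rows where only some variables are missing, which matters if one variable has gaps the others do not.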

#####  5. Output
- Saves the processed data as `NASA_Standardized_Minnesota_<first year>-<last year>.csv`
- Writes the file to a user-defined directory (set via `ROOT_DATA_DIR`)
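
The output naming can be sketched as below. The example assumes a one-row placeholder DataFrame and uses a temporary directory in place of `ROOT_DATA_DIR`, so it can run anywhere without touching a real data folder.

```python
import os
import tempfile

import pandas as pd

years = ["2015", "2016", "2017", "2018", "2019", "2020"]
df = pd.DataFrame({"time": ["2015-01-01"], "lat": [45.0], "lon": [266.0],
                   "precip_daily_mm": [8.64]})

# A temporary directory stands in for ROOT_DATA_DIR in this sketch.
with tempfile.TemporaryDirectory() as root_dir:
    filename = f"NASA_Standardized_Minnesota_{min(years)}-{max(years)}.csv"
    output_path = os.path.join(root_dir, filename)
    df.to_csv(output_path, index=False)
    written_ok = os.path.exists(output_path)
    round_trip = pd.read_csv(output_path)

print(filename, written_ok)
```

Taking `min` and `max` of four-digit year strings works because they sort lexicographically in the same order as the numbers they represent.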

##### 6. Cleanup
- Releases file handles, deletes temporary NetCDF files, and runs garbage collection
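
One way to make the cleanup robust to mid-run failures is a `try`/`finally` block around the processing. This is an alternative sketch, not the script's own structure (which cleans up at the end of the run); the empty temporary files stand in for downloaded NetCDFs.

```python
import os
import tempfile

temp_files = []  # paths of temporary files created during the run

try:
    for _ in range(3):
        fd, path = tempfile.mkstemp(suffix=".nc")
        os.close(fd)  # close the OS-level handle; only the path is kept
        temp_files.append(path)
    # ... processing that reads the temporary files would go here ...
finally:
    # Runs even if processing raised, so files are not left behind.
    leftovers = []
    for path in temp_files:
        try:
            os.remove(path)
        except OSError as exc:
            leftovers.append((path, exc))

print(f"removed {len(temp_files) - len(leftovers)} of {len(temp_files)} files")
```

On Windows, deletion fails while a file is still open, so any dataset opened from a temporary file should be closed before this step.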

---

##### Dependencies

- Third-party: `xarray`, `fsspec` (with `s3fs` for `s3://` URLs), `numpy`, `pandas`, `dask`, `netCDF4`
- Standard library: `tempfile`, `os`, `gc`, `time`

---

##### Notes

- Longitude in the NEX-GDDP-CMIP6 files is stored in **0–360** format; the Minnesota bounding box spans `265–273`
- The script does not itself handle **forecast reset bugs**; discrete hourly transformation logic would have to be added if it is extended to hourly data
- All temporary NetCDFs are deleted at the end of execution to manage disk space
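
The longitude convention in the first note can be checked with a short conversion. The function names below are hypothetical helpers for illustration; the edge values are the bounding-box limits used in this script.

```python
def to_0_360(lon_deg):
    """Convert a longitude from the -180..180 convention to 0..360."""
    return lon_deg % 360.0


def to_180(lon_deg):
    """Convert a longitude from the 0..360 convention to -180..180."""
    return (lon_deg + 180.0) % 360.0 - 180.0


# The bounding box 265-273 corresponds to 95 W through 87 W.
west_edge = to_0_360(-95.0)
east_edge = to_0_360(-87.0)
print(west_edge, east_edge)
```

Converting the bounds rather than the dataset keeps the source coordinates untouched, which avoids re-sorting the longitude axis.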

---

*Use this script as a robust template for extracting and preprocessing other CMIP6 or NASA datasets by adjusting bounds, variables, and model paths.*




In [5]:
import xarray as xr
import fsspec
import tempfile
import os
import numpy as np
import pandas as pd
from dask.diagnostics import ProgressBar
import gc
import time

# --- 1. CONFIGURATION ---
# Define the bounding box for the state of Minnesota
MINNESOTA_BOUNDS = {
    "min_lat": 43.5,
    "max_lat": 49.0,
    "min_lon": 265.0, # Using 0-360 longitude format as in the source files
    "max_lon": 273.0
}

# Define the years we want to fetch and process for this run
SELECTED_YEARS = ['2015', '2016', '2017', '2018', '2019', '2020']
#['2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014']

# Define the variables required for our downscaling models
VARIABLES_TO_FETCH = [
    'hurs', 'pr', 'rlds', 'rsds', 'sfcWind', 'tas', 'tasmax', 'tasmin'
]

# --- S3 PATH CONFIGURATION ---
# Define the scenario name and the S3 prefix.
# 'historical' (r1i1p1f2) was used for 2000-2014; 'ssp126' (r1i1p1f2) applies from 2015 onwards.
SCENARIO = 'ssp126'
BASE_S3_PREFIX = f's3://nex-gddp-cmip6/NEX-GDDP-CMIP6/CNRM-CM6-1/{SCENARIO}/r1i1p1f2/'

# List to keep track of temporary files for cleanup
all_temp_files = []

# --- 2. DATA FETCHING AND PROCESSING ---

def download_and_open_netcdf(s3_path, var_name, year):
    """Downloads a NetCDF file from S3 to a temporary local file and opens it with xarray."""
    try:
        with fsspec.open(s3_path, mode='rb', anon=True) as remote_file:
            # Create a temporary file to store the downloaded data
            fd, tmp_path = tempfile.mkstemp(suffix=".nc")
            with os.fdopen(fd, 'wb') as tmp_file:
                tmp_file.write(remote_file.read())
            all_temp_files.append(tmp_path)
            
            # Open the local temporary file with xarray
            ds = xr.open_dataset(tmp_path, engine='netcdf4', chunks='auto')
            print(f"  - Successfully loaded {var_name} for {year}")
            return ds
    except FileNotFoundError:
        print(f"  - ERROR: File not found for {var_name} in {year} at {s3_path}")
        return None
    except Exception as e:
        print(f"  - ERROR processing {var_name} for {year}: {e}")
        return None

# --- Main Script Execution ---
print("--- Step 1: Loading NASA NEX-GDDP Data from S3 ---")

all_variable_datasets = []
for var_name in VARIABLES_TO_FETCH:
    print(f"\nProcessing variable: {var_name}")
    datasets_for_current_var = []
    for year in SELECTED_YEARS:
        # Build the S3 path of the yearly file for this variable
        s3_path = f"{BASE_S3_PREFIX}{var_name}/{var_name}_day_CNRM-CM6-1_{SCENARIO}_r1i1p1f2_gr_{year}.nc"
        ds = download_and_open_netcdf(s3_path, var_name, year)
        if ds is not None:
            datasets_for_current_var.append(ds)
            
    if datasets_for_current_var:
        # Concatenate the yearly datasets for the current variable
        with ProgressBar():
            concatenated_var_ds = xr.concat(datasets_for_current_var, dim='time')
        all_variable_datasets.append(concatenated_var_ds)
        print(f"  --- Concatenated {var_name} for years {min(SELECTED_YEARS)}-{max(SELECTED_YEARS)} ---")

# Merge all variables into a single xarray Dataset
if all_variable_datasets:
    with ProgressBar():
        final_combined_ds = xr.merge(all_variable_datasets)
    print("\n--- Final Combined Xarray Dataset (All Variables, All Years) ---")
else:
    # Stop here: there is nothing to filter or save.
    raise SystemExit("No datasets were successfully combined. Exiting.")

# --- Step 2: Filtering and Converting to DataFrame ---
print("\n--- Step 2: Filtering for Minnesota and Converting to DataFrame ---")

# Create a boolean mask for the Minnesota bounding box
lat_mask = (final_combined_ds.lat >= MINNESOTA_BOUNDS["min_lat"]) & (final_combined_ds.lat <= MINNESOTA_BOUNDS["max_lat"])
lon_mask = (final_combined_ds.lon >= MINNESOTA_BOUNDS["min_lon"]) & (final_combined_ds.lon <= MINNESOTA_BOUNDS["max_lon"])

# Apply the spatial filter
with ProgressBar():
    filtered_ds = final_combined_ds.where(lat_mask & lon_mask, drop=True)

print("\n--- Converting to Pandas DataFrame ---")
# Stack spatial dimensions and convert to a pandas DataFrame
with ProgressBar():
    stacked_ds = filtered_ds.stack(spatial_point=('lat', 'lon'))
    nasa_df = stacked_ds.to_dataframe()

# Handle the index robustly before resetting it: if 'lat' or 'lon' already
# appear as columns, drop them so reset_index() does not collide with them.
cols_to_drop = [col for col in ['lat', 'lon'] if col in nasa_df.columns]
if cols_to_drop:
    print(f"  Found duplicated coordinate columns: {cols_to_drop}. Dropping them before resetting index.")
    nasa_df = nasa_df.drop(columns=cols_to_drop)

# Now, resetting the index will work reliably.
print("  Resetting index to convert coordinates to columns...")
nasa_df = nasa_df.reset_index()

# Clean up any rows that might be all null after filtering
nasa_df = nasa_df.dropna(subset=VARIABLES_TO_FETCH, how='all')

print("\n--- Step 3: Standardizing Units for NASA Data ---")
# Apply the necessary unit conversions
nasa_df['precip_daily_mm'] = nasa_df['pr'] * 86400
# Other variables are already in standard units (K, m/s, %, W/m^2)

print("  Unit standardization complete.")
print("\n--- Sample of Final NASA DataFrame ---")
print(nasa_df.head())
print(f"\nShape of final NASA DataFrame: {nasa_df.shape}")


# --- Step 4: Save the Processed Data ---
try:
    ROOT_DATA_DIR = r"C:\Users\91788\Downloads\ERA5 Data\Extracted"
    os.makedirs(ROOT_DATA_DIR, exist_ok=True)
    output_filename = f"NASA_Standardized_Minnesota_{min(SELECTED_YEARS)}-{max(SELECTED_YEARS)}.csv"
    output_path = os.path.join(ROOT_DATA_DIR, output_filename)
    print(f"\n--- Step 4: Saving final data to: {output_path} ---")
    nasa_df.to_csv(output_path, index=False)
    print("Save complete.")
except NameError:
    print("\nSkipping file save as ROOT_DATA_DIR is not defined.")
except Exception as e:
    print(f"\nAn error occurred while saving the file: {e}")


# --- Cleanup ---
print("\n--- Cleaning up temporary files ---")
# Close xarray file handles
if 'final_combined_ds' in locals(): final_combined_ds.close()
if 'filtered_ds' in locals(): filtered_ds.close()
if 'stacked_ds' in locals(): stacked_ds.close()
gc.collect()
time.sleep(1) # Give a moment for file handles to release

for fp in all_temp_files:
    if os.path.exists(fp):
        try:
            os.remove(fp)
            # print(f"Removed: {os.path.basename(fp)}") # Optional: uncomment for verbose output
        except Exception as e:
            print(f"ERROR: Could not remove {os.path.basename(fp)}: {e}")

--- Step 1: Loading NASA NEX-GDDP Data from S3 ---

Processing variable: hurs
  - Successfully loaded hurs for 2015
  - Successfully loaded hurs for 2016
  - Successfully loaded hurs for 2017
  - Successfully loaded hurs for 2018
  - Successfully loaded hurs for 2019
  - Successfully loaded hurs for 2020
  --- Concatenated hurs for years 2015-2020 ---

Processing variable: pr
  - Successfully loaded pr for 2015
  - Successfully loaded pr for 2016
  - Successfully loaded pr for 2017
  - Successfully loaded pr for 2018
  - Successfully loaded pr for 2019
  - Successfully loaded pr for 2020
  --- Concatenated pr for years 2015-2020 ---

Processing variable: rlds
  - Successfully loaded rlds for 2015
  - Successfully loaded rlds for 2016
  - Successfully loaded rlds for 2017
  - Successfully loaded rlds for 2018
  - Successfully loaded rlds for 2019
  - Successfully loaded rlds for 2020
  --- Concatenated rlds for years 2015-2020 ---

Processing variable: rsds
  - Successfully loaded rsds 