## NASA Center Analysis
<details>
<summary>IMPORTANT (FEB 2025)</summary>
<br>
NEX GDDP currently has a bug affecting tmin (minimum temperature) and tas (average temperature) at some Centers, but only for SSP 1-2.6 and 3-7.0. The affected Centers are:

AMES
LARC
GISS
JPL
JSC
KSC
WFF

Any variable derived from tmin or tas is therefore affected for SSP 1-2.6 and 3-7.0, and we do not plan to use those variables for these scenarios for now. The affected variables are:
Max DTR, Tmin ≥20°C, Tmin ≤0°C, Coldest Tmin of the Year, Annual Average Tmin, and all humid heat diagnostics (e.g., heat index, WBGT).
</details>

<details>
<summary>VARIABLES</summary>

| Variable Name       | Long Name                                          | Variable Category | Units     | Description                                                                                                                                                          |
| ------------------- | -------------------------------------------------- | ----------------- | --------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| tmax_days_35C       | \# days Tmax ≥35°C                                 | extreme index     | \# days   | Number of days, per year, with Tmax >=35C                                                                                                                            |
| tmax_days_90th      | \# days Tmax ≥90th Percentile                      | extreme index     | \# days   | Number of days, per year, with Tmax >=90th percentile. 90th percentile calculated using all daily tmax values from 1995-2014.                                        |
| tmax_days_95th      | \# days Tmax ≥95th Percentile                      | extreme index     | \# days   | Number of days, per year, with Tmax >=95th percentile. 95th percentile calculated using all daily tmax values from 1995-2014.                                        |
| tmax_days_99th      | \# days Tmax ≥99th Percentile                      | extreme index     | \# days   | Number of days, per year, with Tmax >=99th percentile. 99th percentile calculated using all daily tmax values from 1995-2014.                                        |
| Hottest_Tmax        | Hottest Tmax of the Year (°C)                      | extreme index     | degrees C | Hottest Tmax value every year                                                                                                                                        |
| Max_DTR             | Largest Diurnal Temperature Range of the Year (°C) | extreme index     | degrees C | largest diurnal temperature range (tmax minus tmin) each year                                                                                                        |
| tmin_tropnights_20C | \# days Tmin ≥20°C                                 | extreme index     | \# days   | Number of days, per year with Tmin >=20C                                                                                                                             |
| tmin_frostdays_0C   | \# days Tmin ≤0°C                                  | extreme index     | \# days   | Number of days per year with Tmin <=0C                                                                                                                               |
| Coldest_Tmin        | Coldest Tmin of the Year (°C)                      | extreme index     | degrees C | Coldest minimum temperature each year                                                                                                                                |
| prec_days_dry       | \# days with precipitation ≤0.001 in               | extreme index     | \# days   | Number of days, per year, where precipitation <=1e-3 inches                                                                                                          |
| prec_days_oneinch   | \# days with precipitation ≥1 in                   | extreme index     | \# days   | Number of days, per year, where precipitation >=1 inch                                                                                                               |
| prec_days_90th      | \# days with precipitation ≥90th Percentile        | extreme index     | \# days   | Number of days, per year, where precipitation >=90th percentile. 90th percentile calculated using all daily precipitation values (dry days EXCLUDED) from 1995-2014  |
| prec_days_95th      | \# days with precipitation ≥95th Percentile        | extreme index     | \# days   | Number of days, per year, where precipitation >=95th percentile. 95th percentile calculated using all daily precipitation values (dry days EXCLUDED) from 1995-2014  |
| prec_days_99th      | \# days with precipitation ≥99th Percentile        | extreme index     | \# days   | Number of days, per year, where precipitation >=99th percentile. 99th percentile calculated using all daily precipitation values (dry days EXCLUDED) from 1995-2014  |
| tmax_annave         | Annual Average Tmax (°C)                           | annual average    | degrees C | Annual average maximum daily temperature                                                                                                                             |
| tmin_annave         | Annual Average Tmin (°C)                           | annual average    | degrees C | Annual average minimum daily temperature                                                                                                                             |
| prec_annave         | Annual Total Precipitation (mm)                    | annual SUM        | mm        | Annual SUM of precipitation                                                                                                                                          |
</details>
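
The percentile-based indices in the table can be illustrated with a minimal sketch. It assumes synthetic daily data (the arrays `tmax` and `prec` are made up for illustration, not taken from NEX GDDP) and assumes that "dry days" means precipitation at or below 0.001 in, matching `prec_days_dry`. It is not the pipeline that produced the CSV files.

```python
import numpy as np

rng = np.random.default_rng(0)  # synthetic data, for illustration only

# Hypothetical daily Tmax (degrees C): one row per year, 365 days per row
years = np.arange(1995, 2025)
tmax = 20 + 10 * rng.standard_normal((years.size, 365))

# Threshold: 90th percentile of ALL daily values in the 1995-2014 baseline
in_baseline = (years >= 1995) & (years <= 2014)
p90 = np.percentile(tmax[in_baseline], 90)

# Index: number of days per year at or above the threshold
tmax_days_90th = (tmax >= p90).sum(axis=1)

# Precipitation variant: drop dry days before taking the percentile
prec = rng.gamma(shape=0.5, scale=0.4, size=(years.size, 365))  # inches
prec[rng.random(prec.shape) < 0.6] = 0.0  # make most days dry
base_prec = prec[in_baseline]
p90_wet = np.percentile(base_prec[base_prec > 1e-3], 90)
prec_days_90th = (prec >= p90_wet).sum(axis=1)
```

Excluding dry days raises the precipitation threshold considerably, because most daily values would otherwise be zero.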

In [1]:
# Imports
import os
import warnings

import numpy as np
import pandas as pd
import pandasql as psql

# Suppress warnings
warnings.filterwarnings('ignore')

## Initialization

In [2]:
path = 'updated_extremes'  # data directory
center = 'LARC'.upper()    # NASA center to analyze
only_future = True         # flag to use only 2020-2099
ssp = ['ssp126', 'ssp245', 'ssp370'] # scenarios to use

In [3]:
# DO NOT CHANGE THIS CELL
# File name convention: variable_CENTER_ssp###.csv

# NASA Centers
centers = ['AMES', 'GSFC', 'JPL', 'KSC', 'MSFC', 'MAF', 'GISS',
           'LARC', 'SSC', 'GRC', 'WFF', 'JSC', 'WSTF', 'AFRC']

# Check if the provided center is valid
if center not in centers:
    raise ValueError(f'{center} not in {centers}')

# Substrings that mark a variable as a count of DAYS; all other variables are assumed to be in degrees Celsius
day_unit = ['days', 'tropnights']

# Shared Socioeconomic Pathways (SSPs) climate scenarios
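# Build the list first: reusing the same quote type inside an f-string requires Python 3.12+
available = sorted({f.split('_')[-1][:-4] for f in os.listdir(path)
                    if 'ssp' in f.split('_')[-1]})
print(f'Available: {available}')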
# Time periods: 10 years before+after a decade
time_periods = {'short': (2020, 2049),  # 2030's: 2020-2029, 2030-2039, 2040-2049
                'mid':   (2040, 2069),  # 2050's: 2040-2049, 2050-2059, 2060-2069
                'long':  (2070, 2099),  # 2080's: 2070-2079, 2080-2089, 2090-2099
                }

Available: ['ssp126', 'ssp245', 'ssp370', 'ssp585']
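
The file name convention `variable_CENTER_ssp###.csv` can be split from the right, since the variable part may itself contain underscores. A minimal sketch, where `parse_name` is a hypothetical helper and not part of the notebook:

```python
import os

def parse_name(filename: str) -> tuple[str, str, str]:
    '''Split 'variable_CENTER_ssp###.csv' into (variable, center, scenario).'''
    # Drop the directory and the extension, then split on the last two underscores
    stem = os.path.splitext(os.path.basename(filename))[0]
    variable, center, scenario = stem.rsplit('_', 2)
    return variable, center, scenario

example = parse_name('updated_extremes/Coldest_Tmin_LARC_ssp126.csv')
# example == ('Coldest_Tmin', 'LARC', 'ssp126')
```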


# Get Files/Data

In [4]:
def get_files(path: str, center: str):
    '''Returns list of all csv files in the directory that contain the center and ssp name'''
    return [os.path.join(path, f) for f in os.listdir(path) 
             if center in f and any(s in f for s in ssp) and f.endswith('.csv')]

def check_df_consistency(df_list: list[pd.DataFrame]):
    '''Returns T/F if all dataframes in the list have the same column names and index values'''
    if not df_list:
        return False
    
    # Get reference column names and index values from the first dataframe
    ref_cols, ref_index = list(df_list[0].columns), list(df_list[0].index)
    
    # Check if all other dataframes have the same column names and index values
    return all(list(df.columns) == ref_cols and list(df.index) == ref_index 
               for df in df_list[1:])

def label_term(year: int):
    '''Returns list of time period labels for a given year'''
    return [t for t, (s, e) in time_periods.items() if s <= year <= e]


def preprocess(filename: str, only_future: bool=True):
    '''Returns a preprocessed pandas DataFrame from a csv file'''
    df = pd.read_csv(filename)
    
    # Extract variable name from the filename
    name = os.path.basename(filename)[:-4].split('_')  # basename also handles Windows path separators
    var = '_'.join(name[:-2])
    
    # Add new columns: term, scenario, and variable
    df.insert(0, 'term', df.years.apply(label_term))
    df.insert(0, 'scenario', name[-1])
    df.insert(0, 'variable', var + ('_d' if any(d in filename for d in day_unit) else '_c'))
    
    # Explode the 'term' column (in case a year belongs to multiple terms)
    df = df.explode('term')

    # Remove rows with NaN terms if only_future is True, otherwise return all rows
    return df.dropna(subset=['term']) if only_future else df # nan's (unlabeled) assumed to be past data
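
Because the time periods overlap (2040-2049 is in both `short` and `mid`), `label_term` can return more than one label, and `explode` then duplicates that year's row. A small sketch on a made-up frame (`toy` and its values are hypothetical) shows the effect, including how years outside every period end up as NaN and are dropped:

```python
import pandas as pd

time_periods = {'short': (2020, 2049), 'mid': (2040, 2069), 'long': (2070, 2099)}

def label_term(year):
    return [t for t, (s, e) in time_periods.items() if s <= year <= e]

# Hypothetical data: one historical year and three future years
toy = pd.DataFrame({'years': [2014, 2025, 2045, 2080],
                    'value': [1.0, 2.0, 3.0, 4.0]})
toy['term'] = toy['years'].apply(label_term)

exploded = toy.explode('term')               # 2045 appears twice; 2014 gets NaN
future = exploded.dropna(subset=['term'])    # keeps only labeled (future) years
```

Here `exploded` has five rows and `future` has four: 2045 counts toward both the `short` and `mid` statistics, which is why each term still contains 30 years.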

In [5]:
# Preprocess all files as df's
files = sorted(get_files(path, center))
df = [preprocess(f, only_future) for f in files]

# Check if all dataframes have the same format
if not check_df_consistency(df):
    raise ValueError('DataFrames are inconsistent')

# Combine all dataframes into one
df = pd.concat(df).reset_index(drop=True)
cols = ['variable', 'scenario', 'term']

# Check the number of years per time period (expected is 30)
years_per_term = df.groupby(cols).size().unique()
if len(years_per_term) != 1 or years_per_term[0] != 30:
    raise ValueError(
        f'# of years per time period is incorrect: {years_per_term}')

######################################
# REMOVE IF SSP 126, 370 ERROR FIXED
######################################
# For: AMES LARC GISS JPL JSC KSC WFF
# Erroneous: Max DTR, Tmin >20C, Tmin < 0C, Coldest Tmin of the Year, Annual Average Tmin, any of the humid heat diagnostics (e.g., heat index, WBGT, etc) in ssp126 and 370
# NOTE: these patterns do not match the humid heat files (e.g. HI_days_100F), so those rows are NOT removed below
errors = ['Max_DTR', 'tmin', 'Coldest_Tmin']
is_error = df.variable.str.contains('|'.join(errors), na=False, regex=True)
print(f'ssp126+ssp370 Erroneous variables: {df[is_error].variable.unique()}\n')
if center in ['AMES', 'LARC', 'GISS', 'JPL', 'JSC', 'KSC', 'WFF']:
    # Remove rows with erroneous variables for specific scenarios
    df = df[~(is_error & df.scenario.isin(['ssp126', 'ssp370']))]
######################################

print(f'{len(files)} {center} files')
print(files[:5], '\n')
display(df.info())

ssp126+ssp370 Erroneous variables: ['Coldest_Tmin_c' 'Max_DTR_c' 'tmin_frostdays_0C_d'
 'tmin_tropnights_20C_d']

45 LARC files
['updated_extremes/Coldest_Tmin_LARC_ssp126.csv', 'updated_extremes/Coldest_Tmin_LARC_ssp245.csv', 'updated_extremes/Coldest_Tmin_LARC_ssp370.csv', 'updated_extremes/HI_days_100F_LARC_ssp126.csv', 'updated_extremes/HI_days_100F_LARC_ssp245.csv'] 

<class 'pandas.core.frame.DataFrame'>
Index: 3330 entries, 90 to 3959
Data columns (total 30 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   variable       3330 non-null   object 
 1   scenario       3330 non-null   object 
 2   term           3330 non-null   object 
 3   years          3330 non-null   float64
 4   ACCESS-CM2     3330 non-null   float64
 5   ACCESS-ESM1-5  3330 non-null   float64
 6   BCC-CSM2-MR    3330 non-null   float64
 7   CESM2          1620 non-null   float64
 8   CMCC-ESM2      3330 non-null   float64
 9   CNRM-CM6-1     3330 non-null   f

None
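
The scenario-specific removal above combines a regex match on the variable name with a scenario test. A minimal sketch on a made-up frame (`toy` is hypothetical) shows which rows survive. Note that `str.contains` is case-sensitive by default, which is why `Coldest_Tmin` needs its own pattern alongside `tmin`.

```python
import pandas as pd

# Hypothetical rows covering matched and unmatched cases
toy = pd.DataFrame({
    'variable': ['Max_DTR_c', 'tmin_frostdays_0C_d', 'Hottest_Tmax_c',
                 'Max_DTR_c', 'HI_days_100F_d'],
    'scenario': ['ssp126', 'ssp370', 'ssp126', 'ssp245', 'ssp126'],
})

errors = ['Max_DTR', 'tmin', 'Coldest_Tmin']
is_error = toy['variable'].str.contains('|'.join(errors), na=False, regex=True)
is_bad_ssp = toy['scenario'].isin(['ssp126', 'ssp370'])

# Drop a row only when BOTH conditions hold
kept = toy[~(is_error & is_bad_ssp)].reset_index(drop=True)
```

In this sketch `Max_DTR_c` survives for ssp245 only, while `Hottest_Tmax_c` and `HI_days_100F_d` are kept for every scenario because neither name matches a pattern.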

# Calculations

In [6]:
# Extract column names
mme = list(df.filter(regex='mme-').columns)
models = list(df.columns.drop(cols+mme+['years']))

# Calculate term-wise statistics
# For variables ending with '_d', use median
# For variables ending with '_c', use mean
term_mme = pd.concat([df[df.variable.str.endswith('_d')].groupby(cols)[mme].median(),
                      df[df.variable.str.endswith('_c')].groupby(cols)[mme].mean()
                     ]).reset_index().sort_values(cols, ascending=[1, 1, 0],
                                                  ignore_index=True)
display(term_mme.head())

|     | variable       | scenario | term  | mme-mean   | mme-median | mme-pct25  | mme-pct75 |
| --- | -------------- | -------- | ----- | ---------- | ---------- | ---------- | --------- |
| 0   | Coldest_Tmin_c | ssp245   | short | -11.268358 | -10.740965 | -13.503494 | -8.679038 |
| 1   | Coldest_Tmin_c | ssp245   | mid   | -10.260196 | -9.92698   | -12.113423 | -7.821688 |
| 2   | Coldest_Tmin_c | ssp245   | long  | -8.900938  | -8.415058  | -10.516808 | -6.578249 |
| 3   | HI_days_100F_d | ssp126   | short | 13.25      | 11.5       | 5.25       | 19.0      |
| 4   | HI_days_100F_d | ssp126   | mid   | 16.159091  | 14.5       | 7.625      | 22.75     |
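
The term-wise aggregation uses the median for day counts (`_d`) and the mean for temperatures (`_c`). A small sketch on made-up values (the variable names `x_days_d` and `y_temp_c` are hypothetical) makes the difference visible:

```python
import pandas as pd

# Hypothetical yearly values for one days variable and one Celsius variable
toy = pd.DataFrame({
    'variable': ['x_days_d'] * 3 + ['y_temp_c'] * 3,
    'scenario': ['ssp245'] * 6,
    'term': ['short'] * 6,
    'mme-mean': [1.0, 2.0, 9.0, 10.0, 11.0, 15.0],
})
cols = ['variable', 'scenario', 'term']
is_days = toy['variable'].str.endswith('_d')

# Median for '_d' variables, mean for everything else
summary = pd.concat([
    toy[is_days].groupby(cols)[['mme-mean']].median(),
    toy[~is_days].groupby(cols)[['mme-mean']].mean(),
]).reset_index()
```

The days variable aggregates to 2.0 (the median, unaffected by the outlying 9.0), while the Celsius variable aggregates to 12.0 (the mean).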


In [7]:
# # Recalculate mme columns
# df['calc_mean'] = df[models].mean(axis=1)
# df['calc_median'] = df[models].median(axis=1)
# df['calc_pct25'] = df[models].quantile(0.25, axis=1)
# df['calc_pct75'] = df[models].quantile(0.75, axis=1)
# calc = ['calc_mean', 'calc_median', 'calc_pct25', 'calc_pct75']

# # Calculate term-wise statistics
# # For variables ending with '_d', use median
# # For variables ending with '_c', use mean
# term_calc = pd.concat([df[df.variable.str.endswith('_d')].groupby(cols)[calc].median(),
#                       df[df.variable.str.endswith('_c')].groupby(cols)[calc].mean()
#                      ]).reset_index().sort_values(cols, ascending=[1, 1, 0],
#                                                   ignore_index=True)
# # Rename columns to match term_mme
# term_calc.columns = term_mme.columns

# # Check if original mme and recalculated values are equal
# if min((term_mme.round(10) == term_calc.round(10)).sum()) != len(term_mme):
#     print(term_mme.compare(term_calc, result_names=('term_mme', 'term_calc')))
#     raise

# display(term_calc.head())

## Calculate Change Per (variable, scenario)
- short-mid: mid minus short
- short-long: long minus short
- mid-long: long minus mid

In [8]:
query = """
SELECT
    a.variable,
    a.scenario,
    CASE
        WHEN a.term = 'short' AND b.term = 'mid' THEN 'short-mid'
        WHEN a.term = 'short' AND b.term = 'long' THEN 'short-long'
        WHEN a.term = 'mid' AND b.term = 'long' THEN 'mid-long'
    END AS term_diff,
    b.'mme-mean' - a.'mme-mean' AS 'mme-mean',
    b.'mme-median' - a.'mme-median' AS 'mme-median',
    b.'mme-pct25' - a.'mme-pct25' AS 'mme-pct25',
    b.'mme-pct75' - a.'mme-pct75' AS 'mme-pct75'
FROM term_mme a
JOIN term_mme b
    ON a.variable = b.variable
    AND a.scenario = b.scenario
    AND (
        (a.term = 'short' AND b.term = 'mid') OR
        (a.term = 'short' AND b.term = 'long') OR
        (a.term = 'mid' AND b.term = 'long')
    )
ORDER BY 1, 2, 3 DESC
"""

change = psql.sqldf(query, locals())

display(change.head(2))

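|     | variable       | scenario | term_diff  | mme-mean | mme-median | mme-pct25 | mme-pct75 |
| --- | -------------- | -------- | ---------- | -------- | ---------- | --------- | --------- |
| 0   | Coldest_Tmin_c | ssp245   | short-mid  | 1.008162 | 0.813985   | 1.390072  | 0.857351  |
| 1   | Coldest_Tmin_c | ssp245   | short-long | 2.36742  | 2.325906   | 2.986686  | 2.100789  |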
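
The same self-join can be expressed directly in pandas, which may be convenient where `pandasql` is not installed. This is a sketch of an equivalent approach on made-up numbers (`term_toy` and the `order` mapping are hypothetical), not a replacement for the query above:

```python
import pandas as pd

# Hypothetical term-wise values for one variable and scenario
term_toy = pd.DataFrame({
    'variable': ['v'] * 3,
    'scenario': ['ssp245'] * 3,
    'term': ['short', 'mid', 'long'],
    'mme-mean': [-11.0, -10.0, -8.5],
})
order = {'short': 0, 'mid': 1, 'long': 2}

# Self-join on (variable, scenario), then keep pairs where a comes before b
pairs = term_toy.merge(term_toy, on=['variable', 'scenario'], suffixes=('_a', '_b'))
pairs = pairs[pairs['term_a'].map(order) < pairs['term_b'].map(order)].copy()

# Later term minus earlier term, labeled like the SQL CASE expression
pairs['term_diff'] = pairs['term_a'] + '-' + pairs['term_b']
pairs['change'] = pairs['mme-mean_b'] - pairs['mme-mean_a']
result = pairs[['variable', 'scenario', 'term_diff', 'change']].reset_index(drop=True)
```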


# Results
## Degrees C

In [9]:
# Degrees C
agg_c = (term_mme[term_mme.variable.str.endswith('_c')]
         .groupby(['variable', 'term'])
         .agg({'mme-mean': ['min', 'max']})
         .sort_values(['variable', 'term'], ascending=[1, 0]))

agg_c['rounded_min'], agg_c['rounded_max'] = agg_c['mme-mean'].round(1).values.T
agg_c['diff'] = (agg_c[('mme-mean', 'max')] - agg_c[('mme-mean', 'min')]).round(1)

agg_c


| variable       | term  | mme-mean min | mme-mean max | rounded_min | rounded_max | diff |
| -------------- | ----- | ------------ | ------------ | ----------- | ----------- | ---- |
| Coldest_Tmin_c | short | -11.268358   | -11.268358   | -11.3       | -11.3       | 0.0  |
| Coldest_Tmin_c | mid   | -10.260196   | -10.260196   | -10.3       | -10.3       | 0.0  |
| Coldest_Tmin_c | long  | -8.900938    | -8.900938    | -8.9        | -8.9        | 0.0  |
| Hottest_Tmax_c | short | 39.078562    | 39.775092    | 39.1        | 39.8        | 0.7  |
| Hottest_Tmax_c | mid   | 39.55323     | 40.45273     | 39.6        | 40.5        | 0.9  |
| Hottest_Tmax_c | long  | 39.878945    | 41.974741    | 39.9        | 42.0        | 2.1  |
| Max_DTR_c      | short | 24.67876     | 24.67876     | 24.7        | 24.7        | 0.0  |
| Max_DTR_c      | mid   | 24.469513    | 24.469513    | 24.5        | 24.5        | 0.0  |
| Max_DTR_c      | long  | 24.211326    | 24.211326    | 24.2        | 24.2        | 0.0  |


## Days

In [10]:
# Days
agg_d = (term_mme[term_mme.variable.str.endswith('_d')]
          .groupby(['variable', 'term'])
          .agg({'mme-median': ['min', 'max']})
          .sort_values(by=['variable', 'term'], ascending=[1, 0]))

round_up_half = lambda x: np.ceil(x) if x % 1 == 0.5 else round(x)

agg_d['rounded_min'] = agg_d[('mme-median', 'min')].apply(round_up_half).astype(int)
agg_d['rounded_max'] = agg_d[('mme-median', 'max')].apply(round_up_half).astype(int)
# Note: astype(int) truncates toward zero, so e.g. a 0.5-day spread is reported as 0
agg_d['diff'] = (agg_d[('mme-median', 'max')] - agg_d[('mme-median', 'min')]).astype(int)

agg_d

| variable         | term  | mme-median min | mme-median max | rounded_min | rounded_max | diff |
| ---------------- | ----- | -------------- | -------------- | ----------- | ----------- | ---- |
| HI_days_100F_d   | short | 6.25           | 13.25          | 6           | 13          | 7    |
| HI_days_100F_d   | mid   | 12.5           | 20.75          | 13          | 21          | 8    |
| HI_days_100F_d   | long  | 14.75          | 44.25          | 15          | 44          | 29   |
| prec_days_90th_d | short | 21.0           | 21.25          | 21          | 21          | 0    |
| prec_days_90th_d | mid   | 22.0           | 22.5           | 22          | 23          | 0    |
| prec_days_90th_d | long  | 21.5           | 24.0           | 22          | 24          | 2    |
| prec_days_95th_d | short | 11.0           | 11.0           | 11          | 11          | 0    |
| prec_days_95th_d | mid   | 11.5           | 12.0           | 12          | 12          | 0    |
| prec_days_95th_d | long  | 11.5           | 13.0           | 12          | 13          | 1    |
| prec_days_99th_d | short | 2.0            | 2.5            | 2           | 3           | 0    |
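
The custom rounding in the Days cell exists because Python's built-in `round` rounds exact halves to the nearest even integer. A minimal standard-library sketch (`round_half_up` is a hypothetical helper; adding 0.5 and flooring can misbehave for values a hair below a half, so treat it as illustrative):

```python
import math

def round_half_up(x: float) -> int:
    '''Round to the nearest integer, with exact halves always rounded up.'''
    return math.floor(x + 0.5)

builtin = [round(12.5), round(11.5), round(2.5)]                     # banker's rounding
half_up = [round_half_up(12.5), round_half_up(11.5), round_half_up(2.5)]

# Truncation, as used for the 'diff' column, drops the fractional part
truncated = int(44.25 - 14.75)
```

With built-in rounding, 12.5 becomes 12 while 11.5 also becomes 12. With half-up rounding they become 13 and 12. The truncated spread of 29.5 days is 29, which is why `diff` can differ from `rounded_max - rounded_min`.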
