## NASA Center Analysis
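The MME-mean calculations in the input files are wrong: they do not match the means recalculated from the individual model columns, and the other MME statistics (median and percentiles) show mismatches as well.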
<details>
<summary>VARIABLES</summary>

| Variable Name       | Long Name                                          | Variable Category | Units     | Description                                                                                                                                                          |
| ------------------- | -------------------------------------------------- | ----------------- | --------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| tmax_days_35C       | \# days Tmax ≥35°C                                 | extreme index     | \# days   | Number of days, per year, with Tmax >=35C                                                                                                                            |
| tmax_days_90th      | \# days Tmax ≥90th Percentile                      | extreme index     | \# days   | Number of days, per year, with Tmax >=90th percentile. 90th percentile calculated using all daily tmax values from 1995-2014.                                        |
| tmax_days_95th      | \# days Tmax ≥95th Percentile                      | extreme index     | \# days   | Number of days, per year, with Tmax >=95th percentile. 95th percentile calculated using all daily tmax values from 1995-2014.                                        |
| tmax_days_99th      | \# days Tmax ≥99th Percentile                      | extreme index     | \# days   | Number of days, per year, with Tmax >=99th percentile. 99th percentile calculated using all daily tmax values from 1995-2014.                                        |
| Hottest_Tmax        | Hottest Tmax of the Year (°C)                      | extreme index     | degrees C | Hottest Tmax value every year                                                                                                                                        |
| Max_DTR             | Largest Diurnal Temperature Range of the Year (°C) | extreme index     | degrees C | largest diurnal temperature range (tmax minus tmin) each year                                                                                                        |
| tmin_tropnights_20C | \# days Tmin ≥20°C                                 | extreme index     | \# days   | Number of days, per year with Tmin >=20C                                                                                                                             |
| tmin_frostdays_0C   | \# days Tmin ≤0°C                                  | extreme index     | \# days   | Number of days per year with Tmin <=0C                                                                                                                               |
| Coldest_Tmin        | Coldest Tmin of the Year (°C)                      | extreme index     | degrees C | Coldest minimum temperature each year                                                                                                                                |
| prec_days_dry       | \# days with precipitation ≤0.001 in               | extreme index     | \# days   | Number of days, per year, where precipitation <=1e-3 inches                                                                                                          |
| prec_days_oneinch   | \# days with precipitation ≥1 in                   | extreme index     | \# days   | Number of days, per year, where precipitation >=1 inch                                                                                                               |
| prec_days_90th      | \# days with precipitation ≥90th Percentile        | extreme index     | \# days   | Number of days, per year, where precipitation >=90th percentile. 90th percentile calculated using all daily precipitation values (dry days EXCLUDED) from 1995-2014  |
| prec_days_95th      | \# days with precipitation ≥95th Percentile        | extreme index     | \# days   | Number of days, per year, where precipitation >=95th percentile. 95th percentile calculated using all daily precipitation values (dry days EXCLUDED) from 1995-2014  |
| prec_days_99th      | \# days with precipitation ≥99th Percentile        | extreme index     | \# days   | Number of days, per year, where precipitation >=99th percentile. 99th percentile calculated using all daily precipitation values (dry days EXCLUDED) from 1995-2014  |
| tmax_annave         | Annual Average Tmax (°C)                           | annual average    | degrees C | Annual average maximum daily temperature                                                                                                                             |
| tmin_annave         | Annual Average Tmin (°C)                           | annual average    | degrees C | Annual average minimum daily temperature                                                                                                                             |
| prec_annave         | Annual Total Precipitation (mm)                    | annual SUM        | mm        | Annual SUM of precipitation                                                                                                                                          |

</details>
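
The threshold and percentile indices in the table above can be illustrated with a small sketch. It is not the pipeline that produced the input files: the daily series is synthetic, and the names `tmax`, `baseline`, and `p90` are made up for illustration. It only shows the definition stated in the table, i.e. a single 90th percentile taken over all daily values in 1995-2014, then a per-year count of days at or above it.

```python
import numpy as np
import pandas as pd

# Synthetic daily Tmax (deg C), 1995-2030; a stand-in for real model output
rng = np.random.default_rng(0)
dates = pd.date_range('1995-01-01', '2030-12-31', freq='D')
seasonal = 20 + 12 * np.sin(2 * np.pi * (dates.dayofyear.to_numpy() - 100) / 365.25)
tmax = pd.Series(seasonal + rng.normal(0, 3, len(dates)), index=dates)

# Threshold: 90th percentile of ALL daily values in the 1995-2014 baseline
baseline = tmax.loc['1995':'2014']
p90 = baseline.quantile(0.90)

# Indices: number of days per year at or above each threshold
tmax_days_90th = (tmax >= p90).groupby(tmax.index.year).sum()
tmax_days_35C = (tmax >= 35).groupby(tmax.index.year).sum()

print(tmax_days_90th.head())
```

By construction, roughly 10% of baseline days (about 36 per year on average) meet the 90th-percentile threshold, which gives a quick sanity check on any real index file.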

In [1]:
# Imports
import os
import warnings

import numpy as np
import pandas as pd
import pandasql as psql

# Suppress warnings
warnings.filterwarnings('ignore')

## Initialization

In [2]:
path = 'updated_extremes'  # data directory
center = 'LARC'.upper()    # NASA center to analyze
path = os.path.join(path, center)
only_future = True         # flag to use only 2020-2099
ssp = ['ssp126', 'ssp245', 'ssp370'] # scenarios to use

In [3]:
# DO NOT CHANGE THIS CELL
# OLD File name convention: variable_CENTER_ssp###.csv
# File name convention: variable.ssp###.CENTER.csv

# NASA Centers
centers = ['AMES', 'GSFC', 'JPL', 'KSC', 'MSFC', 'MAF', 'GISS',
           'LARC', 'SSC', 'GRC', 'WFF', 'JSC', 'WSTF', 'AFRC']

# Check if the provided center is valid
if center not in centers:
    raise ValueError(f'{center} not in {centers}')

# Variable unit: number of DAYS when... assume others are celsius
day_unit = ['days', 'tropnights']

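# Shared Socioeconomic Pathways (SSPs) climate scenarios
# The scenario is the second dot-separated field: variable.ssp###.CENTER.csv
# (built outside the f-string: nested same-type quotes need Python 3.12+)
available = sorted({f.split('.')[1] for f in os.listdir(path)
                    if f.endswith('.csv') and 'ssp' in f})
print(f'Available: {available}')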
# Time periods: 10 years before+after a decade
time_periods = {'short': (2020, 2049),  # 2030's: 2020-2029, 2030-2039, 2040-2049
                'mid':   (2040, 2069),  # 2050's: 2040-2049, 2050-2059, 2060-2069
                'long':  (2070, 2099),  # 2080's: 2070-2079, 2080-2089, 2090-2099
                }

Available: ['ssp126', 'ssp245', 'ssp370']


## Get Files/Data

In [4]:
def get_files(path: str, center: str):
    '''Returns list of all csv files in the directory that contain the center and ssp name'''
    return [os.path.join(path, f) for f in os.listdir(path) 
             if center in f and any(s in f for s in ssp) and f.endswith('.csv')]

def check_df_consistency(df_list: list[pd.DataFrame]):
    '''Returns True if all dataframes in the list have the same column names and index values, else False'''
    if not df_list:
        return False
    
    # Get reference column names and index values from the first dataframe
    ref_cols, ref_index = list(df_list[0].columns), list(df_list[0].index)
    
    # Check if all other dataframes have the same column names and index values
    return all(list(df.columns) == ref_cols and list(df.index) == ref_index 
               for df in df_list[1:])

def label_term(year: int):
    '''Returns list of time period labels for a given year'''
    return [t for t, (s, e) in time_periods.items() if s <= year <= e]


def preprocess(filename: str, only_future: bool=True):
    '''Returns a preprocessed pandas DataFrame from a csv file'''
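    df = pd.read_csv(filename)
    # File name convention: variable.ssp###.CENTER.csv -> [variable, scenario, center]
    name = os.path.splitext(os.path.basename(filename))[0].split('.')
    
    # Add new columns: term, scenario, and variable
    # Variable suffix marks the unit: '_d' for day counts, '_c' otherwise (assumed Celsius)
    df.insert(0, 'term', df.Years.apply(label_term))
    df.insert(0, 'scenario', name[1])
    df.insert(0, 'variable', name[0] + ('_d' if any(d in name[0] for d in day_unit) else '_c'))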
    
    # Explode the 'term' column (in case a year belongs to multiple terms)
    df = df.explode('term')

    # Remove rows with NaN terms if only_future is True, otherwise return all rows
    return df.dropna(subset=['term']) if only_future else df # nan's (unlabeled) assumed to be past data
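
Because the time periods overlap (2040-2049 falls in both `short` and `mid`), `label_term` can return more than one label, and `explode` then duplicates that year once per label. The toy example below sketches this behavior. The frame `toy` and the column `model_a` are hypothetical; `time_periods` and `label_term` are copied from the cells above so the snippet runs on its own.

```python
import pandas as pd

time_periods = {'short': (2020, 2049), 'mid': (2040, 2069), 'long': (2070, 2099)}

def label_term(year: int):
    '''Returns list of time period labels for a given year'''
    return [t for t, (s, e) in time_periods.items() if s <= year <= e]

# Hypothetical data: one past year, one overlap year (2045), two ordinary years
toy = pd.DataFrame({'Years': [2014, 2030, 2045, 2080],
                    'model_a': [1.0, 2.0, 3.0, 4.0]})
toy['term'] = toy.Years.apply(label_term)

# Empty lists become NaN; multi-label years become one row per label
exploded = toy.explode('term')

# Dropping NaN terms removes years outside every period (here 2014)
future = exploded.dropna(subset=['term'])
print(future)
```

Here 2045 appears twice (once as `short`, once as `mid`) and 2014 is dropped, which is why each variable/scenario/term group is expected to hold exactly 30 years.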

In [5]:
# Preprocess all files as df's
files = sorted(get_files(path, center))
df = [preprocess(f, only_future) for f in files]

# Check if all dataframes have the same format
if not check_df_consistency(df):
    raise ValueError('DataFrames are inconsistent')

# Combine all dataframes into one
df = pd.concat(df).reset_index(drop=True)
cols = ['variable', 'scenario', 'term']

# Check the number of years per time period (expected is 30)
years_per_term = df.groupby(cols).size().unique()
if len(years_per_term) != 1 or years_per_term[0] != 30:
    raise ValueError(
        f'# of years per time period is incorrect: {years_per_term}')

print(f'{len(files)} {center} files')
print(files[:5], '\n')
display(df.info())

51 LARC files
['updated_extremes/LARC/Coldest_Tmin.ssp126.LARC.csv', 'updated_extremes/LARC/Coldest_Tmin.ssp245.LARC.csv', 'updated_extremes/LARC/Coldest_Tmin.ssp370.LARC.csv', 'updated_extremes/LARC/Hottest_Tmax.ssp126.LARC.csv', 'updated_extremes/LARC/Hottest_Tmax.ssp245.LARC.csv'] 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4590 entries, 0 to 4589
Data columns (total 32 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   variable       4590 non-null   object 
 1   scenario       4590 non-null   object 
 2   term           4590 non-null   object 
 3   Years          4590 non-null   float64
 4   ACCESS-CM2     4590 non-null   float64
 5   ACCESS-ESM1-5  4590 non-null   float64
 6   BCC-CSM2-MR    4590 non-null   float64
 7   CESM2          1620 non-null   float64
 8   CMCC-ESM2      4590 non-null   float64
 9   CNRM-CM6-1     4590 non-null   float64
 10  CNRM-ESM2-1    4590 non-null   float64
 11  EC-Earth3      4590 non-null 

None

In [8]:
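# Split columns into precomputed multi-model ensemble (MME) statistics and individual models
MME = list(df.filter(regex='MME-').columns)
models = list(df.columns.drop(cols+MME+['Years']))
df1 = df.set_index(['variable', 'scenario', 'term', 'Years'])

# Recalculate each statistic across the model columns (NaNs are skipped by default)
df1['calc_mean'] = df1[models].mean(axis=1)
df1['calc_median'] = df1[models].median(axis=1)
df1['calc_pct25'] = df1[models].quantile(0.25, axis=1)
df1['calc_pct75'] = df1[models].quantile(0.75, axis=1)
df1['calc_pct05'] = df1[models].quantile(0.05, axis=1)
df1['calc_pct95'] = df1[models].quantile(0.95, axis=1)

# Round to suppress floating-point noise before comparing
df_mme = df1.filter(regex='MME-').round(10)
df_cal = df1.filter(regex='calc_').round(10)
# Rename positionally: assumes the MME-* columns are in the same order as the calc_* columns
df_cal.columns=df_mme.columns

# Keep only the rows where at least one statistic differs
err = df_mme.compare(df_cal, result_names=('term_MME', 'term_calc')).dropna(how='all')
display(err)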
display(err)

| variable | scenario | term | Years | MME-mean (term_MME) | MME-mean (term_calc) | MME-median (term_MME) | MME-median (term_calc) | MME-pct25 (term_MME) | MME-pct25 (term_calc) | MME-pct75 (term_MME) | MME-pct75 (term_calc) | MME-pct05 (term_MME) | MME-pct05 (term_calc) | MME-pct95 (term_MME) | MME-pct95 (term_calc) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Coldest_Tmin_c | ssp245 | short | 2020.0 | -13.203454 | -11.633955 | -12.729980 | -11.357941 | -15.333984 | -13.266235 | -10.944702 | -9.445007 | -19.054031 | -16.501160 | -7.518524 | -7.770294 |
| Coldest_Tmin_c | ssp245 | short | 2021.0 | -10.888566 | -11.645869 | -9.929718 | -11.238373 | -13.511353 | -12.918243 | -8.312775 | -9.161407 | -14.697510 | -18.496933 | -7.620392 | -6.229919 |
| Coldest_Tmin_c | ssp245 | short | 2022.0 | -13.024539 | -11.652946 | -11.989258 | -11.386108 | -14.897949 | -13.464233 | -9.883087 | -9.325989 | -22.243896 | -16.978699 | -7.624878 | -6.419098 |
| Coldest_Tmin_c | ssp245 | short | 2023.0 | -12.098663 | -12.612939 | -12.043091 | -10.976562 | -12.968109 | -13.887787 | -10.257416 | -9.951202 | -18.183319 | -19.684601 | -8.020020 | -9.756409 |
| Coldest_Tmin_c | ssp245 | short | 2024.0 | -13.303314 | -11.236225 | -12.001373 | -11.758453 | -15.710175 | -12.706421 | -10.432739 | -10.397491 | -20.645370 | -15.583130 | -8.825684 | -6.363983 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| tmin_tropnights_20C_d | ssp370 | long | 2095.0 | 125.571429 | 89.428571 | 126.000000 | 90.000000 | 116.000000 | 79.000000 | 133.000000 | 96.000000 | 102.000000 | 70.000000 | 154.000000 | 112.000000 |
| tmin_tropnights_20C_d | ssp370 | long | 2096.0 | 126.095238 | 86.666667 | 123.000000 | 87.000000 | 117.000000 | 80.000000 | 132.000000 | 94.000000 | 112.000000 | 70.000000 | 148.000000 | 99.000000 |
| tmin_tropnights_20C_d | ssp370 | long | 2097.0 | 125.095238 | 91.761905 | 121.000000 | 90.000000 | 113.000000 | 84.000000 | 137.000000 | 102.000000 | 107.000000 | 63.000000 | 152.000000 | 117.000000 |
| tmin_tropnights_20C_d | ssp370 | long | 2098.0 | 127.714286 | 87.285714 | 125.000000 | 82.000000 | 118.000000 | 75.000000 | 136.000000 | 101.000000 | 113.000000 | 69.000000 | 147.000000 | 117.000000 |
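
The comparison above rounds both sides to 10 decimals and then tests for exact equality. A tolerance-based check with `np.isclose` is one possible alternative that also yields a per-statistic mismatch count and the largest absolute difference. The sketch below uses a made-up three-model ensemble (`model_a`, `model_b`, `model_c`) with one deliberately wrong stored mean. It is an illustration under those assumptions, not a reanalysis of the LARC files.

```python
import numpy as np
import pandas as pd

# Hypothetical ensemble: three models over three years, one missing value
models = pd.DataFrame(
    {'model_a': [1.0, 2.0, 3.0],
     'model_b': [2.0, 4.0, np.nan],
     'model_c': [3.0, 6.0, 9.0]},
    index=pd.Index([2020, 2021, 2022], name='Years'))

# "Stored" statistics; the 2021 mean is deliberately wrong (5.0 instead of 4.0)
stored = pd.DataFrame({'MME-mean': [2.0, 5.0, 6.0],
                       'MME-median': [2.0, 4.0, 6.0]}, index=models.index)

# Recalculate across models; pandas skips NaN by default
recalc = pd.DataFrame({'MME-mean': models.mean(axis=1),
                       'MME-median': models.median(axis=1)})

# Element-wise comparison within a tolerance instead of round-then-equal
close = pd.DataFrame(
    np.isclose(stored.to_numpy(), recalc.to_numpy(),
               rtol=1e-9, atol=1e-9, equal_nan=True),
    index=stored.index, columns=stored.columns)

mismatch_counts = (~close).sum()                # mismatching rows per statistic
max_abs_diff = (stored - recalc).abs().max()    # size of the worst mismatch
print(mismatch_counts, max_abs_diff, sep='\n')
```

In the real output the differences are on the order of whole degrees or tens of days, far larger than any tolerance, so a check like this would mainly help quantify the size of the discrepancy per statistic rather than change the conclusion.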


In [7]:
err.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 3409 entries, ('Coldest_Tmin_c', 'ssp126', 'short', np.float64(2020.0)) to ('tmin_tropnights_20C_d', 'ssp370', 'long', np.float64(2099.0))
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   (MME-mean, term_MME)     3289 non-null   float64
 1   (MME-mean, term_calc)    3289 non-null   float64
 2   (MME-median, term_MME)   2871 non-null   float64
 3   (MME-median, term_calc)  2871 non-null   float64
 4   (MME-pct25, term_MME)    2879 non-null   float64
 5   (MME-pct25, term_calc)   2879 non-null   float64
 6   (MME-pct75, term_MME)    2935 non-null   float64
 7   (MME-pct75, term_calc)   2935 non-null   float64
 8   (MME-pct05, term_MME)    2918 non-null   float64
 9   (MME-pct05, term_calc)   2918 non-null   float64
 10  (MME-pct95, term_MME)    3046 non-null   float64
 11  (MME-pct95, term_calc)   3046 non-null   float64
dtypes: float64(12)
me