# Calculating variance across models

This notebook runs through the calculation of multiple types of uncertainty for an example metric, across 3 future scenarios spanning 2015-2100. Here we calculate: 
- **Internal variability:** the component of future uncertainty attributable to natural variations in the climate, or within-model variation. We calculate this using a large ensemble approach to the variance calculation. 
- **Model uncertainty:** the component of future uncertainty attributable to differences between models, which, due to their different components and assumptions, may respond differently to identical radiative forcing scenarios. 
- **Scenario uncertainty:** the component of future uncertainty attributable to differences between Shared Socioeconomic Pathways, or the uncertainty in future emissions, and thus radiative forcing and climate. 

As an example metric, we will calculate the annual average of maximum daily air temperature, area-averaged across Alameda County. Note that other metrics over different areas may have different relative uncertainties.

**Intended application**: As a user, I wish to understand projections of air temperature in my region by:
- Calculating variance in internal, model, and scenario uncertainties through end of century
- Visualizing the variance

**Runtime**: With the default settings, this notebook takes approximately **6 minutes** to run from start to finish. Modifications to selections may increase the runtime.

## Step 0: Set-up

In [None]:
import climakitae as ck
import climakitaegui as ckg

import pandas as pd
import xarray as xr
import numpy as np

import matplotlib.pyplot as plt

## Step 1: Calculate historical baseline

If we want to look at anomalies of future projections rather than absolute values, we first need to calculate a historical baseline for our metric of interest. From this baseline and the absolute values provided by future projections, we can calculate anomalies. Use the select tool, or bypass it and hard-code the selections for the 1985-2015 historical period. 

#### 1a) Select historical data
First we need to select the relevant subset of historical data. In this example, we want the hybrid-statistically downscaled monthly maximum air temperature, area-averaged over Alameda County, for our historical baseline period of 1985-2015. We can either choose this subset interactively with the selection tool (`selections.show()`, commented out below) or hard-code the selections. 

In [None]:
selections = ckg.Select()

selections.area_average='Yes'
selections.area_subset='CA counties'
selections.cached_area=['Alameda County']
selections.data_type='Gridded'
selections.downscaling_method='Statistical'
selections.resolution='3 km'
selections.scenario_historical=['Historical Climate']
selections.time_slice=(1985, 2015)
selections.timescale='monthly'
selections.units='K'
selections.variable='Maximum air temperature at 2m'

# selections.show()

#### 1b) Retrieve and load data
Next we retrieve the historical data that we have selected, and load it into memory. The historical data may take a minute to load.

In [None]:
hist_data = selections.retrieve()

In [None]:
hist_data = ck.load(hist_data)

## Step 2: Calculate weighted temporal mean of historical data

From monthly timeseries, calculate annual averages (weighted by the number of days in each month), and then average across years.

In [None]:
from climakitae.explore.uncertainty import weighted_temporal_mean

# calculate annual average
ann_avg_temp_hist = weighted_temporal_mean(hist_data)
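
The day-weighting idea can be sketched with plain xarray. This is an illustration only, not the implementation of `weighted_temporal_mean` in climakitae; the helper name `annual_mean_weighted_by_days` and the synthetic monthly series are assumptions made for the example, and missing values are not handled.

```python
import numpy as np
import pandas as pd
import xarray as xr


def annual_mean_weighted_by_days(da):
    """Annual mean of a monthly series, weighting each month by its length.

    Hypothetical helper for illustration; assumes a datetime `time` coordinate.
    """
    days = da["time"].dt.days_in_month
    # Normalize so that the weights within each calendar year sum to 1
    weights = days.groupby("time.year") / days.groupby("time.year").sum()
    return (da * weights).groupby("time.year").sum(dim="time")


# Synthetic two-year monthly series: 10 in every February, 0 in all other months
time = pd.date_range("2001-01-01", periods=24, freq="MS")
values = np.where(time.month == 2, 10.0, 0.0)
monthly = xr.DataArray(values, coords={"time": time}, dims="time")

annual = annual_mean_weighted_by_days(monthly)
# February 2001 has 28 of 365 days, so the weighted mean is 10 * 28 / 365,
# slightly below the unweighted monthly mean of 10 / 12
```

Weighting by month length matters most for variables with a strong seasonal cycle, where short and long months would otherwise count equally.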

Calculate the 30-year mean:


In [None]:
hist_mean = ann_avg_temp_hist.mean(dim='time')

## Step 3: Select and retrieve future projections, calculate anomalies 

Next we need to retrieve and load in our desired future projection data. As with the historical data, we'll calculate annual averages, and then use the historical baseline we calculated in the section above to convert our future projections from absolute values into anomalies relative to that 1985-2015 historical baseline. 


#### 3a) Select future projections

In this example, we want hybrid-statistically downscaled maximum temperature for 2015-2100, area-averaged across Alameda County, and we want to include all available future projections and ensemble members across all scenarios. For now the data are area-averaged over Alameda County during retrieval, but later on (for at least some metrics), area-averaging should occur after metric calculation. 

In [None]:
selections.area_average='Yes'
selections.area_subset='CA counties'
selections.cached_area=['Alameda County']
selections.data_type='Gridded'
selections.downscaling_method='Statistical'
selections.resolution='3 km'
selections.scenario_historical=[]
selections.scenario_ssp=['SSP 3-7.0', 'SSP 2-4.5', 'SSP 5-8.5']
selections.time_slice=(2015, 2100)
selections.timescale='monthly'
selections.units='K'
selections.variable='Maximum air temperature at 2m'

#### 3b) Retrieve and load data

Next we retrieve the future projections data that we have selected, and load it into memory. Like the historical data, this may take a few minutes. 

In [None]:
ssp_data = selections.retrieve()

In [None]:
ssp_data = ck.load(ssp_data)

#### 3c) Evaluate for missing values
We will next check how many missing values are present.

In [None]:
# Check missing values (count once, then report the count and the percentage)
n_missing = int(np.isnan(ssp_data).sum().values)
print("There are", n_missing, "missing values in `ssp_data`, out of", ssp_data.size, "expected values.")
print("This is about", round(n_missing / ssp_data.size * 100), "%.")

#### 3d) Construct annual means 

Next we calculate annual averages of the projected monthly data. 

In [None]:
ann_avg_temp = weighted_temporal_mean(ssp_data)

#### 3e) Calculate anomalies

Then we subtract each ensemble member's own historical mean from all of its future projection values. 

In [None]:
# First broadcast `ann_avg_temp` and `hist_mean` to match dimensions in order to calc anomalies: 
a, b = xr.broadcast(ann_avg_temp, hist_mean.sel(scenario = 'Historical Climate'))

# subtract historical mean from future projection values: 
anoms = ann_avg_temp - b
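
As a small illustration of the anomaly step, the sketch below subtracts a per-simulation baseline from synthetic future values. The scenario and simulation names and all numbers are made up for the example. It relies on xarray aligning on the shared `simulation` coordinate and broadcasting over the remaining dimensions during arithmetic.

```python
import numpy as np
import xarray as xr

# Synthetic future annual means: 2 scenarios x 2 simulations x 3 years (values in K)
future = xr.DataArray(
    np.array(
        [
            [[290.0, 291.0, 292.0], [288.0, 289.0, 290.0]],
            [[291.0, 292.0, 293.0], [289.0, 290.0, 291.0]],
        ]
    ),
    coords={
        "scenario": ["ssp_a", "ssp_b"],
        "simulation": ["sim_1", "sim_2"],
        "time": [2015, 2016, 2017],
    },
    dims=("scenario", "simulation", "time"),
)

# One historical baseline value per simulation
baseline = xr.DataArray(
    [289.0, 287.0], coords={"simulation": ["sim_1", "sim_2"]}, dims="simulation"
)

# Each simulation's own baseline is subtracted from all of its scenarios and years
anomalies = future - baseline
```

Because every simulation is compared against its own baseline, constant offsets between simulations are removed before any variance is calculated.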

## Step 4: Variance calculations

Here we calculate 3 types of uncertainty so we can compare variance: 
- Internal variability (using a large ensemble approach)
- Model uncertainty
- Scenario uncertainty

For more detailed information on internal variability and model uncertainty, check out the `internal_variability.ipynb` and `model_uncertainty.ipynb` notebooks in the Exploratory notebook folder!

#### 4a) Calculate internal variability
While there are several methods of calculating internal variability, here we choose to leverage the **Large Ensemble** (the presence of multiple ensemble members per model). We first calculate the decadal running mean of anomalies, then calculate the variance across all available ensemble members as a function of time, and finally average that variance across time. This results in a single time-invariant estimate of internal variability. 

The decadal running mean is the 10-year mean of 2015-2024, the 10-year mean of 2016-2025, and so on, with each mean labeled by the last year of its window. Note that calculating rolling averages contracts the number of years for which we have data (e.g. to get a 30-year timeseries of rolling decadal averages for 2070-2100, we need data from about 2060 onward), so we have these rolling decadal averages from 2024 onward rather than for the full 2015-2100 period.

In [None]:
dec_mean_anom = anoms.rolling(time=10, center=False).mean() 
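
To see how a trailing 10-year window shortens the usable record, here is a minimal sketch on a synthetic annual series whose `time` coordinate holds integer years (an assumption for the example; the real data may use datetime values).

```python
import numpy as np
import xarray as xr

# Synthetic annual series for 2015-2100; the value in each year is the year itself
years = np.arange(2015, 2101)
series = xr.DataArray(years.astype(float), coords={"time": years}, dims="time")

# Trailing (right-aligned) 10-year mean, as with center=False
smoothed = series.rolling(time=10, center=False).mean()

# Windows with fewer than 10 values are NaN by default
n_missing = int(smoothed.isnull().sum())
first_valid_year = int(smoothed.dropna("time")["time"][0])
```

With this series the first nine years are NaN and the first complete window, covering 2015-2024, is labeled 2024.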

From the decadal averages of anomalies, we calculate the variance across all ensemble members as a function of time. 

In [None]:
var_foft = dec_mean_anom.var(dim='simulation')

For our internal variability calculation we average these variances across time, and then average across the three scenarios. 


In [None]:
# avg across time, then scenario
int_var = var_foft.mean(dim='time').mean() # yields 1 value per scenario, then average across these
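
The reduction order can be checked on a toy array. In the sketch below, every name and number is synthetic: three ensemble members differ by fixed offsets, so the variance across members is the same at every time step and in both scenarios. Note that xarray's `var` uses the population variance (`ddof=0`) by default; pass `ddof=1` for the sample variance.

```python
import numpy as np
import xarray as xr

# 2 scenarios x 3 ensemble members x 4 time steps; members sit at -1, 0, and +1
offsets = np.array([-1.0, 0.0, 1.0])
data = np.broadcast_to(offsets[None, :, None], (2, 3, 4)).copy()
toy = xr.DataArray(
    data,
    coords={
        "scenario": ["ssp_a", "ssp_b"],
        "simulation": ["m1", "m2", "m3"],
        "time": np.arange(2024, 2028),
    },
    dims=("scenario", "simulation", "time"),
)

# Variance across members at each time step (still varies by scenario and time)
var_over_members = toy.var(dim="simulation")

# Average over time, then over the remaining scenario dimension: one scalar
internal = float(var_over_members.mean(dim="time").mean())
```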

#### 4b) Calculate model variability
Here we estimate the (unweighted) model variability in three steps: average each model's ensemble members to get one model mean, calculate the variance across model means as a function of time, and average that variance across scenarios. The result is a time-varying model variability. 

$M(t) = \frac{1}{N_s}\sum_s \mathrm{var}_m(x_{m,s,t})$ 

Here $x_{m,s,t}$ is the decadal average anomaly for model $m$ and scenario $s$ at time $t$, and $N_s$ is the number of scenarios. This follows the variance across model prediction fits described in [Hawkins and Sutton (2009)](https://journals.ametsoc.org/view/journals/bams/90/8/2009bams2607_1.xml), except that the variance here is unweighted.

First, we'll make a key (outside of the data), mapping each simulation to its parent GCM. 

In [None]:
def get_GCM(sim_dim):
    """Extract GCM from simulation name
    Parameters
    ----------
    sim_dim: dimension of xarray.DataArray
    
    Returns
    -------
    to_return: list
    """

    to_return = []
    
    for one_val in sim_dim:
        one_gcm = str(one_val.values).split('_')[1]
        to_return.append(one_gcm)
    return to_return

# gcm/simulation key
gcm_df = pd.DataFrame(get_GCM(dec_mean_anom.simulation), index = dec_mean_anom.simulation,
                      columns = ['gcm'])

# list out unique GCMs 
gcm_list = list(set(gcm_df['gcm'].values))

Iterate through models to calculate ensemble averages for each model.

In [None]:
# Collect one ensemble-mean DataArray per model, then concatenate them
# along a new `gcm` dimension. Rebuilding the list each time the cell runs
# avoids stale results from a previous run.
model_means = []

# Iterate through models
for tgt_mod in gcm_list:
    # select subset of simulations in dec_mean_anom from that model 
    # and average across simulations to get a model average
    # (yields 1 model average of decadally-smoothed anomaly per scenario):
    m = dec_mean_anom.sel(simulation = gcm_df.loc[gcm_df['gcm'] == tgt_mod].index).mean('simulation')
    # add gcm coordinate to concat along
    model_means.append(m.assign_coords({"gcm": tgt_mod}))

mod_df = xr.concat(model_means, dim="gcm")

Calculate variance across models for each scenario, and average across scenarios.  

In [None]:
mod_var = mod_df.var(dim = 'gcm').mean(dim = 'scenario')
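
The same per-model averaging can also be written with xarray's `groupby`, shown here on synthetic data. This is an alternative sketch rather than the notebook's method. The simulation names are hypothetical and assume, as `get_GCM` does, that the model name is the second underscore-separated field.

```python
import numpy as np
import xarray as xr

# Hypothetical simulation names: two members of ModelA, one member of ModelB
sims = ["x_ModelA_r1", "x_ModelA_r2", "x_ModelB_r1"]

# Constant-in-time anomalies: ModelA members at 0 and 2, ModelB member at 3
vals = np.array([0.0, 2.0, 3.0])
data = np.broadcast_to(vals[None, :, None], (2, 3, 4)).copy()
toy = xr.DataArray(
    data,
    coords={
        "scenario": ["ssp_a", "ssp_b"],
        "simulation": sims,
        "time": np.arange(2024, 2028),
    },
    dims=("scenario", "simulation", "time"),
)

# Attach the parent model as a coordinate along `simulation`, then group by it
gcm_labels = [name.split("_")[1] for name in sims]
toy = toy.assign_coords(gcm=("simulation", gcm_labels))
model_means = toy.groupby("gcm").mean(dim="simulation")

# Variance across model means for each scenario, then average across scenarios
model_variance = model_means.var(dim="gcm").mean(dim="scenario")
```

The model means here are 1 (ModelA) and 3 (ModelB), so the variance across models is 1 at every time step.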

#### 4c) Calculate scenario variability

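Calculate scenario variability, similar to how it was calculated in [Hawkins and Sutton (2009)](https://journals.ametsoc.org/view/journals/bams/90/8/2009bams2607_1.xml), who estimate it from the unweighted variance across weighted multimodel means (eq. 6). Here we take the unweighted multimodel mean for each of the three scenarios and calculate the variance across these means. 

$S(t) = \mathrm{var}_s\left(\frac{1}{N_m}\sum_m x_{m,s,t}\right)$ 

Note that the code below averages over all simulations rather than over per-model means, so models with more ensemble members carry more weight in each scenario mean.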


In [None]:
scen_var = dec_mean_anom.mean(dim = 'simulation').var(dim = 'scenario')
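
Averaging over all simulations and averaging over per-model means can give different scenario variances when models contribute different numbers of ensemble members. The sketch below contrasts the two on synthetic data for a single time step. All names and values are hypothetical, and the per-model variant is shown only for comparison.

```python
import numpy as np
import xarray as xr

# Hypothetical simulations: three members of ModelA, one member of ModelB
sims = ["x_ModelA_r1", "x_ModelA_r2", "x_ModelA_r3", "x_ModelB_r1"]
data = np.array(
    [
        [0.0, 0.0, 0.0, 0.0],  # scenario 1
        [1.0, 1.0, 1.0, 3.0],  # scenario 2
        [2.0, 2.0, 2.0, 6.0],  # scenario 3
    ]
)
toy = xr.DataArray(
    data,
    coords={"scenario": ["ssp_a", "ssp_b", "ssp_c"], "simulation": sims},
    dims=("scenario", "simulation"),
)

# Option 1: mean over all simulations (ModelA counts three times as much as ModelB)
scen_var_all_sims = float(toy.mean(dim="simulation").var(dim="scenario"))

# Option 2: mean over per-model means (each model counts once)
gcm_labels = [name.split("_")[1] for name in sims]
model_means = toy.assign_coords(gcm=("simulation", gcm_labels)).groupby("gcm").mean(dim="simulation")
scen_var_model_means = float(model_means.mean(dim="gcm").var(dim="scenario"))
```

In this example the simulation-weighted estimate is smaller because the model with the weaker response has more ensemble members.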

#### 4d) Calculate total variability from internal variability, model uncertainty, and scenario uncertainty

First we calculate **total variability** as the sum of the three components. To add the time-invariant internal variability estimate (`int_var`) to the time-varying model and scenario components, we repeat it into a timeseries of the same length (note: this forces computation if the data are not already loaded). 

In [None]:
int_var_rep = np.repeat(int_var.values, mod_var.size)

In [None]:
# Calculate total variance
tot_var = mod_var.values + int_var_rep + scen_var.values

Next we calculate each uncertainty component (internal variability, model uncertainty, and scenario uncertainty), as well as the **fractional uncertainty components**, i.e. each component's share of total variance. 

In [None]:
# Component uncertainty
modelComponent = mod_var.values
scenarioComponent = scen_var.values
internalComponent = tot_var - mod_var.values - scen_var.values  # equals int_var_rep, but is NaN wherever tot_var is NaN

# Fractional uncertainty
fracModel  = mod_var.values / tot_var
fracInternal  = internalComponent / tot_var
fracScenario = scen_var.values / tot_var
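
A quick sanity check on the fractional components is that they sum to 1 wherever total variance is defined, and are NaN where the rolling mean left no data. The sketch below shows this with plain NumPy; the function name `fractional_uncertainty` and the input values are assumptions made for the example.

```python
import numpy as np


def fractional_uncertainty(model, scenario, internal):
    """Return each component's share of total variance (hypothetical helper).

    `model` and `scenario` are time series; `internal` is a single scalar.
    """
    model = np.asarray(model, dtype=float)
    scenario = np.asarray(scenario, dtype=float)
    internal_series = np.full(model.shape, float(internal))
    total = model + scenario + internal_series
    # Mask the internal component where the total is undefined
    internal_series = np.where(np.isnan(total), np.nan, internal_series)
    return {
        "model": model / total,
        "scenario": scenario / total,
        "internal": internal_series / total,
    }


# First time step is NaN, mimicking years without a complete rolling window
fractions = fractional_uncertainty(
    model=[np.nan, 0.1, 0.3], scenario=[np.nan, 0.1, 0.5], internal=0.2
)
frac_sum = fractions["model"] + fractions["scenario"] + fractions["internal"]
```

With a constant internal term, its fraction shrinks as the model and scenario terms grow, which is the pattern the fractional plot is meant to show.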

## Step 5: Visualize variance
Now, we'll visualize the variance from model uncertainty, scenario uncertainty, and internal variability over time, both in absolute terms and as fractions of total variance. 

In [None]:
mod_var

In [None]:
fig, (ax0, ax1) = plt.subplots(2, 1, sharex=True, figsize=(6, 6))

x = mod_var.time # x-axis (last year of decadal average anomaly) 

# variance of each component (anomalies are in K, so variance is in K^2)
ax0.plot(x, modelComponent, color='b', linewidth=2, label='Model')
ax0.plot(x, scenarioComponent, color='g', linewidth=2, label='Scenario')
ax0.plot(x, internalComponent, color='orange', linewidth=2, label='Internal')
ax0.set_ylabel('Variance [K$^2$]')
ax0.legend()

# fractional variance, stacked from the bottom: model, scenario, internal
ax1.fill_between(x, fracModel+fracScenario, 1, color='orange')
ax1.fill_between(x, fracModel, fracModel+fracScenario, color = 'green')
ax1.fill_between(x, 0, fracModel, color='blue')
ax1.set_ylabel('Fractional variance [0-1]');

## Discussion: Implications for selecting projections
Let's think about this figure in the context of making decisions about what future projections to use.

Towards the end of the century, say 2085-2100, model uncertainty (shown in blue) and scenario uncertainty (shown in green) are both much larger than internal variability (shown in orange). So when choosing projections for this time frame, which models and scenarios you choose will drive differences in your analysis more than which ensemble member you choose from a given model. In other words, if you want to see the greatest range of possible outcomes without looking at *all* projections, focus on outputs from different models and scenarios rather than on different ensemble members, because the greatest differences between the available projections over this period will be between models and scenarios. At 2100, scenario uncertainty contributes the most to total uncertainty, followed by model uncertainty, with internal variability contributing the least. So you can expect the largest differences between projections to be driven by which scenario you're looking at. 

In the first 20 years, however, from 2015 to around 2035, model and scenario uncertainty are quite low, and internal variability is actually the largest share of total uncertainty. So when choosing projections for this time frame, it is important to think carefully about which ensemble member you choose, as differences between ensemble members will drive the range of possible outcomes. 

In the top panel, we observe that total uncertainty increases as time progresses towards 2100. While the proportion of total uncertainty contributed by internal variability declines towards 2100, its absolute contribution, by this estimate, remains constant. 

Remember also that we're looking at one specific example metric (annual average maximum surface temperature) and averaging across one specific geospatial region (Alameda County). The conclusions above about which sources of uncertainty are most important, and when, will likely change if we choose to look at a different metric or region. 

## Potential future additions
- **Area averaging**: Ideally the area averaging function should be separated from the data selection process, and area-averaging should be implemented after calculating metrics and constructing anomalies.  
- **Missing data**: There is quite a bit of missing data. We should look into whether this is due to differences in which models and simulations contribute data to which SSPs, to data not spanning certain time periods (which would be surprising), or to data being lost at some stage in the calculation, such as area averaging. 
- **Data loading**: Xarray objects should be loaded into memory right before plot generation. The code would need rewriting to ensure that loading at that stage is in fact faster than loading at the beginning of the notebook. 
- **Sensitivity testing**: 
    - Large ensemble internal variability sensitivity test: restrict the LE internal variance calculation to models with multiple ensemble members (e.g. remove TaiESM because it has only 1 ensemble member). See what, if any, difference this makes.
    - Model variance sensitivity test:  try sampling a single ensemble member from each model to calculate model variance, rather than first averaging across ensemble members. See what, if any, difference it makes. 
- **Decadal smoothing and 30-yr averages** may not be appropriate for all purposes. A user may wish to fine-tune the time frame over which smoothing occurs for the particular climate variable(s) and metric of interest they wish to work with. 