# Effect of Forcings on CAMELS Simulations

Now we can look at the output and see which forcing variables have the most error. This code is meant to be run on the subset of basins that the ensembles of different model decisions were run on in the `camels_pysumma.ipynb` notebook. This notebook will not run all the way through if the ensembles were not created.

First we load the imports.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import xarray as xr

<br>

### Edit these paths to point to your own folders

In [None]:
top = '/glade/work/ashleyvb'
folder = top+'/CAMELs'
folders = folder+'/summa_camels'

<br>

# Summary Statistics of Error on Output
Let's look at some error metrics by HRU.
A KGE of 1 means perfect agreement, and a KGE below 0 means the mean is a better guess. 
A bias of 0 means perfect agreement, and larger values mean larger error. Bias is computed here as the mean absolute relative error. 
Both metrics add 1 to the terms they divide by, so we never divide by 0. 
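
As a rough sketch of the two metrics described above, the NumPy functions below compute the offset KGE and the offset bias for a single time series. The names `kge_offset` and `bias_offset` and the sample arrays are illustrative assumptions and do not appear elsewhere in this notebook, which applies the same formulas to whole xarray datasets.

```python
import numpy as np

def kge_offset(sim, truth):
    """KGE with 1 added to the std and mean terms before dividing."""
    r = np.corrcoef(sim, truth)[0, 1]              # linear correlation
    alpha = (sim.std() + 1) / (truth.std() + 1)    # variability ratio (offset by 1)
    beta = (sim.mean() + 1) / (truth.mean() + 1)   # mean ratio (offset by 1)
    return 1 - np.sqrt((r - 1)**2 + (alpha - 1)**2 + (beta - 1)**2)

def bias_offset(sim, truth):
    """Mean absolute error relative to (truth + 1)."""
    return np.mean(np.abs(sim - truth) / (truth + 1))

# Made-up series, for illustration only
truth = np.array([0.0, 1.0, 2.0, 4.0, 3.0])
sim_perfect = truth.copy()
sim_shifted = truth + 0.5   # same shape, constant offset

print(kge_offset(sim_perfect, truth), bias_offset(sim_perfect, truth))
print(kge_offset(sim_shifted, truth), bias_offset(sim_shifted, truth))
```

The identical series gives a KGE of 1 and a bias of 0. The shifted series keeps its correlation and spread, so only the mean term lowers its KGE.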

In [None]:
# truth data set
sim_truth = xr.open_dataset(folders+'/output/merged_day/NLDAStruth_hru.nc')
the_hru = np.array(sim_truth['hruId'])

In [None]:
# Forcings to hold constant or take from MetSim, plus lists of error metrics, estimation kinds, and seasons
cm_vars= ['all','airpres','airtemp','LWRadAtm','pptrate','spechum','SWRadAtm','windspd']
error_kind = ['bias','kge']
est_kind = ['constant','metsim']
seas_kind = ['YEAR','DJF','MAM','JJA','SON']
#forcing, liquid water fluxes for the soil domain, turbulent heat transfer, snow, vegetation, derived 
forc_sim = np.delete(cm_vars,0)
comp_sim=['scalarSurfaceRunoff','scalarAquiferBaseflow','scalarInfiltration','scalarRainPlusMelt','scalarSoilDrainage',
          'scalarLatHeatTotal','scalarSenHeatTotal','scalarSnowSublimation',
          'scalarSWE',
          'scalarCanopyWat',
          'scalarNetRadiation','scalarTotalET','scalarTotalRunoff','scalarTotalSoilWat']
var_sim = np.concatenate([forc_sim, comp_sim])

In [None]:
# definitions for KGE computation
def covariance(x,y,dims=None):
    return xr.dot(x-x.mean(dims), y-y.mean(dims), dims=dims) / x.count(dims)

def correlation(x,y,dims=None):
    return covariance(x,y,dims) / (x.std(dims) * y.std(dims))
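
The two definitions above amount to a Pearson correlation built from a population covariance (divide by N) and population standard deviations. A minimal NumPy check of that identity, using made-up random series rather than model output, might look like this:

```python
import numpy as np

# Synthetic series, for illustration only
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.7 * x + rng.normal(scale=0.5, size=200)

# Population covariance: sum of products of deviations, divided by N
cov = np.sum((x - x.mean()) * (y - y.mean())) / x.size

# np.std defaults to ddof=0, which matches the divide-by-N covariance
r_manual = cov / (x.std() * y.std())

# Reference value from NumPy's built-in correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]

print(np.isclose(r_manual, r_numpy))
```

The normalization matters only for consistency: as long as the covariance and both standard deviations use the same divisor, it cancels in the ratio.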

In [None]:
# set up xarray
hrud = sim_truth['hru'] #indices here are 0 to number of basins
shape = (len(hrud), len(cm_vars),len(est_kind), len(error_kind),len(seas_kind))
dims = ('hru','var','estimation','error','season')
coords = {'hru': hrud, 'var':cm_vars, 'estimation':est_kind, 'error':error_kind, 'season':seas_kind}
error_data = xr.Dataset(coords=coords)
for s in var_sim:
    error_data[s] = xr.DataArray(data=np.full(shape, np.nan),
                                 coords=coords, dims=dims,
                                 name=s)

<br>
Now run the actual computations of KGE and bias.

In [None]:
%%time
truth0 = sim_truth.drop_vars('hruId').load()
for v in cm_vars:
    for c in est_kind:     
        sim0 = xr.open_dataset(folders+'/output/merged_day/NLDAS' + c + '_' + v +'_hru.nc')
        sim0 = sim0.drop_vars('hruId').load()
        for i, t in enumerate(seas_kind):     
            if i==0: 
                truth = truth0
                sim = sim0
            if i>0: 
                truth = truth0.sel(time=truth0['time.season']==t)
                sim = sim0.sel(time=sim0['time.season']==t)
                
            r = sim.mean(dim='time') #to set up xarray since xr.dot not supported on dataset and have to do loop
            for s in var_sim:         
                r[s] = correlation(sim[s],truth[s],dims='time')
            # KGE value for each hru, add 1 so no nan
            ds = 1 - np.sqrt( np.square(r-1) 
                + np.square( (sim.std(dim='time')+1)/(truth.std(dim='time')+1) - 1) 
                + np.square( (sim.mean(dim='time')+1)/(truth.mean(dim='time')+1) - 1) )
            ds0 = ds.load()
            # bias value for each hru, add 1 so no nan
            ds = np.abs(sim-truth)/(truth+1) 
            ds1 = ds.mean(dim='time').load()
            for s in var_sim:
                error_data[s].loc[:,v,c,'kge',t]  = ds0[s]
                error_data[s].loc[:,v,c,'bias',t] = ds1[s]
    print(v)

In [None]:
# change coordinates and save in case the notebook hangs
error_data = error_data.assign_coords(hru=sim_truth['hruId'])
error_data.to_netcdf(folder+'/regress_data/error_data.nc') 

<br>
KGE does not need to be normalized. We plot the HRU error as a stack of values, one color band per forcing variable, where a band with no error (KGE of 1) has a height of 1. Values less than 0 are plotted as 0. 
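
The stacking described above can be sketched without any plotting: clip negative KGE values to 0, then start the band for forcing number `j` at height `j`. The small `kge` array below is hypothetical and only stands in for one output variable's slice of `error_data`.

```python
import numpy as np

# Hypothetical KGE values: rows are 3 HRUs, columns are 4 forcing variables
kge = np.array([[1.0, 0.8, -0.3,  0.5],
                [0.9, 0.2,  0.6, -1.2],
                [0.4, 1.0,  0.0,  0.7]])

# Negative KGE values are plotted as 0
heights = np.where(kge > 0, kge, 0.0)

# Band j starts at y = j, so a perfect KGE of 1 fills the band exactly
n_hru, n_var = heights.shape
bottoms = np.tile(np.arange(n_var), (n_hru, 1))
tops = bottoms + heights

print(tops)
```

Because each band has a fixed base rather than sitting on the previous bar, a thin band for one forcing does not shift the bands above it, which keeps the colors comparable across HRUs.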

In [None]:
#error_data =  xr.open_dataset(folder+'/regress_data/error_data.nc') # read this in case the notebook hangs

In [None]:
# Setup plots
x = np.arange(len(hrud))
col_vars = ['gray','y','r','g','orange','c','m','b']
letter = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
wid = int(np.ceil(len(var_sim)/3)) # subplot columns; ceil was not imported, so use NumPy
inc = max(len(hrud)//10, 1) # tick spacing, at least 1
xtic = np.arange(0, len(hrud),inc).tolist()
xtic =[int(i) for i in xtic]
xtics =[str(i+1) for i in xtic]
labels =["V"+i for i in xtics]

<br>
Do the plotting. 

In [None]:
%%time
# Just plotting all seasons. Maybe add winter (ind 1), and summer (ind 3) if you want to see more detail. 
ind = [0]#,1,3]
seas_kind0 = [ seas_kind[i] for i in ind]
for c in est_kind:      
    for t in seas_kind0:     
        plot1 = plt.figure(1, figsize = (20,10))

        for i, s in enumerate(var_sim):
            data0 = error_data[s].loc[:,:,c,'kge',t]
            data = data0.where(data0>0,0) #make the negative values be 0
            data_Master = [0] * len(hrud)
    
            plot2 = plt.subplot(3,wid,i+1)
            for j, v in enumerate(cm_vars):
                plt.bar(height = data.loc[:,v], x = x, width = 1.0, color = col_vars[j], bottom = data_Master)
                #data_Master = [m + n for m, n in zip(data_Master, data.loc[:,v])]
                data_Master = [j+1] * len(hrud)
         
            plt.title('('+letter[i]+') '+s)
            plt.ylim(0,len(cm_vars))
            plt.xticks(xtic, labels, fontsize = 5)
            plt.yticks(np.arange(0, len(cm_vars)+.05, 1).tolist())
            plt.tick_params(axis = "x", which = "both", bottom = False, top = False)
            plt.xlabel("CAMELS basin ("+labels[0]+"-"+labels[-1]+")", fontsize = 9)
            plt.ylabel("KGE", fontsize = 9)

        plt.subplots_adjust(hspace = .4)

        for j, v in enumerate(cm_vars):
            plt.scatter([],[], color = col_vars[j], label = t + '_NLDAS_' + c + '_' + v)
        plt.figlegend(loc = 'lower right')
        plt.show()

<br>
We see that pptrate and air pressure would be better off held constant than set to MetSim values (thinner orange and yellow layers in the MetSim plots), but that air pressure does not matter in the calculation of the other variables (only in the simulation of air pressure itself). Air temperature has less error with MetSim. 
By season, there is more error in the winter for both MetSim and constant forcings.

<br>

# Ranking Changes with other Model Configurations
Let's compare this model output to output from the other model configurations, and see if we can change the rankings at all. 
This may be a good idea if, for example, wind speed ranks as a variable contributing highly to model error and we do not have good estimates of wind speed. In that example, we would want to find a model configuration that ranks wind speed lower in error contribution.
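
One possible way to compare rankings is sketched below. It assumes we already have a mean KGE per forcing for each configuration. The configuration names `config_a` and `config_b`, the helper `error_rank`, and all numbers are made up for illustration and are not results from these simulations.

```python
import numpy as np

forcings = ['airpres', 'airtemp', 'LWRadAtm', 'pptrate',
            'spechum', 'SWRadAtm', 'windspd']

# Made-up mean KGE per forcing for two hypothetical configurations
mean_kge = {
    'config_a': np.array([0.95, 0.80, 0.85, 0.40, 0.75, 0.70, 0.55]),
    'config_b': np.array([0.95, 0.70, 0.85, 0.45, 0.60, 0.65, 0.90]),
}

def error_rank(kge_values):
    """Rank 1 = largest error contribution (lowest KGE)."""
    order = np.argsort(kge_values)   # ascending KGE is descending error
    ranks = np.empty(len(kge_values), dtype=int)
    ranks[order] = np.arange(1, len(kge_values) + 1)
    return dict(zip(forcings, ranks.tolist()))

ranks = {name: error_rank(vals) for name, vals in mean_kge.items()}
for name, r in ranks.items():
    print(name, 'windspd rank:', r['windspd'])
```

In this toy case the second configuration would be preferred if wind speed is poorly known, since `windspd` falls further down its error ranking there.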