# Purpose Statement

This notebook introduces how to download our benchmark datasets and demonstrates a single benchmark method for post-processing an ensemble forecast: Ensemble Model Output Statistics (EMOS). It is based on the 2018 paper by Stephan Rasp and Sebastian Lerch. Please see their paper for a fuller exploration of the EMOS system used here and of other interesting methods for post-processing benchmarking. A brief description is also given below.

Rasp & Lerch GitHub repository with the relevant code: [GITHUB](https://github.com/slerch/ppnn)

Paper URL: [Rasp & Lerch 2018](https://arxiv.org/abs/1805.09091)

#### Ensemble Model Output Statistics (EMOS)
----
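
As a brief sketch, the EMOS variant built later in this notebook (`build_EMOS_network_keras`) treats the observation $y$ as Gaussian, with a mean and standard deviation that are linear functions of the ensemble mean $\bar{x}$ (`t2m_fc_mean`) and the ensemble standard deviation $s$ (`t2m_fc_std`). The symbols $a, b, c, d$ are notation introduced here for the four trainable parameters, that is, the bias and weight of each `Dense(1)` layer:

$$
y \mid \bar{x}, s \sim \mathcal{N}(\mu, \sigma^2), \qquad \mu = a + b\,\bar{x}, \qquad \sigma = c + d\,s
$$

The parameters are fitted by minimizing the mean CRPS over the training set.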


#### CRPS as a 'Proper Scoring' Metric for Ensemble Evaluation
----
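
For a Gaussian predictive distribution the CRPS has a closed form, and this is the expression that `crps_cost_function` evaluates further down. Writing $z = (y - \mu)/\sigma$ for the standardized error, and $\varphi$ and $\Phi$ for the standard normal PDF and CDF:

$$
\mathrm{CRPS}\big(\mathcal{N}(\mu, \sigma^2),\, y\big) = \sigma \left[ z\,\big(2\Phi(z) - 1\big) + 2\,\varphi(z) - \frac{1}{\sqrt{\pi}} \right]
$$

The score is in the units of the forecast variable, and lower values are better.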


## Import Necessary Packages:

In [None]:
import numpy as np
import pandas as pd
import glob
import os
import matplotlib.pyplot as plt
from datetime import datetime
from datetime import timedelta
import time
import xarray as xr
from netCDF4 import Dataset



import keras
from keras.layers import Input, Dense, merge, Embedding, Flatten, Concatenate
from keras.models import Model, Sequential
from keras.optimizers import Adam, SGD
import keras.backend as K
from keras.callbacks import EarlyStopping

if keras.backend.backend() == 'tensorflow':
    from tensorflow import erf
else:
    from theano.tensor import erf
# import utils

import random
random.seed(1)  # for reproducibility (seeds Python's `random` module only)

## Functions We Will Need to Evaluate and Train Our EMOS model:

In [194]:
def crps_cost_function(y_true, y_pred, theano=False):
    """Compute the CRPS cost function for a normal distribution defined by
    the mean and standard deviation.
    Code inspired by Kai Polsterer (HITS).
    Args:
        y_true: True values
        y_pred: Tensor containing predictions: [mean, std]
        theano: Set to true if using this with pure theano.
    Returns:
        mean_crps: Scalar with mean CRPS over batch
    """

    # Split input
    mu = y_pred[:, 0]
    sigma = y_pred[:, 1]
    # Ugly workaround for different tensor allocation in keras and theano
    if not theano:
        y_true = y_true[:, 0]   # Need to also get rid of axis 1 to match!

    # The network can output a negative sigma, so we first convert it
    # to the variance and then take the square root again, which
    # yields |sigma|.
    var = K.square(sigma)
    # The following three variables are just for convenience
    loc = (y_true - mu) / K.sqrt(var)
    phi = 1.0 / np.sqrt(2.0 * np.pi) * K.exp(-K.square(loc) / 2.0)
    Phi = 0.5 * (1.0 + erf(loc / np.sqrt(2.0)))
    # First we will compute the crps for each input/target pair
    crps =  K.sqrt(var) * (loc * (2. * Phi - 1.) + 2 * phi - 1. / np.sqrt(np.pi))
    # Then we take the mean. The cost is now a scalar
    return K.mean(crps)



def build_EMOS_network_keras(compile=False, optimizer='sgd', lr=0.1):
    """Build (and maybe compile) EMOS network in keras.
    Args:
        compile: If true, compile model
        optimizer: String of keras optimizer
        lr: learning rate
    Returns:
        model: Keras model
    """
    mean_in = Input(shape=(1,))
    std_in = Input(shape=(1,))
    mean_out = Dense(1, activation='linear')(mean_in)
    std_out = Dense(1, activation='linear')(std_in)
    x = keras.layers.concatenate([mean_out, std_out], axis=1)
    model = Model(inputs=[mean_in, std_in], outputs=x)

    if compile:
        opt = keras.optimizers.__dict__[optimizer](lr=lr)
        model.compile(optimizer=opt, loss=crps_cost_function)
    return model
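
As a sanity check on `crps_cost_function`, the same closed-form Gaussian CRPS can be written without Keras. The sketch below is illustrative only: the helper names `gaussian_crps` and `mean_gaussian_crps` are hypothetical and not part of the original notebook, and the inputs are toy numbers.

```python
import math


def gaussian_crps(y, mu, sigma):
    """Closed-form CRPS of N(mu, sigma^2) for a single observation y."""
    sigma = abs(sigma)  # the Keras loss uses sqrt(sigma^2), i.e. |sigma|
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))


def mean_gaussian_crps(ys, mus, sigmas):
    """Average CRPS over paired observations and forecasts."""
    scores = [gaussian_crps(y, m, s) for y, m, s in zip(ys, mus, sigmas)]
    return sum(scores) / len(scores)


# A forecast centred on the observation scores lower (better) than a biased one.
centred = gaussian_crps(y=3.0, mu=3.0, sigma=1.0)
biased = gaussian_crps(y=3.0, mu=5.0, sigma=1.0)
print(centred, biased)
```

Evaluating both this helper and the Keras loss on the same small batch should give matching values, which is a quick way to catch a sign or axis mistake in the tensor version.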

## Gather the Data from our Zarr Dataset

In [124]:
AllDat = xr.open_zarr('/Users/will/Desktop/Haupt/Sebastian/ECMWFt2m_zar/')
# Add the mean and standard deviation of the ensemble, then remove the member dimension
AllDat = AllDat.assign(t2m_fc_mean=AllDat.t2m_fc.mean(dim='member'))
AllDat = AllDat.assign(t2m_fc_std=AllDat.t2m_fc.std(dim='member'))
AllDat = AllDat.drop('t2m_fc')
AllDat.squeeze()  # note: returns a new Dataset; the result is not assigned, so AllDat is unchanged
del AllDat['member']
AllDat

Output (dask array summary, in display order; all chunks are `numpy.ndarray`):

| Array | Bytes | Chunk bytes | Shape | Chunk shape | Tasks | Chunks | Type |
|---|---|---|---|---|---|---|---|
| 1 | 2.15 kB | 2.15 kB | (537,) | (537,) | 2 | 1 | float32 |
| 2 | 2.15 kB | 2.15 kB | (537,) | (537,) | 2 | 1 | float32 |
| 3 | 2.15 kB | 2.15 kB | (537,) | (537,) | 2 | 1 | float32 |
| 4 | 19.33 kB | 19.33 kB | (537,) | (537,) | 2 | 1 | \|S36 |
| 5 | 2.15 kB | 2.15 kB | (537,) | (537,) | 2 | 1 | float32 |
| 6 | 7.85 MB | 493.56 kB | (3653, 537) | (914, 135) | 17 | 16 | float32 |
| 7 | 7.85 MB | 246.78 kB | (3653, 537) | (457, 135) | 609 | 32 | float32 |
| 8 | 7.85 MB | 246.78 kB | (3653, 537) | (457, 135) | 641 | 32 | float32 |


## Split the data into appropriate Training, Validating, and Testing chunks

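Let's train on 2007-01-03 through 2013-01-01, validate on 2014, and test on 2016.
We split by date, rather than at random, so that the temporal correlation between forecasts in different sets has decayed to approximately zero; the unused years 2013 and 2015 act as buffers between the sets.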

In [195]:
#Train
Dat_Training = AllDat.loc[dict(time=slice('2007-01-03', '2013-01-01'))]
df_Train=Dat_Training.dropna(dim='station')
df_Train=df_Train.to_dataframe()
df_Train=df_Train.droplevel(0)

#Validate
Dat_Validate = AllDat.loc[dict(time=slice('2014-01-01', '2014-12-31'))]
df_Validate = Dat_Validate.dropna(dim='station')
df_Validate=df_Validate.to_dataframe()
df_Validate=df_Validate.droplevel(0)

#Test
Dat_Test= AllDat.loc[dict(time=slice('2016-01-01', '2016-12-31'))]
df_Test = Dat_Test.dropna(dim='station')
df_Test=df_Test.to_dataframe()
df_Test=df_Test.droplevel(0)
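
The date windows used above can be checked on a toy series before running them on the real data. The sketch below is an illustration under stated assumptions: `toy` is a synthetic daily DataFrame standing in for a single station, not the Zarr dataset, and only the slice boundaries are taken from the cell above.

```python
import numpy as np
import pandas as pd

# Synthetic daily series standing in for one station (illustration only).
times = pd.date_range("2007-01-03", "2016-12-31", freq="D")
toy = pd.DataFrame({"t2m_obs": np.zeros(len(times))}, index=times)

# Same date windows as the cell above; label-based slices include both endpoints.
train = toy.loc["2007-01-03":"2013-01-01"]
validate = toy.loc["2014-01-01":"2014-12-31"]
test = toy.loc["2016-01-01":"2016-12-31"]

# Each set must end strictly before the next one begins.
assert train.index.max() < validate.index.min()
assert validate.index.max() < test.index.min()

# Length of the buffer between consecutive sets, in days.
gap_train_val = (validate.index.min() - train.index.max()).days
gap_val_test = (test.index.min() - validate.index.max()).days
print(gap_train_val, gap_val_test)
```

With these windows roughly a year separates each pair of sets, so forecasts on either side of a boundary are far apart in time.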

## Build the EMOS network using Keras to Ingest Data and Post-Process Forecast:

In [127]:
model_keras = build_EMOS_network_keras(compile=True, optimizer='sgd', lr=0.1)
model_keras.summary()

Model: "model_8"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_15 (InputLayer)           (None, 1)            0                                            
__________________________________________________________________________________________________
input_16 (InputLayer)           (None, 1)            0                                            
__________________________________________________________________________________________________
dense_15 (Dense)                (None, 1)            2           input_15[0][0]                   
__________________________________________________________________________________________________
dense_16 (Dense)                (None, 1)            2           input_16[0][0]                   
____________________________________________________________________________________________

In [201]:
# FIT PARAMETERS
# batch size and number of epochs to train:
bn = 1024
epcs = 40

# train data
x1 = np.array(df_Train['t2m_fc_mean'])
x2 = np.array(df_Train['t2m_fc_std'])
y = np.array(df_Train['t2m_obs'])
# validate data
x1_v = np.array(df_Validate['t2m_fc_mean'])
x2_v = np.array(df_Validate['t2m_fc_std'])
y_v = np.array(df_Validate['t2m_obs'])
# test data
x1_t = np.array(df_Test['t2m_fc_mean'])
x2_t = np.array(df_Test['t2m_fc_std'])
y_t = np.array(df_Test['t2m_obs'])

#### KERAS CALLBACKS TO ADD TO TRAINING ######
filp = '/where/your/best/model/is/saved'
svbst = keras.callbacks.ModelCheckpoint(filp, monitor='val_loss',
                                        verbose=1, save_best_only=True, save_weights_only=False)
# add svbst to the callbacks list in fit() to save the best model on your personal machine.

earlystop = keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0, patience=10,
                                          verbose=1, mode='auto', restore_best_weights=True)
rdclr = keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=5, verbose=1,
                                          mode='auto', min_delta=0.0001, cooldown=0, min_lr=0)

#### Fitting the Model ######
# NOTE: validation_data uses the test arrays here, which matches the recorded output below;
# pass ([x1_v, x2_v], y_v) instead to keep the test set unseen during training.
model_keras.fit([x1, x2], y, batch_size=bn, validation_data=([x1_t, x2_t], y_t),
                epochs=epcs, callbacks=[earlystop, rdclr])

Train on 361515 samples, validate on 136884 samples
Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40

Epoch 00007: ReduceLROnPlateau reducing learning rate to 9.999999310821295e-05.
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40

Epoch 00012: ReduceLROnPlateau reducing learning rate to 9.999999019782991e-06.
Epoch 13/40
Restoring model weights from the end of the best epoch
Epoch 00013: early stopping


<keras.callbacks.callbacks.History at 0x639776710>

## Gather our Predictions and Evaluate the Method

In [113]:
# Now make our predictions:
preds = model_keras.predict([x1_t, x2_t])


In [200]:
#creating a dataframe to save and store the results of the Post-Processing method. 
#dictionary for Pandas
d = {'time_validity': df_Test.index, 'Station_ID':df_Test.station_id,'Obs': df_Test.t2m_obs,'Emos_mean': preds[:,0],'Emos_std': preds[:,1],
    'ECMWF_mean':df_Test.t2m_fc_mean,'ECMWF_std':df_Test.t2m_fc_std}
results_df = pd.DataFrame(d)

#Sorting DataFrame by time and Station ID
results_df = results_df.sort_values(by = ['time','Station_ID'], ascending = [True, True])
results_df

| time | time_validity | Station_ID | Obs | Emos_mean | Emos_std | ECMWF_mean | ECMWF_std |
|---|---|---|---|---|---|---|---|
| 2016-01-01 | 2016-01-01 | 44.0 | 4.3 | 4.500090 | -1.653577 | 4.001777 | 0.604245 |
| 2016-01-01 | 2016-01-01 | 71.0 | 3.3 | 1.666450 | -2.603714 | 0.933522 | 1.482697 |
| 2016-01-01 | 2016-01-01 | 78.0 | 3.2 | 4.455047 | -1.715894 | 3.953004 | 0.661860 |
| 2016-01-01 | 2016-01-01 | 91.0 | 3.5 | 1.998814 | -2.807781 | 1.293405 | 1.671367 |
| 2016-01-01 | 2016-01-01 | 102.0 | 7.1 | 6.460725 | -1.351567 | 6.124743 | 0.325021 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2016-12-31 | 2016-12-31 | 13713.0 | -4.0 | -2.071309 | -1.281187 | -3.113707 | 0.259951 |
| 2016-12-31 | 2016-12-31 | 13777.0 | -4.1 | 0.198184 | -1.382643 | -0.656309 | 0.353752 |
| 2016-12-31 | 2016-12-31 | 15000.0 | 2.0 | -1.332599 | -1.479688 | -2.313835 | 0.443476 |
| 2016-12-31 | 2016-12-31 | 15207.0 | -4.2 | -2.964258 | -1.314240 | -4.080589 | 0.290510 |
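
The `Emos_std` values in the results above are negative because the output layer is linear and unconstrained, while the CRPS loss only sees sigma squared. A minimal sketch of how the predicted spread might be used downstream, assuming the absolute value is the intended standard deviation; `toy_preds` is a made-up array in the same `[mean, std]` layout as `preds`:

```python
import numpy as np

# Toy predictions in the same [mean, std] layout as `preds`; values are illustrative.
toy_preds = np.array([[4.5, -1.65], [1.7, -2.60], [6.5, 1.35]])

emos_mean = toy_preds[:, 0]
emos_std = np.abs(toy_preds[:, 1])  # the CRPS loss depends only on sigma**2

# Central 90% prediction interval of a Gaussian: mean +/- 1.645 * std.
lower = emos_mean - 1.645 * emos_std
upper = emos_mean + 1.645 * emos_std
print(np.column_stack([lower, upper]))
```

Taking the absolute value before storing `Emos_std` keeps the column directly comparable with `ECMWF_std`.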


## Evaluate CRPS Before and After EMOS Post-Processing:

In [202]:
crps_preds = model_keras.evaluate([x1_t,x2_t],y_t)
#jump through hoops to get data in the right form for loss function:
ECMWFt2m_pred = np.transpose(np.array([df_Test.t2m_fc_mean, df_Test.t2m_fc_std]))
crps_ECMWF= keras.backend.eval(crps_cost_function(np.expand_dims(y_t,axis=1),ECMWFt2m_pred ))



In [203]:
print('Global CRPS after EMOS post-processing:', np.round(crps_preds, 2))
print('Global CRPS of the raw ensemble:', np.round(crps_ECMWF, 2))

Global CRPS after EMOS post-processing: 1.02
Global CRPS of the raw ensemble: 1.17
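
One common way to summarize the two scores is a skill score relative to the raw ensemble. The sketch below uses the rounded values printed above; the helper name `crps_skill_score` is hypothetical and introduced only for this illustration.

```python
def crps_skill_score(crps_model, crps_reference):
    """Fractional CRPS improvement over a reference (1 is perfect, 0 is no gain)."""
    return 1.0 - crps_model / crps_reference


# Rounded scores printed above: 1.02 (EMOS) and 1.17 (raw ensemble).
crpss = crps_skill_score(1.02, 1.17)
print(f"CRPSS = {crpss:.3f}")  # CRPSS = 0.128
```

In other words, with these rounded numbers EMOS lowers the CRPS of the raw ensemble by roughly 13%.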


### Again, see Rasp and Lerch 2018 to explore other methods for post-processing this data.

GitHub for relevant code: [GITHUB](https://github.com/slerch/ppnn)

Paper URL: [Rasp & Lerch 2018](https://arxiv.org/abs/1805.09091)

## Spread/Skill Plot
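
A spread/skill diagram compares the predicted spread with the error actually observed: cases are binned by predicted standard deviation, and the RMSE of the predicted mean is computed in each bin. The sketch below is not the notebook's own implementation. It uses synthetic, perfectly calibrated data, and the bin count and variable names are assumptions; for the real results, `spread` would be the absolute value of `Emos_std` and `error` would be `Obs - Emos_mean`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, perfectly calibrated forecasts (illustration only): each error is
# drawn with the standard deviation that the forecast itself predicts.
n = 20000
spread = rng.uniform(0.5, 3.0, size=n)   # predicted std
error = rng.normal(0.0, spread)          # observation minus predicted mean

# Group cases into equally populated bins of increasing spread.
n_bins = 10
order = np.argsort(spread)
bins = np.array_split(order, n_bins)

# Per bin: RMS spread (x-axis) and RMSE of the mean (y-axis).
bin_spread = np.array([np.sqrt(np.mean(spread[idx] ** 2)) for idx in bins])
bin_rmse = np.array([np.sqrt(np.mean(error[idx] ** 2)) for idx in bins])

for s, e in zip(bin_spread, bin_rmse):
    print(f"spread={s:.2f}  rmse={e:.2f}")
```

Plotting `bin_rmse` against `bin_spread` together with the 1:1 line gives the diagram: points below the line indicate a spread that is too large (under-confident), and points above it a spread that is too small (over-confident).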