# __Preparation of train / validation split and training__

## __Splitting time series into train / test data__

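Note that the test set was already split off at the beginning. Test and validation splits are often confused: the validation set is used during model development (for example, to compare models or tune hyperparameters), while the test set is held back for the final evaluation.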

Take a closer look at the table below. If we want to split our dataset into train and test data, we have to think through a few points. Hypothetically, imagine we make a clean cut so that every row with Timestamp > 2009-07-01 05:00:00 goes to the test set. Then almost all values of lag1 in the train set also appear in lag12 of the test set. The same holds, to a lesser extent, for all remaining lag columns. The two sets would therefore share values, which is called leakage.

<img src="../images/leaking_example.png"> <br/>
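The overlap can be reproduced on a small toy series. The following is a minimal sketch with made-up data: the frame `toy`, the column name `target`, the cut position `k` and the helper `leaked_targets` are all assumptions for illustration and not part of the dataset used here.

```python
import pandas as pd

# Hypothetical toy series: 10 readings at 5-minute intervals (values are made up)
index = pd.date_range("2009-07-01 04:00:00", periods=10, freq="5min")
toy = pd.DataFrame({"target": [float(i) for i in range(10)]}, index=index)

# Lag features: lagN holds the target from N steps earlier
max_lag = 2
lag_cols = [f"lag{lag}" for lag in range(1, max_lag + 1)]
for lag in range(1, max_lag + 1):
    toy[f"lag{lag}"] = toy["target"].shift(lag)

k = 6  # rows 0..5 become the train set
train = toy.iloc[:k]
test_clean_cut = toy.iloc[k:]           # test starts directly after the cut
test_with_gap = toy.iloc[k + max_lag:]  # the first max_lag rows after the cut are skipped


def leaked_targets(train_part, test_part):
    """Return the train target values that reappear in the lag features of test_part."""
    test_lag_values = set(test_part[lag_cols].to_numpy().ravel())
    return sorted(float(v) for v in set(train_part["target"]) & test_lag_values)


leak_clean = leaked_targets(train, test_clean_cut)
leak_gap = leaked_targets(train, test_with_gap)
print(leak_clean)
print(leak_gap)
```

With the clean cut, the last two train targets (4.0 and 5.0) show up again in the lag columns of the test rows. With a gap of `max_lag` rows, none do.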

In [2]:
import pandas as pd

In [3]:
df = pd.DataFrame()

#a train / validation split can be done with sklearn.model_selection.train_test_split (with shuffle=False, to keep the chronological order)
#it can also be done manually in a few simple lines; here, the first 80 % of the dataset becomes the training set

k = int(df.shape[0] * 0.8)

data_train = df.iloc[:k,:]
data_validation = df.iloc[k:,:]
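`train_test_split` shuffles the rows by default, which would mix past and future observations in a time series. The following sketch, on a hypothetical toy frame (`toy` and its values are made up), shows that passing `shuffle=False` reproduces the manual `iloc` split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for df (values are made up)
toy = pd.DataFrame({"value": range(10)})

# shuffle=False keeps the original row order, so the first 80 % become the training set
toy_train, toy_validation = train_test_split(toy, train_size=0.8, shuffle=False)

# Compare with the manual split
k = int(toy.shape[0] * 0.8)
same_train = toy_train.equals(toy.iloc[:k, :])
same_validation = toy_validation.equals(toy.iloc[k:, :])
print(same_train, same_validation)
```

Neither variant leaves a gap between the two sets, so the lag overlap described above still has to be handled separately.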

### __Exercise 1__

Consider what was mentioned above about train / test leakage and write a function that takes our dataframe df as input and returns the dataset split into X_train, y_train, X_test, y_test while avoiding leakage.

In [4]:
def train_validation_ts(df, relative_train, maximal_lag, horizon):
    
    #your code here
    
    return X_train, y_train, X_test, y_test

In [14]:
def train_validation_ts(df, relative_train, maximal_lag, horizon):
    '''
    Time series (ts) split function: creates a train/test set while taking into account the potential overlap between the two caused by the lag features
    X_train, y_train, X_test, y_test = ...
    df=must contain the target column "horizon{horizon}" and the raw column "t CO2-e / MWh"; all other columns are used as features
    relative_train=fraction of the total dataset used for training; must be between 0 and 1
    maximal_lag=largest lag number used in the lag feature engineering
    horizon=forecast horizon; selects the target column "horizon{horizon}"
    '''
    k = int(df.shape[0] * relative_train)
    data_train = df.iloc[:k,:]
    #to avoid overlap between train and test data, the first maximal_lag rows after the cut are skipped
    data_test = df.iloc[k+maximal_lag:,:]
    
    #sanity check: every train timestamp must lie before every test timestamp
    assert data_train.index.max() < data_test.index.min()
    
    #the target and the raw signal are removed from the features
    dropped_columns = [f"horizon{horizon}","t CO2-e / MWh"]
    #returns in the sequence X_train, y_train, X_test, y_test
    return (data_train.drop(columns=dropped_columns), data_train[f"horizon{horizon}"],
            data_test.drop(columns=dropped_columns), data_test[f"horizon{horizon}"])


In [6]:
file_path = '../data/train_time_features.pkl'
df = pd.read_pickle(file_path)

In [7]:
print(df.index.min())
print(df.index.max())
df.head()

2009-07-01 04:00:00
2018-05-31 23:55:00


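| | t CO2-e / MWh | year | minute_sin | minute_cos | hour_sin | hour_cos | weekday_sin | month_sin | month_cos | lag1 | ... | lag4 | lag5 | lag6 | lag7 | lag8 | lag9 | lag10 | lag11 | lag12 | horizon0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2009-07-01 04:00:00 | 0.991217 | 2009 | 0.0 | 1.0 | 0.866025 | 0.5 | 0.0 | -0.5 | -0.866025 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2009-07-01 04:05:00 | 0.0 | 2009 | 0.5 | 0.8660254 | 0.866025 | 0.5 | 0.0 | -0.5 | -0.866025 | 0.991217 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2009-07-01 04:10:00 | 0.0 | 2009 | 0.866025 | 0.5 | 0.866025 | 0.5 | 0.0 | -0.5 | -0.866025 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2009-07-01 04:15:00 | 0.991217 | 2009 | 1.0 | 2.832769e-16 | 0.866025 | 0.5 | 0.0 | -0.5 | -0.866025 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2009-07-01 04:20:00 | 1.025701 | 2009 | 0.866025 | -0.5 | 0.866025 | 0.5 | 0.0 | -0.5 | -0.866025 | 0.991217 | ... | 0.991217 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |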


In [15]:
X_train, y_train, X_validation, y_validation = train_validation_ts(df, 0.8, 12, 0)

In [16]:
X_train

| | year | minute_sin | minute_cos | hour_sin | hour_cos | weekday_sin | month_sin | month_cos | lag1 | lag2 | lag3 | lag4 | lag5 | lag6 | lag7 | lag8 | lag9 | lag10 | lag11 | lag12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2009-07-01 04:00:00 | 2009 | 0.000000 | 1.000000e+00 | 0.866025 | 0.500000 | 0.0 | -0.500000 | -0.866025 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2009-07-01 04:05:00 | 2009 | 0.500000 | 8.660254e-01 | 0.866025 | 0.500000 | 0.0 | -0.500000 | -0.866025 | 0.991217 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2009-07-01 04:10:00 | 2009 | 0.866025 | 5.000000e-01 | 0.866025 | 0.500000 | 0.0 | -0.500000 | -0.866025 | 0.000000 | 0.991217 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2009-07-01 04:15:00 | 2009 | 1.000000 | 2.832769e-16 | 0.866025 | 0.500000 | 0.0 | -0.500000 | -0.866025 | 0.000000 | 0.000000 | 0.991217 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2009-07-01 04:20:00 | 2009 | 0.866025 | -5.000000e-01 | 0.866025 | 0.500000 | 0.0 | -0.500000 | -0.866025 | 0.991217 | 0.000000 | 0.000000 | 0.991217 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2016-08-18 14:45:00 | 2016 | -1.000000 | -1.836970e-16 | -0.500000 | -0.866025 | 0.0 | -0.866025 | -0.500000 | 0.562664 | 0.945590 | 1.033730 | 1.033730 | 1.033730 | 1.03373 | 1.03373 | 1.09319 | 1.09319 | 1.09319 | 1.09319 | 1.001666 |
| 2016-08-18 14:50:00 | 2016 | -0.866025 | 5.000000e-01 | -0.500000 | -0.866025 | 0.0 | -0.866025 | -0.500000 | 0.882702 | 0.562664 | 0.945590 | 1.033730 | 1.033730 | 1.03373 | 1.03373 | 1.03373 | 1.09319 | 1.09319 | 1.09319 | 1.093190 |
| 2016-08-18 14:55:00 | 2016 | -0.500000 | 8.660254e-01 | -0.500000 | -0.866025 | 0.0 | -0.866025 | -0.500000 | 1.050214 | 0.882702 | 0.562664 | 0.945590 | 1.033730 | 1.03373 | 1.03373 | 1.03373 | 1.03373 | 1.09319 | 1.09319 | 1.093190 |
| 2016-08-18 15:00:00 | 2016 | 0.000000 | 1.000000e+00 | -0.707107 | -0.707107 | 0.0 | -0.866025 | -0.500000 | 1.033730 | 1.050214 | 0.882702 | 0.562664 | 0.945590 | 1.03373 | 1.03373 | 1.03373 | 1.03373 | 1.03373 | 1.09319 | 1.093190 |


## __Error metrics for time series data__

### __Mean Absolute Error (MAE)__

$MAE = \frac{\sum\limits_{t=1}^n | F_t - A_t |}{n} $

$F_t$: forecast value <br/>
$A_t$: actual value <br/>
$n$: sample size
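As a minimal sketch, the MAE can be computed with NumPy as follows. The function name and the example values are made up for illustration.

```python
import numpy as np


def mean_absolute_error(actual, forecast):
    """MAE: mean of |F_t - A_t| over all n samples."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.mean(np.abs(forecast - actual))


# Made-up example values
actual = [1.0, 2.0, 4.0]
forecast = [1.5, 2.0, 3.0]
mae = mean_absolute_error(actual, forecast)
print(mae)  # (0.5 + 0.0 + 1.0) / 3 = 0.5
```

The result is in the same unit as the target, which makes it easy to interpret.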

### __Mean Absolute Percentage Error (MAPE)__

$MAPE = \frac{100 \%}{n}\sum\limits_{t=1}^n \left| \frac{A_t - F_t}{A_t} \right|$

$F_t$: forecast value <br/>
$A_t$: actual value <br/>
$n$: sample size
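MAPE is computed with the absolute value of the relative error, so that positive and negative errors do not cancel each other. The sketch below uses made-up values; the function name and the zero check are assumptions for illustration.

```python
import numpy as np


def mean_absolute_percentage_error(actual, forecast):
    """MAPE in percent: 100 / n * sum(|(A_t - F_t) / A_t|)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    # The relative error is undefined when an actual value is zero
    if np.any(actual == 0):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100.0 * np.mean(np.abs((actual - forecast) / actual))


# Made-up example values
actual = [2.0, 4.0, 5.0]
forecast = [1.0, 5.0, 5.0]
mape = mean_absolute_percentage_error(actual, forecast)
print(mape)  # (0.5 + 0.25 + 0.0) / 3 * 100 = 25.0
```

The zero check matters for this dataset: the sample rows of `t CO2-e / MWh` shown earlier contain values of 0.0, for which MAPE cannot be evaluated.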

### __Symmetric Mean Absolute Percentage Error (SMAPE)__

$SMAPE = \frac{\sum\limits_{t=1}^n | F_t - A_t |}{\sum\limits_{t=1}^n (A_t + F_t)}$

$F_t$: forecast value <br/>
$A_t$: actual value <br/>
$n$: sample size
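The SMAPE variant given above, which divides the summed absolute errors by the summed actual and forecast values, can be sketched as follows. The function name and the example values are made up.

```python
import numpy as np


def smape(actual, forecast):
    """SMAPE variant: sum(|F_t - A_t|) / sum(A_t + F_t)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    denominator = np.sum(actual + forecast)
    # The ratio is undefined when actual and forecast values sum to zero
    if denominator == 0:
        raise ValueError("SMAPE is undefined when actual and forecast values sum to zero")
    return np.sum(np.abs(forecast - actual)) / denominator


# Made-up example values
actual = [1.0, 3.0]
forecast = [3.0, 1.0]
smape_value = smape(actual, forecast)
print(smape_value)  # (2 + 2) / (4 + 4) = 0.5
```

For non-negative actual and forecast values, this variant stays between 0 and 1, and single zero actuals do not break it as long as the sums are non-zero.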

### __Mean Absolute Scaled Error (MASE)__

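$MASE = \frac{\frac{1}{J}\sum\limits_{j=1}^J | e_j | }{\frac{1}{T-1} \sum\limits_{t=2}^T | Y_t - Y_{t-1} | }$
<br/>
<br/>
$e_j$: forecast error for period $j$ (actual value minus forecast value) <br/>
$J$: number of forecasts <br/>
$Y_t$: actual value of the training series at time $t$ <br/>
$T$: length of the training series <br/>
The denominator is the mean absolute error of the one-step naive forecast (the previous value is used as the forecast) on the training series. <br/>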
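The following sketch computes MASE with NumPy: the numerator is the mean absolute forecast error, and the denominator is the mean absolute error of the one-step naive forecast on the training series. The function name, argument names and example values are made up for illustration.

```python
import numpy as np


def mean_absolute_scaled_error(actual, forecast, train):
    """MASE: forecast MAE divided by the in-sample MAE of the one-step naive forecast."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    train = np.asarray(train, dtype=float)
    # Naive forecast error on the training series: |Y_t - Y_{t-1}| for t = 2..T
    naive_mae = np.mean(np.abs(np.diff(train)))
    if naive_mae == 0:
        raise ValueError("MASE is undefined for a constant training series")
    return np.mean(np.abs(actual - forecast)) / naive_mae


# Made-up example values
train = [1.0, 2.0, 4.0, 7.0]  # naive errors 1, 2, 3 -> mean 2.0
actual = [8.0, 10.0]
forecast = [7.0, 12.0]        # forecast errors 1, 2 -> mean 1.5
mase = mean_absolute_scaled_error(actual, forecast, train)
print(mase)  # 1.5 / 2.0 = 0.75
```

A value below 1 means the forecast has a smaller average error than the naive forecast had on the training data.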