# Problem 1
We have to implement the pre-processing functions for this task.

a) **Splitting the dataset**: Implement a function which splits the data set into a training and validation
data set. The input of the function should be a dataframe containing the SMARD data, as well as
the first and last timesteps of both the training and validation data sets. These four values should be
pandas datetime objects. After the data is split, the columns containing datetime objects should be
removed. The output of the function should be two numpy arrays, one containing the training data
set and one containing the validation data set.
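
The four boundary values can be created with `pd.Timestamp`. A minimal sketch; the dates below are placeholder assumptions for illustration, not values prescribed by the task:

```python
import pandas as pd

# Hypothetical boundaries: two years for training, one year for validation
first_train = pd.Timestamp("2020-01-01 00:00")
last_train = pd.Timestamp("2021-12-31 23:45")
first_val = pd.Timestamp("2022-01-01 00:00")
last_val = pd.Timestamp("2022-12-31 23:45")

# Timestamps compare directly, so the two ranges can be checked for overlap
assert last_train < first_val
```

Choosing boundaries that do not overlap keeps validation rows out of the training set.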

In [1]:
import pandas as pd

Copying helper function ...

In [2]:
def read_SMARD_data(path, remove_bad_columns=False):
    """ Read SMARD data .csv file to a pandas DataFrame

        Input:
        path: file path to SMARD data .csv file
        remove_bad_columns: if True, drop the residual load, pumped storage and 'Ende' columns

        Returns:
        pandas DataFrame with SMARD data
    """
    # save data from .csv file to dataframe
    df = pd.read_csv(path, delimiter=';', thousands='.', decimal=',', parse_dates=[[0,1]], dayfirst=True)

    # rename columns
    df = df.rename(
        columns={
        'Datum_Anfang': "Date",
        'Gesamt (Netzlast) [MWh] Originalauflösungen': 'Total Load [MWh]',
        'Residuallast [MWh] Originalauflösungen': 'Residual Load [MWh]',
        'Pumpspeicher [MWh] Originalauflösungen' : 'Energy from Pumped Storage [MWh]'
        }
    )
    if remove_bad_columns:
        # remove columns from DataFrame
        df = df.drop(['Residual Load [MWh]', 'Energy from Pumped Storage [MWh]', 'Ende'], axis="columns")
    return df

Reading the data from the csv ...

In [3]:
smard_df = read_SMARD_data('SMARD.csv')

In [4]:
print(smard_df)

                      Date   Ende  Total Load [MWh]  Residual Load [MWh]  \
0      2020-01-01 00:00:00  00:15          10964.25              9362.75   
1      2020-01-01 00:15:00  00:30          10908.50              9311.75   
2      2020-01-01 00:30:00  00:45          10833.00              9170.50   
3      2020-01-01 00:45:00  01:00          10788.25              9065.00   
4      2020-01-01 01:00:00  01:15          10756.00              9012.50   
...                    ...    ...               ...                  ...   
105211 2022-12-31 22:45:00  23:00          10457.75              1922.50   
105212 2022-12-31 23:00:00  23:15          10331.25              2098.50   
105213 2022-12-31 23:15:00  23:30          10339.50              2131.25   
105214 2022-12-31 23:30:00  23:45          10220.75              2076.75   
105215 2022-12-31 23:45:00  00:00          10104.25              2039.50   

        Energy from Pumped Storage [MWh]  
0                                  48.75  
1

Implementing a function that splits the data into training and validation datasets ...

In [None]:
def split_train_val(dataframe, first_timestp_train, last_timestp_train, first_timestp_val, last_timestp_val):
    """
    Splits data into training and validation datasets

    Args:
        dataframe (pandas.DataFrame): the dataframe containing the SMARD data.
        first_timestp_train (pandas.Timestamp): the first timestep of the training data.
        last_timestp_train (pandas.Timestamp): the last timestep of the training data.
        first_timestp_val (pandas.Timestamp): the first timestep of the validation data.
        last_timestp_val (pandas.Timestamp): the last timestep of the validation data.

    Returns:
        train_data (numpy.ndarray): the training data,
        val_data (numpy.ndarray): the validation data.
    """

    # select the rows inside each time range (both bounds inclusive)
    train_mask = dataframe['Date'].between(first_timestp_train, last_timestp_train)
    val_mask = dataframe['Date'].between(first_timestp_val, last_timestp_val)

    # remove the time columns: 'Date' holds datetime objects, 'Ende' (if still
    # present) holds the interval end time as a string
    time_columns = ['Date', 'Ende']
    train_data = dataframe[train_mask].drop(columns=time_columns, errors='ignore')
    val_data = dataframe[val_mask].drop(columns=time_columns, errors='ignore')

    # convert to numpy arrays as required by the task
    return train_data.to_numpy(), val_data.to_numpy()
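
One way to sanity-check a time-based split is to run the same idea on a small synthetic frame whose rows can be counted by hand. A minimal sketch, assuming a hypothetical helper `split_by_time` and toy data; neither is part of the SMARD notebook:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the SMARD frame: 8 quarter-hour steps, one numeric column
toy_df = pd.DataFrame({
    "Date": pd.date_range("2020-01-01 00:00", periods=8, freq="15min"),
    "Total Load [MWh]": np.arange(8, dtype=float),
})

def split_by_time(df, first, last):
    """Hypothetical helper: rows with first <= Date <= last, without the Date column."""
    mask = (df["Date"] >= first) & (df["Date"] <= last)
    return df.loc[mask].drop(columns="Date").to_numpy()

# 00:00 to 01:00 covers 5 steps, 01:15 to 01:45 covers the remaining 3
toy_train = split_by_time(toy_df, pd.Timestamp("2020-01-01 00:00"), pd.Timestamp("2020-01-01 01:00"))
toy_val = split_by_time(toy_df, pd.Timestamp("2020-01-01 01:15"), pd.Timestamp("2020-01-01 01:45"))

print(toy_train.shape, toy_val.shape)  # (5, 1) (3, 1)
```

Both bounds are inclusive here, so a timestep that equals a boundary lands in the corresponding set.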