## 2. Baseline model
This notebook defines the baseline model.
For each wind farm, it predicts the mean electricity production observed in that farm's training data, and the resulting scores are aggregated over all wind farms. The wind farms are referred to as zones.
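
As a minimal sketch of the idea, the per-zone mean predictor can be written with a `groupby`. The toy values and the names `toy_train`, `toy_test`, and `PRED` are hypothetical; only the column names `ZONEID` and `TARGETVAR` come from the dataset used in this notebook.

```python
import pandas as pd

# Toy data with made-up values; not taken from the GEFCom2014 dataset.
toy_train = pd.DataFrame({"ZONEID": [1, 1, 2, 2],
                          "TARGETVAR": [0.2, 0.4, 0.6, 1.0]})
toy_test = pd.DataFrame({"ZONEID": [1, 2, 2],
                         "TARGETVAR": [0.5, 0.7, 0.9]})

# One constant per zone: the mean of the training target in that zone.
zone_means = toy_train.groupby("ZONEID")["TARGETVAR"].mean()

# Every test row receives the training mean of its own zone.
toy_test["PRED"] = toy_test["ZONEID"].map(zone_means)
print(toy_test)
```

The prediction for a test row depends only on its zone, never on the weather features, which is what makes this a useful lower bar for the later models.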

In [36]:
import sys
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
sys.path.append("..")

In [37]:
# load the dataset, using the timestamp as index
data = pd.read_csv('../data/GEFCom2014Data/Wind/raw_data_incl_features.csv',
                   parse_dates=['TIMESTAMP'],
                   index_col='TIMESTAMP')

# fill NaN values by linear interpolation
data.interpolate(method='linear', inplace=True)

# add dummy variables for WD100Card and WD10Card
data = pd.get_dummies(data, columns = ['WD100CARD','WD10CARD'], drop_first=True)

Defining the train and test sets. The dataset consists of weather data from 01.01.2012 to 31.12.2013. It is split at 01.07.2013, so the train set covers roughly the first 75% of the period (18 of 24 months) and the test set the remaining 25%.

In [38]:
train = data[:'2013-07-01 00:00:00']
test = data['2013-07-01 01:00:00':]
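
The split can be sanity-checked on a stand-in index. The sketch below assumes an hourly series that starts at 2012-01-01 00:00 and ends at 2013-12-31 23:00; the exact first and last timestamps of the real dataset are an assumption, and `toy` is a hypothetical frame used only for illustration.

```python
import pandas as pd

# Hypothetical hourly index for the two years described above.
idx = pd.date_range("2012-01-01 00:00", "2013-12-31 23:00",
                    freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"TARGETVAR": 0.0}, index=idx)

# Label-based slicing on a DatetimeIndex includes both endpoints,
# so the test slice has to start one hour after the train slice ends.
toy_train = toy.loc[:"2013-07-01 00:00:00"]
toy_test = toy.loc["2013-07-01 01:00:00":]

train_share = len(toy_train) / len(toy)
print(f"train share: {train_share:.3f}")
```

Because both slice endpoints are inclusive, starting the test set at 01:00 keeps the two sets disjoint while still covering every row.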


In [39]:
test.head()

TIMESTAMP,ZONEID,TARGETVAR,U10,V10,U100,V100,HOUR,MONTH,WEEKDAY,IS_HOLIDAY,...,WD10CARD_NNW,WD10CARD_NW,WD10CARD_S,WD10CARD_SE,WD10CARD_SSE,WD10CARD_SSW,WD10CARD_SW,WD10CARD_W,WD10CARD_WNW,WD10CARD_WSW
2013-07-01 01:00:00,1,0.625035,5.896003,-1.520128,9.461001,-2.10653,1,7,0,0,...,0,0,0,0,0,0,0,0,1,0
2013-07-01 02:00:00,1,0.791185,5.886435,-0.900037,9.019789,-1.276092,2,7,0,0,...,0,0,0,0,0,0,0,1,0,0
2013-07-01 03:00:00,1,0.8674,5.899591,-0.69367,8.685795,-1.147814,3,7,0,0,...,0,0,0,0,0,0,0,1,0,0
2013-07-01 04:00:00,1,0.896814,5.807502,-0.680772,8.629487,-1.117739,4,7,0,0,...,0,0,0,0,0,0,0,1,0,0
2013-07-01 05:00:00,1,0.647214,4.936254,-0.752703,7.652959,-1.130014,5,7,0,0,...,0,0,0,0,0,0,0,1,0,0


RMSE of the baseline model for each wind farm, and aggregated over all wind farms:

In [40]:
# zones (wind farms)
zones = np.sort(train.ZONEID.unique())

# baseline scores of all zones are collected in one DataFrame; the TOTAL row
# holds the RMSE with respect to the observations of all zones
df_results = pd.DataFrame(index = [f'ZONE{zone}' for zone in zones] + ['TOTAL'],
                          columns = ['BEST_PARAMS','CV','MODEL','FC','TESTSCORE','TRAINSCORE'])
# use .loc[row, column] rather than chained assignment, which may not write back
df_results.loc['TOTAL', 'TRAINSCORE'] = 0
df_results.loc['TOTAL', 'TESTSCORE'] = 0
df_results['MODEL'] = 'Baseline'

# loop over all zones
for zone in zones:

    # get train and test data of the individual zone
    ytrain = train[train.ZONEID == zone].TARGETVAR
    ytest = test[test.ZONEID == zone].TARGETVAR

    # baseline predictions for the individual zone: the training mean for both sets
    pred_train = np.full(len(ytrain), ytrain.mean())
    pred_test = np.full(len(ytest), ytrain.mean())

    # np.sqrt of the MSE avoids the `squared` argument, which is deprecated
    # in recent scikit-learn releases
    rmse_train = np.sqrt(mean_squared_error(ytrain, pred_train))
    rmse_test = np.sqrt(mean_squared_error(ytest, pred_test))
    df_results.loc[f'ZONE{zone}', 'TRAINSCORE'] = rmse_train
    df_results.loc[f'ZONE{zone}', 'TESTSCORE'] = rmse_test

    # accumulate the squared RMSE, weighted by the zone's share of observations
    df_results.loc['TOTAL', 'TRAINSCORE'] += rmse_train**2 * len(ytrain) / len(train)
    df_results.loc['TOTAL', 'TESTSCORE'] += rmse_test**2 * len(ytest) / len(test)

# the square root of the weighted sum is the RMSE over all zones
df_results.loc['TOTAL', 'TRAINSCORE'] = np.sqrt(df_results.loc['TOTAL', 'TRAINSCORE'])
df_results.loc['TOTAL', 'TESTSCORE'] = np.sqrt(df_results.loc['TOTAL', 'TESTSCORE'])

df_results.index.set_names(['ZONE'], inplace=True)
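
The TOTAL row combines the per-zone scores as the square root of the sum of squared zone RMSEs, each weighted by that zone's share of the observations. The following sketch, on synthetic residuals for three hypothetical zones (none of the numbers come from the dataset), illustrates that this weighted combination equals the RMSE computed directly on the pooled residuals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic residuals (observation minus prediction) for three
# hypothetical zones of different sizes.
errors_by_zone = [rng.normal(scale=s, size=n)
                  for s, n in [(0.2, 50), (0.3, 80), (0.25, 30)]]
n_total = sum(len(e) for e in errors_by_zone)

# Per-zone RMSE, then the size-weighted combination used in the loop above.
rmse_by_zone = [np.sqrt(np.mean(e ** 2)) for e in errors_by_zone]
rmse_weighted = np.sqrt(sum(r ** 2 * len(e) / n_total
                            for r, e in zip(rmse_by_zone, errors_by_zone)))

# RMSE computed directly on all residuals pooled together.
rmse_pooled = np.sqrt(np.mean(np.concatenate(errors_by_zone) ** 2))

print(np.isclose(rmse_weighted, rmse_pooled))
```

Weighting by the number of observations matters only when zones differ in size; with equally sized zones the total reduces to the root of the plain average of the squared zone RMSEs.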


In [41]:
df_results

ZONE,BEST_PARAMS,CV,MODEL,FC,TESTSCORE,TRAINSCORE
ZONE1,,,Baseline,,0.330221,0.274462
ZONE2,,,Baseline,,0.290033,0.256417
ZONE3,,,Baseline,,0.312627,0.296552
ZONE4,,,Baseline,,0.366647,0.31975
ZONE5,,,Baseline,,0.361371,0.326282
ZONE6,,,Baseline,,0.355799,0.331287
ZONE7,,,Baseline,,0.307106,0.251734
ZONE8,,,Baseline,,0.316515,0.262741
ZONE9,,,Baseline,,0.312423,0.276838
ZONE10,,,Baseline,,0.344026,0.339246
