### __Introduction of additional non-time features__

In this notebook, we introduce further features that are likely to affect the marginal CO2 emissions: the electricity demand and the available generation.
In addition, the energy flow from generators outside South Australia (recall that we filtered for South Australia in notebook 3) is added as a feature.

In [37]:
import numpy as np
import pandas as pd

In [38]:
df = pd.read_pickle('../../big_data/train_time_features.pkl')
df.head()

Unnamed: 0,CO2E_EMISSIONS_FACTOR,weekday,year,minute_sin,minute_cos,hour_sin,hour_cos,month_sin,month_cos,lag1,...,lag4,lag5,lag6,lag7,lag8,lag9,lag10,lag11,lag12,horizon0
2009-07-01 04:00:00,0.991217,0,2009,0.0,1.0,0.866025,0.5,-0.5,-0.866025,,...,,,,,,,,,,
2009-07-01 04:05:00,0.0,0,2009,0.5,0.8660254,0.866025,0.5,-0.5,-0.866025,0.991217,...,,,,,,,,,,
2009-07-01 04:10:00,0.0,0,2009,0.866025,0.5,0.866025,0.5,-0.5,-0.866025,0.0,...,,,,,,,,,,
2009-07-01 04:15:00,0.991217,0,2009,1.0,2.832769e-16,0.866025,0.5,-0.5,-0.866025,0.0,...,,,,,,,,,,
2009-07-01 04:20:00,1.025701,0,2009,0.866025,-0.5,0.866025,0.5,-0.5,-0.866025,0.991217,...,0.991217,,,,,,,,,


In [39]:
demand = pd.read_csv('../../big_data/demand.csv', index_col=-1, parse_dates=True)
demand.drop(columns=["SETTLEMENTDATE", "I", "INTERVENTION"], inplace=True)
demand = demand[(demand.index >= df.index.min()) & (demand.index <= df.index.max())]

assert demand.index.min() == df.index.min()
assert demand.index.max() == df.index.max()

demand.head()

start-of-interval,TOTALDEMAND,AVAILABLEGENERATION
2009-07-01 04:00:00,1004.32,3043.771
2009-07-01 04:05:00,1007.58,3043.694
2009-07-01 04:10:00,1019.33,3045.307
2009-07-01 04:15:00,1025.24,3045.436
2009-07-01 04:20:00,1050.5,3043.964


In [40]:
print(demand.index.min())
print(demand.index.max())

print(df.index.min())
print(df.index.max())

2009-07-01 04:00:00
2018-05-31 23:55:00
2009-07-01 04:00:00
2018-05-31 23:55:00


### __Checking for regular time granularity (5 min) in the demand dataset__

To check whether our demand dataset contains any gaps, we first generate a date range at 5-minute frequency that spans the min and max values of the demand DatetimeIndex.

In [41]:
len(pd.date_range(start=demand.index.min(), end=demand.index.max(), freq="5min"))

937968

In [42]:
test = demand.groupby(demand.index).agg({"TOTALDEMAND": "nunique","AVAILABLEGENERATION":"nunique"})
print(test.TOTALDEMAND.value_counts()) 
print(test.AVAILABLEGENERATION.value_counts())

# every timestamp has exactly one distinct TOTALDEMAND and AVAILABLEGENERATION value

1    937968
Name: TOTALDEMAND, dtype: int64
1    937968
Name: AVAILABLEGENERATION, dtype: int64


The number of distinct timestamps corresponds to the length of our hypothetical date range (cell In [41]). Hence, we can conclude that no gaps are present in the dataset.
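
The same check can be made explicit by comparing the index against the full 5-minute grid and listing any missing timestamps. The following is a minimal sketch on a toy index; the helper name `missing_timestamps` is hypothetical and not part of this notebook.

```python
import pandas as pd

def missing_timestamps(index: pd.DatetimeIndex, freq: str = "5min") -> pd.DatetimeIndex:
    """Return the timestamps of the regular grid that are absent from `index`."""
    expected = pd.date_range(start=index.min(), end=index.max(), freq=freq)
    return expected.difference(index)

# Toy index with one 5-minute step missing (04:10).
toy_index = pd.DatetimeIndex(
    ["2009-07-01 04:00", "2009-07-01 04:05", "2009-07-01 04:15"]
)
gaps = missing_timestamps(toy_index)
print(gaps)
```

An empty result would mean that the index covers the grid without gaps.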

### __Addition of non-time features__

Note that the available generation is not added as a raw feature. Instead, it is combined with the total demand into a ratio that reflects how close the energy grid is to its capacity.
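
As a small worked example of this ratio, the sketch below uses the first row of the demand table (TOTALDEMAND 1004.32, AVAILABLEGENERATION 3043.771). The variable names `toy` and `utilisation` are illustrative only.

```python
import pandas as pd

# First row of the demand table: TOTALDEMAND and AVAILABLEGENERATION.
toy = pd.DataFrame(
    {"TOTALDEMAND": [1004.32], "AVAILABLEGENERATION": [3043.771]},
    index=pd.DatetimeIndex(["2009-07-01 04:00:00"]),
)

# Share of the available generation that is currently demanded.
utilisation = toy["TOTALDEMAND"] / toy["AVAILABLEGENERATION"]
print(utilisation.round(6))
```

A value close to 1 would mean that demand is close to the available generation.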

In [43]:
demand = demand.groupby(demand.index).mean()

In [44]:
df["demand"] = demand.TOTALDEMAND
df["demand_capacity"] = demand.TOTALDEMAND/demand.AVAILABLEGENERATION

In [45]:
print(df.shape)
df.head()

(937968, 24)


Unnamed: 0,CO2E_EMISSIONS_FACTOR,weekday,year,minute_sin,minute_cos,hour_sin,hour_cos,month_sin,month_cos,lag1,...,lag6,lag7,lag8,lag9,lag10,lag11,lag12,horizon0,demand,demand_capacity
2009-07-01 04:00:00,0.991217,0,2009,0.0,1.0,0.866025,0.5,-0.5,-0.866025,,...,,,,,,,,,1004.32,0.329959
2009-07-01 04:05:00,0.0,0,2009,0.5,0.8660254,0.866025,0.5,-0.5,-0.866025,0.991217,...,,,,,,,,,1007.58,0.331039
2009-07-01 04:10:00,0.0,0,2009,0.866025,0.5,0.866025,0.5,-0.5,-0.866025,0.0,...,,,,,,,,,1019.33,0.334722
2009-07-01 04:15:00,0.991217,0,2009,1.0,2.832769e-16,0.866025,0.5,-0.5,-0.866025,0.0,...,,,,,,,,,1025.24,0.336648
2009-07-01 04:20:00,1.025701,0,2009,0.866025,-0.5,0.866025,0.5,-0.5,-0.866025,0.991217,...,,,,,,,,,1050.5,0.345109


### __Addition of the interconnectors feature__

This feature reflects how much electricity is being imported into the region of South Australia.

In [46]:
interconnectors = pd.read_csv('../../big_data/interconnectors.csv', index_col=-1, parse_dates=True)
interconnectors.drop(columns=["SETTLEMENTDATE", "I", "INTERCONNECTORID"], inplace=True)
interconnectors = interconnectors[(interconnectors.index >= df.index.min()) & (interconnectors.index <= df.index.max())]

assert interconnectors.index.min() == df_mean2.index.min()
assert interconnectors.index.max() == df_mean2.index.max()

interconnectors.head()

NameError: name 'df_mean2' is not defined

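When trying to load the file together with our assertions, we get an error. Here it is a NameError, because the assertions reference `df_mean2`, a name that is not defined in this notebook. Even with `df` in its place, the first assertion would fail, because the DatetimeIndex of the interconnectors dataframe starts later than that of `df`.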

__Watch:__

In [47]:
print(interconnectors.shape)
print(interconnectors.index.min())
print(interconnectors.index.max())

print(df.index.min())

(942201, 1)
2009-09-01 00:00:00
2018-05-31 23:55:00
2009-07-01 04:00:00


We will take care of that below, when we drop all rows containing NaNs from our dataframe.
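
This works because assigning a Series to a DataFrame column aligns on the index, so timestamps missing from the Series become NaN. A minimal sketch with toy data (all names and values here are illustrative, not taken from the dataset):

```python
import pandas as pd

idx = pd.date_range("2009-07-01 04:00", periods=4, freq="5min")
frame = pd.DataFrame({"demand": [1.0, 2.0, 3.0, 4.0]}, index=idx)

# The flow series starts two steps later than the frame.
flow = pd.Series([10.0, 20.0], index=idx[2:])

# Assignment aligns on the index; unmatched rows are filled with NaN.
frame["interconnector"] = flow

# Dropping NaN rows trims the frame to the overlapping period.
trimmed = frame.dropna()
print(trimmed)
```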

In [48]:
interconnectors.groupby(interconnectors.index).agg({"MWFLOW":"nunique"}).MWFLOW.value_counts()

1    900648
2     19512
Name: MWFLOW, dtype: int64

In [49]:
interconnectors = interconnectors.groupby(interconnectors.index).mean()
interconnectors.shape

(920160, 1)

In [50]:
df["interconnector"] = interconnectors.MWFLOW
df.head()

Unnamed: 0,CO2E_EMISSIONS_FACTOR,weekday,year,minute_sin,minute_cos,hour_sin,hour_cos,month_sin,month_cos,lag1,...,lag7,lag8,lag9,lag10,lag11,lag12,horizon0,demand,demand_capacity,interconnector
2009-07-01 04:00:00,0.991217,0,2009,0.0,1.0,0.866025,0.5,-0.5,-0.866025,,...,,,,,,,,1004.32,0.329959,
2009-07-01 04:05:00,0.0,0,2009,0.5,0.8660254,0.866025,0.5,-0.5,-0.866025,0.991217,...,,,,,,,,1007.58,0.331039,
2009-07-01 04:10:00,0.0,0,2009,0.866025,0.5,0.866025,0.5,-0.5,-0.866025,0.0,...,,,,,,,,1019.33,0.334722,
2009-07-01 04:15:00,0.991217,0,2009,1.0,2.832769e-16,0.866025,0.5,-0.5,-0.866025,0.0,...,,,,,,,,1025.24,0.336648,
2009-07-01 04:20:00,1.025701,0,2009,0.866025,-0.5,0.866025,0.5,-0.5,-0.866025,0.991217,...,,,,,,,,1050.5,0.345109,


In [51]:
def correct_timedelta(df, time_diff):
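    '''
    Check a DataFrame's DatetimeIndex for irregular time steps.
    df=table_of_interest (df.index must be a DatetimeIndex)
    time_diff=expected step between consecutive rows, in seconds as int
    Returns two lists: human-readable messages, and
    (timestamp, step_in_seconds) tuples for every irregular step.
    '''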
    lst = []
    lst_i = []
    count = 0
    for i in df.index:
        count += 1
        if count >= len(df):
            break
        delta = abs(df.index[count] - df.index[count-1])
        if int(delta.total_seconds()) != int(time_diff):
            lst.append(("from index {} on, it has been {} s or {} h.".format(count-1,int(delta.total_seconds()),(int(delta.total_seconds()/3600)))))
            lst_i.append((df.index[count-1],int(delta.total_seconds())))
    return lst, lst_i

In [52]:
lst1, _ = correct_timedelta(df, 300)

In [53]:
len(lst1)

0
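
For a frame of this size, the same check can also be done without a Python loop. The sketch below is an alternative to `correct_timedelta`, not a replacement taken from the notebook: it relies on `Series.diff` applied to the index, and the helper name `irregular_steps` is hypothetical. Unlike the function above, it labels each irregular step by the timestamp at which the step ends.

```python
import pandas as pd

def irregular_steps(index: pd.DatetimeIndex, seconds: int = 300) -> pd.Series:
    """Return the step sizes (in seconds) that differ from `seconds`,
    labelled by the timestamp at which each irregular step ends."""
    steps = index.to_series().diff().dropna().dt.total_seconds()
    return steps[steps != seconds]

# Toy index: the second step is 15 minutes instead of 5.
toy_index = pd.DatetimeIndex(
    ["2009-07-01 04:00", "2009-07-01 04:05", "2009-07-01 04:20"]
)
bad = irregular_steps(toy_index)
print(bad)
```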

In [54]:
print(df.shape)
df.isna().any()

(937968, 25)


CO2E_EMISSIONS_FACTOR    False
weekday                  False
year                     False
minute_sin               False
minute_cos               False
hour_sin                 False
hour_cos                 False
month_sin                False
month_cos                False
lag1                      True
lag2                      True
lag3                      True
lag4                      True
lag5                      True
lag6                      True
lag7                      True
lag8                      True
lag9                      True
lag10                     True
lag11                     True
lag12                     True
horizon0                  True
demand                   False
demand_capacity          False
interconnector            True
dtype: bool

Our current dataframe has a set of columns containing NaN values. For the columns we created during time feature engineering, we know that the NaNs sit either at the very beginning or at the very end of the respective column. Hence, given our large amount of data, it does no harm to simply drop those rows.
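
This assumption can also be tested directly: if the NaN rows sit only at the edges, the complete rows form one contiguous block. A minimal sketch on toy data follows; `nans_only_at_edges` is a hypothetical helper, not part of this notebook.

```python
import numpy as np
import pandas as pd

def nans_only_at_edges(frame: pd.DataFrame) -> bool:
    """True if every row between the first and last complete row is complete."""
    complete = frame.notna().all(axis=1).to_numpy()
    if not complete.any():
        return True  # nothing but NaN rows, so no interior gap exists
    positions = np.flatnonzero(complete)
    first, last = positions[0], positions[-1]
    return bool(complete[first:last + 1].all())

idx = pd.date_range("2009-07-01 04:00", periods=5, freq="5min")
edge = pd.DataFrame({"lag1": [np.nan, 1.0, 2.0, 3.0, 4.0]}, index=idx)
inner = pd.DataFrame({"lag1": [0.0, 1.0, np.nan, 3.0, 4.0]}, index=idx)

print(nans_only_at_edges(edge), nans_only_at_edges(inner))
```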

In [55]:
df.dropna(inplace=True)

In [57]:
print(df.shape)
df.isna().any()

(920160, 25)


CO2E_EMISSIONS_FACTOR    False
weekday                  False
year                     False
minute_sin               False
minute_cos               False
hour_sin                 False
hour_cos                 False
month_sin                False
month_cos                False
lag1                     False
lag2                     False
lag3                     False
lag4                     False
lag5                     False
lag6                     False
lag7                     False
lag8                     False
lag9                     False
lag10                    False
lag11                    False
lag12                    False
horizon0                 False
demand                   False
demand_capacity          False
interconnector           False
dtype: bool

In [58]:
l1, l2 = correct_timedelta(df, 300)

In [59]:
l1

[]

__After our dropna operation, no DatetimeIndex gaps are present, suggesting that we indeed only dropped NaN values at the edges of the table (beginning or end).__

In [55]:
file_path = '../../big_data/df_clean_interconnectors.pkl'
# `df` is the cleaned dataframe after dropna; `df_clean` is not defined in this notebook
pd.to_pickle(df, file_path)