# Feature Engineering

## Libraries

In [1]:
import numpy as np
import pandas as pd
from cnr_methods import get_simplified_data 

# Feature Engineering Library for Time Series
from tsfresh import extract_relevant_features
from tsfresh.utilities.dataframe_functions import make_forecasting_frame
from tsfresh.utilities.dataframe_functions import impute

from sklearn.ensemble import RandomForestRegressor
# Feature Selection Library
from boruta import BorutaPy

## Read Data

For this pipeline, only Training Set will be used.

In [2]:
full_data = get_simplified_data()
full_data = full_data[full_data['Set']=='Train']
y_train = pd.read_csv('Y_train.csv')

As done in the other Notebooks, we will transform the Column 'Time' to Datetime format and set as the index of the dataset.

In [3]:
full_data['Time'] = pd.to_datetime(full_data['Time'],dayfirst=True)
full_data = full_data.set_index('Time')

In [4]:
full_data.head()

Unnamed: 0_level_0,ID,WF,U_100m,V_100m,U_10m,V_10m,T,CLCT,Set
Time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
2018-05-01 01:00:00,1,WF1,-2.2485,-3.2578,1.254603,-0.289687,286.44,82.543144,Train
2018-05-01 02:00:00,2,WF1,-2.4345,-1.4461,2.490908,-0.41337,286.26,99.990844,Train
2018-05-01 03:00:00,3,WF1,-1.220571,-0.266871,0.997093,-1.415138,286.575,98.367235,Train
2018-05-01 04:00:00,4,WF1,-0.420144,-1.172552,0.689598,-0.961441,285.842832,94.860604,Train
2018-05-01 05:00:00,5,WF1,0.100398,-1.262762,0.290994,-0.294963,285.967452,95.905879,Train


To simplify the work, we will generate features for just one Wind Farm. When doing modelling, the features, as the models, will be generated for all Wind Farms separately.

In [5]:
WF = 'WF1'
data = full_data[full_data['WF']==WF]
y_train = y_train[y_train['ID'].isin(data['ID'])]

## Feature Creation

### Wind Speed Vector

In [6]:
feature_data = data[['ID','WF','U_100m','V_100m','U_10m','V_10m','T','CLCT','Set']]
feature_data['Wind Speed 100m'] = np.sqrt(feature_data['U_100m']**2 + feature_data['V_100m']**2)
feature_data['Wind Direction 100m'] = np.arctan(feature_data['V_100m']/feature_data['U_100m'])
feature_data['Wind Speed 10m'] = np.sqrt(feature_data['U_10m']**2 + feature_data['V_10m']**2)
feature_data['Wind Direction 10m'] = np.arctan(feature_data['V_10m']/feature_data['U_10m'])
feature_data = feature_data.drop(['U_100m','V_100m','U_10m','V_10m'],axis=1)

Changing Reference for Negative Angle:

In [7]:
feature_data[feature_data['Wind Direction 100m'] < 0]['Wind Direction 100m'] = 360 - feature_data[feature_data['Wind Direction 100m'] < 0]['Wind Direction 100m']
feature_data[feature_data['Wind Direction 10m'] < 0]['Wind Direction 10m'] = 360 - feature_data[feature_data['Wind Direction 10m'] < 0]['Wind Direction 10m']

### Time-Relative Variables

Variables Last Month, Last Week

### Wavelet Transformations (Check)

## Tsfresh

Now we use Tsfresh, a Python Library that automates Feature Engineering for Time Series Data. We generate new features for all the columns on the Simplified Data, as done below. This step is done after the Wind Speed Vector calculation to avoid Negative Values on the features generated, which would cause problems on the Log Transformations done before Feature Selection.

In [8]:
tsfresh_data = pd.DataFrame()
for variable in ['T', 'CLCT', 'Wind Speed 100m','Wind Direction 100m', 'Wind Speed 10m', 'Wind Direction 10m']:
    df_shift, y = make_forecasting_frame(feature_data[variable],kind=variable,max_timeshift=20,rolling_direction=1)
    X = extract_relevant_features(df_shift, column_id="id", column_sort="time", column_value="value", impute_function=impute,show_warnings=False,n_jobs=3)
    X['Feature'] = variable
    tsfresh_data = tsfresh_data.append(X)

Feature Extraction: 100%|██████████| 15/15 [01:55<00:00,  7.72s/it]
Feature Extraction: 100%|██████████| 15/15 [01:38<00:00,  6.59s/it]
Feature Extraction: 100%|██████████| 15/15 [02:01<00:00,  8.13s/it]
Feature Extraction: 100%|██████████| 15/15 [02:03<00:00,  8.22s/it]
Feature Extraction: 100%|██████████| 15/15 [01:55<00:00,  7.67s/it]
Feature Extraction: 100%|██████████| 15/15 [01:56<00:00,  7.79s/it]


Process tsfresh_data to pass column 'Features' to the other columns

In [9]:
tsfresh_data = tsfresh_data.pivot(columns='Feature')

In [10]:
tsfresh_data.columns = tsfresh_data.columns.map('{0[0]}|{0[1]}'.format)

In [11]:
tsfresh_data.head()

Unnamed: 0_level_0,value__abs_energy|CLCT,value__abs_energy|T,value__abs_energy|Wind Direction 100m,value__abs_energy|Wind Direction 10m,value__abs_energy|Wind Speed 100m,value__abs_energy|Wind Speed 10m,value__absolute_sum_of_changes|CLCT,value__absolute_sum_of_changes|T,value__absolute_sum_of_changes|Wind Direction 100m,value__absolute_sum_of_changes|Wind Direction 10m,...,value__variance|Wind Direction 100m,value__variance|Wind Direction 10m,value__variance|Wind Speed 100m,value__variance|Wind Speed 10m,value__variance_larger_than_standard_deviation|CLCT,value__variance_larger_than_standard_deviation|T,value__variance_larger_than_standard_deviation|Wind Direction 100m,value__variance_larger_than_standard_deviation|Wind Direction 10m,value__variance_larger_than_standard_deviation|Wind Speed 100m,value__variance_larger_than_standard_deviation|Wind Speed 10m
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2018-05-01 02:00:00,6813.370572,82047.8736,0.93448,0.051494,15.669013,1.657948,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2018-05-01 03:00:00,16811.539518,163992.6612,1.221774,0.078539,23.687009,8.033445,17.447701,0.18,0.430687,0.06247,...,0.046373,0.000976,0.317422,0.382766,1.0,0.0,0.0,0.0,0.0,0.0
2018-05-01 04:00:00,26487.652359,246117.891825,1.268109,0.99438,25.248021,11.030254,19.07131,0.495,0.751428,0.855013,...,0.094779,0.129448,1.23464,0.261995,1.0,0.0,0.0,0.0,1.0,0.0
2018-05-01 05:00:00,35486.186521,327824.016181,2.772981,1.894191,26.79942,12.430168,22.577941,1.227168,1.762903,0.863426,...,0.151302,0.143798,1.311684,0.279346,1.0,0.0,0.0,0.0,1.0,0.0
2018-05-01 06:00:00,44684.124127,409601.399782,4.997424,2.521726,28.404069,12.601849,23.623216,1.351789,4.481092,1.019839,...,0.915011,0.122637,1.227261,0.480477,1.0,0.0,0.0,0.0,1.0,0.0


## Feature Selection

Here we do the Feature Selection using Borutapy, a Python Implementation of the Famous R Method. For the method we use a Random Forest Regressor.

In [12]:
rf = RandomForestRegressor(n_jobs=3, max_depth=5)

In [13]:
feat_selector = BorutaPy(rf, n_estimators='auto', verbose=2, random_state=1)

In [14]:
feat_selector.fit(tsfresh_data.values, y_train['Production'][1:].values)

Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	4536
Rejected: 	0
Iteration: 	2 / 100
Confirmed: 	0
Tentative: 	4536
Rejected: 	0


KeyboardInterrupt: 

In [34]:
feat_selector.ranking_

array([1, 1, 1, 1, 1, 1, 1])

In [35]:
feat_selector.support_

array([ True,  True,  True,  True,  True,  True,  True])