# Feature Engineering

In this notebook I perform feature engineering on my dataset, which includes:
* creating dummy variables
* deciding whether to create training and test data sets
* standardizing feature scales

In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

### Creating Dummy Variables

In [2]:
# Import outage_df
outage_df = pd.read_csv('outage_df.csv', index_col = 0)
outage_df.head()

Unnamed: 0,Datetime Event Began,State Affected,NERC Region,Alert Criteria,Event Type,Demand Loss (MW),Number of Customers Affected,State Avg Temp (F),State Avg Windspeed (mph),State Avg Precipitation (mm),Monthly Net Energy for Load (GWh),Monthly Peak Hour Demand (MW)
297,2022-03-15 18:53:00,California,WECC,Physical threat to its Facility excluding weat...,Vandalism,10.0,3958.0,52.185096,6.105909,5.153283,70160.970234,101781.876097
100,2020-01-17 05:28:00,California,WECC,Electrical System Separation (Islanding) where...,Severe Weather,87.0,67864.0,36.832113,5.184511,11.522378,75757.704587,128013.159932
15,2018-05-17 01:11:00,California,WECC,"Loss of electric service to more than 50,000 c...",Severe Weather,70.0,70000.0,56.205196,1.11845,1.093863,69924.486936,128693.92159
16,2018-05-17 01:11:00,California,WECC,"Loss of electric service to more than 50,000 c...",Transmission / Distribution Interruption,124.0,70000.0,56.205196,1.11845,1.093863,69924.486936,128693.92159
242,2021-06-19 18:54:00,California,WECC,"Loss of electric service to more than 50,000 c...",System Operations,93.0,51806.0,68.9,6.651238,0.010243,82787.291966,160779.833901


The following categorical columns need to be converted into dummy variables:
* State Affected
* NERC Region
* Event Type

In [3]:
# Create dummy variables for the columns listed above:
outage_df_with_dummies = pd.get_dummies(data = outage_df, columns = ['State Affected', 'NERC Region', 'Event Type'], dtype = 'int')
outage_df_with_dummies.head()

Unnamed: 0,Datetime Event Began,Alert Criteria,Demand Loss (MW),Number of Customers Affected,State Avg Temp (F),State Avg Windspeed (mph),State Avg Precipitation (mm),Monthly Net Energy for Load (GWh),Monthly Peak Hour Demand (MW),State Affected_Alabama,...,NERC Region_SERC,NERC Region_SPP RE,NERC Region_TRE,NERC Region_WECC,Event Type_Cyber Event,Event Type_Other,Event Type_Severe Weather,Event Type_System Operations,Event Type_Transmission / Distribution Interruption,Event Type_Vandalism
297,2022-03-15 18:53:00,Physical threat to its Facility excluding weat...,10.0,3958.0,52.185096,6.105909,5.153283,70160.970234,101781.876097,0,...,0,0,0,1,0,0,0,0,0,1
100,2020-01-17 05:28:00,Electrical System Separation (Islanding) where...,87.0,67864.0,36.832113,5.184511,11.522378,75757.704587,128013.159932,0,...,0,0,0,1,0,0,1,0,0,0
15,2018-05-17 01:11:00,"Loss of electric service to more than 50,000 c...",70.0,70000.0,56.205196,1.11845,1.093863,69924.486936,128693.92159,0,...,0,0,0,1,0,0,1,0,0,0
16,2018-05-17 01:11:00,"Loss of electric service to more than 50,000 c...",124.0,70000.0,56.205196,1.11845,1.093863,69924.486936,128693.92159,0,...,0,0,0,1,0,0,0,0,1,0
242,2021-06-19 18:54:00,"Loss of electric service to more than 50,000 c...",93.0,51806.0,68.9,6.651238,0.010243,82787.291966,160779.833901,0,...,0,0,0,1,0,0,0,1,0,0
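
To make the transformation above easier to see, here is a minimal sketch on a small toy frame. The frame `toy_df` and its values are hypothetical and only illustrate how `pd.get_dummies` expands the listed columns; it is not the outage data itself.

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative, not read from outage_df.csv
toy_df = pd.DataFrame({
    'NERC Region': ['WECC', 'SERC', 'WECC'],
    'Event Type': ['Vandalism', 'Severe Weather', 'Severe Weather'],
    'Demand Loss (MW)': [10.0, 87.0, 70.0],
})

# Each listed column is replaced by one 0/1 column per category, named
# '<original column>_<category>'; columns not listed pass through unchanged.
toy_dummies = pd.get_dummies(data=toy_df, columns=['NERC Region', 'Event Type'], dtype='int')

print(toy_dummies.columns.tolist())
```

Within each original column, exactly one of its dummy columns is 1 for any given row, which is why the preview above shows a single 1 among the `Event Type_*` columns.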


### Creating Training and Test Data Sets

For this exercise I will not split the data into training and test sets. I plan to perform unsupervised learning (clustering) on this data set, and because there are no labels to evaluate predictions against, a train/test split doesn't make much sense.
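
The reasoning above can be sketched with a small example. This assumes k-means as the clustering algorithm, which this notebook does not specify, and uses a synthetic matrix `X` rather than the outage data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic, unlabeled feature matrix: two well-separated blobs of 20 points each
X = np.vstack([
    rng.normal(0.0, 0.5, size=(20, 2)),
    rng.normal(5.0, 0.5, size=(20, 2)),
])

# There is no y to hold out: the model is fit on every row and
# assigns each row a cluster label (KMeans is an assumed choice here)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)

print(cluster_labels.shape)
```

Every row receives a cluster assignment, and there is no held-out target to score those assignments against.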


### Standardizing Feature Scales

In [4]:
# Scale the numeric features; fit_transform returns a NumPy array of only these columns
numeric_cols = ['Demand Loss (MW)', 'Number of Customers Affected', 'State Avg Temp (F)',
                'State Avg Windspeed (mph)', 'State Avg Precipitation (mm)',
                'Monthly Net Energy for Load (GWh)', 'Monthly Peak Hour Demand (MW)']
scaler = StandardScaler()
outage_df_scaled_dummies = scaler.fit_transform(outage_df_with_dummies[numeric_cols])
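
Because `fit_transform` returns a plain NumPy array containing only the columns passed in, the dummy columns are not part of the result of the cell above. One possible way to keep the scaled values and the dummy columns together is sketched below. The names `toy_df` and `toy_scaled` are hypothetical, and the toy frame borrows a few values from the preview earlier in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame with two numeric features and one dummy column (hypothetical subset)
toy_df = pd.DataFrame({
    'Demand Loss (MW)': [10.0, 87.0, 70.0, 124.0],
    'Number of Customers Affected': [3958.0, 67864.0, 70000.0, 70000.0],
    'Event Type_Severe Weather': [0, 1, 1, 0],
})
numeric_cols = ['Demand Loss (MW)', 'Number of Customers Affected']

# Work on a copy so the unscaled frame stays available, then overwrite only
# the numeric columns with their standardized values
toy_scaled = toy_df.copy()
toy_scaled[numeric_cols] = StandardScaler().fit_transform(toy_df[numeric_cols])

# StandardScaler uses the population standard deviation, hence ddof=0
print(toy_scaled[numeric_cols].mean().round(6))
print(toy_scaled[numeric_cols].std(ddof=0).round(6))
```

After scaling, each numeric column has mean 0 and unit standard deviation, while the dummy column keeps its 0/1 values.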