# Power Outages
* **See the main project notebook for instructions to be sure you satisfy the rubric!**
* See Project 03 for information on the dataset.
* A few example prediction questions to pursue are listed below. However, don't limit yourself to them!
    * Predict the severity (number of customers, duration, or demand loss) of a major power outage.
    * Predict the cause of a major power outage.
    * Predict the number and/or severity of major power outages in the year 2020.
    * Predict the electricity consumption of an area.

Be careful to justify what information you would know at the "time of prediction" and train your model using only those features.

# Summary of Findings


### Introduction
In this project, we look at major power outage data for the continental U.S. from January 2000 to July 2016. After data cleaning, the dataset has 1534 observations (one per outage occurrence) and 56 columns (location, cause, influential factors, timestamps, aftermath of the outage, etc.).

The classification problem is to predict the **cause category** (target variable) of each major power outage given the year, the state, the outage duration, the population of the state, and the number of customers affected (features). We use **recall** as our evaluation metric to assess the model and to guide further improvements. The justification for this metric over others is given in the Baseline Model section.

### Baseline Model
Our baseline model is a Decision Tree Classifier, since the target `CAUSE.CATEGORY` consists of 7 categories. It uses 5 features: [`'YEAR'`, `'U.S._STATE'`, `'OUTAGE.DURATION'`, `'POPULATION'`, `'CUSTOMERS.AFFECTED'`]. Of these, `YEAR` is ordinal, `U.S._STATE` is nominal, and the remaining three columns are quantitative.

We chose year as a feature because we guessed that the mix of cause categories changes over time as technology improves. We chose state because geography, demography, and consumption vary among states, which might affect power outages. We chose outage duration because the length of an outage depends on its severity, which is in turn closely linked to the cause category. We chose population because it is a straightforward measure of the scale of a state, and differences in population might also affect the cause. Finally, if the cause is a large-scale event, many customers will naturally be affected, which makes the number of customers affected another viable feature.

In addition, choosing a good evaluation metric is quite challenging. `CAUSE.CATEGORY` has 7 causes with very different frequencies (see `value_counts`), so recall, specificity, precision, etc. cannot be computed directly; they must be computed separately for each category. (These metrics are much easier to work with for a binary variable, as in the tumor/no tumor or terrorist/non-terrorist examples.) To tackle this problem, we first define a variable `categories` that contains all causes and pass it as the `labels` keyword argument of the confusion matrix. To better visualize True Positives, False Positives, False Negatives, and True Negatives, we use the multi-label confusion matrix, which displays each of the seven categories as its own, more readable 2x2 matrix. As expected, the True Positives in the multi-label confusion matrix match the diagonal entries of the ordinary confusion matrix. Since there are 7 categories with unequal proportions, we care most about the ratio of True Positives to all observations that are actually positive (unlike in the tumor problem, True Negatives do not carry much information here, so recall is the most helpful metric),

$$
{\rm Recall} =\frac{TP}{P} =\frac{TP}{TP + FN}
$$

so we use recall as our evaluation metric. More specifically, we take the mean of the 7 per-category recalls. This metric could be improved further: a category with TP = 0 and TP + FN = 0 is assigned a recall of 0, exactly like a category with TP = 0 and FN > 0, yet the two cases do not mean the same thing, since TP + FN = 0 indicates that the input data contain no positive labels for that category.
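The distinction between "no positives" and "no true positives" can be illustrated with a minimal sketch. It assumes a square confusion matrix with true labels on the rows, and the helper names `per_class_recall` and `macro_recall` as well as the toy matrix are hypothetical, not part of the notebook. Undefined recalls are marked as NaN and left out of the mean.

```python
import math

def per_class_recall(confusion):
    """Recall for each class of a square confusion matrix.

    Rows are true labels, columns are predicted labels. A class with no
    true samples (TP + FN == 0) gets NaN instead of 0.
    """
    recalls = []
    for i, row in enumerate(confusion):
        positives = sum(row)  # TP + FN for class i
        recalls.append(row[i] / positives if positives else math.nan)
    return recalls

def macro_recall(confusion):
    """Mean recall over the classes that actually occur in the true labels."""
    defined = [r for r in per_class_recall(confusion) if not math.isnan(r)]
    return sum(defined) / len(defined)

# Hypothetical 3-class example; the third class never occurs in the true labels.
toy = [
    [3, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
]
print(per_class_recall(toy))  # [0.75, 0.5, nan]
print(macro_recall(toy))      # 0.625
```

Whether a class without positive labels should be skipped or counted as 0 is a design choice; skipping it keeps the mean from being dragged down by classes the test split happens not to contain.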

After training, our baseline model achieves a mean recall of 0.582 over all categories on the test set, which is quite decent.

### Final Model
To arrive at our final model, we made several feature engineering attempts that could in theory improve our evaluation metric. The first is grid search, which finds the best values of three hyperparameters (here `max_depth`, `min_samples_leaf`, and `min_samples_split`) for our pipeline. However, further validation shows that it does not supersede our baseline model, reaching a mean recall of only 0.509. Then, we ...

### Fairness Evaluation
TODO

# Code

We first import necessary packages.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
import datetime
from IPython.display import display
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # Higher resolution figures

In [67]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn import utils
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

This section of the code processes and cleans the dataset.

In [3]:
pd.set_option('display.max_columns', None)

In [4]:
df = pd.read_excel('outage.xlsx', skiprows = 5)
df = df.drop(0).drop(['variables','OBS'], axis = 1).reset_index(drop=True)
df

Unnamed: 0,YEAR,MONTH,U.S._STATE,POSTAL.CODE,NERC.REGION,CLIMATE.REGION,ANOMALY.LEVEL,CLIMATE.CATEGORY,OUTAGE.START.DATE,OUTAGE.START.TIME,OUTAGE.RESTORATION.DATE,OUTAGE.RESTORATION.TIME,CAUSE.CATEGORY,CAUSE.CATEGORY.DETAIL,HURRICANE.NAMES,OUTAGE.DURATION,DEMAND.LOSS.MW,CUSTOMERS.AFFECTED,RES.PRICE,COM.PRICE,IND.PRICE,TOTAL.PRICE,RES.SALES,COM.SALES,IND.SALES,TOTAL.SALES,RES.PERCEN,COM.PERCEN,IND.PERCEN,RES.CUSTOMERS,COM.CUSTOMERS,IND.CUSTOMERS,TOTAL.CUSTOMERS,RES.CUST.PCT,COM.CUST.PCT,IND.CUST.PCT,PC.REALGSP.STATE,PC.REALGSP.USA,PC.REALGSP.REL,PC.REALGSP.CHANGE,UTIL.REALGSP,TOTAL.REALGSP,UTIL.CONTRI,PI.UTIL.OFUSA,POPULATION,POPPCT_URBAN,POPPCT_UC,POPDEN_URBAN,POPDEN_UC,POPDEN_RURAL,AREAPCT_URBAN,AREAPCT_UC,PCT_LAND,PCT_WATER_TOT,PCT_WATER_INLAND
0,2011.0,7.0,Minnesota,MN,MRO,East North Central,-0.3,normal,2011-07-01 00:00:00,17:00:00,2011-07-03 00:00:00,20:00:00,severe weather,,,3060,,70000.0,11.6,9.18,6.81,9.28,2332915,2114774,2113291,6562520,35.5491,32.225,32.2024,2308736.0,276286.0,10673.0,2595696.0,88.9448,10.644,0.411181,51268,47586,1.07738,1.6,4802,274182,1.75139,2.2,5348119.0,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.5927,8.40733,5.47874
1,2014.0,5.0,Minnesota,MN,MRO,East North Central,-0.1,normal,2014-05-11 00:00:00,18:38:00,2014-05-11 00:00:00,18:39:00,intentional attack,vandalism,,1,,,12.12,9.71,6.49,9.28,1586986,1807756,1887927,5284231,30.0325,34.2104,35.7276,2345860.0,284978.0,9898.0,2640737.0,88.8335,10.7916,0.37482,53499,49091,1.08979,1.9,5226,291955,1.79,2.2,5457125.0,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.5927,8.40733,5.47874
2,2010.0,10.0,Minnesota,MN,MRO,East North Central,-1.5,cold,2010-10-26 00:00:00,20:00:00,2010-10-28 00:00:00,22:00:00,severe weather,heavy wind,,3000,,70000.0,10.87,8.19,6.07,8.15,1467293,1801683,1951295,5222116,28.0977,34.501,37.366,2300291.0,276463.0,10150.0,2586905.0,88.9206,10.687,0.392361,50447,47287,1.06683,2.7,4571,267895,1.70627,2.1,5310903.0,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.5927,8.40733,5.47874
3,2012.0,6.0,Minnesota,MN,MRO,East North Central,-0.1,normal,2012-06-19 00:00:00,04:30:00,2012-06-20 00:00:00,23:00:00,severe weather,thunderstorm,,2550,,68200.0,11.79,9.25,6.71,9.19,1851519,1941174,1993026,5787064,31.9941,33.5433,34.4393,2317336.0,278466.0,11010.0,2606813.0,88.8954,10.6822,0.422355,51598,48156,1.07148,0.6,5364,277627,1.93209,2.2,5380443.0,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.5927,8.40733,5.47874
4,2015.0,7.0,Minnesota,MN,MRO,East North Central,1.2,warm,2015-07-18 00:00:00,02:00:00,2015-07-19 00:00:00,07:00:00,severe weather,,,1740,250,250000.0,13.07,10.16,7.74,10.43,2028875,2161612,1777937,5970339,33.9826,36.2059,29.7795,2374674.0,289044.0,9812.0,2673531.0,88.8216,10.8113,0.367005,54431,49844,1.09203,1.7,4873,292023,1.6687,2.2,5489594.0,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.5927,8.40733,5.47874
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1529,2011.0,12.0,North Dakota,ND,MRO,West North Central,-0.9,cold,2011-12-06 00:00:00,08:00:00,2011-12-06 00:00:00,20:00:00,public appeal,,,720,155,34500.0,8.41,7.8,6.2,7.56,488853,438133,386693,1313678,37.2125,33.3516,29.4359,330738.0,60017.0,3639.0,394394.0,83.8598,15.2175,0.922681,57012,47586,1.19808,9.8,934,39067,2.39076,0.5,685326.0,59.9,19.9,2192.2,1868.2,3.9,0.27,0.1,97.5996,2.40177,2.40177
1530,2006.0,,North Dakota,ND,MRO,West North Central,,,,,,,fuel supply emergency,Coal,,,1650,,,,,,,,,,,,,309997.0,53709.0,2331.0,366037.0,84.6901,14.6731,0.636821,42913,48909,0.877405,3.5,1019,27868,3.65652,0.7,649422.0,59.9,19.9,2192.2,1868.2,3.9,0.27,0.1,97.5996,2.40177,2.40177
1531,2009.0,8.0,South Dakota,SD,RFC,West North Central,0.5,warm,2009-08-29 00:00:00,22:54:00,2009-08-29 00:00:00,23:53:00,islanding,,,59,84,,9.25,7.47,5.53,7.67,337874,370771,215406,924051,36.5644,40.1245,23.3111,367206.0,65971.0,3052.0,436229.0,84.1773,15.123,0.699633,45230,46680,0.968937,0,606,36504,1.66009,0.3,807067.0,56.65,26.73,2038.3,1905.4,4.7,0.3,0.15,98.3077,1.69226,1.69226
1532,2009.0,8.0,South Dakota,SD,MRO,West North Central,0.5,warm,2009-08-29 00:00:00,11:00:00,2009-08-29 00:00:00,14:01:00,islanding,,,181,373,,9.25,7.47,5.53,7.67,337874,370771,215406,924051,36.5644,40.1245,23.3111,367206.0,65971.0,3052.0,436229.0,84.1773,15.123,0.699633,45230,46680,0.968937,0,606,36504,1.66009,0.3,807067.0,56.65,26.73,2038.3,1905.4,4.7,0.3,0.15,98.3077,1.69226,1.69226


In [5]:
start_time = pd.to_datetime(
    df['OUTAGE.START.DATE'].dropna().astype(str).str.split(' ').str[0] 
    + ' ' 
    + df['OUTAGE.START.TIME'].dropna().astype(str)
)

restoration_time = pd.to_datetime(
    df['OUTAGE.RESTORATION.DATE'].dropna().astype(str).str.split(' ').str[0] 
    + ' ' 
    + df['OUTAGE.RESTORATION.TIME'].dropna().astype(str)
)

In [6]:
df['OUTAGE.RESTORATION'] = restoration_time
df['OUTAGE.START'] = start_time
df = df.drop(['OUTAGE.START.DATE', 'OUTAGE.START.TIME', 'OUTAGE.RESTORATION.DATE', 'OUTAGE.RESTORATION.TIME'], axis = 1)
df

Unnamed: 0,YEAR,MONTH,U.S._STATE,POSTAL.CODE,NERC.REGION,CLIMATE.REGION,ANOMALY.LEVEL,CLIMATE.CATEGORY,CAUSE.CATEGORY,CAUSE.CATEGORY.DETAIL,HURRICANE.NAMES,OUTAGE.DURATION,DEMAND.LOSS.MW,CUSTOMERS.AFFECTED,RES.PRICE,COM.PRICE,IND.PRICE,TOTAL.PRICE,RES.SALES,COM.SALES,IND.SALES,TOTAL.SALES,RES.PERCEN,COM.PERCEN,IND.PERCEN,RES.CUSTOMERS,COM.CUSTOMERS,IND.CUSTOMERS,TOTAL.CUSTOMERS,RES.CUST.PCT,COM.CUST.PCT,IND.CUST.PCT,PC.REALGSP.STATE,PC.REALGSP.USA,PC.REALGSP.REL,PC.REALGSP.CHANGE,UTIL.REALGSP,TOTAL.REALGSP,UTIL.CONTRI,PI.UTIL.OFUSA,POPULATION,POPPCT_URBAN,POPPCT_UC,POPDEN_URBAN,POPDEN_UC,POPDEN_RURAL,AREAPCT_URBAN,AREAPCT_UC,PCT_LAND,PCT_WATER_TOT,PCT_WATER_INLAND,OUTAGE.RESTORATION,OUTAGE.START
0,2011.0,7.0,Minnesota,MN,MRO,East North Central,-0.3,normal,severe weather,,,3060,,70000.0,11.6,9.18,6.81,9.28,2332915,2114774,2113291,6562520,35.5491,32.225,32.2024,2308736.0,276286.0,10673.0,2595696.0,88.9448,10.644,0.411181,51268,47586,1.07738,1.6,4802,274182,1.75139,2.2,5348119.0,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.5927,8.40733,5.47874,2011-07-03 20:00:00,2011-07-01 17:00:00
1,2014.0,5.0,Minnesota,MN,MRO,East North Central,-0.1,normal,intentional attack,vandalism,,1,,,12.12,9.71,6.49,9.28,1586986,1807756,1887927,5284231,30.0325,34.2104,35.7276,2345860.0,284978.0,9898.0,2640737.0,88.8335,10.7916,0.37482,53499,49091,1.08979,1.9,5226,291955,1.79,2.2,5457125.0,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.5927,8.40733,5.47874,2014-05-11 18:39:00,2014-05-11 18:38:00
2,2010.0,10.0,Minnesota,MN,MRO,East North Central,-1.5,cold,severe weather,heavy wind,,3000,,70000.0,10.87,8.19,6.07,8.15,1467293,1801683,1951295,5222116,28.0977,34.501,37.366,2300291.0,276463.0,10150.0,2586905.0,88.9206,10.687,0.392361,50447,47287,1.06683,2.7,4571,267895,1.70627,2.1,5310903.0,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.5927,8.40733,5.47874,2010-10-28 22:00:00,2010-10-26 20:00:00
3,2012.0,6.0,Minnesota,MN,MRO,East North Central,-0.1,normal,severe weather,thunderstorm,,2550,,68200.0,11.79,9.25,6.71,9.19,1851519,1941174,1993026,5787064,31.9941,33.5433,34.4393,2317336.0,278466.0,11010.0,2606813.0,88.8954,10.6822,0.422355,51598,48156,1.07148,0.6,5364,277627,1.93209,2.2,5380443.0,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.5927,8.40733,5.47874,2012-06-20 23:00:00,2012-06-19 04:30:00
4,2015.0,7.0,Minnesota,MN,MRO,East North Central,1.2,warm,severe weather,,,1740,250,250000.0,13.07,10.16,7.74,10.43,2028875,2161612,1777937,5970339,33.9826,36.2059,29.7795,2374674.0,289044.0,9812.0,2673531.0,88.8216,10.8113,0.367005,54431,49844,1.09203,1.7,4873,292023,1.6687,2.2,5489594.0,73.27,15.28,2279,1700.5,18.2,2.14,0.6,91.5927,8.40733,5.47874,2015-07-19 07:00:00,2015-07-18 02:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1529,2011.0,12.0,North Dakota,ND,MRO,West North Central,-0.9,cold,public appeal,,,720,155,34500.0,8.41,7.8,6.2,7.56,488853,438133,386693,1313678,37.2125,33.3516,29.4359,330738.0,60017.0,3639.0,394394.0,83.8598,15.2175,0.922681,57012,47586,1.19808,9.8,934,39067,2.39076,0.5,685326.0,59.9,19.9,2192.2,1868.2,3.9,0.27,0.1,97.5996,2.40177,2.40177,2011-12-06 20:00:00,2011-12-06 08:00:00
1530,2006.0,,North Dakota,ND,MRO,West North Central,,,fuel supply emergency,Coal,,,1650,,,,,,,,,,,,,309997.0,53709.0,2331.0,366037.0,84.6901,14.6731,0.636821,42913,48909,0.877405,3.5,1019,27868,3.65652,0.7,649422.0,59.9,19.9,2192.2,1868.2,3.9,0.27,0.1,97.5996,2.40177,2.40177,NaT,NaT
1531,2009.0,8.0,South Dakota,SD,RFC,West North Central,0.5,warm,islanding,,,59,84,,9.25,7.47,5.53,7.67,337874,370771,215406,924051,36.5644,40.1245,23.3111,367206.0,65971.0,3052.0,436229.0,84.1773,15.123,0.699633,45230,46680,0.968937,0,606,36504,1.66009,0.3,807067.0,56.65,26.73,2038.3,1905.4,4.7,0.3,0.15,98.3077,1.69226,1.69226,2009-08-29 23:53:00,2009-08-29 22:54:00
1532,2009.0,8.0,South Dakota,SD,MRO,West North Central,0.5,warm,islanding,,,181,373,,9.25,7.47,5.53,7.67,337874,370771,215406,924051,36.5644,40.1245,23.3111,367206.0,65971.0,3052.0,436229.0,84.1773,15.123,0.699633,45230,46680,0.968937,0,606,36504,1.66009,0.3,807067.0,56.65,26.73,2038.3,1905.4,4.7,0.3,0.15,98.3077,1.69226,1.69226,2009-08-29 14:01:00,2009-08-29 11:00:00
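As a quick sanity check on the combined timestamps, the difference between restoration and start should reproduce `OUTAGE.DURATION`. The sketch below uses only the standard library and the first four rows shown above; it assumes the duration is recorded in minutes, and `minutes_between` is a hypothetical helper.

```python
from datetime import datetime

# (OUTAGE.START, OUTAGE.RESTORATION, reported OUTAGE.DURATION) for rows 0-3 above.
rows = [
    ("2011-07-01 17:00:00", "2011-07-03 20:00:00", 3060),
    ("2014-05-11 18:38:00", "2014-05-11 18:39:00", 1),
    ("2010-10-26 20:00:00", "2010-10-28 22:00:00", 3000),
    ("2012-06-19 04:30:00", "2012-06-20 23:00:00", 2550),
]
FMT = "%Y-%m-%d %H:%M:%S"

def minutes_between(start, end):
    """Whole minutes elapsed between two timestamp strings."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)

computed = [minutes_between(start, end) for start, end, _ in rows]
print(computed)  # [3060, 1, 3000, 2550]
```

For these rows the computed values agree with the reported durations, which suggests the date and time columns were combined correctly.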


This part displays the number of occurrences of each of the seven unique cause categories for power outages.

In [7]:
df['CAUSE.CATEGORY'].value_counts()

severe weather                   763
intentional attack               418
system operability disruption    127
public appeal                     69
equipment failure                 60
fuel supply emergency             51
islanding                         46
Name: CAUSE.CATEGORY, dtype: int64
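These counts show how imbalanced the classes are. As a rough reference point, the sketch below works out what a trivial classifier that always predicts the most frequent category would score. It uses the counts above, which come from the full dataset before any rows are dropped, so the figures are only indicative for the modeling subset.

```python
# Class counts copied from the value_counts() output above.
counts = {
    "severe weather": 763,
    "intentional attack": 418,
    "system operability disruption": 127,
    "public appeal": 69,
    "equipment failure": 60,
    "fuel supply emergency": 51,
    "islanding": 46,
}

total = sum(counts.values())
majority = max(counts, key=counts.get)

# Always predicting the majority class is right exactly when that class occurs.
majority_accuracy = counts[majority] / total
# Its recall is 1 for the majority class and 0 for the other six.
majority_macro_recall = 1 / len(counts)

print(total, majority)                   # 1534 severe weather
print(round(majority_accuracy, 3))       # 0.497
print(round(majority_macro_recall, 3))   # 0.143
```

The gap between the two numbers is one reason a class-averaged recall is more informative than accuracy here.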

### Baseline Model

In this section, we will implement our baseline model using 5 features: [`'YEAR'`, `'U.S._STATE'`, `'OUTAGE.DURATION'`, `'POPULATION'`, `'CUSTOMERS.AFFECTED'`] to predict `CAUSE.CATEGORY`.

In [8]:
# Target (y): CAUSE.CATEGORY
# Features (X): YEAR, U.S._STATE, OUTAGE.DURATION, POPULATION, CUSTOMERS.AFFECTED

In [9]:
from sklearn.preprocessing import Binarizer

Out of the 55 columns, we only select those that we'll use.

In [10]:
ndf = df[['YEAR', 'U.S._STATE', 'CAUSE.CATEGORY', 'OUTAGE.DURATION', 'POPULATION', 'CUSTOMERS.AFFECTED', 'TOTAL.CUSTOMERS']]
ndf = ndf.dropna()
ndf

Unnamed: 0,YEAR,U.S._STATE,CAUSE.CATEGORY,OUTAGE.DURATION,POPULATION,CUSTOMERS.AFFECTED,TOTAL.CUSTOMERS
0,2011.0,Minnesota,severe weather,3060,5348119.0,70000.0,2595696.0
2,2010.0,Minnesota,severe weather,3000,5310903.0,70000.0,2586905.0
3,2012.0,Minnesota,severe weather,2550,5380443.0,68200.0,2606813.0
4,2015.0,Minnesota,severe weather,1740,5489594.0,250000.0,2673531.0
5,2010.0,Minnesota,severe weather,1860,5310903.0,60000.0,2586905.0
...,...,...,...,...,...,...,...
1522,2004.0,Idaho,system operability disruption,95,1391802.0,35000.0,701140.0
1523,2011.0,Idaho,intentional attack,360,1584134.0,0.0,794925.0
1524,2003.0,Idaho,public appeal,1548,1363380.0,0.0,687334.0
1526,2016.0,Idaho,intentional attack,0,1680026.0,0.0,849763.0


We first separate the features from the target variable, then split the data into training and testing sets.

In [11]:
X = ndf[['YEAR', 'U.S._STATE', 'CUSTOMERS.AFFECTED', 'OUTAGE.DURATION', 'POPULATION']]
y = ndf['CAUSE.CATEGORY']

In [52]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
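Because the split above is random and unseeded, the scores in this notebook change from run to run, and rare categories may be missing from the test set. One possible remedy, sketched here on hypothetical stand-in labels rather than the outage data, is to pass `random_state` and `stratify` to `train_test_split`.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels standing in for CAUSE.CATEGORY.
y_demo = ["severe weather"] * 60 + ["intentional attack"] * 30 + ["islanding"] * 10
X_demo = [[i] for i in range(len(y_demo))]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.2,
    stratify=y_demo,   # keep class proportions the same in both splits
    random_state=42,   # make the split reproducible
)
print(Counter(y_te))
```

Stratification requires at least two samples per class, so it may not apply directly if a category ends up with a single row after `dropna()`.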

This section transforms the categorical variables using `OneHotEncoder` and leaves the numeric variables unchanged.

In [13]:
cat = ['YEAR', 'U.S._STATE']
num = ['OUTAGE.DURATION', 'CUSTOMERS.AFFECTED', 'POPULATION']

cat_func = OneHotEncoder(handle_unknown='ignore')
num_func = FunctionTransformer(lambda x:x)

ct = ColumnTransformer([('categorical', cat_func, cat), ('numerical', num_func, num)])
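As an aside, `ColumnTransformer` also accepts the string `'passthrough'` in place of a transformer, which leaves the selected columns unchanged without an identity lambda. A minimal sketch on a hypothetical three-row frame (`demo` and `ct_demo` are illustrative names):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Tiny stand-in frame; values taken from rows shown earlier in the notebook.
demo = pd.DataFrame({
    "U.S._STATE": ["Minnesota", "Idaho", "Minnesota"],
    "OUTAGE.DURATION": [3060, 95, 1740],
})

ct_demo = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["U.S._STATE"]),
    ("numerical", "passthrough", ["OUTAGE.DURATION"]),  # keep values as-is
])

out = ct_demo.fit_transform(demo)
print(out.shape)  # (3, 3): two one-hot columns plus the untouched duration
```

Avoiding the lambda also keeps the pipeline picklable, which can matter if the model is saved later.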

In [14]:
baseline_pl = Pipeline([('column', ct), ('dtc', (DecisionTreeClassifier(max_depth=15)))])

In [87]:
baseline_pl.fit(X_train, y_train)

Pipeline(steps=[('column',
                 ColumnTransformer(transformers=[('categorical',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['YEAR', 'U.S._STATE']),
                                                 ('numerical',
                                                  FunctionTransformer(func=<function <lambda> at 0x000001CF316FC5E0>),
                                                  ['OUTAGE.DURATION',
                                                   'CUSTOMERS.AFFECTED',
                                                   'POPULATION'])])),
                ('dtc', DecisionTreeClassifier(max_depth=15))])

Accuracy is not a good measure for classes this imbalanced, but we display it anyway to document our thought process.

In [54]:
baseline_pl.score(X_train, y_train)

0.9976303317535545

In [55]:
baseline_pl.score(X_test, y_test)

0.8160377358490566

We observe that the diagonal entries of the confusion matrix indeed align with the True Positives of the multi-label confusion matrix. We then define a helper function `calculate_recall` to compute **recall** for each category, and finally take the mean as our evaluation metric.

In [18]:
# categories contains all causes
categories = ndf['CAUSE.CATEGORY'].unique()
categories

array(['severe weather', 'intentional attack', 'public appeal',
       'system operability disruption', 'islanding', 'equipment failure',
       'fuel supply emergency'], dtype=object)

In [88]:
base_pred = baseline_pl.predict(X_test)

In [20]:
metrics.confusion_matrix(y_test, base_pred,labels=categories)

array([[134,   1,   0,   5,   1,   3,   1],
       [  1,  28,   0,   0,   1,   1,   0],
       [  1,   1,   4,   0,   0,   0,   0],
       [  6,   1,   1,   5,   0,   3,   0],
       [  1,   0,   2,   2,   4,   0,   0],
       [  2,   0,   0,   2,   0,   0,   0],
       [  0,   0,   0,   1,   0,   0,   0]], dtype=int64)

In [21]:
multi_matrix = metrics.multilabel_confusion_matrix(y_test, base_pred,labels=categories)
multi_matrix
# Counts of TN / FP / FN / TP

array([[[ 56,  11],
        [ 11, 134]],

       [[178,   3],
        [  3,  28]],

       [[203,   3],
        [  2,   4]],

       [[186,  10],
        [ 11,   5]],

       [[201,   2],
        [  5,   4]],

       [[201,   7],
        [  4,   0]],

       [[210,   1],
        [  1,   0]]], dtype=int64)
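The relationship between the two outputs above can be checked by hand. The sketch below derives each `[[TN, FP], [FN, TP]]` block from the 7x7 confusion matrix using plain Python; `one_vs_rest` is a hypothetical helper, and the matrix is copied from the output above.

```python
def one_vs_rest(confusion):
    """Return one [[TN, FP], [FN, TP]] matrix per class.

    Rows of `confusion` are true labels, columns are predicted labels.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    result = []
    for i in range(n):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                       # rest of row i
        fp = sum(confusion[r][i] for r in range(n)) - tp  # rest of column i
        tn = total - tp - fn - fp
        result.append([[tn, fp], [fn, tp]])
    return result

# Confusion matrix printed above (same category order as `categories`).
cm = [
    [134, 1, 0, 5, 1, 3, 1],
    [1, 28, 0, 0, 1, 1, 0],
    [1, 1, 4, 0, 0, 0, 0],
    [6, 1, 1, 5, 0, 3, 0],
    [1, 0, 2, 2, 4, 0, 0],
    [2, 0, 0, 2, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
]

blocks = one_vs_rest(cm)
print(blocks[0])  # [[56, 11], [11, 134]]
print(blocks[1])  # [[178, 3], [3, 28]]
```

Each block's True Positive count is the matching diagonal entry, and its False Negative count is the rest of that category's row.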

In [27]:
def calculate_recall(matrix):
    # matrix is a 2x2 array laid out as [[TN, FP], [FN, TP]]
    # return None if there are no positive labels among the input data
    if matrix[1][1] + matrix[1][0] == 0:
        return None
    return matrix[1][1]/(matrix[1][1]+matrix[1][0])

In [34]:
lst = []
for cur_matrix in multi_matrix:
    cur_recall = calculate_recall(cur_matrix)
    lst.append(cur_recall)
lst = [r for r in lst if r is not None]  # skip categories with no positive labels
avg_recall = np.mean(lst)
avg_recall

0.4644249783710296

Next, we attempt feature engineering that could improve our model, compare the resulting average recall with that of the baseline model, and then decide on our final model.

### Feature Engineering: Grid Search

In this section, we attempted grid search.

In [57]:
# on baseline
baseline_pl.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'column', 'dtc', 'column__n_jobs', 'column__remainder', 'column__sparse_threshold', 'column__transformer_weights', 'column__transformers', 'column__verbose', 'column__categorical', 'column__numerical', 'column__categorical__categories', 'column__categorical__drop', 'column__categorical__dtype', 'column__categorical__handle_unknown', 'column__categorical__sparse', 'column__numerical__accept_sparse', 'column__numerical__check_inverse', 'column__numerical__func', 'column__numerical__inv_kw_args', 'column__numerical__inverse_func', 'column__numerical__kw_args', 'column__numerical__validate', 'dtc__ccp_alpha', 'dtc__class_weight', 'dtc__criterion', 'dtc__max_depth', 'dtc__max_features', 'dtc__max_leaf_nodes', 'dtc__min_impurity_decrease', 'dtc__min_impurity_split', 'dtc__min_samples_leaf', 'dtc__min_samples_split', 'dtc__min_weight_fraction_leaf', 'dtc__random_state', 'dtc__splitter'])

In [29]:
metrics.recall_score(y_test, base_pred, average = None)

array([0.        , 0.        , 0.90322581, 0.44444444, 0.66666667,
       0.92413793, 0.3125    ])

In [89]:
metrics.recall_score(y_test, base_pred, average = 'macro')

0.5476873701011632

In [30]:
lst

[0.9241379310344827,
 0.9032258064516129,
 0.6666666666666666,
 0.3125,
 0.4444444444444444]

In [56]:
grid_params = {
    'dtc__max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 15, 20],
    'dtc__min_samples_split': [9, 11, 13, 15, 17, 20], 
    'dtc__min_samples_leaf': [1, 2, 3, 4, 5]
}

In [57]:
search = GridSearchCV(baseline_pl, grid_params, cv = 3)

In [58]:
search.fit(X_train, y_train)

GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('column',
                                        ColumnTransformer(transformers=[('categorical',
                                                                         OneHotEncoder(handle_unknown='ignore'),
                                                                         ['YEAR',
                                                                          'U.S._STATE']),
                                                                        ('numerical',
                                                                         FunctionTransformer(func=<function <lambda> at 0x000001CF316FC5E0>),
                                                                         ['OUTAGE.DURATION',
                                                                          'CUSTOMERS.AFFECTED',
                                                                          'POPULATION'])])),
                                       ('dtc',
 

In [59]:
search.best_params_

{'dtc__max_depth': 4, 'dtc__min_samples_leaf': 5, 'dtc__min_samples_split': 15}
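Note that `GridSearchCV` selects parameters using the estimator's default score, which is accuracy for a classifier, unless `scoring` is set. The sketch below shows one way to select on mean per-class recall instead; it runs on the bundled iris data as a stand-in (not the outage set), and the grid values are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Stand-in data so the sketch is self-contained.
X_demo, y_demo = load_iris(return_X_y=True)

search_demo = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]},
    scoring="recall_macro",  # rank candidates by mean per-class recall
    cv=3,
)
search_demo.fit(X_demo, y_demo)

# best_estimator_ is already refit on all the data with best_params_,
# so the chosen values do not need to be typed in again by hand.
best_tree = search_demo.best_estimator_
print(search_demo.best_params_)
```

Reusing `best_estimator_` also removes the risk of a transcription slip when rebuilding the pipeline from the reported parameters.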

In [60]:
grid_pl = Pipeline([('column', ct), ('dtc', (DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, min_samples_split=13)))])

In [61]:
grid_pl.fit(X_train, y_train)

Pipeline(steps=[('column',
                 ColumnTransformer(transformers=[('categorical',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['YEAR', 'U.S._STATE']),
                                                 ('numerical',
                                                  FunctionTransformer(func=<function <lambda> at 0x000001CF316FC5E0>),
                                                  ['OUTAGE.DURATION',
                                                   'CUSTOMERS.AFFECTED',
                                                   'POPULATION'])])),
                ('dtc',
                 DecisionTreeClassifier(max_depth=4, min_samples_leaf=5,
                                        min_samples_split=13))])

In [62]:
grid_pl.score(X_test, y_test)

0.8443396226415094

Now we repeat the steps from the baseline model: we compute the mean recall over all categories and check whether it has improved.

In [63]:
grid_pred = grid_pl.predict(X_test)

In [64]:
metrics.recall_score(y_test, grid_pred, average = 'macro')

0.5350370876686666

In [70]:
multi_grid_matrix = metrics.multilabel_confusion_matrix(y_test, grid_pred,labels=categories)
multi_grid_matrix

array([[[ 68,  10],
        [  5, 129]],

       [[168,   5],
        [  3,  36]],

       [[205,   3],
        [  2,   2]],

       [[187,   9],
        [  6,  10]],

       [[201,   2],
        [  4,   5]],

       [[203,   1],
        [  8,   0]],

       [[210,   0],
        [  2,   0]]])

In [71]:
grid_lst = []
for cur_matrix in multi_grid_matrix:
    cur_recall = calculate_recall(cur_matrix)
    grid_lst.append(cur_recall)
avg_grid_recall = np.mean(grid_lst)
avg_grid_recall

0.5094741493995226

It turns out that the mean recall decreased. We will engineer more features.

### Feature Engineering: Affected Proportion

In this section, we tried replacing the number of customers affected with the proportion of customers affected. This should in theory improve our model: each area has a very different total population, so the total number of customers also differs. Comparing raw counts of affected customers between states or regions of very different sizes would be quite biased. Our hypothesis is that switching this feature to a proportion largely solves the problem.
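A minimal sketch of such a proportion feature is given below. It assumes the denominator is the `TOTAL.CUSTOMERS` column, which would then have to be kept in the selected columns; `affected_proportion` and the output column name are hypothetical.

```python
import pandas as pd

def affected_proportion(frame):
    """Share of a state's customers affected by each outage (hypothetical helper)."""
    prop = frame["CUSTOMERS.AFFECTED"] / frame["TOTAL.CUSTOMERS"]
    return prop.to_frame("AFFECTED.PROPORTION")

# Three rows taken from the tables shown earlier (Minnesota 2011, Minnesota 2015, Idaho 2004).
demo = pd.DataFrame({
    "CUSTOMERS.AFFECTED": [70000.0, 250000.0, 35000.0],
    "TOTAL.CUSTOMERS": [2595696.0, 2673531.0, 701140.0],
})

proportions = affected_proportion(demo)
print(proportions.round(4))
```

Because the helper returns a one-column frame, it could be wrapped in a `FunctionTransformer` and used as one branch of a `ColumnTransformer`.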

In [155]:
ndf = df[['YEAR', 'U.S._STATE', 'CAUSE.CATEGORY', 'OUTAGE.DURATION', 'POPULATION', 'CUSTOMERS.AFFECTED', 'POPPCT_URBAN', 'POPDEN_URBAN', 'POPDEN_RURAL']]
ndf = ndf.dropna()
ndf

Unnamed: 0,YEAR,U.S._STATE,CAUSE.CATEGORY,OUTAGE.DURATION,POPULATION,CUSTOMERS.AFFECTED,POPPCT_URBAN,POPDEN_URBAN,POPDEN_RURAL
0,2011.0,Minnesota,severe weather,3060,5348119.0,70000.0,73.27,2279,18.2
2,2010.0,Minnesota,severe weather,3000,5310903.0,70000.0,73.27,2279,18.2
3,2012.0,Minnesota,severe weather,2550,5380443.0,68200.0,73.27,2279,18.2
4,2015.0,Minnesota,severe weather,1740,5489594.0,250000.0,73.27,2279,18.2
5,2010.0,Minnesota,severe weather,1860,5310903.0,60000.0,73.27,2279,18.2
...,...,...,...,...,...,...,...,...,...
1522,2004.0,Idaho,system operability disruption,95,1391802.0,35000.0,70.58,2216.8,5.6
1523,2011.0,Idaho,intentional attack,360,1584134.0,0.0,70.58,2216.8,5.6
1524,2003.0,Idaho,public appeal,1548,1363380.0,0.0,70.58,2216.8,5.6
1526,2016.0,Idaho,intentional attack,0,1680026.0,0.0,70.58,2216.8,5.6


In [172]:
X = ndf[['YEAR', 'U.S._STATE', 'CUSTOMERS.AFFECTED', 'OUTAGE.DURATION', 'POPULATION', 'POPPCT_URBAN', 'POPDEN_URBAN', 'POPDEN_RURAL']]
y = ndf['CAUSE.CATEGORY']

In [207]:
# Run with the same data as above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

We define a function that combines the urban and rural population densities into a single population-weighted density.

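**Note:** I have not modified anything from this point on. He said that PCA does not really count as feature engineering, and a lot of the code below is commented out, so I am leaving it alone for now to avoid misreading it and breaking something. The idea is simply to fit the pipeline, call `pipeline.predict(X_test)`, and then repeat the steps at the bottom of my grid search section to compute recall. If the recall comes out low, so be it. Luckily I checked the lecture: accuracy has serious flaws as a metric here. If we still cannot improve the model, that is frustrating, but we should just move on. It is also rather absurd that grid search did not improve anything either.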

In [175]:
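def pop_den(df):
    # Population-weighted average of urban and rural density.
    # POPPCT_URBAN is a percentage (0-100), so the rural share is 100 - POPPCT_URBAN.
    den = (df['POPPCT_URBAN'] * df['POPDEN_URBAN'] + (100 - df['POPPCT_URBAN']) * df['POPDEN_RURAL']) / 100
    return den.to_frame()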
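A small worked check of the weighted-density formula, assuming `POPPCT_URBAN` is expressed in percent (0-100), as displayed values such as 73.27 suggest. With a percentage, the rural weight is `100 - POPPCT_URBAN`; using `1 - POPPCT_URBAN` would give the rural density a negative weight. The helper name `weighted_density` is hypothetical.

```python
def weighted_density(pct_urban, den_urban, den_rural):
    """Population-share-weighted density; pct_urban is in percent (0-100)."""
    return (pct_urban * den_urban + (100 - pct_urban) * den_rural) / 100

# Minnesota values from the table above: 73.27 % urban, densities 2279 and 18.2.
mn = weighted_density(73.27, 2279, 18.2)
print(round(mn, 2))  # 1674.69

# The result is a weighted average, so it lies between the two densities.
print(18.2 <= mn <= 2279)  # True
```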

In [176]:
cat = ['YEAR', 'U.S._STATE']
num = ['OUTAGE.DURATION', 'CUSTOMERS.AFFECTED']
popden = ['POPPCT_URBAN', 'POPDEN_URBAN', 'POPDEN_RURAL']

cat_func = OneHotEncoder(handle_unknown='ignore')
#cat_func = OneHotEncoder(handle_unknown='ignore')

num_func = StandardScaler()
#num_func = FunctionTransformer(lambda x: x * x)

pop_func = FunctionTransformer(pop_den)

#ct = ColumnTransformer([('categorical', cat_func, cat), ('numerical', num_func, num)])
ct = ColumnTransformer([('categorical', cat_func, cat), ('numerical', num_func, num), ('density', pop_func, popden)])

In [208]:
final_pl = Pipeline([('column', ct), ('dtc', (DecisionTreeClassifier(max_depth=15)))])

In [209]:
final_pl.fit(X_train, y_train)

Pipeline(steps=[('column',
                 ColumnTransformer(transformers=[('categorical',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['YEAR', 'U.S._STATE']),
                                                 ('numerical', StandardScaler(),
                                                  ['OUTAGE.DURATION',
                                                   'CUSTOMERS.AFFECTED']),
                                                 ('density',
                                                  FunctionTransformer(func=<function pop_den at 0x000001CF3437F3A0>),
                                                  ['POPPCT_URBAN',
                                                   'POPDEN_URBAN',
                                                   'POPDEN_RURAL'])])),
                ('dtc', DecisionTreeClassifier(max_depth=15))])

In [210]:
final_pl.score(X_train, y_train)

0.997610513739546

In [211]:
final_pl.score(X_test, y_test)

0.8666666666666667

In [212]:
final_pred = final_pl.predict(X_test)

In [213]:
metrics.recall_score(y_test, final_pred, average = 'macro')

0.5966088098441039

In [214]:
baseline_pl.fit(X_train, y_train)

Pipeline(steps=[('column',
                 ColumnTransformer(transformers=[('categorical',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['YEAR', 'U.S._STATE']),
                                                 ('numerical',
                                                  FunctionTransformer(func=<function <lambda> at 0x000001CF316FC5E0>),
                                                  ['OUTAGE.DURATION',
                                                   'CUSTOMERS.AFFECTED',
                                                   'POPULATION'])])),
                ('dtc', DecisionTreeClassifier(max_depth=15))])

In [215]:
base_pred = baseline_pl.predict(X_test)

In [216]:
baseline_pl.score(X_test, y_test)

0.8333333333333334

In [217]:
metrics.recall_score(y_test, base_pred, average = 'macro')

  _warn_prf(average, modifier, msg_start, len(result))


0.4722335919814911

### Final Model

### Fairness Evaluation

In [None]:
# TODO