# Cluster Analysis using KMeans with n_clusters = 4

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [8]:
# Load the outage data (KMeans labels already attached); the first CSV column is the row index
outage_df = pd.read_csv('outage_df_kmean.csv', index_col=0)
outage_df.head()

,Datetime Event Began,State Affected,NERC Region,Alert Criteria,Event Type,Demand Loss (MW),Number of Customers Affected,State Avg Temp (F),State Avg Windspeed (mph),State Avg Precipitation (mm),Monthly Net Energy for Load (GWh),Monthly Peak Hour Demand (MW),State Avg Snowfall (mm),KMeans Labels (4 clusters)
0,2018-01-02 10:00:00,New York,NPCC,Fuel supply emergencies that could impact elec...,System Operations,675.0,0.0,11.261429,11.499064,0.811386,63641.527716,110050.9,0.0,3
1,2018-01-02 06:45:00,North Carolina,SERC,System-wide voltage reductions of 3 percent or...,Severe Weather,14998.0,0.0,18.694595,4.726667,0.000424,85958.0,179134.0,0.0,0
2,2018-01-12 13:08:00,Michigan,RF,Cyber event that causes interruptions of elect...,System Operations,41.0,23007.0,18.944706,15.835709,8.174586,76640.0,137465.0,0.0,3
3,2018-02-04 13:42:00,California,WECC,Physical attack that could potentially impact ...,Vandalism,9760.0,0.0,56.094913,1.11845,0.005626,66930.133379,127365.022436,0.0,0
4,2018-02-08 13:25:00,California,WECC,Electrical System Separation (Islanding) where...,System Operations,30.0,10900.0,56.660645,3.791818,0.00111,66930.133379,127365.022436,0.0,0


The code below groups the dataframe by KMeans cluster label so that the characteristics of each cluster can be analyzed.

In [13]:
grouped_df = outage_df.groupby('KMeans Labels (4 clusters)')
grouped_df.count()

KMeans Labels (4 clusters),Datetime Event Began,State Affected,NERC Region,Alert Criteria,Event Type,Demand Loss (MW),Number of Customers Affected,State Avg Temp (F),State Avg Windspeed (mph),State Avg Precipitation (mm),Monthly Net Energy for Load (GWh),Monthly Peak Hour Demand (MW),State Avg Snowfall (mm)
0,248,248,248,248,248,248,248,248,248,248,248,248,248
1,75,75,75,75,75,75,75,75,75,75,75,75,75
2,15,15,15,15,15,15,15,15,15,15,15,15,15
3,81,81,81,81,81,81,81,81,81,81,81,81,81


>Cluster 0 is by far the largest (248 outages) and cluster 2 is the smallest (15 outages). Let's dig into the features of each.
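
To put the cluster sizes on a common scale, it can help to express them as shares of all outages. The sketch below is a minimal illustration that re-enters the counts from the table above by hand; the names `sizes` and `summary` are introduced here for illustration only. On the real dataframe, `outage_df['KMeans Labels (4 clusters)'].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Cluster sizes copied by hand from the count table above.
sizes = pd.Series({0: 248, 1: 75, 2: 15, 3: 81}, name='n_outages')
sizes.index.name = 'KMeans Labels (4 clusters)'

# Size of each cluster as a percentage of all outages.
summary = pd.DataFrame({
    'n_outages': sizes,
    'share_pct': (sizes / sizes.sum() * 100).round(1),
})
print(summary)
```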

In [21]:
# show cross tab of NERC regions in each cluster
pd.crosstab(outage_df['KMeans Labels (4 clusters)'], outage_df['NERC Region'])

KMeans Labels (4 clusters),FRCC,MRO,NPCC,RF,SERC,SPP RE,TRE,WECC
0,0,36,3,26,60,0,0,123
1,7,0,11,1,0,6,49,1
2,0,2,0,2,0,0,0,11
3,0,13,4,8,44,0,0,12


>Most of the western (WECC) outages fall in cluster 0 (123 of 147), while all of the Texas (TRE) and Florida (FRCC) outages fall in cluster 1. Cluster 1 also contains all of the SPP RE outages. The SPP footprint includes the following states: Arkansas, Iowa, Kansas, Louisiana, Minnesota, Missouri, Montana, Nebraska, New Mexico, North Dakota, Oklahoma, South Dakota, Texas, and Wyoming.
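
Raw counts are hard to compare across regions with very different outage totals, so one option is to look at the share of each region's outages that lands in each cluster. The sketch below rebuilds the crosstab above by hand (the names `region_counts` and `region_share` are illustrative) and normalizes each column. On the real dataframe, passing `normalize='columns'` to `pd.crosstab` should produce the same shares.

```python
import pandas as pd

# Counts copied by hand from the NERC Region crosstab above
# (rows are clusters 0 to 3, columns are NERC regions).
region_counts = pd.DataFrame(
    {
        'FRCC': [0, 7, 0, 0],
        'MRO': [36, 0, 2, 13],
        'NPCC': [3, 11, 0, 4],
        'RF': [26, 1, 2, 8],
        'SERC': [60, 0, 0, 44],
        'SPP RE': [0, 6, 0, 0],
        'TRE': [0, 49, 0, 0],
        'WECC': [123, 1, 11, 12],
    },
    index=pd.Index([0, 1, 2, 3], name='KMeans Labels (4 clusters)'),
)

# Divide each column by its total so every region sums to 1.
region_share = region_counts.div(region_counts.sum(axis=0), axis=1)
print((region_share * 100).round(1))
```

A value of 100 in a column means every outage from that region sits in a single cluster.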

In [24]:
# show cross tab of event type in each cluster
pd.crosstab(outage_df['KMeans Labels (4 clusters)'], outage_df['Event Type'])

KMeans Labels (4 clusters),Cyber Event,Other,Severe Weather,System Operations,Transmission / Distribution Interruption,Vandalism
0,1,4,85,49,50,59
1,0,0,40,9,20,6
2,0,0,11,2,0,2
3,0,0,60,11,4,6


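>Severe Weather is the most common event type in every cluster, but the mix is not uniform: it accounts for about a third of cluster 0 (85 of 248) versus roughly three quarters of clusters 2 and 3 (11 of 15 and 60 of 81). Most Vandalism events (59 of 73) fall in cluster 0, though cluster 0 is also the largest cluster.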
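
Because the clusters differ so much in size, comparing event types is easier with within-cluster shares than with raw counts. The sketch below re-enters the crosstab above by hand (the names `event_counts`, `event_share`, and `dominant` are illustrative) and normalizes each row. On the real dataframe, `pd.crosstab(..., normalize='index')` should give the same shares.

```python
import pandas as pd

# Counts copied by hand from the Event Type crosstab above
# (rows are clusters 0 to 3, columns are event types).
event_counts = pd.DataFrame(
    {
        'Cyber Event': [1, 0, 0, 0],
        'Other': [4, 0, 0, 0],
        'Severe Weather': [85, 40, 11, 60],
        'System Operations': [49, 9, 2, 11],
        'Transmission / Distribution Interruption': [50, 20, 0, 4],
        'Vandalism': [59, 6, 2, 6],
    },
    index=pd.Index([0, 1, 2, 3], name='KMeans Labels (4 clusters)'),
)

# Divide each row by its total so every cluster sums to 1.
event_share = event_counts.div(event_counts.sum(axis=1), axis=0)

# Most frequent event type within each cluster.
dominant = event_share.idxmax(axis=1)

print((event_share * 100).round(1))
print(dominant)
```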

### Descriptive statistics for each cluster


In [39]:
# Show the descriptive statistics for each column for each cluster
pd.set_option('display.max_rows', None)
pd.options.display.float_format = '{:.2f}'.format
grouped_df.describe().T

Feature,Statistic,Cluster 0,Cluster 1,Cluster 2,Cluster 3
Demand Loss (MW),count,248.0,75.0,15.0,81.0
Demand Loss (MW),mean,737.78,669.55,540.07,2182.63
Demand Loss (MW),std,2791.72,1589.48,1531.06,14783.63
Demand Loss (MW),min,1.0,1.0,14.0,2.0
Demand Loss (MW),25%,20.0,25.5,64.5,48.0
Demand Loss (MW),50%,88.5,150.0,143.0,166.0
Demand Loss (MW),75%,380.25,512.0,210.0,436.0
Demand Loss (MW),max,33480.0,10000.0,6062.0,133200.0
Number of Customers Affected,count,248.0,75.0,15.0,81.0
Number of Customers Affected,mean,39352.86,134682.07,79451.07,56218.83


>**Demand Loss:** Cluster 3 has by far the highest mean demand loss (2,182.63 MW), but this appears to be driven by a major outlier (maximum of 133,200 MW). The medians are much closer together, ranging from 88.5 MW to 166 MW.
>
>**Number of Customers Affected:** Again, outliers appear to drive the highest mean, which falls in cluster 1 (134,682.07 customers). The medians are much closer together.
>
>**State Avg Temp and State Avg Snowfall:** Cluster 2 appears to contain colder temperatures and larger amounts of snowfall. This may be the defining characteristic of cluster 2.
>
>**State Avg Windspeed:** Cluster 3 appears to contain the outages with the highest associated wind speeds.
>
>**State Avg Precipitation:** Cluster 2 appears to have the highest associated precipitation, which is consistent with it also having the most snowfall. Cluster 3 also shows large amounts of precipitation, though less than cluster 2.
>
>**Net Energy for Load:** Cluster 1 has the lowest monthly net energy for load. The other clusters are similar to one another.
>
>**Peak Hour Demand:** Cluster 1 also has the lowest monthly peak hour demand. The other clusters are similar to one another.
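
Several of the observations above hinge on means being pulled away from medians by outliers. A compact way to surface that is to aggregate the mean, median, and maximum side by side for each cluster. The sketch below uses a small hypothetical dataframe (`toy`, with made-up values, not the outage dataset) just to show the pattern. The same `groupby(...).agg(...)` call should work on `outage_df` for any numeric column.

```python
import pandas as pd

# Hypothetical toy data, NOT the outage dataset:
# cluster 3 contains one extreme value that inflates its mean.
toy = pd.DataFrame({
    'KMeans Labels (4 clusters)': [0, 0, 0, 3, 3, 3],
    'Demand Loss (MW)': [20.0, 90.0, 380.0, 50.0, 170.0, 133200.0],
})

# Mean, median, and max per cluster for one numeric column.
stats = (
    toy.groupby('KMeans Labels (4 clusters)')['Demand Loss (MW)']
       .agg(['mean', 'median', 'max'])
)

# A large mean-to-median ratio hints that outliers dominate the mean.
stats['mean_to_median'] = stats['mean'] / stats['median']
print(stats)
```

A ratio close to 1 suggests a roughly symmetric distribution, while a much larger ratio suggests the median is the safer summary for that cluster.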