# Predicting Power Outages and Their Cause

**Name(s)**: Ava Jeong and Charlene Hsu

**Website Link**: https://github.com/charl3n3hsu/DSC80_Final_Proj

In [1]:
pip install tabulate

Note: you may need to restart the kernel to use updated packages.


In [2]:
from dsc80_utils import *

from scipy.stats import kruskal
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures
np.random.seed(42)

## Introduction

Power outages information and data is constantly updated as outages across the United States provide for thousands of data for various regions. This dataset on power outages provides detailed records of major power outages across the United States across several years. Information on regional, geographical data on the outages, climate factors, causes of the power outages, the durations of outages (including the start and restoration times of the outage), population affected, and economical effects of the outage are detailedly provided through this dataset.

Outages in power can have significant economic and safety effects, making this dataset highly useful for understanding patterns in outages and improving power infrastructures by understanding the factors that lead to an outage. This leads to the research question to be analyzed and answered through this project: What causes influence the duration of major power outages? And from that, using our findings, can we predict power outage durations based on these factors?

This analysis is important in allowing us to understand the major causes of power outages, allowing businesses, utility companies, policymakers, and emergency responders to better adapt to these factors as outages affect millions of people and businesses every year. Being able to understand this question and make predictions for the duration of power outages can accommodate better preparation for outages and mitigate the impact of them, which allows for improved policies, infrastructure, faster response times, and reduced economic losses.



### Dataset Overview

The number of rows (meaning number of outages recorded) in this data is 1534 rows.
The relevant columns to assist in answering the research questions consist of ‘CLIMATE.CATEGORY’, ‘CAUSE.CATEGORY’, ‘OUTAGE.DURATION’.
- ‘CLIMATE.CATEGORY’: a categorical variable that classifies the type of climate that the power outage occurred in
- ‘CAUSE.CATEGORY’: a categorical variable that classifies the cause of the power outage
- 'DEMAND.LOSS.MW': a numerical variable that measures the scale of the impact of the outage by reporting the amount of demand loss in megawatts (MW) due to the outage
- 'CUSTOMERS.AFFECTED': a numerical variable that reports the number of customers that were affected by the outage
- ‘OUTAGE.DURATION’: a numerical variable that indicates how long the power outage lasted

Through understanding these columns and features, it allows us to identify a commonality amongst the causes and its relationship with the duration of the outages and make predictions by building a predictive model to estimate how long future outages may last.


## Data Cleaning and Exploratory Data Analysis

In [3]:
outages = pd.read_csv('outages.csv')
outages = outages.iloc[1:] # Gets rid of row that is initially units for each column
outages

Unnamed: 0,variables,OBS,YEAR,MONTH,...,AREAPCT_UC,PCT_LAND,PCT_WATER_TOT,PCT_WATER_INLAND
1,,1.0,2011.0,7.0,...,0.6,91.59266587,8.407334131,5.478742983
2,,2.0,2014.0,5.0,...,0.6,91.59266587,8.407334131,5.478742983
3,,3.0,2010.0,10.0,...,0.6,91.59266587,8.407334131,5.478742983
...,...,...,...,...,...,...,...,...,...
1532,,1532.0,2009.0,8.0,...,0.15,98.30774418,1.692255822,1.692255822
1533,,1533.0,2009.0,8.0,...,0.15,98.30774418,1.692255822,1.692255822
1534,,1534.0,2000.0,,...,0.02,85.76115446,14.23884554,2.901181874


In [4]:
outages.columns # All columns

Index(['variables', 'OBS', 'YEAR', 'MONTH', 'U.S._STATE', 'POSTAL.CODE',
       'NERC.REGION', 'CLIMATE.REGION', 'ANOMALY.LEVEL', 'CLIMATE.CATEGORY',
       'OUTAGE.START.DATE', 'OUTAGE.START.TIME', 'OUTAGE.RESTORATION.DATE',
       'OUTAGE.RESTORATION.TIME', 'CAUSE.CATEGORY', 'CAUSE.CATEGORY.DETAIL',
       'HURRICANE.NAMES', 'OUTAGE.DURATION', 'DEMAND.LOSS.MW',
       'CUSTOMERS.AFFECTED', 'RES.PRICE', 'COM.PRICE', 'IND.PRICE',
       'TOTAL.PRICE', 'RES.SALES', 'COM.SALES', 'IND.SALES', 'TOTAL.SALES',
       'RES.PERCEN', 'COM.PERCEN', 'IND.PERCEN', 'RES.CUSTOMERS',
       'COM.CUSTOMERS', 'IND.CUSTOMERS', 'TOTAL.CUSTOMERS', 'RES.CUST.PCT',
       'COM.CUST.PCT', 'IND.CUST.PCT', 'PC.REALGSP.STATE', 'PC.REALGSP.USA',
       'PC.REALGSP.REL', 'PC.REALGSP.CHANGE', 'UTIL.REALGSP', 'TOTAL.REALGSP',
       'UTIL.CONTRI', 'PI.UTIL.OFUSA', 'POPULATION', 'POPPCT_URBAN',
       'POPPCT_UC', 'POPDEN_URBAN', 'POPDEN_UC', 'POPDEN_RURAL',
       'AREAPCT_URBAN', 'AREAPCT_UC', 'PCT_LAND', 'PCT_WAT

In [5]:
# Keeping relevant columns
outages = outages[['YEAR', 'MONTH', 'U.S._STATE',
       'CLIMATE.REGION', 'ANOMALY.LEVEL', 'CLIMATE.CATEGORY',
       'OUTAGE.START.DATE', 'OUTAGE.START.TIME', 'OUTAGE.RESTORATION.DATE',
       'OUTAGE.RESTORATION.TIME', 'CAUSE.CATEGORY', 'CAUSE.CATEGORY.DETAIL',
       'HURRICANE.NAMES', 'OUTAGE.DURATION', 'DEMAND.LOSS.MW', 'CUSTOMERS.AFFECTED', 'POPULATION', 'POPPCT_URBAN',
       'POPDEN_URBAN', 'POPDEN_UC', 'POPDEN_RURAL',
       'AREAPCT_URBAN', 'AREAPCT_UC', 'PCT_LAND', 'PCT_WATER_TOT',
       'PCT_WATER_INLAND']]

The original dataset consisted of 57 columns, many of which were not of importance to our prediction problems. We chose to only select columns that we believed were important in analyzing/predicting the duration of power outages.

Some of these columns include:
- Time: MONTH, YEAR
- Location: U.S._STATE, CLIMATE.REGION
- Cause-Related: CAUSE.CATEGORY, CAUSE.CATEGORY.DETAIL,
- Outage Duration: OUTAGE.DURATION (Our Target Variable)
- Other Potentially Useful Features: CUSTOMERS.AFFECTED, POPULATION, POPPCT_URBAN, AREAPCT_URBAN, AREAPCT_UC, PCT_LAND, PCT_WATER_TOT, PCT_WATER_INLAND

In [6]:
# Checking for any missing values for our target features
missing_values = outages[['MONTH', 'CLIMATE.REGION', 'CAUSE.CATEGORY', 'OUTAGE.DURATION', 'DEMAND.LOSS.MW', 'CUSTOMERS.AFFECTED']].isnull().sum()
print("Missing Values:\n", missing_values)

Missing Values:
 MONTH                   9
CLIMATE.REGION          6
CAUSE.CATEGORY          0
OUTAGE.DURATION        58
DEMAND.LOSS.MW        705
CUSTOMERS.AFFECTED    443
dtype: int64


After keeping the relevant columns, we checked the dataframe to see if there were any missing values using .isnull().sum(), which showed that the 'MONTH' column has 9 missing values, 'CLIMATE.REGION' has 6 missing values, 'OUTAGE.DURATION' has 58 missing values, 'DEMAND.LOSS.MW' has 705 missing values, and 'CUSTOMERS.AFFECTED' has 443 missing values.

To clean this data we will need to impute the entire dataset to account for all missing values for both numerical and categorical columns. 

In [7]:
# Imputation for the entire dataset (both numerical and categorical)
outages['OUTAGE.DURATION'] = pd.to_numeric(outages['OUTAGE.DURATION'], errors='coerce') # Response variable
outages['OUTAGE.DURATION'] = outages.groupby('CAUSE.CATEGORY')['OUTAGE.DURATION'].transform(lambda x: x.fillna(x.median()))

# Making sure the features we will use later on is the correct type
outages['DEMAND.LOSS.MW'] = outages['DEMAND.LOSS.MW'].astype(float)
outages['CUSTOMERS.AFFECTED'] = outages['CUSTOMERS.AFFECTED'].astype(float)

def impute_missing_values(df):
    df_imputed = df.copy()

    numerical_columns = df_imputed.select_dtypes(include=['number']).columns
    categorical_columns = df_imputed.select_dtypes(include=['object', 'category']).columns

    if len(numerical_columns) > 0:
        numerical_imputer = SimpleImputer(strategy='median')
        df_imputed[numerical_columns] = numerical_imputer.fit_transform(df_imputed[numerical_columns])

    if len(categorical_columns) > 0:
        categorical_imputer = SimpleImputer(strategy='most_frequent')
        df_imputed[categorical_columns] = categorical_imputer.fit_transform(df_imputed[categorical_columns])

    return df_imputed

outages = impute_missing_values(outages)

In [8]:
# Checking to see if there are null values left
print(outages.isnull().sum())

YEAR                0
MONTH               0
U.S._STATE          0
                   ..
PCT_LAND            0
PCT_WATER_TOT       0
PCT_WATER_INLAND    0
Length: 26, dtype: int64


In [9]:
# HEAD of DF

outages.head()
# print(outages.head().to_markdown(index=False))

Unnamed: 0,YEAR,MONTH,U.S._STATE,CLIMATE.REGION,...,AREAPCT_UC,PCT_LAND,PCT_WATER_TOT,PCT_WATER_INLAND
1,2011.0,7.0,Minnesota,East North Central,...,0.6,91.59266587,8.407334131,5.478742983
2,2014.0,5.0,Minnesota,East North Central,...,0.6,91.59266587,8.407334131,5.478742983
3,2010.0,10.0,Minnesota,East North Central,...,0.6,91.59266587,8.407334131,5.478742983
4,2012.0,6.0,Minnesota,East North Central,...,0.6,91.59266587,8.407334131,5.478742983
5,2015.0,7.0,Minnesota,East North Central,...,0.6,91.59266587,8.407334131,5.478742983


### Univariate Analysis

To gain a better understanding of key variables in our dataset, we performed a univariate analysis through visualizations. Below are two plots we generated, along with the description of trends we found in the data. 

In [10]:
# Frequency Distribution of Climate Region
univariate1 = px.bar(outages['CAUSE.CATEGORY'].value_counts(), 
              title='Frequency Distribution of Causes of Outages',
              labels={'index': 'Cause Category', 'value': 'Frequency'})
univariate1.show()
# univariate1.write_html("univariate1.html")

#### Frequency Distribution of Cause of Outages
The bar plot above describes the frequency of different causes of power outages.

From the plot we can see that:
- Severe weather is the most common cause of power outages, occurring significantly more frequently than any other cause.
- Intentional attacks and system operability disruptions also contribute to outages but at a much lower frequency.
- Fuel supply emergencies, public appeals, and islanding are among the least frequent causes of outages.

Through this analysis, we can see that severe weather is a leading cause of power outages, indicating that weather-related factors play a crucial role in outage occurrences.

In [11]:
# Boxplot of Outage Duration
univariate2 = px.box(outages, y='OUTAGE.DURATION', title='Box Plot of Outage Duration',
              labels={'outage_duration': 'Outage Duration (hours)'})
univariate2.show()
# univariate2.write_html("univariate2.html")

#### Box Plot of Outage Duration

The box plot visualizes the distribution of power outage durations across all recorded incidients. 

From the plot we can see that:
- A majority of the power outages are relatively short lived, as indicated by the concentration of data in the lower end.
- There are a lot of outlier within the dataset, some outages lasting tens of thousands of hours. These outages may be due to severe natural disasters and/or infrastructure failures.
- The dataset is highly skewed, seen by the median outage duration.

This analysis shows that the outage durations in our datset vary signficantly and include extreme cases.

### Bivariate Analysis

To determine the relationships between key variables, we performed a bivariate analysis through visualizations. Below are the plots we generated, along with the description of trends we found in the data.

In [12]:
# Bivariate Analysis: Outage Duration vs. Climate Region
bivariate1 = px.box(outages, x='CLIMATE.REGION', y='OUTAGE.DURATION', 
              title='Outage Duration by Climate Region Boxplot',
              labels={'climate_region': 'Climate Region', 'outage_duration': 'Outage Duration (hours)'})
bivariate1.show()
# bivariate1.write_html("bivariate1.html")

#### Outage Duration by Climate Region (Box Plot)

The box plot shows how outage duration varies across different climate regions. 

From the plot we can see that:
- The East North Central region has the longest median outage duration, along with mutliple outliers.
- The South and West regions also show variability in outage durations.
- While most outage durations last relatively short, some climate regions have more long-duration outages, possibly suggesting that climate factors could play a role in the duration of power outages.

This visualization shows how climate-based factors could play a role in predicting power outage durations.

In [13]:
# Bivariate Analysis: Outage Duration vs. Climate Region
bivariate2 = px.scatter(outages, x='CLIMATE.REGION', y='OUTAGE.DURATION', 
                title='Outage Duration by Climate Region Scatterplot',
                labels={'climate_region': 'Climate Region', 'outage_duration': 'Outage Duration (hours)'})
bivariate2.show()
# bivariate2.write_html("bivariate2.html")

#### Outage Duration by Climate Region (Scatter Plot)

To gain better visability on the distribution, we also used a scatter plot to examine overlapping points.

From the plot we can see that:
- A majority of outages are clustered around the lower durations, with few outliers.
- Some regions have consistently longer outages.

This visualization further supports that climate region could be an important factor in our model. 

In [14]:
# Bivariate Analysis: Outage Duration vs. Cause Category
bivariate3 = px.box(outages, x='CAUSE.CATEGORY', y='OUTAGE.DURATION', 
              title='Outage Duration by Cause Category Boxplot',
              labels={'cause_category': 'Cause Category', 'outage_duration': 'Outage Duration (hours)'})
bivariate3.show()
# bivariate3.write_html("bivariate3.html")

#### Outage Duration by Cause Category (Box Plot)

The box plot shows how different causes impact power outage durations. 

From the plot we can see that:
- Severe weather is the most common cause of power outages (has broad range of durations).
- Fuel supply emergencies and public appeals tend to result in longer outages.
- Intentional attacks and equipment failures usually result in shorter outages. 

This visualization suggests that cause category is a significant factor of outage duration, as we observe that different causes lead to different restoration times. 

In [15]:
# Bivariate Analysis: Outage Duration vs. Cause Category
bivariate4 = px.scatter(outages, x='CAUSE.CATEGORY', y='OUTAGE.DURATION', 
                title='Outage Duration by Cause Category Scatterplot',
                labels={'cause_category': 'Cause Category', 'outage_duration': 'Outage Duration (hours)'})
bivariate4.show()
# bivariate4.write_html("bivariate4.html")

#### Outage Duration by Cause Category (Scatter Plot)

To gain better visability on the distribution, we also used a scatter plot to examine overlapping points.

From the plot we can see that:
- A majority of the outages result in shorter durations.
- Most of the points near 0 shows that most outages are resolved fairly quickly.
- There is significant variation between different cause categories.

This visualization further supports that certain causes tend to result in more prolonged outages. 

In [16]:
# Pivot Tables by Mean
pivot_table = pd.pivot_table(
    outages,
    values='OUTAGE.DURATION',  
    index='CLIMATE.REGION',   
    columns='CAUSE.CATEGORY', 
    aggfunc='mean',            
    fill_value=0              
)
pivot_table
# print(pivot_table.to_markdown(index=False))

CAUSE.CATEGORY,equipment failure,fuel supply emergency,intentional attack,islanding,public appeal,severe weather,system operability disruption
CLIMATE.REGION,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Central,293.14,10035.25,315.53,125.33,1410.00,3238.30,2469.73
East North Central,26435.33,27969.00,2376.05,1.00,733.00,4434.82,2610.00
Northeast,216.67,14629.57,191.84,881.00,2655.00,4342.68,705.06
...,...,...,...,...,...,...,...
Southwest,113.80,2018.00,255.84,2.00,2275.00,11572.90,329.22
West,524.81,5250.94,857.68,214.86,2028.11,2908.30,356.41
West North Central,61.00,3960.00,23.50,68.20,439.50,2442.50,0.00


#### Pivot Table: Mean Outage Duration by Climate Region and Cause Category

This pivot table shows the average outage duration grouped by climate region and cause category. The table helps to identify trends adn variations in outage durations based noo geographic location and the underlying cause of the outage. 

Key observations:
- East North Central region has the highest outage duration for equipment failures (26,435 hours).
- Fuel Supply Emergencies result in long outages across multiple regions.
- Severe Weather affects all regions, but its impacts on each region vary.

This table allows us to determine the relationship between cause and climate, which is helpful in improving our model's performance. 

## Assessment of Missingness

In [17]:
# Get version for unimputed outages DataFrame
unimputed_outages = pd.read_csv('outages.csv')
unimputed_outages = unimputed_outages.iloc[1:] # Gets rid of row that is initially units for each column

In [18]:
# Permutation Test for Missingness of OUTAGE.DURATION does depend on
def permutation_test(data, col_missing, col_test, num_permutations=1000):
    frequencies = data[col_test].value_counts(normalize=True).to_dict()

    data[f'{col_test}_ENCODED'] = data[col_test].map(frequencies)

    observed_diff = abs(data[data[col_missing].isnull()][f'{col_test}_ENCODED'].mean() -
                    data[~data[col_missing].isnull()][f'{col_test}_ENCODED'].mean())

    perm_diffs = []
    for n in range(num_permutations):
        shuffled = data[f'{col_test}_ENCODED'].sample(frac=1, replace=False).reset_index(drop=True)
        perm_diff = abs(data[data[col_missing].isnull()].index.to_series().map(shuffled).mean() -
                        data[~data[col_missing].isnull()].index.to_series().map(shuffled).mean())
        perm_diffs.append(perm_diff)

    p_value = np.mean(np.array(perm_diffs) >= observed_diff)

    return observed_diff, perm_diffs, p_value

# Performing Permutation Missingness Test on CAUSE.CATEGORY
observed_diff, perm_diffs, p_value = permutation_test(unimputed_outages, 'OUTAGE.DURATION', 'CAUSE.CATEGORY')

missingness1 = px.histogram(x=perm_diffs, nbins=30, title=f'Permutation Test for Missingness in OUTAGE.DURATION Depending on CAUSE.CATEGORY', labels={'value': 'Absolute Difference in Means'}, opacity=0.7)
missingness1.add_vline(x=observed_diff, line_dash='dash', line_color='red', annotation_text='Observed Absolute Difference in Means')
missingness1.show()
# missingness1.write_html("missingness1.html")

print(f"Observed Absolute Difference in Means: {observed_diff}, P-Value: {p_value}")

Observed Absolute Difference in Means: 0.08634898930475615, P-Value: 0.0


#### Assessment of Missingness: Permutation Test for Missingness in OUTAGE.DURATION Depending on CAUSE.CATEGORY

##### Plot Interpretation
The histogram above shows the empirical distribution of the absolute difference in means from 1,000 permutations when testing whether the missingness in our target variable (OUTAGE.DURATION) depends on CAUSE.CATEGORY. The x-axis shows the absolute difference in means, while the y-axis represents the frequency of permuted differences.  

The observed absolute difference in means has a value of 0.0863, while the empirical p-value from this test has a value of 0.0. This means that none of the permuted differences were at least as extreme as the observed difference. Since the p-value is 0.0, we reject the null hypothesis. This suggests that the missingness in OUTAGE.DURATION is not completely at random and is dependent on CAUSE.CATEGORY. This means that the likelihood of missing values in outage duration is influenced by the specific causes of the outage, exhibiting Missing At Random behavior. Certain causes may be more prone to missing outage durations in the dataset than other causes. 

In [19]:
# Permutation Test for Missingness of OUTAGE.DURATION does NOT depend on
observed_diff, perm_diffs, p_value = permutation_test(unimputed_outages, 'OUTAGE.DURATION', 'HURRICANE.NAMES')

missingness2 = px.histogram(perm_diffs, nbins=30, title=f'Permutation Test for Missingness in OUTAGE.DURATION Depending on HURRICANE.NAMES', labels={'value': 'Absolute Difference in Means'}, opacity=0.7)
missingness2.add_vline(x=observed_diff, line_dash='dash', line_color='red', annotation_text='Observed Absolute Difference in Means')
missingness2.show()
# missingness2.write_html("missingness2.html")

print(f"Observed Absolute Difference in Means: {observed_diff}, P-Value: {p_value}")

Observed Absolute Difference in Means: 0.08763693270735526, P-Value: 0.098


#### Assessment of Missingness: Permutation Test for Missingness in OUTAGE.DURATION Depending on HURRICANE.NAMES

##### Plot Interpretation
The histogram above shows the empirical distribution of the absolute difference in means from 1,000 permutations when testing whether the missingness in our target variable (OUTAGE.DURATION) depends on Hurricane names. The x-axis shows the absolute difference in means, while the y-axis represents the frequency of permuted differences. 

The observed absolute difference in means has a value of 0.0876, while the empirical p-value has a value of 0.098. This means that about 9.8%  of the permuted differences were at least as extreme as the observed differences. Since the p-value is greater than 0.05, we fail to reject the null hypothesis. This suggests that there is not enough evidence to conclude that missingness in OUTAGE.DURATION is dependent on whether the outage was caused by HURRICANE.NAMES. While there may be the possibility of some relationship between the two features, the results from this permutation test suggest that there is no strong statistical support for Missing At Random behavior based on this variable. 


## Hypothesis Testing

In [20]:
# Permutation Test
def permutation_absmeans(df, feature, response, n_permutations=1000):
    means = df.groupby(feature)[response].mean()
    observed_diff = np.abs(means.diff().dropna()).sum()

    abs_diffs = []
    for n in range(n_permutations):
        shuffled = df.assign(Shuffled_outage_duration=np.random.permutation(df[response]))
        shuffled_means = shuffled.groupby(feature)['Shuffled_outage_duration'].mean()
        shuffled_diff = np.abs(shuffled_means.diff().dropna()).sum()
        abs_diffs.append(shuffled_diff)

    p_value = (np.array(abs_diffs) >= observed_diff).mean()

    return observed_diff, abs_diffs, p_value

observed_diff, abs_diffs, p_value = permutation_absmeans(outages, 'CAUSE.CATEGORY', 'OUTAGE.DURATION')

In [21]:
# Empirical distribution of absolute differences for CAUSE.CATEGORY & OUTAGE.DURATION
hypothesis_test_1 = px.histogram(abs_diffs, nbins=30, labels={'value': 'Absolute Difference in Means'}, title='Empirical Distribution of Absolute Difference in Means for CAUSE.CATEGORY & OUTAGE.DURATION')
hypothesis_test_1.add_vline(x=observed_diff, line_dash="dash", line_color="red", annotation_text="Observed Difference", annotation_position="top right")
hypothesis_test_1.update_layout(xaxis_title="Absolute Difference in Means", yaxis_title="Frequency", showlegend=False)
hypothesis_test_1.show()
# hypothesis_test_1.write_html("hypothesis_test_1.html")

print(f"Observed Absolute Difference in Means: {observed_diff}")
print(f"P-value: {p_value}")

Observed Absolute Difference in Means: 27022.72477030569
P-value: 0.0


#### Hypothesis Test 1: Permutation Test for Cause Category and Outage Duration

##### Plot Interpretation:
The histogram that plots the empirical distribution of the absolute difference in means for CAUSE.CATEGORY and OUTAGE.DURATION is heavily right skewed. This shows that most of the permuted differences cluster mainly on the right, meaning that the test results in differences that are in majority small, with just a few differences that are larger occurring in the right tail. The observed difference is also in the far right of the histogram meaning that the observed difference is larger than most permuted differences that were computed. This diagram further shows that the observed difference is unlikely to be due to chance and that the CAUSE.CATEGORY truly does affect the OUTAGE.DURATION.

In [22]:
groups = [group['OUTAGE.DURATION'].values for name, group in outages.groupby('CAUSE.CATEGORY')]

statistic, p_value = kruskal(*groups)

print(f"Kruskal-Wallis Test Statistic: {statistic}, P-Value: {p_value}")

# Violin plot
hypothesis_test_2 = px.violin(outages, x="CAUSE.CATEGORY", y="OUTAGE.DURATION", 
                title="Distribution of Outage Durations by Cause Category",
                box=True, points="all", hover_data=outages.columns)

hypothesis_test_2.update_layout(xaxis_title="Cause Category", yaxis_title="Outage Duration (mins)")
hypothesis_test_2.show()
# hypothesis_test_2.write_html("hypothesis_test_2.html")

Kruskal-Wallis Test Statistic: 700.6054239816652, P-Value: 4.526908484457232e-148


#### Hypothesis Test 2: Kruskal-Wallis Test for Outage Duration Across Cause Categories 

##### Plot Interpretation:
The plot above shows the distribution of outage durations by cause category, where a Kruskal-Wallis test was used to assess whether the duration of outages significantly varies across different causese. With the test statistic value of 700.61 and a p-value of 4.53e-148, which are extremely small, this suggests strong evidence against the null hypothesis. This indicates that different causes likely result in different outage durations. 

As seen on the plot:
- Severe weather is the most frequent, meaning it is the most common cause  of outages and shows a wide spread of durations.
- Fuel supply emergencies and public appeals have high variability, and some outages lasting longer.
- Other categories such as intentional attacks and system operability disruptions, tend to have shorter outages.


## Framing a Prediction Problem

### Prediction Type: Regression
The goal of this prediction task is to predict the duration of a power outage (OUTAGE.DURATION) based on several factors that are available at the start of the outage, including cause and climate. Since our target variable is a continuous numerical value, we define this as a regression problem rather than a classification problem. 

### Response Variable: OUTAGE.DURATION
We chose OUTAGE.DURATION as our response variable understanding the length of outages can help the following:
- Businesses: Plan for potential financial losses
- Utility Companies: Improve infrastructure and optimize restoration efforts
- Emergency Responders: Better prepare for different situations and allocation of resources
- Policy-makers: Use data-driven insights to improve power grid reliability
Overall, we can help predict future outages and enable proactive measures to reduce economic loss and better prepare for such emergencies. 

### Evaluation Metrics
To assess the performance of our model, we used the following metrics:
- Mean Squared Error (MSE): MSE was chosen as our primary form of evaluation metric because it helps to measure how far predicted outage durations deviate from the actual durations. It also is suitable for larger errors, allowing for outliers in the outage duration to have more weight.  
- R2 (Coefficient of Determination): R2 was chosen as our second metric because it measures how well our model performs with variability in OUTAGE.DURATION. A higher value of R2 indicates a better fit, allowing us to understand how well our model is performing with outage duration.  
MSE is prioritized, as there are significant outliers in OUTAGE.DURATION.

### Justification of Features
The model only includes features that are known at the start of an outage. The selected features include the following:
- CAUSE.CATEGORY (Categorical): Used to identify the general cause of the outage, such as severe weather, equipment failure, etc. 
- CLIMATE.CATEGORY (Categorical): Describes the climate type in which the outage occurred, severe weathers may influence the duration
- MONTH (Numerical): Helps us to identify the seasonal trends that may affect the outages.

Some additional features that may be used, but are not known prior to the outages include the following: 
- DEMAND.LOSS.MW (Numerical): Measures the severity of the outage based on power demand loss. 
- CUSTOMERS.AFFECTED (Numerical): Larger outages may take longer to restore, more individuals being affected. 

### Challenges
Some challenges that we faced while developing a outage duration prediction model include:  
- Handling Categorical Data
    - Most of the dataset contains categorical variables, which required us to one-hot encode.
    - We applied one-hot encoding for categorical features used such as CAUSE.CATEGORY and CLIMATE.REGION to convert them into a numerical format.
- Missing Data
    Variables including OUTAGE.DURATION had missing values in the dataset. To mitigate this, we had to perform a missingness analysis and group-wise median imputation to fill in the missing values based on the median outage duration for each cause category.
- Skewed Data
    - Our target variable also has extreme outliers, meaning that we may need to apply log transformation to deal with the outliers.

## Baseline Model

In [23]:
# Linear Regression
X = outages[['CAUSE.CATEGORY', 'CLIMATE.REGION']] 
y = outages['OUTAGE.DURATION'] 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

categorical_features = ['CAUSE.CATEGORY', 'CLIMATE.REGION']

preprocessor = ColumnTransformer(
    transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)])

pipeline = Pipeline([
    ('preprocessor', preprocessor), 
    ('regressor', LinearRegression()) 
])

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse}")
print(f"R² Score: {r2}")

Mean Squared Error (MSE): 26061252.55965856
R² Score: 0.007262599857186136


#### Chosen Model: Linear Regression with a Pipeline
For our baseline model, we chose to implement a Linear Regression model using an sklearn Pipeline, to ensure that all preprocessing steps such as encoding categorical variables were integrated within our training steps. 

##### Features Used

Our model uses the following two categorical features to predict the duration of outages:
- CAUSE.CATEGORY (Nomial, Categorical)
    - Represents the general cause of the power outage, such as severe weather, equipment failure, intentional attack, etc. 
    - Encoding Method: One-hot encoding, since there is no order among these categories. 
- CLIMATE.REGION (Nominal, Categorical)
    - Represents the climate region where the outage occurred, such as the Northeast, West, South, etc. 	
    - Encoding Method: One-hot encoding, since there is no order among these categories. 


#### Encoding and Model Training
- One-hot encoding was applied to the categorical variables, as mentioned above using a ColumnTransformer. 
- We split the dataset into training (80%) and testing (20%) sets to evaluate model generalization.
- The final pipeline consists of
    - Preprocessing step: One-hot encoding of categorical features
    - Regressoin: Linear Regression to predict OUTAGE.DURATION 

#### Baseline Model Performance and Assessment
After training and evaluating our baseline model on the test set, we obtained the following results:
- Mean Squared Error (MSE): 26,061,252.55965856
- $R^2$ Score: 0.007262599857186136

The baseline model’s low $R^2$ score of 0.0077 shows that it does not do a good job of explaining the variance in outage durations. This suggests that additional features may be needed to improve our model’s performance. Some limitations of our baseline model include limited feature selection and the complex relationship between our chosen features and outage duration. The baseline model only considers CAUSE.CATEGORY and CLIMATE.REGION, which do not account for other potential predictive features such as DEMAND.LOSS.MW and CUSTOMERS.AFFECTED. The performance of our model suggests that outage duration is likely influenced by multiple interacting factors, which this baseline model currently fails to capture. In addition to the lack of necessary features, power outages and their durations may not follow a simple linear relationship.

## Final Model

In [24]:
# Feature Engineering
def create_demand_per_customer(X):
    return (X['DEMAND.LOSS.MW'] / X['CUSTOMERS.AFFECTED'].replace(0, np.nan)).values.reshape(-1, 1)

def create_log_demand_loss(X):
    return np.log(X['DEMAND.LOSS.MW'] + 1).values.reshape(-1, 1)

features = ['MONTH', 'DEMAND.LOSS.MW', 'CUSTOMERS.AFFECTED']
categoricalfeatures = [col for col in outages.columns if 'CAUSE.CATEGORY_' in col or 'CLIMATE.REGION_' in col]

# Training Model
preprocessor = ColumnTransformer(
    transformers=[
        ('demand_per_customer', FunctionTransformer(create_demand_per_customer), ['DEMAND.LOSS.MW', 'CUSTOMERS.AFFECTED']),
        ('log_demand_loss', FunctionTransformer(create_log_demand_loss), ['DEMAND.LOSS.MW']),
        ('categorical', 'passthrough', categoricalfeatures)  
    ],
    remainder='drop' 
)

pipeline = Pipeline([
    ('preprocessor', preprocessor),  
    ('imputer', SimpleImputer(strategy='median')),  
    ('poly', PolynomialFeatures()), 
    ('linear', LinearRegression()) 
])

X = outages[features + categoricalfeatures].copy()
y = outages['OUTAGE.DURATION']
X.replace([np.inf, -np.inf], np.nan, inplace=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {
    'poly__degree': [1, 2, 3]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_

y_pred = best_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Best Degree: {grid_search.best_params_['poly__degree']}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"R² Score: {r2}")

Best Degree: 1
Mean Squared Error (MSE): 25941331.59266683
R² Score: 0.01183068532128484


#### Improvements from Baseline Model
To enhance our baseline model, we incorporated feature engineering, explored non-linear relationships, and performed hyperparameter tuning to optimize the model’s performance. 

Notable Improvements include the following:
- Addition of New Features
    - DEMAND.LOSS.MW (Megawatts lost due to outage): A key indicator of outage severity. 
    - CUSTOMERS.AFFECTED (Number of customers affected by the outage): Higher value could indicate longer restoration times, as more people are affected.
    - DEMAND_PER_CUSTOMER (Ratio of demand loss per customer): Captures the outage impact per customer.
    - LOG_DEMAND_LOSS (Log-transformed demand loss): Put in place to make the data more suitable for regression analysis. 
        - Computed as log(DEMAND.LOSS.MW + 1)
    - MONTH (Month in which the outage occurred): Captures the seasonal patterns affecting the outage duration. 
- Handling Missing Values
    - We used median imputation to handle missing values in DEMAND.LOSS.MW and CUSTOMERS.AFFECTED to ensure consistency within our dataset.
    - We replaced infinite values with NaN and re-imputed them to avoid computational errors from occurring. 
- Modeling Approach
    - We applied Polynomial Regression to capture non-linear relationships in the data, rather than using a simple Linear Regression model. 
    - We used PolynomialFeatures() within an sklearn Pipeline to generate the polynomial terms.
- Hyperparameter Tuning
    - We applied GridSearchCV with 5-fold cross-validation and neg_mean_squared_error as our scoring metric.
        - poly_degree: Is the degree of polynomial features, we decided to test values of 1, 2, and 3 to determine the best fit for the data. 
    - The best-performing model had a polynomial degree of 1, which suggests that a linear relationship best fits our dataset.

#### Justification of Feature Engineering
##### Demand_Per_Customer:
This feature was created to capture the per-customer impact of an outage. Outages affecting fewer customers but with high demand loss may be prioritized differently by utility companies, resulting in shorter restoration times. By including this feature, the model would better account for the relationship between demand loss and outage duration. 

##### Log_Demand_Loss:
The log transformation was applied in order to account for the skewness in the DEMAND.LOSS.MW column. Larger outages are not as frequent in this dataset, but can still disproportionately affect the model’s performance. By applying this transformation, we can reduce the influence of these outliers making the data more suitable for regression analysis. 

##### MONTH and Categorical Features:
These features were included to account for seasonal patterns and regions differences in outage durations and causes. Outages caused by storms for instance, could take longer to restore than outages caused by equipment failure. 

#### Final Model Performance Assessment
After training and evaluating our final model on the test set, we obtained the following results:
- Mean Squared Error (MSE): 25,941,331.59266683
- $R^2$ Score: 0.01183

Although the $R^2$ score is still relatively low, we can see a noticeable improvement compared to the results from our baseline model (MSE: 26,061,252.56, R²: 0.0077). The decrease in our MSE value indicates that the model’s predictions are closer to the actual outage durations than before, meaning the model has better accuracy. The increase in our $R^2$ score suggests that the model captures more variance in outage duration, although there is still a significant amount of unexplained variance, indicating that additional features or a different modeling approach could further improve our performance. 

## Fairness Analysis

In [25]:
# Checking each value in CAUSE.CATEGORY
outages['CAUSE.CATEGORY'].value_counts()

CAUSE.CATEGORY
severe weather                   763
intentional attack               418
system operability disruption    127
public appeal                     69
equipment failure                 60
fuel supply emergency             51
islanding                         46
Name: count, dtype: int64

In [26]:
# Creates binary column that determines whether the cause is severe weather or not
outages['natural_cause'] = outages['CAUSE.CATEGORY'].isin(['severe weather']).astype(int)

In [27]:
results = X_test.copy()
results['OUTAGE.DURATION'] = y_test
y_pred = best_model.predict(X_test)
results['prediction'] = y_pred
results['OUTAGE.DURATION'] = results['OUTAGE.DURATION'].fillna(0)
results['prediction'] = results['prediction'].fillna(0)
results['natural_cause'] = outages.loc[results.index, 'natural_cause']

severe_weather_rmse = np.sqrt(mean_squared_error(
    results[results['natural_cause'] == 1]['OUTAGE.DURATION'],
    results[results['natural_cause'] == 1]['prediction']
))

not_severe_weather_rmse = np.sqrt(mean_squared_error(
    results[results['natural_cause'] == 0]['OUTAGE.DURATION'],
    results[results['natural_cause'] == 0]['prediction']
))

observed_rmse_diff = severe_weather_rmse - not_severe_weather_rmse
print(f"Observed Difference in RMSE (Severe Weather - Not Severe Weather Causes): {observed_rmse_diff}")

Observed Difference in RMSE (Severe Weather - Not Severe Weather Causes): 2815.1174346001917


In [28]:
# Function to compute differences in RMSE
def rmse_difference(results):
    severe_weather_rmse = np.sqrt(mean_squared_error(
        results[results['natural_cause'] == 1]['OUTAGE.DURATION'],
        results[results['natural_cause'] == 1]['prediction']
    ))
    not_severe_weather_rmse = np.sqrt(mean_squared_error(
        results[results['natural_cause'] == 0]['OUTAGE.DURATION'],
        results[results['natural_cause'] == 0]['prediction']
    ))
    return severe_weather_rmse - not_severe_weather_rmse

In [29]:
# Permutation Test for Fairness using RMSE
rmse_differences = []

for n in range(1000):
    shuffled = results.assign(Shuffled_natural_cause=np.random.permutation(results['natural_cause']))
    rmse_diff = rmse_difference(shuffled)
    rmse_differences.append(rmse_diff)

p_value = (np.abs(rmse_differences) >= np.abs(observed_rmse_diff)).mean()
print(f"P-value: {p_value}")

P-value: 1.0


In [30]:
# Histogram of differenes in RMSE
fig = px.histogram([float(x) for x in rmse_differences], nbins=20, labels={'value': 'Difference in RMSE'},
                   title='Difference in RMSE (Severe Weather - Not Severe Weather Causes)')
fig.add_vline(x=observed_diff, line_color='red', annotation_text=f'Observed Difference in RMSE: {observed_rmse_diff:.2f}')
fig.update_layout(xaxis_title='Difference in RMSE', yaxis_title='Frequency')
fig.show()

print(rmse_differences)

[np.float64(2815.1174346001917), np.float64(2815.1174346001917), np.float64(2815.1174346001917), np.float64(2815.1174346001917), np.float64(2815.1174346001917), np.float64(2815.1174346001917), np.float64(2815.1174346001917), np.float64(2815.1174346001917), np.float64(2815.1174346001917), np.float64(2815.1174346001917), np.float64(2815.1174346001917), np.float64(2815.1174346001917), np.float64(2815.1174346001917), np.float64(2815.1174346001917), np.float64(2815.1174346001917), np.float64(2815.1174346001917), np.float64(2815.1174346001917), np.float64(2815.1174346001917), np.float64(2815.1174346001917), np.float64(2815.1174346001917), np.float64(2815.1174346001917), np.float64(2815.1174346001917), np.float64(2815.1174346001917), np.float64(2815.1174346001917), np.float64(2815.1174346001917), np.float64(2815.1174346001917), np.float64(2815.1174346001917), np.float64(2815.1174346001917), np.float64(2815.1174346001917), np.float64(2815.1174346001917), np.float64(2815.1174346001917), np.floa