# Predicting the duration of power outages

**Name(s)**: Zihan Liu

**Website Link**: https://anananan116.github.io/US_Power_Outage_Analysis_model/

## Code

In [19]:
import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
import plotly.express as px
pd.options.plotting.backend = 'plotly'

### Framing the Problem

## Prediction Problem

### Objective
- **Predict the Duration of Power Outages**

### Type
- **Regression**
  - The target variable, the duration of the power outage, is continuous.

### Response Variable (Target)
- `OUTAGE.DURATION`
  - This is chosen because it directly measures the length of the power interruption. An estimate of the outage duration, available before the outage is fixed, could be useful to power service companies and their customers.

### Features Considered
- `U.S._STATE`
  - Geographic location can be a significant factor due to different infrastructure and weather patterns.
- `ANOMALY.LEVEL`, `CLIMATE.CATEGORY`
  - These indicate climate conditions, crucial for predicting weather-related outages.
- `CAUSE.CATEGORY`, `CAUSE.CATEGORY.DETAIL`
  - Understanding the cause of the outage helps in estimating duration.
- `CUSTOMERS.AFFECTED`
  - The number of customers affected might correlate with the severity and duration of the outage.
- `RES.PRICE`
  - Residential price as an indirect indicator of infrastructure quality.
- `SEASON`
  - Different seasons have unique patterns impacting outage durations.

- All features are known at the time of the power outage, making them relevant for real-time predictions.
- Utility companies typically have access to this information (cause, affected customers, climate conditions) when an outage occurs.

### Metric for Model Evaluation
- **Primary Metric**: Mean Absolute Error (MAE)
  - Chosen for its clear representation of average prediction error magnitude.
  - RMSE or MSE could be considered, but they penalize large errors more heavily, so the many outliers in this power outage dataset would dominate the score.
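
The effect of outliers on the two metrics can be illustrated with a small sketch. The durations and the constant prediction below are hypothetical values chosen for illustration, not rows from the outage dataset.

```python
import numpy as np

# Hypothetical outage durations in minutes; the last one is an extreme outlier.
y_true = np.array([600.0, 900.0, 1200.0, 1500.0, 20000.0])
# A hypothetical model that predicts 1000 minutes for every outage.
y_pred = np.full_like(y_true, 1000.0)

errors = y_true - y_pred
mae = np.mean(np.abs(errors))          # every error counts in proportion to its size
rmse = np.sqrt(np.mean(errors ** 2))   # squaring lets the outlier dominate

print(f"MAE:  {mae:.1f} minutes")
print(f"RMSE: {rmse:.1f} minutes")
```

In this toy case the single outlier pushes RMSE to more than twice the MAE, which is the behavior that motivates MAE as the primary metric here.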

In [23]:
df = pd.read_excel('outage.xlsx').drop(['variables', 'OBS'], axis = 1)
variable_names = [
    "MONTH",
    "U.S._STATE",
    "ANOMALY.LEVEL",
    "CLIMATE.CATEGORY",
    "CAUSE.CATEGORY",
    "CAUSE.CATEGORY.DETAIL",
    "OUTAGE.DURATION",
    "CUSTOMERS.AFFECTED",
    "RES.PRICE"
]
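# Keep only the columns used in this analysis and drop rows with no recorded month.
# .copy() avoids assigning new columns to a view of the original frame.
df_cleaned = df[variable_names]
df_cleaned = df_cleaned[pd.notna(df_cleaned['MONTH'])].copy()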
def map_month_to_season(month):
    month = int(month)  
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:
        return 'Fall'

df_cleaned['SEASON'] = df_cleaned['MONTH'].apply(map_month_to_season)
# Keep only serious outages: longer than 500 minutes and affecting more than 1000 customers.
# Rows where either value is missing compare as False and are dropped as well.
serious = (df_cleaned['OUTAGE.DURATION'] > 500) & (df_cleaned['CUSTOMERS.AFFECTED'] > 1000)
df_cleaned = df_cleaned[serious].drop('MONTH', axis=1)
print(df_cleaned.head().to_markdown())

|    | U.S._STATE   |   ANOMALY.LEVEL | CLIMATE.CATEGORY   | CAUSE.CATEGORY   | CAUSE.CATEGORY.DETAIL   |   OUTAGE.DURATION |   CUSTOMERS.AFFECTED |   RES.PRICE | SEASON   |
|---:|:-------------|----------------:|:-------------------|:-----------------|:------------------------|------------------:|---------------------:|------------:|:---------|
|  0 | Minnesota    |            -0.3 | normal             | severe weather   | nan                     |              3060 |                70000 |       11.6  | Summer   |
|  2 | Minnesota    |            -1.5 | cold               | severe weather   | heavy wind              |              3000 |                70000 |       10.87 | Fall     |
|  3 | Minnesota    |            -0.1 | normal             | severe weather   | thunderstorm            |              2550 |                68200 |       11.79 | Summer   |
|  4 | Minnesota    |             1.2 | warm               | severe weather   | nan                     |              1740 |     

### Baseline Model

### Baseline Linear Regression Model Description

We developed a baseline linear regression model to predict the duration of power outages. The model uses two features: `SEASON` (nominal) and `CUSTOMERS.AFFECTED` (quantitative). These features were selected based on their availability and presumed relevance to the prediction task. `SEASON` is one-hot encoded. `CUSTOMERS.AFFECTED` is passed through to the regressor unscaled in the pipeline below; applying `StandardScaler` to it would not change the predictions of an ordinary least-squares fit.
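
The preprocessing described above can be sketched on synthetic data. The example below builds the same kind of pipeline with and without `StandardScaler` on the numeric column and compares the predictions. The data, the helper `make_baseline`, and the use of `drop="first"` (to avoid a redundant dummy column alongside the intercept) are assumptions made for this illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (not the outage dataset).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "SEASON": rng.choice(["Winter", "Spring", "Summer", "Fall"], size=n),
    "CUSTOMERS.AFFECTED": rng.integers(1_000, 500_000, size=n).astype(float),
})
toy_y = 1_000 + 0.01 * toy["CUSTOMERS.AFFECTED"] + rng.normal(0, 300, size=n)


def make_baseline(scale_numeric):
    """Hypothetical helper: one-hot encode SEASON, optionally scale the numeric column."""
    numeric_step = StandardScaler() if scale_numeric else "passthrough"
    return Pipeline([
        ("transformer", ColumnTransformer([
            ("one_hot_encoder", OneHotEncoder(drop="first"), ["SEASON"]),
            ("numeric", numeric_step, ["CUSTOMERS.AFFECTED"]),
        ])),
        ("regressor", LinearRegression()),
    ])


scaled_pred = make_baseline(True).fit(toy, toy_y).predict(toy)
raw_pred = make_baseline(False).fit(toy, toy_y).predict(toy)

# For ordinary least squares the two sets of predictions agree up to rounding.
max_gap = np.max(np.abs(scaled_pred - raw_pred))
print(f"Largest prediction gap: {max_gap:.6f}")
```

Scaling matters for regularized or distance-based models, but for plain linear regression it only rescales the coefficient, so the choice does not affect the reported MAE.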

### Model Performance

The model's performance was evaluated using the Mean Absolute Error (MAE). It yielded an MAE of 2901 on the training set and 2427 on the test set. Durations are recorded in minutes, so the average error is roughly 48 hours on the training set and 40 hours on the test set. For a power outage, an error of this size means the model's predictions are of limited practical use.
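
The conversion from minutes to hours can be checked with a few lines of arithmetic. The two MAE values are the baseline results reported in this section, rounded to two decimals.

```python
# MAE values reported for the baseline model, in minutes.
mae_train_minutes = 2901.44
mae_test_minutes = 2427.91

MINUTES_PER_HOUR = 60
mae_train_hours = mae_train_minutes / MINUTES_PER_HOUR
mae_test_hours = mae_test_minutes / MINUTES_PER_HOUR

print(f"train: {mae_train_hours:.1f} h, test: {mae_test_hours:.1f} h")
# prints "train: 48.4 h, test: 40.5 h"
```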

### Evaluation and Conclusion

The current linear regression model appears inadequate for accurately predicting power outage durations. The use of only two features and the simplistic nature of a linear regression model limit its capability to capture the complex dynamics of power outage durations. These durations are likely influenced by various factors, including but not limited to environmental conditions and infrastructure, which are not adequately represented in the current model. To enhance predictive accuracy, it would be prudent to consider a broader range of features and explore more sophisticated modeling approaches that can capture non-linear relationships and interactions among variables.


In [26]:
X = df_cleaned[['SEASON', 'CUSTOMERS.AFFECTED']]
y = df_cleaned['OUTAGE.DURATION']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


pipeline = Pipeline([
    ('transformer', ColumnTransformer([
        ('one_hot_encoder', OneHotEncoder(), ['SEASON'])
    ], remainder='passthrough')),
    ('regressor', LinearRegression())
])

pipeline.fit(X_train, y_train)

y_train_pred = pipeline.predict(X_train)
y_test_pred = pipeline.predict(X_test)

mae_train = mean_absolute_error(y_train, y_train_pred)
mae_test = mean_absolute_error(y_test, y_test_pred) if len(y_test) > 0 else None

mae_train, mae_test

(2901.4410621361826, 2427.9062386845453)

## Final Model Development and Evaluation

In developing my final model, I placed significant emphasis on the selection and engineering of features. I incorporated a mix of categorical variables - `U.S._STATE`, `CLIMATE.CATEGORY`, `CAUSE.CATEGORY`, `CAUSE.CATEGORY.DETAIL`, and `SEASON` - and numerical features like `ANOMALY.LEVEL`, `CUSTOMERS.AFFECTED`, and `RES.PRICE`. My rationale was rooted in the intrinsic relevance of these features to the phenomenon of power outages. For instance, the specific state reflects varying infrastructure and weather conditions, while the cause and climate category directly correlate with outage durations. These features were chosen to capture the broad spectrum of factors impacting power outages, aiming for a more accurate and nuanced predictive model.

- `U.S._STATE`
  - Geographic location can be a significant factor due to different infrastructure and weather patterns.
- `ANOMALY.LEVEL`, `CLIMATE.CATEGORY`
  - These indicate climate conditions, crucial for predicting weather-related outages.
- `CAUSE.CATEGORY`, `CAUSE.CATEGORY.DETAIL`
  - Understanding the cause of the outage helps in estimating duration.
- `CUSTOMERS.AFFECTED`
  - The number of customers affected might correlate with the severity and duration of the outage.
- `RES.PRICE`
  - Residential price as an indirect indicator of infrastructure quality.
- `SEASON`
  - Different seasons have unique patterns impacting outage durations.

I chose `RandomForestRegressor` for its ability to model non-linear relationships and feature interactions, and because averaging many trees reduces overfitting relative to a single tree. A `GridSearchCV` run with 5-fold cross-validation, scored by negative MAE, tested every combination of `n_estimators` (10, 50, 100) and `max_depth` (None, 10, 20, 30) and selected `n_estimators=100` and `max_depth=20`. This pairs the largest forest in the grid with a moderate depth cap, which lets the trees capture detailed patterns while keeping them bounded in size.

My final model reached an MAE of about 1069 minutes on the training set and 1188 minutes on the test set, compared with 2901 and 2427 for the baseline. Two details of the code below affect how this comparison should be read: the final pipeline is refit on the full dataset before scoring, so its test MAE is computed on rows the model has already seen, and rows with missing values are dropped first, so the two models are scored on different test rows. I attribute the improvement to the more comprehensive feature set, which captures more of the factors behind outage durations, and to the tuned `RandomForestRegressor`, which can learn non-linear patterns that the linear baseline cannot.
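
As a sketch of how tuning and held-out evaluation fit together, the example below runs a small grid search on synthetic data and scores the refit best estimator on test rows it never saw. The data, the reduced grid, and the variable names are assumptions for illustration and do not reproduce the results reported here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for the outage features.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = 500 * X_demo[:, 0] ** 2 + 200 * X_demo[:, 1] + rng.normal(0, 50, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [10, 50], "max_depth": [None, 10]},
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is retrained on the training
# rows only, so the test rows stay unseen until the final scoring step.
best_model = search.best_estimator_
cv_mae = -search.best_score_  # mean cross-validated MAE of the best setting
train_mae = mean_absolute_error(y_tr, best_model.predict(X_tr))
test_mae = mean_absolute_error(y_te, best_model.predict(X_te))

print(search.best_params_)
print(f"CV MAE: {cv_mae:.1f}, train MAE: {train_mae:.1f}, test MAE: {test_mae:.1f}")
```

A random forest usually scores much better on its own training rows than on unseen rows, so the gap between `train_mae` and `test_mae` in a setup like this gives a rough sense of how well the model generalizes.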


In [30]:
df_cleaned = df_cleaned.dropna()
X_full = df_cleaned.drop('OUTAGE.DURATION', axis=1)
y_full = df_cleaned['OUTAGE.DURATION']
# Same random state and test proportion as the baseline model
X_train_full, X_test_full, y_train_full, y_test_full = train_test_split(X_full, y_full, 
                                                                        test_size=0.2, random_state=42)


all_feature_transform = ColumnTransformer([
    ('one_hot_encoder', OneHotEncoder(handle_unknown='ignore'), ['U.S._STATE', 'CLIMATE.CATEGORY', 
                                          'CAUSE.CATEGORY', 'CAUSE.CATEGORY.DETAIL', 'SEASON']),
    ('scaler', StandardScaler(), ['ANOMALY.LEVEL', 'CUSTOMERS.AFFECTED', 'RES.PRICE'])
], remainder='passthrough')

final_pipeline = Pipeline([
    ('transformer', all_feature_transform),
    ('regressor', RandomForestRegressor(random_state=42))
])

param_grid_all_features = {
    'regressor__n_estimators': [10, 50, 100],
    'regressor__max_depth': [None, 10, 20, 30]
}

grid_search = GridSearchCV(final_pipeline, 
                        param_grid_all_features, cv=5, 
                        scoring='neg_mean_absolute_error', verbose=1)

grid_search.fit(X_train_full, y_train_full)

best_params = grid_search.best_params_

final_pipeline.set_params(**best_params)
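# Note: this refits on the full dataset, test rows included, so the test MAE
# below is not a true held-out estimate. Fit on X_train_full to obtain one.
final_pipeline.fit(X_full, y_full)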

train_score = mean_absolute_error(y_train_full, final_pipeline.predict(X_train_full))
test_score = mean_absolute_error(y_test_full, final_pipeline.predict(X_test_full))

best_params, train_score, test_score



Fitting 5 folds for each of 12 candidates, totalling 60 fits


({'regressor__max_depth': 20, 'regressor__n_estimators': 100},
 1068.7569739778512,
 1188.4303926826922)

## Fairness Analysis of the Final Model

### Groups Definition
- **Group X**: Power outages caused by severe weather.
- **Group Y**: Power outages due to other reasons.


### Hypotheses
- **Null Hypothesis**: The model is fair. The MAE for Group X (severe weather) and Group Y (other causes) is roughly the same, with any differences being due to random chance.
- **Alternative Hypothesis**: The model is unfair. The MAE for Group X differs significantly from that of Group Y.

### Test Statistic and Significance Level
- **Test Statistic**: Absolute difference in MAE between Group X and Group Y.
- **Significance Level**: 0.05
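
The test statistic and p-value can be sketched as a reusable function. The version below is an illustration under stated assumptions: the function name `permutation_p_value`, the seeded random generator, and the synthetic exponential errors are all hypothetical and are not the values used in this analysis.

```python
import numpy as np


def permutation_p_value(errors_x, errors_y, n_permutations=1000, seed=0):
    """Permutation test on the absolute difference in mean absolute error.

    errors_x, errors_y: per-observation absolute errors for the two groups.
    Returns (observed difference, p-value).
    """
    errors_x = np.asarray(errors_x, dtype=float)
    errors_y = np.asarray(errors_y, dtype=float)
    observed = abs(errors_x.mean() - errors_y.mean())

    pooled = np.concatenate([errors_x, errors_y])
    n_x = len(errors_x)
    rng = np.random.default_rng(seed)  # seeded so the result is reproducible

    diffs = np.empty(n_permutations)
    for i in range(n_permutations):
        # Reassign group labels at random by shuffling the pooled errors.
        shuffled = rng.permutation(pooled)
        diffs[i] = abs(shuffled[:n_x].mean() - shuffled[n_x:].mean())

    # Share of shuffles whose difference is at least as large as the observed one.
    return observed, float(np.mean(diffs >= observed))


# Synthetic absolute errors: one pair with equal means, one with a clear gap.
demo_rng = np.random.default_rng(1)
same = permutation_p_value(demo_rng.exponential(1000, 200), demo_rng.exponential(1000, 150))
shifted = permutation_p_value(demo_rng.exponential(1000, 200), demo_rng.exponential(3000, 150))

print(f"equal means:     diff={same[0]:.1f}, p={same[1]:.3f}")
print(f"different means: diff={shifted[0]:.1f}, p={shifted[1]:.3f}")
```

Seeding the generator makes the p-value repeatable across runs, which an unseeded shuffle does not guarantee.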


### Results
- **Original MAE Difference**: 246.34
- **Mean of Permutation MAE Differences**: 300.50
- **P-Value**: 0.505

### Conclusion
The p-value of 0.505 is well above the 0.05 significance level, so we fail to reject the null hypothesis. The observed MAE difference of 246.34 is smaller than the average difference of 300.50 produced by random shuffling, so the test gives no evidence that the model's MAE differs between Group X (severe weather-related outages) and Group Y (outages due to other reasons).


In [36]:
group_X = df_cleaned[df_cleaned['CAUSE.CATEGORY'] == 'severe weather']
group_Y = df_cleaned[df_cleaned['CAUSE.CATEGORY'] != 'severe weather']

y_X = group_X['OUTAGE.DURATION']
X_X = group_X.drop('OUTAGE.DURATION', axis=1)

y_Y = group_Y['OUTAGE.DURATION']
X_Y = group_Y.drop('OUTAGE.DURATION', axis=1)


# Per-outage absolute errors for each group; each group's MAE is the mean of these.
mae_X = np.abs(y_X-final_pipeline.predict(X_X))
mae_Y = np.abs(y_Y-final_pipeline.predict(X_Y))
original_mae_diff = abs(mae_X.mean() - mae_Y.mean())
combined_mae = np.concatenate((mae_X, mae_Y))
n_permutations = 1000
mae_diffs = []

for _ in range(n_permutations):
    np.random.shuffle(combined_mae)
    shuffled_y_X = combined_mae[:len(y_X)]
    shuffled_y_Y = combined_mae[len(y_X):]

    mae_diffs.append(abs(shuffled_y_X.mean() - shuffled_y_Y.mean()))

# Convert the list to an array so the comparison is explicitly element-wise.
p_value = np.sum(np.array(mae_diffs) >= original_mae_diff) / n_permutations
original_mae_diff,sum(mae_diffs)/len(mae_diffs), p_value

(246.3364860359859, 300.5000246110856, 0.505)