# Environmental Solutions Lab: Adaptive Infrastructure
## Tasks IV & V: Predicting blackout events using machine learning
Intelligent Earth Center for Doctoral Training <br>
Friday 13th December 2024 <br>

Instructors:
* <alison.peard@ouce.ox.ac.uk><br>
* <yu.mo@chch.ox.ac.uk>


### Overview
In this task, you will use machine learning models to predict blackout events based on storm characteristics, blackout data, and socio-economic factors.

Once you have finished the tutorial, experiment with the model. Perhaps a different model would outperform XGBoost, or we have missed some useful predictors. Are there any other datasets we could use?

#### Environment set up
To begin, let's set up our environment.

We will use the `scikit-learn` library for some generic machine learning methods like grid searching and train-test splits. We want to use an `XGBoost` model, so we need to import this library separately.

We define some arguments at the start of our code: `SEED`, to ensure reproducibility, and `NORMALISE`, to select whether to normalise the data before feeding it into the model. Normalisation is not needed for tree-based methods like XGBoost, so we can leave this switched off.

In [None]:
import os
import shap
import numpy as np
import pandas as pd
import xgboost as xgb
import multiprocessing
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from pprint import pp as prettyprint

SEED = 42
NORMALISE = False
XGBModel = xgb.XGBClassifier #  xgb.XGBRegressor for regression

WD = os.path.join(os.getcwd(), "data")

### Load and process the data
We will use the data we prepared in Tasks I-III. This should consist of clean IBTrACS storm characteristics, blackout durations from the Blackout dataset, and extra socioeconomic data. A pre-prepared file, `blackout_with_socio.csv`, is provided, so we can just load this in.

We want to see if we can predict electricity outages (derived from blackout data) based on storm characteristics and socioeconomic factors. We begin by loading the data and visualising it.

In [None]:
# df = pd.read_csv(os.path.join(WD, "outages.csv"))
df = pd.read_csv(os.path.join(WD, "blackout_with_socio.csv"))
df.head()

The income column is in categorical string format, which `XGBoost` cannot automatically handle. We need to encode this in a numeric format.

Two common techniques are _label encoding_ and _one-hot encoding_. Label encoding should only be used when there is a clear hierarchy between categories, e.g., high, medium, low. It assigns an ordinal value to each category corresponding to the hierarchy. One-hot encoding creates a binary column for each category.

Since country income has a clear hierarchy (low, lower-middle, upper-middle, and high), we can use label encoding.

In [None]:
# categorical to ordinal
income_factors = {'L': 0, 'LM': 1, 'UM': 2, 'H': 3}
df['income'] = df['income'].map(income_factors)
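
For comparison, the two encodings can be sketched side by side on a toy table. The `region` column and its values are hypothetical and are not part of the tutorial data; they only stand in for a categorical variable with no natural ordering.

```python
import pandas as pd

# Toy data; the 'region' column and its values are hypothetical.
toy = pd.DataFrame({
    "income": ["L", "UM", "H", "LM"],
    "region": ["coastal", "inland", "coastal", "island"],
})

# Label (ordinal) encoding: a single integer column that preserves the ordering.
income_order = {"L": 0, "LM": 1, "UM": 2, "H": 3}
toy["income_ordinal"] = toy["income"].map(income_order)

# One-hot encoding: one binary column per category, with no ordering implied.
region_onehot = pd.get_dummies(toy["region"], prefix="region", dtype=int)

encoded = pd.concat([toy[["income_ordinal"]], region_onehot], axis=1)
print(encoded)
```

One-hot encoding adds a column per category, so it is usually reserved for variables with a modest number of unordered categories.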

The blackout duration variable takes integer values corresponding to the number of days of blackout experienced after an IBTrACS storm. Predicting the exact duration is a difficult task. One way to simplify it, while still gaining useful information, is to set a threshold of interest and predict whether the blackout duration exceeds it. Here, we will use the median.

In [None]:
# convert duration to binary variable
median = df['duration'].median()
df['duration'] = (df['duration'] > median).astype(int)
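
Because the comparison is strict (`>`), events tied at the median fall into the negative class, so the two classes are not guaranteed to be equally sized. A minimal sketch of how the class balance could be checked, using made-up durations rather than the tutorial data:

```python
import pandas as pd

# Hypothetical blackout durations in days (not the tutorial data).
duration = pd.Series([0, 1, 1, 1, 1, 2, 3, 5, 8])

threshold = duration.median()
exceeds = (duration > threshold).astype(int)

# Share of events in each class after thresholding.
balance = exceeds.value_counts(normalize=True).sort_index()
print(f"threshold = {threshold}")
print(balance)
```

If the split turns out to be very uneven, a different threshold or a stratified train-test split may be worth considering.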

Now that all the data processing has been done, we split the dataset into train and test sets and view it once more.

In [None]:
# select the predictors and the response
response = "duration"
regressors = ["windmax", "speed", "distToTrack", "income"]
df = df.loc[:, regressors + [response]].copy()

# 90:10 train-test split
train, test = train_test_split(df, train_size=0.9, shuffle=True, random_state=SEED)
train.head()
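
An optional variation is to stratify the split on the response, so that the train and test sets keep roughly the same share of positive events. The sketch below uses a hypothetical imbalanced table, not the tutorial data, and the `stratify` argument of `train_test_split`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 30 positive and 70 negative events.
toy = pd.DataFrame({
    "windmax": range(100),
    "duration": [1] * 30 + [0] * 70,
})

# stratify keeps the class proportions (roughly) equal in both subsets.
toy_train, toy_test = train_test_split(
    toy, train_size=0.9, shuffle=True, stratify=toy["duration"], random_state=42
)

print(f"positive share in train: {toy_train['duration'].mean():.2f}")
print(f"positive share in test:  {toy_test['duration'].mean():.2f}")
```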

### XGBoost
XGBoost ([eXtreme Gradient Boosting](https://xgboost.readthedocs.io/en/stable/)) is an ensemble algorithm that combines the predictions of hundreds of decision trees. There are two main types of ensemble algorithm:
* **bagging:** (or bootstrap aggregating) trains each model on a bootstrap sample drawn from the training set. Random forests are a special case of bagging in which the candidate features at each split are also randomly subsampled.
* **boosting:** is an iterative approach, where the ensemble consists of a sequence of weak learners, each fitted to compensate for the weaknesses of its predecessors. XGBoost is a gradient boosting algorithm in which the weak learners are hundreds or thousands of decision trees.
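
To make the distinction concrete, here is a minimal sketch that fits one model of each type to synthetic data. It uses scikit-learn's `RandomForestClassifier` and `GradientBoostingClassifier` as stand-ins (not XGBoost itself), and the dataset and settings are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data (illustrative only).
X_toy, y_toy = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

models = {
    # Bagging: trees are grown independently on bootstrap samples,
    # with a random subset of features considered at each split.
    "bagging (random forest)": RandomForestClassifier(
        n_estimators=100, random_state=0
    ),
    # Boosting: shallow trees are added one after another,
    # each fitted to the errors left by the ensemble so far.
    "boosting (gradient boosting)": GradientBoostingClassifier(
        n_estimators=100, max_depth=2, random_state=0
    ),
}

scores = {}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    scores[name] = clf.score(X_te, y_te)
    print(f"{name}: test accuracy = {scores[name]:.2f}")
```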

XGBoost models have a large number of parameters relating to the base learners (decision trees) and the booster. These are described in detail in the [XGBoost documentation](https://xgboost.readthedocs.io/en/stable/parameter.html). We will use `scikit-learn`'s `GridSearchCV` to do a cross-validated grid search over a subset of these.

In [None]:
# define the model
X, y = train[regressors], train[response]
model = XGBModel(
    n_jobs=multiprocessing.cpu_count() // 2,
    tree_method="hist"
)

In [None]:
# define the grid search
cv = GridSearchCV(
    model,
    {   "booster": ["gbtree"],            # default
        "learning_rate": [0.1, 1],        # alias eta
        "min_split_loss": [1],            # alias gamma
        "alpha": [1],
        "lambda": [1],
        "max_depth": [8],
        "min_child_weight": [1],
        "subsample": [0.8, 0.9],
        "colsample_bytree": [0.8, 0.9],
        "colsample_bylevel": [0.8, 0.9],
        "colsample_bynode": [0.8, 0.9]
        },
    verbose=1,
    n_jobs=multiprocessing.cpu_count() // 2,
)
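
Before running a grid search it can help to estimate how many model fits it implies. A small sketch using scikit-learn's `ParameterGrid` on the same grid as above, assuming the default 5-fold cross-validation that `GridSearchCV` uses when `cv` is not set:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tutorial cell above.
param_grid = {
    "booster": ["gbtree"],
    "learning_rate": [0.1, 1],
    "min_split_loss": [1],
    "alpha": [1],
    "lambda": [1],
    "max_depth": [8],
    "min_child_weight": [1],
    "subsample": [0.8, 0.9],
    "colsample_bytree": [0.8, 0.9],
    "colsample_bylevel": [0.8, 0.9],
    "colsample_bynode": [0.8, 0.9],
}

n_candidates = len(ParameterGrid(param_grid))  # product of the list lengths
n_folds = 5                                    # GridSearchCV default
n_fits = n_candidates * n_folds

print(f"{n_candidates} parameter combinations x {n_folds} folds = {n_fits} fits")
```

Every extra value added to a list multiplies the number of fits, so the grid grows quickly.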

In [None]:
# perform the cv
cv.fit(X, y)

prettyprint(cv.best_score_)
prettyprint(cv.best_params_)

In [None]:
# save parameters to a csv file in the data directory, where the next section reads them
best_params = pd.DataFrame.from_dict(cv.best_params_, orient="index")
best_params.columns = ["value"]
best_params.to_csv(os.path.join(WD, "best_params.csv"))

### Train best model

In [None]:
best_params = pd.read_csv(os.path.join(WD, "best_params.csv"), index_col=0)
best_params = best_params.to_dict()["value"]
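
One thing to be aware of: because the `value` column mixes text (`gbtree`) with numbers, every value comes back from the CSV as a string. Whether that matters may depend on the XGBoost version, so converting the values back to numbers is a cautious option. The sketch below uses hypothetical parameters and a hypothetical helper, `restore_type`, and does the round trip in memory.

```python
import io

import pandas as pd

# Hypothetical tuned parameters, standing in for cv.best_params_.
params = {"booster": "gbtree", "learning_rate": 0.1, "max_depth": 8}

# Round trip through CSV text (in memory, so nothing is written to disk).
frame = pd.DataFrame.from_dict(params, orient="index")
frame.columns = ["value"]
buffer = io.StringIO()
frame.to_csv(buffer)
buffer.seek(0)
loaded = pd.read_csv(buffer, index_col=0)["value"].to_dict()


def restore_type(value):
    """Convert '8' to 8 and '0.1' to 0.1; leave everything else unchanged."""
    if not isinstance(value, str):
        return value
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


restored = {key: restore_type(value) for key, value in loaded.items()}
print(loaded)
print(restored)
```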

In [None]:
best_model = XGBModel(
    n_jobs=multiprocessing.cpu_count() // 2,
    tree_method="hist",
    **best_params
)
best_model.fit(X, y)

### Look at results

In [None]:
y_train = train[response]
y_test = test[response]
y_fit = best_model.predict(train[regressors])
y_pred = best_model.predict(test[regressors])

In [None]:
# classification metrics
# rows are observations, columns are predictions
confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Observation'], colnames=['Prediction'])
a = confusion_matrix.loc[1, 1] # true positives (observed 1, predicted 1)
b = confusion_matrix.loc[0, 1] # false positives (observed 0, predicted 1)
c = confusion_matrix.loc[1, 0] # false negatives (observed 1, predicted 0)
d = confusion_matrix.loc[0, 0] # true negatives (observed 0, predicted 0)

accuracy = np.mean(y_test == y_pred)

bias = (a + b) / (a + c) # frequency bias: predicted positives / observed positives
f1 = 2 * a / (2 * a + b + c)
precision = a / (a + b)
recall = a / (a + c)
csi = a / (a + b + c) # critical success index (threat score)


print('Results:\n--------')
print(f'Accuracy: {accuracy:.2f}')
print(f'Bias: {bias:.2f}')
print(f'F1: {f1:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'CSI: {csi:.2f}')
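
As a sanity check, the contingency-table formulas can be compared against scikit-learn's built-in metrics. The sketch below uses made-up observations and predictions, not the model output; frequency bias and CSI have no direct scikit-learn equivalent, so they are computed by hand.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical observations and predictions (1 = blackout longer than the threshold).
obs = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 1])
pred = np.array([1, 0, 1, 0, 1, 1, 0, 1, 0, 1])

# scikit-learn orders the binary matrix as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(obs, pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)
bias = (tp + fp) / (tp + fn)   # frequency bias
csi = tp / (tp + fp + fn)      # critical success index

# The hand-computed values should agree with scikit-learn's.
assert np.isclose(precision, precision_score(obs, pred))
assert np.isclose(recall, recall_score(obs, pred))
assert np.isclose(f1, f1_score(obs, pred))

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} "
      f"bias={bias:.2f} csi={csi:.2f}")
```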

### Feature importance

XGBoost's built-in feature importance is based on aggregating each feature's importance across all the decision trees in the ensemble.

In [None]:
# look at built-in feature importances (not recommended)
rank = best_model.feature_importances_.argsort()
# label the bars with the predictor names, in the order the model was trained on
plt.barh(np.array(regressors)[rank], best_model.feature_importances_[rank])
plt.xlabel('Default XGBoost feature importances')

Many practitioners prefer to use SHAP values to determine feature importance. SHAP values are model-agnostic and generally considered to be more reliable. We use the Python `shap` library.

**Note:** for Apple Silicon Macs, `shap` needs to be installed with pip, not conda.

In [None]:
shap.initjs()

explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(test[regressors])
shap.summary_plot(shap_values, test[regressors], plot_type="bar");
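
SHAP values build on Shapley values from cooperative game theory. To give a feel for what is being computed, the sketch below calculates exact Shapley values by brute force for a tiny hypothetical model with three features, measured against an all-zero baseline. This is a simplification for illustration: `toy_model` and `shapley_values` are made up here, and the `shap` library uses much more efficient tree-specific algorithms.

```python
from itertools import combinations
from math import factorial


def toy_model(x):
    """Hypothetical model: two additive terms plus one interaction."""
    return 2 * x[0] + 3 * x[1] + x[0] * x[2]


def shapley_values(model, x, baseline):
    """Exact Shapley values of each feature of x relative to a baseline point."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their real value, the rest the baseline value.
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                # Weight of this coalition in the Shapley average.
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                total += weight * (value(set(subset) | {i}) - value(set(subset)))
        phi.append(total)
    return phi


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The contributions sum to the difference between the prediction at `x` and the prediction at the baseline, and the interaction term is shared equally between the two features involved.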

#### References
1. Stephens (2013) Problems with binary pattern measures for flood model evaluation