<a href="https://colab.research.google.com/github/butler-julie/GDSVirtualTutorials/blob/main/060625_UncertaintyQuantification/uncertainty_quantification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Uncertainty Quantification Tutorial

Ashley S. Dale

This notebook introduces a simple method of estimating *aleatoric* and *epistemic* uncertainty for random forest regression (RFR) models based on the conservation of variance.

**Epistemic Uncertainty** arises from the model's limited knowledge and can be reduced by giving the model more information, such as additional training data.

**Aleatoric Uncertainty** cannot be reduced; it comes from randomness inherent in the data, such as measurement noise.

**Total Uncertainty** = Epistemic Uncertainty + Aleatoric Uncertainty

We would like to distinguish between the two so that we can better understand how to select data samples for tasks.  However, estimating the aleatoric uncertainty is challenging.  It is more common to calculate the total uncertainty and epistemic uncertainty, then solve for the aleatoric uncertainty.  

$aleatoric\ uncertainty = (total\ uncertainty) - (epistemic\ uncertainty)$



# Procedure

1. Given a dataset, divide the data into a training and a test set
    - The train set consists of chemistries which do not contain Fe as an element
    - The test set consists of chemistries which *do* contain Fe as an element
2. Train a Random Forest Regression model to predict the energy of formation for these chemistries
3. Estimate the total uncertainty of the predictions
4. Estimate the epistemic uncertainty of the predictions
5. Calculate an estimate of the aleatoric uncertainty for the predictions
6. Identify samples which have low aleatoric and high epistemic uncertainty

In [None]:
%matplotlib inline

In [None]:
import copy
import numpy as np
import datetime
import pandas as pd
from sklearn.metrics import root_mean_squared_error as rmse
from sklearn.metrics import r2_score as r2
from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt
from tqdm import tqdm, trange

from data_utils import get_samples_w_element_X, get_target_label

from calibration import calculate_density, calculate_miscalibration_area, calculate_calibration

## Load Data

We will use a version of the Jarvis3D DFT dataset. The total file size is 211 MB.

In [None]:
#download the data pickle
!chmod 755 get_featurized_data.bash
!./get_featurized_data.bash

In [None]:
# Get Data
data = pd.read_pickle('data/jarvis22/dat_featurized_matminer.pkl')
print(len(data))

In [None]:
data

In [None]:
target = 'e_form'
n_samples = -1 # for all samples, pass -1
element_to_omit_from_training_data = 'Fe'

In [None]:
train_data, test_data = get_samples_w_element_X(data, 'formula', element_to_omit_from_training_data)
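
The helper `get_samples_w_element_X` lives in `data_utils` and is not shown in this notebook. As a rough illustration of the idea only, the sketch below splits a list of formula strings by whether they contain a given element. The function names `contains_element` and `split_by_element` are hypothetical, and the sketch assumes formulas are plain strings of element symbols and counts; it is not the helper's actual implementation.

```python
import re

def contains_element(formula, element):
    """Return True if `element` appears as a chemical symbol in `formula`."""
    # A symbol is one capital letter optionally followed by one lowercase letter,
    # so "F" in "NaF" is not mistaken for "Fe".
    symbols = re.findall(r"[A-Z][a-z]?", formula)
    return element in symbols

def split_by_element(formulas, element):
    """Hypothetical split: formulas without `element` go to train, the rest to test."""
    train = [f for f in formulas if not contains_element(f, element)]
    test = [f for f in formulas if contains_element(f, element)]
    return train, test

formulas = ["Fe2O3", "NaF", "CoFe2O4", "SiO2", "F2"]
train_formulas, test_formulas = split_by_element(formulas, "Fe")
print(train_formulas)
print(test_formulas)
```

Matching whole element symbols rather than substrings keeps fluorine-containing formulas out of the Fe test set.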

In [None]:
X_test, y_test = get_target_label(test_data, target)

In [None]:
X_train, y_train = get_target_label(train_data, target)

## Train the Random Forest Regression Model

In [None]:
num_trees_in_forest = 100
max_feat = 0.1
num_dataset_features = X_test.shape[1]

In [None]:
model = RandomForestRegressor(
    n_estimators=num_trees_in_forest, 
    max_features=max_feat,
    oob_score=True
)

In [None]:
model.fit(X_train, y_train)

In [None]:
y_pred_train = model.predict(X_train)

In [None]:
y_pred_test= model.predict(X_test)

In [None]:
test_predictions = []
train_predictions = []

# Get the predictions from each tree in the forest
for tree_obj in tqdm(model.estimators_, total = model.n_estimators):
    test_predictions.append(tree_obj.tree_.predict(X_test.astype(np.float32)))
    train_predictions.append(tree_obj.tree_.predict(X_train.astype(np.float32)))

In [None]:
set_of_train_predictions = np.transpose(np.squeeze(np.array(train_predictions)))
mean_of_ea_train_pred = np.mean(set_of_train_predictions, axis=1)

In [None]:
set_of_test_predictions = np.transpose(np.squeeze(np.array(test_predictions)))
mean_of_ea_test_pred = np.mean(set_of_test_predictions, axis=1)
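
The mean over the per-tree predictions should agree with the forest's own `predict` output, because a random forest regressor averages its trees. The sketch below checks this on synthetic data; the data, the variable names, and the forest settings are assumptions chosen for illustration and are unrelated to the Jarvis dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=20, max_features=0.5, random_state=0)
forest.fit(X, y)

# One column of predictions per tree: shape (n_samples, n_trees)
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_], axis=1)
mean_over_trees = per_tree.mean(axis=1)

print(per_tree.shape)
print(np.allclose(mean_over_trees, forest.predict(X)))
```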

### Confirm Model Accuracy

To check the model performance, we will make a parity plot of the predictions vs ground truth.

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(8, 4))
ax[0].scatter(y_train, y_pred_train, label=f'R2: {r2(y_train, y_pred_train): .2f}')
ax[0].set_title('Training Data Predictions')
ax[1].scatter(y_test, y_pred_test, label=f'R2: {r2(y_test, y_pred_test): .2f}')
ax[1].set_title('Test Data Predictions')

for ax_ in ax:
    ax_.set_xlabel('True Values')
    ax_.set_ylabel('Predicted Values')
    ax_.plot([ax_.get_xlim()[0], ax_.get_xlim()[1]], [ax_.get_xlim()[0], ax_.get_xlim()[1]], 'k--', lw=2)
    ax_.legend()
plt.tight_layout()
plt.show()

# Calculate Total Uncertainty for Each Sample

The total uncertainty for each prediction from a Random Forest Regression model is commonly estimated using the Mean Squared Error of the prediction. For each prediction $y_i$ from the $i^\text{th}$ tree out of $n$ trees in the Random Forest and the target value $y$:

$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} \left( y - y_i \right)^2$

In [None]:
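def variance_estimate(y, y_hat, n=None):
    """Mean squared deviation of the predictions y_hat from the reference value y."""
    if n is None:
        n = len(y_hat)
    return np.sum(np.power((y - y_hat), 2)) / n

def get_variance_estimate(true_values, predicted_values):
    """Apply variance_estimate to each sample (row) of predicted_values."""
    # Convert to an array so indexing is positional even for a pandas Series
    true_values = np.asarray(true_values).ravel()
    var = []
    for sample_idx in trange(len(true_values)):
        var.append(variance_estimate(true_values[sample_idx], predicted_values[sample_idx, :]))
    return np.array(var)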

In [None]:
train_total_var = get_variance_estimate(y_train, set_of_train_predictions)
test_total_var = get_variance_estimate(y_test, set_of_test_predictions)
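
The per-sample loop can also be written with NumPy broadcasting. The sketch below is one possible vectorized form, assuming the predictions are arranged as an array of shape `(n_samples, n_trees)`; the function name `total_variance` and the toy numbers are illustrative.

```python
import numpy as np

def total_variance(y_true, tree_predictions):
    """Mean squared error of each sample's tree predictions about its target."""
    # Column vector of targets broadcasts against (n_samples, n_trees)
    y_true = np.asarray(y_true, dtype=float).reshape(-1, 1)
    tree_predictions = np.asarray(tree_predictions, dtype=float)
    return np.mean((y_true - tree_predictions) ** 2, axis=1)

y_true = np.array([1.0, 2.0])
preds = np.array([[1.0, 2.0, 3.0],
                  [2.0, 2.0, 2.0]])
print(total_variance(y_true, preds))
```

For the first sample the squared errors are 0, 1, and 4, so the estimate is 5/3; for the second sample every tree hits the target and the estimate is 0.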

## Calculate Epistemic Uncertainty

The epistemic uncertainty for a Random Forest Regression model is approximated using the variance in predictions across all trees. The mean predicted value $\bar{y}$ from the set of predictions $y_i$, where $y_i$ is the prediction of the $i^\text{th}$ tree, is

$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$

The variance in predictions between all trees in the forest is

$\sigma_\text{epi}^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2$

In [None]:
train_explained_var = get_variance_estimate(mean_of_ea_train_pred, set_of_train_predictions)
test_explained_var = get_variance_estimate(mean_of_ea_test_pred, set_of_test_predictions)
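
Since the reference value here is the mean prediction itself, this quantity is the population variance of each row of tree predictions. A minimal sketch with toy numbers (the array `preds` is illustrative) shows that it matches `np.var` with its default `ddof=0`:

```python
import numpy as np

preds = np.array([[1.0, 2.0, 3.0],
                  [2.0, 2.0, 2.0]])

# Population variance across trees (divides by n, matching the 1/n in the formula)
epistemic_var = np.var(preds, axis=1)

# The same quantity written out term by term
row_means = preds.mean(axis=1, keepdims=True)
manual = np.mean((preds - row_means) ** 2, axis=1)

print(epistemic_var)
print(np.allclose(epistemic_var, manual))
```

Passing `ddof=1` would divide by `n - 1` instead and would no longer match the formula as written.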

# Calculate Aleatoric Uncertainty

To obtain an estimate of the aleatoric uncertainty $\sigma_\text{al}^2$, we obtain the difference between the total uncertainty and the prediction variance:

$\sigma_\text{al}^2 = \sigma^2 - \sigma_\text{epi}^2$
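
It may help to see what this difference reduces to. Writing $\bar{y}$ for the mean of the tree predictions and splitting each error into a part shared by all trees and a part specific to tree $i$:

$$
\begin{aligned}
\sigma^2 &= \frac{1}{n} \sum_{i=1}^{n} \left( y - y_i \right)^2
          = \frac{1}{n} \sum_{i=1}^{n} \left[ (y - \bar{y}) + (\bar{y} - y_i) \right]^2 \\
         &= (y - \bar{y})^2 + \frac{2 (y - \bar{y})}{n} \sum_{i=1}^{n} (\bar{y} - y_i) + \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2 \\
         &= (y - \bar{y})^2 + \sigma_\text{epi}^2
\end{aligned}
$$

The cross term vanishes because $\sum_{i=1}^{n} (\bar{y} - y_i) = 0$. Under these definitions, $\sigma_\text{al}^2 = \sigma^2 - \sigma_\text{epi}^2 = (y - \bar{y})^2$: the estimate is the squared error of the mean prediction, so it is never negative.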

In [None]:
train_diff_var = train_total_var - train_explained_var
test_diff_var = test_total_var - test_explained_var

# Check Model Calibration

In [None]:
# Residuals and spread are computed per sample (axis=1 runs over the trees)
residuals_train = np.ravel(y_train) - np.mean(set_of_train_predictions, axis=1)
stddev_train = np.std(set_of_train_predictions, axis=1)

residuals_test = np.ravel(y_test) - np.mean(set_of_test_predictions, axis=1)
stddev_test = np.std(set_of_test_predictions, axis=1)

In [None]:
predicted_pi = np.linspace(0, 1, 100)
obsv_pi_train = calculate_calibration(residuals_train, stddev_train)
obsv_pi_test = calculate_calibration(residuals_test, stddev_test)

In [None]:
cal_err_id = ((predicted_pi - obsv_pi_train)**2).sum()
cal_area_id = calculate_miscalibration_area(predicted_pi, obsv_pi_train)

cal_err_ood = ((predicted_pi - obsv_pi_test)**2).sum()
cal_area_ood = calculate_miscalibration_area(predicted_pi, obsv_pi_test)
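
The functions in the `calibration` module are not shown here. As a hedged illustration of what a calibration curve typically measures, the sketch below computes, for each expected probability, the fraction of residuals that fall inside the matching central interval of a Gaussian with the predicted standard deviation. The names `observed_coverage` and `miscalibration_area`, the Gaussian assumption, and the trapezoid-rule area are all assumptions for illustration, not the module's implementation.

```python
import numpy as np
from statistics import NormalDist

def observed_coverage(residuals, stddevs, expected):
    """Fraction of residuals inside the central Gaussian interval for each expected probability."""
    z = np.abs(np.asarray(residuals, dtype=float) / np.asarray(stddevs, dtype=float))
    std_normal = NormalDist()
    observed = []
    for p in expected:
        # Half-width (in standard deviations) of the interval holding probability p
        half_width = np.inf if p >= 1.0 else std_normal.inv_cdf(0.5 + p / 2.0)
        observed.append(np.mean(z <= half_width))
    return np.array(observed)

def miscalibration_area(expected, observed):
    """Trapezoid-rule area between the calibration curve and the diagonal."""
    gap = np.abs(np.asarray(observed) - np.asarray(expected))
    return float(np.sum(0.5 * (gap[1:] + gap[:-1]) * np.diff(expected)))

# Synthetic, well-calibrated case: residuals really are drawn with the stated spread
rng = np.random.default_rng(0)
stddevs = rng.uniform(0.5, 2.0, size=20000)
residuals = rng.normal(0.0, stddevs)
expected = np.linspace(0.0, 1.0, 11)
observed = observed_coverage(residuals, stddevs, expected)
print(miscalibration_area(expected, observed))
```

A well-calibrated model gives a curve close to the diagonal and an area near zero; understated standard deviations pull the observed frequencies below the expected ones.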

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(8, 4))
ax[0].fill_between(
    predicted_pi, predicted_pi, obsv_pi_train, 
    label=f'Train Data: Cal Err={cal_err_id:.3f}, AUC={cal_area_id:.3f}', alpha=0.51)
ax[0].plot(predicted_pi, predicted_pi, '--', color='k', alpha=0.5)
ax[0].set_xlabel('Expected Frequency')
ax[0].set_ylabel('Observed Frequency')
ax[0].set_title(f'{target} {element_to_omit_from_training_data}')
ax[0].legend()

ax[1].fill_between(
    predicted_pi, predicted_pi, obsv_pi_test, alpha=0.51, 
    label=f'Test Data: Cal Err={cal_err_ood:.3f}, AUC={cal_area_ood:.3f}', color='tab:orange')
ax[1].plot(predicted_pi, predicted_pi, '--', color='k', alpha=0.5)
ax[1].set_xlabel('Expected Frequency')
ax[1].set_ylabel('Observed Frequency')
ax[1].set_title(f'{target} {element_to_omit_from_training_data}')
ax[1].legend()
fig.tight_layout()

plt.show()

# Compare Aleatoric and Epistemic Uncertainties

Next, we visualize the total uncertainty estimate for each prediction by creating a parity plot with error bars. To preserve units, the error bars show the square root of the total uncertainty (a standard deviation) for each sample:

$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y - y_i \right)^2}$

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(9, 4))
ax[0].errorbar(
    y_train, mean_of_ea_train_pred, yerr=np.sqrt(train_total_var), 
    fmt='.', linestyle=None, alpha=0.5, label='R2: ' +str(r2(y_train, mean_of_ea_train_pred))[:6]
    )
ax[0].plot(y_train, y_train, 'r')
ax[0].set_title('Training')
ax[0].set_xlabel('Target')
ax[0].set_ylabel('Mean Prediction')
ax[0].set_xlim(1.1*np.amin(y_train), 1.1*np.amax(y_train))
ax[0].set_ylim(1.1*np.amin(y_train), 1.1*np.amax(y_train))
ax[0].legend()

ax[1].errorbar(
    y_test, mean_of_ea_test_pred, yerr=np.sqrt(test_total_var), 
    fmt='.', linestyle=None, alpha=0.5, color='C1',  label='R2: ' +str(r2(y_test, mean_of_ea_test_pred))[:6]
    )

ax[1].plot(y_test, y_test, 'r')
ax[1].set_title('Testing')
ax[1].set_xlabel('Target')
ax[1].set_ylabel('Mean Prediction')
ax[1].set_xlim(1.1*np.amin(y_test), 1.1*np.amax(y_test))
ax[1].set_ylim(1.1*np.amin(y_test), 1.1*np.amax(y_test))
ax[1].legend()

fig.suptitle(f'Random Forest Predictions: Train omits {element_to_omit_from_training_data}')
fig.tight_layout()
plt.show()

We can also visualize the distributions of the standard deviation:

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(5, 5))
_ = ax.hist(np.sqrt(train_total_var), 100, density=True, alpha=0.7, label='Train')
_ = ax.hist(np.sqrt(test_total_var), 100, density=True, alpha=0.7, label='Test')
ax.set_yticks(())
ax.set_xlabel('Sample Uncertainties')
ax.set_ylabel('Frequency')
ax.set_title('Total Uncertainty')
ax.legend()
plt.show()

Finally, we can check for correlation between epistemic and aleatoric uncertainty:

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(8, 4))

ax[0].scatter(train_explained_var, train_diff_var, alpha=0.5, color='C0', label='Train', s=8)
ax[0].set_xlabel('Epistemic Uncertainty')
ax[0].set_ylabel('Aleatoric Uncertainty')
ax[0].set_title('Train')
ax[0].set_yscale('log')
ax[0].set_xscale('log')

ax[1].scatter(test_explained_var, test_diff_var, alpha=0.5, color='C1', label='Test', s=10)
ax[1].set_xlabel('Epistemic Uncertainty')
ax[1].set_ylabel('Aleatoric Uncertainty')
ax[1].set_title('Test')
ax[1].set_yscale('log')
ax[1].set_xscale('log')
fig.tight_layout()
plt.show()
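
The last step of the procedure, identifying samples with low aleatoric and high epistemic uncertainty, can be sketched with quantile thresholds. The function name `select_candidates`, the default quantiles, and the toy arrays below are assumptions chosen for illustration; suitable cutoffs depend on the task.

```python
import numpy as np

def select_candidates(epistemic_var, aleatoric_var, epi_quantile=0.75, al_quantile=0.25):
    """Indices of samples with high epistemic and low aleatoric uncertainty."""
    epistemic_var = np.asarray(epistemic_var, dtype=float)
    aleatoric_var = np.asarray(aleatoric_var, dtype=float)
    # Thresholds are relative to the distribution of each uncertainty
    epi_cut = np.quantile(epistemic_var, epi_quantile)
    al_cut = np.quantile(aleatoric_var, al_quantile)
    mask = (epistemic_var >= epi_cut) & (aleatoric_var <= al_cut)
    return np.flatnonzero(mask)

# Toy values standing in for arrays such as test_explained_var and test_diff_var
epi = np.array([0.01, 0.50, 0.02, 0.40, 0.03, 0.05])
al = np.array([0.30, 0.01, 0.20, 0.50, 0.02, 0.10])
print(select_candidates(epi, al))
```

Only the sample at index 1 is in the top quarter of epistemic uncertainty and the bottom quarter of aleatoric uncertainty.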