# Uncertainty in DL: MC-Dropout

Click to run on colab (if you're not already there): [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/charlesollion/dlexperiments/blob/master/6-Bayesian-DL/MCdropout_pytorch.ipynb) 

This session aims at understanding and implementing Monte Carlo Dropout, as described in [Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning](https://arxiv.org/abs/1506.02142).

![](https://github.com/charlesollion/dlexperiments/raw/master/6-Bayesian-DL/BDLworkflow.png)

**Notes:**

If you're interested in the Keras/Tensorflow version, please consider this instead:
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/charlesollion/dlexperiments/blob/master/6-Bayesian-DL/BayesianDeepWine.ipynb) 

In [None]:
import torch
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

## The dataset

In [None]:
# If on colab, you may download the dataset running this cell
!wget https://github.com/charlesollion/dlexperiments/blob/master/6-Bayesian-DL/X_data.npy?raw=true -O X_data.npy
!wget https://github.com/charlesollion/dlexperiments/blob/master/6-Bayesian-DL/y_data.npy?raw=true -O y_data.npy

We use the [Wine Quality](https://archive.ics.uci.edu/ml/datasets/wine+quality)
dataset.
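We use one colour subset of it: the UCI repository lists 1,599 red and 4,898 white wines, and `all_X_data.shape` in the loading cell below tells you which one `X_data.npy` holds.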
The dataset has 11 numerical physicochemical features of the wine, and the task is to predict the wine quality, which is a score between 0 and 10.

While the experts gave integer scores, we first treat them as continuous values between 0 and 10 and frame the problem as a regression task, for simplicity and easy interpretation of confidence intervals. To make the data more continuous, we add a small observation noise to the training targets `y_train`.

We could instead treat this as a classification task, or frame it in many other ways, but that's not the point of this session.
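The effect of the observation noise mentioned above can be sketched on synthetic data. The integer scores below are hypothetical stand-ins for the expert ratings (they are not the wine data); only the noise standard deviation of 0.2 is taken from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical integer quality scores, standing in for the expert ratings.
scores = rng.integers(3, 9, size=1000).astype(float)

# Add small zero-mean Gaussian noise (standard deviation 0.2).
noisy_scores = scores + rng.normal(0.0, 0.2, size=scores.shape)

# The average rating is almost unchanged, while the targets no longer
# sit on a handful of integer values.
print(f"unique values before: {np.unique(scores).size}, after: {np.unique(noisy_scores).size}")
print(f"mean before: {scores.mean():.3f}, after: {noisy_scores.mean():.3f}")
```

The noise is small relative to the spacing between scores, so it smooths the target histogram without changing which score each example is closest to in most cases.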

### Create training and evaluation datasets

Let's split the wine dataset into training and test sets, with 85% and 15% of
the examples, respectively.

In [None]:
batch_size = 256

# X has 11 continuous inputs
# y is treated as a continuous rating between 0 and 10
FEATURE_NAMES = [
    "fixed acidity", "volatile acidity", "citric acid",
    "residual sugar", "chlorides", "free sulfur dioxide",
    "total sulfur dioxide", "density", "pH",
    "sulphates", "alcohol",
]

In [None]:
import torch
import numpy as np
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn import preprocessing


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# load data
all_X_data = np.load("./X_data.npy")
all_y_data = np.load("./y_data.npy")

X_train, X_test, y_train, y_test = train_test_split(all_X_data, all_y_data, test_size=0.15, random_state=42)

# add a bit of noise to y_train
y_train = y_train + np.random.normal(0,0.2, size=y_train.shape)

# mean = 0 ; standard deviation = 1.0
scaler = preprocessing.StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# transform to torch tensor
tensor_X_train = torch.from_numpy(X_train).float().to(device) 
tensor_y_train = torch.from_numpy(y_train).float().to(device) 
tensor_X_test = torch.from_numpy(X_test).float().to(device) 
tensor_y_test = torch.from_numpy(y_test).float().to(device) 

# build dataset and dataloader torch objects
dataset_train = TensorDataset(tensor_X_train, tensor_y_train)
dataset_test = TensorDataset(tensor_X_test, tensor_y_test)
dataloader_train = DataLoader(dataset_train, batch_size=batch_size, shuffle=True)
dataloader_test = DataLoader(dataset_test, batch_size=batch_size)

In [None]:
import matplotlib.pyplot as plt
plt.hist(y_train, bins=100);

## Baseline: Standard neural network

We create a standard deterministic neural network model as a baseline.

In [None]:
import torch.nn as nn

class BaselineMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(BaselineMLP, self).__init__()
        self.hidden_layer1 = nn.Linear(input_dim, hidden_dim)
        self.hidden_layer2 = nn.Linear(hidden_dim, hidden_dim)
        self.out_layer = nn.Linear(hidden_dim, 1)
        self.act = torch.relu
        
    def forward(self, inputs):
        # First layer
        h = self.hidden_layer1(inputs)
        h = self.act(h)
        # Second layer
        h = self.hidden_layer2(h)
        h = self.act(h)
        output = self.out_layer(h)
        
        return output

In [None]:
import torch.optim as optim
from tqdm import tqdm

def train_model(model, num_epochs):
    optimizer = optim.RMSprop(model.parameters(), lr=0.003, weight_decay=1e-5)
    loss = torch.nn.MSELoss()
    losses = []
    model.train()
    for e in tqdm(range(num_epochs)):
        for x, y in dataloader_train:
            optimizer.zero_grad()
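            y_hat = model(x).squeeze(-1)  # (B, 1) -> (B,)
            # align the target shape with the predictions to avoid
            # silent (B, 1) vs (B,) broadcasting inside MSELoss
            loss_value = loss(y_hat, y.reshape(y_hat.shape))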
            loss_value.backward()
            losses.append(loss_value.detach().cpu().item())
            optimizer.step()
    return losses


def eval_model(model):
    model.eval()
    errors = []
    for x,y in dataloader_test:
        y_hat = model(x)
        errors.append(((torch.squeeze(y_hat) - torch.squeeze(y))**2).detach().cpu().numpy())
  
    rmse = np.sqrt(np.mean(np.concatenate(errors, axis=None)))
    return round(rmse, 3)
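A shape mismatch between predictions and targets is a common pitfall in regression loops such as `train_model`. Whether it applies here depends on the shape of `y_data.npy`, which is not shown, so the following is only an illustrative NumPy sketch (PyTorch follows the same broadcasting rules). The arrays are made up for the demonstration.

```python
import numpy as np

batch = 4
y_hat = np.arange(batch, dtype=float).reshape(batch, 1)  # shape (4, 1), like a model output
y = np.arange(batch, dtype=float)                        # shape (4,), like a flat target vector

# The values are identical, yet the difference broadcasts to a (4, 4) matrix.
bad_diff = y_hat - y
print(bad_diff.shape)           # (4, 4)
print(np.mean(bad_diff ** 2))   # 2.5, although predictions equal targets

# Aligning the shapes first gives the intended per-example error.
good_diff = y_hat.reshape(-1) - y.reshape(-1)
print(good_diff.shape)          # (4,)
print(np.mean(good_diff ** 2))  # 0.0
```

This is why `eval_model` squeezes both tensors before subtracting them.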

In [None]:
baseline_model = BaselineMLP(11, 32).to(device)
baseline_model

In [None]:
[p.numel() for p in baseline_model.parameters()]
# hidden layer 1 W, hidden layer 1 b, hidden layer 2 W, hidden layer 2 b, output layer W, output layer b

### Baseline evaluations

Let's look at the RMSE of the untrained model, and then of a naive constant model that always predicts the mean of `y_train`:

In [None]:
rmse = eval_model(baseline_model)
print(f"untrained RMSE: {rmse:.3f}")

In [None]:
class Constant():
    def eval(self):
        pass
    
    def __call__(self, x):
        # Always return 5.81...
        return torch.ones((x.shape[0], 1)).to(device) * np.mean(y_train)

rmse = eval_model(Constant())
print(f"constant model RMSE: {round(rmse, 3):.3f}")
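A useful way to read the constant baseline: when the mean is computed on the same data it is evaluated on, its RMSE is exactly the standard deviation of the targets. The sketch below checks this on hypothetical targets (not the wine data). In the notebook the mean comes from `y_train` and the error is measured on `y_test`, so the equality only holds approximately there.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical targets standing in for the wine quality scores.
y = rng.normal(5.8, 0.8, size=2000)

# RMSE of a model that always predicts the mean of y.
constant_prediction = y.mean()
rmse_constant = np.sqrt(np.mean((y - constant_prediction) ** 2))

# This matches the (population) standard deviation of y.
print(f"constant RMSE: {rmse_constant:.4f}, std of y: {y.std():.4f}")
```

Any trained model should therefore be judged against the spread of the targets, not against zero.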

### Training our model

In [None]:
losses = train_model(baseline_model, 50)

plt.plot(losses);

In [None]:
rmse = eval_model(baseline_model)
print(f"trained RMSE: {rmse:.3f}")

In [None]:
samples = 10
# first ten examples
examples_torch, _ = next(iter(dataloader_test))
predicted = baseline_model(examples_torch[:samples])
predicted = predicted.detach().cpu().numpy()
for idx in range(samples):
    print(f"Predicted: {round(float(predicted[idx][0]), 1)} - Actual: {y_test[idx].item()}")

## Simple Dropout Model

In the following, we add a dropout layer after each hidden layer, which makes the hidden activations stochastic. The output layer is kept as before.

In [None]:
import torch.nn as nn

class DropoutMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, dropout_proba):
        super(DropoutMLP, self).__init__()
        self.hidden_layer1 = nn.Linear(input_dim, hidden_dim)
        self.dropout1 = nn.Dropout(dropout_proba)
        self.hidden_layer2 = nn.Linear(hidden_dim, hidden_dim)
        self.dropout2 = nn.Dropout(dropout_proba)
        self.out_layer = nn.Linear(hidden_dim, 1)
        self.act = torch.relu
        
    def forward(self, inputs):
        # First layer
        h = self.hidden_layer1(inputs)
        h = self.act(h)
        h = self.dropout1(h)
        # Second layer
        h = self.hidden_layer2(h)
        h = self.act(h)
        h = self.dropout2(h)
        # Output layer
        output = self.out_layer(h)
        return output

In [None]:
dropout_model = DropoutMLP(11, 32, 0.1).to(device)
[p.numel() for p in dropout_model.parameters()]
# same six tensors as the baseline (W and b for each linear layer): dropout adds no parameters

In [None]:
losses = train_model(dropout_model, 200)
plt.plot(losses);

### Evaluating the model

As we now have a predictive distribution instead of a single estimate, we can sample several dropout masks and compute several predictions for each example in the test set. It will allow us to see the distribution mean, variance or display a histogram.

However, let us first evaluate the model in a deterministic way, as is traditionally done with standard dropout (not MC-dropout):

In [None]:
# Non-stochastic: in eval mode nn.Dropout is the identity
# (PyTorch rescales activations by 1/(1-p) during training instead)
rmse = eval_model(dropout_model)
print(f"trained deterministic network RMSE: {rmse:.3f}")
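PyTorch's `nn.Dropout` uses the "inverted" formulation: kept activations are scaled by 1/(1-p) during training, and the layer does nothing in evaluation mode. The function below is a minimal NumPy sketch of that behaviour for illustration, not PyTorch's actual implementation. The name `inverted_dropout` is hypothetical.

```python
import numpy as np

def inverted_dropout(h, p, training, rng):
    """Sketch of inverted dropout on an activation array `h`."""
    if not training or p == 0.0:
        return h  # evaluation mode: identity, no rescaling needed
    keep_mask = rng.random(h.shape) >= p  # each unit is kept with probability 1 - p
    return h * keep_mask / (1.0 - p)      # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
h = np.ones((100_000,))

train_out = inverted_dropout(h, p=0.1, training=True, rng=rng)
eval_out = inverted_dropout(h, p=0.1, training=False, rng=rng)

print(f"fraction of zeroed units: {np.mean(train_out == 0):.3f}")  # close to p
print(f"mean in training mode:    {train_out.mean():.3f}")         # close to 1
print(f"mean in eval mode:        {eval_out.mean():.3f}")          # exactly 1
```

MC-dropout simply keeps the training-mode branch active at prediction time, so each forward pass draws a fresh mask.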

In [None]:
def enable_dropout(model):
    """ Function to enable the dropout layers during test-time """
    for m in model.modules():
        if m.__class__.__name__.startswith('Dropout'):
            m.train()

In [None]:
def compute_predictions(model, n_samples=20):
    """ Compute `n_samples` predictions for each example in the test set"""
    model.eval()
    enable_dropout(model) # activate dropout at test time
    n_test = len(dataset_test)
    dropout_predictions = np.zeros((n_samples, n_test))
    
    for i in range(n_samples):
        with torch.no_grad():
            batch_y = np.squeeze(model(tensor_X_test).cpu().numpy())
            dropout_predictions[i] = batch_y

    # Calculating mean across multiple MCD forward passes 
    mean = np.mean(dropout_predictions, axis=0) # shape (n_test)

    # Calculating variance across multiple MCD forward passes 
    variance = np.var(dropout_predictions, axis=0) # shape (n_test)

    return dropout_predictions, mean, variance

In [None]:
preds, mean, var = compute_predictions(dropout_model)

In [None]:
preds.shape
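The array returned by `compute_predictions` has one row per dropout mask and one column per test example, so every summary is a reduction over axis 0. The sketch below illustrates this on a synthetic stand-in array (`fake_preds`, a hypothetical name, with made-up centers and spreads). The percentile interval reflects only the spread induced by dropout, and with 20 samples it is a coarse estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for `preds`: 20 stochastic passes over 5 test examples,
# with a different spread for each example.
n_samples, n_test = 20, 5
centers = np.array([5.0, 5.5, 6.0, 6.5, 7.0])
spreads = np.array([0.05, 0.1, 0.2, 0.4, 0.8])
fake_preds = centers + spreads * rng.normal(size=(n_samples, n_test))

# Reduce over axis 0 (the sampling axis) to get one summary per test example.
pred_mean = fake_preds.mean(axis=0)
pred_std = fake_preds.std(axis=0)
lower, upper = np.percentile(fake_preds, [2.5, 97.5], axis=0)

for i in range(n_test):
    print(f"example {i}: mean={pred_mean[i]:.2f}, std={pred_std[i]:.2f}, "
          f"95% interval=[{lower[i]:.2f}, {upper[i]:.2f}]")
```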

In [None]:
def display_predictions(predictions, targets, idxs):
    """ Display predictions for examples in the test set indexed by the list `idxs`"""
    prediction_mean = np.mean(predictions, axis=0).tolist()
    prediction_min = np.min(predictions, axis=0).tolist()
    prediction_max = np.max(predictions, axis=0).tolist()
    prediction_range = (np.max(predictions, axis=0) - np.min(predictions, axis=0)).tolist()

    for idx in idxs:
        print(
            f"Predictions mean: {round(prediction_mean[idx], 2)}, "
            f"min: {round(prediction_min[idx], 2)}, "
            f"max: {round(prediction_max[idx], 2)}, "
            f"range: {round(prediction_range[idx], 2)} - "
            f"Actual: {targets[idx].item()}"
        )
    plt.boxplot(predictions[:, idxs])
    # overlay the true targets; boxplot positions start at 1
    plt.plot(range(1, len(idxs) + 1), targets[idxs], 'r.', alpha=0.8);

In [None]:
# Choose 10 random indices in the test set
idxs=np.random.choice(range(preds.shape[1]), size=10, replace=False)
display_predictions(preds, y_test, idxs)

Compared to a single stochastic forward pass, the empirical mean of the predictions typically has a lower RMSE:

In [None]:
errors = []
for y_hat, y in zip(mean, y_test):
    errors.append((y_hat - y)**2)

rmse = np.sqrt(np.mean(np.concatenate(errors, axis=None)))
print(f"mean prediction RMSE: {round(rmse, 3):.3f}")
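Why averaging helps can be sketched on synthetic data. The setup below is an idealized assumption: each "pass" is the target plus independent noise, whereas real dropout passes also share whatever bias the trained weights have. Averaged over passes, the squared error of the mean prediction can never exceed the average squared error of the individual passes, because squaring is convex.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup: 200 targets and 20 noisy predictions per target.
n_samples, n_test = 20, 200
targets = rng.normal(5.8, 0.8, size=n_test)
sampled_preds = targets + rng.normal(0.0, 0.5, size=(n_samples, n_test))

def rmse(y_hat, y):
    """Root mean squared error between two arrays of the same shape."""
    return np.sqrt(np.mean((y_hat - y) ** 2))

# RMSE of each individual stochastic pass, then of their average.
single_pass_rmses = np.array([rmse(sampled_preds[i], targets) for i in range(n_samples)])
mean_pred_rmse = rmse(sampled_preds.mean(axis=0), targets)

print(f"average single-pass RMSE:    {single_pass_rmses.mean():.3f}")
print(f"RMSE of the mean prediction: {mean_pred_rmse:.3f}")
```

How the MC mean compares with the deterministic evaluation-mode prediction is an empirical question that this sketch does not answer.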

We can also verify that our model produces variances that depend on the test example rather than a constant value:

In [None]:
plt.hist(var, bins=40, alpha=0.4, color="b");

### Comparing different values of $p$

In the original paper, the authors treat $p$ as a hyperparameter and mention that the results (i.e. the variances) should be similar across values such as $p=0.1$ and $p=0.2$.

However, they chose $p=0.5$ for the LSTM experiment. In the following, we test several values of $p$. Note that it could be a good idea to increase the number of epochs to ensure convergence (which is slower at higher $p$).

Here we compute the empirical variance only, even though the original paper gives the predictive variance as:
$$V_{q(y^\star|x^\star)} = V^{\text{empirical}} + \tau^{-1}$$

where $\tau = \frac{(1-p)\,l^2}{2N\lambda}$, with $p$ the dropout probability (the paper writes this factor as the probability of keeping a unit), $l$ the prior length-scale hyperparameter, $N$ the number of data points, and $\lambda$ the weight decay parameter. In practice, the empirical variance alone is often used, as the other parameters can be tuned (the weight decay can be made arbitrarily small, for instance, making $\tau^{-1}$ negligible).
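To get a feeling for the size of $\tau^{-1}$, here is a small worked computation. It assumes that the factor in the numerator is the keep probability, i.e. one minus the rate passed to `nn.Dropout`. The function name `model_precision`, the length scale of 1.0 and the 4,000 data points are hypothetical; only the dropout rate of 0.1 and the weight decay of 1e-5 match settings used in this notebook, and the optimizer's `weight_decay` may not correspond exactly to $\lambda$.

```python
def model_precision(dropout_p, length_scale, n_points, weight_decay):
    """tau = (1 - p) * l^2 / (2 * N * lambda), with p the dropout probability."""
    return (1.0 - dropout_p) * length_scale ** 2 / (2.0 * n_points * weight_decay)

# Hypothetical length scale and dataset size, notebook-like p and weight decay.
tau = model_precision(dropout_p=0.1, length_scale=1.0, n_points=4000, weight_decay=1e-5)
print(f"tau = {tau:.3f}, tau^-1 = {1.0 / tau:.4f}")

# A smaller weight decay increases tau and shrinks the additive term tau^-1.
tau_small_decay = model_precision(dropout_p=0.1, length_scale=1.0, n_points=4000, weight_decay=1e-7)
print(f"tau = {tau_small_decay:.1f}, tau^-1 = {1.0 / tau_small_decay:.5f}")
```

Under these assumed values the additive term is about 0.09 in the first case and below 0.001 in the second, which illustrates how strongly it depends on hyperparameters that are rarely calibrated.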

In [None]:
variances =[]
for p in [0.0, 0.01, 0.1, 0.2, 0.4]:
    print(f"training model with p={p}")
    dropout_model = DropoutMLP(11, 32, p).to(device)
    losses = train_model(dropout_model, 200)
    preds, mean, var = compute_predictions(dropout_model)
    variances.append(var)
    errors = []
    for y_hat, y in zip(mean, y_test):
        errors.append((y_hat - y)**2)

    rmse = np.sqrt(np.mean(np.concatenate(errors, axis=None)))
    print(f"p={p} mean prediction RMSE: {round(rmse, 3):.3f}")
    print(f"p={p} average variance: {np.mean(var):.3f}")

In [None]:
np.vstack(variances[:]).T.shape

In [None]:
plt.hist(np.vstack(variances[1:]).T, histtype='step', bins=40, label=[f"p={p}" for p in [0.01, 0.1, 0.2, 0.4]], 
         fill=False, alpha=0.6, linewidth=2, hatch='..');
plt.legend();

In [None]:
# long run
p = 0.2
print(f"training model with p={p}")
dropout_model = DropoutMLP(11, 32, p).to(device)
losses = train_model(dropout_model, 1000)
preds, mean, var = compute_predictions(dropout_model)
errors = []
for y_hat, y in zip(mean, y_test):
    errors.append((y_hat - y)**2)

rmse = np.sqrt(np.mean(np.concatenate(errors, axis=None)))
print(f"p={p} mean prediction RMSE: {round(rmse, 3):.3f}")
print(f"p={p} average variance: {np.mean(var):.3f}")

In [None]:
# lowest and largest variances
idxs = (np.argsort(var)[0:5]).tolist() + (np.argsort(var)[-5:]).tolist()

In [None]:
display_predictions(preds, y_test, idxs)