# Introduction to Bayesian DL

Click to run on colab (if you're not already there): [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/charlesollion/dlexperiments/blob/master/6-Bayesian-DL/BayesByBackprop_pytorch.ipynb) 

This session aims at understanding and implementing basic Bayesian Deep Learning models, as described in [Bayes by Backprop](https://arxiv.org/abs/1505.05424).

![](https://github.com/charlesollion/dlexperiments/raw/master/6-Bayesian-DL/BDLworkflow.png)

**What this is about:**

- making Bayes by Backprop (BBB) fit the PyTorch framework
- understanding basic BBB building blocks

**Notes:**

The notebook itself is inspired by [Khalid Salama's Keras tutorial on Bayesian Deep Learning](https://keras.io/examples/keras_recipes/bayesian_neural_networks/), and takes several figures from the excellent paper [Hands-on Bayesian Neural Networks - a Tutorial for Deep Learning Users](https://arxiv.org/abs/2007.06823). If you're interested in the Keras/TensorFlow version, please consider this one instead:
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/charlesollion/dlexperiments/blob/master/6-Bayesian-DL/BayesianDeepWine.ipynb) 

## Why BayesByBackprop

Our goal is to use existing frameworks such as PyTorch, and stick to the standard way of training neural networks:

- We will use Stochastic Gradient Descent (SGD) or a variant
- We will then train on random mini-batches of data (data may be large)
- We want to take advantage of automatic differentiation and not compute gradients ourselves: everything must be **differentiable**
- We want to benefit from parallel computing

### Installation (outside of Colab)

You only need PyTorch (the PyPI package is named `torch`); the built-in `torch.distributions` module will be used:

```bash
pip install torch
```

In [None]:
import torch
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

## The dataset

In [None]:
# If on colab, you may download the dataset running this cell
!wget https://github.com/charlesollion/dlexperiments/blob/master/6-Bayesian-DL/X_data.npy?raw=true -O X_data.npy
!wget https://github.com/charlesollion/dlexperiments/blob/master/6-Bayesian-DL/y_data.npy?raw=true -O y_data.npy

We use the [Wine Quality](https://archive.ics.uci.edu/ml/datasets/wine+quality)
dataset.
We use a subset of 4,898 examples (in the UCI dataset this is the size of the white wine subset; the red wine subset has 1,599 examples).
The dataset has 11 numerical physicochemical features of the wine, and the task is to predict the wine quality, which is a score between 0 and 10.

While the experts gave integer scores, we will first consider them as continuous values between 0 and 10 and treat this as a regression task, for simplicity and easy interpretation of confidence intervals. To make the data more continuous, we add a small amount of observation noise to `y`.

We could instead consider a classification task, or many other formulations, but that's not the point of this session.

### Create training and evaluation datasets

Let's split the wine dataset into training and test sets, with 85% and 15% of
the examples, respectively.

In [None]:
batch_size = 256

# X has 11 continuous inputs
# y is treated as a continuous rating between 0 and 10
FEATURE_NAMES = [
    "fixed acidity", "volatile acidity", "citric acid",
    "residual sugar", "chlorides", "free sulfur dioxide",
    "total sulfur dioxide", "density", "pH",
    "sulphates", "alcohol",
]

In [None]:
import torch
import numpy as np
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn import preprocessing


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# load data
all_X_data = np.load("./X_data.npy")
all_y_data = np.load("./y_data.npy")

X_train, X_test, y_train, y_test = train_test_split(all_X_data, all_y_data, test_size=0.15, random_state=42)

# add a bit of noise to y_train
y_train = y_train + np.random.normal(0,0.2, size=y_train.shape)

# mean = 0 ; standard deviation = 1.0
scaler = preprocessing.StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# transform to torch tensor
tensor_X_train = torch.Tensor(X_train) 
tensor_y_train = torch.Tensor(y_train)
tensor_X_test = torch.Tensor(X_test) 
tensor_y_test = torch.Tensor(y_test)

# build dataset and dataloader torch objects
dataset_train = TensorDataset(tensor_X_train, tensor_y_train)
dataset_test = TensorDataset(tensor_X_test, tensor_y_test)
dataloader_train = DataLoader(dataset_train, batch_size=batch_size, shuffle=True)
dataloader_test = DataLoader(dataset_test, batch_size=batch_size)

In [None]:
import matplotlib.pyplot as plt
plt.hist(y_train, bins=100);

## Baseline: Standard neural network

We create a standard deterministic neural network model as a baseline.

<img src="https://raw.githubusercontent.com/charlesollion/dlexperiments/master/6-Bayesian-DL/stochasticNN.png" width="400" height="200">

In [None]:
import torch.nn as nn

class BaselineTorchModel(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(BaselineTorchModel, self).__init__()
        self.hidden_layer = nn.Linear(input_dim, hidden_dim)
        self.out_layer = nn.Linear(hidden_dim, 1)
        self.act = torch.relu
        
    def forward(self, inputs):
        h = self.hidden_layer(inputs)
        h = self.act(h)
        output = self.out_layer(h)
        
        # we add a dummy output, a placeholder for future experiments
        return output, None

In [None]:
import torch.optim as optim
from tqdm import tqdm

def train_model(model, loss, num_epochs):
    optimizer = optim.RMSprop(model.parameters(), lr=0.003)
    losses = []
    model.train()
    for e in tqdm(range(num_epochs)):
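        for x, y in dataloader_train:
            # the model lives on `device`, so the batch must be moved there too
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss_value = loss(model(x), y)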
            loss_value.backward()
            losses.append(loss_value.detach().cpu().item())
            optimizer.step()
    return losses


def eval_model(model):
    model.eval()
    errors = []
    for x, y in dataloader_test:
        # inputs go to the model's device; errors are computed on CPU
        with torch.no_grad():
            y_hat, _ = model(x.to(device))
        errors.append(((torch.squeeze(y_hat).cpu() - torch.squeeze(y))**2).numpy())
  
    rmse = np.sqrt(np.mean(np.concatenate(errors, axis=None)))
    return round(rmse, 3)

In [None]:
baseline_torch_model = BaselineTorchModel(11, 32).to(device)
baseline_torch_model

In [None]:
[p.numel() for p in baseline_torch_model.parameters()]
# hidden layer W, hidden layer b, output layer W, output layer b

### Baseline evaluations

Let's look at the RMSE of the untrained model, and then at a trivial constant model (which just predicts the mean of `y_train`):

In [None]:
rmse = eval_model(baseline_torch_model)
print(f"untrained RMSE: {rmse:.3f}")

In [None]:
class Constant():
    def eval(self):
        pass
    
    def __call__(self, x):
        # Always return 5.81...
        return torch.ones((x.shape[0], 1)) * np.mean(y_train), None

rmse = eval_model(Constant())
print(f"constant model RMSE: {round(rmse, 3):.3f}")
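
The constant model is a useful reference point because the RMSE of always predicting the mean is the standard deviation of the targets (exactly so when the mean is computed on the same data that is evaluated). The sketch below checks this on synthetic scores; the values `5.8` and `0.9` are made up for illustration and are not taken from the wine data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "quality scores" (assumption, not the real dataset)
y = rng.normal(loc=5.8, scale=0.9, size=1000)

# Constant predictor: always output the mean of y
constant_prediction = np.full_like(y, y.mean())
rmse = np.sqrt(np.mean((constant_prediction - y) ** 2))

# np.std uses the population formula (ddof=0), which matches the RMSE
print(rmse, y.std())
```

Any trained model should therefore reach an RMSE clearly below the spread of the targets to be considered useful.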

### Training our model

In [None]:
def simple_mse_loss(model_outputs, y_true):
    y_hat, _ = model_outputs
    y_hat = torch.squeeze(y_hat)
    y_true = torch.squeeze(y_true)
    return torch.nn.MSELoss()(y_hat, y_true) 

losses = train_model(baseline_torch_model, simple_mse_loss, 50)

plt.plot(losses);

In [None]:
rmse = eval_model(baseline_torch_model)
print(f"trained RMSE: {rmse:.3f}")

In [None]:
samples = 10
examples_torch, targets_torch = next(iter(dataloader_test))
predicted, _ = baseline_torch_model(examples_torch[:samples].to(device))
predicted = predicted.detach().cpu().numpy()
for idx in range(samples):
    print(f"Predicted: {round(float(predicted[idx][0]), 1)} - Actual: {targets_torch[idx].item()}")

## A simple Bayesian Neural Network

Our objective is to build a Bayesian Neural Network with a single hidden layer using PyTorch.

<img src="https://raw.githubusercontent.com/charlesollion/dlexperiments/master/6-Bayesian-DL/stochasticNN.png" width="400" height="200">

We define a unit Gaussian prior, and a diagonal covariance multivariate Gaussian posterior.

From the [Bayes by Backprop](https://arxiv.org/abs/1505.05424) paper, we have the following algorithm:

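1. Sample $\epsilon \sim \mathcal{N}(0, I)$.
2. Let $w = \mu + \log(1 + \exp(\rho)) \circ \epsilon$.
3. Let $\theta = (\mu, \rho)$.
4. Let $f(w, \theta) = \log q(w \mid \theta) - \log P(w) P(D \mid w)$.
5. Compute the gradients of $f$ with respect to $\mu$ and $\rho$, and use them to update $\theta$.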

Note that the $\frac{\partial f(w, \theta)}{\partial w}$ term of the gradients is exactly the gradient found by the usual backpropagation algorithm on a neural network. Thus, remarkably, to learn both the mean and the standard deviation we must simply calculate the usual gradients found by backpropagation.
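
The five steps above can be sketched on a toy problem. This is a minimal illustration rather than the paper's reference implementation: the single scalar weight, the synthetic data, the unit-variance Gaussian likelihood and the names `log_q`, `log_prior` and `log_lik` are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy data (assumption): y = 2 * x + noise
x = torch.linspace(-1.0, 1.0, 20)
y = 2.0 * x + 0.1 * torch.randn(20)

# Step 3: variational parameters theta = (mu, rho) for one scalar weight
mu = torch.zeros(1, requires_grad=True)
rho = torch.zeros(1, requires_grad=True)

# Step 1: sample epsilon ~ N(0, I)
eps = torch.randn(1)

# Step 2: w = mu + softplus(rho) * eps, where softplus(rho) = log(1 + exp(rho))
sigma = F.softplus(rho)
w = mu + sigma * eps

# Step 4: f(w, theta) = log q(w | theta) - log P(w) - log P(D | w)
log_q = torch.distributions.Normal(mu, sigma).log_prob(w).sum()
log_prior = torch.distributions.Normal(0.0, 1.0).log_prob(w).sum()
log_lik = torch.distributions.Normal(w * x, 1.0).log_prob(y).sum()
f = log_q - log_prior - log_lik

# Step 5: automatic differentiation gives the gradients w.r.t. mu and rho
f.backward()
print(mu.grad, rho.grad)
```

In a training loop, an optimizer step on `mu` and `rho` would follow, with a fresh `eps` drawn at every iteration.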

When working on $M$ mini-batches $D_i$ of data, the authors propose the following (equivalent) cost function to differentiate:

$$f(w, \theta, D_i) = -\log P(D_i \mid w) + \frac{1}{M} \, \mathrm{KL}\left[q(w \mid \theta) \,\|\, P(w)\right]$$

This is convenient as the cost function is separated into two terms: one which corresponds to the standard MSE loss as before, and a regularization term, the KL divergence between the posterior $q$ and the prior. Note that the weight on the KL term ($\frac{1}{M}$ here) can be chosen with different schemes.
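
For a diagonal Gaussian posterior and a unit Gaussian prior, the KL term has a closed form: $\sum_j \left(-\log \sigma_j + \frac{1}{2}(\sigma_j^2 + \mu_j^2 - 1)\right)$. The sketch below checks it against `torch.distributions.kl.kl_divergence` and shows how the $\frac{1}{M}$ weight would enter a mini-batch loss. The values of `mu`, `sigma`, `num_batches` and `nll_batch` are made up for illustration.

```python
import torch
from torch.distributions import Normal
from torch.distributions.kl import kl_divergence

# Hypothetical posterior parameters for three weights
mu = torch.tensor([0.5, -1.0, 0.0])
sigma = torch.tensor([0.8, 1.5, 1.0])

q = Normal(mu, sigma)
prior = Normal(torch.zeros(3), torch.ones(3))

# KL computed by PyTorch, summed over independent weights
kl_torch = kl_divergence(q, prior).sum()

# Closed form for KL[N(mu, sigma^2) || N(0, 1)]
kl_manual = (-torch.log(sigma) + 0.5 * (sigma**2 + mu**2 - 1.0)).sum()

num_batches = 10               # M, number of mini-batches (hypothetical)
nll_batch = torch.tensor(3.2)  # placeholder for -log P(D_i | w)
loss = nll_batch + kl_torch / num_batches
print(kl_torch.item(), kl_manual.item(), loss.item())
```

The third weight has $\mu = 0$ and $\sigma = 1$, so it matches the prior and contributes nothing to the KL term.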

### Simple Bayesian By Backprop Layer

In the following, we start by implementing a linear layer with stochastic weights. The output layer is kept as before.

In [None]:
class BnnTorch(nn.Module):
    def __init__(self, input_dim, hidden_dim, activation=None):
        super(BnnTorch, self).__init__()
        n = input_dim * hidden_dim
        self.mu = nn.Parameter(torch.zeros((n), dtype=torch.float32))
        self.rho  = nn.Parameter(torch.log(torch.expm1(torch.ones((n), dtype=torch.float32))))
        self.out_layer = nn.Linear(hidden_dim, 1)
        self.act = activation
        self.hidden_dim = hidden_dim
        self.prior = torch.distributions.Normal(loc=torch.zeros((n), device=device, dtype=torch.float32),
                                                scale=torch.ones((n), device=device, dtype=torch.float32))
        #self.posterior = torch.distributions.Normal(loc=self.mu, 
        #                                            scale=torch.log(1.+torch.exp(self.rho)))
        self.kl_func = torch.distributions.kl.kl_divergence
        self.batch_norm = torch.nn.BatchNorm1d(input_dim)

        
    def forward(self, inputs):
        inputs = self.batch_norm(inputs)
        q = torch.distributions.Normal(loc=self.mu, 
                                       scale=torch.log(1.+torch.exp(self.rho)))
        
        kl = torch.sum(self.kl_func(q, self.prior))
        # we use q.rsample() which uses the reparametrization trick instead of 
        # q.sample() which breaks the auto-differentation path
        w = q.rsample() 
        w = w.reshape((-1, self.hidden_dim))
        h = inputs @ w
        if self.act is not None:
            h = self.act(h)
        output = self.out_layer(h)
        return output, kl

In [None]:
bnn_torch = BnnTorch(11, 32, torch.nn.functional.relu).to(device)
[p.numel() for p in bnn_torch.parameters()]
# mu, rho, W_output, b_output, batch_norm weight (gamma), batch_norm bias (beta)
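
The comment in `forward` about `rsample()` versus `sample()` can be checked directly. A minimal sketch, assuming a three-dimensional Gaussian with the same softplus parametrization of the scale as above:

```python
import torch
import torch.nn.functional as F

mu = torch.zeros(3, requires_grad=True)
rho = torch.zeros(3, requires_grad=True)
q = torch.distributions.Normal(mu, F.softplus(rho))

w_reparam = q.rsample()  # mu + sigma * eps: differentiable w.r.t. mu and rho
w_plain = q.sample()     # drawn without tracking gradients

print(w_reparam.requires_grad, w_plain.requires_grad)

# d/dmu of sum(mu + sigma * eps) is a vector of ones
w_reparam.sum().backward()
print(mu.grad)
```

Only the reparametrized sample keeps the path back to `mu` and `rho`, which is what lets the optimizer update the posterior parameters.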

In [None]:
kl_weight = 2. / batch_size

def mse_kl_loss(model_outputs, y_true):
    y_hat, kl = model_outputs
    y_hat = torch.squeeze(y_hat)
    y_true = torch.squeeze(y_true)
    mse = torch.nn.MSELoss()(y_hat, y_true)
    return mse + kl * kl_weight

losses = train_model(bnn_torch, mse_kl_loss, 200)
plt.plot(losses);

In [None]:
def compute_predictions(model, iterations=20):
    model.eval()
    # We sample different weights for each example
    # This is slow! We don't batch anything, and do `iterations` forward passes per example
    single_item_dataloader_test = DataLoader(dataset_test, batch_size=1)
    predicted = np.zeros((y_test.shape[0], iterations))
    
    for j, (x, y) in tqdm(enumerate(single_item_dataloader_test)):
        for i in range(iterations):
            y_hat, _ = model(x.to(device))
            predicted[j,i] = y_hat.detach().cpu()[0].item()
    return predicted

In [None]:
def display_predictions(predictions, targets, samples=10):
    prediction_mean = np.mean(predictions, axis=1).tolist()
    prediction_min = np.min(predictions, axis=1).tolist()
    prediction_max = np.max(predictions, axis=1).tolist()
    prediction_range = (np.max(predictions, axis=1) - np.min(predictions, axis=1)).tolist()

    for idx in range(samples):
        print(
            f"Predictions mean: {round(prediction_mean[idx], 2)}, "
            f"min: {round(prediction_min[idx], 2)}, "
            f"max: {round(prediction_max[idx], 2)}, "
            f"range: {round(prediction_range[idx], 2)} - "
            f"Actual: {targets[idx].item()}"
        )

In [None]:
predictions = compute_predictions(bnn_torch)
display_predictions(predictions, targets_torch)

In [None]:
errors = []
mean_prediction = np.mean(predictions,axis=-1)
for y_hat, y in zip(mean_prediction, y_test):
    errors.append((y_hat - y)**2)

rmse = np.sqrt(np.mean(np.concatenate(errors, axis=None)))
print(f"mean prediction RMSE: {round(rmse, 3):.3f}")

In [None]:
idx=np.random.choice(range(predictions.shape[0]), size=10, replace=False)
plt.boxplot(predictions[idx].T)
plt.plot(range(1,11), y_test[idx], 'r.', alpha=0.8);
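
Beyond the min and max, the sampled predictions can be summarized per example by a standard deviation and an empirical interval. A minimal sketch, assuming an array laid out like `predictions` above (one row per test example, one column per sampled set of weights); the synthetic `fake_predictions` array and the helper name `summarize_samples` are hypothetical.

```python
import numpy as np

def summarize_samples(samples, level=0.95):
    """Per-example mean, std and empirical central interval over sampled predictions."""
    alpha = (1.0 - level) / 2.0
    mean = samples.mean(axis=1)
    std = samples.std(axis=1)
    lower = np.quantile(samples, alpha, axis=1)
    upper = np.quantile(samples, 1.0 - alpha, axis=1)
    return mean, std, lower, upper

rng = np.random.default_rng(0)
# 5 examples, 20 sampled predictions each (synthetic stand-in for `predictions`)
fake_predictions = rng.normal(loc=5.8, scale=0.3, size=(5, 20))

mean, std, lower, upper = summarize_samples(fake_predictions)
for m, s, lo, hi in zip(mean, std, lower, upper):
    print(f"mean: {m:.2f}, std: {s:.2f}, 95% interval: [{lo:.2f}, {hi:.2f}]")
```

With only 20 samples per example the interval endpoints are rough estimates; more iterations give smoother quantiles at a higher compute cost.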

## Full Bayesian Network

We will now implement the full Bayesian neural network, exactly as stated in the paper. This corresponds to adding stochastic biases and output weights.

Keep in mind that the posterior over each weight/bias is still an independent Gaussian.
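
Since all weights and biases are independent Gaussians, they can be sampled as one flat vector and then carved into layer-shaped tensors. One way to do this is `torch.split` followed by `reshape`. The helper `split_flat_weights` below is a hypothetical illustration, assuming the layout hidden weights, hidden biases, output weights, output bias.

```python
import torch

def split_flat_weights(flat, input_dim, hidden_dim):
    """Split a flat vector into (W_hidden, b_hidden, W_output, b_output)."""
    sizes = [input_dim * hidden_dim, hidden_dim, hidden_dim, 1]
    w_h, b_h, w_o, b_o = torch.split(flat, sizes)
    return (w_h.reshape(input_dim, hidden_dim), b_h,
            w_o.reshape(hidden_dim, 1), b_o)

input_dim, hidden_dim = 11, 32
n = input_dim * hidden_dim + hidden_dim + hidden_dim + 1  # 417 values in total

# Use increasing values so it is easy to see which slice ends up where
flat = torch.arange(n, dtype=torch.float32)

W_hidden, b_hidden, W_output, b_output = split_flat_weights(flat, input_dim, hidden_dim)
print(W_hidden.shape, b_hidden.shape, W_output.shape, b_output.shape)
print(b_output)  # the very last entry of the flat vector
```

Because `torch.split` requires the sizes to add up to the length of the vector, an off-by-one in the layout raises an error instead of silently reusing or dropping an entry.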

In [None]:
class FullBnnTorch(nn.Module):
    def __init__(self, input_dim, hidden_dim, activation=None):
        super(FullBnnTorch, self).__init__()
        # All parameters W, b + W_output + b_output
        n = input_dim * hidden_dim + hidden_dim + hidden_dim + 1
        self.mu = nn.Parameter(torch.zeros((n), dtype=torch.float32))
        self.rho  = nn.Parameter(torch.log(torch.expm1(torch.ones((n), dtype=torch.float32))))
        
        self.act = activation
        self.hidden_dim = hidden_dim
        self.input_dim = input_dim
        self.prior = torch.distributions.Normal(loc=torch.zeros((n), device=device, dtype=torch.float32),
                                                scale=torch.ones((n), device=device, dtype=torch.float32))
        #self.posterior = torch.distributions.Normal(loc=self.mu, 
        #                                            scale=torch.log(1.+torch.exp(self.rho)))
        self.kl_func = torch.distributions.kl.kl_divergence
        self.batch_norm = torch.nn.BatchNorm1d(input_dim)

        
    def forward(self, inputs):
        inputs = self.batch_norm(inputs)
        q = torch.distributions.Normal(loc=self.mu, 
                                       scale=torch.log(1.+torch.exp(self.rho)))
        
        kl = torch.sum(self.kl_func(q, self.prior))
        # we use q.rsample() which uses the reparametrization trick instead of 
        # q.sample() which breaks the auto-differentation path
        all_w = q.rsample() 
        
        # split all_w into the different weight and biases matrices
        W_hidden = all_w[0:self.input_dim * self.hidden_dim].reshape((self.input_dim, self.hidden_dim))
        cur = self.input_dim * self.hidden_dim
        b_hidden = all_w[cur:cur + self.hidden_dim]
        cur = cur + self.hidden_dim
        W_output = all_w[cur:cur + self.hidden_dim].reshape((self.hidden_dim, 1))
        # the output bias is the last entry; [-2:-1] would re-use the last output weight
        b_output = all_w[cur + self.hidden_dim:]
        h = inputs @ W_hidden + b_hidden
        if self.act is not None:
            h = self.act(h)
        output = h @ W_output + b_output
        return output, kl

In [None]:
full_bnn_torch = FullBnnTorch(11, 32, torch.nn.functional.relu).to(device)
[p.numel() for p in full_bnn_torch.parameters()]
# 417 = 11*32 + 32 + 32*1 + 1

In [None]:
kl_weight = 1. / batch_size

def mse_kl_loss(model_outputs, y_true):
    y_hat, kl = model_outputs
    y_hat = torch.squeeze(y_hat)
    y_true = torch.squeeze(y_true)
    mse = torch.nn.MSELoss()(y_hat, y_true)
    return mse + kl * kl_weight

losses = train_model(full_bnn_torch, mse_kl_loss, 200)
plt.plot(losses);

In [None]:
predictions = compute_predictions(full_bnn_torch)
display_predictions(predictions, targets_torch)

In [None]:
errors = []
mean_prediction = np.mean(predictions,axis=-1)
for y_hat, y in zip(mean_prediction, y_test):
    errors.append((y_hat - y)**2)

rmse = np.sqrt(np.mean(np.concatenate(errors, axis=None)))
print(f"mean prediction RMSE: {round(rmse, 3):.3f}")

In [None]:
idx=np.random.choice(range(predictions.shape[0]), size=10, replace=False)
plt.boxplot(predictions[idx].T)
plt.plot(range(1,11), y_test[idx], 'r.', alpha=0.8);

## Comparison with parametrizing a distribution

So far, we have distributions over weights, but single point estimates for each input (once we sample these weights).

<img src="https://raw.githubusercontent.com/charlesollion/dlexperiments/master/6-Bayesian-DL/stochasticNN.png" width="400" height="200">

We can also model an output distribution, rather than a single point estimate. We change the last layer so that it parametrizes a Gaussian over the output. This relates to *aleatoric uncertainty*: the irreducible noise in the data, which reflects the stochastic nature of the process generating it.

We then use the negative log-likelihood of this distribution as our loss function, instead of the MSE. We are not using any stochastic weights in that case (it would still be possible!).
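
To see how this loss relates to the MSE: for a Gaussian with mean $\mu$ and standard deviation $\sigma$, the negative log-likelihood of $y$ is $\log \sigma + \frac{(y - \mu)^2}{2\sigma^2} + \frac{1}{2}\log 2\pi$. With $\sigma$ fixed, this is the squared error up to a scale and an offset. Letting the network predict $\sigma$ allows it to down-weight errors where it declares high noise, at the price of the $\log \sigma$ penalty. A small check against `torch.distributions.Normal`, with made-up values:

```python
import math
import torch

mu = torch.tensor([5.0, 6.0, 5.5])     # hypothetical predicted means
sigma = torch.tensor([0.5, 1.0, 2.0])  # hypothetical predicted standard deviations
y = torch.tensor([5.2, 7.0, 5.5])      # hypothetical targets

# Negative log-likelihood from the distribution object
nll_torch = -torch.distributions.Normal(mu, sigma).log_prob(y)

# Same quantity written out term by term
nll_manual = (torch.log(sigma)
              + (y - mu) ** 2 / (2 * sigma**2)
              + 0.5 * math.log(2 * math.pi))

print(nll_torch)
print(nll_manual)
```

The third example has zero error but a large `sigma`, so its loss comes entirely from the $\log \sigma$ and constant terms.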

In [None]:
import torch.nn as nn

class DistributionModel(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(DistributionModel, self).__init__()
        self.hidden_layer = nn.Linear(input_dim, hidden_dim)
        # We output two values, which will be interpreted as mu and log sigma of a distribution
        self.out_layer = nn.Linear(hidden_dim, 2)
        
        self.act = torch.relu
        
    def forward(self, inputs):
        h = self.hidden_layer(inputs)
        h = self.act(h)
        output = self.out_layer(h)
        mu, logscale = torch.split(output, [1,1], dim=-1)
        output = torch.distributions.Normal(loc=mu,
                                            scale=torch.exp(logscale))
        
        # we return a distribution instead of a single point estimate
        return output

In [None]:
distrib_model = DistributionModel(11, 32).to(device)
[p.numel() for p in distrib_model.parameters()]
# 11*32 + 32 + 32*2 + 2

In [None]:
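def nll_loss(model_outputs, y_true):
    # the distribution has batch shape (B, 1): reshape y_true to match, otherwise
    # log_prob would silently broadcast to a (B, B) matrix
    return - torch.sum(model_outputs.log_prob(y_true.reshape(-1, 1)))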

losses = train_model(distrib_model, nll_loss, 200)
plt.plot(losses);

In [None]:
distrib_model.eval()
errors = []
intervals = []
for x,y in dataloader_test:
    d = distrib_model(x.to(device))
    y_hat = d.mean
    intervals.append(d.scale.detach().cpu().numpy())
    errors.append(((torch.squeeze(y_hat).cpu() - torch.squeeze(y))**2).detach().numpy())
    

rmse = np.sqrt(np.mean(np.concatenate(errors, axis=None)))
inter = np.mean(np.concatenate(intervals, axis=None)) * 1.96 * 2
print(f"mean prediction RMSE: {round(rmse, 3):.3f}")
print(f"average 95% confidence interval size: {round(inter, 3):.3f}")

In [None]:
predicted_d = distrib_model(examples_torch[:samples].to(device))
predicted_means = predicted_d.mean.detach().cpu().numpy()
predicted_sigmas = predicted_d.scale.detach().cpu().numpy()

upper = (predicted_means + (1.96 * predicted_sigmas)).tolist()
lower = (predicted_means - (1.96 * predicted_sigmas)).tolist()

for idx in range(samples):
    print(
        f"Prediction mean: {round(predicted_means[idx][0], 2):.2f}, "
        f"stddev: {round(predicted_sigmas[idx][0], 2):.2f}, "
        f"95% CI: [{round(lower[idx][0], 2)}, {round(upper[idx][0], 2)}]"
        f" - Actual: {y_test[idx].item()}"
    )

In [None]:
idx=np.random.choice(range(samples), size=10, replace=False)
# 20 draws per example: shape (20, samples), one boxplot column per example
predictions = predicted_d.sample((20,))[:,:,0].cpu().numpy()
plt.boxplot(predictions[:, idx])
plt.plot(range(1,11), y_test[idx], 'r.', alpha=0.8);

## More complex examples: Convolutional models

This bonus section showcases a larger Bayesian neural network with convolutional layers. 

##### In Physics

A good example of a practical use of Deep Bayesian Learning in Physics Simulation is available here: [Thuerey Group Physics Deep Learning](https://www.physicsbaseddeeplearning.org/bayesian-code.html)

The goal is to predict the air flow around an airfoil, and BDL makes it possible to produce several distinct plausible outputs:
<img src="https://www.physicsbaseddeeplearning.org/_images/bayesian-code_27_0.png" width="600">

##### A standard Bayesian convnet 
A simpler convolutional example is taken from a [GitHub repository](https://github.com/kumar-shridhar/PyTorch-BayesianCNN). It is trained on CIFAR10.

All weights, including the convolutional kernels, are sampled from a weight distribution:
![](https://github.com/kumar-shridhar/PyTorch-BayesianCNN/raw/master/experiments/figures/CNNwithdist_git.png)

In [None]:
!git clone https://github.com/kumar-shridhar/PyTorch-BayesianCNN.git

In [None]:
%cd PyTorch-BayesianCNN

In [None]:
import os
import argparse

import torch
import numpy as np
from torch.optim import Adam, lr_scheduler
from torch.nn import functional as F

import data
import utils
import metrics
import config_bayesian as cfg
from models.BayesianModels.Bayesian3Conv3FC import BBB3Conv3FC

# CUDA settings
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [None]:
trainset, testset, inputs, outputs = data.getDataset("CIFAR10")
train_loader, valid_loader, test_loader = data.getDataloader(
        trainset, testset, cfg.valid_size, 32, 2)

We can load a pretrained model (it could be trained for longer!). Warning: it might only work on GPU.

In [None]:
!wget https://github.com/charlesollion/dlexperiments/blob/master/6-Bayesian-DL/model_3conv3fc_lrt_softplus.pt?raw=true -O model_3conv3fc_lrt_softplus.pt

In [None]:
weights_path = "model_3conv3fc_lrt_softplus.pt"

In [None]:
state_dict = torch.load(weights_path, map_location=device)
net = BBB3Conv3FC(outputs, inputs, cfg.priors, cfg.layer_type, cfg.activation_type).to(device)
net.load_state_dict(state_dict)

In [None]:
from tqdm import tqdm 

def validate_model(net, validloader, num_ens=1):
    """Calculate ensemble accuracy and NLL Loss"""
    net.eval()
    accs = []

    for i, (inputs, labels) in tqdm(enumerate(validloader)):
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = torch.zeros(inputs.shape[0], net.num_classes, num_ens).to(device)
        for j in range(num_ens):
            net_out, _ = net(inputs)
            outputs[:, :, j] = F.log_softmax(net_out, dim=1).data

        log_outputs = utils.logmeanexp(outputs, dim=2)
        accs.append(metrics.acc(log_outputs, labels))

    return np.mean(accs)

validate_model(net, test_loader, num_ens=10)
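
The ensemble prediction above averages class probabilities over `num_ens` stochastic forward passes, working in log space for numerical stability. The repository's `utils.logmeanexp` is not reproduced here; the function `log_mean_exp` below is a stand-in written for illustration, assuming it computes the log of the mean of exponentials along one dimension.

```python
import math
import torch

def log_mean_exp(x, dim):
    """log(mean(exp(x))) along `dim`, computed stably with logsumexp."""
    return torch.logsumexp(x, dim=dim) - math.log(x.shape[dim])

torch.manual_seed(0)
# Fake network outputs: 4 images, 10 classes, 5 sampled networks
logits = torch.randn(4, 10, 5)
log_probs = torch.log_softmax(logits, dim=1)

# Average the 5 predicted distributions per image, in log space
ensemble_log_probs = log_mean_exp(log_probs, dim=2)
ensemble_probs = ensemble_log_probs.exp()

# An average of probability vectors is still a probability vector
print(ensemble_probs.sum(dim=1))
```

Averaging probabilities (rather than logits) is what makes the result a Monte Carlo estimate of the posterior predictive distribution.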

In [None]:
from uncertainty_estimation import init_dataset, get_uncertainty_per_image

In [None]:
import matplotlib.pyplot as plt
bX = next(iter(test_loader))
img = bX[0][2].to(device) # Choose the one you want


def softmax(X): 
    exp = np.exp(X)
    return exp / np.sum(exp)

pred, epistemic, aleatoric = get_uncertainty_per_image(net, img)
cls = pred.argmax()
proba = softmax(pred)
print(f"{cls} : ({proba[cls]:.2})")
plt.plot(proba);
plt.plot(epistemic);
plt.plot(aleatoric);

In [None]:
plt.imshow(img.cpu().numpy().transpose(1,2,0));

If you want to train this model yourself, you may run the following command:

In [None]:
!python main_bayesian.py --net_type 3conv3fc --dataset CIFAR10