# Crash Course on Bayesian Deep Learning

Click to run on colab (if you're not already there): [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/charlesollion/dlexperiments/blob/master/6-Bayesian-DL/BayesianDeepWine.ipynb) 

This session aims at understanding and implementing basic Bayesian Deep Learning models, as described in [Bayes by Backprop](https://arxiv.org/abs/1505.05424), followed by a short comparison with [Monte Carlo Dropout](https://arxiv.org/abs/1506.02142).

**What this is not about:**

- answering questions about a dataset
- Deep Bayesian Learning: *What*, *Why*

**What this is about:**

- Deep Bayesian Learning: *How*
- trying to stick to classic deep learning frameworks and practice
- understanding basic building blocks

The notebook itself is inspired by [Khalid Salama's Keras tutorial on Bayesian Deep Learning](https://keras.io/examples/keras_recipes/bayesian_neural_networks/).

## Deep Learning frameworks introduction

Our goal is to use existing frameworks such as Pytorch or Tensorflow. This usually implies:

- We will use Stochastic Gradient Descent (SGD) or a variant
- We will then train on random mini-batches of data (data may be large)
- We want to take advantage of automatic differentiation and not compute gradients ourselves: everything must be **differentiable**
- We want to benefit from parallel computing

## Installation

In this notebook, we build basic probabilistic Bayesian neural networks, with a focus on practical implementation. We consider the two most popular deep learning frameworks, Tensorflow (with Keras) and Pytorch. Feel free to use your favorite.

For Tensorflow (2.3 or higher), we use the [TensorFlow Probability](https://www.tensorflow.org/probability) library, which is compatible with the Keras API. If you're outside of Colab, you may install it with pip:

```python
pip install tensorflow-probability
```

For pytorch, the built-in `torch.distributions` will be used:

```python
pip install torch
```

Outside of Colab, you can install [TensorFlow Datasets](https://www.tensorflow.org/datasets/catalog/wine_quality) using the following command:

```python
pip install tensorflow-datasets
```

## The dataset

We use the [Wine Quality](https://archive.ics.uci.edu/ml/datasets/wine+quality)
dataset, which is available in [TensorFlow Datasets](https://www.tensorflow.org/datasets/catalog/wine_quality).
We use the white wine subset (the default configuration), which contains 4,898 examples.
The dataset has 11 numerical physicochemical features of the wine, and the task
is to predict the wine quality, which is a score between 0 and 10.

While the experts gave integer scores, we will first consider them as continuous values between 0 and 10 and treat this as a regression task, for simplicity and easy interpretation of confidence intervals. 

We could instead treat this as a classification task, add observation noise, or make many other modeling choices here, but that is not the point of this session.

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_datasets as tfds
import tensorflow_probability as tfp

# Benefits of Keras/Tensorflow
# + no management of CPU/GPU
# + once the model is defined, very easy to train
# - quite verbose code to define models
# - a lot is hidden, difficult to track bugs and what's really going on

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Benefits of Pytorch
# + easier to debug and to control
# - manage GPU/CPU (but it's easy)

### Create training and evaluation datasets

In [None]:
FEATURE_NAMES = [
    "fixed acidity", "volatile acidity", "citric acid",
    "residual sugar", "chlorides", "free sulfur dioxide",
    "total sulfur dioxide", "density", "pH",
    "sulphates", "alcohol",
]

def process_data(x,y):
    x_transform = tf.stack([tf.cast(x[f], tf.float32) for f in FEATURE_NAMES], 0)
    return x_transform, tf.cast(y, tf.float32) # cast y (integer) as a float

def get_train_and_test_splits(train_size, batch_size=1):
    dataset = (
        tfds.load(name="wine_quality", as_supervised=True, split="train")
        .map(process_data)
        .cache()
    )
    # We shuffle with a buffer the same size as the dataset.
    train_dataset = dataset.take(train_size).shuffle(buffer_size=train_size).batch(batch_size)
    test_dataset = dataset.skip(train_size).batch(batch_size)

    return train_dataset, test_dataset

Let's split the wine dataset into training and test sets, with 85% and 15% of
the examples, respectively.

In [None]:
dataset_size = 4898
batch_size = 256
train_size = int(dataset_size * 0.85)
M = int(train_size / batch_size)
train_dataset, test_dataset = get_train_and_test_splits(train_size, batch_size)

### Pytorch conversion

We can transform the tensorflow dataset into a pytorch one. This is not a very good practice, but is doable as the dataset is small.

In [None]:
dataset = tfds.load(name="wine_quality", as_supervised=True, split="train").map(process_data)

def process_data_torch(x,y):
    x_transform = torch.tensor(x.numpy(), device=device)
    y_transform = torch.tensor(y.numpy(), device=device)
    return (x_transform, y_transform)

data_train = list(map(lambda x: process_data_torch(*x), dataset.take(train_size)))
data_test = list(map(lambda x: process_data_torch(*x), dataset.skip(train_size)))
torch_trainloader = torch.utils.data.DataLoader(data_train, batch_size=batch_size)
torch_testloader = torch.utils.data.DataLoader(data_test, batch_size=batch_size)

## 1) Standard neural network

We create a standard deterministic neural network model as a baseline.

### Tensorflow / Keras version

In [None]:
def create_baseline_model(input_dim, hidden_dim):
    inputs = layers.Input(shape=(input_dim,))

    # Hidden layer with deterministic weights using a Dense layer, 
    # and non linear activation function
    features = layers.Dense(hidden_dim, activation="sigmoid")(inputs)

    # The output is deterministic: a single point estimate.
    outputs = layers.Dense(units=1)(features)

    model = keras.Model(inputs=inputs, outputs=outputs)
    return model

In [None]:
learning_rate = 0.001

def run_experiment(model, loss, epochs, train_dataset, test_dataset):
    model.compile(
        optimizer=keras.optimizers.RMSprop(learning_rate=learning_rate),
        loss=loss,
        metrics=[keras.metrics.RootMeanSquaredError()],
    )

    model.fit(train_dataset, epochs=epochs, validation_data=test_dataset)
    _, rmse = model.evaluate(train_dataset, verbose=0)
    print(f"Train RMSE: {round(rmse, 3)}")

    _, rmse = model.evaluate(test_dataset, verbose=0)
    print(f"Test RMSE: {round(rmse, 3)}")

In [None]:
baseline_model = create_baseline_model(11, 32)
baseline_model.summary()

In [None]:
num_epochs = 50
mse_loss = keras.losses.MeanSquaredError()
run_experiment(baseline_model, mse_loss, num_epochs, train_dataset, test_dataset)

We take a sample from the test set and use the model to obtain predictions for it.
Note that since the baseline model is deterministic, we get a single
*point estimate* prediction for each test example, with no information about the
uncertainty of the model or of the prediction.

In [None]:
samples = 10
examples, targets = list(test_dataset.unbatch().shuffle(batch_size * 10).batch(samples))[
    0
]

predicted = baseline_model(examples).numpy()
for idx in range(samples):
    print(f"Predicted: {round(float(predicted[idx][0]), 1)} - Actual: {targets[idx]}")

### Pytorch version

In [None]:
class BaselineTorchModel(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(BaselineTorchModel, self).__init__()
        self.hidden_layer = nn.Linear(input_dim, hidden_dim)
        self.out_layer = nn.Linear(hidden_dim, 1)
        self.act = torch.sigmoid
        #self.batch_norm = torch.nn.BatchNorm1d(input_dim)
        
    def forward(self, inputs):
        #inputs = self.batch_norm(inputs)
        h = self.hidden_layer(inputs)
        h = self.act(h)
        output = self.out_layer(h)
        
        # we add a dummy output, a placeholder for future experiments
        return output, None

In [None]:
from tqdm import tqdm

def run_experiment_torch(model, loss, num_epochs, train_dataloader, test_dataloader):
    optimizer = optim.RMSprop(model.parameters(), lr=0.001)
    for e in tqdm(range(num_epochs)):
        model.train()
        for x, y in train_dataloader:
            optimizer.zero_grad()
            loss_value = loss(model(x), y)
            loss_value.backward()
            optimizer.step()
        model.eval()
    errors = []
    for x,y in test_dataloader:
        yhat, _ = model(x)
        errors.append((torch.squeeze(yhat.detach()).cpu().numpy() - y.detach().cpu().numpy())**2)
  
    rmse = np.sqrt(np.mean(np.concatenate(errors, axis=None)))
    print(f"Test RMSE: {round(rmse, 3)}")

In [None]:
baseline_torch_model = BaselineTorchModel(11, 32).to(device)
baseline_torch_model

In [None]:
[p.numel() for p in baseline_torch_model.parameters()]

In [None]:
def simple_mse_loss(model_outputs, y_true):
    yhat, _ = model_outputs
    yhat = torch.squeeze(yhat)
    return torch.nn.MSELoss()(yhat, y_true) 

run_experiment_torch(baseline_torch_model, 
                     simple_mse_loss, 
                     50, torch_trainloader, torch_testloader)

In [None]:
samples = 10
examples_torch, targets_torch = next(iter(torch_testloader))
predicted, _ = baseline_torch_model(examples_torch[:samples])
predicted = predicted.detach().cpu().numpy()
for idx in range(samples):
    print(f"Predicted: {round(float(predicted[idx][0]), 1)} - Actual: {targets_torch[idx]}")

## 2) A simple Bayesian Neural Network

Our objective is to build a single-layer Bayesian Neural Network using Tensorflow or Pytorch.

Instead of learning specific weight *values* in the
neural network, we will learn weight *distributions*, from which we can sample a weight value and then compute the output for a given input.

We need to define the prior and posterior distributions of these weights. We will start with a unit Gaussian prior and a multivariate Gaussian posterior with diagonal covariance.

From the [Bayes by Backprop](https://arxiv.org/abs/1505.05424) paper, we have the following algorithm:

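1. Sample $\epsilon \sim \mathcal{N}(0, I)$.
2. Let $w = \mu + \log(1 + \exp(\rho)) \circ \epsilon$.
3. Let $\theta = (\mu, \rho)$.
4. Let $f(w, \theta) = \log q(w|\theta) - \log P(w)P(D|w)$.
5. Compute the gradients of $f(w, \theta)$ with respect to $\mu$ and $\rho$, then update both with a gradient step.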

Note that the $\frac{\partial f(w,\theta)}{\partial w}$ term of the gradients for the mean and standard deviation is shared, and is exactly the gradient found by the usual backpropagation algorithm on a neural network. Thus, remarkably, to learn both the mean and the standard deviation we simply need to calculate the usual gradients found by backpropagation.
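Steps 1 to 4 can be sketched in plain NumPy for a toy linear model. This is only an illustration, not the notebook's implementation: the toy data, the helper names (`softplus`, `log_normal`) and the observation noise of 0.1 are assumptions made for the example, and no gradient is computed here.

```python
import numpy as np

rng = np.random.default_rng(0)

def softplus(x):
    # log(1 + exp(x)), written in a numerically stable way
    return np.logaddexp(0.0, x)

def log_normal(x, mean, std):
    # Sum of element-wise Gaussian log-densities
    return np.sum(-0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((x - mean) / std) ** 2)

# Toy regression data (illustrative values): y = x @ w_true + noise
x = rng.normal(size=(20, 3))
y = x @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)

# 3. Variational parameters theta = (mu, rho)
mu = np.zeros(3)
rho = np.full(3, np.log(np.expm1(1.0)))  # chosen so that softplus(rho) == 1

# 1. Sample epsilon ~ N(0, I)
epsilon = rng.normal(size=3)

# 2. w = mu + softplus(rho) * epsilon
sigma = softplus(rho)
w = mu + sigma * epsilon

# 4. f(w, theta) = log q(w|theta) - log P(w) - log P(D|w)
log_q = log_normal(w, mu, sigma)
log_prior = log_normal(w, 0.0, 1.0)
log_lik = log_normal(y, x @ w, 0.1)  # assumed observation noise std of 0.1
f = log_q - log_prior - log_lik
```

With this initialization the posterior equals the prior, so the first two terms cancel and `f` reduces to the negative log-likelihood. In a framework with automatic differentiation, the gradients of `f` with respect to `mu` and `rho` flow through `w`.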

The cost function separates into two terms: one corresponds to the standard MSE loss as before (a Gaussian likelihood with fixed variance), and the other is a KL divergence between the posterior $q$ and the prior.
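For a diagonal Gaussian posterior and a unit Gaussian prior, the KL term has a closed form, $\frac{1}{2}\sum_j (\sigma_j^2 + \mu_j^2 - 1 - 2\log\sigma_j)$. The sketch below checks it against a Monte Carlo estimate; the function name and the values of `mu` and `sigma` are illustrative assumptions.

```python
import numpy as np

def kl_diag_gaussian_to_standard_normal(mu, sigma):
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 0.0])      # illustrative posterior means
sigma = np.array([0.8, 1.5, 1.0])    # illustrative posterior standard deviations

kl_exact = kl_diag_gaussian_to_standard_normal(mu, sigma)

# Monte Carlo check: KL = E_q[log q(w) - log p(w)]
w = mu + sigma * rng.normal(size=(200_000, 3))
log_q = np.sum(-np.log(sigma) - 0.5 * ((w - mu) / sigma) ** 2, axis=1)
log_p = np.sum(-0.5 * w**2, axis=1)  # the 0.5 * log(2 * pi) constants cancel
kl_mc = np.mean(log_q - log_p)
```

The total loss is then the data term plus this KL term multiplied by a weight, which is what the layers in this section add to the MSE.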

### Tensorflow / Keras version

We will first consider a single layer that has a distribution over its weights. We can subclass the Keras `Layer` class to build it:

In [None]:
from tensorflow_probability.python.distributions import kullback_leibler
from tensorflow_probability import distributions as tfd

class LinearVariational(tf.keras.layers.Layer):
    def __init__(self,
               units,
               kl_weight=1.0,
               activation=None):
        super(LinearVariational, self).__init__()
        self.units = int(units)
        self.kl_weight = kl_weight
        self.activation = tf.keras.activations.get(activation)
    
    def _make_prior(self, n):
        return tfp.distributions.MultivariateNormalDiag(loc=tf.zeros(n), 
                                                 scale_diag=tf.ones(n))
        
    def _make_posterior(self, dim):
        # A bit of boilerplate to define Keras Layers with the parameters
        trainable_normal = tf.keras.models.Sequential([
            tfp.layers.VariableLayer(
              shape=[dim, 2], #first is for mu, second for rho
              dtype=tf.float32,  # same dtype as the float32 inputs and prior
              initializer=tfp.layers.BlockwiseInitializer([
                  'zeros', # initialize mu to 0
                  tf.keras.initializers.Constant(np.log(np.expm1(1.))), # initialize rho s.t. softplus(rho)=1
              ], sizes=[1, 1])),
            tfp.layers.DistributionLambda(lambda t: tfd.MultivariateNormalDiag(loc=t[..., 0], 
                                                                             scale_diag=tf.math.softplus(t[..., 1])))
            ])

        return trainable_normal
      
      
    def build(self, input_shape):
        input_shape = tf.TensorShape(input_shape)
        last_dim = tf.compat.dimension_value(input_shape[-1])
        self.posterior = self._make_posterior(last_dim * self.units)
        self.prior = self._make_prior(last_dim * self.units)
        self.built = True

    def call(self, inputs):
        q = self.posterior(0.) # input not used, but a keras model takes inputs
        p = self.prior
        
        kl = kullback_leibler.kl_divergence(q, p)
        self.add_loss(tf.reduce_sum(kl) * self.kl_weight)
        
        w = q.sample() # differentiable because MultivariateNormalDiag 
                       # implicitly uses the reparametrization trick

        # Otherwise, we could have coded the reparametrization trick
        # directly from mu and rho as follows:
        # epsilon = tf.random.normal(mu.shape, 0, 1, tf.float32)
        # w = mu + tf.math.softplus(rho) * epsilon

        w = tf.reshape(w, shape=[-1, self.units])
        outputs = tf.matmul(inputs, w)

        if self.activation is not None:
            outputs = self.activation(outputs)

        return outputs

In [None]:
def create_bnn_model(input_dim, hidden_dim, kl_weight=1.):
    inputs = layers.Input(shape=(input_dim,))
    features = layers.BatchNormalization()(inputs)
    
    # Create our stochastic layer which has a distribution over weights
    z = LinearVariational(hidden_dim, kl_weight=kl_weight)(features)

    # The output is deterministic: a single point estimate.
    output = layers.Dense(units=1)(z)
    model = keras.Model(inputs=inputs, outputs=output)
    return model


In [None]:
bnn_model_small = create_bnn_model(11, 32, kl_weight = 1. / train_size)
bnn_model_small.summary()

Note that there are now more trainable parameters, even though we don't have a bias for the hidden layer.
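The parameter counts can be worked out by hand. The sketch below assumes the layer sizes used above (11 inputs, 32 hidden units) and that `BatchNormalization` contributes one trainable scale and one trainable offset per input feature; the totals are plain arithmetic, not output copied from `summary()`.

```python
input_dim, hidden_dim = 11, 32

# Baseline: Dense(32) with bias, followed by Dense(1) with bias
baseline_hidden = input_dim * hidden_dim + hidden_dim
output_layer = hidden_dim + 1
baseline_total = baseline_hidden + output_layer

# Bayesian layer: one mu and one rho per weight, and no bias
variational_hidden = 2 * input_dim * hidden_dim
# Assumed: a trainable scale and offset per input feature in batch normalization
batch_norm_trainable = 2 * input_dim
bnn_total = batch_norm_trainable + variational_hidden + output_layer

print(f"baseline: {baseline_total}, bayesian: {bnn_total}")
```

Doubling the weights (a mean and a scale for each) outweighs the 32 bias terms that were dropped.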

For the KL weight, the authors of the original paper propose either $1/M$, with $M$ the number of minibatches per epoch, or $\frac{2^{M-i}}{2^M-1}$, where $i$ is the minibatch index. In practice, it was chosen here as a hyperparameter.
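Both weighting schemes from the paper spread the KL term over one epoch so that the weights sum to 1. A minimal sketch, with a hypothetical `M = 16` minibatches per epoch:

```python
M = 16  # hypothetical number of minibatches per epoch

# Option 1: uniform weight, the same for every minibatch
uniform_weights = [1.0 / M] * M

# Option 2: 2^(M - i) / (2^M - 1) for i = 1..M
# The first minibatches are dominated by the KL term, the later ones by the data.
decaying_weights = [2 ** (M - i) / (2 ** M - 1) for i in range(1, M + 1)]

print(sum(uniform_weights), sum(decaying_weights))
print(decaying_weights[0], decaying_weights[-1])
```

With the second scheme, the first minibatch alone carries about half of the total KL weight, and each following one carries half as much as the previous.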

In [None]:
num_epochs = 200
run_experiment(bnn_model_small, mse_loss, num_epochs, train_dataset, test_dataset)

In [None]:
def compute_predictions(model, iterations=100):
    predicted = []
    for _ in range(iterations):
        predicted.append(model(examples).numpy())
    predicted = np.concatenate(predicted, axis=1)
    return predicted

def display_predictions(predictions, targets):
    prediction_mean = np.mean(predictions, axis=1).tolist()
    prediction_min = np.min(predictions, axis=1).tolist()
    prediction_max = np.max(predictions, axis=1).tolist()
    prediction_range = (np.max(predictions, axis=1) - np.min(predictions, axis=1)).tolist()

    for idx in range(samples):
        print(
            f"Predictions mean: {round(prediction_mean[idx], 2)}, "
            f"min: {round(prediction_min[idx], 2)}, "
            f"max: {round(prediction_max[idx], 2)}, "
            f"range: {round(prediction_range[idx], 2)} - "
            f"Actual: {targets[idx]}"
        )

predictions = compute_predictions(bnn_model_small)
display_predictions(predictions, targets)
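Besides the min, max and range, the sampled predictions can be summarized with a standard deviation and percentile-based intervals. The sketch below is one possible way to do it; `summarize_predictions` is a hypothetical helper, and the array it is applied to is synthetic, not real model output.

```python
import numpy as np

def summarize_predictions(predictions, lower_pct=2.5, upper_pct=97.5):
    """predictions: array of shape (n_examples, n_samples)."""
    return {
        "mean": predictions.mean(axis=1),
        "std": predictions.std(axis=1),
        "lower": np.percentile(predictions, lower_pct, axis=1),
        "upper": np.percentile(predictions, upper_pct, axis=1),
    }

# Synthetic stand-in for the output of compute_predictions
rng = np.random.default_rng(0)
fake_predictions = 5.5 + 0.3 * rng.normal(size=(10, 100))

summary = summarize_predictions(fake_predictions)
for mean, lower, upper in zip(summary["mean"], summary["lower"], summary["upper"]):
    print(f"mean: {mean:.2f}, 95% interval: [{lower:.2f}, {upper:.2f}]")
```

Percentile intervals are less sensitive to a single extreme sample than the min/max range.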

### Pytorch version

In [None]:
class BnnTorch(nn.Module):
    def __init__(self, input_dim, hidden_dim, activation=None):
        super(BnnTorch, self).__init__()
        n = input_dim * hidden_dim
        self.mu = nn.Parameter(torch.zeros((n), dtype=torch.float32))
        self.rho  = nn.Parameter(torch.log(torch.expm1(torch.ones((n), dtype=torch.float32))))
        self.out_layer = nn.Linear(hidden_dim, 1)
        self.act = activation
        self.hidden_dim = hidden_dim
        self.prior = torch.distributions.Normal(loc=torch.zeros((n), device=device, dtype=torch.float32),
                                                scale=torch.ones((n), device=device, dtype=torch.float32))
        self.kl_func = torch.distributions.kl.kl_divergence
        self.batch_norm = torch.nn.BatchNorm1d(input_dim)

        
    def forward(self, inputs):
        inputs = self.batch_norm(inputs)
        q = torch.distributions.Normal(loc=self.mu, 
                                       scale=torch.log(1.+torch.exp(self.rho)))
        
        kl = torch.sum(self.kl_func(q, self.prior))
        # we use q.rsample() which uses the reparametrization trick instead of 
        # q.sample() which breaks the auto-differentiation path
        w = q.rsample() 
        w = w.reshape((-1, self.hidden_dim))
        h = inputs @ w
        if self.act is not None:
            h = self.act(h)
        output = self.out_layer(h)
        return output, kl

In [None]:
bnn_torch = BnnTorch(11, 32).to(device)
bnn_torch

In [None]:
kl_weight = 1. / train_size

def mse_kl_loss(model_outputs, y_true):
    yhat, kl = model_outputs
    yhat = torch.squeeze(yhat)
    mse = torch.nn.MSELoss()(yhat, y_true)
    return mse + kl * kl_weight

run_experiment_torch(bnn_torch, 
                     mse_kl_loss, 
                     200, torch_trainloader, torch_testloader)

In [None]:
def compute_predictions_torch(model, iterations=100):
    predicted = []
    model.eval()
    for _ in range(iterations):
        preds, _ = model(examples_torch)
        predicted.append(preds.detach().cpu().numpy())
    predicted = np.concatenate(predicted, axis=1)
    return predicted

In [None]:
predictions = compute_predictions_torch(bnn_torch)
display_predictions(predictions, targets_torch)

## 3) Full Bayesian neural network using existing modules

In order to build more complex networks, we can use the `tfp.layers.DenseVariational` layer instead of our custom layers. For this layer, the Tensorflow [help page](https://www.tensorflow.org/probability/api_docs/python/tfp/layers/DenseVariational) shows that we need to define two functions, `make_posterior_fn` and `make_prior_fn`, each of which returns a Keras model that outputs a `tfd.Distribution`.

We will also parametrize a more complex posterior distribution, using `tfp.distributions.MultivariateNormalTriL`, which has a full covariance matrix instead of a diagonal one.
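A full covariance is much more expensive than a diagonal one: a lower-triangular scale matrix over $n$ weights needs $n + n(n+1)/2$ parameters instead of $2n$. The sketch below counts them for an 11 to 8 layer with bias, and illustrates sampling $w = \mu + L\epsilon$ in NumPy; the helper names and the random matrix `L` are assumptions for illustration only.

```python
import numpy as np

def diag_params(n):
    # One mean and one scale per weight
    return 2 * n

def tril_params(n):
    # n means plus the n * (n + 1) / 2 entries of a lower-triangular scale matrix
    return n + n * (n + 1) // 2

# An 11 -> 8 dense layer, kernel plus bias
n = 11 * 8 + 8
print(n, diag_params(n), tril_params(n))

# Sampling with a full covariance: w = mu + L @ eps, so that Cov(w) = L @ L.T
rng = np.random.default_rng(0)
mu = np.zeros(3)
L = np.tril(rng.normal(size=(3, 3)))
samples = mu + rng.normal(size=(100_000, 3)) @ L.T
empirical_cov = np.cov(samples, rowvar=False)
```

The off-diagonal entries of `L @ L.T` are what let the posterior express correlations between weights, at a cost that grows quadratically with the layer size.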

In [None]:
# Define the prior weight distribution as Normal of mean=0 and stddev=1.
# Note that, in this example, the prior distribution is not trainable,
# as we fix its parameters.
def prior(kernel_size, bias_size, dtype=None):
    n = kernel_size + bias_size
    prior_model = keras.Sequential(
        [
            tfp.layers.DistributionLambda(
                lambda t: tfp.distributions.MultivariateNormalDiag(
                    loc=tf.zeros(n), scale_diag=tf.ones(n)
                )
            )
        ]
    )
    return prior_model


# Define variational posterior weight distribution as multivariate Gaussian.
# Note that the learnable parameters for this distribution are the means,
# variances, and covariances.
def posterior(kernel_size, bias_size, dtype=None):
    n = kernel_size + bias_size
    posterior_model = keras.Sequential(
        [
            tfp.layers.VariableLayer(
                tfp.layers.MultivariateNormalTriL.params_size(n), dtype=dtype
            ),
            tfp.layers.MultivariateNormalTriL(n),
        ]
    )
    return posterior_model


In [None]:
def create_full_bnn_model(input_dim, hidden_units, kl_weight):
    inputs = layers.Input(shape=(input_dim,))
    features = layers.BatchNormalization()(inputs)

    # Create hidden layers with weight uncertainty using the DenseVariational layer.
    for units in hidden_units:
        features = tfp.layers.DenseVariational(
            units=units,
            make_prior_fn=prior,
            make_posterior_fn=posterior,
            kl_weight=kl_weight,
            activation="sigmoid",
        )(features)

    outputs = tfp.layers.DenseVariational(
            units=1,
            make_prior_fn=prior,
            make_posterior_fn=posterior,
            kl_weight=kl_weight,
    )(features)

    model = keras.Model(inputs=inputs, outputs=outputs)
    return model



In practice, [flipout](https://www.tensorflow.org/probability/api_docs/python/tfp/layers/DenseFlipout) layers (`tfp.layers.DenseFlipout`) and [local reparametrization](https://www.tensorflow.org/probability/api_docs/python/tfp/layers/DenseLocalReparameterization) layers (`tfp.layers.DenseLocalReparameterization`) are also used. You may also find convolutional versions of these, or RNN versions.

In [None]:
num_epochs = 500
bnn_model_full = create_full_bnn_model(11, [8, 8], 1/train_size)
run_experiment(bnn_model_full, mse_loss, num_epochs, train_dataset, test_dataset)

In [None]:
predictions = compute_predictions(bnn_model_full)
display_predictions(predictions, targets)

## 4) Bayesian Neural Network parametrizing a distribution

So far, we have distributions over weights, but single point estimates for each input (once we sample these weights).

We can also model an output distribution, rather than a single point estimate. We will change the last layer so that it parametrizes a Gaussian as the output. This captures *aleatoric uncertainty*: the irreducible noise in the data, which reflects the stochastic nature of the process generating it.

We then use the negative loglikelihood of this distribution as our loss function, instead of the MSE.
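To see how this loss relates to the MSE, here is a NumPy sketch of the Gaussian negative log-likelihood. The function name and the target and prediction values are illustrative assumptions.

```python
import numpy as np

def gaussian_nll(y, mean, std):
    # -log N(y | mean, std^2), element-wise
    return 0.5 * np.log(2 * np.pi) + np.log(std) + 0.5 * ((y - mean) / std) ** 2

y = np.array([5.0, 6.0, 7.0])      # illustrative targets
mean = np.array([5.5, 6.0, 5.0])   # illustrative predicted means

# With a fixed std of 1, the NLL is half the squared error plus a constant
nll_fixed = gaussian_nll(y, mean, 1.0)
half_sq_err = 0.5 * (y - mean) ** 2 + 0.5 * np.log(2 * np.pi)

# On the badly predicted third point, a larger predicted std lowers the loss
nll_confident = gaussian_nll(y[2], mean[2], 0.5)
nll_uncertain = gaussian_nll(y[2], mean[2], 2.0)
print(nll_confident, nll_uncertain)
```

When the standard deviation is learned, the model can lower the loss on hard examples by predicting a wider distribution, while the $\log \sigma$ term penalizes being needlessly uncertain everywhere.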

In [None]:
def create_probabilistic_bnn_model(input_dim, hidden_units, kl_weight):
    inputs = layers.Input(shape=(input_dim,))
    features = layers.BatchNormalization()(inputs)

    # Create hidden layers with weight uncertainty using the DenseVariational layer.
    for units in hidden_units:
        features = tfp.layers.DenseVariational(
            units=units,
            make_prior_fn=prior,
            make_posterior_fn=posterior,
            kl_weight=kl_weight,
            activation="sigmoid",
        )(features)

    distribution_params = layers.Dense(units=2)(features)
    outputs = tfp.layers.IndependentNormal(1)(distribution_params)

    model = keras.Model(inputs=inputs, outputs=outputs)
    return model


In [None]:
def negative_loglikelihood(targets, estimated_distribution):
    return -estimated_distribution.log_prob(targets)

num_epochs = 500
bnn_model_probabilistic = create_probabilistic_bnn_model(11, [8, 8], 1/train_size)
run_experiment(bnn_model_probabilistic, negative_loglikelihood, num_epochs, train_dataset, test_dataset)

In [None]:
prediction_distribution = bnn_model_probabilistic(examples)
prediction_mean = prediction_distribution.mean().numpy().tolist()
prediction_stdv = prediction_distribution.stddev().numpy()

# The 95% CI is computed as mean ± (1.96 * stdv)
upper = (prediction_mean + (1.96 * prediction_stdv)).tolist()
lower = (prediction_mean - (1.96 * prediction_stdv)).tolist()
prediction_stdv = prediction_stdv.tolist()

for idx in range(samples):
    print(
        f"Prediction mean: {round(prediction_mean[idx][0], 2)}, "
        f"stddev: {round(prediction_stdv[idx][0], 2)}, "
        f"95% CI: [{round(lower[idx][0], 2)}, {round(upper[idx][0], 2)}]"
        f" - Actual: {targets[idx]}"
    )

## 5) MC Dropout

We will now consider a simpler approach, as described in the [Monte Carlo Dropout](https://arxiv.org/abs/1506.02142) paper. The implementation is extremely simple for people already familiar with dropout: we just need to keep dropout active at test time by forcing `training=True`.

In [None]:
def create_mcdropout_model(input_dim, hidden_dim, p):
    inputs = layers.Input(shape=(input_dim,))

    # Create a hidden layer with deterministic weights using the Dense layer.
    features = layers.Dense(hidden_dim, activation="sigmoid")(inputs)
    features = tf.keras.layers.Dropout(p)(features, training=True)
    # The output is deterministic: a single point estimate.
    outputs = layers.Dense(units=1)(features)

    model = keras.Model(inputs=inputs, outputs=outputs)
    return model

In [None]:
num_epochs = 500
mse_loss = keras.losses.MeanSquaredError()
mc_model = create_mcdropout_model(11, 32, 0.5)
run_experiment(mc_model, mse_loss, num_epochs, train_dataset, test_dataset)

In [None]:
predictions = compute_predictions(mc_model)
display_predictions(predictions, targets)
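What happens at prediction time can be sketched without any framework. The NumPy example below uses random, untrained weights and a hypothetical `forward_with_dropout` helper; it only illustrates that keeping the dropout mask random at test time turns repeated forward passes into samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_with_dropout(x, w1, b1, w2, b2, p, rng):
    h = sigmoid(x @ w1 + b1)
    # Inverted dropout: zero each unit with probability p, rescale the others
    mask = rng.random(h.shape) >= p
    h = h * mask / (1.0 - p)
    return h @ w2 + b2

# Random, untrained weights, for illustration only
w1, b1 = rng.normal(size=(11, 32)), np.zeros(32)
w2, b2 = rng.normal(size=(32, 1)), np.zeros(1)
x = rng.normal(size=(5, 11))

# Each pass draws a new mask, so repeated passes give different outputs
passes = np.concatenate(
    [forward_with_dropout(x, w1, b1, w2, b2, 0.5, rng) for _ in range(100)], axis=1
)
mc_mean, mc_std = passes.mean(axis=1), passes.std(axis=1)
```

The spread of the passes (`mc_std`) plays the same role as the spread of predictions obtained by sampling weights in the Bayesian layers.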

## 6) Convolutional model

This bonus section showcases a larger Bayesian neural network with convolutional layers. It is taken from a [github repository](https://github.com/kumar-shridhar/PyTorch-BayesianCNN) and is trained on CIFAR10.

All weights, including the convolutional kernels, are sampled from a weight distribution:
![](https://github.com/kumar-shridhar/PyTorch-BayesianCNN/raw/master/experiments/figures/CNNwithdist_git.png)

In [None]:
!git clone https://github.com/kumar-shridhar/PyTorch-BayesianCNN.git

In [None]:
%cd PyTorch-BayesianCNN

In [None]:
import os
import argparse

import torch
import numpy as np
from torch.optim import Adam, lr_scheduler
from torch.nn import functional as F

import data
import utils
import metrics
import config_bayesian as cfg
from models.BayesianModels.Bayesian3Conv3FC import BBB3Conv3FC

# CUDA settings
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [None]:
trainset, testset, inputs, outputs = data.getDataset("CIFAR10")
train_loader, valid_loader, test_loader = data.getDataloader(
        trainset, testset, cfg.valid_size, 32, 2)

We can download a pretrained model (it could be trained for longer!). Warning: it might only work on GPU.

In [None]:
!wget https://github.com/charlesollion/dlexperiments/blob/master/6-Bayesian-DL/model_3conv3fc_lrt_softplus.pt?raw=true -O model_3conv3fc_lrt_softplus.pt

In [None]:
weights_path = "model_3conv3fc_lrt_softplus.pt"

In [None]:
state_dict = torch.load(weights_path, map_location=device)
net = BBB3Conv3FC(outputs, inputs, cfg.priors, cfg.layer_type, cfg.activation_type).to(device)
net.load_state_dict(state_dict)

In [None]:
from tqdm import tqdm 

def validate_model(net, validloader, num_ens=1):
    """Calculate ensemble accuracy"""
    net.eval()
    accs = []

    for i, (inputs, labels) in tqdm(enumerate(validloader)):
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = torch.zeros(inputs.shape[0], net.num_classes, num_ens).to(device)
        for j in range(num_ens):
            net_out, _ = net(inputs)
            outputs[:, :, j] = F.log_softmax(net_out, dim=1).data

        log_outputs = utils.logmeanexp(outputs, dim=2)
        accs.append(metrics.acc(log_outputs, labels))

    return np.mean(accs)

validate_model(net, test_loader, num_ens=10)
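The ensemble step averages the class probabilities of the sampled networks, but does so in log space. The sketch below is a NumPy stand-in for that idea, not the repository's `utils.logmeanexp`; the function name and the probabilities are illustrative, and the ensemble axis is placed first here rather than last.

```python
import numpy as np

def log_mean_exp(log_probs, axis):
    # log(mean(exp(log_probs))), computed stably by factoring out the max
    m = np.max(log_probs, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.mean(np.exp(log_probs - m), axis=axis))

# Two ensemble members, one example, three classes (illustrative probabilities)
probs = np.array([[[0.7, 0.2, 0.1]],
                  [[0.5, 0.4, 0.1]]])          # shape (num_ens, batch, classes)

ensemble_log_probs = log_mean_exp(np.log(probs), axis=0)
ensemble_probs = np.exp(ensemble_log_probs)    # same as averaging probs directly
print(ensemble_probs)
```

Subtracting the maximum before exponentiating avoids underflow when the log-probabilities are very negative.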

In [None]:
from uncertainty_estimation import init_dataset, get_uncertainty_per_image

In [None]:
import matplotlib.pyplot as plt
bX = next(iter(test_loader))
img = bX[0][2].to(device) # Choose the one you want


def softmax(X): 
    exp = np.exp(X)
    return exp / np.sum(exp)

pred, epistemic, aleatoric = get_uncertainty_per_image(net, img)
cls = pred.argmax()
proba = softmax(pred)
print(f"{cls} : ({proba[cls]:.2})")
plt.plot(proba);
plt.plot(epistemic);
plt.plot(aleatoric);
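One common variance-based way to split the predictive uncertainty per class is sketched below. Whether `get_uncertainty_per_image` computes exactly this is an assumption; the function name `decompose_uncertainty` and the synthetic logits are for illustration only.

```python
import numpy as np

def decompose_uncertainty(probs):
    """probs: array of shape (n_samples, n_classes), one softmax vector per forward pass."""
    p_bar = probs.mean(axis=0)
    # Aleatoric: average per-pass variance of the categorical output, p * (1 - p)
    aleatoric = np.mean(probs * (1.0 - probs), axis=0)
    # Epistemic: variance of the predicted probabilities across passes
    epistemic = np.mean((probs - p_bar) ** 2, axis=0)
    return p_bar, epistemic, aleatoric

rng = np.random.default_rng(0)
logits = rng.normal(size=(50, 10))   # synthetic logits, not real model output
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

p_bar, epistemic, aleatoric = decompose_uncertainty(probs)
```

With this definition the two parts add up to `p_bar * (1 - p_bar)`, the variance of a prediction drawn from the averaged distribution: the epistemic part shrinks when the sampled networks agree, while the aleatoric part stays large when each network is itself unsure.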

In [None]:
plt.imshow(img.cpu().numpy().transpose(1,2,0));

If you want to train this model yourself, you may run the following command:

In [None]:
!python main_bayesian.py --net_type 3conv3fc --dataset CIFAR10