# Bayesian Deep Learning (BDL)
[From this blog](https://towardsdatascience.com/building-a-bayesian-deep-learning-classifier-ece1845bc09)

## What is Bayesian deep learning?
Bayesian statistics is a theory in the field of statistics in which the evidence about the true state of the world is expressed in terms of degrees of belief. **The combination of Bayesian statistics and deep learning in practice means including uncertainty in your deep learning model predictions.**

## What is uncertainty?
Uncertainty is the state of having limited knowledge where it is impossible to exactly describe the existing state, a future outcome, or more than one possible outcome. As it pertains to deep learning and classification, uncertainty also includes ambiguity: uncertainty about human definitions and concepts rather than about an objective fact of nature.

### Types of uncertainty
* **Epistemic uncertainty** captures our ignorance about which model generated our collected data. This uncertainty can be explained away given enough data, and is often referred to as **model uncertainty**. Epistemic uncertainty is really important to model for:
    * Safety-critical applications, because epistemic uncertainty is required to understand examples which are different from training data,
    * Small datasets where the training data is sparse.
    
* **Aleatoric uncertainty** captures our uncertainty with respect to information which our data cannot explain. For example, aleatoric uncertainty in images can be attributed to occlusions (because cameras can’t see through objects). It can be explained away with the ability to observe all explanatory variables with increasing precision. Aleatoric uncertainty is very important to model for:
    * Large data situations, where epistemic uncertainty is mostly explained away,
    * Real-time applications, because we can form aleatoric models as a deterministic function of the input data, without expensive Monte Carlo sampling.
    * We can actually divide aleatoric into two further sub-categories:
        * **Data-dependent or Heteroscedastic** uncertainty is aleatoric uncertainty which depends on the input data and is predicted as a model output.
        * **Task-dependent or Homoscedastic** uncertainty is aleatoric uncertainty which does not depend on the input data. It is not a model output; rather, it is a quantity which stays constant for all input data and varies between different tasks.
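
The difference between the two sub-categories can be illustrated with synthetic noise. The sketch below is only an illustration: the noise levels and the split at 0.5 are arbitrary assumptions, not values from the original post. It draws one noise sample with a constant standard deviation and one whose standard deviation grows with the input, then compares the empirical spread in the lower and upper halves of the input range.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=10_000)

# Homoscedastic: one noise level for every input (assumed std of 0.1).
homo_noise = rng.normal(0.0, 0.1, size=x.shape)
# Heteroscedastic: the noise standard deviation grows with the input.
hetero_noise = rng.normal(0.0, 0.02 + 0.3 * x)

def noise_std_by_half(noise):
    # Empirical noise std for inputs below and above 0.5.
    return noise[x < 0.5].std(), noise[x >= 0.5].std()

homo_low, homo_high = noise_std_by_half(homo_noise)
hetero_low, hetero_high = noise_std_by_half(hetero_noise)
print(f"homoscedastic:   low={homo_low:.3f} high={homo_high:.3f}")
print(f"heteroscedastic: low={hetero_low:.3f} high={hetero_high:.3f}")
```

The homoscedastic spread is roughly the same in both halves, while the heteroscedastic spread is clearly larger for larger inputs. That input dependence is what a model output has to capture.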
        
### Why is uncertainty important?
In machine learning, we are trying to create approximate representations of the real world. Popular deep learning models created today produce a point estimate but not an uncertainty value. Understanding if your model is under-confident or falsely over-confident can help you reason about your model and your dataset.

**Note: In a classification problem, the softmax output gives you a probability value for each class, but this is not the same as uncertainty. The softmax probability is the probability that an input is a given class relative to the other classes. Because the probability is relative to the other classes, it does not help explain the model’s overall confidence.**
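
A small NumPy sketch can make this point concrete. The logit values are made up for illustration. Softmax depends only on the differences between logits, so two inputs whose raw scores differ a lot can still receive identical class probabilities.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; this does not change the result.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Hypothetical logits for two different inputs.
logits_a = np.array([10.0, 8.0, 2.0])
logits_b = logits_a - 9.0   # same gaps between classes, every logit shifted down

p_a = softmax(logits_a)
p_b = softmax(logits_b)
print(p_a)
print(p_b)
```

Both inputs get the same probabilities, because the output only says how the classes compare with each other.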

### Why is Aleatoric uncertainty important?
**Aleatoric uncertainty is important in cases where parts of the observation space have higher noise levels than others.** 
* For example, aleatoric uncertainty played a role in the first fatality involving a self-driving car. Tesla has said that during this incident, the car’s autopilot failed to recognize the white truck against a bright sky. An image segmentation classifier that is able to predict aleatoric uncertainty would recognize that this particular area of the image was difficult to interpret and predict a high uncertainty. In the case of the Tesla incident, although the car’s radar could “see” the truck, the radar data was inconsistent with the image classifier data and the car’s path planner ultimately ignored the radar data (radar data is known to be noisy). If the image classifier had included a high uncertainty with its prediction, the path planner would have known to ignore the image classifier prediction and use the radar data instead.

### Why is Epistemic uncertainty important?
**Epistemic uncertainty is important because it identifies situations the model was never trained to understand because the situations were not in the training data.**
* Machine learning engineers hope that their models generalize well to situations that are different from the training data; however, in safety-critical applications of deep learning, hope is not enough. High epistemic uncertainty is a red flag that a model is much more likely to make inaccurate predictions, and when this occurs in safety-critical applications, the model should not be trusted.

* Epistemic uncertainty is also helpful for exploring your dataset. In one case, researchers trained a neural network to recognize tanks hidden in trees versus trees without tanks. After training, the network performed incredibly well on the training set and the test set. The only problem was that all of the images of the tanks were taken on cloudy days and all of the images without tanks were taken on sunny days. The classifier had actually learned to identify sunny versus cloudy days. Whoops.

### Calculating aleatoric uncertainty
**Aleatoric uncertainty is a function of the input data.** Therefore, a deep learning model can learn to predict aleatoric uncertainty by using a modified loss function. For a classification task, instead of only predicting the softmax values, the Bayesian deep learning model will have two outputs, the softmax values and the input variance. Teaching the model to predict aleatoric variance is an example of unsupervised learning because the model doesn’t have variance labels to learn from. Below is the standard categorical cross entropy loss function and a function to calculate the Bayesian categorical cross entropy loss.

In [1]:
import numpy as np
from keras import backend as K
from tensorflow.contrib import distributions

# standard categorical cross entropy
# N data points, C classes
# true - true values. Shape: (N, C)
# pred - predicted values. Shape: (N, C)
# returns - loss (N,)
def categorical_cross_entropy(true, pred):
    # cross entropy is the negative log-likelihood of the true class
    return -np.sum(true * np.log(pred), axis=1)

# Bayesian categorical cross entropy.
# N data points, C classes, T monte carlo simulations
# true - true values. Shape: (N, C)
# pred_var - predicted logit values and variance. Shape: (N, C + 1)
# returns - loss (N,)
def bayesian_categorical_crossentropy(T, num_classes):
  def bayesian_categorical_crossentropy_internal(true, pred_var):
    # shape: (N, 1)
    std = K.sqrt(pred_var[:, num_classes:])
    # shape: (N,)
    variance = pred_var[:, num_classes]
    variance_depressor = K.exp(variance) - K.ones_like(variance)
    # shape: (N, C)
    pred = pred_var[:, 0:num_classes]
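    # shape: (N,)
    # NOTE: the (output, target) argument order matches older Keras backends;
    # Keras 2.0.7 and later expect (target, output), so swap the arguments there.
    undistorted_loss = K.categorical_crossentropy(pred, true, from_logits=True)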
    # shape: (T,)
    iterable = K.variable(np.ones(T))
    dist = distributions.Normal(loc=K.zeros_like(std), scale=std)
    monte_carlo_results = K.map_fn(gaussian_categorical_crossentropy(true, pred, dist, undistorted_loss, num_classes), iterable, name='monte_carlo_results')
    
    variance_loss = K.mean(monte_carlo_results, axis=0) * undistorted_loss
    
    return variance_loss + undistorted_loss + variance_depressor
  
  return bayesian_categorical_crossentropy_internal

# for a single monte carlo simulation, 
#   calculate categorical_crossentropy of 
#   predicted logit values plus gaussian 
#   noise vs true values.
# true - true values. Shape: (N, C)
# pred - predicted logit values. Shape: (N, C)
# dist - normal distribution to sample from. Shape: (N, C)
# undistorted_loss - the crossentropy loss without variance distortion. Shape: (N,)
# num_classes - the number of classes. C
# returns - total differences for all classes (N,)
def gaussian_categorical_crossentropy(true, pred, dist, undistorted_loss, num_classes):
  def map_fn(i):
    std_samples = K.transpose(dist.sample(num_classes))
    distorted_loss = K.categorical_crossentropy(pred + std_samples, true, from_logits=True)
    diff = undistorted_loss - distorted_loss
    return -K.elu(diff)
  return map_fn
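
The Monte Carlo term of this loss can be explored in plain NumPy, independent of Keras. The sketch below is a simplified illustration and not the full loss above. It only reproduces the per-sample quantity `-elu(undistorted_loss - distorted_loss)` and averages it over T samples. The helper names, the example logits and the noise levels are assumptions made for the demonstration.

```python
import numpy as np

def softmax_cross_entropy(true, logits):
    # Numerically stable cross entropy computed from logits. Shape: (..., C) -> (...)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -(true * log_probs).sum(axis=-1)

def elu(x):
    # Exponential linear unit with alpha = 1.
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def mean_distortion_term(true, logits, std, T, rng):
    undistorted = softmax_cross_entropy(true, logits)                    # (N,)
    # Independent Gaussian noise per class, scaled by each example's std.
    noise = rng.normal(size=(T,) + logits.shape) * std[None, :, None]    # (T, N, C)
    distorted = softmax_cross_entropy(true[None], logits[None] + noise)  # (T, N)
    # Average of -elu(undistorted - distorted) over the T samples.
    return (-elu(undistorted[None] - distorted)).mean(axis=0)            # (N,)

rng = np.random.default_rng(0)
true = np.array([[1.0, 0.0, 0.0],
                 [1.0, 0.0, 0.0]])
logits = np.array([[4.0, 0.0, 0.0],    # confident and correct
                   [0.0, 4.0, 0.0]])   # confident and wrong

results = {}
for sigma in (0.0, 1.0, 3.0):
    std = np.full(len(logits), sigma)
    results[sigma] = mean_distortion_term(true, logits, std, T=2000, rng=rng)
    print(sigma, results[sigma])
```

With zero noise the term vanishes. With a large standard deviation it is positive for the correct prediction and negative for the wrong one, so the loss pushes predicted variance down where the model is right and up where it is wrong.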



### Calculating epistemic uncertainty
One way of modeling epistemic uncertainty is using Monte Carlo dropout sampling (a type of variational inference) at test time. For a full explanation of why dropout can model uncertainty check out this blog and this white paper. In practice, Monte Carlo dropout sampling means including dropout in your model and running your model multiple times with dropout turned on at test time to create a distribution of outcomes. You can then calculate the predictive entropy (the average amount of information contained in the predictive distribution).

In [2]:
# model - the trained classifier (C classes)
#   where the last layer applies softmax
# X_data - a list of input data (size N)
# T - the number of monte carlo simulations to run
# NOTE: dropout must be active during model.predict (see the learning phase
#   setting below); otherwise all T predictions are identical.
def montecarlo_prediction(model, X_data, T):
    # shape: (T, N, C)
    predictions = np.array([model.predict(X_data) for _ in range(T)])

    # shape: (N, C)
    prediction_probabilities = np.mean(predictions, axis=0)

    # shape: (N,)
    # despite the name, this is the predictive entropy, not a variance
    prediction_variances = predictive_entropy(prediction_probabilities)
    return (prediction_probabilities, prediction_variances)

# prob - prediction probability for each class (C). Shape: (N, C)
# returns - Shape: (N,)
def predictive_entropy(prob):
    # clip to avoid log(0) when a class probability is exactly zero
    prob = np.clip(prob, 1e-12, 1.0)
    return -1 * np.sum(np.log(prob) * prob, axis=1)
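
To try these two functions without a trained network, the sketch below swaps in a hypothetical stub whose `predict` method returns a randomly perturbed softmax. The stub stands in for a Keras model with dropout active; the class name, logits and noise scales are assumptions for illustration only. The two helper functions are repeated so that the example runs on its own.

```python
import numpy as np

class NoisyStubModel:
    """Hypothetical stand-in for a model whose forward pass is stochastic."""
    def __init__(self, base_logits, noise_scale, seed=0):
        self.base_logits = np.asarray(base_logits, dtype=float)
        self.noise_scale = np.asarray(noise_scale, dtype=float)
        self.rng = np.random.default_rng(seed)

    def predict(self, X_data):
        # Ignores X_data; perturbs fixed logits to mimic one stochastic pass.
        noise = self.rng.normal(scale=self.noise_scale, size=self.base_logits.shape)
        logits = self.base_logits + noise
        exp = np.exp(logits - logits.max(axis=1, keepdims=True))
        return exp / exp.sum(axis=1, keepdims=True)

def predictive_entropy(prob):
    prob = np.clip(prob, 1e-12, 1.0)
    return -np.sum(prob * np.log(prob), axis=1)

def montecarlo_prediction(model, X_data, T):
    predictions = np.array([model.predict(X_data) for _ in range(T)])  # (T, N, C)
    mean_prob = predictions.mean(axis=0)                                # (N, C)
    return mean_prob, predictive_entropy(mean_prob)

# Two examples with the same logits: the first is stable across passes,
# the second fluctuates strongly from pass to pass.
stub = NoisyStubModel(base_logits=[[3.0, 0.0, 0.0], [3.0, 0.0, 0.0]],
                      noise_scale=[[0.1], [3.0]])
mean_prob, entropy = montecarlo_prediction(stub, X_data=[None, None], T=200)
print(mean_prob)
print(entropy)
```

The example whose predictions vary more between passes ends up with a flatter averaged distribution and therefore a higher predictive entropy.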

In [3]:
from keras.models import Model
from keras.layers import Input, RepeatVector
from keras.engine.topology import Layer
from keras.layers.wrappers import TimeDistributed

# Take a mean of the results of a TimeDistributed layer.
# Applying TimeDistributedMean()(TimeDistributed(T)(x)) to an
# input of shape (None, ...) returns output of same size.
class TimeDistributedMean(Layer):
    def build(self, input_shape):
        super(TimeDistributedMean, self).build(input_shape)
        
    # input shape (None, T, ...)
    # output shape (None, ...)
    def compute_output_shape(self, input_shape):
        return (input_shape[0],) + input_shape[2:]
    
    def call(self, x):
        return K.mean(x, axis=1)


# Apply the predictive entropy function for input with C classes. 
# Input of shape (None, C, ...) returns output with shape (None, ...)
# Input should be predictive means for the C classes.
# In the case of a single classification, output will be (None,).

class PredictiveEntropy(Layer):
    def build(self, input_shape):
        super(PredictiveEntropy, self).build(input_shape)
        
    # input shape (None, C, ...)
    # output shape (None, ...)
    def compute_output_shape(self, input_shape):
        return (input_shape[0],)
    
    # x - prediction probability for each class(C)
    def call(self, x):
        return -1 * K.sum(K.log(x) * x, axis=1)
    
def create_epistemic_uncertainty_model(checkpoint, epistemic_monte_carlo_simulations):
    model = load_saved_model(checkpoint)
    inpt = Input(shape=(model.input_shape[1:]))
    x = RepeatVector(epistemic_monte_carlo_simulations)(inpt)
    # Keras TimeDistributed can only handle a single output from a model :(
    # and we technically only need the softmax outputs.
    hacked_model = Model(inputs=model.inputs, outputs=model.outputs[1])
    x = TimeDistributed(hacked_model, name='epistemic_monte_carlo')(x)
    # predictive probabilities for each class
    softmax_mean = TimeDistributedMean(name='epistemic_softmax_mean')(x)
    variance = PredictiveEntropy(name='epistemic_variance')(softmax_mean)
    epistemic_model = Model(inputs=inpt, outputs=[variance, softmax_mean])
    
    return epistemic_model

# 1. Load the model
# 2. compile the model
# 3. Set learning phase to train
# 4. predict
def predict():
    model = create_epistemic_uncertainty_model('model.ckpt', 100)
    model.compile(...)
    
    # set learning phase to 1 so that Dropout is on. In keras master you can set this
    # on the TimeDistributed layer
    K.set_learning_phase(1)
    
    epistemic_predictions = model.predict(data)

***Note: Epistemic uncertainty is not used to train the model. It is only calculated at test time, with the learning phase set to training so that dropout stays active, when evaluating test/real world examples. This is different from aleatoric uncertainty, which is predicted as part of the training process. Also, in my experience, it is easier to produce reasonable epistemic uncertainty predictions than aleatoric uncertainty predictions.***

## Training a Bayesian deep learning classifier
Beyond the loss and uncertainty code above, training a Bayesian deep learning classifier requires little more than the code typically used to train a classifier.

In [4]:
from keras.applications.resnet50 import ResNet50
from keras.layers import Flatten

def resnet50(input_shape):
    input_tensor = Input(shape=input_shape)
    base_model = ResNet50(include_top=False, input_tensor=input_tensor)
    # freeze encoder layers to prevent over fitting
    for layer in base_model.layers:
        layer.trainable = False
    
    output_tensor = Flatten()(base_model.output)
    return Model(inputs=input_tensor, outputs=output_tensor)

For this experiment, I used the frozen convolutional layers of ResNet50, with ImageNet weights, to encode the images.

In [5]:
from keras.layers import Activation, BatchNormalization, Dense, Dropout, concatenate

def create_bayesian_model(encoder, input_shape, output_classes):
    encoder_model = resnet50(input_shape)
    input_tensor = Input(shape=encoder_model.output_shape[1:])
    x = BatchNormalization(name='post_encoder')(input_tensor)
    x = Dropout(0.5)(x)
    x = Dense(500, activation='relu')(x)
    x = BatchNormalization()(x)
    x = Dropout(0.5)(x)
    x = Dense(100, activation='relu')(x)
    x = BatchNormalization()(x)
    x = Dropout(0.5)(x)
    
    logits = Dense(output_classes)(x)
    variance_pre = Dense(1)(x)
    variance = Activation('softplus', name='variance')(variance_pre)
    logits_variance = concatenate([logits, variance], name='logits_variance')
    softmax_output = Activation('softmax', name='softmax_output')(logits)
    model = Model(inputs=input_tensor, outputs=[logits_variance,softmax_output])
    
    return model

def encoder_min_input_size(encoder):
    if encoder == 'resnet50':
        return (197, 197)
    else:
        raise ValueError('Unexpected encoder model ' + encoder + ".")

The trainable part of my model sits on top of the ResNet50 output: two Dense layers with relu activations, each preceded by BatchNormalization and Dropout, followed by one more BatchNormalization and Dropout. The logits and variance are calculated using separate Dense layers. Note that the variance layer applies a softplus activation function to ensure the model always predicts variance values greater than zero. The logit and variance layers are then recombined for the aleatoric loss function and the softmax is calculated using just the logit layer.
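
As a quick check of the softplus claim, here is a minimal NumPy sketch. The input values are arbitrary. Softplus is `log(1 + exp(x))`, which is strictly positive for every real input and close to `x` itself for large positive inputs.

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)), written with logaddexp to stay stable for large |x|.
    return np.logaddexp(0.0, x)

# Arbitrary raw outputs of a variance head, including strongly negative ones.
raw = np.array([-20.0, -1.0, 0.0, 1.0, 20.0])
variance = softplus(raw)
print(variance)
```

Even a strongly negative raw output maps to a small positive variance, so taking the square root in the loss stays well defined.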

In [8]:
#!/bin/python 

import os
import sys

#project_path, x = os.path.split(os.path.dirname(os.path.realpath(__file__)))
#sys.path.append(project_path)

import tensorflow as tf
from keras.optimizers import Adam
from keras.callbacks import ModelCheckpoint, EarlyStopping, CSVLogger
from keras import metrics
import numpy as np

from bnn.model import create_bayesian_model, encoder_min_input_size
from bnn.loss_equations import bayesian_categorical_crossentropy
from bnn.util import isAWS, upload_s3, stop_instance, BayesianConfig
from bnn.data import test_train_batch_data

#flags = tf.app.flags
#FLAGS = flags.FLAGS

dataset = 'cifar10'
encoder = 'resnet50'
epochs = 1
monte_carlo_simulations = 100
batch_size = 32
debug = False
verbose = 0
stop = True
min_delta = 0.005
patience = 20


config = BayesianConfig(encoder, dataset, batch_size, epochs, monte_carlo_simulations)
config.info()
min_image_size = encoder_min_input_size(encoder)
    
((x_train, y_train), (x_test, y_test)) = test_train_batch_data(dataset, encoder, debug, augment_data=True)
   
min_image_size = list(min_image_size)
min_image_size.append(3)
num_classes = y_train.shape[-1]
    
model = create_bayesian_model(encoder, min_image_size, num_classes)
    
if debug:
    print(model.summary())
    callbacks = None
else:
    callbacks = [ModelCheckpoint(config.model_file(), verbose=verbose, save_best_only=True),
                 CSVLogger(config.csv_log_file()),
                 EarlyStopping(monitor='val_logits_variance_loss', min_delta=min_delta, patience=patience, verbose=1)]


In [None]:
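# Compile before training. The optimizer settings and the metric are
# assumptions; the loss mapping follows the two named model outputs.
model.compile(
    optimizer=Adam(),
    loss={'logits_variance': bayesian_categorical_crossentropy(monte_carlo_simulations, num_classes),
          'softmax_output': 'categorical_crossentropy'},
    metrics={'softmax_output': metrics.categorical_accuracy})

print("Starting model train process.")
# The FLAGS object is commented out above, so use the module-level settings.
model.fit(x_train,
          {'logits_variance': y_train, 'softmax_output': y_train},
          callbacks=callbacks,
          verbose=verbose,
          epochs=epochs,
          batch_size=batch_size,
          validation_data=(x_test, {'logits_variance': y_test, 'softmax_output': y_test}))

print("Finished training model.")
if isAWS() and not debug:
    upload_s3(config.model_file())
    upload_s3(config.csv_log_file())
    if stop:
        stop_instance()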