# Bayesian Transfer Learning for Deep Networks

This project is concerned with **Bayesian deep learning**. Specifically, we ask whether a deep Bayesian model improves transfer learning. Our hypothesis is that training a model on task **A** and then using the learned weights as the basis for learning task **B** will perform better than training on **B** from scratch, assuming the two domains are similar.

![Transfer Learning](https://image.slidesharecdn.com/13aibigdata-160606103446/95/aibigdata-lab-2016-transfer-learning-7-638.jpg?cb=1465209397)

We use Bayes by Backprop, introduced by [Blundell et al., 2015](https://arxiv.org/abs/1505.05424), to learn a probability distribution over each of the weights in the network. These weight distributions are fitted using variational inference given some prior.
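
To make the idea of "a distribution over each weight" concrete, the following is a minimal sketch of the reparameterised sampling step used in Bayes by Backprop. It assumes a factorised Gaussian posterior with mean `mu` and standard deviation `softplus(rho)`, as in the paper; the `BBBMLP` class used in this notebook may parameterise the variance differently (its `p_logvar_init` argument suggests log-variances). The names `softplus`, `sample_weights`, `mu` and `rho` are illustrative and not part of the library.

```python
import math
import random


def softplus(x):
    """Numerically stable log(1 + exp(x)); maps any real number to a positive std."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sample_weights(mu, rho, rng):
    """Draw one weight vector w = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + softplus(r) * rng.gauss(0.0, 1.0) for m, r in zip(mu, rho)]


# Hypothetical variational parameters for three weights.
mu = [0.5, -1.0, 0.0]
rho = [-3.0, -3.0, -3.0]  # softplus(-3) is roughly 0.049

rng = random.Random(0)
draws = [sample_weights(mu, rho, rng) for _ in range(2000)]

# The empirical mean of the draws should sit close to mu.
means = [sum(d[i] for d in draws) / len(draws) for i in range(len(mu))]
print(means)
```

Because the noise `eps` is drawn independently of `mu` and `rho`, gradients of the loss can flow through the sampled weights back to the variational parameters.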

By inferring the posterior weight distribution for task **A**, $p(w|D_A)$, we obtain a model that can learn to solve the second task **B** when exposed to new data $D_B$ while still remembering task **A**. Variational Bayesian approximations of $p(w|D_A)$ are used for this purpose.

> The model constructed in this notebook tries to dynamically adapt its weights when confronted with new tasks. A method named **elastic weight consolidation (EWC)** ([Kirkpatrick, 2016](http://www.pnas.org/content/114/13/3521.full.pdf)) is implemented that considers data from two different tasks as independent.

In [None]:
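# Imports and declarations
import gc  # used for gc.collect() at the end of each epoch
import sys

import numpy as np
import torch

# Make the local bayesian_transfer package importable
sys.path.append("bayesian_transfer")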

### Hyperparameters

In [None]:
cuda        = torch.cuda.is_available()
num_epochs  = 250
num_samples = 16

## Transfer learning

We quickly show how to get transfer learning up and running in this Bayesian setting.

In [None]:
from bayesian_transfer import BBBMLP
from experiment import get_data

def get_model(num_output, num_hidden=100, num_layers=2, num_flows=0, pretrained=None):
    model = BBBMLP(in_features=784, num_class=num_output, num_hidden=num_hidden,
                   num_layers=num_layers, nflows=num_flows, p_logvar_init=-2)

    if pretrained:
        d = pretrained.state_dict()
        model.load_prior(d)

    return model

In [None]:
# Training loop. `loss_fn` and `optimizer` are globals defined in later cells.
import math

from torch.autograd import Variable
from tqdm import tqdm

def run_epoch(model, loader, epoch, is_training=False):
    # Number of mini-batches
    m = math.ceil(len(loader.dataset) / loader.batch_size)
    
    diagnostics = {"accuracy": [], "likelihood": [],
                   "KL": [], "loss": []}

    for i, (data, labels) in enumerate(tqdm(loader)):
        # Repeat samples
        x = data.view(-1, 784).repeat(num_samples, 1)
        y = labels.repeat(num_samples)

        if cuda:
            x = x.cuda()
            y = y.cuda()

        # Blundell Beta-scheme
        beta = 2 ** (m - (i + 1)) / (2 ** m - 1)

        # Calculate loss
        logits, kl = model.probforward(Variable(x))
        loss = loss_fn(logits, Variable(y), kl, beta)
        ll = -loss.data.mean() + beta*kl.data.mean()

        # Update gradients
        if is_training:
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Compute accuracy
        _, predicted = logits.max(1)
        accuracy = (predicted.data == y).float().mean()

        diagnostics["accuracy"].append(accuracy/m)
        diagnostics["loss"].append(loss.data.mean()/m)
        diagnostics["KL"].append(beta*kl.data.mean()/m)
        diagnostics["likelihood"].append(ll/m)

    return diagnostics
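
The `beta` weighting inside `run_epoch` follows the mini-batch scheme from Blundell et al., in which batch $i$ of $m$ receives weight $2^{m-i}/(2^m-1)$. The sketch below reproduces that formula with exact fractions to show that early batches carry most of the KL weight and that the weights sum to one over an epoch. The helper name `blundell_beta` is illustrative and does not exist in the notebook.

```python
from fractions import Fraction


def blundell_beta(i, m):
    """KL weight for mini-batch i (0-based) out of m: 2^(m-i-1) / (2^m - 1)."""
    return Fraction(2 ** (m - (i + 1)), 2 ** m - 1)


m = 5
betas = [blundell_beta(i, m) for i in range(m)]
print([str(b) for b in betas])  # ['16/31', '8/31', '4/31', '2/31', '1/31']
print(sum(betas))               # 1
```

Each weight is half the previous one, so with a realistic number of mini-batches the KL term is effectively switched off after the first few dozen batches of every epoch.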

### Train model on A

For this, we restrict the dataset to the first five MNIST digits (0 to 4) and use all of the data within this subset. We use a standard 2-layer Bayesian NN with 400 hidden units and the following objective:

$$\theta^* = \arg\min_{\theta} \; KL(q(w|\theta)||p(w)) - \mathbb{E}_{q(w|\theta)}[\ln p(\mathcal{D}_A|w)]$$

where $p(\mathcal{D}_A|w)$ is a suitable likelihood function for the data in the first domain $\mathcal{D}_A$, such as a categorical likelihood, whose negative logarithm is the cross-entropy.
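
As a worked illustration of the two terms in this objective, the sketch below computes the closed-form KL divergence between two factorised Gaussians and combines it with a log-likelihood value. It assumes both $q(w|\theta)$ and $p(w)$ are diagonal Gaussians parameterised by a mean and a log-variance; this is an assumption about the setup, not a description of what `GaussianVariationalInference` does internally, and the function names are hypothetical. Blundell et al. use a scale-mixture prior instead, for which the KL term has to be estimated by sampling.

```python
import math


def kl_diag_gaussians(q_mu, q_logvar, p_mu, p_logvar):
    """KL(q || p) for factorised Gaussians, summed over all weights."""
    total = 0.0
    for qm, qlv, pm, plv in zip(q_mu, q_logvar, p_mu, p_logvar):
        total += 0.5 * (
            plv - qlv + (math.exp(qlv) + (qm - pm) ** 2) / math.exp(plv) - 1.0
        )
    return total


def free_energy(kl, log_likelihood, beta=1.0):
    """Objective to minimise: weighted KL minus the expected log-likelihood."""
    return beta * kl - log_likelihood


# Identical distributions have zero divergence.
print(kl_diag_gaussians([0.0], [0.0], [0.0], [0.0]))  # 0.0

# Shifting the posterior mean by one prior standard deviation costs 0.5 nats.
kl = kl_diag_gaussians([1.0], [0.0], [0.0], [0.0])
print(kl)                                              # 0.5
print(free_energy(kl, log_likelihood=-2.0))            # 2.5
```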

In [None]:
from bayesian_transfer import GaussianVariationalInference
from torch.autograd import Variable

digits = [0, 1, 2, 3, 4]
loader_train, loader_val = get_data(digits, fraction=1.0)

model_a = get_model(5, num_hidden=400, num_layers=2, num_flows=0)

# Define the objective: we minimize the variational free energy (the negative ELBO).
loss_fn = GaussianVariationalInference(torch.nn.CrossEntropyLoss())
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model_a.parameters()), lr=1e-3)

if cuda: model_a.cuda()
    
for epoch in range(num_epochs):
    print("Epoch {}/{}".format(epoch, num_epochs))
    diagnostics_train = run_epoch(model_a, loader_train, epoch, is_training=True)
    diagnostics_val = run_epoch(model_a, loader_val, epoch)

    diagnostics_train = dict({"type": "train", "epoch": epoch}, **diagnostics_train)
    diagnostics_val = dict({"type": "validation", "epoch": epoch}, **diagnostics_val)

    # Save model and diagnostics
    print(diagnostics_train)
    print(diagnostics_val)

    gc.collect()

Now we transfer to the second domain. We use the same loss function, but the likelihood is defined over the second domain $\mathcal{D}_B$, and the posterior learned in the first part serves as the prior. Consequently, the loss function is now defined as:

$$\theta^* = \arg\min_{\theta} \; KL(q(w|\theta)||q_A(w)) - \mathbb{E}_{q(w|\theta)}[\ln p(\mathcal{D}_B|w)]$$

where $q_A$ is learned from the first domain. Implementation-wise we only need to include the trained model as a prior.
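
Conceptually, "including the trained model as a prior" means copying the posterior parameters learned on task A into the prior slots of the new model, while the new posterior remains free to move. The sketch below illustrates this with a hypothetical `GaussianWeights` container and a hypothetical `transfer_prior` helper; it is not the implementation of `load_prior` in `bayesian_transfer`, whose details are not shown in this notebook.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GaussianWeights:
    """Hypothetical container: posterior q and prior p for one layer."""
    q_mu: List[float]
    q_logvar: List[float]
    p_mu: List[float] = field(default_factory=list)
    p_logvar: List[float] = field(default_factory=list)


def transfer_prior(source: GaussianWeights, target: GaussianWeights) -> None:
    """Use the posterior learned on task A as the prior for task B."""
    # Copy the lists so later updates to the source do not leak into the prior.
    target.p_mu = list(source.q_mu)
    target.p_logvar = list(source.q_logvar)


layer_a = GaussianWeights(q_mu=[0.3, -0.7], q_logvar=[-4.0, -5.0])
layer_b = GaussianWeights(q_mu=[0.0, 0.0], q_logvar=[-2.0, -2.0])

transfer_prior(layer_a, layer_b)
print(layer_b.p_mu, layer_b.p_logvar)  # [0.3, -0.7] [-4.0, -5.0]
```

With the prior fixed to $q_A$, the KL term pulls the task-B posterior towards the task-A solution, and weights that task A pinned down with low variance are penalised most for moving.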

In [None]:
digits = [5, 6, 7, 8, 9]
loader_train, loader_val = get_data(digits, fraction=1.0)


model_b = get_model(5, num_hidden=400, num_layers=2, num_flows=0, pretrained=model_a)

optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model_b.parameters()), lr=1e-3)

if cuda: model_b.cuda()
    
for epoch in range(num_epochs):
    print("Epoch {}/{}".format(epoch, num_epochs))
    diagnostics_train = run_epoch(model_b, loader_train, epoch, is_training=True)
    diagnostics_val = run_epoch(model_b, loader_val, epoch)

    diagnostics_train = dict({"type": "train", "epoch": epoch}, **diagnostics_train)
    diagnostics_val = dict({"type": "validation", "epoch": epoch}, **diagnostics_val)

    # Save model and diagnostics
    print(diagnostics_train)
    print(diagnostics_val)

    gc.collect()

### Results

Plotting the results, we see that transfer learning does not work particularly well for this domain.

![](figs/transfer_results.png)