# Bayesian Transfer Learning for Deep Networks

In this project we are concerned with **Bayesian Deep Learning**. Specifically, we want to know whether having a deep Bayesian model will improve the transfer of learning. Our hypothesis is that that knowledge gained from training a model on tasks **A** and then using the learned weights as a basis for learning on tasks $B$ will perform better than training **B** from scratch - assuming the domains are similar.

![Transfer Learning](https://image.slidesharecdn.com/13aibigdata-160606103446/95/aibigdata-lab-2016-transfer-learning-7-638.jpg?cb=1465209397)

We use Bayes By Backprop introduced by ([Blundell, 2015](https://arxiv.org/abs/1505.05424)). to learn a probability distribution over each of the weights in the network. These weight distributions are fitted using variational inference given some prior.

By inferring the posterior weight distribution in task **A** $p(w|D_A)$, a model is trained which is able to solve the second task **B** when exposed to new data $D_B$, while remembering task **A**. Variational Bayasian approximations of $p(w|D_A)$ are considered for this operation.

> The model constructed in this notebook tries to dynamically adapt its weights when confronted with new tasks. A method named **elastic weight consolidation (EWC)** ([Kirkpatrick, 2016](http://www.pnas.org/content/114/13/3521.full.pdf)) is implemented that considers data from two different tasks as independent.

In [6]:
# Imports and declarations
import torch
import gc
import sys
import numpy as np
sys.path.append("bayesian_transfer")

### Hyperparameters

In [7]:
cuda        = torch.cuda.is_available()
num_epochs  = 250
num_samples = 16

## Transfer learning

We quickly show how to get transfer learning up and running in this Bayesian setting.

In [8]:
from bayesian_transfer.models import BBBMLP
from experiment_forgetting import get_data

def get_model(num_output, num_hidden=100, num_layers=2, num_flows=0, pretrained=None):
    model = BBBMLP(in_features=784, num_class=num_output, num_hidden=num_hidden, num_layers=num_layers, nflows=num_flows, p_logvar_init = -2)

    if pretrained:
        d = pretrained.state_dict()
        model.load_prior(d)

    return model

In [9]:
# Training Loop. Here we are forward declaring the loss function
from tqdm import tqdm
import math

def run_epoch(model, loader, epoch, is_training=False):
    # Number of mini.batches
    m = math.ceil(len(loader.dataset) / loader.batch_size)
    
    diagnostics = {"accuracy": [], "likelihood": [],
                   "KL": [], "loss": []}

    for i, (data, labels) in enumerate(tqdm(loader)):
        # Repeat samples
        x = data.view(-1, 784).repeat(num_samples, 1)
        y = labels.repeat(num_samples)

        if cuda:
            x = x.cuda()
            y = y.cuda()

        # Blundell Beta-scheme
        beta = 2 ** (m - (i + 1)) / (2 ** m - 1)

        # Calculate loss
        logits, kl = model.probforward(Variable(x))
        loss = loss_fn(logits, Variable(y), kl, beta)
        ll = -loss.data.mean() + beta*kl.data.mean()

        # Update gradients
        if is_training:
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Compute accuracy
        _, predicted = logits.max(1)
        accuracy = (predicted.data == y).float().mean()

        diagnostics["accuracy"].append(accuracy/m)
        diagnostics["loss"].append(loss.data.mean()/m)
        diagnostics["KL"].append(beta*kl.data.mean()/m)
        diagnostics["likelihood"].append(ll/m)

    return diagnostics

### Train model on A

For this, we limit the dataset to only the first 5 digits of MNIST and use all of the data within this subset. We use a standard 2-layer Bayesian NN with 400 hidden units and the following loss function

$$\theta^* = \text{arg}\min_{\theta} = KL(q(w|\theta)||p(w)) - \mathbb{E}_{q(w|\theta)}[\ln p(\mathcal{D}_A|w)]$$

where $ p(\mathcal{D_A}|w)$ is a suitable likelihood function, such as cross-entropy - for the data in the first domain $\mathcal{D}_A$.

In [None]:
from bayesian_transfer.layers import GaussianVariationalInference
from torch.autograd import Variable

digits = [0, 1, 2, 3, 4]
loader_train, loader_val = get_data(digits, fraction=1.0)

model_a = get_model(5, num_hidden=400, num_layers=2, num_flows=0)

# Define the objective, in this case we want to minimize the negative free free energy.
loss_fn = GaussianVariationalInference(torch.nn.CrossEntropyLoss())
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model_a.parameters()), lr=1e-3)

if cuda: model_a.cuda()
    
for epoch in range(num_epochs):
    print("Epoch {}/{}".format(epoch, num_epochs))
    diagnostics_train = run_epoch(model_a, loader_train, epoch, is_training=True)
    diagnostics_val = run_epoch(model_a, loader_val, epoch)

    diagnostics_train = dict({"type": "train", "epoch": epoch}, **diagnostics_train)
    diagnostics_val = dict({"type": "validation", "epoch": epoch}, **diagnostics_val)

    # Save model and diagnostics
    print(diagnostics_train)
    print(diagnostics_val)

    gc.collect()

  0%|          | 0/240 [00:00<?, ?it/s]

Epoch 0/250


100%|██████████| 240/240 [01:49<00:00,  2.20it/s]
100%|██████████| 41/41 [00:04<00:00,  9.97it/s]
  0%|          | 0/240 [00:00<?, ?it/s]

{'type': 'train', 'epoch': 0, 'accuracy': [0.00056636460554371, 0.0004664179104477612, 0.000599680170575693, 0.0004997334754797442, 0.0003331556503198294, 0.00056636460554371, 0.00031649786780383796, 0.0005330490405117271, 0.00048307569296375264, 0.0005497068230277185, 0.000599680170575693, 0.0003498134328358209, 0.0008662046908315565, 0.0006996268656716418, 0.0006663113006396588, 0.0008495469083155651, 0.0007329424307036248, 0.0007496002132196162, 0.0008495469083155651, 0.0007829157782515991, 0.0010494402985074627, 0.0008995202558635395, 0.0012659914712153518, 0.00113272921108742, 0.000882862473347548, 0.0008328891257995735, 0.00113272921108742, 0.0014325692963752666, 0.0015991471215351812, 0.001382595948827292, 0.0015658315565031983, 0.0013659381663113006, 0.0016158049040511727, 0.001515858208955224, 0.001382595948827292, 0.0012493336886993604, 0.0016324626865671641, 0.0018323560767590618, 0.001415911513859275, 0.0013659381663113006, 0.0016491204690831556, 0.0014658848614072495, 0.00

100%|██████████| 240/240 [01:56<00:00,  2.06it/s]
100%|██████████| 41/41 [00:04<00:00,  9.54it/s]
  0%|          | 0/240 [00:00<?, ?it/s]

{'type': 'train', 'epoch': 1, 'accuracy': [0.001948960554371002, 0.001799040511727079, 0.0018323560767590618, 0.0018323560767590618, 0.0015991471215351812, 0.0018989872068230277, 0.0018989872068230277, 0.0019323027718550106, 0.0018490138592750533, 0.0019656183368869937, 0.0016990938166311302, 0.0018989872068230277, 0.0019156449893390191, 0.0018156982942430704, 0.001982276119402985, 0.0019156449893390191, 0.001982276119402985, 0.0019156449893390191, 0.0018823294243070362, 0.0019323027718550106, 0.0018490138592750533, 0.0019989339019189766, 0.0019989339019189766, 0.0019156449893390191, 0.0016824360341151385, 0.0018323560767590618, 0.0019323027718550106, 0.001948960554371002, 0.0018490138592750533, 0.0019989339019189766, 0.0018323560767590618, 0.0019656183368869937, 0.0019156449893390191, 0.002015591684434968, 0.0018323560767590618, 0.0018490138592750533, 0.0019989339019189766, 0.0019156449893390191, 0.0019156449893390191, 0.0018823294243070362, 0.002015591684434968, 0.0018156982942430704

100%|██████████| 240/240 [02:06<00:00,  1.90it/s]
100%|██████████| 41/41 [00:04<00:00,  8.23it/s]
  0%|          | 0/240 [00:00<?, ?it/s]

{'type': 'train', 'epoch': 2, 'accuracy': [0.0018989872068230277, 0.0019323027718550106, 0.001948960554371002, 0.002015591684434968, 0.0020988805970149254, 0.0018156982942430704, 0.0020988805970149254, 0.0019656183368869937, 0.002015591684434968, 0.0018989872068230277, 0.0018989872068230277, 0.0019656183368869937, 0.0019656183368869937, 0.001982276119402985, 0.0019656183368869937, 0.002048907249466951, 0.001948960554371002, 0.001982276119402985, 0.002015591684434968, 0.0020988805970149254, 0.001982276119402985, 0.001982276119402985, 0.001982276119402985, 0.0019656183368869937, 0.001948960554371002, 0.002048907249466951, 0.002048907249466951, 0.001982276119402985, 0.0019156449893390191, 0.0018656716417910447, 0.002015591684434968, 0.0019989339019189766, 0.0019156449893390191, 0.002015591684434968, 0.001732409381663113, 0.0019323027718550106, 0.0019656183368869937, 0.001948960554371002, 0.0019156449893390191, 0.001982276119402985, 0.0018823294243070362, 0.0020322494669509595, 0.001982276

100%|██████████| 240/240 [02:17<00:00,  1.75it/s]
100%|██████████| 41/41 [00:04<00:00, 10.04it/s]
  0%|          | 0/240 [00:00<?, ?it/s]

{'type': 'train', 'epoch': 3, 'accuracy': [0.0019323027718550106, 0.002015591684434968, 0.0019989339019189766, 0.002048907249466951, 0.0020322494669509595, 0.002015591684434968, 0.0019323027718550106, 0.001982276119402985, 0.001982276119402985, 0.002015591684434968, 0.001982276119402985, 0.001982276119402985, 0.001982276119402985, 0.001948960554371002, 0.002015591684434968, 0.0019323027718550106, 0.002048907249466951, 0.0020655650319829424, 0.0018989872068230277, 0.001982276119402985, 0.002082222814498934, 0.002048907249466951, 0.0020655650319829424, 0.0019989339019189766, 0.0020322494669509595, 0.0020655650319829424, 0.0019656183368869937, 0.001948960554371002, 0.0019989339019189766, 0.001982276119402985, 0.0018323560767590618, 0.0018989872068230277, 0.0019323027718550106, 0.0018656716417910447, 0.002015591684434968, 0.002015591684434968, 0.0018323560767590618, 0.0019989339019189766, 0.0019156449893390191, 0.0017823827292110875, 0.0019323027718550106, 0.001982276119402985, 0.001982276

 62%|██████▏   | 148/240 [01:31<00:57,  1.61it/s]

Now we transfer to the second domain, by using the same loss function, but instead, the likelihood is defined over the second domain $\mathcal{D}_B$. We also use the learned posterior from the first part as a prior for transfer. Consequently, the loss function is now defined as:

$$\theta^* = \text{arg}\min_{\theta} = KL(q(w|\theta)||q_A(w)) - \mathbb{E}_{q(w|\theta)}[\ln p(\mathcal{D}_B|w)]$$

where $q_A$ is learned from the first domain. Implementation-wise we only need to include the trained model as a prior.

In [None]:
digits = [5, 6, 7, 8, 9]
loader_train, loader_val = get_data(digits, fraction=1.0)


model_b = get_model(5, num_hidden=400, num_layers=2, num_flows=0, pretrained=model_a)
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model_b.parameters()), lr=1e-3)

if cuda: model_b.cuda()
    
for epoch in range(num_epochs):
    print("Epoch {}/{}".format(epoch, num_epochs))
    diagnostics_train = run_epoch(model_b, loader_train, epoch, is_training=True)
    diagnostics_val = run_epoch(model_b, loader_val, epoch)

    diagnostics_train = dict({"type": "train", "epoch": epoch}, **diagnostics_train)
    diagnostics_val = dict({"type": "validation", "epoch": epoch}, **diagnostics_val)

    # Save model and diagnostics
    print(diagnostics_train)
    print(diagnostics_val)

    gc.collect()

### Can our network remember? *Validate model on A after trained on B*
To prove whether our network is capable of remembering, we simply validate the model with domain $\mathcal{D}_A$ after it has been trained on domain $\mathcal{D}_B$.

In [None]:
digits = [0, 1, 2, 3, 4]
loader_val = get_data(digits, fraction=1.0)

model_av = get_model(5, num_hidden=400, num_layers=2, num_flows=0, pretrained=model_b)

if cuda: model_av.cuda()
    
for epoch in range(num_epochs):
    print("Epoch {}/{}".format(epoch, num_epochs))
    diagnostics_val = run_epoch(model_av, loader_val, epoch)

    diagnostics_val = dict({"type": "validation", "epoch": epoch}, **diagnostics_val)

    # Save model and diagnostics
    print(diagnostics_val)

    gc.collect()

### Results

#### Does Bayesian transfer learning make learning a second similar task converging faster?
Plotting the results, we see that Bayesian transfer learning does *not* make learning of a second similar task converging faster.

![](figs/transfer_results.png)

#### Is our network capable of remembering domain $\mathcal{D}_A$?  

(plot here results of our experiment)