# Comparison of model changes

The [DAVEnet model](https://github.com/dharwath/DAVEnet-pytorch) (Harwath et al. 2018) had two precursor models:

[NIPS 2016 model](https://papers.nips.cc/paper/6186-unsupervised-learning-of-spoken-language-with-visual-context.pdf) and [ACL 2017 model](https://arxiv.org/pdf/1701.07481.pdf)

Code for these two models was not published, but they can be recreated using DAVEnet as a basis. This notebook documents the differences found between the three models to help the replication process.

## Comparison table

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;border:none;border-color:#ccc;margin:0px auto;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 9px;border-style:solid;border-width:0px;overflow:hidden;word-break:normal;border-color:#ccc;color:#333;background-color:#fff;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 9px;border-style:solid;border-width:0px;overflow:hidden;word-break:normal;border-color:#ccc;color:#333;background-color:#f0f0f0;}
.tg .tg-waok{font-weight:bold;font-family:Tahoma, Geneva, sans-serif !important;;border-color:inherit;text-align:center;vertical-align:top}
.tg .tg-td0d{font-family:"Lucida Sans Unicode", "Lucida Grande", sans-serif !important;;text-align:left;vertical-align:top}
.tg .tg-j6ou{background-color:#f9f9f9;font-family:"Lucida Sans Unicode", "Lucida Grande", sans-serif !important;;text-align:left;vertical-align:top}
@media screen and (max-width: 767px) {.tg {width: auto !important;}.tg col {width: auto !important;}.tg-wrap {overflow-x: auto;-webkit-overflow-scrolling: touch;margin: auto 0px;}}</style>
<div class="tg-wrap"><table class="tg">
  <tr>
    <th class="tg-waok">Model part</th>
    <th class="tg-waok">NIPS 2016 Model</th>
    <th class="tg-waok">ACL 2017 Model</th>
    <th class="tg-waok">DAVEnet ECCV 2018</th>
  </tr>
  <tr>
    <td class="tg-td0d">Image input</td>
      <td class="tg-j6ou">Subtract VGG mean pixel value (no mention of variance/std) and take a <i>center</i> 224x224 crop.</td>
    <td class="tg-td0d">Presumably the same as NIPS.</td>
    <td class="tg-j6ou">Resize smallest dimension to 256, take a <i>random</i> 224x224 crop and normalize with global mean and variance.</td>
  </tr>
  <tr>
    <td class="tg-td0d">Image body</td>
    <td class="tg-j6ou">A VGG16 with softmax classification layer removed. VGG weights presumed to be fixed (not specified in the paper).</td>
    <td class="tg-td0d">The same as NIPS, weights are known to be fixed.</td>
    <td class="tg-j6ou">A VGG16 where the last maxpool and all layers after that are replaced with one convolution layer.</td>
  </tr>
  <tr>
    <td class="tg-td0d">Image output</td>
    <td class="tg-j6ou">Linear transform of the 4096 inputs from the penultimate layer to 1024 dimensions.</td>
    <td class="tg-td0d">The same as NIPS.</td>
    <td class="tg-j6ou">A 3x3 linear convolution that outputs a 1024-channel feature map.</td>
  </tr>
  <tr>
    <td class="tg-td0d">Audio input</td>
    <td class="tg-j6ou">Log Mel filter bank spectrograms with 40 filters, with spectrogram normalization. Fixed to L frames (L = 1024 or <b>2048</b>; the latter performs better) using truncation or zero padding.</td>
    <td class="tg-td0d">The same as NIPS, but only L = 1024 is considered.</td>
    <td class="tg-j6ou">Similar to NIPS, but with the following changes: samples are padded to the length of the longest caption in a minibatch, and manual spectrogram normalization is replaced by a BatchNorm layer.</td>
  </tr>
  <tr>
    <td class="tg-td0d">Audio body</td>
    <td class="tg-j6ou">3 convolution layers with ReLU, 2 maxpools, 1 mean- or maxpool and an L2 normalization.</td>
    <td class="tg-td0d">5 convolution layers with ReLU, 3 maxpools, a meanpool and an L2 normalization.</td>
    <td class="tg-j6ou">Similar to ACL, but with a BatchNorm layer at the front and the L2 normalization removed from the end.</td>
  </tr>
  <tr>
    <td class="tg-td0d">Audio output</td>
    <td class="tg-j6ou">1024-dimensional activation vector.</td>
    <td class="tg-td0d">The same as NIPS.</td>
    <td class="tg-j6ou">1024-dimensional feature map. The padding is removed at this stage, individually for each caption.</td>
  </tr>
  <tr>
    <td class="tg-td0d">Parameters</td>
    <td class="tg-j6ou">SGD, 50 epochs, <i>constant</i> momentum 0.9, minibatch size 128. Learning rate: 1e-5, geometric decay by a factor between 2 and 5 every 5 to 10 epochs.</td>
    <td class="tg-td0d">The same as NIPS.</td>
    <td class="tg-j6ou">The same as NIPS, except that models converged in fewer than 150 epochs on average. Learning rate: 1e-3, decayed by a factor of 10 every 70 epochs.</td>
  </tr>
  <tr>
    <td class="tg-td0d">Similarity function</td>
    <td class="tg-j6ou">Dot product</td>
    <td class="tg-td0d">Dot product</td>
      <td class="tg-j6ou">SISA, <b>MISA</b> (best), SIMA</td>
  </tr>
</table></div>
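
The learning-rate entries in the Parameters row both describe a step decay: the rate is divided by a fixed factor once every fixed number of epochs. A minimal sketch of that schedule follows. `step_decay` is a hypothetical helper name, and the NIPS values `factor=2` and `every=5` are assumptions picked from the ranges given in the table, not values confirmed by the papers.

```python
def step_decay(base_lr, epoch, factor, every):
    """Learning rate after dividing base_lr by `factor` once per `every` epochs."""
    return base_lr / (factor ** (epoch // every))

# DAVEnet column: start at 1e-3, divide by 10 every 70 epochs.
davenet_lrs = {epoch: step_decay(1e-3, epoch, factor=10, every=70)
               for epoch in (0, 69, 70, 140)}

# NIPS/ACL column: start at 1e-5. factor=2 and every=5 are assumed values
# taken from the stated ranges (factor 2 to 5, every 5 to 10 epochs).
nips_lrs = {epoch: step_decay(1e-5, epoch, factor=2, every=5)
            for epoch in (0, 5, 10)}

print(davenet_lrs)
print(nips_lrs)
```

The rate stays constant within each block of `every` epochs and drops only at the block boundary, so epoch 69 still uses the initial DAVEnet rate.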

In [2]:
# For importing the models folder
import sys
sys.path.append('/m/home/home4/44/virkkua1/unix/PlacesAudio_project/DAVEnet')

import torch
import torchvision
import torchvision.models as tvmodels
from torchsummary import summary

import models

# If the last line prints False, something is wrong with the CUDA toolkit installation.
# For example, the toolkit may have been updated without updating the driver, which
# leads to a version mismatch and errors.
print(torch.__version__)
print(torch.version.cuda)
print(torch.backends.cudnn.version())
print(torch.cuda.is_available())

In [3]:
#dave_vgg16 = models.VGG16()
#dave_vgg16.cuda()

standard_vgg16 = tvmodels.vgg16()
standard_vgg16.cuda()

audio_model = models.Davenet()
audio_model.cuda()

Davenet(
  (batchnorm1): BatchNorm2d(1, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (conv1): Conv2d(1, 128, kernel_size=(40, 1), stride=(1, 1))
  (conv2): Conv2d(128, 256, kernel_size=(1, 11), stride=(1, 1), padding=(0, 5))
  (conv3): Conv2d(256, 512, kernel_size=(1, 17), stride=(1, 1), padding=(0, 8))
  (conv4): Conv2d(512, 512, kernel_size=(1, 17), stride=(1, 1), padding=(0, 8))
  (conv5): Conv2d(512, 1024, kernel_size=(1, 17), stride=(1, 1), padding=(0, 8))
  (pool): MaxPool2d(kernel_size=(1, 3), stride=(1, 2), padding=(0, 1), dilation=1, ceil_mode=False)
)

In [4]:
summary(audio_model, (40, 1024))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
       BatchNorm2d-1          [-1, 1, 40, 1024]               2
            Conv2d-2         [-1, 128, 1, 1024]           5,248
            Conv2d-3         [-1, 256, 1, 1024]         360,704
         MaxPool2d-4          [-1, 256, 1, 512]               0
            Conv2d-5          [-1, 512, 1, 512]       2,228,736
         MaxPool2d-6          [-1, 512, 1, 256]               0
            Conv2d-7          [-1, 512, 1, 256]       4,456,960
         MaxPool2d-8          [-1, 512, 1, 128]               0
            Conv2d-9         [-1, 1024, 1, 128]       8,913,920
        MaxPool2d-10          [-1, 1024, 1, 64]               0
Total params: 15,965,570
Trainable params: 15,965,570
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.16
Forward/backward pass size (MB): 10.31
Params size (MB): 60.90
Est
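
The parameter counts in the summary can be checked by hand: a `Conv2d` layer has one weight per input channel, kernel element and output channel, plus one bias per output channel. A small sketch using the layer shapes printed for `Davenet` above follows. `conv2d_params` is a hypothetical helper, and the size estimate assumes 32-bit floats.

```python
def conv2d_params(in_ch, out_ch, kernel):
    """Weights plus one bias per output channel for a Conv2d layer."""
    kh, kw = kernel
    return in_ch * kh * kw * out_ch + out_ch

# (in_channels, out_channels, kernel_size) copied from the printed Davenet module.
layers = [
    (1, 128, (40, 1)),
    (128, 256, (1, 11)),
    (256, 512, (1, 17)),
    (512, 512, (1, 17)),
    (512, 1024, (1, 17)),
]
conv_counts = [conv2d_params(i, o, k) for i, o, k in layers]

# BatchNorm2d(1) with affine=True: one scale and one shift parameter.
batchnorm_params = 2
total = batchnorm_params + sum(conv_counts)

# Assumes 4 bytes per parameter (float32).
size_mb = total * 4 / 1024 ** 2

print(conv_counts)
print(total, round(size_mb, 2))
```

The per-layer counts and the total agree with the `Param #` column and the `Total params` line of the summary.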

In [5]:
#print(dave_vgg16)

In [6]:
#summary(dave_vgg16, (3, 224, 224))

In [7]:
print(standard_vgg16)

VGG(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace=True)
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU(inplace=True)
    (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): ReLU(inplace=True)
    (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (13): ReLU(inplace=True)
    (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): ReLU(inplace=True)
    (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1

In [8]:
summary(standard_vgg16, (3, 224, 224))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1         [-1, 64, 224, 224]           1,792
              ReLU-2         [-1, 64, 224, 224]               0
            Conv2d-3         [-1, 64, 224, 224]          36,928
              ReLU-4         [-1, 64, 224, 224]               0
         MaxPool2d-5         [-1, 64, 112, 112]               0
            Conv2d-6        [-1, 128, 112, 112]          73,856
              ReLU-7        [-1, 128, 112, 112]               0
            Conv2d-8        [-1, 128, 112, 112]         147,584
              ReLU-9        [-1, 128, 112, 112]               0
        MaxPool2d-10          [-1, 128, 56, 56]               0
           Conv2d-11          [-1, 256, 56, 56]         295,168
             ReLU-12          [-1, 256, 56, 56]               0
           Conv2d-13          [-1, 256, 56, 56]         590,080
             ReLU-14          [-1, 256,

## NIPS model

Trying to replicate the NIPS model using the DAVEnet base and the table of changes.
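
The audio-input row of the table fixes every NIPS spectrogram to L frames by truncation or zero padding before it reaches the network. A minimal sketch of that step follows, assuming the padding goes at the end of the caption (the table does not say where) and using nested lists in place of an array. `fix_num_frames` is a hypothetical helper name.

```python
def fix_num_frames(spectrogram, target_frames=1024, pad_value=0.0):
    """Truncate or right-pad each filter-bank row to exactly target_frames."""
    fixed = []
    for row in spectrogram:
        # Keep at most target_frames frames.
        row = list(row[:target_frames])
        # Pad the remainder (if any) with pad_value; assumed to go at the end.
        row.extend([pad_value] * (target_frames - len(row)))
        fixed.append(row)
    return fixed

# Dummy spectrograms: 40 filters, one shorter and one longer than 1024 frames.
short = [[1.0] * 700 for _ in range(40)]
long = [[1.0] * 1500 for _ in range(40)]

fixed_short = fix_num_frames(short)
fixed_long = fix_num_frames(long)
print(len(fixed_short), len(fixed_short[0]))
print(len(fixed_long), len(fixed_long[0]))
```

Both results have 40 rows of 1024 frames, which matches the `(40, 1024)` input shape passed to `summary` in this notebook.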

In [9]:
import torch.nn as nn
import torch.nn.functional as F

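class ConvX3AudioNet(nn.Module):
    def __init__(self, embedding_dim=1024):
        super(ConvX3AudioNet, self).__init__()
        # NOTE: embedding_dim is stored but not used; the final layer below has 205 outputs.
        self.embedding_dim = embedding_dim
        # conv1 spans all 40 filter-bank rows, collapsing the frequency axis to 1.
        self.conv1 = nn.Conv2d(1, 64, kernel_size=(40,5), stride=(1,1), padding=(0,2))
        self.conv2 = nn.Conv2d(64, 512, kernel_size=(1,25), stride=(1,1), padding=(0,12))
        self.conv3 = nn.Conv2d(512, 1024, kernel_size=(1,25), stride=(1,1), padding=(0,12))
        # Halves the time axis on each use (1024 -> 512 -> 256).
        self.pool = nn.MaxPool2d(kernel_size=(1,4), stride=(1,2), padding=(0,1))
        # Pools over the remaining 256 frames, so it assumes 1024-frame inputs.
        self.globalPool = nn.MaxPool2d(kernel_size=(1,256), stride=(1,1), padding=(0,0))
        self.fc = nn.Linear(1024, 205)
        
    def forward(self, x):
        # Add a channel dimension: (batch, 40, frames) -> (batch, 1, 40, frames).
        if x.dim() == 3:
            x = x.unsqueeze(1)
        x = F.relu(self.conv1(x))
        x = self.pool(x)
        x = F.relu(self.conv2(x))
        x = self.pool(x)
        x = F.relu(self.conv3(x))
        # Global max pool over time, then L2-normalize across the 1024 channels.
        x = F.normalize(self.globalPool(x), dim=1)
        # Flatten to (batch, 1024); unlike squeeze(), this keeps a batch of size 1.
        x = x.flatten(1)
        x = self.fc(x)
        return x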

convX3 = ConvX3AudioNet()
convX3.cuda()
summary(convX3, (40, 1024), batch_size=128)

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1         [128, 64, 1, 1024]          12,864
         MaxPool2d-2          [128, 64, 1, 512]               0
            Conv2d-3         [128, 512, 1, 512]         819,712
         MaxPool2d-4         [128, 512, 1, 256]               0
            Conv2d-5        [128, 1024, 1, 256]      13,108,224
         MaxPool2d-6          [128, 1024, 1, 1]               0
            Linear-7                 [128, 205]         210,125
Total params: 14,150,925
Trainable params: 14,150,925
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 20.00
Forward/backward pass size (MB): 737.20
Params size (MB): 53.98
Estimated Total Size (MB): 811.18
----------------------------------------------------------------
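
The time-axis sizes in this summary follow from the usual pooling formula, `floor((length + 2 * padding - kernel) / stride) + 1`. The convolutions in `ConvX3AudioNet` are padded so that they keep the length unchanged. A small sketch of the arithmetic follows, with hypothetical helper names. It also shows what the fixed `(1, 256)` global pool would do with 2048-frame inputs.

```python
def pool_out_len(length, kernel, stride, padding=0):
    """Output length of a pooling layer along one axis (floor mode, no dilation)."""
    return (length + 2 * padding - kernel) // stride + 1

def convx3_time_lengths(num_frames):
    """Time-axis length after each pooling step in ConvX3AudioNet."""
    # The two strided pools use kernel (1,4), stride (1,2), padding (0,1).
    after_pool1 = pool_out_len(num_frames, kernel=4, stride=2, padding=1)
    after_pool2 = pool_out_len(after_pool1, kernel=4, stride=2, padding=1)
    # The global pool uses kernel (1,256), stride (1,1), no padding.
    after_global = pool_out_len(after_pool2, kernel=256, stride=1)
    return after_pool1, after_pool2, after_global

lengths_1024 = convx3_time_lengths(1024)
lengths_2048 = convx3_time_lengths(2048)
print(lengths_1024)
print(lengths_2048)
```

For 2048-frame inputs the fixed 256-wide kernel leaves 257 positions instead of one, so the L = 2048 setting from the table would need a 512-wide global pool or an adaptive pooling layer such as `nn.AdaptiveMaxPool2d`.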


In [10]:
class VGG16withFC(nn.Module):
    def __init__(self, embedding_dim=1024):
        super(VGG16withFC, self).__init__()
        seed_model = tvmodels.vgg16(pretrained=True)
        # Drop the final 1000-way classification layer, keeping the 4096-dim penultimate output
        seed_model.classifier = nn.Sequential(*list(seed_model.classifier.children())[:-1])
        # Freeze all pretrained VGG parameters
        for param in seed_model.parameters():
            param.requires_grad = False
        # Append a linear projection to the embedding dimension; it is created after
        # the freeze above, so it is the only trainable layer
        last_layer_index = len(list(seed_model.classifier.children()))
        seed_model.classifier.add_module(str(last_layer_index),
                                         nn.Linear(4096, embedding_dim))
        self.image_model = seed_model
        
    def forward(self, x):
        x = self.image_model(x)
        return x

vgg16withFC = VGG16withFC()
vgg16withFC.cuda()
summary(vgg16withFC, (3, 224, 224), batch_size=128)


----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1        [128, 64, 224, 224]           1,792
              ReLU-2        [128, 64, 224, 224]               0
            Conv2d-3        [128, 64, 224, 224]          36,928
              ReLU-4        [128, 64, 224, 224]               0
         MaxPool2d-5        [128, 64, 112, 112]               0
            Conv2d-6       [128, 128, 112, 112]          73,856
              ReLU-7       [128, 128, 112, 112]               0
            Conv2d-8       [128, 128, 112, 112]         147,584
              ReLU-9       [128, 128, 112, 112]               0
        MaxPool2d-10         [128, 128, 56, 56]               0
           Conv2d-11         [128, 256, 56, 56]         295,168
             ReLU-12         [128, 256, 56, 56]               0
           Conv2d-13         [128, 256, 56, 56]         590,080
             ReLU-14         [128, 256,