# Comparison of model changes

The [DAVEnet model](https://github.com/dharwath/DAVEnet-pytorch) (Harwath et al. 2018) had two precursor models: the [NIPS 2016 model](https://papers.nips.cc/paper/6186-unsupervised-learning-of-spoken-language-with-visual-context.pdf) and the [ACL 2017 model](https://arxiv.org/pdf/1701.07481.pdf).

Code for these two models was not published, but they could be recreated using DAVEnet as a basis. This notebook documents the differences found between the three models to help replicate them.

## Comparison table

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;border:none;border-color:#ccc;margin:0px auto;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 9px;border-style:solid;border-width:0px;overflow:hidden;word-break:normal;border-color:#ccc;color:#333;background-color:#fff;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 9px;border-style:solid;border-width:0px;overflow:hidden;word-break:normal;border-color:#ccc;color:#333;background-color:#f0f0f0;}
.tg .tg-waok{font-weight:bold;font-family:Tahoma, Geneva, sans-serif !important;;border-color:inherit;text-align:center;vertical-align:top}
.tg .tg-td0d{font-family:"Lucida Sans Unicode", "Lucida Grande", sans-serif !important;;text-align:left;vertical-align:top}
.tg .tg-j6ou{background-color:#f9f9f9;font-family:"Lucida Sans Unicode", "Lucida Grande", sans-serif !important;;text-align:left;vertical-align:top}
@media screen and (max-width: 767px) {.tg {width: auto !important;}.tg col {width: auto !important;}.tg-wrap {overflow-x: auto;-webkit-overflow-scrolling: touch;margin: auto 0px;}}</style>
<div class="tg-wrap"><table class="tg">
  <tr>
    <th class="tg-waok">Model part</th>
    <th class="tg-waok">NIPS2016 Model</th>
    <th class="tg-waok">ACL 2017 Model</th>
    <th class="tg-waok">DAVEnet ECCV 2018</th>
  </tr>
  <tr>
    <td class="tg-td0d">Image input</td>
    <td class="tg-j6ou">Subtract VGG mean pixel value (no mention of variance/std) and take a <i>center</i> 224x224 crop.</td>
    <td class="tg-td0d">Presumably the same as NIPS.</td>
    <td class="tg-j6ou">Resize smallest dimension to 256, take a <i>random</i> 224x224 crop and normalize with global mean and variance.</td>
  </tr>
  <tr>
    <td class="tg-td0d">Image body</td>
    <td class="tg-j6ou">A VGG16 with the softmax classification layer removed. VGG weights presumed to be fixed (not specified in the paper).</td>
    <td class="tg-td0d">The same as NIPS; the weights are known to be fixed.</td>
    <td class="tg-j6ou">A VGG16 where the last maxpool and all layers after that are replaced with one convolution layer.</td>
  </tr>
  <tr>
    <td class="tg-td0d">Image output</td>
    <td class="tg-j6ou">Linear transform of the 4096 inputs from the penultimate layer to 1024 dimensions.</td>
    <td class="tg-td0d">The same as NIPS.</td>
    <td class="tg-j6ou">A 3 by 3 linear convolution which outputs a 1024-channel feature map.</td>
  </tr>
  <tr>
    <td class="tg-td0d">Audio input</td>
    <td class="tg-j6ou">Log Mel filter bank spectrograms with 40 filters. Spectrogram normalization. Fixed to L frames (L = 1024 or <b>2048</b>; the latter is better) using truncation or zero padding.</td>
    <td class="tg-td0d">The same as NIPS, but only L = 1024 is considered.</td>
    <td class="tg-j6ou">Similar to NIPS, but with the following changes: samples are padded to the length of the longest caption in a minibatch, and manual spectrogram normalization is replaced by a BatchNorm layer.</td>
  </tr>
  <tr>
    <td class="tg-td0d">Audio body</td>
    <td class="tg-j6ou">3 convolution layers with ReLU, 2 maxpools, 1 mean- or maxpool and an L2 normalization.</td>
    <td class="tg-td0d">5 convolution layers with ReLU, 3 maxpools, a meanpool and an L2 normalization.</td>
    <td class="tg-j6ou">Similar to ACL, but with a BatchNorm layer at the front and the L2 normalization removed from the end.</td>
  </tr>
  <tr>
    <td class="tg-td0d">Audio output</td>
    <td class="tg-j6ou">1024-dimensional activation vector.</td>
    <td class="tg-td0d">The same as NIPS.</td>
    <td class="tg-j6ou">1024-dimensional feature map. The padding is removed at this stage, individually for each caption.</td>
  </tr>
  <tr>
    <td class="tg-td0d">Parameters</td>
    <td class="tg-j6ou">SGD, 50 epochs, <i>constant</i> momentum 0.9, minibatch size 128. Learning rate: 1e-5, geometric decay by a factor between 2 and 5 every 5 to 10 epochs.</td>
    <td class="tg-td0d">The same as NIPS.</td>
    <td class="tg-j6ou">The same as NIPS, except that models converged in less than 150 epochs on average. Learning rate: 1e-3, decay by a factor of 10 every 70 epochs.</td>
  </tr>
  <tr>
    <td class="tg-td0d">Similarity function</td>
    <td class="tg-j6ou">Dot product</td>
    <td class="tg-td0d">Dot product</td>
    <td class="tg-j6ou">SISA, <b>MISA</b> (best), SIMA</td>
  </tr>
</table></div>
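
The similarity functions in the last row of the table can be illustrated in plain Python. This is a minimal sketch, not the DAVEnet implementation. It assumes a "matchmap" `M[h][w][t]` holding the dot product between every image feature-map cell and every audio frame, and it takes SISA as the mean over all cells and frames, MISA as the maximum over image cells averaged over audio frames, and SIMA as the maximum over audio frames averaged over image cells. The helper names (`dot`, `matchmap`, `sisa`, `misa`, `sima`) and the toy inputs are made up for this example. The NIPS and ACL models correspond to a single call of `dot` on the two 1024-dimensional embedding vectors.

```python
def dot(u, v):
    """Dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))

def matchmap(image, audio):
    """image: H x W grid of feature vectors, audio: list of T feature vectors."""
    return [[[dot(cell, frame) for frame in audio] for cell in row] for row in image]

def sisa(M):
    # Mean over all image cells and all audio frames
    values = [s for row in M for cell in row for s in cell]
    return sum(values) / len(values)

def misa(M):
    # For each audio frame take the best image cell, then average over frames
    n_frames = len(M[0][0])
    best = [max(cell[t] for row in M for cell in row) for t in range(n_frames)]
    return sum(best) / n_frames

def sima(M):
    # For each image cell take the best audio frame, then average over cells
    best = [max(cell) for row in M for cell in row]
    return sum(best) / len(best)

# Toy input: a 1 x 2 image grid and 3 audio frames with 2-dimensional features
image = [[[1.0, 0.0], [0.0, 1.0]]]
audio = [[2.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
M = matchmap(image, audio)
print(sisa(M), misa(M), sima(M))
```

On this toy input the three scores differ because the silent third frame lowers the averages over frames but does not affect the per-cell maxima.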

In [1]:
# For importing the models folder
import sys
sys.path.append('/m/home/home4/44/virkkua1/unix/PlacesAudio_project/DAVEnet')

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.models as tvmodels
from torchsummary import summary

import models

# If the last line prints False, there is probably something wrong with the CUDA toolkit installation.
# For example, the toolkit might have been updated without updating the driver, which
# leads to a version mismatch and errors.
print(torch.__version__)
print(torch.version.cuda)
print(torch.backends.cudnn.version())
print(torch.backends.cudnn.enabled)
print(torch.cuda.is_available())

1.2.0
9.2.148
7600
True
True


## NIPS model

This section tries to replicate the NIPS model using the DAVEnet code as a base and the table of changes above.
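
According to the table, the NIPS model fixes every spectrogram to L frames by truncation or zero padding. A minimal sketch of that step is shown below. It assumes the spectrogram is a list of 40-dimensional frames and that the zero frames are appended at the end; the papers may pad differently, and the name `fix_length` is made up for this example.

```python
def fix_length(frames, target_len, n_mels=40):
    """Truncate or zero-pad a list of spectrogram frames to target_len frames."""
    if len(frames) >= target_len:
        # Too long: keep only the first target_len frames
        return frames[:target_len]
    # Too short: append all-zero frames (assumption: padding goes at the end)
    padding = [[0.0] * n_mels for _ in range(target_len - len(frames))]
    return frames + padding

short_caption = [[1.0] * 40 for _ in range(3)]
long_caption = [[1.0] * 40 for _ in range(10)]
fixed_short = fix_length(short_caption, 5)
fixed_long = fix_length(long_caption, 5)
print(len(fixed_short), len(fixed_long))
```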

In [None]:
class ConvX3AudioNet(nn.Module):
    def __init__(self, embedding_dim=1024):
        super(ConvX3AudioNet, self).__init__()
        self.embedding_dim = embedding_dim
        self.conv1 = nn.Conv2d(1, 64, kernel_size=(40,5), stride=(1,1), padding=(0,2))
        self.conv2 = nn.Conv2d(64, 512, kernel_size=(1,25), stride=(1,1), padding=(0,12))
        self.conv3 = nn.Conv2d(512, 1024, kernel_size=(1,25), stride=(1,1), padding=(0,12))
        self.pool = nn.MaxPool2d(kernel_size=(1,4), stride=(1,2), padding=(0,1))
        self.globalPool = nn.MaxPool2d(kernel_size=(1,256), stride=(1,1), padding=(0,0))
        self.fc = nn.Linear(1024, 205)
        
    def forward(self, x):
        if x.dim() == 3:
            x = x.unsqueeze(1)
        x = F.relu(self.conv1(x))
        x = self.pool(x)
        x = F.relu(self.conv2(x))
        x = self.pool(x)
        x = F.relu(self.conv3(x))
        x = F.normalize(self.globalPool(x), dim=1)
        x = x.squeeze()
        x = self.fc(x)
        return x

convX3 = ConvX3AudioNet()
convX3.cuda()
summary(convX3, (40, 1024), batch_size=128)
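
The kernel size of `globalPool` depends on how many frames are left after the two pooling layers. The sketch below traces the time axis through `ConvX3AudioNet` using the standard output-length formula for convolution and pooling layers (no dilation, floor rounding). The helper names `out_len` and `convx3_time_axis` are made up for this example.

```python
def out_len(length, kernel, stride=1, padding=0):
    """Output length of a convolution or pooling layer along one axis."""
    return (length + 2 * padding - kernel) // stride + 1

def convx3_time_axis(n_frames):
    n = out_len(n_frames, kernel=5, padding=2)        # conv1, kernel (40, 5)
    n = out_len(n, kernel=4, stride=2, padding=1)     # pool
    n = out_len(n, kernel=25, padding=12)             # conv2, kernel (1, 25)
    n = out_len(n, kernel=4, stride=2, padding=1)     # pool
    n = out_len(n, kernel=25, padding=12)             # conv3, kernel (1, 25)
    return n

frames_1024 = convx3_time_axis(1024)
frames_2048 = convx3_time_axis(2048)
print(frames_1024, frames_2048)
```

A 1024-frame input leaves 256 frames, which is exactly what the `(1, 256)` global pool covers. A 2048-frame input would leave 512 frames, so the global pool kernel would have to be doubled for the L = 2048 variant.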

In [None]:
class VGG16withFC(nn.Module):
    def __init__(self, embedding_dim=1024):
        super(VGG16withFC, self).__init__()
        seed_model = tvmodels.__dict__['vgg16'](pretrained=True)
        # Remove last 1000-dim class transform
        seed_model.classifier = nn.Sequential(*list(seed_model.classifier.children())[:-1])
        # Freeze params
        for param in seed_model.parameters():
            param.requires_grad = False
        # Add a linear transform of the embedding dimension size
        last_layer_index = len(list(seed_model.classifier.children()))
        seed_model.classifier.add_module(str(last_layer_index),
                                         nn.Linear(4096, embedding_dim))
        self.image_model = seed_model
        
    def forward(self, x):
        x = self.image_model(x)
        return x

vgg16withFC = VGG16withFC()
vgg16withFC.cuda()
summary(vgg16withFC, (3, 224, 224), batch_size=128)

## Warmup for DAVEnet audio branch

Embedding models are difficult to train from scratch. Tuomas said that it is therefore common practice to warm up (pretrain) the embedding model weights on a simpler classification task. In this section I write a classifier version of the DAVEnet audio branch that tries to predict the Places image class from the audio caption. The image branch of DAVEnet uses conventional image classifiers such as VGG16 or ResNet50, for which pretrained weights already exist and are available in torchvision.

The image classes are extracted from the image paths of the Places400k dataset. For this, see the "create_class_labels" notebook.
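
The actual extraction lives in the "create_class_labels" notebook and is not reproduced here. As a rough illustration only, the sketch below assumes a hypothetical layout in which the class name is the directory that directly contains the image; the example paths and the helper `class_from_path` are made up and may not match the real dataset.

```python
from pathlib import PurePosixPath

def class_from_path(image_path):
    # Assumption: the class name is the directory directly containing the image
    return PurePosixPath(image_path).parent.name

# Hypothetical image paths, not taken from the dataset
paths = ["a/abbey/img_0001.jpg", "k/kitchen/img_0002.jpg", "a/abbey/img_0003.jpg"]

# Map each class name to an integer index usable as a classification target
class_names = sorted({class_from_path(p) for p in paths})
class_to_index = {name: i for i, name in enumerate(class_names)}
labels = [class_to_index[class_from_path(p)] for p in paths]
print(class_to_index, labels)
```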

In [None]:
class DavenetClassifier(nn.Module):
    def __init__(self, embedding_dim=1024):
        super(DavenetClassifier, self).__init__()
        self.embedding_dim = embedding_dim
        self.batchnorm1 = nn.BatchNorm2d(1)
        self.conv1 = nn.Conv2d(1, 128, kernel_size=(40,1), stride=(1,1), padding=(0,0))
        self.conv2 = nn.Conv2d(128, 256, kernel_size=(1,11), stride=(1,1), padding=(0,5))
        self.conv3 = nn.Conv2d(256, 512, kernel_size=(1,17), stride=(1,1), padding=(0,8))
        self.conv4 = nn.Conv2d(512, 512, kernel_size=(1,17), stride=(1,1), padding=(0,8))
        self.conv5 = nn.Conv2d(512, embedding_dim, kernel_size=(1,17), stride=(1,1), padding=(0,8))
        self.pool = nn.MaxPool2d(kernel_size=(1,3), stride=(1,2),padding=(0,1))
        self.fc = nn.Linear(64*embedding_dim, 205)

    def forward(self, x):
        # Accept (batch, 40, frames) input by adding a channel dimension
        if x.dim() == 3:
            x = x.unsqueeze(1)
        x = self.batchnorm1(x)
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = self.pool(x)
        x = F.relu(self.conv3(x))
        x = self.pool(x)
        x = F.relu(self.conv4(x))
        x = self.pool(x)
        x = F.relu(self.conv5(x))
        x = self.pool(x)
        # The four stride-2 pools shrink the time axis by 2**4, so 1024 input frames
        # leave 64 frames of embedding_dim channels, which is the input size of self.fc
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x


class_model = DavenetClassifier()
class_model.cuda()

In [None]:
# self.fc expects 64 * embedding_dim features, which corresponds to 1024 input frames
summary(class_model, (40, 1024), batch_size=8)

## ResDAVEnet

DAVEnet got a new version with residual connections, described in the paper ["Transfer Learning from Audio-Visual Grounding to Speech Recognition"](https://arxiv.org/pdf/1907.04355.pdf).

Here I attempt to replicate the ResDAVEnet model using the DAVEnet base.

In [2]:
# Standard resnet34 for comparison
standard_resnet = tvmodels.resnet34()
standard_resnet.cuda()

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
  

In [3]:
summary(standard_resnet, (3, 224, 224))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1         [-1, 64, 112, 112]           9,408
       BatchNorm2d-2         [-1, 64, 112, 112]             128
              ReLU-3         [-1, 64, 112, 112]               0
         MaxPool2d-4           [-1, 64, 56, 56]               0
            Conv2d-5           [-1, 64, 56, 56]          36,864
       BatchNorm2d-6           [-1, 64, 56, 56]             128
              ReLU-7           [-1, 64, 56, 56]               0
            Conv2d-8           [-1, 64, 56, 56]          36,864
       BatchNorm2d-9           [-1, 64, 56, 56]             128
             ReLU-10           [-1, 64, 56, 56]               0
       BasicBlock-11           [-1, 64, 56, 56]               0
           Conv2d-12           [-1, 64, 56, 56]          36,864
      BatchNorm2d-13           [-1, 64, 56, 56]             128
             ReLU-14           [-1, 64,

In [38]:
# ResDAVEnet has 1x9 convolutions instead of 3x3
def conv1x9(in_planes, out_planes, stride=1):
    """1x9 convolution with padding"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=(1,9), stride=stride,
                     padding=(0, 4), bias=False)

def conv1x1(in_planes, out_planes, stride=1):
    """1x1 convolution"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False)

# This BasicBlock is an adjusted version of the one defined in the torchvision implementation of ResNet:
# https://pytorch.org/docs/stable/_modules/torchvision/models/resnet.html
class BasicBlock(nn.Module):
    expansion = 1
    __constants__ = ['downsample']

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(BasicBlock, self).__init__()
        norm_layer = nn.BatchNorm2d
        # Both self.conv1 and self.downsample layers downsample the input when stride != 1
        self.conv1 = conv1x9(inplanes, planes, stride)
        self.bn1 = norm_layer(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv1x9(planes, planes)
        self.bn2 = norm_layer(planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        if self.downsample is not None:
            identity = self.downsample(x)
        
        out += identity
        out = self.relu(out)
        return out

class ResDavenet(nn.Module):

    def __init__(self, embedding_dim=1024):
        super(ResDavenet, self).__init__()
        
        self._norm_layer = nn.BatchNorm2d
        self.inplanes = 128
        
        self.embedding_dim = embedding_dim
        self.conv1 = nn.Conv2d(1, 128, kernel_size=(40, 1), stride=(1, 1), padding=(0, 0))
        self.relu = nn.ReLU(inplace=True)
        self.batchnorm1 = self._norm_layer(self.inplanes)

        self.stack1 = self._make_residual_block(BasicBlock, 128, 2, stride=2)
        self.stack2 = self._make_residual_block(BasicBlock, 256, 2, stride=2)
        self.stack3 = self._make_residual_block(BasicBlock, 512, 2, stride=2)
        self.stack4 = self._make_residual_block(BasicBlock, 1024, 2, stride=2)
        
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

    def _make_residual_block(self, block, planes, blocks, stride=1):
        
        norm_layer = self._norm_layer
        downsample = None
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                conv1x1(self.inplanes, planes * block.expansion, stride),
                norm_layer(planes * block.expansion),
            )

        layers = [block(self.inplanes, planes, stride, downsample)]
        self.inplanes = planes * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.inplanes, planes))

        return nn.Sequential(*layers)

    def _forward(self, x):
        if x.dim() == 3:
            x = x.unsqueeze(1)
        x = self.conv1(x)
        x = self.relu(x)
        x = self.batchnorm1(x)
        
        x = self.stack1(x)
        x = self.stack2(x)
        x = self.stack3(x)
        x = self.stack4(x)
        
        x = x.squeeze(2)
        return x
    
    # Allow the forward method to be accessed in an inherited class
    forward = _forward
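
The parameter counts of these layers can be checked by hand, which is a quick way to confirm that the 1x9 and 1x1 convolutions are wired as intended. The sketch below uses the usual formula for a 2-D convolution (input channels times output channels times kernel area, plus one bias per output channel if present); the helper name `conv_params` is made up for this example.

```python
def conv_params(in_ch, out_ch, kernel_h, kernel_w, bias):
    """Number of trainable parameters in a 2-D convolution."""
    return in_ch * out_ch * kernel_h * kernel_w + (out_ch if bias else 0)

stem = conv_params(1, 128, 40, 1, bias=True)          # self.conv1
block_conv = conv_params(128, 128, 1, 9, bias=False)  # conv1x9(128, 128)
shortcut = conv_params(128, 128, 1, 1, bias=False)    # conv1x1(128, 128)
batchnorm = 2 * 128                                   # one weight and one bias per channel
print(stem, block_conv, shortcut, batchnorm)
```

These values (5,248, 147,456, 16,384 and 256) agree with the first rows of the `summary(resDAVE, (40, 1024))` output later in this notebook.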
    

In [39]:
resDAVE = ResDavenet()
resDAVE.cuda()

ResDavenet(
  (conv1): Conv2d(1, 128, kernel_size=(40, 1), stride=(1, 1))
  (relu): ReLU(inplace=True)
  (batchnorm1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (stack1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(128, 128, kernel_size=(1, 9), stride=(2, 2), padding=(0, 4), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(1, 9), stride=(1, 1), padding=(0, 4), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(128, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(128, 128, kernel_size=(1, 9), stride=(1, 1), padding=(0, 4), bias=False)
      (bn1)

In [40]:
summary(resDAVE, (40, 1024))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1         [-1, 128, 1, 1024]           5,248
              ReLU-2         [-1, 128, 1, 1024]               0
       BatchNorm2d-3         [-1, 128, 1, 1024]             256
            Conv2d-4          [-1, 128, 1, 512]         147,456
       BatchNorm2d-5          [-1, 128, 1, 512]             256
              ReLU-6          [-1, 128, 1, 512]               0
            Conv2d-7          [-1, 128, 1, 512]         147,456
       BatchNorm2d-8          [-1, 128, 1, 512]             256
              ReLU-9          [-1, 128, 1, 512]               0
           Conv2d-10          [-1, 128, 1, 512]          16,384
      BatchNorm2d-11          [-1, 128, 1, 512]             256
             ReLU-12          [-1, 128, 1, 512]               0
       BasicBlock-13          [-1, 128, 1, 512]               0
           Conv2d-14          [-1, 128,