# Memory footprint estimation

The calculations below estimate roughly how much GPU memory is required to run the [DAVEnet model](https://github.com/dharwath/DAVEnet-pytorch) (Harwath et al. 2018).

In [1]:
# Make the DAVEnet repository's models folder importable (adjust the path to your own checkout)
import sys
sys.path.append('/m/home/home4/44/virkkua1/unix/PlacesAudio_project/DAVEnet')

import torch
import torchvision
from torchsummary import summary

import models

The model consists of two CNNs: an audio branch and an image branch. The audio branch is a 5-layer CNN that generates audio embeddings from spoken captions. The image branch is a standard VGG16 in which the final maxpool is replaced with a 2D convolutional layer.

In [2]:
audio_model = models.Davenet()
audio_model.cuda()

Davenet(
  (batchnorm1): BatchNorm2d(1, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (conv1): Conv2d(1, 128, kernel_size=(40, 1), stride=(1, 1))
  (conv2): Conv2d(128, 256, kernel_size=(1, 11), stride=(1, 1), padding=(0, 5))
  (conv3): Conv2d(256, 512, kernel_size=(1, 17), stride=(1, 1), padding=(0, 8))
  (conv4): Conv2d(512, 512, kernel_size=(1, 17), stride=(1, 1), padding=(0, 8))
  (conv5): Conv2d(512, 1024, kernel_size=(1, 17), stride=(1, 1), padding=(0, 8))
  (pool): MaxPool2d(kernel_size=(1, 3), stride=(1, 2), padding=(0, 1), dilation=1, ceil_mode=False)
)
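The kernel sizes, strides and paddings printed above determine the output shape of every layer. As a rough sketch, the shapes can be traced with the standard output-size formula for convolution and pooling layers (dilation 1, floor rounding). The layer order used here (each of conv3, conv4 and conv5 preceded by a pooling step, plus one final pooling step) is an assumption inferred from the activation sizes used later in this notebook, not something the printout itself states. The names `out_size` and `trace_audio_shapes` are made up for this illustration.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Output length along one axis of a conv or max-pool layer (dilation 1, floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


# Hyperparameters copied from the printout: (out_channels, kernel, stride, padding)
CONVS = {
    "conv1": (128, (40, 1), (1, 1), (0, 0)),
    "conv2": (256, (1, 11), (1, 1), (0, 5)),
    "conv3": (512, (1, 17), (1, 1), (0, 8)),
    "conv4": (512, (1, 17), (1, 1), (0, 8)),
    "conv5": (1024, (1, 17), (1, 1), (0, 8)),
}
POOL = ((1, 3), (1, 2), (0, 1))  # kernel, stride, padding

# ASSUMPTION: forward order inferred from the activation sizes, not from the printout.
ORDER = ("conv1", "conv2", "pool", "conv3", "pool", "conv4", "pool", "conv5", "pool")


def trace_audio_shapes(height=40, width=1024):
    """Return [(layer name, (channels, height, width)), ...] for one input sample."""
    shapes = []
    channels = 1  # single-channel spectrogram input; batch norm keeps the shape unchanged
    for name in ORDER:
        if name == "pool":
            kernel, stride, pad = POOL  # pooling keeps the channel count
        else:
            channels, kernel, stride, pad = CONVS[name]
        height = out_size(height, kernel[0], stride[0], pad[0])
        width = out_size(width, kernel[1], stride[1], pad[1])
        shapes.append((name, (channels, height, width)))
    return shapes


for layer_name, shape in trace_audio_shapes():
    print(f"{layer_name:>5}: {shape}")
```

Under these assumptions, the time axis is halved by each pooling step (1024, 512, 256, 128, 64), while conv1 collapses the 40 frequency bins into a single row.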

In [3]:
image_model = models.VGG16(pretrained=True)
image_model.cuda()

VGG16(
  (image_model): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace)
    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace)
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU(inplace)
    (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): ReLU(inplace)
    (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace)
    (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (13): ReLU(inplace)
    (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): ReLU(inplace)
    (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (17): C

Parameter counts are computed by the torchsummary module, which produces Keras-like network summaries.

In [4]:
# The second argument is the input size of a single sample (without the batch dimension).
summary(image_model, (3, 224, 224))
summary(audio_model, (40, 1024))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1         [-1, 64, 224, 224]           1,792
              ReLU-2         [-1, 64, 224, 224]               0
            Conv2d-3         [-1, 64, 224, 224]          36,928
              ReLU-4         [-1, 64, 224, 224]               0
         MaxPool2d-5         [-1, 64, 112, 112]               0
            Conv2d-6        [-1, 128, 112, 112]          73,856
              ReLU-7        [-1, 128, 112, 112]               0
            Conv2d-8        [-1, 128, 112, 112]         147,584
              ReLU-9        [-1, 128, 112, 112]               0
        MaxPool2d-10          [-1, 128, 56, 56]               0
           Conv2d-11          [-1, 256, 56, 56]         295,168
             ReLU-12          [-1, 256, 56, 56]               0
           Conv2d-13          [-1, 256, 56, 56]         590,080
             ReLU-14          [-1, 256,
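As a quick cross-check of the `Param #` column, the parameter count of a convolutional layer can be computed by hand as kernel height × kernel width × input channels × output channels, plus one bias per output channel. The sketch below does this for the first six VGG16 convolutions listed above. The helper name `conv2d_params` is hypothetical and not part of torchsummary.

```python
def conv2d_params(in_channels, out_channels, kernel_h, kernel_w, bias=True):
    """Number of trainable parameters in a 2D convolution layer."""
    weights = in_channels * out_channels * kernel_h * kernel_w
    return weights + (out_channels if bias else 0)


# (in_channels, out_channels) of the first six 3x3 convolutions in the summary
vgg_convs = [(3, 64), (64, 64), (64, 128), (128, 128), (128, 256), (256, 256)]

counts = [conv2d_params(c_in, c_out, 3, 3) for c_in, c_out in vgg_convs]
print(counts)  # [1792, 36928, 73856, 147584, 295168, 590080]
```

These values agree with the Conv2d rows of the summary. The ReLU and MaxPool2d rows show 0 because those layers have no trainable parameters.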

To estimate the total memory footprint of a model, we need to add together the input size, the forward *and* backward pass sizes (i.e. the feature maps and their gradients), and the size of the parameters (weights and biases). As seen above, torchsummary can do this for us. To understand where the numbers come from, let's do a sanity check and compute the estimates for the audio network ourselves.

The input size is the product of the input dimensions. For the forward pass size, the output dimensions of each layer are multiplied together and the resulting products are summed. The backward pass size equals the forward pass size, so the forward pass size is simply multiplied by two.

The number of weights in a layer is the product of the kernel height, kernel width, input depth and output depth (depth being the number of feature maps, or channels). If the layer has a bias vector, one value per output channel is added on top. Parameters are only counted for layers that have them, in this case the convolutional and batch norm layers, and the per-layer counts are then summed. Finally, to convert each result to megabytes, it is multiplied by 4 bytes (values are presumably stored as float32, so each value takes 32 bits, i.e. 4 bytes) and divided by $1024^2$ (torchsummary uses binary units).

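$$\text{Input:} \ \ \frac{40 \times 1024 \times 4 \ \text{B}}{1024^2} = 0.15625 \ \text{MB} \approx 0.16 \ \text{MB}$$

$$\text{Forward+backward pass:} \ \ \frac{40 \times 1024 + 128 \times 1024 + 256 \times 1024 + 256 \times 512 + 512^2 + 2 \times 512 \times 256 + 512 \times 128 + 1024 \times 128 + 1024 \times 64}{1024^2} \times 4 \ \text{B} \times 2 = 10.3125 \ \text{MB} \approx 10.31 \ \text{MB}$$

$$\text{Parameters:} \ \ \frac{2 + 1 \times 128 \times 40 + 128 + 128 \times 256 \times 11 + 256 + 256 \times 512 \times 17 + 512 + 512^2 \times 17 + 512 + 512 \times 1024 \times 17 + 1024}{1024^2} \times 4 \ \text{B} \approx 60.90 \ \text{MB}$$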
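The three estimates above can also be reproduced with a few lines of plain Python. This is only a sketch of the arithmetic: the per-layer sizes are copied from the formulas rather than read from the model, and the 4 bytes per value assume float32 storage, as noted above. The variable names are chosen for this example.

```python
MIB = 1024 ** 2       # torchsummary divides by 1024^2
BYTES_PER_VALUE = 4   # ASSUMPTION: values are stored as float32

input_values = 40 * 1024

# Activations per layer output, in the same order as the terms in the formula above
activation_values = [
    40 * 1024,
    128 * 1024,
    256 * 1024,
    256 * 512,
    512 * 512,
    2 * 512 * 256,
    512 * 128,
    1024 * 128,
    1024 * 64,
]

# Parameters per layer: weights plus biases
param_values = [
    2,                        # batch norm: one scale and one shift value
    1 * 128 * 40 + 128,       # conv1
    128 * 256 * 11 + 256,     # conv2
    256 * 512 * 17 + 512,     # conv3
    512 * 512 * 17 + 512,     # conv4
    512 * 1024 * 17 + 1024,   # conv5
]

input_mb = input_values * BYTES_PER_VALUE / MIB
forward_backward_mb = 2 * sum(activation_values) * BYTES_PER_VALUE / MIB
params_mb = sum(param_values) * BYTES_PER_VALUE / MIB

print(f"Input:            {input_mb:.2f} MB")
print(f"Forward+backward: {forward_backward_mb:.2f} MB")
print(f"Parameters:       {params_mb:.2f} MB ({sum(param_values):,} values)")
```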

However, these estimates do not take batch size into account. When computing on a GPU, one whole batch is loaded into memory at a time. With a batch size of 10, for example, the input size must be multiplied by 10. The same applies to the forward and backward pass sizes, because the activation and gradient maps are different for each item in a batch. The parameters are shared across all inputs, so the parameter size stays the same.
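The batch scaling rule can be written as a one-line function. The sketch below applies it to the per-sample audio estimates derived earlier. The helper name `total_size_mb` is hypothetical, and the result is only as good as the float32 assumption behind those estimates.

```python
def total_size_mb(input_mb, forward_backward_mb, params_mb, batch_size=1):
    """Estimated footprint: per-sample terms scale with the batch, parameters do not."""
    return batch_size * (input_mb + forward_backward_mb) + params_mb


# Per-sample audio estimates from the formulas above
audio_input_mb = 0.15625
audio_forward_backward_mb = 10.3125
audio_params_mb = 15965570 * 4 / 1024 ** 2  # about 60.90 MB

audio_total_mb = total_size_mb(
    audio_input_mb, audio_forward_backward_mb, audio_params_mb, batch_size=128
)
audio_total_gb = audio_total_mb / 1024

print(f"{audio_total_mb:.2f} MB = {audio_total_gb:.2f} GB")  # 1400.90 MB = 1.37 GB
```

Only the input and activation terms grow with the batch size; the roughly 60.90 MB of parameters is counted once.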

In the DAVEnet paper, the authors report training the model with a batch size of 128. Passing the `batch_size` argument to `summary()` gives an estimate of

$$28273.64 \ \text{MB} = \frac{28273.64}{1024} \ \text{GB} \approx 27.61 \ \text{GB}$$

for the image model and

$$1400.90 \ \text{MB} = \frac{1400.90}{1024} \ \text{GB} \approx 1.37 \ \text{GB}$$

for the audio model. 

In [5]:
summary(image_model, (3, 224, 224), batch_size=128)
summary(audio_model, (40, 1024), batch_size=128)

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1        [128, 64, 224, 224]           1,792
              ReLU-2        [128, 64, 224, 224]               0
            Conv2d-3        [128, 64, 224, 224]          36,928
              ReLU-4        [128, 64, 224, 224]               0
         MaxPool2d-5        [128, 64, 112, 112]               0
            Conv2d-6       [128, 128, 112, 112]          73,856
              ReLU-7       [128, 128, 112, 112]               0
            Conv2d-8       [128, 128, 112, 112]         147,584
              ReLU-9       [128, 128, 112, 112]               0
        MaxPool2d-10         [128, 128, 56, 56]               0
           Conv2d-11         [128, 256, 56, 56]         295,168
             ReLU-12         [128, 256, 56, 56]               0
           Conv2d-13         [128, 256, 56, 56]         590,080
             ReLU-14         [128, 256,