<a id='appendix'></a>
## Appendix
### Weights and Flops estimation
The following formulas estimate the number of weights and operations needed to train fully-connected and convolutional neural networks. Training consists of forward propagation, backpropagation of the error, and gradient computation. Let us consider fully-connected networks first:
  - Weights: inputs times outputs for each layer, plus biases: w = Sum(mn + n), where m is the number of inputs and n the number of outputs of a layer
  - Training 
    - Forward: vector-matrix multiplication for each layer, bias addition and activation function, ~2mn per layer, ~2w in total
    - Backward: vector-matrix and elementwise vector multiplications, ~2mn per layer, ~2w in total
    - Gradient computation: column-vector by row-vector (outer) products, ~mn per layer, ~w in total
    - Weight update: ~w, for a total of 5w + 1w = 6w
    - Counting each "multiply-add" pair as one operation: 3w
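
As a rough check of the estimates above, the following sketch tallies the per-phase counts for a list of layer sizes. The helper name `fc_training_ops` and the dictionary keys are illustrative and not part of the original notebook; the counts are the same approximations as in the list, not exact operation counts.

```python
def fc_training_ops(layers):
    """Per-phase operation estimates for one training sample (hypothetical helper)."""
    # w = sum over layers of m * n weights plus n biases
    w = sum(m * n + n for m, n in zip(layers[:-1], layers[1:]))
    ops = {
        'forward': 2 * w,   # multiply and add for every weight
        'backward': 2 * w,  # same order of work as the forward pass
        'gradient': w,      # one product per weight
        'update': w,        # one update per weight
    }
    ops['total'] = sum(ops.values())
    # A multiply-add pair counts as a single operation
    ops['multiply_add'] = ops['total'] // 2
    return w, ops

w, ops = fc_training_ops([784, 10])
print(w, ops['total'], ops['multiply_add'])  # 7850 47100 23550
```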

In [1]:
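def weightsAndFlops(layers):
    # layers lists the layer sizes, input first, e.g. [784, 10]
    w = 0
    for i in range(1, len(layers)):
        # inputs * outputs weights plus one bias per output
        w = w + layers[i - 1] * layers[i] + layers[i]
    # training cost is estimated as 3w multiply-add operations
    return (w, 3 * w)
# 784 * 10 weights + 10 biases
assert(weightsAndFlops([784, 10])[0] == 7850)
# MSFT model [3]
assert(round(weightsAndFlops([39 * 11, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 9304])[0] / 1E5, 0) * 1E5 == 45.1E6)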
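
To see where the weights of such a network sit, the total can be broken down per layer. This is only a sketch: `fc_layer_weights` is a hypothetical helper, and the layer sizes are copied from the assertion above.

```python
def fc_layer_weights(layers):
    """Weights (including biases) of each layer; hypothetical helper."""
    return [m * n + n for m, n in zip(layers[:-1], layers[1:])]

# Same layer sizes as the model from [3] used in the assertion above
sizes = [39 * 11] + [2048] * 7 + [9304]
per_layer = fc_layer_weights(sizes)
total = sum(per_layer)
for i, count in enumerate(per_layer, start=1):
    print("layer %d: %10d weights (%4.1f%%)" % (i, count, 100.0 * count / total))
print("total: %d" % total)
```

With these sizes the output layer alone accounts for roughly 42% of the weights.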

Convolutional neural networks:
  - Weights: (kernel size times input depth, plus bias) times the number of feature maps
  - Training
     - Forward: dot product of input and weights, summation of the result and bias addition, for each feature map and output neuron
     - Backward: similar to the forward pass
     - Gradient computation: similar to the forward pass
     - Total in "multiply-add" operations: 3f, where f is the forward count; weights are shared across output positions, so f is larger than w
     
Bias adds many more weights to convolutional networks, so recent papers omit it, provided the data is normalized.
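
To make the effect of bias concrete, here is a small sketch for a single convolutional layer. It assumes, as the code in this appendix does, one bias per output position of each feature map (c * c biases per map); a layer with one shared bias per map would add far fewer. The function name `conv_layer_counts` and its parameters are illustrative.

```python
def conv_layer_counts(n, k, d, c, use_bias):
    """Weights and forward multiply-adds of one conv layer (hypothetical helper).

    n: number of feature maps, k: kernel width/height,
    d: input depth seen by each kernel, c: output width/height.
    """
    # Assumption: one bias per output position of each map
    bias = c * c if use_bias else 0
    weights = n * (k * k * d + bias)
    # Each of the n * c * c outputs is a dot product of k * k * d terms
    forward_macs = n * c * c * k * k * d
    return weights, forward_macs

# First layer of the 16x16 network from the unit test:
# 12 maps, 5x5 kernels, depth 1, 8x8 outputs
with_bias = conv_layer_counts(12, 5, 1, 8, True)
without_bias = conv_layer_counts(12, 5, 1, 8, False)
print(with_bias, without_bias)  # (1068, 19200) (300, 19200)
```

In this layer the biases make up 768 of the 1068 weights, while the forward multiply-add count is unchanged.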

In [2]:
import numpy as np
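def computeConvOutput(data, layer):
    # data is [width, height, depth]; square inputs and kernels are assumed
    n = layer['count']
    k = layer['size']
    b = layer['border']  # total padding, both sides combined
    s = layer['stride']
    c = (data[0] - k + b) // s + 1
    return [c, c, n]
def computePoolOutput(data, layer):
    # pooling keeps the depth and shrinks width and height
    n = data[2]
    k = layer['size']
    b = layer['border']
    s = layer['stride']
    c = (data[0] - k + b) // s + 1
    return [c, c, n]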
def cnnWeightsAndFlops(inputLayer, layers, useBias):
    w = 0.0
    f = 0.0
    inputData = inputLayer
    for i in range(0, len(layers)):
        layer = layers[i]
        if (layer['type'] == 'conv'):
            n = layer['count']
            k = layer['size']
            d = layer['depth']
            inputData = computeConvOutput(inputData, layer)
            c = inputData[0]
            bw = 0
            if (useBias): 
                bw = c * c
            w = w + n * (k * k * d + bw)
            f = f + n * (k * k * d * c * c)
        if (layer['type'] == 'pool'):
            n = inputData[2]
            inputData = computePoolOutput(inputData, layer)
        if (layer['type'] == 'full'):
            k = layer['size']
            inputFlat = np.prod(inputData)  # np.product was removed in NumPy 2.0
            bw = 0
            if (useBias):
                bw = k
            w = w + k * inputFlat + bw
            f = f + k * inputFlat
            inputData = [k]
    return (w, 3 * f)

Unit test

In [3]:
# LeCun et al. 1989 (1068 + 2592 + 5790 + 310 == 9760 weights)
assert cnnWeightsAndFlops([16, 16, 1], [{'type' : 'conv', 'count' : 12, 'size' : 5, 'depth' : 1, 'stride': 2, 'border' : 3}], True)[0] == 1068
assert cnnWeightsAndFlops([16, 16, 1], [{'type' : 'conv', 'count' : 12, 'size' : 5, 'depth' : 1, 'stride': 2, 'border' : 3},
                         {'type' : 'conv', 'count' : 12, 'size' : 5, 'depth' : 8, 'stride': 2, 'border' : 3},
                         {'type' : 'full', 'size' : 30},
                         {'type' : 'full', 'size' : 10}], True)[0] == 9760
# Krizhevsky et al. 2012, AlexNet, see also https://github.com/BVLC/caffe/blob/master/models/bvlc_alexnet/deploy.prototxt
assert(computeConvOutput([227, 227, 3], {'type' : 'conv', 'count' : 96, 'size' : 11, 'depth' : 3, 'stride': 4, 'border' : 0}) == [55, 55, 96])
assert(computePoolOutput([55, 55, 96], {'type' : 'pool', 'size' : 3, 'stride': 2, 'border' : 0}) == [27, 27, 96])
assert(computeConvOutput([27, 27, 96], {'type' : 'conv', 'count' : 256, 'size' : 5, 'depth' : 48, 'stride': 1, 'border' : 4}) == [27, 27, 256])
assert(computePoolOutput([27, 27, 256], {'type' : 'pool', 'size' : 3, 'stride': 2, 'border' : 0}) == [13, 13, 256])
assert(computeConvOutput([13, 13, 256], {'type' : 'conv', 'count' : 384, 'size' : 3, 'depth' : 256, 'stride': 1, 'border' : 2}) == [13, 13, 384])
assert(computeConvOutput([13, 13, 384], {'type' : 'conv', 'count' : 384, 'size' : 3, 'depth' : 192, 'stride': 1, 'border' : 2}) == [13, 13, 384])
assert(computeConvOutput([13, 13, 384], {'type' : 'conv', 'count' : 256, 'size' : 3, 'depth' : 192, 'stride': 1, 'border' : 2}) == [13, 13, 256])
assert(computePoolOutput([13, 13, 256], {'type' : 'pool', 'size' : 3, 'stride': 2, 'border' : 0}) == [6, 6, 256])
# Example AlexNet from http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf
alexWeights, alexFlops = cnnWeightsAndFlops([227, 227, 3], [
                           {'type' : 'conv', 'count' : 96, 'size' : 11, 'depth' : 3, 'stride': 4, 'border' : 0},
                           {'type' : 'pool', 'size' : 3, 'stride': 2, 'border' : 0},
                           {'type' : 'conv', 'count' : 256, 'size' : 5, 'depth' : 48, 'stride': 1, 'border' : 4},
                           {'type' : 'pool', 'size' : 3, 'stride': 2, 'border' : 0},
                           {'type' : 'conv', 'count' : 384, 'size' : 3, 'depth' : 256, 'stride': 1, 'border' : 2},
                           {'type' : 'conv', 'count' : 384, 'size' : 3, 'depth' : 384, 'stride': 1, 'border' : 2}, # depth is 192 in paper
                           {'type' : 'conv', 'count' : 256, 'size' : 3, 'depth' : 192, 'stride': 1, 'border' : 2},
                           {'type' : 'pool', 'size' : 3, 'stride': 2, 'border' : 0}, 
                           {'type' : 'full', 'size' : 4096},
                           {'type' : 'full', 'size' : 4096},
                           {'type' : 'full', 'size' : 1000}
                          ], False)
# weights ~ 60M, forward pass ~ 832M Flops
# (alexFlops is 3x the forward pass, hence the division by 3 below)
assert(round(alexWeights / 1E7) * 1E7 == 60E6)
assert(round(alexFlops / 3E8) * 1E8 == 8E8)


## References
  1. Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012. (slides: http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf)
  2. Szegedy, Christian, et al. "Rethinking the Inception Architecture for Computer Vision." arXiv preprint arXiv:1512.00567 (2015).
  3. Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
  4. Iandola, Forrest N., et al. "FireCaffe: near-linear acceleration of deep neural network training on compute clusters." arXiv preprint arXiv:1511.00175 (2015).
  5. Chen, Jianmin, et al. "Revisiting Distributed Synchronous SGD." arXiv preprint arXiv:1604.00981 (2016).
  6. Chapelle, Olivier, Eren Manavoglu, and Romer Rosales. "Simple and scalable response prediction for display advertising." ACM Transactions on Intelligent Systems and Technology (TIST) 5.4 (2015): 61.
  