### This Jupyter notebook compares the performance of a naive convolution (using loops), `np.einsum` with NumPy stride tricks, and TensorFlow's convolution.

In [1]:
import numpy as np
import tensorflow as tf

tf.enable_eager_execution()

##### 1. I start with a small input X for the convolutional layer. X has shape (batch_size, input_height, input_width, input_channel).

In [55]:
X = np.random.normal(size=(2, 12, 10, 3))

#### 2. The kernel has shape (kernel_height, kernel_width, input_channel, output_channel). The cell below sets stride = 2 and padding = "VALID"; note that the loop-based `naive_conv` in section 4 only handles stride = 1.

In [67]:
kernel_height = 3
kernel_width = 3
output_channel = 16
stride = 2
padding = "VALID"

In [48]:
kernel = np.random.normal(size=(kernel_height, kernel_width, 3, output_channel))

#### 3. Define a function that splits X for use with NumPy's einsum

What does this function do? It computes the height and width of the output from the filter size and stride (assuming "VALID" padding), then turns the 4D input of shape `(batch_size, input_height, input_width, input_channel)` into a 6D view of shape `(batch_size, output_height, output_width, filter_height, filter_width, input_channel)`. No data is copied: the view only reinterprets the strides of X. This step is what makes convolution with einsum possible.
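
As a quick check of the shape arithmetic described above, the sketch below computes the output height and width for "VALID" padding. The helper name `valid_output_size` is hypothetical and not part of the notebook; it assumes the usual floor-division formula.

```python
def valid_output_size(input_size, filter_size, stride):
    """Number of filter positions along one axis with "VALID" padding."""
    return (input_size - filter_size) // stride + 1

# Input from section 1: height 12, width 10; 3x3 kernel, stride 2.
out_h = valid_output_size(12, 3, 2)
out_w = valid_output_size(10, 3, 2)
print(out_h, out_w)  # 5 4
```

With these values, the 6D view of an input of shape (2, 12, 10, 3) would have shape (2, 5, 4, 3, 3, 3).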

In [49]:
def _split_X(X, filter_size, stride):
    """
    Preprocess input X to avoid for-loops: return a read-only 6D view of
    shape (batch, out_height, out_width, filter_height, filter_width, channel).
    """
    m, iH, iW, iC = X.shape
    fH, fW = filter_size
    # Output size for "VALID" padding.
    oH = (iH - fH) // stride + 1
    oW = (iW - fW) // stride + 1
    batch_strides, height_strides, width_strides, channel_strides = X.strides
    view_shape = (m, oH, oW, fH, fW, iC)
    # One output step skips `stride` input rows/columns; one step inside a window skips one.
    view_strides = (batch_strides, stride*height_strides, stride*width_strides,
                    height_strides, width_strides, channel_strides)
    return np.lib.stride_tricks.as_strided(X, shape=view_shape, strides=view_strides, writeable=False)
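
To see that the strided view really contains the sliding windows, the following sketch compares one window of the view with the corresponding slice of the input. `split_windows` is a hypothetical stand-in that mirrors `_split_X` so the block runs on its own; the seed and the chosen window position are arbitrary.

```python
import numpy as np

def split_windows(X, filter_size, stride):
    """Hypothetical stand-in for _split_X: read-only 6D view of sliding windows."""
    m, iH, iW, iC = X.shape
    fH, fW = filter_size
    oH = (iH - fH) // stride + 1
    oW = (iW - fW) // stride + 1
    sb, sh, sw, sc = X.strides  # byte strides of the (batch, height, width, channel) axes
    return np.lib.stride_tricks.as_strided(
        X,
        shape=(m, oH, oW, fH, fW, iC),
        strides=(sb, stride * sh, stride * sw, sh, sw, sc),
        writeable=False,
    )

rng = np.random.default_rng(0)
X_small = rng.normal(size=(2, 12, 10, 3))
windows = split_windows(X_small, (3, 3), 2)

# Window (p, q) should equal the 3x3 patch whose top-left corner is (2p, 2q).
p, q = 4, 3
patch = X_small[:, 2 * p:2 * p + 3, 2 * q:2 * q + 3, :]
windows_match = np.array_equal(windows[:, p, q], patch)
print(windows.shape, windows_match)
```

Because `as_strided` does no bounds checking, a comparison like this is a cheap way to catch a wrong stride before it silently reads unrelated memory.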

In [50]:
import torch

In [51]:
def _split_X_torch(X, filter_size, stride):
    """
    Preprocess input X to avoid for-loop.
    """
    m, iH, iW, iC = X.shape
    fH, fW = filter_size
    oH = (iH - fH) // stride + 1
    oW = (iW - fW) // stride + 1
    view_shape = (m, oH, oW, fH, fW, iC)
    # from_numpy keeps the float64 dtype; torch strides are counted in elements, not bytes.
    X = torch.from_numpy(np.ascontiguousarray(X))
    # For a contiguous (m, iH, iW, iC) tensor, one row step is iW*iC elements (not iH*iC).
    X = torch.as_strided(X, size=view_shape, stride=(iH*iW*iC, iW*iC*stride, iC*stride, iW*iC, iC, 1))
    return X

In [52]:
X1 = _split_X_torch(X, (3, 3), 2)

In [57]:
X2 = _split_X(X, (3, 3), 2)

In [44]:
np.allclose(X1.numpy(), X2)

True

#### 4. Intuitive approach using for loops

In [9]:
def naive_conv(X, kernel):
    """Direct convolution with explicit loops (stride = 1, "VALID" padding)."""
    m, iH, iW, iC = X.shape
    fH, fW, iC, fC = kernel.shape
    oH = iH - fH + 1
    oW = iW - fW + 1
    out = np.zeros(shape=(m, oH, oW, fC))
    for f in range(fC):
        for i in range(m):
            for j in range(oH):
                for k in range(oW):
                    # Elementwise product of one window with filter f, summed to a scalar.
                    out[i, j, k, f] = np.sum(X[i, j:j+fH, k:k+fW, :]*kernel[:, :, :, f])
    return out
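
The loop version above assumes stride = 1. As a minimal sketch of how the same idea could be extended to other strides, the hypothetical `naive_conv_strided` below multiplies the window indices by the stride. It is an illustration under that assumption, not the function timed in this notebook.

```python
import numpy as np

def naive_conv_strided(X, kernel, stride=1):
    """Loop-based "VALID" convolution with a stride (illustrative only)."""
    m, iH, iW, iC = X.shape
    fH, fW, _, fC = kernel.shape
    oH = (iH - fH) // stride + 1
    oW = (iW - fW) // stride + 1
    out = np.zeros((m, oH, oW, fC))
    for i in range(m):
        for j in range(oH):
            for k in range(oW):
                r, c = j * stride, k * stride  # top-left corner of the window
                patch = X[i, r:r + fH, c:c + fW, :]
                for f in range(fC):
                    out[i, j, k, f] = np.sum(patch * kernel[:, :, :, f])
    return out

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2, 12, 10, 3))
kernel_demo = rng.normal(size=(3, 3, 3, 16))
out_s1 = naive_conv_strided(X_demo, kernel_demo, stride=1)
out_s2 = naive_conv_strided(X_demo, kernel_demo, stride=2)
print(out_s1.shape, out_s2.shape)  # (2, 10, 8, 16) (2, 5, 4, 16)
```

With "VALID" padding, the stride-2 result is simply every second row and column of the stride-1 result, which gives an easy consistency check.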

#### 5. Convolution with einsum of numpy

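[np.einsum](https://docs.scipy.org/doc/numpy/reference/generated/numpy.einsum.html) evaluates Einstein-summation expressions. Here it multiplies every window of the 6D view by the kernel and sums over the filter-height, filter-width and input-channel axes in a single call.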

In [68]:
def einsum_conv(X, kernel):
    # Take the filter size from the kernel itself; `stride` is the global set in section 2.
    X = _split_X(X, kernel.shape[:2], stride)
    return np.einsum("bwhijk,ijkl->bwhl", X, kernel)
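
To make the subscript string concrete, the sketch below applies the same einsum expression to random stand-in arrays and checks it against two equivalent formulations: `np.tensordot` over the last three axes of the windows, and a hand-computed sum for a single output element. The array names and shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the 6D view: (batch, out_h, out_w, filter_h, filter_w, in_channel)
windows = rng.normal(size=(2, 5, 4, 3, 3, 3))
# Kernel: (filter_h, filter_w, in_channel, out_channel)
kernel_demo = rng.normal(size=(3, 3, 3, 16))

# i, j, k appear in both operands but not in the output, so they are summed over.
out_einsum = np.einsum("bwhijk,ijkl->bwhl", windows, kernel_demo)

# Same contraction written with tensordot.
out_tensordot = np.tensordot(windows, kernel_demo, axes=([3, 4, 5], [0, 1, 2]))

# One output element computed by hand: window (1, 2, 3) against filter 7.
manual = np.sum(windows[1, 2, 3] * kernel_demo[..., 7])

print(out_einsum.shape)  # (2, 5, 4, 16)
```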

In [69]:
def tf_conv(X, kernel):
    with tf.device("/cpu:0"):
        return tf.nn.conv2d(X, kernel, strides=stride, padding="VALID")

In [63]:
%timeit naive_conv(X, kernel)

NameError: name 'naive_conv' is not defined

In [64]:
%timeit einsum_conv(X, kernel)

141 µs ± 1.05 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [65]:
%timeit tf_conv(X, kernel)

199 µs ± 57.1 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [15]:
np.allclose(naive_conv(X, kernel), tf_conv(X, kernel))

True

In [70]:
np.allclose(einsum_conv(X, kernel), tf_conv(X, kernel))

True

### Try for a larger X

In [17]:
X = np.random.normal(size=(32, 28, 28, 3))

In [18]:
%timeit naive_conv(X, kernel)

1.75 s ± 34.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [19]:
%timeit einsum_conv(X, kernel)

18.1 ms ± 90.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [20]:
%timeit tf_conv(X, kernel)

1.74 ms ± 47.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


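##### We can see that the einsum convolution (18.1 ms) is about 100 times faster than the naive convolution written with pure Python loops (1.75 s), while TensorFlow's convolution (1.74 ms) is about 10 times faster still.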