### Different Types of Convolutions

#### Dilated Convolutions

* Dilated convolutions introduce another parameter to convolutional layers called the **dilation rate**. 
* This defines a spacing between the values in a kernel. 
* A 3x3 kernel with a dilation rate of 2 will have the same field of view as a 5x5 kernel, while only using 9 parameters. 
* Imagine taking a 5x5 kernel and deleting every second column and row.
* This delivers a wider field of view at the same computational cost as a regular 3x3 convolution. 
* Dilated convolutions are particularly popular in the field of real-time segmentation. 
* Use them if you need a wide field of view and cannot afford multiple convolutions or larger kernels (due to computational costs).
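
The spacing idea can be sketched in a few lines of plain Python. This is an illustration only, not a library API: `effective_kernel_size` and `dilate_kernel` are hypothetical helper names, and the kernel is assumed to be square. The dilated kernel is written out as a larger kernel with zeros in the gaps purely to make the field of view visible; the number of actual weights stays at 9.

```python
def effective_kernel_size(k, dilation):
    """Side length of the input window covered by a k x k kernel."""
    return k + (k - 1) * (dilation - 1)


def dilate_kernel(kernel, dilation):
    """Spread the weights of a square kernel out, leaving zeros in the gaps."""
    k = len(kernel)
    size = effective_kernel_size(k, dilation)
    dilated = [[0] * size for _ in range(size)]
    for i in range(k):
        for j in range(k):
            dilated[i * dilation][j * dilation] = kernel[i][j]
    return dilated


kernel = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
dilated = dilate_kernel(kernel, dilation=2)
for row in dilated:
    print(row)
print(effective_kernel_size(3, 2))  # 5: same field of view as a 5x5 kernel
```

Every second row and column of the printed 5x5 grid is zero, which mirrors the "delete every second column and row" picture from the list above.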

#### Example: A dilated convolution with a dilation rate of 2 and no padding
* Blue maps are inputs and cyan maps are outputs
<img src = '../pics/dilatedconv3x3rate2nopad.gif'>

#### Transposed Convolution (sometimes called deconvolution or fractionally strided convolution)

* A transposed convolution produces the same spatial resolution a hypothetical deconvolutional layer would. 
* However, the actual mathematical operation that’s being performed on the values is different. 
* A transposed convolutional layer carries out a regular convolution but reverts its spatial transformation.

Example:
* An image of 5x5 is fed into a convolutional layer. 
* The stride is set to 2, the padding is deactivated and the kernel is 3x3. 
* This results in a 2x2 image.

#### Example: Transposed 2D convolution with no padding, stride of 2 and kernel of 3
* Blue maps are inputs and cyan maps are outputs
<img src='../pics/transposedconv3x3stride2nopad.gif'>

* If we wanted to reverse this process, we’d need the inverse mathematical operation so that 9 values are generated from each pixel we input. 
* Afterward, we would traverse the output image with a stride of 2. This would be a deconvolution.

* A transposed convolution does not do that. The only thing it has in common with a deconvolution is that it also guarantees a 5x5 output, while still performing a normal convolution operation. To achieve this, we need to apply some special padding to the input.

* As you can imagine now, this step will not reverse the process from above. We could get the same shape back, but not the same numeric values.
* Instead, this just reconstructs the earlier spatial resolution and performs a convolution. 
* This may not be the mathematical inverse, but for Encoder-Decoder architectures, it’s still very helpful. This way we can combine the upscaling of an image with a convolution, instead of doing two separate processes.

* Therefore, although transposed convolutions are sometimes referred to as "deconvolutions", they shouldn't be, because they don't actually revert the process of a convolution. They are not the mathematical inverse of a convolutional layer.
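
To make the shape-versus-values point concrete, here is a minimal pure-Python sketch of the 5x5 example. The helper names `conv2d` and `conv_transpose2d` are hypothetical, and the all-ones kernel and the input values are assumptions chosen only for illustration. The transposed convolution is written in its scatter form (each input value adds a scaled copy of the kernel into the output), which should give the same result as zero-padding the stride-expanded input and running a regular convolution over it with the flipped kernel.

```python
def conv2d(image, kernel, stride=1):
    """Plain 'valid' convolution (no padding, no kernel flip) on square inputs."""
    k = len(kernel)
    n_out = (len(image) - k) // stride + 1
    return [[sum(image[i * stride + a][j * stride + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(n_out)]
            for i in range(n_out)]


def conv_transpose2d(image, kernel, stride=1):
    """Transposed convolution: every input value scatters a scaled kernel copy."""
    k = len(kernel)
    n_in = len(image)
    n_out = (n_in - 1) * stride + k
    out = [[0] * n_out for _ in range(n_out)]
    for i in range(n_in):
        for j in range(n_in):
            for a in range(k):
                for b in range(k):
                    out[i * stride + a][j * stride + b] += image[i][j] * kernel[a][b]
    return out


image = [[5 * r + c + 1 for c in range(5)] for r in range(5)]  # values 1..25
kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

down = conv2d(image, kernel, stride=2)         # 5x5 -> 2x2
up = conv_transpose2d(down, kernel, stride=2)  # 2x2 -> 5x5

print(len(down), len(down[0]))  # 2 2
print(len(up), len(up[0]))      # 5 5
print(up == image)              # False: the shape is restored, the values are not
```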

### Separable Convolutions

* In a separable convolution, we can split the kernel operation into multiple steps. 
* Let’s express a convolution as y = conv(x, k) where y is the output image, x is the input image, and k is the kernel. 
* Next, let’s assume k can be calculated by: k = k1.dot(k2). 
* This would make it a separable convolution because, instead of doing a 2D convolution with k, we could get the same result by doing two 1D convolutions with k1 and k2.

* Take the Sobel kernel, for example, which is often used in image processing. 
* You could get the same kernel by multiplying the column vector `[1,2,1].T` by the row vector `[1, 0, -1]` (an outer product). 
* This would require 6 instead of 9 parameters while doing the same operation. 
* The example above shows what’s called a "spatial separable convolution", which is rare in deep learning.
* In deep learning, one can create something very similar to a spatial separable convolution by stacking a 1xN and an Nx1 kernel layer. This was recently used in an architecture called EffNet, showing promising results.
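
The Sobel factorization can be checked directly. The sketch below is plain Python with a hypothetical helper `correlate2d` (a "valid" sliding-window sum without kernel flipping, which is what deep learning frameworks usually call a convolution); the 5x5 test image is made up for illustration. It builds the 3x3 kernel as an outer product and then compares one 2D pass against two 1D passes.

```python
def correlate2d(image, kernel):
    """'Valid' sliding-window sum of products; the kernel may be rectangular."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


col = [1, 2, 1]    # vertical smoothing
row = [1, 0, -1]   # horizontal difference

# Outer product of the column vector and the row vector.
sobel_x = [[c * r for r in row] for c in col]
print(sobel_x)  # [[1, 0, -1], [2, 0, -2], [1, 0, -1]]

image = [[(3 * r * r + 2 * c) % 11 for c in range(5)] for r in range(5)]

one_pass = correlate2d(image, sobel_x)                   # one 3x3 kernel, 9 weights
horizontal = correlate2d(image, [row])                   # 1x3 kernel
two_pass = correlate2d(horizontal, [[c] for c in col])   # 3x1 kernel

print(one_pass == two_pass)  # True
```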

#### Sobel X and Y filters
<img src = '../pics/sobelxyfilters.png'>


* More commonly in neural networks, we use something called a "depthwise separable convolution". 
* This performs a spatial convolution on each channel separately (the depthwise step) and then mixes the channels with a 1x1 convolution (the pointwise step). 
* For example, let’s say we have a 3x3 convolutional layer on 16 input channels and 32 output channels. 
* What happens in detail is that each of the 16 channels is traversed by 32 3x3 kernels, resulting in 512 (16x32) feature maps. 
* Next, we merge 1 feature map out of every input channel by adding them up. Since we can do that 32 times, we get the 32 output channels we wanted.
* For a depthwise separable convolution on the same example, we traverse the 16 channels with one 3x3 kernel each, giving us 16 feature maps. 
* Now, before merging anything, we traverse these 16 feature maps with 32 1x1 convolutions each and only then start to add them together. 
* This results in 656 (16x3x3 + 16x32x1x1) parameters as opposed to the 4608 (16x32x3x3) parameters from above.
* The example is a specific implementation of a depthwise separable convolution where the so-called depth multiplier is 1. This is by far the most common setup for such layers.
* We do this because of the hypothesis that spatial and depthwise information can be decoupled. Judging by the performance of the Xception model, this hypothesis seems to hold. 
* Depthwise separable convolutions are also used for mobile devices because of their efficient use of parameters.
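
The parameter counts above follow from a short calculation. This sketch uses hypothetical helper names and, like the text, ignores bias terms; `depth_multiplier` is included only to show where it would enter the count.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a regular k x k convolution (no bias)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k, depth_multiplier=1):
    """Weights of a per-channel k x k step followed by a 1x1 mixing step (no bias)."""
    per_channel = c_in * depth_multiplier * k * k
    mixing = c_in * depth_multiplier * c_out
    return per_channel + mixing


print(standard_conv_params(16, 32, 3))        # 4608
print(depthwise_separable_params(16, 32, 3))  # 656
```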

### Receptive Field and Feature Map Visualization
* The receptive field is defined as the region in the input space that a particular CNN feature is looking at (i.e., is affected by).
* For a convolutional neural network, the number of output features in each dimension can be calculated with the following formula:
$$n_{out} = \Big\lfloor\frac{n_{in} + 2p - k}{s}\Big\rfloor+1$$

* $n_{in}$: number of input features
* $n_{out}$: number of output features
* $k$: convolution kernel size
* $p$: convolution padding size
* $s$: convolution stride size

For the moment, we'll assume that the number of input (or output) features is counted along a single axis (one dimension) of the input (or output), where an axis can be understood as the width, the height, or the channel dimension of a color image.
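
As a quick check of the output-size formula, here is a minimal sketch in Python. The function name `conv_output_size` is hypothetical, and the division is rounded down, as is usual for this formula.

```python
import math


def conv_output_size(n_in, k, s=1, p=0):
    """Number of output features along one axis for kernel k, stride s, padding p."""
    return math.floor((n_in + 2 * p - k) / s) + 1


print(conv_output_size(5, k=3, s=2, p=0))     # 2
print(conv_output_size(5, k=3, s=2, p=1))     # 3
print(conv_output_size(227, k=11, s=4, p=0))  # 55
```

The first call reproduces the earlier 5x5 input with a 3x3 kernel, stride 2 and no padding, which yields a 2x2 output.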

### Visualizing Feature Maps

<img src='../pics/visualize_feature_map.png'>

**Left Column:** 
* The input image is a 5 x 5 matrix (blue grid). 
* Then zero-padding with size of p = 1 (transparent grid around the input image) is used to maintain the edge information during convolution. 
* After that, a 3 x 3 kernel with stride of s = 2 is used to convolve this image to obtain its feature map (green grid) with size of 3 x 3. 
* In this example, nine features are obtained and each feature has a receptive field of 3 x 3 (the area inside light blue lines). 
* We can apply the same convolution to this green grid to obtain a deeper feature map (orange grid), as shown in the bottom-left sub-figure. In the orange feature map, each feature has a 7 x 7 receptive field.
* But if we only look at a feature map (green or orange grid), we cannot directly tell which pixels a feature is looking at or how big that region is. 

<img src='../pics/visualize_feature_map.png'>

**Right Column:**
* In the right column, we have encoded the stride into the feature map.
* Thus, the size of each feature map is fixed and equals the size of the input.
* Also, each feature is located at the center of its receptive field. 
* So in this situation, it's easier to see the receptive field.

### Receptive Field Arithmetic

The receptive field for a given kernel in a particular layer can be calculated as follows:

* First, we calculate the number of output features in each dimension as above:

$$ n_{out} = \Big\lfloor\frac{n_{in}+2p-k}{s}\Big\rfloor+1$$


* Then, we calculate the _jump_ $j$ in the output feature map. The _jump_ is the distance between two adjacent features. For the original input image, _jump_ is equal to 1.
$$ j_{out} = j_{in} \cdot s$$


* Now we calculate the _size of the receptive field_ $r$ of one output feature.

$$ r_{out} = r_{in}+(k-1) \cdot j_{in}$$


* Finally, we calculate the _center position_ of the receptive field of the first output feature.
* Here, _start_ is a center coordinate; for the input image, it is the center of the first pixel.

$$ start_{out} = start_{in}+\Big(\frac{k-1}{2}-p\Big) \cdot j_{in}$$

<img src='../pics/receptive_field_computation.png'>
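
As a worked example, the four equations can be applied to the 5 x 5 input from the Left Column figure ($k = 3$, $s = 2$, $p = 1$ in both layers), starting from $n_0 = 5$, $j_0 = 1$, $r_0 = 1$ and assuming $start_0 = 0.5$ for the center of the first input pixel. The start update is written here with $(k-1)/2$, the distance from the edge of the kernel to its center:

$$
\begin{aligned}
\text{Layer 1:}\quad n_1 &= \Big\lfloor \frac{5 + 2\cdot 1 - 3}{2} \Big\rfloor + 1 = 3, & j_1 &= 1 \cdot 2 = 2, \\
r_1 &= 1 + (3-1)\cdot 1 = 3, & start_1 &= 0.5 + \Big(\frac{3-1}{2} - 1\Big)\cdot 1 = 0.5 \\[6pt]
\text{Layer 2:}\quad n_2 &= \Big\lfloor \frac{3 + 2\cdot 1 - 3}{2} \Big\rfloor + 1 = 2, & j_2 &= 2 \cdot 2 = 4, \\
r_2 &= 3 + (3-1)\cdot 2 = 7, & start_2 &= 0.5 + \Big(\frac{3-1}{2} - 1\Big)\cdot 2 = 0.5
\end{aligned}
$$

These values agree with the figure: each green feature sees a 3 x 3 region and each orange feature a 7 x 7 region.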

### Some code to generate the receptive field of any given neuron
* Each layer is described as `[filter size, stride, padding]`.
* The two filter dimensions are assumed to be equal (square filters).
* Each kernel requires the following parameters:
    - k_i: kernel size
    - s_i: stride
    - p_i: padding (if the padding is uneven, the right padding will be higher than the left padding, as with the "SAME" option in TensorFlow)

* Each layer i requires the following parameters to be fully represented: 
    - n_i: number of features (the data layer has n_0 = image size)
    - j_i: distance (projected to image pixel distance) between the centers of two adjacent features
    - r_i: receptive field of a feature in layer i
    - start_i: position of the first feature's receptive field in layer i (indices start from 0; a negative value means the center falls into the padding)

```python
import math

# Each entry describes one layer as [filter size, stride, padding].
convnet = [[11, 4, 0], [3, 2, 0], [5, 1, 2], [3, 2, 0], [3, 1, 1],
           [3, 1, 1], [3, 1, 1], [3, 2, 0], [6, 1, 0], [1, 1, 0]]
layer_names = ['conv1', 'pool1', 'conv2', 'pool2', 'conv3',
               'conv4', 'conv5', 'pool5', 'fc6-conv', 'fc7-conv']
imsize = 227


def outFromIn(conv, layerIn):
    """Return (n, j, r, start) of a layer's output given the same tuple for its input."""
    n_in, j_in, r_in, start_in = layerIn
    k, s, p = conv

    n_out = math.floor((n_in - k + 2 * p) / s) + 1
    # Total padding implied by n_out; it can differ from 2*p. The left side
    # gets the smaller half, so any odd remainder falls on the right.
    actualP = (n_out - 1) * s - n_in + k
    pL = math.floor(actualP / 2)

    j_out = j_in * s
    r_out = r_in + (k - 1) * j_in
    start_out = start_in + ((k - 1) / 2 - pL) * j_in
    return n_out, j_out, r_out, start_out


def printLayer(layer, layer_name):
    print(layer_name + ":")
    print("\t n features: %s \n \t jump: %s \n \t receptive size: %s \t start: %s "
          % (layer[0], layer[1], layer[2], layer[3]))


layerInfos = []
if __name__ == '__main__':
    # The first layer is the data layer (image) with
    # n_0 = image size, j_0 = 1, r_0 = 1 and start_0 = 0.5.
    print("-------Net summary------")
    currentLayer = [imsize, 1, 1, 0.5]
    printLayer(currentLayer, "input image")
    for conv, name in zip(convnet, layer_names):
        currentLayer = outFromIn(conv, currentLayer)
        layerInfos.append(currentLayer)
        printLayer(currentLayer, name)
    print("------------------------")

    # Look up the receptive field of one feature chosen by the user.
    layer_name = input("Layer name where the feature in: ")
    layer_idx = layer_names.index(layer_name)
    idx_x = int(input("index of the feature in x dimension (from 0)"))
    idx_y = int(input("index of the feature in y dimension (from 0)"))

    n, j, r, start = layerInfos[layer_idx]
    assert idx_x < n
    assert idx_y < n

    print("receptive field: (%s, %s)" % (r, r))
    print("center: (%s, %s)" % (start + idx_x * j, start + idx_y * j))
```

-------Net summary------
input image:
	 n features: 227 
 	 jump: 1 
 	 receptive size: 1 	 start: 0.5 
conv1:
	 n features: 55 
 	 jump: 4 
 	 receptive size: 11 	 start: 5.5 
pool1:
	 n features: 27 
 	 jump: 8 
 	 receptive size: 19 	 start: 9.5 
conv2:
	 n features: 27 
 	 jump: 8 
 	 receptive size: 51 	 start: 9.5 
pool2:
	 n features: 13 
 	 jump: 16 
 	 receptive size: 67 	 start: 17.5 
conv3:
	 n features: 13 
 	 jump: 16 
 	 receptive size: 99 	 start: 17.5 
conv4:
	 n features: 13 
 	 jump: 16 
 	 receptive size: 131 	 start: 17.5 
conv5:
	 n features: 13 
 	 jump: 16 
 	 receptive size: 163 	 start: 17.5 
pool5:
	 n features: 6 
 	 jump: 32 
 	 receptive size: 195 	 start: 33.5 
fc6-conv:
	 n features: 1 
 	 jump: 32 
 	 receptive size: 355 	 start: 113.5 
fc7-conv:
	 n features: 1 
 	 jump: 32 
 	 receptive size: 355 	 start: 113.5 
------------------------
Layer name where the feature in: pool5
index of the feature in x dimension (from 0)3
index of the feature in y dimens