# Understanding Deep Neural Network Layers

## 2D Convolutional Layers
A **convolutional layer** convolves a **filter** (a collection of **kernels**, one per input channel) with the input to produce an **activation map**, also called a **feature map**. Optionally, a **bias** term is added to the output of the filter. Mathematically speaking, the filter slides over the input and, at each window position, the output is the sum of the element-wise products of the filter weights and the input values under the window. In a convolutional neural network, the filter weights are learned.

<img src="media/convolution.png" alt="convolution example" width="512"/>

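The sliding-window description above can be sketched directly with loops. This is a minimal illustration, not how PyTorch implements it: the helper `conv2d_naive` and the tensor names are made up for this example, and it assumes no padding and a stride of 1. The result is compared against `torch.nn.functional.conv2d`.

```python
import torch
import torch.nn.functional as F

def conv2d_naive(x, weight, bias):
    """Sliding-window convolution for one image (no padding, stride 1).

    x: (C_in, H, W), weight: (C_out, C_in, K, K), bias: (C_out,)
    """
    c_out, _, k, _ = weight.shape
    _, h, w = x.shape
    out = torch.zeros(c_out, h - k + 1, w - k + 1)
    for o in range(c_out):                  # one filter per output channel
        for i in range(h - k + 1):          # slide down
            for j in range(w - k + 1):      # slide right
                window = x[:, i:i + k, j:j + k]
                # Sum of element-wise products over all input channels, plus bias
                out[o, i, j] = (window * weight[o]).sum() + bias[o]
    return out

torch.manual_seed(0)
image = torch.rand(3, 6, 6)       # one 6x6 image with 3 channels
weight = torch.rand(2, 3, 3, 3)   # 2 filters, each with 3 kernels of size 3x3
bias = torch.rand(2)

manual = conv2d_naive(image, weight, bias)
reference = F.conv2d(image.unsqueeze(0), weight, bias)[0]
print(manual.shape)
print(torch.allclose(manual, reference, atol=1e-5))
```

Note that, like most deep learning libraries, this computes a cross-correlation: the filter is not flipped before it is applied.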

In [17]:
import torch


# Our randomly initialized input tensor
# We can interpret it as a batch of one 16x16 RGB image (3 channels)
x = torch.rand((1, 3, 16, 16))
N, C, H, W = x.shape

# KERNEL SIZE
# This is the "field of view" of the convolution. This is also referred to as
# the receptive field of the layer.
y = torch.nn.Conv2d(
    in_channels=C, 
    out_channels=C,
    kernel_size=5)(x)
print(y.shape) # torch.Size([1,3,12,12])

# PADDING
# Without padding, we can see that the output is smaller than the input. (If you
# think about the sliding window, the edges of the input matrix are "clipped"
# during convolution.) We can add padding to the matrix to get an output matrix
# of the same size as the input: padding lets the filter be centered on the
# edge pixels.
y = torch.nn.Conv2d(
    in_channels=C, 
    out_channels=C,
    kernel_size=5,
    padding=2)(x)
print(y.shape) # torch.Size([1,3,16,16])

# STRIDE
# We can also play around with striding to reduce the size of our feature map. 
# The stride specifies the "step size" of the filter as we slide through the 
# image. A stride of 2 means that we slide through the image 2 pixels at a time 
# (i.e., skipping a pixel.)
y = torch.nn.Conv2d(
    in_channels=C, 
    out_channels=C,
    kernel_size=5,
    padding=2,
    stride=2)(x)
print(y.shape) # torch.Size([1,3,8,8])

# NUMBER OF FILTERS
# Each filter produces one output channel, so the number of filters
# (out_channels) determines the number of channels in your output feature map.
y = torch.nn.Conv2d(
    in_channels=C, 
    out_channels=5,
    kernel_size=5,
    padding=2)(x) 
print(y.shape) # torch.Size([1,5,16,16])

# DILATION
# Dilation essentially inflates the kernel by inserting spaces between the
# kernel elements. A dilation factor of l=2 means that we insert l-1 spaces
# between the kernel elements. Dilation allows us to observe a larger receptive
# field without additional parameters: the 5x5 kernel with dilation 2 below
# spans the same 9x9 window as the undilated 9x9 kernel that follows it.
y = torch.nn.Conv2d(
    in_channels=C, 
    out_channels=C,
    kernel_size=5,
    dilation=2)(x)
print(y.shape) # torch.Size([1, 3, 8, 8])

y = torch.nn.Conv2d(
    in_channels=C, 
    out_channels=C,
    kernel_size=9,
    dilation=1)(x)
print(y.shape) # torch.Size([1, 3, 8, 8])

torch.Size([1, 3, 12, 12])
torch.Size([1, 3, 16, 16])
torch.Size([1, 3, 8, 8])
torch.Size([1, 5, 16, 16])
torch.Size([1, 3, 8, 8])
torch.Size([1, 3, 8, 8])

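The shapes printed above follow the output-size formula given in the PyTorch `Conv2d` documentation. The sketch below wraps that formula in a helper (`conv_output_size` is a name made up for this example) and checks it against the layers used above. It also prints the parameter count, to show that a dilated 5x5 kernel covers the same window as a 9x9 kernel with far fewer weights.

```python
import torch

def conv_output_size(size, kernel_size, padding=0, stride=1, dilation=1):
    """Spatial output size of a convolution along one dimension."""
    effective_kernel = dilation * (kernel_size - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1

inp = torch.rand((1, 3, 16, 16))
configs = [
    dict(kernel_size=5),
    dict(kernel_size=5, padding=2),
    dict(kernel_size=5, padding=2, stride=2),
    dict(kernel_size=5, dilation=2),
    dict(kernel_size=9),
]

results = []
for cfg in configs:
    layer = torch.nn.Conv2d(in_channels=3, out_channels=3, **cfg)
    predicted = conv_output_size(16, **cfg)
    actual = layer(inp).shape[-1]
    n_params = sum(p.numel() for p in layer.parameters())
    results.append((predicted, actual, n_params))
    print(cfg, predicted, actual, n_params)
```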

### Transposed Convolution

This layer is also called **deconvolution** or **fractionally strided convolution**. Conceptually, its purpose is the opposite of a typical convolution layer, i.e. it "upsamples" information instead of condensing it. This type of layer is often used in high-resolution image generation, or to map a low-dimensional feature space to a higher-dimensional one, as in autoencoders.

Empirically, people have observed that using transposed convolution can lead to **checkerboard artifacts**. This is a result of uneven overlap when the filter size is not divisible by the stride. 

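The uneven overlap can be made visible with a small experiment. In this sketch, every kernel weight and every input pixel is set to 1, so each output value simply counts how many input pixels contributed to it. The helper `overlap_counts` is a name made up for this illustration.

```python
import torch

def overlap_counts(kernel_size, stride, size=4):
    """Count how many input pixels contribute to each output pixel."""
    layer = torch.nn.ConvTranspose2d(
        in_channels=1, out_channels=1,
        kernel_size=kernel_size, stride=stride, bias=False)
    with torch.no_grad():
        layer.weight.fill_(1.0)  # every kernel weight is 1
        out = layer(torch.ones(1, 1, size, size))
    return out[0, 0]

uneven = overlap_counts(kernel_size=3, stride=2)  # 3 is not divisible by 2
even = overlap_counts(kernel_size=4, stride=2)    # 4 is divisible by 2

# Look at the interior only; border pixels always receive fewer contributions.
print(uneven[2:-2, 2:-2])
print(even[2:-2, 2:-2])
```

With a kernel size of 3 and a stride of 2, the interior counts alternate in a checkerboard-like pattern, while a kernel size of 4 gives the same count everywhere in the interior.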
A normal convolution:

<img src="media/convolution_computer.jpeg" alt="convolution example" width="512"/>

A transposed convolution:

<img src="media/transpose_convolution.png" alt="transposed convolution example" width="512"/>

In [3]:
y = torch.nn.ConvTranspose2d(
    in_channels=C, 
    out_channels=C, 
    kernel_size=5)(x)
print(y.shape) #torch.Size([1, 3, 20, 20])

torch.Size([1, 3, 20, 20])

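As a rough sketch of the "upsampling" role, the example below downsamples an input with a strided convolution and then maps it back to the original spatial size with a transposed convolution. The helper `conv_transpose_output_size` is a name made up here; it encodes the output-size formula from the PyTorch `ConvTranspose2d` documentation. Only the shape is restored, not the original values.

```python
import torch

def conv_transpose_output_size(size, kernel_size, stride=1, padding=0,
                               output_padding=0, dilation=1):
    """Spatial output size of a transposed convolution along one dimension."""
    return ((size - 1) * stride - 2 * padding
            + dilation * (kernel_size - 1) + output_padding + 1)

inp = torch.rand((1, 3, 16, 16))

down = torch.nn.Conv2d(3, 3, kernel_size=5, stride=2, padding=2)
# output_padding resolves the ambiguity that both 15x15 and 16x16 inputs
# map to 8x8 under the strided convolution above.
up = torch.nn.ConvTranspose2d(3, 3, kernel_size=5, stride=2, padding=2,
                              output_padding=1)

small = down(inp)
restored = up(small)
print(small.shape)
print(restored.shape)
print(conv_transpose_output_size(8, 5, stride=2, padding=2, output_padding=1))
```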

## Pooling Layers

A pooling layer reduces the spatial size of the input, which reduces the computation and the number of parameters needed in later layers and helps control overfitting. It is applied independently to every channel of the input, downsampling each one spatially with a maximum or average operation over a sliding window.

<img src="media/pooling.jpeg" alt="pooling example" width="512"/>

In [5]:
y = torch.nn.MaxPool2d(
    kernel_size=2, 
    stride=2)(x)
print(y.shape) #torch.Size([1, 3, 8, 8])

y = torch.nn.AvgPool2d(
    kernel_size=2, 
    stride=2)(x)
print(y.shape) # torch.Size([1, 3, 8, 8])

torch.Size([1, 3, 8, 8])
torch.Size([1, 3, 8, 8])

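To see what the two operations compute, it can help to pool a small tensor with known values. The sketch below uses a 4x4 grid holding the numbers 0 to 15 (the name `grid` is just for this example) and also confirms that a pooling layer has no learnable parameters of its own.

```python
import torch

# A single-channel 4x4 "image" with the values 0..15, row by row
grid = torch.arange(16, dtype=torch.float32).view(1, 1, 4, 4)

max_pool = torch.nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = torch.nn.AvgPool2d(kernel_size=2, stride=2)

pooled_max = max_pool(grid)[0, 0]  # largest value in each 2x2 window
pooled_avg = avg_pool(grid)[0, 0]  # mean of each 2x2 window

print(grid[0, 0])
print(pooled_max)
print(pooled_avg)
print(sum(p.numel() for p in max_pool.parameters()))
```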

## Normalization Layers

A normalization layer normalizes activations and is typically placed between a convolutional layer and an activation layer. It first standardizes the input activations to zero mean and unit standard deviation, then applies two trainable parameters: (1) gamma, a scale parameter, and (2) beta, a shift parameter. These let the network learn the standard deviation and mean of the normalized output.

**Batch normalization** computes the statistics per channel over the whole minibatch, **layer normalization** computes them per sample over all of its features, and **instance normalization** computes them per sample and per channel.

In [7]:
y = torch.nn.BatchNorm2d(num_features=3)(x)
print(y.shape) # torch.Size([1, 3, 16, 16])

y = torch.nn.InstanceNorm2d(num_features=3)(x)
print(y.shape) # torch.Size([1, 3, 16, 16])

y = torch.nn.LayerNorm(x.size()[1:])(x)
print(y.shape) #torch.Size([1, 3, 16, 16])

torch.Size([1, 3, 16, 16])
torch.Size([1, 3, 16, 16])
torch.Size([1, 3, 16, 16])

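The three layers differ only in which dimensions the mean and variance are computed over. The sketch below standardizes a batch by hand over the corresponding dimensions and compares the result with the PyTorch layers. It assumes freshly constructed layers in training mode, where gamma is 1 and beta is 0 (or absent), and the helper `standardize` is a name made up for this example.

```python
import torch

torch.manual_seed(0)
batch = torch.rand((4, 3, 16, 16))  # (N, C, H, W)
eps = 1e-5  # default epsilon of the PyTorch normalization layers

def standardize(t, dims):
    """Shift to zero mean and scale to unit variance over the given dims."""
    mean = t.mean(dim=dims, keepdim=True)
    var = ((t - mean) ** 2).mean(dim=dims, keepdim=True)  # biased variance
    return (t - mean) / torch.sqrt(var + eps)

manual_bn = standardize(batch, dims=(0, 2, 3))  # per channel, over the batch
manual_in = standardize(batch, dims=(2, 3))     # per sample and per channel
manual_ln = standardize(batch, dims=(1, 2, 3))  # per sample, over all features

bn = torch.nn.BatchNorm2d(num_features=3)
inorm = torch.nn.InstanceNorm2d(num_features=3)
ln = torch.nn.LayerNorm(batch.shape[1:])

print(torch.allclose(bn(batch), manual_bn, atol=1e-4))
print(torch.allclose(inorm(batch), manual_in, atol=1e-4))
print(torch.allclose(ln(batch), manual_ln, atol=1e-4))
```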

## Fully Connected Layers

A fully connected layer has connections to all activations in the previous layer.

In [13]:
x_flat = x.view(48,16) # torch.Size([48, 16])

y = torch.nn.Linear(
    in_features=16,
    out_features=4)(x_flat)
print(y.shape) # torch.Size([48, 4])

torch.Size([48, 4])

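Under the hood, a fully connected layer is a matrix multiplication plus a bias, so every output depends on every input feature. A minimal sketch that reproduces `torch.nn.Linear` by hand (the tensor names are chosen for this example only):

```python
import torch

torch.manual_seed(0)
features = torch.rand((48, 16))
fc = torch.nn.Linear(in_features=16, out_features=4)

# The weight matrix is stored as (out_features, in_features)
manual = features @ fc.weight.T + fc.bias
n_params = sum(p.numel() for p in fc.parameters())

print(fc.weight.shape)
print(torch.allclose(fc(features), manual, atol=1e-5))
print(n_params)  # 16 * 4 weights + 4 biases
```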

## Nonlinear Layers

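Nonlinear layers, also known as **activation layers**, apply an element-wise nonlinear function such as ReLU, sigmoid, or tanh to their input.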

<img src="media/relu.png" alt="ReLU activation" width="512"/>
<img src="media/sigmoid.png" alt="sigmoid activation" width="512"/>
<img src="media/tanh.png" alt="tanh activation" width="512"/>

In [16]:
y = torch.nn.ReLU()(x)
y = torch.nn.Sigmoid()(x)
y = torch.nn.Tanh()(x)
print(y.shape) # torch.Size([1, 3, 16, 16])

torch.Size([1, 3, 16, 16])

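For reference, the three activations shown above can be written out from their textbook definitions. The sketch below evaluates them on a handful of sample values (chosen arbitrarily for this example) and compares the results with the PyTorch layers.

```python
import torch

values = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

# Textbook definitions, applied element-wise
relu_manual = torch.clamp(values, min=0)             # max(0, x)
sigmoid_manual = 1 / (1 + torch.exp(-values))        # 1 / (1 + e^-x)
tanh_manual = ((torch.exp(values) - torch.exp(-values))
               / (torch.exp(values) + torch.exp(-values)))

print(torch.allclose(torch.nn.ReLU()(values), relu_manual))
print(torch.allclose(torch.nn.Sigmoid()(values), sigmoid_manual))
print(torch.allclose(torch.nn.Tanh()(values), tanh_manual))
```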

## Recurrent Layers

Recurrent layers are good for processing sequential information (e.g. speech, language) since they maintain a **hidden state** representing features from previous time steps. 

A **basic RNN cell** combines the input and the previous hidden state to form a vector. That vector goes through a nonlinear activation to produce the new hidden state.

<img src="media/rnn_cell.gif" alt="RNN cell" width="512"/>

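One step of a basic RNN cell can be written out by hand. This sketch uses the weights of a `torch.nn.RNNCell` (with its default tanh activation) and checks that the manual computation matches the layer. The sizes and tensor names are arbitrary choices for this example.

```python
import torch

torch.manual_seed(0)
cell = torch.nn.RNNCell(input_size=16, hidden_size=10)
x_t = torch.rand((3, 16))     # input at one time step, batch size 3
h_prev = torch.rand((3, 10))  # previous hidden state

# Combine the input and the previous hidden state, then apply tanh
h_manual = torch.tanh(
    x_t @ cell.weight_ih.T + cell.bias_ih
    + h_prev @ cell.weight_hh.T + cell.bias_hh)

print(h_manual.shape)
print(torch.allclose(cell(x_t, h_prev), h_manual, atol=1e-5))
```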
A downside of the basic RNN cell is its short-term memory: because of vanishing gradients, RNNs have a hard time carrying information from earlier time steps to later ones.

**Long Short-Term Memory (LSTM) cells** use three different gate types in order to control the flow of information to and from the **cell state**, which helps hold relevant information throughout the processing of the sequence. 
- The **forget gate** controls what part of the previous cell state will be kept.
- The **input gate** controls what part of the new computed information will be added to the cell state.
- The **output gate** controls what part of the cell state will be exposed as the hidden state.

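The three gates can be traced through one LSTM step by hand. The sketch below assumes the parameter layout of `torch.nn.LSTMCell`, which stacks the weights in the order input gate, forget gate, candidate values, output gate. The tensor names are chosen for this example.

```python
import torch

torch.manual_seed(0)
cell = torch.nn.LSTMCell(input_size=16, hidden_size=10)
x_t = torch.rand((3, 16))     # input at one time step, batch size 3
h_prev = torch.rand((3, 10))  # previous hidden state
c_prev = torch.rand((3, 10))  # previous cell state

# All four pre-activations are computed at once, then split
pre = (x_t @ cell.weight_ih.T + cell.bias_ih
       + h_prev @ cell.weight_hh.T + cell.bias_hh)
i, f, g, o = pre.chunk(4, dim=1)

input_gate = torch.sigmoid(i)   # how much new information to add
forget_gate = torch.sigmoid(f)  # how much of the old cell state to keep
candidate = torch.tanh(g)       # newly computed information
output_gate = torch.sigmoid(o)  # how much of the cell state to expose

c_manual = forget_gate * c_prev + input_gate * candidate
h_manual = output_gate * torch.tanh(c_manual)

h_ref, c_ref = cell(x_t, (h_prev, c_prev))
print(torch.allclose(h_ref, h_manual, atol=1e-5))
print(torch.allclose(c_ref, c_manual, atol=1e-5))
```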
**Gated Recurrent Units (GRUs)** are simpler than LSTM cells. They do not have a cell state and only have two gates, the **reset gate** and the **update gate**. 
- The update gate decides what information to keep from the hidden state.
- The reset gate decides what part of the hidden state we use to compute the new proposal.

<img src="media/lstm_gru.png" alt="LSTM and GRU cells" width="512"/>

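A single GRU step can be sketched the same way. The example assumes the parameter layout of `torch.nn.GRUCell`, which stacks the weights in the order reset gate, update gate, new proposal, and follows PyTorch's convention that the update gate weights the previous hidden state. The tensor names are chosen for this example.

```python
import torch

torch.manual_seed(0)
cell = torch.nn.GRUCell(input_size=16, hidden_size=10)
x_t = torch.rand((3, 16))     # input at one time step, batch size 3
h_prev = torch.rand((3, 10))  # previous hidden state

from_input = x_t @ cell.weight_ih.T + cell.bias_ih
from_hidden = h_prev @ cell.weight_hh.T + cell.bias_hh
i_r, i_z, i_n = from_input.chunk(3, dim=1)
h_r, h_z, h_n = from_hidden.chunk(3, dim=1)

reset_gate = torch.sigmoid(i_r + h_r)
update_gate = torch.sigmoid(i_z + h_z)
# The reset gate scales the hidden-state contribution to the new proposal
proposal = torch.tanh(i_n + reset_gate * h_n)
# The update gate blends the previous hidden state with the proposal
h_manual = (1 - update_gate) * proposal + update_gate * h_prev

print(torch.allclose(cell(x_t, h_prev), h_manual, atol=1e-5))
```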
In [23]:
# Input sequence of length 5, batch size 3, input size 16
x = torch.rand((5, 3, 16))
N, BS, L = x.shape

# Hidden state of batch size 3, hidden size 10
L_hidden = 10
hx = torch.rand((BS, L_hidden))

# Cell state of batch size 3, hidden size 10
cx = torch.rand((BS, L_hidden))

# Make the RNN cells
rnn_cell = torch.nn.RNNCell(
    input_size=L, 
    hidden_size=L_hidden)
lstm_cell = torch.nn.LSTMCell(
    input_size=L, 
    hidden_size=L_hidden)
gru_cell = torch.nn.GRUCell(
    input_size=L, 
    hidden_size=L_hidden)

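# Apply each recurrent cell to the input, one time step at a time.
# The hidden state hx is carried over from one loop to the next.
for i in range(N):
    hx = rnn_cell(x[i], hx)
for i in range(N):
    # The LSTM cell also updates the cell state cx
    hx, cx = lstm_cell(x[i], (hx, cx))
for i in range(N):
    hx = gru_cell(x[i], hx)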

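The explicit loops above step through the sequence one cell call at a time. PyTorch also provides full-sequence modules such as `torch.nn.LSTM` that run this loop internally. A minimal sketch, assuming a single-layer, unidirectional LSTM with zero initial states (the defaults) and tensor names chosen for this example:

```python
import torch

torch.manual_seed(0)
seq = torch.rand((5, 3, 16))  # (sequence length, batch size, input size)

lstm = torch.nn.LSTM(input_size=16, hidden_size=10)
outputs, (h_n, c_n) = lstm(seq)

print(outputs.shape)  # hidden state at every time step
print(h_n.shape)      # final hidden state, one entry per layer
print(c_n.shape)      # final cell state, one entry per layer
# For a single layer, the last output is the final hidden state
print(torch.allclose(outputs[-1], h_n[0]))
```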
## References
- https://towardsdatascience.com/a-comprehensive-introduction-to-different-types-of-convolutions-in-deep-learning-669281e58215
- http://cs231n.github.io/convolutional-networks/
- https://medium.com/techspace-usict/normalization-techniques-in-deep-neural-networks-9121bf100d8
- https://jhui.github.io/2017/03/15/RNN-LSTM-GRU/
- https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21