<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Design" data-toc-modified-id="Design-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Design</a></span><ul class="toc-item"><li><span><a href="#Keep-your-goals-Simple" data-toc-modified-id="Keep-your-goals-Simple-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Keep your goals Simple</a></span></li><li><span><a href="#CNN-Design-Overview" data-toc-modified-id="CNN-Design-Overview-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>CNN Design Overview</a></span><ul class="toc-item"><li><span><a href="#CNN-Architecture-(Image-Classification)" data-toc-modified-id="CNN-Architecture-(Image-Classification)-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>CNN Architecture (Image Classification)</a></span><ul class="toc-item"><li><span><a href="#Head" data-toc-modified-id="Head-1.2.1.1"><span class="toc-item-num">1.2.1.1&nbsp;&nbsp;</span>Head</a></span></li><li><span><a href="#Tail-and-Body" data-toc-modified-id="Tail-and-Body-1.2.1.2"><span class="toc-item-num">1.2.1.2&nbsp;&nbsp;</span>Tail and Body</a></span></li><li><span><a href="#Receptive-field-size-at-the-head" data-toc-modified-id="Receptive-field-size-at-the-head-1.2.1.3"><span class="toc-item-num">1.2.1.3&nbsp;&nbsp;</span>Receptive field size at the head</a></span></li></ul></li></ul></li></ul></li><li><span><a href="#CNN-Architectures" data-toc-modified-id="CNN-Architectures-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>CNN Architectures</a></span><ul class="toc-item"><li><span><a href="#Serial-(LeNet,-AlexNet,-VGG,-MobileNet-V1)" data-toc-modified-id="Serial-(LeNet,-AlexNet,-VGG,-MobileNet-V1)-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Serial (LeNet, AlexNet, VGG, MobileNet V1)</a></span></li><li><span><a href="#Parallel-(GoogLeNet,-Inceptions,-SqueezeNet)" data-toc-modified-id="Parallel-(GoogLeNet,-Inceptions,-SqueezeNet)-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Parallel (GoogLeNet, Inceptions, 
SqueezeNet)</a></span></li><li><span><a href="#Dense-(DenseNet)" data-toc-modified-id="Dense-(DenseNet)-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Dense (DenseNet)</a></span></li><li><span><a href="#Residual" data-toc-modified-id="Residual-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Residual</a></span></li><li><span><a href="#Neural-Architecture-Search" data-toc-modified-id="Neural-Architecture-Search-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Neural Architecture Search</a></span><ul class="toc-item"><li><span><a href="#Notes-for-Vision-Optimized-CNNs" data-toc-modified-id="Notes-for-Vision-Optimized-CNNs-2.5.1"><span class="toc-item-num">2.5.1&nbsp;&nbsp;</span>Notes for Vision Optimized CNNs</a></span></li></ul></li></ul></li><li><span><a href="#RNNs" data-toc-modified-id="RNNs-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>RNNs</a></span><ul class="toc-item"><li><span><a href="#RNN" data-toc-modified-id="RNN-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>RNN</a></span></li><li><span><a href="#GRU" data-toc-modified-id="GRU-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>GRU</a></span></li><li><span><a href="#LSTM" data-toc-modified-id="LSTM-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>LSTM</a></span></li></ul></li><li><span><a href="#Attention" data-toc-modified-id="Attention-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Attention</a></span></li></ul></div>

In [1]:
import tensorflow as tf
from tensorflow.keras.layers import *
print(tf.__version__)

2.1.0


# Design

[Slides](https://github.com/arthurredfern/UT-Dallas-CS-6301-CNNs/blob/master/Lectures/xNNs_050_Design.pdf)

[On disk](file:///F:/Data/xNNs_050_Design.pdf)

## Keep your goals Simple
- Simpler tasks tend to have more training data available, and a NN generally performs better with more data.
- Handle complex problems with a shallow combination of simple problems.

## CNN Design Overview
- CNNs exploit spatial structure in data: correlations between localized regions of pixels.
- __Encoder__ (Tail & Body for feature extraction) + __decoder__ (head for prediction)
    - Commonly: Serial, Parallel, Dense, or Residual
    
- __Tail__ - a few specialized layers with weaker features and better localization.
- __Body__ - layers that define the network, with strong features but worse localization.
- __Head__ - task-specific layers that produce the output of the network.


### CNN Architecture (Image Classification)

A classification network takes an image as input and outputs a vector whose largest element indicates the predicted class.

$$TAIL\rightarrow BODY \rightarrow HEAD$$

$$ENCODER\rightarrow DECODER$$

#### Head

The __head__ estimates the dominant object in a specific region by using the feature vector in the corresponding region of the feature maps at the body output.
- __Global average pooling__ (GAP) extracts one value from each feature map (the presence of that feature across all areas), and the head uses a linear combination of these global features to predict the dominant object of the image. Either GAP or vectorization (flattening) can be used to convert the output of the body to a 1xN vector for input into the linear layers of the head.

GAP is more popular since it allows arbitrarily sized input images and also reduces the computational complexity of the decoding phase.
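
A minimal NumPy sketch of why GAP tolerates different input sizes while vectorization does not. The function names and the feature-map shapes are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def global_average_pool(feature_maps):
    """Average each channel over rows and columns: (H, W, C) -> (C,)."""
    return feature_maps.mean(axis=(0, 1))

def vectorize(feature_maps):
    """Flatten every activation into one long vector: (H, W, C) -> (H*W*C,)."""
    return feature_maps.reshape(-1)

# Assumed body outputs for a smaller and a larger input image.
small = np.ones((7, 7, 512))
large = np.ones((10, 10, 512))

gap_small, gap_large = global_average_pool(small), global_average_pool(large)
vec_small, vec_large = vectorize(small), vectorize(large)

print(gap_small.shape, gap_large.shape)  # same length for both inputs
print(vec_small.shape, vec_large.shape)  # length depends on the input size
```

With GAP the linear layer in the head always sees C values, so its weight matrix does not depend on the image size; with vectorization the weight matrix is tied to one specific H and W.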


Different head designs can be created to accomplish different goals, and more than one head can be attached to the end of a tail and body.

#### Tail and Body

The tail and body transform entire regions of the input image into feature vectors across the output feature maps. The __receptive field__ is the area of input pixels that can influence a given output feature vector.

Each activation in the output of the body relates to the confidence that a class is present in a certain area of the input (one receptive field in size). Averaging over every region leads to the head's final prediction of the dominant class in the image.


##### Tail
- High compute, high feature map memory, low parameter memory, lots of spatial redundancy.
- Aggressive down-sampling and aggressive increase in the number of channels.

Common architectures:
- conv 7x7, stride 2, 64 channels -> max pool 3x3, stride 2
- conv 3x3, stride 2, 32 channels -> conv 3x3, stride 1, 64 channels -> conv 3x3, stride 2, 64 channels


##### Body
Blocks followed by down-sampling. This gives a gradual reduction in the memory required for feature maps and an increase in the number of channels, at the cost of spatial resolution (pooling throws out pixels).

Common design practices:
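- halve the rows and columns and double the number of channels
    - feature-map data volume shrinks by a factor of 2 (1/4 of the positions, 2x the channels), while compute per standard convolution stays roughly constant once input and output widths match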
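
The halving rule can be checked with a few lines of arithmetic. This is a sketch only: the starting shape (56x56x64) and the number of stages are assumed example values.

```python
def body_schedule(rows, cols, channels, num_stages):
    """List (rows, cols, channels, activations) for each body stage.

    Between stages the rows and columns are halved and the channels doubled.
    """
    stages = []
    for _ in range(num_stages):
        stages.append((rows, cols, channels, rows * cols * channels))
        rows, cols, channels = rows // 2, cols // 2, channels * 2
    return stages

schedule = body_schedule(56, 56, 64, 4)
for r, c, ch, n in schedule:
    print(f"{r}x{c}x{ch}: {n} activations")
```

Each stage holds half as many activations as the one before it.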
  
#### Receptive field size at the head
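The receptive field size at the head shows how much of the input image the classifier has to work with at inference. This matters more for higher-resolution images, where a fixed receptive field covers a smaller fraction of the image.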

__Calculation of Receptive Field__
1. Start at the output of the body and set r.f.=1
2. Move backwards in network.
3. Every time you reach a filter, increase the receptive field size by F-1 (F is the filter length, e.g. 3 for a 3x3 filter).
4. Every time you reach a down-sampling step, multiply the receptive field by S, then subtract S-1 (S is the down-sampling factor, e.g. 2).
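
The steps above can be sketched as a small function. The `(filter_size, stride)` layer format and the function name are assumptions made for this example; a strided conv or pooling layer counts as both a down-sample (step 4) and a filter (step 3), applied in that order when walking backwards.

```python
def receptive_field(layers):
    """Receptive field (input pixels along one axis) of one output activation.

    layers: list of (filter_size, stride) tuples ordered from input to output.
    """
    rf = 1  # step 1: start at the body output
    for filter_size, stride in reversed(layers):  # step 2: walk backwards
        rf = rf * stride - (stride - 1)  # step 4: undo the down-sampling
        rf = rf + (filter_size - 1)      # step 3: widen by the filter
    return rf

tail_a = [(7, 2), (3, 2)]          # conv 7x7 stride 2 -> max pool 3x3 stride 2
tail_b = [(3, 2), (3, 1), (3, 2)]  # three 3x3 convs with strides 2, 1, 2

print(receptive_field(tail_a), receptive_field(tail_b))
```

Per layer this is equivalent to `rf_in = stride * (rf_out - 1) + filter_size`.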


# CNN Architectures

## Serial (LeNet, AlexNet, VGG, MobileNet V1)

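__VGGNet__ introduced stacked 3x3 conv filters and used vectorization at the head.
   - Stacking 3x3 convolutions reduces compute relative to a single larger filter with the same receptive field (two stacked 3x3 layers cover the same 5x5 region as one 5x5 filter).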
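
A rough weight count illustrates the saving. This sketch assumes the same number of channels in and out of every layer and ignores biases; the helper names and the 64-channel width are made up for the example.

```python
def conv_weights(kernel, channels, layers=1):
    """Weights in `layers` stacked kernel x kernel convs (channels in == channels out, no bias)."""
    return layers * kernel * kernel * channels * channels

def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convs of size kernel x kernel."""
    return layers * (kernel - 1) + 1

channels = 64
two_3x3 = conv_weights(3, channels, layers=2)
one_5x5 = conv_weights(5, channels)
print(stacked_receptive_field(3, 2), two_3x3, one_5x5)
```

Since compute at a fixed feature-map size scales with the weight count, the stacked version is cheaper while also adding an extra non-linearity between the two layers.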
    

__[MobileNet V1](https://arxiv.org/pdf/1704.04861.pdf)__ uses a cascade of 3x3 spatial and 1x1 channel convs, with global average pooling at the head.
   - __Depthwise-separable convolutions__. The depthwise convolution has one nxnx1 kernel for each channel of the input. The pointwise convolution uses 1x1xc kernels (one per output channel) for an input with c channels. [depthwise conv explained](https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728)

In [None]:
def MobileNetV1(inputs, num_filters, strides=(1,1)):
    """
    Conv Block from the MobileNetV1 architecture.
    
    inputs: Tensor - input to the first layer
    num_filters: int - desired number of channels in output
    strides: tuple - strides for the depthwise convolution
    """
    # depthwise conv: one 3x3 filter per input channel; 'same' padding so that
    # only the stride changes the spatial size
    x = DepthwiseConv2D((3,3), strides=strides, padding='same')(inputs)
    x = BatchNormalization()(x)
    x = ReLU()(x)
    
    x = Conv2D(num_filters, (1, 1))(x) # pointwise conv: mixes channels
    x = BatchNormalization()(x) 
    x = ReLU()(x)
    return x
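
To see why the depthwise + pointwise split is cheaper, the multiply-accumulate counts can be compared directly. This is a sketch with assumed example sizes (a 14x14 feature map with 512 channels in and out); the function names are hypothetical.

```python
def standard_conv_macs(rows, cols, k, c_in, c_out):
    """Multiply-accumulates for a k x k standard conv at stride 1 with 'same' padding."""
    return rows * cols * k * k * c_in * c_out

def separable_conv_macs(rows, cols, k, c_in, c_out):
    """Multiply-accumulates for a k x k depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = rows * cols * k * k * c_in  # one k x k filter per input channel
    pointwise = rows * cols * c_in * c_out  # 1x1 conv mixing the channels
    return depthwise + pointwise

standard = standard_conv_macs(14, 14, 3, 512, 512)
separable = separable_conv_macs(14, 14, 3, 512, 512)
ratio = separable / standard
print(standard, separable, round(ratio, 4))
```

The ratio works out to `1/c_out + 1/k**2`, so for 3x3 filters the separable version needs roughly 8 to 9 times fewer operations.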

## Parallel (GoogLeNet, Inceptions, SqueezeNet)
Input split $\rightarrow$ parallel ops $\rightarrow$ output combine. 

[Inception overview](https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202)

__[Inception V1](https://arxiv.org/pdf/1409.4842v1.pdf)__ - The size of an object may differ across images, so the block builds parallel convolution pathways with different receptive fields.

The authors limit compute by adding 1x1 pointwise convolutions (reducing input channels) before the 3x3 and 5x5 conv branches. The paper cites *Network in Network* as the source of the pointwise conv.

The authors also include a Dropout layer after GAP with a dropout probability of 0.4.

The authors use auxiliary classifiers: heads attached inside the network that predict the output. These aim to give earlier parts of the network a cleaner gradient signal.

<img src="../img/lenet.PNG">

In [None]:
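def conv_block(inputs, filters, kernel_size=(3,3), strides=(1,1)):
    """Generic Conv -> BN -> ReLU abstraction"""
    # 'same' padding keeps rows/cols unchanged at stride 1, so parallel
    # branches with different kernel sizes can be concatenated
    x = Conv2D(filters, kernel_size, strides=strides, padding='same')(inputs)
    x = BatchNormalization()(x)
    x = ReLU()(x)
    return x  

def InceptionV1(inputs, squeeze_dims, out_dims, strides=(1,1)):
    """
    Inception V1 Conv Block (GoogLeNet).

    squeeze_dims: int - channels after each 1x1 reduction
    out_dims: int - channels produced by each branch
    strides: currently unused; every branch runs at stride 1
    """
    # branch 1: 1x1 conv
    x1 = conv_block(inputs, out_dims, kernel_size=(1,1))
    
    # branch 2: 1x1 reduction -> 3x3 conv
    x2 = conv_block(inputs, squeeze_dims, kernel_size=(1,1))
    x2 = conv_block(x2, out_dims, kernel_size=(3,3))
    
    # branch 3: 1x1 reduction -> 5x5 conv
    x3 = conv_block(inputs, squeeze_dims, kernel_size=(1,1))
    x3 = conv_block(x3, out_dims, kernel_size=(5,5))
    
    # branch 4: 3x3 max pool -> 1x1 conv (the conv consumes the pooled tensor)
    x4 = MaxPool2D(pool_size=(3,3), strides=(1,1), padding='same')(inputs)
    x4 = conv_block(x4, out_dims, kernel_size=(1,1))
    
    concat = tf.concat([x1,x2,x3,x4],3) # stack the branches along channels
    return concat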

__[Inception V2 and V3](https://arxiv.org/pdf/1512.00567.pdf)__

__Inception V2__ focused on: 
- avoiding representational bottlenecks caused by extreme compression.
- converting the 5x5 convolution to two stacked 3x3 convolutions.
- reducing compute by factorizing an nxn convolution into an nx1 and a 1xn convolution.

In [None]:
def InceptionV2_fig5(inputs, squeeze_dims, out_dims, strides=(1,1)):
    """
    Inception V2 Figure 5
    """
    x1 = conv_block(inputs, squeeze_dims, kernel_size=(1,1))
    
    x2 = conv_block(inputs, squeeze_dims, kernel_size=(1,1))
    x2 = conv_block(x2, out_dims, kernel_size=(3,3))
    
    x3 = conv_block(inputs, squeeze_dims, kernel_size=(1,1))
    x3 = conv_block(x3, out_dims, kernel_size=(3,3))
    x3 = conv_block(x3, out_dims, kernel_size=(3,3))
    
    # pool branch: the 1x1 conv must consume the pooled tensor, not the raw inputs
    x4 = MaxPool2D(pool_size=(3,3), strides=(1,1), padding='same')(inputs)
    x4 = conv_block(x4, squeeze_dims, kernel_size=(1,1))
    
    concat = tf.concat([x1,x2,x3,x4],3)
    return concat

def InceptionV2_fig6(inputs, squeeze_dims, out_dims, strides=(1,1)):
    """
    Inception V2 figure 6
    """
    x1 = conv_block(inputs, squeeze_dims, kernel_size=(1,1))
    
    x2 = conv_block(inputs, squeeze_dims, kernel_size=(1,1))
    x2 = conv_block(x2, out_dims, kernel_size=(3,1))
    x2 = conv_block(x2, out_dims, kernel_size=(1,3))
    
    x3 = conv_block(inputs, squeeze_dims, kernel_size=(1,1))
    x3 = conv_block(x3, out_dims, kernel_size=(3,1))
    x3 = conv_block(x3, out_dims, kernel_size=(1,3))
    x3 = conv_block(x3, out_dims, kernel_size=(3,1))
    x3 = conv_block(x3, out_dims, kernel_size=(1,3))
    
    # pool branch: the 1x1 conv must consume the pooled tensor, not the raw inputs
    x4 = MaxPool2D(pool_size=(3,3), strides=(1,1), padding='same')(inputs)
    x4 = conv_block(x4, squeeze_dims, kernel_size=(1,1))
    
    concat = tf.concat([x1,x2,x3,x4],3)
    return concat

__Inception V3__ 
- adds RMSProp optimizer
- factorized 7x7 convolutions
- Batch Norm in auxiliary classifiers
- label smoothing (regularization)
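
Label smoothing can be sketched in a few lines: the one-hot target is mixed with a uniform distribution over the classes. The function name, the 4-class target and `epsilon=0.1` are illustrative assumptions for this example.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Mix a one-hot target with a uniform distribution over the K classes."""
    k = len(one_hot)
    return [(1.0 - epsilon) * p + epsilon / k for p in one_hot]

smoothed = smooth_labels([0.0, 0.0, 1.0, 0.0], epsilon=0.1)
print(smoothed)
```

The true class keeps most of the probability mass while every other class gets a small share, which discourages the network from producing over-confident logits.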

__[Inception V4](https://arxiv.org/pdf/1602.07261.pdf)__

Changes the tail design and introduces 5 new inception blocks. Also introduces the Inception-ResNet V1 and V2 architectures, which make use of skip connections.

__[SqueezeNet](https://arxiv.org/abs/1602.07360)__

Parallel block design. A 1x1 convolution squeezes the channel dimension, followed by parallel 1x1 and 3x3 convolutions that expand the channel dimension.
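
A rough weight count shows what the squeeze step buys. This is a sketch only: the channel sizes are example values chosen for illustration, biases are ignored, and the helper names are hypothetical.

```python
def fire_module_weights(c_in, squeeze, expand_1x1, expand_3x3):
    """Weights in a squeeze (1x1) -> parallel expand (1x1 and 3x3) block, no biases."""
    squeeze_w = c_in * squeeze                                   # 1x1 squeeze conv
    expand_w = squeeze * expand_1x1 + 9 * squeeze * expand_3x3   # parallel expand convs
    return squeeze_w + expand_w

def plain_conv3x3_weights(c_in, c_out):
    """Weights in a single 3x3 conv producing the same number of output channels."""
    return 9 * c_in * c_out

fire = fire_module_weights(c_in=96, squeeze=16, expand_1x1=64, expand_3x3=64)
plain = plain_conv3x3_weights(c_in=96, c_out=128)
print(fire, plain)
```

The 3x3 filters only ever see the squeezed channels, which is where most of the saving comes from.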


## Dense (DenseNet)
Input split into a transformation path and an identity path. Results are concatenated (adds width to the network with features at different depths).

## Residual
Instead of a concat as in DenseNet, perform an add on the two splits.
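
The difference between the two combine steps shows up in the output shape. A minimal NumPy sketch, where `transform` is a made-up stand-in for the conv path (any shape-preserving op works for this demo):

```python
import numpy as np

def transform(x):
    """Stand-in for the transformation path; keeps the input shape."""
    return np.maximum(x - 0.5, 0.0)

x = np.random.default_rng(0).random((8, 8, 16))  # rows, cols, channels

dense_out = np.concatenate([x, transform(x)], axis=-1)  # DenseNet-style combine
residual_out = x + transform(x)                          # ResNet-style combine

print(dense_out.shape)     # (8, 8, 32): channels grow with every block
print(residual_out.shape)  # (8, 8, 16): channels stay the same
```

Concatenation keeps the earlier features untouched but widens the tensor; addition keeps the width fixed but requires both paths to have identical shapes.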

__[ResNetV1](https://arxiv.org/pdf/1512.03385.pdf)__ Add the input of a convolutional block to the results of the block's convolutions. These "skip connections" allow networks to train faster by allowing the clean flow of gradients through the network.

Do the ReLU after the Add so the feature maps being added are compatible (both have positive and negative values).

<img src="../img/resnet.png">

[This paper](https://arxiv.org/pdf/1603.05027.pdf) describes the importance of "pre-activation" for the clean flow of gradients through the network.
<img src="../img/preactivation.png" height=200>

In [None]:
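def ResNetV1(inputs, channel_dims, strides=(1,1)):
    """ResNet block with pre-activation (BN -> ReLU -> Conv, twice)"""
    #RESIDUAL PATH
    resid = BatchNormalization()(inputs)
    resid = ReLU()(resid)
    # feed the pre-activated tensor (not the raw inputs) into the first conv
    resid = Conv2D(channel_dims, (3,3), strides=strides, padding='same')(resid)
    resid = BatchNormalization()(resid)
    resid = ReLU()(resid)
    resid = Conv2D(channel_dims, (3,3), strides=(1,1), padding='same')(resid)
    #IDENTITY PATH
    # project with a 1x1 conv whenever the spatial size or channel count changes
    if strides != (1,1) or inputs.shape[-1] != channel_dims:
        inputs = Conv2D(channel_dims, (1,1), strides=strides)(inputs)
    #COMBINE
    combined = Add()([inputs, resid])
    return combined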

__[ResNeXT](https://arxiv.org/abs/1611.05431)__

Replace 3x3 convolutions in the residual bottleneck with highly grouped 3x3 convolutions.

<img src="../img/resnext.png">

__[MobileNet V2](https://arxiv.org/abs/1801.04381)__

- Uses an expansion 1x1 conv
- Utilizes depthwise separable convolutions
- Utilizes ReLU6

The inverted residual ends the block with a compression 1x1 conv that shrinks the number of channel dims.

In [17]:
# bn_params and conv_params are not defined in this cell; the values below
# are assumed placeholders so the block runs on its own
bn_params = {}
conv_params = {'padding': 'same'}

def inverted_residual(inputs, expand_channels, squeeze_channels, strides=(1,1)):
    """
    Inverted residual a la MobileNet V2. Note the channel dimension will
     be expanded by pointwise conv, processed with depthwise conv, then 
     compressed by a linear bottleneck
    """
    x = Conv2D(expand_channels, (1, 1), strides=strides, padding='same')(inputs)
    x = BatchNormalization(**bn_params)(x)
    x = ReLU(max_value=6)(x)  # ReLU6: activations are capped at 6
    
    x = DepthwiseConv2D((3,3), strides=(1,1), **conv_params)(x)
    x = BatchNormalization(**bn_params)(x)
    x = ReLU(max_value=6)(x)
    
    x = Conv2D(squeeze_channels, (1, 1), strides=(1, 1), padding='same')(x)
    x = BatchNormalization(**bn_params)(x) #No activation here (Linear BottleNeck)
    
    # project the shortcut whenever the block changes the spatial size or the
    # channel count, so both inputs to Add have the same shape (the paper
    # instead strides in the depthwise conv and drops the shortcut for such blocks)
    if strides != (1,1) or inputs.shape[-1] != squeeze_channels:
        inputs = Conv2D(squeeze_channels, (1, 1), strides=strides, padding='same')(inputs)
    
    return Add()([x, inputs])

__[ShuffleNet](https://arxiv.org/pdf/1707.01083.pdf)__

__[ShiftNet](https://arxiv.org/abs/1711.08141)__

__[Residual Squeeze and Excite](https://arxiv.org/abs/1709.01507)__

__[Residual Selective Kernel Methods](https://arxiv.org/abs/1903.06586)__

__[ShuffleNet V2](https://arxiv.org/abs/1807.11164)__

## Neural Architecture Search

[1](https://www.fast.ai/2018/07/23/auto-ml-3)

[survey](https://arxiv.org/abs/1808.05377)

[survey](https://www.ml4aad.org/automl/literature-on-neural-architecture-search/)

__[MnasNet](https://ai.googleblog.com/2018/08/mnasnet-towards-automating-design-of.html)__

__[EfficientNet](https://arxiv.org/abs/1905.11946)__

### Notes for Vision Optimized CNNs
- Target a feature map size between 6x6 and 8x8 before GAP
- Early body blocks will have a high compute cost, while late blocks will require more memory.

# RNNs

Exploit sequential structure in data

## RNN
$$y_t^T = f(x_t^TH+y_{t-1}^TG+v^T)$$

## GRU
Update Gate:
$$u_t^T=\sigma(x_t^TH_U+y_{t-1}^TG_U+v_U^T)$$

Reset Gate:
$$r_t^T=\sigma(x_t^TH_r+y_{t-1}^TG_r+v_r^T)$$

Output:
$$y_t^T=[u_t^T\circ y_{t-1}^T]+[(1-u_t^T)\circ \sigma(x_t^TH_y+(r_t\circ y_{t-1})^TG_y+v_y^T)]$$

$\circ$ is the Hadamard product (pointwise multiplication)
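
The update gate, reset gate and output equations can be written as one NumPy step. This is a sketch under assumptions: the parameter packing, sizes and random weights are made up for the example, and the candidate activation is a function argument because these notes write it as $\sigma$ while many GRU references use tanh instead.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, y_prev, params, candidate_fn=sigmoid):
    """One GRU time step following the three equations above (row-vector convention)."""
    H_u, G_u, v_u, H_r, G_r, v_r, H_y, G_y, v_y = params
    u = sigmoid(x_t @ H_u + y_prev @ G_u + v_u)  # update gate
    r = sigmoid(x_t @ H_r + y_prev @ G_r + v_r)  # reset gate
    candidate = candidate_fn(x_t @ H_y + (r * y_prev) @ G_y + v_y)
    return u * y_prev + (1.0 - u) * candidate    # blend old state and candidate

# Assumed sizes: 3 input features, 4 hidden units, small random weights.
rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
params = []
for _ in range(3):  # update, reset, output parameter groups
    params += [0.1 * rng.standard_normal((n_in, n_hidden)),
               0.1 * rng.standard_normal((n_hidden, n_hidden)),
               np.zeros(n_hidden)]

x_t = rng.standard_normal(n_in)
y_prev = rng.standard_normal(n_hidden)
y_t = gru_step(x_t, y_prev, params)
print(y_t.shape)
```

When the update gate saturates at 1 the previous output is copied through unchanged, which is how the GRU carries information across many time steps.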

## LSTM

[Understanding LSTMS](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)

<img src="../img/lstm.png">

The __cell state__ is persistent memory that can be updated or modified at each time step.

The __forget gate__ is a sigmoid operation wrapped around a dense unit. Its output is Hadamard multiplied with the cell state to "remove information".
$$f_t=\sigma(W_f\cdot concat(h_{t-1},x_t)+b_f)$$

The __input gate__ decides which values of the cell state will be updated.
$$i_t=\sigma(W_i\cdot concat(h_{t-1},x_t)+b_i)$$

This is Hadamard multiplied with __candidate values__ created by a tanh dense operation (outputs bounded in (-1, 1)).
$$\tilde{C_t} = tanh(W_c\cdot concat(h_{t-1},x_t)+b_c)$$

The cell state is then updated:
$$C_t = f_t*C_{t-1}+i_t*\tilde{C_t}$$


Lastly the output will be a filtered version of the cell state. The __output gate__ is another sigmoid-dense operation that will be multiplied with the cell state to produce the __hidden state__ $h_t$

$$o_t=\sigma(W_o\cdot concat(h_{t-1},x_t)+b_o)$$

$$h_t=o_t*tanh(C_t)$$
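
The gate equations above map directly onto a single NumPy step. A minimal sketch: the dictionary layout for the weights (`W['f']`, `W['i']`, `W['c']`, `W['o']`) and the sizes are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    W[k] has shape (hidden, hidden + input) and b[k] has shape (hidden,)
    for k in 'f' (forget), 'i' (input), 'c' (candidate), 'o' (output).
    """
    z = np.concatenate([h_prev, x_t])       # concat(h_{t-1}, x_t)
    f_t = sigmoid(W['f'] @ z + b['f'])      # forget gate
    i_t = sigmoid(W['i'] @ z + b['i'])      # input gate
    c_tilde = np.tanh(W['c'] @ z + b['c'])  # candidate values
    c_t = f_t * c_prev + i_t * c_tilde      # new cell state
    o_t = sigmoid(W['o'] @ z + b['o'])      # output gate
    h_t = o_t * np.tanh(c_t)                # new hidden state
    return h_t, c_t

# Assumed sizes: 3 input features, 4 hidden units, small random weights.
rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
W = {k: 0.1 * rng.standard_normal((n_hidden, n_hidden + n_in)) for k in 'fico'}
b = {k: np.zeros(n_hidden) for k in 'fico'}

x_t = rng.standard_normal(n_in)
h_prev, c_prev = np.zeros(n_hidden), np.ones(n_hidden)
h_t, c_t = lstm_step(x_t, h_prev, c_prev, W, b)
print(h_t.shape, c_t.shape)
```

Running a sequence means calling `lstm_step` once per time step and feeding `h_t` and `c_t` back in as `h_prev` and `c_prev`.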



In [None]:
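#TensorFlow APIs (each layer needs a size; 32 units is an arbitrary example)
rnn = tf.keras.layers.RNN(tf.keras.layers.SimpleRNNCell(32))  # generic wrapper around a cell
gru = tf.keras.layers.GRU(32)
lstm = tf.keras.layers.LSTM(32)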

# Attention

[Attention](https://distill.pub/2016/augmented-rnns/)

[The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)

[Annotated](http://nlp.seas.harvard.edu/2018/04/03/attention.html)