![Banner](img/AI_Special_Program_Banner.jpg)

# Convolutional Neural Networks (CNN) - Convolution Arithmetic and Visualizations
---

Sources:
* [A guide to convolution arithmetic for deep learning](https://arxiv.org/abs/1603.07285) (and associated [GitHub page](https://github.com/vdumoulin/conv_arithmetic) + additional picture conversions via [EZGif](https://ezgif.com/))
* [Deep Learning Book, chapter 9](http://www.deeplearningbook.org/contents/convnets.html)

The material presented here has been compiled from these sources.

---

## Overview
  - [Motivation](#Motivation)
    - [Example in 2D](#Example-in-2D)
    - [Sparse Connectivity](#Sparse-Connectivity)
    - [Weight Sharing](#Weight-Sharing)
  - [Parameters of convolutions](#Parameters-of-convolutions)
  - [Example](#Example)
  - [Pooling](#Pooling)
  - [Convolution Arithmetic](#Convolution-Arithmetic)
    - [General Formulae for Unit Strides I](#General-Formulae-for-Unit-Strides-I)
    - [General Formulae for Unit Strides II](#General-Formulae-for-Unit-Strides-II)
    - [General Formulae for Unit Strides III](#General-Formulae-for-Unit-Strides-III)
    - [General Formulae for Non-Unit Strides I](#General-Formulae-for-Non-Unit-Strides-I)
    - [General Formulae for Non-Unit Strides II](#General-Formulae-for-Non-Unit-Strides-II)
  - [Final Observations](#Final-Observations)
    - [Efficacy of CNN](#Efficacy-of-CNN)
    - [Example of usefulness: edge detection](#Example-of-usefulness:-edge-detection)

### Motivation
---
In general ANNs, the input is **flattened** (e.g. MNIST: $28\times 28 \to 784$) $\rightarrow$ loss of structure

Structure in images, sound clips etc:
* stored as multi-dimensional arrays
* one or more **ordered** axes
  - images: width and height
  - sound: time
* shared channel **axis**
  - images: RGB channels
  - sound: left / right for stereo
  
Convolutions preserve structure!

Furthermore
* massive reduction of parameters via
  - sparse connectivity
  - weight sharing
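
The size of this reduction can be illustrated with a quick count. The sketch below is only an illustration under assumed sizes: a single-channel $28\times 28$ input, a layer that keeps the same number of units, one $3\times 3$ kernel, and no bias terms.

```python
# Hypothetical comparison for a single-channel 28x28 input (MNIST-sized).
inputs = 28 * 28    # 784 flattened input values
outputs = 28 * 28   # assumption: the layer keeps the same number of units

# Fully connected: every output unit has its own weight for every input.
dense_weights = inputs * outputs

# One 3x3 kernel: each output sees only 9 inputs (sparse connectivity)
# and all outputs reuse the same 9 weights (weight sharing).
k = 3
conv_weights = k * k

print(dense_weights)  # 614656
print(conv_weights)   # 9
```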

#### Example in 2D
$3\times3$ *kernel* operating on $5\times5$ *input feature map* yielding $3\times3$ *output feature map* via *sliding window operation*
<div>
    <table>
        <tr>
            <td valign="center"> <img src="img/kernel.png" width=80></td>
            <td>&nbsp;<div style="font-size: 30pt">&rarr;</div>&nbsp;</td>
            <td><img src="img/numerical_no_padding_no_strides.gif"></td>
        </tr>
        <tr style="font-size: 11pt">
            <td align="center">kernel</td><td>&nbsp;&nbsp;</td>
            <td align="center">convolution on input&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&rarr;&nbsp;&nbsp;output</td>
        </tr>
    </table>
</div>

**Notes** 
* Output still has 2-dimensional structure!
* example shows one channel; more channels by *stacking* input *feature maps*
* generalization to $N$ dimensions: e.g. in 3-D, kernel is *cuboid* sliding over height, width, depth
* strictly speaking, a convolution requires *kernel flipping*; without it, the operation shown is a *cross-correlation* (see [Goodfellow et al.](http://www.deeplearningbook.org/contents/convnets.html))
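
The sliding window operation can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `cross_correlate2d` and the input and kernel values are made up, and the kernel is applied without flipping.

```python
import numpy as np

def cross_correlate2d(x, kernel):
    """Slide `kernel` over `x` (no padding, unit stride) and sum the products."""
    i1, i2 = x.shape
    k1, k2 = kernel.shape
    out = np.zeros((i1 - k1 + 1, i2 - k2 + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            # Element-wise product of the kernel with the window below it.
            out[r, c] = np.sum(x[r:r + k1, c:c + k2] * kernel)
    return out

x = np.arange(25, dtype=float).reshape(5, 5)  # made-up 5x5 input feature map
kernel = np.ones((3, 3))                      # made-up 3x3 kernel
y = cross_correlate2d(x, kernel)

print(y.shape)  # (3, 3)
```

Passing `kernel[::-1, ::-1]` instead of `kernel` applies the flipped kernel mentioned in the notes above.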

### Parameters of convolutions
---
discrete convolutions consist of kernels parameterized by
\begin{equation*}
\begin{split}
    n &\equiv \text{number of output feature maps},\\
    m &\equiv \text{number of input feature maps},\\
    k_j &\equiv \text{kernel size along axis $j\>,\ j=1,\dots,N$}
\end{split}
\end{equation*}
e.g. $n=3\>,\ m=2\>,\ k_1=k_2=3$:
<div><img src="img/AML_CNN_008.png" width=250></div>
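
As a rough illustration, the kernels of such a layer can be stored in one array. The axis order $(n, m, k_1, k_2)$ used below is one possible layout chosen for this sketch, and bias terms are ignored.

```python
import numpy as np

n, m, k1, k2 = 3, 2, 3, 3            # values from the example above
kernels = np.zeros((n, m, k1, k2))   # one k1 x k2 kernel per (output, input) pair

print(kernels.shape)  # (3, 2, 3, 3)
print(kernels.size)   # 54
```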

output size $o_j$ of convolutional layer along axis $j$ affected by:
* $i_j$: input size along axis $j$,
* $k_j$: kernel size along axis $j$,
* $s_j$: *stride* (distance between two consecutive positions of the kernel) along axis $j$,
* $p_j$: *zero padding* (number of zeros concatenated at the beginning and at the end of an axis) along axis $j$.

### Example
 $N = 2$, $i_1 = i_2 = 5$, $k_1 = k_2 = 3$, $s_1 = s_2 = 2$, and $p_1 = p_2 = 1\rightarrow o_1 = o_2 = 3$
<div>
    <table>
        <tr>
            <td valign="center"> <img src="img/kernel.png" width=80></td>
            <td>&nbsp;<div style="font-size: 30pt">&rarr;</div>&nbsp;</td>
            <td><img src="img/numerical_padding_strides.gif"></td>
        </tr>
        <tr style="font-size: 11pt">
            <td align="center">kernel</td><td>&nbsp;&nbsp;</td>
            <td align="center">padded input + convolution&nbsp;&nbsp;&rarr;&nbsp;output</td>
        </tr>
    </table>
</div>
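
The example can be reproduced with a small NumPy sketch. The helper `conv2d` is a hypothetical name, the all-ones input and kernel are made up, and the kernel is applied without flipping.

```python
import numpy as np

def conv2d(x, kernel, stride=1, padding=0):
    """Zero-pad `x`, then slide `kernel` over it with the given stride."""
    x = np.pad(x, padding)  # `padding` zeros at both ends of every axis
    k1, k2 = kernel.shape
    o1 = (x.shape[0] - k1) // stride + 1
    o2 = (x.shape[1] - k2) // stride + 1
    out = np.zeros((o1, o2))
    for r in range(o1):
        for c in range(o2):
            window = x[r * stride:r * stride + k1, c * stride:c * stride + k2]
            out[r, c] = np.sum(window * kernel)
    return out

x = np.ones((5, 5))        # i_1 = i_2 = 5
kernel = np.ones((3, 3))   # k_1 = k_2 = 3
y = conv2d(x, kernel, stride=2, padding=1)

print(y.shape)  # (3, 3)
```

With all-ones data, each output value counts how many real (non-padding) inputs lie under the kernel: 4 in the corners and 9 in the centre.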

### Pooling
---
+ *practically* works like discrete convolutions, but goal is *summarization of subregions*
+ output size $o_j$ of a pooling layer along axis $j$ affected by:
  - $i_j$: input size along axis $j$
  - $k_j$: pooling window size along axis $j$
  - $s_j$: stride along axis $j$

**Example**: $i_1=i_2=5\>,\ k_1=k_2=3\>,\ s_1=s_2=1\rightarrow o_1=o_2=3$
<div>
    <table>
        <tr>
            <td><img src="img/numerical_max_pooling.gif"></td><td>&nbsp;&nbsp;</td><td><img src="img/numerical_avg_pooling.gif"></td>
        </tr>
        <tr style="font-size: 11pt">
            <td align="center">Max-Pooling</td><td>&nbsp;&nbsp;</td><td align="center">Average Pooling</td>
        </tr>
    </table>
</div>
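
Both pooling variants can be sketched with the same sliding window and a different summary function. The helper `pool2d` and the input values are assumptions made for this illustration.

```python
import numpy as np

def pool2d(x, k, stride=1, reduce=np.max):
    """Summarize each k x k window of `x` with `reduce` (e.g. np.max, np.mean)."""
    o1 = (x.shape[0] - k) // stride + 1
    o2 = (x.shape[1] - k) // stride + 1
    out = np.zeros((o1, o2))
    for r in range(o1):
        for c in range(o2):
            out[r, c] = reduce(x[r * stride:r * stride + k,
                                 c * stride:c * stride + k])
    return out

x = np.arange(25, dtype=float).reshape(5, 5)  # made-up 5x5 input
max_pooled = pool2d(x, k=3, reduce=np.max)
avg_pooled = pool2d(x, k=3, reduce=np.mean)

print(max_pooled.shape, avg_pooled.shape)  # (3, 3) (3, 3)
```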

### Convolution Arithmetic
---
Derivation of formulas for output size $o=o_1=o_2$ for special case (ease of visualization):
* 2-D discrete convolutions ($N = 2$),
* square inputs ($i_1 = i_2 = i$),
* square kernel size ($k_1 = k_2 = k$),
* same strides along both axes ($s_1 = s_2 = s$),
* same zero padding along both axes ($p_1 = p_2 = p$)

**but**: no interaction of parameters across axes, therefore easily generalizable!

$\rightarrow$ Output size equals *number of possible placements of kernel on input*

**Example**: $i=4\>,\ k=3\>,\ s=1\>,\ p=0\rightarrow o=2$
<div><img src="img/no_padding_no_strides.gif" width=200></div>
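
The counting argument can be checked by brute force along a single axis. The function `count_placements` is a hypothetical helper written for this sketch.

```python
def count_placements(i, k, s=1):
    """Count start offsets at which a size-k kernel fits fully inside a size-i axis."""
    return len([start for start in range(0, i, s) if start + k <= i])

print(count_placements(4, 3))  # 2
```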

#### General Formulae for Unit Strides I
* **no zero padding, unit strides** (see above for visualization)<br>
For any $i$ and $k$, and for $s = 1$ and $p = 0$,
\begin{equation*}
    o = (i - k) + 1.
\end{equation*}
* **zero padding, unit strides** <br>
For any $i$,  $k$ and $p$, and for $s = 1$,
\begin{equation*}
    o = (i - k) + 2p + 1.
\end{equation*}
  *Proof*: effective input size changed to $i+2p\quad\square$<br>
 **Example**: $i = 5$, $k = 4$, $p = 2\rightarrow o=6$. 
  <div><img src="img/arbitrary_padding_no_strides.gif" width=250></div>
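
The proof idea can be checked numerically along one axis. This is a small sketch with a made-up 1-D input; `np.pad` adds $p$ zeros at both ends by default.

```python
import numpy as np

i, k, p = 5, 4, 2
signal = np.ones(i)          # made-up 1-D input of size i
padded = np.pad(signal, p)   # p zeros at the beginning and at the end
o = len(padded) - k + 1      # unit-stride placements on the padded input

print(len(padded), o)  # 9 6
```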

#### General Formulae for Unit Strides II
Two special cases of padding are widely used; the first:
+ **half (same) padding**: goal is $o = i$<br>
For any $i$ and for $k$ odd ($k = 2l + 1\>,\ l \in \mathbb{N}$), $s = 1$ and
$p = \lfloor k / 2 \rfloor = l$,<br>

\begin{equation*}
\begin{split}
    o &= i + 2 \lfloor k / 2 \rfloor - (k - 1) \\
      &= i + 2l - 2l = i
\end{split}
\end{equation*}
  **Example**: $i = 5$, $k = 3 \Rightarrow p = 1\rightarrow o=i=5$:
  <div><img src="img/same_padding_no_strides.gif" width=300></div>

#### General Formulae for Unit Strides III
Two special cases of padding are widely used; the second:
+ **full padding**: goal is $o > i$<br>
For any $i$ and $k$, and for $p = k - 1$ and $s = 1$,

\begin{equation*}
\begin{split}
    o &= i + 2(k - 1) - (k - 1) \\
      &= i + (k - 1).
\end{split}
\end{equation*}
  **Example**: $i = 5$, $k = 3\Rightarrow p = 2\rightarrow o=7$.
  <div><img src="img/full_padding_no_strides.gif" width=300></div>
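
In one dimension, NumPy's `np.convolve` offers the modes `"valid"`, `"same"` and `"full"`, which correspond to no, half and full padding at unit stride. The input and kernel values below are made up; note that `np.convolve` flips the kernel, which does not affect the output size.

```python
import numpy as np

x = np.arange(5, dtype=float)    # i = 5
w = np.array([1.0, 0.0, -1.0])   # k = 3

# Output length for each padding mode.
sizes = {mode: len(np.convolve(x, w, mode=mode))
         for mode in ("valid", "same", "full")}

print(sizes)  # {'valid': 3, 'same': 5, 'full': 7}
```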

#### General Formulae for Non-Unit Strides I
* **no zero padding, non-unit strides**<br>
For any $i$ and $k$, and for $s > 1$ and $p = 0$,

\begin{equation*}
    o = \left\lfloor \frac{i - k}{s} \right\rfloor + 1.
\end{equation*}
  **Example**: $i = 5$, $k = 3$, $s = 2$, and $p=0\rightarrow o=2$:
  <div><img src="img/no_padding_strides.gif" width=350></div>

#### General Formulae for Non-Unit Strides II
* **zero padding, non-unit strides**<br>
For any $i$, $k$, $p$ and $s$,
\begin{equation*}
    o = \left\lfloor \frac{i + 2p - k}{s} \right\rfloor + 1.
\end{equation*}

  **Example**: $k=3\>,\ s=2\>,\ p=1$ with $i=5$ (left) and $i=6$ (right) $\rightarrow o=3$ (both cases)
<div>
    <table>
        <tr>
            <td><img src="img/padding_strides.gif" width=80%></td>
            <td>&nbsp;&nbsp;</td>
            <td><img src="img/padding_strides_odd.gif" width=80%></td>
        </tr>
        <tr style="font-size: 11pt">
            <td align="center">5x5 input</td><td>&nbsp;&nbsp;</td><td align="center">6x6 input</td>
        </tr>
    </table>
</div>
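
The general formula fits in a one-line helper. The name `conv_output_size` and the non-square sizes at the end are assumptions made for this sketch.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size along one axis: floor((i + 2p - k) / s) + 1."""
    return (i + 2 * p - k) // s + 1

print(conv_output_size(5, 3, s=2, p=1))  # 3
print(conv_output_size(6, 3, s=2, p=1))  # 3

# Axes do not interact, so a non-square case is handled axis by axis.
# The sizes below are made up for illustration.
shape = tuple(conv_output_size(i, k, s, p)
              for i, k, s, p in [(28, 3, 1, 1), (32, 5, 2, 0)])
print(shape)  # (28, 14)
```

Setting $s=1$ or $p=0$ recovers the special cases from the previous sections.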

### Final Observations
---
#### Efficacy of CNN
<div><img src="img/AML_CNN_004.png" width=500></div>

**Observations**: 
* receptive field widens in deeper layers
* effect compounded by pooling or strided convolutions
* depth of network and choice of architecture lead to *indirect connections* of output neurons to input neurons
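
The growth of the receptive field can be estimated along one axis with a standard recurrence: each layer adds $(k-1)$ times the product of all earlier strides. The recurrence is not taken from the sources above; the helper `receptive_field` and the layer stacks are made up for illustration.

```python
def receptive_field(layers):
    """Receptive field of one output unit along one axis.

    `layers` is a list of (kernel_size, stride) pairs, ordered input to output.
    """
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump   # the kernel spans (k - 1) extra steps of size `jump`
        jump *= s             # strides multiply the step between neighbouring units
    return r

print(receptive_field([(3, 1), (3, 1)]))          # 5
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8
```

The second stack inserts a pooling layer of size 2 with stride 2 between the two convolutions, which widens the receptive field from 5 to 8 input positions.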

#### Example of usefulness: edge detection
kernel: $(1,-1)$; padding: $p_1=1\>,\ p_2=0$; stride: $s_1=s_2=1\rightarrow$ image size preserved

<div><img src="img/AML_CNN_006.png" width=500></div>

(image from [Goodfellow et al.](http://www.deeplearningbook.org/contents/convnets.html))

For a more sophisticated edge detector see, e.g., [Sobel Operator](https://en.wikipedia.org/wiki/Sobel_operator)
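
The effect of the $(1,-1)$ kernel can be sketched on a tiny made-up image. For brevity this sketch omits the padding, so the output is one column narrower than the input, and it applies the kernel without flipping.

```python
import numpy as np

# Made-up image: dark region (0) on the left, bright region (1) on the right.
image = np.array([[0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 1]], dtype=float)

# Kernel (1, -1) without flipping: each pixel minus its right-hand neighbour.
edges = image[:, :-1] - image[:, 1:]

print(edges[0])  # [ 0. -1.  0.  0.]
```

The output is non-zero only where the intensity changes, i.e. at the vertical edge.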

**However**: Weights ($\equiv$ kernel parameters) are *learned*, not prescribed