# Foundations of CNN

* used mostly in computer-vision problems such as image classification, object detection, style transfer, and others
* images may lead to large inputs (and consequently very large networks)

## Convolution operation

**Convolution**  
* image (6 x 6) * filter (3 x 3) = result (4 x 4)
    * (n x n) * (f x f) = (n-f+1 x n-f+1)
        * f usually odd
        * pixels near the border are covered by fewer filter positions and the image shrinks with every convolution -> padding (adding pixels around the image)
        * if p denotes padding size then the result size is (n+2p-f+1 x n+2p-f+1)
        * valid convolution (no padding)
            * (n x n) * (f x f) = (n-f+1 x n-f+1)
        * same convolution (output size is same as original input size)
            * (n+2p x n+2p) * (f x f) = (n+2p-f+1 x n+2p-f+1)
            * p = (f-1)/2  
* in some textbooks, filters are flipped (mirrored horizontally and vertically) before the operation, which makes convolution associative; by convention the flip is skipped in deep learning literature (the operation is technically cross-correlation)
* filters can be learned through weights (!)        
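
The size arithmetic above can be checked with a small NumPy sketch. It assumes zero padding done with `np.pad`; the variable names are illustrative and not part of the notes.

```python
import numpy as np

n, f = 6, 3                           # image side and (odd) filter side
image = np.arange(n * n).reshape(n, n)

# valid convolution: no padding, output side is n - f + 1
valid_side = n - f + 1

# same convolution: pad p = (f - 1) / 2 zeros on every side
p = (f - 1) // 2
padded = np.pad(image, p, mode="constant", constant_values=0)
same_side = padded.shape[0] - f + 1   # equals n + 2p - f + 1

print(valid_side, padded.shape, same_side)
```

With n = 6 and f = 3 the valid output side is 4, while padding by p = 1 keeps the output at 6.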
    
Algorithm 
* sliding filter window into the image, multiplying filter with the image values, adding them together and putting them into the result matrix
* sliding through the whole image from left to right (by 1 pixel), from top to down (by 1 pixel)
* once filled, the result matrix shows where the filter responded (e.g. an edge filter highlights edges)
* python: `conv-forward` tensorflow: `tf.nn.conv2d` keras: `tf.keras.layers.Conv2D`
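
A minimal sketch of the sliding-window steps above, assuming a single-channel image, stride 1 and no padding. The helper name `conv2d_single` and the vertical edge filter are illustrative choices, not taken from the notes; as in deep learning convention, the filter is not flipped.

```python
import numpy as np

def conv2d_single(image, kernel):
    """Valid cross-correlation of a 2D image with a 2D kernel, stride 1."""
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(out.shape[0]):          # top to bottom
        for j in range(out.shape[1]):      # left to right
            window = image[i:i + f_h, j:j + f_w]
            # element-wise product, then sum into one output cell
            out[i, j] = np.sum(window * kernel)
    return out

# 6 x 6 image: bright left half, dark right half
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)
# illustrative vertical edge filter
kernel = np.array([[1, 0, -1]] * 3, dtype=float)

result = conv2d_single(image, kernel)
print(result.shape)   # (4, 4)
print(result[0])      # large values in the middle columns mark the edge
```

The output is nonzero only in the two middle columns, which is where the bright-to-dark transition sits.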

**Strides**
* strides (movement by pixels) different from 1 can be used for the filter window
* (n+2p x n+2p) * (f x f) = ([(n+2p-f)/s+1] x [(n+2p-f)/s+1])
    * s is the stride; [ ] means rounding down (floor) when the fraction is not an integer
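
The stride formula can be written as a one-line helper. This is a sketch; the function name `strided_output_size` is hypothetical and integer division stands in for the rounding down.

```python
def strided_output_size(n, f, p=0, s=1):
    """Output side for an n x n input, f x f filter, padding p, stride s."""
    return (n + 2 * p - f) // s + 1   # // rounds down, as in the formula

print(strided_output_size(7, 3, p=0, s=2))   # (7 - 3) / 2 + 1 = 3
print(strided_output_size(6, 3, p=0, s=2))   # 1.5 rounds down to 1, plus 1 = 2
print(strided_output_size(6, 3, p=1, s=1))   # same convolution keeps 6
```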

**Volumes**
* image (6 x 6 x 3) * filter (3 x 3 x 3) = result (4 x 4)  
* detection across color channels (can look at one or all)
* multiple filters can be applied at the same time resulting in stacking the 2d outputs (the last dimension of the result is driven by number of convolution filters)

Algorithm
* put the filter into the starting position and multiply corresponding elements (across all channels)
* sum the products and write the sum into the result matrix
* move the filter by stride and repeat until the result matrix is filled
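
The volume algorithm above could look roughly like this in NumPy. It is a sketch assuming no padding, a channels-last layout and filters stored as (f, f, channels, number of filters); `conv_volume` is a made-up name.

```python
import numpy as np

def conv_volume(image, filters, stride=1):
    """image: (n_h, n_w, n_c); filters: (f, f, n_c, n_filters)."""
    n_h, n_w, n_c = image.shape
    f, _, f_c, n_filters = filters.shape
    assert f_c == n_c, "filter depth must match the number of image channels"
    out_h = (n_h - f) // stride + 1
    out_w = (n_w - f) // stride + 1
    out = np.zeros((out_h, out_w, n_filters))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            window = image[top:top + f, left:left + f, :]
            for k in range(n_filters):
                # one number per filter position: sum over f x f x n_c products
                out[i, j, k] = np.sum(window * filters[:, :, :, k])
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((6, 6, 3))
filters = rng.standard_normal((3, 3, 3, 2))   # two 3 x 3 x 3 filters
out = conv_volume(image, filters)
print(out.shape)   # (4, 4, 2): one 4 x 4 map per filter
```

Each filter collapses the three channels into a single 2D map, so the last output dimension equals the number of filters.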

## One layer of CNN

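* number of params -> (filter size + bias) * number of filters, where filter size = f x f x number of input channels and each filter has one bias, e.g. 10 filters of size 3 x 3 x 3 -> (27 + 1) * 10 = 280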
* notation for layer $l$
    * $m$ -> training examples
    * $f^{[l]}$ -> filter size
    * $p^{[l]}$ -> padding
    * $s^{[l]}$ -> stride
    * $n_c^{[l]}$ -> number of filters
    * $f^{[l]}$ x $f^{[l]}$ x $n_c^{[l-1]}$ -> size of a filter, last dimension same as the number of channels in previous layer
    * $n_h^{[l]}$ x $n_w^{[l]}$ x $n_c^{[l]}$ -> activations
    * $m$ x $n_h^{[l]}$ x $n_w^{[l]}$ x $n_c^{[l]}$ -> vectorized activations
    * $f^{[l]}$ x $f^{[l]}$ x $n_c^{[l-1]}$ x $n_c^{[l]}$ -> weights
    * $n_c^{[l]}$ -> bias
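
A short sketch of the parameter count, using the weight and bias shapes from the notation above. The function name `conv_layer_params` is hypothetical.

```python
def conv_layer_params(f, n_c_prev, n_c):
    """Learnable parameters of one convolution layer."""
    weights = f * f * n_c_prev * n_c   # shape f x f x n_c_prev x n_c
    biases = n_c                       # one bias per filter
    return weights + biases

# 10 filters of size 3 x 3 applied to a 3-channel input
print(conv_layer_params(f=3, n_c_prev=3, n_c=10))   # (27 + 1) * 10 = 280
```

The count does not depend on the input height or width, only on the filter size and the channel counts.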

* input -> $n_h^{[l-1]}$ x $n_w^{[l-1]}$ x $n_c^{[l-1]}$  
* output -> $n_h^{[l]}$ x $n_w^{[l]}$ x $n_c^{[l]}$, where $n_h^{[l]} = \lfloor \frac{n_h^{[l-1]}+2p^{[l]}-f^{[l]}}{s^{[l]}}+1 \rfloor$ and $n_w^{[l]}$ is computed the same way from $n_w^{[l-1]}$; see the less formal version above

* forward step (one filter)
    * volume calculation (sliding, element-wise multiplication and sum)
    * adding bias
    * feeding the resulting matrix into an activation function
    * repeating this for every filter and stacking the results (output of the convolution layer)
    * output size: ([(n+2p-f)/s+1] x [(n+2p-f)/s+1] x number of filters)
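
The forward step for a single example might be sketched as follows. This assumes zero padding and ReLU as the activation (the notes do not fix a particular activation); `conv_layer_forward` and its argument names are illustrative.

```python
import numpy as np

def conv_layer_forward(a_prev, W, b, pad=0, stride=1):
    """a_prev: (n_h_prev, n_w_prev, n_c_prev); W: (f, f, n_c_prev, n_c); b: (n_c,)."""
    f, n_c = W.shape[0], W.shape[3]
    # zero-pad height and width only, not the channel axis
    a_pad = np.pad(a_prev, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    n_h = (a_prev.shape[0] + 2 * pad - f) // stride + 1
    n_w = (a_prev.shape[1] + 2 * pad - f) // stride + 1
    z = np.zeros((n_h, n_w, n_c))
    for i in range(n_h):
        for j in range(n_w):
            top, left = i * stride, j * stride
            window = a_pad[top:top + f, left:left + f, :]
            # multiply-and-sum over (f, f, n_c_prev) for all filters, then add bias
            z[i, j, :] = np.tensordot(window, W, axes=3) + b
    return np.maximum(z, 0)   # ReLU, an assumed choice of activation

rng = np.random.default_rng(1)
a_prev = rng.standard_normal((6, 6, 3))
W = rng.standard_normal((3, 3, 3, 2))
b = np.zeros(2)
print(conv_layer_forward(a_prev, W, b).shape)          # (4, 4, 2)
print(conv_layer_forward(a_prev, W, b, pad=1).shape)   # (6, 6, 2)
```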

* backward step ?


## Deep CNN

* for the parameters of convolution filters, formulas above apply
* common architecture -> height and width of the activations shrink while the number of channels increases
* in the last step, the final activation volume is unrolled (flattened) into one long vector, which is fed into the output layer
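
To illustrate the shrinking height and width and the growing channel count, here is a shape trace through a hypothetical three-layer stack. The layer settings and the 39 x 39 x 3 input are made up for the example.

```python
def conv_out(n, f, p=0, s=1):
    """Output side for one convolution layer."""
    return (n + 2 * p - f) // s + 1

# hypothetical stack: (filter size, stride, padding, number of filters)
layers = [(3, 1, 0, 10), (5, 2, 0, 20), (5, 2, 0, 40)]

side, channels = 39, 3                 # assumed 39 x 39 x 3 input image
shapes = [(side, side, channels)]
for f, s, p, n_filters in layers:
    side = conv_out(side, f, p, s)
    channels = n_filters               # channels out = number of filters
    shapes.append((side, side, channels))

flattened = side * side * channels     # length of the unrolled vector
print(shapes)
print(flattened)
```

The trace goes 39 x 39 x 3 -> 37 x 37 x 10 -> 17 x 17 x 20 -> 7 x 7 x 40, which unrolls to a vector of length 1960.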

**Pooling layer**  
* max pooling -> input matrix divided into regions, returns max for each region
    * filter size and stride are the pooling hyperparameters (they define the regions)
    * no parameters to learn (fixed function)
    * intuition -> a large value means the feature was detected somewhere in the region, and taking the max keeps it
    * done independently on each channel
    * output size: ([(n+2p-f)/s+1] x [(n+2p-f)/s+1] x number of channels), usually with p = 0
    
* average pooling -> input matrix divided into regions, returns avg for each region
    * not used that often (unless in very deep NNs for reducing the input)
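
Both pooling variants can be sketched with one NumPy helper. It assumes no padding and a channels-last input; `pool2d` and its `mode` argument are illustrative names.

```python
import numpy as np

def pool2d(a, f=2, stride=2, mode="max"):
    """Pool each channel of a (n_h, n_w, n_c) volume independently."""
    n_h, n_w, n_c = a.shape
    out_h = (n_h - f) // stride + 1
    out_w = (n_w - f) // stride + 1
    reduce = np.max if mode == "max" else np.mean
    out = np.zeros((out_h, out_w, n_c))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            window = a[top:top + f, left:left + f, :]
            # reduce over height and width only, keeping channels separate
            out[i, j, :] = reduce(window, axis=(0, 1))
    return out

a = np.arange(16, dtype=float).reshape(4, 4, 1)   # values 0..15, one channel
max_pooled = pool2d(a, mode="max")
avg_pooled = pool2d(a, mode="avg")
print(max_pooled[:, :, 0])
print(avg_pooled[:, :, 0])
```

With f = 2 and stride 2 the 4 x 4 input splits into four non-overlapping regions, so both outputs are 2 x 2 and there is nothing to learn.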