In [3]:
import glob
glob.glob("*.png")

['residual_block.png',
 'shallow and deeper_networks.png',
 'resnet.png',
 'resnet_ensemple_networks.png',
 'resnet_architectures.png',
 'imagenet_resnet_residual.png',
 'Identity_mapping in residual blocks.png',
 'resnet_2_3_layers.png']

<h1>ResNet</h1>

![ResNet](resnet.png)

***The authors of the ResNet paper argue that stacking layers shouldn't degrade network performance, because we could simply stack identity mappings (layers that don't do anything) on top of the current network, and the resulting architecture would perform the same.***

 

*This indicates that the deeper model should not produce a training error higher than its shallower counterparts.* 
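The identity-mapping argument can be sketched in a few lines. This is only a toy illustration: `shallow_net` is a hypothetical stand-in for an already trained network, not anything from the paper.

```python
def shallow_net(x):
    # Hypothetical stand-in for an already trained shallow network.
    return [2.0 * v + 1.0 for v in x]

def identity_layer(x):
    # A layer that does nothing: the output equals the input.
    return list(x)

def deeper_net(x, extra_layers=5):
    # The same shallow network with identity layers stacked on top.
    out = shallow_net(x)
    for _ in range(extra_layers):
        out = identity_layer(out)
    return out

x = [0.5, -1.0, 3.0]
print(shallow_net(x) == deeper_net(x))  # True
```

Since the deeper network can always reproduce the shallower one this way, its best achievable training error should be no worse.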

![imagenet_resnet_residual](imagenet_resnet_residual.png)

    If you think about it, ResNet can be considered as an ensemble of smaller networks!

![Resnet_Ensemble](resnet_ensemple_networks.png)

<h3>WHERE IS THE RESIDUE?</h3>
 

    But why call it residual? Where is the residue? It's time to let the mathematicians within us come to the surface. Consider a neural network block whose input is x, and suppose we would like it to learn the true mapping H(x). Let us denote the difference (or the residual) between the two as

![ResNet_math](resnet_math.png)

![ResNet_core_idea](resnet_core_idea.png)
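In the ResNet paper the residual is defined as F(x) = H(x) - x, so the block outputs F(x) + x, which is presumably what the figures above show. Below is a minimal pure-Python sketch of such a block. The helper names (`linear`, `relu`, `residual_block`) and the two-layer form of F are assumptions for illustration, and the ReLU that usually follows the addition is left out to keep the identity case exact.

```python
def linear(x, weights, bias):
    # Plain fully connected layer: one output per weight row.
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def residual_block(x, w1, b1, w2, b2):
    # F(x) is two small layers; the block returns F(x) + x.
    f = linear(relu(linear(x, w1, b1)), w2, b2)
    return [fi + xi for fi, xi in zip(f, x)]

dim = 3
zeros_w = [[0.0] * dim for _ in range(dim)]
zeros_b = [0.0] * dim
x = [1.0, -2.0, 0.5]
# With all weights at zero, F(x) = 0 and the block is an exact identity mapping.
print(residual_block(x, zeros_w, zeros_b, zeros_w, zeros_b))  # [1.0, -2.0, 0.5]
```

Driving F(x) toward zero is enough to get an identity mapping, which is easier than learning the identity through a stack of nonlinear layers.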

<h2>ResNet Backpropagation</h2>

![Resnet_Back_propagation](ResNet_back_propagation.jpg)

- `If λ > 1 (a scaling picked up on the shortcut path at every block), the left term will be exponentially large and the exploding gradient problem occurs. Recall that when gradients explode, the loss cannot converge.`

- `If λ < 1 (because of multiple ReLUs or other activation functions on the shortcut path), the left term will be exponentially small and the vanishing gradient problem occurs. The weights can only be updated by tiny amounts, so the loss stays on a plateau and ends up converging to a large value.`

        That is why the shortcut connection path from input to output must be kept clean, without any Conv layers, BN, or ReLU.
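To get a feel for the exponential behaviour, here is a small sketch that multiplies the shortcut scaling factor over many blocks. It assumes every block uses the same factor, and both the function name `shortcut_scale` and the depth of 50 blocks are illustrative choices.

```python
def shortcut_scale(lam, num_blocks):
    # Product of the per-block shortcut factors when every factor equals lam.
    scale = 1.0
    for _ in range(num_blocks):
        scale *= lam
    return scale

for lam in (1.1, 1.0, 0.9):
    print(lam, shortcut_scale(lam, 50))
```

With 50 blocks, a factor of 1.1 grows to roughly 117 and a factor of 0.9 shrinks to roughly 0.005. Only a factor of exactly 1, the clean identity shortcut, leaves the signal untouched.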

<h1>ResNet V2</h1>

![ResNet_V2](resnet_v2.png)

    Here's why!
   
![ResNet_math](resnet_m1.jpg)

    The input signal x_l is still kept alive! And

![ResNet_math2](resnet_m2.png)

    The gradients can (almost) never be zero!

    The identity summation speeds up training and improves gradient flow, since the skip connections carry the outputs of earlier conv operations forward unchanged. Backpropagation can therefore pass error corrections to earlier layers much more easily. This addresses the vanishing gradient problem.
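For reference, the two relations from the identity-mappings paper that this argument rests on can be written out as follows. The notation is an assumption about what the figures contain: $x_l$ is the input to block $l$, $F$ is the residual function with weights $W_i$, and $\mathcal{E}$ is the loss.

$$
x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i)
$$

$$
\frac{\partial \mathcal{E}}{\partial x_l}
= \frac{\partial \mathcal{E}}{\partial x_L}
\left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i, W_i) \right)
$$

The first relation shows that $x_l$ reaches any deeper block $L$ unchanged, plus a sum of residuals. In the second, the constant 1 passes the gradient straight through, so it can only vanish if the sum term is exactly $-1$, which is unlikely across a whole mini-batch.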

`For ResNet, there are two kinds of residual connections:`

- `1. Identity shortcuts (x) can be used directly when the input and output have the same dimensions.`

- `2. When the dimensions change (the input is larger than the residual output, but we still need to add them), the default solution is a 1x1 Conv with a stride of 2. Yes, every other pixel along each spatial dimension is skipped, which is three quarters of the positions overall.`


    They do this because it's what you should be doing. Residual connections with different shapes should be handled via a learned linear transformation between the two tensors, e.g. a 1x1 convolution with appropriate strides and border_mode, or for Dense layers, just matrix multiplication.
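The projection shortcut can be sketched without any framework. The function below is a hypothetical pure-Python 1x1 convolution with stride 2 over a feature map stored as nested lists, so the layout and the name `conv1x1_stride2` are assumptions, and a real network would use its framework's convolution layer instead.

```python
def conv1x1_stride2(x, weight):
    # x: input as [in_channels][height][width]
    # weight: [out_channels][in_channels], one scalar per channel pair (1x1 kernel)
    height, width = len(x[0]), len(x[0][0])
    out = []
    for w_row in weight:
        channel = []
        for i in range(0, height, 2):        # stride 2 over rows
            row = []
            for j in range(0, width, 2):     # stride 2 over columns
                row.append(sum(w * x[c][i][j] for c, w in enumerate(w_row)))
            channel.append(row)
        out.append(channel)
    return out

# One input channel, 4x4 map holding the values 0..15.
x = [[[4 * i + j for j in range(4)] for i in range(4)]]
# Two output channels: one copies the input, one doubles it.
weight = [[1.0], [2.0]]
y = conv1x1_stride2(x, weight)
print(y[0])  # [[0.0, 2.0], [8.0, 10.0]]
```

The output has twice the channels and half the height and width, so it can be added to the residual branch. Only the positions on every other row and column are ever read.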

`We'll go through ResNet34.`

    It starts with one convolution layer with a 7x7 kernel (64 channels) and a stride of 2
    
    It is followed by MaxPooling. In fact, ResNet has only 1 MaxPooling operation!
    
    It is followed by 4 ResNet blocks (config: 3, 4, 6, 3)
    
    The channels are constant in each block (64, 128, 256, 512 respectively). Each block has only 3x3 kernels.

    Except for the first block, each block starts with a 3x3 kernel of stride 2 (this downsampling takes the place of MaxPooling)

    The dotted lines are 1x1 convs with stride 2

![ResNet 34](ResNet34.png)
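The walkthrough above can be traced numerically. This sketch assumes a 224x224 input (the usual ImageNet crop) and padding chosen so that each stride-2 operation exactly halves the resolution. The function name is illustrative.

```python
def resnet34_stage_shapes(input_size=224):
    # Stem: 7x7 conv with stride 2, then the single MaxPooling with stride 2.
    size = input_size // 2   # after the 7x7 conv
    size = size // 2         # after MaxPooling
    blocks_per_stage = [3, 4, 6, 3]
    channels_per_stage = [64, 128, 256, 512]
    shapes = []
    for index, (blocks, channels) in enumerate(zip(blocks_per_stage, channels_per_stage)):
        if index > 0:
            size = size // 2  # the first 3x3 conv of the stage uses stride 2
        shapes.append((blocks, channels, size))
    return shapes

for blocks, channels, size in resnet34_stage_shapes():
    print(f"{blocks} blocks, {channels} channels, {size}x{size}")
```

Under these assumptions the four stages run at 56x56, 28x28, 14x14 and 7x7, with the channel count doubling each time the resolution halves.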

***SideNote:***

    The accuracy of convolutional networks evaluated on ImageNet is vastly underestimated. We find that when the mistakes of the model are assessed by human subjects and considered correct when four out of five humans agree with the model's prediction, the top-1 error of a ResNet-101 trained on ImageNet... decreases from 22.69% to 9.47%. Similarly, the top-5 error decreases from 6.44% to 1.94%. (This is true for other models as well).

![ResNet_Bottleneck](ResNet_bottleneck.png)

- `Deeper non-bottleneck ResNets also gain accuracy from increasing depth, but are not as economical as the bottleneck ResNets. So the usage of bottleneck designs is mainly due to practical considerations.` 
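A rough weight count shows where the economy comes from. The channel widths below (256 in, squeezed to 64, back out to 256) are assumed from the usual bottleneck illustration in the paper, and biases and BatchNorm parameters are ignored.

```python
def conv_params(kernel, in_channels, out_channels):
    # Weight count of a kernel x kernel convolution, biases ignored.
    return kernel * kernel * in_channels * out_channels

# Regular block kept at 256 channels: two 3x3 convolutions.
regular = 2 * conv_params(3, 256, 256)
# Bottleneck block: 1x1 down to 64, 3x3 at 64, 1x1 back up to 256.
bottleneck = conv_params(1, 256, 64) + conv_params(3, 64, 64) + conv_params(1, 64, 256)
print(regular, bottleneck)  # 1179648 69632
```

At the same 256-channel width, the bottleneck block needs roughly 17 times fewer weights than two plain 3x3 convolutions.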

                R18    - Regular        - 2, 2, 2, 2
                
                R34    - Regular        - 3, 4, 6, 3
                
                R50    - Bottleneck     - 3, 4, 6, 3
                
                R101   - Bottleneck     - 3, 4, 23, 3
                
                R152   - Bottleneck     - 3, 8, 36, 3
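The names in this table follow from the block counts. The sketch below assumes the usual counting convention: one stem convolution, two convolutions per regular block or three per bottleneck block, and one final fully connected layer. Shortcut projections are not counted.

```python
CONFIGS = {
    "R18": ("regular", [2, 2, 2, 2]),
    "R34": ("regular", [3, 4, 6, 3]),
    "R50": ("bottleneck", [3, 4, 6, 3]),
    "R101": ("bottleneck", [3, 4, 23, 3]),
    "R152": ("bottleneck", [3, 8, 36, 3]),
}
CONVS_PER_BLOCK = {"regular": 2, "bottleneck": 3}

def depth(block_type, blocks_per_stage):
    # Stem conv + convs inside the blocks + final fully connected layer.
    return 1 + CONVS_PER_BLOCK[block_type] * sum(blocks_per_stage) + 1

for name, (block_type, stages) in CONFIGS.items():
    print(name, depth(block_type, stages))
```

Each configuration adds up to the number in its name. For example, R50 is 1 + 3 * 16 + 1 = 50.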

<h1>ResNet V3 or ResNeXt</h1>

*Aggregated Transformations*

![ResNet V3_ResNext](ResNetV3_ResNext.png)