### Going Deeper with Convolutions

**Authors:** C. Szegedy, W. Liu, Y. Jia, *et al.*  
**Link:** https://arxiv.org/pdf/1409.4842.pdf  

---


- Proposes a deep CNN architecture, Inception (building on the "Network in Network" idea), that increases both the depth and the width of the network while improving the utilization of computing resources. The models were designed to keep a computational budget of 1.5 billion multiply-adds at inference time.
- `Hebbian principle` - Neurons that fire together, wire together: A method of determining how to alter the weights between model neurons. The weight between two neurons increases if the two neurons activate simultaneously, and reduces if they activate separately. Nodes that tend to be either both positive or both negative at the same time have strong positive weights, while those that tend to be opposite have strong negative weights.
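
As a rough illustration of the Hebbian rule described above, here is a minimal sketch. The function name `hebbian_update`, the learning rate, and the use of +1/-1 activations are assumptions made for the example; they are not taken from the paper.

```python
def hebbian_update(weight, pre, post, lr=0.1):
    """Plain Hebbian rule: the weight changes in proportion to pre * post.

    With activations coded as +1 / -1, the weight grows when both neurons
    agree (both +1 or both -1) and shrinks when they disagree.
    """
    return weight + lr * pre * post

w_agree = hebbian_update(0.0, +1, +1)      # both active: weight increases
w_both_neg = hebbian_update(0.0, -1, -1)   # both negative: weight increases
w_disagree = hebbian_update(0.0, +1, -1)   # opposite signs: weight decreases
print(w_agree, w_both_neg, w_disagree)
```
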
- GoogLeNet uses 12 times fewer parameters than AlexNet and is 22 layers deep.
- Max pooling layers sometimes result in loss of accurate spatial information.
    - Advantages of Max pooling
        - No parameters
        - Often accurate
    - Disadvantages of Max pooling
        - More computationally expensive
        - More hyper-parameters (pooling size and stride)
- [1 x 1 Explanation](https://www.youtube.com/watch?v=qVP574skyuM): 1 x 1 convolutional layers are used as dimension-reduction modules to remove computational bottlenecks. This makes it possible to increase both the depth and the width (number of units at each level) of the network. For example, convolving a feature map of size 100 x 100 x $C$ with $k$ 1 x 1 filters yields a feature map of size 100 x 100 x $k$.
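
The shape change described above, and the saving it buys, can be sketched with NumPy. This is an illustrative sketch only: the channel counts (64 in, 16 reduced, 32 out) and the helper names `conv1x1` and `conv_multiply_adds` are assumptions chosen for the example, not values from the paper.

```python
import numpy as np

def conv1x1(feature_map, weights):
    """Apply k 1x1 filters to an (H, W, C) feature map.

    weights has shape (C, k), one column per filter. A 1x1 convolution mixes
    channels independently at every spatial position, so it reduces to a
    matrix product over the channel axis.
    """
    return feature_map @ weights

def conv_multiply_adds(height, width, in_channels, out_channels, kernel):
    """Multiply-adds of a stride-1, same-padded kernel x kernel convolution."""
    return height * width * in_channels * out_channels * kernel * kernel

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 100, 64))  # 100 x 100 x C with C = 64 (assumed)
w = rng.standard_normal((64, 16))        # k = 16 filters (assumed)
y = conv1x1(x, w)
print(y.shape)  # (100, 100, 16)

# Cost of a 5x5 convolution applied directly versus after a 1x1 reduction.
direct = conv_multiply_adds(100, 100, 64, 32, 5)
reduced = (conv_multiply_adds(100, 100, 64, 16, 1)
           + conv_multiply_adds(100, 100, 16, 32, 5))
print(direct, reduced)  # 512000000 138240000
```

With these assumed sizes, putting the 1 x 1 reduction first cuts the multiply-adds of the 5 x 5 branch to roughly a quarter.
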
- Deep networks have drawbacks:
    - A large number of parameters makes a deep network more prone to overfitting.
    - Training a deep network requires a lot of computational resources.
    - Solution: distribute computational resources efficiently and introduce sparsity by replacing fully connected layers with sparsely connected ones.
- Architecture
    - Filter sizes of 1 x 1, 3 x 3, and 5 x 5 are used to avoid patch-alignment issues
    - [Inception modules](https://www.youtube.com/watch?v=VxhSouuSZDY): 9 Inception modules are stacked, with about 100 layers (independent building blocks) in total
        - Naive version:
            - Merging the pooling-layer output with the convolutional outputs increases the number of output channels from stage to stage, which leads to a computational blow-up within a few stages
        - Dimensionality-reduction Inception module (idea based on embeddings): 1 x 1 convolutions reduce the channel dimension before the expensive 3 x 3 and 5 x 5 convolutions and also add non-linearity
    ![](images/gnet0.png)
    - All convolutions, including those inside the Inception modules, use ReLU activations
    ![](images/gnet1.png)
    - "#3 x 3 reduce" and "#5 x 5 reduce" stand for the number of 1 x 1 filters in the reduction layer used before the 3 x 3 and 5 x 5 convolutions.
    ![](images/gnet2.png)
    ![](images/gnet3.png)
    - Auxiliary classifiers were attached to intermediate layers to combat the vanishing-gradient problem while also providing regularization. During training, each auxiliary classifier's loss is added to the total loss of the network with a discount weight of 0.3.
    - An average pooling layer is used before the final classifier in place of fully connected layers
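
The weighting of the auxiliary losses can be sketched as follows. The function name `total_training_loss` and the example loss values are hypothetical; only the 0.3 discount weight comes from the notes above.

```python
def total_training_loss(main_loss, aux_losses, aux_weight=0.3):
    """Add down-weighted auxiliary classifier losses to the main loss.

    aux_losses holds one loss value per auxiliary classifier; each is scaled
    by aux_weight (0.3 here) before being added to the main loss.
    """
    return main_loss + aux_weight * sum(aux_losses)

# Hypothetical per-batch loss values, for illustration only.
loss = total_training_loss(main_loss=2.0, aux_losses=[2.5, 2.2])
print(round(loss, 2))  # 3.41
```

The auxiliary terms only shape the gradients during training; at inference time the auxiliary classifiers are discarded and only the main output is used.
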
- Training:
    - GoogLeNet networks were trained using the **DistBelief** (Large Scale Distributed Deep Networks) distributed machine learning system
    - Asynchronous SGD with momentum 0.9 and a fixed learning-rate schedule: the learning rate is decreased by 4% every 8 epochs.
    - ImageNet 2012 dataset (1.3 Million images)
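
The step schedule above (a 4% decrease every 8 epochs) can be written as a small function. The base learning rate of 0.01 is an assumption for the example; the notes do not state the initial value.

```python
def learning_rate(epoch, base_lr=0.01, decay=0.04, step=8):
    """Step schedule: multiply the rate by (1 - decay) once every `step` epochs.

    base_lr is an assumed starting value, not one given in the notes.
    """
    return base_lr * (1.0 - decay) ** (epoch // step)

for epoch in (0, 7, 8, 16, 80):
    print(epoch, learning_rate(epoch))
```

The rate stays constant within each 8-epoch window and drops by a factor of 0.96 at epochs 8, 16, 24, and so on.
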
- Seven versions of the same GoogLeNet model were trained and combined in an ensemble, which obtained a top-5 error of 6.67%
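
One common way to form an ensemble prediction is to average the per-class probabilities of the individual models and rank the classes by the mean. The sketch below assumes that combination rule; the function name `ensemble_top_k` and the toy probabilities (3 models, 6 classes) are made up for illustration.

```python
def ensemble_top_k(model_probs, k=5):
    """Average per-class probabilities over models; return the top-k class ids.

    model_probs is a list with one probability vector per model, all of the
    same length (one entry per class).
    """
    num_models = len(model_probs)
    num_classes = len(model_probs[0])
    mean = [sum(p[c] for p in model_probs) / num_models
            for c in range(num_classes)]
    ranked = sorted(range(num_classes), key=lambda c: mean[c], reverse=True)
    return ranked[:k]

# Toy softmax outputs from three hypothetical models over six classes.
probs = [
    [0.40, 0.30, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.50, 0.20, 0.10, 0.05, 0.05],
    [0.20, 0.40, 0.10, 0.20, 0.05, 0.05],
]
top2 = ensemble_top_k(probs, k=2)
print(top2)  # [1, 0]
```

A prediction counts as top-5 correct when the true label appears among the five classes returned with `k=5`.
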

---

**Object Detection**
- The approach is similar to R-CNN, but augmented with the Inception model as the region classifier

---
![](images/GoogLeNet.gif)