### Inception 

### Going Deeper with Convolutions (Szegedy C. et al., 2014/2015)

*The main hallmark of this architecture is the __improved utilization
of the computing resources inside the network__. This was achieved by a carefully
crafted design that allows for __increasing the depth and width of the network__ while
keeping the computational budget constant. To optimize quality, the architectural
decisions were based on the __Hebbian principle and the intuition of multi-scale
processing__.*

Hebbian principle in a nutshell: "Cells that fire together wire together."

[Paper](https://arxiv.org/abs/1409.4842)


*...Inception, which derives its name from the Network in network paper by Lin et al
in conjunction with the famous “we need to go deeper” internet meme*

<img src="../assets/13_inception.png" width="500">

In [1]:
import os
import numpy as np
import netron
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import pretrainedmodels

assert torch.cuda.is_available() is True
%load_ext watermark

In [2]:
%watermark -p torch,ignite,numpy,netron

torch : 1.10.2
ignite: 0.4.8
numpy : 1.22.1
netron: 5.7.8



Goals:

* Implement multi-scale feature perception
* Make the model deeper while keeping it computationally efficient

<img src="../assets/3_inception_noses.png" width="500">

#### Inception blocks
<img src="../assets/1_inception_naive.png" width="500">

In [3]:
class NaiveBlock(nn.Module):
    def __init__(self, inp_ch: int = 256, out_ch: tuple = (128, 192, 96)) -> None:
        super().__init__()
        self.cov1x1 = nn.Conv2d(in_channels=inp_ch, out_channels=out_ch[0], kernel_size=1)
        self.cov3x3 = nn.Conv2d(in_channels=inp_ch, out_channels=out_ch[1], kernel_size=3, padding=1)
        self.cov5x5 = nn.Conv2d(in_channels=inp_ch, out_channels=out_ch[2], kernel_size=5, padding=2)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=(1, 1), padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        branch1 = F.relu(self.cov1x1(x))
        branch2 = F.relu(self.cov3x3(x))
        branch3 = F.relu(self.cov5x5(x))
        branch4 = self.maxpool(x)
        out = torch.cat((branch1, branch2, branch3, branch4), 1)
        print(f'Total feature maps: {out.shape[1]} of size: {out.shape[2:]}')
        return out

In [4]:
naive_b = NaiveBlock().eval()
x = torch.Tensor(np.random.normal(size=(256, 28, 28)))
model_path = os.path.join('onnx_graphs', 'naive_inception_block.onnx')
torch.onnx.export(naive_b, torch.unsqueeze(x, 0), model_path,
                  input_names=['input'], output_names=['output'], opset_version=10)

Total feature maps: 672 of size: torch.Size([28, 28])


In [None]:
netron.start(model_path, 30000)

Number of multiplications in each convolutional branch of NaiveBlock:

In [5]:
def count_conv_ops(kernel, output_channel, input_shape):
    return np.prod([*kernel, output_channel, *input_shape])

In [6]:
ops_cnt = np.array((count_conv_ops((1, 1), 128, (256, 28, 28)),
                    count_conv_ops((3, 3), 192, (256, 28, 28)),
                    count_conv_ops((5, 5), 96, (256, 28, 28))))
print("%s\n%s" % (ops_cnt, ops_cnt/np.sum(ops_cnt)))
print('Total operations: %.3f M' % (np.sum(ops_cnt)/1e6))

[ 25690112 346816512 481689600]
[0.03007519 0.40601504 0.56390977]
Total operations: 854.196 M
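To make these numbers easier to check by hand: `count_conv_ops` multiplies the kernel area, the number of output channels, and the input shape. It therefore counts the multiplications of a stride-1 convolution whose output keeps the input's spatial size; bias additions and the pooling branch are not counted. The symbols below are notation introduced here for illustration. A worked example for the 5 × 5 branch:

$$
\text{ops} = k_h \cdot k_w \cdot C_{out} \cdot C_{in} \cdot H \cdot W = 5 \cdot 5 \cdot 96 \cdot 256 \cdot 28 \cdot 28 = 481{,}689{,}600
$$

The 5 × 5 branch alone accounts for about 56% of the block's cost, although it produces only 96 of the 672 output channels.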


*...in our setting, __1 × 1 convolutions have dual purpose: most critically, they
are used mainly as dimension reduction modules__ to remove computational bottlenecks, that would
otherwise limit the size of our networks.*
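As a minimal sketch of what "dimension reduction" means here: a 1 × 1 convolution applies the same linear map to the channel vector at every spatial position, so it can shrink 256 channels to 64 without touching the 28 × 28 grid. The tensor sizes and the name `reduce_1x1` are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

x = torch.randn(1, 256, 28, 28)                      # N, C, H, W
reduce_1x1 = nn.Conv2d(256, 64, kernel_size=1, bias=False)

with torch.no_grad():
    y = reduce_1x1(x)
    # The same result as a per-pixel matrix multiplication over channels.
    weight = reduce_1x1.weight.flatten(1)            # (64, 256)
    y_linear = torch.einsum('oc,nchw->nohw', weight, x)

print(y.shape)                                       # torch.Size([1, 64, 28, 28])
print(torch.allclose(y, y_linear, atol=1e-4))
```

Any 3 × 3 or 5 × 5 convolution placed after such a layer sees 64 input channels instead of 256, which is where the savings come from.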

<img src="../assets/8_inception.png" width="300">

<img src="../assets/1_inception_reduction.png" width="500">

In [7]:
class BasicConv2d(nn.Module):
    def __init__(self, in_channels: int, out_channels: int, **kwargs) -> None:
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, bias=False, **kwargs)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv(x)
        return F.relu(x, inplace=True)


class InceptionBlock(nn.Module):
    def __init__(self, inp_ch: int = 256, out_ch: tuple = (128, 64, 192, 64, 96, 64)):
        super().__init__()

        self.branch1 = BasicConv2d(
            in_channels=inp_ch, out_channels=out_ch[0], kernel_size=1
        )
        self.branch2 = nn.Sequential(
            BasicConv2d(inp_ch, out_ch[1], kernel_size=1),
            BasicConv2d(out_ch[1], out_ch[2], kernel_size=3, padding=1)
        )
        self.branch3 = nn.Sequential(
            BasicConv2d(inp_ch, out_ch[3], kernel_size=1),
            BasicConv2d(out_ch[3], out_ch[4], kernel_size=5, padding=2)
        )
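        self.branch4 = nn.Sequential(
            # Pad the pooling layer (not the 1x1 conv) so that the spatial size
            # is preserved without adding an all-zero border to the output.
            nn.MaxPool2d(kernel_size=3, stride=(1, 1), padding=1),
            BasicConv2d(inp_ch, out_ch[5], kernel_size=1)
        )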

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        branch1 = self.branch1(x)
        branch2 = self.branch2(x)
        branch3 = self.branch3(x)
        branch4 = self.branch4(x)
        out = torch.cat((branch1, branch2, branch3, branch4), 1)
        print(f'Total feature maps: {out.shape[1]} of size: {out.shape[2:]}')
        return out

In [8]:
inc_block = InceptionBlock().eval()
x = torch.Tensor(np.random.normal(size=(256, 28, 28)))
model_path = os.path.join('onnx_graphs', 'inc_block.onnx')
torch.onnx.export(inc_block, torch.unsqueeze(x, 0), model_path, 
                  input_names=['input'], output_names=['output'], opset_version=10)

Total feature maps: 480 of size: torch.Size([28, 28])


In [None]:
netron.start(model_path, 30000)

In [9]:
ops_cnt = np.array((count_conv_ops((1, 1), 128, (256, 28, 28)),
                    
                    count_conv_ops((1, 1), 64, (256, 28, 28)),
                    count_conv_ops((3, 3), 192, (64, 28, 28)),
                    
                    count_conv_ops((1, 1), 64, (256, 28, 28)),
                    count_conv_ops((5, 5), 96, (64, 28, 28)), 
                    
                    count_conv_ops((1, 1), 64, (256, 28, 28)), 
                    
                  ))

print("Branch1: %s %.3f"  %  (ops_cnt[0], (ops_cnt/np.sum(ops_cnt))[0]))
print("Branch2: %s %.3f" % (ops_cnt[1:3], sum((ops_cnt/np.sum(ops_cnt))[1:3])))
print("Branch3: %s %.3f" % (ops_cnt[3:5], sum((ops_cnt/np.sum(ops_cnt))[3:5])))
print("Branch4: %s %.3f" % (ops_cnt[5:], sum((ops_cnt/np.sum(ops_cnt))[5:])))
print('Total operations: %.3f M' % (np.sum(ops_cnt)/1e6))

Branch1: 25690112 0.095
Branch2: [12845056 86704128] 0.367
Branch3: [ 12845056 120422400] 0.491
Branch4: [12845056] 0.047
Total operations: 271.352 M
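To put the two totals side by side, the sketch below recomputes both counts with the same formula as `count_conv_ops` (rewritten with `math.prod` so that it is self-contained) and prints their ratio. The helper name `conv_ops` is an assumption made for this example.

```python
from math import prod


def conv_ops(kernel, out_ch, in_shape):
    """Multiplications of a stride-1 convolution that keeps the spatial size."""
    return prod((*kernel, out_ch, *in_shape))


inp = (256, 28, 28)
reduced = (64, 28, 28)  # shape after a 1x1 reduction to 64 channels

naive_ops = sum((
    conv_ops((1, 1), 128, inp),
    conv_ops((3, 3), 192, inp),
    conv_ops((5, 5), 96, inp),
))
reduced_ops = sum((
    conv_ops((1, 1), 128, inp),
    conv_ops((1, 1), 64, inp) + conv_ops((3, 3), 192, reduced),
    conv_ops((1, 1), 64, inp) + conv_ops((5, 5), 96, reduced),
    conv_ops((1, 1), 64, inp),  # 1x1 projection after max pooling
))

print(f"naive:   {naive_ops / 1e6:.3f} M")          # naive:   854.196 M
print(f"reduced: {reduced_ops / 1e6:.3f} M")        # reduced: 271.352 M
print(f"ratio:   {naive_ops / reduced_ops:.2f}x")   # ratio:   3.15x
```

The block with reductions needs roughly three times fewer multiplications, and its output is also narrower (480 instead of 672 feature maps) because the pooling branch is projected down to 64 channels.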


#### Inception V1:

* 9 stacked inception modules, 22 layers (27 with pooling)
* No stack of fully connected layers: global average pooling followed by a single linear classifier (using average pooling instead of fully connected layers improved the top-1 accuracy by about 0.6%)
* InceptionV1 loss = real_loss + 0.3 * aux_loss_1 + 0.3 * aux_loss_2
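The loss formula above can be sketched as follows. This is an illustration under assumptions: the function name and argument order are hypothetical, cross-entropy is used for all three heads, and the auxiliary heads are only needed during training (the paper discards them at inference time).

```python
import torch
import torch.nn.functional as F


def inception_v1_loss(main_logits: torch.Tensor,
                      aux1_logits: torch.Tensor,
                      aux2_logits: torch.Tensor,
                      target: torch.Tensor,
                      aux_weight: float = 0.3) -> torch.Tensor:
    """real_loss + aux_weight * aux_loss_1 + aux_weight * aux_loss_2."""
    real_loss = F.cross_entropy(main_logits, target)
    aux_loss_1 = F.cross_entropy(aux1_logits, target)
    aux_loss_2 = F.cross_entropy(aux2_logits, target)
    return real_loss + aux_weight * aux_loss_1 + aux_weight * aux_loss_2


torch.manual_seed(0)
logits = torch.randn(8, 1000)            # stand-in for any of the three heads
target = torch.randint(0, 1000, (8,))
loss = inception_v1_loss(logits, logits, logits, target)
print(loss.item())
```

Passing the same logits for all three heads, as in the toy call above, simply yields 1.6 times the plain cross-entropy, which is a quick sanity check of the weighting.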


<img src="../assets/7_inception.png" width="800">

<img src="../assets/2_inception.jpg" width="500">

<img src="../assets/4_inception.png" width="900">

#### Torch [implementation](https://github.com/pytorch/vision/blob/435eddf7a8200cc26338036a0a5f7db067ac7b0c/torchvision/models/googlenet.py#L28)


```python
class GoogLeNet(nn.Module):
    __constants__ = ["aux_logits", "transform_input"]

    def __init__(
        self,
        num_classes: int = 1000,
        aux_logits: bool = True,
        transform_input: bool = False,
        init_weights: Optional[bool] = None,
        blocks: Optional[List[Callable[..., nn.Module]]] = None,
        dropout: float = 0.2,
        dropout_aux: float = 0.7,
    ) -> None:
        super().__init__()
        _log_api_usage_once(self)
        if blocks is None:
            blocks = [BasicConv2d, Inception, InceptionAux]
        if init_weights is None:
            warnings.warn(
                "The default weight initialization of GoogleNet will be changed in future releases of "
                "torchvision. If you wish to keep the old behavior (which leads to long initialization times"
                " due to scipy/scipy#11299), please set init_weights=True.",
                FutureWarning,
            )
            init_weights = True
        assert len(blocks) == 3
        conv_block = blocks[0]
        inception_block = blocks[1]
        inception_aux_block = blocks[2]

        self.aux_logits = aux_logits
        self.transform_input = transform_input

        self.conv1 = conv_block(3, 64, kernel_size=7, stride=2, padding=3)
        self.maxpool1 = nn.MaxPool2d(3, stride=2, ceil_mode=True)
        self.conv2 = conv_block(64, 64, kernel_size=1)
        self.conv3 = conv_block(64, 192, kernel_size=3, padding=1)
        self.maxpool2 = nn.MaxPool2d(3, stride=2, ceil_mode=True)

        self.inception3a = inception_block(192, 64, 96, 128, 16, 32, 32)
        self.inception3b = inception_block(256, 128, 128, 192, 32, 96, 64)
        self.maxpool3 = nn.MaxPool2d(3, stride=2, ceil_mode=True)

        self.inception4a = inception_block(480, 192, 96, 208, 16, 48, 64)
        self.inception4b = inception_block(512, 160, 112, 224, 24, 64, 64)
        self.inception4c = inception_block(512, 128, 128, 256, 24, 64, 64)
        self.inception4d = inception_block(512, 112, 144, 288, 32, 64, 64)
        self.inception4e = inception_block(528, 256, 160, 320, 32, 128, 128)
        self.maxpool4 = nn.MaxPool2d(2, stride=2, ceil_mode=True)

        self.inception5a = inception_block(832, 256, 160, 320, 32, 128, 128)
        self.inception5b = inception_block(832, 384, 192, 384, 48, 128, 128)

        if aux_logits:
            self.aux1 = inception_aux_block(512, num_classes, dropout=dropout_aux)
            self.aux2 = inception_aux_block(528, num_classes, dropout=dropout_aux)
        else:
            self.aux1 = None  # type: ignore[assignment]
            self.aux2 = None  # type: ignore[assignment]

        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.dropout = nn.Dropout(p=dropout)
        self.fc = nn.Linear(1024, num_classes)

```

* The Inception block in torchvision does not contain 5x5 convolutions; its "5x5" branch uses a 3x3 kernel:


```python

class Inception(nn.Module):
    def __init__(
        self,
        in_channels: int,
        ch1x1: int,
        ch3x3red: int,
        ch3x3: int,
        ch5x5red: int,
        ch5x5: int,
        pool_proj: int,
        conv_block: Optional[Callable[..., nn.Module]] = None,
    ) -> None:
        super().__init__()
        
        ...
        
        self.branch3 = nn.Sequential(
            conv_block(in_channels, ch5x5red, kernel_size=1),
            # Here, kernel_size=3 instead of kernel_size=5 is a known bug.
            # Please see https://github.com/pytorch/vision/issues/906 for details.
            conv_block(ch5x5red, ch5x5, kernel_size=3, padding=1),
        )

```

#### Further work

#### Inception V2: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (Ioffe S., Szegedy C., 2015)
[Paper](https://arxiv.org/pdf/1502.03167.pdf)



* *The main difference to the network described in (Szegedy et al., 2014) is that the 5 × 5 convolutional layers are replaced by two consecutive layers of 3 × 3 convolutions with up to 128 filters.*

* Small architecture changes: more inception modules, a mix of average and max pooling, strides.
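Why two 3 × 3 layers can stand in for one 5 × 5 layer: stacked without padding, they see the same 5 × 5 input window, but use 2 · 9 = 18 weights per input/output channel pair instead of 25. The sketch below checks the receptive field numerically by back-propagating from a single output pixel; the single-channel setup and constant weights are assumptions chosen to keep the demo small.

```python
import torch
import torch.nn as nn

# Two stacked 3x3 convolutions, one channel, no padding.
stack = nn.Sequential(
    nn.Conv2d(1, 1, kernel_size=3, bias=False),
    nn.Conv2d(1, 1, kernel_size=3, bias=False),
)
for p in stack.parameters():
    nn.init.constant_(p, 1.0)          # constant weights: no accidental zeros

x = torch.zeros(1, 1, 9, 9, requires_grad=True)
y = stack(x)                            # 9x9 -> 7x7 -> 5x5
y[0, 0, 2, 2].backward()                # gradient of the central output pixel

receptive_field = x.grad[0, 0] != 0     # input pixels that influence it
rows = int(receptive_field.any(dim=1).sum())
cols = int(receptive_field.any(dim=0).sum())
print(rows, cols)                       # 5 5

weights_5x5 = 5 * 5
weights_two_3x3 = 2 * 3 * 3
print(1 - weights_two_3x3 / weights_5x5)
```

The weight saving is about 28% per channel pair, and the extra non-linearity between the two layers comes for free.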

In [10]:
class BasicConv2dBN(nn.Module):
    def __init__(self, in_channels: int, out_channels: int, **kwargs) -> None:
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, bias=False, **kwargs)
        self.bn = nn.BatchNorm2d(out_channels, eps=0.001)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.bn(self.conv(x))
        return F.relu(x, inplace=True)


class InceptionBlockV2(nn.Module):
    def __init__(self, inp_ch: int = 256, out_ch: tuple = (128, 64, 192, 64, 96, 64)):
        super().__init__()

        self.branch1 = BasicConv2dBN(
            in_channels=inp_ch, out_channels=out_ch[0], kernel_size=1
        )
        self.branch2 = nn.Sequential(
            BasicConv2dBN(inp_ch, out_ch[1], kernel_size=1),
            BasicConv2dBN(out_ch[1], out_ch[2], kernel_size=3, padding=1)
        )
        self.branch3 = nn.Sequential(
            BasicConv2dBN(inp_ch, out_ch[3], kernel_size=1),
            BasicConv2dBN(out_ch[3], out_ch[4], kernel_size=3, padding=1),
            BasicConv2dBN(out_ch[4], out_ch[4], kernel_size=3, padding=1)
        )
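        self.branch4 = nn.Sequential(
            # Pad the pooling layer (not the 1x1 conv) so that the spatial size
            # is preserved without adding an all-zero border to the output.
            nn.MaxPool2d(kernel_size=3, stride=(1, 1), padding=1),
            BasicConv2dBN(inp_ch, out_ch[5], kernel_size=1)
        )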

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        branch1 = self.branch1(x)
        branch2 = self.branch2(x)
        branch3 = self.branch3(x)
        branch4 = self.branch4(x)
        out = torch.cat((branch1, branch2, branch3, branch4), 1)
        print(f'Total feature maps: {out.shape[1]} of size: {out.shape[2:]}')
        return out

In [11]:
inc2_block = InceptionBlockV2().eval()
x = torch.Tensor(np.random.normal(size=(256, 28, 28)))
model_path = os.path.join('onnx_graphs', 'inc2_block.onnx')
torch.onnx.export(inc2_block, torch.unsqueeze(x, 0), model_path, 
                  input_names=['input'], output_names=['output'], opset_version=10)

Total feature maps: 480 of size: torch.Size([28, 28])


In [None]:
netron.start(model_path, 30000)

In [12]:
ops_cnt = np.array((count_conv_ops((1, 1), 128, (256, 28, 28)),

                    count_conv_ops((1, 1), 64, (256, 28, 28)),
                    count_conv_ops((3, 3), 192, (64, 28, 28)),

                    count_conv_ops((1, 1), 64, (256, 28, 28)),
                    count_conv_ops((3, 3), 96, (64, 28, 28)),
                    count_conv_ops((3, 3), 96, (96, 28, 28)),

                    count_conv_ops((1, 1), 64, (256, 28, 28)),

                    ))

print("Branch1: %s %.3f"  %  (ops_cnt[0], (ops_cnt/np.sum(ops_cnt))[0]))
print("Branch2: %s %.3f" % (ops_cnt[1:3], sum((ops_cnt/np.sum(ops_cnt))[1:3])))
print("Branch3: %s %.3f" % (ops_cnt[3:6], sum((ops_cnt/np.sum(ops_cnt))[3:6])))
print("Branch4: %s %.3f" % (ops_cnt[6:], sum((ops_cnt/np.sum(ops_cnt))[6:])))
print('Total operations: %.3f M' % (np.sum(ops_cnt)/1e6))

Branch1: 25690112 0.099
Branch2: [12845056 86704128] 0.384
Branch3: [12845056 43352064 65028096] 0.467
Branch4: [12845056] 0.050
Total operations: 259.310 M


#### Inception V3: Rethinking the Inception Architecture for Computer Vision (Szegedy C. et al., 2016)
[Paper](https://arxiv.org/pdf/1512.00567.pdf)


Principles and optimization ideas that proved to be useful for scaling up convolutional networks:
   
   * Avoid representational bottlenecks, especially early in the network
   * Increasing the activations per tile in a convolutional network allows for more disentangled features
   * Spatial aggregation can be done over lower-dimensional embeddings without much or any loss in representational power. Convolutions with filters larger than 3 × 3 might not be generally useful, as they can always be reduced to a sequence of 3 × 3 convolutional layers.
   * Optimal performance of the network can be reached by balancing the number of filters per stage and the depth of the network
  
These principles should be used judiciously, in ambiguous situations only.

* Replacing n × n convolutions by a 1 × n convolution followed by an n × 1 convolution gives very good results on medium grid sizes (on m × m feature maps, where m ranges between 12 and 20)

* Auxiliary classifiers did not result in improved convergence early in the training. In general they act as regularizers.

* Label smoothing regularization (LSR): the one-hot targets are mixed with a uniform distribution over the classes, which discourages over-confident predictions.
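A minimal sketch of label smoothing, assuming a uniform prior over the classes and `eps = 0.1`: each one-hot target `q` becomes `(1 - eps) * q + eps / K` for `K` classes. The helper names are hypothetical. The comparison at the end relies on the `label_smoothing` argument of `F.cross_entropy`, which is available from PyTorch 1.10 onwards.

```python
import torch
import torch.nn.functional as F


def smooth_one_hot(target: torch.Tensor, num_classes: int, eps: float = 0.1) -> torch.Tensor:
    """Mix one-hot targets with a uniform distribution over the classes."""
    one_hot = F.one_hot(target, num_classes).float()
    return (1.0 - eps) * one_hot + eps / num_classes


def lsr_cross_entropy(logits: torch.Tensor, target: torch.Tensor, eps: float = 0.1) -> torch.Tensor:
    """Cross-entropy against the smoothed targets, averaged over the batch."""
    soft_target = smooth_one_hot(target, logits.shape[1], eps)
    log_probs = F.log_softmax(logits, dim=1)
    return -(soft_target * log_probs).sum(dim=1).mean()


torch.manual_seed(0)
logits = torch.randn(4, 10)
target = torch.tensor([1, 0, 3, 9])

soft = smooth_one_hot(target, num_classes=10)
print(soft[0])          # 0.91 at index 1, 0.01 everywhere else

manual = lsr_cross_entropy(logits, target)
builtin = F.cross_entropy(logits, target, label_smoothing=0.1)
print(torch.allclose(manual, builtin, atol=1e-5))
```

In practice the built-in argument is the simpler choice; the manual version is shown only to make the "noising" of the targets explicit.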

<img src="../assets/9_inception.png" width="400">

<img src="../assets/10_inception.png" width="450">
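The 1 × n / n × 1 factorization listed above can be sketched as follows, assuming n = 7, 64 channels, and a 17 × 17 feature map (inside the 12 to 20 range mentioned there). These sizes are illustrative and not claimed to match any specific layer of the published network.

```python
import torch
import torch.nn as nn


def n_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())


channels, n = 64, 7

full = nn.Conv2d(channels, channels, kernel_size=n, padding=n // 2, bias=False)
factorized = nn.Sequential(
    nn.Conv2d(channels, channels, kernel_size=(1, n), padding=(0, n // 2), bias=False),
    nn.Conv2d(channels, channels, kernel_size=(n, 1), padding=(n // 2, 0), bias=False),
)

x = torch.randn(1, channels, 17, 17)
with torch.no_grad():
    print(full(x).shape, factorized(x).shape)   # both torch.Size([1, 64, 17, 17])

print(n_params(full))                            # 200704 = 49 * 64 * 64
print(n_params(factorized))                      # 57344  = 2 * 7 * 64 * 64
```

Both variants cover the same 7 × 7 window and keep the grid size, but the factorized pair needs 3.5 times fewer weights (and multiplications) in this setting.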


#### Inception V4: Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning (Szegedy C. et al., 2017)

[Paper](https://arxiv.org/abs/1602.07261)

<img src="../assets/12_inception.png" width="220">

* Residual connections lead to dramatically improved training speed for the Inception architecture
* Inception-v4: a pure Inception variant without residual connections with roughly the same recognition performance as Inception-ResNet-v2.
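A rough sketch of how a residual connection wraps an Inception-style block: the branch outputs are concatenated, projected back to the input depth by a linear 1 × 1 convolution, scaled, and added to the input. The class name, the two-branch layout, and all channel sizes are simplifying assumptions and do not reproduce the blocks from the paper. The `scale` argument follows the paper's observation that scaling residuals by roughly 0.1 to 0.3 helped stabilize training of wide variants.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualInceptionSketch(nn.Module):
    def __init__(self, channels: int = 256, branch_ch: int = 32, scale: float = 0.2) -> None:
        super().__init__()
        self.branch1 = nn.Conv2d(channels, branch_ch, kernel_size=1)
        self.branch2 = nn.Sequential(
            nn.Conv2d(channels, branch_ch, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(branch_ch, branch_ch, kernel_size=3, padding=1),
        )
        # Linear 1x1 projection back to the input depth (no activation).
        self.project = nn.Conv2d(2 * branch_ch, channels, kernel_size=1)
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b1 = F.relu(self.branch1(x))
        b2 = F.relu(self.branch2(x))
        residual = self.project(torch.cat((b1, b2), dim=1))
        return F.relu(x + self.scale * residual)   # identity shortcut


block = ResidualInceptionSketch().eval()
x = torch.randn(1, 256, 28, 28)
with torch.no_grad():
    out = block(x)
print(out.shape)                                    # torch.Size([1, 256, 28, 28])
```

Because the shortcut requires matching shapes, the block's output depth equals its input depth, unlike the plain Inception blocks earlier in this notebook, whose output width is the sum of the branch widths.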

<img src="../assets/11_inception.png" width="440">

In [13]:
tuple(arch for arch in dir(torchvision.models) if 'inception' in arch or 'google' in arch)

('googlenet', 'inception', 'inception_v3')

In [14]:
tuple(x for x in dir(pretrainedmodels) if 'incept' in x)

('bninception', 'inceptionresnetv2', 'inceptionv3', 'inceptionv4')

#### Your training code here

In [None]:
# Define data transformation pipeline.


# Initialize dataset and dataloaders.


# Initialize pretrained network, replace Linear layer with a new one for your dataset.


# Initialize optimizer, loss function and training procedure with handlers/callbacks.

#### References

* http://cs231n.stanford.edu/slides/2021/lecture_9.pdf
* https://cs231n.github.io/convolutional-networks/
* https://www.cs.colostate.edu/~dwhite54/InceptionNetworkOverview.pdf
* https://onnx.ai/
* https://leimao.github.io/blog/Label-Smoothing/