### Xception 

### Xception: Deep Learning with Depthwise Separable Convolutions (Chollet F., 2016)

*We present an interpretation of Inception modules in convolutional neural networks
as being an intermediate step in __between regular convolution and the depthwise separable
convolution operation__ (a depthwise convolution followed by a pointwise convolution)...*


[Paper](https://arxiv.org/abs/1610.02357v2?source=post_page---------------------------)

In [None]:
import os
import numpy as np
import torch
import torch.nn as nn
from typing import Union, Tuple
import pretrainedmodels

assert torch.cuda.is_available(), 'A CUDA device is required'
%load_ext watermark

In [None]:
%watermark -p torch,ignite,numpy,netron,pretrainedmodels

#### Depthwise separable convolution

Depthwise separable convolutions were developed by Laurent Sifre at Google Brain in 2013 and reported in V. Vanhoucke, *Learning visual representations at scale* (ICLR, 2014), and in Sifre's PhD thesis, *Rigid-motion scattering for image classification* (2014).

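A standard convolution kernel has to learn two kinds of correlations at once:
* Correlations between pixels within a single image channel (spatial correlations)
* Correlations between pixels across channels (cross-channel correlations)

The Inception block makes this easier by factoring the two: 1x1 convolutions first map cross-channel correlations, and larger spatial convolutions then map spatial correlations in the resulting smaller spaces.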

<img src="../assets/2_xception.png" width="450">

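This simplified Inception block rests on the hypothesis that spatial and cross-channel correlations can be factorized and mapped separately.
Experiments support this hypothesis.

In the extreme case, a separate spatial convolution is applied to every output channel of the 1x1 convolution: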

<img src="../assets/3_xception.png" width="450">

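This idea is known as __depthwise separable convolution__. The main difference is the order of operations: the extreme Inception block applies the 1x1 convolution first, while a depthwise separable convolution applies the per-channel spatial convolution first: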

<img src="../assets/4_xception.jpeg" width="490">

<img src="../assets/1_xception.png" width="490">

#### Depthwise separable convolution from a computational perspective:

Let $F_{K \times K \times C}$ be a filter (kernel) with dims $K \times K \times C$, and let $H_{G \times G \times N}$ be the resulting feature map with dims $G \times G \times N$, produced by convolving $N$ filters $F_{K \times K \times C}$ over a $W \times H \times C$ input.

Then:

1) For the standard convolution, the number of multiplications per kernel is:

$$\text{Mults}_{1} = K^2 \times C \times G^2$$

For $N$ kernels:

$$\text{Mults}_{N} = K^2 \times C \times G^2 \times N$$

2) For the DWS convolution:

* depthwise part:

$$\text{DW Mults} = C \times K^2 \times G^2$$

* pointwise part:

$$\text{PW Mults}_{N} = N \times G^2 \times C$$

$$\text{DWSC}_{Total} = C \times K^2 \times G^2 + N \times G^2 \times C = C \times G^2 [K^2 + N]$$

3) The reduction ratio:

$$r = \frac{C \times G^2 [K^2 + N]}{K^2 \times C \times G^2 \times N} = \frac{1}{N} + \frac{1}{K^2}$$
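As a quick sanity check of the counts above, the sketch below plugs in one set of illustrative sizes ($K=3$, $C=64$, $G=56$, $N=128$; these values and the helper names `standard_conv_mults` and `dws_conv_mults` are assumptions chosen for the example, not taken from the paper) and compares the resulting ratio with $\frac{1}{N} + \frac{1}{K^2}$.

```python
def standard_conv_mults(k: int, c: int, g: int, n: int) -> int:
    """Multiplications for N standard K x K x C filters producing a G x G map."""
    return k * k * c * g * g * n


def dws_conv_mults(k: int, c: int, g: int, n: int) -> int:
    """Multiplications for a depthwise pass followed by N pointwise (1x1) filters."""
    depthwise = c * k * k * g * g   # one K x K filter per input channel
    pointwise = n * g * g * c       # N filters of size 1 x 1 x C
    return depthwise + pointwise


# Illustrative sizes (assumed for this example only).
k, c, g, n = 3, 64, 56, 128
standard = standard_conv_mults(k, c, g, n)
separable = dws_conv_mults(k, c, g, n)
ratio = separable / standard

print(f"standard: {standard:,}  DWS: {separable:,}  r = {ratio:.4f}")
print(f"1/N + 1/K^2 = {1 / n + 1 / k ** 2:.4f}")
```

For these sizes the separable version needs roughly 12% of the multiplications of the standard convolution, which matches the closed-form ratio.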



In [None]:
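def mults_reduction(n: int, k: int) -> float:
    """Ratio of DWS-conv to standard-conv multiplications: 1/N + 1/K^2."""
    return 1 / n + 1 / k ** 2


kernel_size = 3
# Sweep the number of output filters N from 32 to 104 in steps of 8.
for n in range(32, 32 + 8 * 10, 8):
    print('[n=%d]\tconv has %.2f times as many mults as DWS conv' % (n, 1 / mults_reduction(n, kernel_size)))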

To make a convolution process each channel separately, use the __groups__ parameter of `nn.Conv2d`:

__Groups__ is a positive integer specifying the number of groups into which the input is split along the channel axis. Each group is convolved separately with `out_channels / groups` filters.

The output is the concatenation of all the group results along the channel axis. Both `in_channels` and `out_channels` must be divisible by `groups`. Setting `groups = in_channels` gives a depthwise convolution.

In [None]:
# Groups example: the second dim of the weight tensor is in_channels / groups.
x = torch.randn(1, 25, 28, 28)
for g in (1, 5, 25):
    conv = nn.Conv2d(in_channels=25, out_channels=50, kernel_size=3, padding=1, groups=g)
    print(f'Group: {g} Weights: {conv.weight.shape} Output: {conv(x).shape}')
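To make the "split, convolve separately, concatenate" description concrete, here is a minimal sketch that reproduces a grouped convolution by hand with `torch.nn.functional.conv2d`. The variable names (`x_chunks`, `w_chunks`, `b_chunks`, `manual`) are made up for this illustration, and the tolerance is an arbitrary choice to absorb floating-point differences.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
groups = 5
conv = nn.Conv2d(in_channels=25, out_channels=50, kernel_size=3, padding=1, groups=groups)
x = torch.randn(1, 25, 28, 28)

# Split the input channels, filters and biases into `groups` equal chunks.
x_chunks = x.chunk(groups, dim=1)            # 5 tensors of shape (1, 5, 28, 28)
w_chunks = conv.weight.chunk(groups, dim=0)  # 5 tensors of shape (10, 5, 3, 3)
b_chunks = conv.bias.chunk(groups, dim=0)    # 5 tensors of shape (10,)

# Convolve each input chunk with its own filters, then concatenate along channels.
manual = torch.cat(
    [F.conv2d(xc, wc, bc, padding=1) for xc, wc, bc in zip(x_chunks, w_chunks, b_chunks)],
    dim=1,
)

grouped_matches_manual = torch.allclose(conv(x), manual, atol=1e-4)
print(manual.shape, grouped_matches_manual)
```

Each group only sees `in_channels / groups` input channels, which is why the second dimension of `conv.weight` shrinks as `groups` grows.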

In [None]:
class DWSConv2d(nn.Module):
    """
    Depthwise separable convolution: a depthwise (per-channel) convolution
    followed by a pointwise (1x1) convolution.
    """

    def __init__(self,
                 in_channels: int,
                 out_channels: int,
                 kernel_size: Union[int, Tuple[int, int]],
                 kernels_per_layer: int,
                 stride: Union[int, Tuple[int, int]] = 1,
                 padding: Union[str, int, Tuple[int, int]] = 0,
                 dilation: Union[int, Tuple[int, int]] = 1):
        super().__init__()

        # Depthwise part: groups=in_channels gives every input channel its own
        # `kernels_per_layer` spatial filters.
        self.dw_conv2d = nn.Conv2d(in_channels, in_channels * kernels_per_layer,
                                   kernel_size=kernel_size, stride=stride,
                                   padding=padding, dilation=dilation,
                                   groups=in_channels)
        # Pointwise part: a 1x1 convolution that mixes information across channels.
        self.pw_conv2d = nn.Conv2d(in_channels * kernels_per_layer, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pw_conv2d(self.dw_conv2d(x))
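The savings also show up in the number of learned weights. The sketch below is self-contained and does not depend on the class above: it builds a standard convolution and a depthwise-plus-pointwise pair directly from `nn.Conv2d`, with biases disabled so that only kernel weights are counted. The sizes `C`, `N`, `K` and the helper `count_params` are assumptions made for this example.

```python
import torch
import torch.nn as nn

C, N, K = 64, 128, 3  # illustrative sizes, chosen only for this example

standard = nn.Conv2d(C, N, kernel_size=K, padding=1, bias=False)
separable = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=K, padding=1, groups=C, bias=False),  # depthwise
    nn.Conv2d(C, N, kernel_size=1, bias=False),                       # pointwise
)


def count_params(module: nn.Module) -> int:
    """Total number of learnable scalars in a module."""
    return sum(p.numel() for p in module.parameters())


x = torch.randn(1, C, 32, 32)
out_standard = standard(x)
out_separable = separable(x)

n_standard = count_params(standard)    # K^2 * C * N
n_separable = count_params(separable)  # K^2 * C + C * N
param_ratio = n_separable / n_standard

print(out_standard.shape, out_separable.shape)
print(n_standard, n_separable, round(param_ratio, 4))
```

Both layers map the same input to the same output shape, and the weight ratio equals $\frac{1}{N} + \frac{1}{K^2}$, the same factor as for multiplications, because both counts scale with $G^2$ in the same way.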

Xception (Extreme Inception) components:
   * Inception V3 blocks replaced with depthwise separable convolution (DWSConv) blocks
   * Residual connections around the DWSConv blocks
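How the two components combine can be sketched with a block in the style of the paper's middle flow: three ReLU + separable convolution + batch norm stages wrapped in an identity residual connection, operating on 728 channels. This is a simplified illustration, not the `pretrainedmodels` implementation; the class names `SeparableConv` and `MiddleFlowBlock` are hypothetical.

```python
import torch
import torch.nn as nn


class SeparableConv(nn.Module):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   padding=1, groups=in_channels, bias=False)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))


class MiddleFlowBlock(nn.Module):
    """Three (ReLU -> SeparableConv -> BatchNorm) stages with an identity shortcut."""

    def __init__(self, channels: int = 728):
        super().__init__()
        layers = []
        for _ in range(3):
            # Non-inplace ReLU so the input stays intact for the residual sum.
            layers += [nn.ReLU(), SeparableConv(channels, channels), nn.BatchNorm2d(channels)]
        self.body = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)


block = MiddleFlowBlock(channels=728)
x = torch.randn(1, 728, 19, 19)
y = block(x)
print(y.shape)
```

Because the block keeps both the channel count and the spatial size, the shortcut can be a plain addition with no projection.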
    

<img src="../assets/5_xception.png" width="800">

<img src="../assets/6_xception.png" width="530">

<img src="../assets/7_xception.png" width="700">

#### Torch [implementation](https://github.com/Cadene/pretrained-models.pytorch/blob/8aae3d8f1135b6b13fed79c1d431e3449fdbf6e0/pretrainedmodels/models/xception.py#L114)


In [None]:
xception = pretrainedmodels.xception(pretrained=False)
xception

#### Your training code here

In [None]:
# Define data transformation pipeline.


# Initialize dataset and dataloaders.


# Initialize pretrained network, replace Linear layer with a new one for your dataset.


# Initialize optimizer, loss function and training procedure with handlers/callbacks.

#### References

* https://github.com/Cadene/pretrained-models.pytorch#xception
* https://onnx.ai/
* https://pytorch.org/docs/stable/index.html
* https://pytorch.org/docs/0.3.1/nn.html#torch.nn.Conv2d
* https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv2D
* https://www.researchgate.net/publication/343943234_Real-Time_Food_Intake_Monitoring_Using_Wearable_Egocnetric_Camera