### MobileNets 

### MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications  (Howard, A. G., Google Inc., 2017)

*MobileNets are based on a streamlined architecture that uses depth-wise separable convolutions to build light weight deep neural networks...
This paper proposes a class of network architectures that allows a model developer to specifically choose a small network that matches the resource restrictions (latency, size) for their application...*

[Paper](https://arxiv.org/abs/1704.04861)

Most common applications:

* object detection
* fine-grained classification
* face attributes
* large scale geo-localization

In [1]:
import os
import numpy as np
import netron
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn import datasets as sk_datasets
from typing import Callable
import torchvision

import pandas as pd
import plotly.offline as offline
import plotly.graph_objs as go
import plotly.express as px
offline.init_notebook_mode(connected=True)

assert torch.cuda.is_available() is True
%load_ext watermark

In [2]:
%watermark -p torch,ignite,numpy,netron,sklearn,pandas,plotly

torch  : 1.10.2
ignite : 0.4.8
numpy  : 1.22.1
netron : 5.5.5
sklearn: 0.24.2
pandas : 1.4.1
plotly : 5.6.0



#### MobileNetV1

Main points:

* Heavy use of depthwise separable convolutions with $3 \times 3$ kernels.

<img src="../assets/4_xception.jpeg" width="490">

<img src="../assets/1_xception.png" width="490">

* 1x1 convolution:
    * 75% of MobileNet parameters
    * 95% of computation time
    * Can be implemented directly with [GEMM](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms#Level_3) which is one of the most optimized numerical linear algebra algorithms.


* New hyperparameters:
    * $\alpha$ - the width multiplier, which scales the number of input and output channels to $\alpha C$ and $\alpha N$. It can be viewed as a rough form of feature selection.
    * $\rho$ - the resolution multiplier, which scales the feature map size to $\rho G$.
    * Baseline: $\alpha = \rho = 1$

$$DWSC_{Total} = \alpha C \times K^2 \times \rho^2 G^2 + \alpha N \times \rho^2 G^2 \times \alpha C =  \alpha \rho^2 C \times G^2 [K^2 + \alpha N]$$

Total reduction ratio:

$$r = \frac{\alpha \rho^2  C \times G^2 [K^2 + \alpha N]}{K^2 \times C \times G^2 \times N} = \frac{\alpha \rho^2}{N} + \left[\frac{\alpha \rho }{K}\right]^2$$
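As a quick sanity check of the ratio above, the two costs can be counted directly and compared with the closed form. This is a minimal sketch, assuming $C$ and $N$ are the input and output channel counts, $K$ the kernel size and $G$ the feature map side length; the function names and the layer sizes are illustrative and not taken from the paper.

```python
import math


def standard_conv_cost(c: int, n: int, k: int, g: int) -> float:
    """Mult-adds of a standard K x K convolution: K^2 * C * N * G^2."""
    return k * k * c * n * g * g


def dws_conv_cost(c: int, n: int, k: int, g: int,
                  alpha: float = 1.0, rho: float = 1.0) -> float:
    """Mult-adds of a depthwise separable convolution with both multipliers applied."""
    c_thin, n_thin, g_small = alpha * c, alpha * n, rho * g
    depthwise = c_thin * k * k * g_small * g_small   # one K x K filter per input channel
    pointwise = c_thin * n_thin * g_small * g_small  # 1x1 convolution that mixes channels
    return depthwise + pointwise


def reduction_ratio(n: int, k: int, alpha: float = 1.0, rho: float = 1.0) -> float:
    """Closed-form ratio: alpha * rho^2 / N + (alpha * rho / K)^2."""
    return alpha * rho ** 2 / n + (alpha * rho / k) ** 2


# Hypothetical layer: 64 -> 128 channels, 3x3 kernel, 56x56 feature map.
c, n, k, g = 64, 128, 3, 56
for alpha, rho in [(1.0, 1.0), (0.75, 1.0), (0.5, 0.5)]:
    counted = dws_conv_cost(c, n, k, g, alpha, rho) / standard_conv_cost(c, n, k, g)
    closed_form = reduction_ratio(n, k, alpha, rho)
    assert math.isclose(counted, closed_form)
    print(f"alpha={alpha}, rho={rho}: ratio={closed_form:.4f}, speed-up={1 / closed_form:.1f}x")
```

For the baseline $\alpha = \rho = 1$ with $N = 128$ and $K = 3$, the ratio is $1/128 + 1/9 \approx 0.119$, i.e. roughly 8.4 times fewer mult-adds than the standard convolution.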




In [3]:
mults_reduction = lambda a, rho, n, k: sum(( (a*rho**2) / n, ( (a * rho) / k )**2 ))

# Fixed rho = 1
kernel_size = 3
points = 100
a = np.linspace(0.5, 1, points)
rho = 1
n = np.linspace(32, 256, points)
A, N = np.meshgrid(a, n)
reduction = 1/mults_reduction(A, rho, N, kernel_size)

fig = go.Figure(data=[go.Surface(z=reduction, x=A, y=N)])
fig.update_layout(title='Reducing computational cost with width multiplier', autosize=False,
                  width=500, height=500,
                  margin=dict(l=65, r=50, b=65, t=90),
                  scene = {'xaxis': {'title': "Width multiplier, a"},
                           'yaxis': {'title': "Number of filters, N"}})

fig.show()

In [4]:
# Fixed alpha = 1
kernel_size = 3
points = 100
a = 1
rho = np.linspace(0.5, 1, points)
n = np.linspace(32, 256, points)
R, N = np.meshgrid(rho, n)
reduction = 1/mults_reduction(a, R, N, kernel_size)

fig = go.Figure(data=[go.Surface(z=reduction, x=R, y=N)])
fig.update_layout(title='Reducing computational cost with resolution multiplier', autosize=False,
                  width=500, height=500,
                  margin=dict(l=65, r=50, b=65, t=90),
                  scene = {'xaxis': {'title': "Resolution multiplier, rho"},
                           'yaxis': {'title': "Number of filters, N"}})

fig.show()

In [5]:
# Fixed n = 128
kernel_size = 3
points = 100
n = 128
multipliers = np.linspace(0.5, 1, points)
A, R = np.meshgrid(multipliers, multipliers)
reduction = 1/mults_reduction(A, R, n, kernel_size)

fig = go.Figure(data=[go.Surface(z=reduction, x=A, y=R)])
fig.update_layout(title=f'Reducing MobileNet computational cost. n={n}', autosize=False,
                  width=500, height=500,
                  margin=dict(l=65, r=50, b=65, t=90),
                  scene = {'xaxis': {'title': "Width multiplier, a"},
                           'yaxis': {'title': "Resolution multiplier, rho"}})

fig.show()

<img src="../assets/1_mobilenet.png" width="490">


<img src="../assets/2_mobilenet.png" width="490">


<img src="../assets/3_mobilenet.png" width="490">

#### Torch [implementation](https://github.com/jmjeon94/MobileNet-Pytorch/blob/20972586be740d5bc1e92bfb23928359be30e731/MobileNetV1.py#L4)
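For orientation before reading the linked code, here is a minimal sketch of the MobileNetV1 building block: a $3 \times 3$ depthwise convolution followed by a $1 \times 1$ pointwise convolution, each with batch normalization and ReLU. The class name `DepthwiseSeparableConv` and the example sizes are assumptions made for illustration; this is not the code from the linked repository.

```python
import torch
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise conv + 1x1 pointwise conv, each followed by BN and ReLU."""

    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Sequential(
            # groups=in_channels gives one 3x3 filter per input channel
            nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=stride,
                      padding=1, groups=in_channels, bias=False),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
        )
        self.pointwise = nn.Sequential(
            # the 1x1 convolution mixes information across channels
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))


# Hypothetical usage: 32 -> 64 channels with spatial downsampling by 2.
block = DepthwiseSeparableConv(32, 64, stride=2).eval()
with torch.no_grad():
    out = block(torch.randn(1, 32, 56, 56))
print(out.shape)
```

The depthwise convolution here holds only $32 \times 9$ weights, while the pointwise one holds $32 \times 64$, which is consistent with most parameters sitting in the $1 \times 1$ convolutions.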

#### MobileNetV2

#### MobileNetV2: Inverted Residuals and Linear Bottlenecks (Sandler M. et al., Google Inc., 2018)

[Paper](https://arxiv.org/abs/1801.04381)

Main points:

   * The set of layer activations (for any layer $L_i$) forms a “manifold of interest”.
    
   * It has long been assumed that manifolds of interest in neural networks can be embedded in low-dimensional subspaces. In other words, the features encoded in the channels of a hidden layer's output lie on a manifold that may be embeddable in a low-dimensional subspace.
   
   Supporting experiments:
       * MobileNetV1: width multiplier $\alpha$
   
   
  * However, when ReLU collapses a channel, it inevitably loses information in that channel.

In [6]:
seed = 100
# noisy_moons, _ = sk_datasets.make_moons(n_samples=800, noise=0.05, random_state=seed)
noisy_circles = np.concatenate(
    [sk_datasets.make_circles(n_samples=400, noise=0.02, factor=f, random_state=seed)[0] for f in (0.8, 0.5, 0.2)]
)

fig = px.scatter(x=noisy_circles[:, 0], y=noisy_circles[:, 1], width=400, height=400, opacity=0.8,
                 title='Original data')
fig.show()

In [7]:
def embedding_manifolds(x: np.ndarray, 
                        input_channels: int, 
                        output_channels:int, 
                        activation: Callable = lambda x: x,
                        seed: int = 42
                       ) -> np.ndarray:
    rng = np.random.default_rng(seed)
    
    W = rng.normal(size=(input_channels, output_channels))
    W_inv = np.linalg.pinv(W)
    xW = np.dot(x, W)  # equivalent to x @ W
    z = activation(xW)
    z_inv = np.dot(z, W_inv)
    
    return z_inv

In [8]:
input_ch = 1
output_chs = (2, 3, 5, 15, 20)
relu = np.vectorize(lambda x: np.maximum(0, x))
noisy_circles_reshaped = noisy_circles.reshape(*noisy_circles.shape, input_ch)

In [9]:
df = pd.DataFrame({'x': noisy_circles[:, 0], 'y': noisy_circles[:, 1], 'data': 'original'})

for output_ch in output_chs:
    res = embedding_manifolds(noisy_circles_reshaped, input_ch, output_ch, relu)
    df = pd.concat([df, pd.DataFrame({'x': res[:, 0, 0], 'y': res[:, 1, 0], 'data': f'out_ch: {output_ch}'})])

fig = px.scatter(df, x='x', y='y', opacity=0.8, facet_col='data', width=1000, height=250)
fig.update_traces(marker = dict(size=2))
fig.show()

<img src="../assets/4_mobilenet.png" width="490">

Question: What will the result be if we change the activation function?

*If we have lots of channels, and there is a structure in the activation manifold that information might still be
preserved in the other channels.*


1. If the manifold of interest remains non-zero volume after ReLU transformation, it corresponds to a linear transformation.

2. ReLU is capable of preserving complete information about the input manifold, but only if the input manifold lies in a low-dimensional subspace of the input space.


Assuming the manifold of interest is low-dimensional, we can capture it by inserting __linear bottleneck__
layers with expansion factor $t$ into the convolutional blocks.

In other words, under this assumption a linear bottleneck preserves as much of the features as possible.

* __In a nutshell__: to limit the information lost to ReLU, we first increase the number of channels by an expansion factor $t$, then apply ReLU, and finally project back to the original number of channels with a linear layer (no activation).


* __Inverted residual bottlenecks:__ *inspired by the intuition that the bottlenecks actually contain all the necessary information...we use shortcuts directly between the bottlenecks...*


* __ReLU6__: robust with low-precision computations.
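The three ideas above (expansion by $t$, a linear bottleneck, and shortcuts between bottlenecks) can be combined into one block. The following is a minimal sketch under stated assumptions: the class name `InvertedResidual`, the default `expand_ratio=6` and the tensor sizes are chosen for illustration, and the block is a simplification rather than the torchvision implementation.

```python
import torch
import torch.nn as nn


class InvertedResidual(nn.Module):
    """Expand (1x1) -> depthwise (3x3) -> linear projection (1x1)."""

    def __init__(self, in_channels: int, out_channels: int,
                 stride: int = 1, expand_ratio: int = 6):
        super().__init__()
        hidden = in_channels * expand_ratio
        # A shortcut is only possible when input and output shapes match.
        self.use_shortcut = stride == 1 and in_channels == out_channels
        self.block = nn.Sequential(
            # 1x1 expansion to t times more channels, followed by ReLU6
            nn.Conv2d(in_channels, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3x3 depthwise convolution in the expanded space
            nn.Conv2d(hidden, hidden, kernel_size=3, stride=stride,
                      padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 1x1 linear bottleneck: no activation after the projection
            nn.Conv2d(hidden, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.block(x)
        return x + out if self.use_shortcut else out


# Hypothetical usage on a 24-channel feature map.
x = torch.randn(2, 24, 28, 28)
same = InvertedResidual(24, 24).eval()
down = InvertedResidual(24, 32, stride=2).eval()
with torch.no_grad():
    y_same, y_down = same(x), down(x)
print(y_same.shape, y_down.shape)
```

Note that the shortcut connects the narrow bottleneck tensors, not the wide expanded ones, which is what makes the residual "inverted" compared with a classic ResNet bottleneck.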

<img src="../assets/10_mobilenet.png" width="400">

<img src="../assets/6_mobilenet.png" width="700">

<img src="../assets/7_mobilenet.png" width="490">

<img src="../assets/8_mobilenet.png" width="490">

<img src="../assets/9_mobilenet.png" width="450">

#### Torch [implementation](https://github.com/pytorch/vision/blob/f40c8df02c197d1a9e194210e40dee0e6a6cb1c3/torchvision/models/mobilenetv2.py#L88)

In [10]:
mbnet2 = torchvision.models.mobilenet_v2()
x = torch.Tensor(np.random.normal(size=(1, 3, 224, 224)))
model_path = os.path.join('onnx_graphs', 'mbnet2.onnx')
torch.onnx.export(mbnet2, x, model_path,
                  input_names=['input'], output_names=['output'], opset_version=10)
netron.start(model_path, 30000)

Serving 'onnx_graphs/mbnet2.onnx' at http://localhost:30000


('localhost', 30000)

#### MobileNetV3

#### Searching for MobileNetV3 (Howard, A. et al., Google AI, Google Brain, 2019)

[Paper](https://arxiv.org/abs/1905.02244)

* Automated architecture search:

    1. Neural architecture search (NAS) to find the global network structure by optimizing each network block. [Paper](https://arxiv.org/abs/1611.01578)
    
    2. The NetAdapt algorithm to search for the number of filters per layer. [Paper](https://arxiv.org/abs/1804.03230).
       At each step, the optimization rule is based on the ratio $\frac{\Delta Accuracy}{\Delta Latency}$.
    
    3. Retraining new architecture from scratch.
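The selection rule in step 2 can be illustrated with a toy example. This is a rough sketch, assuming that each proposal has already been briefly fine-tuned and measured; the `Proposal` dataclass, the `pick_proposal` helper and all numbers are hypothetical and only show how the ratio ranks candidates, not how NetAdapt generates them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Proposal:
    name: str
    accuracy: float    # accuracy measured after a short fine-tune (hypothetical)
    latency_ms: float  # measured latency of the modified network (hypothetical)


def pick_proposal(current_accuracy: float, current_latency_ms: float,
                  proposals: List[Proposal]) -> Proposal:
    """Pick the proposal with the largest delta-accuracy / |delta-latency|."""
    # Only proposals that actually reduce latency are considered.
    valid = [p for p in proposals if p.latency_ms < current_latency_ms]
    if not valid:
        raise ValueError("no proposal reduces latency")

    def ratio(p: Proposal) -> float:
        delta_acc = p.accuracy - current_accuracy
        delta_lat = abs(p.latency_ms - current_latency_ms)
        return delta_acc / delta_lat

    return max(valid, key=ratio)


# Hypothetical measurements for a network at 75.0% accuracy and 60 ms latency.
proposals = [
    Proposal("shrink_block_2", accuracy=0.745, latency_ms=58.0),
    Proposal("shrink_block_5", accuracy=0.748, latency_ms=59.0),
    Proposal("shrink_block_9", accuracy=0.735, latency_ms=55.0),
]
best = pick_proposal(0.750, 60.0, proposals)
print(best.name)
```

In this toy setting the chosen proposal is the one that loses the least accuracy per millisecond saved; the procedure would then be repeated from the selected network until the latency target is reached.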

NAS:
<img src="../assets/12_mobilenet.png" width="450">
<img src="../assets/18_mobilenet.png" width="470">

NetAdapt:
<img src="../assets/17_mobilenet.png" width="500">

* h-swish non-linearity (swish with the sigmoid replaced by the piecewise-linear $\frac{ReLU6(x+3)}{6}$):
    $$HSwish = x \times \frac{ReLU6(x+3)}{6}$$


* MobileNetV3-Large, MobileNetV3-Small for high and low resource use cases.

In [11]:
relu6 = np.vectorize(lambda x: np.minimum(np.maximum(0, x), 6)  )
swish = np.vectorize(lambda x: x*(1/(1+np.exp(-x))))
hswish = np.vectorize(lambda x: x*relu6(x+3)/6)
hsigm = lambda x: nn.functional.hardsigmoid(torch.Tensor(x)).cpu().numpy()

n = 500
xs = np.linspace(-8, 8, n)
df = pd.concat([pd.DataFrame({'x':xs, 'y':swish(xs), 'activation': ('swish',)*n}),
                pd.DataFrame({'x':xs, 'y':relu6(xs), 'activation': ('relu6',)*n}),
                pd.DataFrame({'x':xs, 'y':hswish(xs), 'activation': ('hswish',)*n}),
                pd.DataFrame({'x':xs, 'y':hsigm(xs), 'activation': ('hsigm',)*n}),
               ])

fig = px.line(df, x='x', y='y', color='activation', width=600, height=400)
fig.show()



<img src="../assets/11_mobilenet.png" width="550">

<img src="../assets/13_mobilenet.png" width="550">

<img src="../assets/14_mobilenet.png" width="550">

<img src="../assets/15_mobilenet.png" width="550">

<img src="../assets/16_mobilenet.png" width="550">

#### Torch [implementation](https://github.com/pytorch/vision/blob/f40c8df02c197d1a9e194210e40dee0e6a6cb1c3/torchvision/models/mobilenetv3.py#L131)

In [12]:
mbnet3small = torchvision.models.mobilenet_v3_small()
x = torch.Tensor(np.random.normal(size=(1, 3, 224, 224)))
# Make sure that you select an appropriate ONNX opset version:
# the HardSwish operator is supported since opset 14.
# https://github.com/onnx/onnx/blob/main/docs/Operators.md
model_path = os.path.join('onnx_graphs', 'mbnet3small.onnx')
torch.onnx.export(mbnet3small, x, model_path,
                  input_names=['input'], output_names=['output'], opset_version=14)
netron.start(model_path, 30000)

Stopping http://localhost:30000
Serving 'onnx_graphs/mbnet3small.onnx' at http://localhost:30000


('localhost', 30000)

#### Your training code here

In [None]:
# Define data transformation pipeline.


# Initialize dataset and dataloaders.


# Initialize pretrained network, replace Linear layer with a new one for your dataset.


# Initialize optimizer, loss function and training procedure with handlers/callbacks.

#### References

* https://onnx.ai/
* https://pytorch.org/docs/stable/index.html
* https://lilianweng.github.io/posts/2020-08-06-nas/