# MobileNet
<hr>

MobileNets are efficient convolutional neural networks designed for mobile and embedded vision applications.

- **Optimized for Low-Power Devices:** MobileNets are particularly suited for environments with limited computational resources, such as mobile phones, due to their efficiency.


- **Depthwise Separable Convolutions:** The core of the MobileNet architecture is the `depthwise separable convolution`, a lightweight alternative to the standard convolution. It splits the operation into two layers:

    1. **Depthwise Convolution:** Filters each input channel separately with a single filter per channel.
    2. **Pointwise Convolution:** Applies a 1x1 convolution to combine the output of the depthwise convolution across channels.


- **Computational Efficiency:** With $3 \times 3$ filters, depthwise separable convolutions use roughly 8 to 9 times less computation than standard convolutions, with only a small reduction in accuracy.

[Howard et al. 2017, MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications]

<hr>

<div style="text-align:center">
    <img src="media/normal_conv.png" width=700>
    <caption><font color="red"><u>Normal Convolution</u></font></caption>
</div>

In the normal convolution above, we have:

- **Input Volume:** $6 \times 6 \times 3$
- **Filters:** 5 filters of size $3 \times 3 \times 3$
- **Output Volume:** $4 \times 4 \times 5$

<i>For every position in the $4 \times 4 \times 5$ output volume, we performed $3 \times 3 \times 3$ multiplications.</i> Therefore, the total cost is:

$$(3 \times 3 \times 3) \times (4 \times 4 \times 5) = 2160$$
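The count above can be reproduced with a few lines of Python. This is a minimal sketch that assumes a stride of 1 and no padding, as in the figure; the function names `conv_output_size` and `standard_conv_mults` are made up for illustration.

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Spatial size of a convolution output along one dimension."""
    return (n + 2 * padding - f) // stride + 1


def standard_conv_mults(n, n_c, f, n_filters):
    """Multiplications for a stride-1, unpadded standard convolution.

    Input is n x n x n_c; each of the n_filters filters is f x f x n_c.
    """
    n_out = conv_output_size(n, f)
    mults_per_output = f * f * n_c             # one filter placed at one position
    num_outputs = n_out * n_out * n_filters    # positions in the output volume
    return mults_per_output * num_outputs


print(conv_output_size(6, 3))                               # 4
print(standard_conv_mults(n=6, n_c=3, f=3, n_filters=5))    # 2160
```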

<hr>

In a `depthwise separable convolution`, we perform two steps:
- The depthwise convolution
- The pointwise convolution

<div style="text-align:center">
    <img src="media/depthwise_separable.png" width=700>
    <caption><font color="red"><u>Normal vs. Depthwise Separable Convolution</u></font></caption>
</div>

<hr>

**Step 1: Depthwise Convolution**

- The number of filters equals the number of channels in the input volume. Each filter is applied to a single channel, so each position in the $4 \times 4 \times 3$ output volume is the result of $3 \times 3$ multiplications rather than $3 \times 3 \times 3$.


<table style="width: 100%; border: 1px solid #ddd;">
  <tr>
    <td style="text-align:center; width: 50%; border: 1px solid #ddd;">
      <img src="media/depth_conv1.png" style="width: auto; max-width: 100%;">
      <br>
      <span style="color: red;"><u> Step 1: applying 1st filter to 1st channel</u></span>
    </td>
    <td style="text-align:center; width: 50%; border: 1px solid #ddd;">
      <img src="media/depth_conv2.png" style="width: auto; max-width: 100%;">
      <br>
      <span style="color: red;"><u>Step 2: applying 2nd filter to 2nd channel</u></span>
    </td>
  </tr>
  <tr>
    <td style="text-align:center; width: 50%; border: 1px solid #ddd;" colspan="2">
      <img src="media/depth_conv3.png" style="width: auto; max-width: 50%;">
      <br>
      <span style="color: red;"><u>Step 3: applying 3rd filter to 3rd channel</u></span>
    </td>
  </tr>
</table>

In the depthwise convolution above, we have:

- **Input Volume:** $6 \times 6 \times 3$
- **Filters:** 3 filters of size $3 \times 3$
- **Output Volume:** $4 \times 4 \times 3$

<i>For every position in the $4 \times 4 \times 3$ output volume, we performed $3 \times 3$ multiplications.</i> Therefore, the total cost is:

$$(3 \times 3) \times (4 \times 4 \times 3) = 432$$
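To make the per-channel behaviour concrete, here is a small pure-Python sketch of a depthwise convolution that also counts its multiplications. It assumes stride 1, no padding, and no kernel flipping (the usual deep learning convention); the helper name `depthwise_conv` and the toy input values are illustrative only.

```python
def depthwise_conv(volume, filters):
    """Apply one f x f filter to each channel independently.

    volume:  list of channels, each an n x n list of lists
    filters: list of f x f filters, one per channel
    Returns (output_channels, multiplication_count).
    """
    assert len(volume) == len(filters), "need exactly one filter per channel"
    output, mults = [], 0
    for channel, kernel in zip(volume, filters):
        n, f = len(channel), len(kernel)
        n_out = n - f + 1
        result = [[0] * n_out for _ in range(n_out)]
        for i in range(n_out):
            for j in range(n_out):
                acc = 0
                for di in range(f):
                    for dj in range(f):
                        acc += channel[i + di][j + dj] * kernel[di][dj]
                        mults += 1
                result[i][j] = acc
        output.append(result)
    return output, mults


# Toy 6 x 6 x 3 input: channel c is filled with the value c + 1.
volume = [[[c + 1] * 6 for _ in range(6)] for c in range(3)]
# Three 3 x 3 filters of ones, one per channel.
filters = [[[1] * 3 for _ in range(3)] for _ in range(3)]

out, mults = depthwise_conv(volume, filters)
print(len(out), len(out[0]), len(out[0][0]))  # 3 4 4
print(mults)                                  # 432
```

Note that the channels never mix: the output keeps one channel per input channel, which is why a second step is needed to combine them.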


**Step 2: Pointwise Convolution**

- A $1 \times 1$ convolution is applied to the intermediate result produced by the "depthwise convolution" to get the desired output volume.

<div style="text-align:center">
    <img src="media/pointwise.png" width=700>
    <caption><font color="red"><u>Pointwise Convolution</u></font></caption>
</div>

In the pointwise convolution above, we have:

- **Input Volume:** $4 \times 4 \times 3$
- **Filters:** 5 filters of size $1 \times 1 \times 3$
- **Output Volume:** $4 \times 4 \times 5$

<i>For every position in the $4 \times 4 \times 5$ output volume, we performed $1 \times 1 \times 3$ multiplications.</i> Therefore, the total cost is:

$$(1 \times 1 \times 3) \times (4 \times 4 \times 5) = 240$$
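A pointwise convolution is just a weighted sum across channels at every spatial position. The sketch below illustrates this in pure Python; `pointwise_conv` and the toy weights are hypothetical names and values chosen for the example.

```python
def pointwise_conv(volume, weights):
    """Apply 1 x 1 filters that mix channels at each spatial position.

    volume:  list of n_c channels, each an h x w list of lists
    weights: list of filters, each a list of n_c scalars
    Returns one h x w output channel per filter.
    """
    h, w = len(volume[0]), len(volume[0][0])
    output = []
    for filt in weights:
        channel = [
            [sum(wc * volume[c][i][j] for c, wc in enumerate(filt)) for j in range(w)]
            for i in range(h)
        ]
        output.append(channel)
    return output


# Toy 4 x 4 x 3 intermediate volume: channel c is filled with the value c + 1.
volume = [[[c + 1] * 4 for _ in range(4)] for c in range(3)]
# Five 1 x 1 x 3 filters; filter k has all three weights equal to k.
weights = [[k, k, k] for k in range(1, 6)]

out = pointwise_conv(volume, weights)
print(len(out), len(out[0]), len(out[0][0]))  # 5 4 4
print(out[0][0][0], out[4][0][0])             # 6 30
```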

In total, the depthwise separable convolution performs $432 + 240 = 672$ multiplications, while the normal convolution performs 2160:

$$\frac{672}{2160} \approx 0.31$$
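The comparison can be checked numerically. The following sketch assumes stride 1 and no padding; `separable_conv_mults` is an illustrative helper, not part of any library.

```python
def separable_conv_mults(n, n_c, f, n_filters):
    """Multiplications for the two steps of a depthwise separable convolution."""
    n_out = n - f + 1
    depthwise = (f * f) * (n_out * n_out * n_c)     # one f x f filter per channel
    pointwise = n_c * (n_out * n_out * n_filters)   # 1 x 1 x n_c filters
    return depthwise, pointwise


dw, pw = separable_conv_mults(n=6, n_c=3, f=3, n_filters=5)
standard = (3 * 3 * 3) * (4 * 4 * 5)

print(dw, pw, dw + pw)                   # 432 240 672
print(round((dw + pw) / standard, 4))    # 0.3111
```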

<hr>

The authors of the MobileNet paper showed that, in general, the ratio of the cost of a depthwise separable convolution to that of a normal convolution is:

<br>
<font color='red'>
$$\frac{1}{n'_c} + \frac{1}{f^2}$$
</font>


Where $n'_c$ is the number of output channels (in our case $n'_c = 5$) and $f$ is the filter size (in our case $f = 3$):

$$\frac{1}{5} + \frac{1}{3^2} \approx 0.31$$
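The ratio follows from the same counting used in the example. As a sketch, assume an input with $n_c$ channels and an output of spatial size $n_{out} \times n_{out}$ (both symbols are introduced here for illustration only):

$$\text{normal} = f^2 \, n_c \cdot n_{out}^2 \, n'_c$$

$$\text{separable} = \underbrace{f^2 \cdot n_{out}^2 \, n_c}_{\text{depthwise}} + \underbrace{n_c \cdot n_{out}^2 \, n'_c}_{\text{pointwise}}$$

$$\frac{\text{separable}}{\text{normal}} = \frac{n_c \, n_{out}^2 \, (f^2 + n'_c)}{n_c \, n_{out}^2 \, f^2 \, n'_c} = \frac{1}{n'_c} + \frac{1}{f^2}$$

For a wider layer, say $n'_c = 512$ with $f = 3$, the ratio is $\frac{1}{512} + \frac{1}{9} \approx 0.113$. The saving therefore grows with the number of output channels, and the ratio approaches $\frac{1}{f^2}$.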

<hr>

# MobileNetV1

<br>

<div style="text-align:center">
    <img src="media/mobilenet_v1.png" width=800>
    <caption><font color="red"><u>MobileNet v1</u></font></caption>
</div>

- It employs a stack of depthwise separable convolutions, using this block **<font color="red">13</font>** times in sequence.
- The network concludes with the typical arrangement of a pooling layer, a fully connected layer, and a softmax layer for classification.
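The stack can be traced as a sequence of shapes. The sketch below is an assumption-laden illustration: the channel widths, strides, $224 \times 224$ input, and initial 32-filter stride-2 convolution are taken from memory of the configuration table in Howard et al. and should be checked against the paper; "same" padding is assumed, so each layer's spatial size is the input size divided by the stride, rounded up.

```python
import math

# (output_channels, stride) for each depthwise separable block.
# ASSUMPTION: values recalled from the MobileNet v1 paper; verify before relying on them.
V1_BLOCKS = (
    [(64, 1), (128, 2), (128, 1), (256, 2), (256, 1), (512, 2)]
    + [(512, 1)] * 5
    + [(1024, 2), (1024, 1)]
)


def trace_shapes(size=224, first_channels=32, first_stride=2):
    """Return the activation shape after the first conv and after each block."""
    size = math.ceil(size / first_stride)
    shapes = [(size, size, first_channels)]
    for out_channels, stride in V1_BLOCKS:
        size = math.ceil(size / stride)
        shapes.append((size, size, out_channels))
    return shapes


shapes = trace_shapes()
print(len(V1_BLOCKS))          # 13
print(shapes[0], shapes[-1])   # (112, 112, 32) (7, 7, 1024)
```

Under these assumptions, the final $7 \times 7 \times 1024$ volume is what the pooling, fully connected, and softmax layers operate on.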

[Howard et al. 2017, MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications]

<hr>

# MobileNetV2

<br>

<div style="text-align:center">
    <img src="media/mobilenet_v2.png" width=800>
    <caption><font color="red"><u>MobileNet v2</u></font></caption>
</div>

- Introduces two main improvements: residual connections and an expansion layer.
- The residual connection allows for more efficient gradient propagation, similar to ResNets.
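- The expansion layer is a $1 \times 1$ convolution applied before the depthwise convolution. It increases the number of channels, and a final pointwise (projection) convolution compresses them back down. Together, these layers form the bottleneck block.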

<br>

**Bottleneck Blocks**

- These are key to MobileNetV2, allowing the network to compute more complex functions while managing memory efficiently.
- The expansion layer increases the number of channels for richer computation, and the projection layer (a pointwise convolution) reduces it again for memory efficiency.
- MobileNetV2 repeats its modified bottleneck block **<font color="red">17</font>** times, followed by the standard ending layers for classification.
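The expand, filter, project pattern can be sketched by tracing the activation shapes through one block. This is only an illustration: the expansion factor of 6, the "same" padding, the rule that the residual connection is used only when input and output shapes match, and the example sizes are all assumptions made for the sketch, and `bottleneck_shapes` is a hypothetical helper.

```python
def bottleneck_shapes(h, w, c_in, c_out, expansion=6, stride=1):
    """Trace activation shapes through one MobileNetV2-style bottleneck block.

    Returns (stages, use_residual), where stages is a list of (name, shape).
    """
    c_mid = c_in * expansion
    # Ceiling division: 'same' padding with the given stride (assumed).
    h_out, w_out = -(-h // stride), -(-w // stride)
    stages = [
        ("input", (h, w, c_in)),
        ("expansion 1x1", (h, w, c_mid)),           # more channels
        ("depthwise 3x3", (h_out, w_out, c_mid)),   # per-channel filtering
        ("projection 1x1", (h_out, w_out, c_out)),  # back to few channels
    ]
    # The skip connection adds input to output, so the shapes must match.
    use_residual = stride == 1 and c_in == c_out
    return stages, use_residual


stages, use_residual = bottleneck_shapes(14, 14, c_in=64, c_out=64)
for name, shape in stages:
    print(f"{name:15s} {shape}")
print("residual:", use_residual)  # residual: True
```

Only the small input and output volumes need to be kept between blocks; the wide expanded volume exists only inside the block.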

<br>

<div style="text-align:center">
    <img src="media/mobilenet_v2_block.png" width=800>
    <caption><font color="red"><u>MobileNet v2 Bottleneck</u></font></caption>
</div>

<br>

[Sandler et al. 2019, MobileNetV2: Inverted Residuals and Linear Bottlenecks]