**MobileNet** is a family of lightweight convolutional neural networks designed for **mobile and embedded vision applications**, where **computational efficiency** and **model size** are crucial.

Its key innovation is the **depthwise separable convolution**, which drastically reduces computation and parameters compared to standard convolutions.

---

## 1. Motivation

Consider a standard convolutional layer with an input feature map of size
$$
H \times W \times M
$$
where $M$ is the number of input channels. Producing $N$ output channels with a $K \times K$ kernel costs the following number of multiply-adds (assuming stride 1 and padding that preserves the spatial size):

$$
\text{Cost}_{\text{standard}} = H \times W \times M \times N \times K^2
$$

This is computationally expensive, especially for large $M$, $N$, and $K$.

MobileNet replaces this with **depthwise separable convolution**, which breaks the convolution into two simpler steps:

1. **Depthwise Convolution**: apply a single $K \times K$ filter per input channel (no mixing between channels).
2. **Pointwise Convolution**: a $1 \times 1$ convolution that combines the depthwise outputs across channels.

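The depthwise step costs $H \times W \times M \times K^2$ multiply-adds and the pointwise step costs $H \times W \times M \times N$, so:

$$
\text{Cost}_{\text{separable}} = H \times W \times M \times K^2 + H \times W \times M \times N
$$

Dividing by the standard cost gives the fraction of computation that remains:

$$
\frac{\text{Cost}_{\text{separable}}}{\text{Cost}_{\text{standard}}} = \frac{1}{N} + \frac{1}{K^2}
$$

For typical values (e.g., $K = 3$, $N \gg 1$), this fraction is slightly above $1/9$, i.e. about **8–9× less computation**.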
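
The ratio $\frac{1}{N} + \frac{1}{K^2}$ can be checked numerically by counting multiply-adds for both variants. The sketch below is illustrative only: the function names and the example layer size (a 14×14 map with 512 input and output channels) are assumptions, and stride 1 with size-preserving padding is assumed throughout.

```python
def standard_conv_cost(h, w, m, n, k):
    """Multiply-adds of a standard k x k convolution (stride 1, 'same' padding)."""
    return h * w * m * n * k * k


def separable_conv_cost(h, w, m, n, k):
    """Multiply-adds of a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = h * w * m * k * k   # one k x k filter per input channel
    pointwise = h * w * m * n       # 1 x 1 conv mixing m channels into n
    return depthwise + pointwise


if __name__ == "__main__":
    h, w, m, n, k = 14, 14, 512, 512, 3   # hypothetical layer size
    std = standard_conv_cost(h, w, m, n, k)
    sep = separable_conv_cost(h, w, m, n, k)
    print(f"standard:  {std:,} multiply-adds")
    print(f"separable: {sep:,} multiply-adds")
    print(f"ratio:     {sep / std:.4f} (formula: {1 / n + 1 / k**2:.4f})")
    print(f"speed-up:  {std / sep:.2f}x")
```

For this layer size the speed-up lands between 8× and 9×, in line with the estimate for $K = 3$.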

---

## 2. MobileNet Architecture Overview (V1)

| Layer Type            | Input Size | Filters/Stride | Output Size | Notes                |
| --------------------- | ---------- | -------------- | ----------- | -------------------- |
| Conv 3×3              | 224×224×3  | 32 / s2        | 112×112×32  | Standard conv        |
| Depthwise Conv 3×3    | 112×112×32 | s1             | 112×112×32  | Per-channel conv     |
| Pointwise Conv 1×1    | 112×112×32 | 64 / s1        | 112×112×64  | Combines channels    |
| Depthwise + Pointwise | 112×112×64 | 128 / s2       | 56×56×128   | Repeat with stride 2 |
| ...                   | ...        | ...            | ...         | Repeated pattern     |
| Avg Pool              | 7×7×1024   | —              | 1×1×1024    | Global avg pooling   |
| FC (Softmax)          | —          | —              | 1000        | Classification       |

### Building Block

Each **MobileNet block** (except the first layer) follows:
$$
\text{Conv}_{3\times3}^{dw} \rightarrow \text{BN} \rightarrow \text{ReLU6} \rightarrow \text{Conv}_{1\times1}^{pw} \rightarrow \text{BN} \rightarrow \text{ReLU6}
$$

Here:

* **ReLU6**, defined as $\min(\max(x, 0), 6)$, is used instead of ReLU because its bounded output is more robust under low-precision arithmetic.
* **BN** stands for batch normalization.
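
To make the data flow of the block concrete, here is a minimal pure-Python sketch of depthwise convolution, pointwise convolution, and ReLU6 on nested lists. It is a teaching aid under stated assumptions, not an efficient or reference implementation: batch normalization is omitted, stride is fixed at 1 with zero padding, and all function names are made up for this example.

```python
def relu6(x):
    """ReLU capped at 6: min(max(x, 0), 6)."""
    return min(max(x, 0.0), 6.0)


def depthwise_conv(x, filters):
    """x: [C][H][W] input; filters: [C][K][K], one filter per channel.
    Stride 1 with zero padding, so the output keeps the H x W size."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    k = len(filters[0])
    pad = k // 2
    out = [[[0.0] * w for _ in range(h)] for _ in range(c)]
    for ch in range(c):                      # channels are never mixed here
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - pad, j + dj - pad
                        if 0 <= ii < h and 0 <= jj < w:
                            acc += x[ch][ii][jj] * filters[ch][di][dj]
                out[ch][i][j] = acc
    return out


def pointwise_conv(x, weights):
    """x: [C_in][H][W]; weights: [C_out][C_in]. Mixes channels at each pixel."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    return [[[sum(weights[o][ch] * x[ch][i][j] for ch in range(c_in))
              for j in range(w)] for i in range(h)]
            for o in range(len(weights))]


def apply_relu6(x):
    return [[[relu6(v) for v in row] for row in plane] for plane in x]


def mobilenet_block(x, dw_filters, pw_weights):
    """Depthwise -> ReLU6 -> pointwise -> ReLU6 (batch norm omitted in this sketch)."""
    y = apply_relu6(depthwise_conv(x, dw_filters))
    return apply_relu6(pointwise_conv(y, pw_weights))


if __name__ == "__main__":
    x = [[[1.0] * 4 for _ in range(4)] for _ in range(2)]    # 2 channels, 4x4
    dw = [[[1.0] * 3 for _ in range(3)] for _ in range(2)]   # two 3x3 box filters
    pw = [[0.5, 0.5], [1.0, -1.0], [0.1, 0.0]]               # mix 2 -> 3 channels
    y = mobilenet_block(x, dw, pw)
    print("output shape:", len(y), len(y[0]), len(y[0][0]))
```

Note that the depthwise step needs one filter per input channel, while the pointwise weight matrix alone decides the number of output channels.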

---

## 3. Width and Resolution Multipliers

MobileNet introduces two hyperparameters to trade off accuracy vs. efficiency:

1. **Width Multiplier (α)** — scales the number of channels:
   $$
   M' = \alpha M, \quad N' = \alpha N
   $$
   Smaller α → fewer parameters and computations.

2. **Resolution Multiplier (ρ)** — scales the input image size:
   $$
   H' = \rho H, \quad W' = \rho W
   $$
   Lower resolution → faster inference.

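Example:
MobileNet-V1 with α=0.75 and ρ=0.5 needs roughly $\alpha^2 \rho^2 \approx 0.14$ of the full model's computation (about 7× fewer multiply-adds), because the dominant pointwise cost scales with both $M'N'$ and $H'W'$. The price is lower accuracy.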
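
The effect of the two multipliers on a single depthwise separable layer can be sketched as follows. The function name and the example layer size are assumptions made for illustration, and scaled sizes are simply truncated to integers.

```python
def separable_layer_cost(h, w, m, n, k=3, alpha=1.0, rho=1.0):
    """Multiply-adds of one depthwise separable layer after applying
    the width multiplier alpha and the resolution multiplier rho."""
    m_s, n_s = int(alpha * m), int(alpha * n)   # thinner layer
    h_s, w_s = int(rho * h), int(rho * w)       # smaller feature map
    depthwise = h_s * w_s * m_s * k * k
    pointwise = h_s * w_s * m_s * n_s
    return depthwise + pointwise


full = separable_layer_cost(14, 14, 512, 512)
reduced = separable_layer_cost(14, 14, 512, 512, alpha=0.75, rho=0.5)
print(f"full:    {full:,} multiply-adds")
print(f"reduced: {reduced:,} multiply-adds")
print(f"ratio:   {reduced / full:.3f} (alpha^2 * rho^2 = {0.75**2 * 0.5**2:.3f})")
```

The measured ratio sits slightly above $\alpha^2 \rho^2$ because the depthwise term shrinks with $\alpha$ rather than $\alpha^2$.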

---

## 4. MobileNetV2 – Inverted Residuals and Linear Bottlenecks

**MobileNetV2 (2018)** improves over V1 using two new ideas:

### (a) Bottleneck Residual Block

Instead of a simple depthwise separable block, V2 uses:
$$
\text{1×1 expansion} \rightarrow \text{3×3 depthwise} \rightarrow \text{1×1 projection}
$$

This expands channels by a factor $t$ (typically 6), applies depthwise convolution, then projects back to a low-dimensional space.

### (b) Linear Bottleneck

After projection, **no ReLU** is applied at the end — this preserves information that would otherwise be lost due to non-linearity in a narrow bottleneck space.

### Block Structure

| Step | Type                            | Purpose           |
| ---- | ------------------------------- | ----------------- |
| 1    | 1×1 Conv (expand) + BN + ReLU6  | Increase channels |
| 2    | 3×3 Depthwise Conv + BN + ReLU6 | Spatial filtering |
| 3    | 1×1 Conv (project) + BN         | Compress channels |
| 4    | Skip connection (if same shape) | Residual learning |
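
The four steps in the table can be summarized in a few lines of Python. This is a rough sketch with a hypothetical helper name: it only reports the expanded width, the convolution weight counts (batch norm parameters ignored), and whether the skip connection applies. It does not run the block itself.

```python
def inverted_residual(c_in, c_out, t=6, stride=1, k=3):
    """Describe one inverted residual block.

    Returns the expanded width, the conv weight counts per step
    (batch norm parameters ignored), and whether the skip connection applies.
    """
    hidden = t * c_in
    weights = {
        "expand_1x1": c_in * hidden,       # step 1: 1x1 conv, then BN + ReLU6
        "depthwise_kxk": k * k * hidden,   # step 2: one k x k filter per channel
        "project_1x1": hidden * c_out,     # step 3: 1x1 conv + BN, no activation
    }
    use_skip = stride == 1 and c_in == c_out   # step 4: only if shapes match
    return hidden, weights, use_skip


for c_in, c_out, stride in [(24, 24, 1), (24, 32, 2)]:
    hidden, weights, use_skip = inverted_residual(c_in, c_out, stride=stride)
    print(f"{c_in}->{hidden}->{c_out} (stride {stride}): "
          f"{sum(weights.values()):,} weights, skip={use_skip}")
```

The block is "inverted" in the sense that the wide layer sits in the middle and the skip connection joins the two narrow ends.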

---

## 5. MobileNetV3 – Efficient Search and Squeeze-Excitation

**MobileNetV3 (2019)** uses **Neural Architecture Search (NAS)** and **Squeeze-and-Excitation (SE)** blocks for improved accuracy/efficiency.

Key components:

* **SE Block**: channel attention mechanism to reweight features.
* **Hard-Swish (h-swish)** activation: a computationally efficient approximation of Swish:
  $$
  \text{h-swish}(x) = x \cdot \frac{\text{ReLU6}(x + 3)}{6}
  $$
* Mix of **bottleneck residuals (from V2)** and **efficient NAS-designed layers**.
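
The h-swish formula translates directly into code. The sketch below compares it with the smooth Swish function $x \cdot \sigma(x)$ at a few sample points. The function names are chosen for this example.

```python
import math


def relu6(x):
    """ReLU capped at 6."""
    return min(max(x, 0.0), 6.0)


def h_swish(x):
    """Hard-swish: x * ReLU6(x + 3) / 6."""
    return x * relu6(x + 3.0) / 6.0


def swish(x):
    """Swish: x * sigmoid(x), shown only for comparison."""
    return x / (1.0 + math.exp(-x))


for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:+.1f}  h-swish={h_swish(x):+.4f}  swish={swish(x):+.4f}")
```

h-swish is exactly zero for $x \le -3$ and exactly $x$ for $x \ge 3$, and it avoids the exponential that Swish requires.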

---

## 6. Comparison Summary

| Version     | Key Innovation                        | Approx. Params | Typical Use                     |
| ----------- | ------------------------------------- | -------------- | ------------------------------- |
| MobileNetV1 | Depthwise separable conv              | ~4.2M          | Simple, fast models             |
| MobileNetV2 | Inverted residual + linear bottleneck | ~3.4M          | Balance accuracy & speed        |
| MobileNetV3 | NAS + SE + h-swish                    | ~5.4M (Large)  | Most accurate, mobile-optimized |

---

## 7. When to Use MobileNet

* **Real-time inference on edge devices** (phones, drones, embedded systems).
* **Feature extractor in lightweight pipelines** (e.g., object detection with SSD or segmentation with DeepLab).
* **Transfer learning** for small datasets where training from scratch is impractical.

---


## 8. MobileNetV1 – Full Architecture

MobileNetV1 (2017) is a simple **stack of depthwise-separable convolutions** with gradually increasing channel width and downsampling at certain stages.

### Structure

|     # | Type                       | Kernel / Stride | Output Channels | Output Size (for 224×224 input) |
| ----: | -------------------------- | --------------- | --------------: | ------------------------------- |
|     1 | Conv2D                     | 3×3 / 2         |              32 | 112×112×32                     |
|     2 | Depthwise Conv             | 3×3 / 1         |              32 | 112×112×32                     |
|     3 | Pointwise Conv             | 1×1 / 1         |              64 | 112×112×64                     |
|     4 | Depthwise Conv             | 3×3 / 2         |              64 | 56×56×64                       |
|     5 | Pointwise Conv             | 1×1 / 1         |             128 | 56×56×128                      |
|     6 | Depthwise Conv             | 3×3 / 1         |             128 | 56×56×128                      |
|     7 | Pointwise Conv             | 1×1 / 1         |             128 | 56×56×128                      |
|     8 | Depthwise Conv             | 3×3 / 2         |             128 | 28×28×128                      |
|     9 | Pointwise Conv             | 1×1 / 1         |             256 | 28×28×256                      |
|    10 | Depthwise Conv             | 3×3 / 1         |             256 | 28×28×256                      |
|    11 | Pointwise Conv             | 1×1 / 1         |             256 | 28×28×256                      |
|    12 | Depthwise Conv             | 3×3 / 2         |             256 | 14×14×256                      |
|    13 | Pointwise Conv             | 1×1 / 1         |             512 | 14×14×512                      |
| 14–18 | [Depthwise + Pointwise] ×5 | 3×3 / 1         |             512 | 14×14×512                      |
|    19 | Depthwise Conv             | 3×3 / 2         |             512 | 7×7×512                        |
|    20 | Pointwise Conv             | 1×1 / 1         |            1024 | 7×7×1024                       |
|    21 | Depthwise Conv             | 3×3 / 1         |            1024 | 7×7×1024                       |
|    22 | Pointwise Conv             | 1×1 / 1         |            1024 | 7×7×1024                       |
|    23 | AvgPool                    | 7×7             |            1024 | 1×1×1024                       |
|    24 | Fully Connected            | —               |            1000 | 1000 classes                   |

* **Total parameters:** ~4.2 million
* **Multiply-adds:** ~569 million

In short, it is **a deep stack of depthwise-separable convs** that halves the spatial resolution five times (the first convolution plus four stride-2 depthwise layers) and ends with global average pooling before classification.
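
The parameter and multiply-add totals can be roughly reproduced by walking through the table. The sketch below is an approximation under stated assumptions: it counts only convolution and fully connected weights (batch norm parameters are ignored), assumes each stride-2 layer exactly halves the spatial size, and uses a block list and function name invented for this example.

```python
# (output channels, stride) of each depthwise separable block, in table order
BLOCKS = ([(64, 1), (128, 2), (128, 1), (256, 2), (256, 1), (512, 2)]
          + [(512, 1)] * 5
          + [(1024, 2), (1024, 1)])


def mobilenet_v1_stats(resolution=224, num_classes=1000):
    """Count conv/FC weights and multiply-adds for the layer table above."""
    size = resolution // 2                 # first 3x3 conv has stride 2
    channels = 32
    params = 3 * 3 * 3 * channels          # first standard conv, RGB input
    madds = size * size * params
    for out_channels, stride in BLOCKS:
        size //= stride                    # the depthwise conv carries the stride
        dw = 3 * 3 * channels              # depthwise: one 3x3 filter per channel
        pw = channels * out_channels       # pointwise: 1x1 conv
        params += dw + pw
        madds += size * size * (dw + pw)
        channels = out_channels
    fc = channels * num_classes
    params += fc + num_classes             # FC weights plus bias
    madds += fc
    return params, madds, size


params, madds, final_size = mobilenet_v1_stats()
print(f"blocks: {len(BLOCKS)}")
print(f"params: {params / 1e6:.2f}M, multiply-adds: {madds / 1e6:.0f}M")
print(f"final feature map: {final_size}x{final_size}")
```

Under these assumptions the count comes out at about 4.2 million weights and about 569 million multiply-adds, with a 7×7 map before pooling.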

---

## 9. MobileNetV2 – Full Architecture

MobileNetV2 (2018) introduced **inverted residual bottlenecks** with expansion and projection.

Each block has parameters:

* **t**: expansion factor
* **c**: output channels
* **n**: number of repeats
* **s**: stride of the first block in the stage (the remaining repeats use stride 1)

### Structure

| Stage | Input      | Operator       |  t  |   c  |  n  |  s  |
| :---- | :--------- | :------------- | :-: | :--: | :-: | :-: |
| 0     | 224×224×3  | Conv2D 3×3     |  —  |  32  |  1  |  2  |
| 1     | 112×112×32 | Bottleneck     |  1  |  16  |  1  |  1  |
| 2     | 112×112×16 | Bottleneck     |  6  |  24  |  2  |  2  |
| 3     | 56×56×24   | Bottleneck     |  6  |  32  |  3  |  2  |
| 4     | 28×28×32   | Bottleneck     |  6  |  64  |  4  |  2  |
| 5     | 14×14×64   | Bottleneck     |  6  |  96  |  3  |  1  |
| 6     | 14×14×96   | Bottleneck     |  6  |  160 |  3  |  2  |
| 7     | 7×7×160    | Bottleneck     |  6  |  320 |  1  |  1  |
| 8     | 7×7×320    | Conv2D 1×1     |  —  | 1280 |  1  |  1  |
| 9     | 7×7×1280   | Global AvgPool |  —  | 1280 |  —  |  —  |
| 10    | 1×1×1280   | FC + Softmax   |  —  | 1000 |  —  |  —  |

Each *Bottleneck* consists of:
$$
1\times1\ \text{Conv (expand)} \rightarrow 3\times3\ \text{Depthwise} \rightarrow 1\times1\ \text{Conv (project)}
$$
and uses a **residual connection** if stride = 1 and input/output channels match.

* **Total parameters:** ~3.4 million
* **Multiply-adds:** ~300 million
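
To see how the compact (t, c, n, s) rows unfold into individual blocks, the stage table can be expanded programmatically. This is a sketch with a hypothetical helper: it tracks channels, stride, resolution, and the residual rule only, and assumes each stride-2 block exactly halves the spatial size.

```python
# (t, c, n, s) rows of the bottleneck stages in the table above
STAGES = [(1, 16, 1, 1), (6, 24, 2, 2), (6, 32, 3, 2), (6, 64, 4, 2),
          (6, 96, 3, 1), (6, 160, 3, 2), (6, 320, 1, 1)]


def expand_stages(stages, in_channels=32, resolution=112):
    """Unroll (t, c, n, s) rows into one entry per bottleneck block."""
    blocks = []
    for t, c, n, s in stages:
        for i in range(n):
            stride = s if i == 0 else 1          # only the first repeat downsamples
            resolution //= stride
            blocks.append({
                "in": in_channels,
                "hidden": t * in_channels,       # width after the 1x1 expansion
                "out": c,
                "stride": stride,
                "resolution": resolution,
                "residual": stride == 1 and in_channels == c,
            })
            in_channels = c
    return blocks


blocks = expand_stages(STAGES)
for b in blocks:
    print(b)
print("blocks:", len(blocks), "with residual:", sum(b["residual"] for b in blocks))
```

The first block of each stage changes the channel count, so it never gets a skip connection. Only the later repeats within a stage do.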

---

## 10. MobileNetV3 – Full Architecture (Small & Large Variants)

MobileNetV3 (2019) is the result of **Neural Architecture Search** plus **SE blocks** and **h-swish activation**. It has two main variants: **Large** and **Small**.

Below is **MobileNetV3-Large** (for 224×224 input):

| Stage | Operator         |  k  | exp |   c  |  SE |    NL   |  s  |
| :---- | :--------------- | :-: | :-: | :--: | :-: | :-----: | :-: |
| 0     | Conv2D           | 3×3 |  —  |  16  |  —  | h-swish |  2  |
| 1     | Bottleneck       | 3×3 |  16 |  16  |  No |   ReLU  |  1  |
| 2     | Bottleneck       | 3×3 |  64 |  24  |  No |   ReLU  |  2  |
| 3     | Bottleneck       | 3×3 |  72 |  24  |  No |   ReLU  |  1  |
| 4     | Bottleneck       | 5×5 |  72 |  40  | Yes |   ReLU  |  2  |
| 5     | Bottleneck       | 5×5 | 120 |  40  | Yes |   ReLU  |  1  |
| 6     | Bottleneck       | 5×5 | 120 |  40  | Yes |   ReLU  |  1  |
| 7     | Bottleneck       | 3×3 | 240 |  80  |  No | h-swish |  2  |
| 8     | Bottleneck       | 3×3 | 200 |  80  |  No | h-swish |  1  |
| 9     | Bottleneck       | 3×3 | 184 |  80  |  No | h-swish |  1  |
| 10    | Bottleneck       | 3×3 | 184 |  80  |  No | h-swish |  1  |
| 11    | Bottleneck       | 3×3 | 480 |  112 | Yes | h-swish |  1  |
| 12    | Bottleneck       | 3×3 | 672 |  112 | Yes | h-swish |  1  |
| 13    | Bottleneck       | 5×5 | 672 |  160 | Yes | h-swish |  2  |
| 14    | Bottleneck       | 5×5 | 960 |  160 | Yes | h-swish |  1  |
| 15    | Bottleneck       | 5×5 | 960 |  160 | Yes | h-swish |  1  |
| 16    | Conv2D           | 1×1 |  —  |  960 |  —  | h-swish |  1  |
| 17    | Pool + Conv 1×1  |  —  |  —  | 1280 |  —  | h-swish |  —  |
| 18    | FC + Softmax     |  —  |  —  | 1000 |  —  |    —    |  —  |

**MobileNetV3-Small** is similar but optimized for lower latency and smaller memory footprint.

---

## 11. Visualization Summary

### MobileNetV1

```
Input → Conv → [DWConv + PWConv]*13 → AvgPool → FC
```

### MobileNetV2

```
Input → Conv → Bottleneck(t=1, c=16)
      → Bottleneck(t=6, c=24)*2
      → Bottleneck(t=6, c=32)*3
      → Bottleneck(t=6, c=64)*4
      → Bottleneck(t=6, c=96)*3
      → Bottleneck(t=6, c=160)*3
      → Bottleneck(t=6, c=320)
      → Conv → Pool → FC
```

### MobileNetV3

```
Input → Conv → Bottlenecks (some with SE; ReLU / h-swish mix)
      → Conv1x1 → Pool → Conv1x1 → FC
```

---


