# **MobileNet**
**MobileNet** is a family of lightweight convolutional neural networks designed for **mobile and embedded vision applications**, where **computational efficiency** and **model size** are crucial.

Its key innovation is the **depthwise separable convolution**, which drastically reduces computation and parameters compared to standard convolutions.

---

## 1. Motivation

In a standard convolutional layer, for an input feature map of size
$$
H \times W \times M
$$
(where $M$ is the number of input channels), and $N$ output channels, a kernel of size $K \times K$ requires:

$$
\text{Cost}_{\text{standard}} = H \times W \times M \times N \times K^2
$$

This is computationally expensive, especially for large $M$, $N$, and $K$.

MobileNet replaces this with **depthwise separable convolution**, which breaks the convolution into two simpler steps:

1. **Depthwise Convolution**: apply a single $K \times K$ filter per input channel (no mixing between channels).
2. **Pointwise Convolution**: a $1 \times 1$ convolution that combines the depthwise outputs across channels.

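The depthwise step costs $H \times W \times M \times K^2$ and the pointwise step costs $H \times W \times M \times N$, so relative to a standard convolution the cost shrinks to a fraction of:

$$
\frac{H \times W \times M \times (K^2 + N)}{H \times W \times M \times N \times K^2} = \frac{1}{N} + \frac{1}{K^2}
$$

For typical values (e.g., $K = 3$, $N \gg 1$), this fraction is close to $1/9$, i.e. about **8–9× less computation**.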
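As a rough numerical check of this ratio, the sketch below counts multiply-accumulates for one layer under both schemes. The function names and the 14×14, 512-channel layer are illustrative assumptions, not values taken from a specific implementation.

```python
def standard_conv_cost(h, w, m, n, k):
    """Multiply-accumulates of a standard KxK convolution (output size HxW)."""
    return h * w * m * n * k * k


def separable_conv_cost(h, w, m, n, k):
    """Depthwise (KxK per channel) plus pointwise (1x1) multiply-accumulates."""
    depthwise = h * w * m * k * k
    pointwise = h * w * m * n
    return depthwise + pointwise


def cost_ratio(h, w, m, n, k):
    """Separable cost divided by standard cost; equals 1/N + 1/K^2."""
    return separable_conv_cost(h, w, m, n, k) / standard_conv_cost(h, w, m, n, k)


# Hypothetical layer: 14x14 feature map, 512 -> 512 channels, 3x3 kernel.
ratio = cost_ratio(14, 14, 512, 512, 3)
print(f"ratio = {ratio:.4f}, speed-up = {1 / ratio:.2f}x")
# prints: ratio = 0.1131, speed-up = 8.84x
```

The spatial size and input channel count cancel out of the ratio, which is why only $N$ and $K$ appear in the formula.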

---



## 2. MobileNetV1 — Full Architecture

MobileNetV1 (2017) is a simple **stack of depthwise-separable convolutions** with gradually increasing channel width and downsampling at certain stages.

#### Structure

|     # | Type                       | Kernel / Stride | Output Channels | Output Size (for 224×224 input) |
| ----: | -------------------------- | --------------- | --------------: | ------------------------------ |
|     1 | Conv2D                     | 3×3 / 2         |              32 | 112×112×32                     |
|     2 | Depthwise Conv             | 3×3 / 1         |              32 | 112×112×32                     |
|     3 | Pointwise Conv             | 1×1 / 1         |              64 | 112×112×64                     |
|     4 | Depthwise Conv             | 3×3 / 2         |              64 | 56×56×64                       |
|     5 | Pointwise Conv             | 1×1 / 1         |             128 | 56×56×128                      |
|     6 | Depthwise Conv             | 3×3 / 1         |             128 | 56×56×128                      |
|     7 | Pointwise Conv             | 1×1 / 1         |             128 | 56×56×128                      |
|     8 | Depthwise Conv             | 3×3 / 2         |             128 | 28×28×128                      |
|     9 | Pointwise Conv             | 1×1 / 1         |             256 | 28×28×256                      |
|    10 | Depthwise Conv             | 3×3 / 1         |             256 | 28×28×256                      |
|    11 | Pointwise Conv             | 1×1 / 1         |             256 | 28×28×256                      |
|    12 | Depthwise Conv             | 3×3 / 2         |             256 | 14×14×256                      |
|    13 | Pointwise Conv             | 1×1 / 1         |             512 | 14×14×512                      |
| 14–18 | [Depthwise + Pointwise] ×5 | 3×3 / 1         |             512 | 14×14×512                      |
|    19 | Depthwise Conv             | 3×3 / 2         |             512 | 7×7×512                        |
|    20 | Pointwise Conv             | 1×1 / 1         |            1024 | 7×7×1024                       |
|    21 | Depthwise Conv             | 3×3 / 1         |            1024 | 7×7×1024                       |
|    22 | Pointwise Conv             | 1×1 / 1         |            1024 | 7×7×1024                       |
|    23 | AvgPool                    | 7×7             |            1024 | 1×1×1024                       |
|    24 | Fully Connected            | —               |            1000 | 1000 classes                   |

**Total parameters:** ~4.2 million

**Multiply-adds (MACs):** ~569 million (often loosely quoted as FLOPs)

So: it’s essentially **a deep stack of depthwise-separable convs**, with downsampling every few layers and a final global average pooling before classification.
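The totals quoted above can be approximately reproduced from the layer table. The following is a minimal sketch that counts only convolution and classifier weights (batch-norm parameters are ignored, which is an assumption of this sketch); `V1_BLOCKS` and `mobilenet_v1_stats` are hypothetical names.

```python
# (output_channels, stride) for the 13 depthwise-separable blocks in the table
V1_BLOCKS = [
    (64, 1), (128, 2), (128, 1), (256, 2), (256, 1), (512, 2),
    (512, 1), (512, 1), (512, 1), (512, 1), (512, 1),
    (1024, 2), (1024, 1),
]


def mobilenet_v1_stats(resolution=224, num_classes=1000):
    """Return (parameters, multiply-adds), ignoring batch-norm parameters."""
    size = resolution // 2            # the first 3x3 conv has stride 2
    channels = 32
    params = 3 * 3 * 3 * channels     # first conv: 3 input channels
    macs = size * size * params
    for out_channels, stride in V1_BLOCKS:
        size //= stride               # depthwise conv carries the stride
        dw = 3 * 3 * channels         # one 3x3 filter per channel
        pw = channels * out_channels  # 1x1 conv mixing channels
        params += dw + pw
        macs += size * size * (dw + pw)
        channels = out_channels
    fc = channels * num_classes
    params += fc + num_classes        # classifier weights plus biases
    macs += fc
    return params, macs


params, macs = mobilenet_v1_stats()
print(f"{params / 1e6:.2f} M parameters, {macs / 1e6:.0f} M multiply-adds")
# prints: 4.21 M parameters, 569 M multiply-adds
```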





#### Building Block

Each **MobileNet block** (except the first layer) follows:
$$
\text{Conv}_{3\times3}^{dw} \rightarrow \text{BN} \rightarrow \text{ReLU6} \rightarrow \text{Conv}_{1\times1}^{pw} \rightarrow \text{BN} \rightarrow \text{ReLU6}
$$

Here:

* **ReLU6** is used instead of ReLU to improve robustness on low-precision devices.
* **BN** stands for batch normalization.
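To make the data flow of this block concrete, here is a deliberately slow pure-Python reference sketch operating on nested lists. Batch normalization is omitted for brevity, the depthwise step assumes stride 1 with zero padding, and all function names are illustrative rather than taken from any library.

```python
def relu6(x):
    """Clamp the activation to the range [0, 6]."""
    return min(max(x, 0.0), 6.0)


def depthwise_conv3x3(x, kernels):
    """x: [C][H][W] nested lists; kernels: [C][3][3]. Stride 1, zero padding 1."""
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for c in range(channels):              # each channel is filtered on its own
        for i in range(height):
            for j in range(width):
                acc = 0.0
                for di in range(3):
                    for dj in range(3):
                        ii, jj = i + di - 1, j + dj - 1
                        if 0 <= ii < height and 0 <= jj < width:
                            acc += x[c][ii][jj] * kernels[c][di][dj]
                out[c][i][j] = acc
    return out


def pointwise_conv(x, weights):
    """weights: [N][C]; mixes the C input channels into N output channels."""
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    return [
        [[sum(w[c] * x[c][i][j] for c in range(channels)) for j in range(width)]
         for i in range(height)]
        for w in weights
    ]


def apply_elementwise(fn, x):
    return [[[fn(v) for v in row] for row in plane] for plane in x]


def separable_block(x, dw_kernels, pw_weights):
    """Depthwise 3x3 -> ReLU6 -> pointwise 1x1 -> ReLU6 (batch norm omitted)."""
    y = apply_elementwise(relu6, depthwise_conv3x3(x, dw_kernels))
    return apply_elementwise(relu6, pointwise_conv(y, pw_weights))
```

Note that the depthwise step never reads across channels; all channel mixing happens in `pointwise_conv`, which is where most of the parameters and computation end up.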

---

## 3. Width and Resolution Multipliers

MobileNet introduces two hyperparameters to trade off accuracy vs. efficiency:

1. **Width Multiplier (α)** — scales the number of channels:
   $$
   M' = \alpha M, \quad N' = \alpha N
   $$
   Smaller α → fewer parameters and computations.

2. **Resolution Multiplier (ρ)** — scales the input image size:
   $$
   H' = \rho H, \quad W' = \rho W
   $$
   Lower resolution → faster inference.

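Example:
Both channel counts scale with α and both spatial dimensions with ρ, so the cost of a depthwise-separable layer falls roughly as $\alpha^2 \rho^2$. MobileNetV1 with α = 0.75 and ρ = 0.5 therefore needs only about $0.75^2 \times 0.5^2 \approx 0.14$ of the full model's computation, at the price of lower accuracy.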
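The effect of the two multipliers on a single layer can be checked with a few lines of arithmetic. This is a sketch under the cost model from section 1; the function name and the example layer shape are assumptions chosen for illustration.

```python
def scaled_separable_cost(h, w, m, n, k=3, alpha=1.0, rho=1.0):
    """Multiply-adds of one depthwise-separable layer after applying alpha and rho."""
    h, w = round(rho * h), round(rho * w)        # resolution multiplier
    m, n = round(alpha * m), round(alpha * n)    # width multiplier
    return h * w * m * (k * k + n)               # depthwise + pointwise


base = scaled_separable_cost(14, 14, 512, 512)
small = scaled_separable_cost(14, 14, 512, 512, alpha=0.75, rho=0.5)
print(f"relative cost: {small / base:.3f}")
# prints: relative cost: 0.141
```

The result is slightly above $\alpha^2 \rho^2 = 0.1406$ because the depthwise term scales with α rather than α².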

---

## 4.  MobileNetV2 —  Inverted Residuals and Linear Bottlenecks

**MobileNetV2 (2018)** improves over V1 using two new ideas:

#### (a) Bottleneck Residual Block

Instead of a simple depthwise separable block, V2 uses:
$$
\text{1×1 expansion} \rightarrow \text{3×3 depthwise} \rightarrow \text{1×1 projection}
$$

This expands channels by a factor $t$ (typically 6), applies depthwise convolution, then projects back to a low-dimensional space.

#### (b) Linear Bottleneck

After projection, **no ReLU** is applied at the end — this preserves information that would otherwise be lost due to non-linearity in a narrow bottleneck space.

---

Each stage of the full network is described by four parameters:

* **t**: expansion factor
* **c**: output channels
* **n**: number of repeats
* **s**: stride of the first block in the stage (the remaining repeats use stride 1)

#### Structure

| Stage | Input      | Operator       |  t  |   c  |  n  |  s  |
| :---- | :--------- | :------------- | :-: | :--: | :-: | :-: |
| 0     | 224×224×3  | Conv2D 3×3     |  —  |  32  |  1  |  2  |
| 1     | 112×112×32 | Bottleneck     |  1  |  16  |  1  |  1  |
| 2     | 112×112×16 | Bottleneck     |  6  |  24  |  2  |  2  |
| 3     | 56×56×24   | Bottleneck     |  6  |  32  |  3  |  2  |
| 4     | 28×28×32   | Bottleneck     |  6  |  64  |  4  |  2  |
| 5     | 14×14×64   | Bottleneck     |  6  |  96  |  3  |  1  |
| 6     | 14×14×96   | Bottleneck     |  6  |  160 |  3  |  2  |
| 7     | 7×7×160    | Bottleneck     |  6  |  320 |  1  |  1  |
| 8     | 7×7×320    | Conv2D 1×1     |  —  | 1280 |  1  |  1  |
| 9     | 7×7×1280   | Global AvgPool |  —  | 1280 |  —  |  —  |
| 10    | 1×1×1280   | FC + Softmax   |  —  | 1000 |  —  |  —  |

Each *Bottleneck* consists of:
$$
1\times1\ \text{Conv (expand)} \rightarrow 3\times3\ \text{Depthwise} \rightarrow 1\times1\ \text{Conv (project)}
$$
and uses a **residual connection** if stride = 1 and input/output channels match.

**Total parameters:** ~3.4 million

**Multiply-adds (MACs):** ~300 million (often loosely quoted as FLOPs)
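The stage table can be turned into a rough parameter count, which also shows where the residual connections apply. This is a minimal sketch: batch-norm parameters are ignored, and it assumes the expansion convolution is skipped when t = 1 (as common implementations do). All names are hypothetical.

```python
# (t, c, n, s) rows of the bottleneck stages in the table above
V2_STAGES = [
    (1, 16, 1, 1), (6, 24, 2, 2), (6, 32, 3, 2), (6, 64, 4, 2),
    (6, 96, 3, 1), (6, 160, 3, 2), (6, 320, 1, 1),
]


def bottleneck_params(c_in, c_out, t):
    """Conv weights of one inverted residual block (batch norm ignored)."""
    hidden = c_in * t
    expand = c_in * hidden if t != 1 else 0   # assumption: no expansion conv when t = 1
    depthwise = 3 * 3 * hidden
    project = hidden * c_out
    return expand + depthwise + project


def uses_residual(c_in, c_out, stride):
    """Skip connection only when the block keeps both shape and channel count."""
    return stride == 1 and c_in == c_out


def mobilenet_v2_params(num_classes=1000):
    channels = 32
    total = 3 * 3 * 3 * channels                    # stem conv
    residual_blocks = 0
    for t, c, n, s in V2_STAGES:
        for i in range(n):
            stride = s if i == 0 else 1             # only the first repeat downsamples
            total += bottleneck_params(channels, c, t)
            residual_blocks += uses_residual(channels, c, stride)
            channels = c
    total += channels * 1280                        # final 1x1 conv
    total += 1280 * num_classes + num_classes       # classifier weights and biases
    return total, residual_blocks


total, residual_blocks = mobilenet_v2_params()
print(f"{total / 1e6:.2f} M weights, {residual_blocks} blocks with a skip connection")
```

The count lands close to the quoted ~3.4 million; the small gap to published figures comes from batch-norm parameters and rounding.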


---

## 5. MobileNetV3 – Efficient Search and Squeeze-Excitation

**MobileNetV3 (2019)** uses **Neural Architecture Search (NAS)** and **Squeeze-and-Excitation (SE)** blocks for improved accuracy/efficiency.

Key components:

* **SE Block**: channel attention mechanism to reweight features.
* **Hard-Swish (h-swish)** activation: a computationally efficient approximation of Swish:
  $$
  \text{h-swish}(x) = x \cdot \frac{\text{ReLU6}(x + 3)}{6}
  $$
* Mix of **bottleneck residuals (from V2)** and **efficient NAS-designed layers**.
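The two building blocks in this list are small enough to sketch in pure Python. The SE function below is a simplified illustration: it takes caller-supplied weight matrices, omits biases, and uses a hard-sigmoid gate (an assumption of this sketch; the function names are hypothetical).

```python
def relu6(x):
    return min(max(x, 0.0), 6.0)


def hard_sigmoid(x):
    """Piecewise-linear stand-in for the sigmoid: ReLU6(x + 3) / 6."""
    return relu6(x + 3.0) / 6.0


def hard_swish(x):
    """h-swish(x) = x * ReLU6(x + 3) / 6."""
    return x * hard_sigmoid(x)


def squeeze_excite(x, w_reduce, w_expand):
    """x: [C][H][W]; w_reduce: [R][C]; w_expand: [C][R]. Biases omitted."""
    # Squeeze: global average pooling gives one number per channel.
    pooled = [sum(map(sum, plane)) / (len(plane) * len(plane[0])) for plane in x]
    # Excite: bottleneck MLP (ReLU inside, hard-sigmoid gate on the way out).
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w_reduce]
    gates = [hard_sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_expand]
    # Reweight: scale every channel by its gate in [0, 1].
    return [[[g * v for v in r] for r in plane] for g, plane in zip(gates, x)]


for v in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"h-swish({v:+.0f}) = {hard_swish(v):+.3f}")
```

h-swish is exactly zero below −3 and exactly the identity above +3, so only the narrow band in between needs any arithmetic beyond a clamp.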

---


MobileNetV3 comes in two main variants: **Large** and **Small**.

Below is **MobileNetV3-Large** (for 224×224 input):

| Stage | Operator         |  k  | exp |   c  |  SE |    NL   |  s  |
| :---- | :--------------- | :-: | :-: | :--: | :-: | :-----: | :-: |
| 0     | Conv2D           | 3×3 |  —  |  16  |  —  | h-swish |  2  |
| 1     | Bottleneck       | 3×3 |  16 |  16  |  No |   ReLU  |  1  |
| 2     | Bottleneck       | 3×3 |  64 |  24  |  No |   ReLU  |  2  |
| 3     | Bottleneck       | 3×3 |  72 |  24  |  No |   ReLU  |  1  |
| 4     | Bottleneck       | 5×5 |  72 |  40  | Yes |   ReLU  |  2  |
| 5     | Bottleneck       | 5×5 | 120 |  40  | Yes |   ReLU  |  1  |
| 6     | Bottleneck       | 5×5 | 120 |  40  | Yes |   ReLU  |  1  |
| 7     | Bottleneck       | 3×3 | 240 |  80  |  No | h-swish |  2  |
| 8     | Bottleneck       | 3×3 | 200 |  80  |  No | h-swish |  1  |
| 9     | Bottleneck       | 3×3 | 184 |  80  |  No | h-swish |  1  |
| 10    | Bottleneck       | 3×3 | 184 |  80  |  No | h-swish |  1  |
| 11    | Bottleneck       | 3×3 | 480 |  112 | Yes | h-swish |  1  |
| 12    | Bottleneck       | 3×3 | 672 |  112 | Yes | h-swish |  1  |
| 13    | Bottleneck       | 5×5 | 672 |  160 | Yes | h-swish |  2  |
| 14    | Bottleneck       | 5×5 | 960 |  160 | Yes | h-swish |  1  |
| 15    | Bottleneck       | 5×5 | 960 |  160 | Yes | h-swish |  1  |
| 16    | Conv2D           | 1×1 |  —  |  960 |  —  | h-swish |  1  |
| 17    | Pool + Conv      |  —  |  —  | 1280 |  —  | h-swish |  —  |
| 18    | FC + Softmax     |  —  |  —  | 1000 |  —  |    —    |  —  |

**MobileNetV3-Small** is similar but optimized for lower latency and smaller memory footprint.

---

## 5. Comparison Summary

| Version     | Key Innovation                        | Approx. Params | Typical Use                     |
| ----------- | ------------------------------------- | -------------- | ------------------------------- |
| MobileNetV1 | Depthwise separable conv              | ~4.2M          | Simple, fast models             |
| MobileNetV2 | Inverted residual + linear bottleneck | ~3.4M          | Balance accuracy & speed        |
| MobileNetV3 | NAS + SE + h-swish                    | ~5.4M (Large)  | Most accurate, mobile-optimized |

---

## 6. When to Use MobileNet

* **Real-time inference on edge devices** (phones, drones, embedded systems).
* **Feature extractor in lightweight pipelines** (e.g., object detection with SSD or segmentation with DeepLab).
* **Transfer learning** for small datasets where training from scratch is impractical.

---

## **7. MobileNetV3 Variants From timm**

The [timm](https://rwightman.github.io/pytorch-image-models/) library ships several pretrained **MobileNetV3** variants. Each part of a model name, and in particular its suffix, encodes how and on what dataset the model was trained or fine-tuned.
The listing below enumerates them; the following sections decode each part systematically.


```python
# fmt: off
# isort: skip_file
# Do not reorganize imports: the warnings filter must be set before timm is imported.
import warnings
import os

warnings.filterwarnings('ignore')
os.environ['PYTHONWARNINGS'] = 'ignore'

import timm
# fmt: on

# List every MobileNetV3 variant that has pretrained weights available.
all_mobilenetV3 = timm.list_models("*mobilenetv3*", pretrained=True)
for m in all_mobilenetV3:
    print(m)
```

Output:

```text
mobilenetv3_large_100.miil_in21k
mobilenetv3_large_100.miil_in21k_ft_in1k
mobilenetv3_large_100.ra4_e3600_r224_in1k
mobilenetv3_large_100.ra_in1k
mobilenetv3_large_150d.ra4_e3600_r256_in1k
mobilenetv3_rw.rmsp_in1k
mobilenetv3_small_050.lamb_in1k
mobilenetv3_small_075.lamb_in1k
mobilenetv3_small_100.lamb_in1k
tf_mobilenetv3_large_075.in1k
tf_mobilenetv3_large_100.in1k
tf_mobilenetv3_large_minimal_100.in1k
tf_mobilenetv3_small_075.in1k
tf_mobilenetv3_small_100.in1k
tf_mobilenetv3_small_minimal_100.in1k
```


---
## **8. Models and Name Explanation**
#### **8.1. Base model name**

Example: `mobilenetv3_large_100`

* **mobilenetv3** → Architecture version (MobileNetV3, introduced by Howard et al. 2019).
* **large / small** → Two design variants:

  * *large*: higher accuracy, heavier.
  * *small*: lighter, for mobile CPUs.
* **100 / 075 / 050 / 150d** → Width multiplier:

  * `100` = 1.0× (default width)
  * `075` = 0.75× width (narrower)
  * `050` = 0.5× width (narrower)
  * `150d` = 1.5× width; the trailing `d` most likely marks a deeper variant (a depth multiplier above 1.0 in timm's model definition) rather than dilation.

---

#### **8.2. Prefixes**

Example: `tf_mobilenetv3_large_075`

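* **tf_** → TensorFlow-converted weights
  These models come from the **TensorFlow** (TF-Slim / TF-Hub) implementation of MobileNetV3.
  They expect TF-style preprocessing and padding. In timm this typically means Inception-style normalization (mean = std = 0.5, mapping pixels to [-1, 1]) rather than the ImageNet mean/std, so check the model's `pretrained_cfg` before use.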

* **mobilenetv3_rw** → “rw” = Ross Wightman, the timm author’s own re-implementation
  Usually trained from scratch or fine-tuned using timm’s standard recipe.

---

#### **8.3. Training recipe / dataset tags**

| Suffix                  | Meaning |
| :---------------------- | :------ |
| **in1k**                | Trained on ImageNet-1k (1.28 M images / 1000 classes). |
| **in21k**               | Pretrained on ImageNet-21k (about 14 M images / 21 000 classes). |
| **ft_in1k**             | “Fine-tuned on ImageNet-1k” after pretraining on a larger dataset (e.g., IN-21k). |
| **miil**                | Trained by **MIIL** (Alibaba DAMO Academy's Machine Intelligence Israel Lab, authors of “ImageNet-21K Pretraining for the Masses”). |
| **ra_in1k**             | “RA” = timm's **RandAugment**-based training recipe, on ImageNet-1k. |
| **ra4_e3600_r224_in1k** | Detailed training recipe: **ra4** = a later revision of the RandAugment-based recipe (the `4` reads as a recipe version, not necessarily an augmentation magnitude); **e3600** = trained for 3600 epochs; **r224** = training resolution 224 px; **in1k** = dataset ImageNet-1k. |
| **rmsp**                | Optimizer = **RMSProp** (as in the original TensorFlow MobileNet recipe). |
| **lamb**                | Optimizer = **LAMB** (Layer-wise Adaptive Moments optimizer for Batch training). |
---

#### **8.4. Minimal variants**

Example: `tf_mobilenetv3_large_minimal_100.in1k`

* **minimal** → Uses the **“minimalistic” MobileNetV3** (no squeeze-and-excitation, no h-swish, and 3×3 convolutions in place of 5×5).
  This simplifies the network for hardware where those operations are slow, sacrificing some accuracy.

---

#### **8.5. Putting it all together**

| Model                                          | Meaning                                                                       |
| :--------------------------------------------- | :---------------------------------------------------------------------------- |
| **mobilenetv3_large_100.miil_in21k**           | MobileNetV3-Large 1.0× pretrained on ImageNet-21k by MIIL team.               |
| **mobilenetv3_large_100.miil_in21k_ft_in1k**   | Same as above, then fine-tuned on ImageNet-1k.                                |
| **mobilenetv3_large_100.ra4_e3600_r224_in1k**  | Large 1.0× trained on IN-1k with the `ra4` RandAugment-based recipe, 3600 epochs, 224 px inputs. |
| **mobilenetv3_large_100.ra_in1k**              | Large 1.0× trained on IN-1k with standard RandAugment recipe.                 |
| **mobilenetv3_large_150d.ra4_e3600_r256_in1k** | 1.5× wider (`150d`) Large variant trained on IN-1k with the `ra4` recipe, 3600 epochs, 256 px inputs. |
| **mobilenetv3_rw.rmsp_in1k**                   | Ross Wightman’s re-implementation trained with RMSProp on IN-1k.              |
| **mobilenetv3_small_050.lamb_in1k**            | Small 0.5× trained with LAMB optimizer on IN-1k.                              |
| **mobilenetv3_small_075.lamb_in1k**            | Small 0.75× trained with LAMB optimizer.                                      |
| **mobilenetv3_small_100.lamb_in1k**            | Small 1.0× trained with LAMB optimizer.                                       |
| **tf_mobilenetv3_large_075.in1k**              | TF-converted Large 0.75× trained on IN-1k.                                    |
| **tf_mobilenetv3_large_100.in1k**              | TF-converted Large 1.0× on IN-1k.                                             |
| **tf_mobilenetv3_large_minimal_100.in1k**      | TF-converted minimalistic Large 1.0× on IN-1k.                                |
| **tf_mobilenetv3_small_075.in1k**              | TF-converted Small 0.75× on IN-1k.                                            |
| **tf_mobilenetv3_small_100.in1k**              | TF-converted Small 1.0× on IN-1k.                                             |
| **tf_mobilenetv3_small_minimal_100.in1k**      | TF-converted minimalistic Small 1.0× on IN-1k.                                |
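The naming rules from sections 8.1 to 8.4 can be sketched as a small parser. This is an illustrative helper, not part of timm: the function name and the returned dictionary keys are assumptions, and it only covers the MobileNetV3 patterns listed in this document.

```python
import re


def parse_timm_mobilenetv3_name(name):
    """Split a timm MobileNetV3 model name into the parts described above."""
    arch, _, tag = name.partition(".")          # "architecture.pretrained_tag"
    info = {
        "tf_weights": arch.startswith("tf_"),   # TensorFlow-converted weights
        "tags": tag.split("_") if tag else [],  # recipe / dataset tokens
    }
    if info["tf_weights"]:
        arch = arch[len("tf_"):]
    parts = arch.split("_")[1:]                 # drop the leading "mobilenetv3"
    info["size"] = next((p for p in parts if p in ("large", "small")), None)
    info["minimal"] = "minimal" in parts
    info["rw"] = "rw" in parts
    # Width token: three digits, optionally followed by a letter (e.g. "150d").
    width = next((p for p in parts if re.fullmatch(r"\d{3}[a-z]?", p)), None)
    info["width"] = int(width[:3]) / 100 if width else None
    info["width_suffix"] = width[3:] if width else ""
    return info


for model_name in (
    "mobilenetv3_large_100.miil_in21k_ft_in1k",
    "mobilenetv3_large_150d.ra4_e3600_r256_in1k",
    "tf_mobilenetv3_small_minimal_100.in1k",
):
    print(model_name, "->", parse_timm_mobilenetv3_name(model_name))
```

Keeping the tag tokens as a plain list avoids guessing at their meaning; the suffix table in section 8.3 remains the reference for interpreting them.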

---

#### **8.6. Which to choose**

| Goal                                      | Recommended variant                        |
| :---------------------------------------- | :----------------------------------------- |
| Strong accuracy at 1.0× width (IN-21k pretraining) | `mobilenetv3_large_100.miil_in21k_ft_in1k` |
| Lightweight for mobile apps               | `tf_mobilenetv3_small_075.in1k`            |
| Balanced speed / accuracy                 | `mobilenetv3_large_100.ra_in1k`            |
| Reference checkpoint for timm's own from-scratch recipe | `mobilenetv3_rw.rmsp_in1k`   |
| Low-latency inference (no SE, no h-swish) | `tf_mobilenetv3_large_minimal_100.in1k`    |
| Fine-tuning on your own dataset           | `mobilenetv3_large_100.miil_in21k_ft_in1k` (better pretrained features) |



---



## **9. Comparison Table of Accuracy (Top-1) vs Parameter Count**
The table below gives approximate parameter counts and top-1 accuracy for several of the timm MobileNetV3 variants listed above. Full metrics (e.g., size in MB or top-1 accuracy) are not published for every variant, so the table uses the available metadata and best-known numbers.

| Model                                        | Params (M)                                         | Top-1 accuracy (ImageNet-1k)                           | Notes                                                       |
| -------------------------------------------- | -------------------------------------------------- | ------------------------------------------------------ | ----------------------------------------------------------- |
| `mobilenetv3_large_100.ra_in1k`              | ~5.5 M ([huggingface.co][1])                       | ~75.8% ([huggingface.co][2])                           | Large 1.0×, “RA” recipe                                     |
| `mobilenetv3_large_100.ra4_e3600_r224_in1k`  | ~5.5 M ([huggingface.co][3])                       | ~76.31% @224px, ~77.16% @256px ([GitHub][4])           | “RA4”, 3600 epochs, improved recipe                         |
| `mobilenetv3_large_100.miil_in21k_ft_in1k`   | ~5.5 M ([huggingface.co][5])                       | (Not clearly published)                                | Pretrained on IN-21k by MIIL, then fine-tuned on IN-1k      |
| `mobilenetv3_large_100.miil_in21k`           | ~5.5 M ([Dataloop][6])                             | (Not clearly published)                                | Pretrained on IN-21k (no fine-tune details)                 |
| `mobilenetv3_large_150d.ra4_e3600_r256_in1k` | (Params not clearly listed)                        | ~80.94% @256px, ~81.81% @320px ([GitHub][4])           | Wider 1.5× variant, stronger accuracy                       |
| `mobilenetv3_rw.rmsp_in1k`                   | (Params unclear)                                   | —                                                      | “rw” = Ross Wightman’s re-implementation, RMSProp optimizer |
| `mobilenetv3_small_100.lamb_in1k`            | (Params not clearly listed)                        | —                                                      | Small 1.0× variant trained with LAMB optimizer              |
| `tf_mobilenetv3_large_100.in1k`              | ~5.5 M (from original paper) ([huggingface.co][2]) | ~75.5% (original Google weights) ([huggingface.co][2]) | TensorFlow-style weights, baseline for MNV3 Large           |

### Some observations

* Many of the models cluster around **5.5 M parameters** for the “Large” 1.0× versions, regardless of training recipe.
* Improved training recipes (e.g., RA4, more epochs, better augmentation) yield measurable gains: about 0.5 points of top-1 at 224 px in the table above (75.8% → 76.3%).
* The wider 1.5× variant (150d) reaches ≈81% top-1 at 256 px, compared to ~76–77% for the 1.0× versions, though it is also a larger model.
* Models pretrained on larger datasets (IN-21k) then fine-tuned tend to have the strongest potential, though exact public metrics can be harder to find.

### Caveats

* “Size in MB” is not consistently reported in the model cards, so the table lists parameter counts rather than MB values.
* Some of the “small”, “minimal”, or “tf_” variants listed in section 7 do not have full public metrics in the sources cited here.
* Accuracy values may depend on evaluation resolution (224 vs 256 px) and specific test cropping/resizing protocols.
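Where a size in MB is needed, it can be estimated from the parameter count. The sketch below assumes the weight file is dominated by the parameters themselves and uses 1 MB = 10^6 bytes; the function name and the dtype table are illustrative.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def estimate_size_mb(num_params, dtype="fp32"):
    """Rough weight-file size in MB (10**6 bytes); ignores metadata and buffers."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


# ~5.5 M parameters, the figure quoted for the Large 1.0x models above
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{estimate_size_mb(5.5e6, dtype):.1f} MB")
# prints:
# fp32: ~22.0 MB
# fp16: ~11.0 MB
# int8: ~5.5 MB
```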


[1]: https://huggingface.co/timm/mobilenetv3_large_100.ra_in1k "timm/mobilenetv3_large_100.ra_in1k - Hugging Face"
[2]: https://huggingface.co/blog/rwightman/mobilenet-baselines "MobileNet Baselines - Hugging Face"
[3]: https://huggingface.co/timm/mobilenetv3_large_100.ra4_e3600_r224_in1k "timm/mobilenetv3_large_100.ra4_e3600_r224_in1k - Hugging Face"
[4]: https://github.com/huggingface/pytorch-image-models "huggingface/pytorch-image-models - GitHub"
[5]: https://huggingface.co/timm/mobilenetv3_large_100.miil_in21k_ft_in1k "timm/mobilenetv3_large_100.miil_in21k_ft_in1k - Hugging Face"
[6]: https://dataloop.ai/library/model/timm_mobilenetv3_large_100miil_in21k_ft_in1k/ "Mobilenetv3 large 100.miil in21k ft in1k · Models - Dataloop"
