# **MobileNet**

* Standard CNNs like VGG or ResNet are accurate but computationally heavy (many parameters, FLOPs).
* MobileNet was designed (by Google, 2017) to run efficiently on **mobile and embedded devices** with limited compute/memory while keeping good accuracy.

---

##  **1. Key Idea – Depthwise Separable Convolution**

MobileNet replaces standard convolutions with a cheaper two-step operation:

####  **1.1. Standard convolution**

For an input of size $H \times W \times M$ (height × width × input channels) and $N$ filters of size $k \times k$:

* **Cost (MACs)**: $H \cdot W \cdot M \cdot N \cdot k^2$

####  **1.2. Depthwise Separable convolution**

Split into two layers:

1. **Depthwise convolution**:

   * Each input channel has its own $k \times k$ filter (no cross-channel mixing).
   * Cost: $H \cdot W \cdot M \cdot k^2$



2. **Pointwise convolution** (a $1 \times 1$ conv):

   * Combines the outputs of the depthwise step across channels.
   * Cost: $H \cdot W \cdot M \cdot N$





**Total cost:**

$$
H W (M k^2 + M N)
$$

which is much less than the standard convolution cost for a typical kernel size ($k=3$).

#### **1.3. Efficiency gain**

Reduction factor:

$$
\frac{k^2 M N}{k^2 M + M N} = \frac{k^2 N}{k^2 + N}
$$

For $k=3$ and large $N$ this is roughly 8–9× fewer computations.
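
As a quick sanity check of the formulas above, the sketch below computes both costs for one layer. The layer sizes (14×14 feature map, 512 input and output channels, 3×3 kernel) and the function names are illustrative assumptions, not taken from a specific model.

```python
def standard_conv_macs(h, w, m, n, k):
    """MACs for a standard k x k convolution: H * W * M * N * k^2."""
    return h * w * m * n * k * k


def separable_conv_macs(h, w, m, n, k):
    """MACs for depthwise (H*W*M*k^2) plus pointwise (H*W*M*N)."""
    return h * w * m * k * k + h * w * m * n


# Hypothetical layer sizes, chosen only for illustration.
H, W, M, N, K = 14, 14, 512, 512, 3

std = standard_conv_macs(H, W, M, N, K)
sep = separable_conv_macs(H, W, M, N, K)
print(f"standard:  {std:,} MACs")
print(f"separable: {sep:,} MACs")
print(f"reduction: {std / sep:.2f}x")
```

For these sizes the reduction comes out just under 9×, which is consistent with the $k^2 N / (k^2 + N)$ expression.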

---



#### **1.4. Standard Convolution Weight Tensor shape**

If the input feature map has shape **(H, W, M)** and you want **N** output channels with a kernel of size **k×k**, then:

* Each **output channel** has its own filter of shape **(M, k, k)** (one k×k kernel per input channel).
* Collectively, the weight tensor is:

$$ W_\text{shape} = (N,M,k,k) $$

So a standard convolution needs **N** such filters, each spanning all **M** input channels.
That is why the cost is:

$$H \, W \, M \, N \, k^2$$

operations (MACs).



<img src='../conv/images/06_03.png' height="50%" width="50%"/>
<img src='../conv/images/06_09.png' height="50%" width="50%" />
<img src='../conv/images/3_channel_conv.gif' height="50%" width="50%" />


---

#### **1.5. Depthwise separable convolution weight shapes**

MobileNet breaks that heavy convolution into two:

**(a) Depthwise convolution**

* One filter per **input channel**, *not* per output channel.
* Weight shape:

$$ W_\text{depthwise} = (M,1,k,k)$$

so only **M** filters total, each applied to a single channel.

**(b) Pointwise (1×1) convolution**

* After the depthwise stage, you still have **M** channels.
* You then mix them to get **N** channels using a 1×1 conv.
* Weight shape:

$$ W_\text{pointwise} = (N,M,1,1)$$

This is much cheaper than using a full $(N, M, k, k)$ weight tensor directly.


**Depthwise-separable (8 Groups followed by pointwise)**
![](../conv/images/depthwise-separable-convolution-animation-3x3-kernel.gif)


---

#### **1.6. Visual comparison**

| Stage              | Input shape | Weight shape | Output channels |
| ------------------ | ----------- | ------------ | --------------- |
| **Standard Conv**  | (H,W,M)     | (N,M,k,k)    | N               |
| **Depthwise Conv** | (H,W,M)     | (M,1,k,k)    | M               |
| **Pointwise Conv** | (H,W,M)     | (N,M,1,1)    | N               |
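
The weight shapes in the table can be checked directly in PyTorch. This is a small sketch with assumed sizes ($M=32$, $N=64$, $k=3$); note that PyTorch stores activations as (batch, channels, H, W), while the weight layout matches the table.

```python
import torch.nn as nn

M, N, k = 32, 64, 3  # illustrative channel counts and kernel size

standard = nn.Conv2d(M, N, kernel_size=k, padding=1, bias=False)
depthwise = nn.Conv2d(M, M, kernel_size=k, padding=1, groups=M, bias=False)
pointwise = nn.Conv2d(M, N, kernel_size=1, bias=False)

layers = [("standard", standard), ("depthwise", depthwise), ("pointwise", pointwise)]
for name, layer in layers:
    shape = tuple(layer.weight.shape)
    print(f"{name:10s} weight {shape}  params {layer.weight.numel()}")
```

The depthwise and pointwise weights together are much smaller than the single standard weight tensor.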

---

#### **1.7. Why cost drops**

Standard conv cost:
$H W M N k^2$

Depthwise + pointwise cost:
$H W (M k^2 + M N)$

Dividing the second expression by the first gives a ratio of $\frac{1}{k^2} + \frac{1}{N}$. For $k=3$ and large $N$ this approaches $1/9$, i.e. roughly 8–9× less compute.

---



* **Standard conv:** N filters of shape M×k×k.
* **MobileNet (depthwise separable):** M filters of shape 1×k×k **plus** N filters of shape M×1×1.




## **MBConv Block and Numerical Example**

The MBConv block is at the heart of the **MobileNetV2/V3** and **EfficientNet** architectures.

#### 1. What is MBConv?

**MBConv (Mobile Inverted Bottleneck Convolution)** was introduced in **MobileNetV2** and reused in **EfficientNet**.

It’s called *inverted* because:

* A normal bottleneck first **reduces** channels, then applies convolution.
* MBConv first **expands** channels, does computation, and then **projects back** to fewer channels.

---

#### MBConv block structure

1. **Expansion (1×1 convolution)**
   Expands from input channels $C_{in}$ to $t \times C_{in}$, where $t$ is the **expansion factor** (usually 6).

   $$
   X_{expand} = \text{ReLU6}(\text{BN}(\text{Conv}_{1\times1}(X_{in})))
   $$

2. **Depthwise convolution (3×3 convolution per channel)**
   Applies spatial convolution **independently** for each channel.

   $$
   X_{depth} = \text{ReLU6}(\text{BN}(\text{ConvDepthwise}_{3\times3}(X_{expand})))
   $$

3. **Projection (1×1 convolution)**
   Reduces back to $C_{out}$ channels.

   $$
   X_{out} = \text{BN}(\text{Conv}_{1\times1}(X_{depth}))
   $$

4. **Skip connection (optional)**
   If stride = 1 and $C_{in} = C_{out}$:
   $$
   Y = X_{in} + X_{out}
   $$
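
The four steps above can be sketched as a small PyTorch module. This is a minimal illustration, not a reference implementation: the class name `MBConv` and its arguments are assumptions, it always applies the expansion layer, and it omits the SE block that MobileNetV3 and EfficientNet add.

```python
import torch
import torch.nn as nn


class MBConv(nn.Module):
    """Sketch of an inverted-residual block: expand -> depthwise -> project."""

    def __init__(self, c_in, c_out, expansion=6, stride=1):
        super().__init__()
        c_mid = expansion * c_in
        # The skip connection is only valid when input and output shapes match.
        self.use_skip = stride == 1 and c_in == c_out
        self.block = nn.Sequential(
            # 1. Expansion: 1x1 conv, C_in -> t * C_in
            nn.Conv2d(c_in, c_mid, kernel_size=1, bias=False),
            nn.BatchNorm2d(c_mid),
            nn.ReLU6(inplace=True),
            # 2. Depthwise: 3x3 conv, one filter per channel
            nn.Conv2d(c_mid, c_mid, kernel_size=3, stride=stride,
                      padding=1, groups=c_mid, bias=False),
            nn.BatchNorm2d(c_mid),
            nn.ReLU6(inplace=True),
            # 3. Projection: 1x1 conv, linear (no activation afterwards)
            nn.Conv2d(c_mid, c_out, kernel_size=1, bias=False),
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x):
        out = self.block(x)
        # 4. Optional skip connection
        return x + out if self.use_skip else out


x = torch.randn(2, 16, 32, 32)
print(MBConv(16, 16)(x).shape)            # stride 1, same channels: skip used
print(MBConv(16, 24, stride=2)(x).shape)  # downsampling block: no skip
```

Bias terms are disabled in the convolutions because each one is followed by BatchNorm, which has its own shift parameter.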

---

#### 2. Is MBConv the same as Depthwise-Separable Conv?

❌ **Not exactly.**
✅ It *includes* a depthwise convolution, but **adds expansion and projection** around it.

**Depthwise Separable Conv** (used in **MobileNetV1**) only has:

1. Depthwise 3×3 conv
2. Pointwise (1×1) conv

**MBConv = Expansion (1×1) → Depthwise (3×3) → Projection (1×1)**
**Depthwise Separable = Depthwise (3×3) → Pointwise (1×1)**

So MBConv is a **generalized and more expressive** version of Depthwise Separable Conv.

---

#### 3. Numerical Example

Let’s take a **tiny example** to see the shapes.

| Parameter        | Symbol       | Value |
| ---------------- | ------------ | ----- |
| Input size       | $H \times W$ | 8 × 8 |
| Input channels   | $C_{in}$     | 4     |
| Output channels  | $C_{out}$    | 4     |
| Expansion factor | $t$          | 6     |
| Kernel size      | $k$          | 3     |
| Stride           | $s$          | 1     |

---

#### Step 1: Expansion (1×1 conv)

$$
C_{expand} = t \times C_{in} = 6 \times 4 = 24
$$

Output tensor shape:
$$
[8, 8, 24]
$$

---

#### Step 2: Depthwise 3×3 conv

Each of the 24 channels gets its own 3×3 filter → no channel mixing.

Output tensor shape:
$$
[8, 8, 24]
$$

(assuming padding = 1, stride = 1)

---

#### Step 3: Projection (1×1 conv)

Reduces back to 4 channels:

$$
[8, 8, 24] \xrightarrow{\text{Conv1×1}} [8, 8, 4]
$$

---

#### Step 4: Skip connection

Since stride = 1 and input/output channels are the same (4), we add:

$$
Y = X_{in} + X_{out}
$$

Final output shape:
$$
[8, 8, 4]
$$
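
The shape trace in Steps 1 to 4 can be reproduced with three plain `nn.Conv2d` layers. This is only a sketch: BatchNorm and ReLU6 are left out because they do not change tensor shapes, and the variable names are chosen here for illustration.

```python
import torch
import torch.nn as nn

# PyTorch uses (batch, channels, H, W), so [8, 8, 4] becomes (1, 4, 8, 8).
x_in = torch.randn(1, 4, 8, 8)

expand = nn.Conv2d(4, 24, kernel_size=1, bias=False)      # Step 1
depthwise = nn.Conv2d(24, 24, kernel_size=3, padding=1,
                      groups=24, bias=False)              # Step 2
project = nn.Conv2d(24, 4, kernel_size=1, bias=False)     # Step 3

x_expand = expand(x_in)
x_depth = depthwise(x_expand)
x_out = project(x_depth)
y = x_in + x_out                                          # Step 4

for name, t in [("expand", x_expand), ("depthwise", x_depth),
                ("project", x_out), ("skip", y)]:
    print(f"{name:10s} {tuple(t.shape)}")
```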

---

#### 4. Parameter Comparison

Let’s roughly compare MBConv vs Depthwise-Separable Conv for the same example.

| Layer type                         | Parameters                       |
| ---------------------------------- | -------------------------------- |
| Expansion 1×1 conv                 | $1×1×4×24 = 96$                  |
| Depthwise 3×3 conv                 | $3×3×24 = 216$                   |
| Projection 1×1 conv                | $1×1×24×4 = 96$                  |
| **Total MBConv**                   | **408**                          |
| Depthwise-separable (no expansion) | $3×3×4 + 1×1×4×4 = 36 + 16 = 52$ |

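So in this example MBConv has roughly 8× **more parameters** than the plain depthwise-separable layer, but it also has **more expressive capacity**: the depthwise filters operate in a wider $t \times C_{in}$-channel space, and the block applies two ReLU6 nonlinearities followed by a linear projection.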
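
The counts in the table can be reproduced with a few lines of arithmetic. The helper names below are hypothetical, and only convolution weights are counted (no biases, no BatchNorm parameters), which is the same convention the table uses.

```python
def mbconv_params(c_in, c_out, t, k):
    """Conv weights in expand (1x1) + depthwise (k x k) + project (1x1)."""
    c_mid = t * c_in
    expand = c_in * c_mid        # 1x1 conv: C_in -> t * C_in
    depthwise = k * k * c_mid    # one k x k filter per expanded channel
    project = c_mid * c_out      # 1x1 conv: t * C_in -> C_out
    return expand + depthwise + project


def separable_params(c_in, c_out, k):
    """Conv weights in depthwise (k x k) + pointwise (1x1)."""
    return k * k * c_in + c_in * c_out


print(mbconv_params(4, 4, t=6, k=3))   # 408
print(separable_params(4, 4, k=3))     # 52
```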

---

#### 5. Key Differences Summary

| Feature         | Depthwise-Separable | MBConv                                                 |
| --------------- | ------------------- | ------------------------------------------------------ |
| Expansion       | No                  | Yes (1×1 expand to $t \times C_{in}$)                  |
| Nonlinearity    | After both convs    | ReLU6 after expansion and depthwise, linear projection |
| Skip connection | None in MobileNetV1 | Yes (if stride=1 and $C_{in}=C_{out}$)                 |
| Used in         | MobileNetV1         | MobileNetV2, EfficientNet                              |
| Expressiveness  | Moderate            | High                                                   |

---



##  **2. MobileNet Architecture**

* Built mainly from **(Depthwise conv + Pointwise conv) + BatchNorm + ReLU** blocks.
* Ends with a fully-connected layer for classification.
* Lightweight, fewer parameters, and faster inference.

---
##  **3. Hyperparameters for Trade-offs**

MobileNet introduces two knobs to scale size vs. accuracy:

1. **Width Multiplier $ \alpha $**

   * Scales the number of channels in each layer:
     $M' = \alpha M$, $N' = \alpha N$.
   * $\alpha \in (0,1]$ makes the network thinner; compute and parameter count scale roughly with $\alpha^2$.

2. **Resolution Multiplier $ \rho $**

   * Scales the input image resolution:
     $H' = \rho H$, $W' = \rho W$.
   * Compute scales with $\rho^2$, so halving the resolution cuts it to about a quarter.

By adjusting $\alpha$ and $\rho$, you can deploy MobileNet variants for different devices.
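
To see how the two multipliers affect cost, the sketch below applies them to the depthwise-separable MAC formula for one layer. The function name and the layer sizes (56×56, 128 channels in and out) are assumptions made for illustration.

```python
def separable_macs(h, w, m, n, k=3, alpha=1.0, rho=1.0):
    """MACs of one depthwise-separable layer after applying alpha and rho."""
    h_s, w_s = int(rho * h), int(rho * w)      # scaled resolution
    m_s, n_s = int(alpha * m), int(alpha * n)  # scaled channel counts
    depthwise = h_s * w_s * m_s * k * k
    pointwise = h_s * w_s * m_s * n_s
    return depthwise + pointwise


base = separable_macs(56, 56, 128, 128)
for alpha, rho in [(1.0, 1.0), (0.5, 1.0), (1.0, 0.5), (0.5, 0.5)]:
    macs = separable_macs(56, 56, 128, 128, alpha=alpha, rho=rho)
    print(f"alpha={alpha}, rho={rho}: {macs:>12,} MACs ({macs / base:.3f} of base)")
```

Halving $\rho$ divides the cost by exactly 4 here. Halving $\alpha$ gives slightly more than a quarter, because the depthwise term scales with $\alpha$ rather than $\alpha^2$.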

---

##  **4. Evolution of MobileNet Versions**

| Version         | Year | Key Improvements                                                                               |
| --------------- | ---- | ---------------------------------------------------------------------------------------------- |
| **MobileNetV1** | 2017 | Depthwise separable conv + width/resolution multipliers                                        |
| **MobileNetV2** | 2018 | Introduced **Inverted Residuals** + **Linear Bottlenecks** (similar to ResNet but lightweight) |
| **MobileNetV3** | 2019 | Added **SE blocks** (Squeeze-and-Excitation) + **hard-swish** activation + NAS-based search    |

---

##  **5. When to Use**

* On-device classification or detection (phones, IoT devices).
* As a backbone for mobile-friendly models (SSD-MobileNet, DeepLabV3-MobileNet).
* Great for applications where latency, power, or memory are limited.

---

##  **6. Quick Visual of a Block**

```
Input
  │
Depthwise Conv (3x3, per-channel)
  │
BatchNorm + ReLU
  │
Pointwise Conv (1x1, across channels)
  │
BatchNorm + ReLU
  │
Output
```
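
The diagram above maps onto an `nn.Sequential` fairly directly. This is a sketch under assumptions: the helper name `depthwise_separable_block` is made up for this example, and the channel counts in the usage lines are arbitrary.

```python
import torch
import torch.nn as nn


def depthwise_separable_block(c_in, c_out, stride=1):
    """Depthwise 3x3 -> BN -> ReLU -> pointwise 1x1 -> BN -> ReLU."""
    return nn.Sequential(
        # Depthwise: groups=c_in gives one 3x3 filter per input channel.
        nn.Conv2d(c_in, c_in, kernel_size=3, stride=stride,
                  padding=1, groups=c_in, bias=False),
        nn.BatchNorm2d(c_in),
        nn.ReLU(inplace=True),
        # Pointwise: 1x1 conv mixes information across channels.
        nn.Conv2d(c_in, c_out, kernel_size=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )


block = depthwise_separable_block(32, 64)
x = torch.randn(1, 32, 56, 56)
out = block(x)
print(out.shape)
```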



## **7. PyTorch grouped and depthwise Convolution**
PyTorch’s `groups` argument in `nn.Conv2d` is exactly what lets you move from a **full convolution** to **grouped** or even **depthwise** convolutions (like MobileNet uses).

---

####  **7.1. Default (groups = 1)**

* **Shape:** weights = `(out_channels, in_channels, kH, kW)`
* Every output channel sees **all** input channels.
* This is the standard convolution we described first.

---

####  **7.2 Grouped convolution (1 < groups < in_channels)**

* Split the **input channels** and **output channels** into `groups` parts.
* Each group of output channels only looks at a subset of the input channels.
* The weight shape becomes:

$$
\text{weight shape} = \left( \text{out\_channels}, \frac{\text{in\_channels}}{\text{groups}}, kH, kW \right)
$$

* Cost drops roughly by a factor of `groups` compared to a full conv.

Example:
`nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, groups=4)`
→ Input split into 4 groups of 16 channels, output split into 4 groups of 32 channels. Each group processes only its slice.
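
The weight shape and the roughly `groups`-fold saving for this example can be confirmed by inspecting the layers. Bias is disabled in this sketch so that only the convolution weights are counted.

```python
import torch.nn as nn

full = nn.Conv2d(64, 128, kernel_size=3, groups=1, bias=False)
grouped = nn.Conv2d(64, 128, kernel_size=3, groups=4, bias=False)

print("full:   ", tuple(full.weight.shape), full.weight.numel())
print("grouped:", tuple(grouped.weight.shape), grouped.weight.numel())
print("ratio:  ", full.weight.numel() / grouped.weight.numel())
```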

---

####  **7.3 Depthwise convolution (groups = in_channels)**

* Special case of grouped convolution.
* You set:

```python
nn.Conv2d(in_channels=M, out_channels=M, kernel_size=3, groups=M)
```

* Each input channel has its own filter and produces its own output channel (one-to-one).
* This is exactly the **depthwise** step in MobileNet.

Weight shape:

$$
(M,1,kH,kW)
$$

This matches the depthwise weight shape from section 1.5.

---

####  **7.4 Depthwise separable convolution in PyTorch**

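You implement it as two layers:

```python
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    def __init__(self, M, N):
        super().__init__()
        # depthwise: one 3x3 filter per input channel (groups=M)
        self.depthwise = nn.Conv2d(in_channels=M,
                                   out_channels=M,
                                   kernel_size=3,
                                   groups=M,
                                   padding=1)
        # pointwise: 1x1 conv that mixes the M channels into N
        self.pointwise = nn.Conv2d(in_channels=M,
                                   out_channels=N,
                                   kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```

With BatchNorm and ReLU added after each convolution, this is the MobileNetV1 building block.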

---



| groups value    | Name           | Effect                                           |
| --------------- | -------------- | ------------------------------------------------ |
| 1               | Standard conv  | Each output channel sees all input channels      |
| g (1<g<in_ch)   | Grouped conv   | Each group sees a fraction of input channels     |
| in_ch (=out_ch) | Depthwise conv | Each channel has its own filter (MobileNet step) |

---

In short, `groups` in `nn.Conv2d` is PyTorch’s general mechanism for this: a **depthwise convolution** is a convolution with `groups = in_channels`.


#### **7.5 Comparing the output shapes for `groups=1` and `groups=2`**

```python
import torch
import torch.nn as nn

x = torch.randn(1, 4, 5, 5)  # batch=1, 4 input channels

conv1 = nn.Conv2d(4, 8, kernel_size=3, groups=1, padding=1)
conv2 = nn.Conv2d(4, 8, kernel_size=3, groups=2, padding=1)

y1 = conv1(x)
y2 = conv2(x)

print(y1.shape, y2.shape)  # both (1,8,5,5)
```

Output:

```
torch.Size([1, 8, 5, 5]) torch.Size([1, 8, 5, 5])
```


The output shapes are identical, but `groups=2` will **not** give the same values as `groups=1` (unless the weights are specially chosen so that the two behave identically).

Here’s why:

---

####  **7.6 What happens when `groups=1`**

* All `in_channels` are connected to all `out_channels`.
* Each filter can combine information from **every input channel**.

####  **7.7 What happens when `groups=2`**

* PyTorch splits the input channels into 2 equal groups.

  * Group 1: channels `0..in_channels/2-1`
  * Group 2: channels `in_channels/2..end`
* The output channels are also split into 2 groups.
* Each group’s filters only “see” its **own half** of the input channels; there is **no cross-talk** between groups.

So effectively you’re doing two separate convolutions in parallel and then concatenating their outputs along the channel dimension.
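
The "two separate convolutions, then concatenate" view can be checked numerically. The sketch below slices the weights of one `groups=2` layer and rebuilds its output by hand with `torch.nn.functional.conv2d`; the slicing assumes PyTorch's layout, in which the first half of the output channels belongs to the first group.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 4, 5, 5)
conv = nn.Conv2d(4, 8, kernel_size=3, groups=2, padding=1)

y_grouped = conv(x)

w, b = conv.weight, conv.bias  # w has shape (8, 2, 3, 3)
# Group 1: input channels 0-1 -> output channels 0-3
y_first = F.conv2d(x[:, :2], w[:4], b[:4], padding=1)
# Group 2: input channels 2-3 -> output channels 4-7
y_second = F.conv2d(x[:, 2:], w[4:], b[4:], padding=1)
y_manual = torch.cat([y_first, y_second], dim=1)

print(torch.allclose(y_grouped, y_manual, atol=1e-5))
```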



## **8. What “SE” Stands For**

**SE block** = **Squeeze-and-Excitation block**.
It’s a lightweight attention mechanism for CNNs introduced in the paper:

> *“Squeeze-and-Excitation Networks” (Hu et al., CVPR 2018)*

It improves a network’s ability to model the **importance of each channel** in feature maps.

---

#### **8.2. Why We Need It**

Convolutions learn spatial filters, but their channel weighting is fixed after training and is the same for every input.
Some channels might carry more relevant information for the current task.
An SE block lets the network **recalibrate channel-wise feature responses** adaptively.

---

#### **8.3. How an SE Block Works**

Let’s say the input feature map is $X\in \mathbb{R}^{H\times W\times C}$ (height, width, channels).

#### Step A: **Squeeze** (Global information)

* Perform **global average pooling** over spatial dimensions $H \times W$ to get one value per channel.
* This yields a vector $z\in \mathbb{R}^{C}$ summarizing each channel’s global response.

Mathematically:
$$
z_c = \frac{1}{H \, W}\sum_{i=1}^{H}\sum_{j=1}^{W} X_{i,j,c}
$$

#### Step B: **Excitation** (Learn channel attention)

* Pass $z$ through a small **two-layer MLP**:

  * First layer reduces dimension to $C/r$ (bottleneck; $r$ is reduction ratio like 16)
  * Apply ReLU.
  * Second layer expands back to $C$.
  * Apply sigmoid to get weights $s\in[0,1]^C$.

$$
s = \sigma(W_2 \, \delta(W_1 z))
$$

where $\delta$ is ReLU, $\sigma$ is sigmoid.

#### Step C: **Scale** (Recalibration)

* Multiply the original feature map channels by the learned weights:

$$
\tilde{X}_{i,j,c} = s_c \cdot X_{i,j,c}
$$

So channels the block deems “important” get boosted; less important channels get suppressed.

---

#### **8.4. Visual Diagram**

```
Input Feature Map (H×W×C)
       │
 Global Average Pooling (Squeeze)
       ↓
 Channel Descriptor (C)
       │
 Fully Connected (reduce C→C/r), ReLU
       │
 Fully Connected (C/r→C), Sigmoid (Excitation)
       ↓
 Channel Weights (C)
       │
 Scale original feature map channel-wise
       ↓
Output Feature Map (H×W×C)
```

---

#### **8.5. Where It’s Used**

* Originally introduced in **SENet**, which won the ILSVRC 2017 image classification task.
* Incorporated into **MobileNetV3**, **EfficientNet** (in MBConv blocks), ResNeXt, etc.
* Adds only a tiny computational overhead but often improves accuracy significantly.
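
To put a number on the overhead, the sketch below compares the weights of an SE block's two fully connected layers with those of a single 3×3 convolution at the same width. The helper names and the choice of $C = 256$ are assumptions for illustration, and biases are ignored.

```python
def se_params(channels, reduction=16):
    """Weights in the two FC layers: C -> C/r and C/r -> C (no biases)."""
    hidden = channels // reduction
    return channels * hidden + hidden * channels


def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (no bias)."""
    return c_in * c_out * k * k


C = 256
se = se_params(C)
conv = conv_params(C, C, 3)
print(f"SE block: {se:,} weights")
print(f"3x3 conv: {conv:,} weights")
print(f"overhead: {se / conv:.2%}")
```

With these sizes the SE block adds on the order of one percent to the weights of the convolution it follows.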

---

#### **8.6. PyTorch Implementation (simple)**

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid()
        )

    def forward(self, x):
        b, c, _, _ = x.size()
        y = self.avg_pool(x).view(b, c)        # Squeeze
        y = self.fc(y).view(b, c, 1, 1)        # Excitation
        return x * y                          # Scale
```

---

#### **8.7. Key Takeaways**

* **SE block** = channel-wise attention module.
* Steps: **Squeeze → Excite → Scale**.
* Helps the network emphasize informative features dynamically.
* Tiny overhead, noticeable accuracy gain.

---



## **9. Compute Cost (MACs): Standard vs Depthwise-Separable Convolution**

* **Standard conv** (in_channels = $M$, out_channels = $N$, kernel $k\times k$, output size $H\times W$)
  $$
  \text{MACs}_{\text{std}} = H \, W \, M \, N \, k^2
  $$

* **Depthwise-separable conv**
  Depthwise: $H \, W \, M \, k^2$
  Pointwise $1\times 1$: $H \, W \, M \, N$
  $$
  \text{MACs}_{\text{dw+pw}} = H \, W \, (M k^2 + M N) = H \, W \, M \, (k^2 + N)
  $$

* **Ratio (how much of the standard cost remains)**
  $$
  \frac{\text{MACs}_{\text{dw+pw}}}{\text{MACs}_{\text{std}}}
  = \frac{M(k^2+N)}{MNk^2}
  = \frac{k^2+N}{N k^2}
  = \frac{1}{k^2} + \frac{1}{N}
  $$
  So the **speed-up** (std / dw+pw) is:
  $$
  \text{Speed-up} = \frac{N k^2}{k^2 + N}
  $$

**Example: RGB → 5 channels** (M=3, N=5, k=3)

* **Per-pixel MACs**

  * Standard: $M N k^2 = 3 \cdot 5 \cdot 9 = 135$
  * Depthwise+Pointwise: $M k^2 + M N = 3\cdot 9 + 3 \cdot 5 = 27 + 15 = 42$

  **Ratio:** $42/135 = 14/45 \approx 0.311$ → ~**69% fewer MACs**
  **Speed-up:** $135/42 = 45/14 \approx 3.21\times$

* **If output is 224×224** (same spatial size):

  * Standard: $50{,}176 \times 135 = 6{,}773{,}760$ MACs
  * Depthwise+Pointwise: $50{,}176 \times 42 = 2{,}107{,}392$ MACs

**Intuition**

* Standard conv mixes **spatial** and **cross-channel** interactions in one heavy op.
* Depthwise handles **spatial** per-channel cheaply; pointwise (1×1) then mixes **channels**.
* With a typical kernel size ($k=3$) and larger $N$, the ratio $\frac{1}{k^2}+\frac{1}{N}$ gets close to $1/9$, i.e., ~**9×** less compute. For small $N$ (like 5), the saving is smaller (~3.2×), because the $\frac{1}{N}$ term from the 1×1 mixing dominates the ratio.
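
The dependence on $N$ is easy to tabulate. The short sketch below evaluates the speed-up formula $\frac{N k^2}{k^2 + N}$ for a few output widths; the function name and the chosen values of $N$ are illustrative.

```python
def speedup(n, k=3):
    """Standard-conv MACs divided by depthwise-separable MACs."""
    return (n * k * k) / (k * k + n)


for n in (5, 64, 512, 1024):
    s = speedup(n)
    print(f"N={n:5d}  remaining cost={1 / s:.3f}  speed-up={s:.2f}x")
```

The speed-up grows with $N$ but never reaches $k^2 = 9$, since the depthwise term $\frac{1}{k^2}$ always remains.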


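A 1×1 (pointwise) convolution can also be used on its own to change the number of channels, for example to reduce 64 channels to 16:

```python
import torch
import torch.nn as nn

# Suppose we want to reduce 64 channels -> 16 channels
reduce_channels = nn.Conv2d(64, 16, kernel_size=1)

x = torch.randn(1, 64, 32, 32)
y = reduce_channels(x)

print("Reduced shape:", y.shape)  # (1, 16, 32, 32)
```

Output:

```
Reduced shape: torch.Size([1, 16, 32, 32])
```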
