## 1. What is MBConv?

**MBConv (Mobile Inverted Bottleneck Convolution)** was introduced in **MobileNetV2** and reused in **EfficientNet**.

It’s called *inverted* because:

* A normal (ResNet-style) bottleneck first **reduces** channels, applies the 3×3 convolution, and then **expands** them again.
* MBConv first **expands** channels, does computation, and then **projects back** to fewer channels.

---

### MBConv block structure

1. **Expansion (1×1 convolution)**
   Expands from input channels $C_{in}$ to $t \times C_{in}$, where $t$ is the **expansion factor** (usually 6).

   $$
   X_{expand} = \text{ReLU6}(\text{BN}(\text{Conv}_{1\times1}(X_{in})))
   $$

2. **Depthwise convolution (3×3 convolution per channel)**
   Applies spatial convolution **independently** for each channel.

   $$
   X_{depth} = \text{ReLU6}(\text{BN}(\text{ConvDepthwise}_{3\times3}(X_{expand})))
   $$

3. **Projection (1×1 convolution)**
   Reduces back to $C_{out}$ channels.

   $$
   X_{out} = \text{BN}(\text{Conv}_{1\times1}(X_{depth}))
   $$

4. **Skip connection (optional)**
   If stride = 1 and $C_{in} = C_{out}$:
   $$
   Y = X_{in} + X_{out}
   $$
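
The four steps above can be sketched as a single PyTorch module. This is a minimal illustration rather than a reference implementation: the class name `MBConv` and the names `c_mid` and `use_skip` are assumptions made for this example, and `bias=False` is chosen because each convolution is followed by BatchNorm.

```python
import torch
from torch import nn


class MBConv(nn.Module):
    """Sketch of an MBConv block: expand (1x1) -> depthwise (3x3) -> project (1x1)."""

    def __init__(self, c_in, c_out, expansion=6, stride=1):
        super().__init__()
        c_mid = expansion * c_in
        # The skip connection is only valid when input and output shapes match.
        self.use_skip = stride == 1 and c_in == c_out
        self.block = nn.Sequential(
            # 1. Expansion: 1x1 conv + BN + ReLU6
            nn.Conv2d(c_in, c_mid, kernel_size=1, bias=False),
            nn.BatchNorm2d(c_mid),
            nn.ReLU6(inplace=True),
            # 2. Depthwise: 3x3 conv, one filter per channel (groups=c_mid)
            nn.Conv2d(c_mid, c_mid, kernel_size=3, stride=stride,
                      padding=1, groups=c_mid, bias=False),
            nn.BatchNorm2d(c_mid),
            nn.ReLU6(inplace=True),
            # 3. Projection: 1x1 conv + BN, no activation (linear)
            nn.Conv2d(c_mid, c_out, kernel_size=1, bias=False),
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x):
        out = self.block(x)
        # 4. Optional skip connection
        return x + out if self.use_skip else out


x = torch.randn(2, 4, 8, 8)
print(MBConv(4, 4)(x).shape)            # skip connection used
print(MBConv(4, 8, stride=2)(x).shape)  # no skip: channels and resolution change
```

With `stride=2` or `c_in != c_out` the block simply returns the projected tensor, which matches the condition stated in step 4.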

---

**MBConv = Expansion (1×1) → Depthwise (3×3) → Projection (1×1)**


**Depthwise Separable = Depthwise (3×3) → Pointwise (1×1)**

So MBConv is a Depthwise Separable Conv with a 1×1 **expansion** in front (plus an optional skip connection), which makes it **more expressive** at the cost of more parameters.

---

## 2. Numerical Example

Let’s take a **tiny example** to see the shapes.

| Parameter        | Symbol       | Value |
| ---------------- | ------------ | ----- |
| Input size       | $H \times W$ | 8 × 8 |
| Input channels   | $C_{in}$     | 4     |
| Output channels  | $C_{out}$    | 4     |
| Expansion factor | $t$          | 6     |
| Kernel size      | $k$          | 3     |
| Stride           | $s$          | 1     |

---

#### Step 1: Expansion (1×1 conv)

$$
C_{expand} = t \times C_{in} = 6 \times 4 = 24
$$

Output tensor shape:
$$
[8, 8, 24]
$$

---

#### Step 2: Depthwise 3×3 conv

Each of the 24 channels gets its own 3×3 filter → no channel mixing.

Output tensor shape:
$$
[8, 8, 24]
$$

(assuming padding = 1, stride = 1)
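
The unchanged spatial size can be checked with the standard convolution output-size formula. Here $p$ (padding) and $s$ (stride) are symbols introduced for this check, and the same formula applies to the width:

$$
H_{out} = \left\lfloor \frac{H + 2p - k}{s} \right\rfloor + 1 = \left\lfloor \frac{8 + 2 \cdot 1 - 3}{1} \right\rfloor + 1 = 8
$$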

---

#### Step 3: Projection (1×1 conv)

Reduces back to 4 channels:

$$
[8, 8, 24] \xrightarrow{\text{Conv1×1}} [8, 8, 4]
$$

---

#### Step 4: Skip connection

Since stride = 1 and input/output channels are the same (4), we add:

$$
Y = X_{in} + X_{out}
$$

Final output shape:
$$
[8, 8, 4]
$$

<img src="images/mbconv.png"  height="30%" width="30%" />


#### Python Code

```python
import torch

input = torch.randn(1, 4, 8, 8)  # [B, Cin, H, W]
expansion_factor = 6
```

#### 1. Expansion (1×1 convolution)

```python
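conv1x1 = torch.nn.Conv2d(in_channels=4,
                          out_channels=expansion_factor * 4,  # 24
                          kernel_size=1,
                          stride=1,
                          bias=False)  # weights only, as in the parameter table (BN would follow in a full block)
output_expanded = conv1x1(input)
print(output_expanded.shape)  # torch.Size([1, 24, 8, 8])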
```

This expands the number of channels:
$$
C_{expand} = t \times C_{in} = 6 \times 4 = 24
$$

---

#### 2. Depthwise convolution (3×3)

```python
conv_depthwise = torch.nn.Conv2d(in_channels=expansion_factor * 4,
                                 out_channels=expansion_factor * 4,
                                 kernel_size=3,
                                 stride=1,
                                 padding=1,          # keep size same
                                 groups=expansion_factor * 4,  # depthwise
                                 bias=False)         # weights only, as in the parameter table
output_depthwise = conv_depthwise(output_expanded)
print(output_depthwise.shape)  # torch.Size([1, 24, 8, 8])
```

Each channel is convolved **independently** (since `groups=in_channels`).

---

#### 3. Projection (1×1 convolution)

```python
conv_projection = torch.nn.Conv2d(in_channels=expansion_factor * 4,
                                  out_channels=4,
                                  kernel_size=1,
                                  stride=1,
                                  bias=False)  # weights only, as in the parameter table
output_projected = conv_projection(output_depthwise)
print(output_projected.shape)  # torch.Size([1, 4, 8, 8])
```

This projects back down to the original number of channels.

---

#### 4. Optional skip connection

If stride = 1 and `C_in == C_out`, you can add:

```python
output = input + output_projected
print(output.shape)  # [1, 4, 8, 8]
```

---

#### ✅ Summary of shapes

| Stage      | Operation | Input shape   | Output shape  | Parameters (weights only, no bias or BN) |
| ---------- | --------- | ------------- | ------------- | ---------------------------------------- |
| Expansion  | 1×1 conv  | [1, 4, 8, 8]  | [1, 24, 8, 8] | 4×24×1×1 = 96                            |
| Depthwise  | 3×3 conv  | [1, 24, 8, 8] | [1, 24, 8, 8] | 24×3×3 = 216                             |
| Projection | 1×1 conv  | [1, 24, 8, 8] | [1, 4, 8, 8]  | 24×4×1×1 = 96                            |
| **Total**  |           |               |               | **408 params**                           |
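
The counts in the table can be checked against PyTorch directly. The sketch below rebuilds the three convolutions with `bias=False`, which is an assumption made here so that only the convolution weights are counted (no biases, no BatchNorm parameters). The helper `n_params` is a name introduced for this example.

```python
import torch
from torch import nn

c_in, c_out, t = 4, 4, 6
c_mid = t * c_in  # 24

# bias=False so that only convolution weights are counted.
expand = nn.Conv2d(c_in, c_mid, kernel_size=1, bias=False)
depthwise = nn.Conv2d(c_mid, c_mid, kernel_size=3, padding=1,
                      groups=c_mid, bias=False)
project = nn.Conv2d(c_mid, c_out, kernel_size=1, bias=False)


def n_params(module):
    """Total number of learnable scalars in a module."""
    return sum(p.numel() for p in module.parameters())


counts = {name: n_params(m) for name, m in
          [("expansion", expand), ("depthwise", depthwise), ("projection", project)]}
total = sum(counts.values())
print(counts, total)  # 96 + 216 + 96 = 408
```

With the default `bias=True`, each layer would add one extra parameter per output channel, so the totals would be slightly higher.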

---




## 3. Parameter Comparison

Let’s roughly compare MBConv vs Depthwise-Separable Conv for the same example.

| Layer type                         | Parameters (weights only)                                                   |
| ---------------------------------- | --------------------------------------------------------------------------- |
| Expansion 1×1 conv                 | $1 \times 1 \times 4 \times 24 = 96$                                        |
| Depthwise 3×3 conv                 | $3 \times 3 \times 24 = 216$                                                |
| Projection 1×1 conv                | $1 \times 1 \times 24 \times 4 = 96$                                        |
| **Total MBConv**                   | **408**                                                                     |
| Depthwise-separable (no expansion) | $3 \times 3 \times 4 + 1 \times 1 \times 4 \times 4 = 36 + 16 = 52$         |
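
The same arithmetic can be written as two small functions, which makes it easy to try other channel counts. This is a sketch that counts convolution weights only. The function names `mbconv_params` and `dw_separable_params` are introduced here for illustration.

```python
def mbconv_params(c_in, c_out, t=6, k=3):
    """Weights in expand (1x1) + depthwise (kxk) + project (1x1)."""
    c_mid = t * c_in
    expand = c_in * c_mid        # 1x1 conv: c_in -> c_mid
    depthwise = k * k * c_mid    # one kxk filter per expanded channel
    project = c_mid * c_out      # 1x1 conv: c_mid -> c_out
    return expand + depthwise + project


def dw_separable_params(c_in, c_out, k=3):
    """Weights in depthwise (kxk) + pointwise (1x1), no expansion."""
    return k * k * c_in + c_in * c_out


mb = mbconv_params(4, 4)
ds = dw_separable_params(4, 4)
print(mb, ds, round(mb / ds, 1))  # 408 52 7.8
```

For this example MBConv uses roughly 7.8 times as many weights as the plain depthwise-separable layer, which is the price of the wider intermediate representation.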



## **Inverted Residual Structure vs Encoder–Decoder**


#### 1. Usual Intuition (VAE, U-Net)

In encoder–decoder architectures such as VAEs and U-Nets:

* We **shrink** (downsample or reduce channels) to form a **compact representation** — a *latent code*.
* Then we **expand** back (via upsampling or transposed conv) to reconstruct or segment.

The idea is to **compress information**, forcing the network to **learn meaningful global features**.

That’s useful when your goal is *generation* or *reconstruction*, i.e., turning an input into something larger or richer (like an image output).

---

#### 2. In MobileNetV2 — the Goal is Different

MobileNetV2 is **not a generative model**, but a **feature extractor for classification or detection**.

Its purpose is not to compress the entire image into a low-dimensional latent space,
but rather to **transform and refine features efficiently** while keeping information flow stable.

So the expansion–shrink operation is **not an encoder–decoder**, but a **local feature transformation trick** that balances *expressiveness* and *efficiency*.

---

#### 3. The Core Idea — “Inverted Residual”

Let’s recall a **residual block** from ResNet:

$$
y = F(x) + x
$$

Here the input $x$ is usually **wide** (many channels), so running a 3×3 convolution on it directly would be expensive.
ResNet's bottleneck block therefore **compresses (1×1 conv)** → **processes (3×3 conv)** → **expands (1×1 conv)**, and the shortcut connects the wide tensors.

**MobileNetV2 does the opposite:**

$$
\text{Expand (1×1)} \rightarrow \text{Depthwise (3×3)} \rightarrow \text{Project (1×1)}
$$

and keeps the *skip connection* between the **narrow (bottleneck)** tensors.

That’s why it’s called **“inverted residual”** — the *shortcut* connects narrow layers, not wide ones.

---

#### 4. Why Expand First?

Because **depthwise convolutions** (used for efficiency) operate *independently* on each channel —
they can’t mix information across channels.

If the input has few channels, the depthwise conv has *very limited capacity*.
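
The lack of channel mixing is easy to observe. In the sketch below (a toy setup with 4 channels, chosen only for illustration), changing input channel 0 changes output channel 0 and leaves every other output channel untouched.

```python
import torch
from torch import nn

torch.manual_seed(0)
channels = 4
# groups=channels makes this a depthwise convolution.
depthwise = nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                      groups=channels, bias=False)

x = torch.randn(1, channels, 8, 8)
x_perturbed = x.clone()
x_perturbed[:, 0] += 1.0  # modify input channel 0 only

with torch.no_grad():
    # Largest absolute change per output channel.
    diff = (depthwise(x_perturbed) - depthwise(x)).abs().amax(dim=(0, 2, 3))

changed = (diff > 0).tolist()
print(changed)  # only the first entry is True
```

A regular convolution (`groups=1`) would show a change in every output channel, because each output filter reads all input channels.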

So the block does three things:

1. **Expand (1×1 conv)** — project to a higher-dimensional space (e.g., ×6 wider).
   This allows richer combinations of features.
2. **Depthwise Conv (3×3)** — spatial filtering per channel (cheap, local spatial context).
3. **Project back (1×1 conv)** — compress to fewer channels again to save memory and computation.

In short, expansion gives the non-linear transformations **a wider feature space** to work in.

---

#### 5. Why Shrink Again?

The projection (shrink) step serves two purposes:

* **Efficiency**: reduces the number of channels to keep the next layer lightweight.
* **Linear bottleneck**: the projection back to a smaller space is done *without a non-linearity*, because applying ReLU6 to a low-dimensional tensor zeroes out negative values (and clips values above 6), destroying information that later layers cannot recover.

Hence the name **Linear Bottleneck** — linear projection back to compact form.
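
A toy example can make the information-loss argument concrete. The sketch below is an illustration constructed for this note, not an experiment from the MobileNetV2 paper: it places 16 points on the unit circle and counts how many are mapped to the all-zero vector by a ReLU, first in 2 dimensions and then after a fixed expansion to 8 dimensions. The names `collapsed_points`, `w_narrow` and `w_wide` are hypothetical.

```python
import math
import torch


def collapsed_points(x, w):
    """Count rows of relu(x @ w) that are entirely zero (input no longer distinguishable)."""
    y = torch.relu(x @ w)
    return int((y == 0).all(dim=1).sum())


# 16 points on the unit circle, offset so that none lies on a coordinate axis.
angles = (torch.arange(16, dtype=torch.float64) + 0.5) * (2 * math.pi / 16)
points = torch.stack([angles.cos(), angles.sin()], dim=1)   # [16, 2]

# Narrow case: stay in 2 dimensions (identity projection).
w_narrow = torch.eye(2, dtype=torch.float64)

# Wide case: expand to 8 dimensions using 8 evenly spaced directions.
dirs = torch.arange(8, dtype=torch.float64) * (2 * math.pi / 8)
w_wide = torch.stack([dirs.cos(), dirs.sin()], dim=0)        # [2, 8]

narrow_lost = collapsed_points(points, w_narrow)
wide_lost = collapsed_points(points, w_wide)
print(narrow_lost, wide_lost)  # 4 0
```

In 2 dimensions every point with two negative coordinates collapses to zero, while in the expanded space each point keeps at least one positive coordinate. This is the intuition for applying ReLU6 only in the wide part of the block and keeping the narrow projection linear.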

---

#### 6. Intuitive Analogy

Think of it like working in **a higher-dimensional workspace to manipulate data more flexibly**, then compressing it back.

* In low dimension (few channels): limited capacity to separate or combine patterns.
* In high dimension (expanded): easier to apply non-linear transforms (like ReLU) without losing structure.
* After processing: you bring it back down to a compact form to continue efficiently.

It’s loosely similar to the feed-forward layers in transformers or the kernel trick in SVMs:

> Project into a higher-dimensional feature space → perform simple operations → project back.

---



#### 7. Why It Works So Well

* Keeps **parameter count low** (thanks to depthwise + projection).
* Maintains **gradient flow** through the narrow residual path.
* Enables **non-linear expressiveness** in expanded space.
* Prevents **information loss** with the linear bottleneck.

---



## **Numerical Example**
Let’s go through a **tiny tensor** step by step to see exactly how **MobileNetV2’s inverted residual block** works.

We’ll simulate the shapes and operations — no need for an actual image file — so you can *visualize* what happens numerically.

---

#### 1. Setup

Let’s say we have an **input tensor** from some previous layer:

$$
X \in \mathbb{R}^{4 \times 4 \times 8}
$$

That is:

* Height = 4
* Width = 4
* Channels = 8 (a “narrow” representation)

We’ll use:

* **Expansion factor** $t = 6$
* **Output channels** $c_{\text{out}} = 8$
* **Stride = 1**

---

#### 2. Step 1 — Expansion (1×1 convolution)

We apply a **1×1 convolution** to increase the number of channels by the factor $t = 6$:

$$
\text{Conv}_{1\times1}^{expand}: 8 \rightarrow 48
$$

So the tensor becomes:

$$
X_{expand} \in \mathbb{R}^{4 \times 4 \times 48}
$$

Interpretation:

* Each pixel location (4×4 = 16 total) now has **48 features** instead of 8.
* This gives the model more “room” to apply non-linear transformations (ReLU6).

---

#### 3. Step 2 — Depthwise Convolution (3×3)

Next, we apply a **depthwise 3×3 convolution**, one filter per channel:

$$
\text{Conv}_{3\times3}^{dw}: 48 \text{ filters, each on 1 channel}
$$

Output stays the same size (since stride=1, padding=1):

$$
X_{dw} \in \mathbb{R}^{4 \times 4 \times 48}
$$

Key point:

* Each channel is filtered *independently*, capturing spatial context (edges, corners, etc.)
* No mixing between channels yet.

---

#### 4. Step 3 — Projection (1×1 convolution)

Now we use another **1×1 convolution** to compress channels back:

$$
\text{Conv}_{1\times1}^{project}: 48 \rightarrow 8
$$

So the tensor becomes:

$$
X_{proj} \in \mathbb{R}^{4 \times 4 \times 8}
$$

This projection is **linear** (no activation).
It re-combines the 48 feature maps into a compact 8-channel representation.

---

#### 5. Step 4 — Residual Connection

Since stride=1 and the input/output channel counts are the same (8),
we can add the residual connection:

$$
Y = X_{proj} + X
$$

The output has the same shape as the input:

$$
Y \in \mathbb{R}^{4 \times 4 \times 8}
$$
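
Steps 1 to 4 can be traced with explicit weight tensors, which also shows the shape of each filter bank. This is a sketch under simplifying assumptions: random weights stand in for trained ones, BatchNorm is omitted, and PyTorch's channels-first layout `[B, C, H, W]` is used, so the $4 \times 4 \times 8$ tensor appears as `[1, 8, 4, 4]`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 8, 4, 4)            # [B, C, H, W]: 8 channels, 4x4 spatial

# Weight layout is [out_channels, in_channels / groups, kH, kW].
w_expand = torch.randn(48, 8, 1, 1)    # 1x1 conv: 8 -> 48
w_dw = torch.randn(48, 1, 3, 3)        # one 3x3 filter per channel
w_project = torch.randn(8, 48, 1, 1)   # 1x1 conv: 48 -> 8

x_expand = F.relu6(F.conv2d(x, w_expand))
x_dw = F.relu6(F.conv2d(x_expand, w_dw, padding=1, groups=48))
x_proj = F.conv2d(x_dw, w_project)     # linear projection: no activation
y = x_proj + x                         # residual add (shapes match)

for name, tensor in [("expand", x_expand), ("depthwise", x_dw),
                     ("project", x_proj), ("output", y)]:
    print(name, tuple(tensor.shape))
```

The depthwise weight has a second dimension of 1 because each filter sees a single input channel, whereas the two 1×1 weights span all input channels.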

---

#### 6. What Happened Intuitively

| Step         | Operation | Channels | Spatial | Comment                |
| ------------ | --------- | -------- | ------- | ---------------------- |
| Input        | —         | 8        | 4×4     | narrow, compressed     |
| Expand       | 1×1 Conv  | 48       | 4×4     | rich, high-dimensional |
| Depthwise    | 3×3 Conv  | 48       | 4×4     | local spatial mixing   |
| Project      | 1×1 Conv  | 8        | 4×4     | compact again          |
| Residual Add | +         | 8        | 4×4     | information shortcut   |

So the block **transforms information in a high-dimensional space** (48 channels),
but **stores and connects through a low-dimensional bottleneck** (8 channels).

---

#### 7. Why It’s “Inverted Residual”

In **ResNet**, you’d have:

```
Input (wide)
 → 1×1 conv (reduce)
 → 3×3 conv
 → 1×1 conv (expand)
 → Add
```

In **MobileNetV2**, you invert that:

```
Input (narrow)
 → 1×1 conv (expand)
 → 3×3 depthwise conv
 → 1×1 conv (project)
 → Add
```

Hence: **“Inverted residual”**.
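
The two layouts can be compared by tracking the channel width after each layer. The sketch below is illustrative only: the ResNet widths (256 and 64) follow the common bottleneck design with a 4× reduction, the MobileNetV2 widths reuse the 8 → 48 → 8 example above, and the helpers `conv_bn` and `widths` are names introduced here. The residual additions themselves are left out, since only the widths matter for the comparison.

```python
import torch
from torch import nn


def conv_bn(c_in, c_out, k, groups=1, act=None):
    """Conv + BatchNorm, optionally followed by an activation."""
    layers = [nn.Conv2d(c_in, c_out, k, padding=k // 2, groups=groups, bias=False),
              nn.BatchNorm2d(c_out)]
    if act is not None:
        layers.append(act)
    return nn.Sequential(*layers)


# ResNet-style bottleneck body: wide -> narrow -> narrow -> wide
resnet_body = nn.Sequential(
    conv_bn(256, 64, 1, act=nn.ReLU()),
    conv_bn(64, 64, 3, act=nn.ReLU()),
    conv_bn(64, 256, 1),
)

# Inverted residual body: narrow -> wide -> wide -> narrow
inverted_body = nn.Sequential(
    conv_bn(8, 48, 1, act=nn.ReLU6()),
    conv_bn(48, 48, 3, groups=48, act=nn.ReLU6()),  # depthwise
    conv_bn(48, 8, 1),                              # linear projection
)


def widths(body, x):
    """Channel count of the input and of the output of each stage."""
    out = [x.shape[1]]
    for layer in body:
        x = layer(x)
        out.append(x.shape[1])
    return out


resnet_widths = widths(resnet_body, torch.randn(2, 256, 4, 4))
inverted_widths = widths(inverted_body, torch.randn(2, 8, 4, 4))
print(resnet_widths)    # [256, 64, 64, 256]
print(inverted_widths)  # [8, 48, 48, 8]
```

In both cases the shortcut joins the first and last entries of the list: the wide 256-channel tensors in ResNet, and the narrow 8-channel tensors in MobileNetV2.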

---

#### 8. Optional: Visual Intuition (Grid)

```
Before expansion:
  4×4 pixels × 8 channels  → compact info

After expansion:
  4×4 pixels × 48 channels → high-dimensional workspace

After projection:
  4×4 pixels × 8 channels  → compact again, ready for next block
```

You can think of it as a **temporary workspace explosion** —
you blow up the feature space, do useful operations in it,
and then compress it back efficiently.

---

