## **Fused Mobile Inverted Bottleneck Convolution**

**Fused-MBConv** (short for **Fused Mobile Inverted Bottleneck Convolution**) is a variant of the **MBConv** block used in **EfficientNetV2**, designed to make training faster and inference more efficient on modern accelerators (like GPUs and TPUs).

Let’s unpack it step by step.

---

## 1. Background: MBConv (from EfficientNet / MobileNetV2)

The original **MBConv** (Mobile Inverted Bottleneck Convolution) block follows this sequence:

1. **Expand (1×1 conv)** – increases the number of channels
   $$C_{in} \rightarrow C_{in} \times t$$
   where $t$ is the expansion factor.

2. **Depthwise convolution (k×k)** – applies one convolution per channel
   (preserves the number of channels).

3. **Squeeze-and-Excitation (optional)** – channel attention mechanism.

4. **Project (1×1 conv)** – reduces channels back to output dimension
   $$C_{in} \times t \rightarrow C_{out}$$
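
The four steps above can be sketched as a small PyTorch module. This is a minimal illustration rather than a reference implementation: the class name `MBConvSketch` and its argument names are made up for this example, Squeeze-and-Excitation is left out for brevity, and SiLU is assumed as the activation (as in EfficientNet; MobileNetV2 uses ReLU6).

```python
import torch
import torch.nn as nn


class MBConvSketch(nn.Module):
    """Illustrative MBConv block (hypothetical name; SE step omitted)."""

    def __init__(self, c_in, c_out, t=4, k=3, stride=1):
        super().__init__()
        hidden = c_in * t
        # Skip connection only when input and output shapes match.
        self.use_skip = stride == 1 and c_in == c_out
        self.block = nn.Sequential(
            # 1. Expand: 1x1 conv, C_in -> C_in * t
            nn.Conv2d(c_in, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.SiLU(),
            # 2. Depthwise: groups=hidden gives one k x k filter per channel
            nn.Conv2d(hidden, hidden, k, stride, padding=k // 2,
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.SiLU(),
            # 3. (Squeeze-and-Excitation would go here)
            # 4. Project: 1x1 conv, C_in * t -> C_out, no activation
            nn.Conv2d(hidden, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x):
        out = self.block(x)
        return out + x if self.use_skip else out


x = torch.randn(1, 24, 56, 56)
mbconv = MBConvSketch(24, 24, t=4)
print(mbconv(x).shape)
```

Setting `groups` equal to the channel count is what turns an ordinary `nn.Conv2d` into a depthwise convolution.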

---

## 2. The Problem

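While MBConv is *efficient* in parameters and FLOPs (which suits mobile CPUs, where depthwise convs are cheap), it often performs *poorly on GPUs and TPUs*. Depthwise convs do little arithmetic per byte of memory they touch, so they are limited by **memory access overhead**, and the **non-fused operations** (a separate 1×1 conv followed by a depthwise conv) read and write the expanded feature map an extra time.

So, **EfficientNetV2** adopts **Fused-MBConv** to speed up training and make the network more accelerator-friendly.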
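
One way to probe this claim on your own hardware is a rough micro-benchmark that times the unfused pair (1×1 conv followed by a depthwise 3×3 conv) against a single regular 3×3 conv with the same input and output shapes. The sketch below is only illustrative: the helper `time_forward`, the tensor sizes, and the iteration counts are arbitrary choices for this example. The numbers it prints depend heavily on the device, and on a CPU the unfused pair may well come out ahead.

```python
import time

import torch
import torch.nn as nn


def time_forward(module, x, warmup=3, iters=10):
    """Return the mean forward-pass time in milliseconds (hypothetical helper)."""
    module.eval()
    with torch.no_grad():
        for _ in range(warmup):
            module(x)
        if x.is_cuda:
            torch.cuda.synchronize()  # wait for queued GPU work before timing
        start = time.perf_counter()
        for _ in range(iters):
            module(x)
        if x.is_cuda:
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3


device = "cuda" if torch.cuda.is_available() else "cpu"
c_in, t = 24, 4
hidden = c_in * t
x = torch.randn(8, c_in, 56, 56, device=device)

# MBConv-style: 1x1 expansion followed by a depthwise 3x3 conv.
unfused = nn.Sequential(
    nn.Conv2d(c_in, hidden, 1, bias=False),
    nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
).to(device)

# Fused-MBConv-style: one regular 3x3 conv that also expands the channels.
fused = nn.Conv2d(c_in, hidden, 3, padding=1, bias=False).to(device)

unfused_ms = time_forward(unfused, x)
fused_ms = time_forward(fused, x)
print(f"1x1 + depthwise 3x3: {unfused_ms:.3f} ms")
print(f"fused 3x3:           {fused_ms:.3f} ms")
```

For more reliable measurements, use larger batches and more iterations, and include the normalization and activation layers of the full blocks.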

---

## 3. Fused-MBConv: The Idea

In **Fused-MBConv**, the **expansion 1×1 conv** and the **depthwise k×k conv** are **merged (fused)** into a single **k×k convolution**.

This single convolution both **expands** the channels and **extracts spatial features** in one step.

### Block structure:

| Stage | MBConv                             | Fused-MBConv                           |
| ----- | ---------------------------------- | -------------------------------------- |
| 1     | 1×1 conv (expand)                  | 3×3 conv (expand + spatial conv fused) |
| 2     | Depthwise 3×3 conv                 | — (already fused)                      |
| 3     | Squeeze-and-Excitation (optional)  | Squeeze-and-Excitation (optional)      |
| 4     | 1×1 conv (project)                 | 1×1 conv (project)                     |
| 5     | Skip connection (if shape matches) | Skip connection (if shape matches)     |

---

## 4. Block Diagram

```
Input
 │
 ├── Conv3x3 + BN + SiLU   ← fused expand + depthwise
 │
 ├── (Optional SE)
 │
 ├── Conv1x1 + BN
 │
 └── Skip connection (if same shape)
Output
```

By comparison, the classic MBConv had *two separate convolutions* before projection.

---

## 5. Mathematical View

In MBConv:

$$
Y = P(\text{DW}(E(X)))
$$

where:

* $E$ = 1×1 expansion conv
* $\text{DW}$ = depthwise conv
* $P$ = 1×1 projection conv

In Fused-MBConv, we replace $E$ and $\text{DW}$ with a single fused conv $F$:

$$
Y = P(F(X))
$$

where $F$ is a k×k convolution that expands the channel dimension and captures spatial context.
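
As a rough cost comparison, the weights in each operator can be counted directly. Assume no biases, ignore normalization layers, and write $|\cdot|$ for the number of weights (a notation introduced here only for this comparison):

$$
\begin{aligned}
|E| &= C_{in} \cdot t\,C_{in} \\
|\text{DW}| &= k^2 \cdot t\,C_{in} \\
|F| &= k^2 \cdot C_{in} \cdot t\,C_{in}
\end{aligned}
\qquad\Longrightarrow\qquad
\frac{|F|}{|E| + |\text{DW}|} = \frac{k^2\, C_{in}}{C_{in} + k^2}
$$

For $k = 3$ and $C_{in} = 24$ the ratio is $216 / 33 \approx 6.5$, and it approaches $k^2 = 9$ as $C_{in}$ grows. At stride 1 the multiply-accumulate count is the weight count times the number of output positions, so the same ratio applies to the compute of this part of the block.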

---

## 6. When Each Is Used

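* **Fused-MBConv** is used in the **early stages** of EfficientNetV2
  → large feature maps, few channels → the full k×k conv adds relatively few parameters and runs faster on accelerators.

* **MBConv** is used in the **later stages**
  → many channels, small feature maps → a full k×k conv would add many parameters and FLOPs, so the depthwise design is the better trade-off.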
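
A quick calculation shows why fusing gets more expensive as channels grow. The sketch below counts weights in the expand + depthwise pair versus the single fused conv, ignoring biases and normalization. The function names and the channel widths in the loop are illustrative choices, not the actual EfficientNetV2 configuration.

```python
def unfused_weights(c_in, t, k):
    """Weights in a 1x1 expansion conv plus a depthwise k x k conv."""
    hidden = c_in * t
    return c_in * hidden + k * k * hidden


def fused_weights(c_in, t, k):
    """Weights in a single regular k x k conv that expands C_in -> C_in * t."""
    return k * k * c_in * (c_in * t)


# Illustrative channel widths, from "early stage" to "late stage".
for c_in in (24, 48, 128, 256):
    u = unfused_weights(c_in, t=4, k=3)
    f = fused_weights(c_in, t=4, k=3)
    print(f"C_in={c_in:4d}  unfused={u:9,d}  fused={f:10,d}  extra={f - u:10,d}")
```

With 24 input channels the fused conv costs about 18 thousand extra weights, while with 256 channels the extra cost is about 2 million per block, which is why the fused variant is confined to the narrow early stages.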

---

## 7. Performance Trade-off

| Property         | MBConv          | Fused-MBConv                |
| ---------------- | --------------- | --------------------------- |
| Number of layers | More            | Fewer                       |
| GPU speed        | Slower          | Faster                      |
| Memory access    | High overhead   | Reduced                     |
| FLOPs            | Lower           | Higher, but often faster    |
| Accuracy         | Similar         | Similar or slightly better  |

---

## 8. PyTorch Example

```python
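import torch
import torch.nn as nn


class SqueezeExcite(nn.Module):
    """Channel attention: rescales each channel by a learned gate in (0, 1)."""

    def __init__(self, channels, reduced):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),  # squeeze: (N, C, H, W) -> (N, C, 1, 1)
            nn.Conv2d(channels, reduced, 1),
            nn.SiLU(inplace=True),
            nn.Conv2d(reduced, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # Multiply the features by the gate instead of replacing them with it.
        return x * self.gate(x)


class FusedMBConv(nn.Module):
    def __init__(self, in_ch, out_ch, expansion=4, kernel_size=3, stride=1, se_ratio=0.25):
        super().__init__()
        hidden_dim = in_ch * expansion
        # Residual connection only when input and output shapes match.
        self.use_res_connect = (stride == 1 and in_ch == out_ch)

        layers = [
            # Fused k×k conv: expands channels and mixes spatially in one step.
            nn.Conv2d(in_ch, hidden_dim, kernel_size, stride, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(hidden_dim),
            nn.SiLU(inplace=True),
        ]

        # Optional Squeeze-and-Excitation (not always used)
        if se_ratio is not None and se_ratio > 0:
            layers.append(SqueezeExcite(hidden_dim, max(1, int(hidden_dim * se_ratio))))

        layers += [
            # 1×1 projection back to out_ch (no activation after it).
            nn.Conv2d(hidden_dim, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        ]

        self.block = nn.Sequential(*layers)

    def forward(self, x):
        out = self.block(x)
        if self.use_res_connect:
            out = out + x
        return out


x = torch.randn(1, 24, 56, 56)
block = FusedMBConv(24, 24, expansion=4)
print(block(x).shape)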
```

Output:

```
torch.Size([1, 24, 56, 56])
```

---

## 9. Summary

| Concept           | Description                                                     |
| ----------------- | --------------------------------------------------------------- |
| **Origin**        | Adopted by EfficientNetV2; proposed for EfficientNet-EdgeTPU    |
| **Motivation**    | Improve GPU training speed by fusing expansion + depthwise conv |
| **Fusion**        | Replace 1×1 + 3×3 with a single 3×3 convolution                 |
| **Used in**       | Early stages (large feature maps)                               |
| **Advantages**    | Faster, more memory-efficient on modern accelerators            |
| **Disadvantages** | More FLOPs and parameters, less efficient on CPU/mobile         |

