## **1. What EfficientNet Is**

EfficientNet is a family of convolutional neural networks (CNNs) introduced by Google in 2019, designed to achieve **high accuracy with far fewer parameters and FLOPs** than earlier models such as ResNet, Inception, or DenseNet.
It is a *scalable* architecture that balances **depth**, **width**, and **resolution**.



#### **1.1. Depth, Width, and Resolution: Core Definitions**

| Term           | Meaning                                                 | In Neural Networks                                                                             |
| -------------- | ------------------------------------------------------- | ---------------------------------------------------------------------------------------------- |
| **Depth**      | Number of layers (how *deep* the network is)            | How many convolutional or block layers are stacked sequentially (e.g. MBConv repeats).         |
| **Width**      | Number of channels per layer (how *wide* each layer is) | The number of feature maps (filters) in each convolution — controls representational capacity. |
| **Resolution** | Spatial size of input/output feature maps               | Height × Width of the image or intermediate feature maps.                                      |

In short:

* **Depth →** number of layers.
* **Width →** number of channels (feature maps).
* **Resolution →** spatial size (H × W).


---

## **2. Why EfficientNet Was Created**

Traditional model-scaling methods (just making the network deeper or wider or feeding larger images) improve accuracy but quickly lead to inefficiency.
EfficientNet uses a **principled scaling method** to get the most accuracy per computation.

---

## **3. Two Key Ideas**

#### A. **EfficientNet-B0 (the baseline)**

* They searched for a small but powerful baseline network using **neural architecture search (NAS)**.
* This gave a mobile-friendly architecture with compound building blocks (MBConv, similar to MobileNetV2).

#### B. **Compound Scaling**

* Instead of scaling depth, width, or input resolution arbitrarily, EfficientNet grows all three together using fixed **scaling coefficients**, so that compute increases predictably.

Formally, if you want to scale up EfficientNet:

* Depth → $d = \alpha^\phi$
* Width → $w = \beta^\phi$
* Resolution → $r = \gamma^\phi$

Subject to:

$$
\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2
$$

(where $\phi$ is a user-chosen scaling factor indicating how much more compute you want to spend).
This yields EfficientNet-B1 … B7 (each larger/more accurate than the last).

---


## **4. The Rule**

Choose constants $\alpha,\beta,\gamma>1$ and a user knob $\phi \in \{0,1,2,\dots\}$.

* Depth: $d=\alpha^{\phi}$
* Width (channels): $w=\beta^{\phi}$
* Resolution (image size): $r=\gamma^{\phi}$

Conv cost scales roughly as $\text{FLOPs} \propto d \cdot w^2 \cdot r^2$.
So if we enforce
$$
\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2,
$$

then each time you increase $\phi$ by 1, **FLOPs ≈ double**.
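
The $d \cdot w^2 \cdot r^2$ form can be motivated with a back-of-the-envelope count for a standard convolution. This is only a sketch: it assumes dense ($1 \times 1$ or $k \times k$) convolutions dominate the cost, whereas depthwise convolutions scale linearly with width, so the relation is approximate. The symbols $L$ (number of layers), $H$, $W$, $k$, $C_{in}$ and $C_{out}$ are introduced here for illustration.

$$
\text{FLOPs}_{\text{base}} \approx L \cdot H \cdot W \cdot k^2 \cdot C_{in} \cdot C_{out}
$$

$$
\text{FLOPs}_{\text{scaled}} \approx (dL) \cdot (rH)(rW) \cdot k^2 \cdot (wC_{in})(wC_{out}) = d \cdot w^2 \cdot r^2 \cdot \text{FLOPs}_{\text{base}}
$$

$$
d \cdot w^2 \cdot r^2 = \left(\alpha \cdot \beta^2 \cdot \gamma^2\right)^{\phi} \approx 2^{\phi}
$$

Width appears squared because it scales both input and output channels, and resolution appears squared because it scales both height and width.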

A commonly cited set (close to the original paper):
$\alpha=1.2,\; \beta=1.1,\; \gamma=1.15$ → $\alpha\beta^2\gamma^2 \approx 1.92 \approx 2$.

Below are concrete numbers using these.

---

#### **4.1 Numerical examples (starting from a baseline B0)**

Assume baseline depth/width/resolution are all “1×” (e.g., 224×224 input).

| $\phi$ | $d=\alpha^\phi$ | $w=\beta^\phi$ | $r=\gamma^\phi$ |  New input (≈ $224\cdot r$) | FLOPs scale $d \cdot w^2 \cdot r^2$ |
| -----: | --------------: | -------------: | --------------: | --------------------------: | ----------------------: |
|      0 |           1.000 |          1.000 |           1.000 |                         224 |                   1.00× |
|      1 |           1.200 |          1.100 |           1.150 |     **≈258** (round to 256) |              **≈1.92×** |
|      2 |           1.440 |          1.210 |           1.322 | **≈296** (round to 296/288) |              **≈3.69×** |
|      3 |           1.728 |          1.331 |           1.521 |     **≈341** (round to 336) |              **≈7.08×** |

Interpretation:

* At $\phi=1$: make the net ~20% deeper, ~10% wider, feed ~15% larger images → ~1.9× compute.
* At $\phi=2$: apply those multipliers again → ~3.7× compute vs. baseline.
* At $\phi=3$: ~7.1× compute vs. baseline.

*(In practice you round image sizes to multiples of 8/16 and channels to hardware-friendly sizes.)*
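
The rows of the table above can be reproduced with a few lines of Python. This is a minimal sketch using the constants quoted in the text; `compound_scale` is a hypothetical helper name, not part of any library.

```python
# Constants quoted in the text; the baseline input size is 224 px.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15
BASE_RESOLUTION = 224


def compound_scale(phi):
    """Return (depth, width, resolution, FLOPs) multipliers for a given phi."""
    d = ALPHA ** phi
    w = BETA ** phi
    r = GAMMA ** phi
    flops = d * w**2 * r**2  # FLOPs scale roughly as d * w^2 * r^2
    return d, w, r, flops


for phi in range(4):
    d, w, r, flops = compound_scale(phi)
    print(f"phi={phi}: d={d:.3f} w={w:.3f} r={r:.3f} "
          f"input~{BASE_RESOLUTION * r:.0f}px FLOPs~{flops:.2f}x")
```

Because the FLOPs multiplier is $(\alpha\beta^2\gamma^2)^\phi$, each step multiplies compute by about 1.92 rather than exactly 2.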

---

#### Mini “what-if” examples

1. **Channels and layers**
   Baseline has 32 channels and 10 layers. With $\phi=2$:

* Width: $32 \cdot \beta^2 \approx 32 \cdot 1.21 \approx 39$ → round to 40.
* Depth: $10 \cdot \alpha^2 = 10 \cdot 1.44 = 14.4$ → ~14–15 layers.
* Resolution: $224 \cdot \gamma^2 \approx 224 \cdot 1.322 \approx 296$ → round to 288/296.

2. **Compute sanity check**
   Jumping from $\phi=1$ to $\phi=3$ multiplies FLOPs by $\approx 7.08/1.92 \approx 3.7$.
   That’s consistent because each +1 in $\phi$ nearly doubles compute.
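
The rounding in the first what-if example can be checked numerically. The sketch below assumes channels and resolution are rounded to the nearest multiple of 8 and that layer counts are rounded up; `round_to_multiple` is a hypothetical helper written for this illustration.

```python
import math


def round_to_multiple(value, multiple=8):
    """Round to the nearest multiple (hypothetical helper, not a library call)."""
    return max(multiple, int(value / multiple + 0.5) * multiple)


alpha, beta, gamma, phi = 1.2, 1.1, 1.15, 2

channels = round_to_multiple(32 * beta**phi)      # 38.72  -> 40
layers = math.ceil(10 * alpha**phi)               # 14.4   -> 15
resolution = round_to_multiple(224 * gamma**phi)  # 296.24 -> 296

print(channels, layers, resolution)
```

Rounding the layer count up rather than down is a design choice in this sketch; either 14 or 15 is consistent with the estimate in the text.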

---

### Takeaways

* The constraint $\alpha \beta^2 \gamma^2 \approx 2$ ensures **predictable ~2× compute per step**.
* Scaling **all three** (depth, width, resolution) is more *accuracy-efficient* than scaling any single dimension alone.
* Real models round/tune sizes, but the compound law is the guiding principle.


## **5. EfficientNet Architecture**
Let’s go through the **EfficientNet** family (B0–B7) step by step, including:

1. Core architectural **principles**
2. **Building blocks** (MBConv, SE, etc.)
3. **Differences across B0–B7**
4. Discussion of **CBAM** (and where it can be integrated)

---


### **5.1. Building Blocks**

#### **(a) MBConv (Mobile Inverted Bottleneck Convolution)**

The **MBConv** block comes from **MobileNetV2**.
It uses a **depthwise separable convolution** + **inverted residuals**:

```
Input → 1x1 Expansion Conv → 3x3 Depthwise Conv → SE → 1x1 Projection Conv → Output
```

* **Expansion ratio (t):** how much to expand channels before depthwise conv
* **Kernel size:** 3×3 or 5×5 depending on stage
* **Skip connection:** only if stride=1 and input/output have same channels

#### **(b) SE (Squeeze-and-Excitation)**

Integrated inside each MBConv.
It recalibrates channel importance:

1. Global average pool (squeeze)
2. Two FC layers (reduce → expand)
3. Multiply sigmoid output with feature map

Formally:
$$
\mathbf{z}_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{c,i,j}
$$
$$
\mathbf{s} = \sigma(W_2 \cdot \delta(W_1 \cdot \mathbf{z}))
$$
$$
\mathbf{y}_c = \mathbf{s}_c \cdot X_c
$$

The SE ratio is typically **0.25** (i.e., reduction by 4×).
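
The three SE equations map onto a small PyTorch module. This is a minimal sketch rather than the reference implementation: the class name `SqueezeExcite` is hypothetical, the two FC layers are written as 1×1 convolutions (equivalent on a pooled 1×1 map), and $\delta$ is assumed to be SiLU (Swish).

```python
import torch
import torch.nn as nn


class SqueezeExcite(nn.Module):
    """Minimal SE sketch: squeeze (GAP) -> reduce -> expand -> sigmoid gate."""

    def __init__(self, channels, se_ratio=0.25):
        super().__init__()
        reduced = max(1, int(channels * se_ratio))
        self.pool = nn.AdaptiveAvgPool2d(1)                        # z_c: global average pool
        self.reduce = nn.Conv2d(channels, reduced, kernel_size=1)  # W_1 (reduce)
        self.act = nn.SiLU()                                       # delta (assumed SiLU)
        self.expand = nn.Conv2d(reduced, channels, kernel_size=1)  # W_2 (expand)
        self.gate = nn.Sigmoid()                                   # sigma

    def forward(self, x):
        s = self.gate(self.expand(self.act(self.reduce(self.pool(x)))))
        return x * s  # y_c = s_c * X_c, broadcast over H and W


x = torch.randn(2, 32, 14, 14)
y = SqueezeExcite(32)(x)
print(y.shape)  # torch.Size([2, 32, 14, 14])
```

The gate has shape `(N, C, 1, 1)`, so every spatial position of a channel is scaled by the same value in (0, 1).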

#### **(c) CBAM (Convolutional Block Attention Module)**

CBAM is **not** part of original EfficientNet, but sometimes researchers add it to enhance attention.
It has:

1. **Channel Attention (CA)** → similar to SE
2. **Spatial Attention (SA)** → a 7×7 convolution over channel-aggregated map

So CBAM = Channel Attention → Spatial Attention.
If inserted, it’s usually placed after the SE block (or replacing SE).

---

## 5.2. EfficientNet-B0 Architecture

EfficientNet-B0 is the base model; others (B1–B7) scale it up.
Below is the canonical configuration:

| Stage | Operator | Input Resolution | Output Channels | Layers | Expansion | Kernel | SE | Stride |
|:------|:----------|:------------|:----------|:--------|:-----------|:--------|:------|:-------|
| Stem | Conv3x3 | 224x224 | 32 | 1 | – | 3x3 | – | 2 |
| 1 | MBConv1 | 112x112 | 16 | 1 | 1 | 3x3 | ✓ | 1 |
| 2 | MBConv6 | 112x112 | 24 | 2 | 6 | 3x3 | ✓ | 2 |
| 3 | MBConv6 | 56x56 | 40 | 2 | 6 | 5x5 | ✓ | 2 |
| 4 | MBConv6 | 28x28 | 80 | 3 | 6 | 3x3 | ✓ | 2 |
| 5 | MBConv6 | 14x14 | 112 | 3 | 6 | 5x5 | ✓ | 1 |
| 6 | MBConv6 | 14x14 | 192 | 4 | 6 | 5x5 | ✓ | 2 |
| 7 | MBConv6 | 7x7 | 320 | 1 | 6 | 3x3 | ✓ | 1 |
| Head | Conv1x1 + Pool + FC | 7x7 | 1280 | 1 | – | 1x1 | – | – |

Total: **7 MBConv stages** (16 MBConv blocks counting repeats) + 1 stem + 1 head.
All MBConv blocks have **SE** inside.

---

## 5.3. Scaling to B1–B7

Scaling increases:

* **Resolution** (input size)
* **Depth** (more MBConv repeats)
* **Width** (more channels)

| Model | Input (px) | Depth Mult | Width Mult | #Params (M) | #MBConv Blocks |
| :---- | :--------- | :--------- | :--------- | :---------- | :------------- |
| B0    | 224        | 1.0        | 1.0        | 5.3         | 16             |
| B1    | 240        | 1.1        | 1.0        | 7.8         | 23             |
| B2    | 260        | 1.2        | 1.1        | 9.2         | 23             |
| B3    | 300        | 1.4        | 1.2        | 12.0        | 26             |
| B4    | 380        | 1.8        | 1.4        | 19.0        | 32             |
| B5    | 456        | 2.2        | 1.6        | 30.0        | 39             |
| B6    | 528        | 2.6        | 1.8        | 43.0        | 45             |
| B7    | 600        | 3.1        | 2.0        | 66.0        | 55             |

Every model uses the same **MBConv + SE** layout; only the scaling changes.
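
The number of MBConv blocks per model follows from the depth multiplier and the B0 repeats `[1, 2, 2, 3, 3, 4, 1]`. The sketch below assumes each stage's repeat count is rounded up (ceil), which is the rounding used in the reference TensorFlow implementation as far as I know; `scaled_repeats` is a hypothetical helper name.

```python
import math

B0_REPEATS = [1, 2, 2, 3, 3, 4, 1]
DEPTH_MULT = {"B0": 1.0, "B1": 1.1, "B2": 1.2, "B3": 1.4,
              "B4": 1.8, "B5": 2.2, "B6": 2.6, "B7": 3.1}


def scaled_repeats(depth_mult):
    """Scale the per-stage repeats, rounding each stage up (assumed rule)."""
    return [math.ceil(depth_mult * n) for n in B0_REPEATS]


for name, mult in DEPTH_MULT.items():
    repeats = scaled_repeats(mult)
    print(name, repeats, "total blocks:", sum(repeats))
```

Because of the ceil, B1 and B2 end up with the same block count even though their depth multipliers differ; they still differ in width and input resolution.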

---

## 5.4. EfficientNet + CBAM (modified variants)

When **CBAM** is added:

* Replace SE with CBAM, or
* Add CBAM after SE (Channel Attention → Spatial Attention).

Some reported results suggest that CBAM can improve accuracy by refining spatial focus, at the cost of some extra computation; the gain depends on the task and dataset.

Example (pseudo-layout for one block):

```
x = MBConv(...)
x = SE(x)
x = CBAM(x)
```

This yields “EfficientNet-CBAM” (seen in some research papers or GitHub repos).
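
For reference, a CBAM module can be sketched in PyTorch as channel attention followed by spatial attention. This is an illustrative sketch, not code from any EfficientNet-CBAM repository: the class names are hypothetical, and the reduction factor of 16 and the 7×7 spatial kernel are assumed defaults.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        hidden = max(1, channels // reduction)
        # Shared MLP applied to both the avg-pooled and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        return x * torch.sigmoid(avg + mx)


class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # Aggregate over channels, then learn a per-pixel gate.
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))


class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.channel = ChannelAttention(channels, reduction)
        self.spatial = SpatialAttention(kernel_size)

    def forward(self, x):
        return self.spatial(self.channel(x))  # channel first, then spatial


x = torch.randn(2, 40, 28, 28)
y = CBAM(40)(x)
print(y.shape)  # torch.Size([2, 40, 28, 28])
```

The module preserves the input shape, so it can be dropped in after (or instead of) an SE block without changing the surrounding layers.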

---

## 5.5. Summary of the Pipeline

For any EfficientNet-Bx:

```
Input
↓
Conv3x3 (Stem)
↓
[MBConv1 + SE] × n1
↓
[MBConv6 + SE] × n2
↓
[MBConv6 + SE] × n3
↓
[MBConv6 + SE] × n4
↓
[MBConv6 + SE] × n5
↓
[MBConv6 + SE] × n6
↓
[MBConv6 + SE] × n7
↓
Conv1x1 + Pool + FC (Head)
```

Optionally add:

```
+ CBAM attention (optional enhancement)
```

and:
$$
[n_1, n_2, n_3, n_4, n_5, n_6, n_7] = [1, 2, 2, 3, 3, 4, 1]
$$

---


## Explanation of Notation
Notation such as **[MBConv1 + SE] × n1** or **[MBConv6 + SE] × n3** is standard in papers and tables describing **EfficientNet**, **MobileNetV2**, and related models.


---


#### 1. What the number (1 or 6) means

The number after MBConv — e.g. **MBConv1** or **MBConv6** —
is the **expansion ratio**:

$$
t = 1, 6
$$

It controls how much the channel dimension is expanded inside the block:

$$
C_{expanded} = t \times C_{input}
$$

* **MBConv1** → expansion ratio $t = 1$: no expansion (input channels unchanged)
* **MBConv6** → expansion ratio $t = 6$: 6× more channels in the middle layers

Example:
If input has 32 channels:

* MBConv1 → expanded to 32
* MBConv6 → expanded to 192

✅ Larger expansion ratio → higher capacity but more compute.
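
To make the capacity/compute trade-off concrete, the convolution weights of a single block can be counted by hand. This is a rough sketch: it ignores SE, batch norm, and biases, and it assumes the 1×1 expansion conv is skipped entirely when $t = 1$. `mbconv_weights` is a hypothetical helper.

```python
def mbconv_weights(c_in, c_out, t, k=3):
    """Approximate conv weight count for one MBConv block (no SE, BN, or bias)."""
    c_mid = t * c_in
    expand = c_in * c_mid if t != 1 else 0  # 1x1 expansion (assumed skipped when t == 1)
    depthwise = k * k * c_mid               # one k x k filter per expanded channel
    project = c_mid * c_out                 # 1x1 projection back to c_out
    return c_mid, expand + depthwise + project


print(mbconv_weights(32, 32, t=1))  # (32, 1312)
print(mbconv_weights(32, 32, t=6))  # (192, 14016)
```

With 32 input channels, going from $t = 1$ to $t = 6$ raises the weight count by roughly 10×, dominated by the two 1×1 convolutions.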

---

#### 2. What SE means

**SE** = *Squeeze-and-Excitation* block, which adds **channel attention**.

It computes:
$$
\text{scale} = \sigma(W_2 \, \delta(W_1 \, \text{GAP}(x)))
$$

and multiplies this scale back to the feature map.
(SE squeezes spatial info via global average pooling and excites important channels.)

So, **MBConv6 + SE** = an MBConv block with expansion 6 and an SE attention inside.

---

#### 3. What × n means

The **× n** tells you **how many times** that block is repeated in sequence (same stage).

So:

* `[MBConv1 + SE] × n1` → repeat that structure **n1 times**
* `[MBConv6 + SE] × n3` → repeat that structure **n3 times**

Within a stage, all blocks share the same output channel count. Only the first block changes the number of channels and may downsample (stride 2); the remaining blocks use stride 1.

Example (EfficientNet-B0 simplified):

| Stage | Operator         | #Repeats | Output Channels | Stride | Resolution |
| ----- | ---------------- | -------- | --------------- | ------ | ---------- |
| 1     | Conv3×3          | 1        | 32              | 2      | 112×112    |
| 2     | **MBConv1 + SE** | **1**    | 16              | 1      | 112×112    |
| 3     | **MBConv6 + SE** | **2**    | 24              | 2      | 56×56      |
| 4     | **MBConv6 + SE** | **2**    | 40              | 2      | 28×28      |
| 5     | **MBConv6 + SE** | **3**    | 80              | 2      | 14×14      |
| 6     | **MBConv6 + SE** | **3**    | 112             | 1      | 14×14      |
| 7     | **MBConv6 + SE** | **4**    | 192             | 2      | 7×7        |
| 8     | **MBConv6 + SE** | **1**    | 320             | 1      | 7×7        |

So for example:

* `[MBConv6 + SE] × 3` means 3 consecutive MBConv6 blocks (with SE) in that stage.
* The first of them might downsample (stride 2), the rest keep stride 1.

---

#### 4. Summary of notation

| Notation             | Meaning                                                                     |
| -------------------- | --------------------------------------------------------------------------- |
| MBConv               | Mobile Inverted Bottleneck Conv block                                       |
| MBConv1              | Expansion ratio = 1                                                         |
| MBConv6              | Expansion ratio = 6                                                         |
| SE                   | Squeeze-and-Excitation block (channel attention)                            |
| × n                  | Repeat n times in that stage                                                |
| `[MBConv6 + SE] × 3` | Three repeated inverted-bottleneck blocks with expansion 6 and SE attention |

---

✅ **Quick interpretation example:**

`[MBConv6 + SE] × 3`
→ Each block:

* expands channels by ×6,
* applies depthwise conv,
* applies SE attention,
* projects back and maybe adds residual,
  and there are **3 such blocks** in that stage.
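
The same reading can be written as a PyTorch sketch. This is an illustrative implementation under stated assumptions, not the reference code: `MBConv` and `make_stage` are hypothetical names, SiLU is assumed as the activation, and the SE bottleneck width is assumed to be `se_ratio` times the block's input channels.

```python
import torch
import torch.nn as nn


class MBConv(nn.Module):
    def __init__(self, c_in, c_out, expand_ratio=6, kernel_size=3, stride=1, se_ratio=0.25):
        super().__init__()
        c_mid = c_in * expand_ratio
        self.use_residual = stride == 1 and c_in == c_out
        layers = []
        if expand_ratio != 1:  # 1x1 expansion (skipped for MBConv1)
            layers += [nn.Conv2d(c_in, c_mid, 1, bias=False), nn.BatchNorm2d(c_mid), nn.SiLU()]
        layers += [  # k x k depthwise conv: groups == channels
            nn.Conv2d(c_mid, c_mid, kernel_size, stride, padding=kernel_size // 2,
                      groups=c_mid, bias=False),
            nn.BatchNorm2d(c_mid), nn.SiLU(),
        ]
        self.expand_dw = nn.Sequential(*layers)
        c_se = max(1, int(c_in * se_ratio))  # assumed: relative to block input channels
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(c_mid, c_se, 1), nn.SiLU(),
            nn.Conv2d(c_se, c_mid, 1), nn.Sigmoid(),
        )
        self.project = nn.Sequential(nn.Conv2d(c_mid, c_out, 1, bias=False), nn.BatchNorm2d(c_out))

    def forward(self, x):
        out = self.expand_dw(x)
        out = out * self.se(out)  # channel attention
        out = self.project(out)   # linear bottleneck (no activation)
        return x + out if self.use_residual else out


def make_stage(c_in, c_out, repeats, stride, **kwargs):
    """First block may change channels / downsample; the rest use stride 1."""
    blocks = [MBConv(c_in, c_out, stride=stride, **kwargs)]
    blocks += [MBConv(c_out, c_out, stride=1, **kwargs) for _ in range(repeats - 1)]
    return nn.Sequential(*blocks)


# [MBConv6 + SE] x 3 with 40 -> 80 channels and stride 2, as in the B0 table.
stage = make_stage(40, 80, repeats=3, stride=2, expand_ratio=6, kernel_size=3).eval()
x = torch.randn(1, 40, 28, 28)
print(stage(x).shape)  # torch.Size([1, 80, 14, 14])
```

Only the second and third blocks use the residual connection here, since the first block changes both the channel count and the spatial size.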

---



## **6. EfficientNet Variants From timm**
This section lists the EfficientNet variants available in `timm` and traces the architectural evolution from **EfficientNet (V1)** to **EfficientNetV2 (V2)**.
It also clarifies **why EfficientNetV2 uses names like `s`, `m`, `l`, `xl`** instead of `b0–b8`, and what the difference really means.



```python
# fmt: off
# isort: skip_file
# DO NOT reorganize imports - warnings filter must be FIRST!
import warnings
import os

warnings.filterwarnings('ignore')
os.environ['PYTHONWARNINGS'] = 'ignore'

import timm
# fmt: on


# List every EfficientNet-family model that has pretrained weights available.
all_efficientnet = timm.list_models("*efficientnet*", pretrained=True)
for m in all_efficientnet:
    print(m)
```

Output (truncated):

```
efficientnet_b0.ra4_e3600_r224_in1k
efficientnet_b0.ra_in1k
efficientnet_b1.ft_in1k
efficientnet_b1.ra4_e3600_r240_in1k
efficientnet_b1_pruned.in1k
efficientnet_b2.ra_in1k
efficientnet_b2_pruned.in1k
efficientnet_b3.ra2_in1k
efficientnet_b3_pruned.in1k
efficientnet_b4.ra2_in1k
efficientnet_b5.sw_in12k
efficientnet_b5.sw_in12k_ft_in1k
efficientnet_el.ra_in1k
efficientnet_el_pruned.in1k
efficientnet_em.ra2_in1k
efficientnet_es.ra_in1k
efficientnet_es_pruned.in1k
efficientnet_lite0.ra_in1k
efficientnetv2_rw_m.agc_in1k
efficientnetv2_rw_s.ra2_in1k
efficientnetv2_rw_t.ra2_in1k
gc_efficientnetv2_rw_t.agc_in1k
test_efficientnet.r160_in1k
test_efficientnet_evos.r160_in1k
test_efficientnet_gn.r160_in1k
test_efficientnet_ln.r160_in1k
tf_efficientnet_b0.aa_in1k
tf_efficientnet_b0.ap_in1k
tf_efficientnet_b0.in1k
tf_efficientnet_b0.ns_jft_in1k
tf_efficientnet_b1.aa_in1k
tf_efficientnet_b1.ap_in1k
tf_efficientnet_b1.in1k
tf_efficientnet_b1.ns_jft_in1k
tf_efficientnet_b2.aa_in1k
tf_efficientnet_b2.ap
```
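
Each name in the listing has the form `<architecture>.<pretrained tag>`, where the tag describes the weights (training recipe and dataset, for example `in1k` for ImageNet-1k). The sketch below groups a few names copied from the listing by architecture using plain string handling; it does not call `timm`, and the variable names are only illustrative.

```python
from collections import defaultdict

# A few names copied from the listing above.
names = [
    "efficientnet_b0.ra4_e3600_r224_in1k",
    "efficientnet_b0.ra_in1k",
    "efficientnetv2_rw_s.ra2_in1k",
    "tf_efficientnet_b0.ns_jft_in1k",
]

tags_by_arch = defaultdict(list)
for name in names:
    arch, _, tag = name.partition(".")  # "<architecture>.<pretrained tag>"
    tags_by_arch[arch].append(tag)

for arch, tags in tags_by_arch.items():
    print(f"{arch}: {tags}")
```

Grouping this way makes it easier to see that one architecture (such as `efficientnet_b0`) can ship with several sets of weights.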


---

## **7. EfficientNet (V1): the original scaling idea (B0–B8)**

The original **EfficientNet (2019)** paper introduced the **compound scaling rule**:
$$
\text{depth} \propto \alpha^\phi, \quad
\text{width} \propto \beta^\phi, \quad
\text{resolution} \propto \gamma^\phi
$$

with constraint:
$$
\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2
$$

Here:

* **ϕ** is a compound coefficient controlling how “big” the model is.
* **B0** is the baseline network.
* **B1–B7** are scaled versions obtained by increasing ϕ; **B8** was added later in follow-up work (AdvProp) using the same scaling idea.

| Model | Resolution | Parameters | Top-1 Acc. (ImageNet) |
| ----- | ---------- | ---------- | --------------------- |
| B0    | 224×224    | 5.3 M      | ~77%                  |
| B1    | 240×240    | 7.8 M      | ~79%                  |
| B2    | 260×260    | 9.2 M      | ~80%                  |
| B3    | 300×300    | 12 M       | ~81.5%                |
| B4    | 380×380    | 19 M       | ~83%                  |
| B5    | 456×456    | 30 M       | ~84%                  |
| B6    | 528×528    | 43 M       | ~84.5%                |
| B7    | 600×600    | 66 M       | ~85%                  |
| B8    | 672×672    | 87 M       | ~85.7%                |

✅ **Interpretation:**
EfficientNet-B0→B8 are *scaled-up* versions of the same base architecture using a mathematical rule.

---

## **8. EfficientNetV2 Overview**

EfficientNetV2 (Tan & Le, 2021, *Google Research*) is a redesigned, faster, and more efficient version of EfficientNetV1.
Its goals:

1. **Reduce training time** (up to 5× faster).
2. **Reduce memory and FLOPs** while maintaining or improving accuracy.
3. **Improve performance on small images and dense tasks** (like segmentation).

---

#### **8.1. High-Level Pipeline**

```
Input (e.g. 224×224)
↓
Conv3×3 (Stem)
↓
[Fused-MBConv] × n₁
↓
[Fused-MBConv] × n₂
↓
[Fused-MBConv] × n₃
↓
[MBConv + SE] × n₄
↓
[MBConv + SE] × n₅
↓
[MBConv + SE] × n₆
↓
Conv1×1 (Head)
↓
Global Average Pool
↓
Fully Connected (Classifier)
↓
Softmax Output
```

✅ Early stages → **Fused-MBConv** (faster on GPU/TPU in the early, high-resolution stages).
✅ Later stages → **MBConv (with SE)** (better representation capacity).

---

### **8.2. EfficientNetV2 Building Blocks**

---

#### **8.2.1 MBConv Block (same as V1)**

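The standard **Mobile Inverted Bottleneck (MBConv)** block from EfficientNetV1 (see Section 5.1) is still used in the later stages of V2: 1×1 expansion → depthwise conv → SE → 1×1 projection, with a residual connection when stride = 1 and the input and output channels match.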

---

#### **8.2.2 Fused-MBConv Block (new in V2)**

To make early layers faster, V2 *fuses* the first two steps of MBConv:
the **1×1 expansion** and **3×3 depthwise conv**
→ replaced by **a single 3×3 regular convolution**.

### **8.3 Structure:**

```
Input
↓
3×3 Conv (expands and convolves together)
↓
BN + Swish
↓
(optional) SE
↓
1×1 Conv (projection)
↓
BN
↓
Residual (if stride=1 and same channels)
```

So, instead of:

```
1×1 expand → 3×3 depthwise → 1×1 project
```

it becomes:

```
3×3 normal conv → 1×1 project
```

✅ **Why:** Depthwise convs have few FLOPs but are memory-bound and often cannot fully utilize GPUs/TPUs, which hurts most in the early stages.
✅ **Benefit:** Fused-MBConv improves speed and training efficiency for early high-resolution stages.
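
The fused structure can be sketched in PyTorch as follows. This is a simplified illustration, not the reference implementation: the class name `FusedMBConv` and the helper `conv_weights` are hypothetical, SE is omitted (it is optional in the diagram above), and the special handling some implementations apply when the expansion ratio is 1 is not reproduced.

```python
import torch
import torch.nn as nn


class FusedMBConv(nn.Module):
    def __init__(self, c_in, c_out, expand_ratio=4, kernel_size=3, stride=1):
        super().__init__()
        c_mid = c_in * expand_ratio
        self.use_residual = stride == 1 and c_in == c_out
        # One regular k x k conv replaces "1x1 expand + k x k depthwise".
        self.fused = nn.Sequential(
            nn.Conv2d(c_in, c_mid, kernel_size, stride, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(c_mid),
            nn.SiLU(),
        )
        # 1x1 projection back to c_out, no activation.
        self.project = nn.Sequential(
            nn.Conv2d(c_mid, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x):
        out = self.project(self.fused(x))
        return x + out if self.use_residual else out


def conv_weights(module):
    """Count weights in Conv2d layers only (ignores batch norm)."""
    return sum(p.numel() for m in module.modules()
               if isinstance(m, nn.Conv2d) for p in m.parameters())


block = FusedMBConv(24, 48, expand_ratio=4, stride=2).eval()
x = torch.randn(1, 24, 112, 112)
print(block(x).shape)       # torch.Size([1, 48, 56, 56])
print(conv_weights(block))  # 3*3*24*96 + 96*48 = 25344
```

The fused block has more weights and FLOPs than the equivalent MBConv (a dense 3×3 conv is more expensive than 1×1 plus depthwise), so the speed-up comes from better hardware utilization rather than less arithmetic.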

---

#### **8.4. EfficientNetV2 Stage Configuration**

For **EfficientNetV2-S**, the canonical structure is:

| Stage | Operator            | Expansion | Repeats (n) | Output Channels | Stride |    SE   | Input Res (example 224×224) |
| :---: | :------------------ | :-------: | :---------: | :-------------: | :----: | :-----: | :-------------------------: |
|   0   | Conv3×3 (stem)      |     –     |      1      |        24       |    2   |    –    |           224×224           |
|   1   | **Fused-MBConv1**   |     1     |      2      |        24       |    1   |    No   |           112×112           |
|   2   | **Fused-MBConv4**   |     4     |      4      |        48       |    2   |    No   |           112×112           |
|   3   | **Fused-MBConv4**   |     4     |      4      |        64       |    2   |    No   |            56×56            |
|   4   | **MBConv4**         |     4     |      6      |       128       |    2   | **Yes** |            28×28            |
|   5   | **MBConv6**         |     6     |      9      |       160       |    1   | **Yes** |            14×14            |
|   6   | **MBConv6**         |     6     |      15     |       256       |    2   | **Yes** |            14×14            |
|   7   | Conv1×1 + Pool + FC |     –     |      –      |       1280      |    –   |    –    |             7×7             |

✅ `n` here corresponds to n₁ … n₆ (number of repeats per stage).
✅ The **first MBConv** in each stage may downsample (stride = 2).
✅ Later repeats in the same stage use stride = 1 and residuals.
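
The stage configuration can be written down as data and used to trace how the feature-map size shrinks. This sketch assumes a stride-2 3×3 stem in front of stage 1 (as in the EfficientNetV2 paper) and "same" padding, so each stride-2 stage halves the resolution; `V2_S_STAGES` and `trace_resolution` are hypothetical names.

```python
import math

# (operator, expansion, repeats, out_channels, stride, use_se) for stages 1-6.
V2_S_STAGES = [
    ("Fused-MBConv", 1, 2, 24, 1, False),
    ("Fused-MBConv", 4, 4, 48, 2, False),
    ("Fused-MBConv", 4, 4, 64, 2, False),
    ("MBConv", 4, 6, 128, 2, True),
    ("MBConv", 6, 9, 160, 1, True),
    ("MBConv", 6, 15, 256, 2, True),
]


def trace_resolution(input_size=224, stem_stride=2):
    """Return (stage, operator, repeats, channels, res_in, res_out) rows."""
    size = math.ceil(input_size / stem_stride)  # after the assumed stride-2 stem
    rows = []
    for i, (op, t, n, c, s, _se) in enumerate(V2_S_STAGES, start=1):
        size_out = math.ceil(size / s)
        rows.append((i, f"{op}{t}", n, c, size, size_out))
        size = size_out
    return rows


rows = trace_resolution()
for row in rows:
    print(row)

total_blocks = sum(n for _, _, n, _, _, _ in V2_S_STAGES)
print("total blocks:", total_blocks)  # 40
```

For a 224×224 input the trace goes 112 → 112 → 56 → 28 → 14 → 14 → 7, with 40 blocks in total across the six stages.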

---

#### **8.5. Key Differences from V1**

| Aspect            | EfficientNet-V1        | EfficientNet-V2                                                       |
| :---------------- | :--------------------- | :-------------------------------------------------------------------- |
| Early layers      | MBConv                 | **Fused-MBConv (standard conv)**                                      |
| Block types       | MBConv only            | **Hybrid: Fused-MBConv + MBConv**                                     |
| SE usage          | All blocks             | Only in MBConv (later stages)                                         |
| Expansion ratios  | 1, 6                   | Variable (1, 4, 6)                                                    |
| Training strategy | AutoAugment, fixed res | **Progressive learning:** start with small images, gradually increase |
| Accuracy          | up to ~85.7%           | up to ~86.3%                                                          |
| Training speed    | slower                 | up to 5× faster                                                       |
| Scaling           | Compound rule (ϕ)      | Manual scaling for S/M/L/XL                                           |

---

#### **8.6. EfficientNetV2 Family Summary**

| Model   | Params | Resolution | Top-1 Accuracy (ImageNet) | Notable Use          |
| :------ | :----: | :--------: | :-----------------------: | :------------------- |
| `v2_b0` |  ~7 M  |     224    |            ~79%           | Mobile-friendly      |
| `v2_b1` | ~8.5 M |     240    |            ~80%           | Small GPU            |
| `v2_b2` |  ~10 M |     260    |            ~81%           | Lightweight training |
| `v2_b3` |  ~14 M |     300    |           ~82.9%          | Balanced             |
| `v2_s`  |  ~22 M |     384    |           ~83.9%          | Standard choice      |
| `v2_m`  |  ~55 M |     480    |           ~85.1%          | High accuracy        |
| `v2_l`  | ~120 M |     480    |           ~85.7%          | Large GPU            |
| `v2_xl` | ~208 M |     512    |           ~86.3%          | Maximum accuracy     |

- ✅ V2-B0→B3 = small versions, similar scale to V1 (but hybrid blocks).
- ✅ V2-S/M/L/XL = large, high-accuracy architectures.



---

#### **8.7. Summary of Key Ideas**

| Concept                       | Description                                                     |
| :---------------------------- | :-------------------------------------------------------------- |
| **Fused-MBConv**              | Combines expansion + depthwise into one 3×3 conv for efficiency |
| **MBConv + SE**               | Used in deeper layers for accuracy                              |
| **Variable expansion ratios** | t = 1, 4, 6 instead of fixed 6                                  |
| **Progressive training**      | Start small, increase resolution/regularization gradually       |
| **Manual scaling**            | Replace B0–B8 with Small/Medium/Large/XL tailored designs       |

---

✅ **Intuitive summary:**
EfficientNetV1 = mathematically scaled MBConv tower.
EfficientNetV2 = hand-optimized hybrid of Fused-MBConv (fast early) + MBConv+SE (strong late), trained progressively for better speed–accuracy trade-off.
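
Progressive training can be pictured as a schedule that raises image size and regularization together over several training stages. The sketch below uses simple linear interpolation between a start and end value; the function name and the concrete numbers (4 stages, 128 → 320 px, dropout 0.1 → 0.3) are illustrative assumptions, not values taken from this document.

```python
def progressive_schedule(num_stages, size_range, dropout_range):
    """Linearly interpolate image size and dropout across training stages."""
    s_start, s_end = size_range
    d_start, d_end = dropout_range
    schedule = []
    for i in range(num_stages):
        frac = i / (num_stages - 1) if num_stages > 1 else 1.0
        size = round(s_start + (s_end - s_start) * frac)
        dropout = round(d_start + (d_end - d_start) * frac, 3)
        schedule.append((i, size, dropout))
    return schedule


# Illustrative values only.
schedule = progressive_schedule(4, size_range=(128, 320), dropout_range=(0.1, 0.3))
for stage, size, dropout in schedule:
    print(f"stage {stage}: image size {size}px, dropout {dropout}")
```

Early stages see small images with weak regularization (fast, easy to fit), while later stages see larger images with stronger regularization.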

---




#### **8.8. When to use which**

| GPU budget  | Recommended model                    | Reason                       |
| ----------- | ------------------------------------ | ---------------------------- |
| ≤ 4 GB VRAM | `tf_efficientnetv2_b0.in1k`          | Small, accurate, lightweight |
| 4–8 GB      | `tf_efficientnetv2_s.in21k_ft_in1k`  | Balanced, fast training      |
| 8–16 GB     | `tf_efficientnetv2_m.in21k_ft_in1k`  | Higher accuracy              |
| 16–24 GB    | `tf_efficientnetv2_l.in21k_ft_in1k`  | SOTA accuracy                |
| > 24 GB     | `tf_efficientnetv2_xl.in21k_ft_in1k` | Maximum accuracy, expensive  |

---

#### **8.9. Summary**

| Family                | Naming          | Scaling Rule       | Block Type            | Accuracy | Training Speed |
| --------------------- | --------------- | ------------------ | --------------------- | -------- | -------------- |
| **EfficientNet (V1)** | B0–B8           | Compound (α,β,γ)   | MBConv                | 77–85.7% | Moderate       |
| **EfficientNetV2**    | B0–B3, S/M/L/XL | Manual progressive | MBConv + Fused-MBConv | 79–86.3% | Much faster    |

✅ **In short:**
EfficientNetV2 replaced the rigid “B-scaling” formula with new hand-optimized variants (S, M, L, XL) designed for speed and accuracy trade-offs.

