# **Feature Pyramid Network (FPN)**
**FPN (Feature Pyramid Network)** is one of the cornerstones of modern **detection and segmentation** architectures — and it ties directly to **PVT** (since PVT outputs a feature pyramid).


**Two versions**:

1. **Canonical ResNet-style FPN (official FPN paper)**
2. **PVT/Swin-Transformer style FPN (for use with PVT-v2-B2)**

# **Canonical FPN (ResNet Backbone)**

Assume the backbone input is an image of size:

**Input: $H \times W \times 3$**

A ResNet-50/101 backbone (bottleneck blocks) produces the following feature stages:

| Stage | Source               | Spatial Resolution | Channels |
| ----- | -------------------- | ------------------ | -------- |
| $C_2$| After ResNet stage 2 | $H/4 \times W/4$  | 256      |
| $C_3$| After stage 3        | $H/8 \times W/8$  | 512      |
| $C_4$| After stage 4        | $H/16 \times W/16$| 1024     |
| $C_5$| After stage 5        | $H/32 \times W/32$| 2048     |

Now FPN converts them to:

**All P-levels have the same channel count: 256**

| Pyramid Level | Resolution         | Channels |
| ------------- | ------------------ | -------- |
| $P_5$        | $H/32 \times W/32$| 256      |
| $P_4$        | $H/16 \times W/16$| 256      |
| $P_3$        | $H/8 \times W/8$  | 256      |
| $P_2$        | $H/4 \times W/4$  | 256      |

---

## **FPN Equations (with ResNet dimensions)**

### **Step 1. Lateral 1×1 reductions**

Each backbone output is projected to 256 channels:

$$
\tilde{C}_5 = \text{Conv}_{1\times 1}(C_5), \quad \text{shape } (H/32, W/32, 256)
$$

$$
\tilde{C}_4 = \text{Conv}_{1\times 1}(C_4), \quad \text{shape } (H/16, W/16, 256)
$$

$$
\tilde{C}_3 = \text{Conv}_{1\times 1}(C_3), \quad \text{shape } (H/8, W/8, 256)
$$

$$
\tilde{C}_2 = \text{Conv}_{1\times 1}(C_2), \quad \text{shape } (H/4, W/4, 256)
$$
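As a minimal sketch of Step 1, the lateral projections can be written as one 1×1 convolution per level. The input size (`H = W = 256`) and the random tensors standing in for backbone outputs are assumptions made only for this demo; tensors use PyTorch's `(N, C, H, W)` layout rather than the `(H, W, C)` notation of the equations.

```python
import torch
import torch.nn as nn

# Channel counts and strides of C2..C5 for a ResNet-50/101-style backbone
in_channels = [256, 512, 1024, 2048]
strides = [4, 8, 16, 32]
H = W = 256  # hypothetical input size, chosen only for this demo

# Dummy backbone outputs in (N, C, H, W) layout
feats = [torch.randn(1, c, H // s, W // s) for c, s in zip(in_channels, strides)]

# One 1x1 conv per level: changes the channel count, keeps the resolution
lateral = nn.ModuleList([nn.Conv2d(c, 256, kernel_size=1) for c in in_channels])
projected = [conv(f) for conv, f in zip(lateral, feats)]

for name, t in zip(["C2", "C3", "C4", "C5"], projected):
    print(name, tuple(t.shape))
```

Every projected map ends up with 256 channels while its spatial size is unchanged.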

---

### **Step 2. Top–down fusion**

$$
P_5 = \tilde{C}_5
$$

$$
P_4 = \tilde{C}_4 + \text{Upsample}(P_5)
$$

$$
P_3 = \tilde{C}_3 + \text{Upsample}(P_4)
$$

$$
P_2 = \tilde{C}_2 + \text{Upsample}(P_3)
$$

Upsample is nearest-neighbor (the choice in the original FPN paper) or bilinear interpolation, by a factor of 2.
It has **no parameters**.
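The fusion step can be sketched with `torch.nn.functional.interpolate`. The tensors below are random stand-ins for the laterally projected maps, and the helper name `upsample_to` is hypothetical.

```python
import torch
import torch.nn.functional as F

# Hypothetical laterally projected maps, all with 256 channels
c2 = torch.randn(1, 256, 64, 64)
c3 = torch.randn(1, 256, 32, 32)
c4 = torch.randn(1, 256, 16, 16)
c5 = torch.randn(1, 256, 8, 8)


def upsample_to(x, ref):
    # Nearest-neighbor resize to the spatial size of `ref`; no learnable weights
    return F.interpolate(x, size=ref.shape[-2:], mode="nearest")


p5 = c5
p4 = c4 + upsample_to(p5, c4)
p3 = c3 + upsample_to(p4, c3)
p2 = c2 + upsample_to(p3, c2)

print([tuple(p.shape) for p in (p2, p3, p4, p5)])
```

Resizing to the target map's size (rather than using a fixed `scale_factor=2`) is a design choice that also works when the input size is not divisible by 32 and the levels differ by slightly more or less than a factor of 2.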

---

### **Step 3. 3×3 smoothing conv**

Each $P_i$ is passed through a 3×3 conv (stride 1, padding 1):

$$
P_i = \text{Conv}_{3\times 3}(P_i)
$$

This reduces the aliasing effect introduced by upsampling.
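A quick check, under the stated settings (3×3 kernel, stride 1, padding 1), shows that the smoothing conv keeps the spatial size and what it costs in parameters. The input tensor is a random placeholder.

```python
import torch
import torch.nn as nn

smooth = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)

p = torch.randn(1, 256, 32, 32)  # stand-in for a merged pyramid level
out = smooth(p)

# Weights: 256 * 256 * 3 * 3, plus 256 biases
n_params = sum(t.numel() for t in smooth.parameters())
print(tuple(out.shape), n_params)
```

With padding 1 the output has the same height and width as the input, so the smoothed pyramid keeps the shapes listed in the tables.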

---

# **PVT / Swin-Transformer Style FPN**

Hierarchical transformer backbones produce different channel counts than ResNet. Using **PVT-v2-B2** as an example:

| Stage | Resolution | Channels |
| ----- | ---------- | -------- |
| $C_1$| $H/4$     | 64       |
| $C_2$| $H/8$     | 128      |
| $C_3$| $H/16$    | 320      |
| $C_4$| $H/32$    | 512      |

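FPN usually consumes all four stages (or only the last three). To match the $C_2$–$C_5$ notation of the canonical FPN, the same stages are relabeled by stride: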

| C-level | Res    | Channels |
| ------- | ------ | -------- |
| $C_2$  | $H/4$ | 64       |
| $C_3$  | $H/8$ | 128      |
| $C_4$  | $H/16$| 320      |
| $C_5$  | $H/32$| 512      |

---

## **Unified FPN projection**

Everything is projected to **256 channels**:

| Pyramid Level | Resolution | Channels |
| ------------- | ---------- | -------- |
| $P_5$        | $H/32$    | 256      |
| $P_4$        | $H/16$    | 256      |
| $P_3$        | $H/8$     | 256      |
| $P_2$        | $H/4$     | 256      |

Equations remain exactly the same:

$$
P_5 = \text{Conv}_{1\times1}(C_5)
$$

$$
P_4 = \text{Conv}_{1\times1}(C_4)+ \text{Upsample}(P_5)
$$

$$
P_3 = \text{Conv}_{1\times1}(C_3)+ \text{Upsample}(P_4)
$$

$$
P_2 = \text{Conv}_{1\times1}(C_2)+ \text{Upsample}(P_3)
$$

Each $P_i$ then gets a 3×3 conv.

---

# **Final Clean Diagram (Explicit Shapes)**

Assume input image:
**(H = 512, W = 512)** (just as example)

### **Backbone (PVT-v2-B2)**

```
C2: 128×128, 64 ch
C3:  64× 64, 128 ch
C4:  32× 32, 320 ch
C5:  16× 16, 512 ch
```

### **After 1×1 lateral projection**

```
C2 → 128×128, 256 ch
C3 →  64× 64, 256 ch
C4 →  32× 32, 256 ch
C5 →  16× 16, 256 ch
```

### **Top–down pathway**

```
P5 = C5                                      → 16×16,   256 ch
P4 = C4 + up(P5)    (up: 16→32)             → 32×32,   256 ch
P3 = C3 + up(P4)    (up: 32→64)             → 64×64,   256 ch
P2 = C2 + up(P3)    (up: 64→128)            → 128×128, 256 ch
```

### **Final 3×3 smoothing**

```
P5: 16×16   → 256 ch
P4: 32×32   → 256 ch
P3: 64×64   → 256 ch
P2: 128×128 → 256 ch
```

These four maps are then used for:

* Detection heads
* Segmentation decoders
* Anchor-free detectors
* Panoptic segmentation
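The shapes in the diagram follow directly from the strides. The helper below is a small sketch (the function name `pyramid_shapes` is hypothetical) that assumes each level's size is the input size floor-divided by its stride; real backbones may round differently depending on padding.

```python
def pyramid_shapes(height, width, strides=(4, 8, 16, 32), channels=256):
    """Return {level_name: (H_i, W_i, channels)} for P2..P5."""
    shapes = {}
    for level, stride in enumerate(strides, start=2):
        # Assumption: spatial size is the input size floor-divided by the stride
        shapes[f"P{level}"] = (height // stride, width // stride, channels)
    return shapes


shapes = pyramid_shapes(512, 512)
for name, shape in shapes.items():
    print(name, shape)
```

For a 512×512 input this reproduces the 128, 64, 32 and 16 pixel grids shown above.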

---




## **Why it’s powerful**

- ✅ Handles **objects at multiple scales** (small and large).
- ✅ Uses **semantics from deep layers** + **resolution from shallow layers**.
- ✅ Simple and light, yet extremely effective.

---

## **Common architectures using FPN**

| Architecture                 | Backbone            | FPN used for           | Output purpose                 |
| :--------------------------- | :------------------ | :--------------------- | :----------------------------- |
| **Faster R-CNN + FPN**       | ResNet / Swin / PVT | Object detection       | Multi-scale RoI heads          |
| **RetinaNet**                | ResNet / PVT        | Single-stage detection | Multi-scale anchor predictions |
| **Mask R-CNN + FPN**         | ResNet / Swin / PVT | Instance segmentation  | Mask head features             |
| **UPerNet**                  | PVT / Swin / ViT    | Semantic segmentation  | Pyramid fusion before decoder  |
| **Detectron2 / MMDetection** | Many                | Detection/Segmentation | Backbone + FPN combo           |

---



## **Summary**

## **Backbone stages**

| Level | Resolution | Channels                                |
| ----- | ---------- | --------------------------------------- |
| $C_2$| $H/4$     | varies (64 in PVT-v2-B2, 256 in ResNet)|
| $C_3$| $H/8$     | varies                                  |
| $C_4$| $H/16$    | varies                                  |
| $C_5$| $H/32$    | varies                                  |

## **FPN pyramid**

| Level | Resolution | Channels |
| ----- | ---------- | -------- |
| $P_2$| $H/4$     | 256      |
| $P_3$| $H/8$     | 256      |
| $P_4$| $H/16$    | 256      |
| $P_5$| $H/32$    | 256      |


- ✅ **FPN (Feature Pyramid Network)** = multi-scale feature fusion architecture.
- ✅ Combines high-res spatial detail (low layers) with strong semantics (deep layers).
- ✅ Used in detection (Faster R-CNN, RetinaNet), segmentation (Mask R-CNN, UPerNet).
- ✅ Works seamlessly with hierarchical backbones like **ResNet**, **Swin**, **PVT**.
- ✅ In code, it’s mostly:

* `1×1 conv` for lateral mapping
* `upsample + addition`
* `3×3 conv` for smoothing

---



# **The FPN Structure (ASCII Diagram)**

```
            +--------------------------+
            |      Backbone (e.g. PVT) |
            +--------------------------+
                     │
         ┌───────────────────────────────┐
         │ Outputs from different stages │
         └───────────────────────────────┘
             C1: [B, 64, 56, 56]
             C2: [B,128, 28, 28]
             C3: [B,320, 14, 14]
             C4: [B,512,  7,  7]

                     ↓ (Top-down path)
        +------------------------------------------+
        |              FPN construction            |
        +------------------------------------------+

                             P4 ← 1×1 conv(C4)
                              │
                              │  (upsample by 2)
                              ↓
             P3 ← 1×1 conv(C3) + ↑ P4
              │
              │  (upsample by 2)
              ↓
     P2 ← 1×1 conv(C2) + ↑ P3
      │
      │  (upsample by 2)
      ↓
P1 ← 1×1 conv(C1) + ↑ P2

Each Pi then → 3×3 conv smoothing
(P1, P2, P3, P4 each: [B, 256, H_i, W_i])
```

✅ **Top-down path:**
Upsamples deeper, low-resolution features (P4→P3→P2→P1).

✅ **Lateral connections:**
Each upsampled feature is **added** to a 1×1-convolved version of the corresponding backbone feature.

✅ **3×3 conv smoothing:**
Removes aliasing artifacts after upsampling and addition.

---

# **The Idea in One Picture (Layer Fusion)**

```
  C3 (higher resolution)        P4 (low resolution, high-level semantics)
        │                              │
     1×1 conv                    upsample (×2)
        │                              │
        └───────────── add ────────────┘
                        │
                     3×3 conv
                        ↓
                        P3
```

This pattern repeats for each pyramid level — fusing information from the stage above.

---

# **Conceptual Analogy**

Think of FPN as **a lightweight decoder on top of the backbone**:

* Encoder (backbone): progressively downsamples → semantic abstraction
* FPN (top-down): progressively upsamples → detail recovery
* Result: multi-scale, semantically rich features usable for **detection, segmentation, depth, etc.**

---



# **What Do We Do with $P_1, P_2, P_3, P_4$?**

Once we have the **multi-scale pyramid outputs** $P_1, P_2, P_3, P_4$, what do we do with them next?

The answer depends on the **task** (classification, detection, segmentation, depth, etc.), but the principles are always the same.

---

### **What we have so far**

After the **PVT → FPN pipeline**, we have:

| Level | Resolution | Channels | Contains               |
| :---: | :--------- | :------: | :--------------------- |
|   P1  | 56×56      |    256   | fine details, edges    |
|   P2  | 28×28      |    256   | object parts           |
|   P3  | 14×14      |    256   | larger objects         |
|   P4  | 7×7        |    256   | whole-object semantics |

Each $P_i$ is:

* **semantically strong** (due to top-down flow),
* **spatially meaningful** (due to lateral connections),
* and **uniform in channels** (256).

Now we choose a **head** depending on the task.

---

### **Case 1 – Object Detection (e.g., RetinaNet, Faster R-CNN + FPN)**

**Goal:** detect objects at different scales.

Each $P_i$ handles **objects of a specific size range**.

For example (illustrative ranges; the exact assignment depends on the detector):

| Pyramid | Object Size          | Used for          |
| :------ | :------------------- | :---------------- |
| P1      | small (16–32 px)     | small objects     |
| P2      | medium (32–64 px)    | mid-sized objects |
| P3      | large (64–128 px)    | large objects     |
| P4      | very large (>128 px) | scene-level       |
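For two-stage detectors, the FPN paper assigns each RoI of width $w$ and height $h$ to a level with $k = \lfloor k_0 + \log_2(\sqrt{wh}/224) \rfloor$, where $k_0 = 4$. The sketch below implements that rule; it uses the canonical $P_2$–$P_5$ naming, which corresponds to P1–P4 in the table above, and the function name `roi_to_level` is hypothetical.

```python
import math


def roi_to_level(w, h, k0=4, canonical=224, k_min=2, k_max=5):
    """Map an RoI of size w x h (in input pixels) to a pyramid level index."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    # Clamp to the levels that actually exist (P2..P5)
    return max(k_min, min(k_max, k))


for size in (32, 112, 224, 448, 1000):
    print(size, "->", f"P{roi_to_level(size, size)}")
```

A 224×224 RoI lands on $P_4$, and halving its side length moves it one level finer. Single-stage detectors such as RetinaNet do not use this formula; they place anchors of a fixed base size on each level instead.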

---

### **Detection head structure**

Each pyramid level is fed to the same *head* (shared weights) that predicts:

1. **Class scores** (what object?)
2. **Bounding box offsets** (where?)

Each head is typically a few conv layers:

```python
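import torch.nn as nn


class DetectionHead(nn.Module):
    """Conv head shared by all pyramid levels (weights are reused per level)."""

    def __init__(self, in_channels=256, num_classes=80):
        super().__init__()
        # Small conv tower applied before both prediction branches
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1),
            nn.ReLU(),
        )
        self.cls = nn.Conv2d(256, num_classes, 3, padding=1)  # class logits per location
        self.box = nn.Conv2d(256, 4, 3, padding=1)            # 4 box offsets per location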

    def forward(self, features):
        outputs = []
        for x in features:  # [P1, P2, P3, P4]
            f = self.conv(x)
            cls_out = self.cls(f)
            box_out = self.box(f)
            outputs.append((cls_out, box_out))
        return outputs
```

Each feature map produces predictions at its scale —
the results are merged and decoded into final bounding boxes.

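✅ This shared-head pattern is what **RetinaNet** uses, and **Faster R-CNN + FPN** applies the same idea in its region proposal network. **YOLOX**-style detectors also predict on every level, but with a separate head per level.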
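To make the merging step concrete, here is a minimal sketch of how per-level outputs can be flattened and concatenated into one prediction tensor before decoding. The random tensors stand in for the outputs of a shared head, and the helper name `flatten_level` is hypothetical; box decoding and NMS are left out.

```python
import torch

num_classes = 80
sizes = [56, 28, 14, 7]  # spatial sizes of P1..P4 for a 224x224 input

# Stand-in for head outputs: one (class map, box map) pair per level
outputs = [
    (torch.randn(2, num_classes, s, s), torch.randn(2, 4, s, s)) for s in sizes
]


def flatten_level(t):
    # (B, C, H, W) -> (B, H*W, C): one row per spatial location
    return t.permute(0, 2, 3, 1).reshape(t.shape[0], -1, t.shape[1])


cls_all = torch.cat([flatten_level(c) for c, _ in outputs], dim=1)
box_all = torch.cat([flatten_level(b) for _, b in outputs], dim=1)
print(cls_all.shape, box_all.shape)
```

The second dimension counts all locations across levels (56² + 28² + 14² + 7² = 4165), so later steps can treat the pyramid as one flat list of candidates.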

---




### **Case 2 – Semantic Segmentation (e.g., UPerNet)**

**Goal:** classify *each pixel*.

All pyramid levels are **upsampled to the same resolution** (e.g., 1/4 of the input image)
and **concatenated** (or summed) before the segmentation head.

---

### **Example: UPerNet-like head**

```python
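import torch
import torch.nn as nn
import torch.nn.functional as F

# Dummy FPN outputs for a 224×224 input (B and num_classes are example values)
B, num_classes = 2, 21
P1, P2, P3, P4 = (torch.randn(B, 256, s, s) for s in (56, 28, 14, 7))

# Bring every level to P1's resolution (1/4 of the input)
upsampled = [
    F.interpolate(P, size=P1.shape[-2:], mode='bilinear', align_corners=False)
    for P in [P1, P2, P3, P4]
]
fusion = torch.cat(upsampled, dim=1)  # [B, 256*4, 56, 56]

# Final segmentation head
seg_head = nn.Sequential(
    nn.Conv2d(256*4, 256, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(256, num_classes, 1)
)
seg_logits = seg_head(fusion)
print(seg_logits.shape)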
```

**Output:**

```
[B, num_classes, 56, 56]
```

✅ Every pixel’s prediction is now influenced by **both**

* fine details (from P1)
* and global context (from P4)

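**UPerNet** fuses the FPN pyramid in this way, and **MaskFormer** uses an FPN-style pixel decoder. **DeepLabV3+** is not FPN-based: it combines ASPP features with a single low-level skip connection.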
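The text above mentions that the upsampled levels can be summed instead of concatenated. A minimal sketch of that variant follows, with random tensors standing in for the pyramid levels.

```python
import torch
import torch.nn.functional as F

B = 2
# Stand-ins for P1..P4 of a 224x224 input
P1, P2, P3, P4 = (torch.randn(B, 256, s, s) for s in (56, 28, 14, 7))

target = P1.shape[-2:]
upsampled = [
    F.interpolate(p, size=target, mode="bilinear", align_corners=False)
    for p in (P1, P2, P3, P4)
]

# Element-wise sum keeps the channel count at 256 (concatenation gives 1024)
fused = torch.stack(upsampled).sum(dim=0)
print(fused.shape)
```

Summation keeps the following head four times narrower at its input, at the cost of mixing the levels before the head can weight them separately.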

---



## **PVT-v2 + FPN segmentation**
Below is a **clean, minimal, end-to-end PVT-v2 + FPN segmentation example** using **timm**.

It follows a standard structure:

1. Load **PVT-v2-B2** as feature extractor
2. Build a **top-down FPN decoder**
3. Build a **final segmentation head**
4. Run forward with dummy input


#### **1. Imports**

```python
# fmt: off
# isort: skip_file
# DO NOT reorganize imports - the warnings filter must run before the other imports!

import os
import warnings

warnings.filterwarnings('ignore')
os.environ['PYTHONWARNINGS'] = 'ignore'

import torch
import torch.nn as nn
import torch.nn.functional as F

import timm

# fmt: on
```


#### **2. Load PVT-v2-B2 backbone**

PVT-v2-B2 in `timm` outputs 4 feature maps (C1, C2, C3, C4):

| Stage | Resolution | Channels |
| ----- | ---------- | -------- |
| C1    | 1/4        | 64       |
| C2    | 1/8        | 128      |
| C3    | 1/16       | 320      |
| C4    | 1/32       | 512      |

We get these with `features_only=True`.


```python
backbone = timm.create_model(
    'pvt_v2_b2',
    pretrained=True,
    features_only=True
)

for i in backbone.feature_info:
    print(i)

print(f'Feature channels: {backbone.feature_info.channels()}')
print(f'Feature reduction: {backbone.feature_info.reduction()}')
```

**Output:**

```
{'num_chs': 64, 'reduction': 4, 'module': 'stages.0', 'index': 0}
{'num_chs': 128, 'reduction': 8, 'module': 'stages.1', 'index': 1}
{'num_chs': 320, 'reduction': 16, 'module': 'stages.2', 'index': 2}
{'num_chs': 512, 'reduction': 32, 'module': 'stages.3', 'index': 3}
Feature channels: [64, 128, 320, 512]
Feature reduction: [4, 8, 16, 32]
```


---

#### **3. FPN Decoder (minimal)**

The FPN idea:

* Convert all levels to same channel count (here 256)
* Start from highest level C4 → produce P4
* Upsample P4 and add to C3 → P3
* Upsample P3 and add to C2 → P2
* Upsample P2 and add to C1 → P1

Upsampling uses **bilinear interpolation (non-learned)**.

```python
class FPN(nn.Module):
    def __init__(self, in_channels, out_channels=256): # in_channels is : [64, 128, 320, 512]
        super().__init__()

        # 1x1 lateral projections
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels]
        )

        # 3x3 smoothing after merging
        self.smooth = nn.ModuleList(
            [nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
             for _ in range(len(in_channels))]
        )

    def forward(self, features):
        # features = [C1, C2, C3, C4]
        C1, C2, C3, C4 = features

        P4 = self.lateral[3](C4)
        P3 = self.lateral[2](C3) + F.interpolate(P4, size=C3.shape[-2:], mode='bilinear', align_corners=False)
        P2 = self.lateral[1](C2) + F.interpolate(P3, size=C2.shape[-2:], mode='bilinear', align_corners=False)
        P1 = self.lateral[0](C1) + F.interpolate(P2, size=C1.shape[-2:], mode='bilinear', align_corners=False)

        P1 = self.smooth[0](P1)
        P2 = self.smooth[1](P2)
        P3 = self.smooth[2](P3)
        P4 = self.smooth[3](P4)

        return [P1, P2, P3, P4]
```


#### Why `nn.ModuleList` and not a plain Python `list` for `self.lateral`?



**Short Answer**

You **must** use `nn.ModuleList` (not a plain `list`) **when storing layers/modules** inside an `nn.Module` (your network), **if you want PyTorch to track their parameters and move them to GPU/CPU properly**.

---

Why **not** use a plain Python list?

When you write this:

```python
self.lateral = [nn.Conv2d(c, out_channels, 1) for c in in_channels]
```

You’re storing the layers in a plain list, so PyTorch **won’t register** them as part of your model.

As a result:

* `model.parameters()` **won’t include them**
* `.cuda()` / `.to(device)` **won’t move them**
* They **won’t show up** in `model.named_parameters()` or `model.state_dict()`
* **They won’t be trained!**
* Saving/loading the model will silently skip them

This is one of the most common beginner mistakes in PyTorch.

---

#### ✅ Why use `nn.ModuleList`?

```python
self.lateral = nn.ModuleList([
    nn.Conv2d(c, out_channels, 1) for c in in_channels
])
```

This tells PyTorch:
**“These are submodules. Track them. Register their parameters.”**

Now:

* Parameters **will be included in `.parameters()`**
* You can move them to GPU with `.cuda()` or `.to('cuda')`
* They will be saved/loaded with the model checkpoint
* You can iterate over them in `forward()` as usual

---

#### ✅ When to use `nn.ModuleList` vs `nn.Sequential`

| Use `nn.Sequential` when...                         | Use `nn.ModuleList` when...                                                      |
| --------------------------------------------------- | -------------------------------------------------------------------------------- |
| Modules are applied **in order, one after another** | You need **more flexible logic**, e.g., skip connections, top-down fusion, loops |
| Example: classic feedforward or MLP layers          | Example: FPN, U-Net skip connections, layer-wise operations                      |

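In FPN, the top-down loop runs from the deepest level to the shallowest:

```python
P[3] = self.lateral[3](C[3])          # deepest level: lateral projection only
for i in range(2, -1, -1):            # then P[2], P[1], P[0]
    P[i] = self.lateral[i](C[i]) + upsample(P[i + 1])
```

Each layer is called individually and its output is combined with another tensor. `nn.Sequential` would instead chain the layers, feeding each output into the next, which is not what we want here.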
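The top-down loop can be sketched as a small runnable module that works for any number of levels. The class name `TopDown` and the dummy feature sizes are assumptions for illustration; the smoothing convs are omitted to keep the focus on indexed access into the `nn.ModuleList`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopDown(nn.Module):
    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels]
        )

    def forward(self, feats):
        # Start at the deepest level, then walk towards the shallowest one
        outs = [None] * len(feats)
        outs[-1] = self.lateral[-1](feats[-1])
        for i in range(len(feats) - 2, -1, -1):
            up = F.interpolate(outs[i + 1], size=feats[i].shape[-2:], mode="nearest")
            outs[i] = self.lateral[i](feats[i]) + up
        return outs


channels = [64, 128, 320, 512]
feats = [torch.randn(1, c, s, s) for c, s in zip(channels, (56, 28, 14, 7))]
model = TopDown(channels)
outs = model(feats)
print([tuple(o.shape) for o in outs])
print(len(list(model.parameters())))  # one weight and one bias per lateral conv
```

Because the convs live in an `nn.ModuleList`, all eight parameter tensors are registered on the model even though `forward` calls them out of order.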

---

#### ✅ Common mistake clarified

A common confusion is the following:

> I have only seen that when we create our network model, we inherit from nn.ModuleList

That’s a **misunderstanding**: typical models inherit from `nn.Module`, not from `nn.ModuleList`.

Instead, you define your model as:

```python
class MyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([...])
```

So `nn.ModuleList` is **used inside a module**, not inherited directly — unless you're doing something very special like building an entire network as a list (rare and not recommended for general models).

---

#### ✅ Visual demo

Let’s verify it:

```python
import torch.nn as nn

class BadNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(10, 10) for _ in range(3)]

model = BadNet()
print(list(model.parameters()))  # ❌ Empty list!
```

Now with `ModuleList`:

```python
class GoodNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(10, 10) for _ in range(3)])

model = GoodNet()
print(list(model.parameters()))  # ✅ Contains all parameters
```

---

#### ✅ Summary

Use `nn.ModuleList` whenever:

* You are storing a list of `nn.Module` layers
* You need PyTorch to register and track those layers
* You want to iterate or use layers flexibly in `forward()`

❌ **Never** store layers in plain lists/tuples if you want them trained.



---

#### **4. Final Segmentation Head**

A simple head:

* Upsample P2–P4 to the resolution of P1
* Concatenate the multi-scale FPN features
* 3×3 Conv → output logits (the full model then upsamples the logits to the input size)



```python
class SegmentationHead(nn.Module):
    def __init__(self, fpn_channels=256, num_classes=1):
        super().__init__()

        self.conv = nn.Conv2d(fpn_channels * 4, num_classes, kernel_size=3, padding=1)

    def forward(self, P):
        # P = [P1, P2, P3, P4]
        P1, P2, P3, P4 = P

        size = P1.shape[-2:]

        P2 = F.interpolate(P2, size=size, mode='bilinear', align_corners=False)
        P3 = F.interpolate(P3, size=size, mode='bilinear', align_corners=False)
        P4 = F.interpolate(P4, size=size, mode='bilinear', align_corners=False)

        x = torch.cat([P1, P2, P3, P4], dim=1)
        x = self.conv(x)

        return x
```

#### **5. Full PVT-FPN Segmentation Model**

```python
class PVT_FPN_Segmentation(nn.Module):
    def __init__(self, backbone_name='pvt_v2_b2', num_classes=1):
        super().__init__()

        self.backbone = timm.create_model(
            backbone_name,
            pretrained=True,
            features_only=True
        )

        in_channels = self.backbone.feature_info.channels()

        self.fpn = FPN(in_channels, out_channels=256)
        self.head = SegmentationHead(256, num_classes)

    def forward(self, x):
        features = self.backbone(x)          # C1,C2,C3,C4
        P = self.fpn(features)               # P1,P2,P3,P4
        logits = self.head(P)                # prediction at 1/4 resolution
        # Upsample the logits back to the input resolution
        logits = F.interpolate(logits, size=x.shape[-2:], mode='bilinear', align_corners=False)
        return logits
```

---

#### **6. Test the whole pipeline**

```python
if __name__ == "__main__":
    model = PVT_FPN_Segmentation(num_classes=1)

    x = torch.randn(1, 3, 224, 224)
    y = model(x)

    print("Output:", y.shape)
```

**Output:**

```
Output: torch.Size([1, 1, 224, 224])
```

---

# **7. Summary of what happens**

**PVT-v2 backbone:**

* Extract features at 4 scales

**FPN decoder:**

* Upsample from deep → shallow
* Add skip connections
* Produce 4 aligned maps P1–P4

**Head:**

* Bring all P maps to same size
* Concatenate
* Predict segmentation mask

---


### **Case 3 – Depth Estimation / Optical Flow / Reconstruction**

For dense regression tasks (depth, disparity, etc.),
you can similarly **upsample and fuse** P1–P4, then predict a continuous map.

Example:

```python
depth_map = torch.sigmoid(seg_head(fusion)) * max_depth
```

✅ Benefit: FPN gives **multi-scale awareness**, which can help keep depth edges sharp and large planes smooth.
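A slightly fuller sketch of the regression case is shown below. It assumes a dedicated head with a single output channel (one depth value per pixel) rather than reusing the segmentation head; `max_depth = 10.0` and the random `fusion` tensor are placeholder values.

```python
import torch
import torch.nn as nn

max_depth = 10.0  # hypothetical upper bound of the depth range

# Stand-in for the concatenated, upsampled P1..P4 features
fusion = torch.randn(2, 256 * 4, 56, 56)

depth_head = nn.Sequential(
    nn.Conv2d(256 * 4, 256, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(256, 1, 1),  # one output channel: a scalar per pixel
)

# Sigmoid bounds the prediction to (0, 1); scaling maps it to (0, max_depth)
depth_map = torch.sigmoid(depth_head(fusion)) * max_depth
print(depth_map.shape)
```

The sigmoid-and-scale output is one common way to bound a regression target; other parameterizations (for example predicting log-depth) fit the same pyramid features.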

---



### **Case 4 – Feature Fusion / Global Pooling**

For classification or global tasks, sometimes we don’t need all scales.
You can do a **global average pooling** on the coarsest level:

$\text{cls\_vector} = \text{GAP}(P_4)$

and send it to a linear classifier.

Example:

```python
x = F.adaptive_avg_pool2d(P4, 1).flatten(1)
out = nn.Linear(256, num_classes)(x)
```

✅ This is how **FPN backbones** can still perform classification.
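In the inline example, `nn.Linear(...)` is created at call time, so its weights are freshly initialized on every call and are never trained. In a real model the classifier would be registered once as a submodule. A minimal sketch, with the hypothetical class name `PoolClassifier` and a random tensor standing in for $P_4$:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PoolClassifier(nn.Module):
    def __init__(self, in_channels=256, num_classes=10):
        super().__init__()
        # Registered once, so its weights are tracked and trainable
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, p4):
        # Global average pooling: (B, C, H, W) -> (B, C)
        pooled = F.adaptive_avg_pool2d(p4, 1).flatten(1)
        return self.fc(pooled)


p4 = torch.randn(2, 256, 7, 7)  # stand-in for the coarsest pyramid level
clf = PoolClassifier()
logits = clf(p4)
print(logits.shape)
```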

---



### **Summary of How Each P_i is Used**

| Task               | What we do with $P_1, P_2, P_3, P_4$                               | Output             |
| :----------------- | :----------------------------------------------------------------- | :----------------- |
| **Detection**      | Apply detection head to each Pᵢ separately (multi-scale)           | Class + bbox maps  |
| **Segmentation**   | Upsample all Pᵢ to same size, concatenate, predict per-pixel class | Segmentation mask  |
| **Depth / Flow**   | Similar to segmentation, but regression output                     | Depth / motion map |
| **Classification** | Use highest-level (P4), global avg pool                            | Image class        |

---

# **7. Intuition**

* **Backbone (PVT)** builds *a hierarchical pyramid of features* (edges → parts → objects).
* **FPN** refines it into *a multi-scale feature space* with uniform channels.
* **Heads** (detector / segmenter / depth regressor) consume those pyramids differently depending on the task.

So:

```
Image
  ↓
PVT Backbone → [C1..C4]
  ↓
FPN → [P1..P4]
  ↓
Task-specific head → output (boxes / masks / depth / class)
```

---

# **8. Bonus: Example of integration**

If you use **timm** and **torchvision.ops.FeaturePyramidNetwork**, you can combine them easily:

```python
from collections import OrderedDict

import timm
import torch
from torchvision.ops import FeaturePyramidNetwork

# 1. Backbone
backbone = timm.create_model('pvt_v2_b2', pretrained=True, features_only=True)
channels = backbone.feature_info.channels()

# 2. FPN
fpn = FeaturePyramidNetwork(in_channels_list=channels, out_channels=256)

# 3. Forward
x = torch.randn(1, 3, 224, 224)
features = backbone(x)
# The FPN takes an ordered mapping from name to feature map, highest resolution first
pyramid = fpn(OrderedDict((f"c{i}", f) for i, f in enumerate(features)))

for k, v in pyramid.items():
    print(k, v.shape)
```

✅ You get a dictionary like:

```
c0 torch.Size([1, 256, 56, 56])
c1 torch.Size([1, 256, 28, 28])
c2 torch.Size([1, 256, 14, 14])
c3 torch.Size([1, 256, 7, 7])
```

Perfect for plugging into a detector or segmenter.

---

# **9. In summary**

✅ FPN produces **P1–P4**, a multi-scale, uniform set of features.
✅ Each PVT output ($C_i$) contributes to one pyramid level.
✅ What happens next depends on your task:

| Task               | How you use FPN outputs                               |
| :----------------- | :---------------------------------------------------- |
| **Detection**      | Prediction head applied to every level (boxes + classes) |
| **Segmentation**   | Upsample + fuse all levels for dense pixel prediction |
| **Depth / Flow**   | Same as segmentation but with regression              |
| **Classification** | Pool highest-level P4 feature                         |

---
