# **DenseNet201 — Architecture, Motivation, Properties**

## 1. Core Idea

DenseNet stands for **Densely Connected Convolutional Network**.

Instead of each layer receiving only the output of the previous layer, **each layer receives the concatenation of all preceding feature maps** in the block.

Let $x_0$ be the block input and $x_1, \dots, x_L$ the outputs of its $L$ layers. Each layer then computes
$$x_l = H_l([x_0, x_1, \dots, x_{l-1}]),$$
where $[\cdot]$ denotes concatenation along the channel dimension.

So inside a dense block, **the channel dimension grows** as you go deeper.
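
As a small worked instance of this growth, assume (these symbols are introduced here for illustration) that the block input $x_0$ has $C_0$ channels and that every $H_l$ emits $k$ new channels. The width of the input to layer $l$ and the width of the block output then follow directly:

$$
\text{channels}([x_0,\dots,x_{l-1}]) = C_0 + (l-1)\,k, \qquad \text{channels}([x_0,\dots,x_L]) = C_0 + L\,k.
$$

For example, with $C_0 = 64$ and $k = 32$, the third layer receives $64 + 2 \cdot 32 = 128$ channels.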

---


## **2. DenseNet Architecture (e.g., DenseNet-121 / 169 / 201)**

The global structure is:

1. **Stem**
2. **Dense Block 1**
3. **Transition Layer 1**
4. **Dense Block 2**
5. **Transition Layer 2**
6. **Dense Block 3**
7. **Transition Layer 3**
8. **Dense Block 4**
9. **Classifier Head**

This mirrors the ResNet “stem → 4 stages → head” layout, but each stage is a dense block instead of a stack of residual blocks.

---

#### **1. Stem**

The stem is:

* $7 \times 7$ conv, stride 2
* BatchNorm
* ReLU
* $3 \times 3$ max-pool, stride 2

Input example:
$$B \times 3 \times 224 \times 224$$
Output:
$$B \times 64 \times 56 \times 56.$$

This matches ResNet’s stem exactly.
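
The stem shapes above can be checked with the standard output-size formula for convolution and pooling. This is a minimal sketch: the helper name `conv_out` is hypothetical, and the paddings (3 for the conv, 1 for the max-pool) are assumptions that the text does not state.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a conv or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

# Assumed paddings (not stated in the text): 3 for the 7x7 conv, 1 for the 3x3 max-pool.
after_conv = conv_out(224, kernel=7, stride=2, padding=3)
after_pool = conv_out(after_conv, kernel=3, stride=2, padding=1)
print(after_conv, after_pool)
```

Under these assumptions the $224 \times 224$ input is halved twice, which gives the $56 \times 56$ maps that enter Dense Block 1.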

---

#### 2. Dense Blocks (the 4 “stages”)

DenseNet has **4 dense blocks**, analogous to ResNet’s 4 stages.

But instead of residual blocks, a dense block contains **L dense layers**, where L depends on the model:

| Model        | Block 1 | Block 2 | Block 3 | Block 4 |
| ------------ | ------- | ------- | ------- | ------- |
| DenseNet-121 | 6       | 12      | 24      | 16      |
| DenseNet-169 | 6       | 12      | 32      | 32      |
| DenseNet-201 | 6       | 12      | 48      | 32      |
| DenseNet-264 | 6       | 12      | 64      | 48      |
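
To see what these layer counts imply for feature width, the sketch below traces the channel count after each dense block. It assumes a 64-channel stem, a growth rate of 32 channels per dense layer, and transitions that halve the channel count for every variant; `BLOCK_CONFIGS` and `channel_trace` are hypothetical names used only for this illustration.

```python
BLOCK_CONFIGS = {
    "DenseNet-121": (6, 12, 24, 16),
    "DenseNet-169": (6, 12, 32, 32),
    "DenseNet-201": (6, 12, 48, 32),
    "DenseNet-264": (6, 12, 64, 48),
}

def channel_trace(blocks, stem_channels=64, growth_rate=32, theta=0.5):
    """Channel count after each dense block, with a transition between blocks."""
    channels, trace = stem_channels, []
    for i, num_layers in enumerate(blocks):
        channels += num_layers * growth_rate   # each dense layer adds growth_rate channels
        trace.append(channels)
        if i < len(blocks) - 1:                # no transition after the last block
            channels = int(theta * channels)   # transition compresses by theta
    return trace

for name, blocks in BLOCK_CONFIGS.items():
    print(name, channel_trace(blocks))
```

The last entry of each trace is the width of the feature map that reaches the classifier head, for example 1920 channels for DenseNet-201 under these assumptions.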



---

#### 3. Transition Layers (between stages)

These are placed between dense blocks and do:

1. $1 \times 1$ convolution (channel compression): $C \to \theta C$, usually $\theta = 0.5$
2. $2 \times 2$ average pooling, stride 2

Their role is similar to that of ResNet's downsampling blocks:

* reduce spatial size
* reduce channels (ResNet's downsampling blocks instead increase the channel count)
* forward features to the next stage

But the mechanism is different:

* ResNet uses stride-2 conv in residual blocks
* DenseNet uses pooling + compression

---

#### 4. Classifier Head

At the end:

* Global average pooling
* Fully connected layer (num_classes)

Input example:
$$B \times C_{\text{final}} \times 7 \times 7$$
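Output of global average pooling:
$$B \times C_{\text{final}}$$
A linear classifier then maps this to $B \times \text{num\_classes}$ logits. For DenseNet-201, $C_{\text{final}} = 1920$.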

---


#### Clear Comparison: ResNet vs DenseNet Structure

| Component    | ResNet                 | DenseNet                           |
| ------------ | ---------------------- | ---------------------------------- |
| Stem         | Conv + BN + MaxPool    | Same                               |
| Stage 1      | Residual blocks × N    | Dense layers × L1                  |
| Downsampling | Stride 2 conv in block | Transition layer (pool + 1×1 conv) |
| Stage 2      | Residual blocks × N    | Dense layers × L2                  |
| Stage 3      | Residual blocks × N    | Dense layers × L3                  |
| Stage 4      | Residual blocks × N    | Dense layers × L4                  |
| Head         | GAP + FC               | Same                               |


## **3. DenseNet-201 Dense Block (Numeric Walkthrough, k = 32, L = 6)**

This section walks through the tensor shapes and per-layer operations inside Dense Block 1, including the bottleneck and the transition layer that follows the block.

---

#### **1. Input Setup**

Dense Block 1 receives:

* Batch size
  $$B = 4$$
* Input channels
  $$C_0 = 64$$
* Spatial dimensions
  $$H = W = 56$$

Input tensor:

$$x_0 \in \mathbb{R}^{4 \times 64 \times 56 \times 56}.$$

Dense block hyperparameters:

* Number of layers
  $$L = 6$$
* Growth rate
  $$k = 32$$
* Bottleneck channels
  $$4k = 128$$

Each dense layer adds **32 channels** to the concatenated feature map of the block.

---

#### **2. DenseNet Layer Architecture (L1–L6)**

Every dense layer has *identical internal structure*:

```
Input: concat([x0, x1, ..., x_{l-1}])
↓ BN
↓ ReLU
↓ 1×1 Conv  (output = 4k = 128 channels)
↓ BN
↓ ReLU
↓ 3×3 Conv  (output = k = 32 channels)
↓ Output = x_l  (32 channels)
```

Formal expression:

$$
x_l = H_l([x_0,\dots,x_{l-1}]) = \text{Conv}_{3\times3}\Big(\text{ReLU}\big(\text{BN}\big(\text{Conv}_{1\times1}(\text{ReLU}(\text{BN}([x_0,\dots,x_{l-1}])))\big)\big)\Big)
$$

### Important:

The **1×1 bottleneck always receives all concatenated inputs**, not just the previous layer’s output.

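So at each layer:

* Input channel count **increases** by $k = 32$
* 1×1 conv **maps** the input to 128 channels (an expansion while the input has fewer than 128 channels, a compression afterwards)
* 3×3 conv **creates** 32 new channels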
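
The layer structure above can be sketched in PyTorch as follows. This is an illustrative sketch, not the reference implementation: the class names `DenseLayer` and `DenseBlock`, the `bn_size` parameter name, the bias-free convolutions, and `padding=1` on the 3×3 conv are all assumptions made for this example.

```python
import torch
from torch import nn

class DenseLayer(nn.Module):
    """BN -> ReLU -> 1x1 conv (4k channels) -> BN -> ReLU -> 3x3 conv (k channels)."""

    def __init__(self, in_channels: int, growth_rate: int = 32, bn_size: int = 4):
        super().__init__()
        mid = bn_size * growth_rate  # bottleneck width, 4k = 128 for k = 32
        self.net = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, mid, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, growth_rate, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, features):
        # `features` is the list [x0, x1, ..., x_{l-1}]; concatenate along channels.
        return self.net(torch.cat(features, dim=1))

class DenseBlock(nn.Module):
    def __init__(self, num_layers: int, in_channels: int, growth_rate: int = 32):
        super().__init__()
        self.layers = nn.ModuleList(
            [DenseLayer(in_channels + i * growth_rate, growth_rate) for i in range(num_layers)]
        )

    def forward(self, x0):
        features = [x0]
        for layer in self.layers:
            features.append(layer(features))  # every layer sees all earlier feature maps
        return torch.cat(features, dim=1)

block = DenseBlock(num_layers=6, in_channels=64)
with torch.no_grad():
    out = block(torch.randn(4, 64, 56, 56))
out_shape = tuple(out.shape)
print(out_shape)
```

Keeping the feature maps in a Python list and concatenating inside each layer is one simple way to express the connectivity; it makes the growing input width of the 1×1 conv explicit in the constructor.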

---

#### **3. Layer-by-Layer Numeric Shapes (Full Dense Block)**

Each layer’s **input** = concatenation of the block input and all previous layer outputs.

**Layer 1**

Input:
$$4 \times 64 \times 56 \times 56$$

1×1 conv:
$$64 \rightarrow 128$$
Output:
$$4 \times 128 \times 56 \times 56$$

3×3 conv:
$$128 \rightarrow 32$$
Output:
$$x_1 \in \mathbb{R}^{4 \times 32 \times 56 \times 56}$$

Concatenated for next layer:
$$C = 64 + 32 = 96$$

---

 **Layer 2**

Input:
$$4 \times 96 \times 56 \times 56$$

1×1 conv:
$$96 \rightarrow 128$$

3×3 conv:
$$128 \rightarrow 32$$
Output:
$$x_2 \in \mathbb{R}^{4 \times 32 \times 56 \times 56}$$

Concatenated:
$$C = 64 + 2\cdot32 = 128$$

---

 **Layer 3**

Input:
$$4 \times 128 \times 56 \times 56$$

1×1 conv:
$$128 \rightarrow 128$$

3×3 conv:
$$128 \rightarrow 32$$
Output:
$$x_3 \in \mathbb{R}^{4 \times 32 \times 56 \times 56}$$

Concatenated:
$$C = 160$$

---

 **Layer 4**

Input:
$$4 \times 160 \times 56 \times 56$$

1×1 conv:
$$160 \rightarrow 128$$

3×3 conv:
$$128 \rightarrow 32$$

Output:
$$x_4 \in \mathbb{R}^{4 \times 32 \times 56 \times 56}$$

Concatenated:
$$C = 192$$

---

 **Layer 5**

Input:
$$4 \times 192 \times 56 \times 56$$

1×1 conv:
$$192 \rightarrow 128$$

3×3 conv:
$$128 \rightarrow 32$$

Output:
$$x_5 \in \mathbb{R}^{4 \times 32 \times 56 \times 56}$$

Concatenated:
$$C = 224$$

---

 **Layer 6**

Input:
$$4 \times 224 \times 56 \times 56$$

1×1 conv:
$$224 \rightarrow 128$$

3×3 conv:
$$128 \rightarrow 32$$

Output:
$$x_6 \in \mathbb{R}^{4 \times 32 \times 56 \times 56}$$

Final concatenated output:
$$C_{\text{out}} = 64 + 6\cdot32 = 256$$

So Dense Block 1 outputs:

$$\text{DB1 out} \in \mathbb{R}^{4 \times 256 \times 56 \times 56}.$$

General formula:

$$C_{\text{out}} = C_0 + Lk.$$
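
The per-layer channel counts and the general formula can be reproduced with a few lines of arithmetic. The function name `dense_block_channels` is hypothetical and used only for this sketch.

```python
def dense_block_channels(c0: int, num_layers: int, growth_rate: int):
    """Input channel count seen by each layer, plus the width of the block output."""
    inputs = [c0 + i * growth_rate for i in range(num_layers)]
    c_out = c0 + num_layers * growth_rate
    return inputs, c_out

inputs, c_out = dense_block_channels(c0=64, num_layers=6, growth_rate=32)
print(inputs, c_out)
```

The same function applies to any other block once its input width is known, for example `dense_block_channels(128, 12, 32)` for a 12-layer block fed with 128 channels.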

---

#### **4. Transition Layer After Dense Block**

DenseNet uses a transition layer:

* $1 \times 1$ convolution (channel compression)
* $2 \times 2$ average pooling (spatial downsampling)

Compression factor:
$$\theta = 0.5$$

Input:
$$4 \times 256 \times 56 \times 56$$

After $1 \times 1$ conv:
$$C' = \theta \cdot 256 = 128$$
Shape:
$$4 \times 128 \times 56 \times 56$$

After $2 \times 2$ avg pooling:
$$56 \rightarrow 28$$
Final shape:
$$4 \times 128 \times 28 \times 28$$

Thus:

* Input to block:
  $$4 \times 64 \times 56 \times 56$$
* Output of block:
  $$4 \times 256 \times 56 \times 56$$
* Output after transition:
  $$4 \times 128 \times 28 \times 28$$
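
As a quick check of the transition arithmetic, the sketch below applies the channel compression and the pooling to a shape tuple. `transition_shape` is a hypothetical helper; it only tracks shapes and performs no convolution.

```python
def transition_shape(shape, theta=0.5):
    """Shape after a 1x1 conv (channels scaled by theta) and 2x2 average pooling, stride 2."""
    b, c, h, w = shape
    return (b, int(theta * c), h // 2, w // 2)

block_out = (4, 256, 56, 56)
after_transition = transition_shape(block_out)
print(after_transition)
```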

---

#### **5. Final Combined Table (All Layers)**

| Layer | Input Channels | 1×1 Conv (bottleneck) | 3×3 Conv (growth) |
| ----- | -------------- | --------------------- | ----------------- |
| L1    | 64             | 64 → 128              | 128 → 32          |
| L2    | 96             | 96 → 128              | 128 → 32          |
| L3    | 128            | 128 → 128             | 128 → 32          |
| L4    | 160            | 160 → 128             | 128 → 32          |
| L5    | 192            | 192 → 128             | 128 → 32          |
| L6    | 224            | 224 → 128             | 128 → 32          |
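
The table also implies how the cost of each layer changes with depth: the 3×3 conv is the same size everywhere, while the 1×1 conv grows with the input width. A rough sketch of the weight counts, assuming convolutions without bias terms (the helper name is hypothetical):

```python
def dense_layer_conv_params(c_in: int, growth_rate: int = 32, bn_size: int = 4):
    """Weights in the two convolutions of one dense layer (no bias terms assumed)."""
    mid = bn_size * growth_rate
    conv1x1 = c_in * mid                  # 1x1 kernel: c_in * mid weights
    conv3x3 = mid * growth_rate * 3 * 3   # 3x3 kernel: mid * growth_rate * 9 weights
    return conv1x1, conv3x3

rows = [dense_layer_conv_params(64 + i * 32) for i in range(6)]
for i, (p1, p3) in enumerate(rows, start=1):
    print(f"L{i}: 1x1 conv {p1:>6} weights, 3x3 conv {p3} weights")
```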

---


## **4. Parameter Count**

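DenseNet-201 has **≈ 20 million parameters**. For comparison:

* ResNet-152 → 60M
* VGG-16 → 138M
* EfficientNet-B4 → 19M
* ConvNeXt-Tiny → 28M

That is far fewer than ResNet-152 or VGG-16, and in the same range as newer compact models such as EfficientNet-B4 and ConvNeXt-Tiny. The weights therefore take little storage, although activation memory during training is a separate matter because of the concatenations.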
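
The ≈ 20M figure can be reproduced by summing the layer sizes. The sketch below assumes a layout that the text does not spell out in full: bias-free convolutions, two learnable parameters per BatchNorm channel, a BatchNorm in front of each transition conv, one final BatchNorm before pooling, and a 1000-class linear head with bias. The function name `densenet_params` is hypothetical.

```python
def densenet_params(blocks, growth_rate=32, bn_size=4, stem=64, theta=0.5, num_classes=1000):
    """Learnable parameters under the assumptions listed in the lead-in."""
    mid = bn_size * growth_rate
    total = 3 * stem * 7 * 7 + 2 * stem                # stem 7x7 conv + BN
    c = stem
    for i, num_layers in enumerate(blocks):
        for _ in range(num_layers):
            total += 2 * c + c * mid                   # BN + 1x1 conv
            total += 2 * mid + mid * growth_rate * 9   # BN + 3x3 conv
            c += growth_rate
        if i < len(blocks) - 1:
            out_c = int(theta * c)
            total += 2 * c + c * out_c                 # transition BN + 1x1 conv
            c = out_c
    total += 2 * c                                     # final BN before pooling
    total += c * num_classes + num_classes             # linear classifier with bias
    return total

params_201 = densenet_params((6, 12, 48, 32))
params_121 = densenet_params((6, 12, 24, 16))
print(f"DenseNet-201: {params_201:,}  DenseNet-121: {params_121:,}")
```

Under these assumptions the DenseNet-201 total comes out at about 20.0M, consistent with the figure quoted above, and roughly 1.9M of that sits in the 1000-class head.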


#### DenseNet-121 Depth Breakdown

The depth in the model name counts:

1. Every **1×1 conv** in each dense layer
2. Every **3×3 conv** in each dense layer
3. The **conv** in the stem
4. Every **1×1 conv** in the transition layers
5. The final classifier **FC layer**

DenseNet-121 uses **DenseNet-BC** (Bottleneck + Compression), so each dense layer has **two convolutions**:

* 1×1 conv
* 3×3 conv

Thus:

#### Each dense layer = 2 conv layers

---

#### Step 1: Count dense block layers

Dense block layer counts:

* Block 1: 6
* Block 2: 12
* Block 3: 24
* Block 4: 16

Total dense layers:

$$6 + 12 + 24 + 16 = 58.$$

Each of these 58 layers contains **2 convs**:

Total convs inside dense blocks:

$$58 \times 2 = 116.$$

---

#### Step 2: Add the stem convolution

DenseNet stem has one convolution:

* $7 \times 7$ conv → 1 layer

So far:

$$116 + 1 = 117.$$

---

#### Step 3: Add transition-layer convolutions

There are **3 transition layers**, and each has:

* **1×1 conv** → 1 conv per transition

Thus:

$$+ 3 = 120.$$

---

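#### Step 4: Add the classifier layer

The final fully connected layer is counted as well:

$$120 + 1 = 121.$$

---

#### Final Count: 121 Layers

DenseNet-121 includes:

* 116 convs in dense blocks
* 1 conv in stem
* 3 convs in transitions
* 1 fully connected classification layer
* Total:

$$116 + 1 + 3 + 1 = 121.$$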

---


| Component              | Count        |
| ---------------------- | ------------ |
| Dense layers           | 58           |
| Conv per dense layer   | ×2           |
| Convs in dense blocks  | 58 × 2 = 116 |
| Stem conv              | +1           |
| Transition layer convs | +3           |
| Classifier FC layer    | +1           |
| **Total**              | **121**      |

That is why it's called **DenseNet-121**.
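
The same counting rule can be applied to the other variants. The sketch below assumes the fully connected classifier is counted as the last layer, which is what makes the arithmetic $116 + 1 + 3 + 1 = 121$ work; `densenet_depth` is a hypothetical helper name.

```python
def densenet_depth(blocks):
    """2 convs per dense layer + stem conv + one conv per transition + FC layer."""
    dense_convs = 2 * sum(blocks)
    transitions = len(blocks) - 1
    return dense_convs + 1 + transitions + 1

configs = {
    "DenseNet-121": (6, 12, 24, 16),
    "DenseNet-169": (6, 12, 32, 32),
    "DenseNet-201": (6, 12, 48, 32),
}
depths = {name: densenet_depth(cfg) for name, cfg in configs.items()}
print(depths)
```

For DenseNet-201 this gives $2 \cdot 98 + 1 + 3 + 1 = 201$. The (6, 12, 64, 48) configuration listed for DenseNet-264 evaluates to 265 under this rule, so it is left out of the dictionary here.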

---



## **5. When DenseNet201 Performs Well**

DenseNet201 is **excellent for**:

#### ✔ Medical imaging (X-ray, CT, MRI)

Features from every earlier layer stay directly available to later layers, and the short gradient paths keep training stable.

#### ✔ Small datasets

It has relatively few parameters and reuses features heavily, which can help limit overfitting.

#### ✔ Tasks requiring very deep effective receptive fields

It provides this depth without the compute cost of large ResNets.

#### ✔ Training from scratch or transfer learning

Dense connections help training converge even with limited data.

---

## **6. When It Is Not Ideal**

DenseNet201 is **not great** when:

**❌ You need very large input images**

Channel concatenation grows memory quickly.

**❌ You use extremely large batch sizes**

High memory usage in dense blocks limits scaling.

**❌ You need high inference speed**

Dense concatenation makes it slower than EfficientNet/ConvNeXt.
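
A rough way to see the memory pressure is to count how many channels are held if every layer's concatenated input is stored as its own tensor, as a naive implementation would do. The sketch below does this for Dense Block 3 of DenseNet-201, assuming the block receives 256 channels and uses a growth rate of 32; `naive_concat_channels` is a hypothetical helper name.

```python
def naive_concat_channels(c0: int, num_layers: int, growth_rate: int = 32):
    """Total channels held if each layer's concatenated input is stored separately."""
    return sum(c0 + i * growth_rate for i in range(num_layers))

stored = naive_concat_channels(c0=256, num_layers=48)  # Dense Block 3 of DenseNet-201
final = 256 + 48 * 32                                   # width of the block output
print(stored, final, round(stored / final, 1))
```

The stored total grows quadratically with the number of layers, while the block output grows only linearly, which is why deep dense blocks are memory-hungry unless the concatenations share storage.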

Despite these limitations, DenseNet201 remains a top-tier choice for your **Lung Disease Dataset** project.

---

## **7. How to Use DenseNet201 in PyTorch (timm)**


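```python
import timm

# Load ImageNet-pretrained DenseNet201. Passing num_classes=5 makes timm
# build a fresh 5-way classifier head, so it does not need to be replaced by hand.
model = timm.create_model(
    'densenet201',
    pretrained=True,
    num_classes=5
)

# The head is exposed as `model.classifier` (an nn.Linear with 5 outputs).
in_features = model.classifier.in_features
```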


---

## **8. Optimizer Recommendation**

For DenseNet201:

**Stage 1 (frozen backbone)**

Use:

* **Adam**, lr ≈ 1e-3 to 3e-4

Reason: Adam copes well with the randomly initialized classifier head.

**Stage 2 (full fine-tuning)**

Use:

* **AdamW**, lr ≈ 1e-4 to 3e-5
* Weight decay between **0.01 and 0.05**

DenseNet typically benefits from weight decay, plausibly because concatenation gives every layer a wide and growing set of input features to overfit on.
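
The two stages can be sketched as below. To keep the example self-contained it uses a tiny stand-in network rather than the real DenseNet201: `TinyNet`, `stage1_optimizer`, and `stage2_optimizer` are hypothetical names, and the only thing borrowed from the real model is that the head is called `classifier`. The learning rates and weight decay are picked from the ranges above.

```python
import torch
from torch import nn

class TinyNet(nn.Module):
    """Hypothetical stand-in for a pretrained backbone plus a `classifier` head."""

    def __init__(self, num_classes: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(8, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x))

def stage1_optimizer(model, lr=1e-3):
    """Stage 1: freeze everything except the classifier head, train it with Adam."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.classifier.parameters():
        p.requires_grad = True
    return torch.optim.Adam(model.classifier.parameters(), lr=lr)

def stage2_optimizer(model, lr=1e-4, weight_decay=0.01):
    """Stage 2: unfreeze all parameters and fine-tune with AdamW."""
    for p in model.parameters():
        p.requires_grad = True
    return torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)

model = TinyNet()
opt1 = stage1_optimizer(model)
trainable_stage1 = sum(p.requires_grad for p in model.parameters())
opt2 = stage2_optimizer(model)
trainable_stage2 = sum(p.requires_grad for p in model.parameters())
print(trainable_stage1, trainable_stage2)
```

With a real model, the same two helpers would be called on the network returned by `timm.create_model`, running a few epochs after each call.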

---

## **9. Compared to Other Models**

**Accuracy vs Parameter Count (rough)**

| Model           | Params | Top-1 accuracy (ImageNet) |
| --------------- | ------ | ------------------- |
| DenseNet201     | ~20M   | ~77%                |
| ResNet50        | 25M    | ~76%                |
| EfficientNet-B4 | 19M    | ~82%                |
| ConvNeXt-Tiny   | 28M    | ~82%                |
| Swin-Tiny       | 28M    | ~81%                |

DenseNet isn't SOTA anymore, but remains **very strong for medical imaging** because of feature reuse and stable gradients.

---