Below is a **practical** overview of the **most commonly used architectures in medical imaging** (classification, segmentation, detection), including:

* typical backbone choices
* the **most common head replacements** used in medical DL
* recommended **training policies**
* domain-shift strategies
* grayscale handling
* regularization & augmentations

This is exactly the knowledge you need when designing medical models with pretrained backbones (ResNet, EfficientNet, ViT, Swin, PVT…).

---

# 1. Medical Image **Classification**

Used for: X-ray classification, CT slice categorization, MRI plane prediction, ultrasound pathology detection, dermatology classification.

### **Most common backbones**

* EfficientNet-B0/B1/B3
* ResNet-50/101
* DenseNet-121/169 (widely used for chest X-ray work, e.g. CheXNet is a DenseNet-121)
* ViT-B/16 or DeiT-B
* Swin-T, Swin-S
* ConvNeXt-T/S

### **Most common head replacement**

A **2-layer MLP head** with dropout:

```
Dropout(p=0.3)
Linear(backbone_dim → 512)
ReLU
Dropout(p=0.3)
Linear(512 → num_classes)
```

Example dimensions:

* EfficientNet-B0: 1280
* ResNet-50: 2048
* ViT-B: 768
* Swin-T: 768
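
As a rough PyTorch sketch of the head above (assuming `torch` is installed; the helper name `make_head` and the 5-class example are illustrative and not part of any library):

```python
import torch
from torch import nn


def make_head(backbone_dim: int, num_classes: int,
              hidden: int = 512, p: float = 0.3) -> nn.Sequential:
    """Two-layer MLP head with dropout, matching the layout listed above."""
    return nn.Sequential(
        nn.Dropout(p),
        nn.Linear(backbone_dim, hidden),
        nn.ReLU(inplace=True),
        nn.Dropout(p),
        nn.Linear(hidden, num_classes),
    )


# Example: pooled ResNet-50 features (2048-d) for a hypothetical 5-class task.
head = make_head(backbone_dim=2048, num_classes=5)
features = torch.randn(4, 2048)   # batch of 4 pooled feature vectors
logits = head(features)           # shape: (4, 5)
```

With a torchvision ResNet this module would typically be assigned to `model.fc`; other libraries expose the classifier under different attribute names.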

### **Why this head?**

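* acts as a lightweight domain adapter between ImageNet features and the medical task
* dropout reduces (but does not eliminate) overfitting
* has few enough parameters to train on small datasets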

### **Training policy**

The most standard policy in medical imaging:

#### **Stage 1 — Freeze backbone**

```
freeze(backbone)
train classifier head only
epochs: 3–10
lr: 1e-3 or 1e-4
```

#### **Stage 2 — Unfreeze entire backbone**

```
unfreeze(backbone)
train end-to-end
epochs: 10–50
lr: 1e-5 or 3e-5
schedule: cosine or ReduceLROnPlateau
```

This is a widely used "medical fine-tuning" recipe, though not the only one: some setups skip Stage 1 or unfreeze the backbone gradually.
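
A minimal sketch of the two stages in PyTorch. The tiny `backbone` and `head` here are toy stand-ins (assumptions for illustration); a real setup would load a pretrained model instead, and the training loops are omitted:

```python
import torch
from torch import nn

# Toy stand-ins (assumptions): a real setup would load a pretrained backbone.
backbone = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
head = nn.Linear(8, 2)
model = nn.Sequential(backbone, head)


def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Enable or disable gradients for every parameter in `module`."""
    for param in module.parameters():
        param.requires_grad = trainable


# Stage 1: freeze the backbone, optimize only the head with a larger LR.
set_trainable(backbone, False)
stage1_opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
stage1_trainable = sum(p.requires_grad for p in model.parameters())

# ... train the head for 3 to 10 epochs ...

# Stage 2: unfreeze everything and continue with a much smaller LR.
set_trainable(backbone, True)
stage2_opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(stage2_opt, T_max=50)
stage2_trainable = sum(p.requires_grad for p in model.parameters())
```

Note that `requires_grad = False` does not stop BatchNorm layers from updating their running statistics; calling `backbone.eval()` during Stage 1 handles that.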

### **Grayscale handling**

Most common approach:

```
repeat grayscale → 3 channels
```

Works well with ImageNet-pretrained CNNs and ViTs, since their first layer expects 3 input channels.
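
In PyTorch the channel repeat is a one-liner. A small sketch, assuming image batches shaped `(N, 1, H, W)`:

```python
import torch

# A batch of 2 single-channel images, shape (N, 1, H, W).
gray = torch.rand(2, 1, 224, 224)

# Copy the one channel three times -> (N, 3, H, W), as pretrained RGB models expect.
rgb_like = gray.repeat(1, 3, 1, 1)
```

`gray.expand(-1, 3, -1, -1)` yields the same values without copying memory, which can matter for large batches.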

---

# 2. Medical **Segmentation**

Tasks: tumor segmentation, organ segmentation, vessel extraction, lesion boundary.

### **Most common architectures**

#### **A. U-Net**

The de facto standard.
Variants:

* U-Net
* U-Net++
* Attention U-Net
* Residual U-Net

Heads:

```
Conv → BN → ReLU → Conv → Sigmoid/Softmax
```
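
One way the U-Net head above might look in PyTorch. The class name `SegHead` and the channel counts are illustrative assumptions, not taken from a specific U-Net implementation:

```python
import torch
from torch import nn


class SegHead(nn.Module):
    """Conv -> BN -> ReLU -> 1x1 Conv, returning per-pixel logits."""

    def __init__(self, in_channels: int, num_classes: int, mid_channels: int = 32):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, num_classes, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)


head = SegHead(in_channels=64, num_classes=3)
decoder_out = torch.randn(2, 64, 128, 128)   # final decoder feature map
logits = head(decoder_out)                   # (2, 3, 128, 128)
probs = logits.softmax(dim=1)                # multi-class; use sigmoid for binary/multi-label
```

Losses such as `nn.CrossEntropyLoss` expect raw logits, so the softmax is usually applied only at inference time.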

#### **B. SegFormer (very popular now)**

Backbone: MiT-B0/B1/B2

Head:

```
MLP “SegFormer head”:
    • project multi-scale features
    • concat
    • upsample
    • linear classifier on fused features
```

#### **C. Swin-UNet**

Encoder: Swin Transformer
Decoder: symmetric, U-Net-style Swin decoder with skip connections (patch merging downsamples in the encoder; patch expanding upsamples in the decoder)

Head:

```
Conv2d(embedding_dim → num_classes, kernel=1)
```

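#### **D. nnU-Net (self-configuring; a strong baseline that has won many segmentation challenges)**

Automatically configures, based on the dataset's properties:

* U-Net depth and topology
* patch size and batch size
* resampling (target spacing)
* intensity normalization

The optimizer (SGD with Nesterov momentum) and the augmentation pipeline are fixed defaults rather than tuned per dataset.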

### **Training policy**

* usually full fine-tuning (freezing the encoder is uncommon)
* optimizer: AdamW (nnU-Net itself uses SGD with Nesterov momentum)
* lr ~ 1e-4 (for AdamW)
* batch size small (1–4)
* epochs 300–1000 for 3D, 100–250 for 2D
* heavy augmentations (rotation, elastic, gamma, contrast, blur)

---

# 3. Medical **Detection / Localization**

Used for: nodule detection, polyp detection, mass detection in mammography.

### **Most common architectures**

* Faster R-CNN (backbone: ResNet-50)
* RetinaNet
* YOLOv5/YOLOv8 (popular where real-time inference matters)
* Vision Transformer detectors (rare but growing: ViTDet, detectors with Swin backbones)

### **Common head replacement**

Typical head layout for 2D detection (parallel classification and box-regression branches, RetinaNet-style):

```
Conv → ReLU → Conv → class_scores
Conv → ReLU → Conv → box_regression
```
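
A simplified sketch of the two parallel branches in PyTorch. The anchor count, class count, and channel width are illustrative assumptions, and real RetinaNet heads use deeper towers (four conv layers before the output conv):

```python
import torch
from torch import nn


def subnet(in_channels: int, out_channels: int) -> nn.Sequential:
    """Conv -> ReLU -> Conv tower, mirroring the two branches listed above."""
    return nn.Sequential(
        nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
    )


num_anchors, num_classes, channels = 9, 2, 256          # illustrative values
cls_head = subnet(channels, num_anchors * num_classes)  # class scores per anchor
box_head = subnet(channels, num_anchors * 4)            # 4 box offsets per anchor

feature_map = torch.randn(1, channels, 32, 32)  # one feature-pyramid level
class_scores = cls_head(feature_map)            # (1, 18, 32, 32)
box_regression = box_head(feature_map)          # (1, 36, 32, 32)
```

Both branches keep the spatial resolution of the feature map, so every location predicts scores and offsets for each of its anchors.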

YOLO style uses:

```
Conv → Conv → (class + box)
```

### **Training policy**

* start from COCO-pretrained weights
* freeze for ~1 epoch (optional)
* then unfreeze all
* lr: 1e-4 to 1e-3
* augmentation (mosaic, scale, flip)
* 100–300 epochs

---

# 4. Medical **3D classification / 3D segmentation**

Used for CT/MRI volumes.

### **Most common models**

* 3D U-Net
* 3D ResNet
* 3D DenseNet-121 (e.g. MONAI's `DenseNet121` with `spatial_dims=3`)
* nnU-Net 3D full resolution
* Swin-UNETR (3D transformer-based)

### **Heads**

Classification head:

```
Adaptive global pool 3D
Linear(feature_dim → num_classes)    # feature_dim depends on the backbone, e.g. 512 for 3D ResNet-18/34
```

Segmentation head:

```
Conv3d → BN → ReLU → Conv3d → Softmax
```

### **Training policy**

* usually no freezing (train end-to-end)
* 3D patching (64³, 96³, 128³)
* lr: 2e-4
* epochs: 500–1500
* lots of augmentations

---

# 5. Medical **Self-Supervised Pretraining** (very common now)

### Most common SSL methods in medical imaging:

* SimCLR
* BYOL
* DINO
* Masked Autoencoders (MAE)
* Models Genesis
* Swin MAE (reported to work well on 3D CT)

### Typical head during SSL:

Projection head (MLP):

```
Linear(d → d)
ReLU
Linear(d → projection_dim)
```

The projection head is discarded after SSL; only the backbone is kept and fine-tuned for classification/segmentation.
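
A sketch of such a projection head in PyTorch. The sizes `d = 768` and `projection_dim = 128` are illustrative assumptions, and the L2 normalization is what contrastive methods such as SimCLR apply before computing cosine similarities:

```python
import torch
from torch import nn
import torch.nn.functional as F

d, projection_dim = 768, 128   # illustrative sizes

projection_head = nn.Sequential(
    nn.Linear(d, d),
    nn.ReLU(inplace=True),
    nn.Linear(d, projection_dim),
)

backbone_features = torch.randn(8, d)   # stand-in for encoder outputs
# Unit-length embeddings, ready for a cosine-similarity based contrastive loss.
z = F.normalize(projection_head(backbone_features), dim=1)
```

For fine-tuning, only the encoder weights are carried over; `projection_head` is simply not reused.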

---

# 6. What heads are used for **ViT / Swin / DeiT** in medical imaging?

### Classification:

```
MLP head:
  Linear(dim → 512)
  GELU
  Dropout
  Linear(512 → num_classes)
```

### Segmentation:

Use:

* SegFormer head
* UNETR head
* Swin-UNETR head

These fuse transformer features taken from multiple encoder stages or layers.

---

# 7. Common training tricks in medical imaging

### **Loss functions**

Classification:

* BCEWithLogitsLoss
* Focal loss
* Weighted cross entropy
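
To make the effect of focal loss concrete, here is a plain-Python sketch for a single binary prediction. The defaults `gamma=2.0` and `alpha=0.25` follow the original focal loss paper; the function name is an assumption for illustration:

```python
import math


def binary_focal_loss(p: float, target: int,
                      gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Focal loss for one prediction; `p` is the predicted probability of class 1."""
    p_t = p if target == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha  # class-balancing factor
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))


easy = binary_focal_loss(0.95, 1)   # confident and correct: heavily down-weighted
hard = binary_focal_loss(0.10, 1)   # confident and wrong: dominates the loss
plain_ce = -math.log(0.95)          # unweighted cross entropy for the easy case
```

With `gamma=0` the modulating factor disappears and the expression reduces to alpha-weighted cross entropy.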

Segmentation:

* Dice loss
* Cross entropy
* Combo: 0.5·Dice + 0.5·CE
* Tversky loss (for imbalanced lesions)
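
The 0.5·Dice + 0.5·CE combination can be illustrated in plain Python for the binary case. This is a toy sketch over flattened pixel lists (function names are assumptions); real training code would use tensor versions, for example MONAI's `DiceCELoss`:

```python
import math


def dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss over flattened foreground probabilities and 0/1 labels."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)


def bce_loss(probs, targets, eps=1e-12):
    """Mean binary cross entropy over all pixels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


def combo_loss(probs, targets, w_dice=0.5, w_ce=0.5):
    """Weighted sum of Dice and cross-entropy terms."""
    return w_dice * dice_loss(probs, targets) + w_ce * bce_loss(probs, targets)


targets = [0, 0, 0, 1]                            # small lesion: 1 foreground pixel of 4
good = combo_loss([0.1, 0.1, 0.1, 0.9], targets)  # lesion found
bad = combo_loss([0.1, 0.1, 0.1, 0.1], targets)   # lesion missed
```

The Dice term reacts strongly when a small foreground region is missed, while the cross-entropy term keeps per-pixel gradients well behaved.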

### **Optimizers**

* AdamW (the most common default)
* SGD with Nesterov momentum (nnU-Net's default)
* sometimes Adamax
* weight decay: 1e-4

### **Augmentations**

* slight rotation (<15°)
* horizontal/vertical flips (only where anatomically plausible; flips can swap left/right laterality)
* CLAHE
* Gaussian blur
* gamma correction
* brightness/contrast adjustments
* elastic deformation

### **Normalization**

* per-image z-score
* or per-dataset mean/std
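
Both normalization options, sketched in plain Python on a flattened toy image (function names are assumptions; tensor code would do the same arithmetic per image or per channel):

```python
import statistics


def zscore(pixels, eps=1e-8):
    """Per-image z-score: subtract this image's mean, divide by its std."""
    mean = statistics.fmean(pixels)
    std = statistics.pstdev(pixels)
    return [(v - mean) / (std + eps) for v in pixels]


def normalize_with_dataset_stats(pixels, mean, std):
    """Per-dataset variant: `mean` and `std` are computed once over the training set."""
    return [(v - mean) / std for v in pixels]


image = [0.0, 50.0, 100.0, 150.0, 200.0]   # flattened toy intensities
normalized = zscore(image)
```

Per-image z-scoring removes scanner-dependent brightness offsets, whereas dataset statistics preserve intensity differences between images.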

### **Training flow**

Most common:

```
1) Start with ImageNet-pretrained CNN/Transformer
2) Replace head with small MLP + dropout
3) Freeze backbone for warm-up
4) Unfreeze everything → train with small LR
5) Use class weights or focal loss if imbalance
6) Use medical augmentations (contrast, gamma)
7) Reduce LR when performance plateaus
```
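
For step 5 of the flow above, class weights are often derived from label counts. A small sketch using the common "balanced" heuristic, n_samples / (n_classes * class_count); the function name and the toy label list are assumptions:

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in sorted(counts.items())}


# Toy label list: 90 normal scans (class 0), 10 abnormal scans (class 1).
labels = [0] * 90 + [1] * 10
weights = inverse_frequency_weights(labels)
```

In PyTorch these values would typically be passed as the `weight` tensor of `nn.CrossEntropyLoss`, ordered by class index.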

---

# 8. Ready-to-use templates (summary)

---

## **A. Classification (most common template)**

```
EfficientNet-B0 backbone
    ↓
GlobalAvgPool
    ↓
Dropout(0.3)
Linear(1280→512) + ReLU
Dropout(0.3)
Linear(512→C)
```

Training:

* frozen backbone: 5–10 epochs (head only)
* unfrozen: 20–50 epochs (end-to-end)
* LR: 1e-3 for the frozen stage, then 1e-5 after unfreezing
* loss: focal or weighted CE

---

## **B. Segmentation (most common template)**

```
SegFormer-B1 backbone (MiT-B1)
    ↓
SegFormer Head (MLP + upsampling)
    ↓
Conv2d → Softmax
```

or

```
UNet encoder: ResNet-34
Decoder: U-Net
Head: Conv → Conv → Sigmoid/Softmax
```

---

## **C. Detection template (medical YOLO)**

```
YOLOv8n/s backbone
Detect head (Conv → Conv → class+box)
```

Training:

* full fine-tuning
* lr: 1e-3
* epochs: 200

---

## **D. 3D segmentation template**

```
Swin-UNETR
Swin Transformer encoder (3D patches, shifted-window attention)
CNN decoder with U-Net-style skip connections
Conv3d → Softmax
```

---

# If you want

I can also prepare:

* a **full unified table** for classification/segmentation/detection
* PyTorch templates for each
* or recommendations tailored to your dataset (CT, MRI, X-ray, etc.)
