# **Swin Transformer – A Step-by-Step Tutorial**

The **Swin Transformer** (Shifted Window Transformer, Liu et al., 2021) extends the **Vision Transformer (ViT)** to handle **high-resolution** and **dense prediction tasks** (e.g., detection, segmentation) efficiently — without losing its transformer flexibility.

---

## 1. Motivation

**ViT** treats an image as a sequence of patches and applies **global self-attention**.
While this works well for classification, it faces key limitations:

* **Quadratic complexity** in the number of patches.
* **No local inductive bias** (poor handling of fine details).
* **Single-scale feature maps** at one fixed, coarse resolution, which suits dense prediction poorly.

The **Swin Transformer** solves these problems by:

1. Applying **local attention** inside non-overlapping windows (reducing complexity).
2. **Shifting windows** between layers to connect across regions.
3. Introducing **patch merging** to build a **hierarchical (multi-scale)** representation — similar to CNNs.

---

## 2. Architecture Overview

Swin Transformer follows a **hierarchical pyramid design**, much like ResNet. For an $ H \times W $ input and a base channel count of $ C = 96 $:

| Stage   | Output Resolution | Patch Operation                    | Output Channels | Description                    |
| ------- | ----------------- | ---------------------------------- | --------------- | ------------------------------ |
| Stage 1 | H/4 × W/4         | Patch Partition + Linear Embedding | 96              | 4×4 patches embedded as tokens |
| Stage 2 | H/8 × W/8         | Patch Merging                      | 192             | 2× downsampling                |
| Stage 3 | H/16 × W/16       | Patch Merging                      | 384             | 2× downsampling                |
| Stage 4 | H/32 × W/32       | Patch Merging                      | 768             | 2× downsampling                |

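Each stage stacks an even number of **Swin Transformer Blocks**, which come in consecutive pairs with alternating attention types. Every block contains:

1. **W-MSA** (Window-based Multi-Head Self-Attention) in the first block of a pair, or **SW-MSA** (Shifted-Window Multi-Head Self-Attention) in the second
2. **Feed-forward MLP**
3. **LayerNorm** before each of these two sub-layers, and a **residual connection** around each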


<img src="images/two_successive_swin_transformer_blocks.png" height="30%" width="30%" />


<img src="images/swin_architecture.png" height="60%" width="60%" />


---


The four standard model sizes differ in the Stage 1 channel count $ C $ and in the number of blocks per stage:

- Swin-T: $ C = 96 $, layer numbers $ = \{2, 2, 6, 2\} $
- Swin-S: $ C = 96 $, layer numbers $ = \{2, 2, 18, 2\} $
- Swin-B: $ C = 128 $, layer numbers $ = \{2, 2, 18, 2\} $
- Swin-L: $ C = 192 $, layer numbers $ = \{2, 2, 18, 2\} $
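As a quick sanity check, the settings listed above can be written down as a small configuration table. The sketch below is illustrative only: the names `VARIANTS` and `describe` are hypothetical, and the per-stage channel counts assume that each later stage doubles the channel width, as patch merging does.

```python
# Variant settings as listed above: Stage 1 channel count C and blocks per stage.
VARIANTS = {
    "Swin-T": {"C": 96,  "depths": (2, 2, 6, 2)},
    "Swin-S": {"C": 96,  "depths": (2, 2, 18, 2)},
    "Swin-B": {"C": 128, "depths": (2, 2, 18, 2)},
    "Swin-L": {"C": 192, "depths": (2, 2, 18, 2)},
}

def describe(name):
    """Return per-stage channel widths and the total block count for a variant."""
    cfg = VARIANTS[name]
    # Assumption: channels double at every stage after the first.
    channels = [cfg["C"] * 2**i for i in range(len(cfg["depths"]))]
    return {"channels": channels, "total_blocks": sum(cfg["depths"])}

for name in VARIANTS:
    print(name, describe(name))
```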




## 3. Step-by-Step Pipeline

### Step 1: Patch Partition and Embedding

Split the input image into non-overlapping 4×4 patches.

If the image is of size
$$ H \times W \times 3 $$
then after partition:
$$ \frac{H}{4} \times \frac{W}{4} \times 48 $$
since each patch has $ 4 \times 4 \times 3 = 48 $ elements.

A **linear projection** maps these 48-d vectors into an embedding dimension (e.g., 96):
$$ X \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times 96} $$
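The shape bookkeeping above can be checked with a minimal NumPy sketch. The function name `patch_partition` and the randomly initialised weights `W_embed` and `b_embed` are assumptions made for illustration; in a trained model the projection is learned.

```python
import numpy as np

def patch_partition(image, patch_size=4):
    """Split an (H, W, C) image into non-overlapping patches and flatten each one."""
    H, W, C = image.shape
    p = patch_size
    assert H % p == 0 and W % p == 0, "H and W must be divisible by patch_size"
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (H/p, W/p, p*p*C)
    x = image.reshape(H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(H // p, W // p, p * p * C)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
patches = patch_partition(image)              # (56, 56, 48)

# Hypothetical projection weights; learned in a real model.
embed_dim = 96
W_embed = rng.standard_normal((48, embed_dim)) * 0.02
b_embed = np.zeros(embed_dim)
tokens = patches @ W_embed + b_embed          # (56, 56, 96)

print(patches.shape, tokens.shape)
```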

---

### Step 2: Window-based Multi-Head Self-Attention (W-MSA)

Each feature map is divided into **non-overlapping windows** (e.g., 7×7 patches).
Attention is computed **independently** within each window.

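For a feature map of $ H \times W $ patches, this reduces the attention cost from $ O((HW)^2) $ for global attention to
$$ O(M^2HW) $$
where $ M $ is the window size (e.g., 7). Because $ M $ is fixed, the cost grows linearly with the number of patches.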
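To make the saving concrete, the sketch below evaluates the multiply-add counts given in the Swin paper, $ 4hwC^2 + 2(hw)^2C $ for global MSA and $ 4hwC^2 + 2M^2hwC $ for W-MSA, where $ h \times w $ is the patch grid and $ C $ the channel count. The function names are hypothetical, and the counts ignore softmax and other minor terms.

```python
def msa_flops(h, w, C):
    """Global self-attention: projections plus an (hw x hw) attention map."""
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C

def wmsa_flops(h, w, C, M):
    """Window attention: same projections, but each token attends to M*M tokens."""
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

h = w = 56   # Stage 1 patch grid for a 224x224 image
C, M = 96, 7

print("global MSA :", msa_flops(h, w, C))
print("window MSA :", wmsa_flops(h, w, C, M))
# Doubling the resolution multiplies the window cost by 4 (linear in h*w),
# while the global attention term grows by 16.
print("window MSA at 2x resolution:", wmsa_flops(2 * h, 2 * w, C, M))
```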

In this setup,

> **All queries within a window share the same key set.**

That is, every patch inside a window can only attend to other patches inside **that same window**, not across windows.

Formally, for window $ w $:
$$
Q^{(w)} = X^{(w)}W^Q,\quad
K^{(w)} = X^{(w)}W^K,\quad
V^{(w)} = X^{(w)}W^V
$$
and
$$
\text{Attention}^{(w)} = \text{Softmax}\!\left(\frac{Q^{(w)}{K^{(w)}}^{T}}{\sqrt{d}}\right)V^{(w)}
$$

This local attention structure introduces **spatial locality** and scales efficiently with image size.
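The window-local computation can be sketched in NumPy as follows. This is a simplified single-head version under stated assumptions: the helper names and the random weights `Wq`, `Wk`, `Wv` are illustrative, and the relative position bias is left out.

```python
import numpy as np

def window_partition(x, M):
    """(H, W, C) -> (num_windows, M*M, C), windows in row-major order."""
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, M * M, C)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(x, Wq, Wk, Wv, M):
    """Single-head attention computed independently inside each M x M window."""
    windows = window_partition(x, M)                    # (nW, M*M, C)
    Q, K, V = windows @ Wq, windows @ Wk, windows @ Wv
    d = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d)      # (nW, M*M, M*M)
    return softmax(scores) @ V                          # (nW, M*M, d)

rng = np.random.default_rng(0)
H = W = 14
C, M = 8, 7
x = rng.standard_normal((H, W, C))
Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))

out = window_attention(x, Wq, Wk, Wv, M)
print(out.shape)  # (4, 49, 8): four windows of 49 tokens each
```

Because each window is processed as a separate batch entry, changing a token in one window leaves the outputs of every other window untouched.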

---

### Step 3: Shifted-Window Multi-Head Self-Attention (SW-MSA)

In the **next block**, windows are **shifted by half the window size**, i.e. $ \lfloor M/2 \rfloor $ patches along each axis (e.g., 3 patches if $ M = 7 $).
This shift allows patches that were previously in separate windows to now fall in the same window.

Thus, alternating between **W-MSA** and **SW-MSA** layers enables:

* Local attention within windows.
* Cross-window communication.
* Gradual expansion of the receptive field — achieving a global view over multiple layers.
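One common way to realise the shift is a cyclic roll of the feature map followed by the regular window partition. The sketch below only tracks which window each patch lands in; `window_ids` is a hypothetical helper. It also omits the attention mask that the paper applies so that patches wrapped around from the opposite border do not attend to each other.

```python
import numpy as np

def window_ids(H, W, M, shift=0):
    """Window index of every patch on an H x W grid, after an optional cyclic shift."""
    rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # np.roll(x, -shift) moves the element at index i to index (i - shift) mod n.
    rows = (rows - shift) % H
    cols = (cols - shift) % W
    return (rows // M) * (W // M) + (cols // M)

H = W = 14
M = 7
regular = window_ids(H, W, M)                 # W-MSA partition
shifted = window_ids(H, W, M, shift=M // 2)   # SW-MSA partition (shift of 3)

a, b = (6, 6), (7, 7)   # diagonal neighbours on either side of a window border
print(regular[a], regular[b])   # 0 3 -> different windows
print(shifted[a], shifted[b])   # 0 0 -> same window after the shift
```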

---

### Step 4: Patch Merging – Building a Hierarchy

To reduce spatial resolution and increase semantic richness, Swin introduces **Patch Merging**, analogous to CNN downsampling.

Given:
$$ X \in \mathbb{R}^{H \times W \times C} $$

1. **Group 2×2 neighboring patches:**
   Each group of four patches is concatenated:
   $$
   [x_{00}, x_{01}, x_{10}, x_{11}] \in \mathbb{R}^{4C}
   $$

2. **Linear Projection:**
   Reduce dimensionality from $ 4C $ → $ 2C $:
   $$
   X' = \text{Linear}(\text{Concat}_{2\times2}(X)) \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times 2C}
   $$

Thus:

* Spatial size halves along each axis, so the number of patches drops by 4×.
* Channel dimension doubles.

After several patch-merging steps, the model forms a **feature pyramid** where deeper stages capture more abstract semantics.
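The two steps of patch merging can be sketched in NumPy as below. The weight matrix `W_reduce` is a random stand-in for the learned projection, and the concatenation order follows the formula above. The reference implementation also applies a LayerNorm to the $ 4C $ features before the projection, which this sketch omits.

```python
import numpy as np

def patch_merging(x, W_reduce):
    """(H, W, C) -> (H/2, W/2, 2C): concatenate each 2x2 group, then project 4C -> 2C."""
    H, W, C = x.shape
    assert H % 2 == 0 and W % 2 == 0, "H and W must be even"
    x00 = x[0::2, 0::2]   # top-left patch of every 2x2 group
    x01 = x[0::2, 1::2]   # top-right
    x10 = x[1::2, 0::2]   # bottom-left
    x11 = x[1::2, 1::2]   # bottom-right
    merged = np.concatenate([x00, x01, x10, x11], axis=-1)   # (H/2, W/2, 4C)
    return merged @ W_reduce                                 # (H/2, W/2, 2C)

rng = np.random.default_rng(0)
C = 96
x = rng.standard_normal((56, 56, C))
W_reduce = rng.standard_normal((4 * C, 2 * C)) * 0.02   # hypothetical weights

y = patch_merging(x, W_reduce)
print(x.shape, "->", y.shape)
```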

---

## 4. Inside a Swin Transformer Block

Given an input $ X \in \mathbb{R}^{H \times W \times C} $:

1. **LayerNorm:**
   $$
   \hat{X} = \text{LN}(X)
   $$
2. **(Shifted) Window Attention:**
   $$
   X' = X + \text{WindowAttention}(\hat{X})
   $$
3. **Feed-forward (MLP) with residual:**
   $$
   X'' = X' + \text{MLP}(\text{LN}(X'))
   $$
4. **MLP structure:**
   $$
   \text{MLP}(x) = \text{Linear}_2(\text{GELU}(\text{Linear}_1(x)))
   $$
   where hidden dimension = $ 4C $.
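The block's pre-norm residual structure can be sketched as follows. This is a simplified illustration rather than a full implementation: `layer_norm` has no learnable scale or shift, GELU uses the tanh approximation, and `attention_stub` is a placeholder standing in for the (shifted) window attention.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise over the channel axis (no learnable scale/shift in this sketch)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, W1, b1, W2, b2):
    return gelu(x @ W1 + b1) @ W2 + b2

def swin_block(x, attention_fn, mlp_params):
    x = x + attention_fn(layer_norm(x))        # X'  = X  + WindowAttention(LN(X))
    x = x + mlp(layer_norm(x), *mlp_params)    # X'' = X' + MLP(LN(X'))
    return x

rng = np.random.default_rng(0)
H = W = 14
C = 8
x = rng.standard_normal((H, W, C))
W1, b1 = rng.standard_normal((C, 4 * C)) * 0.02, np.zeros(4 * C)   # hidden dim = 4C
W2, b2 = rng.standard_normal((4 * C, C)) * 0.02, np.zeros(C)

# Placeholder: any function mapping (H, W, C) -> (H, W, C) can be plugged in here.
W_mix = rng.standard_normal((C, C)) * 0.02
attention_stub = lambda t: t @ W_mix

y = swin_block(x, attention_stub, (W1, b1, W2, b2))
print(y.shape)  # (14, 14, 8)
```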

---

## 5. Multi-Head Self-Attention Inside a Window

Within each window:

$$
\text{Attention}(Q, K, V) = \text{Softmax}\!\left(\frac{QK^T}{\sqrt{d}} + B\right)V
$$

Here:

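* $ B \in \mathbb{R}^{M^2 \times M^2} $: learnable **relative position bias**, indexed by the relative offset between two patches, for spatial awareness.
* $ Q, K, V $: queries, keys, and values derived from the same window.
* $ d $: the query/key dimension of each head.
* Computation is efficient since windows are small.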
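A sketch of how the bias $ B $ can be assembled: since relative offsets along each axis lie in $ [-(M-1), M-1] $, a table with $ (2M-1)^2 $ entries per head is enough, and $ B $ is gathered from it by index. The names `relative_position_index` and `bias_table` are illustrative, and the table is random here rather than learned.

```python
import numpy as np

def relative_position_index(M):
    """Map every (query, key) pair in an M x M window to a row of the bias table."""
    coords = np.stack(np.meshgrid(np.arange(M), np.arange(M), indexing="ij"))
    coords = coords.reshape(2, -1)                     # (2, M*M): row and column of each patch
    rel = coords[:, :, None] - coords[:, None, :]      # (2, M*M, M*M), values in [-(M-1), M-1]
    rel = rel + (M - 1)                                # shift to [0, 2M-2]
    return rel[0] * (2 * M - 1) + rel[1]               # (M*M, M*M), values in [0, (2M-1)^2 - 1]

M, num_heads = 7, 3
rng = np.random.default_rng(0)
bias_table = rng.standard_normal(((2 * M - 1) ** 2, num_heads)) * 0.02   # hypothetical values

index = relative_position_index(M)
B = bias_table[index].transpose(2, 0, 1)   # (num_heads, M*M, M*M), added to QK^T / sqrt(d)
print(index.shape, B.shape)
```

Pairs of patches with the same relative offset share one table entry, so the bias depends only on relative position and not on where the pair sits inside the window.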

---

## 6. Hierarchical Output Example

For a 224×224 input image, Swin-T ($ C = 96 $) produces the following multi-scale features:

| Stage | Resolution | Channels |
| ----- | ---------- | -------- |
| 1     | 56×56      | 96       |
| 2     | 28×28      | 192      |
| 3     | 14×14      | 384      |
| 4     | 7×7        | 768      |

These outputs form a **feature pyramid** — ideal for:

* **Classification:** via global average pooling.
* **Detection/Segmentation:** as backbone features for an FPN inside frameworks such as Mask R-CNN.
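The table values follow directly from the 4×4 patch size and the repeated 2× patch merging. A small helper, hypothetical and for illustration only, reproduces them for any input size divisible by 32:

```python
def stage_shapes(image_size=224, patch_size=4, embed_dim=96, num_stages=4):
    """Return (stage, height, width, channels) for each stage of the pyramid."""
    shapes = []
    res, ch = image_size // patch_size, embed_dim
    for stage in range(1, num_stages + 1):
        shapes.append((stage, res, res, ch))
        res, ch = res // 2, ch * 2   # patch merging: halve resolution, double channels
    return shapes

for stage, h, w, c in stage_shapes():
    print(f"Stage {stage}: {h}x{w}, {c} channels")
```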

---

## 7. Intuitive Summary

* **ViT**: global attention, single-scale, complexity quadratic in the number of patches.
* **Swin**: local attention, hierarchical, complexity linear in the number of patches (for a fixed window size).
* **Shifted windows**: enable cross-region communication.
* **Patch merging**: provides multi-scale features like CNNs.

**In one sentence:**

> Swin Transformer limits self-attention to local windows (shared key sets), shifts them between layers for global context, and builds a hierarchical multi-scale representation through patch merging — combining the strengths of CNNs and Transformers.

---
