## Squeeze-and-Excitation Networks (SENet)


## **What “SE” Stands For**

**SE block** = **Squeeze-and-Excitation block**.
It’s a lightweight attention mechanism for CNNs introduced in the paper:

> *“Squeeze-and-Excitation Networks” (Hu et al., CVPR 2018)*

It improves a network’s ability to model the **importance of each channel** in feature maps.


<img src="images/squeeze_and_excitation_networks_SENet.png" />

---

#### **Why We Need It**

A standard convolution mixes channels with fixed learned weights, so every input gets the same channel weighting.
Some channels might carry more relevant information for the current task.
An SE block lets the network **recalibrate channel-wise feature responses** adaptively.

---

#### **How an SE Block Works**

Let’s say the input feature map is $X\in \mathbb{R}^{H\times W\times C}$ (height, width, channels).

#### Step A: **Squeeze** (Global information)

* Perform **global average pooling** over spatial dimensions $H \times W$ to get one value per channel.
* This yields a vector $z\in \mathbb{R}^{C}$ summarizing each channel’s global response.

Mathematically:
$$
z_c = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} X_{i,j,c}
$$
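
As a minimal sketch of the squeeze step, the plain-Python function below averages each channel of a tiny feature map stored as nested lists in `[i][j][c]` order, matching the indexing $X_{i,j,c}$ above. The function name `squeeze` and the sample values are illustrative assumptions, not taken from the paper.

```python
def squeeze(x):
    """Global average pooling; x is a nested list indexed [i][j][c]."""
    h, w, c = len(x), len(x[0]), len(x[0][0])
    totals = [0.0] * c
    for row in x:
        for pixel in row:
            for ch, value in enumerate(pixel):
                totals[ch] += value
    # Divide each channel's sum by the number of spatial positions.
    return [total / (h * w) for total in totals]


# Hypothetical 2x2 feature map with 3 channels.
x = [
    [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    [[1.0, 1.0, 2.0], [3.0, 1.0, 2.0]],
]
z = squeeze(x)
print(z)  # [2.0, 0.5, 2.0]
```

Whatever the spatial size, the result always has exactly $C$ entries, which is what lets the following MLP have a fixed input width.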

#### Step B: **Excitation** (Learn channel attention)

* Pass $z$ through a small **two-layer MLP**:

  * First layer reduces dimension to $C/r$ (bottleneck; $r$ is reduction ratio like 16)
  * Apply ReLU.
  * Second layer expands back to $C$.
  * Apply sigmoid to get weights $s\in[0,1]^C$.

$$
s = \sigma(W_2 \, \delta(W_1 z))
$$

where $\delta$ is ReLU, $\sigma$ is sigmoid, $W_1 \in \mathbb{R}^{(C/r) \times C}$ and $W_2 \in \mathbb{R}^{C \times (C/r)}$.
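
The excitation formula can be sketched in a few lines of dependency-free Python. The function name `excitation`, the choice of $C = 4$ with $r = 2$, and all weight values are assumptions made up for illustration; a trained block would learn its own weights.

```python
import math


def excitation(z, w1, w2):
    """Bottleneck MLP without biases: s = sigmoid(W2 @ relu(W1 @ z))."""
    # W1 has one row per bottleneck unit: reduce C -> C/r, then ReLU.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    # W2 has one row per channel: expand C/r -> C, then sigmoid.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    return [1.0 / (1.0 + math.exp(-u)) for u in logits]


# Hypothetical squeezed vector and weights for C = 4, r = 2.
z = [1.0, 0.0, 2.0, 0.5]
w1 = [[0.5, 0.0, 0.25, 0.0],
      [0.0, 1.0, -1.0, 0.0]]
w2 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, 0.0]]
s = excitation(z, w1, w2)  # one weight per channel, each strictly between 0 and 1
```

Here the second bottleneck unit is zeroed by the ReLU, so channels that depend only on it receive a logit of 0 and therefore a neutral weight of 0.5.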

#### Step C: **Scale** (Recalibration)

* Multiply the original feature map channels by the learned weights:

$$
\tilde{X}_{i,j,c} = s_c \cdot X_{i,j,c}
$$

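Because every $s_c$ lies strictly between 0 and 1, no channel is amplified in absolute terms: channels the block deems “important” stay close to their original magnitude, while less important channels are suppressed more strongly.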
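
A short sketch of the scale step, again in plain Python with the feature map indexed `[i][j][c]`. The helper name `scale` and the numbers are hypothetical.

```python
def scale(x, s):
    """Multiply every value of channel c by s[c]; x is indexed [i][j][c]."""
    return [[[value * s[ch] for ch, value in enumerate(pixel)]
             for pixel in row]
            for row in x]


# Hypothetical 1x2 feature map with 3 channels and hypothetical weights.
x = [[[2.0, 4.0, 8.0], [1.0, 1.0, 1.0]]]
s = [0.5, 0.75, 0.25]
x_tilde = scale(x, s)
print(x_tilde)  # [[[1.0, 3.0, 2.0], [0.5, 0.75, 0.25]]]
```

The same weight `s[c]` is applied at every spatial position, so the block changes how strongly a channel responds but not where it responds.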

---

#### **Visual Diagram**

```
Input Feature Map (H×W×C)
       │
 Global Average Pooling (Squeeze)
       ↓
 Channel Descriptor (C)
       │
 Fully Connected (reduce C→C/r), ReLU
       │
 Fully Connected (C/r→C), Sigmoid (Excitation)
       ↓
 Channel Weights (C)
       │
 Scale original feature map channel-wise
       ↓
Output Feature Map (H×W×C)
```

---

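#### **Where It’s Used**

* Originally introduced in **SENet**, which won the ILSVRC 2017 image classification challenge (the paper appeared at CVPR 2018).
* Incorporated into **MobileNetV3**, **EfficientNet** (in MBConv blocks), SE-ResNeXt variants, etc.
* Adds little overhead: a bias-free block contributes about $2C^2/r$ weights (two small FC layers), which is small next to the surrounding convolutions, and the original paper reports consistent accuracy gains on ImageNet.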
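
To put a rough number on the overhead, the sketch below counts the weights of a bias-free SE block, $C \cdot (C/r)$ for the reduction layer plus the same for the expansion layer, and compares them with a bias-free 3×3 convolution that has $C$ input and $C$ output channels. The helper names and the choice of a 3×3 convolution as the reference point are assumptions for illustration only.

```python
def se_param_count(channels, reduction=16):
    """Weights in a bias-free SE block: two FC layers of C * (C // r) each."""
    hidden = channels // reduction
    return 2 * channels * hidden


def conv3x3_param_count(channels):
    """Weights in a bias-free 3x3 convolution with C input and C output channels."""
    return 3 * 3 * channels * channels


c = 256
se = se_param_count(c)          # 2 * 256 * 16 = 8192
conv = conv3x3_param_count(c)   # 9 * 256 * 256 = 589824
ratio = se / conv               # 1/72, about 1.4% of the convolution's weights
```

With $r = 16$ the ratio is $2 / (9r) = 1/72$ regardless of $C$, which is why the reduction ratio is the main knob for trading capacity against cost.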

---



## **Numerical Example**


Assume we have a small **feature map** from some CNN layer:

$$
X \in \mathbb{R}^{4\times4\times3}
$$
So:

* Height = 4
* Width = 4
* Channels = 3

We’ll pick a **reduction ratio** $r = 3$ → bottleneck dimension = $C / r = 1$.

---

####  **Squeeze** (Global Average Pooling)

Compute one average per channel:

$$
z_c = \frac{1}{H \times W} \sum_{i,j} X_{i,j,c}
$$

Let’s assume the averages come out as:
$$
z = [2.0, 0.5, 1.0]
$$

So now we have a vector of size 3.

---

#### **Excitation** (Two small fully-connected layers)

**First FC layer (reduce channels)**

$$
W_1: \mathbb{R}^{3} \rightarrow \mathbb{R}^{1}
$$

We’ll use these weights for illustration:
$$
W_1 = [0.2, 0.4, 0.1]
$$

Compute:
$$
h = W_1 \cdot z = 0.2(2.0) + 0.4(0.5) + 0.1(1.0) = 0.4 + 0.2 + 0.1 = 0.7
$$

Apply ReLU:
$$
h = \max(0, 0.7) = 0.7
$$

---

**Second FC layer (expand back)**

$$
W_2: \mathbb{R}^{1} \rightarrow \mathbb{R}^{3}
$$

Say $W_2 = [0.5, 1.0, -0.5]^T$

Compute:
$$
W_2 h = [0.5 \times 0.7,\ 1.0 \times 0.7,\ -0.5 \times 0.7] = [0.35,\ 0.7,\ -0.35]
$$

Apply **sigmoid** to get weights between 0 and 1:
$$
s = \sigma(W_2 h) \approx [0.587,\ 0.668,\ 0.413]
$$

These are our **channel attention weights**.

---

#### **Scale** (Channel Recalibration)

Now, multiply each channel of the original feature map by its corresponding weight:

| Channel | Scale weight | Example original mean | Scaled mean        |
| ------- | ------------ | --------------------- | ------------------ |
| 1       | 0.587        | 2.0                   | 2.0 × 0.587 ≈ 1.17 |
| 2       | 0.668        | 0.5                   | 0.5 × 0.668 ≈ 0.33 |
| 3       | 0.413        | 1.0                   | 1.0 × 0.413 ≈ 0.41 |

All three weights are below 1, so every channel is attenuated, but by different amounts:

* Channel 2 keeps the largest share of its response (about 67%).
* Channel 1 keeps about 59%.
* Channel 3 is suppressed the most (about 41%).
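
The arithmetic of this worked example can be checked with a few lines of plain Python. The variable names are chosen here for illustration; the numbers are the ones used above. Rounded to three decimals, the weights come out as 0.587, 0.668 and 0.413.

```python
import math

z = [2.0, 0.5, 1.0]      # squeezed channel averages
w1 = [0.2, 0.4, 0.1]     # first FC layer, 3 -> 1
w2 = [0.5, 1.0, -0.5]    # second FC layer, 1 -> 3

h = max(0.0, sum(w * v for w, v in zip(w1, z)))    # ReLU(W1 z), about 0.7
logits = [w * h for w in w2]                        # W2 h
s = [1.0 / (1.0 + math.exp(-u)) for u in logits]    # sigmoid
scaled_means = [w * v for w, v in zip(s, z)]        # weight times channel mean

print([round(v, 3) for v in s])              # [0.587, 0.668, 0.413]
print([round(v, 2) for v in scaled_means])   # [1.17, 0.33, 0.41]
```

Scaling the channel mean is equivalent to scaling every pixel of that channel, because the mean is linear in the channel's values.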

---

#### Intuition

* The SE block **“squeezed”** spatial info into 3 global averages.
* Then it **“excited”** them with a small MLP to learn *which channels matter*.
* Finally, it **scaled** the original channels by those learned importance weights.

So the network learns to emphasize informative channels and dampen unhelpful ones dynamically, per input image.

---




#### **PyTorch Implementation (simple)**

```python
import torch
import torch.nn as nn

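class SEBlock(nn.Module):
    """Squeeze-and-Excitation block for inputs of shape (N, C, H, W)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        # Keep at least one hidden unit when channels < reduction.
        hidden = max(channels // reduction, 1)
        self.avg_pool = nn.AdaptiveAvgPool2d(1)  # (N, C, H, W) -> (N, C, 1, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden, bias=False),  # reduce C -> C/r
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels, bias=False),  # expand C/r -> C
            nn.Sigmoid(),                             # weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.size()
        y = self.avg_pool(x).view(b, c)        # Squeeze: (N, C)
        y = self.fc(y).view(b, c, 1, 1)        # Excitation: (N, C, 1, 1)
        return x * y                           # Scale: broadcast over H and W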
```

---



## **SE (Channel Attention) vs Spatial Attention**
####  1. The Core Difference

| Type                            | Focus                 | What it learns                         | Attention weight shape |
| ------------------------------- | --------------------- | -------------------------------------- | ---------------------- |
| **Squeeze-and-Excitation (SE)** | **Channel attention** | How important each *channel* is        | $1 \times 1 \times C$  |
| **Spatial Attention**           | **Spatial attention** | How important each *pixel/location* is | $H \times W \times 1$  |

In short:

* **SE → “Which feature maps (channels) should I amplify?”**
* **Spatial → “Which spatial regions (pixels) should I focus on?”**

---

####  **2. Mechanism Breakdown**

A. **Squeeze-and-Excitation (Channel Attention)**

1. **Squeeze:** Global average pooling → summarize each channel → vector of length $C$.
2. **Excite:** MLP → outputs one weight per channel (values 0–1).
3. **Scale:** Multiply each channel globally by its weight.

**Effect:** Emphasizes useful *feature types* (e.g., “edges,” “textures,” “object color” channels).
Every pixel within a channel gets scaled equally.

**Visualization:**

```
Input:  H×W×C
 ↓ (Global Average Pool)
Vector (1×1×C)
 ↓ (2 FC layers + sigmoid)
Weights (1×1×C)
 ↓ (channel-wise multiply)
Output: H×W×C
```

---

B. **Spatial Attention**

1. **Squeeze channels:** Compute an importance map for *each spatial location* instead of each channel.
   Typically:

   * Apply average pooling and max pooling along channels → two 2D maps (H×W×1 each).
2. **Concatenate them** and run a small 2D convolution (e.g., 7×7).
3. **Sigmoid** → get spatial attention map (H×W×1).
4. **Multiply** the original feature map by this map at every location, broadcasting across channels.

**Effect:** Emphasizes *where* in the image to focus (e.g., object regions vs. background).
Each location gets its own weight, shared by all channels at that location.

**Visualization:**

```
Input:  H×W×C
 ↓ (Channel AvgPool + MaxPool → concat)
Feature (H×W×2)
 ↓ (Conv 7×7 + sigmoid)
Attention Map (H×W×1)
 ↓ (spatial multiply)
Output: H×W×C
```
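
The pipeline above can be sketched without any libraries. The version below is an illustration under stated assumptions, not a reference implementation: it uses a 3×3 kernel instead of 7×7 to stay short, zero padding at the borders, no bias, and a hand-picked kernel rather than a learned one. The name `spatial_attention` is hypothetical.

```python
import math


def spatial_attention(x, kernel):
    """x is indexed [i][j][c]; kernel is indexed [di][dj][m], m = 0 (avg) or 1 (max).

    The kernel side length is assumed to be odd.
    """
    h, w = len(x), len(x[0])
    k = len(kernel)
    pad = k // 2
    # Pool across channels: one average and one maximum per location.
    pooled = [[(sum(px) / len(px), max(px)) for px in row] for row in x]

    attention = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    ii, jj = i + di - pad, j + dj - pad
                    if 0 <= ii < h and 0 <= jj < w:  # zero padding outside
                        acc += kernel[di][dj][0] * pooled[ii][jj][0]
                        acc += kernel[di][dj][1] * pooled[ii][jj][1]
            attention[i][j] = 1.0 / (1.0 + math.exp(-acc))  # sigmoid

    # Every channel at location (i, j) is scaled by the same weight.
    out = [[[v * attention[i][j] for v in x[i][j]] for j in range(w)]
           for i in range(h)]
    return attention, out


# Hypothetical 2x2x2 input and a 3x3 kernel that reads only the centre pixel.
x = [[[4.0, 0.0], [0.0, 0.0]],
     [[0.0, 0.0], [-2.0, -4.0]]]
kernel = [[[0.0, 0.0] for _ in range(3)] for _ in range(3)]
kernel[1][1] = [1.0, 0.0]  # weight 1 on the channel-average map
attention, out = spatial_attention(x, kernel)
```

With this kernel the attention map is simply the sigmoid of the per-location channel average, so the top-left location (average 2.0) is kept far more strongly than the bottom-right one (average -3.0).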

---

####  3. Comparison Table

| Aspect             | Squeeze-and-Excitation (SE)           | Spatial Attention                      |
| ------------------ | ------------------------------------- | -------------------------------------- |
| Focus              | Channels                              | Spatial positions                      |
| Key question       | *“Which feature maps are important?”* | *“Which image regions are important?”* |
| Pooling used       | Global average pooling (over H, W)    | Pool over channels (avg & max)         |
| Weight shape       | (1×1×C)                               | (H×W×1)                                |
| Computational cost | Very low                              | Slightly higher (7×7 conv)             |
| Used in            | EfficientNet, SENet, MobileNetV3      | CBAM, BAM, attention-based CNNs        |
| Effect             | Reweight feature *types*              | Reweight feature *locations*           |

---

####  4. Combining Both (CBAM)

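Some attention modules, most notably **CBAM (Convolutional Block Attention Module)**, use *both* in sequence:

1. Apply channel attention (SE-style; CBAM's version pools with both average and max before a shared MLP).
2. Then apply spatial attention to the channel-refined features.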

This way the network learns **what** and **where** to focus.
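
Written in the notation used earlier, the sequential arrangement amounts to two element-wise rescalings. The symbol $m$ for the spatial map and the primes for the intermediate results are introduced here for illustration:

$$
X'_{i,j,c} = s_c \cdot X_{i,j,c}, \qquad X''_{i,j,c} = m_{i,j} \cdot X'_{i,j,c}
$$

Here $s \in (0,1)^{C}$ is the channel weight vector computed from $X$, and $m \in (0,1)^{H \times W}$ is the spatial attention map computed from the already recalibrated $X'$, not from $X$.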

---

####  5. Intuitive Analogy

| Mechanism         | Analogy                                                         |
| ----------------- | --------------------------------------------------------------- |
| SE block          | “Turn up the volume for the *important instruments* (channels)” |
| Spatial attention | “Focus your *eyes* on where the action happens in the image”    |

Together → you both **listen carefully** and **look carefully** 👀🎧

