## Convolutional Block Attention Module (CBAM)

Spatial Attention in deep learning is a **mechanism that tells the network *where* to focus in the spatial dimensions** of a feature map (i.e., across height and width).

The **CBAM** defines **two submodules** applied sequentially:

**CBAM = Channel Attention + Spatial Attention**


$$
F' = M_s(M_c(F))
$$

where:

* $M_c(F)$ = **Channel Attention Module (CAM)** — focuses on *what* features (channels) are important.
* $M_s(F)$ = **Spatial Attention Module (SAM)** — focuses on *where* in the spatial map to attend.

So, the CBAM block contains **both** modules — they are not alternatives, they are *stacked*.

<img src="images/convolutional_block_attention_module.png" width="70%" height="70%" />


---



## **1. Spatial Attention Module (SAM)**

Imagine an image with a cat sitting on grass.
A convolutional layer will extract features everywhere — cat + grass + background.
Spatial Attention helps the model **highlight the cat region** (important spatially) and **suppress the background** (less relevant).

In short:

$$
\text{Focus on "where"} \;\rightarrow\; \text{spatial attention}
$$

---


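SAM computes a 2D attention map $A_s \in \mathbb{R}^{H \times W}$ that highlights important *spatial regions* (the “where”).

Formula (AvgPool and MaxPool are taken along the channel axis, and $[\,\cdot\,;\,\cdot\,]$ denotes concatenation along that axis):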

$$
A_s = \sigma(\text{Conv2D}([\text{AvgPool}(F); \text{MaxPool}(F)]))
$$

and the final output:

$$
F' = A_s \odot F
$$




---

####  **1.2. General Formula**

Spatial Attention generates an **attention map** $A_s \in \mathbb{R}^{H \times W}$ (same spatial size as input) that represents *importance of each spatial location*.

Given an input feature map
$$
F \in \mathbb{R}^{C \times H \times W}
$$

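we compute, for each channel $i = 1, \dots, C$:

$$
F'_i = A_s \odot F_i
$$

where:

* $F_i \in \mathbb{R}^{H \times W}$ is the $i$-th channel of $F$
* $\odot$ is element-wise multiplication; the same map $A_s$ is applied to every channel (broadcast along the channel dimension)
* $A_s$ is a 2D spatial attention map, usually normalized with a sigmoid so that $A_s \in (0,1)^{H \times W}$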

---

####  **1.3. How to Compute the Spatial Attention Map**

Steps:

1. **Aggregate along channels** using average-pooling and max-pooling:
   $$
   F_{\text{avg}} = \text{AvgPool}(F) \in \mathbb{R}^{1 \times H \times W}
   $$
   $$
   F_{\text{max}} = \text{MaxPool}(F) \in \mathbb{R}^{1 \times H \times W}
   $$

2. **Concatenate and convolve:**
   $$
   F_{\text{cat}} = [F_{\text{avg}}; F_{\text{max}}]
   $$
   $$
   A_s = \sigma(\text{Conv2D}(F_{\text{cat}}))
   $$

   where $\sigma$ is the sigmoid activation.

3. **Apply attention:**
   $$
   F' = A_s \odot F
   $$

Thus, $A_s$ highlights the important *spatial* locations.

---


<img src="images/spatial_attention_module.png" width="70%" height="70%" />


---

#### **1.4. Code Example (CBAM-style Spatial Attention)**

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super(SpatialAttention, self).__init__()
        padding = kernel_size // 2
        self.conv = nn.Conv2d(2, 1, kernel_size=kernel_size, padding=padding, bias=False)

    def forward(self, x):
        # x: [B, C, H, W]
        avg_out = torch.mean(x, dim=1, keepdim=True)        # [B, 1, H, W]
        max_out, _ = torch.max(x, dim=1, keepdim=True)      # [B, 1, H, W]
        x_cat = torch.cat([avg_out, max_out], dim=1)        # [B, 2, H, W]
        attn = torch.sigmoid(self.conv(x_cat))              # [B, 1, H, W]
        return x * attn                                     # spatially weighted feature map
```

---

#### **1.5. Visualization**

If you visualize the $A_s$ attention map, you’d typically see:

* Bright areas → model focus (object regions)
* Dark areas → ignored background

This makes the model **more interpretable** and can **improve performance** on detection, segmentation, and recognition tasks.
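
One way to inspect the map is to upsample it to the input resolution so it can be overlaid on the image. The sketch below assumes you already have the sigmoid output `attn` of shape `[B, 1, h, w]`; the helper name `attention_heatmap` is hypothetical and the random tensor only stands in for a real attention map.

```python
import torch
import torch.nn.functional as F


def attention_heatmap(attn: torch.Tensor, image_size: tuple) -> torch.Tensor:
    """Upsample a [B, 1, h, w] attention map to [B, image_H, image_W].

    Hypothetical helper, intended for visualization only.
    """
    up = F.interpolate(attn, size=image_size, mode="bilinear", align_corners=False)
    return up.squeeze(1)  # drop the singleton channel axis


# Stand-in for a real attention map: sigmoid of random logits.
torch.manual_seed(0)
attn = torch.sigmoid(torch.randn(1, 1, 7, 7))
heatmap = attention_heatmap(attn, (224, 224))
print(heatmap.shape)  # torch.Size([1, 224, 224])
```

The resulting 2D array can then be drawn semi-transparently over the input image with any plotting library.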

---



## **2. Spatial Attention Module (SAM) Numerical Example**

We’ll use the formula from CBAM:

$$
A_s = \sigma(\text{Conv2D}([\text{AvgPool}(F); \text{MaxPool}(F)]))
$$

and then apply $A_s$ to the input feature map.

---

#### **2.1. Input feature map**

Suppose we have an input feature map with 2 channels, 2×2 spatial size:

$$
F =
\begin{bmatrix}
\text{Channel 1} &
\text{Channel 2}
\end{bmatrix}
$$

Let’s define it numerically:

$$
F_1 =
\begin{bmatrix}
1 & 2 \\
3 & 4
\end{bmatrix},
\quad
F_2 =
\begin{bmatrix}
2 & 0 \\
1 & 3
\end{bmatrix}
$$

So
$F$ has shape $[C, H, W] = [2, 2, 2]$.

---

#### **2.2. Average Pooling along channels**

We compute the mean for each pixel across channels:

$$
F_{\text{avg}}(i,j) = \frac{F_1(i,j) + F_2(i,j)}{2}
$$

So:

| Location | Calculation | Result |
| -------- | ----------- | ------ |
| (0,0)    | (1 + 2)/2   | 1.5    |
| (0,1)    | (2 + 0)/2   | 1.0    |
| (1,0)    | (3 + 1)/2   | 2.0    |
| (1,1)    | (4 + 3)/2   | 3.5    |

$$
F_{\text{avg}} =
\begin{bmatrix}
1.5 & 1.0 \\
2.0 & 3.5
\end{bmatrix}
$$

---

#### **2.3. Max Pooling along channels**

We take the maximum across channels:

| Location | max(Ch1, Ch2) | Result |
| -------- | ------------- | ------ |
| (0,0)    | max(1,2)      | 2      |
| (0,1)    | max(2,0)      | 2      |
| (1,0)    | max(3,1)      | 3      |
| (1,1)    | max(4,3)      | 4      |

$$
F_{\text{max}} =
\begin{bmatrix}
2 & 2 \\
3 & 4
\end{bmatrix}
$$

---

#### **2.4. Concatenate and convolve**

We stack them along the channel dimension:
$$
F_{\text{cat}} = [F_{\text{avg}}; F_{\text{max}}]
$$
so shape becomes `[2, 2, 2]`.

Let’s assume the 2D convolution kernel has weights:
$$
W = [w_1, w_2] = [0.5, 0.5]
$$
(no bias, kernel size = 1×1).

Then convolution output:
$$
A_s = \sigma(0.5 \times F_{\text{avg}} + 0.5 \times F_{\text{max}})
$$

Compute elementwise:

| Location | Equation        | Value | After Sigmoid |
| -------- | --------------- | ----- | ------------- |
| (0,0)    | 0.5×1.5 + 0.5×2 | 1.75  | 0.85          |
| (0,1)    | 0.5×1.0 + 0.5×2 | 1.5   | 0.82          |
| (1,0)    | 0.5×2.0 + 0.5×3 | 2.5   | 0.92          |
| (1,1)    | 0.5×3.5 + 0.5×4 | 3.75  | 0.98          |

$$
A_s =
\begin{bmatrix}
0.85 & 0.82 \\
0.92 & 0.98
\end{bmatrix}
$$

---

#### **2.5. Apply Spatial Attention**

Multiply each spatial location (broadcasted across channels):

$$
\text{For Channel 1:}
$$
$$
F'_1 = F_1 \odot A_s =
\begin{bmatrix}
1\times 0.85 & 2\times 0.82 \\
3\times 0.92 & 4\times 0.98
\end{bmatrix}
=
\begin{bmatrix}
0.85 & 1.64 \\
2.76 & 3.92
\end{bmatrix}
$$

$$
\text{For Channel 2:}
$$
$$
F'_2 = F_2 \odot A_s =
\begin{bmatrix}
2\times 0.85 & 0\times 0.82 \\
1\times 0.92 & 3\times 0.98
\end{bmatrix}
=
\begin{bmatrix}
1.70 & 0.00 \\
0.92 & 2.94
\end{bmatrix}
$$


---

#### **2.6. Result**

The **final attention-weighted feature map** is:

$$
F' =
\Bigg[
\begin{bmatrix}
0.85 & 1.64 \\
2.76 & 3.92
\end{bmatrix},
\quad
\begin{bmatrix}
1.70 & 0.00 \\
0.92 & 2.94
\end{bmatrix}
\Bigg]
$$

Here:

* The bottom-right location got the **highest attention value** (≈ 0.98)
* The top-right location got the **lowest attention value** (≈ 0.82)

So the model **focuses spatially** on the more informative region of the feature map.
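
The steps of this example can be reproduced in a few lines. This is a minimal NumPy sketch, assuming the same 1×1 kernel with weights `[0.5, 0.5]` and no bias; the variable names are illustrative.

```python
import numpy as np

# Input feature map with shape [C, H, W] = [2, 2, 2] (values from section 2.1).
F = np.array([[[1.0, 2.0], [3.0, 4.0]],
              [[2.0, 0.0], [1.0, 3.0]]])

f_avg = F.mean(axis=0)   # channel-wise average -> [H, W]
f_max = F.max(axis=0)    # channel-wise maximum -> [H, W]

# A 1x1 convolution over the two stacked maps is a per-pixel weighted sum.
w_avg, w_max = 0.5, 0.5
logits = w_avg * f_avg + w_max * f_max

a_s = 1.0 / (1.0 + np.exp(-logits))   # sigmoid -> attention map [H, W]
f_out = F * a_s                        # broadcast over the channel axis -> [C, H, W]

print(np.round(a_s, 2))
print(np.round(f_out, 2))
```

Because the code multiplies by the unrounded attention map, some entries of the output can differ from the hand calculation in the last printed digit.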

---



## **3. Channel Attention Module (CAM)**

Let’s now explain the **Channel Attention Module (CAM)**, which comes *before* SAM.

This one tells the network **which channels (features)** to emphasize — for example, “edges” or “color” channels.

Given a feature map:
$$
F \in \mathbb{R}^{C \times H \times W}
$$

CAM computes a **channel attention vector** $A_c \in \mathbb{R}^{C}$, with one weight per channel.

---

#### **3.a) Channel Attention Formula**

1. **Global Pooling (squeeze spatial info):**

Compute both **average** and **max** over spatial dimensions:

$$
F_{\text{avg}} = \text{AvgPool}(F) \in \mathbb{R}^{C \times 1 \times 1}
$$

$$
F_{\text{max}} = \text{MaxPool}(F) \in \mathbb{R}^{C \times 1 \times 1}
$$

2. **Shared MLP (two FC layers):**

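Each pooled vector passes through the same two-layer MLP, with $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$ (reduction ratio $r$, e.g. 16):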

$$
M_{\text{avg}} = W_1(\text{ReLU}(W_0(F_{\text{avg}})))
$$

$$
M_{\text{max}} = W_1(\text{ReLU}(W_0(F_{\text{max}})))
$$

3. **Combine and activate:**

$$
A_c = \sigma(M_{\text{avg}} + M_{\text{max}})
$$

4. **Apply attention across channels:**

$$
F' = A_c \odot F
$$

(where $\odot$ means broadcasting multiplication across spatial dims)
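
The four steps above can be sketched as a PyTorch module. This is one possible implementation under stated assumptions, not a reference one: the shared MLP is written with bias-free 1×1 convolutions (equivalent to FC layers on a `[B, C, 1, 1]` tensor), and the `max(..., 1)` guard for small channel counts is an addition for illustration.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        hidden = max(channels // reduction, 1)  # assumption: avoid a zero-width layer
        # Shared MLP: W_0 (C -> C/r), ReLU, W_1 (C/r -> C).
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),
            nn.ReLU(),
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),
        )

    def forward(self, x):
        # x: [B, C, H, W]
        avg = torch.mean(x, dim=(2, 3), keepdim=True)        # [B, C, 1, 1]
        mx = torch.amax(x, dim=(2, 3), keepdim=True)         # [B, C, 1, 1]
        attn = torch.sigmoid(self.mlp(avg) + self.mlp(mx))   # [B, C, 1, 1]
        return x * attn                                      # broadcast over H and W


x = torch.randn(2, 32, 8, 8)
out = ChannelAttention(32, reduction=16)(x)
print(out.shape)  # torch.Size([2, 32, 8, 8])
```

Both pooled vectors go through the same `self.mlp`, which is what "shared" means in step 2.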

---

#### **3.b) Intuition**

* Each channel represents a *type of feature* (edges, color blobs, textures, etc.).
* CAM learns which channels are *useful* for the current task.
* SAM (later) learns *where* these features matter spatially.

---

#### Complete CBAM Flow

The full process is:

$$
\begin{align*}
F' &= M_c(F) \\
F'' &= M_s(F') \\
\text{Output} &= F''
\end{align*}
$$

**Order matters**:
Channel attention first, spatial attention second.

---

#### Visualization

| Step                  | Input   | Attention Map | Focus                    |
| --------------------- | ------- | ------------- | ------------------------ |
| **Channel Attention** | [C×H×W] | [C×1×1]       | **What** channels matter |
| **Spatial Attention** | [C×H×W] | [1×H×W]       | **Where** they matter    |

Together, they make the model focus on both *what* and *where* to look.

---

#### PyTorch Overview

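```python
import torch.nn as nn

# ChannelAttention and SpatialAttention are assumed to be defined elsewhere
# as nn.Module subclasses (SpatialAttention is given in section 1.4).
class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.channel_attention = ChannelAttention(channels, reduction)
        self.spatial_attention = SpatialAttention(kernel_size)

    def forward(self, x):
        x = self.channel_attention(x)  # what: reweight channels
        x = self.spatial_attention(x)  # where: reweight spatial locations
        return x
```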

---

#### Summary Table

| Module  | Input Shape | Attention Map Shape | Mechanism           | Focus   | Example Formula                                                                             |
| ------- | ----------- | ------------------- | ------------------- | ------- | ------------------------------------------------------------------------------------------- |
| **CAM** | [C, H, W]   | [C, 1, 1]           | Avg/Max Pool + MLP  | *What*  | $A_c = \sigma(\text{MLP}(\text{AvgPool}(F)) + \text{MLP}(\text{MaxPool}(F)))$               |
| **SAM** | [C, H, W]   | [1, H, W]           | Avg/Max Pool + Conv | *Where* | $A_s = \sigma(\text{Conv2D}([\text{AvgPool}(F); \text{MaxPool}(F)]))$                       |

---



#### Channel Attention Module
<img src="images/channel_attention_module.png" width="70%" height="70%" />

## **7. Channel Attention Module (CAM) Numerical Example**

Input Feature Map:

$$
F_1 =
\begin{bmatrix}
1 & 2 \\
3 & 4
\end{bmatrix}, \quad
F_2 =
\begin{bmatrix}
2 & 0 \\
1 & 3
\end{bmatrix}
$$

So $F$ has shape $[C, H, W] = [2, 2, 2]$.

---

#### **7.1. Global Pooling (squeeze spatial info)**

We apply **Average Pooling** and **Max Pooling** over the spatial dimensions $(H, W)$ **independently for each channel**.

**Average Pooling**

Compute the mean per channel:

| Channel | Formula     | Result |
| ------- | ----------- | ------ |
| 1       | (1+2+3+4)/4 | 2.5    |
| 2       | (2+0+1+3)/4 | 1.5    |

$$
F_{\text{avg}} =
\begin{bmatrix}
2.5 \\
1.5
\end{bmatrix}
$$

**Max Pooling**

Compute the max per channel:

| Channel | Formula      | Result |
| ------- | ------------ | ------ |
| 1       | max(1,2,3,4) | 4      |
| 2       | max(2,0,1,3) | 3      |

$$
F_{\text{max}} =
\begin{bmatrix}
4 \\
3
\end{bmatrix}
$$

---

#### **7.2. Shared MLP (two fully connected layers)**

The MLP is shared for both pooled vectors.

Let’s assume:

* Number of channels $C = 2$
* Reduction ratio $r = 1$ (to keep the numbers small)
* So, weights $W_0, W_1$ are both $2\times2$ matrices.

We’ll choose simple example weights:

$$
W_0 =
\begin{bmatrix}
0.5 & 0.1 \\
0.3 & 0.7
\end{bmatrix},
\quad
W_1 =
\begin{bmatrix}
0.2 & 0.4 \\
0.6 & 0.8
\end{bmatrix}
$$

Now process both pooled vectors.

---

#### **7.3. Average Path**

First layer:
$$
z_1 = \text{ReLU}(W_0 F_{\text{avg}})
$$

Compute:

$$
\begin{align*}
W_0 F_{\text{avg}} &=
\begin{bmatrix}
0.5 & 0.1 \\
0.3 & 0.7
\end{bmatrix}
\begin{bmatrix}
2.5 \\
1.5
\end{bmatrix} \\
&=
\begin{bmatrix}
(0.5 \times 2.5 + 0.1 \times 1.5) \\
(0.3 \times 2.5 + 0.7 \times 1.5)
\end{bmatrix} \\
&=
\begin{bmatrix}
1.4 \\
1.8
\end{bmatrix}
\end{align*}
$$

After ReLU: same values (both are positive).

Second layer:

$$
\begin{align*}
M_{\text{avg}} &= W_1 z_1 \\
&=
\begin{bmatrix}
0.2 & 0.4 \\
0.6 & 0.8
\end{bmatrix}
\begin{bmatrix}
1.4 \\
1.8
\end{bmatrix} \\
&=
\begin{bmatrix}
(0.2 \times 1.4 + 0.4 \times 1.8) \\
(0.6 \times 1.4 + 0.8 \times 1.8)
\end{bmatrix} \\
&=
\begin{bmatrix}
1.00 \\
2.28
\end{bmatrix}
\end{align*}
$$

---

#### **7.4. Max Path**

First layer:
$$
z_2 = \text{ReLU}(W_0 F_{\text{max}})
$$

Compute:

$$
\begin{align*}
W_0 F_{\text{max}} &=
\begin{bmatrix}
0.5 & 0.1 \\
0.3 & 0.7
\end{bmatrix}
\begin{bmatrix}
4 \\
3
\end{bmatrix} \\
&=
\begin{bmatrix}
(0.5 \times 4 + 0.1 \times 3) \\
(0.3 \times 4 + 0.7 \times 3)
\end{bmatrix} \\
&=
\begin{bmatrix}
2.3 \\
3.3
\end{bmatrix}
\end{align*}
$$

After ReLU: same values (both are positive).

Second layer:

$$
\begin{align*}
M_{\text{max}} &= W_1 z_2 \\
&=
\begin{bmatrix}
0.2 & 0.4 \\
0.6 & 0.8
\end{bmatrix}
\begin{bmatrix}
2.3 \\
3.3
\end{bmatrix} \\
&=
\begin{bmatrix}
(0.2 \times 2.3 + 0.4 \times 3.3) \\
(0.6 \times 2.3 + 0.8 \times 3.3)
\end{bmatrix} \\
&=
\begin{bmatrix}
1.78 \\
4.02
\end{bmatrix}
\end{align*}
$$

---

#### **7.5. Combine and Apply Sigmoid**

Add both paths:

$$
\begin{align*}
M_{\text{avg}} + M_{\text{max}} &=
\begin{bmatrix}
1.00 + 1.78 \\
2.28 + 4.02
\end{bmatrix} \\
&=
\begin{bmatrix}
2.78 \\
6.30
\end{bmatrix}
\end{align*}
$$

Apply sigmoid:

| Channel | Input | Sigmoid Output |
| ------- | ----- | -------------- |
| 1       | 2.78  | 0.942          |
| 2       | 6.30  | 0.998          |

So:

$$
A_c =
\begin{bmatrix}
0.942 \\
0.998
\end{bmatrix}
$$

---

#### **7.6. Apply Channel Attention**

Now multiply each channel by its attention weight:

$$
F'_1 = 0.942 \times F_1, \quad F'_2 = 0.998 \times F_2
$$

Compute:

$$
F'_1 =
\begin{bmatrix}
0.942 & 1.884 \\
2.826 & 3.768
\end{bmatrix},
\quad
F'_2 =
\begin{bmatrix}
1.996 & 0.000 \\
0.998 & 2.994
\end{bmatrix}
$$

---

#### **7.7. Interpretation**

* **Channel 1 weight = 0.942**
* **Channel 2 weight = 0.998**

→ The module slightly prefers **Channel 2**. In a trained network this would suggest more discriminative features; in this toy example the MLP weights are arbitrary.
Each channel’s spatial structure remains the same, only *scaled*.

---

#### **7.8. Combined CAM Effect with SAM**

If you later pass this through the **Spatial Attention Module (SAM)** (from before), you’ll get:

$$
F'' = M_s(F') = A_s \odot F'
$$

meaning:

* **CAM**: emphasizes *which features* to keep
* **SAM**: emphasizes *where* to look

---

## **8. Complete CBAM: CAM + SAM Numerical Example**


CBAM applies attention **sequentially**:

$$
F' = M_c(F) \quad \text{(Channel Attention)} 
$$

$$
F'' = M_s(F') \quad \text{(Spatial Attention)}
$$

---

#### **8.1. Inputs Recap**

Initial feature maps:

$$
F_1 =
\begin{bmatrix}
1 & 2 \\
3 & 4
\end{bmatrix}, \quad
F_2 =
\begin{bmatrix}
2 & 0 \\
1 & 3
\end{bmatrix}
$$

---

#### **8.2. Channel Attention Output (from CAM example)**

Channel attention weights:

$$
A_c =
\begin{bmatrix}
0.942 \\
0.998
\end{bmatrix}
$$

Applying them to each channel gives:

$$
F'_1 =
\begin{bmatrix}
0.942 & 1.884 \\
2.826 & 3.768
\end{bmatrix}, \quad
F'_2 =
\begin{bmatrix}
1.996 & 0.000 \\
0.998 & 2.994
\end{bmatrix}
$$

---

#### **8.3. Spatial Attention Map (from SAM example)**

We reuse the map computed in the SAM example. Note that it was computed from the original $F$; in the actual module, $A_s$ is computed from the channel-refined $F'$, which would change the values slightly. Reusing it keeps the arithmetic short:

$$
A_s =
\begin{bmatrix}
0.85 & 0.82 \\
0.92 & 0.98
\end{bmatrix}
$$

---

#### **8.4. Apply Spatial Attention**

Multiply each spatial element by its attention weight (broadcasted over channels):

**Channel 1:**

$$
\begin{align*}
F''_1 &= F'_1 \odot A_s \\
&=
\begin{bmatrix}
0.942 \times 0.85 & 1.884 \times 0.82 \\
2.826 \times 0.92 & 3.768 \times 0.98
\end{bmatrix} \\
&=
\begin{bmatrix}
0.801 & 1.545 \\
2.600 & 3.693
\end{bmatrix}
\end{align*}
$$

**Channel 2:**

$$
\begin{align*}
F''_2 &= F'_2 \odot A_s \\
&=
\begin{bmatrix}
1.996 \times 0.85 & 0.000 \times 0.82 \\
0.998 \times 0.92 & 2.994 \times 0.98
\end{bmatrix} \\
&=
\begin{bmatrix}
1.697 & 0.000 \\
0.918 & 2.934
\end{bmatrix}
\end{align*}
$$

---

#### **8.5. Final CBAM Output**

$$
F'' =
\Bigg[
\begin{bmatrix}
0.801 & 1.545 \\
2.600 & 3.693
\end{bmatrix},
\quad
\begin{bmatrix}
1.697 & 0.000 \\
0.918 & 2.934
\end{bmatrix}
\Bigg]
$$

---

#### **8.6. Interpretation**

| Step                        | Focus                       | Effect                                         |
| --------------------------- | --------------------------- | ---------------------------------------------- |
| **Channel Attention (CAM)** | *What* features matter      | Scales channel 1 by 0.942, channel 2 by 0.998  |
| **Spatial Attention (SAM)** | *Where* in each map matters | Emphasizes bottom-right region                 |

Thus, after both attentions:

* The **bottom-right** location (highest $A_s$ = 0.98) keeps the largest share of its activation and has the **strongest activations** in both channels.
* **Channel 2** is attenuated slightly less than Channel 1 because of its higher channel weight (0.998 vs. 0.942).
* Locations with lower $A_s$ (e.g. top-right, 0.82) are **suppressed more**.

---

✅ **Final Summary**

$$
\boxed{
F'' = M_s(M_c(F))
}
$$

with numerically:

$$
A_c =
\begin{bmatrix}
0.942 \\ 0.998
\end{bmatrix},
\quad
A_s =
\begin{bmatrix}
0.85 & 0.82 \\ 0.92 & 0.98
\end{bmatrix}
$$
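
The whole walk-through can be checked numerically. The following is a minimal NumPy sketch under the same assumptions as the text: the toy weights $W_0$, $W_1$, a 1×1 spatial kernel with weights `[0.5, 0.5]`, and a spatial map computed on the original $F$ as in the SAM example. Variable names are illustrative.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


F = np.array([[[1.0, 2.0], [3.0, 4.0]],
              [[2.0, 0.0], [1.0, 3.0]]])      # [C, H, W]
W0 = np.array([[0.5, 0.1], [0.3, 0.7]])
W1 = np.array([[0.2, 0.4], [0.6, 0.8]])

# Channel attention: shared MLP applied to the spatially pooled vectors.
f_avg = F.mean(axis=(1, 2))                    # [C]
f_max = F.max(axis=(1, 2))                     # [C]
m_avg = W1 @ np.maximum(W0 @ f_avg, 0.0)       # ReLU between the two layers
m_max = W1 @ np.maximum(W0 @ f_max, 0.0)
a_c = sigmoid(m_avg + m_max)                   # [C]
f_cam = a_c[:, None, None] * F                 # F' = A_c * F, broadcast over H and W

# Spatial attention map, computed on the original F as in the SAM example.
a_s = sigmoid(0.5 * F.mean(axis=0) + 0.5 * F.max(axis=0))   # [H, W]
f_cbam = a_s * f_cam                           # F'' = A_s * F', broadcast over C

print("M_avg:", m_avg, "M_max:", m_max)
print("A_c:", np.round(a_c, 3))
print("A_s:", np.round(a_s, 2))
print("F'':", np.round(f_cbam, 3))
```

Printing the intermediate vectors makes it easy to compare each step against the hand calculation; small last-digit differences come from rounding $A_c$ and $A_s$ before multiplying by hand.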

## **9. Where and How CBAM Fits Inside a CNN**


CBAM is a **plug-and-play attention module** — you insert it **after** a convolutional block to refine the feature maps spatially and across channels.

It consists of two sub-modules applied sequentially:

$$
F' = M_s(M_c(F)) \quad \text{(Channel → Spatial)}
$$

---

#### **9.1. Where It Goes in a CNN**

In the original CBAM paper (Sanghyun Woo et al., ECCV 2018), the authors insert CBAM **inside each residual block** of a ResNet: after the block's convolutional layers and just before the residual addition.

So, for a residual block with input $x$:

$$
\begin{align*}
\text{out} &= \text{ConvBlock}(x) \\
\text{out} &= \text{CBAM}(\text{out}) \\
\text{out} &= \text{out} + x
\end{align*}
$$

This means CBAM refines the features **before the skip connection adds them back**.

---

#### **9.2. Example: Inside a ResNet Block**

**Original Residual Block:**

```
x → Conv → BN → ReLU → Conv → BN → (add skip x)
```

**With CBAM:**

```
x → Conv → BN → ReLU → Conv → BN → CBAM → (add skip x)
```

✅ You can add CBAM to *each block*, or to only *selected stages* (to save computation).
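
The "With CBAM" flow can be sketched as a self-contained PyTorch module. This is an illustrative sketch, not the paper's code: the class name `CBAMResidualBlock` is hypothetical, the block keeps the channel count fixed so the skip needs no projection, and the final ReLU after the addition follows the usual ResNet design rather than the diagram above.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.ReLU(),
            nn.Conv2d(hidden, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = torch.mean(x, dim=(2, 3), keepdim=True)
        mx = torch.amax(x, dim=(2, 3), keepdim=True)
        return x * torch.sigmoid(self.mlp(avg) + self.mlp(mx))


class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx = torch.amax(x, dim=1, keepdim=True)
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))


class CBAMResidualBlock(nn.Module):
    """Hypothetical basic block: Conv-BN-ReLU-Conv-BN -> CBAM -> add skip."""

    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.channel_att = ChannelAttention(channels, reduction)
        self.spatial_att = SpatialAttention(kernel_size)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.body(x)
        out = self.spatial_att(self.channel_att(out))  # refine before the skip addition
        return self.relu(out + x)                      # assumption: standard post-add ReLU


block = CBAMResidualBlock(64)
y = block(torch.randn(2, 64, 16, 16))
print(y.shape)  # torch.Size([2, 64, 16, 16])
```

Because CBAM preserves the tensor shape, it can be dropped into a block without touching the skip path.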

---

#### **9.3. Inside the CBAM Block**

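The CBAM block itself is small:

```python
import torch.nn as nn

# ChannelAttention and SpatialAttention are assumed to be defined elsewhere
# as nn.Module subclasses (SpatialAttention is given in section 1.4).
class CBAMBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.channel_att = ChannelAttention(channels)
        self.spatial_att = SpatialAttention()

    def forward(self, x):
        x = self.channel_att(x)   # channel attention first
        x = self.spatial_att(x)   # then spatial attention
        return x
```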

---

#### **9.4. Example in Architecture (ResNet with CBAM)**

If we visualize the flow in a ResNet-50:

| Stage   | Typical Layers              | CBAM Added After   |
| ------- | --------------------------- | ------------------ |
| Conv1   | 7×7 conv, BN, ReLU, MaxPool | Optional           |
| Conv2_x | 3 residual blocks           | ✅ after each block |
| Conv3_x | 4 residual blocks           | ✅ after each block |
| Conv4_x | 6 residual blocks           | ✅ after each block |
| Conv5_x | 3 residual blocks           | ✅ after each block |
| FC      | AvgPool + Linear            | ✖ Not here         |

So CBAM is usually **repeated in every residual block**, but you can apply it selectively if speed matters.

---

#### **9.5. Other Architectures Where CBAM Is Used**

| Model Type                                   | How CBAM Is Used                               | Purpose                                      |
| -------------------------------------------- | ---------------------------------------------- | -------------------------------------------- |
| **ResNet / ResNeXt / DenseNet**              | Added after each block                         | Improves classification & detection accuracy |
| **U-Net / SegNet**                           | Added at encoder or decoder stages             | Improves segmentation precision              |
| **YOLO / SSD (Detection)**                   | Added after feature extractors (e.g. backbone) | Helps localize and identify objects          |
| **ViT / Swin Hybrid Models**                 | Applied to CNN stem before transformer         | Gives inductive spatial bias                 |
| **Lightweight CNNs (MobileNet, ShuffleNet)** | Added only to last few layers                  | Minimal cost, large accuracy gain            |

---

#### **9.6. Why Not Put It Everywhere?**

CBAM is lightweight but still adds a small computational overhead:

* Channel attention adds a shared two-layer MLP (2 FC layers).
* Spatial attention adds one convolution.

Hence:

* For **small networks (e.g., MobileNet)**, use it **only in deeper layers**.
* For **large ResNets**, you can afford to apply CBAM to every block.
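
To get a feel for the overhead, the extra parameters of one CBAM block can be counted directly. This is a rough sketch assuming bias-free layers, reduction ratio `r = 16`, and a 7×7 spatial kernel; the function name `cbam_extra_params` is hypothetical.

```python
def cbam_extra_params(channels, reduction=16, kernel_size=7):
    """Parameters added by one CBAM block, assuming bias-free layers."""
    hidden = channels // reduction
    # Shared MLP: W_0 has (C/r x C) weights and W_1 has (C x C/r) weights.
    channel_attention = 2 * channels * hidden
    # One k x k convolution with 2 input maps and 1 output map.
    spatial_attention = 2 * kernel_size * kernel_size
    return channel_attention + spatial_attention


for c in (64, 256, 2048):
    print(c, cbam_extra_params(c))
```

The channel-attention term grows as $2C^2/r$, so most of the added parameters sit in the widest stages, while the spatial-attention term is constant.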

---

#### **9.7. Effect in Practice**

CBAM improves:

* **Classification accuracy** (ImageNet, CIFAR, etc.)
* **Localization and detection** (COCO, Pascal VOC)
* **Segmentation boundary precision**

Typical improvement:
roughly +1–3 % Top-1 accuracy for little extra compute, depending on the backbone and dataset.

---




## **10. CBAM and SENet**
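CBAM's channel attention generalizes the squeeze-and-excitation (SE) block of SENet: SE squeezes with global average pooling only, while CAM adds a max-pooled branch through the same MLP. CBAM then adds spatial attention on top.

**CBAM ≈ SE-style channel attention (plus a max-pool branch) + spatial refinement**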

| Type                  | Input            | Output          | Focus        | Common Use            |
| --------------------- | ---------------- | --------------- | ------------ | --------------------- |
| Channel Attention     | Feature map      | Channel weights | What         | SENet, CBAM           |
| **Spatial Attention** | Feature map      | Spatial mask    | **Where**    | CBAM, attention U-Net |
| Self-Attention        | Flattened tokens | Weighted tokens | Where + What | Transformers          |

---