# Attention Mechanisms in Convolutional Neural Networks

Attention mechanisms in convolutional neural networks enable the model to adaptively focus on the most relevant features of the input signal, either at the channel level or at the spatial level. These modules learn to recalibrate intermediate activations by assigning differentiated importance weights, which increases the representational capacity of the model without drastically increasing the number of parameters or the computational cost.

In modern architectures, attention is integrated in a modular way into existing convolutional blocks, such as the residual blocks of ResNet. The following sections describe and implement two of the most influential attention mechanisms in convolutional networks: the Squeeze-and-Excitation (SE) block and the Convolutional Block Attention Module (CBAM).

## Squeeze-and-Excitation (SE) Block

The Squeeze-and-Excitation block, introduced in _Squeeze-and-Excitation Networks_ (https://arxiv.org/abs/1709.01507), incorporates a channel-wise attention mechanism. The central idea is to explicitly model the dependencies between feature channels so that the network learns to emphasize the channels that are most informative for the task, while suppressing less relevant or redundant ones.

The SE mechanism decomposes into two conceptual stages, commonly referred to as _squeeze_ and _excitation_. In the squeeze phase, the spatial dimension of each feature map is reduced by means of global average pooling. In this way, each channel is compressed into a single scalar value that summarizes its global activation across the entire image. In the excitation phase, these aggregated values are fed into a small fully connected network that learns a channel-wise attention function. The output of this network is a vector of weights in the interval $(0, 1)$, which is applied multiplicatively to the original channels, recalibrating their relative importance.

Let $X \in \mathbb{R}^{B \times C \times H \times W}$ be a feature tensor with batch size $B$, number of channels $C$, and spatial dimensions $H \times W$. The squeeze operation computes, for each channel $c$ of a given sample,

\[ z_c = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} X_c(i, j). \]

The compressed vector $z \in \mathbb{R}^{C}$ is processed by a two-layer fully connected network with an intermediate dimensionality reduction, which produces a vector of weights $s \in (0, 1)^{C}$ after a sigmoid activation. The recalibration is implemented as

\[ \tilde{X}_c(i, j) = s_c \cdot X_c(i, j). \]

The following code shows an implementation of the SE block and its integration into a basic residual block in PyTorch. The code is designed for direct use in a reproducible and fully executable workflow.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SqueezeExcitation(nn.Module):
    def __init__(self, in_channels: int, reduction_ratio: int = 16) -> None:
        super().__init__()
        reduced_channels = max(in_channels // reduction_ratio, 1)

        # Squeeze: Global average pooling per channel
        self.squeeze = nn.AdaptiveAvgPool2d(1)

        # Excitation: Two fully connected (implemented as Linear) layers
        self.excitation = nn.Sequential(
            nn.Linear(in_channels, reduced_channels, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(reduced_channels, in_channels, bias=False),
            nn.Sigmoid()
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch_size, channels, _, _ = x.size()

        # Squeeze: Global average pooling per channel
        squeezed = self.squeeze(x).view(batch_size, channels)

        # Excitation: Channel-wise weights in (0, 1)
        excited = self.excitation(squeezed).view(batch_size, channels, 1, 1)

        # Channel-wise recalibration
        return x * excited


class SEResidualBlock(nn.Module):
    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        stride: int = 1,
        reduction_ratio: int = 16
    ) -> None:
        super().__init__()

        self.conv1 = nn.Conv2d(
            in_channels, out_channels,
            kernel_size=3, stride=stride, padding=1, bias=False
        )
        self.bn1 = nn.BatchNorm2d(out_channels)

        self.conv2 = nn.Conv2d(
            out_channels, out_channels,
            kernel_size=3, stride=1, padding=1, bias=False
        )
        self.bn2 = nn.BatchNorm2d(out_channels)

        self.se = SqueezeExcitation(out_channels, reduction_ratio)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(
                    in_channels, out_channels,
                    kernel_size=1, stride=stride, bias=False
                ),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = self.shortcut(x)

        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.se(out)

        out += identity
        out = F.relu(out)
        return out
```
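For reference, the excitation step performed by `self.excitation` can also be written as a single expression. The symbols $W_1$, $W_2$, $\delta$, $\sigma$, and $r$ are notation introduced here for illustration and do not appear in the text above: $r$ stands for `reduction_ratio`, $\delta$ for the ReLU, and $\sigma$ for the sigmoid. The expression mirrors the bias-free two-layer structure of the code and assumes that $C$ is divisible by $r$.

\[
s = \sigma\bigl(W_2 \, \delta(W_1 z)\bigr),
\qquad
W_1 \in \mathbb{R}^{\frac{C}{r} \times C},
\quad
W_2 \in \mathbb{R}^{C \times \frac{C}{r}}.
\]

The first matrix projects the squeezed vector $z$ down to $C / r$ dimensions and the second projects it back to $C$, so the bottleneck width is set entirely by $r$.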

To verify the correct construction and behavior of the SE block, a small functional test can be defined. This test checks that input and output shapes match and reports the number of parameters of the SE module.

```python
def test_se_block() -> None:
    x = torch.randn(2, 64, 32, 32)
    se_block = SqueezeExcitation(in_channels=64, reduction_ratio=16)
    output = se_block(x)

    print(f"Input shape:  {x.shape}")
    print(f"Output shape: {output.shape}")
    print(f"SE parameters: {sum(p.numel() for p in se_block.parameters())}")

    assert x.shape == output.shape, "Shape mismatch"
    print("SE Block test passed")


test_se_block()
```

The SE block adds only a modest number of parameters, controlled by the hyperparameter `reduction_ratio`. For $C$ channels and a reduction ratio $r$, the two bias-free linear layers used here hold roughly $2C^2 / r$ weights. The ratio determines the bottleneck size in the excitation network: larger values reduce the capacity of the module but also decrease its computational cost. In practice, configurations such as `reduction_ratio = 16` usually provide a good balance between modeling capacity and efficiency.
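As a rough illustration of how `reduction_ratio` controls the size of the module, the sketch below counts the weights of the two bias-free linear layers used in `SqueezeExcitation`. The helper name `se_param_count` is hypothetical and not part of PyTorch; it only reproduces the arithmetic `2 * C * max(C // r, 1)` implied by the layer shapes.

```python
def se_param_count(channels: int, reduction_ratio: int = 16) -> int:
    """Weights in the two bias-free linear layers of an SE block.

    Hypothetical helper for illustration; it mirrors the layer shapes
    used in SqueezeExcitation and is not a PyTorch API.
    """
    reduced = max(channels // reduction_ratio, 1)
    # First layer: channels -> reduced; second layer: reduced -> channels
    return channels * reduced + reduced * channels


for r in (4, 8, 16, 32):
    print(f"C=64, reduction_ratio={r:2d}: {se_param_count(64, r)} parameters")
```

For 64 channels and `reduction_ratio = 16` this gives 512, which should agree with the count printed by `test_se_block`. Doubling the ratio halves the count, as long as the bottleneck stays above the minimum width of one.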

## Convolutional Block Attention Module (CBAM)

The Convolutional Block Attention Module (CBAM) extends the SE idea by sequentially incorporating attention both in the channel domain and in the spatial domain. First, it applies a channel attention module conceptually similar to SE, but combining information from global average pooling and global max pooling. Subsequently, it applies a spatial attention module that analyzes the distribution of activations across channels to determine which regions of the image are most relevant.

The channel attention module in CBAM is built from two parallel paths. One path receives as input the output of a global average pooling and the other uses the output of a global max pooling, both computed over the spatial dimensions for each channel. Each of these summaries is processed by a small $1 \times 1$ convolutional network that acts as a shared fully connected projection. The two resulting outputs are combined by element-wise addition and then passed through a sigmoid function to obtain a channel attention map that modulates the contribution of each channel.

The spatial attention module is applied to the feature maps already recalibrated by channel. To this end, two single-channel spatial maps are computed by aggregating over the channel dimension using mean and maximum operations. These two maps are concatenated along the channel axis and processed by a convolution of size $k \times k$, typically with $k = 7$, followed by a sigmoid activation. The result is a spatial attention map that is applied multiplicatively to the signal, modulating the importance of each spatial position $(i, j)$ in the image.

The following code presents the implementation of CBAM (channel and spatial attention) and its integration into a residual block.

```python
class ChannelAttention(nn.Module):
    def __init__(self, in_channels: int, reduction_ratio: int = 16) -> None:
        super().__init__()
        reduced_channels = max(in_channels // reduction_ratio, 1)

        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)

        # Shared MLP implemented with 1x1 convolutions
        self.fc = nn.Sequential(
            nn.Conv2d(in_channels, reduced_channels, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(reduced_channels, in_channels, kernel_size=1, bias=False)
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg_out = self.fc(self.avg_pool(x))
        max_out = self.fc(self.max_pool(x))
        attention = self.sigmoid(avg_out + max_out)
        return x * attention


class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7) -> None:
        super().__init__()
        padding = (kernel_size - 1) // 2

        self.conv = nn.Conv2d(
            2, 1, kernel_size=kernel_size,
            padding=padding, bias=False
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel-wise average and max projections
        avg_out = torch.mean(x, dim=1, keepdim=True)
        max_out, _ = torch.max(x, dim=1, keepdim=True)

        combined = torch.cat([avg_out, max_out], dim=1)
        attention = self.sigmoid(self.conv(combined))
        return x * attention


class CBAM(nn.Module):
    def __init__(
        self,
        in_channels: int,
        reduction_ratio: int = 16,
        kernel_size: int = 7
    ) -> None:
        super().__init__()
        self.channel_attention = ChannelAttention(
            in_channels, reduction_ratio
        )
        self.spatial_attention = SpatialAttention(kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.channel_attention(x)
        x = self.spatial_attention(x)
        return x


class CBAMResidualBlock(nn.Module):
    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        stride: int = 1
    ) -> None:
        super().__init__()

        self.conv1 = nn.Conv2d(
            in_channels, out_channels,
            kernel_size=3, stride=stride, padding=1, bias=False
        )
        self.bn1 = nn.BatchNorm2d(out_channels)

        self.conv2 = nn.Conv2d(
            out_channels, out_channels,
            kernel_size=3, stride=1, padding=1, bias=False
        )
        self.bn2 = nn.BatchNorm2d(out_channels)

        self.cbam = CBAM(out_channels)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(
                    in_channels, out_channels,
                    kernel_size=1, stride=stride, bias=False
                ),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = self.shortcut(x)

        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.cbam(out)

        out += identity
        out = F.relu(out)
        return out
```
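The two stages implemented by `ChannelAttention` and `SpatialAttention` can be summarized compactly as follows. The symbols $M_c$, $M_s$, $X'$, $X''$, $\mathrm{MLP}$, and $f^{k \times k}$ are notation chosen here for illustration: $\mathrm{MLP}$ denotes the shared two-layer projection, $f^{k \times k}$ the $k \times k$ convolution, $\sigma$ the sigmoid, $[\,\cdot\,;\,\cdot\,]$ concatenation along the channel axis, and $\odot$ element-wise multiplication with broadcasting.

\[
\begin{aligned}
M_c(X) &= \sigma\bigl(\mathrm{MLP}(\mathrm{AvgPool}(X)) + \mathrm{MLP}(\mathrm{MaxPool}(X))\bigr), & X' &= M_c(X) \odot X, \\
M_s(X') &= \sigma\bigl(f^{k \times k}\bigl([\,\mathrm{mean}_c\, X' \,;\, \mathrm{max}_c\, X'\,]\bigr)\bigr), & X'' &= M_s(X') \odot X'.
\end{aligned}
\]

Here $M_c(X)$ has shape $B \times C \times 1 \times 1$ and $M_s(X')$ has shape $B \times 1 \times H \times W$, so each map rescales the features along one axis only: channels first, spatial positions second.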

The following code fragment performs a basic check of the CBAM module, analogous to the test applied in the case of the SE block. It validates that the input and output have the same shape and reports the number of parameters of the module.

```python
def test_cbam() -> None:
    x = torch.randn(2, 64, 32, 32)
    cbam = CBAM(in_channels=64)
    output = cbam(x)

    print(f"Input shape:  {x.shape}")
    print(f"Output shape: {output.shape}")
    print(f"CBAM parameters: {sum(p.numel() for p in cbam.parameters())}")

    assert x.shape == output.shape, "Shape mismatch"
    print("CBAM test passed")


test_cbam()
```
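The parameter count reported by the test can be cross-checked by hand. The sketch below is a minimal illustration under the assumption that all layers are bias-free, as in the `CBAM` class above; the helper name `cbam_param_count` is hypothetical and not part of PyTorch.

```python
def cbam_param_count(
    channels: int,
    reduction_ratio: int = 16,
    kernel_size: int = 7,
) -> int:
    """Weights in a bias-free CBAM module.

    Hypothetical helper for illustration; it mirrors the layer shapes
    used in ChannelAttention and SpatialAttention and is not a PyTorch API.
    """
    reduced = max(channels // reduction_ratio, 1)
    # Shared MLP: two 1x1 convolutions, channels -> reduced -> channels.
    # It is counted once because both pooling paths reuse the same weights.
    channel_attention = channels * reduced + reduced * channels
    # Spatial attention: one k x k convolution mapping 2 channels to 1
    spatial_attention = 2 * 1 * kernel_size * kernel_size
    return channel_attention + spatial_attention


print(cbam_param_count(64))  # 512 + 98 = 610
```

With these defaults the spatial branch contributes 98 weights regardless of the number of channels, so the module holds 98 more parameters than an SE block of the same width and reduction ratio.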

In practice, CBAM often improves on SE, since it combines channel-level and spatial attention in a complementary way. Spatial attention is particularly useful in tasks where the localization of objects or discriminative regions plays a critical role, such as object detection, semantic and instance segmentation, or recognition in scenarios with multiple instances per image.