# Normalization

Normalization constitutes a key stage both in input data preprocessing and in the
internal design of neural network architectures. Its primary objective is to control the
scale of numerical values, ensuring that different features are in comparable ranges and
that training is stable, efficient, and less sensitive to initialization or
hyperparameter choices.

In the context of images, normalization falls into two broad categories. The first is
input data normalization, applied to images before they enter the network. The second is
the normalization of internal activations, implemented as layers inside the network and
applied on every forward pass. Both pursue similar objectives, but they act at different
stages of the data flow and through different mechanisms.

## Input Data Normalization

Input data normalization is applied directly to images before they are processed by the
network layers. In the case of images, one works with tensors or arrays where each pixel
can be represented with raw values in the range $[0, 255]$ or, after prior conversion,
with floating-point values.

The purpose of this normalization is threefold. First, it provides numerical stability by
avoiding excessively large or small values, which can cause exploding or vanishing
gradients. Second, it tends to accelerate training, as gradients propagate more uniformly
through the network. Finally, it prevents one feature from dominating others simply due
to its scale, so that all dimensions of the feature space contribute comparably to
learning.

### Motivation for Normalizing Input

Input normalization serves several objectives. In terms of numerical stability, it
prevents activations from reaching magnitudes that hinder the convergence of optimization
algorithms. Additionally, inputs on a homogeneous scale keep the gradients computed
during backpropagation within reasonable orders of magnitude, which allows larger
learning rates without compromising convergence. Finally, adjusting all features to
similar ranges has a balancing effect, so that the model is not biased toward components
with larger numerical values.

### Input Normalization Techniques

Various standard techniques exist for normalizing images, each suitable for certain
scenarios and architectures. A first technique consists of Min-Max normalization to the
range $[0, 1]$. In this case, the image is linearly rescaled using its minimum and
maximum values:

In [1]:
# 3pps
import numpy as np


def normalize_min_max(image):
    """Brings the image to the range [0, 1]."""
    image = image.astype(np.float32)
    normalized = (image - image.min()) / (image.max() - image.min() + 1e-8)
    return normalized

This method is useful when one wants to work with values bounded between 0 and 1, for
example in simple models or when one wishes to visualize or combine different data
sources normalized to the same range.
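
As a quick sanity check, the following sketch applies the same min-max formula to a small
hypothetical 2×2 patch (the values are illustrative, not taken from a real image). It
also covers the degenerate case of a constant image, where the small epsilon keeps the
division well defined. The helper name `min_max` is an assumption made for this example.

```python
import numpy as np


def min_max(image, eps=1e-8):
    """Rescale an array linearly to approximately [0, 1] using its own extremes."""
    image = image.astype(np.float32)
    return (image - image.min()) / (image.max() - image.min() + eps)


# Hypothetical 2x2 grayscale patch with 8-bit values
patch = np.array([[50, 100], [150, 250]], dtype=np.uint8)
scaled = min_max(patch)  # 50 maps to 0, 250 maps to ~1, 100 maps to ~0.25

# A constant image has max == min; eps avoids a division by zero
flat = min_max(np.full((2, 2), 7, dtype=np.uint8))
```

Because the minimum and maximum are taken from each image, two images with different
exposure are mapped differently; when a fixed mapping is needed, dividing by 255 may be
the safer choice.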

A widely used variant in deep neural networks consists of bringing values to the range
$[-1, 1]$. For typical 8-bit images, a direct way to achieve this is to first divide by
255 and then apply a linear transformation:

In [2]:
def normalize_minus_one_to_one(image):
    """Brings the image to the range [-1, 1]."""
    image = image.astype(np.float32) / 255.0
    normalized = 2.0 * image - 1.0
    return normalized

This type of normalization is common in architectures such as Generative Adversarial
Networks (GANs), where it is preferable for input data to be centered around zero;
generators often end in a `tanh` activation, whose output range is also $[-1, 1]$.
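
Since values in $[-1, 1]$ cannot be displayed directly, it is often useful to invert the
mapping. The sketch below assumes 8-bit inputs; the helper names `to_minus_one_one` and
`to_uint8` are hypothetical and chosen only for this example. Dividing by 127.5 and
subtracting 1 is algebraically the same as dividing by 255 and applying $2x - 1$.

```python
import numpy as np


def to_minus_one_one(image_uint8):
    """Map 8-bit values in [0, 255] to floats in [-1, 1]."""
    return image_uint8.astype(np.float32) / 127.5 - 1.0


def to_uint8(image):
    """Invert the mapping, rounding and clipping back to valid 8-bit values."""
    restored = (image + 1.0) * 127.5
    return np.clip(np.rint(restored), 0, 255).astype(np.uint8)


# Round trip over every possible 8-bit value
levels = np.arange(256, dtype=np.uint8)
normalized = to_minus_one_one(levels)
recovered = to_uint8(normalized)
```

The clipping step matters mainly for model outputs, which may fall slightly outside
$[-1, 1]$.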

Another fundamental approach is standardization or $z$-score normalization. In this case,
the mean of the data is subtracted and the result is divided by the standard deviation:

In [3]:
def standardize(image):
    """Standardizes: (x - mean) / standard deviation."""
    image = image.astype(np.float32)
    mean = image.mean()
    std = image.std()
    standardized = (image - mean) / (std + 1e-8)
    return standardized

This technique transforms data so that it has approximately zero mean and unit variance.
In computer vision, it is frequently used at the channel level, utilizing precomputed
means and standard deviations over large datasets, such as ImageNet.
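
The channel-level variant can be sketched as follows. This is a minimal illustration
that assumes images stored as `(N, H, W, C)` arrays and uses a small synthetic batch as a
stand-in for a real dataset; the function names are hypothetical.

```python
import numpy as np


def channel_statistics(images):
    """Per-channel mean and std over a batch of 8-bit images shaped (N, H, W, C)."""
    images = images.astype(np.float32) / 255.0
    mean = images.mean(axis=(0, 1, 2))
    std = images.std(axis=(0, 1, 2))
    return mean, std


def standardize_per_channel(images, mean, std, eps=1e-8):
    """Apply (x - mean_c) / std_c; the channel axis must be last so broadcasting works."""
    return (images - mean) / (std + eps)


# Synthetic stand-in for a dataset: 16 random RGB images of size 8x8
rng = np.random.default_rng(0)
dataset = rng.integers(0, 256, size=(16, 8, 8, 3), dtype=np.uint8)

mean, std = channel_statistics(dataset)
standardized = standardize_per_channel(dataset.astype(np.float32) / 255.0, mean, std)
```

In a real pipeline the statistics would be computed once over the training set and then
reused unchanged for validation and test data.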

In practice, frameworks like PyTorch facilitate input normalization through predefined
transformations. A typical example for models pretrained on ImageNet is as follows:

In [4]:
# 3pps
from torchvision import transforms


# Standard transformation for pretrained models (ImageNet)
transform = transforms.Compose(
    [
        transforms.ToTensor(),  # Converts to tensor and scales to [0, 1]
        transforms.Normalize(
            mean=[0.485, 0.456, 0.406],  # ImageNet mean per channel
            std=[0.229, 0.224, 0.225],  # ImageNet standard deviation per channel
        ),
    ]
)

In this pipeline, the `ToTensor` transform converts a PIL image or `uint8` array to a
floating-point tensor in channel-first layout and scales values to the range $[0, 1]$.
Subsequently, `Normalize` applies channel-wise standardization using the per-channel mean
and standard deviation of the ImageNet training set. This practice ensures that
pretrained models receive inputs in the same statistical regime for which they were
optimized.
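
To make the arithmetic explicit, the sketch below reproduces the per-channel operation
with plain tensor operations and adds its inverse, which is handy for visualizing
normalized inputs. It assumes a `(3, H, W)` float tensor already scaled to $[0, 1]$; the
helper names `normalize` and `denormalize` are hypothetical and not part of torchvision.

```python
import torch

# ImageNet statistics reshaped to broadcast over a (3, H, W) tensor
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)


def normalize(image):
    """Channel-wise (x - mean) / std for a (3, H, W) tensor in [0, 1]."""
    return (image - IMAGENET_MEAN) / IMAGENET_STD


def denormalize(tensor):
    """Inverse of normalize: x * std + mean."""
    return tensor * IMAGENET_STD + IMAGENET_MEAN


torch.manual_seed(0)
image = torch.rand(3, 4, 4)  # synthetic stand-in for a ToTensor output
restored = denormalize(normalize(image))

# A pure white pixel ends up above 2 in every channel after normalization
white = normalize(torch.ones(3, 1, 1))
```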

## Normalization Layers in Neural Networks

Normalization layers operate within the network architecture, on the intermediate
activations generated as data advances through the layers. Unlike input data
normalization, which is fixed preprocessing, these layers are differentiable blocks that
form part of the model and, in many cases, contain learnable parameters.

The general idea consists of normalizing activations along certain dimensions (for
example, over the batch, over channels, or over all elements of a sample), and then
applying an affine transformation whose scale and shift parameters are learned during
training. This was originally motivated as a way to reduce the so-called "internal
covariate shift"; in practice, it stabilizes the distribution of activations, which
facilitates the training of deep networks.
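
The choice of dimensions is what distinguishes the methods discussed in the rest of this
section. The following sketch, which assumes an `(N, C, H, W)` tensor and a hypothetical
helper `normalize_over`, contrasts statistics shared across the batch with statistics
private to each sample, and then applies a per-channel scale and shift.

```python
import torch


def normalize_over(x, dims, eps=1e-5):
    """Subtract the mean and divide by the std computed over the given dims."""
    mean = x.mean(dim=dims, keepdim=True)
    var = x.var(dim=dims, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)


torch.manual_seed(0)
x = torch.randn(4, 3, 5, 5) * 3.0 + 2.0  # (N, C, H, W), deliberately off-center

# One mean/variance per channel, shared by every sample in the batch
per_channel = normalize_over(x, dims=(0, 2, 3))

# One mean/variance per sample, independent of the rest of the batch
per_sample = normalize_over(x, dims=(1, 2, 3))

# Scale and shift; in a real layer these would be nn.Parameter objects
gamma = torch.ones(1, 3, 1, 1)
beta = torch.zeros(1, 3, 1, 1)
y = gamma * per_channel + beta
```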

### Local Response Normalization (LRN)

Local Response Normalization (LRN) is a technique introduced in early networks such as
AlexNet. Its purpose is to perform normalization based on the response of neighboring
channels, mimicking certain lateral inhibition mechanisms observed in the biological
visual system. Although it is included here for historical completeness, in practice it
is rarely used today, having been largely displaced by more effective methods such as
Batch Normalization or Layer Normalization.
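
For reference, the operation as formulated for AlexNet can be written as follows, where
$a^{i}_{x,y}$ denotes the activation of channel $i$ at spatial position $(x, y)$,
$b^{i}_{x,y}$ the normalized output, and $C$ the total number of channels (the symbol $C$
is a notational choice made here):

$$ b^{i}_{x,y} = \frac{a^{i}_{x,y}}{\left( k + \alpha \sum_{j=\max(0,\, i - n/2)}^{\min(C-1,\, i + n/2)} \left( a^{j}_{x,y} \right)^2 \right)^{\beta}} $$

Each activation is thus divided by a term that grows with the squared activity of its $n$
neighboring channels at the same position. Note that library implementations may differ
in detail; PyTorch's built-in `nn.LocalResponseNorm`, for instance, divides $\alpha$ by
$n$.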

A schematic implementation of LRN in PyTorch can be structured as a class that receives
parameters such as neighborhood size $n$, coefficients $\alpha$ and $\beta$, and a
constant $k$:

In [5]:
# 3pps
import torch
import torch.nn as nn


class LocalResponseNormalization(nn.Module):
    def __init__(self, k=2.0, n=5, alpha=1e-4, beta=0.75):
        super().__init__()
        self.k = k
        self.n = n
        self.alpha = alpha
        self.beta = beta

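    def forward(self, x):
        # Sum of squared activations over a window of n neighboring channels
        # (zero-padded at the channel borders), following the AlexNet formulation.
        squared = x.pow(2).unsqueeze(1)  # (N, 1, C, H, W)
        padding = (0, 0, 0, 0, self.n // 2, (self.n - 1) // 2)
        squared = nn.functional.pad(squared, padding)
        window_sum = self.n * nn.functional.avg_pool3d(
            squared, kernel_size=(self.n, 1, 1), stride=1
        ).squeeze(1)
        return x / (self.k + self.alpha * window_sum).pow(self.beta)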

Although LRN had relevance in early works with deep CNNs, its current impact is very
limited and it is not considered a recommendable choice for modern architectures.

### Global Response Normalization (GRN)

Global Response Normalization (GRN) is a more recent alternative to local normalization.
Instead of normalizing with respect to neighboring channels, GRN aggregates the response
of each channel over all spatial positions and then rescales each channel according to
how its global response compares with that of the other channels. The objective is to
prevent certain channels from becoming redundant or systematically dominating the
representation, promoting a more balanced distribution of energy across channels.

A typical GRN implementation in PyTorch can take the following form:

In [6]:
class GlobalResponseNormalization(nn.Module):
    def __init__(self, num_channels, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        self.eps = eps

    def forward(self, x):
        # Calculate global norm per channel (p=2 over spatial dimensions)
        gx = torch.norm(x, p=2, dim=(2, 3), keepdim=True)
        # Normalize with respect to the mean of global norms
        nx = gx / (gx.mean(dim=1, keepdim=True) + self.eps)
        # Rescale and add residual component
        return self.gamma * (x * nx) + self.beta + x

In this block, an $L^2$ norm per channel is first calculated by aggregating over spatial
dimensions. Subsequently, this norm is normalized with respect to its mean and used to
rescale the original activations through the learnable parameters $\gamma$ and $\beta$,
to which the input itself is also added as a residual term. This type of normalization
has been explored in modern convolutional architectures and in masked autoencoder models.
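
One consequence of initializing $\gamma$ and $\beta$ to zero is worth checking: at the
start of training the block reduces to the identity, because only the residual term
survives. The sketch below restates the module in compact form (the class name `GRN` is
shortened for this example) and verifies that property on a random `(N, C, H, W)` tensor.

```python
import torch
import torch.nn as nn


class GRN(nn.Module):
    """Compact restatement of the GRN block for (N, C, H, W) inputs."""

    def __init__(self, num_channels, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        self.eps = eps

    def relative_response(self, x):
        # L2 norm per channel over the spatial dimensions: shape (N, C, 1, 1)
        gx = torch.linalg.vector_norm(x, ord=2, dim=(2, 3), keepdim=True)
        # Each channel's norm divided by the mean norm across channels
        return gx / (gx.mean(dim=1, keepdim=True) + self.eps)

    def forward(self, x):
        return self.gamma * (x * self.relative_response(x)) + self.beta + x


torch.manual_seed(0)
grn = GRN(num_channels=8)
x = torch.randn(2, 8, 4, 4)

identity_at_init = torch.equal(grn(x), x)  # gamma = beta = 0 leaves only the residual
relative = grn.relative_response(x)        # averages to ~1 across channels
```

Channels with a relative response above 1 are amplified once $\gamma$ moves away from
zero, while weaker channels are attenuated.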

### Batch Normalization (BN)

Batch Normalization (BN) is one of the most influential internal normalization techniques
in deep networks. Its central idea consists of normalizing activations using statistics
(mean and variance) calculated over the training batch itself for each channel.

For an activation tensor $x$ of size $(N, C, H, W)$, where $N$ is the batch size, $C$ is
the number of channels, and $H$ and $W$ are the spatial dimensions, the mean and variance
per channel are calculated in training mode:

$$ \mu_c = \frac{1}{N H W} \sum_{n,h,w} x_{n,c,h,w}, \quad \sigma_c^2 = \frac{1}{N H W}
\sum_{n,h,w} (x_{n,c,h,w} - \mu_c)^2. $$

Next, normalization is performed:

$$ \hat{x}_{n,c,h,w} = \frac{x_{n,c,h,w} - \mu_c}{\sqrt{\sigma_c^2 + \varepsilon}}, $$

and an affine transformation is applied with learnable parameters $\gamma_c$ and
$\beta_c$:

$$ y_{n,c,h,w} = \gamma_c \hat{x}_{n,c,h,w} + \beta_c. $$

A simplified implementation of two-dimensional Batch Normalization can be expressed in
PyTorch as follows:

In [7]:
class BatchNormalization2D(nn.Module):
    def __init__(self, num_channels, eps=1e-5, momentum=0.1):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(1, num_channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        self.eps = eps
        self.momentum = momentum

        # Accumulated statistics for inference
        self.register_buffer("running_mean", torch.zeros(1, num_channels, 1, 1))
        self.register_buffer("running_var", torch.ones(1, num_channels, 1, 1))

    def forward(self, x):
        if self.training:
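            mean = x.mean(dim=(0, 2, 3), keepdim=True)
            # Biased variance (divides by N*H*W), matching the formula above
            var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)

            # Update accumulated statistics in place, without tracking gradients
            # (PyTorch's built-in layer uses the unbiased estimate for running_var)
            with torch.no_grad():
                self.running_mean.mul_(1 - self.momentum).add_(self.momentum * mean)
                self.running_var.mul_(1 - self.momentum).add_(self.momentum * var)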
        else:
            mean = self.running_mean
            var = self.running_var

        x_norm = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * x_norm + self.beta

During training, batch statistics are used and the accumulated means and variances are
updated with a certain momentum. During inference, the accumulated means and variances
are used instead of batch statistics, so the output for a given input is deterministic
and does not depend on the other samples in the batch.

Among the main advantages of Batch Normalization are training acceleration, the
possibility of using higher learning rates, and reduced dependence on weight
initialization. In many architectures, BN also contributes to reducing the need for
additional regularization techniques such as Dropout. However, it also presents
limitations. In particular, its performance degrades when the batch size is very small,
as mean and variance estimates become noisy, and its behavior differs between training
and inference modes, which requires careful management of `train` and `eval` modes.
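
The difference between the two modes can be observed directly with PyTorch's built-in
`nn.BatchNorm2d`. The sketch below uses a synthetic batch whose mean and scale are
deliberately far from the initial running statistics, so the same input produces clearly
different outputs in `train` and `eval` modes after a single update.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(num_features=3)  # default momentum is 0.1
x = torch.randn(8, 3, 4, 4) * 5.0 + 10.0  # synthetic batch, mean ~10, std ~5

# Training mode: normalize with batch statistics and update the running ones
bn.train()
y_train = bn(x)
batch_mean = x.mean(dim=(0, 2, 3))

# Evaluation mode: normalize with the running statistics accumulated so far
bn.eval()
with torch.no_grad():
    y_eval = bn(x)

# After one step the running mean has moved 10% of the way from 0 to the batch mean
running_mean_after_one_step = bn.running_mean.clone()
```

With only one update, the running statistics are still close to their initial values,
which is why evaluating a model too early, or forgetting to call `eval()`, can give
misleading results.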

### Layer Normalization (LN)

Layer Normalization (LN) is designed to overcome some limitations of BN, especially in
contexts where batch size is small or where the model structure does not adapt well to
batch normalization, such as in recurrent networks or Transformers. In LN, normalization
is performed independently for each sample, aggregating over all its feature dimensions.

If one considers an input tensor $x$ associated with an individual sample, LN calculates
the mean and variance over all relevant dimensions (for example, over channels and
spatial positions) for each batch element, and normalizes analogously to BN but without
depending on other samples in the batch. Thus, the normalization behavior is identical in
training and inference, and does not depend on batch size.

A schematic implementation of Layer Normalization for tensors of type $(N, C, H, W)$ can
be written as follows:

In [8]:
class LayerNormalization2D(nn.Module):
    def __init__(self, num_channels=None, eps=1e-6):
        super().__init__()
        self.eps = eps
        if num_channels is not None:
            self.gamma = nn.Parameter(torch.ones(1, num_channels, 1, 1))
            self.beta = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        else:
            self.gamma = None
            self.beta = None

    def forward(self, x):
        mean = x.mean(dim=(1, 2, 3), keepdim=True)
        # Biased variance, consistent with the normalization used in BN
        var = x.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        x_norm = (x - mean) / torch.sqrt(var + self.eps)
        if self.gamma is not None and self.beta is not None:
            return self.gamma * x_norm + self.beta
        return x_norm

This normalization is especially suitable for attention-based architectures, such as
Transformers, and for recurrent networks, where dependence on batch statistics could
introduce undesired noise. Additionally, by not differentiating between training and
inference modes, it simplifies the operational flow of the model and facilitates the use
of very small batch sizes, even equal to one.
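
The batch independence described above can be checked with
`torch.nn.functional.layer_norm`. The following sketch, on synthetic data, normalizes a
sample once as part of a batch of four and once on its own, and also shows the
convention typical of Transformers, where statistics are computed over the feature
dimension of each token. The tensor shapes are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch = torch.randn(4, 3, 5, 5)  # (N, C, H, W)

# Normalize over (C, H, W) for every sample in the batch
full = F.layer_norm(batch, batch.shape[1:])

# Normalize the first sample alone, as a batch of size one
single = F.layer_norm(batch[:1], batch.shape[1:])

# The result for that sample does not depend on its batch companions
same_result = torch.allclose(full[:1], single, atol=1e-5)

# Transformer-style usage: (batch, tokens, features), normalized per token
tokens = torch.randn(2, 7, 16)
per_token = F.layer_norm(tokens, (16,))
```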