## Imports and Utils

In [19]:
import os
import random
from typing import Dict, Tuple

import cv2
import numpy as np
import pandas as pd
import torch
import torchvision.datasets as datasets
from PIL import Image, ImageFile
from torchvision import transforms
from tqdm import tqdm

In [5]:
def seed_all(seed: int = 1930):
    """Seed all random number generators."""
    print("Using Seed Number {}".format(seed))
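    # set PYTHONHASHSEED env var at fixed value
    os.environ["PYTHONHASHSEED"] = str(seed)
    torch.manual_seed(seed)  # seed PyTorch's RNG (CPU and CUDA)
    torch.cuda.manual_seed_all(seed)  # seed every CUDA device
    torch.cuda.manual_seed(seed)  # seed the current CUDA device
    np.random.seed(seed)  # for numpy pseudo-random generator
    random.seed(seed)  # set fixed value for python built-in pseudo-random generator
    torch.backends.cudnn.deterministic = True  # use deterministic cuDNN algorithms
    torch.backends.cudnn.benchmark = False  # disable the cuDNN autotuner, which can pick different algorithms per run
    torch.backends.cudnn.enabled = False  # turn cuDNN off entirely (slower, but removes it as a source of nondeterminism)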

In [6]:
seed_all()

Using Seed Number 1930


## Disclaimer

Note that the following method is not the **most efficient** one, but it is good for learning because the steps in the code are laid out **sequentially** and are easy to follow.

!!! info
    There are a few preprocessing techniques for image data[^1]. Here we discuss the one I encounter most often: normalization across channels.

    [^1]: Extracted from [CS231n](https://cs231n.github.io/neural-networks-2/#datapre)

!!! warning
    Data leakage occurs if you apply this preprocessing step before your train-valid-test split. Compute the mean and std on $X_{\text{train}}$ only, then apply those same values to the validation set during model selection and to the test set during model evaluation.
    In the examples below, I compute the mean and std on the full training set (which includes what would be the validation set). In practice, we should first split the training set further into training and validation sets.
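
As a minimal sketch of the leak-free order described in the warning above, the snippet below splits first and only then computes the statistics. The random array is a hypothetical stand-in for a real dataset, and the 80/20 split is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a dataset: 100 RGB images, channels last, uint8.
images = rng.integers(0, 256, size=(100, 8, 8, 3), dtype=np.uint8)

# 1. Split first (80/20 here, an arbitrary choice).
indices = rng.permutation(len(images))
train_idx, valid_idx = indices[:80], indices[80:]
x_train = images[train_idx] / 255.0
x_valid = images[valid_idx] / 255.0

# 2. Compute per-channel statistics on the training split only.
train_mean = x_train.mean(axis=(0, 1, 2))
train_std = x_train.std(axis=(0, 1, 2))

# 3. Apply the training statistics to both splits.
x_train_norm = (x_train - train_mean) / train_std
x_valid_norm = (x_valid - train_mean) / train_std
```

The validation images never contribute to `train_mean` or `train_std`, so nothing about them leaks into the preprocessing.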

## General Steps to Normalize

!!! warning "Important Warning"
    **Important: we usually resize the images before training, and a different image size can give a different mean and std. So remember to resize first, then calculate.**

1. RGB image with 3 channels. We assume channels first; if the data is channels last, convert it to channels first. As I primarily use PyTorch, this is more natural to me. See CIFAR-10 for such an example.

    - Load the data into memory using either `cv2` or `PIL`. Divide all images by 255 first to scale the pixel values to $[0, 1]$.
    - Then find the mean and std per channel over all images.
        - For example, suppose we want the mean of the red channel of a batch of 10 images of size $(100, 100, 3)$ each. Each image has 3 channels, each channel has $100 \times 100$ pixels, so the 10 images together have $10 \times 100 \times 100 = 100000$ red pixels. We `flatten()` the red channel of all 10 images and take the average (i.e. sum all $100000$ red pixels and divide by $100000$). We do the same for the other channels.

2. Grayscale image with 1 channel.
    - This is just the average of the values in the single channel.

3. Audio/Spectrograms like SETI etc.
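
The worked example in step 1 can be sketched with a random batch standing in for real images (an assumption for illustration). The last few lines use 2x2 average pooling as a crude stand-in for resizing; real interpolation such as `cv2.resize` will behave somewhat differently, but the point that the std depends on image size still holds.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch: 10 images of size (100, 100, 3), scaled to [0, 1].
batch = rng.integers(0, 256, size=(10, 100, 100, 3), dtype=np.uint8) / 255.0

# Flatten the red channel of all 10 images into one long vector.
red = batch[..., 0].flatten()
print(red.size)  # 100000

# Mean of the red channel: sum of all red pixels divided by their count.
mean_red = red.sum() / red.size

# The same per-channel statistics for all three channels at once.
mean_rgb = batch.mean(axis=(0, 1, 2))
std_rgb = batch.std(axis=(0, 1, 2))

# Crude "resize" to (50, 50) by averaging each 2x2 block of pixels.
small = batch.reshape(10, 50, 2, 50, 2, 3).mean(axis=(2, 4))
std_rgb_small = small.std(axis=(0, 1, 2))
```

On this random batch `std_rgb_small` comes out noticeably smaller than `std_rgb`, which is why the statistics should be computed after resizing.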

## CIFAR-10 (RGB)

We first look at an example of calculating the mean and standard deviation of CIFAR-10, which has RGB channels.

```python
Mean: {"R": 0.49139968, "G": 0.48215827, "B": 0.44653124}
Standard Deviation: {"R": 0.24703233, "G": 0.24348505, "B": 0.26158768}
```

We will code a function to calculate the mean and standard deviation of a batch of images.

In [20]:
TRANSFORMS = transforms.Compose([transforms.ToTensor()])

In [25]:
trainset_cifar10 = datasets.CIFAR10(
    root="./data", train=True, download=True, transform=TRANSFORMS
)
testset_cifar10 = datasets.CIFAR10(
    root="./data", train=False, download=True, transform=TRANSFORMS
)
train_images_cifar10 = np.asarray(trainset_cifar10.data)  # (50000, 32, 32, 3)
test_images_cifar10 = np.asarray(testset_cifar10.data)  # (10000, 32, 32, 3)

Files already downloaded and verified
Files already downloaded and verified


In [22]:
def calcMeanStd(images: np.ndarray) -> Dict[str, Tuple[float, ...]]:
    """Take in a numpy array of images and return the mean and std per channel.
    This function assumes the whole array is already loaded into memory.

    Args:
        images (np.ndarray): [num_images, channel, height, width] or [num_images, height, width, channel]
            for RGB, or [num_images, height, width] for grayscale, with pixel values in [0, 255].

    Returns:
        Dict[str, Tuple[float, ...]]: {"mean": (mean_r, mean_g, mean_b), "std": (std_r, std_g, std_b)}
    """

    images = np.asarray(images)  # accept anything array-like (list, uint8 array, tensor on CPU)
    images = images / 255.0      # scale pixel values from [0, 255] to [0, 1]

    if images.ndim == 4:                            # RGB
        if images.shape[1] != 3:                    # if channel is not first, make it so, assume channels last
            images = images.transpose(0, 3, 1, 2)   # if tensor use permute instead
                                                    # the new axes are taken from the old ones as follows
                                                    # new axis0 <- old axis0 (batch)
                                                    # new axis1 <- old axis3 (channel)
                                                    # new axis2 <- old axis1 (height)
                                                    # new axis3 <- old axis2 (width)

        r_channel, g_channel, b_channel = images[:, 0, :, :], images[:, 1, :, :], images[:, 2, :, :]      # get rgb channels individually
        r_channel, g_channel, b_channel = r_channel.flatten(), g_channel.flatten(), b_channel.flatten()   # flatten each channel into one array
        mean_r = r_channel.mean(axis=None)  # average over every pixel of the red channel across all images
        mean_g = g_channel.mean(axis=None)  # same as above
        mean_b = b_channel.mean(axis=None)  # same as above

        # calculate std over each channel (r,g,b)
        std_r = r_channel.std(axis=None)
        std_g = g_channel.std(axis=None)
        std_b = b_channel.std(axis=None)

        return {"mean": (mean_r, mean_g, mean_b), "std": (std_r, std_g, std_b)}

    elif images.ndim == 3:              # grayscale
        gray_channel = images.flatten() # flatten directly since only 1 channel
        mean = gray_channel.mean(axis=None)
        std = gray_channel.std(axis=None)

        return {"mean": (mean,), "std": (std,)}

    else:
        raise ValueError(
            f"images must have 3 (grayscale) or 4 (RGB) dimensions, got {images.ndim}"
        )

In [26]:
mean_std_cifar = calcMeanStd(train_images_cifar10)
print(mean_std_cifar)

{'mean': (0.49139967861519745, 0.4821584083946076, 0.44653091444546616), 'std': (0.2470322324632823, 0.24348512800005553, 0.2615878417279641)}


In [27]:
# alternate way to do this.
print(trainset_cifar10.data.shape)
print(trainset_cifar10.data.mean(axis=(0,1,2))/255)
print(trainset_cifar10.data.std(axis=(0,1,2))/255)

(50000, 32, 32, 3)
[0.49139968 0.48215841 0.44653091]
[0.24703223 0.24348513 0.26158784]
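
Both approaches above need the whole dataset in memory at once. When that is not feasible, one common alternative is to accumulate per-channel sums and sums of squares batch by batch. The sketch below assumes channels-last `uint8` batches; `running_mean_std` is a hypothetical helper name, and the random array stands in for a real data loader.

```python
import numpy as np


def running_mean_std(batches):
    """Per-channel mean and std accumulated over channels-last uint8 batches."""
    count = 0             # number of pixels seen per channel
    channel_sum = 0.0     # running sum of pixel values per channel
    channel_sq_sum = 0.0  # running sum of squared pixel values per channel
    for batch in batches:
        x = np.asarray(batch, dtype=np.float64) / 255.0
        count += x.shape[0] * x.shape[1] * x.shape[2]
        channel_sum = channel_sum + x.sum(axis=(0, 1, 2))
        channel_sq_sum = channel_sq_sum + (x ** 2).sum(axis=(0, 1, 2))
    mean = channel_sum / count
    # Var[X] = E[X^2] - E[X]^2 (population variance, like np.std's default).
    std = np.sqrt(channel_sq_sum / count - mean ** 2)
    return mean, std


rng = np.random.default_rng(0)
data = rng.integers(0, 256, size=(64, 32, 32, 3), dtype=np.uint8)

# Feed the data in chunks of 16 images, as a data loader would.
batches = (data[i:i + 16] for i in range(0, len(data), 16))
mean, std = running_mean_std(batches)
```

Only three running totals are kept between batches, so memory use does not grow with the dataset size. The result should agree with the one-shot computation up to floating point error.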


Strictly speaking, the validation and test sets should be normalized with the parameters found on the training set alone. In practice, for image recognition problems, the parameters are often computed once on the combined train and validation data and then applied to the test set as well.

The steps are:

1. Calculate the mean and std using the method above.
2. Divide the training/validation/test set by 255.
3. Normalize it using the values found. 

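Note that step 2 can be skipped **if the library already divides by 255 internally.** In torchvision, `transforms.ToTensor()` does this for `uint8` images, so `transforms.Normalize` expects its input to already lie in $[0, 1]$.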
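
The three steps can be written out by hand in NumPy. This is a sketch on random stand-in data with a hypothetical `normalize` helper, not what any library does internally. The last line shows what goes wrong if step 2 is skipped while the statistics are on the $[0, 1]$ scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for train and test images (channels last, uint8).
train = rng.integers(0, 256, size=(50, 32, 32, 3), dtype=np.uint8)
test = rng.integers(0, 256, size=(10, 32, 32, 3), dtype=np.uint8)

# Step 1: per-channel mean and std of the training set, on the [0, 1] scale.
mean = train.mean(axis=(0, 1, 2)) / 255.0
std = train.std(axis=(0, 1, 2)) / 255.0


def normalize(images, mean, std):
    scaled = images / 255.0       # step 2: scale to [0, 1]
    return (scaled - mean) / std  # step 3: broadcasts over the channel axis


train_norm = normalize(train, mean, std)
test_norm = normalize(test, mean, std)  # reuse the training statistics

# Skipping step 2 leaves pixels in [0, 255] while mean/std are in [0, 1].
wrong = (train - mean) / std
```

`train_norm` has a per-channel mean of about 0 and std of about 1, whereas `wrong` ends up with values in the hundreds.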

In [28]:
TRANSFORMS_with_normalization = transforms.Compose(
    [
        # ToTensor must come first: Normalize operates on tensors, not PIL images
        transforms.ToTensor(),
        transforms.Normalize(
            mean=mean_std_cifar["mean"], std=mean_std_cifar["std"]
        ),
    ]
)

In [29]:
trainset_cifar10 = datasets.CIFAR10(
    root="./data", train=True, download=True, transform=TRANSFORMS_with_normalization
)
testset_cifar10 = datasets.CIFAR10(
    root="./data", train=False, download=True, transform=TRANSFORMS_with_normalization
)
# .data still holds the raw uint8 images; the transforms run only when a sample is indexed
train_images_cifar10 = np.asarray(trainset_cifar10.data)  # (50000, 32, 32, 3)
test_images_cifar10 = np.asarray(testset_cifar10.data)  # (10000, 32, 32, 3)

Files already downloaded and verified
Files already downloaded and verified


## MNIST (Grayscale)

We next look at an example of calculating the mean and standard deviation of MNIST, which has one channel (grayscale).

```python
Mean: 0.1307
Standard Deviation: 0.3081
```

We reuse the `calcMeanStd` function from the CIFAR-10 section, which handles the single-channel case through its grayscale branch.

In [40]:
# mnist
trainset_mnist = datasets.MNIST(
    root="./data", train=True, download=True, transform=TRANSFORMS
)
testset_mnist = datasets.MNIST(
    root="./data", train=False, download=True, transform=TRANSFORMS
)
train_images_mnist = np.asarray(trainset_mnist.data)  # (60000, 28, 28)
test_images_mnist = np.asarray(testset_mnist.data)  # (10000, 28, 28)

In [42]:
mean_std_mnist = calcMeanStd(train_images_mnist)
print(mean_std_mnist)

{'mean': (0.1306604762738429,), 'std': (0.3081078038564622,)}


In [45]:
print(trainset_mnist.data.float().mean() / 255)
print(trainset_mnist.data.float().std() / 255)

tensor(0.1307)
tensor(0.3081)


## References

To read more on how others do it **efficiently**, see the links below.

- https://www.kaggle.com/kozodoi/seti-mean-and-std-of-new-data/notebook
- https://www.kaggle.com/kozodoi/computing-dataset-mean-and-std
- https://forums.fast.ai/t/calculating-our-own-image-stats-imagenet-stats-cifar-stats-etc/40355/3
- https://github.com/JoshVarty/CancerDetection/blob/master/01_ImageStats.ipynb
- https://forums.fast.ai/t/calculating-new-stats/31214
- https://forums.fast.ai/t/calcuating-the-mean-and-standard-deviation-for-normalize/62883/13
- https://www.kaggle.com/c/ranzcr-clip-catheter-line-classification/discussion/211039
- https://stackoverflow.com/questions/58151507/why-pytorch-officially-use-mean-0-485-0-456-0-406-and-std-0-229-0-224-0-2
- https://stackoverflow.com/questions/65699020/calculate-standard-deviation-for-grayscale-imagenet-pixel-values-with-rotation-m/65717887#65717887
- https://stackoverflow.com/questions/66678052/how-to-calculate-the-mean-and-the-std-of-cifar10-data
- https://drive.google.com/drive/u/1/folders/1Gum3vsRsKKRSFZ1hyKaPTiVs1AUAmdKD
- https://stackoverflow.com/questions/50710493/cifar-10-meaningless-normalization-values
- https://discuss.pytorch.org/t/normalization-in-the-mnist-example/457/6
- https://github.com/kuangliu/pytorch-cifar/issues/19
- https://github.com/Armour/pytorch-nn-practice/blob/master/utils/meanstd.py