## Preprocessing and Normalization in PyTorch

In PyTorch, preprocessing transforms prepare raw images for model input. A typical pipeline includes:

* **Resizing**
  Ensures all images share consistent dimensions.

  ```python
  transforms.Resize((256, 256))       # Fixed (H, W); may distort the aspect ratio
  transforms.Resize(256)              # Shorter side -> 256; aspect ratio preserved
  ```

* **Cropping**

  * **CenterCrop:** For consistent evaluation crops.
  * **RandomCrop / RandomResizedCrop:** For training augmentation.

* **Data Augmentation** *(training only)*
  Examples: `RandomHorizontalFlip`, `RandomRotation`, `ColorJitter`, `RandomAffine`, `RandomErasing`, etc., to improve generalization. Note that `RandomErasing` operates on tensors, so it belongs after `ToTensor`.

* **ToTensor**
  Converts a PIL image or `uint8` NumPy array in `[H, W, C]` layout to a float `torch.Tensor` in `[C, H, W]` layout, scaling pixel values from `[0, 255]` to `[0.0, 1.0]`.

  ```python
  transforms.ToTensor()
  ```

  **Note:** This does **not** normalize by mean/std — it only rescales.

* **Normalization (Standardization)**
  Applies per-channel Z-score normalization:

  $$
  X' = \frac{X - \mu}{\sigma}
  $$

  where `mean = μ` and `std = σ` are computed **from the training set only**.

  ```python
  transforms.Normalize(mean=[0.485, 0.456, 0.406],
                       std=[0.229, 0.224, 0.225])  # ImageNet stats
  ```

  Use dataset-specific stats unless working with pretrained models, in which case match the model’s training stats (e.g., ImageNet).

---

## Building a Transform Pipeline

**Golden rule:**

```
Resize → Augment → ToTensor → Normalize
```

This order ensures:

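* Resizing and most augmentations happen on PIL images (tensor-only transforms such as `RandomErasing` are the exception and go after `ToTensor`).
* Conversion to tensors happens before normalization, because `Normalize` operates on tensors.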
* Normalization matches the way statistics were computed.

**Example: Training vs Evaluation**

```python
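from torchvision import transforms

# Placeholder statistics (ImageNet values); substitute your own dataset's stats
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

# Training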
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    transforms.ToTensor(),
    transforms.Normalize(mean, std)
])

# Evaluation
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean, std)
])
```

---

#### Key Points

* **ToTensor** only rescales to `[0, 1]` — normalization is a separate step.
* Mean/std are **always computed from the training set only** to avoid data leakage.
* Apply the same normalization to training, validation, and test sets.
* Keep the order of transforms consistent with how you computed your stats.

---




## Standardization vs. Normalization

### 1.1 Normalization (Min-Max Scaling or Feature Scaling)
Normalization rescales the feature values to a fixed range, usually $[0, 1]$. The formula is:

$$
X_{\text{norm}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}
$$

- **Purpose**: Useful when you want all features to contribute equally to the model, especially when they are on different scales.
- **Assumptions**: Does **not** assume any particular distribution of the data.
- **Sensitive to outliers**: Yes — since it relies on the minimum and maximum values, outliers can significantly affect the scaling.
- **Use cases**: Algorithms that rely on distances or assume bounded input features, such as:
  - K-Nearest Neighbors (KNN)
  - Neural Networks (e.g., when using sigmoid/tanh activations)
  - Principal Component Analysis (PCA), when a bounded scale is acceptable (standardization is the more common choice for PCA)
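
A small NumPy sketch of min-max scaling, including its sensitivity to outliers. The helper names `minmax_fit` and `minmax_apply` and the numbers are made up for illustration.

```python
import numpy as np

def minmax_fit(x):
    """Return per-feature (min, max) computed from the rows of x."""
    return x.min(axis=0), x.max(axis=0)

def minmax_apply(x, lo, hi):
    """Rescale x with previously fitted min/max values."""
    return (x - lo) / (hi - lo)

# Two features on very different scales (made-up numbers)
train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0], [4.0, 800.0]])
lo, hi = minmax_fit(train)
scaled = minmax_apply(train, lo, hi)
print(scaled)  # both columns now span [0, 1]

# Adding a single outlier to feature 0 squeezes the other values toward 0
with_outlier = np.vstack([train, [100.0, 500.0]])
lo_o, hi_o = minmax_fit(with_outlier)
scaled_o = minmax_apply(with_outlier, lo_o, hi_o)
print(scaled_o[:4, 0])
```

With the outlier present, the four original values of the first feature all land in roughly the lowest 3% of the range, which is the outlier sensitivity described above.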

---

### 1.2 Standardization (Z-score Normalization)
Standardization transforms the data to have zero mean and unit variance. The formula is:

$$
X_{\text{standard}} = \frac{X - \mu}{\sigma}
$$

- **Purpose**: Useful when features have different means and variances and you want to center them around 0.
- **Assumptions**: Works well if the data is approximately normally distributed, but this is **not a strict requirement**.
- **Sensitive to outliers**: Less than min-max normalization, but outliers still affect mean and standard deviation.
- **Use cases**: Algorithms that assume data is centered or use covariance:
  - Linear Regression
  - Logistic Regression
  - Support Vector Machines (SVM)
  - Principal Component Analysis (PCA), so that features with large raw variance do not dominate the principal components
  - K-Means Clustering
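
The same idea for standardization, sketched in NumPy with made-up numbers. The statistics are fitted on the training rows and reused for the test rows.

```python
import numpy as np

train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0], [4.0, 800.0]])
test = np.array([[2.5, 500.0], [5.0, 1000.0]])

# Fit on the training data only
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population standard deviation (ddof=0)

train_z = (train - mu) / sigma
test_z = (test - mu) / sigma  # reuse the training statistics

print(train_z.mean(axis=0), train_z.std(axis=0))  # approximately 0 and 1
print(test_z)
```

Unlike min-max scaling, the output is not bounded: a test value beyond the training range simply maps to a larger z-score.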

---



| Aspect                | Normalization ($[0, 1]$)                          | Standardization ($\mu=0$, $\sigma=1$)                |
|-----------------------|---------------------------------------------------|------------------------------------------------------|
| Range                 | Bounded ($[0, 1]$)                                | Unbounded                                            |
| Sensitive to outliers | High                                              | Medium                                               |
| Assumes normality     | No                                                | No (but benefits from it)                            |
| Effect of outliers    | Squeeze the remaining values into a narrow band   | Stay far from 0, but skew $\mu$ and inflate $\sigma$ |
| Use cases             | KNN, Neural Nets                                  | SVM, Linear Models, PCA, K-Means                     |

---


## 3. Example From MNIST and ImageNet


**1. MNIST mean & std**

* The commonly used values `mean=[0.1307]` and `std=[0.3081]` are **precomputed** from the **training split** of MNIST (grayscale, so only one channel).
* These statistics are used to normalize **both** training and test images, but they are **calculated only from the training set** to avoid "data leakage."
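
For reference, single-channel statistics of this kind can be computed in a few lines. The sketch below uses random stand-in data so that it runs without a download; `grayscale_mean_std` is a hypothetical helper name, and the commented-out lines show how it would presumably be called on the real MNIST training split.

```python
import torch

def grayscale_mean_std(images_uint8):
    """Mean and std over all pixels of a [N, H, W] uint8 tensor, after scaling to [0, 1]."""
    x = images_uint8.float() / 255.0
    return x.mean().item(), x.std(unbiased=False).item()

# Stand-in data with MNIST's image shape (random noise, not real digits)
torch.manual_seed(0)
fake_train = torch.randint(0, 256, (100, 28, 28), dtype=torch.uint8)
mean, std = grayscale_mean_std(fake_train)
print(f"mean={mean:.4f}, std={std:.4f}")

# With the real dataset (requires a download), the call would look like:
# from torchvision import datasets
# train_set = datasets.MNIST("data", train=True, download=True)
# print(grayscale_mean_std(train_set.data))
```

On the real training split, the result should come out close to the `0.1307` and `0.3081` quoted above.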

**2. ImageNet mean & std**

```python
transforms.Normalize(mean=[0.485, 0.456, 0.406],
                     std=[0.229, 0.224, 0.225])
```

* These values are computed from the **ImageNet training split** (RGB channels).
* Same rule: we compute the stats on the training set, then apply them to **both** training and validation/test sets.

**3. Dataset separation**

* For datasets like MNIST, CIFAR-10, and ImageNet, the **training and test (or validation) splits are separate**.
* With `torchvision.datasets.MNIST` or `CIFAR10`, you choose the split explicitly with `train=True` or `train=False`. ImageNet has to be downloaded manually; you then use the appropriate folder (`train/` vs `val/`) or pass `split="train"` / `split="val"` to `torchvision.datasets.ImageNet`.

**4. Why fix the mean/std?**

* Once the dataset split is fixed, we don’t recompute the mean/std every time — we use the same fixed numbers so experiments are reproducible.
* Computing them separately for train and test would be incorrect: the two splits would be normalized differently, and using test-set statistics leaks test information.
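
A toy illustration of why split-specific statistics are a problem: the same raw pixel value ends up at different normalized values depending on which statistics are used. The two random tensors below are hypothetical stand-ins for a training and a test split.

```python
import torch

torch.manual_seed(0)
# Hypothetical splits whose pixel statistics differ slightly
train = torch.rand(1000, 1, 8, 8) * 0.8
test = torch.rand(200, 1, 8, 8)

train_mean, train_std = train.mean(), train.std()
test_mean, test_std = test.mean(), test.std()

pixel = torch.tensor(0.5)  # the same raw pixel value in both cases
correct = (pixel - train_mean) / train_std    # what the model saw during training
mismatched = (pixel - test_mean) / test_std   # test data normalized with its own stats
print(correct.item(), mismatched.item())
```

The model was trained on inputs of the first kind, so feeding it inputs of the second kind shifts every pixel relative to what it learned.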

---


When you evaluate a model, you must apply **the exact same preprocessing** to the test (or validation) data as you did to the training data.

So for ImageNet:

```python
transforms.Normalize(
    mean=[0.485, 0.456, 0.406],
    std=[0.229, 0.224, 0.225]
)
```

is applied **both**:

* During training (on training images, after augmentation like crop/flip)
* During testing/validation (on test images, without augmentations except resizing/cropping)

**Reason:**

* The model learns to work with inputs that are normalized in this way.
* Feeding it unnormalized test data would cause a distribution mismatch, hurting performance.

📌 In short:

* Mean/std are computed **only** from the training split (no leakage).
* Normalization with those values is applied to **all splits** when feeding the model.

---




## Normalizing Your Own Dataset
Unlike pre-packaged datasets such as `ImageNet` or `MNIST`, which come with fixed training and test splits and precomputed normalization statistics, working with your own custom dataset requires you to decide how to split the data and how to compute normalization values. In this case, you have two main approaches:



#### A) Random split each run

Because the split changes on every run, you have to recompute mean/std on the **training subset only** each time.

```python
import torch
from torchvision import datasets, transforms
from torch.utils.data import random_split, DataLoader, Subset

root = "/path/to/images"  # class-subfolder structure for ImageFolder

# Base transform WITHOUT Normalize for stats computation
base_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),          # -> [0,1]
])

full = datasets.ImageFolder(root, transform=base_tf)

# Random split (no fixed seed, so the subsets differ on every run)
n = len(full)
n_train = int(0.7 * n)
n_val   = int(0.15 * n)
n_test  = n - n_train - n_val
train_raw, val_raw, test_raw = random_split(full, [n_train, n_val, n_test])

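# Compute per-channel mean/std on TRAIN ONLY (this run), streaming over batches
def mean_std(dataset):
    loader = DataLoader(dataset, batch_size=64, shuffle=False, num_workers=4)
    # Running pixel count, running mean, and running sum of squared deviations
    n_pix, mean, M2 = 0, torch.zeros(3), torch.zeros(3)
    for x, _ in loader:
        b, c, h, w = x.shape
        x = x.view(b, c, -1)                     # flatten the spatial dimensions
        bm = x.mean(dim=(0, 2))                  # batch mean per channel
        bv = x.var(dim=(0, 2), unbiased=False)   # batch (population) variance per channel
        bp = b*h*w                               # pixels per channel in this batch
        delta = bm - mean
        tot = n_pix + bp
        # Merge the batch statistics into the running statistics
        mean += delta * (bp / tot)
        M2 += bv*bp + (delta**2) * (n_pix*bp/tot)
        n_pix = tot
    std = torch.sqrt(M2 / n_pix)
    return mean.tolist(), std.tolist()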

mean, std = mean_std(train_raw)

# Rebuild datasets WITH Normalize, preserving indices
norm_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=mean, std=std),
])

full_norm = datasets.ImageFolder(root, transform=norm_tf)
train = Subset(full_norm, train_raw.indices)
val   = Subset(full_norm,  val_raw.indices)
test  = Subset(full_norm,  test_raw.indices)
```
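
The batch-wise update used in `mean_std` can be verified against a direct computation on data small enough to fit in memory. The sketch below re-implements the same update as a standalone function (`streaming_mean_std` is a hypothetical name) and feeds it random tensors instead of an `ImageFolder`.

```python
import torch

def streaming_mean_std(batches, channels=3):
    """Per-channel mean/std over an iterable of [B, C, H, W] tensors, one batch at a time."""
    n_pix = 0
    mean = torch.zeros(channels, dtype=torch.float64)
    m2 = torch.zeros(channels, dtype=torch.float64)
    for x in batches:
        b, c, h, w = x.shape
        x = x.double().reshape(b, c, -1)
        bm = x.mean(dim=(0, 2))
        bv = x.var(dim=(0, 2), unbiased=False)
        bp = b * h * w
        delta = bm - mean
        tot = n_pix + bp
        # Combine running and batch statistics, weighted by pixel counts
        mean = mean + delta * (bp / tot)
        m2 = m2 + bv * bp + delta**2 * (n_pix * bp / tot)
        n_pix = tot
    return mean, torch.sqrt(m2 / n_pix)

torch.manual_seed(0)
data = torch.rand(50, 3, 16, 16)
batches = data.split(8)  # six batches of 8 and a final batch of 2

mean, std = streaming_mean_std(batches)
ref_mean = data.double().mean(dim=(0, 2, 3))
ref_std = data.double().std(dim=(0, 2, 3), unbiased=False)
print(torch.allclose(mean, ref_mean), torch.allclose(std, ref_std))
```

The uneven final batch is deliberate: weighting by pixel count is what keeps the result exact when batch sizes differ.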

#### B) Fixed Split

Using a fixed seed makes `random_split` produce the same split on every run, so the mean and std can be computed once and reused.
```python
seed = 42
g = torch.Generator().manual_seed(seed)

train_raw, val_raw, test_raw = random_split(full, [n_train, n_val, n_test], generator=g)
```
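
A minimal check that a seeded generator really reproduces the split. The `TensorDataset` of 100 integers is a hypothetical stand-in for `full`, and `split_with_seed` is a helper introduced only for this sketch.

```python
import torch
from torch.utils.data import TensorDataset, random_split

dataset = TensorDataset(torch.arange(100))  # stand-in for the real dataset
lengths = [70, 15, 15]

def split_with_seed(seed):
    """Split `dataset` using a generator seeded with `seed`."""
    g = torch.Generator().manual_seed(seed)
    return random_split(dataset, lengths, generator=g)

train_a, val_a, test_a = split_with_seed(42)
train_b, val_b, test_b = split_with_seed(42)
train_c, _, _ = split_with_seed(7)

same_seed_matches = list(train_a.indices) == list(train_b.indices)
other_seed_matches = list(train_a.indices) == list(train_c.indices)
print(same_seed_matches, other_seed_matches)
```

Since the training indices are identical across runs with the same seed, statistics computed once on that subset remain valid for later runs, as long as the dataset contents and the split sizes do not change.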




