# Feature Scaling in PyTorch

When you need to apply feature scaling to tabular data, you should use [sklearn](https://scikit-learn.org/). When you are dealing with text, you tokenize the text and assign the tokens to one-hot vectors, which puts the tokens on the same scale (this probably does not make sense at the moment). Other types of sequential data, like weather measurements, can be scaled when you construct your Dataset using the equations we learned in the previous section. When it comes to vision though, PyTorch provides scaling capabilities out of the box in `torchvision.transforms`. Below we cover how to apply those transforms and compare how the training loss of a neural network develops with and without scaling.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
# it is relatively common to call the transforms namespace T
import torchvision.transforms as T
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader

When we create a dataset using the `MNIST` class, we can pass a `transform` argument. As the name suggests, the transform is applied to each image when it is loaded. For example, the `PILToTensor` transform converts the data from a `PIL` image into a tensor. Often you will need to apply more than one transform. You can chain transforms by using `transforms.Compose([transform1,transform2,...])`, which applies them in the given order. While torchvision provides a great number of transforms (see [Torchvision Docs](https://pytorch.org/vision/stable/transforms.html#)), sometimes you might want more control. `transforms.Lambda()` takes any callable, for example a Python lambda function, in which you can process images as you desire. Below we prepare two sets of transforms, both of which we will apply to MNIST.

The first set of transforms first converts the `PIL` image into a `Tensor` and then casts the `Tensor` to float32. Both steps are important: PyTorch can only work with tensors, and `PILToTensor` returns uint8 values, while the weights of our layers are float32 by default, so the inputs have to be float32 as well.

In [2]:
transform = T.Compose([T.PILToTensor(), 
                       T.Lambda(lambda tensor : tensor.to(torch.float32))
])

Those transforms do not include any form of scaling, so the pixel values stay in the 0-255 range and we expect the training to be relatively slow.

In [3]:
dataset_orig = MNIST(root="../datasets/", train=True, download=True, transform=transform)

Below we calculate the mean and the standard deviation of the images' pixel values. You will notice that there is only one mean and one std and not 784 (28*28 pixels). That is because in computer vision the scaling is done per channel and not per pixel, and MNIST images have a single grayscale channel. If we were dealing with color images, we would have 3 channels and would therefore require 3 mean and 3 std calculations.

In [4]:
# calculate mean and std
# we will need this part later for normalization
# we divide by 255.0, because ToTensor (used below) scales the images into the 0-1 range before Normalize is applied
mean = (dataset_orig.data.float() / 255.0).mean()
std = (dataset_orig.data.float() / 255.0).std()
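For color images, the per-channel statistics mentioned above could be computed along the following lines. This is only a sketch: the batch of random RGB images is a hypothetical placeholder, and it assumes the images are stored in the `(N, C, H, W)` layout and already scaled to the 0-1 range.

```python
import torch

torch.manual_seed(0)

# Hypothetical batch of 100 RGB images of size 32x32, values in [0, 1]
images = torch.rand(100, 3, 32, 32)

# Reduce over batch, height and width, but keep the channel dimension
channel_mean = images.mean(dim=(0, 2, 3))
channel_std = images.std(dim=(0, 2, 3))

# One mean and one std per channel
print(channel_mean.shape, channel_std.shape)
```

The two resulting tensors each hold three values, one per color channel.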

The second set of transforms first applies `transforms.ToTensor`, which turns the `PIL` image into a float32 `Tensor` and scales the image into the 0-1 range. The `transforms.Normalize` transform then conducts what we call standardization, by subtracting the mean and dividing by the standard deviation. If you have a color image with 3 channels, you need to provide one mean and one std value per channel, for example as tuples of length 3.

In [5]:
transform = T.Compose([
    T.ToTensor(),
    T.Normalize(mean=mean, std=std)
])

In [6]:
dataset_normalized = MNIST(root="../datasets/", train=True, download=True, transform=transform)

In [7]:
# parameters
DEVICE = ("cuda:0" if torch.cuda.is_available() else "cpu")
NUM_EPOCHS=20
BATCH_SIZE=32

#number of hidden units in the first and second hidden layer
HIDDEN_SIZE_1 = 100
HIDDEN_SIZE_2 = 50
NUM_LABELS = 10
NUM_FEATURES = 28*28
ALPHA = 0.1

Based on the datasets we have two dataloaders: `dataloader_orig` without scaling and `dataloader_normalized` with scaling.

In [8]:
dataloader_orig = DataLoader(dataset=dataset_orig, 
                              batch_size=BATCH_SIZE,
                              shuffle=True,
                              drop_last=True,
                              num_workers=4)

In [9]:
dataloader_normalized = DataLoader(dataset=dataset_normalized, 
                              batch_size=BATCH_SIZE,
                              shuffle=True,
                              drop_last=True,
                              num_workers=4)

The `train` function is the same generic function that we used in the previous PyTorch tutorials. 

In [10]:
def train(dataloader, model, criterion, optimizer):
    for epoch in range(NUM_EPOCHS):
        loss_sum = 0
        batch_nums = 0
        for batch_idx, (features, labels) in enumerate(dataloader):
            # flatten the images and move features and labels to the device (GPU if available)
            features = features.view(-1, NUM_FEATURES).to(DEVICE)
            labels = labels.to(DEVICE)

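            # ------ FORWARD PASS --------
            # the model returns raw scores (logits), not probabilities
            logits = model(features)

            # ------CALCULATE LOSS --------
            # CrossEntropyLoss applies log-softmax internally, so it expects logits
            loss = criterion(logits, labels)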

            # ------BACKPROPAGATION --------
            loss.backward()

            # ------GRADIENT DESCENT --------
            optimizer.step()

            # ------CLEAR GRADIENTS --------
            optimizer.zero_grad()

            # ------TRACK LOSS --------
            batch_nums += 1
            loss_sum += loss.detach().cpu()

        print(f'Epoch: {epoch+1} Loss: {loss_sum / batch_nums}')

Same goes for the `model`. There is nothing new. Just a plain vanilla fully connected neural network.

In [11]:
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
                nn.Linear(NUM_FEATURES, HIDDEN_SIZE_1),
                nn.Sigmoid(),
                nn.Linear(HIDDEN_SIZE_1, HIDDEN_SIZE_2),
                nn.Sigmoid(),
                nn.Linear(HIDDEN_SIZE_2, NUM_LABELS),
            )
    
    def forward(self, features):
        return self.layers(features)

We first train on the non-standardized dataset.

In [12]:
model = Model().to(DEVICE)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=ALPHA)
train(dataloader_orig, model, criterion, optimizer)

Epoch: 1 Loss: 0.9535496234893799
Epoch: 2 Loss: 0.7631136178970337
Epoch: 3 Loss: 0.7229397296905518
Epoch: 4 Loss: 0.7591567635536194
Epoch: 5 Loss: 0.7013404965400696
Epoch: 6 Loss: 0.7294915318489075
Epoch: 7 Loss: 0.6799680590629578
Epoch: 8 Loss: 0.6984747648239136
Epoch: 9 Loss: 0.6679096221923828
Epoch: 10 Loss: 0.6765357851982117
Epoch: 11 Loss: 0.657106876373291
Epoch: 12 Loss: 0.68031245470047
Epoch: 13 Loss: 0.6622717380523682
Epoch: 14 Loss: 0.5883834362030029
Epoch: 15 Loss: 0.5738955736160278
Epoch: 16 Loss: 0.5895755290985107
Epoch: 17 Loss: 0.5575388669967651
Epoch: 18 Loss: 0.5529080629348755
Epoch: 19 Loss: 0.5125207901000977
Epoch: 20 Loss: 0.5626105070114136


We recreate the model with fresh weights and conduct the training on the standardized dataset.

In [13]:
model = Model().to(DEVICE)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=ALPHA)
train(dataloader_normalized, model, criterion, optimizer)

Epoch: 1 Loss: 0.809180736541748
Epoch: 2 Loss: 0.2534318268299103
Epoch: 3 Loss: 0.1727478802204132
Epoch: 4 Loss: 0.13004107773303986
Epoch: 5 Loss: 0.10386991500854492
Epoch: 6 Loss: 0.08564582467079163
Epoch: 7 Loss: 0.07254727929830551
Epoch: 8 Loss: 0.06272272020578384
Epoch: 9 Loss: 0.053299564868211746
Epoch: 10 Loss: 0.04657835140824318
Epoch: 11 Loss: 0.0407768115401268
Epoch: 12 Loss: 0.03545976057648659
Epoch: 13 Loss: 0.031082086265087128
Epoch: 14 Loss: 0.027383966371417046
Epoch: 15 Loss: 0.023808151483535767
Epoch: 16 Loss: 0.021104995161294937
Epoch: 17 Loss: 0.018414998427033424
Epoch: 18 Loss: 0.01610402762889862
Epoch: 19 Loss: 0.014298812486231327
Epoch: 20 Loss: 0.012513038702309132


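You should notice the huge difference. With the standardized dataset the loss decreases at a much higher rate: after 20 epochs it is at roughly 0.013, while training on the unscaled dataset is still stuck at roughly 0.56.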
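One plausible contributor to this gap is the saturation of the sigmoid activations: large inputs push the pre-activations far away from zero, where the slope of the sigmoid is almost zero and the gradients become tiny. The sketch below illustrates this with a freshly initialized layer of the same shape as our first hidden layer. The inputs are random and purely hypothetical, and since real MNIST digits are mostly black, the effect on the actual dataset is likely less extreme.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Same shape as the first hidden layer of the model above
layer = nn.Linear(28 * 28, 100)

# Hypothetical inputs: random pixels in [0, 255] versus zero-mean, unit-variance values
raw_inputs = torch.rand(256, 28 * 28) * 255.0
scaled_inputs = torch.randn(256, 28 * 28)

def mean_sigmoid_slope(inputs):
    # The derivative of the sigmoid is s * (1 - s).
    # Values close to 0 indicate saturated units, the maximum is 0.25.
    with torch.no_grad():
        s = torch.sigmoid(layer(inputs))
        return (s * (1 - s)).mean().item()

raw_slope = mean_sigmoid_slope(raw_inputs)
scaled_slope = mean_sigmoid_slope(scaled_inputs)
print(f"raw: {raw_slope:.4f}  scaled: {scaled_slope:.4f}")
```

With the unscaled inputs the average slope is close to zero, while with the scaled inputs it stays near the maximum of 0.25.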