<a href="https://colab.research.google.com/github/devashishk99/DL-Fundamentals-Lightning-AI/blob/main/Unit%203/exercise_2_standardization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unit 3, Exercise 2: Standardization

This exercise is an extension of Exercise 1. Here, the goal is to add code to standardize the features such that they have a mean of 0 and a standard deviation of 1 as discussed in Unit 3.7.
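As a reminder, the transformation can be written out per feature. The notation below is an assumption chosen to match the code in section 4 (training-set mean, and a standard deviation with an $n-1$ denominator); Unit 3.7 may use slightly different symbols or conventions:

$$
x'_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}, \qquad
\mu_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}, \qquad
\sigma_j = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(x_{ij} - \mu_j\right)^2}
$$

Here $i$ indexes the $n$ training examples and $j$ indexes the features.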

Most of the code below is identical to Exercise 1. To avoid spoiling the solution for Exercise 1, the same code parts are missing.

## 1) Installing Libraries

You likely already have all libraries installed and don't need to do anything here.

In [None]:
# !conda install numpy pandas matplotlib --yes

In [None]:
# !pip install torch

In [None]:
# !conda install watermark

In [None]:
%load_ext watermark
%watermark -v -p numpy,pandas,matplotlib,torch

## 2) Loading the Dataset

We are using the familiar `read_csv` function from pandas to load the dataset:

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("data_banknote_authentication.txt", header=None)
df.head()

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 3.6216 | 8.6661 | -2.8073 | -0.44699 | 0 |
| 1 | 4.5459 | 8.1674 | -2.4586 | -1.4621 | 0 |
| 2 | 3.866 | -2.6383 | 1.9242 | 0.10645 | 0 |
| 3 | 3.4566 | 9.5228 | -4.0112 | -3.5944 | 0 |
| 4 | 0.32924 | -4.4552 | 4.5718 | -0.9888 | 0 |


In [3]:
X_features = df[[0, 1, 2, 3]].values
y_labels = df[4].values

Number of examples and features:

In [4]:
X_features.shape

(1372, 4)

It is usually a good idea to look at the label distribution:

In [5]:
import numpy as np

np.bincount(y_labels)

array([762, 610])
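One use of these counts is to work out the accuracy a trivial "always predict the majority class" rule would reach, which gives a reference point for the accuracies reported later. The snippet below is only a small sketch that hard-codes the two counts printed above; the variable names are illustrative.

```python
import numpy as np

# Class counts copied from the np.bincount output above
counts = np.array([762, 610])

# Fraction of examples in each class
proportions = counts / counts.sum()

# Accuracy of always predicting the most frequent class
majority_baseline = proportions.max()
print(f"Majority-class baseline: {majority_baseline*100:.2f}%")
```

Any trained model should comfortably beat this baseline to be considered useful.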

## 3) Defining a DataLoader

The `DataLoader` code is the same code we used in Unit 3.6:

In [6]:
import torch
from torch.utils.data import Dataset, DataLoader


class MyDataset(Dataset):
    def __init__(self, X, y):

        self.features = torch.tensor(X, dtype=torch.float32)
        self.labels = torch.tensor(y, dtype=torch.float32)

    def __getitem__(self, index):
        x = self.features[index]
        y = self.labels[index]
        return x, y

    def __len__(self):
        return self.labels.shape[0]

We will be using 80% of the data for training and 20% for validation. In a real project, we would also have a separate dataset for the final test set (in this case, we do not have an explicit test set).

In [7]:
train_size = int(X_features.shape[0]*0.80)
train_size

1097

In [8]:
val_size = X_features.shape[0] - train_size
val_size

275

Using `torch.utils.data.random_split`, we generate the training and validation sets along with the respective data loaders:

In [9]:
import torch

dataset = MyDataset(X_features, y_labels)

torch.manual_seed(1)
train_set, val_set = torch.utils.data.random_split(dataset, [train_size, val_size])

train_loader = DataLoader(
    dataset=train_set,
    batch_size=10,
    shuffle=True,
)

val_loader = DataLoader(
    dataset=val_set,
    batch_size=10,
    shuffle=False,
)

## 4) Standardization

There are multiple ways to implement the standardization procedure. For this exercise, we are going to implement a procedure that standardizes each batch of features after it comes out of the data loader, rather than transforming the dataset up front.

Since this dataset has 4 features, there should be 4 means and 4 standard deviations we compute from the training set. We can do this as follows:

In [10]:
train_mean = torch.zeros(X_features.shape[1])

for x, y in train_loader:
    train_mean += x.sum(dim=0)

train_mean /= len(train_set)

train_std = torch.zeros(X_features.shape[1])
for x, y in train_loader:
    train_std += ((x - train_mean)**2).sum(dim=0)

train_std = torch.sqrt(train_std / (len(train_set)-1))

In [11]:
print("Feature means:", train_mean)
print("Feature std. devs:", train_std)

Feature means: tensor([ 0.3854,  1.8680,  1.4923, -1.1999])
Feature std. devs: tensor([2.8575, 5.9216, 4.3869, 2.1041])


We compute the means and standard deviations by iterating over the training loader. This approach works even for large datasets where the entire dataset doesn't fit into memory.
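The loop above makes two passes over the loader (one for the means, one for the squared deviations). As a hedged aside, the same statistics can be obtained in a single pass by accumulating the sum and the sum of squares per feature. The sketch below uses synthetic stand-in data rather than the banknote features, and the names `feat_sum` and `feat_sq_sum` are hypothetical. It accumulates in float64 because this formula can lose precision through cancellation when the mean is large relative to the spread.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical stand-in data: 100 examples with 4 features
X = torch.randn(100, 4) * 3.0 + 1.0
y = torch.zeros(100)
loader = DataLoader(TensorDataset(X, y), batch_size=10, shuffle=False)

n = 0
feat_sum = torch.zeros(4, dtype=torch.float64)
feat_sq_sum = torch.zeros(4, dtype=torch.float64)

# Single pass: accumulate count, sum, and sum of squares per feature
for x, _ in loader:
    x = x.double()
    n += x.shape[0]
    feat_sum += x.sum(dim=0)
    feat_sq_sum += (x ** 2).sum(dim=0)

mean = feat_sum / n
# Sample variance with the same (n - 1) denominator as the two-pass loop
var = (feat_sq_sum - n * mean ** 2) / (n - 1)
std = torch.sqrt(var)

print("Feature means:", mean)
print("Feature std. devs:", std)
```

The two-pass version in the cell above is usually the numerically safer choice; the one-pass variant mainly helps when iterating over the data is expensive.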

A simpler approach, which only works for smaller datasets that fit into memory, is as follows:

In [12]:
all_x = []
for x, y in train_loader:
    all_x.append(x)

train_std = torch.concat(all_x).std(dim=0)
train_mean = torch.concat(all_x).mean(dim=0)

In [13]:
print("Feature means:", train_mean)
print("Feature std. devs:", train_std)

Feature means: tensor([ 0.3854,  1.8680,  1.4923, -1.1999])
Feature std. devs: tensor([2.8575, 5.9216, 4.3869, 2.1041])
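Both approaches print the same values because `Tensor.std` applies Bessel's correction (an `n - 1` denominator) by default, which matches the manual loop. The following check illustrates this on a small synthetic tensor; the data is a stand-in, not the banknote features.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in data: 50 examples with 4 features
x = torch.randn(50, 4)
n = x.shape[0]

# Manual computation with an (n - 1) denominator, as in the loop above
mean = x.sum(dim=0) / n
manual_std = torch.sqrt(((x - mean) ** 2).sum(dim=0) / (n - 1))

# Built-in computation with default settings
builtin_std = x.std(dim=0)

print(torch.allclose(manual_std, builtin_std, atol=1e-5))
```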


<font color='red'>YOUR TASK is now to implement a standardization function based on these training set parameters above:</font>

In [14]:
def standardize(features, train_mean, train_std): # YOUR CODE
    # Subtract the per-feature training mean, then divide by the training std. dev.
    return (features - train_mean) / train_std
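A quick way to sanity-check such a function is to apply it to the data the statistics were computed from: the result should have per-feature means close to 0 and standard deviations close to 1. The sketch below restates the function so that it is self-contained and uses synthetic stand-in data; the scale and shift values are arbitrary assumptions.

```python
import torch


def standardize(features, mean, std):
    # Broadcasts the per-feature statistics across the batch dimension
    return (features - mean) / std


torch.manual_seed(0)

# Hypothetical stand-in for the training features: 200 examples, 4 features
x_train = torch.randn(200, 4) * torch.tensor([2.0, 6.0, 4.0, 2.0]) + 1.5

mean = x_train.mean(dim=0)
std = x_train.std(dim=0)

z = standardize(x_train, mean, std)

print("Means after standardization:", z.mean(dim=0))
print("Std. devs after standardization:", z.std(dim=0))
```

Note that validation data standardized with the training statistics will generally be close to, but not exactly, mean 0 and standard deviation 1.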

## 5) Implementing the model

Here, we are reusing the same model code we used in Unit 3.6:

In [15]:
class LogisticRegression(torch.nn.Module):

    def __init__(self, num_features):
        super().__init__()
        self.linear = torch.nn.Linear(num_features, 1)

    def forward(self, x):
        logits = self.linear(x)
        probas = torch.sigmoid(logits)
        return probas

## 6) The training loop

In this section, we are using the training loop from Unit 3.6. It's the exact same code except for one small modification: we added the line `if not batch_idx % 20` to print the loss only for every 20th batch (to reduce the number of output lines).

<font color='red'>YOUR TASK is to use the standardization code correctly in the for loop. Then, find a good learning rate and number of epochs so that you achieve a training and validation accuracy of at least 98%.</font>

In [16]:
import torch.nn.functional as F


torch.manual_seed(1)
model = LogisticRegression(num_features=4)
optimizer = torch.optim.SGD(model.parameters(), lr=1.) ## possible SOLUTION

num_epochs = 2 ## possible SOLUTION

for epoch in range(num_epochs):

    model = model.train()
    for batch_idx, (features, class_labels) in enumerate(train_loader):

        features = standardize(features, train_mean, train_std) ## SOLUTION
        probas = model(features)

        loss = F.binary_cross_entropy(probas, class_labels.view(probas.shape))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        ### LOGGING
        if not batch_idx % 20: # log every 20th batch
            print(f'Epoch: {epoch+1:03d}/{num_epochs:03d}'
                   f' | Batch {batch_idx:03d}/{len(train_loader):03d}'
                   f' | Loss: {loss:.2f}')

Epoch: 001/002 | Batch 000/110 | Loss: 0.93
Epoch: 001/002 | Batch 020/110 | Loss: 0.08
Epoch: 001/002 | Batch 040/110 | Loss: 0.23
Epoch: 001/002 | Batch 060/110 | Loss: 0.05
Epoch: 001/002 | Batch 080/110 | Loss: 0.03
Epoch: 001/002 | Batch 100/110 | Loss: 0.05
Epoch: 002/002 | Batch 000/110 | Loss: 0.13
Epoch: 002/002 | Batch 020/110 | Loss: 0.02
Epoch: 002/002 | Batch 040/110 | Loss: 0.21
Epoch: 002/002 | Batch 060/110 | Loss: 0.08
Epoch: 002/002 | Batch 080/110 | Loss: 0.07
Epoch: 002/002 | Batch 100/110 | Loss: 0.07


## 7) Evaluating the results

Again, reusing the code from Unit 3.6, we will calculate the training and validation set accuracy.

<font color='red'>Use the code below as is. What do you observe? And why?</font>

In [17]:
def compute_accuracy(model, dataloader):

    model = model.eval()

    correct = 0.0
    total_examples = 0

    for idx, (features, class_labels) in enumerate(dataloader):

        with torch.no_grad():
            probas = model(features)

        pred = torch.where(probas > 0.5, 1, 0)
        lab = class_labels.view(pred.shape).to(pred.dtype)

        compare = lab == pred
        correct += torch.sum(compare)
        total_examples += len(compare)

    return correct / total_examples

In [18]:
train_acc = compute_accuracy(model, train_loader)
print(f"Accuracy: {train_acc*100:.2f}%")

Accuracy: 83.96%


<font color='red'>Notice that the code for the validation accuracy is not shown? It's part of the exercise to implement it :)</font>

In [20]:
## SOLUTION

val_acc = compute_accuracy(model, val_loader)
print(f"Accuracy: {val_acc*100:.2f}%")

Accuracy: 80.36%


<font color='red'>Now, add the standardization to the `compute_accuracy` function above and recompute the training and validation accuracy. What do you observe?</font>

In [21]:
def compute_accuracy(model, dataloader):

    model = model.eval()

    correct = 0.0
    total_examples = 0

    for idx, (features, class_labels) in enumerate(dataloader):

        features = standardize(features, train_mean, train_std) ## SOLUTION
        with torch.no_grad():
            probas = model(features)

        pred = torch.where(probas > 0.5, 1, 0)
        lab = class_labels.view(pred.shape).to(pred.dtype)

        compare = lab == pred
        correct += torch.sum(compare)
        total_examples += len(compare)

    return correct / total_examples

In [22]:
train_acc = compute_accuracy(model, train_loader)
print(f"Accuracy: {train_acc*100:.2f}%")

Accuracy: 98.09%


In [23]:
val_acc = compute_accuracy(model, val_loader)
print(f"Accuracy: {val_acc*100:.2f}%")

Accuracy: 98.18%
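The gap between the two sets of accuracies comes from evaluating a model trained on standardized inputs with unstandardized inputs. One possible way to make this mistake harder to repeat, and one of the alternative implementations alluded to in section 4, is to apply the standardization inside the dataset itself so that every loader yields standardized features. The sketch below is only an illustration under assumptions: `StandardizedDataset` is a hypothetical wrapper (not part of PyTorch or of this course's code), and synthetic data stands in for the banknote features.

```python
import torch
from torch.utils.data import Dataset, DataLoader, TensorDataset


class StandardizedDataset(Dataset):
    """Hypothetical wrapper that standardizes features on access."""

    def __init__(self, base, mean, std):
        self.base = base
        self.mean = mean
        self.std = std

    def __getitem__(self, index):
        x, y = self.base[index]
        return (x - self.mean) / self.std, y

    def __len__(self):
        return len(self.base)


torch.manual_seed(0)

# Hypothetical stand-in data: 40 examples with 4 features and binary labels
X = torch.randn(40, 4) * 5.0 + 2.0
y = torch.randint(0, 2, (40,)).float()
base = TensorDataset(X, y)

# In practice, these statistics should come from the training split only
mean, std = X.mean(dim=0), X.std(dim=0)

wrapped = StandardizedDataset(base, mean, std)
loader = DataLoader(wrapped, batch_size=10, shuffle=False)

batch_x, batch_y = next(iter(loader))
print(batch_x.shape)
```

With this design, the training loop and `compute_accuracy` would not need to call `standardize` themselves, as long as both the training and validation sets are wrapped with the same training-set statistics.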
