#### PyTorch Dataset and DataLoader classes

Computing the gradient over an entire dataset at every step is often slow. A common alternative is to split the dataset into batches, compute the gradient on one batch at a time, and update the parameters after each batch. This approach is called mini-batch gradient descent.

1 epoch = 1 forward and 1 backward pass over <b>all</b> training samples in the dataset

batch_size = number of training samples used in 1 forward and 1 backward pass

number of iterations = number of passes (1 forward + 1 backward pass = 1 pass) needed to complete 1 epoch, each pass using batch_size samples

e.g. 100 samples, batch_size 20:<br>
1 epoch has 100/20 = 5 iterations<br>
1 iteration goes over 20 samples<br>
1 epoch goes over all 100 samples
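
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the helper name `iterations_per_epoch` is hypothetical and not part of PyTorch. It rounds up, on the assumption that a final, smaller batch still counts as one iteration when the dataset size is not a multiple of the batch size.

```python
import math

def iterations_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of batches needed to see every sample once (hypothetical helper)."""
    # Round up so that a final, partially filled batch is still counted
    return math.ceil(num_samples / batch_size)

print(iterations_per_epoch(100, 20))  # 5
print(iterations_per_epoch(102, 20))  # 6 (five full batches plus one batch of 2)
```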

In [1]:
import torch
import torchvision
from torch.utils.data import Dataset, DataLoader
import numpy as np
import pandas as pd
import math

A custom dataset subclasses `Dataset` and implements three methods: `__init__()` (loads the data), `__getitem__()` (returns the sample at a given index) and `__len__()` (returns the total number of samples).

In [2]:
class WineDataset(Dataset):
    def __init__(self):
        # Data loading: read the CSV with ',' as delimiter, skipping the header row
        xy = np.loadtxt('wine.csv', delimiter=',', dtype=np.float32, skiprows=1)
        self.X = xy[:, 1:] # Features: every column except the first (depends on dataset)
        self.X = torch.tensor(self.X, dtype=torch.float32) # Convert to tensor
        self.y = xy[:, [0]] # Targets: first column, kept 2-D with shape (num_samples, 1)
        self.y = torch.tensor(self.y, dtype=torch.float32) # Convert to tensor
        self.num_samples = xy.shape[0] # Number of samples
    
    def __getitem__(self, index):
        # Returns data at particular index
        return self.X[index], self.y[index]
    
    def __len__(self):
        # Returns length of dataset
        return self.num_samples

In [3]:
# Create the dataset
dataset = WineDataset()

In [4]:
# Check the first row
first_row = dataset[0]
print(first_row) # Prints a (features, target) tuple: features at index 0, target at index 1

(tensor([1.4230e+01, 1.7100e+00, 2.4300e+00, 1.5600e+01, 1.2700e+02, 2.8000e+00,
        3.0600e+00, 2.8000e-01, 2.2900e+00, 5.6400e+00, 1.0400e+00, 3.9200e+00,
        1.0650e+03]), tensor([1.]))
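
When `wine.csv` is not at hand, the same three-method pattern can be tried on in-memory data. The sketch below is only an illustration: `ToyDataset` and its small integer-valued tensors are made up for this example and are not part of the original notebook.

```python
import torch
from torch.utils.data import Dataset

class ToyDataset(Dataset):
    """Hypothetical in-memory dataset that mirrors the structure of WineDataset."""

    def __init__(self, num_samples=6, num_features=3):
        # Made-up data: row i of X holds consecutive numbers, and y[i] = i
        self.X = torch.arange(num_samples * num_features, dtype=torch.float32)
        self.X = self.X.reshape(num_samples, num_features)
        self.y = torch.arange(num_samples, dtype=torch.float32).reshape(-1, 1)

    def __getitem__(self, index):
        # Returns the (features, target) pair at the given index
        return self.X[index], self.y[index]

    def __len__(self):
        # Returns the number of samples
        return self.X.shape[0]

toy = ToyDataset()
print(len(toy))  # 6
print(toy[0])    # (tensor([0., 1., 2.]), tensor([0.]))
```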


In [5]:
# Use DataLoader with dataset
# Batches of 4 samples, reshuffled every epoch; num_workers=2 loads batches in 2 worker subprocesses, which can speed up data loading
dataloader = DataLoader(dataset=dataset, batch_size=4, shuffle=True, num_workers=2)

In [6]:
# Fetch one batch of data from the DataLoader
dataiter = iter(dataloader)
features, labels = next(dataiter) # Use the built-in next(); recent PyTorch versions no longer provide dataiter.next()
print(features, labels)

tensor([[1.3860e+01, 1.5100e+00, 2.6700e+00, 2.5000e+01, 8.6000e+01, 2.9500e+00,
         2.8600e+00, 2.1000e-01, 1.8700e+00, 3.3800e+00, 1.3600e+00, 3.1600e+00,
         4.1000e+02],
        [1.2370e+01, 1.0700e+00, 2.1000e+00, 1.8500e+01, 8.8000e+01, 3.5200e+00,
         3.7500e+00, 2.4000e-01, 1.9500e+00, 4.5000e+00, 1.0400e+00, 2.7700e+00,
         6.6000e+02],
        [1.2370e+01, 1.1700e+00, 1.9200e+00, 1.9600e+01, 7.8000e+01, 2.1100e+00,
         2.0000e+00, 2.7000e-01, 1.0400e+00, 4.6800e+00, 1.1200e+00, 3.4800e+00,
         5.1000e+02],
        [1.3940e+01, 1.7300e+00, 2.2700e+00, 1.7400e+01, 1.0800e+02, 2.8800e+00,
         3.5400e+00, 3.2000e-01, 2.0800e+00, 8.9000e+00, 1.1200e+00, 3.1000e+00,
         1.2600e+03]]) tensor([[2.],
        [2.],
        [2.],
        [1.]])


In [7]:
print(len(dataset)) # Length of dataset
print(len(dataloader)) # Number of batches

178
45
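
The batch count is the number of samples divided by the batch size, rounded up: ceil(178 / 4) = 45. The sketch below reproduces that count on synthetic data with the same shape as the wine data (the random tensors are stand-ins, not the real dataset) and shows how the `drop_last` argument of `DataLoader` changes it by discarding the final, incomplete batch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in with the same shape as the wine data: 178 samples, 13 features
features = torch.randn(178, 13)
labels = torch.randint(1, 4, (178, 1)).float()  # made-up class labels in {1, 2, 3}
toy_dataset = TensorDataset(features, labels)

# Default behaviour: the final batch holds the 2 leftover samples
keep_last = DataLoader(toy_dataset, batch_size=4, shuffle=True)
# drop_last=True discards the incomplete final batch
drop_last = DataLoader(toy_dataset, batch_size=4, shuffle=True, drop_last=True)

print(len(keep_last))  # 45
print(len(drop_last))  # 44
```

Dropping the last batch can be convenient when every batch must have the same size, at the cost of skipping a few samples in each epoch.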


In [8]:
# Complete (dummy) training loop
# Hyperparameters
num_epochs = 4
num_iterations = len(dataloader) # Number of batches = number of iterations

for epoch in range(num_epochs):
    # Go over all batches
    for batch, (inputs, outputs) in enumerate(dataloader):
        # The steps of a real training loop would run here, on every batch:
        # Forward pass
        # Loss calculation
        # Backward pass
        # Update parameters
        # Zero out gradients

        # Log progress every 5th batch
        if (batch + 1) % 5 == 0:
            print(f"Epoch: {epoch+1}/{num_epochs} Batch: {batch+1}/{num_iterations} Input size: {inputs.shape}")

Epoch: 1/4 Batch: 5/45 Input size: torch.Size([4, 13])
Epoch: 1/4 Batch: 10/45 Input size: torch.Size([4, 13])
Epoch: 1/4 Batch: 15/45 Input size: torch.Size([4, 13])
Epoch: 1/4 Batch: 20/45 Input size: torch.Size([4, 13])
Epoch: 1/4 Batch: 25/45 Input size: torch.Size([4, 13])
Epoch: 1/4 Batch: 30/45 Input size: torch.Size([4, 13])
Epoch: 1/4 Batch: 35/45 Input size: torch.Size([4, 13])
Epoch: 1/4 Batch: 40/45 Input size: torch.Size([4, 13])
Epoch: 1/4 Batch: 45/45 Input size: torch.Size([2, 13])
Epoch: 2/4 Batch: 5/45 Input size: torch.Size([4, 13])
Epoch: 2/4 Batch: 10/45 Input size: torch.Size([4, 13])
Epoch: 2/4 Batch: 15/45 Input size: torch.Size([4, 13])
Epoch: 2/4 Batch: 20/45 Input size: torch.Size([4, 13])
Epoch: 2/4 Batch: 25/45 Input size: torch.Size([4, 13])
Epoch: 2/4 Batch: 30/45 Input size: torch.Size([4, 13])
Epoch: 2/4 Batch: 35/45 Input size: torch.Size([4, 13])
Epoch: 2/4 Batch: 40/45 Input size: torch.Size([4, 13])
Epoch: 2/4 Batch: 45/45 Input size: torch.Size([2, 13])

Of the 45 batches, 44 contain 4 samples and the last one contains the remaining 2, for a total of 44 × 4 + 2 = 178 samples.

In a real training loop, each iteration computes the gradient on one batch and immediately updates the parameters. Once every batch has been processed, one epoch is complete, and the same procedure repeats for each of the remaining epochs.
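
To make the placeholder steps concrete, here is a minimal sketch of a full mini-batch training loop. Everything specific in it is an assumption made for illustration: the data is synthetic (random features with a linear target) rather than the wine data, and the model (`nn.Linear`), loss (`nn.MSELoss`), optimizer (`torch.optim.SGD`) and learning rate are arbitrary choices, not the notebook's.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic regression data (assumed stand-in): 178 samples, 13 features
X = torch.randn(178, 13)
true_w = torch.randn(13, 1)
y = X @ true_w + 0.1 * torch.randn(178, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=4, shuffle=True)

model = nn.Linear(13, 1)                                   # assumed model
criterion = nn.MSELoss()                                   # assumed loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # assumed optimizer

num_epochs = 4
epoch_losses = []
for epoch in range(num_epochs):
    running_loss = 0.0
    for inputs, targets in loader:
        predictions = model(inputs)             # Forward pass
        loss = criterion(predictions, targets)  # Loss calculation
        loss.backward()                         # Backward pass
        optimizer.step()                        # Update parameters
        optimizer.zero_grad()                   # Zero out gradients
        # Weight the batch loss by batch size so the short last batch counts fairly
        running_loss += loss.item() * inputs.size(0)
    epoch_losses.append(running_loss / len(loader.dataset))
    print(f"Epoch: {epoch+1}/{num_epochs} Mean loss: {epoch_losses[-1]:.4f}")
```

The parameters are updated 45 times per epoch, once per batch, instead of once per pass over the whole dataset.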