# PyTorch Dataset and DataLoader classes

This notebook demonstrates how to create and load custom datasets for use with PyTorch models.

Some terminology for batch training:
- One epoch is one forward and one backward pass over all training samples.
- Batch size is the number of training samples used in a single forward and backward pass.
- The number of iterations is the number of passes needed to complete one epoch, each pass processing one batch of samples.

For example, with 100 samples and a batch size of 20, one epoch takes 100/20 = 5 iterations.
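
The arithmetic above can be checked in a few lines. This is a minimal sketch: `iterations_per_epoch` is a helper name introduced here for illustration, and it rounds up on the assumption that a final, smaller batch still counts as an iteration (PyTorch's `DataLoader` behaves this way by default).

```python
import math


def iterations_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of batches needed to see every sample once (hypothetical helper)."""
    return math.ceil(n_samples / batch_size)


print(iterations_per_epoch(100, 20))  # 5
print(iterations_per_epoch(178, 4))   # 45, the last batch holds only 2 samples
```

The second call uses the size of the wine dataset loaded later in this notebook, where 178 samples do not divide evenly by a batch size of 4.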

In [1]:
# Imports
import torch
import torchvision
from torch.utils.data import Dataset, DataLoader
import numpy as np
import math

In [6]:
# Implementing a custom dataset
class WineDataset(Dataset):  # Subclasses torch.utils.data.Dataset
    def __init__(self):
        # Data loading
        xy = np.loadtxt('data/wine.csv', delimiter=',', dtype=np.float32, skiprows=1)
        # Split and convert
        self.x = torch.from_numpy(xy[:, 1:])  # Features: every column except the first
        self.y = torch.from_numpy(xy[:, [0]])  # Labels: first column, shape (n_samples, 1)
        self.num_samples = xy.shape[0]  # Number of samples

    def __getitem__(self, index):  # Will allow for indexing later
        # dataset[0] would be valid due to this method
        return self.x[index], self.y[index]

    def __len__(self):  # This allows us to get the length of the dataset
        return self.num_samples


# These are the basic methods our dataset class needs
# Then we instantiate it

dataset = WineDataset()
first_data = dataset[0]  # We are able to use indexing
test_features, test_labels = first_data
print(f'Features: {test_features}\nLabels: {test_labels}')



Features: tensor([1.4230e+01, 1.7100e+00, 2.4300e+00, 1.5600e+01, 1.2700e+02, 2.8000e+00,
        3.0600e+00, 2.8000e-01, 2.2900e+00, 5.6400e+00, 1.0400e+00, 3.9200e+00,
        1.0650e+03])
Labels: tensor([1.])
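
If `data/wine.csv` is not at hand, the same pattern can be tried on synthetic data. The sketch below is not part of the original notebook: `ToyDataset`, its random values, and the label range 1 to 3 are assumptions made purely for illustration. It implements the same two methods, `__getitem__` and `__len__`, that a map-style `Dataset` needs.

```python
import torch
from torch.utils.data import Dataset


class ToyDataset(Dataset):
    """Hypothetical stand-in for WineDataset, built from random numbers."""

    def __init__(self, n_samples: int = 10, n_features: int = 13):
        generator = torch.Generator().manual_seed(0)  # Reproducible values
        self.x = torch.rand(n_samples, n_features, generator=generator)
        # Integer class labels in 1..3 (assumed), stored as floats, shape (n_samples, 1)
        self.y = torch.randint(1, 4, (n_samples, 1), generator=generator).float()

    def __getitem__(self, index):
        # Return one (features, label) pair
        return self.x[index], self.y[index]

    def __len__(self):
        return self.x.shape[0]


toy = ToyDataset()
toy_features, toy_label = toy[0]
print(len(toy), toy_features.shape, toy_label.shape)
```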


In [8]:
# Now we'll use a dataloader to load it
dataloader = DataLoader(dataset=dataset, batch_size=4, shuffle=True, num_workers=0)
# shuffle=True reshuffles the data every epoch; num_workers sets how many subprocesses load data
# num_workers=0 loads in the main process; use it if worker processes cause issues

data_iter = iter(dataloader)
data = next(data_iter)  # Built-in next() fetches one batch from the iterator

# Unpack
features, labels = data
print(f'Features: {features}\nLabels: {labels}')

Features: tensor([[1.3720e+01, 1.4300e+00, 2.5000e+00, 1.6700e+01, 1.0800e+02, 3.4000e+00,
         3.6700e+00, 1.9000e-01, 2.0400e+00, 6.8000e+00, 8.9000e-01, 2.8700e+00,
         1.2850e+03],
        [1.2370e+01, 1.2100e+00, 2.5600e+00, 1.8100e+01, 9.8000e+01, 2.4200e+00,
         2.6500e+00, 3.7000e-01, 2.0800e+00, 4.6000e+00, 1.1900e+00, 2.3000e+00,
         6.7800e+02],
        [1.4830e+01, 1.6400e+00, 2.1700e+00, 1.4000e+01, 9.7000e+01, 2.8000e+00,
         2.9800e+00, 2.9000e-01, 1.9800e+00, 5.2000e+00, 1.0800e+00, 2.8500e+00,
         1.0450e+03],
        [1.3300e+01, 1.7200e+00, 2.1400e+00, 1.7000e+01, 9.4000e+01, 2.4000e+00,
         2.1900e+00, 2.7000e-01, 1.3500e+00, 3.9500e+00, 1.0200e+00, 2.7700e+00,
         1.2850e+03]])
Labels: tensor([[1.],
        [2.],
        [1.],
        [1.]])


In [11]:
# Now let's iterate over the whole dataloader in a dummy training loop
# Hyperparameters:
num_epochs = 2
total_samples = len(dataset)
batch_size = 4
n_iterations = math.ceil(total_samples/batch_size)
print(total_samples, n_iterations)

for epoch in range(num_epochs):
    for iteration, (inputs, labels) in enumerate(dataloader):
        # Forward pass
        # Backward pass
        # Update weights
        # But this is dummy, so we're not doing it
        if (iteration + 1) % 5 == 0:
            print(f'Epoch {epoch+1}/{num_epochs}, Step {iteration+1}/{n_iterations}, Inputs: {inputs.shape}')


178 45
Epoch 1/2, Step 5/45, Inputs: torch.Size([4, 13])
Epoch 1/2, Step 10/45, Inputs: torch.Size([4, 13])
Epoch 1/2, Step 15/45, Inputs: torch.Size([4, 13])
Epoch 1/2, Step 20/45, Inputs: torch.Size([4, 13])
Epoch 1/2, Step 25/45, Inputs: torch.Size([4, 13])
Epoch 1/2, Step 30/45, Inputs: torch.Size([4, 13])
Epoch 1/2, Step 35/45, Inputs: torch.Size([4, 13])
Epoch 1/2, Step 40/45, Inputs: torch.Size([4, 13])
Epoch 1/2, Step 45/45, Inputs: torch.Size([2, 13])
Epoch 2/2, Step 5/45, Inputs: torch.Size([4, 13])
Epoch 2/2, Step 10/45, Inputs: torch.Size([4, 13])
Epoch 2/2, Step 15/45, Inputs: torch.Size([4, 13])
Epoch 2/2, Step 20/45, Inputs: torch.Size([4, 13])
Epoch 2/2, Step 25/45, Inputs: torch.Size([4, 13])
Epoch 2/2, Step 30/45, Inputs: torch.Size([4, 13])
Epoch 2/2, Step 35/45, Inputs: torch.Size([4, 13])
Epoch 2/2, Step 40/45, Inputs: torch.Size([4, 13])
Epoch 2/2, Step 45/45, Inputs: torch.Size([2, 13])
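
The final step of each epoch reports `torch.Size([2, 13])` because 178 is not a multiple of 4, so the last batch holds the 2 leftover samples. The sketch below reproduces this with random data in a `TensorDataset` (an assumption, standing in for the wine data) and shows the `drop_last=True` option of `DataLoader`, which discards the incomplete batch. The variable names are introduced here for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random stand-in data with the same shape as the wine dataset (178 x 13)
stand_in_x = torch.rand(178, 13)
stand_in_y = torch.zeros(178, 1)
stand_in = TensorDataset(stand_in_x, stand_in_y)

keep_loader = DataLoader(stand_in, batch_size=4, shuffle=True)
drop_loader = DataLoader(stand_in, batch_size=4, shuffle=True, drop_last=True)

# Record the number of samples in every batch of one epoch
sizes_keep = [inputs.shape[0] for inputs, _ in keep_loader]
sizes_drop = [inputs.shape[0] for inputs, _ in drop_loader]

print(len(keep_loader), sizes_keep[-1])  # 45 2
print(len(drop_loader), sizes_drop[-1])  # 44 4
```

Dropping the last batch keeps every batch the same size, at the cost of skipping a few samples each epoch. With `shuffle=True`, different samples are skipped each time.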


PyTorch also provides built-in datasets through torchvision <br>
MNIST: <br>
`torchvision.datasets.MNIST()` <br>
Fashion MNIST, CIFAR, COCO, and others are also available. See the `torchvision.datasets` documentation for the full list.