## Intro to PyTorch DataLoader

In this lab, we look at PyTorch's DataLoader class and how it loads data in mini-batches. We'll implement a custom Dataset and wrap it in a DataLoader for efficient training.

In [1]:
import os
import torch
import numpy as np
from torch.utils.data import Dataset, DataLoader

In [2]:
# Loading a custom dataset in PyTorch is straightforward: subclass PyTorch's Dataset class and implement
# the __getitem__ and __len__ methods.

# Here we write a custom dataset for the LAB-1 NumPy data.

class CustomDataset(Dataset):
    
    def __init__(self, numpy_train_data, numpy_train_labels):
        '''
        numpy_train_data: path to training data numpy file
        numpy_train_labels: path to training labels numpy file
        '''
        
        self.train_data = np.load(numpy_train_data)
        self.train_labels = np.load(numpy_train_labels)
        
    def __getitem__(self, idx): 
        '''
        idx: index of the data point to return
        '''
        # here we are returning the single data point and label at position idx
        return self.train_data[idx], self.train_labels[idx]
    
    def __len__(self):
        # here we return the length of the dataset
        return self.train_data.shape[0]
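
The same pattern can be tried without the NORB files. The following is a minimal sketch, assuming in-memory arrays instead of `.npy` paths; the class name `ArrayDataset`, the synthetic shapes, and the choice of 5 classes are illustrative assumptions and not part of the lab code.

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class ArrayDataset(Dataset):
    """Hypothetical in-memory variant of CustomDataset (not part of the lab)."""

    def __init__(self, data, labels):
        # Fail early if samples and labels are misaligned.
        assert len(data) == len(labels), "data and labels must have equal length"
        self.data = data
        self.labels = labels

    def __getitem__(self, idx):
        # Return one (sample, label) pair: a tensor and a plain Python int.
        return torch.from_numpy(self.data[idx]), int(self.labels[idx])

    def __len__(self):
        return len(self.data)


# Synthetic stand-in data: 8 single-channel 108x108 images, 5 classes (assumed).
rng = np.random.default_rng(0)
fake_data = rng.random((8, 1, 108, 108), dtype=np.float32)
fake_labels = rng.integers(0, 5, size=8)

toy_dataset = ArrayDataset(fake_data, fake_labels)
sample, label = toy_dataset[3]
print(len(toy_dataset), sample.shape, label)
```

Checking the lengths in `__init__` is a design choice in this sketch: it surfaces a mismatch between data and labels when the dataset is built rather than in the middle of training.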

Copy the NORB data into the current folder (or create a soft link to it), then run the next cell.

In [3]:
custom_data = CustomDataset('norb/train.npy', 'norb/train_cat.npy')

In [4]:
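# Indexing the dataset calls __getitem__: custom_data[idx] returns a single (sample, label) pair,
# while the full slice used below returns every sample and label at once.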
idx = 10

data, labels = custom_data[:]
print("data has shape: {}".format(data.shape))
print("labels has shape: {}".format(labels.shape))

data has shape: (29160, 1, 108, 108)
labels has shape: (29160,)
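
The difference between an integer index and a full slice can be seen on a small synthetic dataset. This is a sketch under assumptions: `ToyDataset` is a hypothetical stand-in that forwards the index to NumPy in the same way as `CustomDataset`, and the zero-filled arrays only mimic the shape of the NORB images.

```python
import numpy as np
from torch.utils.data import Dataset


class ToyDataset(Dataset):
    """Hypothetical stand-in that indexes in-memory arrays like CustomDataset."""

    def __init__(self, data, labels):
        self.data, self.labels = data, labels

    def __getitem__(self, idx):
        # idx is passed straight to NumPy, so ints and slices both work.
        return self.data[idx], self.labels[idx]

    def __len__(self):
        return self.data.shape[0]


# Synthetic data: 6 samples with the same per-sample shape as the output above.
toy = ToyDataset(np.zeros((6, 1, 108, 108), dtype=np.float32), np.arange(6))

one_sample, one_label = toy[2]   # integer index: a single sample
all_data, all_labels = toy[:]    # full slice: every sample at once

print(one_sample.shape, one_label)       # (1, 108, 108) 2
print(all_data.shape, all_labels.shape)  # (6, 1, 108, 108) (6,)
```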


In [5]:
# Now we can wrap our custom dataset in PyTorch's DataLoader, which has many built-in features
# that simplify setting up training.

# DataLoader can wrap any Dataset and acts as an iterable that yields batches of samples.

# We can set the batch size, whether to reshuffle the samples at every epoch,
# and the number of parallel worker processes used to load each batch.

dataloader = DataLoader(custom_data, batch_size=4, shuffle=True, num_workers=1)
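
The effect of `shuffle=True` can be checked on a small synthetic dataset: each pass over the loader visits every sample exactly once, in a freshly drawn random order. The sketch below assumes an in-memory `TensorDataset` and a seeded `torch.Generator` passed through the DataLoader's `generator` argument so that runs are repeatable; neither is used in the lab itself.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic dataset: 8 samples whose "data" is just their own index.
indices = torch.arange(8)
toy_dataset = TensorDataset(indices, indices)

# A seeded generator makes the shuffling order repeatable across runs.
gen = torch.Generator().manual_seed(0)
loader = DataLoader(toy_dataset, batch_size=4, shuffle=True, generator=gen)

for epoch in range(2):
    # Concatenate the batches to recover the order used in this epoch.
    seen = torch.cat([data_batch for data_batch, _ in loader])
    # Every sample appears exactly once per epoch, whatever the order.
    assert sorted(seen.tolist()) == list(range(8))
    print("epoch", epoch, "order:", seen.tolist())
```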

In [6]:
# Now we can iterate over the dataloader to retrieve the data one batch at a time.

# We set the batch size to 4 in the previous step, so each iteration should return 4 samples.

for data_batch, labels_batch in dataloader:
    print("DataLoader data_batch has shape: {}".format(data_batch.size()))
    print("Dataloader labels_batch has shape: {}".format(labels_batch.size()))
    break

DataLoader data_batch has shape: torch.Size([4, 1, 108, 108])
Dataloader labels_batch has shape: torch.Size([4])
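
When the dataset size is not a multiple of the batch size, the last batch is smaller unless it is dropped. The following sketch illustrates this with synthetic tensors wrapped in a `TensorDataset`; the sample count of 10 and the use of `num_workers=0` are assumptions made to keep the example small and are not taken from the lab.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic data: 10 zero-filled samples shaped like the batches printed above.
data = torch.zeros(10, 1, 108, 108)
labels = torch.arange(10)
toy_dataset = TensorDataset(data, labels)

# num_workers=0 loads in the main process, which keeps the example simple.
loader = DataLoader(toy_dataset, batch_size=4, shuffle=False, num_workers=0)
batch_sizes = [data_batch.size(0) for data_batch, _ in loader]
print(len(loader), batch_sizes)  # 3 [4, 4, 2]

# drop_last=True discards the final, smaller batch.
loader_drop = DataLoader(toy_dataset, batch_size=4, drop_last=True)
print(len(loader_drop))  # 2
```

Note that `len(loader)` counts batches, not samples; the number of samples is still `len(toy_dataset)`.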


As we can see, PyTorch's Dataset and DataLoader classes make it easy to build custom loaders and load data in batches.