# Dataset and DataLoader

Machine learning, and deep learning in particular, depends on large amounts of data. That dependence can put a heavy strain on researchers and practitioners. We need to be able to store, manage and retrieve large amounts of data, and when we retrieve the data we need to make sure that we do not go beyond the capacity of our RAM or VRAM. PyTorch gives us a flexible way to deal with our data problems the way we see fit and provides the ```Dataset``` and the ```DataLoader``` classes to manage data.

In [1]:
from torch.utils.data import Dataset, DataLoader

## Dataset

The ```Dataset``` object is the PyTorch representation of data. In PyTorch any class that implements the ```__getitem__``` and the ```__len__``` magic methods is considered to be a Dataset. That means that theoretically ```[1, 2, 3]``` is a Dataset.

In [2]:
dataset = [1,2,3]
# a simple list has the __len__ method
print(len(dataset))
# a simple list has the __getitem__ method
print(dataset[1])

3
2
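
Because the interface is duck-typed, a plain list can also be handed directly to the ```DataLoader``` class that is introduced in the next section. The snippet below is a minimal sketch of that idea. The names ```plain_list``` and ```loader``` are made up for illustration, and it relies on the default behaviour of collecting Python integers into a tensor.

```python
from torch.utils.data import DataLoader

# A plain list already provides __len__ and __getitem__,
# so it can be passed to a DataLoader without any wrapper class.
plain_list = [1, 2, 3]
loader = DataLoader(plain_list, batch_size=2)

# Convert each batch tensor back to a list for easy inspection.
batches = [batch.tolist() for batch in loader]
print(batches)
```

With three elements and a batch size of 2, the loader yields one full batch followed by a batch holding the single remaining element.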


When we are dealing with real world data we will actually subclass the ```Dataset``` class and override the ```__getitem__``` and the ```__len__``` methods. Below we create a dataset that contains a list of numbers, the size of which depends on a parameter of the ```__init__``` method. The ```__getitem__``` method implements the logic that determines how an individual element of our data is returned, given only the index of that element.

In [3]:
class ListDataset(Dataset):
    def __init__(self, size):
        self.data = list(range(size))
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        return self.data[idx]

We can use the dataset the way we could use a list.

In [4]:
dataset = ListDataset(100)

In [5]:
print(len(dataset))

100


In [6]:
print(dataset[42])

42
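
In practice ```__getitem__``` often returns more than a single value, for example a feature together with its label. The following sketch shows one way this could look. The ```PairDataset``` class and its even/odd labelling rule are hypothetical and only serve as an illustration.

```python
import torch
from torch.utils.data import Dataset


class PairDataset(Dataset):
    """Hypothetical dataset that returns (feature, label) pairs."""

    def __init__(self, size):
        # Features are the numbers 0..size-1; the label is 1 for even numbers.
        self.features = torch.arange(size, dtype=torch.float32)
        self.labels = (self.features % 2 == 0).long()

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        # All per-sample logic lives here: look up one feature and its label.
        return self.features[idx], self.labels[idx]


pair_dataset = PairDataset(10)
feature, label = pair_dataset[3]
print(len(pair_dataset), feature.item(), label.item())
```

Indexing the dataset returns a tuple, so it can be unpacked directly into a feature and a label.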


## DataLoader

During the training process we interact with the ```DataLoader``` object and never with the ```Dataset``` directly. The goal of the ```DataLoader``` is to return data in batch-sized pieces that are used for training or testing.

The ```DataLoader``` class has many parameters; we will start with the most important ones.

- ```dataset```: The Dataset object that implements the ```__len__``` and ```__getitem__``` interface
- ```batch_size```: size of the mini-batch used in training/testing, defaults to 1
- ```shuffle```: determines if the data is reshuffled in each epoch, defaults to False

Generate a Dataset with 5 elements.

In [7]:
dataset = ListDataset(5)

Generate a dataloader that shuffles the dataset object and returns 2 elements at a time.

In [8]:
dataloader = DataLoader(dataset=dataset, batch_size=2, shuffle=True)

We iterate through the dataloader to receive one batch at a time. Because 5 is not divisible by 2, the last batch of each epoch contains only the single remaining element.

In [9]:
for epoch in range(2):
    print(f'EPOCH Nr. {epoch+1}')
    print('-' * 45)
    for batch_num, data in enumerate(dataloader):
        print(f'Batch Nr: {batch_num+1} Data: {data}')
    print()

EPOCH Nr. 1
---------------------------------------------
Batch Nr: 1 Data: tensor([2, 4])
Batch Nr: 2 Data: tensor([0, 3])
Batch Nr: 3 Data: tensor([1])

EPOCH Nr. 2
---------------------------------------------
Batch Nr: 1 Data: tensor([1, 3])
Batch Nr: 2 Data: tensor([4, 2])
Batch Nr: 3 Data: tensor([0])
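
To build some intuition for the output above, the batching logic can be imitated in plain Python. This is only a rough sketch of the idea and not how PyTorch implements it internally. The helper ```iterate_batches``` and its ```seed``` parameter are assumptions made for this example.

```python
import random


def iterate_batches(data, batch_size, shuffle=False, seed=None):
    """Yield lists of samples, imitating the batching of a DataLoader."""
    indices = list(range(len(data)))
    if shuffle:
        # Draw a random permutation of the indices. Calling the function
        # again without a seed gives a new order, much like a new epoch.
        random.Random(seed).shuffle(indices)
    # Walk over the indices in steps of batch_size; the final slice
    # may be shorter if the data does not divide evenly.
    for start in range(0, len(indices), batch_size):
        batch_indices = indices[start:start + batch_size]
        yield [data[i] for i in batch_indices]


batches = list(iterate_batches(list(range(5)), batch_size=2, shuffle=True, seed=0))
print([len(batch) for batch in batches])
```

Every element appears exactly once per pass, and only the final batch can be smaller than ```batch_size```.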



Often we want our batches to be of equal size. If a batch is too small, the calculation of the gradient might be too noisy. To avoid that we can use the following argument.

- ```drop_last```: if True, the last batch is dropped when it contains fewer than ```batch_size``` elements, defaults to False

Below we see that out of 5 samples only 4 are included in the loop if we set the ```drop_last``` argument to ```True```.

In [10]:
dataloader = DataLoader(dataset=dataset, batch_size=2, shuffle=True, drop_last=True)

In [11]:
for epoch in range(2):
    print(f'EPOCH Nr. {epoch+1}')
    print('-' * 45)
    for batch_num, data in enumerate(dataloader):
        print(f'Batch Nr: {batch_num+1} Data: {data}')
    print()

EPOCH Nr. 1
---------------------------------------------
Batch Nr: 1 Data: tensor([4, 3])
Batch Nr: 2 Data: tensor([1, 0])

EPOCH Nr. 2
---------------------------------------------
Batch Nr: 1 Data: tensor([0, 3])
Batch Nr: 2 Data: tensor([1, 4])
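
The number of batches per epoch follows from simple arithmetic: rounding up when the last batch is kept and rounding down when it is dropped. The short check below is a sketch that uses a plain list in place of the dataset to stay self-contained. The names ```samples```, ```keep_last_loader``` and ```drop_last_loader``` are made up for this example.

```python
import math
from torch.utils.data import DataLoader

samples = list(range(5))
batch_size = 2

keep_last_loader = DataLoader(samples, batch_size=batch_size)
drop_last_loader = DataLoader(samples, batch_size=batch_size, drop_last=True)

# len() of a DataLoader over an indexable dataset reports batches per epoch.
print(len(keep_last_loader), math.ceil(len(samples) / batch_size))
print(len(drop_last_loader), len(samples) // batch_size)
```

Each line prints the same number twice, which confirms that the loader length matches the rounded-up and rounded-down quotient respectively.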



PyTorch gives us the ability to get the data in parallel by using subprocesses.

- ```num_workers```: integer value that determines the number of worker subprocesses that load the data in parallel. The default is 0, which means that the data is loaded in the main process.

In [12]:
dataloader = DataLoader(dataset=dataset, batch_size=2, shuffle=True, drop_last=True, num_workers=4)

We won't notice a speed difference with such a simple example, but the speedup with large datasets might be noticeable.

Behind the scenes PyTorch does a lot of work. In most cases the default behaviour is sufficient, but sometimes you might need more control. We are not going to cover those details just yet and will address the special cases when the need arises. If you are faced with a problem that requires more control, you can look at the [PyTorch documentation](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader).