# My summary

## Loading data

Preparing data for deep learning algorithms can be a complex pipeline in its own right. PyTorch provides many utility classes that abstract away much of this complexity, such as parallel data loading through multiple worker processes, data augmentation, and batching.

### Dataset class

Any custom dataset class, for example our Dogs vs. Cats dataset class, has to inherit from the PyTorch `Dataset` class. The custom class has to implement two methods, namely `__len__(self)` and `__getitem__(self, idx)`. Any custom class acting as a `Dataset` should look like the following code snippet:

In [1]:
import numpy as np
from torch.utils.data import Dataset, DataLoader
from PIL import Image
from glob import glob

In [2]:
class DogsAndCatsDataset(Dataset):
    def __init__(self,):
        pass
    def __len__(self,):
        pass
    def __getitem__(self, idx):
        pass

We do any required initialization inside the `__init__` method; in our case, that means collecting the filenames of the images. The `__len__(self)` method returns the number of elements in our dataset. The `__getitem__(self, idx)` method returns the element at position `idx` every time it is called. The following code implements our `DogsAndCatsDataset` class:

In [3]:
class DogsAndCatsDataset(Dataset):
    def __init__(self, root_dir, size=(224, 224)):
        self.files = glob(root_dir + '*.jpg')
        self.size = size
    def __len__(self,):
        return len(self.files)
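    def __getitem__(self, idx):
        # Convert to RGB so every image has 3 channels, then resize to a fixed size
        img = np.asarray(Image.open(self.files[idx]).convert('RGB').resize(self.size))
        # The label is the name of the parent directory, e.g. 'dogs'
        label = self.files[idx].split('/')[-2]
        return img, label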
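The class above needs image files on disk to run. As a minimal sketch of the same `__len__`/`__getitem__` pattern that runs without any files, the following uses random arrays in place of real images. The class name `SyntheticImageDataset` and its parameters are hypothetical and exist only for illustration.

```python
import numpy as np
from torch.utils.data import Dataset


class SyntheticImageDataset(Dataset):
    """Hypothetical stand-in for an image dataset, backed by random arrays."""

    def __init__(self, num_samples=10, size=(224, 224), seed=0):
        rng = np.random.default_rng(seed)
        # Random uint8 "images" with shape (height, width, channels).
        self.images = rng.integers(
            0, 256, size=(num_samples, size[1], size[0], 3), dtype=np.uint8
        )
        # Alternate the two class names so both labels appear.
        self.labels = ['dogs' if i % 2 == 0 else 'cats' for i in range(num_samples)]

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        return self.images[idx], self.labels[idx]


dataset = SyntheticImageDataset(num_samples=10)
image, label = dataset[3]
print(len(dataset), image.shape, label)
```

Each element is an `(image, label)` pair, which is the same contract the file-based class follows.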

Once the `DogsAndCatsDataset` class is defined, we can create an instance and iterate over it, as shown in the following code:

In [4]:
dogsdset = DogsAndCatsDataset('data/dogscats/train/dogs/')
for image, label in dogsdset:
    # Apply your DL on the dataset
    pass
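It may not be obvious why the `for` loop works, since the class defines no `__iter__` method. Python falls back to calling `__getitem__` with 0, 1, 2, ... until an `IndexError` is raised. The sketch below shows this with a plain Python class; `FileListDataset` and the file names are hypothetical, and it assumes, as in our class, that indexing past the end of the file list raises `IndexError`.

```python
class FileListDataset:
    """Hypothetical dataset exposing only __len__ and __getitem__."""

    def __init__(self, files):
        self.files = list(files)

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # Indexing past the end of the list raises IndexError,
        # which is what ends the for loop below.
        path = self.files[idx]
        label = path.split('/')[-2]
        return path, label


dataset = FileListDataset([
    'data/dogscats/train/dogs/a.jpg',
    'data/dogscats/train/dogs/b.jpg',
])

# No __iter__ is defined; iteration goes through __getitem__.
seen = [label for _, label in dataset]
print(seen)
```

A `__getitem__` that never raises `IndexError` (for example, one that wraps indices with a modulo) would make such a loop run forever, so the bounds check matters.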

### DataLoader class

Applying a deep learning algorithm to one instance of data at a time is inefficient. We need batches of data, as modern GPUs deliver better performance when they operate on a whole batch at once. The `DataLoader` class helps to create batches by abstracting away a lot of complexity.
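To make concrete what batching involves, here is a rough sketch of doing it by hand with NumPy. The helper `manual_batches` is hypothetical and is not part of PyTorch; it only illustrates stacking equally shaped samples along a new leading batch axis.

```python
import numpy as np


def manual_batches(samples, batch_size):
    """Yield stacked batches from a list of equally shaped arrays."""
    for start in range(0, len(samples), batch_size):
        chunk = samples[start:start + batch_size]
        # np.stack adds a new leading axis: the batch dimension.
        yield np.stack(chunk)


# Ten fake 224x224 RGB images (all zeros), standing in for real data.
samples = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(10)]

shapes = [batch.shape for batch in manual_batches(samples, batch_size=4)]
print(shapes)
```

This version has no shuffling and no parallel loading, and its last batch is smaller than the others because 10 is not a multiple of 4.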

The `DataLoader` class in PyTorch's `torch.utils.data` module combines a dataset object with a sampler, such as `SequentialSampler` or `RandomSampler`, and provides us with batches of images using either a single-process or a multi-process iterator. Samplers are different strategies for choosing the order in which data is provided to the algorithm. The following is an example of a `DataLoader` for our Dogs vs. Cats dataset:

In [8]:
dataloader = DataLoader(dogsdset, batch_size=32, num_workers=3)
for imgs, labels in dataloader:
    # Apply your DL on the dataset
    pass

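`imgs` will contain a tensor of shape (32, 224, 224, 3), where 32 is the batch size, assuming every image has three channels. The final batch can be smaller when the dataset size is not a multiple of 32. `labels` will contain the corresponding label strings.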
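The effect of the two samplers mentioned earlier can be seen on a toy dataset. The following is a small sketch, not the loader used above: it wraps the integers 0 to 9 in a `TensorDataset` and passes each sampler explicitly through the `sampler` argument. The variable names are made up for illustration.

```python
import torch
from torch.utils.data import (
    DataLoader, RandomSampler, SequentialSampler, TensorDataset,
)

# Ten toy samples: sample i is the 1-element tensor [i].
toy = TensorDataset(torch.arange(10).unsqueeze(1))

# SequentialSampler yields indices 0, 1, 2, ... in order.
seq_loader = DataLoader(toy, batch_size=4, sampler=SequentialSampler(toy))
seq_batches = [batch.squeeze(1).tolist() for (batch,) in seq_loader]

# RandomSampler (without replacement, its default) yields a random
# permutation of the indices, so every sample appears exactly once.
rand_loader = DataLoader(toy, batch_size=4, sampler=RandomSampler(toy))
rand_batches = [batch.squeeze(1).tolist() for (batch,) in rand_loader]

print(seq_batches)
print([len(b) for b in rand_batches])
```

In both cases the last batch holds only two samples, because `drop_last` defaults to `False`. Passing `shuffle=True` to `DataLoader` instead of an explicit sampler is the more common way to get random ordering.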

The PyTorch team also maintains two useful libraries, called `torchvision` and `torchtext`, which are built on top of the `Dataset` and `DataLoader` classes.