# Datasets in PyTorch

Objective: decouple the training code from the code that loads and preprocesses the data.

Data primitives:
- `torch.utils.data.Dataset`: stores the samples and their corresponding labels / targets.
- `torch.utils.data.DataLoader`: wraps an iterable around the `Dataset` for easy access to the samples.
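
As a small illustration of how the two primitives fit together, the sketch below builds a dataset from random in-memory tensors with `torch.utils.data.TensorDataset` and batches it with a `DataLoader`. The tensor sizes, variable names, and batch size are assumptions chosen for the example; they are not values used elsewhere in these notes.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data: 10 samples with 3 features each, one integer label per sample.
features = torch.randn(10, 3)
targets = torch.arange(10)

# The Dataset stores the samples together with their labels.
toy_dataset = TensorDataset(features, targets)

# The DataLoader wraps the dataset in an iterable that yields batches.
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=False)

# Collect the shape of every batch: 10 samples with batch_size=4 give batches of 4, 4 and 2.
batch_shapes = [(x.shape, y.shape) for x, y in toy_loader]
for x_shape, y_shape in batch_shapes:
    print(x_shape, y_shape)
```

The last batch is smaller because 10 is not a multiple of 4; passing `drop_last=True` to the `DataLoader` would discard it instead.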

Well-known image, text, and audio datasets can be downloaded and used directly.

Examples:
- `torchvision.datasets.MNIST`
- `torchtext.datasets.IMDB`
- `torchaudio.datasets.SPEECHCOMMANDS`

All of them inherit from `torch.utils.data.Dataset`.

In [26]:
import torch
from torchvision import datasets
from torchvision import transforms

transform = transforms.ToTensor()  # PIL image (0-255) ===> PyTorch tensor in [0, 1]

# '~' expands to the user's home directory, e.g. C:\Users\ashkan on Windows
trainset = datasets.MNIST(root='~/.pytorch/MNIST_data/', download=True, train=True, transform=transform)

trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

## Accessing the samples of a `torch.utils.data.Dataset` by indexing

All (map-style) datasets implement `__getitem__` and `__len__`.

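```python
from torch.utils.data import Dataset

class CustomImageDataset(Dataset):
    def __init__(self, *args, **kwargs):
        super().__init__()
        ...  # store file paths, labels, transforms, etc.

    def __getitem__(self, idx):
        ...  # load and return the sample (e.g. image, label) at index idx

    def __len__(self):
        ...  # return the number of samples
```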
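
To make the skeleton above concrete, here is a minimal sketch of a map-style dataset that keeps everything in memory. The class name `InMemoryImageDataset`, its constructor arguments, and the fake MNIST-shaped data are all hypothetical and only serve the illustration; a real image dataset would typically read files from disk inside `__getitem__`.

```python
import torch
from torch.utils.data import Dataset


class InMemoryImageDataset(Dataset):
    """Hypothetical map-style dataset that keeps all images and labels in memory."""

    def __init__(self, images, labels, transform=None):
        super().__init__()
        self.images = images        # tensor of shape (N, C, H, W)
        self.labels = labels        # tensor of shape (N,)
        self.transform = transform  # optional callable applied to each image

    def __getitem__(self, idx):
        image = self.images[idx]
        if self.transform is not None:
            image = self.transform(image)
        return image, self.labels[idx]

    def __len__(self):
        return len(self.images)


# Fake data with MNIST-like shapes (assumed for illustration only).
fake_images = torch.zeros(5, 1, 28, 28)
fake_labels = torch.tensor([0, 1, 2, 3, 4])
toy_set = InMemoryImageDataset(fake_images, fake_labels, transform=lambda img: img + 1.0)

sample_image, sample_label = toy_set[2]  # indexing calls __getitem__(2)
print(sample_image.shape, sample_label, len(toy_set))
```

Because the class implements `__getitem__` and `__len__`, it can be handed to a `DataLoader` in the same way as the built-in datasets.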


In [32]:
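# image, label = trainset[50000]  # indexing is equivalent to calling __getitem__
image, label = trainset.__getitem__(50000)

print(type(image))  # torch.Tensor with values in [0, 1], because ToTensor() was applied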
print(label)

print(len(trainset))
print(trainset.__len__())

<class 'torch.Tensor'>
3
60000
60000


In [28]:
print(image.shape)

torch.Size([1, 28, 28])


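The shape is `Channel*Height*Width` (CHW).

`ToTensor` turns a `Height*Width*Channel` image into a `Channel*Height*Width` tensor, e.g. `255*255*3` ===> `3*255*255`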
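
The axis reordering can be reproduced by hand with `permute`. The sketch below uses a random `255*255*3` tensor as a stand-in for an RGB image (an assumption for illustration) and mimics the effect of `ToTensor` on 8-bit data; it is not the library's actual implementation.

```python
import torch

# Hypothetical RGB image stored as Height x Width x Channel with values 0-255.
hwc_image = torch.randint(0, 256, (255, 255, 3), dtype=torch.uint8)

# Move the channel axis to the front (HWC -> CHW) and rescale to [0, 1].
chw_image = hwc_image.permute(2, 0, 1).float() / 255.0

print(hwc_image.shape)  # torch.Size([255, 255, 3])
print(chw_image.shape)  # torch.Size([3, 255, 255])
```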

In [15]:
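from torchvision.transforms.functional import to_pil_image

# `image` is a tensor here (ToTensor was applied), so convert it back to a PIL image to display it
pil_image = to_pil_image(image)
pil_image.show()

In [16]:
pil_image.getextrema()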

(0, 255)

## `torch.utils.data.DataLoader`

`DataLoader` wraps an iterable around a `Dataset`. Calling `iter()` on it returns an iterator with the usual two methods:

- `__iter__`: returns the iterator object itself.
- `__next__`: returns the next item in the sequence (here, the next batch).

In [34]:
dataiter = iter(trainloader) 

images, labels = next(dataiter)  # dataiter.next() is not available in recent PyTorch releases

print(images.shape) # PyTorch: NCHW ----- Tensorflow/Keras: NHWC 
print(type(images))

print(labels.shape)

torch.Size([64, 1, 28, 28])
<class 'torch.Tensor'>
torch.Size([64])
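
In practice the iterator is rarely driven by hand; a `for` loop calls `iter()` and `next()` implicitly. The sketch below shows this with a hypothetical stand-in for MNIST (150 blank images, made up for the example) so that it runs without downloading anything.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for MNIST: 150 blank 1x28x28 "images" with labels 0-9.
fake_images = torch.zeros(150, 1, 28, 28)
fake_labels = torch.arange(150) % 10
loader = DataLoader(TensorDataset(fake_images, fake_labels), batch_size=64, shuffle=True)

batch_sizes = []
for images, labels in loader:  # the for loop calls iter() and next() for us
    batch_sizes.append(images.shape[0])

print(batch_sizes)  # [64, 64, 22]
```

One full pass over the loader corresponds to one epoch; with `shuffle=True` the samples are drawn in a new order each time the loop is restarted.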
