# PyTorch's datasets and dataloaders

A basic usage example of PyTorch `DataLoader`s, extracted from Aakash N S's post on linear regression (see Reference).

# Reference
- [Linear regression with pytorch](https://medium.com/jovianml/linear-regression-with-pytorch-3dde91d60b50) (Aakash N S)
- [PyTorch `TensorDataset`](https://pytorch.org/docs/stable/_modules/torch/utils/data/dataset.html#TensorDataset) (pytorch documentation)
- [PyTorch `DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) (pytorch documentation)

---
tags: pytorch, tutorial, dataset, dataloader

# Imports

In [1]:
import torch
from torch.utils.data import TensorDataset, DataLoader

# Data, dataset, dataloader

In [2]:
# Number of batches, batch size, and number of samples
n_batches = 5
batch_size = 4
N = (n_batches - 1)*batch_size + batch_size//2 # size of dataset

# Data
x = torch.tensor([range(N), range(N)], dtype=float).view(N, -1)
y = x.sum(1)

# Dataset
ds = TensorDataset(x, y)

# Dataloader
dl = DataLoader(ds, batch_size, shuffle=False)
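As a quick check on what the cell above built, the sketch below inspects the dataset directly. It rebuilds `x`, `y`, and `ds` the same way, with `N` hard-coded to the value computed above; only the inspection lines are new. One thing worth noticing: `torch.tensor([range(N), range(N)])` has shape `(2, N)`, and `.view(N, -1)` pairs up consecutive values of the flattened tensor, so the second half of the rows repeats the first half.

```python
import torch
from torch.utils.data import TensorDataset

N = 18  # same value as computed above: (5 - 1)*4 + 4//2

# Same construction as in the cell above
x = torch.tensor([range(N), range(N)], dtype=float).view(N, -1)
y = x.sum(1)
ds = TensorDataset(x, y)

print(x.shape)  # torch.Size([18, 2])
print(len(ds))  # 18
print(ds[0])    # first sample as an (input, target) tuple

# Rows 9..17 repeat rows 0..8
print(torch.equal(x[:N // 2], x[N // 2:]))  # True
```

This is why some rows, such as `[0., 1.]`, appear more than once in the batches shown in the rest of the notebook.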

# Get one batch from dataloader
Note that with `shuffle=False`, every new iterator starts at the beginning of the dataset, so we always get the same first batch:

In [3]:
for batch in dl:
    break
print(batch[0])
print(batch[1])

tensor([[0., 1.],
        [2., 3.],
        [4., 5.],
        [6., 7.]], dtype=torch.float64)
tensor([ 1.,  5.,  9., 13.], dtype=torch.float64)


Alternatively:

In [4]:
next(iter(dl))

[tensor([[0., 1.],
         [2., 3.],
         [4., 5.],
         [6., 7.]], dtype=torch.float64),
 tensor([ 1.,  5.,  9., 13.], dtype=torch.float64)]

In [5]:
next(iter(dl))

[tensor([[0., 1.],
         [2., 3.],
         [4., 5.],
         [6., 7.]], dtype=torch.float64),
 tensor([ 1.,  5.,  9., 13.], dtype=torch.float64)]
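Both cells return the same batch because each call to `iter(dl)` creates a fresh iterator that starts from the beginning. The sketch below, which rebuilds the same data with `N` and `batch_size` hard-coded, shows how holding on to a single iterator walks through successive batches instead.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

N, batch_size = 18, 4
x = torch.tensor([range(N), range(N)], dtype=float).view(N, -1)
ds = TensorDataset(x, x.sum(1))
dl = DataLoader(ds, batch_size, shuffle=False)

it = iter(dl)        # one iterator, reused across calls
xb1, yb1 = next(it)  # first batch
xb2, yb2 = next(it)  # second batch
print(yb1)  # targets 1, 5, 9, 13
print(yb2)  # targets 17, 21, 25, 29
```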

# Number of batches

In [6]:
print(f"There are {sum([1 for batch in dl])} batches.")

There are 5 batches.
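For a map-style dataset such as `TensorDataset`, the same count should also be available without iterating, through `len(dl)`. A minimal sketch, rebuilding the same data with hard-coded sizes, which also checks the count against `ceil(N / batch_size)`:

```python
import math
import torch
from torch.utils.data import TensorDataset, DataLoader

N, batch_size = 18, 4
x = torch.tensor([range(N), range(N)], dtype=float).view(N, -1)
ds = TensorDataset(x, x.sum(1))
dl = DataLoader(ds, batch_size, shuffle=False)

n_batches = len(dl)  # number of batches, computed without iterating
print(n_batches)                                      # 5
print(n_batches == math.ceil(len(ds) / batch_size))   # True
```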


# Sizes of batches

In [7]:
print("Batch sizes:", end="")
print([xb.shape[0] for xb, _ in dl])

Batch sizes:[4, 4, 4, 4, 2]


# Inspect all batches

In [8]:
[batch for batch in dl]

[[tensor([[0., 1.],
          [2., 3.],
          [4., 5.],
          [6., 7.]], dtype=torch.float64),
  tensor([ 1.,  5.,  9., 13.], dtype=torch.float64)],
 [tensor([[ 8.,  9.],
          [10., 11.],
          [12., 13.],
          [14., 15.]], dtype=torch.float64),
  tensor([17., 21., 25., 29.], dtype=torch.float64)],
 [tensor([[16., 17.],
          [ 0.,  1.],
          [ 2.,  3.],
          [ 4.,  5.]], dtype=torch.float64),
  tensor([33.,  1.,  5.,  9.], dtype=torch.float64)],
 [tensor([[ 6.,  7.],
          [ 8.,  9.],
          [10., 11.],
          [12., 13.]], dtype=torch.float64),
  tensor([13., 17., 21., 25.], dtype=torch.float64)],
 [tensor([[14., 15.],
          [16., 17.]], dtype=torch.float64),
  tensor([29., 33.], dtype=torch.float64)]]

# Dataloaders with shuffling

In [9]:
dl = DataLoader(ds, batch_size, shuffle=True)
[batch for batch in dl]

[[tensor([[10., 11.],
          [ 2.,  3.],
          [ 0.,  1.],
          [ 8.,  9.]], dtype=torch.float64),
  tensor([21.,  5.,  1., 17.], dtype=torch.float64)],
 [tensor([[10., 11.],
          [12., 13.],
          [ 2.,  3.],
          [14., 15.]], dtype=torch.float64),
  tensor([21., 25.,  5., 29.], dtype=torch.float64)],
 [tensor([[ 0.,  1.],
          [ 6.,  7.],
          [16., 17.],
          [ 4.,  5.]], dtype=torch.float64),
  tensor([ 1., 13., 33.,  9.], dtype=torch.float64)],
 [tensor([[14., 15.],
          [ 6.,  7.],
          [ 4.,  5.],
          [ 8.,  9.]], dtype=torch.float64),
  tensor([29., 13.,  9., 17.], dtype=torch.float64)],
 [tensor([[16., 17.],
          [12., 13.]], dtype=torch.float64),
  tensor([33., 25.], dtype=torch.float64)]]

With `shuffle=True`, inspecting one batch as follows produces a different batch each time:

In [10]:
next(iter(dl))

[tensor([[10., 11.],
         [16., 17.],
         [ 2.,  3.],
         [ 2.,  3.]], dtype=torch.float64),
 tensor([21., 33.,  5.,  5.], dtype=torch.float64)]

In [11]:
next(iter(dl))

[tensor([[ 8.,  9.],
         [14., 15.],
         [ 6.,  7.],
         [12., 13.]], dtype=torch.float64),
 tensor([17., 29., 13., 25.], dtype=torch.float64)]
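If a reproducible shuffle is needed, one option is to pass a seeded `torch.Generator` through the `generator` argument of `DataLoader`. The sketch below is an illustration under that assumption, not part of the original post; `make_loader` is a hypothetical helper name and the seed value is arbitrary.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

N, batch_size = 18, 4
x = torch.tensor([range(N), range(N)], dtype=float).view(N, -1)
ds = TensorDataset(x, x.sum(1))

def make_loader(seed):
    # Hypothetical helper: a shuffling loader with its own seeded generator
    g = torch.Generator()
    g.manual_seed(seed)
    return DataLoader(ds, batch_size, shuffle=True, generator=g)

first_a = next(iter(make_loader(0)))
first_b = next(iter(make_loader(0)))

# Same seed, same shuffle order, hence the same first batch
print(torch.equal(first_a[0], first_b[0]))  # True
```

Note that the generator's state advances as it is used, so calling `next(iter(dl))` twice on the same loader still gives two different batches; only loaders started from the same seed agree.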

# Drop last batch if incomplete

In [12]:
dl = DataLoader(ds, batch_size, drop_last=True, shuffle=False)
print(f"There are {sum([1 for _ in dl])} batches.")
print("Batch sizes:", end="")
print([xb.shape[0] for xb, _ in dl])

There are 4 batches.
Batch sizes:[4, 4, 4, 4]
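With `drop_last=True` the number of batches is the floor of `N / batch_size`, and the leftover samples are skipped. A short sketch of that arithmetic, rebuilding the same data with hard-coded sizes:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

N, batch_size = 18, 4
x = torch.tensor([range(N), range(N)], dtype=float).view(N, -1)
ds = TensorDataset(x, x.sum(1))
dl = DataLoader(ds, batch_size, drop_last=True, shuffle=False)

n_full = len(dl)                          # only full batches are counted
n_dropped = len(ds) - n_full * batch_size # samples that never appear in a batch
print(n_full)                             # 4
print(n_full == len(ds) // batch_size)    # True
print(n_dropped)                          # 2
```

With `shuffle=False` the dropped samples are always the last two in the dataset; with `shuffle=True` they would differ from one pass to the next.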
