In [1]:
import torch as t
import torch.utils.data as td
import numpy as np

# Data Stuff
In PyTorch there are two main classes for processing data: `Dataset` and `DataLoader`. Think of `Dataset` as representing the entire dataset; whether or not it fits in memory is irrelevant. `DataLoader` takes in a dataset and doles out batches of instances that it pulls from the dataset.

## Object Model

At its simplest, the `DataLoader` needs a `Dataset` and a `Sampler` to sample from that dataset. Also, the `Sampler` will need access to the dataset so it can know the total number of available instances.

![concept_model](./imgs/concept_model.png)

In reality a way to batch the samples is also needed, so `DataLoader` ships with a couple of reasonable defaults.
![default](./imgs/default.png)

The builtin `BatchSampler` can be configured with 2 parameters - `batch_size` and `drop_last`. 

Instead of the default `SequentialSampler`, I can choose to use the builtin `RandomSampler` by setting the `shuffle=True` argument.
![random](./imgs/random.png)

I can also use my own custom sampler that the batcher will automatically wrap. This is done by passing my custom sampler object in the `sampler` argument.
![sampler](./imgs/sampler.png)

I can get rid of the batcher entirely if I so choose, by setting the `batch_size=None`.
![nobatcher](./imgs/nobatcher.png)

Or, substitute it with my custom batcher by passing my custom batcher object in the `batch_sampler` argument.
![custom_batcher](./imgs/custom_batcher.png)


## DataLoader

In order to yield batches, the data loader first samples instances from the dataset (sequentially by default, or randomly when `shuffle=True`), and then collates these instances into a single batch of whatever size the caller specified before yielding it to the caller. While these defaults are good for most cases, both the sampling and batching operations are heavily customizable.
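
To make the sample, batch, fetch, collate sequence concrete, here is a minimal pure-Python sketch of that loop. It is a simplified model and not PyTorch's actual implementation; `sequential_sampler`, `batch_sampler`, `collate`, and `toy_loader` are hypothetical names, and plain lists stand in for tensors.

```python
def sequential_sampler(n):
    """Yield indices 0..n-1 in order (stand-in for a sequential sampler)."""
    yield from range(n)


def batch_sampler(sampler, batch_size, drop_last):
    """Group indices coming from `sampler` into lists of `batch_size`."""
    batch = []
    for idx in sampler:
        batch.append(idx)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch and not drop_last:
        yield batch


def collate(samples):
    """Turn [(x0, y0), (x1, y1), ...] into ([x0, x1, ...], [y0, y1, ...])."""
    xs, ys = zip(*samples)
    return list(xs), list(ys)


def toy_loader(dataset, batch_size=1, drop_last=False):
    """Hypothetical, simplified stand-in for the data loader's iteration loop."""
    indices = sequential_sampler(len(dataset))
    for batch_idxs in batch_sampler(indices, batch_size, drop_last):
        samples = [dataset[i] for i in batch_idxs]  # fetch rows by index
        yield collate(samples)                      # merge rows into one batch


dataset = [([i, i], i % 2) for i in range(5)]  # five (features, label) rows
batches = list(toy_loader(dataset, batch_size=2))
for xs, ys in batches:
    print(xs, ys)
# [[0, 0], [1, 1]] [0, 1]
# [[2, 2], [3, 3]] [0, 1]
# [[4, 4]] [0]
```

Each stage is a separate function on purpose: swapping the sampler changes the order, swapping the batcher changes the grouping, and swapping `collate` changes the shape of what the caller receives.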

### Sampling
If we think of datasets as indexed containers of data, pretty much like a `list` or a `dict`, we can think of samplers as objects that decide which index to read next. PyTorch has a bunch of built-in samplers like `SequentialSampler`, `RandomSampler`, etc.

In [2]:
def print_dl(dl):
    print("\n")
    print("sampler: ", type(dl.sampler))
    print("batch size: ", dl.batch_size)
    print("batch sampler: ", type(dl.batch_sampler))

In [3]:
class MockDataset(td.Dataset):
    def __init__(self, size=100):
        super().__init__()
        self._size = size

    def __getitem__(self, key):
        raise NotImplementedError()

    def __len__(self):
        return self._size

In [4]:
sampler = td.SequentialSampler(MockDataset(10))
for idx in sampler:
    print(idx, end=" ")

0 1 2 3 4 5 6 7 8 9 

In [5]:
sampler = td.RandomSampler(MockDataset(10))
for idx in sampler:
    print(idx, end=" ")

7 8 9 2 6 1 5 0 3 4 

I can also implement my own sampler.

In [6]:
class ReverseSampler(td.Sampler):
    def __init__(self, data_source):
        super().__init__(data_source)
        self._len = len(data_source)

    def __iter__(self):
        yield from range(self._len-1, -1, -1)

In [7]:
sampler = ReverseSampler(MockDataset(10))
for idx in sampler:
    print(idx, end=" ")

9 8 7 6 5 4 3 2 1 0 

By default `DataLoader` will create a sequential sampler. If I set the `shuffle` arg to `True` it will create a `RandomSampler`. And of course I can specify my custom sampler if I want. But I cannot specify my custom sampler and at the same time tell the data loader to use the `RandomSampler` by setting `shuffle=True`.

In [8]:
dl = td.DataLoader(MockDataset())
print_dl(dl)



sampler:  <class 'torch.utils.data.sampler.SequentialSampler'>
batch size:  1
batch sampler:  <class 'torch.utils.data.sampler.BatchSampler'>


In [9]:
dl = td.DataLoader(MockDataset(), shuffle=True)
print_dl(dl)



sampler:  <class 'torch.utils.data.sampler.RandomSampler'>
batch size:  1
batch sampler:  <class 'torch.utils.data.sampler.BatchSampler'>


In [10]:
ds = MockDataset()
dl = td.DataLoader(ds, sampler=ReverseSampler(ds))
print_dl(dl)




sampler:  <class '__main__.ReverseSampler'>
batch size:  1
batch sampler:  <class 'torch.utils.data.sampler.BatchSampler'>


In [11]:
try:
    ds = MockDataset()
    dl = td.DataLoader(ds, sampler=ReverseSampler(ds), shuffle=True)
except ValueError as ve:
    print(ve)

sampler option is mutually exclusive with shuffle


### Batching
Samplers return the next index to read. But often in ML we want to learn on a batch of data, so we need a way to group the indexes into a list. The default `BatchSampler` does just that. It is a `Sampler` that wraps another sampler and collects `batch_size` indexes before returning them as a single batch. Besides the sampler that it wraps, it accepts two arguments: `batch_size`, and `drop_last`, which controls whether to drop the last batch if it has fewer than `batch_size` samples.

In [12]:
sampler = td.SequentialSampler(MockDataset())
batcher = td.BatchSampler(sampler, batch_size=14, drop_last=False)
for idxs in batcher:
    print(idxs)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27]
[28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41]
[42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55]
[56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69]
[70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83]
[84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97]
[98, 99]


In [13]:
sampler = td.SequentialSampler(MockDataset())
batcher = td.BatchSampler(sampler, batch_size=14, drop_last=True)
for idxs in batcher:
    print(idxs)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27]
[28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41]
[42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55]
[56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69]
[70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83]
[84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97]
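
The batch counts in the two cells above follow from simple arithmetic. The sketch below reproduces it with a hypothetical helper, `num_batches`; it is only a restatement of the counting rule, not code taken from PyTorch.

```python
import math


def num_batches(n_samples, batch_size, drop_last):
    """Number of batches produced from `n_samples` indices."""
    if drop_last:
        return n_samples // batch_size          # partial batch is discarded
    return math.ceil(n_samples / batch_size)    # partial batch is kept


print(num_batches(100, 14, drop_last=False))  # 8
print(num_batches(100, 14, drop_last=True))   # 7
print(100 % 14)                               # 2 leftover indexes: 98 and 99
```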


And as usual I can create my custom batcher. I can use the same pattern as the built-in `BatchSampler` where it wraps an existing sampler, or I can simply implement it as a `Sampler` that accepts a `data_source` but returns a batch of samples at once instead of single samples.

In [14]:
class FibnonacciBatcher(td.Sampler):
    def __init__(self, data_source):
        # Only the dataset length is needed; no inner sampler is wrapped.
        self._len = len(data_source)

    def __iter__(self):
        # Keep the Fibonacci state local so every pass restarts at batch size 1.
        n1, n2 = 0, 1
        batch = []
        batch_size = n1 + n2
        for idx in range(self._len):
            batch.append(idx)
            if len(batch) == batch_size:
                yield batch
                batch = []
                n1, n2 = n2, n1 + n2
                batch_size = n1 + n2
        if len(batch) > 0:
            yield batch

In [15]:
batcher = FibnonacciBatcher(MockDataset())
for idxs in batcher:
    print(idxs)

[0]
[1, 2]
[3, 4, 5]
[6, 7, 8, 9, 10]
[11, 12, 13, 14, 15, 16, 17, 18]
[19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
[32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52]
[53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86]
[87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
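
The batcher above follows the second pattern (it takes a `data_source`). For comparison, here is a sketch of the first pattern, where the batcher wraps an existing sampler and only decides how to group the indexes it is handed. `CyclingSizeBatcher` and its `sizes` parameter are hypothetical, and a plain `range` stands in for a sampler such as `ReverseSampler` so the block runs without PyTorch.

```python
import itertools


class CyclingSizeBatcher:
    """Hypothetical batcher: wraps any iterable of indexes and cycles through `sizes`."""

    def __init__(self, sampler, sizes=(2, 3)):
        self._sampler = sampler
        self._sizes = sizes

    def __iter__(self):
        sizes = itertools.cycle(self._sizes)
        batch, target = [], next(sizes)
        for idx in self._sampler:
            batch.append(idx)
            if len(batch) == target:
                yield batch
                batch, target = [], next(sizes)
        if batch:
            yield batch  # trailing partial batch


reverse_order = range(9, -1, -1)  # stand-in for a sampler yielding 9, 8, ..., 0
batches = list(CyclingSizeBatcher(reverse_order))
print(batches)
# [[9, 8], [7, 6, 5], [4, 3], [2, 1, 0]]
```

The order of the indexes comes entirely from the wrapped sampler, and the batcher only controls the grouping. In the notebook, the same class would presumably subclass `td.Sampler` and receive `ReverseSampler(ds)` instead of a `range`.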


By default the `DataLoader` will create a `BatchSampler` by passing it the sampler along with the `batch_size` and `drop_last` args, whose default values are `1` and `False`. The sampler can be either of the automated samplers (`SequentialSampler` or `RandomSampler`) or a user-specified one. I can also specify my custom batcher using the `batch_sampler` argument. But then I cannot also tell the data loader to pass the `batch_size` and `drop_last` args to my custom batcher, because it might not take these args. Likewise, there is no point in passing in a sampler via the `shuffle` or `sampler` args, because my custom batcher might not be a wrapper around another sampler.

I can completely turn off automated batching by setting both `batch_size` and `batch_sampler` to `None`. The default value of `batch_sampler` is already `None`.


In [16]:
ds = MockDataset()
dl = td.DataLoader(ds)
print_dl(dl)



sampler:  <class 'torch.utils.data.sampler.SequentialSampler'>
batch size:  1
batch sampler:  <class 'torch.utils.data.sampler.BatchSampler'>


In [17]:
ds = MockDataset()
dl = td.DataLoader(ds, batch_size=2, drop_last=True)
print_dl(dl)



sampler:  <class 'torch.utils.data.sampler.SequentialSampler'>
batch size:  2
batch sampler:  <class 'torch.utils.data.sampler.BatchSampler'>


In [18]:
ds = MockDataset()
batcher = FibnonacciBatcher(ds)
dl = td.DataLoader(ds, batch_sampler=batcher)
print_dl(dl)



sampler:  <class 'torch.utils.data.sampler.SequentialSampler'>
batch size:  None
batch sampler:  <class '__main__.FibnonacciBatcher'>


In [19]:
ds = MockDataset()
batcher = FibnonacciBatcher(ds)
try:
    dl = td.DataLoader(ds, batch_sampler=batcher, batch_size=10)
except ValueError as ve:
    print(ve)

try:
    dl = td.DataLoader(ds, batch_sampler=batcher, drop_last=True)
except ValueError as ve:
    print(ve)

try:
    dl = td.DataLoader(ds, batch_sampler=batcher, shuffle=True)
except ValueError as ve:
    print(ve)

try:
    dl = td.DataLoader(ds, batch_sampler=batcher, sampler=ReverseSampler(ds))
except ValueError as ve:
    print(ve)            

batch_sampler option is mutually exclusive with batch_size, shuffle, sampler, and drop_last
batch_sampler option is mutually exclusive with batch_size, shuffle, sampler, and drop_last
batch_sampler option is mutually exclusive with batch_size, shuffle, sampler, and drop_last
batch_sampler option is mutually exclusive with batch_size, shuffle, sampler, and drop_last


In [20]:
ds = MockDataset()
dl = td.DataLoader(ds, batch_sampler=None, batch_size=None)
print_dl(dl)



sampler:  <class 'torch.utils.data.sampler.SequentialSampler'>
batch size:  None
batch sampler:  <class 'NoneType'>
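
The argument rules exercised in the cells above can be summarized in one small function. This is a hypothetical restatement of the rules as described in this notebook, not PyTorch's actual validation code; `resolve_batching` and `is_rejected` are made-up names, and the `drop_last` check for `batch_size=None` is my understanding of the library's behavior rather than something shown above.

```python
def resolve_batching(batch_size=1, shuffle=False, sampler=None,
                     batch_sampler=None, drop_last=False):
    """Return which batching mode a given argument combination selects."""
    if sampler is not None and shuffle:
        raise ValueError("sampler is mutually exclusive with shuffle")
    if batch_sampler is not None:
        # A custom batcher owns both ordering and grouping.
        if batch_size != 1 or shuffle or sampler is not None or drop_last:
            raise ValueError("batch_sampler is mutually exclusive with "
                             "batch_size, shuffle, sampler, and drop_last")
        return "custom batcher"
    if batch_size is None:
        # Assumption: with no batching there is no last batch to drop.
        if drop_last:
            raise ValueError("batch_size=None is mutually exclusive with drop_last")
        return "no batching"
    return "default batcher"


def is_rejected(**kwargs):
    """True when the combination raises ValueError."""
    try:
        resolve_batching(**kwargs)
    except ValueError:
        return True
    return False


print(resolve_batching())                               # default batcher
print(resolve_batching(batch_size=None))                # no batching
print(resolve_batching(batch_sampler=[[0, 1], [2]]))    # custom batcher
print(is_rejected(batch_sampler=[[0]], batch_size=10))  # True
print(is_rejected(sampler=[0, 1], shuffle=True))        # True
```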


### Collating Data
The callable specified as `collate_fn` takes in the sample(s) that were fetched from the dataset using the indexes yielded by the samplers, and converts them to tensors. If automated batching is on, the callable takes in a list of samples and converts them into a single tensor with the outer dim being the batch size. It also automatically converts numpy arrays to tensors and does a couple of other utility things. When automated batching is turned off, it is called with a single sample, and the default behavior is to convert all numpy arrays to `torch.Tensor` objects and return those. Examples of this are provided after a discussion on datasets.
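
The core of batch collation is a transpose: a list of per-row tuples becomes one tuple of per-field batches. The sketch below shows that idea with plain lists instead of tensors. `toy_collate` is a hypothetical function that only mimics the general shape of the behavior described above; it is not the library's default collate function.

```python
def toy_collate(samples):
    """Hypothetical batch collation using lists in place of tensors."""
    first = samples[0]
    if isinstance(first, tuple):
        # Transpose: a list of tuples becomes a tuple of collated fields.
        return tuple(toy_collate(list(field)) for field in zip(*samples))
    if isinstance(first, dict):
        # Collate each key separately, keeping the dict structure.
        return {key: toy_collate([s[key] for s in samples]) for key in first}
    return list(samples)  # leaf values are simply gathered into one list


rows = [([0, 1, 2], 0), ([3, 4, 5], 1)]  # two (features, label) rows
print(toy_collate(rows))
# ([[0, 1, 2], [3, 4, 5]], [0, 1])
```

A real `collate_fn` would stack the gathered leaves into tensors, which is where the extra outer batch dim comes from.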

## Datasets
A `Dataset` is an abstraction over my entire actual underlying dataset. The underlying dataset could be a folder of images with the folder name as the label, or it can be data that is being streamed from a data warehouse, or it can be data that is read in chunks from the hard drive, etc. There are two broad types of datasets:

  * **Map-Style**: When it is possible to index the underlying dataset, it makes sense to use a map-style dataset. The dataset is expected to return one "row" of the underlying dataset at a time. Here the `Dataset` child class has to implement the `__getitem__` and `__len__` methods. Moreover, it has to guarantee that when called with the same key, the dataset will always return the same row.

  * **Iterable-Style**: When it is not possible to index the underlying dataset, or when returning data one row at a time is too expensive, then it makes sense to implement an iterable-style dataset. Here the `IterableDataset` child class will only have to implement the `__iter__` method that needs to return an iterator.

Usually, if the entire underlying dataset fits in memory, we use map-style datasets. Even when it doesn't, a map-style dataset might still be OK if the data retrieval cost is not too high, e.g., if rows are read from disk, or over the network from a database with the indexes as the primary keys. But if the data retrieval cost is high, e.g., reading data from a data warehouse, we might want to use iterable-style datasets. And if the underlying dataset is a Kafka-like stream, there is no option but to use iterable-style datasets.
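
The two styles boil down to two different Python protocols. The sketch below shows them without PyTorch so the contrast is easy to see; `ToyMapDataset` and `ToyIterableDataset` are hypothetical classes, and in real code they would subclass `Dataset` and `IterableDataset` respectively.

```python
class ToyMapDataset:
    """Map-style: random access by key plus a known length."""

    def __init__(self, rows):
        self._rows = list(rows)

    def __getitem__(self, idx):
        return self._rows[idx]

    def __len__(self):
        return len(self._rows)


class ToyIterableDataset:
    """Iterable-style: rows arrive only in the order the source yields them."""

    def __init__(self, make_stream):
        self._make_stream = make_stream  # callable returning a fresh iterator

    def __iter__(self):
        return iter(self._make_stream())


mapped = ToyMapDataset(["a", "b", "c", "d"])
print([mapped[i] for i in (3, 0, 2, 1)])  # ['d', 'a', 'c', 'b']

streamed = ToyIterableDataset(lambda: (ch for ch in "abcd"))
print(list(streamed))                     # ['a', 'b', 'c', 'd']
```

Because the map-style object can be visited in any index order, a sampler has something to work with. The iterable-style object exposes neither indexes nor a length, so the visiting order is fixed by the source.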

### Map-Style Datasets
It is possible to use various samplers via the `shuffle`, `sampler`, `batch_size`, `drop_last`, and `batch_sampler` args.

In [21]:
class MyMappedDataset(td.Dataset):
    def __init__(self, n=5, m=10):
        self._x = np.arange(n * m).reshape(m, n)
        self._y = np.random.choice([0, 1], size=m, p=[0.7, 0.3])

    def __getitem__(self, idx):
        return self._x[idx], self._y[idx]
    
    def __len__(self):
        return self._x.shape[0]

In [22]:
for x, y in MyMappedDataset(n=3, m=5):
    print(x, y)

[0 1 2] 0
[3 4 5] 0
[6 7 8] 0
[ 9 10 11] 0
[12 13 14] 1


In the example below see how the individual 1-D numpy array is converted to a 2-D `torch.Tensor` with the outer dim as the default batch size of 1. This is the default batch `collate_fn` in action.

In [23]:
ds = MyMappedDataset(n=3, m=5)
dl = td.DataLoader(ds)
for x, y in dl:
    print(x, y)

print_dl(dl)

tensor([[0, 1, 2]]) tensor([0])
tensor([[3, 4, 5]]) tensor([0])
tensor([[6, 7, 8]]) tensor([1])
tensor([[ 9, 10, 11]]) tensor([0])
tensor([[12, 13, 14]]) tensor([0])


sampler:  <class 'torch.utils.data.sampler.SequentialSampler'>
batch size:  1
batch sampler:  <class 'torch.utils.data.sampler.BatchSampler'>


In the example below I am turning automated batching off. The default `collate_fn` now just converts numpy arrays to `torch.Tensor` but there is no extra dim.

In [24]:
ds = MyMappedDataset(n=3, m=5)
dl = td.DataLoader(ds, batch_size=None)
for x, y in dl:
    print(x, y)

print_dl(dl)
    

tensor([0, 1, 2]) tensor(0)
tensor([3, 4, 5]) tensor(1)
tensor([6, 7, 8]) tensor(0)
tensor([ 9, 10, 11]) tensor(0)
tensor([12, 13, 14]) tensor(1)


sampler:  <class 'torch.utils.data.sampler.SequentialSampler'>
batch size:  None
batch sampler:  <class 'NoneType'>


In [25]:
ds = MyMappedDataset()
dl = td.DataLoader(ds, shuffle=True, batch_size=3)
for x, y in dl:
    print(x, y)

print_dl(dl)

tensor([[ 0,  1,  2,  3,  4],
        [15, 16, 17, 18, 19],
        [45, 46, 47, 48, 49]]) tensor([0, 0, 0])
tensor([[30, 31, 32, 33, 34],
        [35, 36, 37, 38, 39],
        [10, 11, 12, 13, 14]]) tensor([0, 1, 1])
tensor([[40, 41, 42, 43, 44],
        [ 5,  6,  7,  8,  9],
        [25, 26, 27, 28, 29]]) tensor([1, 1, 0])
tensor([[20, 21, 22, 23, 24]]) tensor([0])


sampler:  <class 'torch.utils.data.sampler.RandomSampler'>
batch size:  3
batch sampler:  <class 'torch.utils.data.sampler.BatchSampler'>


In [26]:
ds = MyMappedDataset()
dl = td.DataLoader(ds, shuffle=True, batch_size=3, drop_last=True)
for x, y in dl:
    print(x, y)

print_dl(dl)

tensor([[45, 46, 47, 48, 49],
        [15, 16, 17, 18, 19],
        [20, 21, 22, 23, 24]]) tensor([0, 0, 0])
tensor([[40, 41, 42, 43, 44],
        [30, 31, 32, 33, 34],
        [35, 36, 37, 38, 39]]) tensor([0, 0, 0])
tensor([[10, 11, 12, 13, 14],
        [25, 26, 27, 28, 29],
        [ 5,  6,  7,  8,  9]]) tensor([0, 0, 0])


sampler:  <class 'torch.utils.data.sampler.RandomSampler'>
batch size:  3
batch sampler:  <class 'torch.utils.data.sampler.BatchSampler'>


In [27]:
ds = MyMappedDataset()
dl = td.DataLoader(ds, sampler=ReverseSampler(ds), batch_size=3)
for x, y in dl:
    print(x, y)

print_dl(dl)

tensor([[45, 46, 47, 48, 49],
        [40, 41, 42, 43, 44],
        [35, 36, 37, 38, 39]]) tensor([0, 0, 0])
tensor([[30, 31, 32, 33, 34],
        [25, 26, 27, 28, 29],
        [20, 21, 22, 23, 24]]) tensor([1, 0, 0])
tensor([[15, 16, 17, 18, 19],
        [10, 11, 12, 13, 14],
        [ 5,  6,  7,  8,  9]]) tensor([0, 0, 0])
tensor([[0, 1, 2, 3, 4]]) tensor([1])


sampler:  <class '__main__.ReverseSampler'>
batch size:  3
batch sampler:  <class 'torch.utils.data.sampler.BatchSampler'>


In [28]:
ds = MyMappedDataset()
dl = td.DataLoader(ds, batch_sampler=FibnonacciBatcher(ds))
for x, y in dl:
    print(x, y)

print_dl(dl)

tensor([[0, 1, 2, 3, 4]]) tensor([0])
tensor([[ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14]]) tensor([0, 0])
tensor([[15, 16, 17, 18, 19],
        [20, 21, 22, 23, 24],
        [25, 26, 27, 28, 29]]) tensor([0, 1, 0])
tensor([[30, 31, 32, 33, 34],
        [35, 36, 37, 38, 39],
        [40, 41, 42, 43, 44],
        [45, 46, 47, 48, 49]]) tensor([1, 1, 1, 0])


sampler:  <class 'torch.utils.data.sampler.SequentialSampler'>
batch size:  None
batch sampler:  <class '__main__.FibnonacciBatcher'>


### Iterable-Style Datasets
At their most general these datasets have no concept of indexes. Think of the underlying data source as a continuous stream. Of course I can also implement a slow dataset as an iterable-style dataset, and it might be possible to index into such a dataset. But considering the general case, there are no indexes. And because of this, there is no sense in having any kind of sampler or custom batching via the `shuffle`, `sampler`, or `batch_sampler` args. 

However, the data loader does support automated batching with these datasets where it will create an internal constant sampler and pass that to the default `BatchSampler` class along with the `batch_size` and `drop_last` args. This mostly makes sense for the streaming type underlying data sources where each row can be retrieved easily and rows are then batched together in a single batch. Though I am not sure of the scenario where `drop_last` would make sense.
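
Batching a stream needs no indexes at all: it is enough to keep pulling the next `batch_size` rows from the iterator. Here is a minimal sketch of that idea using `itertools.islice`. `batch_stream` is a hypothetical helper, not what the data loader does internally.

```python
import itertools


def batch_stream(stream, batch_size, drop_last=False):
    """Group consecutive rows of an iterator into lists of `batch_size`."""
    it = iter(stream)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return  # stream exhausted exactly on a batch boundary
        if len(batch) < batch_size and drop_last:
            return  # discard the short trailing batch
        yield batch


print(list(batch_stream(range(1, 7), batch_size=4)))
# [[1, 2, 3, 4], [5, 6]]
print(list(batch_stream(range(1, 7), batch_size=4, drop_last=True)))
# [[1, 2, 3, 4]]

endless = batch_stream(itertools.count(1), batch_size=3)
print(next(endless), next(endless))
# [1, 2, 3] [4, 5, 6]
```

In this sketch `drop_last` only has an effect when the source is finite and its length is not a multiple of the batch size. For a truly endless stream there is never a short batch to drop.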

In [29]:
class MyStreamingDataset(td.IterableDataset):
    def __init__(self, n=5):
        self._n = n

    def __iter__(self):
        ctr = 1
        while True:
            x = np.full(self._n, ctr)
            y = np.random.choice([0, 1], size=1, p=[0.3, 0.7])[0]
            yield x, y
            ctr += 1

In [30]:
for x, y in MyStreamingDataset():
    if x[0] == 5: break
    print(x, y)

[1 1 1 1 1] 0
[2 2 2 2 2] 1
[3 3 3 3 3] 1
[4 4 4 4 4] 1


Again notice the default `collate_fn` doing its thing by adding an extra dim and converting numpy arrays to tensors.

In [31]:
ds = MyStreamingDataset()
dl = td.DataLoader(ds)
for x, y in dl:
    if x[0, 0] == 5: break
    print(x, y)

print_dl(dl)    

tensor([[1, 1, 1, 1, 1]]) tensor([1])
tensor([[2, 2, 2, 2, 2]]) tensor([1])
tensor([[3, 3, 3, 3, 3]]) tensor([1])
tensor([[4, 4, 4, 4, 4]]) tensor([1])


sampler:  <class 'torch.utils.data.dataloader._InfiniteConstantSampler'>
batch size:  1
batch sampler:  <class 'torch.utils.data.sampler.BatchSampler'>


In [32]:
ds = MyStreamingDataset()
dl = td.DataLoader(ds, batch_size=3)
for x, y in dl:
    if x[0, 0] >= 15: break
    print(x, y)

print_dl(dl)    

tensor([[1, 1, 1, 1, 1],
        [2, 2, 2, 2, 2],
        [3, 3, 3, 3, 3]]) tensor([0, 1, 1])
tensor([[4, 4, 4, 4, 4],
        [5, 5, 5, 5, 5],
        [6, 6, 6, 6, 6]]) tensor([1, 1, 1])
tensor([[7, 7, 7, 7, 7],
        [8, 8, 8, 8, 8],
        [9, 9, 9, 9, 9]]) tensor([1, 1, 1])
tensor([[10, 10, 10, 10, 10],
        [11, 11, 11, 11, 11],
        [12, 12, 12, 12, 12]]) tensor([1, 0, 1])
tensor([[13, 13, 13, 13, 13],
        [14, 14, 14, 14, 14],
        [15, 15, 15, 15, 15]]) tensor([0, 0, 0])


sampler:  <class 'torch.utils.data.dataloader._InfiniteConstantSampler'>
batch size:  3
batch sampler:  <class 'torch.utils.data.sampler.BatchSampler'>


In [33]:
try:
    ds = MyStreamingDataset()
    dl = td.DataLoader(ds, shuffle=True)
except ValueError as ve:
    print(ve)

try:
    ds = MyStreamingDataset()
    dl = td.DataLoader(ds, sampler=ReverseSampler(ds))
except TypeError as te:
    print(te) 

try:
    ds = MyStreamingDataset()
    dl = td.DataLoader(ds, batch_sampler=FibnonacciBatcher(ds))
except TypeError as te:
    print(te)       

DataLoader with IterableDataset: expected unspecified shuffle option, but got shuffle=True
object of type 'MyStreamingDataset' has no len()
object of type 'MyStreamingDataset' has no len()


For the sake of illustration let's implement a slow dataset that returns a single row at a time.

In [34]:
import time
import random


class MySlowDataset(td.IterableDataset):
    def __init__(self, n=5, m=100):
        self._n = n
        self._m = m

    def __iter__(self):
        for i in range(1, self._m+1):
            x = np.full(self._n, i)
            y = np.random.choice([0, 1], size=1, p=[0.3, 0.7])[0]
            time.sleep(random.randint(1, 5))
            yield x, y

    def __len__(self):
        # even though this is not strictly needed, let's implement it to see
        # the kind of errors that samplers will give
        return self._m

In [35]:
for x, y in MySlowDataset(m=3):
    print(x, y)

[1 1 1 1 1] 1
[2 2 2 2 2] 1
[3 3 3 3 3] 0


In [36]:
ds = MySlowDataset(m=3)
dl = td.DataLoader(ds)
for x, y in dl:
    if x[0, 0] == 5: break
    print(x, y)

print_dl(dl)    

tensor([[1, 1, 1, 1, 1]]) tensor([0])
tensor([[2, 2, 2, 2, 2]]) tensor([1])
tensor([[3, 3, 3, 3, 3]]) tensor([1])


sampler:  <class 'torch.utils.data.dataloader._InfiniteConstantSampler'>
batch size:  1
batch sampler:  <class 'torch.utils.data.sampler.BatchSampler'>


In [37]:
ds = MySlowDataset(m=6)
dl = td.DataLoader(ds, batch_size=3)
for x, y in dl:
    if x[0, 0] >= 15: break
    print(x, y)

print_dl(dl)    

tensor([[1, 1, 1, 1, 1],
        [2, 2, 2, 2, 2],
        [3, 3, 3, 3, 3]]) tensor([1, 1, 1])
tensor([[4, 4, 4, 4, 4],
        [5, 5, 5, 5, 5],
        [6, 6, 6, 6, 6]]) tensor([0, 1, 1])


sampler:  <class 'torch.utils.data.dataloader._InfiniteConstantSampler'>
batch size:  3
batch sampler:  <class 'torch.utils.data.sampler.BatchSampler'>


In [38]:
ds = MySlowDataset(m=6)
dl = td.DataLoader(ds, batch_size=4, drop_last=True)
for x, y in dl:
    if x[0, 0] >= 15: break
    print(x, y)

print_dl(dl)    

tensor([[1, 1, 1, 1, 1],
        [2, 2, 2, 2, 2],
        [3, 3, 3, 3, 3],
        [4, 4, 4, 4, 4]]) tensor([1, 1, 1, 0])


sampler:  <class 'torch.utils.data.dataloader._InfiniteConstantSampler'>
batch size:  4
batch sampler:  <class 'torch.utils.data.sampler.BatchSampler'>


In [39]:
try:
    ds = MySlowDataset()
    dl = td.DataLoader(ds, shuffle=True)
except ValueError as ve:
    print(ve)

try:
    ds = MySlowDataset()
    dl = td.DataLoader(ds, sampler=ReverseSampler(ds))
except ValueError as ve:
    print(ve) 

try:
    ds = MySlowDataset()
    dl = td.DataLoader(ds, batch_sampler=FibnonacciBatcher(ds))
except ValueError as ve:
    print(ve)     

DataLoader with IterableDataset: expected unspecified shuffle option, but got shuffle=True
DataLoader with IterableDataset: expected unspecified sampler option, but got sampler=<__main__.ReverseSampler object at 0x7ff9785d5700>
DataLoader with IterableDataset: expected unspecified batch_sampler option, but got batch_sampler=<__main__.FibnonacciBatcher object at 0x7ff9785d5370>


Now let's implement a batched slow dataset, which is a more realistic scenario.

In [44]:
import time
import random


class MySlowBatchedDataset(td.IterableDataset):
    def __init__(self, n=5, m=100, batch_size=10):
        self._n = n
        self._m = m
        self._batch_size = batch_size

    def __iter__(self):
        batch_x = np.empty((self._batch_size, self._n))
        batch_y = np.empty(self._batch_size)
        next_idx = 0
        for i in range(1, self._m + 1):
            curr_idx = next_idx
            batch_x[curr_idx] = np.full(self._n, i)
            batch_y[curr_idx] = np.random.choice([0, 1], size=1, p=[0.3, 0.7])[0]
            next_idx += 1
            if next_idx == self._batch_size:
                time.sleep(random.randint(1, 5))
                yield batch_x.astype(np.float32), batch_y.astype(int)
                batch_x = np.empty((self._batch_size, self._n))
                batch_y = np.empty(self._batch_size)
                next_idx = 0
        if next_idx > 0:
            # Yield the trailing partial batch only when it is non-empty.
            # It is not cast like the full batches above, so it stays float64.
            time.sleep(random.randint(1, 5))
            yield batch_x[:next_idx], batch_y[:next_idx]

    def __len__(self):
        # even though this is not strictly needed, let's implement it to see
        # the kind of errors that samplers will give
        return self._m

In [45]:
for x, y in MySlowBatchedDataset(m=10, batch_size=3):
    print(x, y)

[[1. 1. 1. 1. 1.]
 [2. 2. 2. 2. 2.]
 [3. 3. 3. 3. 3.]] [1 0 1]
[[4. 4. 4. 4. 4.]
 [5. 5. 5. 5. 5.]
 [6. 6. 6. 6. 6.]] [0 1 1]
[[7. 7. 7. 7. 7.]
 [8. 8. 8. 8. 8.]
 [9. 9. 9. 9. 9.]] [1 0 1]
[[10. 10. 10. 10. 10.]] [0.]


In the example below the default `collate_fn` for batches is doing its thing and adding an extra dim to everything. For a batched dataset this is not really needed.

In [46]:
ds = MySlowBatchedDataset(m=10, batch_size=3)
dl = td.DataLoader(ds)
for x, y in dl:
    print(x, y)

print_dl(dl)    

tensor([[[1., 1., 1., 1., 1.],
         [2., 2., 2., 2., 2.],
         [3., 3., 3., 3., 3.]]]) tensor([[0, 1, 1]])
tensor([[[4., 4., 4., 4., 4.],
         [5., 5., 5., 5., 5.],
         [6., 6., 6., 6., 6.]]]) tensor([[0, 0, 1]])
tensor([[[7., 7., 7., 7., 7.],
         [8., 8., 8., 8., 8.],
         [9., 9., 9., 9., 9.]]]) tensor([[1, 0, 1]])
tensor([[[10., 10., 10., 10., 10.]]], dtype=torch.float64) tensor([[0.]], dtype=torch.float64)


sampler:  <class 'torch.utils.data.dataloader._InfiniteConstantSampler'>
batch size:  1
batch sampler:  <class 'torch.utils.data.sampler.BatchSampler'>


By turning off automated batching, the default `collate_fn` for single rows kicks in, and all it does is convert numpy arrays to tensors.

In [47]:
ds = MySlowBatchedDataset(m=10, batch_size=3)
dl = td.DataLoader(ds, batch_size=None)
for x, y in dl:
    print(x, y)

print_dl(dl)    

tensor([[1., 1., 1., 1., 1.],
        [2., 2., 2., 2., 2.],
        [3., 3., 3., 3., 3.]]) tensor([1, 0, 1])
tensor([[4., 4., 4., 4., 4.],
        [5., 5., 5., 5., 5.],
        [6., 6., 6., 6., 6.]]) tensor([1, 0, 1])
tensor([[7., 7., 7., 7., 7.],
        [8., 8., 8., 8., 8.],
        [9., 9., 9., 9., 9.]]) tensor([0, 0, 1])
tensor([[10., 10., 10., 10., 10.]], dtype=torch.float64) tensor([0.], dtype=torch.float64)


sampler:  <class 'torch.utils.data.dataloader._InfiniteConstantSampler'>
batch size:  None
batch sampler:  <class 'NoneType'>
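
The shape difference between the last two cells can be reproduced with nested lists. In this sketch `shape` and `toy_load` are hypothetical helpers and lists stand in for tensors. The point is only that batching with `batch_size=1` wraps each pre-batched chunk in one more level, while `batch_size=None` passes the chunk through unchanged.

```python
def shape(nested):
    """Return the dimensions of a regularly nested, non-empty list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


def toy_load(chunks, batch_size):
    """Hypothetical loader: batch_size=None yields chunks as they are."""
    if batch_size is None:
        yield from chunks
        return
    batch = []
    for chunk in chunks:
        batch.append(chunk)
        if len(batch) == batch_size:
            yield batch  # one extra outer level around the chunk(s)
            batch = []
    if batch:
        yield batch


chunks = [[[i] * 5 for i in range(1, 4)]]  # one pre-batched chunk of shape (3, 5)
print(shape(next(toy_load(chunks, batch_size=1))))     # (1, 3, 5)
print(shape(next(toy_load(chunks, batch_size=None))))  # (3, 5)
```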
