# Understanding Generators for Deep Learning

Generators are ubiquitous in deep learning. Whether or not a dataset fits in memory, you want to feed a network with mini-batches, and a generator is a convenient tool for this. Generators have some interesting characteristics that make them particularly suitable for the task. Let's illustrate how to use them with a simple example. We start by creating two objects that mimic a typical dataset: a data array `X` and a target vector `y`. For simplicity, we let them have 100 observations.

In [26]:
import numpy as np
X = np.random.rand(100, 10)
y = np.random.rand(100, 1)
X.shape, y.shape

((100, 10), (100, 1))

We want to create a generator that takes the batch size as input and returns batches of `X` and `y` of that size. Two situations can occur:

1. `batch_size` is an integer divisor of `X.shape[0]` (and, obviously, of `y.shape[0]`).
2. `batch_size` is not an integer divisor of `X.shape[0]` (and, obviously, of `y.shape[0]`).

To create our generator we must select `start` and `end`, taken here as inclusive indices, such that:
1. The first `start` value is 0 and the first `end` value is `batch_size - 1` (note the -1!).
2. The second `start` value is `batch_size` and the second `end` value is `2 * batch_size - 1`.

And so on. What about the last batch? Must we stop at the last full batch, that is, at the largest multiple of `batch_size` that does not exceed `X.shape[0]`? It turns out that this is not necessary. A properly written generator can return batches of unequal sizes. Let us see some concrete examples.

1. `batch_size = 50`. We want `start, end` to be, in order, `(0, 49)`, `(50, 99)`.
2. `batch_size = 49`. We have `start, end` equal to `(0, 48)`, `(49, 97)`. There are still two observations left, but the next batch would be `(98, 146)`, which extends well past the last valid index, 99.
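
The index arithmetic above can be sketched as follows. The helper name `batch_bounds` is hypothetical (it is not part of the original code); it lists the inclusive `(start, end)` pairs for a given batch size, clipping the last pair to the dataset size. The final lines show, on a throwaway array `data`, that a NumPy slice with an out-of-range end simply returns whatever rows are left.

```python
import numpy as np

def batch_bounds(n_samples, batch_size):
    """Return inclusive (start, end) index pairs covering n_samples rows.

    Hypothetical helper, for illustration only.
    """
    bounds = []
    for start in range(0, n_samples, batch_size):
        # Clip the inclusive end so the last batch stays inside the data.
        end = min(start + batch_size, n_samples) - 1
        bounds.append((start, end))
    return bounds

print(batch_bounds(100, 50))  # [(0, 49), (50, 99)]
print(batch_bounds(100, 49))  # [(0, 48), (49, 97), (98, 99)]

# NumPy slices take an exclusive end and clip out-of-range bounds.
data = np.arange(100)
print(data[98:147])  # [98 99]
```

Note that Python slices use an exclusive end, so the inclusive pair `(98, 146)` corresponds to the slice `98:147`.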

Conveniently, if we still run that last iteration, the two remaining samples are returned and no error is raised, because NumPy slicing clips an out-of-range end index to the length of the array. For this to work, we must properly count the number of batches we need to output. The code below shows how to do it.

In [27]:
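def batch_generator(batch_size):
    # Reads the arrays X and y defined in the cell above.
    # Round up so that a final, smaller batch is included. IMPORTANT!
    num_batches = int(np.ceil(X.shape[0] / batch_size))
    start = 0
    for _ in range(num_batches):
        end = start + batch_size  # exclusive end; slicing clips it at the array length
        xb = X[start:end]
        yb = y[start:end]
        yield (xb, yb)
        start += batch_size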

In the code above we are taking the *ceiling* of the ratio of `X.shape[0]` and `batch_size`. Look at the examples below to convince yourself that this is the right thing to do:

In [23]:
np.ceil(100 / 50), np.ceil(100 / 49), np.ceil(100 / 51)

(2.0, 3.0, 2.0)

This is what we want. In the first case there are exactly two batches of 50 samples. In the second case there are two batches of 49 samples and one of 2 samples, and in the third case there are two batches, of 51 and 49 samples respectively. Let's go through these three cases with our function.

First case: `batch_size = 50`

In [29]:
batchgen = batch_generator(50)
for a, b in batchgen:
    print(a.shape, b.shape)

(50, 10) (50, 1)
(50, 10) (50, 1)


Second case: `batch_size = 49`

In [30]:
batchgen = batch_generator(49)
for a, b in batchgen:
    print(a.shape, b.shape)

(49, 10) (49, 1)
(49, 10) (49, 1)
(2, 10) (2, 1)


Third case: `batch_size = 51`

In [31]:
batchgen = batch_generator(51)
for a, b in batchgen:
    print(a.shape, b.shape)

(51, 10) (51, 1)
(49, 10) (49, 1)
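
The generator above reads `X` and `y` from the notebook's global scope. A variant that takes the arrays as arguments might look like the sketch below. The name `batch_generator_from`, its signature, and the arrays `X_demo` and `y_demo` are illustrative assumptions, and the integer expression `-(-n_samples // batch_size)` is swapped in for `np.ceil` to avoid the round trip through floats.

```python
import numpy as np

def batch_generator_from(X, y, batch_size):
    """Yield (X, y) mini-batches; the last batch may be smaller.

    Hypothetical variant of batch_generator that takes the arrays as arguments.
    """
    n_samples = X.shape[0]
    # Integer ceiling division: same value as int(np.ceil(n_samples / batch_size)).
    num_batches = -(-n_samples // batch_size)
    for i in range(num_batches):
        start = i * batch_size
        end = start + batch_size  # exclusive; the slice clips it at n_samples
        yield X[start:end], y[start:end]

X_demo = np.random.rand(100, 10)
y_demo = np.random.rand(100, 1)
sizes = [xb.shape[0] for xb, _ in batch_generator_from(X_demo, y_demo, 49)]
print(sizes)  # [49, 49, 2]
```

Passing the arrays explicitly makes the generator reusable for other datasets, such as a validation split.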


## What is an epoch (do I need to use `while True:`)?

An epoch is a full pass over the whole dataset. In our case this means running the for loop above in its entirety. There is no need for constructs like `while True:` to guarantee that the computation doesn't halt. However, this requires re-instantiating the batch generator at each epoch, because a generator object can be iterated over only once.

In [32]:
for epoch in range(3):
    batchgen = batch_generator(49)
    for a, b in batchgen:
        print(a.shape, b.shape)

(49, 10) (49, 1)
(49, 10) (49, 1)
(2, 10) (2, 1)
(49, 10) (49, 1)
(49, 10) (49, 1)
(2, 10) (2, 1)
(49, 10) (49, 1)
(49, 10) (49, 1)
(2, 10) (2, 1)
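
To see why the generator has to be re-created, note that a generator object is exhausted after one full pass, so iterating over it a second time yields nothing. A minimal self-contained sketch follows; the names `simple_batches` and `data` are illustrative and not part of the code above.

```python
def simple_batches(data, batch_size):
    """Yield consecutive slices of `data`; the last slice may be shorter."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

data = list(range(10))
gen = simple_batches(data, 4)

first_pass = [len(b) for b in gen]
second_pass = [len(b) for b in gen]  # the generator is already exhausted
print(first_pass)   # [4, 4, 2]
print(second_pass)  # []

# Calling the generator function again starts a fresh pass (one per epoch).
fresh_pass = [len(b) for b in simple_batches(data, 4)]
print(fresh_pass)   # [4, 4, 2]
```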
