<a href="https://colab.research.google.com/github/adholmgren/tensorflow_playground/blob/master/tf_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Copyright 2020, Andrew Holmgren

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

# A practical guide to the Dataset class in tensorflow
The tensorflow Dataset class can streamline your machine learning application, but I've often found that the documentation could benefit from more examples and more exposition. This notebook will hopefully serve as a guide, or at the very least a supplement, to the tensorflow documentation. Basic Python is required; intermediate Python knowledge is assumed in places, because some concepts are easiest to communicate that way, but it is not strictly necessary.

In [17]:
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  !pip install --quiet "tensorflow-gpu>=2.0.0"

In [36]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import tensorflow.keras as K
# tf.enable_eager_execution()

# Build dataset from arrays

If the data comes in as arrays it's pretty simple to get the data into a tensorflow dataset using the `from_tensor_slices` method.

In [37]:
arr = np.array([7, 2, 1, 6, 3, 5, 9])
dataset = tf.data.Dataset.from_tensor_slices(arr)

`from_tensor_slices` treats the first dimension of the array as the dimension that indexes instances. For example, in the last code block it consumed the (7,) array and generated 7 instances of ()-shaped Tensors. As another example, the following code will produce 2 instances of 2x2 arrays.

In [38]:
dataset_2D = tf.data.Dataset.from_tensor_slices(np.reshape(np.arange(2**3), (2, 2, 2)))
for elem in dataset_2D:
    print(elem)

tf.Tensor(
[[0 1]
 [2 3]], shape=(2, 2), dtype=int64)
tf.Tensor(
[[4 5]
 [6 7]], shape=(2, 2), dtype=int64)


In comparison, this code will give just one instance of a 2x2x2 array.

In [39]:
dataset_3D = tf.data.Dataset.from_tensor_slices(np.reshape(np.arange(2**3), (1, 2, 2, 2)))
for elem in dataset_3D:
    print(elem)

tf.Tensor(
[[[0 1]
  [2 3]]

 [[4 5]
  [6 7]]], shape=(2, 2, 2), dtype=int64)


**Test yourself**: build a dataset with 3 instances of a 3x3 array. Build 1 instance of a 3x10 array.

In [40]:
# 3 instances of 3x3 array
# code here

# 1 instance of a 3x10 array
# code here

Personally, I feel this next method is a bit redundant (I haven't looked into the source code, so there could be some slightly different optimization), but there's also a `from_tensors` method that creates a single instance. I think the context that makes the most sense for this method is testing at inference.

In [41]:
dataset_3D = tf.data.Dataset.from_tensors(np.reshape(np.arange(2**3), (2, 2, 2)))
for elem in dataset_3D:
    print(elem)

tf.Tensor(
[[[0 1]
  [2 3]]

 [[4 5]
  [6 7]]], shape=(2, 2, 2), dtype=int64)


You can see that this is the same thing as using  
```Python
dataset_3D = tf.data.Dataset.from_tensor_slices(np.reshape(np.arange(2**3), (1, 2, 2, 2)))
```  
Basically, if your array is a single instance this is a way for the Dataset class to consume the single instance. 
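
A quick way to convince yourself of this equivalence is to compare the `element_spec` and the contents of the two datasets. The sketch below assumes TensorFlow 2.x with eager execution; the names `ds_single` and `ds_sliced` are only illustrative.

```python
import numpy as np
import tensorflow as tf

arr = np.reshape(np.arange(2**3), (2, 2, 2))

# from_tensors: the whole array becomes one element
ds_single = tf.data.Dataset.from_tensors(arr)
# from_tensor_slices: add a leading axis of length 1, then slice along it
ds_sliced = tf.data.Dataset.from_tensor_slices(arr[np.newaxis])

print(ds_single.element_spec)
print(ds_sliced.element_spec)

single_elems = list(ds_single.as_numpy_iterator())
sliced_elems = list(ds_sliced.as_numpy_iterator())
print(len(single_elems), len(sliced_elems))
```

Both datasets hold exactly one element with the same shape and values, so the choice between them is mostly a matter of which reads better in context.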

# Using methods in the dataset class

### Dataset as an iterator

Just like most Python objects, the dataset class has an iterator. In fact, the fundamental point of the dataset class is that its very essence is to be an iterator. All the methods within the dataset class either instantiate the iterator (e.g. `from_tensor_slices`), modify the iterator (e.g. `map`), or control the iterator (e.g. `batch`). The Dataset class gives users flexibility to control the memory and processing in flowing data inputs to Tensorflow's machine learning models. Most people can, and probably want to, stop there -- the dataset is fundamentally an iterator similar to Python's range, or numpy arrays, or a thousand other Python objects that iterate.
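
As a small illustration of those three roles, the following sketch (assuming TensorFlow 2.x in eager mode; the variable names are illustrative) instantiates an iterator with `from_tensor_slices`, modifies it with `map`, and controls it with `batch`.

```python
import numpy as np
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices(np.arange(6))  # instantiate: elements 0..5
ds = ds.map(lambda x: x * 2)                           # modify: double each element
ds = ds.batch(4)                                       # control: group into batches of up to 4

batches = [batch.tolist() for batch in ds.as_numpy_iterator()]
print(batches)  # [[0, 2, 4, 6], [8, 10]]
```

Each call returns a new dataset, so the steps can be chained in whatever order the pipeline needs.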

For anyone interested in peeking a bit more into the nitty-gritty, read on. If you're not familiar with Python iterators and how they're built, here's a pretty good guide ([link](https://www.ics.uci.edu/~pattis/ICS-33/lectures/iterators.txt)). To really find out what's happening with the dataset iterator you have to go to the source code. The first thing the source code tells you is that the iterator implements the Python iterator protocol and therefore can only be used in eager mode. This comment has the potential to cause confusion, because the Dataset class can also be used with a static graph; it just functions differently. An important consequence of eager vs. static execution is that in a static graph the contents of NumPy arrays are embedded into the graph as constants (potentially taking up a lot of memory and running into byte limits in graph serialization).
```Python
def __iter__(self):
    """Creates an `Iterator` for enumerating the elements of this dataset.
    The returned iterator implements the Python iterator protocol and therefore
    can only be used in eager mode.
    Returns:
      An `Iterator` over the elements of this dataset.
    Raises:
      RuntimeError: If not inside of tf.function and not executing eagerly.
    """
    if (context.executing_eagerly()
        or ops.get_default_graph()._building_function):  # pylint: disable=protected-access
      return iterator_ops.OwnedIterator(self)
    else:
      raise RuntimeError("__iter__() is only supported inside of tf.function "
                         "or when eager execution is enabled.")
```
We then see that it uses the tensorflow `iterator_ops` module for its iterator, so we can go to [that source code](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/data/ops/iterator_ops.py). Its main elements are `_create_iterator` and `_next_internal`, which are effectively the `__iter__` and `__next__` methods for the class. Those functions have some more nitty-gritty of their own, but suffice it to say that you can trace them far enough to reassure yourself that the iterator is moving through nests of the Tensor class. The gist of `_create_iterator` is that it does a fair amount of bookkeeping so you don't break it, but ultimately it assigns pieces that fundamentally want to work with tensors. `_next_internal` checks whether the backend is in eager execution or static graph execution. In eager execution, where you'll most easily see the results, the next method is primarily concerned with returning tensor elements. There's a lot more tracking and integration with the core than described here, but hopefully this at least provides a shallow insight into the class.
```Python
def _create_iterator(self, dataset):
    # pylint: disable=protected-access
    dataset = dataset._apply_options()

    # Store dataset reference to ensure that dataset is alive when this iterator
    # is being used. For example, `tf.data.Dataset.from_generator` registers
    # a few py_funcs that are needed in `self._next_internal`.  If the dataset
    # is deleted, this iterator crashes on `self.__next__(...)` call.
    self._dataset = dataset

    ds_variant = dataset._variant_tensor
    self._element_spec = dataset.element_spec
    self._flat_output_types = structure.get_flat_tensor_types(
        self._element_spec)
    self._flat_output_shapes = structure.get_flat_tensor_shapes(
        self._element_spec)
    with ops.colocate_with(ds_variant):
      self._iterator_resource, self._deleter = (
          gen_dataset_ops.anonymous_iterator_v2(
              output_types=self._flat_output_types,
              output_shapes=self._flat_output_shapes))
      gen_dataset_ops.make_iterator(ds_variant, self._iterator_resource)
      # Delete the resource when this object is deleted
      self._resource_deleter = IteratorResourceDeleter(
          handle=self._iterator_resource,
          device=self._device,
          deleter=self._deleter)

def _next_internal(self):
    """Returns a nested structure of `tf.Tensor`s containing the next element.
    """
    if not context.executing_eagerly():
      with ops.device(self._device):
        ret = gen_dataset_ops.iterator_get_next(
            self._iterator_resource,
            output_types=self._flat_output_types,
            output_shapes=self._flat_output_shapes)
      return structure.from_compatible_tensor_list(self._element_spec, ret)

    # This runs in sync mode as iterators use an error status to communicate
    # that there is no more data to iterate over.
    # TODO(b/77291417): Fix
    with context.execution_mode(context.SYNC):
      with ops.device(self._device):
        # TODO(ashankar): Consider removing this ops.device() contextmanager
        # and instead mimic ops placement in graphs: Operations on resource
        # handles execute on the same device as where the resource is placed.
        ret = gen_dataset_ops.iterator_get_next(
            self._iterator_resource,
            output_types=self._flat_output_types,
            output_shapes=self._flat_output_shapes)

      try:
        # Fast path for the case `self._structure` is not a nested structure.
        return self._element_spec._from_compatible_tensor_list(ret)  # pylint: disable=protected-access
      except AttributeError:
        return structure.from_compatible_tensor_list(self._element_spec, ret)
```
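
If the iterator protocol itself is unfamiliar, a toy version may help. The classes below are purely hypothetical stand-ins (`ToyDataset` and `ToyIterator` are not TensorFlow names); they only mirror the overall shape of the source above, where `__iter__` hands back a fresh iterator object that keeps a reference to its dataset and `__next__` signals exhaustion with `StopIteration`.

```python
class ToyIterator:
    """Hypothetical stand-in for an iterator owned by a ToyDataset."""

    def __init__(self, dataset):
        # Keep a reference to the dataset so it stays alive while iterating
        self._dataset = dataset
        self._position = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self._position >= len(self._dataset.elements):
            raise StopIteration  # tells a for loop there is no more data
        elem = self._dataset.elements[self._position]
        self._position += 1
        return elem


class ToyDataset:
    """Hypothetical stand-in for a dataset built from a list of elements."""

    def __init__(self, elements):
        self.elements = list(elements)

    def __iter__(self):
        # Every call creates a new, independent iterator
        return ToyIterator(self)


toy = ToyDataset([7, 2, 1])
first_pass = list(toy)
second_pass = list(toy)  # iterating again starts from the beginning
print(first_pass, second_pass)
```

Because `__iter__` builds a new iterator on every call, the toy dataset can be looped over repeatedly, which is also how a `Dataset` behaves in a for loop.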

Here's the most common way a Dataset iterator is used.

In [42]:
# create dataset
arr = np.array([7, 2, 1, 6, 3, 5, 9])
dataset = tf.data.Dataset.from_tensor_slices(arr)

In [43]:
# loop through dataset elements
for elem in dataset:
    print(elem)

tf.Tensor(7, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(9, shape=(), dtype=int64)


If you only want a certain number of elements, you can even put a conditional break in the loop.

In [44]:
count = 0
n_elem = 3
for elem in dataset:
    print(elem)
    count += 1
    if count >= n_elem:
        break

tf.Tensor(7, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)


Another way to see just the first few elements is to work with the bare iterator.

In [45]:
ds_iter = iter(dataset)
print(f'first element: {next(ds_iter)}')
print(f'second element: {next(ds_iter)}')
print(f'third element: {next(ds_iter)}')
del ds_iter

first element: 7
second element: 2
third element: 1


If you want to know what the for loop is really doing, the functionally equivalent operation is written out below. The key is that the for loop runs until it hits the `StopIteration` exception raised by the iterator's `__next__` method.

In [46]:
ds_iter = iter(dataset)
try:
    while True:
        elem = next(ds_iter)
        print(elem)
except StopIteration:
    pass
finally:
    del ds_iter

tf.Tensor(7, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(9, shape=(), dtype=int64)


Not as concise, is it?

## Take

Take, grab, give: this could have been called a lot of things but, alas, it is called `take`. This method takes a count (integer) and creates a new dataset that iterates through at most count elements. To reiterate, `take` builds a new dataset; it does not extract elements from the original. Another tip worth noting is that the count can exceed the size of the dataset, in which case iteration simply stops at the last element. Similarly, a negative count takes the entire dataset (the documentation specifies -1 for this; the example below shows -8 behaving the same way).

In [84]:
arr = np.array([7, 2, 1, 6, 3, 5, 9])
dataset = tf.data.Dataset.from_tensor_slices(arr)

In [85]:
ds_take = dataset.take(3)
print(list(ds_take.as_numpy_iterator()), "\n")
ds_take_more = dataset.take(1000)
print("take over count: ", list(ds_take_more.as_numpy_iterator()))
ds_take_neg = dataset.take(-8)
print("take negative count (take all): ", list(ds_take_neg.as_numpy_iterator()))
ds_take_none = dataset.take(0)
print("take zero count (none): ", list(ds_take_none.as_numpy_iterator()))

[7, 2, 1] 

take over count:  [7, 2, 1, 6, 3, 5, 9]
take negative count (take all):  [7, 2, 1, 6, 3, 5, 9]
take zero count (none):  []
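
To see that `take` builds a new dataset rather than pulling elements out of the original, the sketch below (assuming TensorFlow 2.x in eager mode; `ds_range` and `ds_first` are illustrative names) iterates over the original dataset after calling `take`.

```python
import tensorflow as tf

ds_range = tf.data.Dataset.range(5)  # 0, 1, 2, 3, 4
ds_first = ds_range.take(2)          # a new dataset; ds_range is untouched

first_two = [int(x) for x in ds_first.as_numpy_iterator()]
original = [int(x) for x in ds_range.as_numpy_iterator()]
print(first_two)  # [0, 1]
print(original)   # [0, 1, 2, 3, 4]
```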


A decent analogy for `take` would be a wrapper on a generator with a stop condition that makes the first generator stop earlier if the count is reached and keep going if not.  Since the [true class](https://github.com/tensorflow/tensorflow/blob/e26286914b6d3cf7ed0c9a47f50c07391c2174a6/tensorflow/core/kernels/data/take_dataset_op.cc) is defined in C++, this analogy is merely illustrative.

In [49]:
# Similar concept to Dataset.take
def foo_generator():  # base iterator like Dataset
    for i in range(10):
        yield i 

def foo_take(count):
    counter = 0
    pity_the_foo = foo_generator()
    run_all = count < 0
    while True:
        try:
            if (counter >= count) and not run_all:
                raise StopIteration
            yield next(pity_the_foo)
            counter += 1
        except StopIteration:
            break

In [50]:
print(list(foo_take(5)))
print(list(foo_take(1000)))
print(list(foo_take(-1)))
print(list(foo_take(0)))

[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[]
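
For non-negative counts, Python's standard library already offers this behavior through `itertools.islice`, which stops early when the count is reached and otherwise runs to the end of the underlying iterator. The snippet below is just a comparison with the analogy above, not a statement about how TensorFlow implements `take`; unlike `take`, `islice` rejects negative counts.

```python
from itertools import islice

def count_to_ten():  # plays the same role as foo_generator above
    for i in range(10):
        yield i

first_five = list(islice(count_to_ten(), 5))
over_count = list(islice(count_to_ten(), 1000))
none_taken = list(islice(count_to_ten(), 0))
print(first_five)
print(over_count)
print(none_taken)
```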


## Map dataset

In [54]:
arr = np.array([7, 2, 1, 6, 3, 5, 9])
dataset = tf.data.Dataset.from_tensor_slices(arr)

You can use map to apply a function to the dataset. For example, if you want to add 1 to every element of the dataset you could do

In [86]:
def add_one(x):  # note: easily done anonymously lambda x: x + 1
    return x + 1 

dp1 = dataset.map(add_one)
print("dataset ", "mapped dataset")
for ds_elem, elem in zip(dataset, dp1):
    print(ds_elem.numpy(), "\t", elem.numpy())

dataset  mapped dataset
7 	 8
2 	 3
1 	 2
6 	 7
3 	 4
5 	 6
9 	 10


As another example of a mapping function, we can make a function that one-hot encodes values corresponding to a class index.

In [87]:
def one_hot_encode(x, max_index):
    tensor_zeros = tf.zeros(max_index, dtype=tf.int32)
    tensor_ones = tf.ones(max_index, dtype=tf.int32)
    tensor_range = tf.range(max_index, dtype=tf.int32)
    # dataset defaulted to int64 type, hence the casting, types may vary by context
    one_hot_array = tf.where(tensor_range == tf.cast(x, tf.int32), tensor_ones, tensor_zeros)
    return one_hot_array

ds_one_hot = dataset.map(lambda x: one_hot_encode(x, tf.constant(10, dtype=tf.int32)))
for elem in ds_one_hot:
    print(elem)

tf.Tensor([0 0 0 0 0 0 0 1 0 0], shape=(10,), dtype=int32)
tf.Tensor([0 0 1 0 0 0 0 0 0 0], shape=(10,), dtype=int32)
tf.Tensor([0 1 0 0 0 0 0 0 0 0], shape=(10,), dtype=int32)
tf.Tensor([0 0 0 0 0 0 1 0 0 0], shape=(10,), dtype=int32)
tf.Tensor([0 0 0 1 0 0 0 0 0 0], shape=(10,), dtype=int32)
tf.Tensor([0 0 0 0 0 1 0 0 0 0], shape=(10,), dtype=int32)
tf.Tensor([0 0 0 0 0 0 0 0 0 1], shape=(10,), dtype=int32)


Alternatively, rather than manually writing a one-hot encoding function, you could just use tensorflow's built-in `one_hot` function.

In [57]:
ds_one_hot = dataset.map(lambda x: tf.one_hot(x, tf.constant(10, dtype=tf.int32), dtype=tf.int32))
for elem, one_hot_elem in zip(dataset, ds_one_hot):
    print(f'category index {elem.numpy()}, one_hot_encoding {one_hot_elem.numpy()}')

category index 7, one_hot_encoding [0 0 0 0 0 0 0 1 0 0]
category index 2, one_hot_encoding [0 0 1 0 0 0 0 0 0 0]
category index 1, one_hot_encoding [0 1 0 0 0 0 0 0 0 0]
category index 6, one_hot_encoding [0 0 0 0 0 0 1 0 0 0]
category index 3, one_hot_encoding [0 0 0 1 0 0 0 0 0 0]
category index 5, one_hot_encoding [0 0 0 0 0 1 0 0 0 0]
category index 9, one_hot_encoding [0 0 0 0 0 0 0 0 0 1]


To really do the one-hot encoding to death, note that the tensors could also return as a sparse type.

In [58]:
def one_hot_encode(x, max_index):
    return tf.sparse.SparseTensor(indices=[[x]], values=[tf.cast(1, dtype=tf.int64)], dense_shape=[max_index])

ds_one_hot = dataset.map(lambda x: one_hot_encode(x, tf.constant(10, dtype=tf.int64)))
for elem in ds_one_hot:
    print(elem)
    print(tf.sparse.to_dense(elem))

SparseTensor(indices=tf.Tensor([[7]], shape=(1, 1), dtype=int64), values=tf.Tensor([1], shape=(1,), dtype=int64), dense_shape=tf.Tensor([10], shape=(1,), dtype=int64))
tf.Tensor([0 0 0 0 0 0 0 1 0 0], shape=(10,), dtype=int64)
SparseTensor(indices=tf.Tensor([[2]], shape=(1, 1), dtype=int64), values=tf.Tensor([1], shape=(1,), dtype=int64), dense_shape=tf.Tensor([10], shape=(1,), dtype=int64))
tf.Tensor([0 0 1 0 0 0 0 0 0 0], shape=(10,), dtype=int64)
SparseTensor(indices=tf.Tensor([[1]], shape=(1, 1), dtype=int64), values=tf.Tensor([1], shape=(1,), dtype=int64), dense_shape=tf.Tensor([10], shape=(1,), dtype=int64))
tf.Tensor([0 1 0 0 0 0 0 0 0 0], shape=(10,), dtype=int64)
SparseTensor(indices=tf.Tensor([[6]], shape=(1, 1), dtype=int64), values=tf.Tensor([1], shape=(1,), dtype=int64), dense_shape=tf.Tensor([10], shape=(1,), dtype=int64))
tf.Tensor([0 0 0 0 0 0 1 0 0 0], shape=(10,), dtype=int64)
SparseTensor(indices=tf.Tensor([[3]], shape=(1, 1), dtype=int64), values=tf.Tensor([1], shap

## Filter dataset

Both map and filter apply a function to each element in a dataset. Whereas map uses a function to alter the dataset elements in some way, filter uses a function that returns a boolean and drops the elements that fail the test. The filter method is good for removing bad dataset inputs.
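
In plain Python terms, the difference can be sketched with two generators. The helpers `toy_map` and `toy_filter` are hypothetical names used only for this illustration; they mimic the behavior described here and say nothing about TensorFlow's internals.

```python
def toy_map(func, elements):
    # every element comes out, but altered by func
    for elem in elements:
        yield func(elem)

def toy_filter(predicate, elements):
    # elements come out unaltered, but only if predicate is true for them
    for elem in elements:
        if predicate(elem):
            yield elem

data = [7, 2, 1, 6, 3, 5, 9]
mapped = list(toy_map(lambda x: x + 1, data))
filtered = list(toy_filter(lambda x: x > 4, data))
print(mapped)    # [8, 3, 2, 7, 4, 6, 10]
print(filtered)  # [7, 6, 5, 9]
```

The mapped result has the same length as the input, while the filtered result is shorter but contains only original values.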

Let's imagine a dataset of images that vary in length and width. Most pretrained networks are trained for images of sizes 299x299, 256x256, or 224x224. So if images are coming in the general size of 300x300, some more (say 350x350) and some less (say 250x250) then for the most part I can safely crop down to the right size. However, if the data augmentation involves rotation or other affine transforms then extra support around the input size will be desired, and as such some images may need to get thrown out. The following example goes through such a case.

In [88]:
# generate random array of random size
def gen_series(n_samps):
  i = 0
  while True:
    rand_shape = np.random.randint(2, 5)
    # yield the size explicitly: inside filter the array arrives as a symbolic tensor with no static shape
    yield (rand_shape, np.random.random(size=(rand_shape, rand_shape)))
    i += 1
    if i > n_samps:
        break


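Now set the random seed for reproducibility and draw some random numpy arrays. Note that because the counter is checked with `i > n_samps` after each yield, `gen_series(5)` actually produces 6 arrays.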

In [89]:
np.random.seed(314)
for elem in gen_series(5):
    print(elem)

(2, array([[0.97120896, 0.481791  ],
       [0.9738772 , 0.59946984]]))
(4, array([[0.82735501, 0.72795148, 0.26048042, 0.9117634 ],
       [0.26075656, 0.76637602, 0.26153114, 0.12229137],
       [0.38600554, 0.84008124, 0.27817936, 0.06991369],
       [0.63310965, 0.58476603, 0.58123194, 0.6772054 ]]))
(4, array([[0.39143885, 0.16435973, 0.43433933, 0.74557941],
       [0.97003736, 0.35446608, 0.49190316, 0.30551103],
       [0.44273468, 0.38317651, 0.57375445, 0.5094681 ],
       [0.32474148, 0.46083002, 0.00804761, 0.45918614]]))
(4, array([[0.40695651, 0.17784994, 0.90925204, 0.545331  ],
       [0.1004968 , 0.71872059, 0.97842935, 0.3097757 ],
       [0.26012577, 0.66289961, 0.13971997, 0.08372171],
       [0.52679728, 0.6102353 , 0.86738912, 0.14893502]]))
(3, array([[0.27418571, 0.40196772, 0.16730927],
       [0.45452528, 0.84794886, 0.45265904],
       [0.85438408, 0.04457095, 0.43005191]]))
(3, array([[0.98196678, 0.04156222, 0.09930461],
       [0.22048275, 0.66002216, 0.99

Turn the arrays into a dataset

In [90]:
np.random.seed(314)
dataset_shapes = tf.data.Dataset.from_generator(gen_series,
                                                output_types=(tf.int32, tf.float32),
                                                args = [5]
                                                )

Filter out any arrays with a side length less than 3 (or, equivalently, keep arrays with a side length greater than 2).

In [91]:
def filter_shape(shape, arr):
    return shape > 2

np.random.seed(314)
ds_filter = dataset_shapes.filter(filter_shape)
list(ds_filter.as_numpy_iterator())

[(4, array([[0.827355  , 0.72795147, 0.26048043, 0.91176337],
         [0.26075655, 0.766376  , 0.26153114, 0.12229137],
         [0.38600555, 0.8400812 , 0.27817935, 0.06991369],
         [0.6331096 , 0.58476603, 0.58123195, 0.6772054 ]], dtype=float32)),
 (4, array([[0.39143884, 0.16435973, 0.4343393 , 0.7455794 ],
         [0.97003734, 0.35446608, 0.49190316, 0.30551103],
         [0.4427347 , 0.3831765 , 0.57375443, 0.5094681 ],
         [0.32474148, 0.46083003, 0.00804761, 0.45918614]], dtype=float32)),
 (4, array([[0.40695652, 0.17784993, 0.90925205, 0.545331  ],
         [0.1004968 , 0.7187206 , 0.9784294 , 0.3097757 ],
         [0.2601258 , 0.6628996 , 0.13971996, 0.08372171],
         [0.5267973 , 0.6102353 , 0.86738914, 0.14893502]], dtype=float32)),
 (3, array([[0.27418572, 0.40196773, 0.16730927],
         [0.4545253 , 0.84794885, 0.45265904],
         [0.85438406, 0.04457095, 0.43005192]], dtype=float32)),
 (3, array([[0.9819668 , 0.04156223, 0.09930461],
         [0.22048

Another case where a filter could be helpful is when a value outside some range tells you the data is corrupt (maybe a failed CRC check, or just a physical constraint). For example, you might know it's impossible for a speedometer to register a speed less than 0.

In [65]:
np.random.seed(272)
dataset_threshold = tf.data.Dataset.from_tensor_slices(np.random.rand(10)).map(lambda x: x - 0.3)
list(dataset_threshold.as_numpy_iterator())

[-0.05227946474140316,
 0.5785707195716125,
 0.6378979597461731,
 0.006009622292020567,
 0.16534980848240438,
 -0.12634543430508632,
 -0.0776061455831683,
 -0.017021838344000118,
 0.317895438599268,
 -0.1027123945757819]

Keep any values that are above 0

In [66]:
list(dataset_threshold.filter(lambda x: x > 0).as_numpy_iterator())

[0.5785707195716125,
 0.6378979597461731,
 0.006009622292020567,
 0.16534980848240438,
 0.317895438599268]

Try it out for yourself. Make an array with normally distributed data and filter out any events more than 2*sigma from the mean.

## Interleave

Building on the concepts of `map`, `interleave` is also a `map` operation. What is different, though, is that the function passed to `interleave` returns a dataset. Whereas `map` operates on tensor elements and typically involves processing those elements, `interleave` turns each element into a dataset of its own and mixes the resulting datasets together, which typically means creating more samples. In this sense interleave often builds on a map.

Personally, I found this function was really difficult to wrap my head around.

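Interleave has two main controls, `cycle_length` and `block_length`.   
  * `cycle_length` is the number of input elements whose datasets are open and cycled through at the same time.
  * `block_length` is the number of consecutive elements taken from each of those datasets before moving on to the next one in the cycle.

When one of the open datasets runs out, its place in the cycle is taken by the dataset for the next input element.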
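
The interplay of the two parameters is easier to see in a small pure-Python model. `toy_interleave` is a hypothetical helper written for this guide, not TensorFlow code; it assumes the documented behavior for the deterministic, single-threaded case: keep up to `cycle_length` sub-iterators open, visit them round-robin, emit up to `block_length` elements per visit, and refill an exhausted slot with the next input element on a later visit.

```python
import itertools

_END = object()  # sentinel marking an exhausted input

def toy_interleave(inputs, make_sub, cycle_length, block_length):
    """Illustrative model of interleave; not the TensorFlow implementation."""
    source = iter(inputs)
    slots = [None] * cycle_length  # open sub-iterators; None marks an empty slot
    source_done = False
    while not (source_done and all(slot is None for slot in slots)):
        for i in range(cycle_length):
            if slots[i] is None and not source_done:
                # empty slot: open a sub-iterator for the next input element
                nxt = next(source, _END)
                if nxt is _END:
                    source_done = True
                else:
                    slots[i] = iter(make_sub(nxt))
            if slots[i] is None:
                continue
            # emit one block from this slot, then move on to the next slot
            block = list(itertools.islice(slots[i], block_length))
            if len(block) < block_length:
                slots[i] = None  # sub-iterator ran out; refill on a later visit
            yield from block

result = list(toy_interleave(range(1, 6), lambda x: [x] * 6,
                             cycle_length=3, block_length=4))
print(result)
```

With `cycle_length=3` and `block_length=4` on the inputs 1 through 5, each repeated 6 times, the model emits four 1s, four 2s, four 3s, then the two leftover elements of each, and only then opens the sub-iterators for 4 and 5. This is the same order that the `interleave` example in this section prints.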

In [83]:
dataset = tf.data.Dataset.range(1, 6)  # ==> [ 1, 2, 3, 4, 5 ]
# cycle_length=3 keeps three sub-datasets open; block_length=4 reads four elements from each in turn
dataset = dataset.interleave(
    lambda x: tf.data.Dataset.from_tensors(x).repeat(6),
    cycle_length=3, block_length=4)
list(dataset.as_numpy_iterator())


[1,
 1,
 1,
 1,
 2,
 2,
 2,
 2,
 3,
 3,
 3,
 3,
 1,
 1,
 2,
 2,
 3,
 3,
 4,
 4,
 4,
 4,
 5,
 5,
 5,
 5,
 4,
 4,
 5,
 5]

In [77]:
base_arr = [['a'], ['b']]
num_array = [np.arange(4), np.arange(6, 10)]

In [78]:
transform = lambda x: -x
dataset = tf.data.Dataset.from_tensor_slices(base_arr)
print(list(dataset.as_numpy_iterator()))

[array([b'a'], dtype=object), array([b'b'], dtype=object)]


In [79]:
interleaved_set = dataset.interleave(lambda x: tf.data.Dataset.from_tensor_slices(num_array),
                                     cycle_length=1,
                                     block_length=1,
                                     )
list(interleaved_set.as_numpy_iterator())

ValueError: ignored

## Shuffle

Shuffle is pretty much what you imagine, it takes your dataset and will spit out a random permutation of your dataset. The main nuance to the shuffle function is the buffer size argument. That being said, the buffer size argument is itself pretty nuanced.

Take the following example: there's a dataset with the numbers 0 through 9.  
With a buffer size of 2, the first shuffle picks from  
[0, 1]  
let's say that it picks 1, now the buffer has  
[0, 2]  
let's say that it picks 2 this time around, now the buffer has  
[0, 3]  
and the shuffled array, to date, has  
[1, 2]  
elements. As such, the position at which 0 finally lands follows a geometric distribution. On the other end, 9 will only ever be in the 8th or 9th index (zero-indexing). More generally, for a shuffle size of 2, a value originally in the nth index can appear no earlier than the n-1 index. Even more generally, for a shuffle size of s, a value originally in the nth index (zero-indexing) can appear no earlier than the max(0, n-s+1) index, e.g. with a shuffle size of 10 the 10th element (index 9) can appear as early as index 9-10+1=0.
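
The bound on how early an element can appear is easy to check with a small pure-Python model of a shuffle buffer. `toy_shuffle` is a hypothetical helper, not TensorFlow's implementation; it assumes the scheme walked through in this section: fill a buffer of `buffer_size` elements, emit one at random, and replace it with the next input element.

```python
import random

def toy_shuffle(items, buffer_size, rng):
    """Illustrative shuffle-buffer model; not the TensorFlow implementation."""
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) == buffer_size:
            # buffer is full: emit a random element, freeing one slot
            yield buffer.pop(rng.randrange(len(buffer)))
    while buffer:
        # input is exhausted: drain what is left in random order
        yield buffer.pop(rng.randrange(len(buffer)))

buffer_size = 2
bound_holds = True
for seed in range(200):
    shuffled = list(toy_shuffle(range(10), buffer_size, random.Random(seed)))
    for position, value in enumerate(shuffled):
        # each value started at index `value`; check the earliest allowed position
        if position < max(0, value - buffer_size + 1):
            bound_holds = False
print(bound_holds)
```

No seed can violate the bound, because the element emitted at position p is always drawn before input element p + `buffer_size` has entered the buffer.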

In [94]:
ds = tf.data.Dataset.from_tensor_slices(np.arange(10))
print([x.numpy() for x in ds])
ds_shuffle = ds.shuffle(2, reshuffle_each_iteration=True)
for _ in range(20):
    print([x.numpy() for x in ds_shuffle])

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[1, 0, 2, 3, 5, 4, 6, 8, 9, 7]
[1, 2, 0, 3, 4, 5, 7, 6, 8, 9]
[0, 1, 3, 2, 5, 4, 6, 8, 7, 9]
[1, 0, 2, 3, 5, 4, 6, 7, 9, 8]
[1, 0, 2, 4, 3, 5, 7, 6, 9, 8]
[0, 2, 3, 1, 4, 5, 6, 7, 9, 8]
[0, 1, 2, 4, 5, 6, 7, 8, 9, 3]
[1, 2, 3, 4, 0, 5, 6, 8, 7, 9]
[0, 2, 3, 4, 5, 1, 7, 8, 6, 9]
[0, 2, 1, 3, 5, 6, 4, 8, 9, 7]
[0, 1, 2, 3, 5, 6, 7, 4, 9, 8]
[1, 2, 3, 0, 4, 6, 7, 5, 9, 8]
[0, 2, 1, 3, 4, 6, 5, 7, 9, 8]
[1, 0, 3, 4, 5, 6, 7, 2, 8, 9]
[1, 2, 0, 4, 5, 6, 7, 3, 8, 9]
[1, 2, 0, 3, 5, 4, 7, 6, 9, 8]
[1, 0, 3, 2, 4, 6, 5, 8, 9, 7]
[1, 2, 0, 3, 5, 6, 7, 4, 8, 9]
[0, 2, 1, 4, 5, 6, 3, 7, 8, 9]
[0, 2, 3, 1, 5, 6, 4, 7, 9, 8]


Another example of shuffle, but with a tuple input

In [93]:
rand_ds = tf.data.Dataset.from_tensor_slices((np.random.randn(10, 5), np.arange(10)))
for arr, label in rand_ds:
  print(arr.numpy(), '\t', label.numpy())

[-0.17038979 -1.01244879  0.17762838 -1.24809581  0.36346598] 	 0
[ 1.48327704 -0.28145836  0.41544126 -0.54829399  0.24367764] 	 1
[-0.37245552  0.86221593 -1.40354516  1.41368913 -0.44068334] 	 2
[ 1.20110936 -1.47307629 -0.15802795 -0.67009457 -1.02665006] 	 3
[ 1.2861011   0.15637034 -0.83163134  1.89373309  1.79107898] 	 4
[ 0.09053548  0.05684375 -0.14790181 -0.85165972 -0.60498225] 	 5
[ 0.86458806 -2.2202643   0.32900089 -1.29260993  0.85133427] 	 6
[-1.33570526  0.21023918 -0.5957493   0.23565526  0.7347443 ] 	 7
[ 1.01661854  0.089515   -1.06593683  0.42040595  0.18637991] 	 8
[-1.53374496 -0.08477672  0.01286809  0.6599574   0.63874449] 	 9


In [None]:
for i in range(10):
  ds_iter = iter(rand_ds.shuffle(2, reshuffle_each_iteration=True))
  arr, label = next(ds_iter)
  print(arr.numpy(), '\t', label.numpy())
  arr, label = next(ds_iter)
  print(arr.numpy(), '\t', label.numpy())
  print('\n')

# Datasets with x, y

In [None]:
train, test = tf.keras.datasets.fashion_mnist.load_data()


In [None]:
images, labels = train
images = images/255.0
labels = labels.astype(np.int32)

In [None]:
mnist_ds = tf.data.Dataset.from_tensor_slices((images, labels)).shuffle(5000).batch(32)

In [None]:
mnist_ds

In [None]:
images_reshaped = images.reshape((60000, -1))

In [None]:
type(images_reshaped)

In [None]:
images_reshaped.shape

In [None]:
ds_linear = tf.data.Dataset.from_tensor_slices((images_reshaped))

In [None]:
mnist_ds_linear = ds_linear.shuffle(10).batch(1)

In [None]:
mnist_ds_linear

In [None]:
for num_now in ds_linear.take(2):
  plt.figure(); plt.imshow(np.reshape(num_now, (28, 28)), cmap='gray')

In [None]:
for num_now in ds_linear.shuffle(5000).take(2):
  plt.figure(); plt.imshow(np.reshape(num_now, (28, 28)), cmap='gray')

In [None]:
fmnist_train_ds = tf.data.Dataset.from_tensor_slices((images, labels))
fmnist_train_ds = fmnist_train_ds.shuffle(5000).batch(32)

In [None]:
model = tf.keras.Sequential([
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(10)
])

In [None]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
              metrics=['accuracy'])


In [None]:
model.fit(fmnist_train_ds, epochs=10)

In [None]:
result = model.predict(fmnist_train_ds, steps = 1)

In [None]:
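test_images, test_labels = test[0] / 255.0, test[1].astype(np.int32)  # use the held-out split
test_ds = tf.data.Dataset.from_tensor_slices((test_images, test_labels))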

In [None]:
x, y = list(zip(*list(test_ds.batch(5).take(1).as_numpy_iterator())))

In [None]:
y

In [None]:
x, y = list(zip(*list(test_ds.batch(5).take(1).as_numpy_iterator())))
y_predict_all = model.predict(test_ds.batch(5), steps=1)
y_top = np.argmax(y_predict_all, axis=1)
print(y, y_top)

In [None]:
for image_now in x[0]:
    plt.figure()
    plt.imshow(image_now.squeeze(), cmap='gray')

In [None]:
x_lin = tf.range(0, 100, dtype=tf.float32)
y_lin = tf.range(0, 100, dtype=tf.float32)

In [None]:
x_ds = tf.data.Dataset.from_tensor_slices(x_lin)
y_ds = tf.data.Dataset.from_tensor_slices(y_lin)

In [None]:
def add_noise(x):
    return tf.random.normal(x.shape, mean=x)

In [None]:
x_noisy = x_ds.map(add_noise)

In [None]:
xy_ds = tf.data.Dataset.zip((x_noisy, y_ds))

In [None]:
model = tf.keras.Sequential([
     tf.keras.layers.Dense(10, input_shape=[1]),
     tf.keras.layers.Dense(1)]
)

In [None]:
model.compile(loss="mse")

In [None]:
xy_ds.take(1)

In [None]:
model.fit(xy_ds.batch(5).repeat(), epochs=100, steps_per_epoch=10)

In [None]:
model.predict(xy_ds.batch(1).take(1), steps=1)

# Numpy (python) wrap

Sometimes it will be easier to repurpose previously written Python code, most likely a function using numpy or scipy, than to rewrite it with tensorflow methods. The dataset objects are fundamentally built for graph mode, even if there are ways to expose pieces with eager execution, and as such datasets want to operate on tensors. Wrapping the Python function with `tf.numpy_function` lets it be called from inside a `map`.

In [None]:
def add_random_noise(x, mu=0.0, std=1.0):
    if isinstance(x, (float, np.floating)):  # np.float was removed in NumPy 1.24
        x_noisy = x + std * np.random.randn(1) + mu
    else:
        x_noisy = x + std * np.random.randn(*x.shape) + mu
    return np.float32(x_noisy)

In [None]:
arr = np.array([7, 2, 1, 6, 3, 5, 9], dtype=np.float32)
dataset = tf.data.Dataset.from_tensor_slices(arr)

In [None]:
@tf.function(input_signature=[tf.TensorSpec(None, tf.float32)]) 
def tf_add_random(input):
    mu = 0.
    sigma = 1.
    np_random = lambda x: add_random_noise(x, mu, sigma)
    y = tf.numpy_function(np_random, [input], tf.float32) 
    return y

In [None]:
list(dataset.map(tf_add_random).as_numpy_iterator())