In [1]:
import numpy as np
import tensorflow as tf

### The TensorFlow `Dataset` framework – main components

The TensorFlow Dataset framework has two main components:

* The `Dataset`
* An associated `Iterator`

The `Dataset` is where the data resides. It can be loaded from a number of sources: existing tensors, NumPy arrays and NumPy files, the `TFRecord` format, and directly from text files.

### `Dataset` structure

* A dataset holds elements (i.e. our data samples) that all have the same structure.
* These elements can hold one or more `tf.Tensor` objects called _components_, each with an associated `tf.DType` and a `tf.TensorShape` which can be fully or partially specified.
* You can use the `Dataset.output_types` and `Dataset.output_shapes` properties to check on the types and shapes of each component of a dataset element. 

In [2]:
dataset1 = tf.data.Dataset.from_tensor_slices(tf.random_uniform([4,10]))
print('\n',dataset1.output_types,'\n')
print(dataset1.output_shapes,'\n')


 <dtype: 'float32'> 

(10,) 



In [3]:
dataset2 = tf.data.Dataset.from_tensor_slices(
    (tf.random_uniform([4]),
     tf.random_uniform([4,100], maxval=100, dtype=tf.int32)))
print('\n',dataset2.output_types,'\n')
print(dataset2.output_shapes,'\n')


 (tf.float32, tf.int32) 

(TensorShape([]), TensorShape([Dimension(100)])) 



Note that the _nested structure_ of these properties maps to the structure of an element.

In [4]:
dataset3 = tf.data.Dataset.zip((dataset1, dataset2))
print('\n',dataset3.output_types,'\n')
print(dataset3.output_shapes,'\n')


 (tf.float32, (tf.float32, tf.int32)) 

(TensorShape([Dimension(10)]), (TensorShape([]), TensorShape([Dimension(100)]))) 



### Naming components
* To keep things organized, you may want to name each component of an element, for instance if they represent different features of a training example.

In [5]:
dataset = tf.data.Dataset.from_tensor_slices(
   {"component a": tf.random_uniform([4]),
    "component b": tf.random_uniform([4, 100], maxval=100, dtype=tf.int32)})
print('\n',dataset.output_types,'\n')
print(dataset.output_shapes,'\n')


 {'component b': tf.int32, 'component a': tf.float32} 

{'component b': TensorShape([Dimension(100)]), 'component a': TensorShape([])} 



### `Dataset` transformations

Once you’ve loaded the data into the `Dataset` object, you can string together various operations to apply to the data. These include operations such as:

* `batch()` – this allows you to consume the data from your TensorFlow Dataset in batches
* `map()` – this allows you to transform the data using a function (often a lambda) applied to each element
* `flat_map()` – this maps a function that returns a `Dataset` across each element and flattens the resulting datasets into a single Dataset
* `zip()` – this allows you to zip together different `Dataset` objects into a new Dataset, in a similar way to the Python zip function
* `filter()` – this allows you to remove problematic data-points in your data-set, again based on some function (often a lambda)
* `repeat()` – this operation sets the number of times the data can be consumed from the Dataset before a `tf.errors.OutOfRangeError` error is thrown (indefinitely if no count is given)
* `shuffle()` – this operation shuffles the data in the Dataset
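
The difference between `map()` and `flat_map()` is easy to miss: `map()` produces one output per input element, whereas `flat_map()` expects a whole dataset per element and concatenates the results. The sketch below illustrates this with plain Python lists standing in for datasets; it is an analogy only, and `expand` is a made-up helper, not part of TensorFlow.

```python
from itertools import chain

elements = [1, 2, 3]

def expand(x):
    # Hypothetical per-element function returning a small "dataset" (a list here).
    return [x, x * 10]

# map-like behaviour: one (nested) output per input element
mapped = [expand(x) for x in elements]

# flat_map-like behaviour: the per-element results are flattened into one sequence
flat_mapped = list(chain.from_iterable(expand(x) for x in elements))

print(mapped)       # [[1, 10], [2, 20], [3, 30]]
print(flat_mapped)  # [1, 10, 2, 20, 3, 30]
```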

There are many other methods that the Dataset API includes – see https://www.tensorflow.org/api_docs/python/tf/data/Dataset for more details.  

The `Dataset` transformations support datasets of any structure. When using the `Dataset.map()`, `Dataset.flat_map()`, and `Dataset.filter()` transformations, which apply a function to each element, the element structure determines the arguments of the function:

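`dataset1 = dataset1.map(lambda x: ...)`

`dataset2 = dataset2.flat_map(lambda x, y: ...)`

`dataset3 = dataset3.filter(lambda x, (y, z): ...)`

Note: the argument destructuring used in the last line (the `(y, z)` parameter) is Python 2 syntax and is not available in Python 3, where the nested component has to be accepted as a single argument and unpacked inside the function.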
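
In Python 3 the nested part of an element can be accepted as one argument and unpacked inside the function. The following is a minimal sketch using plain Python tuples in place of dataset elements with structure `(x, (y, z))`; `keep_element` and the sample values are invented for illustration.

```python
# Plain-Python stand-ins for elements of a zipped dataset; no TensorFlow involved.
elements = [
    (1.0, (0.2, 7)),
    (2.0, (0.9, 3)),
    (3.0, (0.5, 8)),
]

def keep_element(x, yz):
    # Python 3 has no `lambda x, (y, z): ...`, so unpack the nested tuple explicitly.
    y, z = yz
    return y > 0.4 and z > 5

# Each two-component element is passed to the function as two arguments.
kept = [element for element in elements if keep_element(*element)]
print(kept)  # [(3.0, (0.5, 8))]
```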

### The `Iterator`
The next component in the TensorFlow Dataset framework is the Iterator. This creates operations which can be called during the training, validation and/or testing of your model in TensorFlow. 

The `tf.data` API currently supports the following iterators:

* **one-shot** - iterate once through a dataset - no need to initialize
* **initializable** - allows parameterization of the dataset definition - requires initialization
* **reinitializable** - can be initialized from multiple different `Dataset` objects
* **feedable** - allows you to select what `Iterator` to call and use via `feed_dict` mechanism

### One-shot Iterators
#### Using `.make_one_shot_iterator()` and `get_next`
* A **one-shot** iterator only supports iterating once through a dataset.
* It doesn't need explicit initialization.
* It is currently the only type of iterator that is easily usable with TF's `Estimator` class.
* It handles almost all the cases that existing (and previous) queue-based input pipelines support.

In [6]:
dataset = tf.data.Dataset.range(100)
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    for i in range(100):
        value = sess.run(next_element)
        assert i == value

First, create a NumPy array with `np.arange` to build the dataset from:

In [7]:
x = np.arange(0,10)
x

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Then we can create a TensorFlow Dataset object straight from the NumPy array using
`from_tensor_slices()`:

In [8]:
dx = tf.data.Dataset.from_tensor_slices(x)

* The object `dx` is now a TensorFlow `Dataset` object.
* The next step is to create an Iterator that will extract data from this dataset. 
* In the code below, the iterator is created using the method `make_one_shot_iterator()`.  The iterator arising from this method can only be initialized and run once – it can’t be re-initialized. The importance of being able to re-initialize an iterator will be explained more later.

In [9]:
# create a one-shot iterator
iterator = dx.make_one_shot_iterator()
# extract an element
next_element = iterator.get_next()

In [10]:
with tf.Session() as sess:
    print(sess.run(next_element))
    print(sess.run(next_element))

0
1


In [11]:
next_element

<tf.Tensor 'IteratorGetNext_1:0' shape=() dtype=int64>

In [12]:
iterator.get_next()

<tf.Tensor 'IteratorGetNext_2:0' shape=() dtype=int64>

In [13]:
# extracts all the data and then throws an OutOfRangeError at end
with tf.Session() as sess:
    for i in range(11):
        val = sess.run(next_element)
        print(val)

0
1
2
3
4
5
6
7
8
9


OutOfRangeError: End of sequence
	 [[Node: IteratorGetNext_1 = IteratorGetNext[output_shapes=[[]], output_types=[DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator_1)]]

Caused by op 'IteratorGetNext_1', defined at:
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/usr/local/lib/python3.5/dist-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/usr/local/lib/python3.5/dist-packages/ipykernel/kernelapp.py", line 486, in start
    self.io_loop.start()
  File "/usr/local/lib/python3.5/dist-packages/tornado/platform/asyncio.py", line 127, in start
    self.asyncio_loop.run_forever()
  File "/usr/lib/python3.5/asyncio/base_events.py", line 345, in run_forever
    self._run_once()
  File "/usr/lib/python3.5/asyncio/base_events.py", line 1312, in _run_once
    handle._run()
  File "/usr/lib/python3.5/asyncio/events.py", line 125, in _run
    self._callback(*self._args)
  File "/usr/local/lib/python3.5/dist-packages/tornado/ioloop.py", line 759, in _run_callback
    ret = callback()
  File "/usr/local/lib/python3.5/dist-packages/tornado/stack_context.py", line 276, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/zmq/eventloop/zmqstream.py", line 536, in <lambda>
    self.io_loop.add_callback(lambda : self._handle_events(self.socket, 0))
  File "/usr/local/lib/python3.5/dist-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events
    self._handle_recv()
  File "/usr/local/lib/python3.5/dist-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv
    self._run_callback(callback, msg)
  File "/usr/local/lib/python3.5/dist-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
    callback(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tornado/stack_context.py", line 276, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/usr/local/lib/python3.5/dist-packages/ipykernel/kernelbase.py", line 233, in dispatch_shell
    handler(stream, idents, msg)
  File "/usr/local/lib/python3.5/dist-packages/ipykernel/kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "/usr/local/lib/python3.5/dist-packages/ipykernel/ipkernel.py", line 208, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/usr/local/lib/python3.5/dist-packages/ipykernel/zmqshell.py", line 537, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/IPython/core/interactiveshell.py", line 2662, in run_cell
    raw_cell, store_history, silent, shell_futures)
  File "/usr/local/lib/python3.5/dist-packages/IPython/core/interactiveshell.py", line 2785, in _run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/usr/local/lib/python3.5/dist-packages/IPython/core/interactiveshell.py", line 2903, in run_ast_nodes
    if self.run_code(code, result):
  File "/usr/local/lib/python3.5/dist-packages/IPython/core/interactiveshell.py", line 2963, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-9-f919e9329b5a>", line 4, in <module>
    next_element = iterator.get_next()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 373, in get_next
    name=name)), self._output_types,
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_dataset_ops.py", line 1745, in iterator_get_next
    output_shapes=output_shapes, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3414, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1740, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

OutOfRangeError (see above for traceback): End of sequence
	 [[Node: IteratorGetNext_1 = IteratorGetNext[output_shapes=[[]], output_types=[DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator_1)]]


### Initializable Iterators
#### Using `.make_initializable_iterator()` and `iterator.initializer`
* An **initializable iterator** requires you to run an explicit `iterator.initializer` operation before using it. 
* In exchange for this inconvenience, it lets you parameterize the definition of the dataset, using one or more `tf.placeholder()` tensors that can be fed when you initialize the iterator.
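
The idea of "one iterator, re-parameterized at initialization time" can be sketched in plain Python. This is an analogy, not TensorFlow code: the class `InitializableRange` and its method names are made up, with `initialize(max_value)` playing the role of running `iterator.initializer` with a fed placeholder.

```python
class InitializableRange:
    """Toy stand-in for an initializable iterator over a range of values."""

    def __init__(self):
        self._it = None  # unusable until initialize() has been called

    def initialize(self, max_value):
        # Plays the role of running iterator.initializer with a feed_dict.
        self._it = iter(range(max_value))

    def get_next(self):
        if self._it is None:
            raise RuntimeError("iterator has not been initialized")
        return next(self._it)  # raises StopIteration at the end of the data


it = InitializableRange()
it.initialize(3)
first_pass = [it.get_next() for _ in range(3)]
it.initialize(5)  # same iterator object, different parameter
second_pass = [it.get_next() for _ in range(5)]
print(first_pass, second_pass)
```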

In [14]:
max_value = tf.placeholder(tf.int64, shape=[])
dataset = tf.data.Dataset.range(max_value)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

Initialize an iterator over a dataset with 10 elements.

In [15]:
with tf.Session() as sess:
    sess.run(iterator.initializer, feed_dict={max_value: 10})
    for i in range(10):
        value = sess.run(next_element)
        assert i == value
        #print(value)

Now we initialize the **same** iterator over the **same** dataset, but with a **different** number of elements - in this case 100.

In [16]:
with tf.Session() as sess:
    sess.run(iterator.initializer, feed_dict={max_value: 100})
    for i in range(100):
        value = sess.run(next_element)
        assert i == value

### Reinitializable Iterators
#### Using `Iterator.from_structure` and `make_initializer`
* A **reinitializable iterator** can be initialized from multiple **different** `Dataset` objects. 
* For example, you might have a training input pipeline that uses some form of data augmentation for an image dataset, such as adding rotation or random crops or other random perturbations.
* You may have a validation input pipeline that evaluates predictions on unmodified (i.e. un-augmented) data. 
* These pipelines will typically use **different** `Dataset` objects that have the **same structure** (i.e. the same types and compatible shapes for each component).
*  A reinitializable iterator is defined by its structure. For example:

`iterator = tf.data.Iterator.from_structure(training_dataset.output_types, training_dataset.output_shapes)`

* We could use the `output_types` and `output_shapes` properties of either the `training_dataset` or the `validation_dataset` here, because their types and sizes are compatible.
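
A rough plain-Python model of this behaviour is given below. It is only a sketch under simplifying assumptions: the "structure" is reduced to a single Python type, and `ReinitializableIterator` is a hypothetical class, not a TensorFlow API. It shows that each initializer points the one iterator at a different source and restarts that source from its first element.

```python
class ReinitializableIterator:
    """Toy iterator defined by an element type, then pointed at any matching source."""

    def __init__(self, element_type):
        self.element_type = element_type
        self._it = None

    def make_initializer(self, source):
        # Check compatibility once, when the initializer is created.
        if not all(isinstance(value, self.element_type) for value in source):
            raise TypeError("source does not match the iterator structure")

        def initializer():
            self._it = iter(source)  # restart from the beginning of `source`

        return initializer

    def get_next(self):
        return next(self._it)


training = [10, 11, 12]
validation = [0, 1]
iterator = ReinitializableIterator(int)
training_init = iterator.make_initializer(training)
validation_init = iterator.make_initializer(validation)

training_init()
train_values = [iterator.get_next() for _ in range(3)]
validation_init()
valid_values = [iterator.get_next() for _ in range(2)]
training_init()  # switching back restarts training from its first element
restarted = iterator.get_next()
print(train_values, valid_values, restarted)
```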

#### Reinitializable iterator example
* Here we are _transforming_ the input dataset by adding some random noise to it.

In [17]:
# Define the training dataset: each element gets random integer noise added.
training_dataset = tf.data.Dataset.range(100).map(
    lambda x: x + tf.random_uniform([], -10, 10, tf.int64))

Or, alternatively, we can define our own data transformation functions.

In [18]:
def my_func(x):
    return x + tf.random_uniform([], -10, 10, tf.int64)

training_dataset = tf.data.Dataset.range(100).map(my_func).repeat()

In [19]:
validation_dataset = tf.data.Dataset.range(50)

In [20]:
iterator = tf.data.Iterator.from_structure(training_dataset.output_types,
                                           training_dataset.output_shapes)
next_element = iterator.get_next()

# Notice separate initializers for the training and validation sets
training_init_op = iterator.make_initializer(training_dataset)
validation_init_op = iterator.make_initializer(validation_dataset)

# Run 20 epochs in which the training dataset is traversed, followed by the
# validation dataset.
with tf.Session() as sess:
    for _ in range(20):
      # Initialize an iterator over the training dataset.
      sess.run(training_init_op)
      for _ in range(100):
        sess.run(next_element)

      # Initialize an iterator over the validation dataset.
      sess.run(validation_init_op)
      for _ in range(50):
        sess.run(next_element)

### Feedable Iterators

#### Using `tf.data.Iterator.from_string_handle()`

* A feedable iterator can be used together with `tf.placeholder` to select what `Iterator` to use in each call to `tf.Session.run`, via the familiar `feed_dict` mechanism. 
* It offers the same functionality as a reinitializable iterator, but doesn't need you to initialize the iterator from the start of a dataset when you switch between iterators. 
* For example, using the same training and validation example from above, you can use `tf.data.Iterator.from_string_handle` to define a feedable iterator that allows you to switch between the two datasets.
* A feedable iterator is defined by a **handle placeholder and its structure**. For example:

`handle = tf.placeholder(tf.string, shape=[])`

`iterator = tf.data.Iterator.from_string_handle(handle, training_dataset.output_types, training_dataset.output_shapes)`

* We could use the `output_types` and `output_shapes` properties of either the `training_dataset` or the `validation_dataset` here, because they have identical structure.
* Now we can create multiple different types of iterators, using, for example, `make_one_shot_iterator` or `make_initializable_iterator` and then use the `string_handle()` method with them to get tensors we can use in `feed_dict`
* In other words, the `Iterator.string_handle()` method returns a tensor that can be evaluated and used to feed the `handle` placeholder when we put it in the feed_dict format.
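
The key property, that each underlying iterator keeps its own position while you switch between them, can be illustrated with a small plain-Python analogy. The dictionary keys below stand in for string handles and `get_next(handle)` for feeding the `handle` placeholder; both are made up for this sketch.

```python
# Each handle names an independent iterator that keeps its own position.
iterators = {
    "training": iter(range(100, 110)),
    "validation": iter(range(5)),
}

def get_next(handle):
    # Plays the role of feeding `handle` through feed_dict.
    return next(iterators[handle])

seen = [
    get_next("training"),    # 100
    get_next("training"),    # 101
    get_next("validation"),  # 0
    get_next("training"),    # 102: resumes rather than restarting
]
print(seen)
```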

In [21]:
def my_func(x):
    return x + tf.random_uniform([], -10, 10, tf.int64)

# Define training and validation datasets with the same structure.
training_dataset = tf.data.Dataset.range(100).map(my_func).repeat()
validation_dataset = tf.data.Dataset.range(50)

In [22]:
# This is our handle placeholder, with a dtype of tf.string
handle = tf.placeholder(tf.string, shape=[])

# Notice when defining the iterator we are using our handle placeholder 
# as well as our dataset dtypes and shapes
iterator = tf.data.Iterator.from_string_handle(
    handle, training_dataset.output_types, training_dataset.output_shapes)
next_element = iterator.get_next()

# You can use feedable iterators with a variety of different kinds of iterators
# (such as one-shot and initializable iterators).
training_iterator = training_dataset.make_one_shot_iterator()
validation_iterator = validation_dataset.make_initializable_iterator()

counter = 0

with tf.Session() as sess:
    # The `Iterator.string_handle()` method returns a tensor that can be evaluated
    # and used to feed the `handle` placeholder when we put it in the feed_dict format.
    training_handle = sess.run(training_iterator.string_handle())
    validation_handle = sess.run(validation_iterator.string_handle())
    
    # Loop forever, alternating between training and validation.
    while True:
        # Run 200 steps using the training dataset. Note that the training dataset is
        # infinite, and we resume from where we left off in the previous `while` loop
        # iteration.
        for _ in range(200):
            sess.run(next_element, feed_dict={handle: training_handle})

        # Run one pass over the validation dataset.
        sess.run(validation_iterator.initializer)
        for _ in range(50):
            sess.run(next_element, feed_dict={handle: validation_handle})
        
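        # counter tracks completed train/validation cycles (200 training and
        # 50 validation steps each), not individual steps
        counter += 1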
        if counter > 300:
            print("We have completed {} steps".format(counter))
            break

We have completed 301 steps


### Consuming iterator data
#### Understanding `Iterator.get_next()`
* The `Iterator.get_next()` method returns one or more `tf.Tensor` objects (depending on how the `Dataset` was set up) that correspond to the symbolic next element of an iterator. 
* Each time these tensors are evaluated, they take the value of the next element in the underlying dataset. 
* Same as for other TensorFlow expressions, calling `Iterator.get_next()` does not immediately advance the iterator. Instead you need to fetch the resulting `tf.Tensor` objects in the context of a `tf.Session.run()` to get the next elements and advance the iterator.
* For one-shot and initializable iterators: if the iterator reaches the end of the dataset, executing the `Iterator.get_next()` operation will raise a `tf.errors.OutOfRangeError`. After this point the iterator will be in an unusable state, and you must initialize it again if you want to use it further.
* A common way of dealing with this is to wrap the training loop in a try-except block

In [23]:
dataset = tf.data.Dataset.range(5)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

# Typically `result` will be the output of a model,
# or an optimizer's training operation. In this case we are just doing simple addition.
result = tf.add(next_element, next_element)

with tf.Session() as sess:
    sess.run(iterator.initializer)
    print(sess.run(result))  # ==> "0"
    print(sess.run(result))  # ==> "2"
    print(sess.run(result))  # ==> "4"
    print(sess.run(result))  # ==> "6"
    print(sess.run(result))  # ==> "8"
    try:
      sess.run(result)
    except tf.errors.OutOfRangeError:
      print("End of dataset")  # ==> "End of dataset"

0
2
4
6
8
End of dataset


### Getting data from iterators with a nested data structure
* If each element of the dataset has a nested structure, the return value of `Iterator.get_next()` will be one or more `tf.Tensor` objects in the same nested structure.
* Note that next1, next2, and next3 are tensors produced by the same op/node (created by `Iterator.get_next()`). Therefore, evaluating any of these tensors will advance the iterator for all components. 
* A typical consumer of an iterator will include all components in a single expression.
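
Why the components should be fetched together can be seen in a plain-Python analogy, where one call to the hypothetical `run_get_next()` stands in for one `sess.run` on any tensor returned by `get_next()`. The sample data is invented for illustration.

```python
elements = iter([
    (1, ("a", 10)),
    (2, ("b", 20)),
    (3, ("c", 30)),
])

def run_get_next():
    # One call consumes a whole element, whichever component you wanted.
    return next(elements)

# Fetching components in separate calls mixes values from different elements.
next1, _ = run_get_next()        # consumes element 1
_, (next2, _) = run_get_next()   # consumes element 2
mismatched = (next1, next2)

# Fetching everything in one call keeps the components aligned.
a, (b, c) = run_get_next()       # consumes element 3
aligned = (a, b, c)

print(mismatched)  # (1, 'b')
print(aligned)     # (3, 'c', 30)
```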

In [24]:
dataset1 = tf.data.Dataset.from_tensor_slices(tf.random_uniform([4, 10]))
dataset2 = tf.data.Dataset.from_tensor_slices((tf.random_uniform([4]), tf.random_uniform([4, 100])))
dataset3 = tf.data.Dataset.zip((dataset1, dataset2))

iterator = dataset3.make_initializable_iterator()

with tf.Session() as sess:
    sess.run(iterator.initializer)
    next1, (next2, next3) = iterator.get_next()
print(next1)
print(next2)
print(next3)

Tensor("IteratorGetNext_7:0", shape=(10,), dtype=float32)
Tensor("IteratorGetNext_7:1", shape=(), dtype=float32)
Tensor("IteratorGetNext_7:2", shape=(100,), dtype=float32)


### Saving iterator state
#### Using `tf.contrib.data.make_saveable_from_iterator`
* The `tf.contrib.data.make_saveable_from_iterator` function creates a `SaveableObject` from an iterator, which can be used to save and restore the current state of the iterator (and, effectively, the whole input pipeline). 
* A saveable object thus created can be added to `tf.train.Saver` variables list or the `tf.GraphKeys.SAVEABLE_OBJECTS` collection for saving and restoring in the same manner as a `tf.Variable`.

In [27]:
tf.reset_default_graph()
dataset = tf.data.Dataset.range(100)
iterator_to_save = dataset.make_initializable_iterator()

# Create saveable object from iterator.
saveable = tf.contrib.data.make_saveable_from_iterator(iterator_to_save)

# Save the iterator state by adding it to the saveable objects collection.
tf.add_to_collection(tf.GraphKeys.SAVEABLE_OBJECTS, saveable)

# define a saver as normal
saver = tf.train.Saver()

# you'll have some criterion for saving
should_checkpoint = True
path_to_checkpoint = './logs/20_Dataset_API'

with tf.Session() as sess:
    sess.run(iterator_to_save.initializer)
    if should_checkpoint:
        saver.save(sess, path_to_checkpoint)

In [28]:
# Restore the iterator state.
with tf.Session() as sess:
  saver.restore(sess, path_to_checkpoint)

INFO:tensorflow:Restoring parameters from ./logs/20_Dataset_API


### Batching dataset elements
#### Simple batching: Using `.batch()`
* The simplest form of batching stacks n consecutive elements of a dataset into a single element. 
* The `Dataset.batch()` transformation does exactly this, with the same constraints as the `tf.stack()` operator, applied to each component of the elements: i.e. for each component i, all elements must have a tensor of the exact same shape.

In [29]:
inc_dataset = tf.data.Dataset.range(100)
dec_dataset = tf.data.Dataset.range(0, -100, -1)
dataset = tf.data.Dataset.zip((inc_dataset, dec_dataset))
batched_dataset = dataset.batch(4)

iterator = batched_dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    print(sess.run(next_element))  # ==> ([0, 1, 2,   3],   [ 0, -1,  -2,  -3])
    print(sess.run(next_element))  # ==> ([4, 5, 6,   7],   [-4, -5,  -6,  -7])
    print(sess.run(next_element))  # ==> ([8, 9, 10, 11],   [-8, -9, -10, -11])

(array([0, 1, 2, 3]), array([ 0, -1, -2, -3]))
(array([4, 5, 6, 7]), array([-4, -5, -6, -7]))
(array([ 8,  9, 10, 11]), array([ -8,  -9, -10, -11]))


#### Another example with `.batch()`

In [30]:
x = np.arange(0,10)
dx = tf.data.Dataset.from_tensor_slices(x).batch(3)
iterator = dx.make_initializable_iterator()
next_element = iterator.get_next()

In [31]:
with tf.Session() as sess:
    sess.run(iterator.initializer)
    for i in range(15):
        val = sess.run(next_element)
        print([val], i)
        if (i + 1) % (10 // 3) == 0 and i > 0:
            sess.run(iterator.initializer)

[array([0, 1, 2])] 0
[array([3, 4, 5])] 1
[array([6, 7, 8])] 2
[array([0, 1, 2])] 3
[array([3, 4, 5])] 4
[array([6, 7, 8])] 5
[array([0, 1, 2])] 6
[array([3, 4, 5])] 7
[array([6, 7, 8])] 8
[array([0, 1, 2])] 9
[array([3, 4, 5])] 10
[array([6, 7, 8])] 11
[array([0, 1, 2])] 12
[array([3, 4, 5])] 13
[array([6, 7, 8])] 14
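
Note that the value 9 never appears in the output above: the iterator is re-initialized after three batches, before the final, shorter batch is reached. The plain-Python sketch below models how consecutive elements are grouped when the remainder is kept (which, as far as I know, is the default behaviour of `batch()`); the helper `batch` here is a stand-in written for this illustration, not the TensorFlow method.

```python
from itertools import islice

def batch(iterable, batch_size):
    """Group consecutive items into lists of at most batch_size items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk

batches = list(batch(range(10), 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```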


#### Zip datasets together to pair input-output training/validation data
* In this example, batching is applied consistently across the zipped datasets (i.e. 3 items from `dx` and 3 items from `dy`).
* The re-initialization `if` statement used in the previous example can be dropped by adding `.repeat()` to the line that creates the `dcomb` dataset, as seen below. Called without an argument, `.repeat()` repeats the dataset indefinitely, so no `OutOfRangeError` is thrown.

In [32]:
x = np.arange(0,10)
y = np.arange(1, 11)

def simple_zip_example(x, y):
    # create dataset objects from the arrays
    dx = tf.data.Dataset.from_tensor_slices(x)
    dy = tf.data.Dataset.from_tensor_slices(y)
    
    # Zip the two datasets together
    #dcomb = tf.data.Dataset.zip((dx, dy)).batch(3)
    dcomb = tf.data.Dataset.zip((dx, dy)).repeat().batch(3)
    iterator = dcomb.make_initializable_iterator()
    
    # extract an element
    next_element = iterator.get_next()
    
    with tf.Session() as sess:
        sess.run(iterator.initializer)
        for i in range(15):
            val = sess.run(next_element)
            print(val)
#             if (i+1) % (10 // 3) == 0 and i > 0:
#                sess.run(iterator.initializer)
    

In [33]:
simple_zip_example(x,y)

(array([0, 1, 2]), array([1, 2, 3]))
(array([3, 4, 5]), array([4, 5, 6]))
(array([6, 7, 8]), array([7, 8, 9]))
(array([9, 0, 1]), array([10,  1,  2]))
(array([2, 3, 4]), array([3, 4, 5]))
(array([5, 6, 7]), array([6, 7, 8]))
(array([8, 9, 0]), array([ 9, 10,  1]))
(array([1, 2, 3]), array([2, 3, 4]))
(array([4, 5, 6]), array([5, 6, 7]))
(array([7, 8, 9]), array([ 8,  9, 10]))
(array([0, 1, 2]), array([1, 2, 3]))
(array([3, 4, 5]), array([4, 5, 6]))
(array([6, 7, 8]), array([7, 8, 9]))
(array([9, 0, 1]), array([10,  1,  2]))
(array([2, 3, 4]), array([3, 4, 5]))
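
Batches such as `[9, 0, 1]` in the output above straddle the boundary between two passes over the data, because `.repeat()` is applied before `.batch(3)`. The plain-Python sketch below contrasts the two orderings for two passes over ten values; `batch` is a stand-in helper written for this illustration, not the TensorFlow method.

```python
from itertools import chain, islice, repeat

def batch(iterable, batch_size):
    """Group consecutive items into lists of at most batch_size items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk

data = list(range(10))

# repeat-then-batch: the stream is concatenated first, so batches can cross passes
two_passes = chain.from_iterable(repeat(data, 2))
repeat_then_batch = list(batch(two_passes, 3))

# batch-then-repeat: each pass ends with its own short batch
batch_then_repeat = list(batch(data, 3)) * 2

print(repeat_then_batch[3])  # [9, 0, 1]
print(batch_then_repeat[3])  # [9]
```

Which ordering is preferable depends on whether clean epoch boundaries or uniformly sized batches matter more for the model being trained.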


### Minimal MNIST-style Dataset example (using the scikit-learn digits dataset)

In [34]:
from sklearn.datasets import load_digits

In [35]:
digits = load_digits(return_X_y=True)

In [36]:
train_images = digits[0][:int(len(digits[0]) * 0.8)]
train_labels = digits[1][:int(len(digits[0]) * 0.8)]
valid_images = digits[0][int(len(digits[0]) * 0.8):]
valid_labels = digits[1][int(len(digits[0]) * 0.8):]

In [37]:
# create the training datasets
dx_train = tf.data.Dataset.from_tensor_slices(train_images)
# apply a one-hot transform to each label for use in the NN
dy_train = tf.data.Dataset.from_tensor_slices(train_labels).map(lambda z: tf.one_hot(z, 10))
# zip the x and y training_data together and shuffle, batch etc.
train_dataset = tf.data.Dataset.zip((dx_train, dy_train)).shuffle(500).repeat().batch(30)

In [38]:
# do the same operations for the validation set
dx_valid = tf.data.Dataset.from_tensor_slices(valid_images)
dy_valid = tf.data.Dataset.from_tensor_slices(valid_labels).map(lambda z: tf.one_hot(z, 10))
valid_dataset = tf.data.Dataset.zip((dx_valid, dy_valid)).shuffle(500).repeat().batch(30)
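
The argument to `shuffle()` is a buffer size, so `shuffle(500)` only mixes elements within a sliding buffer rather than across the whole dataset when the dataset is larger than the buffer. The plain-Python sketch below models that behaviour under simplifying assumptions; `buffered_shuffle` and its `seed` parameter are made up for illustration and are not the TensorFlow implementation.

```python
import random

def buffered_shuffle(iterable, buffer_size, seed=None):
    """Yield items in a randomized order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill the buffer first
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered element...
        buffer[idx] = item       # ...and replace it with the incoming one
    rng.shuffle(buffer)          # drain what is left at the end
    yield from buffer

shuffled = list(buffered_shuffle(range(20), buffer_size=5, seed=0))
unshuffled = list(buffered_shuffle(range(20), buffer_size=1, seed=0))
print(shuffled)
print(unshuffled)
```

With a buffer of size 1 the order is unchanged, and an element can never be emitted more than `buffer_size - 1` positions early, which is why a buffer at least as large as the dataset is needed for a uniform shuffle.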

* Now, we want to be able to extract data from either the train_dataset or the valid_dataset seamlessly. 
* This is important, as we don’t want to have to change how data flows through the neural network structure when all we want to do is just change the dataset the model is consuming. 
* To do this, we can use another way of creating the Iterator object – the `from_structure()` method. 
* This method creates a generic iterator object – all it needs is the data types of the data it will be outputting and the output data size/shape in order to be created. 
The code below uses this methodology:

In [39]:
# create general iterator
iterator = tf.data.Iterator.from_structure(train_dataset.output_types,
                                          train_dataset.output_shapes)
next_element = iterator.get_next()

# Now we need operations which can be called during training or eval to initialize
# this generic iterator and "point it" to the desired dataset.
training_init_op = iterator.make_initializer(train_dataset)
validation_init_op = iterator.make_initializer(valid_dataset)

In [40]:
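def nn_model(in_data):
    # normalize the inputs (training defaults to False for this layer)
    bn = tf.layers.batch_normalization(in_data)
    # dense layers have no activation by default, so these are linear layers
    fc1 = tf.layers.dense(bn, 50)
    fc2 = tf.layers.dense(fc1, 50)
    # dropout is only active when training=True is passed; by default it is a no-op
    fc2 = tf.layers.dropout(fc2)
    # 10 output logits, one per digit class
    fc3 = tf.layers.dense(fc2, 10)
    return fc3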

* Note that the next_element operation is handled directly in the model – in other words, it doesn’t need to be called explicitly during the training loop as will be seen below.
* Rather, whenever any of the operations following this point in the graph are called (i.e. the loss operation, the optimization operation etc.) the TensorFlow graph structure will know to run the next_element operation and extract the data from whichever dataset has been initialized into the iterator. 
* The `next_element` operation, because it is operating on the generic iterator which is defined by the structure of the `train_dataset`, is a tuple – the first element (`[0]`) will contain the digit images, while the second element (`[1]`) will contain the corresponding labels. Therefore, `next_element[0]` will extract the image data batch and send it into the neural network model (`nn_model`) as the input data.

In [41]:
logits = nn_model(next_element[0])

In [42]:
# add the optimizer and loss
loss = tf.reduce_sum(tf.nn.softmax_cross_entropy_with_logits_v2(labels=next_element[1], logits=logits))
optimizer = tf.train.AdamOptimizer().minimize(loss)

# get accuracy
prediction = tf.argmax(logits, 1)
equality = tf.equal(prediction, tf.argmax(next_element[1], 1))
accuracy = tf.reduce_mean(tf.cast(equality, tf.float32))
init_op = tf.global_variables_initializer()

In [43]:
# run the training
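# each iteration consumes one batch of 30 samples, so these are training
# steps rather than full passes over the data
epochs = 600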
with tf.Session() as sess:
    sess.run(init_op)
    sess.run(training_init_op)
    for i in range(epochs):
        l, _, acc = sess.run([loss, optimizer, accuracy])
        if i % 50 == 0:
            print("Epoch: {}, loss: {:.3f}, training accuracy: {:.2f}%".format(i,l,acc*100))
    # now setup the validation run
    valid_iters = 100
    # re-initialize the iterator, but this time with validation data
    sess.run(validation_init_op)
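    avg_acc = 0
    for i in range(valid_iters):
        acc = sess.run([accuracy])
        avg_acc += acc[0]
    # divide the accumulated accuracy by the iteration count and scale to a percentage
    print("Average validation set accuracy over {} iterations is {:.2f}%".format(valid_iters, (avg_acc / valid_iters) * 100))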

Epoch: 0, loss: 588.366, training accuracy: 13.33%
Epoch: 50, loss: 35.890, training accuracy: 76.67%
Epoch: 100, loss: 16.141, training accuracy: 80.00%
Epoch: 150, loss: 2.740, training accuracy: 96.67%
Epoch: 200, loss: 5.205, training accuracy: 96.67%
Epoch: 250, loss: 8.484, training accuracy: 86.67%
Epoch: 300, loss: 2.002, training accuracy: 96.67%
Epoch: 350, loss: 1.934, training accuracy: 100.00%
Epoch: 400, loss: 1.203, training accuracy: 100.00%
Epoch: 450, loss: 2.091, training accuracy: 96.67%
Epoch: 500, loss: 0.458, training accuracy: 100.00%
Epoch: 550, loss: 2.094, training accuracy: 93.33%
Average validation set accuracy over 100 iterations is 88.67%


### Some repetition from here on

In [44]:
x = np.random.sample((100,2))
# make a dataset from a numpy array
dataset = tf.data.Dataset.from_tensor_slices(x)

iter = dataset.make_one_shot_iterator()
el = iter.get_next()

with tf.Session() as sess:
    print(sess.run(el))

[0.68805744 0.11730824]


In [45]:
# using two numpy arrays
features, labels = (np.random.sample((100,2)), np.random.sample((100,1)))
dataset = tf.data.Dataset.from_tensor_slices((features,labels))

iter = dataset.make_one_shot_iterator()
el = iter.get_next()

with tf.Session() as sess:
    print(sess.run(el))

(array([0.39828852, 0.73085527]), array([0.54407894]))


In [46]:
# using a tensor 
dataset = tf.data.Dataset.from_tensor_slices(tf.random_uniform([100, 2]))

iter = dataset.make_initializable_iterator()
el = iter.get_next()

with tf.Session() as sess:
    sess.run(iter.initializer)
    print(sess.run(el))

[0.5345912  0.29279315]


In [47]:
# using a placeholder
x = tf.placeholder(tf.float32, shape=[None,2])
dataset = tf.data.Dataset.from_tensor_slices(x)

data = np.random.sample((100,2))

iter = dataset.make_initializable_iterator()
el = iter.get_next()

with tf.Session() as sess:
    sess.run(iter.initializer, feed_dict={ x: data })
    print(sess.run(el))

[0.5966419 0.5760475]


In [55]:
# from generator
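sequence = [[1], [2, 3], [3, 4]]  # plain list: the elements have different lengths

def generator():
    for el in sequence:
        yield el

# from_generator is a static method, so call it on the class;
# the shape [None] allows elements of varying length
dataset = tf.data.Dataset.from_generator(generator,
                                         output_types=tf.float32,
                                         output_shapes=tf.TensorShape([None]))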
iter = dataset.make_initializable_iterator()
el = iter.get_next()

with tf.Session() as sess:
    sess.run(iter.initializer)
    print(sess.run(el))

[1.]


In [56]:
# initializable iterator to switch between data
EPOCHS = 10

x, y = tf.placeholder(tf.float32, shape=[None,2]), tf.placeholder(tf.float32, shape=[None,1])
dataset = tf.data.Dataset.from_tensor_slices((x, y))

train_data = (np.random.sample((100,2)), np.random.sample((100,1)))
test_data = (np.array([[1,2]]), np.array([[0]]))

iter = dataset.make_initializable_iterator()
features, labels = iter.get_next()

with tf.Session() as sess:
#     initialise iterator with train data
    sess.run(iter.initializer, feed_dict={ x: train_data[0], y: train_data[1]})
    for _ in range(EPOCHS):
        sess.run([features, labels])
#     switch to test data
    sess.run(iter.initializer, feed_dict={ x: test_data[0], y: test_data[1]})
    print(sess.run([features, labels]))

[array([1., 2.], dtype=float32), array([0.], dtype=float32)]


In [57]:
# Reinitializable iterator to switch between Datasets
EPOCHS = 10
# making fake data using numpy
train_data = (np.random.sample((100,2)), np.random.sample((100,1)))
test_data = (np.random.sample((10,2)), np.random.sample((10,1)))
# create two datasets, one for training and one for test
train_dataset = tf.data.Dataset.from_tensor_slices(train_data)
test_dataset = tf.data.Dataset.from_tensor_slices(test_data)
# create an iterator of the correct shape and type
iter = tf.data.Iterator.from_structure(train_dataset.output_types,
                                       train_dataset.output_shapes)
features, labels = iter.get_next()
# create the initialisation operations
train_init_op = iter.make_initializer(train_dataset)
test_init_op = iter.make_initializer(test_dataset)

with tf.Session() as sess:
    sess.run(train_init_op) # switch to train dataset
    for _ in range(EPOCHS):
        sess.run([features, labels])
    sess.run(test_init_op) # switch to test dataset
    print(sess.run([features, labels]))

[array([0.75548354, 0.47333289]), array([0.36416387])]


In [58]:
# BATCHING
BATCH_SIZE = 4
x = np.random.sample((100,2))
# make a dataset from a numpy array
dataset = tf.data.Dataset.from_tensor_slices(x).batch(BATCH_SIZE)

iter = dataset.make_one_shot_iterator()
el = iter.get_next()

with tf.Session() as sess:
    print(sess.run(el))

[[0.76792777 0.51634994]
 [0.00132168 0.18642143]
 [0.69443948 0.63152526]
 [0.16554104 0.57096108]]


In [59]:
# REPEAT
BATCH_SIZE = 4
x = np.array([[1],[2],[3],[4]])
# make a dataset from a numpy array
dataset = tf.data.Dataset.from_tensor_slices(x)
dataset = dataset.repeat()

iter = dataset.make_one_shot_iterator()
el = iter.get_next()

with tf.Session() as sess:
    print(sess.run(el))
#     this will run forever
#     while True:
#         print(sess.run(el))    

[1]


In [60]:
# MAP
x = np.array([[1],[2],[3],[4]])
# make a dataset from a numpy array
dataset = tf.data.Dataset.from_tensor_slices(x)
dataset = dataset.map(lambda x: x*2)

iter = dataset.make_one_shot_iterator()
el = iter.get_next()

with tf.Session() as sess:
    print(sess.run(el))
#     to print every element instead of only the first:
#     for _ in range(len(x)):
#         print(sess.run(el))

[2]


In [61]:
# SHUFFLE
BATCH_SIZE = 4
x = np.array([[1],[2],[3],[4]])
# make a dataset from a numpy array
dataset = tf.data.Dataset.from_tensor_slices(x)
dataset = dataset.shuffle(buffer_size=100)
dataset = dataset.batch(BATCH_SIZE)

iter = dataset.make_one_shot_iterator()
el = iter.get_next()

with tf.Session() as sess:
    print(sess.run(el))

[[4]
 [2]
 [3]
 [1]]
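
The `buffer_size` argument decides how thoroughly `shuffle` mixes the data. The block below is a rough pure-Python model of a fixed-size shuffle buffer, meant only to illustrate the idea; it is not TensorFlow's implementation, and the name `buffered_shuffle` is made up for this sketch.

```python
import random

def buffered_shuffle(elements, buffer_size, seed=None):
    """Yield elements in an order a fixed-size shuffle buffer could produce.

    Hypothetical helper for illustration only (not part of tf.data).
    """
    rng = random.Random(seed)
    buffer = []
    for element in elements:
        if len(buffer) < buffer_size:
            # Still filling the buffer: nothing is emitted yet.
            buffer.append(element)
            continue
        # Buffer is full: emit a random slot and refill it with the new element.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = element
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    for element in buffer:
        yield element

# A buffer of size 1 cannot reorder anything.
print(list(buffered_shuffle(range(10), buffer_size=1)))
# A buffer at least as large as the data gives a full shuffle.
print(list(buffered_shuffle(range(10), buffer_size=100, seed=0)))
```

In the cell above, `buffer_size=100` is larger than the four-element dataset, so every permutation is possible. With a buffer smaller than the dataset, an element can only move a limited distance forward from its original position.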


In [62]:
# how to pass the value to a model
EPOCHS = 10
BATCH_SIZE = 16
# using two numpy arrays
# (no extra np.array([...]) wrapper: it would add a leading axis and leave a single element)
features, labels = (np.random.sample((100,2)),
                    np.random.sample((100,1)))

dataset = tf.data.Dataset.from_tensor_slices((features,labels)).repeat().batch(BATCH_SIZE)

iter = dataset.make_one_shot_iterator()
x, y = iter.get_next()

# make a simple model
net = tf.layers.dense(x, 8, activation=tf.tanh) # pass the first value from iter.get_next() as input
net = tf.layers.dense(net, 8, activation=tf.tanh)
prediction = tf.layers.dense(net, 1, activation=tf.tanh)

loss = tf.losses.mean_squared_error(labels=y, predictions=prediction) # pass the second value from iter.get_next() as label
train_op = tf.train.AdamOptimizer().minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(EPOCHS):
        _, loss_value = sess.run([train_op, loss])
        print("Iter: {}, Loss: {:.4f}".format(i, loss_value))

Iter: 0, Loss: 0.2183
Iter: 1, Loss: 0.2084
Iter: 2, Loss: 0.1990
Iter: 3, Loss: 0.1901
Iter: 4, Loss: 0.1817
Iter: 5, Loss: 0.1738
Iter: 6, Loss: 0.1664
Iter: 7, Loss: 0.1595
Iter: 8, Loss: 0.1530
Iter: 9, Loss: 0.1471


In [63]:
# Wrapping all together -> Switch between train and test set
EPOCHS = 10
BATCH_SIZE = 16
# create a placeholder to dynamically switch between batch sizes
batch_size = tf.placeholder(tf.int64)

x, y = tf.placeholder(tf.float32, shape=[None,2]), tf.placeholder(tf.float32, shape=[None,1])
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(batch_size).repeat()

# using two numpy arrays
train_data = (np.random.sample((100,2)), np.random.sample((100,1)))
test_data = (np.random.sample((20,2)), np.random.sample((20,1)))

n_batches = len(train_data[0]) // BATCH_SIZE

iter = dataset.make_initializable_iterator()
features, labels = iter.get_next()
# make a simple model
net = tf.layers.dense(features, 8, activation=tf.tanh) # pass the first value from iter.get_next() as input
net = tf.layers.dense(net, 8, activation=tf.tanh)
prediction = tf.layers.dense(net, 1, activation=tf.tanh)

loss = tf.losses.mean_squared_error(labels=labels, predictions=prediction) # pass the second value from iter.get_next() as label
train_op = tf.train.AdamOptimizer().minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # initialise iterator with train data
    sess.run(iter.initializer, feed_dict={ x: train_data[0], y: train_data[1], batch_size: BATCH_SIZE})
    print('Training...')
    for i in range(EPOCHS):
        tot_loss = 0
        for _ in range(n_batches):
            _, loss_value = sess.run([train_op, loss])
            tot_loss += loss_value
        print("Iter: {}, Loss: {:.4f}".format(i, tot_loss / n_batches))
    # initialise iterator with test data
    sess.run(iter.initializer, feed_dict={ x: test_data[0], y: test_data[1], batch_size: test_data[0].shape[0]})
    print('Test Loss: {:4f}'.format(sess.run(loss)))

Training...
Iter: 0, Loss: 0.3523
Iter: 1, Loss: 0.2568
Iter: 2, Loss: 0.1890
Iter: 3, Loss: 0.1561
Iter: 4, Loss: 0.1304
Iter: 5, Loss: 0.1114
Iter: 6, Loss: 0.0956
Iter: 7, Loss: 0.0839
Iter: 8, Loss: 0.0796
Iter: 9, Loss: 0.0710
Test Loss: 0.066840
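
The two model cells chain the transformations in opposite orders: `repeat().batch(...)` in the first and `batch(...).repeat()` in the second. The order changes where batch boundaries fall. The sketch below models this with plain Python lists; the helper `batch` is a stand-in written for this illustration and assumes the default behaviour of keeping a final partial batch.

```python
def batch(elements, batch_size):
    """Group consecutive elements into lists of at most batch_size items."""
    current = []
    for element in elements:
        current.append(element)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        # Final partial batch (kept, as when the remainder is not dropped).
        yield current

data = list(range(10))
epochs = 2

# batch -> repeat: each pass over the data is batched on its own,
# so every epoch ends with the same short batch.
batch_then_repeat = [b for _ in range(epochs) for b in batch(data, 4)]

# repeat -> batch: the repeated stream is batched as one sequence,
# so a batch can straddle the boundary between two epochs.
repeat_then_batch = list(batch(data * epochs, 4))

print(batch_then_repeat)
print(repeat_then_batch)
```

With `batch` before `repeat`, epochs stay cleanly separated at the cost of one smaller batch per epoch. With `repeat` before `batch`, all batches are full (except possibly the very last), but some mix samples from consecutive epochs.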


### Consuming TFRecord data
* The TFRecord file format is a simple record-oriented binary format that many TensorFlow applications use for training data.
* The `tf.data.TFRecordDataset` class enables you to stream over the contents of one or more TFRecord files as part of an input pipeline.