# Getting started with TensorFlow's `Dataset` API

This notebook contains examples of how to build simple input pipelines with TensorFlow's [`tf.data`](https://www.tensorflow.org/api_docs/python/tf/data/) API. The examples construct `Dataset` objects from NumPy arrays held in memory, an approach intended only for very small datasets because it can be considerably inefficient.

More information can be found in the [Importing Data](https://www.tensorflow.org/guide/datasets) section of the TensorFlow guide.

In [1]:
import numpy as np
import tensorflow as tf

In [2]:
tf.VERSION

'1.12.0'

In [3]:
# Create fake data
nsamples = 10
nfeatures = 4
x_numpy = np.random.random((nsamples, nfeatures))
y_numpy = x_numpy.sum(axis=1).astype(int)

# Check the y values
# np.unique(y_numpy)

In [4]:
# Creating a `Dataset` object
dataset = tf.data.Dataset.from_tensor_slices((x_numpy, y_numpy))
dataset = dataset.shuffle(10)
dataset = dataset.batch(1)
dataset = dataset.repeat(1)
iterator = dataset.make_one_shot_iterator()
next_item = iterator.get_next()

# * Dataset.repeat() concatenates the dataset with itself without signaling the end of one epoch
#   and the beginning of the next one.
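
One consequence of `repeat()` not marking epoch boundaries is that the relative order of `batch()` and `repeat()` matters. The sketch below is a plain-Python model of that behaviour, not TensorFlow code: the helpers `repeat`, `batch` and `epoch` are hypothetical stand-ins written only for this illustration.

```python
from itertools import chain, islice


def epoch():
    # One pass over a toy dataset of five elements.
    return range(5)


def repeat(make_epoch, count):
    # Concatenate `count` passes into a single flat stream,
    # with nothing marking where one pass ends and the next begins.
    return chain.from_iterable(make_epoch() for _ in range(count))


def batch(stream, size):
    # Group consecutive elements into lists of at most `size` items.
    stream = iter(stream)
    while True:
        chunk = list(islice(stream, size))
        if not chunk:
            return
        yield chunk


# repeat -> batch: a batch can straddle two epochs.
repeat_then_batch = list(batch(repeat(epoch, 2), 2))

# batch -> repeat: every epoch ends with its own (possibly smaller) batch.
batch_then_repeat = list(repeat(lambda: batch(epoch(), 2), 2))

print(repeat_then_batch)
print(batch_then_repeat)
```

In this model, repeating first produces the batch `[4, 0]`, which mixes the last element of one epoch with the first element of the next, whereas batching first keeps the epochs separate. The cell above batches before repeating, so its batches never cross an epoch boundary.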

In [5]:
# These properties of a Dataset instance allow you to inspect
# the types, classes (always Tensor in these examples) and
# shapes of the components of a dataset element.
print('output_classes:', dataset.output_classes)
print('output_shapes: ', dataset.output_shapes)
print('output_types:  ', dataset.output_types)

output_classes: (<class 'tensorflow.python.framework.ops.Tensor'>, <class 'tensorflow.python.framework.ops.Tensor'>)
output_shapes:  (TensorShape([Dimension(None), Dimension(4)]), TensorShape([Dimension(None)]))
output_types:   (tf.float64, tf.int64)


In [6]:
# If the iterator reaches the end of the dataset (with all the repeats),
# executing the `next_item` operation will raise a `tf.errors.OutOfRangeError`.
with tf.Session() as sess:
    for i in range(10):
        features, label = sess.run(next_item)
        print('features: %s  |  label: %s' % (features, label))

features: [[0.08979985 0.48339875 0.50210204 0.88801814]]  |  label: [1]
features: [[0.03819731 0.01846576 0.10768675 0.00198685]]  |  label: [0]
features: [[0.74537316 0.17139221 0.36994259 0.17733685]]  |  label: [1]
features: [[0.30423547 0.08525225 0.51536512 0.04312901]]  |  label: [0]
features: [[0.41625645 0.44135946 0.08748522 0.58499039]]  |  label: [1]
features: [[0.16022997 0.29757888 0.98556616 0.36110203]]  |  label: [1]
features: [[0.79684992 0.59123642 0.48409895 0.56672481]]  |  label: [2]
features: [[0.0429743  0.53089394 0.79250231 0.92004754]]  |  label: [2]
features: [[0.94971716 0.68010704 0.81005402 0.18546125]]  |  label: [2]
features: [[0.70738882 0.64529225 0.48295785 0.2769267 ]]  |  label: [2]


In [7]:
with tf.Session() as sess:
    try:
        while True:
            features, label = sess.run(next_item)
            print('features: %s  |  label: %s' % (features, label))
    except tf.errors.OutOfRangeError:
        print('The dataset ran out of entries!')

features: [[0.70738882 0.64529225 0.48295785 0.2769267 ]]  |  label: [2]
features: [[0.79684992 0.59123642 0.48409895 0.56672481]]  |  label: [2]
features: [[0.03819731 0.01846576 0.10768675 0.00198685]]  |  label: [0]
features: [[0.94971716 0.68010704 0.81005402 0.18546125]]  |  label: [2]
features: [[0.08979985 0.48339875 0.50210204 0.88801814]]  |  label: [1]
features: [[0.74537316 0.17139221 0.36994259 0.17733685]]  |  label: [1]
features: [[0.30423547 0.08525225 0.51536512 0.04312901]]  |  label: [0]
features: [[0.16022997 0.29757888 0.98556616 0.36110203]]  |  label: [1]
features: [[0.41625645 0.44135946 0.08748522 0.58499039]]  |  label: [1]
features: [[0.0429743  0.53089394 0.79250231 0.92004754]]  |  label: [2]
The dataset ran out of entries!


In [8]:
# tf.errors.OutOfRangeError:
# `tf.train.MonitoredTrainingSession` is necessary when running distributed training
# with TensorFlow. It uses tf.errors.OutOfRangeError to identify the last iteration
# automatically.

with tf.train.MonitoredTrainingSession() as sess:
    while not sess.should_stop():
        features, label = sess.run(next_item)
        print('features: %s  |  label: %s' % (features, label))

INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
features: [[0.0429743  0.53089394 0.79250231 0.92004754]]  |  label: [2]
features: [[0.41625645 0.44135946 0.08748522 0.58499039]]  |  label: [1]
features: [[0.74537316 0.17139221 0.36994259 0.17733685]]  |  label: [1]
features: [[0.08979985 0.48339875 0.50210204 0.88801814]]  |  label: [1]
features: [[0.79684992 0.59123642 0.48409895 0.56672481]]  |  label: [2]
features: [[0.70738882 0.64529225 0.48295785 0.2769267 ]]  |  label: [2]
features: [[0.94971716 0.68010704 0.81005402 0.18546125]]  |  label: [2]
features: [[0.30423547 0.08525225 0.51536512 0.04312901]]  |  label: [0]
features: [[0.16022997 0.29757888 0.98556616 0.36110203]]  |  label: [1]
features: [[0.03819731 0.01846576 0.10768675 0.00198685]]  |  label: [0]


### The `Dataset` object can also be created from Tensor objects

If the input pipeline depends on TensorFlow operations, as in the following cell, we need to use an **initializable iterator** instead of a **one-shot** one and run `sess.run(iterator.initializer)` before starting to iterate. This runs the graph operations that the pipeline depends on (in this case, it evaluates the tensors).
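
As a rough mental model of the difference, the plain-Python analogy below mimics an iterator that cannot be read until it has been initialized, and whose source is evaluated again on every initialization. It is only an analogy under that assumption: the class `InitializableIterator` and the function `make_rows` are hypothetical and are not part of TensorFlow.

```python
class InitializableIterator:
    """Plain-Python stand-in for an iterator that must be initialized before use."""

    def __init__(self, make_source):
        self._make_source = make_source  # callable that produces the data
        self._it = None                  # nothing to read until initialize() runs

    def initialize(self):
        # Evaluate the source now and start a fresh pass over it.
        self._it = iter(self._make_source())

    def get_next(self):
        if self._it is None:
            raise RuntimeError("iterator has not been initialized")
        return next(self._it)  # raises StopIteration once exhausted


calls = {"n": 0}


def make_rows():
    # Stand-in for an operation that has to run before data exists.
    # The call counter makes each evaluation distinguishable.
    calls["n"] += 1
    return [(calls["n"], i) for i in range(3)]


iterator = InitializableIterator(make_rows)

try:
    iterator.get_next()
    failed_before_init = False
except RuntimeError:
    failed_before_init = True

iterator.initialize()
first_pass = [iterator.get_next() for _ in range(3)]

iterator.initialize()  # evaluates the source again and restarts iteration
second_pass = [iterator.get_next() for _ in range(3)]

print(failed_before_init, first_pass, second_pass)
```

The point of the analogy is that initialization is the step at which the data-producing operation actually runs, which is why it has to happen inside a session before the first element is requested.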

In [9]:
x_tensor = tf.random.uniform([nsamples, nfeatures])
y_tensor = tf.cast(tf.reduce_sum(x_tensor, axis=1), tf.int32)

In [10]:
dataset = tf.data.Dataset.from_tensor_slices((x_tensor, y_tensor))
dataset = dataset.shuffle(10)
dataset = dataset.batch(1)
dataset = dataset.repeat(1)
iterator = dataset.make_initializable_iterator()
next_item = iterator.get_next()

In [11]:
with tf.train.MonitoredTrainingSession() as sess:
    sess.run(iterator.initializer)
    while not sess.should_stop():
        features, label = sess.run(next_item)
        print('features: %s  |  label: %s' % (features, label))

INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
features: [[0.8323463  0.19634986 0.28265893 0.58835554]]  |  label: [1]
features: [[0.5557394  0.15045857 0.42133427 0.57148683]]  |  label: [1]
features: [[0.07681918 0.48761415 0.34690034 0.07560027]]  |  label: [0]
features: [[0.6231345  0.12168407 0.82891834 0.4291649 ]]  |  label: [2]
features: [[0.54758704 0.5609318  0.51281536 0.20193076]]  |  label: [1]
features: [[0.9646871  0.44459295 0.22541761 0.20803297]]  |  label: [1]
features: [[0.4618914  0.38408446 0.47942042 0.77109647]]  |  label: [2]
features: [[0.21837783 0.8274019  0.41136444 0.9828838 ]]  |  label: [2]
features: [[0.55970013 0.93917    0.6917205  0.6595677 ]]  |  label: [2]
features: [[0.7341771  0.6856011  0.17116964 0.85631883]]  |  label: [2]


[`tf.data.Dataset.from_tensor_slices`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices) embeds the features and labels arrays in your TensorFlow graph as `tf.constant` operations. This works well for a small dataset, but wastes memory because the contents of the array will be copied multiple times.

As an alternative, a **feedable** `Dataset` can be defined in terms of `tf.placeholder` tensors, with the NumPy arrays fed when an iterator is initialized over the dataset. However, this is still very inefficient!

In [12]:
features_placeholder = tf.placeholder(x_numpy.dtype, x_numpy.shape)
label_placeholder = tf.placeholder(y_numpy.dtype, y_numpy.shape)

In [13]:
dataset = tf.data.Dataset.from_tensor_slices((features_placeholder, label_placeholder))
dataset = dataset.shuffle(10)
dataset = dataset.batch(1)
dataset = dataset.repeat(1)
# transformations
iterator = dataset.make_initializable_iterator()
next_item = iterator.get_next()

In [14]:
with tf.train.MonitoredTrainingSession() as sess:
    sess.run(iterator.initializer, feed_dict={features_placeholder: x_numpy,
                                              label_placeholder: y_numpy})
    while not sess.should_stop():
        features, label = sess.run(next_item)
        print('features: %s  |  label: %s' % (features, label))

INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
features: [[0.03819731 0.01846576 0.10768675 0.00198685]]  |  label: [0]
features: [[0.94971716 0.68010704 0.81005402 0.18546125]]  |  label: [2]
features: [[0.30423547 0.08525225 0.51536512 0.04312901]]  |  label: [0]
features: [[0.74537316 0.17139221 0.36994259 0.17733685]]  |  label: [1]
features: [[0.41625645 0.44135946 0.08748522 0.58499039]]  |  label: [1]
features: [[0.70738882 0.64529225 0.48295785 0.2769267 ]]  |  label: [2]
features: [[0.16022997 0.29757888 0.98556616 0.36110203]]  |  label: [1]
features: [[0.79684992 0.59123642 0.48409895 0.56672481]]  |  label: [2]
features: [[0.0429743  0.53089394 0.79250231 0.92004754]]  |  label: [2]
features: [[0.08979985 0.48339875 0.50210204 0.88801814]]  |  label: [1]


## Add data transformations to the pipeline

Let's say that for our problem it is beneficial to center the features between -0.5 and 0.5. We would also like to transform the labels from integers to one-hot encoded vectors. This can be done with the `Dataset` method `map`.
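
To make the two per-element transformations concrete, here is a minimal plain-Python sketch of what they compute for a single `(features, label)` pair. It uses lists instead of tensors, and the names `center_row`, `one_hot_row` and the sample `rows` are hypothetical, chosen only for this illustration. A depth of 4 is assumed because each label is the integer part of a sum of four values in [0, 1), so it lies between 0 and 3.

```python
def center_row(features, label):
    # Shift every feature from [0, 1) to [-0.5, 0.5).
    return [x - 0.5 for x in features], label


def one_hot_row(features, label, depth=4):
    # Replace the integer label with a vector holding a single 1.0.
    return features, [1.0 if i == label else 0.0 for i in range(depth)]


# Two made-up rows; each label is int(sum(features)).
rows = [
    ([0.75, 0.25, 0.5, 0.0], 1),
    ([0.875, 0.5, 0.75, 0.875], 3),
]

# Apply the functions one after the other, element by element.
transformed = [one_hot_row(*center_row(f, l)) for f, l in rows]

for features, label in transformed:
    print(features, label)
```

Chaining two small functions this way mirrors calling `map` twice on a `Dataset`: each call sees the output of the previous one.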

In [15]:
# The following transformations are quite simple and can be done
# in a single function, but we will use two different functions
# to show how operations can be pipelined.

def center(*row):
    features = row[0] - 0.5
    label = row[1]
    return features, label

def make_on_hot_labels(features, label):
    return features, tf.one_hot(label, 4)

# simpler with `dataset = dataset.filter(lambda f, l: tf.equal(l, 1))`
def filter_labels(features, label):
    return tf.equal(label, 1)

# simpler with `dataset = dataset.filter(lambda f, l: tf.greater(f[0], 0))`
def filter_features(features, label):
    return tf.greater(features[0], 0)

In [16]:
dataset = tf.data.Dataset.from_tensor_slices((x_numpy, y_numpy))
dataset = dataset.filter(filter_labels)
dataset = dataset.map(center)
dataset = dataset.filter(filter_features)
dataset = dataset.map(make_on_hot_labels)
dataset = dataset.shuffle(150)
dataset = dataset.batch(1)
dataset = dataset.repeat(1)
next_item = dataset.make_one_shot_iterator().get_next()

In [17]:
with tf.train.MonitoredTrainingSession() as sess:
    while not sess.should_stop():
        features, label = sess.run(next_item)
        print('features: %s  |  label: %s' % (features, label))

INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
features: [[ 0.24537316 -0.32860779 -0.13005741 -0.32266315]]  |  label: [[0. 1. 0. 0.]]
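
Only one element survives because the filters and maps are applied in the order they were added. The plain-Python sketch below replays the same sequence on a few of the rows printed earlier in this notebook, using the built-in `filter` and `map` as stand-ins for the `Dataset` methods. It is a model of the logic, not TensorFlow code.

```python
# A subset of the (features, label) rows printed earlier in the notebook.
rows = [
    ([0.08979985, 0.48339875, 0.50210204, 0.88801814], 1),
    ([0.03819731, 0.01846576, 0.10768675, 0.00198685], 0),
    ([0.74537316, 0.17139221, 0.36994259, 0.17733685], 1),
    ([0.41625645, 0.44135946, 0.08748522, 0.58499039], 1),
    ([0.16022997, 0.29757888, 0.98556616, 0.36110203], 1),
    ([0.79684992, 0.59123642, 0.48409895, 0.56672481], 2),
]

stream = iter(rows)
# Same role as filter_labels: keep rows whose label is 1.
stream = filter(lambda row: row[1] == 1, stream)
# Same role as center: subtract 0.5 from every feature.
stream = map(lambda row: ([x - 0.5 for x in row[0]], row[1]), stream)
# Same role as filter_features: keep rows whose first centered feature is positive.
stream = filter(lambda row: row[0][0] > 0, stream)
result = list(stream)

# For comparison: testing the first feature before centering keeps every label-1 row.
uncentered = [row for row in rows if row[1] == 1 and row[0][0] > 0]

print(len(result), len(uncentered))
```

Four rows have label 1, but after centering only the row whose first feature was 0.74537316 is still positive in its first component, which matches the single line of output above.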


## Giving names to `Dataset` components

In [18]:
dataset = tf.data.Dataset.from_tensor_slices({'features': x_numpy, 'label': y_numpy})
print('output_classes:', dataset.output_classes)
print('output_shapes: ', dataset.output_shapes)
print('output_types:  ', dataset.output_types)

output_classes: {'features': <class 'tensorflow.python.framework.ops.Tensor'>, 'label': <class 'tensorflow.python.framework.ops.Tensor'>}
output_shapes:  {'features': TensorShape([Dimension(4)]), 'label': TensorShape([])}
output_types:   {'features': tf.float64, 'label': tf.int64}


In [19]:
def center(row):
    features = row['features'] - 0.5
    label = row['label']
    return features, label

In [20]:
dataset = tf.data.Dataset.from_tensor_slices({'features': x_numpy, 'label': y_numpy})
dataset = dataset.map(center)
dataset = dataset.shuffle(150)
dataset = dataset.batch(1)   # batch size
dataset = dataset.repeat(1)  # number of epochs
next_item = dataset.make_one_shot_iterator().get_next()

In [21]:
with tf.train.MonitoredTrainingSession() as sess:
    while not sess.should_stop():
        features, label = sess.run(next_item)
        print('features: %s  |  label: %s' % (features, label))

INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
features: [[-0.19576453 -0.41474775  0.01536512 -0.45687099]]  |  label: [0]
features: [[-0.46180269 -0.48153424 -0.39231325 -0.49801315]]  |  label: [0]
features: [[ 0.20738882  0.14529225 -0.01704215 -0.2230733 ]]  |  label: [2]
features: [[-0.33977003 -0.20242112  0.48556616 -0.13889797]]  |  label: [1]
features: [[ 0.44971716  0.18010704  0.31005402 -0.31453875]]  |  label: [2]
features: [[ 0.24537316 -0.32860779 -0.13005741 -0.32266315]]  |  label: [1]
features: [[-0.4570257   0.03089394  0.29250231  0.42004754]]  |  label: [2]
features: [[-0.41020015 -0.01660125  0.00210204  0.38801814]]  |  label: [1]
features: [[-0.08374355 -0.05864054 -0.41251478  0.08499039]]  |  label: [1]
features: [[ 0.29684992  0.09123642 -0.01590105  0.06672481]]  |  label: [2]
