# Getting started with TensorFlow's `Dataset` API with eager execution

This notebook shows how to build simple input pipelines with TensorFlow's [`tf.data`](https://www.tensorflow.org/api_docs/python/tf/data/) API. The examples construct `Dataset` objects from NumPy arrays held in memory, an approach suited only to very small datasets because it can be considerably inefficient.

More information is available in the [Importing Data](https://www.tensorflow.org/guide/datasets) section of the TensorFlow guide.

In [1]:
import numpy as np
import tensorflow as tf

In [2]:
tf.enable_eager_execution()

In [3]:
tf.VERSION

'1.13.1'

In [5]:
# Create fake data
nsamples = 20
nfeatures = 4
x_numpy = np.random.random((nsamples, nfeatures))
y_numpy = x_numpy.sum(axis=1).astype(int)

In [6]:
# Creating a `Dataset` object
# With eager execution there is no need to create an iterator object
# to iterate over the dataset
dataset = tf.data.Dataset.from_tensor_slices((x_numpy, y_numpy))
dataset = dataset.shuffle(10)
dataset = dataset.batch(1)
dataset = dataset.repeat(1)
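
Note that `shuffle(10)` on 20 samples does not produce a uniform shuffle: elements are drawn from a buffer of 10. The following is a pure-Python sketch of that buffering idea as described in the `tf.data` documentation. It is not TensorFlow's implementation, and `buffered_shuffle` is a hypothetical helper name used only for illustration.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill phase: nothing is emitted until the buffer is full.
            buffer.append(item)
            continue
        # Buffer is full: emit a random buffered element and
        # store the incoming item in the freed slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(20), buffer_size=10, seed=0))
print(shuffled)
```

In this sketch the first emitted element always comes from the first 10 inputs, so a buffer at least as large as the dataset is needed for a full shuffle.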

In [7]:
# These properties of a Dataset instance let you inspect
# the types, classes (they are always Tensor here, though) and
# shapes of the components of a dataset element.
print('output_classes:', dataset.output_classes)
print('output_shapes: ', dataset.output_shapes)
print('output_types:  ', dataset.output_types)

output_classes: (<class 'tensorflow.python.framework.ops.Tensor'>, <class 'tensorflow.python.framework.ops.Tensor'>)
output_shapes:  (TensorShape([Dimension(None), Dimension(4)]), TensorShape([Dimension(None)]))
output_types:   (tf.float64, tf.int64)
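
The leading `Dimension(None)` in the shapes above is the batch dimension. It is reported as unknown because, unless `drop_remainder=True` is passed to `batch`, the last batch may hold fewer elements than the requested batch size. A pure-Python sketch of that behaviour follows; `batch` here is a hypothetical stand-in, not the TensorFlow method.

```python
def batch(items, batch_size):
    """Group items into lists of batch_size; the last list may be shorter."""
    current = []
    for item in items:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        # Trailing partial batch (what drop_remainder=True would discard).
        yield current

sizes = [len(b) for b in batch(range(20), 6)]
print(sizes)  # [6, 6, 6, 2]
```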


In [8]:
for features, label in dataset:
    print('features: %s  |  label: %s' % (features.numpy(), label.numpy()))

Instructions for updating:
Colocations handled automatically by placer.
features: [[0.57450237 0.21575314 0.78712689 0.28127561]]  |  label: [1]
features: [[0.01352067 0.13658407 0.30805905 0.83829867]]  |  label: [1]
features: [[0.78760446 0.85005774 0.92827435 0.78870465]]  |  label: [3]
features: [[0.32542749 0.54024076 0.85206354 0.92061661]]  |  label: [2]
features: [[0.12255501 0.39644084 0.98641667 0.07655477]]  |  label: [1]
features: [[0.62194791 0.34082038 0.06981713 0.48575887]]  |  label: [1]
features: [[0.13269982 0.32203321 0.08099787 0.20184865]]  |  label: [0]
features: [[0.44447412 0.31827045 0.87963864 0.20281922]]  |  label: [1]
features: [[0.45134301 0.50320016 0.32430183 0.88046682]]  |  label: [2]
features: [[0.30191061 0.65322297 0.54011029 0.73059632]]  |  label: [2]
features: [[0.12390366 0.34119404 0.79383614 0.70713711]]  |  label: [1]
features: [[0.78364559 0.009132   0.03656106 0.09885238]]  |  label: [0]
features: [[0.19541322 0.74802709 0.36409407 0.25749

With eager execution, `tf.errors.OutOfRangeError` is no longer raised when the iterator reaches the end of the dataset (after all the repeats); the `for` loop simply ends.
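
As a rough analogy, an eagerly iterated dataset behaves like any other Python iterable. The sketch below uses plain generators as stand-ins (`repeat` and `one_epoch` are hypothetical names, not TensorFlow functions) to show that the end of the data is signalled through Python's `StopIteration`, which a `for` loop handles silently.

```python
def one_epoch():
    """Stand-in for one pass over a tiny dataset."""
    return iter([0, 1, 2])

def repeat(make_iterable, count):
    """Replay the source iterable `count` times, then stop."""
    for _ in range(count):
        yield from make_iterable()

# A for loop ends cleanly once all repeats are consumed.
seen = [element for element in repeat(one_epoch, count=2)]
print(seen)  # [0, 1, 2, 0, 1, 2]

# Pulling elements by hand exposes the StopIteration signal.
iterator = repeat(one_epoch, count=1)
consumed = []
while True:
    try:
        consumed.append(next(iterator))
    except StopIteration:
        break
print(consumed)  # [0, 1, 2]
```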

### The `Dataset` object can also be created from Tensor objects

In [9]:
x_tensor = tf.random.uniform([nsamples, nfeatures])
y_tensor = tf.cast(tf.reduce_sum(x_tensor, axis=1), tf.int32)

In [10]:
dataset = tf.data.Dataset.from_tensor_slices((x_tensor, y_tensor))
dataset = dataset.shuffle(10)
dataset = dataset.batch(1)
dataset = dataset.repeat(1)

In [11]:
for features, label in dataset:
    print('features: %s  |  label: %s' % (features.numpy(), label.numpy()))

features: [[0.9508163  0.388245   0.10347426 0.17912197]]  |  label: [1]
features: [[0.7252369  0.0251298  0.05889571 0.60462415]]  |  label: [1]
features: [[0.9087721  0.6278064  0.93858564 0.76789796]]  |  label: [3]
features: [[0.3583138  0.07513583 0.38090575 0.20150185]]  |  label: [1]
features: [[0.99589133 0.70830715 0.15883005 0.05250812]]  |  label: [1]
features: [[0.7065567  0.55317616 0.39521015 0.6568639 ]]  |  label: [2]
features: [[0.18450737 0.6289238  0.9752656  0.44455767]]  |  label: [2]
features: [[0.4589064  0.9014076  0.9669143  0.45641887]]  |  label: [2]
features: [[0.37846136 0.47982407 0.23239541 0.6978121 ]]  |  label: [1]
features: [[0.02581453 0.24033892 0.07214499 0.02725887]]  |  label: [0]
features: [[0.07872832 0.4945264  0.11889136 0.73061514]]  |  label: [1]
features: [[0.7008077  0.8412634  0.46224904 0.30928707]]  |  label: [2]
features: [[0.26136672 0.07456124 0.17268598 0.19749177]]  |  label: [0]
features: [[0.20625508 0.11120594 0.42264736 0.8693

[`tf.data.Dataset.from_tensor_slices`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices) embeds the features and labels arrays in your TensorFlow graph as `tf.constant` operations. This works well for a small dataset, but wastes memory because the contents of the array will be copied multiple times.
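
To make the word "slices" concrete: the inputs are cut along their first dimension, so element `i` of the dataset pairs row `i` of the features with label `i`. A pure-Python sketch of that pairing, with nested lists standing in for arrays (`from_slices` is a hypothetical helper, not part of TensorFlow):

```python
def from_slices(features, labels):
    """Pair the i-th row of features with the i-th label."""
    if len(features) != len(labels):
        # All components must agree on the size of the first dimension.
        raise ValueError("components must have the same first dimension")
    return list(zip(features, labels))

x = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 samples, 2 features each
y = [0, 0, 1]                             # 3 labels

elements = from_slices(x, y)
print(len(elements))  # 3
print(elements[0])    # ([0.1, 0.2], 0)
```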

## Add data transformations to the pipeline

Let's say that for our problem it is beneficial to center the features between -0.5 and 0.5. We would also like to transform the labels from integers to one-hot encoded vectors. This can be done with the `Dataset` method `map`.
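
Before wiring these into the pipeline, here is what the two transformations compute on a single row, written as a pure-Python sketch (`center_row` and `one_hot` are illustrative helpers, not the TensorFlow ops). A depth of 4 is enough for the labels because the integer part of a sum of four values in [0, 1) can only be 0, 1, 2 or 3.

```python
def center_row(features):
    """Shift values from the range [0, 1) to [-0.5, 0.5)."""
    return [value - 0.5 for value in features]

def one_hot(label, depth):
    """Return a vector of length depth with 1.0 at position label."""
    return [1.0 if index == label else 0.0 for index in range(depth)]

print(center_row([0.0, 0.5, 1.0]))  # [-0.5, 0.0, 0.5]
print(one_hot(1, 4))                # [0.0, 1.0, 0.0, 0.0]
```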

In [12]:
# The following transformations are quite simple and could be done
# in a single function, but we use two different functions
# to show how operations can be pipelined.

def center(*row):
    features = row[0] - 0.5
    label = row[1]
    return features, label

def make_on_hot_labels(features, label):
    return features, tf.one_hot(label, 4)

# simpler with `dataset = dataset.filter(lambda f, l: tf.equal(l, 1))`
def filter_labels(features, label):
    return tf.equal(label, 1)

# simpler with `dataset = dataset.filter(lambda f, l: tf.greater(f[0], 0))`
def filter_features(features, label):
    return tf.greater(features[0], 0)

In [14]:
dataset = tf.data.Dataset.from_tensor_slices((x_numpy, y_numpy))
dataset = dataset.filter(filter_labels)
dataset = dataset.map(center)
dataset = dataset.filter(filter_features)
dataset = dataset.map(make_on_hot_labels)
dataset = dataset.shuffle(150)
dataset = dataset.batch(1)
dataset = dataset.repeat(2)

for features, label in dataset:
    print('features: %s  |  label: %s' % (features.numpy(), label.numpy()))
    
# Note that nothing may be printed here: with random data, the two filter
# operations can remove every element. In that case, regenerate the data
# and run the cell again.

features: [[ 0.07450237 -0.28424686  0.28712689 -0.21872439]]  |  label: [[0. 1. 0. 0.]]
features: [[ 0.12194791 -0.15917962 -0.43018287 -0.01424113]]  |  label: [[0. 1. 0. 0.]]
features: [[ 0.12194791 -0.15917962 -0.43018287 -0.01424113]]  |  label: [[0. 1. 0. 0.]]
features: [[ 0.07450237 -0.28424686  0.28712689 -0.21872439]]  |  label: [[0. 1. 0. 0.]]
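
The output above contains only two distinct rows, each printed twice because of `repeat(2)`. The pure-Python sketch below replays the filter, center, filter steps on five of the rows printed earlier in this notebook (a subset chosen for illustration, not the full dataset) to show why so few survive.

```python
# (features, label) pairs copied from the earlier printed output.
rows = [
    ([0.57450237, 0.21575314, 0.78712689, 0.28127561], 1),
    ([0.01352067, 0.13658407, 0.30805905, 0.83829867], 1),
    ([0.78760446, 0.85005774, 0.92827435, 0.78870465], 3),
    ([0.62194791, 0.34082038, 0.06981713, 0.48575887], 1),
    ([0.13269982, 0.32203321, 0.08099787, 0.20184865], 0),
]

# Step 1: keep only rows whose label is 1.
kept = [(f, l) for f, l in rows if l == 1]
# Step 2: center the features.
centered = [([v - 0.5 for v in f], l) for f, l in kept]
# Step 3: keep rows whose first centered feature is positive.
survivors = [(f, l) for f, l in centered if f[0] > 0]

print(len(kept), len(survivors))  # 3 2
```

Only rows whose original first feature exceeds 0.5 pass the second filter, which is why both filters together can empty the dataset.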


## Giving names to `Dataset` components

In [15]:
named_dataset = tf.data.Dataset.from_tensor_slices({'features': x_numpy, 'label': y_numpy})
print('output_classes:', named_dataset.output_classes)
print('output_shapes: ', named_dataset.output_shapes)
print('output_types:  ', named_dataset.output_types)

output_classes: {'features': <class 'tensorflow.python.framework.ops.Tensor'>, 'label': <class 'tensorflow.python.framework.ops.Tensor'>}
output_shapes:  {'features': TensorShape([Dimension(4)]), 'label': TensorShape([])}
output_types:   {'features': tf.float64, 'label': tf.int64}
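
One practical consequence of naming the components is how a function passed to `map` receives each element: tuple components arrive as separate positional arguments, while a dictionary arrives as a single argument. The following pure-Python sketch mimics that calling convention; `apply_map`, `center_tuple` and `center_dict` are hypothetical names used only for illustration.

```python
def apply_map(fn, element):
    """Call fn on one element using the convention described above."""
    if isinstance(element, tuple):
        # Tuple components are unpacked into positional arguments.
        return fn(*element)
    # A dict is handed over whole, as one argument.
    return fn(element)

def center_tuple(features, label):
    return [v - 0.5 for v in features], label

def center_dict(row):
    return [v - 0.5 for v in row['features']], row['label']

from_tuple = apply_map(center_tuple, ([0.5, 1.0], 2))
from_dict = apply_map(center_dict, {'features': [0.5, 1.0], 'label': 2})
print(from_tuple)  # ([0.0, 0.5], 2)
print(from_dict)   # ([0.0, 0.5], 2)
```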


In [16]:
def center(row):
    features = row['features'] - 0.5
    label = row['label']
    return features, label

In [17]:
named_dataset = tf.data.Dataset.from_tensor_slices({'features': x_numpy, 'label': y_numpy})
named_dataset = named_dataset.map(center)
named_dataset = named_dataset.shuffle(150)
named_dataset = named_dataset.batch(1)   # batch size
named_dataset = named_dataset.repeat(1)  # number of epochs

for features, label in named_dataset:
    print('features: %s  |  label: %s' % (features.numpy(), label.numpy()))

features: [[-0.19808939  0.15322297  0.04011029  0.23059632]]  |  label: [2]
features: [[ 0.28364559 -0.490868   -0.46343894 -0.40114762]]  |  label: [0]
features: [[-0.36730018 -0.17796679 -0.41900213 -0.29815135]]  |  label: [0]
features: [[-0.47174095 -0.25192335  0.08604084  0.29782747]]  |  label: [1]
features: [[-0.20132728 -0.44612544 -0.32267796 -0.2770587 ]]  |  label: [0]
features: [[ 0.33631338  0.11832159  0.49789804 -0.19494417]]  |  label: [2]
features: [[ 0.12194791 -0.15917962 -0.43018287 -0.01424113]]  |  label: [1]
features: [[-0.48647933 -0.36341593 -0.19194095  0.33829867]]  |  label: [1]
features: [[0.28760446 0.35005774 0.42827435 0.28870465]]  |  label: [3]
features: [[ 0.48564307 -0.23078198  0.02344173  0.13675444]]  |  label: [2]
features: [[-0.37609634 -0.15880596  0.29383614  0.20713711]]  |  label: [1]
features: [[-0.30458678  0.24802709 -0.13590593 -0.2425086 ]]  |  label: [1]
features: [[-0.17457251  0.04024076  0.35206354  0.42061661]]  |  label: [2]
fea