# Getting started with TensorFlow's `Dataset` API

This notebook contains examples of how to build simple input pipelines with TensorFlow's [`tf.data`](https://www.tensorflow.org/api_docs/python/tf/data/) API. The examples construct `Dataset` objects from NumPy arrays held in memory, an approach intended only for very small datasets because it can be considerably inefficient.

More information can be found in the [Importing Data](https://www.tensorflow.org/guide/datasets) section of TensorFlow's guide.

In [1]:
import numpy as np
import tensorflow as tf

In [2]:
tf.version.VERSION

'2.3.0'

In [3]:
# Create fake data
nsamples = 10
nfeatures = 4
x_numpy = np.random.random((nsamples, nfeatures))
y_numpy = x_numpy.sum(axis=1).astype(int)

In [22]:
# Creating a `Dataset` object
dataset = tf.data.Dataset.from_tensor_slices((x_numpy, y_numpy))
dataset = dataset.shuffle(10)
dataset = dataset.batch(1)
dataset = dataset.repeat(1)

# * Dataset.repeat() concatenates the dataset without signaling the end of one epoch
#   and the beginning of the next one.
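
The comment on `repeat()` can be checked with a toy dataset. The following is a minimal sketch, assuming six integer samples and a batch size of 4 (both chosen only for illustration). It suggests that the position of `repeat()` relative to `batch()` determines whether batches can straddle two epochs.

```python
import numpy as np
import tensorflow as tf

# Six toy samples (0..5); the values themselves are irrelevant here.
toy = tf.data.Dataset.from_tensor_slices(np.arange(6))

# shuffle -> batch -> repeat: each epoch is shuffled and batched on its own,
# and repeat(2) chains the two epochs into one stream.
per_epoch = toy.shuffle(6).batch(4).repeat(2)
batch_sizes = [int(b.shape[0]) for b in per_epoch]
print(batch_sizes)  # [4, 2, 4, 2]

# repeat -> batch: the 12 samples form one stream, so a batch may mix
# samples from the end of one epoch and the start of the next.
mixed = toy.repeat(2).batch(4)
mixed_sizes = [int(b.shape[0]) for b in mixed]
print(mixed_sizes)  # [4, 4, 4]
```

In the first ordering the short batch of 2 is the only hint of an epoch boundary. In the second ordering there is none at all.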

In [23]:
for x, y in dataset:
    print(f'x: {x}    y: {y}')

x: [[0.52729416 0.97212842 0.79189043 0.11363075]]    y: [2]
x: [[0.44317303 0.70935323 0.82828039 0.05453286]]    y: [2]
x: [[0.5501465  0.13939612 0.32231135 0.52412096]]    y: [1]
x: [[0.63984006 0.58797306 0.71099367 0.61624241]]    y: [2]
x: [[0.09530047 0.28822325 0.69780745 0.84440639]]    y: [1]
x: [[0.05906955 0.33904709 0.58170577 0.24772649]]    y: [1]
x: [[0.73532572 0.13725345 0.08542806 0.76036736]]    y: [1]
x: [[0.47149137 0.94588893 0.36952536 0.93342537]]    y: [2]
x: [[0.29603562 0.24451967 0.96109423 0.48680933]]    y: [1]
x: [[0.79008348 0.84817151 0.19547031 0.15326672]]    y: [1]


In [24]:
# iterate over the first 5 batches (5 samples here, because the batch size is 1)
for x, y in dataset.take(5):
    print(f'x: {x}    y: {y}')

x: [[0.29603562 0.24451967 0.96109423 0.48680933]]    y: [1]
x: [[0.63984006 0.58797306 0.71099367 0.61624241]]    y: [2]
x: [[0.52729416 0.97212842 0.79189043 0.11363075]]    y: [2]
x: [[0.5501465  0.13939612 0.32231135 0.52412096]]    y: [1]
x: [[0.44317303 0.70935323 0.82828039 0.05453286]]    y: [2]
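
Note that `take()` counts elements of the dataset it is called on, so its meaning depends on where it sits in the pipeline. A small sketch, assuming ten integer samples and a batch size of 2 (illustrative values, not taken from the cells above):

```python
import numpy as np
import tensorflow as tf

samples = tf.data.Dataset.from_tensor_slices(np.arange(10))

# After batch(2) one element is a batch, so take(3) yields 3 batches = 6 samples.
batched_then_taken = samples.batch(2).take(3)
n_samples = sum(int(b.shape[0]) for b in batched_then_taken)
print(n_samples)  # 6

# Before batch(), take(3) limits the number of individual samples instead.
taken_then_batched = samples.take(3).batch(2)
sizes = [int(b.shape[0]) for b in taken_then_batched]
print(sizes)  # [2, 1]
```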


### The `Dataset` object can also be created from Tensor objects

If the input pipeline depends on TensorFlow operations, as in the following cell, TensorFlow 1.x graph mode requires an **initializable iterator** instead of a **one-shot** one, and `sess.run(iterator.initializer)` must be run before iterating. This executes the graph operations that the pipeline needs (in this case, evaluating the tensors). With eager execution, which is the default in TensorFlow 2.x and the mode used in this notebook, the tensors are evaluated immediately and the dataset can be iterated directly.

In [26]:
x_tensor = tf.random.uniform([nsamples, nfeatures])
y_tensor = tf.cast(tf.reduce_sum(x_tensor, axis=1), tf.int32)

In [27]:
dataset = tf.data.Dataset.from_tensor_slices((x_tensor, y_tensor))
dataset = dataset.shuffle(10)
dataset = dataset.batch(1)
dataset = dataset.repeat(1)

In [28]:
for x, y in dataset.take(5):
    print(f'x: {x}    y: {y}')

x: [[0.60480237 0.37253392 0.01995909 0.73342705]]    y: [1]
x: [[0.15328169 0.34407985 0.48509598 0.39430058]]    y: [1]
x: [[0.40143    0.2087518  0.45996332 0.683092  ]]    y: [1]
x: [[0.37643218 0.7631918  0.7444774  0.6310526 ]]    y: [2]
x: [[0.5774617  0.06088769 0.42384326 0.81501126]]    y: [1]
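
With eager execution (the default in TensorFlow 2.x), a `Dataset` behaves like a regular Python iterable, so no initializer has to be run. The sketch below creates the iterator explicitly with `iter()` and `next()`. The variable names and the batch size of 5 are assumptions made for this illustration.

```python
import tensorflow as tf

x = tf.random.uniform([10, 4])
y = tf.cast(tf.reduce_sum(x, axis=1), tf.int32)
ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(5)

# iter() creates a fresh iterator; there is no initializer op to run.
it = iter(ds)
first_x, first_y = next(it)
print(tuple(first_x.shape), tuple(first_y.shape))  # (5, 4) (5,)

# Every new loop over the dataset starts again from the beginning.
n_batches = sum(1 for _ in ds)
print(n_batches)  # 2
```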


[`tf.data.Dataset.from_tensor_slices`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices) embeds the feature and label arrays in your TensorFlow graph as `tf.constant` operations. This works well for a small dataset but wastes memory, because the contents of the arrays are copied multiple times.
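
One possible way to avoid embedding the arrays is to feed samples lazily through `tf.data.Dataset.from_generator`. The following is only a sketch: `sample_generator` is a hypothetical helper, and the arrays are small stand-ins for data that would normally be read from disk. It uses the `output_types` and `output_shapes` arguments, which exist in TensorFlow 2.3; newer releases recommend `output_signature` instead.

```python
import numpy as np
import tensorflow as tf

# Stand-in data; in practice the generator would read from files.
x_big = np.random.random((10, 4))
y_big = x_big.sum(axis=1).astype(np.int64)

def sample_generator():
    # Hypothetical helper: yields one (features, label) pair at a time.
    for features, label in zip(x_big, y_big):
        yield features, label

gen_dataset = tf.data.Dataset.from_generator(
    sample_generator,
    output_types=(tf.float64, tf.int64),
    output_shapes=((4,), ()),
)

# The resulting dataset supports the same transformations as before.
gen_batches = [(tuple(fx.shape), tuple(fy.shape)) for fx, fy in gen_dataset.batch(5)]
print(gen_batches)  # [((5, 4), (5,)), ((5, 4), (5,))]
```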

## Add data transformations to the pipeline

Let's say that for our problem it is beneficial to center the features between -0.5 and 0.5. We would also like to transform the labels from integers to one-hot encoded vectors. Both can be done with the `Dataset` method `map`.

In [30]:
# The following transformations are quite simple and could be done
# in a single function, but we will use two different functions
# to show how operations can be pipelined.

def center(*row):
    features = row[0] - 0.5
    label = row[1]
    return features, label

def make_on_hot_labels(features, label):
    return features, tf.one_hot(label, 4)

# simpler with `dataset = dataset.filter(lambda f, l: tf.equal(l, 1))`
def filter_labels(features, label):
    return tf.equal(label, 1)

# simpler with `dataset = dataset.filter(lambda f, l: tf.greater(f[0], 0))`
def filter_features(features, label):
    return tf.greater(features[0], 0)

In [31]:
dataset = tf.data.Dataset.from_tensor_slices((x_numpy, y_numpy))
dataset = dataset.filter(filter_labels)
dataset = dataset.map(center)
dataset = dataset.filter(filter_features)
dataset = dataset.map(make_on_hot_labels)
dataset = dataset.shuffle(150)
dataset = dataset.batch(1)
dataset = dataset.repeat(1)

In [32]:
for x, y in dataset.take(5):
    print(f'x: {x}    y: {y}')

x: [[ 0.29008348  0.34817151 -0.30452969 -0.34673328]]    y: [[0. 1. 0. 0.]]
x: [[ 0.0501465  -0.36060388 -0.17768865  0.02412096]]    y: [[0. 1. 0. 0.]]
x: [[ 0.23532572 -0.36274655 -0.41457194  0.26036736]]    y: [[0. 1. 0. 0.]]
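
For comparison, the same pipeline can be written with a single `map` function and lambda filters. This is a sketch on a small hand-written array (an assumption made so that the result is deterministic), and `preprocess` is a name introduced only for this example.

```python
import numpy as np
import tensorflow as tf

x = np.array([[0.9, 0.1, 0.2, 0.1],   # label 1, first centered feature > 0: kept
              [0.1, 0.5, 0.6, 0.2],   # label 1, first centered feature < 0: dropped
              [0.9, 0.8, 0.7, 0.1],   # label 2: dropped by the label filter
              [0.6, 0.1, 0.1, 0.1]])  # label 0: dropped by the label filter
y = x.sum(axis=1).astype(int)

def preprocess(features, label):
    # Center the features and one-hot encode the label in one step.
    return features - 0.5, tf.one_hot(label, 4)

fused = (
    tf.data.Dataset.from_tensor_slices((x, y))
    .filter(lambda f, l: tf.equal(l, 1))       # keep only label 1
    .map(preprocess)
    .filter(lambda f, l: tf.greater(f[0], 0))  # keep a positive first feature
    .batch(1)
)

results = list(fused.as_numpy_iterator())
for features, label in results:
    print(f'x: {features}    y: {label}')
```

The label filter has to run before `preprocess` here, because after one-hot encoding the label is no longer a scalar that can be compared with 1.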


## Giving names to `Dataset` components

In [34]:
dataset = tf.data.Dataset.from_tensor_slices({'features': x_numpy, 'label': y_numpy})

In [43]:
def center(row):
    features = row['features'] - 0.5
    label = row['label']
    return {'features': features, 'label': label}

In [44]:
dataset = tf.data.Dataset.from_tensor_slices({'features': x_numpy, 'label': y_numpy})
dataset = dataset.map(center)
dataset = dataset.shuffle(150)
dataset = dataset.batch(1)
dataset = dataset.repeat(1)

In [52]:
for d in dataset.take(5):
    print(f"x: {d['features']}    y: {d['label']}")

x: [[ 0.0501465  -0.36060388 -0.17768865  0.02412096]]    y: [1]
x: [[ 0.29008348  0.34817151 -0.30452969 -0.34673328]]    y: [1]
x: [[-0.44093045 -0.16095291  0.08170577 -0.25227351]]    y: [1]
x: [[ 0.02729416  0.47212842  0.29189043 -0.38636925]]    y: [2]
x: [[-0.02850863  0.44588893 -0.13047464  0.43342537]]    y: [2]
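
The names, shapes, and dtypes of the components can be inspected without iterating, through the dataset's `element_spec` property. A minimal sketch, assuming the same kind of random data as above:

```python
import numpy as np
import tensorflow as tf

x = np.random.random((10, 4))
y = x.sum(axis=1).astype(np.int64)

named = tf.data.Dataset.from_tensor_slices({'features': x, 'label': y})

# element_spec mirrors the dict structure of a single element.
spec = named.element_spec
print(sorted(spec.keys()))  # ['features', 'label']
print(tuple(spec['features'].shape), spec['features'].dtype.name)  # (4,) float64

# After batching, a leading batch dimension of unknown size (None) appears.
batched_spec = named.batch(2).element_spec
print(tuple(batched_spec['features'].shape))  # (None, 4)
```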
