# Getting started with TensorFlow's `Dataset` API

In this notebook we use TensorFlow's [`tf.data`](https://www.tensorflow.org/api_docs/python/tf/data/) API to build simple input pipelines from NumPy arrays and TensorFlow tensors that already exist in memory. This approach is intended only for very small datasets, because it can be considerably inefficient for larger ones.

More information can be found in the section [Importing Data](https://www.tensorflow.org/guide/datasets) of TensorFlow's guide.

In [1]:
import numpy as np
import tensorflow as tf

In [2]:
tf.version.VERSION

'2.3.0'

In [4]:
# Create fake data
nsamples = 10
nfeatures = 4
x_numpy = np.random.random((nsamples, nfeatures))
y_numpy = x_numpy.sum(axis=1).astype(int)

In [13]:
# Creating a `Dataset` object
dataset = tf.data.Dataset.from_tensor_slices((x_numpy, y_numpy))
dataset = dataset.shuffle(10)
dataset = dataset.batch(1)
dataset = dataset.repeat(1)

# Note: Dataset.repeat() concatenates the repeated copies of the dataset without
# signaling the end of one epoch and the beginning of the next one.
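
The comment above can be illustrated with a short sketch. Because `repeat()` does not mark epoch boundaries, calling it before `batch()` lets a batch mix samples from two consecutive epochs, while calling it after `batch()` keeps the batches of each epoch separate. The array `values` and the two dataset names are made up for this illustration.

```python
import numpy as np
import tensorflow as tf

# Ten toy samples numbered 0..9 so the epoch boundary is easy to spot.
values = np.arange(10)

# batch -> repeat: every epoch is batched on its own, so the short
# final batch [9] shows up at the end of each epoch.
batch_then_repeat = tf.data.Dataset.from_tensor_slices(values).batch(3).repeat(2)

# repeat -> batch: the two epochs are concatenated first, so one batch
# can contain the end of epoch 1 and the start of epoch 2.
repeat_then_batch = tf.data.Dataset.from_tensor_slices(values).repeat(2).batch(3)

batches_a = [b.numpy().tolist() for b in batch_then_repeat]
batches_b = [b.numpy().tolist() for b in repeat_then_batch]

print(batches_a)
print(batches_b)
```

In the second list the fourth batch is `[9, 0, 1]`, which straddles the epoch boundary. Which order is preferable depends on whether the training loop needs clean epoch boundaries.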

In [14]:
for x, y in dataset:
    print(f'x: {x}    y: {y}')

x: [[0.78139481 0.11883897 0.39035306 0.73005025]]    y: [2]
x: [[0.51803594 0.029389   0.51314648 0.63139709]]    y: [1]
x: [[0.22394586 0.9510952  0.12551951 0.54944123]]    y: [1]
x: [[0.50468981 0.41202756 0.25616242 0.55185019]]    y: [1]
x: [[0.2047827  0.72270374 0.85931468 0.79648837]]    y: [2]
x: [[0.39116994 0.34961736 0.56749722 0.67636049]]    y: [1]
x: [[0.39408979 0.58097187 0.31183686 0.78895149]]    y: [2]
x: [[0.26150226 0.24383822 0.24565584 0.89782191]]    y: [1]
x: [[0.94700325 0.25707502 0.85429373 0.37009758]]    y: [2]
x: [[0.84791543 0.05778967 0.14045354 0.57825393]]    y: [1]


In [15]:
# Iterate over the first 5 batches only (5 samples, since the batch size is 1)
for x, y in dataset.take(5):
    print(f'x: {x}    y: {y}')

x: [[0.51803594 0.029389   0.51314648 0.63139709]]    y: [1]
x: [[0.39408979 0.58097187 0.31183686 0.78895149]]    y: [2]
x: [[0.39116994 0.34961736 0.56749722 0.67636049]]    y: [1]
x: [[0.22394586 0.9510952  0.12551951 0.54944123]]    y: [1]
x: [[0.2047827  0.72270374 0.85931468 0.79648837]]    y: [2]
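
Note that the two loops above print the samples in different orders: by default, `shuffle()` draws a new order every time the dataset is iterated. The sketch below is a minimal illustration of the `reshuffle_each_iteration` argument of `shuffle()`, which can be set to `False` to keep the same order across iterations. The names `base` and `fixed` and the seed value are chosen only for this example.

```python
import tensorflow as tf

# A toy dataset with the integers 0..9.
base = tf.data.Dataset.range(10)

# Shuffle once and reuse the same order on every pass over the dataset.
fixed = base.shuffle(10, seed=0, reshuffle_each_iteration=False)

first_pass = [int(v) for v in fixed.as_numpy_iterator()]
second_pass = [int(v) for v in fixed.as_numpy_iterator()]

print(first_pass)
print(second_pass)
```

A fixed order can be convenient for debugging, while reshuffling on every epoch is usually what one wants during training.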


### A `Dataset` object can also be created from TensorFlow tensors

In [17]:
x_tensor = tf.random.uniform([nsamples, nfeatures])
y_tensor = tf.cast(tf.reduce_sum(x_tensor, axis=1), tf.int32)

In [18]:
dataset = tf.data.Dataset.from_tensor_slices((x_tensor, y_tensor))
dataset = dataset.shuffle(10)
dataset = dataset.batch(1)
dataset = dataset.repeat(1)

In [20]:
for x, y in dataset.take(5):
    print(f'x: {x}    y: {y}')

x: [[0.8927978  0.441705   0.5889245  0.17077637]]    y: [2]
x: [[0.74003947 0.9098822  0.01213014 0.5557623 ]]    y: [2]
x: [[0.7819412  0.7896348  0.13750732 0.45953274]]    y: [2]
x: [[0.36963928 0.7769023  0.36441612 0.05384433]]    y: [1]
x: [[0.580734   0.79456425 0.9809855  0.44709933]]    y: [2]


[`tf.data.Dataset.from_tensor_slices`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices) embeds the features and labels arrays in your TensorFlow graph as `tf.constant` operations. This works well for a small dataset, but wastes memory because the contents of the array will be copied multiple times.
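
One possible way to avoid embedding the arrays as constants is to feed the samples through a Python generator with `tf.data.Dataset.from_generator`. The following is only a sketch under assumptions: the arrays `x_demo` and `y_demo` and the function `sample_generator` are hypothetical names introduced here, and the `output_types`/`output_shapes` arguments are used because this notebook runs TensorFlow 2.3.0 (newer releases document an `output_signature` argument as the preferred alternative).

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-ins for the notebook's fake data.
x_demo = np.random.random((10, 4))
y_demo = x_demo.sum(axis=1).astype(np.int32)

def sample_generator():
    # Yield one (features, label) pair at a time instead of
    # handing the whole arrays to TensorFlow at once.
    for features, label in zip(x_demo, y_demo):
        yield features, label

generated = tf.data.Dataset.from_generator(
    sample_generator,
    output_types=(tf.float64, tf.int32),
    output_shapes=(tf.TensorShape([4]), tf.TensorShape([])),
)
generated = generated.shuffle(10).batch(1)

n_batches = sum(1 for _ in generated)
print(n_batches)
```

The generator runs in Python, so this trades memory for speed and is not necessarily faster than `from_tensor_slices` on small data.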

## Adding data transformations to the pipeline

Let's say that for our problem it is beneficial to shift the features so that they lie between -0.5 and 0.5. We would also like to transform the labels from integers to one-hot encoded vectors. Both can be done with `Dataset`'s `map` method.

In [45]:
# The following transformations are quite simple and could be done
# in a single function, but we will use two different functions
# to show how operations can be put together as a pipeline.

def center(*row):
    features = row[0] - 0.5
    label = row[1]
    return features, label

def make_one_hot_labels(features, label):
    return features, tf.one_hot(label, 4)

# Keep only the samples whose label is 1.
# simpler with `dataset = dataset.filter(lambda f, l: tf.equal(l, 1))`
def filter_labels(features, label):
    return tf.equal(label, 1)

# Keep only the samples whose first (centered) feature is positive.
# simpler with `dataset = dataset.filter(lambda f, l: tf.greater(f[0], 0))`
def filter_features(features, label):
    return tf.greater(features[0], 0)

In [46]:
dataset = tf.data.Dataset.from_tensor_slices((x_numpy, y_numpy))
dataset = dataset.filter(filter_labels)
dataset = dataset.map(center)
dataset = dataset.filter(filter_features)
dataset = dataset.map(make_one_hot_labels)
dataset = dataset.shuffle(150)
dataset = dataset.batch(1)
dataset = dataset.repeat(1)

In [47]:
for x, y in dataset.take(5):
    print(f'x: {x}    y: {y}')

x: [[ 0.34791543 -0.44221033 -0.35954646  0.07825393]]    y: [[0. 1. 0. 0.]]
x: [[ 0.00468981 -0.08797244 -0.24383758  0.05185019]]    y: [[0. 1. 0. 0.]]
x: [[ 0.01803594 -0.470611    0.01314648  0.13139709]]    y: [[0. 1. 0. 0.]]
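
Only three rows are printed above even though we asked for five: `take(5)` returns at most five batches, and the two `filter` steps removed most of the samples. The sketch below checks this kind of result against a plain NumPy computation. It uses its own seeded random data (`x_demo`, `y_demo`, both hypothetical names), so the count it prints will generally differ from the three rows shown above.

```python
import numpy as np
import tensorflow as tf

# Reproducible stand-in data with the same shape as the notebook's fake data.
rng = np.random.default_rng(0)
x_demo = rng.random((10, 4))
y_demo = x_demo.sum(axis=1).astype(int)

# Same filter -> center -> filter steps as in the pipeline above.
filtered = tf.data.Dataset.from_tensor_slices((x_demo, y_demo))
filtered = filtered.filter(lambda f, l: tf.equal(l, 1))
filtered = filtered.map(lambda f, l: (f - 0.5, l))
filtered = filtered.filter(lambda f, l: tf.greater(f[0], 0))

# Count the surviving samples in the pipeline and with NumPy.
n_kept = sum(1 for _ in filtered)
n_expected = int(np.sum((y_demo == 1) & (x_demo[:, 0] - 0.5 > 0)))

print(n_kept, n_expected)
```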


## Giving names to `Dataset` components

In [48]:
dataset = tf.data.Dataset.from_tensor_slices({'features': x_numpy, 'label': y_numpy})

In [49]:
def center(row):
    features = row['features'] - 0.5
    label = row['label']
    return {'features': features, 'label': label}

In [51]:
dataset = tf.data.Dataset.from_tensor_slices({'features': x_numpy, 'label': y_numpy})
dataset = dataset.map(center)
dataset = dataset.shuffle(10)
dataset = dataset.batch(1)
dataset = dataset.repeat(1)

In [52]:
for d in dataset.take(5):
    print(f"x: {d['features']}    y: {d['label']}")

x: [[-0.10883006 -0.15038264  0.06749722  0.17636049]]    y: [1]
x: [[ 0.00468981 -0.08797244 -0.24383758  0.05185019]]    y: [1]
x: [[-0.23849774 -0.25616178 -0.25434416  0.39782191]]    y: [1]
x: [[ 0.44700325 -0.24292498  0.35429373 -0.12990242]]    y: [2]
x: [[ 0.01803594 -0.470611    0.01314648  0.13139709]]    y: [1]
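
When the components are named, the structure of each element can be inspected through the dataset's `element_spec` property, which mirrors the dictionary keys. A minimal sketch, using hypothetical arrays `x_demo` and `y_demo` in place of the notebook's data:

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-ins for the notebook's fake data.
x_demo = np.random.random((10, 4))
y_demo = x_demo.sum(axis=1).astype(int)

named = tf.data.Dataset.from_tensor_slices({'features': x_demo, 'label': y_demo})

# element_spec is a dict with the same keys as the input dictionary.
spec = named.element_spec
print(spec)

# After batching, each component gains a leading batch dimension.
batched_spec = named.batch(2).element_spec
print(batched_spec)
```

Named components make it harder to mix up features and labels when a pipeline has several `map` steps.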
