# Use the `tf.data` API for data importing

The `tf.data` API enables you to build complex input pipelines from
simple, reusable pieces. It is built around two key abstractions:
`tf.data.Dataset` and `tf.data.Iterator`.

- A `tf.data.Dataset` represents a sequence of elements, in which each
  element contains one or more `Tensor` objects. There are two
  distinct ways to create a dataset:
  - Creating a **source** (e.g. `Dataset.from_tensor_slices()`)
    constructs a dataset from one or more `tf.Tensor` objects.
  - Applying a **transformation** (e.g. `Dataset.batch()`) constructs a
    dataset from one or more `tf.data.Dataset` objects.
- A `tf.data.Iterator` provides the main way to extract elements from
  a dataset. The operation returned by `Iterator.get_next()` yields
  the next element of a `Dataset` when executed, and **acts as the
  interface between input pipeline code and your model**. The simplest
  iterator is a **one-shot iterator**.
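
The three roles above (source, transformation, iterator) can be modelled with plain Python generators. This is only a conceptual sketch: `from_slices` and `map_elements` are hypothetical helpers, not TensorFlow functions, and `next()` merely plays the part of running the `get_next()` operation.

```python
# Conceptual model only; none of these helpers exist in TensorFlow.

def from_slices(rows):
    """Source: yield one element per row of an in-memory sequence."""
    for row in rows:
        yield row


def map_elements(elements, func):
    """Transformation: build a new stream from an existing one."""
    for element in elements:
        yield func(element)


source = from_slices([1, 2, 3])
pipeline = map_elements(source, lambda x: x * 10)

# Pulling values one at a time stands in for running get_next().
first = next(pipeline)
second = next(pipeline)
third = next(pipeline)
print(first, second, third)  # 10 20 30
```

As with a real dataset, nothing is computed until an element is requested from the end of the pipeline.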

Most of the content in this notebook comes from
[here](https://www.tensorflow.org/programmers_guide/datasets) and [here](https://jhui.github.io/2017/11/21/TensorFlow-Importing-data/).

In [13]:
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"]="3"

import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_formats = ('png', 'retina')

def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)


In [3]:
dataset1 = tf.data.Dataset.from_tensor_slices(tf.random_uniform([4, 10]))
print(dataset1.output_types)
print(dataset1.output_shapes)

<dtype: 'float32'>
(10,)


In [26]:
dataset2 = tf.data.Dataset.from_tensor_slices(
    (tf.random_uniform([4]),
     tf.random_uniform([4, 100], maxval=100, dtype=tf.float32)))

print(dataset2.output_types)
print(dataset2.output_shapes)

(tf.float32, tf.float32)
(TensorShape([]), TensorShape([Dimension(100)]))


In [28]:
dataset3 = tf.data.Dataset.zip((dataset1, dataset2))
print(dataset3.output_types)
print(dataset3.output_shapes)

(tf.float32, (tf.float32, tf.float32))
(TensorShape([Dimension(10)]), (TensorShape([]), TensorShape([Dimension(100)])))


In [29]:
dataset = tf.data.Dataset.from_tensor_slices(
    {"a": tf.random_uniform([4]),
     "b": tf.random_uniform([4, 100], maxval=100, dtype=tf.int32)})
print(dataset.output_types)
print(dataset.output_shapes)

{'a': tf.float32, 'b': tf.int32}
{'a': TensorShape([]), 'b': TensorShape([Dimension(100)])}


The `Dataset` transformations support datasets of any structure.
When using the `Dataset.map()`, `Dataset.flat_map()`, and
`Dataset.filter()` transformations, which apply a function to each
element, the element structure determines the arguments of the function:

In [None]:
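dataset1 = dataset1.map(lambda x: x**2)  # square each element
# flat_map must return a Dataset, so wrap the sum in a one-element dataset
dataset2 = dataset2.flat_map(lambda x, y: tf.data.Dataset.from_tensors(x + y))

# Python 3 has no tuple parameters: the nested (y, z) arrives as one argument.
# The predicate (elided here) must return a scalar tf.bool tensor.
dataset3 = dataset3.filter(lambda x, yz: ...)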
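
The rule that the element structure determines the function arguments can be modelled in plain Python. The sketch below assumes the behaviour described above: a top-level tuple is spread over positional arguments, while anything nested inside it is passed through as a single argument. `call_with_element` is a hypothetical helper written for illustration, not part of `tf.data`.

```python
def call_with_element(func, element):
    """Hand an element to a user function the way the text describes.

    A top-level tuple is unpacked into positional arguments; anything
    nested deeper is passed through unchanged.
    """
    if isinstance(element, tuple):
        return func(*element)
    return func(element)


single = 3.0                # one component      -> func(x)
pair = (2.0, 5.0)           # two components     -> func(x, y)
nested = (1.0, (2.0, 3.0))  # nested structure   -> func(x, yz)

squared = call_with_element(lambda x: x ** 2, single)
summed = call_with_element(lambda x, y: x + y, pair)
# The inner tuple arrives as one argument and is indexed by hand.
total = call_with_element(lambda x, yz: x + yz[0] + yz[1], nested)
print(squared, summed, total)  # 9.0 7.0 6.0
```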


# Creating an Iterator

An `Iterator` is convenient for mini-batch learning. `tf.data`
supports four types of iterator, in increasing level of
sophistication:
- one-shot
- initializable
- reinitializable
- feedable

The **one-shot** iterator is the simplest of the four. It is also the only iterator type that is easily usable with the `Estimator` API.

## **one-shot** iterator

In [36]:
dataset = tf.data.Dataset.range(100)
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    for i in range(100):
        value = sess.run(next_element)
        assert i == value
        print(value, sep=", ", end=" ")

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 

## initializable

An **initializable** iterator requires you to run an explicit
`iterator.initializer` operation before using it. In exchange for
this extra step, you can parameterize the definition of the dataset
with one or more `tf.placeholder()` tensors that are fed when you
initialize the iterator.

In [40]:
max_value = tf.placeholder(tf.int64, shape=[])
dataset = tf.data.Dataset.range(max_value)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    sess.run(iterator.initializer, feed_dict={max_value: 10})
    for i in range(10):
        value = sess.run(next_element)
        assert i == value
        print(value, end=", ")

0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 

In [41]:
# Initialize the same iterator over a dataset with 100 elements
with tf.Session() as sess:
    sess.run(iterator.initializer, feed_dict={max_value: 100})
    for i in range(100):
        value = sess.run(next_element)
        assert i == value
        print(value, end=", ")

0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 

## reinitializable

A **reinitializable** iterator can be initialized from multiple
different `Dataset` objects, provided they share the same structure.
For example, you might have a training input pipeline that applies
**random perturbations** to the input images to improve generalization,
and a validation input pipeline that evaluates predictions on
unmodified data.

In [None]:
training_dataset = tf.data.Dataset.range(100).map(
    lambda x: x + tf.random_uniform([], -10, 10, dtype=tf.int64))
validation_dataset = tf.data.Dataset.range(50)

iterator = tf.data.Iterator.from_structure(training_dataset.output_types,
                                           training_dataset.output_shapes)
next_element = iterator.get_next()

training_init_op = iterator.make_initializer(training_dataset)
validation_init_op = iterator.make_initializer(validation_dataset)

# Run 20 epochs; each epoch passes over the training set, then the validation set
with tf.Session() as sess:
    for _ in range(20):
        sess.run(training_init_op)
        for _ in range(100):
            sess.run(next_element)
            
        sess.run(validation_init_op)
        for _ in range(50):
            sess.run(next_element)

    

## feedable

A **feedable** iterator can be used together with a `tf.placeholder` to
select which `Iterator` to use in each call to `sess.run()`, via
the `feed_dict` argument.

A **feedable** iterator offers the same functionality as a
**reinitializable** iterator, but it does not require you to initialize
the iterator from the start of a dataset when you switch between iterators.

In [None]:
training_dataset = tf.data.Dataset.range(100).map(
    lambda x: x + tf.random_uniform([], -10, 10, tf.int64)).repeat()
validation_dataset = tf.data.Dataset.range(50)

# A feedable iterator is defined by a handle placeholder and its structure.
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(
    handle, training_dataset.output_types, training_dataset.output_shapes)
next_element = iterator.get_next()

# You can use feedable iterators with a variety of different kinds of iterator (one_shot, initializable, and reinitializable)
training_iterator = training_dataset.make_one_shot_iterator()
validation_iterator = validation_dataset.make_initializable_iterator()

# the `Iterator.string_handle` method returns a tensor that can be evaluated
# and used to feed the placeholder
with tf.Session() as sess:
    training_handle = sess.run(training_iterator.string_handle())
    validation_handle = sess.run(validation_iterator.string_handle())

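    # Loop forever, alternating between training and validation.
    while True:
        # 200 training steps; the dataset repeats, so it never runs out
        for _ in range(200):
            sess.run(next_element, feed_dict={handle: training_handle})

        # One full pass over the validation set
        sess.run(validation_iterator.initializer)
        for _ in range(50):
            sess.run(next_element, feed_dict={handle: validation_handle})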
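
The practical difference from a reinitializable iterator is that switching handles does not rewind anything. The plain Python sketch below models that behaviour under simplifying assumptions: `ToyIterator` and the `iterators` dictionary are hypothetical stand-ins for real iterators and their string handles, not TensorFlow objects.

```python
class ToyIterator:
    """Toy stand-in for an iterator with an explicit initializer."""

    def __init__(self, values):
        self._values = list(values)
        self._position = 0

    def initialize(self):
        # Rewind to the first element, like running an initializer op.
        self._position = 0

    def get_next(self):
        value = self._values[self._position]
        self._position += 1
        return value


# The dictionary keys play the role of string handles.
iterators = {
    "train": ToyIterator(range(100)),
    "valid": ToyIterator(range(50)),
}


def get_next(handle):
    """Pull from whichever iterator the handle selects."""
    return iterators[handle].get_next()


train_a = [get_next("train") for _ in range(3)]
valid_a = [get_next("valid") for _ in range(2)]
train_b = [get_next("train") for _ in range(3)]  # resumes, no rewind

iterators["valid"].initialize()                  # explicit rewind
valid_b = [get_next("valid") for _ in range(2)]

print(train_a, valid_a, train_b, valid_b)
# [0, 1, 2] [0, 1] [3, 4, 5] [0, 1]
```

Each iterator keeps its own position, so training resumes where it stopped, and only the explicitly initialized validation iterator starts over.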
        



## Consuming data
### Consuming `numpy` arrays

If the data is small enough to fit in memory, the simplest way to
create a `Dataset` is to convert the arrays to `tf.Tensor` objects and use
`Dataset.from_tensor_slices()`.

In [None]:
# np.load returns a plain ndarray for .npy files, which is not a context manager
data = np.load('data/training_data.npy')
features = data['features']
labels = data['labels']

assert features.shape[0] == labels.shape[0]
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    

This snippet works well only for a small dataset. It embeds the `features` and `labels` arrays in the TensorFlow graph as constants, which wastes memory because the array contents are copied multiple times. For a larger dataset, it is better to use placeholders and defer the data copy until you initialize the *iterator*.

The alternative is to define the `Dataset` in terms of `tf.placeholder` tensors and feed the `numpy` arrays when the `Iterator` is initialized over the dataset. The first cell below writes a small example file; the second cell loads it and feeds it through placeholders.

In [None]:
dt = np.dtype([('features', float, (2, )),
               ('labels', int)])

x = np.zeros((2, ), dtype=dt)
x[0]['features'] = [3.0, 2.5]
x[0]['labels'] = 2  # the field is named 'labels' in dt

x[1]['features'] = [1.4, 2.3]
x[1]['labels'] = 1

np.save("training_data.npy", x)


In [None]:
# np.load returns a structured ndarray here, so index its fields directly
data = np.load("training_data.npy")
features = data["features"]
labels = data["labels"]

assert features.shape[0] == labels.shape[0]

features_placeholder = tf.placeholder(features.dtype, features.shape)
labels_placeholder = tf.placeholder(labels.dtype, labels.shape)

dataset = tf.data.Dataset.from_tensor_slices((features_placeholder, labels_placeholder))
# more transformations on the dataset
iterator = dataset.make_initializable_iterator()

with tf.Session() as sess:
    sess.run(iterator.initializer, feed_dict={features_placeholder: features,
                                              labels_placeholder: labels})
    


## Batching dataset elements

The simplest form of batching stacks `n` consecutive elements of a dataset into a single element. This is exactly what `Dataset.batch()` does.

### Simple batching

In [43]:
inc_dataset = tf.data.Dataset.range(100)
dec_dataset = tf.data.Dataset.range(0, -100, -1)
dataset = tf.data.Dataset.zip((inc_dataset, dec_dataset))
batched_dataset = dataset.batch(4)

iterator = batched_dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    print(sess.run(next_element))
    print(sess.run(next_element))
    print(sess.run(next_element))

(array([0, 1, 2, 3]), array([ 0, -1, -2, -3]))
(array([4, 5, 6, 7]), array([-4, -5, -6, -7]))
(array([ 8,  9, 10, 11]), array([ -8,  -9, -10, -11]))
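
Note that each component of the zipped dataset is stacked separately, so a batch is a pair of arrays, not an array of pairs. A plain Python sketch of that grouping, assuming two-component elements (`batch_pairs` is a hypothetical helper, and lists stand in for tensors):

```python
def batch_pairs(elements, batch_size):
    """Group consecutive (a, b) pairs and stack each component separately."""
    elements = list(elements)
    for start in range(0, len(elements), batch_size):
        chunk = elements[start:start + batch_size]
        firsts, seconds = zip(*chunk)  # transpose pairs into two columns
        yield list(firsts), list(seconds)


pairs = zip(range(100), range(0, -100, -1))
batches = list(batch_pairs(pairs, batch_size=4))

print(batches[0])  # ([0, 1, 2, 3], [0, -1, -2, -3])
print(batches[2])  # ([8, 9, 10, 11], [-8, -9, -10, -11])
```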


### Batching tensors with padding

Simple batching works only when every element has the same shape. Elements of varying size, such as sequences of different lengths, need another method: `Dataset.padded_batch()` pads each element to a common shape before stacking. A dimension given as `None` in `padded_shapes` is padded to the longest element in each batch.

In [49]:
dataset = tf.data.Dataset.range(100)
# tf.fill(dims, value, name=None)
dataset = dataset.map(lambda x: tf.fill([tf.cast(x, tf.int32)], x))
dataset = dataset.padded_batch(4, padded_shapes=[None])

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    print(sess.run(next_element))
    print(sess.run(next_element))

[[0 0 0]
 [1 0 0]
 [2 2 0]
 [3 3 3]]
[[4 4 4 4 0 0 0]
 [5 5 5 5 5 0 0]
 [6 6 6 6 6 6 0]
 [7 7 7 7 7 7 7]]
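
The padding rule behind this output can be reproduced in plain Python. This is a minimal sketch, assuming one-dimensional sequences and a padding value of 0; the local `padded_batch` function is a hypothetical stand-in, not the TensorFlow implementation.

```python
def padded_batch(sequences, batch_size, pad_value=0):
    """Pad every sequence in a batch to that batch's longest length."""
    sequences = list(sequences)
    for start in range(0, len(sequences), batch_size):
        chunk = sequences[start:start + batch_size]
        width = max(len(seq) for seq in chunk)
        yield [list(seq) + [pad_value] * (width - len(seq)) for seq in chunk]


# Element x is a list holding x copies of x, like tf.fill([x], x).
sequences = [[x] * x for x in range(8)]
batches = list(padded_batch(sequences, batch_size=4))

for row in batches[0]:
    print(row)
# [0, 0, 0]
# [1, 0, 0]
# [2, 2, 0]
# [3, 3, 3]
```

The width is chosen per batch, which is why the first batch has three columns and the second has seven.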
