## Loading and Preprocessing Data with TensorFlow

The Data API can read from CSV files, binary files (including those in the TFRecord format), SQL databases, or even Google BigQuery, and it takes care of the details needed to read them efficiently. It can also be used for preprocessing.

### The Data API

A _dataset_ represents a sequence of data items.

In [1]:
import tensorflow as tf

In [5]:
# Dataset reading from RAM
X = tf.range(10)
# Creates a dataset that contains 10 items (slices of X)
dataset = tf.data.Dataset.from_tensor_slices(X)
dataset
# similar: tf.data.Dataset.range(10), which yields int64 items instead of int32

<TensorSliceDataset shapes: (), types: tf.int32>

In [3]:
for item in dataset:
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)


### Chaining Transformations
Each transformation returns a new dataset rather than modifying the original, so we must keep a reference to the result.
<center><img src="img/trans.png"></img></center>

In [6]:
# calling transformation methods:
# repeat() repeats the original dataset 3 times
# batch() groups the items of the previous dataset in batches of seven items
dataset2 = dataset.repeat(3).batch(7)
for item in dataset2:
    print(item)
# with drop_remainder=True, the final batch of 2 items would be dropped

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int32)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int32)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int32)
tf.Tensor([8 9], shape=(2,), dtype=int32)
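
The effect of dropping the final partial batch can be checked directly. The following is a minimal sketch that rebuilds the same 30-item dataset and passes `drop_remainder=True` to `batch()`; the names `full_batches` and `batch_sizes` are introduced here only for illustration.

```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(tf.range(10))

# 30 items in batches of 7 leave a final partial batch of 2 items;
# drop_remainder=True discards it so every batch has exactly 7 items.
full_batches = dataset.repeat(3).batch(7, drop_remainder=True)

for item in full_batches:
    print(item)

# Collect the size of each batch to confirm that they are all equal.
batch_sizes = [int(item.shape[0]) for item in full_batches]
```

Equal-sized batches are mainly useful when a later step expects a fixed batch dimension; the cost is that a few items are skipped in each pass.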


In [15]:
# map() applies the lambda to each item (here, each batch)
dataset3 = dataset2.map(lambda x: x * 2)
for item in dataset3:
    print(item)

tf.Tensor([ 0  2  4  6  8 10 12], shape=(7,), dtype=int32)
tf.Tensor([14 16 18  0  2  4  6], shape=(7,), dtype=int32)
tf.Tensor([ 8 10 12 14 16 18  0], shape=(7,), dtype=int32)
tf.Tensor([ 2  4  6  8 10 12 14], shape=(7,), dtype=int32)
tf.Tensor([16 18], shape=(2,), dtype=int32)


For computationally intensive transformations, set the _num_parallel_calls_ argument of _map()_ to process several items in parallel across multiple threads.
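
As a rough sketch of how this argument might be used: the value 4 below is an arbitrary choice, and `tf.data.AUTOTUNE` is assumed to be available (TensorFlow 2.4 and later; older releases expose it as `tf.data.experimental.AUTOTUNE`).

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)

# Let up to 4 items be processed concurrently (4 is an arbitrary choice).
doubled = dataset.map(lambda x: x * 2, num_parallel_calls=4)

# Alternatively, let tf.data choose the level of parallelism at runtime.
doubled_auto = dataset.map(lambda x: x * 2, num_parallel_calls=tf.data.AUTOTUNE)

for item in doubled:
    print(item)
```

With the default settings the items still come out in their original order; only the work of computing them is spread over several threads. For a function as cheap as this one, parallelism is unlikely to bring any speedup.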

_map()_ applies a transformation to each item, whereas _apply()_ applies a transformation to the dataset as a whole.
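
To make the distinction concrete, here is a small sketch. The function passed to `map()` receives one item at a time, while the function passed to `apply()` receives the dataset itself and returns a new dataset. `keep_even_pairs` is a hypothetical helper written only for this example.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)

# map(): the function is called on each item.
squared = dataset.map(lambda x: x * x)

# apply(): the function is called once, on the whole dataset.
def keep_even_pairs(ds):
    """Keep the even items and group them in batches of two."""
    return ds.filter(lambda x: x % 2 == 0).batch(2)

pairs = dataset.apply(keep_even_pairs)
for item in pairs:
    print(item)
```

Packaging several chained transformations in one function this way makes it possible to reuse them on different datasets.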

In [18]:
# with this, each item in the new dataset will be a single-integer
# tensor instead of a batch of seven integers
# (newer TensorFlow versions deprecate this in favor of dataset3.unbatch()):
dataset3_1 = dataset3.apply(tf.data.experimental.unbatch())
for i in dataset3_1:
    print(i)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(10, shape=(), dtype=int32)
tf.Tensor(12, shape=(), dtype=int32)
tf.Tensor(14, shape=(), dtype=int32)
tf.Tensor(16, shape=(), dtype=int32)
tf.Tensor(18, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(10, shape=(), dtype=int32)
tf.Tensor(12, shape=(), dtype=int32)
tf.Tensor(14, shape=(), dtype=int32)
tf.Tensor(16, shape=(), dtype=int32)
tf.Tensor(18, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(10, shape=(), dtype=int32)
tf.Tensor(12, shape=(), dtype=int32)
tf.Tensor(14, shape=(), dtype=int32)
tf.Tensor(16, shape=(), dtype=int32)
tf.Tensor(18, shape=(), dtype=int32)

In [19]:
# Filtering
dataset3_2 = dataset3_1.filter(lambda x: x < 10)
for i in dataset3_2:
    print(i) 

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)


### Shuffling the Data

_shuffle()_ creates a new dataset that starts by filling a buffer with the first items of the source dataset. Whenever it is asked for an item, it pulls one at random from the buffer and replaces it with a fresh item from the source dataset, until it has iterated through the entire source dataset; it then keeps pulling items at random until the buffer is empty. We must specify the buffer size, which must not exceed the amount of RAM we have. A seed can also be set to make the order reproducible.
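
The buffer mechanism can be sketched in plain Python. This is only an illustration of the idea, not TensorFlow's implementation, and it will not reproduce the orderings TensorFlow generates; `buffered_shuffle` is a hypothetical name and a buffer size of at least 1 is assumed.

```python
import random

def buffered_shuffle(source, buffer_size, seed=None):
    """Yield the items of `source` in a buffer-limited random order."""
    rng = random.Random(seed)
    iterator = iter(source)
    buffer = []
    # Fill the buffer with the first items of the source.
    for item in iterator:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    # Each request pulls a random item and refills its slot from the source.
    for item in iterator:
        index = rng.randrange(len(buffer))
        yield buffer[index]
        buffer[index] = item
    # Source exhausted: hand out what is left in the buffer in random order.
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(10), buffer_size=3, seed=42))
print(shuffled)
```

The sketch also shows why a small buffer shuffles poorly: with a buffer of 3, the item at output position `i` can never come from further than position `i + 2` in the source, so items stay close to where they started.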

In [22]:
dataset4 = tf.data.Dataset.range(10).repeat(3)
dataset4_1 = dataset4.shuffle(buffer_size=5, seed=42).batch(7)
for item in dataset4_1:
    print(item)

tf.Tensor([0 2 3 6 7 9 4], shape=(7,), dtype=int64)
tf.Tensor([5 0 1 1 8 6 5], shape=(7,), dtype=int64)
tf.Tensor([4 8 7 1 2 3 0], shape=(7,), dtype=int64)
tf.Tensor([5 4 2 7 8 9 9], shape=(7,), dtype=int64)
tf.Tensor([3 6], shape=(2,), dtype=int64)


If we call _repeat()_ on a shuffled dataset, a new order is generated at every iteration by default; to reuse the same order each time, set _reshuffle_each_iteration=False_.
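
A short sketch of the difference, using a 5-item dataset and a buffer that holds all of it so that each repetition is a full permutation. The variable names are chosen only for this example.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(5)

# Default (reshuffle_each_iteration=True): each repetition gets its own order.
reshuffled = dataset.shuffle(buffer_size=5, seed=42).repeat(2)

# reshuffle_each_iteration=False: every repetition reuses the same order.
fixed = dataset.shuffle(
    buffer_size=5, seed=42, reshuffle_each_iteration=False
).repeat(2)

reshuffled_order = [int(x) for x in reshuffled]
fixed_order = [int(x) for x in fixed]
print(reshuffled_order[:5], reshuffled_order[5:])
print(fixed_order[:5], fixed_order[5:])
```

The two halves printed for `fixed` should be identical, whereas those for `reshuffled` will usually differ. Reshuffling at each iteration is generally what we want during training; a fixed order is mostly useful for debugging or testing.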

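For datasets that do not fit in memory, buffer-based shuffling may not be enough on its own, because the buffer is small compared to the dataset. Common remedies are to shuffle the source data itself (for example with the _shuf_ command on Linux), to split it into multiple files and read them in random order during training, or to read several files at the same time. The Data API makes all of these techniques easy to apply.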
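
The first two ideas can be illustrated with the standard library alone. This is a sketch under stated assumptions: the records are made-up strings, `write_shards` is a hypothetical helper, and the file names and shard count are arbitrary.

```python
import random
import tempfile
from pathlib import Path

def write_shards(lines, directory, n_shards):
    """Spread `lines` round-robin over `n_shards` text files."""
    paths = [Path(directory) / f"shard_{i:02d}.txt" for i in range(n_shards)]
    for i, path in enumerate(paths):
        path.write_text("".join(f"{line}\n" for line in lines[i::n_shards]))
    return paths

lines = [f"record {i}" for i in range(20)]
random.Random(42).shuffle(lines)  # shuffle the source data once

with tempfile.TemporaryDirectory() as tmp:
    shard_paths = write_shards(lines, tmp, n_shards=4)
    random.Random(0).shuffle(shard_paths)  # pick a file order for this epoch
    records = [
        record
        for path in shard_paths
        for record in path.read_text().splitlines()
    ]

print(records[:5])
```

Every record is still read exactly once per pass, but the order changes with the file order. Reading several files at the same time mixes the records further.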

__Interleaving lines from multiple files__

