In [1]:
import tensorflow as tf
from tensorflow import keras

import numpy as np
import pandas as pd
import scipy as sp

# Loading and Preprocessing Data via TensorFlow

We will usually be working with datasets that do not fit in RAM. For these we can use the **TensorFlow Data API** (`tf.data`), which takes care of optimizations including:
- Multithreading
- Queuing
- Batching
- Prefetching

The Data API can read from text files (such as CSV), binary files (including TensorFlow's TFRecord format), and SQL databases, and it can also help with preprocessing the data.

Two things we will focus on:
- _TF Transform_ (tf.Transform) - Lets you write a single preprocessing function that runs in batch mode on the training data before training and can then be incorporated into the trained model, so that once deployed it preprocesses new instances automatically.
- _TF Datasets_ (TFDS) - Provides a function to download many existing datasets, along with dataset objects for convenient manipulation.



***
## The Data API

### Introduction

The example below fits entirely in RAM, but it serves as a starting point.

In [2]:
X = tf.range(10)  # any data tensor
dataset = tf.data.Dataset.from_tensor_slices(X)
dataset

<TensorSliceDataset shapes: (), types: tf.int32>

The **from_tensor_slices()** method takes a tensor and creates a tf.data.Dataset whose elements are the slices of X along its first dimension.
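To make the "slices along the first dimension" behavior concrete, here is a minimal sketch using a 2-D tensor and a tuple of tensors. The names `X_2d`, `y`, `rows`, and `pairs` are illustrative and not part of the original notebook.

```python
import tensorflow as tf

# A 3x2 matrix: slicing along the first dimension yields three elements of shape (2,).
X_2d = tf.constant([[1, 2], [3, 4], [5, 6]])
rows = tf.data.Dataset.from_tensor_slices(X_2d)
row_list = [row.numpy().tolist() for row in rows]
print(row_list)   # [[1, 2], [3, 4], [5, 6]]

# A tuple of tensors is sliced in lockstep, giving (features, label) pairs.
y = tf.constant([0, 1, 0])
pairs = tf.data.Dataset.from_tensor_slices((X_2d, y))
pair_list = [(f.numpy().tolist(), int(l.numpy())) for f, l in pairs]
print(pair_list)  # [([1, 2], 0), ([3, 4], 1), ([5, 6], 0)]
```

The tuple form is a common way to keep features and labels aligned when both already sit in memory.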

In [3]:
for item in dataset:
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)


### Chaining Transformations

An example of a transformation chain is seen below:

![Chaining TensorFlow Transformations](chaining_transformations_tf_dataset.PNG)



In [4]:
dataset = dataset.repeat(3).batch(7)
for item in dataset:
    print(item)

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int32)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int32)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int32)
tf.Tensor([8 9], shape=(2,), dtype=int32)


You can also pass drop_remainder=True to batch() so that the final, smaller batch is dropped and all batches have the same size.
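A minimal sketch of the drop_remainder flag, rebuilding the same 10-element dataset with `tf.data.Dataset.range` (which yields int64 rather than int32). The name `full_batches` is illustrative.

```python
import tensorflow as tf

# 10 elements repeated 3 times = 30 elements; with batches of 7,
# the last 2 elements are dropped instead of forming a short batch.
full_batches = tf.data.Dataset.range(10).repeat(3).batch(7, drop_remainder=True)

batch_sizes = [int(batch.shape[0]) for batch in full_batches]
print(batch_sizes)  # [7, 7, 7, 7]
```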

These dataset methods do not modify the dataset in place; they return new datasets, so the result needs to be assigned to a variable. The elements can also be transformed with the .map() method, which likewise returns a new dataset.
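To check that a transformation leaves the original dataset untouched, here is a minimal sketch (the names `base` and `doubled` are illustrative):

```python
import tensorflow as tf

base = tf.data.Dataset.range(3)
doubled = base.map(lambda x: x * 2)  # returns a new dataset; base is unchanged

base_values = [int(x.numpy()) for x in base]
doubled_values = [int(x.numpy()) for x in doubled]
print(base_values)     # [0, 1, 2]
print(doubled_values)  # [0, 2, 4]
```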

In [5]:
dataset = dataset.map(lambda x: x * 2)
dataset

<MapDataset shapes: (None,), types: tf.int32>

The **.map()** method is the one to call if your dataset needs preprocessing before being fed into the network.
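As a rough sketch of how .map() might sit in a preprocessing pipeline, the example below applies a hypothetical `preprocess` function in parallel, then batches and prefetches. The function, the scaling constant, and the choice of 2 parallel calls are assumptions made for illustration, not taken from the original notebook.

```python
import tensorflow as tf

def preprocess(x):
    # Hypothetical preprocessing step: cast to float and scale 0..9 into [0, 1].
    return tf.cast(x, tf.float32) / 9.0

pipeline = (
    tf.data.Dataset.range(10)
    .map(preprocess, num_parallel_calls=2)  # run the function on 2 elements in parallel
    .batch(5)
    .prefetch(1)                            # prepare the next batch while the current one is consumed
)

first_batch = next(iter(pipeline))
print(first_batch.shape)  # (5,)
```

Element order is preserved by default even with parallel calls, so the first batch still holds the first five elements.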

