## The Data API

The main object used in the data API are ```datasets```. Usually you will use datasets that gradually read data from disk, but let's start with an example that creates a dataset entirely in RAM.

In [1]:
import tensorflow as tf
from tensorflow import keras

In [2]:
X = tf.range(10)
dataset = tf.data.Dataset.from_tensor_slices(X)
dataset

<TensorSliceDataset shapes: (), types: tf.int32>

```from_tensor_slices()``` takes a tensor and creates a Dataset whose elements are all slices of X (along the first dimension). This dataset contains 10 items: tensors 0, 1, 2, ..., 9.

In [3]:
for item in dataset:
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)


## Chaining Transformations

Dataset methods reeturn a new Dataset, so we can chain transformations like so

In [4]:
dataset = dataset.repeat(3).batch(7)
for item in dataset:
    print(item)

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int32)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int32)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int32)
tf.Tensor([8 9], shape=(2,), dtype=int32)


Note that calling the repeat method, it returns a new dataset that repeats the items of the original dataset 3 times, however it does not copy data in memory 3 times!

batch() returned 2 remaining items in the last tensor, if we wanted to keep the tensors with the same size we could pass ```drop_remainder=True```

We can also apply transformations to each item with the ```map``` method. Sometimes we will perform expensive computations with this method, such as rotating or reshaping an image. To spawn multiple threads to speed things up, set the ```num_parallel_calls``` argument. Note that the function you pass to ```map``` must be convertible to a TF function.

In [5]:
for item in dataset.map(lambda x: x**2):
    print(item)

Cause: could not parse the source code:

for item in dataset.map(lambda x: x**2):

This error may be avoided by creating the lambda in a standalone statement.

Cause: could not parse the source code:

for item in dataset.map(lambda x: x**2):

This error may be avoided by creating the lambda in a standalone statement.

tf.Tensor([ 0  1  4  9 16 25 36], shape=(7,), dtype=int32)
tf.Tensor([49 64 81  0  1  4  9], shape=(7,), dtype=int32)
tf.Tensor([16 25 36 49 64 81  0], shape=(7,), dtype=int32)
tf.Tensor([ 1  4  9 16 25 36 49], shape=(7,), dtype=int32)
tf.Tensor([64 81], shape=(2,), dtype=int32)


The ```apply()```  methods applies a transformation to the dataset as whole. Using the ```unbatch``` function, each item in the new dataset will be a single-integer instead of a batch of seven items.

In [6]:
dataset = dataset.apply(tf.data.Dataset.unbatch)
for item in dataset:
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype

We can also filter data with ```filter```

In [7]:
dataset = dataset.filter(lambda x: x < 10)

And to look at a t few items from the dataset use ```take```

In [8]:
for item in dataset.take(3):
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)


## Shuffling the data

The ```shuffle()``` method creates a new dataset that fills up a buffer with the first items of the source dataset. Whenever it is asked for an item, it will pull one out randomly from the buffer and replace it with a fresh new one from the source dataset, until it has iterated thorugh the source dataset. At this point it continues to  pull out items randomly from the buffer until it is empty.

You must specify the buffer size and it is important to make it large enough else shuffling will not be very effective

In [9]:
dataset = tf.data.Dataset.range(10).repeat(3)
dataset = dataset.shuffle(buffer_size=5, seed=42).batch(7)
for item in dataset:
    print(item)

tf.Tensor([0 2 3 6 7 9 4], shape=(7,), dtype=int64)
tf.Tensor([5 0 1 1 8 6 5], shape=(7,), dtype=int64)
tf.Tensor([4 8 7 1 2 3 0], shape=(7,), dtype=int64)
tf.Tensor([5 4 2 7 8 9 9], shape=(7,), dtype=int64)
tf.Tensor([3 6], shape=(2,), dtype=int64)


For large datasets tha do not fit im memory, this simple shuffling-buffer approach may not be sufficient, since the buffer will be small compared to the size of the dataset.

One solution to this is to shuffle the source data itself, for example on Linux you can use ```shuf``` to shuffle text files. 
Even if the source data is shuffled, you migh want to shuffle it more or else the same order will be repeated at each epoch and the model might end up biased. To shuffle some more, a common approach is to split the source data into multiple files, then read them in a random order during training. With this, instances located in the same file will still end up close to each other. To avoid this you can pick multiple files randomly and read them simutaneously, interleaving their records.
Then on top of that we can add a shuffling buffer with ```shuffle```.

The best part about this: the Data API makes it easy for you to do all this.

### Interleaving lines from multiple files.

Let's start by loading the California Housing dataset, shuffling it, split into a training and validation set and a test. Finally we split each set into many csv files

In [10]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(
    housing.data, housing.target.reshape(-1, 1), random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full, random_state=42)

scaler = StandardScaler()
scaler.fit(X_train)
X_mean = scaler.mean_
X_std = scaler.scale_

In [18]:
import os

def save_to_multiple_csv_files(data, name_prefix, header=None, n_parts=10):
    housing_dir = os.path.join("datasets", "housing")
    os.makedirs(housing_dir, exist_ok=True)
    path_format = os.path.join(housing_dir, "my_{}_{:02d}.csv")

    filepaths = []
    m = len(data)
    for file_idx, row_indices in enumerate(np.array_split(np.arange(m), n_parts)):
        part_csv = path_format.format(name_prefix, file_idx)
        filepaths.append(part_csv)
        with open(part_csv, "wt", encoding="utf-8") as f:
            if header is not None:
                f.write(header)
                f.write("\n")
            for row_idx in row_indices:
                f.write(",".join([repr(col) for col in data[row_idx]]))
                f.write("\n")
    return filepaths

In [19]:
import numpy as np

train_data = np.c_[X_train, y_train]
valid_data = np.c_[X_valid, y_valid]
test_data = np.c_[X_test, y_test]
header_cols = housing.feature_names + ['MedianHouseValue']
header = ",".join(header_cols)

train_filepaths = save_to_multiple_csv_files(train_data, 'train', header, n_parts=20)
train_filepaths = save_to_multiple_csv_files(valid_data, 'valid', header, n_parts=10)
train_filepaths = save_to_multiple_csv_files(test_data, 'test', header, n_parts=10)

In [20]:
import pandas as pd
pd.read_csv(train_filepaths[0]).head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedianHouseValue
0,1.6812,25.0,4.192201,1.022284,1392.0,3.877437,36.06,-119.01,0.477
1,2.5313,30.0,5.039384,1.193493,1565.0,2.679795,35.14,-119.46,0.458
2,3.4801,52.0,3.977155,1.185877,1310.0,1.360332,37.8,-122.44,5.00001
3,5.7376,17.0,6.163636,1.020202,1705.0,3.444444,34.28,-118.72,2.186
4,3.725,34.0,5.492991,1.028037,1063.0,2.483645,36.62,-121.93,2.78


Now let's create a dataset using only these filepaths. By default, ```list_files``` retuns a dataset that shuffles the file paths. You can set shuffle=False if you don't want it

In [22]:
filepath_dataset = tf.data.Dataset.list_files(train_filepaths, seed=42)

Next we use the ```interleave``` method to read from five files at a time and interleave their lines

In [24]:
n_readers = 5
dataset = filepath_dataset.interleave(
    lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
    cycle_length=n_readers)

This pulls five files at a time, skipping the first line (containing headers) then constructs a dataset by reading one line from each file. It then pulls the next five file paths, interleaves them in the same way and so on until it runs out of file paths

In [26]:
for line in dataset.take(5):
    print(line.numpy())

b'5.3623,15.0,7.55956678700361,1.1407942238267148,937.0,3.3826714801444044,33.73,-116.89,2.013'
b'5.5968,23.0,5.783870967741936,1.0290322580645161,1145.0,3.693548387096774,37.4,-121.85,2.432'
b'1.2012,12.0,1.4657534246575343,0.8986301369863013,1194.0,3.271232876712329,34.05,-118.27,2.75'
b'6.2427,19.0,6.446293494704992,1.0257186081694403,2621.0,3.9652042360060515,33.85,-118.08,2.887'
b'3.2596,33.0,5.017656500802568,1.0064205457463884,2300.0,3.691813804173355,32.71,-117.03,1.03'


These are the first 5 lines of the dataset, but they are byte strings, we now need to parse them.

### Preprocessing Data

In [30]:
# Using X_mean, X_std from above...
n_inputs = 8

def preprocess(line):
    defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]
    fields = tf.io.decode_csv(line, record_defaults=defs)
    x = tf.stack(fields[:-1])  # Convert features to 1D tensor array
    y = tf.stack(fields[-1:])  # Convert target to 1D tensor array
    return (x - X_mean) / X_std, y

In [34]:
for line in dataset.take(1):
    print(preprocess(line.numpy()))

(<tf.Tensor: shape=(8,), dtype=float32, numpy=
array([-1.4092057 , -1.3151377 , -1.564544  , -0.4318407 , -0.21015665,
        0.1322812 , -0.74789494,  0.6568782 ], dtype=float32)>, <tf.Tensor: shape=(1,), dtype=float32, numpy=array([2.75], dtype=float32)>)


### Putting everything together / Prefetching

In [36]:
def csv_reader_dataset(filepaths, repeat=1, n_readers=5,
                       n_read_threads=None, shuffle_buffer_size=10000,
                       n_parse_threads=5, batch_size=32):
    dataset = tf.data.Dataset.list_files(filepaths)
    dataset = dataset.interleave(
        lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
        cycle_length=n_readers, num_parallel_calls=n_read_threads)
    dataset = dataset.map(preprocess, num_parallel_calls=n_read_threads)
    dataset = dataset.shuffle(shuffle_buffer_size).repeat(repeat)
    return dataset.batch(batch_size).prefetch(1)

The last line in the function above uses the ```prefetch``` method. This creates a dataset that tries to be one batch ahead. For example if we are training a model on this dataset, while the training is happening, the dataset will already be working in parallel on getting the next batch ready. This can dramatically improve performance.

### Using teh Dataset with tf.keras