# TensorFlow Dataset API

**Learning Objectives**
1. Learn how to use tf.data to read data from memory
1. Learn how to use tf.data in a training loop
1. Learn how to use tf.data to read data from disk
1. Learn how to write production input pipelines with feature engineering (batching, shuffling, etc.)


In this notebook, we will start by refactoring the linear regression we implemented in the previous lab so that it takes data from a `tf.data.Dataset`, and we will learn how to implement **stochastic gradient descent** with it. In this case, the original dataset will be synthetic and read by the `tf.data` API directly from memory.

In the second part, we will learn how to load a dataset with the `tf.data` API when the dataset resides on disk.

Each learning objective will correspond to a __#TODO__ in the [student lab notebook](https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/deepdive2/introduction_to_tensorflow/labs/2_dataset_api.ipynb) -- try to complete that notebook first before reviewing this solution notebook.


In [None]:
# Import necessary libraries

import json  # Encode and decode JSON data
import os  # Interact with the operating system
import math  # Mathematical functions from the standard library
from pprint import pprint

import numpy as np
import tensorflow as tf
print("Tensorflow version: ", tf.version.VERSION)

#### Objective 1: Loading data from memory

#### Creating the dataset

Let us consider the synthetic dataset of the previous section:

In [None]:
n_points = 20
# tf.constant() creates a constant tensor from a tensor-like object
X = tf.constant(range(n_points), dtype=tf.float16)
Y = 2 * X + 10
pprint(X)
pprint(Y)

We begin with implementing a function that takes as input


- our $X$ and $Y$ vectors of synthetic data generated by the linear function $y= 2x + 10$
- the number of passes over the dataset we want to train on (`epochs`)
- the size of the batches of the dataset (`batch_size`)

and returns a `tf.data.Dataset`: 

**Remark:** The last batch may contain fewer elements than you specified, because the dataset can run out before the batch is full.

To get batches that all contain exactly the same number of elements, discard the incomplete last batch by setting:

```python
dataset = dataset.batch(batch_size, drop_remainder=True)
```

We will do that here.
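
The effect of `drop_remainder` can be seen on a toy dataset. This is a minimal sketch that uses `tf.data.Dataset.range(10)` and a batch size of 3 as illustrative assumptions, not the notebook's data:

```python
import tensorflow as tf

# Toy dataset with 10 elements; 10 is not a multiple of the batch size 3.
ds = tf.data.Dataset.range(10)

# Default behaviour: the last batch holds whatever is left over.
full = [b.numpy().tolist() for b in ds.batch(3)]

# drop_remainder=True discards the incomplete last batch.
trimmed = [b.numpy().tolist() for b in ds.batch(3, drop_remainder=True)]

print(full)     # four batches, the last one with a single element
print(trimmed)  # three batches, all of size 3
```

Dropping the remainder loses a few examples per pass but guarantees a fixed batch shape, which is why the assertions on `len(x)` later in this notebook can hold.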

In [None]:
# Create dataset procedure

def new_dataset(X, Y, epochs, batch_size):
    # Use tf.data.Dataset.from_tensor_slices to slice X and Y along their first dimension into (x, y) pairs
    dataset = tf.data.Dataset.from_tensor_slices((X, Y))
    dataset = dataset.repeat(epochs).batch(batch_size, drop_remainder=True)
    return dataset
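
Note that `new_dataset` calls `repeat` before `batch`, so batches can straddle epoch boundaries. The sketch below compares both orderings on a toy `Dataset.range(10)` with 2 epochs and a batch size of 4 (all illustrative values):

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10)

# repeat -> batch: the 20 repeated elements are batched as one long stream,
# so a batch may mix the end of one epoch with the start of the next.
repeat_then_batch = [
    b.numpy().tolist() for b in ds.repeat(2).batch(4, drop_remainder=True)
]

# batch -> repeat: each epoch is batched on its own, so the two leftover
# elements are dropped in every epoch.
batch_then_repeat = [
    b.numpy().tolist() for b in ds.batch(4, drop_remainder=True).repeat(2)
]

print(len(repeat_then_batch), repeat_then_batch[2])
print(len(batch_then_repeat))
```

With `repeat` first, fewer examples are lost to `drop_remainder`, at the cost of blurring the boundary between epochs.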

In [None]:
# Try the function by iterating six times over our dataset in batches of 6 points

EPOCH = 6
BATCH_SIZE = 6

dataset = new_dataset(X, Y, epochs=EPOCH, batch_size=BATCH_SIZE)
pprint(dataset)

for i, (x, y) in enumerate(dataset):
    print("x: ", x.numpy(), "y: ", y.numpy())
    assert len(x) == BATCH_SIZE
    assert len(y) == BATCH_SIZE

#### Loss Function and Gradients

The loss function and the function that computes the gradients are the same as before.

In [None]:
# loss_mse() returns the mean squared error between the predictions w0 * X + w1 and the targets Y

def loss_mse(X, Y, w0, w1):
    Y_hat = w0 * X + w1
    error = (Y_hat - Y)**2
    return tf.reduce_mean(error)

# compute_gradient() records the loss computation on a tf.GradientTape and returns the gradients of the loss with respect to w0 and w1
def compute_gradient(X, Y, w0, w1):
    with tf.GradientTape() as tape:
        loss = loss_mse(X, Y, w0, w1)
    return tape.gradient(loss, [w0, w1])
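
For the mean squared error of the model $\hat{y} = w_0 x + w_1$, the gradients have a closed form: $\partial L / \partial w_0 = \frac{2}{n} \sum_i (\hat{y}_i - y_i)\, x_i$ and $\partial L / \partial w_1 = \frac{2}{n} \sum_i (\hat{y}_i - y_i)$. As a sanity check, the sketch below compares them with what `tf.GradientTape` returns. The four data points and the starting weights are arbitrary illustrative values, and `float32` is used here for simplicity:

```python
import numpy as np
import tensorflow as tf

# Tiny illustrative dataset following y = 2x + 10.
X = tf.constant([0.0, 1.0, 2.0, 3.0])
Y = 2 * X + 10

# Arbitrary starting weights (assumed for this example only).
w0 = tf.Variable(0.5)
w1 = tf.Variable(1.0)

# Gradients from automatic differentiation.
with tf.GradientTape() as tape:
    loss = tf.reduce_mean((w0 * X + w1 - Y) ** 2)
dw0, dw1 = tape.gradient(loss, [w0, w1])

# Gradients from the closed-form expressions.
residual = w0.numpy() * X.numpy() + w1.numpy() - Y.numpy()
analytic_dw0 = 2 * np.mean(residual * X.numpy())
analytic_dw1 = 2 * np.mean(residual)

print(dw0.numpy(), analytic_dw0)
print(dw1.numpy(), analytic_dw1)
```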

#### Objective 2: Use tf.data in Training Loop

The main difference is that, in the training loop, we will now iterate directly on the `tf.data.Dataset` generated by our `new_dataset` function.

In [None]:
# Configure the dataset so that it iterates 300 times over our synthetic dataset in batches of 6

EPOCHS = 300
BATCH_SIZE = 6
LEARNING_RATE = .02

MSG = "STEP {step} - loss: {loss}, w0: {w0}, w1: {w1}\n"

w0 = tf.Variable(0.0, dtype=tf.float16)
w1 = tf.Variable(0.0, dtype=tf.float16)

dataset = new_dataset(X, Y, epochs=EPOCHS, batch_size=BATCH_SIZE)

for step, (X_batch, Y_batch) in enumerate(dataset):
    
    dw0, dw1 = compute_gradient(X_batch, Y_batch, w0, w1)
    w0.assign_sub(dw0 * LEARNING_RATE)
    w1.assign_sub(dw1 * LEARNING_RATE)
    
    if step % 100 == 0:
        loss = loss_mse(X_batch, Y_batch, w0, w1)
        print(MSG.format(step=step, loss=loss, w0=w0.numpy(), w1=w1.numpy()))
        
assert loss < 0.0001
assert abs(w0 -2) < 0.001
assert abs(w1 - 10) < 0.001
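
To make the update rule concrete, here is a single gradient descent step worked out in plain Python. It is a sketch under assumptions: 10 points following $y = 2x + 10$, a learning rate of 0.01, the whole dataset used as one batch, and the helper names `mse` and `gradients` are hypothetical stand-ins for `loss_mse` and `compute_gradient`:

```python
# Toy data following y = 2x + 10 (10 points chosen to keep the arithmetic easy).
xs = [float(i) for i in range(10)]
ys = [2 * x + 10 for x in xs]


def mse(w0, w1):
    """Mean squared error of the linear model w0 * x + w1."""
    return sum((w0 * x + w1 - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradients(w0, w1):
    """Closed-form gradients of the mean squared error."""
    n = len(xs)
    dw0 = 2 * sum((w0 * x + w1 - y) * x for x, y in zip(xs, ys)) / n
    dw1 = 2 * sum((w0 * x + w1 - y) for x, y in zip(xs, ys)) / n
    return dw0, dw1


learning_rate = 0.01  # illustrative value, not the notebook's setting
w0, w1 = 0.0, 0.0

loss_before = mse(w0, w1)
dw0, dw1 = gradients(w0, w1)

# Same update as w0.assign_sub(dw0 * LEARNING_RATE) in the training loop.
w0 -= learning_rate * dw0
w1 -= learning_rate * dw1
loss_after = mse(w0, w1)

print(dw0, dw1)                 # -204.0 and -38.0
print(loss_before, loss_after)  # the loss drops from 394.0 to about 89.13
```

In the notebook's loop the same step is applied once per batch, which is what makes the procedure stochastic: each update sees only the points in the current batch.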

#### Objective 3: Loading data from Disk

##### Locating the CSV files


The taxifare dataset files have been saved into `../data`.


#### Use tf.data to load the CSV files

The `tf.data` API can easily read CSV files using the helper function `tf.data.experimental.make_csv_dataset`.

If you have TFRecords (which is recommended), you may use `tf.data.experimental.make_batched_features_dataset` instead.

The first step is to define:

- the feature names in a list `CSV_COLUMNS`
- their default values in a list `DEFAULTS`

In [None]:
# Define the column names in the list CSV_COLUMNS

CSV_COLUMNS = [
    'fare_amount',
    'pickup_datetime',
    'pickup_longitude',
    'pickup_latitude',
    'dropoff_longitude',
    'dropoff_latitude',
    'passenger_count',
    'key'
]

# define label column
LABEL_COLUMN = 'fare_amount'

# Define default values into the list DEFAULTS
DEFAULTS = [
    [0.0], ['na'], [0.0], [0.0], [0.0], [0.0], [0.0], ['na']]

Wrap the call to `make_csv_dataset` in its own function that takes only the file pattern (glob) describing where the dataset files are located.

In [None]:
def create_dataset(pattern):
    # make_csv_dataset() reads the CSV files matching the pattern into a dataset (batch size 1 here)
    return tf.data.experimental.make_csv_dataset(pattern, 1, column_names=CSV_COLUMNS, column_defaults=DEFAULTS)


tempds = create_dataset('taxi-test.csv')
print(tempds)

Note that the output is a prefetch dataset in which each element is an `OrderedDict` whose keys are the feature names and whose values are tensors of shape `(1,)`, i.e. vectors.
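
The element structure can be explored without the taxi files. The sketch below writes a hypothetical three-column CSV to a temporary directory and reads it back; the file name, its contents, and the choice of `num_epochs=1` and `shuffle=False` (to make the output finite and ordered) are all assumptions made for illustration:

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical toy file standing in for the taxi CSVs.
path = os.path.join(tempfile.mkdtemp(), "toy.csv")
with open(path, "w") as f:
    f.write("fare_amount,passenger_count,key\n")  # header row
    f.write("5.5,1.0,a\n")
    f.write("7.0,2.0,b\n")

# Column names are taken from the header and column types are inferred.
toy_ds = tf.data.experimental.make_csv_dataset(
    path,
    batch_size=1,
    num_epochs=1,   # read the file once instead of repeating forever
    shuffle=False,  # keep the file order so the output is predictable
)

rows = list(toy_ds)
first = rows[0]
for name, tensor in first.items():
    print(name, tensor.shape, tensor.numpy())
```

Each element is a dictionary-like mapping from column name to a tensor whose first dimension is the batch size, so with `batch_size=1` every value has shape `(1,)`.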

In [None]:
# Iterate over the first two elements in the dataset using dataset.take(2)
# Convert the tensors in each dictionary to NumPy arrays for readability

for data in tempds.take(2):
    pprint({k: v.numpy() for k, v in data.items()})
    print("\n")

#### Objective 4 - Creating Input Pipelines - Transforming the features

The objective is to create a dictionary of features plus a label. To do so, we are going to:
1. Remove the unwanted columns (`pickup_datetime` and `key`)
2. Keep the label separate from the features

Let's first implement a function that takes as input a row (represented as an `OrderedDict` in our `tf.data.Dataset`, as above) and returns a tuple with two elements:

- The first element being the same `OrderedDict` with the label dropped
- The second element being the label itself (`fare_amount`)

Note that we will also need to remove the `key` and `pickup_datetime` columns, which we won't use.

In [None]:
UNWANTED_COLS = ['pickup_datetime', 'key']

# Define the features_and_labels() function

def features_and_labels(row_data):
    label = row_data.pop(LABEL_COLUMN)
    features = row_data
    
    for unwanted_col in UNWANTED_COLS:
        features.pop(unwanted_col)
    
    return features, label

In [None]:
# Iterate for two examples in tempds to make sure it is working as expected
for row_data in tempds.take(2):
    features, label = features_and_labels(row_data)
    pprint(features)
    print(label, "\n")
    assert UNWANTED_COLS[0] not in features.keys()
    assert UNWANTED_COLS[1] not in features.keys()
    assert label.shape == [1]
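
One thing to keep in mind is that `features_and_labels` pops keys from the dictionary it receives, so the caller's row is modified in place. If that side effect is unwanted, a non-mutating variant can copy the row first. The sketch below uses a hypothetical function name, `split_features_and_label`, and a plain Python dictionary with made-up values in place of a dataset row:

```python
LABEL_COLUMN = "fare_amount"
UNWANTED_COLS = ["pickup_datetime", "key"]


def split_features_and_label(row_data):
    """Return (features, label) without modifying row_data."""
    features = dict(row_data)  # shallow copy; the values are not duplicated
    label = features.pop(LABEL_COLUMN)
    for unwanted_col in UNWANTED_COLS:
        features.pop(unwanted_col, None)  # ignore columns that are absent
    return features, label


# Made-up row for illustration only.
row = {
    "fare_amount": 5.5,
    "pickup_datetime": "2014-01-01 00:00:00",
    "passenger_count": 1.0,
    "key": "a",
}
features, label = split_features_and_label(row)

print(features, label)
print(sorted(row))  # the original row still has all four columns
```

Because it only relies on dictionary operations, the same function should also be usable with `dataset.map`, although that is not exercised here.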

#### Batching

Let's now refactor our `create_dataset` function so that it takes an additional argument `batch_size` and batches the data accordingly. We will also apply the `features_and_labels` function we implemented, so that the dataset produces tuples of features and labels.

In [None]:
# Define the create_dataset function:

def create_dataset(pattern, batch_size):
    dataset = tf.data.experimental.make_csv_dataset(
        pattern, batch_size, CSV_COLUMNS, DEFAULTS)
    return dataset.map(features_and_labels)

In [None]:
# Let's test that our batches are of the right size

BATCH_SIZE = 2

tempds = create_dataset('taxi-test.csv', batch_size=BATCH_SIZE)

for X_batch, Y_batch in tempds.take(2):
    pprint({k: v.numpy() for k, v in X_batch.items()})
    print(Y_batch.numpy(), "\n")
    assert len(Y_batch) == BATCH_SIZE
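
Besides iterating, the structure of a batched dataset can be checked without reading any data through its `element_spec` property. The sketch below builds a toy `(features, label)` dataset in memory; the column name and values are illustrative assumptions:

```python
import tensorflow as tf

# Toy (features, label) data with the same structure as our pipeline output.
features = {"passenger_count": [1.0, 2.0, 1.0, 3.0]}
labels = [5.5, 7.0, 6.0, 9.5]

toy_ds = tf.data.Dataset.from_tensor_slices((features, labels)).batch(2)

# element_spec describes one element: a (dict of features, label) tuple.
feature_spec, label_spec = toy_ds.element_spec
print(feature_spec)
print(label_spec)
```

The leading dimension is reported as `None` because, without `drop_remainder=True`, the last batch may be smaller than the requested batch size.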

#### Shuffling

When training a deep learning model in batches over multiple workers, it is helpful if we shuffle the data. That way, different workers will be working on different parts of the input file at the same time, and so averaging gradients across workers will help. Also, during training, we will need to read the data indefinitely.
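
How `shuffle` and `repeat` interact can be checked on a toy dataset. In this sketch the dataset (`Dataset.range(10)`), the buffer size, and the seed are illustrative assumptions; calling `shuffle` before `repeat` means each pass over the data is a complete, separately shuffled permutation:

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10)

# Shuffle first, then repeat: every epoch is emitted in full before the
# next one starts. A buffer as large as the dataset gives a full shuffle.
shuffled = ds.shuffle(buffer_size=10, seed=42).repeat(2)

values = [int(v) for v in shuffled.as_numpy_iterator()]
first_epoch, second_epoch = values[:10], values[10:]

print(first_epoch)
print(second_epoch)
```

A buffer smaller than the dataset only shuffles within a sliding window, so the buffer size is a trade-off between memory use and how well the data is mixed.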


Let's refactor our `create_dataset` function so that it shuffles the data, when the dataset is used for training.

We will introduce an additional argument `mode` to our function to allow the function body to distinguish the case
when it needs to shuffle the data (`mode == "train"`) from when it shouldn't (`mode == "eval"`).

Also, before returning, we will prefetch one element (here, one batch) ahead of time (`dataset.prefetch(1)`) to speed up training:

In [None]:
def create_dataset(pattern, batch_size=1, mode='eval'):
    dataset = tf.data.experimental.make_csv_dataset(pattern, batch_size, CSV_COLUMNS, DEFAULTS)
    
    dataset = dataset.map(features_and_labels)
    
    if mode == 'train':
        dataset = dataset.shuffle(1000).repeat()
    
    # Prefetch one batch so that input preparation overlaps with training;
    # passing tf.data.AUTOTUNE instead would let tf.data choose the buffer size
    dataset = dataset.prefetch(1)
    return dataset
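
A fixed buffer of 1 is a reasonable default, but the buffer size can also be left to `tf.data` by passing `tf.data.AUTOTUNE`. The sketch below assumes TensorFlow 2.4 or later (older 2.x releases expose the same constant as `tf.data.experimental.AUTOTUNE`) and uses a toy dataset to show that prefetching changes timing, not content:

```python
import tensorflow as tf

ds = tf.data.Dataset.range(6).batch(2)

# Fixed prefetch buffer of one batch, as in create_dataset.
fixed = ds.prefetch(1)

# Let tf.data tune the prefetch buffer size at runtime.
autotuned = ds.prefetch(tf.data.AUTOTUNE)

fixed_batches = [b.numpy().tolist() for b in fixed]
auto_batches = [b.numpy().tolist() for b in autotuned]

print(fixed_batches)
print(auto_batches)
```

Both pipelines yield the same batches in the same order; prefetching only affects how far ahead the next batch is prepared.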

In [None]:
#Check if the function is working as expected

tempds = create_dataset('taxi-train.csv', 2, "train")
print(list(tempds.take(2)))

In [None]:
tempds = create_dataset('taxi-valid.csv', 3, "eval")
print(list(tempds.take(3)))