# SIT744 Practical 5: tf.data input pipeline

*Dr Wei Luo*

<div class="alert alert-info">
We suggest that you run this notebook using Google Colab.
</div>


## Pre-practical readings

- [tf.data: Build TensorFlow input pipelines](https://www.tensorflow.org/guide/data)

## Task 1 Experiments with neural networks

Last week, we saw how neural networks can be used for classification and regression. In this task, you will complete some experiments suggested in the textbook.


### Task 1.1 Experiments with binary classification

The code for binary classification is copied below for you.

Complete the following experiments:

- You used two representation layers before the final classification layer. Try using one or three representation layers, and see how doing so affects validation and test accuracy.
- Try using layers with more units or fewer units: 32 units, 64 units, and so on.
- Try using the mse loss function instead of binary_crossentropy.
- Try using the tanh activation (an activation that was popular in the early days of neural networks) instead of relu.

In [0]:
import tensorflow as tf
from tensorflow import keras
import numpy as np
from tensorflow.keras.datasets import imdb



(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)



def vectorize_sequences(sequences, dimension=10000):
    # Create an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set specific indices of results[i] to 1s
    return results

# Our vectorized training data
x_train = vectorize_sequences(train_data)
# Our vectorized test data
x_test = vectorize_sequences(test_data)


# Our vectorized labels
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')


x_val = x_train[:10000]
partial_x_train = x_train[10000:]

y_val = y_train[:10000]
partial_y_train = y_train[10000:]
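
To see what `vectorize_sequences` produces, it can help to run it on a tiny input first. The sketch below repeats the function so that it runs on its own; the two toy sequences and `dimension=4` are made up for illustration. Each row of the result has a 1 at every index that occurs in the corresponding sequence, and repeated indices still give a single 1.

```python
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # One row per sequence, one column per possible word index
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # mark every index that occurs in the sequence
    return results

# Hypothetical toy input: two short "reviews" over a vocabulary of 4 words
toy_sequences = [[0, 2], [1, 1, 3]]
toy_matrix = vectorize_sequences(toy_sequences, dimension=4)
print(toy_matrix)
```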

In [0]:
from tensorflow.keras import models
from tensorflow.keras import layers
from tensorflow.keras import optimizers

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))



model.compile(optimizer=optimizers.RMSprop(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])

history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))



In [0]:

import matplotlib.pyplot as plt

acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

results = model.evaluate(x_test, y_test)
results
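
The experiments listed at the start of this task all change one setting at a time, so one possible way to organise them is a small helper that builds a model from those settings. This is only a sketch: `build_model` and its parameters (`n_hidden`, `units`, `activation`, `loss`, `input_dim`) are hypothetical names introduced here, not part of the textbook code.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(n_hidden=2, units=16, activation="relu",
                loss="binary_crossentropy", input_dim=10000):
    # Hypothetical helper: the defaults reproduce the architecture used above
    model = keras.Sequential()
    model.add(keras.Input(shape=(input_dim,)))
    for _ in range(n_hidden):
        model.add(layers.Dense(units, activation=activation))
    model.add(layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="rmsprop", loss=loss, metrics=["accuracy"])
    return model

# One variant: three tanh layers of 32 units trained with the mse loss.
# input_dim is kept small here only so that the sketch builds quickly.
variant = build_model(n_hidden=3, units=32, activation="tanh",
                      loss="mse", input_dim=100)
print(len(variant.layers))
```

Each variant can then be trained with the same `model.fit(...)` call as above, and the resulting validation curves compared.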

### Task 1.2 Experiments with multiclass classification

The code for multiclass classification is copied below for you.

Complete the following experiments:
- Try using larger or smaller layers: 32 units, 128 units, and so on.
- You used two intermediate layers before the final softmax classification layer. Now try using a single intermediate layer, or three intermediate layers.


In [0]:
from tensorflow.keras.datasets import reuters
from tensorflow.keras.utils import to_categorical

(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=10000)

def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

# Our vectorized training data
x_train = vectorize_sequences(train_data)
# Our vectorized test data
x_test = vectorize_sequences(test_data)


one_hot_train_labels = to_categorical(train_labels)
one_hot_test_labels = to_categorical(test_labels)

In [0]:
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

x_val = x_train[:1000]
partial_x_train = x_train[1000:]

y_val = one_hot_train_labels[:1000]
partial_y_train = one_hot_train_labels[1000:]


history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))

In [0]:
import matplotlib.pyplot as plt

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(loss) + 1)

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()


model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(partial_x_train,
          partial_y_train,
          epochs=8,
          batch_size=512,
          validation_data=(x_val, y_val))
results = model.evaluate(x_test, one_hot_test_labels)
results

## Task 2 TensorFlow input pipelines

If you are familiar with the scikit-learn [Pipeline](https://scikit-learn.org/stable/modules/compose.html#pipeline), you will appreciate the convenience it brings to data preprocessing. In TensorFlow, the `tf.data` API provides similar functionality.

1. When you are dealing with a huge dataset or a data stream that cannot fit into memory, the `tf.data` API provides a way to feed training data batches into TensorFlow.
2. As we saw in the lecture, data in raw formats (for example text) often needs to be preprocessed. The `tf.data` API allows this and other transformations to be performed on-the-fly.
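
As a rough illustration of the first point, a Dataset can be backed by a Python generator, so that elements are produced as the pipeline is iterated rather than stored up front. This is a minimal sketch; `reading_stream` is a hypothetical stand-in for a real data stream.

```python
import tensorflow as tf

def reading_stream(limit=1_000_000):
    # Hypothetical data source: yields one value at a time
    for i in range(limit):
        yield i

stream = tf.data.Dataset.from_generator(
    reading_stream,
    output_signature=tf.TensorSpec(shape=(), dtype=tf.int32))

# Only the first two batches of four elements are requested here
first_batches = [batch.numpy().tolist() for batch in stream.batch(4).take(2)]
print(first_batches)
```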

The key abstraction in the `tf.data` API is the `tf.data.Dataset` interface, which consists of a sequence of **elements** and an iterator for the sequence. (If you are not familiar with iterators, see [here](https://wiki.python.org/moin/Iterator).) Often an element is a pair of batched data *(training features, labels)*.
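
The idea of elements and an iterator can be sketched with a few in-memory arrays. The arrays below are made up for illustration; each element of `pairs` is a *(features, labels)* pair holding a batch of up to four examples.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data: 6 examples with 2 features each
features = np.arange(12, dtype=np.float32).reshape(6, 2)
labels = np.array([0, 1, 0, 1, 0, 1], dtype=np.int64)

pairs = tf.data.Dataset.from_tensor_slices((features, labels)).batch(4)

# A Dataset is iterable, so the usual iter()/next() protocol applies
iterator = iter(pairs)
batch_features, batch_labels = next(iterator)
print(batch_features.shape, batch_labels.numpy())
```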



### Task 2.1 TensorFlow Datasets: a collection of ready-to-use datasets

If you are looking for some common datasets to test your model, you can find example Datasets in `tensorflow_datasets`. These are similar to the NumPy datasets provided by `tf.keras.datasets`. 

**Warning**: You should not confuse "TensorFlow Datasets" provided by the `tensorflow_datasets` package (aka `tfds`, see below) with the `tf.data.Dataset` API (aka `Dataset`). The former is a collection of ready-to-go datasets packaged using the latter. (Yes I know. Unfortunately, TensorFlow has a lot of confusingly named modules. But you should feel lucky that you do not have to learn TensorFlow v1 anymore.)

In [0]:
import tensorflow as tf
import tensorflow_datasets as tfds

# Construct a tf.data.Dataset
ds = tfds.load('mnist', split='train', shuffle_files=True)

Every Dataset may have a different data structure for its elements. You can see how each element is organised via `Dataset.element_spec`.

In [0]:
print(ds.element_spec)
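
If you want to inspect an `element_spec` without downloading anything, a small in-memory Dataset works as well. The sketch below assumes a dictionary structure with `image` and `label` entries similar to the MNIST elements; the all-zero images are placeholders, not real data.

```python
import numpy as np
import tensorflow as tf

# Placeholder data: five blank 28x28 grayscale images and their labels
toy_ds = tf.data.Dataset.from_tensor_slices({
    "image": np.zeros((5, 28, 28, 1), dtype=np.uint8),
    "label": np.arange(5, dtype=np.int64),
})

# element_spec describes one element, so the leading dimension (5) is gone
print(toy_ds.element_spec)
```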

A Dataset is designed for large datasets, so it is not meant to be loaded into memory all at once. Instead, a Dataset has an iterable interface, and you can use a for-loop to access its elements one after another.

In [0]:
for example in ds.take(1):
  image, label = example["image"], example["label"]


## Show the image and label
import matplotlib.pyplot as plt
import numpy as np

def show(image, label):
  plt.figure()
  plt.imshow(np.squeeze(image), cmap='gray')
  plt.title(label.numpy())
  plt.axis('off')

show(image, label)


In the above example, the function `take(n)` limits the number of elements returned by the iterator to `n`.

**exercise** Find out what other datasets are available from `tensorflow_datasets`.

### Task 2.2 Define a data source

When you want to use your own data, a Dataset can be sourced from either memory or files.

#### Dataset from the memory

To define a data source in the memory, you use the function `tf.data.Dataset.from_tensor_slices()`. 

In [0]:
dataset = tf.data.Dataset.from_tensor_slices([8, 3, 0, 8, 2, 1])

for item in dataset:
    print(item)

In most cases, you do not want to use the closely-named function `tf.data.Dataset.from_tensors()`. 

**exercise** Repeat the above experiment replacing `tf.data.Dataset.from_tensor_slices()` with  `tf.data.Dataset.from_tensors()`. What will you get?

#### Dataset from files



##### TextLineDataset

If you have a text file, you can create a Dataset so that each line is an element.

In [0]:
## You can find the file in your Google Colab host machine
file_paths = ["sample_data/california_housing_train.csv"]

housing_dataset = tf.data.TextLineDataset(file_paths)

print(housing_dataset.element_spec)

for line in housing_dataset.take(5):
    print(line.numpy())



For a CSV file, each line still needs to be parsed. You can follow the code snippet below for that.

In [0]:
for line in housing_dataset.skip(1).take(5):
  column_default_values = [[]] * 9
  columns = tf.io.decode_csv(line, record_defaults = column_default_values)
  print(tf.stack(columns).numpy())
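
The same parsing step can also be attached to the Dataset itself, so that every line is parsed as it is read. The sketch below uses three made-up CSV lines held in memory instead of the housing file; `parse_line` is a hypothetical helper name.

```python
import tensorflow as tf

# Hypothetical CSV content: a header line followed by two numeric rows
csv_lines = ["x,y,z", "1.0,2.0,3.0", "4.0,5.0,6.0"]
lines_ds = tf.data.Dataset.from_tensor_slices(csv_lines)

def parse_line(line):
    # An empty float32 default marks each of the three columns as required
    defaults = [tf.constant([], dtype=tf.float32)] * 3
    columns = tf.io.decode_csv(line, record_defaults=defaults)
    return tf.stack(columns)

# Skip the header, then parse each remaining line
parsed_ds = lines_ds.skip(1).map(parse_line)
rows = [row.numpy().tolist() for row in parsed_ds]
print(rows)
```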

##### make_csv_dataset

Alternatively you can use the function `tf.data.experimental.make_csv_dataset` to read CSV files.

In [0]:
housing_batches = tf.data.experimental.make_csv_dataset(
    file_paths, batch_size=4,
    label_name="median_house_value")

print(housing_batches.element_spec)

housing_batches.element_spec[0]

In [0]:
for features, labels in housing_batches.take(1):
  print(f"median_price: {labels}")
  for key, value in features.items():
      print(f"{key}: {value}")

Below is another example.

In [0]:
TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)

titanic_batches = tf.data.experimental.make_csv_dataset(
    train_file_path, batch_size=4,
    label_name="survived")

print(titanic_batches.element_spec)

print()
print("Features:")
titanic_batches.element_spec[0]

You can use `take(n)` to get the first n elements from a Dataset.

In [0]:
for features, labels in titanic_batches.take(1):
  print(f"survived: {labels}")
  for key, value in features.items():
      print(f"{key}: {value}")

Later on, the features should be processed into numerical tensors. In particular, categorical features should be encoded by either one-hot vectors or more sophisticated embeddings, which we will learn later.
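
As a small preview of one-hot encoding, the sketch below turns a few string categories into one-hot vectors with `tf.one_hot`. The vocabulary and values are made up for illustration, and the index lookup is done in plain Python for simplicity.

```python
import tensorflow as tf

# Hypothetical categorical feature with three possible values
vocabulary = ["First", "Second", "Third"]
values = ["Third", "First", "Third"]

# Map each category to its position in the vocabulary, then one-hot encode
indices = [vocabulary.index(value) for value in values]
one_hot = tf.one_hot(indices, depth=len(vocabulary))
print(one_hot.numpy())
```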

##### TFRecord

Besides CSV files, TensorFlow recommends the binary TFRecord format for your data. You encountered this file format in Practical 2, when we learned about TensorBoard: the event files generated by `tf.summary` are TFRecord files. A TFRecord file contains a sequence of binary records, typically serialized `tf.train.Example` messages. You can use `tf.data.TFRecordDataset` to construct a Dataset from one or more TFRecord files. As the format is more complex, we will not go into the details in this practical.


In [0]:
DATA_URL = "https://storage.googleapis.com/download.tensorflow.org/data/fsns-20160927/testdata/fsns-00000-of-00001"
fsns_test_file = tf.keras.utils.get_file("fsns.tfrec", DATA_URL )
dataset = tf.data.TFRecordDataset(filenames = [fsns_test_file])

print(dataset.element_spec)

Here `Dataset.element_spec` is not of much use, as each element is stored as a byte string. To access the features, follow the example below. As you can see, the data itself is deeply embedded in a nested data structure, which reserves space for future extensions of the API.

In [0]:
for element in dataset.take(1):
  example = tf.train.Example.FromString(element.numpy())

print(dict(example.features.feature).keys())
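
To get a feel for the nesting, it may help to build a `tf.train.Example` yourself, write it to a TFRecord file, and read it back. The sketch below is a toy round trip under made-up assumptions: the file name `toy.tfrec` and the feature names `label` and `text` are hypothetical and unrelated to the FSNS data.

```python
import os
import tempfile
import tensorflow as tf

record_path = os.path.join(tempfile.mkdtemp(), "toy.tfrec")

# Build one Example with two hypothetical features
example = tf.train.Example(features=tf.train.Features(feature={
    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[7])),
    "text": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"hello"])),
}))

# Each record in the file is one serialized Example
with tf.io.TFRecordWriter(record_path) as writer:
    writer.write(example.SerializeToString())

# Reading back requires a description of the expected features
feature_description = {
    "label": tf.io.FixedLenFeature([], tf.int64),
    "text": tf.io.FixedLenFeature([], tf.string),
}
toy_records = tf.data.TFRecordDataset([record_path])
parsed = [tf.io.parse_single_example(record, feature_description)
          for record in toy_records]
print(parsed[0]["label"].numpy(), parsed[0]["text"].numpy())
```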

### Task 2.3 Batching

Batching can be achieved through the `batch()` function.

In [0]:
print("Individual element")
for element in housing_dataset.skip(1).take(8):  # skip the CSV header line
  print(element)

print()
print("Three batches")
for batches in housing_dataset.skip(1).take(8).batch(3):
  print(batches)

**question** What is the size of the last batch?

**exercise** Read the `tf.data.Dataset` documentation to find out how to repeat and shuffle training data. Then apply both to the housing dataset.

### Task 2.4 Dataset transformation 

Data preprocessing can be achieved by chaining transformations over the original Dataset. The main function for element-wise transforms is `Dataset.map()`, which is similar to the `map()` function built into other programming languages (such as [Python](https://docs.python.org/3/library/functions.html#map) or [R](https://purrr.tidyverse.org/reference/map.html)).

Below is the MNIST example we saw earlier.

In [0]:
for example in ds.take(2):
  image, label = example["image"], example["label"]
  show(image, label)

In [0]:
def resize_image(example):
  image, label = example["image"], example["label"]
  image = tf.image.resize(image, [14, 14])
  return image, label 
                     

resized_ds = ds.map(resize_image)


for image, label in resized_ds.take(2):
  show(image, label)
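
Because each transformation returns a new Dataset, several of them can be chained into one pipeline. The sketch below shows one possible chain (resize, then batch, then prefetch) on random placeholder images, so it runs without downloading MNIST; the use of `tf.data.AUTOTUNE` is a choice made for this example, not a requirement.

```python
import numpy as np
import tensorflow as tf

# Placeholder data shaped like MNIST elements (random pixels, not real digits)
toy_ds = tf.data.Dataset.from_tensor_slices({
    "image": np.random.randint(0, 256, size=(8, 28, 28, 1), dtype=np.uint8),
    "label": np.arange(8, dtype=np.int64),
})

def resize_image(example):
    image = tf.image.resize(example["image"], [14, 14])
    return image, example["label"]

# Chain the transformations: element-wise map, then batching, then prefetching
pipeline = (toy_ds
            .map(resize_image, num_parallel_calls=tf.data.AUTOTUNE)
            .batch(4)
            .prefetch(tf.data.AUTOTUNE))

for images, labels in pipeline.take(1):
    print(images.shape, labels.numpy())
```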

**exercise** Use the map function to scale the image pixels to have values between 0 and 1.

## Additional resources

- [A blog article on TensorFlow Data Pipeline](https://heartbeat.fritz.ai/building-a-data-pipeline-with-tensorflow-3047656b5095)