<h1 style='font-size:40px'> tf.data Pipeline</h1>
<div> 
    <ul style='font-size:20px'> 
        <li> 
            In this notebook I'll practice my skills with the tf.data module by solving Exercise 9 from Chapter 13 of Hands-On Machine Learning with Scikit-Learn and TensorFlow.
        </li>
        <li> 
            The exercise asks for the following:
            <p style='font-style:italic;margin-top:10px'> 
                Load the Fashion MNIST dataset (introduced in Chapter 10); split
it into a training set, a validation set, and a test set; shuffle the
training set; and save each dataset to multiple TFRecord files.
Each record should be a serialized Example protobuf with two
features: the serialized image (use tf.io.serialize_tensor()
to serialize each image), and the label. Then use tf.data to create
an efficient dataset for each set. Finally, use a Keras model to
train these datasets, including a preprocessing layer to standardize
each input feature. Try to make the input pipeline as efficient as
possible, using TensorBoard to visualize profiling data.
            </p>
        </li>
    </ul>
</div>

<h2 style='font-size:30px'> Data Importing & Splitting</h2>

In [1]:
# Loading the fashion_mnist dataset.
from tensorflow.keras.datasets import fashion_mnist
from sklearn.model_selection import train_test_split
(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()

# Now, generating the validation set with `train_test_split`.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=.1, random_state=42)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz


In [2]:
# Storing each one of the sets in a tf.data.Dataset object.
# Each dataset yields batches of 1000 elements. Each batch will later be written to its own .tfrecord file.
from tensorflow.data import Dataset
batch_size = 1000
train = Dataset.from_tensor_slices((X_train, y_train)).shuffle(54000).batch(batch_size)
val = Dataset.from_tensor_slices((X_val, y_val)).batch(batch_size)
test = Dataset.from_tensor_slices((X_test, y_test)).batch(batch_size)

In [3]:
# We'll produce .tfrecord files with 1000 instances each, one file per batch created above.
train_files = len(X_train) // batch_size
val_files = len(X_val) // batch_size
test_files = len(X_test) // batch_size
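
The floor division above matches the number of batches only because every set size here (54,000, 6,000 and 10,000) is a multiple of 1000. As a minimal sketch, assuming the last partial batch is kept (the default for `Dataset.batch`), the count can be computed with a ceiling instead. The helper name `count_batches` and the 6,500-sample set are hypothetical and not part of the notebook.

```python
import math

def count_batches(n_samples: int, batch_size: int) -> int:
    """Number of batches produced when the last, possibly smaller, batch is kept."""
    return math.ceil(n_samples / batch_size)

# Set sizes used in this notebook: train, validation and test.
for n_samples in (54000, 6000, 10000):
    # For multiples of the batch size, both computations agree.
    print(n_samples, count_batches(n_samples, 1000), n_samples // 1000)

# Hypothetical set size that is not a multiple of 1000:
# the ceiling counts the final partial batch, the floor division does not.
print(count_batches(6500, 1000), 6500 // 1000)
```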

<h2 style='font-size:30px'> Producing the .tfrecord Files</h2>
<div> 
    <ul style='font-size:20px'> 
        <li> 
            TFRecord files usually store their data as serialized protocol buffers (protobufs). We can follow that convention by placing the serialized information in an `Example` protobuf. 
        </li>
    </ul>
</div>
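
To make the idea concrete, here is a minimal sketch of a round trip through an `Example` protobuf for a single image. The image content and the label value are made up for illustration, and storing the label in an `Int64List` is an assumption of this sketch, not necessarily how the rest of the notebook does it.

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in for one 28x28 Fashion MNIST image and its label.
image = (np.arange(28 * 28) % 256).astype(np.uint8).reshape(28, 28)
label = 3

# Build the Example: the image goes in as a serialized tensor (bytes),
# the label as a plain integer.
example = tf.train.Example(features=tf.train.Features(feature={
    'pixels': tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[tf.io.serialize_tensor(image).numpy()])),
    'target': tf.train.Feature(
        int64_list=tf.train.Int64List(value=[label])),
}))
serialized = example.SerializeToString()

# Parsing needs a description of each feature's type and shape.
feature_description = {
    'pixels': tf.io.FixedLenFeature([], tf.string),
    'target': tf.io.FixedLenFeature([], tf.int64),
}
parsed = tf.io.parse_single_example(serialized, feature_description)

# `parse_tensor` must be given the dtype the tensor was serialized with.
restored = tf.io.parse_tensor(parsed['pixels'], out_type=tf.uint8)
print(restored.shape, int(parsed['target']))
```

The `out_type` passed to `tf.io.parse_tensor` has to match the dtype used at serialization time, which is why the sketch keeps the image as `uint8` throughout.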

In [4]:
from tensorflow import Tensor
from tensorflow.train import BytesList, Example, Features, Feature, Int64List
from tensorflow.io import serialize_tensor, TFRecordWriter

def create_example(images:Tensor, targets:Tensor)->bytes:
    '''
        Generates a serialized `Example` object holding the pixel intensities and target values from a batch
        of Fashion MNIST images.
        
        Parameters
        ----------
        `images`: A 3-D `tf.Tensor` with the images' pixels. \n
        `targets`: A 1-D `tf.Tensor` with the images' labels.
        
        Returns
        -------
        The `tf.train.Example` storing both pixels and target values, serialized to a binary string.
    '''
    # Serializing the input tensors.
    serialized_images, serialized_targets = serialize_tensor(images), serialize_tensor(targets)
    example = Example(
        features=Features(
            feature={
            'pixels':Feature(bytes_list=BytesList(value=[serialized_images.numpy()])),
            'target':Feature(bytes_list=BytesList(value=[serialized_targets.numpy()]))
        }
        ))
    # Now, converting the `Example` object into a binary string.
    return example.SerializeToString()

In [5]:
# It is convenient to place all data files in a separate directory.
! mkdir -p mnist

In [6]:
def create_files(dataset:Dataset, filename:str, directory:str='.')->None:
    '''
        Creates the .tfrecord files from the batches of a provided `dataset`, one file per batch.
        
        Parameters
        ----------
        `dataset`: A batched `tf.data.Dataset` object yielding `(images, labels)` pairs. \n
        `filename`: A custom name for file identification. \n
        `directory`: A string that indicates the directory where the files are put.
    '''
    for index, (images, labels) in dataset.enumerate():
        serialized_data = create_example(images, labels)
        # The context manager closes the writer, so the record is flushed to disk.
        with TFRecordWriter(f'{directory}/{filename}_{int(index)}.tfrecord') as file:
            file.write(serialized_data)

# Generating the files.
create_files(train, 'train', 'mnist')
create_files(val, 'val', 'mnist')
create_files(test, 'test', 'mnist')
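
As a quick sanity check, the files can be read back with `tf.data`. The sketch below is self-contained: it writes one synthetic batch to a temporary directory using the same two-feature layout as `create_example` (both features stored as serialized tensors), then parses it. The file name `demo_0.tfrecord`, the helper `parse_batch` and the `uint8` dtypes are assumptions of this sketch. The dtypes match what `fashion_mnist.load_data()` returns, but they would need adjusting if the data were cast before writing.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Write one synthetic batch (4 blank images) in the layout used by `create_example`.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'demo_0.tfrecord')
images = np.zeros((4, 28, 28), dtype=np.uint8)
labels = np.array([0, 1, 2, 3], dtype=np.uint8)
example = tf.train.Example(features=tf.train.Features(feature={
    'pixels': tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[tf.io.serialize_tensor(images).numpy()])),
    'target': tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[tf.io.serialize_tensor(labels).numpy()])),
}))
with tf.io.TFRecordWriter(path) as writer:
    writer.write(example.SerializeToString())

# Both features were stored as serialized tensors, hence the string type.
feature_description = {
    'pixels': tf.io.FixedLenFeature([], tf.string),
    'target': tf.io.FixedLenFeature([], tf.string),
}

def parse_batch(serialized_example):
    """Turns one record (a whole batch) back into image and label tensors."""
    parsed = tf.io.parse_single_example(serialized_example, feature_description)
    batch_images = tf.io.parse_tensor(parsed['pixels'], out_type=tf.uint8)
    batch_labels = tf.io.parse_tensor(parsed['target'], out_type=tf.uint8)
    # `parse_tensor` loses the static shape, so it is restored here.
    batch_images = tf.ensure_shape(batch_images, [None, 28, 28])
    batch_labels = tf.ensure_shape(batch_labels, [None])
    return batch_images, batch_labels

files = tf.data.Dataset.list_files(os.path.join(tmp_dir, 'demo_*.tfrecord'), shuffle=False)
dataset = (tf.data.TFRecordDataset(files)
           .map(parse_batch, num_parallel_calls=tf.data.AUTOTUNE)
           .prefetch(tf.data.AUTOTUNE))

for batch_images, batch_labels in dataset:
    print(batch_images.shape, batch_labels.shape)
```

Because each record holds a whole batch, the parsed dataset is already batched; calling `.unbatch()` after `map` would give per-image elements if a different batch size were wanted for training.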

<p style='color:red'> Preprocessing the .tfrecord files</p>

<h2 style='font-size:30px'> </h2>