<h1 style='font-size:40px'> tf.data Pipeline</h1>
<div> 
    <ul style='font-size:20px'> 
        <li> 
            In this notebook, I practice using the tf.data module by solving Exercise 9 from Chapter 13 of Hands-On Machine Learning with Scikit-Learn and TensorFlow.
        </li>
        <li> 
            The exercise asks for the following:
            <p style='font-style:italic;margin-top:10px'> 
                Load the Fashion MNIST dataset (introduced in Chapter 10); split
                it into a training set, a validation set, and a test set; shuffle the
                training set; and save each dataset to multiple TFRecord files.
                Each record should be a serialized Example protobuf with two
                features: the serialized image (use tf.io.serialize_tensor()
                to serialize each image), and the label. Then use tf.data to create
                an efficient dataset for each set. Finally, use a Keras model to
                train these datasets, including a preprocessing layer to standardize
                each input feature. Try to make the input pipeline as efficient as
                possible, using TensorBoard to visualize profiling data.
            </p>
        </li>
    </ul>
</div>

<h2 style='font-size:30px'> Data Importing & Splitting</h2>

In [1]:
# Loading the fashion_mnist dataset.
from tensorflow.keras.datasets import fashion_mnist
from sklearn.model_selection import train_test_split
(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()

# Now, generating the validation set with `train_test_split`.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=.1, random_state=42)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz


In [2]:
# Storing each set in a tf.data.Dataset object, batched in groups of 1000 elements.
# Each batch will later be written to its own .tfrecord file.
from tensorflow.data import Dataset
batch_size = 1000
train = Dataset.from_tensor_slices((X_train, y_train)).shuffle(54000).batch(batch_size)
val = Dataset.from_tensor_slices((X_val, y_val)).batch(batch_size)
test = Dataset.from_tensor_slices((X_test, y_test)).batch(batch_size)

In [3]:
# Each .tfrecord file will hold 1000 instances, with one file per batch created above.
train_files = len(X_train) // batch_size
val_files = len(X_val) // batch_size
test_files = len(X_test) // batch_size
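
<div> 
    <ul style='font-size:20px'> 
        <li> 
            As a quick check of the arithmetic above, the sketch below counts how many files each split needs. The helper `count_shards` is a hypothetical name introduced only for this illustration. It uses ceiling division, which agrees with `//` here because every split size is a multiple of 1000, but would also keep a final partial batch if the sizes were not.
        </li>
    </ul>
</div>

```python
import math

def count_shards(num_examples: int, batch_size: int) -> int:
    """Number of batches (and hence files) needed to hold every example."""
    # Ceiling division keeps the final, smaller batch that `//` would drop.
    return math.ceil(num_examples / batch_size)

# Split sizes used in this notebook: 54,000 / 6,000 / 10,000 images.
splits = {"train": 54_000, "val": 6_000, "test": 10_000}
shards = {name: count_shards(size, 1000) for name, size in splits.items()}
print(shards)
```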

<h2 style='font-size:30px'> Producing the .tfrecord Files</h2>
<div> 
    <ul style='font-size:20px'> 
        <li> 
            TFRecord files store data as serialized protobufs. We can produce them by placing the serialized information in an `Example` object and converting it to a binary string.
        </li>
    </ul>
</div>
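
<div> 
    <ul style='font-size:20px'> 
        <li> 
            To make the idea concrete, here is a minimal sketch of a round trip for a single image: tensor, then `Example`, then bytes, and back again. It is an illustration under assumptions, not the code used in this notebook: the image is random stand-in data, the feature names `image` and `label` are made up for the example, and the label is stored as an `Int64List` rather than as a serialized tensor.
        </li>
    </ul>
</div>

```python
import numpy as np
import tensorflow as tf

# A stand-in 28x28 uint8 image and label (random values, not real Fashion MNIST data).
image = np.random.randint(0, 256, size=(28, 28), dtype=np.uint8)
label = 3

# Serialize the image tensor to bytes and wrap both features in an Example.
image_bytes = tf.io.serialize_tensor(image).numpy()
example = tf.train.Example(features=tf.train.Features(feature={
    "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
}))
record = example.SerializeToString()  # bytes, ready for TFRecordWriter.write()

# Parse the record back to confirm that nothing was lost.
spec = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}
parsed = tf.io.parse_single_example(record, spec)
restored = tf.io.parse_tensor(parsed["image"], out_type=tf.uint8)
```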

In [4]:
from tensorflow import Tensor
from tensorflow.train import BytesList, Example, Features, Feature, Int64List
from tensorflow.io import serialize_tensor, TFRecordWriter

def create_example(images:Tensor, targets:Tensor)->bytes:
    '''
        Generates a serialized `Example` object holding the pixel intensities and target values from a collection
        of Fashion MNIST images.
        
        Parameters
        ----------
        `images`: A 3-D `tf.Tensor` with the images' pixels. \n
        `targets`: A 1-D `tf.Tensor` with the images' labels.
        
        Returns
        -------
        The binary string of a `tf.train.Example` object storing both pixels and target values.
    '''
    # Serializing the input vectors.
    serialized_images, serialized_targets = serialize_tensor(images), serialize_tensor(targets)
    example = Example(
        features=Features(
            feature={
            'pixels':Feature(bytes_list=BytesList(value=[serialized_images.numpy()])),
            'target':Feature(bytes_list=BytesList(value=[serialized_targets.numpy()]))
        }
        ))
    # Now, converting the `Example` object into a binary string.
    return example.SerializeToString()

In [5]:
# It is convenient to place all data files in a separate directory.
! mkdir mnist

In [6]:
def create_files(dataset:Dataset, filename:str, directory:str='.')->None:
    '''
        Creates the .tfrecord files, one for each batch of the provided `dataset`.
        
        Parameters
        ----------
        `dataset`: A `tf.data.Dataset` object. \n
        `filename`: A custom name for file identification. \n
        `directory`: A string that indicates the directory where the files are put.
    '''
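    for index, (images, labels) in dataset.enumerate():
        # `int(index)` turns the scalar tensor into a plain integer for the file name.
        path = f'{directory}/{filename}_{int(index)}.tfrecord'
        # The context manager closes the writer, which flushes the record to disk.
        with TFRecordWriter(path) as file:
            file.write(create_example(images, labels))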

# Generating the files.
create_files(train, 'train', 'mnist')
create_files(val, 'val', 'mnist')
create_files(test, 'test', 'mnist')

<h2 style='font-size:30px'> Data Treatment</h2>
<div> 
    <ul style='font-size:20px'> 
        <li> 
            With the files generated, we can now read them back and decode their contents into tensors.
        </li>
    </ul>
</div>

In [7]:
# Listing the data files of each set separately.
train_files = Dataset.list_files('mnist/train*')
val_files = Dataset.list_files('mnist/val*')
test_files = Dataset.list_files('mnist/test*')

In [8]:
from tensorflow import string, uint8
from tensorflow.data import AUTOTUNE, TFRecordDataset
from tensorflow.io import FixedLenFeature, parse_example, parse_tensor
from typing import Iterable, Tuple

def preprocess(tfrecord:Tensor)->Tuple[Tensor, Tensor]:
    '''
    Parses a serialized `Example` protobuf and decodes the tensors stored in it.
    
    Parameter
    ---------
    `tfrecord`: A scalar string `tf.Tensor` holding one serialized `Example`.
    
    Returns
    -------
    Two tensors in a tuple. One with the pixel intensities and another containing the target values.
    '''
    features = {
    'pixels':FixedLenFeature([], string, default_value=''), 
    'target':FixedLenFeature([], string, default_value='-1')
                }
    example = parse_example(tfrecord, features) # Returns a dictionary with the serialized images and target values.
    pixels, target = parse_tensor(example['pixels'], uint8), parse_tensor(example['target'], uint8)
    return pixels, target

def read_files(filenames:Iterable[str], shuffle_size:int=None, num_threads_reading:int=AUTOTUNE, 
               num_threads_preprocess:int=AUTOTUNE)->Dataset:
    '''
        Reads the .tfrecord files specified and retrieves a `tf.data.Dataset` object with the processed data.
        
        Parameters
        ----------
        `filenames`: The names of the files.
        `shuffle_size`: If specified, shuffles the records using a buffer of this size.
        `num_threads_reading`: The number of threads to use when reading the files.
        `num_threads_preprocess`: The number of threads to use when preprocessing the dataset.
        
        Returns
        -------
        The treated dataset.
    '''
    dataset = TFRecordDataset(filenames, num_parallel_reads=num_threads_reading)
    if shuffle_size:
        # `shuffle` returns a new dataset, so the result must be reassigned.
        dataset = dataset.shuffle(shuffle_size)
    return dataset.map(preprocess, num_parallel_calls=num_threads_preprocess).prefetch(1)
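
<div> 
    <ul style='font-size:20px'> 
        <li> 
            For comparison, the sketch below shows one possible end-to-end layout with one `Example` per image, closer to what the exercise statement describes. It is a hedged illustration, not this notebook's pipeline: the helpers `write_shards`, `parse` and `make_dataset`, the feature names, the temporary directory and the random stand-in data are all assumptions made for the example. Files are read with `interleave`, then shuffled, parsed, batched and prefetched.
        </li>
    </ul>
</div>

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

SPEC = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def write_shards(images, labels, directory, prefix, n_shards):
    """Writes one Example per image, spread round-robin over `n_shards` files."""
    paths = [os.path.join(directory, f"{prefix}_{i}.tfrecord") for i in range(n_shards)]
    writers = [tf.io.TFRecordWriter(path) for path in paths]
    for i, (image, label) in enumerate(zip(images, labels)):
        example = tf.train.Example(features=tf.train.Features(feature={
            "image": tf.train.Feature(bytes_list=tf.train.BytesList(
                value=[tf.io.serialize_tensor(image).numpy()])),
            "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label)])),
        }))
        writers[i % n_shards].write(example.SerializeToString())
    for writer in writers:
        writer.close()  # flush every file to disk
    return paths

def parse(record):
    """Decodes one serialized Example into an (image, label) pair."""
    parsed = tf.io.parse_single_example(record, SPEC)
    image = tf.io.parse_tensor(parsed["image"], out_type=tf.uint8)
    image = tf.ensure_shape(image, [28, 28])  # parse_tensor loses the static shape
    return image, parsed["label"]

def make_dataset(pattern, batch_size=32, shuffle_size=None):
    """Builds a batched dataset from every file matching `pattern`."""
    files = tf.data.Dataset.list_files(pattern, shuffle=shuffle_size is not None)
    # Read several files concurrently and mix their records.
    dataset = files.interleave(tf.data.TFRecordDataset,
                               num_parallel_calls=tf.data.AUTOTUNE)
    if shuffle_size:
        dataset = dataset.shuffle(shuffle_size)
    dataset = dataset.map(parse, num_parallel_calls=tf.data.AUTOTUNE)
    return dataset.batch(batch_size).prefetch(tf.data.AUTOTUNE)

# Tiny random stand-in data (not real Fashion MNIST).
images = np.random.randint(0, 256, size=(100, 28, 28), dtype=np.uint8)
labels = np.random.randint(0, 10, size=100)
tmp_dir = tempfile.mkdtemp()
write_shards(images, labels, tmp_dir, "demo", n_shards=4)
dataset = make_dataset(os.path.join(tmp_dir, "demo_*.tfrecord"), shuffle_size=100)
```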

<h2 style='font-size:30px'> Standardization Layer</h2>
<div> 
    <ul style='font-size:20px'> 
        <li> 
            Here, we'll code a `tf.keras.layers.Layer` subclass that does a job similar to that of the Batch Normalization layer. The main difference is that $\mu$ and $\sigma$ are computed in advance using the `adapt` method.
        </li>
    </ul>
</div>
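
<div> 
    <ul style='font-size:20px'> 
        <li> 
            The computation itself is $(x - \mu) / (\sigma + \epsilon)$, with $\mu$ and $\sigma$ fixed after a single pass over the training data. The NumPy sketch below illustrates that two-step idea outside of Keras. The class name `Standardizer` and the random stand-in data are assumptions for the illustration, and the default `epsilon` of 1e-7 simply mirrors the Keras backend default.
        </li>
    </ul>
</div>

```python
import numpy as np

class Standardizer:
    """Hypothetical NumPy stand-in for a standardization layer."""

    def __init__(self, epsilon: float = 1e-7):
        self.epsilon = epsilon  # avoids division by zero for constant features

    def adapt(self, data) -> None:
        # Statistics are computed once, per feature, over the first axis.
        data = np.asarray(data, dtype=np.float32)
        self.mean = data.mean(axis=0)
        self.std = data.std(axis=0)

    def __call__(self, inputs):
        # Reuses the stored statistics; nothing is recomputed per batch.
        inputs = np.asarray(inputs, dtype=np.float32)
        return (inputs - self.mean) / (self.std + self.epsilon)

# Random uint8 "images" as stand-in data (not real Fashion MNIST).
rng = np.random.default_rng(0)
train_images = rng.integers(0, 256, size=(1000, 28, 28)).astype(np.uint8)

scaler = Standardizer()
scaler.adapt(train_images)
scaled = scaler(train_images)
```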

In [9]:
from tensorflow.random import normal
from tensorflow.math import reduce_mean, reduce_std
reduce_mean(normal(shape=(5, 10, 10)), axis=0)

<tf.Tensor: shape=(10, 10), dtype=float32, numpy=
array([[-0.71860915, -0.37386498,  0.84272593,  0.15425727,  0.71466786,
        -0.5308983 ,  0.147206  ,  0.1305735 ,  0.28914818,  0.16880181],
       [-0.12105656, -0.12724052,  0.23465379, -0.2755946 , -0.20157509,
         0.64078265,  0.00167464, -0.24770947, -0.17672545,  0.00147332],
       [ 0.6664582 ,  0.1417041 , -0.36508322, -0.3298834 ,  0.19712499,
        -0.08973067, -0.00951571, -0.34670582,  0.15999393,  0.68951577],
       [-0.58635855, -0.34293714,  0.03675737, -0.19457375,  0.04571567,
         0.03293511, -0.708356  ,  0.18836927,  0.7266482 ,  0.20899737],
       [ 0.6625474 , -0.04238774,  0.9981421 ,  0.70100176,  0.23378868,
         0.69413424, -0.54928   ,  0.30869088,  0.35642567,  0.3064534 ],
       [ 0.23344226,  1.3789672 ,  0.5510902 , -0.21043363,  0.3211848 ,
         0.5299282 , -0.35761735, -0.37501377, -0.5181626 ,  0.20408134],
       [-0.29252636, -0.35589853, -0.08450793,  0.31069106, -0.89885

In [10]:
# The layer inherits from the experimental `PreprocessingLayer`.
from tensorflow import cast, float32
from tensorflow.keras.layers.experimental.preprocessing import PreprocessingLayer
from tensorflow.math import reduce_mean, reduce_std
from tensorflow.keras.backend import epsilon

class Standardize(PreprocessingLayer):
    '''
    A `PreprocessingLayer` that standardizes a given array along a chosen axis.
    
    The necessary stats are computed before training with the `adapt` method. This is what differentiates
    this class from the `BatchNormalization` layer, which computes means and standard deviations on the fly.
    '''
    def adapt(self, input_data:Tensor, axis:int=0)->None:
        '''
            Computes means and std's from a provided `tf.Tensor`.
            
            Parameters
            ----------
            `input_data`: The array from which the stats are computed.
            `axis`: The axis of choice to compute the stats.
        '''
        # `reduce_std` requires a floating-point input, and the images are stored as uint8.
        input_data = cast(input_data, float32)
        self.means = reduce_mean(input_data, axis=axis)
        self.stds = reduce_std(input_data, axis=axis)
        
    def transform(self, input_data:Tensor)->Tensor:
        '''
        The method that standardizes the array.
        
        Parameter
        ---------
        `input_data`: The array on which we perform the standardization.
        '''
        # `epsilon` is a function, so it must be called to get the small constant that avoids division by zero.
        return (cast(input_data, float32) - self.means) / (self.stds + epsilon())

    def call(self, inputs:Tensor)->Tensor:
        # Keras invokes `call` when the layer is used inside a model.
        return self.transform(inputs)
    
Standardize()

<__main__.Standardize at 0x7fe2982fff90>

<p style='color:red'> Finish documenting the Standardization layer.</p>

<h2 style='font-size:30px'> </h2>