## Import Packages, Environment Setting

In [1]:
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

tfds.disable_progress_bar()

physical_devices = tf.config.list_physical_devices('GPU')
if physical_devices:
    # Allocate GPU memory on demand; skipped on machines without a GPU.
    tf.config.experimental.set_memory_growth(physical_devices[0], True)

TensorFlow's TFRecord format is useful for storing serialized strings efficiently. In this section, we introduce some basic use cases of TFRecord.

## Construct from Example
The `tf.train.Example` message provides an efficient way to construct a serialized data structure for a custom dataset. The pipeline is to create, for each instance, a dictionary that maps every feature name to a `tf.train.Feature`, and then wrap this dictionary in a `tf.train.Features` object. Thereafter, create a `tf.train.Example` from the `tf.train.Features`, and finally write the serialized string of the `tf.train.Example` to a `TFRecord` file.

For instance, suppose we have a dataset consisting of 100 instances with 3 features and 1 target:

### Create Feature object
`tf.train.Feature(float_list=tf.train.FloatList(value=data))`

We first need to construct a `tf.train.Feature` object for every feature of each instance. Here, we focus on float features since they are the most commonly used data type. However, `tf.train.Feature` also supports [other formats](https://www.tensorflow.org/tutorials/load_data/tfrecord#data_types_for_tftrainexample), namely byte strings and 64-bit integers.
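As a minimal sketch of those other formats, the helpers below wrap a single value in each of the three list types that `tf.train.Feature` accepts. The helper names (`bytes_feature`, `float_feature`, `int64_feature`) and the sample values are hypothetical and chosen only for illustration.

```python
import tensorflow as tf

# Hypothetical helper names; each wraps one scalar in the matching list type.
def bytes_feature(value: bytes) -> tf.train.Feature:
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def float_feature(value: float) -> tf.train.Feature:
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def int64_feature(value: int) -> tf.train.Feature:
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

# Illustrative values only.
label = bytes_feature(b'cat')
weight = float_feature(0.5)
count = int64_feature(3)
```

Strings have to be encoded to `bytes` before they are passed to `BytesList`, and booleans or enums are usually stored through `Int64List`.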

In [2]:
np.random.seed(0)
X = np.random.normal(0, 1, 300).reshape(100, 3)
y = np.random.normal(0, 1, 100)
dict_of_features_for_sample_0 = {
    'feature A': tf.train.Feature(float_list=tf.train.FloatList(value=[X[0, 0]])),
    'feature B': tf.train.Feature(float_list=tf.train.FloatList(value=[X[0, 1]])),
    'feature C': tf.train.Feature(float_list=tf.train.FloatList(value=[X[0, 2]])),
    'target': tf.train.Feature(float_list=tf.train.FloatList(value=[y[0]])),
}

### Create Features object
`tf.train.Features(feature=dict_of_features)`

Next, we construct a `tf.train.Features` object from the dictionary created in the previous step.

In [3]:
features_for_sample_0 = tf.train.Features(feature=dict_of_features_for_sample_0)

### Create Example object
`tf.train.Example(features=features)`

Lastly, we construct a `tf.train.Example` from the `tf.train.Features` object.

In [4]:
example_for_sample_0 = tf.train.Example(features=features_for_sample_0)

### Construct serialized string
`example.SerializeToString()`

In [5]:
example_for_sample_0.SerializeToString()

b'\nY\n\x15\n\tfeature A\x12\x08\x12\x06\n\x04x\xcc\xe1?\n\x15\n\tfeature B\x12\x08\x12\x06\n\x04h\xe1\xcc>\n\x15\n\tfeature C\x12\x08\x12\x06\n\x04\x93\x8ez?\n\x12\n\x06target\x12\x08\x12\x06\n\x04F<\xa7\xbf'
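The byte string above can be sanity-checked by hand. The sketch below assumes that the four bytes following `feature A` in the output are the payload of that feature, and decodes them with the standard `struct` module as a little-endian 32-bit float. The variable names are illustrative.

```python
import struct

# Assumed payload of 'feature A', copied from the serialized output above.
raw = b'x\xcc\xe1?'

# '<f' reads four bytes as a little-endian 32-bit float.
(value,) = struct.unpack('<f', raw)
print(value)
```

The decoded number is close to `X[0, 0]` but not bit-identical to it, because NumPy generates `float64` values while `tf.train.FloatList` stores 32-bit floats.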

### Implement serialization as function
Here, we implement a function that automates the entire process so that it can be efficiently applied to every instance. Note that we __coerce__ each value to a Python `float` to avoid issues when the function later receives TensorFlow Tensors.

In [6]:
def serialize_example(featureA, featureB, featureC, target):
   
    dict_of_features = {
        'feature A': tf.train.Feature(float_list=tf.train.FloatList(value=[float(featureA)])),
        'feature B': tf.train.Feature(float_list=tf.train.FloatList(value=[float(featureB)])),
        'feature C': tf.train.Feature(float_list=tf.train.FloatList(value=[float(featureC)])),
        'target': tf.train.Feature(float_list=tf.train.FloatList(value=[float(target)])),
    }
    features = tf.train.Features(feature=dict_of_features)
    example = tf.train.Example(features=features)
    
    return example.SerializeToString()
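The same idea can be sketched without hard-coding the feature names. The function below is a hypothetical variant (`serialize_float_row` is not part of TensorFlow) that accepts any mapping of feature name to scalar, and the round trip through `tf.train.Example.FromString` shows one way to check that the serialized bytes decode back to the original values.

```python
import tensorflow as tf

def serialize_float_row(row):
    """Serialize a mapping of feature name -> scalar into tf.train.Example bytes."""
    feature = {
        name: tf.train.Feature(float_list=tf.train.FloatList(value=[float(value)]))
        for name, value in row.items()
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()

# Illustrative values that are exactly representable as 32-bit floats.
serialized = serialize_float_row({'feature A': 0.5, 'target': -1.0})

# Parse the bytes back into a message and read one value out again.
restored = tf.train.Example.FromString(serialized)
restored_target = restored.features.feature['target'].float_list.value[0]
```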

### Write as TFRecord

In [7]:
n_instances = X.shape[0]
with tf.io.TFRecordWriter('example.tfrecords') as writer:
    for i in range(n_instances):
        example = serialize_example(X[i, 0], X[i, 1], X[i, 2], y[i])
        writer.write(example)

## Construct from Tensor
When the data is already held in a `tf.data.Dataset`, we can construct the serialized strings by applying the serialization function through the dataset's `map` method.

In [8]:
np.random.seed(0)
X = np.random.normal(0, 1, 300).reshape(100, 3)
y = np.random.normal(0, 1, 100).reshape(100)

dataset = tf.data.Dataset.from_tensor_slices({
    'feature A': X[:, 0],
    'feature B': X[:, 1],
    'feature C': X[:, 2],
    'target': y
})

Note that we have to wrap `serialize_example` in `tf.py_function`, because it is a plain Python function that cannot run inside the TensorFlow graph built by `map`.

In [9]:
def serialize_dataset(x):
    serialize_string = tf.py_function(
        serialize_example,
        (x['feature A'], x['feature B'], x['feature C'], x['target']),
        tf.string
    )
    return tf.reshape(serialize_string, ())
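The final `tf.reshape(..., ())` matters because the output of `tf.py_function` typically carries no static shape information. The sketch below uses a hypothetical single-feature serializer (`to_bytes`) and toy data to show that, after the reshape, the mapped dataset reports scalar string elements.

```python
import tensorflow as tf

def to_bytes(x):
    # Runs eagerly inside tf.py_function, so x is an eager tensor and float() works.
    feature = {'x': tf.train.Feature(float_list=tf.train.FloatList(value=[float(x)]))}
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()

def wrap(x):
    serialized = tf.py_function(to_bytes, (x,), tf.string)
    # Pin the static shape to a scalar so downstream ops know what to expect.
    return tf.reshape(serialized, ())

# Toy data, chosen only for illustration.
ds = tf.data.Dataset.from_tensor_slices([0.5, 1.5, 2.5])
serialized_ds = ds.map(wrap)

spec = serialized_ds.element_spec
records = [record.numpy() for record in serialized_ds]
```

Here `spec` describes each element as a scalar `tf.string`, and `records` holds one serialized `tf.train.Example` per input value.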

In [10]:
with tf.io.TFRecordWriter('example.tfrecords') as writer:
    for example in dataset.map(serialize_dataset):
        writer.write(example.numpy())

## Read TFRecord
First, we read the file as a dataset of serialized strings.

In [11]:
serialized_string_dataset = tf.data.TFRecordDataset('example.tfrecords')

Thereafter, we decode each serialized string using a feature description, which declares the shape, type, and default value of every feature.

In [12]:
feature_description = {
    'feature A': tf.io.FixedLenFeature([], tf.float32, default_value=0.0),
    'feature B': tf.io.FixedLenFeature([], tf.float32, default_value=0.0),
    'feature C': tf.io.FixedLenFeature([], tf.float32, default_value=0.0),
    'target': tf.io.FixedLenFeature([], tf.float32, default_value=0.0),
}

def decode_serialized_string(serialized_string):
    return tf.io.parse_single_example(serialized_string, feature_description)

In [13]:
dataset = serialized_string_dataset.map(decode_serialized_string)
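To confirm that decoding works end to end, a small round trip can be run on a throwaway file. The sketch below writes four toy records to a temporary path (the file name, feature values, and variable names are assumptions made for illustration), parses them one at a time with `tf.io.parse_single_example`, and then parses them in batches with `tf.io.parse_example`.

```python
import os
import tempfile

import tensorflow as tf

# Write four toy records to a temporary file.
path = os.path.join(tempfile.mkdtemp(), 'demo.tfrecords')
with tf.io.TFRecordWriter(path) as writer:
    for i in range(4):
        feature = {
            'feature A': tf.train.Feature(float_list=tf.train.FloatList(value=[float(i)])),
            'target': tf.train.Feature(float_list=tf.train.FloatList(value=[float(-i)])),
        }
        example = tf.train.Example(features=tf.train.Features(feature=feature))
        writer.write(example.SerializeToString())

description = {
    'feature A': tf.io.FixedLenFeature([], tf.float32),
    'target': tf.io.FixedLenFeature([], tf.float32),
}

# Parse one serialized string at a time.
parsed = tf.data.TFRecordDataset(path).map(
    lambda s: tf.io.parse_single_example(s, description)
)
rows = [{k: float(v.numpy()) for k, v in record.items()} for record in parsed]

# Batch first, then parse a whole batch of serialized strings in one call.
batched = tf.data.TFRecordDataset(path).batch(2).map(
    lambda s: tf.io.parse_example(s, description)
)
first_batch = next(iter(batched))
```

Each entry of `rows` is a plain dictionary of Python floats, while `first_batch` maps every feature name to a tensor with one value per record in the batch.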