# **TFRecord and `tf.Example`**

**Learning objectives**

1. Understand the TFRecord format for storing data
2. Understand the `tf.Example` message type
3. Read and write a `TFRecord` file

## **Introduction**

In this notebook, you will create, parse, and use the `tf.Example` message, and then serialise, write, and read `tf.Example` messages to and from `.tfrecord` files. To read data efficiently, it can be helpful to serialise your data and store it in a set of files (100-200 MB each) that can each be read linearly. This is especially true if the data is being streamed over a network. It can also be useful for caching the results of data preprocessing.

### **The TFRecord format**

The TFRecord format is a simple format for storing a sequence of binary records. *Protocol buffers* are a cross-platform, cross-language library for efficient serialisation of structured data. Protocol messages are defined by `.proto` files, which are often the easiest way to understand a message type.

The `tf.Example` message (or protobuf) is a flexible message type that represents a `{"string": value}` mapping. It is designed for use with TensorFlow and is used throughout the higher-level APIs such as TFX. 

Note: while useful, these structures are optional. There is no need to convert existing code to use TFRecords, unless you are using `tf.data` and reading data is still the bottleneck to training.
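As a minimal sketch of the `{"string": value}` mapping described above, the snippet below builds a `tf.Example` with a single feature, serialises it to a binary string, and decodes it again. The feature name `"label"` and its value are hypothetical, chosen only for illustration.

```python
import tensorflow as tf

# Hypothetical feature name and value, used only for illustration.
example = tf.train.Example(
    features=tf.train.Features(
        feature={
            "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[3])),
        }
    )
)

# Serialise the message to a binary string.
serialized = example.SerializeToString()

# Decode the bytes back into a message and look the value up by its key.
restored = tf.train.Example.FromString(serialized)
label = restored.features.feature["label"].int64_list.value[0]
print(label)
```

The serialised bytes are what would be stored as a single record in a `.tfrecord` file.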

## **Load necessary libraries**

We will start by importing the necessary libraries for this lab.

In [3]:
import tensorflow as tf
import numpy as np
import IPython.display as display

print("TensorFlow version:", tf.__version__)

TensorFlow version: 2.4.1


## **`tf.Example`**

### **Data types for `tf.Example`**

Fundamentally, a `tf.Example` is a `{"string": tf.train.Feature}` mapping.

The `tf.train.Feature` message type can accept one of the following three types. Most other generic types can be coerced into one of these:
1. `tf.train.BytesList` (the following types can be coerced)
    - `string`
    - `byte`
2. `tf.train.FloatList` (the following types can be coerced)
    - `float` (`float32`)
    - `double` (`float64`)
3. `tf.train.Int64List` (the following types can be coerced)
    - `bool`
    - `enum`
    - `int32`
    - `uint32`
    - `int64`
    - `uint64`
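The three list types above can be illustrated with a short sketch that builds one `tf.train.Feature` of each kind directly and asks each message which field it holds. The sample values are hypothetical; `kind` is the name of the `oneof` group in the `Feature` proto definition.

```python
import tensorflow as tf

# Each Feature wraps exactly one of the three list types (sample values are illustrative).
bytes_feature = tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"cat"]))
float_feature = tf.train.Feature(float_list=tf.train.FloatList(value=[0.5]))
int64_feature = tf.train.Feature(int64_list=tf.train.Int64List(value=[7]))

# WhichOneof reports which field of the `kind` group is populated.
for feature in (bytes_feature, float_feature, int64_feature):
    print(feature.WhichOneof("kind"))
```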

In order to convert a standard TensorFlow type to a `tf.Example`-compatible `tf.train.Feature`, you can use the shortcut functions below. Note that each function takes a scalar input value and returns a `tf.train.Feature` containing one of the three `list` types above.

In [4]:
# The following functions can be used to convert a value to a type compatible with `tf.Example`

def _bytes_feature(value):
    """Returns a bytes_list from a string / byte"""
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy() # BytesList won't unpack a string from an EagerTensor
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
    """Returns a float_list from a float / double"""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint"""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

Note: to stay simple, this example only uses scalar inputs. The simplest way to handle non-scalar features is to use `tf.io.serialize_tensor` to convert tensors to binary strings. Strings are scalars in TensorFlow. Use `tf.io.parse_tensor` to convert the binary string back to a tensor.
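The round trip for a non-scalar feature might look like the sketch below, which uses `tf.io.serialize_tensor` and `tf.io.parse_tensor` (where these functions live in TensorFlow 2.x). The 2x2 `matrix` is a hypothetical input; note that the original dtype has to be supplied again when parsing.

```python
import tensorflow as tf

# Hypothetical non-scalar feature, used only for illustration.
matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])

# serialize_tensor returns a scalar string tensor holding the encoded tensor.
serialized = tf.io.serialize_tensor(matrix)

# .numpy() extracts the raw bytes so that BytesList can store them.
feature = tf.train.Feature(
    bytes_list=tf.train.BytesList(value=[serialized.numpy()])
)

# parse_tensor needs the original dtype to decode the bytes.
restored = tf.io.parse_tensor(feature.bytes_list.value[0], out_type=tf.float32)
print(restored.shape)
```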

Below are some examples of how these functions work. Note the varying input types and the standardised output types. If the input type for a function does not match one of the coercible types stated above, the function will raise an exception. For example, `_int64_feature(1.0)` will raise an error because `1.0` is a float, so it should be passed to `_float_feature` instead.

In [5]:
print(_bytes_feature(b"test_string"))
print(_bytes_feature(u"test_bytes".encode("utf8")))

print(_float)

bytes_list {
  value: "test_string"
}

bytes_list {
  value: "test_bytes"
}

