<a href="https://colab.research.google.com/github/deep-diver/mlops-hf-tf-vision-models/blob/main/notebooks/parse_tfrecord.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Imports

In [None]:
import tensorflow as tf



## GCS Paths

The prepared TFRecords are stored in Google Cloud Storage (GCS), Google's object storage service, comparable to AWS S3.
- `GCS_PATH_FULL_RESOUTION`: the GCS path where the full-resolution (`500x500`) dataset is stored.
- `GCS_PATH_LOW_RESOLUTION`: the GCS path where the reduced-resolution (`256x256`) dataset is stored.

We provide the dataset at two resolutions. High-resolution images take longer to process in several steps of an ML pipeline, so the full-resolution dataset is useful for testing services dedicated to handling large amounts of data, such as [**Dataflow**](https://cloud.google.com/dataflow).

In [None]:
GCS_PATH_FULL_RESOUTION = "gs://beans-fullres/tfrecords"
GCS_PATH_LOW_RESOLUTION = "gs://beans-lowres/tfrecords"
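Before building a dataset, it can help to confirm which shard files match a split. The following is a minimal sketch, not part of the original pipeline: `list_shards` is a hypothetical helper, the shard file names are made up, and the demo runs against a temporary local directory rather than the buckets above. It assumes the `{split}-*` naming that the glob pattern later in this notebook relies on. `tf.io.gfile.glob` accepts both local paths and `gs://` paths.

```python
import os
import tempfile

import tensorflow as tf


def list_shards(base_path, split):
    """Hypothetical helper: sorted shard paths matching `{base_path}/{split}-*`."""
    return sorted(tf.io.gfile.glob(f"{base_path}/{split}-*"))


# Demo on a temporary local directory with made-up shard names.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ["train-0.tfrec", "train-1.tfrec", "val-0.tfrec"]:
        open(os.path.join(tmp_dir, name), "w").close()

    train_shards = [os.path.basename(p) for p in list_shards(tmp_dir, "train")]
    val_shards = [os.path.basename(p) for p in list_shards(tmp_dir, "val")]

print(train_shards)
print(val_shards)
```

An empty list for a split usually means the path or the naming pattern is wrong, which is easier to diagnose here than inside a `tf.data` pipeline.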

## Parsing TFRecords

The purpose of this section is to verify that the TFRecords are saved correctly and structured as intended. A few points to note:
- The data is flattened and stored in a binary format, so we need to decode it into `Tensor`s and `reshape` them into the appropriate shapes.
- [`tf.io.parse_single_example`](https://www.tensorflow.org/api_docs/python/tf/io/parse_single_example): parses a single serialized `Example` proto (a protocol buffer message) and returns a dict mapping feature keys to `Tensor` and `SparseTensor` values.
- [`tf.sparse.to_dense`](https://www.tensorflow.org/api_docs/python/tf/sparse/to_dense): every feature here is declared as `tf.io.VarLenFeature`, which is parsed into a `tf.sparse.SparseTensor`. If you are not familiar with it, please take a look at the [official document](https://www.tensorflow.org/api_docs/python/tf/sparse/SparseTensor). A `SparseTensor` stores only the explicitly set values together with their indices, but most TensorFlow ops expect a dense `Tensor`, so we convert it back with `tf.sparse.to_dense`.
- After `tf.io.parse_single_example` and `tf.sparse.to_dense`, the dense `Tensor` is still just a one-dimensional array, so we restore the original shape with `tf.reshape`, using the shape stored in the `image_shape` feature.

`tf.data.TFRecordDataset` creates a `tf.data.Dataset` from TFRecord files.
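The steps above can be illustrated with a small in-memory round trip. This is a sketch under assumptions: the writer side shown here (a flattened float image, its shape, and a single integer label) is inferred from the parsing spec used in this notebook, not taken from the code that produced the stored TFRecords, and the tiny `2x3x3` array stands in for a real image.

```python
import numpy as np
import tensorflow as tf

# A tiny made-up "image" (2x3 pixels, 3 channels) and a label.
image = np.arange(18, dtype=np.float32).reshape(2, 3, 3)
label = 1

# Assumed writer side: flatten the image and store its shape next to it.
example = tf.train.Example(
    features=tf.train.Features(
        feature={
            "image": tf.train.Feature(
                float_list=tf.train.FloatList(value=image.ravel().tolist())
            ),
            "image_shape": tf.train.Feature(
                int64_list=tf.train.Int64List(value=list(image.shape))
            ),
            "label": tf.train.Feature(
                int64_list=tf.train.Int64List(value=[label])
            ),
        }
    )
)
serialized = example.SerializeToString()

# Reader side: parse, densify, then reshape.
feature_description = {
    "image": tf.io.VarLenFeature(tf.float32),
    "image_shape": tf.io.VarLenFeature(tf.int64),
    "label": tf.io.VarLenFeature(tf.int64),
}
rec = tf.io.parse_single_example(serialized, feature_description)
flat = tf.sparse.to_dense(rec["image"])         # one-dimensional values
shape = tf.sparse.to_dense(rec["image_shape"])  # stored original shape
restored = tf.reshape(flat, shape)

print(flat.shape, restored.shape)
```

The flat tensor carries no shape information of its own, which is why the shape has to travel with the record as a separate feature.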

In [None]:
BATCH_SIZE = 4
AUTO = tf.data.AUTOTUNE

In [None]:
def parse_tfr(proto):
    """Parse one serialized `Example` into an image tensor and its label."""
    # All features are variable-length, so they are parsed as SparseTensors.
    feature_description = {
        "image": tf.io.VarLenFeature(tf.float32),
        "image_shape": tf.io.VarLenFeature(tf.int64),
        "label": tf.io.VarLenFeature(tf.int64),
    }
    rec = tf.io.parse_single_example(proto, feature_description)
    # Convert to dense tensors, then restore the image's original shape.
    image_shape = tf.sparse.to_dense(rec["image_shape"])
    image = tf.reshape(tf.sparse.to_dense(rec["image"]), image_shape)
    label = tf.sparse.to_dense(rec["label"])
    return {"pixel_values": image, "label": label}


def prepare_dataset(GCS_PATH=GCS_PATH_FULL_RESOUTION,
                    split="train", batch_size=BATCH_SIZE):
    """Build a batched `tf.data.Dataset` from the TFRecord shards of `split`."""
    if split not in ["train", "val"]:
        raise ValueError(
            "Invalid split provided. Supported splits are: `train` and `val`."
        )

    # Shard files are named `{split}-*` under the given GCS path.
    dataset = tf.data.TFRecordDataset(
        tf.io.gfile.glob(f"{GCS_PATH}/{split}-*"),
        num_parallel_reads=AUTO,
    ).map(parse_tfr, num_parallel_calls=AUTO)

    # Shuffle only the training split.
    if split == "train":
        dataset = dataset.shuffle(batch_size * 2)

    dataset = dataset.batch(batch_size)
    dataset = dataset.prefetch(AUTO)
    return dataset
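The same pipeline can be smoke-tested without GCS access by writing a few records to a temporary local shard. This is a sketch under assumptions: `make_example` is a hypothetical writer-side helper inferred from the parsing spec, `parse_example` repeats the steps of `parse_tfr` above so the block is self-contained, and the `8x8x3` images, labels, and shard name are made up.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf


def make_example(image, label):
    """Hypothetical writer-side helper: flatten the image and store its shape."""
    feature = {
        "image": tf.train.Feature(
            float_list=tf.train.FloatList(value=image.ravel().tolist())
        ),
        "image_shape": tf.train.Feature(
            int64_list=tf.train.Int64List(value=list(image.shape))
        ),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))


def parse_example(proto):
    """Same parsing steps as `parse_tfr`, repeated to keep the sketch standalone."""
    feature_description = {
        "image": tf.io.VarLenFeature(tf.float32),
        "image_shape": tf.io.VarLenFeature(tf.int64),
        "label": tf.io.VarLenFeature(tf.int64),
    }
    rec = tf.io.parse_single_example(proto, feature_description)
    image_shape = tf.sparse.to_dense(rec["image_shape"])
    image = tf.reshape(tf.sparse.to_dense(rec["image"]), image_shape)
    return {"pixel_values": image, "label": tf.sparse.to_dense(rec["label"])}


with tempfile.TemporaryDirectory() as tmp_dir:
    # Write five made-up records into one local shard.
    shard_path = os.path.join(tmp_dir, "train-0.tfrec")
    with tf.io.TFRecordWriter(shard_path) as writer:
        for i in range(5):
            image = np.full((8, 8, 3), i, dtype=np.float32)
            writer.write(make_example(image, label=i % 3).SerializeToString())

    # Read them back with the same glob, map, and batch steps.
    dataset = (
        tf.data.TFRecordDataset(tf.io.gfile.glob(f"{tmp_dir}/train-*"))
        .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
        .batch(4)
    )
    batch_shapes = [
        (tuple(b["pixel_values"].shape), tuple(b["label"].shape)) for b in dataset
    ]

print(batch_shapes)
```

Five records with a batch size of 4 yield one full batch and one remainder batch. The label keeps a trailing dimension of 1 because it is parsed as a variable-length feature.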

### Full Resolution Dataset

In [None]:
train_dataset = prepare_dataset()
val_dataset = prepare_dataset(split="val")



In [None]:
for batch in train_dataset.take(1):
    print(batch["pixel_values"].shape, batch["label"].shape)

(4, 500, 500, 3) (4, 1)


In [None]:
for batch in val_dataset.take(1):
    print(batch["pixel_values"].shape, batch["label"].shape)

(4, 500, 500, 3) (4, 1)


### Low Resolution Dataset

In [None]:
train_dataset = prepare_dataset(GCS_PATH_LOW_RESOLUTION)
val_dataset = prepare_dataset(GCS_PATH_LOW_RESOLUTION, split="val")

In [None]:
for batch in train_dataset.take(1):
    print(batch["pixel_values"].shape, batch["label"].shape)

(4, 256, 256, 3) (4, 1)


In [None]:
for batch in val_dataset.take(1):
    print(batch["pixel_values"].shape, batch["label"].shape)

(4, 256, 256, 3) (4, 1)
