# Chapter 13. Loading and Preprocessing Data with TensorFlow

This chapter focuses on the challenge of training deep learning systems on datasets that are too large to fit in memory. TensorFlow addresses this through its Data API, which simplifies ingesting and preprocessing large datasets. Key points include:

- Data API in TensorFlow: You create a dataset object, tell it where to get the data, and specify how to transform it. TensorFlow takes care of the implementation details, such as multithreading, queuing, batching, and prefetching.
- Compatibility with tf.keras: The Data API seamlessly integrates with tf.keras, providing a smooth workflow for deep learning model development.
- Supported data formats: TensorFlow's Data API can read from various sources, including text files (CSV), binary files (fixed-size records and TFRecord format), and SQL databases, enhancing flexibility in data handling.
- Preprocessing challenges: Numerical features usually need to be normalized, and text and categorical features need to be encoded. You can write custom preprocessing layers or use the standard Keras preprocessing layers.
- Related TensorFlow projects: TF Transform lets you write a preprocessing function that runs in batch mode over the training set, and TF Datasets provides a convenient function for downloading common datasets and manipulating them through the Data API.






## The Data API

The Data API in TensorFlow revolves around the concept of a dataset, representing a sequence of data items. Key functionalities include:

- **Creating a Dataset:** Use `tf.data.Dataset.from_tensor_slices()` to create a dataset in RAM, providing a tensor as input. The dataset consists of slices of the tensor along its first dimension.

- **Iterating Over a Dataset:** Easily iterate over dataset items using a simple loop. Each item is a tensor representing a slice of the original data.

- **Chaining Transformations:** Transformation methods such as `repeat()` and `batch()` each return a new dataset, so calls can be chained into a single pipeline.

- **Data Shuffling:** Gradient Descent works best when the training instances are independent and identically distributed. Use the `shuffle()` method with a buffer size to shuffle the instances in the training set.

- **Interleaving Lines from Multiple Files:** When the dataset is split across several files, use the `interleave()` method to read from a few files at a time and interleave their lines, which shuffles the data further.

- **Prefetching:** Calling `prefetch(1)` at the end of the pipeline keeps one batch ready ahead of time, so the next batch is prepared while the current one is being processed. This improves CPU and GPU utilization.

- **Preprocessing the Data:** Implement a preprocessing function using TensorFlow functions like `decode_csv()` and `stack()` to parse, scale, and preprocess data items.

- **Building a Reusable Input Pipeline:** Create a function `csv_reader_dataset()` that efficiently loads, preprocesses, shuffles, repeats, batches, and prefetches data from multiple CSV files.

- **Using the Dataset with tf.keras:** Integrate the created dataset with the Keras API for model training, evaluation, and prediction. Pass datasets directly to `fit()`, `evaluate()`, and `predict()` methods, simplifying the model training process.

- **Custom Training Loop:** Optionally, iterate over the dataset manually for a custom training loop. This offers flexibility in building advanced training procedures.
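
The first few bullets above can be sketched on a toy in-memory dataset. This is a minimal illustration rather than code from the chapter: the data (`tf.range(10)`), the batch sizes, the buffer size, and the seed are arbitrary assumptions.

```python
import tensorflow as tf

# A small in-memory dataset: the integers 0..9, one scalar tensor per item.
dataset = tf.data.Dataset.from_tensor_slices(tf.range(10))

# Chain transformations: repeat the data 3 times, then group items into batches of 7.
batched = dataset.repeat(3).batch(7)
batch_shapes = [tuple(batch.shape) for batch in batched]

# Shuffle with a small buffer, then batch and keep one batch prefetched.
shuffled = dataset.shuffle(buffer_size=5, seed=42).batch(4).prefetch(1)
seen = sorted(int(x) for batch in shuffled for x in batch)

print(batch_shapes)  # 30 items in batches of 7: four full batches and one of size 2
print(seen)          # shuffling reorders the items but does not drop or duplicate any
```

The shuffle buffer only needs to be large relative to the dataset; with a buffer much smaller than the dataset, the shuffle is only approximate.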

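A reusable CSV input pipeline along the lines of the `csv_reader_dataset()` bullet might look like the sketch below. The toy CSV files, the column layout (two features and one target), and the `X_MEAN` and `X_STD` scaling constants are all hypothetical; a real pipeline would compute the scaling statistics from the training set.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical toy data: two CSV files with a header row, two features, and a target.
tmp_dir = tempfile.mkdtemp()
filepaths = []
for i in range(2):
    path = os.path.join(tmp_dir, f"part_{i}.csv")
    with open(path, "w") as f:
        f.write("x1,x2,y\n")
        for j in range(4):
            f.write(f"{i + j}.0,{2 * j}.0,{i}.0\n")
    filepaths.append(path)

N_INPUTS = 2                       # number of feature columns (assumption)
X_MEAN = tf.constant([2.0, 3.0])   # hypothetical per-feature mean
X_STD = tf.constant([1.0, 2.0])    # hypothetical per-feature standard deviation

def preprocess(line):
    # Feature columns default to 0.0; the empty default makes the target column required.
    defaults = [0.0] * N_INPUTS + [tf.constant([], dtype=tf.float32)]
    fields = tf.io.decode_csv(line, record_defaults=defaults)
    x = (tf.stack(fields[:-1]) - X_MEAN) / X_STD
    y = tf.stack(fields[-1:])
    return x, y

def csv_reader_dataset(filepaths, n_readers=2, shuffle_buffer_size=100, batch_size=4):
    dataset = tf.data.Dataset.list_files(filepaths)
    # Read n_readers files at a time, skipping each header row, and interleave their lines.
    dataset = dataset.interleave(
        lambda path: tf.data.TextLineDataset(path).skip(1),
        cycle_length=n_readers)
    dataset = dataset.map(preprocess)
    dataset = dataset.shuffle(shuffle_buffer_size)
    return dataset.batch(batch_size).prefetch(1)

batches = list(csv_reader_dataset(filepaths))
for x_batch, y_batch in batches:
    print(x_batch.shape, y_batch.shape)
```

For training, a `repeat()` call is often added before `shuffle()` so the dataset never runs out; it is left out here so the example terminates after one pass.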

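Passing a dataset to Keras could look like the following sketch. The synthetic regression data and the single-layer model are assumptions chosen only to keep the example small.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy regression data: y = 3*x1 - 2*x2.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 2)).astype(np.float32)
y = (X @ np.array([3.0, -2.0], dtype=np.float32)).reshape(-1, 1)

# Datasets of (features, label) pairs for training and evaluation.
train_set = tf.data.Dataset.from_tensor_slices((X, y)).shuffle(64).batch(16).prefetch(1)
# For prediction, the dataset only needs to yield features.
new_set = tf.data.Dataset.from_tensor_slices(X[:5]).batch(5)

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(loss="mse", optimizer="sgd")

# The dataset already defines the batches, so no batch_size argument is passed.
history = model.fit(train_set, epochs=3, verbose=0)
mse = model.evaluate(train_set, verbose=0)
preds = model.predict(new_set, verbose=0)
print(len(history.history["loss"]), preds.shape)
```

Because `train_set` is batched, `fit()` and `evaluate()` consume it batch by batch; the batch size is a property of the pipeline rather than of the training call.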


## TFRecord Format and Compression

The TFRecord format is TensorFlow's preferred format for storing large amounts of data and reading it efficiently. It is a simple binary format containing a sequence of binary records of varying sizes; each record consists of a length, a CRC checksum of the length, the actual data, and a CRC checksum of the data. You can create a TFRecord file with `tf.io.TFRecordWriter` and read one or more files with `tf.data.TFRecordDataset`. TFRecord files can also be compressed, for example with GZIP, which is mainly useful when they must be loaded over a network connection.

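```python
import tensorflow as tf

# Write two raw byte-string records to a TFRecord file.
with tf.io.TFRecordWriter("my_data.tfrecord") as f:
    f.write(b"This is the first record")
    f.write(b"And this is the second record")

# Read the records back; each item is a scalar string tensor.
filepaths = ["my_data.tfrecord"]
dataset = tf.data.TFRecordDataset(filepaths)
for item in dataset:
    print(item)
```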
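
To illustrate the compression option, the sketch below writes a GZIP-compressed TFRecord file and reads it back. The file name and temporary directory are assumptions for the example.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical output path in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "my_compressed.tfrecord")

# Write two records with GZIP compression enabled.
options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter(path, options) as f:
    f.write(b"This is the first record")
    f.write(b"And this is the second record")

# Pass the matching compression type when reading the file back.
dataset = tf.data.TFRecordDataset([path], compression_type="GZIP")
records = [item.numpy() for item in dataset]
print(records)
```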


## Protocol Buffers (protobuf)

TFRecord files often contain serialized protocol buffers (protobufs). Protobufs are a binary format developed by Google, now widely used, and can be defined using a simple language. TensorFlow provides protobuf definitions, like `tf.train.Example`, commonly used in TFRecord files.

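```python
import tensorflow as tf

# Aliases via tf.train; `from tensorflow.train import ...` is not reliably
# importable across TF 2.x releases.
BytesList, FloatList, Int64List = tf.train.BytesList, tf.train.FloatList, tf.train.Int64List
Feature, Features, Example = tf.train.Feature, tf.train.Features, tf.train.Example

# An Example protobuf describing one contact.
person_example = Example(
    features=Features(
        feature={
            "name": Feature(bytes_list=BytesList(value=[b"Alice"])),
            "id": Feature(int64_list=Int64List(value=[123])),
            "emails": Feature(bytes_list=BytesList(value=[b"a@b.com", b"c@d.com"]))
        }
    )
)
```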
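
Once an `Example` exists, it can be serialized and written to a TFRecord file. The sketch below shows one way to do this and checks that the serialized bytes round-trip; the temporary output path is an assumption, and the `Example` is a shortened version of the contact above.

```python
import os
import tempfile

import tensorflow as tf

person_example = tf.train.Example(features=tf.train.Features(feature={
    "name": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"Alice"])),
    "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[123])),
}))

# Serialize the protobuf to a byte string and store it as one TFRecord record.
serialized = person_example.SerializeToString()
path = os.path.join(tempfile.mkdtemp(), "my_contacts.tfrecord")  # hypothetical location
with tf.io.TFRecordWriter(path) as f:
    f.write(serialized)

# Deserialize the bytes back into an Example using the plain protobuf API.
restored = tf.train.Example.FromString(serialized)
name = restored.features.feature["name"].bytes_list.value[0]
person_id = restored.features.feature["id"].int64_list.value[0]
print(name, person_id)
```

Inside a `tf.data` pipeline, parsing is normally done with the `tf.io.parse_*` functions instead of `FromString()`, since those are TensorFlow operations.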


## Loading and Parsing Examples

Serialized Examples are loaded from a TFRecord file with `tf.data.TFRecordDataset` and parsed with `tf.io.parse_single_example()`. The parser needs a description dictionary that maps each feature name to its type, shape, and default value. Fixed-length features are parsed as regular tensors, while variable-length features are parsed as sparse tensors.

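```python
import tensorflow as tf

# Describe each feature: shape, type, and default value for fixed-length features,
# and only the type for variable-length features.
feature_description = {
    "name": tf.io.FixedLenFeature([], tf.string, default_value=""),
    "id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
    "emails": tf.io.VarLenFeature(tf.string),
}

# Assumes "my_contacts.tfrecord" already holds serialized Example protobufs.
for serialized_example in tf.data.TFRecordDataset(["my_contacts.tfrecord"]):
    parsed_example = tf.io.parse_single_example(serialized_example, feature_description)
    # ... (processing the parsed example)
```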
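
The following self-contained sketch writes two contacts, parses them one at a time, converts the sparse `emails` feature to a dense tensor, and then parses a whole batch with `tf.io.parse_example()`. The helper `make_example()`, the second contact, and the temporary file path are hypothetical.

```python
import os
import tempfile

import tensorflow as tf

def make_example(name, person_id, emails):
    # Hypothetical helper that builds one contact as an Example protobuf.
    return tf.train.Example(features=tf.train.Features(feature={
        "name": tf.train.Feature(bytes_list=tf.train.BytesList(value=[name])),
        "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[person_id])),
        "emails": tf.train.Feature(bytes_list=tf.train.BytesList(value=emails)),
    }))

path = os.path.join(tempfile.mkdtemp(), "my_contacts.tfrecord")
with tf.io.TFRecordWriter(path) as f:
    f.write(make_example(b"Alice", 123, [b"a@b.com", b"c@d.com"]).SerializeToString())
    f.write(make_example(b"Bob", 456, [b"bob@e.com"]).SerializeToString())

feature_description = {
    "name": tf.io.FixedLenFeature([], tf.string, default_value=""),
    "id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
    "emails": tf.io.VarLenFeature(tf.string),
}

# Parse one serialized Example at a time.
parsed = [tf.io.parse_single_example(s, feature_description)
          for s in tf.data.TFRecordDataset([path])]

# Variable-length features come back as sparse tensors; convert to dense if needed.
alice_emails = tf.sparse.to_dense(parsed[0]["emails"], default_value=b"").numpy().tolist()

# Alternatively, parse a whole batch of serialized Examples in one call.
for serialized_batch in tf.data.TFRecordDataset([path]).batch(2):
    parsed_batch = tf.io.parse_example(serialized_batch, feature_description)
ids = parsed_batch["id"].numpy().tolist()
print(alice_emails, ids)
```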


## SequenceExample for Lists of Lists
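For more complex structures, such as lists of lists, TensorFlow provides the `SequenceExample` protobuf. It contains a `Features` object for the contextual data and a `FeatureLists` object holding one or more named `FeatureList` objects. Each `FeatureList` contains a list of `Feature` objects, and each `Feature` can be a list of byte strings, a list of 64-bit integers, or a list of floats.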

```python
import tensorflow as tf

# Aliases via tf.train for the protobuf classes used below.
BytesList, Feature, Features = tf.train.BytesList, tf.train.Feature, tf.train.Features
FeatureList, FeatureLists, SequenceExample = (
    tf.train.FeatureList, tf.train.FeatureLists, tf.train.SequenceExample)

sequence_example = SequenceExample(
    context=Features(feature={"author": Feature(bytes_list=BytesList(value=[b"John"]))}),
    feature_lists=FeatureLists(
        feature_list={
            "content": FeatureList(feature=[Feature(bytes_list=BytesList(value=[b"word1", b"word2"]))]),
            "comments": FeatureList(feature=[Feature(bytes_list=BytesList(value=[b"comment1", b"comment2"]))])
        }
    )
)
```
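
A `SequenceExample` can be parsed with `tf.io.parse_single_sequence_example()`, which returns the context features and the feature lists separately. The sketch below assumes a hypothetical document with two sentences of different lengths and converts the parsed sparse tensor to a ragged tensor; the helper `words_to_feature()` is also hypothetical.

```python
import tensorflow as tf

BytesList, Feature, Features = tf.train.BytesList, tf.train.Feature, tf.train.Features
FeatureList, FeatureLists, SequenceExample = (
    tf.train.FeatureList, tf.train.FeatureLists, tf.train.SequenceExample)

def words_to_feature(words):
    # Hypothetical helper: one Feature per sentence, holding its words as byte strings.
    return Feature(bytes_list=BytesList(value=words))

# Hypothetical content: a list of sentences, each a list of words.
content = [[b"When", b"shall", b"we", b"meet"], [b"In", b"thunder"]]

sequence_example = SequenceExample(
    context=Features(feature={"author": Feature(bytes_list=BytesList(value=[b"John"]))}),
    feature_lists=FeatureLists(feature_list={
        "content": FeatureList(feature=[words_to_feature(s) for s in content]),
    }))
serialized = sequence_example.SerializeToString()

# Separate descriptions for the context features and the feature lists.
context_description = {"author": tf.io.FixedLenFeature([], tf.string, default_value="")}
sequence_description = {"content": tf.io.VarLenFeature(tf.string)}

parsed_context, parsed_feature_lists = tf.io.parse_single_sequence_example(
    serialized, context_description, sequence_description)

# Variable-length lists are parsed as sparse tensors; a ragged tensor keeps row lengths.
ragged_content = tf.RaggedTensor.from_sparse(parsed_feature_lists["content"])
author = parsed_context["author"].numpy()
print(author, ragged_content.to_list())
```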
