# Chapter 13: Loading and Preprocessing Data with TensorFlow

This notebook contains the code reproductions and theoretical explanations for Chapter 13 of *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow*.

## Chapter Summary

This chapter focuses on building efficient, scalable, and production-ready data pipelines using TensorFlow.

Key topics covered include:

* **The Data API (`tf.data`):** We learn how to build high-performance data pipelines by creating a `Dataset` object and chaining transformations. This includes reading from files, shuffling, batching, mapping (preprocessing), and prefetching.
* **The TFRecord Format:** This is TensorFlow's preferred binary format for storing large datasets. We learn how to create, write to, read from, and parse TFRecord files, including those containing `tf.train.Example` protocol buffers.
* **Preprocessing Layers:** We explore how to preprocess features *within* the model itself. This ensures that the same preprocessing logic is applied during training and inference, preventing training-serving skew. This includes:
    * Standardization and normalization.
    * Encoding categorical features (using one-hot encoding or embeddings).
    * Encoding text features (using tokenization and bag-of-words or embeddings).
* **TF Transform:** A library from TensorFlow Extended (TFX) that allows you to define a single preprocessing function which can be run efficiently in batch (e.g., on Apache Beam) before training, and also exported for on-the-fly preprocessing in a deployed model.
* **TensorFlow Datasets (TFDS):** A high-level library that provides a simple way to download and use many common public datasets, already in the `tf.data.Dataset` format.

## Setup

First, let's import the necessary libraries and set up the environment.

In [1]:
import tensorflow as tf
from tensorflow import keras
import numpy as np
import os

# Common setup for plotting
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

## The Data API

**Theoretical Explanation:**

The Data API revolves around the `tf.data.Dataset` object, which represents a sequence of data items. This API is designed to handle large datasets that may not fit in memory. It allows you to build a pipeline by chaining transformations.

* **Source:** You first create a dataset from a source (e.g., from tensors in memory, from files on disk).
* **Transformations:** You then apply a series of transformations to this dataset. Each transformation method (like `.batch()` or `.map()`) returns a *new* dataset object. This enables method chaining and lazy evaluation.
* **Iteration:** Finally, you iterate over the dataset (e.g., in a `for` loop or by passing it to a Keras model's `fit()` method).

In [2]:
# Create a dataset from a tensor in RAM
X = tf.range(10)  # any data tensor
dataset = tf.data.Dataset.from_tensor_slices(X)
dataset

<_TensorSliceDataset element_spec=TensorSpec(shape=(), dtype=tf.int32, name=None)>

In [3]:
# This dataset contains 10 items: tensors 0, 1, 2, ..., 9
for item in dataset:
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)


### Chaining Transformations

Each method returns a new dataset, so you can chain methods together to build a processing pipeline.

In [4]:
dataset = dataset.repeat(3).batch(7, drop_remainder=False)

for item in dataset:
    print(item)

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int32)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int32)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int32)
tf.Tensor([8 9], shape=(2,), dtype=int32)


Common transformation methods:
* `map()`: Applies a custom function to each item. This is used for preprocessing.
* `apply()`: Applies a transformation to the dataset as a whole.
* `filter()`: Filters the dataset, keeping only items that pass a test.
* `take()`: Creates a new dataset with only the first *n* items.

In [5]:
# Reset dataset for new examples
dataset = tf.data.Dataset.from_tensor_slices(tf.range(10))

# map()
dataset_map = dataset.map(lambda x: x * 2)
print("Mapped:", list(dataset_map.as_numpy_iterator()))

# filter()
dataset_filter = dataset.filter(lambda x: x < 5)
print("Filtered:", list(dataset_filter.as_numpy_iterator()))

# take()
dataset_take = dataset.take(3)
print("Taken:", list(dataset_take.as_numpy_iterator()))

Mapped: [np.int32(0), np.int32(2), np.int32(4), np.int32(6), np.int32(8), np.int32(10), np.int32(12), np.int32(14), np.int32(16), np.int32(18)]
Filtered: [np.int32(0), np.int32(1), np.int32(2), np.int32(3), np.int32(4)]
Taken: [np.int32(0), np.int32(1), np.int32(2)]
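
The cell above demonstrates `map()`, `filter()`, and `take()`, but not `apply()`. The following is a minimal sketch of `apply()`, assuming a hypothetical helper named `drop_odd_then_double` that is not part of the original notebook. Unlike `map()`, the function passed to `apply()` receives the whole dataset and returns a new dataset.

```python
import tensorflow as tf

def drop_odd_then_double(ds):
    # Hypothetical dataset-level transformation: takes a Dataset, returns a Dataset.
    return ds.filter(lambda x: x % 2 == 0).map(lambda x: x * 2)

dataset = tf.data.Dataset.range(10)
applied = dataset.apply(drop_odd_then_double)

# Keeps 0, 2, 4, 6, 8 and doubles each of them.
applied_values = [int(v) for v in applied.as_numpy_iterator()]
print("Applied:", applied_values)  # Applied: [0, 4, 8, 12, 16]
```

This is mainly useful for packaging several chained steps into one reusable unit.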


### Shuffling the Data

**Theoretical Explanation:**

Gradient descent works best when the training instances are independent and identically distributed (IID). The `shuffle()` method helps approximate this by drawing items through a **shuffle buffer**.

It works as follows:
1.  It fills a buffer (of `buffer_size`) with the first items from the source dataset.
2.  When an item is requested, it pulls one out *randomly* from the buffer.
3.  It then replaces the pulled item with the *next* item from the source dataset.

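For this to be effective, the `buffer_size` must be large enough. A common practice is to set it to the size of the training set, but if the dataset is too large for RAM, you can use a smaller buffer (e.g., 10,000). A small buffer only mixes items that are close together in the source, so in that case it also helps to shuffle the source data itself beforehand, for example by splitting it into multiple files and reading them in random order.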

In [6]:
dataset = tf.data.Dataset.range(10).repeat(3) # 0 to 9, three times
dataset = dataset.shuffle(buffer_size=5, seed=42).batch(7)
for item in dataset:
    print(item)

tf.Tensor([0 2 3 6 7 9 4], shape=(7,), dtype=int64)
tf.Tensor([5 0 1 1 8 6 5], shape=(7,), dtype=int64)
tf.Tensor([4 8 7 1 2 3 0], shape=(7,), dtype=int64)
tf.Tensor([5 4 2 7 8 9 9], shape=(7,), dtype=int64)
tf.Tensor([3 6], shape=(2,), dtype=int64)
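
To make the three steps of the shuffle buffer concrete, here is a simplified pure-Python simulation. It is only a sketch of the idea, not TensorFlow's actual implementation, and the function name `shuffle_buffer` is made up for this illustration. It assumes `buffer_size >= 1`.

```python
import random

def shuffle_buffer(source, buffer_size, seed=None):
    """Simplified simulation of a shuffle buffer (illustrative only)."""
    rng = random.Random(seed)
    source = iter(source)
    buffer = []
    # Step 1: fill the buffer with the first items from the source.
    for item in source:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    # Steps 2 and 3: emit a random buffered item, then refill its slot
    # with the next item from the source.
    for item in source:
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item
    # Source exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer

small = list(shuffle_buffer(range(10), buffer_size=3, seed=42))
unshuffled = list(shuffle_buffer(range(10), buffer_size=1))
print(small)
print(unshuffled)  # a buffer of size 1 leaves the order unchanged
```

With a buffer of size 3, the item at output position `p` can only come from the first `p + 3` source items, which is why a small buffer produces a weak shuffle.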


### Interleaving lines from multiple files

For very large datasets, a common pattern is to split the data into multiple files. This allows you to shuffle them at the file level. The Data API can read from these files in parallel and interleave their lines. This further improves shuffling and performance.

1.  `Dataset.list_files()`: Creates a dataset of filenames (shuffles them by default).
2.  `dataset.interleave()`: Reads from multiple files at once and interleaves their records.
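
Before working with real files, the behavior of `interleave()` can be seen on a toy example. The sketch below uses in-memory string tensors as hypothetical stand-ins for three files of three lines each; the names `fake_files` and `interleaved_lines` are made up for this illustration.

```python
import tensorflow as tf

# Each row plays the role of one file containing three lines.
fake_files = tf.constant([["a1", "a2", "a3"],
                          ["b1", "b2", "b3"],
                          ["c1", "c2", "c3"]])
file_dataset = tf.data.Dataset.from_tensor_slices(fake_files)

# Read from 2 "files" at a time, taking one line from each in turn.
interleaved = file_dataset.interleave(
    lambda file_lines: tf.data.Dataset.from_tensor_slices(file_lines),
    cycle_length=2)

interleaved_lines = [line.decode() for line in interleaved.as_numpy_iterator()]
print(interleaved_lines)
# ['a1', 'b1', 'a2', 'b2', 'a3', 'b3', 'c1', 'c2', 'c3']
```

The third "file" is only opened once one of the first two is exhausted, which is why a larger `cycle_length` mixes the sources more thoroughly.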

In [7]:
# We first need to create the CSV files for this example
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(
    housing.data, housing.target.reshape(-1, 1), random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full, random_state=42)

scaler = StandardScaler()
scaler.fit(X_train)
X_mean = scaler.mean_
X_std = scaler.scale_

def save_to_multiple_csv_files(data, name_prefix, header=None, n_parts=10):
    housing_dir = os.path.join("datasets", "housing")
    os.makedirs(housing_dir, exist_ok=True)
    path_format = os.path.join(housing_dir, "my_{}_{:02d}.csv")

    filepaths = []
    m = len(data)
    for file_idx, row_indices in enumerate(np.array_split(np.arange(m), n_parts)):
        part_csv = path_format.format(name_prefix, file_idx)
        filepaths.append(part_csv)
        with open(part_csv, "wt", encoding="utf-8") as f:
            if header is not None:
                f.write(header)
                f.write("\n")
            for row_idx in row_indices:
                f.write(",".join([str(col) for col in data[row_idx]]))
                f.write("\n")
    return filepaths

# Prepare and save the data
train_data = np.c_[X_train, y_train]
valid_data = np.c_[X_valid, y_valid]
test_data = np.c_[X_test, y_test]
header_cols = housing.feature_names + ["MedianHouseValue"]
header = ",".join(header_cols)

train_filepaths = save_to_multiple_csv_files(train_data, "train", header, n_parts=20)
valid_filepaths = save_to_multiple_csv_files(valid_data, "valid", header, n_parts=10)
test_filepaths = save_to_multiple_csv_files(test_data, "test", header, n_parts=10)

print("Filepaths created.")

Filepaths created.


In [8]:
# 1. Create a dataset of filepaths
filepath_dataset = tf.data.Dataset.list_files(train_filepaths, seed=42)

# 2. Interleave the lines from 5 files at a time
#    We also skip the header row (skip(1)) from each file
n_readers = 5
dataset = filepath_dataset.interleave(
    lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
    cycle_length=n_readers)

# Let's check the result
for line in dataset.take(5):
    print(line.numpy())

b'4.2083,44.0,5.323204419889502,0.9171270718232044,846.0,2.3370165745856353,37.47,-122.2,2.782'
b'4.1812,52.0,5.701388888888889,0.9965277777777778,692.0,2.4027777777777777,33.73,-118.31,3.215'
b'3.6875,44.0,4.524475524475524,0.993006993006993,457.0,3.195804195804196,34.04,-118.15,1.625'
b'3.3456,37.0,4.514084507042254,0.9084507042253521,458.0,3.2253521126760565,36.67,-121.7,2.526'
b'3.5214,15.0,3.0499445061043287,1.106548279689234,1447.0,1.6059933407325193,37.63,-122.43,1.442'


### Preprocessing the Data

The data is loaded as byte strings. We need a function to parse and scale it. We can use `tf.io.decode_csv` to parse the lines and `tf.stack` to re-form tensors.

In [9]:
n_inputs = 8 # housing.data.shape[1]

def preprocess(line):
    defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]
    fields = tf.io.decode_csv(line, record_defaults=defs)
    x = tf.stack(fields[:-1])
    y = tf.stack(fields[-1:])
    return (x - X_mean) / (X_std + keras.backend.epsilon()), y

# Test the preprocess function
preprocess(b'4.2083,44.0,5.3232,0.9171,846.0,2.3370,37.47,-122.2,2.782')

(<tf.Tensor: shape=(8,), dtype=float32, numpy=
 array([ 0.16579157,  1.216324  , -0.05204564, -0.3921597 , -0.5277444 ,
        -0.2633488 ,  0.8543046 , -1.3072057 ], dtype=float32)>,
 <tf.Tensor: shape=(1,), dtype=float32, numpy=array([2.782], dtype=float32)>)
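
The `record_defaults` argument in `preprocess()` deserves a closer look. A scalar default such as `0.` is used when a field is empty, while an empty tensor such as `tf.constant([], dtype=tf.float32)` marks the field as required. The following sketch illustrates both cases on hypothetical three-column lines (the names `defs_demo`, `values`, and `missing_target_error` are made up for this example).

```python
import tensorflow as tf

# Two optional float columns (default 0.) followed by one required float column.
defs_demo = [0., 0., tf.constant([], dtype=tf.float32)]

# The second field is empty, so it falls back to its default value.
fields = tf.io.decode_csv("1.5,,3.0", record_defaults=defs_demo)
values = [float(f) for f in fields]
print(values)  # [1.5, 0.0, 3.0]

# The last field is required, so leaving it empty raises an error.
try:
    tf.io.decode_csv("1.5,2.5,", record_defaults=defs_demo)
    missing_target_error = None
except tf.errors.InvalidArgumentError as error:
    missing_target_error = error
print("Error raised:", missing_target_error is not None)
```

Making the target column required is a reasonable choice here, since silently replacing a missing label with 0 would corrupt the training data.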

### Putting Everything Together

**Theoretical Explanation:**

We can build a final, efficient pipeline by chaining all these steps. The key to performance is to make sure operations run in parallel.

* `interleave(..., num_parallel_calls=...)`: Reads multiple files in parallel.
* `map(..., num_parallel_calls=...)`: Preprocesses multiple items in parallel.
* `shuffle()`: Shuffles the items.
* `batch()`: Groups items into batches.
* `prefetch(1)`: This is a crucial performance optimization. It creates a dataset that will always prepare one batch ahead of time. While the model is training on batch N, the CPU is already preparing batch N+1. This prevents the GPU from "starving" for data.
* `cache()`: If your dataset is small enough to fit in RAM, you can add `.cache()` after preprocessing (but before shuffling/batching) to store the preprocessed data in memory, avoiding repeated work every epoch.
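
The ordering described in the list above can be sketched on a small in-memory dataset. This example is illustrative only: it uses `tf.data.AUTOTUNE` (available in recent TensorFlow 2 releases) as an alternative to hard-coded values such as `prefetch(1)`, letting TensorFlow pick the level of parallelism and the prefetch depth at runtime.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(100)
# Preprocess items in parallel.
dataset = dataset.map(lambda x: tf.cast(x, tf.float32) / 100.,
                      num_parallel_calls=tf.data.AUTOTUNE)
# Cache after preprocessing but before shuffling, so each epoch
# reuses the preprocessed items yet still sees a different order.
dataset = dataset.cache()
dataset = dataset.shuffle(100, seed=42)
dataset = dataset.batch(32)
# Prepare upcoming batches while the current one is being consumed.
dataset = dataset.prefetch(tf.data.AUTOTUNE)

batch_shapes = [tuple(batch.shape) for batch in dataset]
print(batch_shapes)  # [(32,), (32,), (32,), (4,)]
```

Placing `cache()` after `shuffle()` or `batch()` would instead freeze the same order or the same batches for every epoch.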

In [10]:
def csv_reader_dataset(filepaths, repeat=1, n_readers=5,
                       n_read_threads=None, shuffle_buffer_size=10000,
                       n_parse_threads=5, batch_size=32):
    dataset = tf.data.Dataset.list_files(filepaths).repeat(repeat)
    dataset = dataset.interleave(
        lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
        cycle_length=n_readers, num_parallel_calls=n_read_threads)
    dataset = dataset.map(preprocess, num_parallel_calls=n_parse_threads)
    dataset = dataset.shuffle(shuffle_buffer_size)
    return dataset.batch(batch_size).prefetch(1)


In [11]:
# Create the final dataset objects for training, validation, and testing
train_set = csv_reader_dataset(train_filepaths, repeat=None)
valid_set = csv_reader_dataset(valid_filepaths)
test_set = csv_reader_dataset(test_filepaths)

### Using the Dataset with `tf.keras`

You can now pass these `Dataset` objects directly to the `fit()`, `evaluate()`, and `predict()` methods of a Keras model.

In [14]:
model = keras.models.Sequential([
    keras.layers.Dense(30, activation="relu", input_shape=X_train.shape[1:]),
    keras.layers.Dense(1)
])

model.compile(loss="mse", optimizer=keras.optimizers.SGD(learning_rate=1e-3))

batch_size = 32  # must match the batch_size used in csv_reader_dataset()

# We pass the datasets directly to fit()
# Since train_set repeats indefinitely, we must set steps_per_epoch.
model.fit(train_set, epochs=10,
          validation_data=valid_set,
          steps_per_epoch=len(X_train) // batch_size)

Epoch 1/10


  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


[1m362/362[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 4ms/step - loss: 3.6443 - val_loss: 0.9263
Epoch 2/10
[1m 51/362[0m [32m━━[0m[37m━━━━━━━━━━━━━━━━━━[0m [1m0s[0m 3ms/step - loss: 0.9031



[1m362/362[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - loss: 0.8263 - val_loss: 0.7216
Epoch 3/10
[1m362/362[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 7ms/step - loss: 0.7222 - val_loss: 0.6458
Epoch 4/10
[1m362/362[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 5ms/step - loss: 0.6793 - val_loss: 0.6127
Epoch 5/10
[1m362/362[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - loss: 0.6226 - val_loss: 0.5843
Epoch 6/10
[1m362/362[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - loss: 0.5629 - val_loss: 0.6731
Epoch 7/10
[1m362/362[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - loss: 0.5842 - val_loss: 0.5320
Epoch 8/10
[1m362/362[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - loss: 0.5352 - val_loss: 0.5357
Epoch 9/10
[1m362/362[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - loss: 0.4999 - val_loss: 0.5265
Epoch 10/10
[1m362/362[0m [32m━━━━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x791054e3fb00>

In [15]:
# We can also pass the dataset to evaluate()
model.evaluate(test_set, steps=len(X_test) // batch_size)

161/161 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 0.4852


0.487282931804657

In [16]:
# And to predict()
# Note: new_set should not contain labels
new_set = test_set.take(3).map(lambda X, y: X)
model.predict(new_set)

3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step




array([[1.9415727 ],
       [1.8754908 ],
       [1.2476733 ],
       [2.2222128 ],
       [1.0024118 ],
       [1.4595723 ],
       [1.7911727 ],
       [3.078833  ],
       [0.9406998 ],
       [1.7095742 ],
       [2.852618  ],
       [2.3548572 ],
       [4.203974  ],
       [1.4372646 ],
       [0.9491283 ],
       [1.9114571 ],
       [1.6402541 ],
       [1.0866268 ],
       [2.9489248 ],
       [2.2098937 ],
       [1.1167781 ],
       [1.6792766 ],
       [1.8251857 ],
       [1.4860404 ],
       [2.7172618 ],
       [2.0456617 ],
       [2.6747231 ],
       [2.329032  ],
       [2.6333773 ],
       [1.5017779 ],
       [1.5731975 ],
       [0.9456526 ],
       [1.1374192 ],
       [2.8793921 ],
       [4.7820644 ],
       [1.3767979 ],
       [1.8201216 ],
       [1.510982  ],
       [1.3603859 ],
       [1.7061348 ],
       [1.9048326 ],
       [1.4471312 ],
       [1.6706655 ],
       [1.2072568 ],
       [2.827791  ],
       [1.3204029 ],
       [2.2691965 ],
       [2.392

## The TFRecord Format

**Theoretical Explanation:**

The TFRecord format is TensorFlow's preferred format for storing large datasets. It's a simple binary format composed of a sequence of binary records. Each record contains its length, a CRC checksum to verify that the length was not corrupted, the actual data, and a final CRC checksum for the data.

This format is efficient to read and works very well with the Data API. It's particularly useful for data that isn't line-based, like images or audio.

The records themselves often contain serialized **protocol buffers** (protobufs). A protobuf is a portable, extensible, and efficient binary format. The standard protobuf used in TFRecords is the `tf.train.Example`.

In [17]:
# Write to a TFRecord file
with tf.io.TFRecordWriter("my_data.tfrecord") as f:
    f.write(b"This is the first record")
    f.write(b"And this is the second record")

# Read from a TFRecord file
filepaths = ["my_data.tfrecord"]
dataset = tf.data.TFRecordDataset(filepaths)
for item in dataset:
    print(item)

tf.Tensor(b'This is the first record', shape=(), dtype=string)
tf.Tensor(b'And this is the second record', shape=(), dtype=string)


In [18]:
# You can also compress TFRecord files with GZIP
options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter("my_compressed.tfrecord", options) as f:
    f.write(b"This is the first record")
    f.write(b"And this is the second record")

# And read them by specifying the compression type
dataset = tf.data.TFRecordDataset(["my_compressed.tfrecord"],
                                compression_type="GZIP")
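
The record layout described at the start of this section can be checked by reading an uncompressed TFRecord file as raw bytes. The sketch below assumes the documented framing (an 8-byte little-endian length, a 4-byte checksum of the length, the data, and a 4-byte checksum of the data). It does not verify the checksums, and the file name `layout_demo.tfrecord` is made up for this illustration.

```python
import os
import struct
import tempfile
import tensorflow as tf

payload = b"hello"
path = os.path.join(tempfile.mkdtemp(), "layout_demo.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    writer.write(payload)

with open(path, "rb") as f:
    raw = f.read()

# Split the single record into its four parts.
(length,) = struct.unpack("<Q", raw[:8])     # 8-byte record length
length_crc = raw[8:12]                       # 4-byte checksum of the length
data = raw[12:12 + length]                   # the record itself
data_crc = raw[12 + length:16 + length]      # 4-byte checksum of the data

print(length, data, len(raw))  # 5 b'hello' 21
```

Each record therefore carries 16 bytes of overhead, which is negligible for large records such as serialized images.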

### TensorFlow Protobufs

The `tf.train.Example` protobuf is a flexible message type that represents one instance. It's a dictionary of named features, where each feature can be a list of byte strings, a list of floats, or a list of integers.

Here is its definition (simplified):
```
message BytesList { repeated bytes value = 1; }
message FloatList { repeated float value = 1; }
message Int64List { repeated int64 value = 1; }

message Feature {
  oneof kind {
    BytesList bytes_list = 1;
    FloatList float_list = 2;
    Int64List int64_list = 3;
  }
};

message Features { map<string, Feature> feature = 1; };
message Example { Features features = 1; };
```

In [19]:
# Create a tf.train.Example protobuf
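# Attribute access works across TF versions; importing from `tensorflow.train`
# directly fails in some releases.
BytesList, FloatList, Int64List = (
    tf.train.BytesList, tf.train.FloatList, tf.train.Int64List)
Feature, Features, Example = tf.train.Feature, tf.train.Features, tf.train.Example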

person_example = Example(
    features=Features(
        feature={
            "name": Feature(bytes_list=BytesList(value=[b"Alice"])),
            "id": Feature(int64_list=Int64List(value=[123])),
            "emails": Feature(bytes_list=BytesList(value=[b"a@b.com", b"c@d.com"]))
        })
)

# Serialize it and write to a TFRecord file
with tf.io.TFRecordWriter("my_contacts.tfrecord") as f:
    f.write(person_example.SerializeToString())

### Loading and Parsing Examples

To read and parse the `Example` protobufs, you use `tf.io.parse_single_example()`. This is a TF operation, so it can be part of your `tf.data` pipeline. It requires a dictionary that describes the features you expect.

In [20]:
# Define the feature description
feature_description = {
    "name": tf.io.FixedLenFeature([], tf.string, default_value=""),
    "id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
    "emails": tf.io.VarLenFeature(tf.string), # VarLenFeature for variable length
}

# Create the dataset and parse the examples
dataset = tf.data.TFRecordDataset(["my_contacts.tfrecord"])
for serialized_example in dataset:
    parsed_example = tf.io.parse_single_example(serialized_example,
                                                  feature_description)
    print(parsed_example)

{'emails': SparseTensor(indices=tf.Tensor(
[[0]
 [1]], shape=(2, 1), dtype=int64), values=tf.Tensor([b'a@b.com' b'c@d.com'], shape=(2,), dtype=string), dense_shape=tf.Tensor([2], shape=(1,), dtype=int64)), 'id': <tf.Tensor: shape=(), dtype=int64, numpy=123>, 'name': <tf.Tensor: shape=(), dtype=string, numpy=b'Alice'>}


In [21]:
# VarLenFeatures are parsed as SparseTensors
print(parsed_example["emails"])

SparseTensor(indices=tf.Tensor(
[[0]
 [1]], shape=(2, 1), dtype=int64), values=tf.Tensor([b'a@b.com' b'c@d.com'], shape=(2,), dtype=string), dense_shape=tf.Tensor([2], shape=(1,), dtype=int64))


In [22]:
# You can convert a sparse tensor to a dense tensor
tf.sparse.to_dense(parsed_example["emails"], default_value=b"")

<tf.Tensor: shape=(2,), dtype=string, numpy=array([b'a@b.com', b'c@d.com'], dtype=object)>

**Note on `SequenceExample`:** For more complex data, like lists of lists (e.g., a document as a list of sentences, where each sentence is a list of words), you can use the `SequenceExample` protobuf. It's parsed using `tf.io.parse_single_sequence_example()`.
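
As a minimal sketch of the note above, the following builds a `SequenceExample` for a hypothetical document made of two sentences, serializes it, and parses it back. The feature names (`author_id`, `content`) and the helper `words_to_feature` are made up for this illustration.

```python
import tensorflow as tf

def words_to_feature(words):
    # Hypothetical helper: one sentence becomes one Feature holding a list of byte strings.
    return tf.train.Feature(bytes_list=tf.train.BytesList(
        value=[word.encode("utf-8") for word in words]))

content = [["When", "shall", "we", "meet"], ["again", "?"]]
sequence_example = tf.train.SequenceExample(
    # Context: features that describe the document as a whole.
    context=tf.train.Features(feature={
        "author_id": tf.train.Feature(int64_list=tf.train.Int64List(value=[123])),
    }),
    # Feature lists: one Feature per sentence.
    feature_lists=tf.train.FeatureLists(feature_list={
        "content": tf.train.FeatureList(
            feature=[words_to_feature(sentence) for sentence in content]),
    }))
serialized = sequence_example.SerializeToString()

context_description = {
    "author_id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
}
sequence_description = {
    "content": tf.io.VarLenFeature(tf.string),
}
parsed_context, parsed_lists = tf.io.parse_single_sequence_example(
    serialized, context_description, sequence_description)

# Variable-length sentences are parsed as a SparseTensor; a RaggedTensor is easier to read.
ragged_content = tf.RaggedTensor.from_sparse(parsed_lists["content"])
print(parsed_context["author_id"].numpy())
print(ragged_content.to_list())
```

The context and the feature lists are described and returned separately, which mirrors the two-part structure of the protobuf.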

## Preprocessing the Input Features

**Theoretical Explanation:**

It's crucial to preprocess your features (e.g., scale numerical features, encode categorical features) before feeding them to your neural network. Instead of doing this *before* training, you can do it *inside* the model by creating preprocessing layers.

**Advantages:**
1.  **Simplicity:** The model expects raw data, simplifying your data pipeline.
2.  **Prevents Training-Serving Skew:** By bundling the preprocessing inside the model, you guarantee that the *exact same* preprocessing is applied to new data during inference as was applied to the training data. This is a common source of bugs.
3.  **Portability:** The saved model contains the preprocessing, making it easy to deploy to mobile, web, or servers.

You can create custom layers or use the standard Keras preprocessing layers.

In [23]:
# Example: Creating a custom Standardization layer
class Standardization(keras.layers.Layer):
    def adapt(self, data_sample):
        # Compute the mean and std dev from a data sample
        self.means_ = np.mean(data_sample, axis=0, keepdims=True)
        self.stds_ = np.std(data_sample, axis=0, keepdims=True)
    def call(self, inputs):
        return (inputs - self.means_) / (self.stds_ + keras.backend.epsilon())

In [25]:
std_layer = Standardization()
# We 'adapt' the layer by showing it a sample of the training data
std_layer.adapt(X_train)

# Scale the training and validation data explicitly for this example
X_train_scaled = std_layer(X_train)
X_valid_scaled = std_layer(X_valid)

# Now we can include this layer in a model
model = keras.Sequential()
model.add(std_layer)
model.add(keras.layers.Dense(30, activation="relu"))
model.add(keras.layers.Dense(1))

model.compile(loss="mse", optimizer="nadam")
model.fit(X_train_scaled, y_train, epochs=2,
          validation_data=(X_valid_scaled, y_valid))
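# Caution: std_layer is already the model's first layer, so fitting on
# X_train_scaled standardizes the inputs twice. In a real project, pass the
# *unscaled* X_train and X_valid to fit() and let the layer do the scaling.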

Epoch 1/2
[1m363/363[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 3ms/step - loss: 1.5447 - val_loss: 1.2801
Epoch 2/2
[1m363/363[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step - loss: 1.0632 - val_loss: 0.9866


<keras.src.callbacks.history.History at 0x791054e3f3e0>

### Encoding Categorical Features Using One-Hot Vectors

To encode categorical strings, we can use a lookup table to convert them to integer IDs, then use `tf.one_hot` to encode those IDs.

In [None]:
# 1. Define the vocabulary
vocab = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]
indices = tf.range(len(vocab), dtype=tf.int64)

# 2. Create the lookup table
table_init = tf.lookup.KeyValueTensorInitializer(vocab, indices)
num_oov_buckets = 2 # For out-of-vocabulary (oov) categories
table = tf.lookup.StaticVocabularyTable(table_init, num_oov_buckets)

# 3. Test the table
categories = tf.constant(["NEAR BAY", "DESERT", "INLAND", "INLAND"])
cat_indices = table.lookup(categories)
print("Indices:", cat_indices)

# 4. One-hot encode the indices
cat_one_hot = tf.one_hot(cat_indices, depth=len(vocab) + num_oov_buckets)
print("One-hot:", cat_one_hot)

### Encoding Categorical Features Using Embeddings

**Theoretical Explanation:**

One-hot encoding is inefficient for large vocabularies (e.g., 50,000+ words). A more efficient approach is to use **embeddings**.

An embedding is a **trainable**, dense vector that represents a category. For example, a word might be represented by a 128-dimensional vector instead of a 50,000-dimensional one-hot vector.

Initially, these vectors are random. During training, the network learns to place similar categories close to each other in the *embedding space*. This is a form of **representation learning**.

In [None]:
# The embedding layer does two things:
# 1. It stores the embedding matrix (initialized randomly).
# 2. When given category indices, it looks up the corresponding embedding vectors.

embedding_dim = 2
embedding = keras.layers.Embedding(input_dim=len(vocab) + num_oov_buckets,
                                   output_dim=embedding_dim)

print(embedding(cat_indices))
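
Conceptually, an embedding lookup is equivalent to multiplying a one-hot vector by the embedding matrix, just without materializing the one-hot vector. The NumPy sketch below illustrates this equivalence with a random matrix and hypothetical category IDs; it is not how Keras implements the layer internally.

```python
import numpy as np

rng = np.random.default_rng(42)
n_categories, n_dims = 7, 2          # e.g., 5 known categories + 2 OOV buckets
embedding_matrix = rng.normal(size=(n_categories, n_dims))

cat_ids = np.array([3, 5, 1, 1])     # hypothetical category indices

# Option 1: direct row lookup (what an embedding layer does conceptually).
looked_up = embedding_matrix[cat_ids]

# Option 2: one-hot encode, then multiply by the embedding matrix.
one_hot = np.eye(n_categories)[cat_ids]
via_matmul = one_hot @ embedding_matrix

print(looked_up.shape)                     # (4, 2)
print(np.allclose(looked_up, via_matmul))  # True
```

The lookup touches only 4 rows, while the matrix product multiplies a 4 x 7 matrix that is almost entirely zeros, and this gap grows with the vocabulary size.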

In [None]:
# Example of a Keras model using both numerical and embedding inputs
# This requires the Functional API

regular_inputs = keras.layers.Input(shape=[8]) # Assumes 8 numerical features
categories = keras.layers.Input(shape=[], dtype=tf.string)

# Preprocessing part
cat_indices = keras.layers.Lambda(lambda cats: table.lookup(cats))(categories)
cat_embed = keras.layers.Embedding(input_dim=len(vocab) + num_oov_buckets,
                                   output_dim=embedding_dim)(cat_indices)

# Concatenate numerical features with categorical embeddings
encoded_inputs = keras.layers.concatenate([regular_inputs, cat_embed])

# Regular part of the model
outputs = keras.layers.Dense(1)(encoded_inputs)
model = keras.models.Model(inputs=[regular_inputs, categories],
                             outputs=[outputs])

### Keras Preprocessing Layers

Keras provides a set of standard preprocessing layers in `keras.layers` that work like our custom `Standardization` layer: you `adapt()` them to a data sample, then include them in your model.

Examples include:
* `keras.layers.Normalization`
* `keras.layers.TextVectorization` (for tokenizing, indexing, and bag-of-words)
* `keras.layers.Discretization` (for binning continuous data)
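
As a short sketch of the adapt-then-use workflow, the following applies a `Normalization` layer to synthetic data. It assumes a TensorFlow version that ships this layer under `tf.keras.layers`; the data and variable names are made up for this illustration.

```python
import numpy as np
import tensorflow as tf

# Synthetic data: two features with different means and scales.
rng = np.random.default_rng(42)
data = rng.normal(loc=[10., -5.], scale=[3., 0.5], size=(1000, 2)).astype(np.float32)

norm_layer = tf.keras.layers.Normalization()
norm_layer.adapt(data)  # computes the per-feature mean and variance

normalized = norm_layer(data).numpy()
print(normalized.mean(axis=0))  # close to [0, 0]
print(normalized.std(axis=0))   # close to [1, 1]
```

Once adapted, the layer can be added as the first layer of a model, so the model accepts raw, unscaled features.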

## TF Transform

**Theoretical Explanation:**

A major challenge in production is **training-serving skew**. This happens when the preprocessing you do in your training pipeline (e.g., in Apache Beam) is slightly different from the preprocessing you do in your live application (e.g., in Java or JavaScript).

`TF Transform` (part of TFX) solves this. You define your preprocessing function *once* using `tft` ops.
1.  This function is run in batch over your entire training set using a tool like Apache Beam. This computes all necessary statistics (like mean, std, vocabulary) and preprocesses the data for fast training.
2.  It also generates an equivalent **TensorFlow graph** (a `TF Function`) that includes the *computed statistics* (e.g., the mean and std are now constants in the graph).

You can then include this graph as the first layer of your model, guaranteeing that the exact same preprocessing logic is applied everywhere.

In [None]:
try:
    import tensorflow_transform as tft

    def preprocess(inputs):  # inputs = a batch of input features
        median_age = inputs["housing_median_age"]
        ocean_proximity = inputs["ocean_proximity"]
        standardized_age = tft.scale_to_z_score(median_age)
        ocean_proximity_id = tft.compute_and_apply_vocabulary(ocean_proximity)
        return {
            "standardized_median_age": standardized_age,
            "ocean_proximity_id": ocean_proximity_id
        }
except ImportError:
    print("TF Transform is not installed. Skipping this code block.")

## The TensorFlow Datasets (TFDS) Project

**Theoretical Explanation:**

TFDS is a library that provides a simple way to download and access many common datasets (e.g., MNIST, ImageNet, `imdb_reviews`). It downloads the data, caches it, and returns `tf.data.Dataset` objects, ready to be used.

In [None]:
import tensorflow_datasets as tfds

# Load MNIST dataset
dataset = tfds.load(name="mnist")
mnist_train, mnist_test = dataset["train"], dataset["test"]

In [None]:
# The dataset items are dictionaries
for item in mnist_train.take(1):
    print(item["image"].shape)
    print(item["label"])

In [None]:
# You can load it directly as a (features, label) tuple for Keras
dataset = tfds.load(name="mnist", batch_size=32, as_supervised=True)
mnist_train = dataset["train"].prefetch(1)

# This dataset can be passed directly to fit()
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28, 1]),
    keras.layers.Lambda(lambda x: x / 255.),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10, activation="softmax")
])

model.compile(loss="sparse_categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])
model.fit(mnist_train, epochs=2)

## Exercises

See Appendix A in the book.