# Chapter 12: Custom Models and Training with TensorFlow

**Reference:** Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (Aurélien Géron)

---

## 1. Chapter Introduction

Up until now, we've used only TensorFlow's high-level API, `tf.keras`, but it already got us pretty far: we built various neural network architectures, including regression and classification nets, Wide & Deep nets, and self-normalizing nets, using all sorts of techniques, such as Batch Normalization, dropout, and learning rate schedules. In fact, 95% of the use cases you will encounter will not require anything other than `tf.keras` (and `tf.data`; see Chapter 13). But now it's time to dive deeper into TensorFlow and take a look at its lower-level Python API. This will be useful when you need extra control to write custom loss functions, custom metrics, layers, models, initializers, regularizers, weight constraints, and more. You may even need to fully control the training loop itself, for example to apply special transformations or constraints to the gradients (beyond just clipping them) or to use multiple optimizers for different parts of the network. We will cover all these cases in this chapter, and we will also look at how you can boost your custom models and training algorithms using TensorFlow's automatic graph generation feature.

## 2. A Quick Tour of TensorFlow

TensorFlow is a powerful library for numerical computation, particularly well suited and fine-tuned for large-scale Machine Learning. Here is a summary of what it offers:
* Its core is very similar to NumPy, but with support for GPU acceleration.
* It supports distributed computing (across multiple devices and servers).
* It includes a Just-In-Time (JIT) compiler that allows it to optimize computations for speed and memory usage. It works by extracting the computation graph from a Python function, then optimizing it (e.g., by pruning unused nodes), and finally running it efficiently (e.g., by automatically running independent operations in parallel).
* Computation graphs can be exported to a portable format, so you can train a model in one environment (e.g., on Linux) and run it in another (e.g., on Android).
* It implements autodiff (automatic differentiation) and provides some excellent optimizers, such as RMSProp and Nadam, so you can easily minimize all sorts of loss functions.

At the lowest level, each TensorFlow operation (op for short) is implemented using highly efficient C++ code. Many operations have multiple implementations called kernels: each kernel is dedicated to a specific device type, such as CPUs, GPUs, or even TPUs (Tensor Processing Units).

## 3. Using TensorFlow like NumPy

TensorFlow’s API revolves around **tensors**, which flow from operation to operation, hence the name TensorFlow. A tensor is usually a multidimensional array (much like a NumPy `ndarray`), but it can also hold a scalar (a simple value, such as 42). These tensors will be important when we create custom cost functions, custom metrics, custom layers, and more, so let’s see how to create and manipulate them.

### Tensors and Operations

You can create a tensor with `tf.constant()`. For example, here is a tensor representing a matrix with two rows and three columns of floats:

In [None]:
import tensorflow as tf
import numpy as np

t = tf.constant([[1., 2., 3.], [4., 5., 6.]])
print("Tensor:\n", t)
print("Shape:", t.shape)
print("Dtype:", t.dtype)

Just like an `ndarray`, a `tf.Tensor` has a shape and a data type (`dtype`). Indexing works much like in NumPy:

In [None]:
print("Indexing t[:, 1:]:\n", t[:, 1:])
print("Indexing t[..., 1, tf.newaxis]:\n", t[..., 1, tf.newaxis])

Most importantly, all sorts of tensor operations are available:

In [None]:
print("t + 10:\n", t + 10)
print("tf.square(t):\n", tf.square(t))
print("t @ tf.transpose(t):\n", t @ tf.transpose(t))  # Matrix multiplication

### Tensors and NumPy

Tensors play nice with NumPy: you can create a tensor from a NumPy array, and vice versa. You can even apply TensorFlow operations to NumPy arrays and NumPy operations to tensors:

In [None]:
a = np.array([2., 4., 5.])
print("tf.constant(a):\n", tf.constant(a))
print("t.numpy():\n", t.numpy())
print("tf.square(a):\n", tf.square(a))
print("np.square(t):\n", np.square(t))

### Type Conversions

Type conversions can significantly hurt performance, and they can easily go unnoticed when they are done automatically. To avoid this, TensorFlow does not perform any type conversions automatically: it just raises an exception if you try to execute an operation on tensors with incompatible types. For example, you cannot add a float tensor and an integer tensor, and you cannot even add a 32-bit float and a 64-bit float:

In [None]:
try:
    tf.constant(2.0) + tf.constant(40)
except tf.errors.InvalidArgumentError as ex:
    print(ex)

You can use `tf.cast()` when you really need to convert types:

In [None]:
t2 = tf.constant(40)
print(tf.constant(2.0) + tf.cast(t2, tf.float32))
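The integer example above has a floating-point counterpart: a 32-bit float and a 64-bit float cannot be added either. The following is a minimal sketch of that case, assuming eager execution (the TensorFlow 2 default); the names `f32`, `f64`, and `mixed_add_failed` are illustrative and not part of the original text.

```python
import tensorflow as tf

f32 = tf.constant(2.0)                     # float32 by default
f64 = tf.constant(40.0, dtype=tf.float64)  # explicitly 64-bit

try:
    f32 + f64                              # dtypes differ, so this raises
    mixed_add_failed = False
except tf.errors.InvalidArgumentError as ex:
    mixed_add_failed = True
    print(ex)

# An explicit cast makes the addition legal
total = f32 + tf.cast(f64, tf.float32)
print(total)
```

Casting down to `tf.float32` is only one option; casting `f32` up to `tf.float64` would work too, at the cost of more memory.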

### Variables

The `tf.Tensor` values we’ve seen so far are immutable: you cannot modify them. This means that we cannot use regular tensors to implement weights in a neural network, since they need to be tweaked by backpropagation. Plus, other parameters may also need to change over time (e.g., a momentum in a momentum optimizer). What we need is a `tf.Variable`:

In [None]:
v = tf.Variable([[1., 2., 3.], [4., 5., 6.]])
print("Variable:\n", v)

A `tf.Variable` acts much like a `tf.Tensor`: you can perform the same operations with it, it plays nicely with NumPy as well, and it is just as picky with types. But it can also be modified in place using the `assign()` method (or `assign_add()` or `assign_sub()`, which increment or decrement the variable by the given value). You can also modify individual cells (or slices) of the variable, by using the cell’s (or slice’s) `assign()` method (direct assignment will not work) or by using the `scatter_update()` or `scatter_nd_update()` methods:

In [None]:
v.assign(2 * v)
print("After assign(2*v):\n", v.numpy())
v[0, 1].assign(42)
print("After slice assign:\n", v.numpy())
v[:, 2].assign([0., 1.])
print("After column assign:\n", v.numpy())
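The paragraph above also mentions that direct assignment does not work and that `scatter_nd_update()` can change several cells at once. Here is a small sketch of both points, using a fresh variable so that it runs on its own; the flag `item_assignment_failed` is just an illustrative name.

```python
import tensorflow as tf

v = tf.Variable([[1., 2., 3.], [4., 5., 6.]])

# Direct item assignment is not supported on variables
try:
    v[0, 1] = 42.
    item_assignment_failed = False
except TypeError as ex:
    item_assignment_failed = True
    print(ex)

# Update cells (0, 0) and (1, 2) in a single call
v.scatter_nd_update(indices=[[0, 0], [1, 2]], updates=[100., 200.])
print(v.numpy())
```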

## 4. Custom Loss Functions

Suppose you want to train a regression model, but your training set is a bit noisy. Of course, you start by trying to clean up your dataset by removing or fixing the outliers, but that turns out to be insufficient; the dataset is still noisy. Which loss function should you use? The Mean Squared Error might penalize large errors too much and cause your model to be imprecise. The Mean Absolute Error would not penalize outliers as much, but training might take a while to converge, and the trained model might not be very precise. This is probably a good time to use the **Huber loss** instead.

**Equation 12-1: Huber Loss**
$$ L_{\delta}(y, f(x)) = \begin{cases} \frac{1}{2}(y - f(x))^2 & \text{for } |y - f(x)| \le \delta \\ \delta (|y - f(x)| - \frac{1}{2}\delta) & \text{otherwise} \end{cases} $$

The Huber loss is available in `tf.keras` (just use an instance of the `keras.losses.Huber` class). But let’s pretend it’s not there. To implement it yourself, create a function that takes the labels and predictions as arguments, and use TensorFlow operations to compute every instance’s loss. The version below hardcodes a threshold of $\delta = 1$:

In [None]:
def huber_fn(y_true, y_pred):
    error = y_true - y_pred
    is_small_error = tf.abs(error) < 1
    squared_loss = tf.square(error) / 2
    linear_loss = tf.abs(error) - 0.5
    return tf.where(is_small_error, squared_loss, linear_loss)

For better performance, you should use a vectorized implementation, as in the example above. Moreover, if you want to benefit from TensorFlow’s graph features, you should use TensorFlow operations exclusively.
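As a sanity check on Equation 12-1, it can help to evaluate the formula on a few hand-picked errors. The sketch below uses plain NumPy rather than TensorFlow, and `huber_np` is a hypothetical helper written for this check only. Note that the two branches agree when the absolute error equals $\delta$, so it does not matter whether the comparison is strict (as in `huber_fn`) or not (as in the equation).

```python
import numpy as np

def huber_np(error, delta=1.0):
    """Equation 12-1 written with NumPy; `error` stands for y - f(x)."""
    abs_error = np.abs(error)
    quadratic = 0.5 * error ** 2
    linear = delta * (abs_error - 0.5 * delta)
    return np.where(abs_error <= delta, quadratic, linear)

errors = np.array([0.5, 1.0, 3.0])
losses_delta1 = huber_np(errors, delta=1.0)  # quadratic, boundary, linear
losses_delta2 = huber_np(errors, delta=2.0)  # quadratic, quadratic, linear
print(losses_delta1)
print(losses_delta2)
```

With $\delta = 1$ the error of 3.0 costs 2.5 instead of the 4.5 that a pure squared loss would charge, which is the reduced sensitivity to outliers that motivates the Huber loss.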

Now you can use this loss function when you compile the model, then train your model:

In [None]:
from tensorflow import keras
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Prepare data
housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(
    housing.data, housing.target.reshape(-1, 1), random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)
X_test_scaled = scaler.transform(X_test)

model = keras.models.Sequential([
    keras.layers.Dense(30, activation="relu", input_shape=[8]),
    keras.layers.Dense(1),
])

model.compile(loss=huber_fn, optimizer="nadam")
model.fit(X_train_scaled, y_train, epochs=2, validation_data=(X_valid_scaled, y_valid))

### Saving and Loading Models with Custom Objects

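When you save a model that uses a custom object such as `huber_fn`, Keras saves only the object's name. When you load the model, you must therefore map that name back to the actual object by passing a dictionary to the `custom_objects` argument of `keras.models.load_model()`, for example `custom_objects={"huber_fn": huber_fn}`.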

But what if you want to configure the threshold? One solution is to create a function that creates a configured loss function. However, when you save the model, the `threshold` will not be saved. This means that you will have to specify the `threshold` value when loading the model. A better solution is to create a subclass of the `keras.losses.Loss` class and implement its `get_config()` method:

In [None]:
class HuberLoss(keras.losses.Loss):
    def __init__(self, threshold=1.0, **kwargs):
        self.threshold = threshold
        super().__init__(**kwargs)
    def call(self, y_true, y_pred):
        error = y_true - y_pred
        is_small_error = tf.abs(error) < self.threshold
        squared_loss = tf.square(error) / 2
        linear_loss = self.threshold * tf.abs(error) - self.threshold**2 / 2
        return tf.where(is_small_error, squared_loss, linear_loss)
    def get_config(self):
        base_config = super().get_config()
        return {**base_config, "threshold": self.threshold}

# Usage
model.compile(loss=HuberLoss(2.), optimizer="nadam")
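To see what `get_config()` provides, the sketch below round-trips a `HuberLoss` through its configuration dictionary, which is roughly what happens when a model is saved and reloaded. The class is repeated so the block runs on its own; the sample labels and predictions are made up for illustration.

```python
import tensorflow as tf
from tensorflow import keras

class HuberLoss(keras.losses.Loss):
    def __init__(self, threshold=1.0, **kwargs):
        self.threshold = threshold
        super().__init__(**kwargs)
    def call(self, y_true, y_pred):
        error = y_true - y_pred
        is_small_error = tf.abs(error) < self.threshold
        squared_loss = tf.square(error) / 2
        linear_loss = self.threshold * tf.abs(error) - self.threshold**2 / 2
        return tf.where(is_small_error, squared_loss, linear_loss)
    def get_config(self):
        base_config = super().get_config()
        return {**base_config, "threshold": self.threshold}

loss = HuberLoss(2.0)
config = loss.get_config()                # plain dict, includes "threshold"
restored = HuberLoss.from_config(config)  # rebuilds an equivalent loss

# Errors of -1 and -3 with threshold 2: 0.5 (quadratic) and 4.0 (linear)
y_true = tf.constant([[0.], [0.]])
y_pred = tf.constant([[1.], [3.]])
value = float(restored(y_true, y_pred))   # mean of the two instance losses
print(config["threshold"], value)
```

Because the threshold travels inside the config, only the class itself has to be supplied when loading; the threshold value does not.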

## 5. Custom Metrics

Losses and metrics are conceptually not the same thing: losses (e.g., cross entropy) are used by Gradient Descent to *train* a model, so they must be differentiable (at least where they are evaluated), and their gradients should not be 0 everywhere. Plus, it’s okay if they are not easily interpretable by humans. In contrast, metrics (e.g., accuracy) are used to *evaluate* a model: they must be more easily interpretable, and they can be non-differentiable or have 0 gradients everywhere.

That said, in most cases, defining a custom metric function is exactly the same as defining a custom loss function. In fact, we could even use the Huber loss function we created earlier as a metric.

### Streaming Metrics (Stateful Metrics)

Some metrics, like accuracy, can be computed by averaging the scores of each batch. However, some metrics cannot be computed this way, such as precision. If you compute the precision of the first batch (e.g., 80%) and the precision of the second batch (e.g., 40%), the overall precision is not necessarily the average (60%). It depends on the number of positive predictions in each batch.

To compute such metrics, we need an object that can keep track of the number of true positives and false positives as it sees new batches. This is called a *stateful metric*.
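The bookkeeping a stateful precision metric has to do can be sketched in a few lines of NumPy. The two batches below are made-up labels and binary predictions, chosen so that the naive mean of per-batch precisions differs from the true overall precision; none of the variable names come from the original text.

```python
import numpy as np

# Two batches of (labels, binary predictions); the values are illustrative
batches = [
    (np.array([0, 1, 1, 1, 0, 1, 0, 1]), np.array([1, 1, 0, 1, 0, 1, 0, 1])),
    (np.array([0, 1, 0, 0, 1, 0, 1, 1]), np.array([1, 0, 1, 1, 0, 0, 0, 0])),
]

true_positives = 0    # state carried across batches
false_positives = 0
per_batch_precision = []

for y_true, y_pred in batches:
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    per_batch_precision.append(tp / (tp + fp))
    true_positives += tp
    false_positives += fp

naive_mean = float(np.mean(per_batch_precision))
streaming_precision = true_positives / (true_positives + false_positives)
print(per_batch_precision, naive_mean, streaming_precision)
```

Here the first batch has 4 correct positives out of 5 and the second has 0 out of 3, so averaging the batch scores gives 0.4 while the precision over all 8 positive predictions is 0.5. Keeping running counts is what makes the second number available.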

Here is how to implement a simple `HuberMetric` class that keeps track of the total Huber loss and the number of instances seen so far. When asked for the result, it returns the ratio of the two, which is the mean Huber loss over all instances seen.

In [None]:
class HuberMetric(keras.metrics.Metric):
    def __init__(self, threshold=1.0, **kwargs):
        super().__init__(**kwargs) # handles base args (e.g., dtype)
        self.threshold = threshold
        self.huber_fn = create_huber(threshold)  # create_huber() is defined just below the class
        # State variables, accumulated across batches
        self.total = self.add_weight(name="total", initializer="zeros")
        self.count = self.add_weight(name="count", initializer="zeros")
    def update_state(self, y_true, y_pred, sample_weight=None):
        metric = self.huber_fn(y_true, y_pred)
        self.total.assign_add(tf.reduce_sum(metric))
        self.count.assign_add(tf.cast(tf.size(y_true), tf.float32))
    def result(self):
        return self.total / self.count
    def get_config(self):
        base_config = super().get_config()
        return {**base_config, "threshold": self.threshold}

def create_huber(threshold=1.0):
    def huber_fn(y_true, y_pred):
        error = y_true - y_pred
        is_small_error = tf.abs(error) < threshold
        squared_loss = tf.square(error) / 2
        linear_loss = threshold * tf.abs(error) - threshold**2 / 2
        return tf.where(is_small_error, squared_loss, linear_loss)
    return huber_fn

## 6. Custom Layers

You may occasionally want to build an architecture that contains an exotic layer for which TensorFlow does not provide a default implementation. In this case, you will need to create a custom layer. Or you may simply want to build a very repetitive architecture, containing identical blocks of layers repeated many times, and it would be convenient to treat each block of layers as a single layer.

To create a custom layer, you need to subclass the `keras.layers.Layer` class and implement the following methods:
1.  `__init__()`: Save hyperparameters.
2.  `build()`: Create the layer's variables (weights and biases). This method is called the first time the layer is used, so it knows the input shape.
3.  `call()`: Perform the desired operations. This is where the forward pass logic goes.
4.  `compute_output_shape()`: Return the shape of the outputs (optional; `tf.keras` can usually infer it).
5.  `get_config()`: Return the hyperparameters so the layer can be saved and reloaded.
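The lazy behavior of `build()` in step 2 can be observed with a very small layer. This is only a sketch: `ScaleLayer` is a hypothetical layer invented for the demonstration, and the input shape is arbitrary.

```python
import tensorflow as tf
from tensorflow import keras

class ScaleLayer(keras.layers.Layer):  # hypothetical layer, for illustration only
    def build(self, batch_input_shape):
        # One trainable scale factor per input feature
        self.scale = self.add_weight(
            name="scale", shape=[batch_input_shape[-1]], initializer="ones")
        super().build(batch_input_shape)

    def call(self, X):
        return X * self.scale

layer = ScaleLayer()
built_before = layer.built           # no input seen yet
n_weights_before = len(layer.weights)

outputs = layer(tf.ones([4, 3]))     # first call triggers build()
built_after = layer.built
print(built_before, built_after, layer.scale.shape, outputs.shape)
```

Until the first call the layer has no weights at all, because it cannot know how many input features to expect. That is why weight creation belongs in `build()` rather than in the constructor.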

Here is how to create a simplified version of the `Dense` layer:

In [None]:
class MyDense(keras.layers.Layer):
    def __init__(self, units, activation=None, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.activation = keras.activations.get(activation)

    def build(self, batch_input_shape):
        # Create a trainable weight variable for the kernel (weights matrix)
        self.kernel = self.add_weight(
            name="kernel", shape=[batch_input_shape[-1], self.units],
            initializer="glorot_normal")
        # Create a trainable bias variable
        self.bias = self.add_weight(
            name="bias", shape=[self.units], initializer="zeros")
        super().build(batch_input_shape) # must be at the end

    def call(self, X):
        # The forward pass computation
        return self.activation(X @ self.kernel + self.bias)

    def compute_output_shape(self, batch_input_shape):
        return tf.TensorShape(batch_input_shape.as_list()[:-1] + [self.units])

    def get_config(self):
        base_config = super().get_config()
        return {**base_config, "units": self.units,
                "activation": keras.activations.serialize(self.activation)}

## 7. Custom Models

We have already seen how to create custom models using the Subclassing API (in Chapter 10). It is straightforward: subclass the `keras.Model` class, create layers and variables in the constructor, and implement the `call()` method to do whatever you want the model to do.

Suppose you want to build a model with a custom `ResidualBlock` layer (containing two dense layers and an addition operation) repeated multiple times. Here is how you could implement it:

In [None]:
class ResidualBlock(keras.layers.Layer):
    def __init__(self, n_layers, n_neurons, **kwargs):
        super().__init__(**kwargs)
        self.hidden = [keras.layers.Dense(n_neurons, activation="elu",
                                          kernel_initializer="he_normal")
                       for _ in range(n_layers)]

    def call(self, inputs):
        Z = inputs
        for layer in self.hidden:
            Z = layer(Z)
        return inputs + Z

class ResidualRegressor(keras.Model):
    def __init__(self, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.hidden1 = keras.layers.Dense(30, activation="elu",
                                          kernel_initializer="he_normal")
        self.block1 = ResidualBlock(2, 30)
        self.block2 = ResidualBlock(2, 30)
        self.out = keras.layers.Dense(output_dim)

    def call(self, inputs):
        Z = self.hidden1(inputs)
        for _ in range(1 + 3):  # apply block1 four times, reusing the same weights
            Z = self.block1(Z)
        Z = self.block2(Z)
        return self.out(Z)

# Create model
model = ResidualRegressor(1)
model.compile(loss="mse", optimizer="nadam")
history = model.fit(X_train_scaled, y_train, epochs=2)
model.save("my_custom_model") # Note: Saving custom models requires using the SavedModel format (default in TF2)

## 8. Losses and Metrics Based on Model Internals

Earlier we defined custom losses and metrics based on the labels and the predictions (and optionally sample weights). There will be times when you want to define losses based on other parts of your model, such as the weights or the activations of its hidden layers. This may be useful for regularization purposes or to monitor some internal aspect of your model.

To define a custom loss based on model internals, compute it based on any part of the model you want, then pass the result to the `add_loss()` method. For example, let’s build a custom regression MLP model composed of a stack of five hidden layers plus an output layer. This custom model will also have an auxiliary output on top of the upper hidden layer. The loss associated with this auxiliary output will be called the *reconstruction loss*: it is the mean squared difference between the reconstruction and the inputs. The idea is to encourage the model to preserve as much information as possible through the hidden layers, so that it can reconstruct the inputs.

In [None]:
class ReconstructingRegressor(keras.Model):
    def __init__(self, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.hidden = [keras.layers.Dense(30, activation="selu",
                                          kernel_initializer="lecun_normal")
                       for _ in range(5)]
        self.out = keras.layers.Dense(output_dim)
        # Streaming metric that tracks the mean reconstruction error
        # (the reconstruction layer itself is created in build())
        self.reconstruction_mean = keras.metrics.Mean(name="reconstruction_error")

    def build(self, batch_input_shape):
        n_inputs = batch_input_shape[-1]
        self.reconstruct = keras.layers.Dense(n_inputs)
        super().build(batch_input_shape)

    def call(self, inputs, training=None):
        Z = inputs
        for layer in self.hidden:
            Z = layer(Z)
        reconstruction = self.reconstruct(Z)
        recon_loss = tf.reduce_mean(tf.square(reconstruction - inputs))
        self.add_loss(0.05 * recon_loss)
        if training:
            result = self.reconstruction_mean(recon_loss)
            self.add_metric(result)
        return self.out(Z)

## 9. Computing Gradients with Autodiff

To understand how to write custom training loops, you first need to know how to compute gradients automatically. We use the `tf.GradientTape` context manager.

In [None]:
def f(w1, w2):
    return 3 * w1 ** 2 + 2 * w1 * w2

w1, w2 = tf.Variable(5.), tf.Variable(3.)
with tf.GradientTape() as tape:
    z = f(w1, w2)

gradients = tape.gradient(z, [w1, w2])
print("Gradients:", gradients)

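The tape automatically records every operation that involves a variable. You can then ask it for the gradients of the result `z` with regard to both variables `[w1, w2]`. Here the exact values are $\partial z / \partial w_1 = 6 w_1 + 2 w_2 = 36$ and $\partial z / \partial w_2 = 2 w_1 = 10$, which is what the tape returns.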
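One way to double-check gradients like these without autodiff is a finite-difference approximation: nudge one parameter by a tiny amount and measure how much the output changes. The pure-Python sketch below does this for the same `f`; the step size `eps` is an arbitrary choice.

```python
def f(w1, w2):
    return 3 * w1 ** 2 + 2 * w1 * w2

w1, w2 = 5.0, 3.0
eps = 1e-6

# Forward finite differences: one extra call to f() per parameter
approx_dw1 = (f(w1 + eps, w2) - f(w1, w2)) / eps
approx_dw2 = (f(w1, w2 + eps) - f(w1, w2)) / eps

# Analytical partial derivatives for comparison
exact_dw1 = 6 * w1 + 2 * w2
exact_dw2 = 2 * w1
print(approx_dw1, approx_dw2)
```

The approximation is close but not exact, and it needs at least one extra call to `f()` per parameter, which does not scale to networks with many parameters. Autodiff avoids both problems.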

## 10. Custom Training Loops

In some rare cases, the `fit()` method may not be flexible enough for what you need to do. For example, the Wide & Deep paper uses two different optimizers: one for the wide path and a different one for the deep path. Since the `fit()` method only uses one optimizer (the one that we specify when compiling the model), implementing this paper requires writing your own custom training loop.

Building a custom training loop involves:
1.  Iterating over the dataset.
2.  Running the forward pass inside a `GradientTape`.
3.  Computing the loss.
4.  Computing gradients.
5.  Using an optimizer to update the weights.

In [None]:
l2_reg = keras.regularizers.l2(0.05)
model = keras.models.Sequential([
    keras.layers.Dense(30, activation="elu", kernel_initializer="he_normal",
                       kernel_regularizer=l2_reg),
    keras.layers.Dense(1, kernel_regularizer=l2_reg)
])

def random_batch(X, y, batch_size=32):
    idx = np.random.randint(len(X), size=batch_size)
    return X[idx], y[idx]

def print_status_bar(iteration, total, loss, metrics=None):
    metrics = " - ".join(["{}: {:.4f}".format(m.name, m.result()) for m in [loss] + (metrics or [])])
    end = "" if iteration < total else "\n"
    print("\r{}/{} - ".format(iteration, total) + metrics, end=end)

n_epochs = 5
batch_size = 32
n_steps = len(X_train) // batch_size
optimizer = keras.optimizers.Nadam(learning_rate=0.01)
loss_fn = keras.losses.MeanSquaredError()
mean_loss = keras.metrics.Mean()
metrics = [keras.metrics.MeanAbsoluteError()]

for epoch in range(n_epochs):
    print("Epoch {}/{}".format(epoch + 1, n_epochs))
    for step in range(1, n_steps + 1):
        X_batch, y_batch = random_batch(X_train_scaled, y_train, batch_size)
        with tf.GradientTape() as tape:
            y_pred = model(X_batch, training=True)
            main_loss = loss_fn(y_batch, y_pred)
            loss = main_loss + tf.add_n(model.losses)
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        for variable in model.variables:
            if variable.constraint is not None:
                variable.assign(variable.constraint(variable))
        mean_loss(loss)
        for metric in metrics:
            metric(y_batch, y_pred)
        print_status_bar(step, n_steps, mean_loss, metrics)
    for metric in [mean_loss] + metrics:
        metric.reset_states()  # named reset_state() in more recent TensorFlow releases
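The two-optimizer case mentioned at the start of this section can be sketched as a single training step. This is not the Wide & Deep paper's setup: the layers `wide`, `deep_hidden`, and `deep_out`, the choice of SGD and Nadam, and the random data are all assumptions made to keep the example small.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Hypothetical two-path model: the layer names are illustrative
wide = keras.layers.Dense(1)
deep_hidden = keras.layers.Dense(8, activation="relu")
deep_out = keras.layers.Dense(1)

def forward(X):
    return wide(X) + deep_out(deep_hidden(X))

wide_optimizer = keras.optimizers.SGD(learning_rate=0.01)
deep_optimizer = keras.optimizers.Nadam(learning_rate=0.01)
loss_fn = keras.losses.MeanSquaredError()

rng = np.random.default_rng(42)
X_batch = rng.normal(size=(32, 4)).astype(np.float32)
y_batch = rng.normal(size=(32, 1)).astype(np.float32)

with tf.GradientTape() as tape:
    loss = loss_fn(y_batch, forward(X_batch))  # layers are built here

wide_vars = wide.trainable_variables
deep_vars = deep_hidden.trainable_variables + deep_out.trainable_variables
wide_bias_before = wide_vars[1].numpy().copy()
deep_bias_before = deep_vars[3].numpy().copy()

# One backward pass, then each optimizer updates only its own variables
gradients = tape.gradient(loss, wide_vars + deep_vars)
wide_grads = gradients[:len(wide_vars)]
deep_grads = gradients[len(wide_vars):]
wide_optimizer.apply_gradients(zip(wide_grads, wide_vars))
deep_optimizer.apply_gradients(zip(deep_grads, deep_vars))
```

Computing all gradients in one `tape.gradient()` call and then splitting the list avoids needing a persistent tape. In a full loop this step would sit where the single `optimizer.apply_gradients()` call sits in the loop above.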

## 11. TensorFlow Functions and Graphs

In TensorFlow 2, eager execution is enabled by default, which makes debugging easier but can be slower. To boost performance, you can convert your Python functions into TensorFlow Graphs using the `@tf.function` decorator.

When you call a function decorated with `@tf.function`, TensorFlow traces the function (executes it once) to generate a computation graph. Subsequent calls with input tensors of the same shape and type will run the optimized graph instead of the Python code.

In [None]:
@tf.function
def cube(x):
    print("Tracing...") # This side-effect only happens during tracing
    return x ** 3

print("Result 1:", cube(tf.constant(2.0)))
print("Result 2:", cube(tf.constant(3.0)))
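The rule about reusing a graph for inputs of the same shape and type can be made visible by counting how often the Python body actually runs. In the sketch below, `trace_log` is an illustrative list that is appended to only during tracing; the final two calls also show that plain Python numbers, unlike tensors, trigger a new trace for every distinct value.

```python
import tensorflow as tf

trace_log = []   # appended to only while TensorFlow is tracing

@tf.function
def cube(x):
    trace_log.append("traced")   # Python side effect: runs during tracing only
    return x ** 3

cube(tf.constant(2.0))
cube(tf.constant(3.0))            # same dtype and shape: graph is reused
traces_after_scalars = len(trace_log)

cube(tf.constant([1.0, 2.0]))     # new shape: a new graph is traced
traces_after_vector = len(trace_log)

cube(2)
cube(3)                           # distinct Python values: one trace each
traces_after_python_ints = len(trace_log)
print(traces_after_scalars, traces_after_vector, traces_after_python_ints)
```

The counters come out as 1, 2, and 4. Passing tensors rather than Python numbers is therefore the safer habit when a function is called with many different values.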