# Chapter 19: Training and Deploying TensorFlow Models at Scale

**Reference:** Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (Aurélien Géron)

---

## 1. Chapter Introduction

Once you have a beautiful model that makes amazing predictions, what do you do with it? Well, you need to put it in production! This could be as simple as running the model on a batch of data and perhaps writing a script that runs this model every night. However, it is often much more involved. Various parts of your infrastructure may need to use this model on live data, in which case you probably want to wrap your model in a web service: this way, any part of your infrastructure can query your model at any time using a simple REST API (or some other protocol), as we discussed in Chapter 2. But as time passes, you need to regularly retrain your model on fresh data and push the updated version to production. You must handle model versioning, gracefully transition from one model to the next, possibly roll back to the previous model in case of problems, and perhaps run multiple different models in parallel to perform A/B experiments. If your product becomes successful, your service may start to get plenty of queries per second (QPS), and it must scale up to support the load. A great solution to scale up your service, as we will see in this chapter, is to use TF Serving, either on your own hardware infrastructure or via a cloud service such as Google Cloud AI Platform.

But deployment is not the only challenge. Training large models on huge datasets can take days or even weeks. To speed this up, you can use distributed training strategies to spread the workload across multiple GPUs or TPUs.

## 2. Serving a TensorFlow Model

### Saving and Exporting

Before we deploy, we need a trained model. Let's train a simple MNIST classifier.

```python
import tensorflow as tf
from tensorflow import keras
import numpy as np
import os

# Load MNIST
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.mnist.load_data()
X_train_full = X_train_full[..., np.newaxis].astype("float32") / 255.
X_test = X_test[..., np.newaxis].astype("float32") / 255.
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

# Build Model
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28, 1]),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-2),
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=3, validation_data=(X_valid, y_valid))

# Save Model in SavedModel format
model_version = "0001"
model_name = "my_mnist_model"
model_path = os.path.join(model_name, model_version)
tf.saved_model.save(model, model_path)

print(f"Model saved to {model_path}")
```

A SavedModel is a directory that contains:
* `saved_model.pb`: The computation graph definition.
* `variables/`: The model weights.
* `assets/`: Extra files (e.g., vocabulary files).

### TF Serving

TensorFlow Serving is a high-performance serving system for machine learning models, designed for production environments. It manages model versions (new versions are loaded automatically) and serves requests efficiently over REST or gRPC.

To use it, you typically run it inside a Docker container. The following is a shell command, not Python code: it mounts the model directory into the container and publishes port 8500 (gRPC) and port 8501 (REST).

```bash
docker run -it --rm -p 8500:8500 -p 8501:8501 \
    -v "/path/to/my_mnist_model:/models/my_mnist_model" \
    -e MODEL_NAME=my_mnist_model \
    tensorflow/serving
```
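
The directory mounted above holds one numbered subdirectory per model version (here `0001`). By default TF Serving serves the highest-numbered version it finds. The sketch below imitates that selection rule in plain Python on a throwaway directory tree; `latest_version_dir` is a hypothetical helper written for illustration, not part of TF Serving.

```python
import os
import tempfile


def latest_version_dir(model_dir):
    """Return the path of the highest-numbered version subdirectory, or None."""
    versions = [
        name for name in os.listdir(model_dir)
        if name.isdigit() and os.path.isdir(os.path.join(model_dir, name))
    ]
    if not versions:
        return None
    # Compare as integers so that "0010" ranks above "0002".
    newest = max(versions, key=int)
    return os.path.join(model_dir, newest)


# Demonstration on empty directories (no real model files are needed).
with tempfile.TemporaryDirectory() as root:
    model_dir = os.path.join(root, "my_mnist_model")
    for version in ("0001", "0002", "0010", "notes"):
        os.makedirs(os.path.join(model_dir, version))
    chosen = os.path.basename(latest_version_dir(model_dir))
    print(chosen)  # 0010
```

Exporting a retrained model to `my_mnist_model/0002` would therefore be enough for a running server to pick it up, and deleting that directory is one way to roll back.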

### Querying via REST API

Once the server is running (listening on port 8501 for REST), we can send HTTP POST requests.

```python
import json
import requests

# Prepare data payload (must be JSON serializable)
input_data_json = json.dumps({
    "signature_name": "serving_default",
    "instances": X_test[:3].tolist()  # Convert numpy array to list
})

# In a real scenario with Docker running, you would do this:
# SERVER_URL = "http://localhost:8501/v1/models/my_mnist_model:predict"
# response = requests.post(SERVER_URL, data=input_data_json)
# response.raise_for_status()
# response = response.json()
# y_proba = np.array(response["predictions"])
```
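
To see the request and response shapes without a running server, the following sketch builds a payload by hand and parses a hand-written reply. The tiny inputs and the `fake_response_text` string are made up for illustration; only the `signature_name`, `instances`, and `predictions` keys come from the REST format used above.

```python
import json

# Two tiny 2x2 "images" stand in for real MNIST inputs (illustrative values).
instances = [[[0.0, 0.5], [1.0, 0.25]], [[0.1, 0.2], [0.3, 0.4]]]
request_body = json.dumps({
    "signature_name": "serving_default",
    "instances": instances,
})

# Hand-written stand-in for the server's reply: one list of class
# probabilities per instance, under the "predictions" key.
fake_response_text = json.dumps({
    "predictions": [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]]
})

predictions = json.loads(fake_response_text)["predictions"]

# Index of the largest probability in each row (argmax without NumPy).
predicted_classes = [max(range(len(row)), key=row.__getitem__) for row in predictions]
print(predicted_classes)  # [1, 0]
```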

## 3. Deploying to Mobile or Embedded Devices (TFLite)

For edge devices (phones, IoT boards), a full TensorFlow model is often too heavy: the file is large, and memory and compute are limited. TFLite converts the model into a compact format optimized for size and latency.

### Converting to TFLite

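We use `tf.lite.TFLiteConverter`. We can also apply **quantization** (storing weights as float16 or int8 instead of float32) to shrink the model: float16 roughly halves its size and int8 cuts it to roughly a quarter, usually at some cost in accuracy.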

```python
# Convert the SavedModel
converter = tf.lite.TFLiteConverter.from_saved_model(model_path)

# Optimization: Float16 Quantization (reduces size by 2x, minimal accuracy loss)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]

tflite_model = converter.convert()

# Save the .tflite file
with open("my_mnist_model.tflite", "wb") as f:
    f.write(tflite_model)

print("TFLite model saved.")
```
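
To make the size savings from quantization concrete, here is a NumPy-only sketch applied to a random weight matrix shaped like the first dense layer (784 x 100). The int8 part uses a generic affine (scale and zero-point) mapping chosen for illustration; it is an assumption, not necessarily the exact scheme the TFLite converter applies.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.1, size=(784, 100)).astype(np.float32)

# float16: a plain cast, half the bytes per weight.
weights_fp16 = weights.astype(np.float16)

# int8: map the range [w_min, w_max] onto the 256 integer levels [-128, 127].
w_min, w_max = float(weights.min()), float(weights.max())
scale = (w_max - w_min) / 255.0
zero_point = int(round(-128 - w_min / scale))
weights_int8 = np.clip(np.round(weights / scale) + zero_point, -128, 127).astype(np.int8)

# Dequantize to measure how much precision the round trip loses.
restored = (weights_int8.astype(np.float32) - zero_point) * scale
max_error = float(np.abs(weights - restored).max())

print(weights.nbytes, weights_fp16.nbytes, weights_int8.nbytes)  # 313600 156800 78400
print(f"max round-trip error: {max_error:.6f} (one int8 step is {scale:.6f})")
```

The round-trip error stays on the order of one quantization step, which is why small classifiers like this one usually lose little accuracy.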

### Using the Interpreter

We can test the `.tflite` model in Python using the Interpreter.

```python
interpreter = tf.lite.Interpreter(model_path="my_mnist_model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Run inference on one image
input_data = X_test[0:1] # shape (1, 28, 28, 1)
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
output_data = interpreter.get_tensor(output_details[0]['index'])

print("TFLite Prediction Class:", np.argmax(output_data))
```

## 4. Running on the Browser (TensorFlow.js)

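You can also convert a model to run directly in a web browser with TensorFlow.js; the `tensorflowjs` Python package provides the converter. Running in the browser keeps data on the user's device, which helps privacy, and avoids a network round trip to a server, which lowers latency.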

Command line utility:
```bash
tensorflowjs_converter --input_format=tf_saved_model \
    ./my_mnist_model/0001 ./my_tfjs_model
```
This produces a `model.json` and binary shard files.

## 5. Distributed Training

When the model or the dataset is too large to train in a reasonable time on a single device, we spread the work across several devices, and possibly several machines.

### Data Parallelism vs. Model Parallelism
* **Data Parallelism:** Replicate the model on every GPU and split each batch into equal shards, one per GPU. Each replica computes gradients on its shard independently; the gradients are then aggregated (summed or averaged) and the same update is applied to every replica, so the copies stay identical.
* **Model Parallelism:** Split the model itself across GPUs (e.g., layers 1-5 on GPU0, layers 6-10 on GPU1). This is harder to make efficient because activations and gradients must cross devices at each split, which adds communication overhead.
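
The data-parallel recipe can be checked numerically. The sketch below is a NumPy simulation with no real GPUs involved: a linear model with a mean-squared-error loss stands in for the network, and `mse_gradient` is a helper defined here for illustration. It shows that averaging the gradients from equal-sized shards reproduces the gradient of the whole batch.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(8, 3))   # global batch: 8 examples, 3 features
y = rng.normal(size=8)
w = np.zeros(3)               # identical weights on every simulated device


def mse_gradient(X_part, y_part, w):
    """Gradient of the mean squared error of a linear model on one shard."""
    errors = X_part @ w - y_part
    return 2.0 * X_part.T @ errors / len(y_part)


num_replicas = 4
X_shards = np.split(X, num_replicas)
y_shards = np.split(y, num_replicas)

# Each replica computes a gradient on its own shard, independently.
replica_grads = [mse_gradient(Xs, ys, w) for Xs, ys in zip(X_shards, y_shards)]

# Aggregation: average across replicas (valid because shards have equal size).
aggregated = np.mean(replica_grads, axis=0)

full_batch = mse_gradient(X, y, w)
print(np.allclose(aggregated, full_batch))  # True

# Every replica applies the same update, so the copies stay identical.
learning_rate = 0.1
w_new = w - learning_rate * aggregated
```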

### Mirrored Strategy

`MirroredStrategy` is the standard approach for single-machine, multi-GPU training. It keeps a copy of every variable on each GPU and uses an **AllReduce** operation to aggregate gradients across GPUs so that all copies stay in sync.

```python
# To use distributed training, wrap model creation and compilation in the strategy scope.
# With no arguments, MirroredStrategy uses all GPUs visible to TensorFlow.
# With no GPU, it falls back to a single replica on the CPU.

strategy = tf.distribute.MirroredStrategy()

print(f"Number of devices: {strategy.num_replicas_in_sync}")

with strategy.scope():
    model_dist = keras.models.Sequential([
        keras.layers.Flatten(input_shape=[28, 28, 1]),
        keras.layers.Dense(100, activation="relu"),
        keras.layers.Dense(10, activation="softmax")
    ])
    model_dist.compile(loss="sparse_categorical_crossentropy",
                       optimizer=keras.optimizers.SGD(learning_rate=1e-2),
                       metrics=["accuracy"])

# Training looks exactly the same
batch_size = 100  # Global batch size, split across replicas; increase it as you add GPUs
model_dist.fit(X_train, y_train, epochs=2, batch_size=batch_size)
```
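
One common way to implement AllReduce is to arrange the devices in a ring. The sketch below simulates that idea with NumPy arrays standing in for per-GPU gradients. It is an illustrative assumption about how such an algorithm can work, not a description of what TensorFlow or its communication backend does internally; `ring_all_reduce` is a name made up for this example.

```python
import numpy as np


def ring_all_reduce(vectors):
    """Simulate a ring all-reduce (sum) over one vector per worker."""
    n = len(vectors)
    # chunks[i][c] is worker i's private copy of chunk c.
    chunks = [
        [part.copy() for part in np.array_split(np.asarray(v, dtype=np.float64), n)]
        for v in vectors
    ]

    # Phase 1 (reduce-scatter): at each step every worker passes one chunk to
    # its right-hand neighbour, which adds it to its own copy. After n - 1
    # steps, worker i holds the complete sum for chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, chunks[i][(i - step) % n].copy()) for i in range(n)]
        for sender, c, payload in sends:
            chunks[(sender + 1) % n][c] += payload

    # Phase 2 (all-gather): the completed chunks travel around the ring,
    # overwriting the partial sums held by the other workers.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n].copy()) for i in range(n)]
        for sender, c, payload in sends:
            chunks[(sender + 1) % n][c] = payload

    return [np.concatenate(worker) for worker in chunks]


# Three simulated workers, each holding its own gradient vector.
grads = [
    np.array([1.0, 2.0, 3.0, 4.0]),
    np.array([10.0, 20.0, 30.0, 40.0]),
    np.array([100.0, 200.0, 300.0, 400.0]),
]
results = ring_all_reduce(grads)
print(results[0].tolist())  # [111.0, 222.0, 333.0, 444.0]
```

Each worker only ever exchanges one chunk with its neighbour per step, so the data sent by each worker stays roughly constant as more workers join the ring.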

### Other Strategies

* **MultiWorkerMirroredStrategy:** Similar to MirroredStrategy, but synchronizes replicas across multiple machines (workers).
* **TPUStrategy:** For Google's TPUs, hardware accelerators specialized for the large matrix multiplications that dominate deep learning workloads.