**Chapter 10 – Introduction to Artificial Neural Networks with Keras**

_This notebook contains all the sample code and solutions to the exercises in chapter 10._

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/ageron/handson-ml3/blob/main/10_neural_nets_with_keras.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td>
    <a target="_blank" href="https://kaggle.com/kernels/welcome?src=https://github.com/ageron/handson-ml3/blob/main/10_neural_nets_with_keras.ipynb"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" /></a>
  </td>
</table>

# Setup

This project requires Python 3.7 or above:

In [None]:
import sys

assert sys.version_info >= (3, 7)

It also requires Scikit-Learn ≥ 1.0.1:

In [None]:
from packaging import version
import sklearn

assert version.parse(sklearn.__version__) >= version.parse("1.0.1")

And TensorFlow ≥ 2.8:

In [None]:
import tensorflow as tf

assert version.parse(tf.__version__) >= version.parse("2.8.0")

As we did in previous chapters, let's define the default font sizes to make the figures prettier:

In [None]:
import matplotlib.pyplot as plt

plt.rc('font', size=14)
plt.rc('axes', labelsize=14, titlesize=14)
plt.rc('legend', fontsize=14)
plt.rc('xtick', labelsize=10)
plt.rc('ytick', labelsize=10)

And let's create the `images/ann` folder (if it doesn't already exist), and define the `save_fig()` function, which is used throughout this notebook to save the figures in high resolution for the book:

In [None]:
from pathlib import Path

IMAGES_PATH = Path() / "images" / "ann"
IMAGES_PATH.mkdir(parents=True, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = IMAGES_PATH / f"{fig_id}.{fig_extension}"
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

# From Biological to Artificial Neurons
## The Perceptron

In [None]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris(as_frame=True)
X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = (iris.target == 0)  # Iris setosa

per_clf = Perceptron(random_state=42)
per_clf.fit(X, y)

X_new = [[2, 0.5], [3, 1]]
y_pred = per_clf.predict(X_new)  # predicts True and False for these 2 flowers

In [None]:
y_pred


<details>
<summary><b>AI Understanding – Perceptron Classifier</b></summary>

## ✅ **AI Understanding Template – Perceptron Classifier**

## **1. What is it?**

A **single-layer linear classifier** that finds a straight line separating two classes, using simple error-driven weight updates.

In this notebook's code, it classifies **Iris setosa vs. not-setosa** using two features (petal length and petal width).

---

## **2. How does it reason?**

The perceptron learns a **weight vector** and **bias**:

> **prediction = sign(w·x + b)**

It adjusts weights whenever it misclassifies a sample.
It basically **pushes the decision boundary** until it can separate the classes.

---

## **3. Where does it fail?**

* If classes are **not linearly separable** (perceptron will never converge).
* Sensitive to **feature scaling**.
* No probability outputs (only True/False).
* Cannot model complex patterns: it learns only a **single straight line** (a hyperplane in higher dimensions).

---

## **4. When should I use it?**

Use it when:

* Data is **simple, linear, binary**.
* You want a **very fast, explainable** model.
* You want to understand the basics of neural networks.

Not ideal for real-world complex datasets.

---

## **5. What is the mental model?**

Think of it as:

> **A yes/no switch that draws one straight boundary to split two groups.**

It tries to push wrong predictions to the correct side by nudging weights.

---

## **6. How do I prompt it?**

("Prompt" here means how to give it input and use it.)

* Give numeric feature vectors to `fit()`:
  `per_clf.fit(X, y)`
* Pass new samples to `predict()`:
  `per_clf.predict([[2, 0.5]])`
* In this example the labels are **binary** (True/False).

The defaults work for this example; the main hyperparameters to adjust are `max_iter`, `eta0`, and `random_state`.

---

## **7. What are alternatives?**

Better linear/binary classifiers:

* **Logistic Regression**
* **Linear SVM**
* **`SGDClassifier(loss="log_loss")`**

Better for non-linear patterns:

* **Kernel SVM**
* **Random Forest**
* **Neural Networks**

---

## **Code Explanation (Short & Clear)**

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron
```

Imports Iris dataset and Perceptron model.

---

### **Load dataset**

```python
iris = load_iris(as_frame=True)
```

Loads Iris as a `Bunch` object whose `data` and `target` attributes are pandas objects (a `DataFrame` and a `Series`).

---

### **Select two features**

```python
X = iris.data[["petal length (cm)", "petal width (cm)"]].values
```

Uses only petal length and petal width, giving a 2D input.

---

### **Create binary labels**

```python
y = (iris.target == 0)  # Iris setosa
```

True if class = Setosa, False otherwise.

---

### **Train perceptron**

```python
per_clf = Perceptron(random_state=42)
per_clf.fit(X, y)
```

Learns a linear boundary separating Setosa vs non-Setosa.

---

### **Predict on new samples**

```python
X_new = [[2, 0.5], [3, 1]]
y_pred = per_clf.predict(X_new)
```

Returns:

* **True** → predicted Setosa
* **False** → predicted non-Setosa

</details>

The `Perceptron` is equivalent to a `SGDClassifier` with `loss="perceptron"`, no regularization, and a constant learning rate equal to 1:

In [None]:
# extra code – shows how to build and train a Perceptron

from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(loss="perceptron", penalty=None,
                        learning_rate="constant", eta0=1, random_state=42)
sgd_clf.fit(X, y)
assert (sgd_clf.coef_ == per_clf.coef_).all()
assert (sgd_clf.intercept_ == per_clf.intercept_).all()

When the Perceptron finds a decision boundary that properly separates the classes, it stops learning. This means that the decision boundary is often quite close to one class:

In [None]:
# extra code – plots the decision boundary of a Perceptron on the iris dataset

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

a = -per_clf.coef_[0, 0] / per_clf.coef_[0, 1]
b = -per_clf.intercept_ / per_clf.coef_[0, 1]
axes = [0, 5, 0, 2]
x0, x1 = np.meshgrid(
    np.linspace(axes[0], axes[1], 500).reshape(-1, 1),
    np.linspace(axes[2], axes[3], 200).reshape(-1, 1),
)
X_new = np.c_[x0.ravel(), x1.ravel()]
y_predict = per_clf.predict(X_new)
zz = y_predict.reshape(x0.shape)
custom_cmap = ListedColormap(['#9898ff', '#fafab0'])

plt.figure(figsize=(7, 3))
plt.plot(X[y == 0, 0], X[y == 0, 1], "bs", label="Not Iris setosa")
plt.plot(X[y == 1, 0], X[y == 1, 1], "yo", label="Iris setosa")
plt.plot([axes[0], axes[1]], [a * axes[0] + b, a * axes[1] + b], "k-",
         linewidth=3)
plt.contourf(x0, x1, zz, cmap=custom_cmap)
plt.xlabel("Petal length")
plt.ylabel("Petal width")
plt.legend(loc="lower right")
plt.axis(axes)
plt.show()

<details>
<summary><b>AI Understanding - Perceptron-2 </b></summary>

## ✅ **AI Understanding Template: Perceptron (using SGDClassifier)**

## **1. What is it?**

A **linear binary classifier** that learns a separating line (hyperplane) by adjusting weights whenever it misclassifies a sample.

With `penalty=None`, `learning_rate="constant"`, and `eta0=1`, **`SGDClassifier(loss="perceptron")` is equivalent to the classic Perceptron algorithm.**

---

## **2. How does it reason?**

Each training sample updates weights:

> **If the prediction is wrong → move the decision boundary toward the correct class.**
> If it is correct → no update.

Mathematically, with labels encoded as `y ∈ {-1, +1}`:
`w_new = w_old + learning_rate * (y * x)` and `b_new = b_old + learning_rate * y` for each misclassified sample.

It learns by **incremental weight nudges** based on errors.
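
The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not Scikit-Learn's implementation: the toy data, the function names `train_perceptron` and `predict`, and the `max_epochs` cap are all assumptions made for the example, and labels are encoded as -1/+1.

```python
# Toy, linearly separable data (hypothetical values, not the Iris measurements).
X_toy = [[1.0, 0.2], [1.5, 0.3], [4.5, 1.5], [5.0, 1.8]]
y_toy = [+1, +1, -1, -1]  # +1 = "setosa-like", -1 = everything else


def train_perceptron(X, y, learning_rate=1.0, max_epochs=100):
    """Classic perceptron rule: update the weights only on misclassified samples."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(max_epochs):
        n_errors = 0
        for x_i, y_i in zip(X, y):
            score = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b
            if y_i * score <= 0:  # wrong side of (or exactly on) the boundary
                w = [w_j + learning_rate * y_i * x_j for w_j, x_j in zip(w, x_i)]
                b += learning_rate * y_i
                n_errors += 1
        if n_errors == 0:  # a full pass without mistakes: stop learning
            break
    return w, b


def predict(w, b, x):
    score = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return 1 if score > 0 else -1


w_toy, b_toy = train_perceptron(X_toy, y_toy)
toy_predictions = [predict(w_toy, b_toy, x) for x in X_toy]
print(w_toy, b_toy, toy_predictions)
```

Because the loop stops as soon as one full pass produces no mistakes, the final boundary is simply the first one that happens to separate the training points.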

---

## **3. Where does it fail?**

* If data is **not linearly separable**, it keeps oscillating and never converges.
* Sensitive to **feature scaling**.
* Performs poorly with **noise** or **overlapping classes**.
* Only good for **binary classification** (multi-class via one-vs-all).

---

## **4. When should I use it?**

* When you need an **extremely fast** linear classifier.
* For **online learning** (updates as new data arrives).
* When dataset is **large** and you want **incremental training**.

---

## **5. What is the mental model?**

Think of it as:

> **A line that keeps pivoting every time it makes a mistake.**

Like a student adjusting their answer after every error.

---

## **6. How do I prompt it?**

(Meaning: how to use it effectively.)

* Use **scaled features** (`StandardScaler`).
* Ensure **binary labels** (0/1 or -1/+1).
* Tune these:

  * `learning_rate`,
  * `eta0`,
  * `max_iter`,
  * `penalty` (if regularization needed).
* Provide **shuffled data** (important for SGD).

---

## **7. What are alternatives?**

Better linear models:

* **Logistic Regression**: probabilistic, stable
* **Linear SVM**: maximizes the margin, which often generalizes better
* **`SGDClassifier(loss="log_loss")`**: SGD + logistic regression (the loss was named `"log"` in older Scikit-Learn releases)
* **Perceptron()** (Scikit-Learn's dedicated class)

For non-linear:

* **Random Forest**, **XGBoost**, **Neural Networks**

---

## **Code Explanation (Short + Clear)**

### **1. Create Perceptron via SGD**

```python
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(
    loss="perceptron",   # tells SGD to mimic the Perceptron update rule
    penalty=None,        # no regularization
    learning_rate="constant",
    eta0=1,              # fixed step size
    random_state=42
)
sgd_clf.fit(X, y)
```

### **2. Verify equality with another perceptron**

```python
assert (sgd_clf.coef_ == per_clf.coef_).all()
assert (sgd_clf.intercept_ == per_clf.intercept_).all()
```

Meaning:
Both models learned **the exact same weights**.

---

## **Plotting the Decision Boundary**

### **3. Compute line equation from weights**

Perceptron boundary:
`w1*x1 + w2*x2 + b = 0`

Solve for `x2`:

```python
a = -per_clf.coef_[0, 0] / per_clf.coef_[0, 1]  # slope
b = -per_clf.intercept_ / per_clf.coef_[0, 1]   # intercept
```
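
As a quick check of this algebra, the sketch below plugs in made-up weights (`w1`, `w2`, and `bias` are hypothetical values, not the ones learned by `per_clf`) and confirms that points on the resulting line have a decision score of zero. It assumes `w2` is non-zero, since the line is solved for `x2`.

```python
# Hypothetical weights and bias, chosen only to illustrate the algebra.
w1, w2, bias = -1.0, -0.8, 2.0

slope = -w1 / w2        # plays the role of `a` in the notebook code
intercept = -bias / w2  # plays the role of `b` in the notebook code


def boundary_x2(x1):
    """x2 coordinate of the decision boundary at a given x1."""
    return slope * x1 + intercept


# Every point on the line should give a decision score of (almost) zero.
scores_on_line = [w1 * x1 + w2 * boundary_x2(x1) + bias for x1 in (0.0, 2.5, 5.0)]
print(slope, intercept, scores_on_line)
```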

---

### **4. Create a grid to visualize predictions**

```python
x0, x1 = np.meshgrid(
    np.linspace(0, 5, 500).reshape(-1, 1),   # petal length
    np.linspace(0, 2, 200).reshape(-1, 1)    # petal width
)
X_new = np.c_[x0.ravel(), x1.ravel()]
y_predict = per_clf.predict(X_new)
zz = y_predict.reshape(x0.shape)
```

This creates a 500 × 200 grid (100,000 points) and predicts the class of each point, which gives the colored background regions.
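
To see how the grid is flattened into rows of `(x0, x1)` pairs, here is a small sketch with a 3 × 2 grid alongside the full-size one. The names `x0_small`, `x1_small`, `grid_points`, and `n_points` are introduced only for this illustration.

```python
import numpy as np

# A tiny 3 x 2 grid instead of the 500 x 200 grid used in the notebook.
x0_small, x1_small = np.meshgrid(np.linspace(0, 5, 3), np.linspace(0, 2, 2))
grid_points = np.c_[x0_small.ravel(), x1_small.ravel()]  # one (x0, x1) row per point
print(grid_points)

# The full-size grid: 200 rows (petal width) by 500 columns (petal length).
x0_full, x1_full = np.meshgrid(np.linspace(0, 5, 500), np.linspace(0, 2, 200))
n_points = x0_full.size
print(x0_full.shape, n_points)
```

Reshaping the predictions back to `x0.shape` restores the 2D layout that `contourf()` expects.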

---

### **5. Plot data, decision boundary, and background**

```python
plt.plot(X[y == 0, 0], X[y == 0, 1], "bs", label="Not Iris setosa")
plt.plot(X[y == 1, 0], X[y == 1, 1], "yo", label="Iris setosa")
plt.plot([0,5], [a*0 + b, a*5 + b], "k-", linewidth=3)
plt.contourf(x0, x1, zz, cmap=custom_cmap)
```

* Blue squares = class 0
* Yellow circles = class 1
* Black line = decision boundary
* Background shades = predicted regions

</details>

**Activation functions**

In [None]:
# extra code – this cell generates and saves Figure 10–8

from scipy.special import expit as sigmoid

def relu(z):
    return np.maximum(0, z)

def derivative(f, z, eps=0.000001):
    return (f(z + eps) - f(z - eps))/(2 * eps)

max_z = 4.5
z = np.linspace(-max_z, max_z, 200)

plt.figure(figsize=(11, 3.1))

plt.subplot(121)
plt.plot([-max_z, 0], [0, 0], "r-", linewidth=2, label="Heaviside")
plt.plot(z, relu(z), "m-.", linewidth=2, label="ReLU")
plt.plot([0, 0], [0, 1], "r-", linewidth=0.5)
plt.plot([0, max_z], [1, 1], "r-", linewidth=2)
plt.plot(z, sigmoid(z), "g--", linewidth=2, label="Sigmoid")
plt.plot(z, np.tanh(z), "b-", linewidth=1, label="Tanh")
plt.grid(True)
plt.title("Activation functions")
plt.axis([-max_z, max_z, -1.65, 2.4])
plt.gca().set_yticks([-1, 0, 1, 2])
plt.legend(loc="lower right", fontsize=13)

plt.subplot(122)
plt.plot(z, derivative(np.sign, z), "r-", linewidth=2, label="Heaviside")
plt.plot(0, 0, "ro", markersize=5)
plt.plot(0, 0, "rx", markersize=10)
plt.plot(z, derivative(sigmoid, z), "g--", linewidth=2, label="Sigmoid")
plt.plot(z, derivative(np.tanh, z), "b-", linewidth=1, label="Tanh")
plt.plot([-max_z, 0], [0, 0], "m-.", linewidth=2)
plt.plot([0, max_z], [1, 1], "m-.", linewidth=2)
plt.plot([0, 0], [0, 1], "m-.", linewidth=1.2)
plt.plot(0, 1, "mo", markersize=5)
plt.plot(0, 1, "mx", markersize=10)
plt.grid(True)
plt.title("Derivatives")
plt.axis([-max_z, max_z, -0.2, 1.2])

save_fig("activation_functions_plot")
plt.show()

<details>
<summary><b>AI Understanding - Activation Functions </b></summary>

## ✅ **AI Understanding Template: Activation Functions Code**

## **1. What is it?**

A visualization script that plots:

* Activation functions (ReLU, Sigmoid, Tanh, Heaviside)
* Their numerical derivatives

Used to understand how neural networks **activate** and **backpropagate**.

---

## **2. How does it reason?**

* Each activation transforms an input **z** into an output
* Derivatives show how gradients flow
* Smooth functions (sigmoid/tanh) → smooth gradients
* ReLU → piecewise linear
* Heaviside → step with no useful gradients

The derivative function uses **finite differences** to approximate the true derivative.

---

## **3. Where does it fail?**

* The numerical derivative is unreliable at the Heaviside jump and at the ReLU kink (both at z = 0)
* Sigmoid saturates → vanishing gradients
* Tanh also saturates
* ReLU has zero gradient for negative z, so a neuron stuck there stops learning (the "dying ReLU" problem)
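
The saturation effect is easy to see numerically. The sketch below uses the standard analytic derivatives (rather than the finite-difference helper from the notebook); the function names `sigmoid_grad`, `tanh_grad`, and `relu_grad` are introduced here for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # analytic derivative of the sigmoid


def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2  # analytic derivative of tanh


def relu_grad(z):
    return 1.0 if z > 0 else 0.0  # convention used here: gradient 0 at z = 0


for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z={z:5.1f}  sigmoid'={sigmoid_grad(z):.2e}  tanh'={tanh_grad(z):.2e}")

grad_at_zero = sigmoid_grad(0.0)     # the largest value the sigmoid gradient takes
grad_saturated = sigmoid_grad(10.0)  # tiny: the unit has saturated
```

The sigmoid gradient never exceeds 0.25 and shrinks rapidly as |z| grows, which is why stacking many saturating layers can starve the early layers of learning signal.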

---

## **4. When should I use it?**

Use when learning or demonstrating:

* Activation behavior
* Vanishing gradient problem
* Why modern networks prefer ReLU family
* Backprop intuition

---

## **5. Mental model**

Think of activations as:

> **"Gatekeepers that decide how much signal passes to the next layer."**

Derivatives = how much learning signal (gradient) flows backward.

---

## **6. How do I prompt it?**

(Not LLM prompting; this is *code usage guidance*.)
You "prompt" the functions by:

* Passing a vector `z`
* Plotting outputs
* Using `derivative(func, z)` to approximate slopes
* Tweaking activation choices to see differences

---

## **7. Alternatives**

For activations:

* **Leaky ReLU**
* **ELU / GELU**
* **Swish / Mish**

For derivatives:

* Analytical derivatives (instead of finite difference)
* Autograd / TensorFlow / PyTorch automatic differentiation

---

## **Code Explanation (Short + Clear)**

### **1. Import + activation definitions**

```python
from scipy.special import expit as sigmoid

def relu(z):
    return np.maximum(0, z)
```

* `sigmoid` imported from SciPy
* `relu` implemented manually

---

### **2. Numerical derivative function**

```python
def derivative(f, z, eps=0.000001):
    return (f(z + eps) - f(z - eps))/(2 * eps)
```

* Central difference formula
* Approximates f′(z)
* Works for any function `f` that accepts NumPy arrays and is smooth around `z`
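
A quick way to gain confidence in the central-difference formula is to compare it with a derivative that is known in closed form. The sketch below does this for the sigmoid at a single point, using a scalar version of `derivative()`; the point `z0 = 0.7` is arbitrary.

```python
import math


def derivative(f, z, eps=0.000001):
    # Same central-difference formula as in the notebook, applied to a scalar.
    return (f(z + eps) - f(z - eps)) / (2 * eps)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


z0 = 0.7  # arbitrary evaluation point
numeric = derivative(sigmoid, z0)
analytic = sigmoid(z0) * (1.0 - sigmoid(z0))  # closed-form sigmoid derivative
abs_error = abs(numeric - analytic)
print(numeric, analytic, abs_error)
```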

---

### **3. Prepare input range**

```python
max_z = 4.5
z = np.linspace(-max_z, max_z, 200)
```

* 200 points from -4.5 to +4.5
* Used for smooth plots

---

### **4. Plot activation functions**

```python
plt.subplot(121)
plt.plot(z, relu(z), "m-.", label="ReLU")
plt.plot(z, sigmoid(z), "g--", label="Sigmoid")
plt.plot(z, np.tanh(z), "b-", label="Tanh")
```

Also draws Heaviside step using simple line plots.

Purpose: visualize shape.

---

### **5. Plot derivatives**

```python
plt.subplot(122)
plt.plot(z, derivative(sigmoid, z), "g--", label="Sigmoid")
plt.plot(z, derivative(np.tanh, z), "b-", label="Tanh")
plt.plot(z, derivative(np.sign, z), "r-", label="Heaviside")
```

* Shows slopes
* Critical for understanding gradient flow

The ReLU derivative is drawn manually with straight line segments, because it is undefined at 0.
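
To see why the point z = 0 needs special handling, the following sketch evaluates the central difference exactly at the ReLU kink and at the jump of the sign function. The scalar helpers `relu` and `sign` are stand-ins written for this example (the notebook uses the NumPy versions).

```python
def derivative(f, z, eps=0.000001):
    return (f(z + eps) - f(z - eps)) / (2 * eps)


def relu(z):
    return max(0.0, z)


def sign(z):
    return (z > 0) - (z < 0)  # -1, 0 or +1, like np.sign for scalars


relu_slope_at_zero = derivative(relu, 0.0)    # 0.5: halfway between the slopes 0 and 1
sign_slope_at_zero = derivative(sign, 0.0)    # roughly 1 / eps: an artificial spike
sign_slope_elsewhere = derivative(sign, 0.5)  # 0.0: flat away from the jump
print(relu_slope_at_zero, sign_slope_at_zero, sign_slope_elsewhere)
```

In the notebook's plot, the 200 evenly spaced points are symmetric about zero, so the grid never contains z = 0 itself and the numerical derivative of `np.sign` comes out as 0 everywhere on it; the markers at the origin are added by hand.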

---

### **6. Decoration**

Grid, axis limits, legends, labels, titles.

---

### **7. Save and show**

```python
save_fig("activation_functions_plot")
plt.show()
```

Saves high-quality image + displays it.


</details>

## Regression MLPs

**Warning**: In recent versions of Scikit-Learn, you must use `root_mean_squared_error()` to compute the RMSE, instead of `mean_squared_error(labels, predictions, squared=False)`. The following `try`/`except` block tries to import `root_mean_squared_error`, and if it fails it just defines it.

In [None]:
try:
    from sklearn.metrics import root_mean_squared_error
except ImportError:
    from sklearn.metrics import mean_squared_error

    def root_mean_squared_error(labels, predictions):
        return mean_squared_error(labels, predictions, squared=False)
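
For reference, the RMSE is simply the square root of the mean squared difference between labels and predictions. The sketch below computes it by hand on made-up numbers; `rmse_by_hand` and the toy lists are hypothetical names used only here, not part of Scikit-Learn.

```python
import math


def rmse_by_hand(labels, predictions):
    """Square root of the mean squared difference between labels and predictions."""
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values, for illustration only.
toy_labels = [3.0, 1.0, 2.0, 5.0]
toy_predictions = [2.0, 1.0, 4.0, 5.0]
toy_rmse = rmse_by_hand(toy_labels, toy_predictions)
print(toy_rmse)
```

Because the errors are squared before averaging, one large mistake (here the error of 2) weighs more than several small ones.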

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(
    housing.data, housing.target, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full, random_state=42)

mlp_reg = MLPRegressor(hidden_layer_sizes=[50, 50, 50], random_state=42)
pipeline = make_pipeline(StandardScaler(), mlp_reg)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_valid)
rmse = root_mean_squared_error(y_valid, y_pred)

In [None]:
rmse

## Classification MLPs

In [None]:
# extra code – this was left as an exercise for the reader

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

iris = load_iris()
X_train_full, X_test, y_train_full, y_test = train_test_split(
    iris.data, iris.target, test_size=0.1, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full, test_size=0.1, random_state=42)

mlp_clf = MLPClassifier(hidden_layer_sizes=[5], max_iter=10_000,
                        random_state=42)
pipeline = make_pipeline(StandardScaler(), mlp_clf)
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_valid, y_valid)
accuracy

<details>
<summary><b>AI Understanding - Classification MLPs </b></summary>

## ✅ **AI Understanding Template (for this MLPClassifier code)**

## **1. What is it?**

A small **feed-forward neural network classifier** (MLP) trained on the **Iris dataset** using scikit-learn.

---

## **2. How does it reason?**

* Standardizes inputs (via **StandardScaler**)
* Passes them through **one hidden layer of 5 neurons**
* Learns non-linear decision boundaries
* Uses backprop + gradient descent to minimize classification loss
* Outputs class probabilities for the 3 Iris species
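
The forward pass described above can be sketched in plain Python for a single sample. All the weights, biases, and the input vector below are made up for illustration (a trained `MLPClassifier` learns its own), and the sketch assumes a ReLU hidden layer followed by a softmax output.

```python
import math


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    shifted = [v - max(values) for v in values]  # subtract the max for numerical stability
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """One fully connected layer: out[j] = sum_i inputs[i] * weights[i][j] + biases[j]."""
    return [sum(x * row[j] for x, row in zip(inputs, weights)) + biases[j]
            for j in range(len(biases))]


# Hypothetical parameters: 4 inputs -> 5 hidden neurons -> 3 classes.
W_hidden = [[0.1 * (i - j) for j in range(5)] for i in range(4)]
b_hidden = [0.0, 0.1, 0.0, -0.1, 0.0]
W_out = [[0.2 * (j - i) for j in range(3)] for i in range(5)]
b_out = [0.0, 0.0, 0.0]

x_scaled = [0.5, -1.2, 0.3, 0.8]  # one already-standardized sample (made up)
hidden = relu(dense(x_scaled, W_hidden, b_hidden))
probabilities = softmax(dense(hidden, W_out, b_out))
print(hidden, probabilities)
```

The predicted class is the index of the largest probability; training adjusts the weights so that this index matches the true species as often as possible.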

---

## **3. Where does it fail?**

* Very small network → may underfit complex datasets
* Sensitive to feature scaling (handled here by `StandardScaler`)
* Doesn't capture spatial or sequential structure (no convolution, no attention)
* Not great for: images, text sequences, large tabular sets

---

## **4. When should I use it?**

Use MLPClassifier when:

* You have **small/medium tabular numeric data**
* Need a quick **baseline neural network**
* Problem is **multi-class classification**
* You want simple, fast, shallow neural nets (not deep architectures)

---

## **5. What is the mental model?**

Think of it as:

> **A stack of weighted linear layers + nonlinear activations that bend the space so classes become separable.**

It's the simplest "neural network brain" for classification tasks.

---

## **6. How do I prompt it?**

(How to use/train it effectively)

* Always **scale your features**
* Tune:

  * `hidden_layer_sizes`
  * `max_iter`
  * `learning_rate_init`
* Feed numeric tabular features only
* For small datasets, increase `max_iter`
* For complex patterns, add more layers

---

## **7. What are alternatives?**

| Method                     | When better                                  |
| -------------------------- | -------------------------------------------- |
| **LogisticRegression**     | Data mostly linear                           |
| **RandomForestClassifier** | Tabular data with non-linear relations       |
| **XGBoost/CatBoost**       | Best performance on structured data          |
| **SVM**                    | Small datasets with clear margins            |
| **Keras MLP**              | If you want deeper/more flexible neural nets |
| **Transformers/CNNs**      | For text/vision problems                     |

---

# **Code Explanation (Short & Clear)**

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
```

✔ Load dataset
✔ Import train/test splitting
✔ Import a neural network classifier

---

```python
iris = load_iris()
```

Loads classic 150-sample Iris dataset with 4 numeric features.

---

```python
X_train_full, X_test, y_train_full, y_test = train_test_split(
    iris.data, iris.target, test_size=0.1, random_state=42)
```

* 10% test data
* Remaining 90% kept for training/validation

---

```python
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full, test_size=0.1, random_state=42)
```

* Splits remaining 90% into:

  * 90% training
  * 10% validation

---

```python
mlp_clf = MLPClassifier(hidden_layer_sizes=[5], max_iter=10_000,
                        random_state=42)
```

Creates a neural network with:

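* 1 hidden layer of **5 neurons**
* Up to **10,000 epochs** (with the default `adam` solver, `max_iter` counts passes over the training set, not individual gradient steps)
* `random_state=42` fixes the random weight initialization and shuffling, for reproducibility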

---

```python
pipeline = make_pipeline(StandardScaler(), mlp_clf)
```

Builds a pipeline:

1. **Standardize features**
2. **Feed into neural network**

(Scaling is crucial for MLP to converge.)
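
What standardization does to a single feature can be sketched as follows. The helper `standardize` and the petal lengths are illustrative assumptions; the sketch uses the population standard deviation (dividing by n), which is also what `StandardScaler` does.

```python
import math


def standardize(column):
    """Scale one feature column to zero mean and unit variance."""
    mean = sum(column) / len(column)
    variance = sum((v - mean) ** 2 for v in column) / len(column)  # population variance
    std = math.sqrt(variance)
    return [(v - mean) / std for v in column]


petal_lengths = [1.4, 4.7, 5.1, 1.3, 6.0]  # made-up values in centimetres
scaled = standardize(petal_lengths)
scaled_mean = sum(scaled) / len(scaled)
scaled_var = sum(v ** 2 for v in scaled) / len(scaled)
print(scaled, scaled_mean, scaled_var)
```

Inside the pipeline, the mean and standard deviation are learned from the training set during `fit()` and then reused unchanged on the validation and test sets.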

---

```python
pipeline.fit(X_train, y_train)
```

Trains the neural network on training data.

---

```python
accuracy = pipeline.score(X_valid, y_valid)
accuracy
```

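Evaluates accuracy on the **validation set**; as the last expression in the cell, its value is displayed.

Expected accuracy: roughly **0.93–1.0** (Iris is simple, and the validation set is tiny, so a single mistake moves the score noticeably).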
</details>

# Implementing MLPs with Keras
## Building an Image Classifier Using the Sequential API
### Using Keras to load the dataset

Let's start by loading the Fashion MNIST dataset. Keras has a number of functions to load popular datasets in `tf.keras.datasets`. The dataset is already split for you between a training set (60,000 images) and a test set (10,000 images), but it can be useful to split the training set further to have a validation set. We'll use 55,000 images for training, and 5,000 for validation.

In [None]:
import tensorflow as tf

fashion_mnist = tf.keras.datasets.fashion_mnist.load_data()
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist
X_train, y_train = X_train_full[:-5000], y_train_full[:-5000]
X_valid, y_valid = X_train_full[-5000:], y_train_full[-5000:]

The training set now contains 55,000 grayscale images (5,000 of the original 60,000 were set aside for validation), each 28×28 pixels:

In [None]:
X_train.shape

Each pixel intensity is represented as a byte (0 to 255):

In [None]:
X_train.dtype

Let's scale the pixel intensities down to the 0-1 range and convert them to floats, by dividing by 255:

In [None]:
X_train, X_valid, X_test = X_train / 255., X_valid / 255., X_test / 255.

You can plot an image using Matplotlib's `imshow()` function, with a `'binary'` color map:

In [None]:
# extra code

plt.imshow(X_train[0], cmap="binary")
plt.axis('off')
plt.show()

The labels are the class IDs (represented as uint8), from 0 to 9:

In [None]:
y_train

Here are the corresponding class names:

In [None]:
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

So the first image in the training set is an ankle boot:

In [None]:
class_names[y_train[0]]

Let's take a look at a sample of the images in the dataset:

In [None]:
# extra code – this cell generates and saves Figure 10–10

n_rows = 4
n_cols = 10
plt.figure(figsize=(n_cols * 1.2, n_rows * 1.2))
for row in range(n_rows):
    for col in range(n_cols):
        index = n_cols * row + col
        plt.subplot(n_rows, n_cols, index + 1)
        plt.imshow(X_train[index], cmap="binary", interpolation="nearest")
        plt.axis('off')
        plt.title(class_names[y_train[index]])
plt.subplots_adjust(wspace=0.2, hspace=0.5)

save_fig("fashion_mnist_plot")
plt.show()

<details>
<summary><b>AI Understanding - Implementing MLPs with Keras </b></summary>

## ✅ **AI Understanding Template: Fashion-MNIST Loading & Visualization**

### **1. What is it?**

A machine learning pipeline step that:

* Loads the **Fashion-MNIST** dataset
* Splits it into train/validation/test
* Normalizes pixel values
* Visualizes sample images

It prepares data for image-classification models (CNN, DNN, etc.).

---

### **2. How does it reason?**

Not actual reasoning, but **data processing logic**:

* Load → Split → Normalize → Visualize
* Normalization helps models converge faster
* Visualization lets you verify data sanity

---

### **3. Where does it fail?**

* If the visualization step is skipped, wrong labels or corrupted data can go unnoticed.
* If images are not normalized, training can become slow or unstable.
* If array shapes do not match the model's input shape, the model raises input errors.
* If grayscale (1-channel) images are fed to a CNN that expects 3 channels, the shapes will not match.

---

### **4. When should I use it?**

* Before training **any image classifier**
* To sanity-check data distribution
* To confirm labels match images
* When benchmarking new models on a standard dataset

---

### **5. What is the mental model?**

Think of it as:

> **"Load clothes images → clean them → convert numbers → preview them → ready for model."**

---

### **6. How do I prompt it?**

(Here "prompt" = how to *use the code*.)

* Ensure dataset loads correctly
* Normalize values using `/ 255.`
* Use `plt.imshow()` to confirm image quality
* Keep the train/validation/test sets strictly separated

---

### **7. What are alternatives?**

* **CIFAR-10** (color images)
* **MNIST** (digits)
* **Custom image datasets** via `tf.keras.utils.image_dataset_from_directory`
* **Kaggle fashion datasets** for higher complexity

---

## ‚úÖ **Code Explanation (Short, Clear)**

### **1. Load Dataset**

```python
import tensorflow as tf
fashion_mnist = tf.keras.datasets.fashion_mnist.load_data()
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist
```

* Downloads Fashion-MNIST
* Gives 60,000 training images + 10,000 test images
* Each image is 28×28 grayscale

---

### **2. Create Validation Set**

```python
X_train, y_train = X_train_full[:-5000], y_train_full[:-5000]
X_valid, y_valid = X_train_full[-5000:], y_train_full[-5000:]
```

* Last **5,000 images** → validation
* Remaining **55,000** → training
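
The negative-index slicing can be checked on a small stand-in: the sketch below uses a list of 60 integers in place of the 60,000 images and splits off the last 5 (the names `full`, `train`, and `valid` are illustrative only).

```python
full = list(range(60))               # stand-in for the 60,000 training images
train, valid = full[:-5], full[-5:]  # same pattern as [:-5000] and [-5000:]
print(len(train), len(valid), valid)
```

The two slices never overlap and together cover the whole array, so no image ends up in both the training and validation sets.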

---

### **3. Check shape & datatype**

```python
X_train.shape
X_train.dtype
```

* Shape: `(55000, 28, 28)`
* dtype: `uint8` (0–255 pixel values)

---

### **4. Normalize Pixel Values**

```python
X_train, X_valid, X_test = X_train / 255., X_valid / 255., X_test / 255.
```

* Rescales pixels from **0–255 → 0–1** (and converts them to floats)
* Keeps inputs in a small range, which helps gradient descent converge
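
The dtype change is worth seeing once. The sketch below uses a made-up 1 × 3 "image" (`pixels` is a hypothetical array, not part of the dataset) to show that dividing a `uint8` array by the float `255.` yields floats between 0 and 1.

```python
import numpy as np

pixels = np.array([[0, 128, 255]], dtype=np.uint8)  # made-up 1 x 3 "image"
scaled_pixels = pixels / 255.                       # true division promotes to float64
print(scaled_pixels, scaled_pixels.dtype)
```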

---

### **5. Display First Image**

```python
plt.imshow(X_train[0], cmap="binary")
plt.axis('off')
plt.show()
```

* Shows a grayscale image
* `binary` = grayscale colormap running from white (0) to black (1)
* Turn off axis for cleaner look

---

### **6. Get Label Names**

```python
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

class_names[y_train[0]]
```

* Converts numeric label (0–9) → category name
* Example: 9 → "Ankle boot"

---

### **7. Plot Grid of Images**

```python
n_rows = 4
n_cols = 10
plt.figure(figsize=(n_cols * 1.2, n_rows * 1.2))
```

* Creates grid: **4 rows × 10 columns**
* Shows 40 sample images

---

### **Loop to Draw Each Image**

```python
for row in range(n_rows):
    for col in range(n_cols):
        index = n_cols * row + col
        plt.subplot(n_rows, n_cols, index + 1)
        plt.imshow(X_train[index], cmap="binary", interpolation="nearest")
        plt.axis('off')
        plt.title(class_names[y_train[index]])
```

* Draws each image
* Adds class name as title
* Uses "nearest" interpolation to keep pixels sharp

---

### **Spacing + Save Figure**

```python
plt.subplots_adjust(wspace=0.2, hspace=0.5)
save_fig("fashion_mnist_plot")
plt.show()
```

* Adjust spacing
* Saves as image file
* Displays plot

---


</details>

### Creating the model using the Sequential API

In [None]:
tf.random.set_seed(42)
model = tf.keras.Sequential()
model.add(tf.keras.layers.InputLayer(input_shape=[28, 28]))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(300, activation="relu"))
model.add(tf.keras.layers.Dense(100, activation="relu"))
model.add(tf.keras.layers.Dense(10, activation="softmax"))

In [None]:
# extra code – clear the session to reset the name counters
tf.keras.backend.clear_session()
tf.random.set_seed(42)

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax")
])

In [None]:
model.summary()

In [None]:
# extra code – another way to display the model's architecture
tf.keras.utils.plot_model(model, "my_fashion_mnist_model.png", show_shapes=True)

In [None]:
model.layers

In [None]:
hidden1 = model.layers[1]
hidden1.name

In [None]:
model.get_layer('dense') is hidden1

In [None]:
weights, biases = hidden1.get_weights()
weights

In [None]:
weights.shape

In [None]:
biases

In [None]:
biases.shape

### Compiling the model

In [None]:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd",
              metrics=["accuracy"])

This is equivalent to:

In [None]:
# extra code – this cell is equivalent to the previous cell
model.compile(loss=tf.keras.losses.sparse_categorical_crossentropy,
              optimizer=tf.keras.optimizers.SGD(),
              metrics=[tf.keras.metrics.sparse_categorical_accuracy])

In [None]:
# extra code – shows how to convert class ids to one-hot vectors
tf.keras.utils.to_categorical([0, 5, 1, 0], num_classes=10)

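Note: it's important to set `num_classes` explicitly whenever the sample might not contain the highest class id. Otherwise `to_categorical()` infers the number of classes as the largest id in the sample plus one (here it would produce 6 columns instead of 10).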
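
To see why `num_classes` matters, here is a minimal pure-Python sketch. `one_hot()` is a hypothetical stand-in written for this illustration (it is not part of Keras); it only mimics the documented rule that the width defaults to the largest id plus one.

```python
def one_hot(class_ids, num_classes=None):
    """Hypothetical stand-in for to_categorical(), using plain lists."""
    if num_classes is None:
        # Same inference rule as documented for to_categorical(): max id + 1.
        num_classes = max(class_ids) + 1
    return [[1.0 if i == class_id else 0.0 for i in range(num_classes)]
            for class_id in class_ids]

inferred = one_hot([0, 5, 1, 0])                  # width inferred from the sample
explicit = one_hot([0, 5, 1, 0], num_classes=10)  # width fixed to all 10 classes

print(len(inferred[0]), len(explicit[0]))
```

With the inferred width, the vectors would not line up with a 10-unit softmax output layer.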

In [None]:
# extra code – shows how to convert one-hot vectors to class ids
np.argmax(
    [[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
     [0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
     [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
     [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],
    axis=1
)

<details>
<summary><b>AI Understanding MLPs with Keras - Creating and compiling </b></summary>


---

## ✅ **AI Understanding Template – Fashion-MNIST Sequential Model**

### **1. What is it?**

A simple **feed-forward neural network (MLP)** for classifying 28×28 images (e.g., Fashion-MNIST / MNIST) into 10 classes.

---

### **2. How does it reason?**

* **Flatten** turns the 2D image into a 1D vector.
* **Dense layers (ReLU)** learn non-linear patterns such as shapes/edges.
* **Softmax output** converts scores → probabilities.
* Training uses **cross-entropy** to push probabilities toward the correct class.
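
The last two bullets can be made concrete with a few lines of plain Python. This is a sketch under simplifying assumptions (one instance, three classes, made-up scores); `softmax()` and `sparse_cross_entropy()` are helper names chosen here, not Keras functions.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sparse_cross_entropy(probas, class_id):
    """Loss for one instance when the label is an integer class id."""
    return -math.log(probas[class_id])

logits = [2.0, 0.5, -1.0]  # illustrative scores for 3 classes
probas = softmax(logits)

loss_if_class_0 = sparse_cross_entropy(probas, 0)  # label agrees with the top score
loss_if_class_2 = sparse_cross_entropy(probas, 2)  # label is the lowest-scored class
print(probas, loss_if_class_0, loss_if_class_2)
```

The loss is small when the correct class already has a high probability and large otherwise, which is what pushes the probabilities toward the correct class during training.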

---

### **3. Where does it fail?**

* Works poorly on **complex images** (CNNs beat MLPs).
* Does not use spatial structure → flattening discards the 2D layout, so the network has to relearn which pixels are neighbors.
* Prone to overfitting if network is large or data is small.

---

### **4. When should I use it?**

Use this when:

* Data is **simple** (e.g., digit recognition).
* You want a **baseline classifier** fast.
* You want to teach beginners **how neural nets train**.

---

### **5. What is the mental model?**

> **“Convert image → vector → pass through funnels of neurons → classify.”**

It’s simply a stack of fully connected layers that gradually learn better representations.

---

### **6. How do I prompt it?**

(How to feed/use the model)

* Input must be shape **[batch, 28, 28]**
* Labels must be **integers 0–9** (sparse cross entropy).
* Compile with **SGD** or **Adam**.
* Train with `model.fit(X_train, y_train)`.

---

### **7. What are the alternatives?**

| Alternative             | Why use it?                              |
| ----------------------- | ---------------------------------------- |
| **CNN (Conv2D)**        | Best for images; uses spatial patterns.  |
| **SimpleRNN/LSTM**      | If treating each row/column as sequence. |
| **Vision Transformer**  | High-end accuracy on modern image tasks. |
| **Logistic Regression** | Tiny baseline for comparison.            |

---

## 🧠 **Code Explanation (Short & Clear)**

### **Set seed**

```python
tf.random.set_seed(42)
```

Ensures reproducible weight initialization.

---

## **Model Creation – Method 1**

```python
model = tf.keras.Sequential()
model.add(tf.keras.layers.InputLayer(input_shape=[28, 28]))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(300, activation="relu"))
model.add(tf.keras.layers.Dense(100, activation="relu"))
model.add(tf.keras.layers.Dense(10, activation="softmax"))
```

**Explanation:**

* `InputLayer`: expects 28×28 images
* `Flatten`: reshape into 784-length vector
* Dense-300 → Dense-100 → Dense-10 (classification)

---

### **Clear session (reset counters)**

```python
tf.keras.backend.clear_session()
tf.random.set_seed(42)
```

---

## **Model Creation – Method 2 (cleaner)**

```python
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax")
])
```

---

### **Model summary**

```python
model.summary()
```

Shows layer shapes + parameters.

---

### **Plot the model**

```python
tf.keras.utils.plot_model(model, "my_fashion_mnist_model.png", show_shapes=True)
```

---

## **Accessing Layers**

```python
hidden1 = model.layers[1]
hidden1.name
model.get_layer('dense') is hidden1
```

* `layers[1]` → first Dense(300) layer
* `get_layer()` fetches by name
* You can inspect how layers are stored.

---

### **Weights of a layer**

```python
weights, biases = hidden1.get_weights()
weights
```

* Shows the big `[784 x 300]` weight matrix
* And bias vector of length 300
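
These shapes determine the parameter counts reported by `model.summary()`. The sketch below recomputes them from the layer sizes alone; `dense_param_count()` is a helper name introduced for this illustration.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs x n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

layer_sizes = [28 * 28, 300, 100, 10]  # flattened input, then the three Dense layers
counts = [dense_param_count(n_in, n_out)
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

print(counts, sum(counts))  # [235500, 30100, 1010] 266610
```

Most of the parameters sit in the first hidden layer, because it is connected to all 784 input pixels.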

---

## **Compile Model**

```python
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd",
              metrics=["accuracy"])
```

Why sparse?

* Labels are integers (0–9), not one-hot vectors.

Equivalent version:

```python
model.compile(loss=tf.keras.losses.sparse_categorical_crossentropy,
              optimizer=tf.keras.optimizers.SGD(),
              metrics=[tf.keras.metrics.sparse_categorical_accuracy])
```

---

## **One-hot utilities**

Convert ids → one-hot:

```python
tf.keras.utils.to_categorical([0, 5, 1, 0], num_classes=10)
```

Convert one-hot → ids:

```python
np.argmax([...], axis=1)
```

---



</details>

### Training and evaluating the model

In [None]:
history = model.fit(X_train, y_train, epochs=30,
                    validation_data=(X_valid, y_valid))

In [None]:
history.params

In [None]:
print(history.epoch)

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

pd.DataFrame(history.history).plot(
    figsize=(8, 5), xlim=[0, 29], ylim=[0, 1], grid=True, xlabel="Epoch",
    style=["r--", "r--.", "b-", "b-*"])
plt.legend(loc="lower left")  # extra code
save_fig("keras_learning_curves_plot")  # extra code
plt.show()

In [None]:
# extra code – shows how to shift the training curve by -1/2 epoch
plt.figure(figsize=(8, 5))
for key, style in zip(history.history, ["r--", "r--.", "b-", "b-*"]):
    epochs = np.array(history.epoch) + (0 if key.startswith("val_") else -0.5)
    plt.plot(epochs, history.history[key], style, label=key)
plt.xlabel("Epoch")
plt.axis([-0.5, 29, 0., 1])
plt.legend(loc="lower left")
plt.grid()
plt.show()

In [None]:
model.evaluate(X_test, y_test)

### Using the model to make predictions

In [None]:
X_new = X_test[:3]
y_proba = model.predict(X_new)
y_proba.round(2)

In [None]:
y_pred = y_proba.argmax(axis=-1)
y_pred

In [None]:
np.array(class_names)[y_pred]

In [None]:
y_new = y_test[:3]
y_new

In [None]:
# extra code – this cell generates and saves Figure 10–12
plt.figure(figsize=(7.2, 2.4))
for index, image in enumerate(X_new):
    plt.subplot(1, 3, index + 1)
    plt.imshow(image, cmap="binary", interpolation="nearest")
    plt.axis('off')
    plt.title(class_names[y_test[index]])
plt.subplots_adjust(wspace=0.2, hspace=0.5)
save_fig('fashion_mnist_images_plot', tight_layout=False)
plt.show()

<details>
<summary><b>AI Understanding - Evaluation & Predictions </b></summary>

---

## ✅ **AI Understanding Template – Keras Training/Evaluation Code**

## **1. What is it?**

A standard **model training → tracking → evaluation → prediction → visualization** pipeline using Keras + Matplotlib.

---

## **2. How does it reason?**

* During `fit()`, the model updates its weights with backpropagation after every mini-batch and reports metrics once per epoch.
* Stores all metrics in `history.history`.
* Plots curves to help “reason” about learning trends (overfit/underfit).
* Uses `model.predict()` to produce probabilities → picks highest class.

---

## **3. Where does it fail?**

* If epochs too high → **overfitting**.
* If data not normalized → **unstable loss curves**.
* If classes imbalanced → **misleading accuracy**.
* If model is too small/large → **underfitting/overfitting**.

---

## **4. When should I use it?**

Use this pipeline when you want:

* Quick training + validation monitoring.
* To diagnose model behavior via loss/accuracy curves.
* Simple prediction flow (e.g., MNIST, tabular, binary classifiers).

---

## **5. What is the mental model?**

Think of it as:

> **“A training diary that logs each epoch’s progress and lets you visualize learning behavior.”**

`fit()` = training engine
`history.history` = logbook
Plots = health check of your neural network
`evaluate()` = exam
`predict()` = final output

---

## **6. How do I prompt it?**

(How to use the API correctly)

* Provide training + validation data.
* Choose the number of epochs (for example 20–50 to start with).
* Read `history.history['loss']` etc. for diagnostics.
* Use `plot()` to see underfit/overfit.
* Use `predict()` → `argmax` to convert probabilities → classes.

---

## **7. What are alternatives?**

* **TensorBoard** → real-time training dashboard
* **Scikit-learn fit/score APIs**
* **PyTorch Lightning Trainer**
* **Weights & Biases / MLflow** → experiment tracking
* **FastAI** → automated training cycles

---

##  **CODE EXPLANATION (Short + Clear)**

---

## **Training the model**

```python
history = model.fit(X_train, y_train, epochs=30,
                    validation_data=(X_valid, y_valid))
```

* Trains for **30 epochs**
* Logs **loss + metrics** for train/val
* Returns `history` object containing all logs.

---

## **Inspecting training metadata**

```python
history.params         # training parameters (verbose, epochs, steps)
print(history.epoch)   # list of epochs [0..29]
```

---

## **Plot the learning curves**

```python
pd.DataFrame(history.history).plot(
    figsize=(8, 5), xlim=[0, 29], ylim=[0, 1],
    grid=True, xlabel="Epoch",
    style=["r--", "r--.", "b-", "b-*"])
plt.legend(loc="lower left")
plt.show()
```

Explanation:

* Converts history to DataFrame.
* Plots the **loss** and **accuracy** curves for both the training set and the validation set.
* Each style string is applied to one column, in order (dashed red, solid blue, and so on).
* Helps detect **overfitting** (validation loss rising while training loss keeps falling).
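
The same diagnosis can be done numerically on a history dictionary. The values below are made up for illustration (they are not results from this notebook); only the `loss` and `val_loss` key names follow what Keras stores in `history.history`.

```python
# Made-up curves for illustration only (not results from this notebook).
history_dict = {
    "loss":     [0.70, 0.50, 0.40, 0.33, 0.28, 0.24],
    "val_loss": [0.65, 0.50, 0.44, 0.42, 0.43, 0.46],
}

val_loss = history_dict["val_loss"]

# Epoch with the lowest validation loss.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)

# Gap between validation and training loss at each epoch.
gaps = [v - t for t, v in zip(history_dict["loss"], val_loss)]

print(best_epoch, [round(g, 2) for g in gaps])
```

A gap that keeps widening after the best epoch is the numerical counterpart of the diverging curves in the plot.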

---

## **Shift training curve by -0.5 epochs (optional trick)**

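Training metrics are averaged over each epoch, while validation metrics are computed at the end of it, so shifting the training curves half an epoch to the left lines them up with the validation curves.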

```python
for key, style in zip(history.history, ["r--", "r--.", "b-", "b-*"]):
    epochs = np.array(history.epoch) + (0 if key.startswith("val_") else -0.5)
    plt.plot(epochs, history.history[key], style, label=key)
```
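
Why half an epoch? The training loss logged for an epoch is a mean over that epoch's batches. The sketch below assumes, purely for illustration, a per-batch loss that falls linearly within one epoch; under that assumption the logged mean equals the loss at the middle of the epoch, not at its end.

```python
n_batches = 100

# Hypothetical per-batch loss that falls linearly from 1.0 to 0.5 within one epoch.
batch_losses = [1.0 - 0.5 * i / (n_batches - 1) for i in range(n_batches)]

# What gets logged for the epoch: the mean over all its batches.
reported_loss = sum(batch_losses) / n_batches

# The loss halfway through the epoch and at its end.
mid_epoch_loss = (batch_losses[n_batches // 2 - 1] + batch_losses[n_batches // 2]) / 2
end_epoch_loss = batch_losses[-1]

print(reported_loss, mid_epoch_loss, end_epoch_loss)
```

Real loss curves are not linear, so the half-epoch shift is an approximation rather than an exact correction.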

---

## **Evaluate on test set**

```python
model.evaluate(X_test, y_test)
```

Gives final loss + metrics on unseen data.

---

## **Make predictions**

```python
X_new = X_test[:3]
y_proba = model.predict(X_new)
y_proba.round(2)
```

* Returns probability distribution per class.
* Rounded to 2 decimals.

---

## **Convert probabilities to class labels**

```python
y_pred = y_proba.argmax(axis=-1)
np.array(class_names)[y_pred]
```

`argmax` picks the class with highest probability.
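
A plain-Python version of the same step, as a minimal sketch: the probability rows are hypothetical (not actual model output), and `argmax()` is a small helper defined here rather than the NumPy method.

```python
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

# Hypothetical probability rows (one per image); not actual model output.
y_proba = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.01, 0.0, 0.02, 0.0, 0.97],
    [0.0, 0.0, 0.99, 0.0, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0],
]

def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

y_pred = [argmax(row) for row in y_proba]
predicted_names = [class_names[i] for i in y_pred]
print(y_pred, predicted_names)  # [9, 2] ['Ankle boot', 'Pullover']
```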

---

## **Ground truth comparison**

```python
y_test[:3]
```

---

## **Plot images + true labels**

```python
plt.imshow(image, cmap="binary", interpolation="nearest")
plt.title(class_names[y_test[index]])
```

Shows the test images with their *actual* class names.

---
</details>

## Building a Regression MLP Using the Sequential API

Let's load, split and scale the California housing dataset (the original one, not the modified one as in chapter 2):

In [None]:
# extra code – load and split the California housing dataset, like earlier
housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(
    housing.data, housing.target, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full, random_state=42)

In [None]:
tf.random.set_seed(42)
norm_layer = tf.keras.layers.Normalization(input_shape=X_train.shape[1:])
model = tf.keras.Sequential([
    norm_layer,
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(1)
])
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
model.compile(loss="mse", optimizer=optimizer, metrics=["RootMeanSquaredError"])
norm_layer.adapt(X_train)
history = model.fit(X_train, y_train, epochs=20,
                    validation_data=(X_valid, y_valid))
mse_test, rmse_test = model.evaluate(X_test, y_test)
X_new = X_test[:3]
y_pred = model.predict(X_new)

In [None]:
rmse_test

In [None]:
y_pred

## Building Complex Models Using the Functional API

Not all neural network models are simply sequential. Some may have complex topologies. Some may have multiple inputs and/or multiple outputs. For example, a Wide & Deep neural network (see [paper](https://ai.google/research/pubs/pub45413)) connects all or part of the inputs directly to the output layer.

In [None]:
# extra code – reset the name counters and make the code reproducible
tf.keras.backend.clear_session()
tf.random.set_seed(42)

In [None]:
normalization_layer = tf.keras.layers.Normalization()
hidden_layer1 = tf.keras.layers.Dense(30, activation="relu")
hidden_layer2 = tf.keras.layers.Dense(30, activation="relu")
concat_layer = tf.keras.layers.Concatenate()
output_layer = tf.keras.layers.Dense(1)

input_ = tf.keras.layers.Input(shape=X_train.shape[1:])
normalized = normalization_layer(input_)
hidden1 = hidden_layer1(normalized)
hidden2 = hidden_layer2(hidden1)
concat = concat_layer([normalized, hidden2])
output = output_layer(concat)

model = tf.keras.Model(inputs=[input_], outputs=[output])

In [None]:
model.summary()

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
model.compile(loss="mse", optimizer=optimizer, metrics=["RootMeanSquaredError"])
normalization_layer.adapt(X_train)
history = model.fit(X_train, y_train, epochs=20,
                    validation_data=(X_valid, y_valid))
mse_test, rmse_test = model.evaluate(X_test, y_test)  # returns [loss, RMSE]
y_pred = model.predict(X_new)

What if you want to send different subsets of input features through the wide or deep paths? We will send 5 features through the wide path (features 0 to 4), and 6 through the deep path (features 2 to 7). Note that 3 features will go through both (features 2, 3 and 4).

In [None]:
tf.random.set_seed(42)  # extra code

In [None]:
input_wide = tf.keras.layers.Input(shape=[5])  # features 0 to 4
input_deep = tf.keras.layers.Input(shape=[6])  # features 2 to 7
norm_layer_wide = tf.keras.layers.Normalization()
norm_layer_deep = tf.keras.layers.Normalization()
norm_wide = norm_layer_wide(input_wide)
norm_deep = norm_layer_deep(input_deep)
hidden1 = tf.keras.layers.Dense(30, activation="relu")(norm_deep)
hidden2 = tf.keras.layers.Dense(30, activation="relu")(hidden1)
concat = tf.keras.layers.concatenate([norm_wide, hidden2])
output = tf.keras.layers.Dense(1)(concat)
model = tf.keras.Model(inputs=[input_wide, input_deep], outputs=[output])

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
model.compile(loss="mse", optimizer=optimizer, metrics=["RootMeanSquaredError"])

X_train_wide, X_train_deep = X_train[:, :5], X_train[:, 2:]
X_valid_wide, X_valid_deep = X_valid[:, :5], X_valid[:, 2:]
X_test_wide, X_test_deep = X_test[:, :5], X_test[:, 2:]
X_new_wide, X_new_deep = X_test_wide[:3], X_test_deep[:3]

norm_layer_wide.adapt(X_train_wide)
norm_layer_deep.adapt(X_train_deep)
history = model.fit((X_train_wide, X_train_deep), y_train, epochs=20,
                    validation_data=((X_valid_wide, X_valid_deep), y_valid))
mse_test, rmse_test = model.evaluate((X_test_wide, X_test_deep), y_test)
y_pred = model.predict((X_new_wide, X_new_deep))

Adding an auxiliary output for regularization:

In [None]:
tf.keras.backend.clear_session()
tf.random.set_seed(42)

In [None]:
input_wide = tf.keras.layers.Input(shape=[5])  # features 0 to 4
input_deep = tf.keras.layers.Input(shape=[6])  # features 2 to 7
norm_layer_wide = tf.keras.layers.Normalization()
norm_layer_deep = tf.keras.layers.Normalization()
norm_wide = norm_layer_wide(input_wide)
norm_deep = norm_layer_deep(input_deep)
hidden1 = tf.keras.layers.Dense(30, activation="relu")(norm_deep)
hidden2 = tf.keras.layers.Dense(30, activation="relu")(hidden1)
concat = tf.keras.layers.concatenate([norm_wide, hidden2])
output = tf.keras.layers.Dense(1)(concat)
aux_output = tf.keras.layers.Dense(1)(hidden2)
model = tf.keras.Model(inputs=[input_wide, input_deep],
                       outputs=[output, aux_output])

**Warning**: in recent versions, Keras requires one metric per output, so I replaced `metrics=["RootMeanSquaredError"]` with `metrics=["RootMeanSquaredError", "RootMeanSquaredError"]` in the code below.

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
model.compile(loss=("mse", "mse"), loss_weights=(0.9, 0.1), optimizer=optimizer,
              metrics=["RootMeanSquaredError", "RootMeanSquaredError"])

In [None]:
norm_layer_wide.adapt(X_train_wide)
norm_layer_deep.adapt(X_train_deep)
history = model.fit(
    (X_train_wide, X_train_deep), (y_train, y_train), epochs=20,
    validation_data=((X_valid_wide, X_valid_deep), (y_valid, y_valid))
)

**Warning**: in recent TF versions, `evaluate()` also returns the main metric and the aux metric. To ensure the code works in both old and new versions, we only look at the first 3 elements of `eval_results` (i.e., just the losses):

In [None]:
eval_results = model.evaluate((X_test_wide, X_test_deep), (y_test, y_test))
weighted_sum_of_losses, main_loss, aux_loss = eval_results[:3]

In [None]:
y_pred_main, y_pred_aux = model.predict((X_new_wide, X_new_deep))

In [None]:
y_pred_tuple = model.predict((X_new_wide, X_new_deep))
y_pred = dict(zip(model.output_names, y_pred_tuple))

<details>
<summary><b>AI Understanding - Regression & Complex Model </b></summary>


---

# ✅ **AI Understanding Template – Applied to This Code**

## **1. What is it?**

A set of TensorFlow models for **tabular regression** (California Housing), progressing from:

1. **Simple DNN**
2. **Functional API DNN**
3. **Wide & Deep model**
4. **Wide & Deep with auxiliary outputs (multi-task style)**

---

## **2. How does it reason?**

* **Normalization** brings all features to the same scale.
* **Deep layers** learn non-linear patterns (interactions).
* **Wide branch** learns memorization patterns (direct linear effects).
* **Concatenation** merges shallow + deep knowledge.
* **Auxiliary outputs** stabilize training by forcing hidden layers to learn strong representations.

---

## **3. Where does it fail?**

* Small datasets → overfitting.
* Highly categorical data → better with embeddings / trees.
* Highly linear tasks → simpler linear models are enough.
* Poor feature splitting between wide & deep → suboptimal.

---

## **4. When should I use it?**

Use these models for:

* **Tabular data regression**
* **Mixed linear + non-linear patterns**
* **Real estate, pricing, risk scoring, health tabular datasets**
* **When feature engineering matters**

---

## **5. What is the mental model?**

Think of it like:

> **“A neural network that listens to two sources: simple rules (wide) + complex patterns (deep) and blends both.”**

---

## **6. How do I prompt it?** *(How to use it)*

* Feed **X_train**, **y_train** cleanly.
* Always run `.adapt()` before training.
* For wide+deep → split features logically.
* For predictions → pass inputs as tuples:
  `model.predict((wide, deep))`

---

## **7. What are alternatives?**

* **XGBoost / CatBoost / LightGBM** → strong choices for tabular data.
* **Linear Regression / ElasticNet** → for purely linear tasks.
* **Random Forest / Extra Trees** → quick baselines.
* **TabTransformer** → deep learning for categorical-heavy data.

---

# ✅ **Code Explanation (Short + Clear)**

---

## 🔹 **1) Basic DNN Regression (Sequential model)**

### **Load + split data**

```python
housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(...)
X_train, X_valid, y_train, y_valid = train_test_split(...)
```

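→ Two successive `train_test_split()` calls with the default `test_size` of 0.25, giving roughly 56% training, 19% validation and 25% test data.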
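
The resulting set sizes can be worked out by hand. This sketch assumes the California housing dataset has 20,640 instances and that neither call overrides the default `test_size` of 0.25; both products below are whole numbers, so rounding plays no role here.

```python
n_total = 20_640          # assumed size of the California housing dataset
test_fraction = 0.25      # train_test_split() default when test_size is not given

# First split: hold out the test set.
n_test = round(n_total * test_fraction)
n_train_full = n_total - n_test

# Second split: hold out the validation set from what remains.
n_valid = round(n_train_full * test_fraction)
n_train = n_train_full - n_valid

fractions = [n / n_total for n in (n_train, n_valid, n_test)]
print(n_train, n_valid, n_test, fractions)
```

The second split takes 25% of the remaining 75%, which is why the validation set is smaller than the test set.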

---

### **Normalization + DNN**

```python
norm_layer = tf.keras.layers.Normalization(input_shape=X_train.shape[1:])
model = tf.keras.Sequential([
    norm_layer,
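    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(1)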
])
```

* First layer scales all features.
* Three hidden layers learn non-linear interactions.

---

### **Compile + Train**

```python
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
model.compile(loss="mse", optimizer=optimizer, metrics=["RootMeanSquaredError"])
norm_layer.adapt(X_train)
history = model.fit(...)
```

* Adam optimizer
* RMSE metric
* `adapt()` learns mean & variance
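
What "learns mean & variance" amounts to can be sketched without Keras. The example below standardizes one hypothetical feature column with statistics taken from the training data only; the actual `Normalization` layer additionally guards against division by zero, which is omitted here.

```python
import math
import statistics

# Hypothetical single feature column from a training set and from new data.
train_column = [2.0, 4.0, 6.0, 8.0]
new_column = [5.0, 9.0]

# "adapt" step: the statistics come from the training data only.
mean = statistics.mean(train_column)
variance = statistics.pvariance(train_column)

def standardize(values):
    """Apply (x - mean) / sqrt(variance) using the stored training statistics."""
    return [(x - mean) / math.sqrt(variance) for x in values]

train_scaled = standardize(train_column)
new_scaled = standardize(new_column)
print(mean, variance, new_scaled)
```

New data is scaled with the stored training statistics, never with its own, so validation and test inputs are transformed exactly like the training inputs.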

---

### **Evaluate + Predict**

```python
model.evaluate(X_test, y_test)
y_pred = model.predict(X_new)
```

---

## 🔹 **2) Functional API Model**

Functional API gives more control than Sequential.

### **Define layers explicitly**

```python
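input_ = tf.keras.layers.Input(shape=X_train.shape[1:])
normalized = normalization_layer(input_)  # a Normalization layer created beforehand
hidden1 = tf.keras.layers.Dense(30, activation="relu")(normalized)
hidden2 = tf.keras.layers.Dense(30, activation="relu")(hidden1)
concat = tf.keras.layers.Concatenate()([normalized, hidden2])
output = tf.keras.layers.Dense(1)(concat)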
```

* Input flows through two hidden layers.
* The normalized input is concatenated back → forms a **wide + deep hybrid**.
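
The effect of the concatenation on layer sizes can be checked with simple arithmetic. This sketch assumes the 8 input features of the California housing data and the 30-unit hidden layers used in this section; `dense_param_count()` is a helper name introduced for the illustration.

```python
n_features = 8     # input features in the California housing data
n_hidden = 30      # units in each hidden layer

def dense_param_count(n_inputs, n_units):
    """Weights plus one bias per unit."""
    return n_inputs * n_units + n_units

# Concatenate joins the normalized input with hidden2 along the feature axis.
concat_width = n_features + n_hidden

params = {
    "hidden1": dense_param_count(n_features, n_hidden),
    "hidden2": dense_param_count(n_hidden, n_hidden),
    "output": dense_param_count(concat_width, 1),
}
print(concat_width, params)
```

The output neuron therefore sees 38 values: the 8 normalized inputs directly (the wide path) and the 30 activations of the second hidden layer (the deep path).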

---

### **Build, compile & train**

```python
model = tf.keras.Model(inputs=[input_], outputs=[output])
normalization_layer.adapt(X_train)
model.fit(...)
```

---

## 🔹 **3) Wide + Deep Model (Manually Split Features)**

### **Split features manually**

```python
X_train_wide = X_train[:, :5]   # features 0 to 4
X_train_deep = X_train[:, 2:]   # features 2 to 7
```

Wide: first 5 features (0 to 4)
Deep: last 6 features (2 to 7); the overlap on features 2 to 4 is fine in a Wide & Deep model
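
The two slices and their overlap can be listed explicitly. This sketch uses plain feature indices (assuming the 8 features of the California housing data) instead of the actual arrays.

```python
n_features = 8
all_features = list(range(n_features))

wide_features = all_features[:5]   # same slice as X_train[:, :5]
deep_features = all_features[2:]   # same slice as X_train[:, 2:]

# Features that go through both paths.
shared = sorted(set(wide_features) & set(deep_features))
print(wide_features, deep_features, shared)
```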

---

### **Inputs + normalization**

```python
input_wide = tf.keras.layers.Input(shape=[5])  # features 0 to 4
input_deep = tf.keras.layers.Input(shape=[6])  # features 2 to 7
norm_layer_wide = tf.keras.layers.Normalization()
norm_layer_deep = tf.keras.layers.Normalization()
norm_wide = norm_layer_wide(input_wide)
norm_deep = norm_layer_deep(input_deep)
```

---

### **Deep branch**

```python
hidden1 = tf.keras.layers.Dense(30, activation="relu")(norm_deep)
hidden2 = tf.keras.layers.Dense(30, activation="relu")(hidden1)
```

---

### **Concatenate wide + deep**

```python
concat = tf.keras.layers.concatenate([norm_wide, hidden2])
output = tf.keras.layers.Dense(1)(concat)
model = tf.keras.Model(inputs=[input_wide, input_deep], outputs=[output])
```

---

### **Train**

```python
model.compile(...)
norm_layer_wide.adapt(...)
norm_layer_deep.adapt(...)
model.fit((X_train_wide, X_train_deep), y_train, ...)
```

---

### **Evaluation**

```python
mse_test = model.evaluate((X_test_wide, X_test_deep), y_test)
```

---

### **Prediction**

```python
y_pred = model.predict((X_new_wide, X_new_deep))
```

---

## 🔹 **4) Wide + Deep With Auxiliary Outputs**


### **Evaluate**

```python
eval_results = model.evaluate(...)
weighted_sum_of_losses, main_loss, aux_loss = eval_results[:3]
```

* Weighted losses reflect multi-task training.

### **Predict**

```python
y_pred_main, y_pred_aux = model.predict((X_new_wide, X_new_deep))
```

### **Convert tuple → dictionary**

```python
y_pred = dict(zip(model.output_names, y_pred_tuple))
```

* Useful for named outputs.

---




</details>

## Using the Subclassing API to Build Dynamic Models

In [None]:
class WideAndDeepModel(tf.keras.Model):
    def __init__(self, units=30, activation="relu", **kwargs):
        super().__init__(**kwargs)  # needed to support naming the model
        self.norm_layer_wide = tf.keras.layers.Normalization()
        self.norm_layer_deep = tf.keras.layers.Normalization()
        self.hidden1 = tf.keras.layers.Dense(units, activation=activation)
        self.hidden2 = tf.keras.layers.Dense(units, activation=activation)
        self.main_output = tf.keras.layers.Dense(1)
        self.aux_output = tf.keras.layers.Dense(1)

    def call(self, inputs):
        input_wide, input_deep = inputs
        norm_wide = self.norm_layer_wide(input_wide)
        norm_deep = self.norm_layer_deep(input_deep)
        hidden1 = self.hidden1(norm_deep)
        hidden2 = self.hidden2(hidden1)
        concat = tf.keras.layers.concatenate([norm_wide, hidden2])
        output = self.main_output(concat)
        aux_output = self.aux_output(hidden2)
        return output, aux_output

tf.random.set_seed(42)  # extra code – just for reproducibility
model = WideAndDeepModel(30, activation="relu", name="my_cool_model")

**Warning**: as explained above, Keras now requires one loss and one metric per output, so I replaced `loss="mse"` with `loss=["mse", "mse"]` and I also replaced `metrics=["RootMeanSquaredError"]` with `metrics=["RootMeanSquaredError", "RootMeanSquaredError"]` in the code below.

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
model.compile(loss=["mse", "mse"], loss_weights=[0.9, 0.1], optimizer=optimizer,
              metrics=["RootMeanSquaredError", "RootMeanSquaredError"])
model.norm_layer_wide.adapt(X_train_wide)
model.norm_layer_deep.adapt(X_train_deep)
history = model.fit(
    (X_train_wide, X_train_deep), (y_train, y_train), epochs=10,
    validation_data=((X_valid_wide, X_valid_deep), (y_valid, y_valid)))
eval_results = model.evaluate((X_test_wide, X_test_deep), (y_test, y_test))
y_pred_main, y_pred_aux = model.predict((X_new_wide, X_new_deep))

# ✅ **AI – Quick Answers**

### **1. What is it?**

AI is software that learns patterns from data and uses them to make predictions, decisions, or generate outputs.

### **2. How does it reason?**

It doesn’t “think” like humans.
It **matches patterns**, **optimizes probabilities**, and **predicts the most likely output** based on training data.

### **3. Where does it fail?**

* Unseen edge cases
* Ambiguous/incomplete prompts
* Wrong or biased training data
* Logical reasoning that requires real-world grounding
* Multi-step planning without guidance

### **4. When should I use it?**

Use AI when:

* Rules cannot be hard-coded
* Data is large
* Problem is pattern-driven (image, text, time-series, recommendations)

Not ideal for:

* Exact logic (bank ledger)
* Safety-critical tasks without supervision

### **5. What is the mental model?**

Think of AI as:
👉 **Pattern autocomplete**
Whatever you give it, it tries to complete based on learned examples.

### **6. How do I prompt it?**

* Be explicit (“Do X, explain Y, give Z format”)
* Provide context and constraints
* Break tasks into steps
* Give examples
* State output format

### **7. Alternatives?**

* Rule-based systems
* Statistical models (regression, ARIMA)
* Optimization algorithms
* Search algorithms (A*, DFS, heuristic search)

---

## ✅ **Code Explanation – `WideAndDeepModel`**

This is a **Wide & Deep neural network** combining:

* ✔️ **Wide part** → memorization
* ✔️ **Deep part** → generalization

Google introduced this architecture for recommender systems.

---

## **Class Definition**

### **`__init__()`**

```python
class WideAndDeepModel(tf.keras.Model):
    def __init__(self, units=30, activation="relu", **kwargs):
        super().__init__(**kwargs)
```

* Inherits from `tf.keras.Model`
* Allows custom naming

### **Layers Created**

#### **Normalization layers**

```python
self.norm_layer_wide = tf.keras.layers.Normalization()
self.norm_layer_deep = tf.keras.layers.Normalization()
```

* Each input branch (wide / deep) is normalized separately.

#### **Deep network layers**

```python
self.hidden1 = tf.keras.layers.Dense(units, activation=activation)
self.hidden2 = tf.keras.layers.Dense(units, activation=activation)
```

* Two dense layers for the *deep* pathway (non-linear learning)

#### **Outputs**

```python
self.main_output = tf.keras.layers.Dense(1)
self.aux_output = tf.keras.layers.Dense(1)
```

* Main output: used for final prediction
* Aux output: regularizes the model (helps early layers)

---

## **`call()` method**

This defines the forward pass.

```python
input_wide, input_deep = inputs
```

Two inputs:

* Wide features
* Deep features

### **1. Normalize**

```python
norm_wide = self.norm_layer_wide(input_wide)
norm_deep = self.norm_layer_deep(input_deep)
```

### **2. Deep network**

```python
hidden1 = self.hidden1(norm_deep)
hidden2 = self.hidden2(hidden1)
```

### **3. Concatenate deep output + wide input**

```python
concat = tf.keras.layers.concatenate([norm_wide, hidden2])
```

This merges memorization + generalization.

### **4. Two outputs**

```python
output = self.main_output(concat)
aux_output = self.aux_output(hidden2)
```

* `main_output` → final task
* `aux_output` → helps deep branch learn better

Return both:

```python
return output, aux_output
```

---

## ✅ **Training Explanation**

### **Optimizer**

```python
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
```

### **Compile with two losses**

```python
model.compile(
    loss=["mse", "mse"],
    loss_weights=[0.9, 0.1],
    optimizer=optimizer,
    metrics=["RootMeanSquaredError", "RootMeanSquaredError"]
)
```

Why two losses?

* `0.9` → main output (important)
* `0.1` → aux output (regularization)
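
The loss that training minimizes is the weighted sum of the two per-output losses. A minimal sketch with hypothetical loss values (not results from this notebook); `weighted_total_loss()` is a helper name chosen for the illustration.

```python
def weighted_total_loss(losses, weights):
    """Weighted sum of per-output losses, i.e. the overall training objective."""
    return sum(w * l for w, l in zip(weights, losses))

# Hypothetical per-output MSE values, for illustration only.
main_loss, aux_loss = 0.30, 0.50
total = weighted_total_loss([main_loss, aux_loss], [0.9, 0.1])
print(total)
```

Because the auxiliary loss only contributes 10% of the total, it nudges the deep path toward useful representations without dominating the main objective.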

---

### **Adapt normalization layers**

```python
model.norm_layer_wide.adapt(X_train_wide)
model.norm_layer_deep.adapt(X_train_deep)
```

Learns the mean & variance for scaling.

---

### **Fit the model**

```python
history = model.fit((X_train_wide, X_train_deep), (y_train, y_train), epochs=10)
```

Inputs:

* `(X_train_wide, X_train_deep)`

Outputs:

* `(y_train, y_train)` → main + aux get the same target

---

### **Evaluate**

```python
eval_results = model.evaluate((X_test_wide, X_test_deep), (y_test, y_test))
```

---

### **Predict**

```python
y_pred_main, y_pred_aux = model.predict((X_new_wide, X_new_deep))
```

* `y_pred_main` → final prediction
* `y_pred_aux` → usually ignored at inference time, but available


## Saving and Restoring a Model

**Warning**: Keras now recommends using the `.keras` format to save models, and the `h5` format for weights. Therefore I have updated the code in this section to first show what you need to change if you still want to use TensorFlow's `SavedModel` format, and then how you can use the recommended formats.

In [None]:
# extra code – delete the directory, in case it already exists

import shutil

shutil.rmtree("my_keras_model", ignore_errors=True)

**Warning**: Keras's `model.save()` method no longer supports TensorFlow's `SavedModel` format. However, you can still export models to the `SavedModel` format using `model.export()` like this:

In [None]:
model.export("my_keras_model")

In [None]:
# extra code – show the contents of the my_keras_model/ directory
for path in sorted(Path("my_keras_model").glob("**/*")):
    print(path)

**Warning**: In Keras 3, it is no longer possible to load a TensorFlow `SavedModel` as a Keras model. However, you can load a `SavedModel` as a `tf.keras.layers.TFSMLayer` layer, but be aware that this layer can only be used for inference: no training.

In [None]:
tfsm_layer = tf.keras.layers.TFSMLayer("my_keras_model")
y_pred_main, y_pred_aux = tfsm_layer((X_new_wide, X_new_deep))

**Warning**: Keras now requires the saved weights to have the `.weights.h5` extension. They are no longer saved using the `SavedModel` format.

In [None]:
model.save_weights("my_weights.weights.h5")

In [None]:
model.load_weights("my_weights.weights.h5")

To save a model using the `.keras` format, simply use `model.save()`:

In [None]:
model.save("my_model.keras")

To load a `.keras` model, use the `tf.keras.models.load_model()` function. If the model uses any custom object, you must pass them to the function via the `custom_objects` argument:

In [None]:
loaded_model = tf.keras.models.load_model(
    "my_model.keras",
    custom_objects={"WideAndDeepModel": WideAndDeepModel}
)

# ✅ **AI – Quick Answers**

### **1) What is it?**

A system that learns patterns from data and makes predictions or generates outputs without explicit rules.

### **2) How does it reason?**

By mapping inputs → outputs using learned internal weights; not true “thinking.”
Reasoning = pattern completion + probability.

### **3) Where does it fail?**

* Out-of-distribution data
* Ambiguous instructions
* Small or biased training data
* Tasks requiring common sense or real-world knowledge

### **4) When should I use it?**

* When rules are unclear
* When patterns are complex
* When data is large
* When automation requires prediction, summarization, classification, generation

### **5) What is the mental model?**

Think of AI as:
**“A statistical function that predicts the most likely next step based on training.”**
Not a calculator. Not logic-first. Pattern-first.

### **6) How do I prompt it?**

Use **CLEAR** instructions:

* **Role:** “Act as an expert…”
* **Task:** “Do X clearly…”
* **Constraints:** “Short. Bullet points.”
* **Context:** “Here is the data…”
* **Output format:** “Give JSON, table, steps…”

### **7) What are alternatives?**

* Rule-based systems
* Classical ML (trees, SVM, regressions)
* Optimization algorithms
* Statistical modeling
* Simple scripts if rules are fixed

---

## ✅ **Code Explanation (Short & Clear)**

### **1. Delete old model directory**

```python
import shutil
shutil.rmtree("my_keras_model", ignore_errors=True)
```

Deletes the folder **if it exists**, so you export fresh.

---

### **2. Export the model in TF SavedModel format**

```python
model.export("my_keras_model")
```

Creates a TF-SavedModel directory containing:

* model graph
* variables
* assets
* signatures

---

### **3. List exported files**

```python
for path in sorted(Path("my_keras_model").glob("**/*")):
    print(path)
```

Shows the structure.

---

### **4. Load exported SavedModel into a TFSMLayer**

```python
tfsm_layer = tf.keras.layers.TFSMLayer("my_keras_model")
y_pred_main, y_pred_aux = tfsm_layer((X_new_wide, X_new_deep))
```

* Wraps the SavedModel as a layer
* Allows predictions inside another model
* Useful for serving pipelines, modular models

---

### **5. Save weights only**

```python
model.save_weights("my_weights.weights.h5")
```

Stores **only the layer weights**, no architecture.

---

### **6. Load weights**

```python
model.load_weights("my_weights.weights.h5")
```

Requires the **same model architecture** to already exist in code.

---

### **7. Save full model (.keras format)**

```python
model.save("my_model.keras")
```

Saves:

* architecture
* weights
* compile settings

Keras recommends this format over the legacy `.h5` format for full models.

---

### **8. Load full model with custom layer/class**

```python
loaded_model = tf.keras.models.load_model(
    "my_model.keras",
    custom_objects={"WideAndDeepModel": WideAndDeepModel}
)
```

If your model uses custom classes, you must provide them during load.

---



## Using Callbacks

In [None]:
shutil.rmtree("my_checkpoints", ignore_errors=True)  # extra code

**Warning**: as explained earlier, Keras now requires the checkpoint files to have a `.weights.h5` extension:

In [None]:
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint("my_checkpoints.weights.h5",
                                                   save_weights_only=True)
history = model.fit(
    (X_train_wide, X_train_deep), (y_train, y_train), epochs=10,
    validation_data=((X_valid_wide, X_valid_deep), (y_valid, y_valid)),
    callbacks=[checkpoint_cb])

In [None]:
early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=10,
                                                     restore_best_weights=True)
history = model.fit(
    (X_train_wide, X_train_deep), (y_train, y_train), epochs=100,
    validation_data=((X_valid_wide, X_valid_deep), (y_valid, y_valid)),
    callbacks=[checkpoint_cb, early_stopping_cb])

In [None]:
class PrintValTrainRatioCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs):
        ratio = logs["val_loss"] / logs["loss"]
        print(f"Epoch={epoch}, val/train={ratio:.2f}")

In [None]:
val_train_ratio_cb = PrintValTrainRatioCallback()
history = model.fit(
    (X_train_wide, X_train_deep), (y_train, y_train), epochs=10,
    validation_data=((X_valid_wide, X_valid_deep), (y_valid, y_valid)),
    callbacks=[val_train_ratio_cb], verbose=0)


---

## **AI – The Fast Template**

### **1. What is it?**

A system that learns patterns from data and produces outputs (predictions, text, images) without explicit rules.

### **2. How does it reason?**

By **detecting patterns** in huge datasets, using weights + math (neural networks).
It does **statistical reasoning**, not human logical reasoning.

### **3. Where does it fail?**

* When data is missing or biased
* When asked for exact truth beyond patterns
* In completely new situations
* With unclear or contradictory prompts

### **4. When should I use it?**

When you need:

* classification
* prediction
* summarization
* pattern detection
* automation of repetitive cognitive tasks

Not useful for strict rule-based logic (e.g., tax rules).

### **5. What is the mental model?**

Think of AI as a **probabilistic pattern matcher**:
“Given X, what is the statistically likely Y?”

### **6. How do I prompt it?**

* Be **clear**, **specific**, and **bounded**
* Give **context**, **examples**, **format**
* Mention **constraints** (short, table, JSON, steps etc.)

### **7. What are alternatives?**

* **Rule-based systems**
* **Databases + SQL**
* **Classical ML** (SVM, RF, XGBoost)
* **Optimization algorithms**
* **Symbolic AI**

---

# **Code Explanation (Short & Clear)**

## **1. shutil.rmtree("my_checkpoints", ignore_errors=True)**

Deletes the folder **my_checkpoints** if it exists.
`ignore_errors=True` = **don’t crash** if the folder is missing.

---

## **2. ModelCheckpoint**

```python
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    "my_checkpoints.weights.h5",
    save_weights_only=True
)
```

This callback:

* Saves **model weights** after every epoch
* File: `my_checkpoints.weights.h5`
* Useful for restoring good weights later

---

## **3. First model.fit()**

```python
history = model.fit(..., epochs=10, callbacks=[checkpoint_cb])
```

Runs 10 epochs and **saves weights** after each epoch.

---

## **4. EarlyStopping Callback**

```python
early_stopping_cb = tf.keras.callbacks.EarlyStopping(
    patience=10,
    restore_best_weights=True
)
```

Stops training **early** if validation loss doesn’t improve for **10 epochs**.

`restore_best_weights=True` = after stopping, return to the **best model** seen.
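
The patience logic can be sketched in plain Python. This is a simplified model of the idea, not Keras's actual implementation (which also supports options such as `min_delta` and a configurable monitored metric); the validation losses and the `patience=3` value are invented for the example.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0  # epochs elapsed since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # stop; weights from best_epoch would be restored
    return len(val_losses) - 1, best_epoch  # never triggered: ran all epochs

val_losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.7]  # hypothetical values
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)  # 5 2
```

Training stops at epoch 5 because epochs 3, 4 and 5 all failed to beat the loss reached at epoch 2, which is the epoch whose weights would be restored.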

---

## **5. Second model.fit() (with early stopping)**

```python
history = model.fit(
    ... ,
    epochs=100,
    callbacks=[checkpoint_cb, early_stopping_cb]
)
```

* Trains up to 100 epochs
* Stops early if the validation loss has not improved for 10 consecutive epochs
* Still saves checkpoints
* Ends with best weights restored

---

## **6. Custom Callback**

```python
class PrintValTrainRatioCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs):
        ratio = logs["val_loss"] / logs["loss"]
        print(f"Epoch={epoch}, val/train={ratio:.2f}")
```

What it does:

* After each epoch
* Reads loss and val_loss
* Prints **val_loss / loss** ratio

  * A ratio well above 1 suggests overfitting (validation loss much higher than training loss)
  * A ratio ≈ 1 means training and validation losses are similar

---

## **7. Third model.fit() using the custom callback**

```python
history = model.fit(..., callbacks=[val_train_ratio_cb], verbose=0)
```

Runs silently (verbose=0) and prints only the ratio.
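
The ratio computation in the custom callback assumes that `logs` always contains both `loss` and `val_loss`, which only holds when validation data is passed to `fit()`. Here is a sketch of a more defensive version of that computation, written as a standalone helper so it can be tried without Keras; the helper name `val_train_ratio` is hypothetical.

```python
def val_train_ratio(logs):
    """Return val_loss / loss, or None when the ratio cannot be computed."""
    logs = logs or {}
    loss = logs.get("loss")
    val_loss = logs.get("val_loss")
    if loss is None or val_loss is None or loss == 0:
        return None
    return val_loss / loss

ratio = val_train_ratio({"loss": 0.50, "val_loss": 0.75})
print(f"val/train={ratio:.2f}")  # val/train=1.50
```

Inside `on_epoch_end()`, the callback could call such a helper and skip printing when it returns `None`.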

---



## Using TensorBoard for Visualization

TensorBoard is preinstalled on Colab, but not the `tensorboard-plugin-profile`, so let's install it:

In [None]:
if "google.colab" in sys.modules:  # extra code
    %pip install -q -U tensorboard-plugin-profile

In [None]:
shutil.rmtree("my_logs", ignore_errors=True)

In [None]:
from pathlib import Path
from time import strftime

def get_run_logdir(root_logdir="my_logs"):
    return Path(root_logdir) / strftime("run_%Y_%m_%d_%H_%M_%S")

run_logdir = get_run_logdir()

In [None]:
# extra code – builds the first regression model we used earlier
tf.keras.backend.clear_session()
tf.random.set_seed(42)
norm_layer = tf.keras.layers.Normalization(input_shape=X_train.shape[1:])
model = tf.keras.Sequential([
    norm_layer,
    tf.keras.layers.Dense(30, activation="relu"),
    tf.keras.layers.Dense(30, activation="relu"),
    tf.keras.layers.Dense(1)
])
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-3)
model.compile(loss="mse", optimizer=optimizer, metrics=["RootMeanSquaredError"])
norm_layer.adapt(X_train)

In [None]:
tensorboard_cb = tf.keras.callbacks.TensorBoard(run_logdir,
                                                profile_batch=(100, 200))
history = model.fit(X_train, y_train, epochs=20,
                    validation_data=(X_valid, y_valid),
                    callbacks=[tensorboard_cb])

In [None]:
print("my_logs")
for path in sorted(Path("my_logs").glob("**/*")):
    print("  " * (len(path.parts) - 1) + path.parts[-1])

Let's load the `tensorboard` Jupyter extension and start the TensorBoard server:

In [None]:
%load_ext tensorboard
%tensorboard --logdir=./my_logs

**Note**: if you prefer to access TensorBoard in a separate tab, click the "localhost:6006" link below:

In [None]:
# extra code

if "google.colab" in sys.modules:
    from google.colab import output

    output.serve_kernel_port_as_window(6006)
else:
    from IPython.display import display, HTML

    display(HTML('<a href="http://localhost:6006/">http://localhost:6006/</a>'))

You can also visualize histograms, images, text, and even listen to audio using TensorBoard:

In [None]:
test_logdir = get_run_logdir()
writer = tf.summary.create_file_writer(str(test_logdir))
with writer.as_default():
    for step in range(1, 1000 + 1):
        tf.summary.scalar("my_scalar", np.sin(step / 10), step=step)

        data = (np.random.randn(100) + 2) * step / 100  # gets larger
        tf.summary.histogram("my_hist", data, buckets=50, step=step)

        images = np.random.rand(2, 32, 32, 3) * step / 1000  # gets brighter
        tf.summary.image("my_images", images, step=step)

        texts = ["The step is " + str(step), "Its square is " + str(step ** 2)]
        tf.summary.text("my_text", texts, step=step)

        sine_wave = tf.math.sin(tf.range(12000) / 48000 * 2 * np.pi * step)
        audio = tf.reshape(tf.cast(sine_wave, tf.float32), [1, -1, 1])
        tf.summary.audio("my_audio", audio, sample_rate=48000, step=step)

**Note**: it used to be possible to easily share your TensorBoard logs with the world by uploading them to https://tensorboard.dev/. Sadly, this service will shut down in December 2023, so I have removed the corresponding code examples from this notebook.

When you stop this Jupyter kernel (a.k.a. Runtime), it will automatically stop the TensorBoard server as well. Another way to stop the TensorBoard server is to kill it, if you are running on Linux or MacOSX. First, you need to find its process ID:

In [None]:
# extra code – lists all running TensorBoard server instances

from tensorboard import notebook

notebook.list()

Next you can use the following command on Linux or MacOSX, replacing `<pid>` with the pid listed above:

    !kill <pid>

On Windows:

    !taskkill /F /PID <pid>

# ✅ **AI – Quick Answers**

### **• What is it?**

A system that learns patterns from data and produces predictions, decisions, or generated content.

### **• How does it reason?**

By matching patterns → computing probabilities → choosing the most likely outcome.
(LLMs: token-by-token prediction).

### **• Where does it fail?**

When data is missing, ambiguous, biased, or when tasks require true understanding, memory, or reasoning beyond patterns.

### **• When should I use it?**

For pattern-heavy tasks: classification, summarization, prediction, recommendation, automation, chatbot, vision, etc.

### **• What is the mental model?**

Treat AI like a very smart autocomplete:
*“It continues patterns from huge training data, not by understanding like humans.”*

### **• How do I prompt it?**

Clear intent → role → constraints → examples → output style.
Template:
**You are X → Do Y → Under Z constraints → In this format → Using these examples.**

### **• What are alternatives?**

Rules, algorithms, search, statistics, databases, heuristics, automation scripts.

---

## ✅ **Code Explanation (Simple)**

Below is a **short, section-wise explanation**.

---

## **1. Colab Check + Plugin Install**

```python
if "google.colab" in sys.modules:
    %pip install -q -U tensorboard-plugin-profile
```

If running in Google Colab → install extra TensorBoard profiling plugin.

---

## **2. Clear old logs**

```python
shutil.rmtree("my_logs", ignore_errors=True)
```

Deletes previous TensorBoard logs (fresh run).

---

## **3. Helper to create timestamped log folders**

```python
def get_run_logdir(root_logdir="my_logs"):
    return Path(root_logdir) / strftime("run_%Y_%m_%d_%H_%M_%S")
```

Every training session goes into a new folder like:

```
my_logs/run_2025_11_18_08_45_31
```
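
A side benefit of this naming scheme is that run folders sort chronologically by name. The sketch below shows this using only the standard library. The `when` parameter is a hypothetical addition that makes the timestamp controllable for the demonstration; it is not part of the notebook's `get_run_logdir()`.

```python
from pathlib import Path
from time import strftime, strptime

def get_run_logdir(root_logdir="my_logs", when=None):
    """Build a timestamped run directory path (`when` is a hypothetical test hook)."""
    fmt = "run_%Y_%m_%d_%H_%M_%S"
    name = strftime(fmt) if when is None else strftime(fmt, when)
    return Path(root_logdir) / name

earlier = strptime("2025-11-18 08:45:31", "%Y-%m-%d %H:%M:%S")
later = strptime("2025-11-18 09:00:00", "%Y-%m-%d %H:%M:%S")
run_a = get_run_logdir(when=earlier)
run_b = get_run_logdir(when=later)

# Zero-padded fields from most to least significant: name order equals time order
print(sorted([run_b, run_a]))
```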

---

## **4. Prepare model**

```python
tf.keras.backend.clear_session()
tf.random.set_seed(42)
norm_layer = tf.keras.layers.Normalization(...)
model = tf.keras.Sequential([...])
```

* Clears previous graphs
* Fixes randomness for reproducibility
* Adds a normalization layer
* Builds a dense neural network

---

## **5. Compile + Adapt Normalization**

```python
model.compile(loss="mse", optimizer=optimizer, metrics=["RootMeanSquaredError"])
norm_layer.adapt(X_train)
```

* Prepares model for regression
* Learns normalization statistics from training data

---

## **6. TensorBoard callback**

```python
tensorboard_cb = tf.keras.callbacks.TensorBoard(run_logdir,
                                                profile_batch=(100, 200))
```

Writes TensorBoard logs and profiles the network between batches 100 and 200.

---

## **7. Train model**

```python
history = model.fit(..., callbacks=[tensorboard_cb])
```

Stores training curves (loss, RMSE) in TensorBoard logs.

---

## **8. Print log folder tree**

Useful for checking file structure.

---

## **9. Start TensorBoard**

```python
%load_ext tensorboard
%tensorboard --logdir=./my_logs
```

Launches TensorBoard UI.

---

## **10. Colab vs Local Browser**

Opens TensorBoard in:

* A new Colab window (Colab)
* A local link (Jupyter)

---

## **11. Write custom summaries**

```python
with writer.as_default():
    tf.summary.scalar(...)
    tf.summary.histogram(...)
    tf.summary.image(...)
    tf.summary.text(...)
    tf.summary.audio(...)
```

This section manually generates data for TensorBoard demos:

| Summary     | Meaning                             |
| ----------- | ----------------------------------- |
| `scalar`    | Line chart (sin wave)               |
| `histogram` | Distribution plot (values increase) |
| `image`     | Random images getting brighter      |
| `text`      | Text logs                           |
| `audio`     | Synthetic tone (sine wave)          |

You’ll see all these inside TensorBoard.
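
The audio summary is the least obvious of these, so here is a NumPy sketch of how the tone is built. It mirrors the TensorFlow expression used in the notebook: with 12,000 samples at 48 kHz the clip lasts 0.25 seconds, and the step number doubles as the tone frequency in Hz. The value `step = 440` is an arbitrary choice for the example.

```python
import numpy as np

sample_rate = 48000
n_samples = 12000   # 12000 / 48000 = 0.25 seconds of audio
step = 440          # hypothetical step; also the tone frequency in Hz

t = np.arange(n_samples) / sample_rate   # time of each sample, in seconds
sine_wave = np.sin(2 * np.pi * step * t)

# Audio summaries expect a batch of clips shaped [clips, frames, channels]
audio = sine_wave.astype(np.float32).reshape(1, -1, 1)
print(audio.shape, n_samples / sample_rate)
```

Since the frequency equals the step, the pitch rises as you move the step slider in TensorBoard.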

---

## **12. Show TensorBoard sessions**

```python
notebook.list()
```

Lists running TensorBoard instances.

---



# Fine-Tuning Neural Network Hyperparameters

In this section we'll use the Fashion MNIST dataset again:

In [None]:
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist
X_train, y_train = X_train_full[:-5000], y_train_full[:-5000]
X_valid, y_valid = X_train_full[-5000:], y_train_full[-5000:]

In [None]:
tf.keras.backend.clear_session()
tf.random.set_seed(42)

In [None]:
if "google.colab" in sys.modules:
    %pip install -q -U keras_tuner~=1.4.6

In [None]:
import keras_tuner as kt

def build_model(hp):
    n_hidden = hp.Int("n_hidden", min_value=0, max_value=8, default=2)
    n_neurons = hp.Int("n_neurons", min_value=16, max_value=256)
    learning_rate = hp.Float("learning_rate", min_value=1e-4, max_value=1e-2,
                             sampling="log")
    optimizer = hp.Choice("optimizer", values=["sgd", "adam"])
    if optimizer == "sgd":
        optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate)
    else:
        optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)

    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Flatten())
    for _ in range(n_hidden):
        model.add(tf.keras.layers.Dense(n_neurons, activation="relu"))
    model.add(tf.keras.layers.Dense(10, activation="softmax"))
    model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer,
                  metrics=["accuracy"])
    return model

In [None]:
random_search_tuner = kt.RandomSearch(
    build_model, objective="val_accuracy", max_trials=5, overwrite=True,
    directory="my_fashion_mnist", project_name="my_rnd_search", seed=42)
random_search_tuner.search(X_train, y_train, epochs=10,
                           validation_data=(X_valid, y_valid))

In [None]:
top3_models = random_search_tuner.get_best_models(num_models=3)
best_model = top3_models[0]

In [None]:
top3_params = random_search_tuner.get_best_hyperparameters(num_trials=3)
top3_params[0].values  # best hyperparameter values

In [None]:
best_trial = random_search_tuner.oracle.get_best_trials(num_trials=1)[0]
best_trial.summary()

In [None]:
best_trial.metrics.get_last_value("val_accuracy")

In [None]:
best_model.fit(X_train_full, y_train_full, epochs=10)
test_loss, test_accuracy = best_model.evaluate(X_test, y_test)

In [None]:
class MyClassificationHyperModel(kt.HyperModel):
    def build(self, hp):
        return build_model(hp)

    def fit(self, hp, model, X, y, **kwargs):
        if hp.Boolean("normalize"):
            norm_layer = tf.keras.layers.Normalization()
            norm_layer.adapt(X)
            X = norm_layer(X)
        return model.fit(X, y, **kwargs)

In [None]:
hyperband_tuner = kt.Hyperband(
    MyClassificationHyperModel(), objective="val_accuracy", seed=42,
    max_epochs=10, factor=3, hyperband_iterations=2,
    overwrite=True, directory="my_fashion_mnist", project_name="hyperband")

In [None]:
root_logdir = Path(hyperband_tuner.project_dir) / "tensorboard"
tensorboard_cb = tf.keras.callbacks.TensorBoard(root_logdir)
early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=2)
hyperband_tuner.search(X_train, y_train, epochs=10,
                       validation_data=(X_valid, y_valid),
                       callbacks=[early_stopping_cb, tensorboard_cb])

In [None]:
bayesian_opt_tuner = kt.BayesianOptimization(
    MyClassificationHyperModel(), objective="val_accuracy", seed=42,
    max_trials=10, alpha=1e-4, beta=2.6,
    overwrite=True, directory="my_fashion_mnist", project_name="bayesian_opt")
bayesian_opt_tuner.search(X_train, y_train, epochs=10,
                          validation_data=(X_valid, y_valid),
                          callbacks=[early_stopping_cb])

In [None]:
%tensorboard --logdir {root_logdir}

# Exercise solutions

## 1. to 9.

1. Visit the [TensorFlow Playground](https://playground.tensorflow.org/) and play around with it, as described in this exercise.
2. Here is a neural network based on the original artificial neurons that computes _A_ ⊕ _B_ (where ⊕ represents the exclusive OR), using the fact that _A_ ⊕ _B_ = (_A_ ∧ ¬ _B_) ∨ (¬ _A_ ∧ _B_). There are other solutions: for example, using the fact that _A_ ⊕ _B_ = (_A_ ∨ _B_) ∧ ¬(_A_ ∧ _B_), or the fact that _A_ ⊕ _B_ = (_A_ ∨ _B_) ∧ (¬ _A_ ∨ ¬ _B_), and so on.<br /><img width="70%" src="https://github.com/macsrc/mac-handson-ml3/blob/handson-ml-241025/images/ann/exercise2.png?raw=1" />
3. A classical Perceptron will converge only if the dataset is linearly separable, and it won't be able to estimate class probabilities. In contrast, a Logistic Regression classifier will generally converge to a reasonably good solution even if the dataset is not linearly separable, and it will output class probabilities. If you change the Perceptron's activation function to the sigmoid activation function (or the softmax activation function if there are multiple neurons), and if you train it using Gradient Descent (or some other optimization algorithm minimizing the cost function, typically cross entropy), then it becomes equivalent to a Logistic Regression classifier.
4. The sigmoid activation function was a key ingredient in training the first MLPs because its derivative is always nonzero, so Gradient Descent can always roll down the slope. When the activation function is a step function, Gradient Descent cannot move, as there is no slope at all.
5. Popular activation functions include the step function, the sigmoid function, the hyperbolic tangent (tanh) function, and the Rectified Linear Unit (ReLU) function (see Figure 10-8). See Chapter 11 for other examples, such as ELU and variants of the ReLU function.
6. Considering the MLP described in the question, composed of one input layer with 10 passthrough neurons, followed by one hidden layer with 50 artificial neurons, and finally one output layer with 3 artificial neurons, where all artificial neurons use the ReLU activation function:
    * The shape of the input matrix **X** is _m_ × 10, where _m_ represents the training batch size.
    * The shape of the hidden layer's weight matrix **W**<sub>_h_</sub> is 10 × 50, and the length of its bias vector **b**<sub>_h_</sub> is 50.
    * The shape of the output layer's weight matrix **W**<sub>_o_</sub> is 50 × 3, and the length of its bias vector **b**<sub>_o_</sub> is 3.
    * The shape of the network's output matrix **Y** is _m_ × 3.
    * **Y** = ReLU(ReLU(**X** **W**<sub>_h_</sub> + **b**<sub>_h_</sub>) **W**<sub>_o_</sub> + **b**<sub>_o_</sub>). Recall that the ReLU function just sets every negative number in the matrix to zero. Also note that when you are adding a bias vector to a matrix, it is added to every single row in the matrix, which is called _broadcasting_.
7. To classify email into spam or ham, you just need one neuron in the output layer of a neural network, for example one indicating the probability that the email is spam. You would typically use the sigmoid activation function in the output layer when estimating a probability. If instead you want to tackle MNIST, you need 10 neurons in the output layer, and you must replace the sigmoid function with the softmax activation function, which can handle multiple classes, outputting one probability per class. If you want your neural network to predict housing prices like in Chapter 2, then you need one output neuron, using no activation function at all in the output layer. Note: when the values to predict can vary by many orders of magnitude, you may want to predict the logarithm of the target value rather than the target value directly. Simply computing the exponential of the neural network's output will give you the estimated value (since exp(log _v_) = _v_).
8. Backpropagation is a technique used to train artificial neural networks. It first computes the gradients of the cost function with regard to every model parameter (all the weights and biases), then it performs a Gradient Descent step using these gradients. This backpropagation step is typically performed thousands or millions of times, using many training batches, until the model parameters converge to values that (hopefully) minimize the cost function. To compute the gradients, backpropagation uses reverse-mode autodiff (although it wasn't called that when backpropagation was invented, and it has been reinvented several times). Reverse-mode autodiff performs a forward pass through a computation graph, computing every node's value for the current training batch, and then it performs a reverse pass, computing all the gradients at once (see Appendix B for more details). So what's the difference? Well, backpropagation refers to the whole process of training an artificial neural network using multiple backpropagation steps, each of which computes gradients and uses them to perform a Gradient Descent step. In contrast, reverse-mode autodiff is just a technique to compute gradients efficiently, and it happens to be used by backpropagation.
9. Here is a list of all the hyperparameters you can tweak in a basic MLP: the number of hidden layers, the number of neurons in each hidden layer, and the activation function used in each hidden layer and in the output layer. In general, the ReLU activation function (or one of its variants; see Chapter 11) is a good default for the hidden layers. For the output layer, in general you will want the sigmoid activation function for binary classification, the softmax activation function for multiclass classification, or no activation function for regression. If the MLP overfits the training data, you can try reducing the number of hidden layers and reducing the number of neurons per hidden layer.
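
The shapes given in the answer to exercise 6 can be checked with a short NumPy sketch. The batch size `m = 4` and the random weights are assumptions made for the example; only the layer sizes (10, 50, 3) come from the exercise.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 4                                  # hypothetical batch size

X = rng.normal(size=(m, 10))           # input matrix: m x 10
W_h = rng.normal(size=(10, 50))        # hidden layer weights: 10 x 50
b_h = np.zeros(50)                     # hidden layer biases: length 50
W_o = rng.normal(size=(50, 3))         # output layer weights: 50 x 3
b_o = np.zeros(3)                      # output layer biases: length 3

def relu(Z):
    """Set every negative entry to zero."""
    return np.maximum(Z, 0)

# Y = ReLU(ReLU(X W_h + b_h) W_o + b_o); the bias vectors are broadcast across rows
H = relu(X @ W_h + b_h)
Y = relu(H @ W_o + b_o)
print(H.shape, Y.shape)  # (4, 50) (4, 3)
```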

## 10.

*Exercise: Train a deep MLP on the MNIST dataset (you can load it using `tf.keras.datasets.mnist.load_data()`). See if you can get over 98% accuracy by manually tuning the hyperparameters. Try searching for the optimal learning rate by using the approach presented in this chapter (i.e., by growing the learning rate exponentially, plotting the loss, and finding the point where the loss shoots up). Next, try tuning the hyperparameters using Keras Tuner with all the bells and whistles: save checkpoints, use early stopping, and plot learning curves using TensorBoard.*

**TODO**: update this solution to use Keras Tuner.

Let's load the dataset:

In [None]:
(X_train_full, y_train_full), (X_test, y_test) = tf.keras.datasets.mnist.load_data()

Just like for the Fashion MNIST dataset, the MNIST training set contains 60,000 grayscale images, each 28x28 pixels:

In [None]:
X_train_full.shape

Each pixel intensity is also represented as a byte (0 to 255):

In [None]:
X_train_full.dtype

Let's split the full training set into a validation set and a (smaller) training set. We also scale the pixel intensities down to the 0-1 range and convert them to floats, by dividing by 255, just like we did for Fashion MNIST:

In [None]:
X_valid, X_train = X_train_full[:5000] / 255., X_train_full[5000:] / 255.
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
X_test = X_test / 255.

Let's plot an image using Matplotlib's `imshow()` function, with a `'binary'` color map:

In [None]:
plt.imshow(X_train[0], cmap="binary")
plt.axis('off')
plt.show()

The labels are the class IDs (represented as uint8), from 0 to 9. Conveniently, the class IDs correspond to the digits represented in the images, so we don't need a `class_names` array:

In [None]:
y_train

The validation set contains 5,000 images, and the test set contains 10,000 images:

In [None]:
X_valid.shape

In [None]:
X_test.shape

Let's take a look at a sample of the images in the dataset:

In [None]:
n_rows = 4
n_cols = 10
plt.figure(figsize=(n_cols * 1.2, n_rows * 1.2))
for row in range(n_rows):
    for col in range(n_cols):
        index = n_cols * row + col
        plt.subplot(n_rows, n_cols, index + 1)
        plt.imshow(X_train[index], cmap="binary", interpolation="nearest")
        plt.axis('off')
        plt.title(y_train[index])
plt.subplots_adjust(wspace=0.2, hspace=0.5)
plt.show()

Let's build a simple dense network and find the optimal learning rate. We will need a callback to grow the learning rate at each iteration. It will also record the learning rate and the loss at each iteration:

In [None]:
K = tf.keras.backend

class ExponentialLearningRate(tf.keras.callbacks.Callback):
    def __init__(self, factor):
        self.factor = factor
        self.rates = []
        self.losses = []

    def on_batch_end(self, batch, logs=None):
        lr = self.model.optimizer.learning_rate.numpy() * self.factor
        self.model.optimizer.learning_rate = lr
        self.rates.append(lr)
        self.losses.append(logs["loss"])

In [None]:
tf.keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax")
])

We will start with a small learning rate of 1e-3, and grow it by 0.5% at each iteration:

In [None]:
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-3)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer,
              metrics=["accuracy"])
expon_lr = ExponentialLearningRate(factor=1.005)
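
To get a feel for how far a 0.5% increase per iteration takes the learning rate, here is a back-of-the-envelope sketch. It assumes the 55,000-image training set from the split above and Keras's default batch size of 32; change `batch_size` if you pass a different value to `fit()`.

```python
import math

initial_lr = 1e-3
factor = 1.005
n_train = 55_000   # 60,000 MNIST training images minus 5,000 held out for validation
batch_size = 32    # assumption: Keras's default batch size for fit()

# One iteration per batch, so one epoch performs this many learning rate updates
n_iterations = math.ceil(n_train / batch_size)
final_lr = initial_lr * factor ** n_iterations

def iterations_to_reach(target_lr):
    """Number of iterations for the rate to grow from initial_lr to target_lr."""
    return math.log(target_lr / initial_lr) / math.log(factor)

print(n_iterations, round(final_lr, 2), round(iterations_to_reach(1.0)))
```

Under these assumptions a single epoch sweeps the learning rate across more than three orders of magnitude, which is why one epoch is enough for this search.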

Now let's train the model for just 1 epoch:

In [None]:
history = model.fit(X_train, y_train, epochs=1,
                    validation_data=(X_valid, y_valid),
                    callbacks=[expon_lr])

We can now plot the loss as a function of the learning rate:

In [None]:
plt.plot(expon_lr.rates, expon_lr.losses)
plt.gca().set_xscale('log')
plt.hlines(min(expon_lr.losses), min(expon_lr.rates), max(expon_lr.rates))
plt.axis([min(expon_lr.rates), max(expon_lr.rates), 0, expon_lr.losses[0]])
plt.grid()
plt.xlabel("Learning rate")
plt.ylabel("Loss")

The loss starts shooting back up violently when the learning rate goes over 6e-1, so let's try using half of that, at 3e-1:

In [None]:
tf.keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax")
])

In [None]:
optimizer = tf.keras.optimizers.SGD(learning_rate=3e-1)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer,
              metrics=["accuracy"])

In [None]:
run_index = 1 # increment this at every run
run_logdir = Path() / "my_mnist_logs" / "run_{:03d}".format(run_index)
run_logdir

In [None]:
early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=20)
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint("my_mnist_model.keras", save_best_only=True)
tensorboard_cb = tf.keras.callbacks.TensorBoard(run_logdir)

history = model.fit(X_train, y_train, epochs=100,
                    validation_data=(X_valid, y_valid),
                    callbacks=[checkpoint_cb, early_stopping_cb, tensorboard_cb])

In [None]:
model = tf.keras.models.load_model("my_mnist_model.keras") # rollback to best model
model.evaluate(X_test, y_test)

We got over 98% accuracy. Finally, let's look at the learning curves using TensorBoard:

In [None]:
%tensorboard --logdir=./my_mnist_logs