# Chapter 10: Introduction to Artificial Neural Networks with Keras

**Reference:** Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (Aurélien Géron)

---

## 1. Chapter Summary

This chapter introduces Artificial Neural Networks (ANNs), the core of Deep Learning. Inspired by biological neurons, ANNs are versatile, powerful, and scalable models ideal for complex tasks like image classification, speech recognition, and natural language processing.

**Key Topics:**
1.  **From Biological to Artificial Neurons:** The evolution of ANNs from the simple Perceptron to the Multilayer Perceptron (MLP).
2.  **Training with Backpropagation:** How modern neural networks learn using Gradient Descent and the reverse-mode autodiff algorithm.
3.  **Keras API:** Using the high-level Keras API (integrated into TensorFlow) to build, train, and evaluate deep learning models easily.
4.  **Building Models:**
    * **Sequential API:** For simple stacks of layers.
    * **Functional API:** For complex topologies (multiple inputs/outputs).
    * **Subclassing API:** For fully dynamic models.
5.  **Hyperparameter Tuning:** Strategies to optimize learning rate, number of layers, neurons per layer, and activation functions.

## 2. Theoretical Explanations

### A. The Perceptron

**Mechanism:**
The Perceptron is one of the simplest ANN architectures (invented in 1957). It is based on a **Threshold Logic Unit (TLU)**.
* **Input:** The TLU takes numerical inputs ($x_1, x_2, ..., x_n$), each associated with a weight ($w_1, w_2, ..., w_n$).
* **Computation:** It computes a weighted sum of its inputs ($z$).

$$ z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b = \mathbf{w}^T \mathbf{x} + b $$

* **Activation:** It applies a step function (e.g., Heaviside step function) to this sum and outputs the result ($h_\mathbf{w}(\mathbf{x})$).

$$ h_\mathbf{w}(\mathbf{x}) = \text{step}(z) $$
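
The two formulas above can be sketched in a few lines of plain Python. This is only an illustration: the function names and the weights `w_and`, `b_and` are hypothetical values chosen by hand so that the TLU behaves like a logical AND gate.

```python
def heaviside(z):
    """Heaviside step function: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0


def tlu(x, w, b):
    """Threshold logic unit: step(w . x + b)."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return heaviside(z)


# Hand-picked (hypothetical) weights that make the TLU act as an AND gate.
w_and, b_and = [1.0, 1.0], -1.5
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_outputs = [tlu(x, w_and, b_and) for x in inputs]
print(and_outputs)
```

Only the input `(1, 1)` pushes the weighted sum above the threshold, so the unit fires for that input alone.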

**Training (Hebb's Rule):**
Hebb's rule is often summarized as "Cells that fire together, wire together." The Perceptron is trained with a variant of this rule that takes the prediction error into account: it reinforces the connections that help reduce the error. When an output neuron makes a wrong prediction, the weights from the inputs that would have contributed to the correct prediction are strengthened. The weight update rule is:

$$ w_{i,j}^{(\text{next step})} = w_{i,j} + \eta (y_j - \hat{y}_j) x_i $$

Where:
* $w_{i,j}$ is the connection weight between input $i$ and neuron $j$.
* $x_i$ is the $i^{th}$ input value.
* $\hat{y}_j$ is the output of neuron $j$.
* $y_j$ is the target output of neuron $j$.
* $\eta$ is the learning rate.
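
As a minimal sketch of this update rule, the snippet below trains a single TLU on the logical AND function. It assumes one output neuron, treats the bias as a weight on a constant input of 1, and uses integer weights with $\eta = 1$ to keep the arithmetic exact. The function name `train_perceptron` and the early exit when an epoch produces no errors are illustrative choices, not part of the original notes.

```python
def step(z):
    return 1 if z >= 0 else 0


def train_perceptron(samples, targets, eta=1, epochs=20):
    """Apply w_i <- w_i + eta * (y - y_hat) * x_i, one sample at a time."""
    w = [0] * len(samples[0])
    b = 0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(samples, targets):
            y_hat = step(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            delta = eta * (y - y_hat)
            w = [w_i + delta * x_i for w_i, x_i in zip(w, x)]
            b += delta  # the bias behaves like a weight on a constant input of 1
            errors += int(y != y_hat)
        if errors == 0:  # every sample classified correctly: stop early
            break
    return w, b


X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y_and = [0, 0, 0, 1]
w, b = train_perceptron(X, y_and)
predictions = [step(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b) for x in X]
print(w, b, predictions)
```

Because AND is linearly separable, the rule settles on weights that classify all four inputs correctly.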

**Limitation:**
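A single-layer Perceptron can only solve linearly separable problems, as is true of any linear classifier (such as a linear SVM). It famously cannot solve the XOR (Exclusive OR) problem, because no single straight line separates the two XOR classes.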

### B. The Multilayer Perceptron (MLP) and Backpropagation

**Mechanism:**
An MLP stacks multiple layers of neurons:
1.  **Input Layer:** Receives the features.
2.  **Hidden Layers:** One or more layers of neurons that transform the inputs into intermediate representations.
3.  **Output Layer:** The final layer that produces the prediction.

Adding hidden layers allows the network to model complex non-linear relationships, solving problems like XOR.
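
To make the XOR claim concrete, here is a small sketch of a two-layer network of step units with hand-picked weights (no training involved). The weights are an assumption chosen for illustration: one hidden unit acts as AND, the other as OR, and the output unit fires when OR is on but AND is off.

```python
def step(z):
    return 1 if z >= 0 else 0


def xor_mlp(x1, x2):
    # Hidden layer: two threshold units with hand-picked weights.
    h_and = step(x1 + x2 - 1.5)  # fires only when both inputs are 1
    h_or = step(x1 + x2 - 0.5)   # fires when at least one input is 1
    # Output unit: OR and not AND, which is exactly XOR.
    return step(-h_and + h_or - 0.5)


xor_outputs = [xor_mlp(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(xor_outputs)
```

The hidden layer remaps the inputs into a space where the two classes become linearly separable, which a single TLU cannot do on its own.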

**Backpropagation Algorithm:**
How do we train a deep stack of layers? The breakthrough came with Backpropagation (Rumelhart et al., 1986). It is essentially **Gradient Descent** applied efficiently to a multi-layer network using the Chain Rule.

1.  **Forward Pass:** The data flows through the network. For each layer $l$, the output is computed as:
    $$ \mathbf{a}^{(l)} = \phi(\mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}) $$
    Where $\phi$ is the activation function, $\mathbf{W}$ are the weights, and $\mathbf{b}$ is the bias.

2.  **Compute Loss:** Calculate the error between prediction and target (e.g., using MSE or Cross-Entropy).

3.  **Reverse Pass (Error Propagation):** The algorithm propagates the error gradient backward. It calculates the gradient of the loss function with respect to each weight using the chain rule:
    $$ \frac{\partial J}{\partial w_{i,j}} = \frac{\partial J}{\partial \text{output}} \cdot \frac{\partial \text{output}}{\partial \text{hidden}} \cdot \dots \cdot \frac{\partial \text{hidden}}{\partial w_{i,j}} $$

4.  **Update Step:** It tweaks the connection weights to reduce the error using Gradient Descent:
    $$ \mathbf{W} \leftarrow \mathbf{W} - \eta \nabla_{\mathbf{W}} J(\mathbf{W}) $$
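
The four steps can be traced by hand on a deliberately tiny network. The sketch below assumes one scalar input, one sigmoid hidden unit, a linear output, and the loss $J = \frac{1}{2}(\hat{y} - y)^2$; the starting parameters and the training sample are made-up values. It checks the chain-rule gradients against finite differences, then runs a few Gradient Descent updates.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    w1, b1, w2, b2 = params
    a1 = sigmoid(w1 * x + b1)  # hidden activation
    y_hat = w2 * a1 + b2       # linear output
    return a1, y_hat


def loss(params, x, y):
    _, y_hat = forward(params, x)
    return 0.5 * (y_hat - y) ** 2


def backward(params, x, y):
    """Reverse pass: apply the chain rule from the loss back to each parameter."""
    w1, b1, w2, b2 = params
    a1, y_hat = forward(params, x)
    d_yhat = y_hat - y                # dJ/dy_hat
    d_w2, d_b2 = d_yhat * a1, d_yhat  # output-layer gradients
    d_a1 = d_yhat * w2                # error propagated to the hidden unit
    d_z1 = d_a1 * a1 * (1.0 - a1)     # through the sigmoid derivative
    return [d_z1 * x, d_z1, d_w2, d_b2]


def numerical_gradient(params, x, y, eps=1e-6):
    """Central finite differences, used only to check the analytic gradients."""
    grads = []
    for i in range(len(params)):
        plus, minus = list(params), list(params)
        plus[i] += eps
        minus[i] -= eps
        grads.append((loss(plus, x, y) - loss(minus, x, y)) / (2 * eps))
    return grads


params = [0.5, -0.3, 0.8, 0.1]  # made-up starting values for w1, b1, w2, b2
x, y = 1.5, 2.0                 # a single made-up training sample
max_gap = max(abs(a - n) for a, n in
              zip(backward(params, x, y), numerical_gradient(params, x, y)))

eta = 0.1
initial_loss = loss(params, x, y)
for _ in range(50):  # update step: W <- W - eta * gradient
    grads = backward(params, x, y)
    params = [p - eta * g for p, g in zip(params, grads)]
final_loss = loss(params, x, y)
print(max_gap, initial_loss, final_loss)
```

In practice Keras computes these gradients automatically; the manual version only shows where each factor of the chain rule comes from.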

**Crucial Component: Activation Functions:**
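For backpropagation to work, the activation function needs a usable gradient. The step function is flat everywhere (its derivative is 0 wherever it is defined), so Gradient Descent has no slope to follow. Non-linearity matters as well: a stack of purely linear layers collapses into a single linear transformation. Common choices: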
* **Sigmoid (Logistic):** Outputs 0 to 1. Good for probability estimation but suffers from vanishing gradients.
    $$ \sigma(z) = \frac{1}{1 + e^{-z}} $$
* **Tanh (Hyperbolic Tangent):** Outputs -1 to 1. Similar to sigmoid but centered around 0, which often helps speed up convergence.
    $$ \tanh(z) = 2\sigma(2z) - 1 $$
* **ReLU (Rectified Linear Unit):** The default choice for hidden layers. It is fast and reduces vanishing gradient issues.
    $$ \text{ReLU}(z) = \max(0, z) $$
* **Softmax:** Used in the output layer for multiclass classification. It converts scores into probabilities that sum to 1.
    $$ \hat{p}_k = \frac{\exp(s_k(\mathbf{x}))}{\sum_{j=1}^{K} \exp(s_j(\mathbf{x}))} $$
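
The formulas above translate directly into plain Python. This sketch is for illustration only: the scores passed to `softmax` are made up, and subtracting the maximum score before exponentiating is a common numerical-stability trick that is not part of the formula itself (it leaves the result unchanged).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def tanh_via_sigmoid(z):
    # Uses the identity tanh(z) = 2 * sigmoid(2z) - 1.
    return 2.0 * sigmoid(2.0 * z) - 1.0


def relu(z):
    return max(0.0, z)


def softmax(scores):
    shift = max(scores)  # stability trick: does not change the probabilities
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # made-up class scores
print(sigmoid(0.0), tanh_via_sigmoid(0.7), relu(-3.0), probs)
```

The largest score receives the largest probability, and the probabilities sum to 1.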

### C. Fine-Tuning Neural Network Hyperparameters

Neural Networks have many knobs to turn (topology, learning rate, batch size, activation functions). 
* **Number of Hidden Layers:** Deep networks (many layers) are generally more parameter-efficient than wide networks (one layer with many neurons) for complex tasks because they can learn hierarchical features (e.g., lines -> shapes -> objects).
* **Neurons per Layer:** A common strategy is to construct a pyramid (fewer neurons at each subsequent layer), but using the same number of neurons in all hidden layers often works just as well and is simpler to tune.
* **Learning Rate ($\eta$):** Arguably the most important hyperparameter. If it is too low, training is slow; if it is too high, training may diverge.
* **Batch Size:** Large batches give stable gradient estimates and use hardware (GPUs) efficiently, but may generalize worse. Small batches give noisy gradients, which can help escape local optima, but take longer per epoch.
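
The effect of the learning rate is easy to see on a toy problem. The sketch below assumes the one-dimensional loss $J(w) = w^2$ (gradient $2w$), which stands in for a real network's loss surface; the three learning rates are arbitrary values picked to show the slow, well-behaved, and divergent regimes.

```python
def gradient_descent(eta, w=1.0, steps=20):
    """Minimize J(w) = w**2 starting from w, using the gradient 2 * w."""
    for _ in range(steps):
        w = w - eta * 2.0 * w
    return w


w_slow = gradient_descent(eta=0.01)     # too low: still far from the minimum at 0
w_good = gradient_descent(eta=0.4)      # reasonable: very close to 0
w_diverged = gradient_descent(eta=1.1)  # too high: each step overshoots further
print(w_slow, w_good, w_diverged)
```

Each update multiplies `w` by `1 - 2 * eta`, so any learning rate above 1 makes the magnitude grow on this particular function.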

## 3. Step-by-Step Implementation with Keras

### A. Image Classification with the Sequential API
We will build a Multilayer Perceptron to classify Fashion MNIST images (grayscale images of clothing items).

In [None]:
import tensorflow as tf
from tensorflow import keras
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# 1. Load the Dataset
# Fashion MNIST is a drop-in replacement for MNIST (70,000 images, 10 classes, 28x28 pixels).
fashion_mnist = keras.datasets.fashion_mnist
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()

# 2. Preprocessing
# Split the full training set into a validation set and a (smaller) training set.
# Scale pixel intensities from the 0-255 range down to 0-1 (Gradient Descent converges much better on scaled inputs).
X_valid, X_train = X_train_full[:5000] / 255.0, X_train_full[5000:] / 255.0
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
X_test = X_test / 255.0

# Class names for reference
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

print("Training Data Shape:", X_train.shape)
print("Validation Data Shape:", X_valid.shape)

In [None]:
# 3. Building the Model using Sequential API
# Sequential model is the simplest Keras model for a linear stack of layers.
model = keras.models.Sequential([
    # Input Layer: Flattens the 28x28 2D array into a 1D array of 784 pixels.
    keras.layers.Flatten(input_shape=[28, 28]),
    
    # Hidden Layer 1: Dense layer with 300 neurons.
    # Activation='relu' (Rectified Linear Unit) handles non-linearity well.
    keras.layers.Dense(300, activation="relu"),
    
    # Hidden Layer 2: Dense layer with 100 neurons.
    keras.layers.Dense(100, activation="relu"),
    
    # Output Layer: 10 neurons (one per class).
    # Activation='softmax' ensures outputs sum to 1 (probabilities for each class).
    keras.layers.Dense(10, activation="softmax")
])

# 4. Summary
# Displays the network architecture and parameter count.
model.summary()
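
The parameter counts reported by `model.summary()` can be checked by hand: each Dense layer has one weight per (input, neuron) pair plus one bias per neuron. The helper name `dense_params` below is hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# Flatten has no parameters; it only reshapes 28x28 images into 784 values.
layer_sizes = [28 * 28, 300, 100, 10]
per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(per_layer)
print(per_layer, total_params)  # [235500, 30100, 1010] 266610
```

The first hidden layer alone accounts for most of the parameters, which gives the model a lot of flexibility but also raises the risk of overfitting.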

In [None]:
# 5. Compiling the Model
# We must specify the loss function, optimizer, and metrics before training.
# Loss='sparse_categorical_crossentropy': Used because we have sparse labels (integers 0-9), not one-hot vectors.
# Optimizer='sgd': Stochastic Gradient Descent with its default learning rate. To tune the learning rate,
# pass an optimizer object instead, e.g. keras.optimizers.SGD(learning_rate=0.01).
# Metrics=['accuracy']: To track classification accuracy during training.
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd",
              metrics=["accuracy"])

# 6. Training the Model
# epochs=10: The model sees the entire training set 10 times.
# validation_data: Keras evaluates loss/accuracy on this set at the end of each epoch.
# Ideally, you should see 'loss' decrease and 'val_accuracy' increase.
history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))

In [None]:
# 7. Visualizing Learning Curves
# The fit() method returns a History object containing training parameters and metrics.
# We can plot this to detect overfitting (if training loss keeps dropping but validation loss goes up).

pd.DataFrame(history.history).plot(figsize=(8, 5))
plt.grid(True)
plt.gca().set_ylim(0, 1) # set the vertical range to [0-1]
plt.title("Learning Curves")
plt.show()

# 8. Evaluation
# Evaluate on the unseen Test Set to get the final performance metric.
model.evaluate(X_test, y_test)
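
To see what the `sparse_categorical_crossentropy` loss measures, and how softmax outputs map back to class names, here is a sketch in plain Python. The probability rows are made up for illustration; they are not actual outputs of the model above, and the helper functions are hypothetical stand-ins for what Keras does internally.

```python
import math

class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]


def sparse_categorical_crossentropy(labels, probas):
    """Mean of -log(probability assigned to the true class), with integer labels."""
    return sum(-math.log(p[y]) for y, p in zip(labels, probas)) / len(labels)


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


# Made-up softmax outputs for two images (each row sums to 1).
row_boot = [0.01] * 9 + [0.91]
row_pullover = [0.02] * 2 + [0.82] + [0.02] * 7
probas = [row_boot, row_pullover]
labels = [9, 2]  # sparse labels: class indices, not one-hot vectors

loss = sparse_categorical_crossentropy(labels, probas)
predicted_names = [class_names[argmax(p)] for p in probas]
print(loss, predicted_names)
```

The loss only looks at the probability given to the correct class, so confident correct predictions drive it toward 0.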

### B. Regression MLP
We can also use MLPs for regression tasks (e.g., predicting housing prices). The main differences are:
* **Output Layer:** Only 1 neuron (for a single predicted value) and **no activation function** (to allow outputting any value range).
* **Loss Function:** Mean Squared Error (`mse`).

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Prepare Data
housing = fetch_california_housing()

# Split Train/Valid/Test
X_train_full, X_test, y_train_full, y_test = train_test_split(housing.data, housing.target, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full, random_state=42)

# Scaling is CRITICAL for Neural Networks to converge efficiently.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)
X_test = scaler.transform(X_test)

# 2. Build Regression Model
model_reg = keras.models.Sequential([
    # The first Dense layer declares the input shape itself (8 features in the
    # California housing data), so no separate Input or Flatten layer is needed.
    keras.layers.Dense(30, activation="relu", input_shape=X_train.shape[1:]),
    
    # Output layer: 1 neuron, no activation (linear output).
    keras.layers.Dense(1)
])

# 3. Compile
# Using 'mean_squared_error' for regression loss.
model_reg.compile(loss="mean_squared_error", optimizer="sgd")

# 4. Train
history_reg = model_reg.fit(X_train, y_train, epochs=20, validation_data=(X_valid, y_valid))

# 5. Predict
X_new = X_test[:3]
y_pred = model_reg.predict(X_new)
print("Predictions:", y_pred)
print("Actual:", y_test[:3])
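
The scaling step above fits the scaler on the training set only and then reuses those statistics for the validation and test sets. A rough plain-Python sketch of that idea follows; the tiny two-feature dataset is made up, and `fit_scaler` / `transform` are hypothetical helpers that mimic standardization with the population standard deviation.

```python
import statistics


def fit_scaler(rows):
    """Compute per-column mean and (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds


def transform(rows, means, stds):
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


train_rows = [[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]]  # made-up data
valid_rows = [[2.0, 500.0]]

means, stds = fit_scaler(train_rows)               # statistics from training data only
train_scaled = transform(train_rows, means, stds)
valid_scaled = transform(valid_rows, means, stds)  # reuse the training statistics
print(means, stds, valid_scaled)
```

Fitting on the training set alone keeps information about the validation and test sets from leaking into preprocessing.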

### C. Functional API for Complex Models
The Sequential API is easy but rigid (single input, single output, linear stack). The **Functional API** allows for non-linear topologies, shared layers, and multiple inputs/outputs.

Here, we build a **Wide & Deep** neural network. It connects all or part of the inputs directly to the output layer (Wide path) while also sending inputs through a stack of hidden layers (Deep path). This allows the model to learn both simple rules (linear) and complex patterns (deep).

In [None]:
# Wide & Deep Architecture Implementation

# 1. Define Inputs
input_ = keras.layers.Input(shape=X_train.shape[1:])

# 2. Deep Path
hidden1 = keras.layers.Dense(30, activation="relu")(input_)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)

# 3. Concatenate (Wide Path + Deep Path)
# We merge the raw input_ directly with the output of hidden2.
concat = keras.layers.Concatenate()([input_, hidden2])

# 4. Output Layer
output = keras.layers.Dense(1)(concat)

# 5. Create Model
model_wide_deep = keras.models.Model(inputs=[input_], outputs=[output])

# Compile and Train (same as before)
model_wide_deep.compile(loss="mse", optimizer=keras.optimizers.SGD(learning_rate=1e-3))
model_wide_deep.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid))
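
The data flow of the Wide & Deep model can be followed with a toy forward pass in plain Python. This sketch assumes only 2 input features and 2 neurons per hidden layer, with made-up weights, purely to show how the raw inputs and the deep path are concatenated before the output layer.

```python
def dense(v, weights, biases):
    """One Dense layer without activation: one weight row and one bias per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, biases)]


def relu(v):
    return [max(0.0, x) for x in v]


x = [1.0, -2.0]  # made-up input with 2 features

# Deep path: two small hidden layers with made-up weights.
hidden1 = relu(dense(x, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))
hidden2 = relu(dense(hidden1, [[1.0, 1.0], [-1.0, 0.0]], [0.5, 0.0]))

# Wide path: the raw inputs skip the hidden layers and join the deep features.
concat = x + hidden2

# The output layer sees 2 raw features + 2 deep features = 4 inputs.
output = dense(concat, [[0.5, 0.25, 2.0, 1.0]], [0.0])
print(concat, output)
```

The output neuron has a direct weight on each raw feature, so simple linear effects do not have to pass through the hidden layers.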

### D. Saving and Restoring Models
Keras makes it easy to save the entire model (architecture, weights, and optimizer state) to a single HDF5 file.

In [None]:
# Save model
model.save("my_keras_model.h5")

# Load model
loaded_model = keras.models.load_model("my_keras_model.h5")

# Verify that the model was restored by counting its layers
print("Loaded model layers:", len(loaded_model.layers))

### E. Callbacks: Early Stopping
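Instead of guessing the number of epochs, we can use the `EarlyStopping` callback. It interrupts training once the validation loss has stopped improving for a set number of epochs (the patience), which helps limit overfitting. It pairs well with `ModelCheckpoint`, which saves the best model seen so far.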

In [None]:
# Define Callbacks
# 1. ModelCheckpoint: Saves the best model observed during training (lowest validation loss).
checkpoint_cb = keras.callbacks.ModelCheckpoint("my_best_model.h5", save_best_only=True)

# 2. EarlyStopping: Stops if 'val_loss' doesn't improve for 10 epochs (patience).
# restore_best_weights=True: Reverts the model weights to the best epoch, not the last one.
early_stopping_cb = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)

# Train with callbacks
history = model_reg.fit(X_train, y_train, epochs=100,
                        validation_data=(X_valid, y_valid),
                        callbacks=[checkpoint_cb, early_stopping_cb])
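
The patience logic can be sketched in plain Python. This is a simplified illustration of the idea, not the Keras implementation: the function name `early_stopping_epoch` and the validation losses are made up, and epochs are counted from 0.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (best_epoch, stopped_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch  # new best: reset the wait
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch                 # no improvement for `patience` epochs
    return best_epoch, len(val_losses) - 1           # ran through every epoch


val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.64]  # made-up values
best_epoch, stopped_epoch = early_stopping_epoch(val_losses, patience=3)
print(best_epoch, stopped_epoch)
```

With `restore_best_weights=True`, the model ends up with the weights from `best_epoch` rather than those from the epoch where training stopped.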