# Training Neural Networks

## 1. The Training Process: An Overview

Training a neural network follows the same fundamental three-step process as training a logistic regression model.

### The 3-Step Process
1.  **Specify the Model**: Define the network's architecture (layers, neurons) and how it computes the output $\hat{y}$ (or $f(\vec{x})$) from an input $\vec{x}$. This is the **forward propagation** step.
2.  **Specify the Loss and Cost Function**:
    * The **Loss Function** $L(f(\vec{x}), y)$ measures how well the network performs on a *single* training example.
    * The **Cost Function** $J(W, B)$ is the average of the loss over the *entire* training set. It measures the performance of the current set of parameters $(W, B)$ on all the data.
    $$ J(W,B) = \frac{1}{m} \sum_{i=1}^{m} L(f(\vec{x}^{(i)}), y^{(i)}) $$
3.  **Minimize the Cost Function**: Use an optimization algorithm, like **gradient descent**, to find the values of the parameters $W$ and $B$ that minimize the cost $J(W,B)$.
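
To make the distinction in step 2 concrete, here is a minimal NumPy sketch of a cost computed as the average of per-example losses. The squared-error loss and the four predictions and labels are assumptions chosen only for illustration; they do not come from any particular model.

```python
import numpy as np

def squared_error_loss(f_x, y):
    """Loss for a single example (squared error, used here only as an example)."""
    return (y - f_x) ** 2

def cost(predictions, targets):
    """Cost J: the average of the per-example losses over all m examples."""
    losses = [squared_error_loss(f_x, y) for f_x, y in zip(predictions, targets)]
    return float(np.mean(losses))

# Hypothetical predictions f(x^(i)) and labels y^(i) for m = 4 examples
predictions = np.array([2.5, 0.0, 2.0, 8.0])
targets = np.array([3.0, -0.5, 2.0, 7.0])

J = cost(predictions, targets)
print(J)  # per-example losses are 0.25, 0.25, 0.0, 1.0, so J = 0.375
```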

### Mapping the Steps to TensorFlow
This three-step process maps directly to the primary functions in TensorFlow/Keras:
1.  **Specify Model**: `model = tf.keras.Sequential([...])`
2.  **Specify Cost**: `model.compile(loss=...)`
3.  **Minimize Cost**: `model.fit(X, y, epochs=...)`

---

## 2. Loss and Cost Functions

The choice of loss function depends on the type of problem you are solving (classification vs. regression).

- **Binary Classification (y=0 or 1)**: The standard loss is **Binary Cross-Entropy**.
  $$ L(f(\vec{x}), y) = -y \log(f(\vec{x})) - (1-y) \log(1 - f(\vec{x})) $$
  In TensorFlow, this is `tf.keras.losses.BinaryCrossentropy`.

- **Regression (y is a continuous number)**: The standard loss is **Mean Squared Error**. The loss for a single example is the squared error, and the cost is the average of this over the dataset.
  $$ L(f(\vec{x}), y) = (y - f(\vec{x}))^2 $$
  In TensorFlow, this is `tf.keras.losses.MeanSquaredError`.
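
The two per-example losses above can be sketched directly in NumPy. This is an illustrative re-implementation of the formulas, not the Keras classes themselves, and the input values are made up.

```python
import numpy as np

def binary_cross_entropy(f_x, y):
    """Per-example loss: -y*log(f(x)) - (1 - y)*log(1 - f(x))."""
    return -y * np.log(f_x) - (1 - y) * np.log(1 - f_x)

def squared_error(f_x, y):
    """Per-example loss: (y - f(x))^2."""
    return (y - f_x) ** 2

# True label y = 1: a confident correct prediction is penalized far less
# than a confident wrong one.
loss_good = binary_cross_entropy(0.9, 1)  # equals -log(0.9)
loss_bad = binary_cross_entropy(0.1, 1)   # equals -log(0.1)

print(loss_good, loss_bad)
print(squared_error(2.5, 3.0))
```

The cross-entropy loss grows without bound as the predicted probability of the true class approaches 0, which is why confident mistakes dominate the cost.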

---

## 3. Minimizing the Cost Function

To find the best parameters, we need to minimize $J(W, B)$.
- The core algorithm is **gradient descent**, which iteratively updates the parameters to move "downhill" on the cost function landscape.
- The update rule for any parameter $w_{jk}^{[l]}$ is:
  $$ w_{jk}^{[l]} = w_{jk}^{[l]} - \alpha \frac{\partial J(W,B)}{\partial w_{jk}^{[l]}} $$
  (A similar update is performed for the bias parameters $b_j^{[l]}$).
- To compute the partial derivatives ($\frac{\partial J}{\partial w}$), neural networks use an efficient algorithm called **backpropagation**. TensorFlow handles this automatically within the `model.fit()` function.
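
As a rough illustration of the update rule, the sketch below runs gradient descent on a single linear unit $f(x) = wx + b$ with a mean-squared-error cost. The data, learning rate, and iteration count are assumptions for the example, and the derivatives are written out by hand here; in a real network, backpropagation computes them.

```python
import numpy as np

# Hypothetical training data generated from y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

w, b = 0.0, 0.0   # initial parameters
alpha = 0.1       # learning rate (assumed value)

for _ in range(1000):
    f = w * x + b                        # forward pass
    dJ_dw = np.mean(2.0 * (f - y) * x)   # dJ/dw for J = mean((f - y)^2)
    dJ_db = np.mean(2.0 * (f - y))       # dJ/db
    # Both gradients are computed before either parameter is changed,
    # so the update is simultaneous.
    w = w - alpha * dJ_dw
    b = b - alpha * dJ_db

print(w, b)  # should approach w = 2, b = 1
```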

---

## 4. Activation Functions

So far we've primarily used the sigmoid function, but there are better choices, especially for hidden layers.

### Common Activation Functions
1.  **Sigmoid**: $g(z) = \frac{1}{1 + e^{-z}}$. Output is between 0 and 1. Used for binary classification output layers.
2.  **ReLU (Rectified Linear Unit)**: $g(z) = \max(0, z)$. The most common choice for hidden layers. It is cheaper to compute than sigmoid, and because it is flat only for $z < 0$ (sigmoid flattens at both ends), it helps mitigate the slow learning caused by vanishing gradients.
3.  **Linear**: $g(z) = z$. Essentially no activation function. Used for regression output layers where the output can be positive or negative.
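
The three functions are simple enough to sketch in NumPy. The function names and the sample inputs below are chosen for illustration only.

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)); output lies strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """g(z) = max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

def linear(z):
    """g(z) = z; the input passes through unchanged."""
    return z

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z))
print(relu(z))
print(linear(z))
```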

### How to Choose Activation Functions: Rules of Thumb
- **Output Layer**: The choice depends on the expected output value `y`.
    - **Binary Classification** (y is 0 or 1): Use **`sigmoid`**.
    - **Regression** (y can be +/-): Use **`linear`**.
    - **Regression** (y is non-negative, >= 0): Use **`relu`**.

- **Hidden Layers**:
    - The default and most common choice is **`relu`**. It generally leads to faster training compared to sigmoid.

### Why are Non-Linear Activations Necessary?
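- A neural network whose hidden layers use only **linear activation functions** collapses to a single linear model: with a linear output layer it is equivalent to linear regression, and with a sigmoid output layer it is equivalent to logistic regression.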
- Stacking linear layers does not increase the model's complexity. The non-linear "bends" introduced by functions like ReLU and sigmoid are what allow neural networks to learn highly complex, non-linear relationships in data.
- **Rule**: Always use a non-linear activation function (like ReLU) in your hidden layers.
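
The collapse of stacked linear layers can be checked numerically. In the sketch below, the layer sizes and random weights are arbitrary assumptions; the point is that $W_2(W_1\vec{x} + \vec{b}_1) + \vec{b}_2$ equals a single layer with weights $W_2 W_1$ and bias $W_2\vec{b}_1 + \vec{b}_2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear layers with arbitrary sizes: 4 inputs -> 3 units -> 2 units
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)
x = rng.normal(size=4)

# Output of the two-layer network with linear activations everywhere
two_layer = W2 @ (W1 @ x + b1) + b2

# The equivalent single linear layer
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))  # the two outputs agree
```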

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import BinaryCrossentropy, MeanSquaredError, SparseCategoricalCrossentropy
from tensorflow.keras.optimizers import Adam

# === Model for Binary Classification (e.g., cat vs. dog) ===
# This model uses the recommended best practices.
model_binary = Sequential([
    Dense(units=25, activation='relu'),      # Hidden layer 1: Use ReLU
    Dense(units=15, activation='relu'),      # Hidden layer 2: Use ReLU
    Dense(units=1, activation='linear')      # Output layer: Use linear (for from_logits)
])

model_binary.compile(
    loss=BinaryCrossentropy(from_logits=True), # Use from_logits=True for numerical stability
    optimizer=Adam(learning_rate=0.001)        # Use the Adam optimizer
)


# === Model for Multiclass Classification (e.g., digits 0-9) ===
# This model uses the recommended best practices.
model_multiclass = Sequential([
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=10, activation='linear') # Output layer has 10 units (for 10 classes) and is linear
])

model_multiclass.compile(
    loss=SparseCategoricalCrossentropy(from_logits=True), # Use SparseCategoricalCrossentropy with from_logits
    optimizer=Adam(learning_rate=0.001)
)


# === Model for Regression (e.g., house price) ===
model_regression = Sequential([
    Dense(units=100, activation='relu'),
    Dense(units=50, activation='relu'),
    Dense(units=1, activation='linear') # Output is a single number, so linear is a good choice
])

model_regression.compile(
    loss=MeanSquaredError(),
    optimizer=Adam(learning_rate=0.01)
)

# --- Example of making a prediction ---
# When using from_logits=True, the model's direct output is 'logits' (raw z values), not probabilities.
# To get probabilities, you must manually apply the appropriate activation function.

# For binary classification:
# logits = model_binary.predict(X_new)
# probabilities = tf.nn.sigmoid(logits)

# For multiclass classification:
# logits = model_multiclass.predict(X_new)
# probabilities = tf.nn.softmax(logits)

print("Models for binary, multiclass, and regression have been defined and compiled.")
```

Output:

```
Models for binary, multiclass, and regression have been defined and compiled.
```


---

## 5. Multiclass Classification

This applies when you have more than two classes (e.g., classifying digits 0-9).

### The Softmax Function
Softmax is a generalization of sigmoid to multiple classes. It takes a vector of $N$ real numbers ($z_1, \dots, z_N$) and converts it into $N$ probabilities that sum to 1.
- For an output layer with $N$ units (one for each class), the activation for the $j^{th}$ unit is:
  $$ a_j = \frac{e^{z_j}}{\sum_{k=1}^{N} e^{z_k}} $$
- $a_j$ can be interpreted as the probability that the input belongs to class $j$, i.e., $P(y=j | \vec{x})$.
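
A direct NumPy transcription of the softmax formula is sketched below, with made-up logits. It also checks the claim that softmax generalizes sigmoid: for two classes with logits $(z, 0)$, the first softmax output equals the sigmoid of $z$. This plain version is for illustration only and is not numerically robust for very large logits.

```python
import numpy as np

def softmax(z):
    """a_j = e^(z_j) / sum_k e^(z_k), for a 1-D vector of logits z."""
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([1.0, 2.0, 3.0])  # hypothetical logits for N = 3 classes
a = softmax(z)
print(a, a.sum())  # probabilities that sum to 1; the largest logit gets the largest share

# Two-class case: softmax over (z, 0) reduces to sigmoid(z)
z_single = 0.7
two_class = softmax(np.array([z_single, 0.0]))
print(two_class[0], sigmoid(z_single))
```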

### Loss Function for Softmax
The loss function for multiclass classification is **Categorical Cross-Entropy**. If the true class for a training example is $j$, the loss is simply the negative logarithm of the predicted probability for that class:
$$ L = -\log(a_j) $$
This encourages the model to assign the highest possible probability to the correct class. In TensorFlow, this is `tf.keras.losses.SparseCategoricalCrossentropy`, the variant that expects each label as an integer class index rather than a one-hot vector.
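
The loss $L = -\log(a_j)$ can be sketched in a few lines. The probability vector below is a made-up softmax output, and the function name is chosen for the example. Note that the code indexes classes from 0, whereas the formulas above count from 1.

```python
import numpy as np

def cross_entropy_from_probs(a, j):
    """Loss -log(a_j), where a holds predicted probabilities and j is the true class index."""
    return -np.log(a[j])

a = np.array([0.1, 0.7, 0.2])  # hypothetical softmax output for 3 classes

loss_if_true_is_1 = cross_entropy_from_probs(a, 1)  # -log(0.7): small loss
loss_if_true_is_0 = cross_entropy_from_probs(a, 0)  # -log(0.1): large loss

print(loss_if_true_is_1, loss_if_true_is_0)
```

Only the probability assigned to the true class enters the loss; the other entries matter indirectly because all probabilities must sum to 1.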

---

## 6. Advanced Implementation Details

### Numerical Stability: `from_logits=True`
- When calculating the softmax function and then the cross-entropy loss, intermediate calculations involving exponents ($e^z$) can lead to very large or very small numbers, causing **numerical round-off errors**.
- **Solution**: Combine the calculation of the final activation (softmax) and the loss function into a single, more stable step.
- **Implementation**:
    1. Set the activation function of the final layer to **`linear`** (so it outputs the raw `z` values, called **logits**).
    2. In the `compile` step, use the `BinaryCrossentropy` or `SparseCategoricalCrossentropy` loss function and set the argument **`from_logits=True`**.
- This is the **recommended practice** for both binary and multiclass classification in TensorFlow because it reduces numerical round-off error in the computed loss.
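
The sketch below illustrates why combining softmax and the loss helps. It compares a naive two-step computation with a combined form based on the log-sum-exp identity $-\log(a_j) = \log\sum_k e^{z_k} - z_j$. This shows the general idea only; TensorFlow's internal implementation may differ, and the extreme logits are chosen to force an overflow.

```python
import numpy as np

def naive_loss(z, j):
    """Softmax first, then -log: overflows when logits are large."""
    a = np.exp(z) / np.sum(np.exp(z))
    return -np.log(a[j])

def stable_loss(z, j):
    """Combined form: log(sum_k e^(z_k)) - z_j, with the max logit subtracted first."""
    z_max = np.max(z)
    log_sum_exp = z_max + np.log(np.sum(np.exp(z - z_max)))
    return log_sum_exp - z[j]

z_big = np.array([1000.0, 0.0, -1000.0])  # deliberately extreme logits

# Silence the expected overflow/invalid warnings from the naive version
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_loss(z_big, 0)
stable = stable_loss(z_big, 0)

print(naive)   # nan, because e^1000 overflows to inf and inf/inf is undefined
print(stable)  # 0.0, the correct loss when class 0 has essentially all the probability

# For moderate logits the two versions agree
z_small = np.array([1.0, 2.0, 3.0])
print(np.isclose(naive_loss(z_small, 2), stable_loss(z_small, 2)))
```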

### The Adam Optimizer
- **Adam (Adaptive Moment Estimation)** is an optimization algorithm that typically converges faster than plain gradient descent.
- It **adapts the learning rate automatically** for each parameter during training.
    - It increases the learning rate for parameters that are consistently moving in the same direction.
    - It decreases the learning rate for parameters whose updates are oscillating.
- Adam is the **de facto standard optimizer** for training most neural networks today.
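
For intuition, here is a minimal NumPy sketch of one Adam step using the commonly published update rule (running averages of the gradient and of its square, with bias correction). The function name, the toy cost $J(w) = (w - 3)^2$, and the hyperparameter values are assumptions for the example; the Keras implementation may differ in details.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter w at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad        # running average of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2   # running average of the squared gradient
    m_hat = m / (1 - beta1 ** t)              # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter scaled step
    return w, m, v

# Toy problem: minimize J(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 501):
    grad = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t)

print(w)  # should end up close to the minimizer w = 3
```

Dividing by $\sqrt{\hat{v}}$ is what gives each parameter its own effective step size: gradients that keep flipping sign shrink $\hat{m}$ relative to $\sqrt{\hat{v}}$, while consistent gradients keep the ratio near 1.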

---

## 7. Other Layer Types (For Context)
- The layers we have used are called **Dense** layers because every neuron is connected to all activations from the previous layer.
- Other specialized layers exist, such as **Convolutional Layers**, where neurons only look at a small, localized region of the input. These are the foundation of modern computer vision models. This is for your general knowledge and not required for this course's assignments.
