# Neural Network Training

## TensorFlow Implementation

#### Overview
This week’s focus is on training neural networks using your own data, building on last week's topic of inference. The example task is handwritten digit recognition (e.g., distinguishing between '0' and '1'). The neural network architecture has an input layer (the image), two hidden layers (25 and 15 units respectively), and one output unit.

#### Key Steps for Training a Neural Network:
1. **Specify the Model**:  
   - Use TensorFlow's `Sequential` to create a neural network with three layers:
     - First hidden layer: 25 units, sigmoid activation.
     - Second hidden layer: 15 units, sigmoid activation.
     - Output layer: 1 unit, sigmoid activation (binary output).

2. **Compile the Model**:  
   - Define the loss function for training. For binary classification (e.g., recognizing a digit), the **binary crossentropy loss function** is used.
   - Compile the model with the optimizer, the loss function, and performance metrics if needed.

3. **Fit the Model**:  
   - Use the `fit` function to train the model. It takes the dataset $X$ (inputs) and $Y$ (labels) and trains the model using gradient descent.
   - **Epochs**: The number of passes the learning algorithm (e.g., gradient descent) makes over the training set. You choose how many epochs to run.
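
To make the epoch count concrete, here is a small sketch of how epochs relate to the number of parameter updates under mini-batch training. The dataset size of 1,000 examples is an assumed figure for illustration; the batch size of 32 is the Keras `fit` default.

```python
import math

def count_updates(num_examples: int, epochs: int, batch_size: int = 32) -> int:
    """Total gradient updates when each epoch is one full pass over the data."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return epochs * steps_per_epoch

# Assumed dataset of 1,000 examples trained for 100 epochs.
total = count_updates(num_examples=1000, epochs=100)
print(total)
```

With full-batch gradient descent (batch size equal to the dataset size) the same function gives exactly one update per epoch.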

#### Conceptual Understanding:
- **Understanding Code**: It's crucial to understand each line of the TensorFlow code to effectively debug and improve models. If the learning algorithm doesn't perform well, this conceptual foundation will help troubleshoot.
- **Mental Framework**: Understanding the roles of layers, loss functions, and optimization (like gradient descent) helps in diagnosing issues when training doesn’t yield expected results.

---

### Short Notes for Exam:
1. **Neural Network Architecture**:
   - Input → Hidden Layer 1 (25 units, sigmoid) → Hidden Layer 2 (15 units, sigmoid) → Output (1 unit, sigmoid).

2. **TensorFlow Steps**:
   - **Model Specification**: `Sequential` to string together layers.
   - **Compile Model**: Use binary crossentropy for binary classification.
   - **Fit Model**: Gradient descent optimization, specify number of **epochs** (passes over the training set).

3. **Loss Function**:
   - **Binary Crossentropy**: For binary outcomes (e.g., digit recognition).

4. **Epochs**: Control how long training runs. Each epoch is one pass over the training set, so more epochs means more parameter updates.

5. **Debugging**: Understanding the internals of the model helps in debugging when the neural network doesn't work as expected.


```python
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import BinaryCrossentropy

# Step 1: specify the model (two hidden layers and one output unit)
model = Sequential([
    Dense(units=25, activation="sigmoid"),
    Dense(units=15, activation="sigmoid"),
    Dense(units=1, activation="sigmoid"),
])

# Step 2: compile the model with the binary cross-entropy loss
model.compile(loss=BinaryCrossentropy())

# Step 3: train; X holds the training inputs and Y the 0/1 labels (defined elsewhere)
model.fit(X, Y, epochs=100)  # epochs: number of passes over the training set
```

## Training Details

### Step-by-Step Breakdown:

#### 1. **Specify the Model (Output Computation)**:
   - In **logistic regression**, you predicted the output $ f(x) $ using the sigmoid function applied to the linear combination of parameters $ W $ and $ X $ plus the bias $ B $.
     - $ f(x) = \frac{1}{1 + e^{-z}} $, where $ z = W \cdot X + B $
   - In **neural networks**, this step corresponds to specifying the architecture of the network (layers, units per layer, activation functions). The layers and activations now define how input $ X $ is transformed through multiple layers of computation:
     - **Input → Hidden Layer 1 (25 units, sigmoid) → Hidden Layer 2 (15 units, sigmoid) → Output (1 unit, sigmoid)**
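
The way one layer's output feeds the next can be sketched in plain Python. This is a minimal illustration, not TensorFlow's implementation: the network is shrunk to 2 inputs, 3 hidden units and 1 output, and all weights are made-up values.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def dense(a_in, W, b):
    """One dense layer: unit j computes sigmoid(W[j] . a_in + b[j])."""
    return [sigmoid(sum(w * a for w, a in zip(w_j, a_in)) + b_j)
            for w_j, b_j in zip(W, b)]

# Hypothetical tiny network: 2 inputs -> 3 hidden units -> 1 output.
W1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
b1 = [0.0, 0.1, -0.1]
W2 = [[1.0, -1.0, 0.5]]
b2 = [0.2]

x = [1.0, 2.0]
a1 = dense(x, W1, b1)    # hidden-layer activations
a2 = dense(a1, W2, b2)   # network output f(x), a value in (0, 1)
print(a2)
```

Each row of `W` holds the weights of one unit, so adding a layer is just another call to `dense` on the previous activations.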

#### 2. **Define Loss and Cost Function**:
   - In logistic regression, the **loss function** measured how well the model performed on a single training example. You used the **log loss (binary cross-entropy)**:
     - $ \text{Loss} = -y \log(f(x)) - (1 - y) \log(1 - f(x)) $
   - The **cost function** was the average loss over all training examples, calculated to minimize over the entire dataset.
     - $ J(W, B) = \frac{1}{m} \sum_{i=1}^{m} \text{Loss}(f(x^{(i)}), y^{(i)}) $
   - For **neural networks**, the loss function is still **binary cross-entropy** for binary classification problems. TensorFlow provides this built-in, and once you specify the loss function, the cost function is automatically handled by taking the average over all training examples.
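
The loss and cost formulas above translate almost line for line into code. The sketch below uses only the standard library; the predictions and labels are made-up values, and the function names are illustrative rather than TensorFlow APIs.

```python
import math

def bce_loss(f_x: float, y: int) -> float:
    """Binary cross-entropy loss for one example with prediction f_x in (0, 1)."""
    return -y * math.log(f_x) - (1 - y) * math.log(1 - f_x)

def cost(predictions, labels) -> float:
    """Average loss over the m training examples."""
    m = len(labels)
    return sum(bce_loss(f, y) for f, y in zip(predictions, labels)) / m

preds = [0.9, 0.2, 0.6]   # hypothetical model outputs f(x)
labels = [1, 0, 1]        # matching ground-truth labels
print(cost(preds, labels))
```

A confident correct prediction (0.9 for a label of 1) contributes a small loss, while a hesitant one (0.6 for a label of 1) contributes more.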

#### 3. **Minimize the Cost Function**:
   - For logistic regression, **gradient descent** updated the weights $ W $ and biases $ B $ by calculating the gradient of the cost function with respect to these parameters:
     - $ W = W - \alpha \frac{\partial J(W, B)}{\partial W} $
     - $ B = B - \alpha \frac{\partial J(W, B)}{\partial B} $
   - Similarly, in neural networks, **gradient descent** or its variants (like Adam or RMSprop) is used to minimize the cost function by updating all the parameters (weights and biases) across all layers.
   - The process of computing the gradients for each parameter is handled by TensorFlow’s **backpropagation** algorithm.
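
For reference, here is a minimal sketch of the gradient descent updates for single-feature logistic regression, the manual work that TensorFlow later automates. The gradient expressions are the standard ones for the binary cross-entropy cost; the toy data and the learning rate of 0.1 are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, b, xs, ys):
    m = len(xs)
    return sum(-y * math.log(sigmoid(w * x + b))
               - (1 - y) * math.log(1 - sigmoid(w * x + b))
               for x, y in zip(xs, ys)) / m

def gradient_step(w, b, xs, ys, alpha):
    """One simultaneous update of w and b for single-feature logistic regression."""
    m = len(xs)
    errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
    dj_dw = sum(e * x for e, x in zip(errors, xs)) / m
    dj_db = sum(errors) / m
    return w - alpha * dj_dw, b - alpha * dj_db

# Hypothetical toy data: larger x tends to mean y = 1.
xs = [0.5, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 1]

w, b = 0.0, 0.0
before = cost(w, b, xs, ys)
for _ in range(200):
    w, b = gradient_step(w, b, xs, ys, alpha=0.1)
after = cost(w, b, xs, ys)
print(before, after)
```

In a neural network the same update rule applies to every weight and bias in every layer; backpropagation is what supplies the partial derivatives.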

#### How TensorFlow Implements These Steps:
- **Step 1: Model Specification**  
   You define the architecture of the neural network:
   ```python
   model = Sequential([
       Dense(25, activation='sigmoid'),  # First hidden layer with 25 units
       Dense(15, activation='sigmoid'),  # Second hidden layer with 15 units
       Dense(1, activation='sigmoid')    # Output layer for binary classification
   ])
   ```
- **Step 2: Compile the Model (Loss and Optimizer)**  
   You specify the loss function (binary cross-entropy for classification) and the optimization method (e.g., Adam, which often converges faster than plain gradient descent):
   ```python
   model.compile(optimizer='adam', loss='binary_crossentropy')
   ```
   - TensorFlow also handles other loss functions, such as **mean squared error** for regression tasks.
  
- **Step 3: Train the Model (Minimization of Cost Function)**  
   Once the model is defined and compiled, you can train it using the `fit()` function. TensorFlow implements backpropagation to compute the gradients and updates the weights and biases accordingly.
   ```python
   model.fit(X_train, y_train, epochs=100)
   ```

### Key Concepts:
- **Backpropagation**: TensorFlow automatically computes gradients through all layers using backpropagation, which allows for efficient weight updates in multi-layer networks.
- **Optimization Algorithms**: The earlier logistic regression examples used plain gradient descent. TensorFlow also supports more advanced optimizers like Adam, which often train complex networks faster.
- **Epochs**: An epoch is a single pass over the entire training dataset. You specify how many epochs the training process should run.

### Comparison with Logistic Regression:
| Step                         | Logistic Regression                                                                 | Neural Network (Multilayer Perceptron)                             |
|------------------------------|-------------------------------------------------------------------------------------|-------------------------------------------------------------------|
| **Step 1: Compute Output**    | $ f(x) = \text{sigmoid}(W \cdot X + B) $                                          | Forward pass through multiple layers using activations (e.g., sigmoid) |
| **Step 2: Loss Function**     | Binary cross-entropy loss: $ -y \log(f(x)) - (1 - y) \log(1 - f(x)) $              | Same binary cross-entropy, or other losses depending on the task   |
| **Step 3: Gradient Descent**  | Update $ W $ and $ B $ using gradients of the cost function                     | Backpropagation updates weights/biases for all layers              |

### Final Thoughts:
With TensorFlow, the manual work you did for logistic regression—computing gradients, updating weights, calculating losses—is now abstracted. Libraries like TensorFlow allow you to define, compile, and train complex neural networks with just a few lines of code, making them more accessible for rapid development. Understanding the underlying steps (as you did with logistic regression) still provides a valuable foundation for debugging and model optimization.


# Activation Functions

## Alternatives to the Sigmoid Activation

In neural networks, the choice of activation function plays a crucial role in determining the network's performance. So far, you’ve been using the **sigmoid activation function**, which is a smooth, S-shaped function. The sigmoid function outputs values between 0 and 1, making it suitable for binary classification tasks. However, when you need a more complex or scalable model, you can explore other activation functions, such as **ReLU** (Rectified Linear Unit) and **linear** activation, which can make your network more powerful.

### **ReLU Activation Function (Rectified Linear Unit)**
ReLU is defined as:
$ g(z) = \max(0, z) $

- **Behavior**: For any input $z$, if $z$ is less than 0, ReLU outputs 0. Otherwise, it outputs the value of $z$. This creates a linear relationship for positive values and a flat response for negative ones.
- **Advantages**: ReLU helps overcome the vanishing gradient problem often encountered with sigmoid functions, especially in deeper networks. It also allows the activation to be any non-negative value, which can be useful when the quantity you're modeling can vary significantly.
- **Usage**: ReLU is now the most commonly used activation function in hidden layers because of its simplicity and efficiency in training large neural networks.

### **Linear Activation Function**
Linear activation is defined as:
$ g(z) = z $

- **Behavior**: This function outputs the value of $z$ directly without any transformation.
- **Usage**: Linear activations are typically used in the output layer for regression tasks, where you need a continuous range of outputs. Because $g(z) = z$, it is equivalent to using no activation function at all: the unit's output is just $w \cdot x + b$.

### **Choosing Between Activation Functions**
- **Sigmoid**: Best for binary classification in the output layer (where you want probabilities between 0 and 1).
- **ReLU**: Preferred for hidden layers in deep networks because of its ability to help models converge faster and handle non-negative values.
- **Linear**: Typically used in the output layer when dealing with regression problems where you want unbounded continuous outputs.
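
The three activation functions discussed above can be written in a few lines. This is a plain-Python sketch for scalar inputs, intended only to show the definitions side by side.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def relu(z: float) -> float:
    return max(0.0, z)

def linear(z: float) -> float:
    return z

# Compare the three functions on a negative, zero, and positive input.
for z in (-2.0, 0.0, 3.0):
    print(z, sigmoid(z), relu(z), linear(z))
```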

## Choosing activation functions

In neural networks, choosing the right activation function for the **output layer** and the **hidden layers** is crucial for effective learning and making correct predictions. Here’s a structured approach to selecting the appropriate activation functions for different parts of your network:

### **Output Layer:**

1. **Binary Classification** (e.g., predicting whether something is true or false, like if a product is a top seller):
   - **Sigmoid Activation**: 
     $$g(z) = \frac{1}{1 + e^{-z}}$$
     - This function outputs a probability between 0 and 1, which is perfect for binary classification tasks. It’s the natural choice when the target label $y$ is 0 or 1.
  
2. **Regression Tasks** (e.g., predicting continuous values like tomorrow’s stock price, which can be positive or negative):
   - **Linear Activation**: 
     $$g(z) = z$$
     - This function allows the output to take on any real number value, making it suitable for regression problems where the output can be both positive and negative.

3. **Non-negative Continuous Values** (e.g., predicting the price of a house, which can’t be negative):
   - **ReLU Activation**: 
     $$g(z) = \max(0, z)$$
     - This ensures that the output is always non-negative, which is useful when your target values are naturally constrained to be positive or zero.

### **Hidden Layers:**

- **ReLU (Rectified Linear Unit)**:
  $$g(z) = \max(0, z)$$
  - **Why ReLU?** ReLU is by far the most commonly used activation function in hidden layers due to several reasons:
    - **Efficiency**: It is computationally simpler and faster than the sigmoid function.
    - **Non-linearity**: ReLU introduces non-linearity, allowing the network to model complex patterns. It is linear for positive inputs, so the gradient there does not shrink, and the kink at $z = 0$ is what makes the overall function non-linear.
    - **Avoids the vanishing gradient problem**: Sigmoid functions can become flat at extreme values, which slows down learning due to tiny gradients. ReLU, by contrast, is only flat on one side, so it retains a large gradient for positive values.

- **Leaky ReLU and other variants**:
  - **Leaky ReLU** allows for a small slope on the negative side, preventing the "dead ReLU" problem where units can get stuck and never activate.
  - **Swish** and other advanced activation functions might offer marginal improvements, but for most applications, ReLU is sufficient.

### **Why Not Just Use Linear Activation Everywhere?**
- **Without non-linear activation functions**, the network is just a series of linear combinations, so it can only model linear relationships. To model complex, non-linear patterns, non-linear activations like ReLU are essential.
  
### **Summary of Activation Function Recommendations:**
- **Output Layer**:
  - Binary classification: **Sigmoid**
  - Regression (positive and negative values): **Linear**
  - Non-negative continuous values: **ReLU**
  
- **Hidden Layers**: **ReLU** as the default for most applications.

This strategy allows you to tailor the network's behavior to the nature of your problem, ensuring that you can model both complex non-linear relationships and appropriate output ranges effectively.

## Why do we need activation functions?

Neural networks rely on **non-linear activation functions** to enable them to model complex, non-linear patterns. If a neural network used **only linear activation functions**, it would not gain any additional modeling power over traditional **linear regression**. Here’s why:

### **What Happens Without Non-Linear Activation Functions?**
Let’s break it down using a simplified neural network example:

1. **Single hidden layer example**:
   - Consider an input $x$, a hidden layer with parameters $w_1$ and $b_1$, and an output layer with parameters $w_2$ and $b_2$.
   - If the activation function is **linear** (i.e., $g(z) = z$) at both layers:
     - The hidden layer output $a_1$ is just:
       $$a_1 = w_1 \cdot x + b_1$$
     - The output layer output $a_2$ becomes:
       $$a_2 = w_2 \cdot a_1 + b_2 = w_2 \cdot (w_1 \cdot x + b_1) + b_2$$
     - Simplifying this expression:
       $$a_2 = (w_2 \cdot w_1) \cdot x + (w_2 \cdot b_1 + b_2)$$
     - Notice this is just a **linear function** of $x$, $a_2 = w \cdot x + b$, where $w = w_2 \cdot w_1$ and $b = w_2 \cdot b_1 + b_2$.
   
   **Conclusion**: This output is no different from what you would get with **linear regression**—the entire point of using a neural network is defeated because the model can only learn a **linear relationship** between input and output.

2. **Multiple layers with linear activation**:
   - Even if you stack more layers, using only linear activation functions in all layers results in the same problem. A **linear function of a linear function** is still a **linear function**. Hence, multiple linear layers do not make the model more expressive or capable of learning non-linear patterns.
   
   **General Rule**: A neural network with linear activation functions in all layers is equivalent to a single-layer linear model.
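
The algebra above can be checked numerically. In this sketch the parameter values are arbitrary illustrative choices; the point is only that the two-layer linear network and the single collapsed linear function agree on every input.

```python
def two_linear_layers(x, w1, b1, w2, b2):
    a1 = w1 * x + b1          # hidden layer with g(z) = z
    return w2 * a1 + b2       # output layer with g(z) = z

def collapsed(x, w1, b1, w2, b2):
    w = w2 * w1               # combined weight
    b = w2 * b1 + b2          # combined bias
    return w * x + b

params = dict(w1=2.0, b1=-1.0, w2=0.5, b2=3.0)  # arbitrary illustrative values
for x in (-1.0, 0.0, 4.0):
    print(two_linear_layers(x, **params), collapsed(x, **params))
```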

### **Why Use Non-Linear Activation Functions?**
Non-linear activation functions like **ReLU** (Rectified Linear Unit) or **sigmoid** allow the network to capture complex, non-linear relationships between inputs and outputs, which is essential for most real-world tasks.

- **ReLU** introduces non-linearity by outputting the input directly if it is positive, and 0 otherwise. This simple non-linearity enables the network to model complex patterns and makes gradient descent optimization more effective.
  
- Other non-linear functions like **sigmoid** (used in classification problems) or **tanh** (used sometimes in hidden layers) also introduce non-linearity in different ways, enabling the network to learn more expressive features.

### **Key Insights:**
- **Without non-linearity**, neural networks cannot learn complex patterns and reduce to linear models, which are insufficient for tasks like image recognition, speech processing, or any task requiring more than basic linear modeling.
- **ReLU**, **sigmoid**, and **tanh** are essential because they make the network's output a non-linear function of its input. Backpropagation and gradient descent can then fit complex mappings from inputs to outputs.

### **Summary**:
- **Don’t use linear activation functions** in hidden layers. Always use non-linear activation functions like **ReLU**.
- Neural networks need non-linear activation functions to differentiate them from linear regression and to make them powerful enough to learn complex relationships.


# Multiclass Classification

## Multiclass

Multiclass classification involves problems where there are more than two possible output categories. Unlike binary classification, where the output $y$ takes one of two values (0 or 1), multiclass classification handles cases where $y$ can take several values.

### Examples of Multiclass Classification:
1. **Handwritten Digit Recognition**: Instead of just recognizing digits like 0 and 1, you may need to classify any digit from 0 to 9. This results in 10 possible output classes.
2. **Disease Classification**: A medical application where a model must classify patients into one of several possible diseases, for instance, determining if a patient has one of five distinct conditions.
3. **Defect Detection in Manufacturing**: In quality control, you might classify a product based on multiple defect types (e.g., scratches, discoloration, or chips in a pill).

### Dataset Representation:
In binary classification, you may have a dataset where features $x_1$ and $x_2$ are used to estimate the probability of $y$ being 0 or 1, often using logistic regression. However, in multiclass classification, the model must estimate the probabilities of $y$ taking on more than two values—such as classes 1, 2, 3, or 4.

For example, imagine a dataset where:
- Circles (O) represent one class.
- X’s represent another class.
- Triangles represent a third class.
- Squares represent a fourth class.

In this scenario, you’re not just predicting whether $y = 1$, but instead estimating the probability that $y$ equals each possible class (1, 2, 3, or 4).

### Softmax Regression:
To handle multiclass classification, logistic regression is generalized to **softmax regression**, which extends the idea of estimating a single probability into estimating probabilities for all possible classes. Instead of fitting a model to estimate the probability of $y = 1$, softmax regression helps in estimating the probabilities for all classes. 

In the next step, this **softmax function** is incorporated into a neural network, allowing the model to predict multiple categories from the input features. The softmax layer outputs a probability distribution over the possible classes, ensuring the predicted probabilities sum to 1, making it ideal for multiclass classification tasks. 

This approach allows neural networks to classify inputs into multiple categories effectively.

## Softmax

### Understanding Softmax Regression

**Softmax Regression** extends logistic regression to handle multiclass classification problems where the output variable $y$ can take on more than two discrete values.

#### How Softmax Regression Works

1. **Computing Scores**:
   For each possible class $j$, compute a score $z_j$:
   $$z_j = w_j \cdot x + b_j$$
   where $w_j$ and $b_j$ are the parameters for class $j$, $x$ is the input feature vector, and $\cdot$ denotes the dot product.

2. **Softmax Function**:
   Convert the scores into probabilities using the softmax function:
   $$a_j = \frac{e^{z_j}}{\sum_{k=1}^n e^{z_k}}$$
   Here, $a_j$ represents the estimated probability that $y$ equals class $j$, and the denominator normalizes these probabilities so that they sum to 1.

   For example, for four possible classes, you would compute:
   $$a_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3} + e^{z_4}}$$
   $$a_2 = \frac{e^{z_2}}{e^{z_1} + e^{z_2} + e^{z_3} + e^{z_4}}$$
   $$a_3 = \frac{e^{z_3}}{e^{z_1} + e^{z_2} + e^{z_3} + e^{z_4}}$$
   $$a_4 = \frac{e^{z_4}}{e^{z_1} + e^{z_2} + e^{z_3} + e^{z_4}}$$

   If the computed probabilities are $a_1 = 0.30$, $a_2 = 0.20$, $a_3 = 0.15$, the probability for the fourth class $a_4$ is:
   $$a_4 = 1 - (a_1 + a_2 + a_3) = 1 - (0.30 + 0.20 + 0.15) = 0.35$$

3. **General Case**:
   For $n$ possible classes, the probability for class $j$ is:
   $$a_j = \frac{e^{z_j}}{\sum_{k=1}^n e^{z_k}}$$

   The probabilities $a_1, a_2, \ldots, a_n$ will always sum up to 1.
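
A direct translation of the softmax formula, as a minimal sketch using the standard library. The four scores are arbitrary example values. This straightforward version can overflow for very large scores, which is the issue the later section on the improved implementation addresses.

```python
import math

def softmax(z):
    """Map a list of scores z_1..z_n to probabilities a_1..a_n."""
    exps = [math.exp(z_k) for z_k in z]
    total = sum(exps)
    return [e / total for e in exps]

a = softmax([2.0, 1.0, 0.5, -1.0])   # arbitrary example scores for 4 classes
print(a, sum(a))
```

The largest score always receives the largest probability, and the outputs sum to 1 up to floating-point rounding.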

#### Softmax Regression vs Logistic Regression

When $n = 2$ (binary classification), softmax regression simplifies to logistic regression. The softmax function becomes equivalent to the sigmoid function used in logistic regression, thus allowing softmax regression to handle binary classification scenarios as well.
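
As a short check of this claim, the two-class case can be worked out directly from the softmax formula. This sketch treats the difference $z_1 - z_2$ as the single logistic-regression score $z$, which is an interpretation rather than something the notes state:

$$a_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{-(z_1 - z_2)}} = \frac{1}{1 + e^{-z}}, \qquad a_2 = 1 - a_1$$

Only the difference between the two scores matters, and $a_1$ is the sigmoid of that difference.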

#### Cost Function for Softmax Regression

The cost function for softmax regression is the negative log-likelihood function. For a single training example with ground truth label $y = j$, the loss is:
$$\text{Loss} = -\log(a_j)$$
where $a_j$ is the predicted probability for the true class $j$.

For multiple training examples, the cost function is the average of these individual losses over all training examples:
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^m \log(a_{y^{(i)}}^{(i)})$$
where $a_{y^{(i)}}^{(i)}$ is the probability assigned to the true class for the $i$-th training example, and $m$ is the number of training examples.
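
The loss and cost above can be sketched as follows. The predicted probabilities are made-up values, and the code indexes classes from 0 (as TensorFlow labels do) rather than from 1 as in the formulas.

```python
import math

def softmax_loss(a, y: int) -> float:
    """Loss for one example: -log of the probability given to the true class y."""
    return -math.log(a[y])

def softmax_cost(A, Y) -> float:
    """Average loss over m examples; A[i] is the probability list for example i."""
    return sum(softmax_loss(a, y) for a, y in zip(A, Y)) / len(Y)

# Made-up predictions for 2 examples over 4 classes (0-indexed labels).
A = [[0.30, 0.20, 0.15, 0.35],
     [0.05, 0.85, 0.05, 0.05]]
Y = [3, 1]
print(softmax_cost(A, Y))
```

Only the probability assigned to the true class enters the loss, so the second example (0.85 on the correct class) is penalized far less than the first (0.35).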

#### Summary

- **Softmax Regression** generalizes logistic regression to multiple classes by computing probabilities for each class and normalizing them.
- **Softmax Function** ensures the probabilities sum to 1, providing a valid probability distribution.
- **Cost Function** for softmax regression incentivizes the model to assign high probabilities to the true class, effectively learning to predict the correct class with high confidence.

In the next steps, this softmax regression model can be incorporated into a neural network to handle more complex multiclass classification tasks.

## Neural Network with Softmax Output

### Building a Neural Network with Softmax Output Layer

To extend neural networks for multiclass classification problems (e.g., digit recognition with 10 classes), you use a Softmax output layer. Here's a step-by-step explanation of how to incorporate the Softmax layer into a neural network and how to implement it using TensorFlow.

#### Softmax Output Layer in a Neural Network

1. **Architecture for Multiclass Classification**:
   - **Hidden Layers**: Compute activations just as in any standard neural network.
   - **Output Layer**: Replace the output layer with a Softmax layer that has as many units as the number of classes (e.g., 10 units for digit classification from 0 to 9).

2. **Forward Propagation**:
   - Compute the scores $z_j$ for each class $j$ from the output of the previous layer:
     $$z_j = W_j \cdot a_{(L-1)} + b_j$$
     where $W_j$ and $b_j$ are the weights and biases for class $j$, and $a_{(L-1)}$ is the activation vector from the previous layer.

   - Apply the Softmax function to convert scores into probabilities:
     $$a_j = \frac{e^{z_j}}{\sum_{k=1}^n e^{z_k}}$$
     Here, $a_j$ is the probability of the input belonging to class $j$.

   - The Softmax activation function ensures that the sum of probabilities $a_1, a_2, \ldots, a_n$ is 1.

3. **Unique Property of Softmax**:
   - Unlike other activation functions (sigmoid, ReLU), the Softmax function considers all class scores simultaneously. Each output probability $a_j$ depends on all the scores $z_1, z_2, \ldots, z_n$.

#### Implementing in TensorFlow

1. **Define the Model**:
   ```python
   import tensorflow as tf

   # Define the model architecture
   model = tf.keras.Sequential([
       tf.keras.layers.Dense(25, activation='relu', input_shape=(input_dim,)),  # Hidden layer 1
       tf.keras.layers.Dense(15, activation='relu'),                           # Hidden layer 2
       tf.keras.layers.Dense(10, activation='softmax')                         # Output layer with Softmax
   ])
   ```

2. **Compile the Model**:
   - Use `SparseCategoricalCrossentropy` as the loss function for multiclass classification:
     ```python
     model.compile(optimizer='adam',
                   loss='sparse_categorical_crossentropy',
                   metrics=['accuracy'])
     ```

3. **Train the Model**:
   - Fit the model to your data:
     ```python
     model.fit(x_train, y_train, epochs=10, batch_size=32)
     ```

   Here, `x_train` is your input data and `y_train` is your corresponding labels (with values ranging from 0 to 9 for digit classification).

#### Notes

- **SparseCategoricalCrossentropy**: Used because the target values are integers representing the class labels.
- **Softmax Activation**: Ensures that the model outputs a probability distribution over the classes.

#### Summary

- **Softmax Layer**: Converts raw scores into probabilities for multiclass classification.
- **TensorFlow Implementation**: Define a neural network with a Softmax output layer, compile with `SparseCategoricalCrossentropy`, and train as usual.


## Improved implementation of softmax

This lecture explains how to improve the numerical stability of the softmax regression model in TensorFlow by avoiding potential round-off errors that occur when computing values like the softmax activations or loss functions.

### Key Concepts:

1. **Numerical Stability and Round-off Errors**: 
    - When computing values in a computer, round-off errors can occur because each floating-point number is stored with a finite number of bits.
    - Example: Computing $x = \frac{2}{10{,}000}$ directly versus computing it through intermediate steps as $\left(1 + \frac{1}{10{,}000}\right) - \left(1 - \frac{1}{10{,}000}\right)$ may give slightly different results due to round-off errors. Such discrepancies matter most when very small or very large intermediate values are involved.

2. **Softmax Layer and Loss Function**:
    - In softmax regression, activations are computed as $a_j = \frac{e^{z_j}}{\sum_{i=1}^{10} e^{z_i}}$, where $z_j$ is the output of the last linear layer.
    - Instead of computing softmax activations and loss in separate steps (which can lead to numerical issues), TensorFlow allows you to directly compute the loss function using more numerically stable methods.

3. **Improved Implementation Using `from_logits=True`**:
    - TensorFlow has an option to directly compute the loss from the raw logits (the $z$ values) without first computing the softmax activations. This is done using the `from_logits=True` parameter.
    - By avoiding the explicit calculation of the softmax activations and instead folding it into the loss calculation, TensorFlow can rearrange computations in a more numerically stable way, reducing round-off errors.

4. **Logits**: 
    - In machine learning, logits refer to the raw output of the final linear layer (before applying softmax or any other transformation). When using `from_logits=True`, TensorFlow assumes you're working with logits and internally applies the necessary transformations to calculate the loss.

5. **Application in Multi-Class Classification**:
    - The updated TensorFlow code no longer requires explicitly specifying the softmax activation in the final layer. Instead, it can directly calculate the logits and handle everything through the loss function.
    - The cost function (e.g., `SparseCategoricalCrossentropy`) automatically applies the softmax internally when `from_logits=True` is used.
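
The round-off example from the first point can be reproduced in a couple of lines. This is a plain-Python sketch using ordinary double-precision floats; printing 18 decimal places makes the discrepancy visible.

```python
x1 = 2.0 / 10000                           # computed directly
x2 = (1 + 1 / 10000) - (1 - 1 / 10000)     # same value via intermediate steps

print(f"{x1:.18f}")
print(f"{x2:.18f}")
print(x1 == x2)
```

The two results are mathematically identical, so any difference in the printed digits comes purely from rounding the intermediate values.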

### Benefits of the Improved Approach:
- **Reduced Numerical Errors**: Avoids extremely large or small intermediate values (e.g., $e^{z}$) that could introduce inaccuracies.
- **Cleaner Code**: Handles both activation and loss calculation in one step through the loss function specification.
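
To illustrate the kind of rearrangement involved, the sketch below compares a two-step loss (softmax, then log) with a version computed directly from the logits using the standard log-sum-exp trick. This is an assumption about the general technique, not a description of TensorFlow's actual internals, and the extreme logits are chosen deliberately to trigger overflow.

```python
import math

def naive_loss(z, y: int) -> float:
    """Softmax first, then -log: exp() overflows for large scores."""
    exps = [math.exp(z_k) for z_k in z]
    return -math.log(exps[y] / sum(exps))

def stable_loss(z, y: int) -> float:
    """Same quantity computed directly from logits with the log-sum-exp trick."""
    z_max = max(z)
    log_sum = z_max + math.log(sum(math.exp(z_k - z_max) for z_k in z))
    return log_sum - z[y]

logits = [1000.0, 0.0, -1000.0]   # extreme scores chosen to trigger overflow
try:
    naive_loss(logits, 0)
except OverflowError as err:
    print("naive version failed:", err)
print(stable_loss(logits, 0))
```

One practical consequence of this approach: because the final layer now emits logits rather than probabilities, probabilities at prediction time are obtained by applying softmax to the model's output (in TensorFlow, `tf.nn.softmax`).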

### Code Example:
```python
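from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import SparseCategoricalCrossentropy

model = Sequential([
    Dense(25, activation='relu'),
    Dense(15, activation='relu'),
    Dense(10)  # No softmax here: the layer outputs raw logits (linear activation)
])

# The loss applies softmax internally because from_logits=True
model.compile(optimizer='adam',
              loss=SparseCategoricalCrossentropy(from_logits=True))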
```

This approach improves numerical stability, although the code might seem less intuitive. It is the preferred method when dealing with multi-class classification problems. 


# Additional Neural Network Concepts

## Advanced Optimization

The **Adam optimization algorithm** improves upon standard gradient descent by adapting the learning rate for each parameter during training. This leads to faster convergence and more efficient learning, especially for neural networks. Here's a breakdown of the key points:

### Gradient Descent Recap:
- Gradient descent updates parameters (e.g., $w_j $) using:
  $$
  w_j = w_j - \alpha \cdot \frac{\partial J}{\partial w_j}
  $$
  where $\alpha $ is the learning rate, and $\frac{\partial J}{\partial w_j} $ is the gradient of the cost function $J $ with respect to $w_j $.

### Issues with Fixed Learning Rate:
- **Small learning rates** can result in slow progress, as each step barely moves toward the minimum.
- **Large learning rates** can cause oscillations, making it hard to converge smoothly to the minimum.
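
Both failure modes show up even on a one-parameter problem. The sketch below runs plain gradient descent on the assumed cost $J(w) = w^2$ with two illustrative learning rates.

```python
def run_gradient_descent(alpha: float, steps: int = 5, w: float = 1.0):
    """Minimize J(w) = w**2, whose gradient is 2 * w; return the path of w."""
    path = [w]
    for _ in range(steps):
        w = w - alpha * 2 * w
        path.append(w)
    return path

small = run_gradient_descent(alpha=0.01)   # creeps toward the minimum at 0
large = run_gradient_descent(alpha=0.9)    # overshoots and flips sign each step
print(small)
print(large)
```

With the small rate each step shrinks `w` by only 2%, while with the large rate `w` jumps across the minimum on every step.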

### The Adam Algorithm (Adaptive Moment Estimation):
- **Adaptively changes the learning rate** for each parameter individually.
- **Adjusts learning rates** based on how the gradient behaves:
  - If a parameter continues moving in the same direction, Adam **increases the learning rate** for that parameter, allowing it to converge faster.
  - If a parameter oscillates (bounces back and forth), Adam **reduces the learning rate** for smoother convergence.
  
### Key Benefits:
- **Momentum and adaptive learning rates:** Adam keeps track of past gradients (momentum) and adjusts the learning rates based on that information.
- **Less manual tuning:** Adam reduces the need to tune the learning rate by hand and is fairly robust to the initial learning-rate setting.
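
For readers who want to see the mechanics, here is a minimal single-parameter sketch of the commonly published Adam update rule. The notes above do not give the formulas, so the rule, the hyperparameter defaults (`beta1`, `beta2`, `eps`), and the toy cost function are all assumptions brought in for illustration.

```python
import math

def adam_minimize(grad, w: float, alpha: float = 0.1, steps: int = 1000,
                  beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-8) -> float:
    """Minimize a one-parameter function given its gradient, using Adam updates."""
    m = 0.0   # running average of gradients (momentum term)
    v = 0.0   # running average of squared gradients (scales the step size)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)   # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        w = w - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Hypothetical cost J(w) = (w - 3)**2 with gradient 2 * (w - 3).
w_final = adam_minimize(lambda w: 2 * (w - 3), w=0.0)
print(w_final)
```

Dividing by the square root of `v_hat` is what gives each parameter its own effective step size: consistent gradients keep the ratio near 1, while gradients that flip sign cancel in `m_hat` and shrink the step.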
  
### Implementation in Code:
In TensorFlow/Keras, implementing Adam is straightforward:
```python
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), 
              loss='sparse_categorical_crossentropy', 
              metrics=['accuracy'])
```
Here, the learning rate is set to $10^{-3} $, but this can be adjusted based on the problem.

### Why Use Adam?
- **Faster convergence:** Adam dynamically adjusts learning rates, often resulting in quicker and more efficient training than standard gradient descent.
- **De facto standard:** It's widely used in practice due to its ability to handle large datasets and complex models effectively.

In summary, the Adam optimizer is a more advanced and adaptive version of gradient descent, making it ideal for training neural networks efficiently.

## Additional Layer Types

This lecture introduces **convolutional layers**, a fundamental building block of **convolutional neural networks (CNNs)**, which differ from the **dense layers** previously discussed in neural networks.

### Recap of Dense Layers:
- In a dense layer, each neuron in the hidden layer is connected to every neuron in the previous layer. This works well but can be computationally expensive, especially for large input data like images or signals.

### Convolutional Layers:
- **Convolutional layers** operate differently. Instead of each neuron looking at all inputs from the previous layer, neurons in a convolutional layer focus on **small, localized regions** (called **receptive fields**) of the input.
  
#### Example with Handwritten Digit:
- In an image (e.g., a handwritten digit), each neuron in the convolutional layer looks only at a small portion of the image (a small rectangle of pixels), not the entire image.
- By using these localized connections, convolutional layers:
  - **Speed up computation**, as fewer connections are made compared to dense layers.
  - **Can reduce overfitting**, since they require fewer parameters and learn to capture local patterns.

### Convolutional Neural Networks (CNNs):
- Multiple convolutional layers can be stacked to form a **convolutional neural network**. CNNs are especially effective for tasks like **image recognition** and **signal classification**.
- In each convolutional layer, neurons focus on different portions of the input, enabling the network to capture **local patterns** like edges or specific features in an image.

#### Example with EKG Signals:
- For a one-dimensional signal like an **EKG** (electrocardiogram), neurons in a convolutional layer only focus on small windows of the signal, analyzing a subset of the data points at a time.
- Subsequent layers can also be convolutional, continuing to look at smaller sections of the activations from the previous layer.
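
The windowing idea can be sketched for a one-dimensional signal. This illustration assumes the window's weights are shared across all positions, which is the usual convention for convolutional layers but is not stated in the notes; the signal values and the two-weight filter are made up.

```python
def conv1d(signal, weights, bias: float = 0.0, stride: int = 1):
    """Slide one shared window of weights over the signal (pre-activation values)."""
    k = len(weights)
    return [sum(w * s for w, s in zip(weights, signal[i:i + k])) + bias
            for i in range(0, len(signal) - k + 1, stride)]

signal = [0.0, 0.1, 0.0, 1.0, 0.2, 0.0, 0.1, 0.9, 0.1, 0.0]  # made-up samples
edge_filter = [-1.0, 1.0]                                   # responds to upward jumps
print(conv1d(signal, edge_filter))

# Parameter counts for 9 output units on this 10-sample input:
dense_params = 9 * len(signal) + 9        # every unit sees every sample
conv_params = len(edge_filter) + 1        # one shared window plus one bias
print(dense_params, conv_params)
```

Each output value depends on only two neighbouring samples, and the parameter count no longer grows with the length of the input.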

### Practical Benefits:
- **Efficiency**: By focusing on smaller regions of the input, CNNs can process data faster and with fewer parameters than dense layers.
- **Better Generalization**: By limiting the number of parameters, CNNs tend to be less prone to overfitting, making them more effective when training with limited data.
  
### Architecture Choices in CNNs:
- Researchers have flexibility in choosing the size of the **window** (receptive field) each neuron looks at, as well as the number of neurons in each layer, allowing for various CNN designs suited to specific tasks.

### Neural Network Layer Types:
- The video highlights how new layer types (like **convolutional layers**) add flexibility and power to neural networks.
- Other types of layers, such as **transformers, LSTMs,** or **attention models**, represent ongoing research efforts to develop more effective neural networks by combining different kinds of layers.

### Conclusion:
- While dense layers can be powerful, **convolutional layers** offer advantages for data with spatial or temporal structure, such as images and signals. Understanding different types of layers broadens the capacity of neural networks to solve a wide range of problems.