# Chapter 18: Introduction to Neural Networks


Artificial intelligence has revolutionized the way machines learn from data, enabling them to perform tasks that once seemed impossible. From powering voice assistants to enabling autonomous vehicles, AI systems are now integrated into many aspects of daily life. At the core of this transformation are neural networks—computational models inspired by **the structure and function of the human brain**. These networks consist of interconnected nodes, or "neurons," that process information in layers, allowing them to recognize patterns, make decisions, and improve over time. From early single-layer perceptrons to today’s deep architectures, neural networks have become a cornerstone of modern AI, powering breakthroughs in fields like healthcare, finance, and autonomous systems.

> Neural networks are a **family of machine learning models**, most often trained with supervised learning, capable of identifying intricate patterns and relationships within data, making them suitable for tackling problems that traditional models often struggle with.

![NN](imgs/NN.png)


Although neural network algorithms have existed for many years, recent advancements in their architectures have led to significant improvements in performance on large-scale machine learning tasks. These developments form the foundation of what is now referred to as the "deep learning" methodology.

Deep learning, a branch of machine learning and artificial intelligence, leverages neural networks with multiple hidden layers to address highly complex tasks. These tasks range from natural language processing—such as speech recognition and text interpretation—to computer vision applications like object detection and image classification. The rise of deep learning over the past twenty years can be attributed to its remarkable effectiveness, the surge in computational power, and the growing accessibility of vast datasets.

## 18.1 Fundamentals of Neural Networks

Neural networks are a powerful class of models, most commonly trained with supervised learning, capable of modeling complex, nonlinear relationships in data. Unlike traditional machine learning models, which often rely on handcrafted features, neural networks can learn hierarchical representations directly from raw input. Key components include:

* **Input Layer:** Receives raw data (e.g., pixels in an image, words in a text).

* **Hidden Layers:** Intermediate layers that transform inputs through weighted connections and activation functions.

* **Output Layer:** Produces the final prediction (e.g., class label, regression value).

* **Weights & Biases:** Adjustable parameters learned during training.

* **Activation Functions:** Introduce nonlinearity (e.g., ReLU, Sigmoid, Tanh).


Training a neural network involves **forward propagation** (passing data through the network to produce predictions) and **backpropagation** (computing the gradient of the error with respect to each weight), after which an optimization technique such as gradient descent adjusts the weights.

## **18.1 Neurons: The Building Blocks**

A neuron (or node) is the fundamental unit of a neural network, mimicking biological neurons. It receives inputs, processes them using weights and an activation function, and produces an output. Mathematically, a neuron's operation can be represented as:


$$ y = f\left(\sum_{i=1}^{n} (w_i x_i) + b\right) $$

Where:
- $x_i$ = input features
- $w_i$ = weights
- $b$ = bias term  
- $f$ = activation function
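The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `sigmoid` and `neuron_output`, the choice of sigmoid as $f$, and the input values are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, used here as the activation function f."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, b, f=sigmoid):
    """Compute y = f(sum_i w_i * x_i + b) for a single neuron."""
    z = np.dot(w, x) + b   # weighted sum of the inputs plus the bias
    return f(z)

# Hypothetical values chosen only for illustration.
x = np.array([1.0, 2.0, 3.0])    # input features x_i
w = np.array([0.5, -1.0, 0.5])   # weights w_i
b = 0.0                          # bias term

y = neuron_output(x, w, b)
print(y)  # the weighted sum is 0.5 - 2.0 + 1.5 = 0, and sigmoid(0) = 0.5
```

Swapping `f` for another activation changes only the last step; the weighted sum is the same for every neuron type discussed in this chapter.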


## **18.2 Layers: Input, Hidden, and Output**
Neural networks are organized into layers:

1. **Input Layer**:
    - The first layer that receives raw data (e.g., pixel values in an image, word embeddings in text).

2. **Hidden Layers**:
    - Intermediate layers between input and output where feature extraction and transformation occur.
    - Deep networks have multiple hidden layers.

3. **Output Layer**:
    - Produces the final predictions.
    - Classification: class probabilities.
    - Regression: continuous values.


## **18.3 Weights and Biases**

| Component | Role |
|-----------|------|
| **Weights** ($w_i$) | Determine connection strength between neurons. They are adjusted during training to minimize prediction errors. |
| **Bias** ($b$) | Shifts the input to the activation function, giving the model more flexibility. |



## **18.4 Activation Functions**

Activation functions are essential components of neural networks, introducing non-linearities that enable the model to learn complex patterns. Without them, any stack of layers would collapse into a single linear transformation, no more expressive than a linear model and incapable of handling intricate data relationships. This section explores four widely used activation functions (Sigmoid, ReLU, Tanh, and Softmax) and discusses their properties, advantages, limitations, and typical use cases.


| Function  | Formula                          | Use Case                      |
|-----------|----------------------------------|-------------------------------|
| **Sigmoid** | $\sigma(x) = \frac{1}{1 + e^{-x}}$ | Binary classification  (outputs between 0 and 1)         |
| **ReLU**   | $f(x) = \max(0, x)$              | Default choice for hidden layers (fast computation)    |
| **Tanh**   | $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ | Similar to sigmoid but outputs between -1 and 1 |
| **Softmax**| $\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}$ | Multi-class classification (outputs probabilities) |
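The element-wise functions in the table can be written directly in NumPy. The sketch below covers sigmoid, ReLU, and tanh; the function names and the sample inputs are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    """Squash inputs into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Pass positive inputs through unchanged and clip negatives to zero."""
    return np.maximum(0.0, x)

def tanh(x):
    """Squash inputs into the open interval (-1, 1), centered at zero."""
    return np.tanh(x)

# Hypothetical sample inputs.
x = np.array([-2.0, 0.0, 2.0])

print("sigmoid:", sigmoid(x))
print("relu:   ", relu(x))
print("tanh:   ", tanh(x))
```

Each function is applied independently to every element, which is why a whole layer of pre-activations can be transformed in a single call.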


**The sigmoid function**, $\sigma(x) = \frac{1}{1 + e^{-x}}$, maps any real-valued input to a value between 0 and 1. This probability-like output makes it useful for binary classification and for output layers where a probabilistic interpretation is needed.

Advantages:
* Useful when outputs need to be interpreted as probabilities.
* Differentiable, allowing gradient-based optimization.


Disadvantages:
* Vanishing Gradients: For large positive or large negative inputs, the gradient becomes nearly zero, slowing down learning.
* Not Zero-Centered: Outputs are always positive, leading to inefficient weight updates.
* Computationally Expensive: Requires evaluating an exponential, which costs more than a simple comparison such as ReLU.


**ReLU (Rectified Linear Unit)** is one of the most popular activation functions due to its simplicity and effectiveness. It outputs the input directly if it is positive and zero otherwise, $f(x) = \max(0, x)$, so it is non-linear yet cheap to compute.

Advantages:

* Avoids Vanishing Gradient (for positive inputs): Unlike sigmoid, gradients remain strong for active neurons.
* Fast Computation: No complex exponentials.
* Sparsity: Can deactivate neurons (output zero), making the network more efficient.

Disadvantages:

* Dying ReLU Problem: A neuron whose input is negative for every training example outputs zero, receives zero gradient, and stops learning entirely.
* Not Zero-Centered: Like sigmoid, its outputs are never negative, which can lead to slower convergence.

What is the **Dying ReLU** Problem?
If a neuron consistently receives negative inputs, its output is zero and its weights stop updating, since the gradient is also zero. Over time, such neurons can "die" and never activate again, reducing the model's capacity to learn.

Solutions to Dying ReLU:

* Leaky ReLU: Allows a small negative slope (e.g., 0.01) for negative inputs.
$$
\text{Leaky ReLU}(x) =
\begin{cases}
x, & \text{if } x \geq 0 \\
\alpha x, & \text{if } x < 0
\end{cases}
$$

where $\alpha$ is a small positive constant (e.g., 0.01).

* Parametric ReLU (PReLU): Learns the negative slope during training.

* Exponential Linear Unit (ELU): Smoothly handles negative inputs.
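To see why Leaky ReLU helps, it is useful to compare gradients for negative inputs. The sketch below is illustrative only: the helper names and the sample inputs are assumptions, and the gradient at exactly zero follows the $x \geq 0$ branch of the formula above.

```python
import numpy as np

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for x >= 0, alpha * x for x < 0."""
    return np.where(x >= 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    """Derivative of Leaky ReLU: 1 for x >= 0, alpha for x < 0."""
    return np.where(x >= 0, 1.0, alpha)

# Hypothetical pre-activation values.
x = np.array([-3.0, -0.5, 0.0, 2.0])

print("ReLU gradient:      ", relu_grad(x))
print("Leaky ReLU output:  ", leaky_relu(x))
print("Leaky ReLU gradient:", leaky_relu_grad(x))
```

For the negative inputs, the ReLU gradient is exactly zero while the Leaky ReLU gradient is `alpha`, so a small learning signal still reaches the weights.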

**Tanh (Hyperbolic Tangent) Function:**
The tanh function, $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$, is similar to sigmoid but maps inputs to a range between -1 and 1, making it zero-centered.
* An activation function is zero-centered if its outputs are distributed roughly symmetrically around zero, so that they can be positive or negative. This property helps keep training stable and efficient by preventing systematic weight updates in a single direction.

* When activation outputs are not zero-centered (e.g., sigmoid outputs lie between 0 and 1), every input passed to the next layer is positive. The gradients for all of a neuron's incoming weights then share the same sign, so in a given update those weights can only increase together or decrease together, which slows convergence.

Advantages:

* Zero-centered output allows for better convergence during gradient descent.

* Stronger gradients than sigmoid for inputs near 0.

Disadvantages:

* Still suffers from the vanishing gradient problem: like sigmoid, gradients become very small for inputs of large magnitude.

* Slightly More Computationally Expensive: Due to exponential operations.

**The Softmax function** is typically used in the output layer of a multi-class classification model. It converts a vector of raw scores (logits) into a probability distribution over the $K$ output classes: $\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}$.


Advantages:

* Ensures that the sum of the outputs is 1, making them interpretable as probabilities.

* Highlights the highest-valued input while suppressing the rest, which helps in clear class predictions.

Disadvantages:

* Exponentially sensitive to input scale: very large logits can overflow and cause numerical instability.

* When classes are not mutually exclusive, Softmax is not ideal (use sigmoid instead for multi-label classification).
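A common way to avoid the numerical instability mentioned above is to subtract the largest logit before exponentiating, which leaves the result mathematically unchanged. The sketch below assumes a 1-D vector of logits; the function name and sample values are chosen for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector of logits."""
    shifted = z - np.max(z)      # largest exponent becomes 0, so exp cannot overflow
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

# Hypothetical logits for a three-class problem.
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs, probs.sum())

# Logits this large would overflow a naive np.exp(z) implementation.
large_logits = np.array([1000.0, 1001.0, 1002.0])
large_probs = softmax(large_logits)
print(large_probs)
```

Because only the differences between logits matter, `softmax([1000, 1001, 1002])` gives the same probabilities as `softmax([0, 1, 2])`.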

<h4><a href="/CMSC320TextBook/chapter18/interactive_activation_functions.html">Click Here for Interactive Activation Function Visualization</a></h4>

## **18.5 Loss Functions**
Loss functions (or cost functions) measure how well a neural network’s predictions match the true target values. During training, the goal is to minimize the loss by adjusting the model’s parameters. The choice of loss function depends on the type of task:



- **Mean Squared Error (MSE)**: Widely used in regression problems, such as predicting house prices or temperature, where the output is continuous. It is the average squared difference between predicted and actual values:

$$L_{MSE} = \frac{1}{N}\sum_{i=1}^N (y_i - \hat{y}_i)^2$$

Because the errors are squared, larger errors are penalized more heavily. MSE is straightforward and differentiable, which makes it compatible with gradient descent, but it has notable drawbacks: it is highly sensitive to outliers due to the squaring operation, and it is usually a poor fit for classification tasks, where it often leads to slow convergence.

- **Cross-Entropy Loss**: Widely used for classification tasks, both binary and multi-class. It measures the difference between the predicted probability distribution and the true label distribution.

For binary classification:

$$L_{BCE} = -\frac{1}{N}\sum_{i=1}^N \left[y_i\log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\right]$$

For multi-class classification:

$$L_{CCE} = -\frac{1}{N}\sum_{i=1}^N \sum_{c=1}^C y_{i,c}\log(\hat{y}_{i,c})$$

Here $y_{i,c}$ is 1 if example $i$ belongs to class $c$ and 0 otherwise, and $\hat{y}_{i,c}$ is the predicted probability of class $c$.

For binary classification, cross-entropy penalizes the model when it confidently predicts the wrong class, encouraging outputs closer to the true labels. In multi-class settings, it works with softmax outputs to handle multiple classes simultaneously. One disadvantage is that the loss can become very large when the model assigns a near-zero probability to the true class, which may cause instability during training if not handled properly.
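The three formulas above translate almost line for line into NumPy. This is a sketch under stated assumptions: the function names are illustrative, labels for the multi-class case are one-hot encoded, and the small constant `eps` used to clip probabilities is one common way (not the only one) to keep the logarithm finite.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error over N examples."""
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy; predictions are clipped away from 0 and 1."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Multi-class cross-entropy; y_true is one-hot with shape (N, C)."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Hypothetical targets and predictions.
reg_loss = mse(np.array([3.0, 5.0]), np.array([2.0, 7.0]))            # (1 + 4) / 2
bce_loss = binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.9, 0.1]))
cce_loss = categorical_cross_entropy(
    np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]),
    np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]),
)
# A confidently wrong prediction gives a large but finite loss thanks to clipping.
worst_case = binary_cross_entropy(np.array([1.0]), np.array([0.0]))

print(reg_loss, bce_loss, cce_loss, worst_case)
```

Without the clipping step, the last call would evaluate $\log(0)$ and return infinity, which is the instability described above.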


## **18.6 Optimizers**

Optimizers are algorithms that adjust the weights of a neural network to minimize the loss function during training. Two of the most commonly used optimizers are:
 - **SGD (Stochastic Gradient Descent)**: Basic optimization
 - **Adam**: Adaptive learning rates (most popular)

Stochastic Gradient Descent (SGD) updates the model's parameters by computing the gradient of the loss on a small batch of data and moving in the direction that reduces the loss. While simple and effective, plain SGD applies one learning rate to every parameter and can be slow to converge, especially for complex models.

To address these limitations, Adam (Adaptive Moment Estimation) adapts the learning rate for each parameter individually by combining the benefits of momentum and RMSProp. Adam often converges faster and performs well without much tuning, which makes it a popular default for training deep neural networks.
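The difference between the two update rules is easiest to see side by side. The sketch below applies both to a toy loss $L(\theta) = \sum_j \theta_j^2$; the toy loss, function names, starting point, and step count are assumptions for illustration, while the Adam defaults (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`) are the commonly quoted ones.

```python
import numpy as np

def grad(theta):
    """Gradient of the toy loss L(theta) = sum(theta ** 2)."""
    return 2.0 * theta

def sgd_step(theta, g, lr=0.1):
    """Plain gradient step: the same learning rate for every parameter."""
    return theta - lr * g

def adam_step(theta, g, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v are running averages of g and g ** 2."""
    m = beta1 * m + (1 - beta1) * g          # momentum-style average of gradients
    v = beta2 * v + (1 - beta2) * g ** 2     # RMSProp-style average of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

start = np.array([1.0, -2.0])                # hypothetical starting parameters

# On the very first step, Adam moves each parameter by roughly lr,
# regardless of how large that parameter's gradient is.
first, _, _ = adam_step(start, grad(start), np.zeros(2), np.zeros(2), t=1)

theta_sgd, theta_adam = start.copy(), start.copy()
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 301):
    theta_sgd = sgd_step(theta_sgd, grad(theta_sgd))
    theta_adam, m, v = adam_step(theta_adam, grad(theta_adam), m, v, t)

print("Adam after one step:", first)
print("SGD after 300 steps: ", theta_sgd)
print("Adam after 300 steps:", theta_adam)
```

Both optimizers approach the minimum at zero on this toy problem; the per-parameter scaling in Adam matters most when different parameters see gradients of very different magnitudes.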

<sub>Note: Momentum helps speed up learning by smoothing updates using past gradients, while RMSProp adapts learning rates based on recent gradient magnitudes. Adam combines both techniques for efficient and stable training.</sub>

## 18.7 Training Techniques & Regularization

Training a neural network consists of two main steps performed repeatedly over many iterations: the **forward pass** and the **backward pass**.

### Forward Pass

During the forward pass, input data flows through the network, and the final output layer generates a prediction $\hat{y}^{(i)}$ for each input example $i$. The network then computes the loss function $L$, which quantifies the difference between the predicted outputs and the true targets $y^{(i)}$. For a batch of $m$ samples, the average loss is:

$$
L = \frac{1}{m} \sum_{i=1}^m L\left(\hat{y}^{(i)}, y^{(i)}\right)
$$

This loss measures how well the model is performing. The individual steps are described in detail below.


## Forward Pass: Step-by-Step

The forward pass is the phase in which input data flows through the neural network to produce predictions.


<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1400/1*J-v2B6T9RKxdvwThtQ1NVg.png" width="600"/>
</p>

<p align="center"><b>Figure: Forward Pass in Neural Networks</b></p>



Here's how it works:

**Input Layer**  
Each input example, denoted $x_i^{(0)}$, is fed into the network.

**Hidden Layers**  
For each hidden layer $l = 1, 2, \ldots, L$ (here $L$ is the number of hidden layers, not the loss), the network performs the following steps:

**Linear transformation:**  
$$
z_i^{[l]} = W^{[l]} x_i^{(l-1)} + b^{[l]}
$$

**Non-linear activation:**  
$$
x_i^{(l)} = \sigma\left(z_i^{[l]}\right)
$$  
Here, $x_i^{(l)}$ is the activated output of layer $l$, used as input for the next layer.

**Output Layer**  
After the last hidden layer $L$, the output layer produces the final prediction:  
$$
\hat{y}_i = f\left(x_i^{(L)}\right)
$$  
where $f$ is an appropriate output function (e.g., identity for regression, sigmoid for binary classification, softmax for multi-class classification).

**Loss Computation**  

The predicted output $\hat{y}_i$ is compared to the true label $y_i$ using a loss function $L(\hat{y}_i, y_i)$.

Example (Mean Squared Error):  
$$
L(\hat{y}_i, y_i) = (\hat{y}_i - y_i)^2
$$

**Average Loss Over the Batch**  
For a batch of $m$ samples, the average loss is:  
$$
L = \frac{1}{m} \sum_{i=1}^m L(\hat{y}_i, y_i)
$$  
This average loss quantifies how well the model is performing and is used in the backward pass to update model parameters.
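The forward-pass steps above can be traced on a tiny network. The sketch below follows the same structure (linear transformation, activation, output function, squared-error loss, batch average); the weights, inputs, and the choice of ReLU for the activation $\sigma$ and identity for $f$ are assumptions made so the numbers are easy to check by hand.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, weights, biases, output_fn=lambda a: a):
    """Run one example through every layer, then apply the output function f."""
    a = x                                   # x^{(0)}: the raw input
    for W, b in zip(weights, biases):
        z = W @ a + b                       # linear transformation z^{[l]}
        a = relu(z)                         # non-linear activation x^{(l)}
    return output_fn(a)

# Hypothetical parameters: 2 inputs -> 2 hidden units -> 1 unit.
weights = [np.array([[1.0, -1.0], [0.5, 0.5]]), np.array([[1.0, 2.0]])]
biases = [np.array([0.0, -1.0]), np.array([0.5])]

X = np.array([[2.0, 1.0], [0.0, 3.0]])      # a batch of m = 2 examples
y = np.array([2.0, 2.0])                    # true targets

y_hat = np.array([forward(x, weights, biases)[0] for x in X])
per_example_loss = (y_hat - y) ** 2         # squared error for each example
batch_loss = per_example_loss.mean()        # average loss over the batch

print(y_hat)        # predictions 2.5 and 1.5
print(batch_loss)   # (0.25 + 0.25) / 2 = 0.25
```

In practice the loop over examples is replaced by a single matrix multiplication over the whole batch, but the per-example version maps more directly onto the equations.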

<h4><a href="/CMSC320TextBook/chapter18/interactive_forward_propagation.html">Click Here for Interactive Forward Propagation Visualization</a></h4>

## Backward Pass (Backpropagation)

In the backward pass, gradients of the loss with respect to each parameter (weights $W^{[l]}$ at layer $l$) are computed using the chain rule.

The chain rule lets us break down the gradient of the loss into simpler parts by following the flow of computations backward through the network. It helps calculate how changes in weights affect the final loss by multiplying the derivatives of each intermediate step.

Mathematically:

$$
\frac{\partial L}{\partial W^{[l]}} =
\frac{\partial L}{\partial a^{[l]}} \cdot
\frac{\partial a^{[l]}}{\partial z^{[l]}} \cdot
\frac{\partial z^{[l]}}{\partial W^{[l]}}
$$

where:

- $a^{[l]}$ is the activation output of layer $l$ (written $x^{(l)}$ in the forward pass above),  
- $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$ is the input to the activation function.

**Weight Update**  
The weights are then updated using gradient descent:

$$
W^{[l]} := W^{[l]} - \alpha \cdot \frac{\partial L}{\partial W^{[l]}}
$$

where:

- $W^{[l]}$ = current weights at layer $l$,  
- $\alpha$ = learning rate (controls the step size),  
- $\frac{\partial L}{\partial W^{[l]}}$ = gradient of the loss with respect to the weights.
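The three chain-rule factors and the update rule can be checked numerically on a single layer. The sketch below assumes a sigmoid activation and a squared-error loss, with hypothetical values for the weights and inputs, and compares the analytic gradient against a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(W, b, a_prev, y):
    """Squared-error loss of a single sigmoid layer."""
    a = sigmoid(W @ a_prev + b)
    return np.sum((a - y) ** 2)

# Hypothetical layer: 2 inputs -> 1 output.
a_prev = np.array([0.5, -1.0])   # activations from the previous layer, a^{[l-1]}
y = np.array([1.0])              # target
W = np.array([[0.1, 0.2]])
b = np.array([0.0])
alpha = 0.5                      # learning rate

# Forward pass
z = W @ a_prev + b
a = sigmoid(z)

# Backward pass: the three chain-rule factors
dL_da = 2.0 * (a - y)            # dL/da for squared error
da_dz = a * (1.0 - a)            # derivative of the sigmoid
delta = dL_da * da_dz            # dL/dz
dL_dW = np.outer(delta, a_prev)  # dz/dW contributes a^{[l-1]}
dL_db = delta

# Finite-difference check of dL/dW
eps = 1e-6
num_grad = np.zeros_like(W)
for j in range(W.shape[1]):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[0, j] += eps
    W_minus[0, j] -= eps
    num_grad[0, j] = (loss_fn(W_plus, b, a_prev, y)
                      - loss_fn(W_minus, b, a_prev, y)) / (2 * eps)

# Gradient descent update
loss_before = loss_fn(W, b, a_prev, y)
W_new = W - alpha * dL_dW
b_new = b - alpha * dL_db
loss_after = loss_fn(W_new, b_new, a_prev, y)

print("analytic: ", dL_dW)
print("numerical:", num_grad)
print("loss before / after:", loss_before, loss_after)
```

Agreement between the analytic and numerical gradients is a standard sanity check when implementing backpropagation by hand.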

**Repeat**  
This process of forward pass, backward pass, and weight update is repeated over many epochs (full passes over the training data) until the loss stops improving or reaches an acceptable level.

<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1400/0*LnQvzEdc8wkkUte9.png" width="600"/>
</p>

<p align="center"><b>Figure: Backpropagation in Neural Networks</b></p>

<h4><a href="/CMSC320TextBook/chapter18/interactive_backward_propagation.html">Click Here for Interactive Backpropagation Visualization</a></h4>






## Regularization Techniques

Regularization adds penalty terms to the loss function to prevent overfitting, which happens when the model fits the training data too closely, noise included, and therefore performs poorly on new, unseen data. By constraining model complexity, regularization encourages the model to generalize better.

---

### L2 Regularization (Weight Decay):

$$
L_{reg} = L + \lambda \| w \|_2^2 = L + \lambda \sum_{j} w_j^2
$$

L2 regularization adds the sum of the squared weights to the loss function; here $w$ collects every weight in the network and $j$ indexes the individual weights. The hyperparameter $\lambda$ controls the strength of this penalty. This encourages the model to keep weights small but not necessarily zero, which tends to distribute the "importance" across many features and reduces overfitting.

---

### L1 Regularization:

$$
L_{reg} = L + \lambda \| w \|_1 = L + \lambda \sum_{j} |w_j|
$$

L1 regularization adds the sum of the absolute values of the weights to the loss, where $j$ indexes every individual weight in the network. This tends to push some weights exactly to zero, effectively performing feature selection by encouraging sparsity in the model. This can be useful when you want a simpler model with fewer active features.
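Both penalties are simple sums over the weight arrays. The sketch below computes them for a hypothetical two-layer set of weights; the function names, the weight values, the unregularized loss of 0.8, and $\lambda = 0.01$ are assumptions for illustration. Biases are left out of the penalty here, which is a common convention rather than a requirement.

```python
import numpy as np

def l2_penalty(weights, lam):
    """lambda times the sum of squared weights across all layers."""
    return lam * sum(np.sum(W ** 2) for W in weights)

def l1_penalty(weights, lam):
    """lambda times the sum of absolute weights across all layers."""
    return lam * sum(np.sum(np.abs(W)) for W in weights)

# Hypothetical weight matrices for two layers.
weights = [np.array([[1.0, -2.0], [0.0, 0.5]]), np.array([[3.0, -1.0]])]
data_loss = 0.8    # hypothetical unregularized loss L
lam = 0.01         # regularization strength lambda

loss_l2 = data_loss + l2_penalty(weights, lam)   # 0.8 + 0.01 * 15.25
loss_l1 = data_loss + l1_penalty(weights, lam)   # 0.8 + 0.01 * 7.5

print(loss_l2, loss_l1)
```

Note how the weight of 3.0 contributes 9.0 to the L2 sum but only 3.0 to the L1 sum: L2 punishes large weights disproportionately, while L1 applies the same pressure to every nonzero weight.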

---

### Why Use Regularization?

- **Improves Generalization:** By limiting the size or number of weights, the model avoids fitting noise and irrelevant patterns in the training data.  
- **Controls Model Complexity:** Prevents weights from growing too large, which can cause unstable predictions.  
- **Feature Selection (L1):** Helps identify and ignore irrelevant features by forcing their weights to zero.

---

### Choosing $\lambda$

The regularization strength $\lambda$ is a hyperparameter that must be tuned carefully:

- Too small: little effect on overfitting  
- Too large: model may underfit by being too constrained

---

### Example: When to Use L1 vs. L2 Regularization

- **L2 Regularization (Ridge):**  
Use when most features are relevant and you want to keep them all but reduce overfitting by shrinking weights smoothly.  
*Example:* Predicting house prices using many meaningful features.

- **L1 Regularization (Lasso):**  
Use when you expect only a few important features and want the model to ignore irrelevant ones by setting some weights exactly to zero.  
*Example:* Selecting important genes from thousands of candidates in a biological study.

---

### Other Common Regularization Methods (Brief)

- **Dropout:** Randomly sets some activations to zero during training to prevent co-adaptation of neurons, helping the model generalize better.  
- **Early Stopping:** Stops training when performance on a validation set stops improving, preventing overfitting.
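Both methods are short to express in code. The sketch below uses "inverted" dropout, a common variant that rescales the surviving activations so their expected value is unchanged, and an early-stopping rule with a `patience` parameter (the number of epochs to wait without improvement). The rescaling, the `patience` parameter, the function names, and the validation losses are all assumptions for illustration.

```python
import numpy as np

def dropout(a, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of activations, rescale the rest."""
    if not training or rate == 0.0:
        return a                               # no dropout at prediction time
    keep_prob = 1.0 - rate
    mask = rng.random(a.shape) < keep_prob     # True where the activation survives
    return a * mask / keep_prob

def early_stopping_epoch(val_losses, patience):
    """Return the epoch index at which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1                 # never triggered: ran all epochs

rng = np.random.default_rng(0)
activations = np.ones(10000)
dropped = dropout(activations, rate=0.5, rng=rng)
print("fraction zeroed:", np.mean(dropped == 0.0))

# Hypothetical validation losses recorded after each epoch.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.70]
stop_epoch = early_stopping_epoch(val_losses, patience=2)
print("stop at epoch index:", stop_epoch)
```

With these losses the best value (0.60) occurs at epoch index 2, and two epochs pass without beating it, so training stops at index 4.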









## Putting It All Together: How Neural Network Training Happens

After understanding how regularization helps prevent overfitting and improve generalization, it's important to see how the entire training process operates in practice. Training a neural network involves repeatedly feeding data through the model in manageable portions, updating parameters, and gradually improving performance.

The next section explains the key concepts of **batch size**, **iteration**, and **epoch**, which organize how training data is processed and how the model learns over time.

---

## Neural Network Training: Batch Size, Iteration, and Epoch

When training a neural network, the dataset is usually too large to process all at once, so it is split into smaller parts called **batches**.

---

### Batch Size  
The number of training examples used in one forward and backward pass.  
For example, a batch size of 32 means the model processes 32 samples before updating weights.

---

### Iteration  
One update of the model’s parameters (weights and biases).  
Each iteration uses one batch of data.

---

### Epoch  
One full pass over the entire training dataset.  
If the dataset has 1000 samples and batch size is 100, then:  
$$
1 \text{ epoch} = \frac{1000}{100} = 10 \text{ iterations}
$$
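When the dataset size is not an exact multiple of the batch size, the last batch is simply smaller. The sketch below assumes that this partial batch is still used (some training setups drop it instead); the function name is illustrative.

```python
import math

def iterations_per_epoch(num_samples, batch_size):
    """Number of parameter updates in one epoch, counting a final partial batch."""
    return math.ceil(num_samples / batch_size)

print(iterations_per_epoch(1000, 100))  # 10, as in the example above
print(iterations_per_epoch(1000, 32))   # 32: 31 full batches plus one batch of 8
```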

<p align="center">
  <img src="https://media.geeksforgeeks.org/wp-content/uploads/20241024155237307614/epoch-in-machine-learning_.webp" width="600"/>
</p>

<p align="center"><b>Figure: Epoch in Machine Learning</b></p>

---

### Training Process Overview

1. **Start Training:** Initialize model parameters.

2. For each epoch (repeat multiple times):

   - Divide data into batches according to batch size.  
   
   - For each batch:  
     - Perform forward pass to calculate predictions.  
     - Calculate loss based on predictions and true labels.  
     - Perform backward pass (backpropagation) to compute gradients.  
     - Update weights using gradients (e.g., gradient descent).


3. Repeat until model performance stabilizes or desired accuracy is reached.
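The overview above can be sketched as a complete mini-batch training loop. To keep it short, the "network" here is a single linear unit trained with MSE on synthetic data; the data-generating rule, learning rate, batch size, epoch count, and the shuffling of examples each epoch are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: y = 3x + 2 plus a little noise.
X = rng.uniform(-1.0, 1.0, size=(1000, 1))
y = 3.0 * X[:, 0] + 2.0 + 0.01 * rng.normal(size=1000)

w, b = 0.0, 0.0            # 1. initialize model parameters
alpha = 0.1                # learning rate
batch_size = 100
num_epochs = 50
iterations = 0

for epoch in range(num_epochs):                  # 2. for each epoch
    order = rng.permutation(len(X))              # shuffle, then divide into batches
    for start in range(0, len(X), batch_size):   # for each batch
        idx = order[start:start + batch_size]
        x_batch, y_batch = X[idx, 0], y[idx]

        y_pred = w * x_batch + b                 # forward pass
        error = y_pred - y_batch
        loss = np.mean(error ** 2)               # loss on this batch (MSE)

        grad_w = 2.0 * np.mean(error * x_batch)  # backward pass: dL/dw
        grad_b = 2.0 * np.mean(error)            # backward pass: dL/db

        w -= alpha * grad_w                      # gradient descent update
        b -= alpha * grad_b
        iterations += 1

print(f"w = {w:.3f}, b = {b:.3f}, iterations = {iterations}")
```

With 1000 samples and a batch size of 100, each epoch performs 10 iterations, so 50 epochs give 500 parameter updates in total.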


<p>
  <img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*TFKm_XC2dIk36i80drQ0KA.png" width="600"/>
</p>

<p>
  <b>Figure:</b> The batch size determines how many training examples are used for each weight update. For example, with 5000 training images and a batch size of 1000, the data is split into 5 batches, so the weights are updated 5 times (5 iterations) per epoch.
</p>

<p><em>Ref: <a href="https://medium.com/@crazyhatcap/epoch-batch-size-iteration-in-neural-network-training-process-bee58415eb8e" target="_blank">medium.com/@crazyhatcap</a></em></p>


## Chapter Summary
In this chapter, we explored the key components of training neural networks, including the forward and backward passes, how loss guides learning, and how weights are updated iteratively. We also discussed the role of batch size, iterations, and epochs in organizing the training process for efficient and effective learning.

Understanding these concepts lays the foundation for building, training, and optimizing deep learning models. In the next chapter, we will delve into advanced optimization techniques and strategies to further improve model performance.

## Knowledge Check

<iframe src="https://docs.google.com/forms/d/e/1FAIpQLSdWDTzfBm1pC7w07wLKetrKKxVlzdEz-D-feCfFYemssJeVAA/viewform?embedded=true" width="640" height="1078" frameborder="0" marginheight="0" marginwidth="0">Loading…</iframe>