# Chapter 18: Introduction to the Neural Network


Artificial intelligence has revolutionized the way machines learn from data, enabling them to perform tasks that once seemed impossible. From powering voice assistants to enabling autonomous vehicles, AI systems are now integrated into many aspects of daily life. At the core of this transformation are neural networks—computational models inspired by **the structure and function of the human brain**. These networks consist of interconnected nodes, or "neurons," that process information in layers, allowing them to recognize patterns, make decisions, and improve over time. From early single-layer perceptrons to today’s deep architectures, neural networks have become a cornerstone of modern AI, powering breakthroughs in fields like healthcare, finance, and autonomous systems.

> Neural networks are a **type of supervised learning algorithm** capable of identifying intricate patterns and relationships within data, making them suitable for tackling problems that traditional models often struggle with.

![NN](NN.png)


Although neural network algorithms have existed for many years, recent advancements in their architectures have led to significant improvements in performance on large-scale machine learning tasks. These developments form the foundation of what is now referred to as the "deep learning" methodology.

Deep learning, a branch of machine learning and artificial intelligence, leverages neural networks with multiple hidden layers to address highly complex tasks. These tasks range from natural language processing—such as speech recognition and text interpretation—to computer vision applications like object detection and image classification. The rise of deep learning over the past twenty years can be attributed to its remarkable effectiveness, the surge in computational power, and the growing accessibility of vast datasets.

## 18.1 Fundamentals of Neural Networks

Neural networks are a powerful class of supervised learning algorithms capable of modeling complex, nonlinear relationships in data. Unlike traditional machine learning models, which rely on handcrafted features, neural networks automatically learn hierarchical representations from raw input. Key components include:

* **Input Layer:** Receives raw data (e.g., pixels in an image, words in a text).

* **Hidden Layers:** Intermediate layers that transform inputs through weighted connections and activation functions.

* **Output Layer:** Produces the final prediction (e.g., class label, regression value).

* **Weights & Biases:** Adjustable parameters learned during training.

* **Activation Functions:** Introduce nonlinearity (e.g., ReLU, Sigmoid, Tanh).


Training a neural network involves **forward propagation** (passing data through the network to produce predictions) and **backpropagation** (computing the gradients of the error, which an optimization technique such as gradient descent then uses to adjust the weights).

## **18.1 Neurons: The Building Blocks**

A neuron (or node) is the fundamental unit of a neural network, mimicking biological neurons. It receives inputs, processes them using weights and an activation function, and produces an output. Mathematically, a neuron's operation can be represented as:


$$ y = f\left(\sum_{i=1}^{n} (w_i x_i) + b\right) $$

Where:
- $x_i$ = input features
- $w_i$ = weights
- $b$ = bias term  
- $f$ = activation function
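
The formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the input values, weights, bias, and the choice of sigmoid as $f$ are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, used here as the activation function f."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, f=sigmoid):
    """Compute y = f(sum_i w_i * x_i + b) for a single neuron."""
    z = np.dot(w, x) + b   # weighted sum of the inputs plus the bias
    return f(z)

# Illustrative values (assumed, not taken from the text)
x = np.array([1.0, 2.0, 3.0])     # input features x_i
w = np.array([0.5, -0.25, 0.1])   # weights w_i
b = -0.3                          # bias term

y = neuron(x, w, b)
print(y)   # the weighted sum is (numerically) zero, so y is very close to 0.5
```

Passing a different function as `f` (for example `np.tanh`) changes the activation without touching the weighted-sum step.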


## **18.2 Layers: Input, Hidden, and Output**
Neural networks are organized into layers:

1. **Input Layer**:
    - The first layer that receives raw data (e.g., pixel values in an image, word embeddings in text).

2. **Hidden Layers**:
    - Intermediate layers between input and output where feature extraction and transformation occur.
    - Deep networks have multiple hidden layers.

3. **Output Layer**:
    - Produces the final predictions.
    - Classification: class probabilities.
    - Regression: continuous values.
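
To make the three kinds of layer concrete, the following sketch pushes one input vector through a small network. The layer sizes, the random weights, and the choice of ReLU for the hidden layer and softmax for the output layer are assumptions for illustration only; a trained network would have learned its weights from data.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z)
    return e / e.sum()

# Hypothetical sizes: 4 input features -> 5 hidden units -> 3 classes
n_in, n_hidden, n_out = 4, 5, 3
W1, b1 = rng.normal(size=(n_hidden, n_in)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_out, n_hidden)), np.zeros(n_out)

x = rng.normal(size=n_in)      # input layer: one raw feature vector
h = relu(W1 @ x + b1)          # hidden layer: weighted sum, then nonlinearity
probs = softmax(W2 @ h + b2)   # output layer: class probabilities

print("hidden activations:", h.shape)
print("class probabilities:", probs, "sum =", probs.sum())
```

For a regression task, the last line would typically return `W2 @ h + b2` directly, without the softmax.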


## **18.3 Weights and Biases**

| Component | Role |
|-----------|------|
| **Weights** ($w_i$) | Determine the connection strength between neurons. They are adjusted during training to minimize prediction errors. |
| **Bias** ($b$) | Allows shifting the activation function to improve model flexibility. |
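
The two roles in the table can be seen on a one-input sigmoid neuron. In this sketch (the weight, bias values, and helper name `neuron_output` are illustrative assumptions), the weight sets how steeply the output rises, while the bias moves the input value at which the output crosses 0.5.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, b):
    """Single-input sigmoid neuron: sigmoid(w * x + b)."""
    return sigmoid(w * x + b)

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
w = 3.0   # weight: controls how steeply the output rises with x

for b in (-3.0, 0.0, 3.0):
    crossover = (0.0 - b) / w   # input at which w * x + b = 0, i.e. output = 0.5
    outputs = np.round(neuron_output(x, w, b), 3)
    print(f"b = {b:+.1f}  crossover at x = {crossover:+.1f}  outputs = {outputs}")
```

Changing `b` slides the whole curve left or right without changing its shape, which is the flexibility the bias adds.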



## **18.4 Activation Functions**

Activation functions are essential components of neural networks: they introduce the non-linearity that enables the model to learn complex patterns. Without them, any stack of layers collapses into a single linear transformation, so the network would be no more expressive than a linear model and could not capture intricate relationships in the data. This section explores four widely used activation functions (Sigmoid, ReLU, Tanh, and Softmax) and discusses the properties, advantages, limitations, and typical use cases of each.


| Function  | Formula                          | Use Case                      |
|-----------|----------------------------------|-------------------------------|
| **Sigmoid** | $\sigma(x) = \frac{1}{1 + e^{-x}}$ | Binary classification  (outputs between 0 and 1)         |
| **ReLU**   | $f(x) = \max(0, x)$              | Default choice for hidden layers (fast computation)    |
| **Tanh**   | $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ | Similar to sigmoid but outputs between -1 and 1 |
| **Softmax**| $\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}$ | Multi-class classification (outputs probabilities) |
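
The formulas in the table translate almost directly into NumPy. The sketch below follows the table literally; the sample input vector is an assumption for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def softmax(z):
    e = np.exp(z)
    return e / np.sum(e)

x = np.array([-2.0, 0.0, 2.0])   # illustrative inputs
for name, f in [("sigmoid", sigmoid), ("relu", relu), ("tanh", tanh), ("softmax", softmax)]:
    print(f"{name:8s}", np.round(f(x), 3))
```

Sigmoid, ReLU, and Tanh act on each element independently, whereas Softmax normalizes across the whole vector so that its outputs sum to 1.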


**The sigmoid function** $\sigma(x) = \frac{1}{1 + e^{-x}}$ maps any real-valued input to a value between 0 and 1. Because the output can be read as a probability, sigmoid is well suited to binary classification and to output layers where a probabilistic interpretation is needed.

Advantages:
* Useful when outputs need to be interpreted as probabilities.
* Differentiable, allowing gradient-based optimization.


Disadvantages:
* Vanishing Gradients: For inputs of large magnitude (strongly positive or strongly negative), the gradient becomes nearly zero, slowing down learning.
* Not Zero-Centered: Outputs are always positive, leading to inefficient weight updates.
* Computationally Expensive: Requires evaluating an exponential.
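
The vanishing-gradient behaviour can be checked numerically using the standard identity $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$. The sample points below are chosen only for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (0.0, 2.0, 5.0, 10.0, -10.0):
    print(f"x = {x:6.1f}   sigmoid = {sigmoid(x):.5f}   gradient = {sigmoid_grad(x):.2e}")
```

The gradient peaks at 0.25 for $x = 0$ and falls below $10^{-4}$ by $|x| = 10$, so a neuron pushed far into either tail passes almost no learning signal backward.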


**ReLU (Rectified Linear Unit) Function:** ReLU is one of the most popular activation functions due to its simplicity and effectiveness. It outputs the input directly if the input is positive and zero otherwise ($f(x) = \max(0, x)$). It is non-linear, yet simple and computationally efficient.

Advantages:

* Avoids Vanishing Gradient (for positive inputs): Unlike sigmoid, gradients remain strong for active neurons.
* Fast Computation: No complex exponentials.
* Sparsity: Can deactivate neurons (output zero), making the network more efficient.

Disadvantages:

* Dying ReLU Problem: A neuron whose input is negative outputs zero and receives zero gradient; if this happens for every training example, the neuron stops learning entirely.
* Not Zero-Centered: Like sigmoid, can lead to slower convergence.

What is the **Dying ReLU** Problem?
If a neuron consistently receives negative inputs, its output is zero and its weights stop updating, because the gradient is also zero. Over time, such neurons can "die" and never activate again, reducing the model’s capacity to learn.

Solutions to Dying ReLU:

* Leaky ReLU: Uses a small nonzero slope (e.g., 0.01) for negative inputs instead of outputting zero.
$$
\text{Leaky ReLU}(x) =
\begin{cases}
x, & \text{if } x \geq 0 \\
\alpha x, & \text{if } x < 0
\end{cases}
$$

where $\alpha$ is a small positive constant (e.g., 0.01).

* Parametric ReLU (PReLU): Learns the negative slope during training.

* Exponential Linear Unit (ELU): Handles negative inputs smoothly by replacing the zero output with the curve $\alpha(e^{x} - 1)$, which levels off at $-\alpha$.
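
The sketch below compares the gradient of plain ReLU with two of the alternatives listed above. It assumes the standard ELU definition ($x$ for $x > 0$, $\alpha(e^{x} - 1)$ otherwise) and illustrative values for $\alpha$; PReLU is omitted because its slope is a learned parameter rather than a fixed constant.

```python
import numpy as np

def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise."""
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.01):
    return np.where(x >= 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x >= 0, 1.0, alpha)

def elu(x, alpha=1.0):
    # expm1(x) computes e^x - 1; the minimum keeps the unused branch from overflowing
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

x = np.array([-3.0, -0.5, 0.5, 3.0])   # illustrative pre-activations

print("ReLU gradient:      ", relu_grad(x))         # zero for every negative input
print("Leaky ReLU output:  ", leaky_relu(x))
print("Leaky ReLU gradient:", leaky_relu_grad(x))   # small but nonzero for negative inputs
print("ELU output:         ", np.round(elu(x), 3))
```

Because the Leaky ReLU gradient never reaches exactly zero, a neuron that receives only negative inputs can still update its weights and recover.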

**Tanh (Hyperbolic Tangent) Function:**
The tanh function is similar to sigmoid but maps inputs to a range between -1 and 1, making it zero-centered: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$.
* An activation function is zero-centered if its output values are distributed roughly symmetrically around zero (i.e., they have a mean close to zero). This property helps keep training stable and efficient by preventing systematic weight updates in a single direction.

* When activation outputs are not zero-centered (e.g., sigmoid outputs lie between 0 and 1), every input passed to the next layer is positive. The gradients for all the weights of a given neuron then share the same sign, so in any one step those weights can only all increase or all decrease, which slows down convergence.

Advantages:

* Zero-centered output allows for better convergence during gradient descent.

* Stronger gradients than sigmoid for inputs near 0.

Disadvantages:

* Still suffers from the vanishing gradient problem: like sigmoid, its gradient becomes very small for inputs of large magnitude.

* Slightly more computationally expensive than ReLU, due to the exponential operations.
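
Both advantages can be checked numerically. The sketch below, which assumes an evenly spaced grid of inputs symmetric around zero, compares the mean output of sigmoid and tanh and their derivatives at $x = 0$, using $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ and $\tanh'(x) = 1 - \tanh^2(x)$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3.0, 3.0, 601)   # inputs symmetric around zero

# Zero-centering: sigmoid outputs average about 0.5, tanh outputs about 0
print("mean sigmoid output:", sigmoid(x).mean())
print("mean tanh output:   ", np.tanh(x).mean())

# Gradient strength near zero
sigmoid_grad0 = sigmoid(0.0) * (1.0 - sigmoid(0.0))   # 0.25
tanh_grad0 = 1.0 - np.tanh(0.0) ** 2                  # 1.0
print("sigmoid'(0) =", sigmoid_grad0, "  tanh'(0) =", tanh_grad0)
```

At the origin the tanh gradient is four times larger than the sigmoid gradient, which is what "stronger gradients" refers to.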

**The Softmax function** is typically used in the output layer of a multi-class classification model. It converts a vector of raw scores (logits) $z$ into a probability distribution over the $K$ output classes: $\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}$.


Advantages:

* Ensures that the sum of the outputs is 1, making them interpretable as probabilities.

* Highlights the highest-valued input while suppressing the rest, which helps in clear class predictions.

Disadvantages:

* Exponentially sensitive to input scale—can cause numerical instability if logits are too large.

* When classes are not mutually exclusive, Softmax is not ideal (use sigmoid instead for multi-label classification).
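
The numerical instability mentioned above is commonly avoided by subtracting the largest logit before exponentiating, which leaves the result mathematically unchanged. The sketch below contrasts the two versions; the function names and sample logits are illustrative assumptions.

```python
import numpy as np

def softmax_naive(z):
    e = np.exp(z)
    return e / e.sum()

def softmax_stable(z):
    shifted = z - np.max(z)   # the largest exponent becomes e^0 = 1
    e = np.exp(shifted)
    return e / e.sum()

small = np.array([1.0, 2.0, 3.0])
large = np.array([1000.0, 1001.0, 1002.0])   # same gaps between logits, much larger scale

print(softmax_stable(small))
print(softmax_stable(large))   # same probabilities, since only the gaps matter

with np.errstate(over="ignore", invalid="ignore"):
    print(softmax_naive(large))   # exp(1000) overflows to inf, so the result is nan
```

The shift works because multiplying the numerator and denominator by $e^{-\max(z)}$ cancels out, so the stable version is an exact rewrite rather than an approximation.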



## **18.5 Loss Functions**
Loss functions (or cost functions) measure how well a neural network’s predictions match the true target values. During training, the goal is to minimize the loss by adjusting the model’s parameters. The choice of loss function depends on the type of task:



 - **Mean Squared Error (MSE)**: widely used in regression problems, such as predicting house prices or temperature, where the output is continuous. It is the average squared difference between the predicted values $\hat{y}_i$ and the actual values $y_i$ over $N$ samples:

$$L_{MSE} = \frac{1}{N}\sum_{i=1}^N (y_i - \hat{y}_i)^2$$

 Because the differences are squared, larger errors are penalized more heavily. MSE is straightforward and differentiable, which makes it compatible with gradient descent, but it has notable drawbacks: the squaring operation makes it highly sensitive to outliers, and it performs poorly in classification tasks, where it often leads to slow convergence.
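
A short sketch makes the outlier sensitivity visible. The target and prediction values are invented for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of (y_i - y_hat_i)^2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.5, 8.5]           # every prediction is off by 0.5
y_pred_outlier = [2.5, 5.5, 7.5, 19.0]  # same, except one prediction is off by 10

print(mse(y_true, y_pred))          # 0.25
print(mse(y_true, y_pred_outlier))  # 25.1875
```

A single large error raises the loss by a factor of about 100 here, even though the other three predictions are unchanged.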

 - **Cross-Entropy Loss**: widely used for classification tasks, both binary and multi-class. It measures the difference between the predicted probability distribution and the true label distribution.

    - For binary classification, with labels $y_i \in \{0, 1\}$ and predicted probabilities $\hat{y}_i$:

$$L_{BCE} = -\frac{1}{N}\sum_{i=1}^N \left[y_i\log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\right]$$

    - For multi-class classification, with $C$ classes and one-hot labels $y_{i,c}$:

$$L_{CCE} = -\frac{1}{N}\sum_{i=1}^N \sum_{c=1}^C y_{i,c}\log(\hat{y}_{i,c})$$

 In the binary case, the loss penalizes the model heavily when it confidently predicts the wrong class, encouraging outputs closer to the true labels. In multi-class settings, cross-entropy is paired with softmax outputs to handle multiple classes simultaneously. One disadvantage of cross-entropy loss is that it can become very large when the model assigns a near-zero probability to the true class, which may cause instability during training if not handled properly.
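
Both formulas can be sketched in NumPy as follows. Clipping the predicted probabilities away from 0 and 1 is one common way to handle the instability described above; the clipping threshold `eps` and the sample labels and predictions are assumptions for illustration.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """L_BCE for labels in {0, 1} and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """L_CCE for one-hot labels and predicted probabilities, both of shape (N, C)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

y = [1, 0, 1]
print(binary_cross_entropy(y, [0.9, 0.1, 0.9]))   # confident and right: small loss
print(binary_cross_entropy(y, [0.1, 0.9, 0.1]))   # confident and wrong: large loss

one_hot = [[1, 0, 0], [0, 1, 0]]
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
print(categorical_cross_entropy(one_hot, probs))
```

Without the clipping step, a predicted probability of exactly 0 for the true class would produce an infinite loss.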


## **18.6 Optimizers**

Optimizers are algorithms that adjust the weights of a neural network to minimize the loss function during training. Two of the most commonly used optimizers are:
 - **SGD (Stochastic Gradient Descent)**: The basic gradient-based update
 - **Adam**: Adaptive learning rates (most popular)

Stochastic Gradient Descent (SGD) updates the model’s parameters by computing the gradient of the loss on a small batch of data and moving in the direction that reduces the loss. While simple and effective, plain SGD applies the same learning rate to every parameter and can be slow to converge, especially for complex models.

To address these limitations, Adam (Adaptive Moment Estimation) is widely used due to its ability to adapt the learning rate for each parameter individually by combining the benefits of momentum and RMSProp optimizers. Adam often results in faster convergence and better performance without much tuning, making it the most popular choice for training deep neural networks.

<sub>Note: Momentum helps speed up learning by smoothing updates using past gradients, while RMSProp adapts learning rates based on recent gradient magnitudes. Adam combines both techniques for efficient and stable training.</sub>
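
The two update rules can be compared on a toy problem. The sketch below minimizes the assumed one-dimensional loss $f(\theta) = (\theta - 3)^2$; the learning rate, step count, and the Adam hyperparameters `beta1`, `beta2`, and `eps` are commonly used values chosen for illustration, not settings taken from this chapter.

```python
import numpy as np

def grad(theta):
    """Gradient of the toy loss f(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

def sgd_step(theta, g, lr):
    """Plain gradient step: move against the gradient."""
    return theta - lr * g

def adam_step(theta, g, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * g        # momentum: running mean of gradients
    v = beta2 * v + (1.0 - beta2) * g ** 2   # RMSProp-style: running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)           # bias correction for the zero initialization
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta_sgd = 0.0
for _ in range(300):
    theta_sgd = sgd_step(theta_sgd, grad(theta_sgd), lr=0.1)

theta_adam, m, v = 0.0, 0.0, 0.0
for t in range(1, 301):
    theta_adam, m, v = adam_step(theta_adam, grad(theta_adam), m, v, t, lr=0.1)

print("SGD: ", theta_sgd)    # both should end close to the minimum at 3
print("Adam:", theta_adam)
```

Dividing by `np.sqrt(v_hat)` is what makes the step size adaptive: a parameter with consistently large gradients takes proportionally smaller steps. On a problem this simple SGD is perfectly adequate; the benefit of Adam shows up mainly when different parameters have gradients of very different scales.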

## 18.7 Training Techniques & Regularization

Training a neural network consists of two main steps performed repeatedly over many iterations: the **forward pass** and the **backward pass**.

### Forward Pass

During the forward pass, input data flows through the network, and the final output layer generates a prediction $\hat{y}^{(i)}$ for each input example $i$. The network then computes the loss $L$, which quantifies the difference between the predicted outputs and the true targets $y^{(i)}$. For a batch of $m$ samples, $L$ is the average of the per-example losses $\ell$:

$$
L = \frac{1}{m} \sum_{i=1}^m \ell\left(\hat{y}^{(i)}, y^{(i)}\right)
$$

This loss measures how well the model is performing.

### Backward Pass (Backpropagation)

In the backward pass, the gradient of the loss with respect to each parameter (for example, the weights $W^{[l]}$ at layer $l$) is computed using the **chain rule**:

$$
\frac{\partial L}{\partial W^{[l]}} = \frac{\partial L}{\partial a^{[l]}} \cdot \frac{\partial a^{[l]}}{\partial z^{[l]}} \cdot \frac{\partial z^{[l]}}{\partial W^{[l]}}
$$

Where:
- $a^{[l]}$ is the activation output of layer $l$,
- $z^{[l]}$ is the linear combination $W^{[l]} a^{[l-1]} + b^{[l]}$.

Parameters are updated using gradient descent with learning rate $\alpha$:

$$
W^{[l]} := W^{[l]} - \alpha \frac{\partial L}{\partial W^{[l]}}
$$
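
The forward pass, backward pass, and update rule can be put together in a small from-scratch training loop. Everything specific in this sketch is an assumption for illustration: the toy dataset (points labeled by whether they fall inside a circle), the 2-8-1 architecture, the tanh and sigmoid activations, the cross-entropy loss, and the learning rate. Examples are stored as columns so that each layer computes $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$ as written above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed): label is 1 when a point lies inside a circle around the origin
m = 200
X = rng.uniform(-1.0, 1.0, size=(2, m))                         # shape (features, samples)
y = (np.sum(X ** 2, axis=0) < 0.5).astype(float).reshape(1, m)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Parameters of a 2 -> 8 -> 1 network
W1, b1 = rng.normal(scale=0.5, size=(8, 2)), np.zeros((8, 1))
W2, b2 = rng.normal(scale=0.5, size=(1, 8)), np.zeros((1, 1))
alpha = 0.5   # learning rate
losses = []

for step in range(1000):
    # Forward pass
    z1 = W1 @ X + b1
    a1 = np.tanh(z1)
    z2 = W2 @ a1 + b2
    y_hat = sigmoid(z2)
    p = np.clip(y_hat, 1e-12, 1.0 - 1e-12)
    losses.append(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

    # Backward pass: apply the chain rule layer by layer, from output to input
    dz2 = (y_hat - y) / m                 # dL/dz2 for sigmoid output + cross-entropy
    dW2 = dz2 @ a1.T                      # dz2/dW2 contributes a1
    db2 = dz2.sum(axis=1, keepdims=True)
    da1 = W2.T @ dz2
    dz1 = da1 * (1.0 - a1 ** 2)           # tanh'(z) = 1 - tanh(z)^2
    dW1 = dz1 @ X.T
    db1 = dz1.sum(axis=1, keepdims=True)

    # Gradient descent update: W := W - alpha * dL/dW
    W1 -= alpha * dW1
    b1 -= alpha * db1
    W2 -= alpha * dW2
    b2 -= alpha * db2

accuracy = np.mean((y_hat > 0.5) == (y == 1.0))
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}, training accuracy: {accuracy:.2f}")
```

Each gradient has the same shape as the parameter it updates, which is a useful sanity check when writing backpropagation by hand. In practice, frameworks compute these gradients automatically.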

### Regularization Techniques

Regularization adds penalty terms to the loss function to prevent overfitting by constraining model complexity:

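- **L2 Regularization (Weight Decay):**

$$
L_{\text{reg}} = L + \lambda \sum_{l} \|W^{[l]}\|_2^2
$$

- **L1 Regularization:**

$$
L_{\text{reg}} = L + \lambda \sum_{l} \|W^{[l]}\|_1
$$

Where $\lambda$ controls the regularization strength, the sums run over the layers $l$, and the norms are taken over all entries of each weight matrix (the sum of squared entries for L2, the sum of absolute values for L1).

The L2 penalty encourages smaller weights, while the L1 penalty encourages sparser weights; both can improve generalization on unseen data.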
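
As a minimal sketch, the two penalties and their gradients can be computed as follows. The weight matrices, the value of $\lambda$, and the unregularized loss value are hypothetical numbers chosen so the arithmetic is easy to follow.

```python
import numpy as np

def l2_penalty(weights, lam):
    """lambda * sum of squared entries over all weight matrices."""
    return lam * sum(np.sum(W ** 2) for W in weights)

def l1_penalty(weights, lam):
    """lambda * sum of absolute values over all weight matrices."""
    return lam * sum(np.sum(np.abs(W)) for W in weights)

def l2_grad(W, lam):
    return 2.0 * lam * W      # shrinks each weight in proportion to its size

def l1_grad(W, lam):
    return lam * np.sign(W)   # constant-size push toward zero (0 at exactly zero)

weights = [np.array([[0.5, -1.0], [2.0, 0.0]]), np.array([[1.5, -0.5]])]
lam = 0.01
data_loss = 0.40   # hypothetical unregularized loss L

print("L2-regularized loss:", data_loss + l2_penalty(weights, lam))   # 0.40 + 0.01 * 7.75
print("L1-regularized loss:", data_loss + l1_penalty(weights, lam))   # 0.40 + 0.01 * 5.5
print("L2 gradient for first matrix:\n", l2_grad(weights[0], lam))
print("L1 gradient for first matrix:\n", l1_grad(weights[0], lam))
```

During training, these gradients are added to the gradient of the data loss before the update step. The constant-size L1 push is what drives small weights all the way to zero, while the proportional L2 push only shrinks them.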





## Reference

- TensorFlow Playground: https://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4,2&seed=0.89879&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false