---
# ✅ **What is Deep Learning?**

- Deep Learning (DL) is a **subset of Machine Learning (ML)**.
- It uses **Artificial Neural Networks (ANNs)** with multiple (deep) layers.
- Learns patterns from data by **automatically extracting features**.
- Works best with **large datasets** and high computational power (GPUs).
- Used in **computer vision**, **speech recognition**, **NLP**, etc.

---

# ✅ **Machine Learning vs Deep Learning**

| Aspect | Machine Learning (ML) | Deep Learning (DL) |
|--------|------------------------|---------------------|
| **1. Data Dependency** | Works well with **small/medium** data | Needs **large** datasets |
| **2. Hardware Dependency** | Runs on **normal CPU** | Needs **GPUs/TPUs** |
| **3. Training Time** | **Faster** training | **Slower**, especially for deep nets |
| **4. Feature Selection** | Requires **manual** feature engineering | Learns features **automatically** |
| **5. Interpretability** | More **interpretable** | Often a **black box** |

---

# ✅ **Why Is Deep Learning Successful Now? (Factors Behind Its Growth)**

- **Datasets**: Availability of large datasets (e.g., ImageNet, text corpora).
- **Hardware**: GPUs/TPUs speed up matrix computations.
- **Architecture**: Modern models like CNNs, RNNs, Transformers.
- **Frameworks**: Easy-to-use tools like **TensorFlow**, **PyTorch**.
- **Community**: Open-source contributions, tutorials, large research community.

---


# ✅ **What is a Neural Network?**

- A Neural Network is a **computational model** inspired by the **human brain**.
- It is made up of layers:
  - **Input Layer**: Takes the raw features.
  - **Hidden Layers**: Process inputs using **weights**, **biases**, and **activation functions**.
  - **Output Layer**: Produces the prediction or output.

--- 

# ✅ **Types of Neural Networks**

| Type | Description | Common Use |
|------|-------------|-------------|
| **1. Feedforward Neural Network (FNN)** | Basic structure, unidirectional | Classification, Regression |
| **2. Convolutional Neural Network (CNN)** | Works on images, uses filters | Image classification, Object detection |
| **3. Recurrent Neural Network (RNN)** | Works on sequences, has memory | Time-series, NLP |
| **4. LSTM (Long Short-Term Memory)** | A type of RNN with better memory handling | Language modeling, Text generation |
| **5. GAN (Generative Adversarial Network)** | Generator + Discriminator model | Image generation, Deepfakes |
| **6. Autoencoder** | Learns compressed representation | Denoising, Dimensionality Reduction |
| **7. Transformer** | Based on self-attention | Language models like BERT, GPT |

# Note:
- FNN is an umbrella term that covers both the Perceptron and the MLP.

---

# ✅ **Applications of Deep Learning**
- Computer Vision
- Natural Language Processing (NLP)
- Healthcare
- Self-driving Cars
- Finance
- Generative AI
---

# ✅ **Perceptron Algorithm**

- The Perceptron Algorithm is a **supervised learning** algorithm used for **binary classification**.
- It learns a linear decision boundary to separate two classes.
- It updates the weights to reduce classification errors.

In [1]:
from IPython.display import display, HTML

display(HTML("""
<div style="text-align: center;">
    <img src="Screenshots/P1.png" style="width: 500px;"/>
</div>
"""))

---
# Perceptron Algorithm Steps

In [2]:
from IPython.display import display, HTML

display(HTML("""
<div style="text-align: center;">
    <img src="Screenshots/P2.png" style="width: 500px;"/>
</div>
"""))

---
# Perceptron Trick

In [3]:
from IPython.display import display, HTML

display(HTML("""
<div style="text-align: center;">
    <img src="Screenshots/P3.png" style="width: 500px;"/>
</div>
"""))

---
# Limitation of Perceptron

- Works only if the data is linearly separable
- Cannot handle XOR-like problems

---
# Summary

- A simple and fast binary classifier
- Learns by updating weights

---
# ✅ **Perceptron Loss Function**

- Used to measure errors in classification.
- Loss is 0 if prediction is correct.
- Loss is positive if prediction is wrong.
- Only penalizes misclassified points.
- Helps move decision boundary in the right direction.

---

## Mathematical Formulation

The **perceptron loss** for a single training example is:

$$
L(\mathbf{w}, b; \mathbf{x}, y) = \max(0, -y(\mathbf{w}^T \mathbf{x} + b))
$$
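
As a rough illustration of this loss and the weight update it leads to, here is a minimal NumPy sketch. It assumes labels encoded as -1 and +1 (the convention the formula relies on), a learning rate of 1, and a toy AND dataset. The names `perceptron_loss` and `train_perceptron` are made up for this example.

```python
import numpy as np

def perceptron_loss(w, b, X, y):
    """Mean of max(0, -y * (w.x + b)) over the dataset; labels are -1 or +1."""
    margins = y * (X @ w + b)
    return np.maximum(0.0, -margins).mean()

def train_perceptron(X, y, lr=1.0, epochs=20):
    """Classic perceptron rule: update only on misclassified points."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:  # misclassified (or exactly on the boundary)
                w += lr * yi * xi
                b += lr * yi
    return w, b

# Toy linearly separable data: logical AND with labels in {-1, +1}
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1, -1, -1, 1])

w, b = train_perceptron(X, y)
predictions = np.where(X @ w + b > 0, 1, -1)
print(predictions, perceptron_loss(w, b, X, y))  # predictions match y, loss is 0
```

On this toy data the loop stops making updates after a few epochs, at which point every point is on the correct side of the boundary.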

---

# ✅ **How to Calculate Trainable Parameters in a Neural Network**

---

## What are Trainable Parameters?

- Trainable parameters = **Weights + Biases**
- These are updated during training by an optimizer, using gradients computed with **backpropagation**
- They define what the model learns

---

## Formula to Calculate Parameters

For each fully connected (dense) layer:

**Parameters = (Number of inputs × Number of neurons) + (Number of neurons)**

- First part = **weights**
- Second part = **biases**

---

## Example

Let’s say you have the following network:

- Input layer: 4 features  
- Hidden layer 1: 5 neurons  
- Hidden layer 2: 3 neurons  
- Output layer: 1 neuron

---

### Step-by-step Calculation

1. Input → Hidden1

(4*5) + 5 = 20 + 5 = 25

2. Hidden1 → Hidden2

(5*3) + 3 = 15 + 3 = 18

3. Hidden2 → Output

(3*1) + 1 = 3 + 1 = 4

---

### Total Trainable Parameters:

25 + 18 + 4 = 47
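
The same calculation can be sketched in a few lines of Python. The helper name `count_dense_parameters` is hypothetical, and it assumes every layer is fully connected with one bias per neuron.

```python
def count_dense_parameters(layer_sizes):
    """Sum (inputs * neurons + neurons) over consecutive fully connected layers."""
    total = 0
    for n_inputs, n_neurons in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_inputs * n_neurons + n_neurons  # weights + biases
    return total

# 4 input features -> 5 neurons -> 3 neurons -> 1 output neuron
total_params = count_dense_parameters([4, 5, 3, 1])
print(total_params)  # 47
```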

---

**Tip:**

Only the **number of features and the number of neurons per layer** matter, not the number of training samples.

---


# ✅ **Loss Function**

A loss function measures how well your model's predictions match the targets in your dataset: the lower the loss, the better the fit.

---
# Types of Loss Function

---

**1. Mean Squared Error (MSE) / Squared Loss / L2 Loss**

**2. Mean Absolute Error (MAE) / L1 Loss**

**3. Huber Loss**

**4. Binary Cross Entropy / Log Loss**

**5. Categorical Cross Entropy**

**6. Sparse Categorical Cross Entropy**

---

In [4]:
from IPython.display import display, HTML

display(HTML("""
<div style="text-align: center;">
    <img src="Screenshots/LF1.png" style="width: 500px;"/>
</div>
"""))

In [5]:
from IPython.display import display, HTML

display(HTML("""
<div style="text-align: center;">
    <img src="Screenshots/LF2.png" style="width: 500px;"/>
</div>
"""))

In [6]:
from IPython.display import display, HTML

display(HTML("""
<div style="text-align: center;">
    <img src="Screenshots/LF3.png" style="width: 500px;"/>
</div>
"""))

In [7]:
from IPython.display import display, HTML

display(HTML("""
<div style="text-align: center;">
    <img src="Screenshots/LF4.png" style="width: 500px;"/>
</div>
"""))

In [8]:
from IPython.display import display, HTML

display(HTML("""
<div style="text-align: center;">
    <img src="Screenshots/LF5.png" style="width: 500px;"/>
</div>
"""))

In [9]:
from IPython.display import display, HTML

display(HTML("""
<div style="text-align: center;">
    <img src="Screenshots/LF6.png" style="width: 500px;"/>
</div>
"""))

---
## Loss Function Summary Table

| **Loss Function**                 | **Use Case**                | **Label Format**        | **Activation Function**    |
|----------------------------------|-----------------------------|--------------------------|----------------------------|
| **Mean Squared Error (MSE)**     | Regression                  | Continuous values        | `None` or `Linear`         |
| **Mean Absolute Error (MAE)**    | Regression                  | Continuous values        | `None` or `Linear`         |
| **Huber Loss**                   | Regression with outliers    | Continuous values        | `None` or `Linear`         |
| **Binary Cross Entropy**         | Binary Classification       | 0 or 1                   | `Sigmoid`                  |
| **Categorical Cross Entropy**    | Multi-class Classification  | One-hot encoded vector   | `Softmax`                  |
| **Sparse Categorical Cross Entropy** | Multi-class Classification  | Integer class index       | `Softmax`                  |

---

## Notes:

- Use **MSE/MAE** for regression problems.
- Use **Cross Entropy** for classification problems.
- Choose activation based on output type:
  - `Sigmoid` → for **binary classification**
  - `Softmax` → for **multi-class classification**
  - `Linear` or none → for **regression**
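
For reference, the losses in the summary table can be sketched in NumPy as follows. These are the standard textbook definitions, which may differ in small details (for example the Huber threshold `delta` or the clipping constant `eps`, both assumptions here) from the formulas shown in the screenshots.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    error = np.abs(y_true - y_pred)
    quadratic = 0.5 * error ** 2                # used when |error| <= delta
    linear = delta * (error - 0.5 * delta)      # used when |error| > delta
    return np.mean(np.where(error <= delta, quadratic, linear))

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    p = np.clip(p_pred, eps, 1 - eps)           # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    return -np.mean(np.sum(one_hot * np.log(np.clip(probs, eps, 1.0)), axis=1))

def sparse_categorical_cross_entropy(class_indices, probs, eps=1e-12):
    picked = probs[np.arange(len(class_indices)), class_indices]
    return -np.mean(np.log(np.clip(picked, eps, 1.0)))

# Regression example with one outlier-sized error
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 6.0])
print(mse(y_true, y_pred), mae(y_true, y_pred), huber(y_true, y_pred))

# Classification example: same labels as one-hot vectors and as integer indices
probs = np.array([[0.5, 0.5], [0.25, 0.75]])
print(categorical_cross_entropy(np.array([[1, 0], [0, 1]]), probs))
print(sparse_categorical_cross_entropy(np.array([0, 1]), probs))
```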
---

# ✅ **Gradient Descent**

**Gradient Descent** is an optimization algorithm used to minimize a **loss function** by iteratively moving in the direction of the **negative gradient**.

---
The general update rule is:

**θ = θ - α * ∇L(θ)**

Where:

- θ is the parameter (or weight)
- α is the learning rate (step size)
- ∇L(θ) is the gradient of the loss function with respect to θ

---

# Types of Gradient Descent

---

**1. Batch Gradient Descent**

**2. Stochastic Gradient Descent**

**3. Mini-Batch Gradient Descent**

---

# Summary Table

| Type                        | Data Used per Step | Speed     | Stability | Memory Usage  | Formula                      |
|-----------------------------|--------------------|-----------|-----------|---------------|------------------------------|
| Batch Gradient Descent      | All data           | Slow      | High      | High          | θ = θ - α * (1/m) * Σ ∇Lᵢ(θ) |
| Stochastic Gradient Descent | 1 sample           | Fast      | Low       | Low           | θ = θ - α * ∇Lᵢ(θ)           |
| Mini-Batch Gradient Descent | Small batch        | Moderate  | Moderate  | Moderate      | θ = θ - α * (1/b) * Σ ∇Lᵢ(θ) |


Where:

- **m** is the number of training examples

- **b** is the batch size
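
The three variants differ only in how many samples feed each update, so one loop can cover all of them. The sketch below is an illustration under stated assumptions: a linear model with MSE loss, a noise-free toy dataset, and a hypothetical helper named `gradient_descent`.

```python
import numpy as np

def gradient_descent(X, y, batch_size, lr=0.1, epochs=200, seed=0):
    """Fit y ~ X @ theta with MSE loss.

    batch_size = len(X) gives batch GD, 1 gives SGD, anything between is mini-batch.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    m = len(X)
    for _ in range(epochs):
        order = rng.permutation(m)                    # reshuffle every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ theta - y[idx]
            grad = 2 * X[idx].T @ error / len(idx)    # (1/b) * sum of per-sample gradients
            theta -= lr * grad                        # theta = theta - alpha * gradient
    return theta

# Noise-free toy data: y = 1 + 2x, with a column of ones for the bias
x = np.linspace(-1, 1, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x

theta_batch = gradient_descent(X, y, batch_size=len(X))
theta_mini = gradient_descent(X, y, batch_size=5)
theta_sgd = gradient_descent(X, y, batch_size=1)
print(theta_batch, theta_mini, theta_sgd)  # each should be close to [1, 2]
```

Because the toy data has no noise, all three variants reach the same solution. On real data the single-sample version would keep fluctuating around the minimum.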

---
# ✅ **Multi Layer Perceptron**

---

## MLP Notation

In [11]:
from IPython.display import display, HTML

display(HTML("""
<div style="text-align: center;">
    <img src="Screenshots/MLP1.png" style="width: 500px;"/>
</div>
"""))

---

## What is MLP?

An **MLP (Multi-Layer Perceptron)** is a type of **feedforward neural network** that consists of:

- **Input Layer**
- **One or more Hidden Layers**
- **Output Layer**

Each layer is **fully connected** to the next layer.

---

## Mathematics Behind MLP



In [12]:
from IPython.display import display, HTML

display(HTML("""
<div style="text-align: center;">
    <img src="Screenshots/MLP2.png" style="width: 500px;"/>
</div>
"""))

In [13]:
from IPython.display import display, HTML

display(HTML("""
<div style="text-align: center;">
    <img src="Screenshots/MLP3.png" style="width: 500px;"/>
</div>
"""))

---
# ✅ **Forward Propagation**

**Forward Propagation** is the process of sending input data through the layers of the neural network to compute the output.

It is used:
- During **training** (before loss computation)
- During **prediction** (after training)
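
A minimal NumPy sketch of a forward pass is given here as an illustration. The choice of ReLU for hidden layers, sigmoid for the output, and random weights are assumptions made for the example, not something the notes prescribe.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """Pass x through each (W, b) pair: ReLU on hidden layers, sigmoid on the output."""
    a = x
    for W, b in params[:-1]:
        a = relu(W @ a + b)          # linear step, then activation
    W_out, b_out = params[-1]
    return sigmoid(W_out @ a + b_out)

rng = np.random.default_rng(0)
layer_sizes = [4, 5, 3, 1]           # 4 inputs, two hidden layers, 1 output
params = [(rng.normal(size=(n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

y_hat = forward(rng.normal(size=4), params)
print(y_hat.shape)  # (1,)
```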

---


In [14]:
from IPython.display import display, HTML

display(HTML("""
<div style="text-align: center;">
    <img src="Screenshots/FF1.png" style="width: 500px;"/>
</div>
"""))

---
## Example for 3-layer MLP

In [16]:
from IPython.display import display, HTML

display(HTML("""
<div style="text-align: center;">
    <img src="Screenshots/FF2.png" style="width: 500px;"/>
</div>
"""))

---
# ✅ **Backpropagation**

**Backpropagation** is the process of calculating **gradients** (partial derivatives) of the **loss function** with respect to **weights and biases**.

These gradients are used to **update model parameters** using **Gradient Descent**.

---

In [17]:
from IPython.display import display, HTML

display(HTML("""
<div style="text-align: center;">
    <img src="Screenshots/BB1.png" style="width: 500px;"/>
</div>
"""))

In [18]:
from IPython.display import display, HTML

display(HTML("""
<div style="text-align: center;">
    <img src="Screenshots/BB2.png" style="width: 500px;"/>
</div>
"""))

In [19]:
from IPython.display import display, HTML

display(HTML("""
<div style="text-align: center;">
    <img src="Screenshots/BB3.png" style="width: 500px;"/>
</div>
"""))

In [20]:
from IPython.display import display, HTML

display(HTML("""
<div style="text-align: center;">
    <img src="Screenshots/BB4.png" style="width: 500px;"/>
</div>
"""))

In [22]:
from IPython.display import display, HTML

display(HTML("""
<div style="text-align: center;">
    <img src="Screenshots/BB5.png" style="width: 500px;"/>
</div>
"""))

---
# ✅ **MLP Memoization**

**MLP Memoization** refers to an optimization technique used in training **Multilayer Perceptrons (MLPs)**. It involves **caching intermediate computations** during the **forward pass**, so they can be **reused during the backward pass (backpropagation)**.

This avoids redundant calculations and improves computational efficiency.

---

## Why is Memoization Needed?

During **backpropagation**, we compute gradients using the **chain rule**, which requires:

- The **activations**
- The **linear outputs**

These values are already computed during the **forward pass**, so instead of recomputing them, we **store (memoize)** them for reuse.
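
The idea can be sketched for a tiny two-layer network. This is an illustration only: it assumes sigmoid activations and a squared-error loss, and the names `forward_with_cache` and `backward_from_cache` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_with_cache(x, W1, b1, W2, b2):
    """Forward pass that stores the linear outputs (z) and activations (a)."""
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)
    cache = {"x": x, "z1": z1, "a1": a1, "z2": z2, "a2": a2}
    return a2, cache

def backward_from_cache(y, W2, cache):
    """Gradients of 0.5 * (a2 - y)^2, reusing cached values instead of recomputing."""
    x, a1, a2 = cache["x"], cache["a1"], cache["a2"]
    delta2 = (a2 - y) * a2 * (1 - a2)           # dL/dz2, since sigmoid'(z) = a (1 - a)
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)    # dL/dz1 via the chain rule
    return {"W2": np.outer(delta2, a1), "b2": delta2,
            "W1": np.outer(delta1, x), "b1": delta1}

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), np.array([1.0])
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

output, cache = forward_with_cache(x, W1, b1, W2, b2)
grads = backward_from_cache(y, W2, cache)
```

The backward function never calls `sigmoid` again: everything it needs was stored during the forward pass.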

---

## Benefits of Memoization
- Faster training
- Reduced redundant computation
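- Trades memory for speed: cached values take extra memory, which smart caching strategies (such as recomputing some values instead of storing them) can reduce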
- Improved scalability for deep networks

---

## Summary

MLP Memoization is a **smart caching strategy** that stores intermediate results during the forward pass to **optimize** the backward pass. It's a key technique in efficient neural network training.

---


# ✅ **Improving Neural Network Performance**

---
# **1. Fine-tuning Hyperparameters**
---

## Number of Hidden Layers
- More layers can learn more complex patterns.
- Too many can lead to overfitting or vanishing gradients.
---
## Number of Neurons per Layer
- More neurons = more learning capacity.
- But too many can slow training or cause overfitting.
---
## Learning Rate
- Controls how big the steps are during training.
- Too high = unstable learning; too low = slow learning.
---
## Batch Size
- Number of samples processed before updating weights.
- Small batch = faster updates, more noise.
- Large batch = more stable gradient estimates, but fewer updates per epoch and more memory.
---
## Epochs
- One epoch = one full pass through the training data.
- More epochs = more learning, but risk of overfitting.
---
## Optimizer
- Algorithm that updates weights (e.g., SGD, Adam).
- Adam is popular for being fast and adaptive.
---
## Activation Functions
- Add non-linearity to the model.
- Common ones: ReLU (fast), Sigmoid (can cause vanishing gradients), Tanh.
---


# **2. Solving Common Neural Network Problems**
---
## Vanishing Gradients
   - Activation Functions  
   - Weight Initialization  
---
## Overfitting
   - Reduce Complexity / Increase Data  
   - Dropout Layers  
   - Regularization (L1 & L2)  
   - Early Stopping  
---
## Normalization
   - Normalizing Inputs  
   - Batch Normalization  
   - Normalizing Activations  
---
## Optimizers
   - Momentum  
   - Adagrad  
   - RMSprop  
   - Adam  
    
---

## Learning Rate Scheduling

---

## Gradient Checking and Clipping

---

# ✅ **Vanishing and Exploding Gradient Problems**

---

## Vanishing Gradient Problem

**What is it?**  
When training deep neural networks, the gradients (used to update weights) become **very small** as they are backpropagated through layers. This causes **early layers to learn very slowly or not at all**.

**Why does it happen?**  
In backpropagation, the gradient for an early-layer weight is a product of per-layer derivatives (chain rule). If these factors have magnitude **less than 1**, the product shrinks:

$$
\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial a_n} \left( \prod_{i=2}^{n} \frac{\partial a_i}{\partial a_{i-1}} \right) \frac{\partial a_1}{\partial w_1}
$$

If each factor is < 1 in magnitude, the product becomes **very small** as \( n \) increases.

**Example:**  
If each layer multiplies by 0.5 and you have 10 layers:

$$
0.5^{10} = 0.00098 \quad \text{(almost zero)}
$$

So, the gradient **vanishes**.

---

## Exploding Gradient Problem

**What is it?**  
The opposite: gradients become **very large**, causing **unstable training** and huge weight updates.

**Why does it happen?**  
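If the per-layer derivatives have magnitude **greater than 1**, the same chain-rule product grows exponentially:

$$
\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial a_n} \left( \prod_{i=2}^{n} \frac{\partial a_i}{\partial a_{i-1}} \right) \frac{\partial a_1}{\partial w_1}
$$

If each factor is > 1 in magnitude, the product becomes **very large**.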

**Example:**  
If each layer multiplies by 2 and you have 10 layers:

$$
2^{10} = 1024 \quad \text{(very large)}
$$

So, the gradient **explodes**.

---

## Solutions

- **Vanishing**: Use ReLU activation, Batch Normalization, or architectures like LSTM/GRU or Residual Networks (ResNets).
- **Exploding**: Use Gradient Clipping or better weight initialization.
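
Gradient clipping by norm is simple enough to sketch directly. The helper `clip_by_norm` and the threshold of 5.0 are assumptions chosen for illustration; deep learning frameworks ship their own clipping utilities.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm (the direction is preserved)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

# Per-layer factors from the two examples: ten layers of 0.5 versus ten layers of 2
vanishing = 0.5 ** 10   # about 0.00098
exploding = 2.0 ** 10   # 1024.0

# A gradient with norm 500 is scaled down to norm 5, roughly [3, 4]
clipped = clip_by_norm(np.array([300.0, 400.0]), max_norm=5.0)
print(vanishing, exploding, clipped)
```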

---

# ✅ **Early Stopping**

**Early Stopping** is a regularization technique used to **prevent overfitting** during training.

It stops training **automatically** when the model's performance on a **validation set** starts to get worse, even if training loss is still decreasing.

---

## Why is it needed?

- During training:
  - **Training loss ↓** (model fits training data better)
  - **Validation loss ↓ then ↑** (model starts overfitting)

We stop at the point where **validation loss is lowest**.

---

## How does it work?

1. Split data into **training** and **validation** sets.
2. Monitor **validation loss** after each epoch.
3. If validation loss **doesn’t improve** for a set number of epochs (called **patience**), stop training.

---

## Example

Let’s say validation loss over epochs looks like this:

| Epoch | Validation Loss |
|-------|------------------|
| 1     | 0.50             |
| 2     | 0.40             |
| 3     | 0.35             |
| 4     | 0.36             |
| 5     | 0.38             |

If **patience = 2**, training stops after epoch 5 (no improvement for 2 epochs).
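
The patience logic can be sketched in plain Python and checked against the table. The function name `early_stopping_epoch` is hypothetical, and "improvement" is assumed to mean a strictly lower validation loss.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch  # never triggered: ran all epochs

stop_epoch, best_epoch = early_stopping_epoch([0.50, 0.40, 0.35, 0.36, 0.38], patience=2)
print(stop_epoch, best_epoch)  # 5 3
```

Training stops after epoch 5, but the best weights are the ones from epoch 3, which is why restoring the best checkpoint is usually worthwhile.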

---

## Benefits

- Prevents overfitting
- Saves training time
- Simple to implement

---

## Common Parameters

- `monitor`: What to track (e.g., `val_loss`)
- `patience`: How many epochs to wait before stopping
- `restore_best_weights`: Option to revert to best model



# ✅ **Feature Scaling**

---

**Feature Scaling** is a technique to **normalize or standardize** input features so they are on a **similar scale**.

This helps models **learn faster and better**, especially those that use gradient descent.

---

## Why is it important?

- Features with different scales (e.g., age in years vs. income in dollars) let the large-valued feature **dominate** gradient updates and distance calculations.
- Models like neural networks, SVMs, and KNN are **sensitive to feature magnitudes**.
- Scaling ensures **fair contribution** of each feature.

---

## Common Methods

### 1. Min-Max Scaling (Normalization)

Scales values to a fixed range [0, 1]:

$$
x' = \frac{x - x_{min}}{x_{max} - x_{min}}
$$

### 2. Standardization (Z-score Scaling)

Centers data around 0 with unit variance:

$$
x' = \frac{x - \mu}{\sigma}
$$

Where:
- $\mu$ = mean of the feature
- $\sigma$ = standard deviation of the feature

---

## When to Use What?

| Method         | Use When...                          |
|----------------|--------------------------------------|
| Min-Max        | Data is bounded and not Gaussian     |
| Standardization| Data is Gaussian or unbounded        |

---

## Notes

- Always **fit scaler on training data only**, then apply to test/validation.
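
Both methods, including the "fit on training data only" rule, can be sketched with NumPy. The age and income values are made-up illustration data.

```python
import numpy as np

# Columns: age in years, income in dollars (made-up values)
X_train = np.array([[20.0, 30_000.0], [30.0, 50_000.0], [40.0, 70_000.0]])
X_test = np.array([[25.0, 60_000.0]])

# Min-max scaling: the min and max come from the training set only
x_min, x_max = X_train.min(axis=0), X_train.max(axis=0)
X_train_minmax = (X_train - x_min) / (x_max - x_min)
X_test_minmax = (X_test - x_min) / (x_max - x_min)

# Standardization: the mean and standard deviation also come from the training set only
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
X_train_std = (X_train - mu) / sigma
X_test_std = (X_test - mu) / sigma

print(X_test_minmax)  # roughly [[0.25, 0.75]]
```

Note that scaled test values can fall outside [0, 1] when a test point lies beyond the training range. That is expected and not an error.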



---
# ✅ **Dropout Layer**

---

**Dropout** is a regularization technique used to **prevent overfitting** in neural networks.

During training, it **randomly "drops" (sets to zero)** a fraction of neurons in a layer on each forward pass.

---

## Why use Dropout?

- Forces the network to **not rely too much on specific neurons**
- Encourages the model to **learn more robust features**
- Acts like training many smaller networks and averaging them

---

## How does it work?

If dropout rate = 0.5, then **50% of neurons** are randomly turned off during each training step.

Mathematically:

$$
\text{output}_i = 
\begin{cases}
0 & \text{with probability } p \\
\frac{a_i}{1 - p} & \text{with probability } 1 - p
\end{cases}
$$

Where:
- $a_i$ is the activation of neuron $i$
- $p$ is the dropout rate

---

## During Inference

- **No neurons are dropped**
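- **No extra scaling** is needed: the 1 / (1 - p) factor applied during training (inverted dropout) already keeps the expected activation the same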
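
A minimal NumPy sketch of inverted dropout, matching the formula in this section, shows the difference between the two modes. The function name `dropout` and its `training` flag are assumptions for this example, loosely modeled on how frameworks expose the switch.

```python
import numpy as np

def dropout(a, p, training, rng):
    """Inverted dropout: drop with probability p, scale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return a                              # inference: no dropping, no rescaling
    keep_mask = rng.random(a.shape) >= p      # True with probability 1 - p
    return a * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
activations = np.ones(10_000)

train_out = dropout(activations, p=0.5, training=True, rng=rng)
test_out = dropout(activations, p=0.5, training=False, rng=rng)

# About half the training outputs are 0 and the rest are 2, so the mean stays near 1
print(train_out.mean(), test_out.mean())
```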

---

## Why Dropout is Similar to Random Forests

- In **Random Forests**, each tree is trained on a **random subset of features and data**.
- In **Dropout**, each forward pass uses a **random subset of neurons**.
- Both techniques:
  - Reduce overfitting
  - Encourage **diversity** in learning
  - Combine multiple "weaker" models to form a **stronger, more general model**

So, Dropout is like training **many smaller neural networks** and averaging their predictions — just like Random Forests average many decision trees.

---


## Disadvantages of Dropout

---

### 1. Slower Training

- Dropout introduces **randomness**, which can make training **slower to converge**.
- The model needs more epochs to reach optimal performance.

---

### 2. Not Always Effective

- Dropout works best in **fully connected layers**.
- It may not help much (or even hurt) in **convolutional layers** or **recurrent networks** unless carefully tuned.

---

### 3. Increased Training Time

- Because neurons are dropped randomly, the model has to **learn multiple redundant paths**.
- This increases the **computational cost** during training.

---

### 4. Harder to Tune

- Choosing the right **dropout rate** (e.g., 0.2, 0.5) is not straightforward.
- Too high → underfitting  
  Too low → overfitting

---

### 5. Not Used During Inference (Testing/Prediction)

- Dropout is **disabled during testing**, which means the model behaves differently during training and inference.
- This can cause **inconsistencies** if not handled properly.

---

### 6. May Not Work Well with Batch Normalization

- Dropout and BatchNorm can **interfere** with each other.
- Often, one is preferred over the other depending on the architecture.


---

## Example in Code (Keras)

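```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))  # 50% dropout, active only during training
```
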
---
# ✅ **Regularization: L1 and L2**

---

Regularization is a technique to **reduce overfitting** by **penalizing large weights** in the model.

It adds a **penalty term** to the loss function to discourage complexity.

---

## Modified Loss Function

Let:
- $L_0$ = original loss (e.g., cross-entropy or MSE)
- $w$ = model weights
- $\lambda$ = regularization strength

Then:

- **L1 Regularization (Lasso):**

  $$
  L = L_0 + \lambda \sum |w|
  $$

- **L2 Regularization (Ridge):**

  $$
  L = L_0 + \lambda \sum w^2
  $$
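
As a small worked example, the two penalties can be computed by hand with NumPy. The weight vector, the value of lambda, and the base loss of 0.30 are all made-up numbers for illustration.

```python
import numpy as np

def l1_penalty(w, lam):
    return lam * np.sum(np.abs(w))

def l2_penalty(w, lam):
    return lam * np.sum(w ** 2)

w = np.array([0.5, -2.0, 0.0, 1.5])
lam = 0.01
base_loss = 0.30  # hypothetical value of the original loss L0

loss_l1 = base_loss + l1_penalty(w, lam)   # 0.30 + 0.01 * 4.0
loss_l2 = base_loss + l2_penalty(w, lam)   # 0.30 + 0.01 * 6.5

# Gradients of the penalty terms: constant size for L1, proportional to w for L2
grad_l1 = lam * np.sign(w)
grad_l2 = 2 * lam * w
print(loss_l1, loss_l2)
```

The gradients hint at the different behavior: the L1 pull has the same size for every nonzero weight, so small weights can be driven all the way to zero, while the L2 pull fades as a weight gets small.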

---

## L1 vs. L2

| Feature         | L1 (Lasso)                  | L2 (Ridge)                  |
|-----------------|-----------------------------|-----------------------------|
| Penalty         | $\lambda \sum \lvert w \rvert$ | $\lambda \sum w^2$ |
| Effect          | Shrinks some weights to exactly 0 | Shrinks all weights toward 0, large ones the most |
| Use Case        | Feature selection           | General weight decay        |
| Sparsity        | Produces sparse models      | Keeps all features          |

---


## Notes

- $\lambda$ controls the strength of regularization.
- Too high → underfitting  
  Too low → overfitting

---

## Example in Code (Keras)

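```python
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dense

# L1 penalty on the layer's weights (kernel)
l1_layer = Dense(64, activation='relu', kernel_regularizer=regularizers.l1(0.01))

# L2 penalty on the layer's weights (kernel)
l2_layer = Dense(64, activation='relu', kernel_regularizer=regularizers.l2(0.01))
```
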
---
# ✅ **Activation Functions**

---

- An **activation function** decides whether a neuron should be **activated or not** by applying a mathematical transformation to its input.

- It introduces **non-linearity** into the network, allowing it to learn **complex patterns**.

- Without activation functions, a neural network would behave like a **linear model**, no matter how many layers it has.

---
## How a Neuron Processes Inputs

Each neuron receives inputs, multiplies them by weights, adds a bias, and then passes the result through an **activation function**:

$$
a = g(w_1x_1 + w_2x_2 + \ldots + w_nx_n + b)
$$

Where:
- \( x_1, x_2, ..., x_n \) are the input features  
- \( w_1, w_2, ..., w_n \) are the corresponding weights  
- \( b \) is the bias  
- \( g \) is the activation function  
- \( a \) is the output of the neuron

---
**Note**
- If the output \( a \) is **high**, the neuron is **activated** — it contributes strongly to the next layer.  
- If the output \( a \) is **low or zero**, the neuron is **not activated** — it contributes little or nothing.


---

## 1. Sigmoid Activation Function

**Formula:**

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

**Range:** (0, 1)

**Use Case:** Binary classification (output layer)

**Advantages:**
- Smooth and differentiable
- Outputs can be interpreted as probabilities

**Disadvantages:**
- **Vanishing gradient** for large |x|
- Outputs not zero-centered
- Slow convergence

---

## 2. tanh (Hyperbolic Tangent)

**Formula:**

$$
\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
$$

**Range:** (−1, 1)

**Use Case:** Hidden layers (better than sigmoid)

**Advantages:**
- Zero-centered output
- Stronger gradients than sigmoid

**Disadvantages:**
- Still suffers from **vanishing gradient**
- Can saturate for large |x|

---

## 3. ReLU (Rectified Linear Unit)

**Formula:**

$$
f(x) = \max(0, x)
$$

**Range:** [0, ∞)

**Use Case:** Most common in hidden layers

**Advantages:**
- Simple and fast
- Solves vanishing gradient (mostly)
- Sparse activation (efficient)

**Disadvantages:**
- **Dying ReLU**: neurons can get stuck at 0
- Not zero-centered

---

## Summary Table

| Function | Range     | Zero-Centered  | Vanishing Gradient      | Speed  | Common Use            |
|----------|-----------|----------------|-------------------------|--------|-----------------------|
| Sigmoid  | (0, 1)    | No             | Yes                     | Slow   | Output layer (binary) |
| tanh     | (−1, 1)   | Yes            | Yes (less than sigmoid) | Medium | Hidden layers         |
| ReLU     | [0, ∞)    | No             | Mostly no (gradient is 1 for x > 0) | Fast   | Hidden layers         |
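
The "Vanishing Gradient" column becomes concrete when the derivatives are evaluated numerically. The sketch below uses the standard derivative formulas; the sample inputs are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # peaks at 0.25 when x = 0

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2    # peaks at 1.0 when x = 0

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)    # 1 for positive inputs, 0 otherwise

x = np.array([-10.0, 0.0, 10.0])
print(sigmoid_grad(x))  # tiny at both ends, 0.25 in the middle
print(tanh_grad(x))     # tiny at both ends, 1.0 in the middle
print(relu_grad(x))     # 0 for x <= 0, 1 for x > 0
```

Sigmoid and tanh both saturate for large |x|, while ReLU keeps a gradient of 1 for any positive input.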

---


# ✅ **Variants of ReLU and the Dying ReLU Problem**

---

## What is ReLU?

**ReLU (Rectified Linear Unit):**

$$
f(x) = \max(0, x)
$$

- Outputs 0 if input is negative  
- Outputs input if positive

---

## Dying ReLU Problem

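- A neuron whose input is negative for every sample always outputs **0**, so its gradient is 0 and it **stops learning**; the network suffers when this happens to many neurons  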
- This happens when weights push inputs into the negative zone **permanently**
- **Common Causes:**
    - Large negative bias or weights
    - High learning rate → pushes weights too far into negative zone
    - Poor initialization → many neurons start with negative outputs

---

## ReLU Variants

---

### 1. **Leaky ReLU**

Allows a small slope for negative inputs:

$$
f(x) =
\begin{cases}
x & \text{if } x > 0 \\
\alpha x & \text{if } x \leq 0
\end{cases}
$$

- Typically, \( \alpha = 0.01 \)  
- Helps avoid dying ReLU

---

### 2. **Parametric ReLU (PReLU)**

Like Leaky ReLU, but the negative-side slope \( a \) is **learned** during training:

$$
f(x) =
\begin{cases}
x & \text{if } x > 0 \\
a x & \text{if } x \leq 0
\end{cases}
$$

- More flexible than Leaky ReLU

---

### 3. **ELU (Exponential Linear Unit)**

Smooth curve for negative inputs:

$$
f(x) =
\begin{cases}
x & \text{if } x > 0 \\
\alpha (e^x - 1) & \text{if } x \leq 0
\end{cases}
$$

- Helps with vanishing gradients  
- \( \alpha \) is a hyperparameter (e.g., 1.0)

---

### 4. **SELU (Scaled ELU)**

Self-normalizing version of ELU:

$$
f(x) =
\lambda \begin{cases}
x & \text{if } x > 0 \\
\alpha (e^x - 1) & \text{if } x \leq 0
\end{cases}
$$

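- Works best with **specific initialization** (LeCun normal) and **architecture** (plain fully connected layers)  
- Uses fixed constants \( \lambda \approx 1.0507 \) and \( \alpha \approx 1.6733 \)  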
- Keeps activations normalized

---

## Summary

| Function     | Handles Dying ReLU? | Learnable? | Smooth? | Common Use             |
|--------------|---------------------|------------|---------|------------------------|
| ReLU         | ❌                  | ❌         | ❌      | Default                |
| Leaky ReLU   | ✅ (fixed slope)    | ❌         | ❌      | Simple fix             |
| PReLU        | ✅ (learned slope)  | ✅         | ❌      | More flexible          |
| ELU          | ✅                  | ❌         | ✅      | Deep networks          |
| SELU         | ✅ (self-normalizing)| ❌        | ✅      | Self-normalizing nets  |
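
The fixed-slope and smooth variants can be sketched in NumPy as follows. The SELU constants are rounded values from the SELU paper rather than something stated in these notes. PReLU is left out because its slope is a trained parameter.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # np.minimum keeps exp() from overflowing on large positive inputs
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

# Rounded constants from the SELU paper (assumption: not given in these notes)
SELU_LAMBDA = 1.0507
SELU_ALPHA = 1.67326

def selu(x):
    return SELU_LAMBDA * elu(x, alpha=SELU_ALPHA)

x = np.array([-2.0, 0.0, 3.0])
print(leaky_relu(x))  # negative input is scaled by 0.01 instead of zeroed
print(elu(x))         # negative input saturates smoothly toward -alpha
print(selu(x))
```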


---
# ✅ **Weight Initialization**

Proper weight initialization helps neural networks **converge faster** and **avoid issues** like vanishing or exploding gradients.

---

### Why is Weight Initialization Important?

- Prevents gradients from becoming too small (vanishing) or too large (exploding)  
- Helps maintain stable activations and gradients across layers  
- Speeds up training and improves performance

---

## 1. Xavier / Glorot Initialization

**Used for:** Sigmoid, tanh activations  
**Goal:** Keep the variance of activations and gradients the same across layers.

**Formula:**

Let:
- \( n_{\text{in}} \): number of input units  
- \( n_{\text{out}} \): number of output units

**Uniform:**

$$
W \sim \mathcal{U} \left( -\sqrt{\frac{6}{n_{in} + n_{out}}}, \sqrt{\frac{6}{n_{in} + n_{out}}} \right)
$$

**Normal:**

$$
W \sim \mathcal{N} \left( 0, \frac{2}{n_{in} + n_{out}} \right)
$$

---

## 2. He Initialization

**Used for:** ReLU and its variants  
**Goal:** Preserve variance of activations in forward pass

**Formula:**

$$
W \sim \mathcal{N} \left( 0, \frac{2}{n_{in}} \right)
$$

- Helps avoid dying ReLU problem  
- Allows deeper networks to train effectively
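
Both schemes reduce to picking the right spread for a random draw. The NumPy sketch below treats the second argument of the normal distributions in the formulas as a variance, so the standard deviation is its square root. The function names and layer sizes are made up for illustration.

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng):
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

def xavier_normal(n_in, n_out, rng):
    std = np.sqrt(2.0 / (n_in + n_out))   # variance 2 / (n_in + n_out)
    return rng.normal(0.0, std, size=(n_out, n_in))

def he_normal(n_in, n_out, rng):
    std = np.sqrt(2.0 / n_in)             # variance 2 / n_in
    return rng.normal(0.0, std, size=(n_out, n_in))

rng = np.random.default_rng(0)
W_xavier = xavier_uniform(256, 128, rng)
W_he = he_normal(256, 128, rng)

# Empirical variances should land near 2/384 and 2/256 respectively
print(W_xavier.var(), W_he.var())
```

The uniform and normal Xavier variants target the same variance: a uniform distribution on [-limit, limit] has variance limit² / 3, which equals 2 / (n_in + n_out).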

---

# ✅ **Batch Normalization**

---

**Batch Normalization (BatchNorm)** is a technique to:
- Normalize the inputs of each layer
- Speed up training
- Stabilize the learning process

---

# Why Use It?

- Reduces **internal covariate shift** (the motivation given in the original paper; the exact mechanism is still debated)
- Allows **higher learning rates**
- Helps prevent **vanishing/exploding gradients**
- Acts like a **regularizer**

---

# Step-by-Step Process

---

## Step 1: Compute Mean

Given a batch of activations:
$$x_1, x_2, ..., x_m$$

The mean of the batch is computed as:
$$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i$$

## Step 2: Compute Variance
The variance of the batch is computed as:
$$\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$$

## Step 3: Normalize

Each activation is normalized using the mean and variance:
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

Where $\epsilon$ is a small constant added to avoid division by zero.

## Step 4: Scale and Shift
The normalized activation is then scaled and shifted using learnable parameters $\gamma$ and $\beta$:
$$y_i = \gamma \hat{x}_i + \beta$$

$\gamma$: learnable scale parameter  
$\beta$: learnable shift parameter

These parameters allow the network to recover the original activations if needed.
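
The four steps map directly onto a few lines of NumPy. This sketch covers the training-mode computation only; the running averages that frameworks keep for inference are left out, and the function name `batch_norm_forward` is an assumption.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm over the batch axis (axis 0)."""
    mu = x.mean(axis=0)                      # Step 1: batch mean
    var = x.var(axis=0)                      # Step 2: batch variance (divides by m)
    x_hat = (x - mu) / np.sqrt(var + eps)    # Step 3: normalize
    return gamma * x_hat + beta              # Step 4: scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # batch of 64 samples, 4 features
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))

print(y.mean(axis=0))  # close to 0 for every feature
print(y.std(axis=0))   # close to 1 for every feature
```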

---
# Typical Usage

- Use **after linear/convolution layer**
- Use **before activation function**

---


# ✅ **Optimizers**


Optimizers are algorithms that **adjust weights** of a neural network to **minimize the loss function**.

They use gradients (from backpropagation) to decide **how much and in which direction** to update weights.

---

# Types of Optimizers

## 1. **SGD (Stochastic Gradient Descent)**

Basic update rule:

$$
w = w - \eta \cdot \nabla L(w)
$$

- $\eta$: learning rate  
- $\nabla L(w)$: gradient of loss w.r.t. weights

---

## 2. **SGD with Momentum**

Adds a velocity term to smooth updates:

$$
v_t = \gamma v_{t-1} + \eta \nabla L(w)
$$

$$
w = w - v_t
$$

- $\gamma$: momentum factor (e.g. 0.9)
- Helps escape local minima and dampen oscillations.

---

## 3. **NAG (Nesterov Accelerated Gradient)**

Looks ahead before applying gradient:

$$
v_t = \gamma v_{t-1} + \eta \nabla L(w - \gamma v_{t-1})
$$

$$
w = w - v_t
$$

- More accurate updates than vanilla momentum.

---

## 4. **AdaGrad**

Adapts learning rate for each parameter:

$$
G_t = G_{t-1} + \nabla L(w)^2
$$

$$
w = w - \frac{\eta}{\sqrt{G_t + \epsilon}} \cdot \nabla L(w)
$$

- Works well for sparse data.  
- But learning rate shrinks too much over time.

---

## 5. **RMSProp**

Fixes AdaGrad’s issue by using moving average of squared gradients:

$$
E[g^2]_t = \beta E[g^2]_{t-1} + (1 - \beta) \cdot (\nabla L(w))^2
$$

$$
w = w - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \cdot \nabla L(w)
$$

- Good for non-stationary data and deep networks.

---

## 6. **Adam (Adaptive Moment Estimation)**

Combines **Momentum + RMSProp**:

1. First moment (mean of gradients):

$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla L(w)
$$

2. Second moment (uncentered variance of gradients):

$$
v_t = \beta_2 v_{t-1} + (1 - \beta_2)(\nabla L(w))^2
$$

3. Bias correction:

$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
$$

4. Update weights:

$$
w = w - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t
$$

- A common default choice that works well on a wide range of problems.
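
The four Adam steps can be sketched in NumPy on a toy problem. The quadratic objective, the learning rate of 0.1, and the helper name `adam_minimize` are assumptions for illustration; the beta and epsilon values are the commonly used defaults.

```python
import numpy as np

def adam_minimize(grad_fn, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)   # first moment estimate
    v = np.zeros_like(w)   # second moment estimate
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g          # 1. first moment
        v = beta2 * v + (1 - beta2) * g ** 2     # 2. second moment
        m_hat = m / (1 - beta1 ** t)             # 3. bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps) # 4. update weights
    return w

# Toy problem: L(w) = sum((w - 3)^2), gradient 2 (w - 3), minimum at w = 3
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w0=[0.0, 10.0])
print(w_final)  # both entries should end up close to 3
```

Both coordinates start at different distances from the minimum, yet the first steps have nearly the same size for each, because the update divides by the gradient's own running magnitude.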

---


# Evolution of Optimizers

| Optimizer        | Problem / Limitation                                                                 | How Next Optimizer Improved It                                                  |
|------------------|---------------------------------------------------------------------------------------|----------------------------------------------------------------------------------|
| **SGD**          | - Very slow convergence  <br> - Stuck in local minima  <br> - Oscillates in ravines  | ➤ Momentum added to smooth and accelerate updates                               |
| **SGD + Momentum** | - Still overshoots or fluctuates  <br> - Doesn't look ahead while updating         | ➤ NAG adds lookahead (computes gradient before step) for better guidance        |
| **NAG**          | - Still uses same learning rate for all parameters                                   | ➤ AdaGrad introduces adaptive learning rates for each parameter                 |
| **AdaGrad**      | - Learning rate shrinks too much over time → stops learning                          | ➤ RMSProp uses moving average to prevent learning rate from vanishing           |
| **RMSProp**      | - No momentum  <br> - Can still oscillate                                             | ➤ Adam combines momentum + adaptive learning for stable & fast updates          |
| **Adam**         | - Sometimes converges to sub-optimal solutions  <br> - May not generalize well        | ➤ Improvements like AdamW, Nadam, etc. (not discussed here) try to fix          |

---

## Summary

| Optimizer      | Adaptive Learning Rate? | Uses Momentum?    | Notes                                     |
| -------------- | ----------------------- | ----------------- | ----------------------------------------- |
| SGD            | ❌ No                    | ❌ No              | Basic gradient descent                    |
| SGD + Momentum | ❌ No                    | ✅ Yes             | Faster convergence than plain SGD         |
| NAG            | ❌ No                    | ✅ Yes (lookahead) | Often more stable than plain momentum     |
| AdaGrad        | ✅ Yes                   | ❌ No              | Effective learning rate shrinks over time |
| RMSProp        | ✅ Yes                   | ❌ No              | Good for non-stationary problems          |
| Adam           | ✅ Yes                   | ✅ Yes             | Common default choice                     |

---

# ✅ **Exponentially Weighted Moving Average (EWMA)**

EWMA is a technique used to compute a moving average that gives more weight to recent values and less weight to older values.

It’s widely used in optimizers (like RMSProp and Adam) to smooth gradients or squared gradients over time.

$$
v_t = \beta v_{t-1} + (1 - \beta)x_t
$$

- $x_t$: current value (like gradient or loss)  
- $\beta$: decay rate (e.g. 0.9 or 0.99)  

- Keeps more weight on recent values.  
- Smooths out noise in training.
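
A short pure-Python sketch of the recurrence follows. It assumes the average starts at zero, which is why early values are biased low, and it offers the same bias correction that Adam applies. The function name `ewma` is hypothetical.

```python
def ewma(values, beta=0.9, bias_correction=False):
    """v_t = beta * v_{t-1} + (1 - beta) * x_t, starting from v_0 = 0."""
    v = 0.0
    smoothed = []
    for t, x in enumerate(values, start=1):
        v = beta * v + (1 - beta) * x
        smoothed.append(v / (1 - beta ** t) if bias_correction else v)
    return smoothed

raw = ewma([10.0, 10.0, 10.0], beta=0.9)
corrected = ewma([10.0, 10.0, 10.0], beta=0.9, bias_correction=True)
print(raw)        # roughly [1.0, 1.9, 2.71]: biased toward the zero start
print(corrected)  # roughly [10.0, 10.0, 10.0]
```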

---