# What is Deep Learning
Deep learning is a subset of machine learning in which a model takes data, performs a task on it, and progressively gets better at that task. Its algorithms are inspired by the structure and function of the brain and are called **Artificial Neural Networks (ANNs)**.

## Agenda - 7/25
- Neural Network and Deep Learning Foundation
- Tuning and Optimizing Deep Learning Networks
- Convolutional Neural Networks (CNN) - I --> CNNs deal with images
- Convolutional Neural Networks - II
- Recurrent Neural Networks (RNN) --> deal with time-series data
- Long Short-Term Memory (LSTM) Networks --> deal with time-series data

**Use-Cases**
- Face detection features in Facebook or on the iPhone.
- Object detection in autonomous cars.
- Speech recognition with Siri.
- Movie recommendations on Netflix.

https://poloclub.github.io/cnn-explainer/



# Machine Learning vs Deep Learning

![DLvsML](https://cdn.prod.website-files.com/634e9d5d7cb8f7fc56b28199/66a2f1efb5b625f74538bd2a_668f001243b877277fd25aa1_652ebc26fbd9a45bcec81819_Deep_Learning_vs_Machine_Learning_3033723be2.jpeg)


  - Classical machine learning can be trained on relatively small datasets; deep learning usually needs much more data.
  - Classical ML uses a range of algorithms. DL is built on ANNs.
  - Training time is typically short for ML and long for DL.
  - The accuracy of DL keeps improving with more training data, whereas the accuracy of classical ML tends to plateau.

## Perceptrons

The Perceptron is one of the simplest types of artificial neural networks and forms the foundation of more complex deep learning models. It is a **binary classifier** that **maps input features to a single output using a set of weights and a bias**, followed by **an activation function**.

![Perceptron](https://miro.medium.com/v2/resize:fit:1100/format:webp/0*Ib3_FfuOy04kOmfO)

The following steps take place in each perceptron in the ANN:
- It receives a vector of inputs, each with an associated weight.
- It computes the weighted sum $ x_1 w_1 + x_2 w_2 + \dots + x_n w_n $ (plus a bias).
- The weighted sum is sent to the activation function $ f $ to generate the output.

**Activation Function** introduces non-linearity into the network, which allows it to learn complex patterns (like images, language, etc.). Without activation functions, a neural network would just be a big linear function, no matter how many layers it has.
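
The claim that stacked layers without an activation function collapse into one linear function can be checked numerically. The sketch below is only an illustration: the layer sizes, the random weights, and the variable names are assumptions, not part of the course material.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation function (sizes chosen arbitrarily).
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Pass x through both layers, one after the other.
two_layer_output = W2 @ (W1 @ x + b1) + b2

# The same mapping written as a single linear layer.
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
one_layer_output = W_combined @ x + b_combined

print(np.allclose(two_layer_output, one_layer_output))
```

Because the two outputs agree, the extra layer added no expressive power; a non-linear activation between the layers is what prevents this collapse.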

### Problem with Linearity
Doctors want to predict whether a tumor is cancerous based on features such as shape, size, and texture.

$$
\text{score} = a \cdot \text{size} + b \cdot \text{texture} + c \cdot \text{shape} + d
$$

If the score is high → cancer; if the score is low → not cancer.

This **assumes everything affects the result in a straight line.**

Example: every 1-unit increase in texture always increases the cancer risk by the same amount.
**This is too simple for the real world.**

Example:

- A large cell might only mean cancer if it’s also irregular.

- A rough texture might be OK unless the shape is also weird.

Hence, the **activation function** is needed to introduce some non-linearity.

An **ANN** is a collection of connected layers of **perceptrons**. It consists of:
- Input Layer
- Hidden Layers
- Output Layer

For example, an image is represented as an x × y array of pixels. The array is fed to the input layer, and the output of each layer is sent to every neuron/perceptron in the next layer, and so on.

https://playground.tensorflow.org/#

https://developers.google.com/machine-learning/crash-course/neural-networks/activation-functions

# Input to Activation Function or Transfer Function in Artificial Neural Networks (ANN)

In an **Artificial Neural Network (ANN)**, the **input to the activation function** is also known as:

- **Net input**
- **Weighted sum**
- **Pre-activation value**

As mentioned above, the activation function adds non-linearity so that the network can learn complex underlying patterns in the data. **Without an activation function**, the neuron is just a linear model, like linear regression.

---

### 🔢 Formula

For a neuron/perceptron $ j $, the input to the activation function is:

$$
z_j = \sum_{i=1}^{n} w_{ji} x_i + b_j
$$

Where:
- $ x_i $ = input from neuron $ i $ (or input feature)
- $ w_{ji} $ = weight connecting input $ i $ to neuron $ j $
- $ b_j $ = bias of neuron $ j $
- $ z_j $ = net input to the activation function

The output of the neuron is then:

$$
a_j = f(z_j)
$$

Where $ f $ is the **activation function** (like ReLU, Sigmoid, or Tanh), and $ a_j $ is the output.

**🧠 Example**

Given:
- Inputs: $ x_1 = 0.5 $, $ x_2 = 0.8 $
- Weights: $ w_1 = 0.3 $, $ w_2 = 0.7 $
- Bias: $ b = 0.1 $

Calculate the net input:

$$
z = (0.5 \times 0.3) + (0.8 \times 0.7) + 0.1 = 0.15 + 0.56 + 0.1 = 0.81
$$
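
The same arithmetic can be reproduced in a few lines of Python. This is a minimal sketch that uses the inputs, weights, and bias from the example; the variable names are illustrative.

```python
# Inputs, weights, and bias taken from the worked example.
inputs = [0.5, 0.8]
weights = [0.3, 0.7]
bias = 0.1

# Net input: weighted sum of the inputs plus the bias.
z = sum(x * w for x, w in zip(inputs, weights)) + bias

print(round(z, 2))  # 0.81
```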

---

Here are some of the **activation functions**:

### 1. Step Function (Heaviside)
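It is rarely used in modern networks: its gradient is zero everywhere except at the jump, so gradient-based training (backpropagation) cannot learn through it.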

$$
f(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}
$$

**Graph**
- A sudden jump from 0 to 1 at $ x = 0 $
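
As a small sketch, the step function can be written directly from its definition. The function name `step` is an illustrative choice.

```python
def step(x: float) -> int:
    """Heaviside step: 1 for strictly positive input, 0 otherwise."""
    return 1 if x > 0 else 0


print([step(v) for v in (-2.0, 0.0, 0.81)])  # [0, 0, 1]
```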


### 2. ReLU Function

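ReLU (Rectified Linear Unit) keeps positive inputs unchanged and sets negative inputs to zero:

$$
f(z) = \max(0, z)
$$

If we use ReLU as the activation function for the example above ($ z = 0.81 $):

$$
a = \max(0, 0.81) = 0.81
$$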

**Graph**
- 0 for all negative values
- Straight line for positive values
- Looks simple, but creates a kink at 0
- This kink breaks the linearity

**Usage**
- Used in **almost every modern neural network (CNNs, MLPs, Transformers)**
- It is typically used in the hidden layers.
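
A minimal NumPy sketch of ReLU, applied to a handful of sample values (the sample values are arbitrary, apart from 0.81, which comes from the net-input example):

```python
import numpy as np


def relu(z):
    """Element-wise max(0, z): negatives become 0, positives pass through."""
    return np.maximum(0.0, z)


z_values = np.array([-2.0, -0.5, 0.0, 0.81, 3.0])
print(relu(z_values))
```

Negative entries come out as 0 and positive entries are unchanged, which is the kink at 0 described in the graph notes.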

### 3. Sigmoid Function

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

Where:
- $ \sigma(z) $ is the output of the sigmoid function
- $ z $ is the input (also called the net input or weighted sum)
- $ e $ is Euler's number, approximately equal to 2.718

**Graph**
- S-shaped curve
- Ranges from 0 to 1
- Smoothly squashes input into range (0 to 1)
- Allows small and large values to flatten out
- Non-linear because the rate of change is not constant

**Usage**
- Output layer for **binary classification**, which is its most common use
- Hidden layers of RNNs (for example, the gates in an LSTM)
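
The sigmoid formula translates directly into code. This is a minimal sketch for scalar inputs; it is not hardened against overflow for very large negative `z`.

```python
import math


def sigmoid(z: float) -> float:
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


print(sigmoid(0.0))             # 0.5
print(round(sigmoid(0.81), 3))  # 0.692
```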

### 4. TanH Function

$$
\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}
$$

Where:  
- $ \tanh(z) $ is the output of the hyperbolic tangent function  
- $ z $ is the input (also called the net input or weighted sum)  
- $ e $ is Euler's number, approximately equal to 2.718  

**Graph**  
- S-shaped curve, similar to sigmoid but centered at zero  
- Ranges from $ -1 $ to $ 1 $  
- Smoothly squashes input into the range $ -1 $ to $ 1 $  
- Allows both positive and negative inputs to be transformed non-linearly  
- Non-linear because the rate of change varies depending on input value  

**Usage**  
- Often used in hidden layers of neural networks  
- Especially popular in **recurrent neural networks** (RNNs)  
- Helps center data around zero, which can improve learning speed and stability  
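
As a quick check, the formula above can be coded as written and compared with Python's built-in `math.tanh`. The helper name `tanh_from_definition` is illustrative.

```python
import math


def tanh_from_definition(z: float) -> float:
    """tanh computed directly from the exponential formula."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))


for z in (-2.0, 0.0, 0.81):
    # The two columns should agree up to floating-point rounding.
    print(z, tanh_from_definition(z), math.tanh(z))
```

Negative inputs give negative outputs and an input of 0 gives exactly 0, which is the zero-centering that distinguishes tanh from sigmoid.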

### 5. Softmax Function
$$
\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}
$$

Where:  
- $ z_i $ is the input to the $ i^{\text{th}} $ neuron in the output layer  
- $ K $ is the total number of output classes  
- $ e $ is Euler’s number, approximately equal to 2.718  
- The denominator sums over all $ K $ class scores, turning raw scores into probabilities  

**Graph**  
- Converts raw input values (called logits) into a **probability distribution**  
- All outputs are between $ 0 $ and $ 1 $  
- The sum of all outputs is $ 1 $  
- Emphasizes the **largest values** while suppressing smaller ones  
- Non-linear because the exponential and normalization change the scale and shape of input values  

**Usage**  
- Used in the **output layer** of neural networks for **multi-class classification**  
- Helps the network output probabilities for each class  
- Most common in tasks like image classification, text categorization, etc.  
- Not used in hidden layers  
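
A minimal NumPy sketch of softmax follows. The logits are made-up values, and subtracting the maximum before exponentiating is a common numerical-stability trick that is added here as an assumption; it does not change the result of the formula.

```python
import numpy as np


def softmax(logits):
    """Turn a vector of raw scores into a probability distribution."""
    shifted = logits - np.max(logits)  # stability trick; output is unchanged
    exps = np.exp(shifted)
    return exps / np.sum(exps)


logits = np.array([2.0, 1.0, 0.1])  # hypothetical raw scores for 3 classes
probs = softmax(logits)
print(probs, probs.sum())
```

The largest logit receives the largest probability, and the probabilities add up to 1.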


### ✅ Notes

- **Input Layer** is meant for receiving inputs only. It does not use any activation function.






# 🔁 Training a Neural Network - Forward Propagation vs. Backward Propagation

https://xnought.github.io/backprop-explainer/

---

### 🚀 Forward Propagation

**Definition**:  
Forward propagation is the process of **passing input data through the network** to generate an output (prediction).

#### 🔢 Steps:
1. **Input** data is fed to the input layer.
2. Each neuron computes a **weighted sum** of its inputs:
   $$
   z = \sum w_i x_i + b
   $$
3. The result is passed through an **activation function** (e.g., ReLU, Sigmoid):
   $$
   a = f(z)
   $$
4. This continues layer by layer until the **final output** is produced.

#### 🎯 Purpose:
To calculate the **output (prediction)** of the neural network based on the current weights. The prediction is then compared with the true label through a loss function:

$$
\text{Loss} = \text{LossFunction}(y_{\text{true}}, y_{\text{pred}})
$$

The objective of learning is to minimize this loss. **Forward propagation alone** *has no way to feed the error back into the model*; that is the job of backward propagation.
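
The forward-propagation steps can be sketched as a loop over layers. Everything specific here is an assumption made for illustration: the 2 → 3 → 1 layer sizes, the random weights, the zero biases, and the use of squared error as a stand-in for `LossFunction`, which the text leaves unspecified.

```python
import numpy as np


def relu(z):
    return np.maximum(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)

# One (weights, bias, activation) triple per layer.
layers = [
    (rng.normal(size=(3, 2)), np.zeros(3), relu),     # hidden layer
    (rng.normal(size=(1, 3)), np.zeros(1), sigmoid),  # output layer
]

a = np.array([0.5, 0.8])  # step 1: input data
for W, b, f in layers:
    z = W @ a + b         # step 2: weighted sum
    a = f(z)              # step 3: activation
y_pred = a[0]             # step 4: final output

y_true = 1.0
loss = (y_true - y_pred) ** 2  # squared error as a placeholder loss
print(y_pred, loss)
```

Nothing in this loop changes the weights; it only produces a prediction and a loss value.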

---

### 🔁 Backward Propagation (Backprop)

**Definition**:  
Backward propagation is the process of **calculating the gradient of the loss function** with respect to every weight, working backward through the network, so that an optimizer can **update the weights**.

#### 🔄 Steps:
1. Calculate the **loss (error)** between predicted output and true label:
   $$
   \text{Loss} = \text{LossFunction}(y_{\text{true}}, y_{\text{pred}})
   $$
2. Compute the **gradient of the loss** with respect to each weight using the **chain rule** of calculus.
3. Propagate the error **backward from output to input** layers.
4. **Update the weights** using Gradient Descent:
   $$
   w_{\text{new}} = w_{\text{old}} - \eta \cdot \frac{\partial \text{Loss}}{\partial w_{\text{old}}}
   $$
   Where $ \eta $ is the learning rate.

#### 🎯 Purpose:
To **minimize the loss** by adjusting weights and biases in the direction that reduces error.
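
To make the chain rule concrete, here is a sketch of one gradient-descent update for a single sigmoid neuron. The inputs, weights, and bias reuse the earlier net-input example; the true label, the learning rate, and the squared-error loss are assumptions chosen for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


x = np.array([0.5, 0.8])
w = np.array([0.3, 0.7])
b = 0.1
y_true = 1.0  # assumed label
eta = 0.1     # assumed learning rate


def forward(w, b):
    a = sigmoid(w @ x + b)
    return a, (a - y_true) ** 2  # prediction and squared-error loss


a, loss_before = forward(w, b)

# Chain rule: dLoss/dw = dLoss/da * da/dz * dz/dw
dloss_da = 2.0 * (a - y_true)
da_dz = a * (1.0 - a)  # derivative of the sigmoid
grad_w = dloss_da * da_dz * x
grad_b = dloss_da * da_dz

# Gradient descent update.
w_new = w - eta * grad_w
b_new = b - eta * grad_b

_, loss_after = forward(w_new, b_new)
print(loss_before, loss_after)
```

The loss after the update is slightly smaller than before, which is the effect a single backward pass is meant to have.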

---

### 🧪 Simple Analogy

- **Forward Propagation**: Like a student answering a test based on what they know.
- **Backward Propagation**: Like the teacher grading the test, showing mistakes, and the student learning from it to do better next time.

---

### 🔄 Summary Table

| Process              | Direction           | Purpose                       | Main Operation            |
|----------------------|---------------------|-------------------------------|----------------------------|
| Forward Propagation  | Input → Output       | Make predictions               | Weighted sum + activation |
| Backward Propagation | Output → Input       | Update weights (learn)         | Gradient computation + update |



**EPOCH**: One forward and backward pass over the entire training dataset completes 1 epoch.

The epoch count tells you how many times the full dataset went through forward and backward propagation during training.




# 🧠 Predicting an Image Using a Neural Network (Backward Propagation)

Let's walk through how a neural network predicts whether an image is a **cat or dog**, using key concepts like **initial weights**, **activation functions**, **forward pass**, and **backpropagation**.

---

## 🔢 Step 1: Input Layer

- A 28×28 grayscale image is flattened into a vector of pixel values.
- For a 28×28 image → 784 inputs:

$$
x = [x_1, x_2, ..., x_{784}]
$$

👉 **No activation function is applied** at this stage. It's just raw input.

---

## 🔧 Step 2: Initialize Weights

- Every connection from the input to the first hidden layer gets an initial weight.
- Example using **He Initialization** (good for ReLU):

$$
w \sim \mathcal{N}(0, \frac{2}{n_{\text{in}}})
$$

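Here $ n_{\text{in}} $ is the number of inputs to the neuron and $ \frac{2}{n_{\text{in}}} $ is the variance of the distribution. This helps avoid exploding or vanishing gradients.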
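
A possible NumPy sketch of He initialization for the 784-input, 128-neuron hidden layer used in this walkthrough. The seed, the variable names, and the zero-initialized biases are assumptions; note that `rng.normal` takes a standard deviation, so the variance $ 2 / n_{\text{in}} $ is square-rooted.

```python
import numpy as np

rng = np.random.default_rng(42)

n_in, n_hidden = 784, 128

# The formula gives the variance (2 / n_in); normal() expects the std dev.
std = np.sqrt(2.0 / n_in)
W_hidden = rng.normal(loc=0.0, scale=std, size=(n_hidden, n_in))
b_hidden = np.zeros(n_hidden)  # zero biases are a common choice (assumption)

print(W_hidden.shape, W_hidden.std(), std)
```

The empirical standard deviation of the sampled weights should be close to the target value.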

---

## 🧮 Step 3: Forward Pass

### Hidden Layer

Let’s say we have:

- 1 Hidden Layer with 128 neurons (ReLU activation)
- 1 Output Neuron (Sigmoid activation for binary classification)

The weighted sum for one hidden neuron is:

$$
z = w_1x_1 + w_2x_2 + \dots + w_{784}x_{784} + b
$$

Apply **ReLU** activation:

$$
a = \max(0, z)
$$

Repeat for all 128 neurons.

### Output Layer

Then:

$$
z_{\text{output}} = \sum w_i a_i + b_{\text{out}}
$$

$$
\hat{y} = \sigma(z_{\text{output}}) = \frac{1}{1 + e^{-z_{\text{output}}}}
$$

- If $ \hat{y} \approx 1 $: predicted label is **dog**
- If $ \hat{y} \approx 0 $: predicted label is **cat**
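
The forward pass for this 784 → 128 → 1 network could look roughly like the sketch below. The image is random noise standing in for real pixel data, the weights are freshly initialized (so the prediction is meaningless), and the 0.5 decision threshold is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.random(784)  # stand-in for a flattened 28x28 image, pixels in [0, 1)

# He-initialized weights, zero biases (untrained network).
W1 = rng.normal(0.0, np.sqrt(2.0 / 784), size=(128, 784))
b1 = np.zeros(128)
w2 = rng.normal(0.0, np.sqrt(2.0 / 128), size=128)
b2 = 0.0

# Hidden layer: weighted sum, then ReLU, for all 128 neurons at once.
z_hidden = W1 @ x + b1
a_hidden = np.maximum(0.0, z_hidden)

# Output layer: weighted sum of hidden activations, then sigmoid.
z_output = w2 @ a_hidden + b2
y_hat = 1.0 / (1.0 + np.exp(-z_output))

label = "dog" if y_hat >= 0.5 else "cat"  # 0.5 threshold is an assumption
print(y_hat, label)
```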

---

## 💥 Step 4: Loss Calculation

Compare the prediction to the actual label using **binary cross-entropy**:

$$
\text{Loss} = -\left[y \cdot \log(\hat{y}) + (1 - y) \cdot \log(1 - \hat{y})\right]
$$

Where:
- $ y $ is the true label (1 for dog, 0 for cat)
- $ \hat{y} $ is the predicted probability
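
A short sketch of the binary cross-entropy formula for a single prediction. The clipping constant `eps` is an added assumption that guards against `log(0)`; the sample predictions are made up.

```python
import math


def binary_cross_entropy(y: float, y_hat: float, eps: float = 1e-12) -> float:
    """Loss for one sample: y is the true label, y_hat the predicted probability."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)  # keep log() finite (assumption)
    return -(y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat))


# True label is dog (1): a confident correct prediction vs. a confident wrong one.
print(binary_cross_entropy(1, 0.9))
print(binary_cross_entropy(1, 0.1))
```

The confident wrong prediction is penalized far more heavily than the confident correct one.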

---

## 🔄 Step 5: Backpropagation

- Compute the **error** at the output layer.
- Use the **chain rule** to propagate the error backward.
- Compute gradients:

$$
\frac{\partial \text{Loss}}{\partial w}
$$

- Update weights using **gradient descent**:

$$
w = w - \eta \cdot \frac{\partial \text{Loss}}{\partial w}
$$

Where $ \eta $ is the **learning rate**.
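
Putting the steps together, one training step on a single image might be sketched as follows. The input is random noise, the label and learning rate are assumed values, and the gradient expressions are the standard ones for a ReLU hidden layer followed by a sigmoid output with binary cross-entropy.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.random(784)  # stand-in for a flattened image
y = 1.0              # assumed true label (dog)
eta = 0.001          # assumed learning rate

W1 = rng.normal(0.0, np.sqrt(2.0 / 784), size=(128, 784))
b1 = np.zeros(128)
w2 = rng.normal(0.0, np.sqrt(2.0 / 128), size=128)
b2 = 0.0


def forward(W1, b1, w2, b2):
    z1 = W1 @ x + b1
    a1 = np.maximum(0.0, z1)
    y_hat = 1.0 / (1.0 + np.exp(-(w2 @ a1 + b2)))
    loss = -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))
    return z1, a1, y_hat, loss


z1, a1, y_hat, loss_before = forward(W1, b1, w2, b2)

# Backward pass (chain rule, output layer first).
dz2 = y_hat - y            # dLoss/dz_output for sigmoid + cross-entropy
grad_w2 = dz2 * a1
grad_b2 = dz2
dz1 = dz2 * w2 * (z1 > 0)  # ReLU passes gradient only where z1 > 0
grad_W1 = np.outer(dz1, x)
grad_b1 = dz1

# Gradient descent update.
W1, b1 = W1 - eta * grad_W1, b1 - eta * grad_b1
w2, b2 = w2 - eta * grad_w2, b2 - eta * grad_b2

*_, loss_after = forward(W1, b1, w2, b2)
print(loss_before, loss_after)
```

With a small learning rate, the loss on this image should drop slightly after the update.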

---

## 🔁 Repeat for Many Images

This process repeats for **every image**, over many **epochs**, and accuracy typically improves as training progresses.

The process repeats until a stopping condition is met. Here are some stopping conditions you can set:

1. A fixed number of epochs is reached.

2. The loss falls below a threshold (e.g. loss < 0.001), or it stops improving for a set number of epochs (e.g. 5).

3. The validation loss starts increasing, which means the model is overfitting.

4. A maximum training time is reached.
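
Several of these conditions can be combined in one check. The sketch below is hypothetical: the function name, its parameters, and the decision to track the validation loss for every condition are assumptions, and the time limit is left out for brevity.

```python
def should_stop(val_losses, max_epochs=100, loss_threshold=0.001, patience=5):
    """Return True once any stopping condition holds for the loss history so far."""
    epochs_done = len(val_losses)

    # Condition 1: epoch budget used up.
    if epochs_done >= max_epochs:
        return True

    # Condition 2a: loss is already small enough.
    if val_losses and val_losses[-1] < loss_threshold:
        return True

    # Conditions 2b and 3: no improvement (or a rise) over the last `patience` epochs.
    if epochs_done > patience:
        best_before = min(val_losses[:-patience])
        if min(val_losses[-patience:]) >= best_before:
            return True

    return False


history = [0.9, 0.5, 0.4, 0.41, 0.42, 0.43, 0.44, 0.45]
print(should_stop(history))      # True: no improvement in the last 5 epochs
print(should_stop(history[:3]))  # False: still improving
```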

---

## 🧩 Summary Flowchart

Image (28x28) → Flatten → Input Layer

         ↓

     Hidden Layer 1 (ReLU)

         ↓

     Output Layer (Sigmoid)

         ↓

   Prediction: "Dog" or "Cat"

         ↓

    Compute Loss & Backprop

         ↓

     Update Weights




## Loss Functions
- **Mean Squared Error (MSE)**: for regression tasks  
  $$
  MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
  $$

- **Binary Cross-Entropy**: for binary classification  
  $$
  BCE = - \frac{1}{n} \sum_{i=1}^{n} \left( y_i \cdot \log \hat{y}_i + (1 - y_i) \cdot \log (1 - \hat{y}_i) \right)
  $$

- **Categorical Cross-Entropy**: for multi-class classification  
  $$
  CCE = - \sum_{i=1}^{K} y_i \cdot \log \hat{y}_i
  $$

Where:  
- **n**: number of training samples  
- **K**: number of output classes (the output size)  
- **yᵢ**: actual (true) output  
- **ŷᵢ**: predicted output (a predicted probability for the cross-entropy losses)
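
The regression and multi-class losses can be sketched in NumPy as follows. The sample values are made up, and the `eps` clipping is an added assumption to avoid `log(0)`.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error over n samples."""
    return np.mean((y_true - y_pred) ** 2)


def categorical_cross_entropy(y_true_one_hot, y_pred_probs, eps=1e-12):
    """Cross-entropy for one sample with a one-hot true label."""
    return -np.sum(y_true_one_hot * np.log(np.clip(y_pred_probs, eps, 1.0)))


# Regression: errors of 1 and -2 give (1 + 4) / 2.
print(mse(np.array([3.0, 5.0]), np.array([2.0, 7.0])))  # 2.5

# Classification: the true class is index 1, predicted with probability 0.7.
print(categorical_cross_entropy(np.array([0, 1, 0]), np.array([0.2, 0.7, 0.1])))
```

With a one-hot label, the categorical cross-entropy reduces to the negative log of the probability assigned to the true class.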


**Batch Size**
The number of training samples processed together in one training step.

For example, suppose a dataset has 1000 images and is processed in batches of 100 images.

In that case, number of batches = 10 and batch size = 100.

**EPOCH**
One full pass through the entire dataset is called an epoch; it is composed of multiple iterations (one per batch). For example, processing all 10 batches that make up the 1000 images is one epoch.
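
The relationship between dataset size, batch size, and iterations per epoch can be sketched with the numbers from the example (1000 images, 100 images per batch). The list of integers is only a stand-in for real images.

```python
import math

dataset_size = 1000
batch_size = 100

# Each batch is one iteration; one epoch covers every batch once.
iterations_per_epoch = math.ceil(dataset_size / batch_size)

samples = list(range(dataset_size))  # stand-in for 1000 images
batches = [samples[i:i + batch_size] for i in range(0, dataset_size, batch_size)]

print(iterations_per_epoch, len(batches), len(batches[0]))  # 10 10 100
```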

**Learning Rate**
A hyper-parameter that determines how much the model weights are adjusted during each update step.
It sets the step size of each update: a small value gives slow but stable learning, while a large value can overshoot the minimum.
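
The effect of the learning rate can be seen on a toy problem. The sketch below minimizes the made-up loss $ L(w) = w^2 $ (gradient $ 2w $) with three assumed learning rates; it is an illustration, not a recipe for choosing a value.

```python
def gradient_descent(eta, steps=20, w=1.0):
    """Run plain gradient descent on the toy loss L(w) = w**2."""
    for _ in range(steps):
        w = w - eta * 2 * w  # w_new = w_old - eta * dL/dw
    return w


for eta in (0.01, 0.1, 1.1):
    print(eta, gradient_descent(eta))
```

After 20 steps the smallest rate has barely moved toward the minimum at 0, the middle rate is close to it, and the largest rate has overshot so badly that the weight grows instead of shrinking.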