---

# ✅ **What are CNNs**

---

A CNN (Convolutional Neural Network) is a type of deep learning model designed to process grid-like data, especially images.

- Images = 2D grids of pixels (e.g., 28×28 grayscale or 224×224×3 RGB).
- CNNs use convolutional layers to extract spatial features like edges, textures, and shapes.

---


# ✅ **Why Not Use ANN for Images?**

## 1. Too Many Parameters

- Example: 224×224 RGB image = 150,528 inputs  
- ANN with 1 hidden layer of 1,000 neurons:  
  → 150,528 × 1,000 = **150,528,000 weights** (≈ 150 million)  
  → High memory usage, slow training, overfitting

## 2. No Spatial Awareness

- ANN flattens the image → loses **spatial structure**
- Nearby pixels (edges, shapes) are treated as **independent**

---

## Why CNNs Work Better
### 1. Convolution Layer

- Applies small filters (e.g., 3×3) across the image
- Captures **local patterns** like edges and textures
- **Shared weights** → fewer parameters


### 2. Pooling Layer

- Reduces spatial size (e.g., 2×2 max pooling)
- Keeps the strongest activations and discards small local variations
- Reduces computation

---

### 3. Parameter Sharing

- Same filter slides across the image
- **Drastically fewer weights** than ANN
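
The effect of parameter sharing can be checked with a little arithmetic. The sketch below compares the dense layer from the earlier example with a 3×3 convolutional layer; the choice of 64 filters is an illustrative assumption, and biases are ignored on both sides.

```python
# Dense layer on a flattened 224x224x3 image vs. a 3x3 convolutional layer.
# The layer sizes (1,000 neurons, 64 filters) are illustrative assumptions.
height, width, channels = 224, 224, 3
inputs = height * width * channels  # 150,528 flattened input values

# Fully connected: every input connects to every neuron.
dense_weights = inputs * 1000

# Convolutional: each filter has 3*3*channels weights, reused at every position.
conv_weights = 3 * 3 * channels * 64

print(f"dense weights: {dense_weights:,}")  # 150,528,000
print(f"conv weights:  {conv_weights:,}")   # 1,728
```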

---

## Comparison Table

| Feature              | ANN (Fully Connected)  | CNN (Convolutional) |
|----------------------|------------------------|----------------------|
| Input shape          | Flattened              | 2D/3D (image)        |
| Parameters           | Very high              | Much lower           |
| Spatial info         | Lost                   | Preserved            |

---


# ✅ **Filters (Kernels) and Convolution Operation**

---

# What are Filters / Kernels?

- A **filter** (also called a **kernel**) is a small matrix used to extract features from an image.
- Common sizes: 3×3, 5×5, etc.
- Each filter detects a specific pattern (e.g., edge, corner, texture).

## Example: 3×3 Edge Detection Filter

[[-1, -1, -1], [ 0, 0, 0], [ 1, 1, 1]]


- This filter highlights **horizontal edges** in an image.

---

# What is Convolution Operation?

- Convolution = sliding the filter over the image and computing **dot products**.
- At each position, multiply filter values with the image patch and **sum** the result.

## Formula:

$$
\text{Output}(i, j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} I(i+m, j+n) \cdot K(m, n)
$$

Where:
- \( I \) = input image
- \( K \) = kernel
- \( k \) = kernel size
- \( (i, j) \) = position in output feature map
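
The formula can be sketched directly in plain Python. The function name `conv2d` and the 5×5 test image are assumptions made for illustration; the kernel is the horizontal edge filter from the example above. Note that the kernel is not flipped, so strictly speaking this is cross-correlation, which is what the formula describes.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and sum the products."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for m in range(k):
                for n in range(k):
                    total += image[i + m][j + n] * kernel[m][n]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 5x5 image: dark (0) on top, bright (9) on the bottom.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [9, 9, 9, 9, 9],
    [9, 9, 9, 9, 9],
    [9, 9, 9, 9, 9],
]
horizontal_edge = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]

feature_map = conv2d(image, horizontal_edge)
for row in feature_map:
    print(row)
# [27, 27, 27]
# [27, 27, 27]
# [0, 0, 0]
```

The response is large where the window straddles the dark-to-bright boundary and zero where the window covers a uniform region.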

---

# What is a Convolutional Layer?

- A **convolutional layer** applies multiple filters to the input image.
- Each filter produces a **feature map**.
- The output is a **stack of feature maps**, one per filter (the output depth equals the number of filters).

---

# Summary

- **Filter**: Small matrix to detect patterns
- **Convolution**: Sliding filter over image and computing dot products
- **Convolutional Layer**: Applies multiple filters to extract features

---

# ✅ **Padding and Strides**

---

# What is Stride?

- **Stride** = number of pixels the filter moves at each step.
- Default stride = 1 (moves one pixel at a time).
- Larger stride = smaller output feature map.

## Example:

- Input: 5×5
- Filter: 3×3
- Stride: 1 → Output: 3×3  
- Stride: 2 → Output: 2×2

---

# What is Padding?

- **Padding** = adding extra pixels (usually zeros) around the input image.
- Purpose: control output size and preserve edge information.

## Types of Padding:

| Type         | Description |
|--------------|-------------|
| **Valid**    | No padding → output shrinks |
| **Same**     | Adds padding → output size = input size (for stride 1) |
| **Custom**   | Manually set padding size |
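
As a rough illustration of zero padding, the sketch below surrounds a small image with a border of zeros. The helper name `zero_pad` is an assumption for this example, not a library function.

```python
def zero_pad(image, p):
    """Return a copy of `image` with `p` rows/columns of zeros on every side."""
    width = len(image[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]          # top border
    for row in image:
        padded.append([0] * p + list(row) + [0] * p)  # left/right border
    padded.extend([0] * width for _ in range(p))      # bottom border
    return padded


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
padded = zero_pad(image, 1)
for row in padded:
    print(row)
# [0, 0, 0, 0, 0]
# [0, 1, 2, 3, 0]
# [0, 4, 5, 6, 0]
# [0, 7, 8, 9, 0]
# [0, 0, 0, 0, 0]
```

With `p = 1`, a 3×3 filter can now be centered on every original pixel, including those on the border.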

---




# ✅ **Padding and Stride – Formulas**

---

# 1. Output Size Formula (1D)

For a 1D input of size \( n \), kernel size \( k \), padding \( p \), and stride \( s \):

$$
\text{Output size} = \left\lfloor \frac{n + 2p - k}{s} \right\rfloor + 1
$$

---

# 2. Output Size Formula (2D)

For a 2D input of size \( n \times n \), kernel size \( k \times k \), padding \( p \), and stride \( s \):

$$
\text{Output height} = \left\lfloor \frac{n + 2p - k}{s} \right\rfloor + 1
$$

$$
\text{Output width} = \left\lfloor \frac{n + 2p - k}{s} \right\rfloor + 1
$$
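
The output size formula translates into one line of integer arithmetic. In this sketch the function name `conv_output_size` is an assumption; the first two calls reproduce the 5×5 stride example from earlier, and the last two apply the formula separately to height and width of a hypothetical non-square input.

```python
def conv_output_size(n, k, p=0, s=1):
    """Output length along one dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1


print(conv_output_size(5, 3, s=1))  # 3
print(conv_output_size(5, 3, s=2))  # 2

# Hypothetical 224x160 input, 7x7 kernel, padding 3, stride 2.
out_h = conv_output_size(224, 7, p=3, s=2)
out_w = conv_output_size(160, 7, p=3, s=2)
print(out_h, out_w)  # 112 80
```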

---

# 3. Padding to Keep Output Same as Input
To keep the output size the same as the input size (when stride = 1 and the kernel size \( k \) is odd):

$$
p = \left\lfloor \frac{k - 1}{2} \right\rfloor
$$

This is used in **"same" padding**.

---

## Example

- Input size: 32×32  
- Kernel size: 3×3  
- Stride: 1  
- Padding: 1

Then:

$$
\text{Output size} = \left\lfloor \frac{32 + 2(1) - 3}{1} \right\rfloor + 1 = 32
$$

So the output size is **32×32**, same as input.
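
A quick check of the "same" padding rule, under the assumption of stride 1. The helper names are hypothetical. The last line shows that for an even kernel size the rule leaves the output one pixel short.

```python
def conv_output_size(n, k, p, s=1):
    """floor((n + 2p - k) / s) + 1"""
    return (n + 2 * p - k) // s + 1


def same_padding(k):
    """Padding from the rule p = floor((k - 1) / 2)."""
    return (k - 1) // 2


n = 32
odd_outputs = {}
for k in (1, 3, 5, 7):
    p = same_padding(k)
    odd_outputs[k] = conv_output_size(n, k, p)
    print(f"k={k}, p={p} -> output {odd_outputs[k]}")  # output is 32 each time

# Even kernel: symmetric padding cannot keep the size exactly.
even_output = conv_output_size(n, 4, same_padding(4))
print(f"k=4, p=1 -> output {even_output}")  # 31
```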

---


# ✅ **Calculating Trainable Parameters in CNN**

---

# 1. Convolutional Layer

**Formula:**

$$
\text{Parameters} = (F_H \times F_W \times C_{\text{in}} + 1) \times N_{\text{filters}}
$$

Where:
- **\( F_H, F_W \)**: Filter height and width
- **\( C_{\text{in}} \)**: Number of input channels
- **\( N_{\text{filters}} \)**: Number of filters (output channels)
- `+1` accounts for the **bias** per filter

## Example:

- Input: 128 x 128 x 3
- Filter size: 3 x 3
- Number of filters: 50

**Calculation:**

- Weights per filter: 3 x 3 x 3 = 27
- Total weights: 27 x 50 = 1350
- Biases: 1 x 50 = 50

**Total trainable parameters = 1350 + 50 = 1400**
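
The same calculation as a small function (the name `conv_params` is an assumption). Note that the 128 x 128 spatial size of the input does not appear anywhere: only the filter size, input channels, and filter count matter. The second call uses a hypothetical follow-up layer with 64 filters.

```python
def conv_params(filter_h, filter_w, in_channels, num_filters):
    """Trainable parameters of a conv layer: weights plus one bias per filter."""
    return (filter_h * filter_w * in_channels + 1) * num_filters


first_layer = conv_params(3, 3, 3, 50)
print(first_layer)  # 1400

# Hypothetical second layer: 3x3 filters over the 50 channels produced above.
second_layer = conv_params(3, 3, 50, 64)
print(second_layer)  # 28864
```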

---

# 2. Fully Connected (Dense) Layer

**Formula:**

$$
\text{Parameters} = (N_{\text{in}} + 1) \times N_{\text{out}}
$$

Where:
- **\( N_{\text{in}} \)**: Number of input units
- **\( N_{\text{out}} \)**: Number of output units
- `+1` is for the bias per output unit
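
A short sketch of the dense-layer formula (the function name `dense_params` and the 128 → 10 layer are assumptions). The first call revisits the flattened 224×224×3 image feeding 1,000 neurons, this time including biases.

```python
def dense_params(n_in, n_out):
    """Trainable parameters of a dense layer: weights plus one bias per output."""
    return (n_in + 1) * n_out


image_to_hidden = dense_params(224 * 224 * 3, 1000)
print(f"{image_to_hidden:,}")  # 150,529,000

hidden_to_output = dense_params(128, 10)
print(hidden_to_output)  # 1290
```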

---

# 3. Pooling Layers

- **No trainable parameters**
- They only perform downsampling (e.g., max or average pooling)

---

# Tips

| Term             | Meaning                                |
|------------------|----------------------------------------|
| Input Channels   | Depth of input (e.g., 3 for RGB image) |
| Output Channels  | Number of filters used in the layer    |
| Each Filter Size | Height × Width × Input Channels        |


---


# ✅ **Pooling**

---

## What is Pooling?

- **Pooling** is a downsampling operation used in CNNs.
- It reduces the **spatial dimensions** (height and width) of feature maps.
- Benefits:
  - Reduces computation
  - Controls overfitting
  - Makes features more robust to translation (position changes)

---

## How Pooling Works

- A small window (e.g., 2×2) slides over the input.
- It replaces the window with a **single value** based on a rule (max, average, etc.).
- Pooling is applied **independently** to each feature map.
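
The sliding-window procedure can be sketched for a single feature map as follows. The function `pool2d` and the 4×4 input are assumptions for illustration; the rule is passed in as a function, so the same code covers maximum, average, and minimum.

```python
def pool2d(feature_map, size=2, stride=2, reduce_fn=max):
    """Replace each size x size window with a single value chosen by reduce_fn."""
    output = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [feature_map[i + m][j + n]
                      for m in range(size) for n in range(size)]
            row.append(reduce_fn(window))
        output.append(row)
    return output


def mean(values):
    return sum(values) / len(values)


fmap = [
    [1, 3, 2, 0],
    [2, 4, 1, 1],
    [5, 6, 0, 2],
    [7, 8, 3, 4],
]
print(pool2d(fmap, reduce_fn=max))   # [[4, 2], [8, 4]]
print(pool2d(fmap, reduce_fn=mean))  # [[2.5, 1.0], [6.5, 2.25]]
print(pool2d(fmap, reduce_fn=min))   # [[1, 0], [5, 0]]
```

A 4×4 map shrinks to 2×2 because the 2×2 window moves two pixels at a time.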

---

## Types of Pooling

### 1. Max Pooling

- Takes the **maximum** value in each window.
- Keeps the most important (strongest) feature.

**Example:**

- Input: [[1, 3], [2, 4]]
- Max Pooling → 4

---

### 2. Average Pooling

- Takes the **average** of all values in the window.
- Smooths the feature map.

**Example:**

- Input: [[1, 3], [2, 4]]
- Average Pooling → (1+2+3+4)/4 = 2.5


### 3. Min Pooling

- Takes the **minimum** value in each window.
- Highlights the **least activated** features.

**Example:**

- Input: [[1, 3], [2, 4]]
- Min Pooling → 1

---

### 4. Global Pooling

- Applies pooling over the **entire feature map**.
- Output is a **single value per feature map**.
- Used before fully connected layers to reduce dimensions.

#### Types:
- **Global Max Pooling**: Takes the maximum value of the entire feature map.
- **Global Average Pooling**: Takes the average of all values in the feature map.
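
A minimal sketch of both global variants, assuming the feature maps are given as a list of 2D lists (the function names and the input values are hypothetical). Each feature map collapses to one number, so two maps yield a vector of length two.

```python
def global_max_pool(feature_maps):
    """One value per feature map: its largest entry."""
    return [max(max(row) for row in fmap) for fmap in feature_maps]


def global_avg_pool(feature_maps):
    """One value per feature map: the mean of all its entries."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]


feature_maps = [
    [[1, 3], [2, 4]],
    [[0, 8], [4, 0]],
]
print(global_max_pool(feature_maps))  # [4, 8]
print(global_avg_pool(feature_maps))  # [2.5, 3.0]
```
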
---

## Pooling Parameters

| Parameter     | Description |
|---------------|-------------|
| **Window size** | Size of the pooling filter (e.g., 2×2) |
| **Stride**      | How far the window moves (usually 2) |
| **Padding**     | Usually not used in pooling |

---

## Output Size Formula (Same as Convolution)

For input size \( n \), filter size \( k \), stride \( s \), and padding \( p \):

$$
\text{Output size} = \left\lfloor \frac{n + 2p - k}{s} \right\rfloor + 1
$$

- Usually, \( p = 0 \) and \( s = k \) in pooling.

---

# ✅ **CNN Architecture**

---

# What is a CNN Architecture?

- A **CNN architecture** defines how layers are arranged in a convolutional neural network.
- It includes:
  - Number and type of layers (Conv, Pooling, FC)
  - Filter sizes and counts
  - Activation functions
  - Connections between layers

---

# Typical Layer Order in a CNN

A standard CNN follows this sequence of layers:

1. **Input Layer**
   - Accepts the image (e.g., 28×28 grayscale or 224×224×3 RGB)

2. **Convolutional Layer**
   - Applies filters to extract features like edges, textures, etc.

3. **Activation Function (ReLU)**
   - Adds non-linearity to help the network learn complex patterns

4. **Batch Normalization (optional)**
   - Normalizes activations to stabilize and speed up training

5. **Pooling Layer**
   - Downsamples the feature maps to reduce size and computation

6. **Dropout Layer (optional)**
   - Randomly disables neurons to prevent overfitting

7. **Repeat Steps 2–6**
   - Multiple convolutional blocks are stacked

8. **Flatten Layer**
   - Converts 2D feature maps into a 1D vector

9. **Fully Connected (Dense) Layer**
   - Performs classification based on extracted features

10. **Output Layer**
    - Final predictions (e.g., softmax for multi-class classification)

---

# Example Architecture Flow
Input → Conv → ReLU → BatchNorm → Pool → Dropout → Conv → ReLU → Pool → Flatten → FC → Output
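
To see how tensor shapes change along such a flow, the sketch below traces a hypothetical small network on a 28×28 grayscale input. The filter counts, kernel sizes, and padding are assumptions chosen for illustration. ReLU, BatchNorm, and Dropout are left out because they do not change the shape.

```python
def out_size(n, k, p, s):
    """Spatial output size of a conv or pooling layer."""
    return (n + 2 * p - k) // s + 1


size, channels = 28, 1
shapes = [("Input", (size, size, channels))]

# (name, kernel, padding, stride, filters); filters=None keeps the channel count.
blocks = [
    ("Conv 3x3, 8 filters", 3, 1, 1, 8),
    ("MaxPool 2x2", 2, 0, 2, None),
    ("Conv 3x3, 16 filters", 3, 1, 1, 16),
    ("MaxPool 2x2", 2, 0, 2, None),
]
for name, k, p, s, filters in blocks:
    size = out_size(size, k, p, s)
    if filters is not None:
        channels = filters
    shapes.append((name, (size, size, channels)))

flattened = size * size * channels  # length of the vector fed to the FC layer

for name, shape in shapes:
    print(f"{name:<22} -> {shape}")
print(f"{'Flatten':<22} -> ({flattened},)")
```

Under these assumptions the spatial size goes 28 → 28 → 14 → 14 → 7, and the flattened vector has 7 × 7 × 16 = 784 entries.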

---

# Layer Summary

| Layer Type         | Purpose                          |
|--------------------|----------------------------------|
| Convolutional Layer| Feature extraction               |
| Activation (ReLU)  | Non-linearity                    |
| Batch Norm         | Stabilize training               |
| Pooling Layer      | Downsampling                     |
| Dropout            | Regularization                   |
| Fully Connected    | Classification                   |

---

# Notes

- **BatchNorm** is often placed **after Conv and before ReLU**; the ReLU → BatchNorm order shown in the layer list above is also used in practice.
- **Dropout** is usually applied **after pooling or before FC layers**.
- The exact order may vary slightly in advanced architectures (e.g., ResNet, DenseNet).

---


# Famous CNN Architectures

## 1. LeNet-5 (1998)
- Designed for digit recognition (MNIST)
- Simple: 2 conv layers + 2 subsampling (pooling) layers + 3 FC layers
- Input: 32×32 grayscale

## 2. AlexNet (2012)
- Won ImageNet 2012
- 5 conv layers + 3 FC layers
- Used ReLU, dropout, and GPU training

## 3. VGGNet (2014)
- Very deep: 16 or 19 layers
- Uses only 3×3 filters
- Easy to understand and implement

## 4. GoogLeNet (Inception) (2014)
- Introduced **Inception modules**
- Mixed filters (1×1, 3×3, 5×5) in parallel
- Very efficient and deep

## 5. ResNet (2015)
- Introduced **skip connections** (residual blocks)
- Mitigates the vanishing gradient problem, making very deep networks trainable
- Very deep: up to 152 layers

## 6. DenseNet (2017)
- Within a dense block, each layer connects to **all previous layers**
- Improves feature reuse and gradient flow

---

# Comparison Table

| Model      | Year | Depth | Key Feature             |
|------------|------|-------|-------------------------|
| LeNet-5    | 1998 | 7     | Early CNN for digits    |
| AlexNet    | 2012 | 8     | ReLU, dropout, GPU      |
| VGGNet     | 2014 | 16/19 | Simple, deep, 3×3 filters|
| GoogLeNet  | 2014 | 22    | Inception modules       |
| ResNet     | 2015 | 18–152| Residual connections    |
| DenseNet   | 2017 | 121+  | Dense connections       |

---

# Summary

- CNN architectures evolve to improve **accuracy**, **efficiency**, and **training stability**.
- Famous models like **ResNet** and **DenseNet** are widely used in modern applications.

---


# ✅ **Backpropagation in CNN**

---

# What is Backpropagation?

- Backpropagation is the algorithm that computes the **gradients** needed to **update the weights** of a neural network with **gradient descent**.
- It works by computing the **gradient of the loss function** with respect to each weight using the **chain rule** of calculus.

---

# Backpropagation Steps in CNN

## 1. Forward Pass
- Compute outputs layer by layer:
  - Convolution → Activation → Pooling → Fully Connected → Output
- Calculate **loss** using a loss function (e.g., cross-entropy)

---

## 2. Backward Pass (Backpropagation)

We compute gradients **from output to input**:

### a. Output Layer (Fully Connected)

For a loss function \( L \) and output \( \hat{y} \):

$$
\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial W}
$$

Where:
- \( z = W \cdot x + b \)
- \( \hat{y} \): prediction
- \( W \): weights
- \( x \): input to the layer

---

### b. Activation Function (e.g., ReLU)

For ReLU: \( f(x) = \max(0, x) \)

Its derivative:

$$
f'(x) = \begin{cases}
1 & \text{if } x > 0 \\
0 & \text{if } x \leq 0
\end{cases}
$$

Multiply this with the gradient from the next layer.
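
As a sketch of this step, the backward pass of ReLU simply masks the incoming gradient. The function names and values below are assumptions; `upstream` stands for the gradient arriving from the next layer.

```python
def relu(values):
    """Forward pass: max(0, x) element-wise."""
    return [max(0.0, v) for v in values]


def relu_backward(values, upstream):
    """Backward pass: let the gradient through only where the input was positive."""
    return [g if v > 0 else 0.0 for v, g in zip(values, upstream)]


x = [-2.0, 0.0, 3.0]
upstream = [0.5, 0.5, 0.5]

print(relu(x))                     # [0.0, 0.0, 3.0]
print(relu_backward(x, upstream))  # [0.0, 0.0, 0.5]
```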

---

### c. Convolutional Layer

Let:
- \( I \): input image
- \( K \): kernel
- \( O \): output feature map

Then:

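- Gradient w.r.t. kernel (assuming stride 1 and no padding; \( * \) is the same sliding-window operation as in the forward pass, with \( \frac{\partial L}{\partial O} \) used as the filter):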

$$
\frac{\partial L}{\partial K} = I * \frac{\partial L}{\partial O}
$$

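- Gradient w.r.t. input ("full" means \( \frac{\partial L}{\partial O} \) is zero-padded by \( k - 1 \) on every side; "flipped" means \( K \) rotated by 180°):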

$$
\frac{\partial L}{\partial I} = \text{full convolution of } \frac{\partial L}{\partial O} \text{ with flipped } K
$$
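
Both gradient rules can be tried on a tiny example. This is a sketch under stated assumptions: single channel, stride 1, no padding, and a hypothetical loss equal to the sum of all outputs, so that \( \frac{\partial L}{\partial O} \) is all ones. The function names are made up for this example.

```python
def correlate(image, kernel):
    """Forward-pass operation: slide `kernel` over `image` (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[i + m][j + n] * kernel[m][n]
                for m in range(kh) for n in range(kw))
            for j in range(len(image[0]) - kw + 1)
        ]
        for i in range(len(image) - kh + 1)
    ]


def grad_kernel(image, grad_out):
    """dL/dK: slide dL/dO over the input as if it were the filter."""
    return correlate(image, grad_out)


def grad_input(grad_out, kernel):
    """dL/dI: zero-pad dL/dO by k-1 per side, then slide the 180-degree-rotated kernel."""
    kh, kw = len(kernel), len(kernel[0])
    flipped = [row[::-1] for row in kernel[::-1]]
    width = len(grad_out[0]) + 2 * (kw - 1)
    padded = [[0] * width for _ in range(kh - 1)]
    padded += [[0] * (kw - 1) + list(row) + [0] * (kw - 1) for row in grad_out]
    padded += [[0] * width for _ in range(kh - 1)]
    return correlate(padded, flipped)


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, -1]]
# Assumed loss: the sum of all outputs, so dL/dO is all ones.
grad_out = [[1, 1], [1, 1]]

print(grad_kernel(image, grad_out))  # [[12, 16], [24, 28]]
print(grad_input(grad_out, kernel))  # [[1, 1, 0], [1, 0, -1], [0, -1, -1]]
```

Each entry of the kernel gradient is the sum of the input pixels that kernel weight touched, and each entry of the input gradient is the sum of the kernel weights that touched that pixel.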

---

### d. Pooling Layer (e.g., Max Pooling)

- **Max Pooling**: Gradient is passed only to the **max value** in the window.
- **Average Pooling**: Gradient is **evenly distributed** to all values in the window.
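
A minimal sketch of both rules for a single pooling window (function names are assumptions). How ties between equal maxima are handled varies; here the first maximum receives the gradient.

```python
def max_pool_backward(window, upstream):
    """Send the whole upstream gradient to the position of the window's maximum."""
    flat = [value for row in window for value in row]
    winner = flat.index(max(flat))  # on ties, the first maximum wins (an assumption)
    cols = len(window[0])
    return [
        [upstream if r * cols + c == winner else 0.0 for c in range(cols)]
        for r in range(len(window))
    ]


def avg_pool_backward(window, upstream):
    """Split the upstream gradient equally among all positions in the window."""
    share = upstream / (len(window) * len(window[0]))
    return [[share for _ in row] for row in window]


window = [[1, 3], [2, 4]]
print(max_pool_backward(window, 1.0))  # [[0.0, 0.0], [0.0, 1.0]]
print(avg_pool_backward(window, 1.0))  # [[0.25, 0.25], [0.25, 0.25]]
```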

---

## 3. Weight Update (Gradient Descent)

Update rule:

$$
W := W - \eta \cdot \frac{\partial L}{\partial W}
$$

Where:
- \( \eta \): learning rate
- \( \frac{\partial L}{\partial W} \): gradient of loss w.r.t. weights
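
The update rule can be watched in action on a one-weight toy problem. The loss \( L(w) = (w - 3)^2 \), the starting point, and the learning rate are all assumptions chosen for illustration; a real network applies the same rule to every weight using gradients from backpropagation.

```python
def loss_gradient(w):
    """Gradient of the hypothetical loss L(w) = (w - 3)^2."""
    return 2 * (w - 3)


w = 0.0
learning_rate = 0.1
for step in range(50):
    # W := W - eta * dL/dW
    w = w - learning_rate * loss_gradient(w)

print(w)  # very close to 3, the minimum of the loss
```

Each step shrinks the distance to the minimum by a constant factor here, which is why the learning rate matters: too small and progress is slow, too large and the updates overshoot.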

---

# Summary

| Step             | Operation                          |
|------------------|------------------------------------|
| Forward Pass     | Compute outputs and loss           |
| Backward Pass    | Compute gradients using chain rule |
| Update Weights   | Apply gradient descent             |




# Note:
- Backpropagation in CNNs is similar to ANNs, but includes **convolution-specific gradient rules**.

---


# ✅ **Transfer Learning**

---

# What is Transfer Learning?

- Transfer Learning is a technique where a model trained on one task is reused for another related task.
- In CNNs, it means using a **pre-trained model** (like VGG16, ResNet50) trained on a large dataset (e.g., ImageNet) and adapting it to a new, smaller dataset.

---

There are two main approaches:

## 1. Feature Extraction
- Freeze all pre-trained layers.
- Train only new classifier layers.

## 2. Fine-Tuning
- Unfreeze **some deeper layers** of the pre-trained model.
- Train both the classifier and a few convolutional layers.

---

# Why Use Transfer Learning?

- Saves **time and computation**.
- Useful when you have **limited data**.
- Leverages powerful **features learned from large datasets**.

---

## How Transfer Learning Works

### Step 1: Load a Pre-trained CNN
- Example models: `VGG16`, `ResNet50`, `MobileNet`.
- These are trained on large datasets like **ImageNet** (1.2 million images, 1000 classes).

### Step 2: Freeze Early Layers
- Early layers learn general visual patterns.
- We **freeze** them to keep their weights unchanged.

### Step 3: Replace Final Layers
- Remove the original classifier (top layers).
- Add **custom layers** for your new task.
- Example: change output from 1000 classes → 10 classes.

### Step 4: Train on New Dataset
- Train only the new layers.
- Optionally, **fine-tune** some deeper layers later.

---

# Example (Using Keras)

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Flatten

# Step 1: Load pre-trained model (without top/classifier)
base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# Step 2: Freeze base model layers
for layer in base_model.layers:
    layer.trainable = False

# Step 3: Add new classifier layers
x = Flatten()(base_model.output)
x = Dense(128, activation='relu')(x)
output = Dense(10, activation='softmax')(x)

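# Step 4: Create new model
model = Model(inputs=base_model.input, outputs=output)

# Compile so the model is ready to train on the new dataset
# (this loss assumes one-hot encoded labels)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```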