# CNN : Convolutional Neural Network

In [None]:
from IPython.display import display, HTML

display(HTML("""
<div style="padding: 20px; background-color: #511f1f; border: 1px solid #e9ecef; border-radius: 5px; margin-bottom: 20px; box-shadow: 0 2px 4px rgba(0,0,0,0.05); max-width: 600px; margin: 30px auto;">
    <h3 style="text-align: center; color: #ffffff; margin-top: 0; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;">
        Project Contributors
    </h3>
    <hr style="border-top: 1px solid #ffffff; width: 50%; margin: 10px auto;">
    <div style="text-align: center; font-size: 1.1em; color: #ffffff; margin-bottom: 15px;">
        <strong>Walid Canesse</strong><br>
        <strong>Salaheddin Ben Emran</strong><br>
        <strong>Mohamed-Ayoub Bouzid</strong>
    </div>
    <div style="text-align: center; font-size: 0.9em; color: #ffffff; font-style: italic;">
        All animations and visualizations were created by the authors using <a href="https://www.manim.community/" target="_blank" style="color: #aad7ff; text-decoration: underline;">Manim</a>.
    </div>
</div>
"""))

# Motivations

The purpose of this project is to provide a clear and accessible introduction to convolutional neural networks.  
A basic understanding of artificial neural networks is recommended.

As this work has not been reviewed by professionals in the field and was produced out of personal interest alongside our studies, please feel free to contact us at **walidcanesse@gmail.com** if you notice any inaccuracies or have suggestions for improvement.


# Table of Contents

* [Introduction](#Introduction)
* [Chapter 1 : The Intuition Behind CNNs](#Chapter-1-:-The-intuition-behind-CNNs)
    * [I. What is an Image?](#I.-What-is-an-Image?)
        * [A. Grayscale Images](#A.-Grayscale-Images-(Black-&-White))
        * [B. Color Images (RGB)](#B.-Color-Images-(RGB))
        * [C. What is a Tensor?](#C.-What-is-a-Tensor?)
    * [II. Why Not Classic MLPs? (Example: MNIST)](#II.-Why-Not-Classic-MLPs?-(Example:-MNIST-Dataset))
    * [III. How Humans Classify Images: Feature Detection](#III.-How-Humans-Classify-Images:-Feature-Detection)
* [Chapter 2 : The Convolution Operator](#Chapter-2-:-The-convolution-operator)
    * [I. Defining Image Convolution and Kernels](#I.-Defining-Image-Convolution-and-Kernels)
    * [II. The Dimensions of a Convolution Output](#II.-The-dimensions-of-a-convolution-output)
    * [III. Hyperparameters of Convolution](#III.-Hyperparameters-of-Convolution)
        * [1. Padding](#1.-Padding)
        * [2. Stride](#2.-Stride)
        * [3. Kernel Size & Number of Filters](#3.-Kernel-Size-&-Number-of-Filters)
    * [IV. The Final Output Dimension Formula](#IV.-The-FINAL-Output-Dimension-Formula-including-padding-and-stride-!!)
    * [V. Feature Detection Examples](#V.-Feature-detection-examples)
* [Chapter 3 : Building the Convolutional Neural Network](#Chapter-3-:-Building-the-Convolutional-Neural-Network)
    * [I. From Manual Filters to Learned Features](#I.-From-Manual-Filters-to-Learned-Features)
    * [II. How Can We See Convolution as a Neural Network Layer](#II.-How-can-we-see-Convolution-as-a-Neural-Network-Layer)
        * [1. Definitions and Matrices](#1.-Definitions-and-Matrices)
        * [2. Step-by-Step Calculation](#2.-Step-by-Step-Calculation-(The-Linear-Part))
        * [3. The Activation Function](#3.-The-Activation-Function-(Non-Linearity))
    * [III. The Convolutional Neural Network Pipeline](#III.-The-Convolutional-Neural-Network-Pipeline-!!)
        * [1. Pooling](#1.-Pooling)
            * [Max Pooling vs. Average Pooling](#B.-Max-Pooling-vs.-Average-Pooling)
        * [2. CNN's Pipeline Architecture](#2.CNN's-Pipeline-!!!)
        * [3. Designing the Pipeline: Hyperparameters](#3.-Designing-the-CNN's-Pipeline:-The-Hyperparameters)
            * [3.4 Stabilization & Regularization (Batch Norm & Dropout)](#3.4-Crucial-Extras:-Stabilization-&-Regularization)
        * [4. Example of How a CNN Learns (Modern Pipeline)](#4.-Example-of-how-a-CNN-learns-(The-Full-Modern-Pipeline))
            * [4.1 Intuition vs. Reality](#4.1-Intuition-vs.-Reality)
            * [4.2 The Forward Pass](#4.2-The-Forward-Pass-(Step-by-Step-with-Layers))
            * [4.3 Backpropagation & Optimization](#4.3-Backpropagation-&-Optimization)
        * [5. Network Depth vs. Information Precision](#5.-Network-Depth-vs.-Information-Precision)
        * [6. Bonus: Mathematical Zoom into Backpropagation in CNNs](#6-bonus-mathematical-zoom-into-backpropagation-in-cnns)


* [References](#References)

# Introduction

In Machine Learning, when we need to perform classification, we have many standard models that work well, such as:
* **Logistic Regression**
* **Random Forest**
* **SVM (Support Vector Machines)**
* **K-Nearest Neighbors**

However, in this notebook, we will focus on a specific case: **when the input is an image**.

We will demonstrate that **Neural Networks** are a much more powerful tool for this task. In particular, we will see how their structure—with a specific adaptation called **Convolution**—can be perfectly tailored to understand and classify visual data.

# Chapter 1 : The intuition behind CNNs

## I. What is an Image?


To a computer, an image is just a grid of numbers. However, in Deep Learning, we formalize this using the concept of **Tensors**.

Dimension of an image : $Height \times Width \times Channels$

### A. Grayscale Images (Black & White)
We define a grayscale image as a volume of size **$N_1 \times N_2 \times 1$**.
* It is a matrix where each pixel is a number between **0 and 255**.
    * **0:** Pure Black.
    * **255:** Pure White.
    * **Between:** Shades of gray.
* We can view this as a **Tensor with 1 channel** (a single "slice" or matrix).

### B. Color Images (RGB)
A color image is a volume of size **$N_1 \times N_2 \times 3$**.
* Instead of one slice, we have **3 matrices stacked together**:
    1.  **Red** Matrix
    2.  **Green** Matrix
    3.  **Blue** Matrix
* This is a **3D Tensor** with 3 channels. 

In [None]:
from IPython.display import display, HTML

display(HTML("""
<table style="width: 100%; border: none; background: transparent;">
  <tr style="background: transparent;">
    
    <td style="width: 50%; border: none; padding: 5px;">
      <img src="media/videos/CNN/720p30/RealGrayscaleToMatrixV2_ManimCE_v0.19.0.gif" style="width: 100%;">
    </td>

    <td style="width: 50%; border: none; padding: 5px;">
      <img src="media/videos/CNN/720p30/RGBAbstractTensor_ManimCE_v0.19.0.gif" style="width: 100%;">
    </td>

  </tr>
</table>
"""))




### C. What is a Tensor?
A **Tensor** is simply a generalization of matrices to higher dimensions.

* **0D Tensor (Scalar):** A simple number.
    * *Structure:* Just a singular point.
    * *Example:* `5`

* **1D Tensor (Vector):** A simple list of numbers.
    * *Structure:* Just a line.
    * *Example:* `[0, 255, 12, 45]`

* **2D Tensor (Matrix):** A single grid of numbers.
    * *Structure:* A "sheet" with rows and columns.
    * *Example:* A single grayscale image map ($H \times W$), i.e. the one slice of an $H \times W \times 1$ volume.

* **3D Tensor:** **Matrices stacked together**.
    * *Structure:* A stack of sheets.
    * *Example:* A Color Image. It is **3 matrices** (Red, Green, Blue) stacked on top of each other ($H \times W \times 3$).

### Important Note: 
**A 3D Tensor is not limited to 3 channels.**

* You can stack **2, 5, 7, or even 100 matrices** together.
* It remains a **3D Tensor** because it is still defined by 3 axes: $Height \times Width \times Channels$.
* **Think of a book:** Whether a book has 3 pages or 500 pages, it is still a "3D object". The number of pages (channels) changes, but the structure is the same.


Don't overcomplicate it. Just see a Tensor as a container of matrices stacked on top of each other.
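
To make the vocabulary concrete, here is a minimal NumPy sketch of the tensors described above. The array names and the $28 \times 28$ size are illustrative choices, not part of the definitions.

```python
import numpy as np

scalar = np.array(5)                             # 0D tensor: a single number
vector = np.array([0, 255, 12, 45])              # 1D tensor: a list of numbers
gray = np.zeros((28, 28), dtype=np.uint8)        # 2D tensor: one grayscale sheet
gray_volume = gray[..., np.newaxis]              # same sheet seen as H x W x 1
rgb = np.zeros((28, 28, 3), dtype=np.uint8)      # 3D tensor: 3 stacked sheets
stack = np.zeros((28, 28, 100), dtype=np.uint8)  # still 3D, with 100 channels

for name, t in [("scalar", scalar), ("vector", vector), ("gray", gray),
                ("gray_volume", gray_volume), ("rgb", rgb), ("stack", stack)]:
    print(f"{name:12s} ndim={t.ndim} shape={t.shape}")
```

Note that `rgb` and `stack` both report `ndim=3`: only the size of the last axis changes.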

In [None]:
from IPython.display import display, HTML

display(HTML("""
<table style="width: 100%; border: none; background: transparent;">
  <tr style="background: transparent;">
    <td style="width: 100%; border: none; padding: 5px; text-align: center;">
      <img src="media/videos/CNN/720p30/TensorExplanationFastSimple_ManimCE_v0.19.0.gif" style="width: 80%; border-radius: 8px; box-shadow: 0 4px 8px rgba(0,0,0,0.1);">
    </td>
  </tr>
</table>
"""))

## II. Why Not Classic MLPs? (Example: MNIST Dataset)

Let's start with a classic example: the **MNIST dataset**.
This is a database of handwritten digits (0 to 9). The images are **Grayscale** and have a size of **$28 \times 28 \times 1$**.

As we discussed, grayscale images are simply matrices of pixels where values range from **0 to 255**.



### The Naive Approach: Multi-Layer Perceptron (MLP)
If we want to build a model to classify these digits, the first method that comes to mind is a standard **Fully Connected Neural Network** (or MLP).

However, an MLP expects a **flat vector** as input, not a matrix.
To use it, we are forced to **flatten** our $28 \times 28$ image into a single vector of **784 pixels**. We then feed this long vector into the MLP to get a prediction.
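
The flattening step can be sketched in a few lines of NumPy. The array below is a stand-in filled with consecutive numbers, not a real MNIST digit, so that positions are easy to follow.

```python
import numpy as np

# Stand-in for one 28 x 28 digit: pixel (r, c) holds the value 28 * r + c.
image = np.arange(28 * 28, dtype=np.float32).reshape(28, 28)
flat = image.reshape(-1)

print(image.shape)  # (28, 28)
print(flat.shape)   # (784,)

# Two pixels that touch vertically in the image land 28 positions apart
# in the flat vector.
row, col = 10, 5
idx_here = row * 28 + col
idx_below = (row + 1) * 28 + col
print(idx_below - idx_here)  # 28
```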


In [None]:
from IPython.display import display, HTML

display(HTML("""
<table style="width: 100%; border: none; background: transparent;">
  <tr style="background: transparent;">
    <td style="width: 100%; border: none; padding: 5px; text-align: center;">
      <img src="media/videos/CNN/720p30/MLP_Style_NoText_Fixed_ManimCE_v0.19.0.gif" style="width: 80%; border-radius: 8px; box-shadow: 0 4px 8px rgba(0,0,0,0.1);">
    </td>
  </tr>
</table>
"""))

### Why is this method not optimal?
While this might work for very simple tasks, it has **three major problems**:

#### 1. Loss of Spatial Information
An image is not just a random list of numbers; it is a **2D structured object**.
* Neighboring pixels are correlated (they work together to form shapes).
* The order and position matter.

**Problem:** By flattening the image into a vector, we **break this structure**. The network reads the image "pixel by pixel" without understanding the geometry, leading to a poor use of visual information.

In [None]:
from IPython.display import display, HTML

display(HTML("""
<table style="width: 100%; border: none; background: transparent;">
  <tr style="background: transparent;">
    <td style="width: 100%; border: none; padding: 5px; text-align: center;">
      <img src="media/videos/CNN/720p30/FlatteningSimple_ManimCE_v0.19.0.gif" style="width: 80%; border-radius: 8px; box-shadow: 0 4px 8px rgba(0,0,0,0.1);">
    </td>
  </tr>
</table>
"""))

#### 2. Giant Number of Parameters
Let's do the math for a simple MLP with **2 hidden layers of 100 neurons each** and **10 output neurons**.

* **Input Size:** 784 (pixels)
* **Hidden Layer 1:** 100 neurons
* **Hidden Layer 2:** 100 neurons
* **Output Layer:** 10 neurons

**Detailed Calculation of Parameters (Weights + Biases):**

* **Layer 1 (Input $\to$ Hidden 1):**
    * Weights: $784 \times 100 = 78,400$
    * Biases: $100$ (1 per neuron)
    
* **Layer 2 (Hidden 1 $\to$ Hidden 2):**
    * Weights: $100 \times 100 = 10,000$
    * Biases: $100$

* **Layer 3 (Hidden 2 $\to$ Output):**
    * Weights: $100 \times 10 = 1,000$
    * Biases: $10$

**Total Sum:**
$$78,400 + 100 + 10,000 + 100 + 1,000 + 10 = \mathbf{89,610}$$

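**Problem:** We have nearly **90,000 parameters** just for a tiny $28 \times 28$ black and white image. On a realistic color image (e.g., $1000 \times 1000 \times 3$, i.e. 3 million input values), the first layer of this same MLP alone would need $3{,}000{,}000 \times 100 = 300$ million weights, and wider hidden layers quickly push the count into the billions, making the model impractical to train.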
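
The count above can be checked with a small helper. `mlp_param_count` is a hypothetical name introduced for this sketch, and the second call assumes the same hidden layers applied to a $1000 \times 1000$ RGB image.

```python
def mlp_param_count(layer_sizes):
    """Count weights + biases of a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix + one bias per neuron
    return total

print(mlp_param_count([784, 100, 100, 10]))              # 89610
print(mlp_param_count([1000 * 1000 * 3, 100, 100, 10]))  # 300011210
```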

#### 3. No Translation Invariance
To a human, if a digit "5" moves a little bit to the left or right, it is clearly still a "5". We recognize the **shape**, no matter where it is.
**Problem:** An MLP looks at each specific pixel position separately.
* If the "5" shifts, different input pixels light up.
* To the MLP, this looks like a completely different input.
* It has to "re-learn" what a "5" looks like for every possible position in the image. It lacks **Translation Invariance**.
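
A toy experiment illustrates the point. The "digit" here is just a small rectangle and the neuron's weights are random, so treat this as a sketch of the effect rather than a real MNIST experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

image = np.zeros((28, 28))
image[10:18, 8:12] = 1.0                     # toy "stroke", stand-in for a digit
shifted = np.roll(image, shift=4, axis=1)    # same stroke moved 4 pixels right

# One dense neuron with fixed (hypothetical) random weights, one per pixel.
w = rng.normal(size=28 * 28)
print(w @ image.reshape(-1), w @ shifted.reshape(-1))  # two unrelated values

# The two flattened inputs do not share a single active pixel.
overlap = np.sum(image.reshape(-1) * shifted.reshape(-1))
print(overlap)  # 0.0
```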

In [None]:

from IPython.display import display, HTML

display(HTML("""
<table style="width: 100%; border: none; background: transparent;">
  <tr style="background: transparent;">
    <td style="width: 100%; border: none; padding: 5px; text-align: center;">
      <img src="media/videos/CNN/720p30/DigitShiftBig_ManimCE_v0.19.0.gif" style="width: 80%; border-radius: 8px; box-shadow: 0 4px 8px rgba(0,0,0,0.1);">
    </td>
  </tr>
</table>
"""))

## III. How Humans Classify Images: Feature Detection

To understand the intuition behind CNNs, let's look at how humans see.

When we look at an image, **we do not scan every single pixel one by one.**
Instead, we process the image globally. We unconsciously look for **Features**:
* **Edges** (contours)
* **Textures**
* **Patterns** (e.g., the shape of an ear, the curve of a digit).

**The Decision Process:**
Identifying these features increases the probability of a specific class. If we see "whiskers" + "pointed ears" + "fur", our brain concludes: *"Yes, this looks like a cat."*

**The Goal of CNNs:**
Convolutional Neural Networks are designed to **mimic this exact behavior**.
The key is the convolution operator, which acts like an **Automatic Feature Detector**.

In [41]:
from IPython.display import display, HTML

display(HTML("""
<table style="width: 100%; border: none; background: transparent;">
  <tr style="background: transparent;">
    
    <td style="width: 50%; border: none; padding: 5px;">
      <img src="assets/STRATEGIE1.gif" style="width: 100%;">
    </td>

    <td style="width: 50%; border: none; padding: 5px;">
      <img src="assets/STRATEGIE2.gif" style="width: 100%;">
    </td>

  </tr>
</table>
"""))



# Chapter 2 : The convolution operator

## I. Defining Image Convolution and Kernels

As we said, the convolution acts as a **Feature Detector**. But how exactly do we compute it?

**The Process:**
1.  **Input:** We take an image (pixel matrix).
2.  **Kernel:** We define a filter of a specific size that we choose.
3.  **Operation:** We slide the kernel over the image, performing a "dot product" (multiplication + sum) at every position.
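
The three steps above can be sketched as a plain NumPy loop. This is a minimal version with stride 1 and no padding, and `conv2d_valid` is a name chosen for this example. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this is a cross-correlation.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and sum the products."""
    n_rows, n_cols = image.shape
    k_rows, k_cols = kernel.shape
    out = np.zeros((n_rows - k_rows + 1, n_cols - k_cols + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + k_rows, j:j + k_cols]   # region under the kernel
            out[i, j] = np.sum(patch * kernel)          # multiplication + sum
    return out

image = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]], dtype=float)
kernel = np.array([[1, 0],
                   [0, -1]], dtype=float)

print(conv2d_valid(image, kernel))
# [[-4. -4.]
#  [-4. -4.]]
```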

Let's work through some examples!



### 1. 1 CHANNEL INPUT WITH 1 FILTER



In [None]:
from IPython.display import display, HTML

display(HTML("""
<table style="width: 100%; border: none; background: transparent;">
  <tr style="background: transparent;">
    
    <td style="width: 50%; border: none; padding: 5px;">
      <img src="media/videos/CNN/720p30/Architecture_1_Grey_1Filter_ManimCE_v0.19.0.gif" style="width: 100%;">
    </td>

    <td style="width: 50%; border: none; padding: 5px;">
      <img src="media/videos/CNN/720p30/ConvExampleExplainedClean_ManimCE_v0.19.0.gif" style="width: 100%;">
    </td>

  </tr>
</table>
"""))



### 2. 1 CHANNEL INPUT WITH 2 FILTERS


In [None]:
from IPython.display import display, HTML

display(HTML("""
<table style="width: 100%; border: none; background: transparent;">
  <tr style="background: transparent;">
    
    <td style="width: 50%; border: none; padding: 5px;">
      <img src="media/videos/CNN/720p30/Architecture_2_Grey_2Filters_ManimCE_v0.19.0.gif" style="width: 100%;">
    </td>

    <td style="width: 50%; border: none; padding: 5px;">
      <img src="media/videos/CNN/720p30/Conv2FiltersExample_ManimCE_v0.19.0.gif" style="width: 100%;">
    </td>

  </tr>
</table>
"""))



### 3. 3 CHANNEL INPUT WITH 1 FILTER


In [None]:
from IPython.display import display, HTML

display(HTML("""
<table style="width: 100%; border: none; background: transparent;">
  <tr style="background: transparent;">
    
    <td style="width: 50%; border: none; padding: 5px;">
      <img src="media/videos/CNN/720p30/Architecture_3_RGB_1Filter_ManimCE_v0.19.0.gif" style="width: 100%;">
    </td>

    <td style="width: 50%; border: none; padding: 5px;">
      <img src="media/videos/CNN/720p30/RGBConvStepByStepCentered_ManimCE_v0.19.0.gif" style="width: 100%;">
    </td>

  </tr>
</table>
"""))



### 4. 3 CHANNEL INPUT WITH 5 FILTERS

In [None]:

from IPython.display import display, HTML

display(HTML("""
<table style="width: 100%; border: none; background: transparent;">
  <tr style="background: transparent;">
    <td style="width: 100%; border: none; padding: 5px; text-align: center;">
      <img src="media/videos/CNN/720p30/FullCNNSequence_ManimCE_v0.19.0.gif" style="width: 80%; border-radius: 8px; box-shadow: 0 4px 8px rgba(0,0,0,0.1);">
    </td>
  </tr>
</table>
"""))

## II. The dimensions of a convolution output



$$
(N \times N \times C) \otimes (F \times F \times C \times M) \rightarrow (N - F + 1) \times (N - F + 1) \times M
$$

Where:
* **N**: Size of the input
* **F**: Size of the Kernel
* **C**: Channels
* **M**: Number of filters

> **⚠️ Warning:**
> The input and the kernels **must have the same number of channels** ($C$).
> You cannot convolve a $3 \times 3 \times 1$ input with a $2 \times 2 \times 3$ kernel.
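
The shape rule and the channel constraint can be verified with a small sketch. `conv_layer` is a hypothetical helper written for this notebook (stride 1, no padding, square input), not a library function.

```python
import numpy as np

def conv_layer(x, kernels):
    """x: (N, N, C), kernels: (F, F, C, M) -> output (N-F+1, N-F+1, M)."""
    n, _, c = x.shape
    f, _, c_k, m = kernels.shape
    if c != c_k:
        raise ValueError(f"channel mismatch: input has {c}, kernels have {c_k}")
    out_size = n - f + 1
    out = np.zeros((out_size, out_size, m))
    for i in range(out_size):
        for j in range(out_size):
            patch = x[i:i + f, j:j + f, :]                    # shape (F, F, C)
            # Multiply the patch with every filter and sum over F, F and C.
            out[i, j, :] = np.tensordot(patch, kernels, axes=3)
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 6, 3))            # N = 6, C = 3
kernels = rng.normal(size=(3, 3, 3, 5))   # F = 3, C = 3, M = 5
print(conv_layer(x, kernels).shape)       # (4, 4, 5)
```

Calling `conv_layer` with a $3 \times 3 \times 1$ input and $2 \times 2 \times 3$ kernels raises the `ValueError`, which is the situation the warning describes.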

In [None]:
from IPython.display import display, HTML

display(HTML("""
<div style="text-align: center;">
    <img src="media/videos/CNN/720p30/ConvDimensionsSimpleCentered_ManimCE_v0.19.0.gif" 
         style="width: 80%; border-radius: 6px; box-shadow: 0 4px 10px rgba(0,0,0,0.15);">
</div>
"""))

## III. Hyperparameters of Convolution

When designing a CNN, we don't just "apply convolution." We have to tune specific knobs—called **Hyperparameters**—to control how the network processes the image.




### 1. Padding
**The Problem:** Without padding, the pixels on the borders are "seen" less often by the filters because we cannot center the kernel on them. (Try a simple example yourself: you will notice that edge pixels are involved in far fewer calculations than the central pixels).

**The Solution:** Padding consists of adding a border of pixels (usually **Zeros**) around the input image. It allows us to process the edges almost as effectively as the center.
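
Here is a small sketch of zero padding with `np.pad`, together with a hypothetical helper `coverage` that counts how many kernel positions touch each original pixel. The sizes (a $5 \times 5$ image and a $3 \times 3$ kernel) are chosen only for illustration.

```python
import numpy as np

image = np.arange(1, 10).reshape(3, 3)
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)
print(padded.shape)  # (5, 5)
print(padded)

def coverage(n, f, p):
    """Count how many kernel positions (stride 1) touch each original pixel."""
    size = n + 2 * p
    counts = np.zeros((size, size), dtype=int)
    for i in range(size - f + 1):
        for j in range(size - f + 1):
            counts[i:i + f, j:j + f] += 1
    return counts[p:size - p, p:size - p]   # keep only the original pixels

print(coverage(n=5, f=3, p=0))  # corners are seen once, the centre 9 times
print(coverage(n=5, f=3, p=1))  # corners are now seen 4 times
```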


In [None]:
from IPython.display import display, HTML

display(HTML("""
<div style="text-align: center;">
    <img src="media/videos/CNN/720p30/PaddingOneByOneEng_ManimCE_v0.19.0.gif" 
         style="width: 80%; border-radius: 6px; box-shadow: 0 4px 10px rgba(0,0,0,0.15);">
</div>
"""))

### 2. Stride
**The Concept:** The Stride is the "step size" of the convolution.
* **Stride = 1:** We shift the filter **1 pixel** at a time. This is the standard detailed scan.
* **Stride = 2:** We shift the filter **2 pixels** at a time (we skip one pixel).

**Impact:** A larger stride **reduces the output size** and speeds up computation because we perform fewer operations.
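
The effect of the stride on the output size can be seen with a short sketch. `conv2d_strided` is a name chosen for this example, and it assumes a square image, a square kernel, and no padding.

```python
import numpy as np

def conv2d_strided(image, kernel, stride=1):
    n, f = image.shape[0], kernel.shape[0]
    out_size = (n - f) // stride + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            r, c = i * stride, j * stride          # top-left corner of the window
            out[i, j] = np.sum(image[r:r + f, c:c + f] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((2, 2))

print(conv2d_strided(image, kernel, stride=1).shape)  # (3, 3)
print(conv2d_strided(image, kernel, stride=2).shape)  # (2, 2)
```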



In [None]:
from IPython.display import display, HTML

display(HTML("""
<div style="text-align: center;">
    <img src="media/videos/CNN/720p30/Stride4x4Comparison_ManimCE_v0.19.0.gif" 
         style="width: 80%; border-radius: 6px; box-shadow: 0 4px 10px rgba(0,0,0,0.15);">
</div>
"""))

### 3. Kernel Size & Number of Filters
Finally, we must choose the properties of the filters themselves:

* **Kernel Size (e.g., $3 \times 3$ or $5 \times 5$):**
    * This controls the **Receptive Field** (the local area) the network looks at.
    * Small kernels look for fine details. Large kernels look for broader patterns.

* **Number of Filters (Depth):**
    * This controls **how many features** we want to learn in this layer.
    * **More filters** = The network can learn more diverse patterns (edges, textures, colors).
    * **Trade-off:** More filters mean more parameters to learn and slower training.

In [None]:
from IPython.display import display, HTML

display(HTML("""
<div style="text-align: center;">
    <img src="media/videos/CNN/720p30/KernelSizeAndDepthClean_ManimCE_v0.19.0.gif" 
         style="width: 80%; border-radius: 6px; box-shadow: 0 4px 10px rgba(0,0,0,0.15);">
</div>
"""))

## IV. The FINAL Output Dimension Formula including padding and stride !!

* **$N$**: Input dimension (Height = Width).
* **$F$**: Filter dimension (Height = Width).
* **$P$**: Padding.
* **$S$**: Stride.

The formula to calculate the output size of the feature map is:

$$
\text{Output Size} = \left\lfloor \frac{N - F + 2P}{S} \right\rfloor + 1
$$

> **Example:**
> * Input Image ($N$): $28$ (for a $28 \times 28$ image)
> * Filter ($F$): $3$ (for a $3 \times 3$ kernel)
> * Padding ($P$): $1$
> * Stride ($S$): $1$
>
> $$\text{Size} = \frac{28 - 3 + (2 \times 1)}{1} + 1 = \frac{27}{1} + 1 = \mathbf{28}$$
> *Result:* We kept the same size ($28 \times 28$) thanks to the padding!
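
The formula translates directly into one line of Python. `conv_output_size` is a hypothetical helper name, and the extra calls below are additional illustrative cases beyond the worked example.

```python
def conv_output_size(n, f, p=0, s=1):
    """floor((N - F + 2P) / S) + 1 for a square input and a square kernel."""
    return (n - f + 2 * p) // s + 1

print(conv_output_size(28, 3, p=1, s=1))  # 28 (the example above)
print(conv_output_size(28, 3, p=0, s=1))  # 26 (no padding: N - F + 1)
print(conv_output_size(28, 3, p=1, s=2))  # 14 (stride 2 halves the size)
print(conv_output_size(7, 3, p=0, s=2))   # 3
```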



## V. Feature detection examples
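
As a concrete instance of feature detection, here is a minimal sketch with a hand-picked vertical edge kernel (a Prewitt-style filter) applied to a toy image. Both the image and the kernel are assumptions made for this example and are not taken from the animations.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Toy 6 x 6 image: dark left half (0), bright right half (1).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Vertical edge kernel: responds where brightness increases from left to right.
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

windows = sliding_window_view(image, kernel.shape)     # shape (4, 4, 3, 3)
response = np.einsum("ijkl,kl->ij", windows, kernel)   # shape (4, 4)
print(response)
# every row is [0. 3. 3. 0.]: the filter fires only around the edge
```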

In [None]:
from IPython.display import display, HTML

display(HTML("""
<table style="width: 100%; border: none; background: transparent;">
  <tr style="background: transparent;">
    
    <td style="width: 50%; border: none; padding: 5px;">
      <img src="media/videos/CNN/720p30/FeatureDetectionGray_ManimCE_v0.19.0.gif" style="width: 100%;">
    </td>

    <td style="width: 50%; border: none; padding: 5px;">
      <img src="media/videos/CNN/720p30/FeatureDetectionRGB_ManimCE_v0.19.0.gif" style="width: 100%;">
    </td>

  </tr>
</table>
"""))



# Chapter 3 : Building the Convolutional Neural Network

# Introduction

Now that we understand the convolution operator, we can move on to how it helps us build **image classification models**.

## The guiding principle for CNNs is as follows: 

1. **Convolution part** : We take our input image and perform a convolution with a set of filters. Then, we perform convolutions again on these outputs with new filters, and repeat the process... (the number of times we repeat this is a hyperparameter that we choose ourselves).

2. **MLP part** : Once we have performed as many compositions of convolutions as desired, we take the final outputs, apply a flattening operation, and feed the result into an MLP to perform the classification. 

This is the core concept to keep in mind.


## Wait, didn't we say flattening was bad?

You might be looking at this architecture and thinking:
> *"In the beginning, we said flattening the image destroys spatial structure, requires too many parameters, and lacks translation invariance. Yet, here we are flattening the output at the end. Is this a contradiction?"*

The short answer is **no**. The key difference is **WHEN** we flatten.
In the naive approach, we flattened **raw pixels**. Here, we flatten **extracted features**.

Let's revisit the three major problems to see how this architecture solves them:

#### 1. Loss of Spatial Information
* **The Old Problem:** Flattening the raw image immediately destroyed the 2D grid. The model lost the concept of "neighbors" (pixel $0,0$ and $0,1$) before it could even process shapes.
* **The Solution:** Here, we do **not** flatten immediately. We perform the convolution steps **first**. These steps respect and utilize the 2D structure to detect edges, curves, and patterns. By the time we flatten at the very end, we are no longer flattening "pixels"; we are flattening a list of **high-level concepts** (e.g., "contains a circle," "has vertical lines"). The spatial information has already been processed and encoded.

#### 2. Giant Number of Parameters
* **The Old Problem:** Connecting every single pixel of a large image (e.g., $1000 \times 1000$) to a hidden layer created millions of wasteful connections.
* **The Solution:** As we apply successive convolutions, the spatial dimensions of the image ($Height \times Width$) typically **shrink**. Instead of flattening a massive $1000 \times 1000$ image, we might end up flattening a small $7 \times 7$ feature map.
    * Even with multiple filters, the total input size entering the MLP is significantly smaller.
    * This drastic reduction in size means far fewer parameters are needed in the final layers.
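
A quick back-of-the-envelope comparison shows the size of the gap. The 64 feature maps and the 100 hidden units are assumptions chosen for illustration (100 matches the MLP example from Chapter 1), and the raw image is assumed to be RGB.

```python
hidden_units = 100                    # same hidden width as the earlier MLP example

raw_inputs = 1000 * 1000 * 3          # flattening a raw RGB image
feature_inputs = 7 * 7 * 64           # flattening 64 feature maps of size 7 x 7

print(raw_inputs * hidden_units)      # 300000000 weights in the first dense layer
print(feature_inputs * hidden_units)  # 313600 weights in the first dense layer
```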

#### 3. No Translation Invariance
* **The Old Problem:** In a standard MLP, if a shape moved from the left to the right, it was seen as a completely new input.
* **The Solution:** The convolution operation handles the location. Because we slide the **same filter** across the whole image, a specific feature (like a curve) will be detected regardless of where it is.
    * The convolutional layers do the hard work of *locating* the features.
    * The final MLP doesn't need to search for the object; it just receives a signal from the feature maps saying *"I found this pattern"* (regardless of where it was originally located).

# I. From Manual Filters to Learned Features

### The Hypothesis
We might ask:
> *"Why don't we just manually choose the best kernels (like standard edge detectors), apply them to the image to extract features, and then feed the result into a classic MLP for classification?"*

### The Problem
This approach is flawed for three main reasons:
1.  **Missing Hidden Features:** We don't always know intuitively which patterns are the most "discriminant" (useful) to distinguish between classes.
2.  **Unknown Combinations:** We don't know the optimal mix of kernels to use.
3.  **Human Limitation:** By choosing manually, we limit the network to features *we* have imagined.

### The Solution: End-to-End Learning
The core idea of CNNs is to **let the network learn the values inside the kernels itself.**

* **Learnable Parameters:** The coefficients inside the filters are not fixed constants. They are **parameters** (weights) of the network.
* **Backpropagation:** Just like in an MLP, these weights are updated during training. The network figures out *on its own* which kernels are best to minimize the error.

### The Result: A Giant Pipeline / Neural Network
We obtain a single, unified neural network where:
1.  **Convolution Layers (The "Eyes"):** Learn to be the best possible **Feature Extractors**. This part of the neural network **will not be fully connected** (we will see why later).


2.  **Fully Connected Layers (The "Brain"):** Take these features and handle the **Classification Decision**.

# II. How can we see Convolution as a Neural Network Layer

To fully understand why a Convolutional Layer can be considered a specific type of Neural Network layer, we must examine the mathematical operations in detail. We will deconstruct a minimal example to draw the parallel with standard dense layers.

Let us consider a simplified case:
* **Input ($X$):** A $3 \times 3$ image (or feature map).
* **Filter ($W$):** A $2 \times 2$ kernel.
* **Bias ($b$):** A scalar value associated with this specific filter.


### 1. Definitions and Matrices

First, let us explicitly define our matrices.

**The Input Matrix ($X$):**
This represents the pixel values of our image.
$$
X = 
\begin{bmatrix} 
x_{11} & x_{12} & x_{13} \\
x_{21} & x_{22} & x_{23} \\
x_{31} & x_{32} & x_{33} 
\end{bmatrix}
$$

**The Kernel Matrix ($W$):**
These are the **weights** of the network that will be learned via backpropagation.
$$
W = 
\begin{bmatrix} 
w_1 & w_2 \\
w_3 & w_4 
\end{bmatrix}
$$

**The Bias ($b$):**
⚠️ **Crucial Point:** Just like a standard neuron ($y = wx + b$), a convolution filter normally includes a **Bias**.
* Even though the output is a $2 \times 2$ matrix ($4$ output values), there is only **ONE single bias value** ($b$) shared across the entire feature map.

**The Output Matrix ($Z$):**
Based on the dimensions, a $2 \times 2$ kernel sliding over a $3 \times 3$ input produces a $2 \times 2$ output.
$$
Z = 
\begin{bmatrix} 
z_{1} & z_{2} \\
z_{3} & z_{4} 
\end{bmatrix}
$$



### 2. Step-by-Step Calculation (The Linear Part)

Let us compute the values of the output matrix $Z$. The operation consists of a dot product, to which we add the shared bias.

#### Step A: Calculating $z_1$ (Top-Left)
We position the kernel over the top-left region of the input (covering $x_{11}, x_{12}, x_{21}, x_{22}$).

The calculation is:
$$z_1 = (x_{11} \cdot w_1) + (x_{12} \cdot w_2) + (x_{21} \cdot w_3) + (x_{22} \cdot w_4) + \mathbf{b}$$

**Neural Network Interpretation:**
Observe this equation. It is the standard formula for a neuron: $\sum (x_i \cdot w_i) + b$.
* The neuron $z_1$ is connected to **only 4 specific inputs** ($x_{11}, x_{12}, x_{21}, x_{22}$).
* In a standard Fully Connected layer, $z_1$ would be connected to all 9 inputs.
* This property is called **Sparse Connectivity** (look at the animation at the very end of II. to convince yourself).

#### Step B: Calculating $z_2$ (Top-Right)
We slide the kernel one pixel to the right (Stride = 1). It now covers $x_{12}, x_{13}, x_{22}, x_{23}$.

The calculation is:
$$z_2 = (x_{12} \cdot \mathbf{w_1}) + (x_{13} \cdot \mathbf{w_2}) + (x_{22} \cdot \mathbf{w_3}) + (x_{23} \cdot \mathbf{w_4}) + \mathbf{b}$$

**Neural Network Interpretation:**
Notice that we use **exactly the same weights** ($w_1, w_2, w_3, w_4$) and the **same bias** ($b$) as we did for $z_1$.
* The neuron $z_2$ looks at a different local region.
* However, it shares the same parameters as $z_1$.
* This property is called **Parameter Sharing**.

#### Step C: Completing the Matrix
We continue sliding the window to compute the remaining outputs:

* **Bottom-Left ($z_3$):**
    $$z_3 = (x_{21} \cdot w_1) + (x_{22} \cdot w_2) + (x_{31} \cdot w_3) + (x_{32} \cdot w_4) + \mathbf{b}$$

* **Bottom-Right ($z_4$):**
    $$z_4 = (x_{22} \cdot w_1) + (x_{23} \cdot w_2) + (x_{32} \cdot w_3) + (x_{33} \cdot w_4) + \mathbf{b}$$
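
The four equations can also be written as one matrix product on the flattened input, which makes both properties visible at once. The numeric values of the weights, bias, and input below are arbitrary choices for this sketch.

```python
import numpy as np

x = np.arange(1, 10, dtype=float).reshape(3, 3)   # x11..x33 = 1..9
w1, w2, w3, w4 = 0.5, -1.0, 2.0, 0.25             # arbitrary example weights
b = 0.1                                           # the single shared bias

# Direct evaluation of the four equations z1..z4.
z_direct = np.array([
    x[0, 0]*w1 + x[0, 1]*w2 + x[1, 0]*w3 + x[1, 1]*w4 + b,
    x[0, 1]*w1 + x[0, 2]*w2 + x[1, 1]*w3 + x[1, 2]*w4 + b,
    x[1, 0]*w1 + x[1, 1]*w2 + x[2, 0]*w3 + x[2, 1]*w4 + b,
    x[1, 1]*w1 + x[1, 2]*w2 + x[2, 1]*w3 + x[2, 2]*w4 + b,
])

# The same layer as a 4 x 9 weight matrix acting on the flattened input.
# Zeros = missing connections (sparse connectivity);
# the same w1..w4 in every row = parameter sharing.
M = np.array([
    [w1, w2, 0,  w3, w4, 0,  0,  0,  0],
    [0,  w1, w2, 0,  w3, w4, 0,  0,  0],
    [0,  0,  0,  w1, w2, 0,  w3, w4, 0],
    [0,  0,  0,  0,  w1, w2, 0,  w3, w4],
])
z_matrix = M @ x.reshape(-1) + b

print(z_direct.reshape(2, 2))
print(np.allclose(z_direct, z_matrix))  # True
```

A fully connected layer with the same input and output sizes would have 36 free weights and 4 biases; here only 4 weights and 1 bias are learned.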

---

In [None]:
from IPython.display import display, HTML

display(HTML("""
<div style="text-align: center;">
    <img src="media/videos/CNN/720p30/ConvView_Final_Title_SharedBias_ManimCE_v0.19.0.gif" 
         style="width: 80%; border-radius: 6px; box-shadow: 0 4px 10px rgba(0,0,0,0.15);">
</div>
"""))


### 3. The Activation Function (Non-Linearity)

The calculations above yield a linear output map $Z$. To create a neural network capable of learning complex patterns, we must introduce non-linearity.

Just like in a standard MLP, we apply an **Activation Function** (typically ReLU) element-wise to the matrix $Z$.

Let $A$ be the final Activation Map:
$$
A = \text{ReLU}(Z) = 
\begin{bmatrix} 
\max(0, z_1) & \max(0, z_2) \\
\max(0, z_3) & \max(0, z_4) 
\end{bmatrix}
$$
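
In NumPy, the element-wise ReLU is a single call. The values of `Z` below are made up for the example.

```python
import numpy as np

Z = np.array([[ 2.0, -1.5],
              [-0.3,  4.0]])
A = np.maximum(0, Z)   # element-wise ReLU: negative entries become 0
print(A)
# [[2. 0.]
#  [0. 4.]]
```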

In [None]:
from IPython.display import display, HTML

display(HTML("""
<div style="text-align: center;">
    <img src="media/videos/CNN/720p30/ActivationFunctionStep_ManimCE_v0.19.0.gif" 
         style="width: 80%; border-radius: 6px; box-shadow: 0 4px 10px rgba(0,0,0,0.15);">
</div>
"""))

### 4. Summary: Convolution vs. Fully Connected

We have demonstrated that a Convolution is simply a Neural Network layer with two specific structural constraints:

1.  **Local/Sparse Connectivity:** Each output neuron is connected only to a small, local subset of the input pixels. It does not "see" the whole image at once.
2.  **Weight & Bias/Parameters Sharing:** All neurons in a specific feature map share the exact same weights (filters) and the same bias. This forces the network to search for the same feature everywhere in the image.

In [1]:
from IPython.display import display, HTML

display(HTML("""
<div style="text-align: center;">
    <img src="assets/output.gif" 
         style="width: 80%; border-radius: 6px; box-shadow: 0 4px 10px rgba(0,0,0,0.15);">
</div>
"""))

### Conclusion

We have demonstrated that a **Convolution** combined with a **Bias** and an **Activation** ($\sigma$) is functionally identical to a standard neural network layer: $$y = \sigma(W * x + b)$$

---

> <span style="color:red">**⚠️ CRITICAL WARNING: Bias is per Filter**</span>
>
> Bias is a single scalar value **per filter**. It is NOT per pixel and NOT per input channel. This single value is **broadcast** (added) to the entire 2D output map produced by that filter.

---

### Dimension Examples

We use the format: $F \times F \times Channels_{In} \times N_{Filters}$ (where F is filter size).

#### Example 1: 1-Channel Input (Grayscale)
*Input:* $28 \times 28 \times \mathbf{1}$ image.
*Layer:* 32 filters of size $3 \times 3$.

* **Kernel Shape:** $3 \times 3 \times \mathbf{1} \times 32$
* **Bias:** 32 parameters (1 per output map).

[Image showing a 1-channel convolution process: a single 2D input matrix convolved with a 2D kernel, producing one 2D output map to which one bias value is added.]

#### Example 2: 3-Channel Input (RGB)
*Input:* $32 \times 32 \times \mathbf{3}$ image.
*Layer:* 64 filters of size $5 \times 5$.

* **Kernel Shape:** $5 \times 5 \times \mathbf{3} \times 64$
* **Bias:** 64 parameters.
* *Note:* The kernel depth (3) MUST match the input depth. These 3 channels are summed into a single 2D map *before* the single bias is added.
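
The parameter counts in both examples follow directly from the kernel shape. A small sketch (the function name `conv_layer_params` is ours) that reproduces them:

```python
def conv_layer_params(filter_size, channels_in, n_filters):
    """Return (weights, biases) for a layer of shape F x F x C_in x N_filters."""
    weights = filter_size * filter_size * channels_in * n_filters
    biases = n_filters  # one scalar bias per filter
    return weights, biases


example_1 = conv_layer_params(filter_size=3, channels_in=1, n_filters=32)
example_2 = conv_layer_params(filter_size=5, channels_in=3, n_filters=64)

print(example_1, sum(example_1))  # (288, 32) 320
print(example_2, sum(example_2))  # (4800, 64) 4864
```

Note that the image size ($28 \times 28$ or $32 \times 32$) does not appear anywhere in the count: this is a direct consequence of weight sharing.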

In [None]:
from IPython.display import display, HTML

display(HTML("""
<div style="text-align: center;">
    <img src="media/videos/CNN/720p30/DimensionExamplesSideBySide_ManimCE_v0.19.0.gif" 
         style="width: 80%; border-radius: 6px; box-shadow: 0 4px 10px rgba(0,0,0,0.15);">
</div>
"""))

# III. The Convolutional Neural Network Pipeline !!

## 1. Pooling

The pipeline includes a step called **Pooling**, which is often added after a convolution.

We present this step before moving on to the full CNN pipeline.

Its goal is to reduce the spatial dimensions of the image (Height and Width), which drastically reduces the computation cost and the number of parameters in the later fully connected layers, while preserving the most important information.

### A. How it Works
We define a window of size **$K \times K$** (just like a kernel) and a **Stride**.

* **General Rule:** The window slides over the image. For each window, we compress the pixels into a single value (we calculate either the **max** or the **average** of those values).
* **Standard Practice:** In the vast majority of cases, we use a window of **$2 \times 2$** with a **Stride of 2**. This divides the height and width of the image by 2 at each step, so only a quarter of the values remain.


### B. Max Pooling vs. Average Pooling
There are two main ways to compress the data:
1.  **Average Pooling:** Calculates the average value of the pixels in the window.
2.  **Max Pooling:** Selects only the **maximum value** in the window.
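
Both variants can be sketched in a few lines of pure Python. The function `pool2d` and the $4 \times 4$ feature map below are illustrative choices of ours, not taken from a library:

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Slide a size x size window over the map and compress each window to one value."""
    H, W = len(feature_map), len(feature_map[0])
    out = []
    for i in range(0, H - size + 1, stride):
        row = []
        for j in range(0, W - size + 1, stride):
            window = [feature_map[i + a][j + b]
                      for a in range(size) for b in range(size)]
            if mode == "max":
                row.append(max(window))
            else:  # any other mode is treated as average pooling
                row.append(sum(window) / len(window))
        out.append(row)
    return out


fmap = [
    [1, 3, 2, 0],
    [5, 7, 1, 1],
    [0, 2, 9, 4],
    [4, 2, 3, 8],
]

max_pooled = pool2d(fmap, mode="max")
avg_pooled = pool2d(fmap, mode="avg")
print(max_pooled)  # [[7, 2], [4, 9]]
print(avg_pooled)  # [[4.0, 1.0], [2.0, 6.0]]
```

With a $2 \times 2$ window and a stride of 2, the $4 \times 4$ map becomes $2 \times 2$ in both cases.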


In [None]:
from IPython.display import display, HTML

display(HTML("""
<div style="text-align: center;">
    <img src="media/videos/CNN/720p30/TinyMaxPoolingFixed_ManimCE_v0.19.0.gif" 
         style="width: 80%; border-radius: 6px; box-shadow: 0 4px 10px rgba(0,0,0,0.15);">
</div>
"""))


### C. Why is Max Pooling Better for Classification?
Let's take a concrete example with a $2 \times 2$ window to understand why Max Pooling is the standard.
Imagine a region of the image after a convolution (a feature map):

$$
\text{Window} = \begin{bmatrix} \mathbf{164} & 0 \\ 1 & 0 \end{bmatrix}
$$

* **Context:**
    * **164:** A very strong activation (the filter detected a sharp feature, like an edge or a texture).
    * **0 and 1:** Weak or absent responses. In a feature map, values near 0 mean the filter found little or nothing at that location, so they carry almost no information about the feature.

**The Comparison:**

* **Average Pooling:**
    $$\frac{164 + 0 + 1 + 0}{4} \approx \mathbf{41}$$
    * *Result:* The strong signal (164) is **diluted** by the zeros. The information becomes blurry and less distinct.

* **Max Pooling:**
    $$\max(164, 0, 1, 0) = \mathbf{164}$$
    * *Result:* We keep the **164**. The pooling ignores the noise (the zeros) and preserves the most important feature.

**Conclusion:** For image classification, Max Pooling acts as a "Feature Selector," ensuring the strongest patterns survive, while Average Pooling tends to wash them out.

### Pooling as a Layer of a Neural Network

We can consider the Pooling step as a genuine layer within the Neural Network, just like the convolution.

* **Activation Function:** It applies a specific non-linear function over its inputs:
  $$\sigma(x_1, ..., x_n) = \max(x_1, ..., x_n)$$

* **No Weights, No Bias:** Unlike Convolutional or Dense layers, a Pooling layer has **no weights ($W$) and no bias ($b$)**.
  * *Number of Trainable Parameters = 0.*

* **Role:** Consequently, this layer does not "learn" features itself. It strictly applies a fixed mathematical operation to reduce dimensionality.

## 2.CNN's Pipeline !!!

In [None]:
from IPython.display import display, HTML

display(HTML("""
<table style="width: 100%; border: none; background: transparent;">
  <tr style="background: transparent;">
    
    <td style="width: 50%; border: none; padding: 5px;">
      <img src="media/videos/CNN/720p30/TinyCNN_ImageInput_ManimCE_v0.19.0.gif" style="width: 100%;">
    </td>

    <td style="width: 50%; border: none; padding: 5px;">
      <img src="media/videos/CNN/720p30/NeuralNetworkAnimation_ManimCE_v0.19.0.gif" style="width: 100%;">
    </td>

  </tr>
</table>
"""))




We can view a CNN in two ways: functionally as a **pipeline of blocks**, or structurally as **one giant Neural Network** (see the two animations above).

### A. The Pipeline (Standard Architecture)
Most classical CNNs (like VGG or LeNet) follow this specific recipe:

$$
\text{Input} \xrightarrow{} \underbrace{[ \text{Conv} \xrightarrow{\text{ReLU}} \text{Pool} ] \times N}_{\text{Feature Extraction}} \xrightarrow{\text{Flatten}} \underbrace{\text{Dense} \xrightarrow{\text{ReLU}} \dots \xrightarrow{\text{Softmax}}}_{\text{Classification (MLP)}}
$$

> **⚠️ Critical Insight: It is ONE Network**
> Do not be fooled by the blocks. The Convolutional part **IS** a neural network, but with **sparse connections** (local receptive fields) and **shared weights** (kernels). The whole system is trained end-to-end via Backpropagation.

---

### B. Key Design Choices

#### 1. Activation Function: Why ReLU?
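We use **ReLU** ($f(x) = \max(0, x)$) almost everywhere instead of Sigmoid/Tanh.
* **Less Vanishing Gradient:** The derivative is exactly 1 for positive inputs (and 0 for negative ones), so gradients do not shrink as they pass through active neurons, which speeds up training.
* **Efficiency:** It is very cheap to compute (a single comparison with zero).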

#### 2. Choosing Dimensions (Heuristics)
* **Spatial Size ($\downarrow$):** We **reduce** height/width at each step (via Pooling) to compress information.
* **Depth / Filters ($\uparrow$):** We **increase** the number of filters deeper in the network (e.g., $32 \to 64 \to 128$).
    * *Reason:* Early layers detect simple edges (few filters needed). Deep layers detect complex combinations (many filters needed).
* **Kernel Size:** **$3 \times 3$** is the industry standard. It is the smallest symmetrical filter that captures the notion of "center" and "neighbors" efficiently.
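
These heuristics can be checked by tracing tensor shapes through a stack of blocks. The sketch below uses the standard output-size formula $\lfloor (n + 2p - k) / s \rfloor + 1$. The function names, the padding of 1 (so that a $3 \times 3$ convolution keeps the spatial size) and the $64 \times 64$ RGB input are assumptions made for the example.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Output length along one spatial axis for a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(height, width, channels, filters_per_block, kernel=3, padding=1):
    """Follow (channels, height, width) through [Conv -> ReLU -> 2x2 MaxPool] blocks."""
    shapes = [(channels, height, width)]
    for n_filters in filters_per_block:
        # Convolution with stride 1: depth becomes the number of filters
        height = conv_output_size(height, kernel, 1, padding)
        width = conv_output_size(width, kernel, 1, padding)
        # 2x2 pooling with stride 2: halves height and width
        height = conv_output_size(height, 2, 2, 0)
        width = conv_output_size(width, 2, 2, 0)
        shapes.append((n_filters, height, width))
    return shapes


shapes = trace_shapes(64, 64, 3, filters_per_block=[32, 64, 128])
for shape in shapes:
    print(shape)
# (3, 64, 64)
# (32, 32, 32)
# (64, 16, 16)
# (128, 8, 8)

c, h, w = shapes[-1]
flat_features = c * h * w
print(flat_features)  # 8192
```

The spatial size shrinks while the depth grows, and the last shape gives the number of inputs the MLP receives after flattening.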

## 3. Designing the CNN's Pipeline: The Hyperparameters

We can divide the hyperparameters into three main categories, plus a few crucial extras:

### 3.1 Convolution Layers

**Step 1: Define Architecture Depth**
* **Number of Blocks:** How many (Conv + Pooling) blocks to stack?

**Step 2: For EACH Block, choose:**
* **Number of Filters:** (e.g., 32, 64, 128...)
* **Filter Size (Kernel):** (e.g., $3 \times 3$)
* **Padding:** (e.g., none or "same")
* **Stride:** (e.g., 1 or 2)
* **Activation Function:** (e.g., ReLU, Sigmoid)
* **Pooling Type:** (Max Pooling vs Average Pooling)

### 3.2 MLP's Design (Classifier)
* **Number of Layers**
* **Neurons per Layer**
* **Activation Function:** 

### 3.3 Training (Backward)
* **Optimizer:** (e.g., Adam, SGD)
* **Learning Rate**
* **Batch Size**
* **Number of Epochs**




### 3.4 Crucial Extras: Stabilization & Regularization
To make the model robust, we add two specific layers.

#### 3.4.1 Batch Normalization (Batch Norm)
* **Where to put it?**
    * **In the Conv Block:** Place it **between** the Convolution and the Activation.
        * *Order:* `Conv2d` $\to$ `BatchNorm2d` $\to$ `ReLU`.
    * **In the MLP Block:** Place it **between** the Linear layer and the Activation for all **hidden layers**.
        * *Order:* `Linear` $\to$ `BatchNorm1d` $\to$ `ReLU`.

* **⚠️ Crucial Exception:**
    * **The Output Layer:** Do **NOT** batch normalize the final layer before the Softmax.
    * *Reason:* We need the raw scores (logits, $xW+b$) to preserve the relative scale between classes so Softmax can compute correct probabilities.

* **The Math:** It normalizes the **raw output** $z = xW + b$ (coming out of the Convolution or Linear layer) using the **mean ($\mu$) and variance ($\sigma^2$) of the current mini-batch**.
    $$\hat{z} = \frac{z - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta$$
    * *$\mu$ and $\sigma^2$:* Calculated strictly on the **current batch** (not the whole dataset).
    * *$\gamma$ and $\beta$:* Learnable parameters that allow the network to adjust the scale and shift.
    * *$\epsilon$:* A tiny number added to ensure we **never divide by 0**.
* **Why?** It stabilizes the learning process (keeps values centered), allowing the use of higher learning rates and faster convergence.
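
As a minimal sketch of the formula above, here is the training-mode computation for a single feature observed across a mini-batch of 4 samples. The function name `batch_norm` and the toy values are ours. In a real layer the same computation is done independently for every feature (or every channel in the 2D case).

```python
def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over the batch, then apply the learnable scale and shift."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n  # variance of the current batch
    return [gamma * (v - mu) / (var + eps) ** 0.5 + beta for v in values]


z = [2.0, 4.0, 6.0, 8.0]  # raw outputs of one neuron for 4 samples
z_hat = batch_norm(z)
print([round(v, 2) for v in z_hat])  # [-1.34, -0.45, 0.45, 1.34]
```

With $\gamma = 1$ and $\beta = 0$ the output has mean 0 and variance close to 1. During training the network is free to move $\gamma$ and $\beta$ away from these defaults.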

If you want to know more about Batch Normalization, have a look here:
https://www.youtube.com/watch?v=nT9nKBCjS_Y

#### 3.4.2 Dropout

**Dropout** is also a regularization technique.
We randomly "turn off" some neurons in our Neural Network during training in order to make better predictions (yes, it does seem a little counterintuitive at first).

**1. Biological Intuition (The Origin Story)**

The concept of Dropout is often explained through an analogy with how the human brain functions:
* **Energy Efficiency:** The brain consumes a significant amount of energy. Biologically, it is not sustainable for all neurons to be turned on simultaneously for every stimulus.
* **Sparsity (Specialization):** The brain prefers **sparse representations**. To recognize a "cat," only a small fraction of highly specialized neurons need to activate.
* **Robustness & The Parsimony Principle:**
    * *Biological Fact:* Brain neurons naturally die or misfire, yet our cognitive abilities remain stable. We don't lose the ability to recognize a cat just because one neuron failed.
    * *Occam's Razor:* This forces the brain to adhere to the **Principle of Parsimony (Occam's Razor)**: "The simplest explanation is usually the best."
    * *The Bridge to Overfitting:* By applying Dropout, we mimic this biological constraint. We force the network to make accurate predictions using only a random fraction of its neurons at any given time.

**2. The Context: Fighting Overfitting**

A fundamental number in Deep Learning is the ratio between the number of parameters and the amount of data.
* **The Critical Case:** $N_{parameters} \gg N_{data}$.
* **The Risk:**  Overfitting.

To fight overfitting, we often use **Data Augmentation** (flipping images, zooming, etc.) to artificially increase the dataset size.
Dropout can be seen as a form of Data Augmentation applied inside the network: by randomly turning off some neurons, you corrupt the data going through the Neural Network. This forces the model to process a slightly different (noisy) representation of the input at each iteration.


**3. Mathematics and Algorithm (Step-by-Step)**

Let's formalize the process for a hidden layer $l$.
* Let $z^{(l)}$ be the activation vector (before Dropout) of dimension $N$.
* Let $p$ be the **Dropout Rate** (the probability of zeroing out a neuron, e.g., $p=0.5$).

**The Operation: Inverted Dropout**
Modern frameworks (PyTorch/TensorFlow) use **Inverted Dropout** to simplify the testing phase.

1.  **Generate Bernoulli Mask:**
    We generate a vector $m^{(l)}$ of the same size as $z^{(l)}$, composed of 0s and 1s.
    $$m_i^{(l)} \sim \text{Bernoulli}(1-p)$$
    *(Probability of being 1 is $1-p$, probability of being 0 is $p$).*

2.  **Apply Mask:**
    $$\tilde{z}^{(l)} = z^{(l)} \odot m^{(l)}$$
    *($\odot$ denotes the element-wise / Hadamard product).*

3.  **Scaling:**
    To maintain the same expected "energy" (mathematical expectation) as if all neurons were active, we divide by the keep probability $(1-p)$.
    $$\text{Output}_{train} = \frac{\tilde{z}^{(l)}}{1-p}$$

**Detailed Numerical Example**
Imagine a layer with **5 neurons**.
* Raw Output ($z$): `[10, 20, 30, 40, 50]`
* Dropout Rate ($p$): **0.4** (each neuron is killed with probability 40%; in the two iterations below, 2 of the 5 neurons happen to be killed).
* Scaling Factor: $\frac{1}{1-0.4} = \frac{1}{0.6} \approx 1.67$

* **Iteration 1 (Batch A):** Randomness kills neurons **2** and **4**.
    * Mask: `[1, 0, 1, 0, 1]`
    * Apply: `[10, 0, 30, 0, 50]`
    * Scale ($\times \frac{1}{0.6}$): `[16.67, 0, 50, 0, 83.33]`

* **Iteration 2 (Batch B):** Randomness kills neurons **1** and **5**.
    * Mask: `[0, 1, 1, 1, 0]`
    * Apply: `[0, 20, 30, 40, 0]`
    * Scale ($\times \frac{1}{0.6}$): `[0, 33.33, 50, 66.67, 0]`

> **Key Observation:** Neuron 3 (value 30) was active both times, but it was paired with different neighbors. It cannot "trust" its neighbors to make a decision.
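
The three steps (mask, apply, scale) can be sketched in pure Python. The function `inverted_dropout` is our own illustration, not a framework API. It accepts an optional fixed mask so that the two iterations of the example can be replayed, and it uses the exact factor $\frac{1}{0.6}$ rather than a rounded one, so the last digit may differ from hand-rounded values.

```python
import random


def inverted_dropout(z, p, mask=None, rng=random):
    """Zero each value with probability p, then rescale survivors by 1 / (1 - p)."""
    if mask is None:
        # Bernoulli(1 - p): 1 means "keep", 0 means "drop"
        mask = [1 if rng.random() >= p else 0 for _ in z]
    keep = 1.0 - p
    return [zi * mi / keep for zi, mi in zip(z, mask)]


z = [10, 20, 30, 40, 50]
p = 0.4

out_a = inverted_dropout(z, p, mask=[1, 0, 1, 0, 1])  # Iteration 1
out_b = inverted_dropout(z, p, mask=[0, 1, 1, 1, 0])  # Iteration 2
print([round(v, 2) for v in out_a])  # [16.67, 0.0, 50.0, 0.0, 83.33]
print([round(v, 2) for v in out_b])  # [0.0, 33.33, 50.0, 66.67, 0.0]

# Averaged over many random masks, the output approaches the raw values z
rng = random.Random(0)
trials = 20000
mean_out = [0.0] * len(z)
for _ in range(trials):
    for i, v in enumerate(inverted_dropout(z, p, rng=rng)):
        mean_out[i] += v / trials
print([round(v, 1) for v in mean_out])
```

The last line prints values close to `[10, 20, 30, 40, 50]`, which is the reason for the scaling step: the expected output is unchanged, so nothing needs to be corrected at test time.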


**4. Train vs. Test (The Crucial Mode Switch)**


* **Training Phase (`model.train()`):**
    * **Dropout is ON.**
    * The network is "intoxicated" (noisy).
    * The random mask is applied, and values are scaled by $\frac{1}{1-p}$.
    * *Objective:* Learn robustness.

* **Testing / Inference Phase (`model.eval()`):**
    * **Dropout is OFF.**
    * We want to use the full "brain" capacity for the best prediction.
    * All neurons are used: $m = [1, 1, 1, 1, 1]$.
    * **No extra scaling** is performed (because we already divided by $(1-p)$ during training, the expected values match).
    * *Objective:* Deterministic, maximum performance.


**5. When do you apply Dropout?**

It is standard practice to apply Dropout **only during the MLP phase**.
* **Reason:** Convolutional layers have significantly fewer parameters than Fully Connected (MLP) layers due to weight sharing. Consequently, they are naturally less prone to overfitting, so there is usually no need to aggressively reduce the model's capacity during the feature extraction phase.

In [40]:
from IPython.display import display, HTML

display(HTML("""
<div style="text-align: center;">
    <img src="assets/dropout.gif" 
         style="width: 80%; border-radius: 6px; box-shadow: 0 4px 10px rgba(0,0,0,0.15);">
</div>
"""))

*Note: This animation was not created by us; it was sourced from the [ManimML GitHub repository](https://github.com/helblazer811/ManimML/blob/main/assets/readme/dropout.gif).*

## 4. Example of how a CNN learns (The Full Modern Pipeline)

This section details the actual learning mechanism of a modern Convolutional Neural Network (CNN), integrating crucial stabilization and regularization techniques (**Data Augmentation, Batch Normalization, Dropout**).

**The Setup:**
Assume a dataset of **10,000 images**:
- **8,000** for training ($N_{train}$).
- **2,000** for testing.

---

### 4.1 Intuition vs. Reality

**The Naive Intuition:**
One might think the network loops through images one by one:
1. Load image 1 $\to$ Forward $\to$ Backward $\to$ Update weights.
2. Load image 2 $\to$ Forward $\to$ Backward $\to$ Update weights.
*This would be incredibly slow and unstable (noisy gradients).*

**The Modern Reality (Vectorization):**
We process images in **Mini-Batches**.
Let's choose a **Batch Size ($B$) = 32**.

The input is not a list of images, but a single **4-Dimensional Tensor**:
$$
X \in \mathbb{R}^{(B,\; C,\; H,\; W)}
$$
*Example:* 32 RGB images of size $64 \times 64$ become a tensor of shape `(32, 3, 64, 64)`.

---

### 4.2 The Forward Pass (Step-by-Step with Layers)

For a single batch of 32 images, the data flows through the pipeline as follows. Note that **Training Mode** behaviors are active here.

#### **Step 0: Data Augmentation (CPU/GPU)**
Before entering the network (or at the very first layer), we artificially corrupt the data to reduce overfitting.
* *Action:* Random flips, rotations, or zooms applied to the batch.
* *Result:* The network never sees the exact same image twice.
    $$X_{aug} = \text{Augment}(X)$$

#### **Step 1: The Convolutional Block (Feature Extraction)**
The tensor $X_{aug}$ enters the layers.
1.  **Conv2D:** Extracts features using filters ($W$).
    $$Z = X_{aug} * W$$
2.  **Batch Normalization (2D):** Stabilizes learning by recentering the data. It uses the batch mean $\mu_B$ and variance $\sigma^2_B$.
    $$\hat{Z} = \frac{Z - \mu_B}{\sqrt{\sigma^2_B + \epsilon}} \cdot \gamma + \beta$$
3.  **Activation (ReLU):** Adds non-linearity.
    $$A = \max(0, \hat{Z})$$
4.  **Pooling:** Downsamples dimensions (e.g., MaxPool).

#### **Step 2: Flattening**
The 4D tensor is flattened into a 2D matrix of shape $(B, \text{Features})$ to enter the MLP.

#### **Step 3: The MLP Block (Classification)**
1.  **Dense (Linear):**
    $$Z_{dense} = A_{flat} \cdot W_{dense} + b$$
2.  **Batch Normalization (1D):** Normalizes the vector features.
3.  **Activation (ReLU):** $\max(0, Z_{norm})$.
4.  **Dropout:** Randomly sets a fraction $p$ (e.g., 50%) of neurons to zero to force robustness, then rescales the survivors by $\frac{1}{1-p}$ (Inverted Dropout).
    $$A_{final} = \frac{A_{relu} \odot \text{Mask}_{Bernoulli}}{1-p}$$

#### **Step 4: Output & Loss**
* **Prediction:** The final layer produces logits, converted to probabilities via **Softmax** (or Sigmoid).
* **Loss Computation:** The Loss $\mathcal{L}$ is calculated as the **average** of the errors of the 32 images.
    $$\mathcal{L}_{batch} = \frac{1}{B} \sum_{i=1}^{B} \text{Loss}(\text{pred}_i, \text{target}_i)$$
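
As an illustration of this step, the sketch below converts logits to probabilities with Softmax and averages a per-image loss over the batch. The text leaves $\text{Loss}$ generic; cross-entropy is our assumption here, as are the function names and the toy logits (a batch of 2 images and 3 classes instead of 32 images).

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def batch_cross_entropy(batch_logits, targets):
    """Average over the batch of -log(probability assigned to the true class)."""
    losses = [-math.log(softmax(logits)[t])
              for logits, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)


batch_logits = [[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]]
targets = [0, 2]  # index of the correct class for each image

probs = [softmax(logits) for logits in batch_logits]
loss = batch_cross_entropy(batch_logits, targets)
print([round(p, 3) for p in probs[0]])
print(round(loss, 4))
```

The second image has identical logits for all classes, so its probabilities are uniform and its individual loss is $\ln 3$.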

---

### 4.3 Backpropagation & Optimization

Once $\mathcal{L}_{batch}$ is computed:
1.  **Backward Pass:** We compute the gradient of the loss w.r.t. **all** parameters $\theta$ (kernels, $\gamma$ and $\beta$ of BN, weights, biases).
    $$\nabla_{\theta} \mathcal{L}_{batch}$$
2.  **Optimizer Step:** We update all the parameters with a gradient-based optimizer (e.g., Adam or SGD). For plain gradient descent with learning rate $\eta$:
    $$\theta_{t+1} = \theta_t - \eta \cdot \nabla_{\theta} \mathcal{L}_{batch}$$
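
The update rule can be tried on a toy problem. The sketch below applies it to a single parameter with the loss $(\theta - 3)^2$, whose gradient is $2(\theta - 3)$. The loss, the learning rate and the function name `sgd_step` are our own choices for illustration; a real CNN applies the same rule to millions of parameters at once.

```python
def sgd_step(params, grads, lr):
    """theta <- theta - lr * gradient, applied to every parameter."""
    return [p - lr * g for p, g in zip(params, grads)]


theta = [0.0]  # initial parameter value
lr = 0.1       # learning rate (eta)

for step in range(100):
    grads = [2 * (theta[0] - 3)]  # gradient of (theta - 3)^2
    theta = sgd_step(theta, grads, lr)

print(round(theta[0], 6))  # 3.0
```

Each step moves the parameter against the gradient, and it settles at the minimum of the loss.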

---

### 4.4 Detailed Iteration Math (Epochs vs. Steps)

It is crucial to distinguish between an **Iteration** and an **Epoch**.

* **1 Iteration (or Step):** One forward/backward pass of **1 Batch** (32 images).
* **1 Epoch:** The network has seen the **entire** dataset (8,000 images) once.

**Calculations:**
* Dataset Size: $N = 8000$
* Batch Size: $B = 32$
* **Iterations per Epoch:** $\frac{8000}{32} = \mathbf{250}$ steps.

If we train for **20 Epochs**:
$$
\text{Total Parameter Updates} = 250 \times 20 = \mathbf{5{,}000} \text{ updates}
$$
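
The same arithmetic can be read off the structure of a training loop. This is a skeleton only: the forward pass, backward pass and optimizer step are represented by a counter.

```python
import math

n_train = 8000
batch_size = 32
epochs = 20

steps_per_epoch = math.ceil(n_train / batch_size)

updates = 0
for epoch in range(epochs):
    # One epoch: walk through the training set in chunks of batch_size images
    for start in range(0, n_train, batch_size):
        # forward pass, loss, backward pass and optimizer step would go here
        updates += 1

print(steps_per_epoch)  # 250
print(updates)          # 5000
```

`math.ceil` matters when the dataset size is not a multiple of the batch size: the last, smaller batch still counts as one step.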

---

### 4.5 Crucial Note: Train vs. Eval Mode

The behavior of these layers changes radically after training:

| Layer | **Training Mode** (`model.train()`) | **Evaluation/Test Mode** (`model.eval()`) |
| :--- | :--- | :--- |
| **Data Augmentation** | **ON** (Random variations) | **OFF** (Use original images) |
| **Batch Normalization** | Uses **current batch** statistics ($\mu_B, \sigma^2_B$) | Uses **saved running average** statistics |
| **Dropout** | **ON** (Randomly kills neurons) | **OFF** (Uses all neurons, no scaling) |
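
The Batch Normalization row can be illustrated with a minimal sketch. The class `TinyBatchNorm` is hypothetical. It handles a single feature, omits $\gamma$ and $\beta$, and keeps an exponential moving average with a momentum of 0.1, which is one common convention; frameworks differ in details such as which variance estimator they store.

```python
class TinyBatchNorm:
    """Batch norm for one feature, with a train/eval switch (gamma and beta omitted)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.running_mean = 0.0
        self.running_var = 1.0
        self.momentum = momentum
        self.eps = eps
        self.training = True

    def __call__(self, values):
        if self.training:
            # Training mode: statistics of the current batch
            n = len(values)
            mean = sum(values) / n
            var = sum((v - mean) ** 2 for v in values) / n
            # Update the saved running averages for later use in eval mode
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Eval mode: saved statistics, independent of the current batch
            mean, var = self.running_mean, self.running_var
        return [(v - mean) / (var + self.eps) ** 0.5 for v in values]


bn = TinyBatchNorm()
for _ in range(200):  # simulate 200 training steps on similar batches
    bn([2.0, 4.0, 6.0, 8.0])

bn.training = False   # the analogue of switching to evaluation mode
single = bn([5.0])    # works even for a batch containing one image
print(round(bn.running_mean, 3), round(bn.running_var, 3))
```

In evaluation mode the output for a given input no longer depends on which other images are in the batch, which is what makes predictions deterministic.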

### 4.6 Summary

Modern CNN training is not a loop over images. It is a sequence of **tensor operations** on batches.
1.  **Augment** the batch.
2.  Pass through **Conv $\to$ BN $\to$ ReLU $\to$ Pool**.
3.  Pass through **Dense $\to$ BN $\to$ ReLU $\to$ Dropout**.
4.  Compute **Batch Loss**.
5.  Update parameters (1 Step).
6.  Repeat 250 times to complete 1 Epoch.


### 4.7 Warning !

We repeat ourselves here, but we want to insist on this point: the intuitive expectation that **CNNs process the images of a batch one by one, with a separate forward pass for each image, is FALSE**.

In fact:

- a batch is represented as a single 4-D tensor,  
- one vectorized forward pass processes all images of one batch simultaneously

## 5. Network Depth vs. Information Precision

The more convolutional layers we add, the more complex the detected features become.

### 5.1 The Mathematical Intuition
It is actually quite natural.
Mathematically, a neural network is a **composition of non-linear functions**.
* One layer performs a simple transformation: $y = f(x)$.
* Ten layers perform a chain of transformations: $y = f_{10}(f_9(\dots f_1(x) \dots))$.

By stacking layers, we are building a function of **increasing complexity**.
Just like in math where composing simple functions allows you to describe complex curves, stacking convolutions allows the network to model extremely precise and intricate relationships in the data.

### 5.2 The Hierarchy of Features
Because of this composition, the network learns in a hierarchical way. As we move deeper into the network (towards the output), the features become more abstract and "human-level":

1.  **First Layers:** They detect **simple geometry** (horizontal lines, vertical edges, color gradients).
2.  **Middle Layers:** They combine lines to detect **shapes and parts** (corners, curves, circles, textures).
3.  **Last Layers:** They combine shapes to recognize **complex objects** (eyes, mouths, ears, car wheels, faces).



**Conclusion:** The deeper we go, the more the network understands "concepts" (Is there an eye?) rather than just pixel values.

## 6. Bonus: Mathematical Zoom into Backpropagation in CNNs



We aimed to keep this document concise while providing a clear overview of the CNN pipeline. As established, a CNN is essentially a large neural network composed of two distinct parts: the **Convolutional block** and the **MLP block**.

All parameters, including the values within the kernels, are learned via **Backpropagation**.

To understand how a kernel learns, we rely on the **Chain Rule**.

$$
\underbrace{\frac{\partial \mathcal{L}}{\partial W}}_{\text{What we want}} ={\frac{\partial \mathcal{L}}{\partial Y}} \times {\frac{\partial Y}{\partial W}}
$$

However, if you try to compute these partial derivatives for a Convolutional Kernel, you will find an **interesting result** due to the architecture's specific nature.

---

### The Weight Sharing Effect

To understand why CNN training is computationally intensive but parameter-efficient, let's derive the gradient for a **single weight** in a concrete example.

**The Setup:**
* **Input ($X$):** A $3 \times 3$ matrix (e.g., a small image patch).
* **Kernel ($W$):** A $2 \times 2$ filter.
* **Output ($Y$):** A $2 \times 2$ feature map (valid convolution).

$$
X = \begin{bmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \end{bmatrix}, \quad
W = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix}
$$

**1. The Forward Pass (Weight Sharing in Action)**
Let's calculate the output map $Y$. Notice how the **same weight** $w_{11}$ (top-left of kernel) is reused 4 times, interacting with different input pixels.

$$
Y = \begin{bmatrix} y_{11} & y_{12} \\ y_{21} & y_{22} \end{bmatrix}
$$

The equations for the pixels are:
* $y_{11} = {\color{red}w_{11}}x_{11} + w_{12}x_{12} + w_{21}x_{21} + w_{22}x_{22}$
* $y_{12} = {\color{red}w_{11}}x_{12} + w_{12}x_{13} + w_{21}x_{22} + w_{22}x_{23}$
* $y_{21} = {\color{red}w_{11}}x_{21} + w_{12}x_{22} + w_{21}x_{31} + w_{22}x_{32}$
* $y_{22} = {\color{red}w_{11}}x_{22} + w_{12}x_{23} + w_{21}x_{32} + w_{22}x_{33}$

**2. The Backward Pass (Computing the Gradient)**
Assume we receive the gradient of the Loss $\mathcal{L}$ with respect to the output $Y$ (let's call it $\delta$, the "Message from the future"):
$$
\frac{\partial \mathcal{L}}{\partial Y} = \delta = \begin{bmatrix} \delta_{11} & \delta_{12} \\ \delta_{21} & \delta_{22} \end{bmatrix}
$$

We want to find the gradient for the weight ${\color{red}w_{11}}$: **$\frac{\partial \mathcal{L}}{\partial w_{11}}$**.

**Applying the Multivariate Chain Rule:**
Since $w_{11}$ contributed to **all 4 output pixels**, a change in $w_{11}$ affects the error through 4 different paths. We must sum these contributions.

$$
\frac{\partial \mathcal{L}}{\partial w_{11}} =
\frac{\partial \mathcal{L}}{\partial y_{11}}\frac{\partial y_{11}}{\partial w_{11}} +
\frac{\partial \mathcal{L}}{\partial y_{12}}\frac{\partial y_{12}}{\partial w_{11}} +
\frac{\partial \mathcal{L}}{\partial y_{21}}\frac{\partial y_{21}}{\partial w_{11}} +
\frac{\partial \mathcal{L}}{\partial y_{22}}\frac{\partial y_{22}}{\partial w_{11}}
$$

**3. Computing the Partial Derivatives:**
Looking at the equations in Step 1, we can see the derivative of $y$ with respect to $w_{11}$ is simply the input pixel $x$ it was multiplied with:
* $\frac{\partial y_{11}}{\partial w_{11}} = x_{11}$
* $\frac{\partial y_{12}}{\partial w_{11}} = x_{12}$
* $\frac{\partial y_{21}}{\partial w_{11}} = x_{21}$
* $\frac{\partial y_{22}}{\partial w_{11}} = x_{22}$

**4. The Final Result:**
$$
\frac{\partial \mathcal{L}}{\partial w_{11}} = \delta_{11}x_{11} + \delta_{12}x_{12} + \delta_{21}x_{21} + \delta_{22}x_{22}
$$

**Conclusion:**
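The gradient of a single kernel weight is not just a local calculation. It is a **sum over every output position**: each backpropagated error $\delta_{ij}$ is multiplied by the input pixel that the weight touched when producing $y_{ij}$. The same reasoning applies to the other three weights, with the input window shifted accordingly.

In simpler terms: **$\nabla W$ corresponds to the convolution of the Input $X$ and the Error $\delta$** (a sliding-window product without kernel flip, i.e. the same operation as in the forward pass).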
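
This result can be checked numerically. The sketch below computes $\nabla W$ by sliding $\delta$ over $X$, then compares it with a finite-difference estimate. The toy values, the function names and the surrogate loss $\mathcal{L} = \sum_{ij} \delta_{ij} y_{ij}$ (chosen so that $\frac{\partial \mathcal{L}}{\partial Y} = \delta$ exactly) are assumptions made for the check.

```python
def conv2d_valid(x, w):
    """Sliding-window product of x with w (no padding, stride 1, no kernel flip)."""
    k = len(w)
    n = len(x) - k + 1
    return [[sum(w[a][b] * x[i + a][j + b] for a in range(k) for b in range(k))
             for j in range(n)] for i in range(n)]


def kernel_gradient(x, delta):
    """dL/dw[a][b] = sum over (i, j) of delta[i][j] * x[i + a][j + b]."""
    return conv2d_valid(x, delta)


def surrogate_loss(x, w, delta):
    """A loss whose gradient with respect to the output Y is exactly delta."""
    y = conv2d_valid(x, w)
    return sum(delta[i][j] * y[i][j] for i in range(2) for j in range(2))


x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
w = [[0.5, -1.0], [2.0, 0.0]]
delta = [[1.0, 0.0], [2.0, -1.0]]

grad = kernel_gradient(x, delta)
print(grad)  # [[4.0, 6.0], [10.0, 12.0]]

# Finite-difference check: nudge each weight and measure the change in the loss
eps = 1e-3
numeric = [[0.0, 0.0], [0.0, 0.0]]
for a in range(2):
    for b in range(2):
        w_plus = [row[:] for row in w]
        w_plus[a][b] += eps
        numeric[a][b] = (surrogate_loss(x, w_plus, delta)
                         - surrogate_loss(x, w, delta)) / eps
```

The top-left entry is $\delta_{11}x_{11} + \delta_{12}x_{12} + \delta_{21}x_{21} + \delta_{22}x_{22} = 1 + 0 + 8 - 5 = 4$, in agreement with the formula derived above, and `numeric` matches `grad` up to floating-point error.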

# References 

* https://www.youtube.com/watch?v=JfBf5eYptSs&t=1468s
* https://www.youtube.com/watch?v=zG_5OtgxfAg&t=636s
* https://www.youtube.com/watch?v=581X9wsnWJs&t=700s
* https://www.youtube.com/watch?v=zPwQiZFrwUY
* https://www.kaggle.com/code/blurredmachine/alexnet-architecture-a-complete-guide
* https://medium.com/biased-algorithms/batch-normalization-in-cnn-81c0bd832c63
* https://www.youtube.com/watch?v=h8oL4GXkBV4
* https://github.com/helblazer811/ManimML/blob/main/assets/readme/dropout.gif