# CNN : Convolutional Neural Network

A project by Walid Canesse, Salaheddin Ben Emran, and Mohamed-Ayoub Bouzid

All animations were made by us using Manim.

TODO: add a table of contents.

# Introduction

In Machine Learning, when we need to perform classification, we have many standard models that work well, such as:
* **Logistic Regression**
* **Random Forest**
* **SVM (Support Vector Machines)**
* **K-Nearest Neighbors**

However, in this notebook, we will focus on a specific case: **when the input is an image**.

We will show that **Neural Networks** are a powerful tool for this task. In particular, we will see how their structure, with a specific adaptation called **Convolution**, can be tailored to understand and classify visual data.

# CHAPTER 1 : The intuition behind CNNs

## 0. What is an Image?


To a computer, an image is just a grid of numbers. However, in Deep Learning, we formalize this using the concept of **Tensors**.

Dimension of an image : $Height \times Width \times Channels$

### A. Grayscale Images (Black & White)
We define a grayscale image as a volume of size **$N_1 \times N_2 \times 1$**.
* It is a matrix where each pixel is a number between **0 and 255**.
    * **0:** Pure Black.
    * **255:** Pure White.
    * **Between:** Shades of gray.
* We can view this as a **Tensor with 1 channel** (a single "slice" or matrix).

### B. Color Images (RGB)
A color image is a volume of size **$N_1 \times N_2 \times 3$**.
* Instead of one slice, we have **3 matrices stacked together**:
    1.  **Red** Matrix
    2.  **Green** Matrix
    3.  **Blue** Matrix
* This is a **3D Tensor** with 3 channels. 


### C. What is a Tensor?
A **Tensor** is simply a generalization of matrices to higher dimensions.

* **0D Tensor (Scalar):** A simple number.
    * *Structure:* A single point.
    * *Example:* `5`

* **1D Tensor (Vector):** A simple list of numbers.
    * *Structure:* Just a line.
    * *Example:* `[0, 255, 12, 45]`

* **2D Tensor (Matrix):** A single grid of numbers.
    * *Structure:* A "sheet" with rows and columns.
    * *Example:* A single grayscale image viewed as a plain $H \times W$ matrix.

* **3D Tensor:** **Matrices stacked together**.
    * *Structure:* A stack of sheets.
    * *Example:* A Color Image. It is **3 matrices** (Red, Green, Blue) stacked on top of each other ($H \times W \times 3$).

### Important Note: 
**A 3D Tensor is not limited to 3 channels.**

* You can stack **2, 5, 7, or even 100 matrices** together.
* It remains a **3D Tensor** because it is still defined by 3 axes: $Height \times Width \times Channels$.
* **Think of a book:** Whether a book has 3 pages or 500 pages, it is still a "3D object". The number of pages (channels) changes, but the structure is the same.


**Intuition:** Don't overcomplicate it. Just see a Tensor as a container of matrices stacked on top of each other.
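As a rough illustration, the tensor ranks listed above can be reproduced with NumPy arrays. NumPy is our choice for this sketch (the notebook does not prescribe a library), and the variable names are hypothetical.

```python
import numpy as np

scalar = np.array(5)                  # 0D tensor: a single number
vector = np.array([0, 255, 12, 45])   # 1D tensor: a list of numbers
gray = np.zeros((28, 28))             # 2D tensor: one grayscale matrix (H x W)
rgb = np.zeros((28, 28, 3))           # 3D tensor: 3 stacked matrices (H x W x 3)
deep = np.zeros((28, 28, 100))        # still 3D: 100 channels, same 3 axes

for name, tensor in [("scalar", scalar), ("vector", vector), ("gray", gray),
                     ("rgb", rgb), ("deep", deep)]:
    print(f"{name:>6}: ndim={tensor.ndim}, shape={tensor.shape}")
```

Note that `rgb` and `deep` both report `ndim=3`: only the length of the last axis (the number of channels) differs.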

TODO: add an animation for the tensors, e.g. 1D tensor is a point, 2D ...

## 1. Why Not Classic MLPs? (Example: MNIST Dataset)

Let's start with a classic example: the **MNIST dataset**.
This is a database of handwritten digits (0 to 9). The images are **Grayscale** and have a size of **$28 \times 28 \times 1$**.

As we discussed, grayscale images are simply matrices of pixels where values range from **0 to 255**.

### The Naive Approach: Multi-Layer Perceptron (MLP)
If we want to build a model to classify these digits, the first method that comes to mind is a standard **Fully Connected Neural Network** (or MLP).

However, an MLP expects a **flat vector** as input, not a matrix.
To use it, we are forced to **flatten** our $28 \times 28$ image into a single vector of **784 pixels**. We then feed this long vector into the MLP to get a prediction.
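The flattening step can be sketched with NumPy as follows. The image here is a hypothetical one whose pixel values encode their own position, which makes it easy to see where each pixel ends up in the vector.

```python
import numpy as np

H, W = 28, 28
# Hypothetical image: the value of each pixel is row * W + col.
image = np.arange(H * W).reshape(H, W)

flat = image.reshape(-1)      # row-major flattening: 28 x 28 -> 784
print(flat.shape)             # (784,)

# Two vertically adjacent pixels end up W = 28 positions apart in the vector.
row, col = 10, 5
idx_top = row * W + col
idx_bottom = (row + 1) * W + col
print(idx_bottom - idx_top)   # 28
```

Pixels that touch each other in the image can therefore sit far apart in the flat vector, and nothing in the vector itself tells the MLP that they were neighbors.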



[Image of flattening image to vector]


### Why is this method not optimal?
While this might work for very simple tasks, it has **three major problems**:

#### 1. Loss of Spatial Information
An image is not just a random list of numbers; it is a **2D structured object**.
* Neighboring pixels are correlated (they work together to form shapes).
* The order and position matter.

**Problem:** By flattening the image into a vector, we **break this structure**. The network reads the image "pixel by pixel" without understanding the geometry, leading to a poor use of visual information.

#### 2. Giant Number of Parameters
Let's do the math for a simple MLP with **2 hidden layers of 100 neurons each** and **10 output neurons**.

* **Input Size:** 784 (pixels)
* **Hidden Layer 1:** 100 neurons
* **Hidden Layer 2:** 100 neurons
* **Output Layer:** 10 neurons

**Detailed Calculation of Parameters (Weights + Biases):**

* **Layer 1 (Input $\to$ Hidden 1):**
    * Weights: $784 \times 100 = 78,400$
    * Biases: $100$ (1 per neuron)
    
* **Layer 2 (Hidden 1 $\to$ Hidden 2):**
    * Weights: $100 \times 100 = 10,000$
    * Biases: $100$

* **Layer 3 (Hidden 2 $\to$ Output):**
    * Weights: $100 \times 10 = 1,000$
    * Biases: $10$

**Total Sum:**
$$78,400 + 100 + 10,000 + 100 + 1,000 + 10 = \mathbf{89,610}$$
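The same count can be checked with a few lines of Python. This is a minimal sketch; `count_mlp_parameters` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
def count_mlp_parameters(layer_sizes):
    """Return the number of weights + biases of a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out   # one weight per (input, neuron) pair
        total += n_out          # one bias per neuron
    return total

print(count_mlp_parameters([784, 100, 100, 10]))  # 89610
```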

**Problem:** We have nearly **90,000 parameters** just for a tiny $28 \times 28$ grayscale image. On a realistic color image (e.g., $1000 \times 1000 \times 3$, i.e. 3 million input values), the first hidden layer of 100 neurons alone would need about **300 million weights**, and wider layers quickly push the count into the billions, making the model very costly to train.

#### 3. No Translation Invariance
To a human, if a digit "5" moves a little bit to the left or right, it is clearly still a "5". We recognize the **shape**, no matter where it is.
**Problem:** An MLP looks at each specific pixel position separately.
* If the "5" shifts, different input pixels light up.
* To the MLP, this looks like a completely different input.
* It has to "re-learn" what a "5" looks like for every possible position in the image. It lacks **Translation Invariance**.
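A toy example makes this concrete. The sketch below uses a hypothetical $5 \times 5$ image containing a vertical bar, shifts the bar by one pixel, and compares which entries of the flattened vector are active.

```python
def flatten(image):
    """Row-major flattening of a 2D list into a 1D list."""
    return [pixel for row in image for pixel in row]

def vertical_bar(size, column):
    """Hypothetical toy image: a white vertical bar (1s) on a black background (0s)."""
    return [[1 if c == column else 0 for c in range(size)] for _ in range(size)]

original = flatten(vertical_bar(5, column=1))
shifted = flatten(vertical_bar(5, column=2))   # same shape, moved 1 pixel right

active_original = {i for i, v in enumerate(original) if v}
active_shifted = {i for i, v in enumerate(shifted) if v}

print(sorted(active_original))            # [1, 6, 11, 16, 21]
print(sorted(active_shifted))             # [2, 7, 12, 17, 22]
print(active_original & active_shifted)   # set()
```

The two inputs share no active position at all, so the weights that reacted to the first bar receive nothing from the second one.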

## 2. How Humans Classify Images: Feature Detection

To understand the intuition behind CNNs, let's look at how humans see.

When we look at an image, **we do not scan every single pixel one by one.**
Instead, we process the image globally. We unconsciously look for **Features**:
* **Edges** (contours)
* **Textures**
* **Patterns** (e.g., the shape of an ear, the curve of a digit).

**The Decision Process:**
Identifying these features increases the probability of a specific class. If we see "whiskers" + "pointed ears" + "fur", our brain concludes: *"Yes, this looks like a cat."*

**The Goal of CNNs:**
Convolutional Neural Networks are designed to **mimic this exact behavior**.
The key is the convolution operator, which acts as an **Automatic Feature Detector**.

# CHAPTER 2 : The convolution operator

## 4. Defining Image Convolution and Kernels

As we said, the convolution acts as a **Feature Detector**. But how exactly do we compute it?

**The Process:**
1.  **Input:** We take an image (pixel matrix).
2.  **Kernel:** We define a small filter of a specific size (e.g., $3 \times 3$) that we choose.
3.  **Operation:** We slide the kernel over the image, performing a "dot product" (multiplication + sum) at every position.
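The three steps above can be sketched in plain Python. This is a minimal version under stated assumptions: stride 1, no padding, and no kernel flip (like most deep learning libraries, so strictly speaking it is a cross-correlation). The image and the kernel are hypothetical examples.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    n_rows, n_cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    output = []
    for i in range(n_rows - k_rows + 1):
        row = []
        for j in range(n_cols - k_cols + 1):
            # Multiply the kernel with the patch under it, then sum everything.
            total = 0
            for a in range(k_rows):
                for b in range(k_cols):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output

# Hypothetical 4x6 image: dark left half (0), bright right half (255).
image = [[0, 0, 0, 255, 255, 255] for _ in range(4)]
# A hand-picked vertical edge detector.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

print(conv2d(image, kernel))  # [[0, 765, 765, 0], [0, 765, 765, 0]]
```

The output is zero over the flat regions and large where the kernel overlaps the dark-to-bright transition: this kernel responds to vertical edges.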

### Let's Calculate!
TODO: work through 4 examples (see reMarkable).
TODO: try to animate them.

## 5. Output Dimensions, Tensors, and Calculation Cases

TODO: add animations.

## 6. Hyperparameters of Convolution

When designing a CNN, we don't just "apply convolution." We have to tune specific knobs, called **Hyperparameters**, to control how the network processes the image.

### 1. Padding
**The Problem:** Without padding, the pixels on the borders are "seen" less often by the filters because we cannot center the kernel on them. (Try a simple example yourself: you will notice that edge pixels are involved in far fewer calculations than the central pixels).

**The Solution:** Padding consists of adding a border of pixels (usually **Zeros**) around the input image.
* **Benefit:** The kernel can now be centered on border pixels, so the edges take part in more calculations. As a side effect, the output does not shrink as quickly.
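To make the "simple example" concrete, the sketch below counts how many kernel positions include each pixel of a $5 \times 5$ image for a $3 \times 3$ kernel with stride 1. `coverage_counts` is a hypothetical helper written for this illustration.

```python
def coverage_counts(n, f, padding=0):
    """Count how many kernel positions (stride 1) include each pixel of an n x n image."""
    size = n + 2 * padding
    counts = [[0] * size for _ in range(size)]
    for i in range(size - f + 1):
        for j in range(size - f + 1):
            for a in range(f):
                for b in range(f):
                    counts[i + a][j + b] += 1
    # Keep only the original image region (drop the padded border).
    return [row[padding:padding + n] for row in counts[padding:padding + n]]

for row in coverage_counts(5, 3):
    print(row)
# [1, 2, 3, 2, 1]
# [2, 4, 6, 4, 2]
# [3, 6, 9, 6, 3]
# [2, 4, 6, 4, 2]
# [1, 2, 3, 2, 1]

print(coverage_counts(5, 3, padding=1)[0][0])  # 4
```

Without padding, a corner pixel is used once while the central pixel is used 9 times. With one pixel of padding, the corner is used 4 times: the gap narrows, although it does not disappear entirely.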


EXAMPLE ANIMATION

### 2. Stride
**The Concept:** The Stride is the "step size" of the convolution.
* **Stride = 1:** We shift the filter **1 pixel** at a time. This is the standard detailed scan.
* **Stride = 2:** We shift the filter **2 pixels** at a time (we skip one pixel).

**Impact:** A larger stride **reduces the output size** and speeds up computation because we perform fewer operations.

EXAMPLE ANIMATION

### 3. The Output Dimension Formula 

To make the math easier, let's assume we are working with square inputs and square filters (which is almost always the case).
* **$N$**: Input dimension (Height = Width).
* **$F$**: Filter dimension (Height = Width).
* **$P$**: Padding.
* **$S$**: Stride.

The formula to calculate the output size of the feature map is:

$$
\text{Output Size} = \left\lfloor \frac{N - F + 2P}{S} \right\rfloor + 1
$$
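The formula translates directly into a small helper. `conv_output_size` is a hypothetical name used only for this sketch; integer division `//` plays the role of the floor.

```python
def conv_output_size(n, f, padding=0, stride=1):
    """Output size of a convolution on an n x n input with an f x f filter."""
    return (n - f + 2 * padding) // stride + 1

print(conv_output_size(28, 3, padding=1, stride=1))  # 28
print(conv_output_size(28, 3, padding=0, stride=1))  # 26
print(conv_output_size(28, 3, padding=0, stride=2))  # 13
```

The last call shows why the floor matters: $(28 - 3) / 2 = 12.5$ is rounded down to 12 before adding 1.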

> **Example:**
> * Input Image ($N$): $28$ (for a $28 \times 28$ image)
> * Filter ($F$): $3$ (for a $3 \times 3$ kernel)
> * Padding ($P$): $1$
> * Stride ($S$): $1$
>
> $$\text{Size} = \frac{28 - 3 + (2 \times 1)}{1} + 1 = \frac{27}{1} + 1 = \mathbf{28}$$
> *Result:* We kept the same size ($28 \times 28$) thanks to the padding!

### 4. Kernel Size & Number of Filters
Finally, we must choose the properties of the filters themselves:

* **Kernel Size (e.g., $3 \times 3$ or $5 \times 5$):**
    * This controls the **Receptive Field** (the local area) the network looks at.
    * Small kernels look for fine details. Large kernels look for broader patterns.

* **Number of Filters (Depth):**
    * This controls **how many features** we want to learn in this layer.
    * **More filters** = The network can learn more diverse patterns (edges, textures, colors).
    * **Trade-off:** More filters mean more parameters to learn and slower training.

TODO (transition): add the following idea. We will want to apply convolutions on top of convolutions (several composed convolutions), then flatten the last output and run an MLP on it. This is the main idea to remember.

# CHAPTER 3 : Building the Convolutional Neural Network

## 7. The CNN Breakthrough: From Manual Filters to Learned Features

### The Hypothesis
We might ask:
> *"Why don't we just manually choose the best kernels (like standard edge detectors), apply them to the image to extract features, and then feed the result into a classic MLP for classification?"*

### The Problem
This approach is flawed for three main reasons:
1.  **Missing Hidden Features:** We don't always know intuitively which patterns are the most "discriminant" (useful) to distinguish between classes.
2.  **Unknown Combinations:** We don't know the optimal mix of kernels to use.
3.  **Human Limitation:** By choosing manually, we limit the network to features *we* have imagined.

### The Solution: End-to-End Learning
The core idea of CNNs is to **let the network learn the values inside the kernels itself.**

* **Learnable Parameters:** The coefficients inside the filters are not fixed constants. They are **parameters** (weights) of the network.
* **Backpropagation:** Just like in an MLP, these weights are updated during training. The network figures out *on its own* which kernels are best to minimize the error.

### The Result: A Giant Pipeline
We obtain a single, unified network where:
1.  **Convolution Layers (The "Eyes"):** Learn to be the best possible **Feature Extractors**.
2.  **Fully Connected Layers (The "Brain"):** Take these features and handle the **Classification Decision**.



[Image of cnn architecture diagram]
TODO: animate the CNN pipeline.
TODO: also explain, with an example, why the convolution part can be seen as a neural network, and make clear that this part is not fully connected.

## 10. Convolution as a Layer of a Neural Network & Backpropagation

## 8. Pooling

After convolution, we often add a **Pooling Layer**.
Its goal is to reduce the spatial dimensions of the feature maps (Height and Width). This lowers the computation cost and the number of parameters in the layers that follow (the pooling layer itself has no learnable parameters).

### A. How it Works
We define a window of size **$K \times K$** (just like a kernel) and a **Stride**.

* **General Rule:** The window slides over the image. For each window, we compress the pixels into a single value (we calculate either the **max** or the **average** of those values).
* **Standard Practice:** The most common choice is a **$2 \times 2$** window with a **Stride of 2**. This divides the height and width of the image by 2 at each pooling layer.



[Image of max pooling vs average pooling]


### B. Max Pooling vs. Average Pooling
There are two main ways to compress the data:
1.  **Average Pooling:** Calculates the average value of the pixels in the window.
2.  **Max Pooling:** Selects only the **maximum value** in the window.

### C. Why is Max Pooling Better for Classification?
Let's take a concrete example with a $2 \times 2$ window to understand why Max Pooling is the standard.
Imagine a region of the image after a convolution (a feature map):

$$
\text{Window} = \begin{bmatrix} \mathbf{164} & 0 \\ 1 & 0 \end{bmatrix}
$$

* **Context:**
    * **164:** A very strong activation (the filter detected a sharp feature, like an edge or a texture).
    * **0 and 1:** Weak activations. In a feature map, a value near 0 means the filter found little or nothing at that location.

**The Comparison:**

* **Average Pooling:**
    $$\frac{164 + 0 + 1 + 0}{4} \approx \mathbf{41}$$
    * *Result:* The strong signal (164) is **diluted** by the zeros. The information becomes blurry and less distinct.

* **Max Pooling:**
    $$\max(164, 0, 1, 0) = \mathbf{164}$$
    * *Result:* We keep the **164**. The pooling ignores the weak activations (the near-zero values) and preserves the strongest response.

**Conclusion:** For image classification, Max Pooling acts as a "Feature Selector," ensuring the strongest patterns survive, while Average Pooling tends to wash them out.
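Both pooling variants can be sketched in a few lines of plain Python. The $4 \times 4$ feature map below is hypothetical; its top-left window is the $2 \times 2$ example used above.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Apply max or average pooling to a 2D feature map (list of lists)."""
    aggregate = max if mode == "max" else (lambda values: sum(values) / len(values))
    n_rows, n_cols = len(feature_map), len(feature_map[0])
    output = []
    for i in range(0, n_rows - size + 1, stride):
        row = []
        for j in range(0, n_cols - size + 1, stride):
            window = [feature_map[i + a][j + b]
                      for a in range(size) for b in range(size)]
            row.append(aggregate(window))
        output.append(row)
    return output

# Hypothetical 4x4 feature map; its top-left window is [[164, 0], [1, 0]].
feature_map = [[164, 0, 3, 2],
               [1,   0, 1, 90],
               [0,   5, 0, 0],
               [7,   0, 0, 1]]

print(pool2d(feature_map, mode="max"))      # [[164, 90], [7, 1]]
print(pool2d(feature_map, mode="average"))  # [[41.25, 24.0], [3.0, 0.25]]
```

In both cases the $4 \times 4$ map shrinks to $2 \times 2$, but only max pooling keeps the strong activations (164 and 90) intact.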

## 11. Designing the Pipeline: The Hyperparameters

We can divide the hyperparameters of the pipeline into three main categories, plus a few extras:

### A. Convolution Layers

**Step 1: Define Architecture Depth**
* **Number of Blocks:** How many (Conv + Pooling) blocks to stack?

**Step 2: For EACH Block, choose:**
* **Number of Filters:** (e.g., 32, 64, 128...)
* **Filter Size (Kernel):** (e.g., $3 \times 3$)
* **Padding:** 
* **Stride:** 
* **Activation Function:** (e.g., ReLU, Sigmoid)
* **Pooling Type:** (Max Pooling vs Average Pooling)

### B. MLP Layers (Classifier)
* **Number of Layers**
* **Neurons per Layer**
* **Activation Function:** (e.g., ReLU for hidden, Sigmoid/Softmax for output)

### C. Training (Backward)
* **Optimizer:** (e.g., Adam, SGD)
* **Learning Rate**
* **Batch Size**
* **Number of Epochs**


### D. Crucial Extras: Stabilization & Regularization
To make the model robust, we add two specific layers.

#### 1. Batch Normalization (Batch Norm)
* **Where to put it?** Inside the Conv Block, **between** the Convolution and the Activation function.
    * *Order:* `Conv2d` $\to$ `BatchNorm2d` $\to$ `ReLU`.
* **The Math:** It normalizes the output of the convolution by subtracting the batch mean ($\mu$) and dividing by the batch standard deviation ($\sigma$), then applies a learnable scale and shift:
    $$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y = \gamma \, \hat{x} + \beta$$
    *(Where $\gamma$ and $\beta$ are learnable parameters and $\epsilon$ is a small constant that avoids division by zero).*
* **Why?** It stabilizes the learning process (keeps values centered), allowing the use of higher learning rates and faster convergence.

https://www.youtube.com/watch?v=nT9nKBCjS_Y

#### 2. Dropout
* **Where to put it?** Inside the MLP, **immediately after** the Activation function of hidden layers.
    * *Order:* `Linear` $\to$ `ReLU` $\to$ `Dropout`.
* **The Math:** During training, for each neuron, we flip a biased coin that lands on heads with probability $p$.
    * If heads: The neuron's output becomes **0** (it is "turned off").
    * If tails: The neuron works normally.
    * At inference time, dropout is switched off and every neuron is kept.
* **Why?** It prevents **Overfitting**. By randomly silencing neurons, the network cannot rely on a single specific path. It forces the "team" of neurons to be robust, ensuring that the model generalizes well to new images instead of memorizing the training set.
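The coin flip can be sketched in plain Python as below. One assumption goes beyond the description above: the surviving activations are rescaled by $1 / (1 - p)$ (so-called inverted dropout), which keeps the expected output unchanged and is what common libraries do during training. The function name and the sample activations are hypothetical, and the sketch assumes $p < 1$.

```python
import random

def dropout(activations, p, training=True, rng=random):
    """Inverted dropout on a list of activations (assumes 0 <= p < 1).

    During training, each value is zeroed with probability p and survivors are
    scaled by 1 / (1 - p). At inference (training=False) nothing changes.
    """
    if not training or p == 0:
        return list(activations)
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else a * keep_scale for a in activations]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
hidden = [0.5, 1.2, 0.0, 3.1, 0.7, 2.4]

print(dropout(hidden, p=0.5, rng=rng))         # some values zeroed, the rest doubled
print(dropout(hidden, p=0.5, training=False))  # unchanged at inference
```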

## 13. Understanding the Training Flow: Batches, Tensors, and Iterations

To understand how training really works, we need to look under the hood at the dimensions and the timeline.

### A. The Setup: Data and Batches
Let's use our concrete example:
* **Dataset:** 8,000 Images.
* **Image Size:** $28 \times 28 \times 1$ (Grayscale).
* **Batch Size:** We choose **32**.

**Why Batches?**
For large datasets, we usually cannot feed all the images into the GPU at once (memory explosion). We also don't want to feed them 1 by 1 (too slow, and each update would be very noisy). Instead, we group them into packets of 32.

**Key Definitions:**
1.  **Iteration (Step):** The act of processing **one batch** (32 images) -> calculating error -> updating weights.
2.  **Epoch:** The act of processing the **entire dataset** once.

**The Math:**
$$\text{Iterations per Epoch} = \frac{8000 \text{ images}}{32 \text{ batch size}} = \mathbf{250 \text{ Iterations}}$$

---

### B. Step-by-Step: Inside a Single Iteration
Let's zoom in on **Iteration #1** (the very first step of training).

#### 1. The Input Tensor (4D)
The computer does not load images one by one. It creates a massive 4-Dimensional block containing the first 32 images stacked together.

* **Format:** $(Batch, Channels, Height, Width)$
* **Dimensions:** $(32, 1, 28, 28)$

#### 2. The Convolution Layer (Parallel Processing)
We defined a Convolution layer with:
* **32 Filters** (Kernel size $3 \times 3$).
* **Stride 1**, **No Padding**.

**The Operation:**
The GPU applies the 32 filters to the 32 images **simultaneously**. It does not use a "for loop". It performs a massive matrix calculation on the whole block.

**Calculating Output Size:**
* Spatial size: $28 - 3 + 1 = 26$ (Output is $26 \times 26$).
* Depth: We have 32 Filters, so output depth is 32.
* Batch: We still have 32 images.

**The Output Tensor Dimensions:**
$$(32, 32, 26, 26)$$
*(Batch Size $\times$ Number of Filters $\times$ Height $\times$ Width)*
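The shape bookkeeping can be checked with NumPy. This is only a sketch of the idea of "no for loop": it uses random stand-in values for the images and the filters, ignores biases, and relies on `sliding_window_view` plus `einsum` rather than on whatever a GPU library does internally.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
batch = rng.random((32, 1, 28, 28))    # (Batch, Channels, Height, Width)
filters = rng.random((32, 1, 3, 3))    # (Filters, Channels, kH, kW), random stand-ins

# Extract every 3x3 patch of every image: shape (32, 1, 26, 26, 3, 3).
patches = sliding_window_view(batch, (3, 3), axis=(2, 3))

# One vectorised contraction applies all filters to all images at once.
output = np.einsum("bchwij,fcij->bfhw", patches, filters)

print(patches.shape)  # (32, 1, 26, 26, 3, 3)
print(output.shape)   # (32, 32, 26, 26)
```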

#### 3. The Batch Normalization (The Deep Dive)
Now, this huge tensor $(32, 32, 26, 26)$ enters the Batch Norm layer.
The goal is to calculate the **Mean ($\mu$)** and **Standard Deviation ($\sigma$)** of the activations so that we can normalize them.

**How is the mean calculated?**
Batch Norm works **Filter by Filter** (Channel by Channel). It does NOT mix the filters.

**Example: Processing Filter #1**
Batch Norm looks at the specific "slice" for Filter #1 across the ENTIRE batch.
* It takes the $26 \times 26$ result of Filter 1 for **Image 1**.
* It takes the $26 \times 26$ result of Filter 1 for **Image 2**.
* ...
* It takes the $26 \times 26$ result of Filter 1 for **Image 32**.

It collects all these values together:
$$\text{Total Values} = 32 \times 26 \times 26 = \mathbf{21,632 \text{ values}}$$

It calculates **ONE single average** and **ONE standard deviation** from these 21,632 values.
Then, for every value in Filter #1 (across all 32 images), it subtracts this average and divides by this standard deviation, before applying the learnable scale $\gamma$ and shift $\beta$ of that filter.

*It repeats this process for Filter #2, Filter #3... up to Filter #32.*
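The per-filter statistics can be sketched with NumPy as follows. This covers training-mode normalization only; the running averages that libraries keep for inference are left out. `batch_norm_2d` is a hypothetical helper, and the input is random data standing in for the convolution output.

```python
import numpy as np

def batch_norm_2d(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm for a tensor of shape (Batch, Channels, H, W)."""
    # One mean and one variance per channel, computed over batch, height and width.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Learnable per-channel scale and shift.
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(32, 32, 26, 26))  # stand-in for the conv output
gamma, beta = np.ones(32), np.zeros(32)                     # typical initial values

y = batch_norm_2d(x, gamma, beta)

print(x[:, 0].size)                  # 21632 values feed the statistics of filter #1
print(y.mean(axis=(0, 2, 3)).shape)  # (32,): one statistic per filter
```

After normalization, each of the 32 channels has a mean close to 0 and a standard deviation close to 1, whatever the scale of the incoming activations was.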

---

### C. The Timeline: The Training Loop
Now that we understand what happens inside one step, here is the full timeline of training.

**Epoch 1 Starts:**

1.  **Time T=0 (Iteration 1):**
    * Load **Batch 1** (Images 0 to 31).
    * **Forward Pass:** Calculate Conv -> BatchNorm -> ReLU -> MLP -> Final Prediction.
    * **Loss Calculation:** Compare predictions of these 32 images to the real labels.
    * **Backward Pass:** Compute the gradients by backpropagation and update the weights of the network immediately.

2.  **Time T=1 (Iteration 2):**
    * The network is now slightly "smarter" (weights have changed).
    * Load **Batch 2** (Images 32 to 63).
    * Forward -> Loss -> Backward -> Update.

3.  **... (Iterations 3 to 249 repeat the same steps) ...**

4.  **Time T=249 (Iteration 250):**
    * Load the final Batch.
    * Update weights.

**End of Epoch 1:** The network has seen every image exactly once.
**Start Epoch 2:** We shuffle the images and start again from Batch 1!
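The timeline can be sketched as a loop skeleton in plain Python. Only the batching logic is real here: the forward pass, loss, and update are left as a placeholder comment, and `iterate_batches` is a hypothetical helper.

```python
import random

def iterate_batches(num_samples, batch_size, shuffle=False, rng=random):
    """Yield lists of sample indices, one list per iteration."""
    indices = list(range(num_samples))
    if shuffle:
        rng.shuffle(indices)
    for start in range(0, num_samples, batch_size):
        yield indices[start:start + batch_size]

num_images, batch_size, num_epochs = 8000, 32, 2
rng = random.Random(0)

for epoch in range(1, num_epochs + 1):
    iterations = 0
    # Epoch 1 keeps the original order (as in the timeline); later epochs shuffle.
    for batch_indices in iterate_batches(num_images, batch_size,
                                         shuffle=(epoch > 1), rng=rng):
        # Placeholder for: forward pass -> loss -> backward pass -> weight update.
        iterations += 1
    print(f"epoch {epoch}: {iterations} iterations")  # 250 iterations per epoch
```

If the dataset size were not a multiple of the batch size, the last batch produced by this helper would simply be smaller than the others.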

## 12. Network Depth vs. Information Precision

The more convolutional layers we add, the more complex the detected features become.

### A. The Mathematical Intuition
It is actually quite natural.
Mathematically, a neural network is a **composition of non-linear functions**.
* One layer performs a simple transformation: $y = f(x)$.
* Ten layers perform a chain of transformations: $y = f_{10}(f_9(\dots f_1(x)\dots))$.

By stacking layers, we are building a function of **increasing complexity**.
Just like in math where composing simple functions allows you to describe complex curves, stacking convolutions allows the network to model extremely precise and intricate relationships in the data.

### B. The Hierarchy of Features
Because of this composition, the network learns in a hierarchical way. As we move deeper into the network (towards the output), the features become more abstract and "human-level":

1.  **First Layers:** They detect **simple geometry** (horizontal lines, vertical edges, color gradients).
2.  **Middle Layers:** They combine lines to detect **shapes and parts** (corners, curves, circles, textures).
3.  **Last Layers:** They combine shapes to recognize **complex objects** (eyes, mouths, ears, car wheels, faces).



**Conclusion:** The deeper we go, the more the network understands "concepts" (Is there an eye?) rather than just pixel values.

# CHAPTER 4: Practical Application

## 13. Case Study: Breast Cancer Detection 