## Convolutional Neural Network (CNN)
A Convolutional Neural Network (CNN) processes an input image through several **convolutional layers**, each applying multiple learnable filters (kernels). For example, the first convolutional layer might use 10 filters of size $6 \times 6 \times 3$ (height × width × depth, where the depth of 3 matches the RGB input channels), producing **10 feature maps**, one per filter, that capture different local patterns such as edges or textures.
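
To make the numbers concrete, here is a minimal sketch of the output shape and parameter count for that first layer. The 32×32 input size, stride of 1, and zero padding are assumptions chosen for illustration; only the 10 filters of size 6×6×3 come from the text. The helper names are hypothetical.

```python
def conv_output_shape(height, width, kernel, num_filters, stride=1, padding=0):
    """Return (out_height, out_width, out_channels) for a square kernel."""
    out_h = (height + 2 * padding - kernel) // stride + 1
    out_w = (width + 2 * padding - kernel) // stride + 1
    return out_h, out_w, num_filters


def conv_param_count(kernel, in_channels, num_filters):
    """Each filter has kernel*kernel*in_channels weights plus one bias."""
    return (kernel * kernel * in_channels + 1) * num_filters


# Assumed 32x32 RGB input; 10 filters of size 6x6x3 as in the example above.
shape = conv_output_shape(32, 32, kernel=6, num_filters=10)
params = conv_param_count(kernel=6, in_channels=3, num_filters=10)
print(shape)   # (27, 27, 10)
print(params)  # 1090
```

The output depth equals the number of filters, which is why 10 filters yield 10 feature maps.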

These feature maps then undergo **subsampling** (often via max pooling) to reduce spatial dimensions while keeping the most prominent features, improving translation invariance and reducing computation.
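
As a rough illustration of subsampling, the sketch below applies 2×2 max pooling with stride 2 to a single feature map stored as a nested list. The function name and the sample values are made up for this example; real frameworks provide optimized equivalents.

```python
def max_pool_2d(feature_map, size=2, stride=2):
    """Slide a size x size window over the map and keep the maximum of each window."""
    rows = len(feature_map)
    cols = len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        pooled_row = []
        for c in range(0, cols - size + 1, stride):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


# Hypothetical 4x4 feature map.
fmap = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]
pooled = max_pool_2d(fmap)
print(pooled)  # [[6, 2], [2, 9]]
```

The 4×4 map shrinks to 2×2, and the strongest activation in each window survives regardless of where inside the window it occurred.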

The process of **convolution → subsampling** repeats multiple times, with feature maps getting deeper (more channels) but spatially smaller as we move through the network.

Eventually, the resulting high-level feature maps are **flattened** into a vector and passed to one or more **fully connected layers**, which perform the final classification or regression task.
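
The flatten and fully connected steps can be sketched as follows. The feature maps, weights, and biases are hypothetical values chosen so the arithmetic is easy to follow; in a trained network the weights are learned.

```python
def flatten(feature_maps):
    """Turn a [channels][rows][cols] nested list into a flat vector."""
    return [v for channel in feature_maps for row in channel for v in row]


def fully_connected(vector, weights, biases):
    """One output per weight row: dot(weights_row, vector) + bias."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, biases)]


# Two hypothetical 2x2 feature maps (2 channels).
feature_maps = [
    [[1, 2], [3, 4]],
    [[0, 1], [1, 0]],
]
vector = flatten(feature_maps)
print(vector)  # [1, 2, 3, 4, 0, 1, 1, 0]

# Hypothetical weights for two output classes.
weights = [
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 1, 1, 0],
]
biases = [0.5, -1.0]
scores = fully_connected(vector, weights, biases)
print(scores)  # [1.5, 5.0]
```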

In short:

* **Width & height** of feature maps → decrease as we go deeper.
* **Depth (number of channels)** → increases as we go deeper.

---




![Typical CNN architecture](images/Typical_cnn.png)

![VGG-16 architecture](images/vgg-16.png)

## 1. The Structural Change: Resolution ↓, Depth ↑

As you move **deeper into a CNN**, two things typically happen:

1. **Spatial resolution decreases** —
   The width and height of the feature maps become smaller.
   This happens due to **strided convolutions** or **pooling** (e.g. max pooling with stride 2).

2. **Depth (number of channels) increases** —
   The number of filters in each layer increases (e.g. 64 → 128 → 256 → 512).
   Each filter captures a **different type of pattern** or **feature dimension**.
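
The two trends can be traced with a short sketch. It assumes a VGG-like layout in which every stage uses 3×3 convolutions with stride 1 and padding 1 (so height and width are unchanged) followed by 2×2 max pooling with stride 2; the 224×224 input size is also an assumption.

```python
def trace_shapes(height, width, channels_per_stage):
    """Return the (height, width, channels) shape after each conv + pool stage."""
    shapes = []
    for out_channels in channels_per_stage:
        # The padded 3x3 convolution keeps height and width;
        # the 2x2 max pool with stride 2 then halves both.
        height //= 2
        width //= 2
        shapes.append((height, width, out_channels))
    return shapes


stages = trace_shapes(224, 224, [64, 128, 256, 512])
for shape in stages:
    print(shape)
# (112, 112, 64)
# (56, 56, 128)
# (28, 28, 256)
# (14, 14, 512)
```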

---

## 2. What this means semantically

### Early layers — High resolution, low semantics

* Each neuron sees a **small receptive field** (a few pixels).
* The features represent **low-level visual cues**, such as:

  * Edges
  * Corners
  * Color blobs
  * Simple textures
* Because the resolution is high, spatial precision is retained — you know *where* the feature is.

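Mathematically, if the input is $X \in \mathbb{R}^{H \times W \times 3}$, the first convolutional layer computes:
$$
X_1 = f(W_1 * X + b_1)
$$
where $W_1$ and $b_1$ are the layer's filters and biases, $*$ denotes convolution, and $f$ is a nonlinearity such as ReLU. With a small kernel (e.g., $3 \times 3$), each output value depends only on a small neighborhood of pixels, so the features are local.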
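
A minimal single-channel sketch of this layer is shown below. It assumes $f$ is ReLU, uses no padding and a stride of 1, and computes the cross-correlation that deep learning libraries conventionally call convolution. The image and the vertical-edge kernel are hypothetical.

```python
def conv2d_relu(image, kernel, bias=0.0):
    """Valid (no padding) 2D cross-correlation followed by ReLU."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = bias
            for i in range(k):
                for j in range(k):
                    total += kernel[i][j] * image[r + i][c + j]
            row.append(max(0.0, total))  # ReLU
        output.append(row)
    return output


# Hypothetical 4x5 grayscale image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
# Hypothetical 3x3 kernel that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
edges = conv2d_relu(image, kernel)
print(edges)  # [[3.0, 3.0, 0.0], [3.0, 3.0, 0.0]]
```

The response is strong where the window straddles the edge and zero in the uniform bright region, which is the kind of low-level, well-localized cue early layers tend to learn.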

---

### Middle layers — Medium resolution, medium semantics

* Receptive fields expand: neurons start combining lower-level edges and patterns.
* They detect **parts of objects** (e.g. corners of a mouth, eyes, wheels).
* Still some spatial detail, but less than before.
* Representations become **more invariant** to small translations, rotations, and lighting changes.

---

### Deep layers — Low resolution, high semantics

* Receptive fields cover **almost the entire input image**.
* Each neuron responds to **high-level, abstract concepts** like:

  * “dog’s face”
  * “wheel”
  * “human torso”
  * “text region”
* You lose precise location information — instead, you gain **semantic meaning**.
* This is why deep features are useful for classification, but not directly for tasks that need fine localization (e.g. segmentation or detection).

Formally, the **theoretical receptive field** grows layer by layer as:
$$
r_l = r_{l-1} + (k_l - 1) \prod_{i=1}^{l-1} s_i
$$
with $r_0 = 1$ (a single input pixel), where:

* $r_l$: receptive field size at layer $l$
* $k_l$: kernel size at layer $l$
* $s_i$: stride at layer $i$
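
The recurrence above can be evaluated with a few lines of code. This sketch assumes $r_0 = 1$ for a single input pixel and a hypothetical VGG-like stack of (kernel, stride) pairs: two 3×3 convolutions, a 2×2 pool with stride 2, then the same pattern again.

```python
def receptive_fields(layers):
    """layers: list of (kernel, stride). Returns the receptive field after each layer."""
    r = 1      # r_0: one input pixel
    jump = 1   # product of the strides of all earlier layers
    sizes = []
    for kernel, stride in layers:
        r += (kernel - 1) * jump
        sizes.append(r)
        jump *= stride
    return sizes


# Hypothetical stack: conv3, conv3, pool2/2, conv3, conv3, pool2/2.
layers = [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1), (2, 2)]
rf = receptive_fields(layers)
print(rf)  # [3, 5, 6, 10, 14, 16]
```

Note how each 3×3 convolution adds 2 pixels before the first pooling layer but 4 pixels after it: every stride multiplies the growth contributed by all later layers.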

---

## 3. Why this tradeoff is useful

* **Decreasing spatial resolution** reduces computation and memory cost.
* **Increasing channel depth** allows the model to learn richer, more abstract representations.
* The combination enables hierarchical feature extraction — the hallmark of CNNs.
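
A back-of-the-envelope sketch of the cost argument: the number of multiply-accumulate operations (MACs) in a convolution scales with the output area times the kernel area times the input and output channel counts. The layer sizes below are assumptions in the style of VGG, and the count ignores biases and activations.

```python
def conv_macs(height, width, kernel, in_channels, out_channels):
    """MACs for a stride-1 convolution whose output is height x width."""
    return height * width * kernel * kernel * in_channels * out_channels


# Early layer: high resolution, few channels.
early = conv_macs(224, 224, kernel=3, in_channels=64, out_channels=64)
# Later layer: half the resolution, twice the channels.
late = conv_macs(112, 112, kernel=3, in_channels=128, out_channels=128)
# Same channel count as the later layer, but without any downsampling.
no_downsample = conv_macs(224, 224, kernel=3, in_channels=128, out_channels=128)

print(early == late)          # True
print(no_downsample // late)  # 4
```

Halving the resolution cuts the spatial term by 4, which exactly offsets doubling both channel counts. Without downsampling, the wider layer would cost four times as much.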

---

## 4. Relation to downstream tasks

| Task               | Information needed                  | Example adaptation                                            |
| ------------------ | ----------------------------------- | ------------------------------------------------------------- |
| **Classification** | Semantic meaning                    | Deep layers only (e.g., global average pooling + linear layer) |
| **Detection**      | Semantics + spatial location        | Multi-scale feature maps (FPN, SSD)                           |
| **Segmentation**   | High semantics + fine spatial detail | Encoder–Decoder (e.g., U-Net, DeepLab)                        |

In segmentation, for example, **skip connections** pass feature maps from earlier layers directly to the decoder, restoring fine-grained spatial information that was lost in downsampling.
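
One common form of skip connection, in the style of U-Net, upsamples the deep feature maps and concatenates them with the earlier ones along the channel axis. The sketch below uses nearest-neighbor upsampling on nested lists laid out as [channels][rows][cols]; the function names and values are hypothetical, and real decoders typically use learned upsampling followed by further convolutions.

```python
def upsample_nearest(channel, factor=2):
    """Repeat every value factor times along both rows and columns."""
    out = []
    for row in channel:
        wide = [v for v in row for _ in range(factor)]
        for _ in range(factor):
            out.append(list(wide))
    return out


def skip_concat(early, deep, factor=2):
    """Upsample the deep maps to the early resolution, then stack the channels."""
    upsampled = [upsample_nearest(ch, factor) for ch in deep]
    return early + upsampled


# Hypothetical early features: 1 channel at 4x4 (fine spatial detail).
early = [[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]]
# Hypothetical deep features: 2 channels at 2x2 (coarse but semantic).
deep = [
    [[0.1, 0.2], [0.3, 0.4]],
    [[1.0, 2.0], [3.0, 4.0]],
]
merged = skip_concat(early, deep)
print(len(merged), len(merged[0]), len(merged[0][0]))  # 3 4 4
print(merged[1][0])  # [0.1, 0.1, 0.2, 0.2]
```

The merged tensor has 3 channels at the early layer's 4×4 resolution, so later layers see both the precise locations and the coarse semantic signal.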

---
