## Convolutional Neural Network (CNN)
A Convolutional Neural Network (CNN) processes an input image through several **convolutional layers**, each applying multiple learnable filters (kernels). For example, the first convolution layer might use 10 filters of size $6 \times 6 \times 3$ (height × width × depth), producing **10 feature maps** that capture different local patterns such as edges or textures.
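To make those numbers concrete, here is a minimal sketch of the shape and parameter arithmetic for that first layer. The 32×32×3 input, stride 1, and zero padding are assumptions picked for illustration, and `conv_output_size` is a hypothetical helper name, not a library function.

```python
def conv_output_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

# Assumed input: a 32x32 RGB image (the text does not specify an input size).
in_h, in_w, in_depth = 32, 32, 3
num_filters, k = 10, 6  # 10 filters of size 6x6x3, as in the example above

out_h = conv_output_size(in_h, k)
out_w = conv_output_size(in_w, k)
output_shape = (out_h, out_w, num_filters)  # one feature map per filter

# Each filter has k * k * in_depth weights plus one bias term.
num_params = num_filters * (k * k * in_depth + 1)

print(output_shape)  # (27, 27, 10)
print(num_params)    # 1090
```

Note that the filter depth (3) has to match the input depth, while the number of filters (10) sets the output depth.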

These feature maps then undergo **subsampling** (often via max pooling), which reduces their spatial dimensions while keeping the strongest activations. This lowers computation and makes the network more tolerant of small shifts in the input.
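As a small illustration of max pooling, the sketch below applies a 2×2 window with stride 2 to a single hand-written feature map. The values and the helper name `max_pool_2x2` are made up for the example.

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2D list (height and width assumed even)."""
    pooled = []
    for r in range(0, len(feature_map), 2):
        row = []
        for c in range(0, len(feature_map[0]), 2):
            window = (
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            row.append(max(window))  # keep only the strongest activation
        pooled.append(row)
    return pooled

# Hypothetical 4x4 feature map.
fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 1],
]
pooled = max_pool_2x2(fmap)
print(pooled)  # [[4, 2], [2, 7]]
```

The 4×4 map shrinks to 2×2, and each output cell only records that a strong response occurred somewhere inside its window, not exactly where.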

The process of **convolution → subsampling** repeats multiple times, with feature maps getting deeper (more channels) but spatially smaller as we move through the network.

Eventually, the resulting high-level feature maps are **flattened** into a vector and passed to one or more **fully connected layers**, which perform the final classification or regression task.

In short:

* **Width & height** of feature maps → decrease as we go deeper.
* **Depth (number of channels)** → increases as we go deeper.
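
A quick way to see both trends is to trace shapes through a toy stack. The sketch below assumes a 64×64×3 input and three stages, each a 3×3 "same" convolution followed by 2×2 pooling. The filter counts are invented for illustration.

```python
# Assumed toy network: each stage = 3x3 "same" convolution, then 2x2 max pooling.
shape = (64, 64, 3)            # (height, width, channels), an assumed input
stage_channels = [16, 32, 64]  # assumed number of filters per stage

trace = [shape]
for out_channels in stage_channels:
    h, w, _ = shape
    after_conv = (h, w, out_channels)               # "same" conv keeps H and W, sets depth
    shape = (after_conv[0] // 2, after_conv[1] // 2, after_conv[2])  # pooling halves H and W
    trace.append(shape)

for s in trace:
    print(s)
# (64, 64, 3)
# (32, 32, 16)
# (16, 16, 32)
# (8, 8, 64)

# Flattening the last feature maps gives the vector fed to the fully connected layers.
flattened_length = shape[0] * shape[1] * shape[2]
print(flattened_length)  # 4096
```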

---




![Typical CNN architecture](images/Typical_cnn.png)

![VGG-16 network diagram](images/vgg-16.png)

![VGG-16 architecture](images/VGG-16-architecture.jpg)

## How to read and interpret a network architecture


When you see a diagram of a network architecture like **U-Net** or **ResNet**, the goal is to translate that picture (or description) into:

1. **What’s happening to the data at each stage** (shape, type, and meaning).
2. **Why those stages are there** (function and design reasoning).
3. **How the whole thing solves the intended problem** (semantic flow).

Here’s a structured way to read and interpret them:

---

## **1. Identify the big picture**

Before diving into layers:

* **Task type** – Is this classification, segmentation, detection, reconstruction, etc.?

  * U-Net → segmentation.
  * ResNet → classification (originally), but adaptable.
* **Data type** – Images, video, 3D volumes, etc.? This determines the convolution type (e.g. 2D for images, 3D for volumes) and the expected input shape.
* **Input & output shapes** – e.g. U-Net may take a `256×256×3` image and output `256×256×C` where `C` = number of segmentation classes.
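
To read an `H×W×C` segmentation output, it helps to remember that each pixel carries `C` class scores, and the predicted label is typically the highest-scoring class. Below is a minimal sketch on a made-up 2×2 output with `C = 3`. The scores are hypothetical.

```python
# Hypothetical 2x2 output with C = 3 class scores per pixel (shape 2x2x3).
scores = [
    [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]],
    [[0.2, 0.2, 0.6], [0.3, 0.4, 0.3]],
]

def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

# Collapse the channel axis: H x W x C scores -> H x W label map.
label_map = [[argmax(pixel) for pixel in row] for row in scores]
print(label_map)  # [[1, 0], [2, 1]]
```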

---

## **2. Break the architecture into high-level blocks**

Architectures are rarely “just a list of layers.” They usually have **modules**:

* **Encoder / backbone** – progressively reduces spatial size, increases channels (feature depth).
* **Bottleneck** – most compressed representation.
* **Decoder / head** – progressively upsamples, merges, and produces output.

For example:

* **U-Net** – Encoder → Bottleneck → Decoder with skip connections between matching resolutions.
* **ResNet** – Stack of “Residual Blocks” arranged in stages, gradually shrinking resolution but increasing channels.
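
The encoder → bottleneck → decoder pattern can be sketched as a pure shape trace. This is an illustrative U-Net-style layout, not a specific published configuration: it assumes padded ("same") convolutions, 2× down/upsampling per level, channel doubling per level, and a hypothetical helper `unet_shape_trace`.

```python
def unet_shape_trace(size=256, base_channels=64, depth=4, num_classes=2):
    """Return a list of (stage name, (H, W, C)) for a U-Net-style network.

    Assumes "same" convolutions, 2x resampling per level, and channel
    doubling per level; all of these are illustrative choices.
    """
    trace, skips = [], []
    channels = base_channels
    # Encoder: convolve, remember the features for the skip, then downsample.
    for level in range(depth):
        trace.append((f"encoder {level}", (size, size, channels)))
        skips.append((size, channels))
        size //= 2
        channels *= 2
    trace.append(("bottleneck", (size, size, channels)))
    # Decoder: upsample, concatenate the matching skip, then convolve.
    for level in reversed(range(depth)):
        skip_size, skip_channels = skips[level]
        size *= 2
        channels //= 2  # the upsampling step halves the channel count
        assert size == skip_size  # skip and decoder features share a resolution
        trace.append((f"decoder {level} (after concat)", (size, size, channels + skip_channels)))
        trace.append((f"decoder {level} (after convs)", (size, size, channels)))
    trace.append(("output 1x1 conv", (size, size, num_classes)))
    return trace

trace = unet_shape_trace()
for name, shape in trace:
    print(f"{name:32s} {shape}")
```

Reading the printed trace top to bottom shows the resolution shrinking to 16×16 at the bottleneck and returning to 256×256 at the output, with channel counts jumping at every concatenation.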

---

## **3. Understand the transformations per block**

For each module:

* **What’s the input shape?**
* **What layers are applied?** (Conv, BatchNorm, Activation, Pooling, etc.)
* **How does shape change?**

  * Convolution with stride > 1 → downsampling.
  * Pooling → downsampling.
  * Transposed convolution / upsampling → upsampling.
* **What is the role?**

  * Convs → extract local patterns.
  * Pooling / stride → abstract and compress information.
  * Skip connections → preserve details, help gradient flow.
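
The shape changes above follow simple per-axis formulas. The sketch below uses the usual convolution arithmetic (dilation 1). The 64-pixel input and the kernel/stride/padding choices are assumptions for illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size of a convolution or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

def transposed_conv_out(size, kernel, stride=1, padding=0, output_padding=0):
    """Output size of a transposed convolution along one axis."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

size = 64  # assumed feature-map height/width
down_conv = conv_out(size, kernel=3, stride=2, padding=1)  # strided convolution
down_pool = conv_out(size, kernel=2, stride=2)             # 2x2 max pooling
up = transposed_conv_out(down_conv, kernel=2, stride=2)    # transposed convolution

print(down_conv, down_pool, up)  # 32 32 64
```

Both downsampling options halve the resolution here, and the stride-2 transposed convolution restores it.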

---

## **4. Watch the “information highway”**

Some architectures have *special wiring*:

* **Skip connections (ResNet)** → Shortcuts that add input to output of a block (helps train deep nets).
* **Concatenation skips (U-Net)** → Copy encoder features to decoder to restore detail.
* **Multi-scale paths (FPN, U-Net++)** → Combine features from different scales.

Interpretation tip:

* Ask: *“Where does the data flow in parallel?”*
* Ask: *“What is merged and how (addition, concatenation)?”*
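
The two merge styles have different shape rules, which is often the fastest way to tell them apart in a diagram. Here is a minimal sketch with hypothetical helpers `merge_add` and `merge_concat` operating on `(H, W, C)` tuples:

```python
def merge_add(shape_a, shape_b):
    """Element-wise addition (ResNet-style): shapes must match exactly."""
    if shape_a != shape_b:
        raise ValueError(f"cannot add {shape_a} and {shape_b}")
    return shape_a

def merge_concat(shape_a, shape_b):
    """Channel concatenation (U-Net-style): H and W must match, channels add up."""
    if shape_a[:2] != shape_b[:2]:
        raise ValueError(f"spatial sizes differ: {shape_a} vs {shape_b}")
    return (shape_a[0], shape_a[1], shape_a[2] + shape_b[2])

added = merge_add((56, 56, 64), (56, 56, 64))
concatenated = merge_concat((64, 64, 128), (64, 64, 128))
print(added)         # (56, 56, 64)
print(concatenated)  # (64, 64, 256)
```

Addition leaves the channel count unchanged, while concatenation grows it, so the next convolution after a concatenation has to accept the wider input.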

---

## **5. Pay attention to design patterns**

Many architectures reuse “building blocks”:

* **Residual Block** – Conv → BN → ReLU → Conv → BN → Add input → ReLU. (Deeper variants such as ResNet-50 use a three-conv “bottleneck” form: 1×1 → 3×3 → 1×1.)
* **Inverted Residual (MobileNetV2)** – Expand → Depthwise Conv → Project.
* **Double Convolution (U-Net)** – Two `3×3` convs with BN/ReLU.

Once you recognize a block, you can “mentally compress” the diagram.
Example: ResNet-50 is better read as 4 main stages with 3, 4, 6, and 3 bottleneck blocks, not just “50 layers.”
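
The "50" can be recovered from the stage structure. The sketch below uses the common counting convention (weighted layers on the main path only, so shortcut projections are not counted). The variable names are just for illustration.

```python
# Bottleneck blocks per stage in ResNet-50.
blocks_per_stage = [3, 4, 6, 3]
convs_per_block = 3    # 1x1 -> 3x3 -> 1x1 inside each bottleneck block

stem_layers = 1        # the initial 7x7 convolution
classifier_layers = 1  # the final fully connected layer

total_layers = stem_layers + sum(blocks_per_stage) * convs_per_block + classifier_layers
print(total_layers)  # 50
```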

---

## **6. Use the “shape trace” method**

This is my go-to trick for making sense of *any* architecture:

| Step | Layer/Block                    | Input Shape | Operation            | Output Shape | Notes                 |
| ---- | ------------------------------ | ----------- | -------------------- | ------------ | --------------------- |
| 1    | Conv 7×7 s=2                   | 224×224×3   | ↓ spatial            | 112×112×64   | Early feature extract |
| 2    | MaxPool 3×3 s=2                | 112×112×64  | ↓ spatial            | 56×56×64     | ...                   |
| 3    | Residual (bottleneck) Block ×3 | 56×56×64    | same res, ↑ channels | 56×56×256    | ...                   |

If you do this from start to finish, the architecture stops being “mystical” and becomes a sequence of shape changes.
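
The first two rows of the table can be reproduced with a few lines of code, which is handy when a diagram omits intermediate shapes. The padding values (3 for the 7×7 convolution, 1 for the pooling) are assumptions that are not stated in the table, and `window_out` is a hypothetical helper.

```python
def window_out(size, kernel, stride, padding):
    """Output size of a convolution or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

# Each entry: (name, kernel, stride, padding, out_channels or None to keep channels).
# Padding values are assumed; they are not given in the table.
layers = [
    ("Conv 7x7 s=2", 7, 2, 3, 64),
    ("MaxPool 3x3 s=2", 3, 2, 1, None),
]

shape = (224, 224, 3)  # (height, width, channels)
rows = []
for step, (name, kernel, stride, padding, out_channels) in enumerate(layers, start=1):
    h, w, c = shape
    out_shape = (
        window_out(h, kernel, stride, padding),
        window_out(w, kernel, stride, padding),
        out_channels if out_channels is not None else c,
    )
    rows.append((step, name, shape, out_shape))
    shape = out_shape  # the output of this layer feeds the next one

for step, name, in_shape, out_shape in rows:
    print(f"{step} | {name:16s} | {in_shape} -> {out_shape}")
```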

---

## **7. Ask “why” after “what”**

Once you decode the structure:

* **Why this downsampling rate?** (task might need detail vs abstraction)
* **Why skip connections?** (gradient stability, detail preservation)
* **Why so many channels in the bottleneck?** (spatial resolution is lowest there, so the channels carry the high-level features)

---

## **Example: Quick mental model**

* **U-Net** – Think: *compress (encoder), store detail aside (skip), expand (decoder), stitch back detail.*
* **ResNet** – Think: *standard CNN, but every block learns the **change** needed, not the full mapping.*
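
The ResNet mental model can be written as `y = ReLU(x + F(x))`, where `F` is whatever the block's layers compute. The toy sketch below uses plain lists and hand-picked residuals instead of learned convolutions, purely to show the behavior.

```python
def relu(values):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, v) for v in values]

def residual_block(x, residual_fn):
    """y = ReLU(x + F(x)): the block only has to model the change F(x)."""
    fx = residual_fn(x)
    return relu([a + b for a, b in zip(x, fx)])

x = [1.0, 2.0, 3.0]

# If the residual is zero, the block passes its (non-negative) input through unchanged.
identity_out = residual_block(x, lambda v: [0.0] * len(v))
print(identity_out)  # [1.0, 2.0, 3.0]

# A small residual nudges the input instead of replacing it.
nudged_out = residual_block(x, lambda v: [0.5, -0.5, 0.0])
print(nudged_out)  # [1.5, 1.5, 3.0]
```

Because "do nothing" corresponds to a zero residual, stacking many such blocks does not force each one to relearn the identity mapping.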

---


