# VGG-19

VGG-19 has the following pattern in the first block:

**Input (RGB image)** → Conv(3 → 64) → Conv(64 → 64) → MaxPool

---

### 1. Why are there **two conv layers with 64 feature maps each**?

* The **first conv layer** takes the raw RGB channels (3) and learns 64 different filters, so the output has **64 feature maps**.
* The **second conv layer** then applies 64 new filters, each seeing *all 64 previous feature maps* as input. This lets the network build richer, more abstract features without yet reducing spatial resolution.
* So the second layer is not “repeating the same 64 maps.” The two layers do different jobs:

  * First layer: detect low-level patterns (edges, colors, textures).
  * Second layer: combine them into more complex local structures. The channel count stays at 64, but each output value now depends on a 5×5 neighborhood of the input image rather than a 3×3 one, all before any downsampling.

In other words, **keeping the same number of channels but stacking multiple convs increases depth of processing at the same spatial scale**. This improves expressive power.
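To make the "each filter sees all previous feature maps" point concrete, here is a minimal pure-Python sketch of a multi-channel convolution (stride 1, zero padding, no bias, no ReLU). The function name `conv2d_naive` and the toy shapes (3 input channels, 2 output channels, a 4×4 image) are illustrative assumptions, not how VGG is actually implemented; the only point is that every output channel sums over every input channel.

```python
def conv2d_naive(x, weights, padding=1):
    """Naive stride-1 convolution (cross-correlation, as used in CNNs).

    x:       nested list with shape [c_in][H][W]
    weights: nested list with shape [c_out][c_in][k][k]
    returns: nested list with shape [c_out][H_out][W_out]
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    k = len(weights[0][0])
    h_out = h + 2 * padding - k + 1
    w_out = w + 2 * padding - k + 1

    def pixel(c, i, j):
        # Zero padding: positions outside the image count as 0.
        return x[c][i][j] if 0 <= i < h and 0 <= j < w else 0.0

    out = []
    for kernel in weights:  # one kernel per output channel
        fmap = [[0.0] * w_out for _ in range(h_out)]
        for i in range(h_out):
            for j in range(w_out):
                total = 0.0
                for c in range(c_in):  # every input channel contributes
                    for di in range(k):
                        for dj in range(k):
                            total += kernel[c][di][dj] * pixel(
                                c, i + di - padding, j + dj - padding
                            )
                fmap[i][j] = total
        out.append(fmap)
    return out


# Toy example: all-ones input (3 channels, 4x4) and all-ones 3x3 kernels.
x = [[[1.0] * 4 for _ in range(4)] for _ in range(3)]
weights = [[[[1.0] * 3 for _ in range(3)] for _ in range(3)] for _ in range(2)]
y = conv2d_naive(x, weights)
print(len(y), len(y[0]), len(y[0][0]))  # output channels, height, width
```

With these toy values, an interior output pixel equals 27 (9 positions × 3 input channels), which shows that one kernel mixes all input channels into a single output map.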

---

### 2. What happens to **H and W**?

* Both conv layers in VGG use **3×3 kernels, stride = 1, padding = 1**.
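* Formula for conv output size, with kernel size $k$, stride $s$, and padding $p$ (the floor only matters when $s > 1$):

$$
H_{out} = \left\lfloor \frac{H_{in} + 2p - k}{s} \right\rfloor + 1
$$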

With $k=3, s=1, p=1$, we get:

$$
H_{out} = \frac{H_{in} + 2 - 3}{1} + 1 = H_{in}
$$

The same holds for $W$.
So after each 3×3 conv, the spatial size **stays the same**. Only the **channel depth changes** (3 → 64 → 64).

* After the two convs, a **max-pool (2×2, stride=2)** halves the H and W.

  * Example: $224 \times 224 \times 3$ input →
    Conv → $224 \times 224 \times 64$ →
    Conv → $224 \times 224 \times 64$ →
    MaxPool → $112 \times 112 \times 64$.
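The shape trace above can be checked with a few lines of Python. This is a minimal sketch: the helper names `conv_out` and `pool_out` are made up for illustration, and the pooling helper assumes no padding.

```python
def conv_out(size, k=3, s=1, p=1):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * p - k) // s + 1


def pool_out(size, k=2, s=2):
    """Output size of a pooling layer along one dimension (no padding)."""
    return (size - k) // s + 1


h = w = 224
shapes = [(h, w, 3)]            # input RGB image
h, w = conv_out(h), conv_out(w)
shapes.append((h, w, 64))       # Conv(3 -> 64)
h, w = conv_out(h), conv_out(w)
shapes.append((h, w, 64))       # Conv(64 -> 64)
h, w = pool_out(h), pool_out(w)
shapes.append((h, w, 64))       # MaxPool 2x2, stride 2

for shape in shapes:
    print(shape)
```

Changing `k`, `s`, or `p` in `conv_out` shows how quickly the spatial size would shrink without the padding of 1.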

---


## 1. What does **kernel size** mean?

* A **kernel (filter)** is the small sliding window used in a convolution.
* Its size is written as $k \times k$. Examples:

  * **3×3 kernel** → looks at a 3×3 patch of the image (or feature map).
  * **5×5 kernel** → looks at a 5×5 patch.
  * **7×7 kernel** → looks at a 7×7 patch.

The choice of kernel size affects the **receptive field**:

* 3×3 sees very local details (edges, textures).
* 5×5 sees slightly larger structures.
* 7×7 sees broader patterns (but costs more parameters).

VGG chose **3×3 kernels stacked multiple times** instead of using 5×5 or 7×7, because:

* Two 3×3 layers = receptive field of 5×5, but with fewer weights ($2 \cdot 9C^2 = 18C^2$ versus $25C^2$ for $C$ input and output channels) and an extra non-linearity (ReLU in between).
* Three 3×3 layers = receptive field of 7×7, with $27C^2$ weights versus $49C^2$ and two extra non-linearities.
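The trade-off between stacked 3×3 layers and one larger kernel can be verified numerically. The sketch below is illustrative only: the helper names are hypothetical, biases are ignored, and it assumes stride 1 with the same channel count `C` going in and out of every layer (64 is chosen to match the first VGG block).

```python
def stacked_receptive_field(n_layers, k=3):
    """Receptive field of n stacked k x k convs with stride 1."""
    return 1 + n_layers * (k - 1)


def conv_weights(k, channels):
    """Weight count of one k x k conv with `channels` in and out (no bias)."""
    return k * k * channels * channels


C = 64  # assumed channel count
for n_layers, big_k in [(2, 5), (3, 7)]:
    rf = stacked_receptive_field(n_layers)
    stacked = n_layers * conv_weights(3, C)
    single = conv_weights(big_k, C)
    print(f"{n_layers} x (3x3): receptive field {rf}x{rf}, "
          f"{stacked} weights vs {single} for one {big_k}x{big_k}")
```

For `C = 64`, two 3×3 layers use 73,728 weights against 102,400 for a single 5×5 layer, and three use 110,592 against 200,704 for a single 7×7 layer.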

---

## 2. How many kernels do we need in the first conv layer of VGG?

Input: **RGB image** → $H \times W \times 3$.
First conv layer: 64 output channels.

* Each output channel is produced by **one kernel**.
* Since the input has 3 channels, each kernel must also have **3 channels**.
* So the shape of one kernel is:

  $$
  3 \times 3 \times 3
  $$
* To get 64 output channels, we need **64 different kernels** of shape $3 \times 3 \times 3$.

So total parameters in the first conv =

$$
(3 \times 3 \times 3) \times 64 + 64 \ \text{(bias terms)} = 1{,}792
$$

---

## 3. What about the **second conv layer (64→64)?**

Now the input has 64 channels.
We want 64 outputs again.

* Each kernel is now: $3 \times 3 \times 64$.
* Each output channel has one such kernel.
* We need 64 kernels.

So parameters =

$$
(3 \times 3 \times 64) \times 64 + 64 = 36{,}928
$$
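Both parameter counts follow from the same rule: kernel area × input channels × output channels, plus one bias per output channel. A small sketch, where the helper name `conv_params` is an assumption made for this example:

```python
def conv_params(k, c_in, c_out, bias=True):
    """Number of learnable parameters in a k x k convolution layer."""
    weights = k * k * c_in * c_out
    return weights + (c_out if bias else 0)


first = conv_params(3, 3, 64)    # Conv(3 -> 64)
second = conv_params(3, 64, 64)  # Conv(64 -> 64)
print(first, second)  # 1792 36928
```

The second layer has about 20 times more parameters than the first, purely because each of its kernels spans 64 input channels instead of 3.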

---

**Summary**:

* Kernel size = spatial window (3×3, 5×5, 7×7).
* In conv layers:

  * 3 input channels → each kernel has 3 channels.
  * 64 output channels → we need 64 kernels.
* First VGG conv: 64 kernels of shape $3 \times 3 \times 3$.
* Second VGG conv: 64 kernels of shape $3 \times 3 \times 64$.

---



<img src='images/06_03.png'/>
<img src='images/06_09.png'>
<img src='images/3_channel_conv.gif'>