# Concepts: Convolution and CNNs

Understanding how convolutional neural networks see images.

This notebook covers the **theory** behind CNNs. You'll apply these concepts in [09_lab_convolutional_networks](09_lab_convolutional_networks.ipynb).

---

## Why CNNs for Images?

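A regular (fully connected) neural network flattens the image and gives every pixel its own weight.

**Problems:**
- A 224×224 RGB image has 224 × 224 × 3 = 150,528 inputs, so even a single dense layer needs a huge number of weights
- Ignores spatial structure (nearby pixels are related)
- Not translation invariant (a cat in the corner and a cat in the center activate entirely different weights)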

**CNNs solve this** by:
1. Using small **kernels** that slide across the image
2. Sharing weights across all positions
3. Building hierarchical features (edges → shapes → objects)
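
To make the effect of weight sharing concrete, here is a rough parameter count for a first layer applied to a 224×224 RGB image. The layer sizes (1,000 dense units, 32 filters of size 3×3) are illustrative assumptions, not values taken from this notebook.

```python
# Parameter counts for a first layer on a 224×224 RGB image.
# The layer sizes below are illustrative assumptions.
height, width, channels = 224, 224, 3
n_inputs = height * width * channels          # 150,528 inputs

# Dense layer: every unit has one weight per input, plus one bias.
dense_units = 1000
dense_params = n_inputs * dense_units + dense_units

# Conv layer: each filter has kernel × kernel × channels weights plus one bias,
# and the same weights are reused at every position in the image.
kernel_size, n_filters = 3, 32
conv_params = (kernel_size * kernel_size * channels + 1) * n_filters

print(f"Inputs:       {n_inputs:,}")       # 150,528
print(f"Dense params: {dense_params:,}")   # 150,529,000
print(f"Conv params:  {conv_params:,}")    # 896
```

The convolutional layer's parameter count does not depend on the image size at all, only on the kernel size and the number of input and output channels.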

---

## What is Convolution?

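Convolution is a mathematical operation that slides a small **kernel** (filter) across an image and computes one output value at each position. (Most deep learning libraries skip the kernel flip used in the textbook definition, which technically makes the operation cross-correlation; because the kernels are learned, this makes no practical difference.)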

```
     INPUT IMAGE              KERNEL (3×3)           OUTPUT
     
    ┌─────────────┐          ┌───────┐
    │ 1  2  3  0  │          │ 1 0 1 │
    │ 0  1  2  3  │    *     │ 0 1 0 │    =    Feature Map
    │ 3  0  1  2  │          │ 1 0 1 │
    │ 2  3  0  1  │          └───────┘
    └─────────────┘
```

At each position: **element-wise multiply, then sum**.
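
As a minimal sketch, the multiply-and-sum step can be written out with plain NumPy loops for the 4×4 input and 3×3 kernel in the diagram. The helper name `slide_kernel` is hypothetical, and the sketch assumes no padding and a stride of 1. The kernel here is symmetric, so flipping it (as textbook convolution does) would give the same result.

```python
import numpy as np

image = np.array([[1, 2, 3, 0],
                  [0, 1, 2, 3],
                  [3, 0, 1, 2],
                  [2, 3, 0, 1]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

def slide_kernel(image, kernel):
    """Slide the kernel over the image (no padding, stride 1). Hypothetical helper."""
    k_h, k_w = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    output = np.zeros((out_h, out_w), dtype=image.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + k_h, j:j + k_w]    # patch under the kernel
            output[i, j] = np.sum(window * kernel)  # element-wise multiply, then sum
    return output

feature_map = slide_kernel(image, kernel)
print(feature_map)
# [[9 6]
#  [4 9]]
```

For the top-left position, the kernel picks out the four corners and the center of the patch: 1 + 3 + 3 + 1 + 1 = 9.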

**Key insight:** In CNNs, the network **learns** the best kernels for the task. We don't hand-design them.

---

## Common Kernels and Their Effects

Different kernels detect different features:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import ndimage

# Create a simple test image
image = np.zeros((100, 100))
image[30:70, 30:70] = 1  # White square
image[40:60, 40:60] = 0.5  # Gray inner square

# Define kernels
kernels = {
    'Original': None,
    'Edge (Horizontal)': np.array([[-1, -1, -1], [0, 0, 0], [1, 1, 1]]),
    'Edge (Vertical)': np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]),
    'Blur': np.ones((3, 3)) / 9,
    'Sharpen': np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]]),
}

fig, axes = plt.subplots(1, 5, figsize=(15, 3))

for ax, (name, kernel) in zip(axes, kernels.items()):
    if kernel is None:
        ax.imshow(image, cmap='gray')
    else:
        filtered = ndimage.convolve(image, kernel)
        ax.imshow(filtered, cmap='gray')
    ax.set_title(name)
    ax.axis('off')

plt.tight_layout()
plt.show()
```

---

## Padding and Stride

### Padding

Without padding ("valid" convolution), the output is smaller than the input: a k×k kernel shrinks each dimension by k − 1.

```
Input: 5×5          Kernel: 3×3          Output: 3×3 (smaller!)
```

**"Same" padding** adds zeros around the border so that, at stride 1, the output keeps the same size as the input.
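
The two padding modes can be checked with a short NumPy sketch. It assumes an odd kernel size and a stride of 1, in which case "same" padding amounts to `(kernel_size - 1) // 2` zeros on every side.

```python
import numpy as np

image = np.arange(25).reshape(5, 5)   # 5×5 input
kernel_size = 3

# "Valid" (no padding): each dimension shrinks by kernel_size - 1.
valid_size = image.shape[0] - kernel_size + 1

# "Same": pad zeros on every side (assumes an odd kernel and stride 1).
pad = (kernel_size - 1) // 2
padded = np.pad(image, pad_width=pad, mode="constant", constant_values=0)
same_size = padded.shape[0] - kernel_size + 1

print(padded.shape)            # (7, 7)
print(valid_size, same_size)   # 3 5
```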

### Stride

**Stride** controls how far the kernel moves at each step.

- Stride 1: Move 1 pixel at a time (default)
- Stride 2: Move 2 pixels at a time (roughly halves the output width and height)

```
Stride = 1                    Stride = 2
───────────                   ───────────
Step 1: [x x x] _ _          Step 1: [x x x] _ _
Step 2: _ [x x x] _          Step 2: _ _ [x x x]
Step 3: _ _ [x x x]          (done - only 2 steps)
```
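
Padding and stride together determine the output size. A common way to write this, for an input of length n, kernel size k, padding p on each side, and stride s, is floor((n + 2p − k) / s) + 1 along each dimension. The helper below is a hypothetical sketch of that formula; it assumes symmetric padding, which may differ from how a given framework implements `'same'` padding at strides above 1.

```python
def conv_output_size(n, kernel_size, stride=1, padding=0):
    """Output length along one dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel_size) // stride + 1

# The 5-wide row from the diagram above, with a 3-wide kernel:
print(conv_output_size(5, 3, stride=1))   # 3 steps
print(conv_output_size(5, 3, stride=2))   # 2 steps

# A 224-pixel dimension with a 3×3 kernel, stride 2, and 1 pixel of padding:
print(conv_output_size(224, 3, stride=2, padding=1))   # 112
```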

---

## Pooling

Pooling **downsamples** the feature maps, which reduces computation and makes the output less sensitive to small shifts in the input (a limited form of translation invariance).

### Max Pooling

Take the maximum value in each region. With a 2×2 window and a stride of 2, the regions do not overlap and each dimension is halved.

```
Input (4×4)              Max Pool 2×2              Output (2×2)

┌───┬───┬───┬───┐                                 ┌───┬───┐
│ 1 │ 3 │ 2 │ 1 │                                 │ 4 │ 6 │
├───┼───┼───┼───┤        ──────────▶              ├───┼───┤
│ 4 │ 2 │ 6 │ 4 │         (take max)              │ 8 │ 7 │
├───┼───┼───┼───┤                                 └───┴───┘
│ 1 │ 8 │ 3 │ 7 │
├───┼───┼───┼───┤
│ 5 │ 3 │ 2 │ 1 │
└───┴───┴───┴───┘
```
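
The example above can be reproduced with a NumPy reshape. This is a minimal sketch that assumes non-overlapping 2×2 windows and an input whose sides are divisible by 2.

```python
import numpy as np

x = np.array([[1, 3, 2, 1],
              [4, 2, 6, 4],
              [1, 8, 3, 7],
              [5, 3, 2, 1]])

# Split rows and columns into blocks of 2, then take the max within each block.
# Shape (4, 4) -> (2, 2, 2, 2): (block_row, row_in_block, block_col, col_in_block).
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[4 6]
#  [8 7]]
```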

---

## CNN Architecture

A typical CNN has this structure:

```
┌──────────────────────────────────────────────────────────────────────────┐
│                           CNN ARCHITECTURE                               │
│                                                                          │
│   Image ──▶ [Conv+ReLU] ──▶ [Pool] ──▶ [Conv+ReLU] ──▶ [Pool] ──▶ ...   │
│                                                                          │
│             └─────────── FEATURE EXTRACTION ───────────┘                 │
│                                                                          │
│   ... ──▶ [Flatten] ──▶ [Dense] ──▶ [Dense] ──▶ [Softmax] ──▶ Classes   │
│                                                                          │
│           └─────────────── CLASSIFICATION ──────────────┘                │
└──────────────────────────────────────────────────────────────────────────┘
```

**Typical pattern:**
1. **Conv → ReLU → Pool** (repeat several times)
2. **Flatten** (reshape the stack of feature maps into a 1D vector)
3. **Dense layers** (standard neural network)
4. **Softmax** (output probabilities)
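
To see how the pattern above changes the data, the sketch below traces tensor shapes through a small network without using any deep learning library. Everything specific here is an assumption for illustration: the 32×32 RGB input, the filter counts (32 and 64), 3×3 convolutions with "same" padding, 2×2 pooling, and the helper names `conv_same` and `pool_2x2`.

```python
def conv_same(shape, filters):
    """3×3 conv, 'same' padding, stride 1: size unchanged, channels = filters."""
    h, w, _ = shape
    return (h, w, filters)

def pool_2x2(shape):
    """2×2 pooling with stride 2: halves height and width, keeps channels."""
    h, w, c = shape
    return (h // 2, w // 2, c)

shape = (32, 32, 3)                     # assumed input: 32×32 RGB image
trace = [("Input", shape)]

for filters in (32, 64):                # two Conv → ReLU → Pool blocks
    shape = conv_same(shape, filters)
    trace.append((f"Conv+ReLU ({filters} filters)", shape))
    shape = pool_2x2(shape)
    trace.append(("Pool", shape))

flat_size = shape[0] * shape[1] * shape[2]
trace.append(("Flatten", (flat_size,)))
trace.append(("Dense", (128,)))         # assumed hidden layer size
trace.append(("Dense + Softmax", (10,)))  # assumed 10 classes

for name, s in trace:
    print(f"{name:<26} {s}")
```

Height and width shrink at each pooling step while the number of channels grows, and flattening turns the final 8×8×64 stack into a vector of 4,096 values.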

---

## Feature Hierarchy

As we go deeper, the network learns increasingly complex features:

```
Layer 1         Layer 2         Layer 3         Layer 4
─────────       ─────────       ─────────       ─────────
  Edges    →    Textures   →    Parts      →    Objects
  Lines         Patterns        Eyes, ears      Faces
  Corners       Gradients       Wheels          Cars
```

This hierarchical feature learning is why CNNs are so powerful for vision tasks.

---

## Key Parameters

| Parameter | Description | Typical values |
|-----------|-------------|----------------|
| **Filters** | Number of kernels (output channels) | 32, 64, 128 |
| **Kernel size** | Size of the convolution window | 3×3, 5×5 |
| **Stride** | Step size for sliding the kernel | 1, 2 |
| **Padding** | How to handle borders | 'same', 'valid' |
| **Pool size** | Size of pooling window | 2×2 |

---

## Transfer Learning (Preview)

Training a large CNN from scratch often requires:
- Very large labeled datasets (often millions of images)
- Hours to days of training
- Expensive hardware (GPUs)

**Transfer learning** solves this by reusing a model trained on a large dataset (like ImageNet).

```
PRETRAINED MODEL (e.g., MobileNet trained on ImageNet)
───────────────────────────────────────────────────────

[Conv layers]  →  [Dense]  →  [1000 classes]
   (frozen)       (remove)     (remove)
       ↓
YOUR MODEL
──────────

[Conv layers]  →  [Your Dense]  →  [Your classes]
   (frozen)        (trainable)      (e.g., 5 flowers)
```

This is covered in the CNN Lab notebook.
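
Until then, the idea of "frozen base, trainable head" can be illustrated with a toy NumPy sketch. This is not a real pretrained model: `W_frozen` is a random matrix standing in for pretrained convolutional layers, the data are random, and the sizes and learning rate are arbitrary assumptions. The point is only that gradient updates touch the new head and leave the base unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained feature extractor (assumption: a fixed random matrix).
W_frozen = rng.normal(size=(8, 4))          # 8 input values -> 4 features
W_frozen_before = W_frozen.copy()

# New classification head, trained from scratch.
W_head = np.zeros((4, 3))                   # 4 features -> 3 classes

# Toy dataset (random, for illustration only).
X = rng.normal(size=(30, 8))
y = rng.integers(0, 3, size=30)
Y = np.eye(3)[y]                            # one-hot labels

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(W_head):
    features = np.maximum(X @ W_frozen, 0)  # frozen base + ReLU
    probs = softmax(features @ W_head)
    return -np.mean(np.log(probs[np.arange(len(y)), y]))

initial_loss = cross_entropy(W_head)
for _ in range(200):
    features = np.maximum(X @ W_frozen, 0)  # forward pass through the frozen base
    probs = softmax(features @ W_head)
    grad_head = features.T @ (probs - Y) / len(X)
    W_head -= 0.05 * grad_head              # update the head only
final_loss = cross_entropy(W_head)

print(f"Loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
print("Base unchanged:", np.array_equal(W_frozen, W_frozen_before))
```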

---

## Key Takeaways

1. **Convolution** slides a kernel across an image, detecting local patterns
2. **Kernels** are learned (not hand-designed) to detect useful features
3. **Pooling** reduces spatial size and adds some tolerance to small translations
4. **CNNs learn hierarchically:** edges → textures → parts → objects
5. **Transfer learning** lets you use pretrained models on small datasets

**Next:** Build CNNs in [09_lab_convolutional_networks](09_lab_convolutional_networks.ipynb)