# Convolutional Networks (CNNs)

**Convolutional Networks** are specialized neural networks designed for data with a **grid-like structure**, such as:

* **1-D grids:** time series, audio signals, etc.
* **2-D grids:** images, video frames, etc.

CNNs replace some or all **matrix multiplications** in standard neural networks with **convolutions**, a linear operation designed to capture **local patterns** efficiently.

## The Convolution Operation

Convolution is a mathematical operation between two functions. For **continuous 1-D functions**:

$$
s(t) = (x * w)(t) = \int_{-\infty}^{\infty} x(a) w(t-a) \, da
$$

For **discrete signals**, the integral becomes a sum:

$$
s[t] = (x * w)[t] = \sum_{a=-\infty}^{\infty} x[a] w[t-a]
$$

* $x$ = input (e.g., signal or image)
* $w$ = kernel/filter
* $s$ = output (feature map)
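
As a minimal sketch of the discrete sum, the function below evaluates it directly for finite sequences, assuming entries outside the stored range are zero. The name `conv1d_full` and the sample values are illustrative choices, not part of the text.

```python
import numpy as np

def conv1d_full(x, w):
    """Discrete convolution s[t] = sum_a x[a] * w[t - a] for finite sequences.

    Entries outside the stored range are treated as zero, so the infinite
    sum reduces to the terms where both indices are valid.
    """
    out = np.zeros(len(x) + len(w) - 1)
    for t in range(len(out)):
        for a in range(len(x)):
            if 0 <= t - a < len(w):
                out[t] += x[a] * w[t - a]
    return out

x = [1.0, 2.0, 3.0, 4.0]   # illustrative input
w = [1.0, 0.0, -1.0]       # illustrative kernel
s = conv1d_full(x, w)
print(s)
print(np.allclose(s, np.convolve(x, w)))  # agrees with NumPy's default "full" mode
```

The double loop mirrors the formula term by term; in practice a library routine such as `np.convolve` would be used instead.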

For **2-D inputs** (images):

$$
S[i,j] = (I * K)[i,j] = \sum_m \sum_n I[m,n] K[i-m, j-n]
$$

Convolution is **commutative**:

$$
I * K = K * I
$$

Neural networks often implement **cross-correlation**, which is similar but **without flipping the kernel**:

$$
S[i,j] = \sum_m \sum_n I[i+m, j+n] K[m,n]
$$
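
The relationship between the two operations can be sketched in NumPy as follows. This is an illustrative "valid"-mode implementation (outputs only where the kernel fully overlaps the image, indexed from the first such position); the function names, image, and kernel are assumptions made for the example.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Valid-mode cross-correlation: S[i,j] = sum_m sum_n I[i+m, j+n] K[m,n]."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the window by the unflipped kernel and sum.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def convolve2d(image, kernel):
    """Valid-mode convolution: cross-correlation with the kernel flipped on both axes."""
    return cross_correlate2d(image, kernel[::-1, ::-1])

image = np.arange(16, dtype=float).reshape(4, 4)   # image[i, j] = 4*i + j
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])

corr = cross_correlate2d(image, kernel)
conv = convolve2d(image, kernel)
print(corr)
print(conv)
```

For this kernel the two results differ only in sign. Because a network learns its kernel values, the flip does not change what can be represented, which is why the unflipped form is usually the one implemented.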

<img src="img/matrix.png">

### Numerical Example: 1-D Convolution

Let $x = [1,2,3,4]$ and $w = [1,0,-1]$ (a simple edge detector).

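Compute the output at the "valid" positions (where the kernel fully overlaps the input) using cross-correlation, i.e. without flipping the kernel, as neural networks do:

* Position 0: $s[0] = 1 \cdot 1 + 2 \cdot 0 + 3 \cdot (-1) = -2$
* Position 1: $s[1] = 2 \cdot 1 + 3 \cdot 0 + 4 \cdot (-1) = -2$

Result: $s = [-2, -2]$. True convolution flips the kernel first and gives $[2, 2]$ instead.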
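
The sums above slide the kernel over the input without flipping it, which corresponds to `np.correlate`. A short check with NumPy, shown here only as a sketch, reproduces the result and contrasts it with `np.convolve`, which does flip the kernel:

```python
import numpy as np

x = np.array([1, 2, 3, 4])
w = np.array([1, 0, -1])

# Cross-correlation slides w over x without flipping it.
s_corr = np.correlate(x, w, mode="valid")

# Convolution flips w first, so the sign changes for this antisymmetric kernel.
s_conv = np.convolve(x, w, mode="valid")

print(s_corr)  # [-2 -2]
print(s_conv)  # [2 2]
```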



## Sparse Interactions & Parameter Sharing

* **Sparse interactions:** Each output unit depends only on a **small local patch** of the input.
* **Parameter sharing:** The same kernel is applied across all input locations.

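If the input has $m$ units and the output has $n$ units:

* Fully connected layer: $m \times n$ parameters
* Sparsely connected layer with $k$ inputs per output unit (no sharing): $k \times n$ parameters
* Convolution with kernel size $k$ (shared kernel): $k$ parameters, applied with $O(k \times n)$ multiply-adds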

This **reduces memory** and **computational cost**, while still capturing complex patterns.
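
To make the savings concrete, here is a small arithmetic sketch with hypothetical layer sizes (the values of `m`, `n`, and `k` are chosen only for illustration). Note that $k \times n$ counts the connections when each output unit sees $k$ inputs; when the same $k$ weights are reused at every location, only $k$ distinct values need to be stored.

```python
# Hypothetical layer sizes chosen only for illustration.
m = 10_000   # input units
n = 10_000   # output units
k = 9        # kernel size (connections per output unit)

dense_params = m * n    # every output connects to every input
sparse_params = k * n   # each output has its own k weights (no sharing)
shared_params = k       # one kernel reused at every location

print(dense_params, sparse_params, shared_params)
```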

<img src="img/ex.png" height="10%" weight="10%">

<img src="img/ex2.png" height="10%" weight="10%">

<img src="img/ex3.png" height="10%" weight="10%">



## Equivariance to Translation

A function $f$ is **equivariant** to $g$ if:

$$
f(g(x)) = g(f(x))
$$

For convolution, shifting the input by one pixel shifts the output by the same amount:

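* Let $I'(x, y) = I(x-1, y)$ be the input shifted by one pixel along $x$
* Then $(I' * K)(x, y) = (I * K)(x-1, y)$, i.e. $I' * K = (I * K)'$, where the prime denotes the same one-pixel shift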

This is one reason CNNs work well for **vision tasks**, where the same object can appear at any location in the image.
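
The property can be checked numerically. The sketch below uses a 1-D signal for brevity and an arbitrary kernel, both assumptions of the example. Because `np.roll` wraps the last element around to the front, the comparison skips the one output position affected by the wrap.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(12)          # arbitrary 1-D "image row"
w = np.array([1.0, 0.0, -1.0])       # arbitrary kernel

def shift_right(v):
    """Shift by one position; np.roll wraps the last element around to the front."""
    return np.roll(v, 1)

s = np.correlate(x, w, mode="valid")
s_shifted_input = np.correlate(shift_right(x), w, mode="valid")

# Away from the wrapped boundary element, filtering the shifted input
# equals shifting the filtered output.
equivariant = np.allclose(s_shifted_input[1:], s[:-1])
print(equivariant)
```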



## Pooling

Pooling reduces spatial dimensions and makes the representation approximately **invariant to small translations** of the input. Common pooling operations:

* **Max pooling:** $s[i,j] = \max \{ S[\text{neighbors of } (i,j)] \}$
* **Average pooling:** $s[i,j] = \frac{1}{N}\sum S[\text{neighbors of } (i,j)]$
* **L2 pooling:** $s[i,j] = \sqrt{\sum S[\text{neighbors of } (i,j)]^2}$

Pooling also **reduces computation** for subsequent layers.
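
The three operations can be sketched for non-overlapping windows as follows. The helper `pool2d`, its `mode` strings, and the sample feature map are illustrative assumptions rather than a standard API.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling (stride == size) over a 2-D feature map."""
    h, w = feature_map.shape
    # Drop trailing rows/columns that do not fill a complete window.
    h_out, w_out = h // size, w // size
    trimmed = feature_map[:h_out * size, :w_out * size]
    # Reshape so each window's elements sit on axes 1 and 3.
    windows = trimmed.reshape(h_out, size, w_out, size)
    if mode == "max":
        return windows.max(axis=(1, 3))
    if mode == "average":
        return windows.mean(axis=(1, 3))
    if mode == "l2":
        return np.sqrt((windows ** 2).sum(axis=(1, 3)))
    raise ValueError(f"unknown pooling mode: {mode}")

S = np.array([[1.0, 3.0, 2.0, 4.0],
              [5.0, 6.0, 1.0, 0.0],
              [3.0, 4.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 2.0]])

print(pool2d(S, mode="max"))
print(pool2d(S, mode="average"))
print(pool2d(S, mode="l2"))
```

Each $4 \times 4$ input becomes a $2 \times 2$ output, so the next layer processes a quarter as many values.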

<img src="img/ex5.png" height="10%" weight="10%">

### Example: 1-D Max Pooling

Let $S = [1, 3, 2, 4]$, pool size = 2, stride = 2:

* Pool region 1: $\max(1,3) = 3$
* Pool region 2: $\max(2,4) = 4$

Result: $s_\text{pooled} = [3, 4]$
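
The same calculation as a short sketch in code, with an explicit stride parameter (the helper name `max_pool1d` is an assumption of this example):

```python
import numpy as np

def max_pool1d(values, pool_size=2, stride=2):
    """Max over each window of length pool_size, stepping by stride."""
    values = np.asarray(values)
    starts = range(0, len(values) - pool_size + 1, stride)
    return np.array([values[i:i + pool_size].max() for i in starts])

S = np.array([1, 3, 2, 4])
pooled = max_pool1d(S)
print(pooled)  # [3 4]

# Swapping neighbours inside each window leaves the output unchanged,
# which illustrates invariance to small local changes.
pooled_swapped = max_pool1d(np.array([3, 1, 4, 2]))
print(pooled_swapped)  # [3 4]
```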

<img src="img/ex6.png" height="10%" weight="10%">

<img src="img/ex7.png" height="10%" weight="10%">

<img src="img/ex8.png" height="10%" weight="10%">



## Efficiency Example

Consider edge detection in an image:

* Input: $320 \times 280$ (width $\times$ height)
* Output: $319 \times 280$, using a kernel of width 2 that takes the difference of horizontally adjacent pixels

**Convolution operations** (two multiplications per output pixel):

$$
\text{FLOPs} = 319 \times 280 \times 2 = 178{,}640
$$

A dense **matrix multiplication** mapping every input pixel to every output pixel would require a weight matrix with:

$$
320 \times 280 \times 319 \times 280 \approx 8 \text{ billion entries}
$$

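Convolution therefore needs about 45,000 times fewer multiplications than the dense matrix (one multiplication per matrix entry), making it **orders of magnitude more efficient**.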
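
The counts above can be reproduced with a few lines of arithmetic. This sketch reads the input size as 320 pixels wide by 280 tall and assumes the kernel `[-1, 1]` for the horizontal difference; both are assumptions made for illustration.

```python
import numpy as np

height, width = 280, 320   # image is 320 pixels wide and 280 tall
out_width = width - 1      # a width-2 kernel fits at 319 horizontal positions

# Two multiplications per output pixel, as counted in the text.
conv_multiplies = out_width * height * 2
# A dense matrix mapping every input pixel to every output pixel.
dense_entries = (width * height) * (out_width * height)

print(conv_multiplies)   # 178640
print(dense_entries)     # 8003072000

# Horizontal edge detection as a difference of neighbouring pixels,
# assuming the kernel [-1, 1] (an illustrative choice).
image = np.random.default_rng(0).random((height, width))
edges = image[:, 1:] - image[:, :-1]
print(edges.shape)       # (280, 319)
```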

<img src="img/ex4.png" height="10%" weight="10%">



## Summary of Convolutional Layers

1. **Convolution:** extract local features using a shared kernel
2. **Nonlinearity:** apply activation function (e.g., ReLU)
3. **Pooling:** reduce dimensions, gain invariance to small translations
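
The three stages can be chained in a few lines. The following is a minimal 1-D sketch, not a full layer: it assumes a single channel, a fixed hand-picked kernel, no bias, and non-overlapping max pooling, and the function name `conv_relu_pool` is hypothetical.

```python
import numpy as np

def conv_relu_pool(x, w, pool_size=2):
    """One 1-D stage: valid cross-correlation -> ReLU -> non-overlapping max pooling."""
    features = np.correlate(x, w, mode="valid")      # 1. convolution stage
    activated = np.maximum(features, 0.0)            # 2. nonlinearity (ReLU)
    n_windows = len(activated) // pool_size          # drop any incomplete window
    windows = activated[:n_windows * pool_size].reshape(n_windows, pool_size)
    return windows.max(axis=1)                       # 3. pooling

x = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0])
w = np.array([-1.0, 1.0])   # responds to a rising step

out = conv_relu_pool(x, w)
print(out)
```

The kernel fires where the signal rises, ReLU discards the negative response at the falling edge, and pooling keeps one value per pair of positions.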

<img src="img/ex9.png" height="10%" weight="10%">
