# Convolutional Networks (CNNs)

**Convolutional Networks** are specialized neural networks for data with a **grid-like structure**, such as:

* **1-D grids:** time series, audio signals, sensor readings
* **2-D grids:** images, video frames, spectrograms

CNNs replace the general **matrix multiplication** of a fully connected layer with **convolution** in at least one layer. Convolution is a linear operation that:

* Exploits **local correlations** in the data
* **Reduces the number of parameters** via weight sharing
* Preserves spatial or temporal structure
* Improves **generalization** for structured data



## The Convolution Operation

### Continuous 1-D Convolution

For **continuous functions** $x(t)$ and $w(t)$:

$$
s(t) = (x * w)(t) = \int_{-\infty}^{\infty} x(a)\, w(t-a)\, da
$$

This slides a reversed copy of $w$ along $x$. At each offset $t$, the output is the integral of their product, that is, a weighted sum of the input with weights given by $w$.

### Discrete 1-D Convolution

For **discrete signals**, the integral becomes a sum:

$$
s[t] = (x * w)[t] = \sum_{a=-\infty}^{\infty} x[a] w[t-a]
$$

* $x[a]$ = input value at position $a$
* $w[t-a]$ = kernel weight for relative position $t-a$
* $s[t]$ = output (feature map) at position $t$

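In practice, sequences are finite, so the sum runs over finitely many terms. A **“valid”** convolution keeps only the outputs where the kernel fully overlaps the input, so an input of length $n$ and a kernel of length $k$ produce $n - k + 1$ outputs.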
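The discrete sum can be written out directly. The following is a minimal sketch in plain Python, not a reference implementation; the function name `conv1d_valid` and the sample values are chosen here for illustration only.

```python
def conv1d_valid(x, w):
    """True discrete convolution, keeping only fully overlapping positions.

    Implements s[t] = sum_a x[a] * w[t - a] for t = k-1, ..., n-1,
    so the result has n - k + 1 entries.
    """
    n, k = len(x), len(w)
    if k > n:
        raise ValueError("kernel must not be longer than the input")
    out = []
    for t in range(k - 1, n):
        # a covers the k inputs under the kernel; the kernel index t - a
        # runs from k-1 down to 0, which is the "flip" in convolution.
        out.append(sum(x[a] * w[t - a] for a in range(t - k + 1, t + 1)))
    return out


x = [0, 1, 2, 3]   # illustrative input
w = [1, 2]         # illustrative kernel
print(conv1d_valid(x, w))  # [1, 4, 7]
```

The input has 4 values and the kernel has 2, so the output has $4 - 2 + 1 = 3$ values.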



### 2-D Convolution (Images)

For an image $I$ and kernel $K$:

$$
S[i,j] = (I * K)[i,j] = \sum_m \sum_n I[m,n] K[i-m, j-n]
$$

* $S[i,j]$ = feature map at position $(i,j)$
* $K[m,n]$ = kernel weights, usually small (e.g., $3 \times 3$)
* Each output depends only on a **local patch** of the input

**Cross-correlation**, used in most CNN implementations, does not flip the kernel:

$$
S[i,j] = \sum_m \sum_n I[i+m, j+n] K[m,n]
$$

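Because the kernel weights are learned, the two operations are interchangeable in practice: a network trained with cross-correlation simply learns the flipped version of the kernel that a network trained with true convolution would learn.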
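The relationship between the two operations can be checked on a small case. This sketch assumes "valid" outputs only and uses hypothetical helper names (`cross_correlate2d`, `flip2d`, `convolve2d`) and made-up values; it computes convolution as cross-correlation with the kernel reversed along both axes.

```python
def cross_correlate2d(image, kernel):
    """Valid 2-D cross-correlation: S[i][j] = sum_m sum_n I[i+m][j+n] * K[m][n]."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[i + m][j + n] * kernel[m][n]
                for m in range(kh) for n in range(kw))
            for j in range(ow)
        ]
        for i in range(oh)
    ]


def flip2d(kernel):
    """Reverse the kernel along both axes (a 180-degree rotation)."""
    return [row[::-1] for row in kernel[::-1]]


def convolve2d(image, kernel):
    """Valid 2-D convolution, computed as cross-correlation with the flipped kernel."""
    return cross_correlate2d(image, flip2d(kernel))


image = [[1, 2, 0],
         [0, 1, 3],
         [2, 0, 1]]
kernel = [[1, 2],
          [3, 4]]   # deliberately asymmetric so the flip matters

corr = cross_correlate2d(image, kernel)
conv = convolve2d(image, kernel)
print(corr)  # [[9, 17], [8, 11]]
print(conv)  # [[11, 13], [7, 14]]
```

The two results differ for a fixed kernel, but each can reproduce the other by flipping the kernel, which is why the choice does not matter once the kernel is learned.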



### Numerical Example: 1-D Convolution

Let

$$
x = [1,2,3,4], \quad w = [1,0,-1]
$$

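Compute the **valid** output $s$ using the no-flip (cross-correlation) form that CNN libraries call convolution, $s[t] = \sum_j x[t+j]\, w[j]$. (Flipping the kernel first, as the strict definition of $x * w$ requires, would give $[2, 2]$ instead.)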

* Position 0:

$$
s[0] = 1\cdot 1 + 2\cdot 0 + 3\cdot(-1) = -2
$$

* Position 1:

$$
s[1] = 2\cdot 1 + 3\cdot 0 + 4\cdot(-1) = -2
$$

Result:

$$
s = [-2, -2]
$$
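The arithmetic above applies $w$ to each window as written, without reversing it. A short sketch (the helper name `cross_correlate1d` is an assumption made for this example) reproduces the result and shows what the kernel flip would change:

```python
def cross_correlate1d(x, w):
    """Valid sliding dot product without flipping the kernel."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]


x = [1, 2, 3, 4]
w = [1, 0, -1]

no_flip = cross_correlate1d(x, w)        # kernel applied as written
flipped = cross_correlate1d(x, w[::-1])  # true convolution: kernel reversed first
print(no_flip)  # [-2, -2]
print(flipped)  # [2, 2]
```

For this antisymmetric kernel the flip only changes the sign of the output.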



## Sparse Interactions & Parameter Sharing

* **Sparse interactions:** Each output unit depends on a **small local patch** of size equal to the kernel.
* **Parameter sharing:** The **same kernel** is applied across all positions, drastically reducing the number of parameters.

Example:

* Input length $m = 10$, output length $n = 8$, kernel size $k=3$
* Fully connected: $m \cdot n = 80$ weights (ignoring biases)
* Convolution: $k = 3$ weights, shared across all 8 output positions, about 27 times fewer
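One way to see both properties at once is to write the convolution as the dense matrix it replaces. The sketch below uses hypothetical kernel values and a helper named `conv_as_dense_matrix` (both assumptions for illustration), and applies the kernel without flipping.

```python
def conv_as_dense_matrix(w, m):
    """Build the (m - k + 1) x m matrix that applies kernel w at every valid position."""
    k = len(w)
    n = m - k + 1
    rows = []
    for i in range(n):
        row = [0.0] * m
        row[i:i + k] = w  # the same k weights, shifted one step per row
        rows.append(row)
    return rows


w = [0.5, -1.0, 2.0]   # hypothetical kernel values
m = 10
W = conv_as_dense_matrix(w, m)

total_entries = len(W) * len(W[0])
nonzero_entries = sum(1 for row in W for v in row if v != 0.0)
print(total_entries, nonzero_entries, len(w))  # 80 24 3

# The matrix-vector product matches the sliding-window computation.
x = list(range(m))
dense_out = [sum(a * b for a, b in zip(row, x)) for row in W]
sliding_out = [sum(x[i + j] * w[j] for j in range(len(w))) for i in range(m - len(w) + 1)]
print(dense_out == sliding_out)  # True
```

Of the 80 matrix entries, only 24 are nonzero (sparse interactions), and those 24 are copies of just 3 distinct weights (parameter sharing).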



## Equivariance to Translation

A function $f$ is **equivariant** to transformation $g$ if:

$$
f(g(x)) = g(f(x))
$$

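* For convolution, $g$ is a translation: shifting the input by one position shifts the output by one position, ignoring boundary effects (a prime denotes the shifted version):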

$$
I'(x, y) = I(x-1, y) \quad \Rightarrow \quad (I' * K) = (I * K)'
$$

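This **preserves spatial relationships**: a feature is detected the same way wherever it appears, and the feature map records where it appeared. It is one reason CNNs work well for vision.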
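The property can be checked numerically in 1-D. This sketch assumes a zero-filled shift and "valid" outputs, so the comparison skips the one boundary position where the fill value enters; the helper names and the input values are illustrative.

```python
def cross_correlate1d(x, w):
    """Valid sliding dot product without flipping the kernel."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]


def shift_right(x, fill=0):
    """Shift a sequence one step to the right, dropping the last element."""
    return [fill] + x[:-1]


x = [3, 1, 4, 1, 5, 9, 2, 6]
w = [1, 0, -1]

y = cross_correlate1d(x, w)
y_from_shifted_input = cross_correlate1d(shift_right(x), w)

print(y)                     # [-1, 0, -1, -8, 3, 3]
print(y_from_shifted_input)  # [-1, -1, 0, -1, -8, 3]

# Away from the left boundary, filtering the shifted input equals shifting the output.
print(y_from_shifted_input[1:] == y[:-1])  # True
```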



## Pooling

Pooling replaces each patch of a feature map with a summary statistic, typically reducing **spatial resolution**. Common operations over a patch $P_{i,j}$ of $N$ values around $(i,j)$:

* **Max pooling:** $s[i,j] = \max_{(m,n) \in P_{i,j}} S[m,n]$
* **Average pooling:** $s[i,j] = \frac{1}{N}\sum_{(m,n) \in P_{i,j}} S[m,n]$
* **L2 pooling:** $s[i,j] = \sqrt{\sum_{(m,n) \in P_{i,j}} S[m,n]^2}$

Pooling reduces computation in subsequent layers and makes the output approximately **invariant** to small translations of the input.



### Example: 1-D Max Pooling

Let

$$
S = [1,3,2,4], \quad \text{pool size}=2, \text{stride}=2
$$

* Pool region 1: $\max(1,3) = 3$
* Pool region 2: $\max(2,4) = 4$

Result:

$$
s_\text{pooled} = [3,4]
$$
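The same windows can be summarized with any of the pooling operations listed earlier. The following sketch is one possible way to write this in plain Python; `pool1d`, `mean`, and `l2_norm` are names introduced here for illustration.

```python
import math


def pool1d(values, size, stride, reduce_fn):
    """Apply reduce_fn to each window of `size` elements, stepping by `stride`."""
    return [
        reduce_fn(values[start:start + size])
        for start in range(0, len(values) - size + 1, stride)
    ]


def mean(window):
    return sum(window) / len(window)


def l2_norm(window):
    return math.sqrt(sum(v * v for v in window))


S = [1, 3, 2, 4]
print(pool1d(S, 2, 2, max))      # [3, 4]
print(pool1d(S, 2, 2, mean))     # [2.0, 3.0]
print(pool1d(S, 2, 2, l2_norm))  # square roots of 10 and 20
```

Passing the reduction as a function keeps the windowing logic in one place; only the summary statistic changes between pooling types.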



## Efficiency Example

Edge detection in an image:

* Input: $320 \times 280$
* Kernel: $1 \times 2$ (two horizontally adjacent pixels), which responds to vertically oriented edges
* Output: $319 \times 280$

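**Operations for convolution** (two multiplications per output pixel):

$$
319 \cdot 280 \cdot 2 = 178{,}640
$$

Counting the one addition per output pixel as well gives $319 \cdot 280 \cdot 3 = 267{,}960$ floating-point operations.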

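**Equivalent fully connected layer** (one weight per input-output pair):

$$
320 \cdot 280 \cdot 319 \cdot 280 = 8{,}003{,}072{,}000 \approx 8\text{ billion entries}
$$

Convolution stores 2 kernel weights instead of about 8 billion matrix entries, and because a dense matrix-vector product costs about two operations per entry, it also needs tens of thousands of times fewer operations. Thus convolution is **orders of magnitude more efficient**.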
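The counts can be reproduced with a few lines of arithmetic. This sketch assumes one addition per output pixel for the convolution and roughly one multiplication plus one addition per weight for the dense layer; variable names are illustrative.

```python
in_w, in_h = 320, 280
kernel_w = 2
out_w, out_h = in_w - kernel_w + 1, in_h   # 319 x 280

outputs = out_w * out_h
conv_mults = outputs * kernel_w             # 2 multiplications per output pixel
conv_flops = outputs * (2 * kernel_w - 1)   # plus 1 addition per output pixel

dense_entries = (in_w * in_h) * outputs     # one weight per input-output pair
dense_flops = 2 * dense_entries             # approximation: multiply + add per weight

print(conv_mults)                 # 178640
print(conv_flops)                 # 267960
print(dense_entries)              # 8003072000
print(dense_flops // conv_flops)  # ratio of operation counts, just under 60,000
```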



## 2-D Numerical Example

Let

$$
I = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}, \quad
K = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}
$$

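Compute the **valid** output $S$ using the cross-correlation form (no kernel flip), $S[i,j] = \sum_m \sum_n I[i+m, j+n]\, K[m,n]$. (Flipping the kernel first, as strict convolution requires, would give $+4$ in every position.)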

* $S[0,0] = 1\cdot 1 + 2\cdot 0 + 4\cdot 0 + 5\cdot(-1) = -4$
* $S[0,1] = 2\cdot 1 + 3\cdot 0 + 5\cdot 0 + 6\cdot(-1) = -4$
* $S[1,0] = 4\cdot 1 + 5\cdot 0 + 7\cdot 0 + 8\cdot(-1) = -4$
* $S[1,1] = 5\cdot 1 + 6\cdot 0 + 8\cdot 0 + 9\cdot(-1) = -4$

Resulting feature map:

$$
S = \begin{bmatrix} -4 & -4 \\ -4 & -4 \end{bmatrix}
$$
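Every entry is the same because this kernel subtracts the lower-right neighbor from each pixel, and in $I$ moving one step down and one step right always adds 4. A minimal sketch to check the numbers, with the hypothetical helper `slide2d` applying the kernel exactly as given:

```python
I = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
K = [[1, 0],
     [0, -1]]


def slide2d(image, kernel):
    """Valid sliding-window sum of products, kernel applied as given (no flip)."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [sum(image[i + m][j + n] * kernel[m][n] for m in range(kh) for n in range(kw))
         for j in range(len(image[0]) - kw + 1)]
        for i in range(len(image) - kh + 1)
    ]


S_no_flip = slide2d(I, K)
K_flipped = [row[::-1] for row in K[::-1]]   # rotate the kernel by 180 degrees
S_flipped = slide2d(I, K_flipped)

print(S_no_flip)  # [[-4, -4], [-4, -4]]
print(S_flipped)  # [[4, 4], [4, 4]]
```

The worked values match the no-flip computation; applying the flipped kernel changes only the sign here.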



## Summary of Convolutional Layers

1. **Convolution:** extract local features with a shared kernel
2. **Nonlinearity:** apply activation function (ReLU, sigmoid, etc.)
3. **Pooling:** reduce dimensions, gain invariance to small translations
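The three stages can be chained in a few lines. The following is only a sketch on a 1-D signal, assuming ReLU as the nonlinearity and max pooling with size 2 and stride 2; the signal, the kernel, and the function names are hypothetical.

```python
def cross_correlate1d(x, w):
    """Valid sliding dot product without flipping the kernel."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]


def relu(values):
    return [max(0, v) for v in values]


def max_pool1d(values, size=2, stride=2):
    return [max(values[s:s + size]) for s in range(0, len(values) - size + 1, stride)]


def conv_layer(x, w):
    features = cross_correlate1d(x, w)   # 1. convolution stage (shared kernel)
    activated = relu(features)           # 2. nonlinearity
    return max_pool1d(activated)         # 3. pooling


x = [0, 0, 1, 1, 1, 0, 0, 2, 2, 0]   # hypothetical signal
w = [-1, 1]                          # responds to increases between neighbors
print(conv_layer(x, w))  # [1, 0, 0, 2]
```

The output marks the two places where the signal rises and by how much, at roughly half the resolution of the feature map.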
