# Convolutional Neural Networks

A convolutional neural network (CNN) is a special case of the multilayer perceptron architecture that builds certain assumptions into the design of the model, in particular local connectivity and equivariance.

An important motivating application for CNNs is computer vision: the architectural design of these networks mimics the visual system, where neurons respond to stimuli in a restricted region of the visual field. This concept later led to the modern convolutional neural network trained by backpropagation.

Here we focus on 2D convolutions for image processing. However, CNNs have also been very successful on time series data, using 1D convolutions, and on video analysis, using 3D convolutions.

### The convolution operation

The convolution operation for two Lebesgue integrable functions $h$ and $k$ is defined as

$$
\begin{align}
(h \circ k)(t) = \int_{-\infty}^\infty h(\tau)k(t-\tau)d\tau.
\end{align}
$$

It can be described as the weighted average of the function $h$ according to the weighting function (also called kernel) $k$ at each point in time $t$.

In practice, we discretise the data and work with discrete convolutions:

$$
\begin{align}
(h\circ k)(t) = \sum_{\tau=-\infty}^\infty h(\tau) k(t-\tau)
\end{align}
$$
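
To make the discrete sum concrete, here is a minimal NumPy sketch for finite sequences, which are treated as zero outside their support. The function name `conv1d_full` and the example values are illustrative assumptions, not part of the original text.

```python
import numpy as np

def conv1d_full(h, k):
    """Discrete convolution of two finite sequences (zero outside their support)."""
    out = np.zeros(len(h) + len(k) - 1)
    for t in range(len(out)):
        for tau in range(len(h)):
            # k(t - tau) only contributes when the index lies inside the kernel
            if 0 <= t - tau < len(k):
                out[t] += h[tau] * k[t - tau]
    return out

h = np.array([1.0, 2.0, 3.0, 4.0])
k = np.array([0.5, 0.5])  # two-point moving average as the weighting function
result = conv1d_full(h, k)
print(result)
print(np.allclose(result, np.convolve(h, k)))  # agrees with NumPy's built-in
```

With a two-point averaging kernel, each interior output is the mean of two neighbouring inputs, which matches the weighted-average reading of the convolution.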

A 2D discrete convolution is:

$$
\begin{align}
(\mathbf{h} \circ \mathbf k)(i,j) = \sum_{m,n}h(m,n)k(i-m, j-n)
\end{align}
$$

where we consider $h(i,j)$ and $k(i,j)$ to denote the $(i,j)$-th elements of the matrices $\mathbf h\in\mathbb R^{n_h\times n_w}$ and $\mathbf{k} \in\mathbb R^{k_h\times k_w}$ respectively, where $n_h$ and $n_w$ are the image height and width in pixels, and $k_h$ and $k_w$ are the kernel height and width in pixels.
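
The 2D sum can be sketched directly in NumPy. This is a minimal version that assumes both matrices are zero outside their stored entries and returns every position where the sum can be non-zero; `conv2d_full` is a hypothetical helper name.

```python
import numpy as np

def conv2d_full(h, k):
    """2D discrete convolution, treating h and k as zero outside their entries."""
    n_h, n_w = h.shape
    k_h, k_w = k.shape
    out = np.zeros((n_h + k_h - 1, n_w + k_w - 1))
    for m in range(n_h):
        for n in range(n_w):
            # h(m, n) contributes to out(i, j) with weight k(i - m, j - n),
            # i.e. to the k_h x k_w window whose top-left corner is (m, n)
            out[m:m + k_h, n:n + k_w] += h[m, n] * k
    return out

h = np.arange(9.0).reshape(3, 3)
k = np.array([[0.0, 1.0],
              [0.0, 0.0]])  # only k(0, 1) is non-zero
out = conv2d_full(h, k)
print(out)  # a copy of h moved one column to the right
```

Because the only non-zero kernel entry is $k(0,1)$, the sum reduces to $h(i, j-1)$, so the output is the image shifted by one column.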

Suppose the input is a greyscale image $\mathbf x = \mathbf h^{(0)}\in\mathbb R^{n_h\times n_w}$ (a coloured image is instead a rank-3 tensor, since each pixel consists of an RGB vector).

In a CNN, a convolutional layer usually consists of the above operation plus a bias term, followed by a pointwise activation function:

$$
\begin{align}
\mathbf h^{(k)} = \sigma((\mathbf h^{(k-1)}\circ \mathbf k^{(k-1)}) + b^{(k-1)}).
\end{align}
$$

The output $\mathbf h^{(k)}$ is sometimes referred to as a feature map and the kernel $\mathbf k$ is referred to as a filter.

The operation described above introduces a translational equivariance property in convolutional layers. That is, if the input image is translated, then the activations in the next hidden layer are also translated accordingly. The convolutional kernel searches for the same features across the input image.
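
The equivariance property can be checked numerically. The sketch below is a rough illustration: it assumes a "valid" convolution (kernel kept fully inside the image), a random image and kernel, and a translation of two pixels to the right. The name `conv2d_valid` is hypothetical.

```python
import numpy as np

def conv2d_valid(h, k):
    """Valid 2D convolution: the flipped kernel slides over positions fully inside h."""
    k_h, k_w = k.shape
    flipped = k[::-1, ::-1]
    out_h, out_w = h.shape[0] - k_h + 1, h.shape[1] - k_w + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(h[i:i + k_h, j:j + k_w] * flipped)
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
kernel = rng.normal(size=(3, 3))

# Translate the image two pixels to the right (zeros enter on the left)
shifted = np.zeros_like(image)
shifted[:, 2:] = image[:, :-2]

out = conv2d_valid(image, kernel)
out_shifted = conv2d_valid(shifted, kernel)

# Away from the border, the feature map is translated by the same two pixels
equivariant = np.allclose(out_shifted[:, 2:], out[:, :-2])
print(equivariant)
```

The comparison is restricted to columns unaffected by the zeros entering at the left edge, since equivariance only holds exactly away from the image border.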

### Multi-channel inputs and outputs

We can extend the convolution operation to inputs with multiple channels. A coloured image has three channel values per pixel, so the input is now a rank-3 tensor, for example $\mathbf x=\mathbf h^{(0)}\in\mathbb R^{7\times 7\times 3}$ for a $7\times 7$ RGB image. Correspondingly, we require a rank-3 kernel tensor $\mathbf k\in\mathbb R^{k_h\times k_w\times 3}$. The operation now becomes

$$
\begin{align}
(\mathbf h\circ \mathbf k)(i,j) = \sum_{m,n,p} h(m, n, p)k(i-m,j-n,p)
\end{align}
$$

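A convolutional layer usually applies many filters to the same input, stacking their outputs to produce a multichannel output. In this case, we use a rank-4 kernel tensor $\mathbf k\in\mathbb R^{k_h\times k_w\times c_{in}\times c_{out}}$, where $c_{in}$ is the number of channels in the input and $c_{out}$ is the number of channels in the output. Following the convention of most deep learning libraries, we write the operation as a cross-correlation, in which the kernel is not flipped; since the kernel is learned, this differs from the convolution above only by a relabelling of its entries: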

$$
\begin{align}
(\mathbf h\circ \mathbf k)(i,j,q) = \sum_{m,n,p} h(i + m, j + n, p)k(m,n,p,q)
\end{align}
$$

and the operation becomes:

$$
\begin{align}
\mathbf h^{(k)} = \sigma((\mathbf h^{(k-1)}\circ \mathbf k^{(k-1)}) + \mathbf b^{(k-1)})
\end{align}
$$

where we have $\mathbf b^{(k-1)}\in\mathbb R^{c_{out}}$ added pixel-wise to the output of the convolution operation $(\mathbf h^{(k-1)}\circ \mathbf k^{(k-1)})\in\mathbb R^{(n_h-k_h+1)\times (n_w-k_w+1)\times c_{out}}$.
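
As a rough illustration of the multi-channel layer, the NumPy sketch below evaluates the rank-4 kernel sum by sliding a window over the input without flipping the kernel, adds the per-channel bias, and applies an activation. The function name `conv_layer`, the choice of ReLU for $\sigma$, and the random $7\times 7\times 3$ input are assumptions made for the example.

```python
import numpy as np

def conv_layer(h, k, b):
    """h: (n_h, n_w, c_in), k: (k_h, k_w, c_in, c_out), b: (c_out,)."""
    n_h, n_w, c_in = h.shape
    k_h, k_w, k_c_in, c_out = k.shape
    assert c_in == k_c_in, "kernel and input must agree on c_in"
    out = np.empty((n_h - k_h + 1, n_w - k_w + 1, c_out))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = h[i:i + k_h, j:j + k_w, :]  # shape (k_h, k_w, c_in)
            # Sum over m, n, p for every output channel q at once,
            # then add the bias for each output channel
            out[i, j, :] = np.tensordot(patch, k, axes=3) + b
    return np.maximum(out, 0.0)  # ReLU, assumed here as the activation

rng = np.random.default_rng(0)
h = rng.normal(size=(7, 7, 3))     # n_h = n_w = 7, c_in = 3
k = rng.normal(size=(3, 3, 3, 4))  # k_h = k_w = 3, c_out = 4
b = rng.normal(size=4)
out = conv_layer(h, k, b)
print(out.shape)  # (n_h - k_h + 1, n_w - k_w + 1, c_out)
```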

### Pooling layers

In many CNN models, convolutional layers are alternated with pooling layers that downsample the spatial dimensions of a layer by computing a summary statistic of (often non-overlapping) regions of the input layer's post-activations. For example, a 4x4x2 image processed by a pooling layer with non-overlapping 2x2 regions is downsampled to a 2x2x2 image. Note that the channel dimension stays the same. Common pooling methods are max pooling, average pooling, and L2 norm pooling.
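
A minimal NumPy sketch of max pooling over non-overlapping regions, applied to a 4x4x2 input. The helper name `max_pool` and the requirement that the spatial sizes be divisible by the pool size are assumptions of this sketch.

```python
import numpy as np

def max_pool(x, size=2):
    """Max pooling over non-overlapping size x size regions of x: (n_h, n_w, c)."""
    n_h, n_w, c = x.shape
    assert n_h % size == 0 and n_w % size == 0
    # Split each spatial axis into (block index, position within block),
    # then take the maximum over the within-block axes
    blocks = x.reshape(n_h // size, size, n_w // size, size, c)
    return blocks.max(axis=(1, 3))

x = np.arange(32.0).reshape(4, 4, 2)
pooled = max_pool(x)
print(pooled.shape)  # (2, 2, 2): spatial size halved, channels unchanged
```

Replacing `max` with `mean` in the last line of the function would give average pooling over the same regions.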

### Padding

Padding gives some flexibility over the spatial dimensions of the output of convolutional and pooling layers. In general, for a spatial dimension of size $i$ and a kernel of width $k$, the output size $o$ is given by

$$
\begin{align}
o = i - k + 1
\end{align}
$$

In many model architectures, it is desirable to keep the spatial dimensions the same in the output of a convolutional layer. This can be achieved by padding the input layer with zeros: we add a total of $k-1$ zeros in the corresponding dimension, typically split as evenly as possible between the two sides. This type of padding is known as "same" padding.

If $p$ zeros are added to our input size $i$ with kernel width $k$, then the output size $o$ is given by

$$
\begin{align}
o = i + p - k + 1
\end{align}
$$

and we have $o=i$ if $p=k-1$. If $p=0$, we have "valid" padding and $o=i-k+1$.
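
The two padding modes can be compared on a 1D signal. In this sketch the $k-1$ zeros are split as evenly as possible between the two ends, which is an assumption (the formula only fixes the total $p$); `pad_same_1d` is a hypothetical helper.

```python
import numpy as np

def pad_same_1d(x, k):
    """Add k - 1 zeros in total, split as evenly as possible between both ends."""
    total = k - 1
    left = total // 2
    right = total - left
    return np.pad(x, (left, right))

x = np.arange(1.0, 8.0)  # i = 7
kernel = np.ones(3) / 3  # k = 3

# p = 0 ("valid"): o = i - k + 1
valid = np.convolve(x, kernel, mode="valid")
# p = k - 1 ("same"): o = i
same = np.convolve(pad_same_1d(x, len(kernel)), kernel, mode="valid")

print(len(valid), len(same))
```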

### Strides

Convolutions may also use a stride $s$, which is the distance between consecutive positions of the kernel. Using $s>1$ leads to a downsampling of the input, just like a pooling layer. 
For a spatial dimension of size $i$ with padding $p$ and a kernel of width $k$ with stride $s$, the output size $o$ is given by

$$
\begin{align}
o=\left\lfloor\frac{i+p-k}{s}\right\rfloor + 1
\end{align}
$$
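
The output-size formula is easy to check by enumerating kernel positions. The sketch below uses integer division so that cases where $s$ does not divide $i+p-k$ round down. Here `p` is the total number of zeros added, as in the text; note that PyTorch's `padding` argument counts zeros per side. Both function names are illustrative.

```python
def conv_output_size(i, k, p=0, s=1):
    """Output size for input size i, kernel width k, total padding p, stride s."""
    if i + p < k:
        raise ValueError("kernel does not fit inside the padded input")
    return (i + p - k) // s + 1

def kernel_positions(i, k, p=0, s=1):
    """Start index of every kernel placement on the padded input."""
    return list(range(0, i + p - k + 1, s))

print(conv_output_size(7, 3, p=2, s=2))  # 4
print(kernel_positions(7, 3, p=2, s=2))  # [0, 2, 4, 6]
```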

### Transposed convolutions

Transposed convolutions can be seen as a kind of inverse of regular convolutional layers. The main problem they try to solve is to give a consistent way of upsampling an input, rather than downsampling it.

In certain deep learning architectures, we would want to increase the spatial dimensions of an input. An example of this is when we are using encoder and decoder networks. Transposed convolutions give us a way of doing this, while still preserving the main structural properties of convolutional layers. They are the analogue to transposing the weight matrix in fully connected layers. They essentially swap the forward and backward passes of a convolution.

Every regular convolutional layer has an associated transposed convolution that reverses the dimensions of input and output, whilst preserving the connectivity pattern between the layers. 

In the case of a convolution with stride $s=1$, kernel size $k$, padding $p$ and input size $i$, recall the output size is $o=i+p-k+1$. There is an associated transposed convolution with kernel size $k'=k$, stride $s'=s=1$, and padding $p'=2(k-1)-p$. Its output size is given by

$$
\begin{align}
o'=i' + (k-1) - p
\end{align}
$$

so with $i'=o$, we have $o'=i$, and the transposed convolution reverses the input and output dimensions.


For a regular convolution with stride $s>1$, we can think of the associated transposed convolution as having a stride $s'<1$. Consider the case where the regular convolution is such that $s$ divides $(i+p-k)$, where the output size is $o=\frac{i+p-k}{s} + 1$. Then the input to the associated transposed convolution adds $s-1$ zeros between its input units. It has kernel size $k'=k$, $s'=1$, and padding $p'=2(k-1)-p$. The output size is given by

$$
\begin{align}
o' = s(i'-1) + k - p.
\end{align}
$$

For example, if $s=2$, we insert one zero in between the input units.
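
The zero-insertion view can be sketched in 1D with NumPy. The sketch assumes the padding $p$ of the regular convolution is even and split equally between both sides, so that $p'=2(k-1)-p$ is split the same way; `transposed_conv1d` is a hypothetical name.

```python
import numpy as np

def transposed_conv1d(y, w, s=1, p=0):
    """Zero-insertion view of a 1D transposed convolution.

    y: input of size i', w: kernel of width k, s: stride of the associated
    regular convolution, p: its total padding (assumed even, split equally).
    """
    k = len(w)
    # Insert s - 1 zeros between consecutive input units
    stretched = np.zeros(s * (len(y) - 1) + 1)
    stretched[::s] = y
    # Pad with p' = 2(k - 1) - p zeros in total, half on each side
    side = (k - 1) - p // 2
    padded = np.pad(stretched, (side, side))
    # Stride-1 convolution with the same kernel
    return np.convolve(padded, w, mode="valid")

y = np.array([1.0, 2.0, 3.0])     # i' = 3
w = np.array([1.0, 10.0, 100.0])  # k = 3
out = transposed_conv1d(y, w, s=2, p=0)
print(out.shape)  # (7,), which equals s(i' - 1) + k - p
print(out)
```

Each input unit stamps a scaled copy of the kernel into the output at intervals of $s$, and overlapping copies are summed.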

The case where $s$ does not divide $i + p - k$ is accounted for by the parameter $a=(i+p-k)\bmod s$. The transposed convolution again adds $s-1$ zeros between input units, but also adds additional padding of $a$ zeros. It has kernel size $k'=k$, $s'=1$, padding $p'=2(k-1)-p+a$, and the output size is

$$
\begin{align}
o' = s(i'-1) + a + k - p
\end{align}
$$
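
The size formulas can be checked for consistency: feeding the output size of a regular convolution into the associated transposed convolution should recover the original input size. This is a small sketch with illustrative function names, using integer division for the regular convolution so that the remainder $a$ is well defined.

```python
def conv_size(i, k, p, s):
    """Output size of a regular convolution (rounding down)."""
    return (i + p - k) // s + 1

def transposed_conv_size(i_t, k, p, s, a=0):
    """Output size of the associated transposed convolution."""
    return s * (i_t - 1) + a + k - p

def round_trip(i, k, p, s):
    """Size after a regular convolution followed by its transposed convolution."""
    a = (i + p - k) % s
    o = conv_size(i, k, p, s)
    return transposed_conv_size(o, k, p, s, a)

# s = 2 does not divide i + p - k = 5, so a = 1 is needed to recover i = 8
print(round_trip(8, 3, 0, 2))  # 8
```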

In [1]:
import torch
import torch.nn as nn # neural network modules
import torch.optim as optim # optimization algorithms
import torch.nn.init as init # tensor initialization

import torchvision.datasets as datasets # collection of image datasets
import torchvision.transforms as transforms # image transformation tools

from torch.utils.data import DataLoader # utility for feeding data to the model in batches

import numpy as np
import matplotlib.pyplot as plt

In [3]:
device = torch.device(
    "cuda" if torch.cuda.is_available() else
    "mps" if torch.backends.mps.is_available() else
    "cpu"
)

In [4]:
# Hyperparameters used for training

learning_rate = 0.001
training_epochs = 15
batch_size = 128

In [7]:
mnist_train = datasets.MNIST(root="../../datasets/",
                             train=True,
                             transform=transforms.ToTensor(),
                             download=True)
mnist_test = datasets.MNIST(root = "../../datasets/",
                            train=False,
                            transform=transforms.ToTensor(),
                            download=True)

In [11]:
mnist_train.data[0].shape

torch.Size([28, 28])

In [8]:
# Use a DataLoader to set the batch size
loader = DataLoader(dataset=mnist_train, 
                    batch_size=batch_size, 
                    shuffle=True, 
                    drop_last=True) # drops the last non-full batch

In [12]:
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        # Input shape: (batch_size, 1, 28, 28)
        # Conv2d: 32 output channels, kernel size 3x3, stride 1, padding 1
        # MaxPool2d: kernel size 2x2, stride 2 for downsampling
        # Conv output: i + p - k + 1 = 28 + 2 - 3 + 1 = 28
        # Pool output: (i+p-k)/s + 1 = (28-2)/2 + 1 = 14
        # Output shape: (batch_size, 32, 14, 14)
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        
        # Input shape: (batch_size, 32, 14, 14)
        # Conv2d: 64 output channels, kernel size 3x3, stride 1, padding 1
        # MaxPool2d: kernel size 2x2, stride 2 for downsampling
        # Conv output: 14+2-3+1 = 14
        # Pool output: (14+0-2)/2+1 = 7
        # Output shape: (batch_size, 64, 7, 7)
        self.layer2 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )

        # Input shape: (batch_size, 64, 7, 7)
        # Conv2d: 128 output channels, kernel size 3x3, stride 1, padding 1
        # MaxPool2d: kernel size 3x3, stride 2, padding 1 for downsampling
        # Conv output: 7+2-3+1 = 7
        # Pool output: (7+2-3)/2+1 = 4
        # Output shape: (batch_size, 128, 4, 4)
        self.layer3 = nn.Sequential(
            nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        )

        # Linear layer 1
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(4 * 4 * 128, 625, bias=True)
        self.keep_prob = 0.5  # probability of keeping a unit; dropout probability is 1 - keep_prob
        nn.init.xavier_uniform_(self.fc1.weight) # weight initialization
        self.layer4 = nn.Sequential(
            self.flatten,
            self.fc1,
            nn.ReLU(),
            nn.Dropout(p=1-self.keep_prob)
        )

        # Linear layer 2
        self.fc2 = nn.Linear(625, 10, bias=True)
        nn.init.xavier_uniform_(self.fc2.weight) # weight initialization
    
    def forward(self, x):
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.fc2(x)
        return x
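
The shape comments in the model can be double-checked without running PyTorch. The sketch below traces the spatial size through the three convolution and pooling stages, assuming a 28x28 input. PyTorch's `padding` argument counts zeros per side, so the total padding $p$ in the formulas above is twice that value. The helper `out_size` and the `layers` list are illustrative only.

```python
def out_size(i, k, s=1, pad_per_side=0):
    """Spatial output size, with total padding p = 2 * pad_per_side."""
    return (i + 2 * pad_per_side - k) // s + 1

# (name, kernel size, stride, padding per side), mirroring layer1 to layer3
layers = [
    ("conv1", 3, 1, 1), ("pool1", 2, 2, 0),
    ("conv2", 3, 1, 1), ("pool2", 2, 2, 0),
    ("conv3", 3, 1, 1), ("pool3", 3, 2, 1),
]

size = 28
trace = [size]
for name, k, s, pad in layers:
    size = out_size(size, k, s, pad)
    trace.append(size)
    print(f"{name}: {size}x{size}")

# Number of input features of the first linear layer (128 channels)
flattened = size * size * 128
print(flattened)
```

The final value matches the `4 * 4 * 128` input size passed to `fc1`.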

In [13]:
model = CNN().to(device)

In [14]:
criterion = nn.CrossEntropyLoss().to(device)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

In [15]:
total_batch = len(loader)
print("Total number of batches: {}".format(total_batch))

Total number of batches: 468


In [16]:
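for epoch in range(training_epochs):
    model.train()  # make sure dropout is active during training
    avg_cost = 0.0
    for X, y in loader:
        X = X.to(device)
        y = y.to(device)

        optimizer.zero_grad()
        cost = criterion(model(X), y)
        cost.backward()
        optimizer.step()

        # .item() converts the loss to a Python float, so no graph is kept alive
        avg_cost += cost.item() / total_batch

    print('Epoch: {} cost = {:>.9}'.format(epoch + 1, avg_cost))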

Epoch: 1 cost = 0.231816724
Epoch: 2 cost = 0.0552743524
Epoch: 3 cost = 0.0390527658
Epoch: 4 cost = 0.0310358051
Epoch: 5 cost = 0.0255662575
Epoch: 6 cost = 0.0231119078
Epoch: 7 cost = 0.0212798715
Epoch: 8 cost = 0.016448278
Epoch: 9 cost = 0.0158520509
Epoch: 10 cost = 0.0140962638
Epoch: 11 cost = 0.0110809626
Epoch: 12 cost = 0.0102506513
Epoch: 13 cost = 0.00952469558
Epoch: 14 cost = 0.00980816502
Epoch: 15 cost = 0.00876408815


In [18]:
# No training happens here, so gradients are disabled with torch.no_grad()

# Note: drop_last=True discards the final partial batch, so with batch_size=128
# the accuracy is computed over 9,984 of the 10,000 test images
test_loader = DataLoader(mnist_test, batch_size=batch_size, shuffle=False, drop_last=True)

correct = 0
total = 0

# Evaluation mode (disables dropout)
model.eval()

with torch.no_grad():
    for image, label in test_loader:
        X = image.to(device)
        y = label.to(device)
        y_pred = model(X)

        # torch.max returns (max value, index)
        _, output_index = torch.max(y_pred, 1)

        total += label.size(0)
        correct += (output_index == y).sum().float()

    print("Accuracy: {}%".format(100 * correct/total))

Accuracy: 99.24879455566406%
