## Convolutional Neural Network

references: 
- [https://wegonnamakeit.tistory.com/48](https://wegonnamakeit.tistory.com/48)
- [http://taewan.kim/post/cnn/](http://taewan.kim/post/cnn/)
- [https://cheong.netlify.app/machine-learning/2019-10-14---cs231n-cnn-architectures/](https://cheong.netlify.app/machine-learning/2019-10-14---cs231n-cnn-architectures/)
- [https://kjhov195.github.io/2020-01-07-activation_function_2/](https://kjhov195.github.io/2020-01-07-activation_function_2/)
- [https://jsideas.net/batch_normalization/](https://jsideas.net/batch_normalization/)

### Basic CNN
Let's look at the layers that are commonly used to build a CNN. <br/>



#### Convolution layer (or spatial convolution layer)

![img](http://deeplearning.net/software/theano/_images/numerical_padding_strides.gif)

- Convolution: an operation that slides a filter (or kernel) over the image in steps of `stride`, multiplies each pair of overlapping elements, and outputs the sum of those products.
- Filter (or kernel): number_of_filters x input_channels x kernel_size x kernel_size
- Stride: the step by which the filter moves at a time, in sliding-window fashion.
- Padding: filling the top, bottom, left, and right of the image with '0' values, `pad` pixels on each side. It is used to control the width and height of the output.
- Input (image or features): Batch x Channel x Height x Width (PyTorch: BCHW format, TensorFlow: BHWC format)
  - Batch is a group of input samples. If the input consists of a single image, its shape is 1 x C x H x W. For a 256 x 256 image with RGB channels, the input is 1x3x256x256.
- Output (features or feature map): Batch x number_of_filters x computed_height x computed_width
  - computed_width = floor((width - kernel_size + 2*pad) / stride) + 1
  - computed_height = floor((height - kernel_size + 2*pad) / stride) + 1
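
The output-size formula can be checked with a few lines of plain Python. This is a minimal sketch: `conv_output_size` is a hypothetical helper (not a PyTorch function), and it assumes the division is rounded down when it is not exact, which is how `nn.Conv2d` behaves.

```python
def conv_output_size(size, kernel_size, stride=1, pad=0):
    """Output size of a convolution along one spatial dimension."""
    # Integer floor division implements the rounding-down step.
    return (size - kernel_size + 2 * pad) // stride + 1

# Settings used in this section: 256x256 input, kernel 3, stride 2, pad 1
print(conv_output_size(256, kernel_size=3, stride=2, pad=1))  # 128

# Kernel 5, stride 1, pad 2 keeps the spatial size unchanged
print(conv_output_size(28, kernel_size=5, stride=1, pad=2))   # 28
```

With stride 2 the 256 x 256 input is halved to 128 x 128, so the convolution in this section should produce an output of shape 1 x 2 x 128 x 128.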


In [1]:
import torch
import torch.nn as nn

in_channels = 3
out_channels = 2 # out_channels is the same as number_of_filters.
kernel_size = 3
stride = 2
pad = 1

# Basic 2D convolution layer
conv = nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=kernel_size, stride=stride, padding=pad)

print('conv.weight:\n', conv.weight) 
print('conv.weight.shape:\n', conv.weight.shape) # the filter (kernel) size is exactly the size of the weights.
print()
print('conv.bias:\n', conv.bias)
print('conv.bias.shape:\n', conv.bias.shape) # like a linear layer, a convolution layer can also have a bias.
print()

inp = torch.ones((1, 3, 256, 256)) # B x C x H x W: one 256x256 RGB image
print('input:\n', inp)

out = conv(inp)
print('output:\n', out)
print('output.shape:\n', out.shape)


conv.weight:
 Parameter containing:
tensor([[[[ 0.0551, -0.0760,  0.1334],
          [-0.1192, -0.0871,  0.1630],
          [-0.1401, -0.0445,  0.1765]],

         [[-0.0620, -0.1905,  0.0928],
          [-0.0963, -0.1175, -0.1028],
          [ 0.1328,  0.1005,  0.0581]],

         [[-0.0466, -0.0871,  0.0760],
          [ 0.1106, -0.0890, -0.1355],
          [-0.0618,  0.0945,  0.0847]]],


        [[[-0.0892,  0.0151, -0.1040],
          [ 0.1507, -0.0944,  0.0347],
          [-0.1797, -0.0250, -0.1183]],

         [[ 0.1784,  0.0276, -0.1322],
          [ 0.1316, -0.0685, -0.0231],
          [ 0.1436,  0.1272, -0.1201]],

         [[ 0.1577,  0.0682, -0.1657],
          [-0.1063,  0.1061,  0.1656],
          [-0.0721, -0.0264,  0.1091]]]], requires_grad=True)
conv.weight.shape:
 torch.Size([2, 3, 3, 3])

conv.bias:
 Parameter containing:
tensor([0.1884, 0.0524], requires_grad=True)
conv.bias.shape:
 torch.Size([2])

input:
 tensor([[[[1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1.

#### Max Pooling layer

<img src="https://www.jeremyjordan.me/content/images/2017/07/Screen-Shot-2017-07-27-at-11.43.19-AM.png" width="600">

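- pooling_size (kernel_size): the sliding-window size used by the pooling layer.
- stride: the step by which the sliding window moves (usually set equal to the pooling size)
- input: Batch x Channel x Height x Width
- output: Batch x Channel x computed_height x computed_width (the number of channels is preserved)
- computed_height = floor((Height - pooling_size) / stride) + 1, which equals Height / pooling_size when stride = pooling_size and Height is divisible by it
- computed_width = floor((Width - pooling_size) / stride) + 1, which equals Width / pooling_size under the same condition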


In [2]:
import torch
import torch.nn as nn

# Max pooling layer
maxpool = nn.MaxPool2d(kernel_size=2, stride=2)

inp = torch.rand((1, 1, 4, 4)) # BxCxHxW

out = maxpool(inp)

print(inp)
print(inp.shape)
print(out)
print(out.shape)

tensor([[[[0.4049, 0.7216, 0.5804, 0.7205],
          [0.0937, 0.6157, 0.9569, 0.0500],
          [0.7333, 0.6868, 0.8871, 0.8713],
          [0.1976, 0.2960, 0.5173, 0.7368]]]])
torch.Size([1, 1, 4, 4])
tensor([[[[0.7216, 0.9569],
          [0.7333, 0.8871]]]])
torch.Size([1, 1, 2, 2])
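
To make the operation concrete, the sketch below reimplements max pooling in plain Python and applies it to the 4 x 4 input printed above. `max_pool_2d` is a hypothetical helper written for illustration; it assumes no padding and a single channel.

```python
def max_pool_2d(image, pool_size=2, stride=2):
    """Naive max pooling over a 2D list of numbers (no padding)."""
    out_h = (len(image) - pool_size) // stride + 1
    out_w = (len(image[0]) - pool_size) // stride + 1
    return [
        [
            # Maximum over the pool_size x pool_size window at (r, c)
            max(
                image[r * stride + i][c * stride + j]
                for i in range(pool_size)
                for j in range(pool_size)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# The 4x4 input printed above (batch and channel dimensions dropped)
image = [
    [0.4049, 0.7216, 0.5804, 0.7205],
    [0.0937, 0.6157, 0.9569, 0.0500],
    [0.7333, 0.6868, 0.8871, 0.8713],
    [0.1976, 0.2960, 0.5173, 0.7368],
]
pooled = max_pool_2d(image)
print(pooled)  # [[0.7216, 0.9569], [0.7333, 0.8871]]
```

Each output value is the largest element of one non-overlapping 2 x 2 window, which matches the `nn.MaxPool2d` result above.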


#### Activation functions

<img src="https://kjhov195.github.io/post_img/200107/image11.png" width="600">

In image processing, deeper models generally achieve better performance.<br/>
However, as more layers are stacked, the chronic vanishing gradient problem appears, so it is important to apply an appropriate activation function.


In [3]:
import torch
import torch.nn as nn

inp = torch.ones((4))
inp[2:4] = -1 # set the values at indices 2 and 3 to -1

relu = nn.ReLU() # activation function
sigmoid  = nn.Sigmoid()

out1 = relu(inp)
out2 = sigmoid(out1) # note: sigmoid is applied to the ReLU output, not to inp

print(inp)
print(out1)
print(out2)

tensor([ 1.,  1., -1., -1.])
tensor([1., 1., 0., 0.])
tensor([0.7311, 0.7311, 0.5000, 0.5000])
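
The link between the activation function and vanishing gradients can be sketched numerically. During backpropagation the gradient is multiplied by the activation's derivative at every layer; the example below looks only at that factor and ignores the weight matrices, which is a simplifying assumption. The helper names are made up for this illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid; never larger than 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

x = 1.0  # assume the same pre-activation value at every layer
for depth in (1, 5, 10, 20):
    sig = sigmoid_grad(x) ** depth  # product of `depth` sigmoid derivatives
    rel = relu_grad(x) ** depth     # product of `depth` ReLU derivatives
    print(f"depth={depth:2d}  sigmoid: {sig:.3e}  relu: {rel:.1f}")
```

Under these assumptions the sigmoid factor shrinks exponentially with depth, while the ReLU factor stays at 1 for positive inputs. This is one common argument for preferring ReLU-like activations in deep CNNs.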


#### Batch Normalization

|![](https://image.slidesharecdn.com/dlmmdcud1l06optimization-170427160940/95/optimizing-deep-networks-d1l6-insightdcu-machine-learning-workshop-2017-8-638.jpg?cb=1493309658)|
|:--:|
|Internal Covariate Shift Problem|

|![](https://guillaumebrg.files.wordpress.com/2016/02/bn.png?w=656)|
|:--:|
|Batch Normalization (BN)|

<br/>
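Batch normalization normalizes its input and then applies an affine transform (a learnable scale and shift). It is used to address the problem that the distribution of each layer's input shifts as the network gets deeper. <br/>
Batch normalization has the following advantages.

- Stable and fast training
- Suppression of vanishing and exploding gradients
- A slight regularization effect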


In [4]:
import torch
import torch.nn as nn

inp = torch.randn((1, 3, 2, 2))
bn = nn.BatchNorm2d(num_features=3)

out = bn(inp)

print(inp)

print(bn.running_mean)
print(bn.running_var)

print(out)

tensor([[[[-0.5090,  1.5601],
          [ 0.8618,  0.0343]],

         [[-0.1874,  0.3379],
          [ 0.1858,  1.5325]],

         [[ 0.1149,  0.7461],
          [-0.1042,  0.5804]]]])
tensor([0.0487, 0.0467, 0.0334])
tensor([0.9830, 0.9553, 0.9157])
tensor([[[[-1.2624,  1.3606],
          [ 0.4754, -0.5736]],

         [[-1.0163, -0.2008],
          [-0.4369,  1.6540]],

         [[-0.6396,  1.2006],
          [-1.2784,  0.7174]]]], grad_fn=<NativeBatchNormBackward>)
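
The numbers printed above can be reproduced by hand for a single channel. The sketch below assumes the `nn.BatchNorm2d` defaults (`eps=1e-5`, `momentum=0.1`, running mean initialized to 0, running variance to 1, and an affine transform that starts as the identity); `batch_norm_channel` is a hypothetical helper, not part of PyTorch.

```python
import math

def batch_norm_channel(values, eps=1e-5):
    """Normalize one channel with its batch mean and biased variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # biased estimate
    normalized = [(v - mean) / math.sqrt(var + eps) for v in values]
    return normalized, mean, var

# Channel 0 of the input printed above (B=1, H=W=2, so four values)
channel0 = [-0.5090, 1.5601, 0.8618, 0.0343]
normalized, mean, var = batch_norm_channel(channel0)
print([round(v, 4) for v in normalized])

# Running statistics after one training-mode forward pass.
# The running variance is updated with the unbiased estimate.
momentum = 0.1
n = len(channel0)
unbiased_var = var * n / (n - 1)
running_mean = (1 - momentum) * 0.0 + momentum * mean
running_var = (1 - momentum) * 1.0 + momentum * unbiased_var
print(round(running_mean, 4), round(running_var, 4))
```

Up to rounding of the printed inputs, the results should agree with the first channel of `out`, `bn.running_mean[0]`, and `bn.running_var[0]` above.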


#### Training simple CNN

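A group of multiple layers bundled into a single unit is called a building block. <br/>
Let's define a simple CNN model by composing the layers introduced so far into building blocks. <br/>
We will then train this model on the MNIST handwritten-digit image dataset provided by the torchvision package.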

The MNIST dataset consists of images of handwritten digits (0 through 9).

In [5]:
import torch 
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms


# Device configuration
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Hyper parameters
num_epochs = 5   # number of passes over the entire training set
num_classes = 10 # number of digit classes (0, 1, 2, ..., 9)
batch_size = 100 # number of samples per input batch
learning_rate = 0.001

# MNIST dataset
train_dataset = torchvision.datasets.MNIST(root='./data/',
                                           train=True, 
                                           transform=transforms.ToTensor(),
                                           download=True)

test_dataset = torchvision.datasets.MNIST(root='./data/',
                                          train=False, 
                                          transform=transforms.ToTensor())

# Data loader
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size, 
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size, 
                                          shuffle=False)

# Convolutional neural network (two convolutional layers)
class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super(ConvNet, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc = nn.Linear(7*7*32, num_classes) # fully connected layer
        
    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        return out

model = ConvNet(num_classes).to(device)

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Train the model
total_step = len(train_loader)
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        images = images.to(device)
        labels = labels.to(device)
        
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        if (i+1) % 100 == 0:
            print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}' 
                   .format(epoch+1, num_epochs, i+1, total_step, loss.item()))

# Test the model
model.eval()  # eval mode (batchnorm uses moving mean/variance instead of mini-batch mean/variance)
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    print('Test Accuracy of the model on the 10000 test images: {} %'.format(100 * correct / total))


Epoch [1/5], Step [100/600], Loss: 0.1798
Epoch [1/5], Step [200/600], Loss: 0.0870
Epoch [1/5], Step [300/600], Loss: 0.0917
Epoch [1/5], Step [400/600], Loss: 0.0498
Epoch [1/5], Step [500/600], Loss: 0.0759
Epoch [1/5], Step [600/600], Loss: 0.0515
Epoch [2/5], Step [100/600], Loss: 0.0700
Epoch [2/5], Step [200/600], Loss: 0.0225
Epoch [2/5], Step [300/600], Loss: 0.0363
Epoch [2/5], Step [400/600], Loss: 0.0477
Epoch [2/5], Step [500/600], Loss: 0.0130
Epoch [2/5], Step [600/600], Loss: 0.0520
Epoch [3/5], Step [100/600], Loss: 0.0129
Epoch [3/5], Step [200/600], Loss: 0.0790
Epoch [3/5], Step [300/600], Loss: 0.0091
Epoch [3/5], Step [400/600], Loss: 0.0250
Epoch [3/5], Step [500/600], Loss: 0.0245
Epoch [3/5], Step [600/600], Loss: 0.0223
Epoch [4/5], Step [100/600], Loss: 0.0403
Epoch [4/5], Step [200/600], Loss: 0.0479
Epoch [4/5], Step [300/600], Loss: 0.0552
Epoch [4/5], Step [400/600], Loss: 0.0033
Epoch [4/5], Step [500/600], Loss: 0.0149
Epoch [4/5], Step [600/600], Loss:
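
The `7*7*32` input size of the fully connected layer in `ConvNet` follows from the layer settings. The sketch below traces the feature-map shape through both building blocks in plain Python; `conv_out` and `pool_out` are hypothetical helpers that assume the usual floor-based output-size formulas.

```python
def conv_out(size, kernel_size, stride, pad):
    """Spatial output size of a convolution."""
    return (size - kernel_size + 2 * pad) // stride + 1

def pool_out(size, kernel_size, stride):
    """Spatial output size of a pooling layer without padding."""
    return (size - kernel_size) // stride + 1

size = 28      # MNIST images are 28x28
channels = 1   # grayscale input
for out_channels in (16, 32):  # layer1, then layer2
    size = conv_out(size, kernel_size=5, stride=1, pad=2)  # size unchanged
    size = pool_out(size, kernel_size=2, stride=2)          # size halved
    channels = out_channels
    print(f"after block: {channels} x {size} x {size}")

fc_in_features = channels * size * size
print(fc_in_features)  # 1568, i.e. 7*7*32
```

The 5 x 5 convolutions with padding 2 preserve the spatial size, and each max pooling halves it (28 to 14 to 7), so the flattened feature vector has 32 x 7 x 7 = 1568 elements.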

### Advanced Architecture

Beyond the basic layers, well-performing models also use techniques such as the following.

#### 1x1 Convolution layer (convolution with kernel_size = 1)

![https://cheong.netlify.app/static/fb817ed940cd331991e5f40effdaf455/799d3/image16.png](https://cheong.netlify.app/static/fb817ed940cd331991e5f40effdaf455/799d3/image16.png)

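A 1x1 convolution is typically used to reduce the number of input channels and thus the amount of computation, or to match the number of channels of another feature map. <br/>
For example, before applying a computationally expensive convolution such as a 3x3 or 5x5 conv, reducing the input channels with a 1x1 conv and then applying the 3x3 conv gives a similar effect at a lower cost. (ResNet's Bottleneck block) <br/>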
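
The cost saving can be estimated by counting weights. The sketch below compares a direct 3x3 convolution on 256 channels with a 1x1, 3x3, 1x1 bottleneck that squeezes to 64 channels; these channel counts are taken from the bottleneck example in the skip connection section, `conv_weights` is a hypothetical helper, and bias terms are ignored.

```python
def conv_weights(in_channels, out_channels, kernel_size):
    """Number of weights in a 2D convolution (bias terms ignored)."""
    return in_channels * out_channels * kernel_size * kernel_size

# Direct 3x3 convolution, 256 -> 256 channels
direct = conv_weights(256, 256, 3)

# Bottleneck: 1x1 (256 -> 64), 3x3 (64 -> 64), 1x1 (64 -> 256)
bottleneck = (
    conv_weights(256, 64, 1)
    + conv_weights(64, 64, 3)
    + conv_weights(64, 256, 1)
)

print(direct)                         # 589824
print(bottleneck)                     # 69632
print(round(direct / bottleneck, 1))  # 8.5
```

With stride 1 and an unchanged spatial size, the number of multiply-accumulate operations scales with the weight count, so the bottleneck is roughly 8.5 times cheaper in this setting.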



In [6]:
import torch
import torch.nn as nn

inp = torch.ones((1, 10, 4, 4)) # C=10

conv1x1 = nn.Conv2d(in_channels=10, out_channels=3, kernel_size=1, stride=1, padding=0) # kernel_size = 1

out = conv1x1(inp)

print(inp.shape)
print(out.shape) # C=3

torch.Size([1, 10, 4, 4])
torch.Size([1, 3, 4, 4])


#### Element-wise addition / Concatenate

|![](https://codeforwin.org/ezoimgfmt/secureservercdn.net/160.153.138.219/b79.d22.myftpupload.com/wp-content/uploads/2015/07/matrix-addition.png?ezimgfmt=rs:392x204/rscb1)|
|:--:|
|Element-wise Addition|

|![](https://lh3.googleusercontent.com/proxy/FzrxubMd4t113IigFibyfUm283qNi3_ZxCGzMaMw9Rwj6w2SmhtWKtHefLTk7XMpZmM9EJfoE1CLUE5PRYKMv2gDImWTY-1qABadiHp-e-ukUux8h8axxX_LbeBUI0QXTD1nHNRA_AbHc1OiWnLDyxOl9SpXwAo)|
|:--:|
|Concatenate (Inception module)|

Element-wise addition, as the name says, adds the corresponding elements of tensors that have the same shape. <br/>
Concatenation joins tensors along a chosen dimension.

In [7]:
import torch
import torch.nn as nn

A = torch.arange(1, 10).view(3, 3)
B = torch.arange(9, 0, -1).view(3, 3)

print(A)
print(B)

out1 = A + B # element-wise addition
print(out1)

out2 = torch.cat((A, B), 0) # concat in 0 dim
print(out2)
print(out2.shape)

tensor([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]])
tensor([[9, 8, 7],
        [6, 5, 4],
        [3, 2, 1]])
tensor([[10, 10, 10],
        [10, 10, 10],
        [10, 10, 10]])
tensor([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9],
        [9, 8, 7],
        [6, 5, 4],
        [3, 2, 1]])
torch.Size([6, 3])
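
In a CNN, concatenation is usually applied along the channel dimension, as in the Inception module pictured above. The sketch below uses made-up branch outputs (the tensor names and channel counts are assumptions for illustration) that share the same batch and spatial size.

```python
import torch

# Hypothetical branch outputs in B x C x H x W format
branch_1x1 = torch.ones((1, 64, 28, 28))
branch_3x3 = torch.ones((1, 128, 28, 28))
branch_5x5 = torch.ones((1, 32, 28, 28))

# Concatenate along dim=1, the channel dimension in BCHW format
merged = torch.cat((branch_1x1, branch_3x3, branch_5x5), dim=1)
print(merged.shape)  # torch.Size([1, 224, 28, 28])
```

Every dimension except the concatenated one must match, so the branches need identical batch, height, and width. The channel counts simply add up (64 + 128 + 32 = 224).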


#### Skip connection (shortcut connection)

|![](https://datascienceschool.net/upfiles/6182312059774a81a2a26246bd4e83f2.png)|
|:---:|
|*Skip Connection (ResNet)*|

|![](https://cheong.netlify.app/static/5a711b3b3b3d4789e4d0e0fc742c0e11/7f576/image24.png)|
|:---:|
|*Bottleneck Block (ResNet)*|
<br/>

Skip connections were popularized by ResNet and, thanks to the advantages below, are now used as an essential component of many CNNs. <br/>

- Mitigates the degradation problem of deep models
- Smoother gradient propagation


In [8]:
import torch
import torch.nn as nn

x_in = torch.ones((1, 256, 28, 28))

conv1 = nn.Conv2d(256, 64, kernel_size=1, stride=1, padding=0) # 1x1 conv
conv2 = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1) # 3x3 conv with padding
conv3 = nn.Conv2d(64, 256, kernel_size=1, stride=1, padding=0) # 1x1 conv
F = nn.Sequential(conv1, conv2, conv3) # Bottleneck block

F_out = F(x_in)
# With the nn.Sequential module, the layers are applied in order (feedforward) as shown below.
# out = conv1(x_in)
# out = conv2(out)
# out = conv3(out)

print(F_out.shape)

x_out = F_out + x_in # skip connection (element-wise addition)
print(x_out.shape)

torch.Size([1, 256, 28, 28])
torch.Size([1, 256, 28, 28])
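
The same idea can be packaged as a reusable module. This is a minimal sketch under stated assumptions, not the reference ResNet implementation: `BottleneckBlock` is a hypothetical class, and the batch normalization and ReLU placement follows a common convention that the example above does not include.

```python
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """Hypothetical residual block: 1x1 -> 3x3 -> 1x1 convs plus an identity shortcut."""

    def __init__(self, channels=256, mid_channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid_channels, kernel_size=1),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(),
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(),
            nn.Conv2d(mid_channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        # Skip connection: add the input back before the final activation
        return self.relu(self.body(x) + x)

block = BottleneckBlock()
x_in = torch.ones((1, 256, 28, 28))
x_out = block(x_in)
print(x_out.shape)  # torch.Size([1, 256, 28, 28])
```

Because the body preserves both the channel count and the spatial size, the input can be added to its output without any projection. When the shapes differ, the shortcut would need its own layer (for example a 1x1 conv) to match them.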
