## Convolutional Neural Network

reference: [https://wegonnamakeit.tistory.com/48](https://wegonnamakeit.tistory.com/48), [http://taewan.kim/post/cnn/](http://taewan.kim/post/cnn/)

### Basic CNN
This section covers the layers most commonly used to build a CNN. <br/>



#### Convolution layer (also known as spatial convolution layer)

![img](http://deeplearning.net/software/theano/_images/numerical_padding_strides.gif)

- Convolution: an operation that slides a filter (or kernel) across the image in steps of the stride; at each position it multiplies the overlapping elements pairwise and outputs their sum.
- filter (kernel): number_of_filters x input_channels x kernel_size x kernel_size
- Stride: the step by which the filter moves at a time as a sliding window.
- Padding: fills the top, bottom, left, and right borders of the image with zeros, pad pixels wide on each side. It is used to control the width and height of the output.
- input (image or features): Batch x Channel x Height x Width (PyTorch: BCHW format, TensorFlow: BHWC format)
  - Batch is the group of input samples processed together. If the input is a single image, its shape is 1 x C x H x W. For a 256 x 256 RGB image, the input is 1x3x256x256.
- output (features or feature map): Batch x number_of_filters x computed_height x computed_width
  - computed_width = floor((width - kernel_size + 2*pad) / stride) + 1
  - computed_height = floor((height - kernel_size + 2*pad) / stride) + 1
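
The definitions above can be sketched in plain Python. This is a minimal illustration, not how PyTorch implements convolution: it handles one single-channel image and one square kernel, omits the bias, and the helper names `conv_output_size` and `conv2d_single_channel` are made up for this example.

```python
def conv_output_size(size, kernel_size, stride=1, pad=0):
    """Output length along one spatial dimension (integer floor division)."""
    return (size - kernel_size + 2 * pad) // stride + 1


def conv2d_single_channel(image, kernel, stride=1, pad=0):
    """Naive 2D convolution (cross-correlation, as used in deep learning)
    of one single-channel image with one square kernel, without a bias."""
    k = len(kernel)
    h, w = len(image), len(image[0])

    # Zero-pad the image on all four sides.
    padded = [[0.0] * (w + 2 * pad) for _ in range(h + 2 * pad)]
    for r in range(h):
        for c in range(w):
            padded[r + pad][c + pad] = image[r][c]

    out_h = conv_output_size(h, k, stride, pad)
    out_w = conv_output_size(w, k, stride, pad)
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the overlapping elements pairwise and sum them.
            out[i][j] = sum(
                padded[i * stride + a][j * stride + b] * kernel[a][b]
                for a in range(k)
                for b in range(k)
            )
    return out


# A 256x256 input with kernel_size=3, stride=2, pad=1
# (the settings of the Conv2d example in this section).
print(conv_output_size(256, kernel_size=3, stride=2, pad=1))  # 128

# A 4x4 image of ones with a 3x3 kernel of ones, stride 1, pad 1.
image = [[1.0] * 4 for _ in range(4)]
kernel = [[1.0] * 3 for _ in range(3)]
result = conv2d_single_channel(image, kernel, stride=1, pad=1)
for row in result:
    print(row)
```

With zero padding, the corner outputs overlap only four input pixels, so they sum to 4.0, while interior outputs sum to 9.0.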


In [1]:
import torch
import torch.nn as nn

in_channels = 3
out_channels = 2 # out_channels is the same as number_of_filters
kernel_size = 3
stride = 2
pad = 1

# Basic 2D convolution layer
conv = nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=kernel_size, stride=stride, padding=pad)

print('conv.weight:\n', conv.weight) 
print('conv.weight.shape:\n', conv.weight.shape) # the filter (kernel) shape is the shape of the weights
print()
print('conv.bias:\n', conv.bias)
print('conv.bias.shape:\n', conv.bias.shape) # like a linear layer, a convolution layer can have a bias
print()

inp = torch.ones((1, 3, 256, 256)) # B x C x H x W: one 256x256 RGB image
print('input:\n', inp)

out = conv(inp)
print('output:\n', out)
print('output.shape:\n', out.shape)


conv.weight:
 Parameter containing:
tensor([[[[ 0.1615,  0.1683, -0.0925],
          [ 0.0425,  0.0511, -0.1227],
          [-0.0869,  0.0911, -0.1448]],

         [[ 0.0687,  0.1134, -0.0016],
          [-0.0437, -0.0768, -0.1107],
          [-0.0434,  0.0844,  0.1567]],

         [[ 0.1155, -0.0633,  0.0533],
          [ 0.1172,  0.1431, -0.0745],
          [-0.0649, -0.0369, -0.1078]]],


        [[[ 0.1911,  0.0234, -0.1513],
          [-0.1400, -0.1704, -0.1870],
          [ 0.1102,  0.1501,  0.1861]],

         [[ 0.0070,  0.0397, -0.0540],
          [-0.1614,  0.1011, -0.1816],
          [ 0.0786, -0.1219, -0.0636]],

         [[-0.1382, -0.0883, -0.1460],
          [ 0.1529,  0.1533, -0.1501],
          [ 0.1244, -0.1374, -0.1139]]]], requires_grad=True)
conv.weight.shape:
 torch.Size([2, 3, 3, 3])

conv.bias:
 Parameter containing:
tensor([-0.1112,  0.0886], requires_grad=True)
conv.bias.shape:
 torch.Size([2])

input:
 tensor([[[[1., 1., 1.,  ..., 1., 1., 1.],
          [1., 

#### Max Pooling layer

<img src="https://www.jeremyjordan.me/content/images/2017/07/Screen-Shot-2017-07-27-at-11.43.19-AM.png" width="600">

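- pooling_size (kernel_size): the size of the sliding window used by the pooling layer.
- stride: the step by which the sliding window moves (usually set equal to the pooling size).
- input: Batch x Channel x Height x Width
- output: Batch x Channel x computed_height x computed_width (the number of channels is preserved)
- computed_height = floor(Height / pooling_size), when the stride equals pooling_size and no padding is used
- computed_width = floor(Width / pooling_size), under the same conditions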
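
As a rough sketch of what the pooling window does, the following plain-Python function applies max pooling to a single channel stored as a list of rows. The name `max_pool2d` and the sample values are illustrative assumptions, and padding is not handled.

```python
def max_pool2d(feature_map, pooling_size=2, stride=None):
    """Naive max pooling over one channel given as a list of rows (no padding)."""
    if stride is None:
        stride = pooling_size  # common default: non-overlapping windows
    h, w = len(feature_map), len(feature_map[0])
    out_h = (h - pooling_size) // stride + 1
    out_w = (w - pooling_size) // stride + 1
    return [
        [
            # Keep only the largest value inside each window.
            max(
                feature_map[i * stride + a][j * stride + b]
                for a in range(pooling_size)
                for b in range(pooling_size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [7, 0, 3, 3],
    [1, 8, 2, 6],
]
pooled = max_pool2d(feature_map, pooling_size=2)
print(pooled)  # [[4, 5], [8, 6]]
```

Each 2x2 window contributes one value, so the 4x4 map shrinks to 2x2 while the channel count stays the same.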


In [2]:
import torch
import torch.nn as nn

# Max pooling layer
maxpool = nn.MaxPool2d(kernel_size=2, stride=2)

inp = torch.rand((1, 1, 4, 4)) # BxCxHxW

out = maxpool(inp)

print(inp, inp.shape)
print(out, out.shape)

tensor([[[[0.4419, 0.8402, 0.3806, 0.0291],
          [0.9633, 0.6814, 0.5503, 0.9024],
          [0.7750, 0.6866, 0.8948, 0.4510],
          [0.1990, 0.8540, 0.3861, 0.1111]]]]) torch.Size([1, 1, 4, 4])
tensor([[[[0.9633, 0.9024],
          [0.8540, 0.8948]]]]) torch.Size([1, 1, 2, 2])


A unit that groups multiple layers together is called a building block. <br/>
Let's define a simple CNN model by composing the layers described so far into building blocks. <br/>
We then train this model on the MNIST handwritten-digit dataset provided by the torchvision package.
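
Before reading the model, it may help to trace the feature-map sizes by hand. The sketch below assumes 28x28 single-channel MNIST inputs and the two building blocks used in the next cell (a 5x5 convolution with stride 1 and padding 2, followed by a 2x2 max pool with stride 2). BatchNorm and ReLU are left out because they do not change the shape, and the helper names are made up for this example.

```python
def conv_output_size(size, kernel_size, stride=1, pad=0):
    return (size - kernel_size + 2 * pad) // stride + 1


def pool_output_size(size, pooling_size, stride):
    return (size - pooling_size) // stride + 1


size = 28      # MNIST images are 28x28
channels = 1   # grayscale

# (out_channels, conv kernel_size, conv stride, conv padding) per building block
blocks = [(16, 5, 1, 2), (32, 5, 1, 2)]

for out_channels, kernel_size, stride, pad in blocks:
    size = conv_output_size(size, kernel_size, stride, pad)  # 5x5 conv with pad 2 keeps the size
    size = pool_output_size(size, pooling_size=2, stride=2)  # 2x2 max pool halves it
    channels = out_channels
    print(f"{channels} x {size} x {size}")

flattened = channels * size * size
print(flattened)  # 1568 = 7*7*32
```

This is where the `7*7*32` input size of the fully connected layer comes from: 28 becomes 14 after the first block and 7 after the second, with 32 channels at the end.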

In [4]:
import torch 
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms


# Device configuration
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Hyper parameters
num_epochs = 5
num_classes = 10
batch_size = 100 # number of input samples per mini-batch
learning_rate = 0.001

# MNIST dataset
train_dataset = torchvision.datasets.MNIST(root='./data/',
                                           train=True, 
                                           transform=transforms.ToTensor(),
                                           download=True)

test_dataset = torchvision.datasets.MNIST(root='./data/',
                                          train=False, 
                                          transform=transforms.ToTensor())

# Data loader
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size, 
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size, 
                                          shuffle=False)

# Convolutional neural network (two convolutional layers)
class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super(ConvNet, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc = nn.Linear(7*7*32, num_classes) # fully connected layer
        
    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        return out

model = ConvNet(num_classes).to(device)

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Train the model
total_step = len(train_loader)
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        images = images.to(device)
        labels = labels.to(device)
        
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        if (i+1) % 100 == 0:
            print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}' 
                   .format(epoch+1, num_epochs, i+1, total_step, loss.item()))

# Test the model
model.eval()  # eval mode (batchnorm uses moving mean/variance instead of mini-batch mean/variance)
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)  # index of the highest score per sample
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    print('Test Accuracy of the model on the 10000 test images: {} %'.format(100 * correct / total))


0it [00:00, ?it/s]Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ./data/MNIST\raw\train-images-idx3-ubyte.gz
9920512it [00:03, 2509184.48it/s]
Extracting ./data/MNIST\raw\train-images-idx3-ubyte.gz to ./data/MNIST\raw
0it [00:00, ?it/s]Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ./data/MNIST\raw\train-labels-idx1-ubyte.gz
32768it [00:00, 52601.65it/s]
0it [00:00, ?it/s]Extracting ./data/MNIST\raw\train-labels-idx1-ubyte.gz to ./data/MNIST\raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ./data/MNIST\raw\t10k-images-idx3-ubyte.gz
1654784it [00:02, 697141.88it/s]
0it [00:00, ?it/s]Extracting ./data/MNIST\raw\t10k-images-idx3-ubyte.gz to ./data/MNIST\raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ./data/MNIST\raw\t10k-labels-idx1-ubyte.gz
8192it [00:00, 15142.05it/s]
Extracting ./data/MNIST\raw\t10k-labels-idx1-ubyte.gz to ./data/MNIST\raw
Processing...
Done!
Epoch [1/5],

### Advanced Contents

The following material was added to help with understanding the Pose Estimation models introduced later.

#### 1x1 Convolution layer (convolution with kernel_size = 1)

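It is typically used to reduce the number of input channels, and with it the amount of computation, or to match the channel count of another set of features. <br/>
e.g., reducing the channel dimension with a 1x1 conv before a computationally expensive convolution such as a 3x3 or 5x5 conv can achieve a similar effect at lower cost.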
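
To give a feel for the savings, the sketch below counts multiply-accumulate operations (MACs) for a direct 3x3 convolution and for a 1x1 reduction followed by the same 3x3 convolution. The feature-map size and channel counts are hypothetical values chosen for illustration, and the count assumes stride 1, size-preserving padding, and no bias.

```python
def conv_macs(height, width, in_channels, out_channels, kernel_size):
    """Multiply-accumulate count of a stride-1, size-preserving convolution (bias ignored)."""
    return height * width * in_channels * out_channels * kernel_size * kernel_size


# Hypothetical sizes, chosen only for illustration.
h, w = 28, 28
c_in, c_mid, c_out = 256, 64, 256

# Direct 3x3 convolution: 256 -> 256 channels.
direct = conv_macs(h, w, c_in, c_out, kernel_size=3)

# 1x1 convolution reduces 256 -> 64 channels, then 3x3 maps 64 -> 256.
bottleneck = conv_macs(h, w, c_in, c_mid, kernel_size=1) + conv_macs(h, w, c_mid, c_out, kernel_size=3)

print(f"direct 3x3:           {direct:,} MACs")
print(f"1x1 reduce, then 3x3: {bottleneck:,} MACs")
print(f"ratio: {direct / bottleneck:.2f}x")
```

With these numbers the reduced version needs about 3.6 times fewer MACs. Whether accuracy stays comparable depends on the model and the task.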


In [12]:
import torch
import torch.nn as nn

inp_data = torch.ones((1, 10, 4, 4))

conv1x1 = nn.Conv2d(in_channels=10, out_channels=3, kernel_size=1, stride=1, padding=0) # kernel_size = 1

output = conv1x1(inp_data)

print(inp_data.shape)
print(output.shape) # channel reduction: 10 -> 3

torch.Size([1, 10, 4, 4])
torch.Size([1, 3, 4, 4])
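
Another way to read a 1x1 convolution is as a linear map across channels applied independently at every pixel. The plain-Python sketch below illustrates that view; `conv1x1` here is a hypothetical helper rather than the PyTorch layer, the weights are arbitrary, and the batch dimension is omitted.

```python
def conv1x1(feature, weight, bias):
    """feature: [C_in][H][W], weight: [C_out][C_in], bias: [C_out] -> [C_out][H][W]."""
    c_in, h, w = len(feature), len(feature[0]), len(feature[0][0])
    return [
        [
            [
                # Weighted sum over the input channels at pixel (y, x).
                bias[o] + sum(weight[o][c] * feature[c][y][x] for c in range(c_in))
                for x in range(w)
            ]
            for y in range(h)
        ]
        for o in range(len(weight))
    ]


# Same shapes as the cell above, without the batch dimension:
# 10 input channels -> 3 output channels on a 4x4 map.
feature = [[[1.0] * 4 for _ in range(4)] for _ in range(10)]
weight = [[0.1 * (o + 1)] * 10 for o in range(3)]  # arbitrary weights for illustration
bias = [0.0, 0.0, 0.0]

out = conv1x1(feature, weight, bias)
print(len(out), len(out[0]), len(out[0][0]))  # 3 4 4
```

The spatial size is unchanged because each output pixel depends only on the input pixel at the same location.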
