# M2177.003100 Deep Learning Assignment #1<br> Part 1-3. Training Vision Transformers (PyTorch)

Copyright (C) Data Science & AI Laboratory, Seoul National University. This material is for educational uses only. Some contents are based on the material provided by other paper/book authors and may be copyrighted by them. Written by Jaehoon Lee, September 2023

**For understanding of this work, please carefully look at given PDF file.**

Now, you're going to leave behind your implementations and instead migrate to one of popular deep learning frameworks, **PyTorch**. <br>
In this notebook, you will learn to understand and build the basic components of Vision Tranformer(ViT). Then, you will try to classify images in the FashionMNIST datatset and explore the effects of different components of ViTs.
<br>
There are **2 sections**, and in each section, you need to follow the instructions to complete the skeleton codes and explain them.

**Note**: certain details are missing or ambiguous on purpose, in order to test your knowledge on the related materials. However, if you really feel that something essential is missing and cannot proceed to the next step, then contact the teaching staff with clear description of your problem.

### Submitting your work:
<font color=red>**DO NOT clear the final outputs**</font> so that TAs can grade both your code and results. 

### Some helpful tutorials and references for assignment #1-2:
- [1] Pytorch official documentation. [[link]](https://pytorch.org/docs/stable/index.html)
- [2] Stanford CS231n lectures. [[link]](http://cs231n.stanford.edu/)
- [3] Alexey Dosovitskiy et al., "An Image is Worth 16 x 16 Words: Transformers for Image Recognition at Scale", ICLR 2021. [[pdf]](https://arxiv.org/pdf/2010.11929.pdf)

## 1. Building Vision Transformer
Here, you will build the basic components of Vision Transformer(ViT). <br>

![Vision Transformer](imgs/ViT.png)

Using the explanation and code provided as guidance, <br>
Define each component of ViT. <br>


#### ViT architecture:
* ViT model consists with input patch embedding, positional embeddings, transformer encoder, etc.
* Patch embedding
* Positional embeddings
* Transformer encoder with
    * Attention module
    * MLP module

In [1]:
import torch
import torch.nn as nn

##### Patch Embed

**Initialization**: When you create an instance of the PatchEmbedding class, you specify the image_size, patch_size, and in_channels. image_size is the height and width of the input image, patch_size is the size of each patch, and in_channels is the number of input image channels (e.g., 3 for RGB images). 

**Convolutional Projection**: Inside the PatchEmbedding class, a 2D convolutional layer (nn.Conv2d) is used to perform a patch-based projection. This convolutional layer has a kernel size of patch_size, which defines the size of each patch, and a stride of patch_size, which ensures that patches do not overlap. The convolutional layer effectively extracts image patches.

**Reshaping**: After the convolutional projection, the output tensor is reshaped using view. It is transformed from a 4D tensor with dimensions (batch_size, in_channels, H, W) to a 3D tensor with dimensions (batch_size, num_patches, patch_dim). num_patches is the total number of non-overlapping patches in the image, and patch_dim is the number of output channels from the convolutional layer.

In [2]:
class PatchEmbed(nn.Module):
    """ Image to Patch Embedding
    """

    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) * (img_size // patch_size)
        self.img_size = img_size
        self.patch_size = patch_size
        self.num_patches = num_patches

        ##############################################################################
        #                           IMPLEMENT YOUR CODE                              #
        ##############################################################################
        self.projection = nn.Conv2d(
            in_channels=in_chans,
            out_channels=embed_dim,
            kernel_size=patch_size,
            stride=patch_size
        )

        ##############################################################################
        #                              END YOUR CODE                                 #
        ##############################################################################
    def forward(self, x):
        ##############################################################################
        #                           IMPLEMENT YOUR CODE                              #
        ##############################################################################
        x = self.projection(x)
        x = x.view(x.size(0), -1, x.size(1))

        ##############################################################################
        #                              END YOUR CODE                                 #
        ##############################################################################
        return x # output dimension must be: (batch size, number of patches, embed_dim)

##### Attention

**Initialization**
* dim: The input dimension of the sequence. This is the dimensionality of the queries, keys, and values.
* num_heads: The number of attention heads to use. Multi-head attention allows the model to focus on different parts of the input simultaneously.

**Linear Projections (qkv and proj)**: The qkv linear layer takes the input sequence and projects it into three parts: queries (q), keys (k), and values (v). The output of this layer has a shape of (batch_size, sequence_length, 3 * dim).

**Forward Pass (forward method)**: In the forward pass, the input tensor x is processed through the attention mechanism. Here's what happens:<br>
* The linear projection qkv is applied to x, producing a tensor of shape (batch_size, sequence_length, 3 * dim).|
* This tensor is reshaped to have dimensions (batch_size, sequence_length, 3, num_heads, head_dim). The permute operation rearranges the dimensions to (3, batch_size, num_heads, sequence_length, head_dim), making it suitable for multi-head attention.
* The three parts, q, k, and v, are extracted from the reshaped tensor.
* The attention scores are computed by taking the dot product of queries q and keys k. The result is scaled by self.scale.
* The attention scores are passed through a softmax activation along the last dimension (sequence_length), producing attention weights.
* The weighted sum of values v is computed using the attention weights.
* The result is transposed and reshaped to its original shape, and then passed through the proj linear layer.
* The final output is returned.

In [3]:
class Attention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = head_dim ** -0.5

        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, C = x.shape
        ##############################################################################
        #                           IMPLEMENT YOUR CODE                              #
        ##############################################################################
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        
        ##############################################################################
        #                              END YOUR CODE                                 #
        ##############################################################################

        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)
        
        return x # output dimension must be: (batch size, number of patches, embed_dim)

##### MLP

The MLP module must consist of three layers:
* fully conncted layer 1
* activation layer
* fully conncted layer 2

In [4]:
class Mlp(nn.Module):
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features

        ##############################################################################
        #                           IMPLEMENT YOUR CODE                              #
        ##############################################################################
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)

        ##############################################################################
        #                              END YOUR CODE                                 #
        ##############################################################################

    def forward(self, x):
        ##############################################################################
        #                           IMPLEMENT YOUR CODE                              #
        ##############################################################################
        x = self.fc1(x)
        x = self.act(x)
        x = self.fc2(x)
        
        ##############################################################################
        #                              END YOUR CODE                                 #
        ##############################################################################
        return x # output dimension must be: (batch size, number of patches, out_features)

##### Transformer Block
The transformer block contains the attention module and MLP module which have residual connections. 
Refer to the following image and build the forward pass.

![Transformer Block](imgs/TransformerBlock.png)

In [5]:
class Block(nn.Module):
    def __init__(self, dim, num_heads, mlp_ratio=4., act_layer=nn.GELU, norm_layer=nn.LayerNorm):
        super().__init__()
        self.norm1 = norm_layer(dim)
        self.attn = Attention(dim, num_heads=num_heads)
        self.norm2 = norm_layer(dim)
        mlp_hidden_dim = int(dim * mlp_ratio)
        self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim,
                       act_layer=act_layer)

    def forward(self, x):
        ##############################################################################
        #                           IMPLEMENT YOUR CODE                              #
        ##############################################################################
        x = x + self.attn(self.norm1(x))
        x = x + self.mlp(self.norm2(x))

        ##############################################################################
        #                              END YOUR CODE                                 #
        ##############################################################################
        return x


##### Vision Transformer

Using all the components that you built above, **complete** the vision transformer class.

In [6]:
class VisionTransformer(nn.Module):
    """ Vision Transformer """

    def __init__(self, img_size=28, patch_size=4, in_chans=1, num_classes=10, embed_dim=768, depth=12,
                 num_heads=12, mlp_ratio=4., norm_layer=nn.LayerNorm, ):
        super().__init__()
        self.num_features = self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.depth = depth

        self.patch_embed = PatchEmbed(
            img_size=img_size, patch_size=patch_size, in_chans=in_chans, embed_dim=embed_dim)
        num_patches = self.patch_embed.num_patches

        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        ##############################################################################
        #                           IMPLEMENT YOUR CODE                              #
        ############################################################################## 
        # similarly to cls_token, define a learnable positional embedding that matches the patchified input token size.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

        ##############################################################################
        #                              END YOUR CODE                                 #
        ##############################################################################

        self.blocks = nn.ModuleList([
            Block(
                dim=embed_dim, num_heads=num_heads, mlp_ratio=mlp_ratio,  norm_layer=norm_layer)
            for i in range(depth)])
        self.norm = norm_layer(embed_dim)

        # Classifier head
        self.head = nn.Linear(
            embed_dim, num_classes) if num_classes > 0 else nn.Identity()

    def forward(self, x):
        ##############################################################################
        #                           IMPLEMENT YOUR CODE                              #
        ##############################################################################
        B = x.shape[0]
        
        # Patch Embedding
        x = self.patch_embed(x)
        
        # Concatenate class tokens to patch embedding
        cls_token = self.cls_token.expand(B, -1, -1)
        x = torch.cat((cls_token, x), dim=1)

        # Add positional embedding to patches
        x = x + self.pos_embed

        # Forward through encoder blocks
        for block in self.blocks:
            x = block(x)

        # Use class token for classification
        x = self.norm(x)
        x = x[:, 0]
        
        # Classifier head
        x = self.head(x)

        ##############################################################################
        #                              END YOUR CODE                                 #
        ##############################################################################
        return x

## 2. Training a small ViT model on FashionMNIST dataset.

Define and Train a vision transformer on FashionMNIST dataset. **(You must reach above 85% for full points.)** <br>
Train with at least 5 different hyperparameter settings varying the following ViT hyperparameters. 
Report the setting for the best performance.

#### ViT hyperparameters:
* patch_size
* embed_dim
* depth
* num_heads
* mlp_ratio
* etc.


In [7]:
import numpy as np

from tqdm import tqdm, trange

import torch
import torch.nn as nn
from torch.optim import Adam
from torch.nn import CrossEntropyLoss
from torch.utils.data import DataLoader

from torchvision.transforms import ToTensor
from torchvision.datasets.mnist import FashionMNIST

In [8]:
def Train():
    ##############################################################################
    #                           IMPLEMENT YOUR CODE                              #
    ##############################################################################

    patch_size=4
    embed_dim=256
    depth=6
    num_heads=8 # make sure embed_dim is divisible by num_heads!
    mlp_ratio=4
    
    ##############################################################################
    #                              END YOUR CODE                                 #
    ##############################################################################
    
    # Loading data
    transform = ToTensor()

    train_set = FashionMNIST(root='./data', train=True, download=True, transform=transform)
    test_set = FashionMNIST(root='./data', train=False, download=True, transform=transform)

    train_loader = DataLoader(train_set, shuffle=True, batch_size=128)
    test_loader = DataLoader(test_set, shuffle=False, batch_size=128)

    # Defining model and training options
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print("Using device: ", device, f"({torch.cuda.get_device_name(device)})" if torch.cuda.is_available() else "")
    
    model = VisionTransformer(patch_size=patch_size, embed_dim=embed_dim, depth=depth, num_heads=num_heads, mlp_ratio=mlp_ratio).to(device)
    model_path = './vit.pth'
    N_EPOCHS = 30
    LR = 0.005

    # Training loop
    optimizer = Adam(model.parameters(), lr=LR)
    criterion = CrossEntropyLoss()
    for epoch in trange(N_EPOCHS, desc="Training"):
        train_loss = 0.0
        for batch in tqdm(train_loader, desc=f"Epoch {epoch + 1} in training", leave=False):
            x, y = batch
            x, y = x.to(device), y.to(device)
            y_hat = model(x)
            loss = criterion(y_hat, y)

            train_loss += loss.detach().cpu().item() / len(train_loader)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        print(f"Epoch {epoch + 1}/{N_EPOCHS} loss: {train_loss:.2f}")

    # Test loop
    with torch.no_grad():
        correct, total = 0, 0
        test_loss = 0.0
        for batch in tqdm(test_loader, desc="Testing"):
            x, y = batch
            x, y = x.to(device), y.to(device)
            y_hat = model(x)
            loss = criterion(y_hat, y)
            test_loss += loss.detach().cpu().item() / len(test_loader)

            correct += torch.sum(torch.argmax(y_hat, dim=1) == y).detach().cpu().item()
            total += len(x)
        print(f"Test loss: {test_loss:.2f}")
        print(f"Test accuracy: {correct / total * 100:.2f}%")

    torch.save(model.state_dict(), model_path)
    print('Saved Trained Model.')
    
Train()

Using device:  cuda (NVIDIA GeForce RTX 3060 Ti)


Training:   3%|▎         | 1/30 [00:24<11:50, 24.51s/it]

Epoch 1/30 loss: 0.92


Training:   7%|▋         | 2/30 [00:47<11:04, 23.73s/it]

Epoch 2/30 loss: 0.61


Training:  10%|█         | 3/30 [01:11<10:36, 23.56s/it]

Epoch 3/30 loss: 0.54


Training:  13%|█▎        | 4/30 [01:34<10:10, 23.50s/it]

Epoch 4/30 loss: 0.50


Training:  17%|█▋        | 5/30 [01:57<09:47, 23.50s/it]

Epoch 5/30 loss: 0.45


Training:  20%|██        | 6/30 [02:21<09:23, 23.47s/it]

Epoch 6/30 loss: 0.44


Training:  23%|██▎       | 7/30 [02:44<08:59, 23.46s/it]

Epoch 7/30 loss: 0.42


Training:  27%|██▋       | 8/30 [03:08<08:35, 23.45s/it]

Epoch 8/30 loss: 0.42


Training:  30%|███       | 9/30 [03:31<08:12, 23.45s/it]

Epoch 9/30 loss: 0.39


Training:  33%|███▎      | 10/30 [03:55<07:48, 23.45s/it]

Epoch 10/30 loss: 0.39


Training:  37%|███▋      | 11/30 [04:18<07:24, 23.41s/it]

Epoch 11/30 loss: 0.44


Training:  40%|████      | 12/30 [04:41<07:01, 23.40s/it]

Epoch 12/30 loss: 0.40


Training:  43%|████▎     | 13/30 [05:05<06:37, 23.40s/it]

Epoch 13/30 loss: 0.37


Training:  47%|████▋     | 14/30 [05:28<06:14, 23.40s/it]

Epoch 14/30 loss: 0.36


Training:  50%|█████     | 15/30 [05:51<05:50, 23.39s/it]

Epoch 15/30 loss: 0.35


Training:  53%|█████▎    | 16/30 [06:15<05:27, 23.38s/it]

Epoch 16/30 loss: 0.34


Training:  57%|█████▋    | 17/30 [06:38<05:03, 23.38s/it]

Epoch 17/30 loss: 0.33


Training:  60%|██████    | 18/30 [07:02<04:40, 23.38s/it]

Epoch 18/30 loss: 0.32


Training:  63%|██████▎   | 19/30 [07:25<04:17, 23.37s/it]

Epoch 19/30 loss: 0.32


Training:  67%|██████▋   | 20/30 [07:48<03:53, 23.38s/it]

Epoch 20/30 loss: 0.31


Training:  70%|███████   | 21/30 [08:12<03:30, 23.35s/it]

Epoch 21/30 loss: 0.31


Training:  73%|███████▎  | 22/30 [08:35<03:06, 23.34s/it]

Epoch 22/30 loss: 0.31


Training:  77%|███████▋  | 23/30 [08:58<02:43, 23.36s/it]

Epoch 23/30 loss: 0.30


Training:  80%|████████  | 24/30 [09:22<02:20, 23.35s/it]

Epoch 24/30 loss: 0.29


Training:  83%|████████▎ | 25/30 [09:45<01:56, 23.37s/it]

Epoch 25/30 loss: 0.29


Training:  87%|████████▋ | 26/30 [10:08<01:33, 23.37s/it]

Epoch 26/30 loss: 0.29


Training:  90%|█████████ | 27/30 [10:32<01:10, 23.37s/it]

Epoch 27/30 loss: 0.28


Training:  93%|█████████▎| 28/30 [10:55<00:46, 23.36s/it]

Epoch 28/30 loss: 0.27


Training:  97%|█████████▋| 29/30 [11:19<00:23, 23.35s/it]

Epoch 29/30 loss: 0.27


Training: 100%|██████████| 30/30 [11:42<00:00, 23.41s/it]


Epoch 30/30 loss: 0.26


Testing: 100%|██████████| 79/79 [00:01<00:00, 55.51it/s]

Test loss: 0.34
Test accuracy: 87.41%
Saved Trained Model.





### Describe what you did and discovered here
In this cell you should write all the settings tried and performances you obtained. Report what you did and what you discovered from the trials.
You can write in Korean

### 관찰 내용
- Hidden dimension과 head의 수를 늘리는 것보다 depth를 늘리는 것이 연산량을 덜 증가시키면서 학습을 더 빠르게 하는 데 더 효과적이다.
- Hyperparameter를 적절히 설정하면 모델의 복잡도를 낮게 유지하면서(학습에 드는 자원이 적으면서) 빠르게 학습시킬 수 있다. 어떻게 보면 적절한 d_vc를 찾는 것 같기도 하다.

### 실험 1
#### Hyperparams
```
patch_size=4
embed_dim=256
depth=6
num_heads=8 # make sure embed_dim is divisible by num_heads!
mlp_ratio=4
N_EPOCHS = 10
LR = 0.005
```
#### 결과 분석
```
Using device:  cuda (NVIDIA GeForce RTX 3060 Ti)
Epoch 1/10 loss: 0.80
Epoch 2/10 loss: 0.55
Epoch 3/10 loss: 0.48
Epoch 4/10 loss: 0.48
Epoch 5/10 loss: 0.49
Epoch 6/10 loss: 0.44
Epoch 7/10 loss: 0.41
Epoch 8/10 loss: 0.39
Epoch 9/10 loss: 0.38
Epoch 10/10 loss: 0.37
Test loss: 0.40
Test accuracy: 85.75%
```

최종적으로는 기준 정확도를 넘겼으나, 중간중간 loss가 다시 증가하는 모습도 보임.

이에 learning rate를 조금 줄여서 학습시켜보기로 함.

### 실험 2
#### Hyperparams
```
patch_size=4
embed_dim=256
depth=6
num_heads=8 # make sure embed_dim is divisible by num_heads!
mlp_ratio=4
N_EPOCHS = 10
LR = 0.002
```
#### 결과 분석
```
Using device:  cuda (NVIDIA GeForce RTX 3060 Ti)
Epoch 1/10 loss: 0.64
Epoch 2/10 loss: 0.43
Epoch 3/10 loss: 0.39
Epoch 4/10 loss: 0.39
Epoch 5/10 loss: 0.37
Epoch 6/10 loss: 0.36
Epoch 7/10 loss: 0.35
Epoch 8/10 loss: 0.34
Epoch 9/10 loss: 0.35
Epoch 10/10 loss: 0.38
Test loss: 0.41
Test accuracy: 85.35%
```
학습 과정에서 이전보다 훨씬 빠른 속도로 loss가 감소하는 모습을 보임. 하지만 test accuracy는 이전보다 증가하였음.

이에 learning rate를 이전 값으로 되돌리고, 모델의 depth와 epoch의 수를 늘려보기로 함.

### 실험 3
#### Hyperparams
```
patch_size=4
embed_dim=256
depth=8
num_heads=8 # make sure embed_dim is divisible by num_heads!
mlp_ratio=4
N_EPOCHS = 15
LR = 0.005
```
#### 결과 분석
```
Using device:  cuda (NVIDIA GeForce RTX 3060 Ti)
Epoch 1/15 loss: 0.94
Epoch 2/15 loss: 0.53
Epoch 3/15 loss: 0.48
Epoch 4/15 loss: 0.46
Epoch 5/15 loss: 0.46
Epoch 6/15 loss: 0.43
Epoch 7/15 loss: 0.41
Epoch 8/15 loss: 0.40
Epoch 9/15 loss: 0.40
Epoch 10/15 loss: 0.42
Epoch 11/15 loss: 0.39
Epoch 12/15 loss: 0.37
Epoch 13/15 loss: 0.36
Epoch 14/15 loss: 0.36
Epoch 15/15 loss: 0.37
Test loss: 0.40
Test accuracy: 85.19%
```
학습을 시작할 때의 loss는 이전보다 높았지만, 학습이 충분히 진행된 후의 loss는 이전과 비슷했다.

반대로 매우 단순한 모델을 학습시켜보기로 했다.

### 실험 4
#### Hyperparams
```
patch_size=4
embed_dim=128
depth=4
num_heads=2 # make sure embed_dim is divisible by num_heads!
mlp_ratio=4
N_EPOCHS = 20
LR = 0.005
```
#### 결과 분석
```
Using device:  cuda (NVIDIA GeForce RTX 3060 Ti)
Epoch 1/20 loss: 0.67
Epoch 2/20 loss: 0.46
Epoch 3/20 loss: 0.42
Epoch 4/20 loss: 0.41
Epoch 5/20 loss: 0.39
Epoch 6/20 loss: 0.39
Epoch 7/20 loss: 0.39
Epoch 8/20 loss: 0.38
Epoch 9/20 loss: 0.36
Epoch 10/20 loss: 0.36
Epoch 11/20 loss: 0.35
Epoch 12/20 loss: 0.35
Epoch 13/20 loss: 0.35
Epoch 14/20 loss: 0.35
Epoch 15/20 loss: 0.33
Epoch 16/20 loss: 0.33
Epoch 17/20 loss: 0.34
Epoch 18/20 loss: 0.32
Epoch 19/20 loss: 0.37
Epoch 20/20 loss: 0.33
Testing: 100%|██████████| 79/79 [00:00<00:00, 130.63it/s]
Test loss: 0.35
Test accuracy: 87.19%
```
훨씬 성능이 좋았다. 마지막으로는 모델의 크기를 조금만 늘리고 학습을 꽤 오래 진행해보기로 했다.

### 실험 5
#### Hyperparams
```
patch_size=4
embed_dim=256
depth=6
num_heads=8 # make sure embed_dim is divisible by num_heads!
mlp_ratio=4
N_EPOCHS = 30
LR = 0.005
```
#### 결과 분석
```
Using device:  cuda (NVIDIA GeForce RTX 3060 Ti)
Epoch 1/30 loss: 0.92
Epoch 2/30 loss: 0.61
Epoch 3/30 loss: 0.54
Epoch 4/30 loss: 0.50
Epoch 5/30 loss: 0.45
Epoch 6/30 loss: 0.44
Epoch 7/30 loss: 0.42
Epoch 8/30 loss: 0.42
Epoch 9/30 loss: 0.39
Epoch 10/30 loss: 0.39
Epoch 11/30 loss: 0.44
Epoch 12/30 loss: 0.40
Epoch 13/30 loss: 0.37
Epoch 14/30 loss: 0.36
Epoch 15/30 loss: 0.35
Epoch 16/30 loss: 0.34
Epoch 17/30 loss: 0.33
Epoch 18/30 loss: 0.32
Epoch 19/30 loss: 0.32
Epoch 20/30 loss: 0.31
Epoch 21/30 loss: 0.31
Epoch 22/30 loss: 0.31
Epoch 23/30 loss: 0.30
Epoch 24/30 loss: 0.29
Epoch 25/30 loss: 0.29
Epoch 26/30 loss: 0.29
Epoch 27/30 loss: 0.28
Epoch 28/30 loss: 0.27
Epoch 29/30 loss: 0.27
Epoch 30/30 loss: 0.26
Testing: 100%|██████████| 79/79 [00:01<00:00, 55.51it/s]
Test loss: 0.34
Test accuracy: 87.41%
```
이전보다 아주 약간 성능이 증가하였다.

