# GPU memory usage for deep learning image processing

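### Purpose
For deep-learning image processing, verify how far the theoretical GPU memory requirement for training and running a model deviates from the measured value.

### Method
Theoretical and measured memory usage are compared for the following six items.
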
1. [Tensor variable](#tensor-variable)
2. [Convolutional Neural Network](#convolutional-neural-network)
3. [VGG16](#vgg16)
4. [ResNet50](#resnet50)
5. [Vision Transformer](#visiont-transformer)
6. [UNETR](#unetr)

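Measured memory usage is obtained with PyTorch or TensorFlow functions and with NVML (NVIDIA Management Library) functions.
The PyTorch and TensorFlow functions report the memory allocated inside the program, whereas NVML reports usage that also includes the overhead needed to run the program, such as Python and CUDA.
For the tensor variable, measurements are taken with two implementations (PyTorch and TensorFlow); the other items are currently measured with PyTorch only. The two implementations are expected to give nearly the same measured values.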

The datasets are generated from pseudo-random numbers. The dataset sizes are as follows.
- For Tensor variable
  - input = 0.25 Gi elements (1 GiB as 4-byte float32)
- For CNN, VGG, ResNet, ViT
  - input = [3,224,224]
  - output = [10]
  - datasize = 512
  - batchsize = 128
- For UNETR
  - input = [1, 64,64,64]
  - output = [2, 64, 64, 64]
  - datasize = 32
  - batchsize = 8
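
As a quick check of these sizes, the sketch below computes the memory of one float32 batch from its shape. The helper name `batch_mib` is introduced here for illustration only and is not part of the notebook.

```python
import math


def batch_mib(shape: list[int], bytes_per_element: int = 4) -> float:
    """Return the size in MiB of a dense tensor with the given shape."""
    return math.prod(shape) * bytes_per_element / 1024**2


# Tensor variable: 0.25 Gi float32 elements.
variable_mib = batch_mib([1024**3 // 4])
# One mini-batch of images for CNN, VGG, ResNet and ViT.
image_batch_mib = batch_mib([128, 3, 224, 224])
# One mini-batch of volumes for UNETR (input and output).
unetr_input_mib = batch_mib([8, 1, 64, 64, 64])
unetr_output_mib = batch_mib([8, 2, 64, 64, 64])

print(f"variable: {variable_mib:.1f} MiB")
print(f"image batch: {image_batch_mib:.1f} MiB")
print(f"UNETR input: {unetr_input_mib:.1f} MiB, output: {unetr_output_mib:.1f} MiB")
```

The input batch itself is small compared with the totals reported in the results, which suggests that most of the memory goes to intermediate outputs and their gradients.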

### Results
| Model | Estimated [MiB] | Measured [MiB] | Error [%] |
|---|---|---|---|
| Variable | 1024 | 1024 | 0 |
| CNN | 636 | 531 | 19.8 |
| VGG16 | 15929 | 17232 | 7.6 |
| ResNet50 | 11378 | 11738 | 3.1 |
| ViT | 18678 | 17218 | 8.5 |
| UNETR | 5605 | 5178 | 8.2 |
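
The error column can be reproduced from the other two columns. The sketch below assumes that the error is the absolute difference divided by the measured value; this definition is inferred from the numbers in the table and is not stated explicitly in the notebook.

```python
# (estimated MiB, measured MiB) copied from the table above.
results = {
    "Variable": (1024, 1024),
    "CNN": (636, 531),
    "VGG16": (15929, 17232),
    "ResNet50": (11378, 11738),
    "ViT": (18678, 17218),
    "UNETR": (5605, 5178),
}


def relative_error_percent(estimated: float, measured: float) -> float:
    """Absolute difference as a percentage of the measured value (assumed definition)."""
    return abs(estimated - measured) / measured * 100


errors = {
    name: round(relative_error_percent(est, real), 1)
    for name, (est, real) in results.items()
}
for name, err in errors.items():
    print(f"{name}: {err} %")
```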

#### Test environment
- python 3.10
- tensorflow 2.13.0
- torch 2.0.1
- NVIDIA driver 530.30.02
- CUDA toolkit 11.8
- cuDNN 8.9

#### References
1. [Estimating GPU Memory Consumption of Deep Learning Models](https://2020.esec-fse.org/details/esecfse-2020-industry-papers/5/Estimating-GPU-Memory-Consumption-of-Deep-Learning-Models), [video](https://dl.acm.org/doi/10.1145/3368089.3417050)
2. [A comprehensive guide to memory usage in PyTorch](https://medium.com/deep-learning-for-protein-design/a-comprehensive-guide-to-memory-usage-in-pytorch-b9b7c78031d3)

## Tensor variable
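Consider a float32 (4-byte) tensor with 0.25 Gi (2^28) elements. Its theoretical size is 1 GiB.

The measured value is obtained by checking memory usage when the variable is defined and when it is deleted.
First, the measurement with PyTorch.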

In [1]:
import pynvml
import torch


def print_memory_torch(prefix: str):
    """Print memory usage.
    """    
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)    
    memory_al = torch.cuda.memory_allocated()
    memory_res = torch.cuda.memory_reserved()
    memory_maxal = torch.cuda.max_memory_allocated()

    print(f"{prefix}: allocated = {memory_al/1024**2:.1f} MiB, "
        f"reserved = {memory_res/1024**2:.1f}MiB, "
        f"max allocated = {memory_maxal/1024**2:.1f} MiB, "
        f"used = {int(info.used)/1024**2:.1f} MiB")

In [2]:
# Define a variable with 1GiB.
dim = 1024**3//4
var_cpu = torch.zeros(dim, dtype=torch.float32)
print(f"var_cpu dtype: {var_cpu.dtype}, dim: {dim/1024**2}MiB")

torch.cuda.reset_peak_memory_stats()
print_memory_torch("Initial")

# Copy the variable to gpu.
var_gpu = var_cpu.to("cuda")
print_memory_torch("Define")

# Delete the variable from gpu.
del var_gpu
print_memory_torch("Delete")

# Release cached memory.
torch.cuda.empty_cache()
print_memory_torch("Release")

var_cpu dtype: torch.float32, dim: 256.0MiB
Initial: allocated = 0.0 MiB, reserved = 0.0MiB, max allocated = 0.0 MiB, used = 662.1 MiB
Define: allocated = 1024.0 MiB, reserved = 1024.0MiB, max allocated = 1024.0 MiB, used = 2486.2 MiB
Delete: allocated = 0.0 MiB, reserved = 1024.0MiB, max allocated = 1024.0 MiB, used = 2486.2 MiB
Release: allocated = 0.0 MiB, reserved = 0.0MiB, max allocated = 1024.0 MiB, used = 1462.2 MiB


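After the variable was defined, 1024 MiB = 1 GiB of VRAM was allocated, so the measured value matched the theoretical value.
After the variable was deleted with del, the memory remained held as reserved; it was only fully released by torch.cuda.empty_cache().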
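
The NVML column in the log above can be read the same way. The following sketch only does arithmetic on the logged numbers; interpreting the remainder as CUDA context and library overhead is an assumption, since the log alone does not say what it consists of.

```python
# NVML "used" values in MiB, copied from the log above.
used_initial = 662.1
used_define = 2486.2
used_release = 1462.2
tensor_mib = 1024.0

# Growth seen by NVML when the tensor is first copied to the GPU.
growth_on_define = used_define - used_initial
# Part of that growth that is not the tensor itself.
overhead_mib = growth_on_define - tensor_mib
# Memory still held by the process after empty_cache().
residual_mib = used_release - used_initial

print(f"growth on define: {growth_on_define:.1f} MiB")
print(f"overhead beyond the tensor: {overhead_mib:.1f} MiB")
print(f"residual after release: {residual_mib:.1f} MiB")
```

The overhead and the residual come out equal, which is consistent with a fixed per-process cost that appears on the first CUDA call and is not returned by emptying the cache.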

Next, the measurement with TensorFlow.
By default TensorFlow reserves as much memory as possible at program start. To allocate only the memory that is needed, memory growth is enabled.

In [3]:
import tensorflow as tf
import numpy as np


def print_memory_tf(prefix: str):
    """Print memory usage.
    """
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)  
    memory_info = tf.config.experimental.get_memory_info("GPU:0")

    print(f"{prefix}: current = {memory_info['current']/1024**2:.1f} MiB, "
        f"peak = {memory_info['peak']/1024**2:.1f}MiB, "
        f"used = {int(info.used)/1024**2:.1f} MiB")

2023-11-12 23:27:45.994576: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-11-12 23:27:46.016695: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [4]:
# Enable memory growth.
devices = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(devices[0], True)

# Define a variable with 1GiB.
dim = 1024**3//4
var_cpu = np.zeros(dim, dtype=np.float32)
print(f"var_cpu dtype: {var_cpu.dtype}, dim: {dim/1024**2}MiB")

tf.config.experimental.reset_memory_stats("GPU:0")
print_memory_tf("Initial")

# Copy the variable to gpu.
with tf.device("GPU"): # type: ignore
    var_gpu = tf.constant(var_cpu)
print_memory_tf("Define")

# Delete the variable from gpu.
del var_gpu
print_memory_tf("Delete")

2023-11-12 23:27:46.921069: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355

var_cpu dtype: float32, dim: 256.0MiB
Initial: current = 0.0 MiB, peak = 0.0MiB, used = 1648.1 MiB
Define: current = 1024.0 MiB, peak = 1024.0MiB, used = 3695.8 MiB
Delete: current = 0.0 MiB, peak = 1024.0MiB, used = 3695.8 MiB


After the variable was defined, 1 GiB of VRAM was allocated, so the measured value matched the theoretical value.

## Convolutional Neural Network
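The theoretical and measured memory usage are compared for a simple CNN that performs image classification.

The dataset is generated from pseudo-random numbers.
Input images have dimensions 3x224x224, and 512 of them are prepared.
The targets have 10 classes and the shape of a one-hot encoding (10 values per sample).
The batch size is 128.

The CNN has one convolutional layer, one average pooling layer, and one fully connected layer.
The convolutional layer uses a 3x3 kernel, 8 channels, stride 1, no padding, and a bias.
The average pooling layer uses a 2x2 kernel, stride 2, and no padding.
The fully connected layer has a 10-dimensional output.
SGD is used as the optimizer.
Numerical precision is float32, the PyTorch default.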
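
The layer description above fixes the output shapes and the parameter count. As a minimal sketch, the standard output-size formula for convolution and pooling gives the following; the helper name `conv_output_size` is hypothetical and used only for this illustration.

```python
def conv_output_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length along one spatial axis of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


height = width = 224
in_channels, out_channels, num_classes = 3, 8, 10

conv_hw = conv_output_size(height, kernel=3)             # after the 3x3 convolution
pool_hw = conv_output_size(conv_hw, kernel=2, stride=2)  # after the 2x2 average pooling

conv_params = in_channels * 3 * 3 * out_channels + out_channels  # weights + bias
fc_inputs = out_channels * pool_hw * pool_hw                     # flattened features
fc_params = fc_inputs * num_classes + num_classes                # weights + bias
total_params = conv_params + fc_params

print(f"conv output: {out_channels}x{conv_hw}x{conv_hw}")
print(f"pool output: {out_channels}x{pool_hw}x{pool_hw}")
print(f"parameters: conv={conv_params}, fc={fc_params}, total={total_params}")
```

Almost all parameters sit in the fully connected layer, while almost all activation memory comes from the convolution output.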

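The theoretical memory requirement of this CNN is derived next.
According to references 1 and 2, the memory needed for training and for inference is as follows.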

$$
\begin{align*}
\text{usage for training}   &= \text{data} + \text{weight} + \text{forward output} + \text{weight gradient} + \text{output gradient} \\
                            &= D + W + O \cdot b + W \cdot (d + m) + (O + D) \\
\text{usage for inference}  &= \text{data} + \text{weight} + \text{forward output} \\
                            &= D + W + O
\end{align*}
$$

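where  
$D$: memory usage of one mini-batch of the dataset  
$W$: memory usage of the trainable parameters  
$O$: memory usage of the intermediate-layer outputs  
$b$: 0.5 when mixed precision is used so that only the intermediate-layer outputs are kept in half precision in the forward pass, otherwise 1  
$d$: 2 when Distributed Data Parallel is used to run on multiple GPUs, otherwise 1  
$m$: number of moments used by the optimizer (SGD: 0; Adagrad, RMSprop: 1; Adam: 2)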

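In practice, additional memory is used on top of this: memory needed for GPU computation (CUDA context, cuDNN workspace) and spare memory kept for memory-management optimization (internal tensor fragmentation). According to reference 1, this extra amount varies with the library version and the model but is about 0.5 GB.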
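
To illustrate how the optimizer term $m$ changes the total, here is a minimal sketch of the two formulas. The function names are hypothetical, and the values plugged in are those of the simple CNN described in this section (counting the convolution and pooling outputs as $O$); the overhead mentioned above is not included.

```python
def training_bytes(D: int, W: int, O: int, b: float = 1, d: int = 1, m: int = 0) -> float:
    """Training memory: data + weight + forward output + weight gradient + output gradient."""
    return D + W + O * b + W * (d + m) + (O + D)


def inference_bytes(D: int, W: int, O: int) -> float:
    """Inference memory: data + weight + forward output."""
    return D + W + O


MIB = 1024**2
BYTES = 4  # float32

D = 128 * 3 * 224 * 224 * BYTES                              # one mini-batch of images
W = (3 * 3 * 3 * 8 + 8 + 8 * 111 * 111 * 10 + 10) * BYTES    # conv + fully connected
O = (128 * 8 * 222 * 222 + 128 * 8 * 111 * 111) * BYTES      # conv and pooling outputs

sgd = training_bytes(D, W, O, m=0)
adam = training_bytes(D, W, O, m=2)
inference = inference_bytes(D, W, O)

print(f"training with SGD:  {sgd / MIB:.1f} MiB")
print(f"training with Adam: {adam / MIB:.1f} MiB")
print(f"inference:          {inference / MIB:.1f} MiB")
```

For this model the weights are small, so switching from SGD to Adam adds only two extra copies of $W$; for weight-heavy models the same term can dominate.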

For this CNN, the theoretical memory usage is computed as follows.

In [5]:
from torch import optim
import torch.nn.functional as F


class Config:
    def __init__(self):
        self.dim_input = [3,224,224]
        self.dim_output = [10]
        self.datasize = 512
        self.batchsize = 128
        self.num_epochs = 3
        self.lr = 1e-2
        self.device = 'cuda'
        self.criterion = F.cross_entropy
        self.optim = optim.SGD
        self.moment = 0         # SGD: 0, Adagrad, RMSprop: 1, Adam: 2
        self.ddp = 1            # Distributed data parallel: 2, Not: 1
        self.mixed_pre = 1      # Mixed precision: 0.5, Not: 1


conf = Config()

In [6]:
def print_memory_estimate1(
    dim_input: list[int], 
    dim_output: list[int], 
    num_param: int, 
    num_hidden_output: int, 
    moment: int, 
    ddp: int = 1, 
    mixed_pre: float = 1):
    '''Print theoretical memory usage.
    
    Parameters
    ----------
    dim_input: Shape of input data including batch size. e.g. [batch size, channel, width, height]
    dim_output: Shape of output data.
    num_param: Number of trainable parameters.
    num_hidden_output: Total number of hidden layer's output. Don't include in-place hidden layer that require no additional memory (Activations, Softmax).
    moment: Moment use for optimization. SGD: 0, Adagrad, RMSprop: 1, Adam: 2
    ddp: Multiple GPU use. Distributed data parallel: 2, Not: 1
    mixed_pre: Forward outputs memory saving by Mixed precision: 0.5, Not: 1
    '''
    mem_data = np.prod(dim_input) * 4
    mem_weight = num_param * 4
    mem_weight_grad = mem_weight * (ddp + moment)
    mem_forward_output = (num_hidden_output + np.prod(dim_output)) * 4 * mixed_pre
    mem_output_gradient = mem_forward_output + mem_data
    mem_training = mem_data + mem_weight + mem_forward_output + mem_weight_grad + mem_output_gradient
    mem_inference = mem_data + mem_weight + mem_forward_output

    print(f"Data(MiB): {mem_data/1024**2:.1f}")
    print(f"Weight(MiB): {mem_weight/1024**2:.1f}")
    print(f"Forward output(MiB): {mem_forward_output/1024**2:.1f}")
    print(f"Weight gradient(MiB): {mem_weight_grad/1024**2:.1f}")
    print(f"Output gradient(MiB): {mem_output_gradient/1024**2:.1f}")
    print(f"Total for training(MiB): {mem_training/1024**2:.1f}")
    print(f"Total for inference(MiB): {mem_inference/1024**2:.1f}")

In [7]:

num_param = 3*3*3*8+8 + 8*((conf.dim_input[1]-2)//2)*((conf.dim_input[2]-2)//2)*10 + 10
num_output_shape = np.prod([conf.batchsize, 8, (conf.dim_input[1]-2), (conf.dim_input[2]-2)]) \
                 + np.prod([conf.batchsize, 8, ((conf.dim_input[1]-2)//2), ((conf.dim_input[2]-2)//2)])

print_memory_estimate1([conf.batchsize] + conf.dim_input, conf.dim_output,
                        num_param, int(num_output_shape), conf.moment, conf.ddp, conf.mixed_pre)

Data(MiB): 73.5
Weight(MiB): 3.8
Forward output(MiB): 240.6
Weight gradient(MiB): 3.8
Output gradient(MiB): 314.1
Total for training(MiB): 635.8
Total for inference(MiB): 317.9


Hence the theoretical value is 636 MiB.

Next, the measured memory usage when the CNN is run with PyTorch.

In [8]:
from torch import nn, Tensor
from tqdm import tqdm
from typing import Callable


class CNN(nn.Module):
    def __init__(self, dim_c: int, dim_h: int, dim_w: int):
        super().__init__()
        self.conv = nn.Conv2d(in_channels=dim_c, out_channels=8, kernel_size=3)
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)
        self.fc = nn.Linear(8*((dim_h-2)//2)*((dim_w-2)//2), 10)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv(x)
        x = self.pool(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x

def train(
    model:nn.Module, 
    dim_input: list[int], 
    dim_output: list[int],
    batchsize: int, 
    epoch: int, 
    criterion: Callable[..., Tensor],
    optimizer = optim.SGD, 
    device: str = "cuda"):
    """Train model using random dataset.
    
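    Parameters
    ----------
    model: Model to train.
    dim_input: Shape of input data including data size. e.g. [data size, channel, width, height]
    dim_output: Shape of one target, excluding batch size.
    batchsize: Number of samples per mini-batch.
    epoch: Number of epochs.
    criterion: Loss function called as criterion(prediction, target).
    optimizer: Optimizer class, instantiated with lr=0.01.
    device: Device to train on.
    """
    
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    print_memory_torch("Initial")

    model.to(device)
    print_memory_torch("Model")
    
    # Random inputs and targets stay on the CPU; each batch is moved to the device in the loop.
    data = [[torch.randn([batchsize] + dim_input[1:]), 
             torch.randn([batchsize] + dim_output)] 
             for _ in range(dim_input[0]//batchsize)]

    opt = optimizer(model.parameters(), lr=0.01)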
    for ep in range(epoch):
        model.train()
        with tqdm(data) as pbar:
            pbar.set_description(f'[Epoch {ep + 1}]')
            for x, y in pbar:
                x = x.to(device)
                y = y.to(device)
                
                opt.zero_grad()
                y_pred = model(x)
                loss = criterion(y_pred, y)
                loss.backward()
                opt.step()
            
        print_memory_torch("Train")
    print_memory_torch("Final")


model_cnn = CNN(conf.dim_input[0], conf.dim_input[1], conf.dim_input[2])
train(model_cnn, [conf.datasize] + conf.dim_input, conf.dim_output, conf.batchsize, conf.num_epochs,
      conf.criterion, conf.optim, conf.device)

Initial: allocated = 0.0 MiB, reserved = 0.0MiB, max allocated = 0.0 MiB, used = 3695.8 MiB
Model: allocated = 3.8 MiB, reserved = 22.0MiB, max allocated = 3.8 MiB, used = 3717.8 MiB


[Epoch 1]: 100%|██████████| 4/4 [00:00<00:00,  4.61it/s]


Train: allocated = 97.8 MiB, reserved = 604.0MiB, max allocated = 530.9 MiB, used = 4983.2 MiB


[Epoch 2]: 100%|██████████| 4/4 [00:00<00:00, 110.95it/s]


Train: allocated = 97.8 MiB, reserved = 604.0MiB, max allocated = 530.9 MiB, used = 4983.2 MiB


[Epoch 3]: 100%|██████████| 4/4 [00:00<00:00, 108.40it/s]

Train: allocated = 97.8 MiB, reserved = 604.0MiB, max allocated = 530.9 MiB, used = 4983.2 MiB
Final: allocated = 97.8 MiB, reserved = 604.0MiB, max allocated = 530.9 MiB, used = 4983.2 MiB





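When the model was loaded onto the GPU, the measured value was 3.8 MiB, matching the theoretical value.
At the end of training the measured peak was 531 MiB, which differs from the theoretical value by 19.8%.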



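In PyTorch, torchinfo can retrieve and display the network structure, the parameter counts, and the memory usage. This is convenient when computing the theoretical value, but its memory calculation differs from the formula above.

Specifically, torchinfo computes memory usage = input data + forward/backward pass size + weights, where the forward/backward pass size is twice the sum of the intermediate-layer output sizes shown in its listing, the factor of two accounting for gradients. As a result, the error is especially large for models with many layers that need no extra memory (ReLU, Softmax, and so on). VGG is a particularly clear example.
Also, memory usage is shown in MB, not MiB (1 MB = 1000^2 bytes, 1 MiB = 1024^2 bytes).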
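
Because the two units differ by about 4.9%, figures from torchinfo need converting before they are compared with the MiB values used elsewhere in this notebook. A minimal sketch, with helper names chosen here for illustration:

```python
MB = 1000**2   # bytes in one megabyte
MIB = 1024**2  # bytes in one mebibyte


def mb_to_mib(mb: float) -> float:
    """Convert megabytes to mebibytes."""
    return mb * MB / MIB


def mib_to_mb(mib: float) -> float:
    """Convert mebibytes to megabytes."""
    return mib * MIB / MB


print(f"1000 MB  = {mb_to_mib(1000):.2f} MiB")
print(f"1024 MiB = {mib_to_mb(1024):.2f} MB")
```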

The torchinfo output for this CNN is shown below. The check that follows it confirms that torchinfo uses the calculation described above.

In [9]:
from torchinfo import summary


result = summary(model_cnn, [conf.batchsize] + conf.dim_input,
                depth=6,
                col_names=["input_size",
                            "output_size",
                            "num_params",
                            "params_percent",
                            "kernel_size",
                            "mult_adds",
                            "trainable"])
print(result)

# Confirm the result with the following calculation, which is assumed to match what summary() implements.
print("\nConfirm result")
num_input = conf.batchsize * conf.dim_input[0] * conf.dim_input[1] * conf.dim_input[2]
num_param = conf.dim_input[0]*3*3*8 + 8 + 8*((conf.dim_input[1]-2)//2)*((conf.dim_input[2]-2)//2)*conf.dim_output[0] + conf.dim_output[0]
num_hidden_output = np.prod([conf.batchsize, 8, conf.dim_input[1]-2, conf.dim_input[2]-2]) + np.prod([conf.batchsize, conf.dim_output[0]])
total = num_input + num_hidden_output * 2 + num_param

print("Total params: ", num_param)
print("Input size(byte): ", num_input * 4)
print("Forward/backward(byte): ", 2 * num_hidden_output * 4) # https://github.com/sksq96/pytorch-summary/issues/51
print("Params size(byte): ", num_param * 4)
print("Total size(byte): ", total * 4)

Layer (type:depth-idx)                   Input Shape               Output Shape              Param #                   Param %                   Kernel Shape              Mult-Adds                 Trainable
CNN                                      [128, 3, 224, 224]        [128, 10]                 --                             --                   --                        --                        True
├─Conv2d: 1-1                            [128, 3, 224, 224]        [128, 8, 222, 222]        224                         0.02%                   [3, 3]                    1,413,070,848             True
├─AvgPool2d: 1-2                         [128, 8, 222, 222]        [128, 8, 111, 111]        --                             --                   2                         --                        --
├─Linear: 1-3                            [128, 98568]              [128, 10]                 985,690                    99.98%                   --                        126,168,320       

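With torchinfo, the theoretical value can be computed directly from a PyTorch model. Some layers listed by torchinfo use no memory and others use several times their output size, so the calculation has to treat layers case by case.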

In [10]:
# Layer classes treated as in-place, i.e. needing no extra memory for their output.
MEMORYLESS_CLASSES = frozenset({
    "ReLU", "LeakyReLU", "Sigmoid", "Tanh", "ELU", "GLU", "PReLU", "GELU", "Mish",
    "Softmin", "Softmax", "Softmax2d",
    "Dropout", "Dropout1d", "Dropout2d", "Dropout3d", "AlphaDropout",
    "BatchNorm1d", "BatchNorm2d", "BatchNorm3d", "LayerNorm",
})


def is_memoryless(class_name: str) -> bool:
    '''Return True if the class is a memoryless type.
    Activations, normalizations and dropouts are assumed to update in place by default
    and to require no additional memory.
    '''
    return class_name in MEMORYLESS_CLASSES

def print_memory_estimate2(
    model: nn.Module, 
    dim_input: list[int], 
    moment: int, 
    ddp: int=1, 
    mixed_pre: float = 1):
    '''Print theoretical memory usage.
    
    Parameters
    ----------
    model: 
    dim_input: Shape of input data including batch size. e.g. [batch size, channel, width, height]
    moment: Moment use for optimization. SGD: 0, Adagrad, RMSprop: 1, Adam: 2
    ddp: Multiple GPU use. Distributed data parallel: 2, Not: 1
    mixed_pre: Forward outputs memory saving by Mixed precision: 0.5, Not: 1
    '''
    info = summary(model, dim_input, verbose=0)
    dim_output = info.summary_list[-1].output_size[1:]

    num_param = 0
    num_output_shape = 0
    last_layer = len(info.summary_list) -1
    # print("#, Class, Leaf, Memoryless, Output")
    for i, layer in enumerate(info.summary_list):
        # print(f"{i}, {layer.class_name}, {layer.is_leaf_layer}, {is_memoryless(layer.class_name)}, {layer.output_size}")
        if layer.is_leaf_layer:
            num_param += layer.trainable_params
            if not is_memoryless(layer.class_name):
                num_output_shape += np.prod(layer.output_size)
        elif layer.class_name == "MultiheadAttention": # PyTorch's multihead attention is not a leaf layer but should be counted.
            num_output_shape += np.prod(layer.output_size) * 5
            num_param += layer.trainable_params

    
    mem_data = np.prod(dim_input) * 4
    mem_weight = num_param * 4
    mem_weight_grad = mem_weight * (ddp + moment)
    mem_forward_output = num_output_shape * 4 * mixed_pre
    mem_output_gradient = mem_forward_output + mem_data
    mem_training = mem_data + mem_weight + mem_forward_output + mem_weight_grad + mem_output_gradient
    mem_inference = mem_data + mem_weight + mem_forward_output

    print(f"Data(MiB): {mem_data/1024**2:.1f}")
    print(f"Weight(MiB): {mem_weight/1024**2:.1f}")
    print(f"Forward output(MiB): {mem_forward_output/1024**2:.1f}")
    print(f"Weight gradient(MiB): {mem_weight_grad/1024**2:.1f}")
    print(f"Output gradient(MiB): {mem_output_gradient/1024**2:.1f}")
    print(f"Total for training(MiB): {mem_training/1024**2:.1f}")
    print(f"Total for inference(MiB): {mem_inference/1024**2:.1f}")

For the CNN, passing the model as input gives the same result as before.

In [11]:
print_memory_estimate2(model_cnn, [conf.batchsize] + conf.dim_input, 
                       conf.moment, conf.ddp, conf.mixed_pre)

Data(MiB): 73.5
Weight(MiB): 3.8
Forward output(MiB): 240.6
Weight gradient(MiB): 3.8
Output gradient(MiB): 314.1
Total for training(MiB): 635.8
Total for inference(MiB): 317.9


## VGG16

In [12]:
import torchvision.models as models


model_vgg16 = models.vgg16_bn(weights=None)
model_vgg16.classifier[6] = nn.Linear(model_vgg16.classifier[6].in_features, 10) # Change the num of final output features to 10  # type: ignore 
# print(model_vgg16)

result = summary(model_vgg16, [conf.batchsize] + conf.dim_input,
                depth=6,
                col_names=["input_size",
                            "output_size",
                            "num_params",
                            "params_percent",
                            "kernel_size",
                            "mult_adds",
                            "trainable"])
print(result)

print("=== Estimated ===")
print_memory_estimate2(model_vgg16, [conf.batchsize] + conf.dim_input, 
                       conf.moment, conf.ddp, conf.mixed_pre)

print("=== Real ===")
train(model_vgg16, [conf.datasize] + conf.dim_input, conf.dim_output, conf.batchsize, conf.num_epochs,
      conf.criterion, conf.optim, conf.device)

Layer (type:depth-idx)                   Input Shape               Output Shape              Param #                   Param %                   Kernel Shape              Mult-Adds                 Trainable
VGG                                      [128, 3, 224, 224]        [128, 10]                 --                             --                   --                        --                        True
├─Sequential: 1-1                        [128, 3, 224, 224]        [128, 512, 7, 7]          --                             --                   --                        --                        True
│    └─Conv2d: 2-1                       [128, 3, 224, 224]        [128, 64, 224, 224]       1,792                       0.00%                   [3, 3]                    11,509,170,176            True
│    └─BatchNorm2d: 2-2                  [128, 64, 224, 224]       [128, 64, 224, 224]       128                         0.00%                   --                        16,384          

[Epoch 1]: 100%|██████████| 4/4 [00:01<00:00,  2.88it/s]


Train: allocated = 1125.1 MiB, reserved = 18752.0MiB, max allocated = 17231.9 MiB, used = 23164.9 MiB


[Epoch 2]: 100%|██████████| 4/4 [00:01<00:00,  2.46it/s]


Train: allocated = 1125.1 MiB, reserved = 18752.0MiB, max allocated = 17231.9 MiB, used = 23164.1 MiB


[Epoch 3]: 100%|██████████| 4/4 [00:01<00:00,  2.47it/s]

Train: allocated = 1125.1 MiB, reserved = 18752.0MiB, max allocated = 17231.9 MiB, used = 23161.1 MiB
Final: allocated = 1125.1 MiB, reserved = 18752.0MiB, max allocated = 17231.9 MiB, used = 23161.1 MiB





For VGG16, the theoretical value was 15929 MiB and the measured value was 17232 MiB, an error of 7.6%.
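
To get a feel for where this memory goes, the sketch below sizes the output of the first convolution listed in the torchinfo table and checks the parameter counts of the first two layers. It is plain arithmetic on the shapes shown above, assuming float32.

```python
import math

MIB = 1024**2

# Output shape of the first Conv2d, copied from the torchinfo listing.
conv1_output_shape = [128, 64, 224, 224]
conv1_output_mib = math.prod(conv1_output_shape) * 4 / MIB

conv1_params = 3 * 3 * 3 * 64 + 64  # 3x3 kernel, 3 -> 64 channels, plus bias
bn1_params = 2 * 64                 # scale and shift per channel

print(f"first conv output: {conv1_output_mib:.0f} MiB")
print(f"first conv params: {conv1_params}, first batch norm params: {bn1_params}")
```

A single early feature map of this size, held once for the forward pass and once for its gradient, already accounts for a noticeable share of the total, whereas its parameters are negligible.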

## ResNet50

In [13]:
model_rn50 = models.resnet50(weights=None)
model_rn50.fc = nn.Linear(model_rn50.fc.in_features, 10) # Change the num of final output features to 10
# print(model_rn50)

result = summary(model_rn50, [conf.batchsize] + conf.dim_input,
                depth=6,
                col_names=["input_size",
                            "output_size",
                            "num_params",
                            "params_percent",
                            "kernel_size",
                            "mult_adds",
                            "trainable"])
print(result)

print("=== Estimated ===")
print_memory_estimate2(model_rn50, [conf.batchsize] + conf.dim_input, 
                       conf.moment, conf.ddp, conf.mixed_pre)

print("=== Real ===")
train(model_rn50, [conf.datasize] + conf.dim_input, conf.dim_output, conf.batchsize, conf.num_epochs,
      conf.criterion, conf.optim, conf.device)

Layer (type:depth-idx)                   Input Shape               Output Shape              Param #                   Param %                   Kernel Shape              Mult-Adds                 Trainable
ResNet                                   [128, 3, 224, 224]        [128, 10]                 --                             --                   --                        --                        True
├─Conv2d: 1-1                            [128, 3, 224, 224]        [128, 64, 112, 112]       9,408                       0.04%                   [7, 7]                    15,105,785,856            True
├─BatchNorm2d: 1-2                       [128, 64, 112, 112]       [128, 64, 112, 112]       128                         0.00%                   --                        16,384                    True
├─ReLU: 1-3                              [128, 64, 112, 112]       [128, 64, 112, 112]       --                             --                   --                        --              

[Epoch 1]: 100%|██████████| 4/4 [00:00<00:00,  6.43it/s]


Train: allocated = 1307.0 MiB, reserved = 12982.0MiB, max allocated = 11738.2 MiB, used = 17390.6 MiB


[Epoch 2]: 100%|██████████| 4/4 [00:00<00:00,  4.95it/s]


Train: allocated = 1307.0 MiB, reserved = 12982.0MiB, max allocated = 11738.2 MiB, used = 17387.2 MiB


[Epoch 3]: 100%|██████████| 4/4 [00:00<00:00,  4.97it/s]

Train: allocated = 1307.0 MiB, reserved = 12982.0MiB, max allocated = 11738.2 MiB, used = 17387.2 MiB
Final: allocated = 1307.0 MiB, reserved = 12982.0MiB, max allocated = 11738.2 MiB, used = 17387.2 MiB





For ResNet50, the theoretical value was 11378 MiB and the measured value was 11738 MiB, an error of 3.1%.
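
The first rows of the torchinfo listing can be cross-checked by hand. The sketch assumes the stem convolution of torchvision's ResNet50 (7x7 kernel, stride 2, padding 3, no bias) and float32 activations.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length along one spatial axis of a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


MIB = 1024**2
batch, in_channels, out_channels = 128, 3, 64

stem_hw = conv_output_size(224, kernel=7, stride=2, padding=3)
stem_params = in_channels * 7 * 7 * out_channels  # no bias term
stem_output_mib = batch * out_channels * stem_hw * stem_hw * 4 / MIB

print(f"stem output: [{batch}, {out_channels}, {stem_hw}, {stem_hw}]")
print(f"stem params: {stem_params}")
print(f"stem output size: {stem_output_mib:.0f} MiB")
```

Because the stem halves the resolution immediately, its output is a quarter of the size of the first VGG16 feature map at the same batch size.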

## <a id="visiont-transformer"></a>Vision Transformer

In [14]:
model_vit = models.vit_b_16(weights=None)
model_vit.heads[0] = nn.Linear(in_features=model_vit.heads[0].in_features, out_features=10)# Change the num of final output features to 10  # type: ignore 
# print(model_vit)

result = summary(model_vit, [conf.batchsize] + conf.dim_input,
                depth=7,
                col_names=["input_size",
                            "output_size",
                            "num_params",
                            "params_percent",
                            "kernel_size",
                            "mult_adds",
                            "trainable"])
print(result)

print("=== Estimated ===")
print_memory_estimate2(model_vit, [conf.batchsize] + conf.dim_input, 
                       conf.moment, conf.ddp, conf.mixed_pre)

print("=== Real ===")
train(model_vit, [conf.datasize] + conf.dim_input, conf.dim_output, conf.batchsize, conf.num_epochs,
      conf.criterion, conf.optim, conf.device)

Layer (type:depth-idx)                        Input Shape               Output Shape              Param #                   Param %                   Kernel Shape              Mult-Adds                 Trainable
VisionTransformer                             [128, 3, 224, 224]        [128, 10]                 768                         0.00%                   --                        --                        True
├─Conv2d: 1-1                                 [128, 3, 224, 224]        [128, 768, 14, 14]        590,592                     0.69%                   [16, 16]                  14,816,772,096            True
├─Encoder: 1-2                                [128, 197, 768]           [128, 197, 768]           151,296                     0.18%                   --                        --                        True
│    └─Dropout: 2-1                           [128, 197, 768]           [128, 197, 768]           --                             --                   --               

[Epoch 1]: 100%|██████████| 4/4 [00:01<00:00,  2.67it/s]


Train: allocated = 1967.2 MiB, reserved = 18142.0MiB, max allocated = 17218.1 MiB, used = 22545.5 MiB


[Epoch 2]: 100%|██████████| 4/4 [00:01<00:00,  2.02it/s]


Train: allocated = 1967.2 MiB, reserved = 18142.0MiB, max allocated = 17218.1 MiB, used = 22542.1 MiB


[Epoch 3]: 100%|██████████| 4/4 [00:01<00:00,  2.02it/s]

Train: allocated = 1967.2 MiB, reserved = 18142.0MiB, max allocated = 17218.1 MiB, used = 22540.2 MiB
Final: allocated = 1967.2 MiB, reserved = 18142.0MiB, max allocated = 17218.1 MiB, used = 22540.2 MiB





For ViT, the theoretical value was 18678 MiB and the measured value was 17218 MiB, an error of 8.5%.
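
The shapes and parameter counts in the torchinfo listing follow from the patch size. The sketch assumes the ViT-B/16 configuration (16x16 patches, hidden size 768, one class token); reading the 151,296 parameters of the Encoder row as the position embedding is an interpretation of the listing.

```python
MIB = 1024**2
batch, image_size, patch_size, hidden = 128, 224, 16, 768

num_patches = (image_size // patch_size) ** 2
num_tokens = num_patches + 1  # plus the class token

conv_proj_params = 3 * patch_size * patch_size * hidden + hidden  # patch projection, with bias
pos_embedding_params = num_tokens * hidden
class_token_params = hidden

# One [batch, tokens, hidden] float32 tensor, the shape that flows through the encoder.
token_tensor_mib = batch * num_tokens * hidden * 4 / MIB

print(f"tokens per image: {num_tokens}")
print(f"patch projection params: {conv_proj_params}")
print(f"position embedding params: {pos_embedding_params}")
print(f"one token tensor: {token_tensor_mib:.3f} MiB")
```

Each encoder block stores several tensors of this shape plus the attention maps, which is why the total grows far beyond the size of the input batch.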

## UNETR
A UNETR model is defined in PyTorch, and its theoretical and measured values are obtained.

In [15]:
conf.dim_input = [64, 64, 64]
conf.dim_output = [2, 64, 64, 64]
conf.batchsize = 8
conf.datasize = 32

In [16]:
from monai.networks.nets.unetr import UNETR

# os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_unetr = UNETR(
    in_channels=1,
    out_channels=conf.dim_output[0],
    img_size=conf.dim_input,
    feature_size=16,
    hidden_size=768,
    mlp_dim=3072,
    num_heads=12,
    pos_embed="perceptron",
    norm_name="instance",
    res_block=True,
    dropout_rate=0.0,
).to(conf.device)


result = summary(model_unetr, [conf.batchsize, 1] + conf.dim_input,
                depth=6,
                col_names=["input_size",
                            "output_size",
                            "num_params",
                            "params_percent",
                            "kernel_size",
                            "mult_adds",
                            "trainable"])
print(result)


print("=== Estimated ===")
print_memory_estimate2(model_unetr, [conf.batchsize, 1] + conf.dim_input, 
                       conf.moment, conf.ddp, conf.mixed_pre)

print("=== Real ===")
train(model_unetr, [conf.datasize, 1] + conf.dim_input, conf.dim_output, conf.batchsize, conf.num_epochs,
      conf.criterion, conf.optim, conf.device)

Layer (type:depth-idx)                             Input Shape               Output Shape              Param #                   Param %                   Kernel Shape              Mult-Adds                 Trainable
UNETR                                              [8, 1, 64, 64, 64]        [8, 2, 64, 64, 64]        --                             --                   --                        --                        True
├─ViT: 1-1                                         [8, 1, 64, 64, 64]        [8, 64, 768]              --                             --                   --                        --                        True
│    └─PatchEmbeddingBlock: 2-1                    [8, 1, 64, 64, 64]        [8, 64, 768]              49,152                      0.05%                   --                        --                        True
│    │    └─Sequential: 3-1                        [8, 1, 64, 64, 64]        [8, 64, 768]              --                             --           

[Epoch 1]: 100%|██████████| 4/4 [00:00<00:00,  7.29it/s]


Train: allocated = 2645.8 MiB, reserved = 9268.0MiB, max allocated = 5177.9 MiB, used = 13666.5 MiB


[Epoch 2]: 100%|██████████| 4/4 [00:00<00:00,  5.72it/s]


Train: allocated = 2647.4 MiB, reserved = 9268.0MiB, max allocated = 5177.9 MiB, used = 13657.4 MiB


[Epoch 3]: 100%|██████████| 4/4 [00:00<00:00,  5.72it/s]

Train: allocated = 2647.4 MiB, reserved = 9268.0MiB, max allocated = 5177.9 MiB, used = 13657.1 MiB
Final: allocated = 2647.4 MiB, reserved = 9268.0MiB, max allocated = 5177.9 MiB, used = 13657.1 MiB





For UNETR, the theoretical value was 5605 MiB and the measured value with the PyTorch implementation was 5178 MiB, an error of 8.2%.
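
The token shape [8, 64, 768] in the torchinfo listing can be checked in the same way. The sketch assumes a 16x16x16 patch size, which is not set in the notebook and is taken here to be the UNETR default; under that assumption the 49,152 parameters of the PatchEmbeddingBlock row are consistent with a position embedding of 64 tokens.

```python
MIB = 1024**2
batch, volume_size, hidden = 8, 64, 768
patch_size = 16  # assumption: not configured in the notebook

num_tokens = (volume_size // patch_size) ** 3
pos_embedding_params = num_tokens * hidden

input_mib = batch * 1 * volume_size**3 * 4 / MIB   # [8, 1, 64, 64, 64] float32
output_mib = batch * 2 * volume_size**3 * 4 / MIB  # [8, 2, 64, 64, 64] float32
token_tensor_mib = batch * num_tokens * hidden * 4 / MIB

print(f"tokens per volume: {num_tokens}")
print(f"position embedding params: {pos_embedding_params}")
print(f"input: {input_mib:.1f} MiB, output: {output_mib:.1f} MiB")
print(f"one token tensor: {token_tensor_mib:.1f} MiB")
```

The transformer tokens are small here, so most of the activation memory is likely to come from the convolutional decoder, which works at or near full resolution.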