<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>

# Memory-efficient Model Weight Loading

- This notebook provides tips for loading larger pretrained or finetuned models when GPU (or CPU) memory is limited
- Specifically, it focuses on cases where you saved the model using `torch.save(model.state_dict(), "model.pth")` (for example, in chapters 5-7) and want to load it in a new session later for continued pretraining or additional finetuning
- While the example uses an LLM, the methods explained in this notebook are general and apply to loading any PyTorch model, not just LLMs
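
As a point of reference, the save-and-reload pattern described above can be sketched on a tiny model. This is a minimal illustration, not the notebook's own code: the `nn.Linear` layer is a hypothetical stand-in for the GPT model, and the temporary file path is chosen only for this example.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Hypothetical tiny model standing in for the GPT model used in this notebook
model = nn.Linear(4, 2)

# Save only the weights (the state_dict), not the whole Python object
path = os.path.join(tempfile.mkdtemp(), "model.pth")
torch.save(model.state_dict(), path)

# "New session": rebuild the same architecture, then restore the weights
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(path, map_location="cpu", weights_only=True))

# Check that every parameter survived the round trip unchanged
same = all(
    torch.equal(p, q) for p, q in zip(model.parameters(), restored.parameters())
)
print(same)
```

Because only the `state_dict` is stored, the architecture has to be instantiated again before loading, which is where the memory overhead discussed in this notebook comes from.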

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/memory-efficient-loading/memory-efficient-loading.webp" width="800px">

In [1]:
from importlib.metadata import version

pkgs = [
    "torch",
]
for p in pkgs:
    print(f"{p} version: {version(p)}")

memory_profiler version: 0.61.0
torch version: 2.4.1+cu121


&nbsp;
## 1. Benchmark utilities

- First, let's define some utility code to track VRAM (GPU memory)
- Later, we will also introduce a tool to track the main system RAM (CPU memory)
- The purpose of these functions will become clear when we apply them later

In [2]:
import gc
import time
import torch


def start_memory_tracking():
    """Initialize GPU memory tracking."""
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
    else:
        print("This notebook is intended for CUDA GPUs but CUDA is not available.")

def print_memory_usage():
    max_gpu_memory = torch.cuda.max_memory_allocated() / (1024 ** 3)  # Convert bytes to GB
    print(f"Maximum GPU memory allocated: {max_gpu_memory:.1f} GB")

def cleanup():
    gc.collect()
    torch.cuda.empty_cache()
    time.sleep(3)  # some buffer time to allow memory to clear
    torch.cuda.reset_peak_memory_stats()
    max_memory_allocated = torch.cuda.max_memory_allocated(device) / (1024 ** 3)
    print(f"Maximum GPU memory allocated: {max_memory_allocated:.1f} GB")

&nbsp;
## 2. Model setup

- This code section sets up the model itself
- Here, we use the largest GPT-2 model, "gpt2-xl (1558M)", to make things more interesting (you may use "gpt2-small (124M)" instead to lower the memory requirements and execution time of this notebook)

In [3]:
from previous_chapters import GPTModel
# If the `previous_chapters.py` file is not available locally,
# you can import it from the `llms-from-scratch` PyPI package.
# For details, see: https://github.com/rasbt/LLMs-from-scratch/tree/main/pkg
# E.g.,
# from llms_from_scratch.ch04 import GPTModel



BASE_CONFIG = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "drop_rate": 0.0,        # Dropout rate
    "qkv_bias": True         # Query-key-value bias
}

model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

CHOOSE_MODEL = "gpt2-xl (1558M)"

BASE_CONFIG.update(model_configs[CHOOSE_MODEL])

- Now, let's see the GPU memory functions in action:

In [4]:
start_memory_tracking()


model = GPTModel(BASE_CONFIG)
device = torch.device("cuda")
model.to(device)

print_memory_usage()

Maximum GPU memory allocated: 6.4 GB
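
The reported figure can be sanity-checked with some arithmetic. The sketch below counts parameters for a GPT-2-style model under assumptions that are not stated in this notebook: untied token-embedding and output-head weights, a 4x feed-forward expansion, LayerNorm with scale and shift, and 32-bit floats. The helper name `gpt2_param_count` is made up for this illustration.

```python
def gpt2_param_count(vocab_size, context_length, emb_dim, n_layers):
    """Count parameters of a GPT-2-style model (assumed layout, see lead-in)."""
    d = emb_dim
    attention = 4 * (d * d + d)                           # query, key, value, output projections with biases
    feed_forward = (d * 4 * d + 4 * d) + (4 * d * d + d)  # two linear layers with 4x expansion
    layer_norms = 2 * 2 * d                               # two norms per block, scale + shift each
    per_block = attention + feed_forward + layer_norms

    embeddings = vocab_size * d + context_length * d      # token + positional embeddings
    final_norm = 2 * d
    output_head = vocab_size * d                          # assumed untied and without bias
    return n_layers * per_block + embeddings + final_norm + output_head


# Values taken from BASE_CONFIG and the "gpt2-xl (1558M)" entry
n_params = gpt2_param_count(
    vocab_size=50257, context_length=1024, emb_dim=1600, n_layers=48
)
param_gib = n_params * 4 / 1024**3  # 4 bytes per float32 parameter

print(f"{n_params:,} parameters")
print(f"{param_gib:.2f} GiB for the parameters alone")
```

Under these assumptions the parameters account for most of the measured peak; any remaining gap may stem from non-parameter buffers and allocator rounding. Note that the utilities in this notebook divide by `1024 ** 3`, so the printed "GB" values are strictly speaking GiB.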


- Additionally, let's make sure that the model runs okay by passing in an example tensor

In [5]:
# Test if the model works (no need to track memory here)
test_input = torch.tensor([[1, 2, 3]]).to(device)
model.eval()

with torch.no_grad():
    model(test_input)

- Next, imagine we were pretraining the model and saving it for later use
- We skip the actual pretraining here for simplicity and just save the initialized model (but the same concept applies)

In [6]:
# Training code would go here...

model.train()
torch.save(model.state_dict(), "model.pth")

- Lastly, we delete the model and example tensor in the Python session to reset the GPU memory

In [7]:
del model, test_input
cleanup()

Maximum GPU memory allocated: 0.0 GB


&nbsp;
## 3. Weight loading

- Now begins the interesting part where we load the pretrained model weights
- Let's see how much GPU memory is required to load the previously saved model

In [8]:
# Then load pretrained weights

start_memory_tracking()

model = GPTModel(BASE_CONFIG)
model.to(device)

model.load_state_dict(
    torch.load("model.pth", map_location=device, weights_only=True)
)
model.to(device)
model.eval();

print_memory_usage()

Maximum GPU memory allocated: 12.8 GB


- Notice that the peak memory (12.8 GB) is twice as large as when we only instantiated the model (6.4 GB)
- This is because we have the same model in memory twice, for a short period of time:
  - The first time via `model.to(device)`
  - The second time via the code line `model.load_state_dict(torch.load("model.pth", map_location=device, weights_only=True))`; eventually, the loaded model weights will be copied into the model, and the `state_dict` will be discarded, but for a brief amount of time, we have both the main model and the loaded `state_dict` in memory
- The remaining sections focus on addressing this
- But first, let's test the model and reset the GPU memory


In [9]:
# Test if the model works (no need to track memory here)
test_input = torch.tensor([[1, 2, 3]]).to(device)
model.eval()

with torch.no_grad():
    model(test_input)

del model, test_input
cleanup()

Maximum GPU memory allocated: 0.0 GB
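
The doubling observed in this section can be illustrated at small scale on the CPU. The following sketch uses a hypothetical `nn.Linear` layer and a randomly generated dictionary in place of a file loaded from disk; it shows that a default `load_state_dict` call copies values into the model's existing storage, so the model's tensors and the loaded tensors are two separate allocations that coexist until the loaded dictionary is released.

```python
import torch
import torch.nn as nn

# Hypothetical tiny model; its parameters are already allocated at this point
model = nn.Linear(4, 2)
ptr_before = model.weight.data_ptr()

# Stand-in for a state_dict returned by torch.load: a second set of tensors
loaded_state_dict = {
    key: torch.randn_like(value) for key, value in model.state_dict().items()
}

# Default behavior (assign=False): values are copied into the existing parameters
model.load_state_dict(loaded_state_dict)

copied_in_place = model.weight.data_ptr() == ptr_before
separate_storage = model.weight.data_ptr() != loaded_state_dict["weight"].data_ptr()
values_match = torch.equal(model.weight, loaded_state_dict["weight"])

print(copied_in_place, separate_storage, values_match)
```

Both copies live on the same device when `map_location=device` is used, which is consistent with the 2x peak measured above.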


&nbsp;
## 4. Loading weights sequentially

- One workaround for the problem of having the model weights in GPU memory twice, as highlighted in the previous section, is to load the weights sequentially
- Below, we:
  - first load the model into GPU memory
  - then load the model weights into CPU memory
  - and finally copy each parameter one by one into GPU memory


In [10]:
start_memory_tracking()

model = GPTModel(BASE_CONFIG).to(device)

state_dict = torch.load("model.pth", map_location="cpu", weights_only=True)

print_memory_usage()

# Sequentially copy weights to the model's parameters
with torch.no_grad():
    for name, param in model.named_parameters():
        if name in state_dict:
            param.copy_(state_dict[name].to(device))
        else:
            print(f"Warning: {name} not found in state_dict.")

print_memory_usage()

Maximum GPU memory allocated: 6.4 GB
Maximum GPU memory allocated: 6.7 GB


- As we can see above, the memory usage is much lower than before
- Notice that the memory increases from 6.4 to 6.7 GB because initially, we only have the model in memory, and then we have the model plus one parameter tensor in memory (we temporarily move each parameter tensor to the GPU via `.to(device)` so we can copy it into the model)
- Overall, this is a significant improvement
- Again, let's briefly test the model and then reset the GPU memory for the next section

In [11]:
# Test if the model works (no need to track memory here)
test_input = torch.tensor([[1, 2, 3]]).to(device)
model.eval()

with torch.no_grad():
    model(test_input)

del model, test_input, state_dict, param
cleanup()

Maximum GPU memory allocated: 0.0 GB
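
One detail of the sequential copy in this section is that it iterates over `named_parameters()`, which does not include registered buffers. Whether `GPTModel` registers any buffers is not shown in this notebook, so the module below is a hypothetical example (`TinyBlock` and its `mask` buffer are made up) that only illustrates the difference between the two views of a module.

```python
import torch
import torch.nn as nn


class TinyBlock(nn.Module):
    """Hypothetical module with two parameters and one persistent buffer."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)
        # Buffers are saved in the state_dict but are not parameters
        self.register_buffer("mask", torch.triu(torch.ones(4, 4), diagonal=1))


block = TinyBlock()

param_names = {name for name, _ in block.named_parameters()}
buffer_names = {name for name, _ in block.named_buffers()}
state_dict_keys = set(block.state_dict().keys())

# Entries a parameter-only copy loop would never touch
not_copied = state_dict_keys - param_names
print(sorted(not_copied))  # → ['mask']
```

When the model is constructed normally, as in this section, such buffers are already initialized by the constructor. If storage is instead allocated uninitialized, it may be worth copying buffers from the `state_dict` as well.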


&nbsp;
## 5. Loading the model with low CPU memory

- In the previous section, we reduced GPU memory use by loading the weights (`state_dict`) into CPU memory first before copying them one by one into the model
- However, what do we do if we have limited CPU memory?
- This section uses PyTorch's so-called `"meta"` device approach to load a model on machines with large GPU memory but small CPU memory
- But first, let's define a convenience function to monitor CPU memory

In [12]:
import os
import psutil
from threading import Thread


def memory_usage_in_gb(func, *args, **kwargs):
    process = psutil.Process(os.getpid())

    # Measure the baseline memory usage before running the function
    baseline_mem = process.memory_info().rss / 1024 ** 3  # in GB

    # Start monitoring memory in a separate thread
    mem_usage = []
    done = False

    def monitor_memory():
        while not done:
            mem_usage.append(process.memory_info().rss / 1024 ** 3)  # Convert to GB
            time.sleep(0.1)

    t = Thread(target=monitor_memory)
    t.start()

    # Run the function
    func(*args, **kwargs)

    # Stop monitoring
    done = True
    t.join()

    peak_mem_usage_gb = max(mem_usage) - baseline_mem
    return peak_mem_usage_gb


- To start with, let's track the CPU memory of the sequential weight loading approach from the previous section

In [13]:
def load_sequentially():
    start_memory_tracking()

    model = GPTModel(BASE_CONFIG).to(device)

    state_dict = torch.load("model.pth", map_location="cpu", weights_only=True)

    print_memory_usage()

    # Sequentially copy weights to the model's parameters
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in state_dict:
                param.copy_(state_dict[name].to(device))
            else:
                print(f"Warning: {name} not found in state_dict.")

    print_memory_usage()


peak_memory_used = memory_usage_in_gb(load_sequentially)
print(f"-> Maximum CPU memory allocated: {peak_memory_used:.1f} GB")

Maximum GPU memory allocated: 6.4 GB
Maximum GPU memory allocated: 6.7 GB
-> Maximum CPU memory allocated: 6.3 GB


- Now, suppose we have a machine with low CPU memory but large GPU memory
- We can trade off CPU memory and GPU memory usage by introducing PyTorch's so-called "meta" device
- PyTorch's meta device is a special device type that allows you to create tensors without allocating actual memory for their data, effectively creating "meta" tensors
- This is useful for tasks like model analysis or architecture definition, where you need tensor shapes and types without the overhead of memory allocation
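
To make the meta device more concrete, here is a minimal sketch on a hypothetical `nn.Linear` layer (not the GPT model). It assumes a PyTorch version that supports `torch.device` as a context manager (2.0 or newer) and uses the CPU as the target device so it runs without a GPU.

```python
import torch
import torch.nn as nn

# Construct the layer on the meta device: shapes and dtypes exist, data does not
with torch.device("meta"):
    layer = nn.Linear(1000, 1000)

on_meta = layer.weight.is_meta
shape = tuple(layer.weight.shape)
print(on_meta, shape)

# to_empty allocates real storage on the target device.
# The values are uninitialized and must be overwritten before use.
layer = layer.to_empty(device="cpu")
materialized = not layer.weight.is_meta
print(materialized, layer.weight.device)
```

Since `to_empty` leaves the contents undefined, every tensor in the module needs to be filled from the saved weights afterwards.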

In [14]:
def load_sequentially_with_meta():
    start_memory_tracking()

    with torch.device("meta"):
        model = GPTModel(BASE_CONFIG)

    model = model.to_empty(device=device)

    state_dict = torch.load("model.pth", map_location=device, weights_only=True)

    print_memory_usage()

    # Sequentially copy weights to the model's parameters
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in state_dict:
                param.copy_(state_dict[name])
            else:
                print(f"Warning: {name} not found in state_dict.")

    print_memory_usage()

peak_memory_used = memory_usage_in_gb(load_sequentially_with_meta)
print(f"-> Maximum CPU memory allocated: {peak_memory_used:.1f} GB")

Maximum GPU memory allocated: 12.8 GB
Maximum GPU memory allocated: 12.8 GB
-> Maximum CPU memory allocated: 1.3 GB


- As we can see above, by creating the model on the meta device and loading the weights directly into GPU memory, we effectively reduced the CPU memory requirements (1.3 GB instead of 6.3 GB), at the cost of a higher GPU peak (12.8 GB)
- One might ask: "Is the sequential weight loading still necessary then, and how does that compare to the original approach?"
- Let's check the simple PyTorch weight loading approach for comparison (from the first weight loading section in this notebook):

In [15]:
def baseline():
    start_memory_tracking()

    model = GPTModel(BASE_CONFIG)
    model.to(device)

    model.load_state_dict(torch.load("model.pth", map_location=device, weights_only=True))
    model.to(device)
    model.eval();

    print_memory_usage()

peak_memory_used = memory_usage_in_gb(baseline)
print(f"-> Maximum CPU memory allocated: {peak_memory_used:.1f} GB")

Maximum GPU memory allocated: 12.8 GB
-> Maximum CPU memory allocated: 4.4 GB


- As we can see above, the "simple" weight loading without the meta device uses more CPU memory (4.4 GB versus 1.3 GB)
- In other words, if you have a machine with limited CPU memory, you can use the meta device approach to directly load the model weights into GPU memory to reduce peak CPU memory usage

&nbsp;
## 6. Using `mmap=True` (recommended)

- As an intermediate or advanced `torch.load` user, you may wonder how these approaches compare to the `mmap=True` setting in PyTorch
- The `mmap=True` setting in PyTorch enables memory-mapped file I/O, which allows the tensor to access data directly from disk storage, thus reducing memory usage by not loading the entire file into RAM if RAM is limited
- Also, see the helpful comment by [mikaylagawarecki](https://github.com/rasbt/LLMs-from-scratch/issues/402)
- At first glance, it may look less efficient than the sequential approaches above:

In [37]:
def best_practices():
    with torch.device("meta"):
        model = GPTModel(BASE_CONFIG)

    model.load_state_dict(
        torch.load("model.pth", map_location=device, weights_only=True, mmap=True),
        assign=True
    )

    print_memory_usage()

peak_memory_used = memory_usage_in_gb(best_practices)
print(f"-> Maximum CPU memory allocated: {peak_memory_used:.1f} GB")

Maximum GPU memory allocated: 6.4 GB
-> Maximum CPU memory allocated: 5.9 GB


- The CPU RAM usage is this high because there is enough CPU RAM available on this machine
- However, if you were to run this on a machine with limited CPU RAM, the `mmap` approach would use less memory

&nbsp;
## 7. Other methods

- This notebook is focused on simple, built-in methods for loading weights in PyTorch
- The recommended approach for limited CPU memory cases is the `mmap=True` approach explained in the previous section
- Alternatively, one other option is a brute-force approach that saves and loads each weight tensor separately:

In [13]:
model = GPTModel(BASE_CONFIG)
# Assume `model` is your trained model
state_dict = model.state_dict()

# Create a directory to store individual parameter files
os.makedirs("model_parameters", exist_ok=True)

# Save each parameter tensor separately
for name, param in state_dict.items():
    torch.save(param.cpu(), f"model_parameters/{name}.pt")

del model

In [16]:
def load_individual_weights():

    start_memory_tracking()

    with torch.device("meta"):
        model = GPTModel(BASE_CONFIG)

    model = model.to_empty(device=device)

    print_memory_usage()
    param_dir = "model_parameters"

    with torch.no_grad():
        for name, param in model.named_parameters():
            weight_path = os.path.join(param_dir, f"{name}.pt")
            if os.path.exists(weight_path):
                param_data = torch.load(weight_path, map_location="cpu", weights_only=True)
                param.copy_(param_data)
                del param_data  # Free memory
            else:
                print(f"Warning: {name} not found in {param_dir}.")

    print_memory_usage()


peak_memory_used = memory_usage_in_gb(load_individual_weights)
print(f"-> Maximum CPU memory allocated: {peak_memory_used:.1f} GB")

Maximum GPU memory allocated: 6.4 GB
Maximum GPU memory allocated: 6.4 GB
-> Maximum CPU memory allocated: 0.3 GB
