Skip to content

[Bug] Training use CPU offload raise OOM error in wsl2 system #2977

@lpty

Description

@lpty

cpu/gpu memory is enough to do this job,i run the model gpt-neo-125M in a nvidia 3090-24g and cpu memory is 64g.
system
win10 with a nvidia 3090-24g and cpu memory is 64g.
docker image

REPOSITORY            TAG                   IMAGE ID       CREATED         SIZE
deepspeed/deepspeed   v072_torch112_cu117   b1d3268ea315   6 months ago    14.7GB

docker run --gpus all --name deepspeed -p 10022:22 --ipc=host --shm-size=16g --ulimit memlock=-1 --privileged -v E:\:/mnt/e -it b1d3268ea315 /bin/bash

free

              total        used        free      shared  buff/cache   available
Mem:           50Gi       792Mi        46Gi       1.0Mi       3.1Gi        48Gi
Swap:          13Gi          0B        13Gi

ds_report

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 1.12.0a0+8a1a93a
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.8.1, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.7

python -m torch.utils.collect_env

Collecting environment information...
PyTorch version: 1.12.0a0+8a1a93a
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.23.1
Libc version: glibc-2.31

Python version: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10)  [GCC 10.3.0] (64-bit runtime)
Python platform: Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 11.7.64
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version: 516.94
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.4.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.22.3
[pip3] pytorch-quantization==2.1.2
[pip3] torch==1.12.0a0+8a1a93a
[pip3] torch-tensorrt==1.1.0a0
[pip3] torchtext==0.13.0a0
[pip3] torchtyping==0.1.4
[pip3] torchvision==0.13.0a0
[conda] mkl                       2019.5                      281    conda-forge
[conda] mkl-include               2019.5                      281    conda-forge
[conda] numpy                     1.22.3           py38h99721a1_2    conda-forge
[conda] pytorch-quantization      2.1.2                    pypi_0    pypi
[conda] torch                     1.12.0a0+8a1a93a          pypi_0    pypi
[conda] torch-tensorrt            1.1.0a0                  pypi_0    pypi
[conda] torchtext                 0.13.0a0                 pypi_0    pypi
[conda] torchtyping               0.1.4                    pypi_0    pypi
[conda] torchvision               0.13.0a0                 pypi_0    pypi

Track log

/opt/conda/lib/python3.8/site-packages/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
  warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
[2023-03-09 02:23:09,449] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-03-09 02:23:09,494] [INFO] [runner.py:548:main] cmd = /opt/conda/bin/python3.8 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None train_gptj_summarize.py
/opt/conda/lib/python3.8/site-packages/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
  warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
[2023-03-09 02:23:10,864] [INFO] [launch.py:135:main] 0 NCCL_VERSION=2.12.10+cuda11.6
[2023-03-09 02:23:10,864] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-03-09 02:23:10,864] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-03-09 02:23:10,864] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-03-09 02:23:10,864] [INFO] [launch.py:162:main] dist_world_size=1
[2023-03-09 02:23:10,864] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
/opt/conda/lib/python3.8/site-packages/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
  warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py:1402: UserWarning: positional arguments and argument "destination" are deprecated. nn.Module.state_dict will not accept them in the future. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
Found cached dataset parquet (/root/.cache/huggingface/datasets/CarperAI___parquet/CarperAI--openai_summarize_tldr-536d9955f5e6f921/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
Found cached dataset parquet (/root/.cache/huggingface/datasets/CarperAI___parquet/CarperAI--openai_summarize_tldr-536d9955f5e6f921/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
[2023-03-09 02:23:33,945] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Using True half precision backend
[2023-03-09 02:23:33,996] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed info: version=0.8.1, git-hash=unknown, git-branch=unknown
[2023-03-09 02:23:35,259] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -c /opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
[2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -c /opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
[3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -lcurand -L/opt/conda/lib/python3.8/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 16.64772057533264 seconds
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000010, betas=(0.900000, 0.950000), weight_decay=0.000000, adam_w=1
[2023-03-09 02:23:54,821] [INFO] [logging.py:75:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2023-03-09 02:23:54,825] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2023-03-09 02:23:54,825] [INFO] [utils.py:53:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-03-09 02:23:54,825] [INFO] [logging.py:75:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer
[2023-03-09 02:23:54,825] [INFO] [stage_1_and_2.py:144:__init__] Reduce bucket size 500,000,000
[2023-03-09 02:23:54,825] [INFO] [stage_1_and_2.py:145:__init__] Allgather bucket size 500000000
[2023-03-09 02:23:54,826] [INFO] [stage_1_and_2.py:146:__init__] CPU Offload: True
[2023-03-09 02:23:54,826] [INFO] [stage_1_and_2.py:147:__init__] Round robin gradient partitioning: False
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py38_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.08072209358215332 seconds
Rank: 0 partition count [1] and sizes[(125198592, False)]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /mnt/e/trlx/examples/summarize_rlhf/sft/train_gptj_summarize.py:112 in <module>                  │
│                                                                                                  │
│   109 │   │   data_collator=default_data_collator,                                               │
│   110 │   │   preprocess_logits_for_metrics=preprocess_logits_for_metrics,                       │
│   111 │   )                                                                                      │
│ ❱ 112 │   trainer.train()                                                                        │
│   113 │   trainer.save_model(output_dir)                                                         │
│   114                                                                                            │
│                                                                                                  │
│ /opt/conda/lib/python3.8/site-packages/transformers/trainer.py:1543 in train                     │
│                                                                                                  │
│   1540 │   │   inner_training_loop = find_executable_batch_size(                                 │
│   1541 │   │   │   self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size  │
│   1542 │   │   )                                                                                 │
│ ❱ 1543 │   │   return inner_training_loop(                                                       │
│   1544 │   │   │   args=args,                                                                    │
│   1545 │   │   │   resume_from_checkpoint=resume_from_checkpoint,                                │
│   1546 │   │   │   trial=trial,                                                                  │
│                                                                                                  │
│ /opt/conda/lib/python3.8/site-packages/transformers/trainer.py:1612 in _inner_training_loop      │
│                                                                                                  │
│   1609 │   │   │   or self.fsdp is not None                                                      │
│   1610 │   │   )                                                                                 │
│   1611 │   │   if args.deepspeed:                                                                │
│ ❱ 1612 │   │   │   deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(                   │
│   1613 │   │   │   │   self, num_training_steps=max_steps, resume_from_checkpoint=resume_from_c  │
│   1614 │   │   │   )                                                                             │
│   1615 │   │   │   self.model = deepspeed_engine.module                                          │
│                                                                                                  │
│ /opt/conda/lib/python3.8/site-packages/transformers/deepspeed.py:344 in deepspeed_init           │
│                                                                                                  │
│   341 │   │   lr_scheduler=lr_scheduler,                                                         │
│   342 │   )                                                                                      │
│   343 │                                                                                          │
│ ❱ 344 │   deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)          │
│   345 │                                                                                          │
│   346 │   if resume_from_checkpoint is not None:                                                 │
│   347                                                                                            │
│                                                                                                  │
│ /opt/conda/lib/python3.8/site-packages/deepspeed/__init__.py:125 in initialize                   │
│                                                                                                  │
│   122 │   assert model is not None, "deepspeed.initialize requires a model"                      │
│   123 │                                                                                          │
│   124 │   if not isinstance(model, PipelineModule):                                              │
│ ❱ 125 │   │   engine = DeepSpeedEngine(args=args,                                                │
│   126 │   │   │   │   │   │   │   │    model=model,                                              │
│   127 │   │   │   │   │   │   │   │    optimizer=optimizer,                                      │
│   128 │   │   │   │   │   │   │   │    model_parameters=model_parameters,                        │
│                                                                                                  │
│ /opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py:336 in __init__               │
│                                                                                                  │
│    333 │   │   │   model_parameters = self.module.parameters()                                   │
│    334 │   │                                                                                     │
│    335 │   │   if has_optimizer:                                                                 │
│ ❱  336 │   │   │   self._configure_optimizer(optimizer, model_parameters)                        │
│    337 │   │   │   self._configure_lr_scheduler(lr_scheduler)                                    │
│    338 │   │   │   self._report_progress(0)                                                      │
│    339 │   │   elif self.zero_optimization():                                                    │
│                                                                                                  │
│ /opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py:1292 in _configure_optimizer  │
│                                                                                                  │
│   1289 │   │   optimizer_wrapper = self._do_optimizer_sanity_check(basic_optimizer)              │
│   1290 │   │                                                                                     │
│   1291 │   │   if optimizer_wrapper == ZERO_OPTIMIZATION:                                        │
│ ❱ 1292 │   │   │   self.optimizer = self._configure_zero_optimizer(basic_optimizer)              │
│   1293 │   │   elif optimizer_wrapper == AMP:                                                    │
│   1294 │   │   │   amp_params = self.amp_params()                                                │
│   1295 │   │   │   log_dist(f"Initializing AMP with these params: {amp_params}", ranks=[0])      │
│                                                                                                  │
│ /opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py:1542 in                       │
│ _configure_zero_optimizer                                                                        │
│                                                                                                  │
│   1539 │   │   │   │   │   │   "Pipeline parallelism does not support overlapped communication,  │
│   1540 │   │   │   │   │   )                                                                     │
│   1541 │   │   │   │   │   overlap_comm = False                                                  │
│ ❱ 1542 │   │   │   optimizer = DeepSpeedZeroOptimizer(                                           │
│   1543 │   │   │   │   optimizer,                                                                │
│   1544 │   │   │   │   self.param_names,                                                         │
│   1545 │   │   │   │   timers=timers,                                                            │
│                                                                                                  │
│ /opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:451 in __init__   │
│                                                                                                  │
│    448 │   │   │   self.norm_for_param_grads = {}                                                │
│    449 │   │   │   self.local_overflow = False                                                   │
│    450 │   │   │   self.grad_position = {}                                                       │
│ ❱  451 │   │   │   self.temp_grad_buffer_for_cpu_offload = get_accelerator().pin_memory(         │
│    452 │   │   │   │   torch.zeros(largest_param_numel,                                          │
│    453 │   │   │   │   │   │   │   device=self.device,                                           │
│    454 │   │   │   │   │   │   │   dtype=self.dtype))                                            │
│                                                                                                  │
│ /opt/conda/lib/python3.8/site-packages/deepspeed/accelerator/cuda_accelerator.py:214 in          │
│ pin_memory                                                                                       │
│                                                                                                  │
│   211 │   │   return torch.cuda.LongTensor                                                       │
│   212 │                                                                                          │
│   213 │   def pin_memory(self, tensor):                                                          │
│ ❱ 214 │   │   return tensor.pin_memory()                                                         │
│   215 │                                                                                          │
│   216 │   def on_accelerator(self, tensor):                                                      │
│   217 │   │   device_str = str(tensor.device)                                                    │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[2023-03-09 02:23:56,918] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 735
[2023-03-09 02:23:56,918] [ERROR] [launch.py:324:sigkill_handler] ['/opt/conda/bin/python3.8', '-u', 'train_gptj_summarize.py', '--local_rank=0'] exits with return code = 1

ds config

{
  "train_batch_size": 1,
  "fp16": {
    "enabled": true,
    "min_loss_scale": 1,
    "opt_level": "O2"
  },
  "zero_optimization": {
    "stage": 2,
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "contiguous_gradients": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 1e-05,
      "betas": [
        0.9,
        0.95
      ],
      "eps": 1e-08,
      "torch_adam": false
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 1e-05,
      "warmup_num_steps": "auto"
    }
  }
}

is cpu offload not support in wsl2 ? i can run the same code success in another machine with a v100 gpu

Originally posted by @lpty in #2676 (comment)

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions