OOM when using multi-gpu to train a transformer model #4352

JaniceXiong opened this issue Apr 14, 2022 · 0 comments

❓ Questions and Help

Before asking:

  1. search the issues.
  2. search the docs.

What is your question?

Hi, everyone!
I trained a model with fairseq on RTX 3090 GPUs, using the default Adam optimizer (via the fairseq-train command). Training went well on a single GPU, with no OOM or other errors, but when I tried to use two GPUs, the OOM below occurred.
According to the traceback, it seems to happen in the optimizer step. It is strange that device 0 reports 0 B allocated while device 1 reports a very large allocation.
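
To double-check that imbalance outside of fairseq, here is a minimal sketch I could drop into the training code (assumption: CUDA_VISIBLE_DEVICES remaps the two visible 3090s to cuda:0 and cuda:1; note that these counters are tracked per process, so a rank that never allocates on device 0 would be expected to report 0 B for it):

import torch

# Print current and peak allocated bytes for every visible GPU, as seen by
# *this* process (the caching-allocator statistics are per process).
for i in range(torch.cuda.device_count()):
    cur = torch.cuda.memory_allocated(i) / 2**20
    peak = torch.cuda.max_memory_allocated(i) / 2**20
    print(f"cuda:{i}: allocated={cur:.0f} MiB, peak={peak:.0f} MiB")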

| WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 3.02 GiB (GPU 1; 23.70 GiB total capacity; 16.92 GiB already allocated; 1019.69 MiB free; 21.03 GiB reserved in total by PyTorch)
2022-04-14 02:39:35 | WARNING | fairseq.trainer | |===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Active memory         |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| GPU reserved memory   |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Allocations           |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Active allocs         |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|===========================================================================|

2022-04-14 02:39:35 | WARNING | fairseq.trainer | |===========================================================================|
|                  PyTorch CUDA memory summary, device ID 1                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 1            |        cudaMalloc retries: 1         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |   17329 MB |   20441 MB |  315650 MB |  298320 MB |
|       from large pool |   17326 MB |   20437 MB |  303570 MB |  286244 MB |
|       from small pool |       3 MB |     280 MB |   12079 MB |   12076 MB |
|---------------------------------------------------------------------------|
| Active memory         |   17329 MB |   20441 MB |  315650 MB |  298320 MB |
|       from large pool |   17326 MB |   20437 MB |  303570 MB |  286244 MB |
|       from small pool |       3 MB |     280 MB |   12079 MB |   12076 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |   21530 MB |   21830 MB |   43546 MB |   22016 MB |
|       from large pool |   21464 MB |   21624 MB |   41444 MB |   19980 MB |
|       from small pool |      66 MB |     292 MB |    2102 MB |    2036 MB |
|---------------------------------------------------------------------------|
| Non-releasable memory |    4200 MB |    4202 MB |  264811 MB |  260611 MB |
|       from large pool |    4137 MB |    4139 MB |  250471 MB |  246333 MB |
|       from small pool |      62 MB |     112 MB |   14340 MB |   14277 MB |
|---------------------------------------------------------------------------|
| Allocations           |    2280    |    3778    |  177529    |  175249    |
|       from large pool |     929    |    1345    |   52873    |   51944    |
|       from small pool |    1351    |    2537    |  124656    |  123305    |
|---------------------------------------------------------------------------|
| Active allocs         |    2280    |    3778    |  177529    |  175249    |
|       from large pool |     929    |    1345    |   52873    |   51944    |
|       from small pool |    1351    |    2537    |  124656    |  123305    |
|---------------------------------------------------------------------------|
| GPU reserved segments |     128    |     265    |    1846    |    1718    |
|       from large pool |      95    |     162    |     795    |     700    |
|       from small pool |      33    |     146    |    1051    |    1018    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |     135    |     181    |   85994    |   85859    |
|       from large pool |      53    |      68    |   31036    |   30983    |
|       from small pool |      82    |     146    |   54958    |   54876    |
|===========================================================================|
2022-04-14 02:39:35 | ERROR | fairseq.trainer | OOM during optimization, irrecoverable
Traceback (most recent call last):
  File "/home/xjw/miniconda3/envs/cliff/bin/fairseq-train", line 8, in <module>
    sys.exit(cli_main())
  File "/home/xjw/miniconda3/envs/cliff/lib/python3.6/site-packages/fairseq_cli/train.py", line 392, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/home/xjw/miniconda3/envs/cliff/lib/python3.6/site-packages/fairseq/distributed_utils.py", line 318, in call_main
    cfg.distributed_training.distributed_world_size,
  File "/home/xjw/miniconda3/envs/cliff/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/xjw/miniconda3/envs/cliff/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/xjw/miniconda3/envs/cliff/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/xjw/miniconda3/envs/cliff/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/xjw/miniconda3/envs/cliff/lib/python3.6/site-packages/fairseq/distributed_utils.py", line 300, in distributed_main
    main(cfg, **kwargs)
  File "/home/xjw/miniconda3/envs/cliff/lib/python3.6/site-packages/fairseq_cli/train.py", line 130, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/home/xjw/miniconda3/envs/cliff/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/home/xjw/miniconda3/envs/cliff/lib/python3.6/site-packages/fairseq_cli/train.py", line 219, in train
    log_output = trainer.train_step(samples)
  File "/home/xjw/miniconda3/envs/cliff/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/home/xjw/miniconda3/envs/cliff/lib/python3.6/site-packages/fairseq/trainer.py", line 674, in train_step
    raise e
  File "/home/xjw/miniconda3/envs/cliff/lib/python3.6/site-packages/fairseq/trainer.py", line 647, in train_step
    self.optimizer.step()
    self.optimizer.step(closure)
  File "/home/xjw/miniconda3/envs/cliff/lib/python3.6/site-packages/torch/optim/optimizer.py", line 89, in wrapper
    return func(*args, **kwargs)
  File "/home/xjw/miniconda3/envs/cliff/lib/python3.6/site-packages/fairseq/optim/adam.py", line 210, in step
    denom = exp_avg_sq.sqrt().add_(group["eps"])
RuntimeError: CUDA out of memory. Tried to allocate 3.02 GiB (GPU 1; 23.70 GiB total capacity; 16.92 GiB already allocated; 1019.69 MiB free; 21.03 GiB reserved in total by PyTorch)
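
If I read the failing line right, exp_avg_sq.sqrt() materializes a brand-new FP32 tensor the size of the (flattened) parameters before add_ runs in place, and with --fp16 fairseq also keeps an FP32 master copy of the parameters plus FP32 exp_avg/exp_avg_sq. A rough back-of-the-envelope sketch (the ~810M parameter figure is only inferred from the 3.02 GiB allocation, not measured):

# Back-of-the-envelope: relate the failed 3.02 GiB allocation to model size,
# assuming it is a temporary FP32 tensor the same shape as exp_avg_sq.
GiB = 2 ** 30
failed_alloc = 3.02 * GiB                  # "Tried to allocate 3.02 GiB"
fp32_bytes = 4
params = failed_alloc / fp32_bytes
print(f"implied parameter count: ~{params / 1e6:.0f}M")              # ~810M

# Persistent FP32 optimizer state with --fp16 (master params + exp_avg +
# exp_avg_sq), before counting that temporary, gradients, or activations:
state = 3 * params * fp32_bytes
print(f"persistent optimizer state: ~{state / GiB:.1f} GiB per GPU")  # ~9.1 GiB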

My training script is shown below; the only thing I changed for the multi-GPU run is DEVICES.

TOTAL_NUM_UPDATES=20000
WARMUP_UPDATES=500
LR=3e-05
MAX_TOKENS=1024
UPDATE_FREQ=4
BART_PATH=$1
DATA_DIR=$2
USER_DIR=$3
SAVE_PATH=$4
TENSOR_LOG_PATH=$5
DEVICES=4,5

CUDA_VISIBLE_DEVICES=$DEVICES fairseq-train $DATA_DIR \
    --facets Purpose,Method,Findings \
    --max-epoch 10 \
    --tensorboard-logdir $TENSOR_LOG_PATH \
    --restore-file $BART_PATH --save-dir $SAVE_PATH \
    --max-tokens $MAX_TOKENS \
    --task divide_translation \
    --source-lang source --target-lang Purpose.target \
    --truncate-source \
    --layernorm-embedding \
    --share-all-embeddings \
    --share-decoder-input-output-embed \
    --reset-optimizer --reset-dataloader --reset-meters \
    --required-batch-size-multiple 1 \
    --arch divide_bart_large \
    --criterion divide_loss \
    --label-smoothing 0.1 \
    --fixed-validation-seed 7 \
    --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.01 --optimizer adam --adam-betas "(0.9, 0.999)" --adam-eps 1e-08 \
    --clip-norm 0.1 \
    --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
    --fp16 --update-freq $UPDATE_FREQ \
    --skip-invalid-size-inputs-valid-test \
    --no-save-optimizer-state \
    --find-unused-parameters \
    --user-dir $USER_DIR;
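
For reference, my understanding of how the effective batch scales here (--max-tokens is per GPU and --update-freq accumulates gradients locally, so the per-GPU forward/backward workload should match the single-GPU run; only the DDP and optimizer-step overhead changes):

# Effective tokens per optimizer update in fairseq, as I understand it:
# tokens per GPU * gradient-accumulation steps * number of GPUs.
MAX_TOKENS = 1024   # --max-tokens
UPDATE_FREQ = 4     # --update-freq

for num_gpus in (1, 2):
    tokens_per_update = MAX_TOKENS * UPDATE_FREQ * num_gpus
    print(f"{num_gpus} GPU(s): {tokens_per_update} tokens per update")
# 1 GPU(s): 4096 tokens per update
# 2 GPU(s): 8192 tokens per update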

What's your environment?

  • fairseq Version: 1.0.0a0+0db28cd
  • PyTorch Version: 1.8.2+cu111
  • OS: Linux
  • Python version: 3.6
  • GPU models: 3090