Describe the bug
There is a problem with the asynchronous communication used by ZeRO stage 2 when overlap_comm is enabled.
To Reproduce
Steps to reproduce the behavior:
Train the bloomz-7b1-mt model with Hugging Face Transformers and DeepSpeed ZeRO stage 2. With overlap_comm = true enabled and all sources of randomness controlled (seed control sketched below), the loss still differs on every run.
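For reference, a minimal sketch of the kind of seed control involved; the actual training script is not reproduced in this report, and the seed value is illustrative:

# Hedged sketch: fixing the usual RNG sources before training.
import torch
from transformers import set_seed

set_seed(42)                               # seeds random, numpy, and torch (CPU + CUDA)
torch.backends.cudnn.deterministic = True  # prefer deterministic cuDNN kernels
torch.backends.cudnn.benchmark = False     # disable nondeterministic autotuning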
Model: bloomz-7b1-mt
ZeRO-2 config:
{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
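For context, a hedged sketch of how a config like the one above is typically passed to the Hugging Face Trainer; the file name ds_zero2.json and the dataset are placeholders, not the actual training script:

# Hedged sketch: wiring the ZeRO-2 config into the HF Trainer. Launched with
# `torchrun --nproc_per_node=8 train.py` per the launcher context below.
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-7b1-mt")
args = TrainingArguments(
    output_dir="out",
    fp16=True,                  # matches the "fp16" block in the config
    deepspeed="ds_zero2.json",  # "auto" fields are resolved from these arguments
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)  # train_dataset: any tokenized causal-LM dataset (placeholder)
trainer.train()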
System info (please complete the following information):
- OS: Ubuntu 20.04
- Hardware: one machine with 8x NVIDIA A100-SXM GPUs
- Python 3.8.13
- deepspeed 0.8.1
- transformers 4.26.1
- torch 1.13.0a0+08820cb
Launcher context
torchrun
Docker context
nvcr.io/nvidia/pytorch:22.07-py3