[BUG] NCCL operation timeout when training with deepspeed_zero3_offload or deepspeed_zero3 on RTX 4090 #6756

@MLS2021

Description

I'm using DeepSpeed to fine-tune large models. Because GPU memory is limited, I was training with deepspeed_zero2, but I kept hitting OOM errors, so I switched to deepspeed_zero3. Then a new problem appeared:

[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=235, OpType=BROADCAST, NumelIn=768, NumelOut=768, Timeout(ms)=1800000) ran for 1800956 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=235, OpType=BROADCAST, NumelIn=768, NumelOut=768, Timeout(ms)=1800000) ran for 1800651 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=235, OpType=BROADCAST, NumelIn=768, NumelOut=768, Timeout(ms)=1800000) ran for 1800283 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=235, OpType=BROADCAST, NumelIn=768, NumelOut=768, Timeout(ms)=1800000) ran for 1800253 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=235, OpType=BROADCAST, NumelIn=768, NumelOut=768, Timeout(ms)=1800000) ran for 1800283 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=235, OpType=BROADCAST, NumelIn=768, NumelOut=768, Timeout(ms)=1800000) ran for 1800253 milliseconds before timing out.
[2024-11-18 11:07:07,748] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3539827 closing signal SIGTERM
[2024-11-18 11:07:07,749] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3539828 closing signal SIGTERM

[2024-11-18 11:07:13,274] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 2 (pid: 3539829) of binary: /home/mls01/miniconda3/envs/omg-llava/bin/python

I get the same problem with deepspeed_zero3_offload. The timeout usually happens during the model weight loading phase. Any replies are appreciated.
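
For reference, the ZeRO-3 offload setup I mean looks roughly like the sketch below (illustrative values only; the keys follow DeepSpeed's documented zero_optimization schema, and the exact batch sizes and precision settings in my runs differ). The 1800000 ms in the log is the default 30-minute collective timeout, so I have also been wondering whether passing a larger timeout to init_process_group would change anything or only delay the failure:

import datetime

import torch.distributed as dist

# Minimal ZeRO-3 + CPU offload config (illustrative, not my real values).
# A dict like this would be passed to deepspeed.initialize(..., config=ds_config).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

# Run under torchrun / the deepspeed launcher. The default NCCL collective
# timeout is 30 minutes (the 1800000 ms seen in the log); raising it like this
# only helps if the BROADCAST is slow rather than genuinely stuck.
dist.init_process_group(backend="nccl", timeout=datetime.timedelta(hours=2))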

Labels: bug (Something isn't working), training
