[BUG] NCCL operation timeout when training with deepspeed_zero3_offload or deepspeed_zero3 on RTX 4090 #6756

@MLS2021

Description

I'm using DeepSpeed to fine-tune large models. Because GPU memory is limited, I was training with deepspeed_zero2, but I kept hitting OOM errors, so I switched to deepspeed_zero3. Then a new problem appeared:

[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=235, OpType=BROADCAST, NumelIn=768, NumelOut=768, Timeout(ms)=1800000) ran for 1800956 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=235, OpType=BROADCAST, NumelIn=768, NumelOut=768, Timeout(ms)=1800000) ran for 1800651 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=235, OpType=BROADCAST, NumelIn=768, NumelOut=768, Timeout(ms)=1800000) ran for 1800283 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=235, OpType=BROADCAST, NumelIn=768, NumelOut=768, Timeout(ms)=1800000) ran for 1800253 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=235, OpType=BROADCAST, NumelIn=768, NumelOut=768, Timeout(ms)=1800000) ran for 1800283 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=235, OpType=BROADCAST, NumelIn=768, NumelOut=768, Timeout(ms)=1800000) ran for 1800253 milliseconds before timing out.
[2024-11-18 11:07:07,748] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3539827 closing signal SIGTERM
[2024-11-18 11:07:07,749] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3539828 closing signal SIGTERM

[2024-11-18 11:07:13,274] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 2 (pid: 3539829) of binary: /home/mls01/miniconda3/envs/omg-llava/bin/python

I get the same problem with deepspeed_zero3_offload. The timeout usually happens during the model weight loading phase. Any replies are appreciated.
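
For reference, the ZeRO-3 offload setup I mean looks roughly like the sketch below (illustrative values only; the keys follow DeepSpeed's documented zero_optimization schema, and the exact batch sizes and precision settings in my runs differ). The 1800000 ms in the log is the default 30-minute collective timeout, so I have also been wondering whether passing a larger timeout to init_process_group would change anything or only delay the failure:

import datetime

import torch.distributed as dist

# Minimal ZeRO-3 + CPU offload config (illustrative, not my real values).
# A dict like this would be passed to deepspeed.initialize(..., config=ds_config).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

# Run under torchrun / the deepspeed launcher. The default NCCL collective
# timeout is 30 minutes (the 1800000 ms seen in the log); raising it like this
# only helps if the BROADCAST is slow rather than genuinely stuck.
dist.init_process_group(backend="nccl", timeout=datetime.timedelta(hours=2))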

Labels: bug (Something isn't working), training
