
Set explicit 30-minute timeout on all init_process_group calls #1069

Merged: mcgibbon merged 2 commits into main from fix/distributed-init-timeouts on Apr 21, 2026


mcgibbon (Contributor) commented Apr 21, 2026

The Gloo+torchrun and srun init_process_group paths had no explicit timeout and relied on the PyTorch default (30 minutes). This PR makes all paths consistent with the NCCL+torchrun path, so collective operations time out after 30 minutes when a peer dies.

Changes:

  • fme.core.distributed.TorchDistributed.__init__: added timeout=timedelta(minutes=30) to the Gloo+torchrun and srun init_process_group calls

  • Tests added
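
For readers following along, here is a minimal sketch of passing an explicit timeout to init_process_group and checking it without launching any processes. It is not the code from fme.core.distributed: the names INIT_TIMEOUT, init_kwargs, and init_distributed are hypothetical, the init_fn parameter exists only so a mock can be injected, and the 30-minute value is taken from the merged PR title.

```python
from datetime import timedelta
from unittest import mock

# Assumption: one shared constant for every init path. The PR does not
# say how the value is stored in fme.core.distributed.
INIT_TIMEOUT = timedelta(minutes=30)


def init_kwargs(backend: str) -> dict:
    """Build keyword arguments for torch.distributed.init_process_group."""
    return {"backend": backend, "timeout": INIT_TIMEOUT}


def init_distributed(backend: str, init_fn=None) -> None:
    """Initialize the default process group with an explicit timeout.

    init_fn defaults to torch.distributed.init_process_group. It is a
    parameter only so that a test can pass a mock instead.
    """
    if init_fn is None:
        # Imported lazily so the helper can be tested without PyTorch.
        import torch.distributed as dist

        init_fn = dist.init_process_group
    init_fn(**init_kwargs(backend))


# Check that the timeout is forwarded, using a mock in place of PyTorch.
fake_init = mock.Mock()
init_distributed("gloo", init_fn=fake_init)
fake_init.assert_called_once_with(backend="gloo", timeout=timedelta(minutes=30))
```

Keeping the value in a single constant means a later change, such as moving from 60 to 30 minutes, touches one line rather than every call site.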

The Gloo+torchrun and srun init paths had no explicit timeout, relying
on PyTorch defaults. This makes all paths consistent with the
NCCL+torchrun path, ensuring collective operations time out after 60
minutes when a peer dies instead of potentially hanging longer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jpdunc23 (Member) left a comment

LGTM. Per discussion on Slack, maybe update to 30 min in this PR?

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mcgibbon enabled auto-merge (squash) April 21, 2026 17:46
mcgibbon merged commit 68767dc into main Apr 21, 2026
7 checks passed
mcgibbon deleted the fix/distributed-init-timeouts branch April 21, 2026 17:59
mcgibbon changed the title from "Set explicit 60-minute timeout on all init_process_group calls" to "Set explicit 30-minute timeout on all init_process_group calls" Apr 22, 2026