feat: add diagnostic logging to EFA NCCL test #6114

Merged: Eren-Jeager123 merged 1 commit into main from debug-efa on May 21, 2026

Eren-Jeager123 commented on May 20, 2026

Summary

Add diagnostic logging to the EFA test so future failures are easier to diagnose without manual SSH debugging.

Changes (test_efa.py only)

  1. Pre-test diagnostics: Print the host driver version and the cuda-compat library version from inside the container before running the tests.
  2. Failure capture: Pass warn=True to the NCCL allreduce command so that stdout, stderr, and the log are captured on failure instead of being lost in fabric's exception handling.
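
A minimal sketch of the two changes above, not the actual test_efa.py code. It assumes a Fabric-style runner that accepts warn=True and returns an object with stdout, stderr, and exited. The names run_with_diagnostics, CommandResult, and DIAGNOSTIC_COMMANDS are hypothetical, and the two diagnostic commands (including the /usr/local/cuda/compat/ path) are assumptions, since the PR description does not list the exact commands.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CommandResult:
    """Hypothetical stand-in for the result a Fabric-style runner returns."""
    stdout: str
    stderr: str
    exited: int


# Assumed commands; the PR does not show the exact ones it runs.
DIAGNOSTIC_COMMANDS: List[str] = [
    "nvidia-smi --query-gpu=driver_version,name --format=csv,noheader",
    "ls /usr/local/cuda/compat/",
]


def run_with_diagnostics(
    run: Callable[..., CommandResult], test_command: str
) -> CommandResult:
    # 1. Pre-test diagnostics. warn=True keeps a failing probe from
    #    aborting the run before the real test has started.
    for cmd in DIAGNOSTIC_COMMANDS:
        probe = run(cmd, warn=True)
        output = probe.stdout.strip() or probe.stderr.strip()
        print(f"[diag] $ {cmd}\n{output}")

    # 2. Failure capture. With warn=True the runner returns the result
    #    instead of raising, so the output can be printed before failing.
    result = run(test_command, warn=True)
    if result.exited != 0:
        print(f"[fail] exit code: {result.exited}")
        print(f"[fail] stdout:\n{result.stdout}")
        print(f"[fail] stderr:\n{result.stderr}")
        raise AssertionError(
            f"command failed with exit code {result.exited}: {test_command}"
        )
    return result
```

Passing the runner in as a callable keeps the sketch independent of how the test actually reaches the container; in the real test that role would be played by whatever helper wraps the container command.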

Context

The EFA test had been failing for ~2 weeks because a DLAMI embargo driver (580.150) was incompatible with CUDA 13.0.2 on A100. The failure message ("system not yet initialized") gave no indication of the driver mismatch. These diagnostics would have identified the root cause immediately by showing:

580.150, NVIDIA A100-SXM4-40GB    ← host driver
libcuda.so.580.159.04              ← cuda-compat (mismatch!)

This is now resolved: the DLAMI team has published the post-embargo AMI.
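
The side-by-side comparison in the Context output could also be done in code. The sketch below is an illustration only and is not part of this PR: extract_version and compare_driver_versions are hypothetical helpers, and the rule "report any difference" is a simplification. A difference between the host driver and cuda-compat versions is only a hint for whoever reads the log, not a compatibility verdict.

```python
import re
from typing import Optional, Tuple

# Matches dotted version numbers such as 580.150 or 580.159.04.
VERSION_RE = re.compile(r"\d+(?:\.\d+)+")


def extract_version(text: str) -> Optional[str]:
    """Return the first dotted version number found in text, if any."""
    match = VERSION_RE.search(text)
    return match.group(0) if match else None


def as_tuple(version: str) -> Tuple[int, ...]:
    """Convert '580.159.04' to (580, 159, 4) for numeric comparison."""
    return tuple(int(part) for part in version.split("."))


def compare_driver_versions(host_line: str, compat_line: str) -> str:
    """Summarize whether the host driver and cuda-compat versions differ."""
    host = extract_version(host_line)
    compat = extract_version(compat_line)
    if host is None or compat is None:
        return "unknown: could not parse a version from one of the lines"
    if as_tuple(host) == as_tuple(compat):
        return f"match: {host}"
    return f"differ: host driver {host}, cuda-compat {compat}"


if __name__ == "__main__":
    # The two lines quoted in the Context section.
    print(compare_driver_versions(
        "580.150, NVIDIA A100-SXM4-40GB",
        "libcuda.so.580.159.04",
    ))
```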

Test plan

  • EFA test passes with new DLAMI (post-embargo driver)
  • Diagnostics print driver info in CI logs

Eren-Jeager123 changed the title from "Trigger EFA Test" to "fix: add cuda-compat to LD_LIBRARY_PATH for driver forward compatibility" on May 20, 2026
Eren-Jeager123 changed the title from "fix: add cuda-compat to LD_LIBRARY_PATH for driver forward compatibility" to "fix: remove cuda-compat upgrade to prevent driver version mismatch" on May 20, 2026
Eren-Jeager123 changed the title from "fix: remove cuda-compat upgrade to prevent driver version mismatch" to "Fix EFA Tests" on May 20, 2026
Eren-Jeager123 changed the title from "Fix EFA Tests" to "debug: add diagnostic logging to EFA NCCL test" on May 20, 2026
Eren-Jeager123 force-pushed the debug-efa branch 2 times, most recently from ddb9389 to eb65273 on May 21, 2026 18:14
- Print host driver version and cuda-compat version before test
  (catches driver mismatch issues like the embargo incident)
- Capture NCCL allreduce stdout/stderr/log on failure instead of
  losing output through fabric's exception handling
Eren-Jeager123 changed the title from "debug: add diagnostic logging to EFA NCCL test" to "feat: add diagnostic logging to EFA NCCL test" on May 21, 2026
Eren-Jeager123 merged commit 956819d into main on May 21, 2026
14 checks passed
Eren-Jeager123 deleted the debug-efa branch on May 21, 2026 19:50
Yadan-Wei pushed a commit that referenced this pull request May 22, 2026
The merge of main (PR #6114, 956819d) added warn=True to a
run_on_container() call. Our verbose-_step refactor (9522545) wraps
that same call inside a local _step() helper which doesn't accept warn,
so the merge produced _step(..., warn=True) — TypeError at runtime.

_step already prints stdout/stderr/exit on success and pytest captures
the UnexpectedExit on failure, so warn=True is no longer needed (it was
only useful when the broken DLAMI was masking real errors).

Signed-off-by: Yadan Wei <yadanwei@amazon.com>