
fix(op_builder): avoid duplicate/wrong -gencode flags #7974

Merged
delock merged 4 commits into deepspeedai:master from Cursx:fix/duplicate-gencode-flags
Apr 15, 2026

Conversation

@Cursx (Contributor) commented Apr 14, 2026

Summary

  • Fix duplicate/wrong -gencode= flags in both JIT and non-JIT compilation paths (op_builder/builder.py)
  • Fix TORCH_CUDA_ARCH_LIST env-var restore logic in OpBuilder.jit_load()

DeepSpeed's compute_capability_args() generates its own -gencode flags, but PyTorch (load() in JIT mode, BuildExtension in non-JIT mode) also reads TORCH_CUDA_ARCH_LIST and generates -gencode flags. This causes two problems:

  1. JIT mode: jit_load() set TORCH_CUDA_ARCH_LIST="", which PyTorch treats as unset and falls back to auto-detection — resulting in every flag appearing twice.
  2. Non-JIT mode: subclasses that override filter_ccs() (e.g. FPQuantizerBuilder, EvoformerAttnBuilder) remove certain archs, but BuildExtension re-reads the unfiltered TORCH_CUDA_ARCH_LIST and adds them back — undermining the filter.

The fix synchronises TORCH_CUDA_ARCH_LIST with the filtered arch list in compute_capability_args(), for both JIT and non-JIT paths.
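The sync described above can be sketched roughly as follows. This is an illustrative sketch based on the PR description, not the actual `op_builder/builder.py` code; the function name and argument names are placeholders:

```python
import os

def sync_arch_list(filtered_ccs, jit_mode):
    """Hypothetical sketch of the env-var sync; not DeepSpeed's code.
    filtered_ccs is the output of filter_ccs(), e.g. ['8.0', '9.0']."""
    # Keep PyTorch's own -gencode generation in agreement with the
    # filtered arch list, for both JIT and non-JIT builds.
    os.environ["TORCH_CUDA_ARCH_LIST"] = ";".join(filtered_ccs)
    if jit_mode:
        # JIT: return no flags; torch.utils.cpp_extension.load() will
        # generate -gencode flags from the env var exactly once.
        return []
    # Non-JIT: emit per-builder flags for extra_compile_args.
    args = []
    for cc in filtered_ccs:
        num = cc.replace(".", "")
        args.append(f"-gencode=arch=compute_{num},code=sm_{num}")
    return args
```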

Fixes #7972

Before / After

Before (buggy behavior)

JIT mode: TORCH_CUDA_ARCH_LIST is cleared to "", so PyTorch auto-detects and adds flags while DeepSpeed also adds the same flags:

nvcc ... -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80
     ... -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80

Plus a spurious warning:

UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.

Non-JIT mode: FPQuantizerBuilder.filter_ccs() removes archs < 8.0, but BuildExtension re-adds them from the unfiltered env var:

# FPQuantizer compiled for sm_70 even though filter_ccs() removed it
nvcc ... -gencode=arch=compute_80,code=sm_80   # from DeepSpeed (correct)
     ... -gencode=arch=compute_70,code=sm_70   # from BuildExtension (wrong!)
After (fixed behavior)

JIT mode: TORCH_CUDA_ARCH_LIST is set to the detected architectures; PyTorch generates flags once, with no duplicates:

nvcc ... -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80

No spurious warning. Env var is properly restored/removed after build.

Non-JIT mode: TORCH_CUDA_ARCH_LIST is updated to the filtered list. Each extension keeps its own -gencode flags, and BuildExtension reads the filtered env var:

# FPQuantizer: only sm_80+ as intended
nvcc ... -gencode=arch=compute_80,code=sm_80   # from DeepSpeed
     ... -gencode=arch=compute_80,code=sm_80   # from BuildExtension (harmless dup)

Note: in multi-builder setup.py builds, the last builder's filtered arch list wins for TORCH_CUDA_ARCH_LIST. This may cause harmless duplicates for some extensions, but will never reintroduce archs that any builder's filter_ccs() removed — a strict improvement over the current behavior where the unfiltered original is always used.

Changes

  • op_builder/builder.py
    • CUDAOpBuilder.compute_capability_args():
      • Always sync TORCH_CUDA_ARCH_LIST with the filtered arch list
      • JIT mode: return [] (PyTorch generates flags via load())
      • Non-JIT mode: return -gencode args as before (per-builder flags in extra_compile_args)
    • OpBuilder.jit_load(): simplified stash/restore — properly del the env var if it was not originally set
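The stash/restore pattern for the env var, as described in the last bullet, can be sketched like this (a hedged illustration of the described behavior; the helper name and structure are hypothetical, not DeepSpeed's actual jit_load() code):

```python
import os

def jit_load_env_scope(arch_list):
    """Illustrative stash/restore of TORCH_CUDA_ARCH_LIST around a JIT
    build; not the actual OpBuilder.jit_load() implementation."""
    key = "TORCH_CUDA_ARCH_LIST"
    saved = os.environ.get(key)  # None if the var was not originally set
    os.environ[key] = arch_list
    try:
        pass  # ... build the extension here ...
    finally:
        if saved is None:
            # Was not set before the build: remove it entirely instead
            # of leaving an empty string behind.
            os.environ.pop(key, None)
        else:
            os.environ[key] = saved
```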

…dai#7972)

In JIT mode, compute_capability_args() now sets TORCH_CUDA_ARCH_LIST to
the detected GPU architectures and returns an empty list, letting PyTorch
generate -gencode flags.  Previously the env var was cleared to an empty
string (which PyTorch treats as unset, triggering auto-detection) while
DeepSpeed also added its own -gencode flags, resulting in duplicates.

The jit_load() restore logic is also improved: if TORCH_CUDA_ARCH_LIST
was not originally set, it is now removed from os.environ after build
instead of being left as an empty string.

Fixes deepspeedai#7972

Signed-off-by: Cursx <674760201@qq.com>
@Cursx Cursx requested review from loadams and tjruwase as code owners April 14, 2026 15:42

@Flamefire (Contributor)

Shouldn't this always be done, not only for JIT mode? Otherwise you set TORCH_CUDA_ARCH_LIST, the builder filters it but leaves the env var set AND adds flags, so you build for the wrong archs in addition to the desired ones. Or am I missing something?

@Cursx (Contributor, Author) commented Apr 14, 2026

@Flamefire
You're right, the non-JIT path should also be modified. The multi-builder scenario seems quite complex, though, and trying to solve it fully might expand the scope of the changes.

I'm looking into a proper fix for this as a follow-up.

Cursx added 2 commits April 15, 2026 07:59
…-JIT mode

Extend the fix to non-JIT (setup.py) mode: compute_capability_args() now
updates TORCH_CUDA_ARCH_LIST to the filtered arch list from filter_ccs()
for both JIT and non-JIT paths.  Each CUDAExtension still carries its own
-gencode flags in extra_compile_args, but BuildExtension will no longer
silently re-introduce archs that filter_ccs() removed.

Signed-off-by: Cursx <674760201@qq.com>
Signed-off-by: Cursx <674760201@qq.com>
@Cursx Cursx changed the title fix(op_builder): avoid duplicate -gencode flags in JIT mode fix(op_builder): avoid duplicate/wrong -gencode flags Apr 15, 2026
@delock (Collaborator) left a comment

LGTM

@delock delock enabled auto-merge (squash) April 15, 2026 06:28
@delock delock merged commit 893c6d2 into deepspeedai:master Apr 15, 2026
1 check passed
Format:

- `TORCH_CUDA_ARCH_LIST` may use ; or whitespace separators. Examples:
- ``TORCH_CUDA_ARCH_LIST`` may use ; or whitespace separators. Examples:
Contributor

Suggested change
- ``TORCH_CUDA_ARCH_LIST`` may use ; or whitespace separators. Examples:
- `TORCH_CUDA_ARCH_LIST` may use ; or whitespace separators. Examples:
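As an aside, the separator rule documented in that docstring line can be illustrated with a small parser sketch (a hypothetical helper matching the documented behavior, not code from this PR):

```python
import re

def parse_arch_list(value):
    """Split a TORCH_CUDA_ARCH_LIST value on ';' or whitespace; a
    trailing '+PTX' marks an entry that also keeps virtual PTX code."""
    entries = [e for e in re.split(r"[;\s]+", value.strip()) if e]
    parsed = []
    for e in entries:
        ptx = e.endswith("+PTX")
        parsed.append((e[:-4] if ptx else e, ptx))
    return parsed
```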

Comment on lines 665 to +668
self.enable_bf16 = True
for cc in ccs:
    if int(cc[0]) <= 7:
        self.enable_bf16 = False
Contributor

How about using any?

Suggested change
self.enable_bf16 = True
for cc in ccs:
    if int(cc[0]) <= 7:
        self.enable_bf16 = False
self.enable_bf16 = not any(int(cc[0]) <= 7 for cc in ccs)

Contributor (Author)

Yes, it is better this way.

TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0 7.5 8.0 8.6 9.0 10.0+PTX" pip install ...

- `cross_compile_archs` uses ; separator.
- ``cross_compile_archs`` uses ; separator.
Contributor

Suggested change
- ``cross_compile_archs`` uses ; separator.
- `cross_compile_archs` uses ; separator.

# Let PyTorch generate -gencode flags from the env var.
return []

# Non-JIT: return explicit flags per builder for extra_compile_args.
Contributor

Won't this cause duplicate, possibly wrong flags again? If TORCH_CUDA_ARCH_LIST is used for all extensions and one extension allows a CC that another doesn't, you'll still get it for both.

I guess for non-JIT mode, when that happens, you'd want to set TORCH_CUDA_ARCH_LIST to a single CC from the intersection of the CCs allowed by all extensions: at least then you don't add a wrong one, even if you can't avoid the duplication.

Or if more high level is possible:

  • Determine CCs to add (from GPU arch and/or TORCH_CUDA_ARCH_LIST)
  • Filter them per extension
  • If the intersection equals the full list, set TORCH_CUDA_ARCH_LIST to all of them; else set it to one of them, or error if the intersection is empty
  • add extension specific flags for each non-filtered arch of that extension that is not in TORCH_CUDA_ARCH_LIST, this might be none
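The intersection-based plan above can be sketched as follows (a hypothetical helper expressing the reviewer's proposal, not code from this PR or DeepSpeed):

```python
def plan_arch_env(all_ccs, per_builder_filtered):
    """Pick a TORCH_CUDA_ARCH_LIST value safe for every builder, plus
    per-builder extra archs. per_builder_filtered maps builder name to
    the arch list its filter_ccs() kept."""
    inter = set(all_ccs)
    for ccs in per_builder_filtered.values():
        inter &= set(ccs)
    if inter == set(all_ccs):
        env_ccs = sorted(all_ccs)     # every builder accepts all archs
    elif inter:
        env_ccs = [sorted(inter)[0]]  # fall back to one common arch
    else:
        raise ValueError("no compute capability accepted by every builder")
    # Extension-specific flags cover archs a builder kept but the shared
    # env var does not list.
    extra = {name: sorted(set(ccs) - set(env_ccs))
             for name, ccs in per_builder_filtered.items()}
    return env_ccs, extra
```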

Contributor (Author)

I totally agree with your intersection-based approach, but implementing it would likely require changes to setup.py (splitting the builder loop into two passes), and I'm concerned the scope of that refactor could be too large for this PR.



Development

Successfully merging this pull request may close these issues.

[BUG] Duplicate/Wrong(?) Compute capability flags added

3 participants