
fix(op_builder): avoid duplicate/wrong -gencode flags #7974

Merged
delock merged 4 commits into deepspeedai:master from Cursx:fix/duplicate-gencode-flags
Apr 15, 2026

Conversation

@Cursx (Contributor) commented Apr 14, 2026

Summary

  • Fix duplicate/wrong -gencode= flags in both JIT and non-JIT compilation paths (op_builder/builder.py)
  • Fix TORCH_CUDA_ARCH_LIST env-var restore logic in OpBuilder.jit_load()

DeepSpeed's compute_capability_args() generates its own -gencode flags, but PyTorch (load() in JIT mode, BuildExtension in non-JIT mode) also reads TORCH_CUDA_ARCH_LIST and generates -gencode flags. This causes two problems:

  1. JIT mode: jit_load() set TORCH_CUDA_ARCH_LIST="", which PyTorch treats as unset and falls back to auto-detection — resulting in every flag appearing twice.
  2. Non-JIT mode: subclasses that override filter_ccs() (e.g. FPQuantizerBuilder, EvoformerAttnBuilder) remove certain archs, but BuildExtension re-reads the unfiltered TORCH_CUDA_ARCH_LIST and adds them back — undermining the filter.

The fix synchronises TORCH_CUDA_ARCH_LIST with the filtered arch list in compute_capability_args(), for both JIT and non-JIT paths.
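The sync described above can be sketched roughly as follows. This is an illustrative sketch based on the PR description, not the actual `op_builder/builder.py` code; the function name and argument names are placeholders:

```python
import os

def sync_arch_list(filtered_ccs, jit_mode):
    """Hypothetical sketch of the env-var sync; not DeepSpeed's code.
    filtered_ccs is the output of filter_ccs(), e.g. ['8.0', '9.0']."""
    # Keep PyTorch's own -gencode generation in agreement with the
    # filtered arch list, for both JIT and non-JIT builds.
    os.environ["TORCH_CUDA_ARCH_LIST"] = ";".join(filtered_ccs)
    if jit_mode:
        # JIT: return no flags; torch.utils.cpp_extension.load() will
        # generate -gencode flags from the env var exactly once.
        return []
    # Non-JIT: emit per-builder flags for extra_compile_args.
    args = []
    for cc in filtered_ccs:
        num = cc.replace(".", "")
        args.append(f"-gencode=arch=compute_{num},code=sm_{num}")
    return args
```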

Fixes #7972

Before / After

Before (buggy behavior)

JIT mode: TORCH_CUDA_ARCH_LIST is cleared to "", so PyTorch auto-detects and adds flags while DeepSpeed also adds the same flags:

nvcc ... -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80
     ... -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80

Plus a spurious warning:

UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.

Non-JIT mode: FPQuantizerBuilder.filter_ccs() removes archs < 8.0, but BuildExtension re-adds them from the unfiltered env var:

# FPQuantizer compiled for sm_70 even though filter_ccs() removed it
nvcc ... -gencode=arch=compute_80,code=sm_80   # from DeepSpeed (correct)
     ... -gencode=arch=compute_70,code=sm_70   # from BuildExtension (wrong!)
After (fixed behavior)

JIT mode: TORCH_CUDA_ARCH_LIST is set to the detected architectures; PyTorch generates flags once, with no duplicates:

nvcc ... -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80

No spurious warning. Env var is properly restored/removed after build.

Non-JIT mode: TORCH_CUDA_ARCH_LIST is updated to the filtered list. Each extension keeps its own -gencode flags, and BuildExtension reads the filtered env var:

# FPQuantizer: only sm_80+ as intended
nvcc ... -gencode=arch=compute_80,code=sm_80   # from DeepSpeed
     ... -gencode=arch=compute_80,code=sm_80   # from BuildExtension (harmless dup)

Note: in multi-builder setup.py builds, the last builder's filtered arch list wins for TORCH_CUDA_ARCH_LIST. This may cause harmless duplicates for some extensions, but will never reintroduce archs that any builder's filter_ccs() removed — a strict improvement over the current behavior where the unfiltered original is always used.

Changes

  • op_builder/builder.py
    • CUDAOpBuilder.compute_capability_args():
      • Always sync TORCH_CUDA_ARCH_LIST with the filtered arch list
      • JIT mode: return [] (PyTorch generates flags via load())
      • Non-JIT mode: return -gencode args as before (per-builder flags in extra_compile_args)
    • OpBuilder.jit_load(): simplified stash/restore — properly del the env var if it was not originally set
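The stash/restore pattern for the env var, as described in the last bullet, can be sketched like this (a hedged illustration of the described behavior; the helper name and structure are hypothetical, not DeepSpeed's actual jit_load() code):

```python
import os

def jit_load_env_scope(arch_list):
    """Illustrative stash/restore of TORCH_CUDA_ARCH_LIST around a JIT
    build; not the actual OpBuilder.jit_load() implementation."""
    key = "TORCH_CUDA_ARCH_LIST"
    saved = os.environ.get(key)  # None if the var was not originally set
    os.environ[key] = arch_list
    try:
        pass  # ... build the extension here ...
    finally:
        if saved is None:
            # Was not set before the build: remove it entirely instead
            # of leaving an empty string behind.
            os.environ.pop(key, None)
        else:
            os.environ[key] = saved
```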

…dai#7972)

In JIT mode, compute_capability_args() now sets TORCH_CUDA_ARCH_LIST to
the detected GPU architectures and returns an empty list, letting PyTorch
generate -gencode flags.  Previously the env var was cleared to an empty
string (which PyTorch treats as unset, triggering auto-detection) while
DeepSpeed also added its own -gencode flags, resulting in duplicates.

The jit_load() restore logic is also improved: if TORCH_CUDA_ARCH_LIST
was not originally set, it is now removed from os.environ after build
instead of being left as an empty string.

Fixes deepspeedai#7972

Signed-off-by: Cursx <674760201@qq.com>
@Cursx Cursx requested review from loadams and tjruwase as code owners April 14, 2026 15:42

@Flamefire (Contributor)

Shouldn't this always be done, not only for JIT mode? Otherwise you set TORCH_CUDA_ARCH_LIST, the builder filters it but leaves the env var set AND adds flags, so you build for the wrong archs in addition to the desired ones. Or am I missing something?

@Cursx (Contributor, Author) commented Apr 14, 2026

@Flamefire
You're right, the non-JIT path should also be modified. The multi-builder scenario seems quite complex, though, and trying to solve it fully might expand the scope of the changes.

I'm looking into a proper fix for this as a follow-up.

Cursx added 2 commits April 15, 2026 07:59
…-JIT mode

Extend the fix to non-JIT (setup.py) mode: compute_capability_args() now
updates TORCH_CUDA_ARCH_LIST to the filtered arch list from filter_ccs()
for both JIT and non-JIT paths.  Each CUDAExtension still carries its own
-gencode flags in extra_compile_args, but BuildExtension will no longer
silently re-introduce archs that filter_ccs() removed.

Signed-off-by: Cursx <674760201@qq.com>
Signed-off-by: Cursx <674760201@qq.com>
@Cursx Cursx changed the title fix(op_builder): avoid duplicate -gencode flags in JIT mode fix(op_builder): avoid duplicate/wrong -gencode flags Apr 15, 2026
@delock (Collaborator) left a comment

LGTM

@delock delock enabled auto-merge (squash) April 15, 2026 06:28
@delock delock merged commit 893c6d2 into deepspeedai:master Apr 15, 2026
1 check passed
Format:

- `TORCH_CUDA_ARCH_LIST` may use ; or whitespace separators. Examples:
- ``TORCH_CUDA_ARCH_LIST`` may use ; or whitespace separators. Examples:
Contributor

Suggested change
- ``TORCH_CUDA_ARCH_LIST`` may use ; or whitespace separators. Examples:
- `TORCH_CUDA_ARCH_LIST` may use ; or whitespace separators. Examples:
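As an aside, the separator rule documented in that docstring line can be illustrated with a small parser sketch (a hypothetical helper matching the documented behavior, not code from this PR):

```python
import re

def parse_arch_list(value):
    """Split a TORCH_CUDA_ARCH_LIST value on ';' or whitespace; a
    trailing '+PTX' marks an entry that also keeps virtual PTX code."""
    entries = [e for e in re.split(r"[;\s]+", value.strip()) if e]
    parsed = []
    for e in entries:
        ptx = e.endswith("+PTX")
        parsed.append((e[:-4] if ptx else e, ptx))
    return parsed
```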

Comment on lines 665 to +668
self.enable_bf16 = True
for cc in ccs:
    if int(cc[0]) <= 7:
        self.enable_bf16 = False
Contributor

How about using any?

Suggested change
self.enable_bf16 = True
for cc in ccs:
    if int(cc[0]) <= 7:
        self.enable_bf16 = False
self.enable_bf16 = not any(int(cc[0]) <= 7 for cc in ccs)

Contributor (Author)

Yes, it is better this way.

TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0 7.5 8.0 8.6 9.0 10.0+PTX" pip install ...

- `cross_compile_archs` uses ; separator.
- ``cross_compile_archs`` uses ; separator.
Contributor

Suggested change
- ``cross_compile_archs`` uses ; separator.
- `cross_compile_archs` uses ; separator.

# Let PyTorch generate -gencode flags from the env var.
return []

# Non-JIT: return explicit flags per builder for extra_compile_args.
Contributor

Won't this cause duplicate, possibly wrong flags again? If TORCH_CUDA_ARCH_LIST is used for all extensions and one extension allows a CC that another doesn't, you'll still get it for both.

I guess for non-JIT mode, when that happens, you'd want to set TORCH_CUDA_ARCH_LIST to a single CC from the intersection of the CCs allowed by all extensions: at least then you don't add a wrong one, even if you can't avoid the duplication.

Or if more high level is possible:

  • Determine CCs to add (from GPU arch and/or TORCH_CUDA_ARCH_LIST)
  • Filter them per extension
  • If the intersection equals the full list, set TORCH_CUDA_ARCH_LIST to all of them; else set it to one of them, or error if the intersection is empty
  • add extension specific flags for each non-filtered arch of that extension that is not in TORCH_CUDA_ARCH_LIST, this might be none
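The intersection-based plan above can be sketched as follows (a hypothetical helper expressing the reviewer's proposal, not code from this PR or DeepSpeed):

```python
def plan_arch_env(all_ccs, per_builder_filtered):
    """Pick a TORCH_CUDA_ARCH_LIST value safe for every builder, plus
    per-builder extra archs. per_builder_filtered maps builder name to
    the arch list its filter_ccs() kept."""
    inter = set(all_ccs)
    for ccs in per_builder_filtered.values():
        inter &= set(ccs)
    if inter == set(all_ccs):
        env_ccs = sorted(all_ccs)     # every builder accepts all archs
    elif inter:
        env_ccs = [sorted(inter)[0]]  # fall back to one common arch
    else:
        raise ValueError("no compute capability accepted by every builder")
    # Extension-specific flags cover archs a builder kept but the shared
    # env var does not list.
    extra = {name: sorted(set(ccs) - set(env_ccs))
             for name, ccs in per_builder_filtered.items()}
    return env_ccs, extra
```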

Contributor (Author)

I totally agree with your intersection-based approach, but implementing it would likely require changes to setup.py (splitting the builder loop into two passes), and I'm concerned the scope of that refactor could be too large for this PR.



Development

Successfully merging this pull request may close these issues.

[BUG] Duplicate/Wrong(?) Compute capability flags added

3 participants