fix(op_builder): avoid duplicate/wrong -gencode flags #7974

delock merged 4 commits into deepspeedai:master
Conversation
(deepspeedai#7972) In JIT mode, compute_capability_args() now sets TORCH_CUDA_ARCH_LIST to the detected GPU architectures and returns an empty list, letting PyTorch generate the -gencode flags. Previously the env var was cleared to an empty string (which PyTorch treats as unset, triggering auto-detection) while DeepSpeed also added its own -gencode flags, resulting in duplicates. The jit_load() restore logic is also improved: if TORCH_CUDA_ARCH_LIST was not originally set, it is now removed from os.environ after the build instead of being left as an empty string. Fixes deepspeedai#7972

Signed-off-by: Cursx <674760201@qq.com>
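The JIT-mode behavior described in this commit can be sketched as follows (a minimal sketch with a hypothetical helper name, not the actual DeepSpeed code):

```python
import os

def compute_capability_args_jit(detected_ccs):
    """Hypothetical sketch of the JIT path: point PyTorch at the detected
    GPU architectures and emit no -gencode flags of our own."""
    # e.g. detected_ccs = ["8.0", "9.0"]
    os.environ["TORCH_CUDA_ARCH_LIST"] = ";".join(detected_ccs)
    # Returning an empty list lets torch.utils.cpp_extension.load()
    # generate the -gencode flags exactly once from the env var.
    return []
```

Setting the env var to a concrete list (rather than `""`) is the key point: PyTorch treats an empty string as unset and would auto-detect, which is what caused the duplication.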
Shouldn't this always be done, not only for JIT mode? Otherwise you set TORCH_CUDA_ARCH_LIST,
@Flamefire I'm looking into a proper fix for this as a follow-up. |
Extend the fix to non-JIT (setup.py) mode: compute_capability_args() now updates TORCH_CUDA_ARCH_LIST to the filtered arch list from filter_ccs() for both JIT and non-JIT paths. Each CUDAExtension still carries its own -gencode flags in extra_compile_args, but BuildExtension will no longer silently re-introduce archs that filter_ccs() removed.

Signed-off-by: Cursx <674760201@qq.com>
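The non-JIT path described in this commit can be sketched similarly (a hypothetical helper with illustrative flag formatting, assuming `filter_ccs` is passed in as a callable; not the actual DeepSpeed code):

```python
import os

def compute_capability_args_nonjit(ccs, filter_ccs):
    """Hypothetical sketch of the non-JIT path: filter the arch list per
    builder, sync TORCH_CUDA_ARCH_LIST to the filtered result, and still
    return per-builder -gencode flags for extra_compile_args."""
    filtered = filter_ccs(ccs)  # e.g. drop archs this kernel can't build for
    # BuildExtension re-reads this env var, so it must match the filter.
    os.environ["TORCH_CUDA_ARCH_LIST"] = ";".join(filtered)
    args = []
    for cc in filtered:
        num = cc.replace(".", "")
        args.append(f"-gencode=arch=compute_{num},code=sm_{num}")
    return args
```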
Diff context (docstring):

```diff
 Format:

-- `TORCH_CUDA_ARCH_LIST` may use ; or whitespace separators. Examples:
+- ``TORCH_CUDA_ARCH_LIST`` may use ; or whitespace separators. Examples:
```

Suggested change:

```diff
-- ``TORCH_CUDA_ARCH_LIST`` may use ; or whitespace separators. Examples:
+- `TORCH_CUDA_ARCH_LIST` may use ; or whitespace separators. Examples:
```
```python
self.enable_bf16 = True
for cc in ccs:
    if int(cc[0]) <= 7:
        self.enable_bf16 = False
```

How about using `any`?

Suggested change:

```python
self.enable_bf16 = not any(int(cc[0]) <= 7 for cc in ccs)
```

Yes, it is better this way.
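The equivalence of the two forms can be checked side by side (a standalone sketch using `cc` strings like those the builder handles; not the actual DeepSpeed code):

```python
def enable_bf16_loop(ccs):
    # original form: scan every compute capability in a loop
    enable = True
    for cc in ccs:
        if int(cc[0]) <= 7:
            enable = False
    return enable

def enable_bf16_any(ccs):
    # suggested form: bf16 stays enabled only if no arch has
    # a leading major-version digit of 7 or lower
    return not any(int(cc[0]) <= 7 for cc in ccs)
```

Both return `True` for an empty list and agree on mixed lists, so the rewrite is a pure readability improvement.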
Diff context (docstring):

```diff
 TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0 7.5 8.0 8.6 9.0 10.0+PTX" pip install ...

-- `cross_compile_archs` uses ; separator.
+- ``cross_compile_archs`` uses ; separator.
```

Suggested change:

```diff
-- ``cross_compile_archs`` uses ; separator.
+- `cross_compile_archs` uses ; separator.
```
```python
# Let PyTorch generate -gencode flags from the env var.
return []

# Non-JIT: return explicit flags per builder for extra_compile_args.
```
Won't this cause duplicate, possibly wrong flags again? If TORCH_CUDA_ARCH_LIST is used for all extensions and one allows a CC that another doesn't, you'll still get it for both.
I guess for non-JIT mode you'd want to set TORCH_CUDA_ARCH_LIST to a single CC out of the intersection of the CCs allowed by all extensions, so that you at least don't add a wrong arch even if you can't avoid the duplication.
Or if more high level is possible:
- Determine CCs to add (from GPU arch and/or TORCH_CUDA_ARCH_LIST)
- Filter them per extension
- if the intersection equals the full list, set TORCH_CUDA_ARCH_LIST to all of them; otherwise set it to one of them, or error if the intersection is empty
- add extension-specific flags for each non-filtered arch of that extension that is not in TORCH_CUDA_ARCH_LIST (this might be none)
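The steps above can be sketched as a small planning function (hypothetical names and return shape, not part of this PR; `ext_filters` stands in for each builder's `filter_ccs`):

```python
def plan_arch_flags(candidate_ccs, ext_filters):
    """Hypothetical sketch of the intersection-based plan.

    ext_filters maps extension name -> function returning the CCs
    that extension can actually build for."""
    per_ext = {name: set(f(candidate_ccs)) for name, f in ext_filters.items()}
    common = set.intersection(*per_ext.values()) if per_ext else set()
    if common == set(candidate_ccs):
        # every extension accepts every CC: put them all in the env var
        env_ccs = sorted(candidate_ccs)
    elif common:
        # fall back to a single CC that all extensions agree on
        env_ccs = [sorted(common)[0]]
    else:
        raise RuntimeError("no compute capability is accepted by all extensions")
    # per-extension -gencode flags for archs not covered by the env var
    extra = {name: sorted(ccs - set(env_ccs)) for name, ccs in per_ext.items()}
    return env_ccs, extra
```

Here the env var only ever contains archs every extension tolerates, and each extension picks up its remaining archs via its own flags.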
I totally agree with your intersection-based approach, but implementing it would likely require changes to setup.py (splitting the builder loop into two passes), and I'm concerned the scope of that refactor could be too large for this PR.
Summary

- Fixes duplicate/wrong `-gencode` flags in both JIT and non-JIT compilation paths (`op_builder/builder.py`)
- Fixes the `TORCH_CUDA_ARCH_LIST` env-var restore logic in `OpBuilder.jit_load()`

DeepSpeed's `compute_capability_args()` generates its own `-gencode` flags, but PyTorch (`load()` in JIT mode, `BuildExtension` in non-JIT mode) also reads `TORCH_CUDA_ARCH_LIST` and generates `-gencode` flags. This causes two problems:

- `jit_load()` set `TORCH_CUDA_ARCH_LIST=""`, which PyTorch treats as unset and falls back to auto-detection, resulting in every flag appearing twice.
- Builders with `filter_ccs()` (e.g. `FPQuantizerBuilder`, `EvoformerAttnBuilder`) remove certain archs, but `BuildExtension` re-reads the unfiltered `TORCH_CUDA_ARCH_LIST` and adds them back, undermining the filter.

The fix synchronises `TORCH_CUDA_ARCH_LIST` with the filtered arch list in `compute_capability_args()`, for both JIT and non-JIT paths.

Fixes #7972

Before / After

Before (buggy behavior)

- JIT mode: `TORCH_CUDA_ARCH_LIST` is cleared to `""`, PyTorch auto-detects and adds flags, and DeepSpeed also adds the same flags, so every flag appears twice and a spurious warning is emitted.
- Non-JIT mode: `FPQuantizerBuilder.filter_ccs()` removes archs `< 8.0`, but `BuildExtension` re-adds them from the unfiltered env var.

After (fixed behavior)

- JIT mode: `TORCH_CUDA_ARCH_LIST` is set to the detected architectures, PyTorch generates the flags once, with no duplicates and no spurious warning. The env var is properly restored/removed after the build.
- Non-JIT mode: `TORCH_CUDA_ARCH_LIST` is updated to the filtered list. Each extension keeps its own `-gencode` flags, and `BuildExtension` reads the filtered env var.

Changes

`op_builder/builder.py`:

- `CUDAOpBuilder.compute_capability_args()`: syncs `TORCH_CUDA_ARCH_LIST` with the filtered arch list
  - JIT mode: returns `[]` (PyTorch generates the flags via `load()`)
  - Non-JIT mode: returns `-gencode` args as before (per-builder flags in `extra_compile_args`)
- `OpBuilder.jit_load()`: simplified stash/restore; properly `del` the env var if it was not originally set
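The improved stash/restore in `jit_load()` could be sketched as a context manager (a hypothetical wrapper illustrating the "delete if originally unset" rule described above, not the actual DeepSpeed code):

```python
import os
from contextlib import contextmanager

@contextmanager
def arch_list_override(value):
    """Hypothetical sketch: temporarily set TORCH_CUDA_ARCH_LIST for a
    build, then restore the previous state exactly."""
    key = "TORCH_CUDA_ARCH_LIST"
    saved = os.environ.get(key)  # None means the var was not set before
    os.environ[key] = value
    try:
        yield
    finally:
        if saved is None:
            # previously unset: delete it rather than leaving "" behind,
            # which PyTorch would misread as "auto-detect"
            os.environ.pop(key, None)
        else:
            os.environ[key] = saved
```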