[NVIDIA] Add Orin, GB300, Spark Support #1781
johnnynunez wants to merge 1 commit into bitsandbytes-foundation:main
Conversation
  # CUDA 12.8+: Add sm100 and sm120; remove < sm70 to align with PyTorch 2.8+cu128 minimum
- [[ "${cuda_version}" == 12.8.* || "${cuda_version}" == 12.9.* ]] && build_capability="70;75;80;86;89;90;100;120"
+ [[ "${cuda_version}" == 12.8.* || "${cuda_version}" == 12.9.* ]] && build_capability="70;75;80;87;86;89;90;100;120;121"
My understanding is that Orin (sm87) and Spark (GB10, sm121) are only available on aarch64 platforms, so we shouldn't need to do this for x86-64.
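A minimal sketch of that suggestion, assuming build-cuda.sh can branch on the build machine's architecture (the `uname -m` check is illustrative, and the capability lists are simply the ones from this diff, not a recommendation):

```bash
# Illustrative only: add the aarch64-exclusive targets (sm87 Orin, sm121 GB10/Spark)
# when building aarch64 wheels, and leave the x86-64 list unchanged.
if [[ "$(uname -m)" == "aarch64" ]]; then
    [[ "${cuda_version}" == 12.8.* || "${cuda_version}" == 12.9.* ]] && build_capability="70;75;80;87;86;89;90;100;120;121"
else
    [[ "${cuda_version}" == 12.8.* || "${cuda_version}" == 12.9.* ]] && build_capability="70;75;80;86;89;90;100;120"
fi
```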
  # CUDA 13.0+: Add sm100/sm110/sm120
- [[ "${cuda_version}" == 13.*.* ]] && build_capability="75;80;90;100;110;120"
+ [[ "${cuda_version}" == 13.*.* ]] && build_capability="75;80;87;90;100;103;110;120;121"
I would have expected building for sm80, sm100, and sm120 to cover this, since we don't use any sm87/sm103/sm121-specific features yet. Is it not working today? Can you clarify the benefit of adding these targets? I assume it's maybe just some performance optimizations?
I didn’t know that. That is fine. Closing
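For reference, one way to check which SASS (cubin) and PTX targets a built wheel actually ships, so the "sm80 covers sm87" assumption can be verified directly (the library filename below is illustrative; the CUDA suffix varies by build):

```bash
# List the real-architecture cubins (SASS) embedded in the shared library
cuobjdump --list-elf bitsandbytes/libbitsandbytes_cuda128.so

# List the PTX entries available for JIT compilation on newer GPUs
cuobjdump --list-ptx bitsandbytes/libbitsandbytes_cuda128.so
```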
  # CUDA 13.0+: Remove < sm75 to align with PyTorch 2.9+cu130 minimum
- [[ "${cuda_version}" == 13.*.* ]] && build_capability="75;80;86;89;90;100;120"
+ [[ "${cuda_version}" == 13.*.* ]] && build_capability="75;80;87;90;100;103;110;120;121"
Same comment on sm87, sm110, sm121 being exclusive to aarch64.
I think we should keep sm86/sm89, and maybe consider adding sm103.
| [[ "${cuda_version}" == 13.*.* ]] && build_capability="75;80;87;90;100;103;110;120;121" | |
| [[ "${cuda_version}" == 13.*.* ]] && build_capability="75;80;86;89;90;100;103;120" |
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Adds sm_87 (NVIDIA Jetson Orin: Nano / NX / AGX) to the aarch64 build_capability list in .github/scripts/build-cuda.sh and documents the addition in installation.mdx.

Why an explicit cubin is needed: the CMake arch logic at CMakeLists.txt:226-230 only emits PTX for the latest capability. sm87 hardware can't JIT from sm90+ PTX (forward compatibility is upward-only), so aarch64 wheels currently targeting sm75/sm80/sm90 ship PTX only for sm90, and Jetson Orin users fall back to slow or unsupported paths. This rebuts the "sm80 should cover sm87" reasoning that closed bitsandbytes-foundation#1781.

Wheel size impact (measured on Linux aarch64, CUDA 12.6.68, source HEAD a57d8e2):
- baseline (sm75;80;90): 5,710,520 bytes (5.45 MiB)
- with sm87 (sm75;80;87;90): 7,353,064 bytes (7.01 MiB)
- delta: +1,642,544 bytes (+1.57 MiB, +28.76%)

Adds tests/test_linear4bit_sm87_multishape_regression.py — a pytest reproducer for the multi-shape Linear4bit cold-start fault on sm_87 (bitsandbytes-foundation#1936). The test runs the historical failing recipe (NF4 + bf16 quant_storage + bf16 compute + double_quant + ABC shape order + no hygiene + batch=1) in a cold state on sm_87. The fault is cold-start-specific; the test docstring documents the warm/cold distinction so CI runners can configure accordingly.

Closes bitsandbytes-foundation#1930
Closes bitsandbytes-foundation#1218

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
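As a rough sketch of the codegen a target list like 75-real;80-real;87-real;90 implies (the exact flags CMake emits and the csrc/ops.cu path are illustrative, not the project's actual build line): only the last, non-"-real" entry carries PTX, so any device that doesn't match an embedded cubin has to JIT that PTX, which an sm_87 device cannot do for compute_90.

```bash
# "-real" entries embed SASS (cubin) only; the bare "90" entry also embeds
# compute_90 PTX, which only sm_90-or-newer GPUs can JIT. An sm_87 Orin
# therefore needs its own cubin to avoid the slow or unsupported fallback paths.
nvcc -c csrc/ops.cu \
  -gencode 'arch=compute_75,code=sm_75' \
  -gencode 'arch=compute_80,code=sm_80' \
  -gencode 'arch=compute_87,code=sm_87' \
  -gencode 'arch=compute_90,code=[compute_90,sm_90]'
```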
No description provided.