
Conversation

@CISC (Collaborator) commented Oct 29, 2025

The non-CUB ARGSORT would fail with 64k+ rows (-ub 8192 with expert group selection) due to storing nrows in the 16-bit y-dimension.

@CISC CISC requested a review from slaren as a code owner October 29, 2025 20:07
@ORippler (Contributor) left a comment


> The non-CUB ARGSORT would fail with 64k+ rows (-ub 8192 with expert group selection) due to storing nrows in the 16-bit y-dimension.

According to the official CUDA docs, both dim3 and blockIdx resolve to the uint3 type, which is a vector of three uints that are each stored with 32-bit precision. Did you verify that blockIdx.y is actually stored with only 16-bit precision? Because that would be a pretty severe bug on the CUDA side, and I'd be interested in a repro so I can raise it internally.

@slaren (Member) commented Oct 29, 2025

The limits are documented in https://docs.nvidia.com/cuda/cuda-c-programming-guide/#features-and-technical-specifications-technical-specifications-per-compute-capability

@jukofyork (Collaborator) commented

I can confirm this fixes my problem mentioned in #16691 (comment).

print_info: n_expert_groups  = 8
print_info: n_group_used     = 4

using ubatch = 16384:

srv  params_from_: Chat format: DeepSeek R1
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 24135
slot update_slots: id  0 | task 0 | n_past = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 16384, n_tokens = 16384, progress = 0.678848
slot update_slots: id  0 | task 0 | n_past = 16384, memory_seq_rm [16384, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 24135, n_tokens = 7751, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 24135, n_tokens = 7751
slot print_timing: id  0 | task 0 | 
prompt eval time =  230863.86 ms / 24135 tokens (    9.57 ms per token,   104.54 tokens per second)
       eval time =  167146.10 ms /  1182 tokens (  141.41 ms per token,     7.07 tokens per second)
      total time =  398009.96 ms / 25317 tokens
slot print_timing: id  0 | task 0 | 
draft acceptance rate = 0.53755 (  272 accepted /   506 generated)
srv  log_server_r: request: POST /v1/chat/completions 192.168.1.110 200
slot      release: id  0 | task 0 | stop processing: n_past = 25316, truncated = 0

thanks @CISC!

@CISC (Collaborator, Author) commented Oct 29, 2025

The github-actions bot added the testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), and ggml (changes relating to the ggml tensor library for machine learning) labels on Oct 29, 2025.
@jeffbolznv (Collaborator) commented

Sure, I'll take a look.

@jeffbolznv (Collaborator) commented

Vulkan fix is at #16851.

@CISC CISC merged commit 229bf68 into master Oct 30, 2025
68 of 72 checks passed
@CISC CISC deleted the cisc/fix-cuda-large-argsort-rows branch October 30, 2025 07:56