
Conversation

@CISC (Collaborator) commented Oct 29, 2025

The non-CUB ARGSORT would fail with 64k+ rows (-ub 8192 with expert group selection) due to storing nrows in the 16-bit y-dimension.

@CISC CISC requested a review from slaren as a code owner October 29, 2025 20:07
@ORippler (Contributor) left a comment


> The non-CUB ARGSORT would fail with 64k+ rows (-ub 8192 with expert group selection) due to storing nrows in the 16-bit y-dimension.

According to the official CUDA docs, both dim3 and blockIdx resolve to the uint3 type, which is a vector of three uints that are each stored with 32-bit precision. Did you verify that blockIdx.y is actually stored with only 16-bit precision? Because that would be a pretty severe bug on the CUDA side, and I'd be interested in a repro so I can raise it internally.

@slaren (Member) commented Oct 29, 2025

The limits are documented in https://docs.nvidia.com/cuda/cuda-c-programming-guide/#features-and-technical-specifications-technical-specifications-per-compute-capability

@jukofyork (Collaborator) commented

I can confirm this fixes my problem mentioned in #16691 (comment).

print_info: n_expert_groups  = 8
print_info: n_group_used     = 4

using ubatch = 16384:

srv  params_from_: Chat format: DeepSeek R1
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 24135
slot update_slots: id  0 | task 0 | n_past = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 16384, n_tokens = 16384, progress = 0.678848
slot update_slots: id  0 | task 0 | n_past = 16384, memory_seq_rm [16384, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 24135, n_tokens = 7751, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 24135, n_tokens = 7751
slot print_timing: id  0 | task 0 | 
prompt eval time =  230863.86 ms / 24135 tokens (    9.57 ms per token,   104.54 tokens per second)
       eval time =  167146.10 ms /  1182 tokens (  141.41 ms per token,     7.07 tokens per second)
      total time =  398009.96 ms / 25317 tokens
slot print_timing: id  0 | task 0 | 
draft acceptance rate = 0.53755 (  272 accepted /   506 generated)
srv  log_server_r: request: POST /v1/chat/completions 192.168.1.110 200
slot      release: id  0 | task 0 | stop processing: n_past = 25316, truncated = 0

thanks @CISC!

@CISC (Collaborator, Author) commented Oct 29, 2025

The github-actions bot added the testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), and ggml (changes relating to the ggml tensor library for machine learning) labels on Oct 29, 2025.
@jeffbolznv (Collaborator) commented

Sure, I'll take a look.

@jeffbolznv (Collaborator) commented

Vulkan fix is at #16851.

@CISC CISC merged commit 229bf68 into master Oct 30, 2025
68 of 72 checks passed
@CISC CISC deleted the cisc/fix-cuda-large-argsort-rows branch October 30, 2025 07:56