
Conversation

@jeffbolznv (Collaborator)

The top_k shader reduces workgroup-size elements down to k elements in each iteration. When k is large, a larger workgroup size is chosen so the reduction converges in fewer iterations. This helps with the test cases added here: https://github.com/ggml-org/llama.cpp/pull/17004/files#diff-2749fdb8974ec96afa18444a9d546409318b0a862709139b677eee468c479578R8045
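The selection logic can be sketched roughly as follows. This is a minimal illustration, not the actual dispatch code; the function name, the baseline of 128, and the 4*k growth target are all assumptions made up for this sketch:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical sketch: pick a workgroup size for the top_k shader.
// Each pass reduces `workgroup size` candidates down to k, so a
// workgroup comfortably larger than k converges in fewer passes.
static uint32_t topk_pick_workgroup_size(uint32_t k, uint32_t max_workgroup_size) {
    uint32_t wg = 128;                         // illustrative baseline size
    while (wg < 4 * k && wg < max_workgroup_size) {
        wg *= 2;                               // grow while k is large relative to wg
    }
    return std::min(wg, max_workgroup_size);   // never exceed the device limit
}
```

For k=1 this keeps the small baseline workgroup, while for k=400 it grows to the device limit, matching the intuition that large k benefits from a larger workgroup.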

I also saw that for very small tensors, test-backend-ops has an overflow that makes it compute incorrect results (note the negative run count in the "before" numbers below). This change performs the std::min operation in 64-bit arithmetic before converting the result to 32-bit.
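The failure mode can be illustrated like this. This is not the actual test-backend-ops code; the function names and the byte-budget framing are assumptions for the sketch, but it shows how a run count derived from a large quotient wraps negative when std::min is taken in 32-bit, and stays correct when the min is done in 64-bit first:

```cpp
#include <algorithm>
#include <cstdint>

// BUG sketch: the quotient may exceed INT32_MAX for tiny per-run sizes,
// so narrowing it to 32-bit *before* the min wraps to a negative count.
int32_t runs_32bit_min(int64_t budget_bytes, int64_t bytes_per_run, int32_t cap) {
    return std::min((int32_t)(budget_bytes / bytes_per_run), cap);
}

// FIX sketch: clamp in 64-bit first; the clamped value is guaranteed
// to fit, so the final conversion to 32-bit is safe.
int32_t runs_64bit_min(int64_t budget_bytes, int64_t bytes_per_run, int64_t cap) {
    return (int32_t)std::min(budget_bytes / bytes_per_run, cap);
}
```

With a quotient of several billion, the 32-bit version yields a garbage negative run count like the one in the "before" output, while the 64-bit version returns the cap.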

before
  TOP_K(type=f32,ne=[2,1,1,1],k=1):                 -1431647801 runs -    -0.00 us/run -        0 kB/run - -382276.15 GB/s
  TOP_K(type=f32,ne=[1,1,1,1],k=1):                    23544 runs -    42.48 us/run -        0 kB/run -    0.00 GB/s
...
  TOP_K(type=f32,ne=[400,1,1,1],k=400):               548864 runs -     1.83 us/run -        3 kB/run -    1.63 GB/s
  TOP_K(type=f32,ne=[1000,1,1,1],k=400):              311296 runs -     3.22 us/run -        5 kB/run -    1.62 GB/s
  TOP_K(type=f32,ne=[65000,1,1,1],k=400):              16384 runs -    63.84 us/run -      255 kB/run -    3.82 GB/s
  TOP_K(type=f32,ne=[200000,1,1,1],k=400):             16384 runs -    83.12 us/run -      782 kB/run -    8.98 GB/s
  TOP_K(type=f32,ne=[400,16,1,1],k=400):              516096 runs -     1.95 us/run -       50 kB/run -   24.45 GB/s
  TOP_K(type=f32,ne=[1000,16,1,1],k=400):             303104 runs -     3.33 us/run -       87 kB/run -   25.05 GB/s
  TOP_K(type=f32,ne=[65000,16,1,1],k=400):             16384 runs -    97.10 us/run -     4087 kB/run -   40.15 GB/s
  TOP_K(type=f32,ne=[200000,16,1,1],k=400):             5358 runs -   211.52 us/run -    12525 kB/run -   56.47 GB/s

after
  TOP_K(type=f32,ne=[2,1,1,1],k=1):                   679936 runs -     1.48 us/run -        0 kB/run -    0.01 GB/s
  TOP_K(type=f32,ne=[1,1,1,1],k=1):                   679936 runs -     1.47 us/run -        0 kB/run -    0.01 GB/s
...
  TOP_K(type=f32,ne=[400,1,1,1],k=400):               548864 runs -     1.83 us/run -        3 kB/run -    1.63 GB/s
  TOP_K(type=f32,ne=[1000,1,1,1],k=400):              311296 runs -     3.22 us/run -        5 kB/run -    1.62 GB/s
  TOP_K(type=f32,ne=[65000,1,1,1],k=400):              49152 runs -    23.47 us/run -      255 kB/run -   10.38 GB/s
  TOP_K(type=f32,ne=[200000,1,1,1],k=400):             32768 runs -    31.72 us/run -      782 kB/run -   23.54 GB/s
  TOP_K(type=f32,ne=[400,16,1,1],k=400):              524288 runs -     1.92 us/run -       50 kB/run -   24.83 GB/s
  TOP_K(type=f32,ne=[1000,16,1,1],k=400):             303104 runs -     3.32 us/run -       87 kB/run -   25.17 GB/s
  TOP_K(type=f32,ne=[65000,16,1,1],k=400):             24576 runs -    45.12 us/run -     4087 kB/run -   86.39 GB/s
  TOP_K(type=f32,ne=[200000,16,1,1],k=400):            10716 runs -   108.55 us/run -    12525 kB/run -  110.04 GB/s

@github-actions bot added the testing (Everything test related), Vulkan (Issues specific to the Vulkan backend), and ggml (changes relating to the ggml tensor library for machine learning) labels on Nov 28, 2025
@0cc4m 0cc4m merged commit 59d8d4e into ggml-org:master Nov 29, 2025
72 of 74 checks passed
