@jeffbolznv (Collaborator)

This is an optimized transposed-copy similar to #16841. I've verified this also works with the new test cases in #17332.
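
For context, the standard way to make a transposed copy fast on a GPU is to stage the data through a shared-memory tile, so that both the reads from the source and the writes to the transposed destination are coalesced. The sketch below illustrates that general technique as a standalone GLSL compute shader; it is not the shader added in this PR, and the tile size, bindings, push constants, and names are all illustrative assumptions. The real shader additionally has to handle the element types (f32/f16/bf16) and layouts exercised by the CPY test cases below.

  #version 450
  // Illustrative sketch of a shared-memory tiled transpose (not this PR's shader).
  // A 32x32 tile is loaded with coalesced reads, then written back transposed
  // with coalesced writes. Tile size, bindings, and names are assumptions.
  layout(local_size_x = 32, local_size_y = 8, local_size_z = 1) in;

  layout(std430, binding = 0) readonly  buffer Src { float src_data[]; };
  layout(std430, binding = 1) writeonly buffer Dst { float dst_data[]; };

  layout(push_constant) uniform Params {
      uint rows;  // source rows    (= destination columns)
      uint cols;  // source columns (= destination rows)
  } p;

  // Pad the tile to 33 columns to avoid shared-memory bank conflicts.
  shared float tile[32][33];

  void main() {
      const uint TILE = 32;
      const uint tx   = gl_LocalInvocationID.x;
      const uint ty   = gl_LocalInvocationID.y;
      const uint col0 = gl_WorkGroupID.x * TILE;  // first source column of this tile
      const uint row0 = gl_WorkGroupID.y * TILE;  // first source row of this tile

      // Load a 32x32 tile from the row-major source, 8 rows per iteration.
      for (uint j = 0; j < TILE; j += gl_WorkGroupSize.y) {
          const uint r = row0 + ty + j;
          const uint c = col0 + tx;
          if (r < p.rows && c < p.cols) {
              tile[ty + j][tx] = src_data[r * p.cols + c];
          }
      }

      barrier();

      // Store the tile transposed; consecutive threads write consecutive
      // destination elements, so the writes are coalesced as well.
      for (uint j = 0; j < TILE; j += gl_WorkGroupSize.y) {
          const uint r = row0 + tx;      // source row -> destination column
          const uint c = col0 + ty + j;  // source col -> destination row
          if (r < p.rows && c < p.cols) {
              dst_data[c * p.rows + r] = tile[tx][ty + j];
          }
      }
  }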

On 5090:

before:

  CPY(type_src=f32,type_dst=f32,ne=[786432,256,1,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):               308 runs -  3256.21 us/run -  1572864 kB/run -  471.13 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[786432,256,1,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):               344 runs -  3253.28 us/run -   786432 kB/run -  233.22 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[768,1024,256,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):               602 runs -  1761.13 us/run -   786432 kB/run -  430.81 GB/s
  CPY(type_src=bf16,type_dst=bf16,ne=[768,1024,256,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):             602 runs -  1756.74 us/run -   786432 kB/run -  431.89 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[786432,256,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):               308 runs -  3254.67 us/run -  1572864 kB/run -  471.35 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[768,1024,256,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):               506 runs -  2004.88 us/run -  1572864 kB/run -  765.18 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[786432,256,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):               344 runs -  3256.40 us/run -   786432 kB/run -  232.99 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[768,1024,256,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):               602 runs -  1760.63 us/run -   786432 kB/run -  430.94 GB/s
  CPY(type_src=bf16,type_dst=bf16,ne=[768,1024,256,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):             602 runs -  1758.17 us/run -   786432 kB/run -  431.54 GB/s
  
after:

  CPY(type_src=f32,type_dst=f32,ne=[786432,256,1,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):               946 runs -  1068.94 us/run -  1572864 kB/run - 1435.16 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[786432,256,1,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):              1849 runs -   541.66 us/run -   786432 kB/run - 1400.74 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[768,1024,256,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):              1935 runs -   523.07 us/run -   786432 kB/run - 1450.52 GB/s
  CPY(type_src=bf16,type_dst=bf16,ne=[768,1024,256,1],permute_src=[1,0,2,3],permute_dst=[0,0,0,0],_src_transpose=0):            1935 runs -   525.75 us/run -   786432 kB/run - 1443.12 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[786432,256,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):               946 runs -  1068.55 us/run -  1572864 kB/run - 1435.67 GB/s
  CPY(type_src=f32,type_dst=f32,ne=[768,1024,256,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):               968 runs -  1051.84 us/run -  1572864 kB/run - 1458.48 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[786432,256,1,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):              1892 runs -   540.77 us/run -   786432 kB/run - 1403.04 GB/s
  CPY(type_src=f16,type_dst=f16,ne=[768,1024,256,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):              1935 runs -   525.13 us/run -   786432 kB/run - 1444.83 GB/s
  CPY(type_src=bf16,type_dst=bf16,ne=[768,1024,256,1],permute_src=[0,0,0,0],permute_dst=[0,0,0,0],_src_transpose=1):            1935 runs -   525.78 us/run -   786432 kB/run - 1443.03 GB/s
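
For reference, output in this format is what ggml's test-backend-ops tool prints in perf mode; a typical invocation would be something like `test-backend-ops perf -o CPY`, though the exact flags can vary by version.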

@jeffbolznv requested a review from 0cc4m as a code owner on November 18, 2025 at 22:39.
@github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on November 18, 2025.
@0cc4m (Collaborator) left a comment:

Really big improvements in the CPY benchmarks across all my hardware, nice!

@0cc4m merged commit 2eba631 into ggml-org:master on November 19, 2025. All 74 checks passed.