Conversation

cern1710
Contributor

This PR optimises the GGML_OP_SUM operation implemented in #16539.

The original implementation performed the sum op on one thread as follows:

    ggml_metal_encoder_dispatch_threadgroups(enc, 1, 1, 1, 1, 1, 1);

resulting in the following ./test-backend-ops perf -o SUM log:

./test-backend-ops perf -o SUM
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.007 sec
ggml_metal_device_init: GPU name:   Apple M1 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 12713.12 MB
Testing 3 devices

ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: use bfloat         = true
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
Backend 1/3: Metal
  Device description: Apple M1 Pro
  Device memory: 12124 MB (12123 MB free)

ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_op_sum_f32', name = 'kernel_op_sum_f32'
ggml_metal_library_compile_pipeline: loaded kernel_op_sum_f32                             0x102efc6b0 | th_max = 1024 | th_width =   32
  SUM(type=f32,ne=[8192,1,1,1]):              163820 runs -     6.25 us/run -       32 kB/run -    0.02 GB/s

Implementation

To fix this, I've modified the host-side code to launch nth threads in one threadgroup, so that each thread sums a strided chunk (similar to op_sum_rows); a sketch of the kernel-side reduction follows the list below.

  • simd_sum(sumf) performs a partial sum within each SIMD group
  • A second simd_sum(v) reduces across SIMD groups (only SIMD group 0 is involved here)
  • The nth value is calculated similarly to op_sum_rows
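
For reference, a minimal sketch of the kernel-side reduction under this scheme. The argument and variable names are illustrative (modeled on other ggml Metal kernels) and may differ from the PR's actual code:

    // sketch only: single-threadgroup strided sum with a two-level
    // simd_sum reduction (SIMD width is 32 on Apple GPUs)
    kernel void kernel_op_sum_f32(
            device const float   * src0,
            device       float   * dst,
            constant     int64_t & n,
            threadgroup  float   * shmem [[threadgroup(0)]],
            ushort tpitg[[thread_position_in_threadgroup]],
            ushort ntg  [[threads_per_threadgroup]],
            ushort tiisg[[thread_index_in_simdgroup]],
            ushort sgitg[[simdgroup_index_in_threadgroup]]) {
        // each thread accumulates a strided chunk of the input
        float sumf = 0.0f;
        for (int64_t i0 = tpitg; i0 < n; i0 += ntg) {
            sumf += src0[i0];
        }

        // first reduction: partial sum within each SIMD group
        sumf = simd_sum(sumf);
        if (tiisg == 0) {
            shmem[sgitg] = sumf;
        }
        threadgroup_barrier(mem_flags::mem_threadgroup);

        // second reduction: SIMD group 0 reduces the per-group partials
        if (sgitg == 0) {
            float v = tiisg < ntg/32 ? shmem[tiisg] : 0.0f;
            v = simd_sum(v);
            if (tiisg == 0) {
                dst[0] = v;
            }
        }
    }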

This results in the following performance log via ./test-backend-ops perf -o SUM:

ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_op_sum_f32', name = 'kernel_op_sum_f32'
ggml_metal_library_compile_pipeline: loaded kernel_op_sum_f32                             0x102efc6b0 | th_max = 1024 | th_width =   32
  SUM(type=f32,ne=[8192,1,1,1]):              163820 runs -     6.25 us/run -       32 kB/run -    4.88 GB/s
  SUM(type=f32,ne=[8192,8192,1,1]):              128 runs - 23911.46 us/run -   262144 kB/run -   10.54 GB/s
  SUM(type=f32,ne=[128,8192,1,1]):              8191 runs -   296.56 us/run -     4096 kB/run -   13.17 GB/s
  Backend Metal: OK
ggml_metal_free: deallocating
Backend 2/3: BLAS
  Device description: Accelerate
  Device memory: 0 MB (0 MB free)

  SUM(type=f32,ne=[8192,1,1,1]): not supported
  SUM(type=f32,ne=[8192,8192,1,1]): not supported
  SUM(type=f32,ne=[128,8192,1,1]): not supported
  Backend BLAS: OK
Backend 3/3: CPU
  Skipping CPU backend
3/3 backends passed
OK

Note that this is still quite a bit slower than SUM_ROWS:

ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_sum_rows_f32', name = 'kernel_sum_rows_f32'
ggml_metal_library_compile_pipeline: loaded kernel_sum_rows_f32                           0x10552e7f0 | th_max = 1024 | th_width =   32
  SUM_ROWS(type=f32,ne=[8192,1,1,1],permute=0,slice=0):               147438 runs -     6.84 us/run -       32 kB/run -    4.46 GB/s
  SUM_ROWS(type=f32,ne=[8192,8192,1,1],permute=0,slice=0):               768 runs -  1467.22 us/run -   262176 kB/run -  171.74 GB/s
  SUM_ROWS(type=f32,ne=[128,8192,1,1],permute=0,slice=0):              16258 runs -    80.57 us/run -     4128 kB/run -   48.87 GB/s
  Backend Metal: OK

So there may be room for improvement in the current kernel implementation.

@ggerganov
Member

This implementation assumes that src0 is contiguous, while the current requirement is only that the rows be contiguous:

    case GGML_OP_SUM:
    case GGML_OP_SUM_ROWS:
    case GGML_OP_MEAN:
    case GGML_OP_SOFT_MAX:
    case GGML_OP_GROUP_NORM:
        return has_simdgroup_reduction && ggml_is_contiguous_rows(op->src[0]);

We should either update the requirement or support non-contiguous input. This is also wrong on master, but it would be nice to fix it while we are making changes here.

Would need to add non-contiguous tests in test-backend-ops (for example, permute the input).
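
For example, something along the lines of the existing build_graph pattern in tests/test-backend-ops.cpp (a sketch with illustrative member names, not a drop-in patch):

    // sketch: SUM over an optionally permuted input; permuting the two
    // middle dims keeps rows contiguous while the tensor as a whole is not
    ggml_tensor * build_graph(ggml_context * ctx) override {
        ggml_tensor * a = ggml_new_tensor(ctx, type, 4, ne.data());
        if (permute) {
            a = ggml_permute(ctx, a, 0, 2, 1, 3);
        }
        return ggml_sum(ctx, a);
    }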

> So there may be room for improvement in the current kernel implementation.

Likely, to match the performance of sum_rows, we would need to implement a 2-pass approach with an intermediate fleeting buffer, so that we can launch many threadgroups (not just one). The threadgroups would write their results into the fleeting buffer, and then a second pass with 1 threadgroup would accumulate the final result; see the sketch below. But this is more complicated to implement, so probably in another PR.
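
On the host side that would be roughly two dispatches (a sketch only; the fleeting-buffer allocation and pipeline setup are omitted, and ntg/nth are placeholder values):

    // pass 1: ntg threadgroups, each reduces a strided slice of src0 and
    // writes one partial sum into the fleeting buffer
    ggml_metal_encoder_dispatch_threadgroups(enc, ntg, 1, 1, nth, 1, 1);

    // (barrier between the dispatches so pass 2 sees the partials)

    // pass 2: a single threadgroup reduces the ntg partials into dst
    ggml_metal_encoder_dispatch_threadgroups(enc, 1, 1, 1, nth, 1, 1);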

@cern1710
Contributor Author

This is a temporary fix for now, but I assume it's possible to sum a non-contiguous src0 by passing the sizes and strides in the host-side struct?

    int64_t  ne00;
    int64_t  ne01;
    int64_t  ne02;
    int64_t  ne03;
    uint64_t nb00;
    uint64_t nb01;
    uint64_t nb02;
    uint64_t nb03;

@ggerganov
Member

Yes, we have to pass the strides and use them in the kernel. Note that we are mainly interested in the "contiguous rows" case. The completely non-contiguous case is rare, and for now there is no need to support it.
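
For the contiguous-rows case the kernel would compute each row's base pointer from the byte strides and keep the inner loop contiguous, e.g. (a sketch using the ne/nb fields above; args, i1/i2/i3, tpitg, and ntg are illustrative names):

    // per-row base address from the byte strides nb01/nb02/nb03
    device const float * row = (device const float *)(
            (device const char *) src0 + i1*args.nb01 + i2*args.nb02 + i3*args.nb03);

    // elements within a row are contiguous, so this loop stays coalesced
    for (int64_t i0 = tpitg; i0 < args.ne00; i0 += ntg) {
        sumf += row[i0];
    }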
