Conversation

cern1710
Contributor

This PR optimises the GGML_OP_SUM operation implemented in #16539.

The original implementation performed the sum op on one thread as follows:

    ggml_metal_encoder_dispatch_threadgroups(enc, 1, 1, 1, 1, 1, 1);

resulting in the following ./test-backend-ops perf -o SUM log:

./test-backend-ops perf -o SUM
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.007 sec
ggml_metal_device_init: GPU name:   Apple M1 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 12713.12 MB
Testing 3 devices

ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: use bfloat         = true
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
Backend 1/3: Metal
  Device description: Apple M1 Pro
  Device memory: 12124 MB (12123 MB free)

ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_op_sum_f32', name = 'kernel_op_sum_f32'
ggml_metal_library_compile_pipeline: loaded kernel_op_sum_f32                             0x102efc6b0 | th_max = 1024 | th_width =   32
  SUM(type=f32,ne=[8192,1,1,1]):              163820 runs -     6.25 us/run -       32 kB/run -    0.02 GB/s

Implementation

To fix this, I've modified the host-side code to launch nth threads in one threadgroup, so that each thread sums a strided chunk (similar to op_sum_rows); a sketch of the kernel-side reduction follows the list below.

  • simd_sum(sumf) performs a partial sum within each SIMD group
  • A second simd_sum(v) reduces across SIMD groups (only SIMD group 0 is involved here)
  • The nth value is calculated similarly to op_sum_rows
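
For reference, a minimal sketch of the kernel-side reduction under this scheme. The argument and variable names are illustrative (modeled on other ggml Metal kernels) and may differ from the PR's actual code:

    // sketch only: single-threadgroup strided sum with a two-level
    // simd_sum reduction (SIMD width is 32 on Apple GPUs)
    kernel void kernel_op_sum_f32(
            device const float   * src0,
            device       float   * dst,
            constant     int64_t & n,
            threadgroup  float   * shmem [[threadgroup(0)]],
            ushort tpitg[[thread_position_in_threadgroup]],
            ushort ntg  [[threads_per_threadgroup]],
            ushort tiisg[[thread_index_in_simdgroup]],
            ushort sgitg[[simdgroup_index_in_threadgroup]]) {
        // each thread accumulates a strided chunk of the input
        float sumf = 0.0f;
        for (int64_t i0 = tpitg; i0 < n; i0 += ntg) {
            sumf += src0[i0];
        }

        // first reduction: partial sum within each SIMD group
        sumf = simd_sum(sumf);
        if (tiisg == 0) {
            shmem[sgitg] = sumf;
        }
        threadgroup_barrier(mem_flags::mem_threadgroup);

        // second reduction: SIMD group 0 reduces the per-group partials
        if (sgitg == 0) {
            float v = tiisg < ntg/32 ? shmem[tiisg] : 0.0f;
            v = simd_sum(v);
            if (tiisg == 0) {
                dst[0] = v;
            }
        }
    }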

This results in the following performance log via ./test-backend-ops perf -o SUM:

ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_op_sum_f32', name = 'kernel_op_sum_f32'
ggml_metal_library_compile_pipeline: loaded kernel_op_sum_f32                             0x102efc6b0 | th_max = 1024 | th_width =   32
  SUM(type=f32,ne=[8192,1,1,1]):              163820 runs -     6.25 us/run -       32 kB/run -    4.88 GB/s
  SUM(type=f32,ne=[8192,8192,1,1]):              128 runs - 23911.46 us/run -   262144 kB/run -   10.54 GB/s
  SUM(type=f32,ne=[128,8192,1,1]):              8191 runs -   296.56 us/run -     4096 kB/run -   13.17 GB/s
  Backend Metal: OK
ggml_metal_free: deallocating
Backend 2/3: BLAS
  Device description: Accelerate
  Device memory: 0 MB (0 MB free)

  SUM(type=f32,ne=[8192,1,1,1]): not supported
  SUM(type=f32,ne=[8192,8192,1,1]): not supported
  SUM(type=f32,ne=[128,8192,1,1]): not supported
  Backend BLAS: OK
Backend 3/3: CPU
  Skipping CPU backend
3/3 backends passed
OK

Note that this is still quite a bit slower than SUM_ROWS:

ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_sum_rows_f32', name = 'kernel_sum_rows_f32'
ggml_metal_library_compile_pipeline: loaded kernel_sum_rows_f32                           0x10552e7f0 | th_max = 1024 | th_width =   32
  SUM_ROWS(type=f32,ne=[8192,1,1,1],permute=0,slice=0):               147438 runs -     6.84 us/run -       32 kB/run -    4.46 GB/s
  SUM_ROWS(type=f32,ne=[8192,8192,1,1],permute=0,slice=0):               768 runs -  1467.22 us/run -   262176 kB/run -  171.74 GB/s
  SUM_ROWS(type=f32,ne=[128,8192,1,1],permute=0,slice=0):              16258 runs -    80.57 us/run -     4128 kB/run -   48.87 GB/s
  Backend Metal: OK

So there may be room for improvement in the current kernel implementation.

@ggerganov
Member

This implementation assumes that src0 is contiguous, while the current requirement is only that the rows be contiguous:

    case GGML_OP_SUM:
    case GGML_OP_SUM_ROWS:
    case GGML_OP_MEAN:
    case GGML_OP_SOFT_MAX:
    case GGML_OP_GROUP_NORM:
        return has_simdgroup_reduction && ggml_is_contiguous_rows(op->src[0]);

We should either update the requirement or support non-contiguous input. This is also wrong on master, but it would be nice to fix it while we are making changes here.

Would need to add non-contiguous tests in test-backend-ops (for example, permute the input).
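
For example, something along the lines of the existing build_graph pattern in tests/test-backend-ops.cpp (a sketch with illustrative member names, not a drop-in patch):

    // sketch: SUM over an optionally permuted input; permuting the two
    // middle dims keeps rows contiguous while the tensor as a whole is not
    ggml_tensor * build_graph(ggml_context * ctx) override {
        ggml_tensor * a = ggml_new_tensor(ctx, type, 4, ne.data());
        if (permute) {
            a = ggml_permute(ctx, a, 0, 2, 1, 3);
        }
        return ggml_sum(ctx, a);
    }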

> So there may be room for improvement in the current kernel implementation.

Likely, to match the performance of sum_rows, we would need to implement a 2-pass approach with an intermediate fleeting buffer, so that we can launch many threadgroups (not just one). The threadgroups would write their results into the fleeting buffer, and then a second pass with 1 threadgroup would accumulate the final result; see the sketch below. But this is more complicated to implement, so probably in another PR.
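
On the host side that would be roughly two dispatches (a sketch only; the fleeting-buffer allocation and pipeline setup are omitted, and ntg/nth are placeholder values):

    // pass 1: ntg threadgroups, each reduces a strided slice of src0 and
    // writes one partial sum into the fleeting buffer
    ggml_metal_encoder_dispatch_threadgroups(enc, ntg, 1, 1, nth, 1, 1);

    // (barrier between the dispatches so pass 2 sees the partials)

    // pass 2: a single threadgroup reduces the ntg partials into dst
    ggml_metal_encoder_dispatch_threadgroups(enc, 1, 1, 1, nth, 1, 1);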

@cern1710
Contributor Author

This is a temporary fix for now, but I assume it's possible to sum a non-contiguous src0 by passing the sizes and strides in the host-side struct?

    int64_t  ne00;
    int64_t  ne01;
    int64_t  ne02;
    int64_t  ne03;
    uint64_t nb00;
    uint64_t nb01;
    uint64_t nb02;
    uint64_t nb03;

@ggerganov
Member

Yes, we have to pass the strides and use them in the kernel. Note that we are mainly interested in the "contiguous rows" case. The completely non-contiguous case is rare, and for now there is no need to support it.
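
For the contiguous-rows case the kernel would compute each row's base pointer from the byte strides and keep the inner loop contiguous, e.g. (a sketch using the ne/nb fields above; args, i1/i2/i3, tpitg, and ntg are illustrative names):

    // per-row base address from the byte strides nb01/nb02/nb03
    device const float * row = (device const float *)(
            (device const char *) src0 + i1*args.nb01 + i2*args.nb02 + i3*args.nb03);

    // elements within a row are contiguous, so this loop stays coalesced
    for (int64_t i0 = tpitg; i0 < args.ne00; i0 += ntg) {
        sumf += row[i0];
    }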
