Ggml/cpu col2im 1d by ServeurpersoCom · Pull Request #24206 · ggml-org/llama.cpp

ServeurpersoCom · 2026-06-05T19:18:05Z

Overview

cpu: add GGML_OP_COL2IM_1D

CPU part of #23424, split per review feedback; the CUDA backend follows in a separate PR.

Modern neural audio vocoders (the BigVGAN family and its descendants) build their generator from upsampling blocks: a transposed 1D convolution followed by an AMP / Snake stack. The transposed conv is the upsampler, Snake (#22667) is the periodic activation, and both sit on the hot path of every generated frame.

A ConvTranspose1d factorizes exactly as a GEMM followed by an overlap-add:

    columns = mul_mat(weight[IC, K*OC], input[IC, T_in])  -> [K*OC, T_in]
    signal  = col2im_1d(columns)                          -> [T_out, OC]
    with T_out = (T_in - 1)*s0 + K - 2*p0

Keeping the channel contraction inside ggml_mul_mat lets it ride the tuned (and quantizable) matmul kernels and tensor cores, leaving col2im_1d as a thin, memory-bound overlap-add: each output sample reads at most ceil(K/s0) columns.
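To make the factorization concrete, here is a minimal scalar sketch that is not the kernel from this PR. It assumes the K*OC axis of the column matrix is ordered as oc*K + k and that all buffers are dense F32; the function names (mul_mat_ref, col2im_1d_ref, conv_transpose_1d_ref) are hypothetical. It builds the columns with a plain contraction, gathers them into the signal, and provides a direct scatter reference to compare against.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// columns[t_in][m] = sum_ic w[m][ic] * x[t_in][ic], with m = oc*K + k (assumed ordering).
std::vector<float> mul_mat_ref(const std::vector<float>& w, const std::vector<float>& x,
                               int IC, int M, int T_in) {
    assert((int) w.size() == IC * M && (int) x.size() == IC * T_in);
    std::vector<float> col((size_t) M * T_in, 0.0f);
    for (int t = 0; t < T_in; ++t) {
        for (int m = 0; m < M; ++m) {
            float acc = 0.0f;
            for (int ic = 0; ic < IC; ++ic) {
                acc += w[(size_t) m * IC + ic] * x[(size_t) t * IC + ic];
            }
            col[(size_t) t * M + m] = acc;
        }
    }
    return col;
}

// Gather-style overlap-add: every output sample pulls the few columns that overlap it.
std::vector<float> col2im_1d_ref(const std::vector<float>& col,
                                 int T_in, int K, int OC, int s0, int p0) {
    assert(T_in > 0 && K > 0 && OC > 0 && s0 > 0 && p0 >= 0);
    assert((int) col.size() == T_in * K * OC);
    const int T_out = (T_in - 1) * s0 + K - 2 * p0;
    assert(T_out > 0);
    std::vector<float> dst((size_t) OC * T_out, 0.0f);
    for (int oc = 0; oc < OC; ++oc) {
        for (int t = 0; t < T_out; ++t) {
            const int tt = t + p0;                            // position before cropping
            const int lo = std::max(0, (tt - K + s0) / s0);   // first column with k <= K-1
            const int hi = std::min(T_in - 1, tt / s0);       // last column with k >= 0
            float acc = 0.0f;
            for (int t_in = lo; t_in <= hi; ++t_in) {
                const int k = tt - t_in * s0;
                acc += col[(size_t) t_in * K * OC + (size_t) oc * K + k];
            }
            dst[(size_t) oc * T_out + t] = acc;               // stays 0 when no column overlaps
        }
    }
    return dst;
}

// Direct scatter reference that folds the IC contraction into the scatter.
std::vector<float> conv_transpose_1d_ref(const std::vector<float>& w, const std::vector<float>& x,
                                         int IC, int K, int OC, int T_in, int s0, int p0) {
    const int T_out = (T_in - 1) * s0 + K - 2 * p0;
    assert(T_out > 0);
    std::vector<float> dst((size_t) OC * T_out, 0.0f);
    for (int oc = 0; oc < OC; ++oc) {
        for (int t_in = 0; t_in < T_in; ++t_in) {
            for (int k = 0; k < K; ++k) {
                const int t = t_in * s0 + k - p0;
                if (t < 0 || t >= T_out) continue;            // cropped by p0
                for (int ic = 0; ic < IC; ++ic) {
                    dst[(size_t) oc * T_out + t] +=
                        w[((size_t) oc * K + k) * IC + ic] * x[(size_t) t_in * IC + ic];
                }
            }
        }
    }
    return dst;
}
```

Calling col2im_1d_ref(mul_mat_ref(w, x, IC, K*OC, T_in), T_in, K, OC, s0, p0) should reproduce conv_transpose_1d_ref(w, x, IC, K, OC, T_in, s0, p0) up to floating-point summation order.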

The existing ggml_conv_transpose_1d takes the naive route: a single direct kernel that folds the IC contraction into the scatter and rescans the full input per output element. For a vocoder generator running this op many times per second over long sequences, that is the bottleneck. The GEMM + col2im split removes it and unlocks F16 / BF16 / quantized weights.

This is the upsampling primitive used across three downstream GGML projects: acestep.cpp (music generation), omnivoice.cpp (multilingual TTS) and qwentts.cpp (Qwen3-TTS 12Hz DAC decoder). It pairs with GGML_OP_SNAKE (the other half of the AMP block), and as with Snake the implementation has been exercised and validated on all backends by others before upstreaming.

Additional information

GGML_OP_COL2IM_1D for CPU, F32 / F16 / BF16 with an F32 accumulator, parallelized over output channels.

Backend coverage is already in place in test-backend-ops: eight geometries across all three types, including the canonical kernel = 2*stride upsampling shape, kernel < stride (gap positions are zeroed), kernel not a multiple of stride, both cropping variants and the single column edge case, plus three perf entries at real vocoder stage shapes reporting memory bandwidth. The CUDA follow-up will be validated against this grid with zero additional test code.
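The geometry classes in that grid can be told apart with a little index arithmetic. The sketch below is illustrative only (the helper names are hypothetical, not part of ggml): it computes the output length and the number of columns that overlap a given output position, which is zero at gap positions when kernel < stride.

```cpp
#include <algorithm>
#include <cassert>

// Output length of the overlap-add, including the crop term.
int col2im_1d_out_len(int T_in, int K, int s0, int p0) {
    assert(T_in > 0 && K > 0 && s0 > 0 && p0 >= 0);
    return (T_in - 1) * s0 + K - 2 * p0;
}

// Number of input columns that contribute to output position t.
int contributing_columns(int t, int T_in, int K, int s0, int p0) {
    assert(t >= 0 && t < col2im_1d_out_len(T_in, K, s0, p0));
    const int tt = t + p0;                           // position before cropping
    const int lo = std::max(0, (tt - K + s0) / s0);  // first column whose kernel reaches tt
    const int hi = std::min(T_in - 1, tt / s0);      // last column starting at or before tt
    return std::max(0, hi - lo + 1);
}
```

For example, with K = 2 and s0 = 4 the positions between two kernels report no contributing column and must be written as zero, while with K = 2*s0 every interior position reports two columns.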

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES Opus / MCP rootless container with Nvidia GPU

Add the overlap-add (scatter-add) step of a 1D transposed convolution. A ConvTranspose1d factorizes as a GEMM followed by col2im: a weight pre-permuted to [IC, K*OC] is contracted against the [IC, T_in] input with mul_mat to produce a column matrix [K*OC, T_in], and col2im_1d scatters those columns back into the [T_out, OC] signal, with T_out = (T_in - 1)*s0 + K - 2*p0. Keeping the contraction as a plain mul_mat leaves the heavy work on the optimized (and quantizable) matmul kernels, so col2im_1d only does the cheap overlap-add. CPU uses a gather formulation parallelized over output channels, supporting F32, F16 and BF16 with an F32 accumulator.

Add test_col2im_1d next to the conv_transpose_1d cases, covering F32, F16 and BF16 across eight geometries: the canonical kernel = 2*stride DAC upsampling shape, overlap, no overlap, cropping (p0 = 1 and p0 = stride/2), kernel < stride with zeroed gaps, kernel not a multiple of stride, and a single column unfold. Perf mode gets three real vocoder stage shapes reporting memory bandwidth. max_nmse_err relaxes to 5e-4 for F16 and BF16.
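For readers unfamiliar with the tolerance above, this is a minimal sketch of a normalized mean squared error check, assuming the usual definition (squared error divided by the energy of the reference). The helper names are hypothetical and the code is not copied from test-backend-ops.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Normalized mean squared error of out against ref, accumulated in double.
double nmse(const std::vector<float>& ref, const std::vector<float>& out) {
    assert(ref.size() == out.size() && !ref.empty());
    double err = 0.0;
    double energy = 0.0;
    for (size_t i = 0; i < ref.size(); ++i) {
        const double d = (double) out[i] - (double) ref[i];
        err    += d * d;
        energy += (double) ref[i] * (double) ref[i];
    }
    assert(energy > 0.0);  // an all-zero reference has no meaningful NMSE
    return err / energy;
}

// True when out is within the given NMSE budget, e.g. 5e-4 for F16 / BF16.
bool within_nmse(const std::vector<float>& ref, const std::vector<float>& out, double max_err) {
    return nmse(ref, out) <= max_err;
}
```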

ServeurpersoCom · 2026-06-05T20:19:33Z

CPU benchmark with three real-world use cases:

col2im-cpu-test.cpp.txt

root@pod:/mnt/workspace/col2im-bench# ./col2im-cpu-test
col2im-cpu-test: threads = 32, iters = 5
full ConvTranspose chain per decode, F32 weights, min over iters, all paths fed their preferred layout
MUL_MAT+COL2IM_1D: this PR, ConvTranspose1d as GEMM + overlap-add
MUL_MAT+IM2COL_BACK: existing GGML trick, strictly equivalent mathematically, but ugly (backward op, dummy kernel tensor, F32 CPU only)
CONV_TRANSPOSE_1D: the naive op traditionally used by every project unaware it can be accelerated

acestep.cpp T0 = 1024 (81.9s)

MUL_MAT+COL2IM_1D = 689.083 ms, Cosine Similarity = 0.999999999998
MUL_MAT+IM2COL_BACK = 743.477 ms (1.08x), Cosine Similarity = 0.999999999998
CONV_TRANSPOSE_1D = 2413.170 ms (3.50x), Cosine Similarity reference

qwentts.cpp T0 = 1500 (30.0s)

MUL_MAT+COL2IM_1D = 200.650 ms, Cosine Similarity = 0.999999999998
MUL_MAT+IM2COL_BACK = 273.453 ms (1.36x), Cosine Similarity = 0.999999999998
CONV_TRANSPOSE_1D = 582.309 ms (2.90x), Cosine Similarity reference

omnivoice.cpp T0 = 750 (30.0s)

MUL_MAT+COL2IM_1D = 67.129 ms, Cosine Similarity = 0.999999999999
MUL_MAT+IM2COL_BACK = 93.729 ms (1.40x), Cosine Similarity = 0.999999999999
CONV_TRANSPOSE_1D = 295.380 ms (4.40x), Cosine Similarity reference
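The two figures reported per row can be reproduced with a few lines. This is a sketch under the assumption that the similarity is accumulated in double precision over the flattened outputs and that the ratio in parentheses is time divided by the MUL_MAT+COL2IM_1D time; the attached benchmark source is the authority, and the function names here are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine similarity between two flattened outputs, accumulated in double.
double cosine_similarity(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size() && !a.empty());
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += (double) a[i] * (double) b[i];
        na  += (double) a[i] * (double) a[i];
        nb  += (double) b[i] * (double) b[i];
    }
    assert(na > 0.0 && nb > 0.0);
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

// Ratio shown in parentheses: how many times slower a path is than the baseline.
double slowdown(double ms, double baseline_ms) {
    assert(baseline_ms > 0.0);
    return ms / baseline_ms;
}
```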

ggml_col2im_1d validates s0, oc, p0 and input contiguity at graph build time, before the oc division, protecting every backend at once. The kernel asserts the contiguity its flat indexing assumes, and its doc states the full output length including the crop term. The kernel now parallelizes over the time axis: the split stays balanced down to OC = 1, where the previous channel split was single threaded. Values are bit identical on the three real vocoder chains, and two out of three improve.
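Why the time-axis split stays balanced at OC = 1 can be seen from the chunking arithmetic alone. The sketch below is an illustration with hypothetical helper names, not the code of the kernel: it hands each thread a contiguous chunk of a range and counts how many threads receive work.

```cpp
#include <algorithm>
#include <cassert>

struct Range {
    int begin;
    int end;  // exclusive
};

// Contiguous chunk of [0, n) owned by thread ith out of nth.
Range thread_range(int n, int ith, int nth) {
    assert(n >= 0 && nth > 0 && ith >= 0 && ith < nth);
    const int per   = (n + nth - 1) / nth;        // ceil(n / nth)
    const int begin = std::min(n, per * ith);
    const int end   = std::min(n, begin + per);
    return {begin, end};
}

// Number of threads that get at least one element of [0, n).
int busy_threads(int n, int nth) {
    int busy = 0;
    for (int ith = 0; ith < nth; ++ith) {
        const Range r = thread_range(n, ith, nth);
        if (r.end > r.begin) ++busy;
    }
    return busy;
}
```

Splitting over channels with OC = 1 corresponds to busy_threads(1, nth), which keeps a single thread busy, whereas splitting over a long time axis keeps all of them busy.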

The eval grid grows to eleven geometries: OC = 1 (mono output stage), K = 1 with stride > 1 (sparse scatter, every gap position zeroed) and a crop down to T_out = 2 where all the gather bounds act at once.

tests/test-col2im-1d.cpp proves that mul_mat + col2im_1d matches the native ggml_conv_transpose_1d on the CPU backend: bit exact for F32, and through casts of the column matrix for F16 and BF16. test-backend-ops cannot cover this for a CPU-only op, since the CPU backend is its own reference there.
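As an illustration of what a cast of the column matrix does to the values, here is a standalone BF16 round trip using round-to-nearest-even on the upper 16 bits of the F32 pattern. It is a sketch with hypothetical function names and does not call the ggml conversion helpers; the rounding it introduces is the reason the reduced-precision paths are compared with a tolerance rather than bit for bit.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// F32 -> BF16 with round-to-nearest-even; NaN is kept as a quiet NaN.
uint16_t fp32_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    if ((bits & 0x7fffffffu) > 0x7f800000u) {
        return (uint16_t) ((bits >> 16) | 64u);
    }
    bits += 0x7fffu + ((bits >> 16) & 1u);  // ties go to the even mantissa
    return (uint16_t) (bits >> 16);
}

// BF16 -> F32 is exact: the 16 stored bits become the upper half of the F32 pattern.
float bf16_to_fp32(uint16_t h) {
    const uint32_t bits = (uint32_t) h << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Value a column entry takes after being stored as BF16 and read back.
float round_through_bf16(float f) {
    return bf16_to_fp32(fp32_to_bf16(f));
}
```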

ggerganov · 2026-06-08T12:34:12Z

Need to fix the RPC: https://github.com/ggml-org/llama.cpp/actions/runs/27053569438/job/79853460419?pr=24206#step:7:126

ggerganov · 2026-06-09T07:47:39Z

I think you just need to bump the RPC protocol patch version.

ServeurpersoCom · 2026-06-09T08:15:39Z

Yes, I remember doing that back when Snake was a separate operator and not a graph fusion pattern. I had launched the entire CI locally, which I forgot to do on this PR. I'll fix that quickly.

GGML_OP_COUNT goes from 96 to 97 with the new op, which trips the static_assert in ggml-rpc.h. Bump RPC_PROTO_PATCH_VERSION since the op is appended and no existing op code shifts.
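The guard works along these lines. This is a hypothetical miniature with DEMO_ names, not the contents of ggml-rpc.h: the op count is pinned by a compile-time check, so appending an op fails the build until the pinned count and the protocol patch version are updated together.

```cpp
// Hypothetical miniature of the guard; the DEMO_ names are not from ggml.
enum demo_op {
    DEMO_OP_ADD,
    DEMO_OP_MUL_MAT,
    DEMO_OP_COL2IM_1D,  // appended last, so the codes of the ops above do not shift
    DEMO_OP_COUNT
};

// Bumped whenever the op list grows.
constexpr int DEMO_PROTO_PATCH_VERSION = 1;

// Fails to compile as soon as an op is added without revisiting the version above.
static_assert(DEMO_OP_COUNT == 3, "DEMO_OP_COUNT changed: bump DEMO_PROTO_PATCH_VERSION");
```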

ServeurpersoCom added 2 commits June 5, 2026 21:05

ServeurpersoCom requested a review from ggerganov as a code owner June 5, 2026 19:18

github-actions bot added the testing and ggml labels Jun 5, 2026

ServeurpersoCom added 3 commits June 6, 2026 07:17

tests: extend the GGML_OP_COL2IM_1D grid

00b682b


ggerganov approved these changes Jun 8, 2026


ggerganov self-assigned this Jun 8, 2026

rpc: bump protocol patch version for GGML_OP_COL2IM_1D

b948e4e


ggerganov merged commit 2602169 into ggml-org:master Jun 9, 2026
25 of 27 checks passed

ggerganov merged 6 commits into ggml-org:master from ServeurpersoCom:ggml/cpu-col2im_1d
