Ggml/cpu col2im 1d #24206
Merged
Add the overlap-add (scatter-add) step of a 1D transposed convolution. A ConvTranspose1d factorizes as a GEMM followed by col2im: a weight pre-permuted to [IC, K*OC] is contracted against the [IC, T_in] input with mul_mat to produce a column matrix [K*OC, T_in], and col2im_1d scatters those columns back into the [T_out, OC] signal, with T_out = (T_in - 1)*s0 + K - 2*p0. Keeping the contraction as a plain mul_mat leaves the heavy work on the optimized (and quantizable) matmul kernels, so col2im_1d only does the cheap overlap-add. CPU uses a gather formulation parallelized over output channels, supporting F32, F16 and BF16 with an F32 accumulator.
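The gather formulation described above can be sketched in plain C. This is a minimal single-threaded reference under stated assumptions, not the kernel from this PR: the names `col2im_1d_out_len` and `col2im_1d_ref` are hypothetical, only F32 is handled, and the ordering of `k` and `oc` inside the `K*OC` axis (here `oc*K + k`) is an assumption the description does not confirm.

```c
#include <assert.h>

/* Output length of the transposed convolution: (T_in - 1)*s0 + K - 2*p0. */
static int col2im_1d_out_len(int t_in, int k_n, int s0, int p0) {
    return (t_in - 1) * s0 + k_n - 2 * p0;
}

/*
 * Gather-style overlap-add (illustrative reference only).
 * ASSUMED memory layout, not confirmed by the PR text:
 *   col[i * (K*OC) + oc * K + k]   column matrix [K*OC, T_in]
 *   dst[oc * T_out + t]            output signal [T_out, OC]
 */
static void col2im_1d_ref(const float *col, float *dst,
                          int t_in, int oc_n, int k_n, int s0, int p0) {
    assert(t_in > 0 && oc_n > 0 && k_n > 0 && s0 > 0 && p0 >= 0);
    const int t_out = col2im_1d_out_len(t_in, k_n, s0, p0);

    for (int oc = 0; oc < oc_n; ++oc) {
        for (int t = 0; t < t_out; ++t) {
            const int u = t + p0;   /* position in the uncropped signal */
            float acc = 0.0f;       /* F32 accumulator */
            /* Column i contributes tap k when u == i*s0 + k, so only taps
             * congruent to u modulo s0 are visited. */
            for (int k = u % s0; k < k_n; k += s0) {
                const int i = (u - k) / s0;   /* exact: u - k is a multiple of s0 */
                if (i >= 0 && i < t_in) {
                    acc += col[i * (k_n * oc_n) + oc * k_n + k];
                }
            }
            dst[oc * t_out + t] = acc;   /* gap positions stay at 0 */
        }
    }
}
```

Because each output element is written exactly once, this form needs no atomics and no zero-initialized destination, which is what makes it straightforward to parallelize.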
Add test_col2im_1d next to the conv_transpose_1d cases, covering F32, F16 and BF16 across eight geometries: the canonical kernel = 2*stride DAC upsampling shape, overlap, no overlap, cropping (p0 = 1 and p0 = stride/2), kernel < stride with zeroed gaps, kernel not a multiple of stride, and a single column unfold. Perf mode gets three real vocoder stage shapes reporting memory bandwidth. max_nmse_err relaxes to 5e-4 for F16 and BF16.
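As a worked check of the output-length formula on the kernel = 2*stride family mentioned above (the numeric values are illustrative and are not taken from the test grid):

```latex
\begin{aligned}
T_{\text{out}} &= (T_{\text{in}} - 1)\,s_0 + K - 2\,p_0 \\
K = 2 s_0,\ p_0 = 0:\quad
T_{\text{out}} &= (T_{\text{in}} - 1)\,s_0 + 2 s_0 = (T_{\text{in}} + 1)\,s_0 \\
K = 2 s_0,\ p_0 = s_0/2:\quad
T_{\text{out}} &= (T_{\text{in}} + 1)\,s_0 - s_0 = T_{\text{in}}\,s_0 \\
T_{\text{in}} = 4,\ s_0 = 2,\ K = 4:\quad
T_{\text{out}} &= 10 \ (p_0 = 0), \qquad T_{\text{out}} = 8 \ (p_0 = 1)
\end{aligned}
```

With K = 2*s0 and p0 = s0/2 the crop removes exactly the extra s0 samples, so the block upsamples by a factor of s0.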
CPU benchmark with 3 real-world use cases:
ggml_col2im_1d validates s0, oc, p0 and input contiguity at graph build time, before the oc division, protecting every backend at once. The kernel asserts the contiguity its flat indexing assumes, and its doc states the full output length including the crop term. The kernel now parallelizes over the time axis: the split stays balanced down to OC = 1, where the previous channel split was single-threaded. Values are bit identical on the three real vocoder chains, and two of the three improve.
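A minimal sketch of why a time-axis split stays balanced at OC = 1. The helper `split_time_axis` and the `time_range` struct are hypothetical; `ith` and `nth` follow the usual thread index / thread count naming, and the ceil-divide chunking shown here is an assumption about the partitioning, not a copy of the kernel.

```c
#include <assert.h>

/* Half-open range [t0, t1) of output time steps owned by one thread. */
typedef struct {
    int t0;
    int t1;
} time_range;

/* Split T_out time steps across nth threads; independent of OC, so a mono
 * output (OC = 1) still spreads across every thread. */
static time_range split_time_axis(int t_out, int ith, int nth) {
    assert(t_out >= 0 && nth > 0 && ith >= 0 && ith < nth);
    const int dt = (t_out + nth - 1) / nth;   /* ceil(t_out / nth) */
    time_range r;
    r.t0 = dt * ith < t_out ? dt * ith : t_out;       /* clamp: late threads may be idle */
    r.t1 = r.t0 + dt < t_out ? r.t0 + dt : t_out;
    return r;
}
```

Each thread would then run the gather loop for every output channel over its own `[t0, t1)` slice. A split over channels, by contrast, hands all the work to one thread when OC = 1.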
The eval grid grows to eleven geometries: OC = 1 (mono output stage), K = 1 with stride > 1 (sparse scatter, every gap position zeroed) and a crop down to T_out = 2 where all the gather bounds act at once.
tests/test-col2im-1d.cpp proves mul_mat + col2im_1d matches the native ggml_conv_transpose_1d on the CPU backend: F32 is bit exact, and F16 and BF16 are checked through casts of the column matrix. test-backend-ops cannot cover this for a CPU-only op since the CPU backend is its own reference there.
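The equivalence being tested can be illustrated without ggml. The sketch below is not the test from this PR: both function names are hypothetical, the weight layout `w[ic][oc][k]` and the column layout `col[i][oc*K + k]` are assumptions, and the overlap-add is written in scatter form for brevity.

```c
#include <assert.h>
#include <string.h>

/* Direct transposed conv: y[oc][i*s0 + k - p0] += x[ic][i] * w[ic][oc][k]. */
static void conv_transpose_1d_direct(const float *x, const float *w, float *y,
                                     int ic_n, int oc_n, int k_n,
                                     int t_in, int s0, int p0) {
    const int t_out = (t_in - 1) * s0 + k_n - 2 * p0;
    memset(y, 0, sizeof(float) * (size_t)oc_n * (size_t)t_out);
    for (int ic = 0; ic < ic_n; ++ic)
        for (int oc = 0; oc < oc_n; ++oc)
            for (int i = 0; i < t_in; ++i)
                for (int k = 0; k < k_n; ++k) {
                    const int t = i * s0 + k - p0;
                    if (t >= 0 && t < t_out)
                        y[oc * t_out + t] += x[ic * t_in + i] * w[(ic * oc_n + oc) * k_n + k];
                }
}

/* Same result as a GEMM into `col` followed by an overlap-add. */
static void conv_transpose_1d_factored(const float *x, const float *w,
                                       float *col, float *y,
                                       int ic_n, int oc_n, int k_n,
                                       int t_in, int s0, int p0) {
    const int t_out = (t_in - 1) * s0 + k_n - 2 * p0;
    const int kc = oc_n * k_n;   /* length of the fused K*OC axis */

    /* GEMM: col[i][j] = sum over ic of x[ic][i] * w[ic][j], with j = oc*K + k. */
    for (int i = 0; i < t_in; ++i)
        for (int j = 0; j < kc; ++j) {
            float acc = 0.0f;
            for (int ic = 0; ic < ic_n; ++ic)
                acc += x[ic * t_in + i] * w[ic * kc + j];
            col[i * kc + j] = acc;
        }

    /* Overlap-add: no IC loop left, only cheap adds. */
    memset(y, 0, sizeof(float) * (size_t)oc_n * (size_t)t_out);
    for (int i = 0; i < t_in; ++i)
        for (int oc = 0; oc < oc_n; ++oc)
            for (int k = 0; k < k_n; ++k) {
                const int t = i * s0 + k - p0;
                if (t >= 0 && t < t_out)
                    y[oc * t_out + t] += col[i * kc + oc * k_n + k];
            }
}
```

The two routines sum in different orders, so on arbitrary floats they agree only up to rounding. Comparing them exactly is safe with small integer-valued inputs; matching a specific native kernel bit for bit depends on reproducing its accumulation order.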
ggerganov approved these changes on Jun 8, 2026
I think you just need to bump the RPC protocol patch version.
Yes, I remember doing that back when Snake was a separate operator and not a graph fusion pattern. I had launched the entire CI locally, which I forgot to do on this PR. I'll fix that quickly.
GGML_OP_COUNT goes from 96 to 97 with the new op, which trips the static_assert in ggml-rpc.h. Bump RPC_PROTO_PATCH_VERSION since the op is appended and no existing op code shifts.
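The guard pattern described here can be sketched in miniature. Every name below (`demo_op`, `DEMO_OP_*`, `DEMO_PROTO_PATCH_VERSION`) is hypothetical and the counts are illustrative; this is not the contents of ggml-rpc.h.

```c
#include <assert.h>

/* Hypothetical miniature of an op enum; names are illustrative, not ggml's. */
enum demo_op {
    DEMO_OP_ADD,
    DEMO_OP_MUL_MAT,
    DEMO_OP_CONV_TRANSPOSE_1D,
    DEMO_OP_COL2IM_1D,   /* new op appended just before the count */
    DEMO_OP_COUNT
};

/* Bumped in the same change that updates the expected count below. */
#define DEMO_PROTO_PATCH_VERSION 1

/* Compile-time tripwire: adding an op breaks the build until the expected
 * count (and the protocol version) are revisited. */
static_assert(DEMO_OP_COUNT == 4,
              "DEMO_OP_COUNT changed: bump DEMO_PROTO_PATCH_VERSION");

/* Appending leaves every earlier op code where it was. */
static_assert(DEMO_OP_CONV_TRANSPOSE_1D == 2,
              "existing op codes must not shift");
```

Because the new op is appended, peers built against the older enum still agree on every pre-existing op code, which is why a patch-level bump is the fix suggested in the review.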
Overview
cpu: add GGML_OP_COL2IM_1D
CPU part of #23424, split per review feedback; the CUDA backend follows in a separate PR.
Modern neural audio vocoders (the BigVGAN family and its descendants) build their generator from upsampling blocks: a transposed 1D convolution followed by an AMP / Snake stack. The transposed conv is the upsampler, Snake (#22667) is the periodic activation, and both sit on the hot path of every generated frame.
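A ConvTranspose1d factorizes exactly as a GEMM followed by an overlap-add: ggml_mul_mat contracts the pre-permuted [IC, K*OC] weight against the [IC, T_in] input into a [K*OC, T_in] column matrix, and col2im_1d scatters those columns into the [T_out, OC] signal.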
Keeping the channel contraction inside ggml_mul_mat lets it ride the tuned (and quantizable) matmul kernels and tensor cores, leaving col2im_1d as a thin, memory bound overlap-add: each output reads only ceil(K/s0) columns.
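The ceil(K/s0) bound can be checked with a small counting sketch. The helper names are hypothetical, and the index convention (output sample `t` receives tap `k` of column `i` when `t + p0 == i*s0 + k`) is the standard transposed-convolution relation, assumed here rather than quoted from the kernel.

```c
#include <assert.h>

/* Number of input columns that contribute to output sample t. */
static int contributing_columns(int t, int t_in, int k_n, int s0, int p0) {
    assert(t >= 0 && t_in > 0 && k_n > 0 && s0 > 0 && p0 >= 0);
    const int u = t + p0;   /* position before cropping */
    int n = 0;
    /* Only taps congruent to u modulo s0 can land on this output. */
    for (int k = u % s0; k < k_n; k += s0) {
        const int i = (u - k) / s0;
        if (i >= 0 && i < t_in) {
            ++n;
        }
    }
    return n;
}

/* Upper bound on reads per output element: ceil(K / s0). */
static int max_columns_per_output(int k_n, int s0) {
    return (k_n + s0 - 1) / s0;
}
```

For K = 4 and s0 = 2 an interior sample reads two columns and an edge sample reads one. With K < s0, some samples read none and are left at zero.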
The existing ggml_conv_transpose_1d takes the naive route: a single direct kernel that folds the IC contraction into the scatter and rescans the full input per output element. For a vocoder generator running this op many times per second over long sequences, that is the bottleneck. The GEMM + col2im split removes it and unlocks F16 / BF16 / quantized weights.
This is the upsampling primitive used across three downstream GGML projects: acestep.cpp (music generation), omnivoice.cpp (multilingual TTS) and qwentts.cpp (Qwen3-TTS 12Hz DAC decoder). It pairs with GGML_OP_SNAKE (the other half of the AMP block), and as with Snake the implementation has been exercised and validated on all backends by others before upstreaming.
Additional information
GGML_OP_COL2IM_1D for CPU, F32 / F16 / BF16 with an F32 accumulator, parallelized over output channels.
Backend coverage is already in place in test-backend-ops: eight geometries across all three types, including the canonical kernel = 2*stride upsampling shape, kernel < stride (gap positions are zeroed), kernel not a multiple of stride, both cropping variants and the single column edge case, plus three perf entries at real vocoder stage shapes reporting memory bandwidth. The CUDA follow-up will be validated against this grid with zero additional test code.
Requirements