
Ggml/cpu col2im 1d #24206

Merged
ggerganov merged 6 commits into ggml-org:master from ServeurpersoCom:ggml/cpu-col2im_1d on Jun 9, 2026

@ServeurpersoCom

Contributor

Overview

cpu: add GGML_OP_COL2IM_1D

CPU part of #23424, split per review feedback; the CUDA backend follows in a separate PR.

Modern neural audio vocoders (the BigVGAN family and its descendants) build their generator from upsampling blocks: a transposed 1D convolution followed by an AMP / Snake stack. The transposed conv is the upsampler, Snake (#22667) is the periodic activation, and both sit on the hot path of every generated frame.

A ConvTranspose1d factorizes exactly as a GEMM followed by an overlap-add:

    columns = mul_mat(weight[IC, K*OC], input[IC, T_in])  -> [K*OC, T_in]
    signal  = col2im_1d(columns)                          -> [T_out, OC]
    with T_out = (T_in - 1)*s0 + K - 2*p0
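
The factorization can be checked on a toy case. The following is a minimal pure-Python sketch, not the ggml implementation: tensors are nested lists, the `cols[t][k][oc]` indexing is an illustrative assumption that does not model ggml's flat memory layout, and all function names are hypothetical. Integer-valued inputs keep the comparison exact even though the two paths sum in a different order.

```python
def conv_transpose_1d_direct(x, w, s0, p0):
    """Direct transposed conv: x[ic][t], w[ic][k][oc] -> y[oc][t_out]."""
    IC, T_in = len(x), len(x[0])
    K, OC = len(w[0]), len(w[0][0])
    T_out = (T_in - 1) * s0 + K - 2 * p0
    y = [[0.0] * T_out for _ in range(OC)]
    for ic in range(IC):
        for t in range(T_in):
            for k in range(K):
                pos = t * s0 + k - p0  # output position after cropping p0
                if 0 <= pos < T_out:
                    for oc in range(OC):
                        y[oc][pos] += x[ic][t] * w[ic][k][oc]
    return y


def gemm_columns(x, w):
    """Channel contraction only: cols[t][k][oc] = sum over ic of x * w."""
    IC, T_in = len(x), len(x[0])
    K, OC = len(w[0]), len(w[0][0])
    return [[[sum(x[ic][t] * w[ic][k][oc] for ic in range(IC))
              for oc in range(OC)]
             for k in range(K)]
            for t in range(T_in)]


def col2im_1d_scatter(cols, s0, p0):
    """Overlap-add of the column matrix into y[oc][t_out]."""
    T_in, K, OC = len(cols), len(cols[0]), len(cols[0][0])
    T_out = (T_in - 1) * s0 + K - 2 * p0
    y = [[0.0] * T_out for _ in range(OC)]
    for t in range(T_in):
        for k in range(K):
            pos = t * s0 + k - p0
            if 0 <= pos < T_out:
                for oc in range(OC):
                    y[oc][pos] += cols[t][k][oc]
    return y


# Toy shapes (made up for illustration): IC=3, T_in=4, K=4, OC=2, s0=2, p0=1.
x = [[float((ic + 1) * (t + 1)) for t in range(4)] for ic in range(3)]
w = [[[float(ic - k + 2 * oc) for oc in range(2)] for k in range(4)]
     for ic in range(3)]
direct = conv_transpose_1d_direct(x, w, s0=2, p0=1)
split = col2im_1d_scatter(gemm_columns(x, w), s0=2, p0=1)
```

Here `direct` and `split` hold the same values, and each output row has length (4 - 1)*2 + 4 - 2*1 = 8, matching the T_out formula.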

Keeping the channel contraction inside ggml_mul_mat lets it ride the tuned (and quantizable) matmul kernels and tensor cores, leaving col2im_1d as a thin, memory-bound overlap-add: each output sample reads at most ceil(K/s0) column entries.
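
To see where the ceil(K/s0) bound comes from, here is a small gather-style sketch in pure Python. It is an illustration under assumptions, not the kernel from this PR: the `cols[t][k][oc]` nested-list layout and the function names are hypothetical. For each output position only the taps `k` congruent to the position modulo `s0` can contribute, so the inner loop steps by `s0`.

```python
import math


def col2im_1d_gather(cols, s0, p0):
    """Gather overlap-add: cols[t][k][oc] -> (y[oc][t_out], max reads per output)."""
    T_in, K, OC = len(cols), len(cols[0]), len(cols[0][0])
    T_out = (T_in - 1) * s0 + K - 2 * p0
    y = [[0.0] * T_out for _ in range(OC)]
    max_reads = 0
    for oc in range(OC):
        for t_out in range(T_out):
            q = t_out + p0  # position in the uncropped signal
            acc, reads = 0.0, 0
            # Only taps with k congruent to q modulo s0 can land on q.
            for k in range(q % s0, K, s0):
                t = (q - k) // s0
                if 0 <= t < T_in:
                    acc += cols[t][k][oc]
                    reads += 1
            y[oc][t_out] = acc
            max_reads = max(max_reads, reads)
    return y, max_reads


def ones_cols(T_in, K):
    """All-ones column matrix with a single output channel."""
    return [[[1.0] for _ in range(K)] for _ in range(T_in)]


# kernel = 2*stride: every output reads at most ceil(4/2) entries.
overlap, overlap_reads = col2im_1d_gather(ones_cols(3, 4), s0=2, p0=0)
bound = math.ceil(4 / 2)
# kernel < stride: positions no tap reaches stay at zero.
gaps, gap_reads = col2im_1d_gather(ones_cols(3, 1), s0=3, p0=0)
```

With all-ones columns the output simply counts how many column entries overlap at each position, which makes the overlap pattern and the zeroed gaps easy to inspect.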

The existing ggml_conv_transpose_1d takes the naive route: a single direct kernel that folds the IC contraction into the scatter and rescans the full input per output element. For a vocoder generator running this op many times per second over long sequences, that is the bottleneck. The GEMM + col2im split removes it and unlocks F16 / BF16 / quantized weights.

This is the upsampling primitive used across three downstream GGML projects: acestep.cpp (music generation), omnivoice.cpp (multilingual TTS) and qwentts.cpp (Qwen3-TTS 12Hz DAC decoder). It pairs with GGML_OP_SNAKE (the other half of the AMP block), and as with Snake the implementation has been exercised and validated on all backends by others before upstreaming.

Additional information

GGML_OP_COL2IM_1D for CPU, F32 / F16 / BF16 with an F32 accumulator, parallelized over output channels.

Backend coverage is already in place in test-backend-ops: eight geometries across all three types, including the canonical kernel = 2*stride upsampling shape, kernel < stride (gap positions are zeroed), kernel not a multiple of stride, both cropping variants and the single column edge case, plus three perf entries at real vocoder stage shapes reporting memory bandwidth. The CUDA follow-up will be validated against this grid with zero additional test code.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES (Opus / MCP, rootless container with Nvidia GPU)

Add the overlap-add (scatter-add) step of a 1D transposed convolution.
A ConvTranspose1d factorizes as a GEMM followed by col2im: a weight
pre-permuted to [IC, K*OC] is contracted against the [IC, T_in] input
with mul_mat to produce a column matrix [K*OC, T_in], and col2im_1d
scatters those columns back into the [T_out, OC] signal, with
T_out = (T_in - 1)*s0 + K - 2*p0.

Keeping the contraction as a plain mul_mat leaves the heavy work on the
optimized (and quantizable) matmul kernels, so col2im_1d only does the
cheap overlap-add.

CPU uses a gather formulation parallelized over output channels,
supporting F32, F16 and BF16 with an F32 accumulator.

Add test_col2im_1d next to the conv_transpose_1d cases, covering F32,
F16 and BF16 across eight geometries: the canonical kernel = 2*stride
DAC upsampling shape, overlap, no overlap, cropping (p0 = 1 and
p0 = stride/2), kernel < stride with zeroed gaps, kernel not a
multiple of stride, and a single column unfold.

Perf mode gets three real vocoder stage shapes reporting memory
bandwidth. max_nmse_err relaxes to 5e-4 for F16 and BF16.
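
For a feel of the error that 16-bit storage alone introduces, the sketch below computes a normalized mean squared error between F32 values and their BF16 versions. Several things are assumptions: NMSE is taken as squared error divided by the reference energy (the exact formula used by test-backend-ops is not shown in this thread), BF16 conversion is done by truncating the float32 bit pattern for simplicity (real conversions typically round), and the sample data is made up.

```python
import struct


def to_f32(v):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", v))[0]


def to_bf16_truncate(v):
    """Keep the top 16 bits of the float32 pattern (truncation, no rounding)."""
    bits = struct.unpack("<I", struct.pack("<f", v))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def nmse(ref, out):
    """Squared error normalized by the energy of the reference (assumed definition)."""
    err = sum((r - o) ** 2 for r, o in zip(ref, out))
    return err / sum(r * r for r in ref)


# Made-up sample signal, stored as float32 then truncated to BF16.
ref = [to_f32(0.1 * i - 3.7) for i in range(200)]
out = [to_bf16_truncate(v) for v in ref]
bf16_err = nmse(ref, out)
```

Truncation keeps 8 significant bits, so each element's relative error stays below 2^-7 and the NMSE stays below 2^-14, comfortably under a 5e-4 tolerance.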
github-actions bot added the "testing" (Everything test related) and "ggml" (changes relating to the ggml tensor library for machine learning) labels on Jun 5, 2026
@ServeurpersoCom

Contributor Author

CPU benchmark with 3 real-world use cases:

col2im-cpu-test.cpp.txt

root@pod:/mnt/workspace/col2im-bench# ./col2im-cpu-test
col2im-cpu-test: threads = 32, iters = 5
full ConvTranspose chain per decode, F32 weights, min over iters, all paths fed their preferred layout
MUL_MAT+COL2IM_1D: this PR, ConvTranspose1d as GEMM + overlap-add
MUL_MAT+IM2COL_BACK: existing GGML trick, strictly equivalent mathematically, but ugly (backward op, dummy kernel tensor, F32 CPU only)
CONV_TRANSPOSE_1D: the naive op traditionally used by every project unaware it can be accelerated

acestep.cpp T0 = 1024 (81.9s)

MUL_MAT+COL2IM_1D = 689.083 ms, Cosine Similarity = 0.999999999998
MUL_MAT+IM2COL_BACK = 743.477 ms (1.08x), Cosine Similarity = 0.999999999998
CONV_TRANSPOSE_1D = 2413.170 ms (3.50x), Cosine Similarity reference

qwentts.cpp T0 = 1500 (30.0s)

MUL_MAT+COL2IM_1D = 200.650 ms, Cosine Similarity = 0.999999999998
MUL_MAT+IM2COL_BACK = 273.453 ms (1.36x), Cosine Similarity = 0.999999999998
CONV_TRANSPOSE_1D = 582.309 ms (2.90x), Cosine Similarity reference

omnivoice.cpp T0 = 750 (30.0s)

MUL_MAT+COL2IM_1D = 67.129 ms, Cosine Similarity = 0.999999999999
MUL_MAT+IM2COL_BACK = 93.729 ms (1.40x), Cosine Similarity = 0.999999999999
CONV_TRANSPOSE_1D = 295.380 ms (4.40x), Cosine Similarity reference
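
The ratios in parentheses follow directly from the timings, each being the slower path's time divided by the MUL_MAT+COL2IM_1D time. The sketch below recomputes them from the numbers in the log; the dictionary keys and the `cosine_similarity` helper are hypothetical names, and the helper is only the textbook definition, not necessarily how the attached benchmark computes it.

```python
import math

# Timings in milliseconds, copied from the benchmark output above.
timings_ms = {
    "acestep.cpp": {"col2im": 689.083, "im2col_back": 743.477, "conv_transpose": 2413.170},
    "qwentts.cpp": {"col2im": 200.650, "im2col_back": 273.453, "conv_transpose": 582.309},
    "omnivoice.cpp": {"col2im": 67.129, "im2col_back": 93.729, "conv_transpose": 295.380},
}

# Slowdown of each alternative relative to MUL_MAT+COL2IM_1D, to two decimals.
ratios = {
    name: (round(t["im2col_back"] / t["col2im"], 2),
           round(t["conv_transpose"] / t["col2im"], 2))
    for name, t in timings_ms.items()
}


def cosine_similarity(a, b):
    """Textbook cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


for name, (vs_im2col_back, vs_conv_transpose) in ratios.items():
    print(f"{name}: {vs_im2col_back:.2f}x vs IM2COL_BACK, "
          f"{vs_conv_transpose:.2f}x vs CONV_TRANSPOSE_1D")
```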

ggml_col2im_1d validates s0, oc, p0 and input contiguity at graph
build time, before the oc division, protecting every backend at once.
The kernel asserts the contiguity its flat indexing assumes and its
doc states the full output length including the crop term.
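
As a rough picture of why the checks must come before the oc division, here is a hypothetical shape-inference helper in Python. It is not the ggml code: the function name, the exact set of checks and the error messages are assumptions (the divisibility and positive-length checks in particular are additions for the sketch), and contiguity is not modeled at all.

```python
def col2im_1d_out_shape(rows, t_in, s0, oc, p0):
    """Hypothetical check for a [rows = K*OC, T_in] column matrix.

    Returns (T_out, OC) with T_out = (T_in - 1)*s0 + K - 2*p0.
    """
    if s0 <= 0:
        raise ValueError("stride s0 must be positive")
    if oc <= 0:
        # Checked before dividing by oc below.
        raise ValueError("oc must be positive")
    if p0 < 0:
        raise ValueError("padding p0 must be non-negative")
    if rows % oc != 0:
        raise ValueError("column rows must be a multiple of oc")
    k = rows // oc  # kernel size recovered from the column matrix
    t_out = (t_in - 1) * s0 + k - 2 * p0
    if t_out <= 0:
        raise ValueError("crop removes the whole output")
    return t_out, oc


# Made-up shapes: K*OC = 8 rows with OC = 2 gives K = 4.
regular = col2im_1d_out_shape(rows=8, t_in=4, s0=2, oc=2, p0=1)
# A heavy crop that leaves only two output samples.
tiny = col2im_1d_out_shape(rows=4, t_in=2, s0=2, oc=1, p0=2)
```

Doing the check once where the graph node is created means an invalid `oc` is reported as a clear error instead of surfacing later as a division by zero inside some backend.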

The kernel parallelizes over the time axis: the split stays balanced
down to OC = 1, where the previous channel split was single threaded.
Values are bit identical on the three real vocoder chains, two out of
three improve.
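
The balance argument can be illustrated with a ceil-division work split, a common way to hand each thread a contiguous chunk. This is a sketch under assumptions: whether the kernel partitions exactly this way is not stated in the thread, and the function names and the example sizes are hypothetical.

```python
def thread_range(n, ith, nth):
    """Contiguous chunk [lo, hi) of n work items for thread ith out of nth."""
    per = (n + nth - 1) // nth  # ceil(n / nth)
    lo = min(per * ith, n)
    hi = min(lo + per, n)
    return lo, hi


def busy_threads(n, nth):
    """Number of threads that receive a non-empty chunk."""
    count = 0
    for ith in range(nth):
        lo, hi = thread_range(n, ith, nth)
        if lo < hi:
            count += 1
    return count


# Made-up mono output stage: OC = 1, T_out = 4096, 32 threads.
channel_split_busy = busy_threads(1, 32)    # split over output channels
time_split_busy = busy_threads(4096, 32)    # split over the time axis
```

With a gather formulation each output sample is written by exactly one thread, so splitting the time axis needs no synchronization between chunks.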

The eval grid grows to eleven geometries: OC = 1 (mono output stage),
K = 1 with stride > 1 (sparse scatter, every gap position zeroed) and
a crop down to T_out = 2 where all the gather bounds act at once.

tests/test-col2im-1d.cpp proves mul_mat + col2im_1d matches the
native ggml_conv_transpose_1d on the CPU backend, F32 bit exact, F16
and BF16 through casts of the column matrix. test-backend-ops cannot
cover this for a CPU only op since the CPU backend is its own
reference there.
@ggerganov

Member

Need to fix the RPC: https://github.com/ggml-org/llama.cpp/actions/runs/27053569438/job/79853460419?pr=24206#step:7:126

ggerganov self-assigned this on Jun 8, 2026
@ggerganov

Member

I think you just need to bump the RPC protocol patch version.

@ServeurpersoCom

Contributor Author

Yes, I remember doing that back when Snake was a separate operator and not a graph fusion pattern. I had launched the entire CI locally, which I forgot to do on this PR. I'll fix that quickly.

GGML_OP_COUNT goes from 96 to 97 with the new op, which trips the
static_assert in ggml-rpc.h. Bump RPC_PROTO_PATCH_VERSION since the
op is appended and no existing op code shifts.
ggerganov merged commit 2602169 into ggml-org:master on Jun 9, 2026
25 of 27 checks passed