CUDA: handle OW > 65535 in im2col (2D and 3D)#22944
Conversation
`im2col_cuda` and `im2col_3d_cuda` both dispatch with `block_nums.y = OW`. CUDA caps grid Y at 65535. Conv1d encoders on raw 16 kHz audio with T > 65535 (~ 4 s) trip the limit -- e.g. SEANet at 11 s lands at OW = 176000 -- and the launch returns `invalid configuration argument`. Clamp `block_nums.y` to `MIN(OW, MAX_GRIDDIM_Y)` and loop inside the kernel with stride `MAX_GRIDDIM_Y`. Same in-kernel stride pattern already used for the z axis (`MAX_GRIDDIM_Z`). Both 2D `im2col_kernel` and 3D `im2col_3d_kernel` need the same fix. Bit-identical for OW <= 65535 (single iteration of the new outer loop). Tested on T4 / Jetson Orin with a SEANet encoder running on 11 s / 16 kHz audio (im2col reaching OW ~ 176000); pre-fix launch returns `invalid configuration argument`, post-fix runs to completion. Existing test-backend-ops im2col cases unchanged.
@CISC closed our ggml-org/ggml#1485 with the steer that CUDA changes belong in ggml-org/llama.cpp instead. The src/ggml-cuda/ commit log on ggml master confirms it — 100% (llama/NNNNN) sync commits, no direct CUDA PRs in months. ggml's own README points to llama.cpp as the development source-of-truth. Refiled the im2col OW > 65535 fix as ggml-org/llama.cpp#22944. Tracker updates: - tools/upstream-prs/README.md: new "Which repo to file against" routing table (CUDA / Vulkan → llama.cpp; CPU + ggml.c → ggml; Metal works either way); per-repo title conventions; flag for llama.cpp's stricter AI-content policy. - UPSTREAM.md: log entry for the redirect. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This comment was marked as resolved.
This comment was marked as resolved.
|
This commit is suspiciously close, almost identical, and was committed ~14 hours earlier. Might be a coincidence, also might not. |
|
@CrispStrobe can you comment on this? cc: @leejet |
|
well i had this patch from commit 1552434d at 2026-04-23 13:24 +0200 in CrispASR repo which is MIT licensed so it is perfectly ok for me when others post such patches upstream earlier than me - or just find it indepedently (it is the simplest straightforward application of the MAX_GRIDDIM_Z pattern already in im2col.cu anyway), i don't know which, and am ok with either of course. |
|
@CrispStrobe Thanks for clarifying, looks good and significantly older (with dated gh action too). |
|
Thank you for clarifying. For context, it's usually fine to take commits from other repositories but this should be disclosed. |
|
@JohannesGaessler i am not sure if you mean i should have said that i already have this patch in another of my repos. if so, that is fine by me. i would have sent this earlier but you have that rule for only one PR at a time from newbies, and another patch was only recently accepted... (and i am sitting on a few more...) |
|
If you did not write the code yourself, just add a short blurb like "I took this code from ...". |
|
ah well but i did write it of course (2 weeks ago). |
|
Ah sorry, I misread your earlier, clarifying post. |
im2col_cudaandim2col_3d_cudaboth dispatch withblock_nums.y = OW. CUDA caps grid Y at 65535. Conv1d encoders onraw 16 kHz audio with T > 65535 (~ 4 s) trip the limit — e.g. SEANet
at 11 s lands at OW = 176000 — and the launch returns
invalid configuration argument.Fix: clamp
block_nums.ytoMIN(OW, MAX_GRIDDIM_Y)and loop insidethe kernel with stride
MAX_GRIDDIM_Y. Same in-kernel stride patternalready used for the z axis (
MAX_GRIDDIM_Z). Both 2Dim2col_kerneland 3D
im2col_3d_kernelneed the same fix. Bit-identical forOW ≤ 65535 (single iteration of the new outer loop).
Verification
Tested on T4 / Jetson Orin with a SEANet encoder running on 11 s /
16 kHz audio (im2col reaching OW ≈ 176000); pre-fix launch returns
invalid configuration argument, post-fix runs to completion.Existing
test-backend-opsim2col cases unchanged.Requirements