CUDA: handle OW > 65535 in im2col (2D and 3D)#22944

Merged
JohannesGaessler merged 1 commit into ggml-org:master from CrispStrobe:cuda-im2col-ow on May 11, 2026

Conversation

@CrispStrobe
Contributor

`im2col_cuda` and `im2col_3d_cuda` both dispatch with
`block_nums.y = OW`, but CUDA caps the grid's y dimension at 65535.
Conv1d encoders on raw 16 kHz audio with T > 65535 samples (about 4 s)
trip the limit; SEANet at 11 s, for example, lands at OW = 176000, and
the launch fails with `invalid configuration argument`.

Fix: clamp `block_nums.y` to `MIN(OW, MAX_GRIDDIM_Y)` and loop inside
the kernel with stride `MAX_GRIDDIM_Y`, the same in-kernel stride
pattern already used for the z axis (`MAX_GRIDDIM_Z`). Both the 2D
`im2col_kernel` and the 3D `im2col_3d_kernel` need the same fix. The
result is bit-identical for OW ≤ 65535 (a single iteration of the new
outer loop).
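For illustration, a minimal host-side C++ emulation of the clamp-and-stride pattern described above (the function and constant names below are illustrative stand-ins, not the actual kernel code): each emulated block starts at its `blockIdx.y` and strides by `MAX_GRIDDIM_Y`, so every output column in 0..OW-1 is visited exactly once whether or not OW exceeds the grid-Y cap.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative stand-in for CUDA's grid-Y limit.
static const int64_t MAX_GRIDDIM_Y = 65535;

// Host-side emulation (not the real kernel): count how often each output
// column is visited under the clamped launch + in-kernel stride loop.
std::vector<int> cover_columns(int64_t OW) {
    const int64_t grid_y = OW < MAX_GRIDDIM_Y ? OW : MAX_GRIDDIM_Y; // clamped block_nums.y
    std::vector<int> visits(OW, 0);
    for (int64_t block_y = 0; block_y < grid_y; ++block_y) {  // emulated blockIdx.y
        // In-kernel outer loop with stride MAX_GRIDDIM_Y: exactly one
        // iteration per block when OW <= 65535, hence bit-identical results.
        for (int64_t ow = block_y; ow < OW; ow += MAX_GRIDDIM_Y) {
            visits[ow] += 1;
        }
    }
    return visits;
}
```

Running this for OW = 176000 (the SEANet case) shows each column covered exactly once, and for OW ≤ 65535 the inner loop executes a single iteration per block.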

Verification

Tested on T4 / Jetson Orin with a SEANet encoder running on 11 s of
16 kHz audio (im2col reaching OW ≈ 176000); the pre-fix launch returns
`invalid configuration argument`, post-fix it runs to completion.
Existing `test-backend-ops` im2col cases are unchanged.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: AI assisted (mechanical port, formatting, description draft) in re-applying an existing fork patch to llama.cpp's tree.

@CrispStrobe CrispStrobe requested a review from a team as a code owner May 11, 2026 13:14
CrispStrobe added a commit to CrispStrobe/CrispASR that referenced this pull request May 11, 2026
@CISC closed our ggml-org/ggml#1485 with the steer that CUDA changes
belong in ggml-org/llama.cpp instead. The src/ggml-cuda/ commit log on
ggml master confirms it — 100% (llama/NNNNN) sync commits, no direct
CUDA PRs in months. ggml's own README points to llama.cpp as the
development source-of-truth.

Refiled the im2col OW > 65535 fix as ggml-org/llama.cpp#22944.

Tracker updates:
- tools/upstream-prs/README.md: new "Which repo to file against"
  routing table (CUDA / Vulkan → llama.cpp; CPU + ggml.c → ggml;
  Metal works either way); per-repo title conventions; flag for
  llama.cpp's stricter AI-content policy.
- UPSTREAM.md: log entry for the redirect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions bot added labels Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) on May 11, 2026
@Green-Sky
Collaborator

leejet/ggml@7f4ab36

This commit is suspiciously close, almost identical, and was committed ~14 hours earlier.

Might be a coincidence, also might not.

@JohannesGaessler
Contributor

@CrispStrobe can you comment on this?

cc: @leejet

@CrispStrobe
Contributor Author

CrispStrobe commented May 11, 2026

well i had this patch from commit 1552434d at 2026-04-23 13:24 +0200 in the CrispASR repo, which is MIT licensed, so it is perfectly ok for me when others post such patches upstream earlier than me - or just find them independently (it is the simplest straightforward application of the `MAX_GRIDDIM_Z` pattern already in `im2col.cu` anyway). i don't know which, and am ok with either of course.

@Green-Sky
Collaborator

@CrispStrobe Thanks for clarifying; your commit looks good and is significantly older (with a dated GH Actions run too).

@JohannesGaessler JohannesGaessler merged commit 8e1f9d0 into ggml-org:master May 11, 2026
47 checks passed
@JohannesGaessler
Contributor

Thank you for clarifying. For context, it's usually fine to take commits from other repositories but this should be disclosed.

@CrispStrobe
Contributor Author

@JohannesGaessler i am not sure if you mean i should have said that i already have this patch in another of my repos. if so, that is fine by me. i would have sent this earlier, but you have that rule of only one PR at a time from newbies, and another patch of mine was only recently accepted... (and i am sitting on a few more...)

@JohannesGaessler
Contributor

If you did not write the code yourself, just add a short blurb like "I took this code from ...".

@CrispStrobe
Contributor Author

CrispStrobe commented May 11, 2026

ah well but i did write it of course (2 weeks ago).

@JohannesGaessler
Contributor

JohannesGaessler commented May 11, 2026

Ah sorry, I misread your earlier, clarifying post.
