Collaborator

@lhez lhez commented Jan 20, 2026

The specialized kqv kernel for Adreno wraps dst in an image1d_buffer_t so that it can use the texture processor. However, image1d_buffer_t has a maximum size, and that limit can be reached when the KV cache holds enough tokens.

For example, giving gpt-oss-20b a prompt of about 4k tokens using llama-completion with -ub 1024 hits the limit and results in an assertion failure.

This PR falls back to the general floating-point mm in that case, copying src0 and src1 into contiguous buffers if necessary. I believe the Vulkan backend uses the same approach. A check is added to ensure that the arguments are within the image1d_buffer_t limit before the specialized kqv kernel is used.
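For illustration only, the kind of guard described above could be sketched as follows. This is not the code from this PR: `fits_image1d_buffer`, `choose_mm_path`, and the `mm_path` enum are hypothetical names, and packing 4 elements per pixel is an assumption. The only OpenCL fact relied on is that a device reports the maximum pixel count of a buffer-backed 1D image through `CL_DEVICE_IMAGE_MAX_BUFFER_SIZE` (queried with `clGetDeviceInfo`); here that value is simply passed in as `max_pixels`.

```cpp
#include <cstdint>

// Hypothetical helper (not from the PR): returns true if a tensor with
// n_elems elements fits in a buffer-backed 1D image.
//   elems_per_pixel: elements packed into one pixel (assumed, e.g. 4 for RGBA)
//   max_pixels:      value the device reports for CL_DEVICE_IMAGE_MAX_BUFFER_SIZE
constexpr bool fits_image1d_buffer(uint64_t n_elems,
                                   uint64_t elems_per_pixel,
                                   uint64_t max_pixels) {
    if (elems_per_pixel == 0) {
        return false;
    }
    // Round up to whole pixels without risking overflow on the addition.
    const uint64_t n_pixels =
        n_elems / elems_per_pixel + (n_elems % elems_per_pixel != 0 ? 1 : 0);
    return n_pixels <= max_pixels;
}

// Hypothetical dispatch choice: which matmul path to take.
enum class mm_path { specialized_kqv, general_mm };

// Use the image-based kernel only when dst fits in the image limit;
// otherwise fall back to the general path.
constexpr mm_path choose_mm_path(uint64_t dst_elems,
                                 uint64_t elems_per_pixel,
                                 uint64_t max_pixels) {
    return fits_image1d_buffer(dst_elems, elems_per_pixel, max_pixels)
               ? mm_path::specialized_kqv
               : mm_path::general_mm;
}
```

Doing the check at dispatch time turns an oversized dst into a slower but valid path rather than an assertion failure.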

@github-actions github-actions bot added the ggml and OpenCL labels Jan 20, 2026
@lhez lhez marked this pull request as ready for review January 21, 2026 19:07
@lhez lhez requested a review from max-krasnyansky as a code owner January 21, 2026 19:07
Collaborator

@max-krasnyansky max-krasnyansky left a comment

Looks good.
I haven't had a chance to try it on my setup yet. I'm testing some Hexagon changes and will give this a go along with that.

@max-krasnyansky max-krasnyansky merged commit 9c96465 into ggml-org:master Jan 22, 2026
147 of 149 checks passed