ggml-cuda : use int64 for norm forward dst offset (match rms_norm_back) by palios-taey · Pull Request #24214 · ggml-org/llama.cpp

palios-taey · 2026-06-05T21:18:20Z

The forward norm kernels (norm_f32, rms_norm_f32, l2_norm_f32) compute the destination offset in 32-bit int:

dst += ((sample*nchannels + channel)*nrows + row)*ncols;

nrows/nchannels come from gridDim, and all factors are int, so for a tensor with more than INT_MAX elements the product overflows and the kernel writes out of bounds. The matching backward kernel rms_norm_back_f32 already guards this with int64_t:

dst  += int64_t(row)*ncols;

This promotes the three forward offsets to int64_t the same way, by casting the first factor:

dst += ((int64_t(sample)*nchannels + channel)*nrows + row)*ncols;

The source-pointer arithmetic already uses int64_t strides, so only the destination offset needed it. No behavior change for normal sizes; this just makes the forward path consistent with the backward kernel and removes the latent overflow on very large tensors.

AI usage disclosure: AI-assisted finding the overflow (fuzzing the cuda norm ops under compute-sanitizer) and drafting the fix. Reviewed it, own it.

The forward norm kernels computed ((sample*nchannels+channel)*nrows+row)*ncols in 32-bit int, overflowing for tensors with >INT_MAX elements. rms_norm_back_f32 already uses int64_t for this; promote the forward offset to match. Signed-off-by: Jesse LaRose <jesse@taey.ai>

ggml-gh-bot · 2026-06-05T21:22:25Z

Hi @palios-taey, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 3 open PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

palios-taey · 2026-06-06T01:00:43Z

Going to close this one. On a closer look, the forward norm_f32 / rms_norm_f32 destination offset doesn't appear reachable through the tensor shapes a real model graph actually produces — the activation dimensions stay well under INT_MAX — so this would be guarding an unreachable path rather than fixing a live issue, and I'd rather not add churn for that. If I come across a shape that genuinely hits it I'll open a focused PR. Thanks!

palios-taey requested a review from a team as a code owner June 5, 2026 21:18

github-actions Bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Jun 5, 2026

palios-taey closed this Jun 6, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ggml-cuda : use int64 for norm forward dst offset (match rms_norm_back)#24214

ggml-cuda : use int64 for norm forward dst offset (match rms_norm_back)#24214
palios-taey wants to merge 1 commit into
ggml-org:masterfrom
palios-taey:norm-int64-offset

palios-taey commented Jun 5, 2026

Uh oh!

ggml-gh-bot Bot commented Jun 5, 2026

Uh oh!

palios-taey commented Jun 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

palios-taey commented Jun 5, 2026

Uh oh!

ggml-gh-bot Bot commented Jun 5, 2026

Uh oh!

palios-taey commented Jun 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant