Skip to content

ggml-cuda : use int64 for norm forward dst offset (match rms_norm_back)#24214

Closed
palios-taey wants to merge 1 commit into
ggml-org:masterfrom
palios-taey:norm-int64-offset
Closed

ggml-cuda : use int64 for norm forward dst offset (match rms_norm_back)#24214
palios-taey wants to merge 1 commit into
ggml-org:masterfrom
palios-taey:norm-int64-offset

Conversation

@palios-taey
Copy link
Copy Markdown

The forward norm kernels (norm_f32, rms_norm_f32, l2_norm_f32) compute the destination offset in 32-bit int:

dst += ((sample*nchannels + channel)*nrows + row)*ncols;

nrows/nchannels come from gridDim, and all factors are int, so for a tensor with more than INT_MAX elements the product overflows and the kernel writes out of bounds. The matching backward kernel rms_norm_back_f32 already guards this with int64_t:

dst  += int64_t(row)*ncols;

This promotes the three forward offsets to int64_t the same way, by casting the first factor:

dst += ((int64_t(sample)*nchannels + channel)*nrows + row)*ncols;

The source-pointer arithmetic already uses int64_t strides, so only the destination offset needed it. No behavior change for normal sizes; this just makes the forward path consistent with the backward kernel and removes the latent overflow on very large tensors.

AI usage disclosure: AI-assisted finding the overflow (fuzzing the cuda norm ops under compute-sanitizer) and drafting the fix. Reviewed it, own it.

The forward norm kernels computed ((sample*nchannels+channel)*nrows+row)*ncols
in 32-bit int, overflowing for tensors with >INT_MAX elements. rms_norm_back_f32
already uses int64_t for this; promote the forward offset to match.

Signed-off-by: Jesse LaRose <jesse@taey.ai>
@palios-taey palios-taey requested a review from a team as a code owner June 5, 2026 21:18
@github-actions github-actions Bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Jun 5, 2026
@ggml-gh-bot
Copy link
Copy Markdown

ggml-gh-bot Bot commented Jun 5, 2026

Hi @palios-taey, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 3 open PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@palios-taey
Copy link
Copy Markdown
Author

Going to close this one. On a closer look, the forward norm_f32 / rms_norm_f32 destination offset doesn't appear reachable through the tensor shapes a real model graph actually produces — the activation dimensions stay well under INT_MAX — so this would be guarding an unreachable path rather than fixing a live issue, and I'd rather not add churn for that. If I come across a shape that genuinely hits it I'll open a focused PR. Thanks!

@palios-taey palios-taey closed this Jun 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ggml changes relating to the ggml tensor library for machine learning Nvidia GPU Issues specific to Nvidia GPUs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant