
Conversation

@YaelLogic
Contributor

Summary

Add SYCL backend support for RMS_NORM_BACK using a single FP32 compensated parallel reduction path.
No changes to the public API. Default numerical accuracy is preserved; a fast opt-in macro is also available.


Implementation

Algorithm (consistent with existing backend behavior; a scalar reference sketch follows the formulas)

  • inv_r = 1 / sqrt( (Σ x²) / D + eps )
  • coeff = − (Σ x·dz) / (Σ x² + D·eps)
  • dx[i] = (dz[i] + coeff * x[i]) * inv_r
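For reference, a minimal scalar sketch of these formulas for one row of length D. This is illustrative only — the function and variable names are not from the PR, and the real backend computes the two sums with the parallel reduction described below:

// Scalar reference for one row (illustrative; not the SYCL kernel).
#include <cmath>
#include <cstdint>

static void rms_norm_back_row_ref(const float * x, const float * dz, float * dx,
                                  int64_t D, float eps) {
    // Σ x² and Σ x·dz over the row
    double sum_xx = 0.0, sum_xdz = 0.0;
    for (int64_t i = 0; i < D; ++i) {
        sum_xx  += (double) x[i] * x[i];
        sum_xdz += (double) x[i] * dz[i];
    }
    const float inv_r = 1.0f / std::sqrt((float)(sum_xx / D) + eps);
    const float coeff = -(float)(sum_xdz / (sum_xx + (double) D * eps));
    for (int64_t i = 0; i < D; ++i) {
        dx[i] = (dz[i] + coeff * x[i]) * inv_r;
    }
}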

What was implemented

  • Per-thread accumulation of Σ x² and Σ x·dz with Kahan-style compensation (see the condensed sketch after this list).
  • Warp (sub_group) reduction via warp_reduce_sum.
  • Cross-warp reduction using local memory (one value per warp) with a single barrier.
  • group_broadcast used to distribute inv_r and coeff across the work-group.
  • Work-group size: multiple of WARP_SIZE, capped by device limit (≤256), not larger than D.
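
A condensed, illustrative sketch of that reduction structure, using standard SYCL 2020 group algorithms in place of ggml's internal warp_reduce_sum helper. The function signature, kernel layout, and names below are assumptions for illustration, not the code in norm.cpp; x, dz, and dx are assumed to be USM device pointers:

// One work-group per row; sketch only.
#include <sycl/sycl.hpp>
#include <cstdint>

static void rms_norm_back_rows(sycl::queue & q, const float * x, const float * dz,
                               float * dx, int64_t nrows, int64_t D, float eps,
                               size_t wg_size /* multiple of sub-group size, <= 256, <= D */) {
    constexpr size_t MAX_WARPS = 32;  // enough for wg_size <= 256 with sub-groups >= 8

    q.submit([&](sycl::handler & cgh) {
        // one partial value per sub-group ("warp") for each running sum
        sycl::local_accessor<float, 1> warp_xx (sycl::range<1>(MAX_WARPS), cgh);
        sycl::local_accessor<float, 1> warp_xdz(sycl::range<1>(MAX_WARPS), cgh);

        cgh.parallel_for(
            sycl::nd_range<1>(sycl::range<1>((size_t) nrows * wg_size), sycl::range<1>(wg_size)),
            [=](sycl::nd_item<1> it) {
                const int64_t row = it.get_group(0);
                const int64_t tid = it.get_local_id(0);
                const float * xr  = x  + row * D;
                const float * dzr = dz + row * D;
                float       * dxr = dx + row * D;

                // per-thread Kahan-compensated accumulation of Σ x² and Σ x·dz
                float sum_xx = 0.0f, c_xx = 0.0f, sum_xdz = 0.0f, c_xdz = 0.0f;
                for (int64_t i = tid; i < D; i += (int64_t) wg_size) {
                    float y = xr[i] * xr[i] - c_xx;
                    float t = sum_xx + y;
                    c_xx    = (t - sum_xx) - y;
                    sum_xx  = t;

                    y       = xr[i] * dzr[i] - c_xdz;
                    t       = sum_xdz + y;
                    c_xdz   = (t - sum_xdz) - y;
                    sum_xdz = t;
                }

                // sub-group ("warp") reduction
                auto sg = it.get_sub_group();
                sum_xx  = sycl::reduce_over_group(sg, sum_xx,  sycl::plus<float>());
                sum_xdz = sycl::reduce_over_group(sg, sum_xdz, sycl::plus<float>());

                // cross-warp reduction: one value per warp in local memory, single barrier
                if (sg.get_local_linear_id() == 0) {
                    warp_xx [sg.get_group_linear_id()] = sum_xx;
                    warp_xdz[sg.get_group_linear_id()] = sum_xdz;
                }
                sycl::group_barrier(it.get_group());

                float inv_r = 0.0f, coeff = 0.0f;
                if (tid == 0) {
                    float total_xx = 0.0f, total_xdz = 0.0f;
                    for (uint32_t w = 0; w < sg.get_group_linear_range(); ++w) {
                        total_xx  += warp_xx[w];
                        total_xdz += warp_xdz[w];
                    }
                    inv_r = 1.0f / sycl::sqrt(total_xx / (float) D + eps);
                    coeff = -total_xdz / (total_xx + (float) D * eps);
                }
                // distribute the two scalars to every work-item in the group
                inv_r = sycl::group_broadcast(it.get_group(), inv_r, 0);
                coeff = sycl::group_broadcast(it.get_group(), coeff, 0);

                for (int64_t i = tid; i < D; i += (int64_t) wg_size) {
                    dxr[i] = (dzr[i] + coeff * xr[i]) * inv_r;
                }
            });
    }).wait();
}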

Optional fast path

  • Define GGML_SYCL_RMS_BACK_FAST to disable compensated summation and use plain FP32 accumulation (sketched below).
  • Default remains high-accuracy compensated mode.
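
A hypothetical sketch of how such a macro switch typically looks, shown on a plain host-side sum for brevity; the real kernel in norm.cpp may structure this differently:

#include <cstddef>

// Compensated (Kahan) summation by default; plain FP32 accumulation when
// GGML_SYCL_RMS_BACK_FAST is defined.
static float sum_squares(const float * v, size_t n) {
#ifdef GGML_SYCL_RMS_BACK_FAST
    float s = 0.0f;                 // plain accumulation: fastest, least accurate
    for (size_t i = 0; i < n; ++i) {
        s += v[i] * v[i];
    }
    return s;
#else
    float s = 0.0f, c = 0.0f;       // c carries the lost low-order bits
    for (size_t i = 0; i < n; ++i) {
        const float y = v[i] * v[i] - c;
        const float t = s + y;
        c = (t - s) - y;
        s = t;
    }
    return s;
#endif
}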

Validation

Focused tests executed locally:

Test Suite                       Result
RMS_NORM_BACK (CPU)              4 / 4 passed
RMS_NORM_BACK (SYCL host/GPU)    4 / 4 passed
Sanity check (NMSE)              ≈ 1e-11

Build is warning-free for this code path.
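
For context, the NMSE figure above is the normalized mean squared error reported by test-backend-ops. A minimal sketch, assuming the standard definition (error energy divided by reference-output energy); this is not code from the PR:

#include <cstddef>

static double nmse(const float * out, const float * ref, size_t n) {
    double err = 0.0, energy = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double d = (double) out[i] - ref[i];
        err    += d * d;
        energy += (double) ref[i] * ref[i];
    }
    return err / energy;
}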


Reproduce (build + test)

# Configure & build with SYCL
cmake -B build -DGGML_SYCL=ON && cmake --build build -j"$(nproc)"

# Focused RMS_NORM_BACK tests
./build/bin/test-backend-ops test -o RMS_NORM_BACK -b CPU
SYCL_DEVICE_FILTER=host ./build/bin/test-backend-ops test -o RMS_NORM_BACK -b SYCL0
./build/bin/test-backend-ops test -o RMS_NORM_BACK -b SYCL0

# Optional fast path (less numerically stable):
# add -DGGML_SYCL_RMS_BACK_FAST to the compiler definitions, e.g.
#   cmake -B build -DGGML_SYCL=ON -DCMAKE_CXX_FLAGS="-DGGML_SYCL_RMS_BACK_FAST"

Files Changed (minimal scope only)

File                              Purpose
ggml/src/ggml-sycl/norm.cpp       Implementation of ggml_sycl_op_rms_norm_back
ggml/src/ggml-sycl/ggml-sycl.cpp  Operation dispatch registration
ggml/src/ggml-sycl/norm.hpp       Function declaration
docs/ops.md                       Mark RMS_NORM_BACK as ✅ for SYCL
docs/ops/SYCL.csv                 Mark RMS_NORM_BACK entries as supported

No unrelated files or personal data included.


Notes & Risks

  • Default path gives high numerical accuracy using compensated FP32 sums.
  • Fast path is fully optional (disabled by default).
  • Reduction order on GPUs is not bitwise-identical to CPU, but produces NMSE ≈ 1e-11.

Reviewers

cc @CISC @NeoZhangJianyu
Looking forward to your feedback. Thanks in advance!

@github-actions bot added the documentation, ggml, and SYCL labels on Oct 27, 2025
@YaelLogic
Contributor Author

This PR is ready for review.
Tagging @CISC and @NeoZhangJianyu — your feedback would be greatly appreciated whenever you have the chance.
Thanks for your work on maintaining and improving the SYCL backend!

Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
Collaborator

@NeoZhangJianyu NeoZhangJianyu left a comment


Good job!

Thank you!

@NeoZhangJianyu
Collaborator

@YaelLogic
Please fix the EditorConfig issue.

@YaelLogic
Contributor Author

Hi @NeoZhangJianyu,
the EditorConfig issue has been fixed.
Please let me know if anything else is needed. Thank you.

@NeoZhangJianyu NeoZhangJianyu merged commit 338074c into ggml-org:master Oct 29, 2025
76 of 79 checks passed