Skip to content

feat(cpu-kernels): add SIMD-based vector operations for flash attention#484

Merged
chenghuaWang merged 4 commits intoUbiquitousLearning:v2from
chenghuaWang:v2
Oct 22, 2025
Merged

feat(cpu-kernels): add SIMD-based vector operations for flash attention#484
chenghuaWang merged 4 commits intoUbiquitousLearning:v2from
chenghuaWang:v2

Conversation

@chenghuaWang
Copy link
Copy Markdown
Collaborator

@chenghuaWang chenghuaWang commented Oct 22, 2025

Introduce Highway-based SIMD implementations for common vector operations including dot product, scalar multiplication, fused multiply-add, and constant filling. These are integrated into the flash attention v2 CPU backend for performance optimization on x86 architectures.

The changes include:

  • New SIMD kernels in impl-any-simd-inl.hpp
  • Dynamic dispatch setup in impl-any-simd-inst.cpp
  • Public interface definitions in impl-any-simd-inst.hpp
  • Template specializations in `impl-any-simd.hpp

Also update MLIR Python bindings flag in install script and fix build dependency version in pyproject.toml. Add initial documentation for MLLM IR and MLIR extensions.

Summary by CodeRabbit

  • New Features

    • Added SIMD-accelerated operations for CPU-based attention computations, improving performance on x86 architectures.
    • Enabled Python bindings for MLIR in the build system.
  • Build

    • Pinned scikit-build-core to version 0.10.0 for reproducible builds.
  • Documentation

    • Updated build documentation with new section heading.

Introduce Highway-based SIMD implementations for common vector operations
including dot product, scalar multiplication, fused multiply-add, and
constant filling. These are integrated into the flash attention v2 CPU
backend for performance optimization on x86 architectures.

The changes include:
- New SIMD kernels in `impl-any-simd-inl.hpp`
- Dynamic dispatch setup in `impl-any-simd-inst.cpp`
- Public interface definitions in `impl-any-simd-inst.hpp`
- Template specializations in `impl-any-simd.hpp

Also update MLIR Python bindings flag in install script and fix build
dependency version in pyproject.toml. Add initial documentation for
MLLM IR and MLIR extensions.
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Oct 22, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This PR introduces SIMD-accelerated FP32 vector operations for Flash Attention 2 on x86 architecture using Highway library, adds dynamic dispatch wrappers bridging template specializations to HWY implementations, updates build configuration to enable MLIR Python bindings, and pins scikit-build-core to a specific version.

Changes

Cohort / File(s) Summary
SIMD Implementation Layer
mllm/backends/cpu/kernels/common/fa2_1/impl-any-simd-inl.hpp
Added four inline SIMD implementations: vector_dot_product_fp32_fp32_fp32, mul_from_const_fp32, fma_const_array_fp32, filled_with_const_fp32. Each uses Highway scalable vectors with scalar fallback for tail elements.
Dynamic Dispatch Wrappers
mllm/backends/cpu/kernels/common/fa2_1/impl-any-simd-inst.hpp, impl-any-simd-inst.cpp
New header and implementation files declaring and defining four wrapper functions (call_vector_dot_product_fp32_fp32_fp32, call_fma_const_array_fp32, call_filled_with_const_fp32, call_mul_from_const_fp32) that route to HWY_DYNAMIC_DISPATCH-enabled variants using foreach_target macro.
X86 Architecture Specializations
mllm/backends/cpu/kernels/common/fa2_1/impl-any-simd.hpp
Added template specializations for __X86ArchTag on VectorDotProduct, MulFromConst, FMAConstArray, and FilledWithConst structs, each forwarding to corresponding call_* wrapper functions.
Build & Configuration
pyproject.toml
Pinned scikit-build-core from >=0.10.0 to ==0.10.0 for reproducible builds.
Build Script
scripts/install_mlir.sh
Added CMake flag -DMLIR_ENABLE_BINDINGS_PYTHON=On to enable MLIR Python bindings during installation.
Documentation
mllm/compile/README.md
Added top-level heading "# MLLM IR and It's MLIR Extensions".

Sequence Diagram

sequenceDiagram
    participant User
    participant Template as Template Specialization<br/>(__X86ArchTag)
    participant Wrapper as Dynamic Dispatch Wrapper<br/>(call_*)
    participant HWY as Highway SIMD Implementation<br/>(vector_dot_product_*)

    User->>Template: Invoke VectorDotProduct::run()
    Template->>Wrapper: call_vector_dot_product_fp32_fp32_fp32()
    Wrapper->>HWY: HWY_DYNAMIC_DISPATCH → best SIMD target
    HWY->>HWY: SIMD loop (Highway vectors)
    HWY->>HWY: Scalar tail handling
    HWY-->>Wrapper: Result
    Wrapper-->>Template: Return
    Template-->>User: Computed result
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Rationale: The diff spans multiple files with heterogeneous changes—SIMD implementations with logic density (FP32 vector operations), new dynamic dispatch infrastructure (HWY_DYNAMIC_DISPATCH patterns), template specializations bridging layers, and build configuration changes. While individual cohorts follow consistent patterns, each requires separate reasoning: SIMD correctness and Highway API usage, dispatch mechanism validation, specialization mapping accuracy, and configuration impact assessment.

Poem

🐰 Whiskers twitch with joy—
SIMD vectors dance on x86,
Dispatch wrappers align,
Templates meet Highway streams,
Carrots faster now! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The pull request title "feat(cpu-kernels): add SIMD-based vector operations for flash attention" directly and accurately describes the primary changes in the changeset. The PR introduces Highway-based SIMD implementations for four key vector operations (dot product, scalar multiplication, fused multiply-add, and constant filling) used by the flash attention v2 CPU backend on x86 architectures, along with the necessary infrastructure files for dynamic dispatch and template specializations. The title is concise, specific, and clearly conveys the main contribution without vague language. Secondary changes to build configuration and documentation are supplementary to this core purpose.

Comment @coderabbitai help to get the list of available commands and usage tips.

@chenghuaWang
Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Oct 22, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (10)
scripts/install_mlir.sh (1)

5-5: Enable Python bindings: OK. Consider script hardening and flag normalization.

CMake treats On/ON equally; use ON for consistency and add strict bash flags.

Apply:

+set -euo pipefail
@@
-cmake -G Ninja ../llvm -DMLIR_ENABLE_BINDINGS_PYTHON=On -DLLVM_ENABLE_PROJECTS=mlir -DLLVM_BUILD_EXAMPLES=ON -DLLVM_TARGETS_TO_BUILD="Native;NVPTX" -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DLLVM_ENABLE_LLD=ON -DLLVM_CCACHE_BUILD=ON -DLLVM_USE_SANITIZER="Address;Undefined" -DMLIR_INCLUDE_INTEGRATION_TESTS=ON 
+cmake -G Ninja ../llvm -DMLIR_ENABLE_BINDINGS_PYTHON=ON -DLLVM_ENABLE_PROJECTS=mlir -DLLVM_BUILD_EXAMPLES=ON -DLLVM_TARGETS_TO_BUILD="Native;NVPTX" -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DLLVM_ENABLE_LLD=ON -DLLVM_CCACHE_BUILD=ON -DLLVM_USE_SANITIZER="Address;Undefined" -DMLIR_INCLUDE_INTEGRATION_TESTS=ON

Optional (if CI lacks auto‑detect): add -DPython3_EXECUTABLE="$(command -v python3)".

mllm/backends/cpu/kernels/common/fa2_1/impl-any-simd-inl.hpp (4)

11-24: Dot product SIMD path looks correct; consider adding restrict and HWY_INLINE for consistency.

Small nits: help optimizer/aliasing and follow Highway style.

Apply:

-void vector_dot_product_fp32_fp32_fp32(const float* lhs, const float* rhs, float* out, size_t len) {
+HWY_INLINE void vector_dot_product_fp32_fp32_fp32(const float* HWY_RESTRICT lhs,
+                                                  const float* HWY_RESTRICT rhs,
+                                                  float* HWY_RESTRICT out, size_t len) {

Ensure the corresponding HWY_EXPORT(vector_dot_product_fp32_fp32_fp32) exists in the inst TU.


26-36: Mul-from-const: add restrict and inline; otherwise LGTM.

-void mul_from_const_fp32(float* from, float c, size_t len) {
+HWY_INLINE void mul_from_const_fp32(float* HWY_RESTRICT from, float c, size_t len) {

37-47: FMA path: add restrict and inline.

-void fma_const_array_fp32(float* acc_o, float acc_s, const float* v_token, size_t len) {
+HWY_INLINE void fma_const_array_fp32(float* HWY_RESTRICT acc_o, float acc_s,
+                                     const float* HWY_RESTRICT v_token, size_t len) {

49-55: Fill-const: add restrict and inline.

-void filled_with_const_fp32(float* a, float v, size_t len) {
+HWY_INLINE void filled_with_const_fp32(float* HWY_RESTRICT a, float v, size_t len) {
mllm/backends/cpu/kernels/common/fa2_1/impl-any-simd-inst.cpp (1)

30-39: Wrappers look correct; consider noexcept for all.

Signatures and HWY_DYNAMIC_DISPATCH usage are good. Consider adding noexcept to the dot and FMA wrappers for consistency and to prevent exceptions crossing ABI.

-HWY_DLLEXPORT void call_vector_dot_product_fp32_fp32_fp32(..., size_t len) {
+HWY_DLLEXPORT void call_vector_dot_product_fp32_fp32_fp32(..., size_t len) noexcept {
   HWY_DYNAMIC_DISPATCH(vector_dot_product_fp32_fp32_fp32)(lhs, rhs, out, len);
 }
-
-HWY_DLLEXPORT void call_fma_const_array_fp32(..., size_t len) {
+HWY_DLLEXPORT void call_fma_const_array_fp32(..., size_t len) noexcept {
   HWY_DYNAMIC_DISPATCH(fma_const_array_fp32)(acc_o, acc_s, v_token, len);
 }
mllm/backends/cpu/kernels/common/fa2_1/impl-any-simd.hpp (4)

6-6: Guard or decouple the Highway include.

If this header is included in builds without Highway in the include path, <hwy/highway.h> will fail. Either gate it or provide a fallback for HWY_RESTRICT.

Option A (guard):

-#include <hwy/highway.h>
+#if defined(MLLM_WITH_HWY)
+#  include <hwy/highway.h>
+#endif

Option B (fallback define placed after includes):

+#ifndef HWY_RESTRICT
+#  if defined(_MSC_VER)
+#    define HWY_RESTRICT __restrict
+#  else
+#    define HWY_RESTRICT __restrict__
+#  endif
+#endif

14-20: Make run() wrappers consistent: restrict, inlining, noexcept, and clearer names.

Unify qualifiers and reduce “easily-swappable parameters” risk; keep names descriptive.

 template<>
 struct VectorDotProduct<__X86ArchTag, mllm_fp32_t, mllm_fp32_t, mllm_fp32_t> {
-  static MLLM_FORCE_INLINE void run(const mllm_fp32_t* __restrict__ __lhs, const mllm_fp32_t* __restrict__ __rhs,
-                                    mllm_fp32_t* __out, size_t len) {
-    call_vector_dot_product_fp32_fp32_fp32(__lhs, __rhs, __out, len);
+  static MLLM_FORCE_INLINE void run(const mllm_fp32_t* HWY_RESTRICT lhs,
+                                    const mllm_fp32_t* HWY_RESTRICT rhs,
+                                    mllm_fp32_t* HWY_RESTRICT out,
+                                    size_t len) noexcept {
+    call_vector_dot_product_fp32_fp32_fp32(lhs, rhs, out, len);
   }
 };

 template<>
 struct MulFromConst<__X86ArchTag, mllm_fp32_t, mllm_fp32_t> {
-  static void run(mllm_fp32_t* from, mllm_fp32_t c, size_t len) { call_mul_from_const_fp32(from, c, len); }
+  static MLLM_FORCE_INLINE void run(mllm_fp32_t* HWY_RESTRICT dst,
+                                    mllm_fp32_t scale,
+                                    size_t len) noexcept {
+    call_mul_from_const_fp32(dst, scale, len);
+  }
 };

 template<>
 struct FMAConstArray<__X86ArchTag, mllm_fp32_t, mllm_fp32_t, mllm_fp32_t> {
-  static void run(mllm_fp32_t* acc_o, mllm_fp32_t acc_s, const mllm_fp32_t* v_token, size_t len) {
-    call_fma_const_array_fp32(acc_o, acc_s, v_token, len);
+  static MLLM_FORCE_INLINE void run(mllm_fp32_t* HWY_RESTRICT acc_out,
+                                    mllm_fp32_t acc_scale,
+                                    const mllm_fp32_t* HWY_RESTRICT vec,
+                                    size_t len) noexcept {
+    call_fma_const_array_fp32(acc_out, acc_scale, vec, len);
   }
 };

 template<>
 struct FilledWithConst<__X86ArchTag, mllm_fp32_t> {
-  static void run(mllm_fp32_t* a, mllm_fp32_t v, size_t len) { call_filled_with_const_fp32(a, v, len); }
+  static MLLM_FORCE_INLINE void run(mllm_fp32_t* HWY_RESTRICT dst,
+                                    mllm_fp32_t value,
+                                    size_t len) noexcept {
+    call_filled_with_const_fp32(dst, value, len);
+  }
 };

Also applies to: 22-25, 27-32, 34-37


14-20: Consider std::span to reduce swappable-parameter risk.

If C++20 is available, change pointer+len pairs to std::span to improve safety and clarity.

Example:

- static MLLM_FORCE_INLINE void run(const mllm_fp32_t* HWY_RESTRICT lhs, const mllm_fp32_t* HWY_RESTRICT rhs,
-                                   mllm_fp32_t* HWY_RESTRICT out, size_t len) noexcept {
-   call_vector_dot_product_fp32_fp32_fp32(lhs, rhs, out, len);
- }
+ static MLLM_FORCE_INLINE void run(std::span<const mllm_fp32_t> lhs,
+                                   std::span<const mllm_fp32_t> rhs,
+                                   std::span<mllm_fp32_t> out) noexcept {
+   call_vector_dot_product_fp32_fp32_fp32(lhs.data(), rhs.data(), out.data(), out.size());
+ }

8-11: Move impl-any-simd-inst.hpp inclusion to implementation file to avoid forcing HWY as transitive public dependency.

The include chain Kernels.hpp → fwd_bshd.hpp → impl-any-simd.hpp → impl-any-simd-inst.hpp exposes HWY (#include <hwy/base.h>) to all downstream consumers via the public re-export marked in Kernels.hpp line 44-45. Moving this include to a .cpp file or private header would prevent downstream code from requiring HWY.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3401c62 and 9ec3596.

📒 Files selected for processing (7)
  • mllm/backends/cpu/kernels/common/fa2_1/impl-any-simd-inl.hpp (1 hunks)
  • mllm/backends/cpu/kernels/common/fa2_1/impl-any-simd-inst.cpp (1 hunks)
  • mllm/backends/cpu/kernels/common/fa2_1/impl-any-simd-inst.hpp (1 hunks)
  • mllm/backends/cpu/kernels/common/fa2_1/impl-any-simd.hpp (1 hunks)
  • mllm/compile/README.md (1 hunks)
  • pyproject.toml (1 hunks)
  • scripts/install_mlir.sh (1 hunks)
🧰 Additional context used
🪛 Clang (14.0.6)
mllm/backends/cpu/kernels/common/fa2_1/impl-any-simd-inst.hpp

[error] 8-8: 'mllm/core/DataTypes.hpp' file not found

(clang-diagnostic-error)


[error] 20-20: variable 'call_fma_const_array_fp32' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)


[error] 23-23: variable 'call_filled_with_const_fp32' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)


[error] 25-25: variable 'call_mul_from_const_fp32' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)

mllm/backends/cpu/kernels/common/fa2_1/impl-any-simd-inl.hpp

[error] 4-4: 'hwy/highway.h' file not found

(clang-diagnostic-error)


[error] 13-13: variable 'd' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 13-13: variable name 'd' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 15-15: variable 'i' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 15-15: variable name 'i' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 21-21: variable 'result' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 26-26: 2 adjacent parameters of 'mul_from_const_fp32' of convertible types are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 26-26: parameter name 'c' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 27-27: variable 'd' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 27-27: variable name 'd' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 28-28: variable name 'vc' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 29-29: variable 'i' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 29-29: variable name 'i' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 38-38: variable 'd' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 38-38: variable name 'd' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 40-40: variable 'i' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 40-40: variable name 'i' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 49-49: parameter name 'a' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 49-49: 2 adjacent parameters of 'filled_with_const_fp32' of convertible types are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 49-49: parameter name 'v' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 50-50: variable 'd' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 50-50: variable name 'd' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 51-51: variable name 'vv' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 52-52: variable 'i' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 52-52: variable name 'i' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)

mllm/backends/cpu/kernels/common/fa2_1/impl-any-simd-inst.cpp

[error] 4-4: 'mllm/backends/cpu/kernels/common/fa2_1/impl-any-simd-inst.hpp' file not found

(clang-diagnostic-error)


[error] 11-11: macro 'HWY_TARGET_INCLUDE' used to declare a constant; consider using a 'constexpr' constant

(cppcoreguidelines-macro-usage,-warnings-as-errors)

mllm/backends/cpu/kernels/common/fa2_1/impl-any-simd.hpp

[error] 6-6: 'hwy/highway.h' file not found

(clang-diagnostic-error)


[error] 16-16: 2 adjacent parameters of 'run' of similar type ('const int *__restrict') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 24-24: 2 adjacent parameters of 'run' of similar type ('int') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 24-24: parameter name 'c' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 36-36: parameter name 'a' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)


[error] 36-36: 2 adjacent parameters of 'run' of similar type ('int') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 36-36: parameter name 'v' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-android
  • GitHub Check: build-macos
🔇 Additional comments (6)
mllm/backends/cpu/kernels/common/fa2_1/impl-any-simd-inst.hpp (1)

14-27: Dynamic-dispatch exports verified—no issues found.

The verification confirms all four functions (vector_dot_product_fp32_fp32_fp32, fma_const_array_fp32, filled_with_const_fp32, mul_from_const_fp32) have both HWY_EXPORT declarations and corresponding HWY_DYNAMIC_DISPATCH calls in their wrapper functions. The linkage is properly configured and will not produce link errors.

mllm/backends/cpu/kernels/common/fa2_1/impl-any-simd-inl.hpp (1)

4-4: No action needed—Highway dependency is already properly configured.

Highway is already fetched via CPMAddPackage (mllm/backends/cpu/CMakeLists.txt line 81-85) and explicitly linked to MllmCPUBackend with target_link_libraries(MllmCPUBackend PUBLIC hwy) (line 92). CMake automatically propagates include directories through public linking, so #include <hwy/highway.h> will resolve correctly at compile time.

Likely an incorrect or invalid review comment.

pyproject.toml (1)

3-3: Consider version pinning strategy; manual CI verification needed as no CI configuration files exist in repository.

The current exact pin (==0.10.0) is valid and improves reproducibility. However, scikit-build-core v0.11.6 is the latest release on PyPI (Aug 2025), so v0.10.0 is outdated. If you're intentionally using an older version for compatibility, keep the exact pin; if not, consider upgrading or using a bounded range like >=0.10,<0.11.

The CI image verification suggested in the original comment cannot be completed: no CI configuration files (.github/workflows, etc.) were found in the repository. Verify manually that your CI environment has scikit-build-core 0.10.0 available.

mllm/backends/cpu/kernels/common/fa2_1/impl-any-simd-inst.cpp (3)

25-28: HWY_EXPORT placement looks right.

Exporting the per-target functions before defining the dynamic-dispatch shims is correct.


4-4: No action required—header file exists and include path is correct.

The verification confirms that impl-any-simd-inst.hpp exists at the expected path and is correctly included. The static analysis warning is a false positive, likely a tool configuration issue rather than an actual problem with the code or build system.


6-14: Dynamic-dispatch scaffold verified as correct.

The inl file (impl-any-simd-inl.hpp) properly wraps all four functions (vector_dot_product_fp32_fp32_fp32, mul_from_const_fp32, fma_const_array_fp32, filled_with_const_fp32) with HWY_BEFORE_NAMESPACE/HWY_AFTER_NAMESPACE macros and defines them inside the HWY_NAMESPACE namespace. This follows standard Highway conventions—explicit HWY_EXPORT macros are not needed when using foreach_target.h, as that mechanism handles per-target re-inclusion and visibility.

Comment on lines +41 to +43
HWY_DLLEXPORT void call_filled_with_const_fp32(mllm_fp32_t* HWY_DLLEXPORT a, float v, size_t len) {
HWY_DYNAMIC_DISPATCH(filled_with_const_fp32)(a, v, len);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Fix invalid HWY_DLLEXPORT on parameters + align scalar types.

HWY_DLLEXPORT must annotate the function, not parameters. Also, prefer mllm_fp32_t over float for consistency with the rest of the API.

Apply:

-HWY_DLLEXPORT void call_filled_with_const_fp32(mllm_fp32_t* HWY_DLLEXPORT a, float v, size_t len) {
-  HWY_DYNAMIC_DISPATCH(filled_with_const_fp32)(a, v, len);
-}
+HWY_DLLEXPORT void call_filled_with_const_fp32(mllm_fp32_t* HWY_RESTRICT a, mllm_fp32_t v, size_t len) noexcept {
+  HWY_DYNAMIC_DISPATCH(filled_with_const_fp32)(a, v, len);
+}

-HWY_DLLEXPORT void call_mul_from_const_fp32(mllm_fp32_t* HWY_DLLEXPORT from, float c, size_t len) {
-  HWY_DYNAMIC_DISPATCH(mul_from_const_fp32)(from, c, len);
-}
+HWY_DLLEXPORT void call_mul_from_const_fp32(mllm_fp32_t* HWY_RESTRICT from, mllm_fp32_t c, size_t len) noexcept {
+  HWY_DYNAMIC_DISPATCH(mul_from_const_fp32)(from, c, len);
+}

Also applies to: 45-47

🤖 Prompt for AI Agents
In mllm/backends/cpu/kernels/common/fa2_1/impl-any-simd-inst.cpp around lines
41-43 (and also apply the same change to lines 45-47), remove the misplaced
HWY_DLLEXPORT placed on the parameter declaration(s) so the export attribute
only decorates the function itself, and change the scalar parameter type from
float to mllm_fp32_t to match the API; also update any matching
prototypes/forward declarations accordingly.

Comment on lines +5 to +7
#ifndef HIGHWAY_HWY_FA2_IMPL_INL_H_
#define HIGHWAY_HWY_FA2_IMPL_INL_H_

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Header guard name is incorrect/mismatched.

Guard references “INL” instead of “INST” and non‑project style. Fix to match filename.

Apply:

-#ifndef HIGHWAY_HWY_FA2_IMPL_INL_H_
-#define HIGHWAY_HWY_FA2_IMPL_INL_H_
+// impl-any-simd-inst.hpp
+#ifndef MLLM_BACKENDS_CPU_KERNELS_COMMON_FA2_1_IMPL_ANY_SIMD_INST_HPP_
+#define MLLM_BACKENDS_CPU_KERNELS_COMMON_FA2_1_IMPL_ANY_SIMD_INST_HPP_
@@
-#endif  // HIGHWAY_HWY_FA2_IMPL_INL_H_
+#endif  // MLLM_BACKENDS_CPU_KERNELS_COMMON_FA2_1_IMPL_ANY_SIMD_INST_HPP_

Also applies to: 29-29

🤖 Prompt for AI Agents
In mllm/backends/cpu/kernels/common/fa2_1/impl-any-simd-inst.hpp (around lines
5–7 and also line 29) the header guard uses the wrong name/format
("HIGHWAY_HWY_FA2_IMPL_INL_H_" and "INL") which doesn't match the filename or
project style; replace it with a guard that matches the file path and “INST”
suffix (e.g. MLLM_BACKENDS_CPU_KERNELS_COMMON_FA2_1_IMPL_ANY_SIMD_INST_HPP_) and
update both the #ifndef/#define at the top and the corresponding #endif comment
at the bottom to use the new identifier.

Comment on lines +16 to +26
HWY_DLLEXPORT void call_vector_dot_product_fp32_fp32_fp32(const mllm_fp32_t* HWY_RESTRICT lhs,
const mllm_fp32_t* HWY_RESTRICT rhs, mllm_fp32_t* HWY_RESTRICT out,
size_t len);

HWY_DLLEXPORT void call_fma_const_array_fp32(mllm_fp32_t* HWY_RESTRICT acc_o, mllm_fp32_t acc_s,
const mllm_fp32_t* HWY_RESTRICT v_token, size_t len);

HWY_DLLEXPORT void call_filled_with_const_fp32(mllm_fp32_t* HWY_DLLEXPORT a, float v, size_t len);

HWY_DLLEXPORT void call_mul_from_const_fp32(mllm_fp32_t* HWY_DLLEXPORT from, float c, size_t len);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Misuse of HWY_DLLEXPORT on parameter declarations; use HWY_RESTRICT for pointers.

Putting HWY_DLLEXPORT on parameters is invalid and may break compilation/ABI. Keep HWY_DLLEXPORT on the function, use HWY_RESTRICT on pointer params, and include for size_t.

Apply:

 #include "mllm/core/DataTypes.hpp"
+#include <cstddef>
@@
-HWY_DLLEXPORT void call_vector_dot_product_fp32_fp32_fp32(const mllm_fp32_t* HWY_RESTRICT lhs,
-                                                          const mllm_fp32_t* HWY_RESTRICT rhs, mllm_fp32_t* HWY_RESTRICT out,
-                                                          size_t len);
+HWY_DLLEXPORT void call_vector_dot_product_fp32_fp32_fp32(const mllm_fp32_t* HWY_RESTRICT lhs,
+                                                          const mllm_fp32_t* HWY_RESTRICT rhs,
+                                                          mllm_fp32_t* HWY_RESTRICT out, size_t len);
@@
-HWY_DLLEXPORT void call_filled_with_const_fp32(mllm_fp32_t* HWY_DLLEXPORT a, float v, size_t len);
+HWY_DLLEXPORT void call_filled_with_const_fp32(mllm_fp32_t* HWY_RESTRICT a, float v, size_t len);
@@
-HWY_DLLEXPORT void call_mul_from_const_fp32(mllm_fp32_t* HWY_DLLEXPORT from, float c, size_t len);
+HWY_DLLEXPORT void call_mul_from_const_fp32(mllm_fp32_t* HWY_RESTRICT from, float c, size_t len);

Please confirm mllm/core/DataTypes.hpp resolves on all targets; otherwise adjust the include path.


I'll verify the review comment by examining the file and understanding the Highway library macros. Let me start by investigating.

Now let me examine the actual file in the repository:



The file doesn't exist in search results without cloning. Let me try with cloning:




🏁 Script executed:

#!/bin/bash
# Locate and examine the target file
fd "impl-any-simd-inst.hpp" --type f

Length of output: 65


🏁 Script executed:

#!/bin/bash
# Find the file in the repository
find . -name "impl-any-simd-inst.hpp" 2>/dev/null

Length of output: 67



---

🏁 Script executed:

```shell
#!/bin/bash
# Examine the actual file content
cat "./git/mllm/backends/cpu/kernels/common/fa2_1/impl-any-simd-inst.hpp"

Length of output: 99



---

🏁 Script executed:

```shell
#!/bin/bash
# List directory structure and try to read with fd's path
ls -la mllm/backends/cpu/kernels/common/fa2_1/ 2>/dev/null || ls -la git/mllm/backends/cpu/kernels/common/fa2_1/ 2>/dev/null || pwd && find . -name "impl-any-simd-inst.hpp" -type f -exec cat {} \;

Length of output: 1834


Now let me verify the Highway library macro semantics:


---

🌐 Web query:

```web
Highway library HWY_DLLEXPORT HWY_RESTRICT macro documentation

💡 Result:

Short answer:

  • HWY_RESTRICT — Macro you put after a pointer (e.g. T* HWY_RESTRICT p) to promise the pointer does not alias other pointers; used to improve code generation. (See Highway quick-reference / FAQ and the library headers.) [1][2]

  • HWY_DLLEXPORT — Macro used on functions/variables to mark them as part of Highway’s public/ABI surface (exported symbols). In the codebase it prefixes public API declarations (e.g. SupportedTargets(), AllocateAlignedBytes(), per-target queries) so they are exported when building the library; the implementation of HWY_DLLEXPORT lives in highway_export.h and on common platforms expands to the platform-appropriate export attribute (e.g. __declspec(dllexport)/__declspec(dllimport) on MSVC or attribute((visibility("default"))) / equivalent on GCC/Clang). See usages in the headers. [2][3]

References:

  • Highway quick reference / macros (describes HWY_RESTRICT). [1]
  • Highway source headers (base.h shows HWY_RESTRICT mapping for MSVC; targets.h / per_target.h / aligned_allocator.h show HWY_DLLEXPORT usage). [2][3]

If you want, I can fetch and show the exact highway_export.h contents (the precise HWY_DLLEXPORT definition) and paste the lines here.



---

🏁 Script executed:

```shell
#!/bin/bash
# Check if size_t is available from existing includes or needs explicit cstddef
# Look at what hwy/base.h and DataTypes.hpp provide
grep -r "include.*cstddef\|include.*cstdlib\|include.*cstring" mllm/core/DataTypes.hpp mllm/backends/cpu/kernels/common/fa2_1/impl-any-simd-inst.hpp 2>/dev/null || echo "No cstddef/cstdlib/cstring found in direct includes"
# Check if size_t is used elsewhere in the codebase without explicit cstddef
rg -n "size_t" mllm/backends/cpu/kernels/common/fa2_1/ -A1 -B1 --max-count=5

Length of output: 10230


Fix HWY_DLLEXPORT misuse on parameters in both header and implementation files; replace with HWY_RESTRICT and add explicit #include <cstddef>.

The review is correct: HWY_DLLEXPORT (function visibility macro) is incorrectly applied to pointer parameters in two functions. These should use HWY_RESTRICT instead (aliasing guarantee). The same issues exist in both impl-any-simd-inst.hpp and impl-any-simd-inst.cpp.

Required fixes:

impl-any-simd-inst.hpp (lines 23, 25):

  • Line 23: Replace HWY_DLLEXPORT with HWY_RESTRICT on parameter a
  • Line 25: Replace HWY_DLLEXPORT with HWY_RESTRICT on parameter from

impl-any-simd-inst.cpp (lines 41, 45):

  • Line 41: Replace HWY_DLLEXPORT with HWY_RESTRICT on parameter a
  • Line 45: Replace HWY_DLLEXPORT with HWY_RESTRICT on parameter from

Also add #include <cstddef> to both files for explicit size_t definition.

🧰 Tools
🪛 Clang (14.0.6)

[error] 20-20: variable 'call_fma_const_array_fp32' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)


[error] 23-23: variable 'call_filled_with_const_fp32' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)


[error] 25-25: variable 'call_mul_from_const_fp32' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)

🤖 Prompt for AI Agents
In mllm/backends/cpu/kernels/common/fa2_1/impl-any-simd-inst.hpp around lines
16–26 and mllm/backends/cpu/kernels/common/fa2_1/impl-any-simd-inst.cpp around
lines 41–45, HWY_DLLEXPORT was mistakenly used on pointer parameters — replace
those HWY_DLLEXPORT occurrences on parameters `a` (hpp line ~23 / cpp line ~41)
and `from` (hpp line ~25 / cpp line ~45) with HWY_RESTRICT to provide the
correct aliasing guarantee, and add #include <cstddef> to both files to ensure
size_t is defined.

Comment thread mllm/compile/README.md Outdated
@@ -0,0 +1 @@
# MLLM IR and It's MLIR Extensions
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Typo: “It’s” → “Its”.

Use the possessive “Its” in the heading.

Apply this diff:

-# MLLM IR and It's MLIR Extensions
+# MLLM IR and Its MLIR Extensions
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# MLLM IR and It's MLIR Extensions
# MLLM IR and Its MLIR Extensions
🤖 Prompt for AI Agents
In mllm/compile/README.md around line 1, the heading contains a typo using the
contraction "It's" instead of the possessive "Its"; update the first-line
heading from "MLLM IR and It's MLIR Extensions" to "MLLM IR and Its MLIR
Extensions" and save the file.

- Update macro checks to properly include ARM architecture
- Add missing `HWY_RESTRICT` qualifiers to function parameters
- Wrap implementation includes with proper architecture guards
- Fix typo in README.md ("It's" → "Its")
The logical condition for excluding ARM architectures has been updated
to use proper parentheses grouping. This ensures that the code is only
compiled for non-ARM architectures as intended, improving build
compatibility across different CPU targets.
Include CPUArchHelper.hpp to enable architecture-specific checks
in the FA2 SIMD implementation files. This change ensures proper
compilation and runtime behavior for different CPU architectures
by providing necessary architecture detection utilities.
@chenghuaWang chenghuaWang merged commit c6c000e into UbiquitousLearning:v2 Oct 22, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant