Add ARM64/NEON SIMD paths to all compute kernels #14
Draft
- Add ARM NEON SIMD implementations for all 5 compute kernels:
  * matrix_operations: float64x2_t with vmlaq_f64 (multiply-accumulate) for matmul on two doubles per vector
  * polynomial_eval: float64x2_t with vmlaq_f64 for 2-coefficient parallel Horner
  * string_search: uint8x16_t with vceqq_u8 for 16-byte parallel char search
  * memory_operations: uint8x16_t with vld1q_u8/vst1q_u8 for 16-byte copy
  * hash_operations: uint8x16_t with vld1q_u8 for 16-byte batch load
- Rename polynomial_eval_sse -> polynomial_eval_simd (arch-neutral name)
- Update Dockerfile: arch-detect build flags (-march=armv8-a on AArch64, -march=native on x86), -O3 (was -O2), -std=c++14 (was -std=c++11)
- Add USE_ARM_NEON macro detection in all source files via #ifdef __aarch64__
- All SIMD guards use #if/#elif/#else chains (x86 -> ARM -> scalar fallback)
- ubuntu:22.04 base image confirmed multi-arch (supports linux/arm64)
- Fix scalar hash fallback: cast to unsigned char to prevent sign-extension
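The sign-extension fix in the last item can be illustrated with a minimal sketch. The PR's actual hash function is not shown, so the FNV-1a-style loop and the names `hash_bytes` and `hash_bytes_sign_extended` below are hypothetical. The point is only that widening a signed `char` with the high bit set yields a value such as `0xFFFFFFE9` instead of `0xE9`. Because plain `char` is signed on x86-64 Linux but unsigned on AArch64 Linux, an uncast scalar path could plausibly produce different hashes on the two targets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical FNV-1a-style hash, used only to illustrate the cast.
// Buggy variant: forces the "signed char" behaviour so the effect is
// reproducible regardless of whether plain char is signed on this target.
std::uint32_t hash_bytes_sign_extended(const char* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    std::uint32_t h = 2166136261u;  // FNV offset basis
    for (std::size_t i = 0; i < len; ++i) {
        // A byte such as 0xE9 widens to 0xFFFFFFE9 here.
        h ^= static_cast<std::uint32_t>(static_cast<signed char>(data[i]));
        h *= 16777619u;  // FNV prime
    }
    return h;
}

// Fixed variant: casting to unsigned char first keeps each byte in 0..255.
std::uint32_t hash_bytes(const char* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    std::uint32_t h = 2166136261u;
    for (std::size_t i = 0; i < len; ++i) {
        h ^= static_cast<unsigned char>(data[i]);
        h *= 16777619u;
    }
    return h;
}
```

For pure ASCII input the two variants agree. They diverge only once a byte at or above `0x80` appears, which is why a bug of this kind can go unnoticed in tests that use ASCII-only data.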
Copilot AI changed the title from "[WIP] Migrate this repository to ARM architecture" to "Add ARM64/NEON SIMD paths to all compute kernels" on Mar 10, 2026
All five compute kernels were gated behind `#ifdef __x86_64__`, with only scalar fallbacks on other architectures, so ARM64 got no SIMD acceleration.

Changes
NEON intrinsic paths added to each kernel
- `matrix_operations.cpp`: `float64x2_t` + `vmlaq_f64` for multiply-accumulate on doubles (the explicitly fused intrinsic is `vfmaq_f64`)
- `polynomial_eval.cpp`: parallel coefficient evaluation using `vmlaq_f64`
- `string_search.cpp`: `vceqq_u8` compares 16 bytes per instruction
- `memory_operations.cpp`: `vld1q_u8`/`vst1q_u8` 16-byte loads/stores
- `hash_operations.cpp`: 16-byte batched loads via `vld1q_u8`

Consistent arch-detection pattern across all files
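A minimal sketch of what such an x86 -> ARM -> scalar guard chain might look like, applied to a byte-search kernel in the spirit of `string_search.cpp`. `USE_ARM_NEON` is the macro named in this PR. `USE_X86_SIMD`, `count_byte`, and the choice of SSE2 on the x86 side are assumptions made for illustration, not the repository's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Guard chain: x86 first, then ARM, otherwise no SIMD macro is defined
// and only the scalar loop below is compiled.
#if defined(__x86_64__)
  #define USE_X86_SIMD 1      // hypothetical macro name
  #include <emmintrin.h>      // SSE2, baseline on x86-64
#elif defined(__aarch64__)
  #define USE_ARM_NEON 1      // macro name taken from the PR
  #include <arm_neon.h>
#endif

// Hypothetical kernel: count occurrences of `needle` in `data[0..len)`.
std::size_t count_byte(const std::uint8_t* data, std::size_t len,
                       std::uint8_t needle) {
    assert(data != nullptr || len == 0);
    std::size_t count = 0;
    std::size_t i = 0;
#if defined(USE_X86_SIMD)
    const __m128i pattern = _mm_set1_epi8(static_cast<char>(needle));
    for (; i + 16 <= len; i += 16) {
        const __m128i chunk =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        // One mask bit per byte lane that matched.
        const int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, pattern));
        count += static_cast<std::size_t>(
            __builtin_popcount(static_cast<unsigned>(mask)));
    }
#elif defined(USE_ARM_NEON)
    const uint8x16_t pattern = vdupq_n_u8(needle);
    for (; i + 16 <= len; i += 16) {
        const uint8x16_t chunk = vld1q_u8(data + i);
        const uint8x16_t eq = vceqq_u8(chunk, pattern);  // 0xFF where equal
        // Reduce each matching lane to 1, then sum the 16 lanes.
        count += vaddvq_u8(vshrq_n_u8(eq, 7));
    }
#endif
    // Scalar fallback; also handles the tail shorter than 16 bytes.
    for (; i < len; ++i) {
        count += (data[i] == needle) ? 1u : 0u;
    }
    return count;
}
```

Keeping the scalar loop outside the guards means it serves both as the fallback on other architectures and as the tail handler after the 16-byte SIMD loop, so the three paths share one exit condition.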
- Dockerfile: build flags updated to `-march=armv8-a` on AArch64 and `-O3 -std=c++14` on both targets; the `ubuntu:22.04` base is already multi-arch.
- `main.cpp`: runtime banner reports "AArch64 with NEON Optimizations" on ARM64.