[Deepin-Kernel-SIG] [linux 6.18-y] [Fromlist] lib/crc: arm64: add NEON accelerated CRC64-NVMe implementation#1619
Conversation
Reviewer's Guide

Adds an ARM64 NEON/PMULL-accelerated CRC64-NVMe implementation and wires it into the generic CRC64 architecture layer, with build flags and a chunked SIMD dispatch path that falls back to the generic implementation for short buffers or when PMULL/SIMD is unavailable.

Sequence diagram for crc64_nvme_arch dispatch and fallback:

```mermaid
sequenceDiagram
    participant Caller
    participant crc64_nvme_arch
    participant cpu_have_named_feature
    participant may_use_simd
    participant scoped_ksimd
    participant crc64_nvme_arm64_c
    participant crc64_nvme_generic
    Caller->>crc64_nvme_arch: crc64_nvme_arch(crc, p, len)
    alt len >= 128
        crc64_nvme_arch->>cpu_have_named_feature: cpu_have_named_feature(PMULL)
        cpu_have_named_feature-->>crc64_nvme_arch: has_pmull
        crc64_nvme_arch->>may_use_simd: may_use_simd()
        may_use_simd-->>crc64_nvme_arch: simd_allowed
        alt has_pmull and simd_allowed
            loop while len >= 128
                crc64_nvme_arch->>crc64_nvme_arch: chunk = min(len & ~15, 4KB)
                crc64_nvme_arch->>scoped_ksimd: enter ksimd section
                scoped_ksimd->>crc64_nvme_arm64_c: crc64_nvme_arm64_c(crc, p, chunk)
                crc64_nvme_arm64_c-->>scoped_ksimd: updated_crc
                scoped_ksimd-->>crc64_nvme_arch: leave ksimd section
                crc64_nvme_arch->>crc64_nvme_arch: crc = updated_crc, p += chunk, len -= chunk
            end
        end
    end
    crc64_nvme_arch->>crc64_nvme_generic: crc64_nvme_generic(crc, p, len)
    crc64_nvme_generic-->>crc64_nvme_arch: final_crc
    crc64_nvme_arch-->>Caller: final_crc
```
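The chunking arithmetic in the dispatch loop above can be modeled in plain C. This is a userspace sketch, not the patch itself: `next_chunk`/`simd_bytes` are names I made up, `SZ_4K` is expanded to its value, and the SIMD call is replaced by a byte counter — only the chunk-size logic mirrors the diagram.

```c
#include <assert.h>
#include <stddef.h>

#define SZ_4K 4096

/* Each SIMD section handles at most 4 KiB, rounded down to a
 * 16-byte multiple — i.e. min_t(size_t, len & ~15, SZ_4K). */
static size_t next_chunk(size_t len)
{
    size_t chunk = len & ~(size_t)15;     /* 16-byte-aligned length */
    return chunk < SZ_4K ? chunk : SZ_4K;
}

/* Total bytes the PMULL path would consume; the remainder (always
 * under 128 bytes) is left for the generic fallback. */
static size_t simd_bytes(size_t len)
{
    size_t total = 0;
    while (len >= 128) {                  /* PMULL break-even threshold */
        size_t chunk = next_chunk(len);
        total += chunk;                   /* would run inside scoped_ksimd() */
        len -= chunk;
    }
    return total;
}
```

Capping each chunk at 4 KiB bounds the time spent in any one SIMD critical section on large buffers, which is the preemption-latency point the patch description makes.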
Class diagram for CRC64 ARM64 NEON implementation and dispatch:

```mermaid
classDiagram
    class Crc64Arm64 {
        +u64 crc64_nvme_arm64_c(u64 crc, const u8 *p, size_t len)
        -u64 fold_consts_val[2]
        -u64 bconsts_val[2]
    }
    class Crc64ArchLayer {
        +u64 crc64_nvme_arch(u64 crc, const u8 *p, size_t len)
        +u64 crc64_nvme_generic(u64 crc, const u8 *p, size_t len)
        +u64 crc64_be_arch(u64 crc, const u8 *p, size_t len)
        +u64 crc64_be_generic(u64 crc, const u8 *p, size_t len)
    }
    class CpuFeature {
        +bool cpu_have_named_feature(int feature)
        +const int PMULL
    }
    class SimdSubsystem {
        +bool may_use_simd()
        +scoped_ksimd scoped_ksimd()
    }
    class BuildConfig {
        +CONFIG_CRC64
        +CONFIG_CRC64_ARCH
        +CONFIG_ARM64
    }
    class Objects {
        +crc64_main_o
        +arm64_crc64_neon_inner_o
        +riscv_crc64_lsb_o
        +riscv_crc64_msb_o
        +x86_crc64_pclmul_o
    }
    Crc64ArchLayer --> Crc64Arm64 : uses
    Crc64ArchLayer --> CpuFeature : checks_features
    Crc64ArchLayer --> SimdSubsystem : manages_simd_context
    BuildConfig --> Objects : selects
    Objects --> Crc64Arm64 : links_arm64_neon_path
    Objects --> Crc64ArchLayer : links_common_crc64
```
Hey - I've found 1 issue, and left some high level feedback:
- Consider adding an include guard to lib/crc/arm64/crc64.h to avoid accidental multiple inclusion as this header grows or is reused elsewhere.
- crc64_nvme_arm64_c is only used from the architecture-specific path; making it file-local (static) and exposing only the inline wrapper in crc64.h would better encapsulate the NEON implementation detail and reduce the chance of unintended external use.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Consider adding an include guard to lib/crc/arm64/crc64.h to avoid accidental multiple inclusion as this header grows or is reused elsewhere.
- crc64_nvme_arm64_c is only used from the architecture-specific path; making it file-local (static) and exposing only the inline wrapper in crc64.h would better encapsulate the NEON implementation detail and reduce the chance of unintended external use.
## Individual Comments
### Comment 1
<location path="lib/crc/arm64/crc64.h" line_range="1" />
<code_context>
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * CRC64 using ARM64 PMULL instructions
</code_context>
<issue_to_address>
**nitpick (bug_risk):** Consider adding a traditional include guard to the new header.
In this codebase we usually prefer `#ifndef`/`#define` guards over `#pragma once`, and they make multiple inclusion behavior explicit and easier to reason about. Please add a guard consistent with nearby headers.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
```diff
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
```
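The guard the reviewers are asking for would wrap the header body as below. This is a userspace illustration, not the actual patch: the guard macro name `_LIB_CRC_ARM64_CRC64_H` is assumed, a variable stands in for the real declarations, and the "second inclusion" is spelled out textually so the mechanism is visible.

```c
#include <assert.h>

/* First "inclusion" of the guarded region: the macro is not yet
 * defined, so the body is compiled. */
#ifndef _LIB_CRC_ARM64_CRC64_H
#define _LIB_CRC_ARM64_CRC64_H

static int crc64_hdr_definitions = 1; /* stands in for the declarations */

#endif /* _LIB_CRC_ARM64_CRC64_H */

/* Second "inclusion": the guard macro is now defined, so the
 * preprocessor drops the body and nothing is redefined. */
#ifndef _LIB_CRC_ARM64_CRC64_H
static int crc64_hdr_definitions = 2; /* never reached */
#endif /* _LIB_CRC_ARM64_CRC64_H */
```

The `#endif` comment repeating the guard name follows the convention of nearby kernel headers.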
Pull request overview
Adds an ARM64 NVMe CRC64 implementation accelerated with NEON/PMULL to improve throughput vs the generic shift/XOR path, and wires it into the existing CRC64 arch-dispatch mechanism.
Changes:
- Enables `CRC64_ARCH` by default on ARM64 and adds an ARM64-specific dispatch header for `crc64_nvme()`.
- Introduces a NEON/PMULL-based CRC64-NVMe inner implementation (`crc64_nvme_arm64_c()`), with chunking to cap SIMD critical sections.
- Updates `lib/crc/Makefile` to build the new ARM64 object with appropriate compiler flags.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| lib/crc/Makefile | Adds ARM64 CRC64-NVMe NEON inner object and per-file compile flags. |
| lib/crc/Kconfig | Defaults CRC64_ARCH to enabled on ARM64. |
| lib/crc/arm64/crc64.h | Adds ARM64 crc64_nvme_arch() dispatch with PMULL + SIMD gating and chunking. |
| lib/crc/arm64/crc64-neon-inner.c | Implements the PMULL-accelerated CRC64-NVMe update routine using NEON intrinsics. |
```c
	if (len >= 128 && cpu_have_named_feature(PMULL) &&
	    likely(may_use_simd())) {
		do {
			size_t chunk = min_t(size_t, len & ~15, SZ_4K);

			scoped_ksimd()
				crc = crc64_nvme_arm64_c(crc, p, chunk);
```
scoped_ksimd() does not exist anywhere in this tree (the arm64 SIMD users use kernel_neon_begin() / kernel_neon_end() after may_use_simd()). As written, this will fail to build. Replace this with a kernel_neon_begin()/kernel_neon_end() pair around the accelerated call (and include <asm/neon.h>), keeping the 4KB chunking to bound preempt-disable time.
```make
CFLAGS_REMOVE_arm64/crc64-neon-inner.o += -mgeneral-regs-only
CFLAGS_arm64/crc64-neon-inner.o += -ffreestanding -march=armv8-a+crypto
CFLAGS_arm64/crc64-neon-inner.o += -isystem $(shell $(CC) -print-file-name=include)
```
The per-object flags here hardcode -ffreestanding and -isystem $(CC ... -print-file-name=include), and use -march=armv8-a+crypto. For consistency with existing arm64 NEON/crypto intrinsics build rules, prefer using the arch-exported CC_FLAGS_FPU/CC_FLAGS_NO_FPU (see arch/arm64/Makefile:36-42) and consider -mcpu=generic+crypto (as done for aegis128-neon-inner.o in crypto/Makefile:110-123) instead of a raw -march override.
```diff
-CFLAGS_REMOVE_arm64/crc64-neon-inner.o += -mgeneral-regs-only
-CFLAGS_arm64/crc64-neon-inner.o += -ffreestanding -march=armv8-a+crypto
-CFLAGS_arm64/crc64-neon-inner.o += -isystem $(shell $(CC) -print-file-name=include)
+CFLAGS_REMOVE_arm64/crc64-neon-inner.o += $(CC_FLAGS_NO_FPU)
+CFLAGS_arm64/crc64-neon-inner.o += $(CC_FLAGS_FPU) -mcpu=generic+crypto
```
@opsiff That said, backporting it does seem necessary, as it can significantly boost NVMe I/O performance in certain scenarios. I just couldn't come up with an elegant solution within half a minute, so I couldn't be bothered to deal with it.
[Upstream commit 814f541]

Implement the ksimd scoped guard API so that it can be used by code that supports both ARM and arm64.

Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
[Upstream commit c5b91a1]

Encapsulate kernel_neon_begin() and kernel_neon_end() using a 'ksimd' cleanup guard. This hides the prototype of those functions, allowing them to be changed for arm64 but not ARM, without breaking code that is shared between those architectures (RAID6, AEGIS-128).

It probably makes sense to expose this API more widely across architectures, as it affords more flexibility to the arch code to plumb it in, while imposing more rigid rules regarding the start/end bookends appearing in matched pairs.

Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
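The cleanup-guard idea in this commit can be sketched in plain C with the compiler's `cleanup` attribute. This is a userspace model, not the kernel's actual macro: the stubs and counter stand in for kernel_neon_begin()/kernel_neon_end() so the enforced pairing is observable.

```c
static int neon_depth; /* tracks begin/end nesting in this model */

static void kernel_neon_begin(void) { neon_depth++; } /* stub */
static void kernel_neon_end(void)   { neon_depth--; } /* stub */

static void ksimd_cleanup(int *unused)
{
    (void)unused;
    kernel_neon_end(); /* the "end" bookend, emitted automatically */
}

/* scoped_ksimd(): runs the statement that follows between a matched
 * begin/end pair; the end half fires on ANY scope exit, including an
 * early return, which is what makes mismatched bookends impossible. */
#define scoped_ksimd()                                          \
    for (int _guard __attribute__((cleanup(ksimd_cleanup)))     \
             = (kernel_neon_begin(), 0);                        \
         !_guard; _guard = 1)

static int depth_inside_section(void)
{
    scoped_ksimd()
        return neon_depth; /* cleanup restores depth after the return */
    return -1; /* not reached */
}
```

The hidden-prototype benefit follows directly: callers only ever see the macro, so the begin/end signatures can change per architecture without touching shared code.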
Implement an optimized CRC64 (NVMe) algorithm for ARM64 using NEON Polynomial Multiply Long (PMULL) instructions. The generic shift-and-XOR software implementation is slow, which creates a bottleneck in NVMe and other storage subsystems.

The acceleration is implemented using C intrinsics (<arm_neon.h>) rather than raw assembly for better readability and maintainability.

Key highlights of this implementation:
- Uses 4KB chunking inside scoped_ksimd() to avoid preemption latency spikes on large buffers.
- Pre-calculates and loads fold constants via vld1q_u64() to minimize register spilling.
- Benchmarks show the break-even point against the generic implementation is around 128 bytes. The PMULL path is enabled only for len >= 128.

Performance results (kunit crc_benchmark on Cortex-A72):
- Generic (len=4096): ~268 MB/s
- PMULL (len=4096): ~1556 MB/s (nearly 6x improvement)

Signed-off-by: Demian Shulhan <demyansh@gmail.com>
Link: https://lore.kernel.org/all/20260329074338.1053550-1-demyansh@gmail.com/
Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
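For readers unfamiliar with folding: the "fold constants" mentioned above come from the standard carryless-folding identity. Viewing the buffer as a polynomial over GF(2) and splitting off a 128-bit block $A$ that sits $n$ bits above the remaining data $B$:

$$(A \cdot x^{n} + B) \bmod P \;=\; \big(A \cdot (x^{n} \bmod P) + B\big) \bmod P$$

so each iteration multiplies the accumulated state by the precomputed constant $x^{n} \bmod P$ with a PMULL carryless multiply and XORs the product into the next block of data, deferring the expensive final reduction modulo $P$ (Barrett reduction) until the very end of the buffer.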
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: opsiff

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: needs approval from an approver in each of these files.
Approvers can indicate their approval by writing …
Implement an optimized CRC64 (NVMe) algorithm for ARM64 using NEON Polynomial Multiply Long (PMULL) instructions. The generic shift-and-XOR software implementation is slow, which creates a bottleneck in NVMe and other storage subsystems.

The acceleration is implemented using C intrinsics (<arm_neon.h>) rather than raw assembly for better readability and maintainability.

Key highlights of this implementation:
- 4KB chunking inside scoped_ksimd() to avoid preemption latency spikes on large buffers.
- Fold constants pre-calculated and loaded via vld1q_u64() to minimize register spilling.
- The PMULL path is enabled only for len >= 128, the measured break-even point against the generic implementation.

Performance results (kunit crc_benchmark on Cortex-A72):
- Generic (len=4096): ~268 MB/s
- PMULL (len=4096): ~1556 MB/s (nearly 6x improvement)

Link: https://lore.kernel.org/all/20260329074338.1053550-1-demyansh@gmail.com/
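As a cross-check, the generic shift-and-XOR path this patch accelerates can be written in a few lines of portable C. This is a sketch, not the kernel's exact code: `crc64_nvme_bitwise` is my name, 0x9a6c9329ac4bc9b5 is the bit-reflected NVMe polynomial, and (matching my reading of the lib/crc convention) the ~0 initial value and final inversion are left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time CRC64-NVMe update, reflected form. One shift plus a
 * conditional XOR per message *bit* -- the per-byte cost that makes
 * the 16-bytes-per-PMULL folding path roughly 6x faster. */
static uint64_t crc64_nvme_bitwise(uint64_t crc, const uint8_t *p, size_t len)
{
    const uint64_t poly = 0x9a6c9329ac4bc9b5ULL; /* reflected NVMe poly */

    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ ((crc & 1) ? poly : 0);
    }
    return crc;
}
```

Because the update is incremental, computing a buffer in two calls must match one call over the whole buffer — the same property the arch dispatch relies on when it splits work between the PMULL chunks and the generic tail.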
Summary by Sourcery

Add an ARM64 NEON/PMULL-accelerated CRC64-NVMe implementation and wire it into the generic CRC64 architecture-specific path for capable CPUs.

New Features:
- NEON/PMULL-accelerated CRC64-NVMe inner routine (crc64_nvme_arm64_c()) with a chunked crc64_nvme_arch() dispatch path that falls back to the generic implementation for short buffers or when PMULL/SIMD is unavailable.

Build:
- Default CRC64_ARCH to enabled on ARM64 and build the new NEON inner object with the required per-file compiler flags.