Improve the handling of popcnt and tzcnt on arm64 #128677
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Pull request overview
Improves ARM64 codegen for int.PopCount/long.PopCount and int.TrailingZeroCount/long.TrailingZeroCount by representing them as a single GT_INTRINSIC IR node instead of leaving popcnt unhandled and emitting tzcnt as two separate hwintrinsic nodes (ReverseElementBits + LeadingZeroCount). Lowering to actual instructions (cnt/addv for popcnt, rbit/clz for tzcnt) is now done directly in the codegen, which should preserve more information for heuristics and pattern matching.
Changes:
- In the importer, emit a `GenTreeIntrinsic(TYP_INT, ...)` node for ARM64 PopCount (previously not handled) and for ARM64 TrailingZeroCount (previously decomposed into two hwintrinsics).
- Add ARM64 LSRA build logic for `NI_PRIMITIVE_PopCount` (requests an internal SIMD temp) and `NI_PRIMITIVE_TrailingZeroCount`.
- Add ARM64 codegen for the two intrinsics: `cnt` + `addv` over an 8-byte SIMD register for PopCount, and `rbit` + `clz` (sized via `emitActualTypeSize(srcNode)`) for TrailingZeroCount.
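For readers less familiar with the two instruction pairs, the following is a rough, portable C++ model of what they compute. It is only a sketch and not the JIT's code: the function names (`CntPerByte`, `AddAcrossBytes`, `ReverseBits`, `CountLeadingZeros`, `ModelPopCount`, `ModelTrailingZeroCount`) are hypothetical, and the model assumes the operand is zero-extended into the 8-byte SIMD register before `cnt` runs.

```cpp
#include <cstdint>

// Hypothetical model of `cnt` on an 8-byte vector: every byte lane is
// replaced by the number of set bits in that lane (0..8).
inline uint64_t CntPerByte(uint64_t v)
{
    v = v - ((v >> 1) & 0x5555555555555555ULL);                           // 2-bit sums
    v = (v & 0x3333333333333333ULL) + ((v >> 2) & 0x3333333333333333ULL); // 4-bit sums
    v = (v + (v >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                           // per-byte sums
    return v;
}

// Hypothetical model of `addv`: add the eight byte lanes into one scalar.
inline unsigned AddAcrossBytes(uint64_t lanes)
{
    unsigned sum = 0;
    for (unsigned i = 0; i < 8; i++)
    {
        sum += static_cast<unsigned>((lanes >> (i * 8)) & 0xFF);
    }
    return sum;
}

// Hypothetical model of `rbit`: reverse the low `width` bits (32 or 64).
inline uint64_t ReverseBits(uint64_t v, unsigned width)
{
    uint64_t result = 0;
    for (unsigned i = 0; i < width; i++)
    {
        result = (result << 1) | ((v >> i) & 1);
    }
    return result;
}

// Hypothetical model of `clz`: count zero bits from the top of a
// `width`-bit value; returns `width` when the value is zero.
inline unsigned CountLeadingZeros(uint64_t v, unsigned width)
{
    unsigned count = 0;
    for (unsigned i = width; i-- > 0;)
    {
        if ((v >> i) & 1)
        {
            break;
        }
        count++;
    }
    return count;
}

// PopCount modeled as cnt followed by addv.
inline unsigned ModelPopCount(uint64_t v)
{
    return AddAcrossBytes(CntPerByte(v));
}

// TrailingZeroCount modeled as rbit followed by clz at the operand width.
inline unsigned ModelTrailingZeroCount(uint64_t v, unsigned width)
{
    return CountLeadingZeros(ReverseBits(v, width), width);
}
```

For example, `ModelPopCount(0xFF)` yields 8 and `ModelTrailingZeroCount(8, 32)` yields 3. A zero input gives the operand width (32 or 64), which is why the `rbit` + `clz` pair needs to be sized to the source operand.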
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/coreclr/jit/importercalls.cpp | Build a single NI_PRIMITIVE_PopCount/NI_PRIMITIVE_TrailingZeroCount intrinsic node on ARM64; set baseType = TYP_INT so the existing widen-back cast handles long return types. |
| src/coreclr/jit/lsraarm64.cpp | Add LSRA cases: PopCount needs an internal float (SIMD) reg; TrailingZeroCount is a straight 1-src, 1-dst integer op. |
| src/coreclr/jit/codegenarmarch.cpp | Emit cnt/addv via a SIMD temp for PopCount and rbit/clz for TrailingZeroCount under TARGET_ARM64. |
CC. @dotnet/jit-contrib, @EgorBo for review. This ensures that Diffs are generally positive, showing up to -7.4k bytes of codegen on Linux Arm64 and nearly as much on Windows Arm64. The overall PerfScore is an improvement as well. There are a few regressions intermixed; these mostly look to be due to additional CSEs, although a couple are from extra casts inserted for small types.
The extra casts look easy enough to fix, seems to be because we aren't handling it in
Correct, because it is a This allows us to track it directly as an integer operation and ignore the fact that codegen actually involves SIMD operations due to a limitation in the Arm64 instruction set.

Previously `popcnt` wasn't handled at all and `tzcnt` was imported as two IR nodes, which could negatively impact certain heuristics and pattern matching.