Skip to content

Improve the handling of popcnt and tzcnt on arm64#128677

Merged
tannergooding merged 2 commits into
dotnet:mainfrom
tannergooding:arm64-popcnt-tzcnt
May 29, 2026
Merged

Improve the handling of popcnt and tzcnt on arm64#128677
tannergooding merged 2 commits into
dotnet:mainfrom
tannergooding:arm64-popcnt-tzcnt

Conversation

@tannergooding
Copy link
Copy Markdown
Member

Previously popcnt wasn't handled at all and tzcnt was imported as two IR nodes which could negatively impact certain heuristics and pattern matching.

Copilot AI review requested due to automatic review settings May 28, 2026 03:22
@github-actions github-actions Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 28, 2026
@dotnet-policy-service
Copy link
Copy Markdown
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves ARM64 codegen for int.PopCount/long.PopCount and int.TrailingZeroCount/long.TrailingZeroCount by representing them as a single GT_INTRINSIC IR node instead of leaving popcnt unhandled and emitting tzcnt as two separate hwintrinsic nodes (ReverseElementBits + LeadingZeroCount). Lowering to actual instructions (cnt/addv for popcnt, rbit/clz for tzcnt) is now done directly in the codegen, which should preserve more information for heuristics and pattern matching.

Changes:

  • In the importer, emit a GenTreeIntrinsic(TYP_INT, ...) node for ARM64 PopCount (previously not handled) and for ARM64 TrailingZeroCount (previously decomposed into two hwintrinsics).
  • Add ARM64 LSRA build logic for NI_PRIMITIVE_PopCount (requests an internal SIMD temp) and NI_PRIMITIVE_TrailingZeroCount.
  • Add ARM64 codegen for the two intrinsics: cnt+addv over an 8-byte SIMD register for PopCount, and rbit+clz (sized via emitActualTypeSize(srcNode)) for TrailingZeroCount.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
src/coreclr/jit/importercalls.cpp Build a single NI_PRIMITIVE_PopCount/NI_PRIMITIVE_TrailingZeroCount intrinsic node on ARM64; set baseType = TYP_INT so the existing widen-back cast handles long return types.
src/coreclr/jit/lsraarm64.cpp Add LSRA cases: PopCount needs an internal float (SIMD) reg; TrailingZeroCount is a straight 1-src, 1-dst integer op.
src/coreclr/jit/codegenarmarch.cpp Emit cnt/addv via a SIMD temp for PopCount and rbit/clz for TrailingZeroCount under TARGET_ARM64.

@tannergooding tannergooding force-pushed the arm64-popcnt-tzcnt branch 2 times, most recently from 9607e31 to bf02815 Compare May 28, 2026 04:22
Copilot AI review requested due to automatic review settings May 28, 2026 04:22
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

Copilot AI review requested due to automatic review settings May 28, 2026 05:24
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

@tannergooding tannergooding marked this pull request as ready for review May 28, 2026 16:56
Copilot AI review requested due to automatic review settings May 28, 2026 16:56
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

@tannergooding
Copy link
Copy Markdown
Member Author

CC. @dotnet/jit-contrib, @EgorBo for review. This ensures that tzcnt and popcnt are properly imported as intrinsic on Arm64, allowing other phases to handle those nodes and avoiding the inliner.

Diffs are generally positive, showing up to -7.4k bytes of codegen on Linux Arm64 and nearly as much on Windows Arm64. The overall PerfScore is an improvement as well.

There are a few regressions intermixed, these mostly look to be due to additional CSEs, although a couple are from extra casts inserted for small types.

@tannergooding
Copy link
Copy Markdown
Member Author

tannergooding commented May 28, 2026

There are a few regressions intermixed, these mostly look to be due to additional CSEs, although a couple are from extra casts inserted for small types.

The extra casts look easy enough to fix, seems to be because we aren't handling it in IntegralRange::ForNode like we did with ToScalar(). I'd prefer to do that in a follow up PR, however, since there's a few intrinsics to add there; not just tzcnt/popcnt

@EgorBo
Copy link
Copy Markdown
Member

EgorBo commented May 29, 2026

This feels like yet another case where I don't understand why the managed code in BitOperations
{8AE13EF5-083B-49D9-A5E2-267FC9C19929}

isn't inlined into exactly what you emit by hands here. My understanding that it happens because we don't do opts for AdvSimd.PopCount that we do for PRIMITIVE_PopCount later in JIT (meaning users code that calls AdvSimd.PopCount directly won't be optimized this much)

@tannergooding
Copy link
Copy Markdown
Member Author

meaning users code that calls AdvSimd.PopCount directly won't be optimized this much

Correct, because it is a SIMD instruction and must be presumed to operate across all SIMD8/SIMD16 bytes. We have no way to trivially map that back, especially given the more complex IR and our limited ability to track ranges already.

This allows us to track it directly as an integer operation and ignore the fact that codegen actually involves SIMD operations due to a limitation in the Arm64 instruction set.

@tannergooding tannergooding enabled auto-merge (squash) May 29, 2026 13:08
@tannergooding tannergooding merged commit db51bb8 into dotnet:main May 29, 2026
141 of 146 checks passed
@tannergooding tannergooding deleted the arm64-popcnt-tzcnt branch May 29, 2026 15:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants