-
Notifications
You must be signed in to change notification settings - Fork 706
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Accelerate CRC-32C calculation on Z #16379
Accelerate CRC-32C calculation on Z #16379
Conversation
838e66d
to
50bde15
Compare
50bde15
to
b9a8506
Compare
b9a8506
to
c0e2494
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Spencer-Comin I have skimmed through the changes quickly and left some comments. Will give it another check soon, but would appreciate if you can provide a sample implementation in comment, this will make it easier to follow.
2d7eb52
to
7544479
Compare
For reference, here's the generated assembly (log from a unit test on a z13 machine):
|
I've opened a PR adding a unit test for this acceleration here: #16404 |
7544479
to
8fae066
Compare
8fae066
to
b7bce51
Compare
@Spencer-Comin briefly going over the changes, looks ok to me and most of the concerns were addressed. Any changes to assembly you printed in this comment (#16379 (comment)) ? I wanted to give one last look before approving changes. |
@r30shah I've updated the assembly in #16379 (comment) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Spencer-Comin I went to the changes quickly and looks ok to me. This kind of performance changes should accompanied by relevant performance numbers. Can you post results and analysis from your JMH benchmark so we have it documented here?
@joransiu Can you please give one quick look as well ?
b7bce51
to
1c2eb8a
Compare
Accelerate CRC32C.updateBytes and CRC32C.updateDirectByteBuffers recognized methods by inlining them with a vector assembly sequence. Signed-off-by: Spencer Comin <spencer.comin@ibm.com>
1c2eb8a
to
2a14647
Compare
Jenkins test sanity,extended zlinux jdk11,jdk17 |
While doing some benchmarking I found a ~1/100 error related to this change. As a brief summary, a register spill can be inserted between the first possible jump to the out-of-line call and the start of the vector internal control flow. In the 15B or less case, the spill isn't hit, leading to a segfault later on. |
Thanks @Spencer-Comin - Can you open an issue to track this issue? I'd like to better understand which virtual reg is leading to that spill. From the code, it looks like you have all the dependencies registered and added to the relevant control flow point instructions already. |
Accelerate
CRC32C.updateBytes
andCRC32C.updateDirectByteBuffers
recognized methods by inlining them with a vector assembly sequence.The implementation is based on the bit-reflected algorithm described in Intel's whitepaper "Fast CRC Computation for Generic Polynomials Using
PCLMULQDQ
Instruction" [1] using equivalent Z instructions. The basic layout of the implementation is as follows:Preliminary benchmarking shows approximately 22× speedup for 4kB and 16kB byte arrays, and no significant speedup or slowdown for 15B arrays.
[1] https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf