Accelerate CRC-32C calculation on Z #16379

Merged
merged 1 commit into from
Jan 16, 2023

Spencer-Comin
Contributor

Accelerate CRC32C.updateBytes and CRC32C.updateDirectByteBuffers recognized methods by inlining them with a vector assembly sequence.

The implementation is based on the bit-reflected algorithm described in Intel's whitepaper "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction" [1] using equivalent Z instructions. The basic layout of the implementation is as follows:

  • While at least 64B of data remain, consume 64B chunks using 4 vectors in parallel, then
  • while at least 16B of data remain, consume 16B chunks using 1 vector, then
  • if any data remains, call the original implementation on the remainder.
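As a rough illustration of this layout, the sketch below only computes how many bytes each stage would consume for a given length; it does not compute a CRC. The class and method names are hypothetical, and the thresholds (fall back entirely below 16B, loop while at least 64B or 16B remain) are assumptions taken from the comparisons in the generated assembly posted later in this conversation.

```java
public class ChunkLayoutSketch {
    /**
     * Returns {bytes folded 64 at a time, bytes folded 16 at a time,
     * bytes left for the original implementation}.
     * Hypothetical helper for illustration only.
     */
    static int[] partition(int length) {
        // Inputs shorter than one vector go straight to the fallback.
        if (length < 16) {
            return new int[] {0, 0, length};
        }
        int remaining = length;
        int foldedBy4 = 0;
        int foldedBy1 = 0;
        // Stage 1: four 16B vectors (64B) per iteration.
        while (remaining >= 64) {
            foldedBy4 += 64;
            remaining -= 64;
        }
        // Stage 2: one 16B vector per iteration.
        while (remaining >= 16) {
            foldedBy1 += 16;
            remaining -= 16;
        }
        // Stage 3: whatever is left (0 to 15 bytes) is handed to the fallback.
        return new int[] {foldedBy4, foldedBy1, remaining};
    }
}
```

For example, a 100B input would be split into 64B for the 4-vector stage, 32B for the 1-vector stage, and 4B for the fallback.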

Preliminary benchmarking shows approximately 22× speedup for 4kB and 16kB byte arrays, and no significant speedup or slowdown for 15B arrays.

[1] https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf
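For orientation, the value that any accelerated sequence has to reproduce can be written as a plain bit-at-a-time loop in the bit-reflected convention. The sketch below is not the code in this PR; it is a minimal reference that assumes the standard CRC-32C parameters (reflected polynomial 0x82F63B78, initial value and final XOR of 0xFFFFFFFF) and compares itself against `java.util.zip.CRC32C`. The class and method names are hypothetical.

```java
import java.util.zip.CRC32C;

public class Crc32cReference {
    // CRC-32C (Castagnoli) polynomial in bit-reflected form.
    private static final int REFLECTED_POLY = 0x82F63B78;

    /** Folds len bytes into a raw (not yet inverted) CRC state, one bit at a time. */
    static int update(int crc, byte[] data, int off, int len) {
        for (int i = off; i < off + len; i++) {
            crc ^= data[i] & 0xFF;
            for (int bit = 0; bit < 8; bit++) {
                // If the low bit is set, shift right and XOR in the polynomial.
                crc = (crc >>> 1) ^ (REFLECTED_POLY & -(crc & 1));
            }
        }
        return crc;
    }

    /** Full checksum: initial value 0xFFFFFFFF, final inversion, unsigned result. */
    static long checksum(byte[] data) {
        return (~update(0xFFFFFFFF, data, 0, data.length)) & 0xFFFFFFFFL;
    }

    /** Cross-check against the JDK implementation. */
    static boolean matchesJdk(byte[] data) {
        CRC32C jdk = new CRC32C();
        jdk.update(data, 0, data.length);
        return jdk.getValue() == checksum(data);
    }
}
```

Because `update` carries only the 32-bit state from one call to the next, a bulk portion and a tail can be processed by different routines, which is what makes handing the remainder to the original implementation possible.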

@Spencer-Comin
Contributor Author

@joransiu @r30shah fyi

runtime/compiler/z/codegen/J9CodeGenerator.cpp Outdated Show resolved Hide resolved
runtime/compiler/z/codegen/J9CodeGenerator.cpp Outdated Show resolved Hide resolved
Contributor

@r30shah r30shah left a comment

@Spencer-Comin I have skimmed through the changes quickly and left some comments. I will give it another check soon, but I would appreciate it if you could provide a sample implementation in a comment; that would make it easier to follow.

runtime/compiler/z/codegen/J9TreeEvaluator.cpp Outdated Show resolved Hide resolved
runtime/compiler/z/codegen/J9TreeEvaluator.cpp Outdated Show resolved Hide resolved
runtime/compiler/z/codegen/J9TreeEvaluator.cpp Outdated Show resolved Hide resolved
runtime/compiler/z/codegen/J9TreeEvaluator.cpp Outdated Show resolved Hide resolved
runtime/compiler/z/codegen/J9TreeEvaluator.cpp Outdated Show resolved Hide resolved
runtime/compiler/z/codegen/J9TreeEvaluator.cpp Outdated Show resolved Hide resolved
runtime/compiler/z/codegen/J9TreeEvaluator.cpp Outdated Show resolved Hide resolved
@Spencer-Comin Spencer-Comin force-pushed the crc32c-call-java branch 3 times, most recently from 2d7eb52 to 7544479 on December 9, 2022 17:34
@Spencer-Comin
Contributor Author

Spencer-Comin commented Dec 9, 2022

For reference, here's the generated assembly (log from a unit test on a z13 machine):

============================================================
; Live regs: GPR=4 FPR=0 VRF=0 {GPR_0034, &GPR_0018, &GPR_0017, GPR_0016}
------------------------------
 n62n     (  0)  treetop                                                                              [     0x3ff87080320] bci=[-1,46,153] rc=0 vc=333 vn=- li=8 udi=- nc=1
 n61n     (  2)    icall  java/util/zip/CRC32C.updateBytes(I[BII)I[#409  final static Method] [flags 0x20500 0x0 ] ()  [     0x3ff870802d0] bci=[-1,46,153] rc=2 vc=333 vn=- li=8 udi=- nc=4 flg=0x20
 n55n     (  1)      iloadi  java/util/zip/CRC32C.crc I[#404  Shadow +8] [flags 0x80603 0x0 ] (cannotOverflow )  [     0x3ff870800f0] bci=[-1,38,153] rc=1 vc=333 vn=- li=8 udi=- nc=1 flg=0x1000
 n113n    (  2)        ==>aRegLoad (in &GPR_0018) (X!=0 X>=0 SeenRealReference sharedMemory )
 n112n    (  1)      ==>aRegLoad (in &GPR_0017) (SeenRealReference sharedMemory )
 n111n    (  2)      ==>iRegLoad (in GPR_0016) (cannotOverflow SeenRealReference )
 n60n     (  1)      iadd                                                                             [     0x3ff87080280] bci=[-1,45,153] rc=1 vc=333 vn=- li=8 udi=- nc=2
 n111n    (  2)        ==>iRegLoad (in GPR_0016) (cannotOverflow SeenRealReference )
 n115n    (  1)        ==>iload (in GPR_0034)
------------------------------

buildDirectDispatch
Build Direct Call

GC Point at 000003FF871513F0 has preserved register map 1fc0
------------------------------
 n62n     (  0)  treetop                                                                              [     0x3ff87080320] bci=[-1,46,153] rc=0 vc=333 vn=- li=8 udi=- nc=1
 n61n     (  1)    icall  java/util/zip/CRC32C.updateBytes(I[BII)I[#409  final static Method] [flags 0x20500 0x0 ] (in GPR_0066) ()  [     0x3ff870802d0] bci=[-1,46,153] rc=1 vc=333 vn=- li=8 udi=- nc=4 flg=0x20
 n55n     (  0)      iloadi  java/util/zip/CRC32C.crc I[#404  Shadow +8] [flags 0x80603 0x0 ] (in GPR_0066) (cannotOverflow )  [     0x3ff870800f0] bci=[-1,38,153] rc=0 vc=333 vn=- li=8 udi=- nc=1 flg=0x1000
 n113n    (  1)        ==>aRegLoad (in &GPR_0018) (X!=0 X>=0 SeenRealReference sharedMemory )
 n112n    (  0)      ==>aRegLoad (in &GPR_0017) (SeenRealReference sharedMemory )
 n111n    (  0)      ==>iRegLoad (in GPR_0016) (cannotOverflow SeenRealReference )
 n60n     (  0)      iadd (in GPR_0034)                                                               [     0x3ff87080280] bci=[-1,45,153] rc=0 vc=333 vn=- li=8 udi=- nc=2
 n111n    (  0)        ==>iRegLoad (in GPR_0016) (cannotOverflow SeenRealReference )
 n115n    (  0)        ==>iload (in GPR_0034)
------------------------------

 [     0x3ff87128850]                          L       GPR_0066, Shadow[java/util/zip/CRC32C.crc I] 8(&GPR_0018)
 [     0x3ff87128990]                          LR      GPR_0067,GPR_0016         ; LR=Clobber_eval
 [     0x3ff87128a90]                          AR      GPR_0034,GPR_0016      
 [     0x3ff87128c20]                          LLGFR   GPR_0067,GPR_0067      
 [     0x3ff87128d00]                          LLGFR   GPR_0034,GPR_0034      
 [     0x3ff87128f50]                          LA      GPR_0069,#413 16(GPR_0067,&GPR_0017)
 [     0x3ff87129020]                          SGRK    GPR_0068,GPR_0034,GPR_0067      
 [     0x3ff87129370]                          CLGIJ   GPR_0068,Outlined Label L0032,16,BM(mask=0x4),    ; Jump to OOL call to original Java implementation if < 16B 
 [     0x3ff87129790]                          LARL    GPR_0070, &<LiteralPool Base Address>     ; LoadLitPool
 [     0x3ff87129de0]                          VLM     VRF_0071,VRF_0076,#414 =X(3ffb00e38f0) 0(GPR_0070) ; Populate vectors with CRC-32C reduction constants
 [     0x3ff87129f90]                          VGBM    VRF_0077,0x0
 [     0x3ff8712a140]                          VLVGF   VRF_0077,GPR_0066,#415 3
 [     0x3ff8712a280]                          Label L0034:     # (Start of internal control flow)      
 [     0x3ff8712a3c0]                          CLGIJ   GPR_0068,Label L0035,64,BM(mask=0x4),     ; if remaining < 64 goto foldBy1
 [     0x3ff8712a530]                          Label L0036:      ; foldBy4
 [     0x3ff8712a9b0]                          VLM     VRF_0078,VRF_0081,#416 0(GPR_0069)
 [     0x3ff8712aa90]                          VPERM   VRF_0078,VRF_0078,VRF_0078,VRF_0071
 [     0x3ff8712ab70]                          VPERM   VRF_0079,VRF_0079,VRF_0079,VRF_0071
 [     0x3ff8712ac50]                          VPERM   VRF_0080,VRF_0080,VRF_0080,VRF_0071
 [     0x3ff8712ad30]                          VPERM   VRF_0081,VRF_0081,VRF_0081,VRF_0071
 [     0x3ff8712ae10]                          VX      VRF_0078,VRF_0077,VRF_0078
 [     0x3ff8712af50]                          BRC     BRU(0xf), Label L0037    
 [     0x3ff8712b090]                          Label L0038:      ; foldBy4Loop
 [     0x3ff8712b510]                          VLM     VRF_0082,VRF_0085,#417 0(GPR_0069)
 [     0x3ff8712b5f0]                          VPERM   VRF_0082,VRF_0082,VRF_0082,VRF_0071
 [     0x3ff8712b6d0]                          VPERM   VRF_0083,VRF_0083,VRF_0083,VRF_0071
 [     0x3ff8712b7b0]                          VPERM   VRF_0084,VRF_0084,VRF_0084,VRF_0071
 [     0x3ff8712b890]                          VPERM   VRF_0085,VRF_0085,VRF_0085,VRF_0071
 [     0x3ff8712b970]                          VGFMAG  VRF_0078,VRF_0072,VRF_0078,VRF_0082
 [     0x3ff8712ba50]                          VGFMAG  VRF_0079,VRF_0072,VRF_0079,VRF_0083
 [     0x3ff8712bb30]                          VGFMAG  VRF_0080,VRF_0072,VRF_0080,VRF_0084
 [     0x3ff8712bc10]                          VGFMAG  VRF_0081,VRF_0072,VRF_0081,VRF_0085
 [     0x3ff8712bcf0]                          Label L0037:     
 [     0x3ff8712be90]                          LA      GPR_0069,#418 64(GPR_0069)
 [     0x3ff8712bf60]                          AGHI    GPR_0068,0xffc0  
 [     0x3ff8712c040]                          CLGIJ   GPR_0068,Label L0038,64,BNM(mask=0xa),   
 [     0x3ff8712c190]                          Label L0039:      ; reduce 4 vectors into 1
 [     0x3ff8712c290]                          VGFMAG  VRF_0078,VRF_0073,VRF_0078,VRF_0079
 [     0x3ff8712c370]                          VGFMAG  VRF_0078,VRF_0073,VRF_0078,VRF_0080
 [     0x3ff8712c450]                          VGFMAG  VRF_0078,VRF_0073,VRF_0078,VRF_0081
 [     0x3ff8712c590]                          CLGIJ   GPR_0068,Label L0040,16,BM(mask=0x4),     ; if remaining < 16 goto finalReduction
 [     0x3ff8712c700]                          BRC     BRU(0xf), Label L0041     ; Jump over foldBy1 header
 [     0x3ff8712c800]                          Label L0035:      ; foldBy1
 [     0x3ff8712c9c0]                          VL      VRF_0078,#419 0(GPR_0069),0
 [     0x3ff8712caa0]                          VPERM   VRF_0078,VRF_0078,VRF_0078,VRF_0071
 [     0x3ff8712cb80]                          VX      VRF_0078,VRF_0077,VRF_0078
 [     0x3ff8712ccc0]                          BRC     BRU(0xf), Label L0042
 [     0x3ff8712cda0]                          Label L0041:      ; foldBy1Loop
 [     0x3ff8712cf60]                          VL      VRF_0079,#420 0(GPR_0069),0
 [     0x3ff8712d040]                          VPERM   VRF_0079,VRF_0079,VRF_0079,VRF_0071
 [     0x3ff8712d120]                          VGFMAG  VRF_0078,VRF_0073,VRF_0078,VRF_0079
 [     0x3ff8712d200]                          Label L0042:
 [     0x3ff8712d3a0]                          LA      GPR_0069,#421 16(GPR_0069)
 [     0x3ff8712d470]                          AGHI    GPR_0068,0xfff0
 [     0x3ff8712d550]                          CLGIJ   GPR_0068,Label L0041,16,BNM(mask=0xa),
 [     0x3ff8712dc00]                          assocreg {GPR1:&GPR_0018:R} {GPR2:&GPR_0017:R} {GPR3:GPR_0016:R}
 [     0x3ff8712d640]                          Label L0040:     # (End of internal control flow)         ; Reduce vector into 32-bit CRC-32C value
 POST:
 {AssignAny:GPR_0070:R} {VRF9:VRF_0071:R} {VRF10:VRF_0072:R} {VRF11:VRF_0073:R} {VRF12:VRF_0074:R} {VRF13:VRF_0075:R} {VRF14:VRF_0076:R} {VRF1:VRF_0078:R}
 {VRF2:VRF_0079:R} {VRF3:VRF_0080:R} {VRF4:VRF_0081:R} {VRF5:VRF_0082:R} {VRF6:VRF_0083:R} {VRF7:VRF_0084:R} {VRF8:VRF_0085:R} {AssignAny:VRF_0077:R}
 {AssignAny:GPR_0069:R}
 [     0x3ff8712dcf0]                          VLEIB   VRF_0071,0x40,7
 [     0x3ff8712dde0]                          VSRLB   VRF_0077,VRF_0073,VRF_0071
 [     0x3ff8712dec0]                          VLEIG   VRF_0077,0x1,0
 [     0x3ff8712dfb0]                          VGFMG   VRF_0078,VRF_0077,VRF_0078
 [     0x3ff8712e090]                          VLEIB   VRF_0071,0x20,7
 [     0x3ff8712e180]                          VSRLB   VRF_0079,VRF_0078,VRF_0071
 [     0x3ff8712e260]                          VUPLLF  VRF_0078,VRF_0078
 [     0x3ff8712e340]                          VGFMAG  VRF_0078,VRF_0074,VRF_0078,VRF_0079
 [     0x3ff8712e420]                          VUPLLF  VRF_0079,VRF_0078
 [     0x3ff8712e500]                          VGFMG   VRF_0079,VRF_0075,VRF_0079
 [     0x3ff8712e5e0]                          VUPLLF  VRF_0079,VRF_0079
 [     0x3ff8712e6c0]                          VGFMAG  VRF_0079,VRF_0076,VRF_0079,VRF_0078
 [     0x3ff8712e860]                          VLGVF   GPR_0066,VRF_0079,#422 2
 [     0x3ff8712e940]                          CLGIJ   GPR_0068,Outlined Label L0032,0,BNH(mask=0x6),    ; Jump to OOL call to original Java implementation if remaining data
 [     0x3ff87152330]                          assocreg {GPR0:D_GPR_0088:R}* {GPR1:D_GPR_0089:R}* {GPR2:GPR_0087:R} {GPR3:D_GPR_0090:R}* {GPR4:D_GPR_0091:R}* {GPR14:GPR_0092:R} {FPR0:D_FPR_0093:R}* {FPR1:D_FPR_0094:R}* {FPR2:D_FPR_0095:R}* {FPR3:D_FPR_00
 [     0x3ff87151d70]                          Label L0043:
 POST:
 {AssignAny:&GPR_0017:R} {AssignAny:GPR_0067:R} {AssignAny:GPR_0034:R} {AssignAny:GPR_0068:R} {AssignAny:GPR_0066:R}

------------------------------

 [     0x3ff8712eb00]                          Outlined Label L0032:     ; Denotes start of OOL call to original CRC32C.updateBytes Java implementation
 [     0x3ff8712ec00]                          SGRK    GPR_0067,GPR_0034,GPR_0068
 [     0x3ff8712f950]                          LGR     GPR_0086,GPR_0066         ; LR=Reg_param
 [     0x3ff8712fb10]                          ST      GPR_0034,#423 0(GPR5)
 [     0x3ff871519e0]                          assocreg {GPR1:&GPR_0018:R} {GPR2:&GPR_0017:R} {GPR3:GPR_0016:R} {VRF1:VRF_0078:R} {VRF2:VRF_0079:R} {VRF3:VRF_0080:R} {VRF4:VRF_0081:R} {VRF5:VRF_0082:R} {VRF6:VRF_0083:R} {VRF7:VRF_0084:R} {VRF8:VRF_0085:R
  PRE:
 {GPR1:GPR_0086:R} {GPR2:&GPR_0017:R} {GPR3:GPR_0067:R}
 [     0x3ff871513f0]                          BRASL   GPR_0092, 0x000003FF87151380
 POST:
 {GPR2:GPR_0087:D} {GPR0:D_GPR_0088:D}* {GPR1:D_GPR_0089:D}* {GPR3:D_GPR_0090:D}* {GPR4:D_GPR_0091:D}* {GPR14:GPR_0092:D} {FPR0:D_FPR_0093:D}* {FPR1:D_FPR_0094:D}*
 {FPR2:D_FPR_0095:D}* {FPR3:D_FPR_0096:D}* {FPR4:D_FPR_0097:D}* {FPR5:D_FPR_0098:D}* {FPR6:D_FPR_0099:D}* {FPR7:D_FPR_0100:D}* {FPR8:D_FPR_0101:D}* {FPR9:D_FPR_0102:D}*
 {FPR10:D_FPR_0103:D}* {FPR11:D_FPR_0104:D}* {FPR12:D_FPR_0105:D}* {FPR13:D_FPR_0106:D}* {FPR14:D_FPR_0107:D}* {FPR15:D_FPR_0108:D}* {VRF16:D_VRF_0109:D}* {VRF17:D_VRF_0110:D}*
 {VRF18:D_VRF_0111:D}* {VRF19:D_VRF_0112:D}* {VRF20:D_VRF_0113:D}* {VRF21:D_VRF_0114:D}* {VRF22:D_VRF_0115:D}* {VRF23:D_VRF_0116:D}* {VRF24:D_VRF_0117:D}* {VRF25:D_VRF_0118:D}*
 {VRF26:D_VRF_0119:D}* {VRF27:D_VRF_0120:D}* {VRF28:D_VRF_0121:D}* {VRF29:D_VRF_0122:D}* {VRF30:D_VRF_0123:D}* {VRF31:D_VRF_0124:D}*
 [     0x3ff87151ab0]                          LGR     GPR_0066,GPR_0087
 [     0x3ff87151b90]                          BRC     NOP(0xf), Label L0043     ; Return to main-line
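
The folding steps in the listing above are built on VGFMAG. As a rough model only, and assuming its doubleword form multiplies each pair of 64-bit elements without carries and XORs both 128-bit products together with the accumulator operand, the operation could be sketched in Java as follows. The class and method names are hypothetical, and the sketch ignores the byte-order handling done by VPERM in the listing.

```java
public class ClmulSketch {
    /** Carry-less (GF(2)) 64x64 -> 128-bit multiply; returns {high, low}. */
    static long[] clmul(long a, long b) {
        long hi = 0;
        long lo = 0;
        for (int i = 0; i < 64; i++) {
            if (((b >>> i) & 1L) != 0) {
                // XOR in a shifted copy of a; XOR replaces addition, so no carries.
                lo ^= a << i;
                if (i != 0) {
                    hi ^= a >>> (64 - i);
                }
            }
        }
        return new long[] {hi, lo};
    }

    /**
     * Model of a multiply-sum-accumulate step: both element products are
     * XORed together and then XORed with the 128-bit accumulator.
     * Each argument is a 128-bit value given as {high, low}.
     */
    static long[] gfMulSumAcc(long[] x, long[] y, long[] acc) {
        long[] p0 = clmul(x[0], y[0]);
        long[] p1 = clmul(x[1], y[1]);
        return new long[] {p0[0] ^ p1[0] ^ acc[0], p0[1] ^ p1[1] ^ acc[1]};
    }
}
```

In this model, one fold step corresponds to `gfMulSumAcc(constants, accumulator, nextChunk)`: the running value is multiplied by precomputed constants and the next 16B of input is XORed in.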

@Spencer-Comin
Contributor Author

I've opened a PR adding a unit test for this acceleration here: #16404

runtime/compiler/z/codegen/J9TreeEvaluator.cpp Outdated Show resolved Hide resolved
runtime/compiler/z/codegen/J9TreeEvaluator.cpp Outdated Show resolved Hide resolved
runtime/compiler/z/codegen/J9TreeEvaluator.cpp Outdated Show resolved Hide resolved
runtime/compiler/z/codegen/J9TreeEvaluator.cpp Outdated Show resolved Hide resolved
@r30shah
Contributor

r30shah commented Dec 16, 2022

@Spencer-Comin I briefly went over the changes; they look OK to me, and most of the concerns were addressed. Have there been any changes to the assembly you posted in this comment (#16379 (comment))? I want to take one last look before approving the changes.

@Spencer-Comin
Contributor Author

@r30shah I've updated the assembly in #16379 (comment)

Contributor

@r30shah r30shah left a comment

@Spencer-Comin I went through the changes quickly and they look OK to me. This kind of performance change should be accompanied by relevant performance numbers. Can you post the results and analysis from your JMH benchmark so we have them documented here?

@joransiu Can you please take a quick look as well?

Accelerate CRC32C.updateBytes and CRC32C.updateDirectByteBuffers recognized
methods by inlining them with a vector assembly sequence.

Signed-off-by: Spencer Comin <spencer.comin@ibm.com>
@joransiu
Member

Jenkins test sanity,extended zlinux jdk11,jdk17

@joransiu joransiu merged commit a9d4f14 into eclipse-openj9:master Jan 16, 2023
@Spencer-Comin
Contributor Author

While doing some benchmarking I found an intermittent (~1/100) error related to this change. In brief, a register spill can be inserted between the first possible jump to the out-of-line call and the start of the vector internal control flow. When the input is 15B or less, that jump is taken and the spill is never executed, leading to a segfault later on.

@joransiu
Member

> While doing some benchmarking I found a ~1/100 error related to this change. As a brief summary, a register spill can be inserted between the first possible jump to the out-of-line call and the start of the vector internal control flow. In the 15B or less case, the spill isn't hit, leading to a segfault later on.

Thanks @Spencer-Comin. Can you open an issue to track this? I'd like to better understand which virtual register is leading to that spill. From the code, it looks like you already have all the dependencies registered and added to the instructions at the relevant control flow points.
