Accelerate CRC-32C calculation on Z #16379

Spencer-Comin · 2022-11-29T16:52:20Z

Accelerate CRC32C.updateBytes and CRC32C.updateDirectByteBuffers recognized methods by inlining them with a vector assembly sequence.

The implementation is based on the bit-reflected algorithm described in Intel's whitepaper "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction" [1] using equivalent Z instructions. The basic layout of the implementation is as follows:

While at least 64B of data remain, consume 64B chunks using 4 vectors in parallel, then
while at least 16B of data remain, consume 16B chunks using 1 vector, then
if any data remains, call the original implementation on the remainder.
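
The layout above can be sketched in plain Java. This is only an illustration of the control flow, not the code in this PR: the class name `Crc32cChunkingSketch` and its methods are hypothetical, and a bit-by-bit scalar loop stands in for the vector folding in each tier. The `(crc, b, off, end)` argument order is assumed to mirror `CRC32C.updateBytes`, operating on the raw (pre-inversion) CRC register.

```java
import java.nio.charset.StandardCharsets;

final class Crc32cChunkingSketch {
    // Bit-reflected CRC-32C (Castagnoli) polynomial.
    private static final int POLY = 0x82F63B78;

    // Scalar stand-in: advances the raw CRC register over b[off, end), one bit at a time.
    static int updateScalar(int crc, byte[] b, int off, int end) {
        for (int i = off; i < end; i++) {
            crc ^= b[i] & 0xFF;
            for (int k = 0; k < 8; k++) {
                crc = (crc >>> 1) ^ (POLY & -(crc & 1));
            }
        }
        return crc;
    }

    // Three-tier dispatch: 64B chunks, then 16B chunks, then the remainder.
    static int update(int crc, byte[] b, int off, int end) {
        int remaining = end - off;
        if (remaining < 16) {
            // Too short for the vector path: hand everything to the original implementation.
            return updateScalar(crc, b, off, end);
        }
        while (remaining >= 64) {
            // Assumption: the real sequence folds 4 vectors in parallel here.
            crc = updateScalar(crc, b, off, off + 64);
            off += 64;
            remaining -= 64;
        }
        while (remaining >= 16) {
            // Assumption: the real sequence folds 1 vector here.
            crc = updateScalar(crc, b, off, off + 16);
            off += 16;
            remaining -= 16;
        }
        if (remaining > 0) {
            // Leftover tail of fewer than 16 bytes goes to the original implementation.
            crc = updateScalar(crc, b, off, end);
        }
        return crc;
    }

    // Full checksum with the usual initial and final inversion.
    static int checksum(byte[] data) {
        return ~update(~0, data, 0, data.length);
    }

    public static void main(String[] args) {
        byte[] check = "123456789".getBytes(StandardCharsets.US_ASCII);
        System.out.println(Integer.toHexString(checksum(check)));
    }
}
```

Because every tier advances the same CRC register, the result does not depend on where the chunk boundaries fall, so the sketch can be compared against `java.util.zip.CRC32C` for any input length.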

Preliminary benchmarking shows approximately 22× speedup for 4kB and 16kB byte arrays, and no significant speedup or slowdown for 15B arrays.

[1] https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf

Spencer-Comin · 2022-11-29T17:55:44Z

@joransiu @r30shah fyi

r30shah

@Spencer-Comin I have skimmed through the changes and left some comments. I will check again soon, but I would appreciate it if you could provide a sample implementation in a comment; that would make it easier to follow.

Spencer-Comin · 2022-12-09T17:38:56Z

For reference, here's the generated assembly (log from a unit test on a z13 machine):

============================================================
; Live regs: GPR=4 FPR=0 VRF=0 {GPR_0034, &GPR_0018, &GPR_0017, GPR_0016}
------------------------------
 n62n     (  0)  treetop                                                                              [     0x3ff87080320] bci=[-1,46,153] rc=0 vc=333 vn=- li=8 udi=- nc=1
 n61n     (  2)    icall  java/util/zip/CRC32C.updateBytes(I[BII)I[#409  final static Method] [flags 0x20500 0x0 ] ()  [     0x3ff870802d0] bci=[-1,46,153] rc=2 vc=333 vn=- li=8 udi=- nc=4 flg=0x20
 n55n     (  1)      iloadi  java/util/zip/CRC32C.crc I[#404  Shadow +8] [flags 0x80603 0x0 ] (cannotOverflow )  [     0x3ff870800f0] bci=[-1,38,153] rc=1 vc=333 vn=- li=8 udi=- nc=1 flg=0x1000
 n113n    (  2)        ==>aRegLoad (in &GPR_0018) (X!=0 X>=0 SeenRealReference sharedMemory )
 n112n    (  1)      ==>aRegLoad (in &GPR_0017) (SeenRealReference sharedMemory )
 n111n    (  2)      ==>iRegLoad (in GPR_0016) (cannotOverflow SeenRealReference )
 n60n     (  1)      iadd                                                                             [     0x3ff87080280] bci=[-1,45,153] rc=1 vc=333 vn=- li=8 udi=- nc=2
 n111n    (  2)        ==>iRegLoad (in GPR_0016) (cannotOverflow SeenRealReference )
 n115n    (  1)        ==>iload (in GPR_0034)
------------------------------

buildDirectDispatch
Build Direct Call

GC Point at 000003FF871513F0 has preserved register map 1fc0
------------------------------
 n62n     (  0)  treetop                                                                              [     0x3ff87080320] bci=[-1,46,153] rc=0 vc=333 vn=- li=8 udi=- nc=1
 n61n     (  1)    icall  java/util/zip/CRC32C.updateBytes(I[BII)I[#409  final static Method] [flags 0x20500 0x0 ] (in GPR_0066) ()  [     0x3ff870802d0] bci=[-1,46,153] rc=1 vc=333 vn=- li=8 udi=- nc=4 flg=0x20
 n55n     (  0)      iloadi  java/util/zip/CRC32C.crc I[#404  Shadow +8] [flags 0x80603 0x0 ] (in GPR_0066) (cannotOverflow )  [     0x3ff870800f0] bci=[-1,38,153] rc=0 vc=333 vn=- li=8 udi=- nc=1 flg=0x1000
 n113n    (  1)        ==>aRegLoad (in &GPR_0018) (X!=0 X>=0 SeenRealReference sharedMemory )
 n112n    (  0)      ==>aRegLoad (in &GPR_0017) (SeenRealReference sharedMemory )
 n111n    (  0)      ==>iRegLoad (in GPR_0016) (cannotOverflow SeenRealReference )
 n60n     (  0)      iadd (in GPR_0034)                                                               [     0x3ff87080280] bci=[-1,45,153] rc=0 vc=333 vn=- li=8 udi=- nc=2
 n111n    (  0)        ==>iRegLoad (in GPR_0016) (cannotOverflow SeenRealReference )
 n115n    (  0)        ==>iload (in GPR_0034)
------------------------------

 [     0x3ff87128850]                          L       GPR_0066, Shadow[java/util/zip/CRC32C.crc I] 8(&GPR_0018)
 [     0x3ff87128990]                          LR      GPR_0067,GPR_0016         ; LR=Clobber_eval
 [     0x3ff87128a90]                          AR      GPR_0034,GPR_0016      
 [     0x3ff87128c20]                          LLGFR   GPR_0067,GPR_0067      
 [     0x3ff87128d00]                          LLGFR   GPR_0034,GPR_0034      
 [     0x3ff87128f50]                          LA      GPR_0069,#413 16(GPR_0067,&GPR_0017)
 [     0x3ff87129020]                          SGRK    GPR_0068,GPR_0034,GPR_0067      
 [     0x3ff87129370]                          CLGIJ   GPR_0068,Outlined Label L0032,16,BM(mask=0x4),    ; Jump to OOL call to original Java implementation if < 16B 
 [     0x3ff87129790]                          LARL    GPR_0070, &<LiteralPool Base Address>     ; LoadLitPool
 [     0x3ff87129de0]                          VLM     VRF_0071,VRF_0076,#414 =X(3ffb00e38f0) 0(GPR_0070) ; Populate vectors with CRC-32C reduction constants
 [     0x3ff87129f90]                          VGBM    VRF_0077,0x0
 [     0x3ff8712a140]                          VLVGF   VRF_0077,GPR_0066,#415 3
 [     0x3ff8712a280]                          Label L0034:     # (Start of internal control flow)      
 [     0x3ff8712a3c0]                          CLGIJ   GPR_0068,Label L0035,64,BM(mask=0x4),     ; if remaining < 64 goto foldBy1
 [     0x3ff8712a530]                          Label L0036:      ; foldBy4
 [     0x3ff8712a9b0]                          VLM     VRF_0078,VRF_0081,#416 0(GPR_0069)
 [     0x3ff8712aa90]                          VPERM   VRF_0078,VRF_0078,VRF_0078,VRF_0071
 [     0x3ff8712ab70]                          VPERM   VRF_0079,VRF_0079,VRF_0079,VRF_0071
 [     0x3ff8712ac50]                          VPERM   VRF_0080,VRF_0080,VRF_0080,VRF_0071
 [     0x3ff8712ad30]                          VPERM   VRF_0081,VRF_0081,VRF_0081,VRF_0071
 [     0x3ff8712ae10]                          VX      VRF_0078,VRF_0077,VRF_0078
 [     0x3ff8712af50]                          BRC     BRU(0xf), Label L0037    
 [     0x3ff8712b090]                          Label L0038:      ; foldBy4Loop
 [     0x3ff8712b510]                          VLM     VRF_0082,VRF_0085,#417 0(GPR_0069)
 [     0x3ff8712b5f0]                          VPERM   VRF_0082,VRF_0082,VRF_0082,VRF_0071
 [     0x3ff8712b6d0]                          VPERM   VRF_0083,VRF_0083,VRF_0083,VRF_0071
 [     0x3ff8712b7b0]                          VPERM   VRF_0084,VRF_0084,VRF_0084,VRF_0071
 [     0x3ff8712b890]                          VPERM   VRF_0085,VRF_0085,VRF_0085,VRF_0071
 [     0x3ff8712b970]                          VGFMAG  VRF_0078,VRF_0072,VRF_0078,VRF_0082
 [     0x3ff8712ba50]                          VGFMAG  VRF_0079,VRF_0072,VRF_0079,VRF_0083
 [     0x3ff8712bb30]                          VGFMAG  VRF_0080,VRF_0072,VRF_0080,VRF_0084
 [     0x3ff8712bc10]                          VGFMAG  VRF_0081,VRF_0072,VRF_0081,VRF_0085
 [     0x3ff8712bcf0]                          Label L0037:     
 [     0x3ff8712be90]                          LA      GPR_0069,#418 64(GPR_0069)
 [     0x3ff8712bf60]                          AGHI    GPR_0068,0xffc0  
 [     0x3ff8712c040]                          CLGIJ   GPR_0068,Label L0038,64,BNM(mask=0xa),   
 [     0x3ff8712c190]                          Label L0039:      ; reduce 4 vectors into 1
 [     0x3ff8712c290]                          VGFMAG  VRF_0078,VRF_0073,VRF_0078,VRF_0079
 [     0x3ff8712c370]                          VGFMAG  VRF_0078,VRF_0073,VRF_0078,VRF_0080
 [     0x3ff8712c450]                          VGFMAG  VRF_0078,VRF_0073,VRF_0078,VRF_0081
 [     0x3ff8712c590]                          CLGIJ   GPR_0068,Label L0040,16,BM(mask=0x4),     ; if remaining < 16 goto finalReduction
 [     0x3ff8712c700]                          BRC     BRU(0xf), Label L0041     ; Jump over foldBy1 header
 [     0x3ff8712c800]                          Label L0035:      ; foldBy1
 [     0x3ff8712c9c0]                          VL      VRF_0078,#419 0(GPR_0069),0
 [     0x3ff8712caa0]                          VPERM   VRF_0078,VRF_0078,VRF_0078,VRF_0071
 [     0x3ff8712cb80]                          VX      VRF_0078,VRF_0077,VRF_0078
 [     0x3ff8712ccc0]                          BRC     BRU(0xf), Label L0042
 [     0x3ff8712cda0]                          Label L0041:      ; foldBy1Loop
 [     0x3ff8712cf60]                          VL      VRF_0079,#420 0(GPR_0069),0
 [     0x3ff8712d040]                          VPERM   VRF_0079,VRF_0079,VRF_0079,VRF_0071
 [     0x3ff8712d120]                          VGFMAG  VRF_0078,VRF_0073,VRF_0078,VRF_0079
 [     0x3ff8712d200]                          Label L0042:
 [     0x3ff8712d3a0]                          LA      GPR_0069,#421 16(GPR_0069)
 [     0x3ff8712d470]                          AGHI    GPR_0068,0xfff0
 [     0x3ff8712d550]                          CLGIJ   GPR_0068,Label L0041,16,BNM(mask=0xa),
 [     0x3ff8712dc00]                          assocreg {GPR1:&GPR_0018:R} {GPR2:&GPR_0017:R} {GPR3:GPR_0016:R}
 [     0x3ff8712d640]                          Label L0040:     # (End of internal control flow)         ; Reduce vector into 32-bit CRC-32C value
 POST:
 {AssignAny:GPR_0070:R} {VRF9:VRF_0071:R} {VRF10:VRF_0072:R} {VRF11:VRF_0073:R} {VRF12:VRF_0074:R} {VRF13:VRF_0075:R} {VRF14:VRF_0076:R} {VRF1:VRF_0078:R}
 {VRF2:VRF_0079:R} {VRF3:VRF_0080:R} {VRF4:VRF_0081:R} {VRF5:VRF_0082:R} {VRF6:VRF_0083:R} {VRF7:VRF_0084:R} {VRF8:VRF_0085:R} {AssignAny:VRF_0077:R}
 {AssignAny:GPR_0069:R}
 [     0x3ff8712dcf0]                          VLEIB   VRF_0071,0x40,7
 [     0x3ff8712dde0]                          VSRLB   VRF_0077,VRF_0073,VRF_0071
 [     0x3ff8712dec0]                          VLEIG   VRF_0077,0x1,0
 [     0x3ff8712dfb0]                          VGFMG   VRF_0078,VRF_0077,VRF_0078
 [     0x3ff8712e090]                          VLEIB   VRF_0071,0x20,7
 [     0x3ff8712e180]                          VSRLB   VRF_0079,VRF_0078,VRF_0071
 [     0x3ff8712e260]                          VUPLLF  VRF_0078,VRF_0078
 [     0x3ff8712e340]                          VGFMAG  VRF_0078,VRF_0074,VRF_0078,VRF_0079
 [     0x3ff8712e420]                          VUPLLF  VRF_0079,VRF_0078
 [     0x3ff8712e500]                          VGFMG   VRF_0079,VRF_0075,VRF_0079
 [     0x3ff8712e5e0]                          VUPLLF  VRF_0079,VRF_0079
 [     0x3ff8712e6c0]                          VGFMAG  VRF_0079,VRF_0076,VRF_0079,VRF_0078
 [     0x3ff8712e860]                          VLGVF   GPR_0066,VRF_0079,#422 2
 [     0x3ff8712e940]                          CLGIJ   GPR_0068,Outlined Label L0032,0,BNH(mask=0x6),    ; Jump to OOL call to original Java implementation if remaining data
 [     0x3ff87152330]                          assocreg {GPR0:D_GPR_0088:R}* {GPR1:D_GPR_0089:R}* {GPR2:GPR_0087:R} {GPR3:D_GPR_0090:R}* {GPR4:D_GPR_0091:R}* {GPR14:GPR_0092:R} {FPR0:D_FPR_0093:R}* {FPR1:D_FPR_0094:R}* {FPR2:D_FPR_0095:R}* {FPR3:D_FPR_00
 [     0x3ff87151d70]                          Label L0043:
 POST:
 {AssignAny:&GPR_0017:R} {AssignAny:GPR_0067:R} {AssignAny:GPR_0034:R} {AssignAny:GPR_0068:R} {AssignAny:GPR_0066:R}

------------------------------

 [     0x3ff8712eb00]                          Outlined Label L0032:     ; Denotes start of OOL call to original CRC32C.updateBytes Java implementation
 [     0x3ff8712ec00]                          SGRK    GPR_0067,GPR_0034,GPR_0068
 [     0x3ff8712f950]                          LGR     GPR_0086,GPR_0066         ; LR=Reg_param
 [     0x3ff8712fb10]                          ST      GPR_0034,#423 0(GPR5)
 [     0x3ff871519e0]                          assocreg {GPR1:&GPR_0018:R} {GPR2:&GPR_0017:R} {GPR3:GPR_0016:R} {VRF1:VRF_0078:R} {VRF2:VRF_0079:R} {VRF3:VRF_0080:R} {VRF4:VRF_0081:R} {VRF5:VRF_0082:R} {VRF6:VRF_0083:R} {VRF7:VRF_0084:R} {VRF8:VRF_0085:R
  PRE:
 {GPR1:GPR_0086:R} {GPR2:&GPR_0017:R} {GPR3:GPR_0067:R}
 [     0x3ff871513f0]                          BRASL   GPR_0092, 0x000003FF87151380
 POST:
 {GPR2:GPR_0087:D} {GPR0:D_GPR_0088:D}* {GPR1:D_GPR_0089:D}* {GPR3:D_GPR_0090:D}* {GPR4:D_GPR_0091:D}* {GPR14:GPR_0092:D} {FPR0:D_FPR_0093:D}* {FPR1:D_FPR_0094:D}*
 {FPR2:D_FPR_0095:D}* {FPR3:D_FPR_0096:D}* {FPR4:D_FPR_0097:D}* {FPR5:D_FPR_0098:D}* {FPR6:D_FPR_0099:D}* {FPR7:D_FPR_0100:D}* {FPR8:D_FPR_0101:D}* {FPR9:D_FPR_0102:D}*
 {FPR10:D_FPR_0103:D}* {FPR11:D_FPR_0104:D}* {FPR12:D_FPR_0105:D}* {FPR13:D_FPR_0106:D}* {FPR14:D_FPR_0107:D}* {FPR15:D_FPR_0108:D}* {VRF16:D_VRF_0109:D}* {VRF17:D_VRF_0110:D}*
 {VRF18:D_VRF_0111:D}* {VRF19:D_VRF_0112:D}* {VRF20:D_VRF_0113:D}* {VRF21:D_VRF_0114:D}* {VRF22:D_VRF_0115:D}* {VRF23:D_VRF_0116:D}* {VRF24:D_VRF_0117:D}* {VRF25:D_VRF_0118:D}*
 {VRF26:D_VRF_0119:D}* {VRF27:D_VRF_0120:D}* {VRF28:D_VRF_0121:D}* {VRF29:D_VRF_0122:D}* {VRF30:D_VRF_0123:D}* {VRF31:D_VRF_0124:D}*
 [     0x3ff87151ab0]                          LGR     GPR_0066,GPR_0087
 [     0x3ff87151b90]                          BRC     NOP(0xf), Label L0043     ; Return to main-line
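
For readers unfamiliar with the `VGFMG` and `VGFMAG` instructions in the log above, here is a rough Java model of what they are understood to compute on doubleword elements: a carry-less (GF(2)) multiply of each pair of 64-bit elements, with the two 128-bit products XORed together and, for `VGFMAG`, XORed with a third 128-bit operand. The class and method names are hypothetical, and the model is a sketch for following the fold steps, not a statement of the exact architectural definition.

```java
final class GaloisMultiplySketch {
    // Carry-less 64x64 -> 128-bit multiply. Returns {high 64 bits, low 64 bits}.
    static long[] clmul(long a, long b) {
        long hi = 0;
        long lo = 0;
        for (int i = 0; i < 64; i++) {
            if (((b >>> i) & 1L) != 0) {
                lo ^= a << i;
                // Java masks shift counts to 6 bits, so the i == 0 case must be skipped.
                if (i != 0) {
                    hi ^= a >>> (64 - i);
                }
            }
        }
        return new long[] {hi, lo};
    }

    // Model of a multiply-sum-accumulate over two doubleword elements:
    // (a[0] * b[0]) XOR (a[1] * b[1]) XOR acc, all in GF(2), as a 128-bit {high, low} pair.
    static long[] multiplySumAccumulate(long[] a, long[] b, long[] acc) {
        long[] p0 = clmul(a[0], b[0]);
        long[] p1 = clmul(a[1], b[1]);
        return new long[] {p0[0] ^ p1[0] ^ acc[0], p0[1] ^ p1[1] ^ acc[1]};
    }

    public static void main(String[] args) {
        // (x + 1) * (x + 1) = x^2 + 1 over GF(2), i.e. 3 * 3 = 5 with no carries.
        long[] p = clmul(3L, 3L);
        System.out.println(Long.toHexString(p[0]) + " " + Long.toHexString(p[1]));
    }
}
```

Under this reading, a line such as `VGFMAG VRF_0078,VRF_0072,VRF_0078,VRF_0082` in the foldBy4 loop multiplies the running value in `VRF_0078` by the constants in `VRF_0072` and XORs in the freshly loaded data from `VRF_0082`, which is one fold step of the algorithm in [1].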

Spencer-Comin · 2022-12-13T15:35:34Z

I've opened a PR adding a unit test for this acceleration here: #16404

r30shah · 2022-12-16T14:44:46Z

@Spencer-Comin I briefly went over the changes; they look OK to me, and most of the concerns have been addressed. Has the assembly you printed in this comment (#16379 (comment)) changed? I want to take one last look before approving.

Spencer-Comin · 2022-12-20T16:08:52Z

@r30shah I've updated the assembly in #16379 (comment)

r30shah

@Spencer-Comin I went through the changes quickly and they look OK to me. Performance changes of this kind should be accompanied by relevant performance numbers. Can you post the results and analysis from your JMH benchmark so we have them documented here?

@joransiu Can you please take a quick look as well?

joransiu · 2023-01-13T19:25:33Z

Jenkins test sanity,extended zlinux jdk11,jdk17

Spencer-Comin · 2023-01-16T18:47:41Z

While doing some benchmarking I found a ~1/100 error related to this change. As a brief summary, a register spill can be inserted between the first possible jump to the out-of-line call and the start of the vector internal control flow. In the 15B or less case, the spill isn't hit, leading to a segfault later on.

joransiu · 2023-01-16T21:34:11Z

Thanks @Spencer-Comin. Can you open an issue to track this? I'd like to better understand which virtual register is leading to that spill. From the code, it looks like you have all the dependencies registered and added to the relevant control flow point instructions already.

Spencer-Comin force-pushed the crc32c-call-java branch from 838e66d to 50bde15 Compare November 29, 2022 17:04

Spencer-Comin force-pushed the crc32c-call-java branch from 50bde15 to b9a8506 Compare November 29, 2022 18:52

r30shah requested changes Nov 29, 2022

Spencer-Comin force-pushed the crc32c-call-java branch from b9a8506 to c0e2494 Compare November 29, 2022 20:59

Spencer-Comin mentioned this pull request Dec 1, 2022

Add unit test for accelerated CRC32C.update on Z #16404

Merged

r30shah requested changes Dec 7, 2022

Spencer-Comin force-pushed the crc32c-call-java branch 3 times, most recently from 2d7eb52 to 7544479 Compare December 9, 2022 17:34

r30shah requested changes Dec 13, 2022

joransiu reviewed Dec 13, 2022

Spencer-Comin force-pushed the crc32c-call-java branch from 7544479 to 8fae066 Compare December 14, 2022 16:18

Spencer-Comin mentioned this pull request Dec 14, 2022

Investigate big-endian algorithm for CRC-32C acceleration on Z #16474

Open

Spencer-Comin force-pushed the crc32c-call-java branch from 8fae066 to b7bce51 Compare December 14, 2022 16:46

r30shah approved these changes Jan 11, 2023

Spencer-Comin force-pushed the crc32c-call-java branch from b7bce51 to 1c2eb8a Compare January 12, 2023 20:45

Accelerate CRC-32C calculation on Z

2a14647

Accelerate CRC32C.updateBytes and CRC32C.updateDirectByteBuffers recognized methods by inlining them with a vector assembly sequence. Signed-off-by: Spencer Comin <spencer.comin@ibm.com>

Spencer-Comin force-pushed the crc32c-call-java branch from 1c2eb8a to 2a14647 Compare January 12, 2023 20:49

joransiu added comp:jit arch:z labels Jan 13, 2023

joransiu approved these changes Jan 16, 2023

joransiu merged commit a9d4f14 into eclipse-openj9:master Jan 16, 2023

Spencer-Comin mentioned this pull request Jan 16, 2023

Errors in CRC-32C accleration on Z #16560

Closed
