Add more performant in-reg byte-reverse series of instr for P8 & P9 #5702

rmnattas · 2020-12-02T21:25:45Z

In-register byte-reverse uses the same instruction sequence for P7, P8 & P9.
Given that rlwimi became cheaper in P8 & P9, using it instead of rlwinm in P8 & P9 results in more performant and less total instructions to execute.
Also, given the higher throughput of rlwimi in P8 & P9, instructions were ordered in a non-dependent matter to allow for parallel execution when possible.

`rlwimi` Cost	Latency	Throughput
POWER7	3-4 cycles	1 per cycle
POWER8	1 cycle	2 per cycle
POWER9	2 cycles	4 per cycle	pipe busy 1 cycle

Number of Instructions:

Type	Original #Instr	Improved #Instr
Short Byte Swap	3	2
Int Byte Swap	7	3
Long Byte Swap (32-bit runtime)	16 (14)	8 (6)

BumbleBench Performance Result on P9 (Updated numbers in comments)

Signed-off-by: Abdulrahman Alattas rmnattas@gmail.com

rmnattas · 2020-12-02T21:26:38Z

fyi: @aviansie-ben @zl-wang

rmnattas · 2020-12-02T21:49:25Z

Passes Tril LogicalTest on a P9 machine:

[----------] 20 tests from LogicalTest/Int16LogicalUnary
[----------] 96 tests from LogicalTest/Int32LogicalUnary
[----------] 1536 tests from LogicalTest/Int32LogicalBinary
[----------] 1944 tests from LogicalTest/Int64LogicalBinary
[----------] 72 tests from LogicalTest/Int64LogicalUnary

gita-omr · 2020-12-03T00:37:59Z

I would like to review before this is merged. @aviansie-ben could you please add me as a reviewer? @zl-wang I hope you can review as well.

compiler/p/codegen/OMRTreeEvaluator.cpp

rmnattas · 2020-12-03T20:18:46Z

Updated PR description with more details and performance results

rmnattas · 2020-12-04T15:36:10Z

Ready for another review...
fyi: @aviansie-ben @zl-wang @gita-omr

compiler/p/codegen/OMRTreeEvaluator.cpp

aviansie-ben · 2020-12-07T22:23:49Z

Passes Tril LogicalTest on a P9 machine

Was this run with a special version of Tril? AFAIK, the CPU detection logic doesn't currently run in Tril, so unless you hacked something together the new code wouldn't have been tested.

rmnattas · 2020-12-08T15:24:27Z

Can you please add comments explaining why rlwimi isn't used prior to P8?

rlwimi has a high cost on P7 (of 3-4 cycles) so using rlwinm (which costs 1 cycle) in the P7 implementation even with more total number of instructions yields less total cycles than an implementation with rlwimi.

Was this run with a special version of Tril? AFAIK, the CPU detection logic doesn't currently run in Tril, so unless you hacked something together the new code wouldn't have been tested.

I didn't know about that, thanks. Will use another test that checks byte-swap (sanity/functional/manual).

rmnattas · 2020-12-14T22:31:41Z

Updated code with suggested changes.
Using rlwinm in the first instructions shows +/- 0.05% difference in BumbleBench result, except P8 long byteswap with -0.3% difference and excluding P8 short byteswap* (negative is in favour of the old version). I still think using rlwinm is preferable as it eliminates situational stalls that could happen in workloads.

P8 BumbleBench results are as following:

Type	Original Implementation	Improved Updated Implementation	Improvement
Short ByteSwap	615.4M	609.2M	-1.00% *
Int ByteSwap	552.5M	667.3M	20.79%
Long ByteSwap	295.6M	447.9M	51.55%

P9 BumbleBench results are as following:

Type	Original Implementation	Improved Updated Implementation	Improvement
Short ByteSwap	561.9M	578.6M	2.95%
Int ByteSwap	459.3M	626.7M	36.41%
Long ByteSwap	307.7M	376.1M	22.24%

* P8 short byteswap has unexpected result and I'll be looking more into it.

Implementation	BumbleBench Result
v0.23 release	625.5M (100%)
Original (rlwinm, rlwinm, or)	615.4M (98.38%)
Improved Orig (rlwimi, rlwimi)	615.0M (98.33%)
Improved Updated (rlwinm, rldimi)	609.2M (97.39%)

rmnattas · 2020-12-15T19:41:18Z

Looking more into the openj9-omr v0.23 release vs OMR Dec. 14 master d255f37
BumbleBench Result (Negative percentage in favour of the v0.23 release):

	P8	P9
Short ByteSwap	-1.65%	-0.98%
Int ByteSwap	+0.88%	+4.76%
Long ByteSwap	+1.97%	+3.34%

Looking for the openj9-omr v0.23 release vs OMR Dec. 14 master d255f37 plus this PR
BumbleBench Result (Negative percentage in favour of the v0.23 release):

	P8	P9
Short ByteSwap	-2.68%	+1.95%
Int ByteSwap	+21.85%	+42.90%
Long ByteSwap	+54.54%	+26.33%

rmnattas · 2020-12-16T14:57:47Z

A possible change in hand of some significance is reverting the short ByteSwap to original improved implementation (with rlwimi instructions and different implementation for P7) and that would have the false dependency but it improves P8 short ByteSwap with 1% (Reason for the updated change: #5702 (comment)).
Leaving int and long implementation with the updated (with rlwinm) implementation have +/- negligible effect and I think its safe to keep.

BumbleBench Result in compare to openj9-omr v0.23 release:

Short ByteSwap	P8	P9
Original Improved Implementation (rlwimi, rlwimi)	-1.70%	+1.95%
Updated Improved Implementation (rlwinm, rldimi)	-2.68%	+1.95%

@aviansie-ben @zl-wang @gita-omr
I'm a bit wary of having unnecessary dependency but I feel that's better than the 1% performance drop.
Looking at the numbers in this and the last two comments, any other opinions.

rmnattas · 2021-01-05T15:25:20Z

Given the ~1% performance gain in hand comparing to a possible situational unnecessary dependency.
I'll change the short ByteSwap implementation to rlwimi with having different implementations for P7 than P8/P9.

This change will give the following performance numbers compared to before the PR

	P8	P9
Short ByteSwap	-0.06%	+2.95%
Int ByteSwap	+20.79%	+36.41%
Long ByteSwap	+51.55%	+22.24%

And the following performance numbers compared to openj9-omr v0.23 release

	P8	P9
Short ByteSwap	-1.70%	+1.95%
Int ByteSwap	+21.85%	+42.90%
Long ByteSwap	+54.54%	+26.33%

As rlwimi became cheaper in P8 & P9, using it for in-register byte-reverse instruction sequence results in less total instructions and better performance Signed-off-by: Abdulrahman Alattas <rmnattas@gmail.com>

This to ensure same constant and argument type-length in 64- and 32-bit run modes. Signed-off-by: Abdulrahman Alattas <rmnattas@gmail.com>

rmnattas · 2021-01-05T18:46:51Z

PR ready for review
@aviansie-ben @zl-wang @gita-omr

gita-omr · 2021-01-05T19:13:31Z

Thanks! I will take a look soon.

zl-wang · 2021-01-05T19:45:59Z

Seemed no change from last time I reviewed. Approved!

gita-omr

Looks good!

aviansie-ben

LGTM

aviansie-ben · 2021-01-18T19:16:31Z

@genie-omr build plinux,aix

rmnattas requested a review from aviansie-ben as a code owner December 2, 2020 21:25

github-actions bot added arch:power comp:compiler labels Dec 2, 2020

0xdaryl requested review from zl-wang and gita-omr December 3, 2020 14:27

zl-wang suggested changes Dec 3, 2020

View reviewed changes

compiler/p/codegen/OMRTreeEvaluator.cpp Show resolved Hide resolved

aviansie-ben self-assigned this Dec 7, 2020

aviansie-ben suggested changes Dec 7, 2020

View reviewed changes

compiler/p/codegen/OMRTreeEvaluator.cpp Outdated Show resolved Hide resolved

compiler/p/codegen/OMRTreeEvaluator.cpp Show resolved Hide resolved

compiler/p/codegen/OMRTreeEvaluator.cpp Show resolved Hide resolved

rmnattas force-pushed the p8-p9-byte-reverse branch from 500c36c to 53f07b6 Compare December 14, 2020 22:30

rmnattas added 2 commits January 5, 2021 13:26

Add more performant in-reg byte-reverse series of instr for P8 & P9

eb756f9

As rlwimi became cheaper in P8 & P9, using it for in-register byte-reverse instruction sequence results in less total instructions and better performance Signed-off-by: Abdulrahman Alattas <rmnattas@gmail.com>

Wrap instr mask constant into CONSTANT64() in Power byteSwap

0edd401

This to ensure same constant and argument type-length in 64- and 32-bit run modes. Signed-off-by: Abdulrahman Alattas <rmnattas@gmail.com>

rmnattas force-pushed the p8-p9-byte-reverse branch from 53f07b6 to 0edd401 Compare January 5, 2021 18:42

zl-wang approved these changes Jan 5, 2021

View reviewed changes

gita-omr approved these changes Jan 6, 2021

View reviewed changes

aviansie-ben approved these changes Jan 18, 2021

View reviewed changes

aviansie-ben merged commit 2ccca9b into eclipse:master Jan 18, 2021

rmnattas deleted the p8-p9-byte-reverse branch January 18, 2021 21:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add more performant in-reg byte-reverse series of instr for P8 & P9 #5702

Add more performant in-reg byte-reverse series of instr for P8 & P9 #5702

rmnattas commented Dec 2, 2020 •

edited

Loading

rmnattas commented Dec 2, 2020

rmnattas commented Dec 2, 2020 •

edited

Loading

gita-omr commented Dec 3, 2020

rmnattas commented Dec 3, 2020

rmnattas commented Dec 4, 2020

aviansie-ben commented Dec 7, 2020

rmnattas commented Dec 8, 2020

rmnattas commented Dec 14, 2020 •

edited

Loading

rmnattas commented Dec 15, 2020 •

edited

Loading

rmnattas commented Dec 16, 2020 •

edited

Loading

rmnattas commented Jan 5, 2021

rmnattas commented Jan 5, 2021

gita-omr commented Jan 5, 2021

zl-wang commented Jan 5, 2021

gita-omr left a comment

aviansie-ben left a comment

aviansie-ben commented Jan 18, 2021

Add more performant in-reg byte-reverse series of instr for P8 & P9 #5702

Add more performant in-reg byte-reverse series of instr for P8 & P9 #5702

Conversation

rmnattas commented Dec 2, 2020 • edited Loading

rmnattas commented Dec 2, 2020

rmnattas commented Dec 2, 2020 • edited Loading

gita-omr commented Dec 3, 2020

rmnattas commented Dec 3, 2020

rmnattas commented Dec 4, 2020

aviansie-ben commented Dec 7, 2020

rmnattas commented Dec 8, 2020

rmnattas commented Dec 14, 2020 • edited Loading

rmnattas commented Dec 15, 2020 • edited Loading

rmnattas commented Dec 16, 2020 • edited Loading

rmnattas commented Jan 5, 2021

rmnattas commented Jan 5, 2021

gita-omr commented Jan 5, 2021

zl-wang commented Jan 5, 2021

gita-omr left a comment

Choose a reason for hiding this comment

aviansie-ben left a comment

Choose a reason for hiding this comment

aviansie-ben commented Jan 18, 2021

rmnattas commented Dec 2, 2020 •

edited

Loading

rmnattas commented Dec 2, 2020 •

edited

Loading

rmnattas commented Dec 14, 2020 •

edited

Loading

rmnattas commented Dec 15, 2020 •

edited

Loading

rmnattas commented Dec 16, 2020 •

edited

Loading