Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more performant in-reg byte-reverse series of instr for P8 & P9 #5702

Merged
merged 2 commits into from
Jan 18, 2021

Conversation

rmnattas
Copy link
Contributor

@rmnattas rmnattas commented Dec 2, 2020

In-register byte-reverse uses the same instruction sequence for P7, P8 & P9.
Given that rlwimi became cheaper in P8 & P9, using it instead of rlwinm in P8 & P9 results in more performant and less total instructions to execute.
Also, given the higher throughput of rlwimi in P8 & P9, instructions were ordered in a non-dependent matter to allow for parallel execution when possible.

 rlwimi Cost Latency Throughput  
POWER7 3-4 cycles 1 per cycle  
POWER8 1 cycle 2 per cycle  
POWER9 2 cycles 4 per cycle pipe busy 1 cycle

Number of Instructions:

Type Original #Instr Improved #Instr
Short Byte Swap 3 2
Int Byte Swap 7 3
Long Byte Swap (32-bit runtime) 16 (14) 8 (6)

BumbleBench Performance Result on P9 (Updated numbers in comments)

Signed-off-by: Abdulrahman Alattas rmnattas@gmail.com

@rmnattas
Copy link
Contributor Author

rmnattas commented Dec 2, 2020

fyi: @aviansie-ben @zl-wang

@rmnattas
Copy link
Contributor Author

rmnattas commented Dec 2, 2020

Passes Tril LogicalTest on a P9 machine:

[----------] 20 tests from LogicalTest/Int16LogicalUnary
[----------] 96 tests from LogicalTest/Int32LogicalUnary
[----------] 1536 tests from LogicalTest/Int32LogicalBinary
[----------] 1944 tests from LogicalTest/Int64LogicalBinary
[----------] 72 tests from LogicalTest/Int64LogicalUnary

@gita-omr
Copy link
Contributor

gita-omr commented Dec 3, 2020

I would like to review before this is merged. @aviansie-ben could you please add me as a reviewer? @zl-wang I hope you can review as well.

@rmnattas
Copy link
Contributor Author

rmnattas commented Dec 3, 2020

Updated PR description with more details and performance results

@rmnattas
Copy link
Contributor Author

rmnattas commented Dec 4, 2020

Ready for another review...
fyi: @aviansie-ben @zl-wang @gita-omr

@aviansie-ben aviansie-ben self-assigned this Dec 7, 2020
compiler/p/codegen/OMRTreeEvaluator.cpp Outdated Show resolved Hide resolved
compiler/p/codegen/OMRTreeEvaluator.cpp Show resolved Hide resolved
compiler/p/codegen/OMRTreeEvaluator.cpp Show resolved Hide resolved
@aviansie-ben
Copy link
Member

Passes Tril LogicalTest on a P9 machine

Was this run with a special version of Tril? AFAIK, the CPU detection logic doesn't currently run in Tril, so unless you hacked something together the new code wouldn't have been tested.

@rmnattas
Copy link
Contributor Author

rmnattas commented Dec 8, 2020

Can you please add comments explaining why rlwimi isn't used prior to P8?

rlwimi has a high cost on P7 (of 3-4 cycles) so using rlwinm (which costs 1 cycle) in the P7 implementation even with more total number of instructions yields less total cycles than an implementation with rlwimi.

Was this run with a special version of Tril? AFAIK, the CPU detection logic doesn't currently run in Tril, so unless you hacked something together the new code wouldn't have been tested.

I didn't know about that, thanks. Will use another test that checks byte-swap (sanity/functional/manual).

@rmnattas
Copy link
Contributor Author

rmnattas commented Dec 14, 2020

Updated code with suggested changes.
Using rlwinm in the first instructions shows +/- 0.05% difference in BumbleBench result, except P8 long byteswap with -0.3% difference and excluding P8 short byteswap* (negative is in favour of the old version). I still think using rlwinm is preferable as it eliminates situational stalls that could happen in workloads.

P8 BumbleBench results are as following:

Type Original Implementation Improved Updated Implementation Improvement
Short ByteSwap 615.4M 609.2M -1.00% *
Int ByteSwap 552.5M 667.3M 20.79%
Long ByteSwap 295.6M 447.9M 51.55%

P9 BumbleBench results are as following:

Type Original Implementation Improved Updated Implementation Improvement
Short ByteSwap 561.9M 578.6M 2.95%
Int ByteSwap 459.3M 626.7M 36.41%
Long ByteSwap 307.7M 376.1M 22.24%

* P8 short byteswap has unexpected result and I'll be looking more into it.

Implementation BumbleBench Result
v0.23 release 625.5M (100%)
Original (rlwinm, rlwinm, or) 615.4M (98.38%)
Improved Orig (rlwimi, rlwimi) 615.0M (98.33%)
Improved Updated (rlwinm, rldimi) 609.2M (97.39%)

@rmnattas
Copy link
Contributor Author

rmnattas commented Dec 15, 2020

Looking more into the openj9-omr v0.23 release vs OMR Dec. 14 master d255f37
BumbleBench Result (Negative percentage in favour of the v0.23 release):

P8 P9
Short ByteSwap -1.65% -0.98%
Int ByteSwap +0.88% +4.76%
Long ByteSwap +1.97% +3.34%

Looking for the openj9-omr v0.23 release vs OMR Dec. 14 master d255f37 plus this PR
BumbleBench Result (Negative percentage in favour of the v0.23 release):

P8 P9
Short ByteSwap -2.68% +1.95%
Int ByteSwap +21.85% +42.90%
Long ByteSwap +54.54% +26.33%

@rmnattas
Copy link
Contributor Author

rmnattas commented Dec 16, 2020

A possible change in hand of some significance is reverting the short ByteSwap to original improved implementation (with rlwimi instructions and different implementation for P7) and that would have the false dependency but it improves P8 short ByteSwap with 1% (Reason for the updated change: #5702 (comment)).
Leaving int and long implementation with the updated (with rlwinm) implementation have +/- negligible effect and I think its safe to keep.

BumbleBench Result in compare to openj9-omr v0.23 release:

Short ByteSwap P8 P9
Original Improved Implementation (rlwimi, rlwimi) -1.70% +1.95%
Updated Improved Implementation (rlwinm, rldimi) -2.68% +1.95%

@aviansie-ben @zl-wang @gita-omr
I'm a bit wary of having unnecessary dependency but I feel that's better than the 1% performance drop.
Looking at the numbers in this and the last two comments, any other opinions.

@rmnattas
Copy link
Contributor Author

rmnattas commented Jan 5, 2021

Given the ~1% performance gain in hand comparing to a possible situational unnecessary dependency.
I'll change the short ByteSwap implementation to rlwimi with having different implementations for P7 than P8/P9.

This change will give the following performance numbers compared to before the PR

P8 P9
Short ByteSwap -0.06% +2.95%
Int ByteSwap +20.79% +36.41%
Long ByteSwap +51.55% +22.24%

And the following performance numbers compared to openj9-omr v0.23 release

P8 P9
Short ByteSwap -1.70% +1.95%
Int ByteSwap +21.85% +42.90%
Long ByteSwap +54.54% +26.33%

As rlwimi became cheaper in P8 & P9, using it for in-register
byte-reverse instruction sequence results in less total instructions
and better performance

Signed-off-by: Abdulrahman Alattas <rmnattas@gmail.com>
This to ensure same constant and argument type-length
in 64- and 32-bit run modes.

Signed-off-by: Abdulrahman Alattas <rmnattas@gmail.com>
@rmnattas
Copy link
Contributor Author

rmnattas commented Jan 5, 2021

PR ready for review
@aviansie-ben @zl-wang @gita-omr

@gita-omr
Copy link
Contributor

gita-omr commented Jan 5, 2021

Thanks! I will take a look soon.

@zl-wang
Copy link
Contributor

zl-wang commented Jan 5, 2021

Seemed no change from last time I reviewed. Approved!

Copy link
Contributor

@gita-omr gita-omr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Copy link
Member

@aviansie-ben aviansie-ben left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@aviansie-ben
Copy link
Member

@genie-omr build plinux,aix

@aviansie-ben aviansie-ben merged commit 2ccca9b into eclipse:master Jan 18, 2021
@rmnattas rmnattas deleted the p8-p9-byte-reverse branch January 18, 2021 21:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants