
Runtime compressed refs interpreter performance #8878

Open
gacholio opened this issue Mar 16, 2020 · 80 comments
@gacholio
Contributor

In order to retain optimal performance in the interpreter in runtime compressed refs mode, we'll need to multiply-compile it (we already do this for debug mode).

The idea is to define something like J9_COMPRESSED_REFS_OVERRIDE as either TRUE or FALSE, and use the override value, if present, in the J9VMTHREAD_COMPRESS_OBJECT_REFERENCES and J9JAVAVM_COMPRESS_OBJECT_REFERENCES macros.

In keeping with the current naming conventions, I'd suggest replacing BytecodeInterpreter.cpp with BytecodeInterpreterCompressed.cpp and BytecodeInterpreterFull.cpp, and renaming the C entry points appropriately. (Almost) the entirety of these files should be ifdeffed on OMR_GC_COMPRESSED_POINTERS or OMR_GC_FULL_POINTERS (you'll need to include j9cfg.h to get those ifdefs).

The new entry points will need to be used in:

https://github.com/eclipse/openj9/blob/b542db86f655d22d1697c64d071bd483c3da3695/runtime/vm/jvminit.c#L2455
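For illustration, the selection at that call site might end up looking roughly like this (a sketch only - bytecodeLoopCompressed/bytecodeLoopFull and the field name are hypothetical, following the renaming suggested above):

/* Sketch: pick the interpreter entry point once, at VM init,
 * based on the runtime object reference mode. */
if (J9JAVAVM_COMPRESS_OBJECT_REFERENCES(vm)) {
	vm->bytecodeLoop = (void *)bytecodeLoopCompressed;
} else {
	vm->bytecodeLoop = (void *)bytecodeLoopFull;
}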

For now, let's not bother splitting the debug interpreter or MH interpreter.

@gacholio
Contributor Author

@DanHeidinga @sharon-wang

@gacholio
Contributor Author

Makefiles will also need to be updated - search for BytecodeInterpreter and duplicate accordingly.

@fjeremic
Contributor

> Makefiles will also need to be updated - search for BytecodeInterpreter and duplicate accordingly.

@dnakamura FYI for CMake.

@sharon-wang
Contributor

I'm working on this

@gacholio
Contributor Author

To start with, there's no need to actually be in mixed mode - if the individual interpreters are ifdeffed on the full/compressed flags, then they'll work in the normal builds, and won't bloat them.

For an example, see:

https://github.com/eclipse/openj9/blob/6a8a0aef2e12405cc08654f15c62952d7c5f4fc6/runtime/compiler/runtime/Runtime.cpp#L274-L281

https://github.com/eclipse/openj9/blob/6a8a0aef2e12405cc08654f15c62952d7c5f4fc6/runtime/compiler/runtime/Runtime.cpp#L1094-L1107
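The general shape of the pattern, as a generic sketch (made-up helper names, not the actual Runtime.cpp content):

#include "j9cfg.h"

#if defined(OMR_GC_COMPRESSED_POINTERS)
/* Present in compressed-only and mixed builds */
static void helperCompressed(J9VMThread *currentThread)
{
	/* variant compiled assuming compressed refs */
}
#endif /* OMR_GC_COMPRESSED_POINTERS */

#if defined(OMR_GC_FULL_POINTERS)
/* Present in full-only and mixed builds */
static void helperFull(J9VMThread *currentThread)
{
	/* variant compiled assuming full-width refs */
}
#endif /* OMR_GC_FULL_POINTERS */

In a single-mode build, exactly one variant is compiled (no bloat); in a mixed build, both exist and the caller selects at run time.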

@gacholio
Contributor Author

For the override check, I was thinking something like:

#define J9_COMPRESSED_OVERRIDE TRUE /* or FALSE */

before including much of anything else (particularly j9nonbuilder.h or anything that includes it)

and in j9nonbuilder.h:

#if defined(OMR_GC_COMPRESSED_POINTERS)
#if defined(OMR_GC_FULL_POINTERS)
/* Mixed mode - necessarily 64-bit */
#if defined(J9_COMPRESSED_OVERRIDE)
#define J9VMTHREAD_COMPRESS_OBJECT_REFERENCES(vmThread) J9_COMPRESSED_OVERRIDE
#else /* J9_COMPRESSED_OVERRIDE */
#define J9VMTHREAD_COMPRESS_OBJECT_REFERENCES(vmThread) (0 != (vmThread)->compressObjectReferences)
#endif /* J9_COMPRESSED_OVERRIDE */
#else /* OMR_GC_FULL_POINTERS */
/* Compressed only - necessarily 64-bit */
#define J9VMTHREAD_COMPRESS_OBJECT_REFERENCES(vmThread) TRUE
#endif /* OMR_GC_FULL_POINTERS */
#else /* OMR_GC_COMPRESSED_POINTERS */
/* Full only - could be 32 or 64-bit */
#define J9VMTHREAD_COMPRESS_OBJECT_REFERENCES(vmThread) FALSE
#endif /* OMR_GC_COMPRESSED_POINTERS */
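
Putting it together, the top of the compressed interpreter translation unit might look like this (sketch only; ordering per the note above):

/* BytecodeInterpreterCompressed.cpp (sketch) */
#include "j9cfg.h"

#if defined(OMR_GC_COMPRESSED_POINTERS)

/* Define the override before j9nonbuilder.h (or anything that includes it)
 * so the COMPRESS_OBJECT_REFERENCES macros collapse to the constant TRUE
 * in this translation unit. */
#define J9_COMPRESSED_OVERRIDE TRUE

/* ... remaining includes and the interpreter body, with the
 * renamed C entry point ... */

#endif /* OMR_GC_COMPRESSED_POINTERS */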
 

@sharon-wang
Contributor

sharon-wang commented Apr 24, 2020

Some initial benchmark results...

split: "mixed" build
baseline: regular build

LibertyStartupDT7 - slight improvement in startup time

startup time (ms)
split 12417
baseline 12931
diff -514

Throughput Benchmark - significant reduction in throughput
Options: -Xnocompressedrefs -Xms6g -Xmx6g -Xmn4g

       max_jOPS   critical_jOPS   hbIR_max   hbIR_settled
diff   -46.8%     -62.3%          -42.0%     -39.3%

I'll be running more iterations of these benchmarks to see if results are consistent.

@DanHeidinga
Member

@sharon-wang those numbers seem really off. The overhead from the compressed checks should mostly be in the GC at this point, which shouldn't account for a 2-3x overhead.
@amicic any comment on that?

Can you collect some profiles from a combo build run? It would really help to see where time is going.

@gacholio
Contributor Author

The probable reason for the awful numbers is that the bench was being run in GC stress mode (which is something we will eventually need to address, but for now, the new numbers will be in normal mode to give us a general idea).

@amicic
Contributor

amicic commented Apr 24, 2020

hard to speculate without seeing heap sizing, machine config, GC logs, CPU profiles...

@sharon-wang
Contributor

Ran a few more iterations of LibertyStartupDT7, which yielded results similar to those above.

For GC, I've collected a few profiles and logs. What sort of data do we want from the profiles and GC logs? Are there specific measurements that we are interested in?

@DanHeidinga
Member

For the profiles, we want to compare the profiles from the combo build to a non-combo build to see which symbols have become hotter. This helps to identify where we're spending more time.

@dmitripivkine
Contributor

> For GC, I've collected a few profiles and logs

Please send collected GC verbose log files (both combo and non-combo) to me or @amicic

@amicic
Contributor

amicic commented Apr 28, 2020

GC times are about 10% longer. More or less expected at this stage.

CPU profiles are too short to be meaningful. I've asked @sharon-wang to generate them again.

@sharon-wang
Contributor

sharon-wang commented Apr 30, 2020

Updated Throughput Benchmark Results
The GC regression appears to be much smaller than the initial results suggested. It's not entirely clear why those results were so far off, other than the small heap.

COMPOSITE + PRESET, IR = 5000, Heap settings: -Xms12g -Xmx12g -Xmn10g
[image: benchmark results]

COMPOSITE + PRESET, IR = 20000, Heap settings: -Xms12g -Xmx12g -Xmn10g
[image: benchmark results]

COMPOSITE + COMPLIANT + Heap settings: -Xms12g -Xmx12g -Xmn10g

       max_jOPS   critical_jOPS   hbIR_max   hbIR_settled
diff   -2.22%     -9.88%          0          -1.97%

COMPOSITE + COMPLIANT + Heap settings: -Xms8g -Xmx8g -Xmn6g

       max_jOPS   critical_jOPS   hbIR_max   hbIR_settled
diff   -1.18%     -2.54%          0          0

@amicic
Contributor

amicic commented Apr 30, 2020

The original runs were with -Xnocompressedrefs. Maybe there is something specific to it that led to such a big difference.
The machine is also different, and the heap sizing is a bit different, but I would not expect those to cause such a big difference.
Anyhow, nonCR should be tried again.

@gacholio
Contributor Author

I find it odd that the lower heap sizes are showing better results.

@amicic
Contributor

amicic commented Apr 30, 2020

Run-to-run variation is easily 3%.

We also did not pick the right combo of heap sizes to emphasize the difference between low and high GC overhead scenarios. When I picked the large heap (10G nursery) I assumed the small one would be 4G (like in the original runs), so about 2.5x smaller GC overhead. But I should've also picked one even larger than 10G.

@sharon-wang
Contributor

I can do some more runs: one set with compressed refs and one without. Are we interested in both COMPLIANT and PRESET runs?

What is suggested for the heap settings? I can use -Xms6g -Xmx6g -Xmn4g for small - what is recommended for a larger heap size?

@amicic
Contributor

amicic commented Apr 30, 2020

Let's do COMPLIANT runs only for nonCR. If we indeed reproduce the big gap on the GC perf machine, then we'll do PRESET runs with CPU profiles.

large: -Xmx24G -Xms24G -Xmn20G
small: -Xmx6G -Xms6G -Xmn4G

@sharon-wang
Contributor

sharon-wang commented May 1, 2020

New Throughput Benchmark Results - COMPOSITE + COMPLIANT, -Xnocompressedrefs

There is no big gap in performance like the one initially measured. It might have been a machine issue, since those first runs were done on a different machine than the subsequent ones.

Heap settings: -Xms24g -Xmx24g -Xmn20g

       max_jOPS   critical_jOPS   hbIR_max   hbIR_settled
diff   -1.21%     -5.81%          0          -2.65%

Heap settings: -Xms6g -Xmx6g -Xmn4g

       max_jOPS   critical_jOPS   hbIR_max   hbIR_settled
diff   0          -2.45%          0          -1.10%

@amicic
Contributor

amicic commented May 1, 2020

GC (Scavenge specifically) time slowdown for nonCR

        Reg (ms)   Mix (ms)   Diff (%)
Large   154.851    174.017    12.38
Small   115.13     123.128    6.95

Similar for CR:

        Reg (ms)   Mix (ms)   Diff (%)
Large   147.904    172.263    16.47
Small   145.972    156.681    7.34

Those are average GC times as obtained by:

grep '<gc-op.*scavenge' verbosegc.log | awk -F\" '{x+=$6} END {print x/NR}'
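Note: -F\" splits each line on double quotes, so $6 is the value of the third attribute; assuming the usual <gc-op id="..." type="scavenge" timems="..."> attribute ordering, that is timems, so the averages above are scavenge durations in milliseconds.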

@sharon-wang
Contributor

@amicic Should more perf results be collected or is further profiling needed?

@amicic
Contributor

amicic commented May 5, 2020

I don't need more perf results. These slowdowns are more or less expected, knowing the changes that came in... The next step is to try to reduce the gap.

Generally, we need to reduce the number of runtime 'if CR' checks, if possible by hoisting them out of tight loops. Very often that won't be possible, so we'll have to resort to versioning the code - something similar to what was done for the interpreter (although, to avoid maintaining 2 versions of the source code, I'd consider C++ templates; see the sketch at the end of this comment). Either way, we need to do this for the minimum amount of code that is most impacted by the runtime checks.

Specifically, for the Scavenger, which is the most frequently executed GC operation in Gencon, we can version:

  • top level completeScan() loop and several methods that are dragged in like various scanObject() and copyAndForward() methods
  • Heap Object Iterator, specifically next() method
  • ForwardedHeader, probably the whole class
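
For a concrete picture of the template idea, here's a minimal self-contained sketch (simplified types and hypothetical names - the real scanner code is far more involved). The CR check becomes a template parameter, so the slot width is a compile-time constant inside each instantiation and the runtime check happens once at the top level:

#include <cstddef>
#include <cstdint>

/* Two versions of the hot loop are generated from one source. */
template<bool Compressed>
struct ScanLoop {
	static uintptr_t scan(const uint8_t *base, size_t slotCount) {
		/* Slot width is a compile-time constant: no per-slot branch. */
		const size_t slotSize = Compressed ? sizeof(uint32_t) : sizeof(uintptr_t);
		uintptr_t sum = 0;
		for (size_t i = 0; i < slotCount; i++) {
			const uint8_t *slot = base + (i * slotSize);
			sum += Compressed ? (uintptr_t)*(const uint32_t *)slot
			                  : *(const uintptr_t *)slot;
		}
		return sum;
	}
};

/* One runtime check selects the instantiation, analogous to checking
 * the CR mode once at the top of completeScan(). */
uintptr_t completeScan(bool compressed, const uint8_t *base, size_t slotCount) {
	return compressed ? ScanLoop<true>::scan(base, slotCount)
	                  : ScanLoop<false>::scan(base, slotCount);
}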

@gacholio
Contributor Author

gacholio commented Jun 9, 2020

So where do we go from here? @rwy0717 suggested that an interpreter-style solution (the compressed override define) might work better than templates. Any thoughts on that?

Another obvious option is to simply multiply-compile the majority of the GC and fill in the MM function table appropriately. This would obviously perform well, but may be prohibitively large, and it would also need to address the ODR constraint (i.e. the classes would need to be multiply-named).

I'm not familiar at all with templates, so I have nothing to offer in that regard.

Can anyone provide an example of what a piece of code would look like with templates?

One of the major places we probably want to optimize is the scanners/iterators, but I had heard they were being reworked (for VT?), so I've been loath to put a lot of effort into changing those. One thing I had looked at was keeping an internal SlotObject instead of a scan pointer, and using the new instance-based APIs to increment the pointer.

@sharon-wang
Contributor

sharon-wang commented Oct 21, 2020

Sure thing, I'll do 2 runs of the Throughput Benchmark for mixedrefs and baseline for each of those policies. I expect it'll take a week or so to run everything and put the results together.

I'll coordinate with @mpirvu to get him the builds - thanks for doing the DT7 runs!

@mpirvu
Contributor

mpirvu commented Oct 24, 2020

I compared the baseline (compressedrefs) to the mixedrefs build, both being run with -Xcompressedrefs.
I used DT7 throughput runs, both with SCC on and off, and DT7 start-up experiments with SCC on. I did not see any change in throughput, footprint after load, start-up time or footprint after start-up.

================= Detailed results =============

Results for JDK=/home/mpirvu/sdks/Sharon/Baseline jvmOpts=-Xcompressedrefs -Xms1024m -Xmx1024m
Throughput      avg=3561.54     min=3529.70     max=3603.30     stdDev=27.7     maxVar=2.09%    confInt=0.45%   samples=10
Intermediate results:
Run 0   241.9   3372.0  3556.4  3533.0  Avg=3533        CPU=184399 ms  Footprint=972096 KB
Run 1   270.2   3175.6  3562.2  3546.6  Avg=3547        CPU=138939 ms  Footprint=945864 KB
Run 2   278.6   3135.5  3610.9  3588.3  Avg=3588        CPU=140642 ms  Footprint=936156 KB
Run 3   278.3   3157.1  3542.8  3551.6  Avg=3552        CPU=135591 ms  Footprint=942300 KB
Run 4   273.4   3216.3  3615.3  3603.3  Avg=3603        CPU=137415 ms  Footprint=951392 KB
Run 5   301.2   3102.5  3592.5  3584.7  Avg=3585        CPU=138050 ms  Footprint=946904 KB
Run 6   271.4   3177.7  3613.8  3529.7  Avg=3530        CPU=140451 ms  Footprint=958532 KB
Run 7   281.1   3103.0  3619.0  3592.2  Avg=3592        CPU=141665 ms  Footprint=959196 KB
Run 8   294.0   3170.7  3577.3  3551.2  Avg=3551        CPU=141405 ms  Footprint=944408 KB
Run 9   276.8   3203.7  3605.0  3534.8  Avg=3535        CPU=137581 ms  Footprint=947024 KB
CompTime        avg=143613.80   min=135591.00   max=184399.00   stdDev=14464.3  maxVar=36.00%   confInt=5.84%   samples=10
Footprint       avg=950387.20   min=936156.00   max=972096.00   stdDev=10348.4  maxVar=3.84%    confInt=0.63%   samples=10

Results for JDK=/home/mpirvu/sdks/Sharon/MixedRefs jvmOpts=-Xcompressedrefs -Xms1024m -Xmx1024m
Throughput      avg=3569.66     min=3486.60     max=3616.00     stdDev=35.2     maxVar=3.71%    confInt=0.57%   samples=10
Intermediate results:
Run 0   256.2   3306.2  3531.7  3486.6  Avg=3487        CPU=189048 ms  Footprint=971648 KB
Run 1   280.0   3183.4  3589.9  3557.0  Avg=3557        CPU=138386 ms  Footprint=946104 KB
Run 2   279.1   3100.3  3521.4  3584.2  Avg=3584        CPU=140922 ms  Footprint=941188 KB
Run 3   281.8   3205.2  3617.9  3616.0  Avg=3616        CPU=135744 ms  Footprint=946652 KB
Run 4   269.1   3118.3  3629.6  3557.1  Avg=3557        CPU=135273 ms  Footprint=940276 KB
Run 5   290.2   3230.1  3648.3  3598.8  Avg=3599        CPU=131868 ms  Footprint=953932 KB
Run 6   265.5   3200.7  3595.9  3580.5  Avg=3580        CPU=137383 ms  Footprint=942832 KB
Run 7   291.2   3182.2  3592.9  3552.6  Avg=3553        CPU=135526 ms  Footprint=951772 KB
Run 8   278.3   3215.6  3588.7  3583.6  Avg=3584        CPU=137627 ms  Footprint=946320 KB
Run 9   283.0   3021.0  3553.9  3580.2  Avg=3580        CPU=140904 ms  Footprint=950332 KB
CompTime        avg=142268.10   min=131868.00   max=189048.00   stdDev=16658.7  maxVar=43.36%   confInt=6.79%   samples=10
Footprint       avg=949105.60   min=940276.00   max=971648.00   stdDev=9085.2   maxVar=3.34%    confInt=0.55%   samples=10

Results for JDK=/home/mpirvu/sdks/Sharon/Baseline jvmOpts=-Xcompressedrefs -Xms1024m -Xmx1024m -Xshareclasses:none
Throughput      avg=3647.57     min=3586.80     max=3700.40     stdDev=32.4     maxVar=3.17%    confInt=0.52%   samples=10
Intermediate results:
Run 0   288.9   3297.1  3607.4  3586.8  Avg=3587        CPU=143930 ms  Footprint=957548 KB
Run 1   294.8   3463.2  3652.6  3700.4  Avg=3700        CPU=131022 ms  Footprint=918148 KB
Run 2   271.7   3297.0  3684.4  3681.1  Avg=3681        CPU=141237 ms  Footprint=927300 KB
Run 3   301.5   3281.0  3611.4  3629.5  Avg=3630        CPU=137024 ms  Footprint=916332 KB
Run 4   296.4   3311.9  3651.2  3624.1  Avg=3624        CPU=133670 ms  Footprint=933276 KB
Run 5   294.3   3352.8  3667.4  3626.5  Avg=3626        CPU=138411 ms  Footprint=928344 KB
Run 6   306.2   3360.2  3632.1  3645.6  Avg=3646        CPU=133881 ms  Footprint=929844 KB
Run 7   306.9   3320.7  3647.3  3657.3  Avg=3657        CPU=136465 ms  Footprint=923268 KB
Run 8   290.3   3354.3  3722.6  3665.2  Avg=3665        CPU=134035 ms  Footprint=923764 KB
Run 9   293.7   3310.0  3655.2  3659.2  Avg=3659        CPU=140797 ms  Footprint=923900 KB
CompTime        avg=137047.20   min=131022.00   max=143930.00   stdDev=4055.1   maxVar=9.85%    confInt=1.72%   samples=10
Footprint       avg=928172.40   min=916332.00   max=957548.00   stdDev=11522.8  maxVar=4.50%    confInt=0.72%   samples=10

Results for JDK=/home/mpirvu/sdks/Sharon/MixedRefs jvmOpts=-Xcompressedrefs -Xms1024m -Xmx1024m -Xshareclasses:none
Throughput      avg=3643.11     min=3607.00     max=3681.20     stdDev=23.8     maxVar=2.06%    confInt=0.38%   samples=10
Intermediate results:
Run 0   269.9   3336.6  3652.3  3607.0  Avg=3607        CPU=144425 ms  Footprint=954116 KB
Run 1   283.2   3361.4  3700.8  3650.1  Avg=3650        CPU=138062 ms  Footprint=922316 KB
Run 2   287.4   3344.3  3728.8  3681.2  Avg=3681        CPU=133078 ms  Footprint=934540 KB
Run 3   282.3   3350.6  3671.7  3635.5  Avg=3636        CPU=134608 ms  Footprint=924100 KB
Run 4   268.5   3352.2  3639.9  3626.4  Avg=3626        CPU=136802 ms  Footprint=926748 KB
Run 5   291.6   3311.9  3633.3  3643.4  Avg=3643        CPU=140688 ms  Footprint=928880 KB
Run 6   290.7   3365.5  3678.4  3665.8  Avg=3666        CPU=133991 ms  Footprint=932460 KB
Run 7   272.6   3367.4  3654.5  3610.3  Avg=3610        CPU=128890 ms  Footprint=930800 KB
Run 8   289.8   3385.1  3633.9  3659.7  Avg=3660        CPU=136654 ms  Footprint=908452 KB
Run 9   294.8   3379.3  3640.8  3651.7  Avg=3652        CPU=133142 ms  Footprint=925196 KB
CompTime        avg=136034.00   min=128890.00   max=144425.00   stdDev=4365.3   maxVar=12.05%   confInt=1.86%   samples=10
Footprint       avg=928760.80   min=908452.00   max=954116.00   stdDev=11482.4  maxVar=5.03%    confInt=0.72%   samples=10

Results for JDK=/home/mpirvu/sdks/Sharon/Baseline jvmOpts=-Xcompressedrefs -Xmx256m
StartupTime     avg=3990        min=3838        max=4278        stdDev=78.2     maxVar=11.5%    confInt=0.36%   samples= 80
Footprint       avg=217608      min=212580      max=223988      stdDev=2425.8   maxVar=5.4%     confInt=0.23%   samples= 64
        Outlier values:  279768 273776 270648 268024 277728 276680 271772 271932 271836 278256 273840 276412 280328 268892 267852 271984
CThreadTime     avg=2287        min=1897        max=2825        stdDev=205.1    maxVar=48.9%    confInt=1.67%   samples= 80
ProcessTime     avg=7583        min=7000        max=9040        stdDev=533.7    maxVar=29.1%    confInt=1.32%   samples= 79

Results for JDK=/home/mpirvu/sdks/Sharon/MixedRefs jvmOpts=-Xcompressedrefs -Xmx256m
StartupTime     avg=3995        min=3826        max=4243        stdDev=86.1     maxVar=10.9%    confInt=0.40%   samples= 80
Footprint       avg=217352      min=210596      max=235584      stdDev=3546.9   maxVar=11.9%    confInt=0.33%   samples= 67
        Outlier values:  275104 275252 273148 276632 274804 283276 281312 275076 277008 265988 273980 276528 272844
CThreadTime     avg=2292        min=1904        max=3211        stdDev=252.0    maxVar=68.6%    confInt=2.06%   samples= 79
ProcessTime     avg=7571        min=6900        max=9060        stdDev=526.9    maxVar=31.3%    confInt=1.30%   samples= 79

@sharon-wang
Contributor

sharon-wang commented Oct 28, 2020

Some more Throughput Benchmark COMPOSITE Results

baseline: configure --with-cmake
mixedrefs: configure --with-cmake --with-mixedrefs + -Xcompressedrefs at runtime

Heap settings: -Xms12g -Xmx12g -Xmn10g

Measurements below are an average of 3 runs for each. Run-to-run measurements varied slightly for both configurations and there is no major performance difference between baseline and mixedrefs.

-Xgcpolicy:optthruput

         max_jOPS   critical_jOPS   hbIR_max   hbIR_settled
diff %   -2.13%     +0.41%          --         -1.06%

-Xgcpolicy:balanced

         max_jOPS   critical_jOPS   hbIR_max   hbIR_settled
diff %   -2.25%     +4.54%          -2.09%     +2.88%

-Xgcpolicy:metronome

This GC policy appears to be incompatible with the Throughput Benchmark? The benchmark errors out or crashes even when running with the latest openj9 Adopt JDK15 build.

@dmitripivkine
Contributor

> -Xgcpolicy:metronome
>
> This GC policy appears to be incompatible with SPECjbb? The benchmark errors out or crashes even when running with the latest openj9 Adopt JDK15 build.

Crashes? That's interesting... Do you have an example around, so we can get an idea of where/how it crashed?

@sharon-wang
Contributor

@dmitripivkine Yes, I'll send you the segfault and stack trace

@vijaysun-omr
Contributor

@sharon-wang were you planning to investigate why max_jOPS regressed by ~2% for both optthruput and balanced?

@gacholio
Contributor Author

I'm assuming that's random variance - these GCs should be identical to the normal builds. Perhaps another run is in order. I've manually verified that none of the getters are missing the override check.

@sharon-wang
Contributor

I will do another set of runs to check if the same regression is seen.

@sharon-wang
Contributor

sharon-wang commented Nov 4, 2020

New set of Throughput Benchmark COMPOSITE runs:

baseline: configure --with-cmake
mixedrefs: configure --with-cmake --with-mixedrefs + -Xcompressedrefs at runtime

Heap settings: -Xms12g -Xmx12g -Xmn10g

Measurements below are an average of 5 runs of each.

-Xgcpolicy:optthruput

         max_jOPS   critical_jOPS   hbIR_max   hbIR_settled
diff %   +0.16%     -2.68%          -0.19%     +2.53%

-Xgcpolicy:balanced

         max_jOPS   critical_jOPS   hbIR_max   hbIR_settled
diff %   +3.15%     +3.00%          +4.98%     -1.52%

Seeing run-to-run variance similar to the previous results. It seems the two builds show the same performance.

@sharon-wang
Contributor

For this initial set of changes to enable mixed builds with CMake, are we focused on JDK11 specifically, or do we want to enable this feature for all versions? I assume this also depends on which versions/platforms CMake is currently available on.

@keithc-ca
Contributor

The changes to openj9 to support mixed references should not be specific to any version of Java, so I would expect it to work for all versions (with changes similar to ibmruntimes/openj9-openjdk-jdk11#359 made to the other extension repositories).

@sharon-wang
Contributor

Just an update that all CMake mixed refs changes are now merged. The test story is in progress and can be followed here: #9231.
