Slower performance than C2 on hot thread local access #7699

Open
headius opened this issue Nov 5, 2019 · 22 comments

@headius

headius commented Nov 5, 2019

I was optimizing a particularly hot ThreadLocal in JRuby today and noticed that a current OpenJ9 Java 8 nightly (from around Oct 12th) appears to be nearly 2x slower than OpenJDK 8 C2 (a recent-ish build). It seemed worth posting an issue after discussing it in Slack.

The optimized PR/branch and the benchmark are provided, along with a link to C2 assembly output, in jruby/jruby#5959.
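
For a sense of the shape of the benchmark, here is a minimal sketch of a hot-ThreadLocal loop in that spirit (the real ContextGetter benchmark lives in jruby/jruby#5959 and differs in detail; names here are illustrative):

// Minimal sketch of a hot ThreadLocal benchmark; not the actual ContextGetter.
public class ThreadLocalBench {
    private static final ThreadLocal<Object> CONTEXT =
            ThreadLocal.withInitial(Object::new);

    public static void main(String[] args) {
        Object sink = null;
        while (true) {
            long start = System.nanoTime();
            for (int i = 0; i < 100_000_000; i++) {
                sink = CONTEXT.get();   // the hot ThreadLocal access under test
            }
            System.out.println((System.nanoTime() - start) / 1.0e9);
            if (sink == null) System.out.println("unexpected");
        }
    }
}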

My numbers locally for Java 8 C2 vs J9:

[] ~/projects/jruby $ pickjdk 4 ; java -cp lib/jruby.jar:. ContextGetter
New JDK: adoptopenjdk-8.jdk
3.65159464
3.34112901
3.31879354
...
3.23659174
3.53413285
3.26626448
3.23872332
^C
[] ~/projects/jruby $ pickjdk 3 ; java -cp lib/jruby.jar:. ContextGetter
New JDK: adoptopenjdk-8-openj9.jdk
9.12280868
7.90640427
7.60049042
7.5055133
...
6.04649531
6.04165871
5.98665528
5.97691534
^C

Note that I also saw both Java 11 C2 and J9 degrade, but the degradation seemed proportional. Something has changed in Java 11 ThreadLocal that impacts both runtimes.

@JamesKingdon
Contributor

The timing loop in the benchmark is in the main method, so it will be a DLT (dynamic loop transfer) compile. It would be worth checking what happens if the loop is moved into a separate method and warmed up before running the test.
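
Something along these lines (a rough sketch of what I mean, not an actual modified testcase; the doTest name is just illustrative):

// Sketch: move the timed loop out of main() into its own method and call it a
// few times first, so it gets a normal JIT compile rather than a DLT compile
// of main().
public class ContextGetterReworked {
    private static final ThreadLocal<Object> CONTEXT =
            ThreadLocal.withInitial(Object::new);

    // The timed work now lives in a small, separately compilable method.
    static double doTest(int iterations) {
        long start = System.nanoTime();
        Object sink = null;
        for (int i = 0; i < iterations; i++) {
            sink = CONTEXT.get();
        }
        if (sink == null) throw new AssertionError();
        return (System.nanoTime() - start) / 1.0e9;
    }

    public static void main(String[] args) {
        // Warm-up: let doTest reach higher optimization levels before measuring.
        for (int i = 0; i < 20; i++) {
            doTest(1_000_000);
        }
        while (true) {
            System.out.println(doTest(100_000_000));
        }
    }
}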

@headius
Author

headius commented Nov 7, 2019

I admit the benchmark is poorly-formed. Here's a run with the for loop moved to a separate method and warmed up (it doesn't make a great deal of difference, unfortunately):

[] ~/projects/jruby $ pickjdk 4 ; java -cp lib/jruby.jar:. ContextGetter
New JDK: adoptopenjdk-8.jdk
4.37305029
3.98216038
3.71040556
4.00719827
4.19681166
4.03058245
3.74562432
3.70559625
3.69658124
3.71174565
3.70673965
3.69922997
3.65048014
3.68935385
3.68400718
^C
[] ~/projects/jruby $ pickjdk 3 ; java -cp lib/jruby.jar:. ContextGetter
New JDK: adoptopenjdk-8-openj9.jdk
8.9936078
7.43961186
8.70367504
8.45563959
7.14172598
8.05580192
9.91437537
10.08027979
9.87137747
9.94407308
9.91858358
8.24154539
6.86044516
6.27043835
6.09401586
6.11884365
6.08598684
6.18706203
6.17403144
6.10032537
^C

@JamesKingdon
Contributor

Many thanks for trying that out.

@JamesKingdon
Contributor

JamesKingdon commented Nov 7, 2019

I modified the testcase, reproduced the problem, took a log and was wading through it when I realised that I was using a really old version of jruby (1.7.22). That's probably not so useful :)

(As an aside, with the older version of JRuby, OpenJ9 and HotSpot are pretty much on par; I guess for some reason OpenJ9 isn't benefiting from the improvement in jruby/jruby#5959. Confirmed: comparing 'master' with the old JRuby, HotSpot gets faster and OpenJ9 gets slower.)

@JamesKingdon
Contributor

In case it's useful, I'll attach my modified version of the testcase:
ContextGetter.java.zip

The options to get a compilation log are:

-Xjit:dontInline={ContextGetter.doTest*},{ContextGetter.doTest*}{scorching}(tracefull,log=doTest)
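
(In a typical shell the braces need quoting, so a full invocation against the benchmark would look roughly like the following, with paths as in the earlier runs:)

java '-Xjit:dontInline={ContextGetter.doTest*},{ContextGetter.doTest*}{scorching}(tracefull,log=doTest)' -cp lib/jruby.jar:. ContextGetter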

@andrewcraik
Contributor

Thanks @JamesKingdon for the builds you shared. So there are some interesting things happening here:

  • in the scorching compile of the doTest method there are some paths in the ThreadLocal logic that appear unlikely to execute based on profiling, so we don't inline a number of calls related to the map and a few other bits (see the sketch after this list). Forcing the inlining cuts the gap by a bit more than half on my system
  • there are still a number of profiled guards
  • in addition, OpenJ9 has its own implementation of java/lang/ref/Reference and java/lang/ref/SoftReference. Based on perf stats I suspect OpenJDK's Reference family is not using a volatile field; we seem to have more stalls around a full fence when writing to SoftReference.age. I also see an if check in the OpenJ9 Reference.get testing for a GC mode. This does get removed, but I'm not sure we are doing so unconditionally, e.g. we may be preserving modification safety that is not necessary and so costing us some performance.
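
For reference, OpenJDK 8's ThreadLocal.get() has roughly this shape (simplified for illustration); the getMap/getEntry calls and the WeakReference-based Entry are the pieces the inliner was declining to inline on the cold-looking paths:

// Roughly the shape of OpenJDK 8's ThreadLocal.get(), simplified.
public T get() {
    Thread t = Thread.currentThread();
    ThreadLocalMap map = getMap(t);                   // per-thread map stored on Thread
    if (map != null) {
        ThreadLocalMap.Entry e = map.getEntry(this);  // Entry extends WeakReference<ThreadLocal<?>>
        if (e != null) {
            @SuppressWarnings("unchecked")
            T result = (T) e.value;
            return result;                            // hot path when the value is present
        }
    }
    return setInitialValue();                         // slow path: first access on this thread
}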

So, on my system: baseline for OpenJ9 is 5.6s; with tryToInline={*} on the scorching compile, 4.5s; with JCL fixes to remove the volatile and the extra if, 4.4s; with the new JProfiling implementation, 4.3s. OpenJDK with the same test is 3.8s. Those changes cut the overhead from ~50% slower to ~12% slower. I suspect the last bit is that we are generating some profiled guards in places where we don't need them (e.g. on the super.get in SoftReference.get). If so, those guards would be adding some additional loading and stalling, and removing them may well get us down to the level seen in OpenJDK.

So no changes to 'fix' this yet, but those are at least some of the causes of the delta.

@vijaysun-omr
Contributor

@andrewcraik does the new JProfiling implementation get used in the very-hot profiled compilation before scorching compilation by default in the OpenJ9 build you are running? If not, is that what "use new JProfiling implementation" did in your run?

Do you still need the tryToInline if you are using the JProfiling implementation? That is, does using a different profiler affect the inlining on the paths you mentioned in the scorching compilation such that you don't need to force it?

@andrewcraik
Contributor

@vijaysun-omr My build does not have the new JProfiling implementation on by default (it is a bit older than head) so I was enabling it in my build to test the performance. The tryToInline is still important for performance even with the new JProfiling enabled.

@headius
Author

headius commented Nov 26, 2019

Is it possible for me to build this "JProfiling" implementation and give it a try?

@andrewcraik
Contributor

JProfiling is already in the builds. Adding -Xjit:enableJProfilingInProfilingCompilations will cause the JProfiler to be used for profiling compilations rather than the current default JITProfiler.
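
For example, re-running the earlier benchmark with that option would look something like:

java -Xjit:enableJProfilingInProfilingCompilations -cp lib/jruby.jar:. ContextGetter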

@DanHeidinga
Member

Moving this forward based on the info here. @andrewcraik if you think this is addressed (has JProfiling shipped now?) then please retarget / close.

@andrewcraik
Contributor

JProfiling is now on by default. Not all the issues I identified have been fully explored, but we haven't had more time to work on this yet. I'll leave it open for now.

@DanHeidinga
Member

Moving this forward as it hasn't gotten any attention for this release

@andrewcraik
Contributor

We have some ideas for how to fix this, but I'm not sure they will make the release. I'll leave it here for now, though it's not certain it will be addressed.

@andrewcraik
Contributor

We still haven't had time to implement the perf improvements due to other work - moving forward. Let me know if the priority increases.

@liqunl
Contributor

liqunl commented Nov 25, 2020

We're still looking at this perf issue. Regarding the comments from #7699 (comment), Andrew suspected OpenJDK is not using a volatile field in its implementation of the Reference classes, which could cause a perf difference. However, OpenJ9 treats the volatile field like a non-volatile field here:
https://github.com/eclipse/openj9/blob/fea3bfb3c0fadfe5b70399488940e00771455de7/runtime/compiler/compile/J9SymbolReferenceTable.cpp#L855-L860
so that is not the problem here.

There are profiled guards that we can get rid of with a type constraint on the receiver; this should give us a ~20% boost. @rpshukla is working on it.

@headius
Author

headius commented Dec 2, 2020

FWIW we usually try to avoid this ThreadLocal by passing the relevant object through the call stack, but it is hit heavily when we can't use the stack to pass it. For example, Java calling into Ruby code needs to acquire this context again via the ThreadLocal on every call, and it can't be cached since it is thread-specific.

So long story short, this should not usually affect pure-Ruby apps but apps with lots of Java integration will see more impact.
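
To illustrate the two patterns (a rough sketch with illustrative names, not the actual JRuby API):

// Illustrative only; RuntimeContext stands in for the thread-specific context object.
final class RuntimeContext {
    // Thread-specific context; cannot be cached across threads.
    private static final ThreadLocal<RuntimeContext> CURRENT =
            ThreadLocal.withInitial(RuntimeContext::new);

    static RuntimeContext current() {
        return CURRENT.get();
    }
}

final class Example {
    // Pure-Ruby-style call path: the context is threaded through explicitly,
    // so the ThreadLocal is not touched on the hot path.
    static Object callWithContext(RuntimeContext context, Object arg) {
        return doWork(context, arg);
    }

    // Java-integration-style entry point: every call from Java into Ruby has
    // to re-acquire the context from the ThreadLocal, which is the hot access
    // this issue is about.
    static Object callFromJava(Object arg) {
        RuntimeContext context = RuntimeContext.current();
        return doWork(context, arg);
    }

    private static Object doWork(RuntimeContext context, Object arg) {
        return arg;
    }
}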

@liqunl
Contributor

liqunl commented Dec 3, 2020

Update: the VP change with the ~20% boost is still in progress; we expect to get it in soon. However, the remaining gap is unlikely to be closed in the next release. Given #7699 (comment), we should defer this to the next release and continue investigating the remaining gap.

@0xdaryl
Contributor

0xdaryl commented Dec 9, 2020

Moving this forward another release for the reasons Liqun highlighted.

@0xdaryl
Contributor

0xdaryl commented Feb 18, 2021

We haven't had much of an opportunity to make progress on this one. Moving forward to 0.26.

@0xdaryl
Contributor

0xdaryl commented Mar 31, 2021

No updates. Moving to 0.27.

@0xdaryl
Contributor

0xdaryl commented Jun 21, 2021

No updates in Liqun's absence. Moving to 0.28.
