Slower performance than C2 on hot thread local access #7699

Open
headius opened this issue Nov 5, 2019 · 22 comments

@headius

headius commented Nov 5, 2019

I was optimizing a particularly hot ThreadLocal in JRuby today and noticed that a current OpenJ9 Java 8 nightly (from around Oct 12th) appears to be nearly 2x slower than OpenJDK 8 C2 (a recent-ish build). It seemed worth posting an issue after discussing it in Slack.

The optimized PR/branch and the benchmark are provided, along with a link to C2 assembly output, in jruby/jruby#5959.
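
For a sense of the shape of the benchmark, here is a minimal sketch of a hot-ThreadLocal loop in that spirit (the real ContextGetter benchmark lives in jruby/jruby#5959 and differs in detail; names here are illustrative):

// Minimal sketch of a hot ThreadLocal benchmark; not the actual ContextGetter.
public class ThreadLocalBench {
    private static final ThreadLocal<Object> CONTEXT =
            ThreadLocal.withInitial(Object::new);

    public static void main(String[] args) {
        Object sink = null;
        while (true) {
            long start = System.nanoTime();
            for (int i = 0; i < 100_000_000; i++) {
                sink = CONTEXT.get();   // the hot ThreadLocal access under test
            }
            System.out.println((System.nanoTime() - start) / 1.0e9);
            if (sink == null) System.out.println("unexpected");
        }
    }
}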

My numbers locally for Java 8 C2 vs J9:

[] ~/projects/jruby $ pickjdk 4 ; java -cp lib/jruby.jar:. ContextGetter
New JDK: adoptopenjdk-8.jdk
3.65159464
3.34112901
3.31879354
...
3.23659174
3.53413285
3.26626448
3.23872332
^C
[] ~/projects/jruby $ pickjdk 3 ; java -cp lib/jruby.jar:. ContextGetter
New JDK: adoptopenjdk-8-openj9.jdk
9.12280868
7.90640427
7.60049042
7.5055133
...
6.04649531
6.04165871
5.98665528
5.97691534
^C

Note that I also saw both Java 11 C2 and J9 degrade, but the degradation seemed proportional. Something has changed in Java 11 ThreadLocal that impacts both runtimes.

@JamesKingdon
Contributor

The timing loop in the benchmark is in the main method, so it will be a DLT (dynamic loop transfer) compile. It would be worth checking what happens if the loop is moved into a separate method and warmed up before running the test.
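
Something along these lines (a rough sketch of what I mean, not an actual modified testcase; the doTest name is just illustrative):

// Sketch: move the timed loop out of main() into its own method and call it a
// few times first, so it gets a normal JIT compile rather than a DLT compile
// of main().
public class ContextGetterReworked {
    private static final ThreadLocal<Object> CONTEXT =
            ThreadLocal.withInitial(Object::new);

    // The timed work now lives in a small, separately compilable method.
    static double doTest(int iterations) {
        long start = System.nanoTime();
        Object sink = null;
        for (int i = 0; i < iterations; i++) {
            sink = CONTEXT.get();
        }
        if (sink == null) throw new AssertionError();
        return (System.nanoTime() - start) / 1.0e9;
    }

    public static void main(String[] args) {
        // Warm-up: let doTest reach higher optimization levels before measuring.
        for (int i = 0; i < 20; i++) {
            doTest(1_000_000);
        }
        while (true) {
            System.out.println(doTest(100_000_000));
        }
    }
}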

@headius
Author

headius commented Nov 7, 2019

I admit the benchmark is poorly-formed. Here's a run with the for loop moved to a separate method and warmed up (it doesn't make a great deal of difference, unfortunately):

[] ~/projects/jruby $ pickjdk 4 ; java -cp lib/jruby.jar:. ContextGetter
New JDK: adoptopenjdk-8.jdk
4.37305029
3.98216038
3.71040556
4.00719827
4.19681166
4.03058245
3.74562432
3.70559625
3.69658124
3.71174565
3.70673965
3.69922997
3.65048014
3.68935385
3.68400718
^C
[] ~/projects/jruby $ pickjdk 3 ; java -cp lib/jruby.jar:. ContextGetter
New JDK: adoptopenjdk-8-openj9.jdk
8.9936078
7.43961186
8.70367504
8.45563959
7.14172598
8.05580192
9.91437537
10.08027979
9.87137747
9.94407308
9.91858358
8.24154539
6.86044516
6.27043835
6.09401586
6.11884365
6.08598684
6.18706203
6.17403144
6.10032537
^C

@JamesKingdon
Contributor

Many thanks for trying that out.

@JamesKingdon
Contributor

JamesKingdon commented Nov 7, 2019

I modified the testcase, reproduced the problem, took a log and was wading through it when I realised that I was using a really old version of jruby (1.7.22). That's probably not so useful :)

(As an aside, with the older version of JRuby, OpenJ9 and HotSpot are pretty much on par; I guess for some reason OpenJ9 isn't benefiting from the improvement in jruby/jruby#5959. Confirmed: comparing 'master' with the old JRuby, HotSpot gets faster and OpenJ9 gets slower.)

@JamesKingdon
Contributor

In case it's useful, I'll attach my modified version of the testcase:
ContextGetter.java.zip

The options to get a compilation log are:

-Xjit:dontInline={ContextGetter.doTest*},{ContextGetter.doTest*}{scorching}(tracefull,log=doTest)
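
(In a typical shell the braces need quoting, so a full invocation against the benchmark would look roughly like the following, with paths as in the earlier runs:)

java '-Xjit:dontInline={ContextGetter.doTest*},{ContextGetter.doTest*}{scorching}(tracefull,log=doTest)' -cp lib/jruby.jar:. ContextGetter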

@andrewcraik
Contributor

Thanks @JamesKingdon for the builds you shared. So there are some interesting things happening here:

  • in the scorching compile of the doTest method there are some paths in the ThreadLocal logic that appear unlikely to execute based on profiling, so we don't inline a number of calls related to the map and a few other bits (see the sketch after this list). Forcing the inlining cuts the gap by a bit more than half on my system
  • there are still a number of profiled guards
  • in addition, OpenJ9 has its own implementation of java/lang/ref/Reference and java/lang/ref/SoftReference. Based on perf stats I suspect OpenJDK's Reference family is not using a volatile field; we seem to have more stalls around a full fence when writing to SoftReference.age. I also see an if check in the OpenJ9 Reference.get testing for a GC mode. This does get removed, but I'm not sure we are doing so unconditionally, e.g. we may be preserving modification safety that is not necessary and so costing us some performance.
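
For reference, OpenJDK 8's ThreadLocal.get() has roughly this shape (simplified for illustration); the getMap/getEntry calls and the WeakReference-based Entry are the pieces the inliner was declining to inline on the cold-looking paths:

// Roughly the shape of OpenJDK 8's ThreadLocal.get(), simplified.
public T get() {
    Thread t = Thread.currentThread();
    ThreadLocalMap map = getMap(t);                   // per-thread map stored on Thread
    if (map != null) {
        ThreadLocalMap.Entry e = map.getEntry(this);  // Entry extends WeakReference<ThreadLocal<?>>
        if (e != null) {
            @SuppressWarnings("unchecked")
            T result = (T) e.value;
            return result;                            // hot path when the value is present
        }
    }
    return setInitialValue();                         // slow path: first access on this thread
}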

So, on my system: baseline for OpenJ9 is 5.6s; with tryToInline={*} on the scorching compile, 4.5s; with JCL fixes to remove the volatile and the extra if, 4.4s; with the new JProfiling implementation, 4.3s. OpenJDK with the same test is 3.8s. Those changes cut the overhead from ~50% slower to ~12% slower. I suspect the last bit is that we are generating some profiled guards in places where we don't need them (e.g. on the super.get in SoftReference.get). If so, those guards would be adding some additional loading and stalling, and removing them may well get us down to the level seen in OpenJDK.

So no changes to 'fix' this yet, but those are at least some of the causes of the delta.

@vijaysun-omr
Contributor

@andrewcraik does the new JProfiling implementation get used in the very-hot profiled compilation before scorching compilation by default in the OpenJ9 build you are running? If not, is that what "use new JProfiling implementation" did in your run?

Do you still need the tryToInline if you are using the JProfiling implementation? That is, does using a different profiler affect the inlining on the paths you mentioned in the scorching compilation such that you don't need to force it?

@andrewcraik
Contributor

@vijaysun-omr My build does not have the new JProfiling implementation on by default (it is a bit older than head) so I was enabling it in my build to test the performance. The tryToInline is still important for performance even with the new JProfiling enabled.

@headius
Author

headius commented Nov 26, 2019

Is it possible for me to build this "JProfiling" implementation and give it a try?

@andrewcraik
Contributor

JProfiling is already in the builds. Adding -Xjit:enableJProfilingInProfilingCompilations will cause the JProfiler to be used for profiling compilations rather than the current default JITProfiler.
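
For example, re-running the earlier benchmark with that option would look something like:

java -Xjit:enableJProfilingInProfilingCompilations -cp lib/jruby.jar:. ContextGetter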

@DanHeidinga
Member

Moving this forward based on the info here. @andrewcraik if you think this is addressed (has JProfiling shipped now?) then please retarget / close.

@andrewcraik
Contributor

JProfiling is now on by default. Not all the issues I identified have been fully explored, but we haven't had more time to work on this yet. I'll leave it open for now.

@DanHeidinga
Member

Moving this forward as it hasn't gotten any attention for this release

@andrewcraik
Contributor

We have some ideas for how to fix this, but I'm not sure they will make the release. I'll leave it here for now, though it's not certain it will be addressed.

@andrewcraik
Contributor

We still haven't had time to implement the perf improvements due to other work - moving forward. Let me know if the priority increases.

@liqunl
Contributor

liqunl commented Nov 25, 2020

We're still looking at this perf issue. Regarding the comments from #7699 (comment), Andrew suspected OpenJDK is not using a volatile field in its implementation of the Reference classes, which could cause a perf difference. However, OpenJ9 treats the volatile field like a non-volatile field here:
https://github.com/eclipse/openj9/blob/fea3bfb3c0fadfe5b70399488940e00771455de7/runtime/compiler/compile/J9SymbolReferenceTable.cpp#L855-L860
so that is not the problem here.

There are profiled guards that we can get rid of with a type constraint on the receiver; this should give us a ~20% boost. @rpshukla is working on it.

@headius
Author

headius commented Dec 2, 2020

FWIW we usually try to avoid this ThreadLocal by passing the relevant object through the call stack, but it is hit heavily when we can't use the stack to pass it. For example, Java calling into Ruby code needs to acquire this context again via the ThreadLocal on every call, and it can't be cached since it is thread-specific.

So long story short, this should not usually affect pure-Ruby apps but apps with lots of Java integration will see more impact.
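
To illustrate the two patterns (a rough sketch with illustrative names, not the actual JRuby API):

// Illustrative only; RuntimeContext stands in for the thread-specific context object.
final class RuntimeContext {
    // Thread-specific context; cannot be cached across threads.
    private static final ThreadLocal<RuntimeContext> CURRENT =
            ThreadLocal.withInitial(RuntimeContext::new);

    static RuntimeContext current() {
        return CURRENT.get();
    }
}

final class Example {
    // Pure-Ruby-style call path: the context is threaded through explicitly,
    // so the ThreadLocal is not touched on the hot path.
    static Object callWithContext(RuntimeContext context, Object arg) {
        return doWork(context, arg);
    }

    // Java-integration-style entry point: every call from Java into Ruby has
    // to re-acquire the context from the ThreadLocal, which is the hot access
    // this issue is about.
    static Object callFromJava(Object arg) {
        RuntimeContext context = RuntimeContext.current();
        return doWork(context, arg);
    }

    private static Object doWork(RuntimeContext context, Object arg) {
        return arg;
    }
}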

@liqunl
Contributor

liqunl commented Dec 3, 2020

Update: the VP change with the ~20% boost is still in progress; we expect to get it in soon. However, the remaining gap is unlikely to be closed in the next release. Given #7699 (comment), we should defer this to the next release and continue investigating the remaining gap.

@0xdaryl
Contributor

0xdaryl commented Dec 9, 2020

Moving this forward another release for the reasons Liqun highlighted.

@0xdaryl
Contributor

0xdaryl commented Feb 18, 2021

We haven't had much of an opportunity to make progress on this one. Moving forward to 0.26.

@0xdaryl
Contributor

0xdaryl commented Mar 31, 2021

No updates. Moving to 0.27.

@0xdaryl
Contributor

0xdaryl commented Jun 21, 2021

No updates in Liqun's absence. Moving to 0.28.
