Slower performance than C2 on hot thread local access #7699
The timing loop in the benchmark is in the main method, so it will be a DLT (dynamic loop transfer) compile. It would be worth checking what happens if the loop is moved into a separate method and warmed up before running the test.
I admit the benchmark is poorly formed. Here's a run with the for loop moved to a separate method and warmed up (it doesn't make a great deal of difference, unfortunately):
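The suggestion above can be sketched as follows. This is a hypothetical reduction of the benchmark, not the original testcase: the hot loop lives in its own method so the JIT compiles it normally rather than via DLT, and it is warmed up before the timed run. All names and iteration counts here are illustrative.

```java
public class ThreadLocalBench {
    private static final ThreadLocal<Integer> TL = ThreadLocal.withInitial(() -> 0);

    // Hot loop in its own method so it gets a regular JIT compilation
    // instead of a DLT compile of main's loop.
    static long loop(int iters) {
        long sum = 0;
        for (int i = 0; i < iters; i++) {
            sum += TL.get();   // hot ThreadLocal access under test
            TL.set(i & 0xff);
        }
        return sum;
    }

    public static void main(String[] args) {
        for (int w = 0; w < 20; w++) {
            loop(1_000_000);   // warmup so the method is compiled before timing
        }
        long t0 = System.nanoTime();
        loop(100_000_000);
        System.out.println("elapsed ms: " + (System.nanoTime() - t0) / 1_000_000);
    }
}
```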
Many thanks for trying that out.
I modified the testcase, reproduced the problem, took a log, and was wading through it when I realised that I was using a really old version of jruby (1.7.22). That's probably not so useful :) (As an aside, with the older version of jruby, OpenJ9 and Hotspot are pretty much on par; I guess for some reason OpenJ9 isn't benefiting from the improvement in jruby/jruby#5959. Confirmed: comparing 'master' with the old jruby, Hotspot gets faster and OpenJ9 gets slower.)
In case it's useful, I'll attach my modified version of the testcase. The options to get a compilation log are:
Thanks @JamesKingdon for the builds you shared. There are some interesting things happening here.
Baseline for OpenJ9 on my system: 5.6s. With tryToInline={*} on the scorching compile: 4.5s. With JCL fixes to remove the volatile and an extra if: 4.4s. Using the new JProfiling implementation: 4.3s. OpenJDK with the same test is 3.8s. Those changes cut the overhead from ~50% slower to ~12% slower. I suspect the last bit is that we are generating some profiled guards in places where we don't need them (e.g. on a super.get in SoftReference.get). If so, those guards would be adding extra loading and stalling, and removing them may well get us down to the level seen in OpenJDK. So no changes to 'fix' this yet, but these are at least some of the causes of the delta.
@andrewcraik does the new JProfiling implementation get used in the very-hot profiled compilation before scorching compilation by default in the OpenJ9 build you are running? If not, is that what "use new JProfiling implementation" did in your run? Do you still need the tryToInline if you are using the JProfiling implementation? i.e. does using a different profiler affect the inlining on the paths you mentioned in the scorching compilation such that you don't need to force it?
@vijaysun-omr My build does not have the new JProfiling implementation on by default (it is a bit older than head) so I was enabling it in my build to test the performance. The tryToInline is still important for performance even with the new JProfiling enabled. |
Is it possible for me to build this "JProfiling" implementation and give it a try? |
JProfiling is already in the builds. Adding -Xjit:enableJProfilingInProfilingCompilations will cause the JProfiler to be used for profiling compilations rather than the current default JITProfiler. |
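For concreteness, the flag named above is passed on the JVM command line like any other -Xjit option; a minimal invocation sketch (the class name is a placeholder, not from this thread) would be:

```shell
# Run the benchmark on OpenJ9 with JProfiling used for profiling compilations
# instead of the default JITProfiler. "ThreadLocalBench" is a hypothetical class.
java -Xjit:enableJProfilingInProfilingCompilations ThreadLocalBench
```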
Moving this forward based on the info here. @andrewcraik if you think this is addressed (has jprofile shipped now?) then please retarget / close |
JProfiling is now on by default - not all the issues I identified have been fully explored yet, but we haven't had more time to work on this yet. I'll leave it open for now. |
Moving this forward as it hasn't gotten any attention for this release |
We have some ideas how to fix this, not sure it will make the release - will leave it here for now, but not certain it will be addressed. |
We still haven't had time to implement the perf improvements due to other work - moving forward. Let me know if the priority increases. |
We're still looking at this perf issue. Regarding the comments from #7699 (comment), Andrew suspects OpenJDK is not using a volatile field in its implementation of the Reference classes, which may cause a perf difference; however, OpenJ9 treats the volatile field like a non-volatile field here. There are profiled guards that we can get rid of with the type constraint on the receiver, which should give us a ~20% boost. @rpshukla is working on it.
FWIW we usually try to avoid this ThreadLocal by passing the relevant object through the call stack, but it is hit heavily when we can't use the stack to pass it. For example, Java calling into Ruby code needs to acquire this context again via the ThreadLocal every time, and it can't be cached since it is thread-specific. So, long story short, this should not usually affect pure-Ruby apps, but apps with lots of Java integration will see more impact.
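The two access patterns described above can be sketched as follows. This is an illustrative reduction, not JRuby's actual code; `Context` stands in for something like JRuby's ThreadContext, and all names are hypothetical.

```java
final class Context {
    // Per-thread context; the ThreadLocal is only consulted at boundaries.
    private static final ThreadLocal<Context> CURRENT =
            ThreadLocal.withInitial(Context::new);

    // Fast path: pure-Ruby call chains pass the context down the call stack,
    // avoiding the ThreadLocal lookup entirely.
    static int withPassedContext(Context ctx, int x) {
        return ctx.compute(x);
    }

    // Slow path: a Java -> Ruby boundary can't carry the context on the stack,
    // so it must re-acquire the current thread's context each time.
    static int withThreadLocalLookup(int x) {
        return CURRENT.get().compute(x);   // the hot ThreadLocal access
    }

    int compute(int x) {
        return x + 1;   // placeholder for real per-context work
    }
}
```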
Update: the VP change with the ~20% boost is still in progress; we expect to get it in soon. But the remaining gap is unlikely to be closed in the next release. Given #7699 (comment), we should defer it to the next release and continue investigating the remaining gap.
Moving this forward another release for the reasons Liqun highlighted. |
We haven't had much of an opportunity to make progress on this one. Moving forward to 0.26. |
No updates. Moving to 0.27. |
No updates in Liqun's absence. Moving to 0.28. |
I was optimizing a particularly hot threadlocal in JRuby today and noticed that current OpenJ9 Java 8 (nightly from Oct 12th or so) appears to be nearly 2x slower than OpenJDK8 C2 (recent-ish build). It seemed worth posting an issue after discussing it in Slack.
The optimized PR/branch and the benchmark are provided along with a link to C2 assembly output in jruby/jruby#5959
My numbers locally for Java 8 C2 vs J9:
Note that I also saw both Java 11 C2 and J9 degrade, but the degradation seemed proportional. Something has changed in Java 11 ThreadLocal that impacts both runtimes.