
JDK 21, lusearch, and Lucene "regression" #264

Open

shipilev opened this issue Mar 18, 2024 · 33 comments

@shipilev

shipilev commented Mar 18, 2024

Wanted to report a "problem" with the current lusearch benchmark in 23.11-chopin, which so far looks like an issue in Lucene's support for JDK preview features. In short, running with different JDKs shows that lusearch is substantially slower on JDK 21.

Running self-built JDKs on a Graviton 3 instance with:

% shipilev-jdk/build/linux-aarch64-server-release/images/jdk/bin/java -jar dacapo-23.11-chopin.jar lusearch -n 20

...yields these results:

JDK 17-dev: 1584 msec
JDK 21-dev: 2446 msec (!!!)
JDK 22+0:   2455 msec (!!!)
JDK 22+1:   1527 msec
JDK 23-dev: 1519 msec

Bisection shows that the "regression" starts in JDK 19 with: 8282191: Implementation of Foreign Function & Memory API (Preview). We cannot run JDK 19 prior to that changeset: Lucene fails with NCDFE trying to access MemorySegment. We suspect that Lucene is opting in to some FFM preview features, which are slower than what it used before. Mainline JDK is fast again, but only after the JDK starts to identify itself as JDK 22: openjdk/jdk@5a706fb

I suspect that the Lucene that ships with 23.11-chopin actually wants JDK 22 to perform well, and the intermediate version that opts in to JDK 19 preview features is actually slower on this test.

@ChrisHegarty, is there something we can do here? Maybe there is a feature flag that switches what Lucene is using? Maybe Dacapo should update Lucene to some other version?

@shipilev shipilev changed the title JDK 21, lusearch and Lucene "regression" JDK 21, lusearch, and Lucene "regression" Mar 18, 2024
@ChrisHegarty

ChrisHegarty commented Mar 19, 2024

What version of Lucene are you using? It seems that the MemorySegment backed mmap directory is being used by default. You can disable it by setting the following system property:

-Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false
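
For example (assuming the same invocation as in the original report), the full command would be:

% java -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false -jar dacapo-23.11-chopin.jar lusearch -n 20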

I'm surprised that using the MemorySegment backed mmap directory causes a perf regression. This is not what we see in our tests, so it's likely that there's something else going on here too.

@steveblackburn
Contributor

We're using 9.7.0 in the Chopin release.

@caizixian

Yes, I can confirm that we can see the same slowdown in our setup: 2845ms on Temurin 11 (0.74% 95% CI), 2825ms on Temurin 17 (0.74% 95% CI), and 3238ms on Temurin 21 (1% 95% CI).

@caizixian

Setting the -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false flag indeed improves the performance on Temurin 21. It's now on par with 17 and 11.

steveblackburn pushed a commit that referenced this issue Apr 18, 2024
@steveblackburn
Contributor

I just want to confirm that this problem persists with Lucene 9.10.

@ChrisHegarty is this what you expect?

I'm preparing for an upcoming release, but given what I saw with 9.10, I am now wondering whether I should be turning the flag off by default. My (strong) preference is to have Lucene run "as-is". But I'd also rather not ship the suite with a known major performance regression (I regret that I did not see this when I made the recent release).

@shipilev do you have thoughts on this?

@shipilev
Author

@shipilev do you have thoughts on this?

It sounds to me that there is a problem with Lucene and JDK 21, which Dacapo correctly caught. (Pats the benchmark on its back.) And JDK 21 is a de-facto LTS release, which will be around for years to come. So I think this problem should be fixed in Lucene, and then Dacapo should update to the new version of Lucene. If there is no Lucene version that solves the issue with JDK 21, Dacapo should not be forced to do anything to work around performance bugs.

@ChrisHegarty

@ChrisHegarty is this what you expect?

We run Lucene 9.10 with JDK 21 in many scenarios WITHOUT issue. Whatever it is that you're encountering, it is not an obvious or known issue with Lucene 9.10 and JDK 21. Can you please try to determine where the performance difference is, and why enableMemorySegments=false would have any effect on it?

@steveblackburn
Contributor

I can confirm that the issue appears to have been introduced in 9.7. It is not evident in 9.6 or earlier.

I'll look into this further tomorrow.

@steveblackburn
Contributor

This seems to be the culprit: apache/lucene#12294

The next question is why we're seeing the issue and you're not.

@steveblackburn
Contributor

steveblackburn commented Apr 23, 2024

@ChrisHegarty

A quick look at a perf profile shows the following:

9.6

  8.30%  Query0           [JIT] tid 786719      [.] int org.apache.lucene.util.compress.LZ4.decompress(org.apache.lucene.store.DataInput, int, byte[], int)
   7.86%  Query0           [JIT] tid 786719      [.] java.nio.ByteBuffer java.nio.ByteBuffer.getArray(int, byte[], int, int)
   6.88%  Query0           [JIT] tid 786719      [.] StubRoutines (final stubs)
   4.41%  Query0           [JIT] tid 786719      [.] short org.apache.lucene.store.ByteBufferIndexInput.readShort()

9.7

  10.20%  Query0           [JIT] tid 792317      [.] int java.lang.invoke.VarHandleGuards.guard_LJ_I(java.lang.invoke.VarHandle, java.lang.Object, long, java.lang.invoke.VarHandle$AccessDescriptor)
   9.71%  Query0           [JIT] tid 792317      [.] StubRoutines (final stubs)
   8.50%  Query0           [JIT] tid 792317      [.] void org.apache.lucene.store.MemorySegmentIndexInput.readBytes(byte[], int, int)
   6.63%  Query0           [JIT] tid 792317      [.] int java.lang.invoke.MethodHandle.linkToStatic(java.lang.Object, java.lang.Object, long, java.lang.invoke.MemberName)
   5.86%  Query0           [JIT] tid 792317      [.] short java.lang.invoke.VarHandleSegmentAsShorts.get(java.lang.invoke.VarHandle, java.lang.Object, long)
   5.74%  Query0           [JIT] tid 792317      [.] int org.apache.lucene.util.compress.LZ4.decompress(org.apache.lucene.store.DataInput, int, byte[], int)
   5.23%  Query0           [JIT] tid 792317      [.] short org.apache.lucene.store.MemorySegmentIndexInput.readShort()

So the link to MemorySegments is clear.

This is with Corretto-21.0.0.35.1 (build 21+35-LTS)

I'm running DaCapo lusearch with arguments -n 20 -t 1 (20 invocations of the workload, single threaded). The relevant code is here.

This runs 10 M simple queries in a single worker thread ("Query0").
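
For context, here is a minimal, self-contained sketch of the two read paths these profiles are comparing. This is not Lucene's actual code: it needs JDK 22+ (or JDK 21 with --enable-preview), and the file path is a placeholder.

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ReadPathSketch {
  public static void main(String[] args) throws Exception {
    Path file = Path.of(args[0]);  // any file of at least 2 bytes (placeholder)
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ);
         Arena arena = Arena.ofShared()) {

      // ByteBuffer-backed mmap read, the path visible in the 9.6 profile (ByteBufferIndexInput):
      ByteBuffer bb = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size())
                        .order(ByteOrder.nativeOrder());
      short viaBuffer = bb.getShort(0);

      // MemorySegment-backed mmap read, the path visible in the 9.7 profile
      // (MemorySegmentIndexInput). Segment reads go through VarHandles, which is
      // why VarHandleGuards / VarHandleSegmentAsShorts show up above; the liveness
      // and bounds checks have to be eliminated by the JIT to be cheap.
      MemorySegment seg = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
      short viaSegment = seg.get(ValueLayout.JAVA_SHORT_UNALIGNED, 0);

      System.out.println(viaBuffer + " " + viaSegment);
    }
  }
}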

@caizixian

Running with -Xint and profiling using VisualVM shows the full stack with and without enableMemorySegments.
This is not apparent from the above perf output since most of the methods are inlined into LZ4.decompress.

(Two VisualVM screenshots: the call trees with and without enableMemorySegments.)

So I think we learnt the following:

  1. The bulk of DaCapo's query threads' time is spent in LZ4.decompress, which heavily exercises IndexInput's various read* methods.
  2. The MemorySegment implementation of IndexInput seems to be inefficient on older JDKs, especially with more mutator threads (running DaCapo with -t 1 or -t 32 shows very different results with and without enableMemorySegments).

@caizixian

I wonder whether this fixes the issue on newer JDKs openjdk/jdk@1594653

@steveblackburn
Contributor

steveblackburn commented Apr 24, 2024

Thanks @caizixian

@ChrisHegarty this seems fairly clear now.

In short, enabling memory segments does not play nicely with JDK 21 (as @shipilev said), and we now have a profile that shows why this is so.

As @caizixian says, from all the testing we did, you will see this if you have an intense workload with a fair amount of parallelism (I saw 2X slowdown with 24 threads, but none when running single-threaded).

I am not sure why this did not show up in your testing.

Given what we know about the pathologies of JDK 21 and the data here, I wonder if it would be better for MemorySegmentIndexInput to be off by default for JDK 21. If so, my plan for DaCapo would be to include that change in the upcoming release. If not, perhaps my best option would be to downgrade to 9.6 for the upcoming release.

@ChrisHegarty

I think that the results of -Xint are not that useful, because performance with MemorySegments and VarHandles relies heavily on the JVM being able to eliminate (or otherwise hoist, or ...) various checks.

Can you please check with the most recent JDK, say JDK 21.0.2 and/or JDK 22.0.1? Preferably, or as well, with the Oracle GPL binaries, e.g. https://jdk.java.net/21/, https://jdk.java.net/22/

@caizixian

I think that the results of -Xint are not that useful

We were running with -Xint only to show the full call hierarchy that leads to the MemorySegment-backed IndexInput.

The performance difference is clearly visible when running with default parameters, and the relevant symbols can be seen using perf.

We were running 21.0.2.

openjdk version "21.0.2" 2024-01-16 LTS
OpenJDK Runtime Environment Temurin-21.0.2+13 (build 21.0.2+13-LTS)
OpenJDK 64-Bit Server VM Temurin-21.0.2+13 (build 21.0.2+13-LTS, mixed mode, sharing)

I just tested with the Oracle commercial binary of JDK 21 and it gave the same performance result, so it's not an OpenJDK/Temurin-specific problem.

java version "21.0.3" 2024-04-16 LTS
Java(TM) SE Runtime Environment (build 21.0.3+7-LTS-152)
Java HotSpot(TM) 64-Bit Server VM (build 21.0.3+7-LTS-152, mixed mode, sharing)

Lucene 9.7 + JDK 22 doesn't produce the performance pathology, because the MemorySegment IndexInput is guarded to Java 19–21: https://github.com/apache/lucene/tree/releases/lucene/9.7.0/lucene/core/src/java21/org/apache/lucene

@caizixian

We can reproduce the performance pathology using Lucene 9.10 + JDK 22, where the MemorySegment IndexInput is enabled for JDK 22 as well per apache/lucene#12706

@steveblackburn
Contributor

steveblackburn commented Apr 26, 2024

@ChrisHegarty

Can you please check with the most recent JDK

Here are the times in msec for five JDKs I happen to have at hand:

Family     Build             9.6    9.7    9.8    9.9    9.10
Corretto   17.0.8+7-LTS      3562   3694   3517   3508   3508
Corretto   21+35-LTS         3474   6371   6227   6183   6065
Oracle     21.0.3+7-LTS-152  3387   6229   6133   6300   6318
Corretto   21.0.3+9-LTS      3398   6221   5959   5985   6177
Corretto   22+36-FR          3401   3550   3395   3378   6183

Run as: java -jar dacapo-evaluation-git-d348bbe0.jar lusearch -n 20, with five builds; the first four each have a single-line change to libs.xml to change the Lucene version away from 9.10, which is what d348bbe uses.

As @caizixian mentioned above, the problem manifests in 22 also, which I had not noticed.

Here's the same analysis, now using just Lucene 9.10 and toggling enableMemorySegments:

Family     Build             enableMemorySegments=false   enableMemorySegments=true
Corretto   17.0.8+7-LTS      3611                         3522
Corretto   21+35-LTS         3352                         6392
Oracle     21.0.3+7-LTS-152  3354                         6075
Corretto   21.0.3+9-LTS      3387                         6185
Corretto   22+36-FR          3440                         6411

Run as: java -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=<value> -jar dacapo-evaluation-git-d348bbe0.jar lusearch -n 20, using the default level of parallelism (the available hardware parallelism; 24 in this case).

I think that the results of -Xint are not that useful

@caizixian only used -Xint to expose the calling context.

My profile results in the post above @caizixian's pointed to MemorySegmentIndexInput but did not have the calling context since I was not using -Xint. He used the interpreter to provide you with that info. Although the interpreter is obviously problematic in general, the results align with the profile from the run without -Xint and are consistent with all of the other evidence, so I think they're helpful.

@ChrisHegarty

For awareness, I filed the following Lucene issue to help track and investigate this: apache/lucene#13325

@mikemccand

@shipilev do you have thoughts on this?

It sounds to me that there is a problem with Lucene and JDK 21, which Dacapo correctly caught. (Pats the benchmark on its back.)

+1, thank you for running these benchmarks -- it's awesome and vital when they catch otherwise missed performance regressions. It's spooky this regression made it this far without detection! It's like a neutrino. We need to improve our detectors.

We need to understand why Lucene's own nightly benchmarks failed to detect this too ... I've opened mikemccand/luceneutil#267 to get to the bottom of that.

@uschindler

uschindler commented Apr 29, 2024

Hi, after studying the benchmark I think I found the issue: it looks like wrong use of the Lucene APIs!

What happens:

  • Look at the search tool:

      public void run() {
        try {
          int count = totalQueries / threadCount + (threadID < (totalQueries % threadCount) ? 1 : 0);
          for (int r = 0; r < iterations; r++) {
            for (int i = 0, queryId = threadID; i < count; i++, queryId += threadCount) {
              // make and run query
              new QueryProcessor(parent, name, queryId, index, outBase, queryBase, field, normsField, raw, hitsPerPage, totalQueries, iterations, threadID).run();
            }
          }
        } catch (Exception e) {
          e.printStackTrace();
        }
      }

    It runs each batch of queries in a new QueryProcessor instance, one per thread.
  • Look at QueryProcessor:

      public QueryProcessor(Search parent, String name, int queryID, String index, String outBase, String queryBase, String field, String normsField, boolean raw,
          int hitsPerPage, int totalQueries, int iterations, int threadID) {
        this.parent = parent;
        this.threadID = threadID;
        this.field = field;
        this.raw = raw;
        this.hitsPerPage = hitsPerPage;
        this.fivePercent = iterations*totalQueries/20;
        this.iterations = iterations;
        try {
          reader = DirectoryReader.open(FSDirectory.open(Paths.get(index)));
          /*if (normsField != null)
            reader = new OneNormsReader(reader, normsField);*/
          searcher = new IndexSearcher(reader);
          String query = queryBase + File.separator + "query" + (queryID < 10 ? "000" : (queryID < 100 ? "00" : (queryID < 1000 ? "0" : ""))) + queryID + ".txt";
          in = new BufferedReader(new FileReader(query));
          out = new PrintWriter(new BufferedWriter(new FileWriter(outBase + queryID)));
        } catch (Exception e) {
          e.printStackTrace();
        }
      }

    It opens a new IndexSearcher before each query batch and closes it after the batch. The close() is the bad part in multithreaded environments. Depending on the size of the batches, it may open and expensively close a new IndexReader quite often (on the same index!?), one for every run and one for every thread. Never ever do this! Open a single IndexReader/IndexSearcher combination for all threads once in the main setup, and only then spawn the threads.

The problem now is the following: because the index is closed all the time, the JVM keeps entering safepoints to perform thread-local handshakes; all other threads need to stop so the JVM can ensure that no thread still accesses the memory being unmapped.

I think the whole benchmark should be fixed to open only a single IndexSearcher, execute all threads in parallel on this single instance, and close it at the end.
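
A minimal sketch of that restructuring, for illustration only (not the actual DaCapo code; the worker body and argument handling are placeholders):

import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class SharedSearcherSketch {
  public static void main(String[] args) throws Exception {
    String index = args[0];                       // path to the Lucene index (placeholder)
    int threadCount = Integer.parseInt(args[1]);  // level of parallelism (placeholder)

    // Open the directory and reader once, up front, in the main thread.
    try (FSDirectory dir = FSDirectory.open(Paths.get(index));
         DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);  // IndexSearcher is thread-safe: share it
      List<Thread> workers = new ArrayList<>();
      for (int t = 0; t < threadCount; t++) {
        Thread worker = new Thread(() -> runQueryBatches(searcher));
        workers.add(worker);
        worker.start();
      }
      for (Thread worker : workers) {
        worker.join();
      }
    } // the one and only close() happens here, after all query work is done
  }

  private static void runQueryBatches(IndexSearcher searcher) {
    // placeholder for the per-thread query loop of the benchmark
  }
}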

The usage of Lucene as done in the benchmark is not realistic. This is why Lucene does not see the issue in its own benchmarks.

The slowdown comes from the fact that MemorySegments allow mmapped files to be unmapped safely, at the cost of extra safepoints on the IndexReader#close call. As a result, the heavily used methods may stop in safepoints and deoptimize all the time. @shipilev may explain this better: when you close an IndexReader in Lucene, we close all shared MemorySegments, causing a thread-local handshake.

@uschindler

I wonder whether this fixes the issue on newer JDKs openjdk/jdk@1594653

You gave the correct hint here. But this won't fix it; the problem will still exist. Don't close the IndexReader all the time, as it causes all (possibly unrelated) threads to stop.

@uschindler

uschindler commented Apr 29, 2024

Can you rewrite the benchmark code to open the IndexReader/IndexSearcher in the main thread once, then spawn all threads, and then close the index at the end before the benchmark exits?

If this restores performance, then it's verified that the thread-local handshakes are the problem for this code.

Anyway: the old ByteBuffer code is risky, as it may crash the JVM with SIGSEGV, because we use sun.misc.Unsafe to unmap all ByteBuffers. If there are threads executing queries at the same time, they will SIGSEGV.

So with the additional safety we buy a more expensive close(), but that's the only way to go, as it is the only viable future for Lucene's MMapDirectory. The Lucene main branch (Java 21+) no longer uses ByteBuffers and has switched to MemorySegment completely, for exactly that reason. In addition, if you don't close indexes all the time, the code is faster (as seen in Elasticsearch benchmarks).

It also has nothing to do with preview features. The slowdown is by design.

@uschindler

uschindler commented Apr 29, 2024

See also the talk by @mcimadamore at FOSDEM 2024, slide 13 ("Arena-based memory management"):

Strong safety guarantee: no use-after-free
• When the arena is closed, all its segments are invalidated, atomically
• Closing a shared arena triggers a thread-local handshake (JEP 312)

Also: https://github.com/openjdk/panama-foreign/blob/foreign-memaccess%2Babi/doc/panama_memaccess.md

(1): Shared arenas rely on VM thread-local handshakes (JEP 312) to implement lock-free, safe, shared memory access; that is, when it comes to memory access, there should be no difference in performance between a shared segment and a confined segment. On the other hand, Arena::close might be slower on shared arenas than on confined ones.

I hope this helps with understanding. You have to add to that statement that the handshake not only makes the close slower, it also affects other threads using MemorySegments due to deoptimization (as far as I remember).
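
To make that concrete, here is a minimal, self-contained sketch of the Arena behaviour described above (JDK 22+, or JDK 21 with --enable-preview). This is my reading of the Arena javadoc, so treat it as illustrative rather than authoritative:

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class SharedArenaCloseDemo {
  public static void main(String[] args) throws InterruptedException {
    Arena arena = Arena.ofShared();
    MemorySegment segment = arena.allocate(1024);

    Thread reader = new Thread(() -> {
      long sum = 0;
      try {
        while (true) {
          // Plain read: no per-access locking is needed for safety.
          sum += segment.get(ValueLayout.JAVA_BYTE, 0);
        }
      } catch (IllegalStateException e) {
        // Thrown once the arena has been closed: use-after-free is impossible.
        System.out.println("reader stopped after sum=" + sum);
      }
    });
    reader.start();

    Thread.sleep(100);

    // Closing a shared arena performs a thread-local handshake (JEP 312). Per the
    // javadoc it can itself fail with IllegalStateException if it observes an
    // access in flight, so retry until it succeeds; this is where the extra cost
    // of close() comes from.
    while (true) {
      try {
        arena.close();
        break;
      } catch (IllegalStateException retry) {
        Thread.onSpinWait();
      }
    }
    reader.join();
  }
}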

@steveblackburn
Contributor

The slowdown comes from the fact that MemorySegments allow mmapped files to be unmapped safely, at the cost of extra safepoints on the IndexReader#close call. As a result, the heavily used methods may stop in safepoints and deoptimize all the time. @shipilev may explain this better: when you close an IndexReader in Lucene, we close all shared MemorySegments, causing a thread-local handshake.

Thanks for that. That code is old. AFAIK when we originally wrote the workload, it was based on a Lucene performance benchmark.

I will investigate re-writing to address this pathology.

@steveblackburn
Contributor

I made the minimal change suggested by @uschindler of lifting the construction of the IndexReader and IndexSearcher outside the loop nest (so a single instance, shared across all threads, across all batches). As he suggested, this completely addresses the problem.

We see a substantial performance win and the sensitivity to enableMemorySegments has gone.

Here's a quick and rough performance analysis:

Family     Build             d348bbe true   d348bbe false   21b653d true   21b653d false
Corretto   17.0.8+7-LTS      3710           3621            2673           2448
Corretto   21+35-LTS         6392           3547            2455           2399
Oracle     21.0.3+7-LTS-152  6214           3562            2274           2356
Corretto   21.0.3+9-LTS      6371           3550            2372           2326
Corretto   22+36-FR          6416           3455            2223           2325

Thanks to @uschindler for identifying the cause of the problem and to @shipilev for spotting the issue in the first place.

@uschindler

Thanks for the confirmation. I am still wondering why this is so dramatic. Maybe the batches are very short, so the open/close is happening all the time. In addition, if the batches execute at different speeds, at the end there is always some IndexReader being closed.

Anyway, we should talk to the HotSpot people to figure out how we can improve the safepoint/thread-local handshake so that it does not get so expensive for other threads. To me it looks like on every close, all methods accessing MemorySegments get deoptimized, but I haven't looked into this.

@caizixian

Deoptimization indeed seems to be the problem. Running JDK 21 with -t 1, the deopt log looks similar with and without MemorySegment. But if we use MemorySegment and have more than one thread, the log is filled with deopt entries such as the ones below:

DEOPT UNPACKING thread=0x00007fa13c2d4ae0 vframeArray=0x00007f9fb0030fc0 mode=0
   Virtual frames (outermost/oldest first):
DEOPT UNPACKING thread=0x00007fa13c2db830 vframeArray=0x00007f9f980795c0 mode=0
   Virtual frames (outermost/oldest first):
      VFrame 4 (0x00007f9fb0032408) - org.apache.lucene.search.TopScoreDocCollector.create(ILorg/apache/lucene/search/ScoreDoc;Lorg/apache/lucene/search/HitsThresholdChecker;Lorg/apache/lucene/search/MaxScoreAccumulator;)Lorg/apache/lucene/search/TopScoreDocCollector; - invokespecial @ bci=39 sp=0x00007fa0ad2e1660
      VFrame 1 (0x00007f9fcc05f368) - org.apache.lucene.codecs.lucene90.LZ4WithPresetDictCompressionMode$LZ4WithPresetDictDecompressor.decompress(Lorg/apache/lucene/store/DataInput;IIILorg/apache/lucene/util/BytesRef;)V - invokestatic @ bci=265 sp=0x00007fa0ad7e63c8
      VFrame 0 (0x00007fa00402e228) - org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.postings(Lorg/apache/lucene/index/PostingsEnum;I)Lorg/apache/lucene/index/PostingsEnum; - invokevirtual @ bci=54 sp=0x00007fa0ae5f4620

DEOPT UNPACKING thread=0x00007fa13c2d80c0 vframeArray=0x00007f9fac07c1e0 mode=0
   Virtual frames (outermost/oldest first):
      VFrame 1 (0x00007f9fac07d508) - org.apache.lucene.codecs.lucene90.LZ4WithPresetDictCompressionMode$LZ4WithPresetDictDecompressor.decompress(Lorg/apache/lucene/store/DataInput;IIILorg/apache/lucene/util/BytesRef;)V - invokestatic @ bci=265 sp=0x00007fa0acfde3c8
      VFrame 3 (0x00007f9fb00323a8) - org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector.<init>(ILorg/apache/lucene/search/HitsThresholdChecker;Lorg/apache/lucene/search/MaxScoreAccumulator;)V - invokespecial @ bci=4 sp=0x00007fa0ad2e15e8
      VFrame 0 (0x00007f9fcc05f308) - org.apache.lucene.util.compress.LZ4.decompress(Lorg/apache/lucene/store/DataInput;I[BI)I - if_icmplt @ bci=245 sp=0x00007fa0ad7e6360

      VFrame 0 (0x00007f9f9807a888) - org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame.scanToTermLeaf(Lorg/apache/lucene/util/BytesRef;Z)Lorg/apache/lucene/index/TermsEnum$SeekStatus; - if_icmplt @ bci=226 sp=0x00007fa0accdb4f8

      VFrame 2 (0x00007f9fb0032348) - org.apache.lucene.search.TopScoreDocCollector.<init>(ILorg/apache/lucene/search/HitsThresholdChecker;Lorg/apache/lucene/search/MaxScoreAccumulator;)V - invokespecial @ bci=7 sp=0x00007fa0ad2e1568
      VFrame 0 (0x00007f9fd00697e8) - org.apache.lucene.codecs.lucene90.compressing.StoredFieldsInts.readInts8(Lorg/apache/lucene/store/IndexInput;I[JI)V - goto @ bci=204 sp=0x00007fa0adae9398

      VFrame 1 (0x00007f9fb00322e8) - org.apache.lucene.search.HitQueue.<init>(IZ)V - invokespecial @ bci=8 sp=0x00007fa0ad2e14d8
      VFrame 0 (0x00007f9fb0032288) - org.apache.lucene.util.PriorityQueue.<init>(ILjava/util/function/Supplier;)V - goto @ bci=118 sp=0x00007fa0ad2e1480

      VFrame 0 (0x00007f9fac07d4a8) - org.apache.lucene.util.compress.LZ4.decompress(Lorg/apache/lucene/store/DataInput;I[BI)I - if_icmplt @ bci=245 sp=0x00007fa0acfde360

@caizixian

The LogCompilation log looks like this:

<deoptimized thread='664511' reason='constraint' pc='0x00007f4f0857bff4' compile_id='2446' compiler='c2' level='4'>
<jvms bci='245' method='org.apache.lucene.util.compress.LZ4 decompress (Lorg/apache/lucene/store/DataInput;I[BI)I' bytes='250' count='1537' backedge_count='180079' iicount='1537'/>
</deoptimized>
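
(For reference: output like the above typically comes from running with compilation logging enabled, e.g. java -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation -jar dacapo-evaluation-git-d348bbe0.jar lusearch -n 20, which writes an XML compilation log into the working directory. The exact invocation here is an assumption on my part.)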

@uschindler

The innermost frames would be more interesting.

@uschindler

Anyway, to conclude: there's room for improvement for the HotSpot team in making thread-local handshakes cheaper, so this "badly designed Lucene benchmark" actually helped to identify the issues with thread-local handshakes.

Elasticsearch and Solr of course need to close IndexReaders from time to time in NRT (near-real-time) use cases, and they have benchmarks for this. The benchmark setup here was just a good "code example" showing the issue.

@shipilev -> your turn!

If we get the innermost frames of the deoptimizations (inside the MemorySegment VarHandle code), we can open issues in OpenJDK. I think the main problem is: the thread-local handshakes trigger safepoints in all affected threads, and a side effect of those safepoints seems to be deoptimization. But the deoptimization is not actually needed; maybe this can be prevented in the HotSpot code: after the handshake, the code could reuse the already optimized code!?

@uschindler

One question: what CPU architecture did you benchmark on? Cheap thread-local handshakes are only available on x86-64 and SPARC (see JEP 312); on all other platforms global safepoints are used. So it would be good to know which platform.

@uschindler

Looks like AARCH64 also has thread-local handshakes: https://bugs.openjdk.org/browse/JDK-8189596

@caizixian

@shipilev used aarch64 (Graviton 3) and I used x86_64 (Ryzen 9 7950X).
