JDK 21, lusearch, and Lucene "regression" #264

Wanted to report a "problem" with the current lusearch benchmark in 23.11-chopin, which so far looks like an issue in Lucene's support for JDK preview features. In short, running with different JDKs shows that lusearch is substantially slower with JDK 21.

Running self-built JDKs on a Graviton 3 instance with:

...yields these results:

Bisection shows that the "regression" starts in JDK 19 with 8282191: Implementation of Foreign Function & Memory API (Preview). We cannot run JDK 19 prior to that changeset: Lucene fails with NCDFE trying to access MemorySegment. We suspect that Lucene is opting in to some FFM preview features, which are slower than what it used before. Mainline JDK is fast again, but only after the JDK starts to identify itself as JDK 22: openjdk/jdk@5a706fb. I suspect that the Lucene that ships with 23.11-chopin actually wants JDK 22 to perform well, and the intermediate version that opts in to JDK 19 preview features is slower on this test.

@ChrisHegarty, is there something we can do here? Maybe there is a feature flag that switches what Lucene is using? Maybe DaCapo should update Lucene to some other version?

Comments
What version of Lucene are you using? It seems that the MemorySegment-backed mmap directory is being used by default. You can disable it by setting the following system property:
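(For what it's worth: in Lucene 9.x the switch is, to my knowledge, the org.apache.lucene.store.MMapDirectory.enableMemorySegments system property; treat the exact name as an assumption and check your version's MMapDirectory javadocs. For example:)

```
# assumption: property name per Lucene 9.x MMapDirectory; verify against your version
java -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false -jar dacapo-23.11-chopin.jar lusearch
```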
I'm surprised that using the MemorySegment-backed mmap directory causes a perf regression. This is not what we see in our tests, so it's likely that there's something else going on here too.
We're using 9.7.0 in the Chopin release.
Yes, I can confirm that we can see the same slowdown in our setup: 2845 ms in Temurin 11 (0.74% 95CI), 2825 ms in Temurin 17 (0.74% 95CI), and 3238 ms in Temurin 21 (1% 95CI).
Setting the
I just want to confirm that this problem persists with Lucene 9.10. @ChrisHegarty, is this what you expect? I'm preparing for an upcoming release, but given what I saw with 9.10, I am now wondering whether I should be turning the flag off by default. My (strong) preference is to have Lucene run "as-is". But I'd also rather not ship the suite with a known major performance regression (I regret that I did not see this when I made the recent release). @shipilev, do you have thoughts on this?
It sounds to me like there is a problem with Lucene and JDK 21, which DaCapo correctly caught. (Pats the benchmark on its back.) And JDK 21 is a de facto LTS release, which will be around for years to come. So I think this problem should be fixed in Lucene, and then DaCapo should update to the new version of Lucene. If there is no Lucene version that solves the issue with JDK 21, DaCapo should not be forced to do anything to work around performance bugs.
We run Lucene 9.10 with JDK 21 in many scenarios WITHOUT issue. Whatever it is that you're encountering, it is not an obvious or known issue with Lucene 9.10 and JDK 21. Can you please try to determine where the performance difference is, and why?
I can confirm that the issue appears to have been introduced in 9.7. It is not evident in 9.6 or earlier. I'll look into this further tomorrow.
This seems to be the culprit: apache/lucene#12294. The next question is why we're seeing the issue and you're not.
A quick look at a perf profile shows the following:

9.6

9.7
So the link to MemorySegments is clear. This is with Corretto-21.0.0.35.1 (build 21+35-LTS). I'm running DaCapo lusearch with arguments that run 10M simple queries in a single worker thread ("Query0").
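(For context, a typical DaCapo harness invocation looks like the sketch below; the `-t` thread-count flag is my assumption about the harness options:)

```
java -jar dacapo-23.11-chopin.jar lusearch -t 1
```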
Running with

So I think we learnt the following:
I wonder whether this fixes the issue on newer JDKs: openjdk/jdk@1594653
Thanks @caizixian and @ChrisHegarty, this seems fairly clear now. In short, enabling memory segments does not play nicely with JDK 21 (as @shipilev said), and we now have a profile that shows why this is so. As @caizixian says, from all the testing we did, you will see this if you have an intense workload with a fair amount of parallelism (I saw a 2X slowdown with 24 threads, but none when running single-threaded). I am not sure why this did not show up in your testing. Given what we know about the pathologies of JDK 21 and the data here, I wonder if it would be better for
I think that the results of

Can you please check with the most recent JDK, say JDK 21.0.2 and/or JDK 22.0.1? Preferably, or as well as, the Oracle GPL binaries, e.g. https://jdk.java.net/21/, https://jdk.java.net/22/
We were running with

The performance difference is clearly visible when running with default parameters, and the relevant symbols can be seen using

We were running 21.0.2.
I just tested with the Oracle commercial binary of JDK 21 and it gave the same performance result, so it's not an OpenJDK/Temurin-specific problem.
Lucene 9.7 + JDK 22 doesn't produce the performance pathology, because the MemorySegment IndexInput is guarded for Java 19-21: https://github.com/apache/lucene/tree/releases/lucene/9.7.0/lucene/core/src/java21/org/apache/lucene
We can reproduce the performance pathology using Lucene 9.10 + JDK 22, where the MemorySegment IndexInput is enabled for JDK 22 as well, per apache/lucene#12706.
Here are the times in msec for five JDKs I happen to have at hand:
Run as:

As @caizixian mentioned above, the problem manifests in 22 also, which I had not noticed. Here's the same analysis, now using just Lucene 9.10 and toggling
Run as:
@caizixian only used

My profile results in the post above @caizixian's pointed to
For awareness, I filed the following Lucene issue to help track and investigate this: apache/lucene#13325
+1, thank you for running these benchmarks -- it's awesome and vital when they catch otherwise-missed performance regressions. It's spooky this regression made it this far without detection! It's like a neutrino. We need to improve our detectors. We need to understand why Lucene's own nightly benchmarks failed to detect this too ... I've opened mikemccand/luceneutil#267 to get to the bottom of that.
Hi, after studying the benchmark I think I found the issue: it looks like wrong use of Lucene APIs! What happens:
The problem now is the following: because the indexes are closed all the time, the JVM goes into a safepoint to do thread-local handshakes; all other threads need to stop (due to the safepoint) to ensure they do not access the memory being unmapped. I think the whole benchmark should be fixed to open only a single IndexSearcher and execute all threads in parallel on this single instance, which is closed at the end.

The usage of Lucene as done in the benchmark is not real-life. This is why Lucene does not see the issue in its own benchmarks. The slowdown comes from the fact that MemorySegments allow safely unmapping mmapped files, at the cost of extra safepoints on the IndexReader#close call. This means the heavily used methods may stop in safepoints and deoptimize all the time. @shipilev may explain this better: when you close an IndexReader in Lucene, we close all shared MemorySegments, causing a thread-local handshake.
You gave the correct hint here. This won't fix it; the problem will still exist. Don't close the IndexReader all the time, as it causes a stop of all (possibly unrelated) threads.
Can you rewrite the benchmark code to open the IndexReader/IndexSearcher in the main thread once, then spawn all threads, and then close the index at the end before the benchmark exits (see the sketch below)? If this restores perf, then it's verified that the thread-local handshakes are the problem for this code. Anyway: the old ByteBuffer code is risky, as it may crash the JVM with SIGSEGV because we use sun.misc.Unsafe to unmap all ByteBuffers. If there are threads executing queries at the same time, they will SIGSEGV. So with the additional safety we buy a more expensive close(), but that's the only way to go, as it is the only viable future for Lucene's MMapDirectory. The Lucene main branch (Java 21+) no longer uses ByteBuffers and switched to MemorySegment completely - for exactly that reason. In addition, if you don't close indexes all the time, the code is faster (seen in Elasticsearch benchmarks). It also has nothing to do with preview features. The slowdown is by design.
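A minimal sketch of that restructuring, under stated assumptions: the index path, thread count, and the runQueries worker below are illustrative placeholders, not the actual DaCapo code.

```java
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;

public class SharedSearcherSketch {

    public static void main(String[] args) throws Exception {
        // Open directory, reader, and searcher ONCE, in the main thread.
        try (Directory dir = new MMapDirectory(Paths.get("index"));
             DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);

            // Share the single IndexSearcher across all worker threads.
            List<Thread> workers = new ArrayList<>();
            for (int i = 0; i < 24; i++) {
                Thread t = new Thread(() -> runQueries(searcher));
                workers.add(t);
                t.start();
            }
            for (Thread t : workers) {
                t.join();
            }
        } // Reader (and its MemorySegments) closed exactly once, after all workers finish.
    }

    // Hypothetical worker: run the benchmark's queries against the shared searcher.
    private static void runQueries(IndexSearcher searcher) {
        // ... searcher.search(query, n) per query ...
    }
}
```

With this shape, the expensive thread-local handshake in close() happens once at shutdown instead of once per batch while other threads are still querying.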
See also the talk by @mcimadamore at FOSDEM 2024, slide 13 ("Arena-based memory management"):
Also: https://github.com/openjdk/panama-foreign/blob/foreign-memaccess%2Babi/doc/panama_memaccess.md
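To make the mechanism concrete, here is a self-contained sketch (my illustration, not Lucene code; needs JDK 22+, where the FFM API is final - it was preview in 19-21): closing a shared Arena must prove that no other thread is mid-access to its segments, and that proof is the thread-local handshake being discussed.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class SharedArenaClose {

    public static void main(String[] args) throws Exception {
        Arena arena = Arena.ofShared();               // segments accessible from any thread
        MemorySegment segment = arena.allocate(1024); // stand-in for an mmapped index file

        Thread reader = new Thread(() -> {
            try {
                while (true) {
                    // Safe access: after arena.close() this throws
                    // IllegalStateException instead of crashing the JVM with SIGSEGV.
                    segment.get(ValueLayout.JAVA_BYTE, 0);
                }
            } catch (IllegalStateException expected) {
                // The arena was closed concurrently; the access failed cleanly.
            }
        });
        reader.start();

        Thread.sleep(10);
        // Closing a shared arena triggers a thread-local handshake with all
        // threads, so no thread can be caught mid-access to the freed memory.
        // This is what makes close() expensive for the rest of the application.
        arena.close();
        reader.join();
    }
}
```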
I hope this helps with understanding. You have to add to that statement that the handshake not only makes the close slower, it also affects other threads using MemorySegments due to deoptimization (as far as I remember).
Thanks for that. That code is old. AFAIK when we originally wrote the workload, it was based on a Lucene performance benchmark. I will investigate re-writing to address this pathology. |
I made the minimal change suggested by @uschindler of lifting the construction of the IndexReader and IndexSearcher outside the loop nest (so a single instance, shared across all threads, across all batches). As he suggested, this completely addresses the problem. We see a substantial performance win, and the sensitivity to

Here's a quick and rough performance analysis:
Thanks @uschindler for identifying the cause of the problem, and to @shipilev for spotting the issue in the first place.
Thanks for the confirmation. I am still wondering why this is so dramatic. Maybe the batches are very short, so the open/close is happening all the time. In addition, if the batches execute at different speeds, at the end you always have one IndexReader that is being closed. Anyway, we should talk to the HotSpot people to figure out how we can improve the safepoint/thread-local handshake so that it does not get so expensive for other threads. To me it looks like on every close, all methods accessing MemorySegments get deoptimized, but I haven't looked into this.
Deoptimization indeed seems to be the problem. Running JDK 21 with compilation logging enabled shows entries like:
<deoptimized thread='664511' reason='constraint' pc='0x00007f4f0857bff4' compile_id='2446' compiler='c2' level='4'>
<jvms bci='245' method='org.apache.lucene.util.compress.LZ4 decompress (Lorg/apache/lucene/store/DataInput;I[BI)I' bytes='250' count='1537' backedge_count='180079' iicount='1537'/>
</deoptimized>
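For reference, `<deoptimized>` events in this XML shape come from HotSpot's compilation log; a typical way to capture them (the exact flags here are my assumption) is:

```
java -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation -jar dacapo-23.11-chopin.jar lusearch
```

By default, LogCompilation writes to a hotspot_pid<pid>.log file in the working directory (redirectable with -XX:LogFile=...).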
The innermost frames would be more interesting. |
Anyway, to conclude: there's room for improvement for the HotSpot team to make thread-local handshakes cheaper, so this "badly designed Lucene benchmark" actually helped to identify the issues with thread-local handshakes. Elasticsearch and Solr of course need to close IndexReaders from time to time in NRT use cases, and they have benchmarks for this. The benchmark setup here was just a good "code example" showing the issue. @shipilev -> your turn! If we get the innermost frames of the deoptimizations (inside the MemorySegment var-handle code), we can open issues in OpenJDK. I think the main problem is this: the thread-local handshakes trigger safepoints in all affected threads, and a side effect of those safepoints seems to be deoptimization. But the deoptimization is not actually needed; maybe this can be prevented in HotSpot code: after the handshake, the code could reuse the already-optimized code!?
One question: what CPU architecture did you benchmark on? Cheap thread-local handshakes are only available on x86-64 and SPARC (see JEP 312); on all other platforms it falls back to safepoints. So it would be good to know which platform.
Looks like AArch64 also has thread-local handshakes: https://bugs.openjdk.org/browse/JDK-8189596
@shipilev used aarch64 (Graviton 3) and I used x86_64 (Ryzen 9 7950X).