JDK 21, lusearch, and Lucene "regression" #264

Wanted to report a "problem" with the current lusearch benchmark in 23.11-chopin, which so far looks like an issue in Lucene's support for JDK preview features. In short, running with different JDKs shows that lusearch is substantially slower with JDK 21.

Running self-built JDKs on a Graviton 3 instance with:

...yields these results:

Bisection shows that the "regression" starts in JDK 19 with 8282191: Implementation of Foreign Function & Memory API (Preview). We cannot run JDK 19 prior to that changeset: Lucene fails with NCDFE trying to access MemorySegment. We suspect that Lucene is opting in to some FFM preview features, which are slower than what it used before. Mainline JDK is fast again, but only after the JDK starts to identify itself as JDK 22: openjdk/jdk@5a706fb. I suspect that the Lucene that ships with 23.11-chopin actually wants JDK 22 to perform well, and the intermediate version that opts in to JDK 19 preview features is slower on this test.

@ChrisHegarty, is there something we can do here? Maybe there is a feature flag that switches what Lucene is using? Maybe DaCapo should update Lucene to some other version?

Comments
What version of Lucene are you using? It seems that the MemorySegment-backed mmap directory is being used by default. You can disable it by setting the following system property:
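(For what it's worth: in Lucene 9.x the switch is, to my knowledge, the org.apache.lucene.store.MMapDirectory.enableMemorySegments system property; treat the exact name as an assumption and check your version's MMapDirectory javadocs. For example:)

```
# assumption: property name per Lucene 9.x MMapDirectory; verify against your version
java -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false -jar dacapo-23.11-chopin.jar lusearch
```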
I'm surprised that using the MemorySegment-backed mmap directory causes a perf regression. This is not what we see in our tests, so it's likely that there's something else going on here too.
We're using 9.7.0 in the Chopin release.
Yes, I can confirm that we can see the same slowdown in our setup: 2845 ms in Temurin 11 (0.74% 95CI), 2825 ms in Temurin 17 (0.74% 95CI), and 3238 ms in Temurin 21 (1% 95CI).
Setting the
I just want to confirm that this problem persists with Lucene 9.10. @ChrisHegarty, is this what you expect? I'm preparing for an upcoming release, but given what I saw with 9.10, I am now wondering whether I should be turning the flag off by default. My (strong) preference is to have Lucene run "as-is". But I'd also rather not ship the suite with a known major performance regression (I regret that I did not see this when I made the recent release). @shipilev, do you have thoughts on this?
It sounds to me like there is a problem with Lucene and JDK 21, which DaCapo correctly caught. (Pats the benchmark on its back.) And JDK 21 is a de facto LTS release, which will be around for years to come. So I think this problem should be fixed in Lucene, and then DaCapo should update to the new version of Lucene. If there is no Lucene version that solves the issue with JDK 21, DaCapo should not be forced to do anything to work around performance bugs.
We run Lucene 9.10 with JDK 21 in many scenarios WITHOUT issue. Whatever it is that you're encountering, it is not an obvious or known issue with Lucene 9.10 and JDK 21. Can you please try to determine where the performance difference is, and why?
I can confirm that the issue appears to have been introduced in 9.7. It is not evident in 9.6 or earlier. I'll look into this further tomorrow.
This seems to be the culprit: apache/lucene#12294. The next question is why we're seeing the issue and you're not.
A quick look at a perf profile shows the following:

9.6

9.7
So the link to MemorySegments is clear. This is with Corretto-21.0.0.35.1 (build 21+35-LTS). I'm running DaCapo lusearch with arguments that run 10M simple queries in a single worker thread ("Query0").
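(For context, a typical DaCapo harness invocation looks like the sketch below; the `-t` thread-count flag is my assumption about the harness options:)

```
java -jar dacapo-23.11-chopin.jar lusearch -t 1
```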
Running with

So I think we learnt the following:
I wonder whether this fixes the issue on newer JDKs: openjdk/jdk@1594653
Thanks @caizixian and @ChrisHegarty, this seems fairly clear now. In short, enabling memory segments does not play nicely with JDK 21 (as @shipilev said), and we now have a profile that shows why this is so. As @caizixian says, from all the testing we did, you will see this if you have an intense workload with a fair amount of parallelism (I saw a 2X slowdown with 24 threads, but none when running single-threaded). I am not sure why this did not show up in your testing. Given what we know about the pathologies of JDK 21 and the data here, I wonder if it would be better for
I think that the results of

Can you please check with the most recent JDK, say JDK 21.0.2 and/or JDK 22.0.1? Preferably, or as well as, the Oracle GPL binaries, e.g. https://jdk.java.net/21/, https://jdk.java.net/22/
We were running with

The performance difference is clearly visible when running with default parameters, and the relevant symbols can be seen using

We were running 21.0.2.
I just tested with the Oracle commercial binary of JDK 21 and it gave the same performance result, so it's not an OpenJDK/Temurin-specific problem.
Lucene 9.7 + JDK 22 doesn't produce the performance pathology, because the MemorySegment IndexInput is guarded for Java 19-21: https://github.com/apache/lucene/tree/releases/lucene/9.7.0/lucene/core/src/java21/org/apache/lucene
We can reproduce the performance pathology using Lucene 9.10 + JDK 22, where the MemorySegment IndexInput is enabled for JDK 22 as well, per apache/lucene#12706.
Here are the times in msec for five JDKs I happen to have at hand:
Run as:

As @caizixian mentioned above, the problem manifests in 22 also, which I had not noticed. Here's the same analysis, now using just Lucene 9.10 and toggling
Run as:
@caizixian only used

My profile results in the post above @caizixian's pointed to
For awareness, I filed the following Lucene issue to help track and investigate this: apache/lucene#13325
+1, thank you for running these benchmarks -- it's awesome and vital when they catch otherwise-missed performance regressions. It's spooky this regression made it this far without detection! It's like a neutrino. We need to improve our detectors. We need to understand why Lucene's own nightly benchmarks failed to detect this too ... I've opened mikemccand/luceneutil#267 to get to the bottom of that.
Hi, after studying the benchmark I think I found the issue: it looks like wrong use of Lucene APIs! What happens:
The problem now is the following: because the indexes are closed all the time, the JVM goes into a safepoint to do thread-local handshakes; all other threads need to stop (due to the safepoint) to ensure they do not access the memory being unmapped. I think the whole benchmark should be fixed to open only a single IndexSearcher and execute all threads in parallel on this single instance, which is closed at the end.

The usage of Lucene as done in the benchmark is not real-life. This is why Lucene does not see the issue in its own benchmarks. The slowdown comes from the fact that MemorySegments allow safely unmapping mmapped files, at the cost of extra safepoints on the IndexReader#close call. This means the heavily used methods may stop in safepoints and deoptimize all the time. @shipilev may explain this better: when you close an IndexReader in Lucene, we close all shared MemorySegments, causing a thread-local handshake.
You gave the correct hint here. This won't fix it; the problem will still exist. Don't close the IndexReader all the time, as it causes a stop of all (possibly unrelated) threads.
Can you rewrite the benchmark code to open the IndexReader/IndexSearcher in the main thread once, then spawn all threads, and then close the index at the end before the benchmark exits (see the sketch below)? If this restores perf, then it's verified that the thread-local handshakes are the problem for this code. Anyway: the old ByteBuffer code is risky, as it may crash the JVM with SIGSEGV because we use sun.misc.Unsafe to unmap all ByteBuffers. If there are threads executing queries at the same time, they will SIGSEGV. So with the additional safety we buy a more expensive close(), but that's the only way to go, as it is the only viable future for Lucene's MMapDirectory. The Lucene main branch (Java 21+) no longer uses ByteBuffers and switched to MemorySegment completely - for exactly that reason. In addition, if you don't close indexes all the time, the code is faster (seen in Elasticsearch benchmarks). It also has nothing to do with preview features. The slowdown is by design.
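A minimal sketch of that restructuring, under stated assumptions: the index path, thread count, and the runQueries worker below are illustrative placeholders, not the actual DaCapo code.

```java
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;

public class SharedSearcherSketch {

    public static void main(String[] args) throws Exception {
        // Open directory, reader, and searcher ONCE, in the main thread.
        try (Directory dir = new MMapDirectory(Paths.get("index"));
             DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);

            // Share the single IndexSearcher across all worker threads.
            List<Thread> workers = new ArrayList<>();
            for (int i = 0; i < 24; i++) {
                Thread t = new Thread(() -> runQueries(searcher));
                workers.add(t);
                t.start();
            }
            for (Thread t : workers) {
                t.join();
            }
        } // Reader (and its MemorySegments) closed exactly once, after all workers finish.
    }

    // Hypothetical worker: run the benchmark's queries against the shared searcher.
    private static void runQueries(IndexSearcher searcher) {
        // ... searcher.search(query, n) per query ...
    }
}
```

With this shape, the expensive thread-local handshake in close() happens once at shutdown instead of once per batch while other threads are still querying.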
See also the talk by @mcimadamore at FOSDEM 2024, slide 13 ("Arena-based memory management"):
Also: https://github.com/openjdk/panama-foreign/blob/foreign-memaccess%2Babi/doc/panama_memaccess.md
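To make the mechanism concrete, here is a self-contained sketch (my illustration, not Lucene code; needs JDK 22+, where the FFM API is final - it was preview in 19-21): closing a shared Arena must prove that no other thread is mid-access to its segments, and that proof is the thread-local handshake being discussed.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class SharedArenaClose {

    public static void main(String[] args) throws Exception {
        Arena arena = Arena.ofShared();               // segments accessible from any thread
        MemorySegment segment = arena.allocate(1024); // stand-in for an mmapped index file

        Thread reader = new Thread(() -> {
            try {
                while (true) {
                    // Safe access: after arena.close() this throws
                    // IllegalStateException instead of crashing the JVM with SIGSEGV.
                    segment.get(ValueLayout.JAVA_BYTE, 0);
                }
            } catch (IllegalStateException expected) {
                // The arena was closed concurrently; the access failed cleanly.
            }
        });
        reader.start();

        Thread.sleep(10);
        // Closing a shared arena triggers a thread-local handshake with all
        // threads, so no thread can be caught mid-access to the freed memory.
        // This is what makes close() expensive for the rest of the application.
        arena.close();
        reader.join();
    }
}
```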
I hope this helps with understanding. You have to add to that statement that the handshake not only makes the close slower, it also affects other threads using MemorySegments due to deoptimization (as far as I remember).
Thanks for that. That code is old. AFAIK when we originally wrote the workload, it was based on a Lucene performance benchmark. I will investigate re-writing to address this pathology. |
I made the minimal change suggested by @uschindler of lifting the construction of the IndexReader and IndexSearcher outside the loop nest (so a single instance, shared across all threads, across all batches). As he suggested, this completely addresses the problem. We see a substantial performance win, and the sensitivity to

Here's a quick and rough performance analysis:
Thanks @uschindler for identifying the cause of the problem, and to @shipilev for spotting the issue in the first place.
Thanks for the confirmation. I am still wondering why this is so dramatic. Maybe the batches are very short, so the open/close is happening all the time. In addition, if the batches execute at different speeds, at the end you always have one IndexReader that is being closed. Anyway, we should talk to the HotSpot people to figure out how we can improve the safepoint/thread-local handshake so that it does not get so expensive for other threads. To me it looks like on every close, all methods accessing MemorySegments get deoptimized, but I haven't looked into this.
Deoptimization indeed seems to be the problem. Running JDK 21 with compilation logging enabled shows entries like:
<deoptimized thread='664511' reason='constraint' pc='0x00007f4f0857bff4' compile_id='2446' compiler='c2' level='4'>
<jvms bci='245' method='org.apache.lucene.util.compress.LZ4 decompress (Lorg/apache/lucene/store/DataInput;I[BI)I' bytes='250' count='1537' backedge_count='180079' iicount='1537'/>
</deoptimized>
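For reference, `<deoptimized>` events in this XML shape come from HotSpot's compilation log; a typical way to capture them (the exact flags here are my assumption) is:

```
java -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation -jar dacapo-23.11-chopin.jar lusearch
```

By default, LogCompilation writes to a hotspot_pid<pid>.log file in the working directory (redirectable with -XX:LogFile=...).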
The innermost frames would be more interesting. |
Anyway, to conclude: there's room for improvement for the HotSpot team to make thread-local handshakes cheaper, so this "badly designed Lucene benchmark" actually helped to identify the issues with thread-local handshakes. Elasticsearch and Solr of course need to close IndexReaders from time to time in NRT use cases, and they have benchmarks for this. The benchmark setup here was just a good "code example" showing the issue. @shipilev -> your turn! If we get the innermost frames of the deoptimizations (inside the MemorySegment var-handle code), we can open issues in OpenJDK. I think the main problem is this: the thread-local handshakes trigger safepoints in all affected threads, and a side effect of those safepoints seems to be deoptimization. But the deoptimization is not actually needed; maybe this can be prevented in HotSpot code: after the handshake, the code could reuse the already-optimized code!?
One question: what CPU architecture did you benchmark on? Cheap thread-local handshakes are only available on x86-64 and SPARC (see JEP 312); on all other platforms it falls back to safepoints. So it would be good to know which platform.
Looks like AArch64 also has thread-local handshakes: https://bugs.openjdk.org/browse/JDK-8189596
@shipilev used aarch64 (Graviton 3) and I used x86_64 (Ryzen 9 7950X).