New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LuceneSailConnection is not threadsafe #301
Comments
Are you using a build with the changes from #192 ? |
@pulquero good catch, he's probably not. This is my mistake: I have neglected to merge the release branch back into master. I'll take care of it ASAP. |
Merge done. Try again. |
I will merge master back into that branch today and see if it is replicable still. If it is still occurring for me (and hudson) it is possible that it is either a bug with my changes, or an existing bug in the lucene/sail code that is being triggered more reliably with my changes that more proactively close iterators which may be releasing resources that were previously being locked up for slightly longer. |
After merging master and fixing the trix pom.xml, I am (possibly also as the issue above is intermittent) receiving another error, should I open a separate issue or coopt this issue for it:
|
The current master (5f4d262) still has the reader closed issue on Hudson: https://hudson.eclipse.org/rdf4j/job/rdf4j-master-verify/52/console It may also have dead/live-locked ( @jeenbroekstra may need to shut the build off manually). I put in a fix for the testsuite locking up in my unrelated pull request where I was having trouble testing because of it. Feel free to cherry-pick the following commits (sorry I don't have a squashed version available right now) to master/etc. to make sure the testsuite doesn't deadlock in this case: |
Thanks Peter, I put up a PR that cherrypicks those fixes for merge into master. |
The elasticsearch error is unrelated. I can't reproduce it. Every so often I can reproduce a red health cluster, but that does not lead to any stack trace. You could try setting the WAIT_FOR_NODES_KEY, WAIT_FOR_ACTIVE_SHARDS_KEY search index properties to see if that solves it for you. |
Merged the concurrency fix. I'm not seeing any further problems, so barring info to the contrary I think we can resolve this issue? |
I will run some more builds locally with this patch today to see if it still triggers intermittently for me. |
@ansell be aware that this isn't merged to master yet, only the release branch. Will fix ASAP. |
Okay, I will run some tests on the release branch for now until it hits master. |
Just merged it because why the hell not :) |
Testing this on my issue branch (after rebasing and squashing it down to a single commit) showed the error on the first attempt. The commit I tested it on was: |
I can also intermittently reproduce the bug using the Eclipse JUnit runner, so it isn't related to maven/surefire. |
If I understand the problem correctly, the core of the issue is that when a (update removed fix suggestion because I looked a bit closer and realized I was wrong) |
I think AbstractReaderMonitor has a flaw in the concurrency logic: it uses both an AtomicBoolean and a reference counter, without enforcing synchronizing their reading. E.g. line 55:
It's entirely possible for readingCount to be re-incremented by the time the second part of the I'm also slightly confused by the fact that it has two AtomicBooleans, one called |
The more I look at it, the stranger I find the construction. |
The way Lucene internally manages reference counts is also fundamentally broken and a bit scary to look through. Instead of calling AtomicInteger.incrementAndGet like usual, they use the following obtuse method in IndexReader that is prone to decrements happening concurrently. Seems strange to do it that way if you just have just signalled that you want to add to the reference count, but it is fairly likely to get two threads to not work concurrently if one completes too quickly and decrements the reference count to zero before the second thread is able to start:
|
issue#301 : Work on tightening the contract for lucene sail components
Fixes merged into master and the 2.0 release branch. Can we consider this resolved now? The fact that you no longer get the issue in 30+ runs seems pretty solid to me. |
Sure, we can reopen or open a follow up if/when it pops up again. To clarify that was 30 runs just of LuceneSailTest, where the issue was occurring, inside of Eclipse JUnit runner. I didn't run the maven build 30 times and it was only on my 2 CPU VirtualBox virtual machine. But it was occurring consistently for me in the 2 CPU virtual machine previously, so in that way it was a valid test host. |
The LuceneSail sends the same LuceneIndex to each LuceneSailConnection created from it. Each of the connections are implemented to call LuceneIndex.endReading when they are closed, which fails sporadically with the following (or similar) stacktrace:
The failure is not deterministic for whatever reason.
In general RepositoryConnection is not specified to be threadsafe, but closing a repository connection should not affect other connections before the entire Repository/Sail is shutdown. In this test the shutdown should not happen before the test completes. The bug is reproducible fairly consistently for me, happening in at least 1 in 20 builds with a change to randomly vary the number of insertions for each of the threads so they don't all complete at a similar time consistently.
The text was updated successfully, but these errors were encountered: