
LuceneSailConnection is not threadsafe #301

Closed
ansell opened this issue Aug 30, 2016 · 21 comments

Labels
🐞 bug issue is a bug

Milestone

Comments
@ansell
Contributor

ansell commented Aug 30, 2016

The LuceneSail passes the same LuceneIndex to each LuceneSailConnection created from it. Each connection is implemented to call LuceneIndex.endReading when it is closed, which fails sporadically with the following (or similar) stack trace:

Exception in thread "Thread-1" java.lang.AssertionError: org.eclipse.rdf4j.repository.RepositoryException: org.eclipse.rdf4j.sail.SailException: org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed
    at org.eclipse.rdf4j.sail.lucene.AbstractLuceneSailTest$1.run(AbstractLuceneSailTest.java:1069)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.eclipse.rdf4j.repository.RepositoryException: org.eclipse.rdf4j.sail.SailException: org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed
    at org.eclipse.rdf4j.repository.sail.SailRepositoryConnection.commit(SailRepositoryConnection.java:172)
    at org.eclipse.rdf4j.repository.base.AbstractRepositoryConnection.conditionalCommit(AbstractRepositoryConnection.java:366)
    at org.eclipse.rdf4j.repository.base.AbstractRepositoryConnection.add(AbstractRepositoryConnection.java:491)
    at org.eclipse.rdf4j.sail.lucene.AbstractLuceneSailTest$1.run(AbstractLuceneSailTest.java:1063)
    ... 1 more
Caused by: org.eclipse.rdf4j.sail.SailException: org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed
    at org.eclipse.rdf4j.sail.lucene.LuceneSailConnection.commit(LuceneSailConnection.java:237)
    at org.eclipse.rdf4j.repository.sail.SailRepositoryConnection.commit(SailRepositoryConnection.java:169)
    ... 4 more
Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed
    at org.apache.lucene.index.IndexReader.ensureOpen(IndexReader.java:274)
    at org.apache.lucene.index.SegmentReader.getPostingsReader(SegmentReader.java:227)
    at org.apache.lucene.index.CodecReader.fields(CodecReader.java:102)
    at org.apache.lucene.index.LeafReader.terms(LeafReader.java:207)
    at org.apache.lucene.index.LeafReader.termDocsEnum(LeafReader.java:318)
    at org.eclipse.rdf4j.sail.lucene.LuceneIndex.addDocuments(LuceneIndex.java:461)
    at org.eclipse.rdf4j.sail.lucene.LuceneIndex.getDocuments(LuceneIndex.java:452)
    at org.eclipse.rdf4j.sail.lucene.LuceneIndex.getDocuments(LuceneIndex.java:338)
    at org.eclipse.rdf4j.sail.lucene.AbstractSearchIndex.addRemoveStatements(AbstractSearchIndex.java:237)
    at org.eclipse.rdf4j.sail.lucene.LuceneSailConnection.addRemoveStatements(LuceneSailConnection.java:251)
    at org.eclipse.rdf4j.sail.lucene.LuceneSailConnection.commit(LuceneSailConnection.java:217)
    ... 5 more
org.eclipse.rdf4j.repository.RepositoryException: org.eclipse.rdf4j.sail.SailException: org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed
    at org.eclipse.rdf4j.repository.sail.SailRepositoryConnection.commit(SailRepositoryConnection.java:172)
    at org.eclipse.rdf4j.repository.base.AbstractRepositoryConnection.conditionalCommit(AbstractRepositoryConnection.java:366)
    at org.eclipse.rdf4j.repository.base.AbstractRepositoryConnection.add(AbstractRepositoryConnection.java:491)
    at org.eclipse.rdf4j.sail.lucene.AbstractLuceneSailTest$1.run(AbstractLuceneSailTest.java:1063)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.eclipse.rdf4j.sail.SailException: org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed
    at org.eclipse.rdf4j.sail.lucene.LuceneSailConnection.commit(LuceneSailConnection.java:237)
    at org.eclipse.rdf4j.repository.sail.SailRepositoryConnection.commit(SailRepositoryConnection.java:169)
    ... 4 more
Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed
    at org.apache.lucene.index.IndexReader.ensureOpen(IndexReader.java:274)
    at org.apache.lucene.index.SegmentReader.getPostingsReader(SegmentReader.java:227)
    at org.apache.lucene.index.CodecReader.fields(CodecReader.java:102)
    at org.apache.lucene.index.LeafReader.terms(LeafReader.java:207)
    at org.apache.lucene.index.LeafReader.termDocsEnum(LeafReader.java:318)
    at org.eclipse.rdf4j.sail.lucene.LuceneIndex.addDocuments(LuceneIndex.java:461)
    at org.eclipse.rdf4j.sail.lucene.LuceneIndex.getDocuments(LuceneIndex.java:452)
    at org.eclipse.rdf4j.sail.lucene.LuceneIndex.getDocuments(LuceneIndex.java:338)
    at org.eclipse.rdf4j.sail.lucene.AbstractSearchIndex.addRemoveStatements(AbstractSearchIndex.java:237)
    at org.eclipse.rdf4j.sail.lucene.LuceneSailConnection.addRemoveStatements(LuceneSailConnection.java:251)
    at org.eclipse.rdf4j.sail.lucene.LuceneSailConnection.commit(LuceneSailConnection.java:217)
    ... 5 more

The failure is not deterministic for whatever reason.

In general RepositoryConnection is not specified to be threadsafe, but closing one repository connection should not affect other connections before the entire Repository/Sail is shut down, and in this test the shutdown does not happen until the test completes. The bug is reproducible fairly consistently for me, occurring in at least 1 in 20 builds once the test is changed to randomly vary the number of insertions per thread so the threads don't all complete at about the same time.
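
For illustration only, a hypothetical minimal sketch of the scenario (not the actual AbstractLuceneSailTest code; the class name, resource IRIs, and insert counts are made up, and an in-memory Lucene index via LuceneSail.LUCENE_RAMDIR_KEY is assumed): several threads share one repository backed by a LuceneSail, each opens its own connection, adds a varying number of statements, and closes.

    import java.util.concurrent.ThreadLocalRandom;

    import org.eclipse.rdf4j.model.ValueFactory;
    import org.eclipse.rdf4j.repository.sail.SailRepository;
    import org.eclipse.rdf4j.repository.sail.SailRepositoryConnection;
    import org.eclipse.rdf4j.sail.lucene.LuceneSail;
    import org.eclipse.rdf4j.sail.memory.MemoryStore;

    public class LuceneSailConcurrencySketch {

        public static void main(String[] args) throws InterruptedException {
            LuceneSail luceneSail = new LuceneSail();
            luceneSail.setBaseSail(new MemoryStore());
            // assumption: keep the Lucene index in memory for the sketch
            luceneSail.setParameter(LuceneSail.LUCENE_RAMDIR_KEY, "true");

            SailRepository repository = new SailRepository(luceneSail);
            repository.initialize();
            ValueFactory vf = repository.getValueFactory();

            Runnable writer = () -> {
                // vary the number of insertions so the connections close at different times
                int inserts = ThreadLocalRandom.current().nextInt(1, 50);
                try (SailRepositoryConnection conn = repository.getConnection()) {
                    for (int i = 0; i < inserts; i++) {
                        conn.add(vf.createIRI("urn:test:subject" + i),
                                vf.createIRI("urn:test:predicate"),
                                vf.createLiteral("some indexed text " + i));
                    }
                } // closing one connection must not invalidate the LuceneIndex shared with the other
            };

            Thread t1 = new Thread(writer);
            Thread t2 = new Thread(writer);
            t1.start();
            t2.start();
            t1.join();
            t2.join();

            repository.shutDown();
        }
    }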

@pulquero

Are you using a build with the changes from #192?

@abrokenjester
Contributor

@pulquero good catch, he's probably not. This is my mistake: I have neglected to merge the release branch back into master. I'll take care of it ASAP.

@abrokenjester
Contributor

abrokenjester commented Aug 30, 2016

Merge done. Try again.

@ansell
Contributor Author

ansell commented Aug 30, 2016

I will merge master back into that branch today and see whether it is still replicable. If it still occurs for me (and on Hudson), it is possibly either a bug in my changes, or an existing bug in the Lucene sail code that is triggered more reliably by my changes, which close iterators more proactively and may be releasing resources that were previously held slightly longer.

@ansell
Contributor Author

ansell commented Aug 31, 2016

After merging master and fixing the trix pom.xml, I am now receiving another error (possibly in addition to the intermittent issue above). Should I open a separate issue or co-opt this one for it:

[main] INFO org.elasticsearch.node - [Asp] version[1.7.0], pid[5571], build[929b973/2015-07-16T14:31:07Z]
[main] INFO org.elasticsearch.node - [Asp] initializing ...
[main] INFO org.elasticsearch.plugins - [Asp] loaded [], sites []
[main] INFO org.elasticsearch.env - [Asp] using [1] data paths, mounts [[/ (/dev/sda1)]], net usable_space [69.3gb], net total_space [90.4gb], types [ext4]
[main] INFO org.elasticsearch.node - [Asp] initialized
[main] INFO org.elasticsearch.node - [Asp] starting ...
[main] INFO org.elasticsearch.transport - [Asp] bound_address {local[5]}, publish_address {local[5]}
[main] INFO org.elasticsearch.discovery - [Asp] sail_test/YFAyJLDCTkS4ZVgok1p2mA
[elasticsearch[Asp][clusterService#updateTask][T#1]] INFO org.elasticsearch.cluster.service - [Asp] master {new [Asp][YFAyJLDCTkS4ZVgok1p2mA][mint-Shakespeare][local[5]]{local=true}}, removed {[Asp][MC__PpqxQCm9mqm6_z-hwA][mint-Shakespeare][local[5]]{local=true},}, reason: local-disco-initial_connect(master)
[elasticsearch[Asp][clusterService#updateTask][T#1]] INFO org.elasticsearch.gateway - [Asp] recovered [0] indices into cluster_state
[main] INFO org.elasticsearch.http - [Asp] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.0.2.15:9200]}
[main] INFO org.elasticsearch.node - [Asp] started
[elasticsearch[Asp][clusterService#updateTask][T#1]] INFO org.elasticsearch.cluster.metadata - [Asp] [elastic-search-sail] creating index, cause [api], templates [], shards [1]/[0], mappings []
[elasticsearch[Asp][clusterService#updateTask][T#1]] INFO org.elasticsearch.cluster.metadata - [Asp] [elastic-search-sail] create_mapping [resource]
[main] INFO org.eclipse.rdf4j.sail.elasticsearch.ElasticsearchIndex - Field mappings:
{_all={enabled=false}, properties={_geopoint_http://www.opengis.net/ont/geosparql#asWKT={type=geo_point}, context={type=string, index=not_analyzed}, text={type=string}, uri={type=string, index=not_analyzed}}}
[main] INFO org.eclipse.rdf4j.sail.elasticsearch.ElasticsearchIndex - Cluster health: RED
[main] INFO org.eclipse.rdf4j.sail.elasticsearch.ElasticsearchIndex - Cluster nodes: 1 (data 1)
[main] INFO org.eclipse.rdf4j.sail.elasticsearch.ElasticsearchIndex - Index health: RED
[main] INFO org.eclipse.rdf4j.sail.elasticsearch.ElasticsearchIndex - Index shards: 1 (active 0 [primary 0], initializing 1, unassigned 0, relocating 0)
[main] ERROR org.eclipse.rdf4j.sail.lucene.LuceneSailConnection - Committing operations in lucenesail, encountered exception org.elasticsearch.action.search.SearchPhaseExecutionException: Failed to execute phase [query_fetch], all shards failed. Only some operations were stored, 1 operations are discarded. Lucene Index is now corrupt.
org.elasticsearch.action.search.SearchPhaseExecutionException: Failed to execute phase [query_fetch], all shards failed
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.java:237)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$1.onFailure(TransportSearchTypeAction.java:183)
    at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:565)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
[main] INFO org.elasticsearch.node - [Asp] stopping ...
[main] INFO org.elasticsearch.node - [Asp] stopped
[main] INFO org.elasticsearch.node - [Asp] closing ...
[main] INFO org.elasticsearch.node - [Asp] closed
Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.852 sec <<< FAILURE! - in org.eclipse.rdf4j.sail.elasticsearch.ElasticsearchSailIndexedPropertiesTest
testRegularQuery(org.eclipse.rdf4j.sail.elasticsearch.ElasticsearchSailIndexedPropertiesTest)  Time elapsed: 0.278 sec  <<< ERROR!
org.eclipse.rdf4j.repository.RepositoryException: org.eclipse.rdf4j.sail.SailException: org.elasticsearch.action.search.SearchPhaseExecutionException: Failed to execute phase [query_fetch], all shards failed
    at org.eclipse.rdf4j.repository.sail.SailRepositoryConnection.commit(SailRepositoryConnection.java:172)
    at org.eclipse.rdf4j.sail.lucene.AbstractLuceneSailIndexedPropertiesTest.setUp(AbstractLuceneSailIndexedPropertiesTest.java:121)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
    at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
    at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264)
    at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
    at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124)
    at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200)
    at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
    at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)
Caused by: org.eclipse.rdf4j.sail.SailException: org.elasticsearch.action.search.SearchPhaseExecutionException: Failed to execute phase [query_fetch], all shards failed
    at org.eclipse.rdf4j.sail.lucene.LuceneSailConnection.commit(LuceneSailConnection.java:237)
    at org.eclipse.rdf4j.repository.sail.SailRepositoryConnection.commit(SailRepositoryConnection.java:169)
    at org.eclipse.rdf4j.sail.lucene.AbstractLuceneSailIndexedPropertiesTest.setUp(AbstractLuceneSailIndexedPropertiesTest.java:121)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
    at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
    at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:264)
    at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
    at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:124)
    at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200)
    at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
    at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)
Caused by: org.elasticsearch.action.search.SearchPhaseExecutionException: Failed to execute phase [query_fetch], all shards failed
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.java:237)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$1.onFailure(TransportSearchTypeAction.java:183)
    at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:565)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

@ansell
Contributor Author

ansell commented Aug 31, 2016

The current master (5f4d262) still has the reader closed issue on Hudson:

https://hudson.eclipse.org/rdf4j/job/rdf4j-master-verify/52/console

It may also have dead-/live-locked (@jeenbroekstra may need to shut the build off manually). I put a fix for the test suite locking up into my unrelated pull request, where it was making testing difficult. Feel free to cherry-pick the following commits (sorry, I don't have a squashed version available right now) onto master etc. so the test suite doesn't deadlock in this case:

214cbff

b6bbac1

62cb90a

@abrokenjester
Contributor

Thanks Peter, I put up a PR that cherry-picks those fixes for merging into master.

abrokenjester added a commit that referenced this issue Aug 31, 2016
@abrokenjester abrokenjester added the 🐞 bug issue is a bug label Sep 1, 2016
@pulquero pulquero mentioned this issue Sep 2, 2016
@pulquero

pulquero commented Sep 2, 2016

The elasticsearch error is unrelated. I can't reproduce it. Every so often I can reproduce a red cluster health status, but that does not lead to any stack trace. You could try setting the WAIT_FOR_NODES_KEY and WAIT_FOR_ACTIVE_SHARDS_KEY search index properties to see if that solves it for you.
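
For reference, a minimal configuration sketch of that suggestion (an assumption-heavy illustration: it presumes WAIT_FOR_NODES_KEY and WAIT_FOR_ACTIVE_SHARDS_KEY are constants on ElasticsearchIndex and that LuceneSail.setParameter forwards them to the search index; the values shown are placeholders):

    import org.eclipse.rdf4j.sail.elasticsearch.ElasticsearchIndex;
    import org.eclipse.rdf4j.sail.lucene.LuceneSail;
    import org.eclipse.rdf4j.sail.memory.MemoryStore;

    public class ElasticsearchWaitSettingsSketch {

        static LuceneSail configureSail() {
            LuceneSail luceneSail = new LuceneSail();
            luceneSail.setBaseSail(new MemoryStore());
            // use the Elasticsearch-backed search index instead of the default Lucene one
            luceneSail.setParameter(LuceneSail.INDEX_CLASS_KEY,
                    "org.eclipse.rdf4j.sail.elasticsearch.ElasticsearchIndex");
            // placeholder values: wait for at least one node and one active shard
            // before the index reports itself as ready
            luceneSail.setParameter(ElasticsearchIndex.WAIT_FOR_NODES_KEY, ">=1");
            luceneSail.setParameter(ElasticsearchIndex.WAIT_FOR_ACTIVE_SHARDS_KEY, "1");
            return luceneSail;
        }
    }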

@abrokenjester abrokenjester added this to the 2.0.1 milestone Sep 3, 2016
@abrokenjester
Contributor

Merged the concurrency fix. I'm not seeing any further problems, so barring info to the contrary I think we can resolve this issue?

@ansell
Contributor Author

ansell commented Sep 4, 2016

I will run some more builds locally with this patch today to see if it still triggers intermittently for me.

@abrokenjester
Contributor

abrokenjester commented Sep 4, 2016

@ansell be aware that this isn't merged to master yet, only the release branch. Will fix ASAP.

@ansell
Contributor Author

ansell commented Sep 4, 2016

Okay, I will run some tests on the release branch for now until it hits master.

@abrokenjester
Contributor

Just merged it because why the hell not :)

@ansell
Contributor Author

ansell commented Sep 4, 2016

Testing this on my issue branch (after rebasing and squashing it down to a single commit) showed the error on the first attempt. The commit I tested it on was:

4e49be2

@ansell
Contributor Author

ansell commented Sep 4, 2016

I can also intermittently reproduce the bug using the Eclipse JUnit runner, so it isn't related to maven/surefire.

@abrokenjester
Contributor

abrokenjester commented Sep 5, 2016

If I understand the problem correctly, the core of the issue is that when a LuceneSailConnection is closed, it calls LuceneIndex.endReading, which in turn (eventually) calls ReaderMonitor.endReading, and this method calls close() without verifying if other parallel connections are still using that same index.

(update removed fix suggestion because I looked a bit closer and realized I was wrong)

@abrokenjester
Contributor

I think AbstractReaderMonitor has a flaw in its concurrency logic: it uses both an AtomicBoolean and a reference counter, without synchronizing reads of the two together. E.g. line 55:

    if (readingCount.decrementAndGet() == 0 && doClose.get()) {

It's entirely possible for readingCount to be re-incremented by the time the second part of the && is evaluated.

I'm also slightly confused by the fact that it has two AtomicBooleans, one called doClose and one called closed.
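
For illustration, a minimal sketch of one way to keep the reading count and the close decision consistent, by guarding both with a single lock (hypothetical; this is not the actual AbstractReaderMonitor code, and the field names are made up):

    // Hypothetical monitor: one intrinsic lock guards both the reader count and the
    // close request, so the decrement and the "close now?" decision cannot interleave
    // with a concurrent beginReading().
    class ReaderMonitorSketch {

        private int readingCount = 0;            // guarded by "this"
        private boolean closeRequested = false;  // guarded by "this"
        private boolean closed = false;          // guarded by "this"

        synchronized void beginReading() {
            if (closed) {
                throw new IllegalStateException("monitor already closed");
            }
            readingCount++;
        }

        synchronized void endReading() {
            readingCount--;
            closeIfIdle();
        }

        synchronized void requestClose() {
            closeRequested = true;
            closeIfIdle();
        }

        private void closeIfIdle() {
            // always called while holding the lock
            if (readingCount == 0 && closeRequested && !closed) {
                closed = true;
                // release the underlying IndexReader/IndexSearcher here
            }
        }
    }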

@abrokenjester
Contributor

abrokenjester commented Sep 5, 2016

The more I look at it, the stranger I find the construction. LuceneSailConnection calls LuceneIndex.beginReading, which in turn calls ReaderMonitor.beginReading, which increments the count, but as far as I can tell nothing enforces or checks anywhere whether the underlying reader has already been closed at that point.
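
For illustration, a sketch of a beginReading that refuses to hand out a reader once it has been released, using the same compare-and-set idiom that appears in the next comment (hypothetical; not the actual ReaderMonitor code):

    import java.util.concurrent.atomic.AtomicInteger;

    // Hypothetical monitor: refCount starts at 1 (the monitor's own reference);
    // 0 is a sentinel meaning "closed", so beginReading() fails instead of silently
    // handing out a reader that may already have been released.
    class CloseAwareMonitorSketch {

        private final AtomicInteger refCount = new AtomicInteger(1);

        boolean beginReading() {
            int count;
            while ((count = refCount.get()) > 0) {
                if (refCount.compareAndSet(count, count + 1)) {
                    return true; // the reader is now pinned open until endReading()
                }
            }
            return false; // already closed: the caller must obtain a fresh reader
        }

        void endReading() {
            release();
        }

        // called exactly once, when the monitor itself is no longer needed
        void close() {
            release();
        }

        private void release() {
            if (refCount.decrementAndGet() == 0) {
                // last reference gone: actually close the underlying IndexReader here
            }
        }
    }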

@ansell
Contributor Author

ansell commented Sep 5, 2016

The way Lucene internally manages reference counts is also fundamentally broken and a bit scary to look through.

Instead of calling AtomicInteger.incrementAndGet as usual, they use the following obtuse method in IndexReader, which is vulnerable to decrements happening concurrently. It seems strange to do it that way when you have just signalled that you want to add to the reference count, and it makes it fairly likely that two threads will fail to work concurrently if one completes too quickly and decrements the reference count to zero before the second thread is able to start:

  public final boolean tryIncRef() {
    int count;
    while ((count = refCount.get()) > 0) {
      if (refCount.compareAndSet(count, count+1)) {
        return true;
      }
    }
    return false;
  }
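
A standalone illustration of the timing described above, using a plain AtomicInteger to mimic the refCount pattern (this is not Lucene's actual class): once the last reference has been released and the reader closed, a late tryIncRef can no longer succeed.

    import java.util.concurrent.atomic.AtomicInteger;

    public class TryIncRefTimingSketch {

        static final AtomicInteger refCount = new AtomicInteger(1); // initial owner holds one reference

        static boolean tryIncRef() {
            int count;
            while ((count = refCount.get()) > 0) {
                if (refCount.compareAndSet(count, count + 1)) {
                    return true;
                }
            }
            return false;
        }

        static void decRef() {
            if (refCount.decrementAndGet() == 0) {
                System.out.println("last reference released: the reader would be closed here");
            }
        }

        public static void main(String[] args) {
            // Thread A acquires and releases quickly
            System.out.println("A tryIncRef: " + tryIncRef()); // true, count 1 -> 2
            decRef();                                          // A done, count 2 -> 1
            decRef();                                          // owner releases, count 1 -> 0: closed

            // Thread B arrives too late; the count is already 0, so tryIncRef fails
            System.out.println("B tryIncRef: " + tryIncRef()); // false
        }
    }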

abrokenjester added a commit that referenced this issue Sep 5, 2016
issue#301 : Work on tightening the contract for lucene sail components
@abrokenjester
Contributor

Fixes merged into master and the 2.0 release branch. Can we consider this resolved now? The fact that you no longer get the issue in 30+ runs seems pretty solid to me.

@ansell
Contributor Author

ansell commented Sep 5, 2016

Sure, we can reopen or open a follow-up if/when it pops up again. To clarify, that was 30 runs of just LuceneSailTest, where the issue was occurring, inside the Eclipse JUnit runner. I didn't run the Maven build 30 times, and it was only on my 2-CPU VirtualBox virtual machine. But the issue was occurring consistently for me on that 2-CPU virtual machine previously, so in that sense it was a valid test host.
