[CI] Multiple tests failing with "some shards are still open" error #52021

Closed
mark-vieira opened this issue Feb 6, 2020 · 10 comments · Fixed by #52099
Labels
:Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. :Search/Search Search-related issues that do not fall into other categories >test-failure Triaged test failures from CI

Comments

@mark-vieira
Contributor

This has failed 4 times already today:

:modules:reindex:test » org.elasticsearch.index.reindex.ReindexFailureTests » classMethod (0.007s)
Some shards are still open after the threadpool terminated. Something is leaking index readers or store references.
java.lang.IllegalStateException: Some shards are still open after the threadpool terminated. Something is leaking index readers or store references.

Here are some build scans. Seems to be isolated to 7.x for now:

https://gradle-enterprise.elastic.co/s/gcnoaw5pepki2/tests/5nfchac5yggnq-rjw76l2zc7kn2
https://gradle-enterprise.elastic.co/s/bqddtyap54zai/tests/5nfchac5yggnq-rjw76l2zc7kn2
https://gradle-enterprise.elastic.co/s/crutz4g7pzhay/tests/5nfchac5yggnq-rjw76l2zc7kn2
https://gradle-enterprise.elastic.co/s/zkfh4z2lj67g2/tests/5nfchac5yggnq-rjw76l2zc7kn2

@mark-vieira mark-vieira added >test-failure Triaged test failures from CI :Distributed/Reindex Issues relating to reindex that are not caused by issues further down labels Feb 6, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Reindex)

@mark-vieira
Contributor Author

Got another one so I've gone ahead and muted this test with 16afbf9.

https://gradle-enterprise.elastic.co/s/mdkgsffws5mly/tests/failed

@mark-vieira
Contributor Author

Interestingly, I also just saw ShardSizeTermsIT fail with the same error:

https://gradle-enterprise.elastic.co/s/c5ihsym47tw2y/tests/kyv2y2z3r4v7m-tvapgmw3k3fym

@mark-vieira mark-vieira changed the title [CI] ReindexFailureTests failing in 7.x with leak error [CI] Multiple tests failing with "some shards are still open" error Feb 7, 2020
@mark-vieira mark-vieira added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. and removed :Distributed/Reindex Issues relating to reindex that are not caused by issues further down labels Feb 7, 2020
@mark-vieira
Contributor Author

Ok, so this seems to be a more generic error that's happening in master and 7.x and looks to have started on February 6th. From what I see so far this has happened in RecoveryWhileUnderLoadIT, ReindexFailureTests, SearchRestCancellationIT, ShardSizeTermsIT and AsyncSearchActionTests. So this is likely some more underlying issue. Here are a number of example build scans:

https://gradle-enterprise.elastic.co/s/llocok2qap4c2/tests/kyv2y2z3r4v7m-n5sx3bdgtt26o
https://gradle-enterprise.elastic.co/s/fbwpfift4mjsw/tests/5nfchac5yggnq-rjw76l2zc7kn2
https://gradle-enterprise.elastic.co/s/b2cthh57ikcbi/tests/xnglhbfnjd7ae-iwi4iupewsj6m
https://gradle-enterprise.elastic.co/s/c5ihsym47tw2y/tests/kyv2y2z3r4v7m-tvapgmw3k3fym
https://gradle-enterprise.elastic.co/s/jsstsgsksw5je/tests/xnglhbfnjd7ae-iwi4iupewsj6m

I'm assuming this falls under @elastic/es-distributed just because I see the word "shards" a lot, but please educate me if this is a wrong assumption.

Given this looks to be a very recent issue, and is happening with some regularity, I think we should give this some priority. Given the number of tests this seems to be cropping up in, muting isn't going to be an effective strategy here.

@dnhatn
Member

dnhatn commented Feb 7, 2020

These failures might relate to #46091.

@DaveCTurner
Contributor

I suspect #51708, on the grounds that on my machine 1dcf1df passed 140 iterations of ./gradlew :modules:reindex:test --tests org.elasticsearch.index.reindex.ReindexFailureTests.testResponseOnSearchFailure but eb69c6f fails with Some shards are still open about one run in ten. I'm running stress -c 16 -m 8 -d 2 too, not sure if that's helpful to reproduce or not.

@dnhatn would you like to investigate further, or should we ask the search team to take a look?
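
To put rough numbers on the comparison above: assuming the runs are independent and that eb69c6f really does fail about one run in ten under the same conditions, the chance of seeing 140 consecutive passes purely by luck would be roughly:

```latex
(1 - 0.1)^{140} = 0.9^{140} \approx 3.9 \times 10^{-7}
```

So 140 clean iterations on 1dcf1df is strong evidence that the leak is not present at that commit, with the caveat that the one-in-ten rate is only an estimate from a handful of runs.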

@dnhatn
Member

dnhatn commented Feb 8, 2020

@DaveCTurner Good find. I will take a look. Thank you for looking :).

@dnhatn dnhatn self-assigned this Feb 8, 2020
@dnhatn
Member

dnhatn commented Feb 8, 2020

Yes, these failures are indeed caused by #51708. I am working on the fix.

@dnhatn dnhatn added the :Search/Search Search-related issues that do not fall into other categories label Feb 8, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Search)

@dnhatn
Member

dnhatn commented Feb 8, 2020

I've opened #52099.

dnhatn added a commit that referenced this issue Feb 10, 2020
We might leak a searcher if the target shard is removed (i.e., its index 
is deleted) or relocated while we are creating a SearchContext from a
SearchRewriteContext.

Relates #51708
Closes #52021

I labelled this non-issue for an unreleased bug introduced in #51708.
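
The kind of leak described in the commit message can be illustrated with a minimal, self-contained sketch. This is not the code from #51708 or #52099: `Shard`, `Searcher`, `Context`, `createLeaky` and `createSafe` are hypothetical stand-ins, and the "success flag plus finally" cleanup is just one common way to release a resource when a later construction step throws. It assumes the failure mode is "searcher acquired, then context creation fails because the shard is gone".

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SearcherLeakSketch {
    /** Hypothetical stand-in for a shard; counts searchers that are still open. */
    static final class Shard {
        final AtomicInteger openSearchers = new AtomicInteger();
        volatile boolean removed; // set when the index is deleted or the shard relocates

        Searcher acquireSearcher() {
            openSearchers.incrementAndGet();
            return new Searcher(this);
        }
    }

    /** Hypothetical searcher handle; closing it releases the shard reference once. */
    static final class Searcher implements AutoCloseable {
        private final Shard shard;
        private boolean closed;

        Searcher(Shard shard) {
            this.shard = shard;
        }

        @Override
        public void close() {
            if (!closed) {
                closed = true;
                shard.openSearchers.decrementAndGet();
            }
        }
    }

    /** Hypothetical search context; its constructor fails if the shard has gone away. */
    static final class Context implements AutoCloseable {
        private final Searcher searcher;

        Context(Shard shard, Searcher searcher) {
            if (shard.removed) {
                throw new IllegalStateException("shard removed while creating context");
            }
            this.searcher = searcher;
        }

        @Override
        public void close() {
            searcher.close();
        }
    }

    /** Leaky variant: if the Context constructor throws, nobody closes the searcher. */
    static Context createLeaky(Shard shard) {
        Searcher searcher = shard.acquireSearcher();
        return new Context(shard, searcher);
    }

    /** Safe variant: release the searcher unless ownership passed to the context. */
    static Context createSafe(Shard shard) {
        Searcher searcher = shard.acquireSearcher();
        boolean success = false;
        try {
            Context context = new Context(shard, searcher);
            success = true;
            return context;
        } finally {
            if (!success) {
                searcher.close();
            }
        }
    }

    /** Simulates a shard removed before context creation; returns searchers left open. */
    static int openAfterFailedCreate(boolean safe) {
        Shard shard = new Shard();
        shard.removed = true;
        try {
            Context context = safe ? createSafe(shard) : createLeaky(shard);
            context.close();
        } catch (IllegalStateException e) {
            // expected: the shard was removed before the context could be built
        }
        return shard.openSearchers.get();
    }

    public static void main(String[] args) {
        System.out.println("leaky: " + openAfterFailedCreate(false)); // prints "leaky: 1"
        System.out.println("safe:  " + openAfterFailedCreate(true));  // prints "safe:  0"
    }
}
```

In the leaky variant the counter stays at 1 after the failed creation, which is the sort of dangling reference that a "some shards are still open" check at shutdown would plausibly flag.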