Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inc store reference before refresh #28656

Merged
merged 1 commit into from
Feb 13, 2018
Merged

Conversation

jimczi
Copy link
Contributor

@jimczi jimczi commented Feb 13, 2018

If a tragic even happens while we are refreshing a searcher/reader the engine can open new files on a store that is already closed.
For instance the following CI job failed because a refresh was concurrently called on a failing shard:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+oracle-java10-periodic/84
This change increments the ref count of the store during a refresh in order to postpone the closing after a tragic event.

If a tragic even happens while we are refreshing a searcher/reader the engine can open new files on a store that is already closed
For instance the following CI job failed because a merge was concurrently called on a failing shard:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+oracle-java10-periodic/84
This change increments the ref count of the store during a refresh in order to postpone the closing after a tragic event.
Copy link
Contributor

@s1monw s1monw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM good catch

@jimczi jimczi merged commit 3b9f530 into elastic:master Feb 13, 2018
@jimczi jimczi deleted the refresh_inc_store branch February 13, 2018 14:38
jimczi added a commit that referenced this pull request Feb 13, 2018
If a tragic even happens while we are refreshing a searcher/reader the engine can open new files on a store that is already closed
For instance the following CI job failed because a merge was concurrently called on a failing shard:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+oracle-java10-periodic/84
This change increments the ref count of the store during a refresh in order to postpone the closing after a tragic event.
@bleskes
Copy link
Contributor

bleskes commented Feb 13, 2018

great catch. I think it's trappy that acquiring a lock isn't enough to keep things alive. I think this kind of failure can happen in many other ways. I'm wondering if should increment the store count everytime we issue a lock and decrement it when the lock is freed. Just a thought. This change is great.

@clintongormley clintongormley added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Engine :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. labels Feb 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. v6.3.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants