Persist all engine failure exceptions #12168

Closed
wants to merge 1 commit into from

Conversation

areek
Contributor

@areek areek commented Jul 9, 2015

Currently, when an engine fails, the shard state file is no longer deleted (see #11933),
and the underlying store is marked as corrupted only for index corruption exceptions.
This means that the store can still be opened, even after it failed with an IOE or OOM exception.

It would be useful to persist the engine failures that are not due to corruption for
inspection; these can be exposed later through #11545.

@areek
Contributor Author

areek commented Jul 9, 2015

@s1monw @kimchy @bleskes thoughts? This is a spin-off from #11933. We keep all the failures around but only mark a store as corrupted on index corruption.

@bleskes
Contributor

bleskes commented Jul 12, 2015

@areek it's great that you are picking this up - I'm on the move now, so excuse me for being brief. I think that non-corruption issues shouldn't be stored on the store but rather in the shard state file (or some other file next to it). The reason is that the store itself can be shared across machines (shadow replicas FTW), so it feels wrong to mark it as "failed" when it didn't really fail. Take the case of an OOM - it's the node that failed, not the data. The state file is local to the node, so it's different there.
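
To make the distinction in the comment above concrete, here is a minimal sketch in Python (not Elasticsearch code). It assumes a hypothetical `CorruptIndexError` and hypothetical file names `corrupted_marker.json` and `engine_failure.json`; none of these come from the PR. The only point is that a corruption marker lives with the (possibly shared) store, while any other failure is recorded in a node-local directory next to the shard state.

```python
import json
from pathlib import Path


class CorruptIndexError(Exception):
    """Hypothetical stand-in for an index corruption exception."""


def record_engine_failure(exc: Exception, store_dir: Path, node_state_dir: Path) -> Path:
    """Persist an engine failure and return the path of the file written.

    Corruption is a property of the data, so it is recorded on the store.
    Anything else (for example an out-of-memory error) is a property of
    the node, so it is recorded in the node-local state directory.
    """
    if isinstance(exc, CorruptIndexError):
        target = store_dir / "corrupted_marker.json"    # hypothetical file name
    else:
        target = node_state_dir / "engine_failure.json"  # hypothetical file name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps({"type": type(exc).__name__, "reason": str(exc)}))
    return target


def store_can_be_opened(store_dir: Path) -> bool:
    """Only a corruption marker on the store itself blocks opening it."""
    return not (store_dir / "corrupted_marker.json").exists()
```

With this layout, another node sharing the same store would not see it as failed after an OOM on this node, but would still refuse to open it after real corruption.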

@areek
Contributor Author

areek commented Mar 16, 2016

Closing, as we now have UnassignedInfo (e.g. through _cat/shards) for any failed and unassigned shards.
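
For readers who want to inspect that information, the following is a rough sketch of building a `_cat/shards` request that asks for the unassigned columns and of filtering its JSON response. The helper names and the base URL are assumptions for illustration, and the `unassigned.reason` / `unassigned.details` columns are assumed to be available in the Elasticsearch version in use.

```python
import json
from urllib.parse import urlencode

# Columns to request; the unassigned.* columns carry the UnassignedInfo data.
COLUMNS = ["index", "shard", "prirep", "state", "unassigned.reason", "unassigned.details"]


def cat_shards_url(base_url: str) -> str:
    """Build a _cat/shards URL that returns the chosen columns as JSON."""
    query = urlencode({"format": "json", "h": ",".join(COLUMNS)})
    return f"{base_url.rstrip('/')}/_cat/shards?{query}"


def unassigned_shards(body: str) -> list:
    """Keep only the rows whose state is UNASSIGNED from a JSON response body."""
    return [row for row in json.loads(body) if row.get("state") == "UNASSIGNED"]
```

The URL from `cat_shards_url("http://localhost:9200")` (an assumed local address) can be fetched with any HTTP client, and the response text passed to `unassigned_shards` to list the shards that carry a failure reason.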

@areek areek closed this Mar 16, 2016
@clintongormley clintongormley added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Engine :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. labels Feb 13, 2018
Labels
:Distributed/Engine Anything around managing Lucene and the Translog in an open shard. >enhancement

4 participants