
Allow to recover into a folder containing a corrupted shard #10558

Closed
wants to merge 1 commit into master from bleskes:corrupted_replica

Conversation

Member

bleskes commented Apr 13, 2015

At the moment, we are very strict when handling data folders containing corrupted shards and will fail any recovery attempt into them. Typically this wouldn't be a problem, as the shard will be assigned to another node (which we try first anyway when a shard fails). However, this has proven to be too strict for smaller clusters, which may not have an extra node available (because of allocation filtering, disk space issues, etc.). This commit changes the behavior to force a full recovery instead. Once all the new files are verified, we remove the old corrupted data and start the shard.
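A rough Java sketch of the new flow (the names here are hypothetical illustrations, not the actual Elasticsearch APIs):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/**
 * Sketch only: recover into a shard folder even if its existing contents are
 * corrupted, instead of failing the recovery outright. All names
 * (hasCorruptionMarker, verifyChecksums, ...) are hypothetical.
 */
public class CorruptedShardRecoverySketch {

    public void recoverFullCopy(Path shardDataPath, List<Path> filesFromSource) throws IOException {
        // Previously: a corruption marker in the target folder aborted the recovery.
        // Now: remember the corruption and force a full file-based recovery instead.
        boolean wasCorrupted = hasCorruptionMarker(shardDataPath);

        // Copy the complete set of files from the recovery source into a staging dir.
        Path staging = Files.createTempDirectory(shardDataPath, "recovery-");
        for (Path file : filesFromSource) {
            Files.copy(file, staging.resolve(file.getFileName()));
        }

        // Only once every new file has been verified do we touch the old data.
        verifyChecksums(staging);

        if (wasCorrupted) {
            // Remove the old corrupted files before the shard is started.
            try (DirectoryStream<Path> old = Files.newDirectoryStream(shardDataPath)) {
                for (Path p : old) {
                    if (!p.equals(staging) && Files.isRegularFile(p)) {
                        Files.delete(p);
                    }
                }
            }
        }
        // ... move the verified files into place and start the shard ...
    }

    private boolean hasCorruptionMarker(Path dir) throws IOException {
        // Hypothetical: look for a "corrupted_*" marker file in the shard folder.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "corrupted_*")) {
            return stream.iterator().hasNext();
        }
    }

    private void verifyChecksums(Path dir) {
        // Placeholder: verify each copied file against its expected checksum,
        // failing the recovery (not the whole folder) if anything doesn't match.
    }
}
```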

This also fixes a small issue where the shard state file wasn't deleted on an engine failure (we had a protection against deleting the state file of an active shard, but in this case the shard is still active even though it is about to be removed). The state deletion is also moved to before the failure handlers are called, to avoid race conditions when notifying the master (it may otherwise try to read the file while re-allocating the shard).
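And a minimal sketch of that reordering, again with hypothetical names, just to show why the state file is deleted before the handlers run:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Sketch of the ordering fix; the types and names here are hypothetical. */
class ShardFailureSketch {
    interface FailureHandler {
        void onShardFailure(String shardId, Exception failure);
    }

    private final List<FailureHandler> failureHandlers;
    private final Path shardStateFile;

    ShardFailureSketch(List<FailureHandler> failureHandlers, Path shardStateFile) {
        this.failureHandlers = failureHandlers;
        this.shardStateFile = shardStateFile;
    }

    void failShard(String shardId, Exception failure) throws IOException {
        // Delete the persisted shard state *before* calling the failure handlers:
        // once the master learns about the failure it may re-allocate the shard
        // and try to read this file, so it must already be gone by then.
        Files.deleteIfExists(shardStateFile);

        for (FailureHandler handler : failureHandlers) {
            handler.onShardFailure(shardId, failure);
        }
    }
}
```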

Recovery: allow to recover into a folder containing a corrupted shard
At the moment, we are very strict when handling data folders containing corrupted shards and will fail any recovery attempt into them. Typically this wouldn't be a problem, as the shard will be assigned to another node (which we try first anyway when a shard fails due to corruption). However, this has proven to be too strict for smaller clusters, which may not have an extra node available (because of allocation filtering, disk space issues, etc.). This commit changes the behavior to force a full recovery. Once all the new files are verified, we remove the old corrupted data and start the shard.

This also fixes a small issue where the shard state file wasn't deleted on an engine failure (we had a protection against deleting the state file of an active shard, but in this case the shard is still active even though it is about to be removed). The state deletion is also moved to before the failure handlers are called.

Contributor

s1monw commented Apr 13, 2015

I don't think this needs to go into 1.5.2; it's not a bugfix, it's a new way of handling things, as well as a pretty controversial change IMO. I think it LGTM, but it doesn't qualify as a bugfix.

@bleskes bleskes removed the v1.5.2 label Apr 13, 2015


Member

bleskes commented Apr 13, 2015

@s1monw fair enough - I removed the 1.5.2 label. Double checking - are you +1 on this as it is now?


Contributor

s1monw commented Apr 13, 2015

LGTM

@bleskes bleskes closed this in 8e302f1 Apr 13, 2015

bleskes added a commit to bleskes/elasticsearch that referenced this pull request Apr 13, 2015

Recovery: allow to recover into a folder containing a corrupted shard

Closes #10558

@bleskes bleskes deleted the bleskes:corrupted_replica branch Apr 13, 2015

@clintongormley clintongormley removed the review label Aug 7, 2015
