Fail restore when the shard allocations max retries count is reached #27493

Merged
12 commits merged into elastic:master from tlrx:fail-restore-on-max-retries-allocations on Dec 12, 2017

Conversation

@tlrx (Member) commented Nov 22, 2017:

When the allocation of a shard has been retried too many times, the
MaxRetryDecider is engaged to prevent any future allocation of the
failed shard. If it happens while restoring a snapshot, the restore
hangs and never completes because it stays around waiting for the
shards to be assigned. It also blocks future attempts to restore the
snapshot again.

This commit changes the current behaviour in order to fail the restore if
a shard reaches the maximum number of allocation attempts without being
successfully assigned.

This is the second part of the #26865 issue.
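For context, a rough sketch of how the stuck-restore scenario can be reproduced in an ESIntegTestCase-style test, similar to the allocation-filtering test added later in this PR (the repository and snapshot names and the filter value are illustrative assumptions, not the PR's exact test code):

    // Restore the snapshot but force an impossible allocation: the filter below only
    // allows allocation on a node name that does not exist in the test cluster.
    // Before this PR the restore stayed "in progress" forever because the primary
    // shards could never be assigned; with this PR the restore is marked as failed.
    RestoreSnapshotResponse restoreResponse = client().admin().cluster()
        .prepareRestoreSnapshot("test-repo", "test-snap")
        .setIndexSettings(Settings.builder()
            .put("index.routing.allocation.include._name", "nonexistent-node") // hypothetical filter value
            .build())
        .setWaitForCompletion(false)
        .get();
    // With wait_for_completion=false the request is only accepted; the restore runs asynchronously.
    assertThat(restoreResponse.status(), equalTo(RestStatus.ACCEPTED));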

core/src/main/java/org/elasticsearch/snapshots/RestoreService.java Outdated
// check if the maximum number of attempts to restore the shard has been reached. If so, we can fail
// the restore and leave the shards unassigned.
IndexMetaData indexMetaData = metaData.getIndexSafe(unassignedShard.getKey().getIndex());
int maxRetry = MaxRetryAllocationDecider.SETTING_ALLOCATION_MAX_RETRY.get(indexMetaData.getSettings());

@ywelsch (Contributor) commented Nov 22, 2017:

I think the solution needs to be more generic than depending on the settings of specific allocation deciders. I think we can use unassignedInfo.getLastAllocationStatus for that and check if it is DECIDERS_NO.

@tlrx (Author, Member) commented Nov 22, 2017:

That's a good suggestion, as I suppose that a restore can also be stuck because the deciders cannot assign the shard (not enough space on disk, awareness rules forbidding allocation, etc.). I also like that it's more generic.

I think I can give it a try by reverting a portion of the code and overriding unassignedInfoUpdated()... I'll push something if it works.

@imotov (Member) approved these changes Nov 22, 2017:

LGTM. Thanks a lot for fixing it!

@tlrx (Author, Member) commented Nov 23, 2017:

Thanks for your reviews. I think @ywelsch is right: we can be more generic by checking the last allocation status and failing the restore if needed.

I updated the code to revert some of my changes and use the last allocation status instead. I also added another test based on shard allocation filtering settings that prevent the shards from being assigned. Please let me know what you think :)

@ywelsch (Contributor) left a review comment:

I prefer this approach over the previous one. I've left some more suggestions here and there. It will be interesting to see how this plays together with #27086,
where we plan on adding a delay between retries of failed allocations. If we choose to retry infinitely (with exponential backoff), then this would be disastrous for the restore here (as it would never abort).

core/src/main/java/org/elasticsearch/snapshots/RestoreService.java Outdated
if (newUnassignedInfo.getLastAllocationStatus() == UnassignedInfo.AllocationStatus.DECIDERS_NO) {
Snapshot snapshot = ((SnapshotRecoverySource) recoverySource).snapshot();
String reason = "shard was denied allocation by all allocation deciders";
changes(snapshot).unassignedShards.put(unassignedShard.shardId(),

@ywelsch (Contributor) commented Nov 23, 2017:

I think instead of adding more types here (unassignedShards), it would be better to do the reverse and fold failedShards, startedShards and unassignedShards into just "updates".
It's not worth separating them just to have the one assertion I've put there.

@tlrx (Author, Member) commented Nov 24, 2017:

Pff I should have seen that... I agree, that would be better, thanks :)

core/src/main/java/org/elasticsearch/snapshots/RestoreService.java Outdated
if (recoverySource.getType() == RecoverySource.Type.SNAPSHOT) {
if (newUnassignedInfo.getLastAllocationStatus() == UnassignedInfo.AllocationStatus.DECIDERS_NO) {
Snapshot snapshot = ((SnapshotRecoverySource) recoverySource).snapshot();
String reason = "shard was denied allocation by all allocation deciders";

@ywelsch (Contributor) commented Nov 23, 2017:

I would just put "shard could not be allocated on any of the nodes"

core/src/test/java/org/elasticsearch/snapshots/SharedClusterSnapshotRestoreIT.java Outdated
.setSettings(Settings.builder().put("location", repositoryLocation)));

// create a test index
final String indexName = randomAlphaOfLength(10).toLowerCase(Locale.ROOT);

@ywelsch (Contributor) commented Nov 23, 2017:

just choose a fixed index name, no need for randomization here :)

core/src/test/java/org/elasticsearch/snapshots/SharedClusterSnapshotRestoreIT.java Outdated

// attempt to restore the snapshot
RestoreSnapshotResponse restoreResponse = client().admin().cluster().prepareRestoreSnapshot("test-repo", "test-snap")
.setMasterNodeTimeout("30s")

@ywelsch (Contributor) commented Nov 23, 2017:

isn't this the default?

@tlrx (Author, Member) commented Nov 24, 2017:

It is, thanks

core/src/test/java/org/elasticsearch/snapshots/SharedClusterSnapshotRestoreIT.java Outdated
assertBusy(() -> {
Client client = client();
for (int shardId = 0; shardId < numShards.numPrimaries; shardId++) {
ClusterAllocationExplainResponse allocationResponse = client.admin().cluster().prepareAllocationExplain()

@ywelsch (Contributor) commented Nov 23, 2017:

No need for the allocation explain API; you can check all this directly on the cluster state that you get below. I also think that assertBusy won't be needed then.

@tlrx (Author, Member) commented Nov 24, 2017:

Right, we can use the cluster state in this case and it's even easier. But it seems that the assertBusy() is still needed, as the cluster state change can take a few milliseconds to propagate.
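A rough sketch of what the cluster-state based check could look like inside assertBusy (a fragment that assumes the indexName and numShards variables and the hamcrest matchers already used in the test file; not the PR's exact test code):

    assertBusy(() -> {
        // Inspect the routing table in the current cluster state directly,
        // instead of calling the allocation explain API for every shard.
        ClusterState clusterState = client().admin().cluster().prepareState().get().getState();
        IndexRoutingTable indexRoutingTable = clusterState.routingTable().index(indexName);
        for (int shardId = 0; shardId < numShards.numPrimaries; shardId++) {
            ShardRouting primary = indexRoutingTable.shard(shardId).primaryShard();
            // The primary stays unassigned and the deciders have said NO on every node.
            assertThat(primary.state(), equalTo(ShardRoutingState.UNASSIGNED));
            assertThat(primary.unassignedInfo().getLastAllocationStatus(),
                equalTo(UnassignedInfo.AllocationStatus.DECIDERS_NO));
        }
    });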

core/src/test/java/org/elasticsearch/snapshots/SharedClusterSnapshotRestoreIT.java Outdated
// Option 2: delete the index and restore again
assertAcked(client().admin().indices().prepareDelete(indexName));
restoreResponse = client().admin().cluster().prepareRestoreSnapshot("test-repo", "test-snap")
.setMasterNodeTimeout("30s")

@ywelsch (Contributor) commented Nov 23, 2017:

idem

core/src/test/java/org/elasticsearch/snapshots/SharedClusterSnapshotRestoreIT.java Outdated
}

// Wait for the shards to be assigned
waitForRelocation();

@ywelsch (Contributor) commented Nov 23, 2017:

just ensureGreen()

* Test that restoring an index with shard allocation filtering settings that prevents
* its allocation does not hang indefinitely.
*/
public void testUnrestorableIndexDuringRestore() throws Exception {

@ywelsch (Contributor) commented Nov 23, 2017:

I think it's possible to share most of the code with the previous test by calling a generic method with two parameters:

  • restoreIndexSettings (which would set maxRetries in the first case and allocation filters in the second case)
  • fixupAction (the action to run to fix the issue); see the sketch just below this list
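A hypothetical sketch of what such a shared helper could look like inside SharedClusterSnapshotRestoreIT (the method name unrestorableUseCase, the parameter types and the elided steps are assumptions for illustration, not necessarily the merged code):

    // Shared scenario: snapshot an index, make it unrestorable via restore-time index
    // settings, check that the restore is reported as failed, run a fix-up action,
    // then restore again and verify that it succeeds.
    private void unrestorableUseCase(final String indexName,
                                     final Settings restoreIndexSettings,
                                     final Runnable fixUpAction) throws Exception {
        // ... create the index, index documents, snapshot it, then delete it ...

        // Restore with settings that prevent the shards from being assigned
        // (e.g. a very low index.allocation.max_retries, or an allocation filter).
        client().admin().cluster().prepareRestoreSnapshot("test-repo", "test-snap")
            .setIndexSettings(restoreIndexSettings)
            .setWaitForCompletion(false)
            .get();

        // ... assert that the restore ends up in a failed state ...

        fixUpAction.run();   // e.g. reroute with retry_failed, or delete the index

        // ... restore again without the bad settings and ensureGreen() ...
    }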

@tlrx (Author, Member) commented Nov 24, 2017:

I tried to do something like that, please let me know what you think.

@tlrx (Author, Member) commented Nov 24, 2017:

Thanks for the reviews! I updated the code; can you please have another look?

> If we choose to retry infinitely (with exponential backoff), then this would be disastrous for the restore here (as it would never abort).

Right, and this is what this pull request tries to avoid. In the case of infinite retries, we could maybe add a timeout (a maximum execution time) at the restore request level and fail the restore if that timeout is reached.

@tlrx tlrx force-pushed the tlrx:fail-restore-on-max-retries-allocations branch Dec 5, 2017
@tlrx (Author, Member) commented Dec 5, 2017:

@ywelsch I added the allocation decider we talked about, and I had to rebase and adapt an existing test. The CI is still failing but I'm not sure it is related to my change; I'm still digging. Anyway, I'd be happy if you could have a look, thanks!

@ywelsch (Contributor) left a review comment:

Change looks great. I've left some minor comments.

core/src/main/java/org/elasticsearch/snapshots/RestoreService.java Outdated
if (recoverySource.getType() == RecoverySource.Type.SNAPSHOT) {
if (newUnassignedInfo.getLastAllocationStatus() == UnassignedInfo.AllocationStatus.DECIDERS_NO) {
Snapshot snapshot = ((SnapshotRecoverySource) recoverySource).snapshot();
String reason = "shard could not be allocated on any of the nodes";

@ywelsch (Contributor) commented Dec 5, 2017:

"allocated to" (instead of "allocated on")

...org/elasticsearch/cluster/routing/allocation/decider/RestoreInProgressAllocationDecider.java Outdated

final RecoverySource recoverySource = shardRouting.recoverySource();
if (recoverySource == null || recoverySource.getType() != RecoverySource.Type.SNAPSHOT) {
return allocation.decision(Decision.YES, NAME, "shard is not being restored");

@ywelsch (Contributor) commented Dec 5, 2017:

I would use the following message:
"ignored as shard is not being recovered from a snapshot"
and not have an explicit check for shardRouting.primary() == false. That case is automatically handled here as well, since replica shards are never recovered from a snapshot (their recovery source is always PEER).

@tlrx (Author, Member) commented Dec 5, 2017:

Right, thanks

...org/elasticsearch/cluster/routing/allocation/decider/RestoreInProgressAllocationDecider.java Outdated
}
}
return allocation.decision(Decision.NO, NAME, "shard has failed to be restored from the snapshot [%s] because of [%s] - " +
"manually delete the index [%s] in order to retry to restore the snapshot again or use the Reroute API to force the " +

@ywelsch (Contributor) commented Dec 5, 2017:

"close or delete the index"

I would also lowercase the "reroute API"

...org/elasticsearch/cluster/routing/allocation/decider/RestoreInProgressAllocationDecider.java Outdated
if (restoresInProgress != null) {
for (RestoreInProgress.Entry restoreInProgress : restoresInProgress.entries()) {
if (restoreInProgress.snapshot().equals(snapshot)) {
assert restoreInProgress.state().completed() == false : "completed restore should have been removed from cluster state";

@ywelsch (Contributor) commented Dec 5, 2017:

This assertion is not correct, I think.
If a restore for a shard fails 5 times, it is marked as completed only in one of the next cluster state updates (see cleanupRestoreState).

@tlrx (Author, Member) commented Dec 5, 2017:

The assertion checks that the restore in progress for the current allocation is not completed, so I think it's correct? It will be marked as completed later, as you noted.

@tlrx (Author, Member) commented Dec 7, 2017:

Note: we talked about this and Yannick is right; this assertion can be problematic on busier clusters if a reroute kicks in between the moment the restore completes and the moment it is removed from the cluster state by the CleanRestoreStateTaskExecutor.

...org/elasticsearch/cluster/routing/allocation/decider/RestoreInProgressAllocationDecider.java Outdated
if (restoreInProgress.snapshot().equals(snapshot)) {
assert restoreInProgress.state().completed() == false : "completed restore should have been removed from cluster state";
RestoreInProgress.ShardRestoreStatus shardRestoreStatus = restoreInProgress.shards().get(shardRouting.shardId());
if (shardRestoreStatus.state() != RestoreInProgress.State.FAILURE) {

@ywelsch (Contributor) commented Dec 5, 2017:

just wondering if it's possible for shardRestoreStatus to be null.
I think it can be if you restore from a snapshot, then the restore fails, and you retry another restore with a different subset of indices from that same snapshot.

@ywelsch (Contributor) commented Dec 5, 2017:

I would write this check as

if (shardRestoreStatus.state().completed() == false) {

and then add an assertion that shardRestoreStatus.state() != SUCCESS (as the shard should have been moved to started and the recovery source cleaned up at that point).

@tlrx (Author, Member) commented Dec 5, 2017:

> Just wondering if it's possible for shardRestoreStatus to be null.
> I think it can be if you restore from a snapshot, then the restore fails, and you retry another restore with a different subset of indices from that same snapshot.

Good catch, thanks!

}

@Override
public Decision canAllocate(final ShardRouting shardRouting, final RoutingNode node, final RoutingAllocation allocation) {

@ywelsch (Contributor) commented Dec 5, 2017:

can you also add

@Override
    public Decision canForceAllocatePrimary(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
        assert shardRouting.primary() : "must not call canForceAllocatePrimary on a non-primary shard " + shardRouting;
        return canAllocate(shardRouting, node, allocation);
    }

as this is a hard constraint with no exceptions

@tlrx (Author, Member) commented Dec 5, 2017:

Good catch

core/src/main/java/org/elasticsearch/snapshots/RestoreService.java Outdated
RestoreInProgress.State.FAILURE, "recovery source type changed from snapshot to " + initializedShard.recoverySource()));
}
}

@Override
public void unassignedInfoUpdated(ShardRouting unassignedShard, UnassignedInfo newUnassignedInfo) {
if (unassignedShard.primary()) {

@ywelsch (Contributor) commented Dec 5, 2017:

only primaries can have a snapshot recovery source, so no need for this extra check here.
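Putting the excerpts above together, the unassignedInfoUpdated hook in RestoreService plausibly ends up close to the sketch below (the name of the folded per-snapshot updates map, shards, is assumed; this is a reconstruction, not a verbatim copy of the merged code):

    @Override
    public void unassignedInfoUpdated(ShardRouting unassignedShard, UnassignedInfo newUnassignedInfo) {
        RecoverySource recoverySource = unassignedShard.recoverySource();
        if (recoverySource.getType() == RecoverySource.Type.SNAPSHOT) {
            if (newUnassignedInfo.getLastAllocationStatus() == UnassignedInfo.AllocationStatus.DECIDERS_NO) {
                // The shard is recovered from a snapshot but no node can accept it:
                // record a FAILURE status so that the restore itself can complete (as failed).
                Snapshot snapshot = ((SnapshotRecoverySource) recoverySource).snapshot();
                String reason = "shard could not be allocated to any of the nodes";
                changes(snapshot).shards.put(unassignedShard.shardId(),   // 'shards' map name is an assumption
                    new ShardRestoreStatus(unassignedShard.currentNodeId(), RestoreInProgress.State.FAILURE, reason));
            }
        }
    }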

@tlrx (Author, Member) commented Dec 5, 2017:

@ywelsch Thanks for the review! I updated the code.

@lcawl lcawl added v6.0.2 v5.6.6 and removed v6.0.1 v5.6.5 labels Dec 6, 2017
@ywelsch (Contributor) approved these changes Dec 7, 2017:

LGTM

...org/elasticsearch/cluster/routing/allocation/decider/RestoreInProgressAllocationDecider.java Outdated
if (restoreInProgress.snapshot().equals(snapshot)) {
RestoreInProgress.ShardRestoreStatus shardRestoreStatus = restoreInProgress.shards().get(shardRouting.shardId());
if (shardRestoreStatus != null && shardRestoreStatus.state().completed() == false) {
assert shardRestoreStatus.state() != RestoreInProgress.State.SUCCESS : "expected initializing shard";

@ywelsch (Contributor) commented Dec 7, 2017:

can you also add the shardRestoreStatus state and the shard routing to the failure message here?

@tlrx (Author, Member) commented Dec 7, 2017:

Sure, will do once the current CI build is finished
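Pieced together from the review excerpts, the canAllocate logic of the new RestoreInProgressAllocationDecider looks roughly as follows (a reconstruction that assumes the decider's NAME constant; the exact decision messages may differ from the merged code):

    @Override
    public Decision canAllocate(final ShardRouting shardRouting, final RoutingNode node, final RoutingAllocation allocation) {
        final RecoverySource recoverySource = shardRouting.recoverySource();
        if (recoverySource == null || recoverySource.getType() != RecoverySource.Type.SNAPSHOT) {
            // Replicas always recover from PEER, so this also covers the non-primary case.
            return allocation.decision(Decision.YES, NAME, "ignored as shard is not being recovered from a snapshot");
        }

        final Snapshot snapshot = ((SnapshotRecoverySource) recoverySource).snapshot();
        final RestoreInProgress restoresInProgress = allocation.custom(RestoreInProgress.TYPE);
        if (restoresInProgress != null) {
            for (RestoreInProgress.Entry restoreInProgress : restoresInProgress.entries()) {
                if (restoreInProgress.snapshot().equals(snapshot)) {
                    RestoreInProgress.ShardRestoreStatus shardRestoreStatus = restoreInProgress.shards().get(shardRouting.shardId());
                    // shardRestoreStatus can be null when a later restore targets a different subset of indices
                    if (shardRestoreStatus != null && shardRestoreStatus.state().completed() == false) {
                        assert shardRestoreStatus.state() != RestoreInProgress.State.SUCCESS
                            : "expected initializing shard " + shardRouting + " but restore state was " + shardRestoreStatus.state();
                        return allocation.decision(Decision.YES, NAME, "shard is currently being restored");
                    }
                    break;
                }
            }
        }
        return allocation.decision(Decision.NO, NAME,
            "shard has failed to be restored from the snapshot [%s] - manually close or delete the index [%s] in order to retry " +
            "to restore the snapshot again or use the reroute API to force the allocation of an empty primary shard",
            snapshot, shardRouting.getIndexName());
    }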

tlrx added 7 commits Nov 22, 2017
When the allocation of a shard has been retried too many times, the
MaxRetryDecider is engaged to prevent any future allocation of the
failed shard. If it happens while restoring a snapshot, the restore
hangs and never completes because it stays around waiting for the
shards to be assigned. It also blocks future attempts to restore the
snapshot again.

This commit changes the current behavior in order to fail the restore if
a shard reaches the maximum number of allocation attempts without being
successfully assigned.

This is the second part of the #26865 issue.

closes #26865
tlrx added 2 commits Dec 7, 2017
@tlrx tlrx force-pushed the tlrx:fail-restore-on-max-retries-allocations branch to 8ced615 Dec 7, 2017
@tlrx (Author, Member) commented Dec 8, 2017:

Thanks @ywelsch!

@imotov This pull request has changed since you LGTM'd it; would you like to have another look?

@imotov (Member) approved these changes Dec 11, 2017:

LGTM. Thanks!

@tlrx tlrx merged commit a1ed347 into elastic:master Dec 12, 2017
2 checks passed:
CLA: Commit author is a member of Elasticsearch
elasticsearch-ci: Build finished.
@tlrx (Author, Member) commented Dec 12, 2017:

Thanks a lot @ywelsch and @imotov. If you're OK with it, I'd like to keep this in master for a few days before backporting; I'd like to see it run multiple times on our CI.

tlrx added a commit that referenced this pull request Dec 22, 2017
…27493)

This commit changes the RestoreService so that it now fails the snapshot
restore if one of the shards to restore has failed to be allocated. It also adds
a new RestoreInProgressAllocationDecider that forbids such shards to be
allocated again. This way, when a restore is impossible or has failed too many
times, the user is forced to take a manual action (like deleting the index
with the failed shards) in order to try to restore it again.

This behaviour has been implemented because when the allocation of a
shard has been retried too many times, the MaxRetryDecider is engaged
to prevent any future allocation of the failed shard. If this happens while
restoring a snapshot, the restore hung and was never completed because
it stayed around waiting for the shards to be assigned (which would never happen).
It also blocked future attempts to restore the snapshot again. With this commit,
the restore does not hang and is marked as failed, leaving the failed shards
around for investigation.

This is the second part of the #26865 issue.

Closes #26865
@tlrx tlrx added the v6.2.0 label Dec 22, 2017
tlrx added a commit that referenced this pull request Dec 22, 2017
@tlrx tlrx added the v6.1.2 label Dec 22, 2017
tlrx added a commit that referenced this pull request Dec 22, 2017
@tlrx tlrx added the v6.0.2 label Dec 22, 2017
tlrx added a commit that referenced this pull request Dec 22, 2017
@tlrx tlrx added v5.6.6 and removed backport pending labels Dec 22, 2017
@tlrx (Author, Member) commented Dec 22, 2017:

Backported to 6.x, 6.1, 6.0 and 5.6.
