
Fail restore when the shard allocations max retries count is reached #27493

Merged
merged 12 commits into elastic:master from the fail-restore-on-max-retries-allocations branch on Dec 12, 2017

Conversation

@tlrx tlrx commented Nov 22, 2017

When the allocation of a shard has been retried too many times, the
MaxRetryDecider is engaged to prevent any future allocation of the
failed shard. If it happens while restoring a snapshot, the restore
hangs and never completes because it stays around waiting for the
shards to be assigned. It also blocks future attempts to restore the
snapshot again.

This commit changes the current behaviour in order to fail the restore if
a shard reached the maximum allocations attempts without being successfully
assigned.

This is the second part of the #26865 issue.

// check if the maximum number of attempts to restore the shard has been reached. If so, we can fail
// the restore and leave the shards unassigned.
IndexMetaData indexMetaData = metaData.getIndexSafe(unassignedShard.getKey().getIndex());
int maxRetry = MaxRetryAllocationDecider.SETTING_ALLOCATION_MAX_RETRY.get(indexMetaData.getSettings());
Contributor:

I think the solution needs to be more generic than depending on the settings of specific allocation deciders. I think we can use unassignedInfo.getLastAllocationStatus for that and check if it is DECIDERS_NO.
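
For context, here is a minimal sketch (not the exact code from this PR) of what such a generic check could look like in the routing-changes listener; markShardAsFailed is a hypothetical helper, everything else uses types already visible in this review:

    @Override
    public void unassignedInfoUpdated(ShardRouting unassignedShard, UnassignedInfo newUnassignedInfo) {
        final RecoverySource recoverySource = unassignedShard.recoverySource();
        if (recoverySource != null && recoverySource.getType() == RecoverySource.Type.SNAPSHOT) {
            // DECIDERS_NO means every decider said no on every node: the shard will never be
            // allocated, so mark its restore as failed instead of waiting forever
            if (newUnassignedInfo.getLastAllocationStatus() == UnassignedInfo.AllocationStatus.DECIDERS_NO) {
                Snapshot snapshot = ((SnapshotRecoverySource) recoverySource).snapshot();
                markShardAsFailed(snapshot, unassignedShard.shardId(), // hypothetical helper
                    "shard could not be allocated to any of the nodes");
            }
        }
    }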

Member Author:

That's a good suggestion, as I suppose that a restore can also be stuck because the deciders cannot assign the shard (not enough space on disk, awareness rules forbid allocation, etc.). I also like that it's more generic.

I think I can give it a try by reverting a portion of the code and overriding unassignedInfoUpdated()... I'll push something if it works.

@imotov imotov (Contributor) left a comment:

LGTM. Thanks a lot for fixing it!

tlrx commented Nov 23, 2017

Thanks for your reviews. I think @ywelsch is right, we can be more generic by checking the last allocation status and fail the restore if needed.

I updated the code to revert some changes I made and use the last allocation status instead. I also added another test based on shard allocation filtering settings that prevent the shards from being assigned. Please let me know what you think :)

@ywelsch ywelsch (Contributor) left a comment:

I prefer this approach over the previous one. I've left some more suggestions here and there. It will be interesting to see how this plays together with #27086,
where we plan on adding a delay between retries of failed allocations. If we chose to infinitely retry (with exponential backoff), then this would be disastrous for the restore here (as it would never abort).

if (newUnassignedInfo.getLastAllocationStatus() == UnassignedInfo.AllocationStatus.DECIDERS_NO) {
Snapshot snapshot = ((SnapshotRecoverySource) recoverySource).snapshot();
String reason = "shard was denied allocation by all allocation deciders";
changes(snapshot).unassignedShards.put(unassignedShard.shardId(),
Contributor:

I think that instead of adding more types here (unassignedShards), it would be better to do the reverse and fold failedShards, startedShards and unassignedShards into a single "updates" structure.
It's not worth separating them just to have this one assertion I've put there.
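
For illustration, a minimal sketch of that folding, reusing the changes(snapshot) accessor pattern visible in the quoted diff; the shardChanges field name is hypothetical:

    // hypothetical: a single map of per-shard restore status updates replaces the
    // separate failedShards, startedShards and unassignedShards maps
    private final Map<ShardId, RestoreInProgress.ShardRestoreStatus> shardChanges = new HashMap<>();

    // the unassigned case then becomes just another entry in the same map
    // (null node id because the shard is unassigned)
    changes(snapshot).shardChanges.put(unassignedShard.shardId(),
        new RestoreInProgress.ShardRestoreStatus(null, RestoreInProgress.State.FAILURE, reason));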

Member Author:

Pff I should have seen that... I agree, that would be better, thanks :)

if (recoverySource.getType() == RecoverySource.Type.SNAPSHOT) {
if (newUnassignedInfo.getLastAllocationStatus() == UnassignedInfo.AllocationStatus.DECIDERS_NO) {
Snapshot snapshot = ((SnapshotRecoverySource) recoverySource).snapshot();
String reason = "shard was denied allocation by all allocation deciders";
Contributor:

I would just put "shard could not be allocated on any of the nodes"

.setSettings(Settings.builder().put("location", repositoryLocation)));

// create a test index
final String indexName = randomAlphaOfLength(10).toLowerCase(Locale.ROOT);
Contributor:

just choose a fixed index name, no need for randomization here :)


// attempt to restore the snapshot
RestoreSnapshotResponse restoreResponse = client().admin().cluster().prepareRestoreSnapshot("test-repo", "test-snap")
.setMasterNodeTimeout("30s")
Contributor:

isn't this the default?

Member Author:

It is, thanks

assertBusy(() -> {
Client client = client();
for (int shardId = 0; shardId < numShards.numPrimaries; shardId++) {
ClusterAllocationExplainResponse allocationResponse = client.admin().cluster().prepareAllocationExplain()
Contributor:

No need for the allocation explain API; you can check all this directly on the cluster state that you get below. I also think that assertBusy won't be needed then.
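
A possible shape for that cluster-state based check (the exact assertions here are illustrative, not the committed test code):

    ClusterState clusterState = client().admin().cluster().prepareState().get().getState();
    // with allocation blocked, every primary of the restored index should remain unassigned
    // and the last allocation status should show that the deciders said NO
    for (ShardRouting shard : clusterState.routingTable().index(indexName).shardsWithState(ShardRoutingState.UNASSIGNED)) {
        assertThat(shard.unassignedInfo().getLastAllocationStatus(),
            equalTo(UnassignedInfo.AllocationStatus.DECIDERS_NO));
    }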

Member Author:

Right, we can use the cluster state in this case and it's even easier. But it seems that the assertBusy() is still needed, as the cluster state change can take a few milliseconds to propagate.

// Option 2: delete the index and restore again
assertAcked(client().admin().indices().prepareDelete(indexName));
restoreResponse = client().admin().cluster().prepareRestoreSnapshot("test-repo", "test-snap")
.setMasterNodeTimeout("30s")
Contributor:

idem

}

// Wait for the shards to be assigned
waitForRelocation();
Contributor:

just ensureGreen()

* Test that restoring an index with shard allocation filtering settings that prevent
* its allocation does not hang indefinitely.
*/
public void testUnrestorableIndexDuringRestore() throws Exception {
Contributor:

I think it's possible to share most of the code with the previous test by calling a generic method with two parameters (a rough sketch follows below):

  • restoreIndexSettings (which would set maxRetries in the first case and filters in the second case)
  • fixupAction (the action to run to fix the issue)
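
A hedged sketch of what such a helper could look like; the method name and the elided assertions are illustrative, not the code that was eventually committed:

    // hypothetical helper name; uses only test client calls already shown in this review
    private void assertUnrestorableIndex(final String indexName, final Settings restoreIndexSettings,
                                         final Runnable fixupAction) throws Exception {
        // restore the snapshot with index settings that are expected to block shard allocation
        client().admin().cluster().prepareRestoreSnapshot("test-repo", "test-snap")
            .setIndexSettings(restoreIndexSettings)
            .setWaitForCompletion(false)
            .get();

        // ... assert via the cluster state that the restore ends up marked as failed because
        // the shards could not be allocated to any node ...

        // apply the corrective action, e.g. delete the partially restored index
        fixupAction.run();

        // restoring the snapshot again (without the blocking settings) should now succeed
        client().admin().cluster().prepareRestoreSnapshot("test-repo", "test-snap")
            .setWaitForCompletion(true)
            .get();
        ensureGreen(indexName);
    }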

Member Author:

I tried to do something like that, please let me know what you think.

tlrx commented Nov 24, 2017

Thanks for the reviews! I updated the code, can you please have another look?

If we chose to infinitely retry (with exponential backoff), then this would be disastrous for the restore here (as it would never abort).

Right, and this is what this pull request tries to avoid. In case of infinite retries we could maybe add a timeout/maximum execution time at the restore request level and fail the restore if this timeout is reached.

@tlrx tlrx force-pushed the fail-restore-on-max-retries-allocations branch from 42bb551 to b32842d on December 5, 2017 09:35
tlrx commented Dec 5, 2017

@ywelsch I added the allocation decider we talked about, and I had to rebase and adapt an existing test. The CI is still failing but I'm not sure it's related to my change; I'm still digging. Anyway, I'd be happy if you could have a look, thanks!

@ywelsch ywelsch (Contributor) left a comment:

Change looks great. I've left some minor comments

if (recoverySource.getType() == RecoverySource.Type.SNAPSHOT) {
if (newUnassignedInfo.getLastAllocationStatus() == UnassignedInfo.AllocationStatus.DECIDERS_NO) {
Snapshot snapshot = ((SnapshotRecoverySource) recoverySource).snapshot();
String reason = "shard could not be allocated on any of the nodes";
Contributor:

allocated to


final RecoverySource recoverySource = shardRouting.recoverySource();
if (recoverySource == null || recoverySource.getType() != RecoverySource.Type.SNAPSHOT) {
return allocation.decision(Decision.YES, NAME, "shard is not being restored");
Contributor:

I would use the following message:
"ignored as shard is not being recovered from a snapshot"
and not have an explicit check for shardRouting.primary() == false. That case is automatically handled here too, as replica shards are never recovered from a snapshot (their recovery source is always PEER).

Member Author:

Right, thanks

}
}
return allocation.decision(Decision.NO, NAME, "shard has failed to be restored from the snapshot [%s] because of [%s] - " +
"manually delete the index [%s] in order to retry to restore the snapshot again or use the Reroute API to force the " +
Contributor:

"close or delete the index"

I would also lowercase the "reroute API"

if (restoresInProgress != null) {
for (RestoreInProgress.Entry restoreInProgress : restoresInProgress.entries()) {
if (restoreInProgress.snapshot().equals(snapshot)) {
assert restoreInProgress.state().completed() == false : "completed restore should have been removed from cluster state";
Contributor:

I think this assertion is not correct.
If a restore for a shard fails 5 times, it is marked as completed only in one of the next cluster state updates (see cleanupRestoreState).

Member Author:

The assertion checks that the restore in progress for the current allocation is not completed, so I think it's good? It will be marked as completed later, as you noticed.

Member Author:

Note: we talked about this and Yannick is right, this assertion can be problematic on busier clusters if a reroute kicks in between the moment the restore completes and the moment the restore is removed from the cluster state by the CleanRestoreStateTaskExecutor.

if (restoreInProgress.snapshot().equals(snapshot)) {
assert restoreInProgress.state().completed() == false : "completed restore should have been removed from cluster state";
RestoreInProgress.ShardRestoreStatus shardRestoreStatus = restoreInProgress.shards().get(shardRouting.shardId());
if (shardRestoreStatus.state() != RestoreInProgress.State.FAILURE) {
Contributor:

just wondering if it's possible for shardRestoreStatus to be null.
I think it can be if you restore from a snapshot, then the restore fails, and you retry another restore with a different subset of indices from that same snapshot.

Contributor:

I would write this check as

if (shardRestoreStatus.state().completed() == false) {

and then add an assertion that shardRestoreStatus.state() != SUCCESS (as the shard should have been moved to started and the recovery source cleaned up at that point).

Member Author:

just wondering if it's possible for shardRestoreStatus to be null.
I think it can be if you restore from a snapshot, then the restore fails, and you retry another restore with a different subset of indices from that same snapshot.

Good catch, thanks!

}

@Override
public Decision canAllocate(final ShardRouting shardRouting, final RoutingNode node, final RoutingAllocation allocation) {
Contributor:

can you also add

@Override
    public Decision canForceAllocatePrimary(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
        assert shardRouting.primary() : "must not call canForceAllocatePrimary on a non-primary shard " + shardRouting;
        return canAllocate(shardRouting, node, allocation);
    }

as this is a hard constraint with no exceptions

Member Author:

Good catch

RestoreInProgress.State.FAILURE, "recovery source type changed from snapshot to " + initializedShard.recoverySource()));
}
}

@Override
public void unassignedInfoUpdated(ShardRouting unassignedShard, UnassignedInfo newUnassignedInfo) {
if (unassignedShard.primary()) {
Contributor:

only primaries can have a snapshot recovery source, so no need for this extra check here.

tlrx commented Dec 5, 2017

@ywelsch Thanks for the review! I updated the code.

@ywelsch ywelsch (Contributor) left a comment:

LGTM

if (restoreInProgress.snapshot().equals(snapshot)) {
RestoreInProgress.ShardRestoreStatus shardRestoreStatus = restoreInProgress.shards().get(shardRouting.shardId());
if (shardRestoreStatus != null && shardRestoreStatus.state().completed() == false) {
assert shardRestoreStatus.state() != RestoreInProgress.State.SUCCESS : "expected initializing shard";
Contributor:

can you also add the shardRestoreStatus state and the shard routing to the failure message here?

Member Author:

Sure, will do once the current CI build is finished

@tlrx tlrx force-pushed the fail-restore-on-max-retries-allocations branch from b8476f6 to 8ced615 on December 7, 2017 21:01
tlrx commented Dec 8, 2017

Thanks @ywelsch!

@imotov This pull request has changed since you LGTM'd it, would you like to have another look?

@imotov imotov (Contributor) left a comment:

LGTM. Thanks!

@tlrx tlrx merged commit a1ed347 into elastic:master Dec 12, 2017
tlrx commented Dec 12, 2017

Thanks a lot @ywelsch and @imotov. If you're OK with it, I'd like to keep this in master for a few days before backporting. I'd like to see it run multiple times on our CI.

tlrx added a commit that referenced this pull request Dec 22, 2017
…27493)

This commit changes the RestoreService so that it now fails the snapshot
restore if one of the shards to restore has failed to be allocated. It also adds
a new RestoreInProgressAllocationDecider that forbids such shards to be
allocated again. This way, when a restore is impossible or has failed too many
times, the user is forced to take a manual action (like deleting the index
whose shards failed to be allocated) in order to try to restore it again.

This behaviour has been implemented because when the allocation of a
shard has been retried too many times, the MaxRetryDecider is engaged
to prevent any future allocation of the failed shard. If this happened while
restoring a snapshot, the restore hung and was never completed because
it stayed around waiting for the shards to be assigned (which would never happen).
It also blocked future attempts to restore the snapshot again. With this commit,
the restore does not hang and is marked as failed, leaving the failed shards
around for investigation.

This is the second part of the #26865 issue.

Closes #26865
@tlrx tlrx added the v6.2.0 label Dec 22, 2017
tlrx added a commit that referenced this pull request Dec 22, 2017
…27493)

@tlrx tlrx added the v6.1.2 label Dec 22, 2017
tlrx added a commit that referenced this pull request Dec 22, 2017
…27493)

@tlrx tlrx added the v6.0.2 label Dec 22, 2017
tlrx added a commit that referenced this pull request Dec 22, 2017
…27493)

tlrx commented Dec 22, 2017

Backported to 6.x, 6.1, 6.0 and 5.6
