Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move reset recovery into RecoveriesCollection #19466

Merged
merged 5 commits into from Jul 19, 2016

Conversation

s1monw
Copy link
Contributor

@s1monw s1monw commented Jul 18, 2016

Today when we reset a recovery because of the source not being
ready or the shard is getting removed on the source (for whatever reason)
we wipe all temp files and reset the recovery without respecting any
reference counting or locking etc. all streams are closed and files are
wiped. Yet, this is problematic since we assert that some files are on disk
etc. when we finish writing a file. These assertions don't hold anymore if we
concurrently wipe the tmp files.

This change moves the logic out of RecoveryTarget into RecoveriesCollection which
basically clones the RecoveryTarget on reset instead which allows in-flight operations
to finish gracefully. This means we now have a single path for cleanups in RecoveryTarget
and can safely use assertions in the class since files won't be removed unless the recovery
is either cancelled, failed or finished.

Closes #19473

Today when we reset a recovery because of the source not being
ready or the shard is getting removed on the source (for whatever reason)
we wipe all temp files and reset the recovery without respecting any
reference counting or locking etc. all streams are closed and files are
wiped. Yet, this is problematic since we assert that some files are on disk
etc. when we finish writing a file. These assertions don't hold anymore if we
concurrently wipe the tmp files.

This change moves the logic out of RecoveryTarget into RecoveriesCollection which
bascially clones the RecoveryTarget on reset instead which allows in-flight operations
to finish gracefully. This means we now have a single path for cleanups in RecoveryTarget
and can safely use assertions in the class since files won't be removed unless the recovery
is either cancled, failed or finished.
@s1monw s1monw added >bug review :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. v5.0.0-alpha5 labels Jul 18, 2016
@s1monw
Copy link
Contributor Author

s1monw commented Jul 18, 2016

@bleskes can you take a look?

success = true;
} finally {
if (success == false) {
copy.cancel("reset failed");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to remove this from onGoingRecoveries and only call cancel if it was found, otherwise we may leave a lingering recovery id, pointing at a cancelled recovery.

@bleskes
Copy link
Contributor

bleskes commented Jul 18, 2016

It took a while to convince my self that keeping the same recovery id is the right way to go for now. I like how it's done and left some minor comments. Thx @s1monw

@s1monw
Copy link
Contributor Author

s1monw commented Jul 18, 2016

@bleskes I pushed new commits

success = true;
} finally {
if (success == false) {
copy.cancel("reset failed");
onGoingRecoveries.remove(id);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: now that we folded the close and performRecoveryRestart into status.resetRecovery() we don't need the success pattern anymore. This can be:

if (onGoingRecoveries.replace(id, status, resetRecovery) == false) {
       resetRecovery.cancel("replace failed");
       throw new IllegalStateException("failed to replace recovery target");
}

@bleskes
Copy link
Contributor

bleskes commented Jul 18, 2016

LGTM. Left minor suggestions - feel free to accept or reject and push away.

success = true;
} finally {
if (success == false) {
onGoingRecoveries.remove(id);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copying from a comment on the commit-

nit: now that we folded the close and performRecoveryRestart into status.resetRecovery() we don't need the success pattern anymore. This can be:

if (onGoingRecoveries.replace(id, status, resetRecovery) == false) {
       resetRecovery.cancel("replace failed");
       throw new IllegalStateException("failed to replace recovery target");
}

@s1monw s1monw merged commit 5b07f81 into elastic:master Jul 19, 2016
@s1monw s1monw deleted the fix_recovery_reset branch July 19, 2016 08:23
@s1monw s1monw removed the review label Jul 19, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. v5.0.0-alpha5
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cancellation of recovery deletes files still held onto by writes
2 participants