New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move reset recovery
into RecoveriesCollection
#19466
Conversation
Today when we reset a recovery because of the source not being ready or the shard is getting removed on the source (for whatever reason) we wipe all temp files and reset the recovery without respecting any reference counting or locking etc. all streams are closed and files are wiped. Yet, this is problematic since we assert that some files are on disk etc. when we finish writing a file. These assertions don't hold anymore if we concurrently wipe the tmp files. This change moves the logic out of RecoveryTarget into RecoveriesCollection which bascially clones the RecoveryTarget on reset instead which allows in-flight operations to finish gracefully. This means we now have a single path for cleanups in RecoveryTarget and can safely use assertions in the class since files won't be removed unless the recovery is either cancled, failed or finished.
@bleskes can you take a look? |
success = true; | ||
} finally { | ||
if (success == false) { | ||
copy.cancel("reset failed"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to remove this from onGoingRecoveries and only call cancel if it was found, otherwise we may leave a lingering recovery id, pointing at a cancelled recovery.
It took a while to convince my self that keeping the same recovery id is the right way to go for now. I like how it's done and left some minor comments. Thx @s1monw |
@bleskes I pushed new commits |
success = true; | ||
} finally { | ||
if (success == false) { | ||
copy.cancel("reset failed"); | ||
onGoingRecoveries.remove(id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: now that we folded the close and performRecoveryRestart into status.resetRecovery()
we don't need the success pattern anymore. This can be:
if (onGoingRecoveries.replace(id, status, resetRecovery) == false) {
resetRecovery.cancel("replace failed");
throw new IllegalStateException("failed to replace recovery target");
}
LGTM. Left minor suggestions - feel free to accept or reject and push away. |
success = true; | ||
} finally { | ||
if (success == false) { | ||
onGoingRecoveries.remove(id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copying from a comment on the commit-
nit: now that we folded the close and performRecoveryRestart into status.resetRecovery()
we don't need the success pattern anymore. This can be:
if (onGoingRecoveries.replace(id, status, resetRecovery) == false) {
resetRecovery.cancel("replace failed");
throw new IllegalStateException("failed to replace recovery target");
}
Today when we reset a recovery because of the source not being
ready or the shard is getting removed on the source (for whatever reason)
we wipe all temp files and reset the recovery without respecting any
reference counting or locking etc. all streams are closed and files are
wiped. Yet, this is problematic since we assert that some files are on disk
etc. when we finish writing a file. These assertions don't hold anymore if we
concurrently wipe the tmp files.
This change moves the logic out of RecoveryTarget into RecoveriesCollection which
basically clones the RecoveryTarget on reset instead which allows in-flight operations
to finish gracefully. This means we now have a single path for cleanups in RecoveryTarget
and can safely use assertions in the class since files won't be removed unless the recovery
is either cancelled, failed or finished.
Closes #19473