[FLINK-5007] [checkpointing] Retain externalized checkpoint on suspension #2750

Closed
wants to merge 1 commit into from

uce
Contributor

@uce uce commented Nov 3, 2016

Handles graceful cluster shutdown (non-HA) like a cancellation and respects the configured cleanup behaviour.

ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION => delete on suspension

ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION => retain on suspension

@uce
Contributor Author

uce commented Nov 28, 2016

@StephanEwen Do you have time to look at this? Currently, when externalized checkpoints are configured and the cluster shuts down via suspending all jobs, the externalized checkpoints are cleaned up. This PR proposes to handle suspension like a cancellation and respect the corresponding cleanup configuration, e.g. retain if RETAIN_ON_CANCELLATION and delete if DELETE_ON_CANCELLATION.

@StephanEwen
Contributor

StephanEwen commented Dec 6, 2016

A suspension is usually a master failure or loss of leadership. Suspending a job does not delete/remove any HA checkpoints.

The same should be the case for externalized checkpoints, in my opinion. Why should an externalized checkpoint be deleted on suspension when a regular HA checkpoint is not?

@uce
Contributor Author

uce commented Dec 6, 2016

In HA mode, checkpoints are not deleted on suspension. This PR won't change that behaviour. It only affects non-HA behaviour.

Currently, the behaviour is to remove checkpoints on suspension, which is definitely a problem. But in non-HA mode, suspension also happens on graceful shutdown (for example, when terminating a YARN session). Never deleting on suspension would mean that users who have DELETE_ON_CANCELLATION configured end up with externalized checkpoints lingering around after they shut down their non-HA cluster. That's why I thought it might be better to apply the retain/delete-on-cancellation configuration to suspension as well.

Does this make sense? For HA, these settings do not apply during suspension.
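
The behaviour described in this comment can be summarised in a small, self-contained sketch. This is not the code from the patch: the class `SuspensionCleanupSketch` and the helper `deleteOnSuspension` are hypothetical names used only for illustration, and the enum merely mirrors the two cleanup modes named in this thread. It assumes that HA mode always retains checkpoints on suspension and that non-HA mode follows the cancellation setting.

```java
public class SuspensionCleanupSketch {

    /** Mirrors the two cleanup modes discussed in this thread (illustrative copy, not Flink's class). */
    enum ExternalizedCheckpointCleanup {
        DELETE_ON_CANCELLATION,
        RETAIN_ON_CANCELLATION
    }

    /**
     * Hypothetical helper: decides whether an externalized checkpoint
     * should be deleted when the job is suspended.
     */
    static boolean deleteOnSuspension(boolean highAvailability, ExternalizedCheckpointCleanup cleanup) {
        if (highAvailability) {
            // HA: suspension usually means master failure or loss of
            // leadership, so checkpoints are never deleted.
            return false;
        }
        // Non-HA: suspension also covers graceful shutdown, so it is
        // treated like a cancellation and follows the configured mode.
        return cleanup == ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION;
    }

    public static void main(String[] args) {
        // Print the decision for every combination of mode and setting.
        for (boolean ha : new boolean[] {true, false}) {
            for (ExternalizedCheckpointCleanup cleanup : ExternalizedCheckpointCleanup.values()) {
                System.out.println("ha=" + ha + ", cleanup=" + cleanup
                        + " -> delete=" + deleteOnSuspension(ha, cleanup));
            }
        }
    }
}
```

Only the non-HA case with DELETE_ON_CANCELLATION results in deletion; the other three combinations retain the externalized checkpoint.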

@StephanEwen
Contributor

Okay, understood, that makes sense.

+1 from my side

…sion

Handles graceful cluster shut down (non-HA) like cancellation.
@uce uce closed this Dec 13, 2016
@uce uce deleted the 5007-suspend_external branch February 16, 2017 09:26
3 participants