[FLINK-5193] [jm] Harden job recovery in case of recovery failures #2909

Closed

tillrohrmann
Contributor

When recovering multiple jobs, a single recovery failure prevented all of the jobs from being recovered.
This PR changes this behaviour to make the recovery of each job independent, so that a single
failure no longer causes the complete recovery to fail. Furthermore, this PR improves the error reporting
for failures originating in the ZooKeeperSubmittedJobGraphStore.

Add test case

Fix failing JobManagerHACheckpointRecoveryITCase
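
For readers skimming the thread, the idea of recovering each job independently can be sketched roughly as follows. This is only an illustration, not the code in this PR: `JobStore`, `RecoveryResult`, `recoverAll` and `demoStore` are hypothetical names invented for the sketch and do not correspond to Flink's actual interfaces. Each job is recovered inside its own try/catch, and a failure is recorded together with the job id instead of aborting the loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IndependentRecoverySketch {

    /** Hypothetical stand-in for a submitted job graph store (not Flink's API). */
    interface JobStore {
        List<String> jobIds();

        String recover(String jobId) throws Exception;
    }

    /** Collects the jobs that were recovered and the failures, keyed by job id. */
    static final class RecoveryResult {
        final List<String> recovered = new ArrayList<>();
        final Map<String, Exception> failures = new LinkedHashMap<>();
    }

    /** Recovers every job on its own; one failing job does not stop the others. */
    static RecoveryResult recoverAll(JobStore store) {
        RecoveryResult result = new RecoveryResult();
        for (String jobId : store.jobIds()) {
            try {
                result.recovered.add(store.recover(jobId));
            } catch (Exception e) {
                // Wrap the cause so the report names the job that failed.
                result.failures.put(
                        jobId, new Exception("Could not recover job " + jobId + ".", e));
            }
        }
        return result;
    }

    /** Assumed example store in which the second of three jobs cannot be recovered. */
    static JobStore demoStore() {
        return new JobStore() {
            @Override
            public List<String> jobIds() {
                return Arrays.asList("job-a", "job-b", "job-c");
            }

            @Override
            public String recover(String jobId) throws Exception {
                if (jobId.equals("job-b")) {
                    throw new Exception("simulated corrupted state handle");
                }
                return jobId;
            }
        };
    }

    public static void main(String[] args) {
        RecoveryResult result = recoverAll(demoStore());
        System.out.println("recovered: " + result.recovered);       // recovered: [job-a, job-c]
        System.out.println("failed: " + result.failures.keySet());  // failed: [job-b]
    }
}
```

In this sketch the failure of `job-b` is reported with its job id while `job-a` and `job-c` are still recovered, which is the behaviour described above.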

@tillrohrmann
Contributor Author

Forwarding the reviews from @uce and @StephanEwen on the backport to this PR.

I am rebasing on the latest master; if Travis passes, I will merge this PR.

@tillrohrmann
Contributor Author

Merging...

@asfgit asfgit closed this in add3765 Dec 9, 2016
static-max pushed a commit to static-max/flink that referenced this pull request Dec 13, 2016
This closes apache#2909.
joseprupi pushed a commit to joseprupi/flink that referenced this pull request Feb 12, 2017
@tillrohrmann tillrohrmann deleted the fixJobRecoveryFailure branch March 6, 2017 14:56