[FLINK-27572][FLINK-27594] Ensure HA metadata is present before restoring job with last state #212
cc @Aitozi @tweise @wangyang0918 @morhidi This is a larger change that fixes a couple of outstanding critical issues and also aims to make the whole flow a bit simpler, more robust and easier to understand. I would appreciate your review/feedback :)
Great job dealing with these important blocker issues in one shot. I have carefully gone through this PR and do not have major comments. I will run some more manual tests on this PR today and then share the feedback.
- Trigger fatal error when upgrade progress is stuck
BTW, I do not fully understand why the upgrade progress could be stuck and how you fix it.
...src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java
.../main/java/org/apache/flink/kubernetes/operator/observer/deployment/ApplicationObserver.java
"Cannot perform savepoint upgrade on missing/failed JobManager deployment",
"JobManager deployment is missing and HA data is not available to make stateful upgrades. "
+ "It is possible that the job has finished or terminally failed, or the configmaps have been deleted. "
+ "Manual restore required.",
It could be done in another ticket, but we might need to document how to do the manual restore. It is important for the 1.14 and 1.13 releases.
I have a ticket open for these docs; I will work on that next week to have it ready for the release.
By the way, this is the exact situation that answers your question regarding the "stuck" upgrade progress: the JM deployment is missing, and HA metadata is also not available to get the last-state info.
We cannot recover from that automatically, which is why we throw an error.
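To make the case described above concrete, here is a minimal, self-contained sketch of that decision. It is not the operator's actual code: the class `UpgradeGuardSketch`, the `Decision` enum, the `UpgradeNotPossibleException` type and the two boolean inputs are all hypothetical names chosen for illustration. Only the error message text is taken from the diff quoted in this thread.

```java
public class UpgradeGuardSketch {

    /** Hypothetical exception; the operator's real exception type may differ. */
    public static class UpgradeNotPossibleException extends RuntimeException {
        public UpgradeNotPossibleException(String message) {
            super(message);
        }
    }

    /** Illustrative outcomes only; these are not names from the operator code base. */
    public enum Decision {
        PROCEED_WITH_RUNNING_JM,
        RESTORE_FROM_HA_METADATA
    }

    /**
     * Decides whether a stateful upgrade can go ahead.
     * Assumption: the two flags are computed elsewhere (for example by observing
     * the cluster); how they are obtained is outside the scope of this sketch.
     */
    public static Decision check(boolean jmDeploymentPresent, boolean haMetadataAvailable) {
        if (jmDeploymentPresent) {
            // The JobManager is still there, so the normal upgrade path applies.
            return Decision.PROCEED_WITH_RUNNING_JM;
        }
        if (haMetadataAvailable) {
            // JobManager is gone, but HA metadata still points at the last state.
            return Decision.RESTORE_FROM_HA_METADATA;
        }
        // Neither source of state is available: nothing safe can be done automatically.
        throw new UpgradeNotPossibleException(
                "JobManager deployment is missing and HA data is not available to make stateful upgrades. "
                        + "It is possible that the job has finished or terminally failed, or the configmaps have been deleted. "
                        + "Manual restore required.");
    }

    public static void main(String[] args) {
        System.out.println(check(true, false));
        System.out.println(check(false, true));
        try {
            check(false, false);
        } catch (UpgradeNotPossibleException e) {
            System.out.println("Upgrade rejected: " + e.getMessage());
        }
    }
}
```

The point of failing loudly in the last branch is that retrying cannot help: without a JobManager deployment or HA metadata there is no state to restore from, so the upgrade would otherwise appear stuck.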
Thanks for the hint.
I have done some manual tests with both the 1.14 and 1.15 versions. It works smoothly and I believe the Flink deployment is much more robust now. 👍🏻
I found some small issues when testing.
Thanks @wangyang0918 for the testing. I will improve the rollback -> upgrade scenario that you encountered. The events should also be fixed after merging #213; I will open a separate blocker ticket for that.
This PR contains the following improvements/fixes for a number of loosely connected blocker issues: