[FLINK-27572][FLINK-27594] Ensure HA metadata is present before restoring job with last state #212

gyfora · 2022-05-13T16:48:11Z

This PR contains the following improvements/fixes for a number of loosely connected blocker issues:

Eliminate bug that causes 0 delay infinite reconcile loop with the UpdateControl management logic
Ensure HA metadata is available when we are relying on it during stateful upgrades
Make JM deployment recovery conditional on availability of HA metadata
Trigger fatal error when upgrade progress is stuck
Clean up and improve stateful upgrade conditions with added detailed debug logging
Greatly simplify code around suspend/cancellation and restore operations in the reconciler
Make sure finished jobs are marked as stable to avoid rollback loop for short jobs

gyfora · 2022-05-13T16:50:24Z

cc @Aitozi @tweise @wangyang0918 @morhidi

This is a larger change that fixes a couple outstanding critical issues and also aims to make the whole flow a bit simpler, more robust and easier to understand. I would appreciate your review/feedback :)

wangyang0918

Great job dealing with these important blocker issues in one shot. I have carefully gone through this PR and do not have major comments. I will run some more manual tests against this PR today and then share the feedback.

Trigger fatal error when upgrade progress is stuck

BTW, I do not fully understand why the upgrade progress could be stuck and how you fix it.

...src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java

.../main/java/org/apache/flink/kubernetes/operator/observer/deployment/ApplicationObserver.java

wangyang0918 · 2022-05-15T03:43:06Z

...n/java/org/apache/flink/kubernetes/operator/reconciler/deployment/ApplicationReconciler.java

-                    "Cannot perform savepoint upgrade on missing/failed JobManager deployment",
+                    "JobManager deployment is missing and HA data is not available to make stateful upgrades. "
+                            + "It is possible that the job has finished or terminally failed, or the configmaps have been deleted. "
+                            + "Manual restore required.",


It could be done in another ticket, but we might need to document how to do the manual restore. It is important for the release 1.14 and 1.13.

I have a ticket open for these docs, I will work on that next week to have it for the release.

By the way this is the exact situation that answers your question regarding the "stuck" upgrade progress. This is a situation where the JM deployment is missing, and also HA is not available to get last state info.

We cannot deal with it, that's why we throw an error.
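
The case described above can be sketched as a small guard. This is only an illustration, not the operator's actual code: the class, enum, and method names are hypothetical, and the two boolean inputs stand in for whatever checks the reconciler really performs. Only the error message is taken from the diff above.

```java
// Illustrative sketch only: all names here are hypothetical, not the operator's API.
class LastStateUpgradeGuard {

    /** Hypothetical outcomes of the check. */
    enum Decision {
        PROCEED,                  // JobManager deployment exists, continue as usual
        RECOVER_FROM_HA_METADATA  // deployment is missing, but HA metadata can restore it
    }

    // Message text copied from the diff in this review thread.
    static final String MANUAL_RESTORE_MESSAGE =
            "JobManager deployment is missing and HA data is not available to make stateful upgrades. "
                    + "It is possible that the job has finished or terminally failed, or the configmaps have been deleted. "
                    + "Manual restore required.";

    /**
     * Decides how to continue, assuming the caller has already determined
     * whether the JM deployment exists and whether HA metadata is available.
     */
    static Decision decide(boolean jmDeploymentPresent, boolean haMetadataAvailable) {
        if (jmDeploymentPresent) {
            return Decision.PROCEED;
        }
        if (haMetadataAvailable) {
            return Decision.RECOVER_FROM_HA_METADATA;
        }
        // Neither the deployment nor the HA metadata exists: there is no
        // last-state information to restore from, so fail loudly.
        throw new IllegalStateException(MANUAL_RESTORE_MESSAGE);
    }

    public static void main(String[] args) {
        System.out.println(decide(true, true));
        System.out.println(decide(false, true));
        try {
            decide(false, false);
        } catch (IllegalStateException e) {
            System.out.println("Fatal: " + e.getMessage());
        }
    }
}
```

The point of the sketch is that the missing-deployment, missing-HA combination has no automatic recovery path, so surfacing an error is preferable to retrying indefinitely.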

Thanks for the hint.

wangyang0918 · 2022-05-16T03:33:15Z

I have done some manual tests with both 1.14 and 1.15 versions. It works smoothly and I believe the Flink deployment is much more robust now. 👍🏻

Invalid image and rollback
Cancel/Fail Flink and verify last-state upgrade. Expect fatal error in 1.14 and successful recovery for 1.15.
Recover missing JobManager deployment
Checking the status and operator logs

I found some small issues when testing.

It is unnecessary to trigger an upgrade when the spec is exactly the same as the stable spec, for example when the invalid image is reverted.
We are generating too many ERROR events for FlinkDeployment. Each reconciliation will append a new one.
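
Both issues could be addressed with simple guards along the following lines. This is a hedged sketch under stated assumptions, not how the operator handles either case: the spec is modelled as a plain `Map<String, String>`, and `ReconcileGuards`, `needsUpgrade`, and `ErrorEventRecorder` are hypothetical names introduced for illustration.

```java
import java.util.Map;
import java.util.Objects;

// Illustrative sketch only: all names here are hypothetical, not the operator's API.
class ReconcileGuards {

    /**
     * Returns true only when the desired spec differs from the last stable spec.
     * Specs are modelled as simple key/value maps for the sake of the example.
     */
    static boolean needsUpgrade(Map<String, String> desired, Map<String, String> lastStable) {
        return !Objects.equals(desired, lastStable);
    }

    /** Emits an error event only when the message differs from the previous one. */
    static final class ErrorEventRecorder {
        private String lastMessage;
        private int emitted;

        /** Returns true if a new event was recorded, false if it was a repeat. */
        boolean record(String message) {
            if (Objects.equals(message, lastMessage)) {
                return false; // same error as in the last reconciliation, skip it
            }
            lastMessage = message;
            emitted++;
            return true;
        }

        int emittedCount() {
            return emitted;
        }
    }

    public static void main(String[] args) {
        Map<String, String> stable = Map.of("image", "flink:1.15");
        Map<String, String> reverted = Map.of("image", "flink:1.15");
        System.out.println("needs upgrade: " + needsUpgrade(reverted, stable));

        ErrorEventRecorder recorder = new ErrorEventRecorder();
        recorder.record("Invalid image");
        recorder.record("Invalid image"); // repeated reconciliation, no new event
        System.out.println("events emitted: " + recorder.emittedCount());
    }
}
```

Comparing only against the immediately preceding message keeps the sketch small; a real implementation would likely need to account for event timestamps and counts as well.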

gyfora · 2022-05-16T06:57:16Z

Thanks @wangyang0918 for the testing. I will improve the rollback -> upgrade scenario that you encountered.

The events should also be fixed; after merging #213 I will open a separate blocker ticket for that.

gyfora requested a review from wangyang0918 May 13, 2022 16:48

wangyang0918 reviewed May 15, 2022

gyfora force-pushed the FLINK-27572 branch from bb9758d to 0036e28 Compare May 16, 2022 09:11

[FLINK-27572][FLINK-27594] Ensure HA metadata is present before restoring job with last state

d2f1efe

gyfora force-pushed the FLINK-27572 branch from 0036e28 to d2f1efe Compare May 16, 2022 09:19

gyfora merged commit a0aca64 into apache:main May 16, 2022

gyfora deleted the FLINK-27572 branch June 27, 2022 15:50
