Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-27572][FLINK-27594] Ensure HA metadata is present before restoring job with last state #212

Merged
merged 1 commit into from
May 16, 2022

Conversation

gyfora
Copy link
Contributor

@gyfora gyfora commented May 13, 2022

This PR contains the following improvements/fixes for a number of loosely connected blocker issues:

  1. Eliminate bug that causes 0 delay infinite reconcile loop with the UpdateControl management logic
  2. Ensure HA metadata is available when we are relying on it during stateful upgrades
  3. Make JM deployment recovery condititional on availablity of HA metadata
  4. Trigger fatal error when upgrade progress is stuck
  5. Clean up and improve stateful upgrade conditions with added detailed debug logging
  6. Greately simplify code around suspend/cancellation and restore operation in the reconciler
  7. Make sure finished jobs are marked as stable to avoid rollback loop for short jobs

@gyfora gyfora requested a review from wangyang0918 May 13, 2022 16:48
@gyfora
Copy link
Contributor Author

gyfora commented May 13, 2022

cc @Aitozi @tweise @wangyang0918 @morhidi

This is a larger change that fixes a couple outstanding critical issues and also aims to make the whole flow a bit simpler, more robust and easier to understand. I would appreciate your review/feedback :)

Copy link
Contributor

@wangyang0918 wangyang0918 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job for dealing with these important blocker issues in one shot. I have carefully gone though this PR and do not have major comments. I will play this PR with some more manual tests today and then share the feedback.

  1. Trigger fatal error when upgrade progress is stuck

BTW, I do not fully understand why the upgrade progress could be stuck and how you fix it.

"Cannot perform savepoint upgrade on missing/failed JobManager deployment",
"JobManager deployment is missing and HA data is not available to make stateful upgrades. "
+ "It is possible that the job has finished or terminally failed, or the configmaps have been deleted. "
+ "Manual restore required.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be done in another ticket. But we might need to document that how to do the manual restore. It is important for the release 1.14 and 1.13.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a ticket open for these docs, I will work on that next week to have it for the release.

By the way this is the exact situation that answers your question regarding the "stuck" upgrade progress. This is a situation where the JM deployment is missing, and also HA is not available to get last state info.

We cannot deal with it, that's why we throw an error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the hint.

@wangyang0918
Copy link
Contributor

I have done some manual tests with both 1.14 and 1.15 versions. It works smoothly and I believe the Flink deployment is pretty more robust now. 👍🏻

  • Invalid image and rollback
  • Cancel/Fail Flink and verify last-state upgrade. Expect fatal error in 1.14 and successful recovery for 1.15.
  • Recover missing JobManager deployment
  • Checking the status and operator logs

I found some small issues when testing.

  1. It is unnecessary to trigger an upgrade when the spec is exactly same with stable spec. For example, the invalid image is reverted.
  2. We are generating too many ERROR events for FlinkDeployment. Each reconciliation will append a new one.
    image

@gyfora
Copy link
Contributor Author

gyfora commented May 16, 2022

Thanks @wangyang0918 for the testing. I will improve the rollback -> upgrade scenario that you encountered.

The events should also be fixed after merging #213 will open a separate blocker ticket for that.

@gyfora gyfora merged commit a0aca64 into apache:main May 16, 2022
@gyfora gyfora deleted the FLINK-27572 branch June 27, 2022 15:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants