[FLINK-27675] Improve manual savepoint tracking #225
+1
I am doing a manual verification in my minikube.
        return Optional.of(errorMsg);
    } else {
        LOG.info("Savepoint operation not running yet, waiting within grace period...");
"not running yet" -> "not finished yet"? The savepoint should already have been triggered successfully at this point.
fixed
thanks!
@@ -98,6 +98,10 @@ public void observe(FlinkDeployment flinkApp, Context context) {
            }
        }

        if (!ReconciliationUtils.isJobRunning(flinkApp.getStatus())) {
We might also need to create an event when we cancel the savepoint operation. At the very least, we should add a log message when an actual reset happens (that is, when the triggerId is not empty).
Added, PTAL
Works well after manual verification. With this PR, savepoint failures are now much easier to find.
Events:
  Type     Reason          Age   From      Message
  ----     ------          ----  ----      -------
  Warning  SavepointError  33m   Operator  Savepoint failed for savepointTriggerNonce: 2
  Warning  SavepointError  17m   Operator  Savepoint failed for savepointTriggerNonce: 3
The savepoint process could still be improved in FLINK-27257 by supporting retries on certain exceptions.
Will merge this PR after CI passes.
Cancelling savepoint operation on application failures
Cancelling savepoint operation on savepoint fetching failures
Generating events for failed savepoint operations
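
For readers following the review, the grace-period behaviour discussed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the operator's actual code: the class `SavepointGraceSketch`, the enum `FetchResult`, and the `observe` signature are all hypothetical names introduced here. Only the idea of returning an `Optional` error message and the event text `Savepoint failed for savepointTriggerNonce: N` come from the conversation.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

/** Hypothetical sketch of grace-period tracking for a manually triggered savepoint. */
public class SavepointGraceSketch {

    /** Assumed outcomes of polling the savepoint status (not the operator's real types). */
    public enum FetchResult { PENDING, COMPLETED, FAILED }

    /**
     * Returns an error message if the savepoint should be treated as failed,
     * or an empty Optional if it completed or is still within the grace period.
     */
    public static Optional<String> observe(
            FetchResult result,
            Instant triggeredAt,
            Instant now,
            Duration gracePeriod,
            long nonce) {
        String errorMsg = "Savepoint failed for savepointTriggerNonce: " + nonce;
        switch (result) {
            case COMPLETED:
                return Optional.empty();
            case FAILED:
                return Optional.of(errorMsg);
            default:
                // Still pending: fail only once the grace period has elapsed.
                if (Duration.between(triggeredAt, now).compareTo(gracePeriod) > 0) {
                    return Optional.of(errorMsg);
                }
                return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Instant triggeredAt = Instant.parse("2022-05-18T00:00:00Z");
        Duration grace = Duration.ofMinutes(1);
        // Ten seconds after triggering: still waiting within the grace period.
        System.out.println(observe(FetchResult.PENDING, triggeredAt,
                triggeredAt.plusSeconds(10), grace, 2));
        // Two minutes after triggering: the grace period has expired.
        System.out.println(observe(FetchResult.PENDING, triggeredAt,
                triggeredAt.plusSeconds(120), grace, 2));
    }
}
```

In this sketch a non-empty result is what the caller would turn into a `SavepointError` event before resetting the trigger; how the real operator wires that up is not shown in the conversation.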