Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-27675] Improve manual savepoint tracking #225

Merged
merged 1 commit into from May 19, 2022

Conversation

morhidi
Copy link
Contributor

@morhidi morhidi commented May 18, 2022

Cancelling savepoint operation on application failures
Cancelling savepoint operation on savepoint fetching failures
Generating events for failed savepoint operations

@morhidi
Copy link
Contributor Author

morhidi commented May 18, 2022

cc @wangyang0918 @SteNicholas

Copy link
Contributor

@wangyang0918 wangyang0918 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@wangyang0918
Copy link
Contributor

I am doing a manual verification in my minikube.

return Optional.of(errorMsg);
} else {
LOG.info("Savepoint operation not running yet, waiting within grace period...");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not running yet -> not finished yet? The savepoint should already be triggered successfully.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@morhidi
Copy link
Contributor Author

morhidi commented May 19, 2022

I am doing a manual verification in my minikube.

thanks!

@@ -98,6 +98,10 @@ public void observe(FlinkDeployment flinkApp, Context context) {
}
}

if (!ReconciliationUtils.isJobRunning(flinkApp.getStatus())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might also need to create an event when we canceling the savepoint operation. At least, we need to add a log when doing a concrete reset(when the triggerId is not empty).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, PTAL

Copy link
Contributor

@wangyang0918 wangyang0918 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works well after manual verification. After this PR, the savepoint failure now is more easily to be found.

Events:
  Type     Reason          Age   From      Message
  ----     ------          ----  ----      -------
  Warning  SavepointError  33m   Operator  Savepoint failed for savepointTriggerNonce: 2
  Warning  SavepointError  17m   Operator  Savepoint failed for savepointTriggerNonce: 3

The savepoint process still could be improved in FLINK-27257 by supporting retrying at some given exception.

@wangyang0918
Copy link
Contributor

Will merge this PR after CI pass.

@wangyang0918 wangyang0918 merged commit 8c21487 into apache:main May 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants