New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-27609] Tracking flink-version and flink-revision in FlinkDeploymentStatus #217
Conversation
if (isJmDeploymentReady(flinkApp)) { | ||
observeClusterInfo(flinkApp, observeConfig); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the cluster info does not change except for upgrading/rollback, we do not need to fetch it for every observation. The rest HTTP call is a little heavy :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point :) Moved it to (hopefully) the right spot. PTAL
ccee350
to
5b1f96b
Compare
@@ -112,6 +125,7 @@ protected void observeJmDeployment( | |||
if (JobManagerDeploymentStatus.DEPLOYED_NOT_READY == previousJmStatus) { | |||
logger.info("JobManager deployment is ready"); | |||
deploymentStatus.setJobManagerDeploymentStatus(JobManagerDeploymentStatus.READY); | |||
observeClusterInfo(flinkApp, effectiveConfig); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JobManagerDeploymentStatus.READY
does not mean the JobManager Rest API is ready for serving. So we might failed to get the cluster info.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion. Updated the logic again, @wangyang0918. PTAL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. Do you have verified this PR could work correctly when upgrade with new image or rollback to old image?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just noticed the broken e2e tests, looking...
@wangyang0918 Do you have any idea why the CI test
|
I managed to debug this issue locally finally on an older mac (there's no Flink v1.13 for silicon :/). It is indeed an 501 on the first try, then a JSON parsing issue on the retries. (The error handling in the rest client is a mess, the client side errors are swallowed completely) I'll try to come up with something tomorrow:
This makes more sense ^^^ than the ones I saw in the CI logs. |
@morhidi very confusing error message indeed. We can easily fix this tomorrow by the copy trick we did in other incompatible cases :) |
@morhidi You could find the following exception in the e2e tests with default namespace. We do not print the operator logs when
|
@gyfora Flink is now providing an experimental REST API specification following the OpenAPI standard. Maybe we could manage to use the pure HTTP request/response for posting/getting the cluster-info, savepoint, checkpoint, etc. TBH, I do not like to copy the classes from Flink and make some tiny changes. I do not mean we need to do this in this PR. It indeed deserves more discussion after release-1.0.0. |
This exception is not the real root cause, unfortunately, and it's misleading. I noticed when debugging, when the rest client is handling the original exception, the real response arrives, and it couldn't parse that response into a ErrorResponseBody. We should address this in core Flink. Regardless I'm planning to use a custom response class with the only information we need here. Overall I agree with the approach of having our own request/response classes in the operator that are more resilient for API changes. The current ones are far from ideal. We're going to hit these issues all the time. |
6c511c0
to
3886e66
Compare
This PR is meant to improve the ability to identify malicious Flink versions (CVE affected, deprecated, etc.) in managed environments. The operator propagates exact versioning information in status: