[FLINK-26289][runtime] AdaptiveScheduler: Allow exception history to … #18866
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community to review your pull request.
Automated Checks
Last check on commit c4136d0 (Mon Feb 21 18:01:46 UTC 2022)
Mention the bot in a comment to re-run the automated checks.
Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process.
Details
The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.
…be queried by the REST API.
          executionGraphInfo.getArchivedExecutionGraph();
  if (executionGraph.getFailureInfo() == null) {
-     return new JobExceptionsInfoWithHistory();
+     return new JobExceptionsInfoWithHistory(
Is this really something that should be solved in the REST API? This looks more like an issue with the ArchivedExecutionGraph that the adaptive scheduler returns.
Not 100% sure. The single root exception entry has been deprecated (at least from the API perspective). Need to check whether this failure info still makes sense.
Anyway, this is a general problem of the execution graph with the Adaptive Scheduler (AS): we always return the current one (which might not have all the information) to the UI.
This is something we want to address eventually, but it should be a more systematic approach. Until then, I'm fine with the current solution.
To me this looks fine. The execution graph only contains the failureInfo if the job has failed (which happens only once). This conflicts with restarting the job (which can happen multiple times).
niklassemmler left a comment
looks good 👍
Thanks for the contributions and reviews. The CI run succeeded. I'm gonna merge it.
https://issues.apache.org/jira/browse/FLINK-26289
In FLINK-21439, we started collecting a history of exceptions in the Adaptive Scheduler. We have good test coverage showing that this part works properly, but we missed the part that exposes the history via the REST API.
The problematic part is that the execution graph attached to the ExecutionGraphInfo no longer contains a failure info.
This is covered by a new integration test.
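
The following is a minimal sketch of the behavior change as it can be read from the diff in this conversation, not the actual Flink handler code. All names here (ExceptionHistorySketch, Response, buildBefore, buildAfter) are hypothetical stand-ins, and exceptions are modeled as plain strings. It assumes that the old handler returned an empty response whenever the failure info was null, and that the fix still passes the collected history along in that case.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in types; these are NOT Flink's real classes.
public class ExceptionHistorySketch {

    /** Simplified response: an optional root exception plus the collected history. */
    public static final class Response {
        public final String rootException; // null if the job never reached a failed state
        public final List<String> history;

        public Response(String rootException, List<String> history) {
            this.rootException = rootException;
            this.history = Collections.unmodifiableList(new ArrayList<>(history));
        }
    }

    /** Assumed old behavior: a missing failure info short-circuits to an empty response. */
    public static Response buildBefore(String failureInfo, List<String> history) {
        if (failureInfo == null) {
            return new Response(null, Collections.<String>emptyList());
        }
        return new Response(failureInfo, history);
    }

    /** Assumed new behavior: the history is returned even without a failure info. */
    public static Response buildAfter(String failureInfo, List<String> history) {
        return new Response(failureInfo, history);
    }

    public static void main(String[] args) {
        // A job that restarted twice but never failed terminally has no failure info.
        List<String> history = new ArrayList<>();
        history.add("restart-1: TaskManager lost");
        history.add("restart-2: checkpoint timeout");

        System.out.println(buildBefore(null, history).history.size());
        System.out.println(buildAfter(null, history).history.size());
    }
}
```

With the two restarts above and no failure info, the assumed old path drops both entries while the assumed new path keeps them, which is the case the reviewers describe: failing happens once, restarting can happen many times.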