@dmvk dmvk commented Feb 21, 2022

https://issues.apache.org/jira/browse/FLINK-26289

In FLINK-21439, we started collecting a history of exceptions in the Adaptive Scheduler. We have good test coverage showing that this part works properly, but we missed the part that exposes the history via the REST API.

The problem is that the execution graph attached to the ExecutionGraphInfo no longer contains any failure info.

This is covered by a new integration test.
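
To illustrate the problem, here is a minimal, self-contained sketch. It uses simplified stand-in types (`GraphInfo`, plain strings for exceptions) and method names (`historyBeforeFix`, `historyAfterFix`) that are hypothetical and are not the actual Flink classes or handler code. It only shows why returning early on missing failure info hides a non-empty exception history.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ExceptionHistorySketch {

    /** Hypothetical stand-in for an execution graph plus the separately tracked history. */
    static final class GraphInfo {
        final String failureInfo; // assumption: null unless the job failed terminally
        final List<String> exceptionHistory;

        GraphInfo(String failureInfo, List<String> exceptionHistory) {
            this.failureInfo = failureInfo;
            this.exceptionHistory = exceptionHistory;
        }
    }

    /** Sketch of the old behavior: missing failure info yields an empty response. */
    static List<String> historyBeforeFix(GraphInfo info) {
        if (info.failureInfo == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(info.exceptionHistory);
    }

    /** Sketch of the new behavior: the history is reported even without failure info. */
    static List<String> historyAfterFix(GraphInfo info) {
        return new ArrayList<>(info.exceptionHistory);
    }

    public static void main(String[] args) {
        // A job that restarted twice but never failed terminally.
        GraphInfo restarted =
                new GraphInfo(null, Arrays.asList("first failure", "second failure"));

        System.out.println(historyBeforeFix(restarted)); // prints []
        System.out.println(historyAfterFix(restarted));  // prints [first failure, second failure]
    }
}
```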

flinkbot commented Feb 21, 2022

CI report:

Bot commands
The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@flinkbot

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit c4136d0 (Mon Feb 21 18:01:46 UTC 2022)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!
  • This pull request references an unassigned Jira ticket. According to the code contribution guide, tickets need to be assigned before starting with the implementation work.

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details
The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

        executionGraphInfo.getArchivedExecutionGraph();
if (executionGraph.getFailureInfo() == null) {
-    return new JobExceptionsInfoWithHistory();
+    return new JobExceptionsInfoWithHistory(

@zentol zentol Feb 22, 2022

Is this really something that should be solved in the REST API? This looks more like an issue with the ArchivedExecutionGraph that the adaptive scheduler returns.

@dmvk (Member, Author)

Not 100% sure. The single root exception entry has been deprecated (at least from the API perspective). Need to check whether this failure info still makes sense.

https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/rest/messages/JobExceptionsInfo.java#L45

@dmvk (Member, Author)

Anyway, this is a general problem with the execution graph in the Adaptive Scheduler (AS): we always return the current graph, which might not have all the information, to the UI.

This is something we want to address eventually, but it should be done with a more systematic approach. Until then, I'm fine with the current solution.

@niklassemmler

To me this looks fine. The execution graph only contains the failureInfo if the job has failed (which happens only once). This conflicts with restarting the job (which can happen multiple times).
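
The mismatch described above can be sketched with a small toy model. The class `RestartingJobSketch` and its methods are hypothetical and are not Flink APIs. The sketch also assumes that a terminal failure is recorded in the history as well, which is an assumption made for illustration only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class RestartingJobSketch {
    private final List<String> exceptionHistory = new ArrayList<>();
    private String failureInfo; // stays null until the job fails terminally

    /** A recoverable failure: the job restarts, only the history grows. */
    void restartAfter(String cause) {
        exceptionHistory.add(cause);
    }

    /** A terminal failure: can happen at most once per job. */
    void failTerminally(String cause) {
        if (failureInfo != null) {
            throw new IllegalStateException("job has already failed");
        }
        exceptionHistory.add(cause); // assumption: terminal failures are part of the history
        failureInfo = cause;
    }

    String getFailureInfo() {
        return failureInfo;
    }

    List<String> getExceptionHistory() {
        return Collections.unmodifiableList(exceptionHistory);
    }

    public static void main(String[] args) {
        RestartingJobSketch job = new RestartingJobSketch();
        job.restartAfter("task manager lost");
        job.restartAfter("checkpoint timeout");

        // Two failures are recorded, yet there is no failure info to key off.
        System.out.println(job.getExceptionHistory().size()); // prints 2
        System.out.println(job.getFailureInfo());             // prints null

        job.failTerminally("restart attempts exhausted");
        System.out.println(job.getExceptionHistory().size()); // prints 3
        System.out.println(job.getFailureInfo());             // prints restart attempts exhausted
    }
}
```

A response that is built only when `getFailureInfo()` is non-null would report nothing for the first two failures in this model.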

@niklassemmler niklassemmler left a comment

looks good 👍

XComp commented Mar 1, 2022

Thanks for the contributions and reviews. The CI run succeeded. I'm gonna merge it.

@XComp XComp merged commit 5c4d263 into apache:master Mar 1, 2022