[FLINK-26289][runtime] AdaptiveScheduler: Allow exception history to … #18866

dmvk · 2022-02-21T17:57:36Z

https://issues.apache.org/jira/browse/FLINK-26289

In FLINK-21439, we started collecting a history of exceptions in the Adaptive Scheduler. That part is well covered by tests, but we missed the part that exposes the history via the REST API.

The problem is that the execution graph attached to the ExecutionGraphInfo no longer contains failure info, so the JobExceptionsHandler returns an empty response even when an exception history exists.

This is covered by a new integration test.
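
A minimal sketch of the behavior described above, under simplifying assumptions. The types `GraphInfo` and `Response` and the methods `withEarlyReturn` and `withHistory` are hypothetical stand-ins invented for illustration; they are not Flink's actual classes or the code changed in this PR. The sketch only shows why an early return on a missing failure info hides a history that the scheduler has already collected.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ExceptionHistorySketch {

    /** Hypothetical stand-in for the info handed to the REST handler. */
    static final class GraphInfo {
        final String failureInfo;            // null if the current graph has no recorded failure
        final List<String> exceptionHistory; // collected by the scheduler across restarts

        GraphInfo(String failureInfo, List<String> exceptionHistory) {
            this.failureInfo = failureInfo;
            this.exceptionHistory = exceptionHistory;
        }
    }

    /** Hypothetical stand-in for the REST response. */
    static final class Response {
        final String rootException;
        final List<String> history;

        Response(String rootException, List<String> history) {
            this.rootException = rootException;
            this.history = history;
        }
    }

    /** Early return: without failure info, the collected history is dropped. */
    static Response withEarlyReturn(GraphInfo info) {
        if (info.failureInfo == null) {
            return new Response(null, Collections.<String>emptyList());
        }
        return new Response(info.failureInfo, info.exceptionHistory);
    }

    /** Alternative: the history is reported even when the failure info is absent. */
    static Response withHistory(GraphInfo info) {
        return new Response(info.failureInfo, info.exceptionHistory);
    }

    public static void main(String[] args) {
        // A job that restarted twice; the current graph carries no failure info.
        GraphInfo restarted =
                new GraphInfo(null, Arrays.asList("failure-1", "failure-2"));

        System.out.println(withEarlyReturn(restarted).history.size()); // prints 0
        System.out.println(withHistory(restarted).history.size());     // prints 2
    }
}
```

In this simplified model, the only difference between the two methods is whether the history depends on the presence of the graph-level failure info.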

flinkbot · 2022-02-21T18:01:28Z

CI report:

795d5c1 Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure: re-run the last Azure build

flinkbot · 2022-02-21T18:01:46Z

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit c4136d0 (Mon Feb 21 18:01:46 UTC 2022)

Warnings:

No documentation files were touched! Remember to keep the Flink docs up to date!
This pull request references an unassigned Jira ticket. According to the code contribution guide, tickets need to be assigned before starting with the implementation work.

Mention the bot in a comment to re-run the automated checks.

Review Progress

❓ 1. The [description] looks good.
❓ 2. There is [consensus] that the contribution should go into Flink.
❓ 3. Needs [attention] from.
❓ 4. The change fits into the overall [architecture].
❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details

The bot tracks the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot approve description: approve one or more aspects (aspects: description, consensus, architecture and quality)
@flinkbot approve all: approve all aspects
@flinkbot approve-until architecture: approve everything until architecture
@flinkbot attention @username1 [@username2 ..]: require somebody's attention
@flinkbot disapprove architecture: remove an approval you gave earlier

zentol · 2022-02-22T08:51:34Z

flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/job/JobExceptionsHandler.java

                executionGraphInfo.getArchivedExecutionGraph();
        if (executionGraph.getFailureInfo() == null) {
-            return new JobExceptionsInfoWithHistory();
+            return new JobExceptionsInfoWithHistory(


Is this really something that should be solved in the REST API? This looks more like an issue with the ArchivedExecutionGraph that the adaptive scheduler returns.

Not 100% sure. The single root exception entry has been deprecated (at least from the API perspective). Need to check whether this failure info still makes sense.

https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/rest/messages/JobExceptionsInfo.java#L45

Anyway, this is a general problem with the execution graph under the AdaptiveScheduler (AS): we always return the current graph (which might not have all the information) to the UI.

This is something we want to address eventually, but it should be a more systematic approach. Until then, I'm fine with the current solution.

niklassemmler

looks good 👍

niklassemmler · 2022-02-28T10:31:52Z

flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/job/JobExceptionsHandler.java

                executionGraphInfo.getArchivedExecutionGraph();
        if (executionGraph.getFailureInfo() == null) {
-            return new JobExceptionsInfoWithHistory();
+            return new JobExceptionsInfoWithHistory(


To me this looks fine. The execution graph only contains the failureInfo if the job has failed (which happens only once). This conflicts with restarting the job (which can happen multiple times).

XComp · 2022-03-01T06:35:37Z

Thanks for the contributions and reviews. The CI run succeeded. I'm gonna merge it.

dmvk force-pushed the FLINK-26289 branch from c4136d0 to 2cabf24 on February 21, 2022 17:58

[FLINK-26289][runtime] AdaptiveScheduler: Allow exception history to …

795d5c1

…be queried by the REST API.

dmvk force-pushed the FLINK-26289 branch from 2cabf24 to 795d5c1 on February 21, 2022 18:13

rmetzger added the component=Runtime/Coordination label Feb 21, 2022

zentol reviewed Feb 22, 2022

niklassemmler approved these changes Feb 28, 2022

XComp merged commit 5c4d263 into apache:master Mar 1, 2022


6 participants
