Skip to content

Conversation

@XComp
Copy link
Contributor

@XComp XComp commented Mar 22, 2021

What is the purpose of the change

The goal is to not only provide the root cause of a task or global failure but exceptions that were caught concurrently in other Executions which would be affected by the partial or full restart of the job.

Brief change log

  • Introduced RooExceptionHistoryEntry
  • Introduced ExceptionHistoryEntryExtractor that deals with creating the history entries
  • Created exceptionhistory sub-package

Verifying this change

This change added tests and can be verified as follows:

  • ExceptionHistoryEntryExtractorTest was added dealing with the actual entry extraction

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): yes
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? REST API docs

@flinkbot
Copy link
Collaborator

flinkbot commented Mar 22, 2021

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 14a4a8a (Sat Aug 28 11:11:01 UTC 2021)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into to Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Copy link
Collaborator

flinkbot commented Mar 22, 2021

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

@XComp XComp force-pushed the FLINK-21189 branch 2 times, most recently from 0923b25 to 1d0123a Compare March 25, 2021 08:13
@XComp XComp marked this pull request as ready for review March 26, 2021 11:42
@XComp
Copy link
Contributor Author

XComp commented Mar 26, 2021

I addressed all comments and marked the PR as reviewable. The AzureCI failure seems to be unrelated FLINK-21929.

@XComp
Copy link
Contributor Author

XComp commented Mar 26, 2021

I added a separate commit to cover adding the timestamp to the FailureHandlingResult. I kept the commit separately to make it easier to review the two issues (the actual purpose of this PR and the timestamp refactoring).

I verified manually that the UI is not affected by this change. @zentol please give it another round...

Copy link
Contributor

@zentol zentol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some CI failures that should be looked at.

@XComp XComp force-pushed the FLINK-21189 branch 2 times, most recently from 97997bc to bac508e Compare March 29, 2021 13:01
@XComp
Copy link
Contributor Author

XComp commented Mar 30, 2021

The JobMasterTest failed due to a NullPointerException caused by me adding a notNull check to the FailureHandlingResult as part of my initial efforts to introduce ErrorInfo in FailureHandlingResult instead of dedicated members for the failureCause and timestamp. I reverted that change due to problems with the SerializedThrowable swallowing annotations that is necessary for identifying non-recoverable errors. I reverted the null check.

@XComp
Copy link
Contributor Author

XComp commented Mar 30, 2021

Writing down the previous comment got me thinking again: Ideally, we would want to have this null check since a failure should always have a cause. We didn't introduce a null check so far because of FLINK-21376. There is ErrorInfo. createErrorInfoWithNullableCause for handling this. But it feels to be handled in the wrong place. Instead, we should substitute the null value by the FlinkException in TaskExecutionState's constructor. This way, we make sure that the null never ends up in the ExecutionGraph-related code which enables us to make the invariants stricter on our end.

@zentol
Copy link
Contributor

zentol commented Mar 30, 2021

But that's all part of FLINK-21376, correct? FYI, I don't think the TaskExecutionState should set an exception on it's own, instead it should enforce that if the exception variants are used the exception is not null.

Copy link
Contributor

@zentol zentol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you consider that blocking the PR, or you can we go ahead with the merge?

@XComp
Copy link
Contributor Author

XComp commented Mar 31, 2021

But that's all part of FLINK-21376, correct? FYI, I don't think the TaskExecutionState should set an exception on it's own, instead it should enforce that if the exception variants are used the exception is not null.

It would be some preparation/cleanup task for FLINK-21376. I created FLINK-22060 to cover this. Hence, it's not blocking this PR.

@zentol zentol merged commit 39c959c into apache:master Mar 31, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants