[FLINK-31890][runtime] Introduce DefaultScheduler failure enrichment/labeling #22506

Merged
merged 1 commit into from
May 19, 2023

pgaref
Contributor

@pgaref pgaref commented May 2, 2023

https://issues.apache.org/jira/browse/FLINK-31890

  • Introduce async task failure labeling as part of ExecutionFailureHandler#handleFailure (both for local and global failures)
  • Introduce two fields to ExceptionHistoryEntry: a transient CompletableFuture<Map<String, String>> failureLabelsFuture as well as a Map<String, String> failureLabels -- the failureLabels are set as soon as failureLabelsFuture is completed
  • Extend ExceptionHistoryEntry, FailureHandlingResult, FailureHandlingResultSnapshot to expose labels as part of ExceptionHistory
  • Extend existing tests (e.g., DefaultSchedulerTest, FailureHandlingResultTest) to validate functionality
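To illustrate the async labeling described in the first bullet, here is a minimal sketch and not the code from this PR. The `Enricher` interface, the `labelFailure` method, and the label keys are hypothetical names chosen for illustration; the only assumption taken from the description is that labels arrive as a `CompletableFuture<Map<String, String>>`.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Hypothetical names throughout; this is not Flink's actual API.
final class FailureLabelingSketch {

    /** Hypothetical enricher: inspects a failure and asynchronously returns labels. */
    interface Enricher {
        CompletableFuture<Map<String, String>> label(Throwable cause);
    }

    /** Starts every enricher right away and merges the labels once all futures complete. */
    static CompletableFuture<Map<String, String>> labelFailure(
            Throwable cause, Collection<Enricher> enrichers) {
        CompletableFuture<Map<String, String>> merged =
                CompletableFuture.completedFuture(new HashMap<>());
        for (Enricher enricher : enrichers) {
            // label(cause) is invoked eagerly here, so enrichers run concurrently.
            merged = merged.thenCombine(enricher.label(cause), (acc, labels) -> {
                Map<String, String> out = new HashMap<>(acc);
                out.putAll(labels); // later enrichers overwrite earlier ones on key clashes
                return out;
            });
        }
        return merged;
    }

    public static void main(String[] args) {
        Enricher type = cause -> CompletableFuture.supplyAsync(
                () -> Map.of("type", cause instanceof IllegalStateException ? "system" : "user"));
        Enricher origin = cause -> CompletableFuture.completedFuture(Map.of("origin", "task"));

        Map<String, String> labels =
                labelFailure(new IllegalStateException("boom"), List.of(type, origin)).join();
        System.out.println(labels);
    }
}
```

The caller receives the future immediately, so failure handling does not have to wait for slow enrichers.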

Member

@dmvk dmvk left a comment

Thanks for the PR @pgaref; this already looks pretty good. I've left a couple of comments. My biggest concern is not having a dedicated test in the JobMaster test suite.

@pgaref
Contributor Author

pgaref commented May 4, 2023

> Thanks for the PR @pgaref; this already looks pretty good. I've left a couple of comments. My biggest concern is not having a dedicated test in the JobMaster test suite.

Thanks for the comments, @dmvk! I addressed them in 0365fb5 and opened FLINK-31993 to track the configuration changes. PTAL.

Contributor

@zhuzhurk zhuzhurk left a comment

Thanks for creating this PR, @pgaref!
I have a few comments, mainly that some task failures are not labeled.
Please take a look.

@pgaref pgaref force-pushed the FLINK-31890 branch 2 times, most recently from 76ceaad to 5bc0eb5 Compare May 10, 2023 04:14
@pgaref pgaref changed the title [FLINK-31890][runtime] Introduce JobMaster per-task failure enrichment/labeling [FLINK-31890][runtime] Introduce SchedulerBase per-task failure enrichment/labeling May 10, 2023
@pgaref pgaref force-pushed the FLINK-31890 branch 4 times, most recently from cacb7cb to a3b5f5a Compare May 10, 2023 23:04
@pgaref
Contributor Author

pgaref commented May 10, 2023

Thanks for the comments, @zhuzhurk and @dmvk!
The latest revision introduces async task failure labeling for DefaultScheduler (Adaptive and Global failures are follow-ups) as part of SchedulerBase#updateTaskExecutionState -- this also covers the InternalFailuresListener cases, as those failures are eventually propagated to the scheduler.

I decided to pass a CompletableFuture<Map<String, String>> all the way down to ExceptionHistoryEntry, as discussed -- we can then decide in the JobExceptionHandler how to handle it for the REST/UI endpoints.

This approach also gives restart strategies the flexibility, in the future, to block/wait for label results if they want to, or to just ignore them.
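The two options mentioned here (wait for the labels, or ignore them if they are not ready) can be sketched as follows. This is only an illustration under stated assumptions: `LabelConsumerSketch`, `labelsIfReady`, and `awaitLabels` are hypothetical names, and nothing below is taken from Flink's restart strategy interfaces.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical consumer of a labels future; not part of Flink.
final class LabelConsumerSketch {

    /** Non-blocking: returns the labels if they are already available, else an empty map. */
    static Map<String, String> labelsIfReady(CompletableFuture<Map<String, String>> future) {
        if (future.isDone() && !future.isCompletedExceptionally()) {
            return future.join(); // does not block because the future is already done
        }
        return Map.of();
    }

    /** Blocking: waits up to the timeout and falls back to an empty map on timeout or failure. */
    static Map<String, String> awaitLabels(
            CompletableFuture<Map<String, String>> future, long timeoutMillis) {
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            return Map.of();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return Map.of();
        }
    }

    public static void main(String[] args) {
        CompletableFuture<Map<String, String>> pending = new CompletableFuture<>();
        System.out.println(labelsIfReady(pending));   // labels not ready yet
        pending.complete(Map.of("type", "user"));
        System.out.println(awaitLabels(pending, 100));
    }
}
```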

Keep in mind that I had to introduce a SerializableCompletableFuture class, as ErrorInfo is a Serializable class and we cannot break that contract.

Please let me know what you think.

@pgaref pgaref force-pushed the FLINK-31890 branch 3 times, most recently from b1acee4 to cffe686 Compare May 11, 2023 15:34
@pgaref pgaref force-pushed the FLINK-31890 branch 3 times, most recently from 08aedd5 to fcec058 Compare May 15, 2023 21:28
@pgaref pgaref changed the title [FLINK-31890][runtime] Introduce SchedulerBase per-task failure enrichment/labeling [FLINK-31890][runtime] Introduce DefaultScheduler failure enrichment/labeling May 15, 2023
@pgaref pgaref force-pushed the FLINK-31890 branch 2 times, most recently from 42d460b to 22b4d12 Compare May 16, 2023 01:17
@pgaref
Contributor Author

pgaref commented May 16, 2023

Thanks for the comments once again, @zhuzhurk!
PTAL at the latest revision:

  • introducing async task failure labeling as part of ExecutionFailureHandler#handleFailure (for local and global failures)
  • transient CompletableFuture with serializable labels as part of ErrorInfo
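The second bullet (a transient future next to a serializable label map) can be sketched like this. It is a simplified illustration, not the PR's ErrorInfo or ExceptionHistoryEntry code: the class `LabeledEntrySketch` and its methods are hypothetical, and registering the callback in the constructor is a shortcut taken for brevity.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Hypothetical class for illustration; not Flink's ErrorInfo or ExceptionHistoryEntry.
final class LabeledEntrySketch implements Serializable {
    private static final long serialVersionUID = 1L;

    // CompletableFuture is not Serializable, so the field is skipped during serialization.
    private final transient CompletableFuture<Map<String, String>> failureLabelsFuture;
    // Serializable snapshot, filled in once the future completes normally.
    private volatile Map<String, String> failureLabels = Collections.emptyMap();

    LabeledEntrySketch(CompletableFuture<Map<String, String>> failureLabelsFuture) {
        this.failureLabelsFuture = failureLabelsFuture;
        failureLabelsFuture.thenAccept(labels -> this.failureLabels = new HashMap<>(labels));
    }

    Map<String, String> getFailureLabels() {
        return failureLabels;
    }

    /** Returns null on a deserialized instance, because the field is transient. */
    CompletableFuture<Map<String, String>> getFailureLabelsFuture() {
        return failureLabelsFuture;
    }

    /** Serializes and deserializes an entry in memory. */
    static LabeledEntrySketch roundTrip(LabeledEntrySketch entry) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(entry);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (LabeledEntrySketch) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        CompletableFuture<Map<String, String>> future = new CompletableFuture<>();
        LabeledEntrySketch entry = new LabeledEntrySketch(future);
        future.complete(Map.of("type", "user"));

        LabeledEntrySketch copy = roundTrip(entry);
        System.out.println(copy.getFailureLabels());               // labels survive
        System.out.println(copy.getFailureLabelsFuture() == null); // future does not
    }
}
```

Labels that have not arrived by the time the entry is serialized are simply absent from the copy, which matches the idea of not relying on labeling to complete.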

Member

@dmvk dmvk left a comment

Excellent work, @pgaref! This already looks pretty solid. Not relying on the labeling to complete seems to have simplified things a lot.

I've left some minor comments, PTAL, but we're mostly good to go.

@pgaref pgaref force-pushed the FLINK-31890 branch 5 times, most recently from eaade21 to d5c766f Compare May 17, 2023 16:53
Contributor

@zhuzhurk zhuzhurk left a comment

Thanks for addressing the comments, @pgaref!
The PR looks almost good to me.
I have two last comments, including one that can be addressed in a later PR for FLINK-32114.

Member

@dmvk dmvk left a comment

LGTM minus the concurrent failure labels. Once we get rid of those, this should be good to merge; the other comments already have scheduled follow-ups. Nice work!

@pgaref pgaref force-pushed the FLINK-31890 branch 2 times, most recently from b160a65 to 606c628 Compare May 18, 2023 18:28
@pgaref
Contributor Author

pgaref commented May 18, 2023

Thanks @dmvk and @zhuzhurk for the valuable comments!
This is good to go now :)

Member

@dmvk dmvk left a comment

👍 🎉

@dmvk dmvk merged commit a9383fd into apache:master May 19, 2023