-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-31890][runtime] Introduce DefaultScheduler failure enrichment/labeling #22506
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @pgaref; this already looks pretty good. I've left a couple of comments. My biggest concern is not having a dedicated test in the JobMaster test suite.
flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ErrorInfo.java
Outdated
Show resolved
Hide resolved
flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ErrorInfo.java
Outdated
Show resolved
Hide resolved
...main/java/org/apache/flink/runtime/scheduler/exceptionhistory/RootExceptionHistoryEntry.java
Outdated
Show resolved
Hide resolved
flink-runtime/src/test/java/org/apache/flink/runtime/jobmaster/JobMasterTest.java
Outdated
Show resolved
Hide resolved
flink-runtime/src/test/java/org/apache/flink/runtime/jobmaster/utils/JobMasterBuilder.java
Outdated
Show resolved
Hide resolved
...t/java/org/apache/flink/runtime/scheduler/exceptionhistory/ExceptionHistoryEntryMatcher.java
Outdated
Show resolved
Hide resolved
...c/main/java/org/apache/flink/runtime/jobmaster/factories/DefaultJobMasterServiceFactory.java
Show resolved
Hide resolved
Thanks for the comments @dmvk ! Addressed as part of 0365fb5 and opened FLINK-31993 to track the configuration changes, PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for creating this PR! @pgaref
I have a few comments. Mainly about that some task failures are not labeled.
Please take a look.
flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ErrorInfo.java
Outdated
Show resolved
Hide resolved
flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java
Outdated
Show resolved
Hide resolved
flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java
Outdated
Show resolved
Hide resolved
76ceaad
to
5bc0eb5
Compare
cacb7cb
to
a3b5f5a
Compare
Thanks for the comments @zhuzhurk and @dmvk ! Decided to pass a This approach is also giving the flexibility to restart strategies in the future to block/wait for label results if they want to, or just ignore them. Keep in mind I had to to introduce a Please let me know what you think. |
b1acee4
to
cffe686
Compare
...e/src/main/java/org/apache/flink/runtime/scheduler/adaptivebatch/AdaptiveBatchScheduler.java
Outdated
Show resolved
Hide resolved
...c/test/java/org/apache/flink/runtime/executiongraph/DefaultExecutionGraphDeploymentTest.java
Outdated
Show resolved
Hide resolved
...ain/java/org/apache/flink/runtime/executiongraph/failover/flip1/ExecutionFailureHandler.java
Show resolved
Hide resolved
flink-core/src/main/java/org/apache/flink/util/concurrent/FutureUtils.java
Outdated
Show resolved
Hide resolved
08aedd5
to
fcec058
Compare
42d460b
to
22b4d12
Compare
Thanks for the comments once again @zhuzhurk !
|
flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ErrorInfo.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent work, @pgaref! This already looks pretty solid. Not relying on the labeling to complete seems to have simplified things a lot.
I've left some minor comments, PTAL, but we're mostly good to go.
.../main/java/org/apache/flink/runtime/executiongraph/failover/flip1/FailureHandlingResult.java
Outdated
Show resolved
Hide resolved
flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultScheduler.java
Show resolved
Hide resolved
...e/src/main/java/org/apache/flink/runtime/scheduler/adaptivebatch/AdaptiveBatchScheduler.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/apache/flink/runtime/scheduler/exceptionhistory/ExceptionHistoryEntry.java
Show resolved
Hide resolved
...java/org/apache/flink/runtime/executiongraph/failover/flip1/ExecutionFailureHandlerTest.java
Outdated
Show resolved
Hide resolved
...e/src/main/java/org/apache/flink/runtime/scheduler/adaptivebatch/AdaptiveBatchScheduler.java
Outdated
Show resolved
Hide resolved
flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ErrorInfo.java
Outdated
Show resolved
Hide resolved
flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ErrorInfo.java
Outdated
Show resolved
Hide resolved
...ain/java/org/apache/flink/runtime/executiongraph/failover/flip1/ExecutionFailureHandler.java
Show resolved
Hide resolved
eaade21
to
d5c766f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing the comments! @pgaref
The PR looks almost good to me.
I have 2 last comments, including one that can be addressed in a later PR of FLINK-32114.
...main/java/org/apache/flink/runtime/scheduler/exceptionhistory/RootExceptionHistoryEntry.java
Outdated
Show resolved
Hide resolved
...ain/java/org/apache/flink/runtime/executiongraph/failover/flip1/ExecutionFailureHandler.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM minus the concurrent failure labels; once we get rid of those, this should be good to merge, other comments already have scheduled follow-ups; nice work!
...main/java/org/apache/flink/runtime/scheduler/exceptionhistory/RootExceptionHistoryEntry.java
Outdated
Show resolved
Hide resolved
…labeling * Introduce async task failure labeling as part of ExecutionFailureHandler#handleFailure (both for local and global failures) * Introduce two fields to ExceptionHistoryEntry: a transient CompletableFuture<Map<String, String>> failureLabelsFuture as well as a Map<String, String> failureLabels -- the failureLabels are set as soon as failureLabelsFuture is completed * Extend ExceptionHistoryEntry, FailureHandlingResult, FailureHandlingResultSnapshot to expose labels as part of ExceptionHistory * Extend existing tests (e.g., DefaultSchedulerTest, FailureHandlingResultTest) to validate functionality
b160a65
to
606c628
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 🎉
https://issues.apache.org/jira/browse/FLINK-31890