[FLINK-21439][core] WIP: Adds Exception History for AdaptiveScheduler #15898

bytesandwich · 2021-05-11T18:37:11Z

What is the purpose of the change

This pull request will address saving exception histories in AdaptiveScheduler.

Brief change log

added three tests in AdaptiveSchedulerTest for three conditions: knowing the specific failing task, not knowing it, and concurrent failures.
added the history queue to AdaptiveScheduler along with requestJob
added archiveExecutionFailure to StateWithExecutionGraph to populate the queue in AdaptiveScheduler along with uses of this method in archiveAnyFailure in Executing and StopWithSavepoint.
added failingExecutionVertexId to Executing.FailureResult.
added userCodeClassLoader to StateWithExecutionGraph
add error archiving to Failing and Restarting
added a FailureHandlingResultSnapshot factory method that accepts what we have available in AdaptiveScheduler when we have failures. I think maybe it's a broader type of FailureSnapshot now.

Verifying this change

This change adds exception state accumulation to the AdaptiveScheduler and has associated tests in AdaptiveSchedulerTest.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): no
The public API, i.e., is any changed class annotated with @Public(Evolving): no
The serializers: no
The runtime per-record code paths (performance sensitive): no
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: no
The S3 file system connector: (yes / no / don't know)

Documentation

Does this pull request introduce a new feature? no
If yes, how is the feature documented? not applicable

flinkbot · 2021-05-11T18:40:48Z

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 699bacf (Tue May 11 18:40:48 UTC 2021)

Warnings:

No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

❓ 1. The [description] looks good.
❓ 2. There is [consensus] that the contribution should go into Flink.
❓ 3. Needs [attention] from.
❓ 4. The change fits into the overall [architecture].
❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
@flinkbot approve all to approve all aspects
@flinkbot approve-until architecture to approve everything until architecture
@flinkbot attention @username1 [@username2 ..] to require somebody's attention
@flinkbot disapprove architecture to remove an approval you gave earlier

flinkbot · 2021-05-11T18:53:03Z

CI report:

903a79e Azure: FAILURE

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure re-run the last Azure build

flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveScheduler.java

XComp

Thanks @bytesandwich for your draft. And thanks for your patience. I looked through your proposal and the changes look good already. I had a few remarks and questions which are listed below. Looking forward to your reply.

I didn't go through the StopWithSavepointTest in detail, yet: Some of the tests are failing. Keep in mind that we also want to cover the other StateWithExecutionGraph implementations as well for the exception history.

.../java/org/apache/flink/runtime/scheduler/exceptionhistory/FailureHandlingResultSnapshot.java

XComp · 2021-06-13T13:41:02Z

flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveScheduler.java

+        final Collection<RootExceptionHistoryEntry> copy = new ArrayList<>(exceptionHistory.size());
+        exceptionHistory.forEach(copy::add);
+        return copy;


We could think of moving this logic into BoundedFIFOQueue considering that SchedulerBase uses the exact same code. WDYT?

Does making BoundedFIFOQueue an AbstractQueue and replacing this with new ArrayList<>(<the queue>) work?
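To make the question above concrete, here is a minimal sketch of the idea. BoundedFifoQueueSketch is a hypothetical stand-in written for this example; it is not Flink's BoundedFIFOQueue, and the eviction policy shown (drop the oldest element once the bound is reached) is an assumption. Because AbstractQueue is a Collection, the manual forEach(copy::add) loop can be replaced by the ArrayList copy constructor.

```java
import java.util.AbstractQueue;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Objects;

/** Hypothetical stand-in for a bounded FIFO queue; not Flink's BoundedFIFOQueue. */
final class BoundedFifoQueueSketch<T> extends AbstractQueue<T> {

    private final int maxSize;
    private final ArrayDeque<T> elements = new ArrayDeque<>();

    BoundedFifoQueueSketch(int maxSize) {
        if (maxSize <= 0) {
            throw new IllegalArgumentException("maxSize must be positive");
        }
        this.maxSize = maxSize;
    }

    /** Adds the element and evicts the oldest one if the bound is already reached. */
    @Override
    public boolean offer(T element) {
        Objects.requireNonNull(element);
        if (elements.size() == maxSize) {
            elements.pollFirst();
        }
        return elements.offerLast(element);
    }

    @Override
    public T poll() {
        return elements.pollFirst();
    }

    @Override
    public T peek() {
        return elements.peekFirst();
    }

    @Override
    public Iterator<T> iterator() {
        return elements.iterator();
    }

    @Override
    public int size() {
        return elements.size();
    }

    /** Returns an independent copy in insertion order, using the Collection copy constructor. */
    List<T> toArrayList() {
        return new ArrayList<>(this);
    }
}
```

With a helper like toArrayList() on the queue itself, both schedulers could share the copy logic instead of duplicating it.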

flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/Executing.java

...runtime/src/test/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveSchedulerTest.java

XComp · 2021-06-14T06:08:56Z

flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/Executing.java

     * The {@link FailureResult} describes how a failure shall be handled. Currently, there are two
     * alternatives: Either restarting the job or failing it.
     */
    static final class FailureResult {


I'm wondering whether we should introduce a unit test for FailureResult considering that it becomes more "powerful". And, maybe, moving it into AdaptiveScheduler might make sense? WDYT?

I think there's a follow up refactor of handleAnyFailure etc to be had.
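As an illustration of what a dedicated FailureResult unit test could exercise, here is a simplified sketch. FailureResultSketch is hypothetical: it uses a String in place of ExecutionVertexID, and its factory and accessor names are assumptions modeled on the snippets quoted in this thread, not the actual Executing.FailureResult API.

```java
import java.time.Duration;
import java.util.Objects;
import java.util.Optional;

/** Hypothetical, simplified stand-in for Executing.FailureResult. */
final class FailureResultSketch {

    // null means the failure is not attributed to a single vertex (global failure)
    private final String failingExecutionVertexId;
    private final Throwable failureCause;
    // null means the job cannot be restarted
    private final Duration backoffTime;

    private FailureResultSketch(
            String failingExecutionVertexId, Throwable failureCause, Duration backoffTime) {
        this.failingExecutionVertexId = failingExecutionVertexId;
        this.failureCause = Objects.requireNonNull(failureCause);
        this.backoffTime = backoffTime;
    }

    static FailureResultSketch canRestart(
            String failingExecutionVertexId, Throwable failureCause, Duration backoffTime) {
        return new FailureResultSketch(
                failingExecutionVertexId, failureCause, Objects.requireNonNull(backoffTime));
    }

    static FailureResultSketch canNotRestart(
            String failingExecutionVertexId, Throwable failureCause) {
        return new FailureResultSketch(failingExecutionVertexId, failureCause, null);
    }

    boolean canRestart() {
        return backoffTime != null;
    }

    Duration getBackoffTime() {
        if (!canRestart()) {
            throw new IllegalStateException("Failure result cannot be restarted.");
        }
        return backoffTime;
    }

    Optional<String> getFailingExecutionVertexId() {
        return Optional.ofNullable(failingExecutionVertexId);
    }

    Throwable getFailureCause() {
        return failureCause;
    }
}
```

A small test class covering both the null and the non-null vertex ID would then remove the need to repeat those cases in every scheduler-level test.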

flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/StopWithSavepoint.java

XComp · 2021-06-14T06:56:58Z

...runtime/src/test/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveSchedulerTest.java


-        assertThat(scheduler.howToHandleFailure(new Exception("test")).canRestart(), is(false));
+        assertThat(
+                scheduler.howToHandleFailure(null, new Exception("test")).canRestart(), is(false));


Theoretically, we would have to test passing a non-null value here as well for the failingExecutionVertexId parameter. Introducing a FailureResultTest as mentioned above would free us from doing that.

Thanks for addressing this. Could you move this case into its own test method?

flink-runtime/src/test/java/org/apache/flink/runtime/scheduler/adaptive/ExecutingTest.java

XComp · 2021-06-14T07:09:53Z

...runtime/src/test/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveSchedulerTest.java

+        assertThat(failure.getTimestamp(), greaterThanOrEqualTo(start));
+        assertThat(failure.getTimestamp(), lessThanOrEqualTo(end));
+        assertThat(failure.getTaskManagerLocation(), Matchers.is(nullValue()));
+        assertThat(failure.getFailingTaskName(), Matchers.is(nullValue()));


There's an ExceptionHistoryEntryMatcher which you could use (and extend) instead.

Do we need to add the code from getFailureTimestamp in DefaultSchedulerTest to use that?

I think that's a good idea. That makes the test more precise as well.

XComp

Thanks @bytesandwich. The changes go in the right direction, I think. 👍 We're still missing the support of concurrent failures. I tried to pinpoint this in some of my comments below. Looking forward to your response.

XComp · 2021-06-26T05:39:25Z

flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/Executing.java

        @Nullable private final Duration backoffTime;

+        /**
+         * the {@link ExecutionVertexID} refering to the {@link ExecutionVertex} the failure is


Suggested change

-         * the {@link ExecutionVertexID} refering to the {@link ExecutionVertex} the failure is
+         * The {@link ExecutionVertexID} refering to the {@link ExecutionVertex} the failure is

nit

XComp · 2021-06-26T06:04:20Z

...ntime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/StateWithExecutionGraph.java

+    @Nullable
+    protected ExecutionVertexID getExecutionVertexId(ExecutionAttemptID id) {
+        Execution execution = getExecutionGraph().getRegisteredExecutions().get(id);
+        if (execution == null) {


Looks like I missed that last time: This seems to be wrong, doesn't it? This method returning null would lead to the failure being interpreted as a global one. It feels like the wrong location for this decision. I'd propose that the method expects the ID to be present. Setting the null value should be done in the handleGlobalFailure method explicitly. Alternatively, you could follow what DefaultScheduler/SchedulerBase are doing: return an Optional and do the state check in case of a successful update.

Is this the same behavior as the original method:

flink/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/SchedulerBase.java

Line 519 in d105eb3

return Optional.ofNullable(executionGraph.getRegisteredExecutions().get(executionAttemptId))

?

Not sure whether I understand you correctly: The method implementation exists in SchedulerBase as well, yes.
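The Optional-based alternative mentioned in this thread can be sketched as follows. ExecutionRegistrySketch and its methods are hypothetical and use plain Strings for the attempt and vertex IDs. The point is only that an unknown attempt ID yields an empty Optional, so the lookup itself never turns a missing execution into a "global failure".

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical registry mapping execution attempt IDs to execution vertex IDs. */
final class ExecutionRegistrySketch {

    private final Map<String, String> vertexIdByAttemptId = new HashMap<>();

    void register(String attemptId, String vertexId) {
        vertexIdByAttemptId.put(attemptId, vertexId);
    }

    /** Returns the vertex ID for a registered attempt, or empty if the attempt is unknown. */
    Optional<String> findExecutionVertexId(String attemptId) {
        return Optional.ofNullable(vertexIdByAttemptId.get(attemptId));
    }

    /**
     * Records a task failure only if the attempt is known. Unknown attempts are reported
     * through the return value instead of being treated as a global failure.
     */
    boolean recordTaskFailure(String attemptId, List<String> failedVertexIds) {
        Optional<String> vertexId = findExecutionVertexId(attemptId);
        vertexId.ifPresent(failedVertexIds::add);
        return vertexId.isPresent();
    }
}
```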

XComp · 2021-06-26T06:17:32Z

.../java/org/apache/flink/runtime/scheduler/exceptionhistory/FailureHandlingResultSnapshot.java

+     * @return The {@code FailureHandlingResultSnapshot}.
+     */
+    public static FailureHandlingResultSnapshot create(
+            Optional<ExecutionVertexID> failingExecutionVertexId,


Passing an Optional here causes an unnecessary wrapping in StateWithExecutionGraph:325 just to have it unwrapped in the method. Instead, we could do a failureHandlingResult.getExecutionVertexIdOfFailedTask().orElse(null) in the factory method above and make this parameter @Nullable.
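A small sketch of the suggested shape: the factory that receives the handling result unwraps the Optional once with orElse(null) and delegates to a factory with a nullable parameter. SnapshotSketch and HandlingResultSketch are hypothetical stand-ins for FailureHandlingResultSnapshot and FailureHandlingResult; only the getter name getExecutionVertexIdOfFailedTask is taken from the comment above.

```java
import java.util.Objects;
import java.util.Optional;

/** Hypothetical stand-in for a failure handling result with an optional failed vertex. */
final class HandlingResultSketch {

    private final String vertexId; // null for a global failure
    private final Throwable error;

    HandlingResultSketch(String vertexId, Throwable error) {
        this.vertexId = vertexId;
        this.error = Objects.requireNonNull(error);
    }

    Optional<String> getExecutionVertexIdOfFailedTask() {
        return Optional.ofNullable(vertexId);
    }

    Throwable getError() {
        return error;
    }
}

/** Hypothetical stand-in for the snapshot created from a failure. */
final class SnapshotSketch {

    private final String failingVertexId; // null marks a global failure
    private final Throwable rootCause;

    private SnapshotSketch(String failingVertexId, Throwable rootCause) {
        this.failingVertexId = failingVertexId;
        this.rootCause = Objects.requireNonNull(rootCause);
    }

    /** Unwraps the Optional once and delegates to the nullable-parameter factory. */
    static SnapshotSketch create(HandlingResultSketch result) {
        return create(result.getExecutionVertexIdOfFailedTask().orElse(null), result.getError());
    }

    /** The failingVertexId parameter may be null, which marks a global failure. */
    static SnapshotSketch create(String failingVertexId, Throwable rootCause) {
        return new SnapshotSketch(failingVertexId, rootCause);
    }

    boolean isGlobalFailure() {
        return failingVertexId == null;
    }

    Throwable getRootCause() {
        return rootCause;
    }
}
```

Callers that already hold a nullable ID, such as a state class, can then call the second factory directly without wrapping.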

XComp · 2021-06-26T06:20:16Z

...ntime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/StateWithExecutionGraph.java

+        /**
+         * Archive the details of an execution failure for future retrieval and inspection.
+         *
+         * @param failureHandlingResultSnapshot


Suggested change

-         * @param failureHandlingResultSnapshot
+         * @param failureHandlingResultSnapshot The {@link FailureHandlingResultSnapshot} holding the failure information that needs to be archived.

nit: just to please the IDE and remove a warning.

XComp · 2021-06-26T06:26:51Z

...runtime/src/test/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveSchedulerTest.java


-        assertThat(scheduler.howToHandleFailure(new Exception("test")).canRestart(), is(false));
+        assertThat(
+                scheduler.howToHandleFailure(null, new Exception("test")).canRestart(), is(false));


Thanks for addressing this. Could you move this case into its own test method?

XComp · 2021-06-26T06:37:31Z

...runtime/src/test/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveSchedulerTest.java

+                Matchers.is(expectedException));
+        assertThat(failure.getTimestamp(), greaterThanOrEqualTo(start));
+        assertThat(failure.getTimestamp(), lessThanOrEqualTo(end));
+        assertThat(failure.getTaskManagerLocation(), Matchers.is(nullValue()));


Suggested change

-        assertThat(failure.getTaskManagerLocation(), Matchers.is(nullValue()));
+        assertThat(failure.getTaskManagerLocation(), is(nullValue()));

nit: This code already statically imports org.hamcrest.core.Is.is, so the Matchers. prefix is not necessary here.

XComp · 2021-06-26T06:39:48Z

...runtime/src/test/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveSchedulerTest.java

+import static org.hamcrest.Matchers.nullValue;
 import static org.hamcrest.core.Is.is;
 import static org.junit.Assert.assertFalse;
 import static org.junit.Assert.assertThat;


nit: I know it's not caused by you, but could you replace the org.junit.Assert.assertThat import by org.hamcrest.MatcherAssert.assertThat in a hotfix commit just to have the deprecation warning removed?

XComp · 2021-06-26T07:05:47Z

...runtime/src/test/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveSchedulerTest.java

+        assertThat(failure.getTaskManagerLocation(), Matchers.is(nullValue()));
+        assertThat(failure.getFailingTaskName(), Matchers.is(nullValue()));
+    }
+


Can you add the concurrent failure test here as well? This should fail right now since we're not covering the failure archiving in the restart state.

XComp · 2021-06-26T07:10:53Z

flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/Executing.java

+        final FailureResult failureResult =
+                context.howToHandleFailure(failingExecutionVertexId, cause);
+
+        archiveExecutionFailure(failingExecutionVertexId, cause);


Thinking about it once more: I guess, it's not the right location to archive the failure considering that we also want to identify concurrent failures. We haven't addressed that in this PR, yet.

To achieve that, we have to collect the failure snapshot here and pass it over to the next state (failure or restart). Any failure that pops up in these subsequent states has to be collected as well. The archiving should happen when re-instantiating the ExecutionGraph.

We should also cover this in the corresponding StateWithExecutionGraphTest test implementations.
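The proposed flow (collect the first failure, keep collecting failures that arrive in the following states, archive only when the ExecutionGraph is re-instantiated) could look roughly like this. FailureCollectorSketch and all of its method names are hypothetical and chosen for this illustration; the real state classes would carry this information across state transitions rather than in one object.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Objects;

/** Hypothetical collector that groups a root failure with its concurrent failures. */
final class FailureCollectorSketch {

    /** One archived history entry: the root cause plus failures seen while it was handled. */
    static final class Entry {
        final Throwable rootCause;
        final List<Throwable> concurrentFailures;

        Entry(Throwable rootCause, List<Throwable> concurrentFailures) {
            this.rootCause = rootCause;
            this.concurrentFailures = Collections.unmodifiableList(concurrentFailures);
        }
    }

    private Throwable rootCause;
    private final List<Throwable> concurrentFailures = new ArrayList<>();
    private final List<Entry> history = new ArrayList<>();

    /** The first failure becomes the root cause; later ones are collected as concurrent. */
    void onFailure(Throwable cause) {
        Objects.requireNonNull(cause);
        if (rootCause == null) {
            rootCause = cause;
        } else {
            concurrentFailures.add(cause);
        }
    }

    /** Archives the collected failures once the graph is re-created, then resets the state. */
    void onExecutionGraphRecreated() {
        if (rootCause == null) {
            return;
        }
        history.add(new Entry(rootCause, new ArrayList<>(concurrentFailures)));
        rootCause = null;
        concurrentFailures.clear();
    }

    List<Entry> getHistory() {
        return Collections.unmodifiableList(history);
    }
}
```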

flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/StopWithSavepoint.java

XComp · 2021-07-06T10:38:11Z

@bytesandwich just letting you know: The feature freeze for Flink 1.14 is at the beginning of August 2021. How realistic is it to finish this PR by then? FYI: @zentol might take over as I'm out for the next few weeks.

zentol · 2021-07-20T22:25:09Z

@bytesandwich ping

bytesandwich · 2021-07-28T06:47:19Z

Hi @zentol I just got back from a vacation so I'm looking at this again. I'm not sure what we want specification wise with concurrent failure support. I can imagine all sorts of things failing concurrently. It seems like maybe concurrent failures would be best tested in a more elaborate integration test to have a very clear expectation of correct behavior? Perhaps that could be a follow up ticket to make this current minimal exception handling landable quickly after I address the current feedback?

bytesandwich · 2021-07-28T07:31:33Z

Regarding using ExceptionHistoryEntryMatcher, I'm not sure what exact timestamp to expect. Maybe it's best to stick with the range that's in the test now without the matcher?

FYI, I uploaded the changes I had made before vacation, which I think address the review. I'd like to see what the integration test runs do, since developing on Windows makes it hard to run the build.

XComp · 2021-08-03T12:16:48Z

> Hi @zentol I just got back from a vacation so I'm looking at this again. I'm not sure what we want specification wise with concurrent failure support. I can imagine all sorts of things failing concurrently. It seems like maybe concurrent failures would be best tested in a more elaborate integration test to have a very clear expectation of correct behavior? Perhaps that could be a follow up ticket to make this current minimal exception handling landable quickly after I address the current feedback?

Hi @bytesandwich, I'm back from vacation so I am able to answer your questions.
Testing concurrent failures should be possible as part of the AdaptiveSchedulerTest. Similar to what you've done in AdaptiveSchedulerTest:929 with one updateTaskExecutionState call, you should be able to do the same with two calls. The first call makes the AdaptiveScheduler switch into the Restarting state. Calling updateTaskExecutionState again would not catch the second exception right now. Implementing the exception handling in the Restarting state class as well should solve the issue.

Analogously, that has to be done for cases where the scheduler does not switch into Restarting state but Failing. Does that make sense to you?
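The two-call scenario described above can be sketched with a toy state machine. MiniSchedulerSketch is hypothetical and much simpler than AdaptiveScheduler: it only shows that the first failed update moves the scheduler out of the executing state, and that the second failed update is still recorded because the follow-up states (restarting or failing) handle failures too.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical toy scheduler; only models failure recording across states. */
final class MiniSchedulerSketch {

    enum State {
        EXECUTING,
        RESTARTING,
        FAILING
    }

    private final boolean restartAllowed;
    private State state = State.EXECUTING;
    private final List<String> recordedFailures = new ArrayList<>();

    MiniSchedulerSketch(boolean restartAllowed) {
        this.restartAllowed = restartAllowed;
    }

    /** Records a failed task in every state, then leaves EXECUTING on the first failure. */
    void updateTaskExecutionState(String taskName, boolean failed) {
        if (!failed) {
            return;
        }
        // Recording here, independent of the current state, is what catches the
        // second (concurrent) failure while restarting or failing.
        recordedFailures.add(taskName);
        if (state == State.EXECUTING) {
            state = restartAllowed ? State.RESTARTING : State.FAILING;
        }
    }

    State getState() {
        return state;
    }

    List<String> getRecordedFailures() {
        return Collections.unmodifiableList(recordedFailures);
    }
}
```

A test in this style calls updateTaskExecutionState twice and asserts that both failures are present, once for the restart path and once for the fail path.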

XComp · 2021-08-03T12:26:45Z

> Regarding using ExceptionHistoryEntryMatcher, I'm not sure what exact timestamp to expect. Maybe it's best to stick with the range that's in the test now without the matcher?
>
> FYI, I uploaded the changes I had made before vacation, which I think address the review. I'd like to see what the integration test runs do, since developing on Windows makes it hard to run the build.

The same way you get the ExecutionAttemptID in AdaptiveSchedulerTest:920-924, you could also get the failure info. Instead of calling getAttemptId() to retrieve the ID, you could call getFailureInfo() after updateTaskExecutionState is called to retrieve the ErrorInfo, which includes the timestamp.

I hope that helped. Let me know if you have further questions.
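To illustrate why an exact timestamp can be asserted instead of a range, here is a sketch with hypothetical stand-in types (ErrorInfoSketch, ExecutionSketch, HistoryEntrySketch). It assumes, as the comment above suggests, that the failed execution exposes its failure info including the timestamp, and that the history entry is derived from that same info.

```java
import java.util.Objects;
import java.util.Optional;

/** Hypothetical stand-in for an ErrorInfo-like value: exception plus failure timestamp. */
final class ErrorInfoSketch {
    private final Throwable exception;
    private final long timestamp;

    ErrorInfoSketch(Throwable exception, long timestamp) {
        this.exception = Objects.requireNonNull(exception);
        this.timestamp = timestamp;
    }

    Throwable getException() {
        return exception;
    }

    long getTimestamp() {
        return timestamp;
    }
}

/** Hypothetical stand-in for an execution that remembers its failure info. */
final class ExecutionSketch {
    private ErrorInfoSketch failureInfo;

    void markFailed(Throwable cause, long timestamp) {
        failureInfo = new ErrorInfoSketch(cause, timestamp);
    }

    Optional<ErrorInfoSketch> getFailureInfo() {
        return Optional.ofNullable(failureInfo);
    }
}

/** Hypothetical history entry built from the failure info of a failed execution. */
final class HistoryEntrySketch {
    private final Throwable exception;
    private final long timestamp;

    private HistoryEntrySketch(Throwable exception, long timestamp) {
        this.exception = exception;
        this.timestamp = timestamp;
    }

    static HistoryEntrySketch fromFailedExecution(ExecutionSketch execution) {
        ErrorInfoSketch info =
                execution
                        .getFailureInfo()
                        .orElseThrow(() -> new IllegalArgumentException("execution has not failed"));
        return new HistoryEntrySketch(info.getException(), info.getTimestamp());
    }

    Throwable getException() {
        return exception;
    }

    long getTimestamp() {
        return timestamp;
    }
}
```

Since the test reads the expected timestamp from the same source the entry was built from, it no longer needs start and end bounds around the call.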

XComp

Thanks @bytesandwich. I looked through your questions and changes and responded to all of them. Please see my comments above and below. Feel free to reach out for further questions.

.../java/org/apache/flink/runtime/scheduler/exceptionhistory/FailureHandlingResultSnapshot.java

bytesandwich · 2021-08-19T01:59:45Z

Hi @XComp I just switched jobs so I was out for a bit. I see how you envision the test and I implemented it that way. PTAL!

XComp

Thanks, @bytesandwich. The changes look good. We're going in the right direction with it. Great! 👍 I added some comments to your code changes. Please check them below.

I didn't go through the tests, yet. AzureCI seems to be failing quite a bit. Is this related to your changes?

XComp · 2021-08-20T10:56:36Z

.../java/org/apache/flink/runtime/scheduler/exceptionhistory/FailureHandlingResultSnapshot.java

+     * @param failingExecutionVertexId an {@link ExecutionVertexID} the failure originates from, or
+     *     {@code None}.


Suggested change

-     * @param failingExecutionVertexId an {@link ExecutionVertexID} the failure originates from, or
-     *     {@code None}.
+     * @param failingExecutionVertexId the {@link ExecutionVertexID} referring to the {@link Execution} that failed. {@code null} should be used in case of a global failure.

I adapted the text a bit since we're not relying on Optional anymore, i.e. None is not exactly correct in this context.

XComp · 2021-08-20T10:59:29Z

.../java/org/apache/flink/runtime/scheduler/exceptionhistory/FailureHandlingResultSnapshot.java

+     * @param concurrentVertexIds {@link ExecutionVertexID} Task vertices concurrently failing with
+     *     the {@code failingExecutionVertexID}.


Suggested change

-     * @param concurrentVertexIds {@link ExecutionVertexID} Task vertices concurrently failing with
-     *     the {@code failingExecutionVertexID}.
+     * @param concurrentVertexIds {@link ExecutionVertexID} referring to {@link Execution Executions} that failed while the initial failure of {@code failingExecutionVertexID} was handled.

XComp · 2021-08-20T11:00:33Z

.../java/org/apache/flink/runtime/scheduler/exceptionhistory/FailureHandlingResultSnapshot.java

+    public static FailureHandlingResultSnapshot create(
+            @Nullable ExecutionVertexID failingExecutionVertexId,
+            Throwable rootCause,
+            Set<ExecutionVertexID> concurrentVertexIds,


Suggested change

-            Set<ExecutionVertexID> concurrentVertexIds,
+            Set<ExecutionVertexID> concurrentlyFailingExecutionVertexIds,

nit: a small suggestion to make the parameter more expressive.

XComp · 2021-08-20T11:10:08Z

...ntime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/StateWithExecutionGraph.java

        }
    }

+    void maybeArchiveExecutionFailure(TaskExecutionStateTransition taskExecutionStateTransition) {


Suggested change

-    void maybeArchiveExecutionFailure(TaskExecutionStateTransition taskExecutionStateTransition) {
+    void archiveExecutionFailureIfFailed(TaskExecutionStateTransition taskExecutionStateTransition) {

as an idea to make it more explicit what the method does

XComp · 2021-08-20T11:11:18Z

...ntime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/StateWithExecutionGraph.java

+
+        if (taskExecutionStateTransition.getExecutionState() != ExecutionState.FAILED) {
+            return;
+        }


Suggested change

if (taskExecutionStateTransition.getExecutionState() != ExecutionState.FAILED) {

return;

}

if (taskExecutionStateTransition.getExecutionState() != ExecutionState.FAILED) {

return;

}

nit: To visually separate the failed state handling...

XComp · 2021-08-20T11:31:38Z

...ntime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/StateWithExecutionGraph.java

+    }
+
+    void archiveExecutionFailure(
+            @Nullable ExecutionVertexID failingExecutionVertexId, Throwable cause) {


Could we add some JavaDoc here as well to describe the @Nullable contract?

XComp · 2021-08-20T11:37:25Z

...ntime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/StateWithExecutionGraph.java


    private final Logger logger;

+    protected final ClassLoader userCodeClassLoader;


Suggested change

-    protected final ClassLoader userCodeClassLoader;
+    private final ClassLoader userCodeClassLoader;

Sharing members with subclasses might not be the best approach. What about making userCodeClassLoader private and providing a getError(TaskExecutionStateTransition stateTransition) method in StateWithExecutionGraph? All subclasses could use that method as well.
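A rough sketch of that encapsulation, using hypothetical stand-ins: StateBaseSketch plays the role of StateWithExecutionGraph, TransitionSketch the role of TaskExecutionStateTransition, and the Function-based "deserializer" is an assumption that replaces the real error deserialization. Subclasses never see the class loader; they only call getError.

```java
import java.util.Objects;
import java.util.function.Function;

/** Hypothetical stand-in for a state transition whose error needs a class loader to resolve. */
final class TransitionSketch {
    private final Function<ClassLoader, Throwable> errorDeserializer;

    TransitionSketch(Function<ClassLoader, Throwable> errorDeserializer) {
        this.errorDeserializer = Objects.requireNonNull(errorDeserializer);
    }

    Throwable getError(ClassLoader userCodeClassLoader) {
        return errorDeserializer.apply(userCodeClassLoader);
    }
}

/** Hypothetical base state that keeps the user code class loader private. */
abstract class StateBaseSketch {
    private final ClassLoader userCodeClassLoader;

    StateBaseSketch(ClassLoader userCodeClassLoader) {
        this.userCodeClassLoader = Objects.requireNonNull(userCodeClassLoader);
    }

    /** The only way for subclasses to resolve the error of a transition. */
    final Throwable getError(TransitionSketch transition) {
        return transition.getError(userCodeClassLoader);
    }
}

/** Hypothetical subclass that uses the accessor instead of the field. */
final class RestartingSketch extends StateBaseSketch {
    RestartingSketch(ClassLoader userCodeClassLoader) {
        super(userCodeClassLoader);
    }

    String describeFailure(TransitionSketch transition) {
        return "failure: " + getError(transition).getMessage();
    }
}
```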

XComp · 2021-08-20T11:53:12Z

flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/Restarting.java

    }

    @Override
    boolean updateTaskExecutionState(TaskExecutionStateTransition taskExecutionStateTransition) {


It's not enough to collect only task failures. We also want to collect global failures. Hence, you have to cover the archiving in Restarting.handleGlobalFailure.

XComp · 2021-08-20T11:55:33Z

flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/Failing.java


    @Override
    boolean updateTaskExecutionState(TaskExecutionStateTransition taskExecutionStateTransition) {
+        maybeArchiveExecutionFailure(taskExecutionStateTransition);


We also want to handle global failures, i.e. we want to extend Failing.handleGlobalFailure
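One way to cover both task failures and global failures in the Restarting and Failing states is to route them through a single archiving method with a documented nullable contract. The sketch below is hypothetical: FailureArchiveSketch and its method names are assumptions, and Strings replace ExecutionVertexID.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Objects;

/** Hypothetical archive that stores task failures and global failures in one place. */
final class FailureArchiveSketch {

    /** One archived failure; a null vertex ID marks a global failure. */
    static final class Archived {
        private final String failingVertexId;
        private final Throwable cause;

        Archived(String failingVertexId, Throwable cause) {
            this.failingVertexId = failingVertexId;
            this.cause = Objects.requireNonNull(cause);
        }

        boolean isGlobal() {
            return failingVertexId == null;
        }

        Throwable getCause() {
            return cause;
        }
    }

    private final List<Archived> archive = new ArrayList<>();

    /** Global failures are archived explicitly with a null vertex ID. */
    void handleGlobalFailure(Throwable cause) {
        archiveExecutionFailure(null, cause);
    }

    /** Task failures must name the failing vertex. */
    void handleTaskFailure(String failingVertexId, Throwable cause) {
        archiveExecutionFailure(Objects.requireNonNull(failingVertexId), cause);
    }

    /**
     * Archives a failure.
     *
     * @param failingVertexId the failing vertex, or null in case of a global failure
     * @param cause the failure cause, never null
     */
    private void archiveExecutionFailure(String failingVertexId, Throwable cause) {
        archive.add(new Archived(failingVertexId, cause));
    }

    List<Archived> getArchive() {
        return Collections.unmodifiableList(archive);
    }
}
```

Keeping the null decision inside handleGlobalFailure means no other code path can accidentally produce a global entry.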

XComp · 2021-08-20T12:33:03Z

flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/Canceling.java

            OperatorCoordinatorHandler operatorCoordinatorHandler,
-            Logger logger) {
-        super(context, executionGraph, executionGraphHandler, operatorCoordinatorHandler, logger);
+            Logger logger,


Just for the record: I verified that we do not collect failures while being already in cancelling stage for the DefaultScheduler (see source). Hence, not implementing the archiving in this class is correct. 👍

XComp · 2021-09-08T08:54:39Z

@bytesandwich Any update on your side? If not, I might pick it up to finalize it for Flink 1.15 after the upcoming release of 1.14.

bytesandwich · 2021-09-12T05:03:29Z

Hi @XComp I think that would be a good idea, unless you feel that the changes you asked for are the final changes to land the diff? I am not working with Flink at my day job at the moment.

github-actions · 2025-01-14T18:07:27Z

This PR is being marked as stale since it has not had any activity in the last 180 days.
If you would like to keep this PR alive, please leave a comment asking for a review.
If the PR has merge conflicts, update it with the latest from the base branch.

If you are having difficulty finding a reviewer, please reach out to the [community](https://flink.apache.org/what-is-flink/community/).

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 90 days, it will be automatically closed.

XComp · 2025-01-15T11:52:24Z

This PR was superseded by #18689. Closing the PR.

bytesandwich marked this pull request as draft May 11, 2021 18:37

rmetzger added the review=description? label May 11, 2021

rmetzger added the component=Runtime/Coordination label May 11, 2021

bytesandwich changed the title ~~[FLINK-21439][core] WIP: Adds failing test case for AdaptiveScheduler~~ [FLINK-21439][core] WIP: Adds Exception History for AdaptiveScheduler Jun 2, 2021

XComp reviewed Jun 2, 2021

XComp requested changes Jun 14, 2021

XComp requested changes Jun 26, 2021

zentol self-assigned this Jul 9, 2021

XComp requested changes Aug 3, 2021

zentol removed their assignment Aug 5, 2021

bytesandwich added 12 commits August 18, 2021 17:42

[FLINK-21439][core] WIP: Adds failing test case for AdaptiveScheduler

be278b1

This adds a first failing test case for AdaptiveScheduler to return an ExceptionHistory.

add failingExecutionVertexId to Executing.FailureResult

374ba35

Pass failure information through from StateWithExecutionGraph to Adap…

558c52f

…tiveScheduler

mend

0add77f

fix test failures

51675dc

address review

93b5819

fix tests

b5f65e1

commit for upstream tests

72ad1f0

add failing concurrent test

230ac40

fix javadoc

4274246

add userCodeClassLoader to StateWithExecutionGraph

8c47f3d

handle concurrent failures in AdaptiveScheduler

1645ccf

fix test typo

903a79e

bytesandwich marked this pull request as ready for review August 19, 2021 01:59

XComp requested changes Aug 20, 2021

XComp closed this Jan 15, 2025
