[FLINK-21258] Add Canceling state for DeclarativeScheduler#14909

Closed

rmetzger wants to merge 3 commits into apache:master from

rmetzger:FLINK-21258-cancelling

Contributor

rmetzger commented Feb 9, 2021

This PR is based on #14879

What is the purpose of the change

Declarative Scheduler consists of a number of internal states.

Note that this change is not usable on its own yet, because the other parts of the declarative scheduler have not been merged (see the prototype this PR is based on: https://github.com/tillrohrmann/flink/tree/declarative-scheduler).

Verifying this change

This change adds unit tests.
Note that integration tests for the declarative scheduler will cover additional functionality.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (yes / no)
The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
The serializers: (yes / no / don't know)
The runtime per-record code paths (performance sensitive): (yes / no / don't know)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (yes / no / don't know)
The S3 file system connector: (yes / no / don't know)

Documentation

Will be handled in a separate PR.

Collaborator

flinkbot commented Feb 9, 2021

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit b1053be (Tue Feb 09 11:32:29 UTC 2021)

Warnings:

No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

❓ 1. The [description] looks good.
❓ 2. There is [consensus] that the contribution should go into Flink.
❓ 3. Needs [attention] from.
❓ 4. The change fits into the overall [architecture].
❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details

The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
@flinkbot approve all to approve all aspects
@flinkbot approve-until architecture to approve everything until architecture
@flinkbot attention @username1 [@username2 ..] to require somebody's attention
@flinkbot disapprove architecture to remove an approval you gave earlier

rmetzger added the review=description? label

Collaborator

flinkbot commented Feb 9, 2021

CI report:

41d226e Azure: FAILURE

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run travis re-run the last Travis build
@flinkbot run azure re-run the last Azure build

rmetzger added the component=Runtime/Coordination label

tillrohrmann requested changes

Contributor

tillrohrmann left a comment

Thanks for creating this PR @rmetzger. The changes already look quite good. I had a couple of comments concerning the tests. Please take a look.

flink-runtime/src/test/java/org/apache/flink/runtime/scheduler/declarative/CancelingTest.java Outdated

+                  public void testTransitionToFinishedWhenCancellationCompletes() throws Exception {
+                      try (MockStateWithExecutionGraphContext ctx = new MockStateWithExecutionGraphContext()) {
+                          Canceling canceling = createCancelingState(ctx);

Contributor

tillrohrmann Feb 10, 2021

Don't we have to call onEnter here?

flink-runtime/src/test/java/org/apache/flink/runtime/scheduler/declarative/CancelingTest.java Outdated

+                  @Test
+                  public void testTransitionToFinishedWhenCancellationCompletes() throws Exception {
+                      try (MockStateWithExecutionGraphContext ctx = new MockStateWithExecutionGraphContext()) {
+                          Canceling canceling = createCancelingState(ctx);

Contributor

tillrohrmann Feb 10, 2021

It is a bit hidden why this test works. I assume it is because the ExecutionGraph does not contain any Executions which are in DEPLOYING or RUNNING state. That's why an ExecutionGraph.cancel call can directly cancel all Executions. Maybe we can make this a bit more explicit.

flink-runtime/src/test/java/org/apache/flink/runtime/scheduler/declarative/CancelingTest.java Outdated

+                  public void testTransitionToSuspend() throws Exception {
+                      try (MockStateWithExecutionGraphContext ctx = new MockStateWithExecutionGraphContext()) {
+                          Canceling canceling = createCancelingState(ctx);

Contributor

tillrohrmann Feb 10, 2021

I assume that you don't call onEnter because it would make the state go to Finished. I think that we should stick to the convention of State and call onEnter. Otherwise we risk that we are testing something which is impossible to reach in production. The proper solution would be to ensure that the ExecutionGraph does not directly go to CANCELLED when onEnter is called.
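To illustrate the suggestion, here is a minimal, self-contained sketch of a test setup in which onEnter is called but the graph does not terminate on its own. All types (`PendingCancelGraph`, `CancelingSketch`) and the transition names are hypothetical stand-ins, not Flink's real `ExecutionGraph` or `Canceling`; the actual fix in the PR may look different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// All types here are hypothetical stand-ins, not Flink's real classes.

/** Graph stub whose cancellation stays pending until the test completes it explicitly. */
class PendingCancelGraph {
    private final CompletableFuture<Void> terminated = new CompletableFuture<>();
    private boolean cancelRequested;

    void cancel() {
        // Models executions that are still running: cancel() alone does not terminate the graph.
        cancelRequested = true;
    }

    void completeCancellation() {
        terminated.complete(null);
    }

    boolean isCancelRequested() {
        return cancelRequested;
    }

    CompletableFuture<Void> getTerminationFuture() {
        return terminated;
    }
}

/** Simplified state that follows the convention of calling onEnter() first. */
class CancelingSketch {
    private final PendingCancelGraph graph;
    private final List<String> transitions = new ArrayList<>();

    CancelingSketch(PendingCancelGraph graph) {
        this.graph = graph;
    }

    void onEnter() {
        graph.cancel();
        // Fires only once the graph reports termination.
        graph.getTerminationFuture().thenRun(() -> transitions.add("Finished"));
    }

    void suspend() {
        transitions.add("Suspended");
    }

    List<String> getTransitions() {
        return transitions;
    }
}

class SuspendAfterOnEnterDemo {
    public static void main(String[] args) {
        PendingCancelGraph graph = new PendingCancelGraph();
        CancelingSketch canceling = new CancelingSketch(graph);

        canceling.onEnter(); // same entry path as in production
        System.out.println("after onEnter: " + canceling.getTransitions());

        canceling.suspend();
        System.out.println("after suspend: " + canceling.getTransitions());
    }
}
```

Because the stub keeps the termination future pending, the suspend path is exercised from a state that was entered the regular way.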

flink-runtime/src/test/java/org/apache/flink/runtime/scheduler/declarative/CancelingTest.java

+                                                  ExecutionState.FAILED,
+                                                  new RuntimeException()));
+                          canceling.updateTaskExecutionState(update);
+                          ctx.assertNoStateTransition();

Contributor

tillrohrmann Feb 10, 2021

We should also assert that the ExecutionGraph has been updated. Otherwise a passing implementation of the Canceling.updateTaskExecutionState could be an empty method.

Technically, we would also have to do this for the other StateWithExecutionGraph subclasses.
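A small sketch of why the extra assertion matters, using hypothetical stand-in types (`RecordingGraph`, `ForwardingCanceling`, `EmptyCanceling`) rather than Flink's real classes: both implementations satisfy "no state transition", but only the forwarding one updates the graph.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration; not Flink's real classes.

/** Graph stub that records every task state update it receives. */
class RecordingGraph {
    final List<String> updates = new ArrayList<>();

    boolean updateState(String taskState) {
        updates.add(taskState);
        return true;
    }
}

interface CancelingLike {
    boolean updateTaskExecutionState(String taskState);

    boolean hasTransitioned();
}

/** Forwards the update to the graph and stays in the current state. */
class ForwardingCanceling implements CancelingLike {
    private final RecordingGraph graph;

    ForwardingCanceling(RecordingGraph graph) {
        this.graph = graph;
    }

    @Override
    public boolean updateTaskExecutionState(String taskState) {
        return graph.updateState(taskState);
    }

    @Override
    public boolean hasTransitioned() {
        return false;
    }
}

/** Broken variant: does nothing, yet also never transitions. */
class EmptyCanceling implements CancelingLike {
    @Override
    public boolean updateTaskExecutionState(String taskState) {
        return false;
    }

    @Override
    public boolean hasTransitioned() {
        return false;
    }
}

class GraphUpdateAssertionDemo {
    public static void main(String[] args) {
        RecordingGraph forwardedTo = new RecordingGraph();
        CancelingLike good = new ForwardingCanceling(forwardedTo);
        good.updateTaskExecutionState("FAILED");

        RecordingGraph untouched = new RecordingGraph();
        CancelingLike bad = new EmptyCanceling();
        bad.updateTaskExecutionState("FAILED");

        // Both pass a "no state transition" check ...
        System.out.println("good transitioned: " + good.hasTransitioned());
        System.out.println("bad transitioned: " + bad.hasTransitioned());
        // ... only the recorded updates tell them apart.
        System.out.println("good graph updates: " + forwardedTo.updates);
        System.out.println("bad graph updates: " + untouched.updates);
    }
}
```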

flink-runtime/src/test/java/org/apache/flink/runtime/scheduler/declarative/CancelingTest.java

Comment on lines +167 to +185

+                      public void notifyGlobalFailure(Throwable t) {
+                          canceling.handleGlobalFailure(t);
+                      }

Contributor

tillrohrmann Feb 10, 2021

For which test do we need this implementation?

Contributor Author

rmetzger Feb 10, 2021

For the testTaskFailuresAreIgnored() test.

I'm actually proposing to remove that test. It doesn't belong here. We need to test the proper handling of the updateTaskExecutionState() on the DeclarativeScheduler itself (which registers an InternalFailuresListener and forwards failures to the handleGlobalFailure() method of the State).
Since we are testing the handleGlobalFailure() in a unit test, having a test for updateTaskExecutionState() here doesn't make much sense.

@zentol: Can we add such a test to the DeclarativeScheduler skeleton PR?

Contributor

tillrohrmann Feb 11, 2021

If I am not mistaken, then we don't need an InternalFailuresListener for testing updateTaskExecutionState. I thought that this listener is only used when a failure occurs in some other operation (e.g. deploy). Testing that updateTaskExecutionState stays in the CANCELING state makes sense to me.

tillrohrmann and others added 3 commits

February 10, 2021 16:13


          [FLINK-21258] Add Canceling state for DeclarativeScheduler

12546ab


          [FLINK-21258] Add test for Canceling state

e23a1c8


          adress comments

41d226e

Contributor Author

rmetzger commented Feb 10, 2021

I rebased this change onto the latest master (thus fewer commits are included) and addressed all comments.
Once you confirm that testTaskFailuresAreIgnored() doesn't make sense, I'll remove it.

I once again introduced a MockExecutionGraph for this test. Extracting an interface from the ExecutionGraph is a really involved change that would not be worth it for my needs in this PR.
It is easier to introduce an ExecutionGraph interface in a separate change.

rmetzger force-pushed the FLINK-21258-cancelling branch from b1053be to 41d226e Compare

February 10, 2021 18:41

zentol mentioned this pull request

[FLINK-21100][coordination] Add DeclarativeScheduler #14921

Merged

tillrohrmann approved these changes

Contributor

tillrohrmann left a comment

Thanks for updating this PR @rmetzger. I think it makes sense to keep testTaskFailuresAreIgnored because we want to ensure that we don't leave the state on a failed task state update. Moreover, I think we shouldn't need TestInternalFailuresListener.

Contributor Author

rmetzger commented Feb 11, 2021

Thanks for your review!

I'll remove the TestInternalFailuresListener and mock the ExecutionGraph to intercept the failGlobal() method, then merge the change.
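Roughly, the idea could be sketched as follows. Everything here is a hypothetical stand-in (`GraphSketch`, `FailGlobalInterceptingGraph`, `CancelingUnderTest`), and the assumption that a FAILED task update escalates to `failGlobal()` exists only to give the interception something to catch; the real ExecutionGraph behaves differently in detail.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; the real ExecutionGraph is a class with far more behaviour.

class GraphSketch {
    /** Assumption for this sketch: a FAILED task update escalates to a global failure. */
    boolean updateState(String taskState) {
        if ("FAILED".equals(taskState)) {
            failGlobal(new RuntimeException("task failed"));
        }
        return true;
    }

    void failGlobal(Throwable cause) {
        throw new IllegalStateException("failover logic is not available in this sketch", cause);
    }
}

/** Test double that intercepts failGlobal() and records the cause instead of acting on it. */
class FailGlobalInterceptingGraph extends GraphSketch {
    final List<Throwable> globalFailures = new ArrayList<>();

    @Override
    void failGlobal(Throwable cause) {
        globalFailures.add(cause);
    }
}

/** Simplified state under test: forwards updates and never leaves the state on failures. */
class CancelingUnderTest {
    private final GraphSketch graph;
    private boolean transitioned;

    CancelingUnderTest(GraphSketch graph) {
        this.graph = graph;
    }

    boolean updateTaskExecutionState(String taskState) {
        return graph.updateState(taskState);
    }

    boolean hasTransitioned() {
        return transitioned;
    }
}

class InterceptFailGlobalDemo {
    public static void main(String[] args) {
        FailGlobalInterceptingGraph graph = new FailGlobalInterceptingGraph();
        CancelingUnderTest canceling = new CancelingUnderTest(graph);

        canceling.updateTaskExecutionState("FAILED");

        System.out.println("intercepted failures: " + graph.globalFailures.size());
        System.out.println("state transitioned: " + canceling.hasTransitioned());
    }
}
```

With the interception living in the graph double, no separate failure listener is needed to observe what happened.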

rmetzger closed this in

fe41328

Labels

component=Runtime/Coordination review=description?