Conversation

@akalash
Contributor

@akalash akalash commented Apr 6, 2021

… failed

What is the purpose of the change

This PR adds an invocation of task cancellation before the clean-up method when the execution has failed, which allows sending a cancel signal to custom code in order to avoid hangs.

Brief change log


  • Invoke cancelTask before cleanUpInvoke when the invocation fails
  • Invoke declineCheckpoint instead of propagating the exception when performCheckpoint fails with an exception
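The ordering described in the first bullet can be sketched in plain Java. This is a minimal illustration of the control flow, not Flink's actual StreamTask code; the method names cancelTask and cleanUpInvoke are borrowed from the change log, and invoke stands in for the failing task body:

```java
import java.util.ArrayList;
import java.util.List;

class CancelBeforeCleanup {
    static final List<String> calls = new ArrayList<>();

    // Stand-in for the task body that fails during execution.
    static void invoke() { throw new RuntimeException("task failed"); }
    static void cancelTask() { calls.add("cancelTask"); }
    static void cleanUpInvoke() { calls.add("cleanUpInvoke"); }

    public static void main(String[] args) {
        try {
            invoke();
        } catch (RuntimeException e) {
            // The fix: signal cancellation to user code before cleaning up,
            // so loops checking a cancellation flag can exit instead of hanging.
            cancelTask();
        } finally {
            cleanUpInvoke(); // cleanup still runs, but only after the cancel signal
        }
        System.out.println(String.join(",", calls));
    }
}
```

The point of the ordering is that cleanup may block on resources the user code still holds; cancelling first gives that code a chance to release them.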

Verifying this change

This change added tests and can be verified as follows:


  • Added an integration test for a failing checkpoint in the snapshotState method (CheckpointFailureManagerITCase)

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@akalash akalash marked this pull request as draft April 6, 2021 15:27
@flinkbot
Collaborator

flinkbot commented Apr 6, 2021

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 2b29cc5 (Sat Aug 28 11:14:15 UTC 2021)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details
The bot tracks the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Collaborator

flinkbot commented Apr 6, 2021

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

@akalash
Contributor Author

akalash commented Apr 8, 2021

@flinkbot run azure

@akalash akalash marked this pull request as ready for review April 8, 2021 07:59
@akalash
Contributor Author

akalash commented Apr 9, 2021

@flinkbot run azure

1 similar comment
@akalash
Contributor Author

akalash commented Apr 12, 2021

@flinkbot run azure

Contributor

@rkhachatryan rkhachatryan left a comment


Thanks for the fix @akalash , production code changes look good to me.

I have some concerns about the test, PTAL at the comments below.

I'd also consider putting the test into StreamTaskTest (and not in CheckpointFailureManagerITCase as the fix has nothing to do with Failure Manager).

public void testSourceFailureTriggerJobFailed() throws Exception {
    // given: Environment with failed source and no restart strategy.
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.enableCheckpointing(2000L);
Contributor


I guess this big interval is needed to ensure that the LegacySourceThread has actually started.
Could we use a more explicit means instead (like a future / latch / condition)?
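The latch-based signalling the reviewer suggests can be sketched in plain Java with a CountDownLatch, instead of relying on a generous checkpoint interval to hide the race. The thread and names here are illustrative stand-ins, not the test's actual code:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class LatchStartSignal {
    public static void main(String[] args) throws InterruptedException {
        CountDownLatch sourceStarted = new CountDownLatch(1);

        // Stand-in for the legacy source thread.
        Thread source = new Thread(() -> {
            sourceStarted.countDown(); // explicit "I have started" signal
            // ... the source run loop would execute here ...
        });
        source.start();

        // The test proceeds only once the source has actually started,
        // rather than assuming a fixed interval is long enough.
        boolean started = sourceStarted.await(10, TimeUnit.SECONDS);
        System.out.println("source started: " + started);
        source.join();
    }
}
```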

Contributor Author


I hadn't thought about the true purpose of this property. I will take a look and change it to a latch if possible.

Comment on lines 139 to 141
public void run(SourceContext<String> ctx) throws Exception {
    while (running) {
        ctx.collect("test");
Contributor


To use ctx.collect, synchronized (ctx.getCheckpointLock()) must be held.
But we probably don't need to emit anything here. Can we just block on some condition (like latch.await)?
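The locking pattern the reviewer describes can be sketched in plain Java. The lock object here stands in for ctx.getCheckpointLock() and collect() for ctx.collect(); in Flink's legacy source API the contract is that record emission must happen under the checkpoint lock:

```java
import java.util.ArrayList;
import java.util.List;

class CheckpointLockEmit {
    // Stand-in for ctx.getCheckpointLock().
    static final Object checkpointLock = new Object();
    static final List<String> out = new ArrayList<>();
    static volatile boolean running = true;

    // Stand-in for ctx.collect().
    static void collect(String record) { out.add(record); }

    public static void main(String[] args) {
        int emitted = 0;
        while (running && emitted < 3) {
            // Emission must hold the checkpoint lock so it cannot
            // interleave with a checkpoint of the operator state.
            synchronized (checkpointLock) {
                collect("test");
            }
            emitted++;
        }
        System.out.println(out.size());
    }
}
```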

Contributor Author


latch.await is not OK because the goal is to simulate a loop that checks the 'running' flag. But it can easily be replaced by parkNanos, for example.
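The loop the author describes can be sketched in plain Java: spin on a volatile 'running' flag and park briefly each iteration instead of emitting records. This is an illustrative stand-in, not the test's actual code (and, as noted later in the review, an interruptible wait was ultimately preferred over parkNanos):

```java
import java.util.concurrent.locks.LockSupport;

class ParkingRunLoop {
    static volatile boolean running = true;

    public static void main(String[] args) throws InterruptedException {
        Thread loop = new Thread(() -> {
            // Simulated source loop: checks the cancellation flag each pass,
            // parking ~1 ms instead of doing real work.
            while (running) {
                LockSupport.parkNanos(1_000_000L);
            }
        });
        loop.start();

        Thread.sleep(50);
        running = false; // cancel() would flip this flag
        loop.join();
        System.out.println("stopped");
    }
}
```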

private static class FailedSource extends RichParallelSourceFunction<String>
implements CheckpointedFunction {

public static final AtomicInteger INITIALIZE_TIMES = new AtomicInteger(0);
Contributor


This variable ideally needs to be reset before running the test.

Comment on lines 114 to 119
Optional<RuntimeException> throwable =
        findThrowable(jobException, RuntimeException.class);

// then: Job failed with expected exception.
assertTrue(throwable.isPresent());
assertEquals(FailedSource.SOURCE_FAILED_MSG, throwable.get().getMessage());
Contributor


These assertions seem a bit fragile to me (what if Flink wraps an exception into a RuntimeException?).
And they do not actually check the production code, but the test itself: without the fix the test would time out, and without the exception being thrown the job might have exited for some other reason.

But the setup is quite simple IMO, so I'd remove them.

WDYT?
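For context on the cause-chain search used in the assertions under discussion, here is a plain-Java stand-in for a findThrowable-style helper (Flink ships one in ExceptionUtils; this sketch only mirrors the idea): walk the cause chain and return the first throwable of the requested type.

```java
import java.util.Optional;

class FindThrowableSketch {
    // Walk the cause chain looking for the first throwable of the given type.
    static <T extends Throwable> Optional<T> findThrowable(Throwable t, Class<T> type) {
        while (t != null) {
            if (type.isInstance(t)) {
                return Optional.of(type.cast(t));
            }
            t = t.getCause();
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // The original failure is wrapped, as job exceptions often are.
        Exception wrapped = new Exception(new RuntimeException("source failed"));
        Optional<RuntimeException> found = findThrowable(wrapped, RuntimeException.class);
        System.out.println(found.map(Throwable::getMessage).orElse("none"));
    }
}
```

Note the fragility the reviewer points out: if the framework wraps the original error in another RuntimeException, this search matches the wrapper first and the message assertion breaks.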

Contributor Author


What you said makes sense, but the main goal was to check that the cluster fails only for the expected reason. Anyway, I will rework this place to avoid the fragility.

Contributor

@rkhachatryan rkhachatryan left a comment


Thanks for updating the PR @akalash,
it LGTM except that I'd replace parkNanos with something interruptible (please see the comment below).

Contributor

@rkhachatryan rkhachatryan left a comment


Thanks for updating the PR @akalash, LGTM.
I'd merge it once tests finish.

@rkhachatryan
Contributor

Merged into master 345bf34.
