[FLINK-21990][streaming] Cancel task before clean up if execution was… #15496
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
Automated Checks
Last check on commit 2b29cc5 (Sat Aug 28 11:14:15 UTC 2021)
Warnings:
Mention the bot in a comment to re-run the automated checks.
Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process.
Details
The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.
Bot commands
The @flinkbot bot supports the following commands:
@flinkbot run azure
rkhachatryan
left a comment
Thanks for the fix @akalash , production code changes look good to me.
I have some concerns about the test, PTAL at the comments below.
I'd also consider putting the test into StreamTaskTest (and not in CheckpointFailureManagerITCase as the fix has nothing to do with Failure Manager).
    public void testSourceFailureTriggerJobFailed() throws Exception {
        // given: Environment with failed source and no restart strategy.
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(2000L);
I guess this big interval is needed to ensure that the LegacySourceThread has actually started.
Could we use some more explicit means instead? (like a future / latch / condition)
I didn't think about the true target of this property. I will take a look and indeed change it to a latch if possible.
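A minimal sketch of the latch idea suggested above, in plain Java rather than against Flink's source API. `LatchedSource`, `awaitStarted`, and `LatchedSourceDemo` are hypothetical names introduced for illustration; the point is only that the test waits for an explicit "thread has started" signal instead of relying on a long checkpoint interval.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a test source; it does not implement any Flink interface.
class LatchedSource implements Runnable {
    // Counted down as soon as run() is entered, so a test can wait for it explicitly.
    private final CountDownLatch started = new CountDownLatch(1);
    private volatile boolean running = true;

    @Override
    public void run() {
        started.countDown();
        while (running) {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    void cancel() {
        running = false;
    }

    // Returns true if run() was entered within the timeout.
    boolean awaitStarted(long timeout, TimeUnit unit) throws InterruptedException {
        return started.await(timeout, unit);
    }
}

class LatchedSourceDemo {
    // Starts the source, waits for the start signal, then cancels and joins.
    static boolean startThenCancel() {
        LatchedSource source = new LatchedSource();
        Thread thread = new Thread(source, "source-thread");
        thread.start();
        try {
            boolean started = source.awaitStarted(5, TimeUnit.SECONDS);
            source.cancel();
            thread.join(5000);
            return started && !thread.isAlive();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

In a real Flink test the latch would likely need to be static, since the source instance is serialized before it runs; that detail is an assumption and is not shown here.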
    public void run(SourceContext<String> ctx) throws Exception {
        while (running) {
            ctx.collect("test");
To use ctx.collect, synchronized (ctx.getCheckpointLock()) must be used.
But we probably don't need to emit anything here. Can we just block on some condition (like latch.await)?
latch.await is not OK because the goal is to simulate a loop that checks the 'running' flag. But it can easily be replaced by parkNanos, for example.
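For reference, the locking pattern the reviewer mentions for ctx.collect could look roughly like the sketch below. To keep it runnable without Flink on the classpath, `SketchContext` is a hypothetical stand-in that mirrors only the two methods named in the comment (`collect` and `getCheckpointLock`); `RecordingContext` and the `maxRecords` bound are likewise assumptions added for the demo.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in mirroring the two context methods named in the review comment.
interface SketchContext<T> {
    void collect(T element);

    Object getCheckpointLock();
}

class LockingSource {
    private volatile boolean running = true;

    // Emits up to maxRecords elements, holding the checkpoint lock for each emit.
    void run(SketchContext<String> ctx, int maxRecords) {
        int emitted = 0;
        while (running && emitted < maxRecords) {
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect("test");
            }
            emitted++;
        }
    }

    void cancel() {
        running = false;
    }
}

// Test double that records elements and notes any emit made without the lock held.
class RecordingContext implements SketchContext<String> {
    final Object lock = new Object();
    final List<String> records = new ArrayList<>();
    boolean collectedWithoutLock = false;

    @Override
    public void collect(String element) {
        if (!Thread.holdsLock(lock)) {
            collectedWithoutLock = true;
        }
        records.add(element);
    }

    @Override
    public Object getCheckpointLock() {
        return lock;
    }
}
```

If the test source does not need to emit anything, as the reviewer suggests, the lock is not needed at all and the loop body can simply wait.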
    private static class FailedSource extends RichParallelSourceFunction<String>
            implements CheckpointedFunction {

        public static final AtomicInteger INITIALIZE_TIMES = new AtomicInteger(0);
This variable ideally needs to be reset before running the test.
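One way to do that reset, sketched in plain Java. `CountingSource`, `setUp`, and `runOnce` are hypothetical names; in a JUnit 4 test the reset would typically live in a method annotated with `@Before`, which is omitted here so the sketch has no test-framework dependency.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a source that counts how often it was initialized.
class CountingSource {
    static final AtomicInteger INITIALIZE_TIMES = new AtomicInteger(0);

    void initializeState() {
        INITIALIZE_TIMES.incrementAndGet();
    }
}

class CountingSourceTestSketch {
    // Would carry @Before in a JUnit 4 test: clears state left by earlier tests.
    static void setUp() {
        CountingSource.INITIALIZE_TIMES.set(0);
    }

    // Simulates one test run and returns the counter value the test would observe.
    static int runOnce() {
        setUp();
        new CountingSource().initializeState();
        return CountingSource.INITIALIZE_TIMES.get();
    }
}
```

Because the counter is static, skipping the reset would make the observed value depend on how many tests ran earlier in the same JVM.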
    Optional<RuntimeException> throwable =
            findThrowable(jobException, RuntimeException.class);

    // then: Job failed with expected exception.
    assertTrue(throwable.isPresent());
    assertEquals(FailedSource.SOURCE_FAILED_MSG, throwable.get().getMessage());
These assertions seem a bit fragile to me (what if Flink wraps an exception into RuntimeException?).
And they are not actually checking the production code, but the test itself: without the fix the test will time out; and without the exception thrown the job might have exited for some other reason.
But the setup is quite simple IMO, so I'd remove them.
WDYT?
What you said makes sense, but the main goal is to check that the cluster fails only for the expected reason. Anyway, I think I will rework this part to avoid the fragility.
rkhachatryan
left a comment
Thanks for updating the PR @akalash ,
it LGTM except that I'd replace parkNanos with something interruptible (please see comment below).
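The inline comment itself is not reproduced here, so the following is only one possible reading of "something interruptible": `Thread.sleep` throws `InterruptedException` when the thread is interrupted, whereas `LockSupport.parkNanos` merely returns and leaves the interrupt flag set, so a loop around it keeps spinning until `running` changes. `InterruptibleLoop` and `InterruptibleLoopDemo` are hypothetical names for this sketch.

```java
// Hypothetical loop that checks a 'running' flag but also reacts to interruption.
class InterruptibleLoop implements Runnable {
    private volatile boolean running = true;
    volatile boolean sawInterrupt = false;

    @Override
    public void run() {
        try {
            while (running) {
                // Throws InterruptedException if the thread is (or gets) interrupted.
                Thread.sleep(50);
            }
        } catch (InterruptedException e) {
            sawInterrupt = true;
            // Restore the flag so callers further up can still observe the interrupt.
            Thread.currentThread().interrupt();
        }
    }

    void cancel() {
        running = false;
    }
}

class InterruptibleLoopDemo {
    // Interrupts the loop without touching 'running' and checks that it still exits.
    static boolean exitsOnInterrupt() {
        InterruptibleLoop loop = new InterruptibleLoop();
        Thread thread = new Thread(loop, "interruptible-loop");
        thread.start();
        thread.interrupt();
        try {
            thread.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !thread.isAlive() && loop.sawInterrupt;
    }
}
```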
flink-streaming-java/src/test/java/org/apache/flink/streaming/runtime/tasks/StreamTaskTest.java
rkhachatryan
left a comment
Thanks for updating the PR @akalash, LGTM.
I'd merge it once tests finish.
Merged into master 345bf34.
… failed
What is the purpose of the change
This PR invokes task cancellation before the clean-up method when execution has failed, which sends a cancel signal to custom code and thereby avoids hangs.
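The ordering described above can be illustrated with a small plain-Java sketch. It is not the StreamTask code from this PR: `SketchTask`, its `invoke`, `cancel`, and `cleanUp` methods, and the simulated failure are all assumptions made for illustration. The user-code thread only stops when `running` is cleared, so a clean-up that joins it would wait out its full timeout (or hang, with an unbounded join) if no cancel signal were sent first.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical task: user code runs in its own thread and loops on a 'running' flag.
class SketchTask {
    private volatile boolean running = true;
    final List<String> events = new ArrayList<>();

    private final Thread userThread = new Thread(() -> {
        while (running) {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                return;
            }
        }
    }, "user-code");

    void invoke() throws Exception {
        userThread.start();
        try {
            // Simulated failure during execution.
            throw new RuntimeException("source failed");
        } catch (Exception failure) {
            // Signal user code first, so that clean-up below does not wait on it.
            cancel();
            throw failure;
        } finally {
            cleanUp();
        }
    }

    void cancel() {
        events.add("cancel");
        running = false;
    }

    void cleanUp() throws InterruptedException {
        // Bounded join so the sketch itself cannot hang.
        userThread.join(5000);
        events.add("cleanUp");
    }
}

class SketchTaskDemo {
    // Runs the task and returns the order in which the steps happened.
    static List<String> runAndCollectEvents() {
        SketchTask task = new SketchTask();
        try {
            task.invoke();
        } catch (Exception expected) {
            task.events.add("failed: " + expected.getMessage());
        }
        return task.events;
    }
}
```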
Brief change log
(for example:)
Verifying this change
This change added tests and can be verified as follows:
(example:)
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (yes / no)
Documentation