-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-10955] Flink Java Runner test flake: Could not find Flink job #15127
Conversation
Run Java PreCommit |
Run Java PreCommit |
6 similar comments
Run Java PreCommit |
Run Java PreCommit |
Run Java PreCommit |
Run Java PreCommit |
Run Java PreCommit |
Run Java PreCommit |
R: @ibzib |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix checkstyle (:runners:flink:1.13:checkstyleTest
)
@@ -159,6 +157,9 @@ private void runSavepointAndRestore(boolean isPortablePipeline) throws Exception | |||
// Initial parallelism | |||
options.setParallelism(2); | |||
options.setRunner(FlinkRunner.class); | |||
// Enable checkpointing interval for streaming non portable pipeline to avoid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this not a problem for portable pipelines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about this, but when I set a checkpointing interval for a portable pipeline, it shows a timeout error like in https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/3819/testReport/org.apache.beam.runners.flink/FlinkSavepointTest/testSavepointRestorePortable_2/.
The reason behind this fix is to enable restart after some job failure.
When this test fails, continuously shows the error: "Recovery is suppressed by NoRestartBackoffTimeStrategy" like in https://scans.gradle.com/s/n2coqujl4jc7i/tests/:runners:flink:1.13:test/org.apache.beam.runners.flink.FlinkSavepointTest/testSavepointRestoreLegacy?top-execution=1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about this, but when I set a checkpointing interval for a portable pipeline, it shows a timeout error like in https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/3819/testReport/org.apache.beam.runners.flink/FlinkSavepointTest/testSavepointRestorePortable/.
I'm not sure which error you are talking about? If the test passed, it's likely it's benign.
The reason behind this fix is to enable restart after some job failure.
When this test fails, continuously shows the error: "Recovery is suppressed by NoRestartBackoffTimeStrategy" like in https://scans.gradle.com/s/n2coqujl4jc7i/tests/:runners:flink:1.13:test/org.apache.beam.runners.flink.FlinkSavepointTest/testSavepointRestoreLegacy?top-execution=1.
Thanks for getting the build scan. It looks like something is going wrong while taking the savepoint. It looks like it could be a real bug, so let's wait to merge this until we are sure that's not the case.
Caused by: java.lang.Exception: Could not materialize checkpoint 2 for operator VerificationStage/ParMultiDo(Anonymous) (1/2)#0. |
| at org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:257) |
| ... 4 more |
| Caused by: java.lang.IllegalArgumentException |
| at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:122) |
| at org.apache.flink.runtime.checkpoint.CheckpointMetrics.<init>(CheckpointMetrics.java:74) |
| at org.apache.flink.runtime.checkpoint.CheckpointMetricsBuilder.build(CheckpointMetricsBuilder.java:135) |
| at org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.reportCompletedSnapshotStates(AsyncCheckpointRunnable.java:206) |
| at org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:158) |
| ... 3 more
It looks like the failed precondition is checking alignmentDurationNanos
. I'm not sure however what the unacceptable value is, nor where it is coming from. https://github.com/apache/flink/blob/3909c9f0a11e8b38b264db9e7716fb41e75cc524/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointMetrics.java#L74
Fixed |
@@ -194,7 +196,7 @@ private void runSavepointAndRestore(boolean isPortablePipeline) throws Exception | |||
private JobID executeLegacy(Pipeline pipeline) throws Exception { | |||
JobGraph jobGraph = getJobGraph(pipeline); | |||
flinkCluster.submitJob(jobGraph).get(); | |||
return jobGraph.getJobID(); | |||
return waitForJobToBeReady(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this line here, maybe if this PR is accepted and merged we can close #15119
What is the next step on this PR? |
I'm pretty sure the true cause of the flake is a Flink issue (FLINK-23201), so it is fine to ignore the test until we upgrade to the Flink patch releases containing the fix. |
Is it ok to close this PR then? |
Yes. |
A different approach from #15107 to fix this flaky test, FlinkSavepointTest in Java PreCommit
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username
).[BEAM-XXX] Fixes bug in ApproximateQuantiles
, where you replaceBEAM-XXX
with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.CHANGES.md
with noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
ValidatesRunner
compliance status (on master branch)Examples testing status on various runners
Post-Commit SDK/Transform Integration Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.