-
Notifications
You must be signed in to change notification settings - Fork 13.9k
[FLINK-34200][test] Fix the bug that AutoRescalingITCase.testCheckpointRescalingInKeyedState fails occasionally #24246
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -86,7 +86,7 @@ | |
| import java.util.concurrent.TimeUnit; | ||
|
|
||
| import static org.apache.flink.runtime.testutils.CommonTestUtils.waitForAllTaskRunning; | ||
| import static org.apache.flink.runtime.testutils.CommonTestUtils.waitForOneMoreCheckpoint; | ||
| import static org.apache.flink.runtime.testutils.CommonTestUtils.waitForNewCheckpoint; | ||
| import static org.apache.flink.test.scheduling.UpdateJobResourceRequirementsITCase.waitForAvailableSlots; | ||
| import static org.apache.flink.test.scheduling.UpdateJobResourceRequirementsITCase.waitForRunningTasks; | ||
| import static org.junit.Assert.assertEquals; | ||
|
|
@@ -163,6 +163,8 @@ public void setup() throws Exception { | |
| NettyShuffleEnvironmentOptions.NETWORK_BUFFERS_PER_CHANNEL, buffersPerChannel); | ||
|
|
||
| config.set(JobManagerOptions.SCHEDULER, JobManagerOptions.SchedulerType.Adaptive); | ||
| // Disable the scaling cooldown to speed up the test | ||
| config.set(JobManagerOptions.SCHEDULER_SCALING_INTERVAL_MIN, Duration.ofMillis(0)); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The commit message needs to be changed to someting like |
||
|
|
||
| // speed the test suite up | ||
| // - lower refresh interval -> controls how fast we invalidate ExecutionGraphCache | ||
|
|
@@ -261,7 +263,11 @@ public void testCheckpointRescalingKeyedState(boolean scaleOut) throws Exception | |
|
|
||
| waitForAllTaskRunning(cluster.getMiniCluster(), jobGraph.getJobID(), false); | ||
|
|
||
| waitForOneMoreCheckpoint(jobID, cluster.getMiniCluster()); | ||
| // We need to wait for a checkpoint to be completed that was triggered after all the | ||
| // data was processed. That ensures the entire data being flushed out of the Operator's | ||
| // network buffers to avoid reprocessing test data twice after the restore (see | ||
| // FLINK-34200). | ||
| waitForNewCheckpoint(jobID, cluster.getMiniCluster()); | ||
1996fanrui marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
|
||
| SubtaskIndexSource.SOURCE_LATCH.reset(); | ||
|
|
||
|
|
@@ -328,7 +334,7 @@ public void testCheckpointRescalingNonPartitionedStateCausesException() throws E | |
| // wait until the operator handles some data | ||
| StateSourceBase.workStartedLatch.await(); | ||
|
|
||
| waitForOneMoreCheckpoint(jobID, cluster.getMiniCluster()); | ||
| waitForNewCheckpoint(jobID, cluster.getMiniCluster()); | ||
XComp marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
|
||
| JobResourceRequirements.Builder builder = JobResourceRequirements.newBuilder(); | ||
| for (JobVertex vertex : jobGraph.getVertices()) { | ||
|
|
@@ -411,7 +417,7 @@ public void testCheckpointRescalingWithKeyedAndNonPartitionedState() throws Exce | |
| // clear the CollectionSink set for the restarted job | ||
| CollectionSink.clearElementsSet(); | ||
|
|
||
| waitForOneMoreCheckpoint(jobID, cluster.getMiniCluster()); | ||
| waitForNewCheckpoint(jobID, cluster.getMiniCluster()); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I guess the scenario can happen in this test as well because it's almost the same test implementation as in
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm wondering whether the redundant code could be removed here. But that's probably a bit out-of-scope for this issue.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you mean I checked them, they have a lot of differences in details. Such as:
I will check could they extract some common code later. If yes, I can submit a hotfix PR and cc you.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I created a PR to show what I had in mind for the code redundancy reduction
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It seems out-of-scope of this JIRA. Would you mind if we refactor it in a hotfix PR? Your code is ready. After this PR, you can submit a official PR, and I can help review. WDYT? |
||
|
|
||
| SubtaskIndexSource.SOURCE_LATCH.reset(); | ||
|
|
||
|
|
@@ -513,7 +519,7 @@ public void testCheckpointRescalingPartitionedOperatorState( | |
| // wait until the operator handles some data | ||
| StateSourceBase.workStartedLatch.await(); | ||
|
|
||
| waitForOneMoreCheckpoint(jobID, cluster.getMiniCluster()); | ||
| waitForNewCheckpoint(jobID, cluster.getMiniCluster()); | ||
XComp marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
|
||
| JobResourceRequirements.Builder builder = JobResourceRequirements.newBuilder(); | ||
| for (JobVertex vertex : jobGraph.getVertices()) { | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we make the number of checkpoints to wait for configurable? That way, we can pass in
2in the test implementation analogously towaitForCheckpoint. I also feel like we can remove some redundant code within the two methods. 🤔Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, the semantic between
waitForCheckpointandwaitForOneMoreCheckpointare different. (waitForOneMoreCheckpointis renamed towaitForNewCheckpointin this PR.)waitForCheckpointcheck the total count of all completed checkpoints.waitForOneMoreCheckpointcheck the whether the new checkpoint is completed after it's called.waitForOneMoreCheckpointwill wait for checkpoint-11 is completed.BTW, I have refactored the
waitForNewCheckpoint. I check the checkpoint trigger time instead of checkpointCount.I think checking trigger time is clearer than checkpointCount >= 2. Other developers might don't know why check 2 checkpoint here, and
checkpointCount >= 2doesn't work when enabling the concurrent checkpoint.WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. Checking the trigger time is a better solution. I like that. 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About the redundant code:
...just to clarify what I meant. Feel free to ignore that one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your code, this refactor makes sense to me. I have updated.