fix flaky RemoteTaskRunnerTest.testRunPendingTaskFailToAssignTask with ugly Thread.sleep by clintropolis · Pull Request #13344 · apache/druid

clintropolis · 2022-11-09T23:55:11Z

RemoteTaskRunnerTest.testRunPendingTaskFailToAssignTask fails pretty consistently when run until failure in IntelliJ. After adding this Thread.sleep, I let it run for over 2k iterations without a failure.

I hate it, but it seems to significantly reduce the flakiness (at least, I saw no failures), and I wasn't able to find a "good" fix in a short amount of time, so let's do this for now.

The underlying issue appears to be a race condition between the test ZooKeeper server and worker startup. If the timing is off, an INITIALIZED event that arrives after the first pending task is added can cause the task runner to call runPendingTask on its own, before the test gets to call runPendingTask, which makes the test's assertions no longer hold.
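To make the ordering problem concrete, here is a toy model of the race. It is a sketch under stated assumptions, not Druid code: `ToyRunner`, `onInitialized()` and `claimedBy()` are hypothetical names, and the real RemoteTaskRunner is far more involved. The model only assumes what the description says: a late INITIALIZED event makes a background thread try to run pending tasks, and whoever polls the pending queue first "claims" the task. The `.get()` calls force one ordering or the other so the outcome is reproducible; in the real test the ordering depends on ZooKeeper/worker startup timing.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical toy model of the race; NOT Druid's RemoteTaskRunner. */
public class PendingTaskRace {

  static class ToyRunner {
    private final ConcurrentLinkedQueue<String> pending = new ConcurrentLinkedQueue<>();
    // Plays the role of the "rtr-pending-tasks-runner-0" thread seen in the logs.
    private final ExecutorService pendingTasksRunner = Executors.newSingleThreadExecutor();
    // Records which caller actually picked up the task ("background" or "test").
    private volatile String claimedBy = null;

    void addPendingTask(String taskId) {
      pending.add(taskId);
    }

    // Assumed reaction to an INITIALIZED event: asynchronously run whatever is pending.
    Future<?> onInitialized() {
      return pendingTasksRunner.submit(() -> runPendingTask("background"));
    }

    // Whoever polls first claims the task; later callers find the queue empty.
    synchronized void runPendingTask(String caller) {
      if (pending.poll() != null) {
        claimedBy = caller;
      }
    }

    String claimedBy() {
      return claimedBy;
    }

    void shutdown() {
      pendingTasksRunner.shutdownNow();
    }
  }

  /** Forces one of the two orderings and reports who ended up running the task. */
  public static String simulate(boolean initializedBeforeAdd) {
    ToyRunner runner = new ToyRunner();
    try {
      if (initializedBeforeAdd) {
        runner.onInitialized().get();               // queue is empty, nothing is claimed
        runner.addPendingTask("task id with spaces");
      } else {
        runner.addPendingTask("task id with spaces");
        runner.onInitialized().get();               // late INITIALIZED grabs the task
      }
      runner.runPendingTask("test");                // the test's own explicit call
      return runner.claimedBy();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      runner.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println("INITIALIZED before add -> " + simulate(true));  // test
    System.out.println("INITIALIZED after add  -> " + simulate(false)); // background
  }
}
```

In the second ordering the test's own runPendingTask call finds nothing to do, which matches the failing log where the task is assigned by the background thread.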

In successful runs, the logs have a section like:

2022-11-03T01:20:05,937 INFO [Time-limited test] org.apache.druid.indexing.overlord.RemoteTaskRunner - Added pending task task id with spaces
2022-11-03T01:20:05,938 ERROR [Time-limited test] org.apache.druid.indexing.overlord.RemoteTaskRunner - Exception while trying to assign task: {class=org.apache.druid.indexing.overlord.RemoteTaskRunner, exceptionType=class java.lang.IllegalArgumentException, exceptionMessage=task id != workItem id, taskId=wrongId}
java.lang.IllegalArgumentException: task id != workItem id
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:125) ~[guava-16.0.1.jar:?]
	at org.apache.druid.indexing.overlord.RemoteTaskRunner.tryAssignTask(RemoteTaskRunner.java:847) ~[classes/:?]
	at org.apache.druid.indexing.overlord.RemoteTaskRunner.runPendingTask(RemoteTaskRunner.java:771) ~[classes/:?]
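The stack trace above shows the expected failure path: a `Preconditions.checkArgument` call inside `RemoteTaskRunner.tryAssignTask` rejects the assignment with `task id != workItem id`. The sketch below is a hypothetical, stdlib-only stand-in for that check (Guava's `checkArgument` likewise throws `IllegalArgumentException` with the given message); it does not reproduce Druid's actual code. The log's `taskId=wrongId` suggests the test deliberately pairs mismatched ids, but which side carries `wrongId` is an assumption here.

```java
import java.util.Optional;

/** Hypothetical illustration of the id check seen in the stack trace; not Druid code. */
public class AssignCheck {

  // Behaves like Guava's Preconditions.checkArgument(boolean, Object).
  static void checkArgument(boolean condition, String message) {
    if (!condition) {
      throw new IllegalArgumentException(message);
    }
  }

  /** Empty when assignment would proceed, otherwise the rejection message. */
  public static Optional<String> rejectionReason(String taskId, String workItemTaskId) {
    try {
      checkArgument(taskId.equals(workItemTaskId), "task id != workItem id");
      return Optional.empty();
    } catch (IllegalArgumentException e) {
      return Optional.of(e.getMessage());
    }
  }

  public static void main(String[] args) {
    // Mismatched ids: the path a passing run of the test is expected to hit.
    System.out.println(rejectionReason("wrongId", "task id with spaces"));             // Optional[task id != workItem id]
    // Matching ids: the assignment that happens when the background runner wins the race.
    System.out.println(rejectionReason("task id with spaces", "task id with spaces")); // Optional.empty
  }
}
```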

But in a failing run, there is no exception, and the task is instead assigned by the rtr-pending-tasks-runner-0 thread:

2022-11-03T01:20:17,391 INFO [Time-limited test] org.apache.druid.indexing.overlord.RemoteTaskRunner - Added pending task task id with spaces
2022-11-03T01:20:17,399 INFO [rtr-pending-tasks-runner-0] org.apache.druid.indexing.overlord.RemoteTaskRunner - Assigning task [task id with spaces] to worker [worker]
2022-11-03T01:20:17,423 INFO [rtr-pending-tasks-runner-0] org.apache.druid.indexing.overlord.RemoteTaskRunner - Task [task id with spaces] started running on worker [worker]
2022-11-03T01:20:18,316 INFO [SessionTracker] org.apache.zookeeper.server.SessionTrackerImpl - SessionTrackerImpl exited loop!
2022-11-03T01:20:18,397 INFO [Time-limited test] org.apache.druid.indexing.overlord.RemoteTaskRunner - Stopping RemoteTaskRunner...
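One commonly used alternative to a fixed Thread.sleep is to wait for an observable condition with a deadline. The helper below is a hypothetical sketch, not part of Druid, and it assumes the test could observe something like "the runner has processed INITIALIZED"; whether RemoteTaskRunner exposes such a signal is not established here. The `initialized` flag and the 50 ms startup delay are purely illustrative.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

/** Hypothetical test helper: wait for a condition instead of sleeping a guessed duration. */
public final class AwaitCondition {
  private AwaitCondition() {}

  /**
   * Polls condition every pollMillis until it is true (returns true) or
   * timeoutMillis elapses (returns false). Restores the interrupt flag if interrupted.
   */
  public static boolean await(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
    final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        return false; // timed out; the caller should fail the test loudly
      }
      try {
        Thread.sleep(pollMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) throws InterruptedException {
    // Assumed stand-in for "the runner has processed the INITIALIZED event".
    AtomicBoolean initialized = new AtomicBoolean(false);
    Thread startup = new Thread(() -> {
      try {
        Thread.sleep(50); // simulated ZooKeeper/worker startup latency
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      initialized.set(true);
    });
    startup.start();

    if (!await(initialized::get, 5_000, 10)) {
      throw new AssertionError("runner never reported INITIALIZED");
    }
    startup.join();
    // Only at this point would the test add its pending task and call runPendingTask.
    System.out.println("initialized; safe to add the pending task");
  }
}
```

This only removes the race if the late INITIALIZED event is the sole path that triggers the background runPendingTask call and its completion is observable; otherwise a different synchronization point would be needed.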


kfaraz

Thanks for fixing this, @clintropolis ! This has been failing consistently on recent PRs.

Commit 49b3c56: fix flaky RemoteTaskRunnerTest.testRunPendingTaskFailToAssignTask with ugly Thread.sleep

clintropolis added the Flaky test label Nov 9, 2022

kfaraz approved these changes Nov 10, 2022


abhishekagarwal87 merged commit 44f2903 into apache:master Nov 10, 2022

abhishekagarwal87 mentioned this pull request Nov 10, 2022

Flaky Test: RemoteTaskRunnerTest #12643 (Closed)

clintropolis deleted the sad-rtr-flaky-test-fix branch November 10, 2022 10:46

kfaraz added this to the 25.0 milestone Nov 21, 2022

abhishekagarwal87 merged 1 commit into apache:master from clintropolis:sad-rtr-flaky-test-fix
