fix flaky RemoteTaskRunnerTest.testRunPendingTaskFailToAssignTask with ugly Thread.sleep #13344

Merged
abhishekagarwal87 merged 1 commit into apache:master from clintropolis:sad-rtr-flaky-test-fix on Nov 10, 2022

@clintropolis clintropolis commented Nov 9, 2022

RemoteTaskRunnerTest.testRunPendingTaskFailToAssignTask fails pretty consistently when run until failure in IntelliJ. After adding this Thread.sleep, I let it run for over 2k iterations without a failure.

I hate it, but it seems to significantly reduce the flakiness (at least I saw no failures), and I wasn't able to determine a "good" fix in a short amount of time, so let's do this for now.

The underlying issue appears to be a race condition between the test ZK server and worker startup. If the timing is wrong, an INITIALIZED event that arrives after the first pending task is added can cause the task runner to call runPendingTask before the test is able to call runPendingTask itself, which makes the test assertions no longer true.
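To make the ordering problem concrete, here is a minimal standalone sketch of that kind of race. It is not Druid code: the class name, the queue, and the latches are all hypothetical stand-ins, with a single background thread playing the role of the task runner reacting to the INITIALIZED event and the calling thread playing the role of the test. The latches force each ordering deterministically so the two outcomes can be compared.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical miniature of the race; none of these names come from Druid.
public class PendingTaskRaceSketch {

    // Waits on a latch without leaking the checked InterruptedException.
    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    /**
     * Returns "test" if the calling thread claims the pending task, or
     * "background" if the simulated event handler claims it first.
     *
     * @param lateInitializedEvent true to deliver the simulated INITIALIZED
     *                             event after the pending task has been added
     */
    public static String whoRunsPendingTask(boolean lateInitializedEvent) {
        ConcurrentLinkedQueue<String> pending = new ConcurrentLinkedQueue<>();
        AtomicReference<String> claimedBy = new AtomicReference<>("nobody");
        CountDownLatch taskAdded = new CountDownLatch(1);
        CountDownLatch eventHandled = new CountDownLatch(1);
        ExecutorService background = Executors.newSingleThreadExecutor();
        try {
            // Stand-in for the runner's reaction to INITIALIZED:
            // it runs whatever pending task it finds.
            background.execute(() -> {
                try {
                    if (lateInitializedEvent) {
                        awaitQuietly(taskAdded); // event arrives after the add
                    }
                    if (pending.poll() != null) {
                        claimedBy.set("background");
                    }
                } finally {
                    eventHandled.countDown();
                }
            });

            if (!lateInitializedEvent) {
                awaitQuietly(eventHandled); // event fully handled before the add
            }
            pending.add("task id with spaces");
            taskAdded.countDown();
            if (lateInitializedEvent) {
                awaitQuietly(eventHandled); // background thread wins the race
            }

            // Stand-in for the test calling runPendingTask itself.
            if (pending.poll() != null) {
                claimedBy.set("test");
            }
            return claimedBy.get();
        } finally {
            background.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(whoRunsPendingTask(false));
        System.out.println(whoRunsPendingTask(true));
    }
}
```

When the simulated event is handled before the task is added, the calling thread claims the task, which corresponds to the case where the test's own runPendingTask call does the work. When the event is delivered late, the background thread claims it first and the caller finds nothing left to run.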

In successful runs, the logs have a section like:

2022-11-03T01:20:05,937 INFO [Time-limited test] org.apache.druid.indexing.overlord.RemoteTaskRunner - Added pending task task id with spaces
2022-11-03T01:20:05,938 ERROR [Time-limited test] org.apache.druid.indexing.overlord.RemoteTaskRunner - Exception while trying to assign task: {class=org.apache.druid.indexing.overlord.RemoteTaskRunner, exceptionType=class java.lang.IllegalArgumentException, exceptionMessage=task id != workItem id, taskId=wrongId}
java.lang.IllegalArgumentException: task id != workItem id
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:125) ~[guava-16.0.1.jar:?]
	at org.apache.druid.indexing.overlord.RemoteTaskRunner.tryAssignTask(RemoteTaskRunner.java:847) ~[classes/:?]
	at org.apache.druid.indexing.overlord.RemoteTaskRunner.runPendingTask(RemoteTaskRunner.java:771) ~[classes/:?]

but in failing runs there is no exception, and the task is assigned from the rtr-pending-tasks-runner-0 thread instead:

2022-11-03T01:20:17,391 INFO [Time-limited test] org.apache.druid.indexing.overlord.RemoteTaskRunner - Added pending task task id with spaces
2022-11-03T01:20:17,399 INFO [rtr-pending-tasks-runner-0] org.apache.druid.indexing.overlord.RemoteTaskRunner - Assigning task [task id with spaces] to worker [worker]
2022-11-03T01:20:17,423 INFO [rtr-pending-tasks-runner-0] org.apache.druid.indexing.overlord.RemoteTaskRunner - Task [task id with spaces] started running on worker [worker]
2022-11-03T01:20:18,316 INFO [SessionTracker] org.apache.zookeeper.server.SessionTrackerImpl - SessionTrackerImpl exited loop!
2022-11-03T01:20:18,397 INFO [Time-limited test] org.apache.druid.indexing.overlord.RemoteTaskRunner - Stopping RemoteTaskRunner...

@kfaraz kfaraz left a comment

Thanks for fixing this, @clintropolis! This has been failing consistently on recent PRs.

@abhishekagarwal87 abhishekagarwal87 merged commit 44f2903 into apache:master Nov 10, 2022
@clintropolis clintropolis deleted the sad-rtr-flaky-test-fix branch November 10, 2022 10:46
@kfaraz kfaraz added this to the 25.0 milestone Nov 21, 2022