-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix timeout in KafkaSupervisorTest.testCheckpointForInactiveTaskGroup #6207
Fix timeout in KafkaSupervisorTest.testCheckpointForInactiveTaskGroup #6207
Conversation
|
||
Assert.assertNull(serviceEmitter.getStackTrace()); | ||
Assert.assertNull(serviceEmitter.getExceptionMessage()); | ||
Assert.assertNull(serviceEmitter.getStackTrace(), serviceEmitter.getStackTrace()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this a fix or just temp debugging arrangement to see what is in the stacktrace so that later a fix can be done ?
test times out because serviceEmitter.getStackTrace() is non-null , so this assertion is gonna fail in those cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this doesn't fix the root cause, but fixes the infinite loop and print the stack trace so that we can fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR title misled me :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so, do you want this PR to be merged or you're trying builds on this PR to see things whenever this test fails ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the title looks somewhat confusing :(. Do you have a better idea?
I thought it would be great if the CI for this PR shows something interesting, but unfortunately it succeeded. I can restart it until I find something, or check other PRs if they fail with some stack traces. I don't have a strong opinion here, but I would say this PR gives us a small benefit by making CI jobs fail faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe, "Add debug information to expose transient failure cause in KafkaSupervisorTest.testCheckpointForInactiveTaskGroup", if this PR is merged without actually fixing.
I'm approving it so feel free to merge whenever you want after trying out the builds here. I triggered one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice. It catches the exception. I'll check it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The exception was
KafkaSupervisorTest.testCheckpointForInactiveTaskGroup:2118 java.lang.AssertionError:
Unexpected method call TaskRunner.getRunningTasks():
at org.easymock.internal.MockInvocationHandler.invoke(MockInvocationHandler.java:44)
at org.easymock.internal.ObjectMethodsFilter.invoke(ObjectMethodsFilter.java:94)
at com.sun.proxy.$Proxy43.getRunningTasks(Unknown Source)
at io.druid.indexing.kafka.supervisor.KafkaSupervisor$1.getTaskLocation(KafkaSupervisor.java:296)
at io.druid.indexing.kafka.supervisor.KafkaSupervisor.checkpointTaskGroup(KafkaSupervisor.java:1503)
at io.druid.indexing.kafka.supervisor.KafkaSupervisor.checkTaskDuration(KafkaSupervisor.java:1433)
at io.druid.indexing.kafka.supervisor.KafkaSupervisor.runInternal(KafkaSupervisor.java:879)
at io.druid.indexing.kafka.supervisor.KafkaSupervisor$RunNotice.handle(KafkaSupervisor.java:595)
at io.druid.indexing.kafka.supervisor.KafkaSupervisor$2.run(KafkaSupervisor.java:369)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
So, I added expect(taskRunner.getRunningTasks()).andReturn(workItems).anyTimes();
. Hopefully this fixes this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jihoonson this failure persists. See https://travis-ci.org/apache/incubator-druid/jobs/424467954.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@leventov thanks. I've also noticed that before. The problem is that the supervisor periodically executes runNotice
and this can potentially call every method of the supervisor which in turn requires to mock everything using EasyMock
.
I think it's better to refactor the whole KafkaSupervisorTest to be based on a sort of more controllable mockup environment by creating taskRunner
, taskMaster
, taskStorage
, and taskClient
for test purpose instead of making them using EasyMock.
Raised #6296.
Let's see what is the error message.