-
Notifications
You must be signed in to change notification settings - Fork 152
APEXMALHAR-2546.Mocking-dependencies-should-be-part-of-parent-pom #672
APEXMALHAR-2546.Mocking-dependencies-should-be-part-of-parent-pom #672
Conversation
Travis builds seem to fail most of the time in org.apache.apex.malhar.flume.sink.FlumeSinkTest - APEXMALHAR-2544 |
Yes. I spent a few minutes today morning to see the root cause as this is failing across many PRs. All my attempts on the local build seem to go fine. I enabled some debug flags on the local build and it looks like one of the first few things that happen is to connect to a server on a port. Will explore this a little bit further and report progress if any on the other JIRA that was created for this problem. |
This PR seems suitable to debug the flume test issue as it reproduces the problem (Travis seems to have stickiness wrt worker). So we should add debug info to the test until we know where it gets stuck. |
8ce1f0a
to
bb4d567
Compare
@tweise - Please ignore the changes to FlumeSinkTest as I have added log statements to trace out what is happening on the Travis server. Will resubmit this PR with all the unnecessary log statements removed after getting the root cause. I will notify when the PR is ready for review comments. Until then please ignore changes to this PR. |
bb4d567
to
214531e
Compare
pom.xml
Outdated
@@ -52,6 +52,8 @@ | |||
<apex.core.version>3.6.0</apex.core.version> | |||
<semver.plugin.skip>false</semver.plugin.skip> | |||
<surefire.args>-Xmx2048m</surefire.args> | |||
<mockito-version>2.8.47</mockito-version> | |||
<power-mockito-version>1.7.1</power-mockito-version> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be power-mock-version
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corrected this.
e13822c
to
8c547dd
Compare
@tweise - Here are some interesting observations regarding the Flume sink tests. The tests always seem to pass when there is some logging present. Looking at the travis error message itself, it says it is aborting the test run because it did not get any sings of activity from the test itself. So I am wondering if just adding a couple of log statements is a fix for the issue in JIRA MALHAR-2544. |
The test will be aborted after there is no activity for a long time. However, the tests, when they pass, don't take a long time. I suspect there is a race condition and the added logging changes the timing. Is the result consistent, i.e. with logging it always passes and which of the logging statements changes the balance? I don't mind adding a logging statement as a workaround, but we need to fix the root cause eventually. |
The logging approach seems to be consistent although I dont trust the result yet :) .. |
It would make sense because in Travis the logging is reduced compared to all other environments. But if you can find out which statement makes the difference that might help to identify the root cause (can probably also use yield or sleep(1)). |
Here is another attempt to get the root cause. This time around, I took the approach of doing a stack dump every 5 minutes. ( Assumption is that the test should have ideally been done in 5 minutes and Travis by default kills a Test JVM in 10 minutes ) . Since we do not have access to Travis box, I took the approach of dumping all the current threads using the JMX Bean approach. The scheduled executor that is running as part of the test gets all the current threads and dumps them along with their state as a "System.out" so that it can be viewed in the Travis log. During one of the runs, the following interesting observation was noticed: "main" Id=1 RUNNABLE
at sun.nio.ch.SelectorImpl.<init>(SelectorImpl.java:58)
at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:64)
at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:36)
at java.nio.channels.Selector.open(Selector.java:227)
at com.datatorrent.netlet.DefaultEventLoop.<init>(DefaultEventLoop.java:69)
at org.apache.apex.malhar.flume.sink.FlumeSink.start(FlumeSink.java:300)
at org.apache.apex.malhar.flume.sink.FlumeSinkTest.testServer(FlumeSinkTest.java:99)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) This is not reproducible in all runs though. The current thinking is that the epoll binding on the current OS and the current version of the jdk are the root cause. Still exploring it a bit further as there seems to be interesting chatter about this on other threads as well. Perhaps this also explains why when a System.out is involved, the system seems to pass as writing out might have triggered the kernel state to handle the stalled epoll ( Hypothesis at this point ) |
cc366e2
to
7eec727
Compare
} | ||
|
||
}; | ||
|
||
DefaultEventLoop eventloop = new DefaultEventLoop("Eventloop-TestClient"); | ||
eventloop.start(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make a difference if you move sink.start()
after eventloop.start()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tweise - Moving the event loop did help ( although I have attempted this only once ) . Will have to re-run the test a few times more to confirm this.
Some more information is present in the comment below. Also I have seen some very strange behaviors in some of the builds. Jenkins build failed on the EventLoop as well once when there was a launch of the travis Flume test ( Jenkins does not seem to be configured for running the Flume tests and only Travis does the Flume testing ) but what is more confusing is that the test failed in another component ( I guess it was the JDBCPollTest ) . The Jenkins test also failed in t he Event Loop logic. Will add more details when I hit this issue once more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there are other flaky tests that fail the CI sometimes, but not nearly as often as this issue. See also comment WRT logging in advertise
- your thread dump was valuable to get closer to the root cause.
7790bc2
to
007ccc3
Compare
Optimized the thread dump a bit more and here is a more reproducible stack dump ( Snipped all other threads except for the threads under contention ) The main thread is in a foreever waiting state "main" Id=1 WAITING on org.apache.apex.malhar.flume.sink.FlumeSinkTest$1@121157c9
at java.lang.Object.wait(Native Method)
- waiting on org.apache.apex.malhar.flume.sink.FlumeSinkTest$1@121157c9
at java.lang.Object.wait(Object.java:503)
at org.apache.apex.malhar.flume.sink.FlumeSinkTest$1.discover(FlumeSinkTest.java:87)
at org.apache.apex.malhar.flume.sink.FlumeSinkTest.testServer(FlumeSinkTest.java:144)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
... The main thread is having the following Lock Info: The full stack trace of this thread is java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:503)
org.apache.apex.malhar.flume.sink.FlumeSinkTest$1.discover(FlumeSinkTest.java:87)
org.apache.apex.malhar.flume.sink.FlumeSinkTest.testServer(FlumeSinkTest.java:144)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:606)
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
org.junit.runners.ParentRunner.run(ParentRunner.java:309)
org.junit.runners.Suite.runChild(Suite.java:127)
org.junit.runners.Suite.runChild(Suite.java:26)
org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
org.junit.runners.ParentRunner.run(ParentRunner.java:309)
org.junit.runner.JUnitCore.run(JUnitCore.java:160)
org.junit.runner.JUnitCore.run(JUnitCore.java:138)
org.apache.maven.surefire.junitcore.JUnitCoreWrapper.createRequestAndRun(JUnitCoreWrapper.java:141)
org.apache.maven.surefire.junitcore.JUnitCoreWrapper.executeEager(JUnitCoreWrapper.java:114)
org.apache.maven.surefire.junitcore.JUnitCoreWrapper.execute(JUnitCoreWrapper.java:86)
org.apache.maven.surefire.junitcore.JUnitCoreProvider.invoke(JUnitCoreProvider.java:134)
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200)
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103) The second thread is the one of the 2 event loops: "Eventloop-TestClient" Id=10 RUNNABLE (in native)
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
- locked sun.nio.ch.Util$2@57c63b
- locked java.util.Collections$UnmodifiableSet@2974baa
- locked sun.nio.ch.EPollSelectorImpl@1ae8873a
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
at com.datatorrent.netlet.DefaultEventLoop.runEventLoop(DefaultEventLoop.java:255)
at com.datatorrent.netlet.DefaultEventLoop.run(DefaultEventLoop.java:119)
at java.lang.Thread.run(Thread.java:748) This thread does not seem to be waiting on any locks. However the following are the lock monitors for this thread. sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87) The full stack trace of this thread is sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
com.datatorrent.netlet.DefaultEventLoop.runEventLoop(DefaultEventLoop.java:255)
com.datatorrent.netlet.DefaultEventLoop.run(DefaultEventLoop.java:119)
java.lang.Thread.run(Thread.java:748) The third thread is "EventLoop-null" Id=9 RUNNABLE (in native)
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
- locked sun.nio.ch.Util$2@1805fea3
- locked java.util.Collections$UnmodifiableSet@31908ba7
- locked sun.nio.ch.EPollSelectorImpl@8c09a5d
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
at com.datatorrent.netlet.DefaultEventLoop.runEventLoop(DefaultEventLoop.java:255)
at com.datatorrent.netlet.DefaultEventLoop.run(DefaultEventLoop.java:119)
at java.lang.Thread.run(Thread.java:748) This thread too does not have any lock Objects but is having the following LockMonitors sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87) The full stack trace of this thread is sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
com.datatorrent.netlet.DefaultEventLoop.runEventLoop(DefaultEventLoop.java:255)
com.datatorrent.netlet.DefaultEventLoop.run(DefaultEventLoop.java:119)
java.lang.Thread.run(Thread.java:748) |
@ananthc Is it possible that the test gets stuck because |
Output looks like this:
|
@tweise - Regarding the test for logging advertise and discover , it turned out that none of the messages were logged. The deadlock is happening even before that is happening. Looking at the stack trace, the discover() method is in a blocked state waiting for a lock. The lock is obtained on the "this" object. ( discover() is a synchronized method ). The other location which is taking this lock is in the onMessage() of the default client implementation. It is in this method that notify() is being called. There are other synchronized methods in the innerclass implementations. |
845b8a2
to
1f9c652
Compare
@tweise - I ran the modifications as discussed and the tests are consistently passing when the even loop is moved much higher point in the invocation timeline. I am removing all the debugging logic and the PR is ready for your review now. Could you please review ? |
1f9c652
to
8e43bac
Compare
wait occurs after notify, race condition |
notify(); | ||
} | ||
|
||
@Override | ||
public synchronized void advertise(Service<byte[]> service) | ||
{ | ||
logger.info("Advertise invoked"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try adding Thread.sleep(1), also add a comment that this is due to race condition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
d8b8ab5
to
8e43bac
Compare
The workaround for the flume test is good enough for this PR. Will take it up in APEXMALHAR-2544 Thanks for the big effort to identify the root cause of the test failure, even though it was not related to your changes. |
jenkins test this please |
@tweise - Here is a summary of this change:
Could you please review or assign this to someone who can help me with the review ?