Skip to content
This repository has been archived by the owner on May 12, 2021. It is now read-only.

APEXMALHAR-2546.Mocking-dependencies-should-be-part-of-parent-pom #672

Conversation

ananthc
Copy link

@ananthc ananthc commented Sep 29, 2017

@tweise - Here is a summary of this change:

  • Moved the power mock module to the parent.
  • Bumped the version of mock to 2.x versions. Please let me know if this change requires a community discussion. The build did not fail on the local machine.

Could you please review or assign this to someone who can help me with the review ?

@tweise
Copy link
Contributor

tweise commented Sep 30, 2017

Travis builds seem to fail most of the time in org.apache.apex.malhar.flume.sink.FlumeSinkTest - APEXMALHAR-2544

@ananthc
Copy link
Author

ananthc commented Sep 30, 2017

Yes. I spent a few minutes today morning to see the root cause as this is failing across many PRs. All my attempts on the local build seem to go fine. I enabled some debug flags on the local build and it looks like one of the first few things that happen is to connect to a server on a port. Will explore this a little bit further and report progress if any on the other JIRA that was created for this problem.

@tweise
Copy link
Contributor

tweise commented Oct 2, 2017

This PR seems suitable to debug the flume test issue as it reproduces the problem (Travis seems to have stickiness wrt worker). So we should add debug info to the test until we know where it gets stuck.

@ananthc ananthc force-pushed the APEXMALHAR-2546.Mocking-dependencies-should-be-part-of-parent-pom branch from 8ce1f0a to bb4d567 Compare October 2, 2017 18:37
@ananthc
Copy link
Author

ananthc commented Oct 2, 2017

@tweise - Please ignore the changes to FlumeSinkTest as I have added log statements to trace out what is happening on the Travis server. Will resubmit this PR with all the unnecessary log statements removed after getting the root cause. I will notify when the PR is ready for review comments. Until then please ignore changes to this PR.

@ananthc ananthc force-pushed the APEXMALHAR-2546.Mocking-dependencies-should-be-part-of-parent-pom branch from bb4d567 to 214531e Compare October 2, 2017 19:14
pom.xml Outdated
@@ -52,6 +52,8 @@
<apex.core.version>3.6.0</apex.core.version>
<semver.plugin.skip>false</semver.plugin.skip>
<surefire.args>-Xmx2048m</surefire.args>
<mockito-version>2.8.47</mockito-version>
<power-mockito-version>1.7.1</power-mockito-version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be power-mock-version

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected this.

@ananthc ananthc force-pushed the APEXMALHAR-2546.Mocking-dependencies-should-be-part-of-parent-pom branch 4 times, most recently from e13822c to 8c547dd Compare October 3, 2017 17:43
@ananthc
Copy link
Author

ananthc commented Oct 3, 2017

@tweise - Here are some interesting observations regarding the Flume sink tests. The tests always seem to pass when there is some logging present. Looking at the travis error message itself, it says it is aborting the test run because it did not get any sings of activity from the test itself. So I am wondering if just adding a couple of log statements is a fix for the issue in JIRA MALHAR-2544.

@tweise
Copy link
Contributor

tweise commented Oct 3, 2017

The test will be aborted after there is no activity for a long time. However, the tests, when they pass, don't take a long time. I suspect there is a race condition and the added logging changes the timing. Is the result consistent, i.e. with logging it always passes and which of the logging statements changes the balance? I don't mind adding a logging statement as a workaround, but we need to fix the root cause eventually.

@ananthc
Copy link
Author

ananthc commented Oct 3, 2017

The logging approach seems to be consistent although I dont trust the result yet :) ..

@tweise
Copy link
Contributor

tweise commented Oct 3, 2017

It would make sense because in Travis the logging is reduced compared to all other environments. But if you can find out which statement makes the difference that might help to identify the root cause (can probably also use yield or sleep(1)).

@ananthc
Copy link
Author

ananthc commented Oct 3, 2017

Here is another attempt to get the root cause. This time around, I took the approach of doing a stack dump every 5 minutes. ( Assumption is that the test should have ideally been done in 5 minutes and Travis by default kills a Test JVM in 10 minutes ) .

Since we do not have access to Travis box, I took the approach of dumping all the current threads using the JMX Bean approach. The scheduled executor that is running as part of the test gets all the current threads and dumps them along with their state as a "System.out" so that it can be viewed in the Travis log.

During one of the runs, the following interesting observation was noticed:

"main" Id=1 RUNNABLE
	at sun.nio.ch.SelectorImpl.<init>(SelectorImpl.java:58)
	at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:64)
	at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:36)
	at java.nio.channels.Selector.open(Selector.java:227)
	at com.datatorrent.netlet.DefaultEventLoop.<init>(DefaultEventLoop.java:69)
	at org.apache.apex.malhar.flume.sink.FlumeSink.start(FlumeSink.java:300)
	at org.apache.apex.malhar.flume.sink.FlumeSinkTest.testServer(FlumeSinkTest.java:99)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

This is not reproducible in all runs though.

The current thinking is that the epoll binding on the current OS and the current version of the jdk are the root cause. Still exploring it a bit further as there seems to be interesting chatter about this on other threads as well. Perhaps this also explains why when a System.out is involved, the system seems to pass as writing out might have triggered the kernel state to handle the stalled epoll ( Hypothesis at this point )

@ananthc ananthc force-pushed the APEXMALHAR-2546.Mocking-dependencies-should-be-part-of-parent-pom branch 2 times, most recently from cc366e2 to 7eec727 Compare October 4, 2017 11:25
}

};

DefaultEventLoop eventloop = new DefaultEventLoop("Eventloop-TestClient");
eventloop.start();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make a difference if you move sink.start() after eventloop.start() ?

Copy link
Author

@ananthc ananthc Oct 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tweise - Moving the event loop did help ( although I have attempted this only once ) . Will have to re-run the test a few times more to confirm this.

Some more information is present in the comment below. Also I have seen some very strange behaviors in some of the builds. Jenkins build failed on the EventLoop as well once when there was a launch of the travis Flume test ( Jenkins does not seem to be configured for running the Flume tests and only Travis does the Flume testing ) but what is more confusing is that the test failed in another component ( I guess it was the JDBCPollTest ) . The Jenkins test also failed in t he Event Loop logic. Will add more details when I hit this issue once more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there are other flaky tests that fail the CI sometimes, but not nearly as often as this issue. See also comment WRT logging in advertise - your thread dump was valuable to get closer to the root cause.

@ananthc ananthc force-pushed the APEXMALHAR-2546.Mocking-dependencies-should-be-part-of-parent-pom branch from 7790bc2 to 007ccc3 Compare October 4, 2017 17:32
@ananthc
Copy link
Author

ananthc commented Oct 4, 2017

Optimized the thread dump a bit more and here is a more reproducible stack dump ( Snipped all other threads except for the threads under contention )

The main thread is in a foreever waiting state

"main" Id=1 WAITING on org.apache.apex.malhar.flume.sink.FlumeSinkTest$1@121157c9
	at java.lang.Object.wait(Native Method)
	-  waiting on org.apache.apex.malhar.flume.sink.FlumeSinkTest$1@121157c9
	at java.lang.Object.wait(Object.java:503)
	at org.apache.apex.malhar.flume.sink.FlumeSinkTest$1.discover(FlumeSinkTest.java:87)
	at org.apache.apex.malhar.flume.sink.FlumeSinkTest.testServer(FlumeSinkTest.java:144)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	...

The main thread is having the following Lock Info:
Lock inf/name o org.apache.apex.malhar.flume.sink.FlumeSinkTest$1@121157c9 org.apache.apex.malhar.flume.sink.FlumeSinkTest$1@121157c9 Lock Owner Namenull Lock Owner id -1

The full stack trace of this thread is

java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:503)
org.apache.apex.malhar.flume.sink.FlumeSinkTest$1.discover(FlumeSinkTest.java:87)
org.apache.apex.malhar.flume.sink.FlumeSinkTest.testServer(FlumeSinkTest.java:144)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:606)
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
org.junit.runners.ParentRunner.run(ParentRunner.java:309)
org.junit.runners.Suite.runChild(Suite.java:127)
org.junit.runners.Suite.runChild(Suite.java:26)
org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
org.junit.runners.ParentRunner.run(ParentRunner.java:309)
org.junit.runner.JUnitCore.run(JUnitCore.java:160)
org.junit.runner.JUnitCore.run(JUnitCore.java:138)
org.apache.maven.surefire.junitcore.JUnitCoreWrapper.createRequestAndRun(JUnitCoreWrapper.java:141)
org.apache.maven.surefire.junitcore.JUnitCoreWrapper.executeEager(JUnitCoreWrapper.java:114)
org.apache.maven.surefire.junitcore.JUnitCoreWrapper.execute(JUnitCoreWrapper.java:86)
org.apache.maven.surefire.junitcore.JUnitCoreProvider.invoke(JUnitCoreProvider.java:134)
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:200)
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)

The second thread is the one of the 2 event loops:

"Eventloop-TestClient" Id=10 RUNNABLE (in native)
	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
	-  locked sun.nio.ch.Util$2@57c63b
	-  locked java.util.Collections$UnmodifiableSet@2974baa
	-  locked sun.nio.ch.EPollSelectorImpl@1ae8873a
	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
	at com.datatorrent.netlet.DefaultEventLoop.runEventLoop(DefaultEventLoop.java:255)
	at com.datatorrent.netlet.DefaultEventLoop.run(DefaultEventLoop.java:119)
	at java.lang.Thread.run(Thread.java:748)

This thread does not seem to be waiting on any locks. However the following are the lock monitors for this thread.

sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)

The full stack trace of this thread is

sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
com.datatorrent.netlet.DefaultEventLoop.runEventLoop(DefaultEventLoop.java:255)
com.datatorrent.netlet.DefaultEventLoop.run(DefaultEventLoop.java:119)
java.lang.Thread.run(Thread.java:748)

The third thread is

"EventLoop-null" Id=9 RUNNABLE (in native)
	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
	-  locked sun.nio.ch.Util$2@1805fea3
	-  locked java.util.Collections$UnmodifiableSet@31908ba7
	-  locked sun.nio.ch.EPollSelectorImpl@8c09a5d
	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
	at com.datatorrent.netlet.DefaultEventLoop.runEventLoop(DefaultEventLoop.java:255)
	at com.datatorrent.netlet.DefaultEventLoop.run(DefaultEventLoop.java:119)
	at java.lang.Thread.run(Thread.java:748)

This thread too does not have any lock Objects but is having the following LockMonitors

sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)

The full stack trace of this thread is

sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
com.datatorrent.netlet.DefaultEventLoop.runEventLoop(DefaultEventLoop.java:255)
com.datatorrent.netlet.DefaultEventLoop.run(DefaultEventLoop.java:119)
java.lang.Thread.run(Thread.java:748)

@tweise
Copy link
Contributor

tweise commented Oct 4, 2017

@ananthc Is it possible that the test gets stuck because wait occurs after notify? How about adding logging into advertise and unadvertise?

@tweise
Copy link
Contributor

tweise commented Oct 4, 2017

Output looks like this:

2017-10-04 16:24:05,479 [main] WARN  sink.FlumeSinkTest discover - about to wait..
2017-10-04 16:24:05,487 [EventLoop-null] WARN  sink.FlumeSinkTest advertise - advertise
2017-10-04 16:24:05,487 [main] WARN  sink.FlumeSinkTest discover - done waiting

@ananthc
Copy link
Author

ananthc commented Oct 5, 2017

@tweise - Regarding the test for logging advertise and discover , it turned out that none of the messages were logged. The deadlock is happening even before that is happening.

Looking at the stack trace, the discover() method is in a blocked state waiting for a lock. The lock is obtained on the "this" object. ( discover() is a synchronized method ). The other location which is taking this lock is in the onMessage() of the default client implementation. It is in this method that notify() is being called. There are other synchronized methods in the innerclass implementations.

@ananthc ananthc force-pushed the APEXMALHAR-2546.Mocking-dependencies-should-be-part-of-parent-pom branch from 845b8a2 to 1f9c652 Compare October 5, 2017 11:28
@ananthc
Copy link
Author

ananthc commented Oct 5, 2017

@tweise - I ran the modifications as discussed and the tests are consistently passing when the even loop is moved much higher point in the invocation timeline. I am removing all the debugging logic and the PR is ready for your review now. Could you please review ?

@ananthc ananthc closed this Oct 5, 2017
@ananthc ananthc force-pushed the APEXMALHAR-2546.Mocking-dependencies-should-be-part-of-parent-pom branch from 1f9c652 to 8e43bac Compare October 5, 2017 11:36
@ananthc ananthc reopened this Oct 5, 2017
@tweise
Copy link
Contributor

tweise commented Oct 5, 2017

2017-10-05 12:05:08,985 [main] INFO  sink.FlumeSinkTest testServer - Sink started
2017-10-05 12:05:08,991 [EventLoop-null] INFO  sink.FlumeSinkTest advertise - Advertise invoked
2017-10-05 12:05:09,004 [main] INFO  sink.FlumeSinkTest discover - Discover invoked

wait occurs after notify, race condition

notify();
}

@Override
public synchronized void advertise(Service<byte[]> service)
{
logger.info("Advertise invoked");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try adding Thread.sleep(1), also add a comment that this is due to race condition

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@ananthc ananthc closed this Oct 5, 2017
@ananthc ananthc force-pushed the APEXMALHAR-2546.Mocking-dependencies-should-be-part-of-parent-pom branch from d8b8ab5 to 8e43bac Compare October 5, 2017 18:34
@ananthc ananthc reopened this Oct 5, 2017
@tweise
Copy link
Contributor

tweise commented Oct 5, 2017

The workaround for the flume test is good enough for this PR. Will take it up in APEXMALHAR-2544 Thanks for the big effort to identify the root cause of the test failure, even though it was not related to your changes.

@tweise
Copy link
Contributor

tweise commented Oct 5, 2017

jenkins test this please

@tweise tweise merged commit 2ef615b into apache:master Oct 5, 2017
@ananthc ananthc deleted the APEXMALHAR-2546.Mocking-dependencies-should-be-part-of-parent-pom branch October 27, 2017 21:18
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
2 participants