
revans2 (Contributor) commented Sep 21, 2015

This upgrades to version 3.3.2 of the Disruptor queue. There have been questions in the past about its stability and also about out-of-order delivery.

I really wanted to be sure that everything would behave about the same. I ran the DisruptorQueue-related unit tests over the weekend and got no failures at all, with well over 10,000 runs.

I also did some performance tests using the FastWordCountTopology I added as a part of this. I ran 5 times each with the original 0.11.0-SNAPSHOT this branch is based on (e859210) and with this branch. With topology.max.spout.pending set to 200 I got essentially identical results. The new queue was slightly faster, but the difference was small enough that it could just be noise.

Similarly, when I did not set topology.max.spout.pending and relied on the automatic throttling, I got very similar numbers, although the variance between runs was much higher, so a real comparison there is much more difficult.
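For reference, topology.max.spout.pending can be set on the command line (as in the perf-test invocations later in this thread) or programmatically when submitting a topology. The sketch below is a minimal illustration of the programmatic route, assuming the 0.x backtype.storm package names; it is not the FastWordCountTopology from this patch.

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class MaxSpoutPendingExample {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // ... register spouts and bolts on the builder here ...

        Config conf = new Config();
        // Limit each spout task to 200 un-acked tuples in flight,
        // i.e. the equivalent of topology.max.spout.pending=200.
        conf.setMaxSpoutPending(200);
        conf.setNumWorkers(4);

        StormSubmitter.submitTopology("fast-word-count", conf, builder.createTopology());
    }
}
```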

@HeartSaVioR, in the past you did some testing to see if out-of-order delivery was happening. I would love it if you could take a look at this patch and test it similarly.

Anecdotally, I have seen this version behave better than the one we are currently using. I have seen no NPEs from tuples disappearing, whereas that did show up in some of my stress testing with the old queue. Again, I don't know how often this happens, so I cannot guarantee that it was a disruptor bug.

revans2 (Contributor, Author) commented Sep 21, 2015

The Travis failure looks unrelated. I think we need to fix how we handle ports in the DRPC test; it seems that every so often we cannot bind to the port we want.

HeartSaVioR (Contributor) commented:

@revans2
The patch looks good overall.
I'll run a performance test and check whether there are any failed tuples. If it doesn't show failed tuples, I think we're safe to merge.

vesense (Member) commented Sep 22, 2015

+1 LGTM

HeartSaVioR (Contributor) commented:

@revans2

I ran a performance test on a small cluster (three machines, each running ZooKeeper and a supervisor, with one of them also running Nimbus) with the configuration below:

com.yahoo.storm.perftest.Main --ack --bolt 4 --name test -l 1 -n 1 --workers 4 --spout 3 --testTimeSec 900 -c topology.max.spout.pending=1092 --messageSize 10

Unfortunately I'm seeing failed tuples again.

[Screenshot: Storm UI showing failed tuples]

I packaged the perf test against the 0.10.0-rc of storm-core, but I don't think it matters since its scope is provided.

I still don't know why this happens. According to the log files, there is no Netty connection issue.

Could you test it too?

revans2 (Contributor, Author) commented Sep 23, 2015

@HeartSaVioR I figured out the failed tuples and will be pushing a new patch shortly. This is the same null issue we were seeing previously; the newer disruptor just exacerbates the two underlying problems.

First, MutableObject is not synchronized, so it is possible for the data not to be flushed to memory, and if the receiver thread is on a different core you can read back a null.
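As an illustration of that visibility problem (this is a hypothetical holder class, not Storm's actual MutableObject): a plain field written by the publishing thread has no happens-before relationship with the consumer's read, so the consumer can legally observe a stale null. Marking the field volatile, or otherwise synchronizing the access, closes that gap.

```java
// Hypothetical sketch of the visibility issue, not Storm's MutableObject.
public class MutableHolderSketch {
    // Without volatile (or other synchronization) there is no happens-before
    // edge between set() on the publishing thread and get() on the consuming
    // thread, so the consumer may legally observe a stale null.
    private volatile Object value;

    public void set(Object v) {
        value = v;        // volatile write publishes the value to other threads
    }

    public Object get() {
        return value;     // volatile read sees the most recent published write
    }
}
```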

Additionally, I noticed that the non-blocking receive does not seem to put the correct barriers in place, so polling can read a MutableObject that has not been updated yet and is still set to null. The only way I could find to fix it was to implement it in terms of the blocking call. From my performance tests I did not see any impact from these changes.
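To illustrate the idea of expressing the non-blocking receive in terms of the blocking call (the method names below are hypothetical, not Storm's actual DisruptorQueue API): the non-blocking path only enters the blocking consume once it knows events are available, so both paths cross the same memory barriers.

```java
// Hypothetical sketch: reuse the blocking consume path for the non-blocking case
// so that both go through the same barriers. Names are illustrative only.
public abstract class QueueSketch {

    /** Blocking consume: waits until at least one event has been published. */
    public abstract void consumeBatchBlocking(EventHandler handler) throws InterruptedException;

    /** Number of events already published and safe to read. */
    public abstract long availableCount();

    /** Non-blocking consume, implemented in terms of the blocking call. */
    public void consumeBatchNonBlocking(EventHandler handler) throws InterruptedException {
        if (availableCount() > 0) {
            // Only enter the blocking call when it is guaranteed to return
            // immediately, so the non-blocking path inherits its barriers.
            consumeBatchBlocking(handler);
        }
    }

    public interface EventHandler {
        void onEvent(Object event);
    }
}
```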

revans2 (Contributor, Author) commented Sep 23, 2015

I pushed the fix; please try again.

HeartSaVioR (Contributor) commented:

@revans2
It seems like it doesn't resolve the failed tuples. The first time I didn't see any failed tuples, but on the second trial they appeared.
[Screenshot: Storm UI showing failed tuples on the second trial]

revans2 (Contributor, Author) commented Sep 24, 2015

@HeartSaVioR can you share the topology that you are running so I can try to reproduce/debug it locally?

revans2 (Contributor, Author) commented Sep 24, 2015

@HeartSaVioR
I added an in-order test case that I ran for over 50 minutes on both Linux and my Mac; both passed with no errors and each processed over 35 million tuples. Any help you could give me in reproducing the failures would be greatly appreciated.
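For context, an in-order check like the one described here can be as simple as publishing a strictly increasing sequence number and having the single consumer verify it. The sketch below is illustrative only, not the actual test case added in this patch.

```java
// Illustrative in-order check, not the test case from this patch.
public class InOrderChecker {
    private long expected = 0;   // next sequence number the consumer expects
    private long errors = 0;     // count of out-of-order or missing tuples

    // Called by the single consumer thread for every tuple it receives.
    public void onTuple(long sequence) {
        if (sequence != expected) {
            errors++;            // gap or reordering detected
        }
        expected = sequence + 1;
    }

    public long errorCount() {
        return errors;
    }
}
```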

HeartSaVioR (Contributor) commented:

@revans2
I mailed you the jar file as an attachment. If you can't receive it, I'll find another way to share it (via Dropbox?).

revans2 (Contributor, Author) commented Sep 25, 2015

@HeartSaVioR Oh, you are using my perf test: https://github.com/yahoo/storm-perf-test. I will try it out and see if there are any failures.

revans2 (Contributor, Author) commented Sep 25, 2015

@HeartSaVioR I ran a number of tests, and I am fairly sure that what you are seeing is a failure in the messaging layer and not in disruptor. I limited the number of workers to just 1 so that I would be sure to isolate things:
storm jar storm_perf_test.jar com.yahoo.storm.perftest.Main --ack --ackers 8 --bolt 8 --maxPending 200 --workers 1 --spout 8 --testTime 8000 --pollFreq 30

I didn't run for the full 8000 seconds, but part way through I got this screenshot.

[Screenshot: Storm UI part way through the test run]

I hope the 185 million tuples from this test, plus the 35 million tuples from my in-order test, all with no failures, are enough to convince you that the new disruptor queue is solid. If you want me to try to debug the messaging layer and reproduce the failures you saw, I am fine with that, but I would like to do it as part of another JIRA.

revans2 force-pushed the disruptor-upgrade branch from 4dedbcd to 8eb2775 on October 5, 2015

HeartSaVioR (Contributor) commented:

@revans2
Throughput just dropped while tuples were being failed.
It occurred near the end of the performance test, so I'd like to test again with a longer test duration.
If it occurs near the end of the performance test again, I'll treat it as a separate issue, maybe a transfer issue during worker shutdown.

revans2 (Contributor, Author) commented Oct 13, 2015

@HeartSaVioR we started doing some stress/load testing on 0.10 plus this patch and ran into the NPE issue. I am going to close this pull request until the NPE can be resolved, but it looks like it is something in disruptor itself. The exact same code, but with the disruptor version reverted, saw no issues.

I will also update my batching pull request #765 to be based on the original version of the disruptor queue.
