
revans2 (Contributor) commented Sep 21, 2015

This upgrades to version 3.3.2 of the Disruptor queue. There have been questions in the past about its stability and also about out-of-order delivery.

I really wanted to be sure that everything would behave about the same. I ran the DisruptorQueue-related unit tests over the weekend and got no failures at all, with well over 10,000 runs.

I also did some performance tests using the FastWordCountTopology I added as a part of this. I ran 5 times each with the original 0.11.0-SNAPSHOT this branch is based on (e859210) and with this branch. With topology.max.spout.pending set to 200 I got essentially identical results. The new queue was slightly faster, but the difference was small enough that it could just be noise.

Similarly, when I did not set topology.max.spout.pending and relied on the automatic throttling, I got very similar numbers, although the variance between runs was much higher, so a real comparison there is much more difficult.
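For reference, topology.max.spout.pending can be set on the command line (as in the perf-test invocations later in this thread) or programmatically when submitting a topology. The sketch below is a minimal illustration of the programmatic route, assuming the 0.x backtype.storm package names; it is not the FastWordCountTopology from this patch.

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class MaxSpoutPendingExample {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // ... register spouts and bolts on the builder here ...

        Config conf = new Config();
        // Limit each spout task to 200 un-acked tuples in flight,
        // i.e. the equivalent of topology.max.spout.pending=200.
        conf.setMaxSpoutPending(200);
        conf.setNumWorkers(4);

        StormSubmitter.submitTopology("fast-word-count", conf, builder.createTopology());
    }
}
```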

@HeartSaVioR, in the past you did some testing to see if out-of-order delivery was happening. I would love it if you could take a look at this patch and test it similarly.

Anecdotally, I have seen this version behave better than the one we are currently using. I have seen no NPEs from tuples disappearing, whereas that did show up in some of my stress testing with the old queue. Again, I don't know how often this happens, so I cannot guarantee that it was a disruptor bug.

revans2 (Contributor, Author) commented Sep 21, 2015

The Travis failure looks unrelated. I think we need to fix how we handle ports in the DRPC test; it seems that every so often we cannot bind to the port we want.

HeartSaVioR (Contributor) commented:

@revans2
The patch looks good overall.
I'll run a performance test and check whether there are any failed tuples. If it doesn't show failed tuples, I think we're safe to merge.

vesense (Member) commented Sep 22, 2015

+1 LGTM

HeartSaVioR (Contributor) commented:

@revans2

I ran a performance test on a small cluster (three machines, each running ZooKeeper and a supervisor, with one of them also running Nimbus) with the configuration below:

com.yahoo.storm.perftest.Main --ack --bolt 4 --name test -l 1 -n 1 --workers 4 --spout 3 --testTimeSec 900 -c topology.max.spout.pending=1092 --messageSize 10

Unfortunately I'm seeing failed tuples again.

[Screenshot: Storm UI showing failed tuples]

I packaged the perf test against the 0.10.0-rc of storm-core, but I don't think it matters since its scope is provided.

I still don't know why this happens. According to the log files, there is no Netty connection issue.

Could you test it too?

revans2 (Contributor, Author) commented Sep 23, 2015

@HeartSaVioR I figured out the failed tuples and will be pushing a new patch shortly. This is the same null issue we were seeing previously; the newer disruptor just exacerbates the two underlying problems.

First, MutableObject is not synchronized, so it is possible for the data not to be flushed to memory, and if the receiver thread is on a different core you can read back a null.
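As an illustration of that visibility problem (this is a hypothetical holder class, not Storm's actual MutableObject): a plain field written by the publishing thread has no happens-before relationship with the consumer's read, so the consumer can legally observe a stale null. Marking the field volatile, or otherwise synchronizing the access, closes that gap.

```java
// Hypothetical sketch of the visibility issue, not Storm's MutableObject.
public class MutableHolderSketch {
    // Without volatile (or other synchronization) there is no happens-before
    // edge between set() on the publishing thread and get() on the consuming
    // thread, so the consumer may legally observe a stale null.
    private volatile Object value;

    public void set(Object v) {
        value = v;        // volatile write publishes the value to other threads
    }

    public Object get() {
        return value;     // volatile read sees the most recent published write
    }
}
```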

Additionally, I noticed that the non-blocking receive does not seem to put the correct barriers in place, so polling can read a MutableObject that has not been updated yet and is still set to null. The only way I could find to fix it was to implement it in terms of the blocking call. From my performance tests I did not see any impact from these changes.
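To illustrate the idea of expressing the non-blocking receive in terms of the blocking call (the method names below are hypothetical, not Storm's actual DisruptorQueue API): the non-blocking path only enters the blocking consume once it knows events are available, so both paths cross the same memory barriers.

```java
// Hypothetical sketch: reuse the blocking consume path for the non-blocking case
// so that both go through the same barriers. Names are illustrative only.
public abstract class QueueSketch {

    /** Blocking consume: waits until at least one event has been published. */
    public abstract void consumeBatchBlocking(EventHandler handler) throws InterruptedException;

    /** Number of events already published and safe to read. */
    public abstract long availableCount();

    /** Non-blocking consume, implemented in terms of the blocking call. */
    public void consumeBatchNonBlocking(EventHandler handler) throws InterruptedException {
        if (availableCount() > 0) {
            // Only enter the blocking call when it is guaranteed to return
            // immediately, so the non-blocking path inherits its barriers.
            consumeBatchBlocking(handler);
        }
    }

    public interface EventHandler {
        void onEvent(Object event);
    }
}
```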

revans2 (Contributor, Author) commented Sep 23, 2015

I pushed the fix; please try again.

HeartSaVioR (Contributor) commented:

@revans2
It seems like it doesn't resolve the failed tuples. The first time I didn't see any failed tuples, but on the second trial they appeared.
[Screenshot: Storm UI showing failed tuples on the second trial]

revans2 (Contributor, Author) commented Sep 24, 2015

@HeartSaVioR can you share the topology that you are running so I can try to reproduce/debug it locally?

revans2 (Contributor, Author) commented Sep 24, 2015

@HeartSaVioR
I added an in-order test case that I ran for over 50 minutes on both Linux and my Mac; both passed with no errors and each processed over 35 million tuples. Any help you could give me in reproducing the failures would be greatly appreciated.
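For context, an in-order check like the one described here can be as simple as publishing a strictly increasing sequence number and having the single consumer verify it. The sketch below is illustrative only, not the actual test case added in this patch.

```java
// Illustrative in-order check, not the test case from this patch.
public class InOrderChecker {
    private long expected = 0;   // next sequence number the consumer expects
    private long errors = 0;     // count of out-of-order or missing tuples

    // Called by the single consumer thread for every tuple it receives.
    public void onTuple(long sequence) {
        if (sequence != expected) {
            errors++;            // gap or reordering detected
        }
        expected = sequence + 1;
    }

    public long errorCount() {
        return errors;
    }
}
```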

HeartSaVioR (Contributor) commented:

@revans2
I mailed you the jar file as an attachment. If you can't receive it, I'll find another way to share it (via Dropbox?).

revans2 (Contributor, Author) commented Sep 25, 2015

@HeartSaVioR Oh, you are using my perf test: https://github.com/yahoo/storm-perf-test. I will try it out and see if there are any failures.

revans2 (Contributor, Author) commented Sep 25, 2015

@HeartSaVioR I ran a number of tests, and I am fairly sure that what you are seeing is a failure in the messaging layer and not in disruptor. I limited the number of workers to just 1 so that I would be sure to isolate things:
storm jar storm_perf_test.jar com.yahoo.storm.perftest.Main --ack --ackers 8 --bolt 8 --maxPending 200 --workers 1 --spout 8 --testTime 8000 --pollFreq 30

I didn't run for the full 8000 seconds, but part way through I got this screenshot.

[Screenshot: Storm UI part way through the test run]

I hope the 185 million tuples from this test, plus the 35 million tuples from my in-order test, all with no failures, are enough to convince you that the new disruptor queue is solid. If you want me to try to debug the messaging layer and reproduce the failures you saw, I am fine with that, but I would like to do it as part of another JIRA.

revans2 force-pushed the disruptor-upgrade branch from 4dedbcd to 8eb2775 on October 5, 2015

HeartSaVioR (Contributor) commented:

@revans2
Throughput just dropped while tuples were being failed.
It occurred near the end of the performance test, so I'd like to test again with a longer test duration.
If it occurs near the end of the performance test again, I'll treat it as a separate issue, maybe a transfer issue during worker shutdown.

revans2 (Contributor, Author) commented Oct 13, 2015

@HeartSaVioR we started doing some stress/load testing on 0.10 plus this patch and ran into the NPE issue. I am going to close this pull request until the NPE can be resolved, but it looks like it is something in disruptor itself. The exact same code, but with the disruptor version reverted, saw no issues.

I will also update my batching pull request #765 to be based on the original version of the disruptor queue.
