STORM-350: Upgrade to newer version of disruptor #750
Conversation
Force-pushed 46df80c to 14eab88
The Travis failure looks unrelated. I think we need to fix how we do ports in the DRPC test. It seems that every so often we cannot bind to the port we want.

@revans2

+1 LGTM
I ran a performance test on a small cluster (three machines, each running ZooKeeper and a supervisor, and only one of them also running Nimbus) with the configuration below:
Unfortunately I'm seeing failed tuples again. I packaged the perf test against 0.10.0-rc of storm-core, but I think that doesn't matter since its scope is provided. I still don't know why this happens. According to the log files there is no Netty connection issue. Could you test it too?
@HeartSaVioR I figured out the failed tuples and will be pushing a new patch shortly. This is the same null issue we were seeing previously; the newer disruptor just exacerbates two existing problems. First, MutableObject is not synchronized, so it is possible for the data not to be flushed to memory, and if the receiver thread is on a different core you can get a null out. Additionally, I noticed that the non-blocking receive does not seem to put the correct barriers in place, so polling can read a MutableObject that has not been updated yet and is still set to null. The only way I could find to fix it was to implement it in terms of the blocking call. In my performance tests I did not see any impact from these changes.
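For readers following along, here is a minimal, self-contained Java sketch of the visibility hazard described above. It is not Storm's or the Disruptor's actual code; the `Slot` class and its fields are hypothetical stand-ins for an unsynchronized MutableObject sitting behind a publication flag that lacks the right barriers.

```java
// Hypothetical sketch only -- not Storm or Disruptor code.
public class NullOutSketch {
    static class Slot {
        Object payload;      // plain field, like an unsynchronized MutableObject
        boolean published;   // plain flag standing in for a sequence with no barrier
    }

    static final Slot slot = new Slot();

    public static void main(String[] args) throws InterruptedException {
        Thread producer = new Thread(() -> {
            slot.payload = "tuple";  // (1) write the data
            slot.published = true;   // (2) "publish" it -- nothing orders (1) before (2)
        });

        Thread consumer = new Thread(() -> {
            // A non-blocking style poll: spin on the flag, then read the payload.
            // Under the Java memory model this loop may never observe the flag
            // flip, and even when it does, the payload read can still see null,
            // because no barrier orders the producer's writes or guarantees
            // their visibility to this thread.
            while (!slot.published) {
                Thread.yield();
            }
            System.out.println("consumed: " + slot.payload);  // may print null
        });

        consumer.start();
        producer.start();
        producer.join();
        consumer.join();
    }
}
```

Implementing the non-blocking receive in terms of the blocking call, as described above, presumably avoids this because the blocking path waits on the Disruptor's sequence barrier, which supplies the ordering the plain polling path was missing.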
I pushed the fix, please try again.

@revans2
@HeartSaVioR can you share the topology that you are running, so I can try to reproduce/debug it locally?

@HeartSaVioR

@revans2

@HeartSaVioR Oh, you are using my perf test https://github.com/yahoo/storm-perf-test. I will try it out and see if there are any failures.
@HeartSaVioR I ran a number of tests, and I am fairly sure that what you are seeing is a failure in the messaging layer and not in the disruptor. I limited the number of workers to just 1 so that I would be sure to isolate things. I didn't run for the full 8000 seconds, but part way through I got this screenshot. I hope 185 million tuples from this test, plus the 35 million tuples from my in-order test, all with no failures, is enough to convince you that the new disruptor queue is solid. If you want me to try to debug the messaging layer and reproduce the failures you saw, I am fine with that, but I would like to do it as part of another JIRA.
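As a side note on the isolation approach: with a single worker, every executor runs in one JVM, so tuples only ever move through the in-process disruptor queues and never cross the Netty messaging layer. A minimal sketch of that setting, assuming the pre-1.0 `backtype.storm.Config` API (the class name here is just a placeholder, not part of the perf test):

```java
import backtype.storm.Config;

public class SingleWorkerSketch {
    public static void main(String[] args) {
        Config conf = new Config();
        // One worker: all executors share a JVM, so tuple transfer stays on the
        // in-process disruptor queues and never touches Netty. A failure that
        // appears with multiple workers but not here points at the messaging
        // layer rather than the queue.
        conf.setNumWorkers(1);  // topology.workers = 1
    }
}
```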
Force-pushed 020aae0 to 4dedbcd
Force-pushed 4dedbcd to 8eb2775
@revans2

@HeartSaVioR we started doing some stress/load testing on 0.10 plus this patch and ran into the NPE issue. I am going to close this pull request until the NPE can be resolved, but it looks like it is something in the disruptor itself: the exact same code with the disruptor version reverted saw no issues. I will also update my batching pull request #765 to be based on the original version of the disruptor queue.



This upgrades the Disruptor queue to version 3.3.2. There have been questions in the past about its stability and about out-of-order delivery.
I really wanted to be sure that behavior would stay about the same. I ran the DisruptorQueue-related unit tests over the weekend, well over 10,000 runs, and got no failures at all.
I did some performance tests too, using the FastWordCountTopology I added as part of this. I ran five times each against the original 0.11.0-SNAPSHOT this is based off of (e859210) and against this branch. With topology.max.spout.pending set to 200 I got essentially identical results; the new queue was slightly faster, but by a small enough margin that it could just be noise.
Similarly, when I did not set topology.max.spout.pending and relied on the automatic throttling, I got very similar numbers, although the variance between runs was much higher, so a real comparison there is much more difficult.
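For anyone wanting to repeat the comparison, here is a rough sketch of the two configurations described above, assuming the pre-1.0 `backtype.storm.Config` API; the class name and worker count are placeholders, not the actual test harness.

```java
import backtype.storm.Config;

public class PerfComparisonSketch {
    public static void main(String[] args) {
        // Run 1: cap the in-flight tuples per spout task at 200.
        Config capped = new Config();
        capped.setNumWorkers(4);         // placeholder worker count
        capped.setMaxSpoutPending(200);  // topology.max.spout.pending = 200

        // Run 2: leave topology.max.spout.pending unset and rely on the
        // automatic throttling instead.
        Config uncapped = new Config();
        uncapped.setNumWorkers(4);       // placeholder worker count
    }
}
```

Either config would then be submitted with the FastWordCountTopology build for the comparison.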
@HeartSaVioR in the past you did some testing to see if out-of-order delivery was happening; I would love it if you could take a look at this patch and test it similarly.
Anecdotally, I have seen this version behave better than the current one we are using: I have seen no NPEs from tuples disappearing, while that has shown up in some of my stress testing with the old queue. Again, I don't know how often that happens, so I cannot guarantee it was a disruptor bug.