STORM-2678 Improve performance of LoadAwareShuffleGrouping #2261
Conversation
The build failure is due to a missing change removing CHANGELOG.md from the binary distribution pom. I'll address it and rebase.
* construct ring which represents distribution of tasks based on load
* chooseTasks() just accesses the ring sequentially
* port related tests from Clojure to Java
* add performance tests for multi-threads
From the second commit I added performance tests on multiple threads: I started exploring this with 2 threads, and may explore it with more threads if necessary.
The test result clearly shows that both versions take a performance hit with multiple threads, but the new LASG is affected much less than the old LASG.
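The ring idea in the first commit can be sketched roughly like this (a hypothetical simplification, not the actual patch: class and field names, and the weighting formula, are illustrative). Task indices are repeated in an array in inverse proportion to their load, and chooseTask() just walks the array sequentially:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the "ring" idea: tasks with lower load occupy
// more slots, so sequential reads pick them more often.
public class LoadRingSketch {
    private final int[] ring;
    private final AtomicInteger current = new AtomicInteger(0);

    public LoadRingSketch(int[] tasks, double[] loads) {
        List<Integer> slots = new ArrayList<>();
        for (int i = 0; i < tasks.length; i++) {
            // More slots for less-loaded tasks; at least one slot each.
            int weight = Math.max(1, (int) Math.round((1.0 - loads[i]) * 100));
            for (int j = 0; j < weight; j++) {
                slots.add(tasks[i]);
            }
        }
        Collections.shuffle(slots);
        ring = slots.stream().mapToInt(Integer::intValue).toArray();
    }

    public int chooseTask() {
        // Walk the ring sequentially, wrapping at the end.
        int idx = Math.abs(current.getAndIncrement() % ring.length);
        return ring[idx];
    }

    public int ringSize() {
        return ring.length;
    }
}
```

With this shape, the per-call work is an atomic increment plus an array read; all the load-dependent work happens when the ring is (re)built.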
* introduce 'skip checking update count'
* we no longer call System.currentTimeMillis() every time
* but we call AtomicInteger.incrementAndGet() every time
* this may hurt multi-thread perf. but really faster with single-thread
Now I introduce 'skip checking update count' to avoid calling System.currentTimeMillis() every time; the clear trade-off is that we must call AtomicInteger.incrementAndGet() every time.

The test results fluctuate, but roughly the change is an improvement. Multi-threaded runs now fluctuate more and lose some performance compared to before, but are still faster than the old LASG's. What we really gain is single-threaded performance: we cut the time by more than half.

The value of 'skip checking update count' should be a reasonable answer to the question: "Are we OK with delaying the load-information update if fewer than N (the value) calls occurred within 1 sec?" We may want to put more effort into finding the value (given that the test results were not stable enough), but from the test results, 100 seems to be a good value; higher values don't show linear performance improvement.

Btw, the update duration (M secs) is another variable to explore. We may also need to see how often the origin load information gets updated, since it is meaningless for LASG to update the information more often than the origin load information gets updated. UPDATE: load is updated once per second, so the update duration seems correct.
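The 'skip checking update count' trade-off can be sketched like this (names and constants here are illustrative, not the actual patch): we pay an incrementAndGet() on every call so that the clock is read only once per SKIP_COUNT calls.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the "skip checking update count" trade-off: instead of
// calling System.currentTimeMillis() on every chooseTasks() call, count
// calls and only check the clock every SKIP_COUNT calls.
public class SkipCheckSketch {
    private static final int SKIP_COUNT = 100;           // the "N" in the text
    private static final long UPDATE_INTERVAL_MS = 1000; // the "M secs" in the text

    private final AtomicInteger callCount = new AtomicInteger(0);
    private volatile long lastUpdateMs = System.currentTimeMillis();
    private int clockChecks = 0; // counted here only to illustrate the saving

    public void maybeRefreshLoad() {
        // incrementAndGet() on every call is the cost we pay...
        if (callCount.incrementAndGet() % SKIP_COUNT != 0) {
            return; // ...so the clock is read on only 1 in SKIP_COUNT calls.
        }
        clockChecks++;
        long now = System.currentTimeMillis();
        if (now - lastUpdateMs >= UPDATE_INTERVAL_MS) {
            lastUpdateMs = now;
            // refresh load information here
        }
    }

    public int clockChecks() {
        return clockChecks;
    }
}
```

So under a burst of 1000 calls, the clock is consulted only 10 times, at the cost of one atomic increment per call.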
* add a new way to provide LoadMapping: via push
* refresh load and push updated LoadMapping to all groupers when refreshLoadTimer is activated
* update tests to reflect the change
I took the opposite approach: pushing the updated load mapping when the load updater timer is activated. We no longer need any tricks or optimizations to reduce checking, and there is no need to check the update duration at all. This is based on my guess that we couldn't easily optimize further unless we change some specifications, such as allowing non-thread-safety.
Given that this changes some interfaces, I would like many reviewers to review it and provide opinions.
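The push model can be sketched like this (the `refreshLoad` method name follows the PR; the registry and timer wiring are assumptions for illustration): the refresh timer pushes the updated LoadMapping to every registered grouper, so chooseTasks() never has to check whether load information is stale.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Sketch of the push model: the load refresh timer calls onTimer(),
// which fans the latest LoadMapping out to all registered groupers.
public class PushModelSketch {
    // Stand-in for Storm's LoadMapping.
    public static class LoadMapping { }

    public interface LoadAwareGrouping {
        void refreshLoad(LoadMapping loadMapping);
    }

    private final List<LoadAwareGrouping> groupers = new CopyOnWriteArrayList<>();

    public void register(LoadAwareGrouping g) {
        groupers.add(g);
    }

    // Called from the refreshLoadTimer; groupers rebuild their state here,
    // keeping the hot chooseTasks() path free of any staleness checks.
    public void onTimer(LoadMapping latest) {
        for (LoadAwareGrouping g : groupers) {
            g.refreshLoad(latest);
        }
    }
}
```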
* change everything to Array and pre-allocate all
* use static length for choices
* prepare backup array for choices pre-allocated and swap to avoid allocating arrays
I explored and applied more changes:
This might use a bit more memory, but it gets rid of object allocation on each load update.
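The pre-allocate-and-swap change can be sketched like this (field names echo the snippets discussed later in this review; the class itself is a simplification): two fixed-length arrays are allocated once, updates are written into the spare array, and then the references are swapped, so the update path allocates nothing and readers always see a fully built array.

```java
// Sketch of the pre-allocated double-buffer swap: CAPACITY is static,
// both arrays exist for the grouper's lifetime, and updateRing() only
// writes and swaps -- no allocation per update.
public class SwapSketch {
    private static final int CAPACITY = 1000;

    private volatile int[] choices = new int[CAPACITY];
    private int[] prepareChoices = new int[CAPACITY];

    public void updateRing(int[] tasks) {
        // Fill the spare array in place; nothing is allocated here.
        for (int i = 0; i < CAPACITY; i++) {
            prepareChoices[i] = tasks[i % tasks.length];
        }
        // Swap: readers move to the freshly built array in one reference write.
        int[] tempForSwap = choices;
        choices = prepareChoices;
        prepareChoices = tempForSwap;
    }

    public int choiceAt(int i) {
        return choices[i];
    }
}
```

The memory cost is one extra CAPACITY-length array per grouper, which is the "bit more memory" trade-off mentioned above.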
Now I have more numbers to support this patch. I took the same approach as what @Ethanlm has done with his patch #2270.
I used 08038b6 (the last commit) for this patch, and 77354fe for the baseline (master).
So the new LoadAwareShuffleGrouping is definitely faster than the current LoadAwareShuffleGrouping (about 28%), and even faster than the current ShuffleGrouping (about 4%). Logically LoadAwareShuffleGrouping should be the same or slower than ShuffleGrouping (the logic behind chooseTask() is identical and LoadAwareShuffleGrouping has extra overhead from reconstructing information), but the test showed the opposite of what I expected.
I went back to 8f63d5a, which doesn't touch any interfaces, and ran the same tests:
It is a bit slower than ShuffleGrouping but still faster than the current LoadAwareShuffleGrouping (about 22%). So we can choose either the bigger improvement that touches multiple parts, or a still-great improvement that doesn't touch other parts. I have tested one more thing, replacing List with Array in ShuffleGrouping. The test result is below:
It doesn't seem to bring noticeable improvement. The difference may be the length of the array: the array is too small (it would have 1 element) in the test and had to call another … We could grow the array in …
Raw numbers are here: https://gist.github.com/HeartSaVioR/5e80ab3a58b3e8cf40bab9c6da482639
@revans2 @roshannaik
```java
if (rightNow < CAPACITY) {
    return rets[choices[rightNow]];
} else if (rightNow == CAPACITY) {
    current.set(0);
```
If you are trying to make this thread safe, I suspect this `current.set(0)` is a race condition. Not sure if it's an acceptable race condition or not.
I borrowed this from ShuffleGrouping, and even though I'm not certain whether it creates a race condition, I think it is acceptable: the race scenarios can't cause an out-of-bounds index, they just let some threads select the same index and maybe skip some indices.
Moreover, this patch contains tests which address multi-thread safety.
The same assumption also makes `updateRing()` thread-safe.
We could still replace `set` with `getAndSet` to make it fully thread-safe, but we would need more experiments to see how that affects performance if we want to go that way. The same applies to ShuffleGrouping.
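For reference, here is one way the wrap-around could be made fully thread-safe (a sketch only, using `compareAndSet` rather than the `getAndSet` mentioned above; the class and constant names are illustrative). With plain `set(0)`, two threads past CAPACITY can both reset the counter, which may repeat or skip a few indices but never reads out of bounds; a CAS makes the reset happen exactly once:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a fully thread-safe wrap-around for the index counter:
// only one thread wins the reset to 0, the others retry the loop.
public class WrapSketch {
    private static final int CAPACITY = 1000;
    private final AtomicInteger current = new AtomicInteger(-1);

    public int nextIndex() {
        while (true) {
            int rightNow = current.incrementAndGet();
            if (rightNow < CAPACITY) {
                return rightNow;
            }
            // Only the thread that observed the overflow value wins the
            // reset; losers loop and increment again.
            if (current.compareAndSet(rightNow, 0)) {
                return 0;
            }
        }
    }
}
```

Whether the extra CAS on the wrap path costs anything measurable is exactly the experiment discussed above.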
The impact of loadAware that you show here seems in line with what I have seen. Encouraging to see these improvements.
```java
}

shuffleArray(choices);
current = new AtomicInteger(0);
```
Logically this should be -1 because we increment and get, but it doesn't hurt much.
(Same applies to ShuffleGrouping)
```java
choices = prepareChoices;
prepareChoices = tempForSwap;

current.set(0);
```
Again, logically this should be -1 because we increment and get, and unlike in `chooseTasks()` we don't read the value in this method, but it doesn't hurt much anyway.
* Let chooseTask() read from index 0, not 1
```java
@@ -20,5 +20,6 @@
 import java.util.List;

 public interface LoadAwareCustomStreamGrouping extends CustomStreamGrouping {
     void refreshLoad(LoadMapping loadMapping);
     List<Integer> chooseTasks(int taskId, List<Object> values, LoadMapping load);
```
If we are going to refresh the load out of band, then let's delete the LoadMapping from the parameters passed into chooseTasks.
OK. I was just waiting for review to see if we are OK with changing the way LoadMapping is provided. Looks like you're OK with pushing, so I'll remove the method.
Addressed.
From a code perspective the changes look fine to me. It would be nice to clean up the load aware grouping interface some. From a performance perspective it is a bit of a wash from what I am seeing. I ran ThroughputVsLatency with 1 spout, 1 splitter, and 1 counter. I also did the same with 2 splitters, so there was a choice to be made. From a cost perspective, in the 1 splitter case everything appeared to be within the noise range; in the 2 splitter case the new version took up about 5% more CPU than the old version, so I would like to explore this a bit more. Similarly for latency: at the mean, 99%ile, and 99.9%ile measurements the original one was slightly better, around 5%, when there were 2 splitters. I do want to explore these a bit more, because it seems counter to the benchmarks that others have run.
I spent the last few hours running more tests and I get the same results. I am not too concerned about it. The overhead appears to be rather low, if any.
@revans2
I've shown numbers from the grouper performance test, and also from ConstSpoutNullBoltTopo, which clearly show the newer version is (much) faster, so the result is actually confusing to me. Maybe we should put more effort into standardizing our approach to performance tests. I'll try to find time to have a look at loadgen.
* remove LoadAware specific chooseTasks method
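After that commit, the interface would take a shape along these lines (a hypothetical sketch with stand-in types, not Storm's actual classes): load arrives only via refreshLoad(), and the LoadMapping-taking chooseTasks overload is gone.

```java
import java.util.List;

// Sketch of the interface shape after removing the LoadAware-specific
// chooseTasks method; LoadMapping and CustomStreamGrouping are stand-ins.
public class InterfaceSketch {
    public static class LoadMapping { }

    public interface CustomStreamGrouping {
        List<Integer> chooseTasks(int taskId, List<Object> values);
    }

    // Load is pushed via refreshLoad(); the per-call path uses only the
    // inherited chooseTasks(taskId, values).
    public interface LoadAwareCustomStreamGrouping extends CustomStreamGrouping {
        void refreshLoad(LoadMapping loadMapping);
    }
}
```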
+1 These changes look good to me
…o STORM-2678 STORM-2678: Improve performance of LoadAwareShuffleGrouping This closes #2261
I'm not an expert on micro-benchmarks, but I also crafted some simple performance tests, which you can see in LoadAwareShuffleGroupingTest. They are `testBenchmarkLoadAwareShuffleGroupingEvenLoad` and `testBenchmarkLoadAwareShuffleGroupingUnevenLoad`, and I put `@Ignore` on them to avoid running them unless we want to do a performance test.

Here's my observation from running them with the old and new LoadAwareShuffleGrouping:
You can see that the modified LoadAwareShuffleGrouping is faster than before, 5% or more for single-threaded access. We may also want to do a multi-threaded performance test, keeping in mind that accessing OutputCollector from a single thread is preferred over multiple threads.

This still respects thread-safety, and I think it respects thread-safety better than before, given that we only allow one thread to update the ring, and we replace the new information at once rather than updating it on the fly while other threads are referencing it.
We still don't guard the information in a mutually exclusive manner, but I think that is tolerable, as it was before.
I'm planning to explore some more, mostly around reducing calls to System.currentTimeMillis() in chooseTasks(). I'll put up additional commits if I find any more improvements; it will be easy to revert some of them if we don't want them.
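The benchmark tests described above boil down to a timed loop over the grouping call. A minimal sketch of that kind of harness (the harness here is an assumption for illustration, not the actual test code):

```java
// Minimal micro-benchmark harness sketch: run the work many times and
// report elapsed nanoseconds. Real benchmarks would also warm up the JIT
// and repeat runs to smooth out the fluctuation mentioned above.
public class BenchSketch {
    public static long timeCalls(Runnable work, int calls) {
        long start = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            work.run();
        }
        return System.nanoTime() - start;
    }
}
```

Comparing the elapsed time for the old and new grouping under even and uneven loads gives numbers like the 5%-or-more single-threaded improvement reported above.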