
[SPARK-18196] [CORE] Optimise CompactBuffer implementation #15713

Closed
wants to merge 1 commit into from

Conversation

a-roberts
Contributor

What changes were proposed in this pull request?

See the JIRA for details. In summary: a slightly increased footprint for the class, but a 4% performance increase observed on PageRank with HiBench large.

How was this patch tested?

Existing unit tests and HiBench large

@srowen
Member

srowen commented Nov 1, 2016

It leads to a bigger memory footprint because of the extra elements that may be allocated but unused in the array, and the performance win is probably from avoiding those branches? I could believe it.

But then there's little point to this class. Reading the original problem statement, I wonder why it wasn't easier to just allocate an ArrayBuffer with a small initial capacity?

@SparkQA

SparkQA commented Nov 1, 2016

Test build #67898 has finished for PR 15713 at commit cdf8b8d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mridulm
Contributor

mridulm commented Nov 1, 2016

+CC @rxin who can elaborate better.
IIRC, the reason the first two elements are inlined as fields has to do with the usually low cardinality of the buffer - they effectively come for "free" due to the padding overheads of an object, lowering the number of objects created and so the GC impact.

For specific workloads, of course, this might not be the case.

Having said that, this was done in the 1.0-1.1 timeframe, and is perhaps not applicable anymore?
A more direct optimization could be to have specialized versions of the class for the primitive types (or rely on something like Trove or Apache Collections). We used to do that for some pretty remarkable gains when working with primitives - for both AppendOnlyMap and CompactBuffer.
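As an illustration of the specialization idea (not code from Spark; the class name, default capacity and sizing policy are assumptions), a primitive-backed buffer might look like this:

```scala
// Illustrative sketch only: a buffer specialised for Long values, so the backing
// storage is a primitive Array[Long] rather than an array of boxed java.lang.Long.
class LongBuffer(initialCapacity: Int = 2) {
  private var elements = new Array[Long](initialCapacity)
  private var curSize = 0

  def +=(value: Long): this.type = {
    if (curSize == elements.length) {
      // double the backing array once it is full
      elements = java.util.Arrays.copyOf(elements, math.max(elements.length * 2, 2))
    }
    elements(curSize) = value
    curSize += 1
    this
  }

  def apply(i: Int): Long = elements(i)
  def size: Int = curSize
}
```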

@a-roberts
Contributor Author

Thanks for the prompt feedback. We found this opportunity when profiling Spark 1.6.2 with HiBench large, where this showed up as a hot method in the PageRank benchmark. We can gather data to see if it's still hot with Spark 2 as well, and I'm planning to contribute lots of similar improvements.

Paraphrasing from a colleague:

This data structure is the backing structure used by RDDs that are doing group-by operations (we saw it from a PairRDD doing a groupByKey in PageRank).

The downside of the existing implementation is that every method in this class has an if ... else ... if ... else ... chain which handles element 0, element 1 and then everything else, respectively (a simplified sketch of the pattern follows below).

We found that on PageRank this change provides a throughput boost of around 5% and costs us about 1 MB of estimated RDD size (86.5 MB to just under 88 MB).

Note that with our testing using OpenJDK 8 we didn't see a noticeable performance improvement (nor a regression), despite the very slight footprint increase (an increase of 2 MB instead of 1.5 MB). Ideally we'd improve the performance for everybody, so there may be scope for optimisations here that would be of use to OpenJDK users too.
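For readers following the thread, the if/else pattern described above looks roughly like this; a simplified, paraphrased sketch rather than the exact Spark source:

```scala
import scala.reflect.ClassTag

// Simplified sketch of the CompactBuffer pattern: the first two elements live in
// dedicated fields and everything else in a lazily allocated spill array, so
// every method branches on the position.
class TinyCompactBuffer[T: ClassTag] {
  private var element0: T = _
  private var element1: T = _
  private var otherElements: Array[T] = null
  private var curSize = 0

  def +=(value: T): this.type = {
    val newIndex = curSize
    if (newIndex == 0) {
      element0 = value
    } else if (newIndex == 1) {
      element1 = value
    } else {
      if (otherElements == null) {
        otherElements = new Array[T](8)            // allocated only when needed
      } else if (newIndex - 2 == otherElements.length) {
        val bigger = new Array[T](otherElements.length * 2)
        System.arraycopy(otherElements, 0, bigger, 0, otherElements.length)
        otherElements = bigger
      }
      otherElements(newIndex - 2) = value
    }
    curSize = newIndex + 1
    this
  }

  def apply(position: Int): T =
    if (position == 0) element0
    else if (position == 1) element1
    else otherElements(position - 2)

  def size: Int = curSize
}
```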

@srowen
Member

srowen commented Nov 2, 2016

I believe the benchmark, in which case it seems like this class could just be removed at this point, and replaced with an ArrayBuffer that sets an appropriately small initial size. Is that possible to easily benchmark too while you have the machinery in place?

@a-roberts
Contributor Author

Good point. I'll first see if we can build and pass the unit tests by replacing CompactBuffer with an ArrayBuffer of initial capacity 2, and we'd also repeat the profiling (this would verify whether groupByKey is still hot for PageRank).

I'll keep the CompactBufferSuite for our experiment and replace the CompactBuffer references with ArrayBuffer to verify we can still perform all of the operations that the tests exercise.

My concern is that the comment for the CompactBuffer class points out the downsides of ArrayBuffer (the 16-entry Object array we'd get by default), so we'd indeed want a two-element ArrayBuffer that only grows if needed -- which sounds like what we currently have, but encapsulated within this class so we can override two operators and grow the backing array as required.
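For reference, a minimal sketch of the replacement being discussed, assuming Scala's ArrayBuffer constructor that takes an initial size (the default is 16):

```scala
import scala.collection.mutable.ArrayBuffer

// Default construction reserves a 16-slot backing array up front ...
val big = new ArrayBuffer[Int]()

// ... whereas an explicit initial size keeps the backing array small until needed.
val small = new ArrayBuffer[Int](2)
small += 1
small += 2
small += 3   // first growth of the backing array
```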

@andrewcraik

Having worked with @a-roberts to help develop this change, I can see that the type itself may not be required any more with the change proposed. Keeping it around would, however, let us distinguish between ArrayBuffers that are supposed to be small and ArrayBuffers that are general purpose, so that if the best trade-off for small buffers changes again in the future we still have a single point at which to tune it. I would be interested to hear @srowen's thoughts on future tuneability vs maintenance.

@mridulm
Contributor

mridulm commented Nov 2, 2016

@andrewcraik If we are simply delegating to ArrayBuffer, it makes sense to remove the class - it is an additional object created for each key being aggregated.
An alternative would be to implement the array management within CompactBuffer and specialize the array for primitives, reducing overhead where relevant.

Having said that, I do think the original intention for the class continues to hold - it depends on the cardinality of values per key. For larger cardinalities, it might be as simple as changing the order of the conditions to improve performance: newIndex > 1 checked first, followed by 0 and 1.
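A hedged sketch of that reordering, reusing the fields from the simplified buffer sketched earlier in the thread:

```scala
// Hypothetical reordering: for high-cardinality keys the buffer is usually past
// its two inline slots, so test the spill-array case first.
def +=(value: T): this.type = {
  val newIndex = curSize
  if (newIndex > 1) {
    if (otherElements == null) {
      otherElements = new Array[T](8)
    } else if (newIndex - 2 == otherElements.length) {
      val bigger = new Array[T](otherElements.length * 2)
      System.arraycopy(otherElements, 0, bigger, 0, otherElements.length)
      otherElements = bigger
    }
    otherElements(newIndex - 2) = value
  } else if (newIndex == 0) {
    element0 = value
  } else {
    element1 = value
  }
  curSize = newIndex + 1
  this
}
```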

@andrewcraik

I do agree we should save the extra object allocation, but again the question is whether we should flatten the array into the class or remove the class and use ArrayBuffer directly - do we want to preserve the semantic distinction of the type's intent, or lose it for simplicity? I'm not sure what the right trade-off is, but preserving the class would provide a future ability to tune that would be lost by removing the type.

While reordering branches might help a bit, the size of the method in bytecodes is inflated by having the fields for elements 0 and 1, leading to reduced icache locality, longer-distance branches and more complexity for the JIT compiler to handle. Part of the performance gain from removing the fields is simply from making the code smaller and simpler for the Java JIT.

Perhaps, if the feeling is that there is value in the semantic meaning of the type because of the trade-offs it implies, finding some way to not just delegate might be the best way forward?

@rxin
Contributor

rxin commented Nov 3, 2016

What is this benchmark? Can you show the query and the explain plan, if it is in Spark SQL?

The current change doesn't make a lot of sense. If the findings are correct, we should just use ArrayBuffer (or java ArrayList) directly.

@a-roberts
Contributor Author

The benchmark used is HiBench 5, which works with Spark 1.6; I expect HiBench 6 will be officially available soon with Spark 2 support.

We can see here that the Spark PageRank example is used and that the main functions, in order of their occurrence, are textFile, map, split, slice, groupByKey, mapValues, join, flatMap, reduceByKey (a rough sketch of the example follows below).
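For context, the core of the Spark PageRank example looks roughly like the following (abridged; sc, inputPath and iterations are placeholder names). The groupByKey is where CompactBuffer shows up, backing the grouped adjacency lists:

```scala
// Build adjacency lists per page, then iterate rank contributions.
val lines = sc.textFile(inputPath)
val links = lines.map { s =>
  val parts = s.split("\\s+")
  (parts(0), parts(1))
}.distinct().groupByKey().cache()   // grouped values are where CompactBuffer appears

var ranks = links.mapValues(v => 1.0)
for (i <- 1 to iterations) {
  val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
    urls.map(url => (url, rank / urls.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}
```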

@srowen
Member

srowen commented Nov 7, 2016

The benchmark sounds fine. I think it would indeed be interesting to see how a simple ArrayBuffer works, then.

@rxin
Contributor

rxin commented Nov 8, 2016

So it seems like the benchmark is the worst case for the current implementation -- it's doing a big groupByKey to potentially add a lot of items to the array buffer.

Do we have cases that only add one or two items?

@srowen
Member

srowen commented Nov 21, 2016

@a-roberts I'm interested in your thoughts on that last point, which is a fair one. If just using ArrayBuffer is as good for small cases and better for large cases, that's a great win.

@a-roberts
Contributor Author

I'll add a print to keep track of how big the CompactBuffer actually gets when used with, say, SparkSqlPerf (1 scale factor then 2 scale factor) and HiBench.

Once this is complete and we have an idea of how big it gets for these workloads, I'll run the unit tests with an ArrayBuffer of initial capacity 2 replacing all CompactBuffer instances. I have the code ready, and I'm hoping to soon finish up the Partitioned* class PR to speed up the PageRank implementation and to contribute the first phase of improvements to the SizeEstimator.
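The instrumentation itself isn't part of this PR; a hypothetical version, added to the += of the simplified buffer sketched earlier, might be as simple as the following (appendInternal is a hypothetical helper standing in for the existing append logic):

```scala
// Hypothetical instrumentation: one line of output per append while the buffer is
// still small (curSize <= 2), and one per append beyond that, so the stdout log
// can be aggregated later to see how large the buffers actually get.
def +=(value: T): this.type = {
  if (curSize <= 2) println(s"CompactBuffer append ${curSize + 1}")
  else println(s"CompactBuffer grow ${curSize + 1}")
  appendInternal(value)   // hypothetical helper holding the existing append logic
  this
}
```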

@a-roberts
Contributor Author

In response to @rxin's question: for HiBench, CompactBuffers are used only in PageRank (none of the other 11 workloads), and these buffers mainly have between 3 and 40 elements, no more than 60, and never only two elements. The PageRank workload processes 500k pages (large profile); we have 500k CompactBuffer constructor calls and 500k prints in the += method when curSize <= 2, indicating they're always expanding.

I don't know of any cases where we're adding only a couple of elements. I also ran SparkSqlPerf, all 100 queries, and again we have no output indicating that we use this class (no prints from the constructor, the growToSize or the += methods).

Here's a breakdown of growToSize invocations (printing the curSize variable) with PageRank, so we have an idea of how big the CompactBuffers actually become.

I used the Spark WordCount example on the 677 MB stdout file containing my prints to generate this data; we have a total of 18,762,361 growth events.

(3,500000), (4,500000), (5,500000), (6,500000), (7,500000), (8,500000), (9,500000), (10,500000), (11,500000), (12,500000), (13,500000), (14,500000), (15,500000), (16,500000), (17,500000), (18,500000), (19,500000), (20,500000), (21,499998), (22,499995), (23,499992), (24,499978), (25,499951), (26,499879), (27,499729), (28,499321), (29,498517), (30,496984), (31,494114), (32,488878), (33,480328), (34,467214), (35,447829), (36,421619), (37,387790), (38,346826), (39,300660), (40,251266), (41,201702), (42,155372), (43,114024), (44,79886), (45,53196), (46,33580), (47,20146), (48,11569), (49,6222), (50,3143), (51,1491), (52,684), (53,289), (54,126), (55,39), (56,15), (57,6), (58,1), (59,1), (60,1)

On the left we have the CompactBuffer size in elements and on the right we have a number representing how many times this appeared in the output file (therefore the CompactBuffer has grown to have this many elements that many times).
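For reference, a hedged sketch of the WordCount-style aggregation described above (the file name is illustrative, sc is an assumed SparkContext, and it assumes the print format from the earlier instrumentation sketch):

```scala
// Count how often each growth size appears in the instrumentation output, then
// emit (size, count) pairs in ascending size order.
val sizeCounts = sc.textFile("pagerank-stdout.log")
  .filter(_.startsWith("CompactBuffer grow"))          // keep only the growth prints
  .map(line => (line.split(" ").last.toInt, 1))
  .reduceByKey(_ + _)
  .sortByKey()

sizeCounts.collect().foreach { case (size, count) => println(s"($size,$count)") }
```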

If there are better ways to figure this out, or other workloads to suggest, do let me know; I've got the code ready that replaces CompactBuffer with an ArrayBuffer of initial capacity 2 for profiling and testing.

@a-roberts
Contributor Author

Performance results against the Spark master branch, on a 48-core machine running PageRank with 500k pages, follow.

Vanilla CompactBuffer, no changes, run time and throughput (bytes per second) provided

ScalaSparkPagerank 2016-12-01 13:16:09 259928115            47.933               5422738
ScalaSparkPagerank 2016-12-01 13:22:41 259928115            45.551               5706309
ScalaSparkPagerank 2016-12-01 13:26:31 259928115            46.745               5560554
ScalaSparkPagerank 2016-12-01 13:28:58 259928115            51.699               5027720
ScalaSparkPagerank 2016-12-01 13:33:26 259928115            48.415               5368751
240.343s / 5 = 48.068s avg

With the commit from this PR

ScalaSparkPagerank 2016-12-01 10:26:12 259928115            48.706               5336675
ScalaSparkPagerank 2016-12-01 10:37:30 259928115            48.947               5310399
ScalaSparkPagerank 2016-12-01 10:40:16 259928115            49.768               5222796
ScalaSparkPagerank 2016-12-01 12:55:37 259928115            48.873               5318439
ScalaSparkPagerank 2016-12-01 12:58:12 259928115            47.535               5468141
243.829 / 5 = 48.7658s avg

These are too similar, so I'm attributing the difference to benchmark noise; without the 51s run in the baseline, this change would actually look a few percentage points worse, though.

Use an ArrayBuffer (initial capacity of 16, default) instead of CompactBuffer

ScalaSparkPagerank 2016-12-01 13:42:45 259928115            62.190               4179580
ScalaSparkPagerank 2016-12-01 13:55:20 259928115            54.112               4803520
ScalaSparkPagerank 2016-12-01 13:59:06 259928115            60.818               4273868
ScalaSparkPagerank 2016-12-01 14:06:26 259928115            57.428               4526156
ScalaSparkPagerank 2016-12-01 14:35:01 259928115            58.218               4464737
292.766 / 5 = 58.5532s avg

Use an ArrayBuffer (initial capacity of 2) instead of CompactBuffer

ScalaSparkPagerank 2016-12-01 15:31:16 259928115            53.544               4854476
ScalaSparkPagerank 2016-12-01 15:36:32 259928115            58.105               4473420
ScalaSparkPagerank 2016-12-01 15:38:45 259928115            53.976               4815623
ScalaSparkPagerank 2016-12-01 15:44:09 259928115            55.174               4711061
ScalaSparkPagerank 2016-12-01 15:50:01 259928115            55.084               4718758
275.883 / 5 = 55.1766s avg

In my tests, using an ArrayBuffer is noticeably worse, so I'll continue to look into what's going on to see if we can improve performance here, as this is definitely a hot codepath for this particular algorithm.

@rxin
Contributor

rxin commented Dec 2, 2016

@a-roberts feel free to continue investigating, but at some point it's also important to ask the question: is this micro-optimization worth it? It only improves performance by 4% for a very specific workload, based on your measurement, and it might regress other workloads, though we are not sure about that. It also takes time from a lot of people to look at ...

@srowen
Member

srowen commented Dec 11, 2016

Although I'd love to find we can get rid of a custom collections class, it seems like we can't do that and that the optimization here doesn't result in a win. Let's close this for now.

srowen added a commit to srowen/spark that referenced this pull request Jan 1, 2017
@srowen mentioned this pull request Jan 1, 2017
@asfgit closed this in ba48812 Jan 2, 2017