[SPARK-25885][Core][Minor] HighlyCompressedMapStatus deserialization/construction optimization #22894

Koraseg · 2018-10-30T12:41:31Z

What changes were proposed in this pull request?

Removal of intermediate structures in HighlyCompressedMapStatus will speed up its creation and deserialization time.

https://issues.apache.org/jira/browse/SPARK-25885

How was this patch tested?

Additional tests are not necessary for the patch.

mgaido91 · 2018-10-30T12:49:45Z

@Koraseg please check the contribution guide and update this PR accordingly:

first, please fill the PR description properly;
second, provide a demonstration of the improved performance with a benchmark;
third, IMO this actually worsens performance, since you are using an immutable data structure, so the map is copied every time it is modified.

Koraseg · 2018-10-30T13:12:28Z

Regarding the third point, the proposed implementation is conceptually the same as the current one. Under the hood, hugeBlockSizesArray.toMap just updates an internal immutable map reference with each tuple from the array and eventually returns it.

mgaido91 · 2018-10-30T13:20:52Z

Practically, though, it generates a whole copy of the map at every update, so for 10 items, the implementation in the PR generates 9 copies of 1, 2, 3, ... elements, while the current one generates only 1 copy, at the end of size 10. So the proposed change is worse than the current solution. If you create a benchmark, you can see this.

Koraseg · 2018-10-31T11:44:58Z

> Practically, though, it generates a whole copy of the map at every update, so for 10 items, the implementation in the PR generates 9 copies of 1, 2, 3, ... elements, while the current one generates only 1 copy, at the end of size 10. So the proposed change is worse than the current solution. If you create a benchmark, you can see this.

That is not how immutable persistent data structures, and the Scala map in particular, handle updates. Moreover, as I mentioned above, it is exactly the same logic that lies under the hood of the ArrayBuffer -> Map conversion in the current implementation. I have only removed an intermediate layer.

I created a benchmark with cut-down versions of HighlyCompressedMapStatus (with only the empty blocks bitmap and the huge blocks map) and measured deserialization performance. The proposed version showed about a 10% performance boost across different block configurations.

You can check out the results on the repo below and repeat the test.
https://github.com/Koraseg/mapstatus-benchmark

mgaido91 · 2018-10-31T11:50:13Z

What if you use a mutable map instead of an immutable one? What is the perf comparison?

Koraseg · 2018-10-31T13:42:32Z

Thanks for the remark above. I have checked scala.collection.mutable.Map performance, and it is substantially better. In some cases the speedup is up to 2x. I will update the benchmark and the PR soon.

mgaido91 · 2018-10-31T13:44:52Z

that sounds more reasonable and a better implementation, thanks.

srowen · 2018-10-31T14:12:19Z

I also would not expect updating an immutable data structure to be faster. Building a map once from tuples at the end seems better than rebuilding a map each time. Under the hood the immutable map is going to be a HashTrieMap (a map of smaller optimized immutable maps) and its updated0 method does some clever stuff to avoid recreating the whole map.

But, yeah, why immutable here to begin with? it ought to be better still to update a mutable Map. And then I am still not sure why it would be faster to keep the map invariants over this loop rather than build the map with its size known ahead of time at the end.

Benchmarks are good evidence but we just need to make sure that the difference is material as used in Spark. It may well be.
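The three construction strategies discussed so far can be sketched side by side. This is a minimal illustration, not the benchmark from the linked repository: the object name `BuildStrategies` and the sample data are hypothetical, and the sketch only shows that all three produce the same contents. It makes no claim about their relative speed.

```scala
import scala.collection.mutable

object BuildStrategies {
  // Hypothetical sample data: (block id, compressed size) pairs.
  val entries: Seq[(Int, Byte)] = (0 until 1000).map(i => (i, (i % 100).toByte))

  // 1. Original approach: buffer the tuples, convert to a map once at the end.
  def viaBuffer(): Map[Int, Byte] = {
    val buf = mutable.ArrayBuffer[(Int, Byte)]()
    entries.foreach(buf += _)
    buf.toMap
  }

  // 2. First PR revision: replace an immutable map reference on every insert.
  def viaImmutableUpdates(): Map[Int, Byte] = {
    var m = Map.empty[Int, Byte]
    entries.foreach { case (k, v) => m = m.updated(k, v) }
    m
  }

  // 3. Later revision: update a mutable map in place.
  def viaMutable(): mutable.Map[Int, Byte] = {
    val m = mutable.HashMap.empty[Int, Byte]
    entries.foreach { case (k, v) => m(k) = v }
    m
  }
}
```

Whether the difference between these is material depends on the map sizes seen in practice, which is why the thread asks for measurements.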

Koraseg · 2018-10-31T16:23:36Z

I have also updated benchmarks here: https://github.com/Koraseg/mapstatus-benchmark.

By the way, gnu.trove.map.hash.TIntByteHashMap showed the best performance. But the gain over java.util.HashMap does not seem large enough to justify a new library dependency.

mgaido91 · 2018-10-31T16:37:42Z

core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala

@@ -149,7 +150,7 @@ private[spark] class HighlyCompressedMapStatus private (
    private[this] var numNonEmptyBlocks: Int,
    private[this] var emptyBlocks: RoaringBitmap,
    private[this] var avgSize: Long,
-    private var hugeBlockSizes: Map[Int, Byte])
+    private[this] var hugeBlockSizes: mutable.Map[Int, Byte])


this shouldn't be changed, we should still have an immutable map here

What about using the more generic scala.collection.Map[Int, Byte] type here?

It's mutated though, and needs to be mutable. If it were exposed outside the class, or there was significant danger of accidentally mutating it elsewhere, I think it might be necessary to wrap the result in an immutable wrapper, but here this seems OK to me.

I don't think we should change it. Although we use a mutable map to build it, the result should be an immutable map, as that enforces correctness by preventing bad updates. You can just call toMap on the mutable one, but in that case I think the performance would drop back to that of the original, as there is no override of toMap in Scala. A better option would then probably be an immutable map builder (Map.newBuilder). Could you check it please?

I like that possibility. If the cost of wrapping/building an immutable map isn't higher, it's better.

Basically, that means using an immutable Map instead of a mutable one, with worse performance characteristics, since MapBuilder just updates its inner reference to an immutable Map after each insert. To enforce correctness, what about using scala.collection.Map[Int, Byte]? It does not expose mutating operations, and mutable.Map is its subtype.

yes, you're right. I just checked the implementation of the MapBuilder... Seems like there is no efficient way to build an immutable map... I am fine with the approach you're proposing using scala.collection.Map, but still seems a bit hacky to me.
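The idea of typing the field as scala.collection.Map can be sketched as follows. The class `SizesHolder` is hypothetical and is not the real HighlyCompressedMapStatus. It only shows that a mutable map can be built internally and exposed through the read-only supertype.

```scala
import scala.collection.mutable

// Hypothetical holder class, used only to illustrate the typing idea.
class SizesHolder(entries: Seq[(Int, Byte)]) {
  // Built in place with a mutable map...
  private val backing = mutable.HashMap.empty[Int, Byte]
  entries.foreach { case (block, size) => backing(block) = size }

  // ...but exposed through the supertype, which has no update or += methods.
  val sizes: scala.collection.Map[Int, Byte] = backing
}
```

The protection is only at the type level: the underlying object is still mutable, so a caller that downcasts to mutable.Map can modify it.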

mgaido91 · 2018-10-31T16:38:38Z

core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala

@@ -189,13 +190,12 @@ private[spark] class HighlyCompressedMapStatus private (
    emptyBlocks.readExternal(in)
    avgSize = in.readLong()
    val count = in.readInt()
-    val hugeBlockSizesArray = mutable.ArrayBuffer[Tuple2[Int, Byte]]()
+    hugeBlockSizes = new util.HashMap[Int, Byte](count).asScala


we should use a mutable.Map instead of the java implementation wrapped by scala, this is not a clean solution IMHO

srowen · 2018-10-31T16:42:12Z

core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala

@@ -189,13 +190,12 @@ private[spark] class HighlyCompressedMapStatus private (
    emptyBlocks.readExternal(in)
    avgSize = in.readLong()
    val count = in.readInt()
-    val hugeBlockSizesArray = mutable.ArrayBuffer[Tuple2[Int, Byte]]()
+    hugeBlockSizes = new util.HashMap[Int, Byte](count).asScala


How about just scala's mutable Map? I'd expect it's no slower than Java's, given it might specialize for primitives (not sure about this) and sometimes has smarter implementations internally.

The scala.collection.mutable.HashMap implementation does not offer a way to set the initial capacity out of the box. The performance gets worse, probably because of hash table resizing.

The scala.collection.mutable.OpenHashMap implementation does, but it is still slower than java.util.HashMap.

However, if a tradeoff between code cleanliness and performance is needed, I would use one of the variants above.

Does calling HashMap.sizeHint(...) here actually help?
I'd stick to the scala collection for now

Unfortunately, its default implementation in the Builder trait is empty, and it is not overridden for the mutable map.
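The presized java.util.HashMap variant from the diff above can be sketched as a self-contained round trip. The object `HugeBlockSizesCodec` and its `write`/`read` methods are hypothetical helpers, not Spark code. Only the count-then-pairs layout and the `new util.HashMap[Int, Byte](count).asScala` construction follow the diff. The sketch assumes Scala 2.12, where scala.collection.JavaConverters is the standard way to wrap Java collections (it is deprecated in 2.13).

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import java.{util => ju}
import scala.collection.JavaConverters._
import scala.collection.mutable

object HugeBlockSizesCodec {
  // Writes the entry count followed by (block id, size) pairs.
  def write(sizes: scala.collection.Map[Int, Byte]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    out.writeInt(sizes.size)
    sizes.foreach { case (block, size) =>
      out.writeInt(block)
      out.writeByte(size)
    }
    out.flush()
    bytes.toByteArray
  }

  // Reads the pairs straight into a map, with no intermediate buffer.
  def read(data: Array[Byte]): mutable.Map[Int, Byte] = {
    val in = new DataInputStream(new ByteArrayInputStream(data))
    val count = in.readInt()
    // The count is passed as the initial capacity, as in the diff.
    val sizes: mutable.Map[Int, Byte] = new ju.HashMap[Int, Byte](count).asScala
    (0 until count).foreach { _ =>
      val block = in.readInt()
      val size = in.readByte()
      sizes(block) = size
    }
    sizes
  }
}
```

Note that java.util.HashMap uses a default load factor of 0.75, so passing the exact count as the initial capacity may still allow a resize while the map fills up.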

srowen · 2018-11-03T14:26:12Z

core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala

    (0 until count).foreach { _ =>
      val block = in.readInt()
      val size = in.readByte()
-      hugeBlockSizesArray += Tuple2(block, size)
+      hugeBlockSizes.asInstanceOf[mutable.Map[Int, Byte]].update(block, size)


Why cast it? It is used as a mutable map and its type is a mutable map, so the type on line 151 is wrong. Also, just hugeBlockSizes(block) = size, no?

@srowen I used the more generic map type in HighlyCompressedMapStatus, so the cast was needed to make update operations on it possible. However, that does not seem necessary, because the HighlyCompressedMapStatus fields are not exposed outside the class. So we can simply change the type of the map right there.

Yes, it is a mutable map and used as a mutable map. Its type must reflect that.

srowen

OK, I'm good with this approach, after the quick discussion. Thanks!

mgaido91 · 2018-11-06T15:24:09Z

shall we trigger the tests for this @srowen ?

SparkQA · 2018-11-06T18:44:09Z

Test build #4414 has finished for PR 22894 at commit 57bdd75.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

Koraseg · 2018-11-06T20:06:31Z

retest this please

SparkQA · 2018-11-07T00:17:07Z

Test build #4416 has finished for PR 22894 at commit 57bdd75.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2018-11-07T15:12:21Z

Merged to master

…construction optimization

## What changes were proposed in this pull request?

Removal of intermediate structures in HighlyCompressedMapStatus will speed up its creation and deserialization time.

https://issues.apache.org/jira/browse/SPARK-25885

## How was this patch tested?

Additional tests are not necessary for the patch.

Closes apache#22894 from Koraseg/mapStatusesOptimization.

Authored-by: koraseg <artem.kupchinsky@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>

Koraseg changed the title ~~[SPARK-25885] HighlyCompressedMapStatus deserialization/construction optimization~~ [SPARK-25885][Core][Minor] HighlyCompressedMapStatus deserialization/construction optimization Oct 30, 2018

Koraseg force-pushed the mapStatusesOptimization branch from f6ab475 to c31d0f5 Compare October 31, 2018 16:10

mgaido91 reviewed Oct 31, 2018

srowen reviewed Oct 31, 2018

Koraseg force-pushed the mapStatusesOptimization branch from c31d0f5 to 0ec13f8 Compare November 1, 2018 09:22

srowen requested changes Nov 3, 2018

Koraseg force-pushed the mapStatusesOptimization branch from b89e375 to 487df9f Compare November 6, 2018 13:11

[SPARK-25885] HighlyCompressedMapStatus deserialization/construction …

57bdd75

…optimization

Koraseg force-pushed the mapStatusesOptimization branch from 487df9f to 57bdd75 Compare November 6, 2018 14:36

srowen approved these changes Nov 6, 2018

asfgit closed this in 0a32238 Nov 7, 2018
