
[SPARK-9023] [SQL] Efficiency improvements for UnsafeRows in Exchange #7456

Closed
wants to merge 14 commits

Conversation

JoshRosen (Contributor)

This pull request aims to improve the performance of SQL's Exchange operator when shuffling UnsafeRows. It also makes several general efficiency improvements to Exchange.

Key changes:

  • When performing hash partitioning, the old Exchange projected the partitioning columns into a new row and then passed a (partitioningColumnRow: InternalRow, row: InternalRow) pair into the shuffle. This is very inefficient because it ends up redundantly serializing the partitioning columns only to immediately discard them after the shuffle. After this patch's changes, Exchange now shuffles (partitionId: Int, row: InternalRow) pairs. This still isn't optimal, since we're still shuffling extra data that we don't need, but it's significantly more efficient than the old implementation; in the future, we may be able to further optimize this once we implement a new shuffle write interface that accepts non-key-value-pair inputs.
  • Exchange's compute() method has been significantly simplified; the new code has less duplication and thus is easier to understand.
  • When the Exchange's input operator produces UnsafeRows, Exchange will use a specialized UnsafeRowSerializer to serialize these rows. This serializer is significantly more efficient since it simply copies the UnsafeRow's underlying bytes. Note that this approach does not work for UnsafeRows that use the ObjectPool mechanism; I did not add support for this because we are planning to remove ObjectPool in the next few weeks.
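The byte-copy serialization described in the last bullet amounts to size-prefixed framing of each row's backing bytes, with no per-field encoding at all. Below is a minimal self-contained sketch of that idea; it uses plain java.io streams rather than Spark's actual SerializerInstance API, and a hypothetical RowBytes stand-in for an UnsafeRow backed by an on-heap byte array:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

// Hypothetical stand-in for an UnsafeRow backed by an on-heap byte array.
final case class RowBytes(bytes: Array[Byte])

// Write each row as [sizeInBytes: Int][raw bytes] -- a straight memory copy,
// which is the efficiency win over field-by-field serialization.
def writeRows(rows: Seq[RowBytes]): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val out = new DataOutputStream(bos)
  rows.foreach { row =>
    out.writeInt(row.bytes.length)
    out.write(row.bytes)
  }
  out.writeInt(-1) // end-of-stream sentinel
  out.flush()
  bos.toByteArray
}

// Read rows back by repeating [size][bytes] until the sentinel.
def readRows(data: Array[Byte]): Seq[RowBytes] = {
  val in = new DataInputStream(new ByteArrayInputStream(data))
  Iterator
    .continually(in.readInt())
    .takeWhile(_ != -1)
    .map { size =>
      val buf = new Array[Byte](size)
      in.readFully(buf)
      RowBytes(buf)
    }
    .toSeq
}
```

Because every row with the same schema has a known field count, the real serializer only needs numFields at construction time; the byte stream itself carries just the variable per-row sizes.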

@SparkQA

SparkQA commented Jul 17, 2015

Test build #37562 has finished for PR 7456 at commit 8dd3ff2.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ShuffledRowRDD(
    • class UnsafeRowSerializer(numFields: Int) extends Serializer

import org.apache.spark.unsafe.PlatformDependent


class UnsafeRowSerializer(numFields: Int) extends Serializer {
Contributor

this is a file that deserves a little bit more comment to explain what this is for.

Contributor Author

Yep. I'm going to add comments now.

@rxin
Contributor

rxin commented Jul 17, 2015

The current code looks pretty good to me.

@JoshRosen
Contributor Author

I'm going to rebase this on top of #7482 to make things easier to test; will rebase again once that patch is merged.

@JoshRosen
Contributor Author

Alright, I've updated this and it should be ready for another look. I added a very trivial test to trigger the new shuffle path, which caught a bug related to UnsafeRowSerializer not being serializable.

@JoshRosen JoshRosen changed the title [SPARK-9023] [SQL] [WIP] Efficiency improvements for UnsafeRows in Exchange [SPARK-9023] [SQL] Efficiency improvements for UnsafeRows in Exchange Jul 18, 2015
@JoshRosen
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Jul 18, 2015

Test build #37736 has finished for PR 7456 at commit 0082515.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Concat(children: Seq[Expression]) extends Expression with ImplicitCastInputTypes
    • class ShuffledRowRDD(

@SparkQA

SparkQA commented Jul 18, 2015

Test build #37738 has finished for PR 7456 at commit 0082515.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Concat(children: Seq[Expression]) extends Expression with ImplicitCastInputTypes
    • class ShuffledRowRDD(

@rxin
Contributor

rxin commented Jul 18, 2015

Can you also update the pull request description?

@JoshRosen
Contributor Author

Done; I removed the "work-in-progress" part.

@JoshRosen
Contributor Author

Huh, looks like a legitimate test failure in SparkSqlSerializer2SortMergeShuffleSuite:

org.apache.spark.rdd.MapPartitionsRDD cannot be cast to org.apache.spark.rdd.ShuffledRDD

@SparkQA

SparkQA commented Jul 19, 2015

Test build #37744 has finished for PR 7456 at commit 7e75259.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ShuffledRowRDD(

this
}
override def writeKey[T: ClassTag](key: T): SerializationStream = {
assert(key.isInstanceOf[Int])
Contributor

you need to add some comment explaining why we are not doing anything when writing keys.
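The reason writeKey can be silent here is that the "key" is just the partition ID, which the shuffle machinery has already consumed to route the record; serializing it would only duplicate bytes the reader never looks at. A minimal sketch of that pattern, using a hypothetical MiniSerializationStream trait in place of Spark's real org.apache.spark.serializer.SerializationStream:

```scala
// Hypothetical slimmed-down model of Spark's SerializationStream API,
// for illustration only.
trait MiniSerializationStream {
  def writeKey(key: Any): MiniSerializationStream
  def writeValue(value: Any): MiniSerializationStream
}

// Keys are partition IDs already used for routing by the shuffle writer,
// so writeKey deliberately writes nothing; only row values hit the sink.
class PartitionIdAwareStream(sink: StringBuilder) extends MiniSerializationStream {
  override def writeKey(key: Any): MiniSerializationStream = {
    assert(key.isInstanceOf[Int], "keys must be partition IDs")
    this // intentional no-op
  }
  override def writeValue(value: Any): MiniSerializationStream = {
    sink.append(value.toString)
    this
  }
}
```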

var dataRemaining: Int = row.getSizeInBytes
val baseObject = row.getBaseObject
var rowReadPosition: Long = row.getBaseOffset
while (dataRemaining > 0) {
Contributor

probably doesn't matter in the MVP, but if we know the UnsafeRow is backed by a byte array, we don't need to do this copying, do we?

Contributor Author

A nice way to address this would be to add a writeTo method to UnsafeRow itself. That method could contain a special case to handle the case where the row is backed by an on-heap byte array.

Contributor

+1
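The writeTo idea floated above might look roughly like the following. This is a speculative sketch, not the eventual implementation: FakeUnsafeRow is a hypothetical slimmed-down model of an UnsafeRow's memory layout, and the off-heap branch elides the real Platform.copyMemory call since it isn't available here:

```scala
import java.io.OutputStream

// Hypothetical model of an UnsafeRow: backed either by an on-heap byte
// array (baseObject) or by off-heap memory.
final class FakeUnsafeRow(val baseObject: AnyRef, val baseOffset: Int, val sizeInBytes: Int) {

  // Sketch of the proposed writeTo: special-case the on-heap byte-array
  // backing so no intermediate buffer copy is needed.
  def writeTo(out: OutputStream, buffer: Array[Byte]): Unit = baseObject match {
    case bytes: Array[Byte] =>
      out.write(bytes, baseOffset, sizeInBytes) // direct write, zero extra copies
    case _ =>
      // Off-heap case: would copy through the caller-supplied buffer chunk
      // by chunk (the real memory copy via Platform.copyMemory is elided).
      var remaining = sizeInBytes
      var pos = baseOffset.toLong
      while (remaining > 0) {
        val toCopy = math.min(remaining, buffer.length)
        remaining -= toCopy
        pos += toCopy
      }
  }
}
```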

@rxin
Contributor

rxin commented Jul 20, 2015

Looks pretty good. I'm going to merge it. Please submit a followup pr to address some of the comments on documentation and choice of buffer size.

@asfgit asfgit closed this in 79ec072 Jul 20, 2015
asfgit pushed a commit that referenced this pull request Jul 21, 2015
…safeRows in Exchange)

This patch addresses code review feedback from #7456.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7551 from JoshRosen/unsafe-exchange-followup and squashes the following commits:

76dbdf8 [Josh Rosen] Add comments + more methods to UnsafeRowSerializer
3d7a1f2 [Josh Rosen] Add writeToStream() method to UnsafeRow
@JoshRosen JoshRosen deleted the unsafe-exchange branch December 29, 2015 20:25