
[SPARK-23989][SQL] exchange should copy data before non-serialized shuffle #21101

Closed
cloud-fan wants to merge 1 commit into master from cloud-fan:shuffle

Conversation

@cloud-fan (Contributor) commented Apr 18, 2018

What changes were proposed in this pull request?

In Spark SQL we usually reuse the same UnsafeRow instance, so the data must be copied whenever something downstream buffers the rows as non-serialized objects.

Shuffle may buffer such objects if we end up on neither the bypass-merge shuffle path nor the unsafe (serialized) shuffle path.

ShuffleExchangeExec.needToCopyObjectsBeforeShuffle misses the case where spark.sql.shuffle.partitions is so large that the unsafe shuffle cannot be used and Spark falls back to the non-serialized shuffle.

This bug is very hard to hit in practice, since users would not normally set such a large number of partitions (more than about 16 million) for a Spark SQL exchange.
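For reference, here is a simplified, self-contained sketch (not the actual Spark source; the threshold constants are inlined for illustration) of the decision needToCopyObjectsBeforeShuffle is supposed to make after this fix:

```scala
object NeedToCopySketch {
  // Default value of spark.shuffle.sort.bypassMergeThreshold.
  val BypassMergeThreshold = 200

  // Mirrors SortShuffleManager.MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE:
  // 2^24 = 16,777,216 partitions, the "16 million" mentioned above.
  val MaxSerializedModePartitions = 1 << 24

  // Returns true when the shuffle write path buffers deserialized rows, in which case the
  // reused UnsafeRow instance must be copied before it is handed to the shuffle.
  def needToCopyObjectsBeforeShuffle(numParts: Int): Boolean = {
    if (numParts <= BypassMergeThreshold) {
      // Bypass-merge path: records go straight to per-partition files, nothing is buffered.
      false
    } else if (numParts <= MaxSerializedModePartitions) {
      // Serialized ("unsafe") shuffle: rows are serialized immediately, so no copy is needed.
      false
    } else {
      // Too many partitions for the serialized shuffle: Spark falls back to the sort-based
      // path that buffers deserialized rows, so the row must be copied. This is the branch
      // the old check missed.
      true
    }
  }
}
```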

TODO: test

How was this patch tested?

todo.

SparkQA commented Apr 18, 2018

Test build #89513 has finished for PR 21101 at commit 40b2c5c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  // If we're using the original SortShuffleManager and the number of output partitions is
  // sufficiently small, then Spark will fall back to the hash-based shuffle write path, which
  // doesn't buffer deserialized records.
  // Note that we'll have to remove this case if we fix SPARK-6026 and remove this bypass.
  false
- } else if (serializer.supportsRelocationOfSerializedObjects) {
+ } else if (numParts <= SortShuffleManager.MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE) {
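For context (not part of the diff): the no-copy branch is now gated on the partition count fitting within the serialized shuffle's limit instead of on the serializer's relocation support. The limit itself comes from the serialized shuffle packing the partition id into 24 bits of its record pointer, roughly:

```scala
// The serialized (unsafe) shuffle stores the partition id in 24 bits, so it supports at most
// 2^24 partitions; beyond that Spark falls back to the buffering, non-serialized sort path.
val maxSerializedModePartitions: Int = 1 << 24 // 16,777,216
```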
@JoshRosen (Contributor) commented Apr 19, 2018

I was almost going to suggest that we should check for both conditions with an && here, just as future-proofing in case the serializer was ever changed, but I can now see why that isn't a huge risk in the current codebase: we always use an UnsafeRowSerializer here now. It was only in the pre-Tungsten era that we could use either UnsafeRowSerializer or SparkSqlSerializer here.
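For illustration, the future-proofed variant discussed here (hypothetical, not what was merged) would combine both checks, roughly:

```scala
// Hypothetical combined condition: skip the defensive copy only if the serializer supports
// relocation of serialized objects AND the partition count fits serialized-mode limits.
// The merged change keeps only the partition-count check because this code path always uses
// UnsafeRowSerializer, which supports relocation.
def canSkipCopy(serializerSupportsRelocation: Boolean, numParts: Int): Boolean =
  serializerSupportsRelocation && numParts <= (1 << 24)
```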

@hvanhovell (Contributor) left a comment

LGTM

@hvanhovell commented Apr 19, 2018

Merging to master and 2.3. Let me know if further backports are needed.

asfgit pushed a commit that referenced this pull request Apr 19, 2018
[SPARK-23989][SQL] exchange should copy data before non-serialized shuffle

Author: Wenchen Fan <wenchen@databricks.com>

Closes #21101 from cloud-fan/shuffle.

(cherry picked from commit 6e19f76)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
asfgit closed this in 6e19f76 on Apr 19, 2018