[SPARK-22160][SQL] Make sample points per partition (in range partitioner) configurable and bump the default value up to 100 #19387
Conversation
…e shuffle exchange (cherry picked from commit 8e51ae5) Signed-off-by: Reynold Xin <rxin@databricks.com>
lgtm except for a minor comment
@@ -108,9 +108,17 @@ class HashPartitioner(partitions: Int) extends Partitioner {
 class RangePartitioner[K : Ordering : ClassTag, V](
     partitions: Int,
     rdd: RDD[_ <: Product2[K, V]],
-    private var ascending: Boolean = true)
+    private var ascending: Boolean = true,
+    val samplePointsPerPartitionHint: Int = 20)
Add a precondition check that this is > 0.
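A minimal, self-contained sketch of the kind of precondition being requested — the class and message below are illustrative stand-ins, not Spark's actual `RangePartitioner` code:

```scala
// Hypothetical sketch: a stripped-down class (not Spark's) showing a
// require() guard that rejects non-positive sample-point hints.
class SketchRangePartitioner(val samplePointsPerPartitionHint: Int = 20) {
  require(samplePointsPerPartitionHint > 0,
    s"Sample points per partition must be greater than 0 but found " +
      s"$samplePointsPerPartitionHint")
}

object SketchRangePartitioner {
  def main(args: Array[String]): Unit = {
    // A positive hint constructs fine; a zero hint throws
    // IllegalArgumentException from require().
    val ok = new SketchRangePartitioner(100)
    println(ok.samplePointsPerPartitionHint)
    println(scala.util.Try(new SketchRangePartitioner(0)).isFailure)
  }
}
```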
done
wait, I need to review the test
Some rough calculation suggests the average values of the chi-squared test statistic may be on the order of ~50 and ~1200 in the two test cases respectively, so the thresholds (100 and 1000) may make the test flaky.
withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> numPartitions.toString) {
  // The default chi-sq value should be low
  assert(computeChiSquareTest() < 100)
This test may be flaky. It depends on the ratio of n / RANGE_EXCHANGE_SAMPLE_SIZE_PER_PARTITION. What is the default value of RANGE_EXCHANGE_SAMPLE_SIZE_PER_PARTITION?
100 - which is pretty high
the actual value computed on my laptop is around 10, so 1000 is already three orders of magnitude larger
withSQLConf(SQLConf.RANGE_EXCHANGE_SAMPLE_SIZE_PER_PARTITION.key -> "1") {
  // If we only sample one point, the range boundaries will be pretty bad and the
  // chi-sq value would be very high.
  assert(computeChiSquareTest() > 1000)
This test may be flaky as well.
the value i got from my laptop was 1800
I put up a comment saying this test result should be deterministic, since the sampling uses a fixed seed based on partition id.
LGTM
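For readers unfamiliar with the statistic being discussed: a chi-squared goodness-of-fit value here measures how far observed partition sizes deviate from a perfectly uniform split, so small values mean good range boundaries and large values mean skew. A self-contained sketch of that computation (this is an assumption about the shape of `computeChiSquareTest`, not the merged test code):

```scala
// Hypothetical sketch of the chi-squared statistic the test computes:
// compare each observed partition size against the uniform expectation
// n / k and sum the normalized squared deviations.
object ChiSquareSketch {
  def chiSquare(partitionSizes: Seq[Long]): Double = {
    val n = partitionSizes.sum.toDouble
    val expected = n / partitionSizes.length
    partitionSizes.map { observed =>
      val d = observed - expected
      d * d / expected
    }.sum
  }

  def main(args: Array[String]): Unit = {
    // Perfectly uniform partitions give a statistic of 0.
    println(chiSquare(Seq(25L, 25L, 25L, 25L)))
    // All rows landing in one of four partitions gives a large value:
    // (100-25)^2/25 + 3 * (0-25)^2/25 = 225 + 75 = 300.
    println(chiSquare(Seq(100L, 0L, 0L, 0L)))
  }
}
```

This also illustrates why a fixed sampling seed makes the asserted thresholds deterministic: the same partition sizes always yield the same statistic.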
// 4th constructor parameter samplePointsPerPartitionHint. See SPARK-22160.
// This is added to make sure from a bytecode point of view, there is still a 3-arg ctor.
def this(partitions: Int, rdd: RDD[_ <: Product2[K, V]], ascending: Boolean) = {
  this(partitions, rdd, ascending, samplePointsPerPartitionHint = 20)
The default value is 100 now in SQLConf; shall we also use 100 here as the default value for samplePointsPerPartitionHint, to be consistent?
That one has been there for much longer so I'd rather change the SQL default first and see what happens.
Test build #82296 has finished for PR 19387 at commit
Test build #82295 has finished for PR 19387 at commit
Test build #82302 has finished for PR 19387 at commit
Merging in master.
…oner) configurable and bump the default value up to 100 Spark's RangePartitioner hard codes the number of sampling points per partition to be 20. This is sometimes too low. This ticket makes it configurable, via spark.sql.execution.rangeExchange.sampleSizePerPartition, and raises the default in Spark SQL to be 100. Added a pretty sophisticated test based on chi square test ... Author: Reynold Xin <rxin@databricks.com> Closes apache#19387 from rxin/SPARK-22160. This commit contains the following squashed commits: 938326b NETFLIX-BUILD: Fixup backport of SPARK-22160.
What changes were proposed in this pull request?
Spark's RangePartitioner hard codes the number of sampling points per partition to be 20. This is sometimes too low. This ticket makes it configurable, via spark.sql.execution.rangeExchange.sampleSizePerPartition, and raises the default in Spark SQL to be 100.
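For reference, the new knob can be set per session like any other SQL conf. This is an illustrative config fragment only — it assumes an active `SparkSession` named `spark`, and the value 200 is arbitrary:

```scala
// Illustrative only: raising the per-partition sample size above the new
// default of 100 for a session where range partitions still come out skewed.
// Assumes an active SparkSession bound to `spark`.
spark.conf.set("spark.sql.execution.rangeExchange.sampleSizePerPartition", "200")
```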
How was this patch tested?
Added a pretty sophisticated test based on chi square test ...