
[SPARK-22515] [SQL] Estimate relation size based on numRows * rowSize #19743

Closed
wants to merge 2 commits

Conversation

wzhfy
Contributor

@wzhfy wzhfy commented Nov 14, 2017

What changes were proposed in this pull request?

Currently, relation size is computed as the sum of file sizes, which is error-prone because columnar storage formats like Parquet can have a much smaller on-disk size than the corresponding in-memory size. When we choose broadcast join based on file size, there is a risk of OOM. But if the number of rows is available in statistics, we can get a better estimate from numRows * rowSize, which helps to alleviate this problem.
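
For illustration, a minimal sketch of the estimation strategy described above, with assumed simplified signatures (the real change lives inside Spark's statistics code, not in a free-standing helper like this):

// Sketch only: prefer numRows * rowSize when a row count is known,
// and fall back to the raw file size otherwise.
def estimateRelationSize(
    fileSizeInBytes: BigInt,      // sum of on-disk file sizes (e.g. compressed Parquet)
    rowCount: Option[BigInt],     // from ANALYZE TABLE statistics, if available
    avgRowSizeInBytes: BigInt): BigInt = {
  rowCount match {
    // A row count is known: numRows * rowSize approximates the in-memory
    // size far better than the compressed, encoded file size does.
    case Some(numRows) => numRows * avgRowSizeInBytes
    // No statistics: the on-disk size is the only signal available.
    case None => fileSizeInBytes
  }
}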

How was this patch tested?

Added a new test case covering both a data source table and a Hive table.

@SparkQA

SparkQA commented Nov 14, 2017

Test build #83841 has finished for PR 19743 at commit cc9ecc6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Nov 14, 2017

cc @cloud-fan @gatorsmile

@SparkQA

SparkQA commented Nov 28, 2017

Test build #84256 has finished for PR 19743 at commit e8355e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Nov 28, 2017

ping @cloud-fan, could you review this?

      val attrStats = AttributeMap(planOutput.flatMap(a => colStats.get(a.name).map(a -> _)))
      // Estimate size as number of rows * row size.
      val size = EstimationUtils.getOutputSize(planOutput, rowCount.get, attrStats)
      Statistics(sizeInBytes = size, rowCount = rowCount, attributeStats = attrStats)
    } else {
      // When CBO is disabled, we apply the size-only estimation strategy, so there's no need to
Contributor

now we need to update the comment: when CBO is disabled or the table doesn't have statistics
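
For context, a sketch of the full method shape implied by this excerpt, with the comment updated as the review asks. This is an assumption of the final wording, not the verbatim Spark source; rowCount, colStats, and sizeInBytes are fields of the enclosing catalog-statistics holder:

// Sketch: convert catalog statistics into plan statistics.
def toPlanStats(planOutput: Seq[Attribute], cboEnabled: Boolean): Statistics = {
  if (cboEnabled && rowCount.isDefined) {
    val attrStats = AttributeMap(planOutput.flatMap(a => colStats.get(a.name).map(a -> _)))
    // Estimate size as number of rows * row size.
    val size = EstimationUtils.getOutputSize(planOutput, rowCount.get, attrStats)
    Statistics(sizeInBytes = size, rowCount = rowCount, attributeStats = attrStats)
  } else {
    // Updated per the review: when CBO is disabled or the table doesn't have
    // statistics, apply the size-only estimation strategy and only propagate
    // sizeInBytes.
    Statistics(sizeInBytes = sizeInBytes)
  }
}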

@@ -41,7 +41,35 @@ import org.apache.spark.sql.types._

class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleton {
  test("Hive serde tables should fallback to HDFS for size estimation") {

  test("size estimation for relations based on row size * number of rows") {
Contributor

nit: is based on

    val hiveTbl = "rel_est_hive_table"
    withTable(dsTbl, hiveTbl) {
      spark.range(1000L).write.format("parquet").saveAsTable(dsTbl)
      sql(s"CREATE TABLE $hiveTbl STORED AS parquet AS SELECT * FROM $dsTbl")
Contributor

nit: spark.range(1000L).write.format("hive").saveAsTable(hiveTbl)
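
Putting the two nits together, the test would plausibly end up like the sketch below. dsTbl's definition is not visible in the excerpt, so its name here is assumed, and the assertion body is elided:

test("size estimation for relations is based on row size * number of rows") {
  val dsTbl = "rel_est_ds_table"    // assumed name; not shown in the excerpt above
  val hiveTbl = "rel_est_hive_table"
  withTable(dsTbl, hiveTbl) {
    spark.range(1000L).write.format("parquet").saveAsTable(dsTbl)
    // Per the nit above: write the Hive table directly rather than via CTAS.
    spark.range(1000L).write.format("hive").saveAsTable(hiveTbl)
    // ...assertions comparing the estimated sizeInBytes against
    // rowCount * rowSize would follow here.
  }
}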

@cloud-fan
Contributor

LGTM

@SparkQA

SparkQA commented Nov 28, 2017

Test build #84263 has finished for PR 19743 at commit acc498c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thanks! Merged to master.
