
[SPARK-22515] [SQL] Estimate relation size based on numRows * rowSize #19743

Closed
wants to merge 2 commits

Conversation

wzhfy
Contributor

@wzhfy wzhfy commented Nov 14, 2017

What changes were proposed in this pull request?

Currently, relation size is computed as the sum of file sizes, which is error-prone because columnar storage formats like Parquet can have a much smaller on-disk size than the corresponding in-memory size. When we choose broadcast join based on file size, there is a risk of OOM. But if the number of rows is available in statistics, we can get a better estimate from numRows * rowSize, which helps to alleviate this problem.
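
For illustration, a minimal sketch of the estimation strategy described above, with assumed simplified signatures (the real change lives inside Spark's statistics code, not in a free-standing helper like this):

// Sketch only: prefer numRows * rowSize when a row count is known,
// and fall back to the raw file size otherwise.
def estimateRelationSize(
    fileSizeInBytes: BigInt,      // sum of on-disk file sizes (e.g. compressed Parquet)
    rowCount: Option[BigInt],     // from ANALYZE TABLE statistics, if available
    avgRowSizeInBytes: BigInt): BigInt = {
  rowCount match {
    // A row count is known: numRows * rowSize approximates the in-memory
    // size far better than the compressed, encoded file size does.
    case Some(numRows) => numRows * avgRowSizeInBytes
    // No statistics: the on-disk size is the only signal available.
    case None => fileSizeInBytes
  }
}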

How was this patch tested?

Added a new test case covering both a data source table and a Hive table.

@SparkQA

SparkQA commented Nov 14, 2017

Test build #83841 has finished for PR 19743 at commit cc9ecc6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Nov 14, 2017

cc @cloud-fan @gatorsmile

@SparkQA

SparkQA commented Nov 28, 2017

Test build #84256 has finished for PR 19743 at commit e8355e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Contributor Author

wzhfy commented Nov 28, 2017

ping @cloud-fan, could you review this?

      val attrStats = AttributeMap(planOutput.flatMap(a => colStats.get(a.name).map(a -> _)))
      // Estimate size as number of rows * row size.
      val size = EstimationUtils.getOutputSize(planOutput, rowCount.get, attrStats)
      Statistics(sizeInBytes = size, rowCount = rowCount, attributeStats = attrStats)
    } else {
      // When CBO is disabled, we apply the size-only estimation strategy, so there's no need to
Contributor

now we need to update the comment: when CBO is disabled or the table doesn't have statistics
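
For context, a sketch of the full method shape implied by this excerpt, with the comment updated as the review asks. This is an assumption of the final wording, not the verbatim Spark source; rowCount, colStats, and sizeInBytes are fields of the enclosing catalog-statistics holder:

// Sketch: convert catalog statistics into plan statistics.
def toPlanStats(planOutput: Seq[Attribute], cboEnabled: Boolean): Statistics = {
  if (cboEnabled && rowCount.isDefined) {
    val attrStats = AttributeMap(planOutput.flatMap(a => colStats.get(a.name).map(a -> _)))
    // Estimate size as number of rows * row size.
    val size = EstimationUtils.getOutputSize(planOutput, rowCount.get, attrStats)
    Statistics(sizeInBytes = size, rowCount = rowCount, attributeStats = attrStats)
  } else {
    // Updated per the review: when CBO is disabled or the table doesn't have
    // statistics, apply the size-only estimation strategy and only propagate
    // sizeInBytes.
    Statistics(sizeInBytes = sizeInBytes)
  }
}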

@@ -41,7 +41,35 @@ import org.apache.spark.sql.types._

class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleton {
  test("Hive serde tables should fallback to HDFS for size estimation") {

  test("size estimation for relations based on row size * number of rows") {
Contributor

nit: is based on

    val hiveTbl = "rel_est_hive_table"
    withTable(dsTbl, hiveTbl) {
      spark.range(1000L).write.format("parquet").saveAsTable(dsTbl)
      sql(s"CREATE TABLE $hiveTbl STORED AS parquet AS SELECT * FROM $dsTbl")
Contributor

nit: spark.range(1000L).write.format("hive").saveAsTable(hiveTbl)
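
Putting the two nits together, the test would plausibly end up like the sketch below. dsTbl's definition is not visible in the excerpt, so its name here is assumed, and the assertion body is elided:

test("size estimation for relations is based on row size * number of rows") {
  val dsTbl = "rel_est_ds_table"    // assumed name; not shown in the excerpt above
  val hiveTbl = "rel_est_hive_table"
  withTable(dsTbl, hiveTbl) {
    spark.range(1000L).write.format("parquet").saveAsTable(dsTbl)
    // Per the nit above: write the Hive table directly rather than via CTAS.
    spark.range(1000L).write.format("hive").saveAsTable(hiveTbl)
    // ...assertions comparing the estimated sizeInBytes against
    // rowCount * rowSize would follow here.
  }
}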

@cloud-fan
Contributor

LGTM

@SparkQA

SparkQA commented Nov 28, 2017

Test build #84263 has finished for PR 19743 at commit acc498c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thanks! Merged to master.
