[SPARK-22515] [SQL] Estimation relation size based on numRows * rowSize #19743
Conversation
Test build #83841 has finished for PR 19743 at commit.

Force-pushed from cc9ecc6 to e8355e0.

Test build #84256 has finished for PR 19743 at commit.
ping @cloud-fan Could you review this?
val attrStats = AttributeMap(planOutput.flatMap(a => colStats.get(a.name).map(a -> _)))
// Estimate size as number of rows * row size.
val size = EstimationUtils.getOutputSize(planOutput, rowCount.get, attrStats)
Statistics(sizeInBytes = size, rowCount = rowCount, attributeStats = attrStats)
} else {
// When CBO is disabled, we apply the size-only estimation strategy, so there's no need to
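The snippet above computes the relation size via `EstimationUtils.getOutputSize`. As a rough illustration of the idea (not the actual Spark implementation; the method name, per-row overhead, and default column width below are assumptions), the row-based estimate multiplies the row count by the sum of per-column average widths taken from column statistics:

```scala
// Hypothetical simplified sketch of row-based size estimation, modeled on
// what EstimationUtils.getOutputSize does conceptually. Names, the 8-byte
// per-row overhead, and the default column width are illustrative assumptions.
object SizeEstimationSketch {
  // avgLenBytes: average serialized length of each output column, taken from
  // column statistics when available.
  def estimateSize(rowCount: BigInt,
                   avgLenBytes: Seq[Long],
                   defaultColSize: Long = 8L): BigInt = {
    // Row size = assumed fixed per-row overhead + sum of column widths
    // (falling back to a default when no column stats exist).
    val rowSize = 8L + (if (avgLenBytes.isEmpty) defaultColSize else avgLenBytes.sum)
    // Relation size = numRows * rowSize, clamped to at least 1 byte.
    (rowCount * BigInt(rowSize)).max(BigInt(1))
  }
}
```

For example, 1,000,000 rows with two columns of average widths 8 and 4 bytes would yield an estimate of 1,000,000 * (8 + 12) = 20,000,000 bytes under these assumed constants.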
Now we need to update the comment: when CBO is disabled or the table doesn't have statistics.
@@ -41,7 +41,35 @@ import org.apache.spark.sql.types._

class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleton {
  test("Hive serde tables should fallback to HDFS for size estimation") {

  test("size estimation for relations based on row size * number of rows") {
nit: is based on
val hiveTbl = "rel_est_hive_table"
withTable(dsTbl, hiveTbl) {
  spark.range(1000L).write.format("parquet").saveAsTable(dsTbl)
  sql(s"CREATE TABLE $hiveTbl STORED AS parquet AS SELECT * FROM $dsTbl")
nit: spark.range(1000L).write.format("hive").saveAsTable(hiveTbl)
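The nit above suggests creating the Hive serde table directly through the DataFrameWriter instead of a SQL CTAS. A sketch of the two equivalent approaches, assuming a Hive-enabled `SparkSession` and the table names from the test:

```scala
// 1. Via SQL, copying from the existing data source table:
spark.sql(s"CREATE TABLE $hiveTbl STORED AS parquet AS SELECT * FROM $dsTbl")
// 2. Via the DataFrameWriter "hive" format, as suggested in the review,
//    which creates the Hive table without going through the parquet table:
spark.range(1000L).write.format("hive").saveAsTable(hiveTbl)
```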
LGTM
Test build #84263 has finished for PR 19743 at commit.
Thanks! Merged to master.
What changes were proposed in this pull request?
Currently, relation size is computed as the sum of file sizes, which is error-prone because a storage format like Parquet may have a much smaller file size than the corresponding in-memory size. When we choose a broadcast join based on file size, there is a risk of OOM. But if the number of rows is available in statistics, we can get a better estimate as numRows * rowSize, which helps to alleviate this problem.

How was this patch tested?
Added a new test case for a data source table and a Hive table.
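To illustrate the motivation numerically, the following sketch uses made-up numbers (all constants are assumptions, not measurements from this PR) to show how a compressed on-disk size can fall under the broadcast threshold while the row-based estimate correctly exceeds it:

```scala
// Illustrative arithmetic only; all sizes here are invented to show why
// compressed file size can badly underestimate in-memory size.
object BroadcastThresholdExample {
  def main(args: Array[String]): Unit = {
    val fileSizeBytes = 5L * 1024 * 1024        // 5 MB Parquet file on disk
    val numRows = 10L * 1000 * 1000             // 10 million rows per statistics
    val rowSizeBytes = 20L                      // assumed in-memory row width
    val estimatedSize = numRows * rowSizeBytes  // 200,000,000 bytes estimated
    val threshold = 10L * 1024 * 1024           // e.g. a 10 MB broadcast threshold

    // File size alone (5 MB) would wrongly permit a broadcast join;
    // the numRows * rowSize estimate correctly exceeds the threshold.
    println(fileSizeBytes < threshold)   // broadcast allowed by file size
    println(estimatedSize > threshold)   // broadcast rejected by row-based size
  }
}
```

In Spark, the broadcast decision is governed by `spark.sql.autoBroadcastJoinThreshold`, so a more accurate size estimate directly reduces the chance of broadcasting a relation that is too large to fit in memory.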