[SPARK-56484][SQL] Filter's sizeInBytes estimation should not inflate over child's sizeInBytes when CBO is on#55349
[SPARK-56484][SQL] Filter's sizeInBytes estimation should not inflate over child's sizeInBytes when CBO is on#55349JkSelf wants to merge 2 commits into
Conversation
|
@cloud-fan Could you help to review this PR? Thanks for your help. |
cloud-fan
left a comment
There was a problem hiding this comment.
Summary
Prior state and problem: FilterEstimation.estimate() computes sizeInBytes via getOutputSize(plan.output, filteredRowCount, newColStats), which is rowCount × getSizePerRow(attributes, attrStats). getSizePerRow adds overhead per row (8 bytes base + 12 bytes per StringType column for UTF8String object layout). When the scan reports an accurate, compact sizeInBytes from table metadata (e.g., Parquet/Iceberg columnar data), the heuristic per-row computation inflates the estimate. In the PR's example, a single IntegerType column gets 12 bytes/row from the heuristic (8 overhead + 4 avgLen) versus ~4 bytes/row from actual scan metadata — a ~3x inflation that can push the planner above the broadcast hash join threshold.
Design approach: Take min(sizeByOutputAttrs, sizeByChildScaling) where sizeByChildScaling = childSizeInBytes × filteredRowCount / childRowCount. The min of two estimates handles both directions: when the child has accurate metadata (the scaling estimate is better) and when the child has wildly overestimated size like defaultSizeInBytes (the per-attribute estimate is better).
Key design decisions: The PR also adds two defensive improvements: (1) applying boundProbability to the final selectivity to guard against floating-point rounding in compound expressions (e.g., Or where p1 + p2 - p1*p2 can slightly exceed 1.0), and (2) capping filteredRowCount at childRowCount to prevent rounding from ceil. Both are good safety nets.
Implementation: Changes are confined to the estimate() method in FilterEstimation. No API changes, no impact on callers.
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Thank you, @JkSelf and @cloud-fan .
Please file a JIRA issue and have a proper PR title because it will be a commit log eventually.
Filed https://issues.apache.org/jira/browse/SPARK-56484 Jira issue. |
|
Thank you, @JkSelf . |
|
Thank you, @JkSelf and @cloud-fan . |

What changes were proposed in this pull request?
We observed a significant discrepancy in the logical plan's statistics estimation at the Filter node when running Q23a and Q23b in 10TB TPC-DS . For the customer table, the RelationV2 scan correctly identifies a sizeInBytes of 248.0 MiB based on actual metadata. However, after applying the Filter isnotnull(c_customer_sk) operator, the CBO inflates the estimated size to 743.9 MiB. Even though the rowCount remains unchanged , the heuristic recalculation of sizeInBytes triples the value. This "data inflation" after a filter causes the planner to exceed the 250 MiB threshold, incorrectly disabling the Broadcast Hash Join.
The Filter node should maintain its existing logic for estimating rowCount and updating attributeStats. However, for sizeInBytes, we should adopt a more conservative approach by selecting the minimum of two estimates:
Legacy Logic: getOutputSize(outputAttrs, filteredRowCount, newColStats). This often results in a value larger than the actual sizeInBytes from the Scan node due to heuristic row-width defaults.
New Scaling Logic: child.sizeInBytes * (filteredRowCount / childRowCount). This is more reasonable as a Filter does not change the row width; it only reduces the number of rows.
Final Decision: min(sizeByOutputAttrs, sizeByChildScaling)
This ensures the estimated size never exceeds the actual size of the child node. (e.g., preventing the original 260MB from being inflated to 780MB).
Why are the changes needed?
Missing BHJ optimization when setting 250MB bhj threshold.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added new unit tests
Was this patch authored or co-authored using generative AI tooling?
No