Skip to content

[SPARK-56484][SQL] Filter's sizeInBytes estimation should not inflate over child's sizeInBytes when CBO is on#55349

Closed
JkSelf wants to merge 2 commits into
apache:masterfrom
JkSelf:fix-filter-estimation
Closed

[SPARK-56484][SQL] Filter's sizeInBytes estimation should not inflate over child's sizeInBytes when CBO is on#55349
JkSelf wants to merge 2 commits into
apache:masterfrom
JkSelf:fix-filter-estimation

Conversation

@JkSelf
Copy link
Copy Markdown
Contributor

@JkSelf JkSelf commented Apr 15, 2026

What changes were proposed in this pull request?

We observed a significant discrepancy in the logical plan's statistics estimation at the Filter node when running Q23a and Q23b in 10TB TPC-DS . For the customer table, the RelationV2 scan correctly identifies a sizeInBytes of 248.0 MiB based on actual metadata. However, after applying the Filter isnotnull(c_customer_sk) operator, the CBO inflates the estimated size to 743.9 MiB. Even though the rowCount remains unchanged , the heuristic recalculation of sizeInBytes triples the value. This "data inflation" after a filter causes the planner to exceed the 250 MiB threshold, incorrectly disabling the Broadcast Hash Join.

Filter isnotnull(c_customer_sk#8913), Statistics(sizeInBytes=743.9 MiB, rowCount=6.50E+7)

+- RelationV2[c_customer_sk#8913] spark_catalog.tpcds_sf10000_parquet_zstd_iceberg_part_perfteam2.customer, Statistics(sizeInBytes=248.0 MiB, rowCount=6.50E+7)

The Filter node should maintain its existing logic for estimating rowCount and updating attributeStats. However, for sizeInBytes, we should adopt a more conservative approach by selecting the minimum of two estimates:

Legacy Logic: getOutputSize(outputAttrs, filteredRowCount, newColStats). This often results in a value larger than the actual sizeInBytes from the Scan node due to heuristic row-width defaults.

New Scaling Logic: child.sizeInBytes * (filteredRowCount / childRowCount). This is more reasonable as a Filter does not change the row width; it only reduces the number of rows.

Final Decision: min(sizeByOutputAttrs, sizeByChildScaling)

This ensures the estimated size never exceeds the actual size of the child node. (e.g., preventing the original 260MB from being inflated to 780MB).

Why are the changes needed?

Missing BHJ optimization when setting 250MB bhj threshold.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added new unit tests

Was this patch authored or co-authored using generative AI tooling?

No

@JkSelf
Copy link
Copy Markdown
Contributor Author

JkSelf commented Apr 15, 2026

@cloud-fan Could you help to review this PR? Thanks for your help.

@JkSelf JkSelf changed the title IS NOT NULL should not increase sizeInBytes over child Filter IS NOT NULL expression 's sizeInBytes should not increase over child Apr 15, 2026
Copy link
Copy Markdown
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

Prior state and problem: FilterEstimation.estimate() computes sizeInBytes via getOutputSize(plan.output, filteredRowCount, newColStats), which is rowCount × getSizePerRow(attributes, attrStats). getSizePerRow adds overhead per row (8 bytes base + 12 bytes per StringType column for UTF8String object layout). When the scan reports an accurate, compact sizeInBytes from table metadata (e.g., Parquet/Iceberg columnar data), the heuristic per-row computation inflates the estimate. In the PR's example, a single IntegerType column gets 12 bytes/row from the heuristic (8 overhead + 4 avgLen) versus ~4 bytes/row from actual scan metadata — a ~3x inflation that can push the planner above the broadcast hash join threshold.

Design approach: Take min(sizeByOutputAttrs, sizeByChildScaling) where sizeByChildScaling = childSizeInBytes × filteredRowCount / childRowCount. The min of two estimates handles both directions: when the child has accurate metadata (the scaling estimate is better) and when the child has wildly overestimated size like defaultSizeInBytes (the per-attribute estimate is better).

Key design decisions: The PR also adds two defensive improvements: (1) applying boundProbability to the final selectivity to guard against floating-point rounding in compound expressions (e.g., Or where p1 + p2 - p1*p2 can slightly exceed 1.0), and (2) capping filteredRowCount at childRowCount to prevent rounding from ceil. Both are good safety nets.

Implementation: Changes are confined to the estimate() method in FilterEstimation. No API changes, no impact on callers.

@JkSelf JkSelf changed the title Filter IS NOT NULL expression 's sizeInBytes should not increase over child Filter's sizeInBytes estimation should not inflate over child's sizeInBytes when CBO is on. Apr 15, 2026
Copy link
Copy Markdown
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @JkSelf and @cloud-fan .

Please file a JIRA issue and have a proper PR title because it will be a commit log eventually.

@JkSelf JkSelf changed the title Filter's sizeInBytes estimation should not inflate over child's sizeInBytes when CBO is on. [SPARK-56484][SQL]Filter's sizeInBytes estimation should not inflate over child's sizeInBytes when CBO is on. Apr 15, 2026
@JkSelf
Copy link
Copy Markdown
Contributor Author

JkSelf commented Apr 15, 2026

Thank you, @JkSelf and @cloud-fan .

Please file a JIRA issue and have a proper PR title because it will be a commit log eventually.

Filed https://issues.apache.org/jira/browse/SPARK-56484 Jira issue.

@dongjoon-hyun
Copy link
Copy Markdown
Member

Thank you, @JkSelf .

@dongjoon-hyun dongjoon-hyun dismissed their stale review April 15, 2026 15:58

Addressed.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-56484][SQL]Filter's sizeInBytes estimation should not inflate over child's sizeInBytes when CBO is on. [SPARK-56484][SQL] Filter's sizeInBytes estimation should not inflate over child's sizeInBytes when CBO is on Apr 15, 2026
Copy link
Copy Markdown
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.

@dongjoon-hyun
Copy link
Copy Markdown
Member

Merged to master for Apache Spark 4.2.0 whose Feature Freeze is scheduled in two weeks on 2026-05-01.

Screenshot 2026-04-15 at 11 53 04

@dongjoon-hyun
Copy link
Copy Markdown
Member

Thank you, @JkSelf and @cloud-fan .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants