[SPARK-56484][SQL] Filter's sizeInBytes estimation should not inflate over child's sizeInBytes when CBO is on by JkSelf · Pull Request #55349 · apache/spark

JkSelf · 2026-04-15T12:55:58Z

What changes were proposed in this pull request?

We observed a significant discrepancy in the logical plan's statistics estimation at the Filter node when running Q23a and Q23b in 10TB TPC-DS. For the customer table, the RelationV2 scan correctly reports a sizeInBytes of 248.0 MiB based on actual metadata. However, after the Filter isnotnull(c_customer_sk) operator is applied, the CBO inflates the estimated size to 743.9 MiB. Even though the rowCount remains unchanged, the heuristic recalculation of sizeInBytes triples the value. This "data inflation" after a filter pushes the estimate above the 250 MiB threshold, so the planner incorrectly rules out the Broadcast Hash Join.

Filter isnotnull(c_customer_sk#8913), Statistics(sizeInBytes=743.9 MiB, rowCount=6.50E+7)

+- RelationV2[c_customer_sk#8913] spark_catalog.tpcds_sf10000_parquet_zstd_iceberg_part_perfteam2.customer, Statistics(sizeInBytes=248.0 MiB, rowCount=6.50E+7)

The Filter node should maintain its existing logic for estimating rowCount and updating attributeStats. However, for sizeInBytes, we should adopt a more conservative approach by selecting the minimum of two estimates:

Legacy Logic: getOutputSize(outputAttrs, filteredRowCount, newColStats). This often results in a value larger than the actual sizeInBytes from the Scan node due to heuristic row-width defaults.

New Scaling Logic: child.sizeInBytes * (filteredRowCount / childRowCount). This is more reasonable as a Filter does not change the row width; it only reduces the number of rows.

Final Decision: min(sizeByOutputAttrs, sizeByChildScaling)

This ensures the estimated size never exceeds the size reported by the child node. In the example above, it prevents the original 260 MB (248.0 MiB) from being inflated to 780 MB (743.9 MiB).
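As a rough illustration of the proposed rule, the arithmetic can be sketched in Python. This is not the Scala code from the PR: the function name, the integer division, and the zero-row fallback are assumptions, and the byte counts are round numbers chosen to match the displayed 248.0 MiB and 743.9 MiB.

```python
MIB = 1024 * 1024


def filter_size_in_bytes(size_by_output_attrs, child_size, child_rows, filtered_rows):
    """Pick the smaller of the heuristic estimate and the scaled child size.

    Hypothetical helper that models min(sizeByOutputAttrs, sizeByChildScaling).
    """
    if child_rows <= 0:
        # Assumed fallback: without a child row count there is nothing to scale by.
        return size_by_output_attrs
    size_by_child_scaling = child_size * filtered_rows // child_rows
    return min(size_by_output_attrs, size_by_child_scaling)


# Assumed raw values behind the plan output quoted above.
child_rows = 65_000_000           # rowCount=6.50E+7
filtered_rows = 65_000_000        # isnotnull keeps every row in this example
child_size = 260_000_000          # displays as 248.0 MiB
size_by_output_attrs = 780_000_000  # displays as 743.9 MiB

estimate = filter_size_in_bytes(size_by_output_attrs, child_size, child_rows, filtered_rows)
threshold = 250 * MIB

print(f"estimate: {estimate / MIB:.1f} MiB, under threshold: {estimate <= threshold}")
```

With these assumed inputs the legacy estimate sits above a 250 MiB threshold while the combined estimate stays below it, which is the behavior the description is after.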

Why are the changes needed?

With the broadcast hash join (BHJ) threshold set to 250 MB, the inflated Filter estimate causes the planner to miss the BHJ optimization.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added new unit tests

Was this patch authored or co-authored using generative AI tooling?

No

JkSelf · 2026-04-15T12:56:56Z

@cloud-fan Could you help to review this PR? Thanks for your help.

cloud-fan

Summary

Prior state and problem: FilterEstimation.estimate() computes sizeInBytes via getOutputSize(plan.output, filteredRowCount, newColStats), which is rowCount × getSizePerRow(attributes, attrStats). getSizePerRow adds overhead per row (8 bytes base + 12 bytes per StringType column for UTF8String object layout). When the scan reports an accurate, compact sizeInBytes from table metadata (e.g., Parquet/Iceberg columnar data), the heuristic per-row computation inflates the estimate. In the PR's example, a single IntegerType column gets 12 bytes/row from the heuristic (8 overhead + 4 avgLen) versus ~4 bytes/row from actual scan metadata — a ~3x inflation that can push the planner above the broadcast hash join threshold.
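The per-row heuristic described in this summary can be modeled in a few lines of Python. This is a simplified sketch built only from the numbers quoted in the comment (8 bytes of row overhead, 12 extra bytes per string column); the function names and the `(type_name, avg_len)` tuple layout are hypothetical, and the real getSizePerRow may handle more cases than shown here.

```python
ROW_OVERHEAD_BYTES = 8      # base overhead per row, as quoted in the review
STRING_OVERHEAD_BYTES = 12  # extra bytes per string column, as quoted in the review


def size_per_row(columns):
    """columns: list of (type_name, avg_len) pairs; a hypothetical layout."""
    total = ROW_OVERHEAD_BYTES
    for type_name, avg_len in columns:
        total += avg_len
        if type_name == "string":
            total += STRING_OVERHEAD_BYTES
    return total


def output_size(columns, row_count):
    """Heuristic size: row count times estimated bytes per row."""
    return row_count * size_per_row(columns)


int_column = [("int", 4)]
per_row = size_per_row(int_column)
heuristic = output_size(int_column, 65_000_000)

print(per_row, heuristic)
```

For a single 4-byte integer column the model charges 12 bytes per row, three times the roughly 4 bytes per row implied by the scan metadata.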

Design approach: Take min(sizeByOutputAttrs, sizeByChildScaling) where sizeByChildScaling = childSizeInBytes × filteredRowCount / childRowCount. The min of two estimates handles both directions: when the child has accurate metadata (the scaling estimate is better) and when the child has wildly overestimated size like defaultSizeInBytes (the per-attribute estimate is better).
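The two directions mentioned here can be checked with a small Python model. The helper name is hypothetical, the row counts in the second case are made up for illustration, and `ASSUMED_DEFAULT_SIZE` is only a stand-in for a very large default size.

```python
def pick_size(size_by_output_attrs, child_size, child_rows, filtered_rows):
    """Model of min(sizeByOutputAttrs, sizeByChildScaling); names are hypothetical."""
    size_by_child_scaling = child_size * filtered_rows // child_rows
    return min(size_by_output_attrs, size_by_child_scaling)


# Stand-in for a child whose size is a huge default rather than real metadata.
ASSUMED_DEFAULT_SIZE = 2**63 - 1

# Direction 1: the child size is accurate, so the scaled estimate wins.
accurate_child = pick_size(780_000_000, 260_000_000, 65_000_000, 65_000_000)

# Direction 2: the child size is wildly overestimated, so the per-attribute estimate wins.
overestimated_child = pick_size(1_200, ASSUMED_DEFAULT_SIZE, 1_000, 100)

print(accurate_child, overestimated_child)
```

In both cases the smaller value is the more plausible one, which is why a plain min works without any check on where the child's size came from.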

Key design decisions: The PR also adds two defensive improvements: (1) applying boundProbability to the final selectivity to guard against floating-point rounding in compound expressions (e.g., Or, where p1 + p2 - p1*p2 can slightly exceed 1.0), and (2) capping filteredRowCount at childRowCount so that rounding up with ceil cannot push it above the child's row count. Both are good safety nets.
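A minimal Python sketch of the two guards follows. The function names mirror the terms used in the comment, but the bodies are assumptions rather than the PR's code, and the slightly-above-one selectivity is an artificial input used to exercise the clamp.

```python
import math


def or_probability(p1, p2):
    """Selectivity of (A OR B) assuming independence."""
    return p1 + p2 - p1 * p2


def bound_probability(p):
    """Clamp a selectivity into [0.0, 1.0]; an assumed model of boundProbability."""
    return max(0.0, min(1.0, p))


def filtered_row_count(child_rows, selectivity):
    """Round up, then cap at the child's row count."""
    rows = math.ceil(child_rows * bound_probability(selectivity))
    return min(rows, child_rows)


slightly_over_one = 1.0000000000000002  # artificial value just above 1.0
print(bound_probability(slightly_over_one))
print(filtered_row_count(65_000_000, slightly_over_one))
print(filtered_row_count(10, or_probability(0.5, 0.5)))
```

Either guard alone would keep this toy model in range; keeping both is cheap and protects against each source of rounding separately.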

Implementation: Changes are confined to the estimate() method in FilterEstimation. No API changes, no impact on callers.

dongjoon-hyun

Thank you, @JkSelf and @cloud-fan.

Please file a JIRA issue and have a proper PR title because it will be a commit log eventually.

JkSelf · 2026-04-15T15:49:00Z

> Thank you, @JkSelf and @cloud-fan.
>
> Please file a JIRA issue and have a proper PR title because it will be a commit log eventually.

Filed the JIRA issue: https://issues.apache.org/jira/browse/SPARK-56484

dongjoon-hyun · 2026-04-15T15:58:03Z

Thank you, @JkSelf.

Addressed.

dongjoon-hyun

+1, LGTM.

dongjoon-hyun · 2026-04-15T18:53:17Z

Merged to master for Apache Spark 4.2.0 whose Feature Freeze is scheduled in two weeks on 2026-05-01.

https://spark.apache.org/versioning-policy.html

dongjoon-hyun · 2026-04-15T18:53:25Z

Thank you, @JkSelf and @cloud-fan.

IS NOT NULL should not increase sizeInBytes over child

6b05acf

JkSelf changed the title ~~IS NOT NULL should not increase sizeInBytes over child~~ Filter IS NOT NULL expression 's sizeInBytes should not increase over child Apr 15, 2026

cloud-fan approved these changes Apr 15, 2026

Comment thread ...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

Comment thread ...yst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala

JkSelf changed the title ~~Filter IS NOT NULL expression 's sizeInBytes should not increase over child~~ Filter's sizeInBytes estimation should not inflate over child's sizeInBytes when CBO is on. Apr 15, 2026

Resolve comments

9c87447

dongjoon-hyun previously requested changes Apr 15, 2026

JkSelf changed the title ~~Filter's sizeInBytes estimation should not inflate over child's sizeInBytes when CBO is on.~~ [SPARK-56484][SQL]Filter's sizeInBytes estimation should not inflate over child's sizeInBytes when CBO is on. Apr 15, 2026

dongjoon-hyun changed the title ~~[SPARK-56484][SQL]Filter's sizeInBytes estimation should not inflate over child's sizeInBytes when CBO is on.~~ [SPARK-56484][SQL] Filter's sizeInBytes estimation should not inflate over child's sizeInBytes when CBO is on Apr 15, 2026

dongjoon-hyun approved these changes Apr 15, 2026

dongjoon-hyun closed this in 15d7783 Apr 15, 2026

JkSelf wants to merge 2 commits into apache:master from JkSelf:fix-filter-estimation
