[SPARK-49000][SQL][3.5] Fix "select count(distinct 1) from t" where t is empty table by expanding RewriteDistinctAggregates #47566
Backport to 3.5 is ready; waiting for CI checks.
I have landed the PR for 3.4. Can you make the CI pass here, @uros-db?
The Python doc failure is definitely unrelated. Thanks, merging to 3.5!
… is empty table by expanding RewriteDistinctAggregates

### What changes were proposed in this pull request?

Fix the `RewriteDistinctAggregates` rule so that it properly handles aggregation on DISTINCT literals. The physical plan for `select count(distinct 1) from t` is:

```
-- count(distinct 1)
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[count(distinct 1)], output=[count(DISTINCT 1)#2L])
   +- HashAggregate(keys=[], functions=[partial_count(distinct 1)], output=[count#6L])
      +- HashAggregate(keys=[], functions=[], output=[])
         +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=20]
            +- HashAggregate(keys=[], functions=[], output=[])
               +- FileScan parquet spark_catalog.default.t[] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/nikola.mandic/oss-spark/spark-warehouse/org.apache.spark.s..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
```

The problem occurs when the `HashAggregate(keys=[], functions=[], output=[])` node yields one row to the `partial_count` node, which then counts that row. This four-node structure is constructed by `AggUtils.planAggregateWithOneDistinct`.

To fix the problem, we add an `Expand` node, which forces non-empty grouping expressions in the `HashAggregateExec` nodes. This in turn allows zero rows to be streamed to the parent `partial_count` node, yielding the correct final result.

### Why are the changes needed?

Aggregation with a DISTINCT literal gives wrong results. For example, when run on an empty table `t`, `select count(distinct 1) from t` returns 1, while the correct result is 0.

For reference, `select count(1) from t` returns 0, which is the correct and expected result.

### Does this PR introduce _any_ user-facing change?

Yes, this fixes a critical bug in Spark.

### How was this patch tested?

New e2e SQL tests for aggregates with DISTINCT literals.

### Was this patch authored or co-authored using generative AI tooling?

No.
Closes #47566 from uros-db/SPARK-49000-3.5. Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
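For readers following the thread, here is a rough sketch of why the empty-key aggregate produces the wrong count. It is a plain-Python toy model, not Spark code: the function names `dedup_without_keys`, `dedup_with_keys`, and `count_rows` are hypothetical. Grouping by the literal `1` is only an assumed stand-in for the "non-empty grouping expressions" mentioned in the description, not the actual output of the `Expand` node.

```python
def dedup_without_keys(rows):
    """Toy model of HashAggregate(keys=[], functions=[], output=[]).

    With no grouping keys, the aggregate emits exactly one (column-less)
    row, even when the input is empty.
    """
    for _ in rows:
        pass  # input is consumed, nothing is retained
    return [()]


def dedup_with_keys(rows, key):
    """Toy model of an aggregate with non-empty grouping keys.

    Emits one row per distinct key, hence zero rows for empty input.
    """
    return [(k,) for k in dict.fromkeys(key(r) for r in rows)]


def count_rows(rows):
    """Toy model of partial_count + final count: counts incoming rows."""
    return sum(1 for _ in rows)


empty_table = []

# Before the fix (assumed model): the key-less dedup step hands one row upward.
before_fix = count_rows(dedup_without_keys(empty_table))

# After the fix (assumed model): grouping keys are non-empty, so no rows flow.
after_fix = count_rows(dedup_with_keys(empty_table, key=lambda r: 1))

# Sanity check on a non-empty table: count(distinct 1) should still be 1.
non_empty = count_rows(dedup_with_keys([("a",), ("b",)], key=lambda r: 1))

print(before_fix, after_fix, non_empty)
```

In this model the only difference between the two paths is whether the deduplication step has grouping keys, which mirrors the reasoning in the description: an aggregate without keys always emits a row, while a keyed aggregate emits nothing for empty input.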
Thank you, @uros-db and all.