Skip to content

fix: UNIQUE constraint with NULLs incorrectly collapses GROUP BY groups#21715

Open
xiedeyantu wants to merge 2 commits intoapache:mainfrom
xiedeyantu:fd
Open

fix: UNIQUE constraint with NULLs incorrectly collapses GROUP BY groups#21715
xiedeyantu wants to merge 2 commits intoapache:mainfrom
xiedeyantu:fd

Conversation

@xiedeyantu
Copy link
Copy Markdown
Member

Which issue does this PR close?

Rationale for this change

UNIQUE constraints can contain multiple NULL values, so they do not guarantee row-level uniqueness in SQL semantics. The optimizer was incorrectly treating nullable unique constraints as functional dependencies that could reduce GROUP BY keys, which collapsed distinct NULL rows into a single group.

What changes are included in this PR?

This PR updates functional-dependency handling so nullable dependencies derived from UNIQUE constraints are not used to eliminate GROUP BY expressions. It also adds a regression test covering the NULL case from the issue report.

Are these changes tested?

Yes. I ran:

  • cargo fmt --all
  • cargo clippy -p datafusion-common --all-targets -- -D warnings
  • cargo test -p datafusion-common functional_dependencies
  • cargo test -p datafusion-sqllogictest --test sqllogictests -- group_by

Are there any user-facing changes?

Yes. Queries that group by nullable UNIQUE columns will no longer return incorrect aggregated results when multiple NULL values are present.

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) common Related to common crate logical-expr Logical plan and expressions substrait Changes to the substrait crate labels Apr 18, 2026
Copy link
Copy Markdown
Contributor

@nuno-faria nuno-faria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @xiedeyantu. While this fixes the original issue, it now treats all nulls as different when aggregating, which I don't think is correct. For example:

CREATE TABLE t(a INT, b INT, c INT, UNIQUE(a));
INSERT INTO t VALUES (1, 10, 100), (NULL, 20, 200), (NULL, 30, 300);
SELECT a, SUM(c) AS total FROM t GROUP BY a;
+------+-------+
| a    | total |
+------+-------+
| 1    | 100   |
| NULL | 200   |
| NULL | 300   |
+------+-------+

@github-actions github-actions bot added sql SQL Planner and removed substrait Changes to the substrait crate labels Apr 19, 2026
@xiedeyantu
Copy link
Copy Markdown
Member Author

Thanks @xiedeyantu. While this fixes the original issue, it now treats all nulls as different when aggregating, which I don't think is correct. For example:

CREATE TABLE t(a INT, b INT, c INT, UNIQUE(a));
INSERT INTO t VALUES (1, 10, 100), (NULL, 20, 200), (NULL, 30, 300);
SELECT a, SUM(c) AS total FROM t GROUP BY a;
+------+-------+
| a    | total |
+------+-------+
| 1    | 100   |
| NULL | 200   |
| NULL | 300   |
+------+-------+

@nuno-faria Thank you for your comprehensive analysis. I have refactored the code and added the test cases you provided; please review it again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate logical-expr Logical plan and expressions sql SQL Planner sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Incorrect query results for GROUP BY with UNIQUE constraint

2 participants