feat: multiple columns in count distinct #20460

Open
Mark1626 wants to merge 3 commits into apache:main from Mark1626:feat/count-distinct-multi

@Mark1626
Contributor

Which issue does this PR close?

Closes #5619

What changes are included in this PR?

  1. Introduce a separate accumulator, MultiColumnDistinctCountAccumulator, for multi-column distinct counts
  2. I used parts of "Count distinct support multiple expressions" (#5939) for reference; however, that PR was old, so I had to reimplement the change

Are these changes tested?

  1. Unit tests have been added
  2. I've tested this with a couple of queries in the CLI, for example:
with data AS (
  select * from (values
    ('a', 1, 'x'),
    ('a', 2, 'x'),
    ('b', 2, 'y'),
    ('b', 2, 'z'),
    ('c', 3, 'z')
  ) AS t(col1, col2, col3)
)
select count(distinct (col1, col2)) FROM data;
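
For reference, a minimal std-only sketch of what the query above is expected to compute: the number of distinct (col1, col2) pairs among the five sample rows. The helper names `sample_rows` and `count_distinct_pairs` are hypothetical and are not part of this PR; NULL handling is deliberately left out because the sample data contains no NULLs.

```rust
use std::collections::HashSet;

/// The five sample rows from the query above, as (col1, col2, col3).
fn sample_rows() -> Vec<(&'static str, i64, &'static str)> {
    vec![
        ("a", 1, "x"),
        ("a", 2, "x"),
        ("b", 2, "y"),
        ("b", 2, "z"),
        ("c", 3, "z"),
    ]
}

/// Counts distinct (col1, col2) pairs, ignoring col3.
/// Hypothetical helper for illustration only; not the PR's accumulator.
fn count_distinct_pairs(rows: &[(&str, i64, &str)]) -> usize {
    let mut seen: HashSet<(&str, i64)> = HashSet::new();
    for &(col1, col2, _col3) in rows {
        // Inserting an already-seen pair leaves the set unchanged.
        seen.insert((col1, col2));
    }
    seen.len()
}
```

Under these assumptions, `count_distinct_pairs(&sample_rows())` returns 4, because ('b', 2) appears twice and is counted once.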

@github-actions bot added the core (Core DataFusion crate) and functions (Changes to functions implementation) labels on Feb 21, 2026
@github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) label on Feb 21, 2026

#[derive(Debug)]
struct MultiColumnDistinctCountAccumulator {
    values: HashSet<Vec<ScalarValue>, RandomState>,

Using https://docs.rs/arrow-row/latest/arrow_row/struct.Row.html, or some custom hashing/equality, as the key would probably be much faster than Vec<ScalarValue>.
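
A rough, std-only sketch of the idea behind this suggestion: instead of hashing a vector of enum values per row, encode each row into one flat byte key and hash that. The `Value` enum and the `encode_value` and `count_distinct_encoded` functions are hypothetical names for illustration. This is not the arrow-row API and not the PR's code; arrow-row defines its own row encoding over Arrow arrays, and whether it is actually faster here would need benchmarking.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a dynamically typed scalar value.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
enum Value {
    Int(i64),
    Text(String),
}

/// Appends an unambiguous byte encoding of `v` to `buf`:
/// a one-byte type tag, then a fixed-width integer or a length-prefixed string.
fn encode_value(v: &Value, buf: &mut Vec<u8>) {
    match v {
        Value::Int(i) => {
            buf.push(0);
            buf.extend_from_slice(&i.to_be_bytes());
        }
        Value::Text(s) => {
            buf.push(1);
            // The length prefix keeps ("ab", "c") distinct from ("a", "bc").
            buf.extend_from_slice(&(s.len() as u64).to_be_bytes());
            buf.extend_from_slice(s.as_bytes());
        }
    }
}

/// Counts distinct rows by hashing one flat byte key per row.
fn count_distinct_encoded(rows: &[Vec<Value>]) -> usize {
    let mut seen: HashSet<Vec<u8>> = HashSet::new();
    let mut key = Vec::new(); // scratch buffer reused across rows
    for row in rows {
        key.clear();
        for v in row {
            encode_value(v, &mut key);
        }
        // Only allocate a new key when the row has not been seen before.
        if !seen.contains(key.as_slice()) {
            seen.insert(key.clone());
        }
    }
    seen.len()
}
```

The design point is that equality and hashing run over one contiguous byte slice per row rather than over a vector of separately allocated values, and a key is only cloned for rows not seen before.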

Ok(())
}

#[tokio::test]

Please put the tests in .slt files.

@comphead left a comment

Thanks @Mark1626 for driving this 💪

Before moving on to code review, let's expand the tests a bit to cover more cases, specifically:

  • mixed nulls in values
  • different column datatypes
  • 3+ cols
  • different col order
  • duplicates, like select count(distinct a, a) and select count(distinct a, a, b, b)

Once those tests pass, the code is most likely stable and ready for review.
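
Two of the requested cases, column order and duplicated columns, come with invariants that hold regardless of implementation, so they can be stated up front. The sketch below models them with plain non-NULL integers; `count_distinct_cols` is a hypothetical helper, not the PR's accumulator, and NULL semantics are left out because this thread does not specify them.

```rust
use std::collections::HashSet;

/// Counts distinct projections of `rows` onto the column indexes in `cols`.
/// Hypothetical model for illustration: values are non-NULL integers only.
fn count_distinct_cols(rows: &[Vec<i64>], cols: &[usize]) -> usize {
    let seen: HashSet<Vec<i64>> = rows
        .iter()
        // Project each row onto the requested columns, in the requested order.
        .map(|row| cols.iter().map(|&c| row[c]).collect::<Vec<i64>>())
        .collect();
    seen.len()
}
```

In this model, `count_distinct_cols(&rows, &[0, 1])` equals `count_distinct_cols(&rows, &[1, 0])`, and projecting onto `[0, 0]` gives the same count as projecting onto `[0]`, which is what the equivalent .slt cases would be expected to show for non-NULL data.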

Development

Successfully merging this pull request may close these issues.

Count() and Count(Distinct )should accept multiple exprs

3 participants