@XiangpengHao
Contributor

Which issue does this PR close?

Closes #11767

Rationale for this change

It is a typo

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@alamb
Contributor

alamb commented Aug 1, 2024

Thank you @XiangpengHao -- I have some tests written for this that I will push to this branch

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Aug 1, 2024
Contributor

@alamb alamb left a comment

Thank you for the very quick turnaround @XiangpengHao

DataType::Binary => Box::new(BytesDistinctCountAccumulator::<i32>::new(
OutputType::Binary,
)),
DataType::BinaryView => Box::new(BytesViewDistinctCountAccumulator::new(

💯

@XiangpengHao
Contributor Author

@alamb, do you have any thoughts on why ClickBench query 5 did not panic before this fix?

SELECT COUNT(DISTINCT "SearchPhrase") FROM hits;

I find that we only go down this path when the query counts two distinct columns, for example:

SELECT COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserCountry") FROM hits;

This makes me suspect that Q5 is being cast to Utf8, which would explain why it is slower with string view. I am looking into it; let me know if you have any ideas.

@alamb
Contributor

alamb commented Aug 1, 2024

@alamb, do you have any thoughts on why ClickBench query 5 did not panic before this fix?

Yes, I think it is because

SELECT COUNT(DISTINCT "SearchPhrase") FROM hits;

Is rewritten to a group by without distinct here: https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/single_distinct_to_groupby.rs

andrewlamb@Andrews-MacBook-Pro-2:~/Downloads/benchmarking$ datafusion-cli -c 'EXPLAIN SELECT COUNT(DISTINCT "SearchPhrase") FROM "hits.parquet"'
DataFusion CLI v40.0.0
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: count(alias1) AS count(DISTINCT hits.parquet.SearchPhrase)                                                                                                                                                                                                                                                                                                                                                                                                                 |
|               |   Aggregate: groupBy=[[]], aggr=[[count(alias1)]]                                                                                                                                                                                                                                                                                                                                                                                                                                      |
|               |     Aggregate: groupBy=[[hits.parquet.SearchPhrase AS alias1]], aggr=[[]]                                                                                                                                                                                                                                                                                                                                                                                                              |
|               |       TableScan: hits.parquet projection=[SearchPhrase]                                                                                                                                                                                                                                                                                                                                                                                                                                |
| physical_plan | ProjectionExec: expr=[count(alias1)@0 as count(DISTINCT hits.parquet.SearchPhrase)]                                                                                                                                                                                                                                                                                                                                                                                                    |
|               |   AggregateExec: mode=Final, gby=[], aggr=[count(alias1)]                                                                                                                                                                                                                                                                                                                                                                                                                              |
|               |     CoalescePartitionsExec                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
|               |       AggregateExec: mode=Partial, gby=[], aggr=[count(alias1)]                                                                                                                                                                                                                                                                                                                                                                                                                        |
|               |         AggregateExec: mode=FinalPartitioned, gby=[alias1@0 as alias1], aggr=[]                                                                                                                                                                                                                                                                                                                                                                                                        |
|               |           CoalesceBatchesExec: target_batch_size=8192                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|               |             RepartitionExec: partitioning=Hash([alias1@0], 16), input_partitions=16                                                                                                                                                                                                                                                                                                                                                                                                    |
|               |               AggregateExec: mode=Partial, gby=[SearchPhrase@0 as alias1], aggr=[]                                                                                                                                                                                                                                                                                                                                                                                                     |
|               |                 ParquetExec: file_groups={16 groups: [[Users/andrewlamb/Downloads/benchmarking/hits.parquet:0..923748528], [Users/andrewlamb/Downloads/benchmarking/hits.parquet:923748528..1847497056], [Users/andrewlamb/Downloads/benchmarking/hits.parquet:1847497056..2771245584], [Users/andrewlamb/Downloads/benchmarking/hits.parquet:2771245584..3694994112], [Users/andrewlamb/Downloads/benchmarking/hits.parquet:3694994112..4618742640], ...]}, projection=[SearchPhrase] |
|               |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.046 seconds.

This was largely needed before we had implemented native distinct accumulators. But I wonder if we should re-evaluate now that we have fast string accumulators 🤔
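The rewrite described above can be illustrated with a small, self-contained Rust sketch. It is not DataFusion code: the function names `count_distinct_direct` and `count_distinct_via_group_by` are hypothetical, and plain `HashSet`s stand in for the real accumulators and aggregate operators. It only shows why grouping by the column with no aggregates and then counting the groups gives the same answer as a single distinct-count accumulator, assuming NULLs are ignored by `COUNT` in both forms.

```rust
use std::collections::HashSet;

/// Direct form: one accumulator keeps the set of non-NULL values it has seen.
/// (Hypothetical stand-in for a native distinct-count accumulator.)
fn count_distinct_direct(values: &[Option<&str>]) -> usize {
    let mut seen: HashSet<&str> = HashSet::new();
    for v in values.iter().flatten() {
        seen.insert(*v);
    }
    seen.len()
}

/// Rewritten form: an inner GROUP BY on the column with no aggregates,
/// followed by an outer COUNT over the group keys.
/// (Hypothetical stand-in for the plan shape shown in the EXPLAIN output.)
fn count_distinct_via_group_by(partitions: &[Vec<Option<&str>>]) -> usize {
    let mut groups: HashSet<Option<&str>> = HashSet::new();
    for partition in partitions {
        // Partial aggregate: each partition deduplicates its own rows.
        let partial: HashSet<Option<&str>> = partition.iter().copied().collect();
        // Final aggregate: merge the per-partition group keys.
        groups.extend(partial);
    }
    // Outer COUNT(alias1): NULL forms a group of its own but is not counted.
    groups.iter().filter(|g| g.is_some()).count()
}
```

Calling both functions on the same rows (split into partitions for the second one) should return the same count, which is the property the optimizer rule relies on.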

@XiangpengHao
Contributor Author

Is rewritten to a group by without distinct here

I see, that makes sense

@XiangpengHao
Contributor Author

This was largely needed before we had implemented native distinct accumulators. But I wonder if we should re-evaluate now that we have fast string accumulators 🤔

I ran a simple test and found that this optimization rule indeed improves performance

@Dandandan Dandandan merged commit f044bc8 into apache:main Aug 1, 2024

Development

Successfully merging this pull request may close these issues.

COUNT(DISTINCT) on StringView panics: unreachable code: Utf8/Binary should use ArrowBytesSet
