Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

COUNT(1) is different from COUNT(*) #9943

Closed
gruuya opened this issue Apr 4, 2024 · 0 comments · Fixed by #9944
Closed

COUNT(1) is different from COUNT(*) #9943

gruuya opened this issue Apr 4, 2024 · 0 comments · Fixed by #9944
Labels
bug Something isn't working

Comments

@gruuya
Copy link
Contributor

gruuya commented Apr 4, 2024

Describe the bug

As per the standard COUNT(1) is equivalent to the more common COUNT(*), however this is not the case now, since the latter has the following sentinel value:
https://github.com/apache/arrow-datafusion/blob/4bd7c137e0e205140e273a7c25824c94b457c660/datafusion/expr/src/utils.rs#L43-L45
whereas the former resolves to COUNT(Int64(1)).

In turn this means this optimization is not being used
https://github.com/apache/arrow-datafusion/blob/4bd7c137e0e205140e273a7c25824c94b457c660/datafusion/core/src/physical_optimizer/aggregate_statistics.rs#L158-L162

To Reproduce

gruuya@markogrujics-MacBook-Pro datafusion-cli % DATAFUSION_EXECUTION_COLLECT_STATISTICS=true datafusion-cli
DataFusion CLI v36.0.0
❯ create external table hits stored as parquet location 'benchmarks/data/hits.parquet';
0 rows in set. Query took 0.030 seconds.

❯ explain select count(*) from hits;
+---------------+---------------------------------------------------------------+
| plan_type     | plan                                                          |
+---------------+---------------------------------------------------------------+
| logical_plan  | Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1)) AS COUNT(*)]] |
|               |   TableScan: hits projection=[]                               |
| physical_plan | ProjectionExec: expr=[99997497 as COUNT(*)]                   |
|               |   PlaceholderRowExec                                          |
|               |                                                               |
+---------------+---------------------------------------------------------------+
2 rows in set. Query took 0.039 seconds.

❯ explain select count(1) from hits;
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Aggregate: groupBy=[[]], aggr=[[COUNT(Int64(1))]]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
|               |   TableScan: hits projection=[]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| physical_plan | AggregateExec: mode=Final, gby=[], aggr=[COUNT(Int64(1))]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
|               |   CoalescePartitionsExec                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
|               |     AggregateExec: mode=Partial, gby=[], aggr=[COUNT(Int64(1))]                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
|               |       ParquetExec: file_groups={12 groups: [[Users/gruuya/Splitgraph/arrow-datafusion/benchmarks/data/hits.parquet:0..1231664704], [Users/gruuya/Splitgraph/arrow-datafusion/benchmarks/data/hits.parquet:1231664704..2463329408], [Users/gruuya/Splitgraph/arrow-datafusion/benchmarks/data/hits.parquet:2463329408..3694994112], [Users/gruuya/Splitgraph/arrow-datafusion/benchmarks/data/hits.parquet:3694994112..4926658816], [Users/gruuya/Splitgraph/arrow-datafusion/benchmarks/data/hits.parquet:4926658816..6158323520], ...]} |
|               |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.011 seconds.

Given that count(*) basically just prints out the metadata, while count(1) does some additional work, there's also a an apparent performance implication, albeit negligible, with count(1) being 10x slower:

❯ select count(*) from hits;
+----------+
| COUNT(*) |
+----------+
| 99997497 |
+----------+
1 row in set. Query took 0.004 seconds.

❯ select count(1) from hits;
+-----------------+
| COUNT(Int64(1)) |
+-----------------+
| 99997497        |
+-----------------+
1 row in set. Query took 0.043 seconds.

Expected behavior

COUNT(1) and COUNT(*) should be interchangable and result in the same logical and physical plans (save for the column alias).

Additional context

I can get the expected behavior by explicit casting, though the column name is off then

❯ explain select count(1::tinyint unsigned) from hits;
+---------------+----------------------------------------------------------------------+
| plan_type     | plan                                                                 |
+---------------+----------------------------------------------------------------------+
| logical_plan  | Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1)) AS COUNT(Int64(1))]] |
|               |   TableScan: hits projection=[]                                      |
| physical_plan | ProjectionExec: expr=[99997497 as COUNT(*)]                          |
|               |   PlaceholderRowExec                                                 |
|               |                                                                      |
+---------------+----------------------------------------------------------------------+
2 rows in set. Query took 0.007 seconds.

❯ select count(1::tinyint unsigned) from hits;
+----------+
| COUNT(*) |
+----------+
| 99997497 |
+----------+
1 row in set. Query took 0.003 seconds.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
1 participant