You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
gruuya@markogrujics-MacBook-Pro datafusion-cli % DATAFUSION_EXECUTION_COLLECT_STATISTICS=true datafusion-cli
DataFusion CLI v36.0.0
❯ create external table hits stored as parquet location 'benchmarks/data/hits.parquet';
0 rows in set. Query took 0.030 seconds.
❯ explain selectcount(*) from hits;
+---------------+---------------------------------------------------------------+
| plan_type | plan |
+---------------+---------------------------------------------------------------+
| logical_plan | Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1)) AS COUNT(*)]] ||| TableScan: hits projection=[] || physical_plan | ProjectionExec: expr=[99997497 as COUNT(*)] ||| PlaceholderRowExec ||||
+---------------+---------------------------------------------------------------+
2 rows in set. Query took 0.039 seconds.
❯ explain selectcount(1) from hits;
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Aggregate: groupBy=[[]], aggr=[[COUNT(Int64(1))]] ||| TableScan: hits projection=[] || physical_plan | AggregateExec: mode=Final, gby=[], aggr=[COUNT(Int64(1))] ||| CoalescePartitionsExec ||| AggregateExec: mode=Partial, gby=[], aggr=[COUNT(Int64(1))] ||| ParquetExec: file_groups={12 groups: [[Users/gruuya/Splitgraph/arrow-datafusion/benchmarks/data/hits.parquet:0..1231664704], [Users/gruuya/Splitgraph/arrow-datafusion/benchmarks/data/hits.parquet:1231664704..2463329408], [Users/gruuya/Splitgraph/arrow-datafusion/benchmarks/data/hits.parquet:2463329408..3694994112], [Users/gruuya/Splitgraph/arrow-datafusion/benchmarks/data/hits.parquet:3694994112..4926658816], [Users/gruuya/Splitgraph/arrow-datafusion/benchmarks/data/hits.parquet:4926658816..6158323520], ...]} ||||
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.011 seconds.
Given that count(*) basically just prints out the metadata, while count(1) does some additional work, there's also a an apparent performance implication, albeit negligible, with count(1) being 10x slower:
❯ select count(*) from hits;
+----------+
| COUNT(*) |
+----------+
| 99997497 |
+----------+
1 row in set. Query took 0.004 seconds.
❯ select count(1) from hits;
+-----------------+
| COUNT(Int64(1)) |
+-----------------+
| 99997497 |
+-----------------+
1 row in set. Query took 0.043 seconds.
Expected behavior
COUNT(1) and COUNT(*) should be interchangable and result in the same logical and physical plans (save for the column alias).
Additional context
I can get the expected behavior by explicit casting, though the column name is off then
❯ explain selectcount(1::tinyint unsigned) from hits;
+---------------+----------------------------------------------------------------------+
| plan_type | plan |
+---------------+----------------------------------------------------------------------+
| logical_plan | Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1)) AS COUNT(Int64(1))]] ||| TableScan: hits projection=[] || physical_plan | ProjectionExec: expr=[99997497 as COUNT(*)] ||| PlaceholderRowExec ||||
+---------------+----------------------------------------------------------------------+
2 rows in set. Query took 0.007 seconds.
❯ selectcount(1::tinyint unsigned) from hits;
+----------+
| COUNT(*) |
+----------+
| 99997497 |
+----------+
1 row in set. Query took 0.003 seconds.
The text was updated successfully, but these errors were encountered:
Describe the bug
As per the standard
COUNT(1)
is equivalent to the more commonCOUNT(*)
, however this is not the case now, since the latter has the following sentinel value:https://github.com/apache/arrow-datafusion/blob/4bd7c137e0e205140e273a7c25824c94b457c660/datafusion/expr/src/utils.rs#L43-L45
whereas the former resolves to
COUNT(Int64(1))
.In turn this means this optimization is not being used
https://github.com/apache/arrow-datafusion/blob/4bd7c137e0e205140e273a7c25824c94b457c660/datafusion/core/src/physical_optimizer/aggregate_statistics.rs#L158-L162
To Reproduce
Given that
count(*)
basically just prints out the metadata, whilecount(1)
does some additional work, there's also a an apparent performance implication, albeit negligible, withcount(1)
being 10x slower:Expected behavior
COUNT(1)
andCOUNT(*)
should be interchangable and result in the same logical and physical plans (save for the column alias).Additional context
I can get the expected behavior by explicit casting, though the column name is off then
The text was updated successfully, but these errors were encountered: