Skip to content

Bug: statistics not collected automatically upon creation of ListingTable #18952

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

According to https://datafusion.apache.org/user-guide/configs.html

The datafusion.execution.collect_statistics defaults to true which means the statistics should be gathered as part of the CREATE TABLE (and subsequent queries should not have to recreate Statistics for individual files)

datafusion.execution.collect_statistics true Should DataFusion collect statistics when first creating a table. Has no effect after the table is created. Applies to the default ListingTableProvider in DataFusion. Defaults to true.

This behavior was introduced in the following issue for precisely this reason

However, while reviewing ClickBench performance results, it seems to me that the statistics are actually computed on first use:

This means all our ClickBench results include the time to gather / convert statistics for 100 files. Other systems such as DuckDB do not pay this penalty (they gather the statistics on creation) and thus appear much faster in the leaderboard: https://benchmark.clickhouse.com

You can reproduce this like

Get data:

cd ~/Software/datafusion
 cd benchmarks/
# get data
./bench.sh data clickbench_partitioned

Then run the queries:

-- Create the external table -- this SHOULD gather statistics
CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 'data/hits_partitioned'
OPTIONS ('binary_as_string' 'true');

-- This query should be really fast (1ms) as it can be run directly from statistics
-- however, it takes 17ms-18ms on my machine, and 32ms on c6x.* class machines in EC2
SELECT COUNT(*) FROM hits;
-- This query should take about the same time as the one above
SELECT COUNT(*) FROM hits;
(venv) andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion/benchmarks$ datafusion-cli -f q.sql
DataFusion CLI v51.0.0
0 row(s) fetched.
Elapsed 0.021 seconds.

+----------+
| count(*) |
+----------+
| 99997497 |
+----------+
1 row(s) fetched.
Elapsed 0.017 seconds.

+----------+
| count(*) |
+----------+
| 99997497 |
+----------+
1 row(s) fetched.
Elapsed 0.001 seconds.

Describe the solution you'd like

I would like the statistics to actually be gathered on CREATE TABLE

Describe alternatives you've considered

It appear that the statistics are cached on DefaultFileStatisticsCache, which created when the ListingTable and instantiated (but not populated) when the table is created the first time: https://github.com/apache/datafusion/blob/3c21b546a9acf9922229220d3ceca91a945cbf46/datafusion/catalog-listing/src/table.rs#L227-L226

I think we could maybe "prewarm" the statistics cache here somehow:

let provider = ListingTable::try_new(config)?
.with_cache(state.runtime_env().cache_manager.get_file_statistic_cache());

Additional context

No response

Metadata

Metadata

Labels

bugSomething isn't workingenhancementNew feature or request

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions