-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
Is your feature request related to a problem or challenge?
According to https://datafusion.apache.org/user-guide/configs.html
The datafusion.execution.collect_statistics defaults to true which means the statistics should be gathered as part of the CREATE TABLE (and subsequent queries should not have to recreate Statistics for individual files)
| datafusion.execution.collect_statistics | true | Should DataFusion collect statistics when first creating a table. Has no effect after the table is created. Applies to the default ListingTableProvider in DataFusion. Defaults to true. |
|---|
This behavior was introduced in the following issue for precisely this reason
However, while reviewing ClickBench performance results, it seems to me that the statistics are actually computed on first use:
- see Update ClickBench benchmarks with DataFusion 51.0.0 #18909 (comment) for more details
This means all our ClickBench results include the time to gather / convert statistics for 100 files. Other systems such as DuckDB do not pay this penalty (they gather the statistics on creation) and thus appear much faster in the leaderboard: https://benchmark.clickhouse.com
You can reproduce this like
Get data:
cd ~/Software/datafusion
cd benchmarks/
# get data
./bench.sh data clickbench_partitionedThen run the queries:
-- Create the external table -- this SHOULD gather statistics
CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 'data/hits_partitioned'
OPTIONS ('binary_as_string' 'true');
-- This query should be really fast (1ms) as it can be run directly from statistics
-- however, it takes 17ms-18ms on my machine, and 32ms on c6x.* class machines in EC2
SELECT COUNT(*) FROM hits;
-- This query should take about the same time as the one above
SELECT COUNT(*) FROM hits;(venv) andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion/benchmarks$ datafusion-cli -f q.sql
DataFusion CLI v51.0.0
0 row(s) fetched.
Elapsed 0.021 seconds.
+----------+
| count(*) |
+----------+
| 99997497 |
+----------+
1 row(s) fetched.
Elapsed 0.017 seconds.
+----------+
| count(*) |
+----------+
| 99997497 |
+----------+
1 row(s) fetched.
Elapsed 0.001 seconds.Describe the solution you'd like
I would like the statistics to actually be gathered on CREATE TABLE
Describe alternatives you've considered
It appear that the statistics are cached on DefaultFileStatisticsCache, which created when the ListingTable and instantiated (but not populated) when the table is created the first time: https://github.com/apache/datafusion/blob/3c21b546a9acf9922229220d3ceca91a945cbf46/datafusion/catalog-listing/src/table.rs#L227-L226
I think we could maybe "prewarm" the statistics cache here somehow:
datafusion/datafusion/core/src/datasource/listing_table_factory.rs
Lines 187 to 188 in 4ea7601
| let provider = ListingTable::try_new(config)? | |
| .with_cache(state.runtime_env().cache_manager.get_file_statistic_cache()); |
Additional context
No response