-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
Is your feature request related to a problem or challenge?
While working with @zhuqi-lucas on #19042 we noticed it is not possible to sort the hits.parquet dataset
Get the data
./benchmarks/bench.sh data clickbench_1Try to resort it using 4G of memory (on a 20 core Mac M3 laptop):
datafusion-cli -m 4G -c "COPY (SELECT * FROM 'benchmarks/data/hits.parquet' ORDER BY \"EventTime\") TO 'hits_sorted.parquet' STORED AS PARQUET;"Results in
DataFusion CLI v51.0.0
Error: Not enough memory to continue external sort. Consider increasing the memory limit, or decreasing sort_spill_reservation_bytes
caused by
Resources exhausted: Additional allocation failed for ExternalSorter[7] with top memory consumers (across reservations) as:
ExternalSorterMerge[4]#11(can spill: false) consumed 883.8 MB, peak 883.8 MB,
ExternalSorterMerge[1]#5(can spill: false) consumed 812.6 MB, peak 812.6 MB,
ExternalSorterMerge[9]#21(can spill: false) consumed 764.8 MB, peak 764.8 MB.
Error: Failed to allocate additional 13.7 MB for ExternalSorter[7] with 0.0 B already allocated for this reservation - 1088.1 KB remain available for the total pool
As @2010YOUY01 has documented in https://datafusion.apache.org/user-guide/configs.html#memory-limited-queries, this query does run to completion with fewer target partitions for example 1:
SET datafusion.execution.target_partitions = 1;Then this succeeds
datafusion-cli -m 4G -c "SET datafusion.execution.target_partitions = 1; COPY (SELECT * FROM 'benchmarks/data/hits.parquet' ORDER BY \"EventTime\") TO 'hits_sorted.parquet' STORED AS PARQUET;"However, 2 target partitions still fails:
andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion2$ datafusion-cli -m 4G -c "SET datafusion.execution.target_partitions = 2; COPY (SELECT * FROM 'benchmarks/data/hits.parquet' ORDER BY \"EventTime\") TO 'hits_sorted.parquet' STORED AS PARQUET;"
DataFusion CLI v51.0.0
0 row(s) fetched.
Elapsed 0.000 seconds.
Error: Not enough memory to continue external sort. Consider increasing the memory limit, or decreasing sort_spill_reservation_bytes
caused by
Resources exhausted: Additional allocation failed for ExternalSorter[0] with top memory consumers (across reservations) as:
ExternalSorter[1]#5(can spill: true) consumed 3.2 GB, peak 3.5 GB,
ExternalSorterMerge[1]#6(can spill: false) consumed 767.1 MB, peak 1773.9 MB,
ExternalSorterMerge[0]#4(can spill: false) consumed 10.0 MB, peak 1679.3 MB.
Error: Failed to allocate additional 27.6 MB for ExternalSorter[0] with 0.0 B already allocated for this reservation - 20.7 MB remain available for the total poolDescribe the solution you'd like
I would like DataFusion to be able to complete such queries with a reasonable amount of RAM without having to tune the target partitions
Describe alternatives you've considered
Maybe there could be be some "rule of thumb" for the required resources -- for example, perhaps we could make sure queries run with 1 GB of RAM per core (and adjust the batch size / target partitioning automatically if needed)
Additional context
No response