Skip to content

Sort ClickBench data using 4GB on standard laptop (spilling) #19216

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

While working with @zhuqi-lucas on #19042 we noticed it is not possible to sort the hits.parquet dataset

Get the data

./benchmarks/bench.sh data clickbench_1

Try to resort it using 4G of memory (on a 20 core Mac M3 laptop):

datafusion-cli -m 4G -c "COPY (SELECT * FROM 'benchmarks/data/hits.parquet' ORDER BY \"EventTime\") TO 'hits_sorted.parquet' STORED AS PARQUET;"

Results in

DataFusion CLI v51.0.0
Error: Not enough memory to continue external sort. Consider increasing the memory limit, or decreasing sort_spill_reservation_bytes
caused by
Resources exhausted: Additional allocation failed for ExternalSorter[7] with top memory consumers (across reservations) as:
  ExternalSorterMerge[4]#11(can spill: false) consumed 883.8 MB, peak 883.8 MB,
  ExternalSorterMerge[1]#5(can spill: false) consumed 812.6 MB, peak 812.6 MB,
  ExternalSorterMerge[9]#21(can spill: false) consumed 764.8 MB, peak 764.8 MB.
Error: Failed to allocate additional 13.7 MB for ExternalSorter[7] with 0.0 B already allocated for this reservation - 1088.1 KB remain available for the total pool

As @2010YOUY01 has documented in https://datafusion.apache.org/user-guide/configs.html#memory-limited-queries, this query does run to completion with fewer target partitions for example 1:

SET datafusion.execution.target_partitions = 1;

Then this succeeds

datafusion-cli -m 4G -c "SET datafusion.execution.target_partitions = 1; COPY (SELECT * FROM 'benchmarks/data/hits.parquet' ORDER BY \"EventTime\") TO 'hits_sorted.parquet' STORED AS PARQUET;"

However, 2 target partitions still fails:

andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion2$ datafusion-cli -m 4G -c "SET datafusion.execution.target_partitions = 2; COPY (SELECT * FROM 'benchmarks/data/hits.parquet' ORDER BY \"EventTime\") TO 'hits_sorted.parquet' STORED AS PARQUET;"
DataFusion CLI v51.0.0
0 row(s) fetched.
Elapsed 0.000 seconds.

Error: Not enough memory to continue external sort. Consider increasing the memory limit, or decreasing sort_spill_reservation_bytes
caused by
Resources exhausted: Additional allocation failed for ExternalSorter[0] with top memory consumers (across reservations) as:
  ExternalSorter[1]#5(can spill: true) consumed 3.2 GB, peak 3.5 GB,
  ExternalSorterMerge[1]#6(can spill: false) consumed 767.1 MB, peak 1773.9 MB,
  ExternalSorterMerge[0]#4(can spill: false) consumed 10.0 MB, peak 1679.3 MB.
Error: Failed to allocate additional 27.6 MB for ExternalSorter[0] with 0.0 B already allocated for this reservation - 20.7 MB remain available for the total pool

Describe the solution you'd like

I would like DataFusion to be able to complete such queries with a reasonable amount of RAM without having to tune the target partitions

Describe alternatives you've considered

Maybe there could be be some "rule of thumb" for the required resources -- for example, perhaps we could make sure queries run with 1 GB of RAM per core (and adjust the batch size / target partitioning automatically if needed)

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions