Enable split_file_groups_by_statistics by default #10336

Open
Tracked by #10313
alamb opened this issue May 1, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented May 1, 2024

Is your feature request related to a problem or challenge?

Part of #10313

In #9593, @suremarc added a way to reorganize the input files of a ListingTable so that no merge is needed when the files' sort key ranges do not overlap.

This feature is behind a feature flag, split_file_groups_by_statistics, which defaults to false, as I think more tests need to be in place before we turn it on.
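To make the idea concrete, here is a minimal Python sketch of grouping files by their min/max statistics on the sort key. It is only an illustration under stated assumptions, not the code from #9593: `FileStats`, `split_into_ordered_groups`, and the integer key ranges are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical stand-in for one file's min/max statistics on the sort key."""
    name: str
    min_key: int
    max_key: int


def split_into_ordered_groups(files):
    """Greedily pack files into groups whose key ranges do not overlap.

    Files inside one group can be read back to back and still yield sorted
    output, so no merge is needed within a group.
    """
    groups = []
    for f in sorted(files, key=lambda s: (s.min_key, s.max_key)):
        for group in groups:
            # Strict comparison is a conservative choice for this sketch:
            # a file may follow a group only if it starts after the group ends.
            if group[-1].max_key < f.min_key:
                group.append(f)
                break
        else:
            # No existing group can take this file, so start a new one.
            groups.append([f])
    return groups


files = [
    FileStats("file1.parquet", 0, 9),
    FileStats("file2.parquet", 10, 19),
    FileStats("file3.parquet", 5, 15),  # overlaps both of the others
]
groups = split_into_ordered_groups(files)
group_names = [[f.name for f in g] for g in groups]
```

In this sketch file1 and file2 end up in the same group because their ranges are disjoint, while file3 overlaps both and gets a group of its own, so a merge would only be needed across the two groups.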

Describe the solution you'd like

Add additional tests, and then enable split_file_groups_by_statistics by default.

Describe alternatives you've considered

No response

Additional context

No response

@alamb alamb added the enhancement New feature or request label May 1, 2024
@alamb
Contributor Author

alamb commented May 1, 2024

Example test coverage that I think we should add: #9593 (comment)

@yyy1000
Contributor

yyy1000 commented May 4, 2024

I'd like to help with this. 🙌

@alamb
Contributor Author

alamb commented May 4, 2024

Thank you @yyy1000 🙏

I think a good place to start would be to write some sqllogic level tests to cover the important cases

Perhaps for the first test:

  1. Create two files, file1.parquet and file2.parquet, both sorted on a, where file 1 has the columns in the order a, b, c and file 2 has the columns in the order c, b, a. The key ranges of the values of a should be non-overlapping
  2. Create an external table with columns a, b, c and an explicit order by a, then query SELECT ... ORDER BY a and make sure the output plan doesn't use a sort preserving merge
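The scenario in the two steps above can be modeled in a few lines of plain Python. This is only a sketch of the property the test is meant to check, assuming the point of the differing column orders is that statistics for a must be looked up by column name rather than by position. The in-memory dictionaries, `read_as`, and `key_range` are hypothetical stand-ins, not DataFusion APIs, and the real test would live in a sqllogictest file.

```python
# Hypothetical in-memory stand-ins for file1.parquet and file2.parquet.
# Both are sorted on a, but store their columns in different orders.
file1 = {"columns": ["a", "b", "c"], "rows": [(1, "x", 1.0), (2, "y", 2.0)]}
file2 = {"columns": ["c", "b", "a"], "rows": [(3.0, "z", 3), (4.0, "w", 4)]}


def read_as(file, table_columns):
    """Reorder each row's values to match the table's declared column order."""
    idx = [file["columns"].index(c) for c in table_columns]
    return [tuple(row[i] for i in idx) for row in file["rows"]]


def key_range(file, key):
    """Return (min, max) of the key column, located by name, not by position."""
    i = file["columns"].index(key)
    values = [row[i] for row in file["rows"]]
    return min(values), max(values)


table_columns = ["a", "b", "c"]

# Order the files by the minimum of a, then check the ranges do not overlap.
files = sorted([file2, file1], key=lambda f: key_range(f, "a")[0])
ranges = [key_range(f, "a") for f in files]
non_overlapping = all(prev[1] < cur[0] for prev, cur in zip(ranges, ranges[1:]))

# With non-overlapping ranges, plain concatenation is already sorted on a,
# which is why the plan should not need a sort preserving merge.
concatenated = [row for f in files for row in read_as(f, table_columns)]
```

If `key_range` read the first column of file2 by position instead of by name, it would pick up c rather than a and report the wrong range, which is the kind of mistake this first test should catch.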

I think we could extend https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/parquet_sorted_statistics.slt

cc @suremarc
