Automatically detect and use "is the data sorted" information in parquet file metadata #4177

Open
1 of 2 tasks
Tracked by #10313
alamb opened this issue Nov 11, 2022 · 3 comments
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Nov 11, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Suggested by @crepererum in #4169 (comment)

Some systems, such as IOx, store parquet files in a particular sort order and then use the fact that the data is sorted for a variety of sort-related optimizations.

Storing sorted data in parquet is often a key performance technique, as it "clusters" data in ways that can make predicate evaluation and other query techniques faster.

The BasicEnforcement rule added in #4122 by @mingmwang allows DataFusion to take advantage of known information about the sort order.

One contrived example: if your parquet file is sorted by price and your query is "select * from data order by price limit 10", DataFusion can avoid scanning the entire file.
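
As a rough illustration of the limit case, here is a standalone sketch (not DataFusion code; the function names are hypothetical and prices are modeled as plain `u64` values). It contrasts what must be read when the order is unknown with what must be read when the input is known to be sorted ascending:

```rust
/// Order unknown: every row has to be read and sorted before the limit applies.
fn order_by_limit_unsorted(rows: impl IntoIterator<Item = u64>, limit: usize) -> Vec<u64> {
    let mut all: Vec<u64> = rows.into_iter().collect();
    all.sort_unstable();
    all.truncate(limit);
    all
}

/// Input assumed to be sorted ascending already: stop after `limit` rows.
/// If that assumption is wrong, the result is wrong, so the ordering
/// information has to be trustworthy.
fn order_by_limit_sorted(rows: impl IntoIterator<Item = u64>, limit: usize) -> Vec<u64> {
    rows.into_iter().take(limit).collect()
}
```

Both functions return the same rows for sorted input, but the second one pulls only `limit` rows from its source, which is the saving the example above is after.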

Another, more interesting example could be using the sort order to reorder pushdown filters, or using a sort-merge join without actually sorting.
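
To make the "sort-merge join without actually sorting" idea concrete, here is a minimal standalone sketch of the merge phase only. It assumes both inputs are already sorted ascending by a `u32` key; the function name and tuple layout are illustrative and are not DataFusion's join implementation:

```rust
use std::cmp::Ordering;

/// Inner equi-join of two inputs that are each already sorted ascending by key.
/// No sort step is performed; duplicate keys produce every matching pair.
fn merge_join<L: Clone, R: Clone>(left: &[(u32, L)], right: &[(u32, R)]) -> Vec<(u32, L, R)> {
    let mut out = Vec::new();
    let (mut i, mut j) = (0, 0);
    while i < left.len() && j < right.len() {
        match left[i].0.cmp(&right[j].0) {
            // Advance whichever side has the smaller key.
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                let key = left[i].0;
                // Find the run of rows with this key on the right side.
                let run_start = j;
                let mut run_end = j;
                while run_end < right.len() && right[run_end].0 == key {
                    run_end += 1;
                }
                // Pair every left row with this key against the whole right run.
                while i < left.len() && left[i].0 == key {
                    for r in &right[run_start..run_end] {
                        out.push((key, left[i].1.clone(), r.1.clone()));
                    }
                    i += 1;
                }
                j = run_end;
            }
        }
    }
    out
}
```

If the planner knows from file metadata that both sides are sorted on the join key, only this merge pass is needed; otherwise each side would need a sort first.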

Describe the solution you'd like

Describe alternatives you've considered
Don't do it

Additional context

Here is a ticket that tracks allowing users of DataFusion to manually specify the sort order: #4169

@doki23
Contributor

doki23 commented Apr 24, 2023

I have a question: if we detect the sort information when initializing the physical plan, would it cause a performance regression, since we would need to read the metadata of all the parquet files?

@crepererum
Contributor

> I have a question: if we detect the sort information when initializing the physical plan, would it cause a performance regression, since we would need to read the metadata of all the parquet files?

It depends on where you place the parquet metadata. We likely don't want to pre-fetch metadata when constructing the physical plan. However, you could store the metadata in some catalog or cache, in which case it would be available during planning.
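
A minimal sketch of the catalog/cache idea, using only the standard library. `SortKey`, `SortOrderCache`, `record`, and `lookup` are hypothetical names for illustration, not existing DataFusion types; the point is only that planning does a map lookup instead of file I/O:

```rust
use std::collections::HashMap;

/// Hypothetical description of one sort key: column name plus direction.
#[derive(Debug, Clone, PartialEq)]
struct SortKey {
    column: String,
    descending: bool,
}

/// Hypothetical catalog-side cache mapping a file path to its declared sort order.
#[derive(Default)]
struct SortOrderCache {
    by_path: HashMap<String, Vec<SortKey>>,
}

impl SortOrderCache {
    /// Record the sort order once, e.g. when the file is written or first opened.
    fn record(&mut self, path: &str, order: Vec<SortKey>) {
        self.by_path.insert(path.to_string(), order);
    }

    /// Planning-time lookup with no I/O. `None` means "unknown", in which
    /// case the planner must not assume any ordering.
    fn lookup(&self, path: &str) -> Option<&[SortKey]> {
        self.by_path.get(path).map(|v| v.as_slice())
    }
}
```

Treating a cache miss as "unknown" rather than fetching metadata on demand keeps planning cost independent of the number of files, at the price of missing the optimization for uncached files.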

@alamb
Contributor Author

alamb commented Apr 24, 2023

Some of the parquet file metadata is already read as part of physical planning (e.g. fetching the statistics). I don't quite remember how it is all hooked up, but you can trace it back from:

https://github.com/apache/arrow-datafusion/blob/729586258fe6371e394b8b2caa4e1b55eccbf6c5/datafusion/core/src/physical_plan/file_format/parquet.rs#L154

That might give one a sense of how we could use the sortedness information in DataFusion without doing additional work.
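
For reference, the Parquet format lets each row group declare an optional list of sorting columns (column index, descending flag, nulls-first flag). Once that metadata is in hand, one possible way to combine it is sketched below, with a local `SortingColumn` struct mirroring those three fields. `common_sort_prefix` is a hypothetical helper, and how the reader exposes the metadata is not shown:

```rust
/// Local mirror of the Parquet `SortingColumn` fields, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SortingColumn {
    column_idx: i32,
    descending: bool,
    nulls_first: bool,
}

/// Longest sort-key prefix declared by every row group.
/// A row group with no declaration (`None`) means nothing can be assumed.
///
/// Caveat: the declaration is per row group. Agreement between row groups
/// does not by itself prove the file is globally sorted across row-group
/// boundaries; this sketch only checks that the declarations are consistent.
fn common_sort_prefix(row_groups: &[Option<Vec<SortingColumn>>]) -> Vec<SortingColumn> {
    let mut iter = row_groups.iter();
    let mut prefix: Vec<SortingColumn> = match iter.next() {
        Some(Some(cols)) => cols.clone(),
        _ => return Vec::new(),
    };
    for rg in iter {
        let cols = match rg {
            Some(cols) => cols,
            None => return Vec::new(),
        };
        // Keep only the leading keys that match exactly.
        let shared = prefix
            .iter()
            .zip(cols.iter())
            .take_while(|(a, b)| a == b)
            .count();
        prefix.truncate(shared);
    }
    prefix
}
```

An empty result means the scan should report no known ordering, which is the behavior without this feature.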


3 participants