Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to specify external sort information for ParquetExec #4169

Closed
Tracked by #10313
alamb opened this issue Nov 10, 2022 · 10 comments · Fixed by #4170
Closed
Tracked by #10313

Add ability to specify external sort information for ParquetExec #4169

alamb opened this issue Nov 10, 2022 · 10 comments · Fixed by #4170
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Nov 10, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
IOx stores parquet files in a particular sort order, and then uses the fact the data is sorted for a variety of sort related optimizations

The new BasicEnforcement rule added in #4122 by @mingmwang is (correctly) deciding that since the ParquetExec declares its output is not sorted, it needs to add a SortExec which is unnecessary in our case and will slow performance dramatically.

I think the way to avoid this is to teach DataFusion that the ParquetExec is actually sorted (which is is) and then everything will work out.

Describe the solution you'd like
I would like a way for someone constructing a ParquetExec manually to be able to specify that the data is already sorted.

Describe alternatives you've considered
It might be possible to figure out the sort order of the data given the parquet metadata, but I haven't looked into that carefully

Additional context

As a bonus, I think at least some part of our plan construction logic in IOx that adds SortExec's in to sort the data could potentially be removed as it is now covered by the DataFusion optimizer.

See more detail at https://github.com/influxdata/influxdb_iox/pull/6108#discussion_r1019387151

@crepererum
Copy link
Contributor

One alternative would be to encode the "sorted by" property into the parquet file itself. Sure that's more effort, but I kinda think that it would be nicer for the ecosystem. This metadata would be optional and solely help optimization (although if specified, it must be correct). This is very similar to statistics.

@mingmwang
Copy link
Contributor

One question for SortPreservingMergeExec, does it alway require the inputs to be sorted ? What if the inputs are not sorted,
what kind of output stream the SortPreservingMergeExec can generate ?

@Dandandan
Copy link
Contributor

Is there a way we can generalize sort order over different formats?
I think sort order information should be quite generalizable over formats, CSV/JSON/Avro could be sorted as well.

@crepererum
Copy link
Contributor

Is there a way we can generalize sort order over different formats?

Depends on the source of this information, hence my comment. Should we place this information into the parquet file (similar to stats) or in the catalog (this is what IOx does and what would generalize to other storage formats as well). Both can make sense under different circumstances.

@alamb
Copy link
Contributor Author

alamb commented Nov 11, 2022

One alternative would be to encode the "sorted by" property into the parquet file itself. Sure that's more effort, but I kinda think that it would be nicer for the ecosystem. This metadata would be optional and solely help optimization (although if specified, it must be correct). This is very similar to statistics.

Yes, I agree this would be nice if there was some standard way to do so.

I poked around in the format definition and it seems like there is a standard way to encode the sort order in each Row Group's metadata:

There is a "SortingColumn" in the format
https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L685-L698

Which is then in the RowGroup metadata:
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L829-L832

However, I did not find any code to read/write this in the parquet crate
https://sourcegraph.com/search?q=context:global+repo:%5Egithub%5C.com/apache/arrow-rs%24+SortingColumn&patternType=standard

I will file some follow on tickets to properly support this in parquet and in datafusion.

@alamb
Copy link
Contributor Author

alamb commented Nov 11, 2022

One question for SortPreservingMergeExec, does it alway require the inputs to be sorted ? What if the inputs are not sorted,
what kind of output stream the SortPreservingMergeExec can generate ?

SortPreservingMergeExec is a fairly classic multi-column merge operator. This comment describes it well:

https://github.com/apache/arrow-datafusion/blob/5883e43db6c16d3ac3616302606849abbfbc86eb/datafusion/core/src/physical_plan/sorts/sort_preserving_merge.rs#L54-L80

If the inputs are not sorted the output will be incorrect (specifically some of the rows may be lost)

The SortPreservingMergeExec produces a sorted output stream without resorting its input

One usecase for this operator outside of IOx might be to implement UNION (which removes duplicate) of two subqueries that were already sorted.

@alamb
Copy link
Contributor Author

alamb commented Nov 11, 2022

Is there a way we can generalize sort order over different formats?
I think sort order information should be quite generalizable over formats, CSV/JSON/Avro could be sorted as well.

That is an excellent point -- since they all use the "ListingTable" API perhaps that is the most appropriate place to allow users to specify externally known sorting

@alamb
Copy link
Contributor Author

alamb commented Nov 11, 2022

My plan is to implement the "manual override" initially (both to unblock IOx work and allow users to provide other externally known sort information) and file tickets describing the "encode sort order in the parquet files" for follow on

@alamb
Copy link
Contributor Author

alamb commented Nov 11, 2022

So my current proposal is:

  1. Allow users to specify the sort order of their data in ListingTableOptions (e.g. Add ability to specify external sort information for ListingTables #4170)
  2. Filed tickets to track potentially automatically determining this information from parquet metadata: Expose SortingColumn when reading and writing parquet metadata arrow-rs#3090 Automatically detect and use "is the data sorted" information in parquet file metadata #4177

@alamb
Copy link
Contributor Author

alamb commented Nov 11, 2022

PR #4170 contain the proposed implementation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants