Reusable "row group pruning" logic #363

Closed
alamb opened this issue May 19, 2021 · 0 comments · Fixed by #426
Labels
datafusion Changes in the datafusion crate enhancement New feature or request

Comments

Contributor

alamb commented May 19, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

DataFusion contains logic (originally contributed by @yordan-pavlov in apache/arrow#9064 🎉) to perform Row Group Pruning, which skips scanning entire row groups within a Parquet file based on pushed-down predicates (source in arrow-datafusion: parquet.rs).

The algorithm behind the Row Group Pruning implementation is general: it can be applied to any storage system that maintains min/max statistics for sets of files or chunks of data and wants to quickly rule out chunks that cannot match a predicate.
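To make the idea concrete, here is a minimal sketch of min/max pruning over a single integer column. The `MinMax`, `Predicate`, `might_match`, and `prune` names are made up for illustration and are not DataFusion's API; the actual implementation works on arbitrary expressions and Arrow arrays.

```rust
/// Min/max statistics for one column in one chunk of data
/// (a row group, a file, a partition, ...). Hypothetical type.
#[derive(Debug, Clone, Copy)]
pub struct MinMax {
    pub min: i64,
    pub max: i64,
}

/// A tiny predicate language over a single integer column (illustrative only).
#[derive(Debug, Clone, Copy)]
pub enum Predicate {
    Eq(i64),
    Gt(i64),
    Lt(i64),
}

/// Returns true if the chunk *might* contain matching rows and must be scanned,
/// false if the statistics prove that no row can match.
pub fn might_match(stats: MinMax, pred: Predicate) -> bool {
    match pred {
        // x = v can only match if min <= v <= max
        Predicate::Eq(v) => stats.min <= v && v <= stats.max,
        // x > v can only match if the largest value exceeds v
        Predicate::Gt(v) => stats.max > v,
        // x < v can only match if the smallest value is below v
        Predicate::Lt(v) => stats.min < v,
    }
}

/// Returns the indices of the chunks that survive pruning.
pub fn prune(chunks: &[MinMax], pred: Predicate) -> Vec<usize> {
    chunks
        .iter()
        .enumerate()
        .filter(|(_, s)| might_match(**s, pred))
        .map(|(i, _)| i)
        .collect()
}
```

Note that pruning is conservative: a chunk that survives may still contain no matching rows, so the predicate still has to be evaluated on the rows that are actually scanned.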

We would like to reuse the row group pruning logic from DataFusion rather than writing our own. To that end, we want to make this logic easier to reuse, both by other parts of DataFusion (e.g. pruning whole Parquet files rather than just row groups) and by downstream projects. We also hope to benefit ourselves as the community works to improve this code.

In addition, there are other use cases, such as the one mentioned by @returnString, where you have a bunch of Parquet files in some object store along with min/max statistics for them, and you could skip entire files based on those statistics alone.

Describe the solution you'd like

  1. Refactor what is currently called RowGroupPredicateBuilder into something more generic related to pruning
  2. Rework the implementation so it is generic over a Statistics trait, so that the predicates can be evaluated against any source of statistics (not just the Parquet RowGroupMetadata)
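As a rough sketch of what step 2 might look like, the pruning code could depend only on a small trait rather than on Parquet metadata. Everything below (`ChunkStatistics`, `GreaterThan`, `FileStats`) is a hypothetical name chosen for illustration, not an existing DataFusion type, and it handles only `i64` columns to stay short.

```rust
use std::collections::HashMap;

/// Hypothetical trait: anything that can report per-column min/max values.
/// `None` means no statistics are available for that column.
pub trait ChunkStatistics {
    fn min_value(&self, column: &str) -> Option<i64>;
    fn max_value(&self, column: &str) -> Option<i64>;
}

/// Illustrative predicate: `column > value`.
pub struct GreaterThan {
    pub column: String,
    pub value: i64,
}

impl GreaterThan {
    /// Returns false only when the statistics prove that no row can match.
    /// Missing statistics are treated conservatively (the chunk is kept).
    pub fn might_match<S: ChunkStatistics>(&self, stats: &S) -> bool {
        match stats.max_value(&self.column) {
            Some(max) => max > self.value,
            None => true,
        }
    }
}

/// One possible implementor: statistics kept in a catalog for a whole file,
/// stored as column name -> (min, max).
pub struct FileStats {
    pub columns: HashMap<String, (i64, i64)>,
}

impl ChunkStatistics for FileStats {
    fn min_value(&self, column: &str) -> Option<i64> {
        self.columns.get(column).map(|(min, _)| *min)
    }

    fn max_value(&self, column: &str) -> Option<i64> {
        self.columns.get(column).map(|(_, max)| *max)
    }
}
```

With this shape, a Parquet row group, a whole file, or a chunk in some other storage system would each only need its own implementation of the trait, and the predicate logic would stay unchanged.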

Additional context

You can see more about the use case on the IOx ticket https://github.com/influxdata/influxdb_iox/issues/736 and design document
