feat: Implement Row Group Index support with predicate pushdown (Phase 3) #64
Codecov Report
❌ Patch coverage is
@@ Coverage Diff @@
## main #64 +/- ##
=======================================
Coverage ? 81.03%
=======================================
Files ? 46
Lines ? 7144
Branches ? 0
=======================================
Hits ? 5789
Misses ? 1355
Partials ? 0
I will add more tests to improve code coverage.
WenyXu
left a comment
Nice work!
Overview
This PR implements Phase 3: Row Index Integration from issue #58. It adds Row Group Index support for ORC files and enables efficient predicate pushdown functionality. Queries can now automatically skip irrelevant row groups based on row group-level statistics (min/max values, etc.), significantly reducing I/O and computational overhead.
This builds upon Phase 1 (Basic API) and Phase 2 (Efficient Skipping) that were implemented in previous PRs.
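To illustrate the idea of skipping row groups from min/max statistics, here is a minimal, self-contained sketch. It is not the code from this PR: `IntStats`, `IntPredicate`, `may_match`, and `select_row_groups` are hypothetical names, and only a single integer column is assumed.

```rust
/// Hypothetical per-row-group statistics for one integer column.
#[derive(Debug, Clone, Copy)]
struct IntStats {
    min: i64,
    max: i64,
}

/// Hypothetical predicate on that column (not the PR's `Predicate` type).
#[derive(Debug, Clone, Copy)]
enum IntPredicate {
    Eq(i64),
    Gte(i64),
    Lt(i64),
}

/// Returns `false` only when the statistics prove that no row in the group
/// can satisfy the predicate; `true` means the group still has to be read.
fn may_match(stats: IntStats, pred: IntPredicate) -> bool {
    match pred {
        IntPredicate::Eq(v) => stats.min <= v && v <= stats.max,
        IntPredicate::Gte(v) => stats.max >= v,
        IntPredicate::Lt(v) => stats.min < v,
    }
}

/// Indices of the row groups that cannot be skipped.
fn select_row_groups(groups: &[IntStats], pred: IntPredicate) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        .filter(|(_, s)| may_match(**s, pred))
        .map(|(i, _)| i)
        .collect()
}
```

Note that the check is conservative: a `true` result does not guarantee a match, so the predicate still has to be applied to the rows that are actually read.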
Key Features
1. Row Group Index Support
- RowGroupIndex and StripeRowIndex data structures
- ROW_INDEX streams from stripes
2. Predicate System
- Predicate type system supporting: Predicate::eq(), Predicate::gte(), etc.
3. Row Group Filtering
- evaluate_predicate() function that evaluates predicates based on statistics
4. ArrowReader Integration
- ArrowReaderBuilder::with_predicate() method
- RowSelection and applies it to data reading
- RowSelection (logical AND)
Usage Example
Technical Implementation
Data Flow
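As a rough sketch of the later stages of such a flow, the code below turns per-row-group keep/skip decisions into a run-length row selection and intersects it with a second selection (logical AND). Everything here is an assumption made for illustration: `Run`, `push_run`, `selection_from_row_groups`, and `and_selections` are hypothetical and do not reflect the crate's actual `RowSelection` API, and a fixed number of rows per row group is assumed.

```rust
/// Hypothetical run-length selection entry: `rows` consecutive rows that are
/// either skipped or read. Not the crate's `RowSelection` type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Run {
    skip: bool,
    rows: usize,
}

/// Append a run, merging it into the previous run when both have the same kind.
fn push_run(out: &mut Vec<Run>, skip: bool, rows: usize) {
    if rows == 0 {
        return;
    }
    if let Some(last) = out.last_mut() {
        if last.skip == skip {
            last.rows += rows;
            return;
        }
    }
    out.push(Run { skip, rows });
}

/// Build a selection from per-row-group keep flags. The last row group may be
/// shorter than `rows_per_group`, so each group is clamped to the rows left.
fn selection_from_row_groups(keep: &[bool], rows_per_group: usize, total_rows: usize) -> Vec<Run> {
    let mut out = Vec::new();
    let mut remaining = total_rows;
    for &k in keep {
        let rows = rows_per_group.min(remaining);
        remaining -= rows;
        push_run(&mut out, !k, rows);
    }
    out
}

/// Logical AND of two selections: a row is read only if both selections read it.
/// Stops at the end of the shorter selection.
fn and_selections(a: &[Run], b: &[Run]) -> Vec<Run> {
    let mut out = Vec::new();
    let (mut i, mut j) = (0usize, 0usize);
    let mut left_a = a.first().map_or(0, |r| r.rows);
    let mut left_b = b.first().map_or(0, |r| r.rows);
    while i < a.len() && j < b.len() {
        // Consume the overlap of the two current runs.
        let n = left_a.min(left_b);
        push_run(&mut out, a[i].skip || b[j].skip, n);
        left_a -= n;
        left_b -= n;
        if left_a == 0 {
            i += 1;
            left_a = a.get(i).map_or(0, |r| r.rows);
        }
        if left_b == 0 {
            j += 1;
            left_b = b.get(j).map_or(0, |r| r.rows);
        }
    }
    out
}
```

Keeping the selection in run-length form rather than as a per-row bitmap lets a reader skip whole ranges in one step, which is the point of pruning at row-group granularity.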
New Files
- src/row_index.rs - Row index data structures and parsing
- src/predicate.rs - Predicate definitions
- src/row_group_filter.rs - Predicate evaluation and row group filtering
Modified Files
- src/arrow_reader.rs - Integrated predicate pushdown support
- src/async_arrow_reader.rs - Async version support
- src/stripe.rs - Added read_row_indexes() method
- src/lib.rs - Export new types
Limitations and Future Work
Current Limitations
Future Plans