@suxiaogang223
Contributor

@suxiaogang223 suxiaogang223 commented Nov 1, 2025

Overview

This PR implements Phase 3: Row Index Integration from issue #58. It adds Row Group Index support for ORC files and enables efficient predicate pushdown functionality. Queries can now automatically skip irrelevant row groups based on row group-level statistics (min/max values, etc.), significantly reducing I/O and computational overhead.

This builds upon Phase 1 (Basic API) and Phase 2 (Efficient Skipping) that were implemented in previous PRs.

Key Features

1. Row Group Index Support

  • ✅ Implemented RowGroupIndex and StripeRowIndex data structures
  • ✅ Support for reading and parsing ROW_INDEX streams from stripes
  • ✅ Lazy loading: indexes are only read when needed, maintaining performance for default streaming reads
  • ✅ Access to row group-level statistics (min/max/null count, etc.)
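As a rough illustration of what a stripe-level index holds, the sketch below models per-row-group statistics keyed by column. The names `GroupStats` and `StripeIndexSketch`, their fields, and the integer-only statistics are assumptions made for this example; they are not the PR's actual `RowGroupIndex` or `StripeRowIndex` definitions.

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Hypothetical, simplified statistics for one row group of an integer column.
#[derive(Debug, Clone, PartialEq)]
pub struct GroupStats {
    pub min: Option<i64>,
    pub max: Option<i64>,
    pub has_null: bool,
}

/// Hypothetical stripe-level index: column id -> one `GroupStats` per row group.
#[derive(Debug)]
pub struct StripeIndexSketch {
    rows_per_group: usize,
    total_rows: usize,
    columns: HashMap<usize, Vec<GroupStats>>,
}

impl StripeIndexSketch {
    pub fn new(total_rows: usize, rows_per_group: usize) -> Self {
        assert!(rows_per_group > 0, "rows_per_group must be non-zero");
        Self { rows_per_group, total_rows, columns: HashMap::new() }
    }

    /// Registers the per-row-group statistics of one column.
    pub fn insert_column(&mut self, column: usize, groups: Vec<GroupStats>) {
        self.columns.insert(column, groups);
    }

    /// Number of row groups in the stripe (the last one may be partial).
    pub fn num_row_groups(&self) -> usize {
        (self.total_rows + self.rows_per_group - 1) / self.rows_per_group
    }

    /// Row range covered by `group`, or `None` if the group does not exist.
    pub fn row_range(&self, group: usize) -> Option<Range<usize>> {
        if group >= self.num_row_groups() {
            return None;
        }
        let start = group * self.rows_per_group;
        Some(start..(start + self.rows_per_group).min(self.total_rows))
    }

    /// Statistics for one column and row group, if present.
    pub fn stats(&self, column: usize, group: usize) -> Option<&GroupStats> {
        self.columns.get(&column)?.get(group)
    }
}
```

With a stripe of 25,000 rows and 10,000 rows per group (the ORC default row index stride), this sketch yields three row groups, the last covering rows 20,000 to 24,999.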

2. Predicate System

  • ✅ Complete Predicate type system supporting:
    • Comparison operations (==, !=, <, <=, >, >=)
    • NULL checks (IS NULL, IS NOT NULL)
    • Logical combinations (AND, OR, NOT)
  • ✅ Support for multiple data types: Integer, Float, String, Date, Timestamp, Decimal, Boolean
  • ✅ Convenient construction methods (Predicate::eq(), Predicate::gte(), etc.)
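One plausible shape for such a predicate type is a small expression tree with constructor helpers. The sketch below is only an illustration: `Pred`, `CmpOp`, `Value`, and the `columns()` helper are hypothetical names for this example and may differ from the `Predicate` and `PredicateValue` types the PR adds.

```rust
/// Hypothetical literal values; the PR lists more types (Date, Decimal, ...).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Int(i64),
    Str(String),
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum CmpOp {
    Eq,
    NotEq,
    Lt,
    LtEq,
    Gt,
    GtEq,
}

/// Hypothetical predicate tree: comparisons, NULL checks, logical combinators.
#[derive(Debug, Clone, PartialEq)]
pub enum Pred {
    Compare { column: String, op: CmpOp, value: Value },
    IsNull(String),
    IsNotNull(String),
    And(Vec<Pred>),
    Or(Vec<Pred>),
    Not(Box<Pred>),
}

impl Pred {
    pub fn gte(column: &str, value: Value) -> Self {
        Pred::Compare { column: column.to_string(), op: CmpOp::GtEq, value }
    }

    pub fn lt(column: &str, value: Value) -> Self {
        Pred::Compare { column: column.to_string(), op: CmpOp::Lt, value }
    }

    pub fn is_null(column: &str) -> Self {
        Pred::IsNull(column.to_string())
    }

    pub fn and(preds: Vec<Pred>) -> Self {
        Pred::And(preds)
    }

    pub fn not(pred: Pred) -> Self {
        Pred::Not(Box::new(pred))
    }

    /// Columns referenced by the predicate, in traversal order.
    /// A reader could use this to load indexes only for those columns.
    pub fn columns(&self) -> Vec<&str> {
        match self {
            Pred::Compare { column, .. } | Pred::IsNull(column) | Pred::IsNotNull(column) => {
                vec![column.as_str()]
            }
            Pred::And(ps) | Pred::Or(ps) => ps.iter().flat_map(|p| p.columns()).collect(),
            Pred::Not(p) => p.columns(),
        }
    }
}
```

Keeping the predicate as plain data lets a separate module evaluate it against statistics, so the type itself needs no knowledge of the ORC format.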

3. Row Group Filtering

  • ✅ Implemented evaluate_predicate() function that evaluates predicates based on statistics
  • ✅ Smart filtering logic:
    • If a row group's min/max range doesn't overlap with query conditions, skip the entire row group
    • If it might contain matching data, keep the row group and verify during decoding
  • ✅ Support for statistics comparison across all data types
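The overlap test described above can be sketched for a single integer column as follows. This is a minimal illustration, not the PR's `evaluate_predicate()`: the names `IntStats`, `Op`, `Filter`, and `may_match` are hypothetical, the statistics are assumed to cover at least one non-null value, and `NOT` is handled by conservatively keeping the row group.

```rust
/// Hypothetical min/max statistics of one row group for an integer column.
/// Assumes the row group contains at least one non-null value.
#[derive(Debug, Clone, Copy)]
pub struct IntStats {
    pub min: i64,
    pub max: i64,
    pub has_null: bool,
}

#[derive(Debug, Clone, Copy)]
pub enum Op {
    Eq,
    NotEq,
    Lt,
    LtEq,
    Gt,
    GtEq,
}

/// Hypothetical single-column filter expression.
pub enum Filter {
    Cmp(Op, i64),
    IsNull,
    And(Box<Filter>, Box<Filter>),
    Or(Box<Filter>, Box<Filter>),
    Not(Box<Filter>),
}

/// Returns `false` only when the statistics prove that no row can match.
/// `true` means "might match": the row group is kept and checked later.
pub fn may_match(filter: &Filter, stats: &IntStats) -> bool {
    match filter {
        Filter::Cmp(op, v) => match op {
            Op::Eq => stats.min <= *v && *v <= stats.max,
            // Only prunable when every non-null value equals `v`.
            Op::NotEq => !(stats.min == *v && stats.max == *v),
            Op::Lt => stats.min < *v,
            Op::LtEq => stats.min <= *v,
            Op::Gt => stats.max > *v,
            Op::GtEq => stats.max >= *v,
        },
        Filter::IsNull => stats.has_null,
        Filter::And(a, b) => may_match(a, stats) && may_match(b, stats),
        Filter::Or(a, b) => may_match(a, stats) || may_match(b, stats),
        // Negating "might match" is not sound, so keep the row group.
        Filter::Not(_) => true,
    }
}
```

For a row group with `min = 20` and `max = 65`, the filter `age < 18` returns `false` and the group can be skipped, while `age >= 18` returns `true` and the group is decoded.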

4. ArrowReader Integration

  • ✅ Added ArrowReaderBuilder::with_predicate() method
  • ✅ Automatically reads row indexes and evaluates predicates
  • ✅ Automatically generates RowSelection and applies it to data reading
  • ✅ Supports combination with manual RowSelection (logical AND)
  • ✅ Both synchronous and asynchronous versions supported
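Combining a predicate-derived selection with a manual one by logical AND amounts to intersecting two sets of row ranges. The sketch below shows that idea on plain `Range<usize>` lists; the `intersect` function is a hypothetical helper for illustration and assumes each input is sorted and non-overlapping. It is not the crate's `RowSelection` API.

```rust
use std::ops::Range;

/// Intersects two sorted, non-overlapping lists of row ranges.
/// A row is kept only if both selections keep it (logical AND).
pub fn intersect(a: &[Range<usize>], b: &[Range<usize>]) -> Vec<Range<usize>> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        let start = a[i].start.max(b[j].start);
        let end = a[i].end.min(b[j].end);
        if start < end {
            out.push(start..end);
        }
        // Advance whichever range finishes first.
        if a[i].end <= b[j].end {
            i += 1;
        } else {
            j += 1;
        }
    }
    out
}
```

For example, if the predicate keeps rows `0..10000` and `20000..30000` and the manual selection keeps `5000..25000`, the combined selection is `5000..10000` and `20000..25000`.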

Usage Example

use std::fs::File;

use orc_rust::{ArrowReaderBuilder, Predicate, PredicateValue};

// Build the predicate `age >= 18`
let predicate = Predicate::gte("age", PredicateValue::Int32(Some(18)));

// Create a reader that applies the predicate to row group statistics
let file = File::open("data.orc")?;
let reader = ArrowReaderBuilder::try_new(file)?
    .with_predicate(predicate)
    .build();

// Read data; row groups whose statistics rule out a match are skipped
for batch in reader {
    let batch = batch?;
    // Process data...
}

Technical Implementation

Data Flow

Query Predicate
  ↓
ArrowReaderBuilder::with_predicate()
  ↓
Stripe::read_row_indexes()  ← Read ROW_INDEX stream
  ↓
parse_stripe_row_indexes()  ← Parse protobuf
  ↓
evaluate_predicate()  ← Evaluate predicate
  ↓
RowSelection::from_row_group_filter()  ← Generate row selection
  ↓
Decode only selected row groups
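The last two steps of the flow above turn a per-row-group keep/skip decision into row ranges to decode. The sketch below illustrates that conversion under stated assumptions: `ranges_from_filter` is a hypothetical stand-in for what `RowSelection::from_row_group_filter()` might do, and it returns plain ranges rather than the crate's `RowSelection` type.

```rust
use std::ops::Range;

/// Converts per-row-group keep flags into merged row ranges.
/// `keep[g]` says whether row group `g` might contain matching rows.
pub fn ranges_from_filter(
    keep: &[bool],
    rows_per_group: usize,
    total_rows: usize,
) -> Vec<Range<usize>> {
    let mut out: Vec<Range<usize>> = Vec::new();
    for (group, &kept) in keep.iter().enumerate() {
        if !kept {
            continue;
        }
        let start = group * rows_per_group;
        if start >= total_rows {
            break;
        }
        // The last row group of a stripe may be shorter than the others.
        let end = (start + rows_per_group).min(total_rows);
        // Merge with the previous range when the groups are adjacent.
        if let Some(last) = out.last_mut() {
            if last.end == start {
                last.end = end;
                continue;
            }
        }
        out.push(start..end);
    }
    out
}
```

With flags `[true, true, false, true]`, 10,000 rows per group, and 35,000 rows in total, the result is `0..20000` and `30000..35000`: the first two groups merge into one range and the third is skipped.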

New Files

  • src/row_index.rs - Row index data structures and parsing
  • src/predicate.rs - Predicate definitions
  • src/row_group_filter.rs - Predicate evaluation and row group filtering

Modified Files

  • src/arrow_reader.rs - Integrated predicate pushdown support
  • src/async_arrow_reader.rs - Async version support
  • src/stripe.rs - Added read_row_indexes() method
  • src/lib.rs - Export new types

Limitations and Future Work

Current Limitations

  • Only supports filtering based on statistics (min/max)
  • Limited benefit for equality queries: min/max statistics can only rule out a row group when the value lies outside its range, and Bloom filters are not yet used

Future Plans

  • Bloom Filter Support: Implement Bloom Filter Index to further improve equality query performance (planned for future PR)
  • Index Caching: Cache parsed indexes to avoid repeated parsing
  • Position Information Utilization: Use position information in indexes for precise seeking

@suxiaogang223 suxiaogang223 marked this pull request as draft November 1, 2025 15:41
@suxiaogang223 suxiaogang223 marked this pull request as ready for review November 23, 2025 08:37
@codecov-commenter

Codecov Report

❌ Patch coverage is 39.20118% with 411 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@7ea0ce7). Learn more about missing BASE report.

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #64   +/-   ##
=======================================
  Coverage        ?   81.03%           
=======================================
  Files           ?       46           
  Lines           ?     7144           
  Branches        ?        0           
=======================================
  Hits            ?     5789           
  Misses          ?     1355           
  Partials        ?        0           

@suxiaogang223 suxiaogang223 changed the title from "draft: Impl row index" to "feat: Implement Row Group Index support with predicate pushdown" Nov 23, 2025
@suxiaogang223 suxiaogang223 changed the title from "feat: Implement Row Group Index support with predicate pushdown" to "feat: Implement Row Group Index support with predicate pushdown (Phase 3)" Nov 23, 2025
@suxiaogang223
Contributor Author

suxiaogang223 commented Nov 23, 2025

I will add more tests to improve code coverage.

@WenyXu WenyXu self-requested a review November 24, 2025 12:11
Collaborator

@WenyXu WenyXu left a comment

Nice work!

@WenyXu WenyXu merged commit 8ffebbe into datafusion-contrib:main Nov 30, 2025
12 checks passed