Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
- part of [Epic] Parquet Reader Improvement Plan / Proposal - July 2025 #8000
- Related to [Parquet] Adaptive Parquet Predicate Pushdown #8733 from @hhhizzz
- Related to Adaptive Parquet Predicate Pushdown Evaluation #5523
TLDR: I want to 1) advance the state of understanding of how late materialization / filter pushdown works, and 2) show how good the Rust implementation is (and implicitly explain what other types of optimizations it unlocks).
I think there is significant room to help industrial practitioners by explaining the challenges that go into implementing late materialization "for real" in an industrial-strength Parquet reader.
Background
The techniques for implementing "late materialization" in column stores are well understood and were first explained well in 2006/2007:
- Materialization Strategies in a Column-Oriented DBMS
- Column-Stores vs. Row-Stores: How Different Are They Really?
The current Rust Parquet reader supports late materialization (basically the "EM Pipelined" strategy described in Materialization Strategies in a Column-Oriented DBMS).
The API for evaluating predicates during the scan is ArrowReaderBuilder::with_row_filter. See the RowFilter API documentation for details.
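For concreteness, here is a minimal sketch (not from this issue) of how a predicate is pushed into the scan with the `parquet` and `arrow` crates. The file name `data.parquet`, the assumption that leaf column 0 is an Int64 column, and the `> 10` predicate are all hypothetical:

```rust
use std::fs::File;

use arrow::array::cast::AsArray;
use arrow::array::types::Int64Type;
use arrow::array::BooleanArray;
use parquet::arrow::arrow_reader::{ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter};
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical file, assumed to have an Int64 column at leaf index 0.
    let file = File::open("data.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // The predicate only needs the first leaf column, so only that column
    // is decoded while the filter is evaluated.
    let predicate_mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);
    let predicate = ArrowPredicateFn::new(predicate_mask, |batch| {
        // `batch` contains only the columns selected by `predicate_mask`.
        let col = batch.column(0).as_primitive::<Int64Type>();
        // One boolean per row: true if the row passes the filter.
        Ok(BooleanArray::from_iter(
            col.iter().map(|v| v.map(|v| v > 10)),
        ))
    });

    // Rows for which the predicate is false are skipped before the
    // remaining columns are decoded (late materialization).
    let reader = builder
        .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
        .build()?;

    for batch in reader {
        println!("filtered batch with {} rows", batch?.num_rows());
    }
    Ok(())
}
```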
@XiangpengHao also gives a good background treatment in the context of adding a predicate cache (to avoid the overhead of decompressing pages twice): https://datafusion.apache.org/blog/2025/03/21/parquet-pushdown/
However, it has taken us several years (and we are still not quite there) to get to the point where we can turn late materialization on "for real", due to various engineering challenges (such as decompression speed).
Interestingly, there is a similar discussion of filter representation in Predicate Caching: Query-Driven Secondary Indexing for Cloud Data Warehouses -- see sections 4.1.1 Range Index and 4.1.2 Bitmap Index.
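To illustrate the representation tradeoff, here is a small sketch (again, not from the issue) contrasting a range/run-length style selection with a per-row bitmap, using types from the Rust parquet crate; it assumes `RowSelection::from_filters` is available in `parquet::arrow::arrow_reader`, as in recent crate versions:

```rust
use arrow::array::BooleanArray;
use parquet::arrow::arrow_reader::{RowSelection, RowSelector};

fn main() {
    // "Range" style: runs of consecutive rows to skip or select.
    // Here: skip rows 0..2, select rows 2..5, skip rows 5..8.
    let ranges = RowSelection::from(vec![
        RowSelector::skip(2),
        RowSelector::select(3),
        RowSelector::skip(3),
    ]);

    // "Bitmap" style: one boolean per row, true = row passes the filter.
    let bitmap = BooleanArray::from(vec![
        false, false, true, true, true, false, false, false,
    ]);

    // The reader converts bitmap-style filter results into run-length
    // selections; both forms describe the same set of rows (2, 3, 4).
    let from_bitmap = RowSelection::from_filters(&[bitmap]);

    println!("range form : {ranges:?}");
    println!("bitmap form: {from_bitmap:?}");
}
```

Ranges are compact when selected rows are clustered, while bitmaps cost a fixed bit per row but handle scattered selections without fragmenting into many tiny runs.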
Describe the solution you'd like
I would like to write a blog post that highlights the tradeoffs in filter representation and how we worked to improve it.
Describe alternatives you've considered
Additional context