Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
- part of [Epic] Parquet Reader Improvement Plan / Proposal - July 2025 #8000
- Related to [Parquet] Adaptive Parquet Predicate Pushdown #8733 from @hhhizzz
- Related to Adaptive Parquet Predicate Pushdown Evaluation #5523
TLDR: I want to 1) advance the state of understanding of how late materialization / filter pushdown works, and 2) show how good the Rust implementation is (and implicitly explain what other types of optimizations it unlocks).
I think there is significant room to help industrial practitioners by explaining the challenges that go into implementing late materialization "for real" in an industrial-strength Parquet reader.
Background
The techniques for implementing "late materialization" in column stores are well understood and were first explained well in 2006/2007:
- Materialization Strategies in a Column-Oriented DBMS
- Column-Stores vs. Row-Stores: How Different Are They Really?
The current Rust Parquet reader supports late materialization (basically the "EM Pipelined" strategy described in Materialization Strategies in a Column-Oriented DBMS).
The API for evaluating predicates during the scan is ArrowReaderBuilder::with_row_filter. See the RowFilter API documentation for details.
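For concreteness, here is a minimal sketch (not from this issue) of how a predicate is pushed into the scan with the `parquet` and `arrow` crates. The file name `data.parquet`, the assumption that leaf column 0 is an Int64 column, and the `> 10` predicate are all hypothetical:

```rust
use std::fs::File;

use arrow::array::cast::AsArray;
use arrow::array::types::Int64Type;
use arrow::array::BooleanArray;
use parquet::arrow::arrow_reader::{ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter};
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical file, assumed to have an Int64 column at leaf index 0.
    let file = File::open("data.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // The predicate only needs the first leaf column, so only that column
    // is decoded while the filter is evaluated.
    let predicate_mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);
    let predicate = ArrowPredicateFn::new(predicate_mask, |batch| {
        // `batch` contains only the columns selected by `predicate_mask`.
        let col = batch.column(0).as_primitive::<Int64Type>();
        // One boolean per row: true if the row passes the filter.
        Ok(BooleanArray::from_iter(
            col.iter().map(|v| v.map(|v| v > 10)),
        ))
    });

    // Rows for which the predicate is false are skipped before the
    // remaining columns are decoded (late materialization).
    let reader = builder
        .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
        .build()?;

    for batch in reader {
        println!("filtered batch with {} rows", batch?.num_rows());
    }
    Ok(())
}
```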
@XiangpengHao also gives a good background treatment in the context of adding a predicate cache (to avoid the overhead of decompressing pages twice): https://datafusion.apache.org/blog/2025/03/21/parquet-pushdown/
However, it has taken us several years (and we are still not quite there) to get to the point where we can turn late materialization on "for real", due to various engineering challenges (such as decompression speed).
Interestingly, there is a similar discussion of filter representation in Predicate Caching: Query-Driven Secondary Indexing for Cloud Data Warehouses -- see sections 4.1.1 Range Index and 4.1.2 Bitmap Index.
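To illustrate the representation tradeoff, here is a small sketch (again, not from the issue) contrasting a range/run-length style selection with a per-row bitmap, using types from the Rust parquet crate; it assumes `RowSelection::from_filters` is available in `parquet::arrow::arrow_reader`, as in recent crate versions:

```rust
use arrow::array::BooleanArray;
use parquet::arrow::arrow_reader::{RowSelection, RowSelector};

fn main() {
    // "Range" style: runs of consecutive rows to skip or select.
    // Here: skip rows 0..2, select rows 2..5, skip rows 5..8.
    let ranges = RowSelection::from(vec![
        RowSelector::skip(2),
        RowSelector::select(3),
        RowSelector::skip(3),
    ]);

    // "Bitmap" style: one boolean per row, true = row passes the filter.
    let bitmap = BooleanArray::from(vec![
        false, false, true, true, true, false, false, false,
    ]);

    // The reader converts bitmap-style filter results into run-length
    // selections; both forms describe the same set of rows (2, 3, 4).
    let from_bitmap = RowSelection::from_filters(&[bitmap]);

    println!("range form : {ranges:?}");
    println!("bitmap form: {from_bitmap:?}");
}
```

Ranges are compact when selected rows are clustered, while bitmaps cost a fixed bit per row but handle scattered selections without fragmenting into many tiny runs.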
Describe the solution you'd like
I would like to write a blog post that highlights the tradeoffs in filter representation and how we worked to improve it.
Describe alternatives you've considered
Additional context