Blog post about efficient filter representation in Parquet filter pushdown #8843

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The TLDR is that I want to 1) advance the state of understanding of how late materialization / filter pushdown works, and 2) tell the world how great the Rust implementation is (and implicitly explain what other types of optimizations it unlocks).

I think there is significant room to help industrial practitioners by explaining the challenges that go into implementing late materialization "for real" in an industrial-strength Parquet reader.

Background

The techniques for implementing "late materialization" in column stores are well understood and were first explained well in 2006/2007:

The current Rust Parquet reader supports late materialization (basically the "EM Pipelined" strategy in this diagram from Materialization Strategies in a Column-Oriented DBMS).

[Image: diagram of materialization strategies from the paper]

The API for evaluating predicates during the scan is ArrowReaderBuilder::with_row_filter. See details on the RowFilter API.
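
To make the idea concrete, here is a minimal sketch of what evaluating a predicate during the scan buys you. It uses only the standard library and does not call the parquet crate: `Column`, `evaluate_predicate`, and `late_materialize` are hypothetical names for illustration, and a plain `Vec<i64>` stands in for an encoded column chunk whose decoding is the expensive part.

```rust
/// Hypothetical stand-in for an encoded column chunk. In a real reader,
/// decoding values is the expensive step that late materialization skips.
struct Column {
    values: Vec<i64>,
}

impl Column {
    /// Decode every row (what a scan without filter pushdown would do).
    fn decode_all(&self) -> Vec<i64> {
        self.values.clone()
    }

    /// Decode only the rows whose mask entry is true.
    fn decode_selected(&self, mask: &[bool]) -> Vec<i64> {
        self.values
            .iter()
            .zip(mask)
            .filter(|(_, keep)| **keep)
            .map(|(v, _)| *v)
            .collect()
    }
}

/// Step 1: decode only the filter column and evaluate the predicate,
/// producing one bool per row.
fn evaluate_predicate(filter_col: &Column, predicate: impl Fn(i64) -> bool) -> Vec<bool> {
    filter_col.decode_all().into_iter().map(predicate).collect()
}

/// Step 2: materialize the projected column only for the surviving rows.
fn late_materialize(
    filter_col: &Column,
    projected: &Column,
    predicate: impl Fn(i64) -> bool,
) -> Vec<i64> {
    let mask = evaluate_predicate(filter_col, predicate);
    projected.decode_selected(&mask)
}
```

With a filter column of `[1, 5, 10, 20]` and a projected column of `[100, 200, 300, 400]`, the predicate `v > 4 && v < 20` keeps rows 1 and 2, so only those two projected values are ever decoded. How the intermediate `mask` is represented is exactly the tradeoff this issue is about.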

@XiangpengHao also gives a good background treatment in the context of adding a predicate cache (to avoid the overhead of decompressing pages twice): https://datafusion.apache.org/blog/2025/03/21/parquet-pushdown/

However, it has taken us several years (and we are still not quite there) to get to the point that we can turn on late materialization "for real" due to various engineering challenges (e.g. decompression speed).

Interestingly, there is a similar discussion on filter representation in Predicate Caching: Query-Driven Secondary Indexing for Cloud Data Warehouses, where the two representations are referred to as 4.1.1 Range Index and 4.1.2 Bitmap Index.
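
As a rough illustration of the two general shapes of filter representation (runs of rows versus one flag per row), here is a small standard-library sketch. `Mask`, `Run`, `mask_to_runs`, and `runs_to_mask` are hypothetical names, not the parquet crate's types, and the correspondence to the paper's range and bitmap indexes is only loose.

```rust
/// Bitmap-style filter: one flag per row.
type Mask = Vec<bool>;

/// Range-style filter: consecutive runs of rows to select or skip.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Run {
    select: bool,
    len: usize,
}

/// Collapse a per-row mask into runs of equal flags.
fn mask_to_runs(mask: &[bool]) -> Vec<Run> {
    let mut runs: Vec<Run> = Vec::new();
    for &bit in mask {
        // Extend the current run if the flag is unchanged.
        if let Some(run) = runs.last_mut() {
            if run.select == bit {
                run.len += 1;
                continue;
            }
        }
        // Otherwise start a new run of length one.
        runs.push(Run { select: bit, len: 1 });
    }
    runs
}

/// Expand runs back into a per-row mask.
fn runs_to_mask(runs: &[Run]) -> Mask {
    let mut mask = Vec::new();
    for run in runs {
        mask.extend(std::iter::repeat(run.select).take(run.len));
    }
    mask
}
```

The sketch shows why neither form wins everywhere: a clustered selection such as skip 2, select 3, skip 1 needs only three runs however long each run gets, while a selection that alternates row by row produces one run per row, at which point the per-row mask is the more compact and cheaper form to apply.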

Describe the solution you'd like

I would like to write a blog post that highlights the tradeoffs in filter representation and how we worked to improve it.

Describe alternatives you've considered

Additional context

Metadata

Labels

enhancement: Any new improvement worthy of an entry in the changelog