
Parquet: Add scan-level cache for adapted pruning setup (projection/predicate) to reuse CPU-only work across same-schema files #21566

Open
kosiew wants to merge 5 commits into apache:main from kosiew:schema_caching-01-21554

Conversation

Contributor

@kosiew kosiew commented Apr 12, 2026

Which issue does this PR close?


Rationale for this change

Parquet scans currently adapt and simplify projection and predicate expressions per file, even when multiple files share the same physical schema and query inputs. This results in repeated CPU work (expression rewriting, simplification, and pruning predicate construction) that is identical across files.

This PR introduces a scan-local cache to reuse this CPU-only setup when safe, reducing redundant computation and improving performance for datasets with many files but few schema variations.


What changes are included in this PR?

  • Introduced ParquetPruningSetupCache owned by ParquetMorselizer

    • Stores adapted projection, predicate, and row-group pruning predicate
    • Uses a Mutex + Condvar to coordinate concurrent access and avoid duplicate work
  • Added ParquetPruningSetupCacheKey

    • Includes:

      • Logical file schema
      • Physical file schema
      • Predicate identity (pointer-based)
      • Projection expression identities
    • Ensures reuse only when inputs are equivalent within a scan

  • Added ParquetPruningSetup and cache entry state machine

    • States: Pending, Ready, Failed
    • Prevents duplicate computation and propagates errors safely
  • Refactored pruning setup logic

    • Extracted into build_pruning_setup
    • Added build_or_get_pruning_setup to handle cache lookup/fallback
    • Preserves existing behavior on cache miss
  • Integrated cache usage into MetadataLoadedParquetOpen

    • Replaces per-file rewrite + pruning predicate construction with cached version when applicable
  • Added supports_reusable_rewrites to PhysicalExprAdapterFactory

    • Defaults to false
    • Enabled for DefaultPhysicalExprAdapterFactory
    • Ensures only cache-safe adapters participate in reuse
  • Added pruning_setup_reusable guard

    • Disables reuse when literal column replacement occurs
  • Wired cache through ParquetMorselizer and ParquetSource


Are these changes tested?

Yes. New tests validate correctness and cache boundaries:

  • Reuse occurs for files with the same physical schema and reusable adapter
  • No reuse when adapter does not support reusable rewrites
  • No reuse across different physical schemas

Additionally:

  • A custom counting adapter factory verifies that rewrite creation is invoked only once in cache-hit scenarios
  • Existing pruning behavior and correctness are preserved

Are there any user-facing changes?

No user-facing API changes.

This is an internal performance optimization. Query results and behavior remain unchanged.


LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.


Notes

  • Cache is scoped to a single scan via ParquetMorselizer
  • Failure entries are removed to avoid poisoning the cache
  • Page-level pruning is intentionally excluded and handled separately (future work)

Performance Impact

Expected improvements for workloads with:

  • Many files
  • Few distinct physical schemas
  • Non-trivial predicates/projections

Reduces repeated expression rewriting and pruning predicate construction overhead.

kosiew added 4 commits April 12, 2026 16:50
Add a scan-local ParquetPruningSetupCache in opener.rs to reuse adapted projection, adapted predicate, and row-group pruning predicate setup for same-schema files. Implement PhysicalExprAdapterFactory::supports_reusable_rewrites() in schema_rewriter.rs so that default adapters opt in while custom ones default to opting out. Connect the cache to ParquetMorselizer in source.rs and introduce a regression test to ensure the setup is reused correctly across same-schema files while keeping it conservative to avoid per-file literal replacements.

Simplify the pruning setup cache in opener.rs by removing the wrapper entry struct, deriving PartialEq/Eq, and returning cloned setup values instead of an extra Arc. Extract the cache-or-build branch into build_or_get_pruning_setup and avoid repeated literal_columns.is_empty() checks. Move returned pruning setup fields directly into prepared and simplify the test-only counting adapter and cache regression test loop. Tighten the supports_reusable_rewrites doc comment in schema_rewriter.rs without changing the public interface.

Refactor the cache in opener.rs to use a HashMap, computing cold misses outside the mutex for better performance, with a re-check on insert. Add cache-boundary tests for non-reusable adapters and differing physical schemas. Clarify the documentation of supports_reusable_rewrites() in schema_rewriter.rs to specify the same logical/physical schema rewrite inputs.

Enhance opener.rs with cache-key rationale comments for clarity.

Update ParquetPruningSetupCache to handle concurrent cold misses by implementing a per-entry pending/ready state. Clarify the public contract of supports_reusable_rewrites() in schema_rewriter.rs.
@github-actions github-actions Bot added the datasource Changes to the datasource crate label Apr 12, 2026
@kosiew kosiew marked this pull request as ready for review April 12, 2026 09:44
Contributor

AdamGS commented Apr 13, 2026

This is really awesome! I was planning on trying to add something like that once the morsel work is more mature, but I was wondering if it'll be possible to make it more format-independent? I ran into a similar issue on our FileSource, which made me file #20078 (which I expect this PR to improve significantly if not just close outright).

Contributor Author

kosiew commented Apr 14, 2026

@AdamGS

Thanks! I agree this is the direction we should take. This PR keeps the cache Parquet-local on purpose because the reusable setup currently stores Parquet-specific artifacts: the adapted Parquet projection/predicate, the physical schema after Parquet file-schema coercions / INT96 handling, and the row-group PruningPredicate. It also leaves page-index work, reader metadata, file metrics, and access-plan execution per file.

That said, the shape is a good stepping stone toward the more format-independent problem in #20078. The parts that seem general are:

  • scan-local reuse keyed by logical schema, physical schema, projection, predicate, and adapter cache-safety
  • avoiding repeated PhysicalExprAdapterFactory::create / rewrite / simplification work for files with equivalent schema inputs
  • letting custom adapters opt in only when their rewrites do not depend on factory-local or unkeyed per-file state

I would prefer to land this narrowly for Parquet first, with the cache-safety contract and tests in place, then follow up by extracting the format-neutral expression adaptation / pruning setup cache into a datasource-level helper once another FileSource can exercise it. Vortex or another custom FileSource would be a good second consumer to make sure the abstraction is not overfit to Parquet row-group pruning.

Contributor

AdamGS commented Apr 14, 2026

That makes perfect sense, would be happy to help with anything here! This is really awesome stuff
