
Parquet: Add scan-level cache for adapted pruning setup (projection/predicate) to reuse CPU-only work across same-schema files #21566

Open
kosiew wants to merge 5 commits into apache:main from kosiew:schema_caching-01-21554

Conversation

Contributor

@kosiew kosiew commented Apr 12, 2026

Which issue does this PR close?


Rationale for this change

Parquet scans currently adapt and simplify projection and predicate expressions per file, even when multiple files share the same physical schema and query inputs. This results in repeated CPU work (expression rewriting, simplification, and pruning predicate construction) that is identical across files.

This PR introduces a scan-local cache to reuse this CPU-only setup when safe, reducing redundant computation and improving performance for datasets with many files but few schema variations.


What changes are included in this PR?

  • Introduced ParquetPruningSetupCache owned by ParquetMorselizer

    • Stores adapted projection, predicate, and row-group pruning predicate
    • Uses a Mutex + Condvar to coordinate concurrent access and avoid duplicate work
  • Added ParquetPruningSetupCacheKey

    • Includes:

      • Logical file schema
      • Physical file schema
      • Predicate identity (pointer-based)
      • Projection expression identities
    • Ensures reuse only when inputs are equivalent within a scan

  • Added ParquetPruningSetup and cache entry state machine

    • States: Pending, Ready, Failed
    • Prevents duplicate computation and propagates errors safely
  • Refactored pruning setup logic

    • Extracted into build_pruning_setup
    • Added build_or_get_pruning_setup to handle cache lookup/fallback
    • Preserves existing behavior on cache miss
  • Integrated cache usage into MetadataLoadedParquetOpen

    • Replaces per-file rewrite + pruning predicate construction with cached version when applicable
  • Added supports_reusable_rewrites to PhysicalExprAdapterFactory

    • Defaults to false
    • Enabled for DefaultPhysicalExprAdapterFactory
    • Ensures only cache-safe adapters participate in reuse
  • Added pruning_setup_reusable guard

    • Disables reuse when literal column replacement occurs
  • Wired cache through ParquetMorselizer and ParquetSource


Are these changes tested?

Yes. New tests validate correctness and cache boundaries:

  • Reuse occurs for files with the same physical schema and reusable adapter
  • No reuse when adapter does not support reusable rewrites
  • No reuse across different physical schemas

Additionally:

  • A custom counting adapter factory verifies that rewrite creation is invoked only once in cache-hit scenarios
  • Existing pruning behavior and correctness are preserved

Are there any user-facing changes?

No user-facing API changes.

This is an internal performance optimization. Query results and behavior remain unchanged.


LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.


Notes

  • Cache is scoped to a single scan via ParquetMorselizer
  • Failure entries are removed to avoid poisoning the cache
  • Page-level pruning is intentionally excluded and handled separately (future work)

Performance Impact

Expected improvements for workloads with:

  • Many files
  • Few distinct physical schemas
  • Non-trivial predicates/projections

Reduces repeated expression rewriting and pruning predicate construction overhead.

kosiew added 4 commits April 12, 2026 16:50
Add a scan-local ParquetPruningSetupCache in opener.rs to reuse adapted projection, adapted predicate, and row-group pruning predicate setup for same-schema files. Implement PhysicalExprAdapterFactory::supports_reusable_rewrites() in schema_rewriter.rs so that default adapters opt in while custom ones default to opting out. Connect the cache to ParquetMorselizer in source.rs and introduce a regression test to ensure the setup is reused correctly across same-schema files while keeping it conservative to avoid per-file literal replacements.

Simplify the pruning setup cache in opener.rs by removing the wrapper entry struct, deriving PartialEq/Eq, and returning cloned setup values instead of an extra Arc. Extract the cache-or-build branch into build_or_get_pruning_setup and avoid repeated literal_columns.is_empty() checks. Move returned pruning setup fields directly into prepared and simplify the test-only counting adapter and cache regression test loop. Tighten the supports_reusable_rewrites doc comment in schema_rewriter.rs without changing the public interface.

Refactor the cache in opener.rs to use a HashMap, computing cold misses outside the mutex for better performance, with a re-check on insert. Add cache-boundary tests for non-reusable adapters and differing physical schemas. Clarify the documentation of supports_reusable_rewrites() in schema_rewriter.rs to specify the same logical/physical schema rewrite inputs.

Enhance opener.rs with cache-key rationale comments for clarity.

Update ParquetPruningSetupCache to handle concurrent cold misses by implementing a per-entry pending/ready state. Clarify the public contract of supports_reusable_rewrites() in schema_rewriter.rs.
@github-actions github-actions Bot added the datasource Changes to the datasource crate label Apr 12, 2026
@kosiew kosiew marked this pull request as ready for review April 12, 2026 09:44
Contributor

AdamGS commented Apr 13, 2026

This is really awesome! I was planning on trying to add something like that once the morsel work is more mature, but I was wondering if it'll be possible to make it more format-independent? I ran into a similar issue on our FileSource, which made me file #20078 (which I expect this PR to improve significantly if not just close outright).

Contributor Author

kosiew commented Apr 14, 2026

@AdamGS

Thanks! I agree this is the direction we should take. This PR keeps the cache Parquet-local on purpose because the reusable setup currently stores Parquet-specific artifacts: the adapted Parquet projection/predicate, the physical schema after Parquet file-schema coercions / INT96 handling, and the row-group PruningPredicate. It also leaves page-index work, reader metadata, file metrics, and access-plan execution per file.

That said, the shape is a good stepping stone toward the more format-independent problem in #20078. The parts that seem general are:

  • scan-local reuse keyed by logical schema, physical schema, projection, predicate, and adapter cache-safety
  • avoiding repeated PhysicalExprAdapterFactory::create / rewrite / simplification work for files with equivalent schema inputs
  • letting custom adapters opt in only when their rewrites do not depend on factory-local or unkeyed per-file state

I would prefer to land this narrowly for Parquet first, with the cache-safety contract and tests in place, then follow up by extracting the format-neutral expression adaptation / pruning setup cache into a datasource-level helper once another FileSource can exercise it. Vortex or another custom FileSource would be a good second consumer to make sure the abstraction is not overfit to Parquet row-group pruning.

Contributor

AdamGS commented Apr 14, 2026

That makes perfect sense, would be happy to help with anything here! This is really awesome stuff
