Parquet: Add scan-level cache for adapted pruning setup (projection/predicate) to reuse CPU-only work across same-schema files#21566
Conversation
Add a scan-local ParquetPruningSetupCache in opener.rs to reuse adapted projection, adapted predicate, and row-group pruning predicate setup for same-schema files. Implement PhysicalExprAdapterFactory::supports_reusable_rewrites() in schema_rewriter.rs to allow default adapters while defaulting custom ones to opt-out. Connect the cache to ParquetMorselizer in source.rs and introduce a regression test to ensure the setup is reused correctly across same-schema files while keeping it conservative to avoid per-file literal replacements.
Simplify the pruning setup cache in opener.rs by removing the wrapper entry struct, deriving PartialEq/Eq, and returning cloned setup values instead of an extra Arc. Extract the cache-or-build branch into build_or_get_pruning_setup and avoid repeated literal_columns.is_empty() checks. Move returned pruning setup fields directly into prepared and simplify the test-only counting adapter and cache regression test loop. Tighten the supports_reusable_rewrites doc comment in schema_rewriter.rs without changing the public interface.
Refactor the cache in opener.rs to use a HashMap, computing cold misses outside the mutex for better performance, with a re-check on insert. Add cache-boundary tests for non-reusable adapters and diverse physical schemas. Clarify documentation in schema_rewriter.rs for supports_reusable_rewrites() to specify the same logical/physical schema rewrite inputs.
Enhance opener.rs with cache-key rationale comments for clarity. Update ParquetPruningSetupCache to handle concurrent cold misses by implementing a per-entry pending/ready state. Clarify the public contract of supports_reusable_rewrites() in schema_rewriter.rs.
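The per-entry pending/ready coordination described in these commits can be sketched as follows. This is a simplified illustration, not DataFusion's actual code: the `SetupCache` and `Entry` names are hypothetical, the `Failed` state is omitted, and cold misses are computed outside the mutex as the commit messages describe.

```rust
use std::collections::HashMap;
use std::sync::{Condvar, Mutex};

// Hypothetical per-entry state: a cold miss inserts Pending, computes outside
// the lock, then publishes Ready; concurrent readers wait on the Condvar.
enum Entry<V> {
    Pending,
    Ready(V),
}

struct SetupCache<K, V> {
    state: Mutex<HashMap<K, Entry<V>>>,
    cond: Condvar,
}

impl<K: std::hash::Hash + Eq + Clone, V: Clone> SetupCache<K, V> {
    fn new() -> Self {
        Self {
            state: Mutex::new(HashMap::new()),
            cond: Condvar::new(),
        }
    }

    /// Return the cached value, or compute it on a cold miss. The builder runs
    /// outside the mutex so other entries are not blocked by a slow build.
    fn get_or_build(&self, key: K, build: impl FnOnce() -> V) -> V {
        let mut map = self.state.lock().unwrap();
        loop {
            match map.get(&key) {
                Some(Entry::Ready(v)) => return v.clone(),
                Some(Entry::Pending) => {
                    // Another caller is building this entry: wait instead of
                    // duplicating the CPU-only setup work.
                    map = self.cond.wait(map).unwrap();
                }
                None => {
                    map.insert(key.clone(), Entry::Pending);
                    drop(map); // compute the cold miss outside the mutex
                    let v = build();
                    let mut map = self.state.lock().unwrap();
                    map.insert(key, Entry::Ready(v.clone()));
                    self.cond.notify_all();
                    return v;
                }
            }
        }
    }
}
```

A second lookup with the same key returns the cached value without invoking the builder again, which is the behavior the regression test in the PR is checking at the Parquet level.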
This is really awesome! I was planning on trying to add something like that once the morsel work is more mature, but I was wondering if it'll be possible to make it more format-independent? Ran into a similar issue on our …
Thanks! I agree this is the direction we should take. This PR keeps the cache Parquet-local on purpose because the reusable setup currently stores Parquet-specific artifacts: the adapted Parquet projection/predicate, the physical schema after Parquet file-schema coercions / INT96 handling, and the row-group PruningPredicate. It also leaves page-index work, reader metadata, file metrics, and access-plan execution per file.

That said, the shape is a good stepping stone toward the more format-independent problem in #20078. The parts that seem general are scan-local reuse keyed by logical schema, physical schema, projection, predicate, and adapter cache-safety. I would prefer to land this narrowly for Parquet first, with the cache-safety contract and tests in place, then follow up by extracting the format-neutral expression adaptation / pruning setup cache into a datasource-level helper once another FileSource can exercise it. Vortex or another custom FileSource would be a good second consumer to make sure the abstraction is not overfit to Parquet row-group pruning.
That makes perfect sense; I'd be happy to help with anything here! This is really awesome stuff.
Which issue does this PR close?
Rationale for this change
Parquet scans currently adapt and simplify projection and predicate expressions per file, even when multiple files share the same physical schema and query inputs. This results in repeated CPU work (expression rewriting, simplification, and pruning predicate construction) that is identical across files.
This PR introduces a scan-local cache to reuse this CPU-only setup when safe, reducing redundant computation and improving performance for datasets with many files but few schema variations.
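To illustrate "reuse when safe": the cache can only hit when every input to the pruning setup is identical. A minimal sketch of that idea, with simplified stand-in fields (the real key would hold schema and physical-expression handles, not strings):

```rust
use std::collections::HashMap;

// Hypothetical simplification of a scan-local cache key: reuse is allowed
// only when every input that feeds the pruning setup matches exactly.
#[derive(Hash, PartialEq, Eq, Clone, Debug)]
struct PruningSetupKey {
    logical_schema: String,
    physical_schema: String,
    projection: Vec<usize>,
    predicate: String,
}

fn lookups_hit_only_on_identical_inputs() -> (bool, bool) {
    let mut cache: HashMap<PruningSetupKey, &'static str> = HashMap::new();
    let key = PruningSetupKey {
        logical_schema: "a:int64,b:utf8".into(),
        physical_schema: "a:int64,b:utf8".into(),
        projection: vec![0, 1],
        predicate: "a > 10".into(),
    };
    cache.insert(key.clone(), "adapted setup");

    // A same-schema file presents identical inputs, so the setup is reused.
    let hit = cache.contains_key(&key);

    // A file whose physical schema differs (e.g. a different column type)
    // must not reuse the setup.
    let mut other = key.clone();
    other.physical_schema = "a:int32,b:utf8".into();
    let miss = cache.contains_key(&other);
    (hit, miss)
}
```

Keying on the full set of inputs is what keeps the optimization conservative: schema drift across files falls back to the existing per-file setup path.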
What changes are included in this PR?
Introduced ParquetPruningSetupCache
- Owned by ParquetMorselizer
- Uses a Mutex + Condvar to coordinate concurrent access and avoid duplicate work

Added ParquetPruningSetupCacheKey
- Includes the inputs that determine the setup: logical schema, physical schema, projection, predicate, and adapter cache-safety
- Ensures reuse only when inputs are equivalent within a scan

Added ParquetPruningSetup and a cache entry state machine
- Entry states: Pending, Ready, Failed

Refactored pruning setup logic
- Renamed build_pruning_setup to build_or_get_pruning_setup to handle cache lookup/fallback

Integrated cache usage into MetadataLoadedParquetOpen

Added supports_reusable_rewrites to PhysicalExprAdapterFactory
- Defaults to false, so custom adapter factories are conservatively excluded from the cache; DefaultPhysicalExprAdapterFactory opts in

Added a pruning_setup_reusable guard

Wired the cache through ParquetMorselizer and ParquetSource

Are these changes tested?
Yes. New tests validate correctness and cache boundaries:
- A regression test verifies the pruning setup is built once and reused across same-schema files
- Cache-boundary tests cover non-reusable adapter factories and files with differing physical schemas
Additionally:
Are there any user-facing changes?
No user-facing API changes.
This is an internal performance optimization. Query results and behavior remain unchanged.
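The cache-safety contract described in the change list can be sketched as an opt-in trait method. The names mirror the PR, but the trait shown here is a simplified stand-in for the real PhysicalExprAdapterFactory, reduced to just this one method:

```rust
// Simplified stand-in for the real trait: custom factories conservatively
// default to false (their rewrites may not be reusable across files), while
// the built-in default factory opts in.
trait PhysicalExprAdapterFactory {
    /// Return true only if this factory produces identical rewrites for
    /// identical (logical schema, physical schema) inputs, making the adapted
    /// projection/predicate safe to cache across same-schema files.
    fn supports_reusable_rewrites(&self) -> bool {
        false
    }
}

struct DefaultPhysicalExprAdapterFactory;

impl PhysicalExprAdapterFactory for DefaultPhysicalExprAdapterFactory {
    fn supports_reusable_rewrites(&self) -> bool {
        true // deterministic rewrites: safe to reuse within a scan
    }
}

// A hypothetical user-defined adapter that does not override the method
// inherits the conservative default and is excluded from the cache.
struct CustomAdapterFactory;
impl PhysicalExprAdapterFactory for CustomAdapterFactory {}
```

Because the method has a default body, existing third-party implementations keep compiling unchanged; they simply do not participate in the cache until they explicitly opt in.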
LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.
Notes
The cache is scan-local, owned by a single ParquetMorselizer instance; it is not shared across scans.

Performance Impact
Expected improvements for workloads with:
- many files sharing the same physical schema (few schema variations across the scan)
- projections and predicates that would otherwise be adapted and simplified per file

Reduces repeated expression rewriting and pruning predicate construction overhead.