
perf(core): prune partition subtrees during descent and parallelize listing#597

Open
sushiljacksparrow wants to merge 2 commits into apache:main from sushiljacksparrow:feat/parallel-pruning-leaf-listing

Conversation

Contributor

@sushiljacksparrow commented May 6, 2026

Change Logs

Implements the re-scoped plan from #205 (xushiyan's 2026-04-28 comment): parallelize get_leaf_dirs and add an optional per-level predicate so partition pruning short-circuits subtrees during descent.

  1. get_leaf_dirs (crates/core/src/storage/mod.rs) gains an optional predicate and a global concurrency cap. Subtrees the predicate rejects are skipped without listing; sibling subdirectories at each level are walked concurrently. Total in-flight list_dirs calls are bounded globally at parallelism by a tokio::sync::Semaphore shared across the whole descent, so recursion does not multiply concurrency. buffer_unordered is kept on the recursive stream to bound the memory held by pending child futures, while the semaphore bounds the actual I/O. Parallelism is read from hoodie.plan.listing.parallelism (default 10). A sketch of this shape follows the list.
  2. PartitionPruner::should_include_prefix evaluates filters whose field is already present in a partial partition path; filters on deeper fields are deferred. The single _hoodie_partition_path field used by timestamp-based key generators is conservatively admitted at intermediate levels, and should_include still runs at the leaf as before. A prefix-check sketch appears after the scope note below.
  3. FileLister::list_relevant_partition_paths plumbs both together: the predicate filters out lake-format metadata dirs (.hoodie, _delta_log, metadata) and calls should_include_prefix for partition pruning.
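To make the shape concrete, here is a minimal, self-contained sketch of item 1 (and the plumbing in item 3). The names, signatures, and the in-memory Storage are illustrative stand-ins, not the actual code in crates/core/src/storage/mod.rs; the point is the shared semaphore around the LIST call plus buffer_unordered on the recursive fan-out:

```rust
// Sketch only: names, signatures, and the in-memory Storage are illustrative,
// not the actual hudi-rs code. Requires the `futures` and `tokio` crates.
use std::collections::HashMap;
use std::sync::Arc;

use futures::future::BoxFuture;
use futures::stream::{self, StreamExt, TryStreamExt};
use tokio::sync::Semaphore;

type Result<T> = std::result::Result<T, Box<dyn std::error::Error + Send + Sync>>;

/// In-memory stand-in for an object-store listing: prefix -> child dir names.
struct Storage {
    dirs: HashMap<String, Vec<String>>,
}

async fn list_dirs(storage: &Storage, prefix: &str) -> Result<Vec<String>> {
    // One simulated LIST request.
    Ok(storage.dirs.get(prefix).cloned().unwrap_or_default())
}

/// Recursively collect leaf partition directories under `prefix`.
/// `predicate` decides whether a partial partition path is worth descending into;
/// `permits` is shared across the entire descent, so total in-flight LIST calls
/// never exceed the configured parallelism, no matter how deep the tree is.
fn get_leaf_dirs<'a>(
    storage: &'a Storage,
    prefix: String,
    predicate: Option<&'a (dyn Fn(&str) -> bool + Sync)>,
    permits: Arc<Semaphore>,
    parallelism: usize,
) -> BoxFuture<'a, Result<Vec<String>>> {
    Box::pin(async move {
        // Hold a permit only for the LIST itself (the actual I/O) and release it
        // before fanning out, so children are free to make progress.
        let children = {
            let _permit = permits.acquire().await?;
            list_dirs(storage, &prefix).await?
        };

        if children.is_empty() {
            return Ok(vec![prefix]); // no subdirectories: this is a leaf partition
        }

        let nested: Vec<Vec<String>> = stream::iter(
            children
                .into_iter()
                .map(|child| format!("{prefix}/{child}"))
                // Pruning happens here: rejected subtrees are never listed at all.
                .filter(|path| predicate.map_or(true, |p| p(path.as_str())))
                .map(|path| {
                    get_leaf_dirs(storage, path, predicate, Arc::clone(&permits), parallelism)
                }),
        )
        // Bounds how many child futures are pending (memory); the semaphore
        // above bounds how many LIST requests are in flight (I/O).
        .buffer_unordered(parallelism)
        .try_collect()
        .await?;

        Ok(nested.into_iter().flatten().collect())
    })
}
```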

Scope is the no-MDT listing path only. The metadata-table FILES path is unchanged.
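A sketch of the prefix check in item 2, again with illustrative types: the real PartitionPruner holds parsed filter expressions and the table's partition schema and handles more path styles, but the idea is that only filters whose field already appears in a hive-style prefix are evaluated, and anything the check cannot interpret fails open:

```rust
// Sketch only: the real PartitionPruner holds parsed filter expressions and the
// table's partition schema; this version handles hive-style `field=value` paths
// and fails open on anything it cannot interpret.
struct Filter {
    field: String,
    /// Evaluates the filter against a raw partition value (illustrative).
    matches: fn(&str) -> bool,
}

struct PartitionPruner {
    filters: Vec<Filter>,
    /// Ordered partition fields, e.g. ["year", "month", "day"].
    partition_fields: Vec<String>,
}

impl PartitionPruner {
    /// Decide whether a *partial* partition path (e.g. "year=2026/month=04")
    /// can still contain relevant leaves. Only filters whose field already
    /// appears in the prefix are evaluated; the rest are deferred to
    /// `should_include` at the leaf. Anything ambiguous fails open.
    fn should_include_prefix(&self, prefix: &str) -> bool {
        let mut seen: Vec<(&str, &str)> = Vec::new();
        for segment in prefix.split('/').filter(|s| !s.is_empty()) {
            match segment.split_once('=') {
                Some((field, value)) => seen.push((field, value)),
                // Non hive-style segment (e.g. the single _hoodie_partition_path
                // level from a timestamp-based key generator): admit conservatively.
                None => return true,
            }
        }
        if seen.len() > self.partition_fields.len() {
            // More segments than partition fields: unexpected layout, fail open.
            return true;
        }
        self.filters.iter().all(|f| {
            match seen.iter().copied().find(|(field, _)| *field == f.field) {
                Some((_, value)) => (f.matches)(value), // field present: evaluate now
                None => true, // field not reached at this depth: defer to the leaf
            }
        })
    }
}
```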

Note on the concurrency cap

A first cut of this PR used buffer_unordered(parallelism) per level, which only bounds fan-out at each recursion step. A D-level partition tree could therefore reach up to P^D concurrent list_dirs calls: with the default parallelism=10, that is 1000 simultaneous LIST requests on a 3-level hive-style table and 10000 on a 4-level (e.g. date/hour/minute/second) layout. The semaphore fix makes parallelism a true global cap, at the cost of somewhat slower un-pruned full-table descents at the default setting, and the config is now a meaningful knob: raising it scales concurrency linearly instead of exponentially. The toy comparison below reproduces the effect.
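The blow-up is easy to reproduce with a toy recursion (not PR code; it assumes the tokio and futures crates). On a 3-level, fan-out-10 tree, the per-level cap lets the peak number of concurrent simulated LISTs grow toward 10^3, while a shared Semaphore holds it at 10:

```rust
// Toy demonstration, not PR code. Requires `tokio` (macros, rt, sync, time) and
// `futures`. Each node "lists" (sleeps 20 ms), then fans out to `fanout` children
// bounded only by buffer_unordered; the optional shared Semaphore is the global cap.
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::time::Duration;

use futures::future::BoxFuture;
use futures::stream::{self, StreamExt};
use tokio::sync::Semaphore;

fn descend(
    depth: usize,
    fanout: usize,
    sem: Option<Arc<Semaphore>>,
    in_flight: Arc<AtomicUsize>,
    peak: Arc<AtomicUsize>,
) -> BoxFuture<'static, ()> {
    Box::pin(async move {
        {
            // Simulated LIST request, optionally gated by the shared semaphore.
            let _permit = match &sem {
                Some(s) => Some(s.clone().acquire_owned().await.unwrap()),
                None => None,
            };
            let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
            peak.fetch_max(now, Ordering::SeqCst);
            tokio::time::sleep(Duration::from_millis(20)).await; // pretend per-LIST RTT
            in_flight.fetch_sub(1, Ordering::SeqCst);
        } // permit released here, before fanning out
        if depth == 0 {
            return;
        }
        stream::iter((0..fanout).map(|_| {
            descend(depth - 1, fanout, sem.clone(), in_flight.clone(), peak.clone())
        }))
        .buffer_unordered(fanout) // per-level bound only
        .collect::<Vec<_>>()
        .await;
    })
}

#[tokio::main]
async fn main() {
    for sem in [None, Some(Arc::new(Semaphore::new(10)))] {
        let in_flight = Arc::new(AtomicUsize::new(0));
        let peak = Arc::new(AtomicUsize::new(0));
        descend(3, 10, sem.clone(), in_flight, peak.clone()).await;
        println!(
            "global semaphore: {} -> peak concurrent LISTs: {}",
            sem.is_some(),
            peak.load(Ordering::SeqCst)
        );
    }
}
```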

Closes #205.

Impact

Performance. On remote object stores with selective partition filters, every LIST for a pruned subtree is eliminated. Even un-filtered listings benefit from parallel descent under a predictable concurrency budget.

No behavior change to the listed set on either path.

Risk level

low

Tested

  • Unit + integration tests in hudi-core: 11 new (get_leaf_dirs parallel + predicate, should_include_prefix correctness, end-to-end pruning via FileLister) plus the existing 675 lib tests + 89 integration tests all pass. cargo clippy --workspace --all-targets -- -D warnings and cargo fmt --check clean.
  • Python binding via maturin develop + pytest python/tests/test_table_read.py: 10/10 pass — the new code path is exercised through PyO3.
  • Listing-path microbenchmark on a synthetic level1=*/level2=*/level3=* tree of 10×10×10 = 1000 leaves on a local filesystem (median of 3 runs, release build, post semaphore fix):
| Configuration | Median | Effect |
| --- | --- | --- |
| Sequential, no predicate (pre-PR equivalent) | 93.8 ms | baseline |
| Parallel par=4, no predicate | 52.8 ms | 1.8× from parallel descent |
| Parallel par=10, no predicate | 39.7 ms | 2.4× from parallel descent |
| Parallel par=10, accept-all predicate | 42.0 ms | predicate hook overhead negligible (~5%) |
| Parallel par=10, predicate keeps 10% subtree | 4.6 ms | 9× from pruning |
| Parallel par=10, predicate keeps 1% subtree | 1.5 ms | 27× from pruning |
| Sequential, predicate keeps 1% subtree | 1.8 ms | pruning alone (no parallelism) |

Local FS saturates around par=4–10. Remote object stores with per-LIST RTT in the tens of ms should see parallel-descent speedup scale roughly linearly with parallelism on un-pruned paths, up to the underlying request-rate ceiling.

  • Real-world S3 against a 3-level hive-style table (~200 partitions, ~550 file slices, us-west-2, queried from outside the VPC over the internet, post-fix):
| Query shape | Final slices | Wall clock @ par=10 | Wall clock @ par=50 |
| --- | --- | --- | --- |
| 3-field equality (selective single partition) | 1 | ~2.5 s | ~2.5 s |
| 2-field equality + wildcard at deepest level | ~50 | ~3.9 s | ~3.1 s |
| Top-level field equality only | ~300 | ~10.6 s | ~4.4 s |
| No filter (full descent) | 553 | ~18.0 s | ~5.8 s |

Selective queries are unaffected by the parallelism setting (work is dominated by the small number of leaves the predicate admits). Un-filtered or weakly-filtered descents scale with the parallelism knob, which is now safe to raise because it is a global bound.

  • TPC-H (SF=0.001) via DataFusion: Q1/Q3/Q6 medians 7.5 / 13.5 / 6.8 ms — no regression. (The TPC-H harness creates non-partitioned tables, so this PR's listing-path changes are inert here; the run is a smoke test that nothing else regressed.)

Documentation Update

None required.

…isting

`FileLister::list_relevant_partition_paths` previously walked every
partition subtree to leaf depth, then filtered the results — and the
recursive descent in `get_leaf_dirs` ran sequentially, paying full LIST
latency at every level. On remote object stores with selective partition
filters this is the dominant cost.

Two changes, per the re-scoped plan in issue apache#205:

1. `get_leaf_dirs` (`crates/core/src/storage/mod.rs`) now takes an
   optional per-child predicate and a parallelism bound. Subtrees the
   predicate rejects are skipped without listing. Sibling subdirs at
   each level are walked concurrently via `buffer_unordered`.

2. `PartitionPruner::should_include_prefix` evaluates filters whose
   field is already present in a partial partition path. Filters for
   deeper fields are deferred. The single `_hoodie_partition_path`
   field used by timestamp-based key generators is conservatively
   admitted at intermediate levels; `should_include` still runs at
   the leaf.

`FileLister::list_relevant_partition_paths` plumbs both: the predicate
filters out lake-format metadata dirs (`.hoodie`, `_delta_log`,
`metadata`) and calls `should_include_prefix` for partition pruning.
Parallelism uses the existing `hoodie.plan.listing.parallelism`
config (default 10).

Scope is the no-MDT listing path only; the metadata-table FILES path is
unchanged.

Closes apache#205.

Tests:
- `get_leaf_dirs_descends_in_parallel_and_returns_all_leaves`
- `get_leaf_dirs_predicate_prunes_subtree`
- `test_partition_pruner_should_include_prefix_*` (6 cases:
  empty prefix, deferred filter, evaluable filter prunes, leaf parity
  with `should_include`, single `_hoodie_partition_path` admits, and
  too-many-parts fails open)
- `list_partition_paths_prunes_subtrees_when_top_level_filter_rejects`
- `list_partition_paths_prunes_at_deeper_level_when_only_inner_field_filtered`

`buffer_unordered(parallelism)` only bounds fan-out per recursion level. A
recursive descent over a D-level partition tree could therefore reach up to
P^D concurrent `list_dirs` calls — for the default `parallelism=10` and a
3-level hive-style table, that is 1000 simultaneous LIST requests; for a
date/hour/minute/second layout, 10000.

Replace the per-level cap with a `tokio::sync::Semaphore` shared across the
whole descent. Each recursive call acquires a permit before issuing
`list_dirs` and releases it before fanning out, so total in-flight requests
stay bounded by `parallelism` regardless of depth. `buffer_unordered` is
kept on the recursive stream — it now bounds memory of pending child
futures, while the semaphore bounds actual I/O.

`hoodie.plan.listing.parallelism` becomes a meaningful global knob: the
default 10 is conservative for shallow trees but can be raised for wider
or deeper layouts without risking exponential concurrency growth.
