
perf(core): prune partition subtrees during descent and parallelize listing#597

Open
sushiljacksparrow wants to merge 2 commits into apache:main from sushiljacksparrow:feat/parallel-pruning-leaf-listing

Conversation

Contributor

@sushiljacksparrow commented May 6, 2026

Change Logs

Implements the re-scoped plan from #205 (xushiyan's 2026-04-28 comment): parallelize get_leaf_dirs and add an optional per-level predicate so partition pruning short-circuits subtrees during descent.

  1. get_leaf_dirs (crates/core/src/storage/mod.rs) gains an optional predicate and a global concurrency cap. Subtrees the predicate rejects are skipped without listing; sibling subdirectories at each level are walked concurrently. Total in-flight list_dirs calls are bounded globally at parallelism by a tokio::sync::Semaphore shared across the whole descent, so recursion does not multiply concurrency. buffer_unordered is kept on the recursive stream to bound the memory held by pending child futures, while the semaphore bounds the actual I/O. Parallelism is read from hoodie.plan.listing.parallelism (default 10). A sketch of this shape follows the list.
  2. PartitionPruner::should_include_prefix evaluates filters whose field is already present in a partial partition path; filters on deeper fields are deferred. The single _hoodie_partition_path field used by timestamp-based key generators is conservatively admitted at intermediate levels, and should_include still runs at the leaf as before. A prefix-check sketch appears after the scope note below.
  3. FileLister::list_relevant_partition_paths plumbs both together: the predicate filters out lake-format metadata dirs (.hoodie, _delta_log, metadata) and calls should_include_prefix for partition pruning.
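To make the shape concrete, here is a minimal, self-contained sketch of item 1 (and the plumbing in item 3). The names, signatures, and the in-memory Storage are illustrative stand-ins, not the actual code in crates/core/src/storage/mod.rs; the point is the shared semaphore around the LIST call plus buffer_unordered on the recursive fan-out:

```rust
// Sketch only: names, signatures, and the in-memory Storage are illustrative,
// not the actual hudi-rs code. Requires the `futures` and `tokio` crates.
use std::collections::HashMap;
use std::sync::Arc;

use futures::future::BoxFuture;
use futures::stream::{self, StreamExt, TryStreamExt};
use tokio::sync::Semaphore;

type Result<T> = std::result::Result<T, Box<dyn std::error::Error + Send + Sync>>;

/// In-memory stand-in for an object-store listing: prefix -> child dir names.
struct Storage {
    dirs: HashMap<String, Vec<String>>,
}

async fn list_dirs(storage: &Storage, prefix: &str) -> Result<Vec<String>> {
    // One simulated LIST request.
    Ok(storage.dirs.get(prefix).cloned().unwrap_or_default())
}

/// Recursively collect leaf partition directories under `prefix`.
/// `predicate` decides whether a partial partition path is worth descending into;
/// `permits` is shared across the entire descent, so total in-flight LIST calls
/// never exceed the configured parallelism, no matter how deep the tree is.
fn get_leaf_dirs<'a>(
    storage: &'a Storage,
    prefix: String,
    predicate: Option<&'a (dyn Fn(&str) -> bool + Sync)>,
    permits: Arc<Semaphore>,
    parallelism: usize,
) -> BoxFuture<'a, Result<Vec<String>>> {
    Box::pin(async move {
        // Hold a permit only for the LIST itself (the actual I/O) and release it
        // before fanning out, so children are free to make progress.
        let children = {
            let _permit = permits.acquire().await?;
            list_dirs(storage, &prefix).await?
        };

        if children.is_empty() {
            return Ok(vec![prefix]); // no subdirectories: this is a leaf partition
        }

        let nested: Vec<Vec<String>> = stream::iter(
            children
                .into_iter()
                .map(|child| format!("{prefix}/{child}"))
                // Pruning happens here: rejected subtrees are never listed at all.
                .filter(|path| predicate.map_or(true, |p| p(path.as_str())))
                .map(|path| {
                    get_leaf_dirs(storage, path, predicate, Arc::clone(&permits), parallelism)
                }),
        )
        // Bounds how many child futures are pending (memory); the semaphore
        // above bounds how many LIST requests are in flight (I/O).
        .buffer_unordered(parallelism)
        .try_collect()
        .await?;

        Ok(nested.into_iter().flatten().collect())
    })
}
```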

Scope is the no-MDT listing path only. The metadata-table FILES path is unchanged.
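A sketch of the prefix check in item 2, again with illustrative types: the real PartitionPruner holds parsed filter expressions and the table's partition schema and handles more path styles, but the idea is that only filters whose field already appears in a hive-style prefix are evaluated, and anything the check cannot interpret fails open:

```rust
// Sketch only: the real PartitionPruner holds parsed filter expressions and the
// table's partition schema; this version handles hive-style `field=value` paths
// and fails open on anything it cannot interpret.
struct Filter {
    field: String,
    /// Evaluates the filter against a raw partition value (illustrative).
    matches: fn(&str) -> bool,
}

struct PartitionPruner {
    filters: Vec<Filter>,
    /// Ordered partition fields, e.g. ["year", "month", "day"].
    partition_fields: Vec<String>,
}

impl PartitionPruner {
    /// Decide whether a *partial* partition path (e.g. "year=2026/month=04")
    /// can still contain relevant leaves. Only filters whose field already
    /// appears in the prefix are evaluated; the rest are deferred to
    /// `should_include` at the leaf. Anything ambiguous fails open.
    fn should_include_prefix(&self, prefix: &str) -> bool {
        let mut seen: Vec<(&str, &str)> = Vec::new();
        for segment in prefix.split('/').filter(|s| !s.is_empty()) {
            match segment.split_once('=') {
                Some((field, value)) => seen.push((field, value)),
                // Non hive-style segment (e.g. the single _hoodie_partition_path
                // level from a timestamp-based key generator): admit conservatively.
                None => return true,
            }
        }
        if seen.len() > self.partition_fields.len() {
            // More segments than partition fields: unexpected layout, fail open.
            return true;
        }
        self.filters.iter().all(|f| {
            match seen.iter().copied().find(|(field, _)| *field == f.field) {
                Some((_, value)) => (f.matches)(value), // field present: evaluate now
                None => true, // field not reached at this depth: defer to the leaf
            }
        })
    }
}
```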

Note on the concurrency cap

A first cut of this PR used buffer_unordered(parallelism) per level, which only bounds fan-out at each recursion step. A D-level partition tree could therefore reach up to P^D concurrent list_dirs calls: with the default parallelism=10, that is 1000 simultaneous LIST requests on a 3-level hive-style table and 10000 on a 4-level (e.g. date/hour/minute/second) layout. The semaphore fix makes parallelism a true global cap, at the cost of somewhat slower un-pruned full-table descents at the default setting, and the config is now a meaningful knob: raising it scales concurrency linearly instead of exponentially. The toy comparison below reproduces the effect.
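The blow-up is easy to reproduce with a toy recursion (not PR code; it assumes the tokio and futures crates). On a 3-level, fan-out-10 tree, the per-level cap lets the peak number of concurrent simulated LISTs grow toward 10^3, while a shared Semaphore holds it at 10:

```rust
// Toy demonstration, not PR code. Requires `tokio` (macros, rt, sync, time) and
// `futures`. Each node "lists" (sleeps 20 ms), then fans out to `fanout` children
// bounded only by buffer_unordered; the optional shared Semaphore is the global cap.
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::time::Duration;

use futures::future::BoxFuture;
use futures::stream::{self, StreamExt};
use tokio::sync::Semaphore;

fn descend(
    depth: usize,
    fanout: usize,
    sem: Option<Arc<Semaphore>>,
    in_flight: Arc<AtomicUsize>,
    peak: Arc<AtomicUsize>,
) -> BoxFuture<'static, ()> {
    Box::pin(async move {
        {
            // Simulated LIST request, optionally gated by the shared semaphore.
            let _permit = match &sem {
                Some(s) => Some(s.clone().acquire_owned().await.unwrap()),
                None => None,
            };
            let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
            peak.fetch_max(now, Ordering::SeqCst);
            tokio::time::sleep(Duration::from_millis(20)).await; // pretend per-LIST RTT
            in_flight.fetch_sub(1, Ordering::SeqCst);
        } // permit released here, before fanning out
        if depth == 0 {
            return;
        }
        stream::iter((0..fanout).map(|_| {
            descend(depth - 1, fanout, sem.clone(), in_flight.clone(), peak.clone())
        }))
        .buffer_unordered(fanout) // per-level bound only
        .collect::<Vec<_>>()
        .await;
    })
}

#[tokio::main]
async fn main() {
    for sem in [None, Some(Arc::new(Semaphore::new(10)))] {
        let in_flight = Arc::new(AtomicUsize::new(0));
        let peak = Arc::new(AtomicUsize::new(0));
        descend(3, 10, sem.clone(), in_flight, peak.clone()).await;
        println!(
            "global semaphore: {} -> peak concurrent LISTs: {}",
            sem.is_some(),
            peak.load(Ordering::SeqCst)
        );
    }
}
```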

Closes #205.

Impact

Performance. On remote object stores with selective partition filters, every LIST for a pruned subtree is eliminated. Even un-filtered listings benefit from parallel descent under a predictable concurrency budget.

No behavior change to the listed set on either path.

Risk level

low

Tested

  • Unit + integration tests in hudi-core: 11 new (get_leaf_dirs parallel + predicate, should_include_prefix correctness, end-to-end pruning via FileLister) plus the existing 675 lib tests + 89 integration tests all pass. cargo clippy --workspace --all-targets -- -D warnings and cargo fmt --check clean.
  • Python binding via maturin develop + pytest python/tests/test_table_read.py: 10/10 pass — the new code path is exercised through PyO3.
  • Listing-path microbenchmark on a synthetic level1=*/level2=*/level3=* tree of 10×10×10 = 1000 leaves on a local filesystem (median of 3 runs, release build, post semaphore fix):
| Configuration | Median | Effect |
| --- | --- | --- |
| Sequential, no predicate (pre-PR equivalent) | 93.8 ms | baseline |
| Parallel par=4, no predicate | 52.8 ms | 1.8× from parallel descent |
| Parallel par=10, no predicate | 39.7 ms | 2.4× from parallel descent |
| Parallel par=10, accept-all predicate | 42.0 ms | predicate hook overhead negligible (~5%) |
| Parallel par=10, predicate keeps 10% subtree | 4.6 ms | 9× from pruning |
| Parallel par=10, predicate keeps 1% subtree | 1.5 ms | 27× from pruning |
| Sequential, predicate keeps 1% subtree | 1.8 ms | pruning alone (no parallelism) |

Local FS saturates around par=4–10. Remote object stores with per-LIST RTT in the tens of ms should see parallel-descent speedup scale roughly linearly with parallelism on un-pruned paths, up to the underlying request-rate ceiling.

  • Real-world S3 against a 3-level hive-style table (~200 partitions, ~550 file slices, us-west-2, queried from outside the VPC over the internet, post-fix):
| Query shape | Final slices | Wall clock @ par=10 | Wall clock @ par=50 |
| --- | --- | --- | --- |
| 3-field equality (selective single partition) | 1 | ~2.5 s | ~2.5 s |
| 2-field equality + wildcard at deepest level | ~50 | ~3.9 s | ~3.1 s |
| Top-level field equality only | ~300 | ~10.6 s | ~4.4 s |
| No filter (full descent) | 553 | ~18.0 s | ~5.8 s |

Selective queries are unaffected by the parallelism setting (work is dominated by the small number of leaves the predicate admits). Un-filtered or weakly-filtered descents scale with the parallelism knob, which is now safe to raise because it is a global bound.

  • TPC-H (SF=0.001) via DataFusion: Q1/Q3/Q6 medians 7.5 / 13.5 / 6.8 ms — no regression. (The TPC-H harness creates non-partitioned tables, so this PR's listing-path changes are inert here; the run is a smoke test that nothing else regressed.)

Documentation Update

None required.

…isting

`FileLister::list_relevant_partition_paths` previously walked every
partition subtree to leaf depth, then filtered the results — and the
recursive descent in `get_leaf_dirs` ran sequentially, paying full LIST
latency at every level. On remote object stores with selective partition
filters this is the dominant cost.

Two changes, per the re-scoped plan in issue apache#205:

1. `get_leaf_dirs` (`crates/core/src/storage/mod.rs`) now takes an
   optional per-child predicate and a parallelism bound. Subtrees the
   predicate rejects are skipped without listing. Sibling subdirs at
   each level are walked concurrently via `buffer_unordered`.

2. `PartitionPruner::should_include_prefix` evaluates filters whose
   field is already present in a partial partition path. Filters for
   deeper fields are deferred. The single `_hoodie_partition_path`
   field used by timestamp-based key generators is conservatively
   admitted at intermediate levels; `should_include` still runs at
   the leaf.

`FileLister::list_relevant_partition_paths` plumbs both: the predicate
filters out lake-format metadata dirs (`.hoodie`, `_delta_log`,
`metadata`) and calls `should_include_prefix` for partition pruning.
Parallelism uses the existing `hoodie.plan.listing.parallelism`
config (default 10).

Scope is the no-MDT listing path only; the metadata-table FILES path is
unchanged.

Closes apache#205.

Tests:
- `get_leaf_dirs_descends_in_parallel_and_returns_all_leaves`
- `get_leaf_dirs_predicate_prunes_subtree`
- `test_partition_pruner_should_include_prefix_*` (6 cases:
  empty prefix, deferred filter, evaluable filter prunes, leaf parity
  with `should_include`, single `_hoodie_partition_path` admits, and
  too-many-parts fails open)
- `list_partition_paths_prunes_subtrees_when_top_level_filter_rejects`
- `list_partition_paths_prunes_at_deeper_level_when_only_inner_field_filtered`

`buffer_unordered(parallelism)` only bounds fan-out per recursion level. A
recursive descent over a D-level partition tree could therefore reach up to
P^D concurrent `list_dirs` calls — for the default `parallelism=10` and a
3-level hive-style table, that is 1000 simultaneous LIST requests; for a
date/hour/minute/second layout, 10000.

Replace the per-level cap with a `tokio::sync::Semaphore` shared across the
whole descent. Each recursive call acquires a permit before issuing
`list_dirs` and releases it before fanning out, so total in-flight requests
stay bounded by `parallelism` regardless of depth. `buffer_unordered` is
kept on the recursive stream — it now bounds memory of pending child
futures, while the semaphore bounds actual I/O.

`hoodie.plan.listing.parallelism` becomes a meaningful global knob: the
default 10 is conservative for shallow trees but can be raised for wider
or deeper layouts without risking exponential concurrency growth.
