HDDS-15335. Recon: parallelize NSSummaryTask sub-tasks and cache OmBucketInfo lookups by smengcl · Pull Request #10321 · apache/ozone

smengcl · 2026-05-21T01:18:33Z

This started out as digging into what bottlenecks Recon currently has, then narrowed to NSSummaryTask and its benchmark. It has been a journey.

Generated-by: Claude Code (Opus 4.7 xhigh)

What changes were proposed in this pull request?

Background

NSSummaryTask.process() processes every batch of OM update events Recon ingests. On keyTable workloads (LEGACY or OBJECT_STORE bucket layout) it has two avoidable costs: every event triggers a fresh getBucketTable().getSkipCache(...) RocksDB point read even though bucket layout and objectID never change; and the three sub-tasks (FSO / Legacy / OBS) iterate the event list sequentially even though they operate on disjoint slices and write to disjoint NSSummary entries.

Changes

1. NSSummaryTaskDbEventHandler caches OmBucketInfo lookups in a field-level Map. After the first lookup for a bucket, subsequent lookups become HashMap.get() calls.
2. NSSummaryTask.process() submits the three sub-tasks to a 3-thread pool and joins on all three. The threads see the same event list; each only processes events whose (table, bucket layout) matches its target. Target NSSummary entries are disjoint across sub-tasks so no cross-thread synchronization is needed, and the TaskResult contract is unchanged.
3. The OBS UPDATE path drops a redundant getKeyParentID(oldKeyInfo) call. The parent of an OBS key is its bucket, and an UPDATE event cannot move a key between buckets.
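
The fan-out and join described for NSSummaryTask.process() can be sketched roughly as below. This is a minimal illustration, not the code in this patch: `ParallelSubTasksSketch`, `Target` and `Event` are hypothetical stand-ins for the real Recon types, the per-event work is reduced to a counter, and the pool is created per call for brevity (the real task may well keep a long-lived executor field).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch only: Target, Event and ParallelSubTasksSketch are
// hypothetical stand-ins, not the real Recon classes.
final class ParallelSubTasksSketch {

  enum Target { FSO, LEGACY, OBS }

  // Simplified event: records only which sub-task's slice it belongs to.
  static final class Event {
    final Target target;
    final String key;

    Event(Target target, String key) {
      this.target = target;
      this.key = key;
    }
  }

  // Each sub-task scans the whole list but only handles events in its slice.
  static Callable<Integer> subTask(final Target mine, final List<Event> events) {
    return () -> {
      int processed = 0;
      for (Event e : events) {
        if (e.target != mine) {
          continue; // bail out: the event belongs to another sub-task
        }
        processed++; // placeholder for the real per-event NSSummary update
      }
      return processed;
    };
  }

  // Submit the three sub-tasks to a 3-thread pool and join on all of them.
  static int[] process(List<Event> events) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(3);
    try {
      List<Callable<Integer>> tasks = new ArrayList<>();
      for (Target t : Target.values()) {
        tasks.add(subTask(t, events));
      }
      // invokeAll blocks until every sub-task has completed.
      List<Future<Integer>> futures = pool.invokeAll(tasks);
      int[] counts = new int[futures.size()];
      for (int i = 0; i < counts.length; i++) {
        counts[i] = futures.get(i).get(); // rethrows a sub-task failure
      }
      return counts;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    List<Event> events = Arrays.asList(
        new Event(Target.FSO, "dir/file1"),
        new Event(Target.OBS, "obj1"),
        new Event(Target.LEGACY, "legacy/key1"),
        new Event(Target.FSO, "dir/file2"));
    // Counts are in enum order: FSO, LEGACY, OBS.
    System.out.println(Arrays.toString(process(events))); // prints [2, 1, 1]
  }
}
```

Because each sub-task only touches events in its own slice, the sketch needs no locks; the only synchronization point is the join on the three futures.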

Throughput

Intel Xeon Silver 4416+ (40 cores / 80 threads), OpenJDK 17, 500k events plus 500k preloaded keys, RATIS replication, mixed 60/30/10 create/update/delete:

Code	events/sec	vs vanilla
Vanilla	78,098	1.00x
+ change 1 (cache)	672,172	8.61x
+ changes 1 and 2	918,550	11.76x

Change 1 is the dominant lever: it removes about 1.5M getSkipCache(bucketDBKey) RocksDB Gets per process() call (3 sub-task scans of 500k events, each scan doing one or more bucket lookups before bailing or processing). Change 2 gives a further ~1.37x. Change 3 is below measurement noise.
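
To make the lookup arithmetic concrete, here is a small sketch of the caching idea behind change 1. It is not the patch's code: `BucketLookupCacheSketch` and `BucketInfo` are hypothetical names, the `Function` stands in for the RocksDB point read, and the bucket count (10) is an arbitrary assumption. It also assumes one cache per sub-task handler, which is why a plain HashMap is used; a cache shared across threads would need a concurrent map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch only: names and the 10-bucket workload are assumptions.
final class BucketLookupCacheSketch {

  // Hypothetical stand-in for OmBucketInfo: just the fields that never change.
  static final class BucketInfo {
    final String layout;
    final long objectId;

    BucketInfo(String layout, long objectId) {
      this.layout = layout;
      this.objectId = objectId;
    }
  }

  private final Function<String, BucketInfo> backingRead; // stands in for the DB read
  private final Map<String, BucketInfo> cache = new HashMap<>();
  private long backingReadCount;

  BucketLookupCacheSketch(Function<String, BucketInfo> backingRead) {
    this.backingRead = backingRead;
  }

  // The first lookup per bucket hits the backing store; later ones are map gets.
  BucketInfo lookup(String bucketDbKey) {
    BucketInfo cached = cache.get(bucketDbKey);
    if (cached != null) {
      return cached;
    }
    backingReadCount++;
    BucketInfo loaded = backingRead.apply(bucketDbKey);
    if (loaded != null) {
      cache.put(bucketDbKey, loaded);
    }
    return loaded;
  }

  long getBackingReadCount() {
    return backingReadCount;
  }

  // Simulate `scans` sub-task scans, each with its own cache, over the same events.
  // Returns {total lookups, total backing reads}.
  static long[] simulate(int scans, int eventsPerScan, int buckets) {
    long lookups = 0;
    long reads = 0;
    for (int s = 0; s < scans; s++) {
      BucketLookupCacheSketch handler = new BucketLookupCacheSketch(
          key -> new BucketInfo("OBJECT_STORE", key.hashCode()));
      for (int i = 0; i < eventsPerScan; i++) {
        handler.lookup("/vol1/bucket" + (i % buckets));
        lookups++;
      }
      reads += handler.getBackingReadCount();
    }
    return new long[] {lookups, reads};
  }

  public static void main(String[] args) {
    long[] r = simulate(3, 500_000, 10);
    // 3 scans x 500k events = 1.5M lookups, but only 3 x 10 backing reads.
    System.out.println("lookups=" + r[0] + " backingReads=" + r[1]);
    // prints lookups=1500000 backingReads=30
  }
}
```

With the cache in place, the number of backing reads scales with the number of distinct buckets rather than with the number of events, which is consistent with the large jump between the first two rows of the table.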

Flame graphs for Change 1 (cache)

Before: three RocksDB get towers under NSSummaryTask.process:

After: the three RocksDB get towers are gone. NSSummaryTask.process is not even visible at this zoom level:

Heap pressure

Heap pressure is reduced because change 1 stops allocating a transient OmBucketInfo per RocksDB Get. At 1M events / 1M preloaded keys with an 8 GB heap, total stop-the-world pause time dropped 25% (1137 ms to 850 ms) and cumulative bytes reclaimed dropped 52% (522 GB to 249 GB) across the bench lifetime.

FSO-heavy workloads

On a 100% FSO workload (fileTable / dirTable / deletedDirTable), change 1 is a no-op because the FSO sub-task reads keyInfo.getParentObjectID() directly without a bucket lookup. Change 2 still saves the bail-loop cost of Legacy and OBS scanning the event list to skip at the table-name check, but that cost is small relative to FSO's own processing, so the wall-clock speedup on FSO-heavy workloads is correspondingly smaller. The patch is non-regressive in any case.

Reproduction

The reproduction harness (NSSummaryProcessTimingTest under -Pbench) is provided as a patch on the JIRA.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-15335

How was this patch tested?

All existing TestNSSummaryTask* unit tests pass.
Two regression tests are added to TestNSSummaryTask: one exercises the OBS sub-task path end-to-end (previously only FSO + Legacy events were sent through process()), and one asserts the returned TaskResult reports success and contains a seek position for each of FSO, LEGACY, and OBS.

smengcl requested review from devmadhuu and rakeshadr May 21, 2026 01:18

smengcl added performance recon labels May 21, 2026

smengcl added 2 commits May 20, 2026 18:25

HDDS-15335. Recon: move NSSummaryTask.subTaskExecutor to top of class to satisfy PMD. (4b01ae7)

HDDS-15335. Recon: fix unresolved javadoc reference in NSSummaryTaskDbEventHandler.lookupBucketCached. (b982965)

smengcl added the AI-gen label May 21, 2026

smengcl wants to merge 3 commits into apache:master from smengcl:HDDS-15335-recon-cache-parallel
