HDDS-15335. Recon: parallelize NSSummaryTask sub-tasks and cache OmBucketInfo lookups #10321

Draft
smengcl wants to merge 3 commits into apache:master from smengcl:HDDS-15335-recon-cache-parallel

Conversation

Contributor

@smengcl smengcl commented May 21, 2026

This started out as digging into what bottlenecks Recon currently has, then turned into a focus on NSSummaryTask and its benchmark. It has been a journey.

Generated-by: Claude Code (Opus 4.7 xhigh)

What changes were proposed in this pull request?

Background

NSSummaryTask.process() processes every batch of OM update events Recon ingests. On keyTable workloads (LEGACY or OBJECT_STORE bucket layout) it has two avoidable costs: every event triggers a fresh getBucketTable().getSkipCache(...) RocksDB point read even though bucket layout and objectID never change; and the three sub-tasks (FSO / Legacy / OBS) iterate the event list sequentially even though they operate on disjoint slices and write to disjoint NSSummary entries.

Changes

  1. NSSummaryTaskDbEventHandler caches OmBucketInfo lookups in a field-level Map. After the first lookup for a bucket, subsequent lookups become HashMap.get() calls.
  2. NSSummaryTask.process() submits the three sub-tasks to a 3-thread pool and joins on all three. The threads see the same event list; each only processes events whose (table, bucket layout) matches its target. Target NSSummary entries are disjoint across sub-tasks so no cross-thread synchronization is needed, and the TaskResult contract is unchanged.
  3. The OBS UPDATE path drops a redundant getKeyParentID(oldKeyInfo) call. The parent of an OBS key is its bucket, and an UPDATE event cannot move a key between buckets.
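To illustrate change 1, here is a minimal sketch of a per-handler bucket cache. It is not the code in this patch: `BucketInfoSketch`, `BucketInfoCacheSketch`, and the `dbLookup` function are hypothetical stand-ins for OmBucketInfo, the handler's field-level Map, and the getSkipCache(...) point read. The sketch assumes each handler instance is used by a single thread, which is why a plain HashMap is used.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for OmBucketInfo: only the two fields the
// description says never change for a bucket.
final class BucketInfoSketch {
  final String layout;
  final long objectId;

  BucketInfoSketch(String layout, long objectId) {
    this.layout = layout;
    this.objectId = objectId;
  }
}

// Hypothetical cache, assumed to be owned by one handler and one thread.
final class BucketInfoCacheSketch {
  private final Map<String, BucketInfoSketch> cache = new HashMap<>();
  // Stand-in for the RocksDB point read; an assumption for illustration.
  private final Function<String, BucketInfoSketch> dbLookup;
  private int dbReads;

  BucketInfoCacheSketch(Function<String, BucketInfoSketch> dbLookup) {
    this.dbLookup = dbLookup;
  }

  BucketInfoSketch get(String bucketDbKey) {
    BucketInfoSketch info = cache.get(bucketDbKey);
    if (info == null) {
      // First lookup for this bucket: pay for the DB read once.
      dbReads++;
      info = dbLookup.apply(bucketDbKey);
      if (info != null) {
        cache.put(bucketDbKey, info);
      }
    }
    return info;
  }

  // Number of times the backing lookup was actually invoked.
  int dbReads() {
    return dbReads;
  }
}
```

With this shape, repeated events for the same bucket cost one backing read plus map lookups, regardless of how many events the batch contains.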

Throughput

Intel Xeon Silver 4416+ (40 cores / 80 threads), OpenJDK 17, 500k events plus 500k preloaded keys, RATIS replication, mixed 60/30/10 create/update/delete:

| Code               | events/sec | vs vanilla |
| ------------------ | ----------:| ----------:|
| Vanilla            |     78,098 |      1.00x |
| + change 1 (cache) |    672,172 |      8.61x |
| + changes 1 and 2  |    918,550 |     11.76x |

Change 1 is the dominant lever: it removes about 1.5M getSkipCache(bucketDBKey) RocksDB Gets per process() call (3 sub-task scans of 500k events, each scan doing one or more bucket lookups before bailing or processing). Change 2 gives a further ~1.37x. Change 3 is below measurement noise.
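The ratios quoted above can be re-derived from the table with a few lines of arithmetic. The class and method names below are made up for this check. The read count takes exactly one bucket lookup per event per scan, whereas the text says "one or more", so 1.5M is the low end of the estimate.

```java
// Hypothetical helper for re-deriving the quoted numbers; not part of the patch.
final class ThroughputRatiosSketch {
  // Speedup of one measured throughput over another.
  static double speedup(double eventsPerSec, double baselineEventsPerSec) {
    return eventsPerSec / baselineEventsPerSec;
  }

  // Bucket reads per process() call, assuming one lookup per event per scan.
  static long bucketReads(int subTaskScans, long eventsPerScan) {
    return subTaskScans * eventsPerScan;
  }

  public static void main(String[] args) {
    System.out.printf("cache vs vanilla:    %.2fx%n", speedup(672_172, 78_098));
    System.out.printf("cache+pool vs vanilla: %.2fx%n", speedup(918_550, 78_098));
    System.out.printf("pool on top of cache: %.2fx%n", speedup(918_550, 672_172));
    System.out.println("bucket reads removed: " + bucketReads(3, 500_000));
  }
}
```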

Flame graphs for Change 1 (cache)

Before: three RocksDB get towers under NSSummaryTask.process:

[Flame graph: 1 BEFORE]

After: the three RocksDB get towers are gone. NSSummaryTask.process is not even visible at this zoom level:

[Flame graph: 2 AFTER]

Heap pressure

Heap pressure is reduced because change 1 stops allocating a transient OmBucketInfo per RocksDB Get. At 1M events / 1M preloaded keys with an 8 GB heap, total stop-the-world pause dropped 25% (1137 ms to 850 ms) and cumulative bytes reclaimed dropped 52% (522 GB to 249 GB) across the bench lifetime.

FSO-heavy workloads

On a 100% FSO workload (fileTable / dirTable / deletedDirTable), change 1 is a no-op because the FSO sub-task reads keyInfo.getParentObjectID() directly without a bucket lookup. Change 2 still saves the bail-loop cost of Legacy and OBS scanning the event list to skip at the table-name check, but that cost is small relative to FSO's own processing, so the wall-clock speedup on FSO-heavy workloads is correspondingly smaller. The patch is non-regressive in any case.
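The fan-out in change 2 and the bail-loop discussed here can be sketched as follows. This is an illustration under simplifying assumptions, not the patch itself: `EventSketch` and `ParallelSubTasksSketch` are hypothetical, each sub-task only counts matching events instead of updating NSSummary entries, and the FSO sub-task is reduced to a single table name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical event: just the table name and bucket layout used for routing.
final class EventSketch {
  final String table;
  final String layout;

  EventSketch(String table, String layout) {
    this.table = table;
    this.layout = layout;
  }
}

final class ParallelSubTasksSketch {
  // Each sub-task scans the whole list and skips events that are not its own.
  static Callable<Integer> subTask(List<EventSketch> events, String table, String layout) {
    return () -> {
      int handled = 0;
      for (EventSketch e : events) {
        if (!e.table.equals(table) || !e.layout.equals(layout)) {
          continue; // the cheap bail path for another sub-task's events
        }
        handled++; // real code would update this sub-task's own entries
      }
      return handled;
    };
  }

  // Returns handled-event counts in the order FSO, LEGACY, OBS.
  static int[] process(List<EventSketch> events) {
    ExecutorService pool = Executors.newFixedThreadPool(3);
    try {
      List<Callable<Integer>> tasks = new ArrayList<>();
      tasks.add(subTask(events, "fileTable", "FILE_SYSTEM_OPTIMIZED"));
      tasks.add(subTask(events, "keyTable", "LEGACY"));
      tasks.add(subTask(events, "keyTable", "OBJECT_STORE"));
      // invokeAll blocks until all three finish and preserves task order.
      List<Future<Integer>> futures = pool.invokeAll(tasks);
      int[] counts = new int[futures.size()];
      for (int i = 0; i < counts.length; i++) {
        counts[i] = futures.get(i).get();
      }
      return counts;
    } catch (InterruptedException | ExecutionException ex) {
      throw new IllegalStateException("sub-task failed", ex);
    } finally {
      pool.shutdown();
    }
  }
}
```

On an all-FSO event list, the LEGACY and OBS callables in this sketch do nothing but the skip check, which is the small cost that running them concurrently hides.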

Reproduction

The reproduction harness (NSSummaryProcessTimingTest under -Pbench) is provided as a patch on the JIRA.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-15335

How was this patch tested?

  • All existing TestNSSummaryTask* unit tests pass
  • Two regression tests are added to TestNSSummaryTask: one exercises the OBS sub-task path end-to-end (previously only FSO + Legacy events were sent through process()), and one asserts the returned TaskResult reports success and contains a seek position for each of FSO, LEGACY, and OBS.

…cketInfo lookups.

NSSummaryTask.process() processes every batch of OM update events Recon
ingests. On keyTable workloads (LEGACY or OBJECT_STORE bucket layout)
it has two avoidable costs: every event triggers a fresh
getBucketTable().getSkipCache(...) RocksDB point read even though
bucket layout and objectID never change; and the three sub-tasks
(FSO / Legacy / OBS) iterate the event list sequentially even though
they operate on disjoint slices and write to disjoint NSSummary
entries.

This patch makes three changes:

  1. NSSummaryTaskDbEventHandler caches OmBucketInfo lookups in a
     field-level Map. After the first lookup for a bucket, subsequent
     lookups become HashMap.get() calls.

  2. NSSummaryTask.process() submits the three sub-tasks to a 3-thread
     pool and joins on all three. The threads see the same event list;
     each only processes events whose (table, bucket layout) matches
     its target. Target NSSummary entries are disjoint across
     sub-tasks so no cross-thread synchronization is needed, and the
     TaskResult contract is unchanged.

  3. The OBS UPDATE path drops a redundant getKeyParentID(oldKeyInfo)
     call: the parent of an OBS key is its bucket, and an UPDATE event
     cannot move a key between buckets.

Throughput on Intel Xeon Silver 4416+ (40 cores / 80 threads), OpenJDK
17, at 500k events plus 500k preloaded keys, RATIS replication, mixed
60/30/10 create/update/delete:

  | Code                       | events/sec | vs vanilla |
  | -------------------------- | ----------:| ----------:|
  | Vanilla                    |     78,098 |      1.00x |
  | + change 1 (cache)         |    672,172 |      8.61x |
  | + changes 1 and 2          |    918,550 |     11.76x |

Change 1 is the dominant lever: it removes about 1.5M
getSkipCache(bucketDBKey) RocksDB Gets per process() call (3 sub-task
scans of 500k events, each scan doing one or more bucket lookups
before bailing or processing). Change 2 gives a further ~1.37x.
Change 3 is below measurement noise.

Heap pressure is reduced because change 1 stops allocating a transient
OmBucketInfo per RocksDB Get. At 1M events / 1M preloaded keys with an
8 GB heap, total stop-the-world pause dropped 25% (1137 ms to 850 ms)
and cumulative bytes reclaimed dropped 52% (522 GB to 249 GB) across
the bench lifetime.

On a 100% FSO workload (fileTable / dirTable / deletedDirTable),
change 1 is a no-op because the FSO sub-task reads
keyInfo.getParentObjectID() directly without a bucket lookup. Change 2
still saves the bail-loop cost of Legacy and OBS scanning the event
list to skip at the table-name check, but that cost is small relative
to FSO's own processing, so the wall-clock speedup on FSO-heavy
workloads is correspondingly smaller. The patch is non-regressive in
any case.

The reproduction harness (NSSummaryProcessTimingTest under -Pbench) is
provided as a companion patch on this JIRA.

All existing TestNSSummaryTask* unit tests pass. Two regression tests
are added to TestNSSummaryTask: one exercises the OBS sub-task path
end-to-end (previously only FSO + Legacy events were sent through
process()), and one asserts the returned TaskResult reports success
and contains a seek position for each of FSO, LEGACY, and OBS.
@smengcl smengcl added the AI-gen label May 21, 2026