HDDS-15335. Recon: parallelize NSSummaryTask sub-tasks and cache OmBucketInfo lookups#10321
Draft
smengcl wants to merge 3 commits into
Draft
HDDS-15335. Recon: parallelize NSSummaryTask sub-tasks and cache OmBucketInfo lookups#10321smengcl wants to merge 3 commits into
smengcl wants to merge 3 commits into
Conversation
…cketInfo lookups.
NSSummaryTask.process() processes every batch of OM update events Recon
ingests. On keyTable workloads (LEGACY or OBJECT_STORE bucket layout)
it has two avoidable costs: every event triggers a fresh
getBucketTable().getSkipCache(...) RocksDB point read even though
bucket layout and objectID never change; and the three sub-tasks
(FSO / Legacy / OBS) iterate the event list sequentially even though
they operate on disjoint slices and write to disjoint NSSummary
entries.
This patch makes three changes:
1. NSSummaryTaskDbEventHandler caches OmBucketInfo lookups in a
field-level Map. After the first lookup for a bucket, subsequent
lookups become HashMap.get() calls.
2. NSSummaryTask.process() submits the three sub-tasks to a 3-thread
pool and joins on all three. The threads see the same event list;
each only processes events whose (table, bucket layout) matches
its target. Target NSSummary entries are disjoint across
sub-tasks so no cross-thread synchronization is needed, and the
TaskResult contract is unchanged.
3. The OBS UPDATE path drops a redundant getKeyParentID(oldKeyInfo)
call: the parent of an OBS key is its bucket, and an UPDATE event
cannot move a key between buckets.
Throughput on Intel Xeon Silver 4416+ (40 cores / 80 threads), OpenJDK
17, at 500k events plus 500k preloaded keys, RATIS replication, mixed
60/30/10 create/update/delete:
| Code | events/sec | vs vanilla |
| -------------------------- | ----------:| ----------:|
| Vanilla | 78,098 | 1.00x |
| + change 1 (cache) | 672,172 | 8.61x |
| + changes 1 and 2 | 918,550 | 11.76x |
Change 1 is the dominant lever: it removes about 1.5M
getSkipCache(bucketDBKey) RocksDB Gets per process() call (3 sub-task
scans of 500k events, each scan doing one or more bucket lookups
before bailing or processing). Change 2 gives a further ~1.37x.
Change 3 is below measurement noise.
Heap pressure is reduced because change 1 stops allocating a transient
OmBucketInfo per RocksDB Get. At 1M events / 1M preloaded keys with an
8 GB heap, total stop-the-world pause dropped 25% (1137 ms to 850 ms)
and cumulative bytes reclaimed dropped 52% (522 GB to 249 GB) across
the bench lifetime.
On a 100% FSO workload (fileTable / dirTable / deletedDirTable),
change 1 is a no-op because the FSO sub-task reads
keyInfo.getParentObjectID() directly without a bucket lookup. Change 2
still saves the bail-loop cost of Legacy and OBS scanning the event
list to skip at the table-name check, but that cost is small relative
to FSO's own processing, so the wall-clock speedup on FSO-heavy
workloads is correspondingly smaller. The patch is non-regressive in
any case.
The reproduction harness (NSSummaryProcessTimingTest under -Pbench) is
provided as a companion patch on this JIRA.
All existing TestNSSummaryTask* unit tests pass. Two regression tests
are added to TestNSSummaryTask: one exercises the OBS sub-task path
end-to-end (previously only FSO + Legacy events were sent through
process()), and one asserts the returned TaskResult reports success
and contains a seek position for each of FSO, LEGACY, and OBS.
…bEventHandler.lookupBucketCached.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This started out as digging what bottlenecks Recon currently has, then turned into focusing on NSSummaryTask and its benchmark. It has been a journey.
Generated-by: Claude Code (Opus 4.7 xhigh)
What changes were proposed in this pull request?
Background
NSSummaryTask.process()processes every batch of OM update events Recon ingests. On keyTable workloads (LEGACY or OBJECT_STORE bucket layout) it has two avoidable costs: every event triggers a freshgetBucketTable().getSkipCache(...)RocksDB point read even though bucket layout and objectID never change; and the three sub-tasks (FSO / Legacy / OBS) iterate the event list sequentially even though they operate on disjoint slices and write to disjoint NSSummary entries.Changes
NSSummaryTaskDbEventHandlercachesOmBucketInfolookups in a field-level Map. After the first lookup for a bucket, subsequent lookups becomeHashMap.get()calls.NSSummaryTask.process()submits the three sub-tasks to a 3-thread pool and joins on all three. The threads see the same event list; each only processes events whose(table, bucket layout)matches its target. Target NSSummary entries are disjoint across sub-tasks so no cross-thread synchronization is needed, and theTaskResultcontract is unchanged.getKeyParentID(oldKeyInfo)call. The parent of an OBS key is its bucket, and an UPDATE event cannot move a key between buckets.Throughput
Intel Xeon Silver 4416+ (40 cores / 80 threads), OpenJDK 17, 500k events plus 500k preloaded keys, RATIS replication, mixed 60/30/10 create/update/delete:
Change 1 is the dominant lever: it removes about 1.5M
getSkipCache(bucketDBKey)RocksDB Gets perprocess()call (3 sub-task scans of 500k events, each scan doing one or more bucket lookups before bailing or processing). Change 2 gives a further ~1.37x. Change 3 is below measurement noise.Flame graphs for Change 1 (cache)
Before: Three RocksDB get tower under
NSSummaryTask.process:After: Last three RocksDB get towers gone.
NSSummaryTask.processis not even visible at this zoom level:Heap pressure
Reduced because change 1 stops allocating a transient
OmBucketInfoper RocksDB Get. At 1M events / 1M preloaded keys with an 8 GB heap, total stop-the-world pause dropped 25% (1137 ms to 850 ms) and cumulative bytes reclaimed dropped 52% (522 GB to 249 GB) across the bench lifetime.FSO-heavy workloads
On a 100% FSO workload (
fileTable/dirTable/deletedDirTable), change 1 is a no-op because the FSO sub-task readskeyInfo.getParentObjectID()directly without a bucket lookup. Change 2 still saves the bail-loop cost of Legacy and OBS scanning the event list to skip at the table-name check, but that cost is small relative to FSO's own processing, so the wall-clock speedup on FSO-heavy workloads is correspondingly smaller. The patch is non-regressive in any case.Reproduction
The reproduction harness (
NSSummaryProcessTimingTestunder-Pbench) is provided as a patch on the JIRA.What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-15335
How was this patch tested?
TestNSSummaryTask*unit tests passTestNSSummaryTask: one exercises the OBS sub-task path end-to-end (previously only FSO + Legacy events were sent throughprocess()), and one asserts the returnedTaskResultreports success and contains a seek position for each ofFSO,LEGACY, andOBS.