Skip to content

OAK-12216 : clear segment cache after successful compaction#2904

Closed
rishabhdaim wants to merge 4 commits into
trunkfrom
OAK-12216
Closed

OAK-12216 : clear segment cache after successful compaction#2904
rishabhdaim wants to merge 4 commits into
trunkfrom
OAK-12216

Conversation

@rishabhdaim
Copy link
Copy Markdown
Contributor

No description provided.

@rishabhdaim rishabhdaim self-assigned this May 15, 2026
@rishabhdaim
Copy link
Copy Markdown
Contributor Author

rishabhdaim commented May 18, 2026

Build is failing due to resource leakage in DocumentNodeStoreIT test. To be fixed by : #2905

@ChlineSaurus
Copy link
Copy Markdown
Contributor

What AI flagged (I'm not familiar with this part of the code, so I would need some time to read up on it, to give a proper review. Since this was flagged as the mosts risky PR, I'm still adding it):

After a successful compaction (Oak's GC), this PR throws away the entire L2 cache so the new generation of segments can fill it from scratch.

Why it's problematic:

  • It's redundant. The cleanup step that runs after every compaction already clears the cache. This PR adds a second clear earlier in the pipeline. Net result: two cold-cache windows where there used to
    be one. That's strictly worse for hit rate.
  • It directly contradicts existing code. Another compaction strategy has an explicit comment: "we don't clear the segment cache. There might be transient segments in there." The PR ignores this.
  • Wrong order. It clears L2 before the listener that flushes old-generation segments from the writer cache. In the gap, concurrent commits can refill L2 with the very old-gen segments we just tried to
    evict.
  • Partial-success path clears too. Even when compaction only partly succeeded, it still wipes the cache — including the segments that were just committed and about to be read.
  • Races with concurrent putSegment. Clearing while a writer is mid-insert can leave L1 references pointing at "unloaded" state.

@rishabhdaim
Copy link
Copy Markdown
Contributor Author

rishabhdaim commented May 18, 2026

thanks @ChlineSaurus for the review. I have asked re-review for the AI suggested review and below are the findings:

Verdict on each AI claim

Claim 1: "Cleanup already clears the cache — this adds a redundant second cold-cache window"

Mostly correct, but overstated. DefaultCleanupStrategy.cleanup() (line 41) does unconditionally call context.getSegmentCache().clear(). The PR adds a second clear in notifyCompactionSucceeded →
AbstractCompactionStrategy.java:82
. So there are now two clears per GC cycle.

However "strictly worse" overstates it. The first clear's intent is legitimate: evict old-gen segments before new-gen reads start filling the cache, so new-gen content doesn't compete with stale entries. The
problem is that cleanup immediately follows and clears everything again anyway — so anything loaded between the two clears is thrown away. This is a real but mild design issue, not catastrophic.

Claim 2: "Contradicts CleanupFirstCompactionStrategy's 'transient segments' comment"

Misleading. That comment is in CleanupFirstCompactionStrategy.java:60 and refers to the pre-compaction cleanup phase — when the new generation hasn't been committed yet and transient (in-flight) segments exist.
The PR's clear happens after setHead() committed the new generation. These are different phases with different semantics. The "transient segments" concern simply doesn't apply post-commit.

Claim 3: "Wrong order — clears L2 before L1 old-gen eviction"

Technically true but practically negligible. The cacheManager.evictOldGeneration() call happens inside compactionSucceeded(), which fires after the new code's cache clear. But after setHead() commits the new
generation, concurrent writers already use the new generation. The window between the two operations where old-gen segments could sneak back into L2 is real but vanishingly small.

Claim 4: "Partial-success path clears too"

Valid. notifyCompactionSucceeded() is called from both compactionSucceeded() (line 288) and compactionPartiallySucceeded() (line 292). For partial success, segments from the new partial generation were just
committed and are about to be read — clearing them from L2 hurts without clear benefit. This is analogous to why CleanupFirstCompactionStrategy explicitly avoids clearing: those partially-compacted segments are
live data. This is the strongest point in the review.

Claim 5: "Race between clear() and concurrent putSegment()"

Overstated. The putSegment() implementation calls id.loaded(segment) before cache.put(id, segment). If a concurrent clear() fires between those two calls, the L1 memoization (SegmentId.segment) gets cleared by
onRemove() → id.unloaded(). This results in an L1 miss on the next access, not data corruption — the segment is still loaded from L2 or disk correctly. Moreover, this same race already existed before this PR from
the cleanup's clear(). Not a new issue.

P.S. : The one claim worth acting on is 4: notifyCompactionSucceeded() should probably skip the cache clear when !compacted.isComplete() (partial success).
The double-clear concern (1) is also worth discussing with reviewers — either justify why the early clear adds value despite cleanup doing the same, or remove it and rely solely on cleanup's existing clear.

addressed this here : d2bdaea

Copy link
Copy Markdown
Contributor

@jsedding jsedding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am skeptical that clearing the cache after compaction is a good idea.

On a running system, there may be open sessions that reference a pre-compacted head state. Clearing the cache may therefore cause an unexpected surge in loads of old segments. Given that all segments from the new compacted generation need to also get into the cache at the same time, I believe that this may cause more trouble than it's worth.

I would expect Caffeine's normal eviction mechanism to take care of the GC generation change-over in a smoother fashion than what this change could achieve.

@rishabhdaim
Copy link
Copy Markdown
Contributor Author

@jsedding Yes, I agree with you on this.

Also, based on the benchmarks, the improvement is only for first few thousand entries only.
After sometime, caffeine, starts performing better than Guava even without changes.

rishabhdaim and others added 4 commits May 19, 2026 09:43
After compaction, old-generation segments remain in the Caffeine cache with
saturated W-TinyLFU frequency counts (freq=15). New-generation segments
enter at freq=0 and are auto-rejected by the admission gate (threshold=5),
causing a read freeze until they accumulate enough frequency.

Calling cache.clear() immediately after compaction evicts old-generation
incumbents so new-generation segments can fill the cache directly without
competing against saturated sketch counts.

Change is guarded by FeatureToggle FT_CLEAR_CACHE_OAK-12216 (enabled by
default — bug-fix behaviour). Disable at runtime if needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces the standalone FeatureToggle field with a public AtomicBoolean
(FT_OAK_12216_ENABLE) as the shared state, and a string constant for the
toggle name (FT_CLEAR_CACHE_OAK_12216).  SegmentNodeStoreRegistrar creates
a FeatureToggle wrapping that AtomicBoolean and registers it on the
Whiteboard so runtime tooling (JMX) can discover and flip it without a
code change.  notifyCompactionSucceeded() reads FT_OAK_12216_ENABLE.get()
directly for zero-overhead access on the hot compaction path.

AbstractCompactionStrategy visibility widened to public so the registrar
(different package) can reference the constants.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@sonarqubecloud
Copy link
Copy Markdown

@rishabhdaim
Copy link
Copy Markdown
Contributor Author

As discussed with @jsedding we won't be fixing this.

@rishabhdaim rishabhdaim deleted the OAK-12216 branch May 19, 2026 07:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants