Add DD_TOLERATE_UNCOALESCED_KRM knob to tolerate uncoalesced KRM entries#12934
Add DD_TOLERATE_UNCOALESCED_KRM knob to tolerate uncoalesced KRM entries#12934saintstack wants to merge 6 commits intoapple:release-7.3from
Conversation
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
gxglass
left a comment
There was a problem hiding this comment.
I haven't looked closely yet. Below is what Claude said. I do agree with the suggestion to rename the flag to say "TOLERATE" and not to say that coalescing is happening.
⏺ Analysis of PR #12934: KRM Uncoalesced Entry Tolerance
Goal
This diff addresses a production crash scenario where FoundationDB's KeyRangeMap (KRM) metadata has become uncoalesced — meaning adjacent entries in the KRM carry the same value when they shouldn't (they should have been merged into a single range). When this happens,
krmSetRangeCoalescing_ hits this assertion:
ASSERT(value != endValue || endKey == maxWithPrefix.end); // line 297
…and the process crashes. The diff introduces a knob DD_COALESCE_UNCOALESCED_KRM (default: false) that, when enabled, converts these crashes into logged warnings that skip the problematic operation instead.
Two call sites are patched:
- krmSetRangeCoalescing_ in KeyRangeMap.actor.cpp — the generic KRM range-set function
- unassignServerKeys in MoveKeys.actor.cpp — the server-key-unassignment loop
Risk Assessment
- The "skip and return early" in krmSetRangeCoalescing_ is the biggest concern.
When uncoalesced entries are detected, the function does return Void() — no clear, no set, nothing. This means the caller's intended metadata mutation is silently dropped. For data distribution (DD), this means shard assignments, moves, or removals may not be recorded. That could
lead to:
- Shards that DD thinks it moved but didn't update metadata for
- Stuck data moves that never complete
- Accumulating stale metadata over time
The original ASSERT was a safety net saying "this state is impossible and we can't reason about correctness if it happens." Silently skipping preserves uptime but trades it for potentially accumulating inconsistent metadata state.
- The MoveKeys.actor.cpp change is more benign but still concerning.
Here the loop continues writing entries even when uncoalesced entries are detected (it just logs instead of crashing). The entries are still written, so the mutation proceeds. However, writing entries that perpetuate uncoalesced state means the problem doesn't self-heal — it just
propagates.
- Neither site attempts to actually fix/coalesce the entries.
Despite the knob name DD_COALESCE_UNCOALESCED_KRM, no actual coalescing is performed. The name is misleading. The code either skips the operation entirely or tolerates the bad state, but never repairs it. A more honest name would be DD_TOLERATE_UNCOALESCED_KRM.
- The return Void() early-exit changes the transaction's read-conflict ranges without writing.
The function has already added addReadConflictRange calls (lines 266-271 in the original) before hitting the new early-return. A transaction that reads but doesn't write is benign to FDB, but the caller likely expects that if the function returns successfully, the mutation was
applied. There's no way for the caller to know the operation was skipped.
Does It Solve Its Intended Problem?
Partially. It prevents the crash, which is the immediate operational need — a cluster in this state is stuck in a crash loop on the DD process. But:
- It doesn't repair the root cause (uncoalesced entries remain)
- The skipped mutations may cause downstream DD logic to misbehave
- It's a tourniquet, not a fix — appropriate for emergency use, but the cluster still needs manual intervention to clean up the uncoalesced KRM entries
For a production hotfix to unblock a crashed cluster, this is reasonable if treated as temporary.
Test Workload Assessment
The KRMCoalesceTest workload is a good smoke test — it manually injects uncoalesced entries, calls krmSetRangeCoalescing, and verifies no crash. However:
- It uses a non-system prefix (/krmcoalescetest/), so it doesn't exercise the real DD code paths that would call this in anger
- It only tests the krmSetRangeCoalescing_ path, not the unassignServerKeys path
- The "normal coalescing" test (testNormalCoalescing) verifies basic KRM operations still work, which is good regression coverage
- The test asserts entry count == 6 after injection but doesn't assert anything specific about post-skip state beyond "we got here"
Other Observations
- Knob is off by default — good. This is opt-in for clusters in trouble.
- No BUGGIFY randomization — the knob isn't randomized in simulation. This means simulation won't exercise the skip path unless the test explicitly enables it. You might want if (randomize && BUGGIFY) DD_COALESCE_UNCOALESCED_KRM = true; to get coverage.
- SevWarnAlways is appropriate — these are rare events that operators must see.
- The blank line addition at MoveKeys.actor.cpp:2815 is cosmetic noise.
- Missing #include risk — IKnobCollection.h is added to KeyRangeMap.actor.cpp. The function accesses knobs via the global collection (IKnobCollection::getGlobalKnobCollection().getServerKnobs().DD_COALESCE_UNCOALESCED_KRM) rather than the SERVER_KNOBS macro. This works but is
inconsistent with the MoveKeys.actor.cpp site which uses SERVER_KNOBS->DD_COALESCE_UNCOALESCED_KRM. The KeyRangeMap.actor.cpp site is in fdbclient/ which may not have direct access to SERVER_KNOBS — hence the longer path. This is a valid pattern but worth a comment explaining
why.
Summary
┌───────────────────┬───────────────────────────────────────────────────────────┐
│ Aspect │ Assessment │
├───────────────────┼───────────────────────────────────────────────────────────┤
│ Prevents crash │ Yes │
├───────────────────┼───────────────────────────────────────────────────────────┤
│ Fixes root cause │ No — uncoalesced entries persist │
├───────────────────┼───────────────────────────────────────────────────────────┤
│ Naming │ Misleading — "coalesce" implies repair, but it only skips │
├───────────────────┼───────────────────────────────────────────────────────────┤
│ Skip semantics │ Risky — silent mutation drop in krmSetRangeCoalescing_ │
├───────────────────┼───────────────────────────────────────────────────────────┤
│ Caller awareness │ None — callers can't tell the op was skipped │
├───────────────────┼───────────────────────────────────────────────────────────┤
│ Test coverage │ Adequate for smoke, doesn't test real DD paths │
├───────────────────┼───────────────────────────────────────────────────────────┤
│ Production safety │ Acceptable as emergency knob, not for permanent use │
└───────────────────┴───────────────────────────────────────────────────────────┘
Recommendation: This is a reasonable emergency escape valve. I'd suggest (a) renaming to DD_TOLERATE_UNCOALESCED_KRM, (b) adding a follow-up mechanism (e.g., a background actor) that actually repairs uncoalesced entries when detected, and (c) considering returning an error or
special status from krmSetRangeCoalescing_ rather than silently succeeding when the operation is skipped.
6b0c7f9 to
2254fde
Compare
|
Rebase and address review comments. |
Result of foundationdb-pr-73 on Linux RHEL 9
|
2 similar comments
Result of foundationdb-pr-73 on Linux RHEL 9
|
Result of foundationdb-pr-73 on Linux RHEL 9
|
Result of foundationdb-pr-73 on Linux RHEL 9
|
2 similar comments
Result of foundationdb-pr-73 on Linux RHEL 9
|
Result of foundationdb-pr-73 on Linux RHEL 9
|
Result of foundationdb-pr-73 on Linux RHEL 9
|
2 similar comments
Result of foundationdb-pr-73 on Linux RHEL 9
|
Result of foundationdb-pr-73 on Linux RHEL 9
|
|
What do you make of this? |
When enabled, logs a warning and skips instead of crashing with ASSERT on adjacent KRM entries with the same value. This allows DD to continue operating when metadata corruption (uncoalesced entries) is present, rather than crash-looping. Off by default. Enable with DD_COALESCE_UNCOALESCED_KRM=true when investigating or recovering from corrupted metadata. Includes KRMCoalesceTest workload and simulation test. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The knob does not actually coalesce entries — it tolerates them by skipping operations that would hit the ASSERT. Rename to accurately reflect behavior. Also fix comments that implied coalescing was happening and remove a spurious blank line addition.
2254fde to
6221635
Compare
|
@gxglass Thanks for pr12934_review.md. Rare but possible. Fixed.
|
There was a problem hiding this comment.
Pull request overview
Adds a new DD knob to tolerate (warn + continue) when uncoalesced KeyRangeMap (KRM) entries are detected, preventing DD crash-looping on corrupted metadata, and introduces a simulation workload to exercise the tolerant path.
Changes:
- Add
DD_TOLERATE_UNCOALESCED_KRMserver knob (default off) and use it to gate ASSERT-vs-warn behavior when uncoalesced KRM entries are detected. - Update
krmSetRangeCoalescingand DD’sunassignServerKeysto log warnings and continue when the knob is enabled. - Add
KRMCoalesceTestworkload + TOML config and wire it into the fast test suite.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/fast/KRMCoalesceTest.toml | Adds a fast simulation test config enabling the new knob and running the new workload. |
| tests/CMakeLists.txt | Registers the new fast test TOML so it runs in CI/sim. |
| fdbserver/workloads/KRMCoalesceTest.actor.cpp | New workload validating tolerant behavior when uncoalesced KRM entries exist. |
| fdbserver/MoveKeys.actor.cpp | Gates uncoalesced-KRM ASSERT behind the new knob; warns when enabled. |
| fdbclient/include/fdbclient/ServerKnobs.h | Declares the new server knob. |
| fdbclient/ServerKnobs.cpp | Initializes the new server knob (default off). |
| fdbclient/KeyRangeMap.actor.cpp | Adds knob-gated tolerance logic to krmSetRangeCoalescing. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| .detail("Key2", kvs[i + 1].key) | ||
| .detail("Value", kvs[i].value); | ||
| } else { | ||
| ASSERT(false); // Uncoalesced KRM entries detected; set DD_TOLERATE_UNCOALESCED_KRM=true to tolerate |
There was a problem hiding this comment.
When the tolerance knob is disabled, this now does ASSERT(false) without preserving the original assertion condition or emitting any SevError trace details. This is a regression in debuggability compared to the previous ASSERT(kvs[i].value != kvs[i + 1].value || atEnd) condition. Consider keeping the original ASSERT condition (and/or logging the keys/value before asserting) in the non-tolerating path.
| ASSERT(false); // Uncoalesced KRM entries detected; set DD_TOLERATE_UNCOALESCED_KRM=true to tolerate | |
| TraceEvent(SevError, "MoveKeysUncoalescedDetected", logId) | |
| .detail("SSID", ssId) | |
| .detail("Key1", kvs[i].key) | |
| .detail("Key2", kvs[i + 1].key) | |
| .detail("Value", kvs[i].value); | |
| ASSERT(kvs[i].value != kvs[i + 1].value || | |
| atEnd); // Uncoalesced KRM entries detected; set DD_TOLERATE_UNCOALESCED_KRM=true to tolerate |
| .detail("MaxEnd", maxWithPrefix.end) | ||
| .detail("Value", value); | ||
| } else { | ||
| ASSERT(false); // Uncoalesced KRM entries detected; set DD_TOLERATE_UNCOALESCED_KRM=true to tolerate |
There was a problem hiding this comment.
In the non-tolerating path this switches from ASSERT(value != endValue || endKey == maxWithPrefix.end) to ASSERT(false), which loses the original condition string and makes failures harder to diagnose. Consider asserting the original condition (possibly after a TraceEvent with details) so the crash output still reflects the invariant being violated.
| ASSERT(false); // Uncoalesced KRM entries detected; set DD_TOLERATE_UNCOALESCED_KRM=true to tolerate | |
| ASSERT(value != endValue || | |
| endKey == maxWithPrefix.end); // Uncoalesced KRM entries detected; set DD_TOLERATE_UNCOALESCED_KRM=true to tolerate |
| bool DD_VALIDATE_LOCALITY; | ||
| int DD_CHECK_INVALID_LOCALITY_DELAY; | ||
| bool DD_ENABLE_VERBOSE_TRACING; | ||
| bool DD_TOLERATE_UNCOALESCED_KRM; // If true, tolerate uncoalesced KRM entries instead of crashing with ASSERT |
There was a problem hiding this comment.
PR description mentions enabling the behavior with DD_COALESCE_UNCOALESCED_KRM=true, but the code/config knob added here is DD_TOLERATE_UNCOALESCED_KRM (toml: dd_tolerate_uncoalesced_krm). To avoid operational confusion during recovery, align the documentation/PR description with the actual knob name.
|
|
||
| tr->clear(KeyRangeRef(beginKey, endKey)); | ||
|
|
||
| // Write entries in pairs, checking for uncoalesced entries (adjacent with same value). |
There was a problem hiding this comment.
The new comment says “Write entries in pairs”, but the loop advances by 1 and writes overlapping adjacent pairs (i,i+1), so it’s not actually writing non-overlapping pairs. Consider rewording to clarify it’s checking adjacent entries for coalescing violations (or change the loop to step by 2 if that was the intent).
| // Write entries in pairs, checking for uncoalesced entries (adjacent with same value). | |
| // Write entries while checking adjacent entries for uncoalesced KRM values. |
Result of foundationdb-pr-clang-73 on Linux RHEL 9
|
Result of foundationdb-pr-73 on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests-73 on Linux RHEL 9
|
When enabled, logs a warning and skips instead of crashing with ASSERT on adjacent KRM entries with the same value. This allows DD to continue operating when metadata corruption (uncoalesced entries) is present, rather than crash-looping.
Off by default. Enable with DD_COALESCE_UNCOALESCED_KRM=true when investigating or recovering from corrupted metadata.
Includes KRMCoalesceTest workload and simulation test.
20260413-193715-stack-0d0a91bbfeaeb724 compressed=True data_size=50801557 duration=3500765 ended=100000 fail=1 fail_fast=10 max_runs=100000 pass=99999 priority=100 remaining=0 runtime=1:13:18 sanity=False started=100000 stopped=20260413-205033 submitted=20260413-193715 timeout=5400 username=stackThe failure is unrelated...
JoshuaSeed="1550387959184392139" WillRestart="0" NegativeTest="0" RandomSeed="1613085174" SourceVersion="871a6bc81881169b744bbdeb2c6bb530ca6a02ea" Time="1776109168" BuggifyEnabled="1" DeterminismCheck="0" FaultInjectionEnabled="1" TestFile="tests/slow/MetaclusterManagementConcurrency.toml"