Skip to content

Add DD_TOLERATE_UNCOALESCED_KRM knob to tolerate uncoalesced KRM entries#12934

Open
saintstack wants to merge 6 commits intoapple:release-7.3from
saintstack:dd_7.3.coalesce_krm
Open

Add DD_TOLERATE_UNCOALESCED_KRM knob to tolerate uncoalesced KRM entries#12934
saintstack wants to merge 6 commits intoapple:release-7.3from
saintstack:dd_7.3.coalesce_krm

Conversation

@saintstack
Copy link
Copy Markdown
Contributor

@saintstack saintstack commented Apr 3, 2026

When enabled, logs a warning and skips instead of crashing with ASSERT on adjacent KRM entries with the same value. This allows DD to continue operating when metadata corruption (uncoalesced entries) is present, rather than crash-looping.

Off by default. Enable with DD_COALESCE_UNCOALESCED_KRM=true when investigating or recovering from corrupted metadata.

Includes KRMCoalesceTest workload and simulation test.

20260413-193715-stack-0d0a91bbfeaeb724 compressed=True data_size=50801557 duration=3500765 ended=100000 fail=1 fail_fast=10 max_runs=100000 pass=99999 priority=100 remaining=0 runtime=1:13:18 sanity=False started=100000 stopped=20260413-205033 submitted=20260413-193715 timeout=5400 username=stack

The failure is unrelated... JoshuaSeed="1550387959184392139" WillRestart="0" NegativeTest="0" RandomSeed="1613085174" SourceVersion="871a6bc81881169b744bbdeb2c6bb530ca6a02ea" Time="1776109168" BuggifyEnabled="1" DeterminismCheck="0" FaultInjectionEnabled="1" TestFile="tests/slow/MetaclusterManagementConcurrency.toml"

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 70744bf
  • Duration 0:04:07
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 70744bf
  • Duration 0:08:25
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 70744bf
  • Duration 0:08:27
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 70744bf
  • Duration 0:08:35
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 70744bf
  • Duration 0:34:10
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 70744bf
  • Duration 0:38:50
  • Result: ❌ FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

Copy link
Copy Markdown
Collaborator

@gxglass gxglass left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked closely yet. Below is what Claude said. I do agree with the suggestion to rename the flag to say "TOLERATE" and not to say that coalescing is happening.

⏺ Analysis of PR #12934: KRM Uncoalesced Entry Tolerance

Goal

This diff addresses a production crash scenario where FoundationDB's KeyRangeMap (KRM) metadata has become uncoalesced — meaning adjacent entries in the KRM carry the same value when they shouldn't (they should have been merged into a single range). When this happens,
krmSetRangeCoalescing_ hits this assertion:

ASSERT(value != endValue || endKey == maxWithPrefix.end); // line 297

…and the process crashes. The diff introduces a knob DD_COALESCE_UNCOALESCED_KRM (default: false) that, when enabled, converts these crashes into logged warnings that skip the problematic operation instead.

Two call sites are patched:

  1. krmSetRangeCoalescing_ in KeyRangeMap.actor.cpp — the generic KRM range-set function
  2. unassignServerKeys in MoveKeys.actor.cpp — the server-key-unassignment loop

Risk Assessment

  1. The "skip and return early" in krmSetRangeCoalescing_ is the biggest concern.

When uncoalesced entries are detected, the function does return Void() — no clear, no set, nothing. This means the caller's intended metadata mutation is silently dropped. For data distribution (DD), this means shard assignments, moves, or removals may not be recorded. That could
lead to:

  • Shards that DD thinks it moved but didn't update metadata for
  • Stuck data moves that never complete
  • Accumulating stale metadata over time

The original ASSERT was a safety net saying "this state is impossible and we can't reason about correctness if it happens." Silently skipping preserves uptime but trades it for potentially accumulating inconsistent metadata state.

  1. The MoveKeys.actor.cpp change is more benign but still concerning.

Here the loop continues writing entries even when uncoalesced entries are detected (it just logs instead of crashing). The entries are still written, so the mutation proceeds. However, writing entries that perpetuate uncoalesced state means the problem doesn't self-heal — it just
propagates.

  1. Neither site attempts to actually fix/coalesce the entries.

Despite the knob name DD_COALESCE_UNCOALESCED_KRM, no actual coalescing is performed. The name is misleading. The code either skips the operation entirely or tolerates the bad state, but never repairs it. A more honest name would be DD_TOLERATE_UNCOALESCED_KRM.

  1. The return Void() early-exit changes the transaction's read-conflict ranges without writing.

The function has already added addReadConflictRange calls (lines 266-271 in the original) before hitting the new early-return. A transaction that reads but doesn't write is benign to FDB, but the caller likely expects that if the function returns successfully, the mutation was
applied. There's no way for the caller to know the operation was skipped.


Does It Solve Its Intended Problem?

Partially. It prevents the crash, which is the immediate operational need — a cluster in this state is stuck in a crash loop on the DD process. But:

  • It doesn't repair the root cause (uncoalesced entries remain)
  • The skipped mutations may cause downstream DD logic to misbehave
  • It's a tourniquet, not a fix — appropriate for emergency use, but the cluster still needs manual intervention to clean up the uncoalesced KRM entries

For a production hotfix to unblock a crashed cluster, this is reasonable if treated as temporary.


Test Workload Assessment

The KRMCoalesceTest workload is a good smoke test — it manually injects uncoalesced entries, calls krmSetRangeCoalescing, and verifies no crash. However:

  • It uses a non-system prefix (/krmcoalescetest/), so it doesn't exercise the real DD code paths that would call this in anger
  • It only tests the krmSetRangeCoalescing_ path, not the unassignServerKeys path
  • The "normal coalescing" test (testNormalCoalescing) verifies basic KRM operations still work, which is good regression coverage
  • The test asserts entry count == 6 after injection but doesn't assert anything specific about post-skip state beyond "we got here"

Other Observations

  • Knob is off by default — good. This is opt-in for clusters in trouble.
  • No BUGGIFY randomization — the knob isn't randomized in simulation. This means simulation won't exercise the skip path unless the test explicitly enables it. You might want if (randomize && BUGGIFY) DD_COALESCE_UNCOALESCED_KRM = true; to get coverage.
  • SevWarnAlways is appropriate — these are rare events that operators must see.
  • The blank line addition at MoveKeys.actor.cpp:2815 is cosmetic noise.
  • Missing #include risk — IKnobCollection.h is added to KeyRangeMap.actor.cpp. The function accesses knobs via the global collection (IKnobCollection::getGlobalKnobCollection().getServerKnobs().DD_COALESCE_UNCOALESCED_KRM) rather than the SERVER_KNOBS macro. This works but is
    inconsistent with the MoveKeys.actor.cpp site which uses SERVER_KNOBS->DD_COALESCE_UNCOALESCED_KRM. The KeyRangeMap.actor.cpp site is in fdbclient/ which may not have direct access to SERVER_KNOBS — hence the longer path. This is a valid pattern but worth a comment explaining
    why.

Summary

┌───────────────────┬───────────────────────────────────────────────────────────┐
│ Aspect │ Assessment │
├───────────────────┼───────────────────────────────────────────────────────────┤
│ Prevents crash │ Yes │
├───────────────────┼───────────────────────────────────────────────────────────┤
│ Fixes root cause │ No — uncoalesced entries persist │
├───────────────────┼───────────────────────────────────────────────────────────┤
│ Naming │ Misleading — "coalesce" implies repair, but it only skips │
├───────────────────┼───────────────────────────────────────────────────────────┤
│ Skip semantics │ Risky — silent mutation drop in krmSetRangeCoalescing_ │
├───────────────────┼───────────────────────────────────────────────────────────┤
│ Caller awareness │ None — callers can't tell the op was skipped │
├───────────────────┼───────────────────────────────────────────────────────────┤
│ Test coverage │ Adequate for smoke, doesn't test real DD paths │
├───────────────────┼───────────────────────────────────────────────────────────┤
│ Production safety │ Acceptable as emergency knob, not for permanent use │
└───────────────────┴───────────────────────────────────────────────────────────┘

Recommendation: This is a reasonable emergency escape valve. I'd suggest (a) renaming to DD_TOLERATE_UNCOALESCED_KRM, (b) adding a follow-up mechanism (e.g., a background actor) that actually repairs uncoalesced entries when detected, and (c) considering returning an error or
special status from krmSetRangeCoalescing_ rather than silently succeeding when the operation is skipped.

@saintstack saintstack force-pushed the dd_7.3.coalesce_krm branch from 6b0c7f9 to 2254fde Compare April 14, 2026 18:58
@saintstack
Copy link
Copy Markdown
Contributor Author

Rebase and address review comments.

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-73 on Linux RHEL 9

  • Commit ID: 6b0c7f9
  • Duration 0:04:53
  • Result: ❌ FAILED
  • Error: git checkout failed with exit status 128: fatal: unable to read tree (6b0c7f9176e0f386952fccf5d37e583df02d33f0) for primary source and source version 6b0c7f9176e0f386952fccf5d37e583df02d33f0
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

2 similar comments
@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-73 on Linux RHEL 9

  • Commit ID: 6b0c7f9
  • Duration 0:04:53
  • Result: ❌ FAILED
  • Error: git checkout failed with exit status 128: fatal: unable to read tree (6b0c7f9176e0f386952fccf5d37e583df02d33f0) for primary source and source version 6b0c7f9176e0f386952fccf5d37e583df02d33f0
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-73 on Linux RHEL 9

  • Commit ID: 6b0c7f9
  • Duration 0:04:53
  • Result: ❌ FAILED
  • Error: git checkout failed with exit status 128: fatal: unable to read tree (6b0c7f9176e0f386952fccf5d37e583df02d33f0) for primary source and source version 6b0c7f9176e0f386952fccf5d37e583df02d33f0
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@saintstack saintstack changed the title Add DD_COALESCE_UNCOALESCED_KRM knob to tolerate uncoalesced KRM entries Add DD_TOLERATE_UNCOALESCED_KRM knob to tolerate uncoalesced KRM entries Apr 14, 2026
@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-73 on Linux RHEL 9

  • Commit ID: 2254fde
  • Duration 3:00:41
  • Result: ❌ FAILED
  • Error: Build has timed out.
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

2 similar comments
@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-73 on Linux RHEL 9

  • Commit ID: 2254fde
  • Duration 3:00:41
  • Result: ❌ FAILED
  • Error: Build has timed out.
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-73 on Linux RHEL 9

  • Commit ID: 2254fde
  • Duration 3:00:41
  • Result: ❌ FAILED
  • Error: Build has timed out.
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@saintstack saintstack closed this Apr 14, 2026
@saintstack saintstack reopened this Apr 14, 2026
@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-73 on Linux RHEL 9

  • Commit ID: 2254fde
  • Duration 1:31:04
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

2 similar comments
@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-73 on Linux RHEL 9

  • Commit ID: 2254fde
  • Duration 1:31:04
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-73 on Linux RHEL 9

  • Commit ID: 2254fde
  • Duration 1:31:04
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@gxglass
Copy link
Copy Markdown
Collaborator

gxglass commented Apr 20, 2026

What do you make of this?

pr12934_review.md

michael stack and others added 6 commits April 20, 2026 13:53
When enabled, logs a warning and skips instead of crashing with ASSERT
on adjacent KRM entries with the same value. This allows DD to continue
operating when metadata corruption (uncoalesced entries) is present,
rather than crash-looping.

Off by default. Enable with DD_COALESCE_UNCOALESCED_KRM=true when
investigating or recovering from corrupted metadata.

Includes KRMCoalesceTest workload and simulation test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The knob does not actually coalesce entries — it tolerates them by
skipping operations that would hit the ASSERT. Rename to accurately
reflect behavior. Also fix comments that implied coalescing was
happening and remove a spurious blank line addition.
@saintstack saintstack force-pushed the dd_7.3.coalesce_krm branch from 2254fde to 6221635 Compare April 20, 2026 20:53
@saintstack
Copy link
Copy Markdown
Contributor Author

@gxglass Thanks for pr12934_review.md. Rare but possible. Fixed.

20260420-174846-stack_centos7_all-ea4892122c072bbc compressed=True data_size=51178350 duration=4483617 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:51:56 sanity=False started=100000 stopped=20260420-184042 submitted=20260420-174846 timeout=5400 username=stack_centos7_all

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new DD knob to tolerate (warn + continue) when uncoalesced KeyRangeMap (KRM) entries are detected, preventing DD crash-looping on corrupted metadata, and introduces a simulation workload to exercise the tolerant path.

Changes:

  • Add DD_TOLERATE_UNCOALESCED_KRM server knob (default off) and use it to gate ASSERT-vs-warn behavior when uncoalesced KRM entries are detected.
  • Update krmSetRangeCoalescing and DD’s unassignServerKeys to log warnings and continue when the knob is enabled.
  • Add KRMCoalesceTest workload + TOML config and wire it into the fast test suite.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tests/fast/KRMCoalesceTest.toml Adds a fast simulation test config enabling the new knob and running the new workload.
tests/CMakeLists.txt Registers the new fast test TOML so it runs in CI/sim.
fdbserver/workloads/KRMCoalesceTest.actor.cpp New workload validating tolerant behavior when uncoalesced KRM entries exist.
fdbserver/MoveKeys.actor.cpp Gates uncoalesced-KRM ASSERT behind the new knob; warns when enabled.
fdbclient/include/fdbclient/ServerKnobs.h Declares the new server knob.
fdbclient/ServerKnobs.cpp Initializes the new server knob (default off).
fdbclient/KeyRangeMap.actor.cpp Adds knob-gated tolerance logic to krmSetRangeCoalescing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

.detail("Key2", kvs[i + 1].key)
.detail("Value", kvs[i].value);
} else {
ASSERT(false); // Uncoalesced KRM entries detected; set DD_TOLERATE_UNCOALESCED_KRM=true to tolerate
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the tolerance knob is disabled, this now does ASSERT(false) without preserving the original assertion condition or emitting any SevError trace details. This is a regression in debuggability compared to the previous ASSERT(kvs[i].value != kvs[i + 1].value || atEnd) condition. Consider keeping the original ASSERT condition (and/or logging the keys/value before asserting) in the non-tolerating path.

Suggested change
ASSERT(false); // Uncoalesced KRM entries detected; set DD_TOLERATE_UNCOALESCED_KRM=true to tolerate
TraceEvent(SevError, "MoveKeysUncoalescedDetected", logId)
.detail("SSID", ssId)
.detail("Key1", kvs[i].key)
.detail("Key2", kvs[i + 1].key)
.detail("Value", kvs[i].value);
ASSERT(kvs[i].value != kvs[i + 1].value ||
atEnd); // Uncoalesced KRM entries detected; set DD_TOLERATE_UNCOALESCED_KRM=true to tolerate

Copilot uses AI. Check for mistakes.
.detail("MaxEnd", maxWithPrefix.end)
.detail("Value", value);
} else {
ASSERT(false); // Uncoalesced KRM entries detected; set DD_TOLERATE_UNCOALESCED_KRM=true to tolerate
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the non-tolerating path this switches from ASSERT(value != endValue || endKey == maxWithPrefix.end) to ASSERT(false), which loses the original condition string and makes failures harder to diagnose. Consider asserting the original condition (possibly after a TraceEvent with details) so the crash output still reflects the invariant being violated.

Suggested change
ASSERT(false); // Uncoalesced KRM entries detected; set DD_TOLERATE_UNCOALESCED_KRM=true to tolerate
ASSERT(value != endValue ||
endKey == maxWithPrefix.end); // Uncoalesced KRM entries detected; set DD_TOLERATE_UNCOALESCED_KRM=true to tolerate

Copilot uses AI. Check for mistakes.
bool DD_VALIDATE_LOCALITY;
int DD_CHECK_INVALID_LOCALITY_DELAY;
bool DD_ENABLE_VERBOSE_TRACING;
bool DD_TOLERATE_UNCOALESCED_KRM; // If true, tolerate uncoalesced KRM entries instead of crashing with ASSERT
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR description mentions enabling the behavior with DD_COALESCE_UNCOALESCED_KRM=true, but the code/config knob added here is DD_TOLERATE_UNCOALESCED_KRM (toml: dd_tolerate_uncoalesced_krm). To avoid operational confusion during recovery, align the documentation/PR description with the actual knob name.

Copilot uses AI. Check for mistakes.

tr->clear(KeyRangeRef(beginKey, endKey));

// Write entries in pairs, checking for uncoalesced entries (adjacent with same value).
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new comment says “Write entries in pairs”, but the loop advances by 1 and writes overlapping adjacent pairs (i,i+1), so it’s not actually writing non-overlapping pairs. Consider rewording to clarify it’s checking adjacent entries for coalescing violations (or change the loop to step by 2 if that was the intent).

Suggested change
// Write entries in pairs, checking for uncoalesced entries (adjacent with same value).
// Write entries while checking adjacent entries for uncoalesced KRM values.

Copilot uses AI. Check for mistakes.
@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-73 on Linux RHEL 9

  • Commit ID: 6221635
  • Duration 0:44:17
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-73 on Linux RHEL 9

  • Commit ID: 6221635
  • Duration 0:57:50
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-cluster-tests-73 on Linux RHEL 9

  • Commit ID: 6221635
  • Duration 2:14:31
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants