Miscellaneous DD fixes in 7.3 #12682
Result of foundationdb-pr-clang-arm on Linux CentOS 7
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
Result of foundationdb-pr on Linux RHEL 9
Result of foundationdb-pr-clang on Linux RHEL 9
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
Result of foundationdb-pr-macos on macOS Ventura 13.x
@saintstack I have some questions here:
It's what I did to compile 7.3; not related to my changes. Seeing if others have input.
CI doesn't have the -Wno-* flags, so the compile fails there because warnings are treated as errors.
Internal. Couldn't compile. The distcc cache seemed to be giving me wrong compiles.
cmk to generate gcc compile.
For the last push:
…d handling tool.

Bug fixes:
- Detect peer disconnect in waitValueOrSignal (genericactors.actor.h). Adds a when() clause watching peer->disconnect so dead connections (e.g., from NAT timeouts) are detected immediately instead of hanging indefinitely waiting on a connection the lower layer has already replaced. We saw this in an incident where a request was waiting on a long reply over a network with frequent disconnects: low-level fdb would make a new connection, but the high-level wait continued until it timed out on the original.
- Add DD_COALESCE_UNCOALESCED_KRM knob to tolerate uncoalesced KRM entries (KeyRangeMap.actor.cpp, MoveKeys.actor.cpp). When enabled, logs a warning and skips instead of crashing with ASSERT on adjacent entries with the same value. Off by default.

Tests:
- Unit tests for waitValueOrSignal peer disconnect detection
- KRMCoalesceTest workload and toml for testing uncoalesced KRM tolerance
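To make the peer-disconnect fix easier to picture, here is a minimal sketch in plain C++17 (not Flow, and not the code in genericactors.actor.h). It models the wait as a readiness check over its branches. `PendingRequest`, `WaitOutcome`, `pollWait`, and the `watchDisconnect` flag are hypothetical names introduced only for illustration; the real change adds a when() clause to a Flow choose block.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical outcome type; the names are illustrative, not FoundationDB's.
enum class WaitOutcome { Reply, SignalFired, PeerDisconnected };

// State that the branches of the wait observe (assumed shape, for illustration).
struct PendingRequest {
    std::optional<std::string> reply; // set once the reply arrives
    bool signalFired = false;         // the signal the wait already watched
    bool peerDisconnected = false;    // the disconnect the fix starts watching
};

// Returns the first ready branch, or std::nullopt if the wait would keep blocking.
// watchDisconnect == false models the old behaviour; true models the fix.
std::optional<WaitOutcome> pollWait(const PendingRequest& p, bool watchDisconnect) {
    if (p.reply.has_value()) return WaitOutcome::Reply;
    if (p.signalFired) return WaitOutcome::SignalFired;
    if (watchDisconnect && p.peerDisconnected) return WaitOutcome::PeerDisconnected;
    return std::nullopt; // nothing ready: the caller keeps waiting
}
```

With `peerDisconnected` set and no reply, `pollWait(p, false)` reports nothing ready (the wait that only ended on timeout in the incident), while `pollWait(p, true)` returns `WaitOutcome::PeerDisconnected` right away.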
TraceEvent(SevDebug, "WaitStorageMetricsHandleError").error(e);
if (e.code() == error_code_wrong_shard_server || e.code() == error_code_all_alternatives_failed) {
    cx->invalidateCache(tenantInfo.prefix, keys);
    retryCount++;
Ok. Closing. This PR has been broken into smaller pieces so it is easier to mix and match what we want to commit to release-7.3. The added logging that was in this PR has been taken over by #12913, the coalescing tool by #12934, and the stuck retry by #12935. That leaves a hack where we timed out the wait after 15 minutes and returned a fake size if DD got stuck fetching shard sizes. That hack needs more work if it is to make it out to production (there are better ideas: put aside problem shards so DD can progress rather than faking data, etc.).
Refactored. #12913 adds the logging this PR was adding, and then some. Now this PR adds just two things: a bug fix and a knob-protected KRM coalescing tool. (The facility where DD startup would wait on shard size for 15 minutes and, if there was no answer, return a 'faked' answer has been dropped.)
- Detect peer disconnect in waitValueOrSignal (genericactors.actor.h).
Adds a when() clause watching peer->disconnect so dead connections
(e.g., from NAT timeouts) are detected immediately instead of hanging
indefinitely waiting on a connection the lower layer has already replaced.
We saw this in an incident where a request was waiting on a long reply
over a network with frequent disconnects: low-level fdb would make a new
connection, but the high-level wait continued until it timed out on the
original.
- Add DD_COALESCE_UNCOALESCED_KRM knob to tolerate uncoalesced KRM entries
(KeyRangeMap.actor.cpp, MoveKeys.actor.cpp). When enabled, logs a warning
and skips instead of crashing with ASSERT on adjacent entries with the
same value. Off by default.
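
The "warn and skip instead of ASSERT" behaviour can be sketched roughly as follows. This is a simplified plain C++ model, not the code in KeyRangeMap.actor.cpp or MoveKeys.actor.cpp: `Knobs`, `Boundary`, and `readCoalesced` are hypothetical names, the exception stands in for the ASSERT, and the stderr line stands in for the logged warning.

```cpp
#include <cassert>
#include <iostream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Illustrative stand-in for the DD_COALESCE_UNCOALESCED_KRM knob (off by default).
struct Knobs {
    bool coalesceUncoalescedKrm = false;
};

// Each entry marks where a range begins and the value it carries until the next entry.
using Boundary = std::pair<std::string, std::string>;

// Walks boundary entries in key order. Two adjacent entries with the same value are
// "uncoalesced". With the knob off this throws (modelling the ASSERT); with the knob
// on it prints a warning and skips the redundant entry.
std::vector<Boundary> readCoalesced(const std::vector<Boundary>& entries, const Knobs& knobs) {
    std::vector<Boundary> out;
    for (const auto& e : entries) {
        if (!out.empty() && out.back().second == e.second) {
            if (!knobs.coalesceUncoalescedKrm) {
                throw std::logic_error("uncoalesced KRM entry at key '" + e.first + "'");
            }
            std::cerr << "warning: skipping uncoalesced KRM entry at key '" << e.first << "'\n";
            continue;
        }
        out.push_back(e);
    }
    return out;
}
```

For the input `{"", "a"}, {"b", "a"}, {"c", "x"}`, the default knob setting throws on the entry at "b"; with the knob enabled the same input yields two entries, beginning at "" and "c".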