Add DD init visibility, metrics retries, shard tracking, scan progress, and team collection logging#12913
Conversation
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
|
|
||
| add_library(${COMPILE_BOOST_TARGET} INTERFACE) | ||
| target_include_directories(${COMPILE_BOOST_TARGET} SYSTEM INTERFACE ${BOOST_INSTALL_DIR}/include) | ||
| target_compile_definitions(${COMPILE_BOOST_TARGET} INTERFACE BOOST_NO_CXX98_FUNCTION_BASE) |
There was a problem hiding this comment.
Trying to fix the failures here in CI with gcc 13
There was a problem hiding this comment.
Ok. Seems like branch-7.3 has issues in CI. Let me remove these attempts at fixing them and do as separate item.
There was a problem hiding this comment.
Yeah 7.3 never succeeds and always needs force-push.
I am not sure if this is worth pushing on. If it passes ctests and simulation for you I think that's fine.
There was a problem hiding this comment.
Ack. Removed these atttempts.
| TraceEvent(SevDebug, "WaitStorageMetricsHandleError").error(e); | ||
| if (e.code() == error_code_wrong_shard_server || e.code() == error_code_all_alternatives_failed) { | ||
| cx->invalidateCache(tenantInfo.prefix, keys); | ||
| retryCount++; |
There was a problem hiding this comment.
why not print this stuff outside the if block here? Change the SevDebug to be conditional on the amount of time spent, and just always print all of these details.
|
|
||
| wait(readyToStart); | ||
|
|
||
| for (; it != self->initData->dataMoveMap.ranges().end(); ++it) { |
There was a problem hiding this comment.
How long does this loop take? If it's non-trivial, instead of using in-memory int counters here, use SimplerCounter<int64_t> so we can watch the counters go up by consulting whatever the monitoring backend is (currently TraceEvent).
There was a problem hiding this comment.
Seconds? Runs once at startup. No SimplerCounter in 7.3. Let me log if > 30 seconds....
There was a problem hiding this comment.
Right, working on cherrypicking ff19220. I need to set up 7.3 development environment so don't wait for this
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
|
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
| // DDInitServerListAndDataMoveReadComplete - Server list + data moves read: NumDataMoves, NumServers, | ||
| // ElapsedSeconds DDInitKeyServerScanProgress - (every 30s) keyServer scan: BeginKey, Batches, ShardsScanned |
There was a problem hiding this comment.
looks like a wrapping mixup, ElapsedSeconds should be end of previous line?
When SHARD_ENCODE_LOCATION_METADATA=true we take new codepaths often opaque. Add logging. For example, DD init hung for 14-16 minutes with zero visibility into what was stuck. The only clue was a gap between DDInitUpdatedReplicaKeys and DDInitGotInitialDD trace events. Diagnosing the root cause required extensive log splunking of SS metrics to determine that a single getRange(dataMoveKeys) read was queued on an overloaded storage server. DDTxnProcessor.actor.cpp: - Log elapsed time for the server list + data move read transaction (DDInitServerListAndDataMoveReadComplete) with NumDataMoves, NumServers - Log elapsed time for the keyServer scan (DDInitKeyServerScanComplete) with NumShards - Warn when getRange(dataMoveKeys) takes >5 seconds (DDInitSlowDataMoveRead) DataDistribution.actor.cpp: - Add NumShards and NumServers to DDInitGotInitialDD - Promote DDInitFoundDataMove from SevDebug to SevInfo so individual data moves are visible in production logs - Add DDInitResumedDataMoves summary event with ValidMoves, CancelledMoves, EmptyMoves counts and elapsed time DDTeamCollection.actor.cpp: - Add Reason and Address details to UndesiredStorageServer trace events to distinguish version lag, same-address, wrong-class, and exclusion causes without needing to correlate with other log lines
…ress Additional logging to address DD operational opacity during startup, based on past incidents where DD hung with no visibility into the cause. NativeAPI.actor.cpp: - Log WaitStorageMetricsRetrying every 60s when waitStorageMetrics is stuck retrying wrong_shard_server or all_alternatives_failed, with the key range, retry count, and elapsed time. Previously these retries were silent (SevDebug only), making it impossible to identify which shard was stuck or that retries were even happening. DDShardTracker.actor.cpp: - Log TrackInitialShardsComplete after shard tracker setup with count - Log TrackInitialShardsMetricsComplete after changeSizes() finishes with elapsed time. changeSizes() waits for ALL shards to report metrics via getFirstSize/waitStorageMetrics -- if any shard metrics never arrive, this hangs silently. DDTxnProcessor.actor.cpp: - Log DDInitKeyServerScanProgress every 30s during the multi-transaction keyServer scan with current beginKey, batch count, shards scanned, and elapsed time. With 255K shards this scan requires many transactions and a stuck one was previously invisible. DataDistribution.actor.cpp: - Log DDInitComplete with elapsed time after DataDistributor::init() finishes, providing a single event showing total init duration.
FAILED: flow/CMakeFiles/flow.dir/ActorCollection.actor.g.cpp.o
/usr/local/bin/ccache /usr/local/bin/clang++ -DBOOST_ATOMIC_NO_LIB -DBOOST_CONTEXT_NO_LIB -DBOOST_ERROR_CODE_HEADER_ONLY -DBOOST_FILESYSTEM_NO_LIB -DBOOST_IOSTREAMS_NO_LIB -DBOOST_SERIALIZATION_NO_LIB -DBOOST_SYSTEM_NO_DEPRECATED -DBOOST_SYSTEM_NO_LIB -DNO_INTELLISENSE -DWITH_FOLLY_MEMCPY -DCOMPILATION_UNIT=/codebuild/output/src2645037596/src/github.com/apple/foundationdb/build_output/flow/ActorCollection.actor.g.cpp -I/codebuild/output/src2645037596/src/github.com/apple/foundationdb/flow/include -I/codebuild/output/src2645037596/src/github.com/apple/foundationdb/build_output/flow/include -I/codebuild/output/src2645037596/src/github.com/apple/foundationdb/contrib/stacktrace/include -I/codebuild/output/src2645037596/src/github.com/apple/foundationdb/contrib/fmt-8.1.1/include -I/codebuild/output/src2645037596/src/github.com/apple/foundationdb/contrib/SimpleOpt/include -I/codebuild/output/src2645037596/src/github.com/apple/foundationdb/contrib/crc32/include -I/codebuild/output/src2645037596/src/github.com/apple/foundationdb/contrib/folly_memcpy -isystem /opt/boost_1_78_0_clang/include -stdlib=libc++ -O3 -DNDEBUG -std=gnu++20 -DCMAKE_BUILD -fno-omit-frame-pointer -gdwarf-4 -ggdb1 -gz -Wall -Wextra -Wredundant-move -Wpessimizing-move -Woverloaded-virtual -Wshift-sign-overflow -Wno-sign-compare -Wno-undefined-var-template -Wno-unknown-warning-option -Wno-unused-parameter -Wno-constant-logical-operand -Wno-deprecated-copy -Wno-delete-non-abstract-non-virtual-dtor -Wno-range-loop-construct -Wno-reorder-ctor -Wno-unused-command-line-argument -Wno-ambiguous-reversed-operator -Wno-register -Werror -Wno-error=format -Wunused-variable -Wno-deprecated -fvisibility=hidden -Wreturn-type -fPIC -march=armv8.2-a+crc+simd -DHAVE_OPENSSL -Wno-unused-function -MD -MT flow/CMakeFiles/flow.dir/ActorCollection.actor.g.cpp.o -MF flow/CMakeFiles/flow.dir/ActorCollection.actor.g.cpp.o.d -o flow/CMakeFiles/flow.dir/ActorCollection.actor.g.cpp.o -c /codebuild/output/src2645037596/src/github.com/apple/foundationdb/build_output/flow/ActorCollection.actor.g.cpp
In file included from /codebuild/output/src2645037596/src/github.com/apple/foundationdb/flow/ActorCollection.actor.cpp:21:
In file included from /codebuild/output/src2645037596/src/github.com/apple/foundationdb/flow/include/flow/ActorCollection.h:25:
In file included from /codebuild/output/src2645037596/src/github.com/apple/foundationdb/flow/include/flow/flow.h:23:
In file included from /codebuild/output/src2645037596/src/github.com/apple/foundationdb/flow/include/flow/Arena.h:37:
In file included from /opt/boost_1_78_0_clang/include/boost/functional/hash.hpp:6:
/opt/boost_1_78_0_clang/include/boost/container_hash/hash.hpp:132:33: error: no template named 'unary_function' in namespace 'std'; did you mean '__unary_function'?
132 | struct hash_base : std::unary_function<T, std::size_t> {};
| ~~~~~^
/usr/local/bin/../include/c++/v1/__functional/unary_function.h:46:1: note: '__unary_function' declared here
46 | using __unary_function = __unary_function_keep_layout_base<_Arg, _Result>;
| ^
1 error generated.
- NativeAPI.actor.cpp: Move retry logging outside the error-type if block so all errors get keys/elapsed/retries details. Use severity upgrade (SevDebug -> SevWarn after 60s) on the existing WaitStorageMetricsHandleError event instead of a separate event name. - DataDistribution.actor.cpp: Add periodic progress logging (every 30s) in resumeFromDataMoves loop so operators can watch counts go up during long data move recovery. - CompileBoost.cmake: Remove BOOST_NO_CXX98_FUNCTION_BASE since 7.3 CI is broken independently of this change.
DD exits (e.g. movekeys_conflict) were invisible because reportErrorsExcept suppresses logging for "normal" DD errors. Add DDExiting trace event at SevWarn with error and code so every DD death is visible in trace logs.
a715dcb to
e95820e
Compare
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
|
@saintstack many checks are failing |
it's release-7.3, these are expected. @spraza could you also review. Would like to get this into the next patch for 7.3 |
…s, and team collection logging (apple#12913) Add logging throughout DD startup and runtime to diagnose slow startups, stuck data moves, and undesired server classification. DDTxnProcessor: Log elapsed time for server list + data move read transaction and keyServer scan. Warn when getRange(dataMoveKeys) takes over 5 seconds. DataDistribution: Add NumShards and NumServers to DDInitGotInitialDD. Add DDInitResumedDataMoves summary with ValidMoves, CancelledMoves, EmptyMoves counts and elapsed time. Log DD exit reason as DDExiting at SevWarn. Add DDInit-prefixed trace events throughout startup sequence. DDTeamCollection: Add Reason and Address details to UndesiredStorageServer trace events to distinguish version lag, same-address, wrong-class, and exclusion causes. DDShardTracker: Log TrackInitialShardsComplete with shard count and TrackInitialShardsMetricsComplete with elapsed time. NativeAPI: Add retry counting and logging for getStorageMetrics timeouts.
…s, and team collection logging (apple#12913) Add logging throughout DD startup and runtime to diagnose slow startups, stuck data moves, and undesired server classification. DDTxnProcessor: Log elapsed time for server list + data move read transaction and keyServer scan. Warn when getRange(dataMoveKeys) takes over 5 seconds. DataDistribution: Add NumShards and NumServers to DDInitGotInitialDD. Add DDInitResumedDataMoves summary with ValidMoves, CancelledMoves, EmptyMoves counts and elapsed time. Log DD exit reason as DDExiting at SevWarn. Add DDInit-prefixed trace events throughout startup sequence. DDTeamCollection: Add Reason and Address details to UndesiredStorageServer trace events to distinguish version lag, same-address, wrong-class, and exclusion causes. DDShardTracker: Log TrackInitialShardsComplete with shard count and TrackInitialShardsMetricsComplete with elapsed time. NativeAPI: Add retry counting and logging for getStorageMetrics timeouts.
…nosing slow startups (#13002) * Add DD init visibility, metrics retries, shard tracking, scan progress, and team collection logging (#12913) Add logging throughout DD startup and runtime to diagnose slow startups, stuck data moves, and undesired server classification. DDTxnProcessor: Log elapsed time for server list + data move read transaction and keyServer scan. Warn when getRange(dataMoveKeys) takes over 5 seconds. DataDistribution: Add NumShards and NumServers to DDInitGotInitialDD. Add DDInitResumedDataMoves summary with ValidMoves, CancelledMoves, EmptyMoves counts and elapsed time. Log DD exit reason as DDExiting at SevWarn. Add DDInit-prefixed trace events throughout startup sequence. DDTeamCollection: Add Reason and Address details to UndesiredStorageServer trace events to distinguish version lag, same-address, wrong-class, and exclusion causes. DDShardTracker: Log TrackInitialShardsComplete with shard count and TrackInitialShardsMetricsComplete with elapsed time. NativeAPI: Add retry counting and logging for getStorageMetrics timeouts. * Remove double trace * Fix DD trace event issues: rate-limit warns, deduplicate events, fix DDExiting - Wire up unused lastLogTime in waitStorageMetrics to rate-limit SevWarn events to once per 10s after the 60s threshold. Previously every retry (up to 100/s with 10ms WRONG_SHARD_SERVER_DELAY) emitted SevWarn. - Remove redundant DataDistributorRunning and DDInitRunning events inside the try block. Rename the original DataDistributorRunning to DDInitRunning so a single event serves both purposes and fits the DDInit* query pattern. - Move DDExiting to the three actual throw sites so it only fires on terminal exits, not on retries of movekeys_conflict/dd_config_changed when DD is disabled. * Trim over-apologetic DDInitDone comment
Add DD init and team collection logging for diagnosing slow startups. In particular, when SHARD_ENCODE_LOCATION_METADATA=true we take new codepaths.
Also add DD startup visibility: metrics retries, shard tracking, scan progress to address DD operational opacity during startup.
Pushing on branch 7.3 first and will then forward port.
20260401-150853-stack-5ab41795f72e6586 compressed=True data_size=50771955 duration=4116809 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:48:12 sanity=False started=100000 stopped=20260401-155705 submitted=20260401-150853 timeout=5400 username=stack