Skip to content

Add DD init visibility, metrics retries, shard tracking, scan progress, and team collection logging#12913

Merged
spraza merged 13 commits intoapple:release-7.3from
saintstack:7.3_DD_init_logging
Apr 15, 2026
Merged

Add DD init visibility, metrics retries, shard tracking, scan progress, and team collection logging#12913
spraza merged 13 commits intoapple:release-7.3from
saintstack:7.3_DD_init_logging

Conversation

@saintstack
Copy link
Copy Markdown
Contributor

@saintstack saintstack commented Apr 1, 2026

Add DD init and team collection logging for diagnosing slow startups. In particular, when SHARD_ENCODE_LOCATION_METADATA=true we take new codepaths.

For example, DD init hung for 14-16 minutes with zero visibility into
what was stuck. The only clue was a gap between DDInitUpdatedReplicaKeys
and DDInitGotInitialDD trace events.  Diagnosing the root cause required
extensive log splunking of SS metrics to determine that a single
getRange(dataMoveKeys) read was queued on an overloaded storage server.

DDTxnProcessor.actor.cpp:
- Log elapsed time for the server list + data move read transaction
  (DDInitServerListAndDataMoveReadComplete) with NumDataMoves, NumServers
- Log elapsed time for the keyServer scan (DDInitKeyServerScanComplete)
  with NumShards
- Warn when getRange(dataMoveKeys) takes >5 seconds
  (DDInitSlowDataMoveRead)

DataDistribution.actor.cpp:
- Add NumShards and NumServers to DDInitGotInitialDD
- Promote DDInitFoundDataMove from SevDebug to SevInfo so individual
  data moves are visible in production logs
- Add DDInitResumedDataMoves summary event with ValidMoves,
  CancelledMoves, EmptyMoves counts and elapsed time

DDTeamCollection.actor.cpp:
- Add Reason and Address details to UndesiredStorageServer trace events
  to distinguish version lag, same-address, wrong-class, and exclusion
  causes without needing to correlate with other log lines

Also add DD startup visibility: metrics retries, shard tracking, scan progress to address DD operational opacity during startup.

NativeAPI.actor.cpp:
- Log WaitStorageMetricsRetrying every 60s when waitStorageMetrics is
  stuck retrying wrong_shard_server or all_alternatives_failed, with
  the key range, retry count, and elapsed time. Previously these retries
  were silent (SevDebug only), making it impossible to identify which
  shard was stuck or that retries were even happening.

DDShardTracker.actor.cpp:
- Log TrackInitialShardsComplete after shard tracker setup with count
- Log TrackInitialShardsMetricsComplete after changeSizes() finishes
  with elapsed time. changeSizes() waits for ALL shards to report
  metrics via getFirstSize/waitStorageMetrics -- if any shard metrics
  never arrive, this hangs silently.

DDTxnProcessor.actor.cpp:
- Log DDInitKeyServerScanProgress every 30s during the multi-transaction
  keyServer scan with current beginKey, batch count, shards scanned,
  and elapsed time. With 255K shards this scan requires many transactions
  and a stuck one was previously invisible.

DataDistribution.actor.cpp:
- Log DDInitComplete with elapsed time after DataDistributor::init()
  finishes, providing a single event showing total init duration.

Pushing on branch 7.3 first and will then forward port.

20260401-150853-stack-5ab41795f72e6586 compressed=True data_size=50771955 duration=4116809 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:48:12 sanity=False started=100000 stopped=20260401-155705 submitted=20260401-150853 timeout=5400 username=stack

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 65f74a9
  • Duration 0:04:02
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 65f74a9
  • Duration 0:07:46
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 65f74a9
  • Duration 0:08:06
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 65f74a9
  • Duration 0:08:17
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 65f74a9
  • Duration 0:33:18
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 65f74a9
  • Duration 0:39:28
  • Result: ❌ FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 6fad402
  • Duration 0:04:00
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 6fad402
  • Duration 0:08:42
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 6fad402
  • Duration 0:08:43
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 6fad402
  • Duration 0:09:00
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: ca49453
  • Duration 0:05:37
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: ca49453
  • Duration 0:08:29
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: ca49453
  • Duration 0:08:48
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: ca49453
  • Duration 0:09:09
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: ca49453
  • Duration 0:39:25
  • Result: ❌ FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

Comment thread cmake/CompileBoost.cmake Outdated

add_library(${COMPILE_BOOST_TARGET} INTERFACE)
target_include_directories(${COMPILE_BOOST_TARGET} SYSTEM INTERFACE ${BOOST_INSTALL_DIR}/include)
target_compile_definitions(${COMPILE_BOOST_TARGET} INTERFACE BOOST_NO_CXX98_FUNCTION_BASE)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this for?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to fix the failures here in CI with gcc 13

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Seems like branch-7.3 has issues in CI. Let me remove these attempts at fixing them and do as separate item.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah 7.3 never succeeds and always needs force-push.

I am not sure if this is worth pushing on. If it passes ctests and simulation for you I think that's fine.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack. Removed these atttempts.

Comment thread fdbclient/NativeAPI.actor.cpp Outdated
TraceEvent(SevDebug, "WaitStorageMetricsHandleError").error(e);
if (e.code() == error_code_wrong_shard_server || e.code() == error_code_all_alternatives_failed) {
cx->invalidateCache(tenantInfo.prefix, keys);
retryCount++;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not print this stuff outside the if block here? Change the SevDebug to be conditional on the amount of time spent, and just always print all of these details.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing...


wait(readyToStart);

for (; it != self->initData->dataMoveMap.ranges().end(); ++it) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How long does this loop take? If it's non-trivial, instead of using in-memory int counters here, use SimplerCounter<int64_t> so we can watch the counters go up by consulting whatever the monitoring backend is (currently TraceEvent).

Copy link
Copy Markdown
Contributor Author

@saintstack saintstack Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seconds? Runs once at startup. No SimplerCounter in 7.3. Let me log if > 30 seconds....

Copy link
Copy Markdown
Collaborator

@gxglass gxglass Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, working on cherrypicking ff19220. I need to set up 7.3 development environment so don't wait for this

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: aa44b89
  • Duration 0:04:10
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: aa44b89
  • Duration 0:08:32
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: aa44b89
  • Duration 0:08:38
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: aa44b89
  • Duration 0:08:46
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: aa44b89
  • Duration 0:33:44
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: aa44b89
  • Duration 0:39:07
  • Result: ❌ FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 0de426a
  • Duration 0:04:12
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@saintstack
Copy link
Copy Markdown
Contributor Author

20260402-184830-stack-ee6cb149c8086d9b compressed=True data_size=50771515 duration=4111560 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:48:28 sanity=False started=100000 stopped=20260402-193658 submitted=20260402-184830 timeout=5400 username=stack

@saintstack saintstack requested a review from spraza April 2, 2026 19:45
@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 0de426a
  • Duration 0:08:23
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 0de426a
  • Duration 0:08:26
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 0de426a
  • Duration 0:08:36
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 0de426a
  • Duration 0:34:21
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

Comment thread fdbserver/DataDistribution.actor.cpp Outdated
Comment on lines +516 to +517
// DDInitServerListAndDataMoveReadComplete - Server list + data moves read: NumDataMoves, NumServers,
// ElapsedSeconds DDInitKeyServerScanProgress - (every 30s) keyServer scan: BeginKey, Batches, ShardsScanned
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like a wrapping mixup, ElapsedSeconds should be end of previous line?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixing. Thanks.

michael stack added 13 commits April 13, 2026 16:26
When SHARD_ENCODE_LOCATION_METADATA=true we take new codepaths often
opaque. Add logging.

For example, DD init hung for 14-16 minutes with zero visibility into
what was stuck. The only clue was a gap between DDInitUpdatedReplicaKeys
and DDInitGotInitialDD trace events.  Diagnosing the root cause required
extensive log splunking of SS metrics to determine that a single
getRange(dataMoveKeys) read was queued on an overloaded storage server.

DDTxnProcessor.actor.cpp:
- Log elapsed time for the server list + data move read transaction
  (DDInitServerListAndDataMoveReadComplete) with NumDataMoves, NumServers
- Log elapsed time for the keyServer scan (DDInitKeyServerScanComplete)
  with NumShards
- Warn when getRange(dataMoveKeys) takes >5 seconds
  (DDInitSlowDataMoveRead)

DataDistribution.actor.cpp:
- Add NumShards and NumServers to DDInitGotInitialDD
- Promote DDInitFoundDataMove from SevDebug to SevInfo so individual
  data moves are visible in production logs
- Add DDInitResumedDataMoves summary event with ValidMoves,
  CancelledMoves, EmptyMoves counts and elapsed time

DDTeamCollection.actor.cpp:
- Add Reason and Address details to UndesiredStorageServer trace events
  to distinguish version lag, same-address, wrong-class, and exclusion
  causes without needing to correlate with other log lines
…ress

Additional logging to address DD operational opacity during startup,
based on past incidents where DD hung with no visibility into the cause.

NativeAPI.actor.cpp:
- Log WaitStorageMetricsRetrying every 60s when waitStorageMetrics is
  stuck retrying wrong_shard_server or all_alternatives_failed, with
  the key range, retry count, and elapsed time. Previously these retries
  were silent (SevDebug only), making it impossible to identify which
  shard was stuck or that retries were even happening.

DDShardTracker.actor.cpp:
- Log TrackInitialShardsComplete after shard tracker setup with count
- Log TrackInitialShardsMetricsComplete after changeSizes() finishes
  with elapsed time. changeSizes() waits for ALL shards to report
  metrics via getFirstSize/waitStorageMetrics -- if any shard metrics
  never arrive, this hangs silently.

DDTxnProcessor.actor.cpp:
- Log DDInitKeyServerScanProgress every 30s during the multi-transaction
  keyServer scan with current beginKey, batch count, shards scanned,
  and elapsed time. With 255K shards this scan requires many transactions
  and a stuck one was previously invisible.

DataDistribution.actor.cpp:
- Log DDInitComplete with elapsed time after DataDistributor::init()
  finishes, providing a single event showing total init duration.
FAILED: flow/CMakeFiles/flow.dir/ActorCollection.actor.g.cpp.o
/usr/local/bin/ccache /usr/local/bin/clang++ -DBOOST_ATOMIC_NO_LIB -DBOOST_CONTEXT_NO_LIB -DBOOST_ERROR_CODE_HEADER_ONLY -DBOOST_FILESYSTEM_NO_LIB -DBOOST_IOSTREAMS_NO_LIB -DBOOST_SERIALIZATION_NO_LIB -DBOOST_SYSTEM_NO_DEPRECATED -DBOOST_SYSTEM_NO_LIB -DNO_INTELLISENSE -DWITH_FOLLY_MEMCPY -DCOMPILATION_UNIT=/codebuild/output/src2645037596/src/github.com/apple/foundationdb/build_output/flow/ActorCollection.actor.g.cpp -I/codebuild/output/src2645037596/src/github.com/apple/foundationdb/flow/include -I/codebuild/output/src2645037596/src/github.com/apple/foundationdb/build_output/flow/include -I/codebuild/output/src2645037596/src/github.com/apple/foundationdb/contrib/stacktrace/include -I/codebuild/output/src2645037596/src/github.com/apple/foundationdb/contrib/fmt-8.1.1/include -I/codebuild/output/src2645037596/src/github.com/apple/foundationdb/contrib/SimpleOpt/include -I/codebuild/output/src2645037596/src/github.com/apple/foundationdb/contrib/crc32/include -I/codebuild/output/src2645037596/src/github.com/apple/foundationdb/contrib/folly_memcpy -isystem /opt/boost_1_78_0_clang/include -stdlib=libc++ -O3 -DNDEBUG -std=gnu++20 -DCMAKE_BUILD -fno-omit-frame-pointer -gdwarf-4 -ggdb1 -gz -Wall -Wextra -Wredundant-move -Wpessimizing-move -Woverloaded-virtual -Wshift-sign-overflow -Wno-sign-compare -Wno-undefined-var-template -Wno-unknown-warning-option -Wno-unused-parameter -Wno-constant-logical-operand -Wno-deprecated-copy -Wno-delete-non-abstract-non-virtual-dtor -Wno-range-loop-construct -Wno-reorder-ctor -Wno-unused-command-line-argument -Wno-ambiguous-reversed-operator -Wno-register -Werror -Wno-error=format -Wunused-variable -Wno-deprecated -fvisibility=hidden -Wreturn-type -fPIC -march=armv8.2-a+crc+simd -DHAVE_OPENSSL -Wno-unused-function -MD -MT flow/CMakeFiles/flow.dir/ActorCollection.actor.g.cpp.o -MF flow/CMakeFiles/flow.dir/ActorCollection.actor.g.cpp.o.d -o flow/CMakeFiles/flow.dir/ActorCollection.actor.g.cpp.o -c /codebuild/output/src2645037596/src/github.com/apple/foundationdb/build_output/flow/ActorCollection.actor.g.cpp
In file included from /codebuild/output/src2645037596/src/github.com/apple/foundationdb/flow/ActorCollection.actor.cpp:21:
In file included from /codebuild/output/src2645037596/src/github.com/apple/foundationdb/flow/include/flow/ActorCollection.h:25:
In file included from /codebuild/output/src2645037596/src/github.com/apple/foundationdb/flow/include/flow/flow.h:23:
In file included from /codebuild/output/src2645037596/src/github.com/apple/foundationdb/flow/include/flow/Arena.h:37:
In file included from /opt/boost_1_78_0_clang/include/boost/functional/hash.hpp:6:
/opt/boost_1_78_0_clang/include/boost/container_hash/hash.hpp:132:33: error: no template named 'unary_function' in namespace 'std'; did you mean '__unary_function'?
  132 |         struct hash_base : std::unary_function<T, std::size_t> {};
      |                            ~~~~~^
/usr/local/bin/../include/c++/v1/__functional/unary_function.h:46:1: note: '__unary_function' declared here
   46 | using __unary_function = __unary_function_keep_layout_base<_Arg, _Result>;
      | ^
1 error generated.
- NativeAPI.actor.cpp: Move retry logging outside the error-type if block
  so all errors get keys/elapsed/retries details. Use severity upgrade
  (SevDebug -> SevWarn after 60s) on the existing WaitStorageMetricsHandleError
  event instead of a separate event name.
- DataDistribution.actor.cpp: Add periodic progress logging (every 30s) in
  resumeFromDataMoves loop so operators can watch counts go up during long
  data move recovery.
- CompileBoost.cmake: Remove BOOST_NO_CXX98_FUNCTION_BASE since 7.3 CI is
  broken independently of this change.
DD exits (e.g. movekeys_conflict) were invisible because
reportErrorsExcept suppresses logging for "normal" DD errors.
Add DDExiting trace event at SevWarn with error and code so
every DD death is visible in trace logs.
@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: e95820e
  • Duration 0:04:04
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: e95820e
  • Duration 0:08:37
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: e95820e
  • Duration 0:08:37
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: e95820e
  • Duration 0:08:43
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: e95820e
  • Duration 0:33:09
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: e95820e
  • Duration 0:39:54
  • Result: ❌ FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@gxglass
Copy link
Copy Markdown
Collaborator

gxglass commented Apr 14, 2026

@saintstack many checks are failing

@gxglass
Copy link
Copy Markdown
Collaborator

gxglass commented Apr 15, 2026

@saintstack many checks are failing

it's release-7.3, these are expected.

@spraza could you also review. Would like to get this into the next patch for 7.3

@spraza spraza merged commit 30c7a2a into apple:release-7.3 Apr 15, 2026
1 of 6 checks passed
saintstack pushed a commit to saintstack/foundationdb that referenced this pull request Apr 15, 2026
…s, and team collection logging (apple#12913)

Add logging throughout DD startup and runtime to diagnose slow startups,
stuck data moves, and undesired server classification.

DDTxnProcessor: Log elapsed time for server list + data move read
transaction and keyServer scan. Warn when getRange(dataMoveKeys) takes
over 5 seconds.

DataDistribution: Add NumShards and NumServers to DDInitGotInitialDD.
Add DDInitResumedDataMoves summary with ValidMoves, CancelledMoves,
EmptyMoves counts and elapsed time. Log DD exit reason as DDExiting
at SevWarn. Add DDInit-prefixed trace events throughout startup sequence.

DDTeamCollection: Add Reason and Address details to UndesiredStorageServer
trace events to distinguish version lag, same-address, wrong-class, and
exclusion causes.

DDShardTracker: Log TrackInitialShardsComplete with shard count and
TrackInitialShardsMetricsComplete with elapsed time.

NativeAPI: Add retry counting and logging for getStorageMetrics timeouts.
saintstack pushed a commit to saintstack/foundationdb that referenced this pull request Apr 21, 2026
…s, and team collection logging (apple#12913)

Add logging throughout DD startup and runtime to diagnose slow startups,
stuck data moves, and undesired server classification.

DDTxnProcessor: Log elapsed time for server list + data move read
transaction and keyServer scan. Warn when getRange(dataMoveKeys) takes
over 5 seconds.

DataDistribution: Add NumShards and NumServers to DDInitGotInitialDD.
Add DDInitResumedDataMoves summary with ValidMoves, CancelledMoves,
EmptyMoves counts and elapsed time. Log DD exit reason as DDExiting
at SevWarn. Add DDInit-prefixed trace events throughout startup sequence.

DDTeamCollection: Add Reason and Address details to UndesiredStorageServer
trace events to distinguish version lag, same-address, wrong-class, and
exclusion causes.

DDShardTracker: Log TrackInitialShardsComplete with shard count and
TrackInitialShardsMetricsComplete with elapsed time.

NativeAPI: Add retry counting and logging for getStorageMetrics timeouts.
saintstack added a commit that referenced this pull request Apr 23, 2026
…nosing slow startups (#13002)

* Add DD init visibility, metrics retries, shard tracking, scan progress, and team collection logging (#12913)

Add logging throughout DD startup and runtime to diagnose slow startups,
stuck data moves, and undesired server classification.

DDTxnProcessor: Log elapsed time for server list + data move read
transaction and keyServer scan. Warn when getRange(dataMoveKeys) takes
over 5 seconds.

DataDistribution: Add NumShards and NumServers to DDInitGotInitialDD.
Add DDInitResumedDataMoves summary with ValidMoves, CancelledMoves,
EmptyMoves counts and elapsed time. Log DD exit reason as DDExiting
at SevWarn. Add DDInit-prefixed trace events throughout startup sequence.

DDTeamCollection: Add Reason and Address details to UndesiredStorageServer
trace events to distinguish version lag, same-address, wrong-class, and
exclusion causes.

DDShardTracker: Log TrackInitialShardsComplete with shard count and
TrackInitialShardsMetricsComplete with elapsed time.

NativeAPI: Add retry counting and logging for getStorageMetrics timeouts.

* Remove double trace

* Fix DD trace event issues: rate-limit warns, deduplicate events, fix DDExiting

- Wire up unused lastLogTime in waitStorageMetrics to rate-limit SevWarn
  events to once per 10s after the 60s threshold. Previously every retry
  (up to 100/s with 10ms WRONG_SHARD_SERVER_DELAY) emitted SevWarn.

- Remove redundant DataDistributorRunning and DDInitRunning events inside
  the try block. Rename the original DataDistributorRunning to DDInitRunning
  so a single event serves both purposes and fits the DDInit* query pattern.

- Move DDExiting to the three actual throw sites so it only fires on
  terminal exits, not on retries of movekeys_conflict/dd_config_changed
  when DD is disabled.

* Trim over-apologetic DDInitDone comment
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants