Skip to content

Add DD init visibility, metrics retries, shard tracking, scan progres…#13142

Merged
saintstack merged 1 commit into
apple:mainfrom
saintstack:dd_init_logging_main
May 5, 2026
Merged

Add DD init visibility, metrics retries, shard tracking, scan progres…#13142
saintstack merged 1 commit into
apple:mainfrom
saintstack:dd_init_logging_main

Conversation

@saintstack
Copy link
Copy Markdown
Contributor

…s, and team collection logging (#12913)

Forward-port from 7.3 of

commit 30c7a2aadf395e6cec2be10e03ce7d3f496d5cb5
Author: Michael Stack <saintstack@users.noreply.github.com>
Date:   Wed Apr 15 10:25:43 2026 -0700

    Add DD init visibility, metrics retries, shard tracking, scan progress, and team collection logging (#12913)

  • Add DD init and team collection logging for diagnosing slow startups

When SHARD_ENCODE_LOCATION_METADATA=true we take new codepaths often opaque. Add logging.

For example, DD init hung for 14-16 minutes with zero visibility into what was stuck. The only clue was a gap between DDInitUpdatedReplicaKeys and DDInitGotInitialDD trace events. Diagnosing the root cause required extensive log splunking of SS metrics to determine that a single getRange(dataMoveKeys) read was queued on an overloaded storage server.

DDTxnProcessor.actor.cpp:

  • Log elapsed time for the server list + data move read transaction (DDInitServerListAndDataMoveReadComplete) with NumDataMoves, NumServers
  • Log elapsed time for the keyServer scan (DDInitKeyServerScanComplete) with NumShards
  • Warn when getRange(dataMoveKeys) takes >5 seconds (DDInitSlowDataMoveRead)

DataDistribution.actor.cpp:

  • Add NumShards and NumServers to DDInitGotInitialDD
  • Promote DDInitFoundDataMove from SevDebug to SevInfo so individual data moves are visible in production logs
  • Add DDInitResumedDataMoves summary event with ValidMoves, CancelledMoves, EmptyMoves counts and elapsed time

DDTeamCollection.actor.cpp:

  • Add Reason and Address details to UndesiredStorageServer trace events to distinguish version lag, same-address, wrong-class, and exclusion causes without needing to correlate with other log lines
  • Revert DDInitFoundDataMove to SevDebug to avoid log spam with many data moves

  • Add DD startup visibility: metrics retries, shard tracking, scan progress

Additional logging to address DD operational opacity during startup, based on past incidents where DD hung with no visibility into the cause.

NativeAPI.actor.cpp:

  • Log WaitStorageMetricsRetrying every 60s when waitStorageMetrics is stuck retrying wrong_shard_server or all_alternatives_failed, with the key range, retry count, and elapsed time. Previously these retries were silent (SevDebug only), making it impossible to identify which shard was stuck or that retries were even happening.

DDShardTracker.actor.cpp:

  • Log TrackInitialShardsComplete after shard tracker setup with count
  • Log TrackInitialShardsMetricsComplete after changeSizes() finishes with elapsed time. changeSizes() waits for ALL shards to report metrics via getFirstSize/waitStorageMetrics -- if any shard metrics never arrive, this hangs silently.

DDTxnProcessor.actor.cpp:

  • Log DDInitKeyServerScanProgress every 30s during the multi-transaction keyServer scan with current beginKey, batch count, shards scanned, and elapsed time. With 255K shards this scan requires many transactions and a stuck one was previously invisible.

DataDistribution.actor.cpp:

  • Log DDInitComplete with elapsed time after DataDistributor::init() finishes, providing a single event showing total init duration.

  • NativeAPI.actor.cpp: Move retry logging outside the error-type if block so all errors get keys/elapsed/retries details. Use severity upgrade (SevDebug -> SevWarn after 60s) on the existing WaitStorageMetricsHandleError event instead of a separate event name.

  • DataDistribution.actor.cpp: Add periodic progress logging (every 30s) in resumeFromDataMoves loop so operators can watch counts go up during long data move recovery.

  • CompileBoost.cmake: Remove BOOST_NO_CXX98_FUNCTION_BASE since 7.3 CI is broken independently of this change.

DD exits (e.g. movekeys_conflict) were invisible because reportErrorsExcept suppresses logging for "normal" DD errors. Add DDExiting trace event at SevWarn with error and code so every DD death is visible in trace logs.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds trace-based observability around Data Distributor startup, shard tracking, storage-metrics retries, and undesired-server classification so operators can diagnose slow or stuck DD initialization more easily.

Changes:

  • Adds new DD startup and progress trace events in initialization, shard tracking, and data-move recovery paths.
  • Expands storage-metrics retry logging and undesired storage server logging with more context.
  • Adds warning/error visibility for DD exit paths and long-running initialization sub-steps.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
fdbserver/datadistributor/DataDistribution.cpp Adds DD startup sequence documentation/comments, extra init/data-move logging, and DD exit visibility.
fdbserver/datadistributor/DDTxnProcessor.cpp Adds logging around initial server/data-move reads and key-server scan progress/completion.
fdbserver/datadistributor/DDTeamCollection.actor.cpp Adds reason/address context to undesired storage server trace events.
fdbserver/datadistributor/DDShardTracker.cpp Adds shard-tracker setup completion, metrics completion, and final DD init trace events.
fdbclient/NativeAPI.actor.cpp Extends waitStorageMetrics retry logging with elapsed time, keys, retry count, and severity escalation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread fdbserver/datadistributor/DDTxnProcessor.cpp
Comment thread fdbserver/datadistributor/DataDistribution.cpp
Comment thread fdbserver/datadistributor/DataDistribution.cpp
Comment thread fdbserver/datadistributor/DDTeamCollection.actor.cpp
Comment thread fdbclient/NativeAPI.actor.cpp
Comment thread fdbserver/datadistributor/DDTxnProcessor.cpp
Comment thread fdbserver/datadistributor/DDTxnProcessor.cpp
Comment thread fdbserver/datadistributor/DDShardTracker.cpp
@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 7e27150
  • Duration 0:22:17
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 7e27150
  • Duration 0:34:25
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 7e27150
  • Duration 0:42:43
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 7e27150
  • Duration 0:46:15
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 7e27150
  • Duration 1:10:38
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 7e27150
  • Duration 3:00:09
  • Result: ❌ FAILED
  • Error: Build has timed out.
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 7e27150
  • Duration 3:00:11
  • Result: ❌ FAILED
  • Error: Build has timed out.
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 7e27150
  • Duration 1:38:09
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 7e27150
  • Duration 1:44:55
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@saintstack
Copy link
Copy Markdown
Contributor Author

20260504-155129-stack_dd_init_logging-06af153d303c00aa compressed=True data_size=37065434 duration=5138938 ended=100000 fail=1 fail_fast=10 max_runs=100000 pass=99999 priority=100 remaining=0 runtime=1:03:31 sanity=False started=100000 stopped=20260504-165500 submitted=20260504-155129 timeout=5400 username=stack_dd_init_logging

RandomUnitTest failure. I ran it and seems unrelated to logging changes.

…s, and team collection logging (apple#12913)

* Add DD init and team collection logging for diagnosing slow startups

When SHARD_ENCODE_LOCATION_METADATA=true we take new codepaths often
opaque. Add logging.

For example, DD init hung for 14-16 minutes with zero visibility into
what was stuck. The only clue was a gap between DDInitUpdatedReplicaKeys
and DDInitGotInitialDD trace events.  Diagnosing the root cause required
extensive log splunking of SS metrics to determine that a single
getRange(dataMoveKeys) read was queued on an overloaded storage server.

DDTxnProcessor.actor.cpp:
- Log elapsed time for the server list + data move read transaction
  (DDInitServerListAndDataMoveReadComplete) with NumDataMoves, NumServers
- Log elapsed time for the keyServer scan (DDInitKeyServerScanComplete)
  with NumShards
- Warn when getRange(dataMoveKeys) takes >5 seconds
  (DDInitSlowDataMoveRead)

DataDistribution.actor.cpp:
- Add NumShards and NumServers to DDInitGotInitialDD
- Promote DDInitFoundDataMove from SevDebug to SevInfo so individual
  data moves are visible in production logs
- Add DDInitResumedDataMoves summary event with ValidMoves,
  CancelledMoves, EmptyMoves counts and elapsed time

DDTeamCollection.actor.cpp:
- Add Reason and Address details to UndesiredStorageServer trace events
  to distinguish version lag, same-address, wrong-class, and exclusion
  causes without needing to correlate with other log lines

* Revert DDInitFoundDataMove to SevDebug to avoid log spam with many data moves

* Add DD startup visibility: metrics retries, shard tracking, scan progress

Additional logging to address DD operational opacity during startup,
based on past incidents where DD hung with no visibility into the cause.

NativeAPI.actor.cpp:
- Log WaitStorageMetricsRetrying every 60s when waitStorageMetrics is
  stuck retrying wrong_shard_server or all_alternatives_failed, with
  the key range, retry count, and elapsed time. Previously these retries
  were silent (SevDebug only), making it impossible to identify which
  shard was stuck or that retries were even happening.

DDShardTracker.actor.cpp:
- Log TrackInitialShardsComplete after shard tracker setup with count
- Log TrackInitialShardsMetricsComplete after changeSizes() finishes
  with elapsed time. changeSizes() waits for ALL shards to report
  metrics via getFirstSize/waitStorageMetrics -- if any shard metrics
  never arrive, this hangs silently.

DDTxnProcessor.actor.cpp:
- Log DDInitKeyServerScanProgress every 30s during the multi-transaction
  keyServer scan with current beginKey, batch count, shards scanned,
  and elapsed time. With 255K shards this scan requires many transactions
  and a stuck one was previously invisible.

DataDistribution.actor.cpp:
- Log DDInitComplete with elapsed time after DataDistributor::init()
  finishes, providing a single event showing total init duration.

- NativeAPI.actor.cpp: Move retry logging outside the error-type if block
  so all errors get keys/elapsed/retries details. Use severity upgrade
  (SevDebug -> SevWarn after 60s) on the existing WaitStorageMetricsHandleError
  event instead of a separate event name.
- DataDistribution.actor.cpp: Add periodic progress logging (every 30s) in
  resumeFromDataMoves loop so operators can watch counts go up during long
  data move recovery.
- CompileBoost.cmake: Remove BOOST_NO_CXX98_FUNCTION_BASE since 7.3 CI is
  broken independently of this change.

DD exits (e.g. movekeys_conflict) were invisible because
reportErrorsExcept suppresses logging for "normal" DD errors.
Add DDExiting trace event at SevWarn with error and code so
every DD death is visible in trace logs.
@saintstack saintstack force-pushed the dd_init_logging_main branch from 7e27150 to 69d488b Compare May 4, 2026 21:42
@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 69d488b
  • Duration 0:27:33
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 69d488b
  • Duration 0:42:50
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 69d488b
  • Duration 0:50:16
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 69d488b
  • Duration 1:01:42
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@saintstack
Copy link
Copy Markdown
Contributor Author

20260504-215109-stack_dd_init_logging-9e5454afc7cde41e compressed=True data_size=37067069 duration=5215522 ended=100000 fail=1 fail_fast=10 max_runs=100000 pass=99999 priority=100 remaining=0 runtime=0:55:58 sanity=False started=100000 stopped=20260504-224707 submitted=20260504-215109 timeout=5400 username=stack_dd_init_logging

Failure:
RandomSeed="3509699268" SourceVersion="69d488b8d0e9df6c62466f7b6976b910058b9b11" Time="1777933966" BuggifyEnabled="1" DeterminismCheck="0" FaultInjectionEnabled="1" TestFile="tests/slow/VersionStampBackupToDB.toml"

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 69d488b
  • Duration 1:10:59
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 69d488b
  • Duration 1:11:13
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 69d488b
  • Duration 1:29:03
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@saintstack saintstack requested review from gxglass and spraza May 5, 2026 20:46
@saintstack saintstack merged commit 09d67f9 into apple:main May 5, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants