Forward-port to 7.4: Add DD init and team collection logging for diagnosing slow startups #13002
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
Result of foundationdb-pr-clang-arm on Linux CentOS 7
Result of foundationdb-pr-clang on Linux RHEL 9
Result of foundationdb-pr-macos on macOS Ventura 13.x
Result of foundationdb-pr on Linux RHEL 9
@saintstack assuming you ran 100K against 7.4 branch with this change?

@spraza Yes. This
                                         Optional<Reference<TransactionState>> trState) {
    state Span span("NAPI:WaitStorageMetrics"_loc, generateSpanID(cx->transactionTracingSample));
    state double startTime = now();
    state double lastLogTime = 0;
The following tests FAILED:
Reopening. I don't see a macOS build running.
It says the macOS build is pending. I ran it manually, so now I'll add the success status by hand. In the meantime I'll try to figure out what's up with the macOS build on release-7.4.

With SHARD_ENCODE_LOCATION_METADATA=true we take new code paths that are
often opaque. Add logging.
For example, DD init hung for 14-16 minutes with zero visibility into
what was stuck. The only clue was a gap between DDInitUpdatedReplicaKeys
and DDInitGotInitialDD trace events. Diagnosing the root cause required
extensive log splunking of SS metrics to determine that a single
getRange(dataMoveKeys) read was queued on an overloaded storage server.
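
To illustrate the kind of visibility that was missing, here is a minimal sketch in standard C++ (not Flow, and not the code in this PR) of a wait that reports periodically while a read is still outstanding. The helper `waitWithSlowLogging`, the `DemoResult` struct, the log text, and the 25 ms / 120 ms timings are all assumptions made up for the example.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <future>
#include <iostream>
#include <thread>

// Hypothetical helper, not FoundationDB code. Waits for `f` and calls
// `onSlow(elapsedSeconds)` each time another `interval` passes without a result.
// Assumes `f` was not created with std::launch::deferred.
template <class T>
T waitWithSlowLogging(std::future<T>& f,
                      std::chrono::milliseconds interval,
                      const std::function<void(double)>& onSlow) {
    assert(interval.count() > 0);
    const auto start = std::chrono::steady_clock::now();
    while (f.wait_for(interval) != std::future_status::ready) {
        const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
        onSlow(elapsed.count());
    }
    return f.get();
}

struct DemoResult {
    int numDataMoves; // value returned by the simulated read
    int slowLogs;     // how many times the slow-read callback fired
};

// Simulates a read that sits in a queue for about 120 ms and polls it every 25 ms.
DemoResult runSlowReadDemo() {
    std::future<int> read = std::async(std::launch::async, [] {
        std::this_thread::sleep_for(std::chrono::milliseconds(120));
        return 42; // stand-in for the number of data moves read
    });
    int slowLogs = 0;
    const int numDataMoves =
        waitWithSlowLogging<int>(read, std::chrono::milliseconds(25), [&](double elapsed) {
            ++slowLogs;
            std::cerr << "SlowRead Elapsed=" << elapsed << "s\n";
        });
    std::cerr << "ReadComplete NumDataMoves=" << numDataMoves << " SlowLogs=" << slowLogs << "\n";
    return { numDataMoves, slowLogs };
}
```

Calling `runSlowReadDemo()` from a `main` prints a few slow-read lines followed by one completion line. With output like that, a stalled read shows up as a repeating event rather than as a silent gap between two unrelated events.
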
DDTxnProcessor.actor.cpp:
- (DDInitServerListAndDataMoveReadComplete) with NumDataMoves, NumServers
- with NumShards
- (DDInitSlowDataMoveRead)

DataDistribution.actor.cpp:
- data moves are visible in production logs
- CancelledMoves, EmptyMoves counts and elapsed time

DDTeamCollection.actor.cpp:
- to distinguish version lag, same-address, wrong-class, and exclusion
  causes without needing to correlate with other log lines
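
As a rough sketch of the last point, the four causes named above could be tallied and emitted in a single line, so nobody has to correlate several log lines. This is standard C++ written for illustration only: `Unhealthy`, `ServerStatus`, `classify`, `summarize`, and the output format are hypothetical names, not what DDTeamCollection.actor.cpp does.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// All names here are hypothetical; they only mirror the four causes listed above.
enum class Unhealthy : std::size_t { None, VersionLag, SameAddress, WrongClass, Excluded, Count };

struct ServerStatus {
    bool versionLag = false;
    bool sameAddress = false;
    bool wrongClass = false;
    bool excluded = false;
};

// Report the first matching cause so that each server is counted exactly once.
Unhealthy classify(const ServerStatus& s) {
    if (s.versionLag) return Unhealthy::VersionLag;
    if (s.sameAddress) return Unhealthy::SameAddress;
    if (s.wrongClass) return Unhealthy::WrongClass;
    if (s.excluded) return Unhealthy::Excluded;
    return Unhealthy::None;
}

// Build one summary line with a count per cause.
std::string summarize(const std::vector<ServerStatus>& servers) {
    static const char* names[] = { "Healthy", "VersionLag", "SameAddress", "WrongClass", "Excluded" };
    std::array<int, static_cast<std::size_t>(Unhealthy::Count)> counts{};
    for (const auto& s : servers) {
        const auto idx = static_cast<std::size_t>(classify(s));
        assert(idx < counts.size());
        ++counts[idx];
    }
    std::ostringstream out;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        if (i) out << ' ';
        out << names[i] << '=' << counts[i];
    }
    return out.str();
}
```

For four servers where one lags, one has the wrong class, and one is excluded, `summarize` returns `Healthy=1 VersionLag=1 SameAddress=0 WrongClass=1 Excluded=1`.
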
Forward-ported from 7.3