Make getTeamByServers O(1) in time #12938

Merged
spraza merged 2 commits into apple:release-7.3 from spraza:dd-efficient-getTeamByServers
Apr 9, 2026
@spraza
Collaborator

@spraza spraza commented Apr 7, 2026

Under certain workloads (large data movement, storage migration, etc.) with SHARD_ENCODE_LOCATION_METADATA enabled, DD can get stuck indefinitely at initialization time, because the getTeamByServers function saturates the CPU and starves other critical DD actors, preventing them from completing (symptom: txn_too_old).

This PR is a perf optimization that makes getTeamByServers O(1) instead of O(teams). Previously, for every team, we were doing expensive CPU operations.
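The idea can be illustrated with a minimal, self-contained sketch. It is not the actual DDTeamCollection code: `Team`, `TeamIndex`, and `makeTeamKey` are hypothetical names, and the comma-joined sorted key is an assumed key format. It only shows how a hash map keyed by a canonical server-ID string replaces a scan over all teams with a single lookup.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a DD team: only the member server IDs.
struct Team {
    std::vector<std::string> serverIDs;
};

// Assumed key format: sorted IDs joined by commas, so the same set of
// servers always yields the same string regardless of input order.
// (Assumes IDs themselves never contain a comma.)
std::string makeTeamKey(std::vector<std::string> ids) {
    std::sort(ids.begin(), ids.end());
    std::string key;
    for (const auto& id : ids) {
        if (!key.empty()) key += ',';
        key += id;
    }
    return key;
}

// Hypothetical index of teams keyed by their canonical server-ID string.
class TeamIndex {
public:
    void addTeam(std::shared_ptr<Team> team) {
        std::string key = makeTeamKey(team->serverIDs);
        teamsByServerIDs[key] = std::move(team);
    }

    // One hash lookup, average O(1) in the number of teams; no scan.
    // Building the key still costs time proportional to the team size.
    std::shared_ptr<Team> getTeamByServers(const std::vector<std::string>& ids) const {
        auto it = teamsByServerIDs.find(makeTeamKey(ids));
        if (it == teamsByServerIDs.end()) {
            return nullptr;
        }
        return it->second;
    }

    std::size_t size() const { return teamsByServerIDs.size(); }

private:
    std::unordered_map<std::string, std::shared_ptr<Team>> teamsByServerIDs;
};
```

The trade-off in this sketch is that the index must be kept in sync whenever a team is added or removed, in exchange for lookups that no longer depend on the total team count.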

500K: 20260407-065022-praza-6a689f78cbedc591 compressed=True data_size=35201181 duration=15433678 ended=500000 fail_fast=50 max_runs=500000 pass=500000 priority=100 remaining=0 runtime=2:28:17 sanity=False started=500000 stopped=20260407-091839 submitted=20260407-065022 timeout=5400 username=praza

ctests passed:

...
54/70 Test #57: fdb_c_upgrade_from_prev2_gradual ....................................   Passed   21.68 sec
55/70 Test #65: java-integration ....................................................   Passed   15.12 sec
56/70 Test #17: fdb_c_unit_tests ....................................................   Passed   33.22 sec
57/70 Test #20: fdb_c_external_client_unit_tests ....................................   Passed   33.73 sec
58/70 Test #66: java-multi-integration ..............................................   Passed   19.35 sec
59/70 Test #55: fdb_c_upgrade_from_prev3_gradual ....................................   Passed   28.47 sec
60/70 Test  #4: multi_process_fdbcli_tests ..........................................   Passed   37.14 sec
61/70 Test #53: fdb_c_upgrade_to_future_version_blob_granules .......................   Passed   33.55 sec
62/70 Test #63: python_unit_tests ...................................................   Passed   28.50 sec
63/70 Test #13: authz_no_grv_cache_no_forced_mvc ....................................   Passed   41.63 sec
64/70 Test #15: authz_with_grv_cache_with_forced_mvc ................................   Passed   42.09 sec
65/70 Test #14: authz_no_grv_cache_with_forced_mvc ..................................   Passed   43.16 sec
66/70 Test  #3: single_process_fdbcli_tests .........................................   Passed   49.86 sec
67/70 Test  #5: single_process_external_client_fdbcli_tests .........................   Passed   51.73 sec
68/70 Test #62: fdb_c_shim_library_tests ............................................   Passed   46.75 sec
69/70 Test #60: fdb_c_wiggle_only ...................................................   Passed   74.72 sec
70/70 Test #61: fdb_c_wiggle_and_upgrade ............................................   Passed   99.68 sec

100% tests passed, 0 tests failed out of 70

Total Test time (real) = 118.15 sec

Post feedback 500K: 20260407-232634-praza-193a7d8cc8dd5b43 compressed=True data_size=35200768 duration=14012243 ended=500000 fail=2 fail_fast=50 max_runs=500000 pass=499998 priority=100 remaining=0 runtime=22:58:12 sanity=False started=500000 stopped=20260408-222446 submitted=20260407-232634 timeout=5400 username=praza.

The two failures are in the ConfigureStorageMigration restart test and do not seem related to the additional feedback commit.

Code-Reviewer Section

The general pull request guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 5466516
  • Duration 0:04:07
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

gxglass previously approved these changes Apr 7, 2026
Comment thread fdbserver/DDTeamCollection.actor.cpp Outdated

bool DDTeamCollection::removeTeam(Reference<TCTeamInfo> team) {
TraceEvent("RemovedServerTeam", distributorId).detail("Team", team->getDesc());
if (teamsByServerIDs.find(team->getServerIDsStr()) != teamsByServerIDs.end()) {
Collaborator

auto it = teamsByServerIDs.find(...);
if (it != teamsByServerIDs.end()) {
    teamsByServerIDs.erase(it);
}

Collaborator Author

Done. This reads better and avoids the redundant hash and lookup.
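The pattern from this thread can be sketched in isolation as follows. This is an illustration only, not the merged code: `eraseIfPresent` is a hypothetical helper, and the map's value type is simplified to `int`.

```cpp
#include <string>
#include <unordered_map>

// Erases `key` using a single lookup: the iterator returned by find() is
// reused for erase(), rather than calling find() and then erase(key),
// which would hash and look up the key a second time.
// Returns true if an entry was removed.
bool eraseIfPresent(std::unordered_map<std::string, int>& teamsByServerIDs,
                    const std::string& key) {
    auto it = teamsByServerIDs.find(key);
    if (it != teamsByServerIDs.end()) {
        teamsByServerIDs.erase(it);
        return true;
    }
    return false;
}
```

Keeping the iterator also leaves room to inspect the entry before erasing it, should a caller need that.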

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 5466516
  • Duration 0:08:10
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 5466516
  • Duration 0:08:29
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 5466516
  • Duration 0:08:35
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 9b0d64d
  • Duration 0:04:12
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 9b0d64d
  • Duration 0:08:10
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 9b0d64d
  • Duration 0:08:28
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 9b0d64d
  • Duration 0:08:40
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 5466516
  • Duration 0:32:03
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 5466516
  • Duration 0:38:53
  • Result: ❌ FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@spraza
Collaborator Author

spraza commented Apr 8, 2026

Thanks for the reviews. I'm doing some additional perf testing against this change. Hoping to merge later this week or early next week once I have positive signal from the perf testing.

@gxglass
Collaborator

gxglass commented Apr 8, 2026

> Thanks for the reviews. I'm doing some additional perf testing against this change. Hoping to merge later this week or early next week once I have positive signal from the perf testing.

Based on experience in production so far, I am comfortable merging.

@spraza spraza merged commit bf78a46 into apple:release-7.3 Apr 9, 2026
0 of 4 checks passed
spraza added a commit to spraza/foundationdb that referenced this pull request Apr 16, 2026
* Make getTeamByServers O(1) in time

* address feedback