Data loss protection v3 #8560
Merged: sfc-gh-ljoswiak merged 13 commits into apple:main from sfc-gh-ljoswiak:fixes/cluster-id-v3 on Oct 27, 2022
sfc-gh-ljoswiak force-pushed the fixes/cluster-id-v3 branch from b968976 to 3b356a2 (October 24, 2022 23:58)
sfc-gh-ljoswiak force-pushed the fixes/cluster-id-v3 branch from 3b356a2 to f6be2bc (October 25, 2022 00:36)
And have these processes enter a "zombie" state where they cancel all their actors and then wait forever, refusing to do any additional work until they are manually handled by the operator.
The simulator tracks only active processes. Rebooted or killed processes are removed from the list of processes, and only get added back when the process is rebooted and starts up again. This causes a problem for the `RebootProcessAndSwitch` kill type, which wants to simultaneously reboot all machines in a cluster and change their cluster file. If a machine is in the middle of a reboot when the signal is sent, it will miss the `RebootProcessAndSwitch` command.
The fix is to add a check when a process is being started in simulation. If the process has had its cluster file changed and the cluster is in a state where all processes should have had their cluster files reverted to the original value, the simulator will now send a `RebootProcessAndSwitch` signal right when the process is started. This will cause an extra reboot, but should correctly switch the process back to its original, correct cluster file, allowing all clusters to fully recover.
Note that the above issue should only affect simulation, due to how the simulator tracks processes and handles kill signals.
This commit also adds a field to each process struct to determine whether the process is being run in a DR cluster in the simulation run. This is needed because simulation does not differentiate between processes in different clusters (other than by the IP), and some processes needed to switch clusters while others simply needed to be rebooted.
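The start-up check described above can be sketched roughly as follows. This is an illustrative sketch only, not the simulator's actual code: `SimProcess`, `StartAction`, `actionOnStart`, and the `clusterFilesRestored` flag are hypothetical names invented here to show the shape of the decision.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified view of a simulated process. The real simulator
// struct has many more fields; only the two cluster-file paths matter here.
struct SimProcess {
    std::string originalClusterFile; // cluster file the process started with
    std::string currentClusterFile;  // cluster file it is using right now
};

// Hypothetical result type: what the simulator should do when a process starts.
enum class StartAction { None, RebootAndSwitch };

// Decide whether a freshly started process missed an earlier
// RebootProcessAndSwitch signal. `clusterFilesRestored` is an assumed flag
// meaning "every process should be back on its original cluster file by now".
inline StartAction actionOnStart(const SimProcess& p, bool clusterFilesRestored) {
    const bool fileWasChanged = p.currentClusterFile != p.originalClusterFile;
    if (fileWasChanged && clusterFilesRestored) {
        // The process was down when the switch-back happened: send the signal now.
        return StartAction::RebootAndSwitch;
    }
    return StartAction::None;
}
```

The cost of this approach, as the commit message notes, is one extra reboot for any process that was down when the original signal was sent.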
The cluster ID is now stored in the database instead of in the txnStateStore. The cluster controller will read it on boot and send it to all processes to persist.
The logic to determine the validity of a process joining a cluster now belongs on the worker and the cluster controller. It is no longer restricted to tlogs and storages, but instead applies to all processes (even stateless ones).
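As a rough illustration of the join check described above, the sketch below compares the cluster ID a process has persisted with the one announced by the cluster controller. All names (`ClusterId`, `JoinDecision`, `checkClusterId`) are hypothetical and the ID is modeled as a plain string. The actual FoundationDB types and control flow are not shown in this PR description, so treat this as an assumption-laden sketch rather than the real implementation.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical stand-in for the cluster ID type.
using ClusterId = std::string;

// Hypothetical outcome of the validity check on a worker.
enum class JoinDecision {
    Accept,           // persisted ID matches the announced one
    PersistAndAccept, // no ID persisted yet: store the announced one and join
    Reject            // mismatch: the process must not join this cluster
};

// Compare the ID a process has on disk (if any) with the ID sent by the
// cluster controller. The check applies to every process, stateless or not.
inline JoinDecision checkClusterId(const std::optional<ClusterId>& persisted,
                                   const ClusterId& announced) {
    if (!persisted.has_value()) {
        return JoinDecision::PersistAndAccept;
    }
    return *persisted == announced ? JoinDecision::Accept : JoinDecision::Reject;
}
```

In this sketch, a `Reject` result corresponds to the case where a process would stop doing work and wait for operator intervention.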
In FDB 7.1, this key was stored in the txnStateStore. In 7.2, it has been moved to the database. This was causing protocol compatibility issues during upgrades, so we need to rename the key.
This enables clients to receive the cluster ID.
sfc-gh-ljoswiak force-pushed the fixes/cluster-id-v3 branch from f6be2bc to 19e739a (October 26, 2022 20:47)
sfc-gh-ljoswiak requested review from jzhou77, sfc-gh-etschannen and halfprice (October 26, 2022 22:23)
halfprice reviewed Oct 27, 2022
Thanks for the fix!
Adding `isConfigDB` to avoid `updateClusterSharedStateMap` LGTM.
As I don't have context for the rest of the PR, I'd wait for others' approval.
jzhou77 approved these changes Oct 27, 2022
This was referenced Mar 8, 2023
Builds on #8472, but fixes an fdbcli test timeout due to the `ClientDBInfo` not being updated with the cluster ID. Also disables reading of `\xff\xff/cluster_id` from configuration databases.
Including PR description from #8472 for reference: