Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data loss protection v3 #8560

Merged
merged 13 commits into from Oct 27, 2022
Merged

Conversation

sfc-gh-ljoswiak
Copy link
Collaborator

@sfc-gh-ljoswiak sfc-gh-ljoswiak commented Oct 24, 2022

Builds on #8472, but fixes an fdbcli test timeout due to the ClientDBInfo not being updated with the cluster ID. Also disables reading of \xff\xff/cluster_id from configuration databases.

Including PR description from #8472 for reference:

Data loss protection is a feature to disallow processes from joining the wrong cluster and deleting their data (see #5375). The original implementation (#5637) had to be reverted due to some problems.

This PR re-enables data loss protection, but with a slightly different implementation. Each cluster generates a random cluster ID on the CC after recovery. This UID is stored in the database and sent to all processes through the ServerDBInfo object. Previously, stateful processes would write the cluster ID to their data files, and compare this value to that of the txnStateStore on boot. This PR modifies this approach. Now, each process will write the cluster ID to a durable file in the data directory (the file is named clusterId). When a process starts, it will include its cluster ID, read from this durable file, in its registration message with the cluster controller. The cluster controller will verify the cluster ID against its own on disk value, and refuse to let the process join the cluster if these cluster IDs don't match. The cluster controller will also verify its on disk cluster ID with that of the database when first starting.

Code-Reviewer Section

The general pull request guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: b968976
  • Duration 0:33:47
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Logs (available for 30 days)
  • Build Artifact (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: b968976
  • Duration 0:40:35
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Logs (available for 30 days)
  • Build Artifact (available for 30 days)

@fdb-windows-ci
Copy link
Collaborator

Doxense CI Report for Windows 10

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: b968976
  • Duration 1:58:14
  • Result: ❌ FAILED
  • Error: Error while executing command: if $(grep -q -- "--- FAIL:" ${CODEBUILD_SRC_DIR}/fdb-kubernetes-tests/logs/*.log); then echo "TESTS FAILED SEE THESE LOGS:"; echo ; grep -l -- "--- FAIL:" ${CODEBUILD_SRC_DIR}/fdb-kubernetes-tests/logs/*.log; exit 1; fi. Reason: exit status 1
  • Build Logs (available for 30 days)
  • Build Artifact (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

@fdb-windows-ci
Copy link
Collaborator

Doxense CI Report for Windows 10

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

@fdb-windows-ci
Copy link
Collaborator

Doxense CI Report for Windows 10

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

  • Commit ID: f6be2bc
  • Duration 0:29:14
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Logs (available for 30 days)
  • Build Artifact (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-macos-m1 on macOS BigSur 11.5.2

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-macos on macOS Monterey 12.x

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

  • Commit ID: f6be2bc
  • Duration 0:57:39
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Logs (available for 30 days)
  • Build Artifact (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: 3b356a2
  • Duration 1:57:41
  • Result: ❌ FAILED
  • Error: Error while executing command: if $(grep -q -- "--- FAIL:" ${CODEBUILD_SRC_DIR}/fdb-kubernetes-tests/logs/*.log); then echo "TESTS FAILED SEE THESE LOGS:"; echo ; grep -l -- "--- FAIL:" ${CODEBUILD_SRC_DIR}/fdb-kubernetes-tests/logs/*.log; exit 1; fi. Reason: exit status 1
  • Build Logs (available for 30 days)
  • Build Artifact (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: f6be2bc
  • Duration 1:56:09
  • Result: ❌ FAILED
  • Error: Error while executing command: if $(grep -q -- "--- FAIL:" ${CODEBUILD_SRC_DIR}/fdb-kubernetes-tests/logs/*.log); then echo "TESTS FAILED SEE THESE LOGS:"; echo ; grep -l -- "--- FAIL:" ${CODEBUILD_SRC_DIR}/fdb-kubernetes-tests/logs/*.log; exit 1; fi. Reason: exit status 1
  • Build Logs (available for 30 days)
  • Build Artifact (available for 30 days)

And have these processes enter a "zombie" state where they cancel all
their actors and then wait forever, refusing to do any additional work
until they are manually handled by the operator.
The simulator tracks only active processes. Rebooted or killed processes
are removed from the list of processes, and only get added back when the
process is rebooted and starts up again. This causes a problem for the
`RebootProcessAndSwitch` kill type, which wants to simultaneously reboot
all machines in a cluster and change their cluster file. If a machine is
currently being rebooted, it will miss the reboot process and switch
command.

The fix is to add a check when a process is being started in simulation.
If the process has had its cluster file changed and the cluster is in a
state where all processes should have had their cluster files reverted
to the original value, the simulator will now send a
`RebootProcessAndSwitch` signal right when the process is started. This
will cause an extra reboot, but should correctly switch the process back
to its original, correct cluster file, allowing the cluster to fully
recover all clusters.

Note that the above issue should only affect simulation, due to how the
simulator tracks processes and handles kill signals.

This commit also adds a field to each process struct to determine
whether the process is being run in a DR cluster in the simulation run.
This is needed because simulation does not differentiate between
processes in different clusters (other than by the IP), and some
processes needed to switch clusters and some simply needed to be
rebooted.
The cluster ID is now stored in the database instead of in the
txnStateStore. The cluster controller will read it on boot and send it
to all processes to persist.
The logic to determine the validity of a process joining a cluster now
belongs on the worker and the cluster controller. It is no longer
restricted to tlogs and storages, but instead applies to all processes
(even stateless ones).
In FDB 7.1, this key was stored in the txnStateStore. In 7.2, it has
been moved to the database. This was causing protocol compatibility
issues during upgrades, so we need to rename the key.
This enables clients to receive the cluster ID.
@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-macos on macOS Monterey 12.x

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

@fdb-windows-ci
Copy link
Collaborator

Doxense CI Report for Windows 10

@sfc-gh-ljoswiak sfc-gh-ljoswiak marked this pull request as ready for review October 26, 2022 22:22
@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: 19e739a
  • Duration 1:57:29
  • Result: ❌ FAILED
  • Error: Error while executing command: if $(grep -q -- "--- FAIL:" ${CODEBUILD_SRC_DIR}/fdb-kubernetes-tests/logs/*.log); then echo "TESTS FAILED SEE THESE LOGS:"; echo ; grep -l -- "--- FAIL:" ${CODEBUILD_SRC_DIR}/fdb-kubernetes-tests/logs/*.log; exit 1; fi. Reason: exit status 1
  • Build Logs (available for 30 days)
  • Build Artifact (available for 30 days)

Copy link
Contributor

@halfprice halfprice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix!

Adding isConfigDB to avoid updateClusterSharedStateMap LGTM.

As I don't have context for the rest of the PR, I'd wait for others' approval.

fdbclient/MultiVersionTransaction.actor.cpp Show resolved Hide resolved
@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-ide on Linux CentOS 7

@fdb-windows-ci
Copy link
Collaborator

Doxense CI Report for Windows 10

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux CentOS 7

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang on Linux CentOS 7

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

  • Commit ID: 55b8e75
  • Duration 1:57:57
  • Result: ❌ FAILED
  • Error: Error while executing command: if $(grep -q -- "--- FAIL:" ${CODEBUILD_SRC_DIR}/fdb-kubernetes-tests/logs/*.log); then echo "TESTS FAILED SEE THESE LOGS:"; echo ; grep -l -- "--- FAIL:" ${CODEBUILD_SRC_DIR}/fdb-kubernetes-tests/logs/*.log; exit 1; fi. Reason: exit status 1
  • Build Logs (available for 30 days)
  • Build Artifact (available for 30 days)

@sfc-gh-ljoswiak sfc-gh-ljoswiak merged commit 9625efd into apple:main Oct 27, 2022
@sfc-gh-ljoswiak sfc-gh-ljoswiak deleted the fixes/cluster-id-v3 branch October 27, 2022 20:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants