Data loss protection v3 #8560

sfc-gh-ljoswiak · 2022-10-24T21:36:25Z

Builds on #8472, but fixes an fdbcli test timeout due to the ClientDBInfo not being updated with the cluster ID. Also disables reading of \xff\xff/cluster_id from configuration databases.

Including PR description from #8472 for reference:

Data loss protection is a feature to disallow processes from joining the wrong cluster and deleting their data (see #5375). The original implementation (#5637) had to be reverted due to some problems.

This PR re-enables data loss protection, but with a slightly different implementation. Each cluster generates a random cluster ID on the CC after recovery. This UID is stored in the database and sent to all processes through the ServerDBInfo object. Previously, stateful processes would write the cluster ID to their data files and compare this value to that of the txnStateStore on boot. This PR modifies that approach. Now, each process writes the cluster ID to a durable file in the data directory (the file is named clusterId). When a process starts, it includes its cluster ID, read from this durable file, in its registration message to the cluster controller. The cluster controller verifies that cluster ID against its own on-disk value and refuses to let the process join the cluster if the two IDs don't match. When first starting, the cluster controller also verifies its on-disk cluster ID against the one stored in the database.
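The flow described above can be sketched roughly as follows. This is an illustrative Python model, not the actual C++ implementation: only the file name `clusterId` comes from the description, while the function names, the hex file encoding, the temp-file-plus-rename write, and the rule that a process with no persisted ID may join are all assumptions made for the example.

```python
import os
import tempfile
import uuid
from typing import Optional

CLUSTER_ID_FILE = "clusterId"  # file name taken from the PR description


def read_cluster_id(data_dir: str) -> Optional[uuid.UUID]:
    """Return the persisted cluster ID, or None if no file exists yet."""
    path = os.path.join(data_dir, CLUSTER_ID_FILE)
    try:
        with open(path, "r", encoding="ascii") as f:
            return uuid.UUID(f.read().strip())
    except FileNotFoundError:
        return None


def persist_cluster_id(data_dir: str, cluster_id: uuid.UUID) -> None:
    """Write the ID durably: temp file, fsync, then atomic rename (assumed scheme)."""
    path = os.path.join(data_dir, CLUSTER_ID_FILE)
    fd, tmp_path = tempfile.mkstemp(dir=data_dir)
    try:
        with os.fdopen(fd, "w", encoding="ascii") as f:
            f.write(cluster_id.hex)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Do not leave a stray temp file behind if the write failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def may_join(controller_id: uuid.UUID, worker_id: Optional[uuid.UUID]) -> bool:
    """Controller-side check on the ID carried in a registration message.

    Assumption: a worker without a persisted ID is treated as new and allowed in.
    """
    return worker_id is None or worker_id == controller_id
```

In this model, a worker would call `read_cluster_id` at startup, put the result in its registration message, and call `persist_cluster_id` once the controller has accepted it and sent back the cluster's ID.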

Code-Reviewer Section

The general pull request guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

The PR has a description, explaining both the problem and the solution.
The description mentions which forms of testing were done and the testing seems reasonable.
Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

foundationdb-ci · 2022-10-24T21:54:21Z

Result of foundationdb-pr-clang-ide on Linux CentOS 7

Commit ID: b968976
Duration 0:17:46
Result: ✅ SUCCEEDED
Error: N/A
Build Logs (available for 30 days)
Build Artifact (available for 30 days)

foundationdb-ci · 2022-10-24T22:10:22Z

Result of foundationdb-pr-clang on Linux CentOS 7

Commit ID: b968976
Duration 0:33:47
Result: ❌ FAILED
Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
Build Logs (available for 30 days)
Build Artifact (available for 30 days)

foundationdb-ci · 2022-10-24T22:17:06Z

Result of foundationdb-pr on Linux CentOS 7

Commit ID: b968976
Duration 0:40:35
Result: ❌ FAILED
Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
Build Logs (available for 30 days)
Build Artifact (available for 30 days)

fdb-windows-ci · 2022-10-24T22:45:45Z

Doxense CI Report for Windows 10

Commit ID: b968976
Result: ✔️ SUCCEEDED
Build Logs (available for 30 days)

foundationdb-ci · 2022-10-24T23:34:50Z

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

Commit ID: b968976
Duration 1:58:14
Result: ❌ FAILED
Error: Error while executing command: if $(grep -q -- "--- FAIL:" ${CODEBUILD_SRC_DIR}/fdb-kubernetes-tests/logs/*.log); then echo "TESTS FAILED SEE THESE LOGS:"; echo ; grep -l -- "--- FAIL:" ${CODEBUILD_SRC_DIR}/fdb-kubernetes-tests/logs/*.log; exit 1; fi. Reason: exit status 1
Build Logs (available for 30 days)
Build Artifact (available for 30 days)

foundationdb-ci · 2022-10-25T00:15:30Z

Result of foundationdb-pr-clang-ide on Linux CentOS 7

Commit ID: 3b356a2
Duration 0:17:05
Result: ✅ SUCCEEDED
Error: N/A
Build Logs (available for 30 days)
Build Artifact (available for 30 days)

fdb-windows-ci · 2022-10-25T00:51:03Z

Doxense CI Report for Windows 10

Commit ID: 3b356a2
Result: ✔️ SUCCEEDED
Build Logs (available for 30 days)

foundationdb-ci · 2022-10-25T00:52:00Z

Result of foundationdb-pr-clang on Linux CentOS 7

Commit ID: 3b356a2
Duration 0:53:36
Result: ✅ SUCCEEDED
Error: N/A
Build Logs (available for 30 days)
Build Artifact (available for 30 days)

foundationdb-ci · 2022-10-25T00:52:37Z

Result of foundationdb-pr-clang-ide on Linux CentOS 7

Commit ID: f6be2bc
Duration 0:15:59
Result: ✅ SUCCEEDED
Error: N/A
Build Logs (available for 30 days)
Build Artifact (available for 30 days)

foundationdb-ci · 2022-10-25T00:59:01Z

Result of foundationdb-pr on Linux CentOS 7

Commit ID: 3b356a2
Duration 1:00:35
Result: ✅ SUCCEEDED
Error: N/A
Build Logs (available for 30 days)
Build Artifact (available for 30 days)

fdb-windows-ci · 2022-10-25T01:05:01Z

Doxense CI Report for Windows 10

Commit ID: f6be2bc
Result: ✔️ SUCCEEDED
Build Logs (available for 30 days)

foundationdb-ci · 2022-10-25T01:05:55Z

Result of foundationdb-pr-clang on Linux CentOS 7

Commit ID: f6be2bc
Duration 0:29:14
Result: ❌ FAILED
Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
Build Logs (available for 30 days)
Build Artifact (available for 30 days)

foundationdb-ci · 2022-10-25T01:15:20Z

Result of foundationdb-pr-macos-m1 on macOS BigSur 11.5.2

Commit ID: f6be2bc
Duration 0:38:41
Result: ✅ SUCCEEDED
Error: N/A
Build Logs (available for 30 days)
Build Artifact (available for 30 days)

foundationdb-ci · 2022-10-25T01:21:04Z

Result of foundationdb-pr-macos on macOS Monterey 12.x

Commit ID: f6be2bc
Duration 0:44:26
Result: ✅ SUCCEEDED
Error: N/A
Build Logs (available for 30 days)
Build Artifact (available for 30 days)

foundationdb-ci · 2022-10-25T01:34:17Z

Result of foundationdb-pr on Linux CentOS 7

Commit ID: f6be2bc
Duration 0:57:39
Result: ❌ FAILED
Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
Build Logs (available for 30 days)
Build Artifact (available for 30 days)

foundationdb-ci · 2022-10-25T01:56:07Z

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

Commit ID: 3b356a2
Duration 1:57:41
Result: ❌ FAILED
Error: Error while executing command: if $(grep -q -- "--- FAIL:" ${CODEBUILD_SRC_DIR}/fdb-kubernetes-tests/logs/*.log); then echo "TESTS FAILED SEE THESE LOGS:"; echo ; grep -l -- "--- FAIL:" ${CODEBUILD_SRC_DIR}/fdb-kubernetes-tests/logs/*.log; exit 1; fi. Reason: exit status 1
Build Logs (available for 30 days)
Build Artifact (available for 30 days)

foundationdb-ci · 2022-10-25T02:32:48Z

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

Commit ID: f6be2bc
Duration 1:56:09
Result: ❌ FAILED
Error: Error while executing command: if $(grep -q -- "--- FAIL:" ${CODEBUILD_SRC_DIR}/fdb-kubernetes-tests/logs/*.log); then echo "TESTS FAILED SEE THESE LOGS:"; echo ; grep -l -- "--- FAIL:" ${CODEBUILD_SRC_DIR}/fdb-kubernetes-tests/logs/*.log; exit 1; fi. Reason: exit status 1
Build Logs (available for 30 days)
Build Artifact (available for 30 days)

The simulator tracks only active processes. Rebooted or killed processes are removed from the list of processes and are added back only when the process starts up again. This causes a problem for the `RebootProcessAndSwitch` kill type, which wants to simultaneously reboot all machines in a cluster and change their cluster file. If a machine is being rebooted at that moment, it will miss the `RebootProcessAndSwitch` command.

The fix is to add a check when a process is started in simulation. If the process has had its cluster file changed and the cluster is in a state where all processes should have had their cluster files reverted to the original value, the simulator now sends a `RebootProcessAndSwitch` signal as soon as the process starts. This causes an extra reboot, but should correctly switch the process back to its original cluster file, allowing all clusters to fully recover. This issue should only affect simulation, because of how the simulator tracks processes and handles kill signals.

This commit also adds a field to each process struct that records whether the process is running in a DR cluster in the simulation run. The field is needed because simulation does not differentiate between processes in different clusters (other than by IP), and some processes need to switch clusters while others simply need to be rebooted.
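A toy model of the missed-signal problem and the start-time check, written in Python for brevity. Every name here (`SimProcess`, `Simulator`, `expect_original_files`, `extra_reboots`) is hypothetical and chosen for illustration; the real simulator is C++ and considerably more involved.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SimProcess:
    name: str
    cluster_file: str           # cluster file the process currently uses
    original_cluster_file: str  # cluster file it was first started with


class Simulator:
    def __init__(self) -> None:
        self.active: Dict[str, SimProcess] = {}  # only running processes are tracked
        self.expect_original_files = True        # should everyone be on the original file?
        self.extra_reboots: List[str] = []       # processes fixed up at start time

    def stop(self, name: str) -> SimProcess:
        """A rebooting or killed process drops out of the active list."""
        return self.active.pop(name)

    def switch_all(self, new_file: str) -> None:
        """Reboot-and-switch every active process; stopped ones miss the signal."""
        self.expect_original_files = False
        for p in self.active.values():
            p.cluster_file = new_file

    def revert_all(self) -> None:
        """Switch every active process back; stopped ones miss this as well."""
        self.expect_original_files = True
        for p in self.active.values():
            p.cluster_file = p.original_cluster_file

    def start(self, p: SimProcess) -> None:
        self.active[p.name] = p
        # The check added at start: a process that missed the revert gets
        # one extra reboot-and-switch immediately.
        if self.expect_original_files and p.cluster_file != p.original_cluster_file:
            p.cluster_file = p.original_cluster_file
            self.extra_reboots.append(p.name)
```

Without the check in `start`, a process that was down during `revert_all` would come back still pointing at the wrong cluster file and stay there.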

foundationdb-ci · 2022-10-26T21:05:18Z

Result of foundationdb-pr-clang-ide on Linux CentOS 7

Commit ID: 19e739a
Duration 0:18:05
Result: ✅ SUCCEEDED
Error: N/A
Build Logs (available for 30 days)
Build Artifact (available for 30 days)

foundationdb-ci · 2022-10-26T21:30:58Z

Result of foundationdb-pr-macos on macOS Monterey 12.x

Commit ID: 19e739a
Duration 0:43:44
Result: ✅ SUCCEEDED
Error: N/A
Build Logs (available for 30 days)
Build Artifact (available for 30 days)

foundationdb-ci · 2022-10-26T21:38:51Z

Result of foundationdb-pr-clang on Linux CentOS 7

Commit ID: 19e739a
Duration 0:51:37
Result: ✅ SUCCEEDED
Error: N/A
Build Logs (available for 30 days)
Build Artifact (available for 30 days)

foundationdb-ci · 2022-10-26T21:42:13Z

Result of foundationdb-pr on Linux CentOS 7

Commit ID: 19e739a
Duration 0:55:00
Result: ✅ SUCCEEDED
Error: N/A
Build Logs (available for 30 days)
Build Artifact (available for 30 days)

fdb-windows-ci · 2022-10-26T22:12:52Z

Doxense CI Report for Windows 10

Commit ID: 19e739a
Result: ✔️ SUCCEEDED
Build Logs (available for 30 days)

foundationdb-ci · 2022-10-26T22:44:41Z

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

Commit ID: 19e739a
Duration 1:57:29
Result: ❌ FAILED
Error: Error while executing command: if $(grep -q -- "--- FAIL:" ${CODEBUILD_SRC_DIR}/fdb-kubernetes-tests/logs/*.log); then echo "TESTS FAILED SEE THESE LOGS:"; echo ; grep -l -- "--- FAIL:" ${CODEBUILD_SRC_DIR}/fdb-kubernetes-tests/logs/*.log; exit 1; fi. Reason: exit status 1
Build Logs (available for 30 days)
Build Artifact (available for 30 days)

halfprice

Thanks for the fix!

Adding isConfigDB to avoid updateClusterSharedStateMap LGTM.

As I don't have context for the rest of the PR, I'd wait for others' approval.

fdbclient/MultiVersionTransaction.actor.cpp

foundationdb-ci · 2022-10-27T16:28:45Z

Result of foundationdb-pr-clang-ide on Linux CentOS 7

Commit ID: 55b8e75
Duration 0:16:41
Result: ✅ SUCCEEDED
Error: N/A
Build Logs (available for 30 days)
Build Artifact (available for 30 days)

fdb-windows-ci · 2022-10-27T16:35:35Z

Doxense CI Report for Windows 10

Commit ID: 55b8e75
Result: ✔️ SUCCEEDED
Build Logs (available for 30 days)

foundationdb-ci · 2022-10-27T16:56:40Z

Result of foundationdb-pr on Linux CentOS 7

Commit ID: 55b8e75
Duration 0:44:38
Result: ✅ SUCCEEDED
Error: N/A
Build Logs (available for 30 days)
Build Artifact (available for 30 days)

foundationdb-ci · 2022-10-27T17:09:05Z

Result of foundationdb-pr-clang on Linux CentOS 7

Commit ID: 55b8e75
Duration 0:57:05
Result: ✅ SUCCEEDED
Error: N/A
Build Logs (available for 30 days)
Build Artifact (available for 30 days)

foundationdb-ci · 2022-10-27T18:09:55Z

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

Commit ID: 55b8e75
Duration 1:57:57
Result: ❌ FAILED
Error: Error while executing command: if $(grep -q -- "--- FAIL:" ${CODEBUILD_SRC_DIR}/fdb-kubernetes-tests/logs/*.log); then echo "TESTS FAILED SEE THESE LOGS:"; echo ; grep -l -- "--- FAIL:" ${CODEBUILD_SRC_DIR}/fdb-kubernetes-tests/logs/*.log; exit 1; fi. Reason: exit status 1
Build Logs (available for 30 days)
Build Artifact (available for 30 days)

sfc-gh-ljoswiak force-pushed the fixes/cluster-id-v3 branch from b968976 to 3b356a2 on October 24, 2022 23:58

sfc-gh-ljoswiak force-pushed the fixes/cluster-id-v3 branch from 3b356a2 to f6be2bc on October 25, 2022 00:36

sfc-gh-ljoswiak added 11 commits October 26, 2022 13:17

Add simulation support for changing the cluster file

f1c8cb3

Avoid recruiting workers with different cluster ID

cc97154

Notify processes joining the wrong cluster

8596c44

And have these processes enter a "zombie" state where they cancel all their actors and then wait forever, refusing to do any additional work until they are manually handled by the operator.

Move cluster ID from txnStateStore to the database

c907f74

The cluster ID is now stored in the database instead of in the txnStateStore. The cluster controller will read it on boot and send it to all processes to persist.

Modify how cluster ID tests are run in simulation

eec7c16

Remove cluster ID logic from individual roles

64474ac

The logic to determine the validity of a process joining a cluster now belongs on the worker and the cluster controller. It is no longer restricted to tlogs and storages, but instead applies to all processes (even stateless ones).

Avoid blocking in choose when

1fa6e6b

Rename cluster ID key

12ff151

In FDB 7.1, this key was stored in the txnStateStore. In 7.2, it has been moved to the database. This was causing protocol compatibility issues during upgrades, so we need to rename the key.

Format

cdaa95d

Write cluster ID to ClientDBInfo

5a32001

This enables clients to receive the cluster ID.

Disable shared state updates on configuration database

19e739a
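The "zombie" state from commit 8596c44 above (cancel all actors, then wait forever) can be illustrated with a small asyncio sketch. This is an analogy in Python, not the Flow/C++ code from the PR; `enter_zombie_state` is a hypothetical name, and asyncio tasks stand in for actors.

```python
import asyncio
from typing import List


async def enter_zombie_state(actors: List[asyncio.Task]) -> None:
    """Cancel every running actor, then block forever without doing any work."""
    for task in actors:
        task.cancel()
    # Wait until each task has actually finished unwinding its cancellation.
    await asyncio.gather(*actors, return_exceptions=True)
    # This event is never set, so the coroutine parks here until the
    # process is handled externally (by the operator, in the PR's terms).
    await asyncio.Event().wait()
```

The point of parking rather than exiting is that the process keeps its data files untouched and stays visible to the operator instead of restarting and retrying the join.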

sfc-gh-ljoswiak force-pushed the fixes/cluster-id-v3 branch from f6be2bc to 19e739a on October 26, 2022 20:47

sfc-gh-ljoswiak marked this pull request as ready for review October 26, 2022 22:22

sfc-gh-ljoswiak requested review from jzhou77, sfc-gh-etschannen and halfprice October 26, 2022 22:23

halfprice reviewed Oct 27, 2022

fdbclient/MultiVersionTransaction.actor.cpp

Add comment about configuration database

55b8e75

jzhou77 approved these changes Oct 27, 2022

sfc-gh-ljoswiak merged commit 9625efd into apple:main Oct 27, 2022

sfc-gh-ljoswiak deleted the fixes/cluster-id-v3 branch October 27, 2022 20:56

This was referenced Mar 8, 2023

Protect against data loss if process joins wrong cluster #5375

Closed

Data loss protection v2 #8472

Merged
