[Release-7.1] Fix checkall false alarms and refactor code #11141

kakaiu · 2024-01-20T01:51:59Z

Consider two key sets stored in SS1 and SS2:

SS1 (current server): 1, 2, 3, 4, 5, 6, 7, 8, 9
SS2 (reference server): 1, 2, 3, 4, 7, 8, 9

In this case, SS2 is missing keys 5 and 6, so checkall is expected to report only 5, 6 as the unique keys on SS1.
However, the existing checkall does not produce this output.
Without loss of generality, suppose the checkall command requests 6 consecutive keys from each SS per batch. In the first batch, SS1 replies 1, 2, 3, 4, 5, 6 and SS2 replies 1, 2, 3, 4, 7, 8. checkall then fails to find 5, 6 on SS2 and fails to find 7, 8 on SS1. As a result, the tool reports 5, 6 as unique keys on SS1 and 7, 8 as unique keys on SS2. The second report is a false alarm: SS1 does hold 7, 8, but they fall outside its first batch.

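The problem stems from how the tool selects nextBeginKey: currently, nextBeginKey is the last key returned by the reference server. If the reference server is missing some keys, its batch reaches further into the key space than the current server's batch, so keys that the current server would only return in a later batch are wrongly reported as unique to the reference server.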

This PR fixes the issue by using the minimal last key among all replica replies and reporting an inconsistency only if (1) the key in question is smaller than the minimal last key or (2) there is no next round.
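
For illustration, one round of the fixed comparison could look roughly like the following. This is a minimal sketch, not the code in this PR: `Reply`, `RoundResult`, and `checkRound` are hypothetical names, keys are plain integers, and only two replicas are compared.

```cpp
#include <cassert>
#include <initializer_list>
#include <set>
#include <vector>

// Hypothetical stand-in for one replica's reply in a single round.
struct Reply {
    std::vector<int> keys; // keys returned in this round, sorted ascending
    bool more;             // true if the replica has unread keys in the range
};

// Hypothetical result of comparing two replies in one round.
struct RoundResult {
    std::set<int> onlyCurrent;   // reported as unique to the current server
    std::set<int> onlyReference; // reported as unique to the reference server
    bool hasMore = false;        // true if any replica has more data to read
    int minLastKey = 0;          // smallest last key among non-empty replies
};

RoundResult checkRound(const Reply& current, const Reply& reference) {
    // Assumption of this sketch: at least one reply carries data.
    assert(!current.keys.empty() || !reference.keys.empty());

    RoundResult out;
    out.hasMore = current.more || reference.more;

    // Find the minimal last key among the non-empty replies.
    bool haveBound = false;
    for (const Reply* r : { &current, &reference }) {
        if (r->keys.empty()) continue;
        if (!haveBound || r->keys.back() < out.minLastKey) {
            out.minLastKey = r->keys.back();
            haveBound = true;
        }
    }

    // Report a key only if it is below the bound or there is no next round.
    auto reportable = [&out](int key) { return !out.hasMore || key < out.minLastKey; };

    std::set<int> cur(current.keys.begin(), current.keys.end());
    std::set<int> ref(reference.keys.begin(), reference.keys.end());
    for (int k : cur) {
        if (ref.count(k) == 0 && reportable(k)) out.onlyCurrent.insert(k);
    }
    for (int k : ref) {
        if (cur.count(k) == 0 && reportable(k)) out.onlyReference.insert(k);
    }
    return out;
}
```

With the example above and a batch size of 6, the first round has a minimal last key of 6 and reports only 5 on SS1; 7 and 8 are held back because they are not below 6. Assuming the next round starts at key 6, both replicas return their remaining keys with no more data, and 6 is then reported on SS1.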

Code-Reviewer Section

The general pull request guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

The PR has a description, explaining both the problem and the solution.
The description mentions which forms of testing were done and the testing seems reasonable.
Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

foundationdb-ci · 2024-01-20T01:59:59Z

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

Commit ID: 59893cc
Duration 0:07:48
Result: ❌ FAILED
Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)
Cluster Test Logs zip file of the test logs (available for 30 days)

foundationdb-ci · 2024-01-20T02:00:38Z

Result of foundationdb-pr on Linux CentOS 7

Commit ID: 59893cc
Duration 0:08:27
Result: ❌ FAILED
Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2024-01-20T02:01:08Z

Result of foundationdb-pr-macos on macOS Ventura 13.x

Commit ID: 59893cc
Duration 0:08:59
Result: ❌ FAILED
Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2024-01-20T02:09:29Z

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

Commit ID: 59893cc
Duration 0:17:20
Result: ❌ FAILED
Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /opt/homebrew/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2024-01-20T02:29:17Z

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

Commit ID: a2a56d1
Duration 0:06:13
Result: ❌ FAILED
Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)
Cluster Test Logs zip file of the test logs (available for 30 days)

foundationdb-ci · 2024-01-20T02:30:20Z

Result of foundationdb-pr-clang on Linux CentOS 7

Commit ID: de62a54
Duration 0:08:47
Result: ❌ FAILED
Error: reference not found for primary source and source version de62a54b103ee0fdda55573b2d992bf003c2f86e
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2024-01-20T02:30:35Z

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

Commit ID: de62a54
Duration 0:09:03
Result: ❌ FAILED
Error: reference not found for primary source and source version de62a54b103ee0fdda55573b2d992bf003c2f86e
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)
Cluster Test Logs zip file of the test logs (available for 30 days)

foundationdb-ci · 2024-01-20T02:30:39Z

Result of foundationdb-pr on Linux CentOS 7

Commit ID: de62a54
Duration 0:09:04
Result: ❌ FAILED
Error: reference not found for primary source and source version de62a54b103ee0fdda55573b2d992bf003c2f86e
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2024-01-20T02:31:18Z

Result of foundationdb-pr on Linux CentOS 7

Commit ID: a2a56d1
Duration 0:08:18
Result: ❌ FAILED
Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2024-01-20T02:32:03Z

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

Commit ID: de62a54
Duration 0:10:27
Result: ❌ FAILED
Error: reference not found for primary source and source version de62a54b103ee0fdda55573b2d992bf003c2f86e
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2024-01-20T02:32:17Z

Result of foundationdb-pr-macos on macOS Ventura 13.x

Commit ID: de62a54
Duration 0:10:43
Result: ❌ FAILED
Error: reference not found for primary source and source version de62a54b103ee0fdda55573b2d992bf003c2f86e
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2024-01-20T02:45:38Z

Result of foundationdb-pr-clang on Linux CentOS 7

Commit ID: 59893cc
Duration 0:53:26
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2024-01-20T03:23:02Z

Result of foundationdb-pr-clang on Linux CentOS 7

Commit ID: a2a56d1
Duration 0:59:59
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

jzhou77

LGTM. A couple of questions.

jzhou77 · 2024-01-20T04:35:36Z

fdbcli/DebugCommands.actor.cpp

 			if (currentI >= current.data.size()) {
-				printf(" #%d CurrentI: %lu ReferenceI: %lu Unique key: %s\n",
-				       firstValidServer,
+				printf("UniqueKey, %s(1), %s(0), CurrentIndex %lu, ReferenceIndex %lu, Version %ld, UniqueKey %s\n",


What's the meaning of 1 and 0 here, i.e., "%s(1), %s(0)"?
UniqueKey appears twice.

1 means that the key exists on the server named immediately before "(1)". 0 means that the key does not exist on the server named immediately before "(0)".
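
To make the format concrete, here is a standalone sketch (not the fdbcli code) that fills in the same format string. The function name `formatUniqueKey`, the argument order, and any sample values are assumptions for illustration only.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: builds one "UniqueKey" line using the format string
// from the diff. The first server name is the one that has the key ("(1)"),
// the second is the one that lacks it ("(0)").
std::string formatUniqueKey(const std::string& serverWithKey,
                            const std::string& serverWithoutKey,
                            unsigned long currentIndex,
                            unsigned long referenceIndex,
                            long version,
                            const std::string& key) {
    char buf[512];
    std::snprintf(buf, sizeof(buf),
                  "UniqueKey, %s(1), %s(0), CurrentIndex %lu, ReferenceIndex %lu, "
                  "Version %ld, UniqueKey %s\n",
                  serverWithKey.c_str(), serverWithoutKey.c_str(),
                  currentIndex, referenceIndex, version, key.c_str());
    return std::string(buf);
}
```

Under these assumptions, calling `formatUniqueKey("ss1", "ss2", 4, 4, 100, "5")` yields a line in which `ss1(1)` marks the server that has key 5 and `ss2(0)` marks the server that lacks it.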

jzhou77 · 2024-01-20T04:40:05Z

fdbcli/DebugCommands.actor.cpp

+					// checkResults keeps invariants:
+					// (1) hasMore = true if any server has more data not read yet
+					// (2) nextBeginKey is the minimal key returned from all servers
+					// (3) checkResults reports inconsistency of keys only before the nextBeginKey if hasMore=true
+					// Therefore, whether to proceed to the next round depends on hasMore


These repeat the comments in the definition above and could be removed.

Nice catch!

foundationdb-ci · 2024-01-20T05:21:54Z

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

Commit ID: a3813dd
Duration 0:07:50
Result: ❌ FAILED
Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)
Cluster Test Logs zip file of the test logs (available for 30 days)

foundationdb-ci · 2024-01-20T05:22:12Z

Result of foundationdb-pr on Linux CentOS 7

Commit ID: a3813dd
Duration 0:08:11
Result: ❌ FAILED
Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2024-01-20T05:22:50Z

Result of foundationdb-pr-macos on macOS Ventura 13.x

Commit ID: a3813dd
Duration 0:08:46
Result: ❌ FAILED
Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2024-01-20T05:26:31Z

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

Commit ID: ab79237
Duration 0:07:42
Result: ❌ FAILED
Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)
Cluster Test Logs zip file of the test logs (available for 30 days)

foundationdb-ci · 2024-01-20T05:27:18Z

Result of foundationdb-pr on Linux CentOS 7

Commit ID: ab79237
Duration 0:08:31
Result: ❌ FAILED
Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2024-01-20T05:30:52Z

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

Commit ID: a3813dd
Duration 0:16:50
Result: ❌ FAILED
Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /opt/homebrew/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2024-01-20T06:12:35Z

Result of foundationdb-pr-clang on Linux CentOS 7

Commit ID: a3813dd
Duration 0:58:34
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2024-01-20T06:12:46Z

Result of foundationdb-pr-clang on Linux CentOS 7

Commit ID: ab79237
Duration 0:54:01
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

liquid-helium

Great catch and fix

hfu94

LGTM. I just have minor comments that can be fixed later.

hfu94 · 2024-01-22T19:58:36Z

fdbcli/DebugCommands.actor.cpp

@@ -294,78 +304,91 @@ ACTOR Future<bool> checkallCommandActor(Database cx, std::vector<StringRef> toke
 		    "for that purpose).\n");
 		return false;
 	}
+	if (inputRange.empty()) {
+		return true;


Do we want to print a response indicating that this is an empty range?

I think so. Nice catch!

hfu94 · 2024-01-22T19:59:04Z

fdbcli/DebugCommands.actor.cpp

+		}
+		hasMore = hasMore || current.more;
+	}
+	ASSERT(!claimEndKey.empty());


In the corner case where none of the replies have data, does it fail here?

Within a round, there are two possible cases in which claimEndKey is empty here: (1) The input range to the tool is empty. In this case, the tool exits at line 308. (2) All replicas return empty replies. This case is very unlikely. If no replica has more data available within the range, reply.more should not be set on any replica, and the loop would already have ended after the previous round. The remaining corner case, which is also very unlikely, is the following: in round 1 we check the replicas at version 1 and some replica returns more=true; at version 2, the key is cleared; at version 3, we check the replicas again and no data is returned. In this case, the assert fails and the checkall command needs to be rerun.
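
As a rough illustration of case (2), the scope decision for a round can be sketched as below. This is a simplified model, not the PR code: `Reply`, `Scope`, and `decideScope` are hypothetical names, keys are plain strings, and the sketch returns a flag where the real code asserts.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for one replica's reply in a single round.
struct Reply {
    std::vector<std::string> keys; // keys returned in this round, sorted
    bool more;                     // true if the replica has unread data
};

// Hypothetical summary of a round's comparison scope.
struct Scope {
    bool hasEndKey;          // false if every reply was empty
    std::string claimEndKey; // minimal last key among non-empty replies
    bool hasMore;            // true if any replica reported more data
};

Scope decideScope(const std::vector<Reply>& replies) {
    Scope s;
    s.hasEndKey = false;
    s.hasMore = false;
    for (const Reply& r : replies) {
        if (!r.keys.empty()) {
            // Keep the smallest last key seen so far.
            if (!s.hasEndKey || r.keys.back() < s.claimEndKey) {
                s.claimEndKey = r.keys.back();
                s.hasEndKey = true;
            }
        }
        s.hasMore = s.hasMore || r.more;
    }
    // If hasEndKey is still false here, every reply was empty; this is the
    // situation in which the assert discussed above would fire.
    return s;
}
```

In this sketch a caller could check `hasEndKey` and ask the user to rerun the command instead of hitting an assertion; whether that is preferable for the real tool is a separate design question.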

foundationdb-ci · 2024-01-23T00:16:24Z

Result of foundationdb-pr-cluster-tests on Linux CentOS 7

Commit ID: 4d1dbb2
Duration 0:07:59
Result: ❌ FAILED
Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)
Cluster Test Logs zip file of the test logs (available for 30 days)

foundationdb-ci · 2024-01-23T00:16:56Z

Result of foundationdb-pr on Linux CentOS 7

Commit ID: 4d1dbb2
Duration 0:08:29
Result: ❌ FAILED
Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2024-01-23T00:17:21Z

Result of foundationdb-pr-macos on macOS Ventura 13.x

Commit ID: 4d1dbb2
Duration 0:08:58
Result: ❌ FAILED
Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2024-01-23T00:24:02Z

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

Commit ID: 4d1dbb2
Duration 0:15:39
Result: ❌ FAILED
Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /opt/homebrew/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2024-01-23T01:04:19Z

Result of foundationdb-pr-clang on Linux CentOS 7

Commit ID: 4d1dbb2
Duration 0:55:56
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

kakaiu requested review from jzhou77 and hfu94 January 20, 2024 01:52

kakaiu force-pushed the fix-check-all branch from 59893cc to de62a54 Compare January 20, 2024 02:21

fix-checkall-false-alarm-bug

a2a56d1

kakaiu force-pushed the fix-check-all branch from de62a54 to a2a56d1 Compare January 20, 2024 02:22

jzhou77 reviewed Jan 20, 2024


address comments

ab79237

kakaiu force-pushed the fix-check-all branch from a3813dd to ab79237 Compare January 20, 2024 05:18

kakaiu requested a review from jzhou77 January 20, 2024 05:18

jzhou77 previously approved these changes Jan 20, 2024


kakaiu requested a review from liquid-helium January 22, 2024 19:29

liquid-helium previously approved these changes Jan 22, 2024


hfu94 previously approved these changes Jan 22, 2024


address comments and nits

4d1dbb2

kakaiu dismissed stale reviews from hfu94, liquid-helium, and jzhou77 via 4d1dbb2 January 23, 2024 00:08

kakaiu requested review from hfu94, liquid-helium and jzhou77 January 23, 2024 01:04

liquid-helium approved these changes Jan 23, 2024


jzhou77 approved these changes Jan 23, 2024


jzhou77 merged commit b5efbc7 into apple:release-7.1 Jan 23, 2024
1 of 5 checks passed

kakaiu changed the title ~~Fix checkall false alarms and refactor code~~ [Release-7.1] Fix checkall false alarms and refactor code Jan 30, 2024
