New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Release-7.1] Fix checkall false alarms and refactor code #11141
Conversation
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
|
Result of foundationdb-pr on Linux CentOS 7
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
|
Result of foundationdb-pr on Linux CentOS 7
|
Result of foundationdb-pr on Linux CentOS 7
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-clang on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux CentOS 7
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. A couple of questions.
fdbcli/DebugCommands.actor.cpp
Outdated
if (currentI >= current.data.size()) { | ||
printf(" #%d CurrentI: %lu ReferenceI: %lu Unique key: %s\n", | ||
firstValidServer, | ||
printf("UniqueKey, %s(1), %s(0), CurrentIndex %lu, ReferenceIndex %lu, Version %ld, UniqueKey %s\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the meaning of 1 and 0 here, i.e., "%s(1), %s(0)"?
UniqueKey
appears twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 means that the key appears in the server right before "1". 0 means that the key does not exist in the server right before "0".
fdbcli/DebugCommands.actor.cpp
Outdated
// checkResults keeps invariants: | ||
// (1) hasMore = true if any server has more data not read yet | ||
// (2) nextBeginKey is the minimal key returned from all servers | ||
// (3) checkResults reports inconsistency of keys only before the nextBeginKey if hasMore=true | ||
// Therefore, whether to proceed to the next round depends on hasMore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These repeat the comments in the definition above, could be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch!
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
|
Result of foundationdb-pr on Linux CentOS 7
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
|
Result of foundationdb-pr on Linux CentOS 7
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux CentOS 7
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch and fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just have minor comments can be fixed later.
@@ -294,78 +304,91 @@ ACTOR Future<bool> checkallCommandActor(Database cx, std::vector<StringRef> toke | |||
"for that purpose).\n"); | |||
return false; | |||
} | |||
if (inputRange.empty()) { | |||
return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we want to give some responses indicating this is an empty range.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so. Nice catch!
fdbcli/DebugCommands.actor.cpp
Outdated
} | ||
hasMore = hasMore || current.more; | ||
} | ||
ASSERT(!claimEndKey.empty()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in a corner case that none of the replies have data, does it fail here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a round, there are two possible cases that claimEndKey is empty here: (1) the input range to the tool is empty. For this case, the tool exits at line 308; (2) all replicas return empty. This case is very unlikely to happen. If all replica have no more data is available within the range, the reply.more of all replica should not set and the round was ended in the previous round. There is a corner case which is very unlikely to happen is that we check replica in round1 at version 1 and some replica returns more=true. At version 2, the key is cleared. At version 3, we check the replica and find no data is replied. For this case, the assert is false and we need to redo the checkall command.
4d1dbb2
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
|
Result of foundationdb-pr on Linux CentOS 7
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang on Linux CentOS 7
|
Consider two key sets stored in SS1 and SS2:
SS1 (current server): 1, 2, 3, 4, 5, 6, 7, 8, 9
SS2 (reference server): 1, 2, 3, 4, 7, 8, 9
In this case, SS2 omitted some keys, and the expected output of checkall is 5, 6 as the unique keys on SS1.
However, the existing checkall does not output as expected.
Without loss of generality, suppose the checkall command request 6 successive keys to a SS in batch. Then, in the first batch, SS1 replies 1, 2, 3, 4, 5, 6 and SS2 replies 1, 2, 3, 4, 7, 8. Then, checkall is failed to find 5, 6 on SS2 and failed to find 7, 8 on SS1. Therefore, the tool outputs 5, 6 as the uniques keys on SS1 and 7, 8 as the uniques keys on SS2.
The problem stems from the tool arbitrarily selects the nextBeginKey (currently, the nextBeginKey is the last key replied by the reference server). However, if the reference server omits some keys, the above issue can happen, resulting in false alarms.
This PR fixed this issue by using the minimal last key among all replica reply values and report inconsistency only if (1) the current key is smaller than the minimal last key or (2) no next round exists.
Code-Reviewer Section
The general pull request guidelines can be found here.
Please check each of the following things and check all boxes before accepting a PR.
For Release-Branches
If this PR is made against a release-branch, please also check the following:
release-branch
ormain
if this is the youngest branch)