ccl/schemachangerccl: TestBackupMixedVersionElements failed [node unavailable in KV NodeLiveness even though it started up] #120521
The error is in cockroach/pkg/upgrade/upgradecluster/nodes.go, lines 31 to 62 at 32622e1.
In the CRDB logs, it looks like n3 is in fact up, but n1 doesn't think it is.
Checking if KV can investigate why this disagreement might occur.
Is the solution here to improve cockroach/pkg/upgrade/upgradecluster/cluster.go, lines 74 to 98 at 7fc5635?
I'm confused why this series of tests started failing recently (<2 weeks ago) without any related changes. I've attempted to reproduce for over an hour without success at 12a8659895a33f4a29900e6b9d83bd624e120216:
Speculatively, the cause of the failure could be a race between the node joining and heartbeating liveness, and the upgrade check. We know from the logs that the node created a liveness record, but not that it had heartbeated its liveness.
I'll continue to try and reproduce.
This hasn't reproduced over 2 hours. I'll try a different linked test.
In the test, the gap between n3 being "fully up" and n1 checking whether it is alive is very small. n3 probably published liveness a few ms earlier, but the gossip delay could explain this miss.
These are only 20ms apart, so it is definitely possible that n1 hadn't seen the liveness record for n3 yet. I'm actually surprised this doesn't fail more often.
It should be consistent since it is scanning the liveness range, as opposed to using the in-memory gossiped version?
I was assuming the issue was that the first heartbeat, after creating the record, didn't complete before the upgrade began.
Good point, it does read it from KV. However, the write to KV happens here:
Nodes can be transiently unavailable (failing a heartbeat), in which case the upgrade manager will error out. Retry `UntilClusterStable` up to 10 times when there are unavailable nodes before returning an error. Resolves: cockroachdb#120521 Resolves: cockroachdb#121069 Resolves: cockroachdb#119696 Release note: None
@andrewbaptist I implemented a (less than desirable) retry mechanism for when there are unavailable nodes during the upgrade.
124288: upgradecluster: retry until cluster stable with unavailable nodes r=rafiss a=kvoli Nodes can be transiently unavailable (failing a heartbeat), in which case the upgrade manager will error out. Retry `UntilClusterStable` for up to 10 times when there are unavailable nodes before returning an error. Resolves: #120521 Resolves: #121069 Resolves: #119696 Release note: None Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
ccl/schemachangerccl.TestBackupMixedVersionElements_base_alter_table_alter_primary_key_drop_rowid failed with artifacts on release-23.1 @ 689c53764c44c609313c00f7317f1e747e13a152:
See also: How To Investigate a Go Test Failure (internal)
This test on roachdash.
Jira issue: CRDB-36728