Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: have autoupgrade process look at decommission status, not availability #53515

Open
irfansharif opened this issue Aug 26, 2020 · 3 comments
Labels
A-kv-decom-rolling-restart Decommission and Rolling Restarts C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team
Projects

Comments

@irfansharif
Copy link
Contributor

irfansharif commented Aug 26, 2020

Reported privately in https://github.com/cockroachlabs/support/issues/584 we saw an instance of the following:

  • A node (n14) was added, and allocated a node ID.
  • The node was immediately removed (fast enough, I think, to not get to persist a liveness record of its own).
  • A month later they tried upgrading their cluster, relying on the autoupgrade process.
  • The autoupgrade process found the added node to be unavailable and kept spinning
59512:I200824 08:03:41.404150 1038 server/server_update.go:50  [n1] failed attempt to upgrade cluster version, error: node 14 not running (UNKNOWN), cannot determine version
59516:I200824 08:04:09.303473 1038 server/server_update.go:50  [n1] failed attempt to upgrade cluster version, error: node 14 not running (UNKNOWN), cannot determine version
59520:I200824 08:04:35.472315 1038 server/server_update.go:50  [n1] failed attempt to upgrade cluster version, error: node 14 not running (UNKNOWN), cannot determine version
59529:I200824 08:05:05.895085 1038 server/server_update.go:50  [n1] failed attempt to upgrade cluster version, error: node 14 not running (UNKNOWN), cannot determine version
  • We couldn't decommission n14 because no corresponding liveness record existed for it.
  • We were able to manually bump the cluster version because this process scanned all the available liveness records to ensure that there were no dead nodes that weren't decommissioned.

We should have the autoupgrade process behave similarly by looking at the fully decommissioned bit we added in #50329, instead of just looking at availability. (Separately, it'd be nice to always have a liveness record created when adding a node to the cluster, something that I think will be made easier by #52526).

Jira issue: CRDB-3857

@irfansharif irfansharif added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-decom-rolling-restart Decommission and Rolling Restarts labels Aug 26, 2020
@irfansharif irfansharif added this to Incoming in KV via automation Aug 26, 2020
@knz
Copy link
Contributor

knz commented Aug 27, 2020

The code for this is in (*Server) upgradeStatus() in server/server_update.go.

@irfansharif irfansharif added the E-starter Might be suitable for a starter project for new employees or team members. label Sep 2, 2020
irfansharif added a commit to irfansharif/cockroach that referenced this issue Oct 22, 2020
… not just availability

Fixes cockroachdb#53515.

We should have the autoupgrade process look at the fully decommissioned
bit we added in cockroachdb#50329, instead of just looking at availability. It
would avoid the hazard described in cockroachdb#53515.

Previously the autoupgrade process was also looking at
NodeStatusLiveness, which we've since soured upon (see cockroachdb#50478). Now that
we always create a liveness record on start up (cockroachdb#53805), we can simply
fetch all liveness records from KV. We add a helper to do this, which
we'll also rely on in future PRs for other purposes. It's a bit
unfortunate that we're further adding on to the NodeLiveness API without
changing the caching structure, but the methods fetching records from KV
is the world we're hoping to move towards going forward.

Release note: None
@mwang1026 mwang1026 moved this from Incoming to On Hold in KV May 28, 2021
@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
@github-actions
Copy link

github-actions bot commented Sep 7, 2023

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@knz knz removed E-starter Might be suitable for a starter project for new employees or team members. no-issue-activity labels Sep 7, 2023
@knz
Copy link
Contributor

knz commented Sep 7, 2023

still relevant

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-decom-rolling-restart Decommission and Rolling Restarts C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team
Projects
KV
On Hold
3 participants