
liveness: allow registering callbacks after start #107265

Merged
merged 13 commits into from Jul 26, 2023

Conversation

@tbg (Member) commented Jul 20, 2023

I discovered[^1] a deadlock scenario when multiple nodes in the cluster restart
with additional stores that need to be bootstrapped. In that case, liveness
must be running when the StoreIDs are allocated, but it is not.

Trying to address this problem, I realized that when an auxiliary Store is bootstrapped,
it will create a new replicateQueue, which will register a new callback into NodeLiveness.

But if liveness must be started at this point to fix #106706, we'll run into the assertion
that checks that we don't register callbacks on a started node liveness.

Something's got to give: we will allow registering callbacks at any given point
in time, and they'll get an initial set of notifications synchronously. I
audited the few users of RegisterCallback and this seems OK with all of them.

Epic: None
Release note (bug fix): it was possible for a node status to reflect a "last up" timestamp that led the actual last liveness heartbeat of the node. This has been fixed.

[^1]: https://github.com/cockroachdb/cockroach/issues/106706#issuecomment-1640254715

@blathers-crl (bot) commented Jul 20, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity (Member)

This change is Reviewable

@erikgrinaker (Contributor)

Right, this is probably fallout from #103601.

@tbg tbg added the db-cy-23 label Jul 21, 2023
@tbg tbg marked this pull request as ready for review July 21, 2023 11:20
@tbg tbg requested a review from a team as a code owner July 21, 2023 11:20
@andrewbaptist (Collaborator)

Thanks Tobi!

I see what you mean about this test; it was tricky to get some of the tests, including this one, to work without doing what you did and allowing later registration.

The approach you have works, but I had considered two other options as well (and could code them up if you want):

  1. Don't allow disks/stores to be added to a running instance, instead require a restart. I wasn't sure if this was something that we support in a field deployment but I should have checked first. If we don't support this, then it would be possible to rearrange the code to allow this to work

  2. Have a "single" node level callback that then fans out to the stores on liveness changes. This simplifies the liveness code as there is only a single callback, and the node already has locking in place for adding stores.

Your approach also works and is probably the most straightforward.

@tbg (Member, Author) commented Jul 21, 2023

Don't allow disks/stores to be added to a running instance, instead require a restart. I wasn't sure if this was something that we support in a field deployment but I should have checked first. If we don't support this, then it would be possible to rearrange the code to allow this to work

The new stores are being added following a restart. However, due to the need to assign StoreIDs, KV needs to be up and running. So these new stores are added "async". I don't see a good way around that.

Have a "single" node level callback that then fans out to the stores on liveness changes. This simplifies the liveness code as there is only a single callback, and the node already has locking in place for adding stores.

That's not a bad idea, though I won't be able to do it before I leave (and it might get crowded out anyway). So even though I like that idea, I'd like to stick with what I have, so that we can at least address the bug.

@tbg tbg force-pushed the liveness-register-anytime branch 2 times, most recently from 05cbe38 to 9c8bddd Compare July 24, 2023 10:27
tbg added 13 commits July 24, 2023 17:55
The two removed fields are nil. This made a test failure during the refactors
in this PR more annoying. If we're going to set up a half-inited NodeLiveness,
let's at least be honest about it.
This will help with testing.
We'll give this a proper interface soon.
So that it can implement a public interface.
It's needed to implement the liveness storage (once it exists).
We still want a Storage to be passed into NewNodeLiveness as opposed to a
`*kv.DB`, but so far, so good.
Now Liveness is constructed using a `Storage` as opposed to a `*kv.DB`.
I think there was a bug here. This method was previously invoked in
`updateLiveness`, but that method is the general workhorse for updating
anyone's liveness. In particular, it is called by `IncrementEpoch`. So we were
invoking `onSelfHeartbeat` when we would increment other nodes' epochs. This
doesn't seem great.

Additionally, the code was trying to avoid invoking this callback before
liveness was officially "started". Heartbeating yourself before liveness is
started is unfortunately a thing due to the tangled start-up initialization
sequence; we may see heartbeats triggered by lease requests.

Avoid both complications by invoking `onSelfHeartbeat` from the actual main
heartbeat loop, whose only job is to heartbeat the node's own liveness record.

I tried to adapt `TestNodeHeartbeatCallback` to give better coverage, but
it's a yak shave. A deterministic node liveness (i.e. a way to invoke the
main heartbeat loop manually) would make this a lot simpler. I filed an
issue to that effect:

cockroachdb#107452
This tests that regardless of when a callback is registered, it gets called.
@tbg tbg force-pushed the liveness-register-anytime branch from 9c8bddd to d767731 Compare July 24, 2023 15:57
@andrewbaptist (Collaborator) left a comment

:lgtm:

@tbg tbg requested a review from erikgrinaker July 26, 2023 07:23
@erikgrinaker (Contributor)

We'll want a targeted backport here too, since it seems like prior versions are vulnerable to the same issue (though I'm not sure, given all the refactoring in this area).

@tbg (Member, Author) commented Jul 26, 2023

We'll want a targeted backport here too, since it seems like prior versions are vulnerable to the same issue (but I'm not sure with all the refactoring in this area).

I'm not sure we should fix this in a backport. The chances of messing something up are much higher than the chances of anyone hitting this bug. Folks very rarely add stores to existing nodes (multi-store setups are themselves rare), and doing so in a way that loses quorum on the storeIDGen is rarer still. Also, there is a workaround: restart a quorum without the extra store.

If you disagree, let's file an issue because I am not sure I would get to this backport even if we tried.

@erikgrinaker (Contributor)

Ok, I can buy that. I was also thinking of surrounding issues like the self-callback, but they're fairly minor.

@tbg (Member, Author) commented Jul 26, 2023

The self-callback I can fix in a backport: we can just add a conditional (if we don't insist on also adding a test).

The hang at startup, btw, I think has existed in every past version of CRDB, and to the best of my knowledge was never hit by anyone in a real cluster. Even the test that caught it only did so after I refactored it and made it use in-mem engines. So I am fairly confident that not backporting is the reasonable strategy here.

@tbg (Member, Author) commented Jul 26, 2023

bors r=erikgrinaker

tbg added a commit to tbg/cockroach that referenced this pull request Jul 26, 2023
tbg added a commit to tbg/cockroach that referenced this pull request Jul 26, 2023
@tbg (Member, Author) commented Jul 26, 2023

Backports:

#107605
#107606

@craig craig bot merged commit f147c2b into cockroachdb:master Jul 26, 2023
7 checks passed
@craig (bot) commented Jul 26, 2023

Build succeeded:

kvoli pushed a commit to kvoli/cockroach that referenced this pull request Jul 26, 2023
Successfully merging this pull request may close these issues.

server: TestAddNewStoresToExistingNodes failed
4 participants