Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nodedialer: don't dial decommissioned nodes #53168

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

andreimatei
Copy link
Contributor

@andreimatei andreimatei commented Aug 20, 2020

Fixes #66710.

This patch prevents dialing attempts to a decommissioned node. Since
these nodes are not coming back to the cluster, there's no reason to
dial them. We've seen that we have various cases where different modules
try to connect to such nodes, for whatever reasons (which reasons -
stale caches or whatever - should be addressed separately), causing log
spam, extra latency, or worse. This patches teaches the dialing circuit
breakers to integrate with the node liveness info and refuse dialing
attempts.

Touches #32581

Release note: None

This patch prevents dialing attempts to a decommissioned node. Since
these nodes are not coming back to the cluster, there's no reason to
dial them. We've seen that we have various cases where different modules
try to connect to such nodes, for whatever reasons (which reasons -
stale caches or whatever - should be addressed separately), causing log
spam, extra latency, or worse. This patches teaches the dialing circuit
breakers to integrate with the node liveness info and refuse dialing
attempts.

Touches cockroachdb#32581

Release note: None
@andreimatei andreimatei requested a review from knz August 20, 2020 23:39
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@andreimatei
Copy link
Contributor Author

Hey Rafa this is not a complete PR, but this is how I think I'd try to take the decommissioned bit into consideration for preventing dialing attempts to such nodes. If you like it, you can take it over.

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I understand this. IT's nice!
Just one question - see below

Reviewed 2 of 2 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)


pkg/rpc/nodedialer/nodedialer.go, line 300 at r1 (raw file):

	nodeID roachpb.NodeID, class rpc.ConnectionClass,
) *circuit.Breaker {
	// !!! this should return the wrappedBreaker

Can you explain this? I don't understand it

@knz knz added this to In progress in DB Server & Security via automation Aug 21, 2020
Copy link
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)


pkg/rpc/nodedialer/nodedialer.go, line 300 at r1 (raw file):

Previously, knz (kena) wrote…

Can you explain this? I don't understand it

I mean that whoever is using this method currently to get the breaker for whatever reason (the raft transport and the closed ts transport I think), should also get exposed to thedecommissionAwareBreaker (which I think should just be called dialBreaker btw) so that they can return different errors and not log too much when they're trying to contact a decommissioned node.

@tbg tbg added the X-noremind Bots won't notify about PRs with X-noremind label May 6, 2021
@tbg
Copy link
Member

tbg commented May 18, 2021

This PR came up again in the context of a customer cluster that is seeing connection attempts to decommissioned nodes. I'm actually not sure if this PR would reliably fix that. In the linked issue, you can see that connection attempts to that node get a connection refused error, which do trip the breaker. So the breaker is open most of the time. I think what is happening is that the connection is attempted as part of something that has to do with a stale range descriptor, which is still good enough to serve requests (i.e. has up to date leaseholder) and thus doesn't get evicted. I'm not sure this PR would change that behavior; ultimately we need to evict range descriptors if they contain a decommissioned node. But on top of that, I'm just speculating. Maybe we are connecting to the decommissioned node for other reasons; we'd need to run our own experiments I think.

cc @mwang1026 for prioritization (I know this is nominally server now, but I'll leave that transfer to you if it is going to happen this time around).

@knz
Copy link
Contributor

knz commented Jun 22, 2021

Since this PR was orphan I have created a new issue #66710 to explain the problem, so that the issue can be considered in prioritization.

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
X-noremind Bots won't notify about PRs with X-noremind
Projects
DB Server & Security
  
In progress
Development

Successfully merging this pull request may close these issues.

node dialer: avoid dialing decommissioned nodes
4 participants