-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
nodedialer: don't dial decommissioned nodes #53168
base: master
Are you sure you want to change the base?
nodedialer: don't dial decommissioned nodes #53168
Conversation
This patch prevents dialing attempts to a decommissioned node. Since these nodes are not coming back to the cluster, there's no reason to dial them. We've seen that we have various cases where different modules try to connect to such nodes, for whatever reasons (which reasons - stale caches or whatever - should be addressed separately), causing log spam, extra latency, or worse. This patches teaches the dialing circuit breakers to integrate with the node liveness info and refuse dialing attempts. Touches cockroachdb#32581 Release note: None
Hey Rafa this is not a complete PR, but this is how I think I'd try to take the decommissioned bit into consideration for preventing dialing attempts to such nodes. If you like it, you can take it over. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I understand this. IT's nice!
Just one question - see below
Reviewed 2 of 2 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)
pkg/rpc/nodedialer/nodedialer.go, line 300 at r1 (raw file):
nodeID roachpb.NodeID, class rpc.ConnectionClass, ) *circuit.Breaker { // !!! this should return the wrappedBreaker
Can you explain this? I don't understand it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)
pkg/rpc/nodedialer/nodedialer.go, line 300 at r1 (raw file):
Previously, knz (kena) wrote…
Can you explain this? I don't understand it
I mean that whoever is using this method currently to get the breaker for whatever reason (the raft transport and the closed ts transport I think), should also get exposed to thedecommissionAwareBreaker
(which I think should just be called dialBreaker
btw) so that they can return different errors and not log too much when they're trying to contact a decommissioned node.
This PR came up again in the context of a customer cluster that is seeing connection attempts to decommissioned nodes. I'm actually not sure if this PR would reliably fix that. In the linked issue, you can see that connection attempts to that node get a connection refused error, which do trip the breaker. So the breaker is open most of the time. I think what is happening is that the connection is attempted as part of something that has to do with a stale range descriptor, which is still good enough to serve requests (i.e. has up to date leaseholder) and thus doesn't get evicted. I'm not sure this PR would change that behavior; ultimately we need to evict range descriptors if they contain a decommissioned node. But on top of that, I'm just speculating. Maybe we are connecting to the decommissioned node for other reasons; we'd need to run our own experiments I think. cc @mwang1026 for prioritization (I know this is nominally server now, but I'll leave that transfer to you if it is going to happen this time around). |
Since this PR was orphan I have created a new issue #66710 to explain the problem, so that the issue can be considered in prioritization. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.
Fixes #66710.
This patch prevents dialing attempts to a decommissioned node. Since
these nodes are not coming back to the cluster, there's no reason to
dial them. We've seen that we have various cases where different modules
try to connect to such nodes, for whatever reasons (which reasons -
stale caches or whatever - should be addressed separately), causing log
spam, extra latency, or worse. This patches teaches the dialing circuit
breakers to integrate with the node liveness info and refuse dialing
attempts.
Touches #32581
Release note: None