Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rpc: nodes fail to connect to peer even after the peer is up #44101

Open
andreimatei opened this issue Jan 17, 2020 · 3 comments
Open

rpc: nodes fail to connect to peer even after the peer is up #44101

andreimatei opened this issue Jan 17, 2020 · 3 comments
Labels
A-cc-enablement Pertains to current CC production issues or short-term projects A-kv-server Relating to the KV-level RPC server A-server-networking Pertains to network addressing,routing,initialization C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-server-and-security DB Server & Security X-server-triaged-202105

Comments

@andreimatei
Copy link
Contributor

andreimatei commented Jan 17, 2020

I believe I've run into the following race:

  • node 1 goes down
  • some node 2 code path tries to dial node 1. Let's call it dial attempt 1.
  • node 1 comes back, opens its ports
  • another code path on node 2 tries to dial. This is dial attempt 2. Dial attempt 1 is still in progress, so dial attempt two blocks on conn.initOnce
  • dial attempt one was in the process of failing, and eventually releases that lock
  • every subsequent attempt (i.e. dial attempt 2) that was convoyed behind attempt 1 fails with the same message

This is quite unfortunate, particularly in situations where dial attempt 2 is causally related to node 1 coming back. For example, by node 1 having sent a SetupFlow RPC, which asks node 2 to connect back to it. Node 2 failing to connect back is very unfortunate for the respective SQL query, which will wait in vain for a long time.

A suggested fix is by having node 2 consider network availability signals (incoming connections, or successful heartbeats) and making sure that no error from a dial attempt from before a signal is propagated to any dial attempt from after the signal.

cc @ajwerner @bdarnell

Epic: CRDB-8500

Jira issue: CRDB-5255

@andreimatei andreimatei added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jan 17, 2020
@andreimatei andreimatei added this to Incoming in KV via automation Jan 17, 2020
andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 17, 2020
This patch inhibits DistSQL distribution for the queries that the
migrations run. This was prompted by cockroachdb#44101, which is causing a
distributed query done soon after a node startup to sometimes fail.

I've considered more bluntly disabling distribution for any query for a
short period of time after the node starts up, but I went with the more
targeted change to migrations because I think it's a bad idea for
migrations to use query distribution even outside of cockroachdb#44101 -
distributed queries are more fragile than local execution in general
(for example, because of DistSender retries). And migrations can't
tolerate any flakiness.

Fixes cockroachdb#43957
Fixes cockroachdb#44005
Touches cockroachdb#44101
andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 18, 2020
This patch inhibits DistSQL distribution for the queries that the
migrations run. This was prompted by cockroachdb#44101, which is causing a
distributed query done soon after a node startup to sometimes fail.

I've considered more bluntly disabling distribution for any query for a
short period of time after the node starts up, but I went with the more
targeted change to migrations because I think it's a bad idea for
migrations to use query distribution even outside of cockroachdb#44101 -
distributed queries are more fragile than local execution in general
(for example, because of DistSender retries). And migrations can't
tolerate any flakiness.

Fixes cockroachdb#43957
Fixes cockroachdb#44005
Touches cockroachdb#44101

Release note: None
@ajwerner ajwerner self-assigned this Jan 21, 2020
andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 21, 2020
This patch inhibits DistSQL distribution for the queries that the
migrations run. This was prompted by cockroachdb#44101, which is causing a
distributed query done soon after a node startup to sometimes fail.

I've considered more bluntly disabling distribution for any query for a
short period of time after the node starts up, but I went with the more
targeted change to migrations because I think it's a bad idea for
migrations to use query distribution even outside of cockroachdb#44101 -
distributed queries are more fragile than local execution in general
(for example, because of DistSender retries). And migrations can't
tolerate any flakiness.

Fixes cockroachdb#43957
Fixes cockroachdb#44005
Touches cockroachdb#44101

Release note: None
andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 23, 2020
This patch inhibits DistSQL distribution for the queries that the
migrations run. This was prompted by cockroachdb#44101, which is causing a
distributed query done soon after a node startup to sometimes fail.

I've considered more bluntly disabling distribution for any query for a
short period of time after the node starts up, but I went with the more
targeted change to migrations because I think it's a bad idea for
migrations to use query distribution even outside of cockroachdb#44101 -
distributed queries are more fragile than local execution in general
(for example, because of DistSender retries). And migrations can't
tolerate any flakiness.

Fixes cockroachdb#43957
Fixes cockroachdb#44005
Touches cockroachdb#44101

Release note: None
craig bot pushed a commit that referenced this issue Jan 23, 2020
44102: sql: don't distribute migration queries r=andreimatei a=andreimatei

This patch inhibits DistSQL distribution for the queries that the
migrations run. This was prompted by #44101, which is causing a
distributed query done soon after a node startup to sometimes fail.

I've considered more bluntly disabling distribution for any query for a
short period of time after the node starts up, but I went with the more
targeted change to migrations because I think it's a bad idea for
migrations to use query distribution even outside of #44101 -
distributed queries are more fragile than local execution in general
(for example, because of DistSender retries). And migrations can't
tolerate any flakiness.

Fixes #43957
Fixes #44005
Touches #44101

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
asubiotto added a commit to asubiotto/cockroach that referenced this issue Jul 8, 2020
… complete

DistSQL is fragile on node startup (see cockroachdb#44101) and is therefore disabled in
the migration manager's internal executor. However, migrations sometimes have
side effects and sometimes end up using the SQL server's internal executor
which had DistSQL enabled.

This commit disabled DistSQL for this internal executor until the migrations
complete to avoid flakiness.

Release note: None (testing flake fix)
asubiotto added a commit to asubiotto/cockroach that referenced this issue Jul 9, 2020
… complete

DistSQL is fragile on node startup (see cockroachdb#44101) and is therefore disabled in
the migration manager's internal executor. However, migrations sometimes have
side effects and sometimes end up using the SQL server's internal executor
which had DistSQL enabled.

This commit disabled DistSQL for this internal executor until the migrations
complete to avoid flakiness.

Release note: None (testing flake fix)
@lunevalex lunevalex added the A-kv-server Relating to the KV-level RPC server label Jul 28, 2020
@lunevalex lunevalex moved this from Incoming to Server/CLI in KV Jul 28, 2020
@knz knz added this to To do in DB Server & Security via automation Jan 27, 2021
@lunevalex lunevalex removed this from Server/CLI in KV Apr 23, 2021
@jlinder jlinder added the T-server-and-security DB Server & Security label Jun 16, 2021
@knz knz moved this from To do to Linked issues (from the roadmap columns on the right) in DB Server & Security Jun 28, 2021
@knz knz added A-server-networking Pertains to network addressing,routing,initialization A-cc-enablement Pertains to current CC production issues or short-term projects labels Jul 29, 2021
@ajwerner ajwerner assigned ajwerner and unassigned ajwerner Mar 23, 2022
@ajwerner
Copy link
Contributor

This is a cool issue and would be good to fix at some point. I'm taking it out of my queue, as I haven't thought about it in years.

@tbg
Copy link
Member

tbg commented Sep 5, 2022

We could add code here

OnIncomingPing: func(ctx context.Context, req *rpc.PingRequest) error {
// Decommission state is only tracked for the system tenant.
if tenantID, isTenant := roachpb.TenantFromContext(ctx); isTenant &&
!roachpb.IsSystemTenantID(tenantID.ToUint64()) {
return nil
}
// Incoming ping will reject requests with codes.PermissionDenied to
// signal remote node that it is not considered valid anymore and
// operations should fail immediately.
return checkPingFor(ctx, req.OriginNodeID, codes.PermissionDenied)
},

that resets the breaker for the originator of the PingRequest:

diff --git a/pkg/server/server.go b/pkg/server/server.go
index 3f24182238..3ed5616a67 100644
--- a/pkg/server/server.go
+++ b/pkg/server/server.go
@@ -233,6 +233,13 @@ func NewServer(cfg Config, stopper *stop.Stopper) (*Server, error) {
 
 	nodeTombStorage, checkPingFor := getPingCheckDecommissionFn(engines)
 
+	// This will be assigned once we've instantiated the nodeDialer.
+	//
+	// NB: not a fan of this pattern but alternative is a larger
+	// rearrangement. Would need to double-check that this won't
+	// be invoked before it's populated.
+	var resetNodeDialerBreakerTo func(nodeID roachpb.NodeID)
+
 	rpcCtxOpts := rpc.ContextOptions{
 		TenantID:         roachpb.SystemTenantID,
 		NodeID:           cfg.IDContainer,
@@ -257,7 +264,10 @@ func NewServer(cfg Config, stopper *stop.Stopper) (*Server, error) {
 			// Incoming ping will reject requests with codes.PermissionDenied to
 			// signal remote node that it is not considered valid anymore and
 			// operations should fail immediately.
-			return checkPingFor(ctx, req.OriginNodeID, codes.PermissionDenied)
+			if err := checkPingFor(ctx, req.OriginNodeID, codes.PermissionDenied); err != nil {
+				return err
+			}
+			resetNodeDialerBreakerTo(req.OriginNodeID)
 		},
 	}
 	if knobs := cfg.TestingKnobs.Server; knobs != nil {
@@ -322,6 +332,11 @@ func NewServer(cfg Config, stopper *stop.Stopper) (*Server, error) {
 
 	nodeDialer := nodedialer.NewWithOpt(rpcContext, gossip.AddressResolver(g),
 		nodedialer.DialerOpt{TestingKnobs: dialerKnobs})
+	resetNodeDialerBreakerTo = func(nodeID roachpb.NodeID) {
+		for class := rpc.ConnectionClass(0); int(class) < rpc.NumConnectionClasses; class++ {
+			nodeDialer.GetCircuitBreaker(nodeID, class).Reset()
+		}
+	}
 
 	runtimeSampler := status.NewRuntimeStatSampler(ctx, clock)
 	registry.AddMetricStruct(runtimeSampler)

This would deal with the "causally related" situations. For example, #87104 (comment) would now read:

  • n4 restarts
  • n1 has breaker to n4 tripped as a result
  • n4 issues SetupFlow to n1
    • in particular, establishes healthy connection to n1
    • which includes a heartbeat n4->n1
    • which resets the breaker n1->n4
  • n1 checks breaker, which passes
  • n1 sets up the flow to n4

Copy link

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-cc-enablement Pertains to current CC production issues or short-term projects A-kv-server Relating to the KV-level RPC server A-server-networking Pertains to network addressing,routing,initialization C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-server-and-security DB Server & Security X-server-triaged-202105
Projects
DB Server & Security
  
Linked issues (from the roadmap colum...
Development

No branches or pull requests

7 participants