
roachtest: acceptance/version-upgrade is flaky #87104

Closed
adityamaru opened this issue Aug 30, 2022 · 17 comments · Fixed by #87154
Labels
A-testing Testing tools and infrastructure · branch-master Failures on the master branch. · C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. · T-testeng TestEng Team

@adityamaru
Contributor

adityamaru commented Aug 30, 2022

In the past few days the acceptance/version-upgrade roachtest has been failing in various ways. Some of the error modes are:

pq: failed to run backup: exporting 112 ranges: unable to dial n2: breaker open

dial tcp 127.0.0.1:26259: connect: connection refused

pq: version mismatch in flow request: 65; this node accepts 69 through 69

The last one is the most common failure mode at the moment; the test fails at the step where node 1 is running the current binary version while the other nodes are still on the predecessor binary version.

Build examples:
https://teamcity.cockroachdb.com/viewLog.html?buildId=6278345&tab=buildResultsDiv&buildTypeId=Cockroach_Ci_Tests_LocalRoachtest
https://teamcity.cockroachdb.com/viewLog.html?buildId=6278433&buildTypeId=Cockroach_BazelEssentialCi

Jira issue: CRDB-19153

@adityamaru adityamaru added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Aug 30, 2022
@adityamaru adityamaru added this to Triage in Test Engineering via automation Aug 30, 2022
@adityamaru adityamaru added A-testing Testing tools and infrastructure T-testeng TestEng Team labels Aug 30, 2022
@blathers-crl

blathers-crl bot commented Aug 30, 2022

cc @cockroachdb/test-eng

adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 30, 2022
Skipping the flaky roachtest while we stabilize it.

Informs: cockroachdb#87104

Release note: None

Release justification: testing only change
@tbg tbg added the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Aug 30, 2022

@tbg tbg added the branch-master Failures on the master branch. label Aug 30, 2022
@tbg
Member

tbg commented Aug 30, 2022

Marking as release-blocker to reflect the gravity of this flake - afaict it's likely a problem that would be encountered by customers' workloads while upgrading to 22.2.

I suggest someone from SQL Queries own this. @yuzefovich can you think of someone appropriate and facilitate the assignment? Thank you!

@tbg tbg removed this from Triage in Test Engineering Aug 30, 2022
@tbg tbg added this to Triage in SQL Queries via automation Aug 30, 2022
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Aug 30, 2022
@renatolabs
Collaborator

FWIW, I remember seeing the second error message when I was trying to reduce the flakiness of this test about a month ago (#84382), so I don't think it's new. However, it was a fairly rare occurrence, and maybe it's become more frequent since then.

craig bot pushed a commit that referenced this issue Aug 30, 2022
86563: ts: fix the pretty-printing of tsd keys r=abarganier a=knz

Found while working on #86524.

Release justification: bug fix

Release note (bug fix): When printing keys and range start/end
boundaries for time series, the displayed structure of keys
was incorrect. This is now fixed.

86904: sql: allow mismatch type numbers in `PREPARE` statement r=rafiss a=ZhouXing19

Previously, we only allowed having the same number of parameters and placeholders
in a `PREPARE` statement. This is not compatible with Postgres 14's behavior.

This commit loosens the restriction and enables this compatibility.
We now take `max(#placeholders, #parameters)` as the true number
of parameters of the prepared statement. For each parameter, we first
look at the type deduced from the query statement. If we can't deduce it,
we take the type hint for this parameter.

I.e. we now allow queries such as 

```
PREPARE args_test_many(int, int) as select $1
-- 2 parameters, but only 1 placeholder in the query.

PREPARE args_test_few(int) as select $1, $2::int
-- 1 parameter, but 2 placeholders in the query.
```

fixes #86375

Release justification: Low risk, high benefit changes to existing functionality
Release note: allow mismatch type numbers in `PREPARE` statement

87105: roachtest: skip flaky acceptance/version-upgrade r=tbg a=adityamaru

Skipping the flaky roachtest while we stabilize it.

Informs: #87104

Release note: None

Release justification: testing only change

87117: bazci: fix output path computation r=rail a=rickystewart

These updates were happening in place, so `bazci` was constructing big,
silly paths like `backupccl_test/shard_6_of_16/shard_7_of_16/shard_13_of_16/...`.
We just need to copy the variable here.

Release justification: Non-production code changes
Release note: None

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Jane Xing <zhouxing@uchicago.edu>
Co-authored-by: adityamaru <adityamaru@gmail.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
@msirek
Contributor

msirek commented Aug 30, 2022

Action item here may be to do a bisect.

@msirek msirek moved this from Triage to Up Next in SQL Queries Aug 30, 2022
@yuzefovich yuzefovich moved this from Up Next to Active in SQL Queries Aug 30, 2022
@yuzefovich
Member

I believe I identified the root cause in #87154 (it's a test issue, not an actual bug), so removing the release blocker label.

@yuzefovich yuzefovich removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Aug 30, 2022
@craig craig bot closed this as completed in da00c3a Aug 31, 2022
SQL Queries automation moved this from Active to Done Aug 31, 2022
@yuzefovich yuzefovich reopened this Sep 1, 2022
SQL Queries automation moved this from Done to Triage Sep 1, 2022
@yuzefovich yuzefovich moved this from Triage to Active in SQL Queries Sep 1, 2022
@yuzefovich
Member

I saw it flake on one of my PRs with a different error at a different time in the test. I'll need to take another look.

@yuzefovich
Member

So here is what's happening in this flake:

  • n4 has just been re-upgraded from 22.1.6 to current
  • n4 is the gateway for a SELECT query of the Object Access feature
  • n4 thinks n1 is the leaseholder for the relevant range, so n4 issues SetupFlow RPC to n1
  • n1 receives that request and needs to perform FlowStream RPC to stream data back to n4.
  • n1 is able to get a connection (because we ignore the breaker), but then n1 fails to perform FlowStream RPC because of the breaker
W220901 00:36:29.692490 2795 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 109  ‹[core]›‹[Channel #22 SubChannel #23] grpc: addrConn.createTransport failed to connect to {›
W220901 00:36:29.692490 2795 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 109 +‹  "Addr": "127.0.0.1:26263",›
W220901 00:36:29.692490 2795 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 109 +‹  "ServerName": "127.0.0.1:26263",›
W220901 00:36:29.692490 2795 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 109 +‹  "Attributes": null,›
W220901 00:36:29.692490 2795 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 109 +‹  "BalancerAttributes": null,›
W220901 00:36:29.692490 2795 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 109 +‹  "Type": 0,›
W220901 00:36:29.692490 2795 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 109 +‹  "Metadata": null›
W220901 00:36:29.692490 2795 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 109 +‹}. Err: connection error: desc = "transport: Error while dialing cannot reuse client connection"›
W220901 00:36:29.692553 2826 sql/colflow/colrpc/outbox.go:194 ⋮ [n1,f‹ab65bb47›,streamID=‹0›] 110  Outbox FlowStream connection error, distributed query will fail: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing cannot reuse client connection"›
W220901 00:36:29.692553 2826 sql/colflow/colrpc/outbox.go:194 ⋮ [n1,f‹ab65bb47›,streamID=‹0›] 110 +(1) ‹rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing cannot reuse client connection"›

To me this looks like a dup of #44101.

I'm not sure what to do here, though. I don't have much context on how we would go about fixing #44101. There are a couple of options for fixing this flake in particular:

Curious what others think, especially @tbg on the feasibility of addressing #44101 for good.

@tbg
Member

tbg commented Sep 5, 2022

because of the breaker

I don't see the breaker error in the output you pasted. Rather, this is the onlyOnceDialer:

cockroach/pkg/rpc/context.go

Lines 1548 to 1558 in 34de5fb

// onlyOnceDialer implements the grpc.WithDialer interface but only
// allows a single connection attempt. If a reconnection is attempted,
// redialChan is closed to signal a higher-level retry loop. This
// ensures that our initial heartbeat (and its version/clusterID
// validation) occurs on every new connection.
type onlyOnceDialer struct {
    syncutil.Mutex
    dialed     bool
    closed     bool
    redialChan chan struct{}
}

meaning that a previous attempt to dial failed, and legitimately failed (i.e. it wasn't stopped by the breaker, as that wouldn't use up the onlyOnceDialer). Could you pull up a bit more of the log to see if you can find the true reason n1 couldn't talk to n4?
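
For illustration only, here is a minimal, hypothetical sketch of the behavior described above (not CockroachDB's actual onlyOnceDialer; the type and function names are made up): a dialer that permits exactly one dial attempt and signals a higher-level retry loop when reuse is attempted. The error string mirrors the "cannot reuse client connection" message in the logs above.

```go
package dialersketch

import (
    "context"
    "errors"
    "net"
    "sync"
)

// singleUseDialer is a hypothetical stand-in for a dialer that allows exactly
// one dial attempt. A second attempt closes redialChan so that a higher-level
// retry loop knows it must rebuild the connection (and re-run the initial
// heartbeat/validation) instead of silently redialing.
type singleUseDialer struct {
    mu         sync.Mutex
    dialed     bool
    closed     bool
    redialChan chan struct{}
}

func (d *singleUseDialer) dial(ctx context.Context, addr string) (net.Conn, error) {
    d.mu.Lock()
    defer d.mu.Unlock()
    if d.dialed {
        if !d.closed {
            d.closed = true
            close(d.redialChan) // signal the higher-level loop to start over
        }
        return nil, errors.New("cannot reuse client connection")
    }
    d.dialed = true
    var nd net.Dialer
    return nd.DialContext(ctx, "tcp", addr)
}
```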

@tbg
Member

tbg commented Sep 5, 2022

re: "real" fix, see #44101 (comment)

@yuzefovich
Member

I copied the logs from here. Do you mean getting a more verbose logging output than what is printed by default?

@tbg
Member

tbg commented Sep 7, 2022

The logging just doesn't corroborate the scenario you've outlined. You say:

n1 is able to get a connection (because we ignore the breaker), but then n1 fails to perform FlowStream RPC because of the breaker

That last part doesn't seem true - it looks more like the connection it pulled from the node dialer here

// GetConnForOutbox is a shared function between the rowexec and colexec
// outboxes. It attempts to dial the destination ignoring the breaker, up to the
// given timeout and returns the connection or an error.
// This connection attempt is retried since failure results in a query error. In
// the past, we have seen cases where a gateway node, n1, would send a flow
// request to n2, but n2 would be unable to connect back to n1 due to this
// connection attempt failing.
// Retrying here alleviates these flakes and causes no impact to the end
// user, since the receiver at the other end will hang for
// SettingFlowStreamTimeout waiting for a successful connection attempt.
func GetConnForOutbox(
    ctx context.Context, dialer Dialer, sqlInstanceID base.SQLInstanceID, timeout time.Duration,
) (conn *grpc.ClientConn, err error) {
    firstConnectionAttempt := timeutil.Now()
    for r := retry.StartWithCtx(ctx, base.DefaultRetryOptions()); r.Next(); {
        conn, err = dialer.DialNoBreaker(ctx, roachpb.NodeID(sqlInstanceID), rpc.DefaultClass)
        if err == nil || timeutil.Since(firstConnectionAttempt) > timeout {
            break
        }
    }
    return
}

is somehow unhealthy? Is it possible that the DistSQL request somehow straddles the restart and that n4 legit was down (or hadn't fully restarted yet) when that query was run? The reason I suspect this is that there's a lot of code that you're hitting that tries to establish this connection as healthy,

conn, err := n.rpcContext.GRPCDialNode(addr.String(), nodeID, class).Connect(ctx)
if err != nil {
    // If we were canceled during the dial, don't trip the breaker.
    if ctxErr := ctx.Err(); ctxErr != nil {
        return nil, ctxErr
    }
    err = errors.Wrapf(err, "failed to connect to n%d at %v", nodeID, addr)
    if breaker != nil {
        breaker.Fail(err)
    }
    return nil, err
}
// Check to see if the connection is in the transient failure state. This can
// happen if the connection already existed, but a recent heartbeat has
// failed and we haven't yet torn down the connection.
if err := grpcutil.ConnectionReady(conn); err != nil {
    err = errors.Wrapf(err, "failed to check for ready connection to n%d at %v", nodeID, addr)
    if breaker != nil {
        breaker.Fail(err)
    }
    return nil, err
}
// TODO(bdarnell): Reconcile the different health checks and circuit breaker
// behavior in this file. Note that this different behavior causes problems
// for higher-levels in the system. For example, DistSQL checks for
// ConnHealth when scheduling processors, but can then see attempts to send
// RPCs fail when dial fails due to an open breaker. Reset the breaker here
// as a stop-gap before the reconciliation occurs.
if breaker != nil {
    breaker.Success()
}
return conn, nil

@yuzefovich
Member

Is it possible that the DistSQL request somehow straddles the restart

That doesn't seem possible because the query is issued only after n4 is restarted.

n4 hadn't fully restarted yet

That seems plausible.


Here are the things that I'm confident in:

  • n4 has just been upgraded
  • n4 is the gateway for the "Object Access" query and performs the SetupFlow RPC against n1, which succeeds
  • n1 serves SetupFlow RPC and creates an outbox
  • that outbox is able to get a connection via GetConnForOutbox, but then FlowStream RPC against n4 fails with
W220901 00:36:29.692553 2826 sql/colflow/colrpc/outbox.go:194 ⋮ [n1,f‹ab65bb47›,streamID=‹0›] 110  Outbox FlowStream connection error, distributed query will fail: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing cannot reuse client connection"›
W220901 00:36:29.692553 2826 sql/colflow/colrpc/outbox.go:194 ⋮ [n1,f‹ab65bb47›,streamID=‹0›] 110 +(1) ‹rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing cannot reuse client connection"›
W220901 00:36:29.692553 2826 sql/colflow/colrpc/outbox.go:194 ⋮ [n1,f‹ab65bb47›,streamID=‹0›] 110 +Error types: (1) *status.Error
  • n4 keeps waiting for n1 to dial back in for 10 seconds (see the sketch after this list), after which it times out the query with
E220901 00:36:39.693231 2632 sql/flowinfra/flow_registry.go:336 ⋮ [n4,client=127.0.0.1:37748,user=root,f‹ab65bb47›] 148  flow id:‹ab65bb47-63ac-4f0f-85e0-0a0a088503c0› : 1 inbound streams timed out after 10s; propagated error throughout flow
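
As an aside, here is a tiny, hypothetical sketch (not the actual flow registry code; the function name is made up) of the kind of bounded wait described in the last bullet: the receiver waits for the remote outbox to connect and gives up after a timeout, like the 10s timeout in the error above.

```go
package flowsketch

import (
    "fmt"
    "time"
)

// waitForInboundStream blocks until the remote outbox connects (signaled by
// closing the connected channel) or the timeout elapses, mirroring the
// "inbound streams timed out after 10s" behavior shown in the log above.
func waitForInboundStream(connected <-chan struct{}, timeout time.Duration) error {
    select {
    case <-connected:
        return nil
    case <-time.After(timeout):
        return fmt.Errorf("1 inbound stream timed out after %s", timeout)
    }
}
```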

Let's take a closer look at the logs of n4 after the restart.

I220901 00:36:29.645727 132 1@server/server_sql.go:1415 ⋮ [n4] 52  serving sql connections
...
I220901 00:36:29.648057 1012 upgrade/upgrademanager/manager.go:115 ⋮ [n4,intExec=‹set-version›,migration-mgr] 54  migrating cluster from 22.1 to 1000022.1-68 (stepping through [1000022.1-2 1000022.1-4 1000022.1-6 1000022.1-8 1000022.1-10 1000022.1-12 1000022.1-14 1000022.1-16 1000022.1-18 1000022.1-20 1000022.1-22 1000022.1-24 1000022.1-26 1000022.1-28 1000022.1-30 1000022.1-32 1000022.1-34 1000022.1-36 1000022.1-38 1000022.1-40 1000022.1-42 1000022.1-44 1000022.1-46 1000022.1-48 1000022.1-50 1000022.1-52 1000022.1-54 1000022.1-56 1000022.1-58 1000022.1-60 1000022.1-62 1000022.1-64 1000022.1-66 1000022.1-68])
I220901 00:36:29.649797 1012 upgrade/upgradecluster/cluster.go:118 ⋮ [n4,intExec=‹set-version›,migration-mgr] 55  executing validate-cluster-version=1000022.1-68 on nodes n{1,2,3,4}
I220901 00:36:29.695749 1012 upgrade/upgrademanager/manager.go:135 ⋮ [n4,intExec=‹set-version›,migration-mgr] 56  stepping through 1000022.1-2
I220901 00:36:29.725805 1188 jobs/adopt.go:243 ⋮ [n4,intExec=‹set-version›,migration-mgr] 57  job 792730631195164676: resuming execution
I220901 00:36:29.741825 1190 jobs/registry.go:1206 ⋮ [n4] 58  MIGRATION job 792730631195164676: stepping through state running with error: <nil>
I220901 00:36:29.763583 1190 jobs/registry.go:1206 ⋮ [n4] 59  MIGRATION job 792730631195164676: stepping through state succeeded with error: <nil>
...
I220901 00:36:29.990979 1474 jobs/registry.go:1206 ⋮ [n4] 99  MIGRATION job 792730632062795780: stepping through state running with error: <nil>

If we can rely on the clocks of n1 and n4 being in sync, then we can see that n4 is still running through the upgrade migrations at the time when n1 tries to perform the FlowStream RPC, which seems to corroborate the theory that n4 wasn't fully "up" yet. Do you know whether running migrations on n4 would somehow prevent other nodes from dialing into it? Are we starting to serve SQL connections too early (i.e. should we wait for the migrations to complete)?

@tbg
Member

tbg commented Sep 8, 2022

cockroach start (as deployed by roachprod) will return when it hits the sdnotify line at the end of this method:

cockroach/pkg/cli/start.go

Lines 489 to 548 in 1af6635

serverCfg.ReadyFn = func(waitForInit bool) {
    // Inform the user if the network settings are suspicious. We need
    // to do that after starting to listen because we need to know
    // which advertise address NewServer() has decided.
    hintServerCmdFlags(ctx, cmd)
    // If another process was waiting on the PID (e.g. using a FIFO),
    // this is when we can tell them the node has started listening.
    if startCtx.pidFile != "" {
        log.Ops.Infof(ctx, "PID file: %s", startCtx.pidFile)
        if err := os.WriteFile(startCtx.pidFile, []byte(fmt.Sprintf("%d\n", os.Getpid())), 0644); err != nil {
            log.Ops.Errorf(ctx, "failed writing the PID: %v", err)
        }
    }
    // If the invoker has requested an URL update, do it now that
    // the server is ready to accept SQL connections.
    // (Note: as stated above, ReadyFn is called after the server
    // has started listening on its socket, but possibly before
    // the cluster has been initialized and can start processing requests.
    // This is OK for SQL clients, as the connection will be accepted
    // by the network listener and will just wait/suspend until
    // the cluster initializes, at which point it will be picked up
    // and let the client go through, transparently.)
    if startCtx.listeningURLFile != "" {
        log.Ops.Infof(ctx, "listening URL file: %s", startCtx.listeningURLFile)
        // (Re-)compute the client connection URL. We cannot do this
        // earlier (e.g. above, in the runStart function) because
        // at this time the address and port have not been resolved yet.
        clientConnOptions, serverParams := makeServerOptionsForURL(&serverCfg)
        pgURL, err := clientsecopts.MakeURLForServer(clientConnOptions, serverParams, url.User(username.RootUser))
        if err != nil {
            log.Errorf(ctx, "failed computing the URL: %v", err)
            return
        }
        if err = os.WriteFile(startCtx.listeningURLFile, []byte(fmt.Sprintf("%s\n", pgURL.ToPQ())), 0644); err != nil {
            log.Ops.Errorf(ctx, "failed writing the URL: %v", err)
        }
    }
    if waitForInit {
        log.Ops.Shout(ctx, severity.INFO,
            "initial startup completed.\n"+
                "Node will now attempt to join a running cluster, or wait for `cockroach init`.\n"+
                "Client connections will be accepted after this completes successfully.\n"+
                "Check the log file(s) for progress. ")
    }
    // Ensure the configuration logging is written to disk in case a
    // process is waiting for the sdnotify readiness to read important
    // information from there.
    log.Flush()
    // Signal readiness. This unblocks the process when running with
    // --background or under systemd.
    if err := sdnotify.Ready(); err != nil {
        log.Ops.Errorf(ctx, "failed to signal readiness using systemd protocol: %s", err)
    }
}

Since n4 is restarted, the relevant line is this:

onSuccessfulReturnFn = func() { readyFn(false /* waitForInit */) }

which is invoked at the top of the diff here:

cockroach/pkg/server/server.go

Lines 1403 to 1495 in 2675c7c

onSuccessfulReturnFn()
// NB: This needs to come after `startListenRPCAndSQL`, which determines
// what the advertised addr is going to be if nothing is explicitly
// provided.
advAddrU := util.NewUnresolvedAddr("tcp", s.cfg.AdvertiseAddr)
// We're going to need to start gossip before we spin up Node below.
s.gossip.Start(advAddrU, filtered)
log.Event(ctx, "started gossip")
// Now that we have a monotonic HLC wrt previous incarnations of the process,
// init all the replicas. At this point *some* store has been initialized or
// we're joining an existing cluster for the first time.
advSQLAddrU := util.NewUnresolvedAddr("tcp", s.cfg.SQLAdvertiseAddr)
advHTTPAddrU := util.NewUnresolvedAddr("tcp", s.cfg.HTTPAdvertiseAddr)
if err := s.node.start(
    ctx,
    advAddrU,
    advSQLAddrU,
    advHTTPAddrU,
    *state,
    initialStart,
    s.cfg.ClusterName,
    s.cfg.NodeAttributes,
    s.cfg.Locality,
    s.cfg.LocalityAddresses,
); err != nil {
    return err
}
log.Event(ctx, "started node")
if err := s.startPersistingHLCUpperBound(ctx, hlcUpperBoundExists); err != nil {
    return err
}
s.replicationReporter.Start(ctx, s.stopper)
sentry.ConfigureScope(func(scope *sentry.Scope) {
    scope.SetTags(map[string]string{
        "cluster":         s.StorageClusterID().String(),
        "node":            s.NodeID().String(),
        "server_id":       fmt.Sprintf("%s-%s", s.StorageClusterID().Short(), s.NodeID()),
        "engine_type":     s.cfg.StorageEngine.String(),
        "encrypted_store": strconv.FormatBool(encryptedStore),
    })
})
// We can now add the node registry.
s.recorder.AddNode(
    s.registry,
    s.node.Descriptor,
    s.node.startedAt,
    s.cfg.AdvertiseAddr,
    s.cfg.HTTPAdvertiseAddr,
    s.cfg.SQLAdvertiseAddr,
)
// Begin recording runtime statistics.
if err := startSampleEnvironment(s.AnnotateCtx(ctx),
    s.ClusterSettings(),
    s.stopper,
    s.cfg.GoroutineDumpDirName,
    s.cfg.HeapProfileDirName,
    s.runtime,
    s.status.sessionRegistry,
); err != nil {
    return err
}
var graphiteOnce sync.Once
graphiteEndpoint.SetOnChange(&s.st.SV, func(context.Context) {
    if graphiteEndpoint.Get(&s.st.SV) != "" {
        graphiteOnce.Do(func() {
            s.node.startGraphiteStatsExporter(s.st)
        })
    }
})
// Start the protected timestamp subsystem. Note that this needs to happen
// before the modeOperational switch below, as the protected timestamps
// subsystem will crash if accessed before being Started (and serving general
// traffic may access it).
//
// See https://github.com/cockroachdb/cockroach/issues/73897.
if err := s.protectedtsProvider.Start(ctx, s.stopper); err != nil {
    return err
}
// After setting modeOperational, we can block until all stores are fully
// initialized.
s.grpc.setMode(modeOperational)

and note the bottom of the diff which sets grpc to "operational" (meaning it'll stop refusing incoming requests).

The listener is opened a few pages before this, so a dial to n4 should have succeeded (i.e. no conn refused or the like); before the operational line, RPCs would have been refused, but we are getting a different error which indicates that an attempt to dial failed:

W220901 00:36:29.692553 2826 sql/colflow/colrpc/outbox.go:194 ⋮ [n1,f‹ab65bb47›,streamID=‹0›] 110 Outbox FlowStream connection error, distributed query will fail: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing cannot reuse client connection"›

Unfortunately, the error we really want is the one "before" that; this error here only tells us that a previous dial failed. Why did it fail? With what? That is unclear.

@tbg
Member

tbg commented Sep 8, 2022

@yuzefovich I made a separate issue for this problem: #87634

For now, let's introduce a 4s sleep after each node restart; that should reliably paper over it. Not great, but I don't think this is a new problem - I think we're seeing it now because we are now draining the nodes, so there is no range unavailability after downtime, which probably papered over it very reliably. Would you be able to send that PR, Yahor, and close this issue out if it passes a couple of runs?
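
For reference, a hedged sketch of what this stop-gap could look like in the test; the helper name and the restart callback are hypothetical, not the actual roachtest API:

```go
package upgradesketch

import "time"

// restartNodeAndSettle restarts a node via the provided callback and then
// sleeps for a short settle period so that peers can reliably dial the
// freshly restarted node before the test issues queries through it.
// This only papers over the problem tracked in #87634; it is not a fix.
func restartNodeAndSettle(restart func(node int) error, node int, settle time.Duration) error {
    if err := restart(node); err != nil {
        return err
    }
    time.Sleep(settle) // e.g. 4 * time.Second, as suggested above
    return nil
}
```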

@yuzefovich
Member

Thanks Tobi! I'll send a patch.

craig bot pushed a commit that referenced this issue Sep 12, 2022
87645: ui: fix txn insight query bug, align summary card, remove contended keys in details page r=ericharmeling a=ericharmeling

This commit fixes a small bug on the transaction insight details page
that was incorrectly mapping the waiting transaction statement
fingerprints to the blocking transaction statements. The commit also
aligns the summary cards in the details page. The commit also removes
the contended key from the details page while we look for a more user-
friendly format to display row contention.

Before:

![image](https://user-images.githubusercontent.com/27286675/189216476-8211d598-5d4e-4255-846f-82c785764016.png)


After:

![image](https://user-images.githubusercontent.com/27286675/189216006-f01edeb6-ab2f-42ac-9978-6fce85b9a79a.png)

Fixes #87838.

Release note: None
Release justification: bug fix

87715: roachtest: add 4s of sleep after restart when upgrading nodes r=yuzefovich a=yuzefovich

We have seen cases where a transient error could occur when a newly upgraded node serves as a gateway for a distributed query due to remote nodes not being able to dial back to the gateway for some reason (the investigation of it is tracked in #87634). For now, we're papering over these flakes with a 4-second sleep.

Addresses: #87104.

Release note: None

87840: roachtest: do not generate division ops in costfuzz and unoptimized tests r=mgartner a=mgartner

The division (`/`) and floor division (`//`) operators were making costfuzz and unoptimized-query-oracle tests flaky. This commit disables generation of these operators as a temporary mitigation for these flakes.

Informs #86790

Release note: None

87854: kvcoord: reliably handle stuck watcher error r=erikgrinaker a=tbg

Front-ports parts of #87253.

When a rangefeed gets stuck, and the server is local, the server might
notice the cancellation before the client, and may send a cancellation
error back in a rangefeed event.

We now handle this the same as the other case (where the stream client
errors out due to the cancellation).

This also checks in the test from
#87253 (which is on
release-22.1).

Fixes #87370.

No release note since this will be backported to release-22.2
Release note: None


Co-authored-by: Eric Harmeling <eric.harmeling@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
@tbg
Member

tbg commented Sep 19, 2022

Closing since it's been passing for close to a week now. If it fails again, better to open a new issue.

@tbg tbg closed this as completed Sep 19, 2022
SQL Queries automation moved this from Active to Done Sep 19, 2022
@exalate-issue-sync exalate-issue-sync bot removed the T-sql-queries SQL Queries Team label Sep 30, 2022