cli/zip: make `cockroach zip` more resilient #44342
Conversation
Force-pushed from 7286767 to 0f44524.
@tbg PTAL

Also a question: it is as if there are new ranges appearing after the loop completes. Is it possible that TestCluster's own …

(Note that this is immaterial to merging this PR: as long as the majority of the ranges are moved to the other nodes, the unavailability is sufficient to trigger the tested behavior.)
Force-pushed from 6f3362c to 94e5b3d.
Thank you! These improvements look good across the board. I do have some questions about the bug you're fixing.
> causing `zip` to encounter a retry error while retrieving the system table.
Isn't the real bug that debug zip isn't handling the retry error? It seems that the other thing I don't like (the bespoke way to start a weirdly configured test server) only serves to work around that. Isn't the client (zip)-facing retry error unexpected and also undesirable?
The loop mostly does the right thing from the looks of it but it doesn't guarantee that there won't be replica movement after. For example, a range could be moved as part of a merge, or rebalancing. I assume you had no extra nodes that could cause that, though, but the method also relies on metrics as an exit condition, which is async and in particular starts out at zero. It seems reasonable that the loop could call it quits once at least one range is upreplicated, though usually it will wait for them all.
cockroach/pkg/testutils/testcluster/testcluster.go, lines 756 to 808 at fc3676e:
```go
func (tc *TestCluster) WaitForFullReplication() error {
	start := timeutil.Now()
	defer func() {
		end := timeutil.Now()
		log.Infof(context.TODO(), "WaitForFullReplication took: %s", end.Sub(start))
	}()
	if len(tc.Servers) < 3 {
		// If we have less than three nodes, we will never have full replication.
		return nil
	}
	opts := retry.Options{
		InitialBackoff: time.Millisecond * 10,
		MaxBackoff:     time.Millisecond * 100,
		Multiplier:     2,
	}
	notReplicated := true
	for r := retry.Start(opts); r.Next() && notReplicated; {
		notReplicated = false
		for _, s := range tc.Servers {
			err := s.Stores().VisitStores(func(s *storage.Store) error {
				if n := s.ClusterNodeCount(); n != len(tc.Servers) {
					log.Infof(context.TODO(), "%s only sees %d/%d available nodes", s, n, len(tc.Servers))
					notReplicated = true
					return nil
				}
				// Force upreplication. Otherwise, if we rely on the scanner to do it,
				// it'll take a while.
				if err := s.ForceReplicationScanAndProcess(); err != nil {
					return err
				}
				if err := s.ComputeMetrics(context.TODO(), 0); err != nil {
					// This can sometimes fail since ComputeMetrics calls
					// updateReplicationGauges which needs the system config gossiped.
					log.Info(context.TODO(), err)
					notReplicated = true
					return nil
				}
				if n := s.Metrics().UnderReplicatedRangeCount.Value(); n > 0 {
					log.Infof(context.TODO(), "%s has %d underreplicated ranges", s, n)
					notReplicated = true
				}
				return nil
			})
			if err != nil {
				return err
			}
			if notReplicated {
				break
			}
		}
	}
	return nil
}
```
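Given the concern that the under-replication gauges are async and start out at zero, a defensive caller could insist on several consecutive clean readings before declaring replication stable. A minimal sketch, assuming the same store APIs as the excerpt above plus `testing`, `fmt`, and the repo's `testutils` package; `requireStableFullReplication` and `stablePolls` are hypothetical names, not code from this PR:

```go
// Hypothetical helper: re-check the under-replication gauge until it has
// been zero on every store for several consecutive polls, guarding
// against a gauge that merely hasn't been recomputed yet.
func requireStableFullReplication(t *testing.T, tc *testcluster.TestCluster) {
	const stablePolls = 3 // hypothetical: consecutive clean passes required
	clean := 0
	testutils.SucceedsSoon(t, func() error {
		for _, srv := range tc.Servers {
			err := srv.Stores().VisitStores(func(s *storage.Store) error {
				// Recompute first; the gauge starts at zero before any pass.
				if err := s.ComputeMetrics(context.TODO(), 0); err != nil {
					return err
				}
				if n := s.Metrics().UnderReplicatedRangeCount.Value(); n > 0 {
					return fmt.Errorf("%s still has %d under-replicated ranges", s, n)
				}
				return nil
			})
			if err != nil {
				clean = 0
				return err
			}
		}
		if clean++; clean < stablePolls {
			return fmt.Errorf("waiting for %d consecutive clean polls, have %d", stablePolls, clean)
		}
		return nil
	})
}
```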
Reviewed 2 of 2 files at r1, 2 of 2 files at r2, 1 of 1 files at r3, 2 of 2 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)
pkg/cli/zip.go, line 579 at r2 (raw file):
```go
	}
}
objName := strings.ToLower(out.String())
```
would it be better to transform to lowercase before iterating? I'm just thinking of some crazy unicode stuff (which may not exist) where `IsLetter` is true but then after lowercasing we get something weird that's unsuitable for a filesystem. (I don't know what qualifies as `IsLetter`; probably this is all benign.)
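For illustration, a minimal sketch of the "lowercase first, then filter" ordering being suggested (hypothetical helper, not the PR's actual code; assumes only the standard `strings` and `unicode` packages):

```go
// sanitizeForFS is a hypothetical sketch: lowercase the whole name up
// front, then keep only letters and digits, so any rune that turns
// weird under lowercasing gets filtered after the transformation.
func sanitizeForFS(name string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(name) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteRune(r)
		} else {
			b.WriteByte('_') // anything questionable collapses to '_'
		}
	}
	return b.String()
}
```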
pkg/cli/zip.go, line 587 at r2 (raw file):
```go
cnt := fne.counters[objName]
if cnt > 0 {
	result += fmt.Sprintf("-%d", cnt)
```
Can I break something by feeding in a `f=something-2` initially? No, because `-` will be replaced out in the input, right?
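To make the collision question concrete, a hedged sketch of the counter idea (the `fne.counters` names follow the snippet above; the rest is illustrative): since `-` never survives sanitization of the input, a user-supplied `something-2` becomes `something_2` and cannot collide with a generated `something-2` suffix.

```go
type fileNameEscaper struct {
	counters map[string]int
}

// escape is illustrative: sanitize first (see the sketch above, which
// maps '-' to '_'), then append "-N" on repeated use of the same name.
func (fne *fileNameEscaper) escape(name string) string {
	objName := sanitizeForFS(name)
	result := objName
	if fne.counters == nil {
		fne.counters = map[string]int{}
	}
	cnt := fne.counters[objName]
	if cnt > 0 {
		result += fmt.Sprintf("-%d", cnt)
	}
	fne.counters[objName] = cnt + 1
	return result
}
```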
pkg/cli/zip_test.go, line 249 at r1 (raw file):
```go
}
// Strip any non-deterministic messages:
```
Hmm, this will potentially fail nondeterministically as we change things elsewhere, bump deps, etc. Have you considered making `expected` a regexp? From the looks of it, that's going to leave the test in an even more annoying place, though. Hmm. I'm fine with leaving it as-is.
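A minimal sketch of what the regexp variant could look like (hypothetical; the test kept a plain string comparison; assumes the standard `regexp` and `strings` packages): quote the stable fragments with `regexp.QuoteMeta` and splice wildcards where the output drifts.

```go
// expectedPattern is hypothetical: build a regexp from the literal,
// stable fragments of the expected output, with ".*?" standing in for
// the messages that change as dependencies get bumped.
func expectedPattern(stableParts []string) *regexp.Regexp {
	quoted := make([]string, len(stableParts))
	for i, p := range stableParts {
		quoted[i] = regexp.QuoteMeta(p)
	}
	return regexp.MustCompile("(?s)^" + strings.Join(quoted, ".*?") + "$")
}
```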
pkg/cli/zip_test.go, line 241 at r2 (raw file):
```sql
create table defaultdb."a-b"(x int);
create table defaultdb."pg_catalog.pg_class"(x int);
create table defaultdb."../system"(x int);
```
💡
pkg/testutils/testcluster/testcluster.go, line 890 at r4 (raw file):
```go
}
// StartTestClusterWithoutReplication starts up a TestCluster made up
```
Oof, that seems like a… quite bespoke, not at all general-purpose way of starting a server.
Force-pushed from 94e5b3d to 10e45bb.
> Isn't the real bug that debug zip isn't handling the retry error?
That is a bug, but it was already there before. I suppose I could add an additional commit.
> The loop mostly does the right thing from the looks of it but it doesn't guarantee that there won't be replica movement after. [...] It seems reasonable that the loop could call it quits once at least one range is upreplicated, though usually it will wait for them all.
Are you telling me that the method is called "WaitForFullReplication" and it does not, in fact, wait for full replication? How is that not more generally a problem?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)
pkg/cli/zip.go, line 579 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
> would it be better to transform to lowercase before iterating? I'm just thinking of some crazy unicode stuff (which may not exist) where `IsLetter` is true but then after lowercasing we get something weird that's unsuitable for a filesystem. (I don't know what qualifies as `IsLetter`; probably this is all benign.)
Done.
pkg/cli/zip.go, line 587 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
> Can I break something by feeding in a `f=something-2` initially? No, because `-` will be replaced out in the input, right?
Yes, correct. Added a test to show it.
pkg/testutils/testcluster/testcluster.go, line 890 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
> Oof, that seems like a… quite bespoke, not at all general-purpose way of starting a server.

It was introduced in #44022; this is just moving it to a new place. I would prefer to reuse the same code in both tests that need it (one in `sql` and one in `cli`); what would be a better place for it?
(Also, I would welcome your comments on correctness more generally. The comment at the top wrt WaitForFullReplication makes me think we have a problem to fix.)
> That is a bug, but it was already there before. I suppose I could add an additional commit.
If you don't mind, do that or file an issue. I thought we always issued single statements and wouldn't see retries. (Though with data streamed out already, I can see how we would get one. Which is annoying; not sure how we can fix this without adding retry handlers for all of debug zip. We don't want to buffer too much there, though I suppose we could buffer more than the default.)
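For concreteness, a hedged sketch of the per-query retry handler being discussed (this is what the eventual issue asks for; it is not code from this PR, and it assumes the `lib/pq` driver, under which serialization failures surface as SQLSTATE 40001):

```go
// runWithRetry is hypothetical: retry a query a bounded number of times
// on a retryable serialization error. As noted above, this is only safe
// before any rows have been streamed out to the zip file.
func runWithRetry(ctx context.Context, db *sql.DB, query string) (*sql.Rows, error) {
	const maxAttempts = 5 // hypothetical cap
	for attempt := 1; ; attempt++ {
		rows, err := db.QueryContext(ctx, query)
		var pqErr *pq.Error
		if err != nil && errors.As(err, &pqErr) && pqErr.Code == "40001" && attempt < maxAttempts {
			continue // retryable serialization error: try again
		}
		return rows, err
	}
}
```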
> Are you telling me that the method is called "WaitForFullReplication" and it does not, in fact, wait for full replication? How is that not more generally a problem?
I'm telling you that this code tries to check conditions at a distance and that it may not be bulletproof. I think most consumers are OK with this, but it's clearly not desirable. But looking at the code more, this should really work in practice: it's computing the metrics at each step on each node and checking for under-replication. Do you have a log with the suspicious activity? I'm looking for the `WaitForFullReplication took: %s` line and then more activity.
Reviewed 2 of 2 files at r8, 1 of 2 files at r9, 1 of 1 files at r10.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)
pkg/testutils/testcluster/testcluster.go, line 890 at r4 (raw file):
Previously, knz (kena) wrote…
> It was introduced in #44022; this is just moving it to a new place. I would prefer to reuse the same code in both tests that need it (one in `sql` and one in `cli`); what would be a better place for it? (Also, I would welcome your comments on correctness more generally. The comment at the top wrt WaitForFullReplication makes me think we have a problem to fix.)
Right, I'm not saying we shouldn't start a server like that, more that it seems too niche for the general case. Assuming WaitForFullReplication worked, you would set up three nodes, wait for full replication, and then take two of them down, and things would work (assuming the retry error were fixed)?
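To make that shape concrete, a minimal sketch assuming the standard `testcluster` and `base` helpers (the test name and body are hypothetical; the actual test differs):

```go
// Hypothetical test skeleton: start three nodes, wait for full
// replication, then stop two so that most ranges lose quorum, which is
// exactly the unavailability `cockroach zip` must tolerate.
func TestZipUnavailableRanges(t *testing.T) {
	tc := testcluster.StartTestCluster(t, 3, base.TestClusterArgs{})
	defer tc.Stopper().Stop(context.Background())

	if err := tc.WaitForFullReplication(); err != nil {
		t.Fatal(err)
	}
	tc.StopServer(1)
	tc.StopServer(2)
	// ... run `cockroach zip` against tc.Server(0) and assert it completes.
}
```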
Force-pushed from 10e45bb to becaa6f.
I have re-enabled TestPartialZip and verified via `make stress` that the various tests here are not flaky. They are flaky under race-enabled builds, but that's expected: the slowness of race runs triggers all kinds of timeouts.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/testutils/testcluster/testcluster.go, line 890 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
> Right, I'm not saying we shouldn't start a server like that, more that it seems too niche for the general case. Assuming WaitForFullReplication worked, you would set up three nodes, wait for full replication, and then take two of them down, and things would work (assuming the retry error were fixed)?
Removed in favor of the technique you explained to me.
Filed #45134 to add the missing retry loop.
Previously, `cockroach zip` would only print an informational message about a piece of data it was retrieving *after* the data was retrieved (or an error was observed). This patch changes it to also print a message beforehand. This enables better troubleshooting of hanging queries. Release note (cli change): `cockroach zip` now displays its progress differently on the terminal.
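The pattern is simple; a minimal illustration (hypothetical helper, not zip.go's actual structure; assumes only the standard `fmt` package):

```go
// fetch is illustrative: announce the item before the request so that a
// hang can be attributed to a specific query, then report the outcome.
func fetch(name string, get func() error) error {
	fmt.Printf("requesting %s... ", name)
	if err := get(); err != nil {
		fmt.Printf("error: %v\n", err)
		return err
	}
	fmt.Println("done")
	return nil
}
```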
Previously, `cockroach zip` did not properly escape special characters in db/table names, so that a SQL database named e.g. `"../schema/system"` (inside SQL) would cause a zip entry called `../schema/system` to be created. This generated name is ambiguous with the output path of zip files for the `system` database. This patch fixes that by properly escaping the file names. Release note (cli change): `cockroach zip` now properly supports special characters in database and table names.
The SQL connections used by `cockroach zip` already use a *connect timeout*, configurable via COCKROACH_CONNECT_TIMEOUT and (currently) defaulting to 15s. However, there was no timeout after a connection was established. This patch changes it to use the value configured via `--timeout` as a per-SQL-query timeout. Release note (cli change): `cockroach zip` now applies the `--timeout` parameter also to the SQL queries it performs (there was no timeout previously, causing `cockroach zip` to potentially hang).
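A hedged sketch of the mechanism (illustrative; the actual `--timeout` plumbing is elided; assumes the standard `context`, `database/sql`, and `time` packages): bound each query, including row streaming, with a context deadline.

```go
// runQueryWithTimeout is hypothetical: the deadline covers both query
// execution and draining the rows, so a query against an unavailable
// range errors out instead of hanging `cockroach zip` forever.
func runQueryWithTimeout(
	db *sql.DB, timeout time.Duration, query string, scan func(*sql.Rows) error,
) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	rows, err := db.QueryContext(ctx, query)
	if err != nil {
		return err
	}
	defer rows.Close()
	for rows.Next() {
		if err := scan(rows); err != nil {
			return err
		}
	}
	return rows.Err()
}
```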
Prior to this patch, `cockroach zip` was performing a Node() RPC on the head node to retrieve its SQL address. This RPC performs a KV read and would thus block if the liveness range was unavailable. This is fixed by replacing the call with `Details()` (`/health`), which also delivers the SQL address. Additionally, the per-node retrieval code is updated to retrieve at least the head node's endpoints, even when `Nodes()` fails. This enables extracting a few more things in case liveness is unavailable: gossip, some crdb_internal data, log files, etc. Release note (cli change): `cockroach zip` is now able to tolerate more forms of cluster unavailability. Nonetheless, in case system ranges are unavailable, it is recommended to run `cockroach zip` against each node address in turn to maximize the amount of useful data collected.
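A hedged sketch of the fallback shape this describes (`nodeDescs` is a hypothetical helper, not the PR's code; it assumes the `serverpb` status client, whose `Details` response carries the node's addresses, and the exact field names are assumptions):

```go
// nodeDescs is illustrative: prefer the full Nodes() listing, but if
// that fails (it needs a KV read of liveness), degrade to a single
// entry for the head node built from the purely local Details() call.
func nodeDescs(ctx context.Context, status serverpb.StatusClient) ([]roachpb.NodeDescriptor, error) {
	if resp, err := status.Nodes(ctx, &serverpb.NodesRequest{}); err == nil {
		descs := make([]roachpb.NodeDescriptor, len(resp.Nodes))
		for i, n := range resp.Nodes {
			descs[i] = n.Desc
		}
		return descs, nil
	}
	details, err := status.Details(ctx, &serverpb.DetailsRequest{NodeId: "local"})
	if err != nil {
		return nil, err
	}
	return []roachpb.NodeDescriptor{{
		NodeID:     details.NodeID,
		Address:    details.Address,
		SQLAddress: details.SQLAddress,
	}}, nil
}
```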
Force-pushed from becaa6f to 48259ad.
OK, this passes tests reliably now. RFAL
cc @ajwerner: if you could help with the last pass of reviews, that would be swell.
Reviewed 2 of 3 files at r17, 2 of 2 files at r20.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @tbg)
Thank you! bors r=tbg,ajwerner
44342: cli/zip: make `cockroach zip` more resilient r=tbg,ajwerner a=knz

Fixes #44337.
Fixes #36584.
Fixes #44215.
Fixes #39620.

This is a collection of patches that enhance the behavior of `cockroach zip` over partially or fully unavailable clusters. In particular, it increases the amount of useful data collected.

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Build succeeded.