
cli/zip: make cockroach zip more resilient #44342

Merged
merged 4 commits into cockroachdb:master from 20200124-zip-fixes on Feb 19, 2020

Conversation

knz
Contributor

@knz knz commented Jan 24, 2020

Fixes #44337.
Fixes #36584.
Fixes #44215.
Fixes #39620.

This is a collection of patches that enhance the behavior of `cockroach zip` over partially or fully unavailable clusters. In particular, it increases the amount of useful data collected.

@knz knz requested a review from tbg January 24, 2020 15:20
@knz knz requested a review from a team as a code owner January 24, 2020 15:20
@knz knz added this to To do in DB Server & Security via automation Jan 24, 2020
@cockroach-teamcity
Member

This change is Reviewable

@knz
Contributor Author

knz commented Jan 24, 2020

@tbg PTAL

Also a question:
I'm stumped about why my loop to wait for the zone configs to be applied in StartClusterWithoutReplication (modeled on the code that Darin wrote) does not actually do its job: ranges still pop up on some other nodes every now and then.

It is as if there are new ranges appearing after the loop completes. Is it possible that TestCluster's own WaitForFullReplication does not wait for all initial splits?

(Note that this is immaterial to merging this PR: as long as the majority of the ranges are moved to the other nodes, the unavailability is sufficient to trigger the tested behavior)

@knz knz force-pushed the 20200124-zip-fixes branch 2 times, most recently from 6f3362c to 94e5b3d on January 24, 2020 16:24
Member

@tbg tbg left a comment

Thank you! These improvements look good across the board. I do have some questions about the bug you're fixing.

causing zip to encounter a retry error while retrieving the system table.

Isn't the real bug that debug zip isn't handling the retry error? It seems that the other thing I don't like (the bespoke way to start a weirdly configured test server) only serves to work around that. Isn't the client (zip)-facing retry error unexpected and also undesirable?

The loop mostly does the right thing from the looks of it but it doesn't guarantee that there won't be replica movement after. For example, a range could be moved as part of a merge, or rebalancing. I assume you had no extra nodes that could cause that, though, but the method also relies on metrics as an exit condition, which is async and in particular starts out at zero. It seems reasonable that the loop could call it quits once at least one range is upreplicated, though usually it will wait for them all.

func (tc *TestCluster) WaitForFullReplication() error {
	start := timeutil.Now()
	defer func() {
		end := timeutil.Now()
		log.Infof(context.TODO(), "WaitForFullReplication took: %s", end.Sub(start))
	}()
	if len(tc.Servers) < 3 {
		// If we have less than three nodes, we will never have full replication.
		return nil
	}
	opts := retry.Options{
		InitialBackoff: time.Millisecond * 10,
		MaxBackoff:     time.Millisecond * 100,
		Multiplier:     2,
	}
	notReplicated := true
	for r := retry.Start(opts); r.Next() && notReplicated; {
		notReplicated = false
		for _, s := range tc.Servers {
			err := s.Stores().VisitStores(func(s *storage.Store) error {
				if n := s.ClusterNodeCount(); n != len(tc.Servers) {
					log.Infof(context.TODO(), "%s only sees %d/%d available nodes", s, n, len(tc.Servers))
					notReplicated = true
					return nil
				}
				// Force upreplication. Otherwise, if we rely on the scanner to do it,
				// it'll take a while.
				if err := s.ForceReplicationScanAndProcess(); err != nil {
					return err
				}
				if err := s.ComputeMetrics(context.TODO(), 0); err != nil {
					// This can sometimes fail since ComputeMetrics calls
					// updateReplicationGauges which needs the system config gossiped.
					log.Info(context.TODO(), err)
					notReplicated = true
					return nil
				}
				if n := s.Metrics().UnderReplicatedRangeCount.Value(); n > 0 {
					log.Infof(context.TODO(), "%s has %d underreplicated ranges", s, n)
					notReplicated = true
				}
				return nil
			})
			if err != nil {
				return err
			}
			if notReplicated {
				break
			}
		}

Reviewed 2 of 2 files at r1, 2 of 2 files at r2, 1 of 1 files at r3, 2 of 2 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)


pkg/cli/zip.go, line 579 at r2 (raw file):

		}
	}
	objName := strings.ToLower(out.String())

would it be better to transform to lowercase before iterating? I'm just thinking of some crazy unicode stuff (which may not exist) where IsLetter is true but then after lowercasing we get something weird that's unsuitable for a filesystem. (I don't know what qualifies as IsLetter, probably this is all benign).


pkg/cli/zip.go, line 587 at r2 (raw file):

	cnt := fne.counters[objName]
	if cnt > 0 {
		result += fmt.Sprintf("-%d", cnt)

Can I break something by feeding in a f=something-2 initially? No, because - will be replaced out in the input, right?
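
Not the actual pkg/cli/zip.go implementation, just a minimal sketch of the pattern discussed in this thread and the one above: lowercase first, then filter, then deduplicate with a counter. The identifiers (nameEscaper, sanitizeName) and the choice of '_' as the replacement character are assumptions for illustration only; the real escaper may preserve a different set of characters.

package main

import (
	"fmt"
	"strings"
	"unicode"
)

// nameEscaper turns arbitrary db/table names into safe, unique file names.
type nameEscaper struct {
	counters map[string]int
}

// sanitizeName lowercases the input first (so any case-folding surprises
// happen before filtering), keeps only letters and digits, and replaces
// everything else with '_'. A per-name counter disambiguates duplicates.
func (e *nameEscaper) sanitizeName(name string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(name) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteRune(r)
		} else {
			b.WriteRune('_')
		}
	}
	result := b.String()
	cnt := e.counters[result]
	e.counters[result]++
	if cnt > 0 {
		// Because '-' is replaced above, a user-supplied "something-2" cannot
		// collide with the "-2" suffix generated here.
		result += fmt.Sprintf("-%d", cnt)
	}
	return result
}

func main() {
	e := &nameEscaper{counters: map[string]int{}}
	fmt.Println(e.sanitizeName("../system")) // ___system
	fmt.Println(e.sanitizeName("a-b"))       // a_b
	fmt.Println(e.sanitizeName("a-b"))       // a_b-1
}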


pkg/cli/zip_test.go, line 249 at r1 (raw file):

	}

	// Strip any non-deterministic messages:

Hmm, this will potentially fail nondeterministically as we change things elsewhere, bump deps, etc. Have you considered making expected a regexp? From the looks of it, that's going to leave the test in an even more annoying place, though. Hmm. I'm fine with leaving as-is.


pkg/cli/zip_test.go, line 241 at r2 (raw file):

create table defaultdb."a-b"(x int);
create table defaultdb."pg_catalog.pg_class"(x int);
create table defaultdb."../system"(x int);

💡


pkg/testutils/testcluster/testcluster.go, line 890 at r4 (raw file):

}

// StartTestClusterWithoutReplication starts up a TestCluster made up

Oof, that seems like a ... quite bespoke, not at all general-purpose way of starting a server.

Contributor Author

@knz knz left a comment

Isn't the real bug that debug zip isn't handling the retry error?

That is a bug but it was there already before. I suppose I could add an additional commit.

The loop mostly does the right thing from the looks of it but it doesn't guarantee that there won't be replica movement after. [...] It seems reasonable that the loop could call it quits once at least one range is upreplicated, though usually it will wait for them all.

Are you telling me that the method is called "WaitForFullReplication" and it does not, in fact, wait for full replication? How is that not more generally a problem?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)


pkg/cli/zip.go, line 579 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

would it be better to transform to lowercase before iterating? I'm just thinking of some crazy unicode stuff (which may not exist) where IsLetter is true but then after lowercasing we get something weird that's unsuitable for a filesystem. (I don't know what qualifies as IsLetter, probably this is all benign).

Done.


pkg/cli/zip.go, line 587 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Can I break something by feeding in a f=something-2 initially? No, because - will be replaced out in the input, right?

Yes, correct. Added a test to show.


pkg/testutils/testcluster/testcluster.go, line 890 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Oof, that seems like a ... quite bespoke, not at all general-purpose way of starting a server.

It was introduced in #44022; this is just moving it to a new place. I would prefer to reuse the same code in both tests that need it (one in sql and one in cli); what would be a better place for it?

(Also, I would welcome your comments on correctness more generally. The comment at the top wrt WaitForFullReplication makes me think we have a problem to fix.)

Member

@tbg tbg left a comment

That is a bug but it was there already before. I suppose I could add an additional commit.

If you don't mind - that, or file an issue. I thought we always issued single statements and wouldn't see retries. (Though with data streamed out already, I can see how we would get one. Which is annoying; not sure how we can fix this without adding retry handlers for all of debug zip; we don't want to buffer too much there, though I suppose we could buffer more than the default.)

Are you telling me that the method is called "WaitForFullReplication" and it does not, in fact, wait for full replication? How is that not more generally a problem?

I'm telling you that this code tries to check conditions at a distance and that it may not be bulletproof. I think most consumers are OK with this, but it's clearly not desirable. But looking at the code more, this should really work in practice - it's computing the metrics on each node at each step and checking for underreplication. Do you have a log with the suspicious activity? I'm looking for the `WaitForFullReplication took: %s` line and then more activity.

Reviewed 2 of 2 files at r8, 1 of 2 files at r9, 1 of 1 files at r10.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)


pkg/testutils/testcluster/testcluster.go, line 890 at r4 (raw file):

Previously, knz (kena) wrote…

It was introduced in #44022; this is just moving it to a new place. I would prefer to reuse the same code in both tests that need it (one in sql and one in cli); what would be a better place for it?

(Also, I would welcome your comments on correctness more generally. The comment at the top wrt WaitForFullReplication makes me think we have a problem to fix.)

Right, I'm not saying we shouldn't start a server like that, more that it seems too niche for the general case. Assuming WaitForFullReplication worked, you would set up three nodes, wait for full replication, and then take two of them down, and things would work (assuming the retry error were fixed)?

Contributor Author

@knz knz left a comment

I have re-enabled TestPartialZip and verified via `make stress` that the various tests here are not flaky.
They are flaky under race-enabled builds, but that's expected: the slowness of race runs triggers all kinds of timeouts.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/testutils/testcluster/testcluster.go, line 890 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Right, I'm not saying we shouldn't start a server like that, more that it seems too niche for the general case. Assuming WaitForFullReplication worked, you would set up three nodes, wait for full replication, and then take two of them down, and things would work (assuming the retry error were fixed)?

Removed in favor of the technique you explained to me.
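
As a rough sketch of that technique (not the exact test this PR ended up with, and with an invented test name), assuming the standard testcluster helpers: bring up three nodes, wait for full replication, then stop two of them so that fully replicated ranges lose quorum.

package cli_test

import (
	"context"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/base"
	"github.com/cockroachdb/cockroach/pkg/testutils/testcluster"
)

func TestZipAgainstUnavailableCluster(t *testing.T) {
	ctx := context.Background()

	tc := testcluster.StartTestCluster(t, 3, base.TestClusterArgs{})
	defer tc.Stopper().Stop(ctx)

	// Ensure every range has three replicas before inducing unavailability.
	if err := tc.WaitForFullReplication(); err != nil {
		t.Fatal(err)
	}

	// Stop two of the three nodes: any range replicated across all three now
	// lacks a quorum, which is the condition `cockroach zip` must tolerate.
	tc.StopServer(1)
	tc.StopServer(2)

	// ... exercise the zip code path against tc.Server(0) here.
}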

craig bot pushed a commit that referenced this pull request Feb 15, 2020
45133: *: add a context argument to storagepb.ReplicaRequestFilter r=ajwerner a=knz

Suggested by @tbg 
This will unblock progress on #44342.

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
@knz
Contributor Author

knz commented Feb 15, 2020

Filed #45134 to add the missing retry loop.
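
For context, a purely illustrative sketch of what such a client-side retry loop can look like, using database/sql and lib/pq to detect CockroachDB's retryable serialization errors (SQLSTATE 40001). This is not the code that #45134 eventually added; all names here are invented for the example.

package example

import (
	"context"
	"database/sql"
	"errors"

	"github.com/lib/pq"
)

// runWithRetry re-issues fn whenever the database reports a retryable
// serialization failure (SQLSTATE 40001), up to maxAttempts times.
func runWithRetry(
	ctx context.Context, db *sql.DB, maxAttempts int,
	fn func(context.Context, *sql.DB) error,
) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = fn(ctx, db); err == nil {
			return nil
		}
		var pqErr *pq.Error
		if !errors.As(err, &pqErr) || pqErr.Code != "40001" {
			return err // not a retryable error
		}
	}
	return err
}

As noted above, this only helps when a query's output can be buffered and re-emitted; once rows have already been streamed into the zip archive, retrying needs more care.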

The four commit messages:

Previously, `cockroach zip` would only print an informational message
about a piece of data it was retrieving *after* the data was retrieved
(or an error was observed).

This patch changes it to also print a message beforehand.
This enables better troubleshooting of hanging queries.

Release note (cli change): `cockroach zip` now displays its progress
differently on the terminal.

Previously, `cockroach zip` did not properly escape special characters
in db/table names, so that a SQL database named
e.g. `"../schema/system"` (inside SQL) would cause a zip entry called
`../schema/system` to be created. This generated name is ambiguous
with the output path of zip files for the `system` database.

This patch fixes that by properly escaping the file names.

Release note (cli change): `cockroach zip` now properly supports
special characters in database and table names.

The SQL connections used by `cockroach zip` already use
a *connect timeout*, configurable via COCKROACH_CONNECT_TIMEOUT
and (currently) defaulting to 15s. However, there was no timeout after
a connection was established.

This patch changes it to use the value configured via `--timeout`
as a per-SQL-query timeout.

Release note (cli change): `cockroach zip` now applies the
`--timeout` parameter also to the SQL queries it performs (there was
no timeout previously, which could cause `cockroach zip` to hang).
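
A hedged illustration of one way to bound each SQL query with the --timeout value, using a context deadline with database/sql. The actual CLI may implement this differently (for example via a server-side statement_timeout); the helper below is an assumption, not the PR's code.

package example

import (
	"context"
	"database/sql"
	"time"
)

// runQueryWithTimeout bounds a single SQL query by `timeout` using a context
// deadline, handing the rows to `process` before the deadline is released.
func runQueryWithTimeout(
	ctx context.Context, db *sql.DB, timeout time.Duration,
	query string, process func(*sql.Rows) error,
) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	rows, err := db.QueryContext(ctx, query)
	if err != nil {
		return err
	}
	defer rows.Close()

	if err := process(rows); err != nil {
		return err
	}
	// If the deadline fired mid-stream, rows.Err() reports it here.
	return rows.Err()
}
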
Prior to this patch, `cockroach zip` was performing a Node() RPC on
the head node to retrieve its SQL address. That RPC performs a KV
read and would thus block if the liveness range was unavailable.

This is fixed by replacing that call with `Details()` (`/health`),
which also delivers the SQL address.

Additionally, the per-node retrieval code is updated to retrieve
at least the head node's endpoints, even when `Nodes()` fails.
This enables extracting a few more things in case liveness is
unavailable: gossip, some crdb_internal tables, log files, etc.

Release note (cli change): `cockroach zip` is now able to tolerate
more forms of cluster unavailability. Nonetheless, in case system
ranges are unavailable, it is recommended to run `cockroach zip`
against each node address in turn to maximize the amount of useful
data collected.
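
To make that fallback concrete, a small sketch with hypothetical types (nodeInfo and listNodes are not the real zip.go identifiers): if the cluster-wide node listing fails, collect from the node we are already connected to rather than giving up entirely.

package example

import "context"

// nodeInfo is a stand-in for whatever per-node metadata the collector needs.
type nodeInfo struct {
	id   int
	addr string
}

// nodesToCollect returns the full node list when it is available, and falls
// back to just the head node when the listing fails (e.g. because the
// liveness range is unavailable). The head node still yields gossip,
// crdb_internal tables, log files, etc.
func nodesToCollect(
	ctx context.Context,
	listNodes func(context.Context) ([]nodeInfo, error),
	headNode nodeInfo,
) []nodeInfo {
	nodes, err := listNodes(ctx)
	if err != nil {
		return []nodeInfo{headNode}
	}
	return nodes
}
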
@knz
Contributor Author

knz commented Feb 15, 2020

Ok this passes tests reliably now. RFAL

@knz
Contributor Author

knz commented Feb 18, 2020

cc @ajwerner if you could help with the last pass of reviews that would be swell.

Contributor

@ajwerner ajwerner left a comment

:lgtm_strong:

Reviewed 2 of 3 files at r17, 2 of 2 files at r20.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @tbg)

@knz
Contributor Author

knz commented Feb 19, 2020

thank you!

bors r=tbg,ajwerner

craig bot pushed a commit that referenced this pull request Feb 19, 2020
44342: cli/zip: make `cockroach zip` more resilient r=tbg,ajwerner a=knz

Fixes #44337.
Fixes #36584.
Fixes #44215.
Fixes #39620.

This is a collection of patches that enhance the behavior of `cockroach zip` over partially or fully unavailable clusters. In particular it increases the amount of useful data collected.



Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
@craig
Contributor

craig bot commented Feb 19, 2020

Build succeeded

@craig craig bot merged commit 48259ad into cockroachdb:master Feb 19, 2020
DB Server & Security automation moved this from In progress to Done 20.1 Feb 19, 2020
@knz knz deleted the 20200124-zip-fixes branch February 19, 2020 03:26