cli/zip: make `cockroach zip` more resilient #44342
Conversation
Force-pushed from 7286767 to 0f44524.
@tbg PTAL

Also a question: it is as if there are new ranges appearing after the loop completes. Is it possible that TestCluster's own …

(Note that this is immaterial to merging this PR: as long as the majority of the ranges are moved to the other nodes, the unavailability is sufficient to trigger the tested behavior.)
Force-pushed from 6f3362c to 94e5b3d.
Thank you! These improvements look good across the board. I do have some questions about the bug you're fixing.
> causing `zip` to encounter a retry error while retrieving the system table.
Isn't the real bug that debug zip isn't handling the retry error? It seems that the other thing I don't like (the bespoke way to start a weirdly configured test server) only serves to work around that. Isn't the client (zip)-facing retry error unexpected and also undesirable?
The loop mostly does the right thing from the looks of it but it doesn't guarantee that there won't be replica movement after. For example, a range could be moved as part of a merge, or rebalancing. I assume you had no extra nodes that could cause that, though, but the method also relies on metrics as an exit condition, which is async and in particular starts out at zero. It seems reasonable that the loop could call it quits once at least one range is upreplicated, though usually it will wait for them all.
cockroach/pkg/testutils/testcluster/testcluster.go, lines 756 to 808 at fc3676e:
```go
func (tc *TestCluster) WaitForFullReplication() error {
	start := timeutil.Now()
	defer func() {
		end := timeutil.Now()
		log.Infof(context.TODO(), "WaitForFullReplication took: %s", end.Sub(start))
	}()
	if len(tc.Servers) < 3 {
		// If we have less than three nodes, we will never have full replication.
		return nil
	}
	opts := retry.Options{
		InitialBackoff: time.Millisecond * 10,
		MaxBackoff:     time.Millisecond * 100,
		Multiplier:     2,
	}
	notReplicated := true
	for r := retry.Start(opts); r.Next() && notReplicated; {
		notReplicated = false
		for _, s := range tc.Servers {
			err := s.Stores().VisitStores(func(s *storage.Store) error {
				if n := s.ClusterNodeCount(); n != len(tc.Servers) {
					log.Infof(context.TODO(), "%s only sees %d/%d available nodes", s, n, len(tc.Servers))
					notReplicated = true
					return nil
				}
				// Force upreplication. Otherwise, if we rely on the scanner to do it,
				// it'll take a while.
				if err := s.ForceReplicationScanAndProcess(); err != nil {
					return err
				}
				if err := s.ComputeMetrics(context.TODO(), 0); err != nil {
					// This can sometimes fail since ComputeMetrics calls
					// updateReplicationGauges which needs the system config gossiped.
					log.Info(context.TODO(), err)
					notReplicated = true
					return nil
				}
				if n := s.Metrics().UnderReplicatedRangeCount.Value(); n > 0 {
					log.Infof(context.TODO(), "%s has %d underreplicated ranges", s, n)
					notReplicated = true
				}
				return nil
			})
			if err != nil {
				return err
			}
			if notReplicated {
				break
			}
		}
	}
	return nil
}
```
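Given the concern that the under-replication gauges are async and start out at zero, a defensive caller could insist on several consecutive clean readings before declaring replication stable. A minimal sketch, assuming the same store APIs as the excerpt above plus `testing`, `fmt`, and the repo's `testutils` package; `requireStableFullReplication` and `stablePolls` are hypothetical names, not code from this PR:

```go
// Hypothetical helper: re-check the under-replication gauge until it has
// been zero on every store for several consecutive polls, guarding
// against a gauge that merely hasn't been recomputed yet.
func requireStableFullReplication(t *testing.T, tc *testcluster.TestCluster) {
	const stablePolls = 3 // hypothetical: consecutive clean passes required
	clean := 0
	testutils.SucceedsSoon(t, func() error {
		for _, srv := range tc.Servers {
			err := srv.Stores().VisitStores(func(s *storage.Store) error {
				// Recompute first; the gauge starts at zero before any pass.
				if err := s.ComputeMetrics(context.TODO(), 0); err != nil {
					return err
				}
				if n := s.Metrics().UnderReplicatedRangeCount.Value(); n > 0 {
					return fmt.Errorf("%s still has %d under-replicated ranges", s, n)
				}
				return nil
			})
			if err != nil {
				clean = 0
				return err
			}
		}
		if clean++; clean < stablePolls {
			return fmt.Errorf("waiting for %d consecutive clean polls, have %d", stablePolls, clean)
		}
		return nil
	})
}
```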
Reviewed 2 of 2 files at r1, 2 of 2 files at r2, 1 of 1 files at r3, 2 of 2 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)
pkg/cli/zip.go, line 579 at r2 (raw file):
```go
	}
}
objName := strings.ToLower(out.String())
```
would it be better to transform to lowercase before iterating? I'm just thinking of some crazy unicode stuff (which may not exist) where `IsLetter` is true but then after lowercasing we get something weird that's unsuitable for a filesystem. (I don't know what qualifies as `IsLetter`; probably this is all benign.)
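For illustration, a minimal sketch of the "lowercase first, then filter" ordering being suggested (hypothetical helper, not the PR's actual code; assumes only the standard `strings` and `unicode` packages):

```go
// sanitizeForFS is a hypothetical sketch: lowercase the whole name up
// front, then keep only letters and digits, so any rune that turns
// weird under lowercasing gets filtered after the transformation.
func sanitizeForFS(name string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(name) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteRune(r)
		} else {
			b.WriteByte('_') // anything questionable collapses to '_'
		}
	}
	return b.String()
}
```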
pkg/cli/zip.go, line 587 at r2 (raw file):
```go
cnt := fne.counters[objName]
if cnt > 0 {
	result += fmt.Sprintf("-%d", cnt)
```
Can I break something by feeding in a `f=something-2` initially? No, because `-` will be replaced out in the input, right?
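To make the collision question concrete, a hedged sketch of the counter idea (the `fne.counters` names follow the snippet above; the rest is illustrative): since `-` never survives sanitization of the input, a user-supplied `something-2` becomes `something_2` and cannot collide with a generated `something-2` suffix.

```go
type fileNameEscaper struct {
	counters map[string]int
}

// escape is illustrative: sanitize first (see the sketch above, which
// maps '-' to '_'), then append "-N" on repeated use of the same name.
func (fne *fileNameEscaper) escape(name string) string {
	objName := sanitizeForFS(name)
	result := objName
	if fne.counters == nil {
		fne.counters = map[string]int{}
	}
	cnt := fne.counters[objName]
	if cnt > 0 {
		result += fmt.Sprintf("-%d", cnt)
	}
	fne.counters[objName] = cnt + 1
	return result
}
```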
pkg/cli/zip_test.go, line 249 at r1 (raw file):
```go
}
// Strip any non-deterministic messages:
```
Hmm, this will potentially fail nondeterministically as we change things elsewhere, bump deps, etc. Have you considered making `expected` a regexp? From the looks of it, that's going to leave the test in an even more annoying place, though. Hmm. I'm fine with leaving it as-is.
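A minimal sketch of what the regexp variant could look like (hypothetical; the test kept a plain string comparison; assumes the standard `regexp` and `strings` packages): quote the stable fragments with `regexp.QuoteMeta` and splice wildcards where the output drifts.

```go
// expectedPattern is hypothetical: build a regexp from the literal,
// stable fragments of the expected output, with ".*?" standing in for
// the messages that change as dependencies get bumped.
func expectedPattern(stableParts []string) *regexp.Regexp {
	quoted := make([]string, len(stableParts))
	for i, p := range stableParts {
		quoted[i] = regexp.QuoteMeta(p)
	}
	return regexp.MustCompile("(?s)^" + strings.Join(quoted, ".*?") + "$")
}
```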
pkg/cli/zip_test.go, line 241 at r2 (raw file):
```sql
create table defaultdb."a-b"(x int);
create table defaultdb."pg_catalog.pg_class"(x int);
create table defaultdb."../system"(x int);
```
💡
pkg/testutils/testcluster/testcluster.go, line 890 at r4 (raw file):
```go
}
// StartTestClusterWithoutReplication starts up a TestCluster made up
```
Oof, that seems like a… quite bespoke, not at all general-purpose way of starting a server.
Force-pushed from 94e5b3d to 10e45bb.
> Isn't the real bug that debug zip isn't handling the retry error?
That is a bug, but it was already there before. I suppose I could add an additional commit.
> The loop mostly does the right thing from the looks of it but it doesn't guarantee that there won't be replica movement after. [...] It seems reasonable that the loop could call it quits once at least one range is upreplicated, though usually it will wait for them all.
Are you telling me that the method is called "WaitForFullReplication" and it does not, in fact, wait for full replication? How is that not more generally a problem?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)
pkg/cli/zip.go, line 579 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
> would it be better to transform to lowercase before iterating? I'm just thinking of some crazy unicode stuff (which may not exist) where `IsLetter` is true but then after lowercasing we get something weird that's unsuitable for a filesystem. (I don't know what qualifies as `IsLetter`; probably this is all benign.)
Done.
pkg/cli/zip.go, line 587 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
> Can I break something by feeding in a `f=something-2` initially? No, because `-` will be replaced out in the input, right?
Yes, correct. Added a test to show it.
pkg/testutils/testcluster/testcluster.go, line 890 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
> Oof, that seems like a… quite bespoke, not at all general-purpose way of starting a server.

It was introduced in #44022; this is just moving it to a new place. I would prefer to reuse the same code in both tests that need it (one in `sql` and one in `cli`); what would be a better place for it?
(Also, I would welcome your comments on correctness more generally. The comment at the top wrt WaitForFullReplication makes me think we have a problem to fix.)
> That is a bug, but it was already there before. I suppose I could add an additional commit.
If you don't mind, do that or file an issue. I thought we always issued single statements and wouldn't see retries. (Though with data streamed out already, I can see how we would get one. Which is annoying; not sure how we can fix this without adding retry handlers for all of debug zip. We don't want to buffer too much there, though I suppose we could buffer more than the default.)
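For concreteness, a hedged sketch of the per-query retry handler being discussed (this is what the eventual issue asks for; it is not code from this PR, and it assumes the `lib/pq` driver, under which serialization failures surface as SQLSTATE 40001):

```go
// runWithRetry is hypothetical: retry a query a bounded number of times
// on a retryable serialization error. As noted above, this is only safe
// before any rows have been streamed out to the zip file.
func runWithRetry(ctx context.Context, db *sql.DB, query string) (*sql.Rows, error) {
	const maxAttempts = 5 // hypothetical cap
	for attempt := 1; ; attempt++ {
		rows, err := db.QueryContext(ctx, query)
		var pqErr *pq.Error
		if err != nil && errors.As(err, &pqErr) && pqErr.Code == "40001" && attempt < maxAttempts {
			continue // retryable serialization error: try again
		}
		return rows, err
	}
}
```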
> Are you telling me that the method is called "WaitForFullReplication" and it does not, in fact, wait for full replication? How is that not more generally a problem?
I'm telling you that this code tries to check conditions at a distance and that it may not be bulletproof. I think most consumers are OK with this, but it's clearly not desirable. But looking at the code more, this should really work in practice: it's computing the metrics at each step on each node and checking for under-replication. Do you have a log with the suspicious activity? I'm looking for the `WaitForFullReplication took: %s` line and then more activity.
Reviewed 2 of 2 files at r8, 1 of 2 files at r9, 1 of 1 files at r10.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)
pkg/testutils/testcluster/testcluster.go, line 890 at r4 (raw file):
Previously, knz (kena) wrote…
> It was introduced in #44022; this is just moving it to a new place. I would prefer to reuse the same code in both tests that need it (one in `sql` and one in `cli`); what would be a better place for it? (Also, I would welcome your comments on correctness more generally. The comment at the top wrt WaitForFullReplication makes me think we have a problem to fix.)
Right, I'm not saying we shouldn't start a server like that, more that it seems too niche for the general case. Assuming WaitForFullReplication worked, you would set up three nodes, wait for full replication, and then take two of them down, and things would work (assuming the retry error were fixed)?
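To make that shape concrete, a minimal sketch assuming the standard `testcluster` and `base` helpers (the test name and body are hypothetical; the actual test differs):

```go
// Hypothetical test skeleton: start three nodes, wait for full
// replication, then stop two so that most ranges lose quorum, which is
// exactly the unavailability `cockroach zip` must tolerate.
func TestZipUnavailableRanges(t *testing.T) {
	tc := testcluster.StartTestCluster(t, 3, base.TestClusterArgs{})
	defer tc.Stopper().Stop(context.Background())

	if err := tc.WaitForFullReplication(); err != nil {
		t.Fatal(err)
	}
	tc.StopServer(1)
	tc.StopServer(2)
	// ... run `cockroach zip` against tc.Server(0) and assert it completes.
}
```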
Force-pushed from 10e45bb to becaa6f.
I have re-enabled TestPartialZip and verified via `make stress` that the various tests here are not flaky. They are flaky under race-enabled builds, but that's expected: the slowness of race runs triggers all kinds of timeouts.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/testutils/testcluster/testcluster.go, line 890 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
> Right, I'm not saying we shouldn't start a server like that, more that it seems too niche for the general case. Assuming WaitForFullReplication worked, you would set up three nodes, wait for full replication, and then take two of them down, and things would work (assuming the retry error were fixed)?
Removed in favor of the technique you explained to me.
Filed #45134 to add the missing retry loop.
Previously, `cockroach zip` would only print an informational message about a piece of data it was retrieving *after* the data was retrieved (or an error was observed). This patch changes it to also print a message beforehand. This enables better troubleshooting of hanging queries. Release note (cli change): `cockroach zip` now displays its progress differently on the terminal.
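The pattern is simple; a minimal illustration (hypothetical helper, not zip.go's actual structure; assumes only the standard `fmt` package):

```go
// fetch is illustrative: announce the item before the request so that a
// hang can be attributed to a specific query, then report the outcome.
func fetch(name string, get func() error) error {
	fmt.Printf("requesting %s... ", name)
	if err := get(); err != nil {
		fmt.Printf("error: %v\n", err)
		return err
	}
	fmt.Println("done")
	return nil
}
```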
Previously, `cockroach zip` did not properly escape special characters in db/table names, so that a SQL database named e.g. `"../schema/system"` (inside SQL) would cause a zip entry called `../schema/system` to be created. This generated name is ambiguous with the output path of zip files for the `system` database. This patch fixes that by properly escaping the file names. Release note (cli change): `cockroach zip` now properly supports special characters in database and table names.
The SQL connections used by `cockroach zip` already use a *connect timeout*, configurable via COCKROACH_CONNECT_TIMEOUT and (currently) defaulting to 15s. However, there was no timeout after a connection was established. This patch changes it to use the value configured via `--timeout` as a per-SQL-query timeout. Release note (cli change): `cockroach zip` now applies the `--timeout` parameter also to the SQL queries it performs (there was no timeout previously, causing `cockroach zip` to potentially hang).
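A hedged sketch of the mechanism (illustrative; the actual `--timeout` plumbing is elided; assumes the standard `context`, `database/sql`, and `time` packages): bound each query, including row streaming, with a context deadline.

```go
// runQueryWithTimeout is hypothetical: the deadline covers both query
// execution and draining the rows, so a query against an unavailable
// range errors out instead of hanging `cockroach zip` forever.
func runQueryWithTimeout(
	db *sql.DB, timeout time.Duration, query string, scan func(*sql.Rows) error,
) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	rows, err := db.QueryContext(ctx, query)
	if err != nil {
		return err
	}
	defer rows.Close()
	for rows.Next() {
		if err := scan(rows); err != nil {
			return err
		}
	}
	return rows.Err()
}
```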
Prior to this patch, `cockroach zip` was performing a Node() RPC on the head node to retrieve its SQL address. This RPC performs a KV read and would thus block if the liveness range was unavailable. This is fixed by replacing the call with `Details()` (`/health`), which also delivers the SQL address. Additionally, the per-node retrieval code is updated to retrieve at least the head node's endpoints, even when `Nodes()` fails. This enables extracting a few more things in case liveness is unavailable: gossip, some crdb_internal data, log files, etc. Release note (cli change): `cockroach zip` is now able to tolerate more forms of cluster unavailability. Nonetheless, in case system ranges are unavailable, it is recommended to run `cockroach zip` against each node address in turn to maximize the amount of useful data collected.
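A hedged sketch of the fallback shape this describes (`nodeDescs` is a hypothetical helper, not the PR's code; it assumes the `serverpb` status client, whose `Details` response carries the node's addresses, and the exact field names are assumptions):

```go
// nodeDescs is illustrative: prefer the full Nodes() listing, but if
// that fails (it needs a KV read of liveness), degrade to a single
// entry for the head node built from the purely local Details() call.
func nodeDescs(ctx context.Context, status serverpb.StatusClient) ([]roachpb.NodeDescriptor, error) {
	if resp, err := status.Nodes(ctx, &serverpb.NodesRequest{}); err == nil {
		descs := make([]roachpb.NodeDescriptor, len(resp.Nodes))
		for i, n := range resp.Nodes {
			descs[i] = n.Desc
		}
		return descs, nil
	}
	details, err := status.Details(ctx, &serverpb.DetailsRequest{NodeId: "local"})
	if err != nil {
		return nil, err
	}
	return []roachpb.NodeDescriptor{{
		NodeID:     details.NodeID,
		Address:    details.Address,
		SQLAddress: details.SQLAddress,
	}}, nil
}
```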
Force-pushed from becaa6f to 48259ad.
OK, this passes tests reliably now. RFAL
cc @ajwerner: if you could help with the last pass of reviews, that would be swell.
Reviewed 2 of 3 files at r17, 2 of 2 files at r20.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @tbg)
Thank you! bors r=tbg,ajwerner
44342: cli/zip: make `cockroach zip` more resilient r=tbg,ajwerner a=knz

Fixes #44337.
Fixes #36584.
Fixes #44215.
Fixes #39620.

This is a collection of patches that enhance the behavior of `cockroach zip` over partially or fully unavailable clusters. In particular, it increases the amount of useful data collected.

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Build succeeded.