storage: fix and test a bogus source of replica divergence errors #37668

Merged
merged 5 commits into from May 21, 2019

3 participants
@tbg
Member

commented May 21, 2019

An incompatibility in the consistency checks was introduced between v2.1 and v19.1.
See individual commit messages and #37425 for details.

Release note (bug fix): Fixed a potential source of (faux) replica
inconsistencies that can be reported while running a mixed v19.1 / v2.1
cluster. This error (in that situation only) is benign and can be
resolved by upgrading to the latest v19.1 patch release. Every time this
error occurs a "checkpoint" is created which will occupy a large amount
of disk space and which needs to be removed manually (see <store
directory>/auxiliary/checkpoints).

Release note (bug fix): Fixed a case in which ./cockroach quit would
return success even though the server process was still running in a
severely degraded state.

tbg added some commits May 21, 2019

batcheval: bump ReplicaChecksumVersion
In #35861, I made changes to the consistency checksum computation that
were not backwards-compatible. When a 19.1 node asked a 2.1 node for a
fast SHA, the 2.1 node would run a full computation and return the
corresponding SHA, which wouldn't match the leaseholder's.

Bump ReplicaChecksumVersion to make sure that we don't attempt to
compare SHAs across these two releases.
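
To illustrate why a version bump avoids the false positive, here is a minimal sketch of a version-gated comparison. It is not the actual batcheval code: the type `checksumResponse`, the function `compare`, and the version numbers are all hypothetical. The point is only that a SHA computed under a different checksum version is treated as "not comparable" instead of "diverged".

```go
package main

import (
	"bytes"
	"fmt"
)

// checksumResponse is a hypothetical reply from a follower replica.
type checksumResponse struct {
	Version uint32 // checksum algorithm version the follower used
	SHA     []byte // checksum over the follower's data
}

// compare reports whether the follower's SHA differs from the local one.
// ok is false when the versions differ; in that case the SHAs were computed
// differently and no verdict about divergence is given.
func compare(localVersion uint32, localSHA []byte, resp checksumResponse) (diverged, ok bool) {
	if resp.Version != localVersion {
		return false, false
	}
	return !bytes.Equal(localSHA, resp.SHA), true
}

func main() {
	local := []byte{0xaa}
	// Same version, different SHA: reported as a divergence.
	fmt.Println(compare(2, local, checksumResponse{Version: 2, SHA: []byte{0xbb}}))
	// Different version: the comparison is skipped.
	fmt.Println(compare(2, local, checksumResponse{Version: 1, SHA: []byte{0xbb}}))
}
```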

Fixes #37425.

Release note (bug fix): Fixed a potential source of (faux) replica
inconsistencies that can be reported while running a mixed v19.1 / v2.1
cluster. This error (in that situation only) is benign and can be
resolved by upgrading to the latest v19.1 patch release. Every time this
error occurs a "checkpoint" is created which will occupy a large amount
of disk space and which needs to be removed manually (see <store
directory>/auxiliary/checkpoints).

server: avoid stalled process in ./cockroach quit
There are "known unknown" problems with shutting down gRPC or the
stopper, which can leave the process dangling while its sockets are
already closed. The UX is poor, and it would likely be very annoying to
find the root cause, fix it, and add a regression test for it. Luckily,
all we want to achieve is that the process is dead soon after the client
disconnects, and we can do that.

If I were to rewrite this code, I would probably not even bother with
stopping the stopper or gRPC, but just call `os.Exit` straight away.
I'm not doing this right now to minimize fallout since this change
will be backported to release-19.1.

Release note (bug fix): Fixed a case in which `./cockroach quit` would
return success even though the server process was still running in a
severely degraded state.

roachtest: actually check correct node
The mixed version test was always verifying the first node by accident.

Release note: None

@tbg tbg requested a review from cockroachdb/core-prs as a code owner May 21, 2019

@tbg tbg requested a review from cockroachdb/sql-rest-prs as a code owner May 21, 2019

@tbg tbg force-pushed the tbg:fix/conscheck-version branch from 572b8c9 to 326a213 May 21, 2019

@tbg tbg requested a review from nvanbenschoten May 21, 2019

@nvanbenschoten
Member

left a comment

:lgtm:

Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 2 of 2 files at r4, 1 of 1 files at r5.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @tbg)


pkg/cmd/roachtest/cluster.go, line 1058 at r4 (raw file):

		// TODO(tbg): the checks can fail for silly reasons like missing gossiped
		// descriptors, etc. -- not worth failing the test for. Ideally this would
		// be rock solid.

Is it still worth logging?


pkg/cmd/roachtest/cluster.go, line 1090 at r4 (raw file):

	}

	var db *gosql.DB

Comment that you're trying to find a live node and that this isn't actually the consistency check.

tbg added some commits May 21, 2019

roachtest: run inconsistency checks during mixed version test
This regression tests #37425, which exposed an incompatibility between
v19.1 and v2.1.

`./bin/roachtest run --local version/mixed/nodes=3` ran successfully
after these changes.

I took the opportunity to address a TODO in FailOnReplicaDivergence.

Release note: None

@tbg tbg force-pushed the tbg:fix/conscheck-version branch from 326a213 to e3ae436 May 21, 2019

@tbg
Member Author

left a comment

Comments addressed, TFTR!

bors r=nvanbenschoten

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)

craig bot pushed a commit that referenced this pull request May 21, 2019

Merge #37668 #37701
37668: storage: fix and test a bogus source of replica divergence errors r=nvanbenschoten a=tbg

An incompatibility in the consistency checks was introduced between v2.1 and v19.1.
See individual commit messages and #37425 for details.

Release note (bug fix): Fixed a potential source of (faux) replica
inconsistencies that can be reported while running a mixed v19.1 / v2.1
cluster. This error (in that situation only) is benign and can be
resolved by upgrading to the latest v19.1 patch release. Every time this
error occurs a "checkpoint" is created which will occupy a large amount
of disk space and which needs to be removed manually (see <store
directory>/auxiliary/checkpoints).

Release note (bug fix): Fixed a case in which `./cockroach quit` would
return success even though the server process was still running in a
severely degraded state.

37701: workloadcccl: fix two regressions in fixtures make/load r=nvanbenschoten a=danhhz

The SQL database for all the tables in the BACKUPs created by `fixtures
make` used to be "csv" (an artifact of the way we made them), but as
of #37343 it's the name of the generator. This seems better so change
`fixtures load` to match.

The same PR also (accidentally) started adding foreign keys in the
BACKUPs, but since there's one table per BACKUP (another artifact of the
way we used to make fixtures), we can't restore the foreign keys. It'd
be nice to switch to one BACKUP with all tables and get the foreign
keys, but the UX of the postLoad hook becomes tricky and I don't have
time right now to sort it all out. So, revert to the previous behavior
(no fks in fixtures) for now.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>

@craig

commented May 21, 2019

Build succeeded

@craig craig bot merged commit e3ae436 into cockroachdb:master May 21, 2019

3 checks passed

GitHub CI (Cockroach): TeamCity build finished
bors: Build succeeded
license/cla: Contributor License Agreement is signed.

@tbg tbg deleted the tbg:fix/conscheck-version branch May 22, 2019
