storage: fix command queue dependencies where candidate span is enclosed #14733

spencerkimball · 2017-04-09T21:40:22Z

Change #14342 had a bug in that it did not remove the optimization to stop
checking dependencies in the event that the range group being built up
completely enclosed the candidate span. Bailing out early is no longer
allowed because we must additionally include any read-only spans which
might not be transitively depended on by the enclosing read-write span(s).

This bug applies to both prop-eval-KV as well as non-prop-eval-KV, but
manifested much more readily with prop-eval-KV due to the massively
enclosing span which is added to provide linearizability across all ops
hitting the range's command queue.

Fixes #14434
Fixes #8057

cockroach-teamcity · 2017-04-09T21:40:28Z

This change is

bdarnell · 2017-04-09T22:24:38Z

I've been chasing red herrings related to this all weekend; thanks for finding it!

Reviewed 4 of 4 files at r1.
Review status: all files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.

pkg/sql/parallel_test.go, line 238 at r1 (raw file):

	defer leaktest.AfterTest(t)()

	if testutils.Stress() {

This LGTM but cc @RaduBerinde for context about why this was skipped under stress to begin with. Was it just copied from logic_test.go which is large and arguably less interesting to run under stress?

pkg/storage/command_queue.go, line 392 at r1 (raw file):

				// The current command is a write, so add it to the write RangeGroup.
				_ = cq.wRg.Add(keyRange)

Remove the _ =; we aren't explicit about ignoring the return value at other call sites of RangeGroup.Add (which is now only used in tests)

pkg/storage/command_queue_test.go, line 474 at r1 (raw file):

// fail to return read-only dependencies that a candidate read/write
// span depends on, despite there being an overlapping span which
// completely encloses the candidate.

Link back to the issue that this is testing for.

Should we test other permutations too? (a read depending on a write, etc)

Comments from Reviewable

RaduBerinde · 2017-04-09T22:34:22Z

pkg/sql/parallel_test.go, line 238 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This LGTM but cc @RaduBerinde for context about why this was skipped under stress to begin with. Was it just copied from logic_test.go which is large and arguably less interesting to run under stress?

I am not sure, I did some digging around and it was introduced in #10966; there's some context in #10939.

Comments from Reviewable

spencerkimball · 2017-04-09T22:57:07Z

Review status: 2 of 4 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.

pkg/sql/parallel_test.go, line 238 at r1 (raw file):

Previously, RaduBerinde wrote…

I am not sure, I did some digging around and it was introduced in #10966; there's some context in #10939.

Hmm, OK I'll revert this.

pkg/storage/command_queue.go, line 392 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Remove the _ =; we aren't explicit about ignoring the return value at other call sites of RangeGroup.Add (which is now only used in tests)

Done.

pkg/storage/command_queue_test.go, line 474 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Link back to the issue that this is testing for.

Should we test other permutations too? (a read depending on a write, etc)

Other combinations don't really make sense. This is the simplest test which will reproduce the bug and it crucially requires that the waiter is a write, the enclosing span is a write, and there is a read-only span which is depended on by the waiter but not the enclosing span.

Comments from Reviewable

bdarnell · 2017-04-09T23:05:28Z

Review status: 1 of 3 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.

pkg/sql/parallel_test.go, line 238 at r1 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Hmm, OK I'll revert this.

I'd leave it (so the parallel tests are run under stress). The logic tests are skipped under stress because A) the quantity and running time of the tests causes problems when run under stress and B) they don't have a history of differing behavior under stress. Neither of these apply to the parallel tests; there are only two (for now; even if it grows it's got a long way to go before it's an issue) and both have an understandable history of flakiness under stress.

Comments from Reviewable

Change #14342 had a bug in that it did not remove the optimization to stop checking dependencies in the event that the range group being built up completely enclosed the candidate span. Bailing out early is no longer allowed because we must additionally include any read-only spans which might not be transitively depended on by the enclosing read-write span(s). This bug applies to both prop-eval-KV as well as non-prop-eval-KV, but manifested much more readily with prop-eval-KV due to the massively enclosing span which is added to provide linearizability across all ops hitting the range's command queue. Fixes #14434 Fixes #8057

spencerkimball · 2017-04-09T23:09:00Z

Review status: 2 of 4 files reviewed at latest revision, 3 unresolved discussions.

pkg/sql/parallel_test.go, line 238 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I'd leave it (so the parallel tests are run under stress). The logic tests are skipped under stress because A) the quantity and running time of the tests causes problems when run under stress and B) they don't have a history of differing behavior under stress. Neither of these apply to the parallel tests; there are only two (for now; even if it grows it's got a long way to go before it's an issue) and both have an understandable history of flakiness under stress.

Done.

Comments from Reviewable

RaduBerinde · 2017-04-09T23:31:39Z

Awesome! Exciting to have the multinode parallel test finally enabled!

nvanbenschoten · 2017-04-10T01:37:26Z

Nice catch!

With the change in #14342 we might eventually want to consider changing the structure here, now that the classic "reads depend on writes, writes depend on reads and write" rules are getting more complicated. I have a feeling some of the complexity we've been incrementally adding can be simplified if we reconsider the problem from scratch in light of these new constraints.

Reviewed 2 of 4 files at r1.
Review status: 2 of 4 files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.

pkg/storage/command_queue.go, line 358 at r2 (raw file):

					chans = append(chans, cmd.pending)
				}
			} else {

It looks like we could have retained the optimization for cases where !cmd.readOnly && enclosed == true. I'm not sure this is very important in practice though, as all of the operations below should have constant amortized time in the regime where enclosed is already true (cq.wRg.Encloses(newCmdRange)). Still, might be worth adding back in.

Comments from Reviewable

spencerkimball · 2017-04-10T19:00:51Z

@nvanbenschoten I agree. If the range group data structure could give us enough info to efficiently track inter-command dependencies we could get linear performance in usual case and only quadratic time penalty in worst case scenarios. There is probably something better. Something to keep in mind.

spencerkimball requested review from bdarnell and nvanbenschoten April 9, 2017 21:40

spencerkimball force-pushed the spencerkimball/serializable-failure-prop-eval-kv branch from 7015e6d to 31576ef Compare April 9, 2017 22:38

spencerkimball force-pushed the spencerkimball/serializable-failure-prop-eval-kv branch from 31576ef to 6610d39 Compare April 9, 2017 22:57

spencerkimball force-pushed the spencerkimball/serializable-failure-prop-eval-kv branch from 6610d39 to ae253ca Compare April 9, 2017 23:08

spencerkimball merged commit 8c29595 into master Apr 10, 2017

spencerkimball deleted the spencerkimball/serializable-failure-prop-eval-kv branch April 10, 2017 00:13

bdarnell mentioned this pull request Apr 10, 2017

teamcity: failed tests on master: testrace (proposer evaluated kv)/TestSchemaChangeReverseMutations #14735

Closed

bdarnell mentioned this pull request Apr 10, 2017

storage: Remove propEvalKV synchronization hack #14739

Merged

nvanbenschoten mentioned this pull request Aug 2, 2017

storage: Support deferred cancellation in the CommandQueue #9448

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

storage: fix command queue dependencies where candidate span is enclosed #14733

storage: fix command queue dependencies where candidate span is enclosed #14733

spencerkimball commented Apr 9, 2017

cockroach-teamcity commented Apr 9, 2017

bdarnell commented Apr 9, 2017

RaduBerinde commented Apr 9, 2017

spencerkimball commented Apr 9, 2017

bdarnell commented Apr 9, 2017

spencerkimball commented Apr 9, 2017

RaduBerinde commented Apr 9, 2017

nvanbenschoten commented Apr 10, 2017

spencerkimball commented Apr 10, 2017

storage: fix command queue dependencies where candidate span is enclosed #14733

storage: fix command queue dependencies where candidate span is enclosed #14733

Conversation

spencerkimball commented Apr 9, 2017

cockroach-teamcity commented Apr 9, 2017

bdarnell commented Apr 9, 2017

RaduBerinde commented Apr 9, 2017

spencerkimball commented Apr 9, 2017

bdarnell commented Apr 9, 2017

spencerkimball commented Apr 9, 2017

RaduBerinde commented Apr 9, 2017

nvanbenschoten commented Apr 10, 2017

spencerkimball commented Apr 10, 2017