
logictest: TestPlannerLogic/5node-dist/distsql_interleaved_join created unexpected plan #31068

Closed
tbg opened this issue Oct 8, 2018 · 33 comments
Labels
A-sql-optimizer SQL logical planning and optimizations. C-investigation Further steps needed to qualify. C-label will change.

Comments


tbg commented Oct 8, 2018

Seen on a PR which I don't think should have that effect: https://teamcity.cockroachdb.com/viewLog.html?buildId=950803&buildTypeId=Cockroach_UnitTests_Test&tab=buildResultsDiv

(There's a chance that it is, in some obscure way, caused by it, but I don't think so and it hasn't repro'ed in several minutes of stressrace on my gceworker)

Basically the plan looks as if there were only four nodes.

Have:

[image: observed plan diagram]

Want:

[image: expected plan diagram]

logic.go:2247: 
	 
	testdata/planner_test/distsql_interleaved_join:393: SELECT url FROM [EXPLAIN (DISTSQL) SELECT * FROM outer_c1 LEFT OUTER JOIN outer_p1 USING (pid1) WHERE pid1 >= 0 AND pid1 < 40]
	expected:
	    https://cockroachdb.github.io/distsqlplan/decode.html#eJzUlE-Lo0AQxe_7KeSddtlaYqu7B2HBSwYMIQ6SOQ0yiF3JCMaW7naYEPzug3rIJJP5l5xys_v1q1e_gnKHWkle5Bs2CO8hQPBA8EEIkBEarQo2RuleHh_H8hmhSyjrprX9dUYolGaEO9jSVowQcW1ZV5w_ccq5ZD1TZc164oIg2eZlNaTNeWXRZ5SbXG8j1VrWD0XfQ1quH99KTS-NpUBYbhsOnfn0Zukkd8tp6sySeAFCxSv7MxK_f_3XfZXhE4SktaETCYo8inyKAor-IesIqrV7DmPzNSMUHb3Dukdsa6Ula5YHTFl3YhoL9Uc1k-Do4elo7yBanDVmcZVj9s5i9a6S1T-L1b9K1k9-FSmbRtWGv7Qdbr9eLNc87qJRrS74VqtiiBmPyeAbLiQbO6piPMT1IA0NvjaLD81_D8zusdm7JNm_xBx8y5x1P14CAAD___ca6HA=
	    
	but found (query options: "") :
	    https://cockroachdb.github.io/distsqlplan/decode.html#eJzUlEHLm0AQhu_9FfKeWjolrpochIKXFAwhFklPRYq4k1QwruyupSH434t6SJOm7feZk7fdnX323WcOc0GtJO_yExuEXyFA8EDwQQhAWCIjNFoVbIzS_ZURiOVPhC6hrJvW9scZoVCaEV5gS1sxQsS1ZV1x_oNTziXrjSpr1gsXBMk2L6shccsHiz6jPOX6HKnWsv5W9P9Iy-P3P0tNXxqfAmF_bjh0tutPeyf5sl-nziaJdyBUfLBvI_H-3UfdvzIsQUhaGzqRoMijyKcooGiFrCOo1l49jM2PjFB09BfXq2JbKy1Zs7xxyroH3dipD6pZLO8uPo72bqLFpDaLWbbZm-TqzdLVn-QazNI1mOTqz9L1P2MxZdOo2vCLJoHbjxKWRx7njlGtLvizVsUQM26TgRsOJBs7VsW4ieuhNHzwd1j8E17dwO497D2T7D8DB8_Ay1fBWffmVwAAAP__PKlIOw==

@jordanlewis for routing.

@tbg tbg added the A-sql-optimizer SQL logical planning and optimizations. label Oct 8, 2018

tbg commented Oct 8, 2018

I think it's unrelated but might as well add it here:

there's a data race that is probably generic in the logictests but which I triggered (on master) while stressing via five minutes of

make stressrace PKG=./pkg/sql/logictest TESTS=TestPlannerLogic/5node-dist/distsql_interleaved_join STRESSFLAGS='-p 16'
=== RUN   TestPlannerLogic/5node-dist/distsql_interleaved_join
==================
WARNING: DATA RACE
Read at 0x000004627590 by goroutine 33:
  time.Now()
      /usr/local/go/src/time/time.go:1049 +0xd3
  time.sendTime()
      /usr/local/go/src/time/sleep.go:141 +0x44

Previous write at 0x000004627590 by goroutine 42:
  github.com/cockroachdb/cockroach/pkg/sql/logictest.RunLogicTest()
      /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/sql/logictest/logic.go:2008 +0xf4
  github.com/cockroachdb/cockroach/pkg/sql/logictest.TestPlannerLogic()
      /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/sql/logictest/logic_test.go:48 +0xd2
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:777 +0x16d

Goroutine 33 (running) created at:
  runtime.(*timersBucket).addtimerLocked()
      /usr/local/go/src/runtime/time.go:160 +0x106
  time.Tick()
      /usr/local/go/src/time/tick.go:58 +0x59
  github.com/cockroachdb/cockroach/pkg/util/log.flushDaemon()
      /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1203 +0x3b

Goroutine 42 (running) created at:
  testing.(*T).Run()
      /usr/local/go/src/testing/testing.go:824 +0x564
  testing.runTests.func1()
      /usr/local/go/src/testing/testing.go:1063 +0xa4
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:777 +0x16d
  testing.runTests()
      /usr/local/go/src/testing/testing.go:1061 +0x4e1
  testing.(*M).Run()
      /usr/local/go/src/testing/testing.go:978 +0x2cd
  github.com/cockroachdb/cockroach/pkg/sql/logictest.TestMain()
      /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/sql/logictest/main_test.go:37 +0x15d
  main.main()
      _testmain.go:48 +0x22a
==================


ERROR: exit status 66

97 runs completed, 1 failures, over 4m19s


petermattis commented Oct 8, 2018

I think @arjunravinarayan or someone saw this (the data race) before, but couldn't make sense of it given that it came from time.Now().


tbg commented Oct 8, 2018

Seems like a pretty obvious race to me:

	// Set time.Local to time.UTC to circumvent pq's timetz parsing flaw.
	time.Local = time.UTC


tbg commented Oct 8, 2018

Should probably do that kind of thing in an init function to avoid racing the test harness (or anything else that might be doing stuff on other goroutines)


tbg commented Oct 8, 2018

Happened again in CI, but on my branch, so that's probably where the problem is.

[image: plan diagram from CI failure]

vs

[image: expected plan diagram]

@petermattis

@tschottdorf Here is the other issue about the time.Now race: #30027.


tbg commented Oct 8, 2018

Thanks @petermattis.

Back to the original test failure, thankfully it fails on master too:

--- FAIL: TestPlannerLogic (2.87s)
        test_log_scope.go:81: test logs captured to: /tmp/logTestPlannerLogic804741515
        test_log_scope.go:62: use -show-logs to present logs inline
    --- FAIL: TestPlannerLogic/5node-dist (0.00s)
        --- FAIL: TestPlannerLogic/5node-dist/distsql_interleaved_join (2.84s)
                logic.go:2247:

                        testdata/planner_test/distsql_interleaved_join:383: SELECT url FROM [EXPLAIN (DISTSQL) SELECT * FROM outer_p1 FULL OUTER JOIN outer_c1 USING (pid1)]
                        expected:
                            https://cockroachdb.github.io/distsqlplan/decode.html#eJzclE2L2zAQhu_9FWJOu3TK-nMPhoIuDTgYp5jkVEwx1iQ1OJKR5NIQ_N-L7EOaNP1Y5-abpdEzr585zBmkEpRXRzKQfAEfEAJACAEhAoQYSoROq5qMUdo9mYBU_IDEQ2hk11t3XSLUShMkZ7CNbQkSSKUl3VL1nQqqBOm1aiTpFxchyFZNOyZmtLfgMppjpU9c9Zb01849KprDt99LtStNrQBhe-ooYatdlrHNbvupYOtNmgNCS3v7xP33zx-16zJ-upYkBemEpat8l2VP3EcePiPjATIeIeMxMv4K5YCgensxM7Y6ECT-gH-wv0j3UmlBmsSVZTncmU-uPqjuJb55eD86uIr2Zw0-WMjgg1n24ULsw1n20ULso1n23kLs_7FwCzKdkob-a6N4biWRONC0v4zqdU2ftarHmOm4GbnxQpCxU9WfDqkcS-MP_gr7f4Vfr2DvFg4eSQ4fgaNH4PhNcDm8-xkAAP__5oReVw==

                        but found (query options: "") :
                            https://cockroachdb.github.io/distsqlplan/decode.html#eJzclE2L2zAQhu_9FWJOu3TKWv7Yg6GgSxe8GKeYzamYYqzZ1OBIRpJLQ_B_L7IPadL0Y52bb7ZGz7x65jBHUFpSUe_JQvoFOCCEgBABQgwICVQIvdENWauNvzIDmfwBaYDQqn5w_rhCaLQhSI_gWtcRpJApR6aj-juVVEsyz7pVZB58W0mubrspMadXBz6j3dfmIPTgyHzt_TvKdvft91LjS3MrQHg59JSyp22es8325VPJnjdZAQgdvbo7wd_ffzS-y_TpW5KSZFKWPRXbPL8THEV0j0yEyESMTCTIxCNUI4Ie3MnMunpHkPIR_2B_kh6UNpIMyTPLarwyn0J_0P1DcnHxenR4Fs0XDZ6vZPDhIvtwJfbRIvtoJfbxIvtgJfb_WLgl2V4rS_-1UQK_kkjuaN5fVg-moc9GN1PM_LuZuOlAknVzlc8_mZpK0wN_hflf4cczOLiEw1uSo1vg-BY4eRNcje9-BgAA___t6V5X


                logic.go:2279:
                        testdata/planner_test/distsql_interleaved_join:388: error while processing
                logic.go:2280: testdata/planner_test/distsql_interleaved_join:388: too many errors encountered, skipping the rest of the input
FAIL


ERROR: exit status 1

595 runs completed, 1 failures, over 1m20s
context canceled


tbg commented Oct 8, 2018

This will be a nice test of the automated bisection script I have lying around.

Bisecting: 173 revisions left to test after this (roughly 8 steps)

@jordanlewis

Thanks for the investigation so far. I wonder if there's a race in the way the data moves between nodes during setup, or maybe the cache that is supposed to keep track of range locations is broken. @RaduBerinde or @andreimatei, do you have time to look into this? I recall some related flakiness in the range cache integration test, but I'm failing to turn it up.


tbg commented Oct 8, 2018

#13525 (comment)?

@jordanlewis

Yes, it could very well be related to that. What I don't understand is why your PR tickles it so easily.


tbg commented Oct 8, 2018

I'm bisecting on master right now. The failure pops up easily with make stress on gceworker.


tbg commented Oct 8, 2018

and the winner is:

4b28b0b storage: Avoid adding all replicas at once in RelocateRange

(I'm verifying, but seems plausible)


tbg commented Oct 8, 2018

Verified. I'm going to make sure that I don't merge my PR if it makes the race much more likely, but will otherwise leave the issue in your hands @jordanlewis if that's OK

@jordanlewis

Alright, so the tl;dr is that RELOCATE_RANGE isn't synchronous anymore, right?


tbg commented Oct 8, 2018

Hmm, no, that wasn't my takeaway. I think the impl has just changed to be smarter about getting the replica set where it needs to be with less spurious copies created. Perhaps @a-robinson has an idea what the difference might be.

Btw, this test is a lot more flaky on my branch. I guess that makes it my problem 🙈

@a-robinson

Ugh, I'm sorry for not having caught this one in the first place. 4b28b0b made a few other TestPlannerLogic tests fail as well due to different results compared to how RelocateRange worked before. It's a bit odd, though. For those failures, adding a "verification" query like 4b28b0b#diff-899ba38a1293a3c30487235783b4668eR247 was all that was needed to fix the failures, which doesn't make much sense on the face of it.

Also, I can't tell what the difference is in #31068 (comment). They look logically equivalent?


tbg commented Oct 8, 2018

Also, I can't tell what the difference is in #31068 (comment). They look logically equivalent?

I was also confused by that.

For those failures, adding a "verification" query like 4b28b0b#diff-899ba38a1293a3c30487235783b4668eR247 was all that was needed to fix the failures, which doesn't make much sense on the face of it.

Wait, how or rather why does that work?


tbg commented Oct 8, 2018

The difference is that node i gets assigned joiner i. In the failing test, the numbers are sometimes mixed up. These tests are brittle; I'm probably introducing just enough nondeterminism to sometimes generate an ever so slightly different (but basically equivalent) plan. Sigh.


tbg commented Oct 8, 2018

@a-robinson this is the segment that fails: https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/logictest/testdata/planner_test/distsql_interleaved_join#L366-L387

Is that an easy fix with some magic query? Looks like we already "verify placement" before.

@a-robinson

I've been trying to get a repro in order to test out changes like that, but (a) I haven't been able to reproduce the failure yet, and (b) we already verify the placement of outer_p1 right after the relocate range. I'm not sure how that query could pass while the data/leases still end up in the wrong places for the next query. Perhaps we should also verify outer_c1, but it isn't having RELOCATE run on it.

I haven't spent much time working with the planner tests, but it seems like you'd always want to verify the placements of all tables in order to rule out whether that's the cause of plan differences.

Wait, how or rather why does that work?

I don't know. I initially put them in just to better understand what was wrong about the data/lease placement when things went wrong, but after putting them in the tests stopped failing. The tests were deterministically failing back then, though, not flaking.

@jordanlewis

Also, I can't tell what the difference is in #31068 (comment). They look logically equivalent?

They're logically equivalent, but the test asserts that the plans are identical. This shouldn't be a problem as long as the placement is exactly what the tests request.

@a-robinson

I'm experimenting on top of #31013 and as far as I can tell from enabling verbose logging the replicas/leases are indeed being put in the right places and not moving after the SHOW EXPERIMENTAL_RANGES commands. If I change the expected SHOW EXPERIMENTAL_RANGES output at all, then it fails. It's very odd that the SHOW EXPERIMENTAL_RANGES succeeds but the following query doesn't use all nodes.

This makes me suspect that the issue here is either around bad caching behavior somewhere in the kv/distsql stack or some sort of bad behavior around new replicas not being initialized fully/properly and thus not able to respond to a real request by the time the failing query runs (or some combination of the two). The latter would especially make sense given the nature of #31013.

@a-robinson

Yeah, could someone familiar with the planner track down the info that's being used in constructing the failing plans? The ranges all appear to be in the right places. With #31013, the following command fails almost immediately:

make stress PKG=./pkg/sql/logictest TESTS=TestPlannerLogic/5node-dist/distsql_interleaved_join

@andreimatei

FWIW - As I think you've realized, what matters for the purposes of planning is not where ranges are, but where the range cache on the gateway thinks they are. #31013 does claim to affect that caching by removing some eviction in some case, if I understand correctly.


tbg commented Oct 9, 2018

The eviction isn't relevant (I disabled it and the failure remains). What matters is that we don't try the next replica on RangeNotFound. Not quite sure why, but I should be able to find out.


tbg commented Oct 9, 2018

A theory I have is that by trying the next replica, we discover the leader "organically" but don't populate the cache, where previously we would've tried that first replica for a while and eventually it'd give us a NotLeaseHolderError.


tbg commented Oct 9, 2018

Yeah, that seems to be it. I have a change that populates the leaseholder cache when a successful RPC comes back, and that fixes it. I'll see whether it's easy enough to get tests passing; it's definitely a change I've wanted to make for a while.


tbg commented Oct 9, 2018

Sigh, now another test fails under stress.

expected:

[image: expected plan diagram]

got:

[image: actual plan diagram]

--- FAIL: TestExecBuild (1.31s)
	test_log_scope.go:81: test logs captured to: /tmp/logTestExecBuild151687037
	test_log_scope.go:62: use -show-logs to present logs inline
    --- FAIL: TestExecBuild/5node-dist-opt (0.00s)
        --- FAIL: TestExecBuild/5node-dist-opt/lookup_join (1.30s)
        	logic.go:2247:

        		testdata/lookup_join:214: SELECT url FROM [EXPLAIN (DISTSQL) SELECT DISTINCT authors.name FROM books AS b1, books2 AS b2, authors WHERE b1.title = b2.title AND authors.book = b1.title AND b1.shelf <> b2.shelf]
        		expected:
        		    https://cockroachdb.github.io/distsqlplan/decode.html#eJzEk09r4zAQxe_7Kbxz2oAWW_4XMAR02MNmD8kSeis5KNY0EXUsI8nQEvLdi21oYjdW_lx61Hh-857eWAcolcAF36OB7BkoEEhgTaDSKkdjlG7KXdNcvEEWEJBlVdumvCaQK42QHcBKWyBksFC_VeWnQECg5bJo244EVG1PkLF8i5BNj-RsMHUPfuKbAlfIBWo_6I2HSss91-9so9SrCYHAsraZxyhhEYxp00e16WVtXttdk9WYXjiqd5KpS6UFahTD9K63XDD9l5vdPyVL1H7Y9_wZT0hYRFg8ajp6NKTIsaDb9hN_Z15x336BL_ZXYzeezLTc7roDnTQ3WXgs9X7OPHb678YvldyT5x9prCxz6yd9N4yOzk9786-81BWaSpUGb3qqQZMYii12GzCq1jn-1ypvZbrjsuXagkBju6_T7jAvu0-NwXOYOuGoB9MhHN4Bh0M4csKJWzm-A_6inDjh1B1Y6oSDAbw-_vgIAAD__0987gU=

        		but found (query options: "") :
        		    https://cockroachdb.github.io/distsqlplan/decode.html#eJzEk7FvqzAQxvf3V_BuepEsBQxkQIrk4Q1Nh6SKulUMDr4mVomNbCO1ivK_VwapKTTQkqWjz_7d9913cAKlBa75ES1kTxBBTqAyukBrtfGl9sFKvEIWEpCqqp0v5wQKbRCyEzjpSoQMHvmuxC1ygWYeAgGBjsuyaVsZeeTmje20frEUCGxqlwUsIiyG_ExA1-7S1zq-R8iiM7lNO7quzWt38CMN6dFBvYtMrbQRaFB0JHJPfvfkiuk7bg_3Wio0c9r1_BEPJSwmLBk0Hd8aUjyyoJ_tJ_nNvJKu_RKf3T9vN5ktjdwf2kM085OsA7YI_i4DdvnuhodKp-T5X1onVeHmadcNiwb7L6b036KttLLYD-9q59AnhmKP7Qasrk2BD0YXjUx73DRcUxBoXXtL28NKNVfNX_cZjibAtA_TUTgZV44nwF-Uk1E47cBhH05H4UUPzs9_3gMAAP__QUm74Q==


        	logic.go:2279:
        		testdata/lookup_join:219: error while processing
        	logic.go:2280: testdata/lookup_join:219: too many errors encountered, skipping the rest of the input
FAIL


tbg commented Oct 9, 2018

I added

# Verify data placement.
query TTITI colnames
SHOW EXPERIMENTAL_RANGES FROM TABLE books
----
start_key  end_key  range_id  replicas  lease_holder
NULL       NULL     10        {5}       5

and now this works. #magic

@a-robinson

Yeah, that seems to be it. I have a change that populates the leaseholder cache when a successful RPC comes back, and that fixes it. I'll see whether it's easy enough to get tests passing; it's definitely a change I've wanted to make for a while.

I've been wanting to do that for a while too. Nice work!

tbg added a commit to tbg/cockroach that referenced this issue Oct 9, 2018
Whenever a successful response is received from an RPC that we know has
to contact the leaseholder to succeed, update the leaseholder cache.

The immediate motivation for this is to be able to land the preceding
commits, which greatly exacerbated (as in, added a much faster failure
mode to)

```
make stress PKG=./pkg/sql/logictest TESTS=TestPlannerLogic/5node-dist/distsql_interleaved_join
```

However, the change is one we've wanted to make for a while; our caching
and in particular the eviction of leaseholders has been deficient
essentially ever since it was first introduced.

Touches cockroachdb#31068.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Oct 10, 2018 (same commit message as above)
tbg added a commit to tbg/cockroach that referenced this issue Oct 11, 2018 (same commit message as above)
tbg added a commit to tbg/cockroach that referenced this issue Oct 12, 2018 (same commit message as above)
@petermattis petermattis added the C-investigation Further steps needed to qualify. C-label will change. label Oct 14, 2018
@knz knz added this to Triage in (DEPRECATED) SQL execution via automation Oct 15, 2018
@jordanlewis

@tschottdorf going to close this as the immediate problem is fixed (right?)


tbg commented Oct 22, 2018 via email
