New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: move cmd/zerosum to a roachtest #31011

Merged
merged 1 commit into from Oct 11, 2018

Conversation

Projects
None yet
4 participants
@m-schneider
Contributor

m-schneider commented Oct 5, 2018

closes: 29486

Release note: None

@m-schneider m-schneider requested a review from petermattis Oct 5, 2018

@cockroach-teamcity

This comment has been minimized.

Show comment
Hide comment
@cockroach-teamcity

cockroach-teamcity Oct 5, 2018

Member

This change is Reviewable

Member

cockroach-teamcity commented Oct 5, 2018

This change is Reviewable

@m-schneider

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cmd/roachtest/acceptance.go, line 40 at r1 (raw file):

		{"bank/cluster-recovery", runBankClusterRecovery},
		{"bank/node-restart", runBankNodeRestart},
		{"bank/zerosum", runBankNodeZeroSum},

There's already a test with basic chaos that restarts node, do we need another test the combines the splits and lease transfers and the restarts?


pkg/cmd/roachtest/bank.go, line 293 at r1 (raw file):

			}
		case 1:
			zipF := partitionDistribution(r, partitions)

My biggest question on this test is whether or not partitioning and changing lease preferences is a reasonable alternative to the original lease transfers.

@petermattis

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cmd/roachtest/acceptance.go, line 40 at r1 (raw file):

Previously, m-schneider (Masha Schneider) wrote…

There's already a test with basic chaos that restarts node, do we need another test the combines the splits and lease transfers and the restarts?

Yes. Doing them all at the same time is what leads to interesting badness. In particular, splitting and lease transfers at the same time a node is being restarted is the interesting bit.


pkg/cmd/roachtest/bank.go, line 293 at r1 (raw file):

Previously, m-schneider (Masha Schneider) wrote…

My biggest question on this test is whether or not partitioning and changing lease preferences is a reasonable alternative to the original lease transfers.

It's interesting, but different. The downside to adjust lease holder preferences is how asynchronous that is. I think you can use ALTER TABLE ... EXPERIMENTAL_RELOCATE LEASE ... in order to maintain the original behavior.

@m-schneider

This comment has been minimized.

Show comment
Hide comment
@m-schneider

m-schneider Oct 8, 2018

Contributor

pkg/cmd/roachtest/bank.go, line 293 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

It's interesting, but different. The downside to adjust lease holder preferences is how asynchronous that is. I think you can use ALTER TABLE ... EXPERIMENTAL_RELOCATE LEASE ... in order to maintain the original behavior.

I added the ALTER TABLE ... EXPERIMENTAL_RELOCATE LEASE ... behavior, but I'm seeing a lot of errors of the form:

pq: change replicas of r24 failed: [n1,s1,r24/1:/Table/53/1/{258-408}]: received invalid ChangeReplicasTrigger REMOVE_REPLICA((n1,s1):1): updated=[(n2,s2):3 (n4,s4):2] next=4 to remove self (leaseholder)round 12: setting partition lease preferences for partition 582 to node 2

pq: change replicas of r29 failed: descriptor changed: [expected] r29:/Table/53/1/{265-319} [(n1,s1):1, (n4,s4):5, next=6, gen=0] != [actual] r29:/Table/53/1/{265-319} [(n1,s1):1, (n4,s4):5, (n2,s2):6, next=7, gen=0]: unexpected value: raw_bytes:"\023\240&K\003\010\035\022\005\275\211\367\001\t\032\005\275\211\367\001?\"\006\010\001\020\001\030\001\"\006\010\004\020\004\030\005\"\006\010\002\020\002\030\006(\007" timestamp:<wall_time:1539012778171889067 > round 46: setting partition lease preferences for partition 81 to node 1

Are these expected or a real failure?

Contributor

m-schneider commented Oct 8, 2018


pkg/cmd/roachtest/bank.go, line 293 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

It's interesting, but different. The downside to adjust lease holder preferences is how asynchronous that is. I think you can use ALTER TABLE ... EXPERIMENTAL_RELOCATE LEASE ... in order to maintain the original behavior.

I added the ALTER TABLE ... EXPERIMENTAL_RELOCATE LEASE ... behavior, but I'm seeing a lot of errors of the form:

pq: change replicas of r24 failed: [n1,s1,r24/1:/Table/53/1/{258-408}]: received invalid ChangeReplicasTrigger REMOVE_REPLICA((n1,s1):1): updated=[(n2,s2):3 (n4,s4):2] next=4 to remove self (leaseholder)round 12: setting partition lease preferences for partition 582 to node 2

pq: change replicas of r29 failed: descriptor changed: [expected] r29:/Table/53/1/{265-319} [(n1,s1):1, (n4,s4):5, next=6, gen=0] != [actual] r29:/Table/53/1/{265-319} [(n1,s1):1, (n4,s4):5, (n2,s2):6, next=7, gen=0]: unexpected value: raw_bytes:"\023\240&K\003\010\035\022\005\275\211\367\001\t\032\005\275\211\367\001?\"\006\010\001\020\001\030\001\"\006\010\004\020\004\030\005\"\006\010\002\020\002\030\006(\007" timestamp:<wall_time:1539012778171889067 > round 46: setting partition lease preferences for partition 81 to node 1

Are these expected or a real failure?

@petermattis

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cmd/roachtest/bank.go, line 293 at r1 (raw file):

Previously, m-schneider (Masha Schneider) wrote…

I added the ALTER TABLE ... EXPERIMENTAL_RELOCATE LEASE ... behavior, but I'm seeing a lot of errors of the form:

pq: change replicas of r24 failed: [n1,s1,r24/1:/Table/53/1/{258-408}]: received invalid ChangeReplicasTrigger REMOVE_REPLICA((n1,s1):1): updated=[(n2,s2):3 (n4,s4):2] next=4 to remove self (leaseholder)round 12: setting partition lease preferences for partition 582 to node 2

pq: change replicas of r29 failed: descriptor changed: [expected] r29:/Table/53/1/{265-319} [(n1,s1):1, (n4,s4):5, next=6, gen=0] != [actual] r29:/Table/53/1/{265-319} [(n1,s1):1, (n4,s4):5, (n2,s2):6, next=7, gen=0]: unexpected value: raw_bytes:"\023\240&K\003\010\035\022\005\275\211\367\001\t\032\005\275\211\367\001?\"\006\010\001\020\001\030\001\"\006\010\004\020\004\030\005\"\006\010\002\020\002\030\006(\007" timestamp:<wall_time:1539012778171889067 > round 46: setting partition lease preferences for partition 81 to node 1

Are these expected or a real failure?

The first error looks like a bug in what you did. The destination of the relocation has to be a node that has a replica of the range on it (and is not currently the leaseholder).

The second error is more curious. I think it might be expected if we're performing multiple operations concurrently. I'd sync with @nvanbenschoten to be sure about that.

@nvanbenschoten

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cmd/roachtest/bank.go, line 293 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

The first error looks like a bug in what you did. The destination of the relocation has to be a node that has a replica of the range on it (and is not currently the leaseholder).

The second error is more curious. I think it might be expected if we're performing multiple operations concurrently. I'd sync with @nvanbenschoten to be sure about that.

Yeah, the second error is expected if we're performing concurrent descriptor changes. We usually have these operations in retry loops for exactly this reason. We probably need a retry loop for the operation that ALTER TABLE ... EXPERIMENTAL_RELOCATE LEASE ... performs.

For instance, see

for retryable := retry.StartWithCtx(ctx, retryOpts); retryable.Next(); {

@m-schneider

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cmd/roachtest/acceptance.go, line 40 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Yes. Doing them all at the same time is what leads to interesting badness. In particular, splitting and lease transfers at the same time a node is being restarted is the interesting bit.

I added the combination of the splits, lease transfers and restarts.


pkg/cmd/roachtest/bank.go, line 293 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Yeah, the second error is expected if we're performing concurrent descriptor changes. We usually have these operations in retry loops for exactly this reason. We probably need a retry loop for the operation that ALTER TABLE ... EXPERIMENTAL_RELOCATE LEASE ... performs.

For instance, see

for retryable := retry.StartWithCtx(ctx, retryOpts); retryable.Next(); {

Thanks! I think since the chaos in this file is mostly done in a loop, the other tests ignore retryable errors, so I can add this particular one to the list vs adding retry logic?

Also after trying a couple of different ways to move the leaseholder, the cleanest turned out to relocating the range onto multiple stores. Other implementations like querying for the replicas corresponding to a primary key and then shifting the leaseholder were pretty fragile, because replicas moved between the query to check and the query to move. This way the shift is more atomic.

@petermattis

I need to take another pass at this, but it is looking close to done.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cmd/roachtest/bank.go, line 307 at r2 (raw file):

			})

			const zoneConfQuery = `ALTER TABLE bank.accounts EXPERIMENTAL_RELOCATE VALUES (ARRAY[%s], %d);`

s/zoneConfQuery/relocateQuery/g


pkg/cmd/roachtest/bank.go, line 308 at r2 (raw file):

			const zoneConfQuery = `ALTER TABLE bank.accounts EXPERIMENTAL_RELOCATE VALUES (ARRAY[%s], %d);`
			finalQ := fmt.Sprintf(zoneConfQuery, strings.Join(nodes[1:], ", "), key)

Nit: odd to name this finalQ and the template zoneConfQuery. Odd due to the inconsistent terminology final vs zoneConf and Q vs Query. I'd go with something like relocateQueryFormat and relocateQuery.


pkg/cmd/roachtest/bank.go, line 310 at r2 (raw file):

			finalQ := fmt.Sprintf(zoneConfQuery, strings.Join(nodes[1:], ", "), key)
			c.l.Printf("round %d: setting lease preferences for key %d to node %s\n",
				curRound, key, nodes[0])

This log message is no longer accurate (you're relocating ranges, not changing lease preferences) and it only logs nodes[0] while it should be logging all of the nodes.

@m-schneider

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cmd/roachtest/bank.go, line 307 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/zoneConfQuery/relocateQuery/g

Done.


pkg/cmd/roachtest/bank.go, line 308 at r2 (raw file):

relocateQueryFormat
Done


pkg/cmd/roachtest/bank.go, line 310 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This log message is no longer accurate (you're relocating ranges, not changing lease preferences) and it only logs nodes[0] while it should be logging all of the nodes.

nodes[0] is the leaseholder, but I'll change it to log all of them

@petermattis

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cmd/roachtest/bank.go, line 308 at r2 (raw file):

Previously, m-schneider (Masha Schneider) wrote…

relocateQueryFormat
Done

Heh, you have the variable names reversed. I was imagining const relocateQueryFormat because that is a format string used to construct the relocateQuery variable.


pkg/cmd/roachtest/bank.go, line 310 at r2 (raw file):

Previously, m-schneider (Masha Schneider) wrote…

nodes[0] is the leaseholder, but I'll change it to log all of them

Don't you want s/nodes[1:]/nodes/g.


pkg/cmd/roachtest/bank.go, line 488 at r3 (raw file):

	for i := 1; i <= c.nodes; i++ {
		c.Start(ctx, c.Node(i), startArgs(fmt.Sprintf("-a=--attrs=node%d", i)))

Is the --attrs flag still necessary now that you are not using leaseholder preferences?


pkg/cmd/roachtest/bank.go, line 530 at r3 (raw file):

	for i := 1; i <= c.nodes; i++ {
		c.Start(ctx, c.Node(i), startArgs(fmt.Sprintf("-a=--attrs=node%d", i)))

Ditto about the --attrs flag.

@m-schneider

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/cmd/roachtest/bank.go, line 308 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Heh, you have the variable names reversed. I was imagining const relocateQueryFormat because that is a format string used to construct the relocateQuery variable.

:) I did think about it.


pkg/cmd/roachtest/bank.go, line 310 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Don't you want s/nodes[1:]/nodes/g.

Done.


pkg/cmd/roachtest/bank.go, line 488 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Is the --attrs flag still necessary now that you are not using leaseholder preferences?

Yep, that's not necessary anymore in both places.


pkg/cmd/roachtest/bank.go, line 530 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Ditto about the --attrs flag.

Done.

@petermattis

:lgtm: modulo final comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/cmd/roachtest/bank.go, line 310 at r2 (raw file):

Previously, m-schneider (Masha Schneider) wrote…

Done.

Did you forget to push the update? (still says nodes[1:] in reviewable).

@m-schneider

This comment has been minimized.

Show comment
Hide comment
@m-schneider

m-schneider Oct 9, 2018

Contributor

pkg/cmd/roachtest/bank.go, line 310 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Did you forget to push the update? (still says nodes[1:] in reviewable).

It should, I got a bit confused when I first read your first comment. I used the shuffle function to randomize all of the nodes, but that still left me with 4 nodes, so I dropped the first one to move the range to 3 nodes vs 4.

Contributor

m-schneider commented Oct 9, 2018


pkg/cmd/roachtest/bank.go, line 310 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Did you forget to push the update? (still says nodes[1:] in reviewable).

It should, I got a bit confused when I first read your first comment. I used the shuffle function to randomize all of the nodes, but that still left me with 4 nodes, so I dropped the first one to move the range to 3 nodes vs 4.

@petermattis

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/cmd/roachtest/bank.go, line 310 at r2 (raw file):

Previously, m-schneider (Masha Schneider) wrote…

It should, I got a bit confused when I first read your first comment. I used the shuffle function to randomize all of the nodes, but that still left me with 4 nodes, so I dropped the first one to move the range to 3 nodes vs 4.

Ah, got it. Carry on.

@m-schneider m-schneider changed the title from [wip]roachtest: move cmd/zerosum to a roachtest to roachtest: move cmd/zerosum to a roachtest Oct 10, 2018

@m-schneider

This comment has been minimized.

Show comment
Hide comment
@m-schneider

m-schneider Oct 10, 2018

Contributor

bors r+

Contributor

m-schneider commented Oct 10, 2018

bors r+

craig bot pushed a commit that referenced this pull request Oct 10, 2018

Merge #31011
31011: roachtest: move cmd/zerosum to a roachtest r=m-schneider a=m-schneider

closes: 29486

Release note: None

Co-authored-by: Masha Schneider <masha@cockroachlabs.com>
@craig

This comment has been minimized.

Show comment
Hide comment
@craig

craig bot commented Oct 10, 2018

Build failed

@m-schneider

This comment has been minimized.

Show comment
Hide comment
@m-schneider

m-schneider Oct 10, 2018

Contributor

I'm stressing the two tests before merging, just in case I didn't see more rare errors while testing.

Contributor

m-schneider commented Oct 10, 2018

I'm stressing the two tests before merging, just in case I didn't see more rare errors while testing.

@petermattis

This comment has been minimized.

Show comment
Hide comment
@petermattis

petermattis Oct 11, 2018

Contributor

@m-schneider I think you need to rebase this onto current master. The breaker open errors should be fixed by #30987 which this PR doesn't seem to contain.

Contributor

petermattis commented Oct 11, 2018

@m-schneider I think you need to rebase this onto current master. The breaker open errors should be fixed by #30987 which this PR doesn't seem to contain.

@m-schneider

This comment has been minimized.

Show comment
Hide comment
@m-schneider

m-schneider Oct 11, 2018

Contributor

pkg/cmd/roachtest/acceptance.go, line 43 at r4 (raw file):

		//TODO(masha) This is flaking for same reasons as bank/node-restart
		// enable after #30064 is fixed.
		//{"bank/zerosum-restart", runBankZeroSumRestart},

I did a rebase and ran the test 100 times and there were still failures and hangs. After talking to @nvanbenschoten, I think it would make sense to hold off on merging the test with restarts until the bank/node-restart is fixed.

Contributor

m-schneider commented Oct 11, 2018


pkg/cmd/roachtest/acceptance.go, line 43 at r4 (raw file):

		//TODO(masha) This is flaking for same reasons as bank/node-restart
		// enable after #30064 is fixed.
		//{"bank/zerosum-restart", runBankZeroSumRestart},

I did a rebase and ran the test 100 times and there were still failures and hangs. After talking to @nvanbenschoten, I think it would make sense to hold off on merging the test with restarts until the bank/node-restart is fixed.

@petermattis

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/cmd/roachtest/acceptance.go, line 43 at r4 (raw file):

Previously, m-schneider (Masha Schneider) wrote…

I did a rebase and ran the test 100 times and there were still failures and hangs. After talking to @nvanbenschoten, I think it would make sense to hold off on merging the test with restarts until the bank/node-restart is fixed.

Ack. Just to be clear: all of the failures were stalls? I'm fine checking this in disabled.

@m-schneider

This comment has been minimized.

Show comment
Hide comment
@m-schneider

m-schneider Oct 11, 2018

Contributor

pkg/cmd/roachtest/acceptance.go, line 43 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Ack. Just to be clear: all of the failures were stalls? I'm fine checking this in disabled.

Yes it stalled about 10% of the time.

Contributor

m-schneider commented Oct 11, 2018


pkg/cmd/roachtest/acceptance.go, line 43 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Ack. Just to be clear: all of the failures were stalls? I'm fine checking this in disabled.

Yes it stalled about 10% of the time.

roachtest: move cmd/zerosum to a roachtest
see: 29486

Release note: None
@m-schneider

This comment has been minimized.

Show comment
Hide comment
@m-schneider

m-schneider Oct 11, 2018

Contributor

bors r+

Contributor

m-schneider commented Oct 11, 2018

bors r+

craig bot pushed a commit that referenced this pull request Oct 11, 2018

Merge #31011
31011: roachtest: move cmd/zerosum to a roachtest r=m-schneider a=m-schneider

closes: 29486

Release note: None

Co-authored-by: Masha Schneider <masha@cockroachlabs.com>
@craig

This comment has been minimized.

Show comment
Hide comment
@craig

craig bot commented Oct 11, 2018

Build succeeded

@craig craig bot merged commit 14241fa into cockroachdb:master Oct 11, 2018

3 checks passed

GitHub CI (Cockroach) TeamCity build finished
Details
bors Build succeeded
Details
license/cla Contributor License Agreement is signed.
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment