Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.
Sign uproachtest: move cmd/zerosum to a roachtest #31011
Conversation
m-schneider
requested a review
from
petermattis
Oct 5, 2018
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
m-schneider
reviewed
Oct 5, 2018
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/cmd/roachtest/acceptance.go, line 40 at r1 (raw file):
{"bank/cluster-recovery", runBankClusterRecovery}, {"bank/node-restart", runBankNodeRestart}, {"bank/zerosum", runBankNodeZeroSum},
There's already a test with basic chaos that restarts node, do we need another test the combines the splits and lease transfers and the restarts?
pkg/cmd/roachtest/bank.go, line 293 at r1 (raw file):
} case 1: zipF := partitionDistribution(r, partitions)
My biggest question on this test is whether or not partitioning and changing lease preferences is a reasonable alternative to the original lease transfers.
petermattis
reviewed
Oct 5, 2018
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/cmd/roachtest/acceptance.go, line 40 at r1 (raw file):
Previously, m-schneider (Masha Schneider) wrote…
There's already a test with basic chaos that restarts node, do we need another test the combines the splits and lease transfers and the restarts?
Yes. Doing them all at the same time is what leads to interesting badness. In particular, splitting and lease transfers at the same time a node is being restarted is the interesting bit.
pkg/cmd/roachtest/bank.go, line 293 at r1 (raw file):
Previously, m-schneider (Masha Schneider) wrote…
My biggest question on this test is whether or not partitioning and changing lease preferences is a reasonable alternative to the original lease transfers.
It's interesting, but different. The downside to adjust lease holder preferences is how asynchronous that is. I think you can use ALTER TABLE ... EXPERIMENTAL_RELOCATE LEASE ... in order to maintain the original behavior.
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
m-schneider
Oct 8, 2018
Contributor
pkg/cmd/roachtest/bank.go, line 293 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
It's interesting, but different. The downside to adjust lease holder preferences is how asynchronous that is. I think you can use
ALTER TABLE ... EXPERIMENTAL_RELOCATE LEASE ...in order to maintain the original behavior.
I added the ALTER TABLE ... EXPERIMENTAL_RELOCATE LEASE ... behavior, but I'm seeing a lot of errors of the form:
pq: change replicas of r24 failed: [n1,s1,r24/1:/Table/53/1/{258-408}]: received invalid ChangeReplicasTrigger REMOVE_REPLICA((n1,s1):1): updated=[(n2,s2):3 (n4,s4):2] next=4 to remove self (leaseholder)round 12: setting partition lease preferences for partition 582 to node 2
pq: change replicas of r29 failed: descriptor changed: [expected] r29:/Table/53/1/{265-319} [(n1,s1):1, (n4,s4):5, next=6, gen=0] != [actual] r29:/Table/53/1/{265-319} [(n1,s1):1, (n4,s4):5, (n2,s2):6, next=7, gen=0]: unexpected value: raw_bytes:"\023\240&K\003\010\035\022\005\275\211\367\001\t\032\005\275\211\367\001?\"\006\010\001\020\001\030\001\"\006\010\004\020\004\030\005\"\006\010\002\020\002\030\006(\007" timestamp:<wall_time:1539012778171889067 > round 46: setting partition lease preferences for partition 81 to node 1
Are these expected or a real failure?
|
pkg/cmd/roachtest/bank.go, line 293 at r1 (raw file): Previously, petermattis (Peter Mattis) wrote…
I added the ALTER TABLE ... EXPERIMENTAL_RELOCATE LEASE ... behavior, but I'm seeing a lot of errors of the form:
Are these expected or a real failure? |
petermattis
reviewed
Oct 8, 2018
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/cmd/roachtest/bank.go, line 293 at r1 (raw file):
Previously, m-schneider (Masha Schneider) wrote…
I added the ALTER TABLE ... EXPERIMENTAL_RELOCATE LEASE ... behavior, but I'm seeing a lot of errors of the form:
pq: change replicas of r24 failed: [n1,s1,r24/1:/Table/53/1/{258-408}]: received invalid ChangeReplicasTrigger REMOVE_REPLICA((n1,s1):1): updated=[(n2,s2):3 (n4,s4):2] next=4 to remove self (leaseholder)round 12: setting partition lease preferences for partition 582 to node 2 pq: change replicas of r29 failed: descriptor changed: [expected] r29:/Table/53/1/{265-319} [(n1,s1):1, (n4,s4):5, next=6, gen=0] != [actual] r29:/Table/53/1/{265-319} [(n1,s1):1, (n4,s4):5, (n2,s2):6, next=7, gen=0]: unexpected value: raw_bytes:"\023\240&K\003\010\035\022\005\275\211\367\001\t\032\005\275\211\367\001?\"\006\010\001\020\001\030\001\"\006\010\004\020\004\030\005\"\006\010\002\020\002\030\006(\007" timestamp:<wall_time:1539012778171889067 > round 46: setting partition lease preferences for partition 81 to node 1Are these expected or a real failure?
The first error looks like a bug in what you did. The destination of the relocation has to be a node that has a replica of the range on it (and is not currently the leaseholder).
The second error is more curious. I think it might be expected if we're performing multiple operations concurrently. I'd sync with @nvanbenschoten to be sure about that.
nvanbenschoten
reviewed
Oct 8, 2018
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/cmd/roachtest/bank.go, line 293 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
The first error looks like a bug in what you did. The destination of the relocation has to be a node that has a replica of the range on it (and is not currently the leaseholder).
The second error is more curious. I think it might be expected if we're performing multiple operations concurrently. I'd sync with @nvanbenschoten to be sure about that.
Yeah, the second error is expected if we're performing concurrent descriptor changes. We usually have these operations in retry loops for exactly this reason. We probably need a retry loop for the operation that ALTER TABLE ... EXPERIMENTAL_RELOCATE LEASE ... performs.
For instance, see
cockroach/pkg/storage/replica_command.go
Line 152 in f4e4789
m-schneider
reviewed
Oct 8, 2018
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/cmd/roachtest/acceptance.go, line 40 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Yes. Doing them all at the same time is what leads to interesting badness. In particular, splitting and lease transfers at the same time a node is being restarted is the interesting bit.
I added the combination of the splits, lease transfers and restarts.
pkg/cmd/roachtest/bank.go, line 293 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Yeah, the second error is expected if we're performing concurrent descriptor changes. We usually have these operations in retry loops for exactly this reason. We probably need a retry loop for the operation that
ALTER TABLE ... EXPERIMENTAL_RELOCATE LEASE ...performs.For instance, see
cockroach/pkg/storage/replica_command.go
Line 152 in f4e4789
Thanks! I think since the chaos in this file is mostly done in a loop, the other tests ignore retryable errors, so I can add this particular one to the list vs adding retry logic?
Also after trying a couple of different ways to move the leaseholder, the cleanest turned out to relocating the range onto multiple stores. Other implementations like querying for the replicas corresponding to a primary key and then shifting the leaseholder were pretty fragile, because replicas moved between the query to check and the query to move. This way the shift is more atomic.
petermattis
requested changes
Oct 9, 2018
I need to take another pass at this, but it is looking close to done.
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/cmd/roachtest/bank.go, line 307 at r2 (raw file):
}) const zoneConfQuery = `ALTER TABLE bank.accounts EXPERIMENTAL_RELOCATE VALUES (ARRAY[%s], %d);`
s/zoneConfQuery/relocateQuery/g
pkg/cmd/roachtest/bank.go, line 308 at r2 (raw file):
const zoneConfQuery = `ALTER TABLE bank.accounts EXPERIMENTAL_RELOCATE VALUES (ARRAY[%s], %d);` finalQ := fmt.Sprintf(zoneConfQuery, strings.Join(nodes[1:], ", "), key)
Nit: odd to name this finalQ and the template zoneConfQuery. Odd due to the inconsistent terminology final vs zoneConf and Q vs Query. I'd go with something like relocateQueryFormat and relocateQuery.
pkg/cmd/roachtest/bank.go, line 310 at r2 (raw file):
finalQ := fmt.Sprintf(zoneConfQuery, strings.Join(nodes[1:], ", "), key) c.l.Printf("round %d: setting lease preferences for key %d to node %s\n", curRound, key, nodes[0])
This log message is no longer accurate (you're relocating ranges, not changing lease preferences) and it only logs nodes[0] while it should be logging all of the nodes.
m-schneider
reviewed
Oct 9, 2018
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/cmd/roachtest/bank.go, line 307 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
s/zoneConfQuery/relocateQuery/g
Done.
pkg/cmd/roachtest/bank.go, line 308 at r2 (raw file):
relocateQueryFormat
Done
pkg/cmd/roachtest/bank.go, line 310 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
This log message is no longer accurate (you're relocating ranges, not changing lease preferences) and it only logs
nodes[0]while it should be logging all of the nodes.
nodes[0] is the leaseholder, but I'll change it to log all of them
petermattis
reviewed
Oct 9, 2018
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/cmd/roachtest/bank.go, line 308 at r2 (raw file):
Previously, m-schneider (Masha Schneider) wrote…
relocateQueryFormat
Done
Heh, you have the variable names reversed. I was imagining const relocateQueryFormat because that is a format string used to construct the relocateQuery variable.
pkg/cmd/roachtest/bank.go, line 310 at r2 (raw file):
Previously, m-schneider (Masha Schneider) wrote…
nodes[0] is the leaseholder, but I'll change it to log all of them
Don't you want s/nodes[1:]/nodes/g.
pkg/cmd/roachtest/bank.go, line 488 at r3 (raw file):
for i := 1; i <= c.nodes; i++ { c.Start(ctx, c.Node(i), startArgs(fmt.Sprintf("-a=--attrs=node%d", i)))
Is the --attrs flag still necessary now that you are not using leaseholder preferences?
pkg/cmd/roachtest/bank.go, line 530 at r3 (raw file):
for i := 1; i <= c.nodes; i++ { c.Start(ctx, c.Node(i), startArgs(fmt.Sprintf("-a=--attrs=node%d", i)))
Ditto about the --attrs flag.
m-schneider
reviewed
Oct 9, 2018
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/cmd/roachtest/bank.go, line 308 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Heh, you have the variable names reversed. I was imagining
const relocateQueryFormatbecause that is a format string used to construct therelocateQueryvariable.
:) I did think about it.
pkg/cmd/roachtest/bank.go, line 310 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Don't you want
s/nodes[1:]/nodes/g.
Done.
pkg/cmd/roachtest/bank.go, line 488 at r3 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Is the
--attrsflag still necessary now that you are not using leaseholder preferences?
Yep, that's not necessary anymore in both places.
pkg/cmd/roachtest/bank.go, line 530 at r3 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Ditto about the
--attrsflag.
Done.
petermattis
approved these changes
Oct 9, 2018
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/cmd/roachtest/bank.go, line 310 at r2 (raw file):
Previously, m-schneider (Masha Schneider) wrote…
Done.
Did you forget to push the update? (still says nodes[1:] in reviewable).
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
m-schneider
Oct 9, 2018
Contributor
pkg/cmd/roachtest/bank.go, line 310 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Did you forget to push the update? (still says
nodes[1:]in reviewable).
It should, I got a bit confused when I first read your first comment. I used the shuffle function to randomize all of the nodes, but that still left me with 4 nodes, so I dropped the first one to move the range to 3 nodes vs 4.
|
pkg/cmd/roachtest/bank.go, line 310 at r2 (raw file): Previously, petermattis (Peter Mattis) wrote…
It should, I got a bit confused when I first read your first comment. I used the shuffle function to randomize all of the nodes, but that still left me with 4 nodes, so I dropped the first one to move the range to 3 nodes vs 4. |
petermattis
approved these changes
Oct 9, 2018
Reviewable status:
complete! 1 of 0 LGTMs obtained
pkg/cmd/roachtest/bank.go, line 310 at r2 (raw file):
Previously, m-schneider (Masha Schneider) wrote…
It should, I got a bit confused when I first read your first comment. I used the shuffle function to randomize all of the nodes, but that still left me with 4 nodes, so I dropped the first one to move the range to 3 nodes vs 4.
Ah, got it. Carry on.
m-schneider
changed the title from
[wip]roachtest: move cmd/zerosum to a roachtest
to
roachtest: move cmd/zerosum to a roachtest
Oct 10, 2018
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
|
bors r+ |
bot
pushed a commit
that referenced
this pull request
Oct 10, 2018
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
craig
bot
commented
Oct 10, 2018
Build failed |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
m-schneider
Oct 10, 2018
Contributor
I'm stressing the two tests before merging, just in case I didn't see more rare errors while testing.
|
I'm stressing the two tests before merging, just in case I didn't see more rare errors while testing. |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
petermattis
Oct 11, 2018
Contributor
@m-schneider I think you need to rebase this onto current master. The breaker open errors should be fixed by #30987 which this PR doesn't seem to contain.
|
@m-schneider I think you need to rebase this onto current master. The |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
m-schneider
Oct 11, 2018
Contributor
pkg/cmd/roachtest/acceptance.go, line 43 at r4 (raw file):
//TODO(masha) This is flaking for same reasons as bank/node-restart // enable after #30064 is fixed. //{"bank/zerosum-restart", runBankZeroSumRestart},
I did a rebase and ran the test 100 times and there were still failures and hangs. After talking to @nvanbenschoten, I think it would make sense to hold off on merging the test with restarts until the bank/node-restart is fixed.
|
pkg/cmd/roachtest/acceptance.go, line 43 at r4 (raw file):
I did a rebase and ran the test 100 times and there were still failures and hangs. After talking to @nvanbenschoten, I think it would make sense to hold off on merging the test with restarts until the bank/node-restart is fixed. |
petermattis
approved these changes
Oct 11, 2018
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/cmd/roachtest/acceptance.go, line 43 at r4 (raw file):
Previously, m-schneider (Masha Schneider) wrote…
I did a rebase and ran the test 100 times and there were still failures and hangs. After talking to @nvanbenschoten, I think it would make sense to hold off on merging the test with restarts until the bank/node-restart is fixed.
Ack. Just to be clear: all of the failures were stalls? I'm fine checking this in disabled.
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
m-schneider
Oct 11, 2018
Contributor
pkg/cmd/roachtest/acceptance.go, line 43 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Ack. Just to be clear: all of the failures were stalls? I'm fine checking this in disabled.
Yes it stalled about 10% of the time.
|
pkg/cmd/roachtest/acceptance.go, line 43 at r4 (raw file): Previously, petermattis (Peter Mattis) wrote…
Yes it stalled about 10% of the time. |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
|
bors r+ |
bot
pushed a commit
that referenced
this pull request
Oct 11, 2018
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
craig
bot
commented
Oct 11, 2018
Build succeeded |
m-schneider commentedOct 5, 2018
closes: 29486
Release note: None