Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: TestReplicateRemovedNodeDisruptiveElection failed under stress #16119

Closed
cockroach-teamcity opened this issue May 25, 2017 · 3 comments
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.

Comments

@cockroach-teamcity
Copy link
Member

SHA: https://github.com/cockroachdb/cockroach/commits/11987dc7e2f20f6417925c638f45c2067435b17f

Parameters:

TAGS=
GOFLAGS=

Stress build found a failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=256618&tab=buildLog

I170525 06:32:14.870491 47202 storage/store.go:1257  [n1,s1]: failed initial metrics computation: [n1,s1]: system config not yet available
I170525 06:32:14.870530 47202 gossip/gossip.go:297  [n1] NodeDescriptor set to node_id:1 address:<network_field:"tcp" address_field:"127.0.0.1:34295" > attrs:<> locality:<> 
W170525 06:32:14.873392 47202 gossip/gossip.go:1196  [n2] no incoming or outgoing connections
I170525 06:32:14.873471 47202 storage/store.go:1257  [n2,s2]: failed initial metrics computation: [n2,s2]: system config not yet available
I170525 06:32:14.873503 47202 gossip/gossip.go:297  [n2] NodeDescriptor set to node_id:2 address:<network_field:"tcp" address_field:"127.0.0.1:49715" > attrs:<> locality:<> 
I170525 06:32:14.873958 47322 gossip/client.go:131  [n2] started gossip client to 127.0.0.1:34295
W170525 06:32:14.893826 47202 gossip/gossip.go:1196  [n3] no incoming or outgoing connections
I170525 06:32:14.893929 47202 storage/store.go:1257  [n3,s3]: failed initial metrics computation: [n3,s3]: system config not yet available
I170525 06:32:14.893958 47202 gossip/gossip.go:297  [n3] NodeDescriptor set to node_id:3 address:<network_field:"tcp" address_field:"127.0.0.1:57170" > attrs:<> locality:<> 
I170525 06:32:14.894401 47411 gossip/client.go:131  [n3] started gossip client to 127.0.0.1:34295
W170525 06:32:14.917658 47202 gossip/gossip.go:1196  [n4] no incoming or outgoing connections
I170525 06:32:14.917865 47202 storage/store.go:1257  [n4,s4]: failed initial metrics computation: [n4,s4]: system config not yet available
I170525 06:32:14.917911 47202 gossip/gossip.go:297  [n4] NodeDescriptor set to node_id:4 address:<network_field:"tcp" address_field:"127.0.0.1:41767" > attrs:<> locality:<> 
I170525 06:32:14.919115 47534 gossip/client.go:131  [n4] started gossip client to 127.0.0.1:34295
I170525 06:32:14.939286 47546 storage/replica_raftstorage.go:416  [s1,r1/1:/M{in-ax}] generated preemptive snapshot 3d4eba22 at index 16
I170525 06:32:14.939693 47546 storage/store.go:3353  [s1,r1/1:/M{in-ax}] streamed snapshot to (n2,s2):?: kv pairs: 34, log entries: 6, rate-limit: 2.0 MiB/sec, 0ms
I170525 06:32:14.939873 47563 storage/replica_raftstorage.go:597  [s2,r1/?:{-}] applying preemptive snapshot at index 16 (id=3d4eba22, encoded size=6059, 1 rocksdb batches, 6 log entries)
I170525 06:32:14.940178 47563 storage/replica_raftstorage.go:605  [s2,r1/?:/M{in-ax}] applied preemptive snapshot in 0ms [clear=0ms batch=0ms entries=0ms commit=0ms]
I170525 06:32:14.940474 47546 storage/replica_command.go:3538  [s1,r1/1:/M{in-ax}] change replicas (ADD_REPLICA (n2,s2):2): read existing descriptor r1:/M{in-ax} [(n1,s1):1, next=2]
I170525 06:32:14.941291 47697 storage/replica.go:2617  [s1,r1/1:/M{in-ax}] proposing ADD_REPLICA (n2,s2):2: [(n1,s1):1 (n2,s2):2]
I170525 06:32:14.941952 47551 storage/replica_raftstorage.go:416  [s1,r1/1:/M{in-ax}] generated preemptive snapshot f3ffae7e at index 18
I170525 06:32:14.942635 47551 storage/store.go:3353  [s1,r1/1:/M{in-ax}] streamed snapshot to (n3,s3):?: kv pairs: 37, log entries: 8, rate-limit: 2.0 MiB/sec, 1ms
I170525 06:32:14.942805 47701 storage/replica_raftstorage.go:597  [s3,r1/?:{-}] applying preemptive snapshot at index 18 (id=f3ffae7e, encoded size=7619, 1 rocksdb batches, 8 log entries)
I170525 06:32:14.943141 47701 storage/replica_raftstorage.go:605  [s3,r1/?:/M{in-ax}] applied preemptive snapshot in 0ms [clear=0ms batch=0ms entries=0ms commit=0ms]
I170525 06:32:14.943845 47551 storage/replica_command.go:3538  [s1,r1/1:/M{in-ax}] change replicas (ADD_REPLICA (n3,s3):3): read existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, next=3]
I170525 06:32:14.946213 47730 storage/raft_transport.go:436  raft transport stream to node 1 established
I170525 06:32:14.948634 47723 storage/replica.go:2617  [s1,r1/1:/M{in-ax}] proposing ADD_REPLICA (n3,s3):3: [(n1,s1):1 (n2,s2):2 (n3,s3):3]
I170525 06:32:14.950860 46953 storage/replica_raftstorage.go:416  [s1,r1/1:/M{in-ax}] generated preemptive snapshot 1102424e at index 20
I170525 06:32:14.951302 46953 storage/store.go:3353  [s1,r1/1:/M{in-ax}] streamed snapshot to (n4,s4):?: kv pairs: 40, log entries: 10, rate-limit: 2.0 MiB/sec, 0ms
I170525 06:32:14.951462 47749 storage/replica_raftstorage.go:597  [s4,r1/?:{-}] applying preemptive snapshot at index 20 (id=1102424e, encoded size=9243, 1 rocksdb batches, 10 log entries)
I170525 06:32:14.951872 47749 storage/replica_raftstorage.go:605  [s4,r1/?:/M{in-ax}] applied preemptive snapshot in 0ms [clear=0ms batch=0ms entries=0ms commit=0ms]
I170525 06:32:14.952180 46953 storage/replica_command.go:3538  [s1,r1/1:/M{in-ax}] change replicas (ADD_REPLICA (n4,s4):4): read existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4]
I170525 06:32:14.953434 47739 storage/replica.go:2617  [s1,r1/1:/M{in-ax}] proposing ADD_REPLICA (n4,s4):4: [(n1,s1):1 (n2,s2):2 (n3,s3):3 (n4,s4):4]
I170525 06:32:15.095427 47745 storage/replica_command.go:3538  [s2,r1/2:/M{in-ax}] change replicas (REMOVE_REPLICA (n1,s1):1): read existing descriptor r1:/M{in-ax} [(n1,s1):1, (n2,s2):2, (n3,s3):3, (n4,s4):4, next=5]
I170525 06:32:15.097110 47815 storage/replica.go:2617  [s2,r1/2:/M{in-ax}] proposing REMOVE_REPLICA (n1,s1):1: [(n4,s4):4 (n2,s2):2 (n3,s3):3]
I170525 06:32:15.171221 47719 storage/store.go:3160  [s1,r1/1:/M{in-ax}] added to replica GC queue (peer suggestion)
I170525 06:32:15.171486 47201 storage/store.go:2147  [replicaGC,s1,r1/1:/M{in-ax}] removing replica
I170525 06:32:15.171629 47201 storage/replica.go:684  [replicaGC,s1,r1/1:/M{in-ax}] removed 34 (26+8) keys in 0ms [clear=0ms commit=0ms]
W170525 06:32:15.598731 47788 storage/replica.go:2308  [s2,r1/2:/M{in-ax}] context cancellation after 0.5s of attempting command Increment ["a",/Min)
I170525 06:32:40.378150 47586 gossip/gossip.go:1210  [n4] node has connected to cluster via gossip
I170525 06:32:41.135798 47217 gossip/gossip.go:1210  [n1] node has connected to cluster via gossip
I170525 06:32:42.084931 47332 gossip/gossip.go:1210  [n2] node has connected to cluster via gossip
I170525 06:32:47.115588 47453 gossip/gossip.go:1210  [n3] node has connected to cluster via gossip
I170525 06:33:00.779201 47576 vendor/google.golang.org/grpc/transport/http2_server.go:392  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:41767->127.0.0.1:33693: use of closed network connection
I170525 06:33:00.779257 47528 vendor/google.golang.org/grpc/transport/http2_server.go:392  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:57170->127.0.0.1:60181: read: connection reset by peer
I170525 06:33:00.779296 47407 vendor/google.golang.org/grpc/transport/http2_server.go:392  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:49715->127.0.0.1:57530: use of closed network connection
I170525 06:33:00.779313 47195 vendor/google.golang.org/grpc/transport/http2_server.go:392  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:34295->127.0.0.1:33941: use of closed network connection
W170525 06:33:00.779356 47332 gossip/gossip.go:1196  [n2] no incoming or outgoing connections
	soon.go:43: storage/client_raft_test.go:2494, condition failed to evaluate within 45s: expected [0 5 5 5], got [0 10 10 10]
@cockroach-teamcity cockroach-teamcity added O-robot Originated from a bot. C-test-failure Broken test (automatically or manually discovered). labels May 25, 2017
@tamird
Copy link
Contributor

tamird commented May 25, 2017

This one is mine. 05a3745

@tamird tamird self-assigned this May 25, 2017
@cockroach-teamcity
Copy link
Member Author

@tamird
Copy link
Contributor

tamird commented May 26, 2017

Well this seems bad!

Here's what I think happens:

  • client sends an increment
  • dist sender sends the increment to the first replica
  • SendNextTimeout fires, and dist sender sends the increment to the next replica
  • the second increment returns RangeNotFoundError, causing sendToReplicas to return eagerly, cancelling the context that was used for both RPCs
  • first replica receives context cancellation, but the increment is already proposed, so an ambiguous result is returned, but the dist sender is no longer paying attention, and this ambiguous error is lost (!)
  • dist sender retries at a higher level (sendPartialBatch), and this time the command goes through, and now we have a double-applied increment.

cc @petermattis because this is probably the same as #15687. cc @spencerkimball @bdarnell because I think we should remove SendNextTimeout.

In the meantime, I'm going to revert the early bail introduced in #16088.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Projects
None yet
Development

No branches or pull requests

2 participants