
Kubernetes killed an alpha pod in the middle of retrieving snapshot #2698

Closed
manishrjain opened this issue Oct 28, 2018 · 2 comments

@manishrjain (Member) commented Oct 28, 2018

If you suspect this could be a bug, follow the template.

  • What version of Dgraph are you using?
    v1.0.10-rc1

  • Have you tried reproducing the issue with latest release?
    Yes.

  • What is the hardware spec (RAM, OS)?
    K8s cluster, with 8GB requested.

  • Steps to reproduce the issue (command/config used to run Dgraph).

I1027 07:43:20.172728       1 node.go:83] 3 [term: 88] received a MsgHeartbeat message with higher term from 1 [term: 89]
I1027 07:43:20.173126       1 node.go:83] 3 became follower at term 89
I1027 07:43:20.173157       1 node.go:83] raft.node: 3 changed leader from 3 to 1 at term 89
E1027 07:43:20.180570       1 groups.go:821] While proposing delta: max_assigned:1105827 . Error=While proposing error: context deadline exceeded. Retrying...
I1027 07:43:21.154696       1 node.go:83] 3 [commit: 1081364, lastindex: 1081364, lastterm: 88] starts to restore snapshot [index: 1081489, term: 89]
I1027 07:43:21.154710       1 node.go:83] log [committed=1081364, applied=1081364, unstable.offset=1081365, len(unstable.Entries)=0] starts to restore snapshot [index: 1081489, term: 89]
I1027 07:43:21.154730       1 node.go:83] 3 restored progress of 1 [next = 1081490, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
I1027 07:43:21.154742       1 node.go:83] 3 restored progress of 3 [next = 1081490, match = 1081489, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
I1027 07:43:21.154748       1 node.go:83] 3 restored progress of 4 [next = 1081490, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
I1027 07:43:21.154754       1 node.go:83] 3 [commit: 1081489] restored snapshot [index: 1081489, term: 89]
I1027 07:43:21.156350       1 draft.go:668] Waiting for applyCh to reach 1081364 before taking snapshot
I1027 07:43:21.156364       1 draft.go:672] -------> SNAPSHOT [1] from 1
I1027 07:43:23.220590       1 predicate.go:76] Received a batch of 21865 keys. Total so far: 0
I1027 07:43:23.386962       1 predicate.go:76] Received a batch of 20953 keys. Total so far: 21865
I1027 07:43:24.623417       1 predicate.go:76] Received a batch of 488985 keys. Total so far: 42818
I1027 07:43:27.868938       1 predicate.go:76] Received a batch of 580929 keys. Total so far: 531803
I1027 07:43:27.893203       1 run.go:353] Caught Ctrl-C. Terminating now (this may take a few seconds)...
E1027 07:43:27.893286       1 run.go:205] GRPC listener canceled: accept tcp [::]:9080: use of closed network connection
E1027 07:43:27.894173       1 run.go:224] Stopped taking more http(s) requests. Err: accept tcp [::]:8080: use of closed network connection
I1027 07:43:27.895211       1 run.go:366] GRPC and HTTP stopped.
I1027 07:43:27.895227       1 worker.go:109] Stopping group...
E1027 07:43:27.895564       1 groups.go:534] Unable to sync memberships. Error: EOF
I1027 07:43:31.308171       1 predicate.go:76] Received a batch of 86847 keys. Total so far: 1112732
rpc error: code = Unknown desc = Error: No such container: f42295465b4457261c6061fdd3f12431e667b1fa11d8ca140a874140237cc67b
Unable to retrieve container logs for docker://afe50afa8a791b46e767e89fb4691917bc872f733f58b62cf35eccb51bd3f341
Error from server (BadRequest): container "server" in pod "dgraph-server-2" is waiting to start: ContainerCreating

Server 3 lost its leadership, was then considered to be behind the group, and was in the middle of receiving a snapshot. That is when k8s killed it, packed up its contents, and shipped them to another container.
Note that before receiving a snapshot, the server drops everything it has so far (DropAll); see the sketch below. Server 3 was given no chance to shut down gracefully, so it lost a large number of keys during the transfer and fell out of sync with the rest of the replicas.

  • Expected behaviour and actual result.

It is not clear what we could have done at the Dgraph level to avoid this, given our lack of control over a forceful shutdown and a pod move. So what we want to determine here is whether Kubernetes is a good choice for running a Dgraph cluster. Can we ensure that k8s allows the pods to shut down gracefully? Or should we recommend that users not run Dgraph on k8s?
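
For context on why the kill was so damaging, here is a rough follower-side sketch of the retrieval flow visible in the predicate.go log lines above. The types and helper names are hypothetical, not Dgraph's actual code; the one load-bearing detail from the report is that the store is wiped before any batches arrive, so a kill mid-stream loses nearly everything.

```go
package snapshot

import (
	"context"
	"io"
	"log"
)

// store and keyBatchStream are hypothetical stand-ins for the embedded
// key-value store and the stream used during snapshot retrieval.
type store interface {
	DropAll() error
	WriteBatch(keys [][]byte) error
}

type keyBatch struct {
	Keys [][]byte
}

type keyBatchStream interface {
	Recv() (*keyBatch, error)
}

// retrieveSnapshot wipes local state first, then applies key batches as they
// arrive. A SIGKILL anywhere between DropAll and the final batch leaves the
// node with far fewer keys than it started with, which is what happened to
// server 3 above.
func retrieveSnapshot(ctx context.Context, db store, stream keyBatchStream) error {
	if err := db.DropAll(); err != nil {
		return err
	}
	total := 0
	for {
		batch, err := stream.Recv()
		if err == io.EOF {
			return nil // leader finished sending the snapshot
		}
		if err != nil {
			return err
		}
		if err := db.WriteBatch(batch.Keys); err != nil {
			return err
		}
		log.Printf("Received a batch of %d keys. Total so far: %d", len(batch.Keys), total)
		total += len(batch.Keys)
	}
}
```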

@manishrjain (Member, Author) commented Oct 28, 2018

One possible fix is to increase the graceful shutdown period to a sufficiently large duration, say 10 minutes.

https://pracucci.com/graceful-shutdown-of-kubernetes-pods.html
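
For reference, Kubernetes sends SIGTERM to the container on pod termination and waits terminationGracePeriodSeconds (settable in the pod spec, e.g. 600 for the 10 minutes suggested above) before sending SIGKILL. A longer grace period only helps if the process traps SIGTERM and drains its in-flight work; a minimal Go sketch of that pattern (the helpers are hypothetical, not Dgraph's actual shutdown path):

```go
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)

	// ... start gRPC/HTTP servers and the raft loop here ...

	<-sigCh
	log.Println("Caught termination signal. Shutting down gracefully...")

	// Finish whatever the node is in the middle of (e.g. applying a snapshot)
	// before the grace period expires and the kubelet sends SIGKILL.
	finishPendingWork()
}

// finishPendingWork is a hypothetical placeholder: stop accepting requests,
// wait for in-flight proposals and snapshot application to drain, then flush
// and close the underlying store.
func finishPendingWork() {}
```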

@manishrjain manishrjain self-assigned this Oct 29, 2018

@manishrjain (Member, Author) commented Oct 29, 2018

Also curious to understand the behavior of Raft in this case. Given that the snapshot wasn't fully applied, Raft should have retried when the node restarted.

danielmai added a commit that referenced this issue Oct 30, 2018
manishrjain added a commit that referenced this issue Oct 30, 2018
Make raft leader resume probing after snapshot crash (#2707)
When a raft follower is considered to be falling behind, the leader sends it a snapshot. The follower then opens a streaming connection to the leader, asking it to send the snapshot. If the follower crashes while receiving and applying the snapshot, it is left in permanent limbo, receiving no more raft updates. This is because the leader pauses its probing until it hears back from the follower, and that pause "persists" through a follower crash and restart.

This PR makes the snapshot streaming bi-directional, so the follower can send an ACK back to the leader once it has successfully applied the snapshot. If the leader gets an error instead, it marks the snapshot as a failure and resumes probing.

The end effect is that if the follower crashes while receiving a snapshot, the leader resumes probing, so when the follower comes back up, it requests a snapshot from the leader again.

Fixes #2698.

P.S. I wish etcd's raft lib had better documentation to warn us about this via the function docs. I'll create another PR to improve their godocs.

* Change the order in which the snapshot is retrieved and the WAL is saved via `SaveToStorage`, so that if a follower fails to retrieve the snapshot, the snapshot isn't recorded in the WAL. This allows future probing to work correctly.
* Some exploration to determine a good way to send an ack of writes back to the leader.
* Recovers correctly from a node crash during snapshot streaming.
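
A minimal leader-side sketch of the mechanism this commit relies on, using etcd's raft Node API. The streamSnapshot helper is illustrative, and the import path is whatever raft package the build vendors, not necessarily the one shown:

```go
package snapshot

import (
	"context"

	"github.com/coreos/etcd/raft" // assumed: the etcd raft library vendored by Dgraph at the time
)

// sendSnapshot streams a snapshot to a follower and reports the outcome back
// to raft. Without the ReportSnapshot call on failure, raft keeps the
// follower's progress paused and never probes it again, which is the limbo
// described above.
func sendSnapshot(ctx context.Context, n raft.Node, followerID uint64) {
	if err := streamSnapshot(ctx, followerID); err != nil {
		// The follower crashed or the stream broke: mark the snapshot as
		// failed so the leader resumes probing and can offer a fresh snapshot
		// once the follower comes back up.
		n.ReportSnapshot(followerID, raft.SnapshotFailure)
		return
	}
	// The follower ACKed that it applied the snapshot successfully.
	n.ReportSnapshot(followerID, raft.SnapshotFinish)
}

// streamSnapshot stands in for the bi-directional snapshot stream; the real
// implementation waits for an explicit ACK (or an error) from the follower.
func streamSnapshot(ctx context.Context, followerID uint64) error {
	// ... send key batches, then block until the follower ACKs or the stream fails ...
	return nil
}
```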
dna2github added a commit to dna2fork/dgraph that referenced this issue Jul 19, 2019
Make raft leader resume probing after snapshot crash (dgraph-io#2707)