embed server never becomes ready / can't be stopped in certain start failure conditions #9533

Open
Quentin-M opened this issue Apr 5, 2018 · 11 comments · May be fixed by #16754

@Quentin-M (Contributor)

Hi guys,

Please find below a simple excerpt where we try to start an embed server and wait for it to become ready within a certain time, or stop it otherwise.

	// Assume .MemberAdd() was already called properly.

	c.server, err = embed.StartEtcd(etcdCfg)
	if err != nil {
		// embed.StartEtcd() will by itself call
		//   for _, sctx := range e.sctxs { close(sctx.serversC) }
		//   e.Close()
		// in case of an early failure.
		return fmt.Errorf("failed to start etcd: %s", err)
	}

	select {
	case <-c.server.Server.ReadyNotify():
		break
	case <-ctx.Done():
		// c.server.Server.HardStop()
		// c.server.Server = nil

		c.server.Close()

		return fmt.Errorf("server took too long to become ready")
	}

When we set up an embed server to join an existing cluster, it may take some time for the server to become ready, as it needs to sync data. If anything happens during the transmission of the snapshot, or if the member is removed in the meantime, the etcdserver will notice it, rafthttp will stop the peers (see below), and the server will sit idle. The time allocated for etcd to start then runs out and the context cancels (or we read errors from the error channel). At that point we try to Close() the server so etcd can be stopped and the ports freed, allowing us to retry or move on without leaving zombie etcd goroutines/servers around.

However, Close() does not handle this well and stalls. From my debugging, this line right here is blocking. I am not exactly sure of the lifecycle of this channel.
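
For reference, the part of Close() that appears to block is the loop draining each serve context's serversC channel. Paraphrased from the embed package (so not an exact copy of any particular release), it looks roughly like this:

	// Paraphrased from (*embed.Etcd).Close(); details may differ between versions.
	for _, sctx := range e.sctxs {
		// This range blocks forever if serversC is never written to nor closed,
		// which happens when the server stops before ever becoming ready.
		for ss := range sctx.serversC {
			ctx, cancel := context.WithTimeout(context.Background(), timeout)
			stopServers(ctx, ss)
			cancel()
		}
	}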

I would be happy to just attempt to close the Peers / Listeners manually (as they are exported) - I haven't tried yet, and I am not sure that would be enough anyway - but the Metrics server, at least, is not exported and thus cannot be closed.

Do you have any suggestions?

I | embed: peerTLS: cert = /var/lib/eco/etcd/fixtures/peer/cert.pem, key = /var/lib/eco/etcd/fixtures/peer/key.pem, ca = , trusted-ca = , client-cert-auth = false, crl-file =
I | embed: listening for peers on https://172.16.4.239:2380
I | embed: listening for client requests on 172.16.4.239:2379
I | etcdserver: recovered store from snapshot at index 295
I | mvcc: restore compact to 239897
I | etcdserver: name = i-06429521c23f59cc7
I | etcdserver: data dir = /var/lib/eco/etcd
I | etcdserver: member dir = /var/lib/eco/etcd/member
I | etcdserver: heartbeat = 100ms
I | etcdserver: election = 1000ms
I | etcdserver: snapshot count = 100000
I | etcdserver: advertise client URLs = https://etcd.quentin.bitmechs.com:2379
I | etcdserver: restarting member 20b37b73ca49d36e in cluster 5c8eef25bdd13a25 at commit index 312
I | raft: 20b37b73ca49d36e became follower at term 4
I | raft: newRaft 20b37b73ca49d36e [peers: [20b37b73ca49d36e,333e79a567546799,dc84c4754fe5e219], term: 4, commit: 312, applied: 295, lastindex: 312, lastterm: 4]
I | etcdserver/api: enabled capabilities for version 3.3
I | etcdserver/membership: added member 20b37b73ca49d36e [https://172.16.4.239:2380] to cluster 5c8eef25bdd13a25 from store
I | etcdserver/membership: added member 333e79a567546799 [https://172.16.8.30:2380] to cluster 5c8eef25bdd13a25 from store
I | etcdserver/membership: added member dc84c4754fe5e219 [https://172.16.28.168:2380] to cluster 5c8eef25bdd13a25 from store
I | etcdserver/membership: set the cluster version to 3.3 from store
I | mvcc: restore compact to 239897
W | auth: simple token is not cryptographically signed
I | rafthttp: starting peer 333e79a567546799...
I | rafthttp: started HTTP pipelining with peer 333e79a567546799
I | rafthttp: started streaming with peer 333e79a567546799 (writer)
I | rafthttp: started streaming with peer 333e79a567546799 (writer)
I | rafthttp: started peer 333e79a567546799
I | rafthttp: added peer 333e79a567546799
I | rafthttp: starting peer dc84c4754fe5e219...
I | rafthttp: started HTTP pipelining with peer dc84c4754fe5e219
I | rafthttp: started streaming with peer 333e79a567546799 (stream MsgApp v2 reader)
I | rafthttp: started streaming with peer dc84c4754fe5e219 (stream MsgApp v2 reader)
I | rafthttp: started peer dc84c4754fe5e219
I | rafthttp: added peer dc84c4754fe5e219
I | etcdserver: starting server... [version: 3.3.3, cluster version: 3.3]
I | rafthttp: started streaming with peer dc84c4754fe5e219 (stream Message reader)
I | rafthttp: started streaming with peer 333e79a567546799 (stream Message reader)
I | rafthttp: started streaming with peer dc84c4754fe5e219 (writer)
I | rafthttp: started streaming with peer dc84c4754fe5e219 (writer)
I | embed: ClientTLS: cert = /etc/eco/eco.crt, key = /etc/eco/eco.key, ca = , trusted-ca = /etc/eco/ca.crt, client-cert-auth = true, crl-file =
I | embed: listening for metrics on http://172.16.4.239:2381
I | rafthttp: peer dc84c4754fe5e219 became active
I | rafthttp: established a TCP streaming connection with peer dc84c4754fe5e219 (stream MsgApp v2 writer)
I | raft: raft.node: 20b37b73ca49d36e elected leader dc84c4754fe5e219 at term 4
I | rafthttp: peer 333e79a567546799 became active
I | rafthttp: established a TCP streaming connection with peer 333e79a567546799 (stream Message writer)
I | rafthttp: established a TCP streaming connection with peer dc84c4754fe5e219 (stream Message writer)
W | rafthttp: closed an existing TCP streaming connection with peer 333e79a567546799 (stream Message writer)
I | rafthttp: established a TCP streaming connection with peer 333e79a567546799 (stream Message writer)
W | rafthttp: closed an existing TCP streaming connection with peer dc84c4754fe5e219 (stream MsgApp v2 writer)
I | rafthttp: established a TCP streaming connection with peer dc84c4754fe5e219 (stream MsgApp v2 writer)
I | rafthttp: established a TCP streaming connection with peer 333e79a567546799 (stream MsgApp v2 writer)
W | rafthttp: closed an existing TCP streaming connection with peer 333e79a567546799 (stream MsgApp v2 writer)
I | rafthttp: established a TCP streaming connection with peer 333e79a567546799 (stream MsgApp v2 writer)
W | rafthttp: closed an existing TCP streaming connection with peer dc84c4754fe5e219 (stream Message writer)
I | rafthttp: established a TCP streaming connection with peer dc84c4754fe5e219 (stream Message writer)
I | etcdserver: 20b37b73ca49d36e initialzed peer connection; fast-forwarding 8 ticks (election ticks 10) with 2 active peer(s)
E | rafthttp: failed to dial dc84c4754fe5e219 on stream MsgApp v2 (the member has been permanently removed from the cluster)
I | rafthttp: peer dc84c4754fe5e219 became inactive
E | etcdserver: the member has been permanently removed from the cluster
I | etcdserver: the data-dir used by this member must be removed.
I | etcdserver: aborting publish because server is stopped
I | rafthttp: stopping peer 333e79a567546799...
I | rafthttp: closed the TCP streaming connection with peer 333e79a567546799 (stream MsgApp v2 writer)
I | rafthttp: stopped streaming with peer 333e79a567546799 (writer)
I | rafthttp: closed the TCP streaming connection with peer 333e79a567546799 (stream Message writer)
I | rafthttp: stopped streaming with peer 333e79a567546799 (writer)
E | rafthttp: failed to dial 333e79a567546799 on stream MsgApp v2 (the member has been permanently removed from the cluster)
I | rafthttp: peer 333e79a567546799 became inactive
I | rafthttp: stopped HTTP pipelining with peer 333e79a567546799
I | rafthttp: stopped streaming with peer 333e79a567546799 (stream MsgApp v2 reader)
I | rafthttp: stopped streaming with peer 333e79a567546799 (stream Message reader)
I | rafthttp: stopped peer 333e79a567546799
I | rafthttp: stopping peer dc84c4754fe5e219...
I | rafthttp: closed the TCP streaming connection with peer dc84c4754fe5e219 (stream MsgApp v2 writer)
I | rafthttp: stopped streaming with peer dc84c4754fe5e219 (writer)
I | rafthttp: closed the TCP streaming connection with peer dc84c4754fe5e219 (stream Message writer)
I | rafthttp: stopped streaming with peer dc84c4754fe5e219 (writer)
I | rafthttp: stopped HTTP pipelining with peer dc84c4754fe5e219
I | rafthttp: stopped streaming with peer dc84c4754fe5e219 (stream MsgApp v2 reader)
I | rafthttp: stopped streaming with peer dc84c4754fe5e219 (stream Message reader)
I | rafthttp: stopped peer dc84c4754fe5e219

<Call Close() here>
<Frozen>
@gyuho (Contributor) commented Apr 5, 2018

Hmm, that channel should've been closed when starting embedded etcd failed?
Can you also try adding case err = <-(*embed.Etcd).Err(): to your select?
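
Applied to the excerpt above, that would look roughly like this (just a sketch; c.server is the *embed.Etcd from the original snippet):

	select {
	case <-c.server.Server.ReadyNotify():
		// became ready in time
	case err := <-c.server.Err():
		// an error surfaced by the embed server's internal goroutines
		c.server.Close()
		return fmt.Errorf("etcd server failed before becoming ready: %s", err)
	case <-ctx.Done():
		c.server.Close()
		return fmt.Errorf("server took too long to become ready")
	}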

@Quentin-M (Contributor, Author)

Yes, I also have that case in a separate goroutine that starts once etcd has successfully started and become ready. I haven't verified whether etcd pushes anything there in the case above - but the reaction is about the same: call Stop() and hope it will not freeze and will free the ports.

func (c *Server) runErrorWatcher() {
	select {
	case <-c.server.Server.StopNotify():
		return
	case <-c.server.Err():
		log.Warnf("server has crashed")
		c.server.Close()
	}
}

@Quentin-M (Contributor, Author) commented Apr 6, 2018

To be extra clear, here is an 'easy' reproduction case (a rough clientv3 sketch of steps 2 and 4 follows the list):

  • Get a cluster running with 2-8GB of data loaded (it buys time to set up the seeming race condition)
  • Add a new member, matching the configuration of the embed server you'll start
  • Start the embed server
  • Remove the member while the embed server is sync'ing
  • Try to stop the embed server - and witness Stop() hanging.
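
For illustration only, steps 2 and 4 could be driven with the clientv3 cluster API against the existing cluster; the endpoints and URLs below are placeholders, and the import path corresponds to recent etcd releases rather than the 3.3-era one:

package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// TLS configuration omitted for brevity; endpoints are placeholders.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://existing-member:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Step 2: register the new member before starting the embed server.
	addResp, err := cli.MemberAdd(context.TODO(), []string{"https://new-member:2380"})
	if err != nil {
		log.Fatal(err)
	}

	// Step 3: start the embed server with a matching configuration (elided).

	// Step 4: remove the member again while the embed server is still syncing.
	if _, err := cli.MemberRemove(context.TODO(), addResp.Member.ID); err != nil {
		log.Fatal(err)
	}
}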

@gyuho (Contributor) commented Apr 6, 2018

Remove the member while the embed server is sync'ing

Is this removing the member that was started as an embed server in step 3?

@Quentin-M (Contributor, Author) commented Apr 6, 2018

Yes, that is pretty rare/stupid. However, from what I have seen, this could probably also be reproduced by injecting a failure into the snapshot transmission (e.g. something happens to the member transmitting the snapshot, or to the network in between). I could try to do this if that's of any interest to you, although it feels like it is just about a channel not being closed properly.

The point is that, although a human eye can notice the start failure in the logs, and a machine could potentially infer it from a longer-than-expected start, recovering from such a situation (i.e. etcd started, but was never able to become ready - for whatever reason) is impossible without panicking / exiting the whole wrapping application, as Stop() is not able to stop the goroutines / free the ports.

By the way, I just added case <-c.server.Err(): as suggested, to verify whether the start failure would be caught and surfaced on this channel, but it is not (a context timer is thus the only way to catch it at this point).

@gyuho (Contributor) commented Apr 6, 2018

@Quentin-M Got it. We will try to reproduce. Thanks!

@gyuho (Contributor) commented Apr 16, 2018

Sorry for the delay. Just FYI, I am seeing similar blocking behavior in other sets of tests, so I will make sure to address this within this release cycle.

@Quentin-M (Contributor, Author)

@gyuho Thank you so much for looking into it - I don't think there's any urgency from anyone at the moment anyway!

@gyuho gyuho added this to the etcd-v3.4 milestone Apr 17, 2018
@gyuho gyuho modified the milestones: etcd-v3.4, etcd-v3.5 Aug 5, 2019
@serathius serathius removed this from the etcd-v3.5 milestone Dec 15, 2021
@Hendrik-H

I'm seeing the same issue in 3.5.9. My case seems to be very similar: I have a cluster with 3 nodes, and while a node is stopped it is removed from the cluster. When I then try to start that node, the start does not return an error, but the server stops on its own. After that I cannot invoke Close() on embed.Etcd, as it hangs at https://github.com/etcd-io/etcd/blob/v3.5.9/server/embed/etcd.go#L397 because the channel never gets closed. As a result the listeners are not closed and I cannot restart the server, since the ports are still in use. Because the peer and client listeners are exposed I can close them directly to get around the issue, but not everything is cleaned up this way.
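
For completeness, a rough sketch of that workaround, assuming the exported Peers and Clients fields on embed.Etcd (as in the v3.5.x struct); it frees the ports but, as noted, does not clean everything up (e.g. the metrics listeners):

// forceCloseListeners is a hypothetical helper, not part of the embed API:
// close the exported listeners directly when (*embed.Etcd).Close() hangs,
// so the ports can be reused. Errors are ignored for brevity.
func forceCloseListeners(e *embed.Etcd) {
	for _, p := range e.Peers {
		if p != nil {
			p.Close() // the peer listener embeds net.Listener
		}
	}
	for _, c := range e.Clients {
		if c != nil {
			c.Close()
		}
	}
}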

@Hendrik-H

The issue seems to be that the serve code only waits for the server to become Ready, so the channel never gets closed when the server stops before reaching that state.

A fix might be to wait for Ready and Stopping / Stopped using a select and in the stop case simply close the channel and return.
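
A hypothetical sketch of that idea, assuming the serve path currently blocks on an unconditional <-s.ReadyNotify() and that StoppingNotify() is available on the server (the actual change is in #16754):

	// Instead of blocking unconditionally on ReadyNotify, also watch for the
	// server stopping before it ever becomes ready.
	select {
	case <-s.ReadyNotify():
		// server became ready; continue wiring up the client/metrics servers.
	case <-s.StoppingNotify():
		// server is shutting down before becoming ready; close serversC so
		// that (*Etcd).Close() can finish draining it instead of hanging.
		close(sctx.serversC)
		return nil
	}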

@ahrtr (Member) commented Oct 13, 2023

A fix might be to wait for Ready and Stopping / Stopped using a select and in the stop case simply close the channel and return.

It looks like a valid fix: #16754. Please feel free to continue working on top of the PR if you are interested.

Hendrik-H added a commit to Hendrik-H/etcd that referenced this issue Oct 13, 2023
Hendrik-H added a commit to Hendrik-H/etcd that referenced this issue Oct 13, 2023
Hendrik-H added a commit to Hendrik-H/etcd that referenced this issue Oct 17, 2023