embed server never becomes ready / can't be stopped in certain start failure conditions #9533

Open
Quentin-M opened this issue Apr 5, 2018 · 11 comments · May be fixed by #16754

@Quentin-M (Contributor)

Hi guys,

Please find below a simple excerpt where we try to start an embed server and wait for it to become ready within a certain time, or stop it otherwise.

	// Assume .MemberAdd() was already called properly.

	c.server, err = embed.StartEtcd(etcdCfg)
	if err != nil {
		// embed.StartEtcd() will by itself call
		//   for _, sctx := range e.sctxs { close(sctx.serversC) }
		//   e.Close()
		// in case of an early failure.
		return fmt.Errorf("failed to start etcd: %s", err)
	}

	select {
	case <-c.server.Server.ReadyNotify():
		break
	case <-ctx.Done():
		// c.server.Server.HardStop()
		// c.server.Server = nil

		c.server.Close()

		return fmt.Errorf("server took too long to become ready")
	}

When we set up an embed server to join an existing cluster, it may take some time for the server to become ready, as it needs to sync data. If anything happens during the transmission of the snapshot, or if the member is removed in the meantime, the etcdserver will notice it, rafthttp will stop the peers (see below), and the server will sit idle. The time allocated for etcd to start then runs out and the context cancels (or we read errors from the error channel). At that point we try to Close() the server so etcd can be stopped and the ports freed, allowing us to retry or move on without leaving zombie etcd goroutines/servers around.

However, Close() does not handle this well and stalls. From my debugging, this line right here is blocking. I am not exactly sure of the lifecycle of this channel.
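
For reference, the part of Close() that appears to block is the loop draining each serve context's serversC channel. Paraphrased from the embed package (so not an exact copy of any particular release), it looks roughly like this:

	// Paraphrased from (*embed.Etcd).Close(); details may differ between versions.
	for _, sctx := range e.sctxs {
		// This range blocks forever if serversC is never written to nor closed,
		// which happens when the server stops before ever becoming ready.
		for ss := range sctx.serversC {
			ctx, cancel := context.WithTimeout(context.Background(), timeout)
			stopServers(ctx, ss)
			cancel()
		}
	}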

I would be happy to just attempt to close the Peers / Listeners manually (as they are exported) - I haven't tried yet, and I am not sure that would be enough anyway - but the Metrics server, at least, is not exported and thus cannot be closed.

Do you have any suggestions?

I | embed: peerTLS: cert = /var/lib/eco/etcd/fixtures/peer/cert.pem, key = /var/lib/eco/etcd/fixtures/peer/key.pem, ca = , trusted-ca = , client-cert-auth = false, crl-file =
I | embed: listening for peers on https://172.16.4.239:2380
I | embed: listening for client requests on 172.16.4.239:2379
I | etcdserver: recovered store from snapshot at index 295
I | mvcc: restore compact to 239897
I | etcdserver: name = i-06429521c23f59cc7
I | etcdserver: data dir = /var/lib/eco/etcd
I | etcdserver: member dir = /var/lib/eco/etcd/member
I | etcdserver: heartbeat = 100ms
I | etcdserver: election = 1000ms
I | etcdserver: snapshot count = 100000
I | etcdserver: advertise client URLs = https://etcd.quentin.bitmechs.com:2379
I | etcdserver: restarting member 20b37b73ca49d36e in cluster 5c8eef25bdd13a25 at commit index 312
I | raft: 20b37b73ca49d36e became follower at term 4
I | raft: newRaft 20b37b73ca49d36e [peers: [20b37b73ca49d36e,333e79a567546799,dc84c4754fe5e219], term: 4, commit: 312, applied: 295, lastindex: 312, lastterm: 4]
I | etcdserver/api: enabled capabilities for version 3.3
I | etcdserver/membership: added member 20b37b73ca49d36e [https://172.16.4.239:2380] to cluster 5c8eef25bdd13a25 from store
I | etcdserver/membership: added member 333e79a567546799 [https://172.16.8.30:2380] to cluster 5c8eef25bdd13a25 from store
I | etcdserver/membership: added member dc84c4754fe5e219 [https://172.16.28.168:2380] to cluster 5c8eef25bdd13a25 from store
I | etcdserver/membership: set the cluster version to 3.3 from store
I | mvcc: restore compact to 239897
W | auth: simple token is not cryptographically signed
I | rafthttp: starting peer 333e79a567546799...
I | rafthttp: started HTTP pipelining with peer 333e79a567546799
I | rafthttp: started streaming with peer 333e79a567546799 (writer)
I | rafthttp: started streaming with peer 333e79a567546799 (writer)
I | rafthttp: started peer 333e79a567546799
I | rafthttp: added peer 333e79a567546799
I | rafthttp: starting peer dc84c4754fe5e219...
I | rafthttp: started HTTP pipelining with peer dc84c4754fe5e219
I | rafthttp: started streaming with peer 333e79a567546799 (stream MsgApp v2 reader)
I | rafthttp: started streaming with peer dc84c4754fe5e219 (stream MsgApp v2 reader)
I | rafthttp: started peer dc84c4754fe5e219
I | rafthttp: added peer dc84c4754fe5e219
I | etcdserver: starting server... [version: 3.3.3, cluster version: 3.3]
I | rafthttp: started streaming with peer dc84c4754fe5e219 (stream Message reader)
I | rafthttp: started streaming with peer 333e79a567546799 (stream Message reader)
I | rafthttp: started streaming with peer dc84c4754fe5e219 (writer)
I | rafthttp: started streaming with peer dc84c4754fe5e219 (writer)
I | embed: ClientTLS: cert = /etc/eco/eco.crt, key = /etc/eco/eco.key, ca = , trusted-ca = /etc/eco/ca.crt, client-cert-auth = true, crl-file =
I | embed: listening for metrics on http://172.16.4.239:2381
I | rafthttp: peer dc84c4754fe5e219 became active
I | rafthttp: established a TCP streaming connection with peer dc84c4754fe5e219 (stream MsgApp v2 writer)
I | raft: raft.node: 20b37b73ca49d36e elected leader dc84c4754fe5e219 at term 4
I | rafthttp: peer 333e79a567546799 became active
I | rafthttp: established a TCP streaming connection with peer 333e79a567546799 (stream Message writer)
I | rafthttp: established a TCP streaming connection with peer dc84c4754fe5e219 (stream Message writer)
W | rafthttp: closed an existing TCP streaming connection with peer 333e79a567546799 (stream Message writer)
I | rafthttp: established a TCP streaming connection with peer 333e79a567546799 (stream Message writer)
W | rafthttp: closed an existing TCP streaming connection with peer dc84c4754fe5e219 (stream MsgApp v2 writer)
I | rafthttp: established a TCP streaming connection with peer dc84c4754fe5e219 (stream MsgApp v2 writer)
I | rafthttp: established a TCP streaming connection with peer 333e79a567546799 (stream MsgApp v2 writer)
W | rafthttp: closed an existing TCP streaming connection with peer 333e79a567546799 (stream MsgApp v2 writer)
I | rafthttp: established a TCP streaming connection with peer 333e79a567546799 (stream MsgApp v2 writer)
W | rafthttp: closed an existing TCP streaming connection with peer dc84c4754fe5e219 (stream Message writer)
I | rafthttp: established a TCP streaming connection with peer dc84c4754fe5e219 (stream Message writer)
I | etcdserver: 20b37b73ca49d36e initialzed peer connection; fast-forwarding 8 ticks (election ticks 10) with 2 active peer(s)
E | rafthttp: failed to dial dc84c4754fe5e219 on stream MsgApp v2 (the member has been permanently removed from the cluster)
I | rafthttp: peer dc84c4754fe5e219 became inactive
E | etcdserver: the member has been permanently removed from the cluster
I | etcdserver: the data-dir used by this member must be removed.
I | etcdserver: aborting publish because server is stopped
I | rafthttp: stopping peer 333e79a567546799...
I | rafthttp: closed the TCP streaming connection with peer 333e79a567546799 (stream MsgApp v2 writer)
I | rafthttp: stopped streaming with peer 333e79a567546799 (writer)
I | rafthttp: closed the TCP streaming connection with peer 333e79a567546799 (stream Message writer)
I | rafthttp: stopped streaming with peer 333e79a567546799 (writer)
E | rafthttp: failed to dial 333e79a567546799 on stream MsgApp v2 (the member has been permanently removed from the cluster)
I | rafthttp: peer 333e79a567546799 became inactive
I | rafthttp: stopped HTTP pipelining with peer 333e79a567546799
I | rafthttp: stopped streaming with peer 333e79a567546799 (stream MsgApp v2 reader)
I | rafthttp: stopped streaming with peer 333e79a567546799 (stream Message reader)
I | rafthttp: stopped peer 333e79a567546799
I | rafthttp: stopping peer dc84c4754fe5e219...
I | rafthttp: closed the TCP streaming connection with peer dc84c4754fe5e219 (stream MsgApp v2 writer)
I | rafthttp: stopped streaming with peer dc84c4754fe5e219 (writer)
I | rafthttp: closed the TCP streaming connection with peer dc84c4754fe5e219 (stream Message writer)
I | rafthttp: stopped streaming with peer dc84c4754fe5e219 (writer)
I | rafthttp: stopped HTTP pipelining with peer dc84c4754fe5e219
I | rafthttp: stopped streaming with peer dc84c4754fe5e219 (stream MsgApp v2 reader)
I | rafthttp: stopped streaming with peer dc84c4754fe5e219 (stream Message reader)
I | rafthttp: stopped peer dc84c4754fe5e219

<Call Close() here>
<Frozen>
@gyuho (Contributor) commented Apr 5, 2018

Hmm, that channel should've been closed when starting embedded etcd failed?
Can you also try adding case err = <-(*embed.Etcd).Err(): to your select?
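
Applied to the excerpt above, that would look roughly like this (just a sketch; c.server is the *embed.Etcd from the original snippet):

	select {
	case <-c.server.Server.ReadyNotify():
		// became ready in time
	case err := <-c.server.Err():
		// an error surfaced by the embed server's internal goroutines
		c.server.Close()
		return fmt.Errorf("etcd server failed before becoming ready: %s", err)
	case <-ctx.Done():
		c.server.Close()
		return fmt.Errorf("server took too long to become ready")
	}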

@Quentin-M (Contributor, Author)

Yes, I also have that case in a separate goroutine that starts once etcd has successfully started and become ready. I haven't verified whether etcd pushes anything there in the case above - but the reaction is about the same: call Stop() and hope it will not freeze and will free the ports.

func (c *Server) runErrorWatcher() {
	select {
	case <-c.server.Server.StopNotify():
		return
	case <-c.server.Err():
		log.Warnf("server has crashed")
		c.server.Close()
	}
}

@Quentin-M (Contributor, Author) commented Apr 6, 2018

To be extra clear, here is an 'easy' reproduction case (a rough clientv3 sketch of steps 2 and 4 follows the list):

  • Get a cluster running with 2-8GB of data loaded (it buys time to set up the seeming race condition)
  • Add a new member, matching the configuration of the embed server you'll start
  • Start the embed server
  • Remove the member while the embed server is sync'ing
  • Try to stop the embed server - and witness Stop() hanging.
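
For illustration only, steps 2 and 4 could be driven with the clientv3 cluster API against the existing cluster; the endpoints and URLs below are placeholders, and the import path corresponds to recent etcd releases rather than the 3.3-era one:

package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// TLS configuration omitted for brevity; endpoints are placeholders.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://existing-member:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Step 2: register the new member before starting the embed server.
	addResp, err := cli.MemberAdd(context.TODO(), []string{"https://new-member:2380"})
	if err != nil {
		log.Fatal(err)
	}

	// Step 3: start the embed server with a matching configuration (elided).

	// Step 4: remove the member again while the embed server is still syncing.
	if _, err := cli.MemberRemove(context.TODO(), addResp.Member.ID); err != nil {
		log.Fatal(err)
	}
}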

@gyuho (Contributor) commented Apr 6, 2018

Remove the member while the embed server is sync'ing

Is this removing the member that was started as an embed server in step 3?

@Quentin-M (Contributor, Author) commented Apr 6, 2018

Yes, that is pretty rare/stupid. However, from what I have seen, this could probably also be reproduced by injecting a failure into the snapshot transmission (e.g. something happens to the member transmitting the snapshot, or to the network in between). I could try to do this if that's of any interest to you, although it feels like it is just about a channel not being closed properly.

The point is that, although a human eye can notice the start failure in the logs, and a machine could potentially infer it from a longer-than-expected start, recovering from such a situation (i.e. etcd started, but was never able to become ready - for whatever reason) is impossible without panicking / exiting the whole wrapping application, as Stop() is not able to stop the goroutines / free the ports.

By the way, I just added case <-c.server.Err(): as suggested, to verify whether the start failure would be caught and surfaced on this channel, but it is not (a context timer is thus the only way to catch it at this point).

@gyuho (Contributor) commented Apr 6, 2018

@Quentin-M Got it. We will try to reproduce. Thanks!

@gyuho (Contributor) commented Apr 16, 2018

Sorry for the delay. Just FYI, I am seeing similar blocking behavior in other sets of tests, so I will make sure to address this within this release cycle.

@Quentin-M (Contributor, Author)

@gyuho Thank you so much for looking into it - I don't think there's any urgency from anyone at the moment anyway!

@gyuho gyuho added this to the etcd-v3.4 milestone Apr 17, 2018
@gyuho gyuho modified the milestones: etcd-v3.4, etcd-v3.5 Aug 5, 2019
@serathius serathius removed this from the etcd-v3.5 milestone Dec 15, 2021
@Hendrik-H

I'm seeing the same issue in 3.5.9. My case seems to be very similar: I have a cluster with 3 nodes, and while a node is stopped it is removed from the cluster. When I then try to start that node, the start does not return an error, but the server stops on its own. After that I cannot invoke Close() on embed.Etcd, as it hangs at https://github.com/etcd-io/etcd/blob/v3.5.9/server/embed/etcd.go#L397 because the channel never gets closed. As a result the listeners are not closed and I cannot restart the server, since the ports are still in use. Because the peer and client listeners are exposed I can close them directly to get around the issue, but not everything is cleaned up this way.
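
For completeness, a rough sketch of that workaround, assuming the exported Peers and Clients fields on embed.Etcd (as in the v3.5.x struct); it frees the ports but, as noted, does not clean everything up (e.g. the metrics listeners):

// forceCloseListeners is a hypothetical helper, not part of the embed API:
// close the exported listeners directly when (*embed.Etcd).Close() hangs,
// so the ports can be reused. Errors are ignored for brevity.
func forceCloseListeners(e *embed.Etcd) {
	for _, p := range e.Peers {
		if p != nil {
			p.Close() // the peer listener embeds net.Listener
		}
	}
	for _, c := range e.Clients {
		if c != nil {
			c.Close()
		}
	}
}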

@Hendrik-H

The issue seems to be that the serve code only waits for the server to become Ready, so the channel never gets closed when the server stops before reaching that state.

A fix might be to wait for Ready and Stopping / Stopped using a select and in the stop case simply close the channel and return.
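
A hypothetical sketch of that idea, assuming the serve path currently blocks on an unconditional <-s.ReadyNotify() and that StoppingNotify() is available on the server (the actual change is in #16754):

	// Instead of blocking unconditionally on ReadyNotify, also watch for the
	// server stopping before it ever becomes ready.
	select {
	case <-s.ReadyNotify():
		// server became ready; continue wiring up the client/metrics servers.
	case <-s.StoppingNotify():
		// server is shutting down before becoming ready; close serversC so
		// that (*Etcd).Close() can finish draining it instead of hanging.
		close(sctx.serversC)
		return nil
	}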

@ahrtr (Member) commented Oct 13, 2023

A fix might be to wait for Ready and Stopping / Stopped using a select and in the stop case simply close the channel and return.

It looks like a valid fix: #16754. Please feel free to continue working on top of the PR if you are interested.

Hendrik-H added a commit to Hendrik-H/etcd that referenced this issue Oct 13, 2023
Hendrik-H added a commit to Hendrik-H/etcd that referenced this issue Oct 13, 2023
Hendrik-H added a commit to Hendrik-H/etcd that referenced this issue Oct 17, 2023