etcd daemon: creates member directory if cluster config is not valid #3827
+1 I'm seeing the same problem: although my extra member falls back to proxy as expected initially, on any subsequent restart I get this error.
Thanks for reporting. I will investigate and try to resolve this.
+1, all proxies failed to start after reboot. Running CoreOS 845. Removing the leftover data directory solves the problem.
@xcompass I got the service to start up after removing only the directory with nothing in it (based on your etcd settings). Removing both the member and proxy directories for some reason breaks it on the second startup/reboot.
@gyuho could a possible solution to the problem be to change https://github.com/coreos/etcd/blob/master/etcdmain/etcd.go#L488 to check for a file inside the cfg dir instead of for the directory itself?
@nikfoundas Sorry for being slack on this issue. I will look into it and give you updates.
@nikkomiu I found a workaround: setting proxy=1 forces the node to run in proxy mode, so it never creates the member directory. Then there is no problem after rebooting.
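For reference, the workaround above might look like this as a config fragment (a sketch; the exact spelling of the proxy setting on your system is an assumption, and etcd v2 documents the value as "on" rather than "1"):

```
# Force proxy mode so the node never writes a 'member' directory.
# Environment-variable form, as commonly used with etcd2 on CoreOS:
ETCD_PROXY=on

# Or as a command-line flag:
#   etcd2 --proxy on
```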
This fixes etcd-io#3827, where the member falls back to proxy successfully at first but fails on subsequent tries. It fails when 'member' and 'proxy' directories exist in the same place, one of which did not get cleaned up after a failure, causing this conflict error message: 'invalid datadir. Both member and proxy directories exist.'
My first approach can be found in #3949, which overwrites the proxy setting by deleting the member directory. If anybody has feedback, please let me know!
With CoreOS a simple workaround is to create a drop-in in the cloud-config for etcd2.service which deletes the empty directory before etcd2 starts. Afterwards etcd2 servers and proxies start as expected.
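A drop-in along those lines might look like the following sketch (the file name is an assumption; the `/var/lib/etcd2/member` path is the one reported in this issue). Using `rmdir` rather than `rm -rf` means only an *empty* leftover directory is removed, so real member data is never deleted:

```
# /etc/systemd/system/etcd2.service.d/10-clean-member.conf
# Hypothetical drop-in: remove an empty leftover member directory
# before etcd2 starts. The leading "-" tells systemd to ignore
# failure (e.g. when the directory is absent or non-empty).
[Service]
ExecStartPre=-/usr/bin/rmdir /var/lib/etcd2/member
```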
@geku Cool! And thanks for posting here.
FYI, I am working on improving our discovery function so that it does not write any conflicting directories unless the whole operation succeeds. Will keep you updated on this issue.
This is for etcd-io#3827. This removes the member directory with a defer statement, and only when etcdserver.NewServer returns an error.
When discovery JoinCluster fails with a DiscoveryError, the etcd node falls back to proxy. In order to start as a proxy, we need to make sure there are no conflicting directories: 'member' and 'proxy' directories cannot exist together. This deletes the 'member' directory only when startEtcd fails with the DiscoveryError type. This fixes etcd-io#3827.
When discovery fails, the etcd node falls back to proxy. In order to start as a proxy, we need to make sure there are no conflicting directories: 'member' and 'proxy' directories cannot exist together. This deletes the 'member' directory only when startEtcd fails (etcd-io#3827).
This removes the member directory when bootstrap fails, whether joining an existing cluster or forming a new one. This fixes etcd-io#3827.
I believe #4087 fixes this issue; it removes the member directory when bootstrap fails.
Let's simulate the situation where we bootstrap 4 etcd members but with a size=3 discovery token. The first three members bootstrap without any issue, but the last one fails. It creates /var/lib/etcd2/member even when it cannot be started (for example, the DNS record is still not updated). Here is the gist. We can see that /var/lib/etcd2/member/ was created and it is empty:

```
$ ls -la /var/lib/etcd2
total 24
drwxr-xr-x  3 etcd etcd 4096 Nov  6 16:12 .
drwxr-xr-x 21 root root 4096 Nov  6 15:58 ..
drwx------  2 root root 4096 Nov  6 16:12 member
```
And the etcd2 daemon cannot switch to proxy mode automatically even when the DNS record is already valid. In addition, even when etcd2 runs as a proxy, it still creates the /var/lib/etcd2/member directory, and if you restart, it will fail with the conflict message quoted earlier ('invalid datadir. Both member and proxy directories exist.').

Partly relates to #3713.