etcd daemon: creates member directory if cluster config is not valid #3827

Closed
kayrus opened this issue Nov 6, 2015 · 12 comments

@kayrus
Contributor

kayrus commented Nov 6, 2015

$ etcd2 --version
etcd Version: 2.2.1
Git SHA: 75f8282
Go Version: go1.5.1
Go OS/Arch: linux/amd64

Let's simulate a situation where we bootstrap 4 etcd members with a size=3 discovery token. The first three members bootstrap without any issue, but the fourth does not: it creates /var/lib/etcd2/member even when it cannot start (for example, because its DNS record has not been updated yet). Here is the gist:

$ rm -rf /var/lib/etcd2/member/
$ ls -la /var/lib/etcd2/
total 16
drwxr-xr-x  2 etcd etcd 4096 Nov  6 16:09 .
drwxr-xr-x 21 root root 4096 Nov  6 15:58 ..
$ export ETCD_ADVERTISE_CLIENT_URLS=http://coreos4:2379
$ export ETCD_DATA_DIR=/var/lib/etcd2
$ export ETCD_DISCOVERY=https://discovery.etcd.io/d21a838dd608ab1831a9d84cdea2f7d1
$ export ETCD_INITIAL_ADVERTISE_PEER_URLS=http://coreos4:2380
$ export ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379,http://0.0.0.0:4001
$ export ETCD_LISTEN_PEER_URLS=http://0.0.0.0:2380
$ export ETCD_NAME=2769da4ab46d3b310a3fcc6b06fa9b6e
$ etcd2
2015-11-06 16:04:22.871587 I | flags: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://coreos4:2379
2015-11-06 16:04:22.871773 I | flags: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd2
2015-11-06 16:04:22.871924 I | flags: recognized and used environment variable ETCD_DISCOVERY=https://discovery.etcd.io/d21a838dd608ab1831a9d84cdea2f7d1
2015-11-06 16:04:22.872065 I | flags: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=http://coreos4:2380
2015-11-06 16:04:22.872207 I | flags: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379,http://0.0.0.0:4001
2015-11-06 16:04:22.872330 I | flags: recognized and used environment variable ETCD_LISTEN_PEER_URLS=http://0.0.0.0:2380
2015-11-06 16:04:22.872498 I | flags: recognized and used environment variable ETCD_NAME=2769da4ab46d3b310a3fcc6b06fa9b6e
2015-11-06 16:04:22.872781 I | etcdmain: etcd Version: 2.2.1
2015-11-06 16:04:22.872889 I | etcdmain: Git SHA: 75f8282
2015-11-06 16:04:22.873031 I | etcdmain: Go Version: go1.5.1
2015-11-06 16:04:22.873138 I | etcdmain: Go OS/Arch: linux/amd64
2015-11-06 16:04:22.873254 I | etcdmain: setting maximum number of CPUs to 1, total number of available CPUs is 1
2015-11-06 16:04:22.873509 I | etcdmain: listening for peers on http://0.0.0.0:2380
2015-11-06 16:04:22.873663 I | etcdmain: listening for client requests on http://0.0.0.0:2379
2015-11-06 16:04:22.873826 I | etcdmain: listening for client requests on http://0.0.0.0:4001
2015-11-06 16:04:23.070391 E | netutil: could not resolve host coreos4:2380
2015-11-06 16:04:23.070751 I | etcdmain: stopping listening for client requests on http://0.0.0.0:4001
2015-11-06 16:04:23.071103 I | etcdmain: stopping listening for client requests on http://0.0.0.0:2379
2015-11-06 16:04:23.071343 I | etcdmain: stopping listening for peers on http://0.0.0.0:2380
2015-11-06 16:04:23.071562 I | etcdmain: --initial-cluster must include 2769da4ab46d3b310a3fcc6b06fa9b6e=http://coreos4:2380 given --initial-advertise-peer-urls=http://coreos4:2380
2015-11-06 16:04:23.071747 I | etcdmain: forgot to set --initial-cluster flag?

We can see that /var/lib/etcd2/member/ was created and it is empty:

$ ls -la /var/lib/etcd2
total 24
drwxr-xr-x  3 etcd etcd 4096 Nov  6 16:12 .
drwxr-xr-x 21 root root 4096 Nov  6 15:58 ..
drwx------  2 root root 4096 Nov  6 16:12 member

And the etcd2 daemon cannot switch to proxy mode automatically, even once the DNS record is valid:

$ etcd2
2015-11-06 16:14:48.232621 I | flags: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://coreos4:2379
2015-11-06 16:14:48.233075 I | flags: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd2
2015-11-06 16:14:48.233270 I | flags: recognized and used environment variable ETCD_DISCOVERY=https://discovery.etcd.io/d21a838dd608ab1831a9d84cdea2f7d1
2015-11-06 16:14:48.233583 I | flags: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=http://coreos4:2380
2015-11-06 16:14:48.233794 I | flags: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379,http://0.0.0.0:4001
2015-11-06 16:14:48.234095 I | flags: recognized and used environment variable ETCD_LISTEN_PEER_URLS=http://0.0.0.0:2380
2015-11-06 16:14:48.234286 I | flags: recognized and used environment variable ETCD_NAME=2769da4ab46d3b310a3fcc6b06fa9b6e
2015-11-06 16:14:48.234720 I | etcdmain: etcd Version: 2.2.1
2015-11-06 16:14:48.234896 I | etcdmain: Git SHA: 75f8282
2015-11-06 16:14:48.235032 I | etcdmain: Go Version: go1.5.1
2015-11-06 16:14:48.235231 I | etcdmain: Go OS/Arch: linux/amd64
2015-11-06 16:14:48.235496 I | etcdmain: setting maximum number of CPUs to 1, total number of available CPUs is 1
2015-11-06 16:14:48.235772 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2015-11-06 16:14:48.236047 I | etcdmain: listening for peers on http://0.0.0.0:2380
2015-11-06 16:14:48.236300 I | etcdmain: listening for client requests on http://0.0.0.0:2379
2015-11-06 16:14:48.236576 I | etcdmain: listening for client requests on http://0.0.0.0:4001
2015-11-06 16:14:48.238058 I | netutil: resolving coreos4:2380 to 192.168.122.228:2380
2015-11-06 16:14:48.238692 I | netutil: resolving coreos4:2380 to 192.168.122.228:2380
2015-11-06 16:14:49.250359 I | etcdmain: stopping listening for client requests on http://0.0.0.0:4001
2015-11-06 16:14:49.250660 I | etcdmain: stopping listening for client requests on http://0.0.0.0:2379
2015-11-06 16:14:49.251300 I | etcdmain: stopping listening for peers on http://0.0.0.0:2380
2015-11-06 16:14:49.251593 C | etcdmain: discovery: cluster is full

In addition, even when etcd2 runs as a proxy, it still creates the /var/lib/etcd2/member directory, and on restart it fails with the following message:

invalid datadir. Both member and proxy directories exist.

Partly related to #3713.

@benjvi

benjvi commented Nov 9, 2015

+1, I'm seeing the same problem. My extra member falls back to proxy as expected initially, but on any subsequent restart I get this error.

@gyuho
Contributor

gyuho commented Nov 13, 2015

Thanks for reporting. I will investigate and try to resolve this.

@xcompass

+1, all proxies failed to start after reboot. Running CoreOS 845.

Nov 22 08:53:40 sigma etcd2[822]: etcd Version: 2.2.1
Nov 22 08:53:40 sigma etcd2[822]: Git SHA: 75f8282
Nov 22 08:53:40 sigma etcd2[822]: Go Version: go1.5.1
Nov 22 08:53:40 sigma etcd2[822]: Go OS/Arch: linux/amd64
Nov 22 08:53:40 sigma etcd2[822]: setting maximum number of CPUs to 2, total number of available 
Nov 22 08:53:40 sigma etcd2[822]: invalid datadir. Both member and proxy directories exist.
Nov 22 08:53:40 sigma systemd[1]: etcd2.service: Main process exited, code=exited, status=1/FAILURE
Nov 22 08:53:40 sigma systemd[1]: Failed to start etcd2.
Nov 22 08:53:40 sigma systemd[1]: etcd2.service: Unit entered failed state.
Nov 22 08:53:40 sigma systemd[1]: etcd2.service: Failed with result 'exit-code'.

Running

rm -rf /var/lib/etcd2

solves the problem.

@nikkomiu

nikkomiu commented Dec 2, 2015

@xcompass I got the service to start up after removing only the directory with nothing in it (based on your etcd settings).

Removing both the member and proxy directories for some reason breaks it on the second startup/reboot.

@nikkomiu

nikkomiu commented Dec 2, 2015

@gyuho could a possible solution to the problem be to change https://github.com/coreos/etcd/blob/master/etcdmain/etcd.go#L488 to check for the file in the cfg dir instead of the directory itself?

@gyuho
Contributor

gyuho commented Dec 2, 2015

@nikfoundas Sorry for being slack on this issue. I will look into it and give you updates on that.

@xcompass

xcompass commented Dec 2, 2015

@nikkomiu I found a workaround: setting proxy=1 forces the node to run in proxy mode. It never creates the member directory, so there is no problem after rebooting.

gyuho added a commit to gyuho/etcd that referenced this issue Dec 3, 2015
This fixes etcd-io#3827 where a member falls back to proxy
successfully at first, but fails in subsequent tries.
It fails when both 'member' and 'proxy' directories are in the same place,
one of which did not get cleaned up after a failure and causes this conflict
error message: 'invalid datadir. Both member and proxy directories exist.'
@gyuho
Contributor

gyuho commented Dec 8, 2015

My first approach, which overwrites the proxy setting by deleting the member directory, can be found in #3949. If anybody has feedback, please let me know!

@geku

geku commented Dec 17, 2015

With CoreOS, a simple workaround is to create a drop-in in the cloud-config for etcd2.service that deletes the empty member directory before etcd2 starts up:

- name: etcd2.service
  command: start
  drop-ins:
    - name: 50-cleanup-data.conf
      content: |
        [Service]
        ExecStartPre=-/usr/bin/rmdir /var/lib/etcd2/member

Afterwards etcd2 servers and proxies start as expected.

@veye

veye commented Dec 17, 2015

@geku Cool! And thanks for posting here.

@gyuho
Contributor

gyuho commented Dec 28, 2015

FYI, I am working to improve our discovery function so that it does not write any conflicting directories unless the whole operation succeeds. I will keep you updated on this issue.

gyuho added a commit to gyuho/etcd that referenced this issue Dec 29, 2015
This is for etcd-io#3827.
This removes member directory with defer statement. And it removes only when
etcdserver.NewServer returns error.
gyuho added a commit to gyuho/etcd that referenced this issue Dec 29, 2015
When discovery JoinCluster fails returning DiscoveryError, etcd node falls
back to proxy. And in order to start as a proxy, we need to make sure there
are no conflicting directories: 'member' and 'proxy' directories cannot
exist together. This deletes 'member' directory only when startEtcd fails
with DiscoveryError type. This fixes
etcd-io#3827.
gyuho added a commit to gyuho/etcd that referenced this issue Dec 29, 2015
When discovery fails, etcd node falls back to proxy. And in order to
start as a proxy, we need to make sure there are no conflicting directories:
'member' and 'proxy' directories cannot exist together. This deletes
'member' directory only when startEtcd fails (etcd-io#3827).
gyuho added a commit to gyuho/etcd that referenced this issue Dec 29, 2015
This removes member directory when bootstrap fails including joining existing
cluster and forming a new cluster. This fixes etcd-io#3827.
@gyuho
Contributor

gyuho commented Dec 29, 2015

I believe #4087 fixes this issue by removing the member directory when bootstrap fails. Closing this. Please let us know if you still have the same issue.
