etcd2 fails to start with option ETCD_WAL_DIR #7287

Closed
pizzarabe opened this issue Feb 7, 2017 · 26 comments

@pizzarabe

pizzarabe commented Feb 7, 2017

Bootstrapping a new etcd2 Server on CoreOS with a dedicated wal directory fails with the following error message:

cannot write to member directory: open /var/lib/etcd2/member/.touch: no such file or directory

Removing the wal-dir flag resolves the problem.

Here is the config of an etcd2 cluster node:


Feb 07 10:53:47 alien5 etcd2[1941]: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=https://10.169.1.135:2379
Feb 07 10:53:47 alien5 etcd2[1941]: recognized and used environment variable ETCD_CERT_FILE=/etc/ssl/alien/alien5.cert.pem
Feb 07 10:53:47 alien5 etcd2[1941]: recognized and used environment variable ETCD_CLIENT_CERT_AUTH=true
Feb 07 10:53:47 alien5 etcd2[1941]: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd2
Feb 07 10:53:47 alien5 etcd2[1941]: recognized and used environment variable ETCD_DISCOVERY=http://10.169.1.129:2379/v2/keys/discovery/18fc9f7cc488dc43937718ffa3800e8778fa964b6d83a53307a3d7ef4f9a66f9
Feb 07 10:53:47 alien5 etcd2[1941]: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=https://10.169.1.135:2380
Feb 07 10:53:47 alien5 etcd2[1941]: recognized and used environment variable ETCD_KEY_FILE=/etc/ssl/alien/alien5.key.pem
Feb 07 10:53:47 alien5 etcd2[1941]: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=https://0.0.0.0:2379,https://0.0.0.0:4001
Feb 07 10:53:47 alien5 etcd2[1941]: recognized and used environment variable ETCD_LISTEN_PEER_URLS=https://0.0.0.0:2380
Feb 07 10:53:47 alien5 etcd2[1941]: recognized and used environment variable ETCD_NAME=alien5.local
Feb 07 10:53:47 alien5 etcd2[1941]: recognized and used environment variable ETCD_PEER_CERT_FILE=/etc/ssl/alien/alien5.cert.pem
Feb 07 10:53:47 alien5 etcd2[1941]: recognized and used environment variable ETCD_PEER_CLIENT_CERT_AUTH=true
Feb 07 10:53:47 alien5 etcd2[1941]: recognized and used environment variable ETCD_PEER_KEY_FILE=/etc/ssl/alien/alien5.key.pem
Feb 07 10:53:47 alien5 etcd2[1941]: recognized and used environment variable ETCD_PEER_TRUSTED_CA_FILE=/etc/ssl/alien/alienca.cert.pem
Feb 07 10:53:47 alien5 etcd2[1941]: recognized and used environment variable ETCD_TRUSTED_CA_FILE=/etc/ssl/alien/alienca.cert.pem
Feb 07 10:53:47 alien5 etcd2[1941]: recognized and used environment variable ETCD_WAL_DIR=/var/wal/
Feb 07 10:53:47 alien5 etcd2[1941]: etcd Version: 2.3.7
Feb 07 10:53:47 alien5 etcd2[1941]: Git SHA: fd17c91
Feb 07 10:53:47 alien5 etcd2[1941]: Go Version: go1.7.3
Feb 07 10:53:47 alien5 etcd2[1941]: Go OS/Arch: linux/amd64

According to the docs, the wal-dir option should only change the path of the WAL files. I don't know why it has problems with the data-dir...

Path to the dedicated wal directory. If this flag is set, etcd will write the WAL files to the walDir rather than the dataDir. This allows a dedicated disk to be used, and helps avoid io competition between logging and other IO operations.

@heyitsanthony
Contributor

I suspect you're booting etcd with a wal-dir that's already populated but with a wiped member directory. etcd fails because it expects a member directory but it's not there:

$ ./bin/etcd -wal-dir waldir
[ctrl-c]
# remove member dir, keep waldir
$ rm -rf default.etcd
$ ./bin/etcd -wal-dir waldir

@pizzarabe
Author

The storage for the wal dir is created by the cloud-config (I tested it with a whole new cluster and new servers).

The following snippet of the cloud-config is used to create the wal-dir storage:

    - name: format-sdb.service
      command: start
      content: |
        [Unit]
        Description=Format sdb
        After=dev-sdb.device
        Requires=dev-sdb.device
        [Service]
        Type=oneshot
        RemainAfterExit=yes
        ExecStart=/usr/sbin/wipefs -af /dev/sdb
        ExecStart=/usr/sbin/parted /dev/sdb mklabel msdos
        ExecStart=/usr/sbin/parted /dev/sdb mkpart primary 2048s 100%
        ExecStart=/usr/sbin/mkfs.ext4 /dev/sdb1

It is started before the server is installed with coreos-install, and the partition is mounted with this piece of the cloud-config:

          - name: var-wal.mount
            command: start
            content: |
              [Mount]
              What=/dev/sdb1
              Where=/var/wal
              Type=ext4

The whole storage should be (and is) empty. According to the docs, the member folder should not change if wal-dir is specified (as far as I know).

@gyuho
Contributor

gyuho commented Feb 21, 2017

@pizzarabe Is your WAL directory writable? Maybe check the permissions?
.touch is a dummy file created by etcd to check whether the etcd process can write to that directory.
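
For illustration, the probe described above can be imitated in a few lines of shell. This is only a sketch of the idea, not etcd's actual implementation, and the function name can_write_dir is made up here. It also shows why the error in this issue reads "no such file or directory": creating the probe file fails when the directory itself is missing, not only when the permissions are wrong.

```shell
# Sketch of a ".touch"-style writability probe (not etcd's actual code).
# can_write_dir DIR: try to create and then remove DIR/.touch.
can_write_dir() {
    probe="$1/.touch"
    # Creating the probe fails if DIR does not exist or is not writable.
    # The subshell keeps a failed redirection from aborting the caller.
    if ( : > "$probe" ) 2>/dev/null; then
        rm -f "$probe"
        return 0
    fi
    echo "cannot write to directory: $1" >&2
    return 1
}

# Demo on throwaway paths: an existing directory passes, a missing one fails.
demo_dir=$(mktemp -d)
can_write_dir "$demo_dir" && echo "writable: yes"
can_write_dir "$demo_dir/member" || echo "writable: no (directory missing)"
rm -rf "$demo_dir"
```

On the affected host, the equivalent manual check would be to run the probe as the etcd user against /var/lib/etcd2/member and /var/wal.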

@xiang90
Contributor

xiang90 commented Feb 27, 2017

@pizzarabe kindly ping

@pizzarabe
Author

@gyuho

etcd2 runs as the user etcd. I changed the owner of /var/wal to etcd (and set SELinux to permissive), but still no success.

$ ps aux  | grep etcd
etcd     19361  3.6  0.3 435336 62696 ?        Ssl  11:52   0:00 /usr/bin/etcd2

$ mount | grep wal
/dev/sdb1 on /var/wal type ext4 (rw,relatime,seclabel,data=ordered)

$ ls -la /var
[...]
drwxr-xr-x.  3 etcd etcd 1024 Feb 28 10:54 wal


$ ls -la /var/lib/etcd2/
total 24
drwxr-xr-x.  3 etcd etcd 4096 Feb 28 10:54 .
drwxr-xr-x. 29 root root 4096 Feb 28 00:00 ..
drwx------.  4 etcd etcd 4096 Feb 28 10:54 member

$ getenforce 
Permissive

I followed https://coreos.com/etcd/docs/latest/etcd-live-cluster-reconfiguration.html#replace-a-failed-etcd-member-on-coreos-container-linux to remove and add the node to the cluster (so I can remove the wal dir located at /var/lib/etcd2/member)

Anyway, starting the node with wal-dir does not work:


-- Unit etcd2.service has begun starting up.
Feb 28 11:34:48 alien6 etcd2[17709]: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=https://10.169.1.136:2379
Feb 28 11:34:48 alien6 etcd2[17709]: recognized and used environment variable ETCD_CERT_FILE=/etc/ssl/alien/alien6.cert.pem
Feb 28 11:34:48 alien6 etcd2[17709]: recognized and used environment variable ETCD_CLIENT_CERT_AUTH=true
Feb 28 11:34:48 alien6 etcd2[17709]: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd2
Feb 28 11:34:48 alien6 etcd2[17709]: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=https://10.169.1.136:2380
Feb 28 11:34:48 alien6 etcd2[17709]: recognized and used environment variable ETCD_INITIAL_CLUSTER=alien6=https://10.169.1.136:2380,alien7=https://10.169.1.137:2380
Feb 28 11:34:48 alien6 etcd2[17709]: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=existing
Feb 28 11:34:48 alien6 etcd2[17709]: recognized and used environment variable ETCD_KEY_FILE=/etc/ssl/alien/alien6.key.pem
Feb 28 11:34:48 alien6 etcd2[17709]: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=https://0.0.0.0:2379,https://0.0.0.0:4001
Feb 28 11:34:48 alien6 etcd2[17709]: recognized and used environment variable ETCD_LISTEN_PEER_URLS=https://0.0.0.0:2380
Feb 28 11:34:48 alien6 etcd2[17709]: recognized and used environment variable ETCD_NAME=alien6
Feb 28 11:34:48 alien6 etcd2[17709]: recognized and used environment variable ETCD_PEER_CERT_FILE=/etc/ssl/alien/alien6.cert.pem
Feb 28 11:34:48 alien6 etcd2[17709]: recognized and used environment variable ETCD_PEER_CLIENT_CERT_AUTH=true
Feb 28 11:34:48 alien6 etcd2[17709]: recognized and used environment variable ETCD_PEER_KEY_FILE=/etc/ssl/alien/alien6.key.pem
Feb 28 11:34:48 alien6 etcd2[17709]: recognized and used environment variable ETCD_PEER_TRUSTED_CA_FILE=/etc/ssl/alien/alienca.cert.pem
Feb 28 11:34:48 alien6 etcd2[17709]: recognized and used environment variable ETCD_TRUSTED_CA_FILE=/etc/ssl/alien/alienca.cert.pem
Feb 28 11:34:48 alien6 etcd2[17709]: recognized and used environment variable ETCD_WAL_DIR=/var/wal
Feb 28 11:34:48 alien6 etcd2[17709]: unrecognized environment variable ETCD_DISCOVERY=
Feb 28 11:34:48 alien6 etcd2[17709]: etcd Version: 2.3.7
Feb 28 11:34:48 alien6 etcd2[17709]: Git SHA: fd17c91
Feb 28 11:34:48 alien6 etcd2[17709]: Go Version: go1.7.3
Feb 28 11:34:48 alien6 etcd2[17709]: Go OS/Arch: linux/amd64
Feb 28 11:34:48 alien6 etcd2[17709]: setting maximum number of CPUs to 6, total number of available CPUs is 6
Feb 28 11:34:48 alien6 etcd2[17709]: the server is already initialized as member before, starting as etcd member...
Feb 28 11:34:48 alien6 etcd2[17709]: peerTLS: cert = /etc/ssl/alien/alien6.cert.pem, key = /etc/ssl/alien/alien6.key.pem, ca = , trusted-ca = /etc/ssl/alien/alienca.c
Feb 28 11:34:48 alien6 etcd2[17709]: listening for peers on https://0.0.0.0:2380
Feb 28 11:34:48 alien6 etcd2[17709]: clientTLS: cert = /etc/ssl/alien/alien6.cert.pem, key = /etc/ssl/alien/alien6.key.pem, ca = , trusted-ca = /etc/ssl/alien/alienca
Feb 28 11:34:48 alien6 etcd2[17709]: listening for client requests on https://0.0.0.0:2379
Feb 28 11:34:48 alien6 etcd2[17709]: listening for client requests on https://0.0.0.0:4001
Feb 28 11:34:48 alien6 etcd2[17709]: stopping listening for client requests on https://0.0.0.0:4001
Feb 28 11:34:48 alien6 etcd2[17709]: stopping listening for client requests on https://0.0.0.0:2379
Feb 28 11:34:48 alien6 etcd2[17709]: stopping listening for peers on https://0.0.0.0:2380
Feb 28 11:34:48 alien6 etcd2[17709]: open /var/lib/etcd2/member/snap: no such file or directory
Feb 28 11:34:48 alien6 systemd[1]: etcd2.service: Main process exited, code=exited, status=1/FAILURE
Feb 28 11:34:48 alien6 systemd[1]: Failed to start etcd2.
-- Subject: Unit etcd2.service has failed

After removing the wal-dir option from the systemd unit file, the service works.

Feb 28 11:35:29 alien6 systemd[1]: Starting etcd2...
Feb 28 11:35:29 alien6 etcd2[17832]: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=https://10.169.1.136:2379
Feb 28 11:35:29 alien6 etcd2[17832]: recognized and used environment variable ETCD_CERT_FILE=/etc/ssl/alien/alien6.cert.pem
Feb 28 11:35:29 alien6 etcd2[17832]: recognized and used environment variable ETCD_CLIENT_CERT_AUTH=true
Feb 28 11:35:29 alien6 etcd2[17832]: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd2
Feb 28 11:35:29 alien6 etcd2[17832]: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=https://10.169.1.136:2380
Feb 28 11:35:29 alien6 etcd2[17832]: recognized and used environment variable ETCD_INITIAL_CLUSTER=alien6=https://10.169.1.136:2380,alien7=https://10.169.1.137:2380
Feb 28 11:35:29 alien6 etcd2[17832]: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=existing
Feb 28 11:35:29 alien6 etcd2[17832]: recognized and used environment variable ETCD_KEY_FILE=/etc/ssl/alien/alien6.key.pem
Feb 28 11:35:29 alien6 etcd2[17832]: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=https://0.0.0.0:2379,https://0.0.0.0:4001
Feb 28 11:35:29 alien6 etcd2[17832]: recognized and used environment variable ETCD_LISTEN_PEER_URLS=https://0.0.0.0:2380
Feb 28 11:35:29 alien6 etcd2[17832]: recognized and used environment variable ETCD_NAME=alien6
Feb 28 11:35:29 alien6 etcd2[17832]: recognized and used environment variable ETCD_PEER_CERT_FILE=/etc/ssl/alien/alien6.cert.pem
Feb 28 11:35:29 alien6 etcd2[17832]: recognized and used environment variable ETCD_PEER_CLIENT_CERT_AUTH=true
Feb 28 11:35:29 alien6 etcd2[17832]: recognized and used environment variable ETCD_PEER_KEY_FILE=/etc/ssl/alien/alien6.key.pem
Feb 28 11:35:29 alien6 etcd2[17832]: recognized and used environment variable ETCD_PEER_TRUSTED_CA_FILE=/etc/ssl/alien/alienca.cert.pem
Feb 28 11:35:29 alien6 etcd2[17832]: recognized and used environment variable ETCD_TRUSTED_CA_FILE=/etc/ssl/alien/alienca.cert.pem
Feb 28 11:35:29 alien6 etcd2[17832]: unrecognized environment variable ETCD_DISCOVERY=
Feb 28 11:35:29 alien6 etcd2[17832]: etcd Version: 2.3.7
Feb 28 11:35:29 alien6 etcd2[17832]: Git SHA: fd17c91
Feb 28 11:35:29 alien6 etcd2[17832]: Go Version: go1.7.3
Feb 28 11:35:29 alien6 etcd2[17832]: Go OS/Arch: linux/amd64
Feb 28 11:35:29 alien6 etcd2[17832]: setting maximum number of CPUs to 6, total number of available CPUs is 6
Feb 28 11:35:29 alien6 etcd2[17832]: the server is already initialized as member before, starting as etcd member...
Feb 28 11:35:29 alien6 etcd2[17832]: peerTLS: cert = /etc/ssl/alien/alien6.cert.pem, key = /etc/ssl/alien/alien6.key.pem, ca = , trusted-ca = /etc/ssl/alien/alienca.c
Feb 28 11:35:29 alien6 etcd2[17832]: listening for peers on https://0.0.0.0:2380
Feb 28 11:35:29 alien6 etcd2[17832]: clientTLS: cert = /etc/ssl/alien/alien6.cert.pem, key = /etc/ssl/alien/alien6.key.pem, ca = , trusted-ca = /etc/ssl/alien/alienca
Feb 28 11:35:29 alien6 etcd2[17832]: listening for client requests on https://0.0.0.0:2379
Feb 28 11:35:29 alien6 etcd2[17832]: listening for client requests on https://0.0.0.0:4001
Feb 28 11:35:29 alien6 etcd2[17832]: name = alien6
Feb 28 11:35:29 alien6 etcd2[17832]: data dir = /var/lib/etcd2
Feb 28 11:35:29 alien6 etcd2[17832]: member dir = /var/lib/etcd2/member
Feb 28 11:35:29 alien6 etcd2[17832]: heartbeat = 100ms
Feb 28 11:35:29 alien6 etcd2[17832]: election = 1000ms
Feb 28 11:35:29 alien6 etcd2[17832]: snapshot count = 10000
Feb 28 11:35:29 alien6 etcd2[17832]: advertise client URLs = https://10.169.1.136:2379
Feb 28 11:35:29 alien6 etcd2[17832]: starting member 8b2ffe3af067cda9 in cluster 9a8d8f9371a3bd44
Feb 28 11:35:29 alien6 etcd2[17832]: 8b2ffe3af067cda9 became follower at term 0
Feb 28 11:35:29 alien6 etcd2[17832]: newRaft 8b2ffe3af067cda9 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
Feb 28 11:35:29 alien6 etcd2[17832]: 8b2ffe3af067cda9 became follower at term 1
Feb 28 11:35:29 alien6 etcd2[17832]: starting server... [version: 2.3.7, cluster version: to_be_decided]
Feb 28 11:35:29 alien6 systemd[1]: Started etcd2.

The error message is different, but I guess it is related.
After configuring the wal-dir option, etcd is unable to create directories/files in /var/lib/etcd2/member (?), but it is able to do that without wal-dir.
Does this make sense?

@gyuho
Contributor

gyuho commented Feb 28, 2017

Did you remove any files in data-dir before restart? Simple steps to reproduce would be helpful.

@pizzarabe
Author

@gyuho yes, I removed the whole content

Clean up the /var/lib/etcd2 directory:
$ sudo rm -rf /var/lib/etcd2/*

To reproduce:

  • 3 Node Cluster
  • stop an etcd node
  • remove node from cluster
    $ etcdctl member remove <id>
  • cleanup /var/lib/etcd2
    $ sudo rm -rf /var/lib/etcd2/*
  • add member to cluster
    $ etcdctl member add <nodename> <endpoint>
  • create /run/systemd/system/etcd2.service.d/99-restore.conf to add a node to an existing cluster (with an empty discovery URL)
  • add wal-dir to /run/systemd/system/etcd2.service.d/20-cloudinit.conf
    Environment="ETCD_WAL_DIR=/var/wal"
  • $ sudo systemctl daemon-reload && sudo systemctl start etcd2

if I skip

  • add wal-dir to /run/systemd/system/etcd2.service.d/20-cloudinit.conf
    Environment="ETCD_WAL_DIR=/var/wal"

it works :/

Like I said, I followed the docs at https://coreos.com/etcd/docs/latest/etcd-live-cluster-reconfiguration.html#replace-a-failed-etcd-member-on-coreos-container-linux

@gyuho
Contributor

gyuho commented Feb 28, 2017

If you run sudo rm -rf /var/lib/etcd2/* to start a fresh node, you need to remove your old wal directory as well. Otherwise etcd will try to reload logs from the old wal directory. I think that's what's going on here? Do you have old wal data on that machine?

@heyitsanthony
Contributor

@gyuho @xiang90 should etcd have a better error message / detect when trying to boot with a custom wal directory when the member directory is missing?

@gyuho
Contributor

gyuho commented Feb 28, 2017

It will print out the dedicated wal directory in the log:

/etcd --wal-dir hello
2017-02-28 10:24:18.710152 I | etcdmain: etcd Version: 3.2.0+git
2017-02-28 10:24:18.710242 I | etcdmain: Git SHA: 1867d26
2017-02-28 10:24:18.710246 I | etcdmain: Go Version: go1.8
2017-02-28 10:24:18.710249 I | etcdmain: Go OS/Arch: darwin/amd64
2017-02-28 10:24:18.710253 I | etcdmain: setting maximum number of CPUs to 8, total number of available CPUs is 8
2017-02-28 10:24:18.710260 W | etcdmain: no data-dir provided, using default data-dir ./default.etcd
2017-02-28 10:24:18.710316 W | etcdmain: found invalid file/dir .DS_Store under data dir default.etcd (Ignore this if you are upgrading etcd)
2017-02-28 10:24:18.710324 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2017-02-28 10:24:18.710333 N | etcdmain: failed to detect default host, advertise falling back to "localhost" (default host not supported on darwin_amd64)
2017-02-28 10:24:18.710641 I | embed: listening for peers on http://localhost:2380
2017-02-28 10:24:18.710790 I | embed: listening for client requests on localhost:2379
2017-02-28 10:24:18.711601 I | etcdserver: name = default
2017-02-28 10:24:18.711610 I | etcdserver: data dir = default.etcd
2017-02-28 10:24:18.711613 I | etcdserver: member dir = default.etcd/member
2017-02-28 10:24:18.711620 I | etcdserver: dedicated WAL dir = hello

@gyuho
Contributor

gyuho commented Feb 28, 2017

I misread the comment. I think it would be useful. Thanks!

trying to boot with a custom wal directory when the member directory is missing

^ @heyitsanthony

@pizzarabe
Author

@gyuho

Do you have old wal data in that machine?

No, the old wal dir was located at /var/lib/etcd2/wal and was removed with sudo rm -rf /var/lib/etcd2/*

@gyuho
Contributor

gyuho commented Mar 1, 2017

If you removed the data and wal directories completely, how does your etcd log show this line?

Feb 28 11:34:48 alien6 etcd2[17709]: the server is already initialized as member before, starting as etcd member...

This means you had /var/lib/etcd2/member when you started this service.
Can you double-check?

@pizzarabe
Author

Feb 28 11:34:48 alien6 etcd2[17709]: the server is already initialized as member before, starting as etcd member...

This means you had /var/lib/etcd2/member when you started this service.
Can you double-check?

Maybe I did not remove the member dir at that point...

Anyway, I removed it now too. If I try to start etcd2 with a dedicated wal dir and no member dir I get the following error (again):

Mar 01 16:43:56 alien6 etcd2[20036]: cannot write to member directory: open /var/lib/etcd2/member/.touch: no such file or directory
Mar 01 16:43:56 alien6 systemd[1]: etcd2.service: Main process exited, code=exited, status=1/FAILURE
Mar 01 16:43:56 alien6 systemd[1]: Failed to start etcd2.

Without removing the member dir and with a dedicated wal-dir configured:

Mar 01 16:49:42 alien6 etcd2[20510]: open /var/lib/etcd2/member/snap: no such file or directory
Mar 01 16:49:42 alien6 systemd[1]: etcd2.service: Main process exited, code=exited, status=1/FAILURE
Mar 01 16:49:42 alien6 systemd[1]: Failed to start etcd2

If I do not configure wal-dir, it always works (with and without the member dir)

Mar 01 16:54:24 alien6 etcd2[21030]: data dir = /var/lib/etcd2
Mar 01 16:54:24 alien6 etcd2[21030]: member dir = /var/lib/etcd2/member
[...]
Mar 01 16:54:24 alien6 systemd[1]: Started etcd2.

@gyuho
Contributor

gyuho commented Mar 1, 2017

Can you share your etcd2.service file? And also did you make any change in etcd by any chance? 2.3.7 was released with Go 1.6.2, and I see your log says Go 1.7.3 although it won't make much difference.

@heyitsanthony
Contributor

If I try to start etcd2 with a dedicated wal dir and no member dir I get the following error (again):

Mar 01 16:43:56 alien6 etcd2[20036]: cannot write to member directory: open /var/lib/etcd2/member/.touch: no such file or directory
Mar 01 16:43:56 alien6 systemd[1]: etcd2.service: Main process exited, code=exited, status=1/FAILURE
Mar 01 16:43:56 alien6 systemd[1]: Failed to start etcd2.

This is expected behavior; it should fail to start. It needs a better error message like I suggested.

Without removing the member dir and with a dedicated wal-dir configured:

Mar 01 16:49:42 alien6 etcd2[20510]: open /var/lib/etcd2/member/snap: no such file or directory
Mar 01 16:49:42 alien6 systemd[1]: etcd2.service: Main process exited, code=exited, status=1/FAILURE
Mar 01 16:49:42 alien6 systemd[1]: Failed to start etcd2

This seems corrupted in some way. The member directory is missing contents. It should not start.

If I do not configure wal-dir, it always works (with and without the member dir)

It's filling in the wal data. If the member directory already exists, it probably should refuse to start if there are no wal files.

The member directory and wal directory depend on each other. If either is missing data, etcd should usually refuse to boot. This isn't as much of a problem when the wal directory is inside the member directory since if the member directory is missing, the wal directory will be missing too.
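
The dependency described above can be checked by hand before starting etcd. The following is a minimal sketch, assuming that "has data" simply means "the directory contains at least one entry"; check_etcd_dirs and dir_has_entries are hypothetical helper names, not part of etcd or etcdctl.

```shell
# Hypothetical pre-flight check; these functions are not part of etcd.
# dir_has_entries DIR: succeed if DIR exists and contains at least one entry.
dir_has_entries() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# check_etcd_dirs MEMBER_DIR WAL_DIR: print "ok" when both directories hold
# data or neither does; otherwise print which side is missing and fail.
check_etcd_dirs() {
    if dir_has_entries "$1"; then member=yes; else member=no; fi
    if dir_has_entries "$2"; then wal=yes; else wal=no; fi
    case "$member/$wal" in
        yes/yes|no/no)
            echo "ok"
            return 0 ;;
        yes/no)
            echo "mismatch: member dir has data, WAL dir is empty"
            return 1 ;;
        no/yes)
            echo "mismatch: WAL dir has data, member dir is missing or empty"
            return 1 ;;
    esac
}
```

For the setup in this thread the call would be check_etcd_dirs /var/lib/etcd2/member /var/wal. The check is deliberately strict: it flags either side being empty, which follows the rule stated in the comment rather than the exact behavior of etcd 2.3.7.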

@pizzarabe
Author

pizzarabe commented Mar 2, 2017

@gyuho this is the normal installation of etcd from CoreOS (Container Linux) 1325.1.0 alpha. I did not change anything in the source or anything like that.
Edit: Since CoreOS 1221.0.0 Go binaries are built with 1.7.3 of golang (https://coreos.com/releases/#1221.0.0)

$ systemctl cat etcd2
# /usr/lib/systemd/system/etcd2.service
[Unit]
Description=etcd2
Conflicts=etcd.service

[Service]
User=etcd
Type=notify
Environment=ETCD_DATA_DIR=/var/lib/etcd2
Environment=ETCD_NAME=%m
ExecStart=/usr/bin/etcd2
Restart=always
RestartSec=10s
LimitNOFILE=40000
TimeoutStartSec=0

[Install]
WantedBy=multi-user.target

# /run/systemd/system/etcd2.service.d/20-cloudinit.conf
[Service]
Environment="ETCD_ADVERTISE_CLIENT_URLS=https://10.169.1.136:2379"
Environment="ETCD_CERT_FILE=/etc/ssl/alien/alien6.cert.pem"
Environment="ETCD_CLIENT_CERT_AUTH=true"
Environment="ETCD_DISCOVERY=http://10.169.1.129:2379/v2/keys/discovery/88fc9f7cc488dc43937718ffa3800e8778fa964b6d83a53307a3d7ef4f9a66f9"
Environment="ETCD_INITIAL_ADVERTISE_PEER_URLS=https://10.169.1.136:2380"
Environment="ETCD_KEY_FILE=/etc/ssl/alien/alien6.key.pem"
Environment="ETCD_LISTEN_CLIENT_URLS=https://0.0.0.0:2379,https://0.0.0.0:4001"
Environment="ETCD_LISTEN_PEER_URLS=https://0.0.0.0:2380"
Environment="ETCD_NAME=alien6"
Environment="ETCD_PEER_CERT_FILE=/etc/ssl/alien/alien6.cert.pem"
Environment="ETCD_PEER_KEY_FILE=/etc/ssl/alien/alien6.key.pem"
Environment="ETCD_PEER_CLIENT_CERT_AUTH=true"
Environment="ETCD_PEER_TRUSTED_CA_FILE=/etc/ssl/alien/alienca.cert.pem"
Environment="ETCD_TRUSTED_CA_FILE=/etc/ssl/alien/alienca.cert.pem"
#The wal dir I am trying to use
#Environment="ETCD_WAL_DIR=/var/wal"

# /run/systemd/system/etcd2.service.d/99-restore.conf
# drop-in to add the node to the cluster (according to https://coreos.com/etcd/docs/latest/etcd-live-cluster-reconfiguration.html#replace-a-failed-etcd-member-on-coreos-container-linux)
[Service]
Environment="ETCD_DISCOVERY="
Environment="ETCD_NAME=alien6"
Environment="ETCD_INITIAL_CLUSTER=alien6=https://10.169.1.136:2380,alien7=https://10.169.1.137:2380,alien5=https://10.169.1.135:2380"
Environment="ETCD_INITIAL_CLUSTER_STATE=existing"

@heyitsanthony

This is expected behavior; it should fail to start. It needs a better error message like I suggested.

This seems corrupted in some way. The member directory is missing contents. It should not start.

I am not sure I understand that correctly: should both ways fail? How can I configure a dedicated wal dir then?

@heyitsanthony
Contributor

heyitsanthony commented Mar 2, 2017

@pizzarabe the dedicated wal directory should be created / destroyed along with the member directory. It would be configured at the time of member directory creation.

@pizzarabe
Author

So this seems like a real bug and not a layer 8 error?

@lycclsltt

lycclsltt commented Mar 3, 2017 via email

@heyitsanthony
Contributor

@pizzarabe the bug is that it's not giving a reasonable error on panic. If etcd tries to start with some configured bits on disk but others missing, it should refuse to boot.

@pizzarabe
Author

the bug is that it's not giving a reasonable error on panic. If etcd tries to start with some configured bits on disk but others missing, it should refuse to boot.

If this is the only bug here, how should I configure the wal dir to get it working?

@heyitsanthony
Contributor

heyitsanthony commented Mar 6, 2017

@pizzarabe if you delete the member directory, then delete the wal directory too. If the member is already initialized with a wal directory inside the member directory, move it to what's defined in -wal-dir.
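
The second case (moving an existing WAL out of the member directory) could look roughly like the sketch below. It assumes the default layout in which the WAL lives in a wal subdirectory of the member directory, and that etcd is stopped while the files are moved; move_wal is a hypothetical helper, not an etcd tool.

```shell
# Hypothetical helper, not an etcd tool. Run it only while etcd is stopped.
# move_wal MEMBER_DIR WAL_DIR: move the files in MEMBER_DIR/wal to WAL_DIR.
move_wal() {
    src="$1/wal"
    dst="$2"
    [ -d "$src" ] || { echo "no WAL directory at $src" >&2; return 1; }
    [ -d "$dst" ] || { echo "target $dst does not exist" >&2; return 1; }
    # Refuse to mix with WAL files that are already in the target.
    if ls "$dst"/*.wal >/dev/null 2>&1; then
        echo "target $dst already contains WAL files" >&2
        return 1
    fi
    # Move every entry (including dotfiles), then drop the now-empty source.
    for f in "$src"/* "$src"/.[!.]*; do
        if [ -e "$f" ]; then
            mv "$f" "$dst"/
        fi
    done
    rmdir "$src"
}
```

With the paths from this thread the call would be move_wal /var/lib/etcd2/member /var/wal, followed by setting ETCD_WAL_DIR=/var/wal and starting etcd2 again. The moved files must stay owned by the user etcd runs as.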

@pizzarabe
Author

pizzarabe commented Mar 8, 2017

@heyitsanthony Okay, this does work if I copy/move the wal dir to the dedicated directory after the node has joined the cluster. But if I try to bootstrap the cluster with new nodes, they fail with:
cannot write to member directory: open /var/lib/etcd2/member/.touch: no such file or directory

The node is clean, a new OS installation and the dedicated wal dir storage was wiped before the installation:

$ /usr/sbin/wipefs -af /dev/sdb
$ /usr/sbin/parted /dev/sdb mklabel msdos
$ /usr/sbin/parted /dev/sdb mkpart primary 2048s 100%
$ /usr/sbin/mkfs.ext4 /dev/sdb1


$ cd /var/wal/
$ ls -la
total 21
drwxr-xr-x.  3 etcd etcd  1024 Mar  8 15:49 .
drwxr-xr-x. 11 root root  4096 Mar  8 15:52 ..
drwx------.  2 etcd etcd 12288 Mar  8 15:49 lost+found

And again, after removing the line wal-dir: "/var/wal/" from the cloud-config, the node can start and bootstrap a cluster.

@heyitsanthony
Contributor

This is similar to the problem from before. There's an existing wal directory and no member directory. etcd must have both the wal directory and member directory or neither in order to boot.

etcd is interpreting the /var/wal as being an initialized wal directory because it has some files in it (lost+found). This can be tightened up on the etcd side.

@pizzarabe
Author

etcd is interpreting the /var/wal as being an initialized wal directory because it has some files in it (lost+found).

I can confirm that this is exactly the problem (creating a directory under that path solved it for me).
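
A minimal sketch of that workaround, assuming the dedicated disk is mounted at /var/wal as in this thread: the root of a freshly formatted ext4 filesystem contains lost+found, while a newly created subdirectory is empty, so etcd should not mistake it for an existing WAL. The helper name prepare_wal_subdir and the default subdirectory name "etcd" are made up for illustration.

```shell
# prepare_wal_subdir MOUNT_POINT [NAME]: create an empty subdirectory for the
# WAL and print the systemd drop-in line that points etcd at it.
# Hypothetical helper; "etcd" as the default NAME is an arbitrary choice.
prepare_wal_subdir() {
    sub="$1/${2:-etcd}"
    mkdir -p "$sub" || return 1
    # The subdirectory must be empty, unlike the mount root (lost+found).
    if [ -n "$(ls -A "$sub")" ]; then
        echo "$sub is not empty" >&2
        return 1
    fi
    echo "Environment=\"ETCD_WAL_DIR=$sub\""
}
```

On the real host this would be run as root with /var/wal as the mount point. The new directory then needs to be owned by the etcd user (the unit file above runs etcd2 with User=etcd), and the printed line goes into the 20-cloudinit.conf drop-in in place of the old ETCD_WAL_DIR value.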
