hostnamectl fails in coreos-1520.6.0 #2193

Closed
jeevarathinam-dhanapal opened this Issue Oct 13, 2017 · 27 comments

jeevarathinam-dhanapal commented Oct 13, 2017

Issue Report

Bug

Container Linux Version

1520.6.0

Environment

AWS

Expected Behavior

Hostname is properly set

Actual Behavior

"ip-10-1-1-154 jeeva # hostnamectl set-hostname k8s-worker-1507708889.ops1.srcclr.com
Could not set property: Failed to activate service 'org.freedesktop.hostname1': timed out"

systemd-hostnamed.service also failed.

Logs:

Oct 13 04:24:14 ip-10-1-1-154.us-west-2.compute.internal systemd[1]: systemd-hostnamed.service: Main process exited, code=exited, status=226/NAMESPACE
Oct 13 04:24:14 ip-10-1-1-154.us-west-2.compute.internal systemd[1]: Failed to start Hostname Service.
Oct 13 04:24:14 ip-10-1-1-154.us-west-2.compute.internal systemd[1]: systemd-hostnamed.service: Unit entered failed state.

All the systemd files were updated on Oct 12th.

Reproduction Steps

  1. hostnamectl set-hostname
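
A minimal sketch of reproducing and inspecting the failure, assembled from the commands and units named in this report (the hostname is illustrative):

# Attempt to set the hostname; on affected nodes this times out
# activating org.freedesktop.hostname1:
hostnamectl set-hostname k8s-worker-example.ops1.srcclr.com

# Inspect the failed unit and its journal:
systemctl status systemd-hostnamed.service
journalctl -u systemd-hostnamed.service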

Other Information

We are facing this only after the recent update to the systemd files.

bgilbert (Member) commented Oct 13, 2017

I'm unable to reproduce this. Do you have any custom configuration (e.g. systemd drop-ins) for systemd-hostnamed.service or any other systemd-* services?

Possibly related: #2190 reports systemd-resolved.service failing with status=226/NAMESPACE.

jeevarathinam-dhanapal commented Oct 13, 2017

We have flanneld and a docker wrapper in /etc/systemd/system. I'm not sure if you are looking for this.
I also see status=226/NAMESPACE.

"oem-cloudinit.service -> /etc/systemd/system/oem-cloudinit.service
lrwxrwxrwx. 1 root root 38 Oct 12 04:33 ansible-pull.timer -> /etc/systemd/system/ansible-pull.timer
lrwxrwxrwx. 1 root root 40 Oct 12 04:33 flanneld.service -> /usr/lib/systemd/system/flanneld.service
lrwxrwxrwx. 1 root root 34 Oct 12 04:33 docker.service -> /run/systemd/system/docker.service
lrwxrwxrwx. 1 root root 35 Oct 12 04:35 kubelet.service -> /etc/systemd/system/kubelet.service
k8s-worker-1507782522 multi-user.target.wants # pwd
/etc/systemd/system/multi-user.target.wants"

I have the file /var/lib/coreos-install/user_data; its content:

#cloud-config
hostname: k8s-worker-1507782522.ops1.srcclr.com
coreos:
  flannel:
    etcd_endpoints: htt

/etc/environment
COREOS_PRIVATE_IPV4=10.x.x.x

jeevarathinam-dhanapal commented Oct 15, 2017

I tried creating a drop-in (systemctl edit systemd-hostnamed.service --full) but the result is the same.

"Oct 15 04:00:27 ip-10-1-1-154.us-west-2.compute.internal systemd[1]: systemd-hostnamed.service: Unit entered failed state.
Oct 15 04:00:27 ip-10-1-1-154.us-west-2.compute.internal systemd[1]: systemd-hostnamed.service: Failed with result 'exit-code'.
Oct 15 04:00:27 ip-10-1-1-154.us-west-2.compute.internal systemd[1]: systemd-hostnamed.service: cgroup is empty"

ajeddeloh commented Oct 18, 2017

> I tried creating a drop-in (systemctl edit systemd-hostnamed.service --full) but the result is the same.

I assume you're talking about the drop-in mentioned in #2190. Can you give the output of systemctl cat systemd-hostnamed.service with the drop-in applied?

Are you running on HVM or PV?

It looks like the cloud-config you included was cut off. Also, using backticks instead of quotes will monospace your text and make it easier to read.

jeevarathinam-dhanapal commented Oct 19, 2017

systemd-hostnamed.service:

[Unit]
Description=Hostname Service
Documentation=man:systemd-hostnamed.service(8) man:hostname(5) man:machine-info(5)
Documentation=https://www.freedesktop.org/wiki/Software/systemd/hostnamed

[Service]
ExecStart=/usr/lib/systemd/systemd-hostnamed
BusName=org.freedesktop.hostname1
WatchdogSec=3min
CapabilityBoundingSet=CAP_SYS_ADMIN
PrivateTmp=yes
PrivateDevices=yes
PrivateNetwork=yes
ProtectSystem=strict
ProtectHome=yes
ProtectControlGroups=yes
ProtectKernelTunables=yes
ProtectKernelModules=yes
MemoryDenyWriteExecute=yes
RestrictRealtime=yes
RestrictNamespaces=yes
RestrictAddressFamilies=AF_UNIX
SystemCallFilter=~@clock @cpu-emulation @debug @keyring @module @mount @obsolete @raw-io @reboot @swap
SystemCallArchitectures=native
ReadWritePaths=/etc

Cloud-config content

#cloud-config
hostname: k8s-worker-1507708889.ops1.srcclr.com
coreos:
  flannel:
    etcd_endpoints: http://etcd-1.xxx.io:2379,http://etcd-2.xxx:2379,http://etcd-3.xxx.io:2379

Our instances are HVM.

ajeddeloh commented Oct 19, 2017

Can you try this drop-in for systemd-hostnamed.service (or change it with systemctl edit systemd-hostnamed.service --full)?

[Service]
ProtectSystem=full

Then run systemctl daemon-reload, verify the ProtectSystem line changed when you view it with systemctl cat systemd-hostnamed.service, and see if that changes anything.

Also, I should have mentioned, for multi-line monospace blocks you need triple backticks on their own line before and after the block.
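
For reference, a sketch of the same override created as an on-disk drop-in rather than via systemctl edit; the file name below is illustrative (any *.conf under the unit's .d directory is merged into the unit):

# /etc/systemd/system/systemd-hostnamed.service.d/10-protect-system.conf
# ProtectSystem=full makes /usr, /boot, and /etc read-only (subject to
# ReadWritePaths=); strict makes the entire file system hierarchy
# read-only except /dev, /proc, and /sys.
[Service]
ProtectSystem=full

Then:

systemctl daemon-reload
systemctl restart systemd-hostnamed.service
systemctl cat systemd-hostnamed.service   # confirm the override is applied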

jeevarathinam-dhanapal commented Oct 19, 2017

That did the trick; systemd-hostnamed.service works now. But only a few nodes were affected, not all, and there is no difference in how the nodes were brought up. Is there a way I can track down what could have caused this?

ajeddeloh commented Oct 19, 2017

Glad to hear that worked. The fact that only some nodes fail to come up despite them all being brought up the same way makes me think there might be a race somewhere. I wasn't able to reproduce it before, but it could be I just wasn't hitting it by chance, so I'll try reproducing it again. Have you experienced this on new nodes or just ones that were upgraded?

The error is almost certainly coming from when systemd sets up the mount namespace for a new process; however, it's not obvious why that is failing.

Do you see anything along the lines of:

systemd-hostnamed.service: Failed at step NAMESPACE spawning /usr/lib/systemd/systemd-hostnamed: <SOME ERROR>

when running sudo journalctl -p7 --no-pager UNIT=systemd-hostnamed.service?

parabolic commented Oct 23, 2017

Same here, I am getting this error with Container Linux by CoreOS stable (1520.6.0). I am running the AWS AMI ami-7edf0d07.

systemd-resolved.service: Failed at step NAMESPACE spawning /usr/lib/systemd/systemd-resolved: Value too large for defined data type

It's also worth mentioning that this snippet in the unit file doesn't help:

[Service]
ProtectSystem=full

Sorry, my bad: I didn't see that the ProtectSystem=strict line was a bit further down in the configuration.
I changed it to ProtectSystem=full (with the commands provided by @ajeddeloh) and now it works. I am going to create a drop-in to see if it works on startup.

Here's my systemd-resolved.service file now:

#  This file is part of systemd.
#
#  systemd is free software; you can redistribute it and/or modify it
#  under the terms of the GNU Lesser General Public License as published by
#  the Free Software Foundation; either version 2.1 of the License, or
#  (at your option) any later version.

[Unit]
Description=Network Name Resolution
Documentation=man:systemd-resolved.service(8)
Documentation=https://www.freedesktop.org/wiki/Software/systemd/resolved
Documentation=https://www.freedesktop.org/wiki/Software/systemd/writing-network-configuration-managers
Documentation=https://www.freedesktop.org/wiki/Software/systemd/writing-resolver-clients
After=systemd-networkd.service network.target
Before=network-online.target nss-lookup.target
Wants=nss-lookup.target

# On kdbus systems we pull in the busname explicitly, because it
# carries policy that allows the daemon to acquire its name.
Wants=org.freedesktop.resolve1.busname
After=org.freedesktop.resolve1.busname

[Service]
Type=notify
Restart=always
RestartSec=0
ExecStart=/usr/lib/systemd/systemd-resolved
WatchdogSec=3min
CapabilityBoundingSet=CAP_SETUID CAP_SETGID CAP_SETPCAP CAP_CHOWN CAP_DAC_OVERRIDE CAP_FOWNER CAP_NET_RAW CAP_NET_BIND_SERVICE
PrivateTmp=yes
PrivateDevices=yes
ProtectSystem=full
ProtectHome=yes
ProtectControlGroups=yes
ProtectKernelTunables=yes
ProtectKernelModules=yes
MemoryDenyWriteExecute=yes
RestrictRealtime=yes
RestrictAddressFamilies=AF_UNIX AF_NETLINK AF_INET AF_INET6
SystemCallFilter=~@clock @cpu-emulation @debug @keyring @module @mount @obsolete @raw-io @reboot @swap
SystemCallArchitectures=native
ReadWritePaths=/run/systemd

[Install]
WantedBy=multi-user.target
Alias=dbus-org.freedesktop.resolve1.service

ajeddeloh commented Oct 23, 2017

Can you set LogLevel=debug in /etc/systemd/system.conf and then run sudo systemctl daemon-reload? That should make it easier to tell where the unit is failing. Then try restarting the unit and post the output of sudo journalctl -p7 UNIT=systemd-resolved.service + _SYSTEMD_UNIT=systemd-resolved.service

What does df -iT show?
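
A minimal sketch of that debug sequence, assuming the stock file layout (LogLevel is a [Manager] option in /etc/systemd/system.conf):

# In /etc/systemd/system.conf, under the [Manager] section:
#   [Manager]
#   LogLevel=debug
sudo systemctl daemon-reload
sudo systemctl restart systemd-resolved.service
# The '+' joins the two journal match groups with a logical OR:
sudo journalctl -p7 UNIT=systemd-resolved.service + _SYSTEMD_UNIT=systemd-resolved.service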

parabolic commented Oct 24, 2017

Hi @ajeddeloh,
I am seeing this only in newly created instances with the latest stable CoreOS. The rest are not updated, and now I am afraid to update them :). I saw it every time I started the instance; no startup has passed without one of these services failing (or both at the same time):

  • systemd-resolved.service
  • systemd-hostnamed.service

which led to subsequent failures of other services (DNS doesn't work, etc.).

Here's the output after setting LogLevel=debug:
https://gist.github.com/parabolic/69f3fae95a4cd26a9cc56ad5071d8a8d

Output from df -iT

Filesystem                                Type       Inodes IUsed    IFree IUse% Mounted on
devtmpfs                                  devtmpfs   122848   300   122548    1% /dev
tmpfs                                     tmpfs      127281     1   127280    1% /dev/shm
tmpfs                                     tmpfs      127281   496   126785    1% /run
tmpfs                                     tmpfs      127281    16   127265    1% /sys/fs/cgroup
/dev/xvda9                                ext4     25474432 48283 25426149    1% /
/dev/mapper/usr                           ext4       260096 19603   240493    8% /usr
none                                      tmpfs      127281    26   127255    1% /run/torcx/unpack
/dev/xvda1                                vfat            0     0        0     - /boot
tmpfs                                     tmpfs      127281     1   127280    1% /media
tmpfs                                     tmpfs      127281    11   127270    1% /tmp
/dev/xvda6                                ext4        32768    13    32755    1% /usr/share/oem
<efs_endpoint>:/                          nfs4            0     0        0     - /mnt/efs
tmpfs                                     tmpfs      127281     5   127276    1% /run/user/500

Hope this helps!
Cheers.

ajeddeloh commented Oct 24, 2017

@parabolic Thanks for the logs. Can you run strace -f -p 1 -s500 and trigger the issue?

parabolic commented Oct 30, 2017

Hi @ajeddeloh, and sorry for not replying earlier, but I got sick and couldn't. I am not sure I'll be able to provide the output from the command because of the tight schedule I am on at the moment. I'll try to simulate the problem by spinning up a CoreOS stable instance and passing basic user data.
Cheers!

ajeddeloh commented Nov 15, 2017

@parabolic have you had a chance to look at this again?

parabolic commented Nov 16, 2017

Hi @ajeddeloh, unfortunately no, I haven't. I got dragged into something else.
Today or tomorrow I'll build a new cluster with CoreOS, and I'll try it without the drop-ins and provide you with an strace.
Thank you for your patience.

parabolic commented Nov 16, 2017

Hi @ajeddeloh,
unfortunately the issue still persists in the latest stable CoreOS version:

Container Linux by CoreOS stable (1520.8.0)
Update Strategy: No Reboots
Failed Units: 2
  dnsmasq.service
  systemd-resolved.service

The only difference I see at the moment is that it happens after provisioning: I can push the cloud-config and spin up containers, and after some time this happens. So it's kinda delayed.

Here's the gist with the output you've requested:
https://gist.github.com/parabolic/e864d38fe963cc5e58e566075187d3b3
Hope this helps,
Regards.

parabolic commented Nov 16, 2017

Just FYI, the drop-ins work just fine with

CoreOS stable (1520.8.0)

Cheers.

parabolic commented Nov 16, 2017

Hi, we've been testing this issue with a friend and we found out that if you run a c5 instance, everything works as expected and no drop-ins are required.
In my case the drop-ins are required for a t2.small instance.

It doesn't work without drop-ins on t2, m4, c5 so far.

ajeddeloh commented Nov 16, 2017

Thanks again for the logs. I'm looking into it and will relay the info upstream as well.

euank (Contributor) commented Nov 16, 2017

@ajeddeloh even if we aren't sure exactly what's happening, it seems like we have enough evidence to switch over to ProtectSystem=full for now, right?

It seems like a low-risk enough change to backport that to the beta+stable branches to be picked up the next time they release.

ajeddeloh commented Nov 16, 2017

SGTM, then we'll switch back once it's resolved upstream?

ajeddeloh added a commit to ajeddeloh/systemd that referenced this issue Nov 16, 2017

units: use ProtectSystem=full
This was added in c7fb922 but causes
some bugs, so revert until they are resolved.

Bugs:
 - coreos/bugs#2193
 - coreos/bugs#2190
 - systemd#7082

ajeddeloh commented Nov 21, 2017

Relaying from the upstream bug: it turns out it's a kernel bug with the name_to_handle_at syscall, probably involving NFS. They're working around the bug in upstream systemd, and we're reverting the ProtectSystem=strict change until then.

@parabolic sorry to keep bothering you for more details. Can you share anything about how your nfs is configured?

@jeevarathinam-dhanapal are you also running NFS?
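
For context, a minimal C sketch (not systemd's actual code; the default path is illustrative, taken from the NFS mount reported above) of the kind of mount-point probe systemd performs with name_to_handle_at(2). On the affected kernel/NFS combination the call failed with EOVERFLOW, which strerror() renders as "Value too large for defined data type", matching the error in these logs:

/* Probe a path with name_to_handle_at(2), the syscall systemd uses to
 * detect mount points. Build with: cc -o probe probe.c */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
    const char *path = argc > 1 ? argv[1] : "/mnt/efs";
    struct file_handle *fh;
    int mount_id;

    /* MAX_HANDLE_SZ (128) is the largest handle the kernel should need,
     * so EOVERFLOW here indicates a misbehaving filesystem/kernel, not a
     * short buffer. */
    fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
    if (!fh)
        return 1;
    fh->handle_bytes = MAX_HANDLE_SZ;

    if (name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0) < 0)
        printf("name_to_handle_at(%s) failed: %s\n", path, strerror(errno));
    else
        printf("ok: handle_bytes=%u mount_id=%d\n", fh->handle_bytes, mount_id);

    free(fh);
    return 0;
}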

parabolic commented Nov 21, 2017

Hi @ajeddeloh,
I am glad that a bug has been found.
To answer your question: yes, we use NFS. Here's the configuration:

write-files:
  - path: /etc/conf.d/nfs
    permissions: '0644'
    content: |
      OPTS_RPC_MOUNTD=""
....
....
....
    - name: rpc-statd.service
      command: start
      enable: true
    - name: mnt-efs.mount
      command: start
      enable: true
      content: |
        [Unit]
        Description=Mount the efs.
        [Mount]
        What=efs_on.amazonaws.com:
        Where=/mnt/efs
        Type=nfs
        DirectoryMode=775
 

Hope this helps.
Cheers.

bgilbert (Member) commented Nov 30, 2017

We're reverting to ProtectSystem=full pending a complete fix in upstream systemd. This should avoid the problem starting in alpha 1590.1.0, beta 1576.3.0, and stable 1520.9.0, which are due shortly. I'll leave this open until we can replace this workaround with a long-term fix.

parabolic commented Nov 30, 2017

@bgilbert thanks for the update!
Cheers.

ajeddeloh commented Dec 6, 2017

There's a kernel patch to fix the underlying issue, as well as a bunch of systemd commits that work around the kernel bug. We will continue the ProtectSystem=strict reversion until the kernel patch is accepted by an upstream maintainer, at which point we will backport it and switch ProtectSystem back to strict. Once systemd 236 is released and makes its way into our branches, we will stop backporting the patch. We'll keep this bug open until the kernel patch is backported.

ajeddeloh commented Jan 23, 2018

Closed via coreos/coreos-overlay#2967, which is in Alpha 1662.0.0.

Systemd 236 ended up beating the kernel to fixing the problem, so we're closing this now that it is merged and released.

ajeddeloh closed this Jan 23, 2018
