systemd-networkd failure #1197

Closed
kbrwn opened this Issue Mar 30, 2016 · 7 comments

kbrwn commented Mar 30, 2016

Reproducible on CoreOS v835.13.0 and 766.5.0.
This was reported by a customer. Here is what the network configs look like:

cbr0-interface.network

[Match]
Name=eno1

[Network]
Bridge=cbr0
Gateway=

cbr0.netdev

[NetDev]
Kind=bridge
Name=cbr0

cbr0.network

[Match]
Name=cbr0

[Network]
Address=10.103.26.1/16
Gateway=

[Route]
Destination=10.0.0.0/8
Gateway=10.103.0.1

static.network

[Match]
Name=eno*

[Network]
DHCP=no

When these units are activated, systemd-networkd.service fails sporadically.

An strace shows epoll_ctl returning EBADF:

epoll_ctl(5, EPOLL_CTL_DEL, 9, NULL)    = -1 EBADF (Bad file descriptor)

It looks like the file descriptor is closed before it is deregistered from the epoll instance, so the later EPOLL_CTL_DEL on it fails.

close(9)                                = 0
epoll_ctl(5, EPOLL_CTL_DEL, 9, NULL)    = -1 EBADF (Bad file descriptor)
epoll_ctl(5, EPOLL_CTL_DEL, 3, NULL)    = 0
signalfd4(7, [TERM], 8, O_NONBLOCK|O_CLOEXEC) = 7
signalfd4(7, [], 8, O_NONBLOCK|O_CLOEXEC) = 7
close(5)                                = 0
close(7)                                = 0
close(6)                                = 0
close(8)                                = 0

journalnetworklogs.txt

strace.txt
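
The close-then-deregister pattern described above can be searched for mechanically in a longer trace. The following is a minimal sketch, not part of the original report: the function name `find_stale_epoll_del` is hypothetical, and it assumes the trace has the same shape as the excerpt (one syscall per line, no PID or timestamp prefix). It does not track descriptors that are reopened after being closed, so a hit is a hint rather than proof.

```shell
#!/usr/bin/env bash
# Sketch: report EPOLL_CTL_DEL calls that fail with EBADF on an fd
# that an earlier close() in the same trace already released.
# Reads strace output on stdin; assumes no PID/timestamp prefixes.
find_stale_epoll_del() {
    awk '
        # e.g. close(9) = 0 : remember the line where fd 9 was closed
        /^close\([0-9]+\)[ \t]*= 0/ {
            fd = $0
            sub(/^close\(/, "", fd)
            sub(/\).*/, "", fd)
            closed[fd] = NR
            next
        }
        # e.g. epoll_ctl(5, EPOLL_CTL_DEL, 9, NULL) = -1 EBADF ...
        /^epoll_ctl\(.*EPOLL_CTL_DEL.*EBADF/ {
            split($0, parts, /, */)
            fd = parts[3]            # third argument is the target fd
            if (fd in closed)
                printf "line %d: fd %s deregistered after close() on line %d\n", NR, fd, closed[fd]
        }
    '
}
```

With the attachment saved locally, it could be run as `find_stale_epoll_del < strace.txt`.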

Member

crawford commented Mar 30, 2016

Looks like we need coreos/systemd@2f5b4a7.

Member

crawford commented Mar 31, 2016

I'm having a little trouble reproducing this. I'm using the following Ignition config on an AWS instance:

{  
  "ignitionVersion":1,
  "systemd":{  
    "units":[  
      {  
        "name":"systemd-networkd.service",
        "dropins":[  
          {  
            "name":"10-debug.conf",
            "contents":"[Service]\nEnvironment=SYSTEMD_LOG_LEVEL=debug\nRestart=no\n"
          }
        ]
      }
    ]
  },
  "networkd":{  
    "units":[  
      {  
        "name":"cbr0-interface.network",
        "contents":"[Match]\nName=eno1\n\n[Network]\nBridge=cbr0\n"
      },
      {  
        "name":"cbr0.netdev",
        "contents":"[NetDev]\nKind=bridge\nName=cbr0\n"
      },
      {  
        "name":"cbr0.network",
        "contents":"[Match]\nName=cbr0\n\n[Network]\nAddress=10.103.26.1/16\n\n[Route]\nDestination=10.0.0.0/8\nGateway=10.103.0.1\n"
      },
      {  
        "name":"static.network",
        "contents":"[Match]\nName=eno*\n\n[Network]\nDHCP=no\n"
      }
    ]
  }
}

and I have the following script looking for a failed startup:

while ssh core@example.com systemctl status systemd-networkd &>> log; do
    echo -n '.'; 
    ssh core@example.com sudo systemd-run --on-active=1 reboot &>> log;
    sleep 20;
done

This loop hasn't exited yet, though.

@kbrwn when do you see the systemd-networkd failure? Is it shortly after boot or do I need to wait a bit? Does it only happen when it is setting up the network or will it fail at some point later in the boot lifetime?

Vishant0031 commented Mar 31, 2016

Which file is it trying to handle when it bails out?
Is it some virtual device file or one of the network config files?

Member

crawford commented Mar 31, 2016

I was able to reproduce it using this config:

{  
  "ignitionVersion":1,
  "systemd":{  
    "units":[  
      {  
        "name":"systemd-networkd.service",
        "dropins":[  
          {  
            "name":"10-debug.conf",
            "contents":"[Service]\nEnvironment=SYSTEMD_LOG_LEVEL=debug\nRestart=no\n"
          }
        ]
      },
      {  
        "name":"cpu-load.service",
        "contents":"[Service]\nExecStart=/usr/bin/dd if=/dev/zero of=/dev/null\n"
      },
      {  
        "name":"container@.service",
        "contents":"[Service]\nStartLimitInterval=0\nExecStart=/usr/bin/docker run ubuntu ls\n"
      },
      {  
        "name":"stress.target",
        "enable":true,
        "contents":"[Unit]\nRequires=cpu-load.service\nRequires=container@1.service container@2.service container@3.service container@4.service container@5.service container@6.service container@7.service container@8.service container@9.service container@10.service container@11.service container@12.service container@13.service container@14.service container@15.service container@16.service container@17.service container@18.service container@19.service container@20.service\n"
      }
    ]
  },
  "networkd":{  
    "units":[  
      {  
        "name":"cbr0-interface.network",
        "contents":"[Match]\nName=eno1\n\n[Network]\nBridge=cbr0\n"
      },
      {  
        "name":"cbr0.netdev",
        "contents":"[NetDev]\nKind=bridge\nName=cbr0\n"
      },
      {  
        "name":"cbr0.network",
        "contents":"[Match]\nName=cbr0\n\n[Network]\nAddress=10.103.26.1/16\n\n[Route]\nDestination=10.0.0.0/8\nGateway=10.103.0.1\n"
      },
      {  
        "name":"static.network",
        "contents":"[Match]\nName=eno*\n\n[Network]\nDHCP=no\n"
      }
    ]
  }
}

and this script:

while ssh core@example.com systemctl status systemd-networkd &>> event.log; do
    echo -n '.';
    ssh core@example.com sudo systemctl restart systemd-networkd &>> event.log;
    sleep 1m;
done
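
Both loops above run until the status check fails, which can mean waiting indefinitely when the failure is sporadic. A bounded variant might look like the sketch below. The helper name `probe_until_failure` and its argument order are hypothetical and not taken from the thread; the probe command is passed in so the same loop works with any check.

```shell
#!/usr/bin/env bash
# Sketch: run a probe command repeatedly until it fails or a retry
# budget is exhausted.
# Usage: probe_until_failure MAX_TRIES DELAY CMD [ARGS...]
# Returns 0 if the probe failed (the failure was reproduced),
# 1 if it kept passing for MAX_TRIES attempts.
probe_until_failure() {
    local max_tries=$1 delay=$2
    shift 2
    local attempt
    for ((attempt = 1; attempt <= max_tries; attempt++)); do
        if ! "$@"; then
            echo "probe failed on attempt ${attempt}"
            return 0
        fi
        sleep "$delay"
    done
    echo "probe still passing after ${max_tries} attempts"
    return 1
}
```

For example, `probe_until_failure 100 60 ssh core@example.com systemctl is-active --quiet systemd-networkd` would poll once a minute for up to 100 attempts; the restart or reboot step from the scripts above would still need to be added between probes.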

Vishant0031 commented Apr 1, 2016

So, does it look like the veth* interfaces left over from the exited containers (ubuntu ls) are not getting cleaned up?
Or are they actually orphaned after the container exits, and the network service is not able to handle the orphaned veth* interfaces?

crawford added this to the CoreOS 899.14.0 milestone Apr 1, 2016

Member

crawford commented Apr 4, 2016

The Docker daemon is being run as follows:

ExecStart=/usr/lib/coreos/dockerd --daemon --host=fd:// -b cbr0 --fixed-cidr=10.103.26.0/24

Member

crawford commented Apr 5, 2016

crawford closed this Apr 5, 2016
