Vultr: NetworkManager-wait-online.service fails, but system works anyway #1246

Closed
io7m opened this issue Jul 5, 2022 · 7 comments

io7m commented Jul 5, 2022

Describe the bug
On all three update streams right now, a VM instance created on Vultr will reliably have its NetworkManager-wait-online.service fail, but the system seems to work anyway (with correct networking, and all services running normally).

Reproduction steps
Steps to reproduce the behavior:

  1. Create a VM instance on Vultr
  2. Log in and note the failed unit in the login banner:
Fedora CoreOS 36.20220605.3.0
Tracker: https://github.com/coreos/fedora-coreos-tracker
Discuss: https://discussion.fedoraproject.org/tag/coreos

Last login: Tue Jul  5 16:52:10 2022 from 212.69.61.187
[systemd]
Failed Units: 1
  NetworkManager-wait-online.service

Expected behavior
Presumably, NetworkManager-wait-online.service should succeed.

Actual behavior

$ journalctl -u NetworkManager-wait-online.service
Jul 05 16:46:45 control.eigion.one systemd[1]: Starting NetworkManager-wait-online.service - Network Manager Wait Online...
Jul 05 16:47:46 control.eigion.one systemd[1]: NetworkManager-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Jul 05 16:47:46 control.eigion.one systemd[1]: NetworkManager-wait-online.service: Failed with result 'exit-code'.
Jul 05 16:47:46 control.eigion.one systemd[1]: Failed to start NetworkManager-wait-online.service - Network Manager Wait Online.
-- Boot 66013714b5a24931930a30fb6ced4b36 --
Jul 05 16:50:49 control.eigion.one systemd[1]: Starting NetworkManager-wait-online.service - Network Manager Wait Online...
Jul 05 16:51:49 control.eigion.one systemd[1]: NetworkManager-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Jul 05 16:51:49 control.eigion.one systemd[1]: NetworkManager-wait-online.service: Failed with result 'exit-code'.
Jul 05 16:51:49 control.eigion.one systemd[1]: Failed to start NetworkManager-wait-online.service - Network Manager Wait Online.

System details

  • Vultr, on a standard 1 GB instance in the New Jersey datacenter.
  • Fedora CoreOS 36.20220605.3.0

Ignition config

This is the exact ignition file (because that's what the issue template asked for), but it also happens with an extremely minimal file that doesn't install any services or configure any users.

variant: fcos
version: 1.4.0

passwd:
  users:
    - name: <<redacted>>
      password_hash: <<redacted>>
      ssh_authorized_keys:
        - ssh-ed25519 <<redacted>>
    - name: _looseleaf
      uid: 1100
    - name: _certusine
      uid: 1101

storage:
  disks:
    - device: /dev/vdb
      wipe_table: false

  filesystems:
    - path: /var/storage
      device: /dev/vdb
      format: xfs
      with_mount_unit: true

  directories:
    - path: /var/storage/looseleaf
      mode: 0700
      user:
        id: 1100
      group:
        id: 1100
    - path: /var/storage/certusine
      mode: 0700
      user:
        id: 1101
      group:
        id: 1101

  files:
    - path: /var/storage/looseleaf/config.json
      contents:
        local: control/looseleaf/config.json
      overwrite: true
      mode: 0600
      user:
        id: 1100
      group:
        id: 1100
    - path: /var/storage/certusine/config.json
      contents:
        local: control/certusine/config.json
      overwrite: true
      mode: 0600
      user:
        id: 1101
      group:
        id: 1101
    - path: /var/storage/certusine/io7m.pri
      contents:
        local: control/certusine/io7m.pri
      overwrite: true
      mode: 0600
      user:
        id: 1101
      group:
        id: 1101
    - path: /var/storage/certusine/io7m.pub
      contents:
        local: control/certusine/io7m.pub
      overwrite: true
      mode: 0600
      user:
        id: 1101
      group:
        id: 1101
    - path: /var/storage/certusine/www.pri
      contents:
        local: control/certusine/www.pri
      overwrite: true
      mode: 0600
      user:
        id: 1101
      group:
        id: 1101
    - path: /var/storage/certusine/www.pub
      contents:
        local: control/certusine/www.pub
      overwrite: true
      mode: 0600
      user:
        id: 1101
      group:
        id: 1101

systemd:
  units:
    - name: docker.service
      mask: true

    - name: looseleaf.service
      enabled: true
      contents: |
        [Unit]
        Description=looseleaf
        After=network-online.target
        Wants=network-online.target

        [Service]
        Type=exec
        TimeoutStartSec=60
        User=_looseleaf
        Group=_looseleaf
        Restart=on-failure
        RestartSec=10s

        ExecStartPre=-/bin/podman kill looseleaf
        ExecStartPre=-/bin/podman rm looseleaf
        ExecStartPre=/bin/podman pull docker.io/io7m/looseleaf:0.0.1

        ExecStart=/bin/podman run \
          --name looseleaf \
          --volume /var/storage/looseleaf:/looseleaf/etc:Z \
          --net=host \
          --memory=100m \
          --memory-reservation=80m \
          docker.io/io7m/looseleaf:0.0.1 \
          /looseleaf/bin/looseleaf server --file /looseleaf/etc/config.json

        ExecStop=/bin/podman stop looseleaf

        [Install]
        WantedBy=multi-user.target

    - name: certusine.service
      enabled: true
      contents: |
        [Unit]
        Description=certusine
        After=network-online.target
        Wants=network-online.target

        [Service]
        Type=exec
        TimeoutStartSec=60
        User=_certusine
        Group=_certusine
        Restart=on-failure
        RestartSec=10s

        ExecStartPre=-/bin/podman kill certusine
        ExecStartPre=-/bin/podman rm certusine
        ExecStartPre=/bin/podman pull docker.io/io7m/certusine:0.0.2

        ExecStart=/bin/podman run \
          --name certusine \
          --volume /var/storage/certusine:/certusine/etc:Z \
          --net=host \
          --memory=100m \
          --memory-reservation=80m \
          docker.io/io7m/certusine:0.0.2 \
          /certusine/bin/certusine renew --file /certusine/etc/config.json

        ExecStop=/bin/podman stop certusine

        [Install]
        WantedBy=multi-user.target

Additional information

I've attached the system's full journald output from the first boot, and from the boot after Zincati upgraded the system.

journal0.txt
journal1.txt

I'm not concerned about any "confidential" information that might be in the journals. These VM instances are purely for experimentation and don't matter in the slightest.

io7m added the kind/bug label Jul 5, 2022

io7m (Author) commented Jul 5, 2022

I've just realized the issue template called for the Ignition file and not the Butane file. Here it is:

control.ign.txt

Confidential information is gloriously intact, because these VM instances won't even exist a couple of hours from now, and everything else is generated and not reused.

dustymabe (Member) commented Jul 5, 2022

I just brought up a server using our documentation. It appears to work fine but I'll admit that I'm not starting services that pull in network-online.target. I think I see the problem in your logs, though.

Jul 05 16:48:17 control.eigion.one NetworkManager[1366]: <info>  [1657039697.0109] device (enp6s0): state change: ip-config -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed')
Jul 05 16:48:17 control.eigion.one NetworkManager[1366]: <warn>  [1657039697.0142] device (enp6s0): Activation: failed for connection 'Wired connection 2'
Jul 05 16:48:17 control.eigion.one NetworkManager[1366]: <info>  [1657039697.0150] device (enp6s0): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed')
Jul 05 16:48:17 control.eigion.one NetworkManager[1366]: <info>  [1657039697.0188] dhcp4 (enp6s0): canceled DHCP transaction

On my instance I only have one NIC. It looks like you have two NICs (enp1s0 and enp6s0). Are you using a VPC? Is DHCP configured correctly for the VPC?

io7m (Author) commented Jul 5, 2022

Ah, I do have a VPC configured in the control panel. I've not touched the default network settings at all...

That makes sense, actually. I've just spotted this in the VPC documentation:

VPCs do not have DHCP. You must manually configure the IP addresses or supply your own DHCP server.

io7m (Author) commented Jul 5, 2022

Completely slipped my mind that VPC could be the culprit. I'm not actually doing anything with the VPC yet, so the fact that there was an extra network interface that could be causing issues didn't occur to me...
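
For an interface that is attached but not yet used, one possible approach (not the one taken in this thread) is to tell NetworkManager to leave it unmanaged, so that nothing attempts DHCP on it during boot. The following Butane fragment is a minimal sketch under that assumption. The interface name enp6s0 is taken from the logs above and may differ on other instances; the file name 10-vpc-unmanaged.conf and the section suffix are hypothetical.

```yaml
variant: fcos
version: 1.4.0

storage:
  files:
    # NetworkManager drop-in; the file name is a hypothetical choice.
    - path: /etc/NetworkManager/conf.d/10-vpc-unmanaged.conf
      mode: 0644
      contents:
        inline: |
          # Assumption: enp6s0 is the unused VPC interface seen in the logs.
          [device-vpc-unmanaged]
          match-device=interface-name:enp6s0
          managed=0
```

With the device unmanaged, NetworkManager does not try to activate a connection on it. The drop-in would need to be removed once the VPC interface is actually put to use.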

dustymabe (Member) commented Jul 5, 2022

Yeah. You can set IPs statically if you'd like. Here's a page on our network configuration: https://docs.fedoraproject.org/en-US/fedora-coreos/sysconfig-network-configuration/

I'll go ahead and close this out. Please re-open if new information comes in that warrants it.

io7m (Author) commented Jul 5, 2022

Thanks!

io7m (Author) commented Jul 5, 2022

Just to confirm: This was the problem. Statically assigning an address in ignition worked well, and it also had the happy side effect of decreasing the boot and provisioning time.
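
The exact configuration used for the fix was not posted. As a rough sketch of what a static assignment might look like, the Butane fragment below writes a NetworkManager keyfile for the VPC interface. The interface name enp6s0 comes from the logs earlier in the thread; the address 10.0.0.3/24 is a placeholder and must be replaced with the address assigned to the instance in the VPC; disabling IPv6 on this interface is also an assumption.

```yaml
variant: fcos
version: 1.4.0

storage:
  files:
    # NetworkManager keyfile for the VPC interface (name taken from the logs).
    - path: /etc/NetworkManager/system-connections/enp6s0.nmconnection
      mode: 0600
      contents:
        inline: |
          [connection]
          id=enp6s0
          type=ethernet
          interface-name=enp6s0

          [ipv4]
          method=manual
          # Placeholder address: use the one assigned in the VPC settings.
          address1=10.0.0.3/24
          # Keep the default route on the public interface.
          never-default=true

          [ipv6]
          # Assumption: IPv6 is not needed on the VPC interface.
          method=disabled
```

Because the interface no longer waits for a DHCP lease that never arrives, this kind of configuration would also be consistent with the shorter boot and provisioning time reported above.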
