DNS issues on Azure #356

Closed
arithx opened this issue Jan 28, 2020 · 17 comments

Comments

@arithx
Contributor

arithx commented Jan 28, 2020

Issue Report

Bug

Fedora CoreOS Version

31.20200113.3.1

Expected Behavior

Working DNS

Actual Behavior

When spawning FCOS machines on Azure there is no DNS. The machines do seem to have working networking otherwise.

Reproduction Steps

  1. boot machine on Azure
  2. ping google.com
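
Since `ping google.com` fails for both a dead link and a missing resolver, it can help to check the two separately. The following is a minimal sketch, not part of the original report: `classify_symptom` is a hypothetical helper, and the IP address and host name in the usage comment are only illustrative.

```shell
#!/bin/sh
# Classify the symptom from two exit statuses:
#   $1 = exit status of a reachability check against a literal IP address
#   $2 = exit status of a host name lookup
classify_symptom() {
    if [ "$1" -ne 0 ]; then
        echo "no connectivity"
    elif [ "$2" -ne 0 ]; then
        echo "connectivity ok, DNS broken"
    else
        echo "connectivity and DNS ok"
    fi
}

# Example on a live machine (address and name are illustrative assumptions):
#   ping -c 1 -W 2 1.1.1.1 >/dev/null 2>&1; ip_status=$?
#   getent hosts google.com >/dev/null 2>&1; dns_status=$?
#   classify_symptom "$ip_status" "$dns_status"
```

The behavior described in this report would correspond to the second branch: the address check succeeds while the name lookup fails.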

Other Information

I haven't managed to get a machine booted on Azure with working DNS, whether spawned manually via the CLI or via kola.

@jlebon
Member

jlebon commented Jan 28, 2020

That's odd. What does /etc/resolv.conf say? Logs from NetworkManager?

Hmm, but clearly this has to be working on RHCOS. One difference I can think of is that RHCOS does check-in from the initrd, though I don't think checking in would be related to DNS.

@arithx
Contributor Author

arithx commented Jan 28, 2020

/etc/resolv.conf doesn't exist (likely because we aren't bringing down the networking in the initramfs like RHCOS is).

I've included /run/initramfs/state/etc/resolv.conf as well as the journal for NetworkManager. (Note that I manually restarted NetworkManager earlier via sudo systemctl restart NetworkManager to see whether that resolved it.)

[core@networktest ~]$ cat /etc/resolv.conf
cat: /etc/resolv.conf: No such file or directory
[core@networktest ~]$ ls /etc/
adjtime                        csh.cshrc                fedora-release  hosts          libnl           multipath          pkcs11            rpm               sssd                tmpfiles.d
aliases                        csh.login                filesystems     idmapd.conf    libreport       netconfig          pkgconfig         rpm-ostreed.conf  statetab.d          trusted-key.key
alternatives                   dbus-1                   fuse.conf       inittab        libssh          NetworkManager     pki               rsyncd.conf       subgid              udev
bash_completion.d              default                  gcrypt          inputrc        libuser.conf    networks           pm                rwtab.d           subgid-             virc
bashrc                         depmod.d                 gnupg           iproute2       login.defs      nfs.conf           polkit-1          samba             subuid              X11
bindresvport.blacklist         dhcp                     GREP_COLORS     iscsi          logrotate.conf  nfsmount.conf      popt.d            sasl2             subuid-             xattr.conf
binfmt.d                       DIR_COLORS               group           issue          logrotate.d     nftables           prelink.conf.d    security          sudoers             xdg
chrony.conf                    DIR_COLORS.256color      group-          issue.d        lvm             nsswitch.conf      printcap          selinux           sudoers.d           yum.repos.d
chrony.keys                    DIR_COLORS.lightbgcolor  grub2.cfg       issue.net      machine-id      nsswitch.conf.bak  profile           services          swid                zincati
cifs-utils                     dnf                      grub2-efi.cfg   kernel         magic           openldap           profile.d         sestatus.conf     sysconfig
cni                            dracut.conf              grub.d          krb5.conf      mke2fs.conf     opt                protocols         shadow            sysctl.conf
console-login-helper-messages  dracut.conf.d            gshadow         krb5.conf.d    modprobe.d      os-release         rc.d              shadow-           sysctl.d
containerd                     environment              gshadow-        ld.so.cache    modules-load.d  ostree             redhat-release    shells            systemd
containers                     ethertypes               gss             ld.so.conf     motd            pam.d              request-key.conf  skel              system-release
cron.d                         exports                  host.conf       ld.so.conf.d   motd.d          passwd             request-key.d     ssh               system-release-cpe
crypto-policies                fedora-coreos-pinger     hostname        libaudit.conf  mtab            passwd-            rpc               ssl               terminfo
[core@networktest ~]$ cat /run/initramfs/state/etc/resolv.conf 
nameserver 168.63.129.16
search  u5e2tmrol1sebjifwcberhsgzf.dx.internal.cloudapp.net
[core@networktest ~]$ journalctl -t NetworkManager --no-pager
-- Logs begin at Tue 2020-01-28 20:03:24 UTC, end at Tue 2020-01-28 21:28:33 UTC. --
Jan 28 20:04:35 networktest NetworkManager[1003]: <info>  [1580241875.0456] NetworkManager (version 1.20.8-1.fc31) is starting... (for the first time)
Jan 28 20:04:35 networktest NetworkManager[1003]: <info>  [1580241875.0459] Read config: /etc/NetworkManager/NetworkManager.conf (lib: 10-disable-default-plugins.conf, 20-client-id-from-mac.conf) (run: 10-dracut-dhclient.conf)
Jan 28 20:04:35 networktest NetworkManager[1003]: <info>  [1580241875.6094] bus-manager: acquired D-Bus service "org.freedesktop.NetworkManager"
Jan 28 20:04:35 networktest NetworkManager[1003]: <info>  [1580241875.6509] manager[0x56149f4c0130]: monitoring kernel firmware directory '/lib/firmware'.
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.4818] hostname: hostname: using hostnamed
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.4818] hostname: hostname changed from (none) to "networktest"
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.4824] dns-mgr[0x56149f4a3240]: init: dns=default,systemd-resolved rc-manager=symlink
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.5263] manager[0x56149f4c0130]: rfkill: Wi-Fi hardware radio set enabled
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.5264] manager[0x56149f4c0130]: rfkill: WWAN hardware radio set enabled
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.6368] manager: rfkill: Wi-Fi enabled by radio killswitch; enabled by state file
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.6369] manager: rfkill: WWAN enabled by radio killswitch; enabled by state file
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.6370] manager: Networking is enabled by state file
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.6724] dhcp-init: Using DHCP client 'dhclient'
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.6725] settings: Loaded settings plugin: keyfile (internal)
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.6900] device (lo): carrier: link connected
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.6903] manager: (lo): new Generic device (/org/freedesktop/NetworkManager/Devices/1)
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.6910] device (eth0): carrier: link connected
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.6914] manager: (eth0): new Ethernet device (/org/freedesktop/NetworkManager/Devices/2)
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.7868] settings: (eth0): created default wired connection 'Wired connection 1'
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.7913] device (eth0): state change: unmanaged -> unavailable (reason 'connection-assumed', sys-iface-state: 'external')
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.7922] device (eth0): state change: unavailable -> disconnected (reason 'connection-assumed', sys-iface-state: 'external')
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.7931] device (eth0): Activation: starting connection 'eth0' (4a30bc0c-48d3-49d2-a508-2edd429eaba7)
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.8076] device (eth0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'external')
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.8080] device (eth0): state change: prepare -> config (reason 'none', sys-iface-state: 'external')
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.8083] device (eth0): state change: config -> ip-config (reason 'none', sys-iface-state: 'external')
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.8085] device (eth0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'external')
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.8195] device (eth0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'external')
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.8197] device (eth0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'external')
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.8200] manager: NetworkManager state is now CONNECTED_LOCAL
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.8207] device (eth0): Activation: successful, device activated.
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.8212] manager: NetworkManager state is now CONNECTED_GLOBAL
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.8215] manager: startup complete
Jan 28 20:06:43 networktest NetworkManager[1003]: <info>  [1580242003.3939] caught SIGTERM, shutting down normally.
Jan 28 20:06:43 networktest NetworkManager[1003]: <info>  [1580242003.3953] manager: NetworkManager state is now CONNECTED_LOCAL
Jan 28 20:06:43 networktest NetworkManager[1003]: <info>  [1580242003.4758] exiting (success)
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.5124] NetworkManager (version 1.20.8-1.fc31) is starting... (after a restart)
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.5125] Read config: /etc/NetworkManager/NetworkManager.conf (lib: 10-disable-default-plugins.conf, 20-client-id-from-mac.conf) (run: 10-dracut-dhclient.conf)
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.5201] bus-manager: acquired D-Bus service "org.freedesktop.NetworkManager"
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.5350] manager[0x555befdfe130]: monitoring kernel firmware directory '/lib/firmware'.
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8577] hostname: hostname: using hostnamed
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8578] hostname: hostname changed from (none) to "networktest"
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8580] dns-mgr[0x555befde3240]: init: dns=default,systemd-resolved rc-manager=symlink
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8583] manager[0x555befdfe130]: rfkill: Wi-Fi hardware radio set enabled
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8583] manager[0x555befdfe130]: rfkill: WWAN hardware radio set enabled
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8597] manager: rfkill: Wi-Fi enabled by radio killswitch; enabled by state file
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8598] manager: rfkill: WWAN enabled by radio killswitch; enabled by state file
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8599] manager: Networking is enabled by state file
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8600] dhcp-init: Using DHCP client 'dhclient'
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8601] settings: Loaded settings plugin: keyfile (internal)
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8615] device (lo): carrier: link connected
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8617] manager: (lo): new Generic device (/org/freedesktop/NetworkManager/Devices/1)
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8624] device (eth0): carrier: link connected
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8628] manager: (eth0): new Ethernet device (/org/freedesktop/NetworkManager/Devices/2)
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8648] device (eth0): state change: unmanaged -> unavailable (reason 'connection-assumed', sys-iface-state: 'external')
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8657] device (eth0): state change: unavailable -> disconnected (reason 'connection-assumed', sys-iface-state: 'external')
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8665] device (eth0): Activation: starting connection 'eth0' (794b0119-9912-453b-b991-246d38a41599)
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8678] device (eth0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'external')
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8681] device (eth0): state change: prepare -> config (reason 'none', sys-iface-state: 'external')
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8684] device (eth0): state change: config -> ip-config (reason 'none', sys-iface-state: 'external')
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8686] device (eth0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'external')
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8831] device (eth0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'external')
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8833] device (eth0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'external')
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8836] manager: NetworkManager state is now CONNECTED_LOCAL
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8841] device (eth0): Activation: successful, device activated.
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8845] manager: NetworkManager state is now CONNECTED_GLOBAL
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8847] manager: startup complete
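
The output above shows the nameserver recorded only in the initramfs state file, with nothing in the real root. A minimal sketch of automating that comparison follows. The function names `list_nameservers` and `check_dns_handoff` are hypothetical; the default paths are the two files quoted in this comment.

```shell
#!/bin/sh
# Print the nameserver addresses from a resolv.conf-style file, one per line.
# Prints nothing and returns 1 if the file does not exist.
list_nameservers() {
    [ -f "$1" ] || return 1
    awk '$1 == "nameserver" { print $2 }' "$1"
}

# Compare the real-root file with the copy saved by the initramfs.
# Both paths can be overridden; the defaults are the ones from this report.
check_dns_handoff() {
    root_conf="${1:-/etc/resolv.conf}"
    initrd_conf="${2:-/run/initramfs/state/etc/resolv.conf}"
    if list_nameservers "$root_conf" | grep -q .; then
        echo "ok: $root_conf has nameservers"
    elif list_nameservers "$initrd_conf" | grep -q .; then
        echo "missing: nameservers exist only in $initrd_conf"
    else
        echo "none: no nameservers found in either file"
    fi
}
```

Calling `check_dns_handoff` with no arguments on an affected machine would be expected to take the "missing" branch, given the file contents shown above.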

@jlebon
Member

jlebon commented Jan 28, 2020

> likely because we aren't bringing down the networking in the initramfs like RHCOS is

Ahhh hmm yeah, that's a big delta. I don't quite remember now why we don't do this in FCOS too. Maybe we expected it to be unnecessary with the switch to NM in the initrd?

@jomeier

jomeier commented Jan 29, 2020

#148 (comment)

That's a showstopper because we have to manually reboot each machine a few times before it has an internet connection.

Is it a big task to resolve this in the official fcos image?

@lucab
Contributor

lucab commented Jan 29, 2020

@jlebon I think coreos/ignition-dracut#119 is related.

@jomeier

jomeier commented Feb 4, 2020

Hi folks. Do you have any updates for this one?

@LorbusChris
Contributor

@dustymabe
Member

cross referencing this with #394

@jomeier

jomeier commented Mar 3, 2020

@simongottschlag

Hitting this issue as well. Any updates?

@dustymabe
Member

Yes. Once coreos/ignition-dracut#159 and coreos/fedora-coreos-config#310 are merged and included in a release, we think this should be taken care of.

@dustymabe dustymabe self-assigned this Mar 23, 2020
@simongottschlag

Hi,

FYI, I used this in the ignition to work around the issue. Seems to be working:

systemd:
  units:
    - name: azure-restart-network.service
      enabled: true
      contents: |
        [Service]
        Type=oneshot
        ExecStart=/bin/bash -c '\
          /usr/bin/cp /run/initramfs/state/etc/resolv.conf /etc/resolv.conf; \
          /usr/bin/systemctl restart NetworkManager'

        [Install]
        WantedBy=multi-user.target
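
The unit above copies the file and restarts NetworkManager on every boot. A more defensive variant of the same idea is sketched below; it is an assumption-laden illustration, not the workaround as posted and not the fix the project adopted. `restore_resolv_conf` is a hypothetical helper that copies only when the destination is absent or empty.

```shell
#!/bin/sh
# Copy src to dst only when dst is absent or empty and src is a readable file.
# Returns 0 if a copy was made, non-zero otherwise.
restore_resolv_conf() {
    src="${1:-/run/initramfs/state/etc/resolv.conf}"
    dst="${2:-/etc/resolv.conf}"
    [ -s "$dst" ] && return 1      # already populated, leave it alone
    [ -r "$src" ] || return 1      # nothing to copy from
    cp "$src" "$dst"
}

# On an affected machine (as root), restart NetworkManager only if a copy happened:
#   restore_resolv_conf && systemctl restart NetworkManager
```

Skipping the copy when `/etc/resolv.conf` already has content means the script should become a no-op on images where the underlying problem is fixed.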

@dustymabe
Member

@jomeier @simongottschlag - care to test https://builds.coreos.fedoraproject.org/prod/streams/testing-devel/builds/31.20200323.20.0/x86_64/fedora-coreos-31.20200323.20.0-azure.x86_64.vhd.xz to see if that fixes the problem?

@simongottschlag

> @jomeier @simongottschlag - care to test https://builds.coreos.fedoraproject.org/prod/streams/testing-devel/builds/31.20200323.20.0/x86_64/fedora-coreos-31.20200323.20.0-azure.x86_64.vhd.xz to see if that fixes the problem?

I'm having issues deploying our production VMs right now (capacity in West Europe), meaning I need to prioritise that before tests. Sorry!

@jomeier

jomeier commented Mar 25, 2020

Strike!

I will try that out today. Give me a few hours, please.

@jomeier

jomeier commented Mar 25, 2020

@dustymabe @vrutkovs @LorbusChris

Ok guys ... it looks good.

I installed OKD 4.4 successfully without any manual intervention on my side. Everything is green in the web UI.

For your information: I had to resize, convert and upload the FCOS test image to Azure, but I'm sure that's expected behaviour for this test. I used a helper VM, which I patched into the OKD installer, to do the work.

Good job!

@dustymabe
Member

We are now using NetworkManager in the initramfs and also propagating network information from the initramfs (kargs) when appropriate, which we think fixes this issue.

See #394 (comment) and the preceding discussion for more details.
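
As a rough illustration of the "when appropriate" condition, one can check whether the kernel command line carries any network arguments at all. This is only a sketch: `has_network_kargs` is a hypothetical helper, and the list of arguments is an assumed subset taken from dracut's documented options, not the list the actual implementation uses (which this thread does not state).

```shell
#!/bin/sh
# Return 0 if the given kernel command line contains any network-related
# dracut argument. The list is an illustrative subset, not the project's.
has_network_kargs() {
    for arg in $1; do
        case "$arg" in
            ip=*|nameserver=*|rd.route=*|bond=*|team=*|vlan=*|bridge=*)
                return 0 ;;
        esac
    done
    return 1
}

# Usage on a running machine:
#   has_network_kargs "$(cat /proc/cmdline)" && echo "network kargs present"
```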
