This repository has been archived by the owner on Oct 16, 2020. It is now read-only.

Multiple oem-gce containers get started on GCE when using non-Google DNS #2601

Closed
HeikoOnnebrink opened this issue Jul 8, 2019 · 1 comment · Fixed by coreos/coreos-overlay#3879

Comments

@HeikoOnnebrink

Bug

Instead of a single, continuously running oem-gce rkt container, we found that a new oem-gce container instance spins up every 1-2 minutes, until the system is running so many of them that it eventually runs out of memory.

Container Linux Version

NAME="Container Linux by CoreOS"
ID=coreos
VERSION=2135.5.0
VERSION_ID=2135.5.0
BUILD_ID=2019-07-01-1959
PRETTY_NAME="Container Linux by CoreOS 2135.5.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

Environment

We have been running CoreOS on Google Cloud VMs for several years. The cloud environment is connected to our corporate network via a site-to-site VPN.
In our setup we use our corporate nameserver in resolv.conf and have added the entry 169.254.169.254 metadata.google.internal to the hosts file so that the metadata server name can be resolved.
For some time now this has no longer worked (apologies, I cannot name the exact CoreOS version in which the problems started), and we get the symptoms described above.

Expected Behavior

Only one oem-gce container should be started after boot, and it should stay running.

Actual Behavior

Every 1-2 minutes a new oem-gce rkt container instance spins up and stays running, until we hit OOM issues.

UUID            APP     IMAGE NAME                      STATE           CREATED         STARTED         NETWORKS
065952d0        oem-gce coreos.com/oem-gce:2135.5.0     running         6 minutes ago   6 minutes ago
755867b3        oem-gce coreos.com/oem-gce:2135.5.0     running         1 minute ago    1 minute ago
92878e0b        oem-gce coreos.com/oem-gce:2135.5.0     exited garbage  4 days ago      4 days ago
bdb4d4e6        oem-gce coreos.com/oem-gce:2135.5.0     running         4 minutes ago   4 minutes ago
bea7cb1d        oem-gce coreos.com/oem-gce:2135.5.0     running         3 minutes ago   3 minutes ago
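
To track how many instances pile up over time, a listing like the one above can be counted with a small script. This is only a minimal sketch: it assumes the whitespace-separated column layout shown here (UUID, APP, IMAGE NAME, STATE, ...), and the function name `count_running` is made up for illustration.

```python
def count_running(rkt_list_output: str, app: str = "oem-gce") -> int:
    """Count rows whose APP column equals `app` and whose STATE starts with 'running'."""
    count = 0
    for line in rkt_list_output.splitlines():
        fields = line.split()
        # Need at least UUID, APP, IMAGE NAME, STATE; skip the header row.
        if len(fields) < 4 or fields[0] == "UUID":
            continue
        # A state such as "exited garbage" splits into two tokens; only the first is checked.
        if fields[1] == app and fields[3] == "running":
            count += 1
    return count


# Sample copied from the listing in this report.
SAMPLE = """\
UUID            APP     IMAGE NAME                      STATE           CREATED         STARTED         NETWORKS
065952d0        oem-gce coreos.com/oem-gce:2135.5.0     running         6 minutes ago   6 minutes ago
755867b3        oem-gce coreos.com/oem-gce:2135.5.0     running         1 minute ago    1 minute ago
92878e0b        oem-gce coreos.com/oem-gce:2135.5.0     exited garbage  4 days ago      4 days ago
bdb4d4e6        oem-gce coreos.com/oem-gce:2135.5.0     running         4 minutes ago   4 minutes ago
bea7cb1d        oem-gce coreos.com/oem-gce:2135.5.0     running         3 minutes ago   3 minutes ago
"""

print(count_running(SAMPLE))  # 4
```

On a healthy machine this count would be expected to stay at 1.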

Reproduction Steps

Deploy a VM with the latest CoreOS image on Google Cloud.
Configure resolv.conf to use a corporate, non-Google DNS server.
Remove any nameserver 169.254.169.254 entry from resolv.conf.
Add 169.254.169.254 metadata.google.internal to the hosts file.
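
Whether a machine is actually in the configuration described by these steps can be checked with a short script. The following is a rough sketch, not a definitive tool: it works on the text of the hosts file and resolv.conf passed in as strings, the helper names are made up for illustration, and 10.0.0.53 is a hypothetical placeholder for a corporate nameserver.

```python
METADATA_IP = "169.254.169.254"
METADATA_NAME = "metadata.google.internal"


def _fields(line: str) -> list:
    """Split a config line into fields, dropping '#' and ';' comments."""
    for marker in ("#", ";"):
        line = line.split(marker, 1)[0]
    return line.split()


def hosts_maps_metadata(hosts_text: str) -> bool:
    """True if some hosts line maps the metadata IP to the metadata name."""
    for line in hosts_text.splitlines():
        fields = _fields(line)
        if len(fields) >= 2 and fields[0] == METADATA_IP and METADATA_NAME in fields[1:]:
            return True
    return False


def nameservers(resolv_text: str) -> list:
    """Return the nameserver addresses in the order they are listed."""
    servers = []
    for line in resolv_text.splitlines():
        fields = _fields(line)
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def matches_reproduction(hosts_text: str, resolv_text: str) -> bool:
    """True if the hosts entry exists and the metadata IP is not a nameserver."""
    return hosts_maps_metadata(hosts_text) and METADATA_IP not in nameservers(resolv_text)


# Hypothetical file contents; 10.0.0.53 stands in for a corporate DNS server.
HOSTS = "127.0.0.1 localhost\n169.254.169.254 metadata.google.internal\n"
RESOLV = "nameserver 10.0.0.53\n"

print(matches_reproduction(HOSTS, RESOLV))  # True
```

On a real machine the two strings would typically be read from /etc/hosts and /etc/resolv.conf.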

Other Information

From the journal we got these log entries, which seem to relate to the issue:

instance-setup[2400]: ERROR GET request error retrieving metadata. <urlopen error [Errno -2] Name or service not known>.
google-accounts[913]: ERROR GET request error retrieving metadata. <urlopen error [Errno -2] Name or service not known>.
google-networking[915]: ERROR GET request error retrieving metadata. <urlopen error [Errno -2] Name or service not known>.

As a workaround, we found that the problem disappears once we add nameserver 169.254.169.254 as the first entry in resolv.conf, before our corporate nameserver.
But this is not a solution, because it disables name resolution for our internal machines.

It looks like something has changed inside the oem-gce container, so that just adding the metadata.google.internal entry to the hosts file is no longer sufficient for the container to start properly, even though this configuration worked fine for years.

In the older versions 1576 and 1855 the problem did not exist. It even looks like the latest CoreOS version does not show the problem as long as the oem-gce container version is old. We found this on one machine that was deployed a long time ago and has been updated continuously; evidently the oem-gce container was not updated during these updates.
