waagent.conf not created at Azure #1755

Closed
lshahar opened this Issue Jan 9, 2017 · 16 comments

@lshahar
lshahar commented Jan 9, 2017 edited

Issue Report

Bug

CoreOS Version

$ cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1235.4.0
VERSION_ID=1235.4.0
BUILD_ID=2017-01-04-0450
PRETTY_NAME="Container Linux by CoreOS 1235.4.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

...

Environment

What hardware/cloud provider/hypervisor is being used to run CoreOS?
Azure

Expected Behavior

/etc/waagent.conf should be created by CoreOS / coreos-cloudinit

Actual Behavior

/etc/waagent.conf is missing

Reproduction Steps

  1. After an automatic upgrade, the file is not created.
@drbolsen
drbolsen commented Jan 9, 2017 edited

The same issue here: after an upgrade to 1235.4.0, waagent.service can't start due to an incorrect path to the waagent.conf file, /etc/waagent.conf. To my understanding, the correct path should be /usr/share/oem/waagent.conf instead.

The environment is Azure.

WALinuxAgent-2.1.3 running on container linux by coreos 1235.4.0
Python: 2.7.6

UPDATE: I reckon that it is due to a name change for the CoreOS distribution

CoreOS 1185.5.0 metadata.get_distro() function returns
['coreos', '1185.5.0', 'coreos', 'CoreOS']

CoreOS 1235.4.0 metadata.get_distro() function returns
['container linux by coreos', '1235.4.0', 'coreos', 'Container Linux by CoreOS']

This piece of code

    if distro_name == "coreos":
        return CoreOSDistro()

effectively fails and falls back to the default distro, which results in an attempt to load /etc/waagent.conf instead of the one located in the /usr/share/oem/ folder.
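
The failure mode can be illustrated with a minimal sketch. The `select_distro` function and the two classes below are hypothetical stand-ins, not the agent's actual loader code; they only mimic the string comparison quoted above, assuming the first element of the `get_distro()` result is the value being compared and that the CoreOS config lives at /usr/share/oem/waagent.conf.

```python
# Hypothetical stand-ins for the agent's distro classes (names assumed).
class DefaultDistro(object):
    conf_file_path = "/etc/waagent.conf"


class CoreOSDistro(DefaultDistro):
    conf_file_path = "/usr/share/oem/waagent.conf"


def select_distro(distro_info):
    """Mimic a loader that dispatches on the distro name (first element)."""
    distro_name = distro_info[0]
    if distro_name == "coreos":
        return CoreOSDistro()
    # Any unrecognized name silently falls through to the default.
    return DefaultDistro()


# Values reported in this thread for metadata.get_distro().
old = ['coreos', '1185.5.0', 'coreos', 'CoreOS']
new = ['container linux by coreos', '1235.4.0', 'coreos',
       'Container Linux by CoreOS']

print(select_distro(old).conf_file_path)  # CoreOS-specific path
print(select_distro(new).conf_file_path)  # falls back to /etc/waagent.conf
```

Note that the third element of both lists is still `coreos`, which is why a check on that value would not have been affected by the rename.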

@crawford
Member
crawford commented Jan 9, 2017

That is correct. Machines provisioned before 1235.5.0 (which is basically all of them) incorrectly look at the name instead of the ID. We're working on a fix, but in the meantime, reprovisioning with 1235.5.0 will definitely solve it.

@drbolsen
drbolsen commented Jan 9, 2017

Thanks @crawford, noted. A quick workaround for "desperate" souls like me :) is to make the following change in the loader.py file (/usr/share/oem/python/lib64/python2.7/site-packages/azurelinuxagent/distro) and then recompile it:

    if distro_name == "container linux by coreos":
        return CoreOSDistro()

sudo mv loader.pyc loader.pyc.old
sudo /usr/share/oem/python/bin/python -c 'import py_compile; py_compile.compile("loader.py")'

This should solve the problem until the new release arrives.

@lshahar
lshahar commented Jan 9, 2017

Is there an estimated time for the new release?

@crawford
Member
crawford commented Jan 9, 2017

The new release should be available tomorrow.

@crawford
Member
crawford commented Jan 9, 2017 edited

@drbolsen I wouldn't recommend that change be made. Our fix is going to either involve rolling back the name for now or patching the python libraries to always return "CoreOS" as the distro's name. In either case, that is going to then fail your check when your machine updates.

This patch will be safer and is what we ended up doing in 1235.5.0, 1248.3.0, and 1284.1.0:

--- /usr/share/oem/python/lib64/python2.7/platform.py 2017-01-09 15:26:53.000000000 +0000
+++ /usr/share/oem/python/lib64/python2.7/platform.py 2017-01-09 15:27:39.000000000 +0000
@@ -340,7 +340,7 @@
     os_release_info = _parse_os_release()
     if os_release_info is not None:
         if 'NAME' in os_release_info:
-            distname = os_release_info['NAME']
+            distname = "CoreOS"
         if 'VERSION_ID' in os_release_info:
             version = os_release_info['VERSION_ID']
         if 'ID' in os_release_info:

Since the long-term fix is going to involve updating the Azure agent to read the ID instead of the NAME, this patch will eventually have no effect.
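
As a rough illustration of why matching on the ID is more robust than matching on the NAME, here is a minimal sketch. The `parse_os_release` and `is_coreos` helpers are hypothetical and are not taken from the Azure agent or from platform.py; the parser is deliberately simplified and does not handle shell-style escapes.

```python
def parse_os_release(text):
    """Parse os-release style KEY=VALUE lines into a dict, stripping quotes."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"').strip("'")
    return info


def is_coreos(info):
    """Match on the machine-readable ID rather than the display NAME."""
    return info.get("ID") == "coreos"


# Excerpt of the os-release content from the original report.
sample = '''NAME="Container Linux by CoreOS"
ID=coreos
VERSION_ID=1235.4.0
'''
info = parse_os_release(sample)
print(info["NAME"].lower() == "coreos")  # name-based check no longer matches
print(is_coreos(info))                   # ID-based check still matches
```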

@crawford
Member
crawford commented Jan 9, 2017

Fortunately, platform.py has never changed since we originally shipped the Azure images. It looks like we'll be able to check the file against a known hash and apply this change if it's needed.
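
The "check against a known hash, then patch" idea might be sketched as follows. This is only an illustration under stated assumptions: `patch_if_pristine` is a hypothetical helper, and it operates on an in-memory stand-in rather than the real platform.py, so the checksum here is computed from the sample bytes and is not the real file's hash.

```python
import hashlib

ORIGINAL_LINE = b"distname = os_release_info['NAME']"
PATCHED_LINE = b'distname = "CoreOS"'


def md5_hex(data):
    """Return the hex MD5 digest of a bytes object."""
    return hashlib.md5(data).hexdigest()


def patch_if_pristine(data, known_md5):
    """Return (new_data, changed); rewrite only content whose MD5 matches."""
    if md5_hex(data) != known_md5:
        # Unknown or already-modified content is left untouched.
        return data, False
    patched = data.replace(ORIGINAL_LINE, PATCHED_LINE)
    return patched, patched != data


# In-memory stand-in for the relevant lines of platform.py (not the real file).
pristine = (b"        if 'NAME' in os_release_info:\n"
            b"            distname = os_release_info['NAME']\n")
known = md5_hex(pristine)

once, changed_once = patch_if_pristine(pristine, known)
twice, changed_twice = patch_if_pristine(once, known)
print(changed_once, changed_twice)
```

Because the patched content no longer matches the known checksum, a second run is a no-op, which is what makes this kind of fix safe to apply automatically on every update.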

@crawford crawford referenced this issue in coreos/update_engine Jan 9, 2017
Merged

coreos-postinst: patch platform.py in python libs #139

@crawford
Member
crawford commented Jan 9, 2017

The following script can be run (as root) to fix the issue. The next update will include this script and run it automatically if it hasn't been done.

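PLATFORM_PATH="/usr/share/oem/python/lib64/python2.7/platform.py"
# Patch only if the file exists and still matches the known, unmodified checksum.
if [ -e "${PLATFORM_PATH}" ]; then
    sum=($(md5sum "${PLATFORM_PATH}"))
    if [ "${sum}" == "6315addf42c0b07f5f78d119b578e20a" ]; then
        # Hard-code the distro name instead of reading NAME from os-release.
        sed --in-place \
            "s%distname = os_release_info\['NAME'\]%distname = \"CoreOS\"%" \
            "${PLATFORM_PATH}"
    fi
fi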
@drbolsen
drbolsen commented Jan 9, 2017

Thanks @crawford, much appreciated mate. Is it sufficient to update only the "py" file, or is compilation required?

@crawford
Member

@drbolsen It's sufficient to just update the py file. The waagent service will restart itself after failing again, pick up the changes, and start working. You should be able to just run that snippet.

@lshahar
lshahar commented Jan 10, 2017

We got the update (1235.5.0), but it still failed to start waagent:

core@cluster-d0 ~ $ systemctl status waagent 
● waagent.service - Microsoft Azure Agent
   Loaded: loaded (/run/systemd/system/waagent.service; static; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Tue 2017-01-10 07:42:24 UTC; 4s ago
  Process: 43628 ExecStart=/usr/share/oem/python/bin/python -u /usr/share/oem/bin/waagent -daemon (code=exited, status=1/FAILURE)
 Main PID: 43628 (code=exited, status=1/FAILURE)
    Tasks: 0
   Memory: 0B
      CPU: 0
   CGroup: /system.slice/waagent.service

Jan 10 07:42:24 gicistage-d0 python[43628]:   File "/usr/share/oem/python/lib64/python2.7/site-packages/azurelinuxagent/agent.py", line 41, in __init__
Jan 10 07:42:24 gicistage-d0 python[43628]:     self.distro.init_handler.run(verbose)
Jan 10 07:42:24 gicistage-d0 python[43628]:   File "/usr/share/oem/python/lib64/python2.7/site-packages/azurelinuxagent/distro/default/init.py", line 37, in run
Jan 10 07:42:24 gicistage-d0 python[43628]:     conf.load_conf_from_file(conf_file_path)
Jan 10 07:42:24 gicistage-d0 python[43628]:   File "/usr/share/oem/python/lib64/python2.7/site-packages/azurelinuxagent/conf.py", line 75, in load_conf_from_file
Jan 10 07:42:24 gicistage-d0 python[43628]:     "").format(conf_file_path))
Jan 10 07:42:24 gicistage-d0 python[43628]: azurelinuxagent.exception.AgentConfigError: (000001)Missing configuration in /etc/waagent.conf
Jan 10 07:42:24 gicistage-d0 systemd[1]: waagent.service: Main process exited, code=exited, status=1/FAILURE
Jan 10 07:42:24 gicistage-d0 systemd[1]: waagent.service: Unit entered failed state.
Jan 10 07:42:24 gicistage-d0 systemd[1]: waagent.service: Failed with result 'exit-code'.
@crawford
Member

@lshahar That's not the update. The fix will be in 1235.6.0 which should roll out tomorrow.

@lshahar
lshahar commented Jan 10, 2017

OK, we will check it tomorrow. Thanks.

@crawford
Member

Just a heads up. We've been ready to roll this update since yesterday evening but we are still gated on an upstream vendor publishing a security advisory. Sorry for the delay.

@crawford
Member

It's rolling now. Thank you all for your patience.

@crawford crawford closed this Jan 11, 2017
@lshahar
lshahar commented Jan 11, 2017

I checked the new version (1235.6.0) today and this issue is fixed.

Thanks
