Support "Ubuntu 16.04 (xenial)" VMs #304

Closed
riccardomurri opened this Issue Aug 24, 2016 · 10 comments

riccardomurri commented Aug 24, 2016

Ubuntu 16.04 "xenial" does not ship with Python 2 installed by default, hence
Ansible fails very early with the following error::

    TASK [setup] *******************************************************************
    fatal: [master001]: FAILED! => {"changed": false, "failed": true, "module_stderr": "", "module_stdout": "/bin/sh: 1: /usr/bin/python: not found\r\n", "msg": "MODULE FAILURE", "parsed": false}

The solution entails a few steps:

  1. One should install Python 2.7 on the Ubuntu "xenial" VM::

    sudo apt-get install python2.7
    

    (There is also a python2.7-minimal which is the Python interpreter
    without the std library. That's not sufficient for running Ansible.)

    This step could be automated by:

    • running a "setup" shell script upon first connection to a machine, or
    • introducing a pre_setup_commands= setting that lists commands to execute (it would need to be set per node kind).
  2. Ansible tries to execute /usr/bin/python, whereas Ubuntu's python2.7
    package only provides /usr/bin/python2.7. The
    Python interpreter that Ansible uses can be set with the
    ansible_python_interpreter variable but only in the inventory file.

    Since ElastiCluster rebuilds the inventory file at each invocation, we need a
    way to add per-host variables in the inventory.
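
To make the "per-host variables in the inventory" idea concrete, here is a minimal sketch of what a generated inventory line could look like. The function name ``inventory_line`` and its arguments are hypothetical, and ``ansible_host`` assumes the Ansible 2.x variable naming; this is not ElastiCluster's actual inventory code.

```shell
#!/bin/sh
# Sketch only: emit one INI-style Ansible inventory line that carries a
# per-host interpreter override. Function name and arguments are
# hypothetical, not part of ElastiCluster.
inventory_line() {
    host=$1
    address=$2
    # Default to the path that Ubuntu's python2.7 package provides.
    interpreter=${3:-/usr/bin/python2.7}
    printf '%s ansible_host=%s ansible_python_interpreter=%s\n' \
        "$host" "$address" "$interpreter"
}
```

For example, ``inventory_line master001 192.0.2.10`` prints ``master001 ansible_host=192.0.2.10 ansible_python_interpreter=/usr/bin/python2.7``.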

@riccardomurri riccardomurri added this to the 1.3 milestone Aug 24, 2016

@riccardomurri riccardomurri self-assigned this Aug 24, 2016

riccardomurri commented Aug 24, 2016

There is already a workaround for the "python interpreter location" issue: just
specify it using the "cluster variables" mechanism in the [setup/...]
section::

    [setup/test]
    provider=ansible

    # ...

    global_var_ansible_python_interpreter=/usr/bin/python2.7

What's counter-intuitive is that this must be done in the [setup/...] section
and does not work if inserted in the [cluster/...] section (where image_id
resides).

riccardomurri commented Oct 10, 2016

Another issue with the stock Ubuntu 16.04 images: upon boot, systemd runs an
apt-get update command, which prevents other apt/dpkg commands from
running. Hence Ansible setup fails with this error::

    fatal: [frontend001]: FAILED! => {"changed": false, "failed": true, "msg": "Failed to auto-install python-apt. Error was: 'E: Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily unavailable)\nE: Unable to lock the administration directory (/var/lib/dpkg/), is another process using it?'"}
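
One way to sidestep the lock error is to wait until nothing holds the dpkg lock before running any apt/dpkg command. The following is a rough sketch, not something ElastiCluster does: the helper names ``wait_until`` and ``dpkg_lock_is_free`` are made up for illustration, and the check assumes ``fuser`` (from the psmisc package) is installed and that the script runs as root.

```shell
#!/bin/sh
# wait_until MAX_ATTEMPTS COMMAND [ARG...]
# Re-run COMMAND about once per second until it succeeds; give up after
# MAX_ATTEMPTS tries. Returns 0 on success, 1 on timeout.
wait_until() {
    remaining=$1
    shift
    until "$@"; do
        remaining=$((remaining - 1))
        if [ "$remaining" -le 0 ]; then
            return 1
        fi
        sleep 1
    done
    return 0
}

# Succeeds when no process has the dpkg lock file open.
# Assumption: fuser is available; if it is missing, this check also
# reports "free", so treat the result with care.
dpkg_lock_is_free() {
    ! fuser /var/lib/dpkg/lock >/dev/null 2>&1
}

# Example (not executed here):
#   wait_until 300 dpkg_lock_is_free && apt-get install -y python
```

This only narrows the window: another process could still grab the lock between the check and the following apt-get call.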

corburn commented Oct 10, 2016

What is wrong with python2.7-minimal? If a task depends on a python module, the task can pip install it.

riccardomurri commented Oct 10, 2016

What is wrong with python2.7-minimal?

Ansible core modules depend on the Python standard library being available, whereas python2.7-minimal installs only a subset of it (the modules that are needed by the Debian/Ubuntu boot scripts). Therefore with python2.7-minimal you cannot rely on any Ansible task completing successfully because critical parts of the stdlib may be missing.

riccardomurri commented Oct 10, 2016

Regarding the apt auto-upgrade task mentioned in comment 3, here is a snapshot of the running processes::

    root       738  0.0  0.0   4508  1720 ?        Ss   15:18   0:00 /bin/sh /usr/lib/apt/apt.systemd.daily
    root      1299 34.0  1.7 132024 70972 ?        S    15:18   0:05  \_ /usr/bin/python3 /usr/bin/unattended-upgrade
    root      1571  0.0  1.4 131904 57972 ?        S    15:18   0:00      \_ /usr/bin/python3 /usr/bin/unattended-upgrade
    root      1581  1.2  0.5  41204 20564 pts/1    Ss+  15:18   0:00          \_ /usr/bin/dpkg --status-fd 10 --unpack --auto-deconfigure /var/cache/apt/archives/systemd-sysv_229-4ubuntu10_amd64.deb
    root      1591  0.0  0.0   4508   712 pts/1    S+   15:18   0:00              \_ /bin/sh /var/lib/dpkg/info/man-db.postinst triggered /usr/share/man
    man       1592  1.3  0.0  25796  3572 pts/1    D+   15:18   0:00                  \_ /usr/bin/mandb -pq

riccardomurri commented Oct 10, 2016

The apt auto-upgrade task is actually run by systemd unit apt-daily.service (use systemctl show apt-daily.service to inspect).

riccardomurri commented Oct 10, 2016

I have tried various options to disable the "unattended upgrades" feature
through a "user data" script, but all of them have failed so far:

  1. Try disabling the systemd task: the systemd unit apt-daily.service is triggered
    by apt-daily.timer. I have tried to disable one or the other, or both, with
    various combinations of the following commands; still, apt-daily.service is
    started moments after the VM becomes ready to accept SSH connections::

    #!/bin/bash
    
    systemctl stop apt-daily.timer
    systemctl disable apt-daily.timer
    systemctl mask apt-daily.service
    systemctl daemon-reload
    
  2. Script /usr/lib/apt/apt.systemd.daily reads a few APT configuration variables; the setting APT::Periodic::Enable disables the functionality altogether (lines 331--337). I have tried disabling it with the following script::

    #!/bin/bash
    
    # cannot use /etc/apt/apt.conf.d/10periodic as suggested in
    # /usr/lib/apt/apt.systemd.daily, as Ubuntu distributes the
    # unattended upgrades stuff with priority 20 and 50 ...
    # so override everything with a 99xxx file
    cat > /etc/apt/apt.conf.d/99elasticluster <<__EOF
    APT::Periodic::Enable "0";
    // undo what's in 20auto-upgrade
    APT::Periodic::Update-Package-Lists "0";
    APT::Periodic::Unattended-Upgrade "0";
    __EOF
    

However, despite APT::Periodic::Enable having value 0 from the command line
(see below), the unattended-upgrades program is still run...

    ubuntu@test:~$ apt-config shell AutoAptEnable APT::Periodic::Enable
    AutoAptEnable='0'
  3. The nuclear option: remove /usr/lib/apt/apt.systemd.daily altogether::

    #!/bin/bash
    
    mv /usr/lib/apt/apt.systemd.daily /usr/lib/apt/apt.systemd.daily.DISABLED
    

Still, the task runs and I can see it in the process table, although the file
does not exist when probed from the command line::

    ubuntu@test:~$ ls /usr/lib/apt/apt.systemd.daily
    ls: cannot access '/usr/lib/apt/apt.systemd.daily': No such file or directory

There must be some Linux namespace magic going on, which I do not understand --
it looks as though the cloud-init script (together with the SSH command-line)
and the root systemd process execute in separate filesystems and process spaces
...
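
The namespace hypothesis can at least be checked directly: on Linux, two processes share a mount namespace exactly when their ``/proc/<pid>/ns/mnt`` links resolve to the same target. A small sketch follows; the helper name ``same_mount_ns`` is made up, and reading another user's namespace links generally requires root.

```shell
#!/bin/sh
# same_mount_ns PID1 PID2
# Returns 0 if both processes share a mount namespace, 1 if they differ,
# and 2 if a namespace link cannot be read (no such PID, no /proc, or
# insufficient privileges).
same_mount_ns() {
    ns1=$(readlink "/proc/$1/ns/mnt") || return 2
    ns2=$(readlink "/proc/$2/ns/mnt") || return 2
    [ "$ns1" = "$ns2" ]
}

# Example (not executed here): compare the init process (PID 1) with
# the current shell:
#   sudo sh -c '. ./same_mount_ns.sh; same_mount_ns 1 $$; echo $?'
```

A result of 0 for PID 1 versus the SSH shell would rule out separate mount namespaces as the explanation.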

riccardomurri commented Oct 10, 2016

At last, it seems that feeding this script as image_userdata kills the
apt-get update as early as possible and solves the issues (up to a race
condition, which can theoretically still happen)::

    #!/bin/bash

    systemctl stop apt-daily.service
    systemctl kill --kill-who=all apt-daily.service

    # wait until done before starting other APT tasks
    while ! (systemctl list-units --all apt-daily.service | grep -Fq dead)
    do
      sleep 1
    done
    # print state, mainly for debugging
    systemctl list-units --all 'apt-daily.*'

    # now ensure Ansible can find /usr/bin/python
    apt-get install -y python
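
The wait loop above never gives up if the unit does not reach the "dead" state. A bounded variant could poll the unit's ``ActiveState`` property instead of grepping the ``list-units`` table. This is only a sketch: the function name ``wait_unit_stopped`` and the ``SYSTEMCTL`` override are hypothetical, added so the logic can be exercised without systemd.

```shell
#!/bin/sh
# SYSTEMCTL can be overridden, e.g. to stub systemctl out in tests.
SYSTEMCTL=${SYSTEMCTL:-systemctl}

# wait_unit_stopped UNIT MAX_POLLS
# Poll the unit's ActiveState about once per second. Return 0 as soon as
# it is "inactive" or "failed", 1 if it is in any other state after
# MAX_POLLS polls.
wait_unit_stopped() {
    unit=$1
    tries=$2
    while [ "$tries" -gt 0 ]; do
        # Prints a single line such as "ActiveState=inactive".
        state=$("$SYSTEMCTL" show -p ActiveState "$unit" 2>/dev/null)
        case $state in
            ActiveState=inactive|ActiveState=failed) return 0 ;;
        esac
        tries=$((tries - 1))
        [ "$tries" -gt 0 ] && sleep 1
    done
    return 1
}

# Example (not executed here):
#   wait_unit_stopped apt-daily.service 300 && apt-get install -y python
```

The comparison is against ``ActiveState`` rather than the exit status of ``systemctl is-active`` because a oneshot unit is reported as "activating", not "active", while its command is still running.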

riccardomurri commented Jan 24, 2017

There is apparently a race condition: the image_userdata script is executed
concurrently with other tasks in the init process. In particular, it is
possible that the SSH server is ready to serve connections before
image_userdata has run. In my tests on the local OpenStack infrastructure,
this happens 50% of the time.

Wherever this is an issue, a workaround is the following:

  • Start the cluster, skipping the setup (Ansible) part::

      elasticluster start --no-setup mycluster
    
  • Wait some time, then run the setup::

      sleep 30
      elasticluster setup mycluster
    

Adjust the wait time to your local infrastructure: it should be long
enough so that you see the "Cloud-init v. 0.7.8 finished" message in the
VM boot log.
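
Instead of guessing the wait time, the boot log can be checked for the "finished" message. Below is a minimal sketch; the function name ``cloud_init_finished`` is made up, and how the boot log is fetched depends on the cloud (on OpenStack, ``openstack console log show`` is one option; the server name in the example is hypothetical).

```shell
#!/bin/sh
# cloud_init_finished: read a VM boot log on stdin and succeed if it
# contains a "Cloud-init v. X.Y.Z finished" line.
cloud_init_finished() {
    grep -Eq 'Cloud-init v\. [0-9][0-9.]* finished'
}

# Example (not executed here; "mycluster-frontend001" is a placeholder):
#   until openstack console log show mycluster-frontend001 | cloud_init_finished
#   do
#       sleep 5
#   done
#   elasticluster setup mycluster
```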

riccardomurri commented Apr 3, 2017

The issue with the apt-daily.service systemd job is also being discussed upstream, see comments on Debian bug #844453.
