Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

120 seconds timeout on OpenNebula running subp.py #4078

Closed
ubuntu-server-builder opened this issue May 12, 2023 · 8 comments
Closed

120 seconds timeout on OpenNebula running subp.py #4078

ubuntu-server-builder opened this issue May 12, 2023 · 8 comments
Labels
launchpad Migrated from Launchpad

Comments

@ubuntu-server-builder
Copy link
Collaborator

This bug was originally filed in Launchpad as LP: #2007149

Launchpad details
affected_projects = []
assignee = None
assignee_name = None
date_closed = 2023-02-23T07:45:47.228941+00:00
date_created = 2023-02-13T13:09:39.495644+00:00
date_fix_committed = None
date_fix_released = None
id = 2007149
importance = undecided
is_complete = True
lp_url = https://bugs.launchpad.net/cloud-init/+bug/2007149
milestone = None
owner = rblom
owner_name = Ron
private = False
status = invalid
submitter = rblom
submitter_name = Ron
tags = []
duplicates = []

Launchpad user Ron(rblom) wrote on 2023-02-13T13:09:39.495644+00:00

Hi,

I'm using the latest AlmaLinux 9 GenericCloud image. This has cloud-init 22.1-5.el9.alma and works with OpenNebula. Booting is very slow on OpenNebula because of a 120 seconds timeout.

systemd-analyze blame give me:
2min 3.083s cloud-init-local.service

In de cloud-init.log I see:

2023-02-09 11:30:10,244 - util.py[DEBUG]: Reading from /run/cloud-init/tmp/tmpac937qy0/context.sh (quiet=False) 2023-02-09 11:30:10,245 - util.py[DEBUG]: Read 3674 bytes from /run/cloud-init/tmp/tmpac937qy0/context.sh 2023-02-09 11:30:10,245 - subp.py[DEBUG]: Running command ['sudo', '-u', 'nobody', 'bash', '-e'] with allowed return codes [0] (shell=False, capture=True) 2023-02-09 11:32:10,667 - util.py[DEBUG]: Reading from /sys/class/net/eth0/addr_assign_type (quiet=False) 2023-02-09 11:32:10,668 - util.py[DEBUG]: Read 2 bytes from /sys/class/net/eth0/addr_assign_type 2023-02-09 11:32:10,669 - util.py[DEBUG]: Reading from /sys/class/net/eth0/uevent (quiet=False)

See the two minute timeout after subp.py command.

When running the cloud-init-local.service manually when the machine is started I get the same logging but without the 120 seconds timeout.

@ubuntu-server-builder ubuntu-server-builder added the launchpad Migrated from Launchpad label May 12, 2023
@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Ron(rblom) wrote on 2023-02-13T13:09:39.495644+00:00

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Ron(rblom) wrote on 2023-02-13T13:28:58.813028+00:00

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Ron(rblom) wrote on 2023-02-13T13:57:14.859601+00:00

Launchpad attachments: cloud-init.tar.gz

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Scott Moser(smoser) wrote on 2023-02-13T14:58:54.984894+00:00

OpenNebula passes network information ("Context" [1]) in shell syntax.
Cloud-init drops permissions to the 'nobody' user when parsing that content.

It drops the permissions using 'sudo'. I suspect that sudo is timing out on a dns lookup as networking is not configured at this point in boot. I've never felt it terribly useful for sudo to rely on dns, but it can depending on configuration [3].

The easy solutions for you are:

  • figure out why sudo is doing that and configure it to not do that.
  • hack cloud-init to not use 'sudo' there

The right solution is probably to implement a parser in cloud-init that does not involve executing 'sh' at all. (Sorry, it was me that wrote that the first time).

--
[1] https://archives.opennebula.org/documentation:archives:rel4.2:cong
[2] https://canonical-cloud-init.readthedocs-hosted.com/en/latest/reference/datasources/opennebula.html#opennebula
[3] https://serverfault.com/questions/38114/why-does-sudo-command-take-long-to-execute

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Chad Smith(chad.smith) wrote on 2023-02-15T04:33:25.707380+00:00

Looking over journalctl logs as well as your network config in the attached cloud-init.log it looks like one of the following could lead to that sudo timeout on DNS resolution Scott mentioned.

  1. you have 3 DNS server IPs referenced in the context.sh content provided to this instance. It's possible that one of those IPs is down or hitting a timeout due to either routing problems on the configured network or one of those IPs not responding.
  2. it's possible that your sudo config doesn't have hostname-specific sections defined. If so, you may be able to get away with USERDATA providing the "manage_etc_hosts: true" which should add 127.0.0.1 thereby short-circuiting some of the DNS lookups for your defined <...cloud> hostname

Two other things I note in your attached logs.

  1. cloud-init complains about cc_ntp failing (check cloud-init status --long).
    With error
    Stderr: Failed to reload-or-restart chrony.service: Unit chrony.service not found.

This is due to misconfiguration I believe in user-data provided in context.sh.
From the instance user-data provided to this VM the ntp-related schema is incorrect:

ntp:
enabled: true
ntp_client: chrony
service_name: chronyd
servers:
- ntp1.rinis.cloud
- ntp2.rinis.cloud
- ntp3.rinis.cloud

The service_name and services keys should live under a "config:" key like this:

ntp:
enabled: true
ntp_client: chrony
config:
service_name: chronyd
servers:
- ntp1.blah.cloud
- ntp2.blah.cloud
- ntp3.blah.cloud

  1. runcmd is restarting chronyd directly (probably as a work around for the failed cc_ntp module run in Add a spec draft for the new distro namespace. #1)?

sudo cloud-init schema --system --annotate # on this system should help highlight some of those schema problems in the future. Though I don't think it handles non-ASCII character sets currently.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Chad Smith(chad.smith) wrote on 2023-02-15T04:45:36.632560+00:00

Agreed that there could be a better parsing of context.sh instead of shelling out. That said, this has been in place since 2013, I'd rather avoid the risk if this is a network config problem.

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Ron(rblom) wrote on 2023-02-17T13:31:50.130591+00:00

Hi,

I hacked the cloud-init not to use sudo and then there's no timeout. So your first suggestion seems the right one. Will try to figure out why sudo is waiting for 120 seconds.

Tnx

@ubuntu-server-builder
Copy link
Collaborator Author

Launchpad user Alberto Contreras(aciba) wrote on 2023-02-23T07:45:49.612487+00:00

Per the discussion, it looks like it is a cloud misconfiguration and I am going to mark it as invalid for the time being.

@ubuntu-server-builder ubuntu-server-builder closed this as not planned Won't fix, can't repro, duplicate, stale May 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
launchpad Migrated from Launchpad
Projects
None yet
Development

No branches or pull requests

1 participant