Fix DNS resolution performance regression during cloud-init local #6707
Fixes DNS queries for IP addresses that cause 2+ minute boot delays, particularly with systemd 259+. Moves IP detection earlier in is_resolvable() and removes legacy DNS-dependent metadata URL. Fixes canonical#6641
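As a rough illustration of "moving IP detection earlier", here is a minimal sketch of an early exit for IP literals. It is not the code from this PR: `is_ip_literal`, `is_resolvable_sketch`, and the injectable `resolver` parameter are hypothetical names used only for this example, and the real `is_resolvable()` in `util.py` may be structured differently.

```python
import ipaddress
import socket


def is_ip_literal(host: str) -> bool:
    """Return True if host is an IPv4 or IPv6 literal, so no DNS is needed."""
    try:
        # Strip the brackets used for IPv6 literals in URLs, e.g. "[fd00:ec2::254]".
        ipaddress.ip_address(host.strip("[]"))
        return True
    except ValueError:
        return False


def is_resolvable_sketch(host: str, resolver=socket.getaddrinfo) -> bool:
    """Hypothetical stand-in for is_resolvable(): test for IP literals first."""
    if is_ip_literal(host):
        # Return before any resolver call is made.
        return True
    try:
        resolver(host, None)
        return True
    except OSError:
        # socket.gaierror is a subclass of OSError.
        return False
```

With this ordering, `is_resolvable_sketch("169.254.169.254")` returns without calling the resolver at all, which is the behaviour the description above aims for.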
metadata_urls = [
    "http://169.254.169.254",
    "http://[fd00:ec2::254]",
    "http://instance-data.:8773",
On EC2 this resolves to the current IMDS IP address. While this is known to be a specific IP address, there may be some systems that depend on this.
The problem is that keeping this non-IP address triggers a call to is_resolvable(), which goes through the expensive DNS check. Reference:
cloudinit/sources/DataSourceEc2.py (line 363)
# Remove addresses from the list that wont resolve.
mdurls = mcfg.get("metadata_urls", self.metadata_urls)
filtered = [x for x in mdurls if util.is_resolvable_url(x)]
"instance-data.:8773" is actually removed from the list (at least on EC2) because it fails the is_resolvable_url() check, which in turn calls is_resolvable(). This happens both before and after the systemd 259 upgrade, as the logs attached to the bug report show:
From pre_259_journal.log:
Dec 26 13:36:19 ip-172-31-27-166 python3[318]: [CLOUDINIT] DataSourceEc2.py[DEBUG]: Removed the following from metadata urls: ['http://instance-data.:8773']
From post_259_journal.log:
Dec 26 13:43:18 ip-172-31-27-166 python3[484]: [CLOUDINIT] DataSourceEc2.py[DEBUG]: Removed the following from metadata urls: ['http://instance-data.:8773']
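To make the filtering step concrete, here is a small sketch of how such a list could be split into kept and removed URLs while only consulting DNS for hosts that are not IP literals. The function names and the injected `resolves` callback are assumptions for illustration; this is not the cloud-init implementation of is_resolvable_url().

```python
import ipaddress
from urllib.parse import urlparse


def needs_dns(host: str) -> bool:
    """Return True if host is a name rather than an IPv4/IPv6 literal."""
    try:
        ipaddress.ip_address(host)
        return False
    except ValueError:
        return True


def filter_metadata_urls(urls, resolves):
    """Split urls into (kept, removed).

    `resolves` is a hypothetical callback taking a hostname and returning
    True if it resolves; it is only called for non-IP hosts.
    """
    kept, removed = [], []
    for url in urls:
        # urlparse strips the brackets from IPv6 literals such as [fd00:ec2::254].
        host = urlparse(url).hostname
        if host is not None and (not needs_dns(host) or resolves(host)):
            kept.append(url)
        else:
            removed.append(url)
    return kept, removed
```

Called with the three default URLs and a `resolves` callback that always returns False, the sketch keeps the two IP-based URLs and removes only 'http://instance-data.:8773', which mirrors the "Removed the following from metadata urls" log lines above.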
I also checked the documentation for the "clouds" identified at the beginning of DataSourceEc2.py (line 36, class CloudNames: ...) and none of these refer to "instance-data.", but only "169.254.169.254"
Brightbox: https://www.brightbox.com/docs/reference/metadata-service/
Zstack: https://cloudinit.readthedocs.io/en/24.1/reference/datasources/zstack.html
e24cloud: https://www.e24cloud.com/en/e24cloud-servers/meta-data/
Outscale: https://docs.outscale.com/en/userguide/Accessing-the-Metadata-and-User-Data-of-a-VM.html
Tilaa: https://support.tilaa.com/hc/en-us/articles/228652587-Using-the-VPS-Metadata-Service-169-254-169-254
I am aware that there is also an "UNKNOWN" category for when the cloud cannot be identified; however, I don't think it's reasonable to assume that "instance-data.:8773" is a valid endpoint for clouds that fall into it.
There is also this old bug report (somewhat related): https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/2039723
Where it is mentioned that:
"HOWEVER, the name as given in the list above "instance-data." is trying to do the "hostnames ending in a '.' are fully qualified" thing, but in fact that name in AWS is not fully qualified. Instead, it requires the AWS region-specific local domain be appended"
This indicates that "instance-data.:8773" will (at least on EC2) always fail DNS resolution, no matter what.
The report also theorizes later on:
"(I wonder if this was originally done to support EC2-Classic? Detection of Classic instances is handled elsewhere, and AWS dropped supported for Classic networking in 2022 having migrated all such instances to a VPC. So if "instance-data." is a remnant of that era, it should be migrated also by removing the trailing dot.)"
This indicates that "instance-data.:8773" most likely is a relic that can be safely removed.
That being said, if reviewers still feel it should not be removed, it can be left in. There is a workaround: a "metadata_urls" list that does not contain "instance-data.:8773" can be provided through the cloud-init config files (ref.: https://cloudinit.readthedocs.io/en/latest/reference/datasources/ec2.html).
Either way, the change to is_resolvable() to exit early on IP addresses is still required.
Thanks for digging into this, @drzee99. I think that I am comfortable with removing it.
holmanb left a comment
Thanks @drzee99!
Fixes DNS queries for IP addresses that cause 2+ minute boot delays with systemd 259+. Moves IP detection earlier in is_resolvable() and removes legacy DNS-dependent metadata URL. Fixes GH-6641
Fix DNS resolution performance regression during cloud-init local
Summary
This PR addresses critical DNS resolution performance issues during the early cloud-init local stage that cause boot delays of 2+ minutes, particularly with systemd version 259 and later.
Problem
cloud-init local
Solution
1. Optimize IP address handling in util.py
2. Remove legacy DNS-dependent URL from DataSourceEc2.py
http://instance-data.:8773 which is not in current AWS IMDS documentation
Changes
is_resolvable()
Testing
is_resolvable()
Related Issues
Fixes #6641 - Systemd version 259 slows down DNS check during cloud-init local
Backward Compatibility