New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
azure: case-insensitive UUID to avoid new IID during kernel upgrade #798
azure: case-insensitive UUID to avoid new IID during kernel upgrade #798
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only looked at the integration test. Can do fuller review tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some quick integration test thoughts.
ba8e071
to
bf190cd
Compare
bf190cd
to
5c199e9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just integration test review for now. I've left some specific comments, but I'd like us to step back and consider whether this is the right path for this test. Previously you were launching an image that would present with one UUID case, and then upgrading its kernel in a way that would present the other case. I suggested an iteration of that approach: using an old standard Ubuntu image to avoid the dependency on Pro images. I think it's likely that would give us a simpler test (with fewer external dependencies); did you try that approach before settling on this one?
|
||
def _check_iid_insensitive_across_kernel_upgrade(client): | ||
uuid = client.read_from_file('/sys/class/dmi/id/product_uuid') | ||
assert uuid.islower(), "UUID does not appear to be lowercase {}".format( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to indicate (either in the message or in a comment) that this is a precondition assertion, rather than an assertion on what cloud-init should have done. Let's do this for other such assertions too.
uuid = client.read_from_file('/sys/class/dmi/id/product_uuid') | ||
assert uuid.isupper(), "UUID does not appear to be uppercase {}".format( | ||
uuid | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we check that the UUID hasn't changed (and therefore invalidated our preconditions)?
case-insensitive comparison and avoid triggering "new instance" detection | ||
in cloud-init on Azure platform. | ||
|
||
The test will launch an Ubuntu image which has a 5.4 kernel. We allow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't true: this test is not constrained by Ubuntu release, so will launch whatever version of the kernel is in the latest Azure image for the target Ubuntu release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've now pinned the test to launch a specific Ubuntu PRO FIPS image that will be available forever and exhibits a UUID case change when upgrading to linux-azure from linux-azure-fips.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, and now some not-integration-test review.
5566f9b
to
a38eb71
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code, unit tests and general structure of the integration test looks good to me, thank you!
We still need to get this test to run only on bionic, see my inline thoughts.
|
||
@pytest.mark.azure | ||
@pytest.mark.sru_next | ||
@pytest.mark.not_xenial |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will still run on every release bionic+; we could introduce an only_bionic
mark or similar. However, given that this test doesn't use the client
fixtures, we don't actually need to perform the skip via marks to avoid the instance launch: we can do it in the body of the test before we call launch
instead.
Ideally, we'd be able to do something like if (session_cloud.image.os, session_cloud.image.release) != ("ubuntu", "bionic"): pytest.skip(...)
but we don't currently store the ImageSpecification
that we determine on session_cloud
. That's probably not too painful to hook up (though may require some thought around snapshotting), but I don't think we need to for this, re-parsing OS_IMAGE
should be sufficient: something like image_spec = ImageSpecification.from_os_image(); if (image_spec.os, image_spec.release) != ("ubuntu", "bionic"): pytest.skip(...)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@OddBloke funny, I put that type of pytest.skip in locally and decided to pullt it out based on looking at self.settings.os_image I think. I'll put some semblance of that back into this PR
a38eb71
to
65b2311
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the revisions, a few more thoughts inline.
One thing to note, if I try to run this test locally, I get:
E msrestazure.azure_exceptions.CloudError: Azure Error: ResourcePurchaseValidationFailed
E Message: User failed validation to purchase resources. Error message: 'You have not accepted the legal terms on this subscription: '9a37cc2c-dc4a-4097-84dd-45dd2b8cbc63' for this plan. Before the subscription can be used, you need to accept the legal terms of the image. To read and accept legal terms, use the Azure CLI commands described at https://go.microsoft.com/fwlink/?linkid=2110637 or the PowerShell commands available at https://go.microsoft.com/fwlink/?linkid=862451. Alternatively, deploying via the Azure portal provides a UI experience for reading and accepting the legal terms. Offer details: publisher='canonical' offer = '0001-com-ubuntu-pro-bionic-fips', sku = 'pro-fips-18_04', Correlation Id: 'b0f49067-fae7-4c9f-b26c-8b7fd2eec9a0'.'
I'm going to follow those steps now, but this confirms that we shouldn't be running this test by default.
# work around pad.lv/1908287 | ||
instance.restart() | ||
if not instance.execute("cloud-init status --wait --long").ok: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% sure this will workaround the issue: this first execute
call could still get into the instance before it reboots (which is the root cause of the issue) and therefore not trigger any of the waiting behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Please see my below comment on where we call this before spending too much time on these comments. 👍)
for _ in range(10): | ||
time.sleep(5) | ||
result = instance.execute("cloud-init status --wait --long") | ||
if result.ok: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If result.ok
is False, what is gained by retrying? --wait
means that we know that's the final status we're going to hit, so I think these retries will just run cloud-init status
4 additional times before raising the below exception.
# Allow for upgrade of cloud-init in FIPS image for pre-release testing | ||
if install_cloudinit: | ||
instance.install_new_cloud_init(source, take_snapshot=False) | ||
_restart_with_retries(instance) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that we'll restart the instance for the new kernel, do we need to restart here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason we restart after upgrading cloud-init is because install_new_cloud_init
calls instance.clean() which removes all cloud-init logs, semaphores and syslog though pycloudlib. What we are trying to establish is that the upgraded cloud-init detects uppercase UUID on first boot, then on 2nd boot into new kernel we don't accidentally detect a change in instance-id due to the now lowercase UUID. I support instead of the reboot I could either avoid instance.clean() in install_new_cloud_init
by providing a clean=False
param. That'd save time on the reboot (because cloud-init will have already detected the uppercase instance-id even before we upgrade cloud-init on that instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've pushed a change to avoid the extra-reboot and I'm testing this now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'd support modifying install_new_cloud_init
to allow for this case; a clean
parameter sounds reasonable to me.
@TheRealFalcon How does that sound to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added clean param and confirmed success run of the following (which a built deb from this PR)
CLOUD_INIT_OS_IMAGE=bionic CLOUD_INIT_CLOUD_INIT_SOURCE=/cloud-init_20.4.1-86-g5c199e90-1~bddeb_all.deb CLOUD_INIT_PUBLIC_SSH_KEY=/mykey CLOUD_INIT_PLATFORM=azure .tox/integration-tests/bin/py.test -v tests/integration_tests/bugs/test_lp1835584.py
65b2311
to
6396c3c
Compare
To follow-up However, now that I've successfully run the test, it has failed. The cloud-init version in that image (intentionally) doesn't have the fix to make this test pass, which means this test can never pass without an updated cloud-init being installed. (This is different to our usual tests, which we would expect to fail until the Ubuntu image is updated with the fixed cloud-init: this test will never use an image with a fixed cloud-init.) Should we skip the test if we don't have a CLOUD_INIT_SOURCE which |
Kernel's newer than 4.15 present /sys/dmi/id/product_uuid as a lowercase value. Peviously UUID as uppercase. Azure datasource reads the product_uuid directly as their platform's instance-id. This present a problem if a kernel is either upgraded or downgraded across the 4.15 kernel version boundary because the case of the UUID will change, resulting in cloud-init seeing a "new" instance id and re-running all modules. This causes ssh host keys to get regenerated across reboot into the new kernels which will cause concern on long-running instances that something nefarious has happened. LP: #1835584
37c7fbe
to
6c738da
Compare
Yes this a good point, I was thinking for current dev it gives us the change to run the test to see the failure currently without the image_source set. But, we probably don't want to instrument tests that we can easily cause to break. Reviewers could just remove such a pytest.skip if they wanted to see failures. |
6c738da
to
2b83481
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Integration test LGTM. I did a much shallower review of everything else given how much Dan has been involved.
Proposed Commit Message
Kernel's newer than 4.15 present /sys/dmi/id/product_uuid as a
lowercase value. Previously UUID was uppercase.
Azure datasource reads the product_uuid directly as their platform's
instance-id. This presents a problem if a kernel is either
upgraded or downgraded across the 4.15 kernel version boundary because
the case of the UUID will change, resulting in cloud-init seeing a
"new" instance id and re-running all modules.
Re-running cc_ssh in cloud-init deletes and regenerates ssh_host keys
on a system which can cause concern on long-running instances that
somethingnefarious has happened.
Also add:
a FIPS kernel with uppercase UUID to a lowercase UUID in linux-azure
next SRU
LP: #1835584
Additional Context
Integration test will add a 4.14 generic kernel, add grub config to prefer the "generic" flavor of that kernel across reboot.
This triggers the uppercase product_uuid which on current cloud-init would generate new ssh host keys.
New cloud-init will not trigger this ssh host key creation.
Test Steps
Checklist: