azure: case-insensitive UUID to avoid new IID during kernel upgrade #798

blackboxsw · 2021-02-01T22:24:12Z

Proposed Commit Message

Kernel's newer than 4.15 present /sys/dmi/id/product_uuid as a
lowercase value. Previously UUID was uppercase.

Azure datasource reads the product_uuid directly as their platform's
instance-id. This presents a problem if a kernel is either
upgraded or downgraded across the 4.15 kernel version boundary because
the case of the UUID will change, resulting in cloud-init seeing a
"new" instance id and re-running all modules.

Re-running cc_ssh in cloud-init deletes and regenerates ssh_host keys
on a system which can cause concern on long-running instances that
somethingnefarious has happened.

Also add:

An integration test for this for Azure Bionic Ubuntu FIPS upgrading from
a FIPS kernel with uppercase UUID to a lowercase UUID in linux-azure
A new pytest.mark.sru_next to collect all integration tests related to our
next SRU

LP: #1835584

Additional Context

Integration test will add a 4.14 generic kernel, add grub config to prefer the "generic" flavor of that kernel across reboot.
This triggers the uppercase product_uuid which on current cloud-init would generate new ssh host keys.
New cloud-init will not trigger this ssh host key creation.

Test Steps

# create a passwordless ssh pub/private keypair
ssh-keygen 


 CLOUD_INIT_PUBLIC_SSH_KEY=/root/.ssh/id_rsa_az.pub CLOUD_INIT_PLATFORM=azure  .tox/integration-tests/bin/py.test -v tests/integration_tests/bugs/test_lp1835584.py
# Expect failure running against current cloud-init: 
# E       AssertionError: config_ssh ran too many times 2
# E       assert 1 == 2
# E         +1
# E         -2

make deb
# Expect success running using CLOUD_INIT_CLOUD_INIT_SOURCE=<deb_from_this_pr>

Checklist:

My code follows the process laid out in the documentation
I have updated or added any unit tests accordingly
[n/a] I have updated or added any documentation accordingly

TheRealFalcon

Only looked at the integration test. Can do fuller review tomorrow.

tests/integration_tests/bugs/test_lp1835584.py

OddBloke

Some quick integration test thoughts.

tests/integration_tests/bugs/test_lp1835584.py

OddBloke

Just integration test review for now. I've left some specific comments, but I'd like us to step back and consider whether this is the right path for this test. Previously you were launching an image that would present with one UUID case, and then upgrading its kernel in a way that would present the other case. I suggested an iteration of that approach: using an old standard Ubuntu image to avoid the dependency on Pro images. I think it's likely that would give us a simpler test (with fewer external dependencies); did you try that approach before settling on this one?

OddBloke · 2021-02-16T19:56:07Z

tests/integration_tests/bugs/test_lp1835584.py

+
+def _check_iid_insensitive_across_kernel_upgrade(client):
+    uuid = client.read_from_file('/sys/class/dmi/id/product_uuid')
+    assert uuid.islower(), "UUID does not appear to be lowercase {}".format(


It would be good to indicate (either in the message or in a comment) that this is a precondition assertion, rather than an assertion on what cloud-init should have done. Let's do this for other such assertions too.

tests/integration_tests/bugs/test_lp1835584.py

OddBloke · 2021-02-16T19:58:50Z

tests/integration_tests/bugs/test_lp1835584.py

+    uuid = client.read_from_file('/sys/class/dmi/id/product_uuid')
+    assert uuid.isupper(), "UUID does not appear to be uppercase {}".format(
+        uuid
+    )


Should we check that the UUID hasn't changed (and therefore invalidated our preconditions)?

tests/integration_tests/bugs/test_lp1835584.py

OddBloke · 2021-02-16T20:04:16Z

tests/integration_tests/bugs/test_lp1835584.py

+case-insensitive comparison and avoid triggering "new instance" detection
+in cloud-init on Azure platform.
+
+The test will launch an Ubuntu image which has a 5.4 kernel. We allow


This isn't true: this test is not constrained by Ubuntu release, so will launch whatever version of the kernel is in the latest Azure image for the target Ubuntu release.

We've now pinned the test to launch a specific Ubuntu PRO FIPS image that will be available forever and exhibits a UUID case change when upgrading to linux-azure from linux-azure-fips.

tests/integration_tests/bugs/test_lp1835584.py

OddBloke

OK, and now some not-integration-test review.

cloudinit/sources/DataSourceAzure.py

tests/unittests/test_datasource/test_azure.py

OddBloke

Code, unit tests and general structure of the integration test looks good to me, thank you!

We still need to get this test to run only on bionic, see my inline thoughts.

tests/integration_tests/bugs/test_lp1835584.py

OddBloke · 2021-02-17T22:21:30Z

tests/integration_tests/bugs/test_lp1835584.py

+
+@pytest.mark.azure
+@pytest.mark.sru_next
+@pytest.mark.not_xenial


This will still run on every release bionic+; we could introduce an only_bionic mark or similar. However, given that this test doesn't use the client fixtures, we don't actually need to perform the skip via marks to avoid the instance launch: we can do it in the body of the test before we call launch instead.

Ideally, we'd be able to do something like if (session_cloud.image.os, session_cloud.image.release) != ("ubuntu", "bionic"): pytest.skip(...) but we don't currently store the ImageSpecification that we determine on session_cloud. That's probably not too painful to hook up (though may require some thought around snapshotting), but I don't think we need to for this, re-parsing OS_IMAGE should be sufficient: something like image_spec = ImageSpecification.from_os_image(); if (image_spec.os, image_spec.release) != ("ubuntu", "bionic"): pytest.skip(...).

@OddBloke funny, I put that type of pytest.skip in locally and decided to pullt it out based on looking at self.settings.os_image I think. I'll put some semblance of that back into this PR

tests/integration_tests/bugs/test_lp1835584.py

OddBloke

Thanks for the revisions, a few more thoughts inline.

One thing to note, if I try to run this test locally, I get:

E           msrestazure.azure_exceptions.CloudError: Azure Error: ResourcePurchaseValidationFailed
E           Message: User failed validation to purchase resources. Error message: 'You have not accepted the legal terms on this subscription: '9a37cc2c-dc4a-4097-84dd-45dd2b8cbc63' for this plan. Before the subscription can be used, you need to accept the legal terms of the image. To read and accept legal terms, use the Azure CLI commands described at https://go.microsoft.com/fwlink/?linkid=2110637 or the PowerShell commands available at https://go.microsoft.com/fwlink/?linkid=862451. Alternatively, deploying via the Azure portal provides a UI experience for reading and accepting the legal terms. Offer details: publisher='canonical' offer = '0001-com-ubuntu-pro-bionic-fips', sku = 'pro-fips-18_04', Correlation Id: 'b0f49067-fae7-4c9f-b26c-8b7fd2eec9a0'.'

I'm going to follow those steps now, but this confirms that we shouldn't be running this test by default.

OddBloke · 2021-02-18T14:34:47Z

tests/integration_tests/bugs/test_lp1835584.py

+    # work around pad.lv/1908287
+    instance.restart()
+    if not instance.execute("cloud-init status --wait --long").ok:


I'm not 100% sure this will workaround the issue: this first execute call could still get into the instance before it reboots (which is the root cause of the issue) and therefore not trigger any of the waiting behaviour.

(Please see my below comment on where we call this before spending too much time on these comments. 👍)

OddBloke · 2021-02-18T14:36:29Z

tests/integration_tests/bugs/test_lp1835584.py

+        for _ in range(10):
+            time.sleep(5)
+            result = instance.execute("cloud-init status --wait --long")
+            if result.ok:


If result.ok is False, what is gained by retrying? --wait means that we know that's the final status we're going to hit, so I think these retries will just run cloud-init status 4 additional times before raising the below exception.

OddBloke · 2021-02-18T14:41:41Z

tests/integration_tests/bugs/test_lp1835584.py

+        # Allow for upgrade of cloud-init in FIPS image for pre-release testing
+        if install_cloudinit:
+            instance.install_new_cloud_init(source, take_snapshot=False)
+            _restart_with_retries(instance)


Given that we'll restart the instance for the new kernel, do we need to restart here as well?

The reason we restart after upgrading cloud-init is because install_new_cloud_init calls instance.clean() which removes all cloud-init logs, semaphores and syslog though pycloudlib. What we are trying to establish is that the upgraded cloud-init detects uppercase UUID on first boot, then on 2nd boot into new kernel we don't accidentally detect a change in instance-id due to the now lowercase UUID. I support instead of the reboot I could either avoid instance.clean() in install_new_cloud_init by providing a clean=False param. That'd save time on the reboot (because cloud-init will have already detected the uppercase instance-id even before we upgrade cloud-init on that instance.

I've pushed a change to avoid the extra-reboot and I'm testing this now.

Yeah, I'd support modifying install_new_cloud_init to allow for this case; a clean parameter sounds reasonable to me.

@TheRealFalcon How does that sound to you?

added clean param and confirmed success run of the following (which a built deb from this PR)

CLOUD_INIT_OS_IMAGE=bionic CLOUD_INIT_CLOUD_INIT_SOURCE=/cloud-init_20.4.1-86-g5c199e90-1~bddeb_all.deb CLOUD_INIT_PUBLIC_SSH_KEY=/mykey CLOUD_INIT_PLATFORM=azure .tox/integration-tests/bin/py.test -v tests/integration_tests/bugs/test_lp1835584.py

tests/integration_tests/bugs/test_lp1835584.py

OddBloke · 2021-02-18T20:49:42Z

One thing to note, if I try to run this test locally, I get:

E           msrestazure.azure_exceptions.CloudError: Azure Error: ResourcePurchaseValidationFailed
E           Message: User failed validation to purchase resources. Error message: 'You have not accepted the legal terms on this subscription: '9a37cc2c-dc4a-4097-84dd-45dd2b8cbc63' for this plan. Before the subscription can be used, you need to accept the legal terms of the image. To read and accept legal terms, use the Azure CLI commands described at https://go.microsoft.com/fwlink/?linkid=2110637 or the PowerShell commands available at https://go.microsoft.com/fwlink/?linkid=862451. Alternatively, deploying via the Azure portal provides a UI experience for reading and accepting the legal terms. Offer details: publisher='canonical' offer = '0001-com-ubuntu-pro-bionic-fips', sku = 'pro-fips-18_04', Correlation Id: 'b0f49067-fae7-4c9f-b26c-8b7fd2eec9a0'.'

I'm going to follow those steps now, but this confirms that we shouldn't be running this test by default.

To follow-up az vm image terms accept --urn Canonical:0001-com-ubuntu-pro-bionic-fips:pro-fips-18_04:18.04.202010201 fixed this issue for me.

However, now that I've successfully run the test, it has failed. The cloud-init version in that image (intentionally) doesn't have the fix to make this test pass, which means this test can never pass without an updated cloud-init being installed. (This is different to our usual tests, which we would expect to fail until the Ubuntu image is updated with the fixed cloud-init: this test will never use an image with a fixed cloud-init.) Should we skip the test if we don't have a CLOUD_INIT_SOURCE which installs_new_version()?

Kernel's newer than 4.15 present /sys/dmi/id/product_uuid as a lowercase value. Peviously UUID as uppercase. Azure datasource reads the product_uuid directly as their platform's instance-id. This present a problem if a kernel is either upgraded or downgraded across the 4.15 kernel version boundary because the case of the UUID will change, resulting in cloud-init seeing a "new" instance id and re-running all modules. This causes ssh host keys to get regenerated across reboot into the new kernels which will cause concern on long-running instances that something nefarious has happened. LP: #1835584

blackboxsw · 2021-02-18T20:57:57Z

One thing to note, if I try to run this test locally, I get:
E           msrestazure.azure_exceptions.CloudError: Azure Error: ResourcePurchaseValidationFailed
E           Message: User failed validation to purchase resources. Error message: 'You have not accepted the legal terms on this subscription: '9a37cc2c-dc4a-4097-84dd-45dd2b8cbc63' for this plan. Before the subscription can be used, you need to accept the legal terms of the image. To read and accept legal terms, use the Azure CLI commands described at https://go.microsoft.com/fwlink/?linkid=2110637 or the PowerShell commands available at https://go.microsoft.com/fwlink/?linkid=862451. Alternatively, deploying via the Azure portal provides a UI experience for reading and accepting the legal terms. Offer details: publisher='canonical' offer = '0001-com-ubuntu-pro-bionic-fips', sku = 'pro-fips-18_04', Correlation Id: 'b0f49067-fae7-4c9f-b26c-8b7fd2eec9a0'.'
I'm going to follow those steps now, but this confirms that we shouldn't be running this test by default.
To follow-up az vm image terms accept --urn Canonical:0001-com-ubuntu-pro-bionic-fips:pro-fips-18_04:18.04.202010201 fixed this issue for me.

However, now that I've successfully run the test, it has failed. The cloud-init version in that image (intentionally) doesn't have the fix to make this test pass, which means this test can never pass without an updated cloud-init being installed. (This is different to our usual tests, which we would expect to fail until the Ubuntu image is updated with the fixed cloud-init: this test will never use an image with a fixed cloud-init.) Should we skip the test if we don't have a CLOUD_INIT_SOURCE which installs_new_version()?

Yes this a good point, I was thinking for current dev it gives us the change to run the test to see the failure currently without the image_source set. But, we probably don't want to instrument tests that we can easily cause to break. Reviewers could just remove such a pytest.skip if they wanted to see failures.

TheRealFalcon

Integration test LGTM. I did a much shallower review of everything else given how much Dan has been involved.

TheRealFalcon reviewed Feb 1, 2021

View reviewed changes

tests/integration_tests/bugs/test_lp1835584.py Outdated Show resolved Hide resolved

tests/integration_tests/bugs/test_lp1835584.py Outdated Show resolved Hide resolved

blackboxsw mentioned this pull request Feb 2, 2021

Enabling FIPS on Ubuntu Pro 18.04 in Azure changes the ssh host key canonical/ubuntu-pro-client#1177

Closed

OddBloke reviewed Feb 2, 2021

View reviewed changes

blackboxsw force-pushed the ssh-revert-default-auth-keys branch 3 times, most recently from ba8e071 to bf190cd Compare February 12, 2021 21:59

blackboxsw requested review from TheRealFalcon and OddBloke February 12, 2021 22:00

blackboxsw force-pushed the ssh-revert-default-auth-keys branch from bf190cd to 5c199e9 Compare February 12, 2021 22:34

OddBloke reviewed Feb 16, 2021

View reviewed changes

cloudinit/sources/DataSourceAzure.py Outdated Show resolved Hide resolved

tests/unittests/test_datasource/test_azure.py Outdated Show resolved Hide resolved

blackboxsw force-pushed the ssh-revert-default-auth-keys branch 2 times, most recently from 5566f9b to a38eb71 Compare February 17, 2021 20:19

blackboxsw requested a review from OddBloke February 17, 2021 20:28

OddBloke reviewed Feb 17, 2021

View reviewed changes

blackboxsw force-pushed the ssh-revert-default-auth-keys branch from a38eb71 to 65b2311 Compare February 18, 2021 02:13

OddBloke reviewed Feb 18, 2021

View reviewed changes

blackboxsw force-pushed the ssh-revert-default-auth-keys branch from 65b2311 to 6396c3c Compare February 18, 2021 20:14

blackboxsw force-pushed the ssh-revert-default-auth-keys branch from 37c7fbe to 6c738da Compare February 18, 2021 20:50

tests: add integration test for validating Azure UUID case-insentive

2b83481

blackboxsw force-pushed the ssh-revert-default-auth-keys branch from 6c738da to 2b83481 Compare February 18, 2021 21:01

Merge branch 'master' into ssh-revert-default-auth-keys

17b3817

TheRealFalcon approved these changes Feb 19, 2021

View reviewed changes

blackboxsw merged commit 66e2d42 into canonical:master Feb 19, 2021

This was referenced May 12, 2023

Release 21.1 #3846

Closed

Randomly set credentials written in cleartext to world-readable file #3854

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

azure: case-insensitive UUID to avoid new IID during kernel upgrade #798

azure: case-insensitive UUID to avoid new IID during kernel upgrade #798

blackboxsw commented Feb 1, 2021 •

edited

TheRealFalcon left a comment

OddBloke left a comment

OddBloke left a comment

OddBloke Feb 16, 2021

OddBloke Feb 16, 2021

OddBloke Feb 16, 2021

blackboxsw Feb 18, 2021

OddBloke left a comment

OddBloke left a comment

OddBloke Feb 17, 2021

blackboxsw Feb 17, 2021

OddBloke left a comment

OddBloke Feb 18, 2021

OddBloke Feb 18, 2021

OddBloke Feb 18, 2021

OddBloke Feb 18, 2021

blackboxsw Feb 18, 2021

blackboxsw Feb 18, 2021

OddBloke Feb 18, 2021

blackboxsw Feb 18, 2021

OddBloke commented Feb 18, 2021

blackboxsw commented Feb 18, 2021

TheRealFalcon left a comment

azure: case-insensitive UUID to avoid new IID during kernel upgrade #798

azure: case-insensitive UUID to avoid new IID during kernel upgrade #798

Conversation

blackboxsw commented Feb 1, 2021 • edited

Proposed Commit Message

Additional Context

Test Steps

Checklist:

TheRealFalcon left a comment

Choose a reason for hiding this comment

OddBloke left a comment

Choose a reason for hiding this comment

OddBloke left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

OddBloke left a comment

Choose a reason for hiding this comment

OddBloke left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

OddBloke left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

OddBloke commented Feb 18, 2021

blackboxsw commented Feb 18, 2021

TheRealFalcon left a comment

Choose a reason for hiding this comment

blackboxsw commented Feb 1, 2021 •

edited