azure: case-insensitive UUID to avoid new IID during kernel upgrade #798

Merged
merged 3 commits into canonical:master on Feb 19, 2021

Conversation

blackboxsw
Collaborator

@blackboxsw blackboxsw commented Feb 1, 2021

Proposed Commit Message

Kernels newer than 4.15 present /sys/class/dmi/id/product_uuid as a
lowercase value. Previously the UUID was uppercase.

Azure datasource reads the product_uuid directly as their platform's
instance-id. This presents a problem if a kernel is either
upgraded or downgraded across the 4.15 kernel version boundary because
the case of the UUID will change, resulting in cloud-init seeing a
"new" instance id and re-running all modules.

Re-running cc_ssh in cloud-init deletes and regenerates ssh host keys
on a system, which can cause concern on long-running instances that
something nefarious has happened.

Also add:

  • An integration test for this for Azure Bionic Ubuntu FIPS upgrading from
    a FIPS kernel with uppercase UUID to a lowercase UUID in linux-azure
  • A new pytest.mark.sru_next to collect all integration tests related to our
    next SRU

LP: #1835584
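
The case-insensitive comparison described in the commit message can be sketched as follows. This is a minimal illustration, not the code from this PR: the function name `stable_instance_id` and its parameters are hypothetical, and it assumes the previously recorded instance-id is available as a string.

```python
from typing import Optional


def stable_instance_id(current_uuid: str, previous_iid: Optional[str]) -> str:
    """Return the instance-id to use for this boot.

    Hypothetical helper: if the UUID reported by the kernel differs from
    the previously recorded instance-id only by letter case, keep the
    previous value so the instance is not treated as new.
    """
    current = current_uuid.strip()
    if previous_iid is not None:
        previous = previous_iid.strip()
        # Compare without regard to case; a kernel change may flip it.
        if previous.lower() == current.lower():
            return previous
    # No previous id, or a genuinely different UUID: use the current one.
    return current
```

Returning the previously recorded value (rather than a normalized one) is one possible design choice: it avoids changing the instance-id on systems that already booted with either case.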

Additional Context

The integration test will add a 4.14 generic kernel and add grub config to prefer the "generic" flavor of that kernel across reboot.
This triggers the uppercase product_uuid, which on current cloud-init would generate new ssh host keys.
The new cloud-init will not trigger this ssh host key creation.

Test Steps

# create a passwordless ssh pub/private keypair
ssh-keygen 


 CLOUD_INIT_PUBLIC_SSH_KEY=/root/.ssh/id_rsa_az.pub CLOUD_INIT_PLATFORM=azure  .tox/integration-tests/bin/py.test -v tests/integration_tests/bugs/test_lp1835584.py
# Expect failure running against current cloud-init: 
# E       AssertionError: config_ssh ran too many times 2
# E       assert 1 == 2
# E         +1
# E         -2

make deb
# Expect success running using CLOUD_INIT_CLOUD_INIT_SOURCE=<deb_from_this_pr>

Checklist:

  • My code follows the process laid out in the documentation
  • I have updated or added any unit tests accordingly
  • [n/a] I have updated or added any documentation accordingly

Member

@TheRealFalcon TheRealFalcon left a comment

Only looked at the integration test. Can do fuller review tomorrow.

Collaborator

@OddBloke OddBloke left a comment

Some quick integration test thoughts.

@blackboxsw blackboxsw force-pushed the ssh-revert-default-auth-keys branch 3 times, most recently from ba8e071 to bf190cd Compare February 12, 2021 21:59
Collaborator

@OddBloke OddBloke left a comment

Just integration test review for now. I've left some specific comments, but I'd like us to step back and consider whether this is the right path for this test. Previously you were launching an image that would present with one UUID case, and then upgrading its kernel in a way that would present the other case. I suggested an iteration of that approach: using an old standard Ubuntu image to avoid the dependency on Pro images. I think it's likely that would give us a simpler test (with fewer external dependencies); did you try that approach before settling on this one?


def _check_iid_insensitive_across_kernel_upgrade(client):
    uuid = client.read_from_file('/sys/class/dmi/id/product_uuid')
    assert uuid.islower(), "UUID does not appear to be lowercase {}".format(
Collaborator

It would be good to indicate (either in the message or in a comment) that this is a precondition assertion, rather than an assertion on what cloud-init should have done. Let's do this for other such assertions too.

Comment on lines 72 to 66
    uuid = client.read_from_file('/sys/class/dmi/id/product_uuid')
    assert uuid.isupper(), "UUID does not appear to be uppercase {}".format(
        uuid
    )
Collaborator

Should we check that the UUID hasn't changed (and therefore invalidated our preconditions)?
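
One way to express the two precondition checks raised in these comments (the case differs, but the UUID value itself is unchanged) is sketched below. The helper name `check_uuid_preconditions` is hypothetical and is not taken from the test in this PR; it only assumes that the UUID was read as a string before and after the kernel change.

```python
def check_uuid_preconditions(uuid_before: str, uuid_after: str) -> None:
    """Assert test preconditions, not cloud-init behaviour.

    Hypothetical helper: the UUID must differ in case across the kernel
    change, but must otherwise be the same value.
    """
    before = uuid_before.strip()
    after = uuid_after.strip()
    # Precondition 1: the kernel change actually flipped the case.
    assert before != after, (
        "Precondition failed: UUID case did not change across reboot: "
        "{}".format(after)
    )
    # Precondition 2: only the case changed, not the UUID itself.
    assert before.lower() == after.lower(), (
        "Precondition failed: UUID value changed across reboot: "
        "{} -> {}".format(before, after)
    )
```

Prefixing the messages with "Precondition failed" keeps a failure here distinguishable from a failure in the assertions about what cloud-init did.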

case-insensitive comparison and avoid triggering "new instance" detection
in cloud-init on Azure platform.

The test will launch an Ubuntu image which has a 5.4 kernel. We allow
Collaborator

This isn't true: this test is not constrained by Ubuntu release, so will launch whatever version of the kernel is in the latest Azure image for the target Ubuntu release.

Collaborator Author

We've now pinned the test to launch a specific Ubuntu PRO FIPS image that will be available forever and exhibits a UUID case change when upgrading to linux-azure from linux-azure-fips.

Collaborator

@OddBloke OddBloke left a comment

OK, and now some not-integration-test review.

@blackboxsw blackboxsw force-pushed the ssh-revert-default-auth-keys branch 2 times, most recently from 5566f9b to a38eb71 Compare February 17, 2021 20:19
Collaborator

@OddBloke OddBloke left a comment

Code, unit tests and general structure of the integration test look good to me, thank you!

We still need to get this test to run only on bionic; see my inline thoughts.


@pytest.mark.azure
@pytest.mark.sru_next
@pytest.mark.not_xenial
Collaborator

This will still run on every release bionic+; we could introduce an only_bionic mark or similar. However, given that this test doesn't use the client fixtures, we don't actually need to perform the skip via marks to avoid the instance launch: we can do it in the body of the test before we call launch instead.

Ideally, we'd be able to do something like if (session_cloud.image.os, session_cloud.image.release) != ("ubuntu", "bionic"): pytest.skip(...), but we don't currently store the ImageSpecification that we determine on session_cloud. That's probably not too painful to hook up (though it may require some thought around snapshotting), but I don't think we need to for this; re-parsing OS_IMAGE should be sufficient. Something like image_spec = ImageSpecification.from_os_image(); if (image_spec.os, image_spec.release) != ("ubuntu", "bionic"): pytest.skip(...).
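
The in-body skip suggested here could be sketched roughly as below. Everything in this block is a stand-in: `ImageSpec`, the `image_id::os::release` string format, and the set of bare release names are assumptions made for illustration, not the test suite's actual `ImageSpecification`. The helper returns a reason string so that a test could pass it to `pytest.skip(...)` before launching an instance.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: bare release names that imply an Ubuntu image.
KNOWN_UBUNTU_RELEASES = {"xenial", "bionic", "focal", "groovy", "hirsute"}


@dataclass(frozen=True)
class ImageSpec:
    """Hypothetical stand-in for the suite's image specification."""

    image_id: str
    os: Optional[str] = None
    release: Optional[str] = None

    @classmethod
    def from_os_image(cls, os_image: str) -> "ImageSpec":
        # Assumed format: "image_id[::os[::release]]" or a bare release name.
        parts = os_image.split("::")
        if len(parts) == 1 and parts[0] in KNOWN_UBUNTU_RELEASES:
            return cls(parts[0], "ubuntu", parts[0])
        os_name = parts[1] if len(parts) > 1 else None
        release = parts[2] if len(parts) > 2 else None
        return cls(parts[0], os_name, release)


def bionic_only_skip_reason(os_image: str) -> Optional[str]:
    """Return a skip reason unless the image is Ubuntu bionic."""
    spec = ImageSpec.from_os_image(os_image)
    if (spec.os, spec.release) != ("ubuntu", "bionic"):
        return "Test only applies to Ubuntu bionic, got {!r}".format(os_image)
    return None
```

In a test body this would be used before the launch call, for example: compute the reason, and call `pytest.skip(reason)` if it is not `None`.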

Collaborator Author

@OddBloke funny, I put that type of pytest.skip in locally and decided to pull it out, based on looking at self.settings.os_image I think. I'll put some semblance of that back into this PR.

Collaborator

@OddBloke OddBloke left a comment

Thanks for the revisions, a few more thoughts inline.

One thing to note, if I try to run this test locally, I get:

E           msrestazure.azure_exceptions.CloudError: Azure Error: ResourcePurchaseValidationFailed
E           Message: User failed validation to purchase resources. Error message: 'You have not accepted the legal terms on this subscription: '9a37cc2c-dc4a-4097-84dd-45dd2b8cbc63' for this plan. Before the subscription can be used, you need to accept the legal terms of the image. To read and accept legal terms, use the Azure CLI commands described at https://go.microsoft.com/fwlink/?linkid=2110637 or the PowerShell commands available at https://go.microsoft.com/fwlink/?linkid=862451. Alternatively, deploying via the Azure portal provides a UI experience for reading and accepting the legal terms. Offer details: publisher='canonical' offer = '0001-com-ubuntu-pro-bionic-fips', sku = 'pro-fips-18_04', Correlation Id: 'b0f49067-fae7-4c9f-b26c-8b7fd2eec9a0'.'

I'm going to follow those steps now, but this confirms that we shouldn't be running this test by default.

Comment on lines 48 to 50
# work around pad.lv/1908287
instance.restart()
if not instance.execute("cloud-init status --wait --long").ok:
Collaborator

I'm not 100% sure this will work around the issue: this first execute call could still get into the instance before it reboots (which is the root cause of the issue) and therefore not trigger any of the waiting behaviour.

Collaborator

(Please see my below comment on where we call this before spending too much time on these comments. 👍)

for _ in range(10):
    time.sleep(5)
    result = instance.execute("cloud-init status --wait --long")
    if result.ok:
Collaborator

If result.ok is False, what is gained by retrying? --wait means that we know that's the final status we're going to hit, so I think these retries will just run cloud-init status 4 additional times before raising the below exception.
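
To illustrate the distinction drawn in this comment, the sketch below retries only when the command could not be delivered at all, and treats a completed `cloud-init status --wait` as final. It is not the code from this PR: `execute` is a hypothetical callable standing in for `instance.execute`, `Result` is a stand-in type, and the assumption that an unreachable instance surfaces as an `OSError` is made purely for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    """Stand-in for a command result with an `ok` flag."""

    ok: bool
    stdout: str = ""


def wait_for_cloud_init(
    execute: Callable[[str], Result], retries: int = 10, delay: float = 5.0
) -> Result:
    """Retry on connection errors only; a finished --wait is final."""
    last_error = None
    for _ in range(retries):
        try:
            result = execute("cloud-init status --wait --long")
        except OSError as error:
            # Assumed transient failure (e.g. instance still rebooting).
            last_error = error
            time.sleep(delay)
            continue
        if not result.ok:
            # --wait already blocked until the final status; do not retry.
            raise RuntimeError("cloud-init failed: {}".format(result.stdout))
        return result
    raise RuntimeError("instance unreachable after retries") from last_error
```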

# Allow for upgrade of cloud-init in FIPS image for pre-release testing
if install_cloudinit:
    instance.install_new_cloud_init(source, take_snapshot=False)
    _restart_with_retries(instance)
Collaborator

Given that we'll restart the instance for the new kernel, do we need to restart here as well?

Collaborator Author

The reason we restart after upgrading cloud-init is that install_new_cloud_init calls instance.clean(), which removes all cloud-init logs, semaphores and syslog through pycloudlib. What we are trying to establish is that the upgraded cloud-init detects the uppercase UUID on first boot, and that on the 2nd boot into the new kernel we don't accidentally detect a change in instance-id due to the now-lowercase UUID. I suppose that instead of the reboot I could avoid instance.clean() in install_new_cloud_init by providing a clean=False param. That'd save the time spent on the reboot, because cloud-init will have already detected the uppercase instance-id even before we upgrade cloud-init on that instance.

Collaborator Author

I've pushed a change to avoid the extra-reboot and I'm testing this now.

Collaborator

Yeah, I'd support modifying install_new_cloud_init to allow for this case; a clean parameter sounds reasonable to me.

@TheRealFalcon How does that sound to you?

Collaborator Author

Added a clean param and confirmed a successful run of the following (with a deb built from this PR):

 CLOUD_INIT_OS_IMAGE=bionic CLOUD_INIT_CLOUD_INIT_SOURCE=/cloud-init_20.4.1-86-g5c199e90-1~bddeb_all.deb CLOUD_INIT_PUBLIC_SSH_KEY=/mykey  CLOUD_INIT_PLATFORM=azure  .tox/integration-tests/bin/py.test -v tests/integration_tests/bugs/test_lp1835584.py
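
The shape of the clean parameter discussed in this thread might look roughly like the sketch below. It is illustrative only: `FakeInstance` and its `install_deb`, `clean` and `snapshot` methods are hypothetical stand-ins, and the real `install_new_cloud_init` is a method in the integration test suite whose body is not shown in this conversation.

```python
class FakeInstance:
    """Hypothetical instance that records which steps were run."""

    def __init__(self):
        self.calls = []

    def install_deb(self, source):
        self.calls.append(("install", source))

    def clean(self):
        # Stands in for removing cloud-init logs, semaphores and syslog.
        self.calls.append(("clean",))

    def snapshot(self):
        self.calls.append(("snapshot",))


def install_new_cloud_init(instance, source, take_snapshot=True, clean=True):
    """Install cloud-init; optionally clean and snapshot afterwards."""
    instance.install_deb(source)
    if clean:
        # Skipping this keeps first-boot state, so no extra reboot is needed.
        instance.clean()
    if take_snapshot:
        instance.snapshot()
```

Defaulting `clean` to `True` keeps existing callers unchanged, while a test that relies on state from the first boot can pass `clean=False`.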

@OddBloke
Collaborator

One thing to note, if I try to run this test locally, I get:

E           msrestazure.azure_exceptions.CloudError: Azure Error: ResourcePurchaseValidationFailed
E           Message: User failed validation to purchase resources. Error message: 'You have not accepted the legal terms on this subscription: '9a37cc2c-dc4a-4097-84dd-45dd2b8cbc63' for this plan. Before the subscription can be used, you need to accept the legal terms of the image. To read and accept legal terms, use the Azure CLI commands described at https://go.microsoft.com/fwlink/?linkid=2110637 or the PowerShell commands available at https://go.microsoft.com/fwlink/?linkid=862451. Alternatively, deploying via the Azure portal provides a UI experience for reading and accepting the legal terms. Offer details: publisher='canonical' offer = '0001-com-ubuntu-pro-bionic-fips', sku = 'pro-fips-18_04', Correlation Id: 'b0f49067-fae7-4c9f-b26c-8b7fd2eec9a0'.'

I'm going to follow those steps now, but this confirms that we shouldn't be running this test by default.

To follow up: az vm image terms accept --urn Canonical:0001-com-ubuntu-pro-bionic-fips:pro-fips-18_04:18.04.202010201 fixed this issue for me.

However, now that I've successfully run the test, it has failed. The cloud-init version in that image (intentionally) doesn't have the fix to make this test pass, which means this test can never pass without an updated cloud-init being installed. (This is different to our usual tests, which we would expect to fail until the Ubuntu image is updated with the fixed cloud-init: this test will never use an image with a fixed cloud-init.) Should we skip the test if we don't have a CLOUD_INIT_SOURCE which installs_new_version()?
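
The skip proposed in this comment could be sketched as follows. The enum below is a stand-in written for this example: its members, their values, and which of them count as installing a new version are assumptions, not the test suite's actual `CloudInitSource` definition.

```python
from enum import Enum
from typing import Optional


class CloudInitSource(Enum):
    """Illustrative stand-in for the suite's cloud-init source setting."""

    NONE = "none"
    IN_PLACE = "in_place"
    DEB_PACKAGE = "deb"

    def installs_new_version(self) -> bool:
        # Assumption: only these two leave the image's cloud-init untouched.
        return self not in (CloudInitSource.NONE, CloudInitSource.IN_PLACE)


def skip_reason_for_source(source: CloudInitSource) -> Optional[str]:
    """Return a skip reason when the pinned image would keep its old cloud-init."""
    if not source.installs_new_version():
        return (
            "Pinned image never contains the fix; "
            "set a cloud-init source that installs a new version"
        )
    return None
```

A test could call `pytest.skip(reason)` when the returned reason is not `None`, before launching the pinned image.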

Kernels newer than 4.15 present /sys/class/dmi/id/product_uuid as a
lowercase value. Previously the UUID was uppercase.

Azure datasource reads the product_uuid directly as their platform's
instance-id. This presents a problem if a kernel is either
upgraded or downgraded across the 4.15 kernel version boundary because
the case of the UUID will change, resulting in cloud-init seeing a
"new" instance id and re-running all modules.

This causes ssh host keys to get regenerated across reboot into the new
kernels which will cause concern on long-running instances that something
nefarious has happened.

LP: #1835584
@blackboxsw
Collaborator Author

Yes, this is a good point. I was thinking that for current dev it gives us the chance to run the test and see the failure without the image_source set. But we probably don't want to instrument tests that we can easily cause to break. Reviewers could just remove such a pytest.skip if they wanted to see failures.

Member

@TheRealFalcon TheRealFalcon left a comment

Integration test LGTM. I did a much shallower review of everything else given how much Dan has been involved.

@blackboxsw blackboxsw merged commit 66e2d42 into canonical:master Feb 19, 2021