
Allow user control over update events #834

Merged (22 commits) on May 13, 2021

Conversation

TheRealFalcon
Member

@TheRealFalcon TheRealFalcon commented Mar 5, 2021

Proposed Commit Message

Allow user control over update events

Control is currently limited to boot events, though this should
allow us to more easily incorporate HOTPLUG support. Disabling
'instance-first-boot' is not supported as we apply networking config
too early in boot to have processed userdata (along with the fact
that this would be a pretty big foot-gun).

The concept of update events on a datasource has been split into
supported update events and default update events. Defaults will be
used if there are no user-defined update events, but user-defined
events won't be applied if they aren't supported.
When applying the networking config, we now check whether the event
is supported by the datasource as well as whether it is enabled.

Configuration looks like:
updates:
  network:
    when: ['boot']
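
The supported/default/user-defined split described above could be sketched roughly like this (`enabled_events` and the event names are illustrative, not the actual cloud-init API):

```python
from enum import Enum

class EventType(Enum):
    BOOT = "boot"
    BOOT_NEW_INSTANCE = "boot-new-instance"
    HOTPLUG = "hotplug"

def enabled_events(supported, default, user_defined):
    """Sketch: user-defined events replace the defaults, but events the
    datasource does not support are dropped."""
    requested = user_defined if user_defined is not None else default
    return {scope: events & supported.get(scope, set())
            for scope, events in requested.items()}

supported = {"network": {EventType.BOOT, EventType.BOOT_NEW_INSTANCE}}
default = {"network": {EventType.BOOT_NEW_INSTANCE}}

# No user config -> defaults apply.
assert enabled_events(supported, default, None) == default

# User enables BOOT; it is supported, so it is honored.
assert enabled_events(supported, default, {"network": {EventType.BOOT}}) \
    == {"network": {EventType.BOOT}}

# User requests an unsupported event; it is filtered out.
assert enabled_events(
    supported, default,
    {"network": {EventType.BOOT, EventType.HOTPLUG}},
) == {"network": {EventType.BOOT}}
```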

Additional Context

Most of the functionality here has been taken from the initial hotplug PR, though this PR is intentionally more limited in scope, and changes have been made as I saw fit.

Test Steps

WIP

@TheRealFalcon TheRealFalcon added the wip Work in progress, do not land label Mar 5, 2021
@TheRealFalcon TheRealFalcon force-pushed the events-user-data branch 2 times, most recently from 852ba9c to ec59faa on March 5, 2021 20:52
Comment on lines 764 to 769
if not any([
    self.datasource is NULL_DATA_SOURCE,
    self.is_new_instance(),  # or should it be self._is_first_boot(),
    (self.update_event_enabled(current_boot_type, scope='network')
     and self.datasource.update_metadata([current_boot_type]))
]):
Member Author

This section is where I've been banging my head against the wall the most. (With a few exceptions), apply_networking_config gets called twice during boot: first in LOCAL mode, and then again in NETWORK mode.

Prior to this PR, on first boot, if the datasource got initialized in LOCAL mode, we'd return at this point and not apply networking config in NETWORK mode (is_new_instance() returns False in network mode on first boot if there's a FILESYSTEM dep). On every subsequent boot, if the BOOT update event is enabled, we apply networking config twice (once for LOCAL, once for NETWORK).

That same general behavior should still be true, but applying networking config twice on every boot seems wrong to me. Since self.is_new_instance() will always return False on every subsequent boot, that can't be our determining factor. I had considered checking datasource dependencies, applying during LOCAL if there's a filesystem dependency and at NETWORK if there's a network dependency, but that could cause problems on first boot: even if we find no local datasource, we still generate the fallback config and use that.

Collaborator

@TheRealFalcon just caught up on mail so sorry for the delay.

I think we've had a bug on apply_networking_config running twice for the longest
time, as you've noticed. Generally this isn't an issue (other than being
wasteful), in that for datasources that are local and have network-config (say
OpenStack) we already get the network config we're going to use. In the case
that the DS is network-only (GCE), local never wrote a network config.

That said, I have wanted to apply it only once, and the rub has been that we've
not decided how to keep state between local and init recording that we've
already rendered network config. If we want to do so, we could use explicit
state on the cloud object, such that when we re-use the object at net time (we
load the object with existing=trust in stages.py:_restore_checked_cache())
we'll have something that can be used to determine whether we've already
applied it.

If we want to avoid attributes on the instance object, we can use ephemeral
files in /run/cloud-init; this won't have any state implications from boot to
boot but we should carefully think about the case of something re-running
cloud-init from the command line.

@TheRealFalcon
Member Author

@raharper , do you have any insight to my comment here?

Collaborator

@OddBloke OddBloke left a comment

First pass, familiarising myself with the overall approach here. Thanks for the work, both Ryan and James!

======

`Cloud-init`_ will fetch and apply cloud and user data configuration
upon serveral event types. The two most common events for cloud-init
Collaborator

Suggested change
upon serveral event types. The two most common events for cloud-init
upon several event types. The two most common events for cloud-init

Each Datasource will provide a set of events that it is capable of handling.
Datasources may not support all event types. In some cases a system
may be configured to allow a particular event but may be running on
a platform who's datasource cannot support the event.
Collaborator

Suggested change
a platform who's datasource cannot support the event.
a platform whose datasource cannot support the event.

Comment on lines 55 to 56
event occur. Currently there are two known scopes: ``network`` and
``storage``. Scopes are defined by convention but arbitrary values
Collaborator

Do we have a storage scope yet?

Comment on lines 39 to 42
Cloud-init has a default updates policy to handle new instance
events always. Vendors may want an instance to handle additional
events. Users have the final say and may provide update configuration
which can be used to enable or disable handling of specific events. The
exception is first boot. Configuration will always be applied at first
boot, and the user cannot disable this.
Collaborator

This paragraph is quite dense. Most readers won't care about vendors (or, honestly, even know what they mean in a cloud-init context), so that could be mentioned separately; it also feels like the first and last sentences are stating the same thing slightly differently.

========================

All :ref:`datasources` by default support the ``BOOT_NEW_INSTANCE`` event.
Each Datasource will provide a set of events that it is capable of handling.
Collaborator

Suggested change
Each Datasource will provide a set of events that it is capable of handling.
Each Datasource will declare a set of these events that it is capable of handling.

"provide a set of event" implies more agency for the datasource than I think we intend

cloudinit/stages.py (outdated; resolved)
if self.is_new_instance():
    current_boot_type = EventType.BOOT_NEW_INSTANCE

if not any([
Collaborator

Using any([...]) here means that we'll evaluate all of these conditions, ironically, unconditionally:

In [2]: def foo():
   ...:     print("foo")
   ...:     return True

In [3]: any([foo(), foo()])
foo
foo
Out[3]: True

rather than lazily:

In [6]: foo() or foo()
foo
Out[6]: True

This is particularly problematic here because it means there's a decent chance we will call self.datasource.update_metadata in cases where we previously wouldn't: that could cause us to re-fetch metadata in cases where it isn't necessary.

Comment on lines 760 to 762
current_boot_type = EventType.BOOT
if self.is_new_instance():
    current_boot_type = EventType.BOOT_NEW_INSTANCE
Collaborator

This feels like it should be a first-class attribute (presumably on self).

Member Author

I don't see any other method where this would currently be used. Can we take a YAGNI approach here and make it first class later if it's needed?

Collaborator

My fear is that people will just reimplement it (but probably slightly differently) instead of performing a refactor. I think it would also aid readability here, as it would be clear(er) that the current_boot_type is an invariant property of the Init stage we're running (in the same way that is_new_instance is an invariant property).

(But I certainly won't block landing on it.)

Comment on lines 777 to 779
else:
# refresh netcfg after update
netcfg, src = self._find_networking_config()
Collaborator

A nit, for sure, but as we unconditionally return in the if branch, we don't need the else:

Suggested change
else:
# refresh netcfg after update
netcfg, src = self._find_networking_config()
# refresh netcfg after update
netcfg, src = self._find_networking_config()

.. _events:

******************
Events and Updates
Collaborator

This doc could use some work to make it a little more accessible for cloud-init users.

@TheRealFalcon TheRealFalcon changed the title WIP: Allow user control over update events Allow user control over update events Apr 12, 2021
@TheRealFalcon TheRealFalcon removed the wip Work in progress, do not land label Apr 12, 2021
Collaborator

@OddBloke OddBloke left a comment

Thanks James, this is looking good! I have some suggestions to improve clarity in cloudinit.event as well as some thoughts on testing, but the core of this functionality LGTM: now I'm mostly concerned with making sure we're setting ourselves up well for the future, in terms of API choices.

tests/integration_tests/modules/test_user_events.py (3 outdated, resolved comments)
bpath = os.path.join(
    self.paths.get_cpath(),
    'sem',
    'config_scripts_per_once.once'
Collaborator

There will be a brief window of cloud-init execution where this file exists on a first boot: after scripts-per-once runs, but before the remaining modules complete. I don't think this is a problem that we need to address, but a comment or docstring here indicating that corner case would be good.

    'sem',
    'config_scripts_per_once.once'
)
first_boot = not os.path.exists(bpath)
Collaborator

This looks similar to

def has_run(self, name, freq):
    if not freq or freq == PER_ALWAYS:
        return False
    cname = canon_sem_name(name)
    sem_file = self._get_path(cname, freq)
    # This isn't really a good atomic check
    # but it suffices for where and when cloudinit runs
    if os.path.exists(sem_file):
        return True
    # this case could happen if the migrator module hadn't run yet
    # but the item had run before we did canon_sem_name.
    if cname != name and os.path.exists(self._get_path(name, freq)):
        LOG.warning("%s has run without canonicalized name [%s].\n"
                    "likely the migrator has not yet run. "
                    "It will run next boot.\n"
                    "run manually with: cloud-init single --name=migrator",
                    name, cname)
        return True
    return False
but missing some specific handling; should we be using that somehow instead?


# Default: generate network config on new instance id (first boot).
update_events = {'network': set([EventType.BOOT_NEW_INSTANCE])}
supported_update_events = {'network': set([
Collaborator

frozenset, maybe? (A nit, regardless.)

EventType.BOOT_NEW_INSTANCE,
EventType.BOOT,
])}
default_update_events = {'network': set([EventType.BOOT_NEW_INSTANCE])}
Collaborator

Should we use a constant for 'network' so that we don't have a "magic" string affecting behaviour? (We'll notice EventScope.NETWROK a lot sooner than 'netwrok' via linting, etc.)
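
The two suggestions above (a frozenset for the supported events, and an enum constant instead of the magic `'network'` string) could look something like this sketch; `EventScope`, `EventType`, and `SUPPORTED_UPDATE_EVENTS` here are illustrative stand-ins, not necessarily the exact definitions that landed:

```python
from enum import Enum

class EventScope(Enum):
    NETWORK = 'network'
    # A future 'storage' scope could be added here.

class EventType(Enum):
    BOOT = 'boot'
    BOOT_NEW_INSTANCE = 'boot-new-instance'

# frozenset: the declaration is immutable, so nothing can mutate the
# shared class attribute in place. Enum key: a typo is a loud
# AttributeError instead of a silently-never-matching string.
SUPPORTED_UPDATE_EVENTS = {
    EventScope.NETWORK: frozenset(
        [EventType.BOOT_NEW_INSTANCE, EventType.BOOT]
    )
}

assert EventType.BOOT in SUPPORTED_UPDATE_EVENTS[EventScope.NETWORK]
assert not hasattr(SUPPORTED_UPDATE_EVENTS[EventScope.NETWORK], 'add')

# A typo'd enum attribute is detectable (AttributeError if accessed);
# a typo'd string key just silently never matches:
assert getattr(EventScope, 'NETWROK', None) is None
assert 'netwrok' not in {'network': set()}  # no error, just wrong behavior
```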

Comment on lines 714 to 756
allowed = get_enabled_events(
    config_events, default_events
)

if not scope:
    scopes = [allowed.keys()]
else:
    scopes = [scope]
LOG.debug('Possible scopes for this event: %s', scopes)

for evt_scope in scopes:
    if event_source_type in allowed.get(evt_scope, []):
        LOG.debug('Event Allowed: scope=%s EventType=%s',
                  evt_scope, event_source_type)
        return True

LOG.debug('Event Denied: scopes=%s EventType=%s',
          scopes, event_source_type)
return False
Collaborator

None of this references self: perhaps a candidate to move into cloudinit.event as a function?
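
A sketch of what extracting that logic into a module-level function in cloudinit.event might look like (the name and signature here are hypothetical):

```python
import logging

LOG = logging.getLogger(__name__)

def update_event_enabled(allowed, event_source_type, scope=None):
    """Return True if event_source_type is enabled for any relevant scope.

    Pure function of its arguments, so it needs no `self`."""
    scopes = [scope] if scope else list(allowed.keys())
    for evt_scope in scopes:
        if event_source_type in allowed.get(evt_scope, set()):
            LOG.debug('Event Allowed: scope=%s EventType=%s',
                      evt_scope, event_source_type)
            return True
    LOG.debug('Event Denied: scopes=%s EventType=%s',
              scopes, event_source_type)
    return False

allowed = {'network': {'boot', 'boot-new-instance'}}
assert update_event_enabled(allowed, 'boot', scope='network')
assert not update_event_enabled(allowed, 'hotplug')
```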

self.init.apply_network_config(True)
self.init.distro.apply_network_config_names.assert_called_with(net_cfg)
self.init.distro.apply_network_config_names.assert_called_with(
Collaborator

Is this mock autospec'd?

self._apply_network_setup(m_macs)

self.init.datasource.supported_update_events = {
'network': [EventType.BOOT_NEW_INSTANCE]
Collaborator

Should this be a set instead of a list?

@TheRealFalcon
Member Author

@OddBloke Pushed a big-ish update. I (hopefully) simplified event.py a lot, and a lot of the other changes flow out from there.

Collaborator

@OddBloke OddBloke left a comment

Thanks James, this refactor has made it much easier to follow along! I think structurally we're pretty much there: I have some more specific comments/questions inline.

"""Classes and functions related to event handling."""

from enum import Enum
from typing import Dict, Set
Collaborator

This presents a compatibility break with Python 3.4, unfortunately. It does make the code much easier to follow, so perhaps switch to type comments without imports (which will cause mypy to complain, but we don't run it today and fixing these imports once we do will likely be one of the easier fixes we have to make)?
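
For illustration, a type comment carries the same intent without importing `typing` (the function and names here are hypothetical):

```python
# Instead of an annotation that needs `from typing import Dict, Set`
# (the typing module is stdlib only from Python 3.5 onward), a type
# comment documents the types without the import:

def get_update_events(cfg):
    # type: (dict) -> dict
    """Return the 'updates' section of a config dict (sketch)."""
    return cfg.get('updates', {})

update_events = {'network': set(['boot'])}  # type: dict

assert get_update_events({'updates': update_events}) == update_events
assert get_update_events({}) == {}
```

mypy would still complain about the bare `dict` comments lacking parameters, which matches the trade-off described above.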

Comment on lines 188 to 194
supported_update_events = {EventScope.NETWORK: set([
EventType.BOOT_NEW_INSTANCE,
EventType.BOOT,
])}
default_update_events = {EventScope.NETWORK: set(
[EventType.BOOT_NEW_INSTANCE
])}
Collaborator

The formatting here is inconsistent; we won't be adding to these too often, so I think it would be fine to do without a trailing comma:

    supported_update_events = {
        EventScope.NETWORK: set([EventType.BOOT_NEW_INSTANCE, EventType.BOOT])
    }
    default_update_events = {
        EventScope.NETWORK: set([EventType.BOOT_NEW_INSTANCE])
    }

(but that's a nit: the inconsistency is not 😄 )

@@ -206,10 +206,10 @@ def read_user_data_callback(mount_dir):

class DataSourceRbxCloud(sources.DataSource):
dsname = "RbxCloud"
update_events = {'network': [
default_update_events = {EventScope.NETWORK: set([
Collaborator

Did you consider using set literals for these definitions?

Member Author

It's a personal preference because I often think I'm seeing a dictionary when I see a set literal. I don't have a problem changing it though.

Comment on lines 67 to 68
self.original_supported_events = copy.deepcopy(
self.datasource.supported_update_events)
Collaborator

Do we need these lines in addition to the line in setUpClass?

Comment on lines 73 to 74
cls.original_supported_events = copy.deepcopy(
DataSource.supported_update_events)
Collaborator

Could we use mock.patch for this? Borderline nit, but I expect it will handle corner cases we aren't considering, for example.
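
A minimal sketch of the `mock.patch.object` approach for a class attribute (the `DataSource` stub here is illustrative):

```python
from unittest import mock

class DataSource:
    supported_update_events = {'network': {'boot-new-instance'}}

original = DataSource.supported_update_events

# mock.patch.object saves and restores the class attribute for us,
# even if the body raises, replacing the manual deepcopy/restore in
# setUpClass/tearDownClass:
with mock.patch.object(
        DataSource, 'supported_update_events',
        {'network': {'boot-new-instance', 'boot'}}):
    assert 'boot' in DataSource.supported_update_events['network']

# Restored automatically (same object, not a copy) on exit:
assert DataSource.supported_update_events is original
```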

LOG.debug('Allowed events: %s', allowed)

if not scope:
scopes = list(allowed.keys())
Collaborator

No need to change it, it's harmless, but is there a specific reason I'm missing for needing the list call here?

Member Author

Because of the "almost list" structures that dicts give us back.

>>> x = {1:2}
>>> [x.keys()]
[dict_keys([1])]
>>> list(x.keys())
[1]

Since I'm using it in a comprehension later, I need it to be a "real" list.

Collaborator

dict_keys is iterable:

In [22]: from collections.abc import Iterable
In [21]: isinstance({}.keys(), Iterable)
Out[21]: True

So is usable in comprehensions:

In [3]: from enum import Enum

In [4]: class Test(Enum):
   ...:     thing = "string"
In [8]: d = {Test("string"): "foo"}

In [9]: [x for x in d.keys()]
Out[9]: [<Test.thing: 'string'>]

In [10]: [x.value for x in d.keys()]
Out[10]: ['string']

It isn't a sequence:

In [22]: from collections.abc import Sequence
In [23]: isinstance({}.keys(), Sequence)
Out[23]: False

because it doesn't implement __getitem__ (so you can't subscript it, which is perhaps what you were thinking of).

Member Author

Actually...no 😄

The code I was copying from had a scopes = [allowed.keys()] and I misinterpreted why that wasn't working. I thought it was leaving it as a dict_keys, which is why I converted it to a proper list, when in actuality it was wrapping the dict_keys into another list.

All this to say...I'll drop the list(...)

return True

LOG.debug('Event Denied: scopes=%s EventType=%s',
[s.value for s in scopes], event_source_type)
Collaborator

These log messages look like they'll be a little inconsistent: the "Possible scopes" message doesn't call s.value on each scope before logging it; the "Event Allowed" one doesn't call it on evt.scope. Can we get these aligned?

Comment on lines 764 to 800
if not (
    self.datasource is NULL_DATA_SOURCE or
    self.boot_type == EventType.BOOT_NEW_INSTANCE or
    (
        self.update_event_enabled(
            self.boot_type, scope=EventScope.NETWORK
        ) and self.datasource.update_metadata([self.boot_type])
    )
):
Collaborator

I'm having a hard time following the exact logic here; could we decompose it into one or more variables, that'll make it a little easier to understand what we're checking?

Comment on lines 93 to 104
log = client.read_from_file('/var/log/cloud-init.log')
assert 'Applying network configuration' in log
assert 'dummy0' not in client.execute('ls /sys/class/net')

_add_dummy_bridge_to_netplan(client)
client.execute('rm /var/log/cloud-init.log')
client.restart()
log = client.read_from_file('/var/log/cloud-init.log')

assert 'Event Allowed: scope=network EventType=boot' in log
assert 'Applying network configuration' in log
assert 'dummy0' not in client.execute('ls /sys/class/net')
Collaborator

Worth refactoring this into a _test_network_config_applied_on_reboot which both this and the Azure test can call?

Comment on lines 3 to 10
This is currently limited to applying network config on BOOT events.
Since we reapply the same config on reboot, verification is limited to
checking log messages.

Because the supported/default events are datasource specific, the datasource
is being limited to LXD containers/vms (NoCloud).
For NoCloud, BOOT_NEW_INSTANCE and BOOT are supported, but only
BOOT_NEW_INSTANCE is default.
Collaborator

These tests look much better now, thank you! (But this docstring is now out-of-date. 🏃💨)

evt_scope.value, event_source_type)
return True

LOG.debug('Event Denied: scopes=%s EventType=%s',
Member Author

I'm still not happy with this. We attempt to apply networking config twice on every boot. Pre-this-PR behavior, if only BOOT_NEW_INSTANCE was enabled, we would apply exactly once on the first part of first boot, then deny on every subsequent attempt. If BOOT was enabled, we'd apply every single time (twice per boot, including first boot).

With this PR, on first boot we don't have access to userdata, so we fall back to the default update events of the datasource. If only BOOT_NEW_INSTANCE is enabled, we'll get an event denied message in the logs, even if BOOT is specified in the userdata. I'm concerned that an event denied message in the logs will be confusing when the event is enabled in the userdata.

I could add a check for first boot here and slightly alter the log message. It could be something like "Event Denied from data source...", but that also seems like overkill checking every boot for a not-much-more-helpful log message.

Any thoughts for improvement?

Collaborator

Are there cases when we would only successfully apply networking configuration on the second attempt during first boot? If so, then does it follow that BOOT_NEW_INSTANCE should be the toggle for both of those runs? And, therefore, should we actually be using the check for first boot for more than just a log message?

Collaborator

It also occurs to me that we already have language for these occasions, from our script execution modules: once, instance, and boot. I'm not suggesting we change the language we're using here ("instance" is obviously ~contained within "BOOT_NEW_INSTANCE"), but it does make me wonder if we should be following those modules and tracking whether we have applied network configuration for this instance, rather than trying to infer what we should be doing based on the type of a boot we think we have.

If we haven't yet configured networking, BOOT_NEW_INSTANCE or BOOT will cause it to be applied; if we have, then only BOOT will.

What do you think?
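
A rough sketch of that "track what we've done" idea (paths and function names hypothetical): a per-instance marker records that network config was applied, so BOOT_NEW_INSTANCE or BOOT applies it when the marker is absent, and only BOOT applies it afterwards.

```python
import os
import tempfile

def network_config_applied(instance_id, run_dir):
    """Has network config already been applied for this instance? (sketch)"""
    marker = os.path.join(run_dir, 'network-applied-%s' % instance_id)
    return os.path.exists(marker)

def mark_network_config_applied(instance_id, run_dir):
    """Record that network config was applied for this instance."""
    marker = os.path.join(run_dir, 'network-applied-%s' % instance_id)
    with open(marker, 'w'):
        pass

run_dir = tempfile.mkdtemp()  # stand-in for /run/cloud-init
assert not network_config_applied('i-1234', run_dir)
mark_network_config_applied('i-1234', run_dir)
assert network_config_applied('i-1234', run_dir)
# A new instance-id has no marker, so first-boot application still happens:
assert not network_config_applied('i-5678', run_dir)
```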

Member Author

Are there cases when we would only successfully apply networking configuration on the second attempt during first boot?

Yes, I've been assuming that in my explanations. We will be in the BOOT_NEW_INSTANCE state until the datasource has been initialized. It's entirely possible that the first attempt results in "wait till networking is up", so the second attempt is still BOOT_NEW_INSTANCE. We always attempt to apply networking on BOOT_NEW_INSTANCE.

If so, then does it follow that BOOT_NEW_INSTANCE should be the toggle for both of those runs? And, therefore, should we actually be using the check for first boot for more than just a log message?

I'm not sure I follow what "should be the toggle" means. But to the second question, I think the answer is no. First boot is really only relevant because we don't have userdata available yet.

but it does make me wonder if we should be following those modules and tracking whether we have applied network configuration for this instance, rather than trying to infer what we should be doing based on the type of a boot we think we have.
If we haven't yet configured networking, BOOT_NEW_INSTANCE or BOOT will cause it to be applied; if we have, then only BOOT will.

This is essentially what I'm already doing. Or am I misunderstanding what you're saying?

Sorry if I didn't make it clear, but the case I'm concerned about is when networking is applied on the first attempt and userdata specifies BOOT (which we can't access on first boot): what happens on the second attempt?

CURRENT BEHAVIOR:

master when datasource specifies BOOT_NEW_INSTANCE:
first boot, one: apply
first boot, two: skip
subsequent boot, one: skip
subsequent boot, two: skip

master when datasource specifies BOOT:
first boot, one: apply
first boot, two: apply
subsequent boot, one: apply
subsequent boot, two: apply

The above should be exactly the same on this branch when userdata specifies BOOT_NEW_INSTANCE. However, when userdata specifies BOOT:

BRANCH BEHAVIOR:

branch with userdata BOOT:
first boot, one: apply
first boot, two: apply or skip based on datasource default because we don't have userdata
subsequent boot, one: apply
subsequent boot, two: apply

(These scenarios are assuming networking gets applied on the first attempt)

If on "first boot, two" we skip because of the datasource default, we get a big ugly "Event Denied" log message that will likely confuse somebody who doesn't know their userdata won't take effect until the next boot.

Changing this to instead apply on both attempts isn't difficult, but it would also apply in the BOOT_NEW_INSTANCE case, which is a change in current behavior. This could be important for backwards compatibility, in case we specifically DON'T want network config applied more than once when BOOT_NEW_INSTANCE is specified.

Does my reasoning here make sense? If so, do you think it makes sense to modify the log message on first boot for the case I mentioned?

@TheRealFalcon
Member Author

@OddBloke should be ready for another review

else:
# refresh netcfg after update
netcfg, src = self._find_networking_config()
def _boot_event_enabled_and_supported():
Collaborator

Thanks for the update here James, it's much easier to follow what the intent is here!

def _boot_event_enabled_and_supported():
    return self.update_event_enabled(
        EventType.BOOT, scope=EventScope.NETWORK
    ) and self.datasource.update_metadata([EventType.BOOT])
Collaborator

This call still throws me off: this isn't just checking for support, but will actually cause the datasource to discard cached data and fetch it anew. That feels to me like it should be two separate calls: "does this datasource support $event_type" and "update the datasource's metadata for $event_type".

It also feels a little odd that update_metadata doesn't take a scope: we only want to update network metadata here, if (e.g.) storage metadata has changed then we don't particularly want to update it.

Relatedly, it may look like update_metadata should just update network metadata (or, generally, metadata for the explicitly configured scopes), but our _get_data implementations generally fetch and store All The Metadata, so I think this will completely refresh all metadata regardless: perhaps not a problem, but worth noting.
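
A sketch of how the two concerns might be separated (method names hypothetical; not the actual DataSource API):

```python
class DataSource:
    """Hypothetical split of the two concerns mixed into update_metadata:
    a side-effect-free capability check, and an explicit refresh."""

    supported_update_events = {'network': {'boot', 'boot-new-instance'}}

    def __init__(self):
        self._metadata = None

    def supports_update_event(self, scope, event_type):
        # Pure query: nothing is discarded, nothing is re-fetched.
        return event_type in self.supported_update_events.get(scope, set())

    def update_metadata(self, event_types):
        # The actual refresh: discards cached data and fetches anew
        # (stubbed here to show the side effect).
        self._metadata = {'fetched_for': list(event_types)}
        return True

ds = DataSource()
assert ds.supports_update_event('network', 'boot')
assert not ds.supports_update_event('network', 'hotplug')
assert ds._metadata is None          # the capability check fetched nothing
ds.update_metadata(['boot'])
assert ds._metadata == {'fetched_for': ['boot']}
```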

@TheRealFalcon
Member Author

@OddBloke most recent commit uses the semaphore as we discussed. Still needs a ton of test changes though.

Comment on lines 17 to 18
- **BOOT_NEW_INSTANCE**: ``New instance first boot``
- **BOOT**: ``Any system boot other than 'BOOT_NEW_INSTANCE'``
Collaborator

We need to update this with the legacy event.

Collaborator

@OddBloke OddBloke left a comment

This LGTM now, thanks for all the work, James! I've confirmed that a full LXD container run and a full Azure run both succeed locally.

@TheRealFalcon TheRealFalcon merged commit 8643469 into canonical:master May 13, 2021
@TheRealFalcon TheRealFalcon deleted the events-user-data branch May 13, 2021 17:55
TheRealFalcon added a commit to TheRealFalcon/cloud-init that referenced this pull request Sep 2, 2021
In canonical#834, we refactored the handling of events for fetching new metadata.
Previously, in Azure's __init__, the BOOT event was added to the
update_events, so it was assumed that Azure required the standard BOOT
behavior, which is to apply metadata twice every boot: once during
local-init, then again during standard init phase.
https://github.com/canonical/cloud-init/blob/21.2/cloudinit/sources/DataSourceAzure.py#L356

However, this line was effectively meaningless. After the metadata was
fetched in local-init, it was then pickled out to disk. Because
"update_events" was a class variable, the EventType.BOOT was not
persisted into the pickle. When the pickle was then unpickled in the
init phase, metadata did not get re-fetched because EventType.BOOT was
not present, so Azure effectively only supports BOOT_NEW_INSTANCE.

Fetching metadata twice during boot causes some issues for
pre-provisioning on Azure, because updating metadata during
re-provisioning will cause cloud-init to poll for reprovisiondata again
in DataSourceAzure, which will infinitely return 404 (reprovisiondata
is deleted from IMDS after the health signal is sent by cloud-init during
init-local). This leaves cloud-init stuck in 'init'.
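The class-attribute pitfall described in this commit message can be reproduced with a small sketch (hypothetical class names): pickle only serializes an instance's `__dict__`, so state that lives on the class never makes it to disk, and unpickling does not re-run `__init__`:

```python
import pickle

class DataSource:
    # class-level attribute, shared by all instances (like update_events)
    update_events = frozenset({"BOOT_NEW_INSTANCE"})

class AzureLike(DataSource):
    def __init__(self):
        # modifies the *class*, not the instance's __dict__
        AzureLike.update_events = AzureLike.update_events | {"BOOT"}

ds = AzureLike()
assert "BOOT" in ds.update_events
blob = pickle.dumps(ds)  # pickles only the (empty) instance __dict__

# Simulate the next boot stage: a fresh interpreter re-imports the class
# definition, so the class attribute is back to its default, and
# unpickling never calls __init__.
AzureLike.update_events = frozenset({"BOOT_NEW_INSTANCE"})
restored = pickle.loads(blob)
print("BOOT" in restored.update_events)  # False: BOOT was never persisted
```

The reassignment of the class attribute before `pickle.loads` stands in for the fresh process that runs the init stage; in a single long-lived process the mutation would still be visible, which is why the bug only shows up across boot stages.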
TheRealFalcon added a commit that referenced this pull request Sep 3, 2021