Allow user control over update events #834
Conversation
Branch updated: 852ba9c to ec59faa
cloudinit/stages.py
Outdated
if not any([
        self.datasource is NULL_DATA_SOURCE,
        self.is_new_instance(),  # or should it be self._is_first_boot(),
        (self.update_event_enabled(current_boot_type, scope='network')
         and self.datasource.update_metadata([current_boot_type]))
]):
This section is where I've been banging my head against the wall the most. (With a few exceptions,) apply_networking_config gets called twice during boot: first in LOCAL mode, and then again in NETWORK mode.

Prior to this PR, on first boot, if the datasource got initialized in LOCAL mode, we'd return at this point and not apply networking config in NETWORK mode (is_new_instance() returns False in NETWORK mode on first boot if there's a FILESYSTEM dep). On every subsequent boot, if the BOOT update event is enabled, we apply networking config twice (once for LOCAL, once for NETWORK).

That same general behavior should still hold, but applying networking config twice on every boot seems wrong to me. Since self.is_new_instance() will always return False on every subsequent boot, it can't be our determining factor. I had considered checking datasource dependencies, and applying during LOCAL if there's a filesystem dependency and at NETWORK if there's a network dependency, but that could cause problems on first boot: even if we find no local datasource, we still generate the fallback config and use it.
@TheRealFalcon just caught up on mail, so sorry for the delay.

I think we've had a bug for the longest time on apply_networking_config running twice, as you've noticed. Generally this isn't an issue (other than being wasteful) in that for datasources that are local and have network-config (say OpenStack), we already get the network config we're going to use. In the case that the DS is network-only (GCE), local never wrote a network config.

That said, I have wanted to apply it only once, and the rub has been that we've not decided how we can keep state between local and init that we've already rendered network config. If we want to do so, we could use explicit state on the cloud object, such that when we re-use the object at net time (we load the object with existing=trust in stages.py:_restore_checked_cache()) we'll have something that can be used to determine if we've already applied.

If we want to avoid attributes on the instance object, we can use ephemeral files in /run/cloud-init; this won't have any state implications from boot to boot, but we should carefully think about the case of someone re-running cloud-init from the command line.
@raharper, do you have any insight into my comment here?
First pass, familiarising myself with the overall approach here. Thanks for the work, both Ryan and James!
doc/rtd/topics/events.rst
Outdated
======

`Cloud-init`_ will fetch and apply cloud and user data configuration
upon serveral event types. The two most common events for cloud-init
- upon serveral event types. The two most common events for cloud-init
+ upon several event types. The two most common events for cloud-init
doc/rtd/topics/events.rst
Outdated
Each Datasource will provide a set of events that it is capable of handling.
Datasources may not support all event types. In some cases a system
may be configured to allow a particular event but may be running on
a platform who's datasource cannot support the event.
- a platform who's datasource cannot support the event.
+ a platform whose datasource cannot support the event.
doc/rtd/topics/events.rst
Outdated
event occur. Currently there are two known scopes: ``network`` and
``storage``. Scopes are defined by convention but arbitrary values
Do we have a storage scope yet?
doc/rtd/topics/events.rst
Outdated
Cloud-init has a default updates policy to handle new instance
events always. Vendors may want an instance to handle additional
events. Users have the final say and may provide update configuration
which can be used to enable or disable handling of specific events. The
exception is first boot. Configuration will always be applied at first
boot, and the user cannot disable this.
This paragraph is quite dense. Most readers won't care about vendors (or, honestly, even know what they mean in a cloud-init context), so that could be mentioned separately; it also feels like the first and last sentences are stating the same thing slightly differently.
doc/rtd/topics/events.rst
Outdated
========================

All :ref:`datasources` by default support the ``BOOT_NEW_INSTANCE`` event.
Each Datasource will provide a set of events that it is capable of handling.
- Each Datasource will provide a set of events that it is capable of handling.
+ Each Datasource will declare a set of these events that it is capable of handling.

"provide a set of events" implies more agency for the datasource than I think we intend.
cloudinit/stages.py
Outdated
if self.is_new_instance():
    current_boot_type = EventType.BOOT_NEW_INSTANCE

if not any([
Using any([...]) here means that we'll evaluate all of these conditions, ironically, unconditionally:

In [2]: def foo():
   ...:     print("foo")
   ...:     return True
   ...:

In [3]: any([foo(), foo()])
foo
foo
Out[3]: True

rather than lazily:

In [6]: foo() or foo()
foo
Out[6]: True

This is particularly problematic here because it means there's a decent chance we will call self.datasource.update_metadata in cases where we previously wouldn't: that could cause us to re-fetch metadata in cases where it isn't necessary.
cloudinit/stages.py
Outdated
current_boot_type = EventType.BOOT
if self.is_new_instance():
    current_boot_type = EventType.BOOT_NEW_INSTANCE
This feels like it should be a first-class attribute (presumably on self).
I don't see any other method where this would currently be used. Can we take a YAGNI approach here and make it first class later if it's needed?
My fear is that people will just reimplement it (but probably slightly differently) instead of performing a refactor. I think it would also aid readability here, as it would be clear(er) that the current_boot_type is an invariant property of the Init stage we're running (in the same way that is_new_instance is an invariant property).
(But I certainly won't block landing on it.)
cloudinit/stages.py
Outdated
else:
    # refresh netcfg after update
    netcfg, src = self._find_networking_config()
A nit, for sure, but as we unconditionally return in the if branch, we don't need the else:

- else:
-     # refresh netcfg after update
-     netcfg, src = self._find_networking_config()
+ # refresh netcfg after update
+ netcfg, src = self._find_networking_config()
.. _events:

******************
Events and Updates
This doc could use some work to make it a little more accessible for cloud-init users.
Thanks James, this is looking good! I have some suggestions to improve clarity in cloudinit.event as well as some thoughts on testing, but the core of this functionality LGTM: now I'm mostly concerned with making sure we're setting ourselves up well for the future, in terms of API choices.
cloudinit/stages.py
Outdated
bpath = os.path.join(
    self.paths.get_cpath(),
    'sem',
    'config_scripts_per_once.once'
There will be a brief window of cloud-init execution where this file exists on a first boot: after scripts-per-once runs, but before the remaining modules complete. I don't think this is a problem that we need to address, but a comment or docstring here indicating that corner case would be good.
cloudinit/stages.py
Outdated
    'sem',
    'config_scripts_per_once.once'
)
first_boot = not os.path.exists(bpath)
This looks similar to cloud-init/cloudinit/helpers.py, lines 114 to 135 in 9f9e154:
def has_run(self, name, freq):
    if not freq or freq == PER_ALWAYS:
        return False
    cname = canon_sem_name(name)
    sem_file = self._get_path(cname, freq)
    # This isn't really a good atomic check
    # but it suffices for where and when cloudinit runs
    if os.path.exists(sem_file):
        return True
    # this case could happen if the migrator module hadn't run yet
    # but the item had run before we did canon_sem_name.
    if cname != name and os.path.exists(self._get_path(name, freq)):
        LOG.warning("%s has run without canonicalized name [%s].\n"
                    "likely the migrator has not yet run. "
                    "It will run next boot.\n"
                    "run manually with: cloud-init single --name=migrator",
                    name, cname)
        return True
    return False
cloudinit/sources/__init__.py
Outdated
# Default: generate network config on new instance id (first boot).
update_events = {'network': set([EventType.BOOT_NEW_INSTANCE])}
supported_update_events = {'network': set([
frozenset, maybe? (A nit, regardless.)
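A quick illustration of the rationale (event names shown as plain strings for brevity; this is a sketch, not the datasource code):

```python
# Why frozenset for class-level defaults: shared class state can't be
# mutated in place by a subclass or a test, so an accidental .add()
# fails loudly instead of silently leaking into every other datasource.
supported_update_events = {
    'network': frozenset(['boot-new-instance', 'boot'])
}

try:
    supported_update_events['network'].add('hotplug')
except AttributeError as exc:
    # frozenset has no .add(), so the mutation is rejected at the call site
    print('mutation rejected:', exc)
```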
cloudinit/sources/__init__.py
Outdated
    EventType.BOOT_NEW_INSTANCE,
    EventType.BOOT,
])}
default_update_events = {'network': set([EventType.BOOT_NEW_INSTANCE])}
Should we use a constant for 'network' so that we don't have a "magic" string affecting behaviour? (We'll notice EventScope.NETWROK a lot sooner than 'netwrok' via linting, etc.)
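A minimal sketch of such a constant (member set and `__str__` behaviour are illustrative, not the merged `EventScope`):

```python
from enum import Enum


class EventScope(Enum):
    # A 'storage' member could be added if/when that scope exists.
    NETWORK = 'network'

    def __str__(self):
        # Keeps log messages readable: str(EventScope.NETWORK) == 'network'
        return self.value


# A typo in an attribute name is an immediate AttributeError (and linters
# flag it), whereas the magic string 'netwrok' would just never match:
print(EventScope.NETWORK.value)  # network
```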
cloudinit/stages.py
Outdated
allowed = get_enabled_events(
    config_events, default_events
)

if not scope:
    scopes = [allowed.keys()]
else:
    scopes = [scope]
LOG.debug('Possible scopes for this event: %s', scopes)

for evt_scope in scopes:
    if event_source_type in allowed.get(evt_scope, []):
        LOG.debug('Event Allowed: scope=%s EventType=%s',
                  evt_scope, event_source_type)
        return True

LOG.debug('Event Denied: scopes=%s EventType=%s',
          scopes, event_source_type)
return False
None of this references self: perhaps a candidate to move into cloudinit.event as a function?
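The extraction could look roughly like this sketch of a pure function (signature and names are illustrative, not the merged API):

```python
import logging

LOG = logging.getLogger(__name__)


def update_event_enabled(event_source_type, allowed, scope=None):
    """Return True if the event type is enabled for the given scope.

    `allowed` maps scope -> set of enabled event types; when no scope is
    given, every scope is consulted. No `self` needed, so this can live
    in cloudinit.event and be unit-tested in isolation.
    """
    scopes = list(allowed.keys()) if scope is None else [scope]
    LOG.debug('Possible scopes for this event: %s', scopes)
    for evt_scope in scopes:
        if event_source_type in allowed.get(evt_scope, set()):
            LOG.debug('Event Allowed: scope=%s EventType=%s',
                      evt_scope, event_source_type)
            return True
    LOG.debug('Event Denied: scopes=%s EventType=%s',
              scopes, event_source_type)
    return False
```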
cloudinit/tests/test_stages.py
Outdated
self.init.apply_network_config(True)
self.init.distro.apply_network_config_names.assert_called_with(net_cfg)
self.init.distro.apply_network_config_names.assert_called_with(
Is this mock autospec'd?
cloudinit/tests/test_stages.py
Outdated
self._apply_network_setup(m_macs)

self.init.datasource.supported_update_events = {
    'network': [EventType.BOOT_NEW_INSTANCE]
Should this be a set instead of a list?
Branch updated: d8ccb3f to 96f4934
@OddBloke Pushed a big-ish update. I (hopefully) simplified event.py a lot, and a lot of the other changes flow out from there.
Thanks James, this refactor has made it much easier to follow along! I think structurally we're pretty much there: I have some more specific comments/questions inline.
"""Classes and functions related to event handling."""

from enum import Enum
from typing import Dict, Set
This presents a compatibility break with Python 3.4, unfortunately. It does make the code much easier to follow, so perhaps switch to type comments without imports (which will cause mypy to complain, but we don't run it today and fixing these imports once we do will likely be one of the easier fixes we have to make)?
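The fallback being suggested might look like this sketch, using PEP 484 type comments (all names here are illustrative). The comment is ignored at runtime, so it stays importable on Python 3.4 with no `typing` import, though mypy would flag the unresolved names until the real imports return:

```python
def get_enabled_events(user_events, default_events):
    # type: (Dict[str, Set[str]], Dict[str, Set[str]]) -> Dict[str, Set[str]]
    # Merge user-configured events over the datasource defaults.
    enabled = dict(default_events)
    enabled.update(user_events)
    return enabled
```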
cloudinit/sources/__init__.py
Outdated
supported_update_events = {EventScope.NETWORK: set([
    EventType.BOOT_NEW_INSTANCE,
    EventType.BOOT,
])}
default_update_events = {EventScope.NETWORK: set(
    [EventType.BOOT_NEW_INSTANCE
])}
The formatting here is inconsistent; we won't be adding to these too often, so I think it would be fine to do without a trailing comma:

supported_update_events = {
    EventScope.NETWORK: set([EventType.BOOT_NEW_INSTANCE, EventType.BOOT])
}
default_update_events = {
    EventScope.NETWORK: set([EventType.BOOT_NEW_INSTANCE])
}
(but that's a nit: the inconsistency is not 😄 )
@@ -206,10 +206,10 @@ def read_user_data_callback(mount_dir):


 class DataSourceRbxCloud(sources.DataSource):
     dsname = "RbxCloud"
-    update_events = {'network': [
+    default_update_events = {EventScope.NETWORK: set([
Did you consider using set literals for these definitions?
It's a personal preference because I often think I'm seeing a dictionary when I see a set literal. I don't have a problem changing it though.
cloudinit/sources/tests/test_init.py
Outdated
self.original_supported_events = copy.deepcopy(
    self.datasource.supported_update_events)
Do we need these lines in addition to the line in setUpClass?
cloudinit/sources/tests/test_init.py
Outdated
cls.original_supported_events = copy.deepcopy(
    DataSource.supported_update_events)
Could we use mock.patch for this? Borderline nit, but I expect it will handle corner cases we aren't considering, for example.
cloudinit/stages.py
Outdated
LOG.debug('Allowed events: %s', allowed)

if not scope:
    scopes = list(allowed.keys())
No need to change it, it's harmless, but is there a specific reason I'm missing for needing the list call here?
Because of the "almost list" structures that dicts give us back:

>>> x = {1: 2}
>>> [x.keys()]
[dict_keys([1])]
>>> list(x.keys())
[1]

Since I'm using it in a comprehension later, I need it to be a "real" list.
dict_keys is iterable:

In [22]: from collections.abc import Iterable

In [21]: isinstance({}.keys(), Iterable)
Out[21]: True

So is usable in comprehensions:

In [3]: from enum import Enum

In [4]: class Test(Enum):
   ...:     thing = "string"

In [8]: d = {Test("string"): "foo"}

In [9]: [x for x in d.keys()]
Out[9]: [<Test.thing: 'string'>]

In [10]: [x.value for x in d.keys()]
Out[10]: ['string']

It isn't a sequence:

In [22]: from collections.abc import Sequence

In [23]: isinstance({}.keys(), Sequence)
Out[23]: False

because it doesn't implement __getitem__ (so you can't subscript it, which is perhaps what you were thinking of).
Actually...no 😄

The code I was copying from had scopes = [allowed.keys()] and I misinterpreted why that wasn't working. I thought it was leaving it as a dict_keys, which is why I converted it to a proper list, when in actuality it was wrapping the dict_keys into another list.

All this to say...I'll drop the list(...).
cloudinit/stages.py
Outdated
    return True

LOG.debug('Event Denied: scopes=%s EventType=%s',
          [s.value for s in scopes], event_source_type)
These log messages look like they'll be a little inconsistent: the "Possible scopes" message doesn't call s.value on each scope before logging it; the "Event Allowed" one doesn't call it on evt_scope. Can we get these aligned?
cloudinit/stages.py
Outdated
if not (
    self.datasource is NULL_DATA_SOURCE or
    self.boot_type == EventType.BOOT_NEW_INSTANCE or
    (
        self.update_event_enabled(
            self.boot_type, scope=EventScope.NETWORK
        ) and self.datasource.update_metadata([self.boot_type])
    )
):
I'm having a hard time following the exact logic here; could we decompose it into one or more variables, that'll make it a little easier to understand what we're checking?
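One possible decomposition, as a sketch: name each clause and pass the refresh as a callable so it stays lazy, mirroring the `and` short-circuiting in the original. All names here are hypothetical, not the merged code.

```python
def should_apply_network_config(no_datasource, is_new_instance_boot,
                                boot_event_enabled, update_metadata):
    """Decide whether to apply networking config this stage.

    `update_metadata` is a zero-argument callable so the (possibly
    expensive) refresh only runs when the boot event is enabled.
    """
    # Apply when there is no datasource yet (fallback config) or on a
    # new-instance boot; otherwise only when the BOOT event is enabled
    # and the metadata refresh succeeds.
    if no_datasource or is_new_instance_boot:
        return True
    return boot_event_enabled and bool(update_metadata())
```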
log = client.read_from_file('/var/log/cloud-init.log')
assert 'Applying network configuration' in log
assert 'dummy0' not in client.execute('ls /sys/class/net')

_add_dummy_bridge_to_netplan(client)
client.execute('rm /var/log/cloud-init.log')
client.restart()
log = client.read_from_file('/var/log/cloud-init.log')

assert 'Event Allowed: scope=network EventType=boot' in log
assert 'Applying network configuration' in log
assert 'dummy0' not in client.execute('ls /sys/class/net')
Worth refactoring this into a _test_network_config_applied_on_reboot which both this and the Azure test can call?
This is currently limited to applying network config on BOOT events.
Since we reapply the same config on reboot, verification is limited to
checking log messages.

Because the supported/default events are datasource specific, the datasource
is being limited to LXD containers/vms (NoCloud).
For NoCloud, BOOT_NEW_INSTANCE and BOOT are supported, but only
BOOT_NEW_INSTANCE is default.
These tests look much better now, thank you! (But this docstring is now out-of-date. 🏃💨)
          evt_scope.value, event_source_type)
return True

LOG.debug('Event Denied: scopes=%s EventType=%s',
I'm still not happy with this. We attempt to apply networking config twice on every boot. Pre-this-PR behavior, if only BOOT_NEW_INSTANCE was enabled, we would apply exactly once on the first part of first boot, then deny on every subsequent attempt. If BOOT was enabled, we'd apply every single time (twice per boot, including first boot).
With this PR, on first boot we don't have access to userdata, so we fall back to the default update events of the datasource. If only BOOT_NEW_INSTANCE is enabled, we'll get an event denied message in the logs, even if BOOT is specified in the userdata. I'm concerned having an event denied message in the logs will be confusing given that it is enabled in the userdata.
I could add a check for first boot here and slightly alter the log message. It could be something like "Event Denied from data source...", but that also seems like overkill checking every boot for a not-much-more-helpful log message.
Any thoughts for improvement?
Are there cases when we would only successfully apply networking configuration on the second attempt during first boot? If so, then does it follow that BOOT_NEW_INSTANCE should be the toggle for both of those runs? And, therefore, should we actually be using the check for first boot for more than just a log message?
It also occurs to me that we already have language for these occasions, from our script execution modules: once, instance, and boot. I'm not suggesting we change the language we're using here ("instance" is obviously ~contained within "BOOT_NEW_INSTANCE"), but it does make me wonder if we should be following those modules and tracking whether we have applied network configuration for this instance, rather than trying to infer what we should be doing based on the type of a boot we think we have.
If we haven't yet configured networking, BOOT_NEW_INSTANCE or BOOT will cause it to be applied; if we have, then only BOOT will.

What do you think?
> Are there cases when we would only successfully apply networking configuration on the second attempt during first boot?

Yes, I've been assuming that in my explanations. We will be in BOOT_NEW_INSTANCE state until the datasource has been initialized. It's entirely possible that the first attempt can result in "wait till networking is up", so the second attempt is still BOOT_NEW_INSTANCE. We always attempt to apply networking when BOOT_NEW_INSTANCE.

> If so, then does it follow that BOOT_NEW_INSTANCE should be the toggle for both of those runs? And, therefore, should we actually be using the check for first boot for more than just a log message?

I'm not sure I follow what "should be the toggle" means. But to the second question, I think the answer is no. First boot is really only relevant because we don't have userdata available yet.

> but it does make me wonder if we should be following those modules and tracking whether we have applied network configuration for this instance, rather than trying to infer what we should be doing based on the type of a boot we think we have.
> If we haven't yet configured networking, BOOT_NEW_INSTANCE or BOOT will cause it to be applied; if we have, then only BOOT will.

This is essentially what I'm already doing. Or am I misunderstanding what you're saying?

Sorry if I didn't make it clear, but the case I'm concerned about is when networking is applied on the first attempt and userdata specifies BOOT (which we can't access on first boot): what's happening on the second attempt.
CURRENT BEHAVIOR:
master when datasource specifies BOOT_NEW_INSTANCE:
first boot, one: apply
first boot, two: skip
subsequent boot, one: skip
subsequent boot, two: skip
master when datasource specifies BOOT:
first boot, one: apply
first boot, two: apply
subsequent boot, one: apply
subsequent boot, two: apply
The above should be exactly the same on this branch when userdata specifies BOOT_NEW_INSTANCE. However, when userdata specifies BOOT:
BRANCH BEHAVIOR:
branch with userdata BOOT:
first boot, one: apply
first boot, two: apply or skip based on datasource default because we don't have userdata
subsequent boot, one: apply
subsequent boot, two: apply
(These scenarios are assuming networking gets applied on the first attempt)
If on "first boot, two" we skip because of the datasource default, we get a big ugly "Event Denied" log message that will likely be confusing for somebody who doesn't know userdata won't take effect till next boot.

Changing this to instead apply on both attempts isn't difficult, but it would also apply in the BOOT_NEW_INSTANCE case, which is a change in current behavior. This could be important for backwards compatibility, in case we specifically DON'T want network config applied more than once when BOOT_NEW_INSTANCE is specified.

Does my reasoning here make sense? If so, do you think it makes sense to modify the log message on first boot for the case I mentioned?
@OddBloke should be ready for another review
cloudinit/stages.py
Outdated
else:
    # refresh netcfg after update
    netcfg, src = self._find_networking_config()
def _boot_event_enabled_and_supported():
Thanks for the update here James, it's much easier to follow what the intent is here!
cloudinit/stages.py
Outdated
def _boot_event_enabled_and_supported():
    return self.update_event_enabled(
        EventType.BOOT, scope=EventScope.NETWORK
    ) and self.datasource.update_metadata([EventType.BOOT])
This call still throws me off: this isn't just checking for support, but will actually cause the datasource to discard cached data and fetch it anew. That feels to me like it should be two separate calls: "does this datasource support $event_type" and "update the datasource's metadata for $event_type".
It also feels a little odd that update_metadata doesn't take a scope: we only want to update network metadata here; if (e.g.) storage metadata has changed then we don't particularly want to update it.

Relatedly, it may look like update_metadata should just update network metadata (or, generally, metadata for the explicitly configured scopes), but our _get_data implementations generally fetch and store All The Metadata, so I think this will completely refresh all metadata regardless: perhaps not a problem, but worth noting.
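Splitting the conflated call could look roughly like this sketch: a side-effect-free support query plus an explicit, scope-aware refresh. Method names and the plain-string event/scope values are illustrative, not the final API.

```python
class DataSource:
    supported_update_events = {'network': {'boot', 'boot-new-instance'}}

    def supports_update_event(self, event_type, scope='network'):
        """Pure query: never discards cached data or re-fetches."""
        return event_type in self.supported_update_events.get(scope, set())

    def update_metadata(self, event_type, scopes=('network',)):
        """Explicit refresh. In practice _get_data fetches everything,
        so `scopes` is advisory until per-scope fetches exist."""
        if not self.supports_update_event(event_type):
            return False
        return self._refresh(scopes)

    def _refresh(self, scopes):
        # Placeholder for the real metadata re-fetch.
        return True
```

Callers can then check support without triggering a fetch, and only refresh when they actually want new data.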
Branch updated: 18c075c to 9fb6353
@OddBloke most recent commit uses the semaphore as we discussed. Still needs a ton of test changes though.
Branch updated: f8cd1bf to 399483a
default_update_events
Suggestion from OddBloke Co-authored-by: Daniel Watkins <daniel@daniel-watkins.co.uk>
It's a bad idea in general, but this is specifically for tests
…tworking config this boot
Branch updated: d931565 to 2d3d9c5
doc/rtd/topics/events.rst
Outdated
- **BOOT_NEW_INSTANCE**: ``New instance first boot``
- **BOOT**: ``Any system boot other than 'BOOT_NEW_INSTANCE'``
We need to update this with the legacy event.
This LGTM now, thanks for all the work, James! I've confirmed that a full LXD container run and a full Azure run both succeed locally.
Branch updated: 4549f37 to 6d20103
In canonical#834, we refactored the handling of events for fetching new metadata. Previously, in Azure's __init__, the BOOT event was added to update_events, so it was assumed that Azure required the standard BOOT behavior, which is to apply metadata twice every boot: once during local-init, then again during the standard init phase. https://github.com/canonical/cloud-init/blob/21.2/cloudinit/sources/DataSourceAzure.py#L356

However, this line was effectively meaningless. After the metadata was fetched in local-init, it was pickled out to disk. Because "update_events" was a class variable, the EventType.BOOT was not persisted into the pickle. When the pickle was then unpickled in the init phase, metadata did not get re-fetched because EventType.BOOT was not present, so Azure is effectively only BOOT_NEW_INSTANCE.

Fetching metadata twice during boot causes some issues for pre-provisioning on Azure, because updating metadata during re-provisioning will cause cloud-init to poll for reprovisiondata again in DataSourceAzure, which will infinitely return 404 (reprovisiondata is deleted from IMDS after the health signal was sent by cloud-init during init-local). This makes cloud-init stuck in 'init'.
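The pickling pitfall described above can be shown with a minimal sketch (the `DataSource` class and `update_events` contents here are simplified stand-ins, not cloud-init's actual definitions): pickle only serializes an instance's `__dict__`, so state kept in a mutated class attribute is silently dropped when the object is restored in a fresh process.

```python
import pickle

class DataSource:
    # Class attribute: lives on the class, not in any instance's
    # __dict__, so pickle never serializes it with the instance.
    update_events = {"network": {"boot-new-instance"}}

ds = DataSource()
# Mutating through the shared class-level dict (as Azure's __init__
# effectively did) adds nothing to ds.__dict__ for pickle to capture.
DataSource.update_events["network"].add("boot")

blob = pickle.dumps(ds)

# Simulate unpickling in a fresh process, where the class default
# has not been mutated.
DataSource.update_events = {"network": {"boot-new-instance"}}
restored = pickle.loads(blob)

# The restored instance falls back to the class default:
# 'boot' is gone, so only BOOT_NEW_INSTANCE behavior remains.
print(restored.update_events)
```

Making `update_events` an instance attribute (set in `__init__`) would put it in `__dict__` and let it survive the pickle round-trip.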
Proposed Commit Message
Additional Context
Most of the functionality here has been taken from the initial hotplug PR, though this PR is intentionally more limited in scope, and changes have been made as I saw fit.
Test Steps
WIP