Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Warn user of unexpected run mode #5209

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

holmanb
Copy link
Member

@holmanb holmanb commented Apr 24, 2024

Proposed Commit Message

feat: Warn user of unexpected run mode
  
On systemd, services are started by PID 1. When this doesn't happen, cloud-init
is in an unknown run state and should warn the user.

- Reorder pid log to be able to reuse Distro information.
- Add docstring deprecating util.is_Linux().

Merge type

  • Squash merge using "Proposed Commit Message"
  • Rebase and merge unique commits. Requires commit messages per-commit each referencing the pull request number (#<PR_NUM>)

On systemd, services are started by PID 1. When this doesn't happen, cloud-init
is in an unknown run state and should warn the user.

Reorder pid log to be able to reuse Distro information.

Add docstring deprecating util.is_Linux().
Copy link
Collaborator

@blackboxsw blackboxsw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking some boxes:

  • ✔ Alpine (openrc-based) will invoke cloud-init boot stages from non-1 PPID, but we are ignoring non-systemd cases
  • ✔non-linux and therefore non-systemd is ignored

Here's a suggested diff to ensure the logs are actually written, and the log message has a bit more data about the launched boot stage. It doesn't resolve either the lack of logging PID on non-linux/non-systemd or the dropping of util.is_Linux.

diff --git a/cloudinit/cmd/main.py b/cloudinit/cmd/main.py
index ad9910e04..aff4d380f 100644
--- a/cloudinit/cmd/main.py
+++ b/cloudinit/cmd/main.py
@@ -72,15 +72,20 @@ def print_exc(msg=""):
     sys.stderr.write("\n")
 
 
-def log_ppid(distro):
+def log_ppid(distro, bootstage_name):
     if distro.is_linux:
         ppid = os.getppid()
         log = LOG.info
         extra_message = ""
         if 1 != ppid and distro.uses_systemd():
             log = LOG.warning
-            extra_message = ("Not a supported configuration.",)
-        log("PID [%s] started cloud-init. %s", ppid, extra_message)
+            extra_message = " Unsupported configuration: boot stage called outside of systemd"
+        log(
+            "PID [%s] started cloud-init '%s'.%s",
+            ppid,
+            bootstage_name,
+            extra_message,
+        )
 
 
 def welcome(action, msg=None):
@@ -321,14 +326,11 @@ def main_init(name, args):
     #    objects config as it may be different from init object
     # 10. Run the modules for the 'init' stage
     # 11. Done!
-    if not args.local:
-        w_msg = welcome_format(name)
-    else:
-        w_msg = welcome_format("%s-local" % (name))
+    bootstage_name = "%s-local" % (name) if args.local else name
+    w_msg = welcome_format(bootstage_name)
     init = stages.Init(ds_deps=deps, reporter=args.reporter)
     # Stage 1
     init.read_cfg(extract_fns(args))
-    log_ppid(init.distro)
     # Stage 2
     outfmt = None
     errfmt = None
@@ -354,6 +356,7 @@ def main_init(name, args):
     # config applied.  We send the welcome message now, as stderr/out have
     # been redirected and log now configured.
     welcome(name, msg=w_msg)
+    log_ppid(init.distro, bootstage_name)
 
     # re-play early log messages before logging was setup
     for lvl, msg in early_logs:
@@ -581,11 +584,11 @@ def main_modules(action_name, args):
     #    the modules objects configuration
     # 5. Run the modules for the given stage name
     # 6. Done!
-    w_msg = welcome_format("%s:%s" % (action_name, name))
+    bootstage_name = "%s:%s" % (action_name, name)
+    w_msg = welcome_format(bootstage_name)
     init = stages.Init(ds_deps=[], reporter=args.reporter)
     # Stage 1
     init.read_cfg(extract_fns(args))
-    log_ppid(init.distro)
     # Stage 2
     try:
         init.fetch(existing="trust")
@@ -620,6 +623,7 @@ def main_modules(action_name, args):
 
     # now that logging is setup and stdout redirected, send welcome
     welcome(name, msg=w_msg)
+    log_ppid(init.distro, bootstage_name)
 
     if name == "init":
         util.deprecate(

@@ -324,6 +328,7 @@ def main_init(name, args):
init = stages.Init(ds_deps=deps, reporter=args.reporter)
# Stage 1
init.read_cfg(extract_fns(args))
log_ppid(init.distro)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cannot happen until after log.setup_logging has been called.

@@ -580,6 +585,7 @@ def main_modules(action_name, args):
init = stages.Init(ds_deps=[], reporter=args.reporter)
# Stage 1
init.read_cfg(extract_fns(args))
log_ppid(init.distro)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to move this down a bit too after log.setup_logging otherwise no logs are written to /var/log/cloud-init.log. To retain original behavior, we may want to log this just after the welcome(... w_msg) to log it during the boot stage block.

extra_message = ""
if 1 != ppid and distro.uses_systemd():
log = LOG.warning
extra_message = ("Not a supported configuration.",)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add details in the extra message to tell use what's unsupported about it.
Unsupported configuration: boot stage called outside of systemd

@@ -490,6 +490,12 @@ def multi_log(

@lru_cache()
def is_Linux():
"""deprecated: prefer Distro object's `is_linux` property
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could just drop this function altogether if we can figure a better approach to sort cloud-init analyze dump as the only other use-case right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep!

if 1 != ppid and distro.uses_systemd():
log = LOG.warning
extra_message = ("Not a supported configuration.",)
log("PID [%s] started cloud-init. %s", ppid, extra_message)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we intentionally not logging the PID when not distro.is_linux?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what the code did before, but I suppose we should be able to log PPID on non-linux too. I'm curious whether the PPID=1 should hold true on any of the BSDs, and it wouldn't hurt to log it.

@blackboxsw blackboxsw self-assigned this Apr 25, 2024
Copy link
Collaborator

@blackboxsw blackboxsw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good and behaves well. One thing I think we need to sort here is integration tests which call cloud-init --local over SSH as they'll now exit non-zero due to the warning.

Tests affected are at least the following which both check response.ok from a cloud-init init --local call.

  • tests/integration_tests/datasources/test_ec2_ipv6.py
  • tests/integration_tests/test_upgrade.py

We might be able to adapt those tests to generally disregard exit 2 or maybe specifically call verify_clean_boot with a non-empty ignore_warnings item to ignore this expected warning log.

@TheRealFalcon
Copy link
Member

TheRealFalcon commented Apr 30, 2024

Because we patched out the exit 2 behavior on old releases, there's no backwards incompatibility here, correct? Just an extra warning in the logs?

We would see an exit 2 when we SRU to Noble though...

@blackboxsw
Copy link
Collaborator

Because we patched out the exit 2 behavior on old releases, there's no backwards incompatibility here, correct? Just an extra warning in the logs?

We would see an exit 2 when we SRU to Noble though...

yep and yep. I think it's worth a followup quilt patch in ubuntu/noble:debian/patches/retain-ppid-unsupported-config-as-info-level.patch to ensure we don't set warning level on the logs in noble.

@holmanb
Copy link
Member Author

holmanb commented May 2, 2024

yep and yep. I think it's worth a followup quilt patch in ubuntu/noble:debian/patches/retain-ppid-unsupported-config-as-info-level.patch to ensure we don't set warning level on the logs in noble.

we could always just make this a deprecation log and deal with this in the same way that we choose to deal with other new deprecations on older series - a patch per log like this seems really expensive to maintain in the long run

@holmanb holmanb mentioned this pull request May 3, 2024
2 tasks
@blackboxsw
Copy link
Collaborator

we could always just make this a deprecation log and deal with this in the same way that we choose to deal with other new deprecations on older series - a patch per log like this seems really expensive to maintain in the long run

That is a much better idea. Good suggestion, let's go with deprecation.

Copy link
Collaborator

@blackboxsw blackboxsw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So on stable releaeses, we'll patch the warning to a deprecation. Only thing left to land this PR is integration test fixes per this comment

@github-actions github-actions bot added the stale-pr Pull request is stale; will be auto-closed soon label May 23, 2024
@holmanb holmanb removed the stale-pr Pull request is stale; will be auto-closed soon label May 29, 2024
@canonical canonical deleted a comment from github-actions bot Jun 3, 2024
@blackboxsw blackboxsw added this to the cloud-init-24.3 milestone Jun 3, 2024
@blackboxsw
Copy link
Collaborator

I think this is just waiting on two integration test fixes. But, I'd like this in 24.3 as we are introducing a "breaking change" as far as warnings being emitted where there were not warnings before-hand and I'd like us to avoid trying to quickly stuff quilt patches into /noble for this across the SRU boundary for 24.2

@holmanb
Copy link
Member Author

holmanb commented Jun 4, 2024

I think this is just waiting on two integration test fixes. But, I'd like this in 24.3 as we are introducing a "breaking change" as far as warnings being emitted where there were not warnings before-hand and I'd like us to avoid trying to quickly stuff quilt patches into /noble for this across the SRU boundary for 24.2

Resolving this involves either:

  1. ignore this warning always
  2. do this somehow without using verify_clean_log()

I don't think that we want 1) (it involves muddying verify_clean_log() with even more single-use ignores aka tech debt), but 2) requires implementing what was described here:

One idea that I envision for this helper is a matrix of platforms and series that allows us to specify exactly when an expected warning message that will be seen. This would be more precise than the current "ignore this pattern everywhere" paradigm that verify_clean_log() uses. That said, I'm not sure that I want to include that refactor in this PR, which has already grown sizeable.

In order to make verify_clean_log() truly be a general-purpose drop-in replacement for verify_clean_boot(), I think we need to grow the ability to track tracebacks in cloud-init status --format json (which would be useful for users anyways). Unfortunately, that is not something that will happen quickly, but also I don't think that this is a strong requirement today for implementing this test, since ignoring tracebacks is an oracle-only requirement.

I'll share what I've been working on shortly.

Implement verify_clean_boot() to ignore certain expected logs
in a platform-specific way.
@holmanb
Copy link
Member Author

holmanb commented Jun 4, 2024

@blackboxsw Last commit contains what I had in mind.

With this commit I can successfully run tests/integration_tests/test_upgrade.py::test_clean_boot_of_upgraded_package and get an instance with this log:

# grep WARN /var/log/cloud-init.log 
2024-06-04 01:07:26,672 - main.py[WARNING]: PID [2082] started cloud-init 'init-local'. Unsupported configuration: boot stage called outside of systemd
2024-06-04 01:07:27,299 - main.py[WARNING]: PID [2082] started cloud-init 'init'. Unsupported configuration: boot stage called outside of systemd
2024-06-04 01:07:27,956 - main.py[WARNING]: PID [2082] started cloud-init 'modules:config'. Unsupported configuration: boot stage called outside of systemd
2024-06-04 01:07:28,548 - main.py[WARNING]: PID [2082] started cloud-init 'modules:final'. Unsupported configuration: boot stage called outside of systemd

And it passes.

Thoughts?

@holmanb holmanb requested a review from blackboxsw June 10, 2024 18:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants