Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug(status): retry systemctl show to avoid dbus disconnect #4679

Closed

Conversation

blackboxsw
Copy link
Collaborator

@blackboxsw blackboxsw commented Dec 9, 2023

Proposed Commit Message

bug(status): retry systemctl show to avoid dbus disconnect

Fixes GH-4676

Additional Context

Test Steps

Checklist

Merge type

  • Squash merge using "Proposed Commit Message"
  • Rebase and merge unique commits. Requires commit messages per-commit each referencing the pull request number (#<PR_NUM>)

When cloud-init encounters user-data and vendor-data that does not
contain a valid header, cloud-init logs a warning and creates an
empty processed user-data or vendor-data file at
/var/lib/cloud/instane/cloud-config.txt or vendor-cloud-config.txt.

In cases where invalid user-data or vendor-data was provided, prefer
to process the raw data files to alert the end-user to the specific
failure in parsing the raw data provided.

This approach improves error reporting from the CLI when user-data,
vendor-data or vendor2-data contain undeclared or invalid headers.

Add a utility function get_processed_or_fallback_path which will
return the preferred data path processed or raw fallback file
depending on whether the processed file or the raw data file
has content.

Two other minor changes:
 - Align exit 1 behavior for cloud-init schema --annotate command
   with and without --annotate parameter
 - refactor parts of handle_schema_args into simpler helper
   functions, _assert_exclusive_args and get_config_paths_from_args
Move get_processed_or_fallback_path out of helpers and into schema.py.
Add type hinting and fixup docstring.
Copy link
Member

@holmanb holmanb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This leaves users that want a quick response to cloud-init status waiting for dbus. What if we did something different:

In this case, obviously cloud-init isn't in "done" or "degraded done" state yet, and it can't be in "not run" or "disabled" either, given the check before the call to _get_systemd_status() so that leaves either "error" or "running" or "degraded running". Since this function only returns "running" or "error" ("degraded" gets assigned later), and the absence of dbus means that we cannot introspect to discover an "error", I think that it would be reasonable for this function to return "running" when this happens, because to the best of our knowledge, it is running. If cloud-init is called with --wait, it will just loop anyways. In the case that cloud-init status was called without --wait, the user requested an immediate status, not to wait for something to get a status. We should report the status, to the best of our knowledge.

Also, looking at the call site of _get_systemd_status(), I see no reason that this function should be called if we have already seen that status is set to UXAppStatus.ERROR. The presence of an error cannot be proven to be "not an error" by systemd introspection. We already saw that there are errors, so in that case this should just fall through.

@TheRealFalcon
Copy link
Member

TheRealFalcon commented Dec 9, 2023

Also, looking at the call site of _get_systemd_status(), I see no reason that this function should be called if we have already seen that status is set to UXAppStatus.ERROR. The presence of an error cannot be proven to be "not an error" by systemd introspection.

error becomes "not an error" when cloud-init is still running. The reason for that code was that status --wait was returning with error before cloud-init ever completed, which is a bug. I should've named it better, but _get_systemd_status() can only ever return RUNNING, ERROR, or None (ignore my status).

@blackboxsw
Copy link
Collaborator Author

Also, looking at the call site of _get_systemd_status(), I see no reason that this function should be called if we have already seen that status is set to UXAppStatus.ERROR. The presence of an error cannot be proven to be "not an error" by systemd introspection.

error becomes "not an error" when cloud-init is still running. The reason for that code was that status --wait was returning with error before cloud-init ever completed, which is a bug. I should've named it better, but _get_systemd_status() can only ever return RUNNING, ERROR, or None (ignore my status).

Closing this approach. As Brett brought up, retrying in all cases is not ideal if we are just looking for a quick status on cloud-init without blocking on more systemctl retries due to dbus issues. What was decided in the meeting (per James comment) was that we will retry failed systemctl commands if overall cloud-init status in error condition to avoid early returns which claim that cloud-init is DONE when cloud-init units are still RUNNING.

@blackboxsw blackboxsw closed this Dec 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants