bug(status): retry systemctl show to avoid dbus disconnect #4679

blackboxsw · 2023-12-09T01:52:48Z

Proposed Commit Message

bug(status): retry systemctl show to avoid dbus disconnect

Fixes GH-4676

Additional Context

Test Steps

Checklist

My code follows the process laid out in the documentation
I have updated or added any unit tests accordingly
I have updated or added any documentation accordingly

Merge type

Squash merge using "Proposed Commit Message"
Rebase and merge unique commits. Requires commit messages per-commit each referencing the pull request number (#<PR_NUM>)

When cloud-init encounters user-data and vendor-data that does not contain a valid header, cloud-init logs a warning and creates an empty processed user-data or vendor-data file at /var/lib/cloud/instane/cloud-config.txt or vendor-cloud-config.txt. In cases where invalid user-data or vendor-data was provided, prefer to process the raw data files to alert the end-user to the specific failure in parsing the raw data provided. This approach improves error reporting from the CLI when user-data, vendor-data or vendor2-data contain undeclared or invalid headers. Add a utility function get_processed_or_fallback_path which will return the preferred data path processed or raw fallback file depending on whether the processed file or the raw data file has content. Two other minor changes: - Align exit 1 behavior for cloud-init schema --annotate command with and without --annotate parameter - refactor parts of handle_schema_args into simpler helper functions, _assert_exclusive_args and get_config_paths_from_args

Move get_processed_or_fallback_path out of helpers and into schema.py. Add type hinting and fixup docstring.

…#4679) Fixes canonicalGH-4676

holmanb

This leaves users that want a quick response to cloud-init status waiting for dbus. What if we did something different:

In this case, obviously cloud-init isn't in "done" or "degraded done" state yet, and it can't be in "not run" or "disabled" either, given the check before the call to _get_systemd_status() so that leaves either "error" or "running" or "degraded running". Since this function only returns "running" or "error" ("degraded" gets assigned later), and the absence of dbus means that we cannot introspect to discover an "error", I think that it would be reasonable for this function to return "running" when this happens, because to the best of our knowledge, it is running. If cloud-init is called with --wait, it will just loop anyways. In the case that cloud-init status was called without --wait, the user requested an immediate status, not to wait for something to get a status. We should report the status, to the best of our knowledge.

Also, looking at the call site of _get_systemd_status(), I see no reason that this function should be called if we have already seen that status is set to UXAppStatus.ERROR. The presence of an error cannot be proven to be "not an error" by systemd introspection. We already saw that there are errors, so in that case this should just fall through.

TheRealFalcon · 2023-12-09T05:24:33Z

Also, looking at the call site of _get_systemd_status(), I see no reason that this function should be called if we have already seen that status is set to UXAppStatus.ERROR. The presence of an error cannot be proven to be "not an error" by systemd introspection.

error becomes "not an error" when cloud-init is still running. The reason for that code was that status --wait was returning with error before cloud-init ever completed, which is a bug. I should've named it better, but _get_systemd_status() can only ever return RUNNING, ERROR, or None (ignore my status).

blackboxsw · 2023-12-11T16:22:17Z

Also, looking at the call site of _get_systemd_status(), I see no reason that this function should be called if we have already seen that status is set to UXAppStatus.ERROR. The presence of an error cannot be proven to be "not an error" by systemd introspection.

error becomes "not an error" when cloud-init is still running. The reason for that code was that status --wait was returning with error before cloud-init ever completed, which is a bug. I should've named it better, but _get_systemd_status() can only ever return RUNNING, ERROR, or None (ignore my status).

Closing this approach. As Brett brought up, retrying in all cases is not ideal if we are just looking for a quick status on cloud-init without blocking on more systemctl retries due to dbus issues. What was decided in the meeting (per James comment) was that we will retry failed systemctl commands if overall cloud-init status in error condition to avoid early returns which claim that cloud-init is DONE when cloud-init units are still RUNNING.

blackboxsw added 3 commits December 8, 2023 09:50

review suggestion: get_processed_or_fallback_path use suppress

4c4221a

Move get_processed_or_fallback_path out of helpers and into schema.py. Add type hinting and fixup docstring.

bug(status): retry systemctl show to avoid dbus disconnect (canonical…

b4f683b

…#4679) Fixes canonicalGH-4676

blackboxsw force-pushed the GH-4676-retry-systemctl-show branch from e091fcc to b4f683b Compare December 9, 2023 01:53

holmanb reviewed Dec 9, 2023

View reviewed changes

blackboxsw closed this Dec 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug(status): retry systemctl show to avoid dbus disconnect #4679

bug(status): retry systemctl show to avoid dbus disconnect #4679

blackboxsw commented Dec 9, 2023 •

edited

holmanb left a comment •

edited

TheRealFalcon commented Dec 9, 2023 •

edited

blackboxsw commented Dec 11, 2023

bug(status): retry systemctl show to avoid dbus disconnect #4679

bug(status): retry systemctl show to avoid dbus disconnect #4679

Conversation

blackboxsw commented Dec 9, 2023 • edited

Proposed Commit Message

Additional Context

Test Steps

Checklist

Merge type

holmanb left a comment • edited

Choose a reason for hiding this comment

TheRealFalcon commented Dec 9, 2023 • edited

blackboxsw commented Dec 11, 2023

blackboxsw commented Dec 9, 2023 •

edited

holmanb left a comment •

edited

TheRealFalcon commented Dec 9, 2023 •

edited