doc: Explain breaking change in status code (#5049)
holmanb committed Mar 18, 2024
1 parent 18daab5 commit e517f5a
Showing 4 changed files with 206 additions and 18 deletions.
2 changes: 2 additions & 0 deletions doc/rtd/explanation/failure_states.rst
@@ -46,6 +46,8 @@ module-level keys: ``init-local``, ``init``, ``modules-config``,
See :ref:`this more detailed explanation<exported_errors>` to learn how to
use cloud-init's exported errors.

.. _error_codes:

Cloud-init error codes
----------------------

1 change: 1 addition & 0 deletions doc/rtd/explanation/index.rst
@@ -22,3 +22,4 @@ knowledge and become better at using and configuring ``cloud-init``.
kernel-cmdline.rst
failure_states.rst
exported_errors.rst
return_codes.rst
150 changes: 150 additions & 0 deletions doc/rtd/explanation/return_codes.rst
@@ -0,0 +1,150 @@
.. _return_codes:

Why did `cloud-init status` start returning exit code 2?
========================================================

Cloud-init introduced :ref:`a new error code<error_codes>`
in 23.4. This page describes the purpose of this change and
gives some context for why this change was made.

.. _return_codes_history:

Background
----------

Since cloud-init provides access to cloud instances, the
paradigm for handling errors was "log errors, but proceed".
Exiting on failure conditions doesn't make sense when that
may prevent one from accessing the system to debug it.

Since cloud-init's behavior is heavily tied to specific cloud
platforms, reproducing a cloud-init bug without exactly
reproducing the specific cloud environment is often impossible
and usually requires guesswork. To make debugging possible
without an exact reproduction, cloud-init logs are quite verbose.

.. _return_codes_pain_points:

Pain points
-----------

1) Invalid configurations were historically ignored.
2) Log verbosity is unfriendly to end users who may not know
   what to look for. Verbose logs mean users often ignore real
   errors.
3) Cloud-init's reported status was only capable of telling the user
   whether cloud-init crashed. Cloud-init would report a status of
   "done" in all of the following cases:

   * a user's configuration was invalid
   * the operating system or cloud environment experienced some error that
     prevented cloud-init from configuring the instance
   * cloud-init internally experienced an error

.. _return_codes_improvements:

Efforts to improve cloud-init
-----------------------------

Several changes have been introduced to cloud-init to address the pain
points described above.

JSON schema
^^^^^^^^^^^

Cloud-init has defined a JSON schema which fully documents the user-data
cloud-config. This JSON schema may be used in several different ways:

Text editor integration
"""""""""""""""""""""""

Thanks to `yaml-language-server`_, cloud-init's JSON schema may be
used for YAML syntax checking, warnings when invalid keys are used, and
autocompletion. Several different text editors are capable of this.
See this `blog post on configuring this for neovim`_. For VS Code, install
the `extension`_; a file named ``cloud-config.yaml`` will then
automatically use cloud-init's JSON schema.


Cloud-init schema subcommand
""""""""""""""""""""""""""""

The cloud-init package includes a cloud-init subcommand,
:ref:`cloud-init schema<check_user_data_cloud_config>`, which uses the schema
to validate either the configuration passed to the instance that you are
running the command on, or an arbitrary text file containing a
configuration.
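
As a rough illustration, both modes of validation could be scripted along the following lines. This is a minimal sketch, not part of cloud-init: it assumes the ``cloud-init`` executable is on ``PATH``, that ``--config-file`` selects an arbitrary file and ``--system`` selects the configuration passed to the current instance, and that the subcommand exits non-zero when validation fails. The helper names ``build_schema_command`` and ``validate`` are hypothetical.

```python
import subprocess
from typing import List, Optional


def build_schema_command(config_file: Optional[str] = None) -> List[str]:
    """Build the argument list for ``cloud-init schema``.

    With a path, validate that file; without one, validate the
    configuration passed to the current instance (assumed flag: --system).
    """
    command = ["cloud-init", "schema"]
    if config_file is None:
        command.append("--system")
    else:
        command.extend(["--config-file", config_file])
    return command


def validate(config_file: Optional[str] = None) -> bool:
    """Return True when the schema subcommand exits with code 0."""
    completed = subprocess.run(
        build_schema_command(config_file),
        capture_output=True,
        text=True,
        check=False,  # inspect the return code ourselves instead of raising
    )
    return completed.returncode == 0
```

Calling ``validate("user-data.yaml")`` would check a local file before it is ever passed to an instance, while ``validate()`` would check the configuration of the machine the script runs on.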

Return codes
^^^^^^^^^^^^

Cloud-init historically used two return codes from the
:code:`cloud-init status` subcommand: 0 to indicate success and 1 to indicate
failure. These return codes lacked nuance. Return code 0 (success) also covered
the in-between case where something went wrong but cloud-init was still able
to finish.

Many users of cloud-init run :code:`cloud-init status --wait` and expect that
when complete, cloud-init has finished. Since cloud-init is not guaranteed to
succeed, users should also check the return code of this command.

As of 23.4, errors that do not crash cloud-init result in an exit code of 2.
An exit code of 1 means that cloud-init crashed, and an exit code of 0 more
correctly means that cloud-init succeeded. Anyone that previously checked for
exit code 0 should probably update their assumptions in one of the following
two ways:

Users that wish to take advantage of cloud-init's error reporting
capabilities should check for exit code of 2 from :code:`cloud-init status`.
An example of this:

.. code-block:: python

    from json import loads
    from logging import getLogger
    from subprocess import run
    from sys import exit

    logger = getLogger(__name__)
    # pass the command as a list and capture stdout so it can be parsed
    completed = run(
        ["cloud-init", "status", "--format", "json"],
        capture_output=True,
        text=True,
    )
    output = loads(completed.stdout)

    if 2 == completed.returncode:
        # something bad might have happened - we should check it out
        logger.warning("cloud-init experienced a recoverable error")
        logger.warning("status: %s", output.get("extended_status"))
        logger.warning("recoverable error: %s", output.get("recoverable_errors"))
    elif 1 == completed.returncode:
        # cloud-init completely failed
        logger.error("cloud-init crashed, all bets are off!")
        exit(1)

Users that wish to ignore cloud-init's errors and check the return code in
a backwards-compatible way should check that the return code is not equal to
1. This will provide the same behavior before and after the changed exit code.
See an example of this:

.. code-block:: python

    from logging import getLogger
    from subprocess import run
    from sys import exit

    logger = getLogger(__name__)
    # pass the command as a list so no shell is required
    completed = run(["cloud-init", "status", "--format", "json"])

    if 1 == completed.returncode:
        # cloud-init completely failed
        logger.error("cloud-init crashed, all bets are off!")
        exit(1)

    # cloud-init might have failed, but this code ignores that possibility
    # in preference of backwards compatibility

See :ref:`our explanation of failure states<failure_states>` for more
information.
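
The three exit codes described above can also be condensed into a small lookup for scripts that block on :code:`cloud-init status --wait`. The following is a minimal sketch under stated assumptions: the label strings and the helper names ``describe_exit_code`` and ``wait_for_cloud_init`` are hypothetical, and only the meanings of codes 0, 1 and 2 come from this page.

```python
import subprocess
from typing import Sequence

# Meanings as described on this page; the label strings are illustrative.
EXIT_CODE_MEANINGS = {
    0: "succeeded",
    1: "crashed",
    2: "finished with recoverable errors",
}


def describe_exit_code(returncode: int) -> str:
    """Translate an exit code of ``cloud-init status`` into a short label."""
    return EXIT_CODE_MEANINGS.get(
        returncode, f"unexpected exit code {returncode}"
    )


def wait_for_cloud_init(
    command: Sequence[str] = ("cloud-init", "status", "--wait"),
) -> str:
    """Block until cloud-init completes, then describe how it ended."""
    completed = subprocess.run(
        list(command), capture_output=True, text=True, check=False
    )
    return describe_exit_code(completed.returncode)
```

Keeping the mapping separate from the subprocess call means the interpretation of the codes can be checked without a cloud-init installation.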

.. _yaml-language-server: https://github.com/redhat-developer/yaml-language-server
.. _extension: https://marketplace.visualstudio.com/items?itemName=redhat.vscode-yaml
.. _blog post on configuring this for neovim: https://phoenix-labs.xyz/blog/setup-neovim-cloud-init-completion/
71 changes: 53 additions & 18 deletions doc/rtd/reference/cli.rst
@@ -404,12 +404,14 @@ module default frequency of ``instance``:
:command:`status`
=================

-Report whether ``cloud-init`` is running, done, disabled or errored. Exits
-non-zero if an error is detected in ``cloud-init``.
+Report cloud-init's current status.
+
+Exits 1 if ``cloud-init`` crashes, 2 if ``cloud-init`` finishes but experienced
+recoverable errors, and 0 if ``cloud-init`` ran without error.

* :command:`--long`: Detailed status information.
* :command:`--wait`: Block until ``cloud-init`` completes.
-* :command:`--format [yaml|json|tabular]`: Machine-readable JSON or YAML
+* :command:`--format [yaml|json]`: Machine-readable JSON or YAML
detailed output.

The :command:`status` command can be used simply as follows:
@@ -419,7 +421,8 @@ The :command:`status` command can be used simply as follows:
$ cloud-init status
Which shows whether ``cloud-init`` is currently running, done, disabled, or in
-error, as in this example output:
+error. Note that the ``extended_status`` key in ``--long`` or ``--format json``
+contains more accurate and complete status information. Example output:

.. code-block::
@@ -436,19 +439,24 @@ Example output when ``cloud-init`` is running:
.. code-block::
status: running
-time: Fri, 26 Jan 2018 21:39:43 +0000
-detail:
-Running in stage: init-local
+extended_status: running
+boot_status_code: enabled-by-generator
+last_update: Wed, 13 Mar 2024 18:46:26 +0000
+detail: DataSourceLXD
+errors: []
+recoverable_errors: {}
Example output when ``cloud-init`` is done:

.. code-block::
status: done
+extended_status: done
boot_status_code: enabled-by-generator
-last_update: Tue, 16 Aug 2022 19:12:58 +0000
-detail:
-DataSourceNoCloud [seed=/var/lib/cloud/seed/nocloud-net][dsmode=net]
+last_update: Wed, 13 Mar 2024 18:46:26 +0000
+detail: DataSourceLXD
+errors: []
+recoverable_errors: {}
The detailed output can be shown in machine-readable JSON or YAML with the
:command:`format` option, for example:
@@ -461,13 +469,40 @@ Which would produce the following example output:

.. code-block::
-{
-    "boot_status_code": "enabled-by-generator",
-    "datasource": "nocloud",
-    "detail": "DataSourceNoCloud [seed=/var/lib/cloud/seed/nocloud-net][dsmode=net]",
-    "errors": [],
-    "last_update": "Tue, 16 Aug 2022 19:12:58 +0000",
-    "status": "done"
-}
+{
+    "boot_status_code": "enabled-by-generator",
+    "datasource": "lxd",
+    "detail": "DataSourceLXD",
+    "errors": [],
+    "extended_status": "done",
+    "init": {
+        "errors": [],
+        "finished": 1710355584.3603137,
+        "recoverable_errors": {},
+        "start": 1710355584.2216876
+    },
+    "init-local": {
+        "errors": [],
+        "finished": 1710355582.279756,
+        "recoverable_errors": {},
+        "start": 1710355582.2255273
+    },
+    "last_update": "Wed, 13 Mar 2024 18:46:26 +0000",
+    "modules-config": {
+        "errors": [],
+        "finished": 1710355585.5042186,
+        "recoverable_errors": {},
+        "start": 1710355585.334438
+    },
+    "modules-final": {
+        "errors": [],
+        "finished": 1710355586.9038777,
+        "recoverable_errors": {},
+        "start": 1710355586.8076844
+    },
+    "recoverable_errors": {},
+    "stage": null,
+    "status": "done"
+}
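
A consumer of this JSON might want one combined view of the per-stage ``recoverable_errors`` entries. The following is a minimal sketch, not cloud-init code: the helper name ``collect_recoverable_errors`` is hypothetical, and it assumes that each ``recoverable_errors`` object maps a level name to a list of messages, which the empty objects in the example output above do not confirm.

```python
import json
from typing import Dict, List


def collect_recoverable_errors(status_json: str) -> Dict[str, List[str]]:
    """Merge the per-stage ``recoverable_errors`` objects into one dict.

    Assumption: each ``recoverable_errors`` object maps a level name
    to a list of message strings.
    """
    status = json.loads(status_json)
    merged: Dict[str, List[str]] = {}
    for key, details in status.items():
        # Stage entries such as "init" or "modules-final" are JSON objects;
        # skip scalars, lists, and the top-level summary object.
        if not isinstance(details, dict) or key == "recoverable_errors":
            continue
        for level, messages in details.get("recoverable_errors", {}).items():
            merged.setdefault(level, []).extend(messages)
    return merged
```

With the example output above, every stage reports an empty object, so the merged result would be an empty dict.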
.. _More details on machine-id: https://www.freedesktop.org/software/systemd/man/machine-id.html
