
add new sub_state return value to service_facts module#84618

Closed
NomakCooper wants to merge 0 commits into ansible:devel from NomakCooper:devel

Conversation

@NomakCooper

SUMMARY

The ansible.builtin.service_facts module does not accurately report the state of services when the source is systemd.

On hosts using systemd, a state of exited or dead indicates that the service is inactive; however, this alone does not confirm that the service is properly in the stopped state.

According to the code in the SystemctlScanService class, the service state on hosts is determined by the SUB column of the output of the following command:

$ systemctl list-units --no-pager --type service --all --plain

Unfortunately, the SUB value, read as fields[3], is later overridden by the state_val variable, which defaults to stopped.

As a result, the module only ever returns two possible outcomes, stopped or running, rather than accurately reporting the true state of each service.
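To illustrate the collapse described above, here is a simplified, self-contained sketch; the sample output and the parse_units() helper are invented for illustration and are not the actual SystemctlScanService code:

```python
# Simplified sketch of the behaviour described above; the sample
# output and parse_units() helper are illustrative only.
SAMPLE = """\
rsyslog.service loaded inactive dead    System Logging Service
sshd.service    loaded active   running OpenSSH server daemon
"""

def parse_units(output):
    services = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 4 or not fields[0].endswith(".service"):
            continue
        sub = fields[3]  # the SUB column from systemctl
        # The SUB detail ("dead", "exited", "failed", ...) is
        # flattened into just two values for the `state` field.
        state_val = "running" if sub == "running" else "stopped"
        services[fields[0]] = {"state": state_val, "sub": sub}
    return services

print(parse_units(SAMPLE)["rsyslog.service"])
# {'state': 'stopped', 'sub': 'dead'}
```

Both dead and exited services come out as plain stopped, which is exactly the information loss the new sub_state value is meant to address.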

The best solution would be to add a new value, sub_state, that reflects the exact state from systemctl, but only when the source is systemd.

I have modified the service_facts.py file by adding the return of a new value, sub_state.

  • The state value remains unchanged.
  • The new value, sub_state, has been added to reflect the SUB state of systemctl.

To avoid errors, the sub_state value has been added to both _list_from_units and _list_from_unit_files.

In _list_from_unit_files, the sub_state value will be identical to state.
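The shape of the change can be sketched as follows; the helper names here are hypothetical, invented only to show the two code paths:

```python
# Hypothetical helpers illustrating the two code paths; these are
# not the module's actual function names.
def entry_from_units(name, sub):
    # SUB is available here, so sub_state carries the real detail.
    state = "running" if sub == "running" else "stopped"
    return {"name": name, "state": state, "sub_state": sub}

def entry_from_unit_files(name, state):
    # No SUB column is available in this path, so sub_state simply
    # mirrors state to keep the returned dict shape uniform.
    return {"name": name, "state": state, "sub_state": state}

print(entry_from_units("rsyslog.service", "dead"))
# {'name': 'rsyslog.service', 'state': 'stopped', 'sub_state': 'dead'}
```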

I have also updated the documentation with a description of sub_state and an example.

Fixes #84607

ISSUE TYPE
  • Feature Pull Request

@ansibot ansibot added feature This issue/PR relates to a feature request. needs_triage Needs a first human triage before being processed. module This issue/PR relates to a module. has_issue labels Jan 27, 2025
@NomakCooper
Author

Added a new changelog fragment: changelogs/fragments/84618-service_facts_sub_state.yml

@mkrizek mkrizek removed the needs_triage Needs a first human triage before being processed. label Jan 28, 2025
@NomakCooper
Author

NomakCooper commented Jan 28, 2025

I created a temporary test env to evaluate the module with these changes.
The test was conducted using ansible [core 2.18.2]

# ansible --version
ansible [core 2.18.2]
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.11/site-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/local/bin/ansible
  python version = 3.11.5 (main, Jan 27 2025, 22:17:48) [GCC 11.5.0 20240719 (Red Hat 11.5.0-2)] (/usr/local/bin/python3.11)
  jinja version = 3.1.5
  libyaml = True

This test playbook kills the rsyslogd daemon to set the service to a dead state.
It then populates ansible_facts['services'] and displays the debug output of the rsyslog service dict.

# systemd_test.yml
- name: Test systemd service_facts
  hosts: localhost
  gather_facts: no

  tasks:

  - name: kill rsyslog
    command: killall rsyslogd

  - name: populate service facts
    service_facts:

  - name: print rsyslog dict from service_facts
    debug:
      msg: "{{ ansible_facts['services'].values() | selectattr('name', 'equalto', 'rsyslog.service') }}"

results:

# ansible-playbook systemd_test.yml

PLAY [Test systemd service_facts] ***********************************************************************************

TASK [kill rsyslog] *************************************************************************************************
changed: [localhost]

TASK [populate service facts] ***************************************************************************************
ok: [localhost]

TASK [print rsyslog dict from service_facts] ************************************************************************
ok: [localhost] => {
    "msg": [
        {
            "name": "rsyslog.service",
            "source": "systemd",
            "state": "stopped",
            "status": "enabled",
            "sub_state": "dead"
        }
    ]
}

PLAY RECAP *********************************************************************************************************
localhost                  : ok=3    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

@ansibot
Contributor

ansibot commented Feb 1, 2025

The test ansible-test sanity --test yamllint [explain] failed with 2 errors:

lib/ansible/modules/service_facts.py:103:11: error: RETURN: syntax error: could not find expected ':' (syntax)
lib/ansible/modules/service_facts.py:103:11: unparsable-with-libyaml: RETURN: while scanning a simple key - could not find expected ':'

The test ansible-test sanity --test ansible-doc [explain] failed with the error:

Command "ansible-doc -t module service_facts" returned exit status 1.
>>> Standard Error
ERROR! module service_facts missing documentation (or could not parse documentation): service_facts did not contain a DOCUMENTATION attribute (/root/ansible/lib/ansible/modules/service_facts.py). Unable to parse documentation in python file '/root/ansible/lib/ansible/modules/service_facts.py': while scanning a simple key
  in "<unicode string>", line 33, column 11
could not find expected ':'
  in "<unicode string>", line 34, column 11. while scanning a simple key
  in "<unicode string>", line 33, column 11
could not find expected ':'
  in "<unicode string>", line 34, column 11

The test ansible-test sanity --test validate-modules [explain] failed with 3 errors:

lib/ansible/modules/service_facts.py:0:0: documentation-error: Unknown DOCUMENTATION error, see TRACE: Unable to parse documentation in python file 'lib/ansible/modules/service_facts.py': while scanning a simple key
  in "<unicode string>", line 33, column 11
could not find expected ':'
  in "<unicode string>", line 34, column 11. while scanning a simple key
  in "<unicode string>", line 33, column 11
could not find expected ':'
  in "<unicode string>", line 34, column 11
lib/ansible/modules/service_facts.py:0:0: invalid-documentation-markup: Directive "RV(ansible_facts.services)" contains a non-existing return value "ansible_facts.services"
lib/ansible/modules/service_facts.py:103:11: return-syntax-error: RETURN is not valid YAML

click here for bot help

@ansibot ansibot added needs_revision This PR fails CI tests or a maintainer has requested a review/revision of the PR. and removed needs_revision This PR fails CI tests or a maintainer has requested a review/revision of the PR. labels Feb 1, 2025
@NomakCooper
Author

I have added changes to run integration tests for the sub_state value:

  • Modified the original file test/integration/targets/service_facts/tasks/main.yml
    • Added a new block to avoid interfering with the existing disabled test
  • Added a new file test/integration/targets/service_facts/tasks/subtests.yml

In this new test, the sub_state value is extracted in several key scenarios to observe whether it mirrors the state value or takes on different values.

The test uses the ansible_test service already present in the integration targets.

Scenarios:

| state   | enabled |
|---------|---------|
| started | yes     |
| stopped | yes     |
| stopped | no      |
| started | no      |
  • In the penultimate test, the main process is intentionally killed to force the service into a failed state.

  • In the final test, the .service file is modified, then the daemon is reloaded and the service restarted, forcing it into an exited state immediately after starting.

In this last test, you can see how the module reports the following values:

"ansible_test.service": {
    "name": "ansible_test.service", 
    "source": "systemd", 
    "state": "stopped", 
    "status": "enabled", 
    "sub_state": "exited"
}

@NomakCooper
Author

NomakCooper commented Feb 2, 2025

I have rewritten the integration test to make it more efficient and easier to maintain.

  • Separated the standard operation tests into standard_operation.yml using include_tasks in a loop.
  • Isolated the kill test into kill_operation.yml.
  • Moved the test for the exited service into exited_operation.yml.
    • This test replaces the ansible_test.systemd file with exited_ansible.systemd.

new tree:

test/integration/targets/service_facts/
├── aliases
├── files
│   ├── ansible.systemd
│   ├── ansible_test_service.py
│   └── exited_ansible.systemd
└── tasks
    ├── exited_operation.yml
    ├── kill_operation.yml
    ├── main.yml
    ├── standard_operation.yml
    ├── subtests.yml
    ├── systemd_cleanup.yml
    ├── systemd_setup.yml
    └── tests.yml

@Akasurde Akasurde requested a review from bcoca February 4, 2025 20:10
@Akasurde
Member

Akasurde commented Feb 4, 2025

@NomakCooper While you are here, can you please check #76667 changes? Thanks.

@NomakCooper
Author

@Akasurde I reviewed PR #76667, and based on James Livulpi's description, it seems similar to my original idea regarding the state value.

Setting aside the fact that the PR is quite old (it references an outdated version of service_facts and would require a complete rebase), the proposed change could be risky: it removes the default values for state and status and instead maps each value directly to the corresponding LOAD/ACTIVE/SUB columns.

While the core idea is valid, I initially considered this approach as well.
However, @felixfontein rightly pointed out that it would break any playbook or role currently relying on the state and status fields.

I quickly tested my updated version of the module in my test environment (CentOS 9 - Ansible Core 2.18.2) with the auto-cpufreq.service, which is currently in a "bad" state because its LOAD value is not-found.

# systemctl list-units auto-cpufreq.service --no-pager --type service --all --plain
UNIT                 LOAD      ACTIVE   SUB  DESCRIPTION
auto-cpufreq.service not-found inactive dead auto-cpufreq.service

The module reports the following values:

        {
            "name": "auto-cpufreq.service",
            "source": "systemd",
            "state": "stopped",
            "status": "not-found",
            "sub_state": "dead"
        }

Specifically, since the service is in a "bad" state due to not-found, the module replaces the status value, which would otherwise be inactive, with the LOAD value, not-found.

When testing the module with a failed service like mcelog.service, it reports the following values:

# systemctl list-units mcelog.service --no-pager --type service --all --plain
UNIT           LOAD   ACTIVE SUB    DESCRIPTION
mcelog.service loaded failed failed Machine Check Exception Logging Daemon
        {
            "name": "mcelog.service",
            "source": "systemd",
            "state": "stopped",
            "status": "failed",
            "sub_state": "failed"
        }

The state field correctly remains stopped, while sub_state and status properly reflect failed.

The current version of service_facts may be too simplistic. However, with the new sub_state field, we can enhance it without breaking existing playbooks or roles by simply introducing an additional value.

Systemd handles more state/status values than other service sources, and in the future, it might be worth considering a complete rewrite of its dedicated class. However, making drastic changes to long-standing values without proper warnings could lead to significant issues for users.
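The conservative mapping described above can be summarised in a small sketch; map_service() and its parameters are invented for this illustration and reflect the behaviour observed in the tests, not the shipped implementation:

```python
# Illustrative sketch of the conservative mapping discussed above;
# map_service() and its arguments are invented for this example.
def map_service(load, sub, unit_file_status):
    # `state` keeps its legacy two-value behaviour.
    state = "running" if sub == "running" else "stopped"
    # When the unit failed to load, LOAD replaces the usual status.
    status = load if load != "loaded" else unit_file_status
    # `sub_state` exposes the raw SUB value from systemctl.
    return {"state": state, "status": status, "sub_state": sub}

print(map_service("not-found", "dead", "inactive"))
# {'state': 'stopped', 'status': 'not-found', 'sub_state': 'dead'}
print(map_service("loaded", "failed", "failed"))
# {'state': 'stopped', 'status': 'failed', 'sub_state': 'failed'}
```

Existing consumers of state and status see no change, while sub_state adds the systemd-specific detail.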

@NomakCooper NomakCooper requested a review from Akasurde February 6, 2025 13:32
@webknjaz
Member

webknjaz commented Feb 7, 2025

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@ansibot ansibot added needs_revision This PR fails CI tests or a maintainer has requested a review/revision of the PR. and removed needs_revision This PR fails CI tests or a maintainer has requested a review/revision of the PR. labels Feb 7, 2025
@NomakCooper
Copy link
Copy Markdown
Author

I made a new change to the integration test tasks.

Starting a service that immediately enters the exited state is not typical behavior.

It turns out that the integration test was failing intermittently, and only on RHEL 9.5 with Python 3.12, because restarting an exited service would get stuck during the restart process.

To prevent this, I added the TimeoutStartSec parameter to the service, setting a 30-second startup timeout (the first time, it was stuck for about 50 minutes).

To ensure the service doesn’t cause further issues, I added a task in exited_operation.yml to disable the service, as it would otherwise be in a failed state.

I also re-enable the service immediately after the daemon reload, so instead of a restart, it performs a simple start, bringing it back to the exited state.

@ansibot ansibot added needs_revision This PR fails CI tests or a maintainer has requested a review/revision of the PR. and removed needs_revision This PR fails CI tests or a maintainer has requested a review/revision of the PR. labels Feb 10, 2025
@NomakCooper
Copy link
Copy Markdown
Author

I had to modify the exited_ansible.systemd unit file once again because, apparently, on RHEL 9.5 with Python 3.12, it fails to handle operations on the exited service in the exited_operation.yml test after running kill_operation.yml.

I had to add a service cleanup step after the failed tests in kill_operation.yml via systemctl reset-failed.

I had to change the ExecStart parameter to /bin/true because starting the service normally caused an unexpected delay, leading to a failed startup and, consequently, a failed test.

In my opinion, the integration tests for failed and exited should be removed once the correct functionality of the sub_state value has been confirmed.
Keeping them could cause unintended issues in future PRs for the service_facts module.

@ansibot ansibot added needs_rebase https://docs.ansible.com/ansible/devel/dev_guide/developing_rebasing.html stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested. labels Feb 24, 2025
@ansibot ansibot added the stale_pr This PR has not been pushed to for more than one year. label Feb 12, 2026


Development

Successfully merging this pull request may close these issues.

ansible.builtin.service_facts incorrect and simplified state

6 participants