Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

timeout and retry do not work together #881

Closed
1 task done
sourcedelica opened this issue Sep 18, 2022 · 13 comments
Closed
1 task done

timeout and retry do not work together #881

sourcedelica opened this issue Sep 18, 2022 · 13 comments
Assignees
Labels
bug Something isn't working MUNI tech writers

Comments

@sourcedelica
Copy link

sourcedelica commented Sep 18, 2022

Summary

Using the timeout and retry keywords together do not work as expected.

A typical use case for this would be downloading large files over a flaky connection. We want to retry failed downloads but we don't want hung downloads to hang the whole playbook, so we want to be able to time them out.

Issue Type

Bug Report

Component Name

core

Ansible Version

$ ansible --version
ansible [core 2.13.4]
  config file = /home/eric/work/TradeSystem/target/ansible.cfg
  configured module search path = ['/home/eric/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/eric/.local/lib/python3.8/site-packages/ansible
  ansible collection location = /home/eric/.ansible/collections:/usr/share/ansible/collections
  executable location = /home/eric/.local/bin/ansible
  python version = 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]
  jinja version = 3.1.2
  libyaml = True

Configuration

# if using a version older than ansible-core 2.12 you should omit the '-t all'
$ ansible-config dump --only-changed -t all
INTERPRETER_PYTHON(/etc/ansible/ansible.cfg) = auto
LOCALHOST_WARNING(/etc/ansible/ansible.cfg) = False

OS / Environment

Ubuntu 20.04

Steps to Reproduce

The first test is to see if a timeout of 15 seconds will fail a retry loop of 3 times 10 seconds

- hosts: localhost
  connection: local
  gather_facts: no
  tasks:
    - shell:
        cmd: sleep 10; false
      register: res
      timeout: 15
      retries: 2
      until: res is not failed

The second test is to see if a timeout of 5 seconds would fail the individual tries

- hosts: localhost
  connection: local
  gather_facts: no
  tasks:
    - shell:
        cmd: sleep 10; false
      register: res
      timeout: 5
      retries: 2
      until: res is not failed

Expected Results

I would expect either

  • The timeout would fail the whole task if, during the retry loop, the timeout was exceeded

Or

  • The timeout would fail an individual "try" of the retry loop

Actual Results

In the first test I would expect that the timeout would fail the task at 15 seconds, but the retry loop continues for 30 seconds:

PLAY [localhost] *************************************************************************************************************************************************

TASK [shell] *****************************************************************************************************************************************************
FAILED - RETRYING: [localhost]: shell (2 retries left).
FAILED - RETRYING: [localhost]: shell (1 retries left).
fatal: [localhost]: FAILED! => {"attempts": 2, "changed": true, "cmd": "sleep 10; false", "delta": "0:00:10.011513", "end": "2022-09-18 12:50:19.531123", "msg": "non-zero return code", "rc": 1, "start": "2022-09-18 12:50:09.519610", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

PLAY RECAP *******************************************************************************************************************************************************
localhost                  : ok=0    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0

If the timeout doesn't work to fail the overall task while it is retrying then I would expect that it would fail an individual try. Instead it fails the whole task.

PLAY [localhost] *************************************************************************************************************************************************

TASK [shell] *****************************************************************************************************************************************************
fatal: [localhost]: FAILED! => {"changed": false, "msg": "The shell action failed to execute in the expected time frame (5) and was terminated"}

PLAY RECAP *******************************************************************************************************************************************************
localhost                  : ok=0    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0

The workaround for this is, instead of using retries, we have to run the task multiple times, like:

  tasks:
    - include_tasks: download_installer.yml

    - include_tasks: download_installer.yml
      when: win_get_result is failed

    - include_tasks: download_installer.yml
      when: win_get_result is failed

    - fail:
        msg: "Download failed after 3 attempts"
      when: win_get_result is failed

This is ok for 2 retries but any more would get ridiculous :) Also when there are hundreds of hosts it translates in to a ton of skipping which can be distracting.

There's no way to do "loop until condition" in Ansible without using retries unfortunately.

Code of Conduct

  • I agree to follow the Ansible Code of Conduct
@ansibot
Copy link
Contributor

ansibot commented Sep 18, 2022

Files identified in the description:
None

If these files are incorrect, please update the component name section of the description or use the !component bot command.

click here for bot help

@ansibot ansibot added bug Something isn't working needs_triage Needs a first human triage before being processed. labels Sep 18, 2022
@mkrizek mkrizek removed the needs_triage Needs a first human triage before being processed. label Sep 20, 2022
@s-hertel
Copy link
Contributor

s-hertel commented Oct 4, 2022

@sourcedelica I'm not sure this is a bug. The timeout applies per task execution. All three attempts are made for your first test case since each lasts less time than the timeout. The second test fails on the first attempt since it takes longer than the timeout.

If a task takes a minimum amount of time to complete, I could see why someone might want the full duration of the timeout to apply to each retry. Maybe we just need to document this behavior so it's clear timeout applies per attempt rather than the total amount of time spent on a task.

@sourcedelica
Copy link
Author

sourcedelica commented Oct 4, 2022

What is a task execution? One task or one attempt of a set of retries?

I would expect either
a. The timeout would fail the whole task if, during the retry loop, the timeout was exceeded
Or
b. The timeout would fail an individual "try" of the retry loop

My tests showed that both did not happen.
a. It never timed out - the max number of retries was reached
b. It timed out the first try but it never retried

How would you solve the problem: downloading large files over a flaky connection. We want to retry failed downloads but we don't want hung downloads to hang the whole playbook, so we want to be able to time them out. With the current behavior you can't do this with retries+timeout.

@s-hertel
Copy link
Contributor

s-hertel commented Oct 4, 2022

The behavior is b, the timeout fails an individual attempt.

My tests showed that both did not happen.
a. It never timed out - the max number of retries was reached
b. It timed out the first try but it never retried

A task timeout stops the retry process. b) timed out on the first attempt, and because of this it was not retried. This is consistent with the behavior for a) - each attempt is less than the timeout duration, so it reached the max number of retries.

@sourcedelica
Copy link
Author

If a task timeout stops the retry process it sounds like the bug is (a) then. If should have timeout out the task at 15 seconds but did not. It appears that the retry restarts the timeout timer.

@s-hertel
Copy link
Contributor

s-hertel commented Oct 5, 2022

@sourcedelica It doesn't seem like a bug in my opinion. It seems like you want the sequence of events to be:

- timeout is 15 seconds
- task execution 1 takes 10 seconds and doesn't exceed execution timeout
- timeout is calculated from the duration of the last task attempt
- timeout is 5 seconds
- task execution 2 takes 5 seconds and timed out
- task execution 3 does not occur

But the behavior is

- timeout is 15 seconds
- task execution 1 takes 10 seconds and doesn't exceed execution timeout
- task execution 2 takes 10 seconds and doesn't exceed execution timeout
- task execution 3 takes 10 seconds and doesn't exceed execution timeout

It's not obviously to me which is the better behavior, they are just different expectations.

Playbooks are likely relying on the current design. If a task is known to take a minimum duration and the timeout accounts for that, subtracting the previous attempt duration for the next attempt could make the attempt pointless. So those tasks may need a longer timeout. But if the timeout is used to prevent a hang it will take longer to detect that.

My point though is a) and b) are internally consistent. The timeout and retry work predictably together but just not how you expected.

@bcoca
Copy link
Member

bcoca commented Oct 5, 2022

The 'timeout' keyword was designed to affect the execution of the task code, not the templating, looping and retries.

@sourcedelica
Copy link
Author

Besides the first interpretation I would also accept my second interpretation:

  • Timeout is 5 seconds
  • Task execution 1 takes 10 seconds and is failed by the timeout
  • Task execution 2 takes 10 seconds and is failed by the timeout
  • Task execution 3 takes 10 seconds and is failed by the timeout
  • Max retries reached so the overall task fails

Right now it's not treating a timeout like a failure which would trigger a retry. A timeout should be considered a failure.

Obviously I'm not expecting that specific scenario. What I'm expecting instead is something like:

  • Timeout is 5 seconds
  • Task execution 1 takes 10 seconds and is failed by the timeout
  • Task execution 2 takes 10 seconds and is failed by the timeout
  • Task execution 3 succeeds

But this is not possible due to the bug that a timeout is not considered a failure and retried after the first execution.

@bcoca
Copy link
Member

bcoca commented Oct 6, 2022

It is considered a failure, it is just not a 'task failure' and closer to a connection/unreachable failure.

Currently it is seen as the task itself has not failed but something in the system/network has, so the timeout was triggered.

I can see an argument to change this status and consider it a task failure (it didn't execute in the allocated time), as it does change the task status to such.

We make such distinctions because recovery on 'task failing due to task inherit logic' and 'task failing due to outside circumstances' have very different recovery and response requirements.

@mattclay
Copy link
Member

This is something that could definitely be explained better in the documentation, most likely where we cover playbook keywords. I'll move this to the docs repo so the issue can be addressed there.

@mattclay mattclay transferred this issue from ansible/ansible Nov 29, 2023
@oraNod
Copy link
Contributor

oraNod commented Mar 20, 2024

Estimated effort: L (MUNI tech writers)

@sourn00dl
Copy link

@oraNod Can I please be assigned this issue? Thank you!

@oraNod
Copy link
Contributor

oraNod commented Apr 3, 2024

Welcome @sourn00dl I've assigned the issue to you. Please feel free to ask any questions here or join us in the docs channel on Matrix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working MUNI tech writers
Projects
Status: Done
Development

No branches or pull requests

8 participants