timeout and retry do not work together #881

sourcedelica · 2022-09-18T17:06:14Z

Summary

Using the timeout and retry keywords together do not work as expected.

A typical use case for this would be downloading large files over a flaky connection. We want to retry failed downloads but we don't want hung downloads to hang the whole playbook, so we want to be able to time them out.

Issue Type

Bug Report

Component Name

core

Ansible Version

$ ansible --version
ansible [core 2.13.4]
  config file = /home/eric/work/TradeSystem/target/ansible.cfg
  configured module search path = ['/home/eric/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/eric/.local/lib/python3.8/site-packages/ansible
  ansible collection location = /home/eric/.ansible/collections:/usr/share/ansible/collections
  executable location = /home/eric/.local/bin/ansible
  python version = 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]
  jinja version = 3.1.2
  libyaml = True

Configuration

# if using a version older than ansible-core 2.12 you should omit the '-t all'
$ ansible-config dump --only-changed -t all
INTERPRETER_PYTHON(/etc/ansible/ansible.cfg) = auto
LOCALHOST_WARNING(/etc/ansible/ansible.cfg) = False

OS / Environment

Ubuntu 20.04

Steps to Reproduce

The first test is to see if a timeout of 15 seconds will fail a retry loop of 3 times 10 seconds

- hosts: localhost
  connection: local
  gather_facts: no
  tasks:
    - shell:
        cmd: sleep 10; false
      register: res
      timeout: 15
      retries: 2
      until: res is not failed

The second test is to see if a timeout of 5 seconds would fail the individual tries

- hosts: localhost
  connection: local
  gather_facts: no
  tasks:
    - shell:
        cmd: sleep 10; false
      register: res
      timeout: 5
      retries: 2
      until: res is not failed

Expected Results

I would expect either

The timeout would fail the whole task if, during the retry loop, the timeout was exceeded

Or

The timeout would fail an individual "try" of the retry loop

Actual Results

In the first test I would expect that the timeout would fail the task at 15 seconds, but the retry loop continues for 30 seconds:

PLAY [localhost] *************************************************************************************************************************************************

TASK [shell] *****************************************************************************************************************************************************
FAILED - RETRYING: [localhost]: shell (2 retries left).
FAILED - RETRYING: [localhost]: shell (1 retries left).
fatal: [localhost]: FAILED! => {"attempts": 2, "changed": true, "cmd": "sleep 10; false", "delta": "0:00:10.011513", "end": "2022-09-18 12:50:19.531123", "msg": "non-zero return code", "rc": 1, "start": "2022-09-18 12:50:09.519610", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

PLAY RECAP *******************************************************************************************************************************************************
localhost                  : ok=0    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0

If the timeout doesn't work to fail the overall task while it is retrying then I would expect that it would fail an individual try. Instead it fails the whole task.

PLAY [localhost] *************************************************************************************************************************************************

TASK [shell] *****************************************************************************************************************************************************
fatal: [localhost]: FAILED! => {"changed": false, "msg": "The shell action failed to execute in the expected time frame (5) and was terminated"}

PLAY RECAP *******************************************************************************************************************************************************
localhost                  : ok=0    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0

The workaround for this is, instead of using retries, we have to run the task multiple times, like:

  tasks:
    - include_tasks: download_installer.yml

    - include_tasks: download_installer.yml
      when: win_get_result is failed

    - include_tasks: download_installer.yml
      when: win_get_result is failed

    - fail:
        msg: "Download failed after 3 attempts"
      when: win_get_result is failed

This is ok for 2 retries but any more would get ridiculous :) Also when there are hundreds of hosts it translates in to a ton of skipping which can be distracting.

There's no way to do "loop until condition" in Ansible without using retries unfortunately.

Code of Conduct

I agree to follow the Ansible Code of Conduct

The text was updated successfully, but these errors were encountered:

ansibot · 2022-09-18T17:11:37Z

Files identified in the description:
None

If these files are incorrect, please update the component name section of the description or use the !component bot command.

click here for bot help

s-hertel · 2022-10-04T14:55:58Z

@sourcedelica I'm not sure this is a bug. The timeout applies per task execution. All three attempts are made for your first test case since each lasts less time than the timeout. The second test fails on the first attempt since it takes longer than the timeout.

If a task takes a minimum amount of time to complete, I could see why someone might want the full duration of the timeout to apply to each retry. Maybe we just need to document this behavior so it's clear timeout applies per attempt rather than the total amount of time spent on a task.

sourcedelica · 2022-10-04T17:37:07Z

What is a task execution? One task or one attempt of a set of retries?

I would expect either
a. The timeout would fail the whole task if, during the retry loop, the timeout was exceeded
Or
b. The timeout would fail an individual "try" of the retry loop

My tests showed that both did not happen.
a. It never timed out - the max number of retries was reached
b. It timed out the first try but it never retried

How would you solve the problem: downloading large files over a flaky connection. We want to retry failed downloads but we don't want hung downloads to hang the whole playbook, so we want to be able to time them out. With the current behavior you can't do this with retries+timeout.

s-hertel · 2022-10-04T19:43:13Z

The behavior is b, the timeout fails an individual attempt.

My tests showed that both did not happen.
a. It never timed out - the max number of retries was reached
b. It timed out the first try but it never retried

A task timeout stops the retry process. b) timed out on the first attempt, and because of this it was not retried. This is consistent with the behavior for a) - each attempt is less than the timeout duration, so it reached the max number of retries.

sourcedelica · 2022-10-05T04:02:34Z

If a task timeout stops the retry process it sounds like the bug is (a) then. If should have timeout out the task at 15 seconds but did not. It appears that the retry restarts the timeout timer.

s-hertel · 2022-10-05T14:20:31Z

@sourcedelica It doesn't seem like a bug in my opinion. It seems like you want the sequence of events to be:

- timeout is 15 seconds
- task execution 1 takes 10 seconds and doesn't exceed execution timeout
- timeout is calculated from the duration of the last task attempt
- timeout is 5 seconds
- task execution 2 takes 5 seconds and timed out
- task execution 3 does not occur

But the behavior is

- timeout is 15 seconds
- task execution 1 takes 10 seconds and doesn't exceed execution timeout
- task execution 2 takes 10 seconds and doesn't exceed execution timeout
- task execution 3 takes 10 seconds and doesn't exceed execution timeout

It's not obviously to me which is the better behavior, they are just different expectations.

Playbooks are likely relying on the current design. If a task is known to take a minimum duration and the timeout accounts for that, subtracting the previous attempt duration for the next attempt could make the attempt pointless. So those tasks may need a longer timeout. But if the timeout is used to prevent a hang it will take longer to detect that.

My point though is a) and b) are internally consistent. The timeout and retry work predictably together but just not how you expected.

bcoca · 2022-10-05T14:23:49Z

The 'timeout' keyword was designed to affect the execution of the task code, not the templating, looping and retries.

sourcedelica · 2022-10-06T03:15:59Z

Besides the first interpretation I would also accept my second interpretation:

Timeout is 5 seconds
Task execution 1 takes 10 seconds and is failed by the timeout
Task execution 2 takes 10 seconds and is failed by the timeout
Task execution 3 takes 10 seconds and is failed by the timeout
Max retries reached so the overall task fails

Right now it's not treating a timeout like a failure which would trigger a retry. A timeout should be considered a failure.

Obviously I'm not expecting that specific scenario. What I'm expecting instead is something like:

Timeout is 5 seconds
Task execution 1 takes 10 seconds and is failed by the timeout
Task execution 2 takes 10 seconds and is failed by the timeout
Task execution 3 succeeds

But this is not possible due to the bug that a timeout is not considered a failure and retried after the first execution.

bcoca · 2022-10-06T03:44:55Z

It is considered a failure, it is just not a 'task failure' and closer to a connection/unreachable failure.

Currently it is seen as the task itself has not failed but something in the system/network has, so the timeout was triggered.

I can see an argument to change this status and consider it a task failure (it didn't execute in the allocated time), as it does change the task status to such.

We make such distinctions because recovery on 'task failing due to task inherit logic' and 'task failing due to outside circumstances' have very different recovery and response requirements.

mattclay · 2023-11-29T19:49:30Z

This is something that could definitely be explained better in the documentation, most likely where we cover playbook keywords. I'll move this to the docs repo so the issue can be addressed there.

oraNod · 2024-03-20T10:36:22Z

Estimated effort: L (MUNI tech writers)

sourn00dl · 2024-04-02T20:55:14Z

@oraNod Can I please be assigned this issue? Thank you!

oraNod · 2024-04-03T07:09:57Z

Welcome @sourn00dl I've assigned the issue to you. Please feel free to ask any questions here or join us in the docs channel on Matrix.

ansibot added bug Something isn't working needs_triage Needs a first human triage before being processed. labels Sep 18, 2022

mkrizek removed the needs_triage Needs a first human triage before being processed. label Sep 20, 2022

mkrizek assigned bcoca Sep 20, 2022

mattclay transferred this issue from ansible/ansible Nov 29, 2023

oraNod added the MUNI tech writers label Jan 25, 2024

oraNod unassigned bcoca Mar 20, 2024

oraNod assigned sourn00dl Apr 3, 2024

sourn00dl mentioned this issue Apr 18, 2024

Added a section on retrying tasks with timeout. #1282

Closed

sourn00dl mentioned this issue Apr 28, 2024

MUNI tech writers: timeout documentation. ansible/ansible#83152

Merged

sourn00dl mentioned this issue May 6, 2024

MUNI Tech Writers: Timeout with retry in loops documentation #1416

Merged

oraNod closed this as completed May 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

timeout and retry do not work together #881

timeout and retry do not work together #881

sourcedelica commented Sep 18, 2022 •

edited

ansibot commented Sep 18, 2022

s-hertel commented Oct 4, 2022

sourcedelica commented Oct 4, 2022 •

edited

s-hertel commented Oct 4, 2022

sourcedelica commented Oct 5, 2022

s-hertel commented Oct 5, 2022

bcoca commented Oct 5, 2022

sourcedelica commented Oct 6, 2022

bcoca commented Oct 6, 2022 •

edited

mattclay commented Nov 29, 2023

oraNod commented Mar 20, 2024

sourn00dl commented Apr 2, 2024

oraNod commented Apr 3, 2024

timeout and retry do not work together #881

timeout and retry do not work together #881

Comments

sourcedelica commented Sep 18, 2022 • edited

Summary

Issue Type

Component Name

Ansible Version

Configuration

OS / Environment

Steps to Reproduce

Expected Results

Actual Results

Code of Conduct

ansibot commented Sep 18, 2022

s-hertel commented Oct 4, 2022

sourcedelica commented Oct 4, 2022 • edited

s-hertel commented Oct 4, 2022

sourcedelica commented Oct 5, 2022

s-hertel commented Oct 5, 2022

bcoca commented Oct 5, 2022

sourcedelica commented Oct 6, 2022

bcoca commented Oct 6, 2022 • edited

mattclay commented Nov 29, 2023

oraNod commented Mar 20, 2024

sourn00dl commented Apr 2, 2024

oraNod commented Apr 3, 2024

sourcedelica commented Sep 18, 2022 •

edited

sourcedelica commented Oct 4, 2022 •

edited

bcoca commented Oct 6, 2022 •

edited