timeout and retry do not work together #881
@sourcedelica I'm not sure this is a bug. The timeout applies per task execution. All three attempts are made for your first test case since each lasts less time than the timeout. The second test fails on the first attempt since it takes longer than the timeout. If a task takes a minimum amount of time to complete, I could see why someone might want the full duration of the |
What is a task execution? One task or one attempt of a set of retries? I would expect either My tests showed that both did not happen. How would you solve the problem: downloading large files over a flaky connection. We want to retry failed downloads but we don't want hung downloads to hang the whole playbook, so we want to be able to time them out. With the current behavior you can't do this with retries+timeout. |
The behavior is b, the timeout fails an individual attempt.
A task timeout stops the retry process. b) timed out on the first attempt, and because of this it was not retried. This is consistent with the behavior for a) - each attempt is less than the timeout duration, so it reached the max number of retries. |
If a task timeout stops the retry process, it sounds like the bug is (a) then. It should have timed out the task at 15 seconds but did not. It appears that the retry restarts the timeout timer.
@sourcedelica It doesn't seem like a bug in my opinion. It seems like you want the sequence of events to be:
- timeout is 15 seconds
- task execution 1 takes 10 seconds and doesn't exceed execution timeout
- timeout is calculated from the duration of the last task attempt
- timeout is 5 seconds
- task execution 2 takes 5 seconds and times out
- task execution 3 does not occur

But the behavior is:
- timeout is 15 seconds
- task execution 1 takes 10 seconds and doesn't exceed execution timeout
- task execution 2 takes 10 seconds and doesn't exceed execution timeout
- task execution 3 takes 10 seconds and doesn't exceed execution timeout

It's not obvious to me which is the better behavior; they are just different expectations. Playbooks are likely relying on the current design. If a task is known to take a minimum duration and the timeout accounts for that, subtracting the previous attempt's duration for the next attempt could make that attempt pointless, so those tasks may need a longer timeout. But if the timeout is used to prevent a hang, it will take longer to detect that. My point, though, is that a) and b) are internally consistent. The timeout and retry work predictably together, just not how you expected.
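The two sequences of events above can be compared with a small plain-Python model. This is only a sketch of the semantics discussed in this thread, not Ansible's implementation: the function names are hypothetical, and it assumes every attempt that finishes in time fails on its own, so the retry loop keeps going.

```python
def fresh_timer_per_attempt(durations, timeout):
    """Model of the behavior described as current: every attempt gets the
    full timeout, and an attempt that times out ends the retry loop."""
    outcomes = []
    for needed in durations:
        if needed > timeout:
            outcomes.append("timed out")
            break
        outcomes.append("failed")
    return outcomes


def shared_timer_across_attempts(durations, timeout):
    """Model of the other expectation: one timeout budget is shared by all
    attempts, so each attempt only gets what earlier attempts left over."""
    remaining = timeout
    outcomes = []
    for needed in durations:
        if needed > remaining:
            outcomes.append("timed out")
            break
        outcomes.append("failed")
        remaining -= needed
    return outcomes


# Three 10-second attempts with a 15-second timeout.
print(fresh_timer_per_attempt([10, 10, 10], 15))
print(shared_timer_across_attempts([10, 10, 10], 15))
# Three 10-second attempts with a 5-second timeout.
print(fresh_timer_per_attempt([10, 10, 10], 5))
```

Under the first model all three attempts run with a 15-second timeout, while under the second the task stops during the second attempt. With a 5-second timeout the first model stops on the first attempt, which matches what the thread reports for the second test.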
The 'timeout' keyword was designed to affect the execution of the task code, not the templating, looping and retries. |
Besides the first interpretation I would also accept my second interpretation:
Right now it's not treating a timeout like a failure which would trigger a retry. A timeout should be considered a failure. Obviously I'm not expecting that specific scenario. What I'm expecting instead is something like:
But this is not possible due to the bug that a timeout is not considered a failure and retried after the first execution. |
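The behavior being asked for here, where a timed-out attempt counts as a failure and is retried, can be modeled in plain Python. This is a sketch under stated assumptions, not Ansible code: the function name and the `(seconds_needed, would_succeed)` tuples are hypothetical and only illustrate the requested semantics.

```python
def retry_treating_timeout_as_failure(attempts, timeout):
    """Model of the requested behavior: each attempt gets the full timeout,
    and a timed-out attempt is retried like any other failed attempt.

    `attempts` is a list of (seconds_needed, would_succeed) tuples, one per
    allowed attempt (hypothetical input format for this illustration).
    """
    outcomes = []
    for seconds_needed, would_succeed in attempts:
        if seconds_needed > timeout:
            # A hung attempt is cut off, then the next attempt is tried.
            outcomes.append("timed out")
            continue
        if would_succeed:
            outcomes.append("ok")
            break
        outcomes.append("failed")
    return outcomes


# A hung download, then a quick failure, then a successful download.
print(retry_treating_timeout_as_failure(
    [(60, True), (3, False), (4, True), (2, True)], timeout=5))
```

In this model the hung first attempt does not end the task; the fourth attempt is never needed because the third one succeeds.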
It is considered a failure; it is just not a 'task failure' and is closer to a connection/unreachable failure. Currently it is seen as the task itself not having failed, but something in the system/network having failed, so the timeout was triggered. I can see an argument to change this status and consider it a task failure (it didn't execute in the allocated time), as it does change the task status to such. We make such distinctions because 'task failing due to logic inherent to the task' and 'task failing due to outside circumstances' have very different recovery and response requirements.
This is something that could definitely be explained better in the documentation, most likely where we cover playbook keywords. I'll move this to the docs repo so the issue can be addressed there. |
Estimated effort: L (MUNI tech writers) |
@oraNod Can I please be assigned this issue? Thank you! |
Welcome @sourn00dl I've assigned the issue to you. Please feel free to ask any questions here or join us in the docs channel on Matrix. |
Summary
Using the timeout and retry keywords together does not work as expected.

A typical use case for this would be downloading large files over a flaky connection. We want to retry failed downloads but we don't want hung downloads to hang the whole playbook, so we want to be able to time them out.
Issue Type
Bug Report
Component Name
core
Ansible Version
Configuration
OS / Environment
Ubuntu 20.04
Steps to Reproduce
The first test is to see if a timeout of 15 seconds will fail a retry loop of 3 times 10 seconds
The second test is to see if a timeout of 5 seconds would fail the individual tries
Expected Results
I would expect either
Or
Actual Results
If the timeout doesn't work to fail the overall task while it is retrying then I would expect that it would fail an individual try. Instead it fails the whole task.
The workaround for this is, instead of using retries, to run the task multiple times, like:

This is OK for 2 retries, but any more would get ridiculous :) Also, when there are hundreds of hosts it translates into a ton of skipping, which can be distracting.
There's no way to do "loop until condition" in Ansible without using retries, unfortunately.