AnsibleModule.run_command() scheduling race #51393
Files identified in the description: If these files are inaccurate, please update the
The existing behavior appears to be due to coping with the possibility that a child process starts a long-lived child of its own that keeps the pipes open, so testing for pipe disconnect is not by itself a reliable way to exit the loop.
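That grandchild scenario is easy to reproduce directly. A sketch (assuming a POSIX `sh`): the shell exits almost immediately, but a backgrounded `sleep` inherits the write end of the pipe, so EOF arrives long after the direct child has died.

```python
import subprocess
import time

# The shell exits right after `echo`, but the backgrounded `sleep`
# inherits its stdout, keeping the write end of the pipe open.
proc = subprocess.Popen(["sh", "-c", "sleep 3 & echo hi"],
                        stdout=subprocess.PIPE)
t0 = time.time()
proc.wait()                      # direct child is gone quickly
wait_elapsed = time.time() - t0

out = proc.stdout.read()         # blocks until the grandchild exits (pipe EOF)
read_elapsed = time.time() - t0
print(wait_elapsed, read_elapsed, out)
```

So "pipe still open" does not imply "child still running", which is why the exit condition cannot rely on pipe disconnect alone.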
Rough patch in https://github.com/dw/ansible/tree/issue51393
similar first attempt to fix:

```diff
index 007cb32a33..4406c80482 100644
--- a/lib/ansible/module_utils/basic.py
+++ b/lib/ansible/module_utils/basic.py
@@ -2833,8 +2833,14 @@ class AnsibleModule(object):
             cmd.stdin.write(data)
             cmd.stdin.close()

+        read_once = False
         while True:
             rfds, wfds, efds = select.select(rpipes, [], rpipes, 1)
+            if not read_once:
+                if rfds:
+                    read_once = True
+                else:
+                    continue
             stdout += self._read_from_pipes(rpipes, rfds, cmd.stdout)
             stderr += self._read_from_pipes(rpipes, rfds, cmd.stderr)
             # if we're checking for prompts, do it now
```
I'm not sure it's possible to avoid a timeout-second delay when the pipes are held open; kind of annoying.
read() is not guaranteed to empty the buffer in one call, even given a suitably sized buffer.
If the process is trying to write more than a pipe buffer's worth of data, it won't be able to exit. The problem only manifests when the written output fits entirely in the buffer, and we need to be sure the buffer is fully drained (a 9 KiB read in one iteration isn't enough if the buffer could be up to 64 KiB).
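One way to guarantee a full drain without counting iterations is to switch the descriptor to non-blocking and read until EAGAIN. A sketch, not the patch itself; `drain` is a hypothetical helper, not part of basic.py:

```python
import errno
import fcntl
import os

def drain(fd, chunk=9000):
    """Read everything currently buffered on fd, however many reads it takes."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)
    buf = b''
    while True:
        try:
            piece = os.read(fd, chunk)
        except OSError as exc:
            if exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
                break  # kernel buffer is empty right now
            raise
        if not piece:
            break  # EOF: all writers have closed
        buf += piece
    return buf
```

This sidesteps the buffer-size question entirely: the loop stops when the kernel reports the buffer empty, not after a fixed read count.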
Pushed another patch, but still not happy with it.
I don't like the '10 iterations' bit, but there is no guarantee read() will return the entire buffer in one call. On Linux I think it's true that read() on a pipe will return the full buffer given a large enough userspace buffer, but that may not be true on other OSes, and it's not guaranteed by the specification. The os.read() should also be wrapped in an EINTR try/except for Python 2 (for example, I believe SIGWINCH could be delivered during connection:local), but I haven't done that yet. [edit: the select.select() needs an EINTR wrapper too]
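For reference, a minimal sketch of the kind of EINTR wrapper being described. Python 3.5+ retries interrupted syscalls automatically (PEP 475), so this mainly matters on Python 2; `eintr_retry` is an illustrative name, not an Ansible API:

```python
import errno
import select

def eintr_retry(func, *args):
    """Call func(*args), retrying if the call is interrupted by a signal."""
    while True:
        try:
            return func(*args)
        except (OSError, select.error) as exc:
            # On Python 2, select.error carries errno in args[0],
            # not in an .errno attribute.
            err = getattr(exc, 'errno', None)
            if err is None and exc.args:
                err = exc.args[0]
            if err != errno.EINTR:
                raise
```

Usage would look like `rfds, wfds, efds = eintr_retry(select.select, rpipes, [], rpipes, 1)` and similarly for the `os.read()` calls.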
don't assume linux buffer size, 1/2 of my test targets are using 4k buffers
Should I just keep the read size at 9000? That looks like a made-up number, and tweaking it is fairly orthogonal to the rest of the patch, except that if the pipe buffer is legitimately full, 8 loop iterations of 9000-byte reads are needed to empty a 64 KiB buffer, so bumping exit_loops to 20 or 30 might be wise in that case. A better solution would not require the exit_loops variable at all, but I'm out of ideas :)
i was looking at dynamically querying the size via
The PIPE_BUF constant (maybe available via sysconf) is what you're after, but per above it's inaccurate on Linux, since Linux does not pick a constant. Sorry, someone talking in my ear :) "per above" doesn't actually appear above. The Linux pipe buffer size varies according to admin sysctl configurables, memory pressure, and whether or not the process has CAP_SYS_ADMIN, and it varies dynamically across the process lifetime, though I think it's at least fixed per pipe lifetime.
So this seems to work across most systems
I've pushed another copy that includes the EINTR handling. It's easy to hit this while running Ansible under a profiler, and I've witnessed it in the past. The script below demonstrates:

```python
import time
import os
import subprocess
import signal

signal.signal(signal.SIGALRM, (lambda *_: None))
signal.setitimer(signal.ITIMER_REAL, 100/1e6, 100/1e6)

proc = subprocess.Popen(['cat', '/dev/zero'], stdout=subprocess.PIPE)
while True:
    os.read(proc.stdout.fileno(), 1)
```
I'm happy to add tests and turn this into a PR, but there should be clarity on what the desired behavior is. Fixing one data loss with another, or breaking something elsewhere, would suck :) It boils down to having no way to know whether the child or its noisy grandchild is responsible for continued output.

I like bcoca's idea: if the pipe buffer size can be portably determined, stop reading after we know for sure at least a full buffer's worth has been emptied, or a select timeout occurs. That is more principled than the heuristic in my patch, but if the size cannot be determined everywhere, a heuristic might be necessary.

If you are happy for me to continue working on a patch, is it okay to fix the EINTR stuff in the same change? At some stage I was restarting Ansible constantly as a run would fail under profiling due to it, though I can't remember which profiler caused this. (I guess it hardly matters, as profilers aren't the only source of signals.)

This has been present for >4 years (5d0bb33). I looked for related issues but couldn't find any with obvious searches. It's fairly likely this is responsible for quite a few random failures.
I don't think EINTR is an issue here, since we are running one-off scripts and they should really avoid using signals; this is not affecting the ansible runtime itself.

As for the race condition, I doubt many have hit this, since it would require a very particular load on the system when ansible executes the module. It is still a bug and I think we can fix it; either your code or mine seems to address the issue. I do prefer a simpler approach, but I would bring more people in to discuss it.

As for the buffer size, I've tested on several platforms and it seems pretty consistent. I don't have access to 'everything', but I consider we cover 95% of what users should encounter and can
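That cross-platform consistency can be checked empirically rather than trusting constants: fill a non-blocking pipe until the kernel refuses more. A sketch; `pipe_capacity` is a hypothetical helper, not part of the patch:

```python
import errno
import fcntl
import os

def pipe_capacity():
    """Measure how many bytes a fresh pipe accepts before it would block."""
    r, w = os.pipe()
    flags = fcntl.fcntl(w, fcntl.F_GETFL)
    fcntl.fcntl(w, fcntl.F_SETFL, flags | os.O_NONBLOCK)
    total = 0
    try:
        while True:
            # 512-byte chunks stay within the POSIX atomic-write minimum,
            # so each write either fully succeeds or raises EAGAIN.
            total += os.write(w, b'\0' * 512)
    except OSError as exc:
        if exc.errno not in (errno.EAGAIN, errno.EWOULDBLOCK):
            raise
    os.close(r)
    os.close(w)
    return total
```

On a default Linux configuration this typically reports 65536, but as noted above it can vary with sysctl settings and memory pressure.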
* Try to get correct buffer size to avoid races; fixes ansible#51393
* fix test, mock buffer function since all is mocked
SUMMARY
If select() indicates no readable descriptors, _read_from_pipes will not attempt to read from the associated descriptor. If the child then writes its output and exits before the proc.poll() test runs, that test will succeed and the input loop will exit without the output ever being read.
ISSUE TYPE
COMPONENT NAME
lib/ansible/module_utils/basic.py
ANSIBLE VERSION
devel
OS / ENVIRONMENT
Ubuntu 18.10
STEPS TO REPRODUCE
Modify run_command():
time.sleep(2)
immediately following the select to simulate scheduler uncertainty.
Run playbook:
Observe stdout is absent in output:
EXPECTED RESULTS
"stdout" key contains "hi"
ACTUAL RESULTS
"stdout" key is empty due to race