Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wait_for: treat broken connections as "unready" #28839

Merged
merged 2 commits into from
Oct 12, 2017

Conversation

sethp-nr
Copy link
Contributor

SUMMARY

We have observed the following condition while waiting for hosts:

Traceback (most recent call last):
  File "/var/folders/f8/23xp00654plcv2b2tcc028680000gn/T/ansible_8hxm4_/ansible_module_wait_for.py", line 585, in <module>
    main()
  File "/var/folders/f8/23xp00654plcv2b2tcc028680000gn/T/ansible_8hxm4_/ansible_module_wait_for.py", line 535, in main
    s.shutdown(socket.SHUT_RDWR)
  File "/usr/local/opt/python/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
socket.error: [Errno 57] Socket is not connected

This appears to happen while the host is still starting; we believe something is
accepting our connection but immediately resetting it. In these cases, we'd
prefer to continue waiting instead of immediately failing the play.

This patch has been applied locally for some time, and we have seen no adverse
effects.

ISSUE TYPE
  • Bugfix Pull Request
COMPONENT NAME

wait_for module

ANSIBLE VERSION
ansible 2.3.1.0
  config file =
  configured module search path = Default w/o overrides
  python version = 2.7.13 (default, Apr  4 2017, 08:47:57) [GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.38)]

We have observed the following condition while waiting for hosts:

```
Traceback (most recent call last):
  File "/var/folders/f8/23xp00654plcv2b2tcc028680000gn/T/ansible_8hxm4_/ansible_module_wait_for.py", line 585, in <module>
    main()
  File "/var/folders/f8/23xp00654plcv2b2tcc028680000gn/T/ansible_8hxm4_/ansible_module_wait_for.py", line 535, in main
    s.shutdown(socket.SHUT_RDWR)
  File "/usr/local/opt/python/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
socket.error: [Errno 57] Socket is not connected
```

This appears to happen while the host is still starting; we believe something is
accepting our connection but immediately resetting it. In these cases, we'd
prefer to continue waiting instead of immediately failing the play.

This patch has been applied locally for some time, and we have seen no adverse
effects.
@ansibot
Copy link
Contributor

ansibot commented Aug 30, 2017

@ansibot ansibot added affects_2.4 This issue/PR affects Ansible v2.4 bugfix_pull_request core_review In order to be merged, this PR must follow the core review workflow. module This issue/PR relates to a module. needs_triage Needs a first human triage before being processed. support:core This issue/PR relates to code supported by the Ansible Engineering Team. ci_verified Changes made in this PR are causing tests to fail. needs_revision This PR fails CI tests or a maintainer has requested a review/revision of the PR. and removed core_review In order to be merged, this PR must follow the core review workflow. labels Aug 30, 2017
We were missing an import and a space after the `#`
@ansibot ansibot removed the ci_verified Changes made in this PR are causing tests to fail. label Aug 30, 2017
@jborean93 jborean93 removed the needs_triage Needs a first human triage before being processed. label Aug 30, 2017
@dagwieers
Copy link
Contributor

dagwieers commented Aug 30, 2017

Is the wait_for_connection action plugin not what you need ?

@ansibot ansibot added core_review In order to be merged, this PR must follow the core review workflow. and removed needs_revision This PR fails CI tests or a maintainer has requested a review/revision of the PR. labels Aug 30, 2017
@sethp-nr
Copy link
Contributor Author

Ah! That's a neat new addition in 2.3. I'll take a look at that action, but I think still need this change for at least the case where we're restarting into a system that doesn't have python installed yet, so ping will never run successfully.

@dagwieers
Copy link
Contributor

@sethp-nr From your description it isn't clear what you are testing or what connection you are talking about.

@sethp-nr
Copy link
Contributor Author

Sorry for the confusion, we've not managed to isolate the ultimate cause but the general flow is as follows:

  1. Our ansible play targets a host and makes a change that requires reboot
  2. We trigger a shutdown, and use wait_for to await the system's restart
  3. The play continues on the rebooted host

The issue this patch addresses is that often step 2 fails (spuriously) because the shutdown call in the module races with packets from the remote host that put the connection in the state that Darwin throws errno 59. Since that happens to say "we can't close this connection because it's not connected", it seems appropriate to say "ok, thanks Darwin, no problem" and continue on rather than raising that transient error and failing the play.

For the common case, we can migrate to using the (relatively new) wait_for_connection, but we'll need to continue to use wait_for in the case that step 1 is "install a new Core OS image" and step 2 is "reboot into that install'd OS". Core OS doesn't ship with python installed, so we need to run a bootstrap play with raw commands before the ping module that wait_for_connection action relies on has any chance of working.

@dagwieers
Copy link
Contributor

Right, for that case wait_for_connection was designed. Testing a TCP port is not sufficient to continue your playbook, you need to know if the transport works end-to-end.

There is an open project I am doing to implement a platform-independent reboot module that would reuse the exact same logic, but would also verify if the system was effectively rebooted. And would have some additional functionality, you can find some information at #16186

There's some preliminary code. I wanted to finish it for v2.4 but too much on my plate...

@sethp-nr
Copy link
Contributor Author

Indeed, though I'm not sure the changes here and there are mutually exclusive? Even if I was waiting for a service to become available on a port, I suspect the same race would occur between socket shutdown and a RST (or similar) from the peer. Wouldn't it make sense to consider those errors as spurious?

Further, there is at least one case that we know the transport will not work end-to-end until the play continues (because the next thing the play does is install python, which ping requires) – in that case, we must use the wait_for module and muddle through. For at least that reboot case, this change is still important.

I'm glad to hear you're working on the reboot module! That would indeed make our lives better – I read through that thread, and it wasn't clear to me how you were planning to handle our complex case (booting into a system that has SSH, but not much else). Should I leave a comment over there describing that difficulty?

@ansibot ansibot added the stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested. label Sep 8, 2017
@sethp-nr
Copy link
Contributor Author

@dagwieers Just wanted to check in – this change is still important for us in at least the case I mentioned in my last comment, and doesn't seem harmful otherwise. Any thoughts?

@ansible ansible deleted a comment from ansibot Oct 11, 2017
@dagwieers
Copy link
Contributor

@sethp-nr I am not the person who can decide if this gets merged. If you want to bring this up to the core team, feel free to add your PR to the core team's agenda and be present on the next core meeting. ansible/community#263

@sethp-nr
Copy link
Contributor Author

Ah, thank you for the pointer!

@bcoca bcoca merged commit 402b095 into ansible:devel Oct 12, 2017
@bcoca bcoca added this to TODO: Next release in 2.4.x Blocker List Oct 12, 2017
bcoca pushed a commit that referenced this pull request Oct 12, 2017
* wait_for: treat broken connections as "unready"

We have observed the following condition while waiting for hosts:

```
Traceback (most recent call last):
  File "/var/folders/f8/23xp00654plcv2b2tcc028680000gn/T/ansible_8hxm4_/ansible_module_wait_for.py", line 585, in <module>
    main()
  File "/var/folders/f8/23xp00654plcv2b2tcc028680000gn/T/ansible_8hxm4_/ansible_module_wait_for.py", line 535, in main
    s.shutdown(socket.SHUT_RDWR)
  File "/usr/local/opt/python/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
socket.error: [Errno 57] Socket is not connected
```

This appears to happen while the host is still starting; we believe something is
accepting our connection but immediately resetting it. In these cases, we'd
prefer to continue waiting instead of immediately failing the play.

This patch has been applied locally for some time, and we have seen no adverse
effects.

* wait_for: fixup change

We were missing an import and a space after the `#`

(cherry picked from commit 402b095)
BondAnthony pushed a commit to BondAnthony/ansible that referenced this pull request Oct 14, 2017
* wait_for: treat broken connections as "unready"

We have observed the following condition while waiting for hosts:

```
Traceback (most recent call last):
  File "/var/folders/f8/23xp00654plcv2b2tcc028680000gn/T/ansible_8hxm4_/ansible_module_wait_for.py", line 585, in <module>
    main()
  File "/var/folders/f8/23xp00654plcv2b2tcc028680000gn/T/ansible_8hxm4_/ansible_module_wait_for.py", line 535, in main
    s.shutdown(socket.SHUT_RDWR)
  File "/usr/local/opt/python/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
socket.error: [Errno 57] Socket is not connected
```

This appears to happen while the host is still starting; we believe something is
accepting our connection but immediately resetting it. In these cases, we'd
prefer to continue waiting instead of immediately failing the play.

This patch has been applied locally for some time, and we have seen no adverse
effects.

* wait_for: fixup change

We were missing an import and a space after the `#`
mtb-xt pushed a commit to mtb-xt/ansible that referenced this pull request Oct 15, 2017
* wait_for: treat broken connections as "unready"

We have observed the following condition while waiting for hosts:

```
Traceback (most recent call last):
  File "/var/folders/f8/23xp00654plcv2b2tcc028680000gn/T/ansible_8hxm4_/ansible_module_wait_for.py", line 585, in <module>
    main()
  File "/var/folders/f8/23xp00654plcv2b2tcc028680000gn/T/ansible_8hxm4_/ansible_module_wait_for.py", line 535, in main
    s.shutdown(socket.SHUT_RDWR)
  File "/usr/local/opt/python/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
socket.error: [Errno 57] Socket is not connected
```

This appears to happen while the host is still starting; we believe something is
accepting our connection but immediately resetting it. In these cases, we'd
prefer to continue waiting instead of immediately failing the play.

This patch has been applied locally for some time, and we have seen no adverse
effects.

* wait_for: fixup change

We were missing an import and a space after the `#`
@abadger
Copy link
Contributor

abadger commented Oct 17, 2017

cherry-picked to the temp-staging-post-2.4.1 branch for the 2.4.2 release.

@abadger abadger moved this from TODO: Next release to Done in 2.4.2 in 2.4.x Blocker List Oct 17, 2017
abadger pushed a commit that referenced this pull request Oct 17, 2017
* wait_for: treat broken connections as "unready"

We have observed the following condition while waiting for hosts:

```
Traceback (most recent call last):
  File "/var/folders/f8/23xp00654plcv2b2tcc028680000gn/T/ansible_8hxm4_/ansible_module_wait_for.py", line 585, in <module>
    main()
  File "/var/folders/f8/23xp00654plcv2b2tcc028680000gn/T/ansible_8hxm4_/ansible_module_wait_for.py", line 535, in main
    s.shutdown(socket.SHUT_RDWR)
  File "/usr/local/opt/python/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
socket.error: [Errno 57] Socket is not connected
```

This appears to happen while the host is still starting; we believe something is
accepting our connection but immediately resetting it. In these cases, we'd
prefer to continue waiting instead of immediately failing the play.

This patch has been applied locally for some time, and we have seen no adverse
effects.

* wait_for: fixup change

We were missing an import and a space after the `#`

(cherry picked from commit 402b095)
abadger pushed a commit that referenced this pull request Oct 18, 2017
* wait_for: treat broken connections as "unready"

We have observed the following condition while waiting for hosts:

```
Traceback (most recent call last):
  File "/var/folders/f8/23xp00654plcv2b2tcc028680000gn/T/ansible_8hxm4_/ansible_module_wait_for.py", line 585, in <module>
    main()
  File "/var/folders/f8/23xp00654plcv2b2tcc028680000gn/T/ansible_8hxm4_/ansible_module_wait_for.py", line 535, in main
    s.shutdown(socket.SHUT_RDWR)
  File "/usr/local/opt/python/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
socket.error: [Errno 57] Socket is not connected
```

This appears to happen while the host is still starting; we believe something is
accepting our connection but immediately resetting it. In these cases, we'd
prefer to continue waiting instead of immediately failing the play.

This patch has been applied locally for some time, and we have seen no adverse
effects.

* wait_for: fixup change

We were missing an import and a space after the `#`

(cherry picked from commit 402b095)
abadger pushed a commit that referenced this pull request Oct 26, 2017
* wait_for: treat broken connections as "unready"

We have observed the following condition while waiting for hosts:

```
Traceback (most recent call last):
  File "/var/folders/f8/23xp00654plcv2b2tcc028680000gn/T/ansible_8hxm4_/ansible_module_wait_for.py", line 585, in <module>
    main()
  File "/var/folders/f8/23xp00654plcv2b2tcc028680000gn/T/ansible_8hxm4_/ansible_module_wait_for.py", line 535, in main
    s.shutdown(socket.SHUT_RDWR)
  File "/usr/local/opt/python/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
socket.error: [Errno 57] Socket is not connected
```

This appears to happen while the host is still starting; we believe something is
accepting our connection but immediately resetting it. In these cases, we'd
prefer to continue waiting instead of immediately failing the play.

This patch has been applied locally for some time, and we have seen no adverse
effects.

* wait_for: fixup change

We were missing an import and a space after the `#`

(cherry picked from commit 402b095)
@ansibot ansibot added bug This issue/PR relates to a bug. and removed bugfix_pull_request labels Mar 6, 2018
@ansible ansible locked and limited conversation to collaborators Apr 26, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
affects_2.4 This issue/PR affects Ansible v2.4 bug This issue/PR relates to a bug. core_review In order to be merged, this PR must follow the core review workflow. module This issue/PR relates to a module. stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested. support:core This issue/PR relates to code supported by the Ansible Engineering Team.
Projects
No open projects
2.4.x Blocker List
Done in 2.4.2
Development

Successfully merging this pull request may close these issues.

None yet

6 participants