New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reset network_cli command_timeout (alarm) on data recv'd #35817

Open
wants to merge 7 commits into
base: devel
from

Conversation

@bdowling
Contributor

bdowling commented Feb 7, 2018

SUMMARY

network_cli currently depends on the command_timeout variable to set the length of time that it waits for a command to FULLY complete. But it does not take into account if the device is actually sending output still.

Why not instead reset this timer when data is received?

This PR proposes that the alarm() timer should simply be reset whenever data is received. I did this in a simple way by seeing if the handler for SIGALRM was set.

FYI, with this change, I was able to set command_timeout=1 in the config and still run a number of long-winded test plays successfully including ones that previously failed with timeout set to 10 (due to the show running all that took 23 seconds on my DUT).

ISSUE TYPE
  • Bugfix Pull Request
COMPONENT NAME

network_cli

ANSIBLE VERSION
devel
ADDITIONAL INFORMATION

Extending COMMAND_TIMEOUT is a oft recommended fix to these timeout issues, but setting it to some large value is impractical because it results in any non-responsive device taking that much extra time to abort the command.

@bdowling bdowling force-pushed the bdowling:fix_network_timeout branch Feb 7, 2018

@ansibot

This comment has been minimized.

@gundalow

This comment has been minimized.

Contributor

gundalow commented Feb 14, 2018

@bdowling Is this causing you issues at the moment, ie do you have playbooks that fail at the moment with the default timeouts?

@bdowling

This comment has been minimized.

Contributor

bdowling commented Feb 14, 2018

Yes, a very simple play using ios_config triggers this in part because it calls show running-config all. Now I may have a slightly slow test device (csr1000v) on my hands, but this should still be considered a bug. As noted, when I switch to this PR branch it completes just fine, resetting the alarm timer when data is still being received is the "right thing to do" ™️.

---

- hosts: onerouter
  gather_facts: no
  connection: network_cli

  tasks:
  - name: acl
    ios_config:
      lines:
        - access-list 99 permit 192.168.1.100

Result:

TASK [acl] **************************************************************************************************************************************************
task path: /home/bdowling/src/netdev/plays/bug-acl.yaml:8
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ansible.module_utils.connection.ConnectionError: timeout trying to send command: show running-config all
fatal: [bd-csr1000v-00]: FAILED! => {"changed": false, "failed": true, "module_stderr": "Traceback (most recent call last):\n  File \"/tmp/ansible_LMpMhE/ansible_module_ios_config.py\", line 551, in <module>\n    main()\n  File \"/tmp/ansible_LMpMhE/ansible_module_ios_config.py\", line 454, in main\n    config, have_banners = get_running_config(module, config)\n  File \"/tmp/ansible_LMpMhE/ansible_module_ios_config.py\", line 352, in get_running_config\n    contents = get_config(module, flags=flags)\n  File \"/tmp/ansible_LMpMhE/ansible_modlib.zip/ansible/module_utils/network/ios/ios.py\", line 117, in get_config\n  File \"/tmp/ansible_LMpMhE/ansible_modlib.zip/ansible/module_utils/connection.py\", line 146, in __rpc__\nansible.module_utils.connection.ConnectionError: timeout trying to send command: show running-config all\n", "module_stdout": "", "msg": "MODULE FAILURE", "rc": 1}
        to retry, use: --limit @/home/bdowling/.ansible/retry/bug-acl.retry
@bdowling

This comment has been minimized.

Contributor

bdowling commented Feb 14, 2018

With some additional (local) debugging turned on, you can see that the show running all is taking a good amount of time and ansible cuts the connection 10 seconds in a partial receipt of the response:

2018-02-14 16:15:44,253 p=1766 u=bdowling |    1773 1518624944.25329: CLI-PROMPT: looking for prompts in [21]| show running-config all\r\nBuilding configuration...\r\n
2018-02-14 16:15:45,019 p=1766 u=bdowling |    1773 1518624945.01909: CLI-PROMPT: looking for prompts in [22]| \r\nCurrent configuration with default configu
rations exposed : 223507 bytes\r\n!\r\n! Last configuration change at 15:46:59 UTC Wed Feb 14 2018 by neoadmin\r\n!\r\nversion 15.5\r\ndownward-compatible-co
nfig 15.5\r\nno service log backtrace\r\nservice config\r\nno service e

...
2018-02-14 16:15:53,767 p=1766 u=bdowling |    1773 1518624953.76767: CLI-PROMPT: looking for prompts in [437]| slot mma bay notice process 1 module R1 level forwarding-manager\r\nplatform trace runtime slot mmon bay notice process 0 module F0 level forwarding-manager\r\nplatform trace runtime slot mmon bay notice process 0 module F1 level forwarding-manager\r\nplatform t
2018-02-14 16:15:54,062 p=1766 u=bdowling |  command timeout triggered, timeout value is 10 secs
2018-02-14 16:15:54,062 p=1766 u=bdowling |    1773 1518624954.06231: closing ssh connection to device
2018-02-14 16:15:54,062 p=1766 u=bdowling |    1773 1518624954.06262: firing event: on_close_shell()
2018-02-14 16:15:54,062 paramiko.transport [chan 0] EOF sent (0)
2018-02-14 16:15:54,063 p=1766 u=bdowling |    1773 1518624954.06330: cli session is now closed
2018-02-14 16:15:54,063 p=1766 u=bdowling |    1773 1518624954.06355: ssh connection has been closed successfully
2018-02-14 16:15:54,063 p=1766 u=bdowling |  shutdown complete
2018-02-14 16:15:54,064 p=1766 u=bdowling |  shutdown complete
2018-02-14 16:15:54,086 p=1756 u=bdowling |    1765 1518624954.08666: done communicating
2018-02-14 16:15:54,087 p=1756 u=bdowling |    1765 1518624954.08696: done with local.exec_command()
2018-02-14 16:15:54,087 p=1756 u=bdowling |    1765 1518624954.08719: _low_level_execute_command() done: rc=1, stdout=, stderr=Traceback (most recent call last):
  File "/tmp/ansible_qNPUQd/ansible_module_ios_config.py", line 551, in <module>
    main()
  File "/tmp/ansible_qNPUQd/ansible_module_ios_config.py", line 454, in main
    config, have_banners = get_running_config(module, config)
  File "/tmp/ansible_qNPUQd/ansible_module_ios_config.py", line 352, in get_running_config
    contents = get_config(module, flags=flags)
  File "/tmp/ansible_qNPUQd/ansible_modlib.zip/ansible/module_utils/network/ios/ios.py", line 117, in get_config
  File "/tmp/ansible_qNPUQd/ansible_modlib.zip/ansible/module_utils/connection.py", line 146, in __rpc__
ansible.module_utils.connection.ConnectionError: timeout trying to send command: show running-config all

@ansibot ansibot added stale_ci and removed stale_ci labels Feb 22, 2018

@bdowling bdowling changed the title from Fix network_cli command_timeout by extending on data recv'd to Reset network_cli command_timeout (alarm) on data recv'd Feb 24, 2018

@privateip

This comment has been minimized.

Member

privateip commented Feb 26, 2018

@bdowling your proposed change will break if the prompt is not properly identified and the session will be left in a hung state. It is precisely that situation that the current code path protects.

@bdowling

This comment has been minimized.

Contributor

bdowling commented Feb 26, 2018

@privateip previously this may have been the case. But now with the great work from @ganeshrn and in particular the merged fix in #35439 network_cli no longer sends any commands to try to wake up the prompt (it just uses the last seen).

So in this PR, even in cases where the prompt is not found, the timeout will still occur because the device stops sending any output. I hope that you will see from the tests below that I have addressed your concerns. Network delays, device performance, etc can all affect how fast the device can return all the data we are asking of it. Using a timeout that is reset whenever the device outputs data is better than abruptly disconnecting a working, but slow connection short.

To prove out this theory, I have tested this in my branch in two ways.

  1. I broke the enable password: prompt by fudging the regex. Which results in a timeout:
AnsibleConnectionFailure: unable to elevate privilege to enable mode, at prompt [None] with error: timeout trying to send command: enable
  1. I patched network_cli to mess with the stdout_re after I send a special command:
diff --git a/lib/ansible/plugins/connection/network_cli.py b/lib/ansible/plugins/connection/network_cli.py
index 1e14949..03adbca 100644
--- a/lib/ansible/plugins/connection/network_cli.py
+++ b/lib/ansible/plugins/connection/network_cli.py
@@ -388,6 +388,11 @@ class Connection(ConnectionBase):
         if signal_handler == signal.SIG_DFL:
             signal_handler = None

+
+        if command and 'break prompts' in command:
+            display.debug("CLI-RECEIVE Breaking prompt matching")
+            self._terminal.terminal_stdout_re = [re.compile(br'NeverMatchThis')]
+
         while True:
             data = self._ssh_shell.recv(256)

With play:

---

- hosts: all
  gather_facts: no
  connection: network_cli

  tasks:
  - name: echo break prompts
    ios_command:
      authorize: yes
      commands: echo break prompts

Which also results in the expected timeout:

ansible.module_utils.connection.ConnectionError: timeout trying to send command: echo break prompts
@privateip

This comment has been minimized.

Member

privateip commented Mar 1, 2018

@bdowling just a quick update. I talked with @ganeshrn and he is going to do some testing of this patch and we will try to pull it in. no promises but we will try our best.

@privateip privateip self-requested a review Mar 1, 2018

@privateip privateip added this to the 2.5.0 milestone Mar 1, 2018

@privateip privateip added this to Stretch in Networking Mar 1, 2018

@privateip privateip requested a review from ganeshrn Mar 1, 2018

@ansibot

This comment has been minimized.

Contributor

ansibot commented Mar 1, 2018

@ganeshrn

This comment has been minimized.

Member

ganeshrn commented Mar 8, 2018

I have done some manual testing and added integration and unit test as part of PR #37185 to test this patch. Seems to be working fine.

@samccann

This comment has been minimized.

Contributor

samccann commented Sep 12, 2018

Based on slack comment, this PR still has code issues.

@ansibot

This comment has been minimized.

Contributor

ansibot commented Sep 17, 2018

@bdowling this PR contains the following merge commits:

Please rebase your branch to remove these commits.

click here for bot help

@ansibot ansibot added merge_commit and removed stale_ci labels Sep 17, 2018

@ansibot ansibot added the stale_ci label Sep 25, 2018

@ganeshrn

This comment has been minimized.

Member

ganeshrn commented Oct 2, 2018

@bdowling Please rebase this PR to remove merge commit

@bdowling bdowling force-pushed the bdowling:fix_network_timeout branch from d675918 to b5ddd7e Oct 3, 2018

@ansibot

This comment has been minimized.

Contributor

ansibot commented Oct 3, 2018

@bdowling bdowling force-pushed the bdowling:fix_network_timeout branch from e755185 to 3f15793 Oct 3, 2018

@bdowling bdowling force-pushed the bdowling:fix_network_timeout branch from 3f15793 to b114727 Oct 3, 2018

@ansibot ansibot removed the needs_revision label Oct 6, 2018

@ansibot ansibot added the docsite label Nov 5, 2018

@samccann

This comment has been minimized.

Contributor

samccann commented Nov 28, 2018

@ganeshrn - is there more work to go on this PR? (as in does it still need more revisions or perhaps just a rebase?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment