Contiv installer - Intermittent Install failures seen w/ latest 1.1.7 installer bits #340

Open
rkharya opened this issue Jan 25, 2018 · 4 comments

rkharya commented Jan 25, 2018

Description

v2plugin installation failed multiple times on two different setups.
The error messages differ between Contiv master nodes and Contiv worker nodes.

Expected Behavior

Contiv install should succeed on all master and worker nodes without any errors.

Observed Behavior

The issue is intermittent, but one pattern is consistent: after a complete clean-up of Contiv bits from the Docker Swarm cluster, the first installation attempt fails, and a subsequent retry eventually succeeds in installing Contiv. This behaviour is seen only with the latest code changes made about 20 days ago on the 1.1.7 release. We did not see this issue during the CVD validation cycle, up to the CVD release on Dec 18th, 2017.

##Master Node install failures -

TASK [contiv_network : install v2plugin on master nodes] ***********************
fatal: [node2]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.122.63 control_url=10.65.122.63:9999 vxlan_port=8472 iflist=eno6 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=master fwd_mode=bridge", "delta": "0:06:11.601524", "end": "2018-01-22 15:11:25.034534", "failed": true, "rc": 1, "start": "2018-01-22 15:05:13.433010", "stderr": "Error response from daemon: dial unix /run/docker/plugins/330e5e6cb7025e7c40805912541ff706fad4d35eb4bb34b877ea5004dfcf8511/netplugin.sock: connect: connection refused", "stderr_lines": ["Error response from daemon: dial unix /run/docker/plugins/330e5e6cb7025e7c40805912541ff706fad4d35eb4bb34b877ea5004dfcf8511/netplugin.sock: connect: connection refused"], "stdout": "1.1.7: Pulling from contiv/v2plugin\n1ba3fc0d8c93: Verifying Checksum\n1ba3fc0d8c93: Download complete\nDigest: sha256:2b610546b385bcc46ca6c76a9be7fd859a3abf4b37f529ba9df41a4dc3853c30\nStatus: Downloaded newer image for contiv/v2plugin:1.1.7", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin", "1ba3fc0d8c93: Verifying Checksum", "1ba3fc0d8c93: Download complete", "Digest: sha256:2b610546b385bcc46ca6c76a9be7fd859a3abf4b37f529ba9df41a4dc3853c30", "Status: Downloaded newer image for contiv/v2plugin:1.1.7"]}
fatal: [node1]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.122.61 control_url=10.65.122.61:9999 vxlan_port=8472 iflist=eno6 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=master fwd_mode=bridge", "delta": "0:06:12.083192", "end": "2018-01-22 15:11:25.836960", "failed": true, "rc": 1, "start": "2018-01-22 15:05:13.753768", "stderr": "Error response from daemon: dial unix /run/docker/plugins/6f11c1b2fea19a72d9aa2ef95c0e85c224891f982826f815ff8a556dc640e48c/netplugin.sock: connect: no such file or directory", "stderr_lines": ["Error response from daemon: dial unix /run/docker/plugins/6f11c1b2fea19a72d9aa2ef95c0e85c224891f982826f815ff8a556dc640e48c/netplugin.sock: connect: no such file or directory"], "stdout": "1.1.7: Pulling from contiv/v2plugin\n1ba3fc0d8c93: Verifying Checksum\n1ba3fc0d8c93: Download complete\nDigest: sha256:2b610546b385bcc46ca6c76a9be7fd859a3abf4b37f529ba9df41a4dc3853c30\nStatus: Downloaded newer image for contiv/v2plugin:1.1.7", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin", "1ba3fc0d8c93: Verifying Checksum", "1ba3fc0d8c93: Download complete", "Digest: sha256:2b610546b385bcc46ca6c76a9be7fd859a3abf4b37f529ba9df41a4dc3853c30", "Status: Downloaded newer image for contiv/v2plugin:1.1.7"]}
fatal: [node3]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.122.62 control_url=10.65.122.62:9999 vxlan_port=8472 iflist=eno6 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=master fwd_mode=bridge", "delta": "0:06:12.404043", "end": "2018-01-22 15:11:25.136644", "failed": true, "rc": 1, "start": "2018-01-22 15:05:12.732601", "stderr": "Error response from daemon: dial unix /run/docker/plugins/9c15133fdbe9ee55f4054b0f3af7fbd9be9ae8efc0bfd72d70b791f3ecfb27fd/netplugin.sock: connect: no such file or directory", "stderr_lines": ["Error response from daemon: dial unix /run/docker/plugins/9c15133fdbe9ee55f4054b0f3af7fbd9be9ae8efc0bfd72d70b791f3ecfb27fd/netplugin.sock: connect: no such file or directory"], "stdout": "1.1.7: Pulling from contiv/v2plugin\n1ba3fc0d8c93: Verifying Checksum\n1ba3fc0d8c93: Download complete\nDigest: sha256:2b610546b385bcc46ca6c76a9be7fd859a3abf4b37f529ba9df41a4dc3853c30\nStatus: Downloaded newer image for contiv/v2plugin:1.1.7", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin", "1ba3fc0d8c93: Verifying Checksum", "1ba3fc0d8c93: Download complete", "Digest: sha256:2b610546b385bcc46ca6c76a9be7fd859a3abf4b37f529ba9df41a4dc3853c30", "Status: Downloaded newer image for contiv/v2plugin:1.1.7"]}
to retry, use: --limit @/ansible/install_plays.retry

PLAY RECAP *********************************************************************
node1 : ok=17 changed=9 unreachable=0 failed=1
node2 : ok=17 changed=9 unreachable=0 failed=1
node3 : ok=17 changed=9 unreachable=0 failed=1
node4 : ok=9 changed=4 unreachable=0 failed=0
node5 : ok=9 changed=4 unreachable=0 failed=0
node6 : ok=9 changed=4 unreachable=0 failed=0
node7 : ok=9 changed=4 unreachable=0 failed=0
node8 : ok=9 changed=4 unreachable=0 failed=0
node9 : ok=9 changed=4 unreachable=0 failed=0
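For longer runs, the failing hosts can be pulled out of a PLAY RECAP block like the one above with a small script. This is a minimal sketch that assumes the recap lines keep the `host : key=value ...` shape shown here; the function names `parse_recap` and `failed_hosts` are illustrative and not part of the Contiv installer.

```python
import re

# Matches recap lines such as "node1 : ok=17 changed=9 unreachable=0 failed=1".
RECAP_LINE = re.compile(r"^(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")


def parse_recap(text):
    """Return {host: {counter: int}} for every recap line found in text."""
    results = {}
    for line in text.splitlines():
        match = RECAP_LINE.match(line.strip())
        if not match:
            continue  # banner lines and blank lines are ignored
        counters = dict(pair.split("=") for pair in match.group("counters").split())
        results[match.group("host")] = {key: int(value) for key, value in counters.items()}
    return results


def failed_hosts(text):
    """List hosts whose recap shows failed or unreachable tasks (string-sorted)."""
    recap = parse_recap(text)
    return sorted(
        host
        for host, counters in recap.items()
        if counters.get("failed", 0) or counters.get("unreachable", 0)
    )
```

Run against the recap above, `failed_hosts` would report the three master nodes and none of the workers.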

##Worker Node install failures -

TASK [contiv_network : install v2plugin on worker nodes] ***********************
fatal: [node6]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.121.140 control_url=10.65.121.140:9999 vxlan_port=8472 iflist=ens192 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=worker fwd_mode=bridge", "delta": "0:04:51.934836", "end": "2018-01-25 11:38:37.231374", "failed": true, "rc": 1, "start": "2018-01-25 11:33:45.296538", "stderr": "failed to download: unexpected EOF", "stderr_lines": ["failed to download: unexpected EOF"], "stdout": "1.1.7: Pulling from contiv/v2plugin", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin"]}
fatal: [node7]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.121.141 control_url=10.65.121.141:9999 vxlan_port=8472 iflist=ens192 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=worker fwd_mode=bridge", "delta": "0:04:52.343379", "end": "2018-01-25 11:38:44.770569", "failed": true, "rc": 1, "start": "2018-01-25 11:33:52.427190", "stderr": "failed to download: unexpected EOF", "stderr_lines": ["failed to download: unexpected EOF"], "stdout": "1.1.7: Pulling from contiv/v2plugin", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin"]}
fatal: [node4]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.121.142 control_url=10.65.121.142:9999 vxlan_port=8472 iflist=ens192 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=worker fwd_mode=bridge", "delta": "0:04:52.475222", "end": "2018-01-25 11:38:46.382501", "failed": true, "rc": 1, "start": "2018-01-25 11:33:53.907279", "stderr": "failed to download: unexpected EOF", "stderr_lines": ["failed to download: unexpected EOF"], "stdout": "1.1.7: Pulling from contiv/v2plugin", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin"]}
fatal: [node8]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.121.130 control_url=10.65.121.130:9999 vxlan_port=8472 iflist=ens192 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=worker fwd_mode=bridge", "delta": "0:04:54.685860", "end": "2018-01-25 11:38:48.099427", "failed": true, "rc": 1, "start": "2018-01-25 11:33:53.413567", "stderr": "failed to download: unexpected EOF", "stderr_lines": ["failed to download: unexpected EOF"], "stdout": "1.1.7: Pulling from contiv/v2plugin", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin"]}
fatal: [node5]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.121.143 control_url=10.65.121.143:9999 vxlan_port=8472 iflist=ens192 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=worker fwd_mode=bridge", "delta": "0:04:55.817107", "end": "2018-01-25 11:38:49.210135", "failed": true, "rc": 1, "start": "2018-01-25 11:33:53.393028", "stderr": "failed to download: unexpected EOF", "stderr_lines": ["failed to download: unexpected EOF"], "stdout": "1.1.7: Pulling from contiv/v2plugin", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin"]}
fatal: [node12]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.121.129 control_url=10.65.121.129:9999 vxlan_port=8472 iflist=ens192 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=worker fwd_mode=bridge", "delta": "0:01:54.202116", "end": "2018-01-25 11:40:35.330632", "failed": true, "rc": 1, "start": "2018-01-25 11:38:41.128516", "stderr": "failed to download: unexpected EOF", "stderr_lines": ["failed to download: unexpected EOF"], "stdout": "1.1.7: Pulling from contiv/v2plugin", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin"]}
fatal: [node11]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.121.128 control_url=10.65.121.128:9999 vxlan_port=8472 iflist=ens192 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=worker fwd_mode=bridge", "delta": "0:01:56.424311", "end": "2018-01-25 11:40:43.263658", "failed": true, "rc": 1, "start": "2018-01-25 11:38:46.839347", "stderr": "failed to download: unexpected EOF", "stderr_lines": ["failed to download: unexpected EOF"], "stdout": "1.1.7: Pulling from contiv/v2plugin", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin"]}
fatal: [node9]: FAILED! => {"changed": true, "cmd": "/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=10.65.121.124 control_url=10.65.121.124:9999 vxlan_port=8472 iflist=eno6 plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=worker fwd_mode=bridge", "delta": "0:02:54.790835", "end": "2018-01-25 11:41:46.656811", "failed": true, "rc": 1, "start": "2018-01-25 11:38:51.865976", "stderr": "failed to download: unexpected EOF", "stderr_lines": ["failed to download: unexpected EOF"], "stdout": "1.1.7: Pulling from contiv/v2plugin", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin"]}
changed: [node10]


PLAY RECAP *********************************************************************
node1 : ok=38 changed=19 unreachable=0 failed=0
node10 : ok=23 changed=14 unreachable=0 failed=0
node11 : ok=16 changed=9 unreachable=0 failed=1
node12 : ok=16 changed=9 unreachable=0 failed=1
node2 : ok=37 changed=18 unreachable=0 failed=0
node3 : ok=37 changed=18 unreachable=0 failed=0
node4 : ok=16 changed=9 unreachable=0 failed=1
node5 : ok=16 changed=9 unreachable=0 failed=1
node6 : ok=16 changed=9 unreachable=0 failed=1
node7 : ok=16 changed=9 unreachable=0 failed=1
node8 : ok=16 changed=9 unreachable=0 failed=1
node9 : ok=16 changed=9 unreachable=0 failed=1
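The logs above show two distinct failure signatures: a `netplugin.sock` connection error on the masters and `failed to download: unexpected EOF` on the workers. The sketch below groups hosts by signature from raw Ansible output. It assumes each failure is a single line of the form `fatal: [host]: FAILED! => {json}`, as in the paste above; the category labels and function names are made up for illustration.

```python
import json
import re
from collections import defaultdict

# Matches Ansible failure lines of the form: fatal: [node2]: FAILED! => {...}
FATAL_LINE = re.compile(r"^fatal: \[(?P<host>[^\]]+)\]: FAILED! => (?P<payload>\{.*\})\s*$")


def classify(stderr):
    """Map a stderr string to one of the signatures seen in this issue."""
    if "failed to download" in stderr:
        return "download"
    if "netplugin.sock" in stderr:
        return "plugin-socket"
    return "other"


def group_failures(log_text):
    """Return {signature: [hosts]} for every fatal line in log_text."""
    groups = defaultdict(list)
    for line in log_text.splitlines():
        match = FATAL_LINE.match(line)
        if not match:
            continue
        try:
            payload = json.loads(match.group("payload"))
        except json.JSONDecodeError:
            continue  # skip lines whose payload was truncated or mangled
        groups[classify(payload.get("stderr", ""))].append(match.group("host"))
    return dict(groups)
```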

##Worker node failure key error message -
"failed": true, "rc": 1, "start": "2018-01-25 11:33:45.296538", "stderr": "failed to download: unexpected EOF", "stderr_lines": ["failed to download: unexpected EOF"], "stdout": "1.1.7: Pulling from contiv/v2plugin", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin"]}

##Master node failure key error message -
"stderr": "Error response from daemon: dial unix /run/docker/plugins/330e5e6cb7025e7c40805912541ff706fad4d35eb4bb34b877ea5004dfcf8511/netplugin.sock: connect: connection refused", "stderr_lines": ["Error response from daemon: dial unix /run/docker/plugins/330e5e6cb7025e7c40805912541ff706fad4d35eb4bb34b877ea5004dfcf8511/netplugin.sock: connect: connection refused"], "stdout": "1.1.7: Pulling from contiv/v2plugin\n1ba3fc0d8c93: Verifying Checksum\n1ba3fc0d8c93: Download complete\nDigest: sha256:2b610546b385bcc46ca6c76a9be7fd859a3abf4b37f529ba9df41a4dc3853c30\nStatus: Downloaded newer image for contiv/v2plugin:1.1.7", "stdout_lines": ["1.1.7: Pulling from contiv/v2plugin", "1ba3fc0d8c93: Verifying Checksum", "1ba3fc0d8c93: Download complete", "Digest: sha256:2b610546b385bcc46ca6c76a9be7fd859a3abf4b37f529ba9df41a4dc3853c30", "Status: Downloaded newer image for contiv/v2plugin:1.1.7"]}

Steps to Reproduce (for bugs)

  1. Create a DEE swarm-mode cluster with 3 master nodes and a couple of worker nodes
  2. Download the latest Contiv installer bits, version 1.1.7 (full install), from the Contiv GitHub install release location
  3. Modify cfg.yml and env.json to suit your cluster environment
  4. Run the installation command - ./install/ansible/install_swarm.sh -f install/ansible/cfg.yml -u root -e ~/.ssh/id_rsa -p
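Since the first attempt after a clean-up fails and a later retry succeeds, a retry wrapper can serve as a stopgap while the root cause is investigated. The following is only a sketch: `run_with_retry`, the attempt count, and the delay are assumptions, not something the installer provides, and it does not address the underlying failure.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, delay_seconds=30.0,
                   runner=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits 0 or attempts run out; return the attempt number that succeeded."""
    for attempt in range(1, attempts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            sleep(delay_seconds)  # give the Docker daemon and registry time to settle
    raise RuntimeError(f"command failed after {attempts} attempts: {cmd}")
```

The `cmd` argument would be the install command from step 4, passed as a list of arguments with the key path already expanded. The `runner` and `sleep` parameters exist only so the helper can be exercised without actually running the installer.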

Your Environment

  • netctl version - 1.1.7/v2Plugin
  • Orchestrator version (e.g. kubernetes, mesos, swarm): Swarm / UCP 2.2.4 / Docker Engine 17.06.2-ee-6
  • Operating System and version: RHEL 7.3

##Installation logs are attached herewith -
contiv_install_01-22-2018.09-34-14.UTC.log
contiv_install_01-25-2018.05-56-47.UTC.log

@vhosakot
Member

vhosakot commented Jan 25, 2018

Looking at the attached logs contiv_install_01-22-2018.09-34-14.UTC.log and contiv_install_01-25-2018.05-56-47.UTC.log, I see failures when the contiv docker v2plugin was installed.

The following command failed on both master and worker nodes in the logs:

/usr/bin/docker plugin install --grant-all-permissions contiv/v2plugin:1.1.7 ctrl_ip=<IP> control_url=<IP>:9999 vxlan_port=8472 iflist=<interface> plugin_name=contiv/v2plugin:1.1.7 cluster_store=etcd://localhost:2379 plugin_role=[master|worker] fwd_mode=bridge

Can you send the logs in /var/log/contiv/ and /var/log/contiv*.log from the master and worker nodes that saw this issue?
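To make gathering these easier across many nodes, something like the sketch below could bundle both locations into one archive per node. It assumes the logs live under /var/log as described above; `collect_contiv_logs` and its parameters are hypothetical names, and the `log_root` argument is only there so the function can be tried against a scratch directory.

```python
import glob
import os
import tarfile


def collect_contiv_logs(archive_path, log_root="/var/log"):
    """Bundle <log_root>/contiv/ and <log_root>/contiv*.log into a gzip tarball.

    Returns the list of paths added, which is empty on nodes with no Contiv logs.
    """
    candidates = []
    contiv_dir = os.path.join(log_root, "contiv")
    if os.path.isdir(contiv_dir):
        candidates.append(contiv_dir)
    candidates.extend(sorted(glob.glob(os.path.join(log_root, "contiv*.log"))))

    with tarfile.open(archive_path, "w:gz") as archive:
        for path in candidates:
            # Store paths relative to log_root so the archive unpacks cleanly.
            archive.add(path, arcname=os.path.relpath(path, log_root))
    return candidates
```

On a node with neither location present, this writes an empty archive and returns an empty list, which is itself a useful data point.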

@rkharya
Author

rkharya commented Jan 26, 2018

Worker node install failures -
The worker nodes don't have a /var/log/contiv/ folder or any other Contiv logs, so I am attaching logs from the corresponding master nodes in the same cluster -
contiv-master-logs-workerfailure.tar.gz

Master node install failures (as observed on the 2nd cluster) -
contiv-master-node-logs.tar.gz

In this case the master nodes don't have netctl installed, though netplugin booted up cleanly -
[root@DEE-Ctrl-1 contiv]# cat plugin_bootup.log
2018-01-22T09:41:03Z|00001|vlog|INFO|opened log file /var/log/contiv/ovs-db.log
2018-01-22T09:41:03Z|00001|vlog|INFO|opened log file /var/log/contiv/ovs-vswitchd.log
Waiting for netmaster to be ready for connections
Netmaster ready for connections, setting forward mode to bridge
Forward mode is set
n-if=eno6 -cluster-store=etcd://localhost:2379 -ctrl-ip=10.65.122.61
/netmaster -plugin-name=contiv/v2plugin:1.1.7 -cluster-mode=swarm-mode -cluster-store=etcd://localhost:2379 -control-url=10.65.122.61:9999

Also docker plugin ls doesn't list Contiv -

[root@DEE-Ctrl-1 contiv]# docker plugin ls
ID             NAME                                         DESCRIPTION                    ENABLED
631d379403b4   docker/telemetry:1.0.0.linux-x86_64-stable   Docker Inc. metrics exporter   false
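The same check can be scripted when many nodes need to be inspected. This is a rough sketch that parses the text table printed by `docker plugin ls`, assuming the ID is the first column, the plugin name is the second and contains no spaces, and ENABLED is the last; `listed_plugins` and `has_contiv` are illustrative names.

```python
def listed_plugins(plugin_ls_output):
    """Return {plugin_name: enabled} parsed from `docker plugin ls` table output."""
    plugins = {}
    for line in plugin_ls_output.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and anything too short to be a plugin row.
        if len(fields) < 3 or fields[0] == "ID":
            continue
        plugins[fields[1]] = fields[-1].lower() == "true"
    return plugins


def has_contiv(plugin_ls_output):
    """True if any listed plugin name starts with "contiv/"."""
    return any(name.startswith("contiv/") for name in listed_plugins(plugin_ls_output))
```

For the output above, `has_contiv` returns False, matching what was observed on the master node.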

@blaksmit added the P1 label Jan 31, 2018
@unclejack
Contributor

@rkharya: Have you reproduced this on CentOS or on another distribution?

@rkharya
Author

rkharya commented Feb 8, 2018

@unclejack: Reproducible on RHEL 7.3 environments, both bare metal and bare metal with VMs.
