
[BUG] Increase the waitforssh retry to fix the CI issues. #766

Closed
praveenkumar opened this issue Oct 24, 2019 · 8 comments
Labels: kind/bug (Something isn't working)
@praveenkumar
Member

Right now our CI is failing with the following output.

```
17:21:18   Scenario Outline: Start CRC # features/story_registry.feature:9
17:21:18     When starting CRC with default bundle and hypervisor "libvirt" succeeds # features/story_registry.feature:11
17:21:18       Error: command 'crc start -d libvirt -p '/home/crc_ci/payload/crc_pull_secret' -b crc_libvirt_4.2.0.crcbundle --log-level debug', expected to succeed, exited with exit code: 1
17:21:18
17:21:18   Scenario: Push local image to OpenShift image registry # features/story_registry.feature:28
17:21:18     Given executing "oc new-project testproj-img" succeeds # features/story_registry.feature:29
17:21:18       Error: command 'oc new-project testproj-img', expected to succeed, exited with exit code: 127
17:21:18
17:21:18   Scenario: Deploy the image # features/story_registry.feature:34
17:21:18     Given executing "oc new-app testproj-img/hello:test" succeeds # features/story_registry.feature:35
17:21:18       Error: command 'oc new-app testproj-img/hello:test', expected to succeed, exited with exit code: 127
```

If we check the logs from the artifacts, we see the following, which means SSH did not become available within 10 retries; this might be because of nested virtualization and resource limitations.

time="2019-10-23T12:48:29+01:00" level=debug msg="Waiting until ssh is available"
time="2019-10-23T12:48:38+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr     : exit status 255\noutput  :  - sleeping 1s"
time="2019-10-23T12:48:39+01:00" level=debug msg="retry loop 1"
time="2019-10-23T12:48:47+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr     : exit status 255\noutput  :  - sleeping 1s"
time="2019-10-23T12:48:48+01:00" level=debug msg="retry loop 2"
time="2019-10-23T12:48:56+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr     : exit status 255\noutput  :  - sleeping 1s"
time="2019-10-23T12:48:57+01:00" level=debug msg="retry loop 3"
time="2019-10-23T12:49:05+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr     : exit status 255\noutput  :  - sleeping 1s"
time="2019-10-23T12:49:06+01:00" level=debug msg="retry loop 4"
time="2019-10-23T12:49:14+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr     : exit status 255\noutput  :  - sleeping 1s"
time="2019-10-23T12:49:15+01:00" level=debug msg="retry loop 5"
time="2019-10-23T12:49:23+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr     : exit status 255\noutput  :  - sleeping 1s"
time="2019-10-23T12:49:24+01:00" level=debug msg="retry loop 6"
time="2019-10-23T12:49:32+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr     : exit status 255\noutput  :  - sleeping 1s"
time="2019-10-23T12:49:33+01:00" level=debug msg="retry loop 7"
time="2019-10-23T12:49:41+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr     : exit status 255\noutput  :  - sleeping 1s"
time="2019-10-23T12:49:42+01:00" level=debug msg="retry loop 8"
time="2019-10-23T12:49:50+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr     : exit status 255\noutput  :  - sleeping 1s"
time="2019-10-23T12:49:51+01:00" level=debug msg="retry loop 9"
time="2019-10-23T12:49:59+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr     : exit status 255\noutput  :  - sleeping 1s"
time="2019-10-23T12:50:00+01:00" level=error msg="Failed to connect to the CRC VM with SSH"
@praveenkumar praveenkumar added the kind/bug Something isn't working label Oct 24, 2019
@praveenkumar praveenkumar self-assigned this Oct 24, 2019
anjannath pushed a commit that referenced this issue Oct 24, 2019
Our CI is failing with the following output.
```
17:21:18   Scenario Outline: Start CRC # features/story_registry.feature:9
17:21:18     When starting CRC with default bundle and hypervisor "libvirt" succeeds # features/story_registry.feature:11
17:21:18       Error: command 'crc start -d libvirt -p '/home/crc_ci/payload/crc_pull_secret' -b crc_libvirt_4.2.0.crcbundle --log-level debug', expected to succeed, exited with exit code: 1
17:21:18
17:21:18   Scenario: Push local image to OpenShift image registry # features/story_registry.feature:28
17:21:18     Given executing "oc new-project testproj-img" succeeds # features/story_registry.feature:29
17:21:18       Error: command 'oc new-project testproj-img', expected to succeed, exited with exit code: 127
17:21:18
17:21:18   Scenario: Deploy the image # features/story_registry.feature:34
17:21:18     Given executing "oc new-app testproj-img/hello:test" succeeds # features/story_registry.feature:35
17:21:18       Error: command 'oc new-app testproj-img/hello:test', expected to succeed, exited with exit code: 127
```

If we check the logs from the artifacts, we see the following,
which means SSH did not become available within 10 retries;
this might be because of nested virtualization and resource limitations.

```
time="2019-10-23T12:49:24+01:00" level=debug msg="retry loop 6"
time="2019-10-23T12:49:32+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr     : exit status 255\noutput  :  - sleeping 1s"
time="2019-10-23T12:49:33+01:00" level=debug msg="retry loop 7"
time="2019-10-23T12:49:41+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr     : exit status 255\noutput  :  - sleeping 1s"
time="2019-10-23T12:49:42+01:00" level=debug msg="retry loop 8"
time="2019-10-23T12:49:50+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr     : exit status 255\noutput  :  - sleeping 1s"
time="2019-10-23T12:49:51+01:00" level=debug msg="retry loop 9"
time="2019-10-23T12:49:59+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr     : exit status 255\noutput  :  - sleeping 1s"
time="2019-10-23T12:50:00+01:00" level=error msg="Failed to connect to the CRC VM with SSH"
```
@anjannath
Member

Fixed via #767

@anjannath anjannath added this to Done in Sprint 174 Oct 24, 2019
@cfergeau
Contributor

I think we should only do this for CI. Am I right in thinking that for a user of crc, if the VM we create does not get an IP, we just made the time until we report failure much longer?

@praveenkumar
Member Author

> if the VM we create does not get an IP, we just made the time until we report failure much longer?

@cfergeau I think this helps not only in CI but also on a slow system, where the SSH connection isn't established because the VM takes a bit longer to start.

This is not about getting the IP: in the stop=>start scenario the IP is captured from the crc.status file, but the VM is still booting up, which sometimes takes longer than usual, and waitforssh makes sure we only start executing commands once it is up.
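A rough sketch of that stop=>start flow, assuming hypothetical helpers (loadStatus, sshCheck, runPostStartSetup) in place of the real crc internals:

```go
// Minimal sketch of the stop=>start flow described above; loadStatus,
// sshCheck, and runPostStartSetup are hypothetical stand-ins, not the
// actual crc internals.
package main

import (
	"fmt"
	"time"
)

type vmStatus struct{ IP string }

// loadStatus stands in for reading the previously captured VM IP from the
// crc.status file; the address used here is just a placeholder.
func loadStatus(path string) (vmStatus, error) {
	return vmStatus{IP: "192.168.130.11"}, nil
}

// sshCheck stands in for an SSH round-trip to the VM; it keeps failing while
// the guest OS is still booting, even though the IP is already known.
func sshCheck(ip string) error {
	return fmt.Errorf("exit status 255")
}

// runPostStartSetup stands in for the commands run over SSH after start.
func runPostStartSetup(ip string) error { return nil }

func startStoppedVM() error {
	st, err := loadStatus("crc.status")
	if err != nil {
		return err
	}
	// Knowing the IP is not enough: gate every post-start command on SSH
	// becoming available, so they only run once the VM has finished booting.
	for i := 0; i < 10; i++ {
		if sshCheck(st.IP) == nil {
			return runPostStartSetup(st.IP)
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("failed to connect to the CRC VM with SSH")
}

func main() {
	if err := startStoppedVM(); err != nil {
		fmt.Println(err)
	}
}
```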

@cfergeau
Contributor

WaitForSSH is our main way of checking that the OS we just started is fully functional and that we can continue with our post-start setup, so I think we will detect most boot failures through it. We just made that wait 6 times longer before reporting a failure.
Apart from CI, I don't think we've had reports of regular users hitting the time limit, and I doubt such a slow system would give a good crc experience.

@gbraad
Contributor

gbraad commented Oct 24, 2019 via email

@anjannath
Member

anjannath commented Oct 24, 2019 via email

@cfergeau
Contributor

Yes, this is what I'm saying: this commit increased the time it takes us to report a failure. If my node is broken, I'll now have to wait 6 times longer until crc lets me know.

@gbraad
Contributor

gbraad commented Oct 24, 2019 via email
