
[BUG] Increase the waitforssh retry to fix the CI issues. #766

Closed
praveenkumar opened this issue Oct 24, 2019 · 8 comments
Labels: kind/bug (Something isn't working)
@praveenkumar
Member

Right now our CI is failing with the following output.

```
17:21:18   Scenario Outline: Start CRC # features/story_registry.feature:9
17:21:18     When starting CRC with default bundle and hypervisor "libvirt" succeeds # features/story_registry.feature:11
17:21:18       Error: command 'crc start -d libvirt -p '/home/crc_ci/payload/crc_pull_secret' -b crc_libvirt_4.2.0.crcbundle --log-level debug', expected to succeed, exited with exit code: 1
17:21:18
17:21:18   Scenario: Push local image to OpenShift image registry # features/story_registry.feature:28
17:21:18     Given executing "oc new-project testproj-img" succeeds # features/story_registry.feature:29
17:21:18       Error: command 'oc new-project testproj-img', expected to succeed, exited with exit code: 127
17:21:18
17:21:18   Scenario: Deploy the image # features/story_registry.feature:34
17:21:18     Given executing "oc new-app testproj-img/hello:test" succeeds # features/story_registry.feature:35
17:21:18       Error: command 'oc new-app testproj-img/hello:test', expected to succeed, exited with exit code: 127
```

If we check the logs from the artifacts, we see the following, which means SSH did not become available within 10 retries; this might be because of nested virtualization and resource limitations.

time="2019-10-23T12:48:29+01:00" level=debug msg="Waiting until ssh is available"
time="2019-10-23T12:48:38+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr     : exit status 255\noutput  :  - sleeping 1s"
time="2019-10-23T12:48:39+01:00" level=debug msg="retry loop 1"
time="2019-10-23T12:48:47+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr     : exit status 255\noutput  :  - sleeping 1s"
time="2019-10-23T12:48:48+01:00" level=debug msg="retry loop 2"
time="2019-10-23T12:48:56+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr     : exit status 255\noutput  :  - sleeping 1s"
time="2019-10-23T12:48:57+01:00" level=debug msg="retry loop 3"
time="2019-10-23T12:49:05+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr     : exit status 255\noutput  :  - sleeping 1s"
time="2019-10-23T12:49:06+01:00" level=debug msg="retry loop 4"
time="2019-10-23T12:49:14+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr     : exit status 255\noutput  :  - sleeping 1s"
time="2019-10-23T12:49:15+01:00" level=debug msg="retry loop 5"
time="2019-10-23T12:49:23+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr     : exit status 255\noutput  :  - sleeping 1s"
time="2019-10-23T12:49:24+01:00" level=debug msg="retry loop 6"
time="2019-10-23T12:49:32+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr     : exit status 255\noutput  :  - sleeping 1s"
time="2019-10-23T12:49:33+01:00" level=debug msg="retry loop 7"
time="2019-10-23T12:49:41+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr     : exit status 255\noutput  :  - sleeping 1s"
time="2019-10-23T12:49:42+01:00" level=debug msg="retry loop 8"
time="2019-10-23T12:49:50+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr     : exit status 255\noutput  :  - sleeping 1s"
time="2019-10-23T12:49:51+01:00" level=debug msg="retry loop 9"
time="2019-10-23T12:49:59+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr     : exit status 255\noutput  :  - sleeping 1s"
time="2019-10-23T12:50:00+01:00" level=error msg="Failed to connect to the CRC VM with SSH"
@praveenkumar praveenkumar added the kind/bug Something isn't working label Oct 24, 2019
@praveenkumar praveenkumar self-assigned this Oct 24, 2019
anjannath pushed a commit that referenced this issue Oct 24, 2019
Our CI is failing with the following output.
```
17:21:18   Scenario Outline: Start CRC # features/story_registry.feature:9
17:21:18     When starting CRC with default bundle and hypervisor "libvirt" succeeds # features/story_registry.feature:11
17:21:18       Error: command 'crc start -d libvirt -p '/home/crc_ci/payload/crc_pull_secret' -b crc_libvirt_4.2.0.crcbundle --log-level debug', expected to succeed, exited with exit code: 1
17:21:18
17:21:18   Scenario: Push local image to OpenShift image registry # features/story_registry.feature:28
17:21:18     Given executing "oc new-project testproj-img" succeeds # features/story_registry.feature:29
17:21:18       Error: command 'oc new-project testproj-img', expected to succeed, exited with exit code: 127
17:21:18
17:21:18   Scenario: Deploy the image # features/story_registry.feature:34
17:21:18     Given executing "oc new-app testproj-img/hello:test" succeeds # features/story_registry.feature:35
17:21:18       Error: command 'oc new-app testproj-img/hello:test', expected to succeed, exited with exit code: 127
```

If we check the logs from the artifacts, we see the following,
which means SSH did not become available within 10 retries;
this might be because of nested virtualization and resource limitations.

```
time="2019-10-23T12:49:24+01:00" level=debug msg="retry loop 6"
time="2019-10-23T12:49:32+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr     : exit status 255\noutput  :  - sleeping 1s"
time="2019-10-23T12:49:33+01:00" level=debug msg="retry loop 7"
time="2019-10-23T12:49:41+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr     : exit status 255\noutput  :  - sleeping 1s"
time="2019-10-23T12:49:42+01:00" level=debug msg="retry loop 8"
time="2019-10-23T12:49:50+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr     : exit status 255\noutput  :  - sleeping 1s"
time="2019-10-23T12:49:51+01:00" level=debug msg="retry loop 9"
time="2019-10-23T12:49:59+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr     : exit status 255\noutput  :  - sleeping 1s"
time="2019-10-23T12:50:00+01:00" level=error msg="Failed to connect to the CRC VM with SSH"
```
@anjannath
Member

Fixed via #767

@anjannath anjannath added this to Done in Sprint 174 Oct 24, 2019
@cfergeau
Contributor

I think we should only do this for CI. Am I right in thinking that for a user of crc, if the VM we create does not get an IP, we just made the time until we report failure much longer?

@praveenkumar
Member Author

> if the VM we create does not get an IP, we just made the time until we report failure much longer?

@cfergeau I think this helps not only in CI but also on a slow system, where the SSH connection isn't established because the VM takes a bit longer to start.

This is not about getting the IP: in the stop=>start scenario the IP is captured from the crc.status file, but the VM is still booting up, which sometimes takes longer than usual, and waitforssh makes sure we only start executing commands once it is up.
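A rough sketch of that stop=>start flow, assuming hypothetical helpers (loadStatus, sshCheck, runPostStartSetup) in place of the real crc internals:

```go
// Minimal sketch of the stop=>start flow described above; loadStatus,
// sshCheck, and runPostStartSetup are hypothetical stand-ins, not the
// actual crc internals.
package main

import (
	"fmt"
	"time"
)

type vmStatus struct{ IP string }

// loadStatus stands in for reading the previously captured VM IP from the
// crc.status file; the address used here is just a placeholder.
func loadStatus(path string) (vmStatus, error) {
	return vmStatus{IP: "192.168.130.11"}, nil
}

// sshCheck stands in for an SSH round-trip to the VM; it keeps failing while
// the guest OS is still booting, even though the IP is already known.
func sshCheck(ip string) error {
	return fmt.Errorf("exit status 255")
}

// runPostStartSetup stands in for the commands run over SSH after start.
func runPostStartSetup(ip string) error { return nil }

func startStoppedVM() error {
	st, err := loadStatus("crc.status")
	if err != nil {
		return err
	}
	// Knowing the IP is not enough: gate every post-start command on SSH
	// becoming available, so they only run once the VM has finished booting.
	for i := 0; i < 10; i++ {
		if sshCheck(st.IP) == nil {
			return runPostStartSetup(st.IP)
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("failed to connect to the CRC VM with SSH")
}

func main() {
	if err := startStoppedVM(); err != nil {
		fmt.Println(err)
	}
}
```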

@cfergeau
Contributor

WaitForSSH is our main way of checking that the OS we just started is fully functional and that we can continue with our post-start setup, so I think we will detect most boot failures through it. We just made that wait 6 times longer before reporting a failure.
Apart from CI, I don't think we've had reports of regular users hitting the time limit, and I doubt such a slow system would give a good crc experience.

@gbraad
Contributor

gbraad commented Oct 24, 2019 via email

@anjannath
Member

anjannath commented Oct 24, 2019 via email

@cfergeau
Contributor

Yes, this is what I'm saying: this commit increased the time it takes us to report a failure. If my node is broken, I'll now have to wait 6 times longer until crc lets me know.

@gbraad
Contributor

gbraad commented Oct 24, 2019 via email
