[BUG] Increase the waitforssh retry to fix the CI issues. #766
Comments
Our CI is failing with the following output:

```
17:21:18 Scenario Outline: Start CRC # features/story_registry.feature:9
17:21:18 When starting CRC with default bundle and hypervisor "libvirt" succeeds # features/story_registry.feature:11
17:21:18 Error: command 'crc start -d libvirt -p '/home/crc_ci/payload/crc_pull_secret' -b crc_libvirt_4.2.0.crcbundle --log-level debug', expected to succeed, exited with exit code: 1
17:21:18
17:21:18 Scenario: Push local image to OpenShift image registry # features/story_registry.feature:28
17:21:18 Given executing "oc new-project testproj-img" succeeds # features/story_registry.feature:29
17:21:18 Error: command 'oc new-project testproj-img', expected to succeed, exited with exit code: 127
17:21:18
17:21:18 Scenario: Deploy the image # features/story_registry.feature:34
17:21:18 Given executing "oc new-app testproj-img/hello:test" succeeds # features/story_registry.feature:35
17:21:18 Error: command 'oc new-app testproj-img/hello:test', expected to succeed, exited with exit code: 127
```

If we check the logs from the artifacts, we see the following, which means SSH does not become available within 10 retries. This is likely because of nested virtualization and resource limitations:

```
time="2019-10-23T12:49:24+01:00" level=debug msg="retry loop 6"
time="2019-10-23T12:49:32+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr : exit status 255\noutput : - sleeping 1s"
time="2019-10-23T12:49:33+01:00" level=debug msg="retry loop 7"
time="2019-10-23T12:49:41+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr : exit status 255\noutput : - sleeping 1s"
time="2019-10-23T12:49:42+01:00" level=debug msg="retry loop 8"
time="2019-10-23T12:49:50+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr : exit status 255\noutput : - sleeping 1s"
time="2019-10-23T12:49:51+01:00" level=debug msg="retry loop 9"
time="2019-10-23T12:49:59+01:00" level=debug msg="error: Temporary Error: ssh command error:\ncommand : exit 0\nerr : exit status 255\noutput : - sleeping 1s"
time="2019-10-23T12:50:00+01:00" level=error msg="Failed to connect to the CRC VM with SSH"
```
Fixed via #767
I think we should only do this for CI. Am I right in thinking that, for a user of crc, if the VM we create does not get an IP, we just made the time until we report failure much longer?
@cfergeau I think this helps not only for CI but also on a slow system, where the SSH connection is not established because the VM takes a bit longer to start. This is not about getting an IP, since on
WaitForSSH is our main way of checking that the OS we just started is fully functional and that we can continue with our post-start setup, so I think we will be detecting most boot failures through it. We just made that wait 6 times longer before reporting a failure.

Apart from CI, I don't think we've had reports of regular users hitting the time limit, and I have doubts such a slow system would give a good crc experience.
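To put rough numbers on the "6 times longer" concern: the timestamps in the log above suggest each attempt costs about 9 s (an ~8 s SSH timeout plus the 1 s sleep). Assuming the retry cap went from 10 to 60 attempts (both values are inferences from this thread, not documented crc constants), the worst-case wait before a failure is reported works out as:

```go
package main

import "fmt"

func main() {
	// ~8 s SSH timeout + 1 s sleep per attempt, estimated from the log timestamps.
	const secondsPerAttempt = 8 + 1
	for _, attempts := range []int{10, 60} {
		fmt.Printf("%2d retries -> worst case ~%d s before failure is reported\n",
			attempts, attempts*secondsPerAttempt)
	}
}
```

Under those assumptions, a broken VM that previously surfaced an error in about a minute and a half would now take roughly nine minutes, which is the trade-off being discussed.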
> regular users hitting the time limit,

Right. Usually this occurs with nested virtualization or when resources are pushed on a node.
--
Gerard Braad | http://gbraad.nl
[ Doing Open Source Matters ]
It stops waiting once SSH is available, so it won't increase the overall time on a machine where crc is already working properly; only on a slow machine does this increase the wait time.
--
ANJAN J NATH
nathearthling.me
Yes, this is what I'm saying: this commit increased the time it takes us to report a failure. If my node is broken, I'll now have to wait 6 times longer until crc lets me know.
> 6 times longer until crc lets me know

Seems wrong ... but CI only :-/