Nondeterministic failures #29

Open
davidchisnall opened this issue Feb 8, 2023 · 14 comments

@davidchisnall

I'm trying to use this with microsoft/snmalloc#588, but each run seems to have a high chance of one of the jobs failing by hitting the timeout. The output looks as if the connection is simply dropping. Is it possible that the SSH connection is dropped under high load? Would it be possible to run dtach in the VMs and reconnect if the session is dropped?
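
Something along these lines is what I have in mind; a rough sketch, assuming dtach is available in the guest (the host alias, socket path, script name, and log/exit-file paths are all made up for illustration):

```sh
# start the job detached inside the VM so it survives a dropped SSH session
ssh vm 'dtach -n /tmp/ci.sock sh -c "./run-tests.sh > /tmp/ci.log 2>&1; echo $? > /tmp/ci.exit"'

# poll for completion over fresh SSH connections instead of holding one open
until ssh vm 'test -f /tmp/ci.exit'; do
  sleep 10
done

# replay the log and propagate the exit status
ssh vm 'cat /tmp/ci.log; exit $(cat /tmp/ci.exit)'
```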

@jacob-carlborg
Contributor

Do you have an example of a workflow run that shows it failing?

@Slackadays
Contributor

I saw an example here, https://github.com/microsoft/snmalloc/actions/runs/4124100375/jobs/7122943134, where the job is cancelled, but it looks like the output may have just cut off for a while, prompting the cancellation request.

@jacob-carlborg
Contributor

I don't think a lack of output would cause the workflow to be cancelled; it's hitting the timeout of 25 minutes. I've compared the failing job above with a successful one. Both Test 9 (func-first_operation-fast) and Test 10 (func-first_operation-check) timed out after 400 seconds (6 minutes and 40 seconds); in the successful run they take slightly less than 5 seconds. I also noticed that there's no output for the last 15 minutes before the job times out.

It's difficult to say what the cause is. There are many layers involved. I would hope that if the SSH connection dropped, an exception would be thrown and the action would abort. To me it seems like the VM just stops doing work.
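
For what it's worth, OpenSSH keepalives would at least turn a silently dead connection into a prompt failure instead of a hang until the job timeout. A rough sketch with the plain ssh client (illustrative only, not necessarily how this action invokes SSH; the host alias and script are placeholders):

```sh
# fail after ~60s of silence (4 missed keepalives at 15s intervals)
# rather than hanging until the 25-minute job timeout
ssh -o ServerAliveInterval=15 \
    -o ServerAliveCountMax=4 \
    vm ./run-tests.sh
```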

@davidchisnall
Author

I added some more aggressive timeouts because the jobs were taking a very long time in the cases where they didn't make progress. This one hit the timeout (set to 25 minutes; a successful run takes <15): https://github.com/microsoft/snmalloc/actions/runs/4123711673/jobs/7122600447
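
For context, the timeout is applied at the job level, roughly like this (the job name and runner are illustrative):

```yaml
# cap the job well below the default 360-minute limit
jobs:
  freebsd-test:
    runs-on: macos-latest
    timeout-minutes: 25   # a successful run finishes in under 15
```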

@jacob-carlborg
Contributor

Hmm, this one also stops making progress. I'll see if I can fork the repository and debug the issue.

@davidchisnall
Author

For reference, the perf-contention-fast test takes <20s on the macOS runner itself, yet we sit waiting for the timeout after 20 minutes in the FreeBSD VM on that same runner. On FreeBSD in a Hyper-V VM on a Xeon W-2155, that test takes a shade over 1s, so I can confirm that it doesn't hit any special weirdness with FreeBSD.

@Slackadays
Contributor

I'm getting a slightly similar issue with a couple of my actions where some random component either times out or stops on its own. This time, it was rsync: https://github.com/Slackadays/Clipboard/actions/runs/4154242313/jobs/7186806540

@jacob-carlborg
Contributor

I forked snmalloc but I haven't been able to reproduce the issue yet.

@Slackadays
Contributor

As another data point, I noticed that my FreeBSD issues only happened when using macos-latest and not ubuntu-latest.

@davidchisnall
Author

A more exciting failure today: the job succeeded, but the VM teardown failed, so the runner reported failure.

@knightjoel

> As another data point, I noticed that my FreeBSD issues only happened when using macos-latest and not ubuntu-latest.

I've noticed similar with OpenBSD jobs. Unit tests in a couple of projects I contribute to run fine locally, but fail in weird ways in CI when run on macos-12 hosts (e.g., 'write after free' errors and unexplained segfaults). Switching the host to ubuntu-latest results in consistently clean runs, as in the sketch below. I would run everything on ubuntu-latest, except the performance tradeoff is significant.
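
In workflow terms the switch is just the host runner; a minimal sketch, with placeholder job name, action release, OS version, and test command:

```yaml
# stable-but-slower: QEMU on a Linux host instead of xhyve on macOS
jobs:
  openbsd-tests:
    runs-on: ubuntu-latest   # was macos-12
    steps:
      - uses: actions/checkout@v3
      - uses: cross-platform-actions/action@v0.10.0
        with:
          operating_system: openbsd
          version: '7.2'      # placeholder version
          run: make check     # placeholder test command
```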

@jacob-carlborg
Contributor

jacob-carlborg commented Mar 13, 2023

OpenBSD and FreeBSD use the xhyve hypervisor on macOS runners. The xhyve hypervisor is probably not as battle-tested as QEMU. I could add an option to allow selecting the hypervisor. Then QEMU could be selected on macOS runners, and since it would support hardware-accelerated virtualization there, this should make performance better compared to Linux runners and hopefully make it more stable.

@knightjoel

I'd be happy to give that a whirl and test it on the projects where I had issues using the xhyve hypervisor.

@jacob-carlborg
Contributor

jacob-carlborg commented Apr 3, 2023

@Slackadays @knightjoel I created a new release which supports selecting the hypervisor. Now it's possible to use QEMU (which is the default on Linux runners) on macOS runners for FreeBSD and OpenBSD. Previously they would only use the xhyve hypervisor.

https://github.com/cross-platform-actions/action/releases/tag/v0.11.0
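
Selecting QEMU should look roughly like this (the OS version and run command are placeholders):

```yaml
- uses: cross-platform-actions/action@v0.11.0
  with:
    operating_system: freebsd
    version: '13.1'      # placeholder
    hypervisor: qemu     # select QEMU instead of the xhyve default on macOS
    run: ./run-tests.sh  # placeholder
```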
