Nondeterministic failures #29

Open
davidchisnall opened this issue Feb 8, 2023 · 14 comments

@davidchisnall

I'm trying to use this with microsoft/snmalloc#588, but each run seems to have a high chance of one of the jobs failing by hitting the timeout. The output looks as if the connection is simply dropping. Is it possible that the SSH connection is dropped under high load? Would it be possible to run dtach in the VMs and reconnect if the session is dropped?
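
Something along these lines is what I have in mind; a rough sketch, assuming dtach is available in the guest (the host alias, socket path, script name, and log/exit-file paths are all made up for illustration):

```sh
# start the job detached inside the VM so it survives a dropped SSH session
ssh vm 'dtach -n /tmp/ci.sock sh -c "./run-tests.sh > /tmp/ci.log 2>&1; echo $? > /tmp/ci.exit"'

# poll for completion over fresh SSH connections instead of holding one open
until ssh vm 'test -f /tmp/ci.exit'; do
  sleep 10
done

# replay the log and propagate the exit status
ssh vm 'cat /tmp/ci.log; exit $(cat /tmp/ci.exit)'
```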

@jacob-carlborg
Contributor

Do you have an example of a workflow run that shows it failing?

@Slackadays
Contributor

I saw an example here, https://github.com/microsoft/snmalloc/actions/runs/4124100375/jobs/7122943134, where the job is cancelled, but it looks like the output may have just cut off for a while, prompting the cancellation request.

@jacob-carlborg
Contributor

I don't think a lack of output would cause the workflow to be cancelled; it's hitting the timeout of 25 minutes. I've compared the failing job above with a successful one. Both Test 9 (func-first_operation-fast) and Test 10 (func-first_operation-check) timed out after 400 seconds (6 minutes and 40 seconds); in the successful run they take slightly less than 5 seconds. I also noticed that there's no output for the last 15 minutes before the job times out.

It's difficult to say what the cause is. There are many layers involved. I would hope that if the SSH connection dropped, an exception would be thrown and the action would abort. To me it seems like the VM just stops doing work.
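
For what it's worth, OpenSSH keepalives would at least turn a silently dead connection into a prompt failure instead of a hang until the job timeout. A rough sketch with the plain ssh client (illustrative only, not necessarily how this action invokes SSH; the host alias and script are placeholders):

```sh
# fail after ~60s of silence (4 missed keepalives at 15s intervals)
# rather than hanging until the 25-minute job timeout
ssh -o ServerAliveInterval=15 \
    -o ServerAliveCountMax=4 \
    vm ./run-tests.sh
```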

@davidchisnall
Author

I added some more aggressive timeouts because the jobs were taking a very long time in the cases where they didn't make progress. This one hit the timeout (set to 25 minutes; a successful run takes <15): https://github.com/microsoft/snmalloc/actions/runs/4123711673/jobs/7122600447
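
For context, the timeout is applied at the job level, roughly like this (the job name and runner are illustrative):

```yaml
# cap the job well below the default 360-minute limit
jobs:
  freebsd-test:
    runs-on: macos-latest
    timeout-minutes: 25   # a successful run finishes in under 15
```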

@jacob-carlborg
Contributor

Hmm, this one also stops making progress. I'll see if I can fork the repository and debug the issue.

@davidchisnall
Author

For reference, the perf-contention-fast test takes <20s on the macOS runner itself, yet we sit waiting for the timeout after 20 minutes in the FreeBSD VM on that same runner. On FreeBSD in a Hyper-V VM on a Xeon W-2155, that test takes a shade over 1s, so I can confirm that it doesn't hit any special weirdness with FreeBSD.

@Slackadays
Contributor

I'm getting a slightly similar issue with a couple of my actions where some random component either times out or stops on its own. This time, it was rsync: https://github.com/Slackadays/Clipboard/actions/runs/4154242313/jobs/7186806540

@jacob-carlborg
Contributor

I forked snmalloc but I haven't been able to reproduce the issue yet.

@Slackadays
Contributor

As another data point, I noticed that my FreeBSD issues only happened when using macos-latest and not ubuntu-latest.

@davidchisnall
Author

A more exciting failure today: the job succeeded, but the VM teardown failed, so the runner reported failure.

@knightjoel

> As another data point, I noticed that my FreeBSD issues only happened when using macos-latest and not ubuntu-latest.

I've noticed similar with OpenBSD jobs. Unit tests in a couple of projects I contribute to run fine locally, but fail in weird ways in CI when run on macos-12 hosts (e.g., 'write after free' errors and unexplained segfaults). Switching the host to ubuntu-latest results in consistently clean runs, as in the sketch below. I would run everything on ubuntu-latest, except the performance tradeoff is significant.
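
In workflow terms the switch is just the host runner; a minimal sketch, with placeholder job name, action release, OS version, and test command:

```yaml
# stable-but-slower: QEMU on a Linux host instead of xhyve on macOS
jobs:
  openbsd-tests:
    runs-on: ubuntu-latest   # was macos-12
    steps:
      - uses: actions/checkout@v3
      - uses: cross-platform-actions/action@v0.10.0
        with:
          operating_system: openbsd
          version: '7.2'      # placeholder version
          run: make check     # placeholder test command
```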

@jacob-carlborg
Contributor

jacob-carlborg commented Mar 13, 2023

OpenBSD and FreeBSD use the xhyve hypervisor on macOS runners. The xhyve hypervisor is probably not as battle-tested as QEMU. I could add an option to allow selecting the hypervisor. Then QEMU could be selected on macOS runners, and since it would support hardware-accelerated virtualization there, this should make performance better compared to Linux runners and hopefully make it more stable.

@knightjoel

I'd be happy to give that a whirl and test it on the projects where I had issues using the xhyve hypervisor.

@jacob-carlborg
Contributor

jacob-carlborg commented Apr 3, 2023

@Slackadays @knightjoel I created a new release which supports selecting the hypervisor. Now it's possible to use QEMU (which is the default on Linux runners) on macOS runners for FreeBSD and OpenBSD. Previously they would only use the xhyve hypervisor.

https://github.com/cross-platform-actions/action/releases/tag/v0.11.0
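
Selecting QEMU should look roughly like this (the OS version and run command are placeholders):

```yaml
- uses: cross-platform-actions/action@v0.11.0
  with:
    operating_system: freebsd
    version: '13.1'      # placeholder
    hypervisor: qemu     # select QEMU instead of the xhyve default on macOS
    run: ./run-tests.sh  # placeholder
```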
