FreeBSD tasks signaled to exit, not restarted, likely not OOM ? #592

Closed
jsiwek opened this issue Mar 5, 2020 · 29 comments

@jsiwek

jsiwek commented Mar 5, 2020

Maybe related to #591, but I wasn't sure whether they actually tracked that down to OOM. I've seen several FreeBSD tasks fail with "Signaled to exit!" at various points:

https://cirrus-ci.com/task/5566901552152576
https://cirrus-ci.com/task/6189245568122880
https://cirrus-ci.com/task/4958657070759936
https://cirrus-ci.com/task/6374248700706816
https://cirrus-ci.com/task/6568457424601088

The OOM angle didn't seem right because those two are killed very early on, while just installing dependency packages. I haven't yet checked how much memory to expect at that point, but if installing packages takes more than 8GB, I'd be surprised!

Is there any way to get more information about why those tasks were killed? If it's just preemption, I expected an auto-restart. Or let me know if there are other limits to be aware of, like disk usage, that could explain it (what's the default limit in that case?).

@jsiwek jsiwek added the question label Mar 5, 2020
@fkorotkov
Contributor

Indeed, it seems they were preempted!

Cirrus CI relies on GCE's startup script to bootstrap the Cirrus CI agent on a VM and start executing the task. There were some issues with it on FreeBSD (see #594).

Cirrus CI also relies on the shutdown script to detect preemptions: if the shutdown script is executed while a task is still running, it means that something from the outside killed it, e.g. a potential preemption.

I've checked the most recent task, 4958657070759936, and the VM was preempted but the shutdown script was never executed.

@emaste who is preparing and publishing these VMs? I would like to get in touch with them to add some integration tests verifying that new VMs are compatible with Cirrus CI (i.e. that the startup and shutdown scripts are executed as expected).

@swaldhoer

I'm having the same issue with FreeBSD images. I need to re-run them and then they sometimes work as expected.

@emaste
Contributor

emaste commented Mar 25, 2020

Under discussion on the FreeBSD-cloud mailing list: https://lists.freebsd.org/pipermail/freebsd-cloud/2020-March/000229.html

@swills

swills commented Mar 26, 2020

I just tested a shutdown script using the 12.1 RELEASE image and the latest 13-CURRENT image and the shutdown script was run, so I'm not sure what's happening.

@emaste
Contributor

emaste commented Mar 26, 2020

@fkorotkov where does "Signaled to exit!" come from? Is it from a Cirrus script?

@fkorotkov
Contributor

@swills how did you test it?

@emaste "Signaled to exit!" comes from the Cirrus agent and means that it was killed from the outside before task execution finished. In this case, that happened because the VM was shut down.

@swills

swills commented Mar 26, 2020

@fkorotkov I created a VM with shutdown-script custom metadata set to:

#!/bin/sh
date >> /tmp/shutdown.txt

and then saw a new entry appear for each time I shut the VM down.

@swills

swills commented Mar 26, 2020

@fkorotkov can you share your shutdown script please?

@emaste
Contributor

emaste commented Mar 26, 2020

@fkorotkov ah, so the "Signaled to exit!" is from the standard FreeBSD shutdown mechanism terminating all processes.

I agree that having FreeBSD-Cirrus CI integration tests would be very valuable; do you have suggestions on how to test the operation of the Cirrus shutdown script?

@fkorotkov
Contributor

I've just tried it myself and I see that the shutdown script was executed. Digging through the GCP docs, I found a limitation of shutdown scripts:

Compute Engine executes shutdown scripts only on a best-effort basis. In rare cases, Compute Engine cannot guarantee that the shutdown script will complete.

So GCP itself does not guarantee that the script will be executed all the time.

On second thought, I've just added an additional check for this "Signaled to exit!" message, so in a little while it will serve as an additional indication of preemption and detection will no longer rely completely on the shutdown script. I'll update the issue once the change is deployed.
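
To illustrate the idea of using two independent signals, here is a minimal sketch in POSIX sh. It is not Cirrus CI's actual implementation: the function name `looks_preempted`, the marker file, and the agent log path are all hypothetical and only stand in for "the shutdown script ran" and "the agent logged that it was killed".

```shell
#!/bin/sh
# Hypothetical sketch: treat a task as preempted if EITHER signal is present.
#   $1 - marker file that a shutdown script would create (assumed path)
#   $2 - agent log for the task (assumed path)
looks_preempted() {
    marker_file=$1
    agent_log=$2

    # Signal 1: the shutdown script ran and left its marker behind.
    if [ -e "$marker_file" ]; then
        return 0
    fi

    # Signal 2: the agent logged that it was killed from the outside.
    if [ -f "$agent_log" ] && grep -qF 'Signaled to exit!' "$agent_log"; then
        return 0
    fi

    return 1
}
```

A caller could then decide to reschedule with something like `looks_preempted /tmp/shutdown.txt agent.log && echo "reschedule task"`. Checking the log line as well covers the case where the best-effort shutdown script never ran.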

@emaste
Contributor

emaste commented Mar 26, 2020

I don't know why GCP has that limitation, but you could hook into FreeBSD's standard shutdown system to report that the instance is terminating - i.e. a /usr/local/etc/rc.d/cirrus script with a stop_cmd
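
A rough sketch of what such an rc.d script might look like follows. Everything specific in it is an assumption made for illustration: the marker path, the `cirrus_marker` variable, and the fact that the hook only appends a timestamp to a file rather than reporting to Cirrus CI. The `if` guard around `/etc/rc.subr` is there only so the file can be checked on non-FreeBSD systems.

```shell
#!/bin/sh
#
# PROVIDE: cirrus
# REQUIRE: NETWORKING
# KEYWORD: shutdown
#
# Hypothetical sketch of /usr/local/etc/rc.d/cirrus. "KEYWORD: shutdown"
# asks rc.shutdown(8) to run this script's stop action when the system halts.

name="cirrus"
start_cmd=":"            # nothing to do at boot
stop_cmd="cirrus_stop"   # invoked on shutdown; no rcvar, so always active

# Assumed marker location; a real hook would notify Cirrus CI instead.
: "${cirrus_marker:=/var/run/cirrus_shutdown}"

cirrus_stop()
{
    # Record that the instance is going down, and when.
    echo "instance terminating at $(date -u '+%Y-%m-%dT%H:%M:%SZ')" >> "${cirrus_marker}"
}

# rc.subr only exists on FreeBSD; the guard lets the file be linted elsewhere.
if [ -r /etc/rc.subr ]; then
    . /etc/rc.subr
    load_rc_config "$name"
    run_rc_command "$1"
fi
```

Because the hook goes through the standard shutdown sequence, it does not depend on GCE's best-effort shutdown script being run.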

@swills

swills commented Mar 31, 2020

What is the status of this?

@fkorotkov
Contributor

The change adding a check besides the stop hook was deployed, and it seems preemptions are now detected again.

@emaste
Contributor

emaste commented Apr 18, 2020

Seems I'm still getting an occasional "Signaled to exit!" failure, e.g. https://cirrus-ci.com/task/5995876548083712

@Minoru

Minoru commented Apr 24, 2020

Me too. Saw one yesterday, two more just now:

It happens while running pkg install. I haven't changed this command for months, and it works the majority of the time, so it's unclear why it failed now. I restarted both jobs, and I'm pretty sure they're going to execute fine the second time.

@fkorotkov fkorotkov reopened this Apr 24, 2020
@swills

swills commented May 5, 2020

Any update here?

@timwoj

timwoj commented Jun 11, 2020

Any updates on this? We continue to see jobs get randomly killed, including non-FreeBSD ones. See the test step of https://cirrus-ci.com/task/5904400537354240, for example.

@RDIL
Contributor

RDIL commented Jun 11, 2020

Just to confirm, this happens on FreeBSD and which other OSes?

@fkorotkov
Contributor

Sorry, I missed @swills' message. I think the original issue should be fixed as of early May, when Cirrus, besides the stop hook, started checking VM statuses before deleting them and detecting preemption in another way. I'll try to collect more data to verify it.

@timwoj's case looks different. It's Linux, and I see that the corresponding container wasn't preempted. Since the task continued executing, the test returned exit code 0, but the wait status of the shell process also indicated that it was signaled, which is weird. Could it be that your script exited in an uncommon way?
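
For readers unfamiliar with the distinction between a normal exit and being signaled, here is a small POSIX sh sketch. It only shows how a parent shell sees the two cases and is not how the Cirrus agent inspects wait statuses; the helper name `describe_status` is hypothetical. It assumes a common shell (sh, bash, dash) that reports "killed by signal N" as status 128 + N.

```shell
#!/bin/sh
# Hypothetical helper: explain an exit status as seen by the parent shell.
describe_status() {
    code=$1
    if [ "$code" -gt 128 ]; then
        # Common shells encode "killed by signal N" as 128 + N.
        echo "signaled: $((code - 128))"
    else
        echo "exited: $code"
    fi
}

# A child that terminates itself with SIGTERM (signal 15).
status=0
sh -c 'kill -TERM $$' || status=$?
describe_status "$status"

# A child that exits normally.
status=0
sh -c 'exit 0' || status=$?
describe_status "$status"
```

A process is normally reported as either exited or signaled, not both, which is why seeing exit code 0 together with a "signaled" wait status is surprising.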

@Minoru

Minoru commented Jun 11, 2020

I think the original issue should be fixed as of early May

I'm still restarting about one FreeBSD job in ten. I'll start collecting URLs of failing tasks again, so you can take a look.

@fkorotkov
Contributor

fkorotkov commented Jun 11, 2020

@Minoru, sorry to hear that. Links will be helpful. 🙌

@Minoru

Minoru commented Jun 16, 2020

Of course now that I'm waiting for failures, they happen less often… But here's one that occurred just now (with FreeBSD): https://cirrus-ci.com/task/6014718282301440

I also had a couple Linux jobs fail in the same way:

I'll keep collecting them in hopes of amassing enough info that the root cause becomes obvious.

@Minoru

Minoru commented Jun 17, 2020

@fkorotkov
Contributor

The issue was that VMs were getting preempted (shutting down due to preemption and causing "Signaled to exit!") while the GCE API was not showing the VM as preempted. I've implemented another mechanism for detecting such cases. Let's see how it goes. 🤞

@Minoru

Minoru commented Jul 2, 2020

It's been two weeks, and I haven't seen this issue pop up in that time. I'd consider it fixed now. Thanks for the hard work, @fkorotkov!

@fkorotkov
Contributor

Awesome! Glad to hear it! 🙌

@aviramha

Hi, sorry to be the party pooper, but it seems like it's still happening:
https://cirrus-ci.com/task/5277731556425728

@swaldhoer

I think it's also happening again for me:

https://cirrus-ci.com/task/4865315773349888

@swills

swills commented Aug 31, 2020

I'm seeing it as well:

https://cirrus-ci.com/task/6295815264141312
