Pipeline and CI sometimes hitting EAGAIN (Resource temporarily unavailable) on fork #1186
We're seeing this also in the rpm-ostree CI: coreos/rpm-ostree#3639 (comment)
Isn't this just ye olde kubernetes-vs-nproc https://www.flamingspork.com/blog/2020/11/25/why-you-should-use-nproc-and-not-grep-proc-cpuinfo/ that we've patched in a few places, but probably not enough? I think classically still today
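For context, the distinction that post draws, as a minimal shell sketch (the numbers are illustrative):

```sh
# On a 64-CPU host with the workload pinned to 4 CPUs (illustrative numbers):
grep -c '^processor' /proc/cpuinfo   # 64: /proc/cpuinfo always shows the host's CPUs
nproc                                # 4: honors the CPU affinity mask (though not CFS quota limits)
make -j"$(nproc)"                    # size build parallelism from nproc, not from cpuinfo
```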
I'm not sure if it is the same root cause, but Zincati CI was also endlessly deadlocking: coreos/zincati#758 (comment). At the moment it seems to be back on track.
We're seeing issues where the pipeline sometimes hits EAGAIN on `fork`: coreos/fedora-coreos-tracker#1186 This may have started happening more frequently after the recent addition of VirtualBox. Anyway, it's kinda silly to build that many images in parallel. There's diminishing returns and eventually a performance cost. So let's do it in two rounds instead. (Ideally instead it'd be a kind of task pool, but let's not try to implement that in Groovy.)
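The idea of the change, sketched in shell rather than the actual Groovy pipeline code (the artifact names here are illustrative, not the real list from the Jenkinsfile):

```sh
# Round 1: kick off a first batch of image builds and wait for all of them.
for artifact in metal metal4k live; do
    cosa buildextend-"$artifact" &
done
wait

# Round 2: only start the next batch once the first has finished, so we never
# have every buildextend process forking at the same time.
for artifact in aws gcp vmware virtualbox; do
    cosa buildextend-"$artifact" &
done
wait
```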
My recollection of that is that our pod CPU request wasn't matching the parallel threads we were hitting. But here it seems like we're hitting some PID limit, which is obviously related to the amount of parallelism we do, but is independent of the pod's CPU request. But yeah, clearly one thing we could do is lower the parallelism. OK did that in:
Hmm, I was reading https://kubernetes.io/docs/concepts/policy/pid-limiting/. I wonder if the Fedora cluster admins recently started using some of the PID limit switches mentioned there too. Will ask around.
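If the admins did flip one of those switches, it should show up as a pids limit on the pod cgroups. A rough way to check (a sketch; exact cgroup paths depend on the cgroup version and driver, so treat these as assumptions):

```sh
# From inside the container: the nearest pids controller limit visible to us.
# "max" means unlimited at this level (a limit could still exist on a parent
# cgroup, e.g. the pod-level cgroup where kubelet's podPidsLimit is applied).
cat /sys/fs/cgroup/pids/pids.max 2>/dev/null   # cgroup v1
cat /sys/fs/cgroup/pids.max 2>/dev/null        # cgroup v2
# Current usage at the same level, for comparison:
cat /sys/fs/cgroup/pids/pids.current 2>/dev/null || cat /sys/fs/cgroup/pids.current 2>/dev/null
```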
We're seeing issues where CI sometimes hits EAGAIN on `fork`: coreos/fedora-coreos-tracker#1186 It's possible something changed recently that made rpm-ostree use more threads than before and more likely to hit this. There's no way for us to control the PID limit from our side (it's a node setting), so let's just lower parallelism here. (Another approach is splitting it into multiple pods, but it doesn't seem worth the complexity yet.)
OK, cosa also hitting this now:
This bit:
makes me think this is very unlikely to be us hitting the limit at the pod level, but instead either at the configured kubelet level or the actual host level.
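For reference, the host-level knobs in question can be read directly on a node (a sketch; nothing here is specific to this cluster):

```sh
# Host-wide limits (these affect every pod on the node):
sysctl kernel.pid_max       # highest PID value the kernel will assign
sysctl kernel.threads-max   # maximum number of tasks (threads) system-wide
ulimit -u                   # RLIMIT_NPROC: per-user cap on processes/threads, also a classic source of fork() EAGAIN
```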
We know how to increase the limits; we can create a MachineConfig. But we're just wondering: if there is some issue in the pipeline, increasing the limit may not be the right fix here. Is there a way to set the limit somehow within Jenkins instead?
I think we do need to try to limit parallelism inside our workloads, which... I don't think Jenkins can help with, but chasing down things that are trying to create too many CPU threads would. We also have the generic problem that we have absolutely no prioritization going on: when e.g. Dependabot appears on Monday it creates a storm of PRs which compete for resources with the production FCOS jobs, AFAIK. Perhaps another approach is to use Kata Containers by default; that will inherently create a small VM, which I think will limit what appears to be the "physical" CPU count, and things like the Go runtime will only see e.g. 4 processors and not 64 or whatever the host has.
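Short of Kata, one stopgap on the Go-runtime side is to pin GOMAXPROCS in the CI environment rather than letting it default to the host CPU count; a minimal sketch (the value 4 is arbitrary, and this only helps Go programs):

```sh
# GOMAXPROCS bounds how many OS threads run Go code simultaneously; without it
# the runtime sizes itself from the host's CPU count (e.g. 64) even in a small pod.
# It does not bound threads blocked in syscalls/cgo, so it's a mitigation, not a hard cap.
export GOMAXPROCS=4
```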
OK this is super confusing. We were investigating this a bit yesterday with @darknao, who was monitoring. Today I wanted to dig in more and so pushed:

```diff
diff --git a/build.sh b/build.sh
index 0629db4a5..003d08fd7 100755
--- a/build.sh
+++ b/build.sh
@@ -96,7 +96,11 @@ install_ocp_tools() {
 }
 
 make_and_makeinstall() {
+    while sleep 1; do ps -eLf | wc -l | xargs echo threads:; done &
+    local pid=$!
+    export GODEBUG=schedtrace=1000
     make && make install
+    kill $pid
 }
 
 configure_user(){
```

and ran a build through CoreOS CI (https://jenkins-coreos-ci.apps.ocp.fedoraproject.org/job/coreos-assembler/job/try/1/console). At its peak, I got:
So ... But also, the total number of threads was 74, which is a far cry from 600, and even farther from the default 1024 limit. Hmm, I wonder if this is somehow specific to some nodes only, which would help explain why we don't hit this more often. @davidkirwan Do all the nodes have the same specs?
@jlebon nope, it's a bit of a hodgepodge. We have the following machines with the following physical cores; each can have 2-4 threads per core. It varies.
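A quick way to compare topology across that mix of nodes (just a generic lscpu check, not something from the thread):

```sh
# Run on each node: total CPUs, sockets, cores per socket, and SMT threads per core
lscpu | grep -E '^(CPU\(s\)|Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core)'
```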
Encountered this during a build of the `next` stream:

Hitting up against limits somehow? My first thought was we're not reaping zombies, but we do have:
Let's keep a close eye on this and see if it happens again.
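For the zombie theory, one quick check that could be run in the affected pod (a sketch, not from the original report):

```sh
# Zombie processes show up with state "Z"; a steadily growing list would point
# at a missing reaper, while an empty list points back at a PID/thread limit.
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'
```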