buildkitd stuck at 100% CPU for 5 minutes during SAVE ARTIFACT for ~20 MiB output #1187

Open
mologie opened this issue Aug 25, 2021 · 20 comments
Labels: type:bug (Something isn't working)

Comments

@mologie

mologie commented Aug 25, 2021

Versions: earthly 0.5.22 (in --ci mode), docker 20.10.7

Hi, I am trying to migrate an existing GitLab CI pipeline to Earthly, but I am hitting a major performance roadblock. I would be happy to contribute to a solution here, but as it stands I would need some pointers to start diagnosing this properly.

Symptoms: A cross-compilation task that takes 9 minutes with Docker takes 14 minutes with Earthly on the same build machine. Splitting it up into 4 sub-tasks, each of which takes roughly 2m20s individually with Docker, makes each sub-task take over 7 minutes with Earthly!

I found that the time difference is explained by the SAVE ARTIFACT step. Earthly prints "ongoing" messages, and buildkitd shows 100% CPU usage. It eventually completes and produces the correct result:

[screenshot of the build output]

The container from which the artifacts are saved weighs roughly 8 GiB. However, the exported artifacts are only about 20 MiB.

Indeed, the average read rate of ~22 MiB/s sustained over roughly 5 minutes works out to ~6.5 GiB, which suggests that it is reading/exporting the whole 8 GiB build container and thereby gets CPU-bottlenecked by compression (note that it writes only ~6 MiB/s).

Is my assumption plausible? If so, how can I prevent it, or could Earthly be improved to directly export artifacts over some sidechannel instead of reading in the whole build image?
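One way to sanity-check this hypothesis would be to watch buildkitd's CPU and disk I/O while SAVE ARTIFACT runs. A rough sketch, assuming the default earthly-buildkitd container name and that sysstat is available on the host:

    # Live CPU and block I/O of the buildkitd container:
    docker stats earthly-buildkitd

    # Per-process read/write rates on the host, sampled every 5 seconds:
    pidstat -d -p "$(pgrep buildkitd)" 5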

@vladaionescu added the type:bug (Something isn't working) label on Aug 28, 2021
@vladaionescu
Member

Hi @mologie, your assumptions could be correct. A few questions:

  • Just curious: is the cross-compilation emulation-based, or does it run natively? (Is FROM --platform=... being used?)
  • The container image does seem pretty big. I assume you're not doing any SAVE IMAGE anywhere else?
  • I assume that the findings you're seeing so far are taking place on your own computer? What OS are you on?
  • We can help with debugging this. Do you have a way to give us an Earthfile with a minimal reproduction of this issue?

@mologie
Author

mologie commented Sep 1, 2021

Hi @vladaionescu, thank you for your response and sorry for the delay on my end. The CI changes I am working on are not a high priority for my team right now, but rest assured this is being worked on. We love Earthly so far and hope to be able to contribute back here.

Thank you first of all for confirming I /could/ be on the right track. I will continue investigating!

If you would rather not keep this open with the sparse info I have given, feel free to close it; otherwise I will keep documenting my findings here, probably starting next week.

My steps to reproduce will likely have to contain an Earthfile combined with a Fedora CoreOS machine config and shell scripts to spawn a VM with qemu or vmware-vmx. I have not yet been able to reproduce this outside of FCOS, e.g. with just Ubuntu and Docker.


Just to answer your questions in advance:

Just curious: is the cross-compilation emulation based? Or does it run natively? (is FROM --platform=... being used?)

It runs natively, FROM --platform is not used.

The container image does seem pretty big. I assume you're not doing any SAVE IMAGE anywhere else?

An image for the cross-compiler is collected with SAVE IMAGE --cache-from=x:foo x:bar. The build step itself, which has a FROM +crossenv, does not have a SAVE IMAGE, only a SAVE ARTIFACT, which takes forever. I am working on isolating an Earthfile + VM to reproduce the issue.
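In sketch form, the structure is roughly the following (placeholder image, package, and path names; the real Earthfile is more involved):

    crossenv:
        FROM ubuntu:20.04
        # Toolchain and SDK layers grow this image to roughly 8 GiB.
        RUN apt-get update && apt-get install -y crossbuild-essential-arm64
        SAVE IMAGE --cache-from=x:foo x:bar

    build:
        FROM +crossenv
        COPY src src
        RUN make -C src                        # native cross-compilation, no --platform
        SAVE ARTIFACT src/out AS LOCAL ./dist  # ~20 MiB of output, yet takes ~5 minutes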

I assume that the findings you're seeing so far are taking place on your own computer? What OS are you on?

Negative: the issue can be observed only in CI, not on my computer. CI runs Fedora CoreOS 34.20210808.3.0 with Docker 20.10.7.

We can help with debugging this. Do you have a way to give us an Earthfile with a minimal reproduction of this issue?

I appreciate the offer! So far I have failed to find a reliable minimal reproducible example, because it looks like something in my CI environment is the deciding factor, as described above. All builds in our CI environment are slower with Earthly (compared to just running Docker), but the massive difference is only observed for the cross-build with the large build container.

@vladaionescu
Member

Some more things to try:

  • If you try to use earthly --artifact +your-target/artifact ./some/dest/path, can you reproduce the issue? This command only performs the artifact output. If it is fast, perhaps there is some other unexpected exporting taking place.
  • We have a new hidden feature (SAVE IMAGE is slow, even when there's no work to be done #500 (comment)) that allows faster image exports. I wonder if this new feature somehow magically solves the issue. To enable it, you can run this in your CI:
    earthly config global.local_registry_host 'tcp://127.0.0.1:8371'

    (You can remove the created local_registry_host entry from the Earthly config file to go back to normal; see the sketch after this list.)
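In CI-script form, those two experiments could look roughly like this (the +build/app artifact reference and ./dist/ destination are placeholders; the config file path assumes the default ~/.earthly/config.yml):

    # 1) Output only the artifact; if this is fast, the slowness is in some other export step:
    earthly --artifact +build/app ./dist/

    # 2) Enable the experimental local registry for faster exports, then re-run the build:
    earthly config global.local_registry_host 'tcp://127.0.0.1:8371'
    earthly --ci +build

    # To revert, delete the local_registry_host line from ~/.earthly/config.yml.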

@mxey

mxey commented Oct 8, 2021

I have a similar issue, which seems tied to pushing an image to the registry. All the build steps are cached and then Earthly takes a while before it begins pushing the image. This is running with --ci, so there should be no artifacts exported, right?

+test-headless-exporter | *cached* --> RUN ./headless-exporter.test
             ongoing | 
             ongoing | 
             ongoing | 
              output | --> exporting outputs
              output | [          ] pushing layers ... 0%
              output | [██████████] pushing layers ... 100%

During the "ongoing" lines, buildkitd is running at 100% CPU (apparently using only a single thread).

@mxey

mxey commented Oct 8, 2021

If I remove --push from my call, it does not happen. If I add --push, it takes forever before it starts pushing layers.

The image that is eventually pushed to the registry is only about 100 MB

@mxey

mxey commented Oct 8, 2021

If I replace --ci with --no-output --strict --use-inline-cache, the problem goes away. So I assume the issue is with --save-inline-cache.
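For comparison, the two invocations look roughly like this (+build is a placeholder target; --ci additionally implies --save-inline-cache):

    # slow: --ci implies --no-output --strict --use-inline-cache --save-inline-cache
    earthly --ci --push +build

    # fast: the same flags minus --save-inline-cache
    earthly --no-output --strict --use-inline-cache --push +build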

@vladaionescu
Member

So I assume the issue is with --save-inline-cache.

It's very possible - I think I've noticed that correlation too in the past. Could be some internal buildkit cache prep routine, but not sure.

@mariusvniekerk

I've started seeing this on my builds now too. Is there anything potentially useful I can do to help diagnose this behavior?

@mariusvniekerk

Tailing the logs of buildkitd, I see something like:

time="2021-12-13T14:50:53Z" level=info msg="response completed" go.version=go1.17.3 http.request.host="127.0.0.1:8371" http.request.id=7bfd83bb-c3b6-4938-9b50-41d345720006 http.request.method=GET http.request.remoteaddr="172.17.0.1:41320" http.request.uri=/v2/ http.request.useragent="docker/20.10.9 go/go1.16.8 git-commit/79ea9d3 kernel/4.14.248-189.473.amzn2.x86_64 os/linux arch/amd64 UpstreamClient(Docker-Client/20.10.9 \\(linux\\))" http.response.contenttype="application/json; charset=utf-8" http.response.duration=1.136363ms http.response.status=200 http.response.written=2
time="2021-12-13T14:50:53Z" level=info msg="response completed" go.version=go1.17.3 http.request.host="127.0.0.1:8371" http.request.id=e3bc87cf-4c75-4a31-b0f4-0c03d664f9a5 http.request.method=HEAD http.request.remoteaddr="172.17.0.1:41324" http.request.uri=/v2/sess-rhr6rr47nbfjm29l31tk2dzwy/sp/manifests/img10 http.request.useragent="docker/20.10.9 go/go1.16.8 git-commit/79ea9d3 kernel/4.14.248-189.473.amzn2.x86_64 os/linux arch/amd64 UpstreamClient(Docker-Client/20.10.9 \\(linux\\))" http.response.contenttype=application/vnd.docker.distribution.manifest.v2+json http.response.duration=1.471681ms http.response.status=200 http.response.written=740
time="2021-12-13T14:50:53Z" level=info msg="response completed" go.version=go1.17.3 http.request.host="127.0.0.1:8371" http.request.id=70a11798-b5b2-431f-9bca-b80f68df2036 http.request.method=GET http.request.remoteaddr="172.17.0.1:41332" http.request.uri=/v2/ http.request.useragent="docker/20.10.9 go/go1.16.8 git-commit/79ea9d3 kernel/4.14.248-189.473.amzn2.x86_64 os/linux arch/amd64 UpstreamClient(Docker-Client/20.10.9 \\(linux\\))" http.response.contenttype="application/json; charset=utf-8" http.response.duration=1.077896ms http.response.status=200 http.response.written=2
time="2021-12-13T14:50:53Z" level=info msg="response completed" go.version=go1.17.3 http.request.host="127.0.0.1:8371" http.request.id=4a011475-52e5-4b42-8b25-9ab60fcad984 http.request.method=HEAD http.request.remoteaddr="172.17.0.1:41336" http.request.uri=/v2/sess-rhr6rr47nbfjm29l31tk2dzwy/sp/manifests/img1 http.request.useragent="docker/20.10.9 go/go1.16.8 git-commit/79ea9d3 kernel/4.14.248-189.473.amzn2.x86_64 os/linux arch/amd64 UpstreamClient(Docker-Client/20.10.9 \\(linux\\))" http.response.contenttype=application/vnd.docker.distribution.manifest.v2+json http.response.duration=1.565915ms http.response.status=200 http.response.written=2194
time="2021-12-13T14:50:54Z" level=error msg="reading from connection failed: EOF" app=shellrepeater
time="2021-12-13T14:50:54Z" level=debug msg="received shell connection" app=shellrepeater remote.addr="172.17.0.1:59868"
time="2021-12-13T14:50:54Z" level=debug msg="closing existing shell connection" app=shellrepeater remote.addr="172.17.0.1:59868"
time="2021-12-13T14:50:56Z" level=info msg="trying next host - response was http.StatusNotFound"
time="2021-12-13T14:51:03Z" level=warning msg="failed to match any cache with layers"
time="2021-12-13T14:51:06Z" level=error msg="reading from connection failed: EOF" app=shellrepeater
time="2021-12-13T14:51:06Z" level=debug msg="received shell connection" app=shellrepeater remote.addr="172.17.0.1:59910"
time="2021-12-13T14:51:06Z" level=debug msg="closing existing shell connection" app=shellrepeater remote.addr="172.17.0.1:59910"
time="2021-12-13T14:51:08Z" level=info msg="trying next host - response was http.StatusNotFound"
time="2021-12-13T14:53:39Z" level=error msg="reading from connection failed: EOF" app=shellrepeater
time="2021-12-13T14:53:39Z" level=debug msg="received shell connection" app=shellrepeater remote.addr="172.17.0.1:60130"
time="2021-12-13T14:53:39Z" level=debug msg="closing existing shell connection" app=shellrepeater remote.addr="172.17.0.1:60130"
time="2021-12-13T14:53:40Z" level=info msg="trying next host - response was http.StatusNotFound"
time="2021-12-13T14:54:46Z" level=error msg="reading from connection failed: EOF" app=shellrepeater
time="2021-12-13T14:54:46Z" level=debug msg="received shell connection" app=shellrepeater remote.addr="172.17.0.1:60242"
time="2021-12-13T14:54:46Z" level=debug msg="closing existing shell connection" app=shellrepeater remote.addr="172.17.0.1:60242"
time="2021-12-13T14:54:47Z" level=info msg="trying next host - response was http.StatusNotFound"
time="2021-12-13T14:57:41Z" level=error msg="reading from connection failed: EOF" app=shellrepeater
time="2021-12-13T14:57:42Z" level=debug msg="received shell connection" app=shellrepeater remote.addr="172.17.0.1:60506"
time="2021-12-13T14:57:42Z" level=debug msg="closing existing shell connection" app=shellrepeater remote.addr="172.17.0.1:60506"
time="2021-12-13T14:57:43Z" level=info msg="trying next host - response was http.StatusNotFound"
time="2021-12-13T15:00:59Z" level=info msg="PurgeUploads starting: olderThan=2021-12-06 15:00:59.693646652 +0000 UTC m=-602519.477236891, actuallyDelete=true"
time="2021-12-13T15:00:59Z" level=info msg="Purge uploads finished.  Num deleted=0, num errors=1"
time="2021-12-13T15:00:59Z" level=info msg="Starting upload purge in 24h0m0s" go.version=go1.17.3

At this point buildkitd is using 300% CPU and just spinning

@mariusvniekerk

mariusvniekerk commented Dec 13, 2021

$ perf top -g
+   73.95%     5.75%  buildkitd                [.] github.com/moby/buildkit/cache/remotecache/v1.(*normalizeState).checkLoops                                 ▒
+   24.92%     0.08%  perf                     [.] hist_entry_iter__add                                                                                       ▒
+   19.30%    19.28%  buildkitd                [.] runtime.mapiternext                                                                                        ▒
+   14.80%     4.22%  buildkitd                [.] runtime.mapiterinit                                                                                        ▒
+   13.62%    13.61%  buildkitd                [.] aeshashbody                                                                                                ▒
+   13.51%    13.50%  buildkitd                [.] runtime.mapaccess2_faststr                                                                                 ▒
+    8.53%     0.35%  buildkitd                [.] github.com/moby/buildkit/cache/remotecache/v1.(*normalizeState).checkLoops.func1                           ▒
+    7.76%     0.04%  perf                     [.] callchain_append                                                                                           ▒
+    5.28%     0.00%  buildkitd                [.] runtime.goexit.abi0                                                                                        ▒
+    5.28%     0.00%  buildkitd                [.] google.golang.org/grpc.(*Server).serveStreams.func1.2                                                      ▒
+    5.28%     0.00%  buildkitd                [.] google.golang.org/grpc.(*Server).handleStream                                                              ▒
+    5.28%     0.00%  buildkitd                [.] google.golang.org/grpc.(*Server).processUnaryRPC                                                           ▒
+    5.28%     0.00%  buildkitd                [.] github.com/moby/buildkit/api/services/control._Control_Solve_Handler                                       ▒
+    5.28%     0.00%  buildkitd                [.] github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1                                        ▒
+    5.28%     0.00%  buildkitd                [.] github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1                                    ▒
+    5.28%     0.00%  buildkitd                [.] main.unaryInterceptor.func1                                                                                ▒
+    5.28%     0.00%  buildkitd                [.] go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1   ▒
+    5.28%     0.00%  buildkitd                [.] github.com/moby/buildkit/util/grpcerrors.UnaryServerInterceptor                                            ▒
+    5.28%     0.00%  buildkitd                [.] github.com/moby/buildkit/api/services/control._Control_Solve_Handler.func1                                 ▒
+    5.28%     0.00%  buildkitd                [.] github.com/moby/buildkit/control.(*Controller).Solve                                                       ▒
+    5.28%     0.00%  buildkitd                [.] github.com/moby/buildkit/solver/llbsolver.(*Solver).Solve                                                  ▒
+    5.28%     0.00%  buildkitd                [.] github.com/moby/buildkit/solver/llbsolver.inlineCache                                                      ▒
+    5.28%     0.00%  buildkitd                [.] github.com/moby/buildkit/cache/remotecache/inline.(*exporter).ExportForLayers                              ▒
+    5.28%     0.00%  buildkitd                [.] github.com/moby/buildkit/cache/remotecache/v1.(*CacheChains).Marshal                                       ▒
+    5.28%     0.00%  buildkitd                [.] github.com/moby/buildkit/cache/remotecache/v1.(*CacheChains).normalize                                     ▒
+    5.28%     0.00%  buildkitd                [.] github.com/moby/buildkit/cache/remotecache/v1.(*normalizeState).removeLoops                                ▒
+    5.27%     5.26%  buildkitd                [.] runtime.mapdelete_faststr                                                                                  ▒
+    4.68%     4.68%  buildkitd                [.] runtime.mapassign_faststr                                

@mariusvniekerk

Might be related to moby/buildkit#2009.

@mariusvniekerk

Additionally, when cancelling the build, the buildkitd instance stays busy doing whatever it was stuck on.

@m0ar

m0ar commented Feb 17, 2022

This is happening for us as well in the ongoing | cache stage, before any actual pushing happens. Last night I had to cancel a CI run after 90 minutes, spent on the cache for building one small service 🧐 The actual work was done in ~2 minutes 😂

I have removed all cache hints and am still seeing the issue as soon as I add the --push flag to the earthly command. Have any of you managed to work around this, @mariusvniekerk @mologie?

@mariusvniekerk

My only workaround is to not use the cache features
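Concretely, that means dropping the inline/remote cache flags from the invocation, e.g. (placeholder target name):

    # instead of: earthly --ci --push +build   (or adding --remote-cache=...)
    earthly --no-output --strict --push +build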

@vladaionescu
Member

I wonder if there's some kind of infinite loop in the Buildkit code on computing the inline cache metadata. If anyone has a minimal reproduction Earthfile that we can run, it would help tremendously to pinpoint the exact cause.
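For anyone who wants to try, a candidate reproduction might be as small as the following (entirely hypothetical names; the idea is a multi-gigabyte intermediate image, a tiny SAVE ARTIFACT, and inline cache enabled via --ci or --save-inline-cache):

    VERSION 0.6

    bigbase:
        FROM ubuntu:20.04
        # Several large layers, to stress the inline-cache metadata export.
        RUN dd if=/dev/urandom of=/blob1 bs=1M count=512
        RUN dd if=/dev/urandom of=/blob2 bs=1M count=512
        RUN dd if=/dev/urandom of=/blob3 bs=1M count=512

    build:
        FROM +bigbase
        RUN echo "tiny output" > /out.txt
        SAVE ARTIFACT /out.txt AS LOCAL ./out.txt
        SAVE IMAGE --push registry.example.com/repro:latest

Then compare earthly --ci --push +build against the same command without --ci.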

@m0ar

m0ar commented Feb 18, 2022

@vladaionescu I'm using explicit caching (without any additional cache hints), so it's probably not just happening with inline caching.

earthly --push --strict --remote-cache=europe-docker.pkg.dev/path/to/image:tag +build

@brumhard

brumhard commented May 9, 2022

+1 for this issue. It really took some time to figure this out, since the build is just stuck forever without any error when using --ci or --save-inline-cache, respectively.
Maybe a timeout for this step could also help here as a quick win.

@vladaionescu
Member

We have identified an issue and have fixed it in Earthly v0.6.20. Closing this as resolved.

@vladaionescu
Member

There are signs that this issue still reproduces even after the fix. Reopening...

@vladaionescu reopened this on Jul 18, 2022
@sam-utila

Having the same problem in GitHub Actions: 2 hours with --ci, 10 minutes without.

@vladaionescu mentioned this issue on Oct 25, 2022