What went wrong?
Original observation: when running earthly tests (+test against main @ 8bb5634), executions that were cached took a long time to finish. On every execution, there was a significant delay between "exporting outputs" and the actual completion of the build. Shortly after the "exporting outputs" message, the "Waiting on Buildkit... (Utilization: 0 other builds, 0/20 op load)" message appears and remains until the build ends. During this delay, which is measured in minutes (and accounts for half of the total build time), CPU was noticeably and consistently elevated, and memory utilization quadrupled. The memory eventually dropped a few minutes after the build was complete.
The behavior itself is 100% reproducible. However, to this point it has only been noticeable while running the earthly tests, and when "fully cached". It has not been noticeable or detrimental in my typical day-to-day usage, though I suspect the problem is still there.
Earthly's own Earthfile is the best reproducer at this point. However, in order to get an apples-to-apples comparison across versions (see below), some specific tests had to be disabled. I can provide the diff, but I don't think it's relevant.
What should have happened?
The expectation is that "fully cached" builds are fast and that delays have logical explanations.
So far, the only workaround appears to be downgrading to version 0.7.2.
Other Helpful Information
After experimentation and discussion with others, I believe this is due to buildkit changes between the earthly 0.7.2 and 0.7.4 releases, and not the earthly binary itself. My original observations called out increased memory and CPU, but I now believe that increased disk writes are the real issue (with memory/CPU being symptoms, or sharing a common root cause).
At one point, the test suite took only 18 seconds when fully cached, with 0.6.30 being the first release to include those changes. Therefore I started my reproduction/investigation with a 0.6.30 client and corresponding tag as a baseline, with buildkit version being variable.
In both tables, only buildkit versions 0.7.2 and 0.7.3 are relevant, but additional versions are included to help illustrate the patterns. The last three columns are verbatim from the output of docker stats for the buildkit container. My original observations were on my corporate laptop (with lots of security software); to aid in isolation and consistency, these numbers were gathered with my laptop as the client and a remote server (on the same LAN) running buildkit.
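The sampling itself can be sketched as a one-shot `docker stats` call. This is an assumed reconstruction of the methodology, not the exact command from the report; in particular, the container name `earthly-buildkitd` is an assumption:

```shell
# Hypothetical capture, run once after the build completes
# (container name is an assumption):
#   docker stats --no-stream --format '{{.NetIO}}\t{{.BlockIO}}\t{{.PIDs}}' earthly-buildkitd
#
# Splitting one captured sample line into the three table columns:
printf '348MB / 161MB\t0B / 404MB\t25\n' |
  awk -F'\t' '{ printf "NetIO=%s  DiskIO=%s  PIDs=%s\n", $1, $2, $3 }'
```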
Stats for "fresh" builds (empty buildkit cache, but fully populated registry cache):
| Client/Code version | Buildkit version | (Wall) time | Net I/O | Disk I/O | PIDs |
|---|---|---|---|---|---|
| 0.6.30 | 0.6.30 | 8m33s | 2.44GB / 184MB | 0B / 14.1GB | 35 |
| 0.6.30 | 0.7.0 | 8m44s | 2.44GB / 184MB | 8.19kB / 14.3GB | 28 |
| 0.6.30 | 0.7.2 | 8m59s | 2.44GB / 186MB | 0B / 14.2GB | 31 |
| 0.6.30 | 0.7.3 | 9m10s | 2.44GB / 185MB | 0B / 15GB | 21 |
| 0.6.30 | 0.7.4 | 9m11s | 2.44GB / 186MB | 0B / 15.4GB | 23 |
| 0.6.30 | 0.7.5 | 9m19s | 2.44GB / 185MB | 0B / 15.1GB | 22 |
| 0.6.30 | 0.7.9 | 8m59s | 2.44GB / 184MB | 221kB / 15.2GB | 24 |
The main takeaway is a consistent increase of nearly 1GB in disk writes. Wall time also increased, though possibly not by a statistically significant amount. Notably, there was no obvious delay at the end of the build (no "Waiting on Buildkit").
Stats for "fully cached" builds (fully populated buildkit cache, fully populated registry cache). Note that some tests had `RUN --no-cache`, so the baseline is still somewhat elevated:
| Client/Code version | Buildkit version | (Wall) time | Net I/O | Disk I/O | PIDs |
|---|---|---|---|---|---|
| 0.6.30 | 0.6.30 | 2m14s | 348MB / 161MB | 0B / 402MB | 25 |
| 0.6.30 | 0.7.0 | 2m24s | 348MB / 160MB | 369kB / 399MB | 26 |
| 0.6.30 | 0.7.2 | 2m24s | 348MB / 161MB | 0B / 404MB | 25 |
| 0.6.30 | 0.7.3 | 3m31s | 348MB / 161MB | 0B / 3.34GB | 33 |
| 0.6.30 | 0.7.4 | 3m37s | 348MB / 160MB | 0B / 3.64GB | 32 |
| 0.6.30 | 0.7.5 | 3m25s | 348MB / 161MB | 0B / 3.3GB | 32 |
| 0.6.30 | 0.7.9 | 3m20s | 348MB / 160MB | 0B / 3.34GB | 35 |
The increase in disk writes is extremely obvious. The increased times are significant as well, and they feel worse to a human because every run now ends with "Waiting on Buildkit". The PID counts are noticeably and consistently elevated, but may or may not be relevant. In both the "fresh" and "cached" scenarios, network I/O was extremely consistent, implying that neither build context transfer nor exports are factors.
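For scale, a quick back-of-the-envelope on the cached-build disk writes, using the 0.7.2 and 0.7.3 rows from the table above (converted to MB):

```shell
# Fully-cached builds wrote 404MB on buildkit 0.7.2 and 3.34GB (3340MB)
# on 0.7.3, per the table above.
awk 'BEGIN { printf "%.1fx more data written\n", 3340 / 404 }'
# -> 8.3x more data written
```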
After shifting focus to disk writes, a few other things of interest turned up. One very specific system call (pwrite64) stood out: an strace of a fully-cached build with 0.7.2 showed 78,238 calls, whereas a fully-cached 0.7.4 build showed 467,255. Just as significantly, the majority of those 467k writes occurred during the noticeable lag at the end of the build, and the predominant file descriptor being written during that time corresponds to /tmp/earthly/buildkit/runc-overlayfs/containerdmeta.db.
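The per-syscall counts can come straight from `strace -c`, or from grepping a raw dump. A minimal sketch of the latter, run over a tiny inline sample rather than a real multi-hundred-thousand-line capture (note that a capture taken with `strace -f` prefixes each line with a PID, so the anchored pattern below assumes a single-process dump):

```shell
# Count pwrite64 calls in a raw strace log. The sample lines are inlined
# here; a real capture would come from something like:
#   strace -o buildkit.strace -p <buildkitd-pid>
grep -c '^pwrite64(' <<'EOF'
pwrite64(12, "\0\0\0\0"..., 4096, 0) = 4096
pwrite64(12, "\0\0\0\0"..., 4096, 4096) = 4096
openat(AT_FDCWD, "/tmp/earthly/buildkit/runc-overlayfs", O_RDONLY) = 12
EOF
```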
I have asciinema casts (to illustrate the lags that a human sees) and strace dumps, but those are likely too large to attach here. Hopefully there's enough here to go on (and hopefully reproducible by others).
This tests, and implements, earthly/buildkit#20 as a partial performance fix for #3085. Some of the elevated resource utilization at the end remains (per user reports testing this fix), but the pauses appear to be gone.