Cache results for Container.Import and Container.Build #5014
Conversation
I haven't read the code in this PR thoroughly yet, but essentially this skips submitting the same LLB to buildkit multiple times? In general our approach has been to rely on buildkit's caching to do that, e.g. for host dirs buildkit will only sync the directory once per session. If possible, I think I'd prefer keeping it that way, but could be convinced otherwise. The main reason is just simplicity; it's much easier to think about how the caching works when it's more centralized, and it also reduces the lines of code that could contain bugs. E.g. now we need to be very careful about what we include in the cache keys in our caches, otherwise we could end up deduplicating objects that are actually different. I guess the tradeoff is performance, so if this makes a dramatic improvement that can't be addressed any other way then maybe it's worth it. Were there any noticeable performance improvements?
I'll try to reproduce this. I cached host paths around the same time that I switched Bass to use a single gateway (they're the same commit). I remember noticing that using a single gateway helped but that there was some other step that was still not cached, but maybe I forgot to recompile or something goofy like that and it's completely redundant.
100% agree
The trouble here is we also stream the input into the local OCI store, which takes a long time even for a "no-op". I think one alternative here could be to support importing OCI layout directories instead of tar streams. That way we can just fetch the index and skip a whole bunch of I/O. But even with that optimization, there would still be the overhead of making a gateway request to fetch the index every single time the image is used. This was my motivation for caching here.

So the tl;dr seems to be: host path caches are possibly redundant; container image caches are kind of important (at least to me). Will get back to you! Thanks for keeping my weekend experiments in check. 😄
Confirming the host path change isn't necessary. I was seeing Buildkit repeatedly upload, but it's because I was missing the settings at lines 73 to 79 in 96c37a2.
This PR is stale because it has been open 14 days with no activity. Remove the stale label or comment, or this will be closed in 7 days.
> So the tl;dr seems to be: host path caches possibly redundant, container image caches kind of important (at least to me).
Cool, yeah. I think we should only do the caching for container import, but looking at the implementation of that resolver again, I think it makes perfect sense to have our own layer of caching. Mostly just because it's unique in that we are doing a lot of extra work and storing state on disk separately from buildkit.
core/container.go (outdated):

```go
	tag string,
	store content.Store,
) (*Container, error) {
	cached, initializer, found := importCache.GetOrInitialize(source)
```
Should the cache key include `tag` too?
Totally; adding `tag` now.
core/cachemap.go (outdated):

```go
	Release()
}

// TODO per-key locks
```
In case you haven't seen it before, singleflight does what we need here (iiuc): https://pkg.go.dev/golang.org/x/sync/singleflight
Oh nice, yeah that looks like what I'm looking for. We essentially want a version of that with generics. golang/go#53427
Maybe I'll ChatGPT it. 😈
Haha, also looks like bradfitz added a generic implementation to tailscale too: tailscale/tailscale#4876
Ended up just adjusting my implementation for this after seeing that it's pretty straightforward.
@sipsma Confirmed that caching helps. Full results here. Both were run twice so the Dockerfile build itself was cached, and we're just measuring the time taken to use it repeatedly. It looks like most time is wasted on repeatedly resolving images in the Dockerfile; a full trace would give us a better idea, though.

No caching:

With caching:

It's nowhere near as dramatic an improvement.
(force-pushed from 30e69d0 to 96d401d)
Interesting, that makes sense too in retrospect. I feel like container Build is going to have a trickier cache key since there are more inputs, and also stuff like service dependencies. We can obviously get it right, but it is a downside in terms of maintenance (though not a blocker). I think I'd be good with updating it to be cached too for now.

Actually, thinking out loud here: is it feasible to cache queries instead of individual resources? Like, could we cache the result for

```graphql
query {
  container {
    from(address: "alpine") {
      withExec(args: ["apk", "add", "curl"]) {
        withExec(args: ["curl", "https://dagger.io/"]) {
          stdout
        }
      }
    }
  }
}
```

using a hash of the query string (with some normalization)? I have a sense that something would break somewhere... but I can't actually think of an example off the top of my head. And I suppose even if there are some exceptions that can't be cached, as long as they are limited we could probably set it up so caching is skipped for them.

Also, it would probably be important to not cache every return value (caching every stdout in the session would probably be a lot of memory). Instead we'd just want to cache IDs.

I guess one complication would come from the need to cache each step of the pipeline. So if someone queried

```graphql
query {
  container {
    from(address: "alpine") {
      withExec(args: ["apk", "add", "curl"]) {
        stdout
      }
    }
  }
}
```

we'd want to make sure that if someone submitted another pipeline that was the same up to that point, we'd just re-use the cached result.

If this is actually viable, I don't think we need to do it right away; I'd be okay with the per-resource caching for now. But this approach is appealing in that it would be totally generic. And while getting the cache key correct for a query is tricky, it's only tricky once (as opposed to having to solve that for every resource type).
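A rough sketch of the query-hashing idea, purely illustrative: key the cache on a hash of the query after trivial normalization. Here "normalization" is just collapsing whitespace; a real implementation would normalize the parsed AST (argument order, aliases, fragments) instead of the raw text. `queryCacheKey` is a hypothetical name, not anything in this PR.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// queryCacheKey hashes a whitespace-normalized query string, so purely
// formatting differences between two queries produce the same cache key.
func queryCacheKey(query string) string {
	// Collapse all runs of whitespace (newlines, indentation) to single spaces.
	normalized := strings.Join(strings.Fields(query), " ")
	sum := sha256.Sum256([]byte(normalized))
	return hex.EncodeToString(sum[:])
}

func main() {
	a := `query { container { from(address: "alpine") { stdout } } }`
	b := `
query {
  container {
    from(address: "alpine") { stdout }
  }
}`
	// Differently formatted but token-identical queries share a key.
	fmt.Println(queryCacheKey(a) == queryCacheKey(b))
}
```

This also shows why normalization is the hard part: any semantic equivalence the normalizer misses (reordered arguments, aliases) silently becomes a cache miss rather than a correctness bug, which is the safe failure mode.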
Yeah, it has a ton of inputs, but I think I got 'em all. Should be ready for review.
I like that idea! We're actually very close to this, with coincidentally related code for Progrock (lines 133 to 157 in 9d107b0). The part at the end, where it checks the return value, seems like the relevant piece.
core/container.go (outdated):

```go
	target string,
	secrets []SecretID,
) (*Container, error) {
	cached, initializer, found := buildCache.GetOrInitialize(cacheKey(container, context, dockerfile, buildArgs, target, secrets))
```
Cool, that wasn't so hard after all 😄 I guess the service dependencies are implicitly in `context`.

One thing: should we sort the slices containing objects where the order doesn't matter? E.g. I don't think the order matters for `buildArgs`, but I also don't think that `json.Marshal` would implicitly sort them, due to the fact that they are objects and not strings (maybe I'm wrong about that?).

Actually, while this might be nice to have, I guess it's not totally necessary for correctness. It would result in more caching, but if we didn't do this everything should still behave correctly, just slower for certain cases. So no biggie either way.
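To make the `json.Marshal` question concrete: Go's encoder sorts *map* keys alphabetically but preserves *slice* order, so a slice of name/value structs like `buildArgs` would need explicit sorting to make order-equivalent calls hash identically. A hypothetical sketch (the `BuildArg` struct and `buildCacheKey` helper are illustrative, not this PR's actual code):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"sort"
)

// BuildArg mirrors the kind of name/value pair discussed above.
type BuildArg struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

// buildCacheKey hashes the JSON encoding of the build inputs. Because
// json.Marshal keeps slice order as-is, order-insensitive slices are sorted
// first so that semantically equivalent calls produce the same key.
func buildCacheKey(dockerfile, target string, buildArgs []BuildArg) string {
	// Copy before sorting to avoid mutating the caller's slice.
	args := append([]BuildArg(nil), buildArgs...)
	sort.Slice(args, func(i, j int) bool { return args[i].Name < args[j].Name })

	payload, err := json.Marshal(struct {
		Dockerfile string     `json:"dockerfile"`
		Target     string     `json:"target"`
		BuildArgs  []BuildArg `json:"buildArgs"`
	}{dockerfile, target, args})
	if err != nil {
		panic(err) // only possible for unmarshalable types
	}
	sum := sha256.Sum256(payload)
	return hex.EncodeToString(sum[:])
}

func main() {
	a := buildCacheKey("Dockerfile", "", []BuildArg{{"FOO", "1"}, {"BAR", "2"}})
	b := buildCacheKey("Dockerfile", "", []BuildArg{{"BAR", "2"}, {"FOO", "1"}})
	fmt.Println(a == b)
}
```

As noted in the thread, skipping the sort only costs extra cache misses, never incorrect hits, since two differently ordered-but-equal inputs would simply get two keys.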
You're right that two Build calls with differently ordered args, but the same name/value pairs and all else being equal, are semantically equivalent. We could sort here, but I think it would be hard to handle that generically, e.g. in `ToResolver` where the `--debug` vertex is created. To me it's not worth the trouble, since that seems like a super unlikely situation; they'd at least be individually cached, and the order is at least stable.
core/cachemap.go (outdated):

```go
	init.Wait()
	return init.val, nil, true
```
If an error happens in the initialization being waited on here, should we return that here? Otherwise, afaict, the blocked caller will hit the `found == true` path and then return an object but no error, even though the "real" work to create the object resulted in an error.

Also, it looks like the lock for the whole cache is held while waiting here; I thought the idea was that there would be one lock per key? Or am I misunderstanding?
Both totally correct, fixing!

I was hoping to avoid a nested callback pattern here, but that seems like the simplest path after all if we want to propagate errors. Alternatively, `Get` could loop as long as initialization didn't finish so each call site could give it a try, but that's a lot of potentially wasted work for nonrecoverable errors.
@sipsma Ready for another look. 🙏
I switched to something closer to singleflight's callback flow, the difference being that errors won't be cached forever, but they will still be returned to all concurrent runs.
To be honest, this is all theoretical. It felt a little aggressive to cache errors forever, since they could be from network flakes, but realistically any retries might be done with an entirely different session anyway. 🤷
Yeah, that makes perfect sense I think; it should essentially match buildkit's caching behavior.
Signed-off-by: Alex Suraci <alex@dagger.io>

necessary for including e.g. platform
Yeah, I'm kind of wondering if this could lead to us replacing IDs entirely with the query cache key (which would probably just be a normalized form of the query). It could be really cool if all the code for creating the cache key lived in some of the generic functions.
The conversation was very nice to read; thank you @vito & @sipsma for taking the time to async this. I have not yet read the code or tried this in practice, but I intend to do so in the next few hours. If everything checks out, which I have no doubt that it will, I would like to get this merged too so that we get it out in today's release.
I went through the code, all looks good.
I ran my pipeline with this change (including the dev engine built from this), and there were no noticeable changes. Everything worked the same for me before this change, as well as after.
I don't see any reason to not merge this 🛣
Builds on #5013 to use IDs as cache keys for expensive API requests.

This has by far the largest performance improvement for `dagger run ./bass/test -i src=./`, because I use OCI archives for images everywhere. Without this, every single call to `Container.Import` imports the same file every time I run something, spending tens of seconds on redundant I/O that's invisible to the user.

I think it's important for these caches to live on the Dagger side because it frees the API consumer to structure their code however they want. I applied these same optimizations in Bass a couple of weeks ago for the same reason.