
Cache results for Container.Import and Container.Build #5014

Merged: 7 commits from the cache-money branch into dagger:main on May 11, 2023

Conversation

@vito (Contributor) commented Apr 24, 2023

Builds on #5013 to use IDs as cache keys for expensive API requests.

This has by far the largest performance improvement for dagger run ./bass/test -i src=./, because I use OCI archives for images everywhere. Without this, every single call to Container.Import re-imports the same file on every run, spending tens of seconds on redundant I/O that's invisible to the user.

I think it's important for these caches to live on the Dagger side because it frees the API consumer to structure their code however they want. I applied these same optimizations in Bass a couple of weeks ago for the same reason.

@sipsma (Contributor) commented Apr 24, 2023

I haven't read the code in this PR thoroughly yet, but essentially this skips submitting the same LLB to buildkit multiple times?

In general our approach has been to rely on buildkit's caching to do that. E.g. for host dirs buildkit will only sync the directory once per-session.

If possible, I think I'd prefer keeping it that way, but could be convinced otherwise. The main reason is just simplicity; it's much easier to think about how the caching works when it's more centralized, and it also reduces the lines of code that could contain bugs. E.g. now we need to be very careful about what we include in our cache keys, otherwise we could end up deduplicating objects that are actually different.

I guess the tradeoff is performance, so if this makes a dramatic improvement that can't be addressed any other way then maybe it's worth it. Were there any noticeable performance improvements besides the Container.Import case?

For the Container.Import case, that almost feels like it could be a bug in buildkit. It definitely sounds different than the local dir sync behavior where the sync only happens once per session. Do you know if doing a sync every time is the expected behavior upstream? If not, I feel like it would be better to fix there.

@vito (Contributor, Author) commented Apr 25, 2023

In general our approach has been to rely on buildkit's caching to do that. E.g. for host dirs buildkit will only sync the directory once per-session.

I'll try to reproduce this. I cached host paths around the same time that I switched Bass to use a single gateway (they're the same commit). I remember noticing that using a single gateway helped but that some other step was still not cached, but maybe I forgot to recompile or something goofy like that and this change is completely redundant.

If possible, I think I'd prefer keeping it that way, but could be convinced otherwise. The main reason is just simplicity; it's much easier to think about how the caching works when it's more centralized, and it also reduces the lines of code that could contain bugs. E.g. now we need to be very careful about what we include in our cache keys, otherwise we could end up deduplicating objects that are actually different.

100% agree

I guess the tradeoff is performance, so if this makes a dramatic improvement that can't be addressed any other way then maybe it's worth it. Were there any noticeable performance improvements besides the Container.Import case?

For the Container.Import case, that almost feels like it could be a bug in buildkit. It definitely sounds different than the local dir sync behavior where the sync only happens once per session. Do you know if doing a sync every time is the expected behavior upstream? If not, I feel like it would be better to fix there.

The trouble here is we also stream the input into the local OCI store, which takes a long time even for a "no-op". I think one alternative here could be to support importing OCI layout directories instead of tar streams. That way we can just fetch the index and skip a whole bunch of I/O.
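(For illustration: importing from an OCI layout directory only requires reading its index.json, which is tiny compared to streaming a whole tarball into the content store. A rough sketch in Go, with a hypothetical helper and path, assuming the standard OCI image-spec types:)

package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"

	ocispec "github.com/opencontainers/image-spec/specs-go/v1"
)

// readLayoutIndex reads only index.json from an OCI layout directory,
// skipping the blobs entirely.
func readLayoutIndex(layoutDir string) (*ocispec.Index, error) {
	data, err := os.ReadFile(filepath.Join(layoutDir, "index.json"))
	if err != nil {
		return nil, err
	}
	var idx ocispec.Index
	if err := json.Unmarshal(data, &idx); err != nil {
		return nil, err
	}
	return &idx, nil
}

func main() {
	idx, err := readLayoutIndex("./my-image") // hypothetical layout directory
	if err != nil {
		panic(err)
	}
	for _, desc := range idx.Manifests {
		fmt.Println(desc.MediaType, desc.Digest)
	}
}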

But even with that optimization, there would still be the overhead of making a gateway request to fetch the index every single time the image is used. That was also my motivation for caching Container.Build - the repeated gateway requests really add up quickly and punish you for avoiding a registry.

So the tl;dr seems to be: host path caches possibly redundant, container image caches kind of important (at least to me). Will get back to you! Thanks for keeping my weekend experiments in check. 😄

@vito (Contributor, Author) commented Apr 25, 2023

Confirming the host path change isn't necessary. I was seeing Buildkit repeatedly upload, but it's because I was missing these LocalOpts:

dagger/core/host.go, lines 73 to 79 (at 96c37a2):

// synchronize concurrent filesyncs for the same path
llb.SharedKeyHint(localID),
// make the LLB stable so we can test invariants like:
//
// workdir == directory(".")
llb.LocalUniqueID(localID),
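(For context, a minimal sketch of how those options attach to the host-directory LLB, assuming buildkit's client/llb package; the localID and path values here are made up:)

package main

import (
	"github.com/moby/buildkit/client/llb"
)

// hostDir builds an llb.Local state with a stable identity so buildkit can
// deduplicate concurrent syncs of the same path within a session.
func hostDir(localID, path string) llb.State {
	return llb.Local(
		path,
		// synchronize concurrent filesyncs for the same path
		llb.SharedKeyHint(localID),
		// make the LLB stable across queries
		llb.LocalUniqueID(localID),
	)
}

func main() {
	_ = hostDir("host:./src", "./src") // hypothetical values
}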

@vito changed the title from "Cache results for Container.Import, Host.Directory, Container.Build" to "Cache results for Container.Import and Container.Build" on Apr 25, 2023
@github-actions bot commented May 9, 2023

This PR is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions bot added the stale label on May 9, 2023
@sipsma (Contributor) left a review comment:

So the tl;dr seems to be: host path caches possibly redundant, container image caches kind of important (at least to me).

Cool, yeah, I think we should only do the caching for container import. But looking at the implementation of that resolver again, I think it makes perfect sense to have our own layer of caching there, mostly because it's unique in that we're doing a lot of extra work and storing state on disk separately from buildkit.

tag string,
store content.Store,
) (*Container, error) {
cached, initializer, found := importCache.GetOrInitialize(source)
sipsma (Contributor):

Should the cache key include tag too?

vito (Author):

Totally; adding

core/cachemap.go (outdated):
Release()
}

// TODO per-key locks
sipsma (Contributor):

In case you haven't seen it before, singleflight does what we need here (iiuc) https://pkg.go.dev/golang.org/x/sync/singleflight
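(A minimal illustration of that suggestion; expensiveImport and the key here are made up. singleflight.Group.Do collapses concurrent calls that share a key into a single execution whose value and error go to every caller:)

package main

import (
	"fmt"

	"golang.org/x/sync/singleflight"
)

// expensiveImport stands in for the real OCI import work.
func expensiveImport(id string) (string, error) {
	return "imported:" + id, nil
}

func main() {
	var g singleflight.Group

	// Concurrent callers using the same key share one execution and all
	// receive the same value and error.
	v, err, shared := g.Do("import:abc123", func() (interface{}, error) {
		return expensiveImport("abc123")
	})
	fmt.Println(v, err, shared)
}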

vito (Author):

Oh nice, yeah that looks like what I'm looking for. We essentially want a version of that with generics. golang/go#53427

Maybe I'll ChatGPT it. 😈

sipsma (Contributor):

Haha, also looks like bradfitz added a generic implementation to tailscale too: tailscale/tailscale#4876

vito (Author):

Ended up just adjusting my implementation for this after seeing that it's pretty straightforward.

@vito (Contributor, Author) commented May 10, 2023

@sipsma Confirmed that caching Container.Build helps greatly. Open to other ideas for optimizing it though.

Full results here. Both were run twice so the Dockerfile build itself was cached, and we're just measuring time taken to use it repeatedly. It looks like most time is wasted on repeatedly resolving images in the Dockerfile. A full trace would give us a better idea though.

No caching:

     226         376868324 ns/op
--- BENCH: BenchmarkContainerBuild-16
    bench_test.go:69: iterations: 1
    bench_test.go:69: iterations: 38
    bench_test.go:69: iterations: 187
    bench_test.go:69: iterations: 226

With caching:

    8616           9284815 ns/op
--- BENCH: BenchmarkContainerBuild-16
    bench_test.go:69: iterations: 1
    bench_test.go:69: iterations: 40
    bench_test.go:69: iterations: 1855
    bench_test.go:69: iterations: 8616

It's nowhere near as dramatic as Container.Import which can take tens of seconds for large archives, but it's still very noticeable overhead, especially if the Dockerfile-built container is used repeatedly (as tends to happen).

@vito force-pushed the cache-money branch 2 times, most recently from 30e69d0 to 96d401d on May 10, 2023 at 15:12
@sipsma (Contributor) commented May 10, 2023

Confirmed that caching Container.Build helps greatly. Open to other ideas for optimizing it though.

Interesting, that makes sense in retrospect too. I feel like Container.Build is going to have a trickier cache key since there are more inputs and also stuff like service dependencies. We can obviously get it right, but it is a downside in terms of maintenance (though not a blocker). I think I'd be good with updating it to be cached too for now.


Actually, thinking out loud here, is it feasible to cache queries instead of individual resources? Like could we cache the result for

query {
  container {
    from(address: "alpine") {
      withExec(args: ["apk", "add", "curl"]) {
        withExec(args: ["curl", "https://dagger.io/"]) {
          stdout
        }
      }
    }
  }
}

using a hash of the query string (with some normalization)?

I have a sense that something would break somewhere... but I can't actually think of an example off the top of my head. And I suppose even if there are some exceptions that can't be cached, as long as they are limited we could probably set it up so caching is skipped for them.

Also, it would probably be important to not cache every return value (caching every stdout in the session would probably be a lot of memory). Instead we'd just want to cache IDs. So in the query above, we'd cache not stdout, but the ID of the withExec.

I guess one complication would come from the need to cache each step of the pipeline. So if someone queried

query {
  container {
    from(address: "alpine") {
      withExec(args: ["apk", "add", "curl"]) {
        stdout
      }
    }
  }
}

we'd need to make sure that if someone submitted another pipeline that was the same up until that point, we'd just re-use the cached ID rather than re-run the whole pipeline up until then.

If this is actually viable, I don't think we need to do it right away, I'd be okay with the per-resource caching for now. But this approach is appealing in that it would be totally generic. And while getting the cache key correct for a query is tricky, it's only tricky once (as opposed to having to solve that for every resource type).
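(A rough sketch of that idea, not part of this PR: hash a normalized form of the query and cache only the resulting ID. A real implementation would normalize via the GraphQL AST; collapsing whitespace here just stands in for that:)

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
	"sync"
)

// normalizeQuery is a placeholder for real normalization.
func normalizeQuery(q string) string {
	return strings.Join(strings.Fields(q), " ")
}

// queryCache maps a digest of the normalized query to a cached ID, so we
// store small IDs rather than large results like stdout.
type queryCache struct {
	mu  sync.Mutex
	ids map[string]string
}

func (c *queryCache) key(query string) string {
	sum := sha256.Sum256([]byte(normalizeQuery(query)))
	return hex.EncodeToString(sum[:])
}

func (c *queryCache) put(query, id string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.ids == nil {
		c.ids = map[string]string{}
	}
	c.ids[c.key(query)] = id
}

func (c *queryCache) get(query string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	id, ok := c.ids[c.key(query)]
	return id, ok
}

func main() {
	cache := &queryCache{}
	q := `query { container { from(address: "alpine") { id } } }`
	cache.put(q, "some-container-id") // hypothetical cached ID

	// The same query with different formatting hits the same entry.
	reformatted := "query {\n  container {\n    from(address: \"alpine\") {\n      id\n    }\n  }\n}"
	id, ok := cache.get(reformatted)
	fmt.Println(id, ok)
}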

@vito (Contributor, Author) commented May 10, 2023

@sipsma

I feel like Container.Build is going to have a trickier cache key since there are more inputs and also stuff like service dependencies. We can obviously get it right, but it is a downside in terms of maintenance (though not a blocker). I think I'd be good with updating it to be cached too for now.

Yeah, it has a ton of inputs, but I think I got'em all. Should be ready for review.

Actually, thinking out loud here, is it feasible to cache queries instead of individual resources?

I like that idea! We're actually very close to this, with coincidentally related code for Progrock's --debug mode:

dagger/router/schema.go, lines 133 to 157 (at 9d107b0):

vtx, err := queryVertex(recorder, p)
if err != nil {
	return nil, err
}
ctx := Context{
	Context:       p.Context,
	ResolveParams: p,
	Vertex:        vtx,
}
res, err := f(&ctx, parent, args)
if err != nil {
	vtx.Done(err)
	return nil, err
}
if edible, ok := any(res).(Digestible); ok {
	dg, err := edible.Digest()
	if err != nil {
		return nil, fmt.Errorf("failed to compute digest: %w", err)
	}
	vtx.Output(dg)
}

The part at the end where it checks if the return value is Digestible could be updated to populate the cache, keyed by the queryDigest. Probably other things to worry about, but it's kind of neat that we're already trending in that direction.

target string,
secrets []SecretID,
) (*Container, error) {
cached, initializer, found := buildCache.GetOrInitialize(cacheKey(container, context, dockerfile, buildArgs, target, secrets))
sipsma (Contributor):

Cool, that wasn't so hard after all 😄 I guess the service dependencies are implicitly in context.

One thing: should we sort the slices containing objects where the order doesn't matter? E.g. I don't think the order matters for buildArgs, but I also don't think json.Marshal would implicitly sort them, since they are objects and not strings (maybe I'm wrong about that?).

Actually, while this might be nice to have, I guess it's not totally necessary for correctness. It would result in more caching, but if we didn't do this everything should still behave correctly, just slower for certain cases. So no biggie either way.

vito (Author):

You're right that two Build calls with different-ordered args, but the same name/value pairs and all else being equal, are semantically equivalent. We could sort here, but I think it would be hard to handle that generically, e.g. in ToResolver where the --debug vertex is created. To me it's not worth the trouble: it seems like a super unlikely situation, each ordering would at least be individually cached, and the order is at least stable.
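(For completeness, if order-insensitive keys were ever wanted, sorting a copy of the args before marshaling would do it. A sketch; the BuildArg shape here is an assumption and this is not what the PR does:)

package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// BuildArg mirrors a name/value build argument (assumed shape).
type BuildArg struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

// canonicalArgs returns a sorted copy so argument order no longer affects
// the marshaled form used in a cache key.
func canonicalArgs(args []BuildArg) []BuildArg {
	sorted := append([]BuildArg(nil), args...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })
	return sorted
}

func main() {
	a, _ := json.Marshal(canonicalArgs([]BuildArg{{"FOO", "1"}, {"BAR", "2"}}))
	b, _ := json.Marshal(canonicalArgs([]BuildArg{{"BAR", "2"}, {"FOO", "1"}}))
	fmt.Println(string(a) == string(b)) // true: same key input either way
}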

core/cachemap.go (outdated), comment on lines 47 to 48:
init.Wait()
return init.val, nil, true
sipsma (Contributor):

If an error happens in the initialization being waited on here, should we return that here? Otherwise afaict the blocked caller will hit the found==true path and then return an object but no error, even though the "real" work to create the object resulted in an error.

Also, it looks like the lock for the whole cache is held while waiting here; I thought the idea was that there would be one lock per key? Or am I misunderstanding?

vito (Author):

Both totally correct, fixing!

I was hoping to avoid a nested callback pattern here, but that seems like the simplest path after all if we want to propagate errors. Alternatively Get could loop as long as initialization didn't finish so each call site could give it a try, but that's a lot of potentially wasted work for nonrecoverable errors.

vito (Author):

@sipsma Ready for another look. 🙏

I switched to something closer to singleflight's callback flow, the difference being that errors won't be cached forever, but they will still be returned to all concurrent runs.

To be honest this is all theoretical. It felt a little aggressive to cache errors forever since they could be from network flakes, but realistically any retries might be done with an entirely different session anyway. 🤷‍♂️
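(A rough sketch of that behavior, not the PR's exact code and with illustrative names: concurrent callers of the same key share one execution, every waiter sees its error, but failed entries are evicted so a later call can retry:)

package main

import (
	"fmt"
	"sync"
)

type call[V any] struct {
	done chan struct{}
	val  V
	err  error
}

// CacheMap caches successful results per key; errors propagate to all
// concurrent waiters but are not cached, so a later call can retry.
type CacheMap[K comparable, V any] struct {
	mu    sync.Mutex
	calls map[K]*call[V]
}

func NewCacheMap[K comparable, V any]() *CacheMap[K, V] {
	return &CacheMap[K, V]{calls: map[K]*call[V]{}}
}

func (c *CacheMap[K, V]) GetOrInitialize(key K, fn func() (V, error)) (V, error) {
	c.mu.Lock()
	if cl, ok := c.calls[key]; ok {
		c.mu.Unlock()
		<-cl.done // wait without holding the map-wide lock
		return cl.val, cl.err
	}
	cl := &call[V]{done: make(chan struct{})}
	c.calls[key] = cl
	c.mu.Unlock()

	cl.val, cl.err = fn()
	close(cl.done)

	if cl.err != nil {
		// Drop failed entries (e.g. network flakes) so future calls retry.
		c.mu.Lock()
		delete(c.calls, key)
		c.mu.Unlock()
	}
	return cl.val, cl.err
}

func main() {
	cache := NewCacheMap[string, string]()
	v, err := cache.GetOrInitialize("import:abc", func() (string, error) {
		return "imported-container", nil // hypothetical expensive work
	})
	fmt.Println(v, err)
}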

sipsma (Contributor):

Yeah, that makes perfect sense I think; it should essentially match buildkit's caching behavior.

vito added 7 commits on May 10, 2023 at 15:02, each signed off by Alex Suraci <alex@dagger.io>; one commit message notes "necessary for including e.g. platform".
@sipsma (Contributor) commented May 11, 2023

I like that idea! We're actually very close to this, with coincidentally related code for Progrock's --debug mode:

Yeah, I'm kind of wondering if this could lead to us replacing IDs entirely with the query cache key (which would probably just be a normalized form of the query). It could be really cool if all the code for creating the cache key lived in some of the generic functions like ToResolver. We might be able to just delete all the other code for handling IDs at that point? All resources would just be inherently IDable by virtue of being queryable.

@gerhard (Member) commented May 11, 2023

The conversation was very nice to read, thank you @vito & @sipsma for taking the time to async this.

I have not yet read the code, or tried this in practice, but I intend to do so in the next few hours. If everything checks out - which I have no doubt it will - I would like to get this merged so that it goes out in today's release.

@gerhard (Member) left a review:

I went through the code, all looks good.

I ran my pipeline with this change (including the dev engine built from it), and there were no noticeable changes: everything worked the same for me before and after.

I don't see any reason to not merge this 🛣

@vito merged commit a2d8aa0 into dagger:main on May 11, 2023
16 checks passed
@vito deleted the cache-money branch on May 11, 2023 at 14:10