
Cache results for Container.Import and Container.Build #5014

Merged: 7 commits from the cache-money branch into dagger:main on May 11, 2023

Conversation

@vito (Contributor) commented Apr 24, 2023

Builds on #5013 to use IDs as cache keys for expensive API requests.

This has by far the largest performance improvement for dagger run ./bass/test -i src=./, because I use OCI archives for images everywhere. Without this, every single call to Container.Import re-imports the same file on every run, spending tens of seconds on redundant I/O that's invisible to the user.

I think it's important for these caches to live on the Dagger side because it frees the API consumer to structure their code however they want. I applied these same optimizations in Bass a couple of weeks ago for the same reason.

@sipsma (Contributor) commented Apr 24, 2023

I haven't read the code in this PR thoroughly yet, but essentially this skips submitting the same LLB to buildkit multiple times?

In general our approach has been to rely on buildkit's caching to do that. E.g. for host dirs buildkit will only sync the directory once per-session.

If possible, I think I'd prefer keeping it that way, but could be convinced otherwise. The main reason is just simplicity; it's much easier to think about how the caching works when it's more centralized, and it also reduces the lines of code that could contain bugs. E.g. now we need to be very careful about what we include in our cache keys, otherwise we could end up deduplicating objects that are actually different.

I guess the tradeoff is performance, so if this makes a dramatic improvement that can't be addressed any other way then maybe it's worth it. Were there any noticeable performance improvements besides the Container.Import case?

For the Container.Import case, that almost feels like it could be a bug in buildkit. It definitely sounds different than the local dir sync behavior where the sync only happens once per session. Do you know if doing a sync every time is the expected behavior upstream? If not, I feel like it would be better to fix there.

@vito (Contributor, Author) commented Apr 25, 2023

In general our approach has been to rely on buildkit's caching to do that. E.g. for host dirs buildkit will only sync the directory once per-session.

I'll try to reproduce this. I cached host paths around the same time that I switched Bass to use a single gateway (they're the same commit). I remember noticing that using a single gateway helped but that some other step was still not cached, but maybe I forgot to recompile or something goofy like that and this change is completely redundant.

If possible, I think I'd prefer keeping it that way, but could be convinced otherwise. The main reason is just simplicity; it's much easier to think about how the caching works when it's more centralized, and it also reduces the lines of code that could contain bugs. E.g. now we need to be very careful about what we include in our cache keys, otherwise we could end up deduplicating objects that are actually different.

100% agree

I guess the tradeoff is performance, so if this makes a dramatic improvement that can't be addressed any other way then maybe it's worth it. Were there any noticeable performance improvements besides the Container.Import case?

For the Container.Import case, that almost feels like it could be a bug in buildkit. It definitely sounds different than the local dir sync behavior where the sync only happens once per session. Do you know if doing a sync every time is the expected behavior upstream? If not, I feel like it would be better to fix there.

The trouble here is we also stream the input into the local OCI store, which takes a long time even for a "no-op". I think one alternative here could be to support importing OCI layout directories instead of tar streams. That way we can just fetch the index and skip a whole bunch of I/O.
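(For illustration: importing from an OCI layout directory only requires reading its index.json, which is tiny compared to streaming a whole tarball into the content store. A rough sketch in Go, with a hypothetical helper and path, assuming the standard OCI image-spec types:)

package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"

	ocispec "github.com/opencontainers/image-spec/specs-go/v1"
)

// readLayoutIndex reads only index.json from an OCI layout directory,
// skipping the blobs entirely.
func readLayoutIndex(layoutDir string) (*ocispec.Index, error) {
	data, err := os.ReadFile(filepath.Join(layoutDir, "index.json"))
	if err != nil {
		return nil, err
	}
	var idx ocispec.Index
	if err := json.Unmarshal(data, &idx); err != nil {
		return nil, err
	}
	return &idx, nil
}

func main() {
	idx, err := readLayoutIndex("./my-image") // hypothetical layout directory
	if err != nil {
		panic(err)
	}
	for _, desc := range idx.Manifests {
		fmt.Println(desc.MediaType, desc.Digest)
	}
}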

But even with that optimization, there would still be the overhead of making a gateway request to fetch the index every single time the image is used. That was also my motivation for caching Container.Build - the repeated gateway requests really add up quickly and punish you for avoiding a registry.

So the tl;dr seems to be: host path caches possibly redundant, container image caches kind of important (at least to me). Will get back to you! Thanks for keeping my weekend experiments in check. 😄

@vito (Contributor, Author) commented Apr 25, 2023

Confirming the host path change isn't necessary. I was seeing Buildkit repeatedly upload, but it's because I was missing these LocalOpts:

dagger/core/host.go, lines 73 to 79 (at 96c37a2):

// synchronize concurrent filesyncs for the same path
llb.SharedKeyHint(localID),
// make the LLB stable so we can test invariants like:
//
// workdir == directory(".")
llb.LocalUniqueID(localID),
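(For context, a minimal sketch of how those options attach to the host-directory LLB, assuming buildkit's client/llb package; the localID and path values here are made up:)

package main

import (
	"github.com/moby/buildkit/client/llb"
)

// hostDir builds an llb.Local state with a stable identity so buildkit can
// deduplicate concurrent syncs of the same path within a session.
func hostDir(localID, path string) llb.State {
	return llb.Local(
		path,
		// synchronize concurrent filesyncs for the same path
		llb.SharedKeyHint(localID),
		// make the LLB stable across queries
		llb.LocalUniqueID(localID),
	)
}

func main() {
	_ = hostDir("host:./src", "./src") // hypothetical values
}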

@vito changed the title from "Cache results for Container.Import, Host.Directory, Container.Build" to "Cache results for Container.Import and Container.Build" on Apr 25, 2023
@github-actions bot commented May 9, 2023

This PR is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions bot added the stale label on May 9, 2023
@sipsma (Contributor) left a review comment:

So the tl;dr seems to be: host path caches possibly redundant, container image caches kind of important (at least to me).

Cool, yeah, I think we should only do the caching for container import. But looking at the implementation of that resolver again, I think it makes perfect sense to have our own layer of caching there, mostly because it's unique in that we're doing a lot of extra work and storing state on disk separately from buildkit.

tag string,
store content.Store,
) (*Container, error) {
cached, initializer, found := importCache.GetOrInitialize(source)
sipsma (Contributor):

Should the cache key include tag too?

vito (Author):

Totally; adding

core/cachemap.go (outdated):
Release()
}

// TODO per-key locks
sipsma (Contributor):

In case you haven't seen it before, singleflight does what we need here (iiuc) https://pkg.go.dev/golang.org/x/sync/singleflight
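(A minimal illustration of that suggestion; expensiveImport and the key here are made up. singleflight.Group.Do collapses concurrent calls that share a key into a single execution whose value and error go to every caller:)

package main

import (
	"fmt"

	"golang.org/x/sync/singleflight"
)

// expensiveImport stands in for the real OCI import work.
func expensiveImport(id string) (string, error) {
	return "imported:" + id, nil
}

func main() {
	var g singleflight.Group

	// Concurrent callers using the same key share one execution and all
	// receive the same value and error.
	v, err, shared := g.Do("import:abc123", func() (interface{}, error) {
		return expensiveImport("abc123")
	})
	fmt.Println(v, err, shared)
}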

vito (Author):

Oh nice, yeah that looks like what I'm looking for. We essentially want a version of that with generics. golang/go#53427

Maybe I'll ChatGPT it. 😈

sipsma (Contributor):

Haha, also looks like bradfitz added a generic implementation to tailscale too: tailscale/tailscale#4876

vito (Author):

Ended up just adjusting my implementation for this after seeing that it's pretty straightforward.

@vito (Contributor, Author) commented May 10, 2023

@sipsma Confirmed that caching Container.Build helps greatly. Open to other ideas for optimizing it though.

Full results here. Both were run twice so the Dockerfile build itself was cached, and we're just measuring time taken to use it repeatedly. It looks like most time is wasted on repeatedly resolving images in the Dockerfile. A full trace would give us a better idea though.

No caching:

     226         376868324 ns/op
--- BENCH: BenchmarkContainerBuild-16
    bench_test.go:69: iterations: 1
    bench_test.go:69: iterations: 38
    bench_test.go:69: iterations: 187
    bench_test.go:69: iterations: 226

With caching:

    8616           9284815 ns/op
--- BENCH: BenchmarkContainerBuild-16
    bench_test.go:69: iterations: 1
    bench_test.go:69: iterations: 40
    bench_test.go:69: iterations: 1855
    bench_test.go:69: iterations: 8616

It's nowhere near as dramatic as Container.Import which can take tens of seconds for large archives, but it's still very noticeable overhead, especially if the Dockerfile-built container is used repeatedly (as tends to happen).

@vito force-pushed the cache-money branch 2 times, most recently from 30e69d0 to 96d401d on May 10, 2023 at 15:12
@sipsma (Contributor) commented May 10, 2023

Confirmed that caching Container.Build helps greatly. Open to other ideas for optimizing it though.

Interesting, that makes sense in retrospect too. I feel like Container.Build is going to have a trickier cache key since there are more inputs and also stuff like service dependencies. We can obviously get it right, but it is a downside in terms of maintenance (though not a blocker). I think I'd be good with updating it to be cached too for now.


Actually, thinking out loud here, is it feasible to cache queries instead of individual resources? Like could we cache the result for

query {
  container {
    from(address: "alpine") {
      withExec(args: ["apk", "add", "curl"]) {
        withExec(args: ["curl", "https://dagger.io/"]) {
          stdout
        }
      }
    }
  }
}

using a hash of the query string (with some normalization)?

I have a sense that something would break somewhere... but I can't actually think of an example off the top of my head. And I suppose even if there are some exceptions that can't be cached, as long as they are limited we could probably set it up so caching is skipped for them.

Also, it would probably be important to not cache every return value (caching every stdout in the session would probably be a lot of memory). Instead we'd just want to cache IDs. So in the query above, we'd cache not stdout, but the ID of the withExec.

I guess one complication would come from the need to cache each step of the pipeline. So if someone queried

query {
  container {
    from(address: "alpine") {
      withExec(args: ["apk", "add", "curl"]) {
        stdout
      }
    }
  }
}

we'd need to make sure that if someone submitted another pipeline that was the same up until that point, we'd just re-use the cached ID rather than re-run the whole pipeline up until then.

If this is actually viable, I don't think we need to do it right away, I'd be okay with the per-resource caching for now. But this approach is appealing in that it would be totally generic. And while getting the cache key correct for a query is tricky, it's only tricky once (as opposed to having to solve that for every resource type).
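(A rough sketch of that idea, not part of this PR: hash a normalized form of the query and cache only the resulting ID. A real implementation would normalize via the GraphQL AST; collapsing whitespace here just stands in for that:)

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
	"sync"
)

// normalizeQuery is a placeholder for real normalization.
func normalizeQuery(q string) string {
	return strings.Join(strings.Fields(q), " ")
}

// queryCache maps a digest of the normalized query to a cached ID, so we
// store small IDs rather than large results like stdout.
type queryCache struct {
	mu  sync.Mutex
	ids map[string]string
}

func (c *queryCache) key(query string) string {
	sum := sha256.Sum256([]byte(normalizeQuery(query)))
	return hex.EncodeToString(sum[:])
}

func (c *queryCache) put(query, id string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.ids == nil {
		c.ids = map[string]string{}
	}
	c.ids[c.key(query)] = id
}

func (c *queryCache) get(query string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	id, ok := c.ids[c.key(query)]
	return id, ok
}

func main() {
	cache := &queryCache{}
	q := `query { container { from(address: "alpine") { id } } }`
	cache.put(q, "some-container-id") // hypothetical cached ID

	// The same query with different formatting hits the same entry.
	reformatted := "query {\n  container {\n    from(address: \"alpine\") {\n      id\n    }\n  }\n}"
	id, ok := cache.get(reformatted)
	fmt.Println(id, ok)
}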

@vito (Contributor, Author) commented May 10, 2023

@sipsma

I feel like Container.Build is going to have a trickier cache key since there are more inputs and also stuff like service dependencies. We can obviously get it right, but it is a downside in terms of maintenance (though not a blocker). I think I'd be good with updating it to be cached too for now.

Yeah, it has a ton of inputs, but I think I got'em all. Should be ready for review.

Actually, thinking out loud here, is it feasible to cache queries instead of individual resources?

I like that idea! We're actually very close to this, with coincidentally related code for Progrock's --debug mode:

dagger/router/schema.go, lines 133 to 157 (at 9d107b0):

vtx, err := queryVertex(recorder, p)
if err != nil {
	return nil, err
}
ctx := Context{
	Context:       p.Context,
	ResolveParams: p,
	Vertex:        vtx,
}
res, err := f(&ctx, parent, args)
if err != nil {
	vtx.Done(err)
	return nil, err
}
if edible, ok := any(res).(Digestible); ok {
	dg, err := edible.Digest()
	if err != nil {
		return nil, fmt.Errorf("failed to compute digest: %w", err)
	}
	vtx.Output(dg)
}

The part at the end where it checks if the return value is Digestible could be updated to populate the cache, keyed by the queryDigest. Probably other things to worry about, but it's kind of neat that we're already trending in that direction.

target string,
secrets []SecretID,
) (*Container, error) {
cached, initializer, found := buildCache.GetOrInitialize(cacheKey(container, context, dockerfile, buildArgs, target, secrets))
sipsma (Contributor):

Cool, that wasn't so hard after all 😄 I guess the service dependencies are implicitly in context.

One thing: should we sort the slices containing objects where the order doesn't matter? E.g. I don't think the order matters for buildArgs, but I also don't think json.Marshal would implicitly sort them, since they are objects and not strings (maybe I'm wrong about that?).

Actually, while this might be nice to have, I guess it's not totally necessary for correctness. It would result in more caching, but if we didn't do this everything should still behave correctly, just slower for certain cases. So no biggie either way.

vito (Author):

You're right that two Build calls with different-ordered args, but the same name/value pairs and all else being equal, are semantically equivalent. We could sort here, but I think it would be hard to handle that generically, e.g. in ToResolver where the --debug vertex is created. To me it's not worth the trouble: it seems like a super unlikely situation, each ordering would at least be individually cached, and the order is at least stable.
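(For completeness, if order-insensitive keys were ever wanted, sorting a copy of the args before marshaling would do it. A sketch; the BuildArg shape here is an assumption and this is not what the PR does:)

package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// BuildArg mirrors a name/value build argument (assumed shape).
type BuildArg struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

// canonicalArgs returns a sorted copy so argument order no longer affects
// the marshaled form used in a cache key.
func canonicalArgs(args []BuildArg) []BuildArg {
	sorted := append([]BuildArg(nil), args...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })
	return sorted
}

func main() {
	a, _ := json.Marshal(canonicalArgs([]BuildArg{{"FOO", "1"}, {"BAR", "2"}}))
	b, _ := json.Marshal(canonicalArgs([]BuildArg{{"BAR", "2"}, {"FOO", "1"}}))
	fmt.Println(string(a) == string(b)) // true: same key input either way
}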

core/cachemap.go (outdated), comment on lines 47 to 48:
init.Wait()
return init.val, nil, true
sipsma (Contributor):

If an error happens in the initialization being waited on here, should we return that here? Otherwise afaict the blocked caller will hit the found==true path and then return an object but no error, even though the "real" work to create the object resulted in an error.

Also, it looks like the lock for the whole cache is held while waiting here; I thought the idea was that there would be one lock per key? Or am I misunderstanding?

vito (Author):

Both totally correct, fixing!

I was hoping to avoid a nested callback pattern here, but that seems like the simplest path after all if we want to propagate errors. Alternatively Get could loop as long as initialization didn't finish so each call site could give it a try, but that's a lot of potentially wasted work for nonrecoverable errors.

vito (Author):

@sipsma Ready for another look. 🙏

I switched to something closer to singleflight's callback flow, the difference being that errors won't be cached forever, but they will still be returned to all concurrent runs.

To be honest this is all theoretical. It felt a little aggressive to cache errors forever since they could be from network flakes, but realistically any retries might be done with an entirely different session anyway. 🤷‍♂️
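(A rough sketch of that behavior, not the PR's exact code and with illustrative names: concurrent callers of the same key share one execution, every waiter sees its error, but failed entries are evicted so a later call can retry:)

package main

import (
	"fmt"
	"sync"
)

type call[V any] struct {
	done chan struct{}
	val  V
	err  error
}

// CacheMap caches successful results per key; errors propagate to all
// concurrent waiters but are not cached, so a later call can retry.
type CacheMap[K comparable, V any] struct {
	mu    sync.Mutex
	calls map[K]*call[V]
}

func NewCacheMap[K comparable, V any]() *CacheMap[K, V] {
	return &CacheMap[K, V]{calls: map[K]*call[V]{}}
}

func (c *CacheMap[K, V]) GetOrInitialize(key K, fn func() (V, error)) (V, error) {
	c.mu.Lock()
	if cl, ok := c.calls[key]; ok {
		c.mu.Unlock()
		<-cl.done // wait without holding the map-wide lock
		return cl.val, cl.err
	}
	cl := &call[V]{done: make(chan struct{})}
	c.calls[key] = cl
	c.mu.Unlock()

	cl.val, cl.err = fn()
	close(cl.done)

	if cl.err != nil {
		// Drop failed entries (e.g. network flakes) so future calls retry.
		c.mu.Lock()
		delete(c.calls, key)
		c.mu.Unlock()
	}
	return cl.val, cl.err
}

func main() {
	cache := NewCacheMap[string, string]()
	v, err := cache.GetOrInitialize("import:abc", func() (string, error) {
		return "imported-container", nil // hypothetical expensive work
	})
	fmt.Println(v, err)
}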

sipsma (Contributor):

Yeah, that makes perfect sense I think; it should essentially match buildkit's caching behavior.

vito added 7 commits on May 10, 2023 at 15:02, each signed off by Alex Suraci <alex@dagger.io>; one commit message notes "necessary for including e.g. platform".
@sipsma (Contributor) commented May 11, 2023

I like that idea! We're actually very close to this, with coincidentally related code for Progrock's --debug mode:

Yeah, I'm kind of wondering if this could lead to us replacing IDs entirely with the query cache key (which would probably just be a normalized form of the query). It could be really cool if all the code for creating the cache key lived in some of the generic functions like ToResolver. We might be able to just delete all the other code for handling IDs at that point? All resources would just be inherently IDable by virtue of being queryable.

@gerhard (Member) commented May 11, 2023

The conversation was very nice to read, thank you @vito & @sipsma for taking the time to async this.

I have not yet read the code, or tried this in practice, but I intend to do so in the next few hours. If everything checks out - which I have no doubt it will - I would like to get this merged so that it goes out in today's release.

@gerhard (Member) left a review:

I went through the code, all looks good.

I ran my pipeline with this change (including the dev engine built from it), and there were no noticeable changes: everything worked the same for me before and after.

I don't see any reason to not merge this 🛣

@vito merged commit a2d8aa0 into dagger:main on May 11, 2023
16 checks passed
@vito deleted the cache-money branch on May 11, 2023 at 14:10