add CacheID and CacheVolume#3287
Conversation
✅ Deploy Preview for cloak-docs ready!
To edit notification comments on pull requests, go to your Netlify site settings. |
sipsma
left a comment
There was a problem hiding this comment.
Awesome! No major comments looks great
There was a problem hiding this comment.
I have the sense I'm forgetting something obvious, but do we still need "Cache" here?
There was a problem hiding this comment.
Whoops, copy-pasta - fixing!
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
|
I’m a bit late to the party, sorry… I don’t understand the value of the “token” concept. As a developer: Why do I need to provide one? How do I choose a token name? How many should I provide? Can’t I just create a cache volume, get its ID, then use that ID wherever I want to mount that volume? |
|
@shykes Assuming there's no volume state behind the scenes, and you're suggesting to just generate random IDs, that seems painful as it puts the responsibility on the client to manage the lifecycle of that ID themselves. There isn't a clear practice that I can imagine for the API consumer to stash that cache ID away and re-use it for later runs, and there's no way to "retrieve" the ID from the last run because there were no inputs associated to it in the first place, and for similar reasons it's hard for the client to decide when to invalidate the cache and create a new one. In my experience cache keys are always either a static string or they're keyed on some external dependency. For example here I need a cache mount that will be invalidated whenever the base image changes, so I put the image tag + digest in the cache key. The other examples are mostly simple strings, but some demonstrate a basic hierarchy like So tokens are just a seed for generating deterministic cache IDs instead of random ones. I went with a list instead of a single string just to make composite cache keys like the one I linked easier to use (worth noting that GraphQL doesn't support string concatenation or interpolation). To re-use a cache, just keep passing the same tokens. To create a new cache, just pass different tokens. For now it's just a CacheID with extra steps, but it's a layer of indirection that'll let us throw more things into CacheID generation if we ever need scoping. Hope that makes sense and I'm not totally missing your point. 😅 |
|
Totally not married to the name 'token' btw, it was a game-time decision. |
|
Thanks, you’re absolutely right. I was so focused on how to reuse a cache volume across queries in the same session, I totally forgot about the need to reuse it across sessions… Quick follow-up:
|
I agree that we'll need scoping really soon if not on day 1. I wonder if it's closer to "project" than "user" though. I may be using the same machine/user to work on multiple projects that all use Dagger and it'd be awkward if they collided with one another. In some cases you might want to share caches across projects though, for example it should be safe to share one big Go build/module cache. Maybe it could be a value in Also one thing worth clarifying: we're aiming to prevent cache collisions just from a usability perspective, not a security perspective, right? So it should be OK to let the user provide the scope key. Multiple untrusted tenants on the same Buildkit is never a safe thing to do anyway.
Works for me!
I'll change it to a regular string. I started writing up an example that reused GraphQL query params to feed into the cache tokens, but then realized it's a separate query to create the cache anyway; given that the user will have to decide which params to pass for creating the cache, they can just as easily come up with a cache key ahead of time. The other thought I had was related to GitHub Action's
Agree that |
|
re: scoping Yeah I mean technically this exact issue has been present since Europa (it's just that the cache id was a string implicit to the exec api, but it was shared across all sessions concurrently connected to buildkit). That's not a reason to ignore but it is true that we have gotten by ignoring it up til now :-) This issue of when/how to share cache mounts came up when discussing the cloud backend too. It's part of a broader question of namespacing and session-state, which will also impact services, synced directories and even our long-future plans around magical caching (there was some discussion about tiering+namespacing there). That is to say, this is a very complicated topic with ties to many other complicated topics. The reason I like the tokens approach is that it should work for any number of different future possibilities. It may even make sense to use a similar approach with some other resources like services (maybe). If I had to guess, I'd think one likely first step would be to just enable the engine to optionally implicitly scope all cache mounts to a single session, to support the cloud backend use case. It could do this by just implicitly inserting the session id into the set of tokens. For more broader use cases, it gets complicated because there's valid use cases for having global caches, for having "team-wide" caches (shared with a few individuals) and for having "individual" caches, all at the same time. The tokens approach will also be a great base for the implementation there I think. It's more figuring out to enable that tiering of namespaces+sessions that also works with everything else that will become challenging. There's plenty of prior art for this too in k8s and even containerd (and many many others I'm sure), we can learn from that. Either way, I very briefly mention cache mounts in my WIP arch discussion here (pain point 5), but as that grows and splits out into sub issues it will probably grow in prominence, can keep figuring this all out there: #3287 (comment) EDIT: to be clear, whether tokens are literally in the API schema or if we just have a single string there and have users construct tokens themselves is mostly independent of the above. I was thinking more about the "spirit" of that sort of namespacing design, the underlying implementation is more flexible. |
|
Created #3341 with the changes I mentioned. I went with Internally this actually works the same as before; |
|
Starting with the easy part: bikeshedding the API signature :)
I wonder if we need an ID at all now - why not just lookup by key? On first lookup, we just create it. extend type Query {
“””
Lookup a cache volume by its key, creating a new one if needed.
Key is namespaced to the current project.
“””
cacheVolume(key: String!): CacheVolume!
} |
|
Focusing on the hard part now...
You are right. I see two possible paths here:
I'm leaning towards option 1. Mostly because it seems really hard to ship a UX with global scoping, then change it to support per-project scoping. I could be wrong though. Note: I think @vito 's use of "project" is roughly equivalent to @sipsma 's use of "session" in this discussion, is this accurate?
This part I'm OK with deferring to later. I think if we have a good project scoping primitive, we can find a way to add this feature later.
IMO we need a UX primitive that works in a single-tenant and multi-tenant context. Totally agree that we don't want to actually make a buildkit instance multi-tenant (at least not with untrusted tenants) but we might want to implement multi-tenancy at a higher level (with cloud etc). So if the UX has a hook for that, it would be ideal. |
Too simple and clear. We need to show that our brains are large and wrinkled. I'll make that change. 😄 Note that there's still no lookup/create though at API call time, it's all just song and dance for generating a deterministic cache ID that we can choose to scope later on the backend. Whether that scoping happens at |
fixes #3270
Implements the following:
(Bikeshedding welcome.)
Prior discussions were in Discord, crystallized in this comment: #3217 (comment)
The basic idea is to re-introduce
CacheIDbut with an API for creating them rather than having the user provide such a low-level and far-reaching identifier directly.Rather than an arbitrary string value, a
CacheIDis a special encoded payload, just like other IDs, containing arbitrary string tokens the user provides for controlling cache busting.The idea is to prevent users from accidentally passing in a
CacheID("banana")without realizing the API exists. There's no state anywhere, so it's possible to create one client-side if you know the structure, but I don't think we really care yet. If/when we do care, and we want to prevent users from accessing each other's caches, we can just augment the tokens with a user-scoped value.