Skip to content

add CacheID and CacheVolume#3287

Merged
vito merged 2 commits into
dagger:cloakfrom
vito:cache-volumes
Oct 8, 2022
Merged

add CacheID and CacheVolume#3287
vito merged 2 commits into
dagger:cloakfrom
vito:cache-volumes

Conversation

@vito
Copy link
Copy Markdown
Contributor

@vito vito commented Oct 7, 2022

fixes #3270

Implements the following:

"A global cache volume identifier"
scalar CacheID

extend type Query {
  "Construct a cache volume from its ID"
  cache(id: CacheID!): CacheVolume!

  "Create a new cache volume identified by an arbitrary set of tokens"
  cacheFromTokens(tokens: [String!]!): CacheVolume!
}

"A directory whose contents persist across runs"
type CacheVolume {
  id: CacheID!
}

extend type Container {
  withMountedCache(path: String!, cache: CacheID!, source: DirectoryID): Container!
}

(Bikeshedding welcome.)

Prior discussions were in Discord, crystallized in this comment: #3217 (comment)

The basic idea is to re-introduce CacheID but with an API for creating them rather than having the user provide such a low-level and far-reaching identifier directly.

Rather than an arbitrary string value, a CacheID is a special encoded payload, just like other IDs, containing arbitrary string tokens the user provides for controlling cache busting.

The idea is to prevent users from accidentally passing in a CacheID("banana") without realizing the API exists. There's no state anywhere, so it's possible to create one client-side if you know the structure, but I don't think we really care yet. If/when we do care, and we want to prevent users from accessing each other's caches, we can just augment the tokens with a user-scoped value.

@netlify
Copy link
Copy Markdown

netlify Bot commented Oct 7, 2022

Deploy Preview for cloak-docs ready!

Name Link
🔨 Latest commit fd84af1
🔍 Latest deploy log https://app.netlify.com/sites/cloak-docs/deploys/6340e33844c5db0009720d34
😎 Deploy Preview https://deploy-preview-3287--cloak-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site settings.

Copy link
Copy Markdown
Contributor

@sipsma sipsma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! No major comments looks great

Comment thread core/cache.go Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the sense I'm forgetting something obvious, but do we still need "Cache" here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, copy-pasta - fixing!

vito added 2 commits October 7, 2022 22:40
Signed-off-by: Alex Suraci <alex@dagger.io>
Signed-off-by: Alex Suraci <alex@dagger.io>
@vito vito merged commit 940cbf3 into dagger:cloak Oct 8, 2022
@vito vito deleted the cache-volumes branch October 8, 2022 17:24
@shykes
Copy link
Copy Markdown
Contributor

shykes commented Oct 8, 2022

I’m a bit late to the party, sorry… I don’t understand the value of the “token” concept. As a developer: Why do I need to provide one? How do I choose a token name? How many should I provide?

Can’t I just create a cache volume, get its ID, then use that ID wherever I want to mount that volume?

@vito
Copy link
Copy Markdown
Contributor Author

vito commented Oct 12, 2022

@shykes Assuming there's no volume state behind the scenes, and you're suggesting to just generate random IDs, that seems painful as it puts the responsibility on the client to manage the lifecycle of that ID themselves. There isn't a clear practice that I can imagine for the API consumer to stash that cache ID away and re-use it for later runs, and there's no way to "retrieve" the ID from the last run because there were no inputs associated to it in the first place, and for similar reasons it's hard for the client to decide when to invalidate the cache and create a new one.

In my experience cache keys are always either a static string or they're keyed on some external dependency. For example here I need a cache mount that will be invalidated whenever the base image changes, so I put the image tag + digest in the cache key. The other examples are mostly simple strings, but some demonstrate a basic hierarchy like bass-loop/gopath, which you could interpret as two tokens.

So tokens are just a seed for generating deterministic cache IDs instead of random ones. I went with a list instead of a single string just to make composite cache keys like the one I linked easier to use (worth noting that GraphQL doesn't support string concatenation or interpolation). To re-use a cache, just keep passing the same tokens. To create a new cache, just pass different tokens. For now it's just a CacheID with extra steps, but it's a layer of indirection that'll let us throw more things into CacheID generation if we ever need scoping.

Hope that makes sense and I'm not totally missing your point. 😅

@vito
Copy link
Copy Markdown
Contributor Author

vito commented Oct 12, 2022

Totally not married to the name 'token' btw, it was a game-time decision.

@shykes
Copy link
Copy Markdown
Contributor

shykes commented Oct 12, 2022

Thanks, you’re absolutely right. I was so focused on how to reuse a cache volume across queries in the same session, I totally forgot about the need to reuse it across sessions

Quick follow-up:

  • How do we avoid conflicts between tenants of the same buildkit? I think we might have to deal with that level of scoping from day one, or face complications as soon as 2 dagger users point to the same buildkit (which could happen quickly). My first guess is: maybe the local dagger installation has a random $HOME/.dagger/hostid that the graphql server uses for scoping?

  • I like “key” better than “token” because it evokes the familiar term “cache key” already stored in my brain. I think that’s pretty common?

  • I get your point on the lack of string interpolation in graphql. Allowing multiple keys/tokens is the most correct solution from an engineering perspective, and yet… I think a single key might be better DX, because that’s what most devs will expect; while power users like bass can trivially interpolate.

  • could we somehow use a trick similar to container { withFS } and container { from } to avoid the awkward cacheFromTokens ?

@vito
Copy link
Copy Markdown
Contributor Author

vito commented Oct 12, 2022

  • How do we avoid conflicts between tenants of the same buildkit? I think we might have to deal with that level of scoping from day one, or face complications as soon as 2 dagger users point to the same buildkit (which could happen quickly). My first guess is: maybe the local dagger installation has a random $HOME/.dagger/hostid that the graphql server uses for scoping?

I agree that we'll need scoping really soon if not on day 1. I wonder if it's closer to "project" than "user" though. I may be using the same machine/user to work on multiple projects that all use Dagger and it'd be awkward if they collided with one another.

In some cases you might want to share caches across projects though, for example it should be safe to share one big Go build/module cache.

Maybe it could be a value in project.json? (cc @sipsma) And then if you want to share caches you'd just set the same value in each project?

Also one thing worth clarifying: we're aiming to prevent cache collisions just from a usability perspective, not a security perspective, right? So it should be OK to let the user provide the scope key. Multiple untrusted tenants on the same Buildkit is never a safe thing to do anyway.

  • I like “key” better than “token” because it evokes the familiar term “cache key” already stored in my brain. I think that’s pretty common?

Works for me!

  • I get your point on the lack of string interpolation in graphql. Allowing multiple keys/tokens is the most correct solution from an engineering perspective, and yet… I think a single key might be better DX, because that’s what most devs will expect; while power users like bass can trivially interpolate.

I'll change it to a regular string. I started writing up an example that reused GraphQL query params to feed into the cache tokens, but then realized it's a separate query to create the cache anyway; given that the user will have to decide which params to pass for creating the cache, they can just as easily come up with a cache key ahead of time.

The other thought I had was related to GitHub Action's restore-keys config which configures a prefix to use for finding the last cache to restore, which is pretty handy. It felt like a list conveyed more of a prefix hierarchy, but it clearly works with strings too. And we might never even implement that.

  • could we somehow use a trick similar to container { withFS } and container { from } to avoid the awkward cacheFromTokens ?

Agree that cacheFromTokens sticks out like a store thumb. My issue with this as-is is that while container {} and directory {} are both universally valid (scratch container, empty dir), cache {} would then imply one big shared cache, which feels kind of odd. I think this could make sense once we have scoped caches since it'd at least be only affect one tenant. Essentially a "global project cache"?

@sipsma
Copy link
Copy Markdown
Contributor

sipsma commented Oct 12, 2022

re: scoping

Yeah I mean technically this exact issue has been present since Europa (it's just that the cache id was a string implicit to the exec api, but it was shared across all sessions concurrently connected to buildkit). That's not a reason to ignore but it is true that we have gotten by ignoring it up til now :-)

This issue of when/how to share cache mounts came up when discussing the cloud backend too. It's part of a broader question of namespacing and session-state, which will also impact services, synced directories and even our long-future plans around magical caching (there was some discussion about tiering+namespacing there).

That is to say, this is a very complicated topic with ties to many other complicated topics. The reason I like the tokens approach is that it should work for any number of different future possibilities. It may even make sense to use a similar approach with some other resources like services (maybe).

If I had to guess, I'd think one likely first step would be to just enable the engine to optionally implicitly scope all cache mounts to a single session, to support the cloud backend use case. It could do this by just implicitly inserting the session id into the set of tokens.

For more broader use cases, it gets complicated because there's valid use cases for having global caches, for having "team-wide" caches (shared with a few individuals) and for having "individual" caches, all at the same time. The tokens approach will also be a great base for the implementation there I think. It's more figuring out to enable that tiering of namespaces+sessions that also works with everything else that will become challenging.

There's plenty of prior art for this too in k8s and even containerd (and many many others I'm sure), we can learn from that.

Either way, I very briefly mention cache mounts in my WIP arch discussion here (pain point 5), but as that grows and splits out into sub issues it will probably grow in prominence, can keep figuring this all out there: #3287 (comment)

EDIT: to be clear, whether tokens are literally in the API schema or if we just have a single string there and have users construct tokens themselves is mostly independent of the above. I was thinking more about the "spirit" of that sort of namespacing design, the underlying implementation is more flexible.

@vito
Copy link
Copy Markdown
Contributor Author

vito commented Oct 12, 2022

Created #3341 with the changes I mentioned.

I went with cache { withKey } even though it means we expose a global top-level cache which could be potentially misused until we make it scoped. I like withKey because it has nice scoping/nesting semantics since it just appends a key to create a new child.

Internally this actually works the same as before; withKey just appends the key to the internal list. So I think this meets both the engineering and DX goals.

@shykes
Copy link
Copy Markdown
Contributor

shykes commented Oct 12, 2022

Starting with the easy part: bikeshedding the API signature :)

I went with cache { withKey }

I wonder if we need an ID at all now - why not just lookup by key? On first lookup, we just create it.

extend type Query {
  “””
  Lookup a cache volume by its key, creating a new one if needed.
  Key is namespaced to the current project.
  “””
  cacheVolume(key: String!): CacheVolume!
}

@shykes
Copy link
Copy Markdown
Contributor

shykes commented Oct 12, 2022

Focusing on the hard part now...

I agree that we'll need scoping really soon if not on day 1. I wonder if it's closer to "project" than "user" though. I may be using the same machine/user to work on multiple projects that all use Dagger and it'd be awkward if they collided with one another.

You are right. I see two possible paths here:

  1. We introduce the smallest possible concept of project, just enough for have basic scoping, and improve later.
  2. We defer project management, and therefore scoping, entirely. We take the leap of faith that we can introduce it later and not break the UX (similar to extensions).

I'm leaning towards option 1. Mostly because it seems really hard to ship a UX with global scoping, then change it to support per-project scoping. I could be wrong though.

Note: I think @vito 's use of "project" is roughly equivalent to @sipsma 's use of "session" in this discussion, is this accurate?

In some cases you might want to share caches across projects though, for example it should be safe to share one big Go build/module cache.

This part I'm OK with deferring to later. I think if we have a good project scoping primitive, we can find a way to add this feature later.

Also one thing worth clarifying: we're aiming to prevent cache collisions just from a usability perspective, not a security perspective, right? So it should be OK to let the user provide the scope key. Multiple untrusted tenants on the same Buildkit is never a safe thing to do anyway.

IMO we need a UX primitive that works in a single-tenant and multi-tenant context. Totally agree that we don't want to actually make a buildkit instance multi-tenant (at least not with untrusted tenants) but we might want to implement multi-tenancy at a higher level (with cloud etc). So if the UX has a hook for that, it would be ideal.

@shykes shykes mentioned this pull request Oct 12, 2022
@vito
Copy link
Copy Markdown
Contributor Author

vito commented Oct 13, 2022

@shykes

  cacheVolume(key: String!): CacheVolume!

Too simple and clear. We need to show that our brains are large and wrinkled.

I'll make that change. 😄

Note that there's still no lookup/create though at API call time, it's all just song and dance for generating a deterministic cache ID that we can choose to scope later on the backend. Whether that scoping happens at cache {} time or cacheVolume() time is immaterial.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants