Basic namespacing in API #3345

Closed
shykes opened this issue Oct 12, 2022 · 21 comments
Labels
area/engine About dagger core engine resolution/duplicate This issue or pull request already exists

Comments

@shykes
Contributor

shykes commented Oct 12, 2022

Overview

In the new cloak core API, cache volumes are associated with a key provided by the client. Buildkit does not provide a facility for namespacing these keys, so by default every client that looks up the cache volume foo on the same engine will use the same cache volume. This is not practical, and will probably lead to ad-hoc namespacing solutions. It would be better for Dagger itself to handle namespacing. But how?
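
For illustration, a minimal sketch with the Go SDK (the helper name is made up; the dagger.io/dagger import is assumed):

    // Sketch: two unrelated pipelines, possibly in different repos, can both
    // contain this call. On a shared engine both mounts resolve to the same
    // underlying volume, because the key "foo" is not scoped to either pipeline.
    func withNpmCache(client *dagger.Client, ctr *dagger.Container) *dagger.Container {
        return ctr.WithMountedCache("/root/.npm", client.CacheVolume("foo"))
    }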

As explained by @vito (see original discussion below), the answer is probably namespacing by something roughly equivalent to a "project". This issue is to discuss whether this feature is actually needed, and if so how to design it.

Original discussion

As discussed in #3287 (comment):

I agree that we'll need scoping really soon if not on day 1. I wonder if it's closer to "project" than "user" though. I may be using the same machine/user to work on multiple projects that all use Dagger and it'd be awkward if they collided with one another.

You are right. I see two possible paths here:

  1. We introduce the smallest possible concept of namespace, just enough to have basic scoping, and improve later.
  2. We defer namespacing entirely. We take the leap of faith that we can introduce it later and not break the UX (similar to extensions).

I'm leaning towards option 1. Mostly because it seems really hard to ship a UX with global scoping, then change it to support namespacing that can support something like a project. I could be wrong though.

Note: I think @vito's use of "project" is roughly equivalent to @sipsma's use of "session" in this discussion. Is this accurate?

In some cases you might want to share caches across projects though, for example it should be safe to share one big Go build/module cache.

This part I'm OK with deferring to later. I think if we have a good project scoping primitive, we can find a way to add this feature later.

Also one thing worth clarifying: we're aiming to prevent cache collisions just from a usability perspective, not a security perspective, right? So it should be OK to let the user provide the scope key. Multiple untrusted tenants on the same Buildkit is never a safe thing to do anyway.

IMO we need a UX primitive that works in a single-tenant and multi-tenant context. Totally agree that we don't want to actually make a buildkit instance multi-tenant (at least not with untrusted tenants) but we might want to implement multi-tenancy at a higher level (with cloud etc). So if the UX has a hook for that, it would be ideal.

@shykes shykes added the cloak label Oct 12, 2022
@shykes shykes added this to the 0.3.0-alpha.1 milestone Oct 12, 2022
@shykes shykes changed the title Basic project management Basic project scoping Oct 12, 2022
@sipsma
Contributor

sipsma commented Oct 12, 2022

"Session" and the use of "project" here are related but not entirely the same.

Session is an already existing concept today. It's how buildkit scopes certain resources (localdirs, gateway containers, among others). Sessions have a random ID (they also have a name you can set, but that's different than the ID) and are tied to a single client connection. Once that connection is closed, the session is gone.

  • Confusingly, some resources associated with it are freed (gateway containers), but others have more complicated stories (local dirs)

Project as it's used here (and as I understand it, correct me if wrong) would be more persistent than a session because it would be scoping cache dirs. If you use a cache dir, disconnect and then reconnect, it should be possible for the cache dir to still exist (as contrasted with e.g. gateway containers where they are gone when the session ends).

The direction I've been leaning more and more towards for a while is that the session concept we get from Buildkit is really hard to work with and often not what we want. What we really want is more like the "project" concept above where resources are less ephemeral. You should be able to let a service keep running even if you disconnect.

  • I actually asked Buildkit maintainers a few months ago whether they'd want persistent gateway containers specifically; they said no.

The vague, incomplete idea that's been knocking around my head is that we maybe should invent our own "dagger session" concept which encompasses not only services and local dirs (like Buildkit sessions) but probably also the loaded schema. Can add cache dirs to that too. Dagger sessions would have some degree of persistence.

If we wanted to call "dagger session" a project instead, that'd make sense to me in isolation; we just need to sort out the conflicts with the existing concept called project. The two concepts have overlap in that they may both be associated with a schema, but in other ways they are not really the same. I'd actually be in favor of calling the "persistent-session thing" a "project" and the "config for an extension" (i.e. what's currently called "project") something else.

Hard parts:

  1. Garbage collection - does project state live forever? Get pruned after some time? etc.
  2. Sharing between projects - what if a service should be accessible in multiple projects? Or a cache volume? Or a synced dir? etc.
    • Support for tiered namespacing is a possibility, but complicated

@shykes
Contributor Author

shykes commented Oct 12, 2022

In order to avoid confusion with existing definitions of "session" and "project", for the purpose of this discussion I propose using the more neutral and functional "namespace". The analogy is e.g. to kubernetes namespaces, containerd namespaces, and perhaps docker context?

Doesn't have to be the actual name we use, it just seems useful to have our own name for now, to avoid interference.

EDIT: add comparison to containerd namespace

@shykes shykes changed the title Basic project scoping Basic namespacing in API Oct 12, 2022
@sipsma
Contributor

sipsma commented Oct 12, 2022

kubernetes namespaces

containerd has namespaces too, which could be especially relevant to us if we decide to use the containerd worker

@mircubed mircubed removed the cloak label Nov 29, 2022
@shykes shykes added area/engine About dagger core engine and removed area/core labels Dec 5, 2022
@mircubed mircubed added the stuck label Jan 12, 2023
@mircubed mircubed removed the stuck label Mar 20, 2023
@shykes
Contributor Author

shykes commented Jul 7, 2023

I see two possible paths here:

  1. We introduce the smallest possible concept of namespace, just enough to have basic scoping, and improve later.
  2. We defer namespacing entirely. We take the leap of faith that we can introduce it later and not break the UX (similar to extensions).

Update: we have gone with option 2 (defer namespacing entirely). Now we are facing an urgent need to introduce namespacing, in order to implement persistence of cache volumes in Dagger Cloud.

I am resurrecting this thread to make sure we avoid breaking the UX.

cc @sipsma @vito @marcosnils

@sipsma
Contributor

sipsma commented Jul 7, 2023

@shykes What is the connection between namespaces and cache volume syncing? I can imagine updating the cache volume API to support settings that would impact the cloud's sync of them, but I don't see yet the connection point with namespacing, which has more to do with multitenancy in my mind.

@shykes
Contributor Author

shykes commented Jul 7, 2023

@shykes What is the connection between namespaces and cache volume syncing? I can imagine updating the cache volume API to support settings that would impact the cloud's sync of them, but I don't see yet the connection point with namespacing, which has more to do with multitenancy in my mind.

The connection is that when creating a cache volume in your code, you need to give it a name; that name is chosen with an assumption that cache volumes are namespaced. How namespacing works exactly will impact when my cache volume is reused, and when it's not.

My understanding is that, in the current implementation of Dagger Cloud, cache volume names are global to the entire Cloud organization. So if I run 5 unrelated pipelines, all with a cache volume named "npm", they will be synced as a single volume. This is leading to a) stopgaps where e.g. @jpadams is writing code with globally unique cache volume names such as customer_name_2, and b) the cloud team designing a cloud-specific way to namespace cache volumes.
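
The stopgap amounts to hand-rolling uniqueness into the key, roughly like this (a sketch with the Go SDK; the helper and its parameters are illustrative, and the fmt and dagger.io/dagger imports are assumed):

    // Hand-rolled "namespacing": bake org/project into the key yourself,
    // because cache volume names are currently global to the Cloud org.
    func scopedCacheVolume(client *dagger.Client, org, project, name string) *dagger.CacheVolume {
        return client.CacheVolume(fmt.Sprintf("%s_%s_%s", org, project, name))
    }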

Instead of designing a cloud-specific way to namespace cache volumes, I would like us to design namespacing platform-wide, then implement it in Cloud.

@shykes shykes unassigned vito Jul 7, 2023
@sipsma
Contributor

sipsma commented Jul 7, 2023

So if I run 5 unrelated pipelines, all with a cache volume named "npm", they will be synced as a single volume. This is leading to a) stopgaps where eg. @jpadams is writing code with globally unique cache volume names such as customer_name_2,

I'm still missing how "cache volumes are shared" leads to us needing to namespace by customer_name_*. They just seem completely unrelated still. @jpadams can you help fill in the gaps here?

To be clear, I'm not saying we don't want some type of namespacing ever (namespacing by npm version could be a nice feature to have for this specific case), it just feels independent of cloud synchronization.

b) cloud team is designing a cloud-specific way to namespace cache volumes.

@marcosnils can you clarify this point? I understand that in the backend we store layers+cache-mounts under org-specific prefixes, but that's a 100% totally internal implementation detail that users need zero awareness of. Where else does namespacing come into play that it impacts users?

@shykes
Contributor Author

shykes commented Jul 7, 2023

I'm still missing how "cache volumes are shared" leads to us needing to namespace by customer_name_*.

Sorry that was confusing on my part. I’m conflating cross-org global namespacing (unrelated and very temporary cloud issue) with org-wide namespacing, which would require volume names more like teamfoo_projectbar_npm.

Also just illustrating that design considerations on the cloud side have impact on DX on the engine side, so they should be considered carefully.

@sipsma
Contributor

sipsma commented Jul 7, 2023

Sorry that was confusing on my part. I’m conflating cross-org global namespacing (unrelated and very temporary cloud issue) with org-wide namespacing, which would require volume names more like teamfoo_projectbar_npm.

Oh okay, so we're thinking about when users want to split up their org into separate "sub-orgs" (teams, etc.) and then have independent synchronization of cache mounts?

If so, makes much more sense and I can see the conceptual connection. In terms of the implementation, I believe it still is orthogonal to cloud synchronization right now, but that could change in the future.

Reason being that cloud sync currently works with the low-level name of the cache mount in buildkit, which would be one or two layers of abstraction below namespacing dagger applies. So it shouldn't need any awareness of namespacing, it just syncs based on whatever the namespace "compiles" to in terms of the low-level cache mount name.

One hypothetical situation where a connection point arises is if we want some sort of "tiered" caching that reflects org structures. So if you are in OrgFoo and on TeamBar, which is a part of OrgFoo, then maybe you want to be able to say something like: "give me the npm cache for TeamBar, but if there isn't any then check to see if there's any in OrgFoo".
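
A rough sketch of the lookup order such tiered caching implies (purely hypothetical; existsInScope is a placeholder, not an existing API):

    // Hypothetical tiered resolution: prefer the team-scoped cache, fall back
    // to the org-scoped one, and finally to an unscoped default.
    func resolveCacheKey(name string, scopes []string) string {
        // scopes might be ["OrgFoo/TeamBar", "OrgFoo"]
        for _, scope := range scopes {
            if existsInScope(scope, name) { // placeholder lookup
                return scope + "/" + name
            }
        }
        return name
    }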

I would have previously assumed that was a feature a ways off in terms of needing, but maybe not?

Either way, agree it's worth more thought if we are starting down the "org hierarchy" path already. We should also consider whether layer caches would be impacted by this too though (mostly in terms of storage, pruning, pricing, etc.).

@shykes
Contributor Author

shykes commented Jul 7, 2023

Oh okay, so we're thinking about when users want to split up their org into separate "sub-orgs" (teams, etc.) and then have independent synchronization of cache mounts?

One way to describe the problem: when I write code describing my pipeline logic (which includes creating and naming my cache volumes), I shouldn't have to know everything else that runs, or will run in the future, in the same Dagger Cloud org as my pipeline. But if Dagger Cloud considers cache volume names to be global to the org, then that's effectively what I have to do.

cloud sync currently works with the low-level name of the cache mount in buildkit, which would be one or two layers of abstraction below namespacing dagger applies. So it shouldn't need any awareness of namespacing, it just syncs based on whatever the namespace "compiles" to in terms of the low-level cache mount name.

What is an example of such a low-level volume ID, and what other inputs go into "compiling" it? I still feel like when developing my pipeline logic, I need to know how my cache volumes will be namespaced - "don't worry about it" doesn't help me decide how to name my volume.

@sipsma
Contributor

sipsma commented Jul 10, 2023

One way to describe the problem: when I write code describing my pipeline logic (which includes creating and naming my cache volumes), I shouldn't have to know everything else that runs, or will run in the future, in the same Dagger Cloud org as my pipeline. But if Dagger Cloud considers cache volume names to be global to the org, then that's effectively what I have to do.

Agree that's a problem, though I think the solution to that in particular is either (or both) of the following:

  • There's an API for registering and querying the cache volumes your org uses
  • Cache volumes can be defined in a Zenith env and vendored out to users in your org from there

But I don't think this has to do with namespacing. Even if you could namespace, you'd still have the same problem of needing to know what exists in your namespace.

Maybe we have different definitions of what namespacing means? From the original description of this issue, I thought it was just a way of scoping the same name under different "contexts" (aka namespaces) for the purposes of isolation and multitenancy, which I think only comes into play here if we want tiered caches from org->team that I mentioned in my previous comment.

What is an example of such a low-level volume ID, and what other inputs go into "compiling" it? I still feel like when developing my pipeline logic, I need to know how my cache volumes will be namespaced - "don't worry about it" doesn't help me decide how to name my volume.

Right now (with no namespacing), the name is just a hash of the cache volume id, which currently only contains the name the user provided. With namespacing, presumably the namespace hierarchy would also be mixed in to the hash.
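
Concretely, "mixing the namespace into the hash" might look something like this (a sketch only, not the actual implementation; crypto/sha256, encoding/hex and strings imports assumed):

    // Sketch: derive the low-level buildkit cache mount name from the
    // user-visible volume name plus whatever namespace hierarchy applies.
    func lowLevelCacheName(namespace []string, name string) string {
        id := strings.Join(namespace, "/") + "/" + name // e.g. "OrgFoo/TeamBar/npm"
        sum := sha256.Sum256([]byte(id))
        return hex.EncodeToString(sum[:])
    }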

I agree that none of that helps you decide what to name your volume, the point was that magicache doesn't need to care about this, it just syncs what it's told to sync. It would only need awareness of namespacing if we want tiered caching, otherwise it's just an opaque ID of a cache volume to sync.

@shykes
Contributor Author

shykes commented Jul 10, 2023

  • Cache volumes can be defined in a Zenith env and vendored out to users in your org from there

I didn't understand this part, what does "vendored out to users in your org" mean?

@shykes shykes modified the milestones: 0.3.0-alpha.1, Engine v0.3.8 Jul 10, 2023
@sipsma
Contributor

sipsma commented Jul 10, 2023

I didn't understand this part, what does "vendored out to users in your org" mean?

I'm imagining a top-hat for an org creating their org's standalone environment and including CacheVolume types as part of that environment's API. The low-level plumbing for this would be easy to support, but this doesn't clearly fit into any of the existing entrypoint types (commands, artifacts, shells, etc.), so I suppose we could consider adding a Caches entrypoint type?

The DX would be that when users in that org need to use a cache volume, they just import their org's environment and then can choose cache volumes from it. So if the top-hat wants to support pip_cache and gradle_cache, users would use them in their pipelines in either of these forms:

  1. Type-unsafe version: client.Environment().Load("my-org").Cache("pip_cache")
  2. Codegen version: client.MyOrg().PipCache()

Either one of those above would be returning a CacheVolume type, so it would just plug into the rest of the APIs seamlessly. You could also imagine APIs for querying what caches exist, etc.
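
Either way the result behaves like any other CacheVolume, so it would plug into existing mounts unchanged, e.g. (a sketch reusing the hypothetical API from option 1 above):

    // Sketch: the vendored volume behaves like any other CacheVolume.
    func withOrgPipCache(client *dagger.Client, ctr *dagger.Container) *dagger.Container {
        pip := client.Environment().Load("my-org").Cache("pip_cache") // hypothetical API from option 1
        return ctr.WithMountedCache("/root/.cache/pip", pip)
    }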

The overall idea is that Zenith gives orgs a way of cataloguing cache volumes for end-users, which solves the problem of "when I write code describing my pipeline logic (which includes creating and naming my cache volumes), I shouldn't have to know everything else that runs, or will run in the future, in the same Dagger Cloud org as my pipeline." You know what to use because your org's environment tells you what is available to use.

There's actually a bunch more interesting features we could layer on top of something like this, but I'll try to not derail the conversation too much yet :-)

@shykes
Contributor Author

shykes commented Jul 11, 2023

  1. Type-unsafe version: client.Environment().Load("my-org").Cache("pip_cache")
  2. Codegen version: client.MyOrg().PipCache()

Either one of those above would be returning a CacheVolume type, so it would just plug into the rest of the APIs seamlessly. You could also imagine APIs for querying what caches exist, etc.

There's actually a bunch more interesting features we could layer on top of something like this, but I'll try to not derail the conversation too much yet :-)

OK that's pretty cool - perhaps even mindblowing ;) This gives me a glimpse of a deeper integration with Dagger Cloud, where devs can take advantage of its features programmatically. Which I love.

BUT it also scares me a little bit: wouldn't this cause cloud-specific and org-specific details to leak all over the place into otherwise perfectly portable code?

I feel like we could have the best of both worlds if we could make this similar to the dynamic secrets API: 1) fully programmatic; but also 2) tracing a path to keeping the code portable in the future, with "secret providers".
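
For concreteness, the "provider" shape being gestured at might look something like this (entirely hypothetical, by analogy with secret providers; none of these names exist today):

    // Entirely hypothetical: pipeline code names a logical cache, and a pluggable
    // provider (engine-local, Dagger Cloud, org environment) decides how it is
    // namespaced and backed. Only the provider binding is environment-specific.
    type CacheProvider interface {
        CacheVolume(ctx context.Context, name string) (*dagger.CacheVolume, error)
    }

    // Portable pipeline code:
    //   npm, err := provider.CacheVolume(ctx, "npm")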

I feel like your idea for cache volumes isn't quite there, but perhaps it could be?

@mircubed mircubed removed this from the Engine v0.3.8 milestone Jul 18, 2023
@sipsma
Contributor

sipsma commented Jul 19, 2023

@shykes Sorry, I just remembered this whole thread and realized I never got back to you.

BUT it also scares me a little bit, wouldn't this cause cloud-specific, and org-specific details to leak all over the place into otherwise perfectly portable code?

I'd argue that "cloud-specific, and org-specific details" are not something to be avoided. We absolutely 100% want portable, re-usable environments, but ultimately end-users need to take those environments and apply them to their precise needs, and it's great if that's just code too. I guess the analogies would be:

  1. There's lots of re-usable/generic libraries out there for every language, but you still ultimately have to actually import and use those for your specific use case. All of that, both the libraries and the end users' code, gets committed/published somewhere.
  2. There's plenty of shared utils and abstractions in various IaC frameworks, but it's still beneficial to publish the code for your specific infra that uses those utils somewhere too.

I feel like we could have the best of both worlds, if we could make this similar to the dynamic secrets API: 1) fully programmatic; but also 2) traces a path to keeping the code portable in the future, with "secret providers".

I can't totally imagine what this would look like in a way that's different than just having the ability to import a cache mount definition that's defined in an external environment; I might need clarification on what you are imagining for something like this.

@mircubed
Contributor

@marcosnils @aluzzardi can you provide additional context on the current status of this?

@sagikazarmark
Contributor

This is what I ran into recently:

I have a project with two Go binary builds and an e2e test run (not to mention a golangci-lint module that uses the same Go module under the hood).
The binary builds run in one GHA workflow, the e2e test in another (although I don't think that really matters).

All three steps use the same Go module that defines the following cache volumes:

        WithMountedCache("/root/.cache/go-build", dag.CacheVolume("go-build")).
        WithMountedCache("/go/pkg/mod", dag.CacheVolume("go-mod"))

That results in the behavior outlined above: essentially the last one to complete wins and overwrites the cache of the other two.

In this case, each of the three steps would need its own "namespace" to work.

I also wonder if the namespace should include other information, like branch or commit hash (similarly to how caching works on GHA and other SaaS CI providers). Let's say I have two PRs running the same builds in parallel. They would also overwrite the cache.

One potential solution I can imagine is initializing a new module with some context accessible from the module:

// When calling the module
dag.Go(/* add some context here */).FromVersion("1.21.5")

// In module
dag.Context()

I considered adding a cache key argument to my Go module (I may still do it), but considering the rather complex API (partly due to the method chaining) it wouldn't be trivial. (For reference: here is the module)
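
For reference, the cache-key-argument idea would look roughly like this in the module (hypothetical field and method names, not part of the published module):

    // Hypothetical: let callers scope the module's cache volumes themselves.
    func (m *Go) WithCacheKey(key string) *Go {
        m.CacheKey = key // e.g. "binary-build", "e2e", or a branch name
        return m
    }

    // ...and inside the module, prefix the volume names with it:
    //   dag.CacheVolume(m.CacheKey + "-go-build")
    //   dag.CacheVolume(m.CacheKey + "-go-mod")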

@shykes
Contributor Author

shykes commented Dec 19, 2023

@sagikazarmark in your case I think you are running into 3 distinct (but related) consequences of the lack of namespacing:

  1. Global Go cache. All your invocations of go share the same Go cache because they use the same cache volume name. This is actually OK (I think) because Go supports sharing the same cache across projects.

  2. Concurrent access. In the default sharing configuration, cache volumes allow concurrent writers. So if you invoke go multiple times concurrently, they will hit the same cache volume at the same time. This is also OK (I think) because Go also supports concurrent writers to its global cache.

  3. Last run wins. You are seeing some runs overwrite the result of other runs. I think this is an artifact of Dagger Cloud’s distributed caching implementation. At the moment cache volumes are shared between all engines connected to the same Dagger Cloud organization. The Dagger Cloud organization is a global namespace for cache volumes. Synchronization is done by uploading a snapshot of the entire volume. The last upload wins. There is no merge mechanism at the moment.

In summary: the core namespacing problem is not specific to Dagger Cloud, but it can manifest itself in more visible ways when using Dagger Cloud with cache volumes, and that is what you are encountering.

If I’m right, then grouping all three steps in the same Dagger session (ie. in the same CI job) should solve the “last writer wins” problem.
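
A sketch of that stopgap with the Go SDK (the step functions are placeholders for the existing build and e2e logic; the os, context, and dagger.io/dagger imports are assumed):

    // Sketch: one connection, so all three steps run in the same Dagger session
    // and share engine-local cache volumes instead of racing full-volume uploads.
    func runAll(ctx context.Context) error {
        client, err := dagger.Connect(ctx, dagger.WithLogOutput(os.Stderr))
        if err != nil {
            return err
        }
        defer client.Close()

        buildBinaries(ctx, client) // placeholder for the two binary builds
        runE2ETests(ctx, client)   // placeholder for the e2e step
        return nil
    }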

@sagikazarmark
Contributor

@shykes makes sense, I'll try that, thanks.

TBH, I'm not that comfortable with bundling all CI steps into a single run though.

For one, I don't get the same feedback on GH right now with checks if I do that.

Also, in some projects it's just not possible to scale the runner vertically (eg. OSS projects), so splitting CI runs is often for performance reasons (I don't have the numbers to back that up though...yet).

@shykes
Contributor Author

shykes commented Dec 19, 2023

@shykes makes sense, I'll try that, thanks.

TBH, I'm not that comfortable with bundling all CI steps into a single run though.

For one, I don't get the same feedback on GH right now with checks if I do that.

Also, in some projects it's just not possible to scale the runner vertically (eg. OSS projects), so splitting CI runs is often for performance reasons (I don't have the numbers to back that up though...yet).

I completely understand. To be clear I only suggested this as a stopgap to alleviate the immediate pain. I absolutely agree that you shouldn’t have to combine runs in order to get the cache volume sharing semantics you expect.

@shykes
Contributor Author

shykes commented Apr 26, 2024

I am deprecating this issue in favor of #7211, which is more up-to-date and more narrowly focused on cache volumes.

@shykes shykes closed this as not planned Apr 26, 2024
@shykes shykes added the resolution/duplicate This issue or pull request already exists label Apr 26, 2024