Question
I want to start a discussion on an important aspect of Dagger's architecture: whether to make the engine stateful or stateless. This is a complex topic with important ramifications that affect users in very tangible ways.
This has been a known unresolved question for a while, which we collectively shelved to deal with more urgent things. But I think now is the time to unshelve it, because more organizations are now running, or looking to run, Dagger on their CI infrastructure. As a result they face new questions:
- What is the architecture? What are the components, where do they run, how are they installed?
- What is the lifecycle? What components need to be updated when, and what are the dependencies between them?
- Who's in charge? When self-hosted CI runners are involved, Dagger touches the CI platform (e.g. Github Actions) and the underlying compute platform (e.g. Kubernetes). These platforms are often managed by different teams, sometimes in completely different orgs, with very different priorities and timelines. Who ultimately is in charge of installing Dagger in CI? Are they the same people who experience the value of Dagger? If not, how do we make harmonious collaboration between those two teams possible?
At the moment it is very difficult to answer these questions clearly and definitively, because our architecture is not clear, and not yet defined. For various pragmatic engineering reasons, the engine is split in two parts:
- A stateless frontend that lives in the dagger CLI. This includes both the user-facing interface and GraphQL API.
- A stateful backend that lives in a container. This includes our custom build of buildkit, containerd and associated tooling.
These two pieces must eventually be reunited - we have strong consensus on that. The unresolved question is where?
- Does the entire engine become stateless and live in the dagger CLI + purely ephemeral companion containers spawned as needed?
- Or does it become stateful and move entirely into a long-running OCI container?
This is a simplified question designed to provoke conversation. I believe both options are possible, and both come with a long list of difficult tradeoffs and nuance. But, let's start simple for now :) What does everyone think?
Answer
After much discussion (see comments below), we have reached the following tentative answer:
Where we want to go
- We should move towards making the engine stateless and ephemeral
- The benefits are 1) it simplifies the versioning matrix, 2) it simplifies the operational model, 3) it makes it easier to adopt Dagger.
How to get there
- We should start moving towards this architecture in incremental steps
- We should take the first step as soon as possible
- The first step is detailed by @sipsma as follows:
I actually think we can+should jump immediately to doing this such that the changes are to our engine image, rather than the k8s daemonset, and thus benefit every user rather than just k8s users.
The outline being (in approximate order of implementation steps, can be done iteratively rather than all at once):
- Switch our engine's executor backend from oci worker -> containerd worker
  - This means we have to include containerd in our engine image, which will increase the size a bit, but highly unlikely to be so much it's a real concern.
  - Also containerd may be a preferable backend to use anyways since we could benefit from the (small but growing) ecosystem of containerd runtimes.
- Change the engine image's entrypoint to no longer be the dagger engine but instead to be containerd
  - When the image starts, it will start the engine as a container in containerd w/ the containerd socket mounted in.
  - Note that since containerd is the engine's backend, this change does NOT result in crazy container-in-container-in-container behavior; any containers created by the engine will actually be "sibling" containers to it in containerd. So it's the same level of nestedness as today.
- Add support for clients (i.e. the CLI) to optionally start older/newer versions of the engine in containerd rather than just one single version
  - We'd likely want to do something like include the latest version of the engine in the image and then require that any other engine images be pulled at runtime, which feels like a reasonable tradeoff in terms of functionality and not bloating the image. But there's many other variations possible here too.
  - This is also the point where we'll need to start considering making the engine containers more short-running, which would make them available for cleanup and thus prevent versions from accumulating indefinitely.
That would technically just be a start to the overall grand plan, but it's extremely appealing in that:
- It's surprisingly low-hanging fruit; probably somewhere between medium and large in terms of t-shirt sized effort
- It would dramatically cut back on how often version mismatches between CLI<->Runner happen, likely to the point of those being very rare events.
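To make the third step in the outline above more concrete (letting clients start older or newer engine versions, with only the latest one bundled), here is a rough Python sketch of the selection and cleanup logic. It is a toy model under stated assumptions, not Dagger's implementation: the `Runner` class, its method names, and the version strings are all hypothetical, and a real image pull is replaced by adding the version to a set.

```python
from dataclasses import dataclass, field


@dataclass
class Runner:
    """Toy stand-in for the proposed image whose entrypoint is containerd.

    All names here are hypothetical; nothing below is a real Dagger API.
    """
    bundled_version: str                          # engine version shipped in the image
    pulled: set = field(default_factory=set)      # engine versions pulled at runtime
    running: dict = field(default_factory=dict)   # version -> connected client count

    def ensure_engine(self, client_version: str) -> str:
        """Start (or reuse) an engine container matching the client's version.

        Returns where the engine came from:
        "running", "bundled", "cached" or "pulled".
        """
        if client_version in self.running:
            source = "running"            # an engine of that version is already up
        elif client_version == self.bundled_version:
            source = "bundled"            # shipped inside the image, no pull needed
        elif client_version in self.pulled:
            source = "cached"             # pulled earlier, image still on disk
        else:
            self.pulled.add(client_version)   # stands in for a real image pull
            source = "pulled"
        self.running[client_version] = self.running.get(client_version, 0) + 1
        return source

    def disconnect(self, client_version: str) -> None:
        """Drop one client; stop the engine once no client is using it."""
        self.running[client_version] -= 1
        if self.running[client_version] == 0:
            del self.running[client_version]

    def cleanup(self) -> list:
        """Remove pulled images with no running engine; return them sorted."""
        idle = sorted(v for v in self.pulled if v not in self.running)
        self.pulled.difference_update(idle)
        return idle
```

In this model a client at the bundled version never triggers a pull, while a client at any other version gets a matching engine pulled on first use. Because engines stop when their last client disconnects, `cleanup()` can reclaim old versions instead of letting them accumulate. The bundled image is never removed, which mirrors the "include the latest version in the image" idea above.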
Packaging and naming considerations
One important detail to consider: this will determine product packaging, as in: what are the parts of our product and what do our users call them?
I think we reserve “engine” for the upper part, soon to be fully orchestrated by the CLI, soon to be stateless.
We need a new name for the lower part: the substrate that acts as privileged container runner + data plumbing.
Both need to be distributed as OCI images, at least until we can fully split buildkit into upper/lower parts (so: for a good while).
Consequently we may need to distribute 2 distinct OCI images? One called “engine” (upper) and one called maybe “runner” for the lower part / substrate? That would be consistent with the use of “RUNNER” in the env variable.