Stateless engine or stateful engine?

## Question

I want to start a discussion on an important aspect of Dagger's architecture: whether to make the engine stateful or stateless. This is a complex topic with important ramifications that affect users in very tangible ways.

This has been a known unresolved question for a while, which we collectively shelved to deal with more urgent things. But I think now is the time to unshelve it, because more organizations are now running, or looking to run, Dagger on their CI infrastructure. As a result they face new questions:

1. What is the architecture? What are the components, where do they run, how are they installed?
2. What is the lifecycle? What components need to be updated when, and what are the dependencies between them?
3. Who's in charge? When self-hosted CI runners are involved, Dagger touches both the CI platform (e.g. GitHub Actions) and the underlying compute platform (e.g. Kubernetes). These platforms are often managed by different teams, sometimes in completely different orgs, with very different priorities and timelines. Who ultimately is in charge of installing Dagger in CI? Are they the same people who experience the value of Dagger? If not, how do we make harmonious collaboration between those two teams possible?

At the moment it is *very difficult* to answer these questions clearly and definitively, because our architecture is not clear, and not yet defined. For various pragmatic engineering reasons, the engine is split in two parts:

1. A *stateless* frontend that lives in the `dagger` CLI. This includes both the user-facing interface and GraphQL API.
2. A *stateful* backend that lives in a container. This includes our custom build of buildkit, containerd and associated tooling.

These two pieces must eventually be reunited - we have strong consensus on that. The unresolved question is **where**?

* Does the entire engine become *stateless* and live in the `dagger` CLI + purely ephemeral companion containers spawned as needed?
* Or does it become *stateful* and move entirely into a long-running OCI container?

This is a simplified question designed to provoke conversation. I believe both options are possible, and both come with a long list of difficult tradeoffs and nuance. But, let's start simple for now :) What does everyone think?

## Answer

After much discussion (see comments below), we have reached the following tentative answer:

### Where we want to go

* We should move towards making the engine **stateless** and **ephemeral**
* The benefits are 1) it simplifies the versioning matrix, 2) it simplifies the operational model, 3) it makes it easier to adopt Dagger.
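
To make the first benefit a bit more concrete, here is a rough count under stated assumptions. The version lists are hypothetical, and the "matched" case assumes the CLI always starts the engine version it was built for, which is one reading of the plan rather than a confirmed design.

```python
from itertools import product

# Hypothetical release lists, for illustration only.
cli_versions = ["v0.1.0", "v0.2.0", "v0.3.0"]
engine_versions = ["v0.1.0", "v0.2.0", "v0.3.0"]

# Long-running engine upgraded on its own schedule:
# any CLI version may end up talking to any engine version.
independent_pairs = list(product(cli_versions, engine_versions))

# Ephemeral engine started by the CLI (assumption: it picks its own version):
# only matching pairs can occur.
matched_pairs = [(v, v) for v in cli_versions]

print(len(independent_pairs), len(matched_pairs))  # prints "9 3"
```

The gap grows quadratically with the number of supported releases, which is why decoupled lifecycles get harder to reason about over time.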


### How to get there

* We should start moving towards this architecture in incremental steps
* We should take the first step as soon as possible
* The first step is detailed by @sipsma as follows:

> I actually think we can+should jump immediately to doing this such that the changes are to our engine image, rather than the k8s daemonset, and thus benefit every user rather than just k8s users.

> The outline being (in approximate order of implementation steps, can be done iteratively rather than all at once):

> 1. Switch our engine's executor backend from oci worker -> containerd worker
>    * This means we have to include containerd in our engine image, which will increase the size a bit, but highly unlikely to be so much it's a real concern.
>    * Also containerd may be a preferable backend to use anyways since we could benefit from the (small but growing) ecosystem of containerd runtimes.
> 2. Change the engine image's entrypoint to no longer be the dagger engine but instead to be containerd
>    * When the image starts, it will start the engine as a container in containerd w/ the containerd socket mounted in.
>    * Note that since containerd is the engine's backend, this change does NOT result in crazy container-in-container-in-container behavior; any containers created by the engine will actually be "sibling" containers to it in containerd. So it's the same level of nestedness as today.
> 3. Add support for clients (i.e. the CLI) to optionally start older/newer versions of the engine in containerd rather than just one single version
>    * We'd likely want to do something like include the latest version of the engine in the image and then require that any other engine images be pulled at runtime, which feels like a reasonable tradeoff in terms of functionality and not bloating the image. But there's many other variations possible here too.
>    * This is also the point where we'll need to start considering making the engine containers more short-running, which would make them available for cleanup and thus prevent versions from accumulating indefinitely.
>
> That would technically just be a start to the overall grand plan, but it's extremely appealing in that
>
> * It's surprisingly low-hanging fruit; probably somewhere between medium and large in terms of t-shirt sized effort
> * It would dramatically cut back on how often version mismatches between CLI<->Runner happen, likely to the point of those being very rare events.
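
The outline above can be sketched as a small in-memory model. This is not real containerd or Dagger code: `FakeContainerd`, `Runner`, `BUNDLED_VERSION` and the version strings are all hypothetical names made up for illustration. It only shows the intended shape: the image bundles one engine version, other versions are pulled on demand, everything the engine starts is a sibling under containerd, and idle engines get cleaned up.

```python
from dataclasses import dataclass, field

BUNDLED_VERSION = "v0.2.0"  # hypothetical: the one engine version shipped in the image


@dataclass
class FakeContainerd:
    """In-memory stand-in for containerd (NOT the real containerd API)."""
    images: set = field(default_factory=lambda: {BUNDLED_VERSION})
    containers: dict = field(default_factory=dict)  # container name -> parent
    pulls: list = field(default_factory=list)

    def ensure_image(self, version):
        # Only versions that are not bundled need a pull at runtime.
        if version not in self.images:
            self.pulls.append(version)
            self.images.add(version)

    def run(self, name):
        # Every container is created directly under containerd, never nested.
        self.containers[name] = "containerd"


class Runner:
    """Models the image whose entrypoint is containerd rather than the engine."""

    def __init__(self):
        self.ctrd = FakeContainerd()
        self.engines = {}  # engine version -> time it was last used

    def engine_for(self, version, now):
        # Start the requested engine version on demand (step 3 of the outline).
        if version not in self.engines:
            self.ctrd.ensure_image(version)
            self.ctrd.run(f"engine-{version}")
        self.engines[version] = now
        return f"engine-{version}"

    def run_job(self, version, job, now):
        engine = self.engine_for(version, now)
        # The engine uses the mounted containerd socket, so the job container
        # is a sibling of the engine. Finished jobs are kept here for brevity.
        self.ctrd.run(f"{engine}-job-{job}")

    def cleanup(self, now, max_idle):
        # Short-running engines: remove any engine idle longer than max_idle.
        for version, last_used in list(self.engines.items()):
            if now - last_used > max_idle:
                del self.engines[version]
                del self.ctrd.containers[f"engine-{version}"]


r = Runner()
r.run_job("v0.2.0", "build", now=0)  # bundled version, no pull needed
r.run_job("v0.1.0", "test", now=5)   # older version, pulled at runtime
r.cleanup(now=100, max_idle=60)      # both engines are idle by now
```

In this model only `v0.1.0` triggers a pull, every container reports `containerd` as its parent, and no engine survives the cleanup pass, which is the behavior the three steps are aiming for.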

## Packaging and naming considerations

One important detail to consider: this will determine product packaging, as in: what are the parts of our product and what do our users call them?

> I think we reserve “engine” for the upper part, soon to be fully orchestrated by the CLI, soon to be stateless.
> We need a new name for the lower part: the substrate that acts as privileged container runner + data plumbing.
> Both need to be distributed as OCI images, at least until we can fully split buildkit into upper/lower parts (so: for a good while).
> Consequently we may need to distribute 2 distinct OCI images? One called “engine” (upper) and one called maybe “runner” for the lower part / substrate? Would be consistent with the use of “RUNNER” in the env variable.

Stateless engine or stateful engine? #5484

Question

Answer

Where we want to go

How to get there

Packaging and naming considerations

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Stateless engine or stateful engine? #5484

Description

Question

Answer

Where we want to go

How to get there

Packaging and naming considerations

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions