Merged
47 changes: 47 additions & 0 deletions docs/architecture/overview.mdx
@@ -0,0 +1,47 @@
---
title: "System Overview"
description: "Bird's-eye view of the ctrlplane orchestration flow"
---

This is the developer-facing entry point to the ctrlplane codebase. It shows
how the apps in this monorepo fit together when a deployment version moves
from creation to execution.

```mermaid
flowchart TD
CLI["CLI / curl"]:::ext
Users["Users<br/>(browser)"]:::ext
Web["apps/web<br/><i>React + tRPC client</i>"]
API["apps/api<br/><i>Express + tRPC + webhooks</i>"]
DB[("Postgres<br/><b>reconcile_work_scope</b>")]
Engine["apps/workspace-engine<br/><i>Go controllers</i>"]
Agents["Job agents<br/>GitHub Actions · ArgoCD ·<br/>Terraform Cloud · custom"]:::ext

Users --> Web
Web -->|tRPC| API
CLI -->|"① register version"| API
API -->|"② enqueue work"| DB
DB <-->|"③ lease / requeue"| Engine
Engine -->|"④ dispatch job"| Agents
Agents -->|"⑤ result"| API
API -->|"⑥ enqueue follow-up"| DB

classDef ext fill:#444,stroke:#888,color:#ddd
```

## The orchestration loop

CLI or `curl` calls register a deployment version against `apps/api` (①). The
api persists the version and writes a work item into the `reconcile_work_scope`
table in Postgres (②) — **this is the only thing the api does to "start"
orchestration; it does not call the engine.**

`apps/workspace-engine` controllers continuously lease items from that queue
(③), and each controller's output enqueues work for the next controller
(planning → policy → dispatch). When dispatch fires, the engine reaches out to
a job agent over HTTPS (④).

Results come back through webhooks to the api (⑤), which writes the job update
plus any follow-up work into the queue (⑥). The engine picks it up again. The
loop ③↔⑥ is the whole orchestration model — every release phase is a trip
through the queue.
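
As a rough illustration of steps ① and ②, the sketch below models the api's side of the handoff with an in-memory stand-in for Postgres. The engine itself is written in Go; Python is used here only to keep the sketch short. `FakeStore`, `WorkItem`, `register_version`, and the `scope_id` field are hypothetical names, and the assumption that the api enqueues a `desired-release` item directly is a simplification, not a description of the real schema.

```python
from dataclasses import dataclass, field


@dataclass
class WorkItem:
    kind: str      # which controller should pick the item up
    scope_id: str  # what the item is about (hypothetical field name)


@dataclass
class FakeStore:
    """In-memory stand-in for Postgres: a versions table plus the work queue."""
    versions: list = field(default_factory=list)
    queue: list = field(default_factory=list)


def register_version(store: FakeStore, release_target: str, tag: str) -> None:
    """Persist the version, then enqueue work. Note: no call into the engine."""
    store.versions.append((release_target, tag))
    store.queue.append(WorkItem(kind="desired-release", scope_id=release_target))


store = FakeStore()
register_version(store, "target-1", "v1.2.3")
print(store.queue)
```

The point of the sketch is what is absent: `register_version` returns as soon as the row is written, and whichever engine instance leases the item next carries the release forward.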
143 changes: 143 additions & 0 deletions docs/architecture/workspace-engine.mdx
@@ -0,0 +1,143 @@
---
title: "Workspace Engine"
description: "How apps/workspace-engine orchestrates the release lifecycle"
---

The workspace-engine is the Go service that drives every release forward. It
polls a Postgres work queue (`reconcile_work_scope`), leases items by `kind`,
and runs the matching controller. Each controller's output is enqueueing more
⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix spelling at Line 8 (enqueueing → enqueuing).

This is a user-facing docs typo and should be corrected for consistency.

🧰 Tools
🪛 LanguageTool

[grammar] ~8-~8: Ensure spelling is correct
Context: ...controller. Each controller's output is enqueueing more work, so a single release moves th...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/architecture/workspace-engine.mdx` at line 8, Change the misspelled word
"enqueueing" to "enqueuing" in the sentence "and runs the matching controller.
Each controller's output is enqueueing more" so the line reads "... Each
controller's output is enqueuing more" (search for the phrase containing
"enqueueing" in workspace-engine.mdx to locate the occurrence).

work, so a single release moves through phases by chaining items through the
queue.

## The release-flow chain

When a release-target needs to be evaluated (a new version was created, a
policy changed, a job finished, a resource started matching), a
`desired-release` work item lands in the queue. From there:

```mermaid
sequenceDiagram
autonumber
participant Q as reconcile_work_scope
participant DR as desiredrelease
participant JE as jobeligibility
participant JD as jobdispatch
participant JV as jobverificationmetric
participant Ext as Job agent

Note over Q: kind = desired-release<br/>scope = release-target
Q->>DR: lease
Note over DR: evaluate policies, pick the<br/>deployable version, resolve<br/>variables, persist release
DR-->>Q: enqueue kind=job-eligibility
Q->>JE: lease
Note over JE: can this release run now?<br/>(concurrency, retry rules)
JE-->>Q: enqueue kind=job-dispatch
Q->>JD: lease
Note over JD: create job, route to the<br/>right job agent
JD->>Ext: dispatch
Ext-->>Q: result via api, enqueue kind=job-verification-metric
Q->>JV: lease
Note over JV: poll metrics, on completion<br/>re-enqueue desired-release
JV-->>Q: enqueue kind=desired-release (loop)
```

Four controllers, one queue between them. **No controller calls another
directly** — handoff is always via insert-then-lease. That means each phase is
independently retriable, leasable, and observable, and the engine can run as
multiple instances safely.
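
To see why insert-then-lease makes multiple instances safe, here is a minimal in-memory model of a lease with an expiry. It is a sketch under assumptions: the `Queue` class, its method names, and the 30-second TTL are hypothetical, and the `threading.Lock` merely stands in for whatever row locking the real Postgres queries use.

```python
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    id: int
    kind: str
    leased_by: Optional[str] = None
    lease_expires: float = 0.0  # 0.0 means "never leased"


class Queue:
    def __init__(self) -> None:
        self._items: list[Item] = []
        self._lock = threading.Lock()  # stand-in for database row locking
        self._next_id = 1

    def enqueue(self, kind: str) -> int:
        with self._lock:
            item = Item(self._next_id, kind)
            self._next_id += 1
            self._items.append(item)
            return item.id

    def lease(self, kind: str, worker: str, now: float, ttl: float = 30.0) -> Optional[Item]:
        """Hand out one item of this kind whose lease is free or expired."""
        with self._lock:
            for item in self._items:
                if item.kind == kind and item.lease_expires <= now:
                    item.leased_by = worker
                    item.lease_expires = now + ttl
                    return item
            return None

    def complete(self, item_id: int, worker: str) -> bool:
        """Remove the item, but only if this worker still holds the lease."""
        with self._lock:
            for i, item in enumerate(self._items):
                if item.id == item_id and item.leased_by == worker:
                    del self._items[i]
                    return True
            return False


q = Queue()
q.enqueue("job-dispatch")
first = q.lease("job-dispatch", "engine-a", now=0.0)
second = q.lease("job-dispatch", "engine-b", now=1.0)   # None: engine-a holds the lease
retry = q.lease("job-dispatch", "engine-b", now=31.0)   # lease expired, engine-b takes over
```

If `engine-a` crashes mid-phase, nothing is lost: the lease lapses and another instance picks the item up, which is the sense in which each phase is independently retriable.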

## How every controller works

Every controller is a `reconcile.Processor` registered for one `kind`. The
pattern is identical across all of them: lease an event, recompute the desired
state from current Postgres state, persist the result, enqueue follow-up.

```mermaid
flowchart LR
DB[(reconcile_work_scope)]
C[Controller<br/>handles one kind]
DB -->|lease event by kind| C
C -->|persist results<br/>+ enqueue next kind| DB
```

Two things make this a reconciler rather than a job runner. First,
**controllers are stateless** — every invocation re-reads input from Postgres
rather than carrying state forward in memory. If the world changes between
events (a policy is disabled, an approval lands, a new version appears), the
next event picks up the change automatically. Second, **the loop closes back
to the start** — when a job finishes, `jobverificationmetric` enqueues another
`desired-release` event and `desiredrelease` recomputes from scratch.
Idempotent recomputation is the orchestration model.
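
The lease, recompute, persist, enqueue pattern can be sketched as a registry of stateless functions keyed by `kind`. This is an illustrative model, not the real `reconcile.Processor` interface, whose Go signature is not shown here; the `state` dictionary, its keys, and the function names are all assumptions.

```python
from collections import deque


def desired_release(scope, state):
    """Recompute from current state on every call; nothing is cached in memory."""
    if not state["policy_enabled"]:
        state["releases"].pop(scope, None)   # persist "no release"
        return []
    state["releases"][scope] = state["latest_version"]  # persist the result
    return [("job-eligibility", scope)]      # enqueue follow-up


def job_eligibility(scope, state):
    return [("job-dispatch", scope)]


def job_dispatch(scope, state):
    return []  # a real dispatch would contact a job agent here


# One processor per kind, mirroring "registered for one kind".
PROCESSORS = {
    "desired-release": desired_release,
    "job-eligibility": job_eligibility,
    "job-dispatch": job_dispatch,
}


def run(queue, state):
    """Drain the queue, returning the kinds handled in order."""
    handled = []
    while queue:
        kind, scope = queue.popleft()             # "lease"
        follow_ups = PROCESSORS[kind](scope, state)
        queue.extend(follow_ups)                  # "enqueue next kind"
        handled.append(kind)
    return handled


state = {"policy_enabled": True, "latest_version": "v2", "releases": {}}
handled = run(deque([("desired-release", "target-1")]), state)
print(handled)
```

Because `desired_release` writes the same result for the same input state, running it twice is harmless, and flipping `policy_enabled` between two events changes the outcome without any controller needing to be told.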

## Inside `desiredrelease`

`desiredrelease` is the only controller in the chain that does meaningful
internal work — the other three are mostly routing or checking. Here is what
happens on a single lease:

```mermaid
flowchart TD
In([dequeued: desired-release work item])
LP[load scope and policies]
Iter[iterate candidate versions<br/>newest-first]
Eval[evaluate policy rules<br/>inline via policyeval library]
Decide{any version passes?}
NoRel[persist 'no release']
Resolve[resolve variables]
Persist[persist release record]
Out[enqueue job-eligibility]

In --> LP --> Iter --> Eval --> Decide
Decide -->|no| NoRel
Decide -->|yes| Resolve --> Persist --> Out
```

Two things worth knowing:

1. **Policy evaluation is inline, not a separate controller.** A `policyeval`
directory exists at `svc/controllers/policyeval/` but that's a different
controller that writes per-version rule evaluations for the UI. The gating
logic that decides whether a version can deploy lives in the `policyeval`
*library subpackage* at `svc/controllers/desiredrelease/policyeval/` and is
called as a function from inside `desiredrelease`.
2. **Versions are evaluated newest-first as a stream.** The controller doesn't
load all candidate versions then filter — it iterates them and stops at the
first one that passes all policy rules. That's what makes "skip blocked
versions but deploy the newest passing one" cheap.
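
The newest-first streaming in point 2 can be sketched with a generator. This is a minimal model, assuming versions arrive already sorted newest-first and that each policy rule is a plain predicate; `pick_deployable`, `stream`, and the `approved` field are hypothetical names.

```python
def pick_deployable(versions, rules):
    """Return the first version that passes every rule, or None."""
    for version in versions:                      # consumed lazily
        if all(rule(version) for rule in rules):
            return version                        # stop: older versions are never loaded
    return None


fetched = []  # records which versions were actually pulled from the stream


def stream():
    candidates = [
        {"tag": "v3", "approved": False},  # newest, blocked by policy
        {"tag": "v2", "approved": True},
        {"tag": "v1", "approved": True},
    ]
    for version in candidates:
        fetched.append(version["tag"])
        yield version


rules = [lambda v: v["approved"]]
chosen = pick_deployable(stream(), rules)
print(chosen["tag"], fetched)
```

Here `v3` is skipped, `v2` is chosen, and `v1` is never fetched at all, which is where the cost saving comes from when the candidate list is long.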

## Other release-flow controllers

**`jobeligibility`** — given a release record, decides whether a job can run
*right now*. Runs two evaluators: `releasetargetconcurrency` (under the
configured concurrency cap?) and `retry` (under the retry budget?). If both
pass, enqueue `job-dispatch`. If not, requeue with `notBefore`.
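
A minimal sketch of that decision, assuming each evaluator is a simple predicate over the release target's current state. The cap of 1, the budget of 3, the 30-second delay, and the `TargetState` fields are made-up values for illustration, not the engine's defaults.

```python
from dataclasses import dataclass


@dataclass
class TargetState:
    running_jobs: int
    failed_attempts: int


def under_concurrency_cap(s: TargetState, cap: int = 1) -> bool:
    return s.running_jobs < cap


def under_retry_budget(s: TargetState, budget: int = 3) -> bool:
    return s.failed_attempts < budget


def decide(s: TargetState, now: float, delay: float = 30.0) -> dict:
    """Both evaluators must pass; otherwise come back later via notBefore."""
    if under_concurrency_cap(s) and under_retry_budget(s):
        return {"enqueue": "job-dispatch"}
    return {"requeue": "job-eligibility", "notBefore": now + delay}


print(decide(TargetState(running_jobs=0, failed_attempts=0), now=100.0))
print(decide(TargetState(running_jobs=1, failed_attempts=0), now=100.0))
```

The requeue path does not block a worker: the item simply becomes leasable again once `notBefore` has passed, and the evaluators run afresh against whatever the state is by then.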

**`jobdispatch`** — given a job, picks the right job-agent adapter (GitHub
Actions, ArgoCD, Terraform Cloud, Argo Workflows, or the test runner) and
sends the job over HTTPS. The agent's `externalId` is recorded so results can
be correlated back later.
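
Adapter selection and result correlation might look roughly like the following. Everything here is hypothetical: the `agent_type` strings, the adapter functions, and the shape of the ids they return are invented for the sketch, and real adapters would make an HTTPS call instead of formatting a string.

```python
def dispatch_github(job):
    return f"gh-run-{job['id']}"       # pretend id returned by the external system


def dispatch_argocd(job):
    return f"argocd-app-{job['id']}"


# One adapter per agent type (keys are assumed names).
ADAPTERS = {"github-actions": dispatch_github, "argocd": dispatch_argocd}

jobs_by_external_id = {}  # stand-in for an indexed column in Postgres


def dispatch(job):
    """Route the job to its adapter and record the agent's externalId."""
    external_id = ADAPTERS[job["agent_type"]](job)
    job["externalId"] = external_id
    jobs_by_external_id[external_id] = job
    return external_id


def on_result(external_id, status):
    """Webhook side: find the job an incoming result belongs to."""
    job = jobs_by_external_id[external_id]
    job["status"] = status
    return job


job = {"id": 7, "agent_type": "argocd"}
ext = dispatch(job)
print(on_result(ext, "successful"))
```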

**`jobverificationmetric`** — given a finished job, polls verification
providers (Datadog, HTTP probes, Terraform Cloud run status, etc.) until they
return pass/fail. On completion, calls `EnqueueDesiredRelease` to close the
loop.
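
One way to picture the polling is as a function that runs once per lease and either requeues itself or closes the loop. Whether the real controller requeues or polls in-process is not stated here, so treat the requeue as an assumption; `verify`, the provider functions, and the job fields are hypothetical.

```python
def verify(job, providers, queue):
    """One lease of a job-verification-metric item."""
    results = [check(job) for check in providers]
    if "pending" in results:
        # Not all providers have an answer yet: poll again on a later lease.
        queue.append(("job-verification-metric", job["id"]))
        return "pending"
    # Every provider answered: hand back to desiredrelease to recompute.
    queue.append(("desired-release", job["release_target"]))
    return "fail" if "fail" in results else "pass"


answers = iter(["pending", "pass"])


def slow_probe(job):
    return next(answers)   # pending on the first poll, pass on the second


def always_pass(job):
    return "pass"


queue = []
job = {"id": "job-1", "release_target": "target-1"}
first = verify(job, [slow_probe, always_pass], queue)
second = verify(job, [slow_probe, always_pass], queue)
print(first, second, queue)
```

Note that a `desired-release` item is enqueued on both pass and fail in this sketch, since either outcome changes what the next desired state should be.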

## Controllers outside the release-flow chain

The `svc/controllers/` directory contains several other controllers that exist
for UI surface or precomputed state, not for moving a release through phases:

- `policyeval` (top-level) — computes per-version rule evaluations so the UI
can show "why isn't this version deploying yet."
- `deploymentplan` / `deploymentplanresult` — power plan previews and dry-run
views.
- `deploymentresourceselectoreval` / `environmentresourceselectoreval` —
precompute which resources currently match a deployment or environment
selector.
- `relationshipeval` — evaluates resource relationship rules into the resource
graph.
- `forcedeploy` — handles user-triggered manual deploys (a separate path from
the policy-gated chain).

If you're trying to understand "what happens when I push a version," you can
safely ignore these and focus on the four chain controllers.
13 changes: 13 additions & 0 deletions docs/docs.json
@@ -169,6 +169,19 @@
]
}
]
},
{
"tab": "Architecture",
"icon": "diagram-project",
"groups": [
{
"group": "System",
"pages": [
"architecture/overview",
"architecture/workspace-engine"
]
}
]
}
],
"global": {