
Go-SDK: Implement coordinator-mode runtime entry point and task runner#67318

Draft
jason810496 wants to merge 3 commits into apache:main from jason810496:refactor/go-sdk/coordinator-runtime-server

jason810496 (Member) commented May 22, 2026

Why

Third and final PR in the stack carved out of #67154. With the protocol primitives merged in PR1 and the dispatcher / logger / client merged in PR2, this PR wires the entry point: the same bundle binary that today serves go-plugin can now also be launched directly by the Python supervisor, dial the supervisor's comm and logs sockets, and run a single TaskInstance.

Coordinator mode is the path that lets the Python supervisor schedule Go tasks without standing up a separate worker process. The supervisor launches the bundle binary as a child, hands it two socket addresses on the CLI, and speaks the msgpack-over-IPC protocol to it directly, so a Go-task DagRun is operationally indistinguishable from a Python-task DagRun on the supervisor side.

This is the smallest PR in the stack (~650 LOC) because the heavy lifting (frame I/O, dispatcher, slog handler, sdk.Client re-implementation) already landed in PR1 and PR2. Dag-file parsing over the coordinator protocol is intentionally not part of this stack and will land in a follow-up once that protocol settles.

How

  • pkg/execution/server.go -- execution.Serve(bundle, commAddr, logsAddr) dials both supervisor sockets, defers a Close on each, installs SocketLogHandler as the slog default before any user code runs, constructs a CoordinatorComm over the comm socket, reads the initial StartupDetails, and dispatches to task_runner.Run. If Serve itself errors before the dispatcher spins up, the deferred close still releases the dialed sockets so the supervisor doesn't see a stuck child.
  • pkg/execution/task_runner.go -- runs a single task. Builds a context carrying the CoordinatorClient under sdkcontext.SdkClientContextKey (PR1 added the injection site in bundlev1.taskFunction.Execute), invokes bundle.LookupTask(dag, task).Execute, and sends the resulting TaskStateMsg back through the dispatcher. Terminal-state delivery is ctx.Err()-aware so a cancelled supervisor doesn't leave the runtime blocked on a send.
  • pkg/execution/integration_test.go -- end-to-end test that pipes a fake supervisor against the real Serve over an in-memory socket pair, exercises GetVariable / XCom push / deferral, and asserts the emitted TaskStateMsg.
  • bundle/bundlev1/bundlev1server/server.go -- splits Serve into a decideMode switch over (--bundle-metadata | --comm/--logs | <none>) so the same binary still serves go-plugin when no coordinator flags are present. Partial use of --comm / --logs is a hard error (ErrCoordinatorFlagsIncomplete), returned to main so the caller exits non-zero with usage rather than silently falling back to go-plugin.
  • example/bundle/main.go -- propagates bundlev1server.Serve's error via log.Fatal, and tightens the example connection-log to log only non-sensitive fields, matching the masker TODOs PR1 added on sdk.Client.GetConnection.
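The mode selection described for bundlev1server could look roughly like the sketch below. It is an illustration under assumptions, not the code in this PR: the `Mode` type and its constants are hypothetical names, `--bundle-metadata` is assumed to be a boolean flag that takes precedence, and the error text is made up. Only the flag names, `decideMode`, and `ErrCoordinatorFlagsIncomplete` come from the description above.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
)

// Mode is a hypothetical enum; the PR's real type names are not shown.
type Mode int

const (
	ModeGoPlugin       Mode = iota // no coordinator flags: serve go-plugin as before
	ModeBundleMetadata             // --bundle-metadata was passed
	ModeCoordinator                // both --comm and --logs were passed
)

// ErrCoordinatorFlagsIncomplete mirrors the sentinel named in the PR.
// The message text here is an assumption.
var ErrCoordinatorFlagsIncomplete = errors.New("--comm and --logs must be set together")

// decideMode parses args and picks a mode. Passing only one of
// --comm / --logs is a hard error instead of a silent fallback.
func decideMode(args []string) (Mode, error) {
	fs := flag.NewFlagSet("bundle", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep the sketch quiet on parse errors
	metadata := fs.Bool("bundle-metadata", false, "print bundle metadata and exit")
	comm := fs.String("comm", "", "supervisor comm socket address")
	logs := fs.String("logs", "", "supervisor logs socket address")
	if err := fs.Parse(args); err != nil {
		return ModeGoPlugin, err
	}

	switch {
	case *metadata:
		return ModeBundleMetadata, nil
	case *comm != "" && *logs != "":
		return ModeCoordinator, nil
	case *comm != "" || *logs != "":
		return ModeGoPlugin, ErrCoordinatorFlagsIncomplete
	default:
		return ModeGoPlugin, nil
	}
}

func main() {
	cases := [][]string{
		nil,
		{"--bundle-metadata"},
		{"--comm", "127.0.0.1:9000", "--logs", "127.0.0.1:9001"},
		{"--comm", "127.0.0.1:9000"}, // partial use: error
	}
	for _, args := range cases {
		mode, err := decideMode(args)
		fmt.Println(args, mode, err)
	}
}
```

Returning the error to the caller, rather than exiting inside `decideMode`, keeps the decision testable and leaves the exit code to main.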

What

  • Add go-sdk/pkg/execution/{server,task_runner,integration_test}.go.
  • Extend go-sdk/bundle/bundlev1/bundlev1server/server.go with coordinator-mode dispatch and ErrCoordinatorFlagsIncomplete.
  • Update go-sdk/example/bundle/main.go to propagate Serve's error and redact the connection log.

Next

  • (none -- last PR of the stack)

Was generative AI tooling used to co-author this PR?

First step toward landing the Go SDK coordinator-mode runtime
(ADR 0003, msgpack-over-IPC). Scaffolding only -- no entry point is wired
here, so go-plugin / Edge Worker behaviour is unchanged.

Adds the length-prefixed msgpack frame codec and the typed message
envelopes the runtime will exchange with the supervisor, the
sdkcontext.SdkClientContextKey injection hook on bundlev1.Task so a
follow-up PR can swap in a comm-socket-backed sdk.Client, and the small
sdk surface tweaks (ConnFromAPIResponse export, VariableClient interface
docs, secret-masking TODOs) the comm-socket client will rely on. Pulls
in github.com/vmihailenco/msgpack/v5 -- the encoding the supervisor
speaks.

Build the comm layer on top of the protocol primitives so subsequent
runtime code has a single typed entry point for talking to the supervisor.

CoordinatorComm runs a concurrent-safe dispatcher loop that fans inbound
frames out to per-request reply channels keyed by a monotonic id,
propagates context cancellation, and cleans up pending requests on
SendRequest failure. SocketLogHandler streams slog records as structured
JSON over the dedicated logs socket so the supervisor can demux task
logs without parsing stderr. CoordinatorClient implements the sdk.Client
surface (GetVariable honouring AIRFLOW_VAR_* overrides, GetConnection,
XCom push/pull, deferral) by routing each method through the dispatcher
and translating supervisor not-found responses into the SDK's sentinel
errors.

No server or task-runner loop is wired yet -- that lands in the next PR
in this stack.

Wire the supervisor-launched runtime that speaks ADR 0003's coordinator
protocol. execution.Serve dials the comm and logs sockets the supervisor
passes via the new --comm/--logs flags, installs SocketLogHandler so slog
records reach the supervisor, reads StartupDetails, and drives a single
TaskInstance through task_runner.Run. The runner injects a
CoordinatorClient into the user task function via
sdkcontext.SdkClientContextKey so tasks written against the existing
sdk.Client API run unchanged. bundlev1server.Serve grows a mode selector
so the same binary still serves go-plugin when no coordinator flags are
present, and exits non-zero on partial --comm/--logs misuse.

DAG-file parsing is intentionally not part of this stack -- it will land
in a follow-up once the parsing protocol settles.
@jason810496 jason810496 self-assigned this May 22, 2026
@jason810496 jason810496 changed the title go-sdk: Implement coordinator-mode runtime entry point and task runner Go-SDK: Implement coordinator-mode runtime entry point and task runner May 22, 2026
@jason810496 jason810496 moved this to In progress in AIP-72 (addendum): Go-SDK May 22, 2026
@jason810496 jason810496 moved this from In progress to In review in AIP-72 (addendum): Go-SDK May 22, 2026

Projects

Status: In review
