
fix(provision/docker): probe host apiserver before declaring ready#12

Merged
solsson merged 3 commits into main from fix-docker-apiserver-host-readiness
May 1, 2026
Conversation

@solsson (Contributor) commented May 1, 2026

Summary

  • pkg/provision/docker/docker.go: add waitForHostAPIServer, called between waitForKubeconfig and the "k3s ready" log. It polls kubectl get --raw=/readyz against the merged context (1s interval, 60s deadline) so a not-yet-bound port forward and a still-starting apiserver are both retried before envoygateway.Install (or any other host-side kubectl call) runs.
  • Updates the waitForKubeconfig doc comment, which had asserted the opposite of reality ("k3s.yaml means apiserver is ready to accept connections").

Why

ystack acceptance on ubuntu-latest has been failing every run on the docker provider with:

INFO docker/docker.go:208  k3s ready
INFO envoygateway/install.go:119  applying envoy-gateway install manifest
error: ... dial tcp 127.0.0.1:6443: connect: connection refused

(Yolean/ystack#76; example run: actions/runs/25220092901). The race is deterministic — every fresh provision shows ~2s between "starting docker" and "k3s ready", with kubectl failing within 1s. The ystack-side 4×retry workaround can't paper over it because each retry tears the cluster down and reproduces the same race.

The qemu provisioner is unaffected: its ssh-driven k3s install script returns only after the apiserver is fully up, so "k3s ready" there is already truthful.

Test plan

  • go test ./... passes
  • go test -tags 'e2e,docker' -run TestDocker_ProvisionTeardown ./e2e/ passes locally; "k3s ready" now arrives ~10s after "starting docker" (was ~2s) and the subsequent envoy-gateway install runs without retry
  • CI e2e (e2e,docker) job green

k3s writes /etc/rancher/k3s/k3s.yaml inside the container as soon
as the apiserver socket is bound, but two host-side concerns lag
behind: docker's userland port-forward to 127.0.0.1:HostAPIPort
takes a moment to start accepting, and the apiserver itself
needs a beat to advance from "listening" to "ready". The very
next step in Provision -- envoygateway.Install via kubectl apply
--server-side against the merged kubeconfig -- raced against
both, deterministically failing with:

  dial tcp 127.0.0.1:6443: connect: connection refused

Add waitForHostAPIServer between waitForKubeconfig and the
"k3s ready" log. It polls `kubectl get --raw=/readyz` against
the merged context (1s interval, 60s deadline). /readyz covers
both failure modes -- a transport error from a not-yet-bound
port and a 503 from a starting apiserver are both retried.
The qemu provisioner is unaffected: its ssh-driven k3s install
script returns only after the apiserver is fully up, so
"k3s ready" there is already truthful.

Verified locally: TestDocker_ProvisionTeardown sees ~10s between
"starting docker" and "k3s ready" (was ~2s) and the subsequent
envoy-gateway install runs without retry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@solsson (Contributor, Author) commented May 1, 2026

Review from maintainer agent

Verdict

LGTM. Tight, focused change against a real, deterministic race that ystack acceptance has been hitting. Worth merging.

What I liked

  • Honest doc-comment fix on waitForKubeconfig (line 372–376). The old comment claimed "k3s.yaml means apiserver is ready to accept connections" — that's exactly the false invariant the bug hinges on. Updating it as part of the fix is the right thing.
  • Symmetric error handling. The new waitForHostAPIServer failure path captures container logs the same way waitForKubeconfig already does (dlogs, _ := dockerexec.Logs(...)). Reading the diff feels like one author wrote both.
  • /readyz over /healthz. Correct choice — /readyz gates readiness (and includes the apiserver having advanced past "listening" to "serving"), /healthz doesn't.
  • ctx.Done() honored in the poll loop. A caller cancellation interrupts cleanly.
  • lastErr instead of returning the most recent transient. The first 30 connection-refused errors aren't useful to surface; the final state when the deadline trips is.

Suggestions (none blocking)

1. The qemu provisioner is "unaffected" by timing, not by guarantee

The PR description: "its ssh-driven k3s install script returns only after the apiserver is fully up, so 'k3s ready' there is already truthful."

That's accurate for "apiserver up inside the VM," but the host-side port forward is still a separate readiness step. qemu happens not to race today because its full-VM boot path runs much longer than the docker container's, by which time SLIRP forwarding has been bound for a while. On a slow host or with very fast k3s startup, qemu could in principle hit the same race.

I wouldn't expand the scope of this PR, but waitForHostAPIServer is generic enough that lifting it to a provision/internal (or provision/k8sready) helper and calling it from pkg/provision/qemu/qemu.go:309 after the kubeconfig merge is a natural follow-up. Filing as a tracking issue would be enough for now.

2. A start-of-wait log line would help future debugging

The function is silent until success or 60s timeout. When someone in the future stares at:

INFO docker/docker.go:208  waiting for host apiserver readiness   <-- new
INFO docker/docker.go:224  k3s ready

it's immediately obvious which step took the time. Today on success there's no log between "merge kubeconfig" and "k3s ready," so a 30s wait looks like a 30s mystery. One c.logger.Info("waiting for host apiserver", zap.String("context", c.cfg.Context)) at the top of waitForHostAPIServer is enough.

3. Why shell out to kubectl instead of a direct HTTP probe

You're forking kubectl up to 60 times in the worst case (1s interval × 60s deadline). It's not expensive in absolute terms — this whole loop only runs on a cold provision — but a client-go REST GET against the merged context's server URL would avoid the fork entirely and be one less moving piece on the host (kubectl version drift, PATH order, etc.).

I'd lean toward keeping kubectl since the rest of the provisioner already shells out (and envoygateway.Install does the same), so consistency wins here. Just flagging it because the comment on line 380 implies the choice was deliberate without spelling out why the shell-out was preferred — a one-line "matches envoygateway.Install which also drives the host kubeconfig via kubectl" would close the loop.
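For reference, the direct-probe alternative could be as small as the following stdlib-only sketch (hypothetical; a real version would load the CA bundle and credentials from the merged kubeconfig instead of skipping TLS verification):

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

// probeReadyz issues one GET against <server>/readyz. A transport error
// maps to "port-forward not bound yet"; a non-200 status (e.g. 503)
// maps to "apiserver still starting". Either would be retried by the
// surrounding poll loop.
func probeReadyz(server string) (bool, error) {
	client := &http.Client{
		Timeout: 2 * time.Second,
		Transport: &http.Transport{
			// Sketch only: production would trust the kubeconfig's CA.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	resp, err := client.Get(server + "/readyz")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK, nil
}

func main() {
	ok, err := probeReadyz("https://127.0.0.1:6443")
	fmt.Printf("ready=%v err=%v\n", ok, err)
}
```

The sketch shows why the kubectl shell-out is simpler in practice: the HTTP path reintroduces TLS and credential handling that the merged kubeconfig already solves.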

4. Polling interval choice

1s is fine. If the deterministic gap is "~2s (current)" vs "~10s (after this fix)" as the PR says, an aggressive interval doesn't buy much. If you ever want to tighten the floor, 200–500ms would catch the median case faster without changing the deadline logic.

5. Unit test coverage

The PR plan calls out TestDocker_ProvisionTeardown (e2e) and CI. That's the right level for a real-network probe loop, but a tiny in-process unit test against a fake kubectl-on-PATH (write a shell script to t.TempDir(), prepend to PATH via t.Setenv, have it print fail-then-success) would lock in:

  • The 60s deadline is honored (run with the fake always-failing, expect error).
  • The context cancellation path returns ctx.Err() (not the 60s timeout error).

Optional, but it would make a refactor that accidentally drops the select { case <-ctx.Done(): } branch fail loudly instead of silently regressing to a non-cancellable wait.

File:line cross-reference

  • pkg/provision/docker/docker.go:208–222 — Provision flow now has a new readiness gate; reads cleanly.
  • pkg/provision/docker/docker.go:354–378 — waitForKubeconfig doc comment updated.
  • pkg/provision/docker/docker.go:380–410 — waitForHostAPIServer. Single-purpose, well-commented.

Nit

probe.CombinedOutput() discards the separation of stdout/stderr; fine here since we're just looking at exit code + last-error context. No change suggested.


Net: small, surgical, addresses the failing CI mode head-on, and doesn't expand scope. Ship it.

solsson and others added 2 commits May 1, 2026 19:06
Two non-blocking review nits from #12:
- The function was silent until success/timeout; a 30s wait looked
  like a 30s mystery. Log the start so successive log lines bracket
  the readiness gate.
- Spell out *why* we shell out to kubectl instead of dialing the
  apiserver directly (consistency with envoygateway.Install, which
  is the very next caller and uses the same code path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address review nit #5 from #12: lock in the deadline path and the
ctx-cancel path with a fake-kubectl-on-PATH harness.

To run the loop in milliseconds rather than the production 60s,
the body of waitForHostAPIServer is split into pollHostAPIServerReadyz
which takes the timeout and interval as arguments. Production keeps
the same effective behaviour via const wrappers; the test drives the
helper directly with sub-second knobs.

Three cases:
- success: kubectl exits 0, returns nil immediately.
- deadline-honored: always-failing kubectl, expect the wrapped
  "/readyz never returned 200" error (not ctx.Err()) and the context
  name in the message.
- ctx-cancelled: 50ms ctx vs 10s loop timeout, expect ctx.Err()
  (not the deadline message) -- guards against a refactor that
  drops the select { case <-ctx.Done(): } branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@solsson solsson merged commit d13b217 into main May 1, 2026
11 checks passed
solsson added a commit to Yolean/ystack that referenced this pull request May 1, 2026
v0.3.5 (Yolean/y-cluster#12) added a host-side /readyz probe in
the docker provisioner between the in-container kubeconfig
appearing and "k3s ready" being declared. The previous race --
"k3s ready" firing while the docker port-forward to 127.0.0.1:6443
wasn't yet accepting, so the very next step (envoy-gateway install
via kubectl apply) failed with "dial tcp 127.0.0.1:6443: connect:
connection refused" -- is now closed at the source.

The 4x retry workaround in the linux-amd64 acceptance script never
helped (each retry tore the cluster down and reproduced the same
deterministic race) and is dropped. The provision call is back to
a single line.

The y-kustomize Deployment image is bumped to the matching v0.3.5
tag for consistency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
solsson added a commit to Yolean/ystack that referenced this pull request May 3, 2026
Two y-cluster releases unblock the docker provider on
ubuntu-latest and let the acceptance script collapse to a single
provision call:

- v0.3.5 (Yolean/y-cluster#12) added a host-side /readyz probe
  between the in-container kubeconfig appearing and "k3s ready"
  being declared, closing the docker port-forward race that made
  envoy-gateway install fail with "dial tcp 127.0.0.1:6443:
  connect: connection refused". The 4x retry/sleep-10s
  workaround in this script is dead code now -- each retry tore
  the cluster down and reproduced the deterministic race anyway.
- v0.3.6 (Yolean/y-cluster#15) fixed a separate silent-drop in
  the docker provider's PortBindings: HostIP was left as the
  zero netip.Addr ("invalid IP"), which moby v1.54+ marshals to
  the empty JSON string and Docker Engine 28 dropped silently.

  A second issue with PortBindings still surfaces in some CI
  contexts -- the y-cluster-managed container's
  NetworkSettings.Ports comes back empty even with v0.3.6 -- but
  it's distinct from anything this script can work around;
  filed upstream against y-cluster.

The y-kustomize Deployment image is bumped to the matching
v0.3.6 tag for consistency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>