
fix(provision/docker): probe host apiserver before declaring ready#12

Merged
solsson merged 3 commits into main from fix-docker-apiserver-host-readiness
May 1, 2026
Conversation

@solsson (Contributor) commented May 1, 2026

Summary

  • pkg/provision/docker/docker.go: add waitForHostAPIServer, called between waitForKubeconfig and the "k3s ready" log. It polls kubectl get --raw=/readyz against the merged context (1s interval, 60s deadline) so a not-yet-bound port forward and a still-starting apiserver are both retried before envoygateway.Install (or any other host-side kubectl call) runs.
  • Updates the waitForKubeconfig doc comment, which had asserted the opposite of reality ("k3s.yaml means apiserver is ready to accept connections").

Why

ystack acceptance on ubuntu-latest has been failing every run on the docker provider with:

INFO docker/docker.go:208  k3s ready
INFO envoygateway/install.go:119  applying envoy-gateway install manifest
error: ... dial tcp 127.0.0.1:6443: connect: connection refused

(Yolean/ystack#76; example run: actions/runs/25220092901). The race is deterministic — every fresh provision shows ~2s between "starting docker" and "k3s ready", with kubectl failing within 1s. The ystack-side 4×retry workaround can't paper over it because each retry tears the cluster down and reproduces the same race.

The qemu provisioner is unaffected: its ssh-driven k3s install script returns only after the apiserver is fully up, so "k3s ready" there is already truthful.

Test plan

  • go test ./... passes
  • go test -tags 'e2e,docker' -run TestDocker_ProvisionTeardown ./e2e/ passes locally; "k3s ready" now arrives ~10s after "starting docker" (was ~2s) and the subsequent envoy-gateway install runs without retry
  • CI e2e (e2e,docker) job green

k3s writes /etc/rancher/k3s/k3s.yaml inside the container as soon
as the apiserver socket is bound, but two host-side concerns lag
behind: docker's userland port-forward to 127.0.0.1:HostAPIPort
takes a moment to start accepting, and the apiserver itself
needs a beat to advance from "listening" to "ready". The very
next step in Provision -- envoygateway.Install via kubectl apply
--server-side against the merged kubeconfig -- raced against
both, deterministically failing with:

  dial tcp 127.0.0.1:6443: connect: connection refused

Add waitForHostAPIServer between waitForKubeconfig and the
"k3s ready" log. It polls `kubectl get --raw=/readyz` against
the merged context (1s interval, 60s deadline). /readyz covers
both failure modes -- a transport error from a not-yet-bound
port and a 503 from a starting apiserver are both retried.
The qemu provisioner is unaffected: its ssh-driven k3s install
script returns only after the apiserver is fully up, so
"k3s ready" there is already truthful.

Verified locally: TestDocker_ProvisionTeardown sees ~10s between
"starting docker" and "k3s ready" (was ~2s) and the subsequent
envoy-gateway install runs without retry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@solsson (Contributor, Author) commented May 1, 2026

Review from maintainer agent

Verdict

LGTM. Tight, focused change against a real, deterministic race that ystack acceptance has been hitting. Worth merging.

What I liked

  • Honest doc-comment fix on waitForKubeconfig (line 372–376). The old comment claimed "k3s.yaml means apiserver is ready to accept connections" — that's exactly the false invariant the bug hinges on. Updating it as part of the fix is the right thing.
  • Symmetric error handling. The new waitForHostAPIServer failure path captures container logs the same way waitForKubeconfig already does (dlogs, _ := dockerexec.Logs(...)). Reading the diff feels like one author wrote both.
  • /readyz over /healthz. Correct choice — /readyz gates readiness (and includes the apiserver having advanced past "listening" to "serving"), /healthz doesn't.
  • ctx.Done() honored in the poll loop. A caller cancellation interrupts cleanly.
  • lastErr instead of returning the most recent transient. The first 30 connection-refused errors aren't useful to surface; the final state when the deadline trips is.

Suggestions (none blocking)

1. The qemu provisioner is "unaffected" by timing, not by guarantee

The PR description: "its ssh-driven k3s install script returns only after the apiserver is fully up, so 'k3s ready' there is already truthful."

That's accurate for "apiserver up inside the VM," but the host-side port forward is still a separate readiness step. qemu happens not to race today because its full-VM boot path runs much longer than the docker container's, by which time SLIRP forwarding has been bound for a while. On a slow host or with very fast k3s startup, qemu could in principle hit the same race.

I wouldn't expand the scope of this PR, but waitForHostAPIServer is generic enough that lifting it to a provision/internal (or provision/k8sready) helper and calling it from pkg/provision/qemu/qemu.go:309 after the kubeconfig merge is a natural follow-up. Filing as a tracking issue would be enough for now.

2. A start-of-wait log line would help future debugging

The function is silent until success or 60s timeout. When someone in the future stares at:

INFO docker/docker.go:208  waiting for host apiserver readiness   <-- new
INFO docker/docker.go:224  k3s ready

it's immediately obvious which step took the time. Today on success there's no log between "merge kubeconfig" and "k3s ready," so a 30s wait looks like a 30s mystery. One c.logger.Info("waiting for host apiserver", zap.String("context", c.cfg.Context)) at the top of waitForHostAPIServer is enough.

3. Why shell out to kubectl instead of a direct HTTP probe

You're forking kubectl up to 60 times in the worst case (1s interval × 60s deadline). It's not expensive in absolute terms — this whole loop only runs on a cold provision — but a client-go REST GET against the merged context's server URL would avoid the fork entirely and be one less moving piece on the host (kubectl version drift, PATH order, etc.).

I'd lean toward keeping kubectl since the rest of the provisioner already shells out (and envoygateway.Install does the same), so consistency wins here. Just flagging it because the comment on line 380 implies the choice was deliberate without spelling out why the shell-out was preferred — a one-line "matches envoygateway.Install which also drives the host kubeconfig via kubectl" would close the loop.
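For reference, the direct-probe alternative could be as small as the following stdlib-only sketch (hypothetical; a real version would load the CA bundle and credentials from the merged kubeconfig instead of skipping TLS verification):

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

// probeReadyz issues one GET against <server>/readyz. A transport error
// maps to "port-forward not bound yet"; a non-200 status (e.g. 503)
// maps to "apiserver still starting". Either would be retried by the
// surrounding poll loop.
func probeReadyz(server string) (bool, error) {
	client := &http.Client{
		Timeout: 2 * time.Second,
		Transport: &http.Transport{
			// Sketch only: production would trust the kubeconfig's CA.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	resp, err := client.Get(server + "/readyz")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK, nil
}

func main() {
	ok, err := probeReadyz("https://127.0.0.1:6443")
	fmt.Printf("ready=%v err=%v\n", ok, err)
}
```

The sketch shows why the kubectl shell-out is simpler in practice: the HTTP path reintroduces TLS and credential handling that the merged kubeconfig already solves.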

4. Polling interval choice

1s is fine. If the deterministic gap is "~2s (current)" vs "~10s (after this fix)" as the PR says, an aggressive interval doesn't buy much. If you ever want to tighten the floor, 200–500ms would catch the median case faster without changing the deadline logic.

5. Unit test coverage

The PR plan calls out TestDocker_ProvisionTeardown (e2e) and CI. That's the right level for a real-network probe loop, but a tiny in-process unit test against a fake kubectl-on-PATH (write a shell script to t.TempDir(), prepend to PATH via t.Setenv, have it print fail-then-success) would lock in:

  • The 60s deadline is honored (run with the fake always-failing, expect error).
  • The context cancellation path returns ctx.Err() (not the 60s timeout error).

Optional, but it would make a refactor that accidentally drops the select { case <-ctx.Done(): } branch fail loudly instead of silently regressing to a non-cancellable wait.

File:line cross-reference

  • pkg/provision/docker/docker.go:208–222 — Provision flow now has a new readiness gate; reads cleanly.
  • pkg/provision/docker/docker.go:354–378 — waitForKubeconfig doc comment updated.
  • pkg/provision/docker/docker.go:380–410 — waitForHostAPIServer. Single-purpose, well-commented.

Nit

probe.CombinedOutput() discards the separation of stdout/stderr; fine here since we're just looking at exit code + last-error context. No change suggested.


Net: small, surgical, addresses the failing CI mode head-on, and doesn't expand scope. Ship it.

solsson and others added 2 commits May 1, 2026 19:06
Two non-blocking review nits from #12:
- The function was silent until success/timeout; a 30s wait looked
  like a 30s mystery. Log the start so successive log lines bracket
  the readiness gate.
- Spell out *why* we shell out to kubectl instead of dialing the
  apiserver directly (consistency with envoygateway.Install, which
  is the very next caller and uses the same code path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address review nit #5 from #12: lock in the deadline path and the
ctx-cancel path with a fake-kubectl-on-PATH harness.

To run the loop in milliseconds rather than the production 60s,
the body of waitForHostAPIServer is split into pollHostAPIServerReadyz
which takes the timeout and interval as arguments. Production keeps
the same effective behaviour via const wrappers; the test drives the
helper directly with sub-second knobs.

Three cases:
- success: kubectl exits 0, returns nil immediately.
- deadline-honored: always-failing kubectl, expect the wrapped
  "/readyz never returned 200" error (not ctx.Err()) and the context
  name in the message.
- ctx-cancelled: 50ms ctx vs 10s loop timeout, expect ctx.Err()
  (not the deadline message) -- guards against a refactor that
  drops the select { case <-ctx.Done(): } branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@solsson solsson merged commit d13b217 into main May 1, 2026
11 checks passed
solsson added a commit to Yolean/ystack that referenced this pull request May 1, 2026
v0.3.5 (Yolean/y-cluster#12) added a host-side /readyz probe in
the docker provisioner between the in-container kubeconfig
appearing and "k3s ready" being declared. The previous race --
"k3s ready" firing while the docker port-forward to 127.0.0.1:6443
wasn't yet accepting, so the very next step (envoy-gateway install
via kubectl apply) failed with "dial tcp 127.0.0.1:6443: connect:
connection refused" -- is now closed at the source.

The 4x retry workaround in the linux-amd64 acceptance script never
helped (each retry tore the cluster down and reproduced the same
deterministic race) and is dropped. The provision call is back to
a single line.

The y-kustomize Deployment image is bumped to the matching v0.3.5
tag for consistency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
solsson added a commit to Yolean/ystack that referenced this pull request May 3, 2026
Two y-cluster releases unblock the docker provider on
ubuntu-latest and let the acceptance script collapse to a single
provision call:

- v0.3.5 (Yolean/y-cluster#12) added a host-side /readyz probe
  between the in-container kubeconfig appearing and "k3s ready"
  being declared, closing the docker port-forward race that made
  envoy-gateway install fail with "dial tcp 127.0.0.1:6443:
  connect: connection refused". The 4x retry/sleep-10s
  workaround in this script is dead code now -- each retry tore
  the cluster down and reproduced the deterministic race anyway.
- v0.3.6 (Yolean/y-cluster#15) fixed a separate silent-drop in
  the docker provider's PortBindings: HostIP was left as the
  zero netip.Addr ("invalid IP"), which moby v1.54+ marshals to
  the empty JSON string and Docker Engine 28 dropped silently.

  A second issue with PortBindings still surfaces in some CI
  contexts -- the y-cluster-managed container's
  NetworkSettings.Ports comes back empty even with v0.3.6 -- but
  it's distinct from anything this script can work around;
  filed upstream against y-cluster.

The y-kustomize Deployment image is bumped to the matching
v0.3.6 tag for consistency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>