fix(provision/docker): probe host apiserver before declaring ready #12
Conversation
k3s writes /etc/rancher/k3s/k3s.yaml inside the container as soon as the apiserver socket is bound, but two host-side concerns lag behind: docker's userland port-forward to 127.0.0.1:HostAPIPort takes a moment to start accepting, and the apiserver itself needs a beat to advance from "listening" to "ready". The very next step in Provision -- envoygateway.Install via kubectl apply --server-side against the merged kubeconfig -- raced against both, deterministically failing with:

    dial tcp 127.0.0.1:6443: connect: connection refused

Add waitForHostAPIServer between waitForKubeconfig and the "k3s ready" log. It polls `kubectl get --raw=/readyz` against the merged context (1s interval, 60s deadline). /readyz covers both failure modes -- a transport error from a not-yet-bound port and a 503 from a starting apiserver are both retried.

The qemu provisioner is unaffected: its ssh-driven k3s install script returns only after the apiserver is fully up, so "k3s ready" there is already truthful.

Verified locally: TestDocker_ProvisionTeardown sees ~10s between "starting docker" and "k3s ready" (was ~2s) and the subsequent envoy-gateway install runs without retry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
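A minimal sketch of the wait described above, assuming the helper name and knobs from the commit message (`waitForHostAPIServer`, 1s interval, 60s deadline); the context-name parameter and exact signature are illustrative, not taken from the diff:

```go
package docker

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// waitForHostAPIServer polls /readyz through kubectl until the merged
// context answers 200. A non-zero kubectl exit covers both failure modes:
// a transport error from a not-yet-bound port forward and a 503 from a
// still-starting apiserver, and both are retried.
func waitForHostAPIServer(ctx context.Context, kubeContext string) error {
	deadline := time.Now().Add(60 * time.Second)
	for {
		cmd := exec.CommandContext(ctx, "kubectl", "--context", kubeContext, "get", "--raw=/readyz")
		if err := cmd.Run(); err == nil {
			return nil // apiserver answered 200 on /readyz
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("apiserver /readyz never returned 200 within 60s (context %q)", kubeContext)
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(1 * time.Second):
		}
	}
}
```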
Review from maintainer agent

Verdict: LGTM. Tight, focused change against a real, deterministic race that ystack acceptance has been hitting. Worth merging.

What I liked
Suggestions (none blocking)

1. The qemu provisioner is "unaffected" by timing, not by guarantee

The PR description says: "its ssh-driven k3s install script returns only after the apiserver is fully up, so 'k3s ready' there is already truthful." That's accurate for "apiserver up inside the VM," but the host-side port forward is still a separate readiness step. qemu happens not to race today because its full-VM boot path runs much longer than the docker container's, by which time SLIRP forwarding has been bound for a while. On a slow host or with very fast k3s startup, qemu could in principle hit the same race. I wouldn't expand the scope of this PR, but it's worth keeping in mind.

2. A start-of-wait log line would help future debugging

The function is silent until success or the 60s timeout. With a start line, anyone staring at a slow provision log can see immediately which step took the time. Today on success there's no log between "merge kubeconfig" and "k3s ready," so a 30s wait looks like a 30s mystery. One log line at the start would do it.

3. Why shell out to kubectl instead of dialing the apiserver directly?

A code comment would help here: the consistency argument (envoygateway.Install is the very next caller and uses the same code path) is worth spelling out.
Two non-blocking review nits from #12:

- The function was silent until success/timeout; a 30s wait looked like a 30s mystery. Log the start so successive log lines bracket the readiness gate.
- Spell out *why* we shell out to kubectl instead of dialing the apiserver directly (consistency with envoygateway.Install, which is the very next caller and uses the same code path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
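In code, the two nits are small; a sketch of roughly what they amount to (log call and wording assumed, not taken from the diff):

```go
// Bracket the readiness gate in the log so a long wait is attributable.
log.Printf("waiting for host apiserver /readyz (context %q)", kubeContext)

// Shell out to kubectl rather than dialing the apiserver directly:
// envoygateway.Install is the very next caller and goes through kubectl
// against the same merged kubeconfig, so a passing probe proves the exact
// code path the next step will use.
```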
Address review nit #5 from #12: lock in the deadline path and the ctx-cancel path with a fake-kubectl-on-PATH harness.

To run the loop in milliseconds rather than the production 60s, the body of waitForHostAPIServer is split into pollHostAPIServerReadyz, which takes the timeout and interval as arguments. Production keeps the same effective behaviour via const wrappers; the test drives the helper directly with sub-second knobs.

Three cases:

- success: kubectl exits 0, returns nil immediately.
- deadline-honored: always-failing kubectl, expect the wrapped "/readyz never returned 200" error (not ctx.Err()) and the context name in the message.
- ctx-cancelled: 50ms ctx vs 10s loop timeout, expect ctx.Err() (not the deadline message) -- guards against a refactor that drops the select { <-ctx.Done() } branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
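A sketch of what such a harness can look like, assuming the `pollHostAPIServerReadyz(ctx, kubeContext, timeout, interval)` split described above; the stub script and the `"k3s"` context name are illustrative:

```go
package docker

import (
	"context"
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"testing"
	"time"
)

// withFakeKubectl puts a stub kubectl that exits with the given code at the
// front of PATH for the duration of the test.
func withFakeKubectl(t *testing.T, exitCode int) {
	t.Helper()
	dir := t.TempDir()
	script := fmt.Sprintf("#!/bin/sh\nexit %d\n", exitCode)
	if err := os.WriteFile(filepath.Join(dir, "kubectl"), []byte(script), 0o755); err != nil {
		t.Fatal(err)
	}
	t.Setenv("PATH", dir+string(os.PathListSeparator)+os.Getenv("PATH"))
}

func TestPollReadyz_Success(t *testing.T) {
	withFakeKubectl(t, 0) // kubectl exits 0: expect nil immediately
	if err := pollHostAPIServerReadyz(context.Background(), "k3s", time.Second, 10*time.Millisecond); err != nil {
		t.Fatal(err)
	}
}

func TestPollReadyz_DeadlineHonored(t *testing.T) {
	withFakeKubectl(t, 1) // always-failing kubectl
	err := pollHostAPIServerReadyz(context.Background(), "k3s", 200*time.Millisecond, 20*time.Millisecond)
	// Expect the wrapped deadline error (not ctx.Err()) naming the context.
	if err == nil || !strings.Contains(err.Error(), "/readyz never returned 200") || !strings.Contains(err.Error(), "k3s") {
		t.Fatalf("want deadline error naming the context, got %v", err)
	}
}

func TestPollReadyz_CtxCancelled(t *testing.T) {
	withFakeKubectl(t, 1)
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	// 50ms ctx vs 10s loop timeout: the select on <-ctx.Done() must win.
	if err := pollHostAPIServerReadyz(ctx, "k3s", 10*time.Second, 20*time.Millisecond); !errors.Is(err, context.DeadlineExceeded) {
		t.Fatalf("want ctx.Err(), got %v", err)
	}
}
```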
v0.3.5 (Yolean/y-cluster#12) added a host-side /readyz probe in the docker provisioner between the in-container kubeconfig appearing and "k3s ready" being declared. The previous race -- "k3s ready" firing while the docker port-forward to 127.0.0.1:6443 wasn't yet accepting, so the very next step (envoy-gateway install via kubectl apply) failed with "dial tcp 127.0.0.1:6443: connect: connection refused" -- is now closed at the source.

The 4x retry workaround in the linux-amd64 acceptance script never helped (each retry tore the cluster down and reproduced the same deterministic race) and is dropped. The provision call is back to a single line.

The y-kustomize Deployment image is bumped to the matching v0.3.5 tag for consistency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two y-cluster releases unblock the docker provider on ubuntu-latest and let the acceptance script collapse to a single provision call:

- v0.3.5 (Yolean/y-cluster#12) added a host-side /readyz probe between the in-container kubeconfig appearing and "k3s ready" being declared, closing the docker port-forward race that made envoy-gateway install fail with "dial tcp 127.0.0.1:6443: connect: connection refused". The 4x retry/sleep-10s workaround in this script is dead code now -- each retry tore the cluster down and reproduced the deterministic race anyway.
- v0.3.6 (Yolean/y-cluster#15) fixed a separate silent drop in the docker provider's PortBindings: HostIP was left as the zero netip.Addr ("invalid IP"), which moby v1.54+ marshals to the empty JSON string and Docker Engine 28 dropped silently.

A second issue with PortBindings still surfaces in some CI contexts -- the y-cluster-managed container's NetworkSettings.Ports comes back empty even with v0.3.6 -- but it's distinct from anything this script can work around; filed upstream against y-cluster.

The y-kustomize Deployment image is bumped to the matching v0.3.6 tag for consistency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
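The mechanics behind the v0.3.6 item are reproducible in a few lines: the zero `netip.Addr` stringifies as "invalid IP" but text-marshals to the empty string, so an unset HostIP serializes as `""` in JSON. A standalone demo of that behavior (the explicit-HostIP line illustrates the idea of the fix, not the actual y-cluster diff):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/netip"
)

func main() {
	var unset netip.Addr        // zero value, as the buggy PortBindings left HostIP
	fmt.Println(unset.String()) // "invalid IP"

	b, _ := json.Marshal(unset) // netip.Addr marshals via MarshalText; zero Addr -> ""
	fmt.Println(string(b))      // `""` -- the empty HostIP Docker Engine 28 drops silently

	fixed := netip.MustParseAddr("127.0.0.1") // the fix: set HostIP explicitly
	fmt.Println(fixed.String())               // "127.0.0.1"
}
```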
Summary
- `pkg/provision/docker/docker.go`: add `waitForHostAPIServer`, called between `waitForKubeconfig` and the `"k3s ready"` log. It polls `kubectl get --raw=/readyz` against the merged context (1s interval, 60s deadline) so a not-yet-bound port forward and a still-starting apiserver are both retried before `envoygateway.Install` (or any other host-side kubectl call) runs.
- Fix the `waitForKubeconfig` doc comment, which had asserted the opposite of reality ("k3s.yaml means apiserver is ready to accept connections").

Why
ystack acceptance on `ubuntu-latest` has been failing every run on the docker provider with `dial tcp 127.0.0.1:6443: connect: connection refused` (Yolean/ystack#76; example run: actions/runs/25220092901). The race is deterministic — every fresh provision shows ~2s between `"starting docker"` and `"k3s ready"`, with `kubectl` failing within 1s. The ystack-side 4× retry workaround can't paper over it because each retry tears the cluster down and reproduces the same race.

The qemu provisioner is unaffected: its ssh-driven k3s install script returns only after the apiserver is fully up, so `"k3s ready"` there is already truthful.

Test plan
- `go test ./...` passes
- `go test -tags 'e2e,docker' -run TestDocker_ProvisionTeardown ./e2e/` passes locally; `"k3s ready"` now arrives ~10s after `"starting docker"` (was ~2s) and the subsequent envoy-gateway install runs without retry
- e2e (`e2e,docker`) job green