What is the issue?
When a consumer container has any WithServiceBinding(alias, svc) and is then evaluated with WithExec(...).Stdout(ctx) (or .Sync(ctx)), the exec fails immediately at host-alias setup time with:
lookup <hash> for hosts file: lookup <hash> on 10.87.0.1:53: no such host
lookup <hash>.<session>.dagger.local on 10.87.0.1:53: no such host
This happens before the consumer's command runs at all — the consumer's command can be a harmless echo, never referencing the bound alias, and it still fails. Plain WithExec (no service binding) works. Starting the service in isolation (Service.Start(ctx) + Service.Hostname(ctx) + Service.Endpoint(ctx)) also works.
Direct inspection of the engine's dnsmasq hosts file while the service is running shows the bug's mechanism:
$ docker exec dagger-engine-v0.20.8 cat /var/run/containers/cni/dnsname/dagger/addnhosts
10.87.0.2
10.87.0.3
...
10.87.0.17
Every entry has an IP but an empty hostname. The expected line for the service container (e.g. 10.87.0.18 <hash>) is never written. So when engine/buildkit/executor_spec.go calls net.LookupIP(<hash>) to populate the consumer's /etc/hosts (around line 283), dnsmasq has no record and returns NXDOMAIN.
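The same check can be scripted against the contents of the addnhosts file. This is a minimal sketch and not part of Dagger or the dnsname plugin: it assumes one entry per line, an IP followed by whitespace-separated names, and `findNamelessEntries` is a hypothetical helper name.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// findNamelessEntries returns the IPs of addnhosts lines that carry no hostname.
// Assumed format: "<ip>[<whitespace><name>...]" per line.
// Blank lines and lines starting with "#" are skipped.
func findNamelessEntries(addnhosts string) []string {
	var nameless []string
	sc := bufio.NewScanner(strings.NewReader(addnhosts))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 0 || strings.HasPrefix(fields[0], "#") {
			continue
		}
		if len(fields) == 1 {
			// Only an IP is present, so dnsmasq has no name to answer for.
			nameless = append(nameless, fields[0])
		}
	}
	return nameless
}

func main() {
	// Illustrative sample: two nameless entries and one named entry.
	sample := "10.87.0.2\t\n10.87.0.3\t\n10.87.0.18\tsvc-hash\n"
	fmt.Println(findNamelessEntries(sample))
}
```

Feeding it the output of the docker exec command above would list every IP that was registered without a name.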
Looks related to #6951 (closed; same error string) and #13060 (open; different trigger). This reproduction does not use PRIVATE cache mounts or modules, so it's not #13060 specifically.
Workaround that works
dag.Host().Tunnel(svc).Start(ctx) + dial the tunnel endpoint from the host process succeeds in ~1s on v0.20.8. (Tunnel was reportedly broken on v0.18.19 in similar environments.) This sidesteps the WithServiceBinding consumer setup path entirely.
Where it appears to break (engine side, code references)
Lookup site: engine/buildkit/executor_spec.go:283 — net.LookupIP(qualified) returns NXDOMAIN because dnsmasq doesn't know the service hostname.
Registration site: cmd/dnsname/files.go:14-35 writes IP\tpodname\t… to addnhosts. With podname == "", the empty-hostname entries we observe are produced.
CNI args wiring: internal/buildkit/util/network/cniprovider/cni.go:269-277 — K8S_POD_NAME is only passed when the namespace is created with hostname != "". Pool slots (created via pool.provider.newNS(ctx, "")) intentionally have empty hostnames. What's unclear is why the service container — which should be created via c.newNS(ctx, hostname) — also lacks a hostname entry in addnhosts. Three plausible failure modes:
The service container is being routed through the pool path instead of the named-hostname path.
K8S_POD_NAME is passed but lost between Dagger and the dnsname plugin on this host (libcni / kernel version interaction).
The plugin runs but silently fails to write the hostname for hostnamed containers.
I don't have an engine dev-loop set up to instrument further, but happy to test patches.
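To make the suspected chain concrete, here is a simplified model of the behaviour described above. It is not the engine's code: `cniArgs` and `addnhostsLine` are hypothetical stand-ins for the logic referenced at cni.go:269-277 and files.go:14-35, and the model writes only the IP and pod name, leaving out whatever fields follow in the real file.

```go
package main

import (
	"fmt"
	"strings"
)

// cniArgs models the wiring described above: K8S_POD_NAME is only
// passed when the namespace was created with a non-empty hostname.
func cniArgs(hostname string) map[string]string {
	args := map[string]string{}
	if hostname != "" {
		args["K8S_POD_NAME"] = hostname
	}
	return args
}

// addnhostsLine models the registration step: IP, a tab, then the pod name.
// A missing K8S_POD_NAME yields an entry with an IP and an empty hostname.
func addnhostsLine(ip string, args map[string]string) string {
	return ip + "\t" + args["K8S_POD_NAME"]
}

func main() {
	// A pool slot is created with an empty hostname.
	pool := addnhostsLine("10.87.0.2", cniArgs(""))
	// A service namespace is expected to carry its hostname.
	named := addnhostsLine("10.87.0.18", cniArgs("svc-hash"))
	fmt.Printf("%q\n%q\n", pool, named)

	// Count whitespace-separated fields: only the named line has a hostname.
	fmt.Println(len(strings.Fields(pool)), len(strings.Fields(named)))
}
```

In this model, any path that drops the hostname before registration produces the empty-hostname lines seen in addnhosts, which matches the first two failure modes listed above.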
Log output
=== RUN TestServiceBindingDNS
✘ withExec sh -c 'echo …' ERROR
! lookup <hash> for hosts file: lookup <hash> on 10.87.0.1:53: no such host
lookup <hash>.<session>.dagger.local on 10.87.0.1:53: no such host
--- FAIL: TestServiceBindingDNS (0.9s)
Dagger version
dagger v0.20.8 (image://registry.dagger.io/engine:v0.20.8) linux/amd64
Also reproduces on v0.18.19, where the failure was a silent, indefinite hang instead of a fast error; the upgrade made it diagnosable.
Steps to reproduce
Minimal Go test (go 1.26, dagger.io/dagger v0.20.8):
Run with:
go test -tags daggerE2ETest -count=1 -v -run TestServiceBindingDNS .
What we ruled out
docker rm -f engine container.
nc -s 0.0.0.0 (IPv4 wildcard) makes no difference.
PRIVATE cache mounts.
Start(ctx) anchoring: pre-anchoring the service identity via svc.Start(ctx) before binding to a consumer makes no difference.
Custom hostname: Service.WithHostname("named") makes no difference; same error with the chosen name in place of the random hash.