🐞 WithServiceBinding + WithExec fails: dnsmasq addnhosts has IP entries with empty hostnames #13169

@gtzo-anchorage

What is the issue?

When a consumer container has any WithServiceBinding(alias, svc) and is then evaluated with WithExec(...).Stdout(ctx) (or .Sync(ctx)), the exec fails immediately at host-alias setup time with:

lookup <hash> for hosts file: lookup <hash> on 10.87.0.1:53: no such host
lookup <hash>.<session>.dagger.local on 10.87.0.1:53: no such host

This happens before the consumer's command runs at all — the consumer's command can be a harmless echo, never referencing the bound alias, and it still fails. Plain WithExec (no service binding) works. Starting the service in isolation (Service.Start(ctx) + Service.Hostname(ctx) + Service.Endpoint(ctx)) also works.

Direct inspection of the engine's dnsmasq hosts file while the service is running shows the bug's mechanism:

$ docker exec dagger-engine-v0.20.8 cat /var/run/containers/cni/dnsname/dagger/addnhosts
10.87.0.2	
10.87.0.3	
...
10.87.0.17	

Every entry has an IP but an empty hostname. The expected line for the service container (e.g. 10.87.0.18 <hash>) is never written. So when engine/buildkit/executor_spec.go does net.LookupIP(<hash>) to populate the consumer's /etc/hosts (around line 283), dnsmasq has no record and returns NXDOMAIN.

Looks related to #6951 (closed; same error string) and #13060 (open; different trigger). This reproduction does not use PRIVATE cache mounts or modules, so it's not #13060 specifically.

Dagger version

dagger v0.20.8 (image://registry.dagger.io/engine:v0.20.8) linux/amd64

Also reproduces on v0.18.19, where the failure was a silent, indefinite hang rather than a fast error; the upgrade made it diagnosable.

Steps to reproduce

Minimal Go test (go 1.26, dagger.io/dagger v0.20.8):

//go:build daggerE2ETest

package minrepro

import (
	"context"
	"testing"
	"time"

	"dagger.io/dagger"
	"github.com/stretchr/testify/require"
)

func TestServiceBindingDNS(t *testing.T) {
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()
	dag, err := dagger.Connect(ctx, dagger.WithLogOutput(testWriter{t}))
	require.NoError(t, err)
	t.Cleanup(func() { _ = dag.Close() })

	svc := dag.Container().
		From("alpine:3.20").
		WithExposedPort(8888).
		AsService(dagger.ContainerAsServiceOpts{
			Args: []string{"sh", "-c", "echo serving; nc -lk -p 8888 -s 0.0.0.0"},
		})

	// Consumer binds the alias but its exec never references it.
	out, err := dag.Container().
		From("alpine:3.20").
		WithServiceBinding("served", svc).
		WithExec([]string{"sh", "-c", "echo 'consumer running'; cat /etc/hosts"}).
		Stdout(ctx)
	require.NoError(t, err, "bound-but-unused consumer must succeed")
	t.Logf("stdout:\n%s", out)
}

type testWriter struct{ t *testing.T }

func (w testWriter) Write(p []byte) (int, error) {
	w.t.Logf("[dagger] %s", string(p))
	return len(p), nil
}

Run from the package directory with: go test -tags daggerE2ETest -count=1 -v -run TestServiceBindingDNS .

What we ruled out

  • Test design / runtime DNS use: consumer's command never resolves the alias; the failure is at WithExec network setup time, not inside the container.
  • Stale state: failure is identical on a fresh docker rm -f engine container.
  • Contention: failure is identical on a near-idle host (loadavg ~5) and a loaded one.
  • IPv4-only listener / readiness probe mismatch: explicit nc -s 0.0.0.0 (IPv4 wildcard) makes no difference.
  • Cache key drift (the trigger for #13060, "regression: private cache mounts can split service endpoint and binding identity"): this repro uses no PRIVATE cache mounts.
  • Start(ctx) anchoring: pre-anchoring the service identity via svc.Start(ctx) before binding to a consumer makes no difference.
  • Custom hostname: Service.WithHostname("named") makes no difference — same error with the chosen name in place of the random hash.

Workaround that works

dag.Host().Tunnel(svc).Start(ctx) + dial the tunnel endpoint from the host process succeeds in ~1s on v0.20.8. (Tunnel was reportedly broken on v0.18.19 in similar environments.) This sidesteps the WithServiceBinding consumer setup path entirely.

Where it appears to break (engine side, code references)

  • Lookup site: engine/buildkit/executor_spec.go:283, where net.LookupIP(qualified) returns NXDOMAIN because dnsmasq doesn't know the service hostname.
  • Registration site: cmd/dnsname/files.go:14-35 writes IP\tpodname\t… to addnhosts. With podname == "", the empty-hostname entries we observe are produced.
  • CNI args wiring: internal/buildkit/util/network/cniprovider/cni.go:269-277, where K8S_POD_NAME is only passed when the namespace is created with hostname != "". Pool slots (created via pool.provider.newNS(ctx, "")) intentionally have empty hostnames. What's unclear is why the service container, which should be created via c.newNS(ctx, hostname), also lacks a hostname entry in addnhosts. Three plausible failure modes:
    1. The service container is being routed through the pool path instead of the named-hostname path.
    2. K8S_POD_NAME is passed but lost between Dagger and the dnsname plugin on this host (libcni / kernel version interaction).
    3. The plugin runs but silently fails to write the hostname for hostnamed containers.

I don't have an engine dev-loop set up to instrument further, but happy to test patches.

Log output

=== RUN   TestServiceBindingDNS
    ✘ withExec sh -c 'echo …' ERROR
    ! lookup <hash> for hosts file: lookup <hash> on 10.87.0.1:53: no such host
      lookup <hash>.<session>.dagger.local on 10.87.0.1:53: no such host
--- FAIL: TestServiceBindingDNS (0.9s)
