
Intermittent failed to get state for index errors #6111

Closed
sipsma opened this issue Nov 15, 2023 · 11 comments

@sipsma (Contributor) commented Nov 15, 2023

These sorts of errors have been happening occasionally (~1/day) in our CI, approximately since we switched everything over to our shared runners:

Error: input:1: pipeline.pipeline.container.export failed to solve for container publish: failed to compute cache key: failed to get state for index 0 on copy /runc /usr/local/bin/runc

(run)

The exact message varies but seems to always happen on FileOps and include failed to compute cache key: failed to get state for index 0.

There is an upstream issue for this here: moby/buildkit#3635

If we can repro the error and get useful data, we'll move it to the upstream issue (or just send a PR with a fix). Creating this issue to track it in case anyone hits it and searches for the error message.


The fact that it seems to have started only once we switched to the shared runners is notable. My immediate gut reaction is that it probably has something to do with concurrent solves that overlap in vertices, since that has proven to be a fairly tricky and error-prone codepath in buildkit, and it's something that would not have happened before switching to the shared runners.
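
To make the hypothesis concrete, here is a minimal sketch (using the Go SDK) of the kind of overlap I have in mind: two clients running the same FileOp-producing (WithFile) pipeline concurrently against one shared Engine, so their solve graphs share vertices. This is only an illustration of the suspected scenario, not a confirmed reproducer; the image tag, file paths, and goroutine count are arbitrary.

package main

import (
    "context"
    "fmt"
    "os"
    "sync"

    "dagger.io/dagger"
)

func main() {
    ctx := context.Background()

    var wg sync.WaitGroup
    for i := 0; i < 2; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()

            // Each goroutine opens its own session against the same shared Engine.
            client, err := dagger.Connect(ctx, dagger.WithLogOutput(os.Stderr))
            if err != nil {
                fmt.Println(id, "connect error:", err)
                return
            }
            defer client.Close()

            // Both pipelines produce an identical copy (FileOp) vertex,
            // so the two concurrent solves overlap on it.
            src := client.Container().From("alpine:3.19").File("/etc/os-release")
            _, err = client.Container().
                From("alpine:3.19").
                WithFile("/tmp/os-release", src).
                Sync(ctx)
            fmt.Println(id, "solve error:", err)
        }(i)
    }
    wg.Wait()
}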

@matiasinsaurralde (Contributor)

I've been observing it in the past few days too: https://github.com/dagger/dagger/actions/runs/6937495840/job/18871658791?pr=6136

@gerhard (Member) commented Jan 31, 2024

I know that we've been seeing these for a while; I want to resume tracking them so that we have a better idea of how often they happen:

  1. https://github.com/dagger/dagger/actions/runs/7729432480/job/21072590483#step:6:720
  2. https://github.com/dagger/dagger/actions/runs/7728437741/job/21069222134#step:6:175

@gerhard (Member) commented Feb 1, 2024

@gerhard (Member) commented Mar 12, 2024

@mckinnsb commented Apr 12, 2024

I also get "failed to compute cache key" errors in CI on Kubernetes with the runner, but with a different "source":

cause: ClientError: resolve: container: withRegistryAuth: build: publish: failed to solve for container publish: failed to compute cache key: failed to unmount /tmp/containerd-mount922630576: failed to unmount target /tmp/containerd-mount922630576: device or resource busy: {"response":{"errors":[{"message":"resolve: container: withRegistryAuth: build: publish: failed to solve for container publish: failed to compute cache key: failed to unmount /tmp/containerd-mount922630576: failed to unmount target /tmp/containerd-mount922630576: device or resource busy"}]

@helderco (Contributor)

Whelp! This error is happening a lot in Python tests:

failed to compute cache key: failed to get state for index 0 on copy /runtime/template/runtime.py /runtime

That copy comes from here:

m.Container = m.Container.WithFile(
    RuntimeExecutablePath,
    template.File("runtime.py"),
    ContainerWithFileOpts{Permissions: 0o755},
)

Seems to be more frequent with the tests around the lock file. See anything wrong here?

func TestModulePythonLockHashes(t *testing.T) {
    t.Parallel()
    c, ctx := connect(t)
    base := daggerCliBase(t, c).With(daggerInitPython())
    out, err := base.File("requirements.lock").Contents(ctx)
    require.NoError(t, err)
    // Replace hashes for platformdirs with an invalid one.
    // The lock file has the following format:
    //
    // httpx==0.27.0 \
    //     --hash=sha256:71d5465162c13681bff01ad59b2cc68dd838ea1f10e51574bac27103f00c91a5 \
    //     --hash=sha256:a0cb88a46f32dc874e04ee956e4c2764aba2aa228f650b06788ba6bda2962ab5
    //     # via gql
    // platformdirs==4.2.0 \
    //     --hash=sha256:0614df2a2f37e1a662acbd8e2b25b92ccf8632929bc6d43467e17fe89c75e068 \
    //     --hash=sha256:ef0cc731df711022c174543cb70a9b5bd22e5a9337c8624ef2c2ceb8ddad8768
    // pygments==2.17.2 \
    //     --hash=sha256:b27c2826c47d0f3219f29554824c30c5e8945175d888647acd804ddd04af846c \
    //     --hash=sha256:da46cec9fd2de5be3a8a784f434e4c4ab670b4ff54d605c4c2717e9d49c4c367
    //     # via rich
    var lock strings.Builder
    replaceHashes := false
    for _, line := range strings.Split(out, "\n") {
        if strings.HasPrefix(line, "platformdirs==") {
            replaceHashes = true
            lock.WriteString(
                fmt.Sprintf("%s\n --hash=sha256:%s\n", line, strings.Repeat("1", 64)),
            )
            continue
        }
        if replaceHashes {
            if strings.HasPrefix(strings.TrimSpace(line), "--hash") {
                continue
            } else {
                replaceHashes = false
            }
        }
        lock.WriteString(line)
        lock.WriteString("\n")
    }
    requirements := lock.String()
    t.Logf("requirements.lock:\n%s", requirements)
    t.Run("uv", func(t *testing.T) {
        _, err := base.
            With(fileContents("requirements.lock", requirements)).
            With(daggerExec("develop")).
            Sync(ctx)
        // TODO: uv doesn't support hash verification yet.
        // require.ErrorContains(t, err, "hash mismatch")
        require.NoError(t, err)
    })
    t.Run("pip", func(t *testing.T) {
        _, err := base.
            With(fileContents("requirements.lock", requirements)).
            With(pyprojectExtra(`
                [tool.dagger]
                use-uv = false
            `)).
            With(daggerExec("develop")).
            Sync(ctx)
        require.ErrorContains(t, err, "DO NOT MATCH THE HASHES")
        require.ErrorContains(t, err, "Expected sha256 1111111")
    })
}

Maybe this is a simpler case for debugging and figuring this out.

@sipsma (Contributor, Author) commented Apr 19, 2024

Just FYI, I'm attempting to debug this and the other (possibly related) inconsistent graph state errors here: #7128

@sipsma (Contributor, Author) commented Apr 26, 2024

More details in #7128, but I ended up being able to repro this very consistently locally and figured out the bug upstream; the fix is at moby/buildkit#4887.

I'll leave this open until we've picked up the upstream fix, done a release with it, upgraded the CI runners, and confirmed that the errors are gone.

@gerhard (Member) commented Apr 30, 2024

Yes!

FTR, the new CI runners for Engine & CLI are 16 CPUs & 128GB RAM, so this is less likely to occur for those CI jobs.

All other CI runners are still bunched around a single Engine, so we can expect this to continue being an issue.

💪

@jedevc (Member) commented May 7, 2024

The worst of this seems to be resolved by #7295, so I'm going to take this out of the milestone, but it's worth leaving open until the upstream fix lands.

@jedevc removed this from the v0.11.3 milestone on May 7, 2024
@sipsma (Contributor, Author) commented Jun 5, 2024

We've been running with this fix in place for a while and haven't seen the error since (it used to happen one to a few times a day), so I'll close this out until proven otherwise.

@sipsma closed this as completed on Jun 5, 2024