
Intermittent failed to get state for index errors #6111

Closed
sipsma opened this issue Nov 15, 2023 · 11 comments

@sipsma (Contributor) commented Nov 15, 2023

These sorts of errors have been happening occasionally (~1/day) in our CI, approximately since we switched everything over to our shared runners:

Error: input:1: pipeline.pipeline.container.export failed to solve for container publish: failed to compute cache key: failed to get state for index 0 on copy /runc /usr/local/bin/runc

(run)

The exact message varies but seems to always happen on FileOps and include failed to compute cache key: failed to get state for index 0.

There is an upstream issue for this here: moby/buildkit#3635

If we can repro the error and get useful data, we'll move it to the upstream issue (or just send a PR with a fix). Creating this issue to track it in case anyone hits it and searches for the error message.


The fact that it seems to have started only once we switched to the shared runners is notable. My immediate gut reaction is that it probably has something to do with concurrent solves that overlap in vertices, since that has proven to be a fairly tricky and error-prone codepath in buildkit, and it's something that would not have happened before switching to the shared runners.
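
To make the hypothesis concrete, here is a minimal sketch (using the Go SDK) of the kind of overlap I have in mind: two clients running the same FileOp-producing (WithFile) pipeline concurrently against one shared Engine, so their solve graphs share vertices. This is only an illustration of the suspected scenario, not a confirmed reproducer; the image tag, file paths, and goroutine count are arbitrary.

package main

import (
    "context"
    "fmt"
    "os"
    "sync"

    "dagger.io/dagger"
)

func main() {
    ctx := context.Background()

    var wg sync.WaitGroup
    for i := 0; i < 2; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()

            // Each goroutine opens its own session against the same shared Engine.
            client, err := dagger.Connect(ctx, dagger.WithLogOutput(os.Stderr))
            if err != nil {
                fmt.Println(id, "connect error:", err)
                return
            }
            defer client.Close()

            // Both pipelines produce an identical copy (FileOp) vertex,
            // so the two concurrent solves overlap on it.
            src := client.Container().From("alpine:3.19").File("/etc/os-release")
            _, err = client.Container().
                From("alpine:3.19").
                WithFile("/tmp/os-release", src).
                Sync(ctx)
            fmt.Println(id, "solve error:", err)
        }(i)
    }
    wg.Wait()
}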

@matiasinsaurralde (Contributor)

I've been observing it in the past few days too: https://github.com/dagger/dagger/actions/runs/6937495840/job/18871658791?pr=6136

@gerhard (Member) commented Jan 31, 2024

I know that we've been seeing these for a while; I want to resume tracking them so that we have a better idea of how often they happen:

  1. https://github.com/dagger/dagger/actions/runs/7729432480/job/21072590483#step:6:720
  2. https://github.com/dagger/dagger/actions/runs/7728437741/job/21069222134#step:6:175

@gerhard (Member) commented Feb 1, 2024

@gerhard (Member) commented Mar 12, 2024

@mckinnsb commented Apr 12, 2024

I also get "failed to compute cache key" errors in CI on Kubernetes with the runner, but with a different "source":

cause: ClientError: resolve: container: withRegistryAuth: build: publish: failed to solve for container publish: failed to compute cache key: failed to unmount /tmp/containerd-mount922630576: failed to unmount target /tmp/containerd-mount922630576: device or resource busy: {"response":{"errors":[{"message":"resolve: container: withRegistryAuth: build: publish: failed to solve for container publish: failed to compute cache key: failed to unmount /tmp/containerd-mount922630576: failed to unmount target /tmp/containerd-mount922630576: device or resource busy"}]

@helderco (Contributor)

Whelp! This error is happening a lot in Python tests:

failed to compute cache key: failed to get state for index 0 on copy /runtime/template/runtime.py /runtime

That copy comes from here:

m.Container = m.Container.WithFile(
    RuntimeExecutablePath,
    template.File("runtime.py"),
    ContainerWithFileOpts{Permissions: 0o755},
)

Seems to be more frequent with the tests around the lock file. See anything wrong here?

func TestModulePythonLockHashes(t *testing.T) {
    t.Parallel()
    c, ctx := connect(t)
    base := daggerCliBase(t, c).With(daggerInitPython())
    out, err := base.File("requirements.lock").Contents(ctx)
    require.NoError(t, err)
    // Replace hashes for platformdirs with an invalid one.
    // The lock file has the following format:
    //
    // httpx==0.27.0 \
    //     --hash=sha256:71d5465162c13681bff01ad59b2cc68dd838ea1f10e51574bac27103f00c91a5 \
    //     --hash=sha256:a0cb88a46f32dc874e04ee956e4c2764aba2aa228f650b06788ba6bda2962ab5
    //     # via gql
    // platformdirs==4.2.0 \
    //     --hash=sha256:0614df2a2f37e1a662acbd8e2b25b92ccf8632929bc6d43467e17fe89c75e068 \
    //     --hash=sha256:ef0cc731df711022c174543cb70a9b5bd22e5a9337c8624ef2c2ceb8ddad8768
    // pygments==2.17.2 \
    //     --hash=sha256:b27c2826c47d0f3219f29554824c30c5e8945175d888647acd804ddd04af846c \
    //     --hash=sha256:da46cec9fd2de5be3a8a784f434e4c4ab670b4ff54d605c4c2717e9d49c4c367
    //     # via rich
    var lock strings.Builder
    replaceHashes := false
    for _, line := range strings.Split(out, "\n") {
        if strings.HasPrefix(line, "platformdirs==") {
            replaceHashes = true
            lock.WriteString(
                fmt.Sprintf("%s\n --hash=sha256:%s\n", line, strings.Repeat("1", 64)),
            )
            continue
        }
        if replaceHashes {
            if strings.HasPrefix(strings.TrimSpace(line), "--hash") {
                continue
            } else {
                replaceHashes = false
            }
        }
        lock.WriteString(line)
        lock.WriteString("\n")
    }
    requirements := lock.String()
    t.Logf("requirements.lock:\n%s", requirements)
    t.Run("uv", func(t *testing.T) {
        _, err := base.
            With(fileContents("requirements.lock", requirements)).
            With(daggerExec("develop")).
            Sync(ctx)
        // TODO: uv doesn't support hash verification yet.
        // require.ErrorContains(t, err, "hash mismatch")
        require.NoError(t, err)
    })
    t.Run("pip", func(t *testing.T) {
        _, err := base.
            With(fileContents("requirements.lock", requirements)).
            With(pyprojectExtra(`
                [tool.dagger]
                use-uv = false
            `)).
            With(daggerExec("develop")).
            Sync(ctx)
        require.ErrorContains(t, err, "DO NOT MATCH THE HASHES")
        require.ErrorContains(t, err, "Expected sha256 1111111")
    })
}

Maybe this is a simpler case for debugging and figuring this out.

@sipsma (Contributor, Author) commented Apr 19, 2024

Just FYI, I'm attempting to debug this and the other (possibly related) inconsistent graph state errors here: #7128

@sipsma (Contributor, Author) commented Apr 26, 2024

More details in #7128, but I ended up being able to repro this very consistently locally and figured out the bug upstream; the fix is at moby/buildkit#4887.

I'll leave this open until we've picked up the upstream fix, done a release with it, upgraded the CI runners, and confirmed that the errors are gone.

@gerhard (Member) commented Apr 30, 2024

Yes!

FTR, the new CI runners for Engine & CLI are 16 CPUs & 128GB RAM, so this is less likely to occur for those CI jobs.

All other CI runners are still bunched around a single Engine, so we can expect this to continue being an issue.

💪

@jedevc (Member) commented May 7, 2024

The worst of this seems to be resolved by #7295, so I'm going to take this out of the milestone, but it's worth leaving open until the upstream fix lands.

@jedevc removed this from the v0.11.3 milestone on May 7, 2024
@sipsma (Contributor, Author) commented Jun 5, 2024

We've been running with this fix in place for a while and haven't seen the error since (it used to happen one to a few times a day), so I'll close this out until proven otherwise.

@sipsma closed this as completed on Jun 5, 2024