nerdctl sometimes fails to check whether a container name already exists and ends up creating duplicates #2992
Comments
Actually, I have a reproducer:
Will result after some time with:
|
Maybe the culprit here is the fact that the container almost immediately exits, but has a restart policy? |
@AkihiroSuda this one is really bad. I'll look around and see if I can send a patch. |
I am new here, but this is what I gathered from reading the source: it looks like container names are managed in namestore.go, with a filesystem-based store.
I am leaning towards number 2 ^ but any opinion / more informed insight on this would be welcome. Thanks! |
Confirming there is a serious issue with the name store:
|
Release is being called somewhere. |
@AkihiroSuda I need help.
The problem lies in: https://github.com/containerd/nerdctl/blob/main/pkg/ocihook/ocihook.go#L486
During the whole run/restart cycle,
The reproducer is actually much simpler - just start a failing container with
The problem now is that I am out of my depth here. Do you have any insight on what to do to fix this?
I am now thinking the entire namestore is a bad idea and we should rather query containerd instead of trying to maintain our own "db".
Thanks in advance. |
This may increase the lookup cost from O(1) to O(n)? |
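To make the cost question concrete, here is a minimal sketch of the two lookup strategies. Everything in it is hypothetical: `containerRecord`, `nameIndex`, and `existsByScan` are illustration-only names, not nerdctl or containerd APIs, and it assumes that querying the runtime means listing every container and filtering on the client side.

```go
package main

import "fmt"

// containerRecord is a hypothetical stand-in for whatever a runtime returns
// when listing containers. It is not a containerd type.
type containerRecord struct {
	ID   string
	Name string
}

// nameIndex models a dedicated name store: one entry per name, so a lookup
// is a single keyed access (comparable to checking for one file on disk).
type nameIndex map[string]string // name -> container ID

// exists reports whether name is present in the index.
func (idx nameIndex) exists(name string) bool {
	_, ok := idx[name]
	return ok
}

// existsByScan models listing all containers and comparing names one by one,
// which takes time proportional to the number of containers.
func existsByScan(all []containerRecord, name string) bool {
	for _, c := range all {
		if c.Name == name {
			return true
		}
	}
	return false
}

func main() {
	all := []containerRecord{{ID: "c1", Name: "web"}, {ID: "c2", Name: "db"}}

	// Build the index from the same data so both strategies can be compared.
	idx := nameIndex{}
	for _, c := range all {
		idx[c.Name] = c.ID
	}

	fmt.Println(idx.exists("db"), existsByScan(all, "db"))
	fmt.Println(idx.exists("cache"), existsByScan(all, "cache"))
}
```

Both strategies give the same answer; the difference is only in how much work each lookup does as the number of containers grows.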
Probably (unless containerd lookup methods are smarter than just enumerating all containers).
It seems to me the OCI lifecycle events are just not enough for us to maintain a copy of the state on our side.
Here is my current train of thought:
Solution A: keep namestore, and patch ocihook
Pretty much, in onStartContainer, call namestore.Acquire again (I have not verified that containerd will actually trigger these in that case - hopefully it does). It will not fix the problem, but it will alleviate it by making the racy window much shorter: between postStop and onStart for the first (restart always, failing) container.
Solution B: remove namestore entirely, and query containerd
The upside is that we stop trying to keep the state on our side (pretty sure there are other ways to fuck it up than this specific issue).
Thoughts? |
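For illustration, the race and Solution A can be sketched roughly as below. This is not the nerdctl implementation: `nameStore` is a hypothetical in-memory stand-in for the filesystem-based store, its `Acquire`/`Release` signatures are assumptions, and the sketch assumes a start hook fires on every restart, which is unverified as noted above.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errNameInUse = errors.New("name is already used")

// nameStore is a hypothetical in-memory name store (name -> owning container ID).
type nameStore struct {
	mu    sync.Mutex
	names map[string]string
}

func newNameStore() *nameStore {
	return &nameStore{names: map[string]string{}}
}

// Acquire reserves name for id. Re-acquiring a name already held by the same
// id succeeds, so a start hook can safely call it on every (re)start.
func (s *nameStore) Acquire(name, id string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if owner, ok := s.names[name]; ok && owner != id {
		return fmt.Errorf("%w: %q is held by %s", errNameInUse, name, owner)
	}
	s.names[name] = id
	return nil
}

// Release frees name, but only if it is currently held by id.
func (s *nameStore) Release(name, id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.names[name] == id {
		delete(s.names, name)
	}
}

func main() {
	// Without re-acquire: the stop hook releases the name, the container
	// restarts, and nothing reserves the name again.
	s := newNameStore()
	_ = s.Acquire("web", "c1") // container c1 is created
	s.Release("web", "c1")     // c1 exits, stop hook releases the name
	fmt.Println("duplicate allowed:", s.Acquire("web", "c2") == nil)

	// Solution A: the start hook re-acquires the name on restart, so the
	// name is only free between the stop hook and the next start hook.
	s = newNameStore()
	_ = s.Acquire("web", "c1") // container c1 is created
	s.Release("web", "c1")     // c1 exits, stop hook releases the name
	_ = s.Acquire("web", "c1") // c1 restarts, start hook re-acquires
	fmt.Println("duplicate allowed:", s.Acquire("web", "c2") == nil)
}
```

Note that a second container can still take the name if it is created between the release and the re-acquire, which is why this only narrows the window rather than closing it.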
I guess we can try A first and see if it practically works. We may try B later if A is still too racy. |
Ok. |
Signed-off-by: apostasie <spam_blackhole@farcloser.world>
Re-acquire name in onStartContainer (see #2992)
Unfortunately, I am seeing this somewhat often while parallelizing tests. |
Description
Title says all
Steps to reproduce the issue
Unfortunately, I do not have a reproducer and this seems to happen under certain circumstances only.
This is definitely confirmed though:
Note error message above ^
Describe the results you received and expected
This should never happen.
What version of nerdctl are you using?
1.7.6
Are you using a variant of nerdctl? (e.g., Rancher Desktop)
None
Host information
No response