
nerdctl sometimes fails to check whether a container name already exists and ends up creating duplicates #2992

Open
apostasie opened this issue May 11, 2024 · 12 comments · Fixed by #2995
Labels
kind/unconfirmed-bug-claim Unconfirmed bug claim

Comments

@apostasie
Contributor

Description

Title says it all.

Steps to reproduce the issue

Unfortunately, I do not have a reproducer, and this seems to happen only under certain circumstances.

This is definitely confirmed though:

apo@oliphant:~ $ sudo ./nerdctl ps
CONTAINER ID    IMAGE                                                         COMMAND                   CREATED              STATUS    PORTS                     NAMES
5e7d6291a423    docker.io/library/debian:latest                               "sleep 3600"              4 minutes ago        Up                                  restarttest
b579c38e3c55    docker.io/library/debian:latest                               "sleep 123"               6 seconds ago        Up                                  restarttest

apo@oliphant:~ $ sudo ./nerdctl rm restarttest
FATA[0000] 1 errors:
multiple IDs found with provided prefix: restarttest

Note the error message above ^

Describe the results you received and expected

This should never happen.

What version of nerdctl are you using?

1.7.6

Are you using a variant of nerdctl? (e.g., Rancher Desktop)

None

Host information

No response

@apostasie apostasie added the kind/unconfirmed-bug-claim Unconfirmed bug claim label May 11, 2024
@apostasie
Contributor Author

Actually, I have a reproducer:

while true; do sudo ./nerdctl run -d --restart always --name whatevername debian bash;  done

After some time, this will result in:

apo@oliphant:~ $ sudo ./nerdctl ps -a | grep whatevername
016ee92893a2    docker.io/library/debian:latest                               "exit 1"                  About a minute ago    Created                                                   whatevername
2a9140621055    docker.io/library/debian:latest                               "exit 1"                  About a minute ago    Created                                                   whatevername
3b294997c56c    docker.io/library/debian:latest                               "bash"                    51 seconds ago        Restarting (0) 8 seconds ago                              whatevername
5fb34e4cd5a8    docker.io/library/debian:latest                               "bash"                    About a minute ago    Restarting (0) 8 seconds ago                              whatevername
86606352a9af    docker.io/library/debian:latest                               "exit 1"                  About a minute ago    Created                                                   whatevername
b3eebcb75094    docker.io/library/debian:latest                               "exit 1"                  About a minute ago    Created                                                   whatevername

@apostasie
Contributor Author

Maybe the culprit here is that the container exits almost immediately but has a restart policy?

@apostasie
Contributor Author

@AkihiroSuda this one is really bad

I'll look around and see if I can send a patch.

@apostasie
Contributor Author

I am new here, but this is what I gathered from reading the source: name reservation looks like it is managed in namestore.go with a filesystem-based store.
Obviously, this is racy, so I assume either:

  1. there is a lock mechanism in there that is not doing what it should
  2. or the name is temporarily released for some other reason during the exit/restart cycle

I am leaning towards number 2 ^, but any opinion / more informed insight on this would be welcome.
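
For context, here is a minimal sketch of what a filesystem-based name store roughly looks like - illustrative only; the names, paths and signatures are assumptions, not the actual namestore.go code:

```go
// Minimal sketch of a filesystem-backed name store, for illustration only.
// Names, paths and signatures are assumptions, not the real namestore.go.
package namestore

import (
	"fmt"
	"os"
	"path/filepath"
)

type NameStore struct {
	dir string // e.g. /var/lib/nerdctl/<hash>/names/<namespace>
}

// Acquire reserves a name by writing a file <dir>/<name> containing the ID.
// (A real store would need locking around this check-then-write.)
func (s *NameStore) Acquire(name, id string) error {
	path := filepath.Join(s.dir, name)
	if existing, err := os.ReadFile(path); err == nil {
		if string(existing) == id {
			return nil // already owned by this container
		}
		return fmt.Errorf("name %q is already in use by container %s", name, existing)
	}
	return os.WriteFile(path, []byte(id), 0o600)
}

// Release frees the name by removing the file. If this is ever called while
// the container still exists (e.g. during a restart cycle), another container
// can grab the same name - which is hypothesis 2 above.
func (s *NameStore) Release(name, id string) error {
	return os.Remove(filepath.Join(s.dir, name))
}
```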

Thanks!

@apostasie
Contributor Author

Confirming there is a serious issue with the name store:

apo@oliphant:~ $ sudo ./nerdctl ps -a | grep whatever
ab08a16d1d74    docker.io/library/debian:latest                               "bash"                    7 minutes ago    Restarting (0) 6 seconds ago                              whatevername
b249f8f569bb    docker.io/library/debian:latest                               "bash"                    7 minutes ago    Restarting (0) 6 seconds ago

apo@oliphant:~ $ sudo ls -lA /var/lib/nerdctl/1935db59/names/default/ | grep whatever

The above returns nothing - the file in the name store is gone.

@apostasie
Contributor Author

apostasie commented May 12, 2024

At first, surprisingly, namestore.Release did not seem to ever be called - which made me wonder whether another routine was altering the name store.

It turns out Release is being called somewhere, and this seems to be why this is racy: something releases the name while it should not.

@apostasie
Contributor Author

apostasie commented May 12, 2024

@AkihiroSuda I need help.

The problem lies in: https://github.com/containerd/nerdctl/blob/main/pkg/ocihook/ocihook.go#L486

During the whole run/restart cycle, onPostStop is called, and it releases the name.
The race happens if another Create is called after that, before the first container is restarted and gets to lock the name again.
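
To spell the window out, here is a simplified sketch reusing the NameStore type from the earlier comment; the hook signature is hypothetical, not the actual ocihook code:

```go
// Simplified illustration of the race; the hook signature is hypothetical.
// postStop fires on every stop, including stops that are part of a restart
// cycle, so the name is released while the container still exists:
//
//   t0  container A (--restart always) exits       -> postStop -> Release(name)
//   t1  `nerdctl run --name <name>` for container B -> Acquire(name) succeeds
//   t2  container A is restarted anyway             -> two containers, one name
func onPostStop(store *NameStore, containerName, containerID string) error {
	if containerName == "" {
		return nil
	}
	return store.Release(containerName, containerID)
}
```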

The reproducer is actually much simpler - just start a failing container with --restart always, and after some time the name file will just disappear from /var/lib/nerdctl/XXX/names/default/.

The problem now is that I am out of my depth here.

Do you have any insight on what to do to fix this?

I am now thinking the entire namestore is a bad idea, and that we should query containerd instead of trying to maintain our own "db".

Thanks in advance.

@AkihiroSuda
Member

AkihiroSuda commented May 13, 2024

query containerd

This may increase the lookup cost from O(1) to O(n)?

@apostasie
Contributor Author

query containerd

This may increase the lookup cost from O(1) to O(n)?

Probably (unless containerd lookup methods are smarter than just enumerating all containers).

It seems to me the OCI lifecycle events are just not enough for us to maintain a copy of the state on our side.
postStop just tells us that a container is done - and not whether it is going to be restarted or not.

Here is my current train of thought:

Solution A: keep namestore, and patch ocihook

Essentially, in onStartContainer, call namestore.Acquire again (I have not verified that containerd will actually trigger these hooks in that case - hopefully it does).

It will not fix the problem, but it will alleviate it by making the racy window much shorter: between postStop and onStart for the first (restart-always, failing) container.
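
A rough sketch of Solution A, with the same caveat that the hook signature here is hypothetical and not the actual ocihook code:

```go
// Hypothetical sketch of Solution A: re-acquire the name when the container
// (re)starts, so the name is only free between postStop and the next start.
func onStartContainer(store *NameStore, containerName, containerID string) error {
	if containerName == "" {
		return nil
	}
	// With the sketch above, Acquire is a no-op when this container already
	// owns the name, and fails if another container grabbed it in the window.
	return store.Acquire(containerName, containerID)
}
```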

Solution B: remove namestore entirely, and query containerd

containerNameStore.Acquire(name id) becomes client.Containers(ctx, fmt.Sprintf("labels.%q==%s", "nerdctl/name", name))

The upside is that we stop trying to keep the state on our side (pretty sure there are other ways to break it than this specific issue).
The downside is performance.
Also, I am not 100% positive this won't still be racy (it really depends on how containerd implements that).
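
For reference, a sketch of what Solution B could look like with the containerd Go client - the "nerdctl/name" label key and the filter syntax mirror the snippet above and would need to be double-checked against what nerdctl actually sets:

```go
// Hypothetical sketch of Solution B: resolve names by querying containerd
// labels instead of maintaining a file-based name store.
package main

import (
	"context"
	"fmt"

	containerd "github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

// nameInUse reports whether some container in the namespace already carries
// the nerdctl name label with this value.
func nameInUse(ctx context.Context, client *containerd.Client, name string) (bool, error) {
	filter := fmt.Sprintf("labels.%q==%s", "nerdctl/name", name)
	containers, err := client.Containers(ctx, filter)
	if err != nil {
		return false, err
	}
	return len(containers) > 0, nil
}

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		panic(err)
	}
	defer client.Close()

	ctx := namespaces.WithNamespace(context.Background(), "default")
	used, err := nameInUse(ctx, client, "whatevername")
	if err != nil {
		panic(err)
	}
	fmt.Println("name in use:", used)
}
```

Even with this, whether Create is atomic with respect to that label check is exactly the "still racy" concern above.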

Thoughts?

@AkihiroSuda
Member

I guess we can try A first and see if it practically works. We may try B later if A is still too racy.

@apostasie
Contributor Author

Ok.
I'll send a PR.

apostasie added a commit to apostasie/nerdctl that referenced this issue May 13, 2024
AkihiroSuda added a commit that referenced this issue May 14, 2024
Re-acquire name in onStartContainer (see #2992)
@apostasie
Contributor Author

@AkihiroSuda

Unfortunately, I am seeing this somewhat often while parallelizing tests.
We might have to revisit this and consider plan B.

@AkihiroSuda AkihiroSuda reopened this Jun 16, 2024