
[BUG] killing soci-snapshotter-grpc while a container is running requires manual cleanup #275

Closed
sparr opened this issue Jan 3, 2023 · 3 comments
Labels
bug Something isn't working

Comments

sparr (Contributor) commented Jan 3, 2023

Describe the bug
When the soci-snapshotter-grpc process is killed while a container is running, some of its mounts and metadata are left in a broken state that requires manual cleanup.

Steps To Reproduce

  1. soci-snapshotter-grpc
  2. ctr run --snapshotter soci [...]
  3. killall soci-snapshotter-grpc
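
Expanded as a rough shell sketch (the image reference, container ID, and use of sudo are illustrative assumptions; the reporter's exact invocation is not recorded here):

```shell
# Hypothetical expansion of the three steps above. $IMAGE is a placeholder for
# an image reference that already has a SOCI index available.

# 1. Start the snapshotter (often run as a systemd unit in practice).
sudo soci-snapshotter-grpc &

# 2. Run a container through the soci snapshotter.
sudo ctr run --rm --snapshotter soci "$IMAGE" soci-test

# 3. In another shell, kill the snapshotter while the container is still running.
sudo killall soci-snapshotter-grpc

# The FUSE mounts backing the container rootfs are now orphaned; listing them
# shows the leftover state that later requires manual cleanup.
mount | grep -i fuse
```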

Expected behavior
All state (running processes, files, container and image metadata in the data store and registry, etc.) is left in a condition that allows the snapshotter and the container to be started again without manual intervention.

Additional context
The problem statement is non-specific because I did not take sufficient notes when I encountered this while troubleshooting a more pressing issue. Reproducing and investigating the exact failure will be a necessary first step in resolving it.

sparr added the bug label on Jan 3, 2023
rdpsin (Contributor) commented Jan 3, 2023

The problem is that the FUSE mounts' lifetimes are tied to the snapshotter process. If the snapshotter crashes, the FUSE mounts die with it, and we currently have no way to re-mount them. A couple of options:

  1. Separate out the FUSE implementation from the snapshotter, so that the mounts still exist even if the snapshotter crashes.

  2. Persist some kind of state on disk that will allow the snapshotter to reconstruct the FUSE mount whenever it comes back online.
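
Until one of those options exists, recovery after a crash is manual. A minimal cleanup sketch, assuming a default install layout (the snapshotter root path and the FUSE fstype shown are assumptions, not details taken from this issue):

```shell
# Hypothetical manual cleanup of mounts orphaned by the crashed snapshotter.

# Find FUSE mounts left behind; the fstype string is an assumption, so a
# broader filter on the snapshotter root also works.
findmnt -t fuse.rawBridge
mount | grep soci

# Lazily unmount each stale mount point. A plain umount often fails with
# "Transport endpoint is not connected" because the FUSE server is gone.
# The path below assumes the default root and is illustrative only.
sudo umount -l /var/lib/soci-snapshotter-grpc/snapshotter/snapshots/<id>/fs
```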

Kern-- (Contributor) commented Jan 3, 2023

Is this the same problem? #93

Kern-- (Contributor) commented Jul 20, 2023

I think the "solution" to this is the config option to ignore the broken data: https://github.com/awslabs/soci-snapshotter/blob/main/config/service.go#L79

Maybe we could also have the SOCI snapshotter call containerd to remove the broken snapshots? That seems a bit weird, though.
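
For context, the manual equivalent of that today is to remove the broken snapshots through ctr; a rough sketch, assuming the default containerd namespace (the snapshot key is a placeholder):

```shell
# Hypothetical manual removal of broken soci snapshots via containerd's ctr.

# List snapshots held by the soci snapshotter.
sudo ctr --namespace default snapshots --snapshotter soci list

# Remove a broken snapshot by key so future containers can start cleanly.
sudo ctr --namespace default snapshots --snapshotter soci remove <snapshot-key>
```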
