Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keep directories when SIGINT sent to daemon #881

Merged
merged 1 commit into from
Oct 30, 2023

Conversation

sondavidb
Copy link
Contributor

Issue #, if available:
#275, #832

Description of changes:
On SIGINT, the daemon will perform some cleanup functions, including removing the directories for each of the snapshot layers. However, the metadata does not reflect this change. This meant that, upon restarting the snapshot, the snapshot would not start unless allow_invalid_mounts_on_restart=true was set in the config. Even with the variable set, one had to manually remove the snapshot to get the daemon in a working state.

This was pretty frustrating behavior. What should be a graceful kill ends up being more work than using a non-graceful kill (i.e. SIGKILL/SIGTERM, which allow the snapshots to instantly start working again, no questions asked). Additionally, it was unintuitive (to me) that killing the daemon also wipes all the snapshotter data, forcing one to re-pull an image every time the daemon is killed.

This change removes the cleanup step of wiping the snapshot directories. It will still unmount the FUSE mounts, but the snapshotter already tries to restore snapshots on startup, which just unmounts any mounts leftover and remounts existing mounts in the metadata, so it ties in very well to snapshot startup

Testing performed:
I performed some basic testing, largely including pulling an image + running a container with it after a SIGINT and confirmed normal behavior.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@sondavidb sondavidb requested a review from a team as a code owner October 20, 2023 23:11
Copy link
Contributor

@austinvazquez austinvazquez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have some of these manual tests as automated tests with this change?

snapshot/snapshot.go Outdated Show resolved Hide resolved
@sondavidb sondavidb force-pushed the keep-directories-on-daemon-kill branch 2 times, most recently from 2898ab4 to 7182dc3 Compare October 24, 2023 00:29
@sondavidb
Copy link
Contributor Author

Should we have some of these manual tests as automated tests with this change?

I tried, but I don't actually know how to kill the snapshotter with a SIGINT. I see that we have KillMatchingProcess, which is called by rebootContainerd, but this sends a SIGKILL, not a SIGINT. Even if I could pass it in, rebootContainerd also wipes the content stores so I don't believe calling that is a viable option regardless.

I tried to manually kill but killall isn't a binary in the container created. I can technically fudge this by banking on the grpc being a specific pid every time the container starts, but from past experiences relying on an arbitrary value far out of the scope of what I'm trying to achieve is...not ideal, to say the least.

@austinvazquez
Copy link
Contributor

Should we have some of these manual tests as automated tests with this change?

I tried, but I don't actually know how to kill the snapshotter with a SIGINT. I see that we have KillMatchingProcess, which is called by rebootContainerd, but this sends a SIGKILL, not a SIGINT. Even if I could pass it in, rebootContainerd also wipes the content stores so I don't believe calling that is a viable option regardless.

I tried to manually kill but killall isn't a binary in the container created. I can technically fudge this by banking on the grpc being a specific pid every time the container starts, but from past experiences relying on an arbitrary value far out of the scope of what I'm trying to achieve is...not ideal, to say the least.

Give https://man7.org/linux/man-pages/man1/kill.1.html a try and if still stuck lmk and we can look together.

@sondavidb sondavidb force-pushed the keep-directories-on-daemon-kill branch from 7182dc3 to 5f36c19 Compare October 24, 2023 18:26
@sondavidb
Copy link
Contributor Author

sondavidb commented Oct 24, 2023

I ended up using pkill instead of kill since for some reason kill doesn't work with a name even though the man page seems to imply that.

I added the test and ensured that it crashed with the previous version and passed with this one.

EDIT: I was wrong, but now I should be right (I hope!) I tried to do a goroutine at first but they weren't guaranteed to terminate with its encapsulating testing function. Hopefully timeout works fine for this.

Just also realized I will need to update the documentation to specify this new behavior. Should that be a separate PR? I think the docs (aka just debug.md) would only reflect slightly extraneous steps after this PR, maybe it's ok to keep as is?

@sondavidb sondavidb force-pushed the keep-directories-on-daemon-kill branch 3 times, most recently from b8f3729 to 7529c33 Compare October 25, 2023 00:53
integration/run_test.go Outdated Show resolved Hide resolved
@sondavidb sondavidb force-pushed the keep-directories-on-daemon-kill branch from 7529c33 to 9fe318b Compare October 25, 2023 16:56
Kern--
Kern-- previously approved these changes Oct 25, 2023
@sondavidb sondavidb force-pushed the keep-directories-on-daemon-kill branch 2 times, most recently from 13f39aa to a81d56f Compare October 26, 2023 16:33
turan18
turan18 previously approved these changes Oct 27, 2023
Kern--
Kern-- previously approved these changes Oct 27, 2023
snapshot/snapshot.go Show resolved Hide resolved
snapshot/snapshot.go Outdated Show resolved Hide resolved
Signed-off-by: David Son <davbson@amazon.com>
@sondavidb sondavidb dismissed stale reviews from Kern-- and turan18 via 1e7b33f October 30, 2023 17:31
@sondavidb sondavidb force-pushed the keep-directories-on-daemon-kill branch from a81d56f to 1e7b33f Compare October 30, 2023 17:31
@sondavidb sondavidb merged commit 5af486c into awslabs:main Oct 30, 2023
4 checks passed
@sondavidb sondavidb deleted the keep-directories-on-daemon-kill branch October 30, 2023 19:37
pendo324 added a commit to runfinch/finch that referenced this pull request Jan 29, 2024
…#767)

Issue #, if available: re-verified #412
- Through extensive e2e test debugging, I noticed that soci and stargz
snapshotters weren't persisting data as expected. After debugging, I
found some context in these two PRs:
  - awslabs/soci-snapshotter#881
  - containerd/stargz-snapshotter#1526
Unfortunately, neither of them are deployed yet, so I've implemented a
hacky workaround for now. After this change, an image/container can be
pull/run, the VM can be restarted, and then the container can be
re-started again.

*Description of changes:*
- Redo how BuildKit/Stargz/SOCI are related to containerd using
[systemd's `PartOf`

](https://www.freedesktop.org/software/systemd/man/latest/systemd.unit.html#PartOf=)
- this ensures that all of these services are restarted when containerd
is restarted, which the lack of has caused errors in the past
- Create some missing directories that might throw errors in cloud-init
- Ensure that `SIGTERM` is used to kill the snapshotter services for now

*Testing done:*
- manual testing



- [x] I've reviewed the guidance in CONTRIBUTING.md


#### License Acceptance

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Signed-off-by: Justin Alvarez <alvajus@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants