Delete intermediary snapshots created by tests, to avoid out-of-disk-space errors during artifact generation #4641
Merged
Conversation
Codecov Report: all modified and coverable lines are covered by tests ✅

```
@@ Coverage Diff @@
##             main    #4641   +/-  ##
=======================================
  Coverage   82.08%   82.08%
=======================================
  Files         255      255
  Lines       31257    31257
=======================================
  Hits        25656    25656
  Misses       5601     5601
=======================================
```

Flags with carried forward coverage won't be shown.
Make `Microvm.restore_from_snapshot` return a `Snapshot` object. The return value describes the snapshot files inside the microvm's jail, which are potentially a copy of the given `Snapshot`. This will be used so that tests that restore snapshots in a loop will be able to clean them up again easily. Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
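The shape of this change can be sketched as follows. This is a minimal, hypothetical illustration of the pattern the commit message describes, not the actual Firecracker test-framework code; the field names (`mem`, `vmstate`), the `delete` helper, and the `Microvm` constructor are assumptions made for the sketch.

```python
import shutil
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Snapshot:
    """Describes the files that make up a snapshot (names are illustrative)."""

    mem: Path
    vmstate: Path

    def delete(self):
        """Remove the snapshot files from disk."""
        self.mem.unlink(missing_ok=True)
        self.vmstate.unlink(missing_ok=True)


class Microvm:
    def __init__(self, jail_dir: Path):
        self.jail_dir = jail_dir

    def restore_from_snapshot(self, snapshot: Snapshot) -> Snapshot:
        # Copy the given snapshot's files into this microvm's jail ...
        jailed = Snapshot(
            mem=self.jail_dir / snapshot.mem.name,
            vmstate=self.jail_dir / snapshot.vmstate.name,
        )
        shutil.copy(snapshot.mem, jailed.mem)
        shutil.copy(snapshot.vmstate, jailed.vmstate)
        # ... and return a Snapshot describing the in-jail copies, so the
        # caller can clean them up once the restored microvm is done.
        return jailed
```

With a return value describing the in-jail copies, a test that restores in a loop can call something like `jailed.delete()` per iteration without touching the original snapshot files.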
Force-pushed from 0f2a289 to 60d1f6c
Reuse the existing function Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
The `test_restore_latency` test resumes 30 microVMs in a loop. Each iteration copies the snapshot that is to be restored into the respective microvm's jail, where the copies are not cleaned up until the end of the test, i.e. after all 30 iterations have completed. This is a problem if any of these 30 iterations fails: in that case, we copy all files in the chroot of each of the 30 microvms into `test_results` (even for the ones that completed successfully). If the snapshots are large, this can make us run out of disk space. To avoid this, delete the copies of the snapshots at the end of their respective iterations.

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
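The per-iteration cleanup can be pictured with a small, self-contained sketch. The function name, the jail layout, and the loop body are illustrative stand-ins for the actual test, which restores a real microVM where this sketch only copies a file:

```python
import shutil
import tempfile
from pathlib import Path


def restore_latency_loop(snapshot_file: Path, iterations: int = 30) -> list[Path]:
    """Simulate the restore loop; return the per-microvm jail directories."""
    jails = []
    for i in range(iterations):
        # Each iteration gets its own "jail" directory, mirroring one microvm.
        jail = Path(tempfile.mkdtemp(prefix=f"microvm-{i}-"))
        jails.append(jail)
        copy = jail / snapshot_file.name
        shutil.copy(snapshot_file, copy)  # restoring copies the snapshot in
        # ... restore the microvm and measure its latency here ...
        copy.unlink()  # the fix: delete the copy at the end of the iteration
    return jails
```

After the loop, every jail is empty, so a failure in iteration N no longer drags 30 accumulated snapshot copies into `test_results`.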
Force-pushed from 60d1f6c to a03811d
pb8o reviewed Jun 11, 2024
pb8o previously approved these changes Jun 11, 2024
Force-pushed from cd436be to 0f9ad85
The test creates, and resumes from, 5 consecutively taken snapshots. This means it involves 6 microVMs: the seed microVM that is originally booted, and then 5 restored microVMs.

The SIGSTOP/SIGCONT logic was supposed to ensure that the in-guest state of the vsock `socat` process would be consistent across snapshot-restore boundaries: we "stop" the process before taking a snapshot, and "continue" it after restoring. However, there are two problems with our implementation:

- no `SIGSTOP` was sent before taking the first snapshot, and
- the SIGSTOP/SIGCONT signals related to later snapshots were sent to the wrong microVM (they were sent inside the original, booted microVM, but never inside the restored microVMs, since the `vm` variable from outside the `for` loop was reused instead of the `microvm` variable from inside the loop).

Together, these mean that we never actually stopped/continued any `socat` servers, so drop the whole logic, since it is nonsense. Instead, replace it with a `time.sleep(2)`: seemingly, the ssh connection to the wrong microVM was inducing exactly the right delay (measured at slightly above 1s, which I rounded up to 2) to reduce connection-related intermittent failures, which is why we thought the SIGSTOP/SIGCONT pair was fixing those failures.

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
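The loop-variable bug described in that commit message can be demonstrated with a tiny self-contained sketch. Everything here is a hypothetical stand-in (the `ssh_run` helper, the string labels, the command text); it only models where the signals ended up, not real microVMs:

```python
# Record which "guest" each command was executed in.
signals_received = {"vm": [], "microvm-1": [], "microvm-2": []}


def ssh_run(target: str, cmd: str):
    """Stand-in for running a command over ssh inside a guest."""
    signals_received[target].append(cmd)


vm = "vm"  # the originally booted microVM, created outside the loop
for i in (1, 2):
    microvm = f"microvm-{i}"  # the freshly restored microVM
    # Buggy: reuses `vm` from outside the loop, so the signal lands in the
    # original booted guest ...
    ssh_run(vm, "kill -CONT $(pidof socat)")
    # ... when it should have targeted the restored guest:
    # ssh_run(microvm, "kill -CONT $(pidof socat)")
```

Running this shows both SIGCONTs arriving in `vm` while the restored guests receive nothing, which is exactly why the `socat` servers were never actually stopped or continued.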
Kill microvms after we are done with them.

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
Similar to test_snapshot_ab.py, delete copies of snapshots created in the loop. Unlike test_snapshot_ab, we here do keep one copy of each snapshot around, since the snapshots are consecutive (e.g. snapshot 3 is taken from a microVM restored from snapshot 2), and having the full chain available might be useful for debugging. We can safely delete any intermediary copies of the snapshot files, however.

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
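The chain-of-snapshots cleanup can be sketched as follows. This is an assumed, file-level simplification (the `snapshot_chain` function and its directory layout are invented for the sketch): one canonical copy of each snapshot in the chain is kept, while the intermediary copy placed in each restored microvm's jail is deleted.

```python
import shutil
import tempfile
from pathlib import Path


def snapshot_chain(n: int) -> list[Path]:
    """Simulate taking n consecutive snapshots; return the kept chain."""
    base = Path(tempfile.mkdtemp())
    current = base / "snap-0"
    current.write_bytes(b"seed")  # snapshot taken from the booted seed VM
    chain = [current]
    for i in range(1, n + 1):
        jail = base / f"jail-{i}"
        jail.mkdir()
        jailed = jail / current.name
        shutil.copy(current, jailed)  # restoring copies snap i-1 into jail i
        nxt = base / f"snap-{i}"      # take snapshot i from the restored VM
        nxt.write_bytes(b"snap" + bytes([i]))
        chain.append(nxt)             # keep one canonical copy for debugging
        jailed.unlink()               # delete the intermediary in-jail copy
        current = nxt
    return chain
```

After the loop, `snap-0` through `snap-n` all still exist (the debuggable chain), but every jail directory is empty.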
The test operates similarly to test_5_snapshots, so delete the superfluous copies. Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
Force-pushed from 0f9ad85 to ee1ad8e
pb8o approved these changes Jun 13, 2024
kalyazin approved these changes Jun 13, 2024
Changes
Delete intermediary snapshots when tests create snapshots in loops.
Reason
Tests such as `test_snapshot_ab.py` or `test_5_snapshots` create snapshot files in a loop. This interacts badly with some debugging infrastructure added in #4590: if a test fails, then all files from all microVM chroots from that test are copied to `test_results` and then uploaded as run artifacts. This means that if one of the 30 microVM restorations in `test_snapshot_ab.py` fails, we copy all (potentially 30) snapshot files to `test_results`. If these are large (which they can be: the test goes up to 12GB microVMs), this can cause the buildkite agent to run out of disk space.
License Acceptance
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check CONTRIBUTING.md.
PR Checklist
- … referenced in the PR.
- … CHANGELOG.md.
- … TODOs link to an issue.
- … contribution quality standards.
- … rust-vmm.