Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delete intermediary snapshots created by tests, to avoid out-of-disk-space errors during artifact generation #4641

Merged
merged 7 commits into from
Jun 14, 2024

Conversation

roypat
Copy link
Contributor

@roypat roypat commented Jun 11, 2024

Changes

Delete intermediary snapshot while when tests create snapshots in loops.

Reason

Tests such as test_snapshot_ab.py or test_5_snapshots create snapshot files in a loop. This interacts badly with some debugging infrastructure added in #4590: If a test fails, then all files from all microVM chroots from that test are copied to test_results, and then uploaded as run artifacts. This means that if one of the 30 microVM restorations in test_snapshot_ab.py fails, we copy all, potentially 30, snapshot files to test_results. If these are large (which they can be: the test goes up to 12GB microVMs), this can cause the buildkite agent to run out of disk space.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • If a specific issue led to this PR, this PR closes the issue.
  • The description of changes is clear and encompassing.
  • Any required documentation changes (code and docs) are included in this
    PR.
  • API changes follow the Runbook for Firecracker API changes.
  • User-facing changes are mentioned in CHANGELOG.md.
  • All added/changed functionality is tested.
  • New TODOs link to an issue.
  • Commits meet
    contribution quality standards.

  • This functionality cannot be added in rust-vmm.

Copy link

codecov bot commented Jun 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.08%. Comparing base (e641bfb) to head (ee1ad8e).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4641   +/-   ##
=======================================
  Coverage   82.08%   82.08%           
=======================================
  Files         255      255           
  Lines       31257    31257           
=======================================
  Hits        25656    25656           
  Misses       5601     5601           
Flag Coverage Δ
4.14-c5n.metal 79.57% <ø> (ø)
4.14-c7g.metal ?
4.14-m5n.metal 79.56% <ø> (ø)
4.14-m6a.metal 78.78% <ø> (ø)
4.14-m6g.metal 76.60% <ø> (ø)
4.14-m6i.metal 79.55% <ø> (+<0.01%) ⬆️
4.14-m7g.metal 76.60% <ø> (ø)
5.10-c5n.metal 82.09% <ø> (ø)
5.10-c7g.metal ?
5.10-m5n.metal 82.07% <ø> (ø)
5.10-m6a.metal 81.38% <ø> (+<0.01%) ⬆️
5.10-m6g.metal 79.37% <ø> (ø)
5.10-m6i.metal 82.07% <ø> (ø)
5.10-m7g.metal 79.37% <ø> (ø)
6.1-c5n.metal 82.09% <ø> (+<0.01%) ⬆️
6.1-c7g.metal ?
6.1-m5n.metal 82.07% <ø> (-0.01%) ⬇️
6.1-m6a.metal 81.38% <ø> (+<0.01%) ⬆️
6.1-m6g.metal 79.37% <ø> (ø)
6.1-m6i.metal 82.06% <ø> (ø)
6.1-m7g.metal 79.37% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Make `Microvm.restore_from_snapshot` return a `Snapshot` object. The
return value describes the snapshot files inside the microvm's jail,
which are potentially a copy of the given `Snapshot`.

This will be used so that tests that restore snapshots in a loop will
be able to clean them up again easily.

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
Reuse the existing function

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
The `test_restore_latency` test resumes 30 microVMs in a loop. Each
iteration copies the snapshot that is to be restored into the respective
microvm's jail, where they are not cleaned up until the end of the test,
e.g. after all 30 iterations have completed. This can cause a problem if
any of these 30 iterations fail. In this case, we copy all files in the
chroot of each of the 30 microvms into `test_results` (even the ones
that completed successfully). This can result in us running out of disk
space if it happens with large snapshots.

To avoid this, delete the copies of the snapshots at the end of their
respective iteration.

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
pb8o
pb8o previously approved these changes Jun 11, 2024
@roypat roypat marked this pull request as draft June 11, 2024 13:44
@roypat roypat force-pushed the delete-snapshots-ab branch 3 times, most recently from cd436be to 0f9ad85 Compare June 11, 2024 16:11
The test creates, and resumes from 5 consecutively taken snapshots. This
means it involves 6 microVMs: The seed microVM that is originally
booted, and then 5 restored microVMs. The SIGSTOP/SIGCONT logic was
supposed to ensure that the in-guest state of the vsock `socat` process
would be consistent across snapshot-restore boundaries: We "stop" the
process before taking a snapshot, and "continue" it after restoring.
However, there are two problems with our implementation:
- No `SIGSTOP` was sent before taking the first snapshot, and
- The SIGSTOP/SIGCONT signals related to later snapshots were sent to
  the wrong microVM (they were sent inside the original, booted microVM,
  but never inside the restored microVMs, since the `vm` variable from
  outside the `for` loop was reused, instead of the `microvm` variable
  from inside the loop).

Together, these mean that we never actually stopped/continued any
`socat` servers. Thus drop the whole logic related to it, since it is
nonsense. Instead, replace with with a `time.sleep(2)`, as seemingly the
ssh connection to the wrong microVM was inducing exactly the right delay
(measured at slightly above 1s, which I rounded up to 2) to reduce
connection related intermittent failures (which is why we thought the
SIGSTOP/SIGCONT pair was fixing the intermittent failures).

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
Kill microvms after we are doing with them.

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
Similar to test_snapshot_ab.py, delete copies of snapshots created in
the loop. Unlike test_snapshot_ab we here do keep one copy of each
snapshot around, since the snapshots are consecutive (e.g. snapshot 3 is
taken from a microVM restored from snapshot 2). Having the full chain
available might be useful for debugging. We can safely delete any
intermediary copies of the snapshot files, however.

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
The test operates similarly to test_5_snapshots, so delete the
superfluous copies.

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
@roypat roypat marked this pull request as ready for review June 12, 2024 08:00
@roypat roypat changed the title Delete intermediary snapshots created by tests, to avoid out-of-disk-space errors during artifact generatoin Delete intermediary snapshots created by tests, to avoid out-of-disk-space errors during artifact generation Jun 12, 2024
@roypat roypat requested a review from pb8o June 12, 2024 08:25
@roypat roypat added the Status: Awaiting review Indicates that a pull request is ready to be reviewed label Jun 12, 2024
@roypat roypat requested review from bchalios June 12, 2024 09:07
@roypat roypat requested a review from zulinx86 June 13, 2024 08:54
@roypat roypat merged commit 8903d3e into firecracker-microvm:main Jun 14, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Awaiting review Indicates that a pull request is ready to be reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants