Delete intermediary snapshots created by tests, to avoid out-of-disk-space errors during artifact generation #4641

roypat · 2024-06-11T12:57:42Z

Changes

Delete intermediary snapshot while when tests create snapshots in loops.

Reason

Tests such as test_snapshot_ab.py or test_5_snapshots create snapshot files in a loop. This interacts badly with some debugging infrastructure added in #4590: If a test fails, then all files from all microVM chroots from that test are copied to test_results, and then uploaded as run artifacts. This means that if one of the 30 microVM restorations in test_snapshot_ab.py fails, we copy all, potentially 30, snapshot files to test_results. If these are large (which they can be: the test goes up to 12GB microVMs), this can cause the buildkite agent to run out of disk space.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

If a specific issue led to this PR, this PR closes the issue.
The description of changes is clear and encompassing.
Any required documentation changes (code and docs) are included in this
PR.
API changes follow the Runbook for Firecracker API changes.
User-facing changes are mentioned in CHANGELOG.md.
All added/changed functionality is tested.
New TODOs link to an issue.
Commits meet
contribution quality standards.

This functionality cannot be added in rust-vmm.

codecov · 2024-06-11T13:03:35Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.08%. Comparing base (e641bfb) to head (ee1ad8e).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #4641   +/-   ##
=======================================
  Coverage   82.08%   82.08%           
=======================================
  Files         255      255           
  Lines       31257    31257           
=======================================
  Hits        25656    25656           
  Misses       5601     5601

Flag	Coverage Δ
4.14-c5n.metal	`79.57% <ø> (ø)`
4.14-c7g.metal	`?`
4.14-m5n.metal	`79.56% <ø> (ø)`
4.14-m6a.metal	`78.78% <ø> (ø)`
4.14-m6g.metal	`76.60% <ø> (ø)`
4.14-m6i.metal	`79.55% <ø> (+<0.01%)`	⬆️
4.14-m7g.metal	`76.60% <ø> (ø)`
5.10-c5n.metal	`82.09% <ø> (ø)`
5.10-c7g.metal	`?`
5.10-m5n.metal	`82.07% <ø> (ø)`
5.10-m6a.metal	`81.38% <ø> (+<0.01%)`	⬆️
5.10-m6g.metal	`79.37% <ø> (ø)`
5.10-m6i.metal	`82.07% <ø> (ø)`
5.10-m7g.metal	`79.37% <ø> (ø)`
6.1-c5n.metal	`82.09% <ø> (+<0.01%)`	⬆️
6.1-c7g.metal	`?`
6.1-m5n.metal	`82.07% <ø> (-0.01%)`	⬇️
6.1-m6a.metal	`81.38% <ø> (+<0.01%)`	⬆️
6.1-m6g.metal	`79.37% <ø> (ø)`
6.1-m6i.metal	`82.06% <ø> (ø)`
6.1-m7g.metal	`79.37% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Make `Microvm.restore_from_snapshot` return a `Snapshot` object. The return value describes the snapshot files inside the microvm's jail, which are potentially a copy of the given `Snapshot`. This will be used so that tests that restore snapshots in a loop will be able to clean them up again easily. Signed-off-by: Patrick Roy <roypat@amazon.co.uk>

Reuse the existing function Signed-off-by: Patrick Roy <roypat@amazon.co.uk>

The `test_restore_latency` test resumes 30 microVMs in a loop. Each iteration copies the snapshot that is to be restored into the respective microvm's jail, where they are not cleaned up until the end of the test, e.g. after all 30 iterations have completed. This can cause a problem if any of these 30 iterations fail. In this case, we copy all files in the chroot of each of the 30 microvms into `test_results` (even the ones that completed successfully). This can result in us running out of disk space if it happens with large snapshots. To avoid this, delete the copies of the snapshots at the end of their respective iteration. Signed-off-by: Patrick Roy <roypat@amazon.co.uk>

tests/integration_tests/performance/test_snapshot_ab.py

The test creates, and resumes from 5 consecutively taken snapshots. This means it involves 6 microVMs: The seed microVM that is originally booted, and then 5 restored microVMs. The SIGSTOP/SIGCONT logic was supposed to ensure that the in-guest state of the vsock `socat` process would be consistent across snapshot-restore boundaries: We "stop" the process before taking a snapshot, and "continue" it after restoring. However, there are two problems with our implementation: - No `SIGSTOP` was sent before taking the first snapshot, and - The SIGSTOP/SIGCONT signals related to later snapshots were sent to the wrong microVM (they were sent inside the original, booted microVM, but never inside the restored microVMs, since the `vm` variable from outside the `for` loop was reused, instead of the `microvm` variable from inside the loop). Together, these mean that we never actually stopped/continued any `socat` servers. Thus drop the whole logic related to it, since it is nonsense. Instead, replace with with a `time.sleep(2)`, as seemingly the ssh connection to the wrong microVM was inducing exactly the right delay (measured at slightly above 1s, which I rounded up to 2) to reduce connection related intermittent failures (which is why we thought the SIGSTOP/SIGCONT pair was fixing the intermittent failures). Signed-off-by: Patrick Roy <roypat@amazon.co.uk>

Kill microvms after we are doing with them. Signed-off-by: Patrick Roy <roypat@amazon.co.uk>

Similar to test_snapshot_ab.py, delete copies of snapshots created in the loop. Unlike test_snapshot_ab we here do keep one copy of each snapshot around, since the snapshots are consecutive (e.g. snapshot 3 is taken from a microVM restored from snapshot 2). Having the full chain available might be useful for debugging. We can safely delete any intermediary copies of the snapshot files, however. Signed-off-by: Patrick Roy <roypat@amazon.co.uk>

The test operates similarly to test_5_snapshots, so delete the superfluous copies. Signed-off-by: Patrick Roy <roypat@amazon.co.uk>

roypat force-pushed the delete-snapshots-ab branch from 0f2a289 to 60d1f6c Compare June 11, 2024 13:15

roypat added 2 commits June 11, 2024 14:33

refactor(test): Do not open-code Microvm.chroot in microvm_factory

4cbe6fd

Reuse the existing function Signed-off-by: Patrick Roy <roypat@amazon.co.uk>

roypat force-pushed the delete-snapshots-ab branch from 60d1f6c to a03811d Compare June 11, 2024 13:37

pb8o reviewed Jun 11, 2024

View reviewed changes

tests/integration_tests/performance/test_snapshot_ab.py Show resolved Hide resolved

pb8o previously approved these changes Jun 11, 2024

View reviewed changes

pb8o assigned roypat Jun 11, 2024

roypat dismissed pb8o’s stale review via f907726 June 11, 2024 13:44

roypat marked this pull request as draft June 11, 2024 13:44

roypat force-pushed the delete-snapshots-ab branch 3 times, most recently from cd436be to 0f9ad85 Compare June 11, 2024 16:11

roypat added 4 commits June 12, 2024 08:49

test_5_snapshots: Kill old microvms

107a1cc

Kill microvms after we are doing with them. Signed-off-by: Patrick Roy <roypat@amazon.co.uk>

test_vmgenid: delete intermediary snapshot copies

ee1ad8e

The test operates similarly to test_5_snapshots, so delete the superfluous copies. Signed-off-by: Patrick Roy <roypat@amazon.co.uk>

roypat force-pushed the delete-snapshots-ab branch from 0f9ad85 to ee1ad8e Compare June 12, 2024 07:51

roypat marked this pull request as ready for review June 12, 2024 08:00

roypat changed the title ~~Delete intermediary snapshots created by tests, to avoid out-of-disk-space errors during artifact generatoin~~ Delete intermediary snapshots created by tests, to avoid out-of-disk-space errors during artifact generation Jun 12, 2024

roypat requested a review from pb8o June 12, 2024 08:25

roypat added the Status: Awaiting review Indicates that a pull request is ready to be reviewed label Jun 12, 2024

roypat requested review from bchalios June 12, 2024 09:07

pb8o approved these changes Jun 13, 2024

View reviewed changes

roypat requested a review from zulinx86 June 13, 2024 08:54

kalyazin approved these changes Jun 13, 2024

View reviewed changes

roypat merged commit 8903d3e into firecracker-microvm:main Jun 14, 2024
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Delete intermediary snapshots created by tests, to avoid out-of-disk-space errors during artifact generation #4641

Delete intermediary snapshots created by tests, to avoid out-of-disk-space errors during artifact generation #4641

roypat commented Jun 11, 2024

codecov bot commented Jun 11, 2024 •

edited

Loading

Delete intermediary snapshots created by tests, to avoid out-of-disk-space errors during artifact generation #4641

Delete intermediary snapshots created by tests, to avoid out-of-disk-space errors during artifact generation #4641

Conversation

roypat commented Jun 11, 2024

Changes

Reason

License Acceptance

PR Checklist

codecov bot commented Jun 11, 2024 • edited Loading

Codecov Report

codecov bot commented Jun 11, 2024 •

edited

Loading