Fix spurious failure in api_server::tests::test_serve_vmm_action_request #3607

roypat · 2023-04-13T08:25:33Z

The test calls the serve_vmm_action_request function and tries to check whether it correctly updates specific metrices related to measuring how long a request took. However, since the VMM side is mocked out in this test (the receiver end of the channel through which the API server communicates is simply pre-filled with VmmData::Empty, and not passed to any implementation that handles the requests), the serve_vmm_action_request function only goes through the following steps:

Put something into a channel
Read a pre-submitted value from another channel
Update the metrics

This can happen very fast (below 1us, which is the resolution of our metrics), in which cases the unittests fails at the assertion that tries to check whether the metric of "how long did handling this request take" was correctly set (as this assertion only checks that the metric is not 0).

We fix this by subtracting 1 from the start time inside the test, so make sure that the difference "now - start time" which serve_vmm_action_request uses to compute the metric value is always at least 1.

Prior to this fix, empirical testing has shown a failure rate of around 1 in 3 on m6i.metal instances, with this patch that has gone done to 0 in 100.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following
Developer Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

If a specific issue led to this PR, this PR closes the issue.
The description of changes is clear and encompassing.
Any required documentation changes (code and docs) are included in this PR.
API changes follow the Runbook for Firecracker API changes.
User-facing changes are mentioned in CHANGELOG.md.
All added/changed functionality is tested.
New TODOs link to an issue.
Commits meet contribution quality standards.

This functionality cannot be added in rust-vmm.

src/api_server/src/lib.rs

andreitraistaru · 2023-04-13T09:19:36Z

Nice investigation! Looks good to me. Just for my understanding: do we have a reason for why this error seems to be more often reproducible in the last period of time? Have you discovered anything that could increase the reproducing frequency.

roypat · 2023-04-13T10:08:39Z

Nice investigation! Looks good to me. Just for my understanding: do we have a reason for why this error seems to be more often reproducible in the last period of time? Have you discovered anything that could increase the reproducing frequency.

Ah, yes I do! I thought I added it to the commit description, apparently I didnt, thanks for pointing that out!

I think its due to the rust 1.67.0 update, which changes the implementation for mpsc channels to be based on crossbeam, which I suspect is more performant than the old one.

The test calls the `serve_vmm_action_request` function and tries to check whether it correctly updates specific metrices related to measuring how long a request took. However, since the VMM side is mocked out in this test (the receiver end of the channel through which the API server communicates is simply pre-filled with VmmData::Empty, and not passed to any implementation that handles the requests), the `serve_vmm_action_request` function only goes through the following steps: 1. Put something into a channel 2. Read a pre-submitted value from another channel 3. Update the metrics This can happen very fast (below 1us, which is the resolution of our metrics), in which cases the unittests fails at the assertion that tries to check whether the metric of "how long did handling this request take" was correctly set (as this assertion only checks that the metric is not 0). I suspect that the reason we have been encountering this more often recently is due to the Rust 1.67.0 update (firecracker-microvm#3576), which replaced the old stdlib implementation of mpsc channels with that from the crossbeam crate. Since channel operations are most of what the test does, this could have a performance impact if the new rust version has a more performant implementation (which the changelog implies [1]). We fix this by subtracting 1 from the start time inside the test, so make sure that the difference "now - start time" which serve_vmm_action_request uses to compute the metric value is always at least 1. [1]: https://blog.rust-lang.org/2023/01/26/Rust-1.67.0.html Signed-off-by: Patrick Roy <roypat@amazon.co.uk>

roypat added the Status: Awaiting review Indicates that a pull request is ready to be reviewed label Apr 13, 2023

zulinx86 reviewed Apr 13, 2023

View reviewed changes

src/api_server/src/lib.rs Outdated Show resolved Hide resolved

roypat force-pushed the fix- branch from 520ccc8 to c5d96e7 Compare April 13, 2023 10:12

zulinx86 approved these changes Apr 13, 2023

View reviewed changes

andreitraistaru approved these changes Apr 13, 2023

View reviewed changes

zulinx86 merged commit 5adec36 into firecracker-microvm:main Apr 13, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix spurious failure in api_server::tests::test_serve_vmm_action_request #3607

Fix spurious failure in api_server::tests::test_serve_vmm_action_request #3607

roypat commented Apr 13, 2023

andreitraistaru commented Apr 13, 2023

roypat commented Apr 13, 2023 •

edited

Loading

Fix spurious failure in api_server::tests::test_serve_vmm_action_request #3607

Fix spurious failure in api_server::tests::test_serve_vmm_action_request #3607

Conversation

roypat commented Apr 13, 2023

License Acceptance

PR Checklist

andreitraistaru commented Apr 13, 2023

roypat commented Apr 13, 2023 • edited Loading

roypat commented Apr 13, 2023 •

edited

Loading