Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix spurious failure in api_server::tests::test_serve_vmm_action_request #3607

Merged
merged 1 commit into from
Apr 13, 2023

Conversation

roypat
Copy link
Contributor

@roypat roypat commented Apr 13, 2023

The test calls the serve_vmm_action_request function and tries to check whether it correctly updates specific metrices related to measuring how long a request took. However, since the VMM side is mocked out in this test (the receiver end of the channel through which the API server communicates is simply pre-filled with VmmData::Empty, and not passed to any implementation that handles the requests), the serve_vmm_action_request function only goes through the following steps:

  1. Put something into a channel
  2. Read a pre-submitted value from another channel
  3. Update the metrics

This can happen very fast (below 1us, which is the resolution of our metrics), in which cases the unittests fails at the assertion that tries to check whether the metric of "how long did handling this request take" was correctly set (as this assertion only checks that the metric is not 0).

We fix this by subtracting 1 from the start time inside the test, so make sure that the difference "now - start time" which serve_vmm_action_request uses to compute the metric value is always at least 1.

Prior to this fix, empirical testing has shown a failure rate of around 1 in 3 on m6i.metal instances, with this patch that has gone done to 0 in 100.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following
Developer Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • If a specific issue led to this PR, this PR closes the issue.
  • The description of changes is clear and encompassing.
  • Any required documentation changes (code and docs) are included in this PR.
  • API changes follow the Runbook for Firecracker API changes.
  • User-facing changes are mentioned in CHANGELOG.md.
  • All added/changed functionality is tested.
  • New TODOs link to an issue.
  • Commits meet contribution quality standards.

  • This functionality cannot be added in rust-vmm.

@roypat roypat added the Status: Awaiting review Indicates that a pull request is ready to be reviewed label Apr 13, 2023
src/api_server/src/lib.rs Outdated Show resolved Hide resolved
@andreitraistaru
Copy link
Contributor

Nice investigation! Looks good to me. Just for my understanding: do we have a reason for why this error seems to be more often reproducible in the last period of time? Have you discovered anything that could increase the reproducing frequency.

@roypat
Copy link
Contributor Author

roypat commented Apr 13, 2023

Nice investigation! Looks good to me. Just for my understanding: do we have a reason for why this error seems to be more often reproducible in the last period of time? Have you discovered anything that could increase the reproducing frequency.

Ah, yes I do! I thought I added it to the commit description, apparently I didnt, thanks for pointing that out!

I think its due to the rust 1.67.0 update, which changes the implementation for mpsc channels to be based on crossbeam, which I suspect is more performant than the old one.

The test calls the `serve_vmm_action_request` function and tries to
check whether it correctly updates specific metrices related to
measuring how long a request took. However, since the VMM side is mocked
out in this test (the receiver end of the channel through which the
API server communicates is simply pre-filled with VmmData::Empty, and
not passed to any implementation that handles the requests), the
`serve_vmm_action_request` function only goes through the following
steps:

1. Put something into a channel
2. Read a pre-submitted value from another channel
3. Update the metrics

This can happen very fast (below 1us, which is the resolution of our
metrics), in which cases the unittests fails at the assertion that tries
to check whether the metric of "how long did handling this request take"
was correctly set (as this assertion only checks that the metric is not
0).

I suspect that the reason we have been encountering this more often
recently is due to the Rust 1.67.0 update (firecracker-microvm#3576), which replaced the
old stdlib implementation of mpsc channels with that from the crossbeam
crate. Since channel operations are most of what the test does, this
could have a performance impact if the new rust version has a more
performant implementation (which the changelog implies [1]).

We fix this by subtracting 1 from the start time inside the test, so
make sure that the difference "now - start time" which
serve_vmm_action_request uses to compute the metric value is always at
least 1.

[1]: https://blog.rust-lang.org/2023/01/26/Rust-1.67.0.html

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
@zulinx86 zulinx86 merged commit 5adec36 into firecracker-microvm:main Apr 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Awaiting review Indicates that a pull request is ready to be reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants