fix(snapshot/x86_64): make sure TSC_DEADLINE MSR is non-zero #4630

Merged

Conversation

kalyazin

Changes

Backport of #4618.

This change introduces a workaround. If, when taking a snapshot, we see a zero MSR_IA32_TSC_DEADLINE, we replace its value with the MSR_IA32_TSC value from the same vCPU to make sure the vCPU will continue to receive TSC interrupts.
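The workaround described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `MsrEntry` and `fix_zero_tsc_deadline` are hypothetical names standing in for whatever the snapshot code uses to hold saved MSR state. Only the two MSR indices (0x10 for IA32_TSC, 0x6e0 for IA32_TSC_DEADLINE) are architectural constants.

```rust
/// Architectural MSR indices.
const MSR_IA32_TSC: u32 = 0x10;
const MSR_IA32_TSC_DEADLINE: u32 = 0x6e0;

/// Hypothetical stand-in for one saved MSR (index plus value).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct MsrEntry {
    index: u32,
    data: u64,
}

/// Assumed helper: if the saved TSC_DEADLINE of a vCPU is zero, overwrite it
/// with the TSC value saved for the same vCPU. Returns true if a value was
/// replaced.
fn fix_zero_tsc_deadline(msrs: &mut [MsrEntry]) -> bool {
    // Find the TSC value saved for this vCPU; without it nothing can be done.
    let tsc = match msrs.iter().find(|m| m.index == MSR_IA32_TSC) {
        Some(entry) => entry.data,
        None => return false,
    };

    let mut patched = false;
    for m in msrs.iter_mut() {
        // Only a zero deadline is touched; an armed timer is left as-is.
        if m.index == MSR_IA32_TSC_DEADLINE && m.data == 0 {
            m.data = tsc;
            patched = true;
        }
    }
    patched
}
```

In this sketch the helper would be called once per vCPU on the MSR list collected while taking the snapshot, before the state is serialized. A non-zero deadline is never modified, so vCPUs with a normally armed timer are unaffected.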

Reason

On x86_64, we observed that when restoring from a snapshot, one of the vCPUs had MSR_IA32_TSC_DEADLINE cleared and never received TSC interrupts until the MSR was updated externally (e.g. by setting the system time).

We believe this happens because the TSC interrupt is lost during the snapshot-taking process: the MSR is cleared, but the interrupt is not delivered to the guest, so the guest does not rearm the timer.

A visible effect is a failure to connect to a restored VM via SSH, similar to https://buildkite.com/firecracker/firecracker-pr-nightly/builds/1403#018f83db-5395-4656-8d9c-83b6fcfcfd54/50-1994 .

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • [ ] If a specific issue led to this PR, this PR closes the issue.
  • The description of changes is clear and encompassing.
  • Any required documentation changes (code and docs) are included in this
    PR.
  • [ ] API changes follow the Runbook for Firecracker API changes.
  • User-facing changes are mentioned in CHANGELOG.md.
  • All added/changed functionality is tested.
  • [ ] New TODOs link to an issue.
  • Commits meet
    contribution quality standards.

  • This functionality cannot be added in rust-vmm.

On x86_64, we observed that when restoring from a snapshot,
one of the vCPUs had MSR_IA32_TSC_DEADLINE cleared and never
received TSC interrupts until the MSR was updated externally
(e.g. by setting the system time).

We believe this happens because the TSC interrupt is lost
during the snapshot-taking process: the MSR is cleared, but the
interrupt is not delivered to the guest, so the guest
does not rearm the timer.

A visible effect is a failure to connect to a restored VM
via SSH.

This commit introduces a workaround. If, when taking a snapshot,
we see a zero MSR_IA32_TSC_DEADLINE, we replace its value with
the MSR_IA32_TSC value from the same vCPU to make sure that
the vCPU will continue to receive TSC interrupts.

(cherry picked from commit 94b37cb)
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
The TSC_DEADLINE MSR value is volatile, as it is updated
by the guest kernel based on the current TSC value.

(cherry picked from commit 4402c82)
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
The TSC_DEADLINE MSR value is volatile, as it is updated
by the guest kernel based on the current TSC value.

(cherry picked from commit cee34ab)
Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>

codecov bot commented May 31, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.16%. Comparing base (5c17bc6) to head (506f1b8).

Additional details and impacted files
@@                 Coverage Diff                  @@
##           firecracker-v1.8    #4630      +/-   ##
====================================================
+ Coverage             82.14%   82.16%   +0.01%     
====================================================
  Files                   255      255              
  Lines                 31285    31307      +22     
====================================================
+ Hits                  25700    25722      +22     
  Misses                 5585     5585              
Flag Coverage Δ
4.14-c5n.metal 79.66% <100.00%> (+0.01%) ⬆️
4.14-c7g.metal ?
4.14-m5n.metal 79.64% <100.00%> (+0.01%) ⬆️
4.14-m6a.metal 78.88% <100.00%> (+0.01%) ⬆️
4.14-m6g.metal 76.70% <ø> (ø)
4.14-m6i.metal 79.64% <100.00%> (+0.01%) ⬆️
4.14-m7g.metal 76.70% <ø> (ø)
5.10-c5n.metal 82.17% <100.00%> (+0.01%) ⬆️
5.10-c7g.metal ?
5.10-m5n.metal 82.15% <100.00%> (+0.01%) ⬆️
5.10-m6a.metal 81.47% <100.00%> (+0.01%) ⬆️
5.10-m6g.metal 79.47% <ø> (ø)
5.10-m6i.metal 82.16% <100.00%> (+0.01%) ⬆️
5.10-m7g.metal 79.47% <ø> (ø)
6.1-c5n.metal 82.17% <100.00%> (+0.01%) ⬆️
6.1-c7g.metal ?
6.1-m5n.metal 82.16% <100.00%> (+0.02%) ⬆️
6.1-m6a.metal 81.47% <100.00%> (+<0.01%) ⬆️
6.1-m6g.metal 79.46% <ø> (-0.01%) ⬇️
6.1-m6i.metal 82.16% <100.00%> (+0.02%) ⬆️
6.1-m7g.metal 79.47% <ø> (ø)


@kalyazin kalyazin changed the title [WIP] [1.8] fix(snapshot/x86_64): make sure TSC_DEADLINE MSR is non-zero [1.8] fix(snapshot/x86_64): make sure TSC_DEADLINE MSR is non-zero May 31, 2024
@kalyazin kalyazin self-assigned this May 31, 2024
@kalyazin kalyazin marked this pull request as ready for review May 31, 2024 15:15
@kalyazin kalyazin added the Status: Awaiting review Indicates that a pull request is ready to be reviewed label May 31, 2024
@kalyazin kalyazin changed the title [1.8] fix(snapshot/x86_64): make sure TSC_DEADLINE MSR is non-zero [1.8, do not merge yet] fix(snapshot/x86_64): make sure TSC_DEADLINE MSR is non-zero Jun 3, 2024
@roypat roypat changed the title [1.8, do not merge yet] fix(snapshot/x86_64): make sure TSC_DEADLINE MSR is non-zero fix(snapshot/x86_64): make sure TSC_DEADLINE MSR is non-zero Jun 11, 2024
@roypat

roypat commented Jun 11, 2024

We have observed no more failures of this type on main since merging #4618, so we are confident in proceeding with this PR, too.


@xmarcalx xmarcalx left a comment


LGTM, thanks for the change.

@roypat roypat merged commit 54c2210 into firecracker-microvm:firecracker-v1.8 Jun 12, 2024
7 checks passed
Labels
Status: Awaiting review Indicates that a pull request is ready to be reviewed

5 participants