
Prevent hanging vbd.Unmount from blocking task progress #6114

Merged · 1 commit into master on Mar 12, 2024

Conversation

bduffany (Member)

Related issues: N/A

bduffany marked this pull request as ready for review on March 11, 2024 at 18:07
maggie-lou (Contributor)

Could this cause issues in the healthy case? Say we try to reuse a firecracker container where the previous VBD hasn't fully unmounted yet (let's say there's some resource contention on the executor, so the unmount is slow but not actually hanging), and we try to mount a new VBD drive?

bduffany (Member Author) commented Mar 11, 2024

@maggie-lou I'm not sure I fully understand the scenario you're thinking of, but by the time we call vbd.Unmount() the backing COWStores should have been closed by Pause(). That is, by the time we call vbd.Unmount() the VM should no longer exist, and the remaining work should just be (1) unmounting the VBDs (which kills the FUSE server, which can no longer access the closed COWStore anyway, and cleans up the FUSE mount) and then (2) removing the workspace dir. That's how it's supposed to work, anyway; I'm going to keep looking at this code to make sure we're cleaning up everything properly.
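
For readers following along, here is a minimal sketch of that teardown order, using made-up names (fakeContainer, fakeVBD, workspaceDir) rather than the real FirecrackerContainer/vbd types:

```go
package sketch

import (
	"context"
	"os"
)

// fakeVBD and fakeContainer are illustrative stand-ins, not the real types.
type fakeVBD struct{}

func (v *fakeVBD) Unmount(ctx context.Context) error { return nil }

type fakeContainer struct {
	rootVBD, scratchVBD *fakeVBD
	workspaceDir        string
}

// cleanup sketches the order described above: Pause() has already stopped the
// VM and closed the backing COWStores, so the remaining work is (1) unmount
// the VBDs, which tears down the FUSE servers and their mount points, and
// (2) remove the workspace directory.
func (c *fakeContainer) cleanup(ctx context.Context) error {
	var lastErr error
	for _, v := range []*fakeVBD{c.rootVBD, c.scratchVBD} {
		if v == nil {
			continue
		}
		if err := v.Unmount(ctx); err != nil {
			lastErr = err
		}
	}
	if err := os.RemoveAll(c.workspaceDir); err != nil {
		lastErr = err
	}
	return lastErr
}
```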

maggie-lou (Contributor)


I just mean that it seems potentially error-prone that we're no longer unmounting the VBD synchronously.

The hotSwapWorkspace case is perhaps a better example, because we unmount one VBD and immediately afterwards mount a new one. It seems like this could create a race condition if we haven't fully killed the old FUSE server by the time we try to start a new one (or cause some weird interaction between the FUSE mounts).
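
To make the concern concrete, here is a rough sketch of that hot-swap ordering, with hypothetical vbdMount/mountVBDAt names standing in for the real vbd API; the point is only that the old unmount has to finish (and be checked) before mounting again at the same path:

```go
package sketch

import (
	"context"
	"fmt"
)

// vbdMount and mountVBDAt are hypothetical stand-ins for the real vbd API.
type vbdMount struct{ dir string }

func (v *vbdMount) Unmount(ctx context.Context) error { return nil }

func mountVBDAt(ctx context.Context, dir string) (*vbdMount, error) {
	return &vbdMount{dir: dir}, nil
}

// hotSwap illustrates the ordering concern: if the old FUSE server is still
// shutting down when the replacement is mounted at the same path, the new
// mount could fail or interact badly with the old one. Unmounting
// synchronously, and checking the error, closes that window.
func hotSwap(ctx context.Context, old *vbdMount, dir string) (*vbdMount, error) {
	if err := old.Unmount(ctx); err != nil {
		// Don't mount over a path whose previous FUSE mount may still be alive.
		return nil, fmt.Errorf("unmount old VBD: %w", err)
	}
	return mountVBDAt(ctx, dir)
}
```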

maggie-lou (Contributor)

We've seen similar problems to this before (https://github.com/buildbuddy-io/buildbuddy-internal/issues/3006). Instead, should we change Run to not wait on successfully cleaning up the container? That feels like a more holistic solution to this whole class of bugs.

bduffany (Member Author)

> Instead, should we change Run to not wait on successfully cleaning up the container? That feels like a more holistic solution to this whole class of bugs.

Yes, but it's a much more involved fix, because we want to make sure it works across all container implementations, and changing lifecycle-type code tends to be risky. Since this PR should fix the specific issue we're seeing in Firecracker, I would prefer to get this merged for now and consider a more comprehensive solution later.
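
As a rough outline of the alternative being floated (not how Run is currently written), backgrounding the cleanup with its own timeout might look something like this; the container interface and the one-minute timeout here are made up for illustration:

```go
package sketch

import (
	"context"
	"log"
	"time"
)

// container is a minimal stand-in for the real container interface.
type container interface {
	Remove(ctx context.Context) error
}

// removeAsync outlines the alternative under discussion: report task
// completion without waiting for container cleanup, and let cleanup run in
// the background with its own timeout. The one-minute timeout is arbitrary.
func removeAsync(c container) {
	go func() {
		ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
		defer cancel()
		if err := c.Remove(ctx); err != nil {
			log.Printf("background container cleanup failed: %s", err)
		}
	}()
}
```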

```diff
@@ -2381,7 +2381,7 @@ func (c *FirecrackerContainer) remove(ctx context.Context) error {
 		c.scratchStore = nil
 	}
 	if c.rootVBD != nil {
-		if err := c.rootVBD.Unmount(); err != nil {
+		if err := c.rootVBD.Unmount(ctx); err != nil {
```
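
For context on the signature change above, a common Go pattern for bounding a potentially hanging call with a context looks roughly like the following; this is a generic sketch, not necessarily how vbd.Unmount is implemented internally:

```go
package sketch

import "context"

// unmountWithContext is a generic sketch of the pattern the new context
// parameter enables: run the blocking unmount in a goroutine and stop
// waiting once the context is cancelled or times out, so a wedged FUSE
// unmount no longer stalls task cleanup. The real vbd.Unmount may be
// implemented differently.
func unmountWithContext(ctx context.Context, unmount func() error) error {
	done := make(chan error, 1)
	go func() { done <- unmount() }()
	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		// The unmount goroutine may still finish later; the caller just
		// stops blocking on it.
		return ctx.Err()
	}
}
```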
Contributor

We discussed this offline, but I think there's a chance of errors if we fail to unmount and then try to close the store while the VBD drive is still reading from it. That might be better than hanging indefinitely, but it's perhaps worth looking into.

bduffany (Member Author) commented Mar 12, 2024

Thinking about it a bit more, I think we have bigger problems if firecracker block device IO is allowed to happen after the VM is paused. Specifically, if there are COW writes in flight at the time we call CacheSnapshot, it could cause filecache corruption: we could compute the chunk digest before the in-progress write completes, and then the write lands and changes the chunk file contents so that they no longer match the digest. (Note that only writes are problematic: reads don't cause corruption, and if the COWStore is closed, a read will just return an error, which should be fine.)

I think we should investigate this possibility separately and get this PR merged to prevent user actions from getting stuck. I will send a separate follow-up PR to make sure we close the COW for writes before calling CacheSnapshot.
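
As a rough sketch of the ordering that follow-up would enforce (the cowStore interface and CloseForWrites method here are hypothetical, not an existing API):

```go
package sketch

import (
	"context"
	"fmt"
)

// cowStore and CloseForWrites are hypothetical; the real COWStore API may
// differ. This only illustrates the intent described in the comment above.
type cowStore interface {
	CloseForWrites() error
}

// snapshotAfterClosingWrites shows the ordering the follow-up aims for:
// stop accepting writes before chunk digests are computed, so an in-flight
// write can't change a chunk after its digest has been recorded.
func snapshotAfterClosingWrites(ctx context.Context, store cowStore, cacheSnapshot func(context.Context) error) error {
	if err := store.CloseForWrites(); err != nil {
		return fmt.Errorf("close COW for writes: %w", err)
	}
	// Any digest computed from here on reflects the final chunk contents.
	return cacheSnapshot(ctx)
}
```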

bduffany merged commit 4c6ff68 into master on Mar 12, 2024
16 of 17 checks passed
bduffany deleted the vbd-unmount-issue branch on March 12, 2024 at 15:59