
Prevent hanging vbd.Unmount from blocking task progress #6114

Merged · 1 commit into master on Mar 12, 2024

Conversation

bduffany (Member)

Related issues: N/A

bduffany marked this pull request as ready for review on March 11, 2024 at 18:07
maggie-lou (Contributor)

Could this cause issues in the healthy case? Say we try to reuse a firecracker container where the previous VBD hasn't fully unmounted yet (let's say there's some resource contention on the executor, so the unmount is slow but not actually hanging), and we try to mount a new VBD drive?

bduffany (Member Author) commented Mar 11, 2024

@maggie-lou I'm not sure I fully understand the scenario you're thinking of, but by the time we call vbd.Unmount() the backing COWStores should have been closed by Pause(). That is, by the time we call vbd.Unmount() the VM should no longer exist, and the remaining work should just be (1) unmounting the VBDs (which kills the FUSE server, which can no longer access the closed COWStore anyway, and cleans up the FUSE mount) and then (2) removing the workspace dir. That's how it's supposed to work, anyway; I'm going to keep looking at this code to make sure we're cleaning up everything properly.
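
For readers following along, here is a minimal sketch of that teardown order, using made-up names (fakeContainer, fakeVBD, workspaceDir) rather than the real FirecrackerContainer/vbd types:

```go
package sketch

import (
	"context"
	"os"
)

// fakeVBD and fakeContainer are illustrative stand-ins, not the real types.
type fakeVBD struct{}

func (v *fakeVBD) Unmount(ctx context.Context) error { return nil }

type fakeContainer struct {
	rootVBD, scratchVBD *fakeVBD
	workspaceDir        string
}

// cleanup sketches the order described above: Pause() has already stopped the
// VM and closed the backing COWStores, so the remaining work is (1) unmount
// the VBDs, which tears down the FUSE servers and their mount points, and
// (2) remove the workspace directory.
func (c *fakeContainer) cleanup(ctx context.Context) error {
	var lastErr error
	for _, v := range []*fakeVBD{c.rootVBD, c.scratchVBD} {
		if v == nil {
			continue
		}
		if err := v.Unmount(ctx); err != nil {
			lastErr = err
		}
	}
	if err := os.RemoveAll(c.workspaceDir); err != nil {
		lastErr = err
	}
	return lastErr
}
```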

maggie-lou (Contributor)


I just mean that it seems potentially error-prone that we're no longer unmounting the VBD synchronously.

The hotSwapWorkspace case is perhaps a better example, because we unmount one VBD and immediately afterwards mount a new one. It seems like this could create a race condition if we haven't fully killed the old FUSE server by the time we try to start a new one (or cause some weird interaction between the FUSE mounts).
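
To make the concern concrete, here is a rough sketch of that hot-swap ordering, with hypothetical vbdMount/mountVBDAt names standing in for the real vbd API; the point is only that the old unmount has to finish (and be checked) before mounting again at the same path:

```go
package sketch

import (
	"context"
	"fmt"
)

// vbdMount and mountVBDAt are hypothetical stand-ins for the real vbd API.
type vbdMount struct{ dir string }

func (v *vbdMount) Unmount(ctx context.Context) error { return nil }

func mountVBDAt(ctx context.Context, dir string) (*vbdMount, error) {
	return &vbdMount{dir: dir}, nil
}

// hotSwap illustrates the ordering concern: if the old FUSE server is still
// shutting down when the replacement is mounted at the same path, the new
// mount could fail or interact badly with the old one. Unmounting
// synchronously, and checking the error, closes that window.
func hotSwap(ctx context.Context, old *vbdMount, dir string) (*vbdMount, error) {
	if err := old.Unmount(ctx); err != nil {
		// Don't mount over a path whose previous FUSE mount may still be alive.
		return nil, fmt.Errorf("unmount old VBD: %w", err)
	}
	return mountVBDAt(ctx, dir)
}
```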

maggie-lou (Contributor)

We've seen similar problems to this before (https://github.com/buildbuddy-io/buildbuddy-internal/issues/3006). Instead, should we change Run to not wait on successfully cleaning up the container? That feels like a more holistic solution to this whole class of bugs.

bduffany (Member Author)

> Instead, should we change Run to not wait on successfully cleaning up the container? That feels like a more holistic solution to this whole class of bugs.

Yes, but it's a much more involved fix, because we want to make sure it works across all container implementations, and changing lifecycle-type code tends to be risky. Since this PR should fix the specific issue we're seeing in Firecracker, I would prefer to get this merged for now and consider a more comprehensive solution later.
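
As a rough outline of the alternative being floated (not how Run is currently written), backgrounding the cleanup with its own timeout might look something like this; the container interface and the one-minute timeout here are made up for illustration:

```go
package sketch

import (
	"context"
	"log"
	"time"
)

// container is a minimal stand-in for the real container interface.
type container interface {
	Remove(ctx context.Context) error
}

// removeAsync outlines the alternative under discussion: report task
// completion without waiting for container cleanup, and let cleanup run in
// the background with its own timeout. The one-minute timeout is arbitrary.
func removeAsync(c container) {
	go func() {
		ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
		defer cancel()
		if err := c.Remove(ctx); err != nil {
			log.Printf("background container cleanup failed: %s", err)
		}
	}()
}
```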

```diff
@@ -2381,7 +2381,7 @@ func (c *FirecrackerContainer) remove(ctx context.Context) error {
 		c.scratchStore = nil
 	}
 	if c.rootVBD != nil {
-		if err := c.rootVBD.Unmount(); err != nil {
+		if err := c.rootVBD.Unmount(ctx); err != nil {
```
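
For context on the signature change above, a common Go pattern for bounding a potentially hanging call with a context looks roughly like the following; this is a generic sketch, not necessarily how vbd.Unmount is implemented internally:

```go
package sketch

import "context"

// unmountWithContext is a generic sketch of the pattern the new context
// parameter enables: run the blocking unmount in a goroutine and stop
// waiting once the context is cancelled or times out, so a wedged FUSE
// unmount no longer stalls task cleanup. The real vbd.Unmount may be
// implemented differently.
func unmountWithContext(ctx context.Context, unmount func() error) error {
	done := make(chan error, 1)
	go func() { done <- unmount() }()
	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		// The unmount goroutine may still finish later; the caller just
		// stops blocking on it.
		return ctx.Err()
	}
}
```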
Contributor

We discussed this offline, but I think there's a chance of errors if we fail to unmount and then try to close the store while the VBD drive is still reading from it. That might be better than hanging indefinitely, but it's perhaps worth looking into.

bduffany (Member Author) commented Mar 12, 2024

Thinking about it a bit more, I think we have bigger problems if firecracker block device IO is allowed to happen after the VM is paused. Specifically, if there are COW writes in flight at the time we call CacheSnapshot, it could cause filecache corruption: we could compute the chunk digest before the in-progress write completes, and then the write lands and changes the chunk file contents so that they no longer match the digest. (Note that only writes are problematic: reads don't cause corruption, and if the COWStore is closed, a read will just return an error, which should be fine.)

I think we should investigate this possibility separately and get this PR merged to prevent user actions from getting stuck. I will send a separate follow-up PR to make sure we close the COW for writes before calling CacheSnapshot.
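
As a rough sketch of the ordering that follow-up would enforce (the cowStore interface and CloseForWrites method here are hypothetical, not an existing API):

```go
package sketch

import (
	"context"
	"fmt"
)

// cowStore and CloseForWrites are hypothetical; the real COWStore API may
// differ. This only illustrates the intent described in the comment above.
type cowStore interface {
	CloseForWrites() error
}

// snapshotAfterClosingWrites shows the ordering the follow-up aims for:
// stop accepting writes before chunk digests are computed, so an in-flight
// write can't change a chunk after its digest has been recorded.
func snapshotAfterClosingWrites(ctx context.Context, store cowStore, cacheSnapshot func(context.Context) error) error {
	if err := store.CloseForWrites(); err != nil {
		return fmt.Errorf("close COW for writes: %w", err)
	}
	// Any digest computed from here on reflects the final chunk contents.
	return cacheSnapshot(ctx)
}
```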

bduffany merged commit 4c6ff68 into master on Mar 12, 2024
16 of 17 checks passed
bduffany deleted the vbd-unmount-issue branch on March 12, 2024 at 15:59