Prevent hanging vbd.Unmount from blocking task progress #6114
Conversation
Force-pushed from 37dd439 to d7679a3
Could this cause issues in the healthy case? If we try to reuse a firecracker container where the previous VBD hasn't fully unmounted yet (let's say there's some resource contention on the executor, so the unmount is slow but not hanging), and we try to mount a new VBD drive?
@maggie-lou I'm not sure I fully understand the scenario you're thinking of, but by the time we call …
I just mean that it seems potentially error-prone that we're no longer unmounting the VBD synchronously. The …
We've seen similar problems before (https://github.com/buildbuddy-io/buildbuddy-internal/issues/3006). Instead, should we change …
Yes, but that's a much more involved fix, because we want to make sure it works across all container implementations, and changing lifecycle-type behavior tends to be risky. Since this PR should fix the specific issue we're seeing in Firecracker, I'd prefer to get it merged for now and consider a more comprehensive solution later.
```diff
@@ -2381,7 +2381,7 @@ func (c *FirecrackerContainer) remove(ctx context.Context) error {
 		c.scratchStore = nil
 	}
 	if c.rootVBD != nil {
-		if err := c.rootVBD.Unmount(); err != nil {
+		if err := c.rootVBD.Unmount(ctx); err != nil {
```
Thinking about it a bit more, I think we have bigger problems if firecracker block device IO is allowed to happen after the VM is paused. Specifically, if there are COW writes in-flight at the time we call `CacheSnapshot`, it could cause filecache corruption: we could compute the chunk digest before an in-progress write completes, and the write then lands afterwards and changes the chunk file contents, so they no longer match the digest. (Note that only writes are problematic; reads don't cause corruption, and if the COWStore is closed a read will just return an error, which should be fine.)
I think we should investigate this possibility separately, and get this PR merged to prevent user actions getting stuck. I will send a separate follow-up PR to make sure we close the COW for writes before calling `CacheSnapshot`.
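The "close for writes before digesting" idea from the comment above can be sketched like this. This is a toy stand-in, not the real COWStore: `cowChunk`, `Write`, and `FreezeAndDigest` are hypothetical names; the point is that freezing and digesting happen under the same lock, so an in-flight write either completes before the digest is computed or is rejected after it.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"sync"
)

// cowChunk is a toy stand-in for one COW store chunk (hypothetical).
type cowChunk struct {
	mu     sync.Mutex
	frozen bool
	data   []byte
}

// Write appends to the chunk, but fails once the chunk is frozen, so
// no write can land after the digest has been computed.
func (c *cowChunk) Write(p []byte) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.frozen {
		return errors.New("chunk is closed for writes")
	}
	c.data = append(c.data, p...)
	return nil
}

// FreezeAndDigest closes the chunk for writes and computes its digest
// under one lock acquisition. Without the freeze step, a write racing
// with the digest could change the contents after the digest is taken,
// which is exactly the filecache-corruption scenario described above.
func (c *cowChunk) FreezeAndDigest() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.frozen = true
	sum := sha256.Sum256(c.data)
	return hex.EncodeToString(sum[:])
}

func main() {
	c := &cowChunk{}
	_ = c.Write([]byte("hello"))
	digest := c.FreezeAndDigest()
	fmt.Println(digest == c.FreezeAndDigest()) // digest is stable once frozen
	fmt.Println(c.Write([]byte("late")))       // late write is rejected
}
```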
Related issues: N/A