Describe the bug
Recently, we reported a bug that led to ongoing read syscalls not being interrupted after a snapshot resume. @ShadowCurse then published a fix for this. We then verified that the fix worked by using our reproducer repo:
In the reproducer repo, we're essentially doing this:
1. Start Firecracker
2. Create a VM with a VSock
3. Start a VSock-over-UDS listener on the host with socat
4. In the guest VM, connect to the listener on the host through VSock with socat (a minimal guest-side sketch of this step follows the list)
5. Pause the VM and create a snapshot
6. Stop the listener on the host
7. Resume the VM
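In the reproducer repo, the guest side of steps 4 and 7 uses socat. For illustration, here is a minimal, hypothetical Rust equivalent of that guest-side check (a sketch only - the port number is our assumption and it uses the libc crate; the repo itself does this with socat). It connects to the host (CID 2) over VSock and blocks in read(2), which after the resume should fail with ECONNRESET rather than hang:

use std::io::Error;
use std::mem;

fn main() {
    unsafe {
        // AF_VSOCK stream socket, like the one socat opens in the guest.
        let fd = libc::socket(libc::AF_VSOCK, libc::SOCK_STREAM, 0);
        assert!(fd >= 0, "socket: {}", Error::last_os_error());

        // CID 2 is the well-known host CID; port 1234 stands in for
        // whatever port the host-side listener was started on.
        let mut addr: libc::sockaddr_vm = mem::zeroed();
        addr.svm_family = libc::AF_VSOCK as libc::sa_family_t;
        addr.svm_cid = 2;
        addr.svm_port = 1234;
        let rc = libc::connect(
            fd,
            &addr as *const libc::sockaddr_vm as *const libc::sockaddr,
            mem::size_of::<libc::sockaddr_vm>() as libc::socklen_t,
        );
        assert!(rc == 0, "connect: {}", Error::last_os_error());

        // Block in read(2). After a snapshot is taken and the VM resumes,
        // this should return -1 with ECONNRESET instead of hanging forever.
        let mut buf = [0u8; 64];
        let n = libc::read(fd, buf.as_mut_ptr().cast(), buf.len());
        if n < 0 {
            eprintln!("read failed as expected: {}", Error::last_os_error());
        } else {
            println!("read returned {} bytes", n);
        }
        libc::close(fd);
    }
}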
And this all works fine now! At step 7, any in-progress reads from the VSock in the guest VM now return RST errors (as they should).
But if we add the following steps:
8. Start a VSock-over-UDS listener on the host with socat again
9. In the guest VM, connect to the listener on the host through VSock with socat
10. Pause the VM and create a snapshot
11. Stop the listener on the host
12. Resume the VM
The issue appears again! Essentially, if we create & resume two or more subsequent snapshots (we have already verified this with upstream Firecracker using our existing reproducer repo), reads from the VSocks no longer get any RST errors but instead hang forever.
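The pause/snapshot/resume part of each cycle (steps 5-7 and 10-12) can also be driven directly against Firecracker's HTTP API. Here is a minimal sketch of two such cycles, assuming the API socket is at /tmp/firecracker.socket and using made-up snapshot paths (the reproducer repo scripts this differently):

use std::io::{Read, Write};
use std::os::unix::net::UnixStream;

const API_SOCK: &str = "/tmp/firecracker.socket"; // assumed path

// Fire a single HTTP/1.1 request at the Firecracker API socket.
fn api(method: &str, path: &str, body: &str) -> std::io::Result<String> {
    let mut s = UnixStream::connect(API_SOCK)?;
    write!(
        s,
        "{} {} HTTP/1.1\r\nContent-Type: application/json\r\nContent-Length: {}\r\n\r\n{}",
        method, path, body.len(), body
    )?;
    let mut buf = [0u8; 4096];
    let n = s.read(&mut buf)?; // first chunk of the response is enough here
    Ok(String::from_utf8_lossy(&buf[..n]).into_owned())
}

// One pause -> snapshot -> resume cycle; the host-side socat listener
// has to be stopped by hand between the snapshot and the resume.
fn snapshot_cycle(n: u32) -> std::io::Result<()> {
    api("PATCH", "/vm", r#"{"state": "Paused"}"#)?;
    let body = format!(
        r#"{{"snapshot_type": "Full", "snapshot_path": "/tmp/snap{n}.file", "mem_file_path": "/tmp/mem{n}.file"}}"#
    );
    api("PUT", "/snapshot/create", &body)?;
    api("PATCH", "/vm", r#"{"state": "Resumed"}"#)?;
    Ok(())
}

fn main() -> std::io::Result<()> {
    snapshot_cycle(1)?; // after this, guest reads get RST, as expected
    snapshot_cycle(2)?; // after this, guest reads hang (the bug)
    Ok(())
}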
To Reproduce
See the original reproduction steps, but add the additional steps described above (snapshot & restore more than once).
Expected behaviour
See the original expected behavior - from our reading, this behavior should continue to hold even on the 2nd, 3rd, etc. snapshot/restore cycle.
Potential Fix
We noticed that in src/vmm/src/device_manager/persist.rs, the VSock state is serialized before the reset event is sent. If this order is swapped, the issue goes away, and the 2nd, 3rd, etc. resume still correctly resets any connected VSocks. Our (unverified) guess is that sending the reset event changes device state, so serializing first captures a state in which the pending reset is not yet reflected. We're not sure whether this is a proper solution, whether there are better workarounds, or whether there is a better way to fix the issue. To test it, apply the following patch:
diff --git a/src/vmm/src/device_manager/persist.rs b/src/vmm/src/device_manager/persist.rs
index 7a51bf790..154427a84 100644
--- a/src/vmm/src/device_manager/persist.rs
+++ b/src/vmm/src/device_manager/persist.rs
@@ -365,11 +365,6 @@ impl<'a> Persist<'a> for MMIODeviceManager {
                     .downcast_mut::<Vsock<VsockUnixBackend>>()
                     .unwrap();
 
-                let vsock_state = VsockState {
-                    backend: vsock.backend().save(),
-                    frontend: vsock.save(),
-                };
-
                 // Send Transport event to reset connections if device
                 // is activated.
                 if vsock.is_activated() {
@@ -378,6 +373,11 @@ impl<'a> Persist<'a> for MMIODeviceManager {
                     });
                 }
 
+                let vsock_state = VsockState {
+                    backend: vsock.backend().save(),
+                    frontend: vsock.save(),
+                };
+
                 states.vsock_device = Some(ConnectedVsockState {
                     device_id: devid.clone(),
                     device_state: vsock_state,
Then run the reproduction steps using the reproducer repo. The issue should no longer appear afterwards.
Environment
- Firecracker version: Firecracker v1.10.0-dev at commit 8eea9df
- Host and guest kernel versions: Host 6.10.6-200.fc40.x86_64, guest 6.1.89
- Rootfs used: Buildroot 2024.02.5. See the loopholelabs/firecracker-vsock-snapshot-reset-bug-reproducer repo for more details and the specific rootfs being used - this is also reproducible with Ubuntu and Alpine rootfses.
- Architecture: x86_64
- Any other relevant software versions: The host is Fedora 40 on an Intel i7-1280P.
Additional context
See the original additional context.
Checks
- Have you searched the Firecracker Issues database for similar problems?
- Have you read the existing relevant Firecracker documentation?
- Are you certain the bug being reported is a Firecracker issue?