NV2 cache coherency fix: DSB barriers and test infrastructure#183
NV2 cache coherency fix: DSB barriers and test infrastructure#183
Conversation
|
⛔ Lint FailedThe Lint check has failed for this PR. Please fix the linting issues before I can proceed with the review. You can check the lint failure details here: https://github.com/ejc3/fcvm/actions/runs/21322777548 Review by Claude | Run |
🔧 CI Auto-FixCreated fix PR: #184 The CI was failing due to rustfmt formatting violations in 5 files. I've applied |
|
🔧 CI Auto-FixCreated fix PR: #185 The CI failure was caused by formatting issues. I've automatically applied |
|
|
CI Failure AnalysisI analyzed CI #21325400113 but couldn't create an automatic fix. DiagnosisThe CI failures are NOT due to code bugs:
Why Not FixableThis appears to be related to the known performance/timeout issues mentioned in PR #183:
The failures are due to environmental performance constraints, not code defects. The kernel building infrastructure is working correctly - it attempts download, gets 404, then successfully builds locally as designed. |
…corruption
FUSE-over-vsock corrupts at ~1MB cumulative transfer under ARM64 NV2 nested
virtualization. Error manifests as "DESERIALIZE FAILED - tag for enum is not
valid" with bincode failing to parse received data.
- Added CRC32 checksum to wire protocol format: [4-byte CRC][4-byte length][payload]
- WIRE CRC MISMATCH proves data is corrupted IN TRANSIT (not serialization bug)
- Corruption always happens at message count=12, around 1.3MB total bytes read
- This is consistently a FUSE WRITE request (~256KB or ~1MB payload)
- 512K, 768K, 1M: Always PASS
- 1280K: ~40-60% success rate
- 1536K: ~20% success rate
- 2M: ~20% success rate
Under NV2 (FEAT_NV2), L1 guest's writes to vsock SKB buffers may not be visible
to L0 host due to cache coherency issues in double Stage 2 translation path.
The data flow:
1. L1 app writes to FUSE
2. L1 fc-agent serializes to vsock SKB
3. L1 kernel adds SKB to virtqueue
4. L1 kicks virtio (MMIO trap to L0)
5. L0 Firecracker reads from virtqueue mmap
6. L0 may see STALE data if L1's writes aren't flushed
- Small messages use LINEAR SKBs (skb->data points to contiguous buffer)
- Large messages (>PAGE_SIZE) use NONLINEAR SKBs with page fragments
- Original DC CIVAC only flushed linear data, missing page fragments
1. nv2-vsock-dcache-flush.patch
- Adds DC CIVAC flush in virtio_transport_send_skb() for TX path
- Handles BOTH linear and nonlinear (paged) SKBs
- Uses page_address() to get proper VA for page fragments
- Adds DSB SY + ISB barriers around flush
2. nv2-virtio-kick-barrier.patch
- Adds DSB SY + ISB in virtqueue_notify() before MMIO kick
- Ensures all prior writes are visible before trap to hypervisor
3. nv2-vsock-rx-barrier.patch (existing)
- Adds DSB SY in virtio_transport_rx_work() before reading RX queue
- Ensures L0's writes are visible to L1 when receiving responses
4. nv2-vsock-cache-sync.patch (existing)
- Adds DSB SY in kvm_nested_sync_hwstate()
- Barrier at nested guest exit
5. nv2-mmio-barrier.patch
- Adds DSB SY in io_mem_abort() before kvm_io_bus_write()
- Ensures L1's writes visible before signaling eventfd
- Only activates on ARM64_HAS_NESTED_VIRT capability
```
[4 bytes: CRC32 of (length + body)]
[4 bytes: length (big-endian u32)]
[N bytes: serialized WireRequest]
```
- Server reads CRC header first
- Computes CRC of received (length + body)
- Logs WIRE CRC MISMATCH if expected != received
- Helps pinpoint WHERE corruption occurs (before or during transit)
With all patches applied:
- ~60% success rate at 1280K (up from ~40%)
- ~20% success rate at 2M
- Still intermittent - likely missing vring descriptor flush
1. Vring descriptor array may need flushing (not just SKB data)
2. Available ring updates may be cached
3. May need flush at different point in virtqueue_add_sgs() path
4. Consider flushing entire virtqueue memory region
```bash
for SIZE in 512K 768K 1M 1280K 1536K 2M; do
sudo fcvm podman run --kernel-profile nested --network bridged \
--map /tmp/test:/mnt alpine:latest \
sh -c "dd if=/dev/urandom of=/mnt/test.bin bs=$SIZE count=1 conv=fsync"
done
```
New layout:
kernel/
├── 0001-fuse-add-remap_file_range-support.patch # Universal (symlinked down)
├── host/
│ ├── arm64/
│ │ ├── 0001-fuse-*.patch -> ../../ (symlink)
│ │ └── nv2-mmio-barrier.patch (host KVM MMIO DSB)
│ └── x86/
│ └── 0001-fuse-*.patch -> ../../ (symlink)
└── nested/
├── arm64/
│ ├── 0001-fuse-*.patch -> ../../ (symlink)
│ ├── nv2-vsock-*.patch (guest vsock cache flush)
│ ├── nv2-virtio-kick-barrier.patch
│ ├── mmfr4-override.vm.patch
│ └── psci-debug-*.patch
└── x86/
└── 0001-fuse-*.patch -> ../../ (symlink)
Principle: Put patches at highest level where they apply, symlink down.
- FUSE remap: ALL kernels → kernel/
- MMIO barrier: Host ARM64 only → kernel/host/arm64/
- vsock flush: Nested ARM64 only → kernel/nested/arm64/
Updated rootfs-config.toml to use new paths:
- nested.arm64.patches_dir = "kernel/nested/arm64"
- nested.arm64.host_kernel.patches_dir = "kernel/host/arm64"
Host kernel patch (nv2-mmio-barrier.patch): - Use vcpu_has_nv(vcpu) instead of cpus_have_final_cap() to only apply DSB barrier for nested guests, not all VMs on NV2 hardware - Remove debug printk that was causing massive performance degradation Nested kernel patch (nv2-virtio-kick-barrier.patch): - Add DC CIVAC cache flush for vring structures (desc, avail, used) - Previous DSB+ISB alone doesn't flush dirty cache lines under NV2 Test script (scripts/nv2-corruption-test.sh): - First verifies simple VM works before running corruption tests - Reports pass/fail counts for each test iteration
- Set up ~/linux with fcvm-host and fcvm-nested branches - Patches now managed via stgit for automatic line number updates - Updated all patches to target v6.18 with correct offsets - Added stgit workflow documentation to CLAUDE.md - Fixed kernel patch layout documentation (added psci-debug patches) Workflow: edit in ~/linux, `stg refresh`, `stg export` to fcvm
Progress: - Set up stgit for kernel patch management (~/linux) - Rebuilt host kernel (85bc71093b8c) and nested kernel (73b4418e28a9) - Updated corruption test script to auto-setup Current issue: - L1 VMs with --kernel-profile nested (HAS_EL2 enabled) fail with I/O error on FUSE writes > ~1.3MB - L1 VMs WITHOUT nested profile work fine at 50MB+ - Issue is NV2-specific: when vCPU has HAS_EL2, cache coherency breaks Analysis: - Host patch (nv2-mmio-barrier.patch) only applies DSB when vcpu_has_nv(vcpu) - vcpu_has_nv() checks if guest is running a nested guest (L2) - But issue occurs at L1 level when L1 has HAS_EL2 feature enabled - Need to add barrier for any vCPU with HAS_EL2, not just nested guests Next: Update host patch to check for HAS_EL2 feature instead of nested state
- Make checksums protective: server drops requests with wire CRC mismatch (continue → skip), sends EIO for field-level checksum failures. Client returns EIO for corrupted responses instead of delivering corrupt data to FUSE layer. - Add 6 unit tests for checksum lifecycle: roundtrip encode/decode with validation, corruption detection for tampered request/response, backwards compatibility (no checksum = passes), determinism. - Update wire protocol docs to show actual format: requests use [CRC][length][payload], responses use [length][payload] with checksum embedded in WireResponse struct. - Remove psci-debug-handle-exit.patch and psci-debug-psci.patch (debug-only, never referenced in build_inputs). - Fix: vcpu_has_nv() already checks HAS_EL2 feature — the WIP commit's analysis was wrong. The barrier fires correctly but is insufficient alone for NV2 cache coherency. Tested: cargo test -p fuse-pipe --lib protocol::wire::tests (11/11 pass)
🔧 CI Auto-FixCreated fix PR: #248 Issue: 4 fuse-pipe tests were failing/timing out because they weren't updated to handle the CRC header added to the client->server wire format in the recent CRC32 checksum implementation. Fix: Updated all affected tests to read/write the CRC header (4 bytes) when simulating wire protocol communication. |
Summary
This PR addresses ARM64 NV2 (FEAT_NV2) cache coherency issues that cause data corruption in nested virtualization.
The Problem: Under NV2, FWB (Stage2 Forwarding Write Buffer) does not properly ensure cache coherency across the double stage-2 translation (L2 GPA → L1 S2 → L1 HPA → L0 S2 → physical). When a guest writes to virtqueue buffers and kicks via MMIO, the host's userspace may read stale/zero data instead of the actual content.
The Solution: Add DSB (Data Synchronization Barrier) patches at key points:
kernel/host/arm64/nv2-mmio-barrier.patch): Conditional DSB before ioeventfd signaling for NV2 guestskernel/nested/arm64/nv2-mmio-barrier.patch): Page table walking + dirty page flush on all MMIO writes (unconditional since nested is always inside broken FWB)Changes
Kernel Patches
vcpu_has_nv()) beforekvm_io_bus_write()dcache_clean_inval_poc()Test Infrastructure
/mnt/fcvm-btrfs/nested-debug/FCVM_DATA_DIR=/root/fcvm-datafor Unix sockets (FUSE doesn't support them)/nested_l2/to/nested/)make setup-nestedtargetOutput Listener Timeout Fix
Other
Current Status
test_nested_run_fcvm_inside_vm) times out at 30minPerformance Notes
Under NV2 nested virtualization:
The nested test currently times out before L2 can complete, but the infrastructure is working correctly. The performance limitation is inherent to FUSE-over-vsock under double Stage 2 translation.
Test Plan
make test-root) - pendingRelated
arch/arm64/kvm/nested.c