Fix timeout during prefault by amphi · Pull Request #111 · cyberus-technology/cloud-hypervisor

amphi · 2026-03-17T16:25:22Z

This PR fixes timeout errors that occur when the receiver side of a migration takes a long time to apply the VM config (e.g. because prefaulting the memory is slow). It does so by making the receiver send keep-alive messages while it is not listening for new requests.
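As an illustration of the idea only (not the code in this PR), the following sketch runs a long operation while a helper thread periodically writes a keep-alive marker. The function name `run_with_keep_alive` and the one-byte `KEEP_ALIVE` marker are hypothetical; the actual wire format and the `KeepAliveStream` internals are not shown on this page.

```rust
use std::io::Write;
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical one-byte keep-alive marker (an assumption for this sketch).
const KEEP_ALIVE: u8 = 0xA5;

/// Runs `work` on the current thread while a helper thread writes
/// `KEEP_ALIVE` to `out` once per `interval`.
/// Returns the result of `work` and the number of markers written.
fn run_with_keep_alive<W, T, F>(
    out: &mut W,
    interval: Duration,
    work: F,
) -> std::io::Result<(T, usize)>
where
    W: Write + Send,
    F: FnOnce() -> T,
{
    let (done_tx, done_rx) = mpsc::channel::<()>();
    thread::scope(|s| {
        let pinger = s.spawn(move || -> std::io::Result<usize> {
            let mut sent = 0;
            // A timeout means the work is still running: send one marker.
            // A received message or a dropped sender ends the loop.
            while let Err(mpsc::RecvTimeoutError::Timeout) = done_rx.recv_timeout(interval) {
                out.write_all(&[KEEP_ALIVE])?;
                out.flush()?;
                sent += 1;
            }
            Ok(sent)
        });

        // Stand-in for the slow step, e.g. applying the VM config with prefaulting.
        let result = work();

        // Tell the helper thread to stop, then collect its marker count.
        let _ = done_tx.send(());
        let sent = pinger.join().expect("keep-alive thread panicked")?;
        Ok((result, sent))
    })
}
```

In this sketch the writer is borrowed exclusively by the helper thread for the duration of `work`, so the long-running step cannot interleave its own messages with the keep-alive markers.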

Unfortunately, this is not easy to test, because we'd need some mechanism that makes the receiver wait for some time at a certain point in the migration. But we tested this at SAP and it worked.

phip1611

The whole keep alive logic is pretty complex. I think for upstreaming, we should perhaps evaluate if we can come up with a somewhat cleaner, simpler, and less invasive architecture.

Thanks for the analysis and fixing this!

arctic-alpaca

The first commit (vm-migration: speed up volatile read and write) doesn't explain how it speeds up the read/write and for me, it's not obvious from the code change.

vm-migration/src/keep_alive_stream.rs

From now on, we also use the KeepAliveStream when the migration uses a single TCP connection. We have seen timeouts when the VM has huge amounts of memory and prefaults it during migration, so we need the keep-alive messages in this case as well. On-behalf-of: SAP sebastian.eydam@sap.com Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>

We will also use the KeepAliveStream on the receiver side of the live migration, thus it has to implement AsFd. On-behalf-of: SAP sebastian.eydam@sap.com Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>

This is important when we wrap the receiver socket into the KeepAliveStream, because we want readers to wait longer than senders, and we want readers to wait long enough to see keep alive messages. On-behalf-of: SAP sebastian.eydam@sap.com Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>
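One way to read the timeout relationship described in this commit message is sketched below. All names and values here are assumptions for illustration (`KEEP_ALIVE_INTERVAL`, `WRITE_TIMEOUT`, `read_timeout_for`, `configure_timeouts`); the timeouts actually chosen in the PR are not visible on this page. Only `TcpStream::set_read_timeout` and `TcpStream::set_write_timeout` are real standard-library calls.

```rust
use std::io;
use std::net::TcpStream;
use std::time::Duration;

/// Illustrative values only (assumptions, not taken from the PR).
const KEEP_ALIVE_INTERVAL: Duration = Duration::from_secs(5);
const WRITE_TIMEOUT: Duration = Duration::from_secs(10);

/// Read timeout that still sees a keep-alive even if `missed`
/// consecutive keep-alive intervals pass without one arriving.
fn read_timeout_for(interval: Duration, missed: u32) -> Duration {
    interval * (missed + 1)
}

/// Applies a short write timeout and a longer read timeout to `stream`.
fn configure_timeouts(stream: &TcpStream) -> io::Result<()> {
    stream.set_write_timeout(Some(WRITE_TIMEOUT))?;
    stream.set_read_timeout(Some(read_timeout_for(KEEP_ALIVE_INTERVAL, 2)))?;
    Ok(())
}
```

With these example numbers the reader waits 15 seconds, which is longer than the 10-second write timeout and spans three keep-alive intervals.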

Otherwise we have to scatter the keep alive handling over the whole code base, which we don't want. On-behalf-of: SAP sebastian.eydam@sap.com Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>

The sender of the live migration usually waits for a response when it isn't sending requests or doing any work. Thus the receiver should send keep alive responses to not break the protocol. On-behalf-of: SAP sebastian.eydam@sap.com Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>
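A minimal sketch of how a sender could wait for a response while tolerating keep-alive responses in between. The tag values, the `Response` enum, and `wait_for_response` are hypothetical names chosen for this example; the real response encoding of the migration protocol is not shown here.

```rust
use std::io::{self, Read};

/// Hypothetical response kinds for this sketch.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Response {
    Ok,
    Error,
}

/// Hypothetical one-byte tags (assumptions, not the real wire format).
const TAG_KEEP_ALIVE: u8 = 0;
const TAG_OK: u8 = 1;
const TAG_ERROR: u8 = 2;

/// Reads one-byte tags until a real response arrives.
/// Returns the response and the number of keep-alives skipped on the way.
fn wait_for_response<R: Read>(stream: &mut R) -> io::Result<(Response, usize)> {
    let mut skipped = 0;
    let mut tag = [0u8; 1];
    loop {
        // Fails with UnexpectedEof if the peer closes the connection.
        stream.read_exact(&mut tag)?;
        match tag[0] {
            // The receiver is still busy: keep waiting.
            TAG_KEEP_ALIVE => skipped += 1,
            TAG_OK => return Ok((Response::Ok, skipped)),
            TAG_ERROR => return Ok((Response::Error, skipped)),
            other => {
                return Err(io::Error::new(
                    io::ErrorKind::InvalidData,
                    format!("unknown response tag {other}"),
                ))
            }
        }
    }
}
```

Because each keep-alive arrives as an ordinary response on the same stream, the sender's request/response loop does not need a separate channel, and a read timeout on the stream only fires when neither a keep-alive nor a real response arrives in time.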

The sender and receiver sides have to behave a bit differently to not break the protocol. Thus we add some special handling for both sides. On-behalf-of: SAP sebastian.eydam@sap.com Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>

On-behalf-of: SAP sebastian.eydam@sap.com Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>

tpressure · 2026-03-18T10:01:56Z

This change will break migration from old chv versions to this one. We cannot merge this in its current state; otherwise, our customer can no longer migrate during rollouts.

tpressure · 2026-03-18T10:02:30Z

@phip1611 do we have a migration protocol version that is negotiated between the two cloud hypervisor instances? If not, we should probably introduce it before we go GA.
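For illustration, a negotiated protocol version could look roughly like the sketch below. Nothing here reflects an existing cloud-hypervisor mechanism: `negotiate`, `supports_keep_alive`, and the version numbers are hypothetical, and the assumption that keep-alive responses would start at version 2 is made up for the example.

```rust
/// Hypothetical: first protocol version in which the receiver may send
/// keep-alive responses (an assumption for this sketch).
const FIRST_VERSION_WITH_KEEP_ALIVE: u32 = 2;

/// Picks the highest version supported by both sides,
/// or `None` if the two sides share no version.
fn negotiate(local: &[u32], remote: &[u32]) -> Option<u32> {
    local
        .iter()
        .copied()
        .filter(|v| remote.contains(v))
        .max()
}

/// Whether keep-alive responses may be used with the negotiated version.
fn supports_keep_alive(version: u32) -> bool {
    version >= FIRST_VERSION_WITH_KEEP_ALIVE
}
```

Under this scheme, a new receiver talking to an old sender would negotiate version 1 and refrain from sending keep-alive responses, so migration during a rollout keeps working at the cost of the old timeout behavior.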

amphi requested review from Coffeeri, phip1611 and scholzp March 17, 2026 16:25

amphi self-assigned this Mar 17, 2026

amphi force-pushed the fix-timeout-during-prefault branch from ecc4580 to 3f1acdc Compare March 17, 2026 16:26

amphi requested review from arctic-alpaca and olivereanderson March 17, 2026 16:30

phip1611 approved these changes Mar 18, 2026

arctic-alpaca reviewed Mar 18, 2026

vm-migration/src/keep_alive_stream.rs (outdated)

vm-migration/src/keep_alive_stream.rs

arctic-alpaca approved these changes Mar 18, 2026

amphi force-pushed the fix-timeout-during-prefault branch from 3f1acdc to f0e96f2 Compare March 18, 2026 09:12

amphi added 8 commits March 18, 2026 10:14

vm-migration: Add AsFd for KeepAliveStream

8306f31

We will also use the KeepAliveStream on the receiver side of the live migration, thus it has to implement AsFd. On-behalf-of: SAP sebastian.eydam@sap.com Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>

vm-migration: move keep alive handling into the protocol

43cd377

Otherwise we have to scatter the keep alive handling over the whole code base, which we don't want. On-behalf-of: SAP sebastian.eydam@sap.com Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>

vmm: always use KeepAliveStream for main connection

8c23a50

On-behalf-of: SAP sebastian.eydam@sap.com Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>

vmm: Use KeepAliveStream also for receiver

82d6812

On-behalf-of: SAP sebastian.eydam@sap.com Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>

amphi force-pushed the fix-timeout-during-prefault branch from 89f2ebe to 82d6812 Compare March 18, 2026 09:14

tpressure marked this pull request as draft March 18, 2026 10:08

tpressure marked this pull request as ready for review March 18, 2026 10:28

tpressure merged commit c9b2935 into cyberus-technology:gardenlinux Mar 18, 2026
11 checks passed

arctic-alpaca mentioned this pull request Mar 18, 2026

Add option to skip zero pages during VM migration #112

Closed

tpressure merged 8 commits into cyberus-technology:gardenlinux from amphi:fix-timeout-during-prefault
