Skip to content

Fix timeout during prefault#111

Merged
tpressure merged 8 commits intocyberus-technology:gardenlinuxfrom
amphi:fix-timeout-during-prefault
Mar 18, 2026
Merged

Fix timeout during prefault#111
tpressure merged 8 commits intocyberus-technology:gardenlinuxfrom
amphi:fix-timeout-during-prefault

Conversation

@amphi
Copy link
Copy Markdown

@amphi amphi commented Mar 17, 2026

This PR fixes timeout errors if the receiver side of a migration takes a long time applying the VM config (e.g. because prefaulting the memory takes a long time). It does that by making the receiver send keep alive messages during times where the receiver does not listen for new requests.

Unfortunately, this is not easy to test, because we'd need some mechanism that makes the receiver wait for some time at a certain point in the migration. But we tested this at SAP and it worked.

@amphi amphi requested review from Coffeeri, phip1611 and scholzp March 17, 2026 16:25
@amphi amphi self-assigned this Mar 17, 2026
@amphi amphi force-pushed the fix-timeout-during-prefault branch from ecc4580 to 3f1acdc Compare March 17, 2026 16:26
Copy link
Copy Markdown
Member

@phip1611 phip1611 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole keep alive logic is pretty complex. I think for upstreaming, we should perhaps evaluate if we can come up with a somewhat cleaner, simpler, and less invasive architecture.

Thanks for the analysis and fixing this!

Copy link
Copy Markdown

@arctic-alpaca arctic-alpaca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first commit (vm-migration: speed up volatile read and write) doesn't explain how it speeds up the read/write and for me, it's not obvious from the code change.

@amphi amphi force-pushed the fix-timeout-during-prefault branch from 3f1acdc to f0e96f2 Compare March 18, 2026 09:12
amphi added 8 commits March 18, 2026 10:14
We will use the KeepAliveStream also if the migration uses a single TCP
connection from now on. We have seen timeouts if the VM has huge
amounts of memory and prefaults the memory during migration. Thus we
need the keep alive messages also in this case.

On-behalf-of: SAP sebastian.eydam@sap.com
Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>
We will use the KeepAliveStream also on the reciever side of the live
migration, thus it has to implement AsFd.

On-behalf-of: SAP sebastian.eydam@sap.com
Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>
This is important when we wrap the receiver socket into the
KeepAliveStream, because we want readers to wait longer than senders,
and we want readers to wait long enough to see keep alive messages.

On-behalf-of: SAP sebastian.eydam@sap.com
Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>
Otherwise we have to scatter the keep alive handling over the whole code
base, which we don't want.

On-behalf-of: SAP sebastian.eydam@sap.com
Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>
The sender of the live migration usually waits for a response when it
isn't sending requests or doing any work. Thus the receiver should send
keep alive responses to not break the protocol.

On-behalf-of: SAP sebastian.eydam@sap.com
Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>
The sender and receiver side have to behave a bit different to not break
the protocol. Thus we add a bit special handling for both sides.

On-behalf-of: SAP sebastian.eydam@sap.com
Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>
On-behalf-of: SAP sebastian.eydam@sap.com
Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>
On-behalf-of: SAP sebastian.eydam@sap.com
Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>
@amphi amphi force-pushed the fix-timeout-during-prefault branch from 89f2ebe to 82d6812 Compare March 18, 2026 09:14
@tpressure
Copy link
Copy Markdown

This change will break migration from old chv versions to this one. We cannot merge this in this state, otherwise, our customer can no longer migrate during rollouts.

@tpressure
Copy link
Copy Markdown

tpressure commented Mar 18, 2026

@phip1611 do we have a migration protocol version that is negotiated between the two cloud hypervisor instances? If not, we should probably introduce it before we go GA.

@tpressure tpressure marked this pull request as draft March 18, 2026 10:08
@tpressure tpressure marked this pull request as ready for review March 18, 2026 10:28
@tpressure tpressure merged commit c9b2935 into cyberus-technology:gardenlinux Mar 18, 2026
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants