Fix timeout during prefault #111
ecc4580 to 3f1acdc
phip1611
left a comment
The whole keep alive logic is pretty complex. I think for upstreaming, we should perhaps evaluate if we can come up with a somewhat cleaner, simpler, and less invasive architecture.
Thanks for the analysis and fixing this!
arctic-alpaca
left a comment
The first commit (vm-migration: speed up volatile read and write) doesn't explain how it speeds up the read/write and for me, it's not obvious from the code change.
3f1acdc to f0e96f2
From now on, we also use the KeepAliveStream when the migration uses a single TCP connection. We have seen timeouts when a VM with a huge amount of memory prefaults that memory during migration, so the keep alive messages are needed in this case as well. On-behalf-of: SAP sebastian.eydam@sap.com Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>
We will also use the KeepAliveStream on the receiver side of the live migration, thus it has to implement AsFd. On-behalf-of: SAP sebastian.eydam@sap.com Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>
This is important when we wrap the receiver socket into the KeepAliveStream, because we want readers to wait longer than senders, and we want readers to wait long enough to see keep alive messages. On-behalf-of: SAP sebastian.eydam@sap.com Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>
Otherwise we have to scatter the keep alive handling over the whole code base, which we don't want. On-behalf-of: SAP sebastian.eydam@sap.com Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>
The sender of the live migration usually waits for a response when it isn't sending requests or doing any work. Thus the receiver should send keep alive responses to not break the protocol. On-behalf-of: SAP sebastian.eydam@sap.com Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>
The sender and receiver sides have to behave slightly differently to not break the protocol. Thus we add some special handling for each side. On-behalf-of: SAP sebastian.eydam@sap.com Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>
On-behalf-of: SAP sebastian.eydam@sap.com Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>
On-behalf-of: SAP sebastian.eydam@sap.com Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>
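The commits above argue for handling keep alive messages inside a stream wrapper rather than at every call site. As a rough illustration of that idea only, the sketch below wraps a reader and silently drops keep alive frames. The frame layout (one tag byte, a big-endian `u32` length, then the payload), the tag values, and the names `KeepAliveFilter` and `encode_frame` are assumptions made for this example; they are not the wire format or the `KeepAliveStream` type from this PR.

```rust
use std::io::{self, Read};

// Hypothetical frame tags; not the actual Cloud Hypervisor wire format.
const TAG_KEEP_ALIVE: u8 = 0;
const TAG_DATA: u8 = 1;

/// Wraps any reader and hides keep alive frames from the caller.
pub struct KeepAliveFilter<R: Read> {
    inner: R,
    /// Number of keep alive frames skipped so far (kept for illustration).
    pub skipped: usize,
}

impl<R: Read> KeepAliveFilter<R> {
    pub fn new(inner: R) -> Self {
        Self { inner, skipped: 0 }
    }

    /// Returns the payload of the next data frame, skipping keep alive frames.
    pub fn next_payload(&mut self) -> io::Result<Vec<u8>> {
        loop {
            // Assumed header: 1 tag byte followed by a big-endian u32 length.
            let mut header = [0u8; 5];
            self.inner.read_exact(&mut header)?;
            let len =
                u32::from_be_bytes([header[1], header[2], header[3], header[4]]) as usize;
            let mut payload = vec![0u8; len];
            self.inner.read_exact(&mut payload)?;
            match header[0] {
                TAG_KEEP_ALIVE => self.skipped += 1,
                TAG_DATA => return Ok(payload),
                other => {
                    return Err(io::Error::new(
                        io::ErrorKind::InvalidData,
                        format!("unknown frame tag {}", other),
                    ))
                }
            }
        }
    }
}

/// Encodes one frame in the assumed layout; helper for the example.
pub fn encode_frame(tag: u8, payload: &[u8]) -> Vec<u8> {
    let mut out = vec![tag];
    out.extend_from_slice(&(payload.len() as u32).to_be_bytes());
    out.extend_from_slice(payload);
    out
}
```

With a wrapper like this, callers only ever call `next_payload` and never see keep alive frames, which is the property the commit message asks for.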
89f2ebe to 82d6812
This change will break migration from old chv versions to this one. We cannot merge this in this state, otherwise, our customer can no longer migrate during rollouts.
@phip1611 do we have a migration protocol version that is negotiated between the two cloud hypervisor instances? If not, we should probably introduce it before we go GA.
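To make the question about a negotiated protocol version concrete, here is a minimal sketch of one common approach: each side advertises the range of versions it supports and both settle on the highest shared one. Everything in it is hypothetical, including the `VersionRange` type, the `negotiate` function, and the assumption that keep alive messages would first appear in version 2. It does not describe an existing Cloud Hypervisor mechanism.

```rust
/// Inclusive range of protocol versions one side can speak (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct VersionRange {
    pub oldest: u32,
    pub newest: u32,
}

/// Picks the highest version both sides support.
/// Returns None if the two ranges do not overlap.
pub fn negotiate(sender: VersionRange, receiver: VersionRange) -> Option<u32> {
    let lowest_common = sender.oldest.max(receiver.oldest);
    let highest_common = sender.newest.min(receiver.newest);
    if lowest_common <= highest_common {
        Some(highest_common)
    } else {
        None
    }
}

/// Assumption for this sketch: keep alive messages exist from version 2 on.
pub fn keep_alive_allowed(version: u32) -> bool {
    version >= 2
}
```

Under this scheme, a new receiver talking to an old sender would agree on version 1 and simply not send keep alive messages, so migration from old versions would keep working during rollouts.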
This PR fixes timeout errors that occur when the receiver side of a migration takes a long time to apply the VM config (e.g. because prefaulting the memory takes a long time). It does so by making the receiver send keep alive messages while it is not listening for new requests.
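The general pattern described here, emitting keep alive messages while a long-running step blocks the receiver, can be sketched with a helper thread. This is a simplified illustration and not the code from this PR: `run_with_keep_alive`, the single-byte `KEEP_ALIVE_BYTE` message, and the choice to hand the sink to the helper thread for the duration of the work are all assumptions of the example.

```rust
use std::io::{self, Write};
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Hypothetical one-byte keep alive message.
pub const KEEP_ALIVE_BYTE: u8 = 0xA5;

/// Runs `work` on the calling thread while a helper thread writes a keep
/// alive byte to `sink` every `interval`. Returns the work result together
/// with the sink so the caller can continue using it afterwards.
pub fn run_with_keep_alive<W, T>(
    mut sink: W,
    interval: Duration,
    work: impl FnOnce() -> T,
) -> io::Result<(T, W)>
where
    W: Write + Send + 'static,
{
    let (stop_tx, stop_rx) = mpsc::channel::<()>();
    let heartbeat = thread::spawn(move || -> io::Result<W> {
        loop {
            match stop_rx.recv_timeout(interval) {
                // No stop signal within the interval: the work is still running.
                Err(RecvTimeoutError::Timeout) => {
                    sink.write_all(&[KEEP_ALIVE_BYTE])?;
                    sink.flush()?;
                }
                // Stop signal received or sender dropped: hand the sink back.
                Ok(()) | Err(RecvTimeoutError::Disconnected) => return Ok(sink),
            }
        }
    });

    let result = work();
    // Ignore a send error: it only means the helper already exited.
    let _ = stop_tx.send(());
    let sink = heartbeat.join().expect("heartbeat thread panicked")?;
    Ok((result, sink))
}
```

Because the sink is moved into the helper thread while the work runs, keep alive bytes cannot interleave with other writes from the caller. A real implementation would also have to respect the request/response ordering of the migration protocol, which this sketch ignores.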
Unfortunately, this is not easy to test, because we would need a mechanism that makes the receiver wait for some time at a specific point in the migration. We did test it at SAP, though, and it worked.