hypervisor: add hot plugging support for Qemu Q35 #456
Conversation
@mcastelino @amshinde can you please review it?
```go
	pcieBridge = "pcie"
)

const pciBridgeMaxCapacity = 30
```
- Where does this limit come from?
- Is there any way to avoid hard-coding this?
> Where does this limit come from?

Qemu's documentation is not very good. I found this [1], but I'm not able to hot plug 32 devices; I guess that's because SHPC is enabled.

> Is there any way to avoid hard-coding this?

I don't think so.

[1] https://github.com/qemu/qemu/blob/master/docs/pcie.txt#L76
Having a peek at https://en.wikipedia.org/wiki/PCI_configuration_space, it looks like the per-bus device space is defined by a 5-bit field.
Maybe 0 and 0x1F are reserved, leaving 30? Just a guess. We could add a note saying this limit comes from the PCI spec.
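As a back-of-the-envelope check of the 5-bit-field reading, a tiny sketch (the reserved-slot count here is the guess from this thread, not something confirmed against the spec):

```go
package main

import "fmt"

// usableSlots computes how many hot-pluggable slots remain on one PCI
// bus, given how many slots are assumed reserved. The 5-bit device
// field comes from the PCI configuration address layout; which slots
// are reserved is the guess above, not spec-verified.
func usableSlots(reserved int) int {
	const slotBits = 5
	return 1<<slotBits - reserved
}

func main() {
	fmt.Println(usableSlots(0)) // 32: devices addressable per bus
	fmt.Println(usableSlots(2)) // 30: matches pciBridgeMaxCapacity
}
```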
bridge.go
Outdated
```go
// hotplug drive on this bridge
drive.Bus = b.ID

// hotplug drive on this bridge's addres
```
typo: "address".
bridge.go
Outdated
```go
	return bridges
}

func (b *Bridge) hotplugAddPCIDevice(h hypervisor, drive Drive, devType deviceType) error {
```
Am I missing something? This function and hotplugRemovePCIDevice()
don't seem to be used?
ouch! Sorry, I forgot to push some patches.
PR updated, thanks
@devimc this function is named hotplugAddPCIDevice, but it seems to be a disk-specific implementation.
```
@@ -0,0 +1,122 @@
//
```
Please fix the errors in your commit message:

```
hotplug: Implement Bridge structure

Bridge represents a PCI or PCIE bridge and can be used
to hot plug PCI devices for pc and q35 machine types.

Signed-off-by: Julio Montes <julio.montes@intel.com>
```
fixed, thanks
The title of the commit message is still wrong.
ahh ok s/Implements/Implement/
done, thanks
bridge.go
Outdated
```go
// hotplug drive on this bridge's addres
drive.Addr = fmt.Sprintf("0x%x", addr)

err := h.hotplugAddDevice(drive, devType)
```
I am not sure this is the right way to implement it inside virtcontainers. This should be called from the hypervisor implementation since different hypervisors might handle this in a different way.
@devimc you have not taken this comment into account.
bridge.go
Outdated
```go
	return fmt.Errorf("Unable to hot unplug device %s: not present on bridge", drive.ID)
}

err := h.hotplugRemoveDevice(drive, devType)
```
Ditto
why Ditto [1]?
[1] https://bulbapedia.bulbagarden.net/wiki/Ditto_(Pok%C3%A9mon) 😄
pod.go
Outdated
```go
// Bridges is the list of pci bridges attached to the pod
// pci bridges can be used to hot plug devices
Bridges []*Bridge
```
This should be a hypervisor resource IMO.
I think a bridge is also a resource, same as memory and CPU
@sboeuf @jodh-intel changes applied
device.go
Outdated
```go
if err = h.hotplugAddDevice(drive, blockDev); err != nil {
var err error
for _, b := range c.pod.config.VMConfig.Bridges {
	err = b.hotplugAddPCIDevice(h, drive, blockDev)
```
That's exactly what I was pointing at in a previous comment as not the right way to proceed. hotplugAddPCIDevice() should be called from the hypervisor implementation (no need for h as a parameter in that case), and we should not modify the call here. The hypervisor implementation should be able to detect whether it needs to hotplug a PCI device, I think.
we need to save PCI bridges in the pod state because, before trying to hot plug any device, we must know which slots are available. Is the hypervisor structure saved in the pod state?
@devimc currently, we don't save any data related to the hypervisor in a state file (this relates to the run-time, which is different from the config files storing the hypervisor config). But I think that if you need to store some information so that you can reuse it across different containers, you should maybe create a new state file for the hypervisor.
Before you do that, I'd like to make sure I understand properly what you need.
@sboeuf do we store the POD state somewhere? also where do we store details about the devices attached to the PODs and containers.
bridge.go
Outdated
```go
}

// hotplug drive on this bridge
drive.Bus = b.ID
```
You are updating drive here. Wouldn't you need to pass a pointer to the drive for the update?
bridge.go
Outdated
```go
}

// free address to re-use the same slot with other devices
delete(b.Address, devAddr)
```
You are reusing the pci slot here; have you checked that the predicted drive name you get is correct after doing this?
You may have to change the logic to predict names based on the pci slot address, rather than the index at which the drive is attached on the main bus, which is what is done currently. I had avoided this earlier due to the differences in the addresses available on the pci and pcie buses. But now that we are using bridges with explicit pci addresses, we may need to revisit predicting the drive names.
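For context, a minimal sketch of an index-based prediction scheme like the one being discussed (helper name and exact algorithm are illustrative, not the actual virtcontainers code). Once a freed slot is reused, the attach index no longer reflects the guest's enumeration order, which is why slot-based prediction may become necessary:

```go
package main

import "fmt"

// predictDriveName maps the index at which a drive is attached to the
// virtio-blk name the guest is expected to assign: "vda", "vdb", ...,
// "vdz", "vdaa", and so on (base-26 with letters).
func predictDriveName(index int) string {
	name := ""
	for index >= 0 {
		name = string(rune('a'+index%26)) + name
		index = index/26 - 1
	}
	return "vd" + name
}

func main() {
	fmt.Println(predictDriveName(0))  // vda
	fmt.Println(predictDriveName(25)) // vdz
	fmt.Println(predictDriveName(26)) // vdaa
}
```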
@amshinde that is a good idea. This will ensure we are fully OS independent.
```go
Bus: bus,
ID:  b.ID,
// Each bridge is required to be assigned a unique chassis id > 0
Chassis: (idx + 1),
```
@devimc does it make sense to store the chassis id in the bridge structure for tracking? If not, it becomes implicit.
I think it is not needed, we already have an ID to identify the bridge
qemu.go
Outdated
```go
// Addr is PCI specific, use ExecutePCIDeviceAdd if it's NOT empty
if drive.Addr != "" {
	if err := q.qmpMonitorCh.qmp.ExecutePCIDeviceAdd(q.qmpMonitorCh.ctx, drive.ID, devID, driver, drive.Addr, drive.Bus); err != nil {
```
Best to make this function specific (PCIBlockDeviceAdd) or make it truly generic. Right now the function name does not match the implementation.
```go
Type bridgeType

// ID is used to identify the bridge in the hypervisor
ID string
```
Can we also add the chassis information here?
(force-pushed from fe2d2b7 to eef1642)
@mcastelino @amshinde @sboeuf I made some changes in order to handle Sebastien's comments; could you please re-review? Thanks.
@devimc thanks for the update, I am gonna take a look
Still some changes needed, and I am sorry, but I need to think about the best way to get this information stored for the bridges.
BTW, could you rebase your PR on the latest master and merge some of your commits? It would help to understand better; having too many commits makes it complicated IMO. One commit per feature is usually a good thing.
```go
// Bridge is a bridge where devices can be hot plugged
type Bridge struct {
	// Address contains information about devices plugged and its address in the bridge
	Address map[uint32]string
```
Correct me if I am wrong, but there is no need for a very specific `uint32` type here since that's only an index, right? `uint` should be okay. But on the other hand, if you really think about saving space, then you might want to go with `uint8`, which is going to be enough for 30 possible entries.
pcie has more capacity; for this reason I'd like to keep `uint32`.
```go
	return nil
}

for i := uint32(0); i < n; i++ {
```
Once again, if we really want to limit the memory, then let's use `uint8`.
```go
// addDevice on success adds the device ID to the bridge and return the address
// where the device was added, otherwise an error is returned
func (b *Bridge) addDevice(ID string) (uint32, error) {
	var addr uint32
```
It's more a key than an address.
```go
// otherwise an error is returned
func (b *Bridge) removeDevice(ID string) error {
	// check if the device was hot plugged in the bridge
	for addr, devID := range b.Address {
```
It's more a key than an address. Let's call it `key`.
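To make the add/remove discussion concrete, a self-contained sketch of how the pair could work over the Address map (field and constant names are taken from the snippets above; the first-free-key scan and error messages are assumptions, not the PR's actual code):

```go
package main

import "fmt"

const pciBridgeMaxCapacity = 30

// Bridge mirrors the fields shown in the diff above.
type Bridge struct {
	ID      string
	Address map[uint32]string
}

// addDevice stores the device ID at the first free key and returns it.
func (b *Bridge) addDevice(ID string) (uint32, error) {
	for key := uint32(1); key <= pciBridgeMaxCapacity; key++ {
		if _, ok := b.Address[key]; !ok {
			b.Address[key] = ID
			return key, nil
		}
	}
	return 0, fmt.Errorf("bridge %s is full", b.ID)
}

// removeDevice deletes the entry so the slot can be reused later.
func (b *Bridge) removeDevice(ID string) error {
	for key, devID := range b.Address {
		if devID == ID {
			delete(b.Address, key)
			return nil
		}
	}
	return fmt.Errorf("device %s not found on bridge %s", ID, b.ID)
}

func main() {
	b := &Bridge{ID: "pci-bridge-0", Address: make(map[uint32]string)}
	key, _ := b.addDevice("drive-0")
	fmt.Println(key) // 1

	// After removal the freed slot is handed out again.
	b.removeDevice("drive-0")
	key, _ = b.addDevice("drive-1")
	fmt.Println(key) // 1
}
```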
hypervisor.go
Outdated
```
@@ -161,6 +161,10 @@ type HypervisorConfig struct {
	// Pod configuration VMConfig.Memory overwrites this.
	DefaultMemSz uint32

	// DefaultBridges specifies default number of Bridges for the VM.
```
s/Bridges/bridges/
hypervisor.go
Outdated
```
@@ -161,6 +161,10 @@ type HypervisorConfig struct {
	// Pod configuration VMConfig.Memory overwrites this.
	DefaultMemSz uint32

	// DefaultBridges specifies default number of Bridges for the VM.
	// bridges can be used to hot plug devices
```
s/bridges/Bridges/
device.go
Outdated
```
@@ -232,7 +232,7 @@ func (device *BlockDevice) attach(h hypervisor, c *Container) (err error) {

	device.DeviceInfo.Hotplugged = false
} else {
	if err = h.hotplugAddDevice(drive, blockDev); err != nil {
	if err := c.pod.hypervisor.hotplugAddDevice(drive, blockDev); err != nil {
```
I think you can remove this commit. There is no point in using `c.pod.hypervisor` instead of the provided `hypervisor`.
device.go
Outdated
```go
if device.DeviceInfo.Hotplugged {
	deviceLogger().WithField("device", device.DeviceInfo.HostPath).Info("Unplugging block device")

	drive := Drive{
		ID: makeBlockDevIDForHypervisor(device.DeviceInfo.ID),
	}

	if err := h.hotplugRemoveDevice(drive, blockDev); err != nil {
		deviceLogger().WithError(err).Error("Failed to unplug block device")
	if err := c.pod.hypervisor.hotplugRemoveDevice(drive, blockDev); err != nil {
```
Again, I don't see the point with this; the `hypervisor` object is provided, so it is better to use it than `c.pod.hypervisor`.
hypervisor.go
Outdated
```go
type HypervisorState struct {
	// Bridges is the list of pci bridges attached to the pod
	// pci bridges can be used to hot plug devices
	Bridges []*Bridge
```
You cannot do that, unfortunately... The thing here is that you want to store this information through a file on the host filesystem. If you store a list of pointers, then when another instance of cc-runtime reads this file, those pointers will certainly not contain the data you expect.
You need a list of real objects:

```go
type HypervisorState struct {
	Bridges []Bridge
}
```
pod.go
Outdated
```
@@ -72,6 +72,9 @@ type State struct {

	// Bool to indicate if the drive for a container was hotplugged.
	HotpluggedDrive bool `json:"hotpluggedDrive"`

	// HypervisorState keeps the state of the hypervisor
	HypervisorState HypervisorState `json:"hypervisorState"`
```
This is not the way we have implemented the state files on virtcontainers. Let me think about it, I'll try to come up with a clean solution for this storage purpose (now that I am clear about what we expect).
@amshinde please take a look and give some inputs about this too.
@devimc I have discussed with @amshinde the options we have for implementing the storage of this list of bridges. I know that what we want to store is at a Pod level (because it depends on the VM), but I don't want to store this through the pod state since it is very specific to the hypervisor implementation. Indeed, let's take the example of the Xen hypervisor, and let's say we want to hotplug a new device. In case Xen does not need a bridge because the hotplug works as expected, we don't want to update the Pod state for nothing. What we need is to let every hypervisor implementation decide if it wants to store something or not.

In our case this means several changes; let me try to summarize:

1. Update the Qemu structure:

```go
type qemu struct {
	pod *Pod
	...
	bridges []Bridge
}
```

2. Update the hypervisor interface from:

```go
type hypervisor interface {
	init(config HypervisorConfig) error
	...
}
```

to:

```go
type hypervisor interface {
	init(pod *Pod) error
	...
}
```

The explanation is that we need the pod to be stored inside the qemu structure so that we can access storage anytime we want from the qemu implementation. The Xen implementation would do nothing with this pod since it would not need to store anything. Of course, you're going to have to update the mock implementation, but that's nothing.

3. Modify the doFetchPod() sequence from:

```go
func doFetchPod(podConfig PodConfig) (*Pod, error) {
	...
	if err := hypervisor.init(podConfig.HypervisorConfig); err != nil {
		return nil, err
	}
	...
	p := &Pod{
		id:         podConfig.ID,
		hypervisor: hypervisor,
		...
	}
	...
```

to:

```go
func doFetchPod(podConfig PodConfig) (*Pod, error) {
	...
	p := &Pod{
		id:         podConfig.ID,
		hypervisor: hypervisor,
		...
	}
	if err := hypervisor.init(p); err != nil {
		return nil, err
	}
	...
```

We need this, otherwise we cannot provide the pod to init().

4. Create a new hypervisor.json file.

5. Fetch the hypervisor info:

```go
func (q *qemu) init(pod *Pod) error {
	q.pod = pod
	...
	// Fetch the hypervisor.json if it exists and assign q.bridges with what we read.
}
```

This will run every time the hypervisor is created from doFetchPod().

6. Store the hypervisor info.

@devimc I am trying to think of everything here, but I might have forgotten some details. @amshinde, @sameo please let me know what you think about this proposal and whether something is missing.
@sboeuf your proposal is good and feasible
@sboeuf you have captured almost everything that we discussed. Another point to note: we should consider storing the hypervisor.json at the end of a container lifecycle event vs having it done immediately after a device is added.
@amshinde oh yes you're right, and this should come as a second step (as an improvement).
@devimc I am glad that's clear :)
@sboeuf just one question: if hypervisor.json is hypervisor specific, then the filesystem fetch method should return
@devimc oh yeah let me clarify this. So basically I think you'll have the fetch method returning an
(force-pushed from e48dcbb to 6c938cf)
@sboeuf changes applied
@devimc thanks, I am going to review this!
@devimc a few comments, but this is getting better :)
filesystem.go
Outdated
```
@@ -32,6 +32,8 @@ import (
// pods and containers.
type podResource int

type HypervisorState []byte
```
This is not needed. There is no point in creating this type since we know that what comes out of marshalling is always an array of bytes.
qemu.go
Outdated
```
@@ -39,6 +39,10 @@ type qmpChannel struct {
	qmp *ciaoQemu.QMP
}

type qemuState struct {
```
This must be an exported structure, otherwise the JSON marshalling won't work.
filesystem.go
Outdated
```
@@ -675,10 +728,31 @@ func (fs *filesystem) fetchPodNetwork(podID string) (NetworkNamespace, error) {
	return NetworkNamespace{}, fmt.Errorf("Unknown network type")
}

func (fs *filesystem) fetchHypervisorState(podID string, state interface{}) error {
```
No need for this cast here; we can simplify this function like this:

```go
func (fs *filesystem) fetchHypervisorState(podID string, state interface{}) error {
	data, err := fs.fetchResource(true, podID, "", hypervisorFileType)
	if err != nil {
		return err
	}

	if err = json.Unmarshal(data, state); err != nil {
		return err
	}

	return nil
}
```

`data` being a simple `[]byte` type.
I'll cast data to `[]byte`; if not, I get this error:

```
# github.com/containers/virtcontainers
./filesystem.go:735:25: cannot use data (type interface {}) as type []byte in argument to json.Unmarshal: need type assertion
Makefile:27: recipe for target 'build' failed
make: *** [build] Error 2
```

```go
func (fs *filesystem) fetchHypervisorState(podID string, state interface{}) error {
	data, err := fs.fetchResource(true, podID, "", hypervisorFileType)
	if err != nil {
		return err
	}

	if err = json.Unmarshal(data.([]byte), state); err != nil {
		return err
	}

	return nil
}
```
@devimc you're right but there is an underlying problem here and it's not related to your PR, but to the way the storage (and the fetch specifically) is implemented in virtcontainers.
I have submitted #472 in order to make this more flexible and it should solve your issues. I am waiting for @sameo to review this PR and hopefully this will get merged soon.
```go
	return err
}

if err := q.pod.storage.storeHypervisorState(q.pod.id, q.state); err != nil {
```
This can be improved later (moving this to container.go where we actually add the devices), but that's fine for now.
```go
	return err
}

if err := q.pod.storage.storeHypervisorState(q.pod.id, q.state); err != nil {
```
Ditto.
(force-pushed from 1a1208f to 2826db8)
One comment, but everything else looks fine
```
@@ -598,10 +610,23 @@ func (fs *filesystem) fetchPodNetwork(podID string) (NetworkNamespace, error) {
	return networkNS, nil
}

func (fs *filesystem) fetchHypervisorState(podID string, state interface{}) error {
	return fs.fetchResource(true, podID, "", hypervisorFileType, state)
```
This is perfect, but look at `commonResourceChecks()`: you have to update the list of known file types.
change applied, thanks
LGTM
(force-pushed from 8fe21f4 to d0114d8)
Qemu Q35 machine type does not support hot plugging devices
directly on pcie.0, hence we have to find a way to allow users
to hot plug N devices.
The only way to hot plug devices in Qemu Q35 is through PCI bridges.
Each PCI bridge is able to support up to 32 devices, therefore we have a
limit on the number of devices we can hot plug.
This patch adds support for hot plugging devices on PCI bridges in
pc and q35 machine types.

Signed-off-by: Julio Montes <julio.montes@intel.com>