Add NVME disk support #12500

Merged · 6 commits · Nov 6, 2023
3 changes: 3 additions & 0 deletions doc/api-extensions.md
@@ -2318,3 +2318,6 @@ that the endpoint can be used to wait for an operation to complete rather than w
This extension adds support for copying and moving custom storage volumes within a cluster with a single API call.
Calling `POST /1.0/storage-pools/<pool>/custom?target=<target>` will copy the custom volume specified in the `source` part of the request.
Calling `POST /1.0/storage-pools/<pool>/custom/<volume>?target=<target>` will move the custom volume from the source, specified in the `source` part of the request, to the target.

## `disk_io_bus`
This introduces a new `io.bus` property on disk devices, which can be used to override the bus a disk is attached to (`virtio-scsi` or `nvme`).
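For illustration, here is a minimal sketch of setting the new property through the Go client bindings (`github.com/canonical/lxd/client`). The instance name `v1` and device name `data` are placeholders, and the snippet assumes a VM whose server advertises the `disk_io_bus` extension:

```go
// Minimal sketch (assumptions noted above): switch an existing VM disk
// device over to the NVMe bus via the LXD API.
package main

import (
	"log"

	lxd "github.com/canonical/lxd/client"
)

func main() {
	// Connect over the default local UNIX socket.
	c, err := lxd.ConnectLXDUnix("", nil)
	if err != nil {
		log.Fatal(err)
	}

	inst, etag, err := c.GetInstance("v1")
	if err != nil {
		log.Fatal(err)
	}

	// Override the bus for the hypothetical "data" disk device.
	inst.Devices["data"]["io.bus"] = "nvme"

	op, err := c.UpdateInstance("v1", inst.Writable(), etag)
	if err != nil {
		log.Fatal(err)
	}

	err = op.Wait()
	if err != nil {
		log.Fatal(err)
	}
}
```

Leaving `io.bus` unset (or setting it to `virtio-scsi`) keeps the pre-existing default behaviour.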
43 changes: 22 additions & 21 deletions doc/reference/devices_disk.md
@@ -92,24 +92,25 @@ Note that you cannot use initial volume configurations with custom volume option

`disk` devices have the following device options:

Key | Type | Default | Required | Description
:-- | :-- | :-- | :-- | :--
`boot.priority` | integer | - | no | Boot priority for VMs (higher value boots first)
`ceph.cluster_name` | string | `ceph` | no | The cluster name of the Ceph cluster (required for Ceph or CephFS sources)
`ceph.user_name` | string | `admin` | no | The user name of the Ceph cluster (required for Ceph or CephFS sources)
`initial.*` | n/a | - | no | {ref}`devices-disk-initial-config` that allows setting unique configurations independent of default storage pool settings
`io.bus` | string | `virtio-scsi` | no | Only for VMs: Override the bus for the device (`virtio-scsi` or `nvme`)
`io.cache` | string | `none` | no | Only for VMs: Override the caching mode for the device (`none`, `writeback` or `unsafe`)
`limits.max` | string | - | no | I/O limit in byte/s or IOPS for both read and write (same as setting both `limits.read` and `limits.write`)
`limits.read` | string | - | no | I/O limit in byte/s (various suffixes supported, see {ref}`instances-limit-units`) or in IOPS (must be suffixed with `iops`) - see also {ref}`storage-configure-IO`
`limits.write` | string | - | no | I/O limit in byte/s (various suffixes supported, see {ref}`instances-limit-units`) or in IOPS (must be suffixed with `iops`) - see also {ref}`storage-configure-IO`
`path` | string | - | yes | Path inside the instance where the disk will be mounted (only for containers)
`pool` | string | - | no | The storage pool to which the disk device belongs (only applicable for storage volumes managed by LXD)
`propagation` | string | - | no | Controls how a bind-mount is shared between the instance and the host (can be one of `private`, the default, or `shared`, `slave`, `unbindable`, `rshared`, `rslave`, `runbindable`, `rprivate`; see the Linux Kernel [shared subtree](https://www.kernel.org/doc/Documentation/filesystems/sharedsubtree.txt) documentation for a full explanation) <!-- wokeignore:rule=slave -->
`raw.mount.options` | string | - | no | File system specific mount options
`readonly` | bool | `false` | no | Controls whether to make the mount read-only
`recursive` | bool | `false` | no | Controls whether to recursively mount the source path
`required` | bool | `true` | no | Controls whether to fail if the source doesn't exist
`shift` | bool | `false` | no | Sets up a shifting overlay to translate the source UID/GID to match the instance (only for containers)
`size` | string | - | no | Disk size in bytes (various suffixes supported, see {ref}`instances-limit-units`) - only supported for the `rootfs` (`/`)
`size.state` | string | - | no | Same as `size`, but applies to the file-system volume used for saving runtime state in VMs
`source` | string | - | yes | Source of a file system or block device (see {ref}`devices-disk-types` for details)
33 changes: 26 additions & 7 deletions lxd/device/disk.go
@@ -195,13 +195,18 @@ func (d *disk) validateConfig(instConf instance.ConfigReader) error {
"boot.priority": validate.Optional(validate.IsUint32),
"path": validate.IsAny,
"io.cache": validate.Optional(validate.IsOneOf("none", "writeback", "unsafe")),
"io.bus": validate.Optional(validate.IsOneOf("virtio-scsi", "nvme")),
}

err := d.config.Validate(rules)
if err != nil {
return err
}

if instConf.Type() == instancetype.Container && d.config["io.bus"] != "" {
return fmt.Errorf("IO bus configuration cannot be applied to containers")
}

if instConf.Type() == instancetype.Container && d.config["io.cache"] != "" {
return fmt.Errorf("IO cache configuration cannot be applied to containers")
}
@@ -736,11 +741,6 @@ func (d *disk) detectVMPoolMountOpts() []string {
opts = append(opts, DiskIOUring)
}

// Allow the user to override the caching mode.
if d.config["io.cache"] != "" {
opts = append(opts, fmt.Sprintf("cache=%s", d.config["io.cache"]))
}

return opts
}

@@ -751,18 +751,33 @@ func (d *disk) startVM() (*deviceConfig.RunConfig, error) {
revert := revert.New()
defer revert.Fail()

// Handle user overrides.
opts := []string{}

// Allow the user to override the bus.
if d.config["io.bus"] != "" {
opts = append(opts, fmt.Sprintf("bus=%s", d.config["io.bus"]))
}

// Allow the user to override the caching mode.
if d.config["io.cache"] != "" {
opts = append(opts, fmt.Sprintf("cache=%s", d.config["io.cache"]))
}

if shared.IsRootDiskDevice(d.config) {
// Handle previous requests for setting new quotas.
err := d.applyDeferredQuota()
if err != nil {
return nil, err
}

opts = append(opts, d.detectVMPoolMountOpts()...)

runConf.Mounts = []deviceConfig.MountEntryItem{
{
TargetPath: d.config["path"], // Indicator used that this is the root device.
DevName: d.name,
Opts: d.detectVMPoolMountOpts(),
Opts: opts,
},
}

@@ -790,6 +805,7 @@ func (d *disk) startVM() (*deviceConfig.RunConfig, error) {
DevPath: fmt.Sprintf("%s:%d:%s", DiskFileDescriptorMountPrefix, f.Fd(), isoPath),
DevName: d.name,
FSType: "iso9660",
Opts: opts,
},
}

@@ -805,6 +821,7 @@ func (d *disk) startVM() (*deviceConfig.RunConfig, error) {
{
DevPath: DiskGetRBDFormat(clusterName, userName, fields[0], fields[1]),
DevName: d.name,
Opts: opts,
},
}
} else {
@@ -814,6 +831,7 @@ func (d *disk) startVM() (*deviceConfig.RunConfig, error) {
mount := deviceConfig.MountEntryItem{
DevPath: shared.HostPath(d.config["source"]),
DevName: d.name,
Opts: opts,
}

// Mount the pool volume and update srcPath to mount path so it can be recognised as dir
@@ -866,6 +884,7 @@ func (d *disk) startVM() (*deviceConfig.RunConfig, error) {
mount := deviceConfig.MountEntryItem{
DevPath: DiskGetRBDFormat(clusterName, userName, poolName, d.config["source"]),
DevName: d.name,
Opts: opts,
}

if contentType == db.StoragePoolVolumeContentTypeISO {
@@ -884,7 +903,7 @@ func (d *disk) startVM() (*deviceConfig.RunConfig, error) {

revert.Add(revertFunc)

mount.Opts = d.detectVMPoolMountOpts()
mount.Opts = append(mount.Opts, d.detectVMPoolMountOpts()...)
}

if shared.IsTrue(d.config["readonly"]) {
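Worth noting: the override travels from the disk device layer to the QEMU driver as plain `key=value` strings on the mount entry's `Opts` slice. The following self-contained sketch (helper names are illustrative, not part of the patch) shows that round trip:

```go
package main

import (
	"fmt"
	"strings"
)

// encodeOpts mirrors how startVM above builds the user-override options.
func encodeOpts(config map[string]string) []string {
	opts := []string{}

	if config["io.bus"] != "" {
		opts = append(opts, fmt.Sprintf("bus=%s", config["io.bus"]))
	}

	if config["io.cache"] != "" {
		opts = append(opts, fmt.Sprintf("cache=%s", config["io.cache"]))
	}

	return opts
}

// lookupOpt mirrors the prefix scan on the driver side, falling back to a
// default when no override is present.
func lookupOpt(opts []string, key string, def string) string {
	prefix := key + "="
	for _, opt := range opts {
		if strings.HasPrefix(opt, prefix) {
			return strings.TrimPrefix(opt, prefix)
		}
	}

	return def
}

func main() {
	opts := encodeOpts(map[string]string{"io.bus": "nvme"})
	fmt.Println(lookupOpt(opts, "bus", "virtio-scsi")) // nvme
	fmt.Println(lookupOpt(opts, "cache", "none"))      // none (no override set)
}
```

Keeping the transport as opaque strings means a new override such as `bus=` slots in without changing the `MountEntryItem` type.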
105 changes: 85 additions & 20 deletions lxd/instance/drivers/driver_qemu.go
@@ -2129,7 +2129,7 @@ func (d *qemu) deviceAttachBlockDevice(deviceName string, configCopy map[string]
return fmt.Errorf("Failed to connect to QMP monitor: %w", err)
}

monHook, err := d.addDriveConfig(nil, mount)
monHook, err := d.addDriveConfig(nil, nil, mount)
if err != nil {
return fmt.Errorf("Failed to add drive config: %w", err)
}
@@ -3119,12 +3119,38 @@ func (d *qemu) generateQemuConfigFile(cpuInfo *cpuTopology, mountInfo *storagePo
for _, drive := range runConf.Mounts {
var monHook monitorHook

// Check if the user has overridden the bus.
busName := "virtio-scsi"
for _, opt := range drive.Opts {
if !strings.HasPrefix(opt, "bus=") {
continue
}

busName = strings.TrimPrefix(opt, "bus=")
break
}

qemuDev := make(map[string]string)
if busName == "nvme" {
// Allocate a PCI(e) port and write it to the config file so QMP can "hotplug" the
// NVME drive into it later.
devBus, devAddr, multi := bus.allocate(busFunctionGroupNone)

// Populate the qemu device with port info.
qemuDev["bus"] = devBus
qemuDev["addr"] = devAddr

if multi {
qemuDev["multifunction"] = "on"
}
}

if drive.TargetPath == "/" {
monHook, err = d.addRootDriveConfig(mountInfo, bootIndexes, drive)
monHook, err = d.addRootDriveConfig(qemuDev, mountInfo, bootIndexes, drive)
} else if drive.FSType == "9p" {
err = d.addDriveDirConfig(&cfg, bus, fdFiles, &agentMounts, drive)
} else {
monHook, err = d.addDriveConfig(bootIndexes, drive)
monHook, err = d.addDriveConfig(qemuDev, bootIndexes, drive)
}

if err != nil {
@@ -3356,7 +3382,7 @@ func (d *qemu) addFileDescriptor(fdFiles *[]*os.File, file *os.File) int {
}

// addRootDriveConfig adds the qemu config required for adding the root drive.
func (d *qemu) addRootDriveConfig(mountInfo *storagePools.MountInfo, bootIndexes map[string]int, rootDriveConf deviceConfig.MountEntryItem) (monitorHook, error) {
func (d *qemu) addRootDriveConfig(qemuDev map[string]string, mountInfo *storagePools.MountInfo, bootIndexes map[string]int, rootDriveConf deviceConfig.MountEntryItem) (monitorHook, error) {
if rootDriveConf.TargetPath != "/" {
return nil, fmt.Errorf("Non-root drive config supplied")
}
Expand Down Expand Up @@ -3391,7 +3417,7 @@ func (d *qemu) addRootDriveConfig(mountInfo *storagePools.MountInfo, bootIndexes
driveConf.DevPath = device.DiskGetRBDFormat(clusterName, userName, config["ceph.osd.pool_name"], vol.Name())
}

return d.addDriveConfig(bootIndexes, driveConf)
return d.addDriveConfig(qemuDev, bootIndexes, driveConf)
}

// addDriveDirConfig adds the qemu config required for adding a supplementary drive directory share.
@@ -3483,7 +3509,7 @@ func (d *qemu) addDriveDirConfig(cfg *[]cfgSection, bus *qemuBus, fdFiles *[]*os
}

// addDriveConfig adds the qemu config required for adding a supplementary drive.
func (d *qemu) addDriveConfig(bootIndexes map[string]int, driveConf deviceConfig.MountEntryItem) (monitorHook, error) {
func (d *qemu) addDriveConfig(qemuDev map[string]string, bootIndexes map[string]int, driveConf deviceConfig.MountEntryItem) (monitorHook, error) {
aioMode := "native" // Use native kernel async IO and O_DIRECT by default.
cacheMode := "none" // Bypass host cache, use O_DIRECT semantics by default.
media := "disk"
@@ -3580,6 +3606,17 @@ func (d *qemu) addDriveConfig(bootIndexes map[string]int, driveConf deviceConfig
}
}

// Check if the user has overridden the bus.
bus := "virtio-scsi"
for _, opt := range driveConf.Opts {
if !strings.HasPrefix(opt, "bus=") {
continue
}

bus = strings.TrimPrefix(opt, "bus=")
break
}

// Check if the user has overridden the cache mode.
for _, opt := range driveConf.Opts {
if !strings.HasPrefix(opt, "cache=") {
@@ -3709,23 +3746,51 @@ func (d *qemu) addDriveConfig(bootIndexes map[string]int, driveConf deviceConfig
blockDev["locking"] = "off"
}

device := map[string]string{
"id": fmt.Sprintf("%s%s", qemuDeviceIDPrefix, escapedDeviceName),
"drive": blockDev["node-name"].(string),
"bus": "qemu_scsi.0",
"channel": "0",
"lun": "1",
"serial": fmt.Sprintf("%s%s", qemuBlockDevIDPrefix, escapedDeviceName),
if qemuDev == nil {
qemuDev = map[string]string{}
}

if bootIndexes != nil {
device["bootindex"] = strconv.Itoa(bootIndexes[driveConf.DevName])
qemuDev["id"] = fmt.Sprintf("%s%s", qemuDeviceIDPrefix, escapedDeviceName)
qemuDev["drive"] = blockDev["node-name"].(string)
qemuDev["serial"] = fmt.Sprintf("%s%s", qemuBlockDevIDPrefix, escapedDeviceName)

if bus == "virtio-scsi" {
qemuDev["channel"] = "0"
qemuDev["lun"] = "1"
qemuDev["bus"] = "qemu_scsi.0"

if media == "disk" {
qemuDev["driver"] = "scsi-hd"
} else if media == "cdrom" {
qemuDev["driver"] = "scsi-cd"
}
} else if bus == "nvme" {
if qemuDev["bus"] == "" {
// Figure out a hotplug slot.
pciDevID := qemuPCIDeviceIDStart

// Iterate through all the instance devices in the same sorted order as is used when allocating the
// boot time devices in order to find the PCI bus slot device we would have used at boot time.
// Then attempt to use that same device, assuming it is available.
for _, dev := range d.expandedDevices.Sorted() {
if dev.Name == driveConf.DevName {
break // Found our device.
}

pciDevID++
}

pciDeviceName := fmt.Sprintf("%s%d", busDevicePortPrefix, pciDevID)
d.logger.Debug("Using PCI bus device to hotplug NVME into", logger.Ctx{"device": driveConf.DevName, "port": pciDeviceName})
qemuDev["bus"] = pciDeviceName
qemuDev["addr"] = "00.0"
}

qemuDev["driver"] = "nvme"
}

if media == "disk" {
device["driver"] = "scsi-hd"
} else if media == "cdrom" {
device["driver"] = "scsi-cd"
if bootIndexes != nil {
qemuDev["bootindex"] = strconv.Itoa(bootIndexes[driveConf.DevName])
}

monHook := func(m *qmp.Monitor) error {
@@ -3769,7 +3834,7 @@ func (d *qemu) addDriveConfig(bootIndexes map[string]int, driveConf deviceConfig
blockDev["filename"] = fmt.Sprintf("/dev/fdset/%d", info.ID)
}

err := m.AddBlockDevice(blockDev, device)
err := m.AddBlockDevice(blockDev, qemuDev)
if err != nil {
return fmt.Errorf("Failed adding block device for disk device %q: %w", driveConf.DevName, err)
}
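One detail of the hotplug path above: when no PCI port was reserved at boot (`qemuDev["bus"] == ""`), `addDriveConfig` re-derives the port the device would have been given at start-up by walking the instance's devices in the same sorted order used during boot allocation. A standalone sketch of that derivation, with `qemuPCIDeviceIDStart` and `busDevicePortPrefix` stubbed with assumed values:

```go
package main

import (
	"fmt"
	"sort"
)

// Assumed stand-ins for the driver's private constants.
const (
	qemuPCIDeviceIDStart = 4
	busDevicePortPrefix  = "qemu_pcie"
)

// bootTimePCIPort returns the PCI port name devName would have received at
// boot, assuming ports are handed out one per device in sorted-name order.
func bootTimePCIPort(deviceNames []string, devName string) string {
	sort.Strings(deviceNames)

	pciDevID := qemuPCIDeviceIDStart
	for _, name := range deviceNames {
		if name == devName {
			break // Found our device; pciDevID is its boot-time slot.
		}

		pciDevID++
	}

	return fmt.Sprintf("%s%d", busDevicePortPrefix, pciDevID)
}

func main() {
	fmt.Println(bootTimePCIPort([]string{"root", "eth0", "data"}, "data"))
	// Prints "qemu_pcie4" with the assumed constants above.
}
```

Because the walk is deterministic, a drive hotplugged later lands on the same PCI address it would have had at boot, provided the instance's device set has not changed in between.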
1 change: 1 addition & 0 deletions shared/version/api.go
@@ -388,6 +388,7 @@ var APIExtensions = []string{
"disk_initial_volume_configuration",
"operation_wait",
"cluster_internal_custom_volume_copy",
"disk_io_bus",
}

// APIExtensionsCount returns the number of available API extensions.