Skip to content
This repository has been archived by the owner on Dec 14, 2023. It is now read-only.

OvmfPkg: use regular PCI bus enumeration for bhyve #9

Open
wants to merge 1 commit into
base: bhyve/edk2-stable201903
Choose a base branch
from

Conversation

d-scott-phillips
Copy link

With the initial import of bhyve functionality, I set
PcdPciDisableBusEnumeration = TRUE to follow the UDK2014.SP1 bhyve
firmware's behavior of using the no-enumeration PCI Dxes from
DuetPkg.

The enumeration disabled path is for Xen, which assumes that PCI
devices are configured in the way that Xen configures them, and
has various asserts to ensure this. Bhyve doesn't meet all the
expectations of this code path, specifically that all 64-bit
capable memory spaces are programmed at an address at or above 4G.

This prevents successfully booting with bhyve's nvme emulation, so
here we switch to using the PCI enumeration path like QEMU.

With the initial import of bhyve functionality, I set
PcdPciDisableBusEnumeration = TRUE to follow the UDK2014.SP1 bhyve
firmware's behavior of using the no-enumeration PCI Dxes from
DuetPkg.

The enumeration disabled path is for Xen, which assumes that PCI
devices are configured in the way that Xen configures them, and
has various asserts to ensure this. Bhyve doesn't meet all the
expectations of this code path, specifically that all 64-bit
capable memory spaces are programmed at an address at or above 4G.

This prevents successfully booting with bhyve's nvme emulation, so
here we switch to using the PCI enumeration path like QEMU.
@lupinglade
Copy link

lupinglade commented Jan 11, 2020

This seems to break pci_fbuf because UEFI is changing the frame buffer memory address behind pci_fbuf's back. It will crash at the assert(bardix == 0) in pci_fbuf because writing to the frame buffer is no longer accessing the video memory address.

@sjorge
Copy link

sjorge commented Jan 17, 2020

We're also hitting this on illumos using a firmware with this change.

@citrus-it
Copy link

@d-scott-phillips do you have any thoughts on this? In illumos we currently prefer fbuf over nvme but would obviously like both! How hard is adjusting things so that we can enable the bus enumeration without breaking fbuf? Thanks!

@d-scott-phillips
Copy link
Author

The fix here would be to add the ability to relocate fbuf's frame buffer BAR as in: https://reviews.freebsd.org/D24066

@citrus-it
Copy link

Fantastic, thank you for that patch. I'll pull it into OmniOS and test it there too.

citrus-it added a commit to citrus-it/illumos-omnios that referenced this pull request Mar 14, 2020
From https://reviews.freebsd.org/D24066

We want to allow the UEFI firmware to enumerate and assign
addresses to PCI devices so we can boot from NVMe[1]. One issue
with this is that fbuf uses a special memory segment for the frame
buffer memory. The pci_emul code can handle address assignments to
PCI BARs, but doesn't know anything about the frame buffer memory
segment. Here we add a callback to emulated PCI devices to inform
them of a BAR configuration change. pci_fbuf then watches for
these BAR changes and relocates the frame buffer memory segment.
We also add a new VM_MUNMAP_MEMSEG ioctl to allow pci_fbuf to
actually change the frame buffer mapping.

[1]: freebsd/uefi-edk2#9
citrus-it added a commit to citrus-it/illumos-omnios that referenced this pull request Mar 14, 2020
From https://reviews.freebsd.org/D24066

We want to allow the UEFI firmware to enumerate and assign
addresses to PCI devices so we can boot from NVMe[1]. One issue
with this is that fbuf uses a special memory segment for the frame
buffer memory. The pci_emul code can handle address assignments to
PCI BARs, but doesn't know anything about the frame buffer memory
segment. Here we add a callback to emulated PCI devices to inform
them of a BAR configuration change. pci_fbuf then watches for
these BAR changes and relocates the frame buffer memory segment.
We also add a new VM_MUNMAP_MEMSEG ioctl to allow pci_fbuf to
actually change the frame buffer mapping.

[1]: freebsd/uefi-edk2#9
citrus-it added a commit to citrus-it/illumos-omnios that referenced this pull request Mar 14, 2020
From https://reviews.freebsd.org/D24066

We want to allow the UEFI firmware to enumerate and assign
addresses to PCI devices so we can boot from NVMe[1]. One issue
with this is that fbuf uses a special memory segment for the frame
buffer memory. The pci_emul code can handle address assignments to
PCI BARs, but doesn't know anything about the frame buffer memory
segment. Here we add a callback to emulated PCI devices to inform
them of a BAR configuration change. pci_fbuf then watches for
these BAR changes and relocates the frame buffer memory segment.
We also add a new VM_MUNMAP_MEMSEG ioctl to allow pci_fbuf to
actually change the frame buffer mapping.

[1]: freebsd/uefi-edk2#9
freebsd-git pushed a commit to freebsd/freebsd-src that referenced this pull request Mar 19, 2021
We want to allow the UEFI firmware to enumerate and assign
addresses to PCI devices so we can boot from NVMe[1]. Address
assignment of PCI BARs is properly handled by the PCI emulation
code in general, but a few specific cases need additional support.
fbuf and passthru map additional objects into the guest physical
address space and so need to handle address updates. Here we add a
callback to emulated PCI devices to inform them of a BAR
configuration change. fbuf and passthru then watch for these BAR
changes and relocate the frame buffer memory segment and passthru
device mmio area respectively.

We also add new VM_MUNMAP_MEMSEG and VM_UNMAP_PPTDEV_MMIO ioctls
to vmm(4) to facilitate the unmapping needed for addres updates.

[1]: freebsd/uefi-edk2#9

Originally by:	scottph
MFC After:	1 week
Sponsored by:	Intel Corporation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D24066
@khng300
Copy link
Member

khng300 commented Mar 24, 2021

Since D24066 has been committed to src, we probably would like to make this changes on upstream as well.

freebsd-git pushed a commit to freebsd/freebsd-src that referenced this pull request Mar 26, 2021
We want to allow the UEFI firmware to enumerate and assign
addresses to PCI devices so we can boot from NVMe[1]. Address
assignment of PCI BARs is properly handled by the PCI emulation
code in general, but a few specific cases need additional support.
fbuf and passthru map additional objects into the guest physical
address space and so need to handle address updates. Here we add a
callback to emulated PCI devices to inform them of a BAR
configuration change. fbuf and passthru then watch for these BAR
changes and relocate the frame buffer memory segment and passthru
device mmio area respectively.

We also add new VM_MUNMAP_MEMSEG and VM_UNMAP_PPTDEV_MMIO ioctls
to vmm(4) to facilitate the unmapping needed for addres updates.

[1]: freebsd/uefi-edk2#9

Originally by:	scottph
Sponsored by:	Intel Corporation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D24066

(cherry picked from commit f8a6ec2)
Yamagi pushed a commit to Yamagi/freebsd that referenced this pull request Jul 5, 2021
We want to allow the UEFI firmware to enumerate and assign
addresses to PCI devices so we can boot from NVMe[1]. Address
assignment of PCI BARs is properly handled by the PCI emulation
code in general, but a few specific cases need additional support.
fbuf and passthru map additional objects into the guest physical
address space and so need to handle address updates. Here we add a
callback to emulated PCI devices to inform them of a BAR
configuration change. fbuf and passthru then watch for these BAR
changes and relocate the frame buffer memory segment and passthru
device mmio area respectively.

We also add new VM_MUNMAP_MEMSEG and VM_UNMAP_PPTDEV_MMIO ioctls
to vmm(4) to facilitate the unmapping needed for addres updates.

[1]: freebsd/uefi-edk2#9

Originally by:	scottph
MFC After:	1 week
Sponsored by:	Intel Corporation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D24066
Yamagi pushed a commit to Yamagi/freebsd that referenced this pull request Jul 6, 2021
libvmm: clean up vmmapi.h

struct checkpoint_op, enum checkpoint_opcodes, and
MAX_SNAPSHOT_VMNAME are not vmm specific, move them out of the vmmapi
header.

They are used for the save/restore functionality that bhyve(8)
provides and are better suited in usr.sbin/bhyve/snapshot.h

Since bhyvectl(8) requires these, the Makefile for bhyvectl has been
modified to include usr.sbin/bhyve/snapshot.h

Reviewed by:    kevans, grehan
Differential Revision:  https://reviews.freebsd.org/D28410

----

bhyve/snapshot: drop mkdir when creating the unix domain socket

Add /var/run/bhyve/ to BSD.var.dist so we don't have to call mkdir when
creating the unix domain socket for a given bhyve vm.

The path to the unix domain socket for a bhyve vm will now be
/var/run/bhyve/vmname instead of /var/run/bhyve/checkpoint/vmname

Move BHYVE_RUN_DIR from snapshot.c to snapshot.h so it can be shared
to bhyvectl(8).

Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D28783

----

bhyve/snapshot: rename checkpoint_opcodes to be more generic

Generalize the naming here since the domain socket that uses these codes
might be used for purposes other than the save/restore feature.

- rename checkpoint_opcodes to ipc_opcode

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28877

----

bhyvectl: reduce code duplication

Combine send_start_checkpoint() and send_start_suspend() into a
single function named snapshot_request().

snapshot_request() is equivalent to send_start_checkpoint() and
send_start_suspend() except that it takes an additional argument. The
additional argument, enum ipc_opcode, is used to determine the type of
snapshot request being performed. Also, switch to using strlcpy instead
of strncpy.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28878

----

bhyve/snapshot: rename and bump size of MAX_SNAPSHOT_VMNAME

MAX_SNAPSHOT_VMNAME is a macro used to set the size of a character
buffer that stores a filename or the path to a file - this file is used
by the save/restore feature.

Since the file doesn't have anything to do with a vm name, rename
MAX_SNAPSHOT_VMNAME to MAX_SNAPSHOT_FILENAME. Bump the size to PATH_MAX
while here.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28879

----

bhyvectl: print a better error message when vm_open() fails

Use errno to print a more descriptive error message when vm_open() fails

libvmm: preserve errno when vm_device_open() fails

vm_destroy() squashes errno by making a dive into sysctlbyname() - we
can safely skip vm_destroy() here since it's not doing any critical
clean up at this point. Replace vm_destroy() with a free() call.

PR:             250671
MFC after:      3 days
Submitted by:   marko@apache.org
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D29109

----

bhyve/snapshot: use SOCK_DGRAM instead of SOCK_STREAM

The save/restore feature uses a unix domain socket to send messages
from bhyvectl(8) to a bhyve(8) process. A datagram socket will suffice
for this.

An added benefit of using a datagram socket is simplified code. For
bhyve, the listen/accept calls are dropped; and for bhyvectl, the
connect() call is dropped.

EPRINTLN handles raw mode for bhyve(8), use it to print error messages.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28983

----

bhyve: virtio shares definitions between sys/dev/virtio

Definitions inside usr.sbin/bhyve/virtio.h are thrown away.
Definitions in sys/dev/virtio are used instead.

This reduces code duplication.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29084

----

Refactor configuration management in bhyve.

Replace the existing ad-hoc configuration via various global variables
with a small database of key-value pairs.  The database supports
heirarchical keys using a MIB-like syntax to name the path to a given
key.  Values are always stored as strings.  The API used to manage
configuation values does include wrappers to handling boolean values.
Other values use non-string types require parsing by consumers.

The configuration values are stored in a tree using nvlists.  Leaf
nodes hold string values.  Configuration values are permitted to
reference other configuration values using '%(name)'.  This permits
constructing template configurations.

All existing command line arguments now set configuration values.  For
devices, the "-s" option parses its option argument to generate a list
of key-value pairs for the given device.

A new '-o' command line option permits setting an individual
configuration variable.  The key name is always given as a full path
of dot-separated components.

A new '-k' command line option parses a simple configuration file.
This configuration file holds a flat list of 'key=value' lines where
the 'key' is the full path of a configuration variable.  Lines
starting with a '#' are comments.

In general, bhyve starts by parsing command line options in sequence
and applying those settings to configuration values.  Once this is
complete, bhyve then begins initializing its state based on the
configuration values.  This means that subsequent configuration
options or files may override or supplement previously given settings.

A special 'config.dump' configuration value can be set to true to help
debug configuration issues.  When this value is set, bhyve will print
out the configuration variables as a flat list of 'key=value' lines.

Most command line argments map to a single configuration variable,
e.g.  '-w' sets the 'x86.strictmsr' value to false.  A few command
line arguments have less obvious effects:

- Multiple '-p' options append their values (as a comma-seperated
  list) to "vcpu.N.cpuset" values (where N is a decimal vcpu number).

- For '-s' options, a pci.<bus>.<slot>.<function> node is created.
  The first argument to '-s' (the device type) is used as the value of
  a "device" variable.  Additional comma-separated arguments are then
  parsed into 'key=value' pairs and used to set additional variables
  under the device node.  A PCI device emulation driver can provide
  its own hook to override the parsing of the additonal '-s' arguments
  after the device type.

  After the configuration phase as completed, the init_pci hook
  then walks the "pci.<bus>.<slot>.<func>" nodes.  It uses the
  "device" value to find the device model to use.  The device
  model's init routine is passed a reference to its nvlist node
  in the configuration tree which it can query for specific
  variables.

  The result is that a lot of the string parsing is removed from
  the device models and centralized.  In addition, adding a new
  variable just requires teaching the model to look for the new
  variable.

- For '-l' options, a similar model is used where the string is
  parsed into values that are later read during initialization.
  One key note here is that the serial ports use the commonly
  used lowercase names from existing documentation and examples
  (e.g. "lpc.com1") instead of the uppercase names previously
  used internally in bhyve.

Reviewed by:	grehan
MFC after:	3 months
Differential Revision:	https://reviews.freebsd.org/D26035

----

bhyve: support relocating fbuf and passthru data BARs

We want to allow the UEFI firmware to enumerate and assign
addresses to PCI devices so we can boot from NVMe[1]. Address
assignment of PCI BARs is properly handled by the PCI emulation
code in general, but a few specific cases need additional support.
fbuf and passthru map additional objects into the guest physical
address space and so need to handle address updates. Here we add a
callback to emulated PCI devices to inform them of a BAR
configuration change. fbuf and passthru then watch for these BAR
changes and relocate the frame buffer memory segment and passthru
device mmio area respectively.

We also add new VM_MUNMAP_MEMSEG and VM_UNMAP_PPTDEV_MMIO ioctls
to vmm(4) to facilitate the unmapping needed for addres updates.

[1]: freebsd/uefi-edk2#9

Originally by:	scottph
MFC After:	1 week
Sponsored by:	Intel Corporation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D24066

----

bhyve amd: Small cleanups in amdvi_dump_cmds

Bump offset with MOD_INC instead in amdvi_dump_cmds.

Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D28862

----

bhyve hostbridge: Rename "device" property to "devid".

"device" is already used as the generic PCI-level name of the device
model to use (e.g. "hostbridge").  The result was that parsing
"hostbridge" as an integer failed and the host bridge used a device ID
of 0.  The EFI ROM asserts that the device ID of the hostbridge is not
0, so booting with the current EFI ROM was failing during the ROM
boot.

Fixes:		621b509

----

bhyve: Enable virtio-scsi legacy config parsing.

The previous commit added the handler to parse the command line
options for virtio-scsi devices but forgot to set the correct function
pointer to point to the handler.

Reported by:	vangyzen
Reviewed by:	vangyzen
Fixes:		621b509
Differential Revision:	https://reviews.freebsd.org/D29438

----

bhyve: change vq_getchain to return iovecs in both directions

The old prototype requires callers to inspect flags of each descriptors
to get the starting position of host-writable iovecs.

vq_getchain() is changed to return a virtio request with the number of
host-readable iovecs and host-writable iovecs instead. Callers can avoid
boilerplate code of getting the start offset of host-writable iovecs.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Reviewed by:	afedorov
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29433

----

Fix typo in xhci nvlist node name, and also increment device counter.

This allows the xhci tablet device to be recognized and a PCI device
instantiated.

Reviewed by:	jhb
Fixes:		621b509 Refactor configuration management in bhyve.
MFC after:	3 months.

----

bhyve: fix regression in legacy virtio-9p config parsing

Commit 621b509 introduced a regression
in legacy virtio-9p config parsing by not initializing *sharename to
NULL. As a result, "sharename != NULL" check in the first iteration fails
and bhyve exits with "virtio-9p: more than one share name given".

Fix by adding NULL back.

Approved by:	grehan

----

bhyve: add SMBIOS Baseboard Information

Add the System Management BIOS Baseboard (or Module) Information
a.k.a. Type 2 structure to the SMBIOS emulation.

Reviewed by:	rgrimes, bcran, grehan
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29657

----

bhyve: Move the gdb_active check to gdb_cpu_suspend().

The check needs to be in the public routine (gdb_cpu_suspend()), not
in the internal routine called from various places
(_gdb_cpu_suspend()).  All the other callers of _gdb_cpu_suspend()
already check gdb_active, and this breaks the use of snapshots when
the debug server is not enabled since gdb_cpu_suspend() tries to lock
an uninitialized mutex.

Reported by:	Darius Mihai, Elena Mihailescu
Reviewed by:	elenamihailescu22_gmail.com
Fixes:		621b509
Differential Revision:	https://reviews.freebsd.org/D29538

----

bhyve: implement rdmsr for MSR_IA32_FEATURE_CONTROL

Without the -w option, Windows guests crash on boot. This is caused by a rdmsr
of MSR_IA32_FEATURE_CONTROL. Windows checks this MSR to determine enabled VMX
features. This MSR isn't emulated in bhyve, so a #GP exception is injected
which causes Windows to crash.

Fix by returning a rdmsr of MSR_IA32_FEATURE_CONTROL with Lock Bit set and
VMX disabled to informWindows that VMX isn't available.

Reviewed by:	jhb, grehan (bhyve)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29665

----

bhyve.8: Make synopsis more readable

There is no need to squeeze all the possible options into one synopsis
entry. Let "-l help" and "-s help" be listed separately.

While here, keep -s and its arguments on the same line.

MFC after:	2 weeks

----

bhyve: Fix synopsis in the usage message

In particular:
- Sort short options to align with style(9)
- Add two missing flags: -G and -r
- Drop unnecessary angle brackets for consistency
- Rename the "vm" argument to vmname for consistency with the manual
  page

MFC after:	2 weeks

----

bhyve: Improve the option description in the usage message

- Sort options as suggested by style(9)
- Capitalize some words like CPU and HLT
- Add a missing description for the -G flag

MFC after:	2 weeks

----

bhyve.8: Sort the options in the OPTIONS section

No content change intended. Just moving the option descriptions around
to follow the order suggested by style(9).

MFC after:	2 weeks

----

bhyve.8: Improve the description and synopsis of -l

- Describe "-l help" separately for readability.
- List all the supported comX devices explicitly
- Use Cm instead of Ar for command modifiers (i.e., literal values a
  user can specify as an argument to the command).
- Explain where to get more information about the possible values of the
  conf argument.

MFC after:	2 weeks

----

bhyve.8: Improve the description of the -m flag

- Stylize the synopsis with proper mdoc macros
- Do some wordsmithing on the description for consistency.

MFC after:	2 weeks

----

bhyve.8: Fix the synopsis of -p

Use appropriate mdoc macros.

MFC after:	2 weeks

----

bhyve.8: Clean up description of -r

There is no need to wrap those flags in Op macros.

MFC after:	2 weeks

----

bhyve.8: Fix indention in the signals table

MFC after:	2 weeks

----

bhyve.8: Clean-up synopsis of -s

- Document "-s help" separately for readability.
- Use appropriate mdoc macros.

MFC after:	2 weeks

----

bhyve.8: Clean up the slot description of -s

Also, remove the macros of the nested list which contained slot,
emulation and conf. This decreases the indention of the -s description.
It was necessary to clean up the slot description.

MFC after:	2 weeks

----

bhyve.8: Improve emulation description of the -s flag

- Set width of the list to the longest key word for readability.
- Separate descriptions of amd_hostbridge and hostbridge emulations.
  Also, wordsmith their descriptions for consistency with other entries.
- Use Cm instead of Li for command modifiers.
- Do not stylize AMD with Li, there's no need to do it.
- Mention COM3 and COM4 in the definition of lpc.
- Fix a typo in the definition of ahci-hd ("hard drive" instead of
  "hard-drive").

MFC after:	2 weeks

----

bhyve.8: Clean up network backends section

- Reformat the format lists, use appropriate mdoc macros for
  readability.
- Add a missing Oxford comma.

MFC after:	2 weeks

----

bhyve.8: Clean up block storage device backends description

MFC after:	2 weeks

----

bhyve.8: Clean up SCSI device backends section

MFC after:	2 weeks

----

bhyve.8: Clean up 9P device backends section

MFC after:	2 weeks

----

bhyve.8: Clean up TTY, boot ROM, and pass-through descriptions

MFC after:	2 weeks

----

bhyve.8: Clean up virtio console device backends description

MFC after:	2 weeks

----

bhyve.8: Improve framebuffer backends description

- Use appropriate mdoc macros
- Document that tcp= is a synonym to rfb= (tcp is used in the examples,
  but never mentioned)
- Clarify the IP address specification

MFC after:	2 weeks

----

bhyve.8: Improve documentation of NVME backend

- Document the configuration format.
- Document two additional configuration options: eui64 and dsm.

MFC after:	2 weeks

----

bhyve.8: Improve AHCI backends documentation

- Document the backend format.

MFC after:	2 weeks

----

bhyve: Document the format for HD audio backends

- This change is done for consistency with other backend definitions.

MFC after:	2 weeks

----

bhyve.8: Fix mandoc -Tlint issues

While here, keep network backends section consistent with other
sections.

MFC after:	2 weeks

----

bhyve: Be explicit that setting config.dump will not start a VM.

Suggested by:	rpokala
Reviewed by:	bcr (manpages)
Differential Revision:	https://reviews.freebsd.org/D29738

----

bhyve: Gracefully handle virtio-scsi with no conf

Fixes segfault with the command `bhyve -s 0,virtio-scsi`, which is used
by some third party software to probe bhyve for virtio-scsi support.

Reviewed by:	jhb
MFC after:	1 day
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D29926

----

bhyve: Set SO_REUSEADDR on the gdb stub socket

Reviewed by:	jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30037

----

bhyve/snapshot: provide a way to send other messages/data to bhyve

This is a step towards sending messages (other than suspend/checkpoint)
from bhyvectl to bhyve.

Introduce a new struct, ipc_message - this struct stores the type of
message and a union containing message specific structures for the type
of message being sent.

Reviewed by:    grehan
Differential Revision: https://reviews.freebsd.org/D30221

----

bhyve/snapshot: split up mutex/cond initialization from socket creation

Move initialization of the mutex/condition variables required by the
save/restore feature to their own function.

The unix domain socket that facilitates communication between bhyvectl
and bhyve doesn't rely on these variables in order to be functional.

Reviewed by:    markj
Differential Revision:  https://reviews.freebsd.org/D30281

----

Add a virtio-input device emulation.

This will be used to inject keyboard/mouse input events into a guest.
The command line syntax is:
   -s <slot>,virtio-input,/dev/input/eventX

Reviewed by:	jhb (bhyve), grehan
Obtained from:	Corvin Köhne <C.Koehne@beckhoff.com>
MFC after:	3 weeks
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D30020

----

bhyve: Register new kevents synchronously.

Change mevent_add*() to synchronously add the new kevent.  This
permits reporting event registration failures to the caller and avoids
failing the registration of other, unrelated events queued up in the
same batch.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30502

----

bhyve: Add support for EVFILT_VNODE mevents.

This allows registering an event to watch for changes to a file's
attributes.  This is a bit imperfect as it would be nice to have a way
to determine if an fd can use EVFILT_VNODE successfully.  mevent's
current structure does not permit that and a failure to register a
single kevent impacts several other kevents.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30503

----

bhyve: Add support for handling disk resize events to block_if.

Allow clients of blockif to register a resize callback handler.  When
a callback is registered, register an EVFILT_VNODE kevent watching the
backing store for a change in the file's attributes.  If the size has
changed when the kevent fires, invoke the clients' callback.

Currently resize detection is limited to backing stores that support
EVFILT_VNODE kevents such as regular files.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30504

----

bhyve: Split out a lower-level helper for VirtIO interrupts.

This allows device models to assert VirtIO interrupts for reasons
other than publishing changes to a VirtIO ring such as configuration
changes.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30505

----

bhyve vtblk: Inform guests of disk resize events.

Register a resize callback with the blockif interface.  When the
callback fires, update the size of the disk and notify the guest via a
configuration change interrupt.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30506

----

bhyve: enhance debug info for memory range clash

Explain what the two clashing regions are.

Reivewed by:		grehan, jhb
Differential Revision:	https://reviews.freebsd.org/D29696
Pull Request:		freebsd/freebsd-src#463

----

Add more GIC and GICv3 registers

These aren't used by either driver, however they will be needed by
bhyve on arm64 to emulate a GICv3 interrupt controller.

Sponsored by:	Innovate UK

----

bhyve: Fix cli regression with NVMe ram

The configuration management refactoring inadvertently removed support
for a RAM-backed NVMe Namespace (i.e. -s X,nvme,ram=16384). This adds it
back.

Reported by:	andy@omniosce.org
Reviewed by:	jhb, andy@omniosce.org
Fixes:		621b509
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D30717

----

bhyve: fix NVMe MDTS comment

Removes an obsolete comment and adds parenthesis around the macro while
in the area. No functional change.

----

bhyve: Fix NVMe iovec construction for large IOs

The UEFI driver included with Rocky Linux 8.4 uncovered an existing bug
in the NVMe emulation's construction of iovec's.

By default, NVMe data transfer operations use a scatter-gather list in
which all entries point to a fixed size memory region. For example, if
the Memory Page Size is 4KiB, a 2MiB IO requires 512 entries. Lists
themselves are also fixed size (default is 512 entries).

Because the list size is fixed, the last entry is special. If the IO
requires more than 512 entries, the last entry in the list contains the
address of the next list of entries. But if the IO requires exactly 512
entries, the last entry points to data.

The NVMe emulation missed this logic and unconditionally treated the
last entry as a pointer to the next list. Fix is to check if the
remaining data is greater than the page size before using the last entry
as a pointer to the next list.

PR:		256422
Reported by:	dave@syix.com
Tested by:	jason@tubnor.net
MFC after:	5 days
Relnotes:	yes
Reviewed by:	imp, grehan
Differential Revision:	https://reviews.freebsd.org/D30897

----

Append Keyboard Layout specified option for using VNC.
Part one: supporting QEMU Extended Keyboard Event Message

PR:             246121
Submitted by:   koinec@yahoo.co.jp
Differential Revision: https://reviews.freebsd.org/D29430

----

libvmm: explicitly save and restore errno in vm_open()

In commit 6bb140e, vm_destroy() was replaced with free() to
preserve errno. However, it's possible that free() may change the errno
as well. Keep the free() call, but explicitly save and restore errno.

Noted by: jhb
Fixes: 6bb140e

----

vmm: Let guests enable SMEP/SMAP if the host supports it

Reviewed by:	kib, grehan, jhb
Tested by:	grehan (AMD)
MFC after:	3 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30462

----

vmm: Fix ivrs_drv device_printf usage

The original %b description string is wrong.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	imp, jhb
Differential Revision:	https://reviews.freebsd.org/D30805

----

bhyve/vioapic: remove an extra pin masked check

vioapic_send_intr does already check whether the pin is masked before
injecting the interrupt, there's no need to do it in vioapic_write
also.

No functional change intended.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28236

----

bhyve/ioapic: only account for asserted line in level mode

After modifying a redirection entry only try to inject an interrupt if
the pin is in level mode, pins in edge mode shouldn't take into
account the line assert status as they are triggered by edge changes,
not the line status itself.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28237

----

bhyve/ioapic: improve the tracking of IRR bit

One common method of EOI'ing an interrupt at the IO-APIC level is to
switch the pin to edge triggering mode and then back into level mode.
That would cause the IRR bit to be cleared and thus further interrupts
to be injected. FreeBSD does indeed use that method if the IO-APIC EOI
register is not supported.

The bhyve IO-APIC emulation code didn't clear the IRR bit when doing
that switch, and was also missing acknowledging the IRR state when
trying to inject an interrupt in vioapic_send_intr.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28238

----

ivrs_drv: Fix IVHDs with duplicated BaseAddress

Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28945

----

AMD-vi: Fix IOMMU device interrupts being overridden

Currently, AMD-vi PCI-e passthrough will lead to the following lines in
dmesg:
"kernel: CPU0: local APIC error 0x40
ivhd0: Error: completion failed tail:0x720, head:0x0."

After some tracing, the problem is due to the interaction with
amdvi_alloc_intr_resources() and pci_driver_added(). In ivrs_drv, the
identification of AMD-vi IVHD is done by walking over the ACPI IVRS
table and ivhdX device_ts are added under the acpi bus, while there are
no driver handling the corresponding IOMMU PCI function. In
amdvi_alloc_intr_resources(), the MSI intr are allocated with the ivhdX
device_t instead of the IOMMU PCI function device_t. bus_setup_intr() is
called on ivhdX. the IOMMU pci function device_t is only used for
pci_enable_msi(). Since bus_setup_intr() is not called on IOMMU pci
function, the IOMMU PCI function device_t's dinfo->cfg.msi is never
updated to reflect the supposed msi_data and msi_addr. So the msi_data
and msi_addr stay in the value 0. When pci_driver_added() tried to loop
over the children of a pci bus, and do pci_cfg_restore() on each of
them, msi_addr and msi_data with value 0 will be written to the MSI
capability of the IOMMU pci function, thus explaining the errors in
dmesg.

This change includes an amdiommu driver which currently does attaching,
detaching and providing DEVMETHODs for setting up and tearing down
interrupt. The purpose of the driver is to prevent pci_driver_added()
from calling pci_cfg_restore() on the IOMMU PCI function device_t.
The introduction of the amdiommu driver handles allocation of an IRQ
resource within the IOMMU PCI function, so that the dinfo->cfg.msi is
populated.

This has been tested on EPYC Rome 7282 with Radeon 5700XT GPU.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28984

----

Correct "Fondation" typo (missing "u")

----

AMD-vi: Mixed format IVHD block should replace fixed format IVHD block

This fixes double IVHD_SETUP_INTR calls on the same IOMMU device.

Sponsored by:	The FreeBSD Foundation
MFC with:	74ada29
Reported by:	Oleg Ginzburg <olevole@olevole.ru>
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29521
Yamagi pushed a commit to Yamagi/freebsd that referenced this pull request Jul 14, 2021
libvmm: clean up vmmapi.h

struct checkpoint_op, enum checkpoint_opcodes, and
MAX_SNAPSHOT_VMNAME are not vmm specific, move them out of the vmmapi
header.

They are used for the save/restore functionality that bhyve(8)
provides and are better suited in usr.sbin/bhyve/snapshot.h

Since bhyvectl(8) requires these, the Makefile for bhyvectl has been
modified to include usr.sbin/bhyve/snapshot.h

Reviewed by:    kevans, grehan
Differential Revision:  https://reviews.freebsd.org/D28410

----

bhyve/snapshot: drop mkdir when creating the unix domain socket

Add /var/run/bhyve/ to BSD.var.dist so we don't have to call mkdir when
creating the unix domain socket for a given bhyve vm.

The path to the unix domain socket for a bhyve vm will now be
/var/run/bhyve/vmname instead of /var/run/bhyve/checkpoint/vmname

Move BHYVE_RUN_DIR from snapshot.c to snapshot.h so it can be shared
to bhyvectl(8).

Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D28783

----

bhyve/snapshot: rename checkpoint_opcodes to be more generic

Generalize the naming here since the domain socket that uses these codes
might be used for purposes other than the save/restore feature.

- rename checkpoint_opcodes to ipc_opcode

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28877

----

bhyvectl: reduce code duplication

Combine send_start_checkpoint() and send_start_suspend() into a
single function named snapshot_request().

snapshot_request() is equivalent to send_start_checkpoint() and
send_start_suspend() except that it takes an additional argument. The
additional argument, enum ipc_opcode, is used to determine the type of
snapshot request being performed. Also, switch to using strlcpy instead
of strncpy.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28878

----

bhyve/snapshot: rename and bump size of MAX_SNAPSHOT_VMNAME

MAX_SNAPSHOT_VMNAME is a macro used to set the size of a character
buffer that stores a filename or the path to a file - this file is used
by the save/restore feature.

Since the file doesn't have anything to do with a vm name, rename
MAX_SNAPSHOT_VMNAME to MAX_SNAPSHOT_FILENAME. Bump the size to PATH_MAX
while here.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28879

----

bhyvectl: print a better error message when vm_open() fails

Use errno to print a more descriptive error message when vm_open() fails

libvmm: preserve errno when vm_device_open() fails

vm_destroy() squashes errno by making a dive into sysctlbyname() - we
can safely skip vm_destroy() here since it's not doing any critical
clean up at this point. Replace vm_destroy() with a free() call.

PR:             250671
MFC after:      3 days
Submitted by:   marko@apache.org
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D29109

----

bhyve/snapshot: use SOCK_DGRAM instead of SOCK_STREAM

The save/restore feature uses a unix domain socket to send messages
from bhyvectl(8) to a bhyve(8) process. A datagram socket will suffice
for this.

An added benefit of using a datagram socket is simplified code. For
bhyve, the listen/accept calls are dropped; and for bhyvectl, the
connect() call is dropped.

EPRINTLN handles raw mode for bhyve(8), use it to print error messages.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28983

----

bhyve: virtio shares definitions between sys/dev/virtio

Definitions inside usr.sbin/bhyve/virtio.h are thrown away.
Definitions in sys/dev/virtio are used instead.

This reduces code duplication.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29084

----

Refactor configuration management in bhyve.

Replace the existing ad-hoc configuration via various global variables
with a small database of key-value pairs.  The database supports
heirarchical keys using a MIB-like syntax to name the path to a given
key.  Values are always stored as strings.  The API used to manage
configuation values does include wrappers to handling boolean values.
Other values use non-string types require parsing by consumers.

The configuration values are stored in a tree using nvlists.  Leaf
nodes hold string values.  Configuration values are permitted to
reference other configuration values using '%(name)'.  This permits
constructing template configurations.

All existing command line arguments now set configuration values.  For
devices, the "-s" option parses its option argument to generate a list
of key-value pairs for the given device.

A new '-o' command line option permits setting an individual
configuration variable.  The key name is always given as a full path
of dot-separated components.

A new '-k' command line option parses a simple configuration file.
This configuration file holds a flat list of 'key=value' lines where
the 'key' is the full path of a configuration variable.  Lines
starting with a '#' are comments.

In general, bhyve starts by parsing command line options in sequence
and applying those settings to configuration values.  Once this is
complete, bhyve then begins initializing its state based on the
configuration values.  This means that subsequent configuration
options or files may override or supplement previously given settings.

A special 'config.dump' configuration value can be set to true to help
debug configuration issues.  When this value is set, bhyve will print
out the configuration variables as a flat list of 'key=value' lines.

Most command line argments map to a single configuration variable,
e.g.  '-w' sets the 'x86.strictmsr' value to false.  A few command
line arguments have less obvious effects:

- Multiple '-p' options append their values (as a comma-seperated
  list) to "vcpu.N.cpuset" values (where N is a decimal vcpu number).

- For '-s' options, a pci.<bus>.<slot>.<function> node is created.
  The first argument to '-s' (the device type) is used as the value of
  a "device" variable.  Additional comma-separated arguments are then
  parsed into 'key=value' pairs and used to set additional variables
  under the device node.  A PCI device emulation driver can provide
  its own hook to override the parsing of the additonal '-s' arguments
  after the device type.

  After the configuration phase as completed, the init_pci hook
  then walks the "pci.<bus>.<slot>.<func>" nodes.  It uses the
  "device" value to find the device model to use.  The device
  model's init routine is passed a reference to its nvlist node
  in the configuration tree which it can query for specific
  variables.

  The result is that a lot of the string parsing is removed from
  the device models and centralized.  In addition, adding a new
  variable just requires teaching the model to look for the new
  variable.

- For '-l' options, a similar model is used where the string is
  parsed into values that are later read during initialization.
  One key note here is that the serial ports use the commonly
  used lowercase names from existing documentation and examples
  (e.g. "lpc.com1") instead of the uppercase names previously
  used internally in bhyve.

Reviewed by:	grehan
MFC after:	3 months
Differential Revision:	https://reviews.freebsd.org/D26035

----

bhyve: support relocating fbuf and passthru data BARs

We want to allow the UEFI firmware to enumerate and assign
addresses to PCI devices so we can boot from NVMe[1]. Address
assignment of PCI BARs is properly handled by the PCI emulation
code in general, but a few specific cases need additional support.
fbuf and passthru map additional objects into the guest physical
address space and so need to handle address updates. Here we add a
callback to emulated PCI devices to inform them of a BAR
configuration change. fbuf and passthru then watch for these BAR
changes and relocate the frame buffer memory segment and passthru
device mmio area respectively.

We also add new VM_MUNMAP_MEMSEG and VM_UNMAP_PPTDEV_MMIO ioctls
to vmm(4) to facilitate the unmapping needed for addres updates.

[1]: freebsd/uefi-edk2#9

Originally by:	scottph
MFC After:	1 week
Sponsored by:	Intel Corporation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D24066

----

bhyve amd: Small cleanups in amdvi_dump_cmds

Bump offset with MOD_INC instead in amdvi_dump_cmds.

Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D28862

----

bhyve hostbridge: Rename "device" property to "devid".

"device" is already used as the generic PCI-level name of the device
model to use (e.g. "hostbridge").  The result was that parsing
"hostbridge" as an integer failed and the host bridge used a device ID
of 0.  The EFI ROM asserts that the device ID of the hostbridge is not
0, so booting with the current EFI ROM was failing during the ROM
boot.

Fixes:		621b509

----

bhyve: Enable virtio-scsi legacy config parsing.

The previous commit added the handler to parse the command line
options for virtio-scsi devices but forgot to set the correct function
pointer to point to the handler.

Reported by:	vangyzen
Reviewed by:	vangyzen
Fixes:		621b509
Differential Revision:	https://reviews.freebsd.org/D29438

----

bhyve: change vq_getchain to return iovecs in both directions

The old prototype requires callers to inspect flags of each descriptors
to get the starting position of host-writable iovecs.

vq_getchain() is changed to return a virtio request with the number of
host-readable iovecs and host-writable iovecs instead. Callers can avoid
boilerplate code of getting the start offset of host-writable iovecs.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Reviewed by:	afedorov
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29433

----

Fix typo in xhci nvlist node name, and also increment device counter.

This allows the xhci tablet device to be recognized and a PCI device
instantiated.

Reviewed by:	jhb
Fixes:		621b509 Refactor configuration management in bhyve.
MFC after:	3 months.

----

bhyve: fix regression in legacy virtio-9p config parsing

Commit 621b509 introduced a regression
in legacy virtio-9p config parsing by not initializing *sharename to
NULL. As a result, "sharename != NULL" check in the first iteration fails
and bhyve exits with "virtio-9p: more than one share name given".

Fix by adding NULL back.

Approved by:	grehan

----

bhyve: add SMBIOS Baseboard Information

Add the System Management BIOS Baseboard (or Module) Information
a.k.a. Type 2 structure to the SMBIOS emulation.

Reviewed by:	rgrimes, bcran, grehan
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29657

----

bhyve: Move the gdb_active check to gdb_cpu_suspend().

The check needs to be in the public routine (gdb_cpu_suspend()), not
in the internal routine called from various places
(_gdb_cpu_suspend()).  All the other callers of _gdb_cpu_suspend()
already check gdb_active, and this breaks the use of snapshots when
the debug server is not enabled since gdb_cpu_suspend() tries to lock
an uninitialized mutex.

Reported by:	Darius Mihai, Elena Mihailescu
Reviewed by:	elenamihailescu22_gmail.com
Fixes:		621b509
Differential Revision:	https://reviews.freebsd.org/D29538

----

bhyve: implement rdmsr for MSR_IA32_FEATURE_CONTROL

Without the -w option, Windows guests crash on boot. This is caused by a rdmsr
of MSR_IA32_FEATURE_CONTROL. Windows checks this MSR to determine enabled VMX
features. This MSR isn't emulated in bhyve, so a #GP exception is injected
which causes Windows to crash.

Fix by returning a rdmsr of MSR_IA32_FEATURE_CONTROL with Lock Bit set and
VMX disabled to informWindows that VMX isn't available.

Reviewed by:	jhb, grehan (bhyve)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29665

----

bhyve.8: Make synopsis more readable

There is no need to squeeze all the possible options into one synopsis
entry. Let "-l help" and "-s help" be listed separately.

While here, keep -s and its arguments on the same line.

MFC after:	2 weeks

----

bhyve: Fix synopsis in the usage message

In particular:
- Sort short options to align with style(9)
- Add two missing flags: -G and -r
- Drop unnecessary angle brackets for consistency
- Rename the "vm" argument to vmname for consistency with the manual
  page

MFC after:	2 weeks

----

bhyve: Improve the option description in the usage message

- Sort options as suggested by style(9)
- Capitalize some words like CPU and HLT
- Add a missing description for the -G flag

MFC after:	2 weeks

----

bhyve.8: Sort the options in the OPTIONS section

No content change intended. Just moving the option descriptions around
to follow the order suggested by style(9).

MFC after:	2 weeks

----

bhyve.8: Improve the description and synopsis of -l

- Describe "-l help" separately for readability.
- List all the supported comX devices explicitly
- Use Cm instead of Ar for command modifiers (i.e., literal values a
  user can specify as an argument to the command).
- Explain where to get more information about the possible values of the
  conf argument.

MFC after:	2 weeks

----

bhyve.8: Improve the description of the -m flag

- Stylize the synopsis with proper mdoc macros
- Do some wordsmithing on the description for consistency.

MFC after:	2 weeks

----

bhyve.8: Fix the synopsis of -p

Use appropriate mdoc macros.

MFC after:	2 weeks

----

bhyve.8: Clean up description of -r

There is no need to wrap those flags in Op macros.

MFC after:	2 weeks

----

bhyve.8: Fix indention in the signals table

MFC after:	2 weeks

----

bhyve.8: Clean-up synopsis of -s

- Document "-s help" separately for readability.
- Use appropriate mdoc macros.

MFC after:	2 weeks

----

bhyve.8: Clean up the slot description of -s

Also, remove the macros of the nested list which contained slot,
emulation and conf. This decreases the indention of the -s description.
It was necessary to clean up the slot description.

MFC after:	2 weeks

----

bhyve.8: Improve emulation description of the -s flag

- Set width of the list to the longest key word for readability.
- Separate descriptions of amd_hostbridge and hostbridge emulations.
  Also, wordsmith their descriptions for consistency with other entries.
- Use Cm instead of Li for command modifiers.
- Do not stylize AMD with Li, there's no need to do it.
- Mention COM3 and COM4 in the definition of lpc.
- Fix a typo in the definition of ahci-hd ("hard drive" instead of
  "hard-drive").

MFC after:	2 weeks

----

bhyve.8: Clean up network backends section

- Reformat the format lists, use appropriate mdoc macros for
  readability.
- Add a missing Oxford comma.

MFC after:	2 weeks

----

bhyve.8: Clean up block storage device backends description

MFC after:	2 weeks

----

bhyve.8: Clean up SCSI device backends section

MFC after:	2 weeks

----

bhyve.8: Clean up 9P device backends section

MFC after:	2 weeks

----

bhyve.8: Clean up TTY, boot ROM, and pass-through descriptions

MFC after:	2 weeks

----

bhyve.8: Clean up virtio console device backends description

MFC after:	2 weeks

----

bhyve.8: Improve framebuffer backends description

- Use appropriate mdoc macros
- Document that tcp= is a synonym to rfb= (tcp is used in the examples,
  but never mentioned)
- Clarify the IP address specification

MFC after:	2 weeks

----

bhyve.8: Improve documentation of NVME backend

- Document the configuration format.
- Document two additional configuration options: eui64 and dsm.

MFC after:	2 weeks

----

bhyve.8: Improve AHCI backends documentation

- Document the backend format.

MFC after:	2 weeks

----

bhyve: Document the format for HD audio backends

- This change is done for consistency with other backend definitions.

MFC after:	2 weeks

----

bhyve.8: Fix mandoc -Tlint issues

While here, keep network backends section consistent with other
sections.

MFC after:	2 weeks

----

bhyve: Be explicit that setting config.dump will not start a VM.

Suggested by:	rpokala
Reviewed by:	bcr (manpages)
Differential Revision:	https://reviews.freebsd.org/D29738

----

bhyve: Gracefully handle virtio-scsi with no conf

Fixes segfault with the command `bhyve -s 0,virtio-scsi`, which is used
by some third party software to probe bhyve for virtio-scsi support.

Reviewed by:	jhb
MFC after:	1 day
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D29926

----

bhyve: Set SO_REUSEADDR on the gdb stub socket

Reviewed by:	jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30037

----

bhyve/snapshot: provide a way to send other messages/data to bhyve

This is a step towards sending messages (other than suspend/checkpoint)
from bhyvectl to bhyve.

Introduce a new struct, ipc_message - this struct stores the type of
message and a union containing message specific structures for the type
of message being sent.

Reviewed by:    grehan
Differential Revision: https://reviews.freebsd.org/D30221

----

bhyve/snapshot: split up mutex/cond initialization from socket creation

Move initialization of the mutex/condition variables required by the
save/restore feature to their own function.

The unix domain socket that facilitates communication between bhyvectl
and bhyve doesn't rely on these variables in order to be functional.

Reviewed by:    markj
Differential Revision:  https://reviews.freebsd.org/D30281

----

Add a virtio-input device emulation.

This will be used to inject keyboard/mouse input events into a guest.
The command line syntax is:
   -s <slot>,virtio-input,/dev/input/eventX

Reviewed by:	jhb (bhyve), grehan
Obtained from:	Corvin Köhne <C.Koehne@beckhoff.com>
MFC after:	3 weeks
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D30020

----

bhyve: Register new kevents synchronously.

Change mevent_add*() to synchronously add the new kevent.  This
permits reporting event registration failures to the caller and avoids
failing the registration of other, unrelated events queued up in the
same batch.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30502

----

bhyve: Add support for EVFILT_VNODE mevents.

This allows registering an event to watch for changes to a file's
attributes.  This is a bit imperfect as it would be nice to have a way
to determine if an fd can use EVFILT_VNODE successfully.  mevent's
current structure does not permit that and a failure to register a
single kevent impacts several other kevents.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30503

----

bhyve: Add support for handling disk resize events to block_if.

Allow clients of blockif to register a resize callback handler.  When
a callback is registered, register an EVFILT_VNODE kevent watching the
backing store for a change in the file's attributes.  If the size has
changed when the kevent fires, invoke the clients' callback.

Currently resize detection is limited to backing stores that support
EVFILT_VNODE kevents such as regular files.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30504

----

bhyve: Split out a lower-level helper for VirtIO interrupts.

This allows device models to assert VirtIO interrupts for reasons
other than publishing changes to a VirtIO ring such as configuration
changes.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30505

----

bhyve vtblk: Inform guests of disk resize events.

Register a resize callback with the blockif interface.  When the
callback fires, update the size of the disk and notify the guest via a
configuration change interrupt.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30506

----

bhyve: enhance debug info for memory range clash

Explain what the two clashing regions are.

Reivewed by:		grehan, jhb
Differential Revision:	https://reviews.freebsd.org/D29696
Pull Request:		freebsd/freebsd-src#463

----

Add more GIC and GICv3 registers

These aren't used by either driver, however they will be needed by
bhyve on arm64 to emulate a GICv3 interrupt controller.

Sponsored by:	Innovate UK

----

bhyve: Fix cli regression with NVMe ram

The configuration management refactoring inadvertently removed support
for a RAM-backed NVMe Namespace (i.e. -s X,nvme,ram=16384). This adds it
back.

Reported by:	andy@omniosce.org
Reviewed by:	jhb, andy@omniosce.org
Fixes:		621b509
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D30717

----

bhyve: fix NVMe MDTS comment

Removes an obsolete comment and adds parenthesis around the macro while
in the area. No functional change.

----

bhyve: Fix NVMe iovec construction for large IOs

The UEFI driver included with Rocky Linux 8.4 uncovered an existing bug
in the NVMe emulation's construction of iovec's.

By default, NVMe data transfer operations use a scatter-gather list in
which all entries point to a fixed size memory region. For example, if
the Memory Page Size is 4KiB, a 2MiB IO requires 512 entries. Lists
themselves are also fixed size (default is 512 entries).

Because the list size is fixed, the last entry is special. If the IO
requires more than 512 entries, the last entry in the list contains the
address of the next list of entries. But if the IO requires exactly 512
entries, the last entry points to data.

The NVMe emulation missed this logic and unconditionally treated the
last entry as a pointer to the next list. Fix is to check if the
remaining data is greater than the page size before using the last entry
as a pointer to the next list.

PR:		256422
Reported by:	dave@syix.com
Tested by:	jason@tubnor.net
MFC after:	5 days
Relnotes:	yes
Reviewed by:	imp, grehan
Differential Revision:	https://reviews.freebsd.org/D30897

----

Append Keyboard Layout specified option for using VNC.
Part one: supporting QEMU Extended Keyboard Event Message

PR:             246121
Submitted by:   koinec@yahoo.co.jp
Differential Revision: https://reviews.freebsd.org/D29430

----

libvmm: explicitly save and restore errno in vm_open()

In commit 6bb140e, vm_destroy() was replaced with free() to
preserve errno. However, it's possible that free() may change the errno
as well. Keep the free() call, but explicitly save and restore errno.

Noted by: jhb
Fixes: 6bb140e

----

vmm: Let guests enable SMEP/SMAP if the host supports it

Reviewed by:	kib, grehan, jhb
Tested by:	grehan (AMD)
MFC after:	3 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30462

----

vmm: Fix ivrs_drv device_printf usage

The original %b description string is wrong.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	imp, jhb
Differential Revision:	https://reviews.freebsd.org/D30805

----

bhyve/vioapic: remove an extra pin masked check

vioapic_send_intr does already check whether the pin is masked before
injecting the interrupt, there's no need to do it in vioapic_write
also.

No functional change intended.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28236

----

bhyve/ioapic: only account for asserted line in level mode

After modifying a redirection entry only try to inject an interrupt if
the pin is in level mode, pins in edge mode shouldn't take into
account the line assert status as they are triggered by edge changes,
not the line status itself.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28237

----

bhyve/ioapic: improve the tracking of IRR bit

One common method of EOI'ing an interrupt at the IO-APIC level is to
switch the pin to edge triggering mode and then back into level mode.
That would cause the IRR bit to be cleared and thus further interrupts
to be injected. FreeBSD does indeed use that method if the IO-APIC EOI
register is not supported.

The bhyve IO-APIC emulation code didn't clear the IRR bit when doing
that switch, and was also missing acknowledging the IRR state when
trying to inject an interrupt in vioapic_send_intr.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28238

----

ivrs_drv: Fix IVHDs with duplicated BaseAddress

Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28945

----

AMD-vi: Fix IOMMU device interrupts being overridden

Currently, AMD-vi PCI-e passthrough will lead to the following lines in
dmesg:
"kernel: CPU0: local APIC error 0x40
ivhd0: Error: completion failed tail:0x720, head:0x0."

After some tracing, the problem is due to the interaction with
amdvi_alloc_intr_resources() and pci_driver_added(). In ivrs_drv, the
identification of AMD-vi IVHD is done by walking over the ACPI IVRS
table and ivhdX device_ts are added under the acpi bus, while there are
no driver handling the corresponding IOMMU PCI function. In
amdvi_alloc_intr_resources(), the MSI intr are allocated with the ivhdX
device_t instead of the IOMMU PCI function device_t. bus_setup_intr() is
called on ivhdX. the IOMMU pci function device_t is only used for
pci_enable_msi(). Since bus_setup_intr() is not called on IOMMU pci
function, the IOMMU PCI function device_t's dinfo->cfg.msi is never
updated to reflect the supposed msi_data and msi_addr. So the msi_data
and msi_addr stay in the value 0. When pci_driver_added() tried to loop
over the children of a pci bus, and do pci_cfg_restore() on each of
them, msi_addr and msi_data with value 0 will be written to the MSI
capability of the IOMMU pci function, thus explaining the errors in
dmesg.

This change includes an amdiommu driver which currently does attaching,
detaching and providing DEVMETHODs for setting up and tearing down
interrupt. The purpose of the driver is to prevent pci_driver_added()
from calling pci_cfg_restore() on the IOMMU PCI function device_t.
The introduction of the amdiommu driver handles allocation of an IRQ
resource within the IOMMU PCI function, so that the dinfo->cfg.msi is
populated.

This has been tested on EPYC Rome 7282 with Radeon 5700XT GPU.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28984

----

Correct "Fondation" typo (missing "u")

----

AMD-vi: Mixed format IVHD block should replace fixed format IVHD block

This fixes double IVHD_SETUP_INTR calls on the same IOMMU device.

Sponsored by:	The FreeBSD Foundation
MFC with:	74ada29
Reported by:	Oleg Ginzburg <olevole@olevole.ru>
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29521

----

vmm: Fix AMD-vi using wrong rid range

The ACPI parsing code around rid range was wrong on assuming there is
only one pair of start/end device id range. Besides, ivhd_dev_parse()
never work as supposed. The start/end rid info was always zero.

Restructure the code to build dynamic-sized tables for each IOMMU softc
holding device entries. The device entries are enumerated to find a
suitable IOMMU unit. Operations on devices not governed (e.g. the IOMMU
unit itself) are no-op from now on. There are also a minor fix on wrong
%b formatting string usage.

Tested on my EPYC 7282.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D30827
Yamagi pushed a commit to Yamagi/freebsd that referenced this pull request Aug 27, 2021
libvmm: clean up vmmapi.h

struct checkpoint_op, enum checkpoint_opcodes, and
MAX_SNAPSHOT_VMNAME are not vmm specific, move them out of the vmmapi
header.

They are used for the save/restore functionality that bhyve(8)
provides and are better suited in usr.sbin/bhyve/snapshot.h

Since bhyvectl(8) requires these, the Makefile for bhyvectl has been
modified to include usr.sbin/bhyve/snapshot.h

Reviewed by:    kevans, grehan
Differential Revision:  https://reviews.freebsd.org/D28410

----

bhyve/snapshot: drop mkdir when creating the unix domain socket

Add /var/run/bhyve/ to BSD.var.dist so we don't have to call mkdir when
creating the unix domain socket for a given bhyve vm.

The path to the unix domain socket for a bhyve vm will now be
/var/run/bhyve/vmname instead of /var/run/bhyve/checkpoint/vmname

Move BHYVE_RUN_DIR from snapshot.c to snapshot.h so it can be shared
to bhyvectl(8).

Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D28783

----

bhyve/snapshot: rename checkpoint_opcodes to be more generic

Generalize the naming here since the domain socket that uses these codes
might be used for purposes other than the save/restore feature.

- rename checkpoint_opcodes to ipc_opcode

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28877

----

bhyvectl: reduce code duplication

Combine send_start_checkpoint() and send_start_suspend() into a
single function named snapshot_request().

snapshot_request() is equivalent to send_start_checkpoint() and
send_start_suspend() except that it takes an additional argument. The
additional argument, enum ipc_opcode, is used to determine the type of
snapshot request being performed. Also, switch to using strlcpy instead
of strncpy.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28878

----

bhyve/snapshot: rename and bump size of MAX_SNAPSHOT_VMNAME

MAX_SNAPSHOT_VMNAME is a macro used to set the size of a character
buffer that stores a filename or the path to a file - this file is used
by the save/restore feature.

Since the file doesn't have anything to do with a vm name, rename
MAX_SNAPSHOT_VMNAME to MAX_SNAPSHOT_FILENAME. Bump the size to PATH_MAX
while here.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28879

----

bhyvectl: print a better error message when vm_open() fails

Use errno to print a more descriptive error message when vm_open() fails

libvmm: preserve errno when vm_device_open() fails

vm_destroy() squashes errno by making a dive into sysctlbyname() - we
can safely skip vm_destroy() here since it's not doing any critical
clean up at this point. Replace vm_destroy() with a free() call.

PR:             250671
MFC after:      3 days
Submitted by:   marko@apache.org
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D29109

----

bhyve/snapshot: use SOCK_DGRAM instead of SOCK_STREAM

The save/restore feature uses a unix domain socket to send messages
from bhyvectl(8) to a bhyve(8) process. A datagram socket will suffice
for this.

An added benefit of using a datagram socket is simplified code. For
bhyve, the listen/accept calls are dropped; and for bhyvectl, the
connect() call is dropped.

EPRINTLN handles raw mode for bhyve(8), use it to print error messages.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28983

----

bhyve: virtio shares definitions between sys/dev/virtio

Definitions inside usr.sbin/bhyve/virtio.h are thrown away.
Definitions in sys/dev/virtio are used instead.

This reduces code duplication.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29084

----

Refactor configuration management in bhyve.

Replace the existing ad-hoc configuration via various global variables
with a small database of key-value pairs.  The database supports
heirarchical keys using a MIB-like syntax to name the path to a given
key.  Values are always stored as strings.  The API used to manage
configuation values does include wrappers to handling boolean values.
Other values use non-string types require parsing by consumers.

The configuration values are stored in a tree using nvlists.  Leaf
nodes hold string values.  Configuration values are permitted to
reference other configuration values using '%(name)'.  This permits
constructing template configurations.

All existing command line arguments now set configuration values.  For
devices, the "-s" option parses its option argument to generate a list
of key-value pairs for the given device.

A new '-o' command line option permits setting an individual
configuration variable.  The key name is always given as a full path
of dot-separated components.

A new '-k' command line option parses a simple configuration file.
This configuration file holds a flat list of 'key=value' lines where
the 'key' is the full path of a configuration variable.  Lines
starting with a '#' are comments.

In general, bhyve starts by parsing command line options in sequence
and applying those settings to configuration values.  Once this is
complete, bhyve then begins initializing its state based on the
configuration values.  This means that subsequent configuration
options or files may override or supplement previously given settings.

A special 'config.dump' configuration value can be set to true to help
debug configuration issues.  When this value is set, bhyve will print
out the configuration variables as a flat list of 'key=value' lines.

Most command line argments map to a single configuration variable,
e.g.  '-w' sets the 'x86.strictmsr' value to false.  A few command
line arguments have less obvious effects:

- Multiple '-p' options append their values (as a comma-seperated
  list) to "vcpu.N.cpuset" values (where N is a decimal vcpu number).

- For '-s' options, a pci.<bus>.<slot>.<function> node is created.
  The first argument to '-s' (the device type) is used as the value of
  a "device" variable.  Additional comma-separated arguments are then
  parsed into 'key=value' pairs and used to set additional variables
  under the device node.  A PCI device emulation driver can provide
  its own hook to override the parsing of the additonal '-s' arguments
  after the device type.

  After the configuration phase as completed, the init_pci hook
  then walks the "pci.<bus>.<slot>.<func>" nodes.  It uses the
  "device" value to find the device model to use.  The device
  model's init routine is passed a reference to its nvlist node
  in the configuration tree which it can query for specific
  variables.

  The result is that a lot of the string parsing is removed from
  the device models and centralized.  In addition, adding a new
  variable just requires teaching the model to look for the new
  variable.

- For '-l' options, a similar model is used where the string is
  parsed into values that are later read during initialization.
  One key note here is that the serial ports use the commonly
  used lowercase names from existing documentation and examples
  (e.g. "lpc.com1") instead of the uppercase names previously
  used internally in bhyve.

Reviewed by:	grehan
MFC after:	3 months
Differential Revision:	https://reviews.freebsd.org/D26035

----

bhyve: support relocating fbuf and passthru data BARs

We want to allow the UEFI firmware to enumerate and assign
addresses to PCI devices so we can boot from NVMe[1]. Address
assignment of PCI BARs is properly handled by the PCI emulation
code in general, but a few specific cases need additional support.
fbuf and passthru map additional objects into the guest physical
address space and so need to handle address updates. Here we add a
callback to emulated PCI devices to inform them of a BAR
configuration change. fbuf and passthru then watch for these BAR
changes and relocate the frame buffer memory segment and passthru
device mmio area respectively.

We also add new VM_MUNMAP_MEMSEG and VM_UNMAP_PPTDEV_MMIO ioctls
to vmm(4) to facilitate the unmapping needed for addres updates.

[1]: freebsd/uefi-edk2#9

Originally by:	scottph
MFC After:	1 week
Sponsored by:	Intel Corporation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D24066

----

bhyve amd: Small cleanups in amdvi_dump_cmds

Bump offset with MOD_INC instead in amdvi_dump_cmds.

Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D28862

----

bhyve hostbridge: Rename "device" property to "devid".

"device" is already used as the generic PCI-level name of the device
model to use (e.g. "hostbridge").  The result was that parsing
"hostbridge" as an integer failed and the host bridge used a device ID
of 0.  The EFI ROM asserts that the device ID of the hostbridge is not
0, so booting with the current EFI ROM was failing during the ROM
boot.

Fixes:		621b509

----

bhyve: Enable virtio-scsi legacy config parsing.

The previous commit added the handler to parse the command line
options for virtio-scsi devices but forgot to set the correct function
pointer to point to the handler.

Reported by:	vangyzen
Reviewed by:	vangyzen
Fixes:		621b509
Differential Revision:	https://reviews.freebsd.org/D29438

----

bhyve: change vq_getchain to return iovecs in both directions

The old prototype requires callers to inspect flags of each descriptors
to get the starting position of host-writable iovecs.

vq_getchain() is changed to return a virtio request with the number of
host-readable iovecs and host-writable iovecs instead. Callers can avoid
boilerplate code of getting the start offset of host-writable iovecs.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Reviewed by:	afedorov
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29433

----

Fix typo in xhci nvlist node name, and also increment device counter.

This allows the xhci tablet device to be recognized and a PCI device
instantiated.

Reviewed by:	jhb
Fixes:		621b509 Refactor configuration management in bhyve.
MFC after:	3 months.

----

bhyve: fix regression in legacy virtio-9p config parsing

Commit 621b509 introduced a regression
in legacy virtio-9p config parsing by not initializing *sharename to
NULL. As a result, "sharename != NULL" check in the first iteration fails
and bhyve exits with "virtio-9p: more than one share name given".

Fix by adding NULL back.

Approved by:	grehan

----

bhyve: add SMBIOS Baseboard Information

Add the System Management BIOS Baseboard (or Module) Information
a.k.a. Type 2 structure to the SMBIOS emulation.

Reviewed by:	rgrimes, bcran, grehan
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29657

----

bhyve: Move the gdb_active check to gdb_cpu_suspend().

The check needs to be in the public routine (gdb_cpu_suspend()), not
in the internal routine called from various places
(_gdb_cpu_suspend()).  All the other callers of _gdb_cpu_suspend()
already check gdb_active, and this breaks the use of snapshots when
the debug server is not enabled since gdb_cpu_suspend() tries to lock
an uninitialized mutex.

Reported by:	Darius Mihai, Elena Mihailescu
Reviewed by:	elenamihailescu22_gmail.com
Fixes:		621b509
Differential Revision:	https://reviews.freebsd.org/D29538

----

bhyve: implement rdmsr for MSR_IA32_FEATURE_CONTROL

Without the -w option, Windows guests crash on boot. This is caused by a rdmsr
of MSR_IA32_FEATURE_CONTROL. Windows checks this MSR to determine enabled VMX
features. This MSR isn't emulated in bhyve, so a #GP exception is injected
which causes Windows to crash.

Fix by returning a rdmsr of MSR_IA32_FEATURE_CONTROL with Lock Bit set and
VMX disabled to informWindows that VMX isn't available.

Reviewed by:	jhb, grehan (bhyve)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29665

----

bhyve.8: Make synopsis more readable

There is no need to squeeze all the possible options into one synopsis
entry. Let "-l help" and "-s help" be listed separately.

While here, keep -s and its arguments on the same line.

MFC after:	2 weeks

----

bhyve: Fix synopsis in the usage message

In particular:
- Sort short options to align with style(9)
- Add two missing flags: -G and -r
- Drop unnecessary angle brackets for consistency
- Rename the "vm" argument to vmname for consistency with the manual
  page

MFC after:	2 weeks

----

bhyve: Improve the option description in the usage message

- Sort options as suggested by style(9)
- Capitalize some words like CPU and HLT
- Add a missing description for the -G flag

MFC after:	2 weeks

----

bhyve.8: Sort the options in the OPTIONS section

No content change intended. Just moving the option descriptions around
to follow the order suggested by style(9).

MFC after:	2 weeks

----

bhyve.8: Improve the description and synopsis of -l

- Describe "-l help" separately for readability.
- List all the supported comX devices explicitly
- Use Cm instead of Ar for command modifiers (i.e., literal values a
  user can specify as an argument to the command).
- Explain where to get more information about the possible values of the
  conf argument.

MFC after:	2 weeks

----

bhyve.8: Improve the description of the -m flag

- Stylize the synopsis with proper mdoc macros
- Do some wordsmithing on the description for consistency.

MFC after:	2 weeks

----

bhyve.8: Fix the synopsis of -p

Use appropriate mdoc macros.

MFC after:	2 weeks

----

bhyve.8: Clean up description of -r

There is no need to wrap those flags in Op macros.

MFC after:	2 weeks

----

bhyve.8: Fix indention in the signals table

MFC after:	2 weeks

----

bhyve.8: Clean-up synopsis of -s

- Document "-s help" separately for readability.
- Use appropriate mdoc macros.

MFC after:	2 weeks

----

bhyve.8: Clean up the slot description of -s

Also, remove the macros of the nested list which contained slot,
emulation and conf. This decreases the indention of the -s description.
It was necessary to clean up the slot description.

MFC after:	2 weeks

----

bhyve.8: Improve emulation description of the -s flag

- Set width of the list to the longest key word for readability.
- Separate descriptions of amd_hostbridge and hostbridge emulations.
  Also, wordsmith their descriptions for consistency with other entries.
- Use Cm instead of Li for command modifiers.
- Do not stylize AMD with Li, there's no need to do it.
- Mention COM3 and COM4 in the definition of lpc.
- Fix a typo in the definition of ahci-hd ("hard drive" instead of
  "hard-drive").

MFC after:	2 weeks

----

bhyve.8: Clean up network backends section

- Reformat the format lists, use appropriate mdoc macros for
  readability.
- Add a missing Oxford comma.

MFC after:	2 weeks

----

bhyve.8: Clean up block storage device backends description

MFC after:	2 weeks

----

bhyve.8: Clean up SCSI device backends section

MFC after:	2 weeks

----

bhyve.8: Clean up 9P device backends section

MFC after:	2 weeks

----

bhyve.8: Clean up TTY, boot ROM, and pass-through descriptions

MFC after:	2 weeks

----

bhyve.8: Clean up virtio console device backends description

MFC after:	2 weeks

----

bhyve.8: Improve framebuffer backends description

- Use appropriate mdoc macros
- Document that tcp= is a synonym to rfb= (tcp is used in the examples,
  but never mentioned)
- Clarify the IP address specification

MFC after:	2 weeks

----

bhyve.8: Improve documentation of NVME backend

- Document the configuration format.
- Document two additional configuration options: eui64 and dsm.

MFC after:	2 weeks

----

bhyve.8: Improve AHCI backends documentation

- Document the backend format.

MFC after:	2 weeks

----

bhyve: Document the format for HD audio backends

- This change is done for consistency with other backend definitions.

MFC after:	2 weeks

----

bhyve.8: Fix mandoc -Tlint issues

While here, keep network backends section consistent with other
sections.

MFC after:	2 weeks

----

bhyve: Be explicit that setting config.dump will not start a VM.

Suggested by:	rpokala
Reviewed by:	bcr (manpages)
Differential Revision:	https://reviews.freebsd.org/D29738

----

bhyve: Gracefully handle virtio-scsi with no conf

Fixes segfault with the command `bhyve -s 0,virtio-scsi`, which is used
by some third party software to probe bhyve for virtio-scsi support.

Reviewed by:	jhb
MFC after:	1 day
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D29926

----

bhyve: Set SO_REUSEADDR on the gdb stub socket

Reviewed by:	jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30037

----

bhyve/snapshot: provide a way to send other messages/data to bhyve

This is a step towards sending messages (other than suspend/checkpoint)
from bhyvectl to bhyve.

Introduce a new struct, ipc_message - this struct stores the type of
message and a union containing message specific structures for the type
of message being sent.

Reviewed by:    grehan
Differential Revision: https://reviews.freebsd.org/D30221

----

bhyve/snapshot: split up mutex/cond initialization from socket creation

Move initialization of the mutex/condition variables required by the
save/restore feature to their own function.

The unix domain socket that facilitates communication between bhyvectl
and bhyve doesn't rely on these variables in order to be functional.

Reviewed by:    markj
Differential Revision:  https://reviews.freebsd.org/D30281

----

Add a virtio-input device emulation.

This will be used to inject keyboard/mouse input events into a guest.
The command line syntax is:
   -s <slot>,virtio-input,/dev/input/eventX

Reviewed by:	jhb (bhyve), grehan
Obtained from:	Corvin Köhne <C.Koehne@beckhoff.com>
MFC after:	3 weeks
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D30020

----

bhyve: Register new kevents synchronously.

Change mevent_add*() to synchronously add the new kevent.  This
permits reporting event registration failures to the caller and avoids
failing the registration of other, unrelated events queued up in the
same batch.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30502

----

bhyve: Add support for EVFILT_VNODE mevents.

This allows registering an event to watch for changes to a file's
attributes.  This is a bit imperfect as it would be nice to have a way
to determine if an fd can use EVFILT_VNODE successfully.  mevent's
current structure does not permit that and a failure to register a
single kevent impacts several other kevents.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30503

----

bhyve: Add support for handling disk resize events to block_if.

Allow clients of blockif to register a resize callback handler.  When
a callback is registered, register an EVFILT_VNODE kevent watching the
backing store for a change in the file's attributes.  If the size has
changed when the kevent fires, invoke the clients' callback.

Currently resize detection is limited to backing stores that support
EVFILT_VNODE kevents such as regular files.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30504

----

bhyve: Split out a lower-level helper for VirtIO interrupts.

This allows device models to assert VirtIO interrupts for reasons
other than publishing changes to a VirtIO ring such as configuration
changes.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30505

----

bhyve vtblk: Inform guests of disk resize events.

Register a resize callback with the blockif interface.  When the
callback fires, update the size of the disk and notify the guest via a
configuration change interrupt.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30506

----

bhyve: enhance debug info for memory range clash

Explain what the two clashing regions are.

Reivewed by:		grehan, jhb
Differential Revision:	https://reviews.freebsd.org/D29696
Pull Request:		freebsd/freebsd-src#463

----

Add more GIC and GICv3 registers

These aren't used by either driver, however they will be needed by
bhyve on arm64 to emulate a GICv3 interrupt controller.

Sponsored by:	Innovate UK

----

bhyve: Fix cli regression with NVMe ram

The configuration management refactoring inadvertently removed support
for a RAM-backed NVMe Namespace (i.e. -s X,nvme,ram=16384). This adds it
back.

Reported by:	andy@omniosce.org
Reviewed by:	jhb, andy@omniosce.org
Fixes:		621b509
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D30717

----

bhyve: fix NVMe MDTS comment

Removes an obsolete comment and adds parenthesis around the macro while
in the area. No functional change.

----

bhyve: Fix NVMe iovec construction for large IOs

The UEFI driver included with Rocky Linux 8.4 uncovered an existing bug
in the NVMe emulation's construction of iovec's.

By default, NVMe data transfer operations use a scatter-gather list in
which all entries point to a fixed size memory region. For example, if
the Memory Page Size is 4KiB, a 2MiB IO requires 512 entries. Lists
themselves are also fixed size (default is 512 entries).

Because the list size is fixed, the last entry is special. If the IO
requires more than 512 entries, the last entry in the list contains the
address of the next list of entries. But if the IO requires exactly 512
entries, the last entry points to data.

The NVMe emulation missed this logic and unconditionally treated the
last entry as a pointer to the next list. Fix is to check if the
remaining data is greater than the page size before using the last entry
as a pointer to the next list.

PR:		256422
Reported by:	dave@syix.com
Tested by:	jason@tubnor.net
MFC after:	5 days
Relnotes:	yes
Reviewed by:	imp, grehan
Differential Revision:	https://reviews.freebsd.org/D30897

----

Append Keyboard Layout specified option for using VNC.
Part one: supporting QEMU Extended Keyboard Event Message

PR:             246121
Submitted by:   koinec@yahoo.co.jp
Differential Revision: https://reviews.freebsd.org/D29430

----

libvmm: explicitly save and restore errno in vm_open()

In commit 6bb140e, vm_destroy() was replaced with free() to
preserve errno. However, it's possible that free() may change the errno
as well. Keep the free() call, but explicitly save and restore errno.

Noted by: jhb
Fixes: 6bb140e

----

vmm: Let guests enable SMEP/SMAP if the host supports it

Reviewed by:	kib, grehan, jhb
Tested by:	grehan (AMD)
MFC after:	3 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30462

----

vmm: Fix ivrs_drv device_printf usage

The original %b description string is wrong.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	imp, jhb
Differential Revision:	https://reviews.freebsd.org/D30805

----

bhyve/vioapic: remove an extra pin masked check

vioapic_send_intr does already check whether the pin is masked before
injecting the interrupt, there's no need to do it in vioapic_write
also.

No functional change intended.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28236

----

bhyve/ioapic: only account for asserted line in level mode

After modifying a redirection entry only try to inject an interrupt if
the pin is in level mode, pins in edge mode shouldn't take into
account the line assert status as they are triggered by edge changes,
not the line status itself.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28237

----

bhyve/ioapic: improve the tracking of IRR bit

One common method of EOI'ing an interrupt at the IO-APIC level is to
switch the pin to edge triggering mode and then back into level mode.
That would cause the IRR bit to be cleared and thus further interrupts
to be injected. FreeBSD does indeed use that method if the IO-APIC EOI
register is not supported.

The bhyve IO-APIC emulation code didn't clear the IRR bit when doing
that switch, and was also missing acknowledging the IRR state when
trying to inject an interrupt in vioapic_send_intr.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28238

----

ivrs_drv: Fix IVHDs with duplicated BaseAddress

Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28945

----

AMD-vi: Fix IOMMU device interrupts being overridden

Currently, AMD-vi PCI-e passthrough will lead to the following lines in
dmesg:
"kernel: CPU0: local APIC error 0x40
ivhd0: Error: completion failed tail:0x720, head:0x0."

After some tracing, the problem is due to the interaction with
amdvi_alloc_intr_resources() and pci_driver_added(). In ivrs_drv, the
identification of AMD-vi IVHD is done by walking over the ACPI IVRS
table and ivhdX device_ts are added under the acpi bus, while there are
no driver handling the corresponding IOMMU PCI function. In
amdvi_alloc_intr_resources(), the MSI intr are allocated with the ivhdX
device_t instead of the IOMMU PCI function device_t. bus_setup_intr() is
called on ivhdX. the IOMMU pci function device_t is only used for
pci_enable_msi(). Since bus_setup_intr() is not called on IOMMU pci
function, the IOMMU PCI function device_t's dinfo->cfg.msi is never
updated to reflect the supposed msi_data and msi_addr. So the msi_data
and msi_addr stay in the value 0. When pci_driver_added() tried to loop
over the children of a pci bus, and do pci_cfg_restore() on each of
them, msi_addr and msi_data with value 0 will be written to the MSI
capability of the IOMMU pci function, thus explaining the errors in
dmesg.

This change includes an amdiommu driver which currently does attaching,
detaching and providing DEVMETHODs for setting up and tearing down
interrupt. The purpose of the driver is to prevent pci_driver_added()
from calling pci_cfg_restore() on the IOMMU PCI function device_t.
The introduction of the amdiommu driver handles allocation of an IRQ
resource within the IOMMU PCI function, so that the dinfo->cfg.msi is
populated.

This has been tested on EPYC Rome 7282 with Radeon 5700XT GPU.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28984

----

Correct "Fondation" typo (missing "u")

----

AMD-vi: Mixed format IVHD block should replace fixed format IVHD block

This fixes double IVHD_SETUP_INTR calls on the same IOMMU device.

Sponsored by:	The FreeBSD Foundation
MFC with:	74ada29
Reported by:	Oleg Ginzburg <olevole@olevole.ru>
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29521

----

vmm: Fix AMD-vi using wrong rid range

The ACPI parsing code around rid range was wrong on assuming there is
only one pair of start/end device id range. Besides, ivhd_dev_parse()
never work as supposed. The start/end rid info was always zero.

Restructure the code to build dynamic-sized tables for each IOMMU softc
holding device entries. The device entries are enumerated to find a
suitable IOMMU unit. Operations on devices not governed (e.g. the IOMMU
unit itself) are no-op from now on. There are also a minor fix on wrong
%b formatting string usage.

Tested on my EPYC 7282.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D30827
Yamagi pushed a commit to Yamagi/freebsd that referenced this pull request Aug 27, 2021
libvmm: clean up vmmapi.h

struct checkpoint_op, enum checkpoint_opcodes, and
MAX_SNAPSHOT_VMNAME are not vmm specific, move them out of the vmmapi
header.

They are used for the save/restore functionality that bhyve(8)
provides and are better suited in usr.sbin/bhyve/snapshot.h

Since bhyvectl(8) requires these, the Makefile for bhyvectl has been
modified to include usr.sbin/bhyve/snapshot.h

Reviewed by:    kevans, grehan
Differential Revision:  https://reviews.freebsd.org/D28410

----

bhyve/snapshot: drop mkdir when creating the unix domain socket

Add /var/run/bhyve/ to BSD.var.dist so we don't have to call mkdir when
creating the unix domain socket for a given bhyve vm.

The path to the unix domain socket for a bhyve vm will now be
/var/run/bhyve/vmname instead of /var/run/bhyve/checkpoint/vmname

Move BHYVE_RUN_DIR from snapshot.c to snapshot.h so it can be shared
to bhyvectl(8).

Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D28783

----

bhyve/snapshot: rename checkpoint_opcodes to be more generic

Generalize the naming here since the domain socket that uses these codes
might be used for purposes other than the save/restore feature.

- rename checkpoint_opcodes to ipc_opcode

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28877

----

bhyvectl: reduce code duplication

Combine send_start_checkpoint() and send_start_suspend() into a
single function named snapshot_request().

snapshot_request() is equivalent to send_start_checkpoint() and
send_start_suspend() except that it takes an additional argument. The
additional argument, enum ipc_opcode, is used to determine the type of
snapshot request being performed. Also, switch to using strlcpy instead
of strncpy.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28878

----

bhyve/snapshot: rename and bump size of MAX_SNAPSHOT_VMNAME

MAX_SNAPSHOT_VMNAME is a macro used to set the size of a character
buffer that stores a filename or the path to a file - this file is used
by the save/restore feature.

Since the file doesn't have anything to do with a vm name, rename
MAX_SNAPSHOT_VMNAME to MAX_SNAPSHOT_FILENAME. Bump the size to PATH_MAX
while here.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28879

----

bhyvectl: print a better error message when vm_open() fails

Use errno to print a more descriptive error message when vm_open() fails

libvmm: preserve errno when vm_device_open() fails

vm_destroy() squashes errno by making a dive into sysctlbyname() - we
can safely skip vm_destroy() here since it's not doing any critical
clean up at this point. Replace vm_destroy() with a free() call.

PR:             250671
MFC after:      3 days
Submitted by:   marko@apache.org
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D29109

----

bhyve/snapshot: use SOCK_DGRAM instead of SOCK_STREAM

The save/restore feature uses a unix domain socket to send messages
from bhyvectl(8) to a bhyve(8) process. A datagram socket will suffice
for this.

An added benefit of using a datagram socket is simplified code. For
bhyve, the listen/accept calls are dropped; and for bhyvectl, the
connect() call is dropped.

EPRINTLN handles raw mode for bhyve(8), use it to print error messages.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28983

----

bhyve: virtio shares definitions between sys/dev/virtio

Definitions inside usr.sbin/bhyve/virtio.h are thrown away.
Definitions in sys/dev/virtio are used instead.

This reduces code duplication.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29084

----

Refactor configuration management in bhyve.

Replace the existing ad-hoc configuration via various global variables
with a small database of key-value pairs.  The database supports
heirarchical keys using a MIB-like syntax to name the path to a given
key.  Values are always stored as strings.  The API used to manage
configuation values does include wrappers to handling boolean values.
Other values use non-string types require parsing by consumers.

The configuration values are stored in a tree using nvlists.  Leaf
nodes hold string values.  Configuration values are permitted to
reference other configuration values using '%(name)'.  This permits
constructing template configurations.

All existing command line arguments now set configuration values.  For
devices, the "-s" option parses its option argument to generate a list
of key-value pairs for the given device.

A new '-o' command line option permits setting an individual
configuration variable.  The key name is always given as a full path
of dot-separated components.

A new '-k' command line option parses a simple configuration file.
This configuration file holds a flat list of 'key=value' lines where
the 'key' is the full path of a configuration variable.  Lines
starting with a '#' are comments.

In general, bhyve starts by parsing command line options in sequence
and applying those settings to configuration values.  Once this is
complete, bhyve then begins initializing its state based on the
configuration values.  This means that subsequent configuration
options or files may override or supplement previously given settings.

A special 'config.dump' configuration value can be set to true to help
debug configuration issues.  When this value is set, bhyve will print
out the configuration variables as a flat list of 'key=value' lines.

Most command line argments map to a single configuration variable,
e.g.  '-w' sets the 'x86.strictmsr' value to false.  A few command
line arguments have less obvious effects:

- Multiple '-p' options append their values (as a comma-seperated
  list) to "vcpu.N.cpuset" values (where N is a decimal vcpu number).

- For '-s' options, a pci.<bus>.<slot>.<function> node is created.
  The first argument to '-s' (the device type) is used as the value of
  a "device" variable.  Additional comma-separated arguments are then
  parsed into 'key=value' pairs and used to set additional variables
  under the device node.  A PCI device emulation driver can provide
  its own hook to override the parsing of the additonal '-s' arguments
  after the device type.

  After the configuration phase as completed, the init_pci hook
  then walks the "pci.<bus>.<slot>.<func>" nodes.  It uses the
  "device" value to find the device model to use.  The device
  model's init routine is passed a reference to its nvlist node
  in the configuration tree which it can query for specific
  variables.

  The result is that a lot of the string parsing is removed from
  the device models and centralized.  In addition, adding a new
  variable just requires teaching the model to look for the new
  variable.

- For '-l' options, a similar model is used where the string is
  parsed into values that are later read during initialization.
  One key note here is that the serial ports use the commonly
  used lowercase names from existing documentation and examples
  (e.g. "lpc.com1") instead of the uppercase names previously
  used internally in bhyve.

Reviewed by:	grehan
MFC after:	3 months
Differential Revision:	https://reviews.freebsd.org/D26035

----

bhyve: support relocating fbuf and passthru data BARs

We want to allow the UEFI firmware to enumerate and assign
addresses to PCI devices so we can boot from NVMe[1]. Address
assignment of PCI BARs is properly handled by the PCI emulation
code in general, but a few specific cases need additional support.
fbuf and passthru map additional objects into the guest physical
address space and so need to handle address updates. Here we add a
callback to emulated PCI devices to inform them of a BAR
configuration change. fbuf and passthru then watch for these BAR
changes and relocate the frame buffer memory segment and passthru
device mmio area respectively.

We also add new VM_MUNMAP_MEMSEG and VM_UNMAP_PPTDEV_MMIO ioctls
to vmm(4) to facilitate the unmapping needed for addres updates.

[1]: freebsd/uefi-edk2#9

Originally by:	scottph
MFC After:	1 week
Sponsored by:	Intel Corporation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D24066

----

bhyve amd: Small cleanups in amdvi_dump_cmds

Bump offset with MOD_INC instead in amdvi_dump_cmds.

Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D28862

----

bhyve hostbridge: Rename "device" property to "devid".

"device" is already used as the generic PCI-level name of the device
model to use (e.g. "hostbridge").  The result was that parsing
"hostbridge" as an integer failed and the host bridge used a device ID
of 0.  The EFI ROM asserts that the device ID of the hostbridge is not
0, so booting with the current EFI ROM was failing during the ROM
boot.

Fixes:		621b509

----

bhyve: Enable virtio-scsi legacy config parsing.

The previous commit added the handler to parse the command line
options for virtio-scsi devices but forgot to set the correct function
pointer to point to the handler.

Reported by:	vangyzen
Reviewed by:	vangyzen
Fixes:		621b509
Differential Revision:	https://reviews.freebsd.org/D29438

----

bhyve: change vq_getchain to return iovecs in both directions

The old prototype requires callers to inspect flags of each descriptors
to get the starting position of host-writable iovecs.

vq_getchain() is changed to return a virtio request with the number of
host-readable iovecs and host-writable iovecs instead. Callers can avoid
boilerplate code of getting the start offset of host-writable iovecs.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Reviewed by:	afedorov
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29433

----

Fix typo in xhci nvlist node name, and also increment device counter.

This allows the xhci tablet device to be recognized and a PCI device
instantiated.

Reviewed by:	jhb
Fixes:		621b509 Refactor configuration management in bhyve.
MFC after:	3 months.

----

bhyve: fix regression in legacy virtio-9p config parsing

Commit 621b509 introduced a regression
in legacy virtio-9p config parsing by not initializing *sharename to
NULL. As a result, "sharename != NULL" check in the first iteration fails
and bhyve exits with "virtio-9p: more than one share name given".

Fix by adding NULL back.

Approved by:	grehan

----

bhyve: add SMBIOS Baseboard Information

Add the System Management BIOS Baseboard (or Module) Information
a.k.a. Type 2 structure to the SMBIOS emulation.

Reviewed by:	rgrimes, bcran, grehan
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29657

----

bhyve: Move the gdb_active check to gdb_cpu_suspend().

The check needs to be in the public routine (gdb_cpu_suspend()), not
in the internal routine called from various places
(_gdb_cpu_suspend()).  All the other callers of _gdb_cpu_suspend()
already check gdb_active, and this breaks the use of snapshots when
the debug server is not enabled since gdb_cpu_suspend() tries to lock
an uninitialized mutex.

Reported by:	Darius Mihai, Elena Mihailescu
Reviewed by:	elenamihailescu22_gmail.com
Fixes:		621b509
Differential Revision:	https://reviews.freebsd.org/D29538

----

bhyve: implement rdmsr for MSR_IA32_FEATURE_CONTROL

Without the -w option, Windows guests crash on boot. This is caused by a rdmsr
of MSR_IA32_FEATURE_CONTROL. Windows checks this MSR to determine enabled VMX
features. This MSR isn't emulated in bhyve, so a #GP exception is injected
which causes Windows to crash.

Fix by returning a rdmsr of MSR_IA32_FEATURE_CONTROL with Lock Bit set and
VMX disabled to informWindows that VMX isn't available.

Reviewed by:	jhb, grehan (bhyve)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29665

----

bhyve.8: Make synopsis more readable

There is no need to squeeze all the possible options into one synopsis
entry. Let "-l help" and "-s help" be listed separately.

While here, keep -s and its arguments on the same line.

MFC after:	2 weeks

----

bhyve: Fix synopsis in the usage message

In particular:
- Sort short options to align with style(9)
- Add two missing flags: -G and -r
- Drop unnecessary angle brackets for consistency
- Rename the "vm" argument to vmname for consistency with the manual
  page

MFC after:	2 weeks

----

bhyve: Improve the option description in the usage message

- Sort options as suggested by style(9)
- Capitalize some words like CPU and HLT
- Add a missing description for the -G flag

MFC after:	2 weeks

----

bhyve.8: Sort the options in the OPTIONS section

No content change intended. Just moving the option descriptions around
to follow the order suggested by style(9).

MFC after:	2 weeks

----

bhyve.8: Improve the description and synopsis of -l

- Describe "-l help" separately for readability.
- List all the supported comX devices explicitly
- Use Cm instead of Ar for command modifiers (i.e., literal values a
  user can specify as an argument to the command).
- Explain where to get more information about the possible values of the
  conf argument.

MFC after:	2 weeks

----

bhyve.8: Improve the description of the -m flag

- Stylize the synopsis with proper mdoc macros
- Do some wordsmithing on the description for consistency.

MFC after:	2 weeks

----

bhyve.8: Fix the synopsis of -p

Use appropriate mdoc macros.

MFC after:	2 weeks

----

bhyve.8: Clean up description of -r

There is no need to wrap those flags in Op macros.

MFC after:	2 weeks

----

bhyve.8: Fix indention in the signals table

MFC after:	2 weeks

----

bhyve.8: Clean-up synopsis of -s

- Document "-s help" separately for readability.
- Use appropriate mdoc macros.

MFC after:	2 weeks

----

bhyve.8: Clean up the slot description of -s

Also, remove the macros of the nested list which contained slot,
emulation and conf. This decreases the indention of the -s description.
It was necessary to clean up the slot description.

MFC after:	2 weeks

----

bhyve.8: Improve emulation description of the -s flag

- Set width of the list to the longest key word for readability.
- Separate descriptions of amd_hostbridge and hostbridge emulations.
  Also, wordsmith their descriptions for consistency with other entries.
- Use Cm instead of Li for command modifiers.
- Do not stylize AMD with Li, there's no need to do it.
- Mention COM3 and COM4 in the definition of lpc.
- Fix a typo in the definition of ahci-hd ("hard drive" instead of
  "hard-drive").

MFC after:	2 weeks

----

bhyve.8: Clean up network backends section

- Reformat the format lists, use appropriate mdoc macros for
  readability.
- Add a missing Oxford comma.

MFC after:	2 weeks

----

bhyve.8: Clean up block storage device backends description

MFC after:	2 weeks

----

bhyve.8: Clean up SCSI device backends section

MFC after:	2 weeks

----

bhyve.8: Clean up 9P device backends section

MFC after:	2 weeks

----

bhyve.8: Clean up TTY, boot ROM, and pass-through descriptions

MFC after:	2 weeks

----

bhyve.8: Clean up virtio console device backends description

MFC after:	2 weeks

----

bhyve.8: Improve framebuffer backends description

- Use appropriate mdoc macros
- Document that tcp= is a synonym to rfb= (tcp is used in the examples,
  but never mentioned)
- Clarify the IP address specification

MFC after:	2 weeks

----

bhyve.8: Improve documentation of NVME backend

- Document the configuration format.
- Document two additional configuration options: eui64 and dsm.

MFC after:	2 weeks

----

bhyve.8: Improve AHCI backends documentation

- Document the backend format.

MFC after:	2 weeks

----

bhyve: Document the format for HD audio backends

- This change is done for consistency with other backend definitions.

MFC after:	2 weeks

----

bhyve.8: Fix mandoc -Tlint issues

While here, keep network backends section consistent with other
sections.

MFC after:	2 weeks

----

bhyve: Be explicit that setting config.dump will not start a VM.

Suggested by:	rpokala
Reviewed by:	bcr (manpages)
Differential Revision:	https://reviews.freebsd.org/D29738

----

bhyve: Gracefully handle virtio-scsi with no conf

Fixes segfault with the command `bhyve -s 0,virtio-scsi`, which is used
by some third party software to probe bhyve for virtio-scsi support.

Reviewed by:	jhb
MFC after:	1 day
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D29926

----

bhyve: Set SO_REUSEADDR on the gdb stub socket

Reviewed by:	jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30037

----

bhyve/snapshot: provide a way to send other messages/data to bhyve

This is a step towards sending messages (other than suspend/checkpoint)
from bhyvectl to bhyve.

Introduce a new struct, ipc_message - this struct stores the type of
message and a union containing message specific structures for the type
of message being sent.

Reviewed by:    grehan
Differential Revision: https://reviews.freebsd.org/D30221

----

bhyve/snapshot: split up mutex/cond initialization from socket creation

Move initialization of the mutex/condition variables required by the
save/restore feature to their own function.

The unix domain socket that facilitates communication between bhyvectl
and bhyve doesn't rely on these variables in order to be functional.

Reviewed by:    markj
Differential Revision:  https://reviews.freebsd.org/D30281

----

Add a virtio-input device emulation.

This will be used to inject keyboard/mouse input events into a guest.
The command line syntax is:
   -s <slot>,virtio-input,/dev/input/eventX

Reviewed by:	jhb (bhyve), grehan
Obtained from:	Corvin Köhne <C.Koehne@beckhoff.com>
MFC after:	3 weeks
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D30020

----

bhyve: Register new kevents synchronously.

Change mevent_add*() to synchronously add the new kevent.  This
permits reporting event registration failures to the caller and avoids
failing the registration of other, unrelated events queued up in the
same batch.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30502

----

bhyve: Add support for EVFILT_VNODE mevents.

This allows registering an event to watch for changes to a file's
attributes.  This is a bit imperfect as it would be nice to have a way
to determine if an fd can use EVFILT_VNODE successfully.  mevent's
current structure does not permit that and a failure to register a
single kevent impacts several other kevents.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30503

----

bhyve: Add support for handling disk resize events to block_if.

Allow clients of blockif to register a resize callback handler.  When
a callback is registered, register an EVFILT_VNODE kevent watching the
backing store for a change in the file's attributes.  If the size has
changed when the kevent fires, invoke the clients' callback.

Currently resize detection is limited to backing stores that support
EVFILT_VNODE kevents such as regular files.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30504

----

bhyve: Split out a lower-level helper for VirtIO interrupts.

This allows device models to assert VirtIO interrupts for reasons
other than publishing changes to a VirtIO ring such as configuration
changes.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30505

----

bhyve vtblk: Inform guests of disk resize events.

Register a resize callback with the blockif interface.  When the
callback fires, update the size of the disk and notify the guest via a
configuration change interrupt.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30506

----

bhyve: enhance debug info for memory range clash

Explain what the two clashing regions are.

Reivewed by:		grehan, jhb
Differential Revision:	https://reviews.freebsd.org/D29696
Pull Request:		freebsd/freebsd-src#463

----

Add more GIC and GICv3 registers

These aren't used by either driver, however they will be needed by
bhyve on arm64 to emulate a GICv3 interrupt controller.

Sponsored by:	Innovate UK

----

bhyve: Fix cli regression with NVMe ram

The configuration management refactoring inadvertently removed support
for a RAM-backed NVMe Namespace (i.e. -s X,nvme,ram=16384). This adds it
back.

Reported by:	andy@omniosce.org
Reviewed by:	jhb, andy@omniosce.org
Fixes:		621b509
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D30717

----

bhyve: fix NVMe MDTS comment

Removes an obsolete comment and adds parenthesis around the macro while
in the area. No functional change.

----

bhyve: Fix NVMe iovec construction for large IOs

The UEFI driver included with Rocky Linux 8.4 uncovered an existing bug
in the NVMe emulation's construction of iovec's.

By default, NVMe data transfer operations use a scatter-gather list in
which all entries point to a fixed size memory region. For example, if
the Memory Page Size is 4KiB, a 2MiB IO requires 512 entries. Lists
themselves are also fixed size (default is 512 entries).

Because the list size is fixed, the last entry is special. If the IO
requires more than 512 entries, the last entry in the list contains the
address of the next list of entries. But if the IO requires exactly 512
entries, the last entry points to data.

The NVMe emulation missed this logic and unconditionally treated the
last entry as a pointer to the next list. Fix is to check if the
remaining data is greater than the page size before using the last entry
as a pointer to the next list.

PR:		256422
Reported by:	dave@syix.com
Tested by:	jason@tubnor.net
MFC after:	5 days
Relnotes:	yes
Reviewed by:	imp, grehan
Differential Revision:	https://reviews.freebsd.org/D30897

----

Append Keyboard Layout specified option for using VNC.
Part one: supporting QEMU Extended Keyboard Event Message

PR:             246121
Submitted by:   koinec@yahoo.co.jp
Differential Revision: https://reviews.freebsd.org/D29430

----

libvmm: explicitly save and restore errno in vm_open()

In commit 6bb140e, vm_destroy() was replaced with free() to
preserve errno. However, it's possible that free() may change the errno
as well. Keep the free() call, but explicitly save and restore errno.

Noted by: jhb
Fixes: 6bb140e

----

vmm: Let guests enable SMEP/SMAP if the host supports it

Reviewed by:	kib, grehan, jhb
Tested by:	grehan (AMD)
MFC after:	3 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30462

----

vmm: Fix ivrs_drv device_printf usage

The original %b description string is wrong.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	imp, jhb
Differential Revision:	https://reviews.freebsd.org/D30805

----

bhyve/vioapic: remove an extra pin masked check

vioapic_send_intr does already check whether the pin is masked before
injecting the interrupt, there's no need to do it in vioapic_write
also.

No functional change intended.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28236

----

bhyve/ioapic: only account for asserted line in level mode

After modifying a redirection entry only try to inject an interrupt if
the pin is in level mode, pins in edge mode shouldn't take into
account the line assert status as they are triggered by edge changes,
not the line status itself.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28237

----

bhyve/ioapic: improve the tracking of IRR bit

One common method of EOI'ing an interrupt at the IO-APIC level is to
switch the pin to edge triggering mode and then back into level mode.
That would cause the IRR bit to be cleared and thus further interrupts
to be injected. FreeBSD does indeed use that method if the IO-APIC EOI
register is not supported.

The bhyve IO-APIC emulation code didn't clear the IRR bit when doing
that switch, and was also missing acknowledging the IRR state when
trying to inject an interrupt in vioapic_send_intr.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28238

----

ivrs_drv: Fix IVHDs with duplicated BaseAddress

Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28945

----

AMD-vi: Fix IOMMU device interrupts being overridden

Currently, AMD-vi PCI-e passthrough will lead to the following lines in
dmesg:
"kernel: CPU0: local APIC error 0x40
ivhd0: Error: completion failed tail:0x720, head:0x0."

After some tracing, the problem is due to the interaction with
amdvi_alloc_intr_resources() and pci_driver_added(). In ivrs_drv, the
identification of AMD-vi IVHD is done by walking over the ACPI IVRS
table and ivhdX device_ts are added under the acpi bus, while there are
no driver handling the corresponding IOMMU PCI function. In
amdvi_alloc_intr_resources(), the MSI intr are allocated with the ivhdX
device_t instead of the IOMMU PCI function device_t. bus_setup_intr() is
called on ivhdX. the IOMMU pci function device_t is only used for
pci_enable_msi(). Since bus_setup_intr() is not called on IOMMU pci
function, the IOMMU PCI function device_t's dinfo->cfg.msi is never
updated to reflect the supposed msi_data and msi_addr. So the msi_data
and msi_addr stay in the value 0. When pci_driver_added() tried to loop
over the children of a pci bus, and do pci_cfg_restore() on each of
them, msi_addr and msi_data with value 0 will be written to the MSI
capability of the IOMMU pci function, thus explaining the errors in
dmesg.

This change includes an amdiommu driver which currently does attaching,
detaching and providing DEVMETHODs for setting up and tearing down
interrupt. The purpose of the driver is to prevent pci_driver_added()
from calling pci_cfg_restore() on the IOMMU PCI function device_t.
The introduction of the amdiommu driver handles allocation of an IRQ
resource within the IOMMU PCI function, so that the dinfo->cfg.msi is
populated.

This has been tested on EPYC Rome 7282 with Radeon 5700XT GPU.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28984

----

Correct "Fondation" typo (missing "u")

----

AMD-vi: Mixed format IVHD block should replace fixed format IVHD block

This fixes double IVHD_SETUP_INTR calls on the same IOMMU device.

Sponsored by:	The FreeBSD Foundation
MFC with:	74ada29
Reported by:	Oleg Ginzburg <olevole@olevole.ru>
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29521

----

vmm: Fix AMD-vi using wrong rid range

The ACPI parsing code around rid range was wrong on assuming there is
only one pair of start/end device id range. Besides, ivhd_dev_parse()
never work as supposed. The start/end rid info was always zero.

Restructure the code to build dynamic-sized tables for each IOMMU softc
holding device entries. The device entries are enumerated to find a
suitable IOMMU unit. Operations on devices not governed (e.g. the IOMMU
unit itself) are no-op from now on. There are also a minor fix on wrong
%b formatting string usage.

Tested on my EPYC 7282.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D30827

----

vmm: Bump vmname buffer in struct vm to VM_MAX_NAMELEN + 1

In hw.vmm.create sysctl handler the maximum length of vm name is
VM_MAX_NAMELEN. However in vm_create() the maximum length allowed is
only VM_MAX_NAMELEN - 1 chars. Bump the length of the internal buffer to
allow the length of VM_MAX_NAMELEN for vm name.

MFC after:	3 days
Reviewed by:	grehan
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31372

----

amd64: Fix output operand specs for the stmxcsr and vmread intrinsics

This does not appear to affect code generation, at least with the
default toolchain.

Noticed because incorrect output specifications lead to false positives
from KMSAN, as the instrumentation uses them to update shadow state for
output operands.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31466

----

vmm: Make iommu ops tables const

While here, use designated initializers and rename some AMD iommu method
implementations to match the corresponding op names.  No functional
change intended.

Reviewed by:	grehan
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31462

----

vmm: Fix wrong assert in ivhd_dev_add_entry

The correct condition is to check the number of ivhd entries fit into
the array.

Reported by:	bz
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D31514

----

vmm: Add credential to cdev object

Add a credential to the cdev object in sysctl_vmm_create(), then check
that we have the correct credentials in sysctl_vmm_destroy(). This
prevents a process in one jail from opening or destroying the /dev/vmm
file corresponding to a VM in a sibling jail.

Add regression tests.

Reviewed by:	jhb, markj
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31156

----

bhyve: net_backends, automatically IFF_UP tap devices

If you want communications with the outside world and tell bhyve to
create an interfaces then it should be usable as well.
Rather than relying on the sysctl net.link.tap.up_on_open automatically
try to IFF_UP the opened tap device.

MFC after:	10 days
Reviewed by:	markj, grehan
Differential Revision: https://reviews.freebsd.org/D31342

----

bhyve: Use fspacectl(2) for BOP_DELETE on regular file images

bhyve can also make use of fspacectl(2) to implement BOP_DELETE with
hole-punching. Since it is not desirable to do zero-filling for large
DEALLOCATE/UNMAP range, candelete is not set if pathconf(2) indicates
that the underlying file system does not support native
VOP_DEALLOCATE(9).

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D28880

----

bhyve: Use pci(4) to access I/O port BARs

This removes the dependency on /dev/io.

PR:		251046
Reviewed by:	jhb
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31308

----

byhve: add option to specify IP address for gdb

Allow user to specify the IP address available for gdb debugger.

Reviewed by:	jhb, grehan, rgrimes, bcr (man pages)
Differential Revision:	https://reviews.freebsd.org/D29607

----

bhyve: change a default address from ANY to localhost

Discussed with:     grehan, jhb

----

bhyve: Fix vq_getchain() error handling bugs in various device models

Reviewed by:	grehan, khng
Approved by:	so
Security:	CVE-2021-29631
Security:	FreeBSD-SA-21:13.bhyve
Yamagi pushed a commit to Yamagi/freebsd that referenced this pull request Aug 27, 2021
libvmm: clean up vmmapi.h

struct checkpoint_op, enum checkpoint_opcodes, and
MAX_SNAPSHOT_VMNAME are not vmm specific, move them out of the vmmapi
header.

They are used for the save/restore functionality that bhyve(8)
provides and are better suited in usr.sbin/bhyve/snapshot.h

Since bhyvectl(8) requires these, the Makefile for bhyvectl has been
modified to include usr.sbin/bhyve/snapshot.h

Reviewed by:    kevans, grehan
Differential Revision:  https://reviews.freebsd.org/D28410

----

bhyve/snapshot: drop mkdir when creating the unix domain socket

Add /var/run/bhyve/ to BSD.var.dist so we don't have to call mkdir when
creating the unix domain socket for a given bhyve vm.

The path to the unix domain socket for a bhyve vm will now be
/var/run/bhyve/vmname instead of /var/run/bhyve/checkpoint/vmname

Move BHYVE_RUN_DIR from snapshot.c to snapshot.h so it can be shared
to bhyvectl(8).

Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D28783

----

bhyve/snapshot: rename checkpoint_opcodes to be more generic

Generalize the naming here since the domain socket that uses these codes
might be used for purposes other than the save/restore feature.

- rename checkpoint_opcodes to ipc_opcode

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28877

----

bhyvectl: reduce code duplication

Combine send_start_checkpoint() and send_start_suspend() into a
single function named snapshot_request().

snapshot_request() is equivalent to send_start_checkpoint() and
send_start_suspend() except that it takes an additional argument. The
additional argument, enum ipc_opcode, is used to determine the type of
snapshot request being performed. Also, switch to using strlcpy instead
of strncpy.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28878

----

bhyve/snapshot: rename and bump size of MAX_SNAPSHOT_VMNAME

MAX_SNAPSHOT_VMNAME is a macro used to set the size of a character
buffer that stores a filename or the path to a file - this file is used
by the save/restore feature.

Since the file doesn't have anything to do with a vm name, rename
MAX_SNAPSHOT_VMNAME to MAX_SNAPSHOT_FILENAME. Bump the size to PATH_MAX
while here.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28879

----

bhyvectl: print a better error message when vm_open() fails

Use errno to print a more descriptive error message when vm_open() fails

libvmm: preserve errno when vm_device_open() fails

vm_destroy() squashes errno by making a dive into sysctlbyname() - we
can safely skip vm_destroy() here since it's not doing any critical
clean up at this point. Replace vm_destroy() with a free() call.

PR:             250671
MFC after:      3 days
Submitted by:   marko@apache.org
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D29109

----

bhyve/snapshot: use SOCK_DGRAM instead of SOCK_STREAM

The save/restore feature uses a unix domain socket to send messages
from bhyvectl(8) to a bhyve(8) process. A datagram socket will suffice
for this.

An added benefit of using a datagram socket is simplified code. For
bhyve, the listen/accept calls are dropped; and for bhyvectl, the
connect() call is dropped.

EPRINTLN handles raw mode for bhyve(8), use it to print error messages.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28983

----

bhyve: virtio shares definitions between sys/dev/virtio

Definitions inside usr.sbin/bhyve/virtio.h are thrown away.
Definitions in sys/dev/virtio are used instead.

This reduces code duplication.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29084

----

Refactor configuration management in bhyve.

Replace the existing ad-hoc configuration via various global variables
with a small database of key-value pairs.  The database supports
heirarchical keys using a MIB-like syntax to name the path to a given
key.  Values are always stored as strings.  The API used to manage
configuation values does include wrappers to handling boolean values.
Other values use non-string types require parsing by consumers.

The configuration values are stored in a tree using nvlists.  Leaf
nodes hold string values.  Configuration values are permitted to
reference other configuration values using '%(name)'.  This permits
constructing template configurations.

All existing command line arguments now set configuration values.  For
devices, the "-s" option parses its option argument to generate a list
of key-value pairs for the given device.

A new '-o' command line option permits setting an individual
configuration variable.  The key name is always given as a full path
of dot-separated components.

A new '-k' command line option parses a simple configuration file.
This configuration file holds a flat list of 'key=value' lines where
the 'key' is the full path of a configuration variable.  Lines
starting with a '#' are comments.

In general, bhyve starts by parsing command line options in sequence
and applying those settings to configuration values.  Once this is
complete, bhyve then begins initializing its state based on the
configuration values.  This means that subsequent configuration
options or files may override or supplement previously given settings.

A special 'config.dump' configuration value can be set to true to help
debug configuration issues.  When this value is set, bhyve will print
out the configuration variables as a flat list of 'key=value' lines.

Most command line argments map to a single configuration variable,
e.g.  '-w' sets the 'x86.strictmsr' value to false.  A few command
line arguments have less obvious effects:

- Multiple '-p' options append their values (as a comma-seperated
  list) to "vcpu.N.cpuset" values (where N is a decimal vcpu number).

- For '-s' options, a pci.<bus>.<slot>.<function> node is created.
  The first argument to '-s' (the device type) is used as the value of
  a "device" variable.  Additional comma-separated arguments are then
  parsed into 'key=value' pairs and used to set additional variables
  under the device node.  A PCI device emulation driver can provide
  its own hook to override the parsing of the additonal '-s' arguments
  after the device type.

  After the configuration phase as completed, the init_pci hook
  then walks the "pci.<bus>.<slot>.<func>" nodes.  It uses the
  "device" value to find the device model to use.  The device
  model's init routine is passed a reference to its nvlist node
  in the configuration tree which it can query for specific
  variables.

  The result is that a lot of the string parsing is removed from
  the device models and centralized.  In addition, adding a new
  variable just requires teaching the model to look for the new
  variable.

- For '-l' options, a similar model is used where the string is
  parsed into values that are later read during initialization.
  One key note here is that the serial ports use the commonly
  used lowercase names from existing documentation and examples
  (e.g. "lpc.com1") instead of the uppercase names previously
  used internally in bhyve.

Reviewed by:	grehan
MFC after:	3 months
Differential Revision:	https://reviews.freebsd.org/D26035

----

bhyve: support relocating fbuf and passthru data BARs

We want to allow the UEFI firmware to enumerate and assign
addresses to PCI devices so we can boot from NVMe[1]. Address
assignment of PCI BARs is properly handled by the PCI emulation
code in general, but a few specific cases need additional support.
fbuf and passthru map additional objects into the guest physical
address space and so need to handle address updates. Here we add a
callback to emulated PCI devices to inform them of a BAR
configuration change. fbuf and passthru then watch for these BAR
changes and relocate the frame buffer memory segment and passthru
device mmio area respectively.

We also add new VM_MUNMAP_MEMSEG and VM_UNMAP_PPTDEV_MMIO ioctls
to vmm(4) to facilitate the unmapping needed for addres updates.

[1]: freebsd/uefi-edk2#9

Originally by:	scottph
MFC After:	1 week
Sponsored by:	Intel Corporation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D24066

----

bhyve amd: Small cleanups in amdvi_dump_cmds

Bump offset with MOD_INC instead in amdvi_dump_cmds.

Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D28862

----

bhyve hostbridge: Rename "device" property to "devid".

"device" is already used as the generic PCI-level name of the device
model to use (e.g. "hostbridge").  The result was that parsing
"hostbridge" as an integer failed and the host bridge used a device ID
of 0.  The EFI ROM asserts that the device ID of the hostbridge is not
0, so booting with the current EFI ROM was failing during the ROM
boot.

Fixes:		621b509

----

bhyve: Enable virtio-scsi legacy config parsing.

The previous commit added the handler to parse the command line
options for virtio-scsi devices but forgot to set the correct function
pointer to point to the handler.

Reported by:	vangyzen
Reviewed by:	vangyzen
Fixes:		621b509
Differential Revision:	https://reviews.freebsd.org/D29438

----

bhyve: change vq_getchain to return iovecs in both directions

The old prototype requires callers to inspect flags of each descriptors
to get the starting position of host-writable iovecs.

vq_getchain() is changed to return a virtio request with the number of
host-readable iovecs and host-writable iovecs instead. Callers can avoid
boilerplate code of getting the start offset of host-writable iovecs.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Reviewed by:	afedorov
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29433

----

Fix typo in xhci nvlist node name, and also increment device counter.

This allows the xhci tablet device to be recognized and a PCI device
instantiated.

Reviewed by:	jhb
Fixes:		621b509 Refactor configuration management in bhyve.
MFC after:	3 months.

----

bhyve: fix regression in legacy virtio-9p config parsing

Commit 621b509 introduced a regression
in legacy virtio-9p config parsing by not initializing *sharename to
NULL. As a result, "sharename != NULL" check in the first iteration fails
and bhyve exits with "virtio-9p: more than one share name given".

Fix by adding NULL back.

Approved by:	grehan

----

bhyve: add SMBIOS Baseboard Information

Add the System Management BIOS Baseboard (or Module) Information
a.k.a. Type 2 structure to the SMBIOS emulation.

Reviewed by:	rgrimes, bcran, grehan
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29657

----

bhyve: Move the gdb_active check to gdb_cpu_suspend().

The check needs to be in the public routine (gdb_cpu_suspend()), not
in the internal routine called from various places
(_gdb_cpu_suspend()).  All the other callers of _gdb_cpu_suspend()
already check gdb_active, and this breaks the use of snapshots when
the debug server is not enabled since gdb_cpu_suspend() tries to lock
an uninitialized mutex.

Reported by:	Darius Mihai, Elena Mihailescu
Reviewed by:	elenamihailescu22_gmail.com
Fixes:		621b509
Differential Revision:	https://reviews.freebsd.org/D29538

----

bhyve: implement rdmsr for MSR_IA32_FEATURE_CONTROL

Without the -w option, Windows guests crash on boot. This is caused by a rdmsr
of MSR_IA32_FEATURE_CONTROL. Windows checks this MSR to determine enabled VMX
features. This MSR isn't emulated in bhyve, so a #GP exception is injected
which causes Windows to crash.

Fix by returning a rdmsr of MSR_IA32_FEATURE_CONTROL with Lock Bit set and
VMX disabled to informWindows that VMX isn't available.

Reviewed by:	jhb, grehan (bhyve)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29665

----

bhyve.8: Make synopsis more readable

There is no need to squeeze all the possible options into one synopsis
entry. Let "-l help" and "-s help" be listed separately.

While here, keep -s and its arguments on the same line.

MFC after:	2 weeks

----

bhyve: Fix synopsis in the usage message

In particular:
- Sort short options to align with style(9)
- Add two missing flags: -G and -r
- Drop unnecessary angle brackets for consistency
- Rename the "vm" argument to vmname for consistency with the manual
  page

MFC after:	2 weeks

----

bhyve: Improve the option description in the usage message

- Sort options as suggested by style(9)
- Capitalize some words like CPU and HLT
- Add a missing description for the -G flag

MFC after:	2 weeks

----

bhyve.8: Sort the options in the OPTIONS section

No content change intended. Just moving the option descriptions around
to follow the order suggested by style(9).

MFC after:	2 weeks

----

bhyve.8: Improve the description and synopsis of -l

- Describe "-l help" separately for readability.
- List all the supported comX devices explicitly
- Use Cm instead of Ar for command modifiers (i.e., literal values a
  user can specify as an argument to the command).
- Explain where to get more information about the possible values of the
  conf argument.

MFC after:	2 weeks

----

bhyve.8: Improve the description of the -m flag

- Stylize the synopsis with proper mdoc macros
- Do some wordsmithing on the description for consistency.

MFC after:	2 weeks

----

bhyve.8: Fix the synopsis of -p

Use appropriate mdoc macros.

MFC after:	2 weeks

----

bhyve.8: Clean up description of -r

There is no need to wrap those flags in Op macros.

MFC after:	2 weeks

----

bhyve.8: Fix indention in the signals table

MFC after:	2 weeks

----

bhyve.8: Clean-up synopsis of -s

- Document "-s help" separately for readability.
- Use appropriate mdoc macros.

MFC after:	2 weeks

----

bhyve.8: Clean up the slot description of -s

Also, remove the macros of the nested list which contained slot,
emulation and conf. This decreases the indention of the -s description.
It was necessary to clean up the slot description.

MFC after:	2 weeks

----

bhyve.8: Improve emulation description of the -s flag

- Set width of the list to the longest key word for readability.
- Separate descriptions of amd_hostbridge and hostbridge emulations.
  Also, wordsmith their descriptions for consistency with other entries.
- Use Cm instead of Li for command modifiers.
- Do not stylize AMD with Li, there's no need to do it.
- Mention COM3 and COM4 in the definition of lpc.
- Fix a typo in the definition of ahci-hd ("hard drive" instead of
  "hard-drive").

MFC after:	2 weeks

----

bhyve.8: Clean up network backends section

- Reformat the format lists, use appropriate mdoc macros for
  readability.
- Add a missing Oxford comma.

MFC after:	2 weeks

----

bhyve.8: Clean up block storage device backends description

MFC after:	2 weeks

----

bhyve.8: Clean up SCSI device backends section

MFC after:	2 weeks

----

bhyve.8: Clean up 9P device backends section

MFC after:	2 weeks

----

bhyve.8: Clean up TTY, boot ROM, and pass-through descriptions

MFC after:	2 weeks

----

bhyve.8: Clean up virtio console device backends description

MFC after:	2 weeks

----

bhyve.8: Improve framebuffer backends description

- Use appropriate mdoc macros
- Document that tcp= is a synonym to rfb= (tcp is used in the examples,
  but never mentioned)
- Clarify the IP address specification

MFC after:	2 weeks

----

bhyve.8: Improve documentation of NVME backend

- Document the configuration format.
- Document two additional configuration options: eui64 and dsm.

MFC after:	2 weeks

----

bhyve.8: Improve AHCI backends documentation

- Document the backend format.

MFC after:	2 weeks

----

bhyve: Document the format for HD audio backends

- This change is done for consistency with other backend definitions.

MFC after:	2 weeks

----

bhyve.8: Fix mandoc -Tlint issues

While here, keep network backends section consistent with other
sections.

MFC after:	2 weeks

----

bhyve: Be explicit that setting config.dump will not start a VM.

Suggested by:	rpokala
Reviewed by:	bcr (manpages)
Differential Revision:	https://reviews.freebsd.org/D29738

----

bhyve: Gracefully handle virtio-scsi with no conf

Fixes segfault with the command `bhyve -s 0,virtio-scsi`, which is used
by some third party software to probe bhyve for virtio-scsi support.

Reviewed by:	jhb
MFC after:	1 day
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D29926

----

bhyve: Set SO_REUSEADDR on the gdb stub socket

Reviewed by:	jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30037

----

bhyve/snapshot: provide a way to send other messages/data to bhyve

This is a step towards sending messages (other than suspend/checkpoint)
from bhyvectl to bhyve.

Introduce a new struct, ipc_message - this struct stores the type of
message and a union containing message specific structures for the type
of message being sent.

Reviewed by:    grehan
Differential Revision: https://reviews.freebsd.org/D30221

----

bhyve/snapshot: split up mutex/cond initialization from socket creation

Move initialization of the mutex/condition variables required by the
save/restore feature to their own function.

The unix domain socket that facilitates communication between bhyvectl
and bhyve doesn't rely on these variables in order to be functional.

Reviewed by:    markj
Differential Revision:  https://reviews.freebsd.org/D30281

----

Add a virtio-input device emulation.

This will be used to inject keyboard/mouse input events into a guest.
The command line syntax is:
   -s <slot>,virtio-input,/dev/input/eventX

Reviewed by:	jhb (bhyve), grehan
Obtained from:	Corvin Köhne <C.Koehne@beckhoff.com>
MFC after:	3 weeks
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D30020

----

bhyve: Register new kevents synchronously.

Change mevent_add*() to synchronously add the new kevent.  This
permits reporting event registration failures to the caller and avoids
failing the registration of other, unrelated events queued up in the
same batch.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30502

----

bhyve: Add support for EVFILT_VNODE mevents.

This allows registering an event to watch for changes to a file's
attributes.  This is a bit imperfect as it would be nice to have a way
to determine if an fd can use EVFILT_VNODE successfully.  mevent's
current structure does not permit that and a failure to register a
single kevent impacts several other kevents.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30503

----

bhyve: Add support for handling disk resize events to block_if.

Allow clients of blockif to register a resize callback handler.  When
a callback is registered, register an EVFILT_VNODE kevent watching the
backing store for a change in the file's attributes.  If the size has
changed when the kevent fires, invoke the clients' callback.

Currently resize detection is limited to backing stores that support
EVFILT_VNODE kevents such as regular files.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30504

----

bhyve: Split out a lower-level helper for VirtIO interrupts.

This allows device models to assert VirtIO interrupts for reasons
other than publishing changes to a VirtIO ring such as configuration
changes.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30505

----

bhyve vtblk: Inform guests of disk resize events.

Register a resize callback with the blockif interface.  When the
callback fires, update the size of the disk and notify the guest via a
configuration change interrupt.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30506

----

bhyve: enhance debug info for memory range clash

Explain what the two clashing regions are.

Reivewed by:		grehan, jhb
Differential Revision:	https://reviews.freebsd.org/D29696
Pull Request:		freebsd/freebsd-src#463

----

Add more GIC and GICv3 registers

These aren't used by either driver, however they will be needed by
bhyve on arm64 to emulate a GICv3 interrupt controller.

Sponsored by:	Innovate UK

----

bhyve: Fix cli regression with NVMe ram

The configuration management refactoring inadvertently removed support
for a RAM-backed NVMe Namespace (i.e. -s X,nvme,ram=16384). This adds it
back.

Reported by:	andy@omniosce.org
Reviewed by:	jhb, andy@omniosce.org
Fixes:		621b509
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D30717

----

bhyve: fix NVMe MDTS comment

Removes an obsolete comment and adds parenthesis around the macro while
in the area. No functional change.

----

bhyve: Fix NVMe iovec construction for large IOs

The UEFI driver included with Rocky Linux 8.4 uncovered an existing bug
in the NVMe emulation's construction of iovec's.

By default, NVMe data transfer operations use a scatter-gather list in
which all entries point to a fixed size memory region. For example, if
the Memory Page Size is 4KiB, a 2MiB IO requires 512 entries. Lists
themselves are also fixed size (default is 512 entries).

Because the list size is fixed, the last entry is special. If the IO
requires more than 512 entries, the last entry in the list contains the
address of the next list of entries. But if the IO requires exactly 512
entries, the last entry points to data.

The NVMe emulation missed this logic and unconditionally treated the
last entry as a pointer to the next list. Fix is to check if the
remaining data is greater than the page size before using the last entry
as a pointer to the next list.

PR:		256422
Reported by:	dave@syix.com
Tested by:	jason@tubnor.net
MFC after:	5 days
Relnotes:	yes
Reviewed by:	imp, grehan
Differential Revision:	https://reviews.freebsd.org/D30897

----

Append Keyboard Layout specified option for using VNC.
Part one: supporting QEMU Extended Keyboard Event Message

PR:             246121
Submitted by:   koinec@yahoo.co.jp
Differential Revision: https://reviews.freebsd.org/D29430

----

libvmm: explicitly save and restore errno in vm_open()

In commit 6bb140e, vm_destroy() was replaced with free() to
preserve errno. However, it's possible that free() may change the errno
as well. Keep the free() call, but explicitly save and restore errno.

Noted by: jhb
Fixes: 6bb140e

----

vmm: Let guests enable SMEP/SMAP if the host supports it

Reviewed by:	kib, grehan, jhb
Tested by:	grehan (AMD)
MFC after:	3 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30462

----

vmm: Fix ivrs_drv device_printf usage

The original %b description string is wrong.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	imp, jhb
Differential Revision:	https://reviews.freebsd.org/D30805

----

bhyve/vioapic: remove an extra pin masked check

vioapic_send_intr does already check whether the pin is masked before
injecting the interrupt, there's no need to do it in vioapic_write
also.

No functional change intended.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28236

----

bhyve/ioapic: only account for asserted line in level mode

After modifying a redirection entry only try to inject an interrupt if
the pin is in level mode, pins in edge mode shouldn't take into
account the line assert status as they are triggered by edge changes,
not the line status itself.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28237

----

bhyve/ioapic: improve the tracking of IRR bit

One common method of EOI'ing an interrupt at the IO-APIC level is to
switch the pin to edge triggering mode and then back into level mode.
That would cause the IRR bit to be cleared and thus further interrupts
to be injected. FreeBSD does indeed use that method if the IO-APIC EOI
register is not supported.

The bhyve IO-APIC emulation code didn't clear the IRR bit when doing
that switch, and was also missing acknowledging the IRR state when
trying to inject an interrupt in vioapic_send_intr.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28238

----

ivrs_drv: Fix IVHDs with duplicated BaseAddress

Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28945

----

AMD-vi: Fix IOMMU device interrupts being overridden

Currently, AMD-vi PCI-e passthrough will lead to the following lines in
dmesg:
"kernel: CPU0: local APIC error 0x40
ivhd0: Error: completion failed tail:0x720, head:0x0."

After some tracing, the problem is due to the interaction with
amdvi_alloc_intr_resources() and pci_driver_added(). In ivrs_drv, the
identification of AMD-vi IVHD is done by walking over the ACPI IVRS
table and ivhdX device_ts are added under the acpi bus, while there are
no driver handling the corresponding IOMMU PCI function. In
amdvi_alloc_intr_resources(), the MSI intr are allocated with the ivhdX
device_t instead of the IOMMU PCI function device_t. bus_setup_intr() is
called on ivhdX. the IOMMU pci function device_t is only used for
pci_enable_msi(). Since bus_setup_intr() is not called on IOMMU pci
function, the IOMMU PCI function device_t's dinfo->cfg.msi is never
updated to reflect the supposed msi_data and msi_addr. So the msi_data
and msi_addr stay in the value 0. When pci_driver_added() tried to loop
over the children of a pci bus, and do pci_cfg_restore() on each of
them, msi_addr and msi_data with value 0 will be written to the MSI
capability of the IOMMU pci function, thus explaining the errors in
dmesg.

This change includes an amdiommu driver which currently does attaching,
detaching and providing DEVMETHODs for setting up and tearing down
interrupt. The purpose of the driver is to prevent pci_driver_added()
from calling pci_cfg_restore() on the IOMMU PCI function device_t.
The introduction of the amdiommu driver handles allocation of an IRQ
resource within the IOMMU PCI function, so that the dinfo->cfg.msi is
populated.

This has been tested on EPYC Rome 7282 with Radeon 5700XT GPU.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28984

----

Correct "Fondation" typo (missing "u")

----

AMD-vi: Mixed format IVHD block should replace fixed format IVHD block

This fixes double IVHD_SETUP_INTR calls on the same IOMMU device.

Sponsored by:	The FreeBSD Foundation
MFC with:	74ada29
Reported by:	Oleg Ginzburg <olevole@olevole.ru>
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29521

----

vmm: Fix AMD-vi using wrong rid range

The ACPI parsing code around rid range was wrong on assuming there is
only one pair of start/end device id range. Besides, ivhd_dev_parse()
never work as supposed. The start/end rid info was always zero.

Restructure the code to build dynamic-sized tables for each IOMMU softc
holding device entries. The device entries are enumerated to find a
suitable IOMMU unit. Operations on devices not governed (e.g. the IOMMU
unit itself) are no-op from now on. There are also a minor fix on wrong
%b formatting string usage.

Tested on my EPYC 7282.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D30827

----

vmm: Bump vmname buffer in struct vm to VM_MAX_NAMELEN + 1

In hw.vmm.create sysctl handler the maximum length of vm name is
VM_MAX_NAMELEN. However in vm_create() the maximum length allowed is
only VM_MAX_NAMELEN - 1 chars. Bump the length of the internal buffer to
allow the length of VM_MAX_NAMELEN for vm name.

MFC after:	3 days
Reviewed by:	grehan
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31372

----

amd64: Fix output operand specs for the stmxcsr and vmread intrinsics

This does not appear to affect code generation, at least with the
default toolchain.

Noticed because incorrect output specifications lead to false positives
from KMSAN, as the instrumentation uses them to update shadow state for
output operands.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31466

----

vmm: Make iommu ops tables const

While here, use designated initializers and rename some AMD iommu method
implementations to match the corresponding op names.  No functional
change intended.

Reviewed by:	grehan
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31462

----

vmm: Fix wrong assert in ivhd_dev_add_entry

The correct condition is to check the number of ivhd entries fit into
the array.

Reported by:	bz
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D31514

----

vmm: Add credential to cdev object

Add a credential to the cdev object in sysctl_vmm_create(), then check
that we have the correct credentials in sysctl_vmm_destroy(). This
prevents a process in one jail from opening or destroying the /dev/vmm
file corresponding to a VM in a sibling jail.

Add regression tests.

Reviewed by:	jhb, markj
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31156

----

bhyve: net_backends, automatically IFF_UP tap devices

If you want communications with the outside world and tell bhyve to
create an interfaces then it should be usable as well.
Rather than relying on the sysctl net.link.tap.up_on_open automatically
try to IFF_UP the opened tap device.

MFC after:	10 days
Reviewed by:	markj, grehan
Differential Revision: https://reviews.freebsd.org/D31342

----

bhyve: Use fspacectl(2) for BOP_DELETE on regular file images

bhyve can also make use of fspacectl(2) to implement BOP_DELETE with
hole-punching. Since it is not desirable to do zero-filling for large
DEALLOCATE/UNMAP range, candelete is not set if pathconf(2) indicates
that the underlying file system does not support native
VOP_DEALLOCATE(9).

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D28880

----

bhyve: Use pci(4) to access I/O port BARs

This removes the dependency on /dev/io.

PR:		251046
Reviewed by:	jhb
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31308

----

byhve: add option to specify IP address for gdb

Allow user to specify the IP address available for gdb debugger.

Reviewed by:	jhb, grehan, rgrimes, bcr (man pages)
Differential Revision:	https://reviews.freebsd.org/D29607

----

bhyve: change a default address from ANY to localhost

Discussed with:     grehan, jhb

----

bhyve: Fix vq_getchain() error handling bugs in various device models

Reviewed by:	grehan, khng
Approved by:	so
Security:	CVE-2021-29631
Security:	FreeBSD-SA-21:13.bhyve
Yamagi pushed a commit to Yamagi/freebsd that referenced this pull request Aug 27, 2021
libvmm: clean up vmmapi.h

struct checkpoint_op, enum checkpoint_opcodes, and
MAX_SNAPSHOT_VMNAME are not vmm specific, move them out of the vmmapi
header.

They are used for the save/restore functionality that bhyve(8)
provides and are better suited in usr.sbin/bhyve/snapshot.h

Since bhyvectl(8) requires these, the Makefile for bhyvectl has been
modified to include usr.sbin/bhyve/snapshot.h

Reviewed by:    kevans, grehan
Differential Revision:  https://reviews.freebsd.org/D28410

----

bhyve/snapshot: drop mkdir when creating the unix domain socket

Add /var/run/bhyve/ to BSD.var.dist so we don't have to call mkdir when
creating the unix domain socket for a given bhyve vm.

The path to the unix domain socket for a bhyve vm will now be
/var/run/bhyve/vmname instead of /var/run/bhyve/checkpoint/vmname

Move BHYVE_RUN_DIR from snapshot.c to snapshot.h so it can be shared
to bhyvectl(8).

Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D28783

----

bhyve/snapshot: rename checkpoint_opcodes to be more generic

Generalize the naming here since the domain socket that uses these codes
might be used for purposes other than the save/restore feature.

- rename checkpoint_opcodes to ipc_opcode

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28877

----

bhyvectl: reduce code duplication

Combine send_start_checkpoint() and send_start_suspend() into a
single function named snapshot_request().

snapshot_request() is equivalent to send_start_checkpoint() and
send_start_suspend() except that it takes an additional argument. The
additional argument, enum ipc_opcode, is used to determine the type of
snapshot request being performed. Also, switch to using strlcpy instead
of strncpy.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28878

----

bhyve/snapshot: rename and bump size of MAX_SNAPSHOT_VMNAME

MAX_SNAPSHOT_VMNAME is a macro used to set the size of a character
buffer that stores a filename or the path to a file - this file is used
by the save/restore feature.

Since the file doesn't have anything to do with a vm name, rename
MAX_SNAPSHOT_VMNAME to MAX_SNAPSHOT_FILENAME. Bump the size to PATH_MAX
while here.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28879

----

bhyvectl: print a better error message when vm_open() fails

Use errno to print a more descriptive error message when vm_open() fails

libvmm: preserve errno when vm_device_open() fails

vm_destroy() squashes errno by making a dive into sysctlbyname() - we
can safely skip vm_destroy() here since it's not doing any critical
clean up at this point. Replace vm_destroy() with a free() call.

PR:             250671
MFC after:      3 days
Submitted by:   marko@apache.org
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D29109

----

bhyve/snapshot: use SOCK_DGRAM instead of SOCK_STREAM

The save/restore feature uses a unix domain socket to send messages
from bhyvectl(8) to a bhyve(8) process. A datagram socket will suffice
for this.

An added benefit of using a datagram socket is simplified code. For
bhyve, the listen/accept calls are dropped; and for bhyvectl, the
connect() call is dropped.

EPRINTLN handles raw mode for bhyve(8), use it to print error messages.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28983

----

bhyve: virtio shares definitions between sys/dev/virtio

Definitions inside usr.sbin/bhyve/virtio.h are thrown away.
Definitions in sys/dev/virtio are used instead.

This reduces code duplication.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29084

----

Refactor configuration management in bhyve.

Replace the existing ad-hoc configuration via various global variables
with a small database of key-value pairs.  The database supports
heirarchical keys using a MIB-like syntax to name the path to a given
key.  Values are always stored as strings.  The API used to manage
configuation values does include wrappers to handling boolean values.
Other values use non-string types require parsing by consumers.

The configuration values are stored in a tree using nvlists.  Leaf
nodes hold string values.  Configuration values are permitted to
reference other configuration values using '%(name)'.  This permits
constructing template configurations.

All existing command line arguments now set configuration values.  For
devices, the "-s" option parses its option argument to generate a list
of key-value pairs for the given device.

A new '-o' command line option permits setting an individual
configuration variable.  The key name is always given as a full path
of dot-separated components.

A new '-k' command line option parses a simple configuration file.
This configuration file holds a flat list of 'key=value' lines where
the 'key' is the full path of a configuration variable.  Lines
starting with a '#' are comments.

In general, bhyve starts by parsing command line options in sequence
and applying those settings to configuration values.  Once this is
complete, bhyve then begins initializing its state based on the
configuration values.  This means that subsequent configuration
options or files may override or supplement previously given settings.

A special 'config.dump' configuration value can be set to true to help
debug configuration issues.  When this value is set, bhyve will print
out the configuration variables as a flat list of 'key=value' lines.

Most command line argments map to a single configuration variable,
e.g.  '-w' sets the 'x86.strictmsr' value to false.  A few command
line arguments have less obvious effects:

- Multiple '-p' options append their values (as a comma-seperated
  list) to "vcpu.N.cpuset" values (where N is a decimal vcpu number).

- For '-s' options, a pci.<bus>.<slot>.<function> node is created.
  The first argument to '-s' (the device type) is used as the value of
  a "device" variable.  Additional comma-separated arguments are then
  parsed into 'key=value' pairs and used to set additional variables
  under the device node.  A PCI device emulation driver can provide
  its own hook to override the parsing of the additonal '-s' arguments
  after the device type.

  After the configuration phase as completed, the init_pci hook
  then walks the "pci.<bus>.<slot>.<func>" nodes.  It uses the
  "device" value to find the device model to use.  The device
  model's init routine is passed a reference to its nvlist node
  in the configuration tree which it can query for specific
  variables.

  The result is that a lot of the string parsing is removed from
  the device models and centralized.  In addition, adding a new
  variable just requires teaching the model to look for the new
  variable.

- For '-l' options, a similar model is used where the string is
  parsed into values that are later read during initialization.
  One key note here is that the serial ports use the commonly
  used lowercase names from existing documentation and examples
  (e.g. "lpc.com1") instead of the uppercase names previously
  used internally in bhyve.

Reviewed by:	grehan
MFC after:	3 months
Differential Revision:	https://reviews.freebsd.org/D26035

----

bhyve: support relocating fbuf and passthru data BARs

We want to allow the UEFI firmware to enumerate and assign
addresses to PCI devices so we can boot from NVMe[1]. Address
assignment of PCI BARs is properly handled by the PCI emulation
code in general, but a few specific cases need additional support.
fbuf and passthru map additional objects into the guest physical
address space and so need to handle address updates. Here we add a
callback to emulated PCI devices to inform them of a BAR
configuration change. fbuf and passthru then watch for these BAR
changes and relocate the frame buffer memory segment and passthru
device mmio area respectively.

We also add new VM_MUNMAP_MEMSEG and VM_UNMAP_PPTDEV_MMIO ioctls
to vmm(4) to facilitate the unmapping needed for addres updates.

[1]: freebsd/uefi-edk2#9

Originally by:	scottph
MFC After:	1 week
Sponsored by:	Intel Corporation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D24066

----

bhyve amd: Small cleanups in amdvi_dump_cmds

Bump offset with MOD_INC instead in amdvi_dump_cmds.

Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D28862

----

bhyve hostbridge: Rename "device" property to "devid".

"device" is already used as the generic PCI-level name of the device
model to use (e.g. "hostbridge").  The result was that parsing
"hostbridge" as an integer failed and the host bridge used a device ID
of 0.  The EFI ROM asserts that the device ID of the hostbridge is not
0, so booting with the current EFI ROM was failing during the ROM
boot.

Fixes:		621b509

----

bhyve: Enable virtio-scsi legacy config parsing.

The previous commit added the handler to parse the command line
options for virtio-scsi devices but forgot to set the correct function
pointer to point to the handler.

Reported by:	vangyzen
Reviewed by:	vangyzen
Fixes:		621b509
Differential Revision:	https://reviews.freebsd.org/D29438

----

bhyve: change vq_getchain to return iovecs in both directions

The old prototype requires callers to inspect flags of each descriptors
to get the starting position of host-writable iovecs.

vq_getchain() is changed to return a virtio request with the number of
host-readable iovecs and host-writable iovecs instead. Callers can avoid
boilerplate code of getting the start offset of host-writable iovecs.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Reviewed by:	afedorov
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29433

----

Fix typo in xhci nvlist node name, and also increment device counter.

This allows the xhci tablet device to be recognized and a PCI device
instantiated.

Reviewed by:	jhb
Fixes:		621b509 Refactor configuration management in bhyve.
MFC after:	3 months.

----

bhyve: fix regression in legacy virtio-9p config parsing

Commit 621b509 introduced a regression
in legacy virtio-9p config parsing by not initializing *sharename to
NULL. As a result, "sharename != NULL" check in the first iteration fails
and bhyve exits with "virtio-9p: more than one share name given".

Fix by adding NULL back.

Approved by:	grehan

----

bhyve: add SMBIOS Baseboard Information

Add the System Management BIOS Baseboard (or Module) Information
a.k.a. Type 2 structure to the SMBIOS emulation.

Reviewed by:	rgrimes, bcran, grehan
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29657

----

bhyve: Move the gdb_active check to gdb_cpu_suspend().

The check needs to be in the public routine (gdb_cpu_suspend()), not
in the internal routine called from various places
(_gdb_cpu_suspend()).  All the other callers of _gdb_cpu_suspend()
already check gdb_active, and this breaks the use of snapshots when
the debug server is not enabled since gdb_cpu_suspend() tries to lock
an uninitialized mutex.

Reported by:	Darius Mihai, Elena Mihailescu
Reviewed by:	elenamihailescu22_gmail.com
Fixes:		621b509
Differential Revision:	https://reviews.freebsd.org/D29538

----

bhyve: implement rdmsr for MSR_IA32_FEATURE_CONTROL

Without the -w option, Windows guests crash on boot. This is caused by a rdmsr
of MSR_IA32_FEATURE_CONTROL. Windows checks this MSR to determine enabled VMX
features. This MSR isn't emulated in bhyve, so a #GP exception is injected
which causes Windows to crash.

Fix by returning a rdmsr of MSR_IA32_FEATURE_CONTROL with Lock Bit set and
VMX disabled to informWindows that VMX isn't available.

Reviewed by:	jhb, grehan (bhyve)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29665

----

bhyve.8: Make synopsis more readable

There is no need to squeeze all the possible options into one synopsis
entry. Let "-l help" and "-s help" be listed separately.

While here, keep -s and its arguments on the same line.

MFC after:	2 weeks

----

bhyve: Fix synopsis in the usage message

In particular:
- Sort short options to align with style(9)
- Add two missing flags: -G and -r
- Drop unnecessary angle brackets for consistency
- Rename the "vm" argument to vmname for consistency with the manual
  page

MFC after:	2 weeks

----

bhyve: Improve the option description in the usage message

- Sort options as suggested by style(9)
- Capitalize some words like CPU and HLT
- Add a missing description for the -G flag

MFC after:	2 weeks

----

bhyve.8: Sort the options in the OPTIONS section

No content change intended. Just moving the option descriptions around
to follow the order suggested by style(9).

MFC after:	2 weeks

----

bhyve.8: Improve the description and synopsis of -l

- Describe "-l help" separately for readability.
- List all the supported comX devices explicitly
- Use Cm instead of Ar for command modifiers (i.e., literal values a
  user can specify as an argument to the command).
- Explain where to get more information about the possible values of the
  conf argument.

MFC after:	2 weeks

----

bhyve.8: Improve the description of the -m flag

- Stylize the synopsis with proper mdoc macros
- Do some wordsmithing on the description for consistency.

MFC after:	2 weeks

----

bhyve.8: Fix the synopsis of -p

Use appropriate mdoc macros.

MFC after:	2 weeks

----

bhyve.8: Clean up description of -r

There is no need to wrap those flags in Op macros.

MFC after:	2 weeks

----

bhyve.8: Fix indention in the signals table

MFC after:	2 weeks

----

bhyve.8: Clean-up synopsis of -s

- Document "-s help" separately for readability.
- Use appropriate mdoc macros.

MFC after:	2 weeks

----

bhyve.8: Clean up the slot description of -s

Also, remove the macros of the nested list which contained slot,
emulation and conf. This decreases the indention of the -s description.
It was necessary to clean up the slot description.

MFC after:	2 weeks

----

bhyve.8: Improve emulation description of the -s flag

- Set width of the list to the longest key word for readability.
- Separate descriptions of amd_hostbridge and hostbridge emulations.
  Also, wordsmith their descriptions for consistency with other entries.
- Use Cm instead of Li for command modifiers.
- Do not stylize AMD with Li, there's no need to do it.
- Mention COM3 and COM4 in the definition of lpc.
- Fix a typo in the definition of ahci-hd ("hard drive" instead of
  "hard-drive").

MFC after:	2 weeks

----

bhyve.8: Clean up network backends section

- Reformat the format lists, use appropriate mdoc macros for
  readability.
- Add a missing Oxford comma.

MFC after:	2 weeks

----

bhyve.8: Clean up block storage device backends description

MFC after:	2 weeks

----

bhyve.8: Clean up SCSI device backends section

MFC after:	2 weeks

----

bhyve.8: Clean up 9P device backends section

MFC after:	2 weeks

----

bhyve.8: Clean up TTY, boot ROM, and pass-through descriptions

MFC after:	2 weeks

----

bhyve.8: Clean up virtio console device backends description

MFC after:	2 weeks

----

bhyve.8: Improve framebuffer backends description

- Use appropriate mdoc macros
- Document that tcp= is a synonym to rfb= (tcp is used in the examples,
  but never mentioned)
- Clarify the IP address specification

MFC after:	2 weeks

----

bhyve.8: Improve documentation of NVME backend

- Document the configuration format.
- Document two additional configuration options: eui64 and dsm.

MFC after:	2 weeks

----

bhyve.8: Improve AHCI backends documentation

- Document the backend format.

MFC after:	2 weeks

----

bhyve: Document the format for HD audio backends

- This change is done for consistency with other backend definitions.

MFC after:	2 weeks

----

bhyve.8: Fix mandoc -Tlint issues

While here, keep network backends section consistent with other
sections.

MFC after:	2 weeks

----

bhyve: Be explicit that setting config.dump will not start a VM.

Suggested by:	rpokala
Reviewed by:	bcr (manpages)
Differential Revision:	https://reviews.freebsd.org/D29738

----

bhyve: Gracefully handle virtio-scsi with no conf

Fixes segfault with the command `bhyve -s 0,virtio-scsi`, which is used
by some third party software to probe bhyve for virtio-scsi support.

Reviewed by:	jhb
MFC after:	1 day
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D29926

----

bhyve: Set SO_REUSEADDR on the gdb stub socket

Reviewed by:	jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30037

----

bhyve/snapshot: provide a way to send other messages/data to bhyve

This is a step towards sending messages (other than suspend/checkpoint)
from bhyvectl to bhyve.

Introduce a new struct, ipc_message - this struct stores the type of
message and a union containing message specific structures for the type
of message being sent.

Reviewed by:    grehan
Differential Revision: https://reviews.freebsd.org/D30221

----

bhyve/snapshot: split up mutex/cond initialization from socket creation

Move initialization of the mutex/condition variables required by the
save/restore feature to their own function.

The unix domain socket that facilitates communication between bhyvectl
and bhyve doesn't rely on these variables in order to be functional.

Reviewed by:    markj
Differential Revision:  https://reviews.freebsd.org/D30281

----

Add a virtio-input device emulation.

This will be used to inject keyboard/mouse input events into a guest.
The command line syntax is:
   -s <slot>,virtio-input,/dev/input/eventX

Reviewed by:	jhb (bhyve), grehan
Obtained from:	Corvin Köhne <C.Koehne@beckhoff.com>
MFC after:	3 weeks
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D30020

----

bhyve: Register new kevents synchronously.

Change mevent_add*() to synchronously add the new kevent.  This
permits reporting event registration failures to the caller and avoids
failing the registration of other, unrelated events queued up in the
same batch.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30502

----

bhyve: Add support for EVFILT_VNODE mevents.

This allows registering an event to watch for changes to a file's
attributes.  This is a bit imperfect as it would be nice to have a way
to determine if an fd can use EVFILT_VNODE successfully.  mevent's
current structure does not permit that and a failure to register a
single kevent impacts several other kevents.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30503

----

bhyve: Add support for handling disk resize events to block_if.

Allow clients of blockif to register a resize callback handler.  When
a callback is registered, register an EVFILT_VNODE kevent watching the
backing store for a change in the file's attributes.  If the size has
changed when the kevent fires, invoke the clients' callback.

Currently resize detection is limited to backing stores that support
EVFILT_VNODE kevents such as regular files.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30504

----

bhyve: Split out a lower-level helper for VirtIO interrupts.

This allows device models to assert VirtIO interrupts for reasons
other than publishing changes to a VirtIO ring such as configuration
changes.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30505

----

bhyve vtblk: Inform guests of disk resize events.

Register a resize callback with the blockif interface.  When the
callback fires, update the size of the disk and notify the guest via a
configuration change interrupt.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30506

----

bhyve: enhance debug info for memory range clash

Explain what the two clashing regions are.

Reivewed by:		grehan, jhb
Differential Revision:	https://reviews.freebsd.org/D29696
Pull Request:		freebsd/freebsd-src#463

----

Add more GIC and GICv3 registers

These aren't used by either driver, however they will be needed by
bhyve on arm64 to emulate a GICv3 interrupt controller.

Sponsored by:	Innovate UK

----

bhyve: Fix cli regression with NVMe ram

The configuration management refactoring inadvertently removed support
for a RAM-backed NVMe Namespace (i.e. -s X,nvme,ram=16384). This adds it
back.

Reported by:	andy@omniosce.org
Reviewed by:	jhb, andy@omniosce.org
Fixes:		621b509
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D30717

----

bhyve: fix NVMe MDTS comment

Removes an obsolete comment and adds parenthesis around the macro while
in the area. No functional change.

----

bhyve: Fix NVMe iovec construction for large IOs

The UEFI driver included with Rocky Linux 8.4 uncovered an existing bug
in the NVMe emulation's construction of iovec's.

By default, NVMe data transfer operations use a scatter-gather list in
which all entries point to a fixed size memory region. For example, if
the Memory Page Size is 4KiB, a 2MiB IO requires 512 entries. Lists
themselves are also fixed size (default is 512 entries).

Because the list size is fixed, the last entry is special. If the IO
requires more than 512 entries, the last entry in the list contains the
address of the next list of entries. But if the IO requires exactly 512
entries, the last entry points to data.

The NVMe emulation missed this logic and unconditionally treated the
last entry as a pointer to the next list. Fix is to check if the
remaining data is greater than the page size before using the last entry
as a pointer to the next list.

PR:		256422
Reported by:	dave@syix.com
Tested by:	jason@tubnor.net
MFC after:	5 days
Relnotes:	yes
Reviewed by:	imp, grehan
Differential Revision:	https://reviews.freebsd.org/D30897

----

Append Keyboard Layout specified option for using VNC.
Part one: supporting QEMU Extended Keyboard Event Message

PR:             246121
Submitted by:   koinec@yahoo.co.jp
Differential Revision: https://reviews.freebsd.org/D29430

----

libvmm: explicitly save and restore errno in vm_open()

In commit 6bb140e, vm_destroy() was replaced with free() to
preserve errno. However, it's possible that free() may change the errno
as well. Keep the free() call, but explicitly save and restore errno.

Noted by: jhb
Fixes: 6bb140e

----

vmm: Let guests enable SMEP/SMAP if the host supports it

Reviewed by:	kib, grehan, jhb
Tested by:	grehan (AMD)
MFC after:	3 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30462

----

vmm: Fix ivrs_drv device_printf usage

The original %b description string is wrong.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	imp, jhb
Differential Revision:	https://reviews.freebsd.org/D30805

----

bhyve/vioapic: remove an extra pin masked check

vioapic_send_intr does already check whether the pin is masked before
injecting the interrupt, there's no need to do it in vioapic_write
also.

No functional change intended.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28236

----

bhyve/ioapic: only account for asserted line in level mode

After modifying a redirection entry only try to inject an interrupt if
the pin is in level mode, pins in edge mode shouldn't take into
account the line assert status as they are triggered by edge changes,
not the line status itself.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28237

----

bhyve/ioapic: improve the tracking of IRR bit

One common method of EOI'ing an interrupt at the IO-APIC level is to
switch the pin to edge triggering mode and then back into level mode.
That would cause the IRR bit to be cleared and thus further interrupts
to be injected. FreeBSD does indeed use that method if the IO-APIC EOI
register is not supported.

The bhyve IO-APIC emulation code didn't clear the IRR bit when doing
that switch, and was also missing acknowledging the IRR state when
trying to inject an interrupt in vioapic_send_intr.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28238

----

ivrs_drv: Fix IVHDs with duplicated BaseAddress

Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28945

----

AMD-vi: Fix IOMMU device interrupts being overridden

Currently, AMD-vi PCI-e passthrough will lead to the following lines in
dmesg:
"kernel: CPU0: local APIC error 0x40
ivhd0: Error: completion failed tail:0x720, head:0x0."

After some tracing, the problem is due to the interaction with
amdvi_alloc_intr_resources() and pci_driver_added(). In ivrs_drv, the
identification of AMD-vi IVHD is done by walking over the ACPI IVRS
table and ivhdX device_ts are added under the acpi bus, while there are
no driver handling the corresponding IOMMU PCI function. In
amdvi_alloc_intr_resources(), the MSI intr are allocated with the ivhdX
device_t instead of the IOMMU PCI function device_t. bus_setup_intr() is
called on ivhdX. the IOMMU pci function device_t is only used for
pci_enable_msi(). Since bus_setup_intr() is not called on IOMMU pci
function, the IOMMU PCI function device_t's dinfo->cfg.msi is never
updated to reflect the supposed msi_data and msi_addr. So the msi_data
and msi_addr stay in the value 0. When pci_driver_added() tried to loop
over the children of a pci bus, and do pci_cfg_restore() on each of
them, msi_addr and msi_data with value 0 will be written to the MSI
capability of the IOMMU pci function, thus explaining the errors in
dmesg.

This change includes an amdiommu driver which currently does attaching,
detaching and providing DEVMETHODs for setting up and tearing down
interrupt. The purpose of the driver is to prevent pci_driver_added()
from calling pci_cfg_restore() on the IOMMU PCI function device_t.
The introduction of the amdiommu driver handles allocation of an IRQ
resource within the IOMMU PCI function, so that the dinfo->cfg.msi is
populated.

This has been tested on EPYC Rome 7282 with Radeon 5700XT GPU.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28984

----

Correct "Fondation" typo (missing "u")

----

AMD-vi: Mixed format IVHD block should replace fixed format IVHD block

This fixes double IVHD_SETUP_INTR calls on the same IOMMU device.

Sponsored by:	The FreeBSD Foundation
MFC with:	74ada29
Reported by:	Oleg Ginzburg <olevole@olevole.ru>
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29521

----

vmm: Fix AMD-vi using wrong rid range

The ACPI parsing code around rid range was wrong on assuming there is
only one pair of start/end device id range. Besides, ivhd_dev_parse()
never work as supposed. The start/end rid info was always zero.

Restructure the code to build dynamic-sized tables for each IOMMU softc
holding device entries. The device entries are enumerated to find a
suitable IOMMU unit. Operations on devices not governed (e.g. the IOMMU
unit itself) are no-op from now on. There are also a minor fix on wrong
%b formatting string usage.

Tested on my EPYC 7282.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D30827

----

vmm: Bump vmname buffer in struct vm to VM_MAX_NAMELEN + 1

In hw.vmm.create sysctl handler the maximum length of vm name is
VM_MAX_NAMELEN. However in vm_create() the maximum length allowed is
only VM_MAX_NAMELEN - 1 chars. Bump the length of the internal buffer to
allow the length of VM_MAX_NAMELEN for vm name.

MFC after:	3 days
Reviewed by:	grehan
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31372

----

amd64: Fix output operand specs for the stmxcsr and vmread intrinsics

This does not appear to affect code generation, at least with the
default toolchain.

Noticed because incorrect output specifications lead to false positives
from KMSAN, as the instrumentation uses them to update shadow state for
output operands.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31466

----

vmm: Make iommu ops tables const

While here, use designated initializers and rename some AMD iommu method
implementations to match the corresponding op names.  No functional
change intended.

Reviewed by:	grehan
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31462

----

vmm: Fix wrong assert in ivhd_dev_add_entry

The correct condition is to check the number of ivhd entries fit into
the array.

Reported by:	bz
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D31514

----

vmm: Add credential to cdev object

Add a credential to the cdev object in sysctl_vmm_create(), then check
that we have the correct credentials in sysctl_vmm_destroy(). This
prevents a process in one jail from opening or destroying the /dev/vmm
file corresponding to a VM in a sibling jail.

Add regression tests.

Reviewed by:	jhb, markj
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31156

----

bhyve: net_backends, automatically IFF_UP tap devices

If you want communications with the outside world and tell bhyve to
create an interfaces then it should be usable as well.
Rather than relying on the sysctl net.link.tap.up_on_open automatically
try to IFF_UP the opened tap device.

MFC after:	10 days
Reviewed by:	markj, grehan
Differential Revision: https://reviews.freebsd.org/D31342

----

bhyve: Use fspacectl(2) for BOP_DELETE on regular file images

bhyve can also make use of fspacectl(2) to implement BOP_DELETE with
hole-punching. Since it is not desirable to do zero-filling for large
DEALLOCATE/UNMAP range, candelete is not set if pathconf(2) indicates
that the underlying file system does not support native
VOP_DEALLOCATE(9).

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D28880

----

bhyve: Use pci(4) to access I/O port BARs

This removes the dependency on /dev/io.

PR:		251046
Reviewed by:	jhb
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31308

----

byhve: add option to specify IP address for gdb

Allow user to specify the IP address available for gdb debugger.

Reviewed by:	jhb, grehan, rgrimes, bcr (man pages)
Differential Revision:	https://reviews.freebsd.org/D29607

----

bhyve: change a default address from ANY to localhost

Discussed with:     grehan, jhb

----

bhyve: Fix vq_getchain() error handling bugs in various device models

Reviewed by:	grehan, khng
Approved by:	so
Security:	CVE-2021-29631
Security:	FreeBSD-SA-21:13.bhyve
Yamagi pushed a commit to Yamagi/freebsd that referenced this pull request Aug 27, 2021
libvmm: clean up vmmapi.h

struct checkpoint_op, enum checkpoint_opcodes, and
MAX_SNAPSHOT_VMNAME are not vmm specific, move them out of the vmmapi
header.

They are used for the save/restore functionality that bhyve(8)
provides and are better suited in usr.sbin/bhyve/snapshot.h

Since bhyvectl(8) requires these, the Makefile for bhyvectl has been
modified to include usr.sbin/bhyve/snapshot.h

Reviewed by:    kevans, grehan
Differential Revision:  https://reviews.freebsd.org/D28410

----

bhyve/snapshot: drop mkdir when creating the unix domain socket

Add /var/run/bhyve/ to BSD.var.dist so we don't have to call mkdir when
creating the unix domain socket for a given bhyve vm.

The path to the unix domain socket for a bhyve vm will now be
/var/run/bhyve/vmname instead of /var/run/bhyve/checkpoint/vmname

Move BHYVE_RUN_DIR from snapshot.c to snapshot.h so it can be shared
to bhyvectl(8).

Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D28783

----

bhyve/snapshot: rename checkpoint_opcodes to be more generic

Generalize the naming here since the domain socket that uses these codes
might be used for purposes other than the save/restore feature.

- rename checkpoint_opcodes to ipc_opcode

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28877

----

bhyvectl: reduce code duplication

Combine send_start_checkpoint() and send_start_suspend() into a
single function named snapshot_request().

snapshot_request() is equivalent to send_start_checkpoint() and
send_start_suspend() except that it takes an additional argument. The
additional argument, enum ipc_opcode, is used to determine the type of
snapshot request being performed. Also, switch to using strlcpy instead
of strncpy.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28878

----

bhyve/snapshot: rename and bump size of MAX_SNAPSHOT_VMNAME

MAX_SNAPSHOT_VMNAME is a macro used to set the size of a character
buffer that stores a filename or the path to a file - this file is used
by the save/restore feature.

Since the file doesn't have anything to do with a vm name, rename
MAX_SNAPSHOT_VMNAME to MAX_SNAPSHOT_FILENAME. Bump the size to PATH_MAX
while here.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28879

----

bhyvectl: print a better error message when vm_open() fails

Use errno to print a more descriptive error message when vm_open() fails

libvmm: preserve errno when vm_device_open() fails

vm_destroy() squashes errno by making a dive into sysctlbyname() - we
can safely skip vm_destroy() here since it's not doing any critical
clean up at this point. Replace vm_destroy() with a free() call.

PR:             250671
MFC after:      3 days
Submitted by:   marko@apache.org
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D29109

----

bhyve/snapshot: use SOCK_DGRAM instead of SOCK_STREAM

The save/restore feature uses a unix domain socket to send messages
from bhyvectl(8) to a bhyve(8) process. A datagram socket will suffice
for this.

An added benefit of using a datagram socket is simplified code. For
bhyve, the listen/accept calls are dropped; and for bhyvectl, the
connect() call is dropped.

EPRINTLN handles raw mode for bhyve(8), use it to print error messages.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28983

----

bhyve: virtio shares definitions between sys/dev/virtio

Definitions inside usr.sbin/bhyve/virtio.h are thrown away.
Definitions in sys/dev/virtio are used instead.

This reduces code duplication.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29084

----

Refactor configuration management in bhyve.

Replace the existing ad-hoc configuration via various global variables
with a small database of key-value pairs.  The database supports
heirarchical keys using a MIB-like syntax to name the path to a given
key.  Values are always stored as strings.  The API used to manage
configuation values does include wrappers to handling boolean values.
Other values use non-string types require parsing by consumers.

The configuration values are stored in a tree using nvlists.  Leaf
nodes hold string values.  Configuration values are permitted to
reference other configuration values using '%(name)'.  This permits
constructing template configurations.

All existing command line arguments now set configuration values.  For
devices, the "-s" option parses its option argument to generate a list
of key-value pairs for the given device.

A new '-o' command line option permits setting an individual
configuration variable.  The key name is always given as a full path
of dot-separated components.

A new '-k' command line option parses a simple configuration file.
This configuration file holds a flat list of 'key=value' lines where
the 'key' is the full path of a configuration variable.  Lines
starting with a '#' are comments.

In general, bhyve starts by parsing command line options in sequence
and applying those settings to configuration values.  Once this is
complete, bhyve then begins initializing its state based on the
configuration values.  This means that subsequent configuration
options or files may override or supplement previously given settings.

A special 'config.dump' configuration value can be set to true to help
debug configuration issues.  When this value is set, bhyve will print
out the configuration variables as a flat list of 'key=value' lines.

Most command line argments map to a single configuration variable,
e.g.  '-w' sets the 'x86.strictmsr' value to false.  A few command
line arguments have less obvious effects:

- Multiple '-p' options append their values (as a comma-seperated
  list) to "vcpu.N.cpuset" values (where N is a decimal vcpu number).

- For '-s' options, a pci.<bus>.<slot>.<function> node is created.
  The first argument to '-s' (the device type) is used as the value of
  a "device" variable.  Additional comma-separated arguments are then
  parsed into 'key=value' pairs and used to set additional variables
  under the device node.  A PCI device emulation driver can provide
  its own hook to override the parsing of the additonal '-s' arguments
  after the device type.

  After the configuration phase as completed, the init_pci hook
  then walks the "pci.<bus>.<slot>.<func>" nodes.  It uses the
  "device" value to find the device model to use.  The device
  model's init routine is passed a reference to its nvlist node
  in the configuration tree which it can query for specific
  variables.

  The result is that a lot of the string parsing is removed from
  the device models and centralized.  In addition, adding a new
  variable just requires teaching the model to look for the new
  variable.

- For '-l' options, a similar model is used where the string is
  parsed into values that are later read during initialization.
  One key note here is that the serial ports use the commonly
  used lowercase names from existing documentation and examples
  (e.g. "lpc.com1") instead of the uppercase names previously
  used internally in bhyve.

Reviewed by:	grehan
MFC after:	3 months
Differential Revision:	https://reviews.freebsd.org/D26035

----

bhyve: support relocating fbuf and passthru data BARs

We want to allow the UEFI firmware to enumerate and assign
addresses to PCI devices so we can boot from NVMe[1]. Address
assignment of PCI BARs is properly handled by the PCI emulation
code in general, but a few specific cases need additional support.
fbuf and passthru map additional objects into the guest physical
address space and so need to handle address updates. Here we add a
callback to emulated PCI devices to inform them of a BAR
configuration change. fbuf and passthru then watch for these BAR
changes and relocate the frame buffer memory segment and passthru
device mmio area respectively.

We also add new VM_MUNMAP_MEMSEG and VM_UNMAP_PPTDEV_MMIO ioctls
to vmm(4) to facilitate the unmapping needed for addres updates.

[1]: freebsd/uefi-edk2#9

Originally by:	scottph
MFC After:	1 week
Sponsored by:	Intel Corporation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D24066

----

bhyve amd: Small cleanups in amdvi_dump_cmds

Bump offset with MOD_INC instead in amdvi_dump_cmds.

Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D28862

----

bhyve hostbridge: Rename "device" property to "devid".

"device" is already used as the generic PCI-level name of the device
model to use (e.g. "hostbridge").  The result was that parsing
"hostbridge" as an integer failed and the host bridge used a device ID
of 0.  The EFI ROM asserts that the device ID of the hostbridge is not
0, so booting with the current EFI ROM was failing during the ROM
boot.

Fixes:		621b509

----

bhyve: Enable virtio-scsi legacy config parsing.

The previous commit added the handler to parse the command line
options for virtio-scsi devices but forgot to set the correct function
pointer to point to the handler.

Reported by:	vangyzen
Reviewed by:	vangyzen
Fixes:		621b509
Differential Revision:	https://reviews.freebsd.org/D29438

----

bhyve: change vq_getchain to return iovecs in both directions

The old prototype requires callers to inspect flags of each descriptors
to get the starting position of host-writable iovecs.

vq_getchain() is changed to return a virtio request with the number of
host-readable iovecs and host-writable iovecs instead. Callers can avoid
boilerplate code of getting the start offset of host-writable iovecs.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Reviewed by:	afedorov
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29433

----

Fix typo in xhci nvlist node name, and also increment device counter.

This allows the xhci tablet device to be recognized and a PCI device
instantiated.

Reviewed by:	jhb
Fixes:		621b509 Refactor configuration management in bhyve.
MFC after:	3 months.

----

bhyve: fix regression in legacy virtio-9p config parsing

Commit 621b509 introduced a regression
in legacy virtio-9p config parsing by not initializing *sharename to
NULL. As a result, "sharename != NULL" check in the first iteration fails
and bhyve exits with "virtio-9p: more than one share name given".

Fix by adding NULL back.

Approved by:	grehan

----

bhyve: add SMBIOS Baseboard Information

Add the System Management BIOS Baseboard (or Module) Information
a.k.a. Type 2 structure to the SMBIOS emulation.

Reviewed by:	rgrimes, bcran, grehan
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29657

----

bhyve: Move the gdb_active check to gdb_cpu_suspend().

The check needs to be in the public routine (gdb_cpu_suspend()), not
in the internal routine called from various places
(_gdb_cpu_suspend()).  All the other callers of _gdb_cpu_suspend()
already check gdb_active, and this breaks the use of snapshots when
the debug server is not enabled since gdb_cpu_suspend() tries to lock
an uninitialized mutex.

Reported by:	Darius Mihai, Elena Mihailescu
Reviewed by:	elenamihailescu22_gmail.com
Fixes:		621b509
Differential Revision:	https://reviews.freebsd.org/D29538

----

bhyve: implement rdmsr for MSR_IA32_FEATURE_CONTROL

Without the -w option, Windows guests crash on boot. This is caused by a rdmsr
of MSR_IA32_FEATURE_CONTROL. Windows checks this MSR to determine enabled VMX
features. This MSR isn't emulated in bhyve, so a #GP exception is injected
which causes Windows to crash.

Fix by returning a rdmsr of MSR_IA32_FEATURE_CONTROL with Lock Bit set and
VMX disabled to informWindows that VMX isn't available.

Reviewed by:	jhb, grehan (bhyve)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29665

----

bhyve.8: Make synopsis more readable

There is no need to squeeze all the possible options into one synopsis
entry. Let "-l help" and "-s help" be listed separately.

While here, keep -s and its arguments on the same line.

MFC after:	2 weeks

----

bhyve: Fix synopsis in the usage message

In particular:
- Sort short options to align with style(9)
- Add two missing flags: -G and -r
- Drop unnecessary angle brackets for consistency
- Rename the "vm" argument to vmname for consistency with the manual
  page

MFC after:	2 weeks

----

bhyve: Improve the option description in the usage message

- Sort options as suggested by style(9)
- Capitalize some words like CPU and HLT
- Add a missing description for the -G flag

MFC after:	2 weeks

----

bhyve.8: Sort the options in the OPTIONS section

No content change intended. Just moving the option descriptions around
to follow the order suggested by style(9).

MFC after:	2 weeks

----

bhyve.8: Improve the description and synopsis of -l

- Describe "-l help" separately for readability.
- List all the supported comX devices explicitly
- Use Cm instead of Ar for command modifiers (i.e., literal values a
  user can specify as an argument to the command).
- Explain where to get more information about the possible values of the
  conf argument.

MFC after:	2 weeks

----

bhyve.8: Improve the description of the -m flag

- Stylize the synopsis with proper mdoc macros
- Do some wordsmithing on the description for consistency.

MFC after:	2 weeks

----

bhyve.8: Fix the synopsis of -p

Use appropriate mdoc macros.

MFC after:	2 weeks

----

bhyve.8: Clean up description of -r

There is no need to wrap those flags in Op macros.

MFC after:	2 weeks

----

bhyve.8: Fix indention in the signals table

MFC after:	2 weeks

----

bhyve.8: Clean-up synopsis of -s

- Document "-s help" separately for readability.
- Use appropriate mdoc macros.

MFC after:	2 weeks

----

bhyve.8: Clean up the slot description of -s

Also, remove the macros of the nested list which contained slot,
emulation and conf. This decreases the indention of the -s description.
It was necessary to clean up the slot description.

MFC after:	2 weeks

----

bhyve.8: Improve emulation description of the -s flag

- Set width of the list to the longest key word for readability.
- Separate descriptions of amd_hostbridge and hostbridge emulations.
  Also, wordsmith their descriptions for consistency with other entries.
- Use Cm instead of Li for command modifiers.
- Do not stylize AMD with Li, there's no need to do it.
- Mention COM3 and COM4 in the definition of lpc.
- Fix a typo in the definition of ahci-hd ("hard drive" instead of
  "hard-drive").

MFC after:	2 weeks

----

bhyve.8: Clean up network backends section

- Reformat the format lists, use appropriate mdoc macros for
  readability.
- Add a missing Oxford comma.

MFC after:	2 weeks

----

bhyve.8: Clean up block storage device backends description

MFC after:	2 weeks

----

bhyve.8: Clean up SCSI device backends section

MFC after:	2 weeks

----

bhyve.8: Clean up 9P device backends section

MFC after:	2 weeks

----

bhyve.8: Clean up TTY, boot ROM, and pass-through descriptions

MFC after:	2 weeks

----

bhyve.8: Clean up virtio console device backends description

MFC after:	2 weeks

----

bhyve.8: Improve framebuffer backends description

- Use appropriate mdoc macros
- Document that tcp= is a synonym to rfb= (tcp is used in the examples,
  but never mentioned)
- Clarify the IP address specification

MFC after:	2 weeks

----

bhyve.8: Improve documentation of NVME backend

- Document the configuration format.
- Document two additional configuration options: eui64 and dsm.

MFC after:	2 weeks

----

bhyve.8: Improve AHCI backends documentation

- Document the backend format.

MFC after:	2 weeks

----

bhyve: Document the format for HD audio backends

- This change is done for consistency with other backend definitions.

MFC after:	2 weeks

----

bhyve.8: Fix mandoc -Tlint issues

While here, keep network backends section consistent with other
sections.

MFC after:	2 weeks

----

bhyve: Be explicit that setting config.dump will not start a VM.

Suggested by:	rpokala
Reviewed by:	bcr (manpages)
Differential Revision:	https://reviews.freebsd.org/D29738

----

bhyve: Gracefully handle virtio-scsi with no conf

Fixes segfault with the command `bhyve -s 0,virtio-scsi`, which is used
by some third party software to probe bhyve for virtio-scsi support.

Reviewed by:	jhb
MFC after:	1 day
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D29926

----

bhyve: Set SO_REUSEADDR on the gdb stub socket

Reviewed by:	jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30037

----

bhyve/snapshot: provide a way to send other messages/data to bhyve

This is a step towards sending messages (other than suspend/checkpoint)
from bhyvectl to bhyve.

Introduce a new struct, ipc_message - this struct stores the type of
message and a union containing message specific structures for the type
of message being sent.

Reviewed by:    grehan
Differential Revision: https://reviews.freebsd.org/D30221

----

bhyve/snapshot: split up mutex/cond initialization from socket creation

Move initialization of the mutex/condition variables required by the
save/restore feature to their own function.

The unix domain socket that facilitates communication between bhyvectl
and bhyve doesn't rely on these variables in order to be functional.

Reviewed by:    markj
Differential Revision:  https://reviews.freebsd.org/D30281

----

Add a virtio-input device emulation.

This will be used to inject keyboard/mouse input events into a guest.
The command line syntax is:
   -s <slot>,virtio-input,/dev/input/eventX

Reviewed by:	jhb (bhyve), grehan
Obtained from:	Corvin Köhne <C.Koehne@beckhoff.com>
MFC after:	3 weeks
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D30020

----

bhyve: Register new kevents synchronously.

Change mevent_add*() to synchronously add the new kevent.  This
permits reporting event registration failures to the caller and avoids
failing the registration of other, unrelated events queued up in the
same batch.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30502

----

bhyve: Add support for EVFILT_VNODE mevents.

This allows registering an event to watch for changes to a file's
attributes.  This is a bit imperfect as it would be nice to have a way
to determine if an fd can use EVFILT_VNODE successfully.  mevent's
current structure does not permit that and a failure to register a
single kevent impacts several other kevents.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30503

----

bhyve: Add support for handling disk resize events to block_if.

Allow clients of blockif to register a resize callback handler.  When
a callback is registered, register an EVFILT_VNODE kevent watching the
backing store for a change in the file's attributes.  If the size has
changed when the kevent fires, invoke the clients' callback.

Currently resize detection is limited to backing stores that support
EVFILT_VNODE kevents such as regular files.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30504

----

bhyve: Split out a lower-level helper for VirtIO interrupts.

This allows device models to assert VirtIO interrupts for reasons
other than publishing changes to a VirtIO ring such as configuration
changes.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30505

----

bhyve vtblk: Inform guests of disk resize events.

Register a resize callback with the blockif interface.  When the
callback fires, update the size of the disk and notify the guest via a
configuration change interrupt.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30506

----

bhyve: enhance debug info for memory range clash

Explain what the two clashing regions are.

Reivewed by:		grehan, jhb
Differential Revision:	https://reviews.freebsd.org/D29696
Pull Request:		freebsd/freebsd-src#463

----

Add more GIC and GICv3 registers

These aren't used by either driver, however they will be needed by
bhyve on arm64 to emulate a GICv3 interrupt controller.

Sponsored by:	Innovate UK

----

bhyve: Fix cli regression with NVMe ram

The configuration management refactoring inadvertently removed support
for a RAM-backed NVMe Namespace (i.e. -s X,nvme,ram=16384). This adds it
back.

Reported by:	andy@omniosce.org
Reviewed by:	jhb, andy@omniosce.org
Fixes:		621b509
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D30717

----

bhyve: fix NVMe MDTS comment

Removes an obsolete comment and adds parenthesis around the macro while
in the area. No functional change.

----

bhyve: Fix NVMe iovec construction for large IOs

The UEFI driver included with Rocky Linux 8.4 uncovered an existing bug
in the NVMe emulation's construction of iovec's.

By default, NVMe data transfer operations use a scatter-gather list in
which all entries point to a fixed size memory region. For example, if
the Memory Page Size is 4KiB, a 2MiB IO requires 512 entries. Lists
themselves are also fixed size (default is 512 entries).

Because the list size is fixed, the last entry is special. If the IO
requires more than 512 entries, the last entry in the list contains the
address of the next list of entries. But if the IO requires exactly 512
entries, the last entry points to data.

The NVMe emulation missed this logic and unconditionally treated the
last entry as a pointer to the next list. Fix is to check if the
remaining data is greater than the page size before using the last entry
as a pointer to the next list.

PR:		256422
Reported by:	dave@syix.com
Tested by:	jason@tubnor.net
MFC after:	5 days
Relnotes:	yes
Reviewed by:	imp, grehan
Differential Revision:	https://reviews.freebsd.org/D30897

----

Append Keyboard Layout specified option for using VNC.
Part one: supporting QEMU Extended Keyboard Event Message

PR:             246121
Submitted by:   koinec@yahoo.co.jp
Differential Revision: https://reviews.freebsd.org/D29430

----

libvmm: explicitly save and restore errno in vm_open()

In commit 6bb140e, vm_destroy() was replaced with free() to
preserve errno. However, it's possible that free() may change the errno
as well. Keep the free() call, but explicitly save and restore errno.

Noted by: jhb
Fixes: 6bb140e

----

vmm: Let guests enable SMEP/SMAP if the host supports it

Reviewed by:	kib, grehan, jhb
Tested by:	grehan (AMD)
MFC after:	3 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30462

----

vmm: Fix ivrs_drv device_printf usage

The original %b description string is wrong.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	imp, jhb
Differential Revision:	https://reviews.freebsd.org/D30805

----

bhyve/vioapic: remove an extra pin masked check

vioapic_send_intr does already check whether the pin is masked before
injecting the interrupt, there's no need to do it in vioapic_write
also.

No functional change intended.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28236

----

bhyve/ioapic: only account for asserted line in level mode

After modifying a redirection entry only try to inject an interrupt if
the pin is in level mode, pins in edge mode shouldn't take into
account the line assert status as they are triggered by edge changes,
not the line status itself.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28237

----

bhyve/ioapic: improve the tracking of IRR bit

One common method of EOI'ing an interrupt at the IO-APIC level is to
switch the pin to edge triggering mode and then back into level mode.
That would cause the IRR bit to be cleared and thus further interrupts
to be injected. FreeBSD does indeed use that method if the IO-APIC EOI
register is not supported.

The bhyve IO-APIC emulation code didn't clear the IRR bit when doing
that switch, and was also missing acknowledging the IRR state when
trying to inject an interrupt in vioapic_send_intr.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28238

----

ivrs_drv: Fix IVHDs with duplicated BaseAddress

Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28945

----

AMD-vi: Fix IOMMU device interrupts being overridden

Currently, AMD-vi PCI-e passthrough will lead to the following lines in
dmesg:
"kernel: CPU0: local APIC error 0x40
ivhd0: Error: completion failed tail:0x720, head:0x0."

After some tracing, the problem is due to the interaction with
amdvi_alloc_intr_resources() and pci_driver_added(). In ivrs_drv, the
identification of AMD-vi IVHD is done by walking over the ACPI IVRS
table and ivhdX device_ts are added under the acpi bus, while there are
no driver handling the corresponding IOMMU PCI function. In
amdvi_alloc_intr_resources(), the MSI intr are allocated with the ivhdX
device_t instead of the IOMMU PCI function device_t. bus_setup_intr() is
called on ivhdX. the IOMMU pci function device_t is only used for
pci_enable_msi(). Since bus_setup_intr() is not called on IOMMU pci
function, the IOMMU PCI function device_t's dinfo->cfg.msi is never
updated to reflect the supposed msi_data and msi_addr. So the msi_data
and msi_addr stay in the value 0. When pci_driver_added() tried to loop
over the children of a pci bus, and do pci_cfg_restore() on each of
them, msi_addr and msi_data with value 0 will be written to the MSI
capability of the IOMMU pci function, thus explaining the errors in
dmesg.

This change includes an amdiommu driver which currently does attaching,
detaching and providing DEVMETHODs for setting up and tearing down
interrupt. The purpose of the driver is to prevent pci_driver_added()
from calling pci_cfg_restore() on the IOMMU PCI function device_t.
The introduction of the amdiommu driver handles allocation of an IRQ
resource within the IOMMU PCI function, so that the dinfo->cfg.msi is
populated.

This has been tested on EPYC Rome 7282 with Radeon 5700XT GPU.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28984

----

Correct "Fondation" typo (missing "u")

----

AMD-vi: Mixed format IVHD block should replace fixed format IVHD block

This fixes double IVHD_SETUP_INTR calls on the same IOMMU device.

Sponsored by:	The FreeBSD Foundation
MFC with:	74ada29
Reported by:	Oleg Ginzburg <olevole@olevole.ru>
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29521

----

vmm: Fix AMD-vi using wrong rid range

The ACPI parsing code around rid range was wrong on assuming there is
only one pair of start/end device id range. Besides, ivhd_dev_parse()
never work as supposed. The start/end rid info was always zero.

Restructure the code to build dynamic-sized tables for each IOMMU softc
holding device entries. The device entries are enumerated to find a
suitable IOMMU unit. Operations on devices not governed (e.g. the IOMMU
unit itself) are no-op from now on. There are also a minor fix on wrong
%b formatting string usage.

Tested on my EPYC 7282.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D30827

----

vmm: Bump vmname buffer in struct vm to VM_MAX_NAMELEN + 1

In hw.vmm.create sysctl handler the maximum length of vm name is
VM_MAX_NAMELEN. However in vm_create() the maximum length allowed is
only VM_MAX_NAMELEN - 1 chars. Bump the length of the internal buffer to
allow the length of VM_MAX_NAMELEN for vm name.

MFC after:	3 days
Reviewed by:	grehan
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31372

----

amd64: Fix output operand specs for the stmxcsr and vmread intrinsics

This does not appear to affect code generation, at least with the
default toolchain.

Noticed because incorrect output specifications lead to false positives
from KMSAN, as the instrumentation uses them to update shadow state for
output operands.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31466

----

vmm: Make iommu ops tables const

While here, use designated initializers and rename some AMD iommu method
implementations to match the corresponding op names.  No functional
change intended.

Reviewed by:	grehan
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31462

----

vmm: Fix wrong assert in ivhd_dev_add_entry

The correct condition is to check the number of ivhd entries fit into
the array.

Reported by:	bz
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D31514

----

vmm: Add credential to cdev object

Add a credential to the cdev object in sysctl_vmm_create(), then check
that we have the correct credentials in sysctl_vmm_destroy(). This
prevents a process in one jail from opening or destroying the /dev/vmm
file corresponding to a VM in a sibling jail.

Add regression tests.

Reviewed by:	jhb, markj
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31156

----

bhyve: net_backends, automatically IFF_UP tap devices

If you want communications with the outside world and tell bhyve to
create an interfaces then it should be usable as well.
Rather than relying on the sysctl net.link.tap.up_on_open automatically
try to IFF_UP the opened tap device.

MFC after:	10 days
Reviewed by:	markj, grehan
Differential Revision: https://reviews.freebsd.org/D31342

----

bhyve: Use fspacectl(2) for BOP_DELETE on regular file images

bhyve can also make use of fspacectl(2) to implement BOP_DELETE with
hole-punching. Since it is not desirable to do zero-filling for large
DEALLOCATE/UNMAP range, candelete is not set if pathconf(2) indicates
that the underlying file system does not support native
VOP_DEALLOCATE(9).

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D28880

----

bhyve: Use pci(4) to access I/O port BARs

This removes the dependency on /dev/io.

PR:		251046
Reviewed by:	jhb
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31308

----

byhve: add option to specify IP address for gdb

Allow user to specify the IP address available for gdb debugger.

Reviewed by:	jhb, grehan, rgrimes, bcr (man pages)
Differential Revision:	https://reviews.freebsd.org/D29607

----

bhyve: change a default address from ANY to localhost

Discussed with:     grehan, jhb

----

bhyve: Fix vq_getchain() error handling bugs in various device models

Reviewed by:	grehan, khng
Approved by:	so
Security:	CVE-2021-29631
Security:	FreeBSD-SA-21:13.bhyve

----

pci: Add an ioctl to perform I/O to BARs

This is useful for bhyve, which otherwise has to use /dev/io to handle
accesses to I/O port BARs when PCI passthrough is in use.

Reviewed by:	imp, kib
Discussed with:	jhb
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31307
bsdjhb pushed a commit to bsdjhb/cheribsd that referenced this pull request Nov 5, 2021
We want to allow the UEFI firmware to enumerate and assign
addresses to PCI devices so we can boot from NVMe[1]. Address
assignment of PCI BARs is properly handled by the PCI emulation
code in general, but a few specific cases need additional support.
fbuf and passthru map additional objects into the guest physical
address space and so need to handle address updates. Here we add a
callback to emulated PCI devices to inform them of a BAR
configuration change. fbuf and passthru then watch for these BAR
changes and relocate the frame buffer memory segment and passthru
device mmio area respectively.

We also add new VM_MUNMAP_MEMSEG and VM_UNMAP_PPTDEV_MMIO ioctls
to vmm(4) to facilitate the unmapping needed for addres updates.

[1]: freebsd/uefi-edk2#9

Originally by:	scottph
MFC After:	1 week
Sponsored by:	Intel Corporation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D24066
ckoehne pushed a commit to Beckhoff/freebsd-src that referenced this pull request Nov 17, 2021
We want to allow the UEFI firmware to enumerate and assign
addresses to PCI devices so we can boot from NVMe[1]. Address
assignment of PCI BARs is properly handled by the PCI emulation
code in general, but a few specific cases need additional support.
fbuf and passthru map additional objects into the guest physical
address space and so need to handle address updates. Here we add a
callback to emulated PCI devices to inform them of a BAR
configuration change. fbuf and passthru then watch for these BAR
changes and relocate the frame buffer memory segment and passthru
device mmio area respectively.

We also add new VM_MUNMAP_MEMSEG and VM_UNMAP_PPTDEV_MMIO ioctls
to vmm(4) to facilitate the unmapping needed for addres updates.

[1]: freebsd/uefi-edk2#9

Originally by:	scottph
MFC After:	1 week
Sponsored by:	Intel Corporation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D24066
Yamagi pushed a commit to Yamagi/freebsd that referenced this pull request Feb 3, 2022
libvmm: clean up vmmapi.h

struct checkpoint_op, enum checkpoint_opcodes, and
MAX_SNAPSHOT_VMNAME are not vmm specific, move them out of the vmmapi
header.

They are used for the save/restore functionality that bhyve(8)
provides and are better suited in usr.sbin/bhyve/snapshot.h

Since bhyvectl(8) requires these, the Makefile for bhyvectl has been
modified to include usr.sbin/bhyve/snapshot.h

Reviewed by:    kevans, grehan
Differential Revision:  https://reviews.freebsd.org/D28410

----

bhyve/snapshot: drop mkdir when creating the unix domain socket

Add /var/run/bhyve/ to BSD.var.dist so we don't have to call mkdir when
creating the unix domain socket for a given bhyve vm.

The path to the unix domain socket for a bhyve vm will now be
/var/run/bhyve/vmname instead of /var/run/bhyve/checkpoint/vmname

Move BHYVE_RUN_DIR from snapshot.c to snapshot.h so it can be shared
to bhyvectl(8).

Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D28783

----

bhyve/snapshot: rename checkpoint_opcodes to be more generic

Generalize the naming here since the domain socket that uses these codes
might be used for purposes other than the save/restore feature.

- rename checkpoint_opcodes to ipc_opcode

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28877

----

bhyvectl: reduce code duplication

Combine send_start_checkpoint() and send_start_suspend() into a
single function named snapshot_request().

snapshot_request() is equivalent to send_start_checkpoint() and
send_start_suspend() except that it takes an additional argument. The
additional argument, enum ipc_opcode, is used to determine the type of
snapshot request being performed. Also, switch to using strlcpy instead
of strncpy.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28878

----

bhyve/snapshot: rename and bump size of MAX_SNAPSHOT_VMNAME

MAX_SNAPSHOT_VMNAME is a macro used to set the size of a character
buffer that stores a filename or the path to a file - this file is used
by the save/restore feature.

Since the file doesn't have anything to do with a vm name, rename
MAX_SNAPSHOT_VMNAME to MAX_SNAPSHOT_FILENAME. Bump the size to PATH_MAX
while here.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28879

----

bhyvectl: print a better error message when vm_open() fails

Use errno to print a more descriptive error message when vm_open() fails

libvmm: preserve errno when vm_device_open() fails

vm_destroy() squashes errno by making a dive into sysctlbyname() - we
can safely skip vm_destroy() here since it's not doing any critical
clean up at this point. Replace vm_destroy() with a free() call.

PR:             250671
MFC after:      3 days
Submitted by:   marko@apache.org
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D29109

----

bhyve/snapshot: use SOCK_DGRAM instead of SOCK_STREAM

The save/restore feature uses a unix domain socket to send messages
from bhyvectl(8) to a bhyve(8) process. A datagram socket will suffice
for this.

An added benefit of using a datagram socket is simplified code. For
bhyve, the listen/accept calls are dropped; and for bhyvectl, the
connect() call is dropped.

EPRINTLN handles raw mode for bhyve(8), use it to print error messages.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28983

----

bhyve: virtio shares definitions between sys/dev/virtio

Definitions inside usr.sbin/bhyve/virtio.h are thrown away.
Definitions in sys/dev/virtio are used instead.

This reduces code duplication.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29084

----

Refactor configuration management in bhyve.

Replace the existing ad-hoc configuration via various global variables
with a small database of key-value pairs.  The database supports
heirarchical keys using a MIB-like syntax to name the path to a given
key.  Values are always stored as strings.  The API used to manage
configuation values does include wrappers to handling boolean values.
Other values use non-string types require parsing by consumers.

The configuration values are stored in a tree using nvlists.  Leaf
nodes hold string values.  Configuration values are permitted to
reference other configuration values using '%(name)'.  This permits
constructing template configurations.

All existing command line arguments now set configuration values.  For
devices, the "-s" option parses its option argument to generate a list
of key-value pairs for the given device.

A new '-o' command line option permits setting an individual
configuration variable.  The key name is always given as a full path
of dot-separated components.

A new '-k' command line option parses a simple configuration file.
This configuration file holds a flat list of 'key=value' lines where
the 'key' is the full path of a configuration variable.  Lines
starting with a '#' are comments.

In general, bhyve starts by parsing command line options in sequence
and applying those settings to configuration values.  Once this is
complete, bhyve then begins initializing its state based on the
configuration values.  This means that subsequent configuration
options or files may override or supplement previously given settings.

A special 'config.dump' configuration value can be set to true to help
debug configuration issues.  When this value is set, bhyve will print
out the configuration variables as a flat list of 'key=value' lines.

Most command line argments map to a single configuration variable,
e.g.  '-w' sets the 'x86.strictmsr' value to false.  A few command
line arguments have less obvious effects:

- Multiple '-p' options append their values (as a comma-seperated
  list) to "vcpu.N.cpuset" values (where N is a decimal vcpu number).

- For '-s' options, a pci.<bus>.<slot>.<function> node is created.
  The first argument to '-s' (the device type) is used as the value of
  a "device" variable.  Additional comma-separated arguments are then
  parsed into 'key=value' pairs and used to set additional variables
  under the device node.  A PCI device emulation driver can provide
  its own hook to override the parsing of the additonal '-s' arguments
  after the device type.

  After the configuration phase as completed, the init_pci hook
  then walks the "pci.<bus>.<slot>.<func>" nodes.  It uses the
  "device" value to find the device model to use.  The device
  model's init routine is passed a reference to its nvlist node
  in the configuration tree which it can query for specific
  variables.

  The result is that a lot of the string parsing is removed from
  the device models and centralized.  In addition, adding a new
  variable just requires teaching the model to look for the new
  variable.

- For '-l' options, a similar model is used where the string is
  parsed into values that are later read during initialization.
  One key note here is that the serial ports use the commonly
  used lowercase names from existing documentation and examples
  (e.g. "lpc.com1") instead of the uppercase names previously
  used internally in bhyve.

Reviewed by:	grehan
MFC after:	3 months
Differential Revision:	https://reviews.freebsd.org/D26035

----

bhyve: support relocating fbuf and passthru data BARs

We want to allow the UEFI firmware to enumerate and assign
addresses to PCI devices so we can boot from NVMe[1]. Address
assignment of PCI BARs is properly handled by the PCI emulation
code in general, but a few specific cases need additional support.
fbuf and passthru map additional objects into the guest physical
address space and so need to handle address updates. Here we add a
callback to emulated PCI devices to inform them of a BAR
configuration change. fbuf and passthru then watch for these BAR
changes and relocate the frame buffer memory segment and passthru
device mmio area respectively.

We also add new VM_MUNMAP_MEMSEG and VM_UNMAP_PPTDEV_MMIO ioctls
to vmm(4) to facilitate the unmapping needed for addres updates.

[1]: freebsd/uefi-edk2#9

Originally by:	scottph
MFC After:	1 week
Sponsored by:	Intel Corporation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D24066

----

bhyve amd: Small cleanups in amdvi_dump_cmds

Bump offset with MOD_INC instead in amdvi_dump_cmds.

Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D28862

----

bhyve hostbridge: Rename "device" property to "devid".

"device" is already used as the generic PCI-level name of the device
model to use (e.g. "hostbridge").  The result was that parsing
"hostbridge" as an integer failed and the host bridge used a device ID
of 0.  The EFI ROM asserts that the device ID of the hostbridge is not
0, so booting with the current EFI ROM was failing during the ROM
boot.

Fixes:		621b509

----

bhyve: Enable virtio-scsi legacy config parsing.

The previous commit added the handler to parse the command line
options for virtio-scsi devices but forgot to set the correct function
pointer to point to the handler.

Reported by:	vangyzen
Reviewed by:	vangyzen
Fixes:		621b509
Differential Revision:	https://reviews.freebsd.org/D29438

----

bhyve: change vq_getchain to return iovecs in both directions

The old prototype requires callers to inspect flags of each descriptors
to get the starting position of host-writable iovecs.

vq_getchain() is changed to return a virtio request with the number of
host-readable iovecs and host-writable iovecs instead. Callers can avoid
boilerplate code of getting the start offset of host-writable iovecs.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Reviewed by:	afedorov
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29433

----

Fix typo in xhci nvlist node name, and also increment device counter.

This allows the xhci tablet device to be recognized and a PCI device
instantiated.

Reviewed by:	jhb
Fixes:		621b509 Refactor configuration management in bhyve.
MFC after:	3 months.

----

bhyve: fix regression in legacy virtio-9p config parsing

Commit 621b509 introduced a regression
in legacy virtio-9p config parsing by not initializing *sharename to
NULL. As a result, "sharename != NULL" check in the first iteration fails
and bhyve exits with "virtio-9p: more than one share name given".

Fix by adding NULL back.

Approved by:	grehan

----

bhyve: add SMBIOS Baseboard Information

Add the System Management BIOS Baseboard (or Module) Information
a.k.a. Type 2 structure to the SMBIOS emulation.

Reviewed by:	rgrimes, bcran, grehan
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29657

----

bhyve: Move the gdb_active check to gdb_cpu_suspend().

The check needs to be in the public routine (gdb_cpu_suspend()), not
in the internal routine called from various places
(_gdb_cpu_suspend()).  All the other callers of _gdb_cpu_suspend()
already check gdb_active, and this breaks the use of snapshots when
the debug server is not enabled since gdb_cpu_suspend() tries to lock
an uninitialized mutex.

Reported by:	Darius Mihai, Elena Mihailescu
Reviewed by:	elenamihailescu22_gmail.com
Fixes:		621b509
Differential Revision:	https://reviews.freebsd.org/D29538

----

bhyve: implement rdmsr for MSR_IA32_FEATURE_CONTROL

Without the -w option, Windows guests crash on boot. This is caused by a rdmsr
of MSR_IA32_FEATURE_CONTROL. Windows checks this MSR to determine enabled VMX
features. This MSR isn't emulated in bhyve, so a #GP exception is injected
which causes Windows to crash.

Fix by returning a rdmsr of MSR_IA32_FEATURE_CONTROL with Lock Bit set and
VMX disabled to informWindows that VMX isn't available.

Reviewed by:	jhb, grehan (bhyve)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29665

----

bhyve.8: Make synopsis more readable

There is no need to squeeze all the possible options into one synopsis
entry. Let "-l help" and "-s help" be listed separately.

While here, keep -s and its arguments on the same line.

MFC after:	2 weeks

----

bhyve: Fix synopsis in the usage message

In particular:
- Sort short options to align with style(9)
- Add two missing flags: -G and -r
- Drop unnecessary angle brackets for consistency
- Rename the "vm" argument to vmname for consistency with the manual
  page

MFC after:	2 weeks

----

bhyve: Improve the option description in the usage message

- Sort options as suggested by style(9)
- Capitalize some words like CPU and HLT
- Add a missing description for the -G flag

MFC after:	2 weeks

----

bhyve.8: Sort the options in the OPTIONS section

No content change intended. Just moving the option descriptions around
to follow the order suggested by style(9).

MFC after:	2 weeks

----

bhyve.8: Improve the description and synopsis of -l

- Describe "-l help" separately for readability.
- List all the supported comX devices explicitly
- Use Cm instead of Ar for command modifiers (i.e., literal values a
  user can specify as an argument to the command).
- Explain where to get more information about the possible values of the
  conf argument.

MFC after:	2 weeks

----

bhyve.8: Improve the description of the -m flag

- Stylize the synopsis with proper mdoc macros
- Do some wordsmithing on the description for consistency.

MFC after:	2 weeks

----

bhyve.8: Fix the synopsis of -p

Use appropriate mdoc macros.

MFC after:	2 weeks

----

bhyve.8: Clean up description of -r

There is no need to wrap those flags in Op macros.

MFC after:	2 weeks

----

bhyve.8: Fix indention in the signals table

MFC after:	2 weeks

----

bhyve.8: Clean-up synopsis of -s

- Document "-s help" separately for readability.
- Use appropriate mdoc macros.

MFC after:	2 weeks

----

bhyve.8: Clean up the slot description of -s

Also, remove the macros of the nested list which contained slot,
emulation and conf. This decreases the indention of the -s description.
It was necessary to clean up the slot description.

MFC after:	2 weeks

----

bhyve.8: Improve emulation description of the -s flag

- Set width of the list to the longest key word for readability.
- Separate descriptions of amd_hostbridge and hostbridge emulations.
  Also, wordsmith their descriptions for consistency with other entries.
- Use Cm instead of Li for command modifiers.
- Do not stylize AMD with Li, there's no need to do it.
- Mention COM3 and COM4 in the definition of lpc.
- Fix a typo in the definition of ahci-hd ("hard drive" instead of
  "hard-drive").

MFC after:	2 weeks

----

bhyve.8: Clean up network backends section

- Reformat the format lists, use appropriate mdoc macros for
  readability.
- Add a missing Oxford comma.

MFC after:	2 weeks

----

bhyve.8: Clean up block storage device backends description

MFC after:	2 weeks

----

bhyve.8: Clean up SCSI device backends section

MFC after:	2 weeks

----

bhyve.8: Clean up 9P device backends section

MFC after:	2 weeks

----

bhyve.8: Clean up TTY, boot ROM, and pass-through descriptions

MFC after:	2 weeks

----

bhyve.8: Clean up virtio console device backends description

MFC after:	2 weeks

----

bhyve.8: Improve framebuffer backends description

- Use appropriate mdoc macros
- Document that tcp= is a synonym to rfb= (tcp is used in the examples,
  but never mentioned)
- Clarify the IP address specification

MFC after:	2 weeks

----

bhyve.8: Improve documentation of NVME backend

- Document the configuration format.
- Document two additional configuration options: eui64 and dsm.

MFC after:	2 weeks

----

bhyve.8: Improve AHCI backends documentation

- Document the backend format.

MFC after:	2 weeks

----

bhyve: Document the format for HD audio backends

- This change is done for consistency with other backend definitions.

MFC after:	2 weeks

----

bhyve.8: Fix mandoc -Tlint issues

While here, keep network backends section consistent with other
sections.

MFC after:	2 weeks

----

bhyve: Be explicit that setting config.dump will not start a VM.

Suggested by:	rpokala
Reviewed by:	bcr (manpages)
Differential Revision:	https://reviews.freebsd.org/D29738

----

bhyve: Gracefully handle virtio-scsi with no conf

Fixes segfault with the command `bhyve -s 0,virtio-scsi`, which is used
by some third party software to probe bhyve for virtio-scsi support.

Reviewed by:	jhb
MFC after:	1 day
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D29926

----

bhyve: Set SO_REUSEADDR on the gdb stub socket

Reviewed by:	jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30037

----

bhyve/snapshot: provide a way to send other messages/data to bhyve

This is a step towards sending messages (other than suspend/checkpoint)
from bhyvectl to bhyve.

Introduce a new struct, ipc_message - this struct stores the type of
message and a union containing message specific structures for the type
of message being sent.

Reviewed by:    grehan
Differential Revision: https://reviews.freebsd.org/D30221

----

bhyve/snapshot: split up mutex/cond initialization from socket creation

Move initialization of the mutex/condition variables required by the
save/restore feature to their own function.

The unix domain socket that facilitates communication between bhyvectl
and bhyve doesn't rely on these variables in order to be functional.

Reviewed by:    markj
Differential Revision:  https://reviews.freebsd.org/D30281

----

Add a virtio-input device emulation.

This will be used to inject keyboard/mouse input events into a guest.
The command line syntax is:
   -s <slot>,virtio-input,/dev/input/eventX

Reviewed by:	jhb (bhyve), grehan
Obtained from:	Corvin Köhne <C.Koehne@beckhoff.com>
MFC after:	3 weeks
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D30020

----

bhyve: Register new kevents synchronously.

Change mevent_add*() to synchronously add the new kevent.  This
permits reporting event registration failures to the caller and avoids
failing the registration of other, unrelated events queued up in the
same batch.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30502

----

bhyve: Add support for EVFILT_VNODE mevents.

This allows registering an event to watch for changes to a file's
attributes.  This is a bit imperfect as it would be nice to have a way
to determine if an fd can use EVFILT_VNODE successfully.  mevent's
current structure does not permit that and a failure to register a
single kevent impacts several other kevents.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30503

----

bhyve: Add support for handling disk resize events to block_if.

Allow clients of blockif to register a resize callback handler.  When
a callback is registered, register an EVFILT_VNODE kevent watching the
backing store for a change in the file's attributes.  If the size has
changed when the kevent fires, invoke the clients' callback.

Currently resize detection is limited to backing stores that support
EVFILT_VNODE kevents such as regular files.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30504

----

bhyve: Split out a lower-level helper for VirtIO interrupts.

This allows device models to assert VirtIO interrupts for reasons
other than publishing changes to a VirtIO ring such as configuration
changes.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30505

----

bhyve vtblk: Inform guests of disk resize events.

Register a resize callback with the blockif interface.  When the
callback fires, update the size of the disk and notify the guest via a
configuration change interrupt.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30506

----

bhyve: enhance debug info for memory range clash

Explain what the two clashing regions are.

Reivewed by:		grehan, jhb
Differential Revision:	https://reviews.freebsd.org/D29696
Pull Request:		freebsd/freebsd-src#463

----

Add more GIC and GICv3 registers

These aren't used by either driver, however they will be needed by
bhyve on arm64 to emulate a GICv3 interrupt controller.

Sponsored by:	Innovate UK

----

bhyve: Fix cli regression with NVMe ram

The configuration management refactoring inadvertently removed support
for a RAM-backed NVMe Namespace (i.e. -s X,nvme,ram=16384). This adds it
back.

Reported by:	andy@omniosce.org
Reviewed by:	jhb, andy@omniosce.org
Fixes:		621b509
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D30717

----

bhyve: fix NVMe MDTS comment

Removes an obsolete comment and adds parenthesis around the macro while
in the area. No functional change.

----

bhyve: Fix NVMe iovec construction for large IOs

The UEFI driver included with Rocky Linux 8.4 uncovered an existing bug
in the NVMe emulation's construction of iovec's.

By default, NVMe data transfer operations use a scatter-gather list in
which all entries point to a fixed size memory region. For example, if
the Memory Page Size is 4KiB, a 2MiB IO requires 512 entries. Lists
themselves are also fixed size (default is 512 entries).

Because the list size is fixed, the last entry is special. If the IO
requires more than 512 entries, the last entry in the list contains the
address of the next list of entries. But if the IO requires exactly 512
entries, the last entry points to data.

The NVMe emulation missed this logic and unconditionally treated the
last entry as a pointer to the next list. Fix is to check if the
remaining data is greater than the page size before using the last entry
as a pointer to the next list.

PR:		256422
Reported by:	dave@syix.com
Tested by:	jason@tubnor.net
MFC after:	5 days
Relnotes:	yes
Reviewed by:	imp, grehan
Differential Revision:	https://reviews.freebsd.org/D30897

----

Append Keyboard Layout specified option for using VNC.
Part one: supporting QEMU Extended Keyboard Event Message

PR:             246121
Submitted by:   koinec@yahoo.co.jp
Differential Revision: https://reviews.freebsd.org/D29430

----

libvmm: explicitly save and restore errno in vm_open()

In commit 6bb140e, vm_destroy() was replaced with free() to
preserve errno. However, it's possible that free() may change the errno
as well. Keep the free() call, but explicitly save and restore errno.

Noted by: jhb
Fixes: 6bb140e

----

vmm: Let guests enable SMEP/SMAP if the host supports it

Reviewed by:	kib, grehan, jhb
Tested by:	grehan (AMD)
MFC after:	3 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30462

----

vmm: Fix ivrs_drv device_printf usage

The original %b description string is wrong.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	imp, jhb
Differential Revision:	https://reviews.freebsd.org/D30805

----

bhyve/vioapic: remove an extra pin masked check

vioapic_send_intr does already check whether the pin is masked before
injecting the interrupt, there's no need to do it in vioapic_write
also.

No functional change intended.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28236

----

bhyve/ioapic: only account for asserted line in level mode

After modifying a redirection entry only try to inject an interrupt if
the pin is in level mode, pins in edge mode shouldn't take into
account the line assert status as they are triggered by edge changes,
not the line status itself.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28237

----

bhyve/ioapic: improve the tracking of IRR bit

One common method of EOI'ing an interrupt at the IO-APIC level is to
switch the pin to edge triggering mode and then back into level mode.
That would cause the IRR bit to be cleared and thus further interrupts
to be injected. FreeBSD does indeed use that method if the IO-APIC EOI
register is not supported.

The bhyve IO-APIC emulation code didn't clear the IRR bit when doing
that switch, and was also missing acknowledging the IRR state when
trying to inject an interrupt in vioapic_send_intr.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28238

----

ivrs_drv: Fix IVHDs with duplicated BaseAddress

Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28945

----

AMD-vi: Fix IOMMU device interrupts being overridden

Currently, AMD-vi PCI-e passthrough will lead to the following lines in
dmesg:
"kernel: CPU0: local APIC error 0x40
ivhd0: Error: completion failed tail:0x720, head:0x0."

After some tracing, the problem is due to the interaction with
amdvi_alloc_intr_resources() and pci_driver_added(). In ivrs_drv, the
identification of AMD-vi IVHD is done by walking over the ACPI IVRS
table and ivhdX device_ts are added under the acpi bus, while there are
no driver handling the corresponding IOMMU PCI function. In
amdvi_alloc_intr_resources(), the MSI intr are allocated with the ivhdX
device_t instead of the IOMMU PCI function device_t. bus_setup_intr() is
called on ivhdX. the IOMMU pci function device_t is only used for
pci_enable_msi(). Since bus_setup_intr() is not called on IOMMU pci
function, the IOMMU PCI function device_t's dinfo->cfg.msi is never
updated to reflect the supposed msi_data and msi_addr. So the msi_data
and msi_addr stay in the value 0. When pci_driver_added() tried to loop
over the children of a pci bus, and do pci_cfg_restore() on each of
them, msi_addr and msi_data with value 0 will be written to the MSI
capability of the IOMMU pci function, thus explaining the errors in
dmesg.

This change includes an amdiommu driver which currently does attaching,
detaching and providing DEVMETHODs for setting up and tearing down
interrupt. The purpose of the driver is to prevent pci_driver_added()
from calling pci_cfg_restore() on the IOMMU PCI function device_t.
The introduction of the amdiommu driver handles allocation of an IRQ
resource within the IOMMU PCI function, so that the dinfo->cfg.msi is
populated.

This has been tested on EPYC Rome 7282 with Radeon 5700XT GPU.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28984

----

Correct "Fondation" typo (missing "u")

----

AMD-vi: Mixed format IVHD block should replace fixed format IVHD block

This fixes double IVHD_SETUP_INTR calls on the same IOMMU device.

Sponsored by:	The FreeBSD Foundation
MFC with:	74ada29
Reported by:	Oleg Ginzburg <olevole@olevole.ru>
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29521

----

vmm: Fix AMD-vi using wrong rid range

The ACPI parsing code around rid range was wrong on assuming there is
only one pair of start/end device id range. Besides, ivhd_dev_parse()
never work as supposed. The start/end rid info was always zero.

Restructure the code to build dynamic-sized tables for each IOMMU softc
holding device entries. The device entries are enumerated to find a
suitable IOMMU unit. Operations on devices not governed (e.g. the IOMMU
unit itself) are no-op from now on. There are also a minor fix on wrong
%b formatting string usage.

Tested on my EPYC 7282.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D30827

----

vmm: Bump vmname buffer in struct vm to VM_MAX_NAMELEN + 1

In hw.vmm.create sysctl handler the maximum length of vm name is
VM_MAX_NAMELEN. However in vm_create() the maximum length allowed is
only VM_MAX_NAMELEN - 1 chars. Bump the length of the internal buffer to
allow the length of VM_MAX_NAMELEN for vm name.

MFC after:	3 days
Reviewed by:	grehan
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31372

----

amd64: Fix output operand specs for the stmxcsr and vmread intrinsics

This does not appear to affect code generation, at least with the
default toolchain.

Noticed because incorrect output specifications lead to false positives
from KMSAN, as the instrumentation uses them to update shadow state for
output operands.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31466

----

vmm: Make iommu ops tables const

While here, use designated initializers and rename some AMD iommu method
implementations to match the corresponding op names.  No functional
change intended.

Reviewed by:	grehan
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31462

----

vmm: Fix wrong assert in ivhd_dev_add_entry

The correct condition is to check the number of ivhd entries fit into
the array.

Reported by:	bz
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D31514

----

vmm: Add credential to cdev object

Add a credential to the cdev object in sysctl_vmm_create(), then check
that we have the correct credentials in sysctl_vmm_destroy(). This
prevents a process in one jail from opening or destroying the /dev/vmm
file corresponding to a VM in a sibling jail.

Add regression tests.

Reviewed by:	jhb, markj
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31156

----

bhyve: net_backends, automatically IFF_UP tap devices

If you want communications with the outside world and tell bhyve to
create an interfaces then it should be usable as well.
Rather than relying on the sysctl net.link.tap.up_on_open automatically
try to IFF_UP the opened tap device.

MFC after:	10 days
Reviewed by:	markj, grehan
Differential Revision: https://reviews.freebsd.org/D31342

----

bhyve: Use fspacectl(2) for BOP_DELETE on regular file images

bhyve can also make use of fspacectl(2) to implement BOP_DELETE with
hole-punching. Since it is not desirable to do zero-filling for large
DEALLOCATE/UNMAP range, candelete is not set if pathconf(2) indicates
that the underlying file system does not support native
VOP_DEALLOCATE(9).

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D28880

----

bhyve: Use pci(4) to access I/O port BARs

This removes the dependency on /dev/io.

PR:		251046
Reviewed by:	jhb
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31308

----

byhve: add option to specify IP address for gdb

Allow user to specify the IP address available for gdb debugger.

Reviewed by:	jhb, grehan, rgrimes, bcr (man pages)
Differential Revision:	https://reviews.freebsd.org/D29607

----

bhyve: change a default address from ANY to localhost

Discussed with:     grehan, jhb

----

bhyve: Fix vq_getchain() error handling bugs in various device models

Reviewed by:	grehan, khng
Approved by:	so
Security:	CVE-2021-29631
Security:	FreeBSD-SA-21:13.bhyve

----

pci: Add an ioctl to perform I/O to BARs

This is useful for bhyve, which otherwise has to use /dev/io to handle
accesses to I/O port BARs when PCI passthrough is in use.

Reviewed by:	imp, kib
Discussed with:	jhb
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31307
Yamagi added a commit to Yamagi/freebsd that referenced this pull request Feb 6, 2022
Includes all commits up to 2022/02/06 or d21e71efce39.

----

libvmm: clean up vmmapi.h

struct checkpoint_op, enum checkpoint_opcodes, and
MAX_SNAPSHOT_VMNAME are not vmm specific, move them out of the vmmapi
header.

They are used for the save/restore functionality that bhyve(8)
provides and are better suited in usr.sbin/bhyve/snapshot.h

Since bhyvectl(8) requires these, the Makefile for bhyvectl has been
modified to include usr.sbin/bhyve/snapshot.h

Reviewed by:    kevans, grehan
Differential Revision:  https://reviews.freebsd.org/D28410

----

bhyve/snapshot: drop mkdir when creating the unix domain socket

Add /var/run/bhyve/ to BSD.var.dist so we don't have to call mkdir when
creating the unix domain socket for a given bhyve vm.

The path to the unix domain socket for a bhyve vm will now be
/var/run/bhyve/vmname instead of /var/run/bhyve/checkpoint/vmname

Move BHYVE_RUN_DIR from snapshot.c to snapshot.h so it can be shared
to bhyvectl(8).

Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D28783

----

bhyve/snapshot: rename checkpoint_opcodes to be more generic

Generalize the naming here since the domain socket that uses these codes
might be used for purposes other than the save/restore feature.

- rename checkpoint_opcodes to ipc_opcode

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28877

----

bhyvectl: reduce code duplication

Combine send_start_checkpoint() and send_start_suspend() into a
single function named snapshot_request().

snapshot_request() is equivalent to send_start_checkpoint() and
send_start_suspend() except that it takes an additional argument. The
additional argument, enum ipc_opcode, is used to determine the type of
snapshot request being performed. Also, switch to using strlcpy instead
of strncpy.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28878

----

bhyve/snapshot: rename and bump size of MAX_SNAPSHOT_VMNAME

MAX_SNAPSHOT_VMNAME is a macro used to set the size of a character
buffer that stores a filename or the path to a file - this file is used
by the save/restore feature.

Since the file doesn't have anything to do with a vm name, rename
MAX_SNAPSHOT_VMNAME to MAX_SNAPSHOT_FILENAME. Bump the size to PATH_MAX
while here.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28879

----

bhyvectl: print a better error message when vm_open() fails

Use errno to print a more descriptive error message when vm_open() fails

libvmm: preserve errno when vm_device_open() fails

vm_destroy() squashes errno by making a dive into sysctlbyname() - we
can safely skip vm_destroy() here since it's not doing any critical
clean up at this point. Replace vm_destroy() with a free() call.

PR:             250671
MFC after:      3 days
Submitted by:   marko@apache.org
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D29109

----

bhyve/snapshot: use SOCK_DGRAM instead of SOCK_STREAM

The save/restore feature uses a unix domain socket to send messages
from bhyvectl(8) to a bhyve(8) process. A datagram socket will suffice
for this.

An added benefit of using a datagram socket is simplified code. For
bhyve, the listen/accept calls are dropped; and for bhyvectl, the
connect() call is dropped.

EPRINTLN handles raw mode for bhyve(8), use it to print error messages.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28983

----

bhyve: virtio shares definitions between sys/dev/virtio

Definitions inside usr.sbin/bhyve/virtio.h are thrown away.
Definitions in sys/dev/virtio are used instead.

This reduces code duplication.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29084

----

Refactor configuration management in bhyve.

Replace the existing ad-hoc configuration via various global variables
with a small database of key-value pairs.  The database supports
heirarchical keys using a MIB-like syntax to name the path to a given
key.  Values are always stored as strings.  The API used to manage
configuation values does include wrappers to handling boolean values.
Other values use non-string types require parsing by consumers.

The configuration values are stored in a tree using nvlists.  Leaf
nodes hold string values.  Configuration values are permitted to
reference other configuration values using '%(name)'.  This permits
constructing template configurations.

All existing command line arguments now set configuration values.  For
devices, the "-s" option parses its option argument to generate a list
of key-value pairs for the given device.

A new '-o' command line option permits setting an individual
configuration variable.  The key name is always given as a full path
of dot-separated components.

A new '-k' command line option parses a simple configuration file.
This configuration file holds a flat list of 'key=value' lines where
the 'key' is the full path of a configuration variable.  Lines
starting with a '#' are comments.

In general, bhyve starts by parsing command line options in sequence
and applying those settings to configuration values.  Once this is
complete, bhyve then begins initializing its state based on the
configuration values.  This means that subsequent configuration
options or files may override or supplement previously given settings.

A special 'config.dump' configuration value can be set to true to help
debug configuration issues.  When this value is set, bhyve will print
out the configuration variables as a flat list of 'key=value' lines.

Most command line argments map to a single configuration variable,
e.g.  '-w' sets the 'x86.strictmsr' value to false.  A few command
line arguments have less obvious effects:

- Multiple '-p' options append their values (as a comma-seperated
  list) to "vcpu.N.cpuset" values (where N is a decimal vcpu number).

- For '-s' options, a pci.<bus>.<slot>.<function> node is created.
  The first argument to '-s' (the device type) is used as the value of
  a "device" variable.  Additional comma-separated arguments are then
  parsed into 'key=value' pairs and used to set additional variables
  under the device node.  A PCI device emulation driver can provide
  its own hook to override the parsing of the additonal '-s' arguments
  after the device type.

  After the configuration phase as completed, the init_pci hook
  then walks the "pci.<bus>.<slot>.<func>" nodes.  It uses the
  "device" value to find the device model to use.  The device
  model's init routine is passed a reference to its nvlist node
  in the configuration tree which it can query for specific
  variables.

  The result is that a lot of the string parsing is removed from
  the device models and centralized.  In addition, adding a new
  variable just requires teaching the model to look for the new
  variable.

- For '-l' options, a similar model is used where the string is
  parsed into values that are later read during initialization.
  One key note here is that the serial ports use the commonly
  used lowercase names from existing documentation and examples
  (e.g. "lpc.com1") instead of the uppercase names previously
  used internally in bhyve.

Reviewed by:	grehan
MFC after:	3 months
Differential Revision:	https://reviews.freebsd.org/D26035

----

bhyve: support relocating fbuf and passthru data BARs

We want to allow the UEFI firmware to enumerate and assign
addresses to PCI devices so we can boot from NVMe[1]. Address
assignment of PCI BARs is properly handled by the PCI emulation
code in general, but a few specific cases need additional support.
fbuf and passthru map additional objects into the guest physical
address space and so need to handle address updates. Here we add a
callback to emulated PCI devices to inform them of a BAR
configuration change. fbuf and passthru then watch for these BAR
changes and relocate the frame buffer memory segment and passthru
device mmio area respectively.

We also add new VM_MUNMAP_MEMSEG and VM_UNMAP_PPTDEV_MMIO ioctls
to vmm(4) to facilitate the unmapping needed for addres updates.

[1]: https://github.com/freebsd/uefi-edk2/pull/9/

Originally by:	scottph
MFC After:	1 week
Sponsored by:	Intel Corporation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D24066

----

bhyve amd: Small cleanups in amdvi_dump_cmds

Bump offset with MOD_INC instead in amdvi_dump_cmds.

Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D28862

----

bhyve hostbridge: Rename "device" property to "devid".

"device" is already used as the generic PCI-level name of the device
model to use (e.g. "hostbridge").  The result was that parsing
"hostbridge" as an integer failed and the host bridge used a device ID
of 0.  The EFI ROM asserts that the device ID of the hostbridge is not
0, so booting with the current EFI ROM was failing during the ROM
boot.

Fixes:		621b5090487de9fed1b503769702a9a2a27cc7bb

----

bhyve: Enable virtio-scsi legacy config parsing.

The previous commit added the handler to parse the command line
options for virtio-scsi devices but forgot to set the correct function
pointer to point to the handler.

Reported by:	vangyzen
Reviewed by:	vangyzen
Fixes:		621b5090487de9fed1b503769702a9a2a27cc7bb
Differential Revision:	https://reviews.freebsd.org/D29438

----

bhyve: change vq_getchain to return iovecs in both directions

The old prototype requires callers to inspect flags of each descriptors
to get the starting position of host-writable iovecs.

vq_getchain() is changed to return a virtio request with the number of
host-readable iovecs and host-writable iovecs instead. Callers can avoid
boilerplate code of getting the start offset of host-writable iovecs.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Reviewed by:	afedorov
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29433

----

Fix typo in xhci nvlist node name, and also increment device counter.

This allows the xhci tablet device to be recognized and a PCI device
instantiated.

Reviewed by:	jhb
Fixes:		621b5090487d Refactor configuration management in bhyve.
MFC after:	3 months.

----

bhyve: fix regression in legacy virtio-9p config parsing

Commit 621b5090487de9fed1b503769702a9a2a27cc7bb introduced a regression
in legacy virtio-9p config parsing by not initializing *sharename to
NULL. As a result, "sharename != NULL" check in the first iteration fails
and bhyve exits with "virtio-9p: more than one share name given".

Fix by adding NULL back.

Approved by:	grehan

----

bhyve: add SMBIOS Baseboard Information

Add the System Management BIOS Baseboard (or Module) Information
a.k.a. Type 2 structure to the SMBIOS emulation.

Reviewed by:	rgrimes, bcran, grehan
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29657

----

bhyve: Move the gdb_active check to gdb_cpu_suspend().

The check needs to be in the public routine (gdb_cpu_suspend()), not
in the internal routine called from various places
(_gdb_cpu_suspend()).  All the other callers of _gdb_cpu_suspend()
already check gdb_active, and this breaks the use of snapshots when
the debug server is not enabled since gdb_cpu_suspend() tries to lock
an uninitialized mutex.

Reported by:	Darius Mihai, Elena Mihailescu
Reviewed by:	elenamihailescu22_gmail.com
Fixes:		621b5090487de9fed1b503769702a9a2a27cc7bb
Differential Revision:	https://reviews.freebsd.org/D29538

----

bhyve: implement rdmsr for MSR_IA32_FEATURE_CONTROL

Without the -w option, Windows guests crash on boot. This is caused by a rdmsr
of MSR_IA32_FEATURE_CONTROL. Windows checks this MSR to determine enabled VMX
features. This MSR isn't emulated in bhyve, so a #GP exception is injected
which causes Windows to crash.

Fix by returning a rdmsr of MSR_IA32_FEATURE_CONTROL with Lock Bit set and
VMX disabled to informWindows that VMX isn't available.

Reviewed by:	jhb, grehan (bhyve)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29665

----

bhyve.8: Make synopsis more readable

There is no need to squeeze all the possible options into one synopsis
entry. Let "-l help" and "-s help" be listed separately.

While here, keep -s and its arguments on the same line.

MFC after:	2 weeks

----

bhyve: Fix synopsis in the usage message

In particular:
- Sort short options to align with style(9)
- Add two missing flags: -G and -r
- Drop unnecessary angle brackets for consistency
- Rename the "vm" argument to vmname for consistency with the manual
  page

MFC after:	2 weeks

----

bhyve: Improve the option description in the usage message

- Sort options as suggested by style(9)
- Capitalize some words like CPU and HLT
- Add a missing description for the -G flag

MFC after:	2 weeks

----

bhyve.8: Sort the options in the OPTIONS section

No content change intended. Just moving the option descriptions around
to follow the order suggested by style(9).

MFC after:	2 weeks

----

bhyve.8: Improve the description and synopsis of -l

- Describe "-l help" separately for readability.
- List all the supported comX devices explicitly
- Use Cm instead of Ar for command modifiers (i.e., literal values a
  user can specify as an argument to the command).
- Explain where to get more information about the possible values of the
  conf argument.

MFC after:	2 weeks

----

bhyve.8: Improve the description of the -m flag

- Stylize the synopsis with proper mdoc macros
- Do some wordsmithing on the description for consistency.

MFC after:	2 weeks

----

bhyve.8: Fix the synopsis of -p

Use appropriate mdoc macros.

MFC after:	2 weeks

----

bhyve.8: Clean up description of -r

There is no need to wrap those flags in Op macros.

MFC after:	2 weeks

----

bhyve.8: Fix indention in the signals table

MFC after:	2 weeks

----

bhyve.8: Clean-up synopsis of -s

- Document "-s help" separately for readability.
- Use appropriate mdoc macros.

MFC after:	2 weeks

----

bhyve.8: Clean up the slot description of -s

Also, remove the macros of the nested list which contained slot,
emulation and conf. This decreases the indention of the -s description.
It was necessary to clean up the slot description.

MFC after:	2 weeks

----

bhyve.8: Improve emulation description of the -s flag

- Set width of the list to the longest key word for readability.
- Separate descriptions of amd_hostbridge and hostbridge emulations.
  Also, wordsmith their descriptions for consistency with other entries.
- Use Cm instead of Li for command modifiers.
- Do not stylize AMD with Li, there's no need to do it.
- Mention COM3 and COM4 in the definition of lpc.
- Fix a typo in the definition of ahci-hd ("hard drive" instead of
  "hard-drive").

MFC after:	2 weeks

----

bhyve.8: Clean up network backends section

- Reformat the format lists, use appropriate mdoc macros for
  readability.
- Add a missing Oxford comma.

MFC after:	2 weeks

----

bhyve.8: Clean up block storage device backends description

MFC after:	2 weeks

----

bhyve.8: Clean up SCSI device backends section

MFC after:	2 weeks

----

bhyve.8: Clean up 9P device backends section

MFC after:	2 weeks

----

bhyve.8: Clean up TTY, boot ROM, and pass-through descriptions

MFC after:	2 weeks

----

bhyve.8: Clean up virtio console device backends description

MFC after:	2 weeks

----

bhyve.8: Improve framebuffer backends description

- Use appropriate mdoc macros
- Document that tcp= is a synonym to rfb= (tcp is used in the examples,
  but never mentioned)
- Clarify the IP address specification

MFC after:	2 weeks

----

bhyve.8: Improve documentation of NVME backend

- Document the configuration format.
- Document two additional configuration options: eui64 and dsm.

MFC after:	2 weeks

----

bhyve.8: Improve AHCI backends documentation

- Document the backend format.

MFC after:	2 weeks

----

bhyve: Document the format for HD audio backends

- This change is done for consistency with other backend definitions.

MFC after:	2 weeks

----

bhyve.8: Fix mandoc -Tlint issues

While here, keep network backends section consistent with other
sections.

MFC after:	2 weeks

----

bhyve: Be explicit that setting config.dump will not start a VM.

Suggested by:	rpokala
Reviewed by:	bcr (manpages)
Differential Revision:	https://reviews.freebsd.org/D29738

----

bhyve: Gracefully handle virtio-scsi with no conf

Fixes segfault with the command `bhyve -s 0,virtio-scsi`, which is used
by some third party software to probe bhyve for virtio-scsi support.

Reviewed by:	jhb
MFC after:	1 day
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D29926

----

bhyve: Set SO_REUSEADDR on the gdb stub socket

Reviewed by:	jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30037

----

bhyve/snapshot: provide a way to send other messages/data to bhyve

This is a step towards sending messages (other than suspend/checkpoint)
from bhyvectl to bhyve.

Introduce a new struct, ipc_message - this struct stores the type of
message and a union containing message specific structures for the type
of message being sent.

Reviewed by:    grehan
Differential Revision: https://reviews.freebsd.org/D30221

----

bhyve/snapshot: split up mutex/cond initialization from socket creation

Move initialization of the mutex/condition variables required by the
save/restore feature to their own function.

The unix domain socket that facilitates communication between bhyvectl
and bhyve doesn't rely on these variables in order to be functional.

Reviewed by:    markj
Differential Revision:  https://reviews.freebsd.org/D30281

----

Add a virtio-input device emulation.

This will be used to inject keyboard/mouse input events into a guest.
The command line syntax is:
   -s <slot>,virtio-input,/dev/input/eventX

Reviewed by:	jhb (bhyve), grehan
Obtained from:	Corvin Köhne <C.Koehne@beckhoff.com>
MFC after:	3 weeks
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D30020

----

bhyve: Register new kevents synchronously.

Change mevent_add*() to synchronously add the new kevent.  This
permits reporting event registration failures to the caller and avoids
failing the registration of other, unrelated events queued up in the
same batch.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30502

----

bhyve: Add support for EVFILT_VNODE mevents.

This allows registering an event to watch for changes to a file's
attributes.  This is a bit imperfect as it would be nice to have a way
to determine if an fd can use EVFILT_VNODE successfully.  mevent's
current structure does not permit that and a failure to register a
single kevent impacts several other kevents.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30503

----

bhyve: Add support for handling disk resize events to block_if.

Allow clients of blockif to register a resize callback handler.  When
a callback is registered, register an EVFILT_VNODE kevent watching the
backing store for a change in the file's attributes.  If the size has
changed when the kevent fires, invoke the clients' callback.

Currently resize detection is limited to backing stores that support
EVFILT_VNODE kevents such as regular files.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30504

----

bhyve: Split out a lower-level helper for VirtIO interrupts.

This allows device models to assert VirtIO interrupts for reasons
other than publishing changes to a VirtIO ring such as configuration
changes.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30505

----

bhyve vtblk: Inform guests of disk resize events.

Register a resize callback with the blockif interface.  When the
callback fires, update the size of the disk and notify the guest via a
configuration change interrupt.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30506

----

bhyve: enhance debug info for memory range clash

Explain what the two clashing regions are.

Reivewed by:		grehan, jhb
Differential Revision:	https://reviews.freebsd.org/D29696
Pull Request:		https://github.com/freebsd/freebsd-src/pull/463

----

Add more GIC and GICv3 registers

These aren't used by either driver, however they will be needed by
bhyve on arm64 to emulate a GICv3 interrupt controller.

Sponsored by:	Innovate UK

----

bhyve: Fix cli regression with NVMe ram

The configuration management refactoring inadvertently removed support
for a RAM-backed NVMe Namespace (i.e. -s X,nvme,ram=16384). This adds it
back.

Reported by:	andy@omniosce.org
Reviewed by:	jhb, andy@omniosce.org
Fixes:		621b5090487d
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D30717

----

bhyve: fix NVMe MDTS comment

Removes an obsolete comment and adds parenthesis around the macro while
in the area. No functional change.

----

bhyve: Fix NVMe iovec construction for large IOs

The UEFI driver included with Rocky Linux 8.4 uncovered an existing bug
in the NVMe emulation's construction of iovec's.

By default, NVMe data transfer operations use a scatter-gather list in
which all entries point to a fixed size memory region. For example, if
the Memory Page Size is 4KiB, a 2MiB IO requires 512 entries. Lists
themselves are also fixed size (default is 512 entries).

Because the list size is fixed, the last entry is special. If the IO
requires more than 512 entries, the last entry in the list contains the
address of the next list of entries. But if the IO requires exactly 512
entries, the last entry points to data.

The NVMe emulation missed this logic and unconditionally treated the
last entry as a pointer to the next list. Fix is to check if the
remaining data is greater than the page size before using the last entry
as a pointer to the next list.

PR:		256422
Reported by:	dave@syix.com
Tested by:	jason@tubnor.net
MFC after:	5 days
Relnotes:	yes
Reviewed by:	imp, grehan
Differential Revision:	https://reviews.freebsd.org/D30897

----

Append Keyboard Layout specified option for using VNC.
Part one: supporting QEMU Extended Keyboard Event Message

PR:             246121
Submitted by:   koinec@yahoo.co.jp
Differential Revision: https://reviews.freebsd.org/D29430

----

libvmm: explicitly save and restore errno in vm_open()

In commit 6bb140e3ca895a14, vm_destroy() was replaced with free() to
preserve errno. However, it's possible that free() may change the errno
as well. Keep the free() call, but explicitly save and restore errno.

Noted by: jhb
Fixes: 6bb140e3ca895a14

----

vmm: Let guests enable SMEP/SMAP if the host supports it

Reviewed by:	kib, grehan, jhb
Tested by:	grehan (AMD)
MFC after:	3 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30462

----

vmm: Fix ivrs_drv device_printf usage

The original %b description string is wrong.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	imp, jhb
Differential Revision:	https://reviews.freebsd.org/D30805

----

bhyve/vioapic: remove an extra pin masked check

vioapic_send_intr does already check whether the pin is masked before
injecting the interrupt, there's no need to do it in vioapic_write
also.

No functional change intended.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28236

----

bhyve/ioapic: only account for asserted line in level mode

After modifying a redirection entry only try to inject an interrupt if
the pin is in level mode, pins in edge mode shouldn't take into
account the line assert status as they are triggered by edge changes,
not the line status itself.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28237

----

bhyve/ioapic: improve the tracking of IRR bit

One common method of EOI'ing an interrupt at the IO-APIC level is to
switch the pin to edge triggering mode and then back into level mode.
That would cause the IRR bit to be cleared and thus further interrupts
to be injected. FreeBSD does indeed use that method if the IO-APIC EOI
register is not supported.

The bhyve IO-APIC emulation code didn't clear the IRR bit when doing
that switch, and was also missing acknowledging the IRR state when
trying to inject an interrupt in vioapic_send_intr.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28238

----

ivrs_drv: Fix IVHDs with duplicated BaseAddress

Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28945

----

AMD-vi: Fix IOMMU device interrupts being overridden

Currently, AMD-vi PCI-e passthrough will lead to the following lines in
dmesg:
"kernel: CPU0: local APIC error 0x40
ivhd0: Error: completion failed tail:0x720, head:0x0."

After some tracing, the problem is due to the interaction with
amdvi_alloc_intr_resources() and pci_driver_added(). In ivrs_drv, the
identification of AMD-vi IVHD is done by walking over the ACPI IVRS
table and ivhdX device_ts are added under the acpi bus, while there are
no driver handling the corresponding IOMMU PCI function. In
amdvi_alloc_intr_resources(), the MSI intr are allocated with the ivhdX
device_t instead of the IOMMU PCI function device_t. bus_setup_intr() is
called on ivhdX. the IOMMU pci function device_t is only used for
pci_enable_msi(). Since bus_setup_intr() is not called on IOMMU pci
function, the IOMMU PCI function device_t's dinfo->cfg.msi is never
updated to reflect the supposed msi_data and msi_addr. So the msi_data
and msi_addr stay in the value 0. When pci_driver_added() tried to loop
over the children of a pci bus, and do pci_cfg_restore() on each of
them, msi_addr and msi_data with value 0 will be written to the MSI
capability of the IOMMU pci function, thus explaining the errors in
dmesg.

This change includes an amdiommu driver which currently does attaching,
detaching and providing DEVMETHODs for setting up and tearing down
interrupt. The purpose of the driver is to prevent pci_driver_added()
from calling pci_cfg_restore() on the IOMMU PCI function device_t.
The introduction of the amdiommu driver handles allocation of an IRQ
resource within the IOMMU PCI function, so that the dinfo->cfg.msi is
populated.

This has been tested on EPYC Rome 7282 with Radeon 5700XT GPU.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28984

----

Correct "Fondation" typo (missing "u")

----

AMD-vi: Mixed format IVHD block should replace fixed format IVHD block

This fixes double IVHD_SETUP_INTR calls on the same IOMMU device.

Sponsored by:	The FreeBSD Foundation
MFC with:	74ada297e897
Reported by:	Oleg Ginzburg <olevole@olevole.ru>
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29521

----

vmm: Fix AMD-vi using wrong rid range

The ACPI parsing code around rid range was wrong on assuming there is
only one pair of start/end device id range. Besides, ivhd_dev_parse()
never work as supposed. The start/end rid info was always zero.

Restructure the code to build dynamic-sized tables for each IOMMU softc
holding device entries. The device entries are enumerated to find a
suitable IOMMU unit. Operations on devices not governed (e.g. the IOMMU
unit itself) are no-op from now on. There are also a minor fix on wrong
%b formatting string usage.

Tested on my EPYC 7282.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D30827

----

vmm: Bump vmname buffer in struct vm to VM_MAX_NAMELEN + 1

In hw.vmm.create sysctl handler the maximum length of vm name is
VM_MAX_NAMELEN. However in vm_create() the maximum length allowed is
only VM_MAX_NAMELEN - 1 chars. Bump the length of the internal buffer to
allow the length of VM_MAX_NAMELEN for vm name.

MFC after:	3 days
Reviewed by:	grehan
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31372

----

amd64: Fix output operand specs for the stmxcsr and vmread intrinsics

This does not appear to affect code generation, at least with the
default toolchain.

Noticed because incorrect output specifications lead to false positives
from KMSAN, as the instrumentation uses them to update shadow state for
output operands.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31466

----

vmm: Make iommu ops tables const

While here, use designated initializers and rename some AMD iommu method
implementations to match the corresponding op names.  No functional
change intended.

Reviewed by:	grehan
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31462

----

vmm: Fix wrong assert in ivhd_dev_add_entry

The correct condition is to check the number of ivhd entries fit into
the array.

Reported by:	bz
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D31514

----

vmm: Add credential to cdev object

Add a credential to the cdev object in sysctl_vmm_create(), then check
that we have the correct credentials in sysctl_vmm_destroy(). This
prevents a process in one jail from opening or destroying the /dev/vmm
file corresponding to a VM in a sibling jail.

Add regression tests.

Reviewed by:	jhb, markj
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31156

----

bhyve: net_backends, automatically IFF_UP tap devices

If you want communications with the outside world and tell bhyve to
create an interfaces then it should be usable as well.
Rather than relying on the sysctl net.link.tap.up_on_open automatically
try to IFF_UP the opened tap device.

MFC after:	10 days
Reviewed by:	markj, grehan
Differential Revision: https://reviews.freebsd.org/D31342

----

bhyve: Use fspacectl(2) for BOP_DELETE on regular file images

bhyve can also make use of fspacectl(2) to implement BOP_DELETE with
hole-punching. Since it is not desirable to do zero-filling for large
DEALLOCATE/UNMAP range, candelete is not set if pathconf(2) indicates
that the underlying file system does not support native
VOP_DEALLOCATE(9).

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D28880

----

bhyve: Use pci(4) to access I/O port BARs

This removes the dependency on /dev/io.

PR:		251046
Reviewed by:	jhb
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31308

----

byhve: add option to specify IP address for gdb

Allow user to specify the IP address available for gdb debugger.

Reviewed by:	jhb, grehan, rgrimes, bcr (man pages)
Differential Revision:	https://reviews.freebsd.org/D29607

----

bhyve: change a default address from ANY to localhost

Discussed with:     grehan, jhb

----

bhyve: Fix vq_getchain() error handling bugs in various device models

Reviewed by:	grehan, khng
Approved by:	so
Security:	CVE-2021-29631
Security:	FreeBSD-SA-21:13.bhyve

----

pci: Add an ioctl to perform I/O to BARs

This is useful for bhyve, which otherwise has to use /dev/io to handle
accesses to I/O port BARs when PCI passthrough is in use.

Reviewed by:	imp, kib
Discussed with:	jhb
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31307

----

bhyve: Nuke double-semicolons

A distinct number of double-semicolons ended up in bhyve. Take a pass at
getting rid of many of these harmless typos.

MFC after:	3 days

----

bhyve: Fix pci device node key in bhyve_config.5

PCI device node key in the manual page is wrong. It should be
pci.bus.slot.function.

MFC after:	3 days

----

bhyve: Support setting the disk serial number for VirtIO block devices.

Reviewed by:	allanjude
Obtained from:	illumos
Differential Revision:	https://reviews.freebsd.org/D31983

----

bhyve: Update the -G description in the SYNPOSIS.

It was missing both the 'w' flag and 'bind_address'.

----

bhyve_config.5: Document gdb.address.

----

bhyve: Add an empty case for event types in mevent_kq_fflags().

This fixes a -Wswitch error raised by GCC 9.

Differential Revision:	https://reviews.freebsd.org/D31938

----

bhyve: Map the MSI-X table unconditionally for passthrough

It is possible for the PBA to reside in the same page as the MSI-X
table.  And, while devices are not supposed to do this, at least some
Intel wifi devices place registers in a page shared with the MSI-X
table.  To handle the first case we currently map the PBA page using
/dev/mem, and the second case is not handled.

Kill two birds with one stone: map the MSI-X table BAR using the
PCIOCBARMMAP ioctl instead of /dev/mem, and map the entire table so that
accesses beyond the bounds of the table can be emulated.  Regions of the
BAR not containing the table are left unmapped.

Reviewed by:	bz, grehan, jhb
MFC after:	3 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32359

----

bhyve.8: Fix markup of the -G flag

----

bhyve: Update usage and synopsis for the -k flag

Let's make it clear to users that -k is for configuration files.
Also, point to bhyve_config(5) in the paragraph describing the flag.

Reviewed by:	jhb
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D32467

----

bhyve: ignore low bits of CFGADR

Bhyve could emulate wrong PCI registers.
In the best case, the guest reads wrong registers and the device driver would
report some errors.
In the worst case, the guest writes to wrong PCI registers and could brick
hardware when using PCI passthrough.

According to Intels specification, low bits of CFGADR should be
ignored. Some OS like linux may rely on it. Otherwise, bhyve could
emulate a wrong PCI register.

E.g.
If linux would like to read 2 bytes from offset 0x02, following would
happen.
linux:
	outl 0x80000002 at CFGADR
	inw  at CFGDAT + 2
bhyve:
	cfgoff = 0x80000002 & 0xFF = 0x02
	coff   = cfgoff + (port - CFGDAT) = 0x02 + 0x02 = 0x04
Bhyve would emulate the register at offset 0x04 not 0x02.

Reviewed By: #bhyve, grehan
Differential Revision: https://reviews.freebsd.org/D31819
Sponsored by:	       Beckhoff Automation GmbH & Co. KG

----

bhyve: Fix the WITH_BHYVE_SNAPSHOT build

Note, this breaks compatibility with snapshots generated by older builds
of bhyve(8).

Fixes: 7fa233534736 ("bhyve: Map the MSI-X table unconditionally for passthrough")
Reported by:	Greg V <greg@unrelenting.technology>
Reviewed by:	grehan, bz
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32523

----

bhyve: Bump the SMBIOS firmware version to 14.0 for 14-CURRENT

Bump the firmware version to 14.0 and set the firmware release date
to today.

Reviewed by: jhb, bz, imp
Differential Revision: https://reviews.freebsd.org/D32534

----

bhyve: use physical lobits for BARs of passthru devices

Tell the guest whether a BAR uses prefetched memory or not for
passthru devices by using the same lobits as the physical device.

Reviewed by:	 grehan
Sponsored by:	 Beckhoff Autmation GmbH & Co. KG
Differential Revision:	  https://reviews.freebsd.org/D32685

----

bhyve: do not explicitly map fbuf framebuffer

Allocating a BAR will call baraddr which maps the framebuffer. No need
to allocate it explicitly on init.

Reviewed by:     grehan
Sponsored by:    Beckhoff Autmation GmbH & Co. KG
Differential Revision:    https://reviews.freebsd.org/D32596

----

bhyve: move 64 bit BAR location to match OVMF assumptions

OVMF will fail, if large 64 bit BARs are used. GCD-Map doesn't cover
64 bit addresses of BARs.
OVMF assumes that 64 bit addresses of BARS are located on next 32 GB
boundary behind Top of High RAM.

This patch moves 64 bit BARs on next 32 GB boundary behind Top of High
RAM to match OVMF assumptions.

Differential Revision:	https://reviews.freebsd.org/D27970
Sponsored by: Beckhoff Automation GmbH & Co. KG

----

bhyve: use a fixed 32 bit BAR base address

OVMF always uses 0xC0000000 as base address for 32 bit PCI MMIO space.
For that reason, we should use that address too.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D31051
Sponsored by:	Beckhoff Automation GmbH & Co. KG

----

bhyve: keep physical and virtual COMMAND reg in sync

On startup all virtual BARs are registered.
Additionally, the encoding bit in the virtual cmd register is set.
After that, the passthru emulation overwrites the virtual cmd register with
the physical one.
This could lead to a mismatch between registered BARs and the encoding
bits in the cmd register.
Instead of writing the physical to the virtual cmd register,
write the virtual to the physical cmd register to solve this issue.

Reviewed by:	  markj
Differential Revision:	https://reviews.freebsd.org/D32687
Sponsored by:	Beckhoff Automation GmbH & Co. KG

----

bhyve: emulate reads of MSI-X capabilities for passthru devices

Reads of the MSI-X capabilites aren't emulated by passthru devices
yet. The guest will read the host MSI-X capabilites which could
cause issues.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D32686
Sponsored by:	Beckhoff Automation GmbH & Co. KG

----

bhyve: Fix compile

We need err.h

Fixes:	5cf21e48ccf11 ("bhyve: use a fixed 32 bit BAR base address")
Sponsored by:	Bechoff Automation GmbH & Co. KG

----

bhyve blockif: fix blockif_candelete with Capsicum

NVMe conformance tests for the Format command failed if the
backing-storage for the bhyve device was a file instead of a Zvol. The
tests (and the specification) expect a Format to destroy all previously
written data. The bhyve NVMe emulation implements this by trimming /
deallocating all data from the backing-storage.

The blockif_candelete() function indicated the file did not support
deallocation (i.e. fpathconf(..., _PC_DEALLOC_PRESENT) returned FALSE)
even though the kernel supported file hole punching. This occurs on
builds with Capsicum enabled because blockif did not allow the
fpathconf(2) right.

Fix is to add CAP_FPATHCONF to the cap_rights_init(3) call.

PR:		260081
Reviewed by:	allanjude, markj, jhb
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D33203

----

bhyve: fix -Wunused-but-set-variable warning

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D33306

----

bhyve: Support a _VARS.fd file for bootrom

OVMF creates two separate .fd files, a _CODE.fd file containing
the UEFI code, and a _VARS.fd file containing a template of an
empty UEFI variable store.

OVMF decides to write variables to the memory range just below the
boot rom code if it detects a CFI flash device. So here we add
just the barest facsimile of CFI command handling to bootrom.c
that is needed to placate OVMF.

Submitted by: D Scott Phillips <d.scott.phillips@intel.com>
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19976
MFC After: 1 week

----

bhyve: set EV_CLEAR for EVFILT_VNODE mevents

When an EVFILT_VNODE filter event is triggered, reset it.

This fixes the issue where a virtio-blk resize event would cause the
mevent thread to consume 100% of the cpu.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D33326

----

bhyve nvme: Add AEN support to NVMe emulation

Add Asynchronous Event Notification infrastructure to the NVMe
emulation.

Reviewed by:	imp, grehan
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D32952

----

bhyve nvme: Inform guests of namespace resize

Register a "block resize" callback to be notified of changes to the
backing storage for the Namespace. Use this to generate an Asynchronous
Event Notification, Namespace Attributes Changed when the guest OS
provides an Asynchronous Event Request.

MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D32953

----

bhyve: Only snapshot initialized VirtIO queues

If the virtio device is not fully initialized, then suspend fails with:

  vi_pci_snapshot_queues: invalid address: vq->vq_desc
  Failed to snapshot virtio-rnd; ret=14

MFC after:	1 week
Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D26268

----

bhyve: passthru: enable BARs before possibly mmap(2)ing them

The first time we start bhyve with a passthru device everything is fine
as on boot we do enable BARs.  If a driver (unload) inside bhyve disables
the BAR(s) as some Linux drivers do, we need to make sure we re-enable
them on next bhyve start.

If we are trying to mmap a disabled BAR for MSI-X (PCIOCBARMMAP)
the kernel will give us an EBUSY.
While we were re-enabling the BAR(s) in the current code loop
cfginit() was writing the changes out too late to the real hardware.

Move the call to init_msix_table() after the register on the real
hardware was updated.  That way the kernel will be happy and the
mmap will succeed and bhyve will start.
Also simplify the code given the last argument to init_msix_table()
is unused we do not need to do checks for each bar. [1]

MFC after:	3 days
PR:		260148
Pointed out by:	markj [1]
Sponsored by:	The FreeBSD Foundation
Reviewed by:	markj
Differential Revision: https://reviews.freebsd.org/D33628

----

bhyve: clean up trailing whitespaces

Clean up trailing whitespaces. No functional changes.

Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D33681

----

bhyve smbios type 3 structure is incorrect

If you look at the SMBIOS specification, we'll find something is
missing. In particular at offset 0Dh is supposed to be the OEM-defined
field. This should go between security and height. It is not legal to
actually skip this and will lead to other folks not properly
interpreting later parts of the table.

https://www.illumos.org/issues/14312

Reviewed by:	jhb
Submitted by:	Robert Mustacchi <rm@fingolfin.org>
Obtained from:	ilumos
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D33682

----

bhyve: only init MSI-X table if passthru device supports it

Some passthru devices only support MSI instead of MSI-X. For those
devices the initialization of MSI-X table will fail. Re-add the
check erroneously removed in f1442847c9404d4bc5f5524a0c3362dd39cb14f9.

MFC after:	3 days
X-MFC with:	f1442847c9404d4bc5f5524a0c3362dd39cb14f9
PR:		260148
Reviewed by:	manu, bz
Differential Revision:	https://reviews.freebsd.org/D33728

----

bhyve: enumerate BARs by size

E.g. Framebuffers can require large space and BARs need to be aligned
by their size. If BARs aren't allocated by size, it'll cause much
fragmentation of the MMIO space. Reduce fragmentation by ordering
the BAR allocation on their size to reduce the risk of
OUT_OF_MMIO_SPACE issues.

Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	Beckhoff Automation GmbH & Co. KG
Differential Revision:	https://reviews.freebsd.org/D28278

----

bhyve: allow reading of fwctl signature multiple times

At the moment, you only have one single chance to read the fwctl
signature. At boot bhyve is in the state IDENT_WAIT. It's then
possible to switch to IDENT_SEND. After bhyve sends the signature,
it switches to REQ. From now on it's impossible to switch back to
IDENT_SEND to read the signature. For that reason, only a single
driver can read the signature. A guest can't use two drivers to
identify that fwctl is present. It gets even worse when using
OVMF. OVMF uses a library to access fwctl. Therefore, every single
OVMF driver would try to read the signature. Currently, only a
single OVMF driver accesses the fwctl. So, there's no issue with
it yet. However, no OS driver would have a chance to detect fwctl when
using OVMF because it's signature was already consumed by OVMF.

Reviewed by:    markj
MFC after:      2 weeks
Sponsored by:   Beckhoff Automation GmbH & Co. KG
Differential Revision:  https://reviews.freebsd.org/D31981

----

bhyve: add more slop to 64 bit BARs

Bhyve allocates small 64 bit BARs below 4 GB and generates ACPI tables
based on this allocation. If the guest decides to relocate those BARs
above 4 GB, it could lead to mismatching ACPI tables. Especially
when using OVMF with enabled bus enumeration it could cause
issues. OVMF relocates all 64 bit BARs above 4 GB. The guest OS
may be unable to recover from this situation and disables some PCI
devices because their BARs are located outside of the MMIO space
reported by ACPI. Avoid this situation by giving the guest more
space for relocating BARs.

Let's be paranoid. The available space for BARs below 4 GB is 512 MB
large. Use a slop of 512 MB. It'll allow the guest to relocate all
BARs below 4 GB to an address above 4 GB. We could run into issues
when we exceeding the memlimit above 4 GB. However, this space has
a size of 32 GB. Even when using many PCI device with large BARs
like framebuffer or when using multiple PCI busses, it's very
unlikely that we run out of space due to the large slop.
Additionally, this situation will occur on startup and not at runtime
which is much better.

Reviewed by:    markj
MFC after:      2 weeks
Sponsored by:   Beckhoff Automation GmbH & Co. KG
Differential Revision:  https://reviews.freebsd.org/D33118

----

bhyve: dynamically register FwCtl ports

Qemu's FwCfg uses the same ports as Bhyve's FwCtl. Static allocated
ports wouldn't allow to switch between Qemu's FwCfg and Bhyve's
FwCtl.

Reviewed by:    markj
MFC after:      2 weeks
Sponsored by:   Beckhoff Automation GmbH & Co. KG
Differential Revision:  https://reviews.freebsd.org/D33496

----

bhyve: Map the right BAR in init_msix_table()

The PBA and MSI-X table can reside in different BARs.

Reported by:	Andy Fiddaman <andy@omniosce.org>
Reviewed by:	jhb
Fixes:		7fa233534736 ("bhyve: Map the MSI-X table unconditionally for passthrough")
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33739

----

bhyve: Correct unmapping of the MSI-X table BAR

The starting address passed to mprotect was wrong, so in the case where
the last page containing the table is not the last page of the BAR, the
wrong region would be unmapped.

Reported by:	Andy Fiddaman <andy@omniosce.org>
Reviewed by:	jhb
Fixes:		7fa233534736 ("bhyve: Map the MSI-X table unconditionally for passthrough")
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33739

----

bhyve: add nvlist functions for setting unset nodes

If an emulation uses those functions instead of set_config_value_node
or set_config_value, it allows the config values to get
overwritten. Introducing new functions is much more readable than
if else statements in the emulation code.

Reviewed by:	khng
MFC after:	2 weeks
Sponsored by:	Beckhoff Automation GmbH & Co. KG
Differential Revision:	https://reviews.freebsd.org/D33770

----

bhyve: get mediasize for character devices when resizing virtio-blk

Reviewed by:	imp, allanjude, jhb
Differential Revision:	https://reviews.freebsd.org/D33403

----

bhyve/snapshot: fix pthread_create() error check

pthread_create() returns 0 on success or an error number on failure.

Reviewed by:	khng, markj
Differential Revision:	https://reviews.freebsd.org/D33930

----

Append Keyboard Layout specified option for using VNC.
Part two: Append bhyve -K option for specified keyboard layout
with layout setting files every languages.
Since the cmd option '-k' was used in the meantime
it was changed to '-K'

PR:		246121
Submitted by:	koinec@yahoo.co.jp
Reviewed by:	grehan@
Differential Revision:	https://reviews.freebsd.org/D29473

MFC after:	4 weeks

----

bhyve: ahci: Fix regression with no ports

An AHCI controller may be specified with no connected ports.  Avoid
dumping core in this case for compatibility with existing VM configs.

Reviewed by:	khng, jhb
Fixes:		621b5090487de Refactor configuration management in bhyve.
MFC after:	1 week
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D33969

----

bhyve/block_if: allow DIOCGMEDIASIZE ioctl

This is needed to get mediasize of the device after a resize event.

I missed this earlier as I was building WITH_BHYVE_SNAPSHOT, which
disables capsicum.

Reviewed by:	khng, markj
Fixes: ae9ea22e14bf ("bhyve: get mediasize for character devices when ...")
Differential Revision:	https://reviews.freebsd.org/D34013

----

pkgbase: bhyve: Tag the kbdlayout file to be in the bhyve package

----

bhyve nvme: Advertise v1.4 support

Bump advertised NVMe support from v1.3 to v1.4

Reviewed by:	allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33564

----

bhyve nvme: Fix NVM Format completion status

The NVM Format command is unique among the Admin commands in that it
needs to finish asynchronously. For this reason, the emulation code
invented a synthetic completion status (NVME_NO_STATUS) to indicate that
the command was still in progress and the command processing loop should
not generate a completion message. The implementation used the value
0xffff for the synthetic value as this set both the Status Code and
Status Code Type fields to reserved values.

Format initialized the completion status to this value and expected
error cases to override it with a status code/type appropriate to the
situation. The macros used to set the NVMe status are careful not to
modify bit 0 (i.e. the phase bit), which with the synthetic completion
status, causes the phase bit to get out of sync. When running tests in a
guest with illegal NVM Format commands, Admin commands would eventually
hang because it appeared there were no completions due to the incorrect
phase bit value.

Fix is to only set NVME_NO_STATUS if the blockif delete command
succeeds. While in the neighborhood, add a missing break statement when
NVM Format is not supported.

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33565

----

bhyve nvme: Fix Namespace Specific Set Features

Return an error if the feature specified in Set Features is Namespace
specific but the Namespace ID uses the Global Namespace tag.

Fixes UNH Test 1.2.7

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33566

----

bhyve nvme: Implement Log Page Offset

Modify the Get Log Page command to parse the Log Page Offset fields to
support more recent versions of the NVMe specification.

Fixes various tests for UNH Test 1.3.*

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33568

----

bhyve nvme: Add missing Admin opcodes

Don't treat unsupported Admin commands as Invalid Opcode. Instead return
the proper Invalid Field in Command.

Fixes UNH IOL test 1.17.2

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33569

----

bhyve nvme: Remove redundant AER Limit checks

The NVMe emulation checked if the Asynchronous Event Request Limit
(a.k.a AERL) would be exceeded in pci_nvme_aer_add(), but this function
is only called from nvme_opc_async_event_req() which also checks for
exceeding the AERL.

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33570

----

bhyve nvme: Fix Set Features

Be more conservative and only support the Features mandatory for an I/O
Controller.

Avoids a "hang" in UNH test 1.2.10 associated with Predictable Latency
Mode Configuration and Host Behavior Support features.

Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33571

----

bhyve nvme: Add Temperature Threshold support

This adds the ability for a guest OS to send Set / Get Feature,
Temperature Threshold commands. The implementation assumes a constant
temperature and will generate an Asynchronous Event Notification if the
specified threshold is above/below this value. Although the
specification allows 9 temperature values, this implementation only
implements the Composite Temperature.

While in the neighborhood, move the clear of the CSTS register in the
reset function after all other cleanup. This avoids a race with the
guest thinking the reset is complete (i.e. CSTS.RDY = 0) before the NVMe
emulation is actually complete with the reset.

Fixes UNH IOL 16.0 Test 1.7, cases 1, 2, and 4.

Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33572

----

bhyve nvme: Update v1.4 Identify Controller data

Compliant v1.4 Controllers must report a Controller Type (CNTRLTYPE).
Also, do not advertise secure erase functionality in the Format NVM
Attributes field of the Identify Controller data structure as the
Controller does not implement secure erase.

Fixes UNH ILO Test 1.1, Case 2

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33573

----

bhyve nvme: Add Select support to Get Features

Implement basic support for the SEL field of Get Features. This returns
information about Namespace Specific features.

Fixes UNH ILO 16.0 Test 1.2, Case 13

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33574

----

bhyve nvme: Fix LBA out-of-range calculation

The function which checks for a valid LBA range mistakenly named an
input value as NLB ("Number of Logical Blocks") instead of "number of
blocks". The NVMe specification defines NLB as a zero-based value (i.e.
NLB=0x0 represents 1 block, 0x1 is 2 blocks, etc.), but the passed
parameter is a 1's-based value.

Fix is to rename the variable to avoid future confusion.

While in the neighborhood, also check that the starting LBA is less than
the size of the backing storage to avoid an integer overflow.

Reviewed by:	imp, allanjude, jhb
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33575

----

bhyve nvme: Fix reported VWC value

v1.4 and later NVMe Controllers report "Flush all Namespaces" support
differently.

Fixes UNH IOL 16.0 Test 2.6, Case 3

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33576

----

bhyve nvme: Fix Set Features, AEN

NVMe Controllers which do not support Endurance Groups must return an
error when the Endurance Group Event Aggregate Log Change Notices bit is
set in Set Features, Asynchronous Event Configuration.

Fixes UNH IOL Test 3.12, Case 8

Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33577

----

bhyve nvme: Fix Identify Namespace, NSID=ffffffff

If the NVMe Controller doesn't support Namespace Management, it should
return "Invalid Namespace or Format" when the Host request Identify
Namespace with the global NSID value.

Fixes UNH IOL 16.0 Test 9.1, Case 6

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33578

----

bhyve/virtio: use correct device id for virtio-scsi

Section 4.1.2.1 of the virtio spec states that the transitional PCI
device id for a scsi device is 0x1004.

Fix suggested by reporter.

PR:             259961
Reported by:    me@nanaya.pro
Reviewed by:	imp, jhb
Fixes:  f9c005a17f4e ("Add bhyve virtio-scsi storage backend support.")
Differential Revision:	https://reviews.freebsd.org/D34103

----

Create VM_MEMATTR_DEVICE on all architectures

This is intended to be used with memory mapped IO, e.g. from
bus_space_map with no flags, or pmap_mapdev.

Use this new memory type in the map request configured by
resource_init_map_request, and in pciconf.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D29692

----

Use if ... else when printing memory attributes

In vmstat there is a switch statement that converts these attributes to
a string. As some values can be duplicate we have to hide these from
userspace.

Replace this switch statement with an if ... else macro that lets us
repeat values without a compiler error.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	ABT Systems Ltd
Differential Revision:	https://reviews.freebsd.org/D29703

----

Remove an always-true check.

This fixes a -Wtype-limits error from GCC 9.

Reviewed by:	imp
Differential Revision:	https://reviews.freebsd.org/D31936

----

vlapic: Schedule callouts on the local CPU

The virtual LAPIC driver uses callouts to implement the LAPIC timer.
Callouts are armed using callout_reset_sbt(), which currently puts
everything on CPU 0.  On systems running many bhyve VMs this results in
a large amount of contention for CPU 0's callout lock.

Modify vlapic to schedule callouts on the local CPU instead.  This
allows timer interrupts to be scheduled more evenly among CPUs where
bhyve is running.

Reviewed by:	grehan, jhb
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32559

----

vmm: vlapic resume can eat 100% CPU by vlapic_callout_handler

Suspend/Resume of Win10 leads that CPU0 is busy on handling interrupts.

Win10 does not use LAPIC timer to often and in most cases, and I see it
is disabled by writing 0 to Initial Count Register (for Timer).

During resume, restart timer only for enabled LAPIC and enabled timer
for that LAPIC.

Reviewed by:	markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D33448

----

bhyve: add support for MTRR

Some guests or driver might depend on MTRR to work properly. E.g. the
nvidia gpu driver won't work without MTRR.

Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	Beckhoff Automation GmbH & Co. KG
Differential Revision:	https://reviews.freebsd.org/D33333
Yamagi added a commit to Yamagi/freebsd that referenced this pull request Feb 6, 2022
Includes all commits up to 2022/02/06 or d21e71efce39.

----

libvmm: clean up vmmapi.h

struct checkpoint_op, enum checkpoint_opcodes, and
MAX_SNAPSHOT_VMNAME are not vmm specific, move them out of the vmmapi
header.

They are used for the save/restore functionality that bhyve(8)
provides and are better suited in usr.sbin/bhyve/snapshot.h

Since bhyvectl(8) requires these, the Makefile for bhyvectl has been
modified to include usr.sbin/bhyve/snapshot.h

Reviewed by:    kevans, grehan
Differential Revision:  https://reviews.freebsd.org/D28410

----

bhyve/snapshot: drop mkdir when creating the unix domain socket

Add /var/run/bhyve/ to BSD.var.dist so we don't have to call mkdir when
creating the unix domain socket for a given bhyve vm.

The path to the unix domain socket for a bhyve vm will now be
/var/run/bhyve/vmname instead of /var/run/bhyve/checkpoint/vmname

Move BHYVE_RUN_DIR from snapshot.c to snapshot.h so it can be shared
to bhyvectl(8).

Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D28783

----

bhyve/snapshot: rename checkpoint_opcodes to be more generic

Generalize the naming here since the domain socket that uses these codes
might be used for purposes other than the save/restore feature.

- rename checkpoint_opcodes to ipc_opcode

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28877

----

bhyvectl: reduce code duplication

Combine send_start_checkpoint() and send_start_suspend() into a
single function named snapshot_request().

snapshot_request() is equivalent to send_start_checkpoint() and
send_start_suspend() except that it takes an additional argument. The
additional argument, enum ipc_opcode, is used to determine the type of
snapshot request being performed. Also, switch to using strlcpy instead
of strncpy.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28878

----

bhyve/snapshot: rename and bump size of MAX_SNAPSHOT_VMNAME

MAX_SNAPSHOT_VMNAME is a macro used to set the size of a character
buffer that stores a filename or the path to a file - this file is used
by the save/restore feature.

Since the file doesn't have anything to do with a vm name, rename
MAX_SNAPSHOT_VMNAME to MAX_SNAPSHOT_FILENAME. Bump the size to PATH_MAX
while here.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28879

----

bhyvectl: print a better error message when vm_open() fails

Use errno to print a more descriptive error message when vm_open() fails

libvmm: preserve errno when vm_device_open() fails

vm_destroy() squashes errno by making a dive into sysctlbyname() - we
can safely skip vm_destroy() here since it's not doing any critical
clean up at this point. Replace vm_destroy() with a free() call.

PR:             250671
MFC after:      3 days
Submitted by:   marko@apache.org
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D29109

----

bhyve/snapshot: use SOCK_DGRAM instead of SOCK_STREAM

The save/restore feature uses a unix domain socket to send messages
from bhyvectl(8) to a bhyve(8) process. A datagram socket will suffice
for this.

An added benefit of using a datagram socket is simplified code. For
bhyve, the listen/accept calls are dropped; and for bhyvectl, the
connect() call is dropped.

EPRINTLN handles raw mode for bhyve(8), use it to print error messages.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28983

----

bhyve: virtio shares definitions between sys/dev/virtio

Definitions inside usr.sbin/bhyve/virtio.h are thrown away.
Definitions in sys/dev/virtio are used instead.

This reduces code duplication.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29084

----

Refactor configuration management in bhyve.

Replace the existing ad-hoc configuration via various global variables
with a small database of key-value pairs.  The database supports
heirarchical keys using a MIB-like syntax to name the path to a given
key.  Values are always stored as strings.  The API used to manage
configuation values does include wrappers to handling boolean values.
Other values use non-string types require parsing by consumers.

The configuration values are stored in a tree using nvlists.  Leaf
nodes hold string values.  Configuration values are permitted to
reference other configuration values using '%(name)'.  This permits
constructing template configurations.

All existing command line arguments now set configuration values.  For
devices, the "-s" option parses its option argument to generate a list
of key-value pairs for the given device.

A new '-o' command line option permits setting an individual
configuration variable.  The key name is always given as a full path
of dot-separated components.

A new '-k' command line option parses a simple configuration file.
This configuration file holds a flat list of 'key=value' lines where
the 'key' is the full path of a configuration variable.  Lines
starting with a '#' are comments.

In general, bhyve starts by parsing command line options in sequence
and applying those settings to configuration values.  Once this is
complete, bhyve then begins initializing its state based on the
configuration values.  This means that subsequent configuration
options or files may override or supplement previously given settings.

A special 'config.dump' configuration value can be set to true to help
debug configuration issues.  When this value is set, bhyve will print
out the configuration variables as a flat list of 'key=value' lines.

Most command line argments map to a single configuration variable,
e.g.  '-w' sets the 'x86.strictmsr' value to false.  A few command
line arguments have less obvious effects:

- Multiple '-p' options append their values (as a comma-seperated
  list) to "vcpu.N.cpuset" values (where N is a decimal vcpu number).

- For '-s' options, a pci.<bus>.<slot>.<function> node is created.
  The first argument to '-s' (the device type) is used as the value of
  a "device" variable.  Additional comma-separated arguments are then
  parsed into 'key=value' pairs and used to set additional variables
  under the device node.  A PCI device emulation driver can provide
  its own hook to override the parsing of the additonal '-s' arguments
  after the device type.

  After the configuration phase as completed, the init_pci hook
  then walks the "pci.<bus>.<slot>.<func>" nodes.  It uses the
  "device" value to find the device model to use.  The device
  model's init routine is passed a reference to its nvlist node
  in the configuration tree which it can query for specific
  variables.

  The result is that a lot of the string parsing is removed from
  the device models and centralized.  In addition, adding a new
  variable just requires teaching the model to look for the new
  variable.

- For '-l' options, a similar model is used where the string is
  parsed into values that are later read during initialization.
  One key note here is that the serial ports use the commonly
  used lowercase names from existing documentation and examples
  (e.g. "lpc.com1") instead of the uppercase names previously
  used internally in bhyve.

Reviewed by:	grehan
MFC after:	3 months
Differential Revision:	https://reviews.freebsd.org/D26035

----

bhyve: support relocating fbuf and passthru data BARs

We want to allow the UEFI firmware to enumerate and assign
addresses to PCI devices so we can boot from NVMe[1]. Address
assignment of PCI BARs is properly handled by the PCI emulation
code in general, but a few specific cases need additional support.
fbuf and passthru map additional objects into the guest physical
address space and so need to handle address updates. Here we add a
callback to emulated PCI devices to inform them of a BAR
configuration change. fbuf and passthru then watch for these BAR
changes and relocate the frame buffer memory segment and passthru
device mmio area respectively.

We also add new VM_MUNMAP_MEMSEG and VM_UNMAP_PPTDEV_MMIO ioctls
to vmm(4) to facilitate the unmapping needed for addres updates.

[1]: https://github.com/freebsd/uefi-edk2/pull/9/

Originally by:	scottph
MFC After:	1 week
Sponsored by:	Intel Corporation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D24066

----

bhyve amd: Small cleanups in amdvi_dump_cmds

Bump offset with MOD_INC instead in amdvi_dump_cmds.

Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D28862

----

bhyve hostbridge: Rename "device" property to "devid".

"device" is already used as the generic PCI-level name of the device
model to use (e.g. "hostbridge").  The result was that parsing
"hostbridge" as an integer failed and the host bridge used a device ID
of 0.  The EFI ROM asserts that the device ID of the hostbridge is not
0, so booting with the current EFI ROM was failing during the ROM
boot.

Fixes:		621b5090487de9fed1b503769702a9a2a27cc7bb

----

bhyve: Enable virtio-scsi legacy config parsing.

The previous commit added the handler to parse the command line
options for virtio-scsi devices but forgot to set the correct function
pointer to point to the handler.

Reported by:	vangyzen
Reviewed by:	vangyzen
Fixes:		621b5090487de9fed1b503769702a9a2a27cc7bb
Differential Revision:	https://reviews.freebsd.org/D29438

----

bhyve: change vq_getchain to return iovecs in both directions

The old prototype requires callers to inspect flags of each descriptors
to get the starting position of host-writable iovecs.

vq_getchain() is changed to return a virtio request with the number of
host-readable iovecs and host-writable iovecs instead. Callers can avoid
boilerplate code of getting the start offset of host-writable iovecs.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Reviewed by:	afedorov
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29433

----

Fix typo in xhci nvlist node name, and also increment device counter.

This allows the xhci tablet device to be recognized and a PCI device
instantiated.

Reviewed by:	jhb
Fixes:		621b5090487d Refactor configuration management in bhyve.
MFC after:	3 months.

----

bhyve: fix regression in legacy virtio-9p config parsing

Commit 621b5090487de9fed1b503769702a9a2a27cc7bb introduced a regression
in legacy virtio-9p config parsing by not initializing *sharename to
NULL. As a result, "sharename != NULL" check in the first iteration fails
and bhyve exits with "virtio-9p: more than one share name given".

Fix by adding NULL back.

Approved by:	grehan

----

bhyve: add SMBIOS Baseboard Information

Add the System Management BIOS Baseboard (or Module) Information
a.k.a. Type 2 structure to the SMBIOS emulation.

Reviewed by:	rgrimes, bcran, grehan
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29657

----

bhyve: Move the gdb_active check to gdb_cpu_suspend().

The check needs to be in the public routine (gdb_cpu_suspend()), not
in the internal routine called from various places
(_gdb_cpu_suspend()).  All the other callers of _gdb_cpu_suspend()
already check gdb_active, and this breaks the use of snapshots when
the debug server is not enabled since gdb_cpu_suspend() tries to lock
an uninitialized mutex.

Reported by:	Darius Mihai, Elena Mihailescu
Reviewed by:	elenamihailescu22_gmail.com
Fixes:		621b5090487de9fed1b503769702a9a2a27cc7bb
Differential Revision:	https://reviews.freebsd.org/D29538

----

bhyve: implement rdmsr for MSR_IA32_FEATURE_CONTROL

Without the -w option, Windows guests crash on boot. This is caused by a rdmsr
of MSR_IA32_FEATURE_CONTROL. Windows checks this MSR to determine enabled VMX
features. This MSR isn't emulated in bhyve, so a #GP exception is injected
which causes Windows to crash.

Fix by returning a rdmsr of MSR_IA32_FEATURE_CONTROL with Lock Bit set and
VMX disabled to informWindows that VMX isn't available.

Reviewed by:	jhb, grehan (bhyve)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29665

----

bhyve.8: Make synopsis more readable

There is no need to squeeze all the possible options into one synopsis
entry. Let "-l help" and "-s help" be listed separately.

While here, keep -s and its arguments on the same line.

MFC after:	2 weeks

----

bhyve: Fix synopsis in the usage message

In particular:
- Sort short options to align with style(9)
- Add two missing flags: -G and -r
- Drop unnecessary angle brackets for consistency
- Rename the "vm" argument to vmname for consistency with the manual
  page

MFC after:	2 weeks

----

bhyve: Improve the option description in the usage message

- Sort options as suggested by style(9)
- Capitalize some words like CPU and HLT
- Add a missing description for the -G flag

MFC after:	2 weeks

----

bhyve.8: Sort the options in the OPTIONS section

No content change intended. Just moving the option descriptions around
to follow the order suggested by style(9).

MFC after:	2 weeks

----

bhyve.8: Improve the description and synopsis of -l

- Describe "-l help" separately for readability.
- List all the supported comX devices explicitly
- Use Cm instead of Ar for command modifiers (i.e., literal values a
  user can specify as an argument to the command).
- Explain where to get more information about the possible values of the
  conf argument.

MFC after:	2 weeks

----

bhyve.8: Improve the description of the -m flag

- Stylize the synopsis with proper mdoc macros
- Do some wordsmithing on the description for consistency.

MFC after:	2 weeks

----

bhyve.8: Fix the synopsis of -p

Use appropriate mdoc macros.

MFC after:	2 weeks

----

bhyve.8: Clean up description of -r

There is no need to wrap those flags in Op macros.

MFC after:	2 weeks

----

bhyve.8: Fix indention in the signals table

MFC after:	2 weeks

----

bhyve.8: Clean-up synopsis of -s

- Document "-s help" separately for readability.
- Use appropriate mdoc macros.

MFC after:	2 weeks

----

bhyve.8: Clean up the slot description of -s

Also, remove the macros of the nested list which contained slot,
emulation and conf. This decreases the indention of the -s description.
It was necessary to clean up the slot description.

MFC after:	2 weeks

----

bhyve.8: Improve emulation description of the -s flag

- Set width of the list to the longest key word for readability.
- Separate descriptions of amd_hostbridge and hostbridge emulations.
  Also, wordsmith their descriptions for consistency with other entries.
- Use Cm instead of Li for command modifiers.
- Do not stylize AMD with Li, there's no need to do it.
- Mention COM3 and COM4 in the definition of lpc.
- Fix a typo in the definition of ahci-hd ("hard drive" instead of
  "hard-drive").

MFC after:	2 weeks

----

bhyve.8: Clean up network backends section

- Reformat the format lists, use appropriate mdoc macros for
  readability.
- Add a missing Oxford comma.

MFC after:	2 weeks

----

bhyve.8: Clean up block storage device backends description

MFC after:	2 weeks

----

bhyve.8: Clean up SCSI device backends section

MFC after:	2 weeks

----

bhyve.8: Clean up 9P device backends section

MFC after:	2 weeks

----

bhyve.8: Clean up TTY, boot ROM, and pass-through descriptions

MFC after:	2 weeks

----

bhyve.8: Clean up virtio console device backends description

MFC after:	2 weeks

----

bhyve.8: Improve framebuffer backends description

- Use appropriate mdoc macros
- Document that tcp= is a synonym to rfb= (tcp is used in the examples,
  but never mentioned)
- Clarify the IP address specification

MFC after:	2 weeks

----

bhyve.8: Improve documentation of NVME backend

- Document the configuration format.
- Document two additional configuration options: eui64 and dsm.

MFC after:	2 weeks

----

bhyve.8: Improve AHCI backends documentation

- Document the backend format.

MFC after:	2 weeks

----

bhyve: Document the format for HD audio backends

- This change is done for consistency with other backend definitions.

MFC after:	2 weeks

----

bhyve.8: Fix mandoc -Tlint issues

While here, keep network backends section consistent with other
sections.

MFC after:	2 weeks

----

bhyve: Be explicit that setting config.dump will not start a VM.

Suggested by:	rpokala
Reviewed by:	bcr (manpages)
Differential Revision:	https://reviews.freebsd.org/D29738

----

bhyve: Gracefully handle virtio-scsi with no conf

Fixes segfault with the command `bhyve -s 0,virtio-scsi`, which is used
by some third party software to probe bhyve for virtio-scsi support.

Reviewed by:	jhb
MFC after:	1 day
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D29926

----

bhyve: Set SO_REUSEADDR on the gdb stub socket

Reviewed by:	jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30037

----

bhyve/snapshot: provide a way to send other messages/data to bhyve

This is a step towards sending messages (other than suspend/checkpoint)
from bhyvectl to bhyve.

Introduce a new struct, ipc_message - this struct stores the type of
message and a union containing message specific structures for the type
of message being sent.

Reviewed by:    grehan
Differential Revision: https://reviews.freebsd.org/D30221

----

bhyve/snapshot: split up mutex/cond initialization from socket creation

Move initialization of the mutex/condition variables required by the
save/restore feature to their own function.

The unix domain socket that facilitates communication between bhyvectl
and bhyve doesn't rely on these variables in order to be functional.

Reviewed by:    markj
Differential Revision:  https://reviews.freebsd.org/D30281

----

Add a virtio-input device emulation.

This will be used to inject keyboard/mouse input events into a guest.
The command line syntax is:
   -s <slot>,virtio-input,/dev/input/eventX

Reviewed by:	jhb (bhyve), grehan
Obtained from:	Corvin Köhne <C.Koehne@beckhoff.com>
MFC after:	3 weeks
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D30020

----

bhyve: Register new kevents synchronously.

Change mevent_add*() to synchronously add the new kevent.  This
permits reporting event registration failures to the caller and avoids
failing the registration of other, unrelated events queued up in the
same batch.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30502

----

bhyve: Add support for EVFILT_VNODE mevents.

This allows registering an event to watch for changes to a file's
attributes.  This is a bit imperfect as it would be nice to have a way
to determine if an fd can use EVFILT_VNODE successfully.  mevent's
current structure does not permit that and a failure to register a
single kevent impacts several other kevents.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30503

----

bhyve: Add support for handling disk resize events to block_if.

Allow clients of blockif to register a resize callback handler.  When
a callback is registered, register an EVFILT_VNODE kevent watching the
backing store for a change in the file's attributes.  If the size has
changed when the kevent fires, invoke the clients' callback.

Currently resize detection is limited to backing stores that support
EVFILT_VNODE kevents such as regular files.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30504

----

bhyve: Split out a lower-level helper for VirtIO interrupts.

This allows device models to assert VirtIO interrupts for reasons
other than publishing changes to a VirtIO ring such as configuration
changes.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30505

----

bhyve vtblk: Inform guests of disk resize events.

Register a resize callback with the blockif interface.  When the
callback fires, update the size of the disk and notify the guest via a
configuration change interrupt.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30506

----

bhyve: enhance debug info for memory range clash

Explain what the two clashing regions are.

Reivewed by:		grehan, jhb
Differential Revision:	https://reviews.freebsd.org/D29696
Pull Request:		https://github.com/freebsd/freebsd-src/pull/463

----

Add more GIC and GICv3 registers

These aren't used by either driver, however they will be needed by
bhyve on arm64 to emulate a GICv3 interrupt controller.

Sponsored by:	Innovate UK

----

bhyve: Fix cli regression with NVMe ram

The configuration management refactoring inadvertently removed support
for a RAM-backed NVMe Namespace (i.e. -s X,nvme,ram=16384). This adds it
back.

Reported by:	andy@omniosce.org
Reviewed by:	jhb, andy@omniosce.org
Fixes:		621b5090487d
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D30717

----

bhyve: fix NVMe MDTS comment

Removes an obsolete comment and adds parenthesis around the macro while
in the area. No functional change.

----

bhyve: Fix NVMe iovec construction for large IOs

The UEFI driver included with Rocky Linux 8.4 uncovered an existing bug
in the NVMe emulation's construction of iovec's.

By default, NVMe data transfer operations use a scatter-gather list in
which all entries point to a fixed size memory region. For example, if
the Memory Page Size is 4KiB, a 2MiB IO requires 512 entries. Lists
themselves are also fixed size (default is 512 entries).

Because the list size is fixed, the last entry is special. If the IO
requires more than 512 entries, the last entry in the list contains the
address of the next list of entries. But if the IO requires exactly 512
entries, the last entry points to data.

The NVMe emulation missed this logic and unconditionally treated the
last entry as a pointer to the next list. Fix is to check if the
remaining data is greater than the page size before using the last entry
as a pointer to the next list.

PR:		256422
Reported by:	dave@syix.com
Tested by:	jason@tubnor.net
MFC after:	5 days
Relnotes:	yes
Reviewed by:	imp, grehan
Differential Revision:	https://reviews.freebsd.org/D30897

----

Append Keyboard Layout specified option for using VNC.
Part one: supporting QEMU Extended Keyboard Event Message

PR:             246121
Submitted by:   koinec@yahoo.co.jp
Differential Revision: https://reviews.freebsd.org/D29430

----

libvmm: explicitly save and restore errno in vm_open()

In commit 6bb140e3ca895a14, vm_destroy() was replaced with free() to
preserve errno. However, it's possible that free() may change the errno
as well. Keep the free() call, but explicitly save and restore errno.

Noted by: jhb
Fixes: 6bb140e3ca895a14

----

vmm: Let guests enable SMEP/SMAP if the host supports it

Reviewed by:	kib, grehan, jhb
Tested by:	grehan (AMD)
MFC after:	3 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30462

----

vmm: Fix ivrs_drv device_printf usage

The original %b description string is wrong.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	imp, jhb
Differential Revision:	https://reviews.freebsd.org/D30805

----

bhyve/vioapic: remove an extra pin masked check

vioapic_send_intr does already check whether the pin is masked before
injecting the interrupt, there's no need to do it in vioapic_write
also.

No functional change intended.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28236

----

bhyve/ioapic: only account for asserted line in level mode

After modifying a redirection entry only try to inject an interrupt if
the pin is in level mode, pins in edge mode shouldn't take into
account the line assert status as they are triggered by edge changes,
not the line status itself.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28237

----

bhyve/ioapic: improve the tracking of IRR bit

One common method of EOI'ing an interrupt at the IO-APIC level is to
switch the pin to edge triggering mode and then back into level mode.
That would cause the IRR bit to be cleared and thus further interrupts
to be injected. FreeBSD does indeed use that method if the IO-APIC EOI
register is not supported.

The bhyve IO-APIC emulation code didn't clear the IRR bit when doing
that switch, and was also missing acknowledging the IRR state when
trying to inject an interrupt in vioapic_send_intr.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28238

----

ivrs_drv: Fix IVHDs with duplicated BaseAddress

Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28945

----

AMD-vi: Fix IOMMU device interrupts being overridden

Currently, AMD-vi PCI-e passthrough will lead to the following lines in
dmesg:
"kernel: CPU0: local APIC error 0x40
ivhd0: Error: completion failed tail:0x720, head:0x0."

After some tracing, the problem is due to the interaction with
amdvi_alloc_intr_resources() and pci_driver_added(). In ivrs_drv, the
identification of AMD-vi IVHD is done by walking over the ACPI IVRS
table and ivhdX device_ts are added under the acpi bus, while there are
no driver handling the corresponding IOMMU PCI function. In
amdvi_alloc_intr_resources(), the MSI intr are allocated with the ivhdX
device_t instead of the IOMMU PCI function device_t. bus_setup_intr() is
called on ivhdX. the IOMMU pci function device_t is only used for
pci_enable_msi(). Since bus_setup_intr() is not called on IOMMU pci
function, the IOMMU PCI function device_t's dinfo->cfg.msi is never
updated to reflect the supposed msi_data and msi_addr. So the msi_data
and msi_addr stay in the value 0. When pci_driver_added() tried to loop
over the children of a pci bus, and do pci_cfg_restore() on each of
them, msi_addr and msi_data with value 0 will be written to the MSI
capability of the IOMMU pci function, thus explaining the errors in
dmesg.

This change includes an amdiommu driver which currently does attaching,
detaching and providing DEVMETHODs for setting up and tearing down
interrupt. The purpose of the driver is to prevent pci_driver_added()
from calling pci_cfg_restore() on the IOMMU PCI function device_t.
The introduction of the amdiommu driver handles allocation of an IRQ
resource within the IOMMU PCI function, so that the dinfo->cfg.msi is
populated.

This has been tested on EPYC Rome 7282 with Radeon 5700XT GPU.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28984

----

Correct "Fondation" typo (missing "u")

----

AMD-vi: Mixed format IVHD block should replace fixed format IVHD block

This fixes double IVHD_SETUP_INTR calls on the same IOMMU device.

Sponsored by:	The FreeBSD Foundation
MFC with:	74ada297e897
Reported by:	Oleg Ginzburg <olevole@olevole.ru>
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29521

----

vmm: Fix AMD-vi using wrong rid range

The ACPI parsing code around rid range was wrong on assuming there is
only one pair of start/end device id range. Besides, ivhd_dev_parse()
never work as supposed. The start/end rid info was always zero.

Restructure the code to build dynamic-sized tables for each IOMMU softc
holding device entries. The device entries are enumerated to find a
suitable IOMMU unit. Operations on devices not governed (e.g. the IOMMU
unit itself) are no-op from now on. There are also a minor fix on wrong
%b formatting string usage.

Tested on my EPYC 7282.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D30827

----

vmm: Bump vmname buffer in struct vm to VM_MAX_NAMELEN + 1

In hw.vmm.create sysctl handler the maximum length of vm name is
VM_MAX_NAMELEN. However in vm_create() the maximum length allowed is
only VM_MAX_NAMELEN - 1 chars. Bump the length of the internal buffer to
allow the length of VM_MAX_NAMELEN for vm name.

MFC after:	3 days
Reviewed by:	grehan
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31372

----

amd64: Fix output operand specs for the stmxcsr and vmread intrinsics

This does not appear to affect code generation, at least with the
default toolchain.

Noticed because incorrect output specifications lead to false positives
from KMSAN, as the instrumentation uses them to update shadow state for
output operands.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31466

----

vmm: Make iommu ops tables const

While here, use designated initializers and rename some AMD iommu method
implementations to match the corresponding op names.  No functional
change intended.

Reviewed by:	grehan
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31462

----

vmm: Fix wrong assert in ivhd_dev_add_entry

The correct condition is to check the number of ivhd entries fit into
the array.

Reported by:	bz
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D31514

----

vmm: Add credential to cdev object

Add a credential to the cdev object in sysctl_vmm_create(), then check
that we have the correct credentials in sysctl_vmm_destroy(). This
prevents a process in one jail from opening or destroying the /dev/vmm
file corresponding to a VM in a sibling jail.

Add regression tests.

Reviewed by:	jhb, markj
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31156

----

bhyve: net_backends, automatically IFF_UP tap devices

If you want communications with the outside world and tell bhyve to
create an interfaces then it should be usable as well.
Rather than relying on the sysctl net.link.tap.up_on_open automatically
try to IFF_UP the opened tap device.

MFC after:	10 days
Reviewed by:	markj, grehan
Differential Revision: https://reviews.freebsd.org/D31342

----

bhyve: Use fspacectl(2) for BOP_DELETE on regular file images

bhyve can also make use of fspacectl(2) to implement BOP_DELETE with
hole-punching. Since it is not desirable to do zero-filling for large
DEALLOCATE/UNMAP range, candelete is not set if pathconf(2) indicates
that the underlying file system does not support native
VOP_DEALLOCATE(9).

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D28880

----

bhyve: Use pci(4) to access I/O port BARs

This removes the dependency on /dev/io.

PR:		251046
Reviewed by:	jhb
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31308

----

byhve: add option to specify IP address for gdb

Allow user to specify the IP address available for gdb debugger.

Reviewed by:	jhb, grehan, rgrimes, bcr (man pages)
Differential Revision:	https://reviews.freebsd.org/D29607

----

bhyve: change a default address from ANY to localhost

Discussed with:     grehan, jhb

----

bhyve: Fix vq_getchain() error handling bugs in various device models

Reviewed by:	grehan, khng
Approved by:	so
Security:	CVE-2021-29631
Security:	FreeBSD-SA-21:13.bhyve

----

pci: Add an ioctl to perform I/O to BARs

This is useful for bhyve, which otherwise has to use /dev/io to handle
accesses to I/O port BARs when PCI passthrough is in use.

Reviewed by:	imp, kib
Discussed with:	jhb
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31307

----

bhyve: Nuke double-semicolons

A distinct number of double-semicolons ended up in bhyve. Take a pass at
getting rid of many of these harmless typos.

MFC after:	3 days

----

bhyve: Fix pci device node key in bhyve_config.5

PCI device node key in the manual page is wrong. It should be
pci.bus.slot.function.

MFC after:	3 days

----

bhyve: Support setting the disk serial number for VirtIO block devices.

Reviewed by:	allanjude
Obtained from:	illumos
Differential Revision:	https://reviews.freebsd.org/D31983

----

bhyve: Update the -G description in the SYNPOSIS.

It was missing both the 'w' flag and 'bind_address'.

----

bhyve_config.5: Document gdb.address.

----

bhyve: Add an empty case for event types in mevent_kq_fflags().

This fixes a -Wswitch error raised by GCC 9.

Differential Revision:	https://reviews.freebsd.org/D31938

----

bhyve: Map the MSI-X table unconditionally for passthrough

It is possible for the PBA to reside in the same page as the MSI-X
table.  And, while devices are not supposed to do this, at least some
Intel wifi devices place registers in a page shared with the MSI-X
table.  To handle the first case we currently map the PBA page using
/dev/mem, and the second case is not handled.

Kill two birds with one stone: map the MSI-X table BAR using the
PCIOCBARMMAP ioctl instead of /dev/mem, and map the entire table so that
accesses beyond the bounds of the table can be emulated.  Regions of the
BAR not containing the table are left unmapped.

Reviewed by:	bz, grehan, jhb
MFC after:	3 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32359

----

bhyve.8: Fix markup of the -G flag

----

bhyve: Update usage and synopsis for the -k flag

Let's make it clear to users that -k is for configuration files.
Also, point to bhyve_config(5) in the paragraph describing the flag.

Reviewed by:	jhb
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D32467

----

bhyve: ignore low bits of CFGADR

Bhyve could emulate wrong PCI registers.
In the best case, the guest reads wrong registers and the device driver would
report some errors.
In the worst case, the guest writes to wrong PCI registers and could brick
hardware when using PCI passthrough.

According to Intels specification, low bits of CFGADR should be
ignored. Some OS like linux may rely on it. Otherwise, bhyve could
emulate a wrong PCI register.

E.g.
If linux would like to read 2 bytes from offset 0x02, following would
happen.
linux:
	outl 0x80000002 at CFGADR
	inw  at CFGDAT + 2
bhyve:
	cfgoff = 0x80000002 & 0xFF = 0x02
	coff   = cfgoff + (port - CFGDAT) = 0x02 + 0x02 = 0x04
Bhyve would emulate the register at offset 0x04 not 0x02.

Reviewed By: #bhyve, grehan
Differential Revision: https://reviews.freebsd.org/D31819
Sponsored by:	       Beckhoff Automation GmbH & Co. KG

----

bhyve: Fix the WITH_BHYVE_SNAPSHOT build

Note, this breaks compatibility with snapshots generated by older builds
of bhyve(8).

Fixes: 7fa233534736 ("bhyve: Map the MSI-X table unconditionally for passthrough")
Reported by:	Greg V <greg@unrelenting.technology>
Reviewed by:	grehan, bz
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32523

----

bhyve: Bump the SMBIOS firmware version to 14.0 for 14-CURRENT

Bump the firmware version to 14.0 and set the firmware release date
to today.

Reviewed by: jhb, bz, imp
Differential Revision: https://reviews.freebsd.org/D32534

----

bhyve: use physical lobits for BARs of passthru devices

Tell the guest whether a BAR uses prefetched memory or not for
passthru devices by using the same lobits as the physical device.

Reviewed by:	 grehan
Sponsored by:	 Beckhoff Autmation GmbH & Co. KG
Differential Revision:	  https://reviews.freebsd.org/D32685

----

bhyve: do not explicitly map fbuf framebuffer

Allocating a BAR will call baraddr which maps the framebuffer. No need
to allocate it explicitly on init.

Reviewed by:     grehan
Sponsored by:    Beckhoff Autmation GmbH & Co. KG
Differential Revision:    https://reviews.freebsd.org/D32596

----

bhyve: move 64 bit BAR location to match OVMF assumptions

OVMF will fail, if large 64 bit BARs are used. GCD-Map doesn't cover
64 bit addresses of BARs.
OVMF assumes that 64 bit addresses of BARS are located on next 32 GB
boundary behind Top of High RAM.

This patch moves 64 bit BARs on next 32 GB boundary behind Top of High
RAM to match OVMF assumptions.

Differential Revision:	https://reviews.freebsd.org/D27970
Sponsored by: Beckhoff Automation GmbH & Co. KG

----

bhyve: use a fixed 32 bit BAR base address

OVMF always uses 0xC0000000 as base address for 32 bit PCI MMIO space.
For that reason, we should use that address too.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D31051
Sponsored by:	Beckhoff Automation GmbH & Co. KG

----

bhyve: keep physical and virtual COMMAND reg in sync

On startup all virtual BARs are registered.
Additionally, the encoding bit in the virtual cmd register is set.
After that, the passthru emulation overwrites the virtual cmd register with
the physical one.
This could lead to a mismatch between registered BARs and the encoding
bits in the cmd register.
Instead of writing the physical to the virtual cmd register,
write the virtual to the physical cmd register to solve this issue.

Reviewed by:	  markj
Differential Revision:	https://reviews.freebsd.org/D32687
Sponsored by:	Beckhoff Automation GmbH & Co. KG

----

bhyve: emulate reads of MSI-X capabilities for passthru devices

Reads of the MSI-X capabilites aren't emulated by passthru devices
yet. The guest will read the host MSI-X capabilites which could
cause issues.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D32686
Sponsored by:	Beckhoff Automation GmbH & Co. KG

----

bhyve: Fix compile

We need err.h

Fixes:	5cf21e48ccf11 ("bhyve: use a fixed 32 bit BAR base address")
Sponsored by:	Bechoff Automation GmbH & Co. KG

----

bhyve blockif: fix blockif_candelete with Capsicum

NVMe conformance tests for the Format command failed if the
backing-storage for the bhyve device was a file instead of a Zvol. The
tests (and the specification) expect a Format to destroy all previously
written data. The bhyve NVMe emulation implements this by trimming /
deallocating all data from the backing-storage.

The blockif_candelete() function indicated the file did not support
deallocation (i.e. fpathconf(..., _PC_DEALLOC_PRESENT) returned FALSE)
even though the kernel supported file hole punching. This occurs on
builds with Capsicum enabled because blockif did not allow the
fpathconf(2) right.

Fix is to add CAP_FPATHCONF to the cap_rights_init(3) call.

PR:		260081
Reviewed by:	allanjude, markj, jhb
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D33203

----

bhyve: fix -Wunused-but-set-variable warning

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D33306

----

bhyve: Support a _VARS.fd file for bootrom

OVMF creates two separate .fd files, a _CODE.fd file containing
the UEFI code, and a _VARS.fd file containing a template of an
empty UEFI variable store.

OVMF decides to write variables to the memory range just below the
boot rom code if it detects a CFI flash device. So here we add
just the barest facsimile of CFI command handling to bootrom.c
that is needed to placate OVMF.

Submitted by: D Scott Phillips <d.scott.phillips@intel.com>
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19976
MFC After: 1 week

----

bhyve: set EV_CLEAR for EVFILT_VNODE mevents

When an EVFILT_VNODE filter event is triggered, reset it.

This fixes the issue where a virtio-blk resize event would cause the
mevent thread to consume 100% of the cpu.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D33326

----

bhyve nvme: Add AEN support to NVMe emulation

Add Asynchronous Event Notification infrastructure to the NVMe
emulation.

Reviewed by:	imp, grehan
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D32952

----

bhyve nvme: Inform guests of namespace resize

Register a "block resize" callback to be notified of changes to the
backing storage for the Namespace. Use this to generate an Asynchronous
Event Notification, Namespace Attributes Changed when the guest OS
provides an Asynchronous Event Request.

MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D32953

----

bhyve: Only snapshot initialized VirtIO queues

If the virtio device is not fully initialized, then suspend fails with:

  vi_pci_snapshot_queues: invalid address: vq->vq_desc
  Failed to snapshot virtio-rnd; ret=14

MFC after:	1 week
Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D26268

----

bhyve: passthru: enable BARs before possibly mmap(2)ing them

The first time we start bhyve with a passthru device everything is fine
as on boot we do enable BARs.  If a driver (unload) inside bhyve disables
the BAR(s) as some Linux drivers do, we need to make sure we re-enable
them on next bhyve start.

If we are trying to mmap a disabled BAR for MSI-X (PCIOCBARMMAP)
the kernel will give us an EBUSY.
While we were re-enabling the BAR(s) in the current code loop
cfginit() was writing the changes out too late to the real hardware.

Move the call to init_msix_table() after the register on the real
hardware was updated.  That way the kernel will be happy and the
mmap will succeed and bhyve will start.
Also simplify the code given the last argument to init_msix_table()
is unused we do not need to do checks for each bar. [1]

MFC after:	3 days
PR:		260148
Pointed out by:	markj [1]
Sponsored by:	The FreeBSD Foundation
Reviewed by:	markj
Differential Revision: https://reviews.freebsd.org/D33628

----

bhyve: clean up trailing whitespaces

Clean up trailing whitespaces. No functional changes.

Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D33681

----

bhyve smbios type 3 structure is incorrect

If you look at the SMBIOS specification, we'll find something is
missing. In particular at offset 0Dh is supposed to be the OEM-defined
field. This should go between security and height. It is not legal to
actually skip this and will lead to other folks not properly
interpreting later parts of the table.

https://www.illumos.org/issues/14312

Reviewed by:	jhb
Submitted by:	Robert Mustacchi <rm@fingolfin.org>
Obtained from:	ilumos
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D33682

----

bhyve: only init MSI-X table if passthru device supports it

Some passthru devices only support MSI instead of MSI-X. For those
devices the initialization of MSI-X table will fail. Re-add the
check erroneously removed in f1442847c9404d4bc5f5524a0c3362dd39cb14f9.

MFC after:	3 days
X-MFC with:	f1442847c9404d4bc5f5524a0c3362dd39cb14f9
PR:		260148
Reviewed by:	manu, bz
Differential Revision:	https://reviews.freebsd.org/D33728

----

bhyve: enumerate BARs by size

E.g. Framebuffers can require large space and BARs need to be aligned
by their size. If BARs aren't allocated by size, it'll cause much
fragmentation of the MMIO space. Reduce fragmentation by ordering
the BAR allocation on their size to reduce the risk of
OUT_OF_MMIO_SPACE issues.

Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	Beckhoff Automation GmbH & Co. KG
Differential Revision:	https://reviews.freebsd.org/D28278

----

bhyve: allow reading of fwctl signature multiple times

At the moment, you only have one single chance to read the fwctl
signature. At boot bhyve is in the state IDENT_WAIT. It's then
possible to switch to IDENT_SEND. After bhyve sends the signature,
it switches to REQ. From now on it's impossible to switch back to
IDENT_SEND to read the signature. For that reason, only a single
driver can read the signature. A guest can't use two drivers to
identify that fwctl is present. It gets even worse when using
OVMF. OVMF uses a library to access fwctl. Therefore, every single
OVMF driver would try to read the signature. Currently, only a
single OVMF driver accesses the fwctl. So, there's no issue with
it yet. However, no OS driver would have a chance to detect fwctl when
using OVMF because it's signature was already consumed by OVMF.

Reviewed by:    markj
MFC after:      2 weeks
Sponsored by:   Beckhoff Automation GmbH & Co. KG
Differential Revision:  https://reviews.freebsd.org/D31981

----

bhyve: add more slop to 64 bit BARs

Bhyve allocates small 64 bit BARs below 4 GB and generates ACPI tables
based on this allocation. If the guest decides to relocate those BARs
above 4 GB, it could lead to mismatching ACPI tables. Especially
when using OVMF with enabled bus enumeration it could cause
issues. OVMF relocates all 64 bit BARs above 4 GB. The guest OS
may be unable to recover from this situation and disables some PCI
devices because their BARs are located outside of the MMIO space
reported by ACPI. Avoid this situation by giving the guest more
space for relocating BARs.

Let's be paranoid. The available space for BARs below 4 GB is 512 MB
large. Use a slop of 512 MB. It'll allow the guest to relocate all
BARs below 4 GB to an address above 4 GB. We could run into issues
when we exceeding the memlimit above 4 GB. However, this space has
a size of 32 GB. Even when using many PCI device with large BARs
like framebuffer or when using multiple PCI busses, it's very
unlikely that we run out of space due to the large slop.
Additionally, this situation will occur on startup and not at runtime
which is much better.

Reviewed by:    markj
MFC after:      2 weeks
Sponsored by:   Beckhoff Automation GmbH & Co. KG
Differential Revision:  https://reviews.freebsd.org/D33118

----

bhyve: dynamically register FwCtl ports

Qemu's FwCfg uses the same ports as Bhyve's FwCtl. Static allocated
ports wouldn't allow to switch between Qemu's FwCfg and Bhyve's
FwCtl.

Reviewed by:    markj
MFC after:      2 weeks
Sponsored by:   Beckhoff Automation GmbH & Co. KG
Differential Revision:  https://reviews.freebsd.org/D33496

----

bhyve: Map the right BAR in init_msix_table()

The PBA and MSI-X table can reside in different BARs.

Reported by:	Andy Fiddaman <andy@omniosce.org>
Reviewed by:	jhb
Fixes:		7fa233534736 ("bhyve: Map the MSI-X table unconditionally for passthrough")
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33739

----

bhyve: Correct unmapping of the MSI-X table BAR

The starting address passed to mprotect was wrong, so in the case where
the last page containing the table is not the last page of the BAR, the
wrong region would be unmapped.

Reported by:	Andy Fiddaman <andy@omniosce.org>
Reviewed by:	jhb
Fixes:		7fa233534736 ("bhyve: Map the MSI-X table unconditionally for passthrough")
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33739

----

bhyve: add nvlist functions for setting unset nodes

If an emulation uses those functions instead of set_config_value_node
or set_config_value, it allows the config values to get
overwritten. Introducing new functions is much more readable than
if else statements in the emulation code.

Reviewed by:	khng
MFC after:	2 weeks
Sponsored by:	Beckhoff Automation GmbH & Co. KG
Differential Revision:	https://reviews.freebsd.org/D33770

----

bhyve: get mediasize for character devices when resizing virtio-blk

Reviewed by:	imp, allanjude, jhb
Differential Revision:	https://reviews.freebsd.org/D33403

----

bhyve/snapshot: fix pthread_create() error check

pthread_create() returns 0 on success or an error number on failure.

Reviewed by:	khng, markj
Differential Revision:	https://reviews.freebsd.org/D33930

----

Append Keyboard Layout specified option for using VNC.
Part two: Append bhyve -K option for specified keyboard layout
with layout setting files every languages.
Since the cmd option '-k' was used in the meantime
it was changed to '-K'

PR:		246121
Submitted by:	koinec@yahoo.co.jp
Reviewed by:	grehan@
Differential Revision:	https://reviews.freebsd.org/D29473

MFC after:	4 weeks

----

bhyve: ahci: Fix regression with no ports

An AHCI controller may be specified with no connected ports.  Avoid
dumping core in this case for compatibility with existing VM configs.

Reviewed by:	khng, jhb
Fixes:		621b5090487de Refactor configuration management in bhyve.
MFC after:	1 week
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D33969

----

bhyve/block_if: allow DIOCGMEDIASIZE ioctl

This is needed to get mediasize of the device after a resize event.

I missed this earlier as I was building WITH_BHYVE_SNAPSHOT, which
disables capsicum.

Reviewed by:	khng, markj
Fixes: ae9ea22e14bf ("bhyve: get mediasize for character devices when ...")
Differential Revision:	https://reviews.freebsd.org/D34013

----

pkgbase: bhyve: Tag the kbdlayout file to be in the bhyve package

----

bhyve nvme: Advertise v1.4 support

Bump advertised NVMe support from v1.3 to v1.4

Reviewed by:	allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33564

----

bhyve nvme: Fix NVM Format completion status

The NVM Format command is unique among the Admin commands in that it
needs to finish asynchronously. For this reason, the emulation code
invented a synthetic completion status (NVME_NO_STATUS) to indicate that
the command was still in progress and the command processing loop should
not generate a completion message. The implementation used the value
0xffff for the synthetic value as this set both the Status Code and
Status Code Type fields to reserved values.

Format initialized the completion status to this value and expected
error cases to override it with a status code/type appropriate to the
situation. The macros used to set the NVMe status are careful not to
modify bit 0 (i.e. the phase bit), which with the synthetic completion
status, causes the phase bit to get out of sync. When running tests in a
guest with illegal NVM Format commands, Admin commands would eventually
hang because it appeared there were no completions due to the incorrect
phase bit value.

Fix is to only set NVME_NO_STATUS if the blockif delete command
succeeds. While in the neighborhood, add a missing break statement when
NVM Format is not supported.

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33565

----

bhyve nvme: Fix Namespace Specific Set Features

Return an error if the feature specified in Set Features is Namespace
specific but the Namespace ID uses the Global Namespace tag.

Fixes UNH Test 1.2.7

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33566

----

bhyve nvme: Implement Log Page Offset

Modify the Get Log Page command to parse the Log Page Offset fields to
support more recent versions of the NVMe specification.

Fixes various tests for UNH Test 1.3.*

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33568

----

bhyve nvme: Add missing Admin opcodes

Don't treat unsupported Admin commands as Invalid Opcode. Instead return
the proper Invalid Field in Command.

Fixes UNH IOL test 1.17.2

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33569

----

bhyve nvme: Remove redundant AER Limit checks

The NVMe emulation checked if the Asynchronous Event Request Limit
(a.k.a AERL) would be exceeded in pci_nvme_aer_add(), but this function
is only called from nvme_opc_async_event_req() which also checks for
exceeding the AERL.

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33570

----

bhyve nvme: Fix Set Features

Be more conservative and only support the Features mandatory for an I/O
Controller.

Avoids a "hang" in UNH test 1.2.10 associated with Predictable Latency
Mode Configuration and Host Behavior Support features.

Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33571

----

bhyve nvme: Add Temperature Threshold support

This adds the ability for a guest OS to send Set / Get Feature,
Temperature Threshold commands. The implementation assumes a constant
temperature and will generate an Asynchronous Event Notification if the
specified threshold is above/below this value. Although the
specification allows 9 temperature values, this implementation only
implements the Composite Temperature.

While in the neighborhood, move the clear of the CSTS register in the
reset function after all other cleanup. This avoids a race with the
guest thinking the reset is complete (i.e. CSTS.RDY = 0) before the NVMe
emulation is actually complete with the reset.

Fixes UNH IOL 16.0 Test 1.7, cases 1, 2, and 4.

Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33572

----

bhyve nvme: Update v1.4 Identify Controller data

Compliant v1.4 Controllers must report a Controller Type (CNTRLTYPE).
Also, do not advertise secure erase functionality in the Format NVM
Attributes field of the Identify Controller data structure as the
Controller does not implement secure erase.

Fixes UNH ILO Test 1.1, Case 2

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33573

----

bhyve nvme: Add Select support to Get Features

Implement basic support for the SEL field of Get Features. This returns
information about Namespace Specific features.

Fixes UNH ILO 16.0 Test 1.2, Case 13

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33574

----

bhyve nvme: Fix LBA out-of-range calculation

The function which checks for a valid LBA range mistakenly named an
input value as NLB ("Number of Logical Blocks") instead of "number of
blocks". The NVMe specification defines NLB as a zero-based value (i.e.
NLB=0x0 represents 1 block, 0x1 is 2 blocks, etc.), but the passed
parameter is a 1's-based value.

Fix is to rename the variable to avoid future confusion.

While in the neighborhood, also check that the starting LBA is less than
the size of the backing storage to avoid an integer overflow.

Reviewed by:	imp, allanjude, jhb
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33575

----

bhyve nvme: Fix reported VWC value

v1.4 and later NVMe Controllers report "Flush all Namespaces" support
differently.

Fixes UNH IOL 16.0 Test 2.6, Case 3

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33576

----

bhyve nvme: Fix Set Features, AEN

NVMe Controllers which do not support Endurance Groups must return an
error when the Endurance Group Event Aggregate Log Change Notices bit is
set in Set Features, Asynchronous Event Configuration.

Fixes UNH IOL Test 3.12, Case 8

Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33577

----

bhyve nvme: Fix Identify Namespace, NSID=ffffffff

If the NVMe Controller doesn't support Namespace Management, it should
return "Invalid Namespace or Format" when the Host request Identify
Namespace with the global NSID value.

Fixes UNH IOL 16.0 Test 9.1, Case 6

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33578

----

bhyve/virtio: use correct device id for virtio-scsi

Section 4.1.2.1 of the virtio spec states that the transitional PCI
device id for a scsi device is 0x1004.

Fix suggested by reporter.

PR:             259961
Reported by:    me@nanaya.pro
Reviewed by:	imp, jhb
Fixes:  f9c005a17f4e ("Add bhyve virtio-scsi storage backend support.")
Differential Revision:	https://reviews.freebsd.org/D34103

----

Create VM_MEMATTR_DEVICE on all architectures

This is intended to be used with memory mapped IO, e.g. from
bus_space_map with no flags, or pmap_mapdev.

Use this new memory type in the map request configured by
resource_init_map_request, and in pciconf.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D29692

----

Use if ... else when printing memory attributes

In vmstat there is a switch statement that converts these attributes to
a string. As some values can be duplicate we have to hide these from
userspace.

Replace this switch statement with an if ... else macro that lets us
repeat values without a compiler error.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	ABT Systems Ltd
Differential Revision:	https://reviews.freebsd.org/D29703

----

Remove an always-true check.

This fixes a -Wtype-limits error from GCC 9.

Reviewed by:	imp
Differential Revision:	https://reviews.freebsd.org/D31936

----

vlapic: Schedule callouts on the local CPU

The virtual LAPIC driver uses callouts to implement the LAPIC timer.
Callouts are armed using callout_reset_sbt(), which currently puts
everything on CPU 0.  On systems running many bhyve VMs this results in
a large amount of contention for CPU 0's callout lock.

Modify vlapic to schedule callouts on the local CPU instead.  This
allows timer interrupts to be scheduled more evenly among CPUs where
bhyve is running.

Reviewed by:	grehan, jhb
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32559

----

vmm: vlapic resume can eat 100% CPU by vlapic_callout_handler

Suspend/Resume of Win10 leads that CPU0 is busy on handling interrupts.

Win10 does not use LAPIC timer to often and in most cases, and I see it
is disabled by writing 0 to Initial Count Register (for Timer).

During resume, restart timer only for enabled LAPIC and enabled timer
for that LAPIC.

Reviewed by:	markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D33448

----

bhyve: add support for MTRR

Some guests or driver might depend on MTRR to work properly. E.g. the
nvidia gpu driver won't work without MTRR.

Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	Beckhoff Automation GmbH & Co. KG
Differential Revision:	https://reviews.freebsd.org/D33333
ckoehne pushed a commit to Beckhoff/freebsd-src that referenced this pull request Mar 17, 2022
We want to allow the UEFI firmware to enumerate and assign
addresses to PCI devices so we can boot from NVMe[1]. Address
assignment of PCI BARs is properly handled by the PCI emulation
code in general, but a few specific cases need additional support.
fbuf and passthru map additional objects into the guest physical
address space and so need to handle address updates. Here we add a
callback to emulated PCI devices to inform them of a BAR
configuration change. fbuf and passthru then watch for these BAR
changes and relocate the frame buffer memory segment and passthru
device mmio area respectively.

We also add new VM_MUNMAP_MEMSEG and VM_UNMAP_PPTDEV_MMIO ioctls
to vmm(4) to facilitate the unmapping needed for addres updates.

[1]: freebsd/uefi-edk2#9

Originally by:	scottph
MFC After:	1 week
Sponsored by:	Intel Corporation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D24066
Yamagi added a commit to Yamagi/freebsd that referenced this pull request Mar 25, 2022
Includes all commits up to 2022/02/06 or d21e71efce39.

----

libvmm: clean up vmmapi.h

struct checkpoint_op, enum checkpoint_opcodes, and
MAX_SNAPSHOT_VMNAME are not vmm specific, move them out of the vmmapi
header.

They are used for the save/restore functionality that bhyve(8)
provides and are better suited in usr.sbin/bhyve/snapshot.h

Since bhyvectl(8) requires these, the Makefile for bhyvectl has been
modified to include usr.sbin/bhyve/snapshot.h

Reviewed by:    kevans, grehan
Differential Revision:  https://reviews.freebsd.org/D28410

----

bhyve/snapshot: drop mkdir when creating the unix domain socket

Add /var/run/bhyve/ to BSD.var.dist so we don't have to call mkdir when
creating the unix domain socket for a given bhyve vm.

The path to the unix domain socket for a bhyve vm will now be
/var/run/bhyve/vmname instead of /var/run/bhyve/checkpoint/vmname

Move BHYVE_RUN_DIR from snapshot.c to snapshot.h so it can be shared
to bhyvectl(8).

Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D28783

----

bhyve/snapshot: rename checkpoint_opcodes to be more generic

Generalize the naming here since the domain socket that uses these codes
might be used for purposes other than the save/restore feature.

- rename checkpoint_opcodes to ipc_opcode

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28877

----

bhyvectl: reduce code duplication

Combine send_start_checkpoint() and send_start_suspend() into a
single function named snapshot_request().

snapshot_request() is equivalent to send_start_checkpoint() and
send_start_suspend() except that it takes an additional argument. The
additional argument, enum ipc_opcode, is used to determine the type of
snapshot request being performed. Also, switch to using strlcpy instead
of strncpy.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28878

----

bhyve/snapshot: rename and bump size of MAX_SNAPSHOT_VMNAME

MAX_SNAPSHOT_VMNAME is a macro used to set the size of a character
buffer that stores a filename or the path to a file - this file is used
by the save/restore feature.

Since the file doesn't have anything to do with a vm name, rename
MAX_SNAPSHOT_VMNAME to MAX_SNAPSHOT_FILENAME. Bump the size to PATH_MAX
while here.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28879

----

bhyvectl: print a better error message when vm_open() fails

Use errno to print a more descriptive error message when vm_open() fails

libvmm: preserve errno when vm_device_open() fails

vm_destroy() squashes errno by making a dive into sysctlbyname() - we
can safely skip vm_destroy() here since it's not doing any critical
clean up at this point. Replace vm_destroy() with a free() call.

PR:             250671
MFC after:      3 days
Submitted by:   marko@apache.org
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D29109

----

bhyve/snapshot: use SOCK_DGRAM instead of SOCK_STREAM

The save/restore feature uses a unix domain socket to send messages
from bhyvectl(8) to a bhyve(8) process. A datagram socket will suffice
for this.

An added benefit of using a datagram socket is simplified code. For
bhyve, the listen/accept calls are dropped; and for bhyvectl, the
connect() call is dropped.

EPRINTLN handles raw mode for bhyve(8), use it to print error messages.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28983

----

bhyve: virtio shares definitions between sys/dev/virtio

Definitions inside usr.sbin/bhyve/virtio.h are thrown away.
Definitions in sys/dev/virtio are used instead.

This reduces code duplication.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29084

----

Refactor configuration management in bhyve.

Replace the existing ad-hoc configuration via various global variables
with a small database of key-value pairs.  The database supports
heirarchical keys using a MIB-like syntax to name the path to a given
key.  Values are always stored as strings.  The API used to manage
configuation values does include wrappers to handling boolean values.
Other values use non-string types require parsing by consumers.

The configuration values are stored in a tree using nvlists.  Leaf
nodes hold string values.  Configuration values are permitted to
reference other configuration values using '%(name)'.  This permits
constructing template configurations.

All existing command line arguments now set configuration values.  For
devices, the "-s" option parses its option argument to generate a list
of key-value pairs for the given device.

A new '-o' command line option permits setting an individual
configuration variable.  The key name is always given as a full path
of dot-separated components.

A new '-k' command line option parses a simple configuration file.
This configuration file holds a flat list of 'key=value' lines where
the 'key' is the full path of a configuration variable.  Lines
starting with a '#' are comments.

In general, bhyve starts by parsing command line options in sequence
and applying those settings to configuration values.  Once this is
complete, bhyve then begins initializing its state based on the
configuration values.  This means that subsequent configuration
options or files may override or supplement previously given settings.

A special 'config.dump' configuration value can be set to true to help
debug configuration issues.  When this value is set, bhyve will print
out the configuration variables as a flat list of 'key=value' lines.

Most command line argments map to a single configuration variable,
e.g.  '-w' sets the 'x86.strictmsr' value to false.  A few command
line arguments have less obvious effects:

- Multiple '-p' options append their values (as a comma-seperated
  list) to "vcpu.N.cpuset" values (where N is a decimal vcpu number).

- For '-s' options, a pci.<bus>.<slot>.<function> node is created.
  The first argument to '-s' (the device type) is used as the value of
  a "device" variable.  Additional comma-separated arguments are then
  parsed into 'key=value' pairs and used to set additional variables
  under the device node.  A PCI device emulation driver can provide
  its own hook to override the parsing of the additonal '-s' arguments
  after the device type.

  After the configuration phase as completed, the init_pci hook
  then walks the "pci.<bus>.<slot>.<func>" nodes.  It uses the
  "device" value to find the device model to use.  The device
  model's init routine is passed a reference to its nvlist node
  in the configuration tree which it can query for specific
  variables.

  The result is that a lot of the string parsing is removed from
  the device models and centralized.  In addition, adding a new
  variable just requires teaching the model to look for the new
  variable.

- For '-l' options, a similar model is used where the string is
  parsed into values that are later read during initialization.
  One key note here is that the serial ports use the commonly
  used lowercase names from existing documentation and examples
  (e.g. "lpc.com1") instead of the uppercase names previously
  used internally in bhyve.

Reviewed by:	grehan
MFC after:	3 months
Differential Revision:	https://reviews.freebsd.org/D26035

----

bhyve: support relocating fbuf and passthru data BARs

We want to allow the UEFI firmware to enumerate and assign
addresses to PCI devices so we can boot from NVMe[1]. Address
assignment of PCI BARs is properly handled by the PCI emulation
code in general, but a few specific cases need additional support.
fbuf and passthru map additional objects into the guest physical
address space and so need to handle address updates. Here we add a
callback to emulated PCI devices to inform them of a BAR
configuration change. fbuf and passthru then watch for these BAR
changes and relocate the frame buffer memory segment and passthru
device mmio area respectively.

We also add new VM_MUNMAP_MEMSEG and VM_UNMAP_PPTDEV_MMIO ioctls
to vmm(4) to facilitate the unmapping needed for addres updates.

[1]: https://github.com/freebsd/uefi-edk2/pull/9/

Originally by:	scottph
MFC After:	1 week
Sponsored by:	Intel Corporation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D24066

----

bhyve amd: Small cleanups in amdvi_dump_cmds

Bump offset with MOD_INC instead in amdvi_dump_cmds.

Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D28862

----

bhyve hostbridge: Rename "device" property to "devid".

"device" is already used as the generic PCI-level name of the device
model to use (e.g. "hostbridge").  The result was that parsing
"hostbridge" as an integer failed and the host bridge used a device ID
of 0.  The EFI ROM asserts that the device ID of the hostbridge is not
0, so booting with the current EFI ROM was failing during the ROM
boot.

Fixes:		621b5090487de9fed1b503769702a9a2a27cc7bb

----

bhyve: Enable virtio-scsi legacy config parsing.

The previous commit added the handler to parse the command line
options for virtio-scsi devices but forgot to set the correct function
pointer to point to the handler.

Reported by:	vangyzen
Reviewed by:	vangyzen
Fixes:		621b5090487de9fed1b503769702a9a2a27cc7bb
Differential Revision:	https://reviews.freebsd.org/D29438

----

bhyve: change vq_getchain to return iovecs in both directions

The old prototype requires callers to inspect flags of each descriptors
to get the starting position of host-writable iovecs.

vq_getchain() is changed to return a virtio request with the number of
host-readable iovecs and host-writable iovecs instead. Callers can avoid
boilerplate code of getting the start offset of host-writable iovecs.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Reviewed by:	afedorov
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29433

----

Fix typo in xhci nvlist node name, and also increment device counter.

This allows the xhci tablet device to be recognized and a PCI device
instantiated.

Reviewed by:	jhb
Fixes:		621b5090487d Refactor configuration management in bhyve.
MFC after:	3 months.

----

bhyve: fix regression in legacy virtio-9p config parsing

Commit 621b5090487de9fed1b503769702a9a2a27cc7bb introduced a regression
in legacy virtio-9p config parsing by not initializing *sharename to
NULL. As a result, "sharename != NULL" check in the first iteration fails
and bhyve exits with "virtio-9p: more than one share name given".

Fix by adding NULL back.

Approved by:	grehan

----

bhyve: add SMBIOS Baseboard Information

Add the System Management BIOS Baseboard (or Module) Information
a.k.a. Type 2 structure to the SMBIOS emulation.

Reviewed by:	rgrimes, bcran, grehan
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29657

----

bhyve: Move the gdb_active check to gdb_cpu_suspend().

The check needs to be in the public routine (gdb_cpu_suspend()), not
in the internal routine called from various places
(_gdb_cpu_suspend()).  All the other callers of _gdb_cpu_suspend()
already check gdb_active, and this breaks the use of snapshots when
the debug server is not enabled since gdb_cpu_suspend() tries to lock
an uninitialized mutex.

Reported by:	Darius Mihai, Elena Mihailescu
Reviewed by:	elenamihailescu22_gmail.com
Fixes:		621b5090487de9fed1b503769702a9a2a27cc7bb
Differential Revision:	https://reviews.freebsd.org/D29538

----

bhyve: implement rdmsr for MSR_IA32_FEATURE_CONTROL

Without the -w option, Windows guests crash on boot. This is caused by a rdmsr
of MSR_IA32_FEATURE_CONTROL. Windows checks this MSR to determine enabled VMX
features. This MSR isn't emulated in bhyve, so a #GP exception is injected
which causes Windows to crash.

Fix by returning a rdmsr of MSR_IA32_FEATURE_CONTROL with Lock Bit set and
VMX disabled to informWindows that VMX isn't available.

Reviewed by:	jhb, grehan (bhyve)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29665

----

bhyve.8: Make synopsis more readable

There is no need to squeeze all the possible options into one synopsis
entry. Let "-l help" and "-s help" be listed separately.

While here, keep -s and its arguments on the same line.

MFC after:	2 weeks

----

bhyve: Fix synopsis in the usage message

In particular:
- Sort short options to align with style(9)
- Add two missing flags: -G and -r
- Drop unnecessary angle brackets for consistency
- Rename the "vm" argument to vmname for consistency with the manual
  page

MFC after:	2 weeks

----

bhyve: Improve the option description in the usage message

- Sort options as suggested by style(9)
- Capitalize some words like CPU and HLT
- Add a missing description for the -G flag

MFC after:	2 weeks

----

bhyve.8: Sort the options in the OPTIONS section

No content change intended. Just moving the option descriptions around
to follow the order suggested by style(9).

MFC after:	2 weeks

----

bhyve.8: Improve the description and synopsis of -l

- Describe "-l help" separately for readability.
- List all the supported comX devices explicitly
- Use Cm instead of Ar for command modifiers (i.e., literal values a
  user can specify as an argument to the command).
- Explain where to get more information about the possible values of the
  conf argument.

MFC after:	2 weeks

----

bhyve.8: Improve the description of the -m flag

- Stylize the synopsis with proper mdoc macros
- Do some wordsmithing on the description for consistency.

MFC after:	2 weeks

----

bhyve.8: Fix the synopsis of -p

Use appropriate mdoc macros.

MFC after:	2 weeks

----

bhyve.8: Clean up description of -r

There is no need to wrap those flags in Op macros.

MFC after:	2 weeks

----

bhyve.8: Fix indention in the signals table

MFC after:	2 weeks

----

bhyve.8: Clean-up synopsis of -s

- Document "-s help" separately for readability.
- Use appropriate mdoc macros.

MFC after:	2 weeks

----

bhyve.8: Clean up the slot description of -s

Also, remove the macros of the nested list which contained slot,
emulation and conf. This decreases the indention of the -s description.
It was necessary to clean up the slot description.

MFC after:	2 weeks

----

bhyve.8: Improve emulation description of the -s flag

- Set width of the list to the longest key word for readability.
- Separate descriptions of amd_hostbridge and hostbridge emulations.
  Also, wordsmith their descriptions for consistency with other entries.
- Use Cm instead of Li for command modifiers.
- Do not stylize AMD with Li, there's no need to do it.
- Mention COM3 and COM4 in the definition of lpc.
- Fix a typo in the definition of ahci-hd ("hard drive" instead of
  "hard-drive").

MFC after:	2 weeks

----

bhyve.8: Clean up network backends section

- Reformat the format lists, use appropriate mdoc macros for
  readability.
- Add a missing Oxford comma.

MFC after:	2 weeks

----

bhyve.8: Clean up block storage device backends description

MFC after:	2 weeks

----

bhyve.8: Clean up SCSI device backends section

MFC after:	2 weeks

----

bhyve.8: Clean up 9P device backends section

MFC after:	2 weeks

----

bhyve.8: Clean up TTY, boot ROM, and pass-through descriptions

MFC after:	2 weeks

----

bhyve.8: Clean up virtio console device backends description

MFC after:	2 weeks

----

bhyve.8: Improve framebuffer backends description

- Use appropriate mdoc macros
- Document that tcp= is a synonym to rfb= (tcp is used in the examples,
  but never mentioned)
- Clarify the IP address specification

MFC after:	2 weeks

----

bhyve.8: Improve documentation of NVME backend

- Document the configuration format.
- Document two additional configuration options: eui64 and dsm.

MFC after:	2 weeks

----

bhyve.8: Improve AHCI backends documentation

- Document the backend format.

MFC after:	2 weeks

----

bhyve: Document the format for HD audio backends

- This change is done for consistency with other backend definitions.

MFC after:	2 weeks

----

bhyve.8: Fix mandoc -Tlint issues

While here, keep network backends section consistent with other
sections.

MFC after:	2 weeks

----

bhyve: Be explicit that setting config.dump will not start a VM.

Suggested by:	rpokala
Reviewed by:	bcr (manpages)
Differential Revision:	https://reviews.freebsd.org/D29738

----

bhyve: Gracefully handle virtio-scsi with no conf

Fixes segfault with the command `bhyve -s 0,virtio-scsi`, which is used
by some third party software to probe bhyve for virtio-scsi support.

Reviewed by:	jhb
MFC after:	1 day
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D29926

----

bhyve: Set SO_REUSEADDR on the gdb stub socket

Reviewed by:	jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30037

----

bhyve/snapshot: provide a way to send other messages/data to bhyve

This is a step towards sending messages (other than suspend/checkpoint)
from bhyvectl to bhyve.

Introduce a new struct, ipc_message - this struct stores the type of
message and a union containing message specific structures for the type
of message being sent.

Reviewed by:    grehan
Differential Revision: https://reviews.freebsd.org/D30221

----

bhyve/snapshot: split up mutex/cond initialization from socket creation

Move initialization of the mutex/condition variables required by the
save/restore feature to their own function.

The unix domain socket that facilitates communication between bhyvectl
and bhyve doesn't rely on these variables in order to be functional.

Reviewed by:    markj
Differential Revision:  https://reviews.freebsd.org/D30281

----

Add a virtio-input device emulation.

This will be used to inject keyboard/mouse input events into a guest.
The command line syntax is:
   -s <slot>,virtio-input,/dev/input/eventX

Reviewed by:	jhb (bhyve), grehan
Obtained from:	Corvin Köhne <C.Koehne@beckhoff.com>
MFC after:	3 weeks
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D30020

----

bhyve: Register new kevents synchronously.

Change mevent_add*() to synchronously add the new kevent.  This
permits reporting event registration failures to the caller and avoids
failing the registration of other, unrelated events queued up in the
same batch.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30502

----

bhyve: Add support for EVFILT_VNODE mevents.

This allows registering an event to watch for changes to a file's
attributes.  This is a bit imperfect as it would be nice to have a way
to determine if an fd can use EVFILT_VNODE successfully.  mevent's
current structure does not permit that and a failure to register a
single kevent impacts several other kevents.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30503

----

bhyve: Add support for handling disk resize events to block_if.

Allow clients of blockif to register a resize callback handler.  When
a callback is registered, register an EVFILT_VNODE kevent watching the
backing store for a change in the file's attributes.  If the size has
changed when the kevent fires, invoke the clients' callback.

Currently resize detection is limited to backing stores that support
EVFILT_VNODE kevents such as regular files.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30504

----

bhyve: Split out a lower-level helper for VirtIO interrupts.

This allows device models to assert VirtIO interrupts for reasons
other than publishing changes to a VirtIO ring such as configuration
changes.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30505

----

bhyve vtblk: Inform guests of disk resize events.

Register a resize callback with the blockif interface.  When the
callback fires, update the size of the disk and notify the guest via a
configuration change interrupt.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30506

----

bhyve: enhance debug info for memory range clash

Explain what the two clashing regions are.

Reivewed by:		grehan, jhb
Differential Revision:	https://reviews.freebsd.org/D29696
Pull Request:		https://github.com/freebsd/freebsd-src/pull/463

----

Add more GIC and GICv3 registers

These aren't used by either driver, however they will be needed by
bhyve on arm64 to emulate a GICv3 interrupt controller.

Sponsored by:	Innovate UK

----

bhyve: Fix cli regression with NVMe ram

The configuration management refactoring inadvertently removed support
for a RAM-backed NVMe Namespace (i.e. -s X,nvme,ram=16384). This adds it
back.

Reported by:	andy@omniosce.org
Reviewed by:	jhb, andy@omniosce.org
Fixes:		621b5090487d
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D30717

----

bhyve: fix NVMe MDTS comment

Removes an obsolete comment and adds parenthesis around the macro while
in the area. No functional change.

----

bhyve: Fix NVMe iovec construction for large IOs

The UEFI driver included with Rocky Linux 8.4 uncovered an existing bug
in the NVMe emulation's construction of iovec's.

By default, NVMe data transfer operations use a scatter-gather list in
which all entries point to a fixed size memory region. For example, if
the Memory Page Size is 4KiB, a 2MiB IO requires 512 entries. Lists
themselves are also fixed size (default is 512 entries).

Because the list size is fixed, the last entry is special. If the IO
requires more than 512 entries, the last entry in the list contains the
address of the next list of entries. But if the IO requires exactly 512
entries, the last entry points to data.

The NVMe emulation missed this logic and unconditionally treated the
last entry as a pointer to the next list. Fix is to check if the
remaining data is greater than the page size before using the last entry
as a pointer to the next list.

PR:		256422
Reported by:	dave@syix.com
Tested by:	jason@tubnor.net
MFC after:	5 days
Relnotes:	yes
Reviewed by:	imp, grehan
Differential Revision:	https://reviews.freebsd.org/D30897

----

Append Keyboard Layout specified option for using VNC.
Part one: supporting QEMU Extended Keyboard Event Message

PR:             246121
Submitted by:   koinec@yahoo.co.jp
Differential Revision: https://reviews.freebsd.org/D29430

----

libvmm: explicitly save and restore errno in vm_open()

In commit 6bb140e3ca895a14, vm_destroy() was replaced with free() to
preserve errno. However, it's possible that free() may change the errno
as well. Keep the free() call, but explicitly save and restore errno.

Noted by: jhb
Fixes: 6bb140e3ca895a14

----

vmm: Let guests enable SMEP/SMAP if the host supports it

Reviewed by:	kib, grehan, jhb
Tested by:	grehan (AMD)
MFC after:	3 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30462

----

vmm: Fix ivrs_drv device_printf usage

The original %b description string is wrong.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	imp, jhb
Differential Revision:	https://reviews.freebsd.org/D30805

----

bhyve/vioapic: remove an extra pin masked check

vioapic_send_intr does already check whether the pin is masked before
injecting the interrupt, there's no need to do it in vioapic_write
also.

No functional change intended.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28236

----

bhyve/ioapic: only account for asserted line in level mode

After modifying a redirection entry only try to inject an interrupt if
the pin is in level mode, pins in edge mode shouldn't take into
account the line assert status as they are triggered by edge changes,
not the line status itself.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28237

----

bhyve/ioapic: improve the tracking of IRR bit

One common method of EOI'ing an interrupt at the IO-APIC level is to
switch the pin to edge triggering mode and then back into level mode.
That would cause the IRR bit to be cleared and thus further interrupts
to be injected. FreeBSD does indeed use that method if the IO-APIC EOI
register is not supported.

The bhyve IO-APIC emulation code didn't clear the IRR bit when doing
that switch, and was also missing acknowledging the IRR state when
trying to inject an interrupt in vioapic_send_intr.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28238

----

ivrs_drv: Fix IVHDs with duplicated BaseAddress

Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28945

----

AMD-vi: Fix IOMMU device interrupts being overridden

Currently, AMD-vi PCI-e passthrough will lead to the following lines in
dmesg:
"kernel: CPU0: local APIC error 0x40
ivhd0: Error: completion failed tail:0x720, head:0x0."

After some tracing, the problem is due to the interaction with
amdvi_alloc_intr_resources() and pci_driver_added(). In ivrs_drv, the
identification of AMD-vi IVHD is done by walking over the ACPI IVRS
table and ivhdX device_ts are added under the acpi bus, while there are
no driver handling the corresponding IOMMU PCI function. In
amdvi_alloc_intr_resources(), the MSI intr are allocated with the ivhdX
device_t instead of the IOMMU PCI function device_t. bus_setup_intr() is
called on ivhdX. the IOMMU pci function device_t is only used for
pci_enable_msi(). Since bus_setup_intr() is not called on IOMMU pci
function, the IOMMU PCI function device_t's dinfo->cfg.msi is never
updated to reflect the supposed msi_data and msi_addr. So the msi_data
and msi_addr stay in the value 0. When pci_driver_added() tried to loop
over the children of a pci bus, and do pci_cfg_restore() on each of
them, msi_addr and msi_data with value 0 will be written to the MSI
capability of the IOMMU pci function, thus explaining the errors in
dmesg.

This change includes an amdiommu driver which currently does attaching,
detaching and providing DEVMETHODs for setting up and tearing down
interrupt. The purpose of the driver is to prevent pci_driver_added()
from calling pci_cfg_restore() on the IOMMU PCI function device_t.
The introduction of the amdiommu driver handles allocation of an IRQ
resource within the IOMMU PCI function, so that the dinfo->cfg.msi is
populated.

This has been tested on EPYC Rome 7282 with Radeon 5700XT GPU.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28984

----

Correct "Fondation" typo (missing "u")

----

AMD-vi: Mixed format IVHD block should replace fixed format IVHD block

This fixes double IVHD_SETUP_INTR calls on the same IOMMU device.

Sponsored by:	The FreeBSD Foundation
MFC with:	74ada297e897
Reported by:	Oleg Ginzburg <olevole@olevole.ru>
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29521

----

vmm: Fix AMD-vi using wrong rid range

The ACPI parsing code around rid range was wrong on assuming there is
only one pair of start/end device id range. Besides, ivhd_dev_parse()
never work as supposed. The start/end rid info was always zero.

Restructure the code to build dynamic-sized tables for each IOMMU softc
holding device entries. The device entries are enumerated to find a
suitable IOMMU unit. Operations on devices not governed (e.g. the IOMMU
unit itself) are no-op from now on. There are also a minor fix on wrong
%b formatting string usage.

Tested on my EPYC 7282.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D30827

----

vmm: Bump vmname buffer in struct vm to VM_MAX_NAMELEN + 1

In hw.vmm.create sysctl handler the maximum length of vm name is
VM_MAX_NAMELEN. However in vm_create() the maximum length allowed is
only VM_MAX_NAMELEN - 1 chars. Bump the length of the internal buffer to
allow the length of VM_MAX_NAMELEN for vm name.

MFC after:	3 days
Reviewed by:	grehan
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31372

----

amd64: Fix output operand specs for the stmxcsr and vmread intrinsics

This does not appear to affect code generation, at least with the
default toolchain.

Noticed because incorrect output specifications lead to false positives
from KMSAN, as the instrumentation uses them to update shadow state for
output operands.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31466

----

vmm: Make iommu ops tables const

While here, use designated initializers and rename some AMD iommu method
implementations to match the corresponding op names.  No functional
change intended.

Reviewed by:	grehan
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31462

----

vmm: Fix wrong assert in ivhd_dev_add_entry

The correct condition is to check the number of ivhd entries fit into
the array.

Reported by:	bz
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D31514

----

vmm: Add credential to cdev object

Add a credential to the cdev object in sysctl_vmm_create(), then check
that we have the correct credentials in sysctl_vmm_destroy(). This
prevents a process in one jail from opening or destroying the /dev/vmm
file corresponding to a VM in a sibling jail.

Add regression tests.

Reviewed by:	jhb, markj
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31156

----

bhyve: net_backends, automatically IFF_UP tap devices

If you want communications with the outside world and tell bhyve to
create an interfaces then it should be usable as well.
Rather than relying on the sysctl net.link.tap.up_on_open automatically
try to IFF_UP the opened tap device.

MFC after:	10 days
Reviewed by:	markj, grehan
Differential Revision: https://reviews.freebsd.org/D31342

----

bhyve: Use fspacectl(2) for BOP_DELETE on regular file images

bhyve can also make use of fspacectl(2) to implement BOP_DELETE with
hole-punching. Since it is not desirable to do zero-filling for large
DEALLOCATE/UNMAP range, candelete is not set if pathconf(2) indicates
that the underlying file system does not support native
VOP_DEALLOCATE(9).

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D28880

----

bhyve: Use pci(4) to access I/O port BARs

This removes the dependency on /dev/io.

PR:		251046
Reviewed by:	jhb
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31308

----

byhve: add option to specify IP address for gdb

Allow user to specify the IP address available for gdb debugger.

Reviewed by:	jhb, grehan, rgrimes, bcr (man pages)
Differential Revision:	https://reviews.freebsd.org/D29607

----

bhyve: change a default address from ANY to localhost

Discussed with:     grehan, jhb

----

bhyve: Fix vq_getchain() error handling bugs in various device models

Reviewed by:	grehan, khng
Approved by:	so
Security:	CVE-2021-29631
Security:	FreeBSD-SA-21:13.bhyve

----

pci: Add an ioctl to perform I/O to BARs

This is useful for bhyve, which otherwise has to use /dev/io to handle
accesses to I/O port BARs when PCI passthrough is in use.

Reviewed by:	imp, kib
Discussed with:	jhb
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31307

----

bhyve: Nuke double-semicolons

A distinct number of double-semicolons ended up in bhyve. Take a pass at
getting rid of many of these harmless typos.

MFC after:	3 days

----

bhyve: Fix pci device node key in bhyve_config.5

PCI device node key in the manual page is wrong. It should be
pci.bus.slot.function.

MFC after:	3 days

----

bhyve: Support setting the disk serial number for VirtIO block devices.

Reviewed by:	allanjude
Obtained from:	illumos
Differential Revision:	https://reviews.freebsd.org/D31983

----

bhyve: Update the -G description in the SYNPOSIS.

It was missing both the 'w' flag and 'bind_address'.

----

bhyve_config.5: Document gdb.address.

----

bhyve: Add an empty case for event types in mevent_kq_fflags().

This fixes a -Wswitch error raised by GCC 9.

Differential Revision:	https://reviews.freebsd.org/D31938

----

bhyve: Map the MSI-X table unconditionally for passthrough

It is possible for the PBA to reside in the same page as the MSI-X
table.  And, while devices are not supposed to do this, at least some
Intel wifi devices place registers in a page shared with the MSI-X
table.  To handle the first case we currently map the PBA page using
/dev/mem, and the second case is not handled.

Kill two birds with one stone: map the MSI-X table BAR using the
PCIOCBARMMAP ioctl instead of /dev/mem, and map the entire table so that
accesses beyond the bounds of the table can be emulated.  Regions of the
BAR not containing the table are left unmapped.

Reviewed by:	bz, grehan, jhb
MFC after:	3 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32359

----

bhyve.8: Fix markup of the -G flag

----

bhyve: Update usage and synopsis for the -k flag

Let's make it clear to users that -k is for configuration files.
Also, point to bhyve_config(5) in the paragraph describing the flag.

Reviewed by:	jhb
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D32467

----

bhyve: ignore low bits of CFGADR

Bhyve could emulate wrong PCI registers.
In the best case, the guest reads wrong registers and the device driver would
report some errors.
In the worst case, the guest writes to wrong PCI registers and could brick
hardware when using PCI passthrough.

According to Intels specification, low bits of CFGADR should be
ignored. Some OS like linux may rely on it. Otherwise, bhyve could
emulate a wrong PCI register.

E.g.
If linux would like to read 2 bytes from offset 0x02, following would
happen.
linux:
	outl 0x80000002 at CFGADR
	inw  at CFGDAT + 2
bhyve:
	cfgoff = 0x80000002 & 0xFF = 0x02
	coff   = cfgoff + (port - CFGDAT) = 0x02 + 0x02 = 0x04
Bhyve would emulate the register at offset 0x04 not 0x02.

Reviewed By: #bhyve, grehan
Differential Revision: https://reviews.freebsd.org/D31819
Sponsored by:	       Beckhoff Automation GmbH & Co. KG

----

bhyve: Fix the WITH_BHYVE_SNAPSHOT build

Note, this breaks compatibility with snapshots generated by older builds
of bhyve(8).

Fixes: 7fa233534736 ("bhyve: Map the MSI-X table unconditionally for passthrough")
Reported by:	Greg V <greg@unrelenting.technology>
Reviewed by:	grehan, bz
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32523

----

bhyve: Bump the SMBIOS firmware version to 14.0 for 14-CURRENT

Bump the firmware version to 14.0 and set the firmware release date
to today.

Reviewed by: jhb, bz, imp
Differential Revision: https://reviews.freebsd.org/D32534

----

bhyve: use physical lobits for BARs of passthru devices

Tell the guest whether a BAR uses prefetched memory or not for
passthru devices by using the same lobits as the physical device.

Reviewed by:	 grehan
Sponsored by:	 Beckhoff Autmation GmbH & Co. KG
Differential Revision:	  https://reviews.freebsd.org/D32685

----

bhyve: do not explicitly map fbuf framebuffer

Allocating a BAR will call baraddr which maps the framebuffer. No need
to allocate it explicitly on init.

Reviewed by:     grehan
Sponsored by:    Beckhoff Autmation GmbH & Co. KG
Differential Revision:    https://reviews.freebsd.org/D32596

----

bhyve: move 64 bit BAR location to match OVMF assumptions

OVMF will fail, if large 64 bit BARs are used. GCD-Map doesn't cover
64 bit addresses of BARs.
OVMF assumes that 64 bit addresses of BARS are located on next 32 GB
boundary behind Top of High RAM.

This patch moves 64 bit BARs on next 32 GB boundary behind Top of High
RAM to match OVMF assumptions.

Differential Revision:	https://reviews.freebsd.org/D27970
Sponsored by: Beckhoff Automation GmbH & Co. KG

----

bhyve: use a fixed 32 bit BAR base address

OVMF always uses 0xC0000000 as base address for 32 bit PCI MMIO space.
For that reason, we should use that address too.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D31051
Sponsored by:	Beckhoff Automation GmbH & Co. KG

----

bhyve: keep physical and virtual COMMAND reg in sync

On startup all virtual BARs are registered.
Additionally, the encoding bit in the virtual cmd register is set.
After that, the passthru emulation overwrites the virtual cmd register with
the physical one.
This could lead to a mismatch between registered BARs and the encoding
bits in the cmd register.
Instead of writing the physical to the virtual cmd register,
write the virtual to the physical cmd register to solve this issue.

Reviewed by:	  markj
Differential Revision:	https://reviews.freebsd.org/D32687
Sponsored by:	Beckhoff Automation GmbH & Co. KG

----

bhyve: emulate reads of MSI-X capabilities for passthru devices

Reads of the MSI-X capabilites aren't emulated by passthru devices
yet. The guest will read the host MSI-X capabilites which could
cause issues.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D32686
Sponsored by:	Beckhoff Automation GmbH & Co. KG

----

bhyve: Fix compile

We need err.h

Fixes:	5cf21e48ccf11 ("bhyve: use a fixed 32 bit BAR base address")
Sponsored by:	Bechoff Automation GmbH & Co. KG

----

bhyve blockif: fix blockif_candelete with Capsicum

NVMe conformance tests for the Format command failed if the
backing-storage for the bhyve device was a file instead of a Zvol. The
tests (and the specification) expect a Format to destroy all previously
written data. The bhyve NVMe emulation implements this by trimming /
deallocating all data from the backing-storage.

The blockif_candelete() function indicated the file did not support
deallocation (i.e. fpathconf(..., _PC_DEALLOC_PRESENT) returned FALSE)
even though the kernel supported file hole punching. This occurs on
builds with Capsicum enabled because blockif did not allow the
fpathconf(2) right.

Fix is to add CAP_FPATHCONF to the cap_rights_init(3) call.

PR:		260081
Reviewed by:	allanjude, markj, jhb
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D33203

----

bhyve: fix -Wunused-but-set-variable warning

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D33306

----

bhyve: Support a _VARS.fd file for bootrom

OVMF creates two separate .fd files, a _CODE.fd file containing
the UEFI code, and a _VARS.fd file containing a template of an
empty UEFI variable store.

OVMF decides to write variables to the memory range just below the
boot rom code if it detects a CFI flash device. So here we add
just the barest facsimile of CFI command handling to bootrom.c
that is needed to placate OVMF.

Submitted by: D Scott Phillips <d.scott.phillips@intel.com>
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19976
MFC After: 1 week

----

bhyve: set EV_CLEAR for EVFILT_VNODE mevents

When an EVFILT_VNODE filter event is triggered, reset it.

This fixes the issue where a virtio-blk resize event would cause the
mevent thread to consume 100% of the cpu.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D33326

----

bhyve nvme: Add AEN support to NVMe emulation

Add Asynchronous Event Notification infrastructure to the NVMe
emulation.

Reviewed by:	imp, grehan
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D32952

----

bhyve nvme: Inform guests of namespace resize

Register a "block resize" callback to be notified of changes to the
backing storage for the Namespace. Use this to generate an Asynchronous
Event Notification, Namespace Attributes Changed when the guest OS
provides an Asynchronous Event Request.

MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D32953

----

bhyve: Only snapshot initialized VirtIO queues

If the virtio device is not fully initialized, then suspend fails with:

  vi_pci_snapshot_queues: invalid address: vq->vq_desc
  Failed to snapshot virtio-rnd; ret=14

MFC after:	1 week
Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D26268

----

bhyve: passthru: enable BARs before possibly mmap(2)ing them

The first time we start bhyve with a passthru device everything is fine
as on boot we do enable BARs.  If a driver (unload) inside bhyve disables
the BAR(s) as some Linux drivers do, we need to make sure we re-enable
them on next bhyve start.

If we are trying to mmap a disabled BAR for MSI-X (PCIOCBARMMAP)
the kernel will give us an EBUSY.
While we were re-enabling the BAR(s) in the current code loop
cfginit() was writing the changes out too late to the real hardware.

Move the call to init_msix_table() after the register on the real
hardware was updated.  That way the kernel will be happy and the
mmap will succeed and bhyve will start.
Also simplify the code given the last argument to init_msix_table()
is unused we do not need to do checks for each bar. [1]

MFC after:	3 days
PR:		260148
Pointed out by:	markj [1]
Sponsored by:	The FreeBSD Foundation
Reviewed by:	markj
Differential Revision: https://reviews.freebsd.org/D33628

----

bhyve: clean up trailing whitespaces

Clean up trailing whitespaces. No functional changes.

Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D33681

----

bhyve smbios type 3 structure is incorrect

If you look at the SMBIOS specification, we'll find something is
missing. In particular at offset 0Dh is supposed to be the OEM-defined
field. This should go between security and height. It is not legal to
actually skip this and will lead to other folks not properly
interpreting later parts of the table.

https://www.illumos.org/issues/14312

Reviewed by:	jhb
Submitted by:	Robert Mustacchi <rm@fingolfin.org>
Obtained from:	ilumos
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D33682

----

bhyve: only init MSI-X table if passthru device supports it

Some passthru devices only support MSI instead of MSI-X. For those
devices the initialization of MSI-X table will fail. Re-add the
check erroneously removed in f1442847c9404d4bc5f5524a0c3362dd39cb14f9.

MFC after:	3 days
X-MFC with:	f1442847c9404d4bc5f5524a0c3362dd39cb14f9
PR:		260148
Reviewed by:	manu, bz
Differential Revision:	https://reviews.freebsd.org/D33728

----

bhyve: enumerate BARs by size

E.g. Framebuffers can require large space and BARs need to be aligned
by their size. If BARs aren't allocated by size, it'll cause much
fragmentation of the MMIO space. Reduce fragmentation by ordering
the BAR allocation on their size to reduce the risk of
OUT_OF_MMIO_SPACE issues.

Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	Beckhoff Automation GmbH & Co. KG
Differential Revision:	https://reviews.freebsd.org/D28278

----

bhyve: allow reading of fwctl signature multiple times

At the moment, you only have one single chance to read the fwctl
signature. At boot bhyve is in the state IDENT_WAIT. It's then
possible to switch to IDENT_SEND. After bhyve sends the signature,
it switches to REQ. From now on it's impossible to switch back to
IDENT_SEND to read the signature. For that reason, only a single
driver can read the signature. A guest can't use two drivers to
identify that fwctl is present. It gets even worse when using
OVMF. OVMF uses a library to access fwctl. Therefore, every single
OVMF driver would try to read the signature. Currently, only a
single OVMF driver accesses the fwctl. So, there's no issue with
it yet. However, no OS driver would have a chance to detect fwctl when
using OVMF because it's signature was already consumed by OVMF.

Reviewed by:    markj
MFC after:      2 weeks
Sponsored by:   Beckhoff Automation GmbH & Co. KG
Differential Revision:  https://reviews.freebsd.org/D31981

----

bhyve: add more slop to 64 bit BARs

Bhyve allocates small 64 bit BARs below 4 GB and generates ACPI tables
based on this allocation. If the guest decides to relocate those BARs
above 4 GB, it could lead to mismatching ACPI tables. Especially
when using OVMF with enabled bus enumeration it could cause
issues. OVMF relocates all 64 bit BARs above 4 GB. The guest OS
may be unable to recover from this situation and disables some PCI
devices because their BARs are located outside of the MMIO space
reported by ACPI. Avoid this situation by giving the guest more
space for relocating BARs.

Let's be paranoid. The available space for BARs below 4 GB is 512 MB
large. Use a slop of 512 MB. It'll allow the guest to relocate all
BARs below 4 GB to an address above 4 GB. We could run into issues
when we exceeding the memlimit above 4 GB. However, this space has
a size of 32 GB. Even when using many PCI device with large BARs
like framebuffer or when using multiple PCI busses, it's very
unlikely that we run out of space due to the large slop.
Additionally, this situation will occur on startup and not at runtime
which is much better.

Reviewed by:    markj
MFC after:      2 weeks
Sponsored by:   Beckhoff Automation GmbH & Co. KG
Differential Revision:  https://reviews.freebsd.org/D33118

----

bhyve: dynamically register FwCtl ports

Qemu's FwCfg uses the same ports as Bhyve's FwCtl. Static allocated
ports wouldn't allow to switch between Qemu's FwCfg and Bhyve's
FwCtl.

Reviewed by:    markj
MFC after:      2 weeks
Sponsored by:   Beckhoff Automation GmbH & Co. KG
Differential Revision:  https://reviews.freebsd.org/D33496

----

bhyve: Map the right BAR in init_msix_table()

The PBA and MSI-X table can reside in different BARs.

Reported by:	Andy Fiddaman <andy@omniosce.org>
Reviewed by:	jhb
Fixes:		7fa233534736 ("bhyve: Map the MSI-X table unconditionally for passthrough")
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33739

----

bhyve: Correct unmapping of the MSI-X table BAR

The starting address passed to mprotect was wrong, so in the case where
the last page containing the table is not the last page of the BAR, the
wrong region would be unmapped.

Reported by:	Andy Fiddaman <andy@omniosce.org>
Reviewed by:	jhb
Fixes:		7fa233534736 ("bhyve: Map the MSI-X table unconditionally for passthrough")
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33739

----

bhyve: add nvlist functions for setting unset nodes

If an emulation uses those functions instead of set_config_value_node
or set_config_value, it allows the config values to get
overwritten. Introducing new functions is much more readable than
if else statements in the emulation code.

Reviewed by:	khng
MFC after:	2 weeks
Sponsored by:	Beckhoff Automation GmbH & Co. KG
Differential Revision:	https://reviews.freebsd.org/D33770

----

bhyve: get mediasize for character devices when resizing virtio-blk

Reviewed by:	imp, allanjude, jhb
Differential Revision:	https://reviews.freebsd.org/D33403

----

bhyve/snapshot: fix pthread_create() error check

pthread_create() returns 0 on success or an error number on failure.

Reviewed by:	khng, markj
Differential Revision:	https://reviews.freebsd.org/D33930

----

Append Keyboard Layout specified option for using VNC.
Part two: Append bhyve -K option for specified keyboard layout
with layout setting files every languages.
Since the cmd option '-k' was used in the meantime
it was changed to '-K'

PR:		246121
Submitted by:	koinec@yahoo.co.jp
Reviewed by:	grehan@
Differential Revision:	https://reviews.freebsd.org/D29473

MFC after:	4 weeks

----

bhyve: ahci: Fix regression with no ports

An AHCI controller may be specified with no connected ports.  Avoid
dumping core in this case for compatibility with existing VM configs.

Reviewed by:	khng, jhb
Fixes:		621b5090487de Refactor configuration management in bhyve.
MFC after:	1 week
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D33969

----

bhyve/block_if: allow DIOCGMEDIASIZE ioctl

This is needed to get mediasize of the device after a resize event.

I missed this earlier as I was building WITH_BHYVE_SNAPSHOT, which
disables capsicum.

Reviewed by:	khng, markj
Fixes: ae9ea22e14bf ("bhyve: get mediasize for character devices when ...")
Differential Revision:	https://reviews.freebsd.org/D34013

----

pkgbase: bhyve: Tag the kbdlayout file to be in the bhyve package

----

bhyve nvme: Advertise v1.4 support

Bump advertised NVMe support from v1.3 to v1.4

Reviewed by:	allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33564

----

bhyve nvme: Fix NVM Format completion status

The NVM Format command is unique among the Admin commands in that it
needs to finish asynchronously. For this reason, the emulation code
invented a synthetic completion status (NVME_NO_STATUS) to indicate that
the command was still in progress and the command processing loop should
not generate a completion message. The implementation used the value
0xffff for the synthetic value as this set both the Status Code and
Status Code Type fields to reserved values.

Format initialized the completion status to this value and expected
error cases to override it with a status code/type appropriate to the
situation. The macros used to set the NVMe status are careful not to
modify bit 0 (i.e. the phase bit), which with the synthetic completion
status, causes the phase bit to get out of sync. When running tests in a
guest with illegal NVM Format commands, Admin commands would eventually
hang because it appeared there were no completions due to the incorrect
phase bit value.

Fix is to only set NVME_NO_STATUS if the blockif delete command
succeeds. While in the neighborhood, add a missing break statement when
NVM Format is not supported.

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33565

----

bhyve nvme: Fix Namespace Specific Set Features

Return an error if the feature specified in Set Features is Namespace
specific but the Namespace ID uses the Global Namespace tag.

Fixes UNH Test 1.2.7

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33566

----

bhyve nvme: Implement Log Page Offset

Modify the Get Log Page command to parse the Log Page Offset fields to
support more recent versions of the NVMe specification.

Fixes various tests for UNH Test 1.3.*

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33568

----

bhyve nvme: Add missing Admin opcodes

Don't treat unsupported Admin commands as Invalid Opcode. Instead return
the proper Invalid Field in Command.

Fixes UNH IOL test 1.17.2

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33569

----

bhyve nvme: Remove redundant AER Limit checks

The NVMe emulation checked if the Asynchronous Event Request Limit
(a.k.a AERL) would be exceeded in pci_nvme_aer_add(), but this function
is only called from nvme_opc_async_event_req() which also checks for
exceeding the AERL.

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33570

----

bhyve nvme: Fix Set Features

Be more conservative and only support the Features mandatory for an I/O
Controller.

Avoids a "hang" in UNH test 1.2.10 associated with Predictable Latency
Mode Configuration and Host Behavior Support features.

Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33571

----

bhyve nvme: Add Temperature Threshold support

This adds the ability for a guest OS to send Set / Get Feature,
Temperature Threshold commands. The implementation assumes a constant
temperature and will generate an Asynchronous Event Notification if the
specified threshold is above/below this value. Although the
specification allows 9 temperature values, this implementation only
implements the Composite Temperature.

While in the neighborhood, move the clear of the CSTS register in the
reset function after all other cleanup. This avoids a race with the
guest thinking the reset is complete (i.e. CSTS.RDY = 0) before the NVMe
emulation is actually complete with the reset.

Fixes UNH IOL 16.0 Test 1.7, cases 1, 2, and 4.

Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33572

----

bhyve nvme: Update v1.4 Identify Controller data

Compliant v1.4 Controllers must report a Controller Type (CNTRLTYPE).
Also, do not advertise secure erase functionality in the Format NVM
Attributes field of the Identify Controller data structure as the
Controller does not implement secure erase.

Fixes UNH ILO Test 1.1, Case 2

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33573

----

bhyve nvme: Add Select support to Get Features

Implement basic support for the SEL field of Get Features. This returns
information about Namespace Specific features.

Fixes UNH ILO 16.0 Test 1.2, Case 13

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33574

----

bhyve nvme: Fix LBA out-of-range calculation

The function which checks for a valid LBA range mistakenly named an
input value as NLB ("Number of Logical Blocks") instead of "number of
blocks". The NVMe specification defines NLB as a zero-based value (i.e.
NLB=0x0 represents 1 block, 0x1 is 2 blocks, etc.), but the passed
parameter is a 1's-based value.

Fix is to rename the variable to avoid future confusion.

While in the neighborhood, also check that the starting LBA is less than
the size of the backing storage to avoid an integer overflow.

Reviewed by:	imp, allanjude, jhb
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33575

----

bhyve nvme: Fix reported VWC value

v1.4 and later NVMe Controllers report "Flush all Namespaces" support
differently.

Fixes UNH IOL 16.0 Test 2.6, Case 3

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33576

----

bhyve nvme: Fix Set Features, AEN

NVMe Controllers which do not support Endurance Groups must return an
error when the Endurance Group Event Aggregate Log Change Notices bit is
set in Set Features, Asynchronous Event Configuration.

Fixes UNH IOL Test 3.12, Case 8

Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33577

----

bhyve nvme: Fix Identify Namespace, NSID=ffffffff

If the NVMe Controller doesn't support Namespace Management, it should
return "Invalid Namespace or Format" when the Host request Identify
Namespace with the global NSID value.

Fixes UNH IOL 16.0 Test 9.1, Case 6

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33578

----

bhyve/virtio: use correct device id for virtio-scsi

Section 4.1.2.1 of the virtio spec states that the transitional PCI
device id for a scsi device is 0x1004.

Fix suggested by reporter.

PR:             259961
Reported by:    me@nanaya.pro
Reviewed by:	imp, jhb
Fixes:  f9c005a17f4e ("Add bhyve virtio-scsi storage backend support.")
Differential Revision:	https://reviews.freebsd.org/D34103

----

Create VM_MEMATTR_DEVICE on all architectures

This is intended to be used with memory mapped IO, e.g. from
bus_space_map with no flags, or pmap_mapdev.

Use this new memory type in the map request configured by
resource_init_map_request, and in pciconf.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D29692

----

Use if ... else when printing memory attributes

In vmstat there is a switch statement that converts these attributes to
a string. As some values can be duplicate we have to hide these from
userspace.

Replace this switch statement with an if ... else macro that lets us
repeat values without a compiler error.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	ABT Systems Ltd
Differential Revision:	https://reviews.freebsd.org/D29703

----

Remove an always-true check.

This fixes a -Wtype-limits error from GCC 9.

Reviewed by:	imp
Differential Revision:	https://reviews.freebsd.org/D31936

----

vlapic: Schedule callouts on the local CPU

The virtual LAPIC driver uses callouts to implement the LAPIC timer.
Callouts are armed using callout_reset_sbt(), which currently puts
everything on CPU 0.  On systems running many bhyve VMs this results in
a large amount of contention for CPU 0's callout lock.

Modify vlapic to schedule callouts on the local CPU instead.  This
allows timer interrupts to be scheduled more evenly among CPUs where
bhyve is running.

Reviewed by:	grehan, jhb
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32559

----

vmm: vlapic resume can eat 100% CPU by vlapic_callout_handler

Suspend/Resume of Win10 leads that CPU0 is busy on handling interrupts.

Win10 does not use LAPIC timer to often and in most cases, and I see it
is disabled by writing 0 to Initial Count Register (for Timer).

During resume, restart timer only for enabled LAPIC and enabled timer
for that LAPIC.

Reviewed by:	markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D33448

----

bhyve: add support for MTRR

Some guests or driver might depend on MTRR to work properly. E.g. the
nvidia gpu driver won't work without MTRR.

Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	Beckhoff Automation GmbH & Co. KG
Differential Revision:	https://reviews.freebsd.org/D33333
Yamagi added a commit to Yamagi/freebsd that referenced this pull request Apr 19, 2022
Includes all commits up to 2022/02/06 or d21e71efce39.

----

libvmm: clean up vmmapi.h

struct checkpoint_op, enum checkpoint_opcodes, and
MAX_SNAPSHOT_VMNAME are not vmm specific, move them out of the vmmapi
header.

They are used for the save/restore functionality that bhyve(8)
provides and are better suited in usr.sbin/bhyve/snapshot.h

Since bhyvectl(8) requires these, the Makefile for bhyvectl has been
modified to include usr.sbin/bhyve/snapshot.h

Reviewed by:    kevans, grehan
Differential Revision:  https://reviews.freebsd.org/D28410

----

bhyve/snapshot: drop mkdir when creating the unix domain socket

Add /var/run/bhyve/ to BSD.var.dist so we don't have to call mkdir when
creating the unix domain socket for a given bhyve vm.

The path to the unix domain socket for a bhyve vm will now be
/var/run/bhyve/vmname instead of /var/run/bhyve/checkpoint/vmname

Move BHYVE_RUN_DIR from snapshot.c to snapshot.h so it can be shared
to bhyvectl(8).

Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D28783

----

bhyve/snapshot: rename checkpoint_opcodes to be more generic

Generalize the naming here since the domain socket that uses these codes
might be used for purposes other than the save/restore feature.

- rename checkpoint_opcodes to ipc_opcode

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28877

----

bhyvectl: reduce code duplication

Combine send_start_checkpoint() and send_start_suspend() into a
single function named snapshot_request().

snapshot_request() is equivalent to send_start_checkpoint() and
send_start_suspend() except that it takes an additional argument. The
additional argument, enum ipc_opcode, is used to determine the type of
snapshot request being performed. Also, switch to using strlcpy instead
of strncpy.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28878

----

bhyve/snapshot: rename and bump size of MAX_SNAPSHOT_VMNAME

MAX_SNAPSHOT_VMNAME is a macro used to set the size of a character
buffer that stores a filename or the path to a file - this file is used
by the save/restore feature.

Since the file doesn't have anything to do with a vm name, rename
MAX_SNAPSHOT_VMNAME to MAX_SNAPSHOT_FILENAME. Bump the size to PATH_MAX
while here.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28879

----

bhyvectl: print a better error message when vm_open() fails

Use errno to print a more descriptive error message when vm_open() fails

libvmm: preserve errno when vm_device_open() fails

vm_destroy() squashes errno by making a dive into sysctlbyname() - we
can safely skip vm_destroy() here since it's not doing any critical
clean up at this point. Replace vm_destroy() with a free() call.

PR:             250671
MFC after:      3 days
Submitted by:   marko@apache.org
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D29109

----

bhyve/snapshot: use SOCK_DGRAM instead of SOCK_STREAM

The save/restore feature uses a unix domain socket to send messages
from bhyvectl(8) to a bhyve(8) process. A datagram socket will suffice
for this.

An added benefit of using a datagram socket is simplified code. For
bhyve, the listen/accept calls are dropped; and for bhyvectl, the
connect() call is dropped.

EPRINTLN handles raw mode for bhyve(8), use it to print error messages.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28983

----

bhyve: virtio shares definitions between sys/dev/virtio

Definitions inside usr.sbin/bhyve/virtio.h are thrown away.
Definitions in sys/dev/virtio are used instead.

This reduces code duplication.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29084

----

Refactor configuration management in bhyve.

Replace the existing ad-hoc configuration via various global variables
with a small database of key-value pairs.  The database supports
heirarchical keys using a MIB-like syntax to name the path to a given
key.  Values are always stored as strings.  The API used to manage
configuation values does include wrappers to handling boolean values.
Other values use non-string types require parsing by consumers.

The configuration values are stored in a tree using nvlists.  Leaf
nodes hold string values.  Configuration values are permitted to
reference other configuration values using '%(name)'.  This permits
constructing template configurations.

All existing command line arguments now set configuration values.  For
devices, the "-s" option parses its option argument to generate a list
of key-value pairs for the given device.

A new '-o' command line option permits setting an individual
configuration variable.  The key name is always given as a full path
of dot-separated components.

A new '-k' command line option parses a simple configuration file.
This configuration file holds a flat list of 'key=value' lines where
the 'key' is the full path of a configuration variable.  Lines
starting with a '#' are comments.

In general, bhyve starts by parsing command line options in sequence
and applying those settings to configuration values.  Once this is
complete, bhyve then begins initializing its state based on the
configuration values.  This means that subsequent configuration
options or files may override or supplement previously given settings.

A special 'config.dump' configuration value can be set to true to help
debug configuration issues.  When this value is set, bhyve will print
out the configuration variables as a flat list of 'key=value' lines.

Most command line argments map to a single configuration variable,
e.g.  '-w' sets the 'x86.strictmsr' value to false.  A few command
line arguments have less obvious effects:

- Multiple '-p' options append their values (as a comma-seperated
  list) to "vcpu.N.cpuset" values (where N is a decimal vcpu number).

- For '-s' options, a pci.<bus>.<slot>.<function> node is created.
  The first argument to '-s' (the device type) is used as the value of
  a "device" variable.  Additional comma-separated arguments are then
  parsed into 'key=value' pairs and used to set additional variables
  under the device node.  A PCI device emulation driver can provide
  its own hook to override the parsing of the additonal '-s' arguments
  after the device type.

  After the configuration phase as completed, the init_pci hook
  then walks the "pci.<bus>.<slot>.<func>" nodes.  It uses the
  "device" value to find the device model to use.  The device
  model's init routine is passed a reference to its nvlist node
  in the configuration tree which it can query for specific
  variables.

  The result is that a lot of the string parsing is removed from
  the device models and centralized.  In addition, adding a new
  variable just requires teaching the model to look for the new
  variable.

- For '-l' options, a similar model is used where the string is
  parsed into values that are later read during initialization.
  One key note here is that the serial ports use the commonly
  used lowercase names from existing documentation and examples
  (e.g. "lpc.com1") instead of the uppercase names previously
  used internally in bhyve.

Reviewed by:	grehan
MFC after:	3 months
Differential Revision:	https://reviews.freebsd.org/D26035

----

bhyve: support relocating fbuf and passthru data BARs

We want to allow the UEFI firmware to enumerate and assign
addresses to PCI devices so we can boot from NVMe[1]. Address
assignment of PCI BARs is properly handled by the PCI emulation
code in general, but a few specific cases need additional support.
fbuf and passthru map additional objects into the guest physical
address space and so need to handle address updates. Here we add a
callback to emulated PCI devices to inform them of a BAR
configuration change. fbuf and passthru then watch for these BAR
changes and relocate the frame buffer memory segment and passthru
device mmio area respectively.

We also add new VM_MUNMAP_MEMSEG and VM_UNMAP_PPTDEV_MMIO ioctls
to vmm(4) to facilitate the unmapping needed for addres updates.

[1]: https://github.com/freebsd/uefi-edk2/pull/9/

Originally by:	scottph
MFC After:	1 week
Sponsored by:	Intel Corporation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D24066

----

bhyve amd: Small cleanups in amdvi_dump_cmds

Bump offset with MOD_INC instead in amdvi_dump_cmds.

Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D28862

----

bhyve hostbridge: Rename "device" property to "devid".

"device" is already used as the generic PCI-level name of the device
model to use (e.g. "hostbridge").  The result was that parsing
"hostbridge" as an integer failed and the host bridge used a device ID
of 0.  The EFI ROM asserts that the device ID of the hostbridge is not
0, so booting with the current EFI ROM was failing during the ROM
boot.

Fixes:		621b5090487de9fed1b503769702a9a2a27cc7bb

----

bhyve: Enable virtio-scsi legacy config parsing.

The previous commit added the handler to parse the command line
options for virtio-scsi devices but forgot to set the correct function
pointer to point to the handler.

Reported by:	vangyzen
Reviewed by:	vangyzen
Fixes:		621b5090487de9fed1b503769702a9a2a27cc7bb
Differential Revision:	https://reviews.freebsd.org/D29438

----

bhyve: change vq_getchain to return iovecs in both directions

The old prototype requires callers to inspect flags of each descriptors
to get the starting position of host-writable iovecs.

vq_getchain() is changed to return a virtio request with the number of
host-readable iovecs and host-writable iovecs instead. Callers can avoid
boilerplate code of getting the start offset of host-writable iovecs.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Reviewed by:	afedorov
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29433

----

Fix typo in xhci nvlist node name, and also increment device counter.

This allows the xhci tablet device to be recognized and a PCI device
instantiated.

Reviewed by:	jhb
Fixes:		621b5090487d Refactor configuration management in bhyve.
MFC after:	3 months.

----

bhyve: fix regression in legacy virtio-9p config parsing

Commit 621b5090487de9fed1b503769702a9a2a27cc7bb introduced a regression
in legacy virtio-9p config parsing by not initializing *sharename to
NULL. As a result, "sharename != NULL" check in the first iteration fails
and bhyve exits with "virtio-9p: more than one share name given".

Fix by adding NULL back.

Approved by:	grehan

----

bhyve: add SMBIOS Baseboard Information

Add the System Management BIOS Baseboard (or Module) Information
a.k.a. Type 2 structure to the SMBIOS emulation.

Reviewed by:	rgrimes, bcran, grehan
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29657

----

bhyve: Move the gdb_active check to gdb_cpu_suspend().

The check needs to be in the public routine (gdb_cpu_suspend()), not
in the internal routine called from various places
(_gdb_cpu_suspend()).  All the other callers of _gdb_cpu_suspend()
already check gdb_active, and this breaks the use of snapshots when
the debug server is not enabled since gdb_cpu_suspend() tries to lock
an uninitialized mutex.

Reported by:	Darius Mihai, Elena Mihailescu
Reviewed by:	elenamihailescu22_gmail.com
Fixes:		621b5090487de9fed1b503769702a9a2a27cc7bb
Differential Revision:	https://reviews.freebsd.org/D29538

----

bhyve: implement rdmsr for MSR_IA32_FEATURE_CONTROL

Without the -w option, Windows guests crash on boot. This is caused by a rdmsr
of MSR_IA32_FEATURE_CONTROL. Windows checks this MSR to determine enabled VMX
features. This MSR isn't emulated in bhyve, so a #GP exception is injected
which causes Windows to crash.

Fix by returning a rdmsr of MSR_IA32_FEATURE_CONTROL with Lock Bit set and
VMX disabled to informWindows that VMX isn't available.

Reviewed by:	jhb, grehan (bhyve)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29665

----

bhyve.8: Make synopsis more readable

There is no need to squeeze all the possible options into one synopsis
entry. Let "-l help" and "-s help" be listed separately.

While here, keep -s and its arguments on the same line.

MFC after:	2 weeks

----

bhyve: Fix synopsis in the usage message

In particular:
- Sort short options to align with style(9)
- Add two missing flags: -G and -r
- Drop unnecessary angle brackets for consistency
- Rename the "vm" argument to vmname for consistency with the manual
  page

MFC after:	2 weeks

----

bhyve: Improve the option description in the usage message

- Sort options as suggested by style(9)
- Capitalize some words like CPU and HLT
- Add a missing description for the -G flag

MFC after:	2 weeks

----

bhyve.8: Sort the options in the OPTIONS section

No content change intended. Just moving the option descriptions around
to follow the order suggested by style(9).

MFC after:	2 weeks

----

bhyve.8: Improve the description and synopsis of -l

- Describe "-l help" separately for readability.
- List all the supported comX devices explicitly
- Use Cm instead of Ar for command modifiers (i.e., literal values a
  user can specify as an argument to the command).
- Explain where to get more information about the possible values of the
  conf argument.

MFC after:	2 weeks

----

bhyve.8: Improve the description of the -m flag

- Stylize the synopsis with proper mdoc macros
- Do some wordsmithing on the description for consistency.

MFC after:	2 weeks

----

bhyve.8: Fix the synopsis of -p

Use appropriate mdoc macros.

MFC after:	2 weeks

----

bhyve.8: Clean up description of -r

There is no need to wrap those flags in Op macros.

MFC after:	2 weeks

----

bhyve.8: Fix indention in the signals table

MFC after:	2 weeks

----

bhyve.8: Clean-up synopsis of -s

- Document "-s help" separately for readability.
- Use appropriate mdoc macros.

MFC after:	2 weeks

----

bhyve.8: Clean up the slot description of -s

Also, remove the macros of the nested list which contained slot,
emulation and conf. This decreases the indention of the -s description.
It was necessary to clean up the slot description.

MFC after:	2 weeks

----

bhyve.8: Improve emulation description of the -s flag

- Set width of the list to the longest key word for readability.
- Separate descriptions of amd_hostbridge and hostbridge emulations.
  Also, wordsmith their descriptions for consistency with other entries.
- Use Cm instead of Li for command modifiers.
- Do not stylize AMD with Li, there's no need to do it.
- Mention COM3 and COM4 in the definition of lpc.
- Fix a typo in the definition of ahci-hd ("hard drive" instead of
  "hard-drive").

MFC after:	2 weeks

----

bhyve.8: Clean up network backends section

- Reformat the format lists, use appropriate mdoc macros for
  readability.
- Add a missing Oxford comma.

MFC after:	2 weeks

----

bhyve.8: Clean up block storage device backends description

MFC after:	2 weeks

----

bhyve.8: Clean up SCSI device backends section

MFC after:	2 weeks

----

bhyve.8: Clean up 9P device backends section

MFC after:	2 weeks

----

bhyve.8: Clean up TTY, boot ROM, and pass-through descriptions

MFC after:	2 weeks

----

bhyve.8: Clean up virtio console device backends description

MFC after:	2 weeks

----

bhyve.8: Improve framebuffer backends description

- Use appropriate mdoc macros
- Document that tcp= is a synonym to rfb= (tcp is used in the examples,
  but never mentioned)
- Clarify the IP address specification

MFC after:	2 weeks

----

bhyve.8: Improve documentation of NVME backend

- Document the configuration format.
- Document two additional configuration options: eui64 and dsm.

MFC after:	2 weeks

----

bhyve.8: Improve AHCI backends documentation

- Document the backend format.

MFC after:	2 weeks

----

bhyve: Document the format for HD audio backends

- This change is done for consistency with other backend definitions.

MFC after:	2 weeks

----

bhyve.8: Fix mandoc -Tlint issues

While here, keep network backends section consistent with other
sections.

MFC after:	2 weeks

----

bhyve: Be explicit that setting config.dump will not start a VM.

Suggested by:	rpokala
Reviewed by:	bcr (manpages)
Differential Revision:	https://reviews.freebsd.org/D29738

----

bhyve: Gracefully handle virtio-scsi with no conf

Fixes segfault with the command `bhyve -s 0,virtio-scsi`, which is used
by some third party software to probe bhyve for virtio-scsi support.

Reviewed by:	jhb
MFC after:	1 day
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D29926

----

bhyve: Set SO_REUSEADDR on the gdb stub socket

Reviewed by:	jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30037

----

bhyve/snapshot: provide a way to send other messages/data to bhyve

This is a step towards sending messages (other than suspend/checkpoint)
from bhyvectl to bhyve.

Introduce a new struct, ipc_message - this struct stores the type of
message and a union containing message specific structures for the type
of message being sent.

Reviewed by:    grehan
Differential Revision: https://reviews.freebsd.org/D30221

----

bhyve/snapshot: split up mutex/cond initialization from socket creation

Move initialization of the mutex/condition variables required by the
save/restore feature to their own function.

The unix domain socket that facilitates communication between bhyvectl
and bhyve doesn't rely on these variables in order to be functional.

Reviewed by:    markj
Differential Revision:  https://reviews.freebsd.org/D30281

----

Add a virtio-input device emulation.

This will be used to inject keyboard/mouse input events into a guest.
The command line syntax is:
   -s <slot>,virtio-input,/dev/input/eventX

Reviewed by:	jhb (bhyve), grehan
Obtained from:	Corvin Köhne <C.Koehne@beckhoff.com>
MFC after:	3 weeks
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D30020

----

bhyve: Register new kevents synchronously.

Change mevent_add*() to synchronously add the new kevent.  This
permits reporting event registration failures to the caller and avoids
failing the registration of other, unrelated events queued up in the
same batch.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30502

----

bhyve: Add support for EVFILT_VNODE mevents.

This allows registering an event to watch for changes to a file's
attributes.  This is a bit imperfect as it would be nice to have a way
to determine if an fd can use EVFILT_VNODE successfully.  mevent's
current structure does not permit that and a failure to register a
single kevent impacts several other kevents.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30503

----

bhyve: Add support for handling disk resize events to block_if.

Allow clients of blockif to register a resize callback handler.  When
a callback is registered, register an EVFILT_VNODE kevent watching the
backing store for a change in the file's attributes.  If the size has
changed when the kevent fires, invoke the clients' callback.

Currently resize detection is limited to backing stores that support
EVFILT_VNODE kevents such as regular files.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30504

----

bhyve: Split out a lower-level helper for VirtIO interrupts.

This allows device models to assert VirtIO interrupts for reasons
other than publishing changes to a VirtIO ring such as configuration
changes.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30505

----

bhyve vtblk: Inform guests of disk resize events.

Register a resize callback with the blockif interface.  When the
callback fires, update the size of the disk and notify the guest via a
configuration change interrupt.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30506

----

bhyve: enhance debug info for memory range clash

Explain what the two clashing regions are.

Reivewed by:		grehan, jhb
Differential Revision:	https://reviews.freebsd.org/D29696
Pull Request:		https://github.com/freebsd/freebsd-src/pull/463

----

Add more GIC and GICv3 registers

These aren't used by either driver, however they will be needed by
bhyve on arm64 to emulate a GICv3 interrupt controller.

Sponsored by:	Innovate UK

----

bhyve: Fix cli regression with NVMe ram

The configuration management refactoring inadvertently removed support
for a RAM-backed NVMe Namespace (i.e. -s X,nvme,ram=16384). This adds it
back.

Reported by:	andy@omniosce.org
Reviewed by:	jhb, andy@omniosce.org
Fixes:		621b5090487d
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D30717

----

bhyve: fix NVMe MDTS comment

Removes an obsolete comment and adds parenthesis around the macro while
in the area. No functional change.

----

bhyve: Fix NVMe iovec construction for large IOs

The UEFI driver included with Rocky Linux 8.4 uncovered an existing bug
in the NVMe emulation's construction of iovec's.

By default, NVMe data transfer operations use a scatter-gather list in
which all entries point to a fixed size memory region. For example, if
the Memory Page Size is 4KiB, a 2MiB IO requires 512 entries. Lists
themselves are also fixed size (default is 512 entries).

Because the list size is fixed, the last entry is special. If the IO
requires more than 512 entries, the last entry in the list contains the
address of the next list of entries. But if the IO requires exactly 512
entries, the last entry points to data.

The NVMe emulation missed this logic and unconditionally treated the
last entry as a pointer to the next list. Fix is to check if the
remaining data is greater than the page size before using the last entry
as a pointer to the next list.

PR:		256422
Reported by:	dave@syix.com
Tested by:	jason@tubnor.net
MFC after:	5 days
Relnotes:	yes
Reviewed by:	imp, grehan
Differential Revision:	https://reviews.freebsd.org/D30897

----

Append Keyboard Layout specified option for using VNC.
Part one: supporting QEMU Extended Keyboard Event Message

PR:             246121
Submitted by:   koinec@yahoo.co.jp
Differential Revision: https://reviews.freebsd.org/D29430

----

libvmm: explicitly save and restore errno in vm_open()

In commit 6bb140e3ca895a14, vm_destroy() was replaced with free() to
preserve errno. However, it's possible that free() may change the errno
as well. Keep the free() call, but explicitly save and restore errno.

Noted by: jhb
Fixes: 6bb140e3ca895a14

----

vmm: Let guests enable SMEP/SMAP if the host supports it

Reviewed by:	kib, grehan, jhb
Tested by:	grehan (AMD)
MFC after:	3 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30462

----

vmm: Fix ivrs_drv device_printf usage

The original %b description string is wrong.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	imp, jhb
Differential Revision:	https://reviews.freebsd.org/D30805

----

bhyve/vioapic: remove an extra pin masked check

vioapic_send_intr does already check whether the pin is masked before
injecting the interrupt, there's no need to do it in vioapic_write
also.

No functional change intended.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28236

----

bhyve/ioapic: only account for asserted line in level mode

After modifying a redirection entry only try to inject an interrupt if
the pin is in level mode, pins in edge mode shouldn't take into
account the line assert status as they are triggered by edge changes,
not the line status itself.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28237

----

bhyve/ioapic: improve the tracking of IRR bit

One common method of EOI'ing an interrupt at the IO-APIC level is to
switch the pin to edge triggering mode and then back into level mode.
That would cause the IRR bit to be cleared and thus further interrupts
to be injected. FreeBSD does indeed use that method if the IO-APIC EOI
register is not supported.

The bhyve IO-APIC emulation code didn't clear the IRR bit when doing
that switch, and was also missing acknowledging the IRR state when
trying to inject an interrupt in vioapic_send_intr.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28238

----

ivrs_drv: Fix IVHDs with duplicated BaseAddress

Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28945

----

AMD-vi: Fix IOMMU device interrupts being overridden

Currently, AMD-vi PCI-e passthrough will lead to the following lines in
dmesg:
"kernel: CPU0: local APIC error 0x40
ivhd0: Error: completion failed tail:0x720, head:0x0."

After some tracing, the problem is due to the interaction with
amdvi_alloc_intr_resources() and pci_driver_added(). In ivrs_drv, the
identification of AMD-vi IVHD is done by walking over the ACPI IVRS
table and ivhdX device_ts are added under the acpi bus, while there are
no driver handling the corresponding IOMMU PCI function. In
amdvi_alloc_intr_resources(), the MSI intr are allocated with the ivhdX
device_t instead of the IOMMU PCI function device_t. bus_setup_intr() is
called on ivhdX. the IOMMU pci function device_t is only used for
pci_enable_msi(). Since bus_setup_intr() is not called on IOMMU pci
function, the IOMMU PCI function device_t's dinfo->cfg.msi is never
updated to reflect the supposed msi_data and msi_addr. So the msi_data
and msi_addr stay in the value 0. When pci_driver_added() tried to loop
over the children of a pci bus, and do pci_cfg_restore() on each of
them, msi_addr and msi_data with value 0 will be written to the MSI
capability of the IOMMU pci function, thus explaining the errors in
dmesg.

This change includes an amdiommu driver which currently does attaching,
detaching and providing DEVMETHODs for setting up and tearing down
interrupt. The purpose of the driver is to prevent pci_driver_added()
from calling pci_cfg_restore() on the IOMMU PCI function device_t.
The introduction of the amdiommu driver handles allocation of an IRQ
resource within the IOMMU PCI function, so that the dinfo->cfg.msi is
populated.

This has been tested on EPYC Rome 7282 with Radeon 5700XT GPU.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28984

----

Correct "Fondation" typo (missing "u")

----

AMD-vi: Mixed format IVHD block should replace fixed format IVHD block

This fixes double IVHD_SETUP_INTR calls on the same IOMMU device.

Sponsored by:	The FreeBSD Foundation
MFC with:	74ada297e897
Reported by:	Oleg Ginzburg <olevole@olevole.ru>
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29521

----

vmm: Fix AMD-vi using wrong rid range

The ACPI parsing code around rid range was wrong on assuming there is
only one pair of start/end device id range. Besides, ivhd_dev_parse()
never work as supposed. The start/end rid info was always zero.

Restructure the code to build dynamic-sized tables for each IOMMU softc
holding device entries. The device entries are enumerated to find a
suitable IOMMU unit. Operations on devices not governed (e.g. the IOMMU
unit itself) are no-op from now on. There are also a minor fix on wrong
%b formatting string usage.

Tested on my EPYC 7282.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D30827

----

vmm: Bump vmname buffer in struct vm to VM_MAX_NAMELEN + 1

In hw.vmm.create sysctl handler the maximum length of vm name is
VM_MAX_NAMELEN. However in vm_create() the maximum length allowed is
only VM_MAX_NAMELEN - 1 chars. Bump the length of the internal buffer to
allow the length of VM_MAX_NAMELEN for vm name.

MFC after:	3 days
Reviewed by:	grehan
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31372

----

amd64: Fix output operand specs for the stmxcsr and vmread intrinsics

This does not appear to affect code generation, at least with the
default toolchain.

Noticed because incorrect output specifications lead to false positives
from KMSAN, as the instrumentation uses them to update shadow state for
output operands.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31466

----

vmm: Make iommu ops tables const

While here, use designated initializers and rename some AMD iommu method
implementations to match the corresponding op names.  No functional
change intended.

Reviewed by:	grehan
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31462

----

vmm: Fix wrong assert in ivhd_dev_add_entry

The correct condition is to check the number of ivhd entries fit into
the array.

Reported by:	bz
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D31514

----

vmm: Add credential to cdev object

Add a credential to the cdev object in sysctl_vmm_create(), then check
that we have the correct credentials in sysctl_vmm_destroy(). This
prevents a process in one jail from opening or destroying the /dev/vmm
file corresponding to a VM in a sibling jail.

Add regression tests.

Reviewed by:	jhb, markj
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31156

----

bhyve: net_backends, automatically IFF_UP tap devices

If you want communications with the outside world and tell bhyve to
create an interfaces then it should be usable as well.
Rather than relying on the sysctl net.link.tap.up_on_open automatically
try to IFF_UP the opened tap device.

MFC after:	10 days
Reviewed by:	markj, grehan
Differential Revision: https://reviews.freebsd.org/D31342

----

bhyve: Use fspacectl(2) for BOP_DELETE on regular file images

bhyve can also make use of fspacectl(2) to implement BOP_DELETE with
hole-punching. Since it is not desirable to do zero-filling for large
DEALLOCATE/UNMAP range, candelete is not set if pathconf(2) indicates
that the underlying file system does not support native
VOP_DEALLOCATE(9).

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D28880

----

bhyve: Use pci(4) to access I/O port BARs

This removes the dependency on /dev/io.

PR:		251046
Reviewed by:	jhb
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31308

----

byhve: add option to specify IP address for gdb

Allow user to specify the IP address available for gdb debugger.

Reviewed by:	jhb, grehan, rgrimes, bcr (man pages)
Differential Revision:	https://reviews.freebsd.org/D29607

----

bhyve: change a default address from ANY to localhost

Discussed with:     grehan, jhb

----

bhyve: Fix vq_getchain() error handling bugs in various device models

Reviewed by:	grehan, khng
Approved by:	so
Security:	CVE-2021-29631
Security:	FreeBSD-SA-21:13.bhyve

----

pci: Add an ioctl to perform I/O to BARs

This is useful for bhyve, which otherwise has to use /dev/io to handle
accesses to I/O port BARs when PCI passthrough is in use.

Reviewed by:	imp, kib
Discussed with:	jhb
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31307

----

bhyve: Nuke double-semicolons

A distinct number of double-semicolons ended up in bhyve. Take a pass at
getting rid of many of these harmless typos.

MFC after:	3 days

----

bhyve: Fix pci device node key in bhyve_config.5

PCI device node key in the manual page is wrong. It should be
pci.bus.slot.function.

MFC after:	3 days

----

bhyve: Support setting the disk serial number for VirtIO block devices.

Reviewed by:	allanjude
Obtained from:	illumos
Differential Revision:	https://reviews.freebsd.org/D31983

----

bhyve: Update the -G description in the SYNPOSIS.

It was missing both the 'w' flag and 'bind_address'.

----

bhyve_config.5: Document gdb.address.

----

bhyve: Add an empty case for event types in mevent_kq_fflags().

This fixes a -Wswitch error raised by GCC 9.

Differential Revision:	https://reviews.freebsd.org/D31938

----

bhyve: Map the MSI-X table unconditionally for passthrough

It is possible for the PBA to reside in the same page as the MSI-X
table.  And, while devices are not supposed to do this, at least some
Intel wifi devices place registers in a page shared with the MSI-X
table.  To handle the first case we currently map the PBA page using
/dev/mem, and the second case is not handled.

Kill two birds with one stone: map the MSI-X table BAR using the
PCIOCBARMMAP ioctl instead of /dev/mem, and map the entire table so that
accesses beyond the bounds of the table can be emulated.  Regions of the
BAR not containing the table are left unmapped.

Reviewed by:	bz, grehan, jhb
MFC after:	3 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32359

----

bhyve.8: Fix markup of the -G flag

----

bhyve: Update usage and synopsis for the -k flag

Let's make it clear to users that -k is for configuration files.
Also, point to bhyve_config(5) in the paragraph describing the flag.

Reviewed by:	jhb
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D32467

----

bhyve: ignore low bits of CFGADR

Bhyve could emulate wrong PCI registers.
In the best case, the guest reads wrong registers and the device driver would
report some errors.
In the worst case, the guest writes to wrong PCI registers and could brick
hardware when using PCI passthrough.

According to Intels specification, low bits of CFGADR should be
ignored. Some OS like linux may rely on it. Otherwise, bhyve could
emulate a wrong PCI register.

E.g.
If linux would like to read 2 bytes from offset 0x02, following would
happen.
linux:
	outl 0x80000002 at CFGADR
	inw  at CFGDAT + 2
bhyve:
	cfgoff = 0x80000002 & 0xFF = 0x02
	coff   = cfgoff + (port - CFGDAT) = 0x02 + 0x02 = 0x04
Bhyve would emulate the register at offset 0x04 not 0x02.

Reviewed By: #bhyve, grehan
Differential Revision: https://reviews.freebsd.org/D31819
Sponsored by:	       Beckhoff Automation GmbH & Co. KG

----

bhyve: Fix the WITH_BHYVE_SNAPSHOT build

Note, this breaks compatibility with snapshots generated by older builds
of bhyve(8).

Fixes: 7fa233534736 ("bhyve: Map the MSI-X table unconditionally for passthrough")
Reported by:	Greg V <greg@unrelenting.technology>
Reviewed by:	grehan, bz
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32523

----

bhyve: Bump the SMBIOS firmware version to 14.0 for 14-CURRENT

Bump the firmware version to 14.0 and set the firmware release date
to today.

Reviewed by: jhb, bz, imp
Differential Revision: https://reviews.freebsd.org/D32534

----

bhyve: use physical lobits for BARs of passthru devices

Tell the guest whether a BAR uses prefetched memory or not for
passthru devices by using the same lobits as the physical device.

Reviewed by:	 grehan
Sponsored by:	 Beckhoff Autmation GmbH & Co. KG
Differential Revision:	  https://reviews.freebsd.org/D32685

----

bhyve: do not explicitly map fbuf framebuffer

Allocating a BAR will call baraddr which maps the framebuffer. No need
to allocate it explicitly on init.

Reviewed by:     grehan
Sponsored by:    Beckhoff Autmation GmbH & Co. KG
Differential Revision:    https://reviews.freebsd.org/D32596

----

bhyve: move 64 bit BAR location to match OVMF assumptions

OVMF will fail, if large 64 bit BARs are used. GCD-Map doesn't cover
64 bit addresses of BARs.
OVMF assumes that 64 bit addresses of BARS are located on next 32 GB
boundary behind Top of High RAM.

This patch moves 64 bit BARs on next 32 GB boundary behind Top of High
RAM to match OVMF assumptions.

Differential Revision:	https://reviews.freebsd.org/D27970
Sponsored by: Beckhoff Automation GmbH & Co. KG

----

bhyve: use a fixed 32 bit BAR base address

OVMF always uses 0xC0000000 as base address for 32 bit PCI MMIO space.
For that reason, we should use that address too.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D31051
Sponsored by:	Beckhoff Automation GmbH & Co. KG

----

bhyve: keep physical and virtual COMMAND reg in sync

On startup all virtual BARs are registered.
Additionally, the encoding bit in the virtual cmd register is set.
After that, the passthru emulation overwrites the virtual cmd register with
the physical one.
This could lead to a mismatch between registered BARs and the encoding
bits in the cmd register.
Instead of writing the physical to the virtual cmd register,
write the virtual to the physical cmd register to solve this issue.

Reviewed by:	  markj
Differential Revision:	https://reviews.freebsd.org/D32687
Sponsored by:	Beckhoff Automation GmbH & Co. KG

----

bhyve: emulate reads of MSI-X capabilities for passthru devices

Reads of the MSI-X capabilites aren't emulated by passthru devices
yet. The guest will read the host MSI-X capabilites which could
cause issues.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D32686
Sponsored by:	Beckhoff Automation GmbH & Co. KG

----

bhyve: Fix compile

We need err.h

Fixes:	5cf21e48ccf11 ("bhyve: use a fixed 32 bit BAR base address")
Sponsored by:	Bechoff Automation GmbH & Co. KG

----

bhyve blockif: fix blockif_candelete with Capsicum

NVMe conformance tests for the Format command failed if the
backing-storage for the bhyve device was a file instead of a Zvol. The
tests (and the specification) expect a Format to destroy all previously
written data. The bhyve NVMe emulation implements this by trimming /
deallocating all data from the backing-storage.

The blockif_candelete() function indicated the file did not support
deallocation (i.e. fpathconf(..., _PC_DEALLOC_PRESENT) returned FALSE)
even though the kernel supported file hole punching. This occurs on
builds with Capsicum enabled because blockif did not allow the
fpathconf(2) right.

Fix is to add CAP_FPATHCONF to the cap_rights_init(3) call.

PR:		260081
Reviewed by:	allanjude, markj, jhb
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D33203

----

bhyve: fix -Wunused-but-set-variable warning

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D33306

----

bhyve: Support a _VARS.fd file for bootrom

OVMF creates two separate .fd files, a _CODE.fd file containing
the UEFI code, and a _VARS.fd file containing a template of an
empty UEFI variable store.

OVMF decides to write variables to the memory range just below the
boot rom code if it detects a CFI flash device. So here we add
just the barest facsimile of CFI command handling to bootrom.c
that is needed to placate OVMF.

Submitted by: D Scott Phillips <d.scott.phillips@intel.com>
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19976
MFC After: 1 week

----

bhyve: set EV_CLEAR for EVFILT_VNODE mevents

When an EVFILT_VNODE filter event is triggered, reset it.

This fixes the issue where a virtio-blk resize event would cause the
mevent thread to consume 100% of the cpu.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D33326

----

bhyve nvme: Add AEN support to NVMe emulation

Add Asynchronous Event Notification infrastructure to the NVMe
emulation.

Reviewed by:	imp, grehan
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D32952

----

bhyve nvme: Inform guests of namespace resize

Register a "block resize" callback to be notified of changes to the
backing storage for the Namespace. Use this to generate an Asynchronous
Event Notification, Namespace Attributes Changed when the guest OS
provides an Asynchronous Event Request.

MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D32953

----

bhyve: Only snapshot initialized VirtIO queues

If the virtio device is not fully initialized, then suspend fails with:

  vi_pci_snapshot_queues: invalid address: vq->vq_desc
  Failed to snapshot virtio-rnd; ret=14

MFC after:	1 week
Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D26268

----

bhyve: passthru: enable BARs before possibly mmap(2)ing them

The first time we start bhyve with a passthru device everything is fine
as on boot we do enable BARs.  If a driver (unload) inside bhyve disables
the BAR(s) as some Linux drivers do, we need to make sure we re-enable
them on next bhyve start.

If we are trying to mmap a disabled BAR for MSI-X (PCIOCBARMMAP)
the kernel will give us an EBUSY.
While we were re-enabling the BAR(s) in the current code loop
cfginit() was writing the changes out too late to the real hardware.

Move the call to init_msix_table() after the register on the real
hardware was updated.  That way the kernel will be happy and the
mmap will succeed and bhyve will start.
Also simplify the code given the last argument to init_msix_table()
is unused we do not need to do checks for each bar. [1]

MFC after:	3 days
PR:		260148
Pointed out by:	markj [1]
Sponsored by:	The FreeBSD Foundation
Reviewed by:	markj
Differential Revision: https://reviews.freebsd.org/D33628

----

bhyve: clean up trailing whitespaces

Clean up trailing whitespaces. No functional changes.

Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D33681

----

bhyve smbios type 3 structure is incorrect

If you look at the SMBIOS specification, we'll find something is
missing. In particular at offset 0Dh is supposed to be the OEM-defined
field. This should go between security and height. It is not legal to
actually skip this and will lead to other folks not properly
interpreting later parts of the table.

https://www.illumos.org/issues/14312

Reviewed by:	jhb
Submitted by:	Robert Mustacchi <rm@fingolfin.org>
Obtained from:	ilumos
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D33682

----

bhyve: only init MSI-X table if passthru device supports it

Some passthru devices only support MSI instead of MSI-X. For those
devices the initialization of MSI-X table will fail. Re-add the
check erroneously removed in f1442847c9404d4bc5f5524a0c3362dd39cb14f9.

MFC after:	3 days
X-MFC with:	f1442847c9404d4bc5f5524a0c3362dd39cb14f9
PR:		260148
Reviewed by:	manu, bz
Differential Revision:	https://reviews.freebsd.org/D33728

----

bhyve: enumerate BARs by size

E.g. Framebuffers can require large space and BARs need to be aligned
by their size. If BARs aren't allocated by size, it'll cause much
fragmentation of the MMIO space. Reduce fragmentation by ordering
the BAR allocation on their size to reduce the risk of
OUT_OF_MMIO_SPACE issues.

Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	Beckhoff Automation GmbH & Co. KG
Differential Revision:	https://reviews.freebsd.org/D28278

----

bhyve: allow reading of fwctl signature multiple times

At the moment, you only have one single chance to read the fwctl
signature. At boot bhyve is in the state IDENT_WAIT. It's then
possible to switch to IDENT_SEND. After bhyve sends the signature,
it switches to REQ. From now on it's impossible to switch back to
IDENT_SEND to read the signature. For that reason, only a single
driver can read the signature. A guest can't use two drivers to
identify that fwctl is present. It gets even worse when using
OVMF. OVMF uses a library to access fwctl. Therefore, every single
OVMF driver would try to read the signature. Currently, only a
single OVMF driver accesses the fwctl. So, there's no issue with
it yet. However, no OS driver would have a chance to detect fwctl when
using OVMF because it's signature was already consumed by OVMF.

Reviewed by:    markj
MFC after:      2 weeks
Sponsored by:   Beckhoff Automation GmbH & Co. KG
Differential Revision:  https://reviews.freebsd.org/D31981

----

bhyve: add more slop to 64 bit BARs

Bhyve allocates small 64 bit BARs below 4 GB and generates ACPI tables
based on this allocation. If the guest decides to relocate those BARs
above 4 GB, it could lead to mismatching ACPI tables. Especially
when using OVMF with enabled bus enumeration it could cause
issues. OVMF relocates all 64 bit BARs above 4 GB. The guest OS
may be unable to recover from this situation and disables some PCI
devices because their BARs are located outside of the MMIO space
reported by ACPI. Avoid this situation by giving the guest more
space for relocating BARs.

Let's be paranoid. The available space for BARs below 4 GB is 512 MB
large. Use a slop of 512 MB. It'll allow the guest to relocate all
BARs below 4 GB to an address above 4 GB. We could run into issues
when we exceeding the memlimit above 4 GB. However, this space has
a size of 32 GB. Even when using many PCI device with large BARs
like framebuffer or when using multiple PCI busses, it's very
unlikely that we run out of space due to the large slop.
Additionally, this situation will occur on startup and not at runtime
which is much better.

Reviewed by:    markj
MFC after:      2 weeks
Sponsored by:   Beckhoff Automation GmbH & Co. KG
Differential Revision:  https://reviews.freebsd.org/D33118

----

bhyve: dynamically register FwCtl ports

Qemu's FwCfg uses the same ports as Bhyve's FwCtl. Static allocated
ports wouldn't allow to switch between Qemu's FwCfg and Bhyve's
FwCtl.

Reviewed by:    markj
MFC after:      2 weeks
Sponsored by:   Beckhoff Automation GmbH & Co. KG
Differential Revision:  https://reviews.freebsd.org/D33496

----

bhyve: Map the right BAR in init_msix_table()

The PBA and MSI-X table can reside in different BARs.

Reported by:	Andy Fiddaman <andy@omniosce.org>
Reviewed by:	jhb
Fixes:		7fa233534736 ("bhyve: Map the MSI-X table unconditionally for passthrough")
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33739

----

bhyve: Correct unmapping of the MSI-X table BAR

The starting address passed to mprotect was wrong, so in the case where
the last page containing the table is not the last page of the BAR, the
wrong region would be unmapped.

Reported by:	Andy Fiddaman <andy@omniosce.org>
Reviewed by:	jhb
Fixes:		7fa233534736 ("bhyve: Map the MSI-X table unconditionally for passthrough")
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33739

----

bhyve: add nvlist functions for setting unset nodes

If an emulation uses those functions instead of set_config_value_node
or set_config_value, it allows the config values to get
overwritten. Introducing new functions is much more readable than
if else statements in the emulation code.

Reviewed by:	khng
MFC after:	2 weeks
Sponsored by:	Beckhoff Automation GmbH & Co. KG
Differential Revision:	https://reviews.freebsd.org/D33770

----

bhyve: get mediasize for character devices when resizing virtio-blk

Reviewed by:	imp, allanjude, jhb
Differential Revision:	https://reviews.freebsd.org/D33403

----

bhyve/snapshot: fix pthread_create() error check

pthread_create() returns 0 on success or an error number on failure.

Reviewed by:	khng, markj
Differential Revision:	https://reviews.freebsd.org/D33930

----

Append Keyboard Layout specified option for using VNC.
Part two: Append bhyve -K option for specified keyboard layout
with layout setting files every languages.
Since the cmd option '-k' was used in the meantime
it was changed to '-K'

PR:		246121
Submitted by:	koinec@yahoo.co.jp
Reviewed by:	grehan@
Differential Revision:	https://reviews.freebsd.org/D29473

MFC after:	4 weeks

----

bhyve: ahci: Fix regression with no ports

An AHCI controller may be specified with no connected ports.  Avoid
dumping core in this case for compatibility with existing VM configs.

Reviewed by:	khng, jhb
Fixes:		621b5090487de Refactor configuration management in bhyve.
MFC after:	1 week
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D33969

----

bhyve/block_if: allow DIOCGMEDIASIZE ioctl

This is needed to get mediasize of the device after a resize event.

I missed this earlier as I was building WITH_BHYVE_SNAPSHOT, which
disables capsicum.

Reviewed by:	khng, markj
Fixes: ae9ea22e14bf ("bhyve: get mediasize for character devices when ...")
Differential Revision:	https://reviews.freebsd.org/D34013

----

pkgbase: bhyve: Tag the kbdlayout file to be in the bhyve package

----

bhyve nvme: Advertise v1.4 support

Bump advertised NVMe support from v1.3 to v1.4

Reviewed by:	allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33564

----

bhyve nvme: Fix NVM Format completion status

The NVM Format command is unique among the Admin commands in that it
needs to finish asynchronously. For this reason, the emulation code
invented a synthetic completion status (NVME_NO_STATUS) to indicate that
the command was still in progress and the command processing loop should
not generate a completion message. The implementation used the value
0xffff for the synthetic value as this set both the Status Code and
Status Code Type fields to reserved values.

Format initialized the completion status to this value and expected
error cases to override it with a status code/type appropriate to the
situation. The macros used to set the NVMe status are careful not to
modify bit 0 (i.e. the phase bit), which with the synthetic completion
status, causes the phase bit to get out of sync. When running tests in a
guest with illegal NVM Format commands, Admin commands would eventually
hang because it appeared there were no completions due to the incorrect
phase bit value.

Fix is to only set NVME_NO_STATUS if the blockif delete command
succeeds. While in the neighborhood, add a missing break statement when
NVM Format is not supported.

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33565

----

bhyve nvme: Fix Namespace Specific Set Features

Return an error if the feature specified in Set Features is Namespace
specific but the Namespace ID uses the Global Namespace tag.

Fixes UNH Test 1.2.7

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33566

----

bhyve nvme: Implement Log Page Offset

Modify the Get Log Page command to parse the Log Page Offset fields to
support more recent versions of the NVMe specification.

Fixes various tests for UNH Test 1.3.*

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33568

----

bhyve nvme: Add missing Admin opcodes

Don't treat unsupported Admin commands as Invalid Opcode. Instead return
the proper Invalid Field in Command.

Fixes UNH IOL test 1.17.2

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33569

----

bhyve nvme: Remove redundant AER Limit checks

The NVMe emulation checked if the Asynchronous Event Request Limit
(a.k.a AERL) would be exceeded in pci_nvme_aer_add(), but this function
is only called from nvme_opc_async_event_req() which also checks for
exceeding the AERL.

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33570

----

bhyve nvme: Fix Set Features

Be more conservative and only support the Features mandatory for an I/O
Controller.

Avoids a "hang" in UNH test 1.2.10 associated with Predictable Latency
Mode Configuration and Host Behavior Support features.

Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33571

----

bhyve nvme: Add Temperature Threshold support

This adds the ability for a guest OS to send Set / Get Feature,
Temperature Threshold commands. The implementation assumes a constant
temperature and will generate an Asynchronous Event Notification if the
specified threshold is above/below this value. Although the
specification allows 9 temperature values, this implementation only
implements the Composite Temperature.

While in the neighborhood, move the clear of the CSTS register in the
reset function after all other cleanup. This avoids a race with the
guest thinking the reset is complete (i.e. CSTS.RDY = 0) before the NVMe
emulation is actually complete with the reset.

Fixes UNH IOL 16.0 Test 1.7, cases 1, 2, and 4.

Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33572

----

bhyve nvme: Update v1.4 Identify Controller data

Compliant v1.4 Controllers must report a Controller Type (CNTRLTYPE).
Also, do not advertise secure erase functionality in the Format NVM
Attributes field of the Identify Controller data structure as the
Controller does not implement secure erase.

Fixes UNH ILO Test 1.1, Case 2

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33573

----

bhyve nvme: Add Select support to Get Features

Implement basic support for the SEL field of Get Features. This returns
information about Namespace Specific features.

Fixes UNH ILO 16.0 Test 1.2, Case 13

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33574

----

bhyve nvme: Fix LBA out-of-range calculation

The function which checks for a valid LBA range mistakenly named an
input value as NLB ("Number of Logical Blocks") instead of "number of
blocks". The NVMe specification defines NLB as a zero-based value (i.e.
NLB=0x0 represents 1 block, 0x1 is 2 blocks, etc.), but the passed
parameter is a 1's-based value.

Fix is to rename the variable to avoid future confusion.

While in the neighborhood, also check that the starting LBA is less than
the size of the backing storage to avoid an integer overflow.

Reviewed by:	imp, allanjude, jhb
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33575

----

bhyve nvme: Fix reported VWC value

v1.4 and later NVMe Controllers report "Flush all Namespaces" support
differently.

Fixes UNH IOL 16.0 Test 2.6, Case 3

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33576

----

bhyve nvme: Fix Set Features, AEN

NVMe Controllers which do not support Endurance Groups must return an
error when the Endurance Group Event Aggregate Log Change Notices bit is
set in Set Features, Asynchronous Event Configuration.

Fixes UNH IOL Test 3.12, Case 8

Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33577

----

bhyve nvme: Fix Identify Namespace, NSID=ffffffff

If the NVMe Controller doesn't support Namespace Management, it should
return "Invalid Namespace or Format" when the Host request Identify
Namespace with the global NSID value.

Fixes UNH IOL 16.0 Test 9.1, Case 6

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33578

----

bhyve/virtio: use correct device id for virtio-scsi

Section 4.1.2.1 of the virtio spec states that the transitional PCI
device id for a scsi device is 0x1004.

Fix suggested by reporter.

PR:             259961
Reported by:    me@nanaya.pro
Reviewed by:	imp, jhb
Fixes:  f9c005a17f4e ("Add bhyve virtio-scsi storage backend support.")
Differential Revision:	https://reviews.freebsd.org/D34103

----

Create VM_MEMATTR_DEVICE on all architectures

This is intended to be used with memory mapped IO, e.g. from
bus_space_map with no flags, or pmap_mapdev.

Use this new memory type in the map request configured by
resource_init_map_request, and in pciconf.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D29692

----

Use if ... else when printing memory attributes

In vmstat there is a switch statement that converts these attributes to
a string. As some values can be duplicate we have to hide these from
userspace.

Replace this switch statement with an if ... else macro that lets us
repeat values without a compiler error.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	ABT Systems Ltd
Differential Revision:	https://reviews.freebsd.org/D29703

----

Remove an always-true check.

This fixes a -Wtype-limits error from GCC 9.

Reviewed by:	imp
Differential Revision:	https://reviews.freebsd.org/D31936

----

vlapic: Schedule callouts on the local CPU

The virtual LAPIC driver uses callouts to implement the LAPIC timer.
Callouts are armed using callout_reset_sbt(), which currently puts
everything on CPU 0.  On systems running many bhyve VMs this results in
a large amount of contention for CPU 0's callout lock.

Modify vlapic to schedule callouts on the local CPU instead.  This
allows timer interrupts to be scheduled more evenly among CPUs where
bhyve is running.

Reviewed by:	grehan, jhb
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32559

----

vmm: vlapic resume can eat 100% CPU by vlapic_callout_handler

Suspend/Resume of Win10 leads that CPU0 is busy on handling interrupts.

Win10 does not use LAPIC timer to often and in most cases, and I see it
is disabled by writing 0 to Initial Count Register (for Timer).

During resume, restart timer only for enabled LAPIC and enabled timer
for that LAPIC.

Reviewed by:	markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D33448

----

bhyve: add support for MTRR

Some guests or driver might depend on MTRR to work properly. E.g. the
nvidia gpu driver won't work without MTRR.

Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	Beckhoff Automation GmbH & Co. KG
Differential Revision:	https://reviews.freebsd.org/D33333
Yamagi added a commit to Yamagi/freebsd that referenced this pull request Sep 18, 2022
Includes all commits up to 2022/02/06 or d21e71efce39.

----

libvmm: clean up vmmapi.h

struct checkpoint_op, enum checkpoint_opcodes, and
MAX_SNAPSHOT_VMNAME are not vmm specific, move them out of the vmmapi
header.

They are used for the save/restore functionality that bhyve(8)
provides and are better suited in usr.sbin/bhyve/snapshot.h

Since bhyvectl(8) requires these, the Makefile for bhyvectl has been
modified to include usr.sbin/bhyve/snapshot.h

Reviewed by:    kevans, grehan
Differential Revision:  https://reviews.freebsd.org/D28410

----

bhyve/snapshot: drop mkdir when creating the unix domain socket

Add /var/run/bhyve/ to BSD.var.dist so we don't have to call mkdir when
creating the unix domain socket for a given bhyve vm.

The path to the unix domain socket for a bhyve vm will now be
/var/run/bhyve/vmname instead of /var/run/bhyve/checkpoint/vmname

Move BHYVE_RUN_DIR from snapshot.c to snapshot.h so it can be shared
to bhyvectl(8).

Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D28783

----

bhyve/snapshot: rename checkpoint_opcodes to be more generic

Generalize the naming here since the domain socket that uses these codes
might be used for purposes other than the save/restore feature.

- rename checkpoint_opcodes to ipc_opcode

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28877

----

bhyvectl: reduce code duplication

Combine send_start_checkpoint() and send_start_suspend() into a
single function named snapshot_request().

snapshot_request() is equivalent to send_start_checkpoint() and
send_start_suspend() except that it takes an additional argument. The
additional argument, enum ipc_opcode, is used to determine the type of
snapshot request being performed. Also, switch to using strlcpy instead
of strncpy.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28878

----

bhyve/snapshot: rename and bump size of MAX_SNAPSHOT_VMNAME

MAX_SNAPSHOT_VMNAME is a macro used to set the size of a character
buffer that stores a filename or the path to a file - this file is used
by the save/restore feature.

Since the file doesn't have anything to do with a vm name, rename
MAX_SNAPSHOT_VMNAME to MAX_SNAPSHOT_FILENAME. Bump the size to PATH_MAX
while here.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28879

----

bhyvectl: print a better error message when vm_open() fails

Use errno to print a more descriptive error message when vm_open() fails

libvmm: preserve errno when vm_device_open() fails

vm_destroy() squashes errno by making a dive into sysctlbyname() - we
can safely skip vm_destroy() here since it's not doing any critical
clean up at this point. Replace vm_destroy() with a free() call.

PR:             250671
MFC after:      3 days
Submitted by:   marko@apache.org
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D29109

----

bhyve/snapshot: use SOCK_DGRAM instead of SOCK_STREAM

The save/restore feature uses a unix domain socket to send messages
from bhyvectl(8) to a bhyve(8) process. A datagram socket will suffice
for this.

An added benefit of using a datagram socket is simplified code. For
bhyve, the listen/accept calls are dropped; and for bhyvectl, the
connect() call is dropped.

EPRINTLN handles raw mode for bhyve(8), use it to print error messages.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D28983

----

bhyve: virtio shares definitions between sys/dev/virtio

Definitions inside usr.sbin/bhyve/virtio.h are thrown away.
Definitions in sys/dev/virtio are used instead.

This reduces code duplication.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29084

----

Refactor configuration management in bhyve.

Replace the existing ad-hoc configuration via various global variables
with a small database of key-value pairs.  The database supports
heirarchical keys using a MIB-like syntax to name the path to a given
key.  Values are always stored as strings.  The API used to manage
configuation values does include wrappers to handling boolean values.
Other values use non-string types require parsing by consumers.

The configuration values are stored in a tree using nvlists.  Leaf
nodes hold string values.  Configuration values are permitted to
reference other configuration values using '%(name)'.  This permits
constructing template configurations.

All existing command line arguments now set configuration values.  For
devices, the "-s" option parses its option argument to generate a list
of key-value pairs for the given device.

A new '-o' command line option permits setting an individual
configuration variable.  The key name is always given as a full path
of dot-separated components.

A new '-k' command line option parses a simple configuration file.
This configuration file holds a flat list of 'key=value' lines where
the 'key' is the full path of a configuration variable.  Lines
starting with a '#' are comments.

In general, bhyve starts by parsing command line options in sequence
and applying those settings to configuration values.  Once this is
complete, bhyve then begins initializing its state based on the
configuration values.  This means that subsequent configuration
options or files may override or supplement previously given settings.

A special 'config.dump' configuration value can be set to true to help
debug configuration issues.  When this value is set, bhyve will print
out the configuration variables as a flat list of 'key=value' lines.

Most command line argments map to a single configuration variable,
e.g.  '-w' sets the 'x86.strictmsr' value to false.  A few command
line arguments have less obvious effects:

- Multiple '-p' options append their values (as a comma-seperated
  list) to "vcpu.N.cpuset" values (where N is a decimal vcpu number).

- For '-s' options, a pci.<bus>.<slot>.<function> node is created.
  The first argument to '-s' (the device type) is used as the value of
  a "device" variable.  Additional comma-separated arguments are then
  parsed into 'key=value' pairs and used to set additional variables
  under the device node.  A PCI device emulation driver can provide
  its own hook to override the parsing of the additonal '-s' arguments
  after the device type.

  After the configuration phase as completed, the init_pci hook
  then walks the "pci.<bus>.<slot>.<func>" nodes.  It uses the
  "device" value to find the device model to use.  The device
  model's init routine is passed a reference to its nvlist node
  in the configuration tree which it can query for specific
  variables.

  The result is that a lot of the string parsing is removed from
  the device models and centralized.  In addition, adding a new
  variable just requires teaching the model to look for the new
  variable.

- For '-l' options, a similar model is used where the string is
  parsed into values that are later read during initialization.
  One key note here is that the serial ports use the commonly
  used lowercase names from existing documentation and examples
  (e.g. "lpc.com1") instead of the uppercase names previously
  used internally in bhyve.

Reviewed by:	grehan
MFC after:	3 months
Differential Revision:	https://reviews.freebsd.org/D26035

----

bhyve: support relocating fbuf and passthru data BARs

We want to allow the UEFI firmware to enumerate and assign
addresses to PCI devices so we can boot from NVMe[1]. Address
assignment of PCI BARs is properly handled by the PCI emulation
code in general, but a few specific cases need additional support.
fbuf and passthru map additional objects into the guest physical
address space and so need to handle address updates. Here we add a
callback to emulated PCI devices to inform them of a BAR
configuration change. fbuf and passthru then watch for these BAR
changes and relocate the frame buffer memory segment and passthru
device mmio area respectively.

We also add new VM_MUNMAP_MEMSEG and VM_UNMAP_PPTDEV_MMIO ioctls
to vmm(4) to facilitate the unmapping needed for addres updates.

[1]: https://github.com/freebsd/uefi-edk2/pull/9/

Originally by:	scottph
MFC After:	1 week
Sponsored by:	Intel Corporation
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D24066

----

bhyve amd: Small cleanups in amdvi_dump_cmds

Bump offset with MOD_INC instead in amdvi_dump_cmds.

Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D28862

----

bhyve hostbridge: Rename "device" property to "devid".

"device" is already used as the generic PCI-level name of the device
model to use (e.g. "hostbridge").  The result was that parsing
"hostbridge" as an integer failed and the host bridge used a device ID
of 0.  The EFI ROM asserts that the device ID of the hostbridge is not
0, so booting with the current EFI ROM was failing during the ROM
boot.

Fixes:		621b5090487de9fed1b503769702a9a2a27cc7bb

----

bhyve: Enable virtio-scsi legacy config parsing.

The previous commit added the handler to parse the command line
options for virtio-scsi devices but forgot to set the correct function
pointer to point to the handler.

Reported by:	vangyzen
Reviewed by:	vangyzen
Fixes:		621b5090487de9fed1b503769702a9a2a27cc7bb
Differential Revision:	https://reviews.freebsd.org/D29438

----

bhyve: change vq_getchain to return iovecs in both directions

The old prototype requires callers to inspect flags of each descriptors
to get the starting position of host-writable iovecs.

vq_getchain() is changed to return a virtio request with the number of
host-readable iovecs and host-writable iovecs instead. Callers can avoid
boilerplate code of getting the start offset of host-writable iovecs.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Reviewed by:	afedorov
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29433

----

Fix typo in xhci nvlist node name, and also increment device counter.

This allows the xhci tablet device to be recognized and a PCI device
instantiated.

Reviewed by:	jhb
Fixes:		621b5090487d Refactor configuration management in bhyve.
MFC after:	3 months.

----

bhyve: fix regression in legacy virtio-9p config parsing

Commit 621b5090487de9fed1b503769702a9a2a27cc7bb introduced a regression
in legacy virtio-9p config parsing by not initializing *sharename to
NULL. As a result, "sharename != NULL" check in the first iteration fails
and bhyve exits with "virtio-9p: more than one share name given".

Fix by adding NULL back.

Approved by:	grehan

----

bhyve: add SMBIOS Baseboard Information

Add the System Management BIOS Baseboard (or Module) Information
a.k.a. Type 2 structure to the SMBIOS emulation.

Reviewed by:	rgrimes, bcran, grehan
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29657

----

bhyve: Move the gdb_active check to gdb_cpu_suspend().

The check needs to be in the public routine (gdb_cpu_suspend()), not
in the internal routine called from various places
(_gdb_cpu_suspend()).  All the other callers of _gdb_cpu_suspend()
already check gdb_active, and this breaks the use of snapshots when
the debug server is not enabled since gdb_cpu_suspend() tries to lock
an uninitialized mutex.

Reported by:	Darius Mihai, Elena Mihailescu
Reviewed by:	elenamihailescu22_gmail.com
Fixes:		621b5090487de9fed1b503769702a9a2a27cc7bb
Differential Revision:	https://reviews.freebsd.org/D29538

----

bhyve: implement rdmsr for MSR_IA32_FEATURE_CONTROL

Without the -w option, Windows guests crash on boot. This is caused by a rdmsr
of MSR_IA32_FEATURE_CONTROL. Windows checks this MSR to determine enabled VMX
features. This MSR isn't emulated in bhyve, so a #GP exception is injected
which causes Windows to crash.

Fix by returning a rdmsr of MSR_IA32_FEATURE_CONTROL with Lock Bit set and
VMX disabled to informWindows that VMX isn't available.

Reviewed by:	jhb, grehan (bhyve)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29665

----

bhyve.8: Make synopsis more readable

There is no need to squeeze all the possible options into one synopsis
entry. Let "-l help" and "-s help" be listed separately.

While here, keep -s and its arguments on the same line.

MFC after:	2 weeks

----

bhyve: Fix synopsis in the usage message

In particular:
- Sort short options to align with style(9)
- Add two missing flags: -G and -r
- Drop unnecessary angle brackets for consistency
- Rename the "vm" argument to vmname for consistency with the manual
  page

MFC after:	2 weeks

----

bhyve: Improve the option description in the usage message

- Sort options as suggested by style(9)
- Capitalize some words like CPU and HLT
- Add a missing description for the -G flag

MFC after:	2 weeks

----

bhyve.8: Sort the options in the OPTIONS section

No content change intended. Just moving the option descriptions around
to follow the order suggested by style(9).

MFC after:	2 weeks

----

bhyve.8: Improve the description and synopsis of -l

- Describe "-l help" separately for readability.
- List all the supported comX devices explicitly
- Use Cm instead of Ar for command modifiers (i.e., literal values a
  user can specify as an argument to the command).
- Explain where to get more information about the possible values of the
  conf argument.

MFC after:	2 weeks

----

bhyve.8: Improve the description of the -m flag

- Stylize the synopsis with proper mdoc macros
- Do some wordsmithing on the description for consistency.

MFC after:	2 weeks

----

bhyve.8: Fix the synopsis of -p

Use appropriate mdoc macros.

MFC after:	2 weeks

----

bhyve.8: Clean up description of -r

There is no need to wrap those flags in Op macros.

MFC after:	2 weeks

----

bhyve.8: Fix indention in the signals table

MFC after:	2 weeks

----

bhyve.8: Clean-up synopsis of -s

- Document "-s help" separately for readability.
- Use appropriate mdoc macros.

MFC after:	2 weeks

----

bhyve.8: Clean up the slot description of -s

Also, remove the macros of the nested list which contained slot,
emulation and conf. This decreases the indention of the -s description.
It was necessary to clean up the slot description.

MFC after:	2 weeks

----

bhyve.8: Improve emulation description of the -s flag

- Set width of the list to the longest key word for readability.
- Separate descriptions of amd_hostbridge and hostbridge emulations.
  Also, wordsmith their descriptions for consistency with other entries.
- Use Cm instead of Li for command modifiers.
- Do not stylize AMD with Li, there's no need to do it.
- Mention COM3 and COM4 in the definition of lpc.
- Fix a typo in the definition of ahci-hd ("hard drive" instead of
  "hard-drive").

MFC after:	2 weeks

----

bhyve.8: Clean up network backends section

- Reformat the format lists, use appropriate mdoc macros for
  readability.
- Add a missing Oxford comma.

MFC after:	2 weeks

----

bhyve.8: Clean up block storage device backends description

MFC after:	2 weeks

----

bhyve.8: Clean up SCSI device backends section

MFC after:	2 weeks

----

bhyve.8: Clean up 9P device backends section

MFC after:	2 weeks

----

bhyve.8: Clean up TTY, boot ROM, and pass-through descriptions

MFC after:	2 weeks

----

bhyve.8: Clean up virtio console device backends description

MFC after:	2 weeks

----

bhyve.8: Improve framebuffer backends description

- Use appropriate mdoc macros
- Document that tcp= is a synonym to rfb= (tcp is used in the examples,
  but never mentioned)
- Clarify the IP address specification

MFC after:	2 weeks

----

bhyve.8: Improve documentation of NVME backend

- Document the configuration format.
- Document two additional configuration options: eui64 and dsm.

MFC after:	2 weeks

----

bhyve.8: Improve AHCI backends documentation

- Document the backend format.

MFC after:	2 weeks

----

bhyve: Document the format for HD audio backends

- This change is done for consistency with other backend definitions.

MFC after:	2 weeks

----

bhyve.8: Fix mandoc -Tlint issues

While here, keep network backends section consistent with other
sections.

MFC after:	2 weeks

----

bhyve: Be explicit that setting config.dump will not start a VM.

Suggested by:	rpokala
Reviewed by:	bcr (manpages)
Differential Revision:	https://reviews.freebsd.org/D29738

----

bhyve: Gracefully handle virtio-scsi with no conf

Fixes segfault with the command `bhyve -s 0,virtio-scsi`, which is used
by some third party software to probe bhyve for virtio-scsi support.

Reviewed by:	jhb
MFC after:	1 day
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D29926

----

bhyve: Set SO_REUSEADDR on the gdb stub socket

Reviewed by:	jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30037

----

bhyve/snapshot: provide a way to send other messages/data to bhyve

This is a step towards sending messages (other than suspend/checkpoint)
from bhyvectl to bhyve.

Introduce a new struct, ipc_message - this struct stores the type of
message and a union containing message specific structures for the type
of message being sent.

Reviewed by:    grehan
Differential Revision: https://reviews.freebsd.org/D30221

----

bhyve/snapshot: split up mutex/cond initialization from socket creation

Move initialization of the mutex/condition variables required by the
save/restore feature to their own function.

The unix domain socket that facilitates communication between bhyvectl
and bhyve doesn't rely on these variables in order to be functional.

Reviewed by:    markj
Differential Revision:  https://reviews.freebsd.org/D30281

----

Add a virtio-input device emulation.

This will be used to inject keyboard/mouse input events into a guest.
The command line syntax is:
   -s <slot>,virtio-input,/dev/input/eventX

Reviewed by:	jhb (bhyve), grehan
Obtained from:	Corvin Köhne <C.Koehne@beckhoff.com>
MFC after:	3 weeks
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D30020

----

bhyve: Register new kevents synchronously.

Change mevent_add*() to synchronously add the new kevent.  This
permits reporting event registration failures to the caller and avoids
failing the registration of other, unrelated events queued up in the
same batch.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30502

----

bhyve: Add support for EVFILT_VNODE mevents.

This allows registering an event to watch for changes to a file's
attributes.  This is a bit imperfect as it would be nice to have a way
to determine if an fd can use EVFILT_VNODE successfully.  mevent's
current structure does not permit that and a failure to register a
single kevent impacts several other kevents.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30503

----

bhyve: Add support for handling disk resize events to block_if.

Allow clients of blockif to register a resize callback handler.  When
a callback is registered, register an EVFILT_VNODE kevent watching the
backing store for a change in the file's attributes.  If the size has
changed when the kevent fires, invoke the clients' callback.

Currently resize detection is limited to backing stores that support
EVFILT_VNODE kevents such as regular files.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30504

----

bhyve: Split out a lower-level helper for VirtIO interrupts.

This allows device models to assert VirtIO interrupts for reasons
other than publishing changes to a VirtIO ring such as configuration
changes.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30505

----

bhyve vtblk: Inform guests of disk resize events.

Register a resize callback with the blockif interface.  When the
callback fires, update the size of the disk and notify the guest via a
configuration change interrupt.

Reviewed by:	grehan, markj
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D30506

----

bhyve: enhance debug info for memory range clash

Explain what the two clashing regions are.

Reivewed by:		grehan, jhb
Differential Revision:	https://reviews.freebsd.org/D29696
Pull Request:		https://github.com/freebsd/freebsd-src/pull/463

----

Add more GIC and GICv3 registers

These aren't used by either driver, however they will be needed by
bhyve on arm64 to emulate a GICv3 interrupt controller.

Sponsored by:	Innovate UK

----

bhyve: Fix cli regression with NVMe ram

The configuration management refactoring inadvertently removed support
for a RAM-backed NVMe Namespace (i.e. -s X,nvme,ram=16384). This adds it
back.

Reported by:	andy@omniosce.org
Reviewed by:	jhb, andy@omniosce.org
Fixes:		621b5090487d
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D30717

----

bhyve: fix NVMe MDTS comment

Removes an obsolete comment and adds parenthesis around the macro while
in the area. No functional change.

----

bhyve: Fix NVMe iovec construction for large IOs

The UEFI driver included with Rocky Linux 8.4 uncovered an existing bug
in the NVMe emulation's construction of iovec's.

By default, NVMe data transfer operations use a scatter-gather list in
which all entries point to a fixed size memory region. For example, if
the Memory Page Size is 4KiB, a 2MiB IO requires 512 entries. Lists
themselves are also fixed size (default is 512 entries).

Because the list size is fixed, the last entry is special. If the IO
requires more than 512 entries, the last entry in the list contains the
address of the next list of entries. But if the IO requires exactly 512
entries, the last entry points to data.

The NVMe emulation missed this logic and unconditionally treated the
last entry as a pointer to the next list. Fix is to check if the
remaining data is greater than the page size before using the last entry
as a pointer to the next list.

PR:		256422
Reported by:	dave@syix.com
Tested by:	jason@tubnor.net
MFC after:	5 days
Relnotes:	yes
Reviewed by:	imp, grehan
Differential Revision:	https://reviews.freebsd.org/D30897

----

Append Keyboard Layout specified option for using VNC.
Part one: supporting QEMU Extended Keyboard Event Message

PR:             246121
Submitted by:   koinec@yahoo.co.jp
Differential Revision: https://reviews.freebsd.org/D29430

----

libvmm: explicitly save and restore errno in vm_open()

In commit 6bb140e3ca895a14, vm_destroy() was replaced with free() to
preserve errno. However, it's possible that free() may change the errno
as well. Keep the free() call, but explicitly save and restore errno.

Noted by: jhb
Fixes: 6bb140e3ca895a14

----

vmm: Let guests enable SMEP/SMAP if the host supports it

Reviewed by:	kib, grehan, jhb
Tested by:	grehan (AMD)
MFC after:	3 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30462

----

vmm: Fix ivrs_drv device_printf usage

The original %b description string is wrong.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	imp, jhb
Differential Revision:	https://reviews.freebsd.org/D30805

----

bhyve/vioapic: remove an extra pin masked check

vioapic_send_intr does already check whether the pin is masked before
injecting the interrupt, there's no need to do it in vioapic_write
also.

No functional change intended.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28236

----

bhyve/ioapic: only account for asserted line in level mode

After modifying a redirection entry only try to inject an interrupt if
the pin is in level mode, pins in edge mode shouldn't take into
account the line assert status as they are triggered by edge changes,
not the line status itself.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28237

----

bhyve/ioapic: improve the tracking of IRR bit

One common method of EOI'ing an interrupt at the IO-APIC level is to
switch the pin to edge triggering mode and then back into level mode.
That would cause the IRR bit to be cleared and thus further interrupts
to be injected. FreeBSD does indeed use that method if the IO-APIC EOI
register is not supported.

The bhyve IO-APIC emulation code didn't clear the IRR bit when doing
that switch, and was also missing acknowledging the IRR state when
trying to inject an interrupt in vioapic_send_intr.

Reviewed by:		grehan
Differential revision:	https://reviews.freebsd.org/D28238

----

ivrs_drv: Fix IVHDs with duplicated BaseAddress

Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28945

----

AMD-vi: Fix IOMMU device interrupts being overridden

Currently, AMD-vi PCI-e passthrough will lead to the following lines in
dmesg:
"kernel: CPU0: local APIC error 0x40
ivhd0: Error: completion failed tail:0x720, head:0x0."

After some tracing, the problem is due to the interaction with
amdvi_alloc_intr_resources() and pci_driver_added(). In ivrs_drv, the
identification of AMD-vi IVHD is done by walking over the ACPI IVRS
table and ivhdX device_ts are added under the acpi bus, while there are
no driver handling the corresponding IOMMU PCI function. In
amdvi_alloc_intr_resources(), the MSI intr are allocated with the ivhdX
device_t instead of the IOMMU PCI function device_t. bus_setup_intr() is
called on ivhdX. the IOMMU pci function device_t is only used for
pci_enable_msi(). Since bus_setup_intr() is not called on IOMMU pci
function, the IOMMU PCI function device_t's dinfo->cfg.msi is never
updated to reflect the supposed msi_data and msi_addr. So the msi_data
and msi_addr stay in the value 0. When pci_driver_added() tried to loop
over the children of a pci bus, and do pci_cfg_restore() on each of
them, msi_addr and msi_data with value 0 will be written to the MSI
capability of the IOMMU pci function, thus explaining the errors in
dmesg.

This change includes an amdiommu driver which currently does attaching,
detaching and providing DEVMETHODs for setting up and tearing down
interrupt. The purpose of the driver is to prevent pci_driver_added()
from calling pci_cfg_restore() on the IOMMU PCI function device_t.
The introduction of the amdiommu driver handles allocation of an IRQ
resource within the IOMMU PCI function, so that the dinfo->cfg.msi is
populated.

This has been tested on EPYC Rome 7282 with Radeon 5700XT GPU.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	jhb
Approved by:	philip (mentor)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28984

----

Correct "Fondation" typo (missing "u")

----

AMD-vi: Mixed format IVHD block should replace fixed format IVHD block

This fixes double IVHD_SETUP_INTR calls on the same IOMMU device.

Sponsored by:	The FreeBSD Foundation
MFC with:	74ada297e897
Reported by:	Oleg Ginzburg <olevole@olevole.ru>
Reviewed by:	grehan
Approved by:	philip (mentor)
Differential Revision:	https://reviews.freebsd.org/D29521

----

vmm: Fix AMD-vi using wrong rid range

The ACPI parsing code around rid range was wrong on assuming there is
only one pair of start/end device id range. Besides, ivhd_dev_parse()
never work as supposed. The start/end rid info was always zero.

Restructure the code to build dynamic-sized tables for each IOMMU softc
holding device entries. The device entries are enumerated to find a
suitable IOMMU unit. Operations on devices not governed (e.g. the IOMMU
unit itself) are no-op from now on. There are also a minor fix on wrong
%b formatting string usage.

Tested on my EPYC 7282.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D30827

----

vmm: Bump vmname buffer in struct vm to VM_MAX_NAMELEN + 1

In hw.vmm.create sysctl handler the maximum length of vm name is
VM_MAX_NAMELEN. However in vm_create() the maximum length allowed is
only VM_MAX_NAMELEN - 1 chars. Bump the length of the internal buffer to
allow the length of VM_MAX_NAMELEN for vm name.

MFC after:	3 days
Reviewed by:	grehan
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31372

----

amd64: Fix output operand specs for the stmxcsr and vmread intrinsics

This does not appear to affect code generation, at least with the
default toolchain.

Noticed because incorrect output specifications lead to false positives
from KMSAN, as the instrumentation uses them to update shadow state for
output operands.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31466

----

vmm: Make iommu ops tables const

While here, use designated initializers and rename some AMD iommu method
implementations to match the corresponding op names.  No functional
change intended.

Reviewed by:	grehan
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31462

----

vmm: Fix wrong assert in ivhd_dev_add_entry

The correct condition is to check the number of ivhd entries fit into
the array.

Reported by:	bz
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D31514

----

vmm: Add credential to cdev object

Add a credential to the cdev object in sysctl_vmm_create(), then check
that we have the correct credentials in sysctl_vmm_destroy(). This
prevents a process in one jail from opening or destroying the /dev/vmm
file corresponding to a VM in a sibling jail.

Add regression tests.

Reviewed by:	jhb, markj
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31156

----

bhyve: net_backends, automatically IFF_UP tap devices

If you want communications with the outside world and tell bhyve to
create an interfaces then it should be usable as well.
Rather than relying on the sysctl net.link.tap.up_on_open automatically
try to IFF_UP the opened tap device.

MFC after:	10 days
Reviewed by:	markj, grehan
Differential Revision: https://reviews.freebsd.org/D31342

----

bhyve: Use fspacectl(2) for BOP_DELETE on regular file images

bhyve can also make use of fspacectl(2) to implement BOP_DELETE with
hole-punching. Since it is not desirable to do zero-filling for large
DEALLOCATE/UNMAP range, candelete is not set if pathconf(2) indicates
that the underlying file system does not support native
VOP_DEALLOCATE(9).

Sponsored by:	The FreeBSD Foundation
Reviewed by:	grehan
Differential Revision:	https://reviews.freebsd.org/D28880

----

bhyve: Use pci(4) to access I/O port BARs

This removes the dependency on /dev/io.

PR:		251046
Reviewed by:	jhb
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31308

----

byhve: add option to specify IP address for gdb

Allow user to specify the IP address available for gdb debugger.

Reviewed by:	jhb, grehan, rgrimes, bcr (man pages)
Differential Revision:	https://reviews.freebsd.org/D29607

----

bhyve: change a default address from ANY to localhost

Discussed with:     grehan, jhb

----

bhyve: Fix vq_getchain() error handling bugs in various device models

Reviewed by:	grehan, khng
Approved by:	so
Security:	CVE-2021-29631
Security:	FreeBSD-SA-21:13.bhyve

----

pci: Add an ioctl to perform I/O to BARs

This is useful for bhyve, which otherwise has to use /dev/io to handle
accesses to I/O port BARs when PCI passthrough is in use.

Reviewed by:	imp, kib
Discussed with:	jhb
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31307

----

bhyve: Nuke double-semicolons

A distinct number of double-semicolons ended up in bhyve. Take a pass at
getting rid of many of these harmless typos.

MFC after:	3 days

----

bhyve: Fix pci device node key in bhyve_config.5

PCI device node key in the manual page is wrong. It should be
pci.bus.slot.function.

MFC after:	3 days

----

bhyve: Support setting the disk serial number for VirtIO block devices.

Reviewed by:	allanjude
Obtained from:	illumos
Differential Revision:	https://reviews.freebsd.org/D31983

----

bhyve: Update the -G description in the SYNPOSIS.

It was missing both the 'w' flag and 'bind_address'.

----

bhyve_config.5: Document gdb.address.

----

bhyve: Add an empty case for event types in mevent_kq_fflags().

This fixes a -Wswitch error raised by GCC 9.

Differential Revision:	https://reviews.freebsd.org/D31938

----

bhyve: Map the MSI-X table unconditionally for passthrough

It is possible for the PBA to reside in the same page as the MSI-X
table.  And, while devices are not supposed to do this, at least some
Intel wifi devices place registers in a page shared with the MSI-X
table.  To handle the first case we currently map the PBA page using
/dev/mem, and the second case is not handled.

Kill two birds with one stone: map the MSI-X table BAR using the
PCIOCBARMMAP ioctl instead of /dev/mem, and map the entire table so that
accesses beyond the bounds of the table can be emulated.  Regions of the
BAR not containing the table are left unmapped.

Reviewed by:	bz, grehan, jhb
MFC after:	3 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32359

----

bhyve.8: Fix markup of the -G flag

----

bhyve: Update usage and synopsis for the -k flag

Let's make it clear to users that -k is for configuration files.
Also, point to bhyve_config(5) in the paragraph describing the flag.

Reviewed by:	jhb
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D32467

----

bhyve: ignore low bits of CFGADR

Bhyve could emulate wrong PCI registers.
In the best case, the guest reads wrong registers and the device driver would
report some errors.
In the worst case, the guest writes to wrong PCI registers and could brick
hardware when using PCI passthrough.

According to Intels specification, low bits of CFGADR should be
ignored. Some OS like linux may rely on it. Otherwise, bhyve could
emulate a wrong PCI register.

E.g.
If linux would like to read 2 bytes from offset 0x02, following would
happen.
linux:
	outl 0x80000002 at CFGADR
	inw  at CFGDAT + 2
bhyve:
	cfgoff = 0x80000002 & 0xFF = 0x02
	coff   = cfgoff + (port - CFGDAT) = 0x02 + 0x02 = 0x04
Bhyve would emulate the register at offset 0x04 not 0x02.

Reviewed By: #bhyve, grehan
Differential Revision: https://reviews.freebsd.org/D31819
Sponsored by:	       Beckhoff Automation GmbH & Co. KG

----

bhyve: Fix the WITH_BHYVE_SNAPSHOT build

Note, this breaks compatibility with snapshots generated by older builds
of bhyve(8).

Fixes: 7fa233534736 ("bhyve: Map the MSI-X table unconditionally for passthrough")
Reported by:	Greg V <greg@unrelenting.technology>
Reviewed by:	grehan, bz
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32523

----

bhyve: Bump the SMBIOS firmware version to 14.0 for 14-CURRENT

Bump the firmware version to 14.0 and set the firmware release date
to today.

Reviewed by: jhb, bz, imp
Differential Revision: https://reviews.freebsd.org/D32534

----

bhyve: use physical lobits for BARs of passthru devices

Tell the guest whether a BAR uses prefetched memory or not for
passthru devices by using the same lobits as the physical device.

Reviewed by:	 grehan
Sponsored by:	 Beckhoff Autmation GmbH & Co. KG
Differential Revision:	  https://reviews.freebsd.org/D32685

----

bhyve: do not explicitly map fbuf framebuffer

Allocating a BAR will call baraddr which maps the framebuffer. No need
to allocate it explicitly on init.

Reviewed by:     grehan
Sponsored by:    Beckhoff Autmation GmbH & Co. KG
Differential Revision:    https://reviews.freebsd.org/D32596

----

bhyve: move 64 bit BAR location to match OVMF assumptions

OVMF will fail, if large 64 bit BARs are used. GCD-Map doesn't cover
64 bit addresses of BARs.
OVMF assumes that 64 bit addresses of BARS are located on next 32 GB
boundary behind Top of High RAM.

This patch moves 64 bit BARs on next 32 GB boundary behind Top of High
RAM to match OVMF assumptions.

Differential Revision:	https://reviews.freebsd.org/D27970
Sponsored by: Beckhoff Automation GmbH & Co. KG

----

bhyve: use a fixed 32 bit BAR base address

OVMF always uses 0xC0000000 as base address for 32 bit PCI MMIO space.
For that reason, we should use that address too.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D31051
Sponsored by:	Beckhoff Automation GmbH & Co. KG

----

bhyve: keep physical and virtual COMMAND reg in sync

On startup all virtual BARs are registered.
Additionally, the encoding bit in the virtual cmd register is set.
After that, the passthru emulation overwrites the virtual cmd register with
the physical one.
This could lead to a mismatch between registered BARs and the encoding
bits in the cmd register.
Instead of writing the physical to the virtual cmd register,
write the virtual to the physical cmd register to solve this issue.

Reviewed by:	  markj
Differential Revision:	https://reviews.freebsd.org/D32687
Sponsored by:	Beckhoff Automation GmbH & Co. KG

----

bhyve: emulate reads of MSI-X capabilities for passthru devices

Reads of the MSI-X capabilites aren't emulated by passthru devices
yet. The guest will read the host MSI-X capabilites which could
cause issues.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D32686
Sponsored by:	Beckhoff Automation GmbH & Co. KG

----

bhyve: Fix compile

We need err.h

Fixes:	5cf21e48ccf11 ("bhyve: use a fixed 32 bit BAR base address")
Sponsored by:	Bechoff Automation GmbH & Co. KG

----

bhyve blockif: fix blockif_candelete with Capsicum

NVMe conformance tests for the Format command failed if the
backing-storage for the bhyve device was a file instead of a Zvol. The
tests (and the specification) expect a Format to destroy all previously
written data. The bhyve NVMe emulation implements this by trimming /
deallocating all data from the backing-storage.

The blockif_candelete() function indicated the file did not support
deallocation (i.e. fpathconf(..., _PC_DEALLOC_PRESENT) returned FALSE)
even though the kernel supported file hole punching. This occurs on
builds with Capsicum enabled because blockif did not allow the
fpathconf(2) right.

Fix is to add CAP_FPATHCONF to the cap_rights_init(3) call.

PR:		260081
Reviewed by:	allanjude, markj, jhb
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D33203

----

bhyve: fix -Wunused-but-set-variable warning

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D33306

----

bhyve: Support a _VARS.fd file for bootrom

OVMF creates two separate .fd files, a _CODE.fd file containing
the UEFI code, and a _VARS.fd file containing a template of an
empty UEFI variable store.

OVMF decides to write variables to the memory range just below the
boot rom code if it detects a CFI flash device. So here we add
just the barest facsimile of CFI command handling to bootrom.c
that is needed to placate OVMF.

Submitted by: D Scott Phillips <d.scott.phillips@intel.com>
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19976
MFC After: 1 week

----

bhyve: set EV_CLEAR for EVFILT_VNODE mevents

When an EVFILT_VNODE filter event is triggered, reset it.

This fixes the issue where a virtio-blk resize event would cause the
mevent thread to consume 100% of the cpu.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D33326

----

bhyve nvme: Add AEN support to NVMe emulation

Add Asynchronous Event Notification infrastructure to the NVMe
emulation.

Reviewed by:	imp, grehan
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D32952

----

bhyve nvme: Inform guests of namespace resize

Register a "block resize" callback to be notified of changes to the
backing storage for the Namespace. Use this to generate an Asynchronous
Event Notification, Namespace Attributes Changed when the guest OS
provides an Asynchronous Event Request.

MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D32953

----

bhyve: Only snapshot initialized VirtIO queues

If the virtio device is not fully initialized, then suspend fails with:

  vi_pci_snapshot_queues: invalid address: vq->vq_desc
  Failed to snapshot virtio-rnd; ret=14

MFC after:	1 week
Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D26268

----

bhyve: passthru: enable BARs before possibly mmap(2)ing them

The first time we start bhyve with a passthru device everything is fine
as on boot we do enable BARs.  If a driver (unload) inside bhyve disables
the BAR(s) as some Linux drivers do, we need to make sure we re-enable
them on next bhyve start.

If we are trying to mmap a disabled BAR for MSI-X (PCIOCBARMMAP)
the kernel will give us an EBUSY.
While we were re-enabling the BAR(s) in the current code loop
cfginit() was writing the changes out too late to the real hardware.

Move the call to init_msix_table() after the register on the real
hardware was updated.  That way the kernel will be happy and the
mmap will succeed and bhyve will start.
Also simplify the code given the last argument to init_msix_table()
is unused we do not need to do checks for each bar. [1]

MFC after:	3 days
PR:		260148
Pointed out by:	markj [1]
Sponsored by:	The FreeBSD Foundation
Reviewed by:	markj
Differential Revision: https://reviews.freebsd.org/D33628

----

bhyve: clean up trailing whitespaces

Clean up trailing whitespaces. No functional changes.

Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D33681

----

bhyve smbios type 3 structure is incorrect

If you look at the SMBIOS specification, we'll find something is
missing. In particular at offset 0Dh is supposed to be the OEM-defined
field. This should go between security and height. It is not legal to
actually skip this and will lead to other folks not properly
interpreting later parts of the table.

https://www.illumos.org/issues/14312

Reviewed by:	jhb
Submitted by:	Robert Mustacchi <rm@fingolfin.org>
Obtained from:	ilumos
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D33682

----

bhyve: only init MSI-X table if passthru device supports it

Some passthru devices only support MSI instead of MSI-X. For those
devices the initialization of MSI-X table will fail. Re-add the
check erroneously removed in f1442847c9404d4bc5f5524a0c3362dd39cb14f9.

MFC after:	3 days
X-MFC with:	f1442847c9404d4bc5f5524a0c3362dd39cb14f9
PR:		260148
Reviewed by:	manu, bz
Differential Revision:	https://reviews.freebsd.org/D33728

----

bhyve: enumerate BARs by size

E.g. Framebuffers can require large space and BARs need to be aligned
by their size. If BARs aren't allocated by size, it'll cause much
fragmentation of the MMIO space. Reduce fragmentation by ordering
the BAR allocation on their size to reduce the risk of
OUT_OF_MMIO_SPACE issues.

Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	Beckhoff Automation GmbH & Co. KG
Differential Revision:	https://reviews.freebsd.org/D28278

----

bhyve: allow reading of fwctl signature multiple times

At the moment, you only have one single chance to read the fwctl
signature. At boot bhyve is in the state IDENT_WAIT. It's then
possible to switch to IDENT_SEND. After bhyve sends the signature,
it switches to REQ. From now on it's impossible to switch back to
IDENT_SEND to read the signature. For that reason, only a single
driver can read the signature. A guest can't use two drivers to
identify that fwctl is present. It gets even worse when using
OVMF. OVMF uses a library to access fwctl. Therefore, every single
OVMF driver would try to read the signature. Currently, only a
single OVMF driver accesses the fwctl. So, there's no issue with
it yet. However, no OS driver would have a chance to detect fwctl when
using OVMF because it's signature was already consumed by OVMF.

Reviewed by:    markj
MFC after:      2 weeks
Sponsored by:   Beckhoff Automation GmbH & Co. KG
Differential Revision:  https://reviews.freebsd.org/D31981

----

bhyve: add more slop to 64 bit BARs

Bhyve allocates small 64 bit BARs below 4 GB and generates ACPI tables
based on this allocation. If the guest decides to relocate those BARs
above 4 GB, it could lead to mismatching ACPI tables. Especially
when using OVMF with enabled bus enumeration it could cause
issues. OVMF relocates all 64 bit BARs above 4 GB. The guest OS
may be unable to recover from this situation and disables some PCI
devices because their BARs are located outside of the MMIO space
reported by ACPI. Avoid this situation by giving the guest more
space for relocating BARs.

Let's be paranoid. The available space for BARs below 4 GB is 512 MB
large. Use a slop of 512 MB. It'll allow the guest to relocate all
BARs below 4 GB to an address above 4 GB. We could run into issues
when we exceeding the memlimit above 4 GB. However, this space has
a size of 32 GB. Even when using many PCI device with large BARs
like framebuffer or when using multiple PCI busses, it's very
unlikely that we run out of space due to the large slop.
Additionally, this situation will occur on startup and not at runtime
which is much better.

Reviewed by:    markj
MFC after:      2 weeks
Sponsored by:   Beckhoff Automation GmbH & Co. KG
Differential Revision:  https://reviews.freebsd.org/D33118

----

bhyve: dynamically register FwCtl ports

Qemu's FwCfg uses the same ports as Bhyve's FwCtl. Static allocated
ports wouldn't allow to switch between Qemu's FwCfg and Bhyve's
FwCtl.

Reviewed by:    markj
MFC after:      2 weeks
Sponsored by:   Beckhoff Automation GmbH & Co. KG
Differential Revision:  https://reviews.freebsd.org/D33496

----

bhyve: Map the right BAR in init_msix_table()

The PBA and MSI-X table can reside in different BARs.

Reported by:	Andy Fiddaman <andy@omniosce.org>
Reviewed by:	jhb
Fixes:		7fa233534736 ("bhyve: Map the MSI-X table unconditionally for passthrough")
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33739

----

bhyve: Correct unmapping of the MSI-X table BAR

The starting address passed to mprotect was wrong, so in the case where
the last page containing the table is not the last page of the BAR, the
wrong region would be unmapped.

Reported by:	Andy Fiddaman <andy@omniosce.org>
Reviewed by:	jhb
Fixes:		7fa233534736 ("bhyve: Map the MSI-X table unconditionally for passthrough")
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33739

----

bhyve: add nvlist functions for setting unset nodes

If an emulation uses those functions instead of set_config_value_node
or set_config_value, it allows the config values to get
overwritten. Introducing new functions is much more readable than
if else statements in the emulation code.

Reviewed by:	khng
MFC after:	2 weeks
Sponsored by:	Beckhoff Automation GmbH & Co. KG
Differential Revision:	https://reviews.freebsd.org/D33770

----

bhyve: get mediasize for character devices when resizing virtio-blk

Reviewed by:	imp, allanjude, jhb
Differential Revision:	https://reviews.freebsd.org/D33403

----

bhyve/snapshot: fix pthread_create() error check

pthread_create() returns 0 on success or an error number on failure.

Reviewed by:	khng, markj
Differential Revision:	https://reviews.freebsd.org/D33930

----

Append Keyboard Layout specified option for using VNC.
Part two: Append bhyve -K option for specified keyboard layout
with layout setting files every languages.
Since the cmd option '-k' was used in the meantime
it was changed to '-K'

PR:		246121
Submitted by:	koinec@yahoo.co.jp
Reviewed by:	grehan@
Differential Revision:	https://reviews.freebsd.org/D29473

MFC after:	4 weeks

----

bhyve: ahci: Fix regression with no ports

An AHCI controller may be specified with no connected ports.  Avoid
dumping core in this case for compatibility with existing VM configs.

Reviewed by:	khng, jhb
Fixes:		621b5090487de Refactor configuration management in bhyve.
MFC after:	1 week
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D33969

----

bhyve/block_if: allow DIOCGMEDIASIZE ioctl

This is needed to get mediasize of the device after a resize event.

I missed this earlier as I was building WITH_BHYVE_SNAPSHOT, which
disables capsicum.

Reviewed by:	khng, markj
Fixes: ae9ea22e14bf ("bhyve: get mediasize for character devices when ...")
Differential Revision:	https://reviews.freebsd.org/D34013

----

pkgbase: bhyve: Tag the kbdlayout file to be in the bhyve package

----

bhyve nvme: Advertise v1.4 support

Bump advertised NVMe support from v1.3 to v1.4

Reviewed by:	allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33564

----

bhyve nvme: Fix NVM Format completion status

The NVM Format command is unique among the Admin commands in that it
needs to finish asynchronously. For this reason, the emulation code
invented a synthetic completion status (NVME_NO_STATUS) to indicate that
the command was still in progress and the command processing loop should
not generate a completion message. The implementation used the value
0xffff for the synthetic value as this set both the Status Code and
Status Code Type fields to reserved values.

Format initialized the completion status to this value and expected
error cases to override it with a status code/type appropriate to the
situation. The macros used to set the NVMe status are careful not to
modify bit 0 (i.e. the phase bit), which with the synthetic completion
status, causes the phase bit to get out of sync. When running tests in a
guest with illegal NVM Format commands, Admin commands would eventually
hang because it appeared there were no completions due to the incorrect
phase bit value.

Fix is to only set NVME_NO_STATUS if the blockif delete command
succeeds. While in the neighborhood, add a missing break statement when
NVM Format is not supported.

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33565

----

bhyve nvme: Fix Namespace Specific Set Features

Return an error if the feature specified in Set Features is Namespace
specific but the Namespace ID uses the Global Namespace tag.

Fixes UNH Test 1.2.7

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33566

----

bhyve nvme: Implement Log Page Offset

Modify the Get Log Page command to parse the Log Page Offset fields to
support more recent versions of the NVMe specification.

Fixes various tests for UNH Test 1.3.*

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33568

----

bhyve nvme: Add missing Admin opcodes

Don't treat unsupported Admin commands as Invalid Opcode. Instead return
the proper Invalid Field in Command.

Fixes UNH IOL test 1.17.2

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33569

----

bhyve nvme: Remove redundant AER Limit checks

The NVMe emulation checked if the Asynchronous Event Request Limit
(a.k.a AERL) would be exceeded in pci_nvme_aer_add(), but this function
is only called from nvme_opc_async_event_req() which also checks for
exceeding the AERL.

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33570

----

bhyve nvme: Fix Set Features

Be more conservative and only support the Features mandatory for an I/O
Controller.

Avoids a "hang" in UNH test 1.2.10 associated with Predictable Latency
Mode Configuration and Host Behavior Support features.

Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33571

----

bhyve nvme: Add Temperature Threshold support

This adds the ability for a guest OS to send Set / Get Feature,
Temperature Threshold commands. The implementation assumes a constant
temperature and will generate an Asynchronous Event Notification if the
specified threshold is above/below this value. Although the
specification allows 9 temperature values, this implementation only
implements the Composite Temperature.

While in the neighborhood, move the clear of the CSTS register in the
reset function after all other cleanup. This avoids a race with the
guest thinking the reset is complete (i.e. CSTS.RDY = 0) before the NVMe
emulation is actually complete with the reset.

Fixes UNH IOL 16.0 Test 1.7, cases 1, 2, and 4.

Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33572

----

bhyve nvme: Update v1.4 Identify Controller data

Compliant v1.4 Controllers must report a Controller Type (CNTRLTYPE).
Also, do not advertise secure erase functionality in the Format NVM
Attributes field of the Identify Controller data structure as the
Controller does not implement secure erase.

Fixes UNH ILO Test 1.1, Case 2

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33573

----

bhyve nvme: Add Select support to Get Features

Implement basic support for the SEL field of Get Features. This returns
information about Namespace Specific features.

Fixes UNH ILO 16.0 Test 1.2, Case 13

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33574

----

bhyve nvme: Fix LBA out-of-range calculation

The function which checks for a valid LBA range mistakenly named an
input value as NLB ("Number of Logical Blocks") instead of "number of
blocks". The NVMe specification defines NLB as a zero-based value (i.e.
NLB=0x0 represents 1 block, 0x1 is 2 blocks, etc.), but the passed
parameter is a 1's-based value.

Fix is to rename the variable to avoid future confusion.

While in the neighborhood, also check that the starting LBA is less than
the size of the backing storage to avoid an integer overflow.

Reviewed by:	imp, allanjude, jhb
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33575

----

bhyve nvme: Fix reported VWC value

v1.4 and later NVMe Controllers report "Flush all Namespaces" support
differently.

Fixes UNH IOL 16.0 Test 2.6, Case 3

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33576

----

bhyve nvme: Fix Set Features, AEN

NVMe Controllers which do not support Endurance Groups must return an
error when the Endurance Group Event Aggregate Log Change Notices bit is
set in Set Features, Asynchronous Event Configuration.

Fixes UNH IOL Test 3.12, Case 8

Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33577

----

bhyve nvme: Fix Identify Namespace, NSID=ffffffff

If the NVMe Controller doesn't support Namespace Management, it should
return "Invalid Namespace or Format" when the Host request Identify
Namespace with the global NSID value.

Fixes UNH IOL 16.0 Test 9.1, Case 6

Reviewed by:	imp, allanjude
Tested by:      jason@tubnor.net
MFC after:      1 month
Differential Revision:	https://reviews.freebsd.org/D33578

----

bhyve/virtio: use correct device id for virtio-scsi

Section 4.1.2.1 of the virtio spec states that the transitional PCI
device id for a scsi device is 0x1004.

Fix suggested by reporter.

PR:             259961
Reported by:    me@nanaya.pro
Reviewed by:	imp, jhb
Fixes:  f9c005a17f4e ("Add bhyve virtio-scsi storage backend support.")
Differential Revision:	https://reviews.freebsd.org/D34103

----

Create VM_MEMATTR_DEVICE on all architectures

This is intended to be used with memory mapped IO, e.g. from
bus_space_map with no flags, or pmap_mapdev.

Use this new memory type in the map request configured by
resource_init_map_request, and in pciconf.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D29692

----

Use if ... else when printing memory attributes

In vmstat there is a switch statement that converts these attributes to
a string. As some values can be duplicate we have to hide these from
userspace.

Replace this switch statement with an if ... else macro that lets us
repeat values without a compiler error.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	ABT Systems Ltd
Differential Revision:	https://reviews.freebsd.org/D29703

----

Remove an always-true check.

This fixes a -Wtype-limits error from GCC 9.

Reviewed by:	imp
Differential Revision:	https://reviews.freebsd.org/D31936

----

vlapic: Schedule callouts on the local CPU

The virtual LAPIC driver uses callouts to implement the LAPIC timer.
Callouts are armed using callout_reset_sbt(), which currently puts
everything on CPU 0.  On systems running many bhyve VMs this results in
a large amount of contention for CPU 0's callout lock.

Modify vlapic to schedule callouts on the local CPU instead.  This
allows timer interrupts to be scheduled more evenly among CPUs where
bhyve is running.

Reviewed by:	grehan, jhb
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32559

----

vmm: vlapic resume can eat 100% CPU by vlapic_callout_handler

Suspend/Resume of Win10 leads that CPU0 is busy on handling interrupts.

Win10 does not use LAPIC timer to often and in most cases, and I see it
is disabled by writing 0 to Initial Count Register (for Timer).

During resume, restart timer only for enabled LAPIC and enabled timer
for that LAPIC.

Reviewed by:	markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D33448

----

bhyve: add support for MTRR

Some guests or driver might depend on MTRR to work properly. E.g. the
nvidia gpu driver won't work without MTRR.

Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	Beckhoff Automation GmbH & Co. KG
Differential Revision:	https://reviews.freebsd.org/D33333
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
5 participants