Commits
Commits on Sep 17, 2013
-
Guest-side Paravirtual posted interrupts support
Posted interrupts allow KVM running on one core (root mode) to inject a virtual interrupt into a guest running on another core (guest mode) without forcing a guest exit. Posted interrupts are not yet available in current hardware, but this patch implements this future hardware virtualization feature in software (para-virtual guest). As described in a previous patch, we extended the ELI mechanism to deliver exitless virtual interrupts. This patch implements the guest-side logic. When a pre-specified IPI (fixed vector number) is received by the guest, the guest kernel checks in a memory descriptor which interrupt handler (virtual vector number) needs to be called. KVM pre-specifies the virtual interrupt vector number in the shared descriptor before sending the posted interrupt IPI. Signed-off-by: Abel Gordon <abelg@il.ibm.com>
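The guest-side dispatch described above can be modeled in a few lines of user-space Python. This is only an illustrative sketch, not the kernel code from the patch: the vector number `0xF2`, the `SharedDescriptor` class and the function names are all assumptions made for the example.

```python
# User-space model of guest-side paravirtual posted interrupts.
# PI_VECTOR, SharedDescriptor and all method names are hypothetical.

PI_VECTOR = 0xF2  # assumed fixed notification vector agreed with the host


class SharedDescriptor:
    """Memory shared between host and guest; holds the vector to inject."""

    def __init__(self):
        self.virtual_vector = None


class Guest:
    def __init__(self, descriptor, pi_vector=PI_VECTOR):
        self.descriptor = descriptor
        self.pi_vector = pi_vector
        self.handlers = {}   # vector number -> handler callable
        self.handled = []    # record of virtual vectors that were dispatched

    def register_handler(self, vector, handler):
        self.handlers[vector] = handler

    def on_interrupt(self, vector):
        # If the fixed PI vector arrives, look up the real (virtual) vector
        # in the shared descriptor and run that handler instead.
        if vector == self.pi_vector:
            vector = self.descriptor.virtual_vector
        self.handlers[vector]()
        self.handled.append(vector)


def host_post_interrupt(guest, descriptor, virtual_vector):
    """Host side: write the virtual vector first, then send the PI IPI."""
    descriptor.virtual_vector = virtual_vector
    guest.on_interrupt(guest.pi_vector)  # stands in for the exitless IPI
```

For example, after `guest.register_handler(0x31, some_handler)`, a call to `host_post_interrupt(guest, desc, 0x31)` runs `some_handler` even though the interrupt that reached the guest was the fixed PI vector.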
-
With ELI, the guest is run using a "shadow IDT" instead of the guest's requested IDT. This shadow IDT is built by KVM in a way that causes exits for some interrupts while running the guest's normal handlers on others. The processor requires that the IDT location be given using a virtual address; thus, the shadow IDT must always be mapped in the guest address space. To solve this issue, this patch introduces a new kernel module. When this module is loaded, we allocate a page (in the guest) and pass the corresponding address (GVA) to KVM. ELI can then use this address to prepare the shadow IDT. Signed-off-by: Abel Gordon <abelg@il.ibm.com>
-
Make vhost.ko a standalone module
Previously, vhost.o was combined with net.o to form vhost_net.ko and with blk.o to form vhost_blk.ko. With this patch vhost.ko is a standalone module on which vhost_net.ko and vhost_blk.ko depend. Signed-off-by: Abel Gordon <abelg@il.ibm.com>
-
Virtio in-kernel accelerator for block devices
Vhost blk implementation written by Asias He. Code taken from https://github.com/asias/linux/tree/blk.vhost-blk Signed-off-by: Abel Gordon <abelg@il.ibm.com>
-
Add statistics to monitor vhost performance
This patch introduces a set of statistics to monitor different performance metrics of vhost and our polling and I/O scheduling mechanisms. The statistics are exposed using debugfs and can be easily displayed with a Python script (vhost_stat). Signed-off-by: Abel Gordon <abelg@il.ibm.com>
-
Do not add a work item if a queue is being polled
vhost stops processing a virtqueue once it has consumed all the descriptors (processed all the pending data). To avoid starvation, vhost also limits the amount of data that can be processed for a virtqueue. When the limit is reached, vhost stops processing the current virtqueue and switches to another virtqueue. vhost may not receive further notifications for the queue that was limited. Thus, to ensure that pending data on this queue will be processed later, vhost adds a new item to the work queue. If the queue is being polled by our mechanism, we don't need to add a new work item when we limit the queue. We just switch queues and continue polling as usual. Signed-off-by: Abel Gordon <abelg@il.ibm.com>
-
Add heuristics to improve I/O scheduling
This patch enhances the round-robin mechanism with a set of heuristics to decide when to leave a virtqueue and proceed to the next. The patch also introduces a set of kernel module parameters to configure these heuristics. The vhost generic layer exposes a new function that implements the heuristics. The concrete code (e.g. vhost-net or vhost-blk) is required to call this function to check when it should stop processing data. The heuristics work as follows:
(1) We always leave a queue after doing a certain maximum amount of work on it, even if it is not yet empty. This limitation is required to avoid starvation.
(2) We may leave a queue earlier if we recognize that another non-empty queue or the work queue is stuck and therefore likely to be latency-sensitive. We call a queue stuck if a certain time has passed since it last received new work. A latency-sensitive workload which waits for replies before sending further requests will get "stuck" in this sense, while a high-throughput workload which continuously creates new requests will not be found stuck.
(3) If we leave a queue earlier without processing half of the maximum data, we move the queue to the head of the round-robin list. We use this condition to avoid degrading queues that were limited because other queues were stuck.
(4) Usually a bursty queue will add a lot of items per burst. The time between bursts may be longer than the time we specified to detect stuck queues. If a queue has more items than the number specified in a module parameter, we never consider the queue stuck. We use this condition to avoid detecting a bursty queue as a stuck queue.
(5) We do not leave a queue before we have done some minimum amount of work on it. This technique improves cache efficiency and limits the number of queue switches (important to improve throughput).
Signed-off-by: Abel Gordon <abelg@il.ibm.com>
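The five heuristics can be read as three small predicates. The sketch below is a user-space Python model under stated assumptions: the parameter names (`max_work`, `min_work`, `stuck_timeout`, `max_stuck_pending`), their default values and the function names are invented for illustration and are not the module parameters or the function the patch actually exposes.

```python
# Illustrative model of the queue-switching heuristics.
# All names and default values are hypothetical.


class SchedParams:
    def __init__(self, max_work=1000, min_work=100,
                 stuck_timeout=1.0, max_stuck_pending=8):
        self.max_work = max_work                    # heuristic (1)
        self.min_work = min_work                    # heuristic (5)
        self.stuck_timeout = stuck_timeout          # heuristic (2)
        self.max_stuck_pending = max_stuck_pending  # heuristic (4)


def is_stuck(pending_items, last_new_work_time, now, params):
    """Return True if a queue looks latency-sensitive ("stuck")."""
    if pending_items == 0:
        return False  # only non-empty queues can be stuck
    if pending_items > params.max_stuck_pending:
        return False  # (4) a large backlog indicates a bursty queue
    # (2) stuck: enough time has passed since new work last arrived
    return now - last_new_work_time >= params.stuck_timeout


def should_leave_queue(work_done, other_queue_stuck, params):
    """Decide whether to stop processing the current queue."""
    if work_done >= params.max_work:
        return True   # (1) hard limit, avoids starvation
    if work_done < params.min_work:
        return False  # (5) do a minimum amount of work first
    return other_queue_stuck  # (2) leave early for a stuck queue


def requeue_at_head(work_done, params):
    """(3) A queue left before half the maximum goes back to the head."""
    return work_done < params.max_work / 2
```

A backend loop in this model would call `should_leave_queue` after each unit of work and, when it returns true, use `requeue_at_head` to choose where the queue re-enters the round-robin list.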
-
Add virtqueue polling mode to vhost
When vhost is waiting for buffers from the guest driver (e.g., more packets to send in vhost-net's transmit queue), it normally goes to sleep and waits for the guest to "kick" it. This kick involves a PIO in the guest, and therefore an exit (and possibly userspace involvement in translating this PIO exit into a file descriptor event), all of which hurts performance. If we can dedicate a core to vhost, we can have it continuously poll the virtqueues for new buffers, and avoid asking the guest to kick us. Polling on all virtqueues happens on the same worker thread, in round-robin fashion. Thanks to the previous patch, the virtqueues of multiple VMs may be polled on the same worker thread, which allows dedicating only one core to servicing the I/O from multiple vcpus. When polling is active for one of the virtqueues, the guest is asked to disable notification (kicks), and the worker thread continuously checks for new buffers. When it does discover new buffers, it simulates a "kick" by invoking the underlying backend driver (such as vhost-net), which thinks it got a real kick from the guest, and acts accordingly. If the underlying driver asks not to be kicked, we disable polling on this virtqueue. In this version, we start polling on a virtqueue when we notice it has work to do. Polling on this virtqueue is later disabled after 3 seconds of polling turned up no new work, as in this case we are better off returning to the exit-based notification mechanism. The default timeout of 3 seconds can be changed with the "poll_stop_idle" kernel module parameter. Signed-off-by: Abel Gordon <abelg@il.ibm.com>
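The start and stop rules for polling can be sketched as a small state machine. The Python model below is an assumption-laden illustration, not the vhost code: only the `poll_stop_idle` name and its 3-second default come from the commit message, while the class, its methods and the use of explicit timestamps are invented for the example.

```python
# User-space model of per-virtqueue polling with an idle timeout.
# Only poll_stop_idle (default 3 seconds) comes from the commit message.


class PolledVirtqueue:
    def __init__(self, poll_stop_idle=3.0):
        self.poll_stop_idle = poll_stop_idle
        self.polling = False
        self.last_work_time = None
        self.simulated_kicks = 0

    def notice_work(self, now):
        """The queue was seen to have work: start (or keep) polling."""
        self.polling = True
        self.last_work_time = now

    def poll_once(self, now, has_new_buffers):
        """One polling pass; returns True if a kick was simulated."""
        if not self.polling:
            return False
        if has_new_buffers:
            # New buffers found: act as if the guest had kicked us.
            self.last_work_time = now
            self.simulated_kicks += 1
            return True
        if now - self.last_work_time >= self.poll_stop_idle:
            # Idle for too long: fall back to exit-based notification.
            self.polling = False
        return False
```

Passing timestamps in explicitly keeps the model deterministic; a real worker thread would read a clock instead.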
-
Share vhost thread between multiple devices
Vanilla vhost creates a worker thread and worker queue per virtio device. This patch creates a worker thread and worker queue shared across multiple virtio devices (VMs). The maximum number of virtio devices per worker can be specified using a kernel module parameter. Every time a new virtio device is created, we search for a running worker thread that is serving fewer than the maximum number of devices. If we find such a worker thread, we bind the new virtio device to this thread. Otherwise, we create a new worker thread to serve the new virtio device and further devices that may be created in the future. Currently, once a device is bound to a specific worker thread it cannot be migrated to another worker thread. In the future, we should improve the mechanism to balance the devices across threads (migrate a device to a different worker thread) and to create/destroy threads dynamically during runtime based on the I/O activity. Signed-off-by: Abel Gordon <abelg@il.ibm.com>
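The binding policy amounts to a first-fit search over the existing workers. A minimal Python sketch follows, assuming a hypothetical `max_devices_per_worker` argument in place of the real module parameter (whose name the commit message does not give) and plain lists in place of kernel threads.

```python
# First-fit assignment of virtio devices to shared workers.
# Worker and max_devices_per_worker are hypothetical names.


class Worker:
    def __init__(self, worker_id):
        self.worker_id = worker_id
        self.devices = []


def bind_device(workers, device, max_devices_per_worker):
    """Bind a device to the first worker with spare capacity.

    If every existing worker is full, create a new worker. A device is
    never migrated once bound, matching the current limitation.
    """
    for worker in workers:
        if len(worker.devices) < max_devices_per_worker:
            worker.devices.append(device)
            return worker
    worker = Worker(len(workers))
    worker.devices.append(device)
    workers.append(worker)
    return worker
```

With a limit of two devices per worker, binding five devices in sequence yields three workers serving two, two and one device.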
-
Handle EOIs for injected virtual interrupts
KVM traps and emulates EOIs for every virtual interrupt it injects. When we inject interrupts using the paravirtual posted interrupts mechanism, KVM still needs to emulate EOIs unless we enable exitless EOIs (requires x2apic). Signed-off-by: Abel Gordon <abelg@il.ibm.com>
-
Enable paravirtual posted interrupts support
Add a hypercall enabling/disabling virtual interrupt injection via paravirtual posted interrupts. When the enable hypercall is called, we check that the paravirtual posted interrupts mechanism was properly initialized and in this case we enable the mechanism for every VCPU. Signed-off-by: Abel Gordon <abelg@il.ibm.com>
-
VMX implementation of PV posted interrupts
This patch implements the two kvm_x86_ops introduced in a previous patch: send_posted_interrupt and has_posted_interrupts. Every time KVM wishes to inject a virtual interrupt into a given VCPU, we check if the VCPU is currently running on some core. In this case, to avoid unnecessary exits, we try to inject the virtual interrupt using the paravirtual posted interrupts mechanism. If the VCPU is not running, we proceed as usual (queue the interrupt). Signed-off-by: Abel Gordon <abelg@il.ibm.com>
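The injection decision can be summarized as a single branch. The Python model below is a sketch only: the two function names mirror the kvm_x86_ops named in the commit message, but their signatures, the `Vcpu` fields and the return values are assumptions for illustration.

```python
# Model of choosing between posted and traditional injection.
# Vcpu fields and the function signatures are hypothetical.


class Vcpu:
    def __init__(self, in_guest_mode, pi_enabled):
        self.in_guest_mode = in_guest_mode  # running on some core right now
        self.pi_enabled = pi_enabled        # PV posted interrupts enabled
        self.posted = []                    # vectors sent without an exit
        self.queued = []                    # vectors queued the usual way


def has_posted_interrupts(vcpu):
    return vcpu.pi_enabled


def send_posted_interrupt(vcpu, vector):
    vcpu.posted.append(vector)


def inject_virtual_interrupt(vcpu, vector):
    """Use the exitless path when the VCPU is running, else queue."""
    if vcpu.in_guest_mode and has_posted_interrupts(vcpu):
        send_posted_interrupt(vcpu, vector)
        return "posted"
    vcpu.queued.append(vector)
    return "queued"
```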
-
Initialize paravirtual posted interrupts support
The patch adds a hypercall for the guest to pass a pointer to a notification vector variable and an injection page, and initializes paravirtual posted interrupts support for all vcpus of this guest. It does not yet enable exitless injection with PV posted interrupts (this will be done in the next patch). The notification vector variable is used by the host to tell the guest which vector number will be used to deliver posted interrupts (set once during the initialization phase). The injection page is used by the host to tell the guest which virtual interrupt KVM wishes to inject each time a posted interrupt is sent (set every time a virtual interrupt is delivered using the paravirtual posted interrupts mechanism). Once the guest receives an interrupt that corresponds to the number stored in the notification vector variable, the guest reads the injection page to identify the virtual interrupt the host is asking to inject and calls the corresponding handler. Signed-off-by: Abel Gordon <abelg@il.ibm.com>
-
Re-queue posted interrupts received in root mode
Paravirtual posted interrupts allow injecting virtual interrupts into a running guest (without causing it to exit to root mode first). However, in rare cases, at the time we send a posted interrupt from core A the guest running on core B may exit for an unrelated reason, and the IPI we sent (from core A to core B) to signal the posted interrupt arrives in the host instead of the guest. In a previous patch we already reserved this vector in the host. This patch introduces the handler, which re-queues the injection using the traditional KVM injection mechanism (because the guest is not running anymore). Signed-off-by: Abel Gordon <abelg@il.ibm.com>
-
VMX fields for PV posted interrupts
This patch defines the per-vcpu (vmx) fields required for virtual interrupt injection via Exitless IPIs. Signed-off-by: Abel Gordon <abelg@il.ibm.com>
-
Injection using PV posted interrupts
This patch uses paravirtual posted interrupts for injecting virtual interrupts into guests running on another core, instead of using the traditional exit-causing IPI (kvm_vcpu_kick). The patch introduces new kvm_x86_ops methods which will be implemented in later patches. Signed-off-by: Abel Gordon <abelg@il.ibm.com>
-
Reserve vector for paravirtual posted interrupts
This patch is the first in a series of patches enabling exit-less virtual interrupt injection, by implementing paravirtual Posted-Interrupts using ELI. This feature allows KVM running on one core to inject an interrupt into a guest running on another core, without the guest exiting. This is done by choosing one interrupt vector (the "posted interrupts" PI vector) and a memory location. The host writes a vector to be injected to this memory location, and sends the PI vector with an IPI to the core running the guest. This guest is run with ELI to have this specific PI vector delivered directly to the guest, without causing an exit. However, now we need some cooperation from the guest (hence we call this feature paravirtual posted interrupts): The interrupt the guest received is the fixed PI vector, but upon getting it, it needs to read the agreed memory location, and run the handler for the vector written there. This first patch of the series reserves a vector to send Exitless IPIs. It also adds a new counter to KVM_STATS which shows how many virtual interrupt injections were done using Exitless IPIs. If we send an Exitless IPI to a running vcpu, in rare cases, the IPI might arrive in root mode (host) because an exit occurred while the IPI was being sent. Thus, we also add a handler in the host for the IPI and a counter in /proc/interrupts. Signed-off-by: Abel Gordon <abelg@il.ibm.com>
-
Add support for disabling halt exits
If we dedicate a physical core per VCPU (to improve performance), there is no reason to force an exit when the guest executes HLT because the host will not schedule another VCPU on the same core. HLT exits increase the time the CPU spends in root mode (host context) and thus increase the chances that an assigned physical interrupt will arrive in root mode and not in guest mode. To achieve maximum performance, we strive to maximize the number of physical assigned interrupts delivered in guest mode, so it is better to avoid HLT exits if we dedicate a core per VCPU. Signed-off-by: Abel Gordon <abelg@il.ibm.com>
-
With ELI, if a guest disables interrupts for a long time, it can also prevent the host from receiving the timer interrupt - and thus prevent the host from ever giving time slices to other guests. To avoid this security problem, we use the VMX Preemption Timer feature to force an exit after some time has elapsed, regardless of what the guest does (including disabling interrupts). To avoid unnecessary exits, the value of the preempt_timer kernel module parameter should be slightly higher than the host timer interrupt period (in cycles) - exits due to the preemption timer should not occur during normal execution, only in misbehaving guests (in the future, such guests should have ELI disabled for them). If preemption timer exits occur frequently (log warnings), then the specified preemption value might be too short or something might be wrong with the guest. Note that the current default value corresponds to 200 timer interrupts per second on a 2.93GHz processor. In the future we should calculate this automatically. Signed-off-by: Abel Gordon <abelg@il.ibm.com>
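The arithmetic behind the default can be checked with a short calculation. The Python sketch below takes the commit message at its word that the parameter is expressed in CPU cycles; the function names and the 10% safety margin are assumptions for illustration, not values from the patch.

```python
# Worked arithmetic for sizing the preemption timer value.
# Function names and the margin are hypothetical.


def cycles_per_timer_interrupt(cpu_hz, timer_hz):
    """CPU cycles between two consecutive host timer interrupts."""
    return cpu_hz // timer_hz


def preempt_timer_cycles(cpu_hz, timer_hz, margin_percent=10):
    """A value slightly above one timer period, so that a well-behaved
    guest is interrupted by the host timer before the preemption timer
    expires."""
    period = cycles_per_timer_interrupt(cpu_hz, timer_hz)
    return period * (100 + margin_percent) // 100
```

For a 2.93GHz processor and 200 timer interrupts per second, `cycles_per_timer_interrupt(2_930_000_000, 200)` gives 14,650,000 cycles, and a 10% margin raises that to 16,115,000.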
-
ELI needs to trap guest changes to the IDTR, to prevent the guest replacing the Shadow IDT enforced by the host. To do this, it must enable the "Descriptor-table exiting" bit of the VM-execution controls. Unfortunately, this bit enables exiting not only on IDTR changes, but also on changes of the unrelated LDTR, GDTR and TR registers, and when any of these exits come, we need to handle them appropriately. The appropriate way to handle IDTR changes is to rebuild the shadow IDT based on the instruction's operand, and the appropriate way to handle LDTR, GDTR and TR exits is to turn off descriptor-table exiting (i.e., go to injection mode), and run the guest for a while until it exits - letting it handle these instructions normally (which is fine - KVM doesn't normally need to trap and emulate these instructions). In this version, however, we didn't do all of this yet. LDTR and TR changes are handled correctly (via injection mode), but for GDTR and IDTR (which share the same exit reason, so separating them requires further decoding), we just disable ELI which is a valid, although drastic, solution. We note that Linux guests rarely change any of these registers after boot, so provided that ELI is only enabled after boot (by a command doing the hypercall enabling ELI), this case will rarely happen anyway, and if it does, a warning message can be noticed. Signed-off-by: Abel Gordon <abelg@il.ibm.com>
-
Hypercalls for enabling/disabling ELI
This patch introduces hypercalls to start and stop ExitLess Interrupt (ELI) delivery. In a previous patch we already introduced hypercalls for starting and stopping ExitLess Interrupt completion. When ELI is enabled, we prepare the shadow IDT and set all the entries corresponding to non-assigned interrupts as not present. Signed-off-by: Abel Gordon <abelg@il.ibm.com>
-
After previous patches introduced exit-less interrupt delivery (the guest receives assigned interrupts without exit), this patch introduces exit-less interrupt completion - i.e., allowing a guest to EOI an assigned interrupt without an exit. Our implementation relies on x2APIC, in which the APIC EOI register is a separate MSR, so we can use the MSR-Bitmap feature to avoid exits on EOI, while still having exits on a guest attempt to use other APIC registers. Signed-off-by: Abel Gordon <abelg@il.ibm.com>
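To make the MSR-Bitmap idea concrete, the sketch below models the 4 KB bitmap in user-space Python and clears the write-intercept bit for the x2APIC EOI register only. It assumes the layout described in the Intel SDM (write bitmap for MSRs 0x0 to 0x1FFF starting at byte offset 2048, one bit per MSR, a set bit meaning "exit") and the x2APIC EOI MSR number 0x80B; the function names are hypothetical and this is not the code from the patch.

```python
# User-space model of a VMX MSR bitmap with exitless x2APIC EOI writes.
# Layout and constants follow the Intel SDM as understood here;
# function names are hypothetical.

X2APIC_EOI_MSR = 0x80B
MSR_BITMAP_SIZE = 4096
WRITE_LOW_OFFSET = 2048  # write bitmap for MSRs 0x0000..0x1FFF


def new_intercept_all_bitmap():
    """Start from a bitmap in which every MSR access causes an exit."""
    return bytearray(b"\xff" * MSR_BITMAP_SIZE)


def allow_low_msr_write(bitmap, msr):
    """Clear the write-intercept bit so writes to `msr` do not exit."""
    if not 0 <= msr <= 0x1FFF:
        raise ValueError("only low MSRs are handled in this sketch")
    bitmap[WRITE_LOW_OFFSET + msr // 8] &= ~(1 << (msr % 8)) & 0xFF


def write_causes_exit(bitmap, msr):
    """Return True if a guest write to `msr` would cause an exit."""
    return bool(bitmap[WRITE_LOW_OFFSET + msr // 8] & (1 << (msr % 8)))
```

After `allow_low_msr_write(bitmap, X2APIC_EOI_MSR)`, writes to the EOI register no longer exit in this model, while writes to neighbouring x2APIC registers still do.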
-
As described in a previous patch, ELI switches to injection mode every time KVM needs to inject a virtual interrupt. We need to switch back to the regular ELI mode once we finish the injection. Thus, right after the guest acknowledges a virtual interrupt that was delivered in injection mode, we switch back to ELI mode. Outside injection mode, i.e., on acknowledgment of assigned interrupt, we have nothing special to do (we EOI on any exit anyway), and we just need to circumvent the regular EOI handling of KVM. Moreover we'll show later a mechanism (x2APIC) for avoiding exits on EOI in this case. IMPORTANT NOTE: PV-EOI must be disabled (-cpu -kvm_pv_eoi) otherwise ELI will remain in injection mode for longer periods of time causing external interrupt exits for any physical interrupt. Signed-off-by: Abel Gordon <abelg@il.ibm.com>
-
With ELI enabled, the VMCS contains not the guest's chosen IDTR, but rather a shadow IDTR that points to our shadow IDT. In this patch we override the IDTR get and set functions to get/save the guest IDTR and not the shadow IDTR. Signed-off-by: Abel Gordon <abelg@il.ibm.com>
-
In the previous patch, we saw that in ELI, outside injection mode, non-assigned interrupts cause an exit with an NP exception (while assigned interrupts do not cause an exit at all). In this patch, we handle these NP exceptions: We verify that the exception was really caused by an external interrupt, and if it was, we re-generate the same interrupt, causing the host (Linux) to handle this interrupt as usual. To regenerate the interrupt that caused the NP exception exit, we use the INT X instruction (software interrupt). Alternatively, it is also possible to send a self-IPI to regenerate the interrupt. ELI changes the way KVM handles interrupt completion (EOIs). With ELI enabled, assigned interrupts could be delivered during guest mode execution and might remain in service (pending EOI) after an exit. Thus, we always acknowledge (EOI/ack_APIC_irq) on exit to complete pending interrupts and let the hardware continue raising new interrupts. Signed-off-by: Abel Gordon <abelg@il.ibm.com>
-
When not in injection mode, trap NP exceptions
When in ELI but not in injection mode (defined in the previous patch), all interrupts arrive at the guest, but the ones not assigned to the guest cause an NP exception (because a handler for it is not present in the shadow IDT). In this patch, when needed we have these NP exceptions cause exits, so the host can handle the non-assigned interrupts. Note that NP exceptions completely unrelated to ELI are also possible, and they will cause unnecessary exits, but the number of these is usually negligible. In the next patch, we will handle these NP exceptions. Signed-off-by: Abel Gordon <abelg@il.ibm.com>
-
ELI runs the guest using the shadow IDT and configures the processor to deliver physical interrupts directly to the guest (no exit). The shadow IDT causes NP exceptions (exits) for non-assigned interrupts. With ELI enabled, KVM cannot inject virtual interrupts from emulated devices (e.g. keyboard) because the injection will cause an NP exception. Thus, we introduce in this patch a special mode of operation we call "injection mode". During this mode, ELI configures the processor to exit on external interrupts and uses the original guest IDT for guest mode execution. ELI temporarily switches to injection mode each time KVM needs to inject a virtual interrupt. In a later patch we'll see that we switch off injection mode on the next exit, typically an EOI. IMPORTANT NOTE: PV-EOI must be disabled (-cpu -kvm_pv_eoi) otherwise ELI will remain in injection mode for longer periods of time causing external interrupt exits for any physical interrupt. Signed-off-by: Abel Gordon <abelg@il.ibm.com>
-
In device assignment, use eli_remap_vector
This patch modifies device assignment to call kvm_arch_eli_remap_vector() telling it which host IRQ should be mapped to each interrupt vector that the guest chose for the device. Signed-off-by: Abel Gordon <abelg@il.ibm.com>
-
Remap host-vector to guest-vector
In device assignment, the guest believes it configures the device on which interrupt vectors to generate. However, what really happens is that the host chooses different vectors, ones that are available in the host, and when the host receives the *host* vector, it injects (forwards) the different *guest* vector into the guest. With ELI, the interrupt will arrive directly at the guest, but the interrupt vector received will be the *host* vector. So we need the shadow IDT to contain, in position host-vector, the handler that the guest IDT set for guest-vector. This patch provides a function, kvm_arch_eli_remap_vector(), which KVM's device assignment code will call (see the next patch) to remap a given host-chosen IRQ to the guest-chosen vector. The host IRQ is specified, not the vector, because device assignment only chooses the IRQ and the actual vector will only be chosen by the kernel later. If ELI is not yet enabled, kvm_arch_eli_remap_vector() only remembers the mapping but doesn't actually create the shadow IDT. This will be done when the eli_remap() function, also provided in this patch, is called after enabling ELI. Note that ELI assumes the KVM device-assignment implementation. VFIO is not supported by this patch. Signed-off-by: Abel Gordon <abelg@il.ibm.com>
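The remapping rule (shadow IDT entry at host-vector holds the guest's handler for guest-vector) can be illustrated with a small user-space Python model. Everything here is an assumption made for the example: the class, its methods, the `NOT_PRESENT` marker and the IRQ-to-vector table stand in for kernel state, and the model covers only external interrupt entries, ignoring exception vectors.

```python
# User-space model of remapping host vectors to guest handlers in the
# shadow IDT. Names are hypothetical and only loosely follow
# kvm_arch_eli_remap_vector() and eli_remap() from the commit message.

NOT_PRESENT = None  # marks a shadow IDT entry that causes an NP exception


class EliModel:
    def __init__(self, guest_idt, irq_to_host_vector):
        self.guest_idt = guest_idt                    # guest vector -> handler
        self.irq_to_host_vector = irq_to_host_vector  # chosen by host kernel
        self.irq_to_guest_vector = {}                 # remembered mappings
        self.enabled = False
        self.shadow_idt = [NOT_PRESENT] * 256

    def remap_vector(self, host_irq, guest_vector):
        """Remember the mapping; rebuild only if ELI is already enabled."""
        self.irq_to_guest_vector[host_irq] = guest_vector
        if self.enabled:
            self.rebuild_shadow_idt()

    def enable(self):
        self.enabled = True
        self.rebuild_shadow_idt()

    def rebuild_shadow_idt(self):
        """Place each guest handler at the position of the host vector."""
        shadow = [NOT_PRESENT] * 256
        for irq, guest_vector in self.irq_to_guest_vector.items():
            host_vector = self.irq_to_host_vector[irq]
            shadow[host_vector] = self.guest_idt[guest_vector]
        self.shadow_idt = shadow
```

For instance, if host IRQ 17 uses host vector 0x59 and the guest chose vector 0x41 for the device, the shadow IDT ends up with the guest's 0x41 handler at index 0x59, and index 0x41 stays not present.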
-
Expose functions for accessing guest memory
The functions to read and write to guest virtual addresses are currently private to x86.c, but in the next patch we will also need to use them in vmx.c, where ELI will need to read and write the shadow IDT. So this patch exports the necessary functions. Signed-off-by: Abel Gordon <abelg@il.ibm.com>
-
With ELI, the guest is run using a "shadow IDT" instead of the guest's requested IDT. This shadow IDT is built by KVM in a way that causes exits for some interrupts while running the guest's normal handlers on others. Unfortunately, the processor requires that the IDT location be given using a virtual address, so we need to cause the guest to map a spare page which can be used to hold the shadow IDT. This patch allows a guest to specify, via a new hypercall, a location for the shadow IDT in its own virtual memory space. When a guest calls this hypercall, ELI is initialized for this guest, and the shadow IDT is located in the given guest virtual address (GVA). Note that this only initializes ELI, but does *not* enable it. To enable ELI, it must be specified which physical interrupts should be assigned to the guest (arrive directly in the guest, without exit), and of course the shadow IDT needs to be built. This will be done in more patches below. IMPORTANT NOTE: In this implementation we offer no protection against a guest modifying the shadow IDT directly: The guest knows the GVA and GPA of the shadow IDT and can modify it maliciously in one of two ways - either by modifying its contents (writing to the page itself), or by changing its page tables (assuming EPT is being used) to make the same GVA point to a different GPA with different content. The ELI paper explains how both holes can be avoided - the first by trapping writes to the shadow IDT, and the second by using IDTR limit or by periodically verifying that the guest hasn't changed the mapping - but none of these solutions is built into this patch set. Instead, it is recommended to either limit ELI's use to non-malicious guests, or make this a non-issue by pinning critical host interrupts to a core not running guests so that even malicious guests cannot cause these critical interrupts to be lost. Signed-off-by: Abel Gordon <abelg@il.ibm.com>
-
Add missing bit and exit reasons to vmx.h
This patch is the first in a series of patches enabling the "Exit-Less Interrupts" (ELI) feature for KVM on VMX. While ordinarily VMX causes an exit for every physical interrupt, ELI allows KVM to determine that certain interrupts should be handled directly in the guest, without exit. This can significantly improve device-assignment I/O performance, as the only remaining causes of exits - the interrupts and their completion - are eliminated. The ELI technique is explained and evaluated in more detail in the ASPLOS 2012 paper "ELI: Bare-Metal Performance for I/O Virtualization". ELI needs to enable the "Descriptor-table exiting" bit of the VM-execution controls, which will cause exits when the guest attempts to change the IDT pointer through LIDT (or makes changes to other descriptor tables). In this patch, we give a name in vmx.h to this bit, and to the two exit reasons which it leads to. Signed-off-by: Abel Gordon <abelg@il.ibm.com>
-
Add IBM copyright notice to all the files we modified to implement our Virtual I/O acceleration technologies. Signed-off-by: Abel Gordon <abelg@il.ibm.com>
Commits on Apr 27, 2013
-
Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/ker…
…nel/git/arm/arm-soc Pull ARM SoC fix from Olof Johansson: "A late-arriving fix for musb on OMAP4, resolving an issue where the musb IP won't be clocked and thus not functional. Small in scope, most of the lines changed is a longish comment." * tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: ARM: OMAP4: hwmod data: make 'ocp2scp_usb_phy_phy_48m" as the main clock