

# The Path to DPDK Speeds for AF\_XDP

Magnus Karlsson (magnus.karlsson@intel.com), Björn Töpel (bjorn.topel@intel.com)

Linux Plumbers Conference, Vancouver, 2018

### **Legal Disclaimer**

- Intel technologies may require enabled hardware, specific software, or services activation. Check with your system manufacturer or retailer.
- No computer system can be absolutely secure.
- Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will
  affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete
  information about performance and benchmark results, visit www.intel.com/benchmarks.
- Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and
  configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
- All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.
- No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
- Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web
  site and confirm whether referenced data are accurate.
- Intel, the Intel logo, and other Intel product and solution names in this presentation are trademarks of Intel.
- Other names and brands may be claimed as the property of others.
- ©2018 Intel Corporation.

### **XDP 101**



### AF\_XDP 101

- Ingress
  - userspace XDP packet sink
  - XDP\_REDIRECT to socket via XSKMAP
- Egress
  - no XDP program
- Register userspace packet buffer memory to kernel (UMEM)
- Pass packet buffer ownership via descriptor rings

#### AF\_XDP 101



- Fill ring (to kernel) / Rx ring (from kernel)
- Tx ring (to kernel) / Completion ring (from kernel)
- copy mode (DMA to/from kernel allocated frames, copy data to user)
- zero-copy mode (DMA to/from user allocated frames)
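The ownership hand-off described above can be modelled in a few lines of C. This is a minimal single-threaded sketch, not the kernel's data layout: `struct ring`, `ring_put`, `ring_get` and `model_kernel_rx` are hypothetical names, the rings carry only frame addresses (the real Rx and Tx rings also carry a length), and the memory barriers a real producer/consumer pair needs are omitted.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u                  /* power of two, so (index & mask) wraps */
#define RING_MASK (RING_SIZE - 1u)

/* Hypothetical single-producer/single-consumer ring of UMEM frame addresses. */
struct ring {
    uint32_t producer;                /* advanced only by the producing side */
    uint32_t consumer;                /* advanced only by the consuming side */
    uint64_t addr[RING_SIZE];
};

/* Publish one frame address; the frame now belongs to the consumer. */
static inline bool ring_put(struct ring *r, uint64_t addr)
{
    if (r->producer - r->consumer == RING_SIZE)
        return false;                 /* ring is full */
    r->addr[r->producer & RING_MASK] = addr;
    r->producer++;
    return true;
}

/* Take one frame address; the frame now belongs to the caller. */
static inline bool ring_get(struct ring *r, uint64_t *addr)
{
    if (r->producer == r->consumer)
        return false;                 /* ring is empty */
    *addr = r->addr[r->consumer & RING_MASK];
    r->consumer++;
    return true;
}

/* Model of the kernel's Rx side: take a free frame from the fill ring,
 * "receive" a packet into it, and return it to userspace on the Rx ring. */
static inline bool model_kernel_rx(struct ring *fill, struct ring *rx)
{
    uint64_t addr;

    if (rx->producer - rx->consumer == RING_SIZE)
        return false;                 /* nowhere to deliver, keep the frame on fill */
    if (!ring_get(fill, &addr))
        return false;                 /* userspace supplied no buffers */
    return ring_put(rx, addr);
}
```

In this model userspace calls `ring_put` on the fill ring to lend a frame to the kernel and `ring_get` on the Rx ring to take it back; Tx and completion follow the same pattern in the opposite direction.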

## Baseline and optimization strategy

- Baseline
  - Linux 4.20
  - 64B @ ~15-22 Mpps
- Strategy
  - do less (instructions)
  - talk less (coherency traffic)
  - do more at the same time (batching, i\$)
  - Land of Spectres: fewer retpolines, fewer retpolines, fewer retpolines
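"Talk less" and "do more at the same time" can be illustrated together on a descriptor ring: the producer works on private copies of the indices and touches the shared cache lines once per batch instead of once per packet. The sketch below is a hedged illustration under that assumption; `shared_ring`, `producer_view`, `ring_room` and `enqueue_batch` are hypothetical names, the two counters exist only to make the effect visible, and a real implementation would need a release barrier before publishing the producer index.

```c
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 64u
#define RING_MASK (RING_SIZE - 1u)

struct shared_ring {
    uint32_t producer;        /* shared: read by the consumer's CPU */
    uint32_t consumer;        /* shared: written by the consumer's CPU */
    uint64_t addr[RING_SIZE];
};

struct producer_view {
    struct shared_ring *ring;
    uint32_t cached_prod;     /* private: entries written, maybe unpublished */
    uint32_t cached_cons;     /* private: last consumer index we looked at */
    unsigned shared_reads;    /* instrumentation for this example only */
    unsigned shared_writes;
};

/* Free slots, re-reading the shared consumer index only if we look too full. */
static inline uint32_t ring_room(struct producer_view *p, uint32_t want)
{
    uint32_t room = RING_SIZE - (p->cached_prod - p->cached_cons);

    if (room >= want)
        return room;
    p->cached_cons = p->ring->consumer;
    p->shared_reads++;
    return RING_SIZE - (p->cached_prod - p->cached_cons);
}

/* Write up to n entries, then publish them with a single shared store. */
static inline size_t enqueue_batch(struct producer_view *p,
                                   const uint64_t *addrs, size_t n)
{
    uint32_t want = n > RING_SIZE ? RING_SIZE : (uint32_t)n;
    uint32_t room = ring_room(p, want);
    uint32_t i;

    if (want > room)
        want = room;
    for (i = 0; i < want; i++)
        p->ring->addr[p->cached_prod++ & RING_MASK] = addrs[i];
    if (want) {
        p->ring->producer = p->cached_prod;   /* one publish per batch */
        p->shared_writes++;
    }
    return want;
}
```

With a batch of 16 the shared producer index is written once rather than 16 times, and the shared consumer index is not read at all while the cached view shows enough room.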

## **Experimental Setup**

- Broadwell E5-2660 @ 2.7GHz
- 2 cores used for run-to-completion benchmarks
- 1 core used for busy-poll benchmarks
- 2 i40e 40 Gbit/s netdevs, 2 AF\_XDP sockets
- Ixia load generator blasting at full 40 Gbit/s per NIC
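As a rough yardstick for the results that follow, the theoretical 64-byte packet rate of one 40 Gbit/s port can be worked out. The calculation assumes that the 64 B figure is the full Ethernet frame including FCS, and adds the standard 20 B of per-frame wire overhead (preamble, start-of-frame delimiter, inter-frame gap); neither assumption is stated on the slide.

$$
\frac{40 \times 10^{9}\ \text{bit/s}}{(64 + 20)\ \text{B} \times 8\ \text{bit/B}}
= \frac{40 \times 10^{9}}{672}\ \text{pps}
\approx 59.5\ \text{Mpps per port}
$$

Under the same assumptions, two ports at full load offer roughly $2 \times 59.5 \approx 119$ Mpps.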

### **Ingress**

- XDP\_ATTACH and bpf\_xsk\_redirect, attach at-most one socket per netdev queue, load built-in XDP program, 2-level hierarchy
- remove indirect call, bpf\_prog\_run\_xdp
- remove indirect call, XDP actions switch-statement ($\geq 5$ cases $\implies$ jump table)
- driver optimizations (batching, code restructure)
- bpf\_prog\_run\_xdp, xdp\_do\_redirect and xdp\_do\_flush\_map: per-CPU struct bpf\_redirect\_info + struct xdp\_buff + struct xdp\_rxq\_info vs explicit, stack-based context
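The switch-statement bullet can be made concrete. A dense `switch` with five or more cases may be lowered by the compiler to a jump table, which is an indirect jump and therefore pays the retpoline cost on mitigated kernels. One way to avoid that, sketched below, is a compare chain with the expected action first; GCC's `-fno-jump-tables` option is another. The slide does not say which remedy the authors chose, so treat this as an illustration only. The enum values mirror `enum xdp_action` in the Linux UAPI header `linux/bpf.h`; `rx_stats` and `handle_action` are hypothetical.

```c
#include <stdint.h>

/* Same numeric values as enum xdp_action in linux/bpf.h. */
enum xdp_action_model {
    ACT_ABORTED = 0,
    ACT_DROP,
    ACT_PASS,
    ACT_TX,
    ACT_REDIRECT,
};

/* Hypothetical per-queue counters standing in for the real driver work. */
struct rx_stats {
    unsigned redirect;
    unsigned pass;
    unsigned tx;
    unsigned drop;
};

/* Compare chain instead of a switch: only direct, predictable branches,
 * with the action AF_XDP cares about tested first. */
static inline void handle_action(uint32_t act, struct rx_stats *s)
{
    if (act == ACT_REDIRECT) {        /* hot path: packet goes to the socket */
        s->redirect++;
        return;
    }
    if (act == ACT_PASS) {
        s->pass++;
        return;
    }
    if (act == ACT_TX) {
        s->tx++;
        return;
    }
    s->drop++;                        /* ABORTED, DROP and unknown values */
}
```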

## Ingress, results<sup>1</sup>, data not touched



<sup>1</sup>Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance/datacenter.

## **Egress**

- Tx performance capped per HW queue
   ⇒ multiple Tx sockets per UMEM
- Larger/more batching, larger descriptor rings
- Dedicated AF\_XDP HW Tx queues
- In-order completion, setsockopt XDP\_INORDER\_COMPLETION
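The slide names the in-order completion option but not its mechanics. One plausible reading, offered here purely as an assumption, is that when frames complete in submission order the application can remember what it sent in a FIFO, so a completion only has to convey a count rather than individual frame addresses. `inflight_fifo`, `tx_submit` and `tx_complete` are hypothetical names and not part of any kernel or libbpf API.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_INFLIGHT 16u                 /* power of two */
#define INFLIGHT_MASK (MAX_INFLIGHT - 1u)

/* Application-side record of frames handed to the Tx ring, oldest first. */
struct inflight_fifo {
    uint32_t head;                       /* oldest frame not yet completed */
    uint32_t tail;                       /* next free slot */
    uint64_t addr[MAX_INFLIGHT];
};

/* Remember a frame at submission time. Returns 0, or -1 if the FIFO is full. */
static inline int tx_submit(struct inflight_fifo *f, uint64_t addr)
{
    if (f->tail - f->head == MAX_INFLIGHT)
        return -1;
    f->addr[f->tail++ & INFLIGHT_MASK] = addr;
    return 0;
}

/* Given only a completion count, recover which frames are free again.
 * Returns the number of addresses written to out. */
static inline size_t tx_complete(struct inflight_fifo *f, uint32_t count,
                                 uint64_t *out)
{
    size_t n = 0;

    while (n < count && f->head != f->tail)
        out[n++] = f->addr[f->head++ & INFLIGHT_MASK];
    return n;
}
```

The point of the sketch is only that ordering makes the addresses redundant: the completion path can shrink to a counter update.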



## Egress, results<sup>1</sup>, data not touched




## Busy poll() vs run-to-completion



## Busy poll() vs run-to-completion, results<sup>1</sup>




## **Comparison with DPDK**

- Userspace, vectorized drivers
- "Learning from the DPDK" http://vger.kernel.org/netconf2018\_files/StephenHemminger\_netconf2018.pdf

## Comparison with DPDK, results<sup>1</sup>




### **Next steps**

### Upstream!

- XDP: switch-statement
- Rx/Tx: drivers
- Rx: XDP\_ATTACH and bpf\_xsk\_redirect
- libbpf AF\_XDP support
- Tx: multiple Tx sockets per UMEM
- selftest, samples

### **Future work**

- hugepage support, less fill ring traffic (get\_user\_pages)
- fd.io/VPP work vectors (i\$, explicit batching in function calls)
- "XDP first" drivers
- collaborate/share code with RDMA (e.g. get\_user\_pages)
- Type-writer model (currently not planned)

## TL;DR

- Rx 15.1 to 39.3 Mpps (260% of baseline)
- Tx 25.3 to 68.0 Mpps (269% of baseline)
- Busy poll() promising
- DPDK still faster for "notouch", but AF\_XDP on par when data is touched
- drivers need to change when skb is not the only consumer
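The Rx and Tx percentages above are ratios of the optimized rate to the baseline rate, as the arithmetic shows:

$$
\frac{39.3}{15.1} \approx 2.60
\qquad
\frac{68.0}{25.3} \approx 2.69
$$

In other words, throughput is about 2.6x and 2.7x the baseline, which corresponds to increases of roughly 160% and 169%.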

### Thanks!

- Ilias Apalodimas
- Daniel Borkmann
- Jesper Dangaard Brouer
- Willem De Bruijn
- Eric Dumazet
- Alexander Duyck
- Mykyta Iziumtsev
- Jakub Kicinski
- Song Liu
- David S. Miller
- Sridhar Samudrala
- Yonghong Song
- Alexei Starovoitov
- William Tu
- Anil Vasudevan
- Jingjing Wu
- Qi Zhang

# **Questions?**

