Commit 78b6f6e

Merge branch 'for-6.15/io_uring-rx-zc' into for-6.15/io_uring-reg-vec
* for-6.15/io_uring-rx-zc: (80 commits)
  io_uring/zcrx: add selftest case for recvzc with read limit
  io_uring/zcrx: add a read limit to recvzc requests
  io_uring: add missing IORING_MAP_OFF_ZCRX_REGION in io_uring_mmap
  io_uring: Rename KConfig to Kconfig
  io_uring/zcrx: fix leaks on failed registration
  io_uring/zcrx: recheck ifq on shutdown
  io_uring/zcrx: add selftest
  net: add documentation for io_uring zcrx
  io_uring/zcrx: add copy fallback
  io_uring/zcrx: throttle receive requests
  io_uring/zcrx: set pp memory provider for an rx queue
  io_uring/zcrx: add io_recvzc request
  io_uring/zcrx: dma-map area for the device
  io_uring/zcrx: implement zerocopy receive pp memory provider
  io_uring/zcrx: grab a net device
  io_uring/zcrx: add io_zcrx_area
  io_uring/zcrx: add interface queue and refill queue
  net: add helpers for setting a memory provider on an rx queue
  net: page_pool: add memory provider helpers
  net: prepare for non devmem TCP memory providers
  ...
2 parents 94765d7 + 89baa22 commit 78b6f6e

107 files changed, +3554 -3357 lines changed

Documentation/arch/s390/driver-model.rst

Lines changed: 1 addition & 1 deletion
@@ -244,7 +244,7 @@ information about the interrupt from the irb parameter.
 --------------------
 
 The ccwgroup mechanism is designed to handle devices consisting of multiple ccw
-devices, like lcs or ctc.
+devices, like qeth or ctc.
 
 The ccw driver provides a 'group' attribute. Piping bus ids of ccw devices to
 this attributes creates a ccwgroup device consisting of these ccw devices (if

Documentation/devicetree/bindings/net/faraday,ftgmac100.yaml

Lines changed: 3 additions & 0 deletions
@@ -44,6 +44,9 @@ properties:
   phy-mode:
     enum:
       - rgmii
+      - rgmii-id
+      - rgmii-rxid
+      - rgmii-txid
       - rmii
 
   phy-handle: true

Documentation/netlink/genetlink-c.yaml

Lines changed: 3 additions & 2 deletions
@@ -14,9 +14,10 @@ $defs:
     pattern: ^[0-9A-Za-z_-]+( - 1)?$
     minimum: 0
   len-or-limit:
-    # literal int or limit based on fixed-width type e.g. u8-min, u16-max, etc.
+    # literal int, const name, or limit based on fixed-width type
+    # e.g. u8-min, u16-max, etc.
     type: [ string, integer ]
-    pattern: ^[su](8|16|32|64)-(min|max)$
+    pattern: ^[0-9A-Za-z_-]+$
     minimum: 0
 
 # Schema for specs

Documentation/netlink/genetlink-legacy.yaml

Lines changed: 3 additions & 2 deletions
@@ -14,9 +14,10 @@ $defs:
     pattern: ^[0-9A-Za-z_-]+( - 1)?$
     minimum: 0
   len-or-limit:
-    # literal int or limit based on fixed-width type e.g. u8-min, u16-max, etc.
+    # literal int, const name, or limit based on fixed-width type
+    # e.g. u8-min, u16-max, etc.
     type: [ string, integer ]
-    pattern: ^[su](8|16|32|64)-(min|max)$
+    pattern: ^[0-9A-Za-z_-]+$
     minimum: 0
 
 # Schema for specs

Documentation/netlink/genetlink.yaml

Lines changed: 3 additions & 2 deletions
@@ -14,9 +14,10 @@ $defs:
     pattern: ^[0-9A-Za-z_-]+( - 1)?$
     minimum: 0
   len-or-limit:
-    # literal int or limit based on fixed-width type e.g. u8-min, u16-max, etc.
+    # literal int, const name, or limit based on fixed-width type
+    # e.g. u8-min, u16-max, etc.
     type: [ string, integer ]
-    pattern: ^[su](8|16|32|64)-(min|max)$
+    pattern: ^[0-9A-Za-z_-]+$
     minimum: 0
 
 # Schema for specs
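
Reviewer note on the three `len-or-limit` hunks above: the old pattern only admitted fixed-width limit names, while the new one admits any identifier-like string, which is what lets a const name through. The sketch below checks both patterns with POSIX extended regexes in C. This is an approximation (the schema itself is evaluated by a JSON Schema validator, not by `regcomp`), and `matches`, `OLD_LIMIT`, `NEW_LIMIT` and the sample value `my-const-name` are names made up for illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Patterns copied from the hunks above; the variable names are hypothetical. */
static const char *const OLD_LIMIT = "^[su](8|16|32|64)-(min|max)$";
static const char *const NEW_LIMIT = "^[0-9A-Za-z_-]+$";

/* Return true if `value` matches the POSIX extended regex `pattern`. */
static bool matches(const char *pattern, const char *value)
{
    regex_t re;
    bool ok;

    assert(pattern != NULL && value != NULL);

    /* REG_NOSUB: only a yes/no answer is needed, not match offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return false;

    ok = regexec(&re, value, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}
```

With these helpers, `matches(OLD_LIMIT, "my-const-name")` is false and `matches(NEW_LIMIT, "my-const-name")` is true, while `u8-max` is accepted by both. Integer values are not affected by `pattern`; they remain bounded by `minimum: 0`.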

Documentation/netlink/specs/netdev.yaml

Lines changed: 15 additions & 0 deletions
@@ -114,6 +114,9 @@ attribute-sets:
         doc: Bitmask of enabled AF_XDP features.
         type: u64
         enum: xsk-flags
+  -
+    name: io-uring-provider-info
+    attributes: []
   -
     name: page-pool
     attributes:
@@ -171,6 +174,11 @@ attribute-sets:
         name: dmabuf
         doc: ID of the dmabuf this page-pool is attached to.
         type: u32
+      -
+        name: io-uring
+        doc: io-uring memory provider information.
+        type: nest
+        nested-attributes: io-uring-provider-info
   -
     name: page-pool-info
     subset-of: page-pool
@@ -296,6 +304,11 @@ attribute-sets:
         name: dmabuf
         doc: ID of the dmabuf attached to this queue, if any.
         type: u32
+      -
+        name: io-uring
+        doc: io_uring memory provider information.
+        type: nest
+        nested-attributes: io-uring-provider-info
 
   -
     name: qstats
@@ -572,6 +585,7 @@ operations:
             - inflight-mem
             - detach-time
             - dmabuf
+            - io-uring
       dump:
         reply: *pp-reply
       config-cond: page-pool
@@ -637,6 +651,7 @@ operations:
             - napi-id
             - ifindex
             - dmabuf
+            - io-uring
       dump:
         request:
           attributes:

Documentation/networking/index.rst

Lines changed: 1 addition & 0 deletions
@@ -63,6 +63,7 @@ Contents:
    gtp
    ila
    ioam6-sysctl
+   iou-zcrx
    ip_dynaddr
    ipsec
    ip-sysctl

Documentation/networking/iou-zcrx.rst

Lines changed: 202 additions & 0 deletions
@@ -0,0 +1,202 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+io_uring zero copy Rx
+=====================
+
+Introduction
+============
+
+io_uring zero copy Rx (ZC Rx) is a feature that removes kernel-to-user copy on
+the network receive path, allowing packet data to be received directly into
+userspace memory. This feature is different to TCP_ZEROCOPY_RECEIVE in that
+there are no strict alignment requirements and no need to mmap()/munmap().
+Compared to kernel bypass solutions such as e.g. DPDK, the packet headers are
+processed by the kernel TCP stack as normal.
+
+NIC HW Requirements
+===================
+
+Several NIC HW features are required for io_uring ZC Rx to work. For now the
+kernel API does not configure the NIC and it must be done by the user.
+
+Header/data split
+-----------------
+
+Required to split packets at the L4 boundary into a header and a payload.
+Headers are received into kernel memory as normal and processed by the TCP
+stack as normal. Payloads are received into userspace memory directly.
+
+Flow steering
+-------------
+
+Specific HW Rx queues are configured for this feature, but modern NICs
+typically distribute flows across all HW Rx queues. Flow steering is required
+to ensure that only desired flows are directed towards HW queues that are
+configured for io_uring ZC Rx.
+
+RSS
+---
+
+In addition to flow steering above, RSS is required to steer all other non-zero
+copy flows away from queues that are configured for io_uring ZC Rx.
+
+Usage
+=====
+
+Setup NIC
+---------
+
+Must be done out of band for now.
+
+Ensure there are at least two queues::
+
+  ethtool -L eth0 combined 2
+
+Enable header/data split::
+
+  ethtool -G eth0 tcp-data-split on
+
+Carve out half of the HW Rx queues for zero copy using RSS::
+
+  ethtool -X eth0 equal 1
+
+Set up flow steering, bearing in mind that queues are 0-indexed::
+
+  ethtool -N eth0 flow-type tcp6 ... action 1
+
+Setup io_uring
+--------------
+
+This section describes the low level io_uring kernel API. Please refer to
+liburing documentation for how to use the higher level API.
+
+Create an io_uring instance with the following required setup flags::
+
+  IORING_SETUP_SINGLE_ISSUER
+  IORING_SETUP_DEFER_TASKRUN
+  IORING_SETUP_CQE32
+
+Create memory area
+------------------
+
+Allocate userspace memory area for receiving zero copy data::
+
+  void *area_ptr = mmap(NULL, area_size,
+                        PROT_READ | PROT_WRITE,
+                        MAP_ANONYMOUS | MAP_PRIVATE,
+                        0, 0);
+
+Create refill ring
+------------------
+
+Allocate memory for a shared ringbuf used for returning consumed buffers::
+
+  void *ring_ptr = mmap(NULL, ring_size,
+                        PROT_READ | PROT_WRITE,
+                        MAP_ANONYMOUS | MAP_PRIVATE,
+                        0, 0);
+
+This refill ring consists of some space for the header, followed by an array of
+``struct io_uring_zcrx_rqe``::
+
+  size_t rq_entries = 4096;
+  size_t ring_size = rq_entries * sizeof(struct io_uring_zcrx_rqe) + PAGE_SIZE;
+  /* align to page size */
+  ring_size = (ring_size + (PAGE_SIZE - 1)) & ~(PAGE_SIZE - 1);
+
+Register ZC Rx
+--------------
+
+Fill in registration structs::
+
+  struct io_uring_zcrx_area_reg area_reg = {
+    .addr = (__u64)(unsigned long)area_ptr,
+    .len = area_size,
+    .flags = 0,
+  };
+
+  struct io_uring_region_desc region_reg = {
+    .user_addr = (__u64)(unsigned long)ring_ptr,
+    .size = ring_size,
+    .flags = IORING_MEM_REGION_TYPE_USER,
+  };
+
+  struct io_uring_zcrx_ifq_reg reg = {
+    .if_idx = if_nametoindex("eth0"),
+    /* this is the HW queue with desired flow steered into it */
+    .if_rxq = 1,
+    .rq_entries = rq_entries,
+    .area_ptr = (__u64)(unsigned long)&area_reg,
+    .region_ptr = (__u64)(unsigned long)&region_reg,
+  };
+
+Register with kernel::
+
+  io_uring_register_ifq(ring, &reg);
+
+Map refill ring
+---------------
+
+The kernel fills in fields for the refill ring in the registration ``struct
+io_uring_zcrx_ifq_reg``. Map it into userspace::
+
+  struct io_uring_zcrx_rq refill_ring;
+
+  refill_ring.khead = (unsigned *)((char *)ring_ptr + reg.offsets.head);
+  refill_ring.ktail = (unsigned *)((char *)ring_ptr + reg.offsets.tail);
+  refill_ring.rqes =
+    (struct io_uring_zcrx_rqe *)((char *)ring_ptr + reg.offsets.rqes);
+  refill_ring.rq_tail = 0;
+  refill_ring.ring_ptr = ring_ptr;
+
+Receiving data
+--------------
+
+Prepare a zero copy recv request::
+
+  struct io_uring_sqe *sqe;
+
+  sqe = io_uring_get_sqe(ring);
+  io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, fd, NULL, 0, 0);
+  sqe->ioprio |= IORING_RECV_MULTISHOT;
+
+Now, submit and wait::
+
+  io_uring_submit_and_wait(ring, 1);
+
+Finally, process completions::
+
+  struct io_uring_cqe *cqe;
+  unsigned int count = 0;
+  unsigned int head;
+
+  io_uring_for_each_cqe(ring, head, cqe) {
+    struct io_uring_zcrx_cqe *rcqe = (struct io_uring_zcrx_cqe *)(cqe + 1);
+
+    unsigned long mask = (1ULL << IORING_ZCRX_AREA_SHIFT) - 1;
+    unsigned char *data = area_ptr + (rcqe->off & mask);
+    /* do something with the data */
+
+    count++;
+  }
+  io_uring_cq_advance(ring, count);
+
+Recycling buffers
+-----------------
+
+Return buffers back to the kernel to be used again::
+
+  struct io_uring_zcrx_rqe *rqe;
+  unsigned mask = refill_ring.ring_entries - 1;
+  rqe = &refill_ring.rqes[refill_ring.rq_tail & mask];
+
+  unsigned long area_offset = rcqe->off & ~IORING_ZCRX_AREA_MASK;
+  rqe->off = area_offset | area_reg.rq_area_token;
+  rqe->len = cqe->res;
+  IO_URING_WRITE_ONCE(*refill_ring.ktail, ++refill_ring.rq_tail);
+
+Testing
+=======
+
+See ``tools/testing/selftests/drivers/net/hw/iou-zcrx.c``
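
Reviewer note on the arithmetic in the document above: the ring sizing, the slot index and the offset/token handling are all plain bit operations, so they can be checked without a NIC. The following is a minimal sketch, not the kernel's or liburing's code. `zcrx_rqe`, `ZCRX_AREA_SHIFT` and `ZCRX_AREA_MASK` are local stand-ins assumed to mirror `struct io_uring_zcrx_rqe`, `IORING_ZCRX_AREA_SHIFT` and `IORING_ZCRX_AREA_MASK`; the shift value 48 and all helper names are assumptions, and real code should take the definitions from `<linux/io_uring.h>`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: stand-ins for the uapi definitions, not the real headers. */
#define ZCRX_AREA_SHIFT 48
#define ZCRX_AREA_MASK  (~((UINT64_C(1) << ZCRX_AREA_SHIFT) - 1))

/* Assumed layout of one refill queue entry (16 bytes). */
struct zcrx_rqe {
    uint64_t off;
    uint32_t len;
    uint32_t pad;
};

/* Header page plus the rqe array, rounded up to a whole number of pages. */
static size_t refill_ring_bytes(size_t rq_entries, size_t page_size)
{
    size_t bytes = rq_entries * sizeof(struct zcrx_rqe) + page_size;

    return (bytes + page_size - 1) & ~(page_size - 1);
}

/* Strip the area bits from a completion offset, leaving the byte offset. */
static uint64_t area_offset(uint64_t cqe_off)
{
    return cqe_off & ~ZCRX_AREA_MASK;
}

/* Build the offset written back into a refill entry. */
static uint64_t rqe_off(uint64_t cqe_off, uint64_t area_token)
{
    return area_offset(cqe_off) | area_token;
}

/* Slot for the next refill entry; ring_entries must be a power of two. */
static size_t rq_slot(uint32_t rq_tail, uint32_t ring_entries)
{
    assert(ring_entries != 0 && (ring_entries & (ring_entries - 1)) == 0);
    return rq_tail & (ring_entries - 1);
}
```

For the values used in the document (4096 entries, and assuming 4 KiB pages), `refill_ring_bytes(4096, 4096)` is already page aligned, so the rounding step only matters for entry counts that do not fill whole pages.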

Kconfig

Lines changed: 2 additions & 0 deletions
@@ -30,3 +30,5 @@ source "lib/Kconfig"
 source "lib/Kconfig.debug"
 
 source "Documentation/Kconfig"
+
+source "io_uring/Kconfig"

arch/s390/include/asm/irq.h

Lines changed: 0 additions & 1 deletion
@@ -54,7 +54,6 @@ enum interruption_class {
 	IRQIO_C70,
 	IRQIO_TAP,
 	IRQIO_VMR,
-	IRQIO_LCS,
 	IRQIO_CTC,
 	IRQIO_ADM,
 	IRQIO_CSC,
