
curl ngtcp2 performance regression on fast links #15267

Closed
vranem1 opened this issue Oct 11, 2024 · 23 comments
Assignees: icing
Labels: HTTP/3 (h3 or quic related), performance

Comments

@vranem1 commented Oct 11, 2024

I did this

Starting from this change: 97c0f89bd, I observe a slight performance improvement on slower links, but also a significant performance regression on fast links when downloading an HTTP/3 resource.

Steps to reproduce:

  1. Build curl with HTTP/3 support from these component versions:
    OpenSSL: OpenSSL_1_1_1w-37-g612d8e44d6
    nghttp3: v1.6.0-1-g7f40779
    ngtcp2: v1.8.0-1-g8d5ab787
    nghttp2: v1.63.0-15-g4202608e
    curl: curl-8_10_1-146-g97c0f89bd

  2. Build a second binary from the same components, but with the previous curl git commit.

  3. Set up an HTTP/3 server and a bandwidth limit using netem with the following parameters:
    rate=1Gbit, rtt=2ms, loss=0, jitter=0

  4. Download a 100 MB file and measure the time (a minimal timing sketch follows below).
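
For illustration only, not the reporter's actual test setup: a minimal libcurl sketch that times an HTTP/3 download, roughly equivalent to `curl --http3 -o /dev/null -w '%{time_total}' <url>`. The URL below is a placeholder.

/* Hypothetical timing sketch, not the reporter's harness: download a URL
 * over HTTP/3 with libcurl and print the total transfer time. */
#include <stdio.h>
#include <curl/curl.h>

/* discard the response body; only the timing matters here */
static size_t discard_cb(char *ptr, size_t size, size_t nmemb, void *userdata)
{
  (void)ptr;
  (void)userdata;
  return size * nmemb;
}

int main(int argc, char *argv[])
{
  CURL *h;
  CURLcode rc;
  curl_off_t total_us = 0;
  const char *url = (argc > 1) ? argv[1] : "https://example.net/100MB.bin";

  curl_global_init(CURL_GLOBAL_DEFAULT);
  h = curl_easy_init();
  if(!h)
    return 1;
  curl_easy_setopt(h, CURLOPT_URL, url);
  /* needs a libcurl built with HTTP/3 support (ngtcp2 + nghttp3) */
  curl_easy_setopt(h, CURLOPT_HTTP_VERSION, (long)CURL_HTTP_VERSION_3);
  curl_easy_setopt(h, CURLOPT_WRITEFUNCTION, discard_cb);

  rc = curl_easy_perform(h);
  if(rc == CURLE_OK) {
    curl_easy_getinfo(h, CURLINFO_TOTAL_TIME_T, &total_us);
    printf("total time: %.3f s\n", (double)total_us / 1000000.0);
  }
  else
    fprintf(stderr, "transfer failed: %s\n", curl_easy_strerror(rc));

  curl_easy_cleanup(h);
  curl_global_cleanup();
  return (rc == CURLE_OK) ? 0 : 1;
}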

I expected the following

The download duration with the older binary is approximately 3 s, but 100 s with the newer one.

The faster the emulated link, the bigger the regression we get: for example, on a 1.5 Gbit line the regression is 1000%!

curl/libcurl version

curl 8.10 and newer (tested with the main branch today)

operating system

Ubuntu 22.04

@icing (Contributor) commented Oct 11, 2024

This sounds like recvmmsg/sendmmsg does not work well on your system.

So, just to be sure that we are not dealing with another effect: assuming you build with automake, could you apply the following patch and check whether the good performance returns?

diff --git a/configure.ac b/configure.ac
index 4a9a7fc5f8..73d086decd 100644
--- a/configure.ac
+++ b/configure.ac
@@ -575,11 +575,11 @@ sac

 # In order to detect support of sendmmsg(), we need to escape the POSIX
 # jail by defining _GNU_SOURCE or <sys/socket.h> will not expose it.
-case $host_os in
-  linux*)
-    CPPFLAGS="$CPPFLAGS -D_GNU_SOURCE"
-    ;;
-esac
+#case $host_os in
+#  linux*)
+#    CPPFLAGS="$CPPFLAGS -D_GNU_SOURCE"
+#    ;;
+#esac

 dnl Build unit tests when option --enable-debug is given.
 if test "x$want_debug" = "xyes" &&

icing self-assigned this Oct 11, 2024
bagder added the performance and HTTP/3 (h3 or quic related) labels Oct 12, 2024
@vranem1 (Author) commented Oct 14, 2024

The source code with the patch fails to compile with the following error:

vquic/vquic.c: In function ‘recvmmsg_packets’:
vquic/vquic.c:362:18: error: array type has incomplete element type ‘struct mmsghdr’
  362 |   struct mmsghdr mmsg[MMSG_NUM];
      |                  ^~~~
vquic/vquic.c:391:21: error: implicit declaration of function ‘recvmmsg’; did you mean ‘recvmsg’? [-Werror=implicit-function-declaration]
  391 |     while((mcount = recvmmsg(qctx->sockfd, mmsg, n, 0, NULL)) == -1 &&
      |                     ^~~~~~~~
      |                     recvmsg
cc1: some warnings being treated as errors

@icing (Contributor) commented Oct 17, 2024

OK, let's try disabling mmsg in a simpler way via:

diff --git a/lib/vquic/vquic.c b/lib/vquic/vquic.c
index 16bfe4ccd..4926b4e1a 100644
--- a/lib/vquic/vquic.c
+++ b/lib/vquic/vquic.c
@@ -52,6 +52,8 @@

 #ifdef USE_HTTP3

+#undef HAVE_SENDMMSG
+
 #ifdef O_BINARY
 #define QLOGMODE O_WRONLY|O_CREAT|O_BINARY
 #else

@koujaz commented Oct 21, 2024

While investigating this regression I noticed the traffic started to hit flow-control limits.
In the sample communication I can see a number of SDB frames:
602x STREAM_DATA_BLOCKED (0x0000000000000015)
which suggests the client may be too conservative with its MSD frames:
1426x MAX_STREAM_DATA (0x0000000000000011)

From that I suspect it is not related to GSO.
(At least not directly. Maybe there are GSO-induced timing changes which influence flow control? Going to dig deeper...)

Also, enabling/disabling offloads on the interfaces doesn't make the SDB frames disappear.

@icing (Contributor) commented Oct 28, 2024

FTR: I am unable to reproduce. Using the setup described in #15415 on my local network with 1GB/s ethernet, the scorecard download runs QUIC/HTTP/3 at 84 MB/s. If I understood this issue description correctly, this is not what the reporter observes in their setup.

@bagder (Member) commented Oct 28, 2024

@koujaz maybe b34b757 improved things a little?

curl deleted a comment Oct 28, 2024
@koujaz commented Oct 29, 2024

Tried curl w/o MMSG as suggested above. Also w/ recent versions of the components.
Some observations:

  • disabling mmsg indeed restores the original performance; however, one can still see MSD/SDB frames in the traffic (it seems the transfer was flow-control limited even before, so this looks like a separate issue)
  • tried a commit beyond b34b757 and can confirm a slight improvement; however, the SDB/MSD frames do not go away completely (both with and without mmsg)
  • noticed some packets claimed lost by the server (again, both with and without mmsg) - trying to hunt this down, as it now seems the most likely cause of the problem (unrelated to flow control; I guess mmsg just somehow triggers larger loss)

@koujaz commented Oct 30, 2024

So the loss is definitely induced by the use of recvmmsg. I instrumented the code (recvmmsg at lib/vquic/vquic.c and recv_pkt at curl/lib/vquic/curl_ngtcp2.c) to see how the data are passed around. I'm not familiar w/ curl internals, but I assume the "recv_pkt" callback should be invoked once per complete received packet.

From the debug prints I see cases like below.
Note:

  • the connection dcid="ABC1EE...", so the packet starts seem right
  • pktlen=848 looks wrong (note 848=2048-1200)
  • pktlen=2048 looks wrong
  • all of the above packets (during the transfer) should have 1200 bytes (the traffic itself is OK, as can be verified by sniffing it on the server in the middle, where the segmentation offloads are disabled)

I'll dig further to see whether the problem is in our system/Linux or in the lib/vquic/vquic.c handling of the recvmmsg call. (A sketch of the gso_size splitting follows after the trace below.)

[recvmmsg] max_pkts=128
    pkts=0 n=64
    mcount=1
    i=0
        msg_len=2048
        gso_size=1200
        recv_cb: pktlen=1200
[recv_pkt] len=1200 43 AB C1 EE A0 C6 DA C4 1E 61 5F 96 0F 27 0B 3C 3E 15 5A E4
        recv_cb: pktlen=848
[recv_pkt] len=848 52 AB C1 EE A0 C6 DA C4 1E 61 5F 96 0F 27 0B 3C 3E 15 5A E4
    pkts=2 n=64
    mcount=-1
[recvmmsg] max_pkts=128
    pkts=0 n=64
    mcount=-1
[recvmmsg] max_pkts=128
    pkts=0 n=64
    mcount=1
    i=0
        msg_len=2048
        gso_size=1200
        recv_cb: pktlen=1200
[recv_pkt] len=1200 5A AB C1 EE A0 C6 DA C4 1E 61 5F 96 0F 27 0B 3C 3E 15 5A E4
         recv_cb: pktlen=848
[recv_pkt] len=848 42 AB C1 EE A0 C6 DA C4 1E 61 5F 96 0F 27 0B 3C 3E 15 5A E4
...
[recvmmsg] max_pkts=128
    pkts=0 n=64
    mcount=3
    i=0 
        msg_len=2048
        gso_size=2048
        recv_cb: pktlen=2048
[recv_pkt] len=2048 4D AB C1 EE A0 C6 DA C4 1E 61 5F 96 0F 27 0B 3C 3E 15 5A E4     
    i=1 
        msg_len=2048
        gso_size=1200
        recv_cb: pktlen=1200
[recv_pkt] len=1200 56 AB C1 EE A0 C6 DA C4 1E 61 5F 96 0F 27 0B 3C 3E 15 5A E4
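
For illustration, with hypothetical names rather than curl's actual code: a GRO-coalesced UDP payload is cut into gso_size-sized chunks and each chunk is handed to the per-packet callback. If the receive buffer is smaller than the coalesced payload, the datagram is truncated, and the split then yields a valid 1200-byte packet followed by a bogus 848-byte tail (848 = 2048 - 1200), matching the trace above.

/* Hypothetical sketch: split a (possibly GRO-coalesced) UDP payload into
 * individual QUIC packets of gso_size bytes and pass each to a callback. */
#include <stddef.h>

typedef void (*pkt_cb)(const unsigned char *pkt, size_t pktlen, void *ctx);

static void split_gro_payload(const unsigned char *buf, size_t buflen,
                              size_t gso_size, pkt_cb cb, void *ctx)
{
  size_t off = 0;
  if(!gso_size)                 /* no GRO cmsg: the payload is one packet */
    gso_size = buflen;
  while(off < buflen) {
    size_t pktlen = buflen - off;
    if(pktlen > gso_size)
      pktlen = gso_size;
    /* if buf was truncated by a too-small receive buffer, the last chunk
     * is a partial packet (e.g. 848 = 2048 - 1200) that the QUIC stack
     * cannot decrypt and drops */
    cb(buf + off, pktlen, ctx);
    off += pktlen;
  }
}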

@icing (Contributor) commented Oct 30, 2024

So, either the kernel or our vquic.c code does not assess the individual packet lengths correctly, and therefore the data gets dropped by ngtcp2, which effectively leads to packet loss for the QUIC connection?

/cc @tatsuhiro-t

@koujaz commented Oct 30, 2024

Yes; this is how I understand the problem. It also explains why the regression gets bigger the faster the link gets (the more packets get read by a single recvmmsg call).
Please advise if there is some debug that could be enabled. Meanwhile I'll keep working w/ instrumented prints...

@tatsuhiro-t (Contributor)

In recvmmsg_packets, each buffer has 2048 bytes, which is too small for GRO, and I think the input is truncated.
If the buffer size must be this small, then disable GRO when calling recvmmsg.
If we can increase the buffer size to 64k, that would be the way to go.
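
A rough sketch of the 64k direction, with hypothetical names and sizes rather than the eventual patch: give every mmsghdr its own 64KB buffer, allocated off the stack, so a fully coalesced GRO datagram fits without truncation.

/* Hypothetical sketch: heap-allocated 64KB receive buffers for recvmmsg()
 * so GRO-coalesced datagrams are not truncated. Control-message (UDP_GRO)
 * setup is omitted here for brevity. */
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define MMSG_NUM  16          /* messages read per recvmmsg() call */
#define BUF_SIZE  65536       /* 64KB, the maximum UDP datagram size */

static int recv_batch(int sockfd)
{
  struct mmsghdr mmsg[MMSG_NUM];
  struct iovec iov[MMSG_NUM];
  unsigned char *bufs;
  int i, mcount;

  /* one large heap allocation instead of MMSG_NUM * 64KB on the stack */
  bufs = malloc((size_t)MMSG_NUM * BUF_SIZE);
  if(!bufs)
    return -1;

  memset(mmsg, 0, sizeof(mmsg));
  for(i = 0; i < MMSG_NUM; i++) {
    iov[i].iov_base = bufs + (size_t)i * BUF_SIZE;
    iov[i].iov_len = BUF_SIZE;
    mmsg[i].msg_hdr.msg_iov = &iov[i];
    mmsg[i].msg_hdr.msg_iovlen = 1;
  }

  mcount = recvmmsg(sockfd, mmsg, MMSG_NUM, 0, NULL);
  /* ... pass each mmsg[i].msg_len bytes on to the QUIC stack, split by
   * the UDP_GRO segment size where present ... */

  free(bufs);
  return mcount;
}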

@icing (Contributor) commented Oct 30, 2024

@tatsuhiro-t I see. Maybe we need to move those off the stack then. I'll make a PR.

@koujaz commented Oct 30, 2024

One more experiment:
I've replaced the 'recvmmsg' call in lib/vquic/vquic.c with this:
https://github.com/psqli/recvmsg_vs_recvmmsg/blob/35029a62821b8372be76b266678e7bb35b0e3205/receive.c#L43
which serves as a drop-in replacement using just recvmsg (roughly as sketched below).

And the data gets truncated the same way...

So it seems the issue is in the application's handling of recvmmsg rather than in the system or recvmmsg itself.
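
Roughly what such a drop-in looks like; a hypothetical sketch rather than the linked code verbatim: recvmmsg() is emulated by calling recvmsg() once per mmsghdr entry.

/* Hypothetical sketch: emulate recvmmsg() with a loop of recvmsg() calls
 * over the same struct mmsghdr array. */
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/socket.h>

static int my_recvmmsg(int sockfd, struct mmsghdr *msgvec,
                       unsigned int vlen, int flags)
{
  unsigned int i;
  for(i = 0; i < vlen; i++) {
    ssize_t n = recvmsg(sockfd, &msgvec[i].msg_hdr, flags);
    if(n < 0) {
      if(i == 0)
        return -1;            /* nothing received: report the error */
      break;                  /* otherwise report what we already have */
    }
    msgvec[i].msg_len = (unsigned int)n;
  }
  return (int)i;
}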

@icing (Contributor) commented Oct 30, 2024

@koujaz could you give #15454 a try?

@koujaz commented Oct 30, 2024

It performs much better; however, it is still not 100%.
Again tracing the calls, I can confirm the truncated packets are no longer present.

However, this one looks wrong:

[recvmmsg] max_pkts=128
    pkts=0 n=2 
    mcount=2
    i=0
        msg_len=13200
        gso_size=13200
        recv_cb: pktlen=13200                                                       
[recv_pkt] pkt_count=1 len=13200 | 40 30 bf 57 b9 53 b2 16 92 f4 b1 c7 d3 ff 19 32 c0 0b f6 11

There should be no single 13200-byte packet (it should be split).
This again induces some packet loss on the QUIC connection (probably only the first packet of this block gets decrypted and read).

@tatsuhiro-t (Contributor) commented Oct 30, 2024

There is an issue in the message header buffer size.
When I changed the type of the GRO control data in 9feb85a, I failed to update the GRO control data buffer size. CMSG_SPACE(sizeof(uint16_t)) should be CMSG_SPACE(sizeof(int)). The recvmsg version also has this bug.
There is also a bug in https://github.com/curl/curl/blob/fb711b50981e86c9dcdd6c2ba468b5d32edbcfce/lib/vquic/vquic.c#L387C7-L387C49
It should be:

      mmsg[i].msg_hdr.msg_control = &msg_ctrl[i * CMSG_SPACE(sizeof(int))];
      mmsg[i].msg_hdr.msg_controllen = CMSG_SPACE(sizeof(int));
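
Pulling the pieces together, a hedged sketch, following the snippets above rather than the literal patch: size every control buffer for an int-sized UDP_GRO value and read it back after recvmmsg().

/* Hypothetical sketch: consistent cmsg sizing for the UDP_GRO value,
 * which the kernel reports as an int. */
#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/udp.h>
#ifndef SOL_UDP
#define SOL_UDP 17            /* sockopt level for UDP */
#endif
#ifndef UDP_GRO
#define UDP_GRO 104           /* from linux/udp.h, for older libc headers */
#endif

#define MMSG_NUM 16

/* every per-message control buffer holds one int-sized GRO cmsg */
static uint8_t msg_ctrl[MMSG_NUM * CMSG_SPACE(sizeof(int))];

static void attach_ctrl(struct mmsghdr *mmsg, unsigned int n)
{
  unsigned int i;
  for(i = 0; i < n; i++) {
    mmsg[i].msg_hdr.msg_control = &msg_ctrl[i * CMSG_SPACE(sizeof(int))];
    mmsg[i].msg_hdr.msg_controllen = CMSG_SPACE(sizeof(int));
  }
}

/* after recvmmsg(): return the GRO segment size for one message, 0 if none */
static size_t get_gso_size(struct msghdr *hdr)
{
  struct cmsghdr *cmsg;
  for(cmsg = CMSG_FIRSTHDR(hdr); cmsg; cmsg = CMSG_NXTHDR(hdr, cmsg)) {
    if(cmsg->cmsg_level == SOL_UDP && cmsg->cmsg_type == UDP_GRO) {
      int gso_size;
      memcpy(&gso_size, CMSG_DATA(cmsg), sizeof(gso_size));
      return (size_t)gso_size;
    }
  }
  return 0;
}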

@jay (Member) commented Oct 31, 2024

Do you want to create a PR? I think other places will need to be updated as well, such as uint8_t msg_ctrl[MMSG_NUM * CMSG_SPACE(sizeof(int))]; and basically anything that uses CMSG_SPACE, for that reason.

@koujaz commented Oct 31, 2024

I wanted to confirm that replacing all CMSG_SPACE occurrences w/ CMSG_SPACE(sizeof(int)) eliminates the packet loss.
But looking closer, I still see occasional jumbo frames passed directly to recv_pkt.

@icing (Contributor) commented Oct 31, 2024

I'll update #15454 with this.

@icing (Contributor) commented Oct 31, 2024

Thanks everyone! #15454 now uses the correct CMSG_SPACE(int) and also a larger set of 64k buffers.

@tatsuhiro-t (Contributor)

Thank you @icing. I was a bit busy today.

@icing (Contributor) commented Oct 31, 2024

np, @tatsuhiro-t, an opportunity to learn more about it.

bagder closed this as completed in 9b863ac Oct 31, 2024
@koujaz commented Nov 1, 2024

Tested #15454 (commit 9bb211e) and can confirm the fix works.
There is zero packet loss induced by recvmmsg.
Thank you!
