
The speed problem #294

Open
Enjia opened this issue Nov 3, 2019 · 22 comments

@Enjia

Enjia commented Nov 3, 2019

When I test proxygen under 1 Gbps bandwidth, 0.22 ms delay and 0 packet loss, the results show that the speeds of QUIC and FTP are 70 MB/s and 110 MB/s. Furthermore, when the delay is 40 ms and the loss rate is 0.01%, the speeds of QUIC and FTP are 0.8 MB/s and 60 MB/s. The file transferred is 500 MB in both tests, using BBR as the congestion control algorithm. I am curious about the reasons for this gap; could you please give me some advice? Thanks! (BTW I set the params "-congestion=bbr -connect_udp=true -early_data=true")

@mjoras
Contributor

mjoras commented Nov 3, 2019

Can you paste the full command line parameters you are using for client and server?

@yangchi
Contributor

yangchi commented Nov 3, 2019

It would also be helpful if you could share how you emulate the latency and loss.

@Enjia
Author

Enjia commented Nov 4, 2019

I just used the tc command to set the delay and loss rate. As for speed, I added timestamps at the start and end of the main() function and took their difference as the transfer time. I'm not sure whether that is a reasonable way to calculate the speed; could you suggest a more accurate method? Thanks~
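Roughly, the shell-level equivalent of this measurement would be something like the following (a sketch only; the client flags are the ones used elsewhere in this thread, and the arithmetic assumes a 500 MB file):

# wall-clock the whole client run
start=$(date +%s.%N)
./hq -mode=client -host=192.168.33.92 -outdir=/home/test-c -path=/test.rar -congestion=bbr
end=$(date +%s.%N)
# approximate throughput in MB/s for a 500 MB transfer
echo "scale=2; 500 / ($end - $start)" | bc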

@mjoras
Contributor

mjoras commented Nov 4, 2019

@Enjia that's not a terrible method.

Could you still post the parameters you are using for the hq client and server?

@Enjia
Author

Enjia commented Nov 5, 2019

In detail, I run the server and client on different machines. On the server, I run the command:
sudo tc qdisc add dev enp7s0f0 root netem delay 40ms loss 0.01% (enp7s0f0 is the name of the machine's network interface)
Then I start hq on the server and client:
./hq -host=192.168.33.92 -static_root=/home/test-s -congestion=bbr -connect_udp=true -early_data=true
./hq -mode=client -host=192.168.33.92 -outdir=/home/test-c -path=/test.rar -congestion=bbr -connect_udp=true -early_data=true (192.168.33.92 is the IP address of the server)
As for speed, as mentioned, I just added timestamps at the start and end of the main() function to roughly calculate it.

@yangchi
Contributor

yangchi commented Nov 7, 2019

I tried the same thing, with a slightly higher RTT I think, since I added the extra 40 ms on top of a 13-15 ms actual latency.
Anyway, 6 minutes later only 2xx MB had been fetched, so I gave up. That's quite bad.
Simply using a much larger cwnd and flow control helped though. On the server side I added "-max_cwnd_mss=800000000" and on the client side I added "-conn_flow_control=800000000 -stream_flow_control=800000000".

After that it seems downloading a 500 MB file can be done in 12-15 seconds.
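For concreteness, the full invocations would look roughly like this (a sketch that just combines the flags above with the server and client commands already posted in this thread):

# server
./hq -host=192.168.33.92 -static_root=/home/test-s -congestion=bbr -connect_udp=true -early_data=true -max_cwnd_mss=800000000
# client
./hq -mode=client -host=192.168.33.92 -outdir=/home/test-c -path=/test.rar -congestion=bbr -connect_udp=true -early_data=true -conn_flow_control=800000000 -stream_flow_control=800000000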

@yangchi
Contributor

yangchi commented Nov 7, 2019

By the way, between the same two hosts, iperf3 with a forced 1280-byte MSS gives me 183-202 Mbit/s.

(Without forcing the MSS to be as small as what we use for Quic, iperf3 can do 1.14 Gbit/s.)
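For reference, that baseline can be reproduced with something like the following (the -M/--set-mss option forces the TCP MSS; the address is the one used in this thread):

# on the server host
iperf3 -s
# on the client host, forcing a 1280-byte TCP MSS to roughly match the Quic packet size
iperf3 -c 192.168.33.92 -M 1280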

@yangchi
Contributor

yangchi commented Nov 7, 2019

You can also try to see if GSO helps on your host. For me it didn't. The param is -quic_batching_mode=1.

@yangchi
Contributor

yangchi commented Nov 7, 2019

We don't have good PMTUD support at the moment. If you happen to know a proper MSS size that works well in your network, we can add a param to HQ so you can test the perf with an increased MSS in Quic. For internet-facing traffic, I don't think you can make it much higher though.

Comparing to FTP is something I don't think any of us is likely to look into. The protocol overhead of HTTP/* vs FTP makes this comparison tricky. At the transport layer we use iperf3 internally as a TCP baseline, and we use tperf (https://github.com/facebookincubator/mvfst/tree/master/quic/tools/tperf) to test Quic transport speed. At the HTTP layer, hq is what we mostly rely on as a server for HTTP/2 vs HTTP/3 tests; there is an HTTP perf test client we haven't open sourced for that purpose.

One thing to note is that running BBR without pacing enabled is OK for perf testing purposes, but doing that for real production traffic may not be a good idea. Something to keep in mind. (-pacing=true will turn on pacing for you.)
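For example, enabling pacing on the server side would look something like this (a sketch reusing the flags already discussed in this thread):

./hq -host=192.168.33.92 -static_root=/home/test-s -congestion=bbr -pacing=true -connect_udp=true -early_data=true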

@yangchi
Contributor

yangchi commented Nov 8, 2019

Hey @Enjia, can you try with the few params I mentioned and let me know if it helps? Thank you!

@Enjia
Author

Enjia commented Nov 15, 2019

Hi @yangchi
I have come across some problems recently and really need your help!
1. As for the params, I added -max_cwnd_mss=800000000 on the server, and -conn_flow_control=800000000 and -stream_flow_control=800000000 on the client. I found that with delay=40ms it took about 15 s to transfer the 500 MB file, which improved the speed a little compared to the former test.
Besides, after checking the mvfst code, I found that the default MSS is 1252. I modified it so the MSS on both server and client is 1280, but the speed didn't improve. I then continued increasing the MSS, and the connection failed with the following error:
E1115 10:21:16.726881 68654 HQClient.cpp:136] HQClient failed to connect, error=Connect failed, msg=connect timeout.
So how can I modify the MSS correctly to avoid the timeout?

2. I managed to transfer a 500 MB file under 1 Gbps bandwidth, 40 ms delay and 0 loss; it took around 15-16 s, but under the same conditions other transfers using TCP took just 8-9 s (proxygen and TCP both use BBR). I wonder why there is such a gap between them, and whether you have any ideas for optimization?

3. You use HTTP/3 on top of QUIC (mvfst) at the application layer. Is there any correlation between speed and the application-layer protocol, and are there any params I can change in terms of HTTP/3?

4. When transferring a large 5 GB file, an error occurred:
Got error=LocalError: No Error, Exceeded max PTO.
Have you ever tried to get large files (10 GB, 100 GB, etc.) from the server? Thanks again!

@yangchi
Contributor

yangchi commented Nov 17, 2019

When you keep increasing the MSS, at a certain point it will exceed your network's MTU and packets will be dropped, hence the timeout error you saw. Currently we don't have a good way to probe the path MTU and use that as the Quic MSS.
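As an aside, one common way to probe a workable MTU by hand on Linux is ping with fragmentation prohibited (this is generic tooling, not an hq/mvfst feature; 1472 bytes of payload plus 28 bytes of headers corresponds to a 1500-byte packet):

ping -M do -s 1472 192.168.33.92
# if the size exceeds the path MTU, ping reports an error instead of replies;
# lower -s until the pings go through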

A few questions about the 15 s Quic file transfer and 8 s TCP file transfer results:

(1) For the TCP case, did you use proxygen as client and server or did you use something else? If it's proxygen, what command did you run so that I can repro it?
(2) For the Quic case, did you enable GSO?
(3) For the 5GB transfer which eventually failed due to PTO limit, was Pacing enabled for Quic, and was there a policer in the middle of the network?
(4) Was everything running with a release binary build? (See the build sketch at the end of this comment.)

I don't expect our Quic performance to be on par with a well-tuned TCP stack today, but twice the time for a 500 MB file transfer sounds like something we can improve. So it would be good to know more about the test setup and the commands you used, so we can reproduce the problem on our side and go from there.
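To clarify question (4): "release binary build" means the binaries were compiled with optimizations enabled, e.g. roughly the following for a plain CMake build (a sketch; the repository's helper build scripts may already set this for you):

cmake -DCMAKE_BUILD_TYPE=Release ..
make -j"$(nproc)"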

@Enjia
Author

Enjia commented Nov 18, 2019

Thank you for your suggestions~
(1) Conditions: 1 Gbps bandwidth, 40 ms delay, 0 loss, file size 500 MB.
TCP: I use FTP and WDT; FTP takes around 8-9 s, WDT 7-8 s.
QUIC: takes around 15-16 s.
The commands are "./hq -host=192.168.33.92 -static_root=/home/test-s -congestion=bbr -connect_udp=true -early_data=true -max_cwnd_mss=800000000 -quic_batching_mode=1"
"./hq -mode=client -host=192.168.33.92 -outdir=/home/test-c -path=/test.rar -congestion=bbr -connect_udp=true -early_data=true -max_cwnd_mss=800000000 -stream_flow_control=800000000 -quic_batching_mode=1"
(2) You asked "Was everything running with a release binary build?"; do you mean the param "-use_draft"? I assumed so, and with -use_draft set to false the time was around 16-17 s.
(3) After decreasing the MSS to 626 (half of the former value), the same error appeared again:
E1115 10:21:16.726881 68654 HQClient.cpp:136] HQClient failed to connect, error=Connect failed, msg=connect timeout
(4) There's no difference whether the "pacing" param is set to true or false; the error is:
Got error=LocalError: No Error, Exceeded max PTO.
(5) I'm not sure whether a policer exists in the middle of the network.

@Enjia
Author

Enjia commented Nov 21, 2019

@yangchi I wonder whether mvfst has been applied to real-world projects, because if mvfst is used in practice its transmission speed should stand up to testing. In recent tests I found that as the network delay increases, the speed of mvfst drops dramatically compared to TCP, which performs better. Can you explain the underlying reasons for this? Thanks!

@mjoras
Contributor

mjoras commented Nov 21, 2019

@Enjia we use mvfst as the quic implementation at Facebook. I can assure you it has been used in real production traffic more than any other Quic implementation except Google's :)

That said, our tools are not optimized for maximal performance for every network condition out of the box. The main thing that you are probably experiencing as the limiting factor is flow control. If the flow control is not higher than the BDP of the network you are testing, you will not achieve maximal throughput. Mvfst/proxygen do not vary the flow control after you set them. The relevant parameters are -stream_flow_control and -conn_flow_control. These both need to be set on the client to avoid becoming blocked on flow control. If you want you can always set them to be very high without ill consequence for this sort of testing.
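To put rough numbers on that, with the conditions used earlier in this thread (1 Gbit/s bandwidth, 40 ms delay):

BDP = bandwidth * RTT ≈ 1 Gbit/s * 0.04 s = 40 Mbit ≈ 5 MB

so any flow control window comfortably above ~5 MB (for example the 800000000-byte values mentioned earlier) should be enough to avoid flow-control blocking in this setup.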

As for the other errors, they look to be mostly connect errors, and indicate the client wasn't able to connect.

As for a comparison with TCP, what TCP client are you using? You can use the proxygen curl-like client (called "curl_client") and compare quic and TCP fairly directly using HTTP/2 and HTTP/3.

@Enjia
Author

Enjia commented Nov 21, 2019

@mjoras Thanks! As for the TCP client, I use WDT and FTP with 1 TCP connection opened. When testing, on the client side I set both -stream_flow_control and -conn_flow_control to 800000000. The bandwidth is 1 Gbps, the delay is 40 ms and the loss rate is 0, so BDP = 128 MB/s * 0.04 s = 5.12 MB, which means stream/conn_flow_control is much higher than the BDP. The file I test is 500 MB; quic within proxygen needs 15-16 s, while FTP and WDT need just 7-8 s. For more details you can see the conversation between yangchi and me above.

@yangchi
Contributor

yangchi commented Nov 21, 2019

@Enjia Just another quick question: what's the Linux kernel version on your testing machines? UDP GSO was added in 4.18, IIRC.
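(A quick way to check on a standard Linux box:)

uname -r   # UDP GSO requires kernel 4.18 or newer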

By the way, there is no benefit to tuning the MSS down from our default value.

As Matt said, mvfst has been used in real production traffic :) That being said, it's quite different from the type of traffic you are testing with. We use it to carry HTTP requests from Facebook users. One number we shared recently is that on one of Facebook's mobile apps, 80-90% of API requests are over Quic today. Both client and server are mvfst in that case. So when we use it, it's internet-facing traffic, with a few hundred KB of HTTP body size and multiple streams over one connection concurrently. And we clearly saw wins in terms of perf metrics and user engagement by using Quic for such traffic.

For larger transfers in a more controlled environment, either inter- or intra-datacenter, I don't think it's expected that mvfst beats TCP today. I'd go even further and say I don't think it's expected that Quic beats TCP today without modifications to either the NIC or the OS kernel. The benefits provided by the protocol itself are better loss recovery, no head-of-line blocking and 0-RTT connection setup. On a lossless link, with a single stream carrying a large data transfer, all of these benefits disappear. On the other hand, on a decent system with a sane TCP implementation, you can at least expect offloading on both the read and write sides and proper MSS sizing. With large transfers in a controlled environment, I expect these to play a much bigger role than the protocol optimizations designed into Quic. We can, and have been, working on optimizing mvfst for such use cases over the past couple of months to shrink the gap, but I'd still expect a gap when compared to a decent TCP implementation.

I haven't tested with WDT before, but I believe it's multithreaded and multi-connection? In that case, the throughput you can get from the sample out-of-the-box proxygen client and server will likely have a really hard time getting close to the WDT number. In fact, I think the design and optimization goals of WDT and Proxygen/mvfst are very different.

But we still shouldn't error out in the 5 GB test you have done; I think that's something we should definitely look into. We have done large transfer tests, even larger than 5 GB, internally without problems. But when we do large transfer tests internally, we only test Quic (mvfst) rather than HTTP/3 (mvfst + proxygen). So this has been a blind spot on our side.

@yangchi
Contributor

yangchi commented Dec 28, 2019

It looks like the connection closing due to the PTO limit when transferring a 5 GB file is not related to the transport: #307

@Winters123

Winters123 commented Apr 24, 2020

Hi @yangchi, I'm still following this thread, and thanks for the explanation about the inter-/intra-datacenter difference; it's pretty insightful. A couple of questions I'm curious about:

  1. Do you expect any packet reordering inside Facebook's data centres? I guess just because we are in a lossless environment doesn't mean that the packets are perfectly ordered, right?
  2. With NIC acceleration, is there a chance that QUIC may replace TCP for internal traffic?
    The second question comes from the fact that, besides loss recovery and the absence of head-of-line blocking, another benefit we can get from QUIC is a faster handshake compared to TCP/TLS.

Another thing I'm interested in is the folly::netops::sendmsg API (which costs 60%+ of a core) that drives packet I/O. I noticed it behaves like the conventional socket API, which makes it the CPU performance bottleneck. I wonder whether you are considering replacing it with kernel-bypass schemes, or whether this is enough for the services Facebook is providing.

Would love to hear your comments, coming from an FPGA guy.

@Winters123

@yangchi
P.S. Is there an easy way for the real-time/average throughput (using tperf) to be calculated and shown on standard output?
Also, the log info printed to the terminal seems like a big CPU burden; can I disable the log generation?

Thanks,

@yangchi
Contributor

yangchi commented Apr 24, 2020

I don't have the numbers, but I believe there is a small fraction of out-of-order packets in our network.
Yes, sendmsg is where we spend a lot of CPU. For the internet use case, we find that proper GSO is enough. There are other use cases where people are experimenting with sendmmsg + GSO. I think @mjoras knows more about that. Kernel bypass has been actively explored.

You can totally remove all the LOGs when you run tperf. But I don't think there is much to begin with?

Do you want to send a PR for real-time throughput in tperf? We don't have that yet and it sounds useful. :)
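On the logging question: a sketch of silencing glog output when running tperf, assuming tperf logs through glog like the rest of mvfst (GLOG_minloglevel is a standard glog environment variable, not a tperf-specific flag; the binary path and remaining arguments are illustrative):

# suppress INFO/WARNING/ERROR logs, keeping only FATAL
GLOG_minloglevel=3 ./tperf ...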

@Winters123

Winters123 commented Apr 26, 2020

Cool. I tested mvfst and found that slight packet reordering (5%-10%) can cause a throughput decrease (without GSO enabled). I'm not sure whether that was caused by packet loss or by the functions that handle out-of-order packets; I need to check the code a bit more. But I'm really interested: would you be willing to use a packet reordering engine (on the NIC) if we can provide one that saves CPU and improves throughput at the cost of a slight additional latency?

You can totally remove all the LOGs when you run tperf. But I don't think there is much to begin with?

Well, at least on my side I got a lot of output printed to the terminal, which in most cases is a heavy waste of CPU. Once the logging was disabled, the throughput jumped from less than 100 Mb/s to around 350 Mb/s. But I'm not sure whether that was caused by the printing or by something else I didn't notice.
