-
-
Notifications
You must be signed in to change notification settings - Fork 6.8k
openssl: enable readahead and increase read buffer size (improves download speed) #17548
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Looks like I'll need to add some |
This is a nice find!
Annoyingly, the docs for these functions says nothing about when they were added or other conditions as to when they are available. I presume we need to add configure/cmake checks to detect if they are present or not before we can use them. |
Thanks
I'll look at this in more detail next week, but a quick search with They should both be available in OpenSSL 1.1.0: Since LibreSSL is a fork of OpenSSL 1.0.x it won't have it. So perhaps some OpenSSL >= 1.1.0 could be used (unfortunately LibreSSL claims to be version 2.0 in OPENSSL_VERSION_NUMBER), but I'll figure an appropriate macro to use for detection. |
However using a to large buffer (8 packets) actually decreases performance. | ||
4 packets is better. | ||
*/ | ||
SSL_CTX_set_default_read_buffer_len(octx->ssl_ctx, 0x401e * 4); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, why 0x401e? The comment above would make me think 0x4800
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is up to 0x4000 + 0x800, but can be smaller. I'll investigate with tcpdump/tshark/wireshark what a typical TLS record size is (might also depend on protocol version)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tshark
says:
Reassembled TCP length: 16413
TLSv1.2 Record Layer: Application Data Protocol: Hypertext Transfer Protocol
Content Type: Application Data (23)
Version: TLS 1.2 (0x0303)
Length: 16408
16413 is 0x401d. I'll try to find out whether that is more or less constant, or whether different TLS versions or settings could change that (well obviously according to the comment it can go up to +0x800, but it talks about compression, which is usually turned off due to security issues)
@edwintorok nice one. Did not know of that OpenSSL feature. I had experimented with adding buffering to the underlying socket filter. That had no real impact because then it caused an additional buffer copy of the incoming data to OpenSSL. Reading directly larger chunks into OpenSSL's packet handling is much better. I ran our scorecard against a local Caddy server, using HTTP/1.1 downloads, via: > python3 tests/http/scorecard.py -d --samples 10 --caddy h1 with a non-debug curl build and varying the OpenSSL buffer sizes on my macOS box:
Which transfers 50 times a 100MB resource, first over the same connection one after the other. Second, over 50 connections. The effect on the serial case is minimal, on the parallel one noticeable. The serial case "loses" time between end of one and start of next download after processing the request, I believe. In the parallel case, curl is at maxed CPU and the performance benefits of the larger reads shine. The benefit comes with a 16KB buffer already with no real gain thereafter. The case of 8 * 16KB shows that it becomes counter-productive somewhere after 64KB, as you had also noticed. Increasing the resource size does not alter the picture: > python3 tests/http/scorecard.py -d --samples 10 --download-sizes=1gb --download-count=10 --caddy h1
buffer size serial(10) [cpu/rss] parallel(10x10) [cpu/rss]
off 1GB 1390 MB/s [98.8%/10MB] 1764 MB/s [100.0%/12MB]
1 1GB 1369 MB/s [97.7%/10MB] 1958 MB/s [100.0%/12MB]
2 1GB 1393 MB/s [98.8%/10MB] 1972 MB/s [100.0%/12MB]
4 1GB 1426 MB/s [99.4%/10MB] 1988 MB/s [100.0%/12MB]
8 1GB 1240 MB/s [99.7%/10MB] 1837 MB/s [100.0%/13MB] Compatibility: there are faults of properly handling buffered TLS data visible in test failures. I have seen one IMAPS test failing for me. The trap is that a protocol waits on socket events, but the data it is expecting is already present in the OpenSSL buffers and just needs to be read. That stalls curl. This is not a fault of this change, but a bug in the protocol handling exposed. |
So maybe something like this? #if (OPENSSL_VERSION_NUMBER >= 0x10100000L && \
!defined(LIBRESSL_VERSION_NUMBER)
#define HAVE_SSL_CTX_SET_READ_AHEAD 1
#define HAVE_SSL_CTX_SET_DEFAULT_READ_BUFFER_LEN 1
#endif |
strace on curl was showing repeated syscalls for 5 bytes, followed by another for the actual data: ``` recvfrom(4, "\27\3\3@\30", 5, 0, NULL, NULL) = 5 recvfrom(4, "\365\211_r\352,\35\275tR\374>\324\262qs?0D\337/\304DoU\303)\3153}\267\302"..., 16408, 0, NULL, NULL) = 16408 ``` (e.g. OpenSSL 3.2.4 on Fedora42) This is because readahead is not enabled on the OpenSSL context (I'm assuming for backwards compatibility). According to the manpage at https://docs.openssl.org/3.3/man3/SSL_CTX_set_read_ahead/#notes: > If the application wants to continue to use the underlying transport (e.g. TCP connection) > after the SSL connection is finished using SSL_shutdown() reading ahead should be turned off. > Otherwise the SSL structure might read data that it shouldn't However I'm not aware of any protocols (that curl would support) which needs this. STARTTLS is the only one that would combine unencrypted and encrypted traffic on the same socket, but the switch is one way: once it entered TLS mode, it doesn't go back. Enable readahead by default on the OpenSSL context, which cuts the amount of read syscalls into half: ``` recvfrom(4, "Cj%&YD\275V(\363\10\331\233\215\221\4\17!\270V\342T\273\355*\33\255\316^\\\3g"..., 16113, 0, NULL, NULL) = 16113 write(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 write(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 12288) = 12288 recvfrom(4, "\27\3\3@\30\33\5\221\342\21\330U~\0313\327\2122\6\304i+\201\37\311\375 \21\227\244\32\7"..., 16713, 0, NULL, NULL) = 16713 write(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096 write(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 12288) = 12288 ``` This also reduces HTTPS download times by 5%, as confirmed by `ministat`: ``` for i in $(seq 30); do src/curl https://${DEST}:9443/10GiB -k -o /dev/null -w '%{time_total}\n'; done >$(git describe).dat ... ministat -A curl-8_14_1-11-gba407ee43.dat curl-8_14_1-12-g03ba528c3.dat x curl-8_14_1-11-gba407ee43.dat + curl-8_14_1-12-g03ba528c3.dat N Min Max Median Avg Stddev x 30 4.596546 5.69831 5.542359 5.4684991 0.29021028 + 30 4.636901 5.34536 5.154769 5.1512865 0.1181803 Difference at 95.0% confidence -0.317213 +/- 0.114534 -5.80072% +/- 1.99067% (Student's t, pooled s = 0.221572) ``` This function got introduced in the following OpenSSL commit: OpenSSL 0.9.2b: 413c4f45ed ("Updates to the new SSL compression code") Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Thanks, I pushed a variant of this in a separate commit (with the missing Looks like there is a failure on BoringSSL too, I've expanded the macro. Their PORTING.md says to contact them about missing functions, they may add them, so I've done that: https://issues.chromium.org/issues/423460787 |
Thanks, so we know that >4 packets is bad not just on my HW. The ideal multiple between 1 and 4 TBD, I'll have to do some testing on more HW (I also have access to HW on 10Gbit/s and 25Gbit/s links, so I can measure those too)
Ah, latent bugs... Not quite so simple to enable this as I thought then :(. Looking at the latest CI run I see these test failures, so probably FTP too, and something wrong with PROXY:
|
This got introduced in the following OpenSSL commit: OpenSSL 1.1.0: dad78fb13d ("Add an ability to set the SSL read buffer size") It is not available in LibreSSL, or BoringSSL which forked from version 1.0.x. No functional change yet. Suggested-by: Daniel Stenberg <daniel@haxx.se> Signed-off-by: Edwin Török <edwin.torok@cloud.com>
By default we only read 1 TLS record at a time from the socket, even if more data is available. Modern NICs can perform various TCP offloads (e.g. GRO), where they can effectively transfer ~64KiB in one go. Even if the NIC doesn't support offloads reducing the number of syscalls can improve performance. This builds on top of the previous commit that enabled readahead (without it, this change would have no effect). After this change large reads are seen with `strace -e recvfrom curl`: ``` recvfrom(4, "\3\3@\30a\360\345\327d\378\325\274\320\n\261hr(\0U\325h\344\32}r-t\351m_"..., 65652, 0, NULL, NULL) = 65652 recvfrom(4, "\3\3@\30a\360\345\327d\378\331\222\335}Q\30\353P\37B\304\363\302\2500*\v\2wE\333"..., 65652, 0, NULL, NULL) = 65652 ``` This results in a further speedup (total download time is reduced): ``` ministat -A curl-8_14_1-12-g03ba528c3.dat curl-8_14_1-13-g34606f4ad.dat x curl-8_14_1-12-g03ba528c3.dat + curl-8_14_1-13-g34606f4ad.dat N Min Max Median Avg Stddev x 30 4.636901 5.34536 5.154769 5.1512865 0.1181803 + 30 4.521651 5.151459 4.6656105 4.7356357 0.19006089 Difference at 95.0% confidence -0.415651 +/- 0.0818046 -8.06887% +/- 1.55338% (Student's t, pooled s = 0.158256) ``` This got introduced in the following OpenSSL commit: OpenSSL 1.1.0: dad78fb13d ("Add an ability to set the SSL read buffer size") It is not available in LibreSSL, which forked from version 1.0.x. Signed-off-by: Edwin Török <edwin.torok@cloud.com>
The maximum TLS record size is actually 16KB (in the protocol layer), so 64KB would make sure it can at least get 4 maximum-sized records. If that is what the buffer is used for. Is there anything more to do here (it is still marked draft) or should we merge? |
@@ -122,6 +122,12 @@ | |||
static void ossl_provider_cleanup(struct Curl_easy *data); | |||
#endif | |||
|
|||
#if (OPENSSL_VERSION_NUMBER >= 0x10100000L && \ | |||
!defined(LIBRESSL_VERSION_NUMBER) && \ | |||
!defined(OPENSSL_IS_BORINGSSL)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@icing found SSL_CTX_set_default_read_buffer_len()
in the AWS-LC source code. So maybe this needs some additional tweaks - is it really confirmed to not exist in boring?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't thing AWS-LC defines OPENSSL_IS_BORINGSSL
. It defines OPENSSL_IS_AWSLC
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wrong. AWS-LC uses OPENSSL_IS_AWSLC
and thus it will be used there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My numbers with AWS-LC today on the current PR:
> python3 tests/http/scorecard.py -d --samples 10 --download-sizes=1gb --download-count=10 --caddy h1
size serial(10) [cpu/rss] parallel(10x10) [cpu/rss]
off 1GB 1394 MB/s [99.2%/7MB] 1572 MB/s [100.0%/8MB]
4 1GB 1428 MB/s [99.5%/9MB] 1922 MB/s [100.0%/15MB]
Nice. All for merging this.
Thanks! |
I think the only outstanding action was to investigate these latent bugs/failures from the CI on certain protocols:
Sorry I haven't had time to look into these yet, I'll try to find some time later this week. (I see you've merged the PR already, that is fine, just something to keep in mind before making the next release of curl). Thanks for all the feedback on my changes. |
Speeds up TLS operations up to ~%13. Closes curl#17548 Signed-off-by: Edwin Török <edwin.torok@cloud.com>
I ran
strace
oncurl
and noticed that it is doing a lot of very short reads, of exactly 5 bytes, followed by a larger read.strace Before
I tweaked the OpenSSL context settings in curl, and now
strace
looks much nicer: we read ~64KiB in one go(writing is still done in several smaller writes, this could be improved in the future).
strace After
Beyond making
strace
look "nicer" this change of course improves download speed too.I measured that download time is reduced by ~13% when downloading a 10GiB file using HTTPS on a 100Gbit/s network link (git hash doesn't match exactly because I amended the commits to include the measurements):
The median speed goes from ~15.5Gbit/s to ~18.4 Gbit/s.
For reference
iperf3
can do 40Gbit/s (unencrypted obviously) on a single TCP stream (and with 3 streams it can saturate the network link).The change is tiny: just 2 commits, 1 line each. Although perhaps the buffer size should be configurable (for now I've hardcoded it).
If you want to reproduce the measurements on your own hardware I've included the commands that I've used below, they may need slight tweaks to run on your OS/HW (click to expand the details).
Experimental setup
Script to put machine into benchmark mode
Following some of the advice from https://llvm.org/docs/Benchmarking.html.
The BIOS was configured to set the System profile to 'Performance' (this exposes P states to the OS and allows turning off turbo):
Script to setup benchmark server
This is part of a larger script that measures also different TLS servers, I've included just the parts needed to measure
curl
withnginx
(since that appears to be the fastest server in my tests so far).Script to benchmark a curl change
I measure total time instead of download speed, because the correct way to average speeds would be to use a geometric mean, however ministat uses an arithmetic mean, so using ministat to process download speed measurements wouldn't be correct.
I also measured using 8 TLS records instead of 4, but that was actually slower, and even canceled out the improvement in the first commit (so overall
ministat
said there was no provable difference).I measured 2 packets instead of 4, but that would've been 5% slower:
So 4 packets seems to be the sweet spot, at least on this HW.
P.S.: this change is probably beneficial for most applications using OpenSSL, especially the readahead change, I'll submit similar changes to a few other projects (
socat
,stunnel
, etc.). Ideally I'd submit the change to OpenSSL itself, but for backwards compatibility reasons they probably cannot change the defaults.