add zero byte read to SslStream #87563

Merged
merged 6 commits into from Jun 19, 2023

Conversation

@wfurt wfurt commented Jun 14, 2023

fixes #76029

This change aims to postpone buffer allocations and avoid allocating GCHandles. The current worst case is Kestrel with idle connections. With the current code we allocate a large buffer as well as a GCHandle whenever the Socket read fails to finish synchronously. That creates memory pressure, and it can fragment memory with a block that is immovable.

To fix this, I added a zero-byte read to EnsureFullTlsFrameAsync and moved all EnsureAvailableSpace calls behind it.
If the underlying Stream supports blocking zero-byte reads, this will wait until some data is actually available. Then we allocate the buffer, and the socket read will finish synchronously using the cheap fixed block.
If the underlying stream does not support blocking zero-byte reads, we just process everything as we did before.
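
As a rough illustration of the pattern described above (not the actual SslStream code), the sketch below waits with a zero-byte read and only then rents a buffer for the real read. The class and method names are hypothetical, and the use of `ArrayPool<byte>` is an assumption made for the example. Whether the zero-byte read really waits for data depends on the stream: a `MemoryStream`, for instance, simply completes it immediately.

```csharp
using System;
using System.Buffers;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class ZeroByteReadSketch
{
    // Hypothetical helper: wait for data first, allocate the buffer afterwards.
    public static async ValueTask<byte[]> ReadWhenAvailableAsync(
        Stream stream, int maxBytes, CancellationToken ct = default)
    {
        // Zero-byte read: no buffer is needed while the connection is idle.
        // On streams that support it, this completes once data (or EOF) is available.
        await stream.ReadAsync(Memory<byte>.Empty, ct);

        // Only now obtain a buffer; the real read is likely to complete synchronously.
        byte[] rented = ArrayPool<byte>.Shared.Rent(maxBytes);
        try
        {
            int read = await stream.ReadAsync(rented.AsMemory(0, maxBytes), ct);
            // Copy out the bytes actually read so the rented array can be returned.
            return rented.AsSpan(0, read).ToArray();
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(rented);
        }
    }
}
```

With an idle connection, the only pending operation in this sketch is the zero-byte read, so no large buffer is held while waiting.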

There is some overhead, but from what I can tell it is not measurable:

| Method | Job | Toolchain | protocol | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HandshakeRSA4096CertAsync | Job-YZAHBZ | \PR\corerun.exe | Tls12 | 8,354.82 us | 212.521 us | 244.739 us | 8,334.01 us | 7,943.804 us | 8,808.05 us | 1.10 | 0.03 | 8795 B | 1.00 |
| HandshakeRSA4096CertAsync | Job-EHNRXP | \main\corerun.exe | Tls12 | 7,625.07 us | 90.142 us | 84.319 us | 7,618.57 us | 7,479.032 us | 7,751.31 us | 1.00 | 0.00 | 8790 B | 1.00 |
| HandshakeRSA4096CertAsync | Job-YZAHBZ | . \PR\corerun.exe | Tls13 | 9,025.35 us | 177.809 us | 157.623 us | 8,980.75 us | 8,759.627 us | 9,299.73 us | 1.00 | 0.03 | 8968 B | 1.00 |
| HandshakeRSA4096CertAsync | Job-EHNRXP | \main\corerun.exe | Tls13 | 9,041.60 us | 145.878 us | 129.317 us | 9,023.76 us | 8,837.507 us | 9,260.70 us | 1.00 | 0.00 | 9005 B | 1.00 |
| DefaultHandshakeContextIPv4Async | Job-YZAHBZ | \PR\corerun.exe | ? | 5,574.37 us | 134.986 us | 150.037 us | 5,594.24 us | 5,294.947 us | 5,865.63 us | 1.00 | 0.03 | 4783 B | 0.98 |
| DefaultHandshakeContextIPv4Async | Job-EHNRXP | \main\corerun.exe | ? | 5,576.96 us | 124.831 us | 138.749 us | 5,598.41 us | 5,376.684 us | 5,849.83 us | 1.00 | 0.00 | 4872 B | 1.00 |
| DefaultHandshakeContextIPv6Async | Job-YZAHBZ | \PR\corerun.exe | ? | 5,593.63 us | 90.259 us | 84.429 us | 5,608.46 us | 5,448.258 us | 5,747.86 us | 1.02 | 0.02 | 4819 B | 0.99 |
| DefaultHandshakeContextIPv6Async | Job-EHNRXP | \main\corerun.exe | ? | 5,468.87 us | 92.570 us | 82.061 us | 5,474.08 us | 5,330.325 us | 5,634.00 us | 1.00 | 0.00 | 4881 B | 1.00 |
| DefaultHandshakeIPv4Async | Job-YZAHBZ | \PR\corerun.exe | ? | 6,299.44 us | 155.422 us | 178.984 us | 6,287.73 us | 5,930.689 us | 6,670.41 us | 1.01 | 0.04 | 6965 B | 1.18 |
| DefaultHandshakeIPv4Async | Job-EHNRXP | \main\corerun.exe | ? | 6,268.75 us | 181.437 us | 208.943 us | 6,220.77 us | 6,014.359 us | 6,690.59 us | 1.00 | 0.00 | 5909 B | 1.00 |
| DefaultHandshakeIPv6Async | Job-YZAHBZ | . \PR\corerun.exe | ? | 6,208.39 us | 147.344 us | 163.772 us | 6,174.09 us | 5,978.729 us | 6,497.04 us | 1.02 | 0.03 | 5910 B | 1.00 |
| DefaultHandshakeIPv6Async | Job-EHNRXP | \main\corerun.exe | ? | 6,088.69 us | 94.486 us | 78.900 us | 6,083.12 us | 5,974.077 us | 6,240.72 us | 1.00 | 0.00 | 5909 B | 1.00 |
| DefaultMutualHandshakeIPv4Async | Job-YZAHBZ | . \PR\corerun.exe | ? | 6,659.25 us | 214.237 us | 229.231 us | 6,628.40 us | 6,316.665 us | 7,197.88 us | 1.02 | 0.04 | 7219 B | 1.00 |
| DefaultMutualHandshakeIPv4Async | Job-EHNRXP | \main\corerun.exe | ? | 6,520.86 us | 129.545 us | 143.989 us | 6,549.53 us | 6,252.251 us | 6,837.06 us | 1.00 | 0.00 | 7240 B | 1.00 |
| DefaultMutualHandshakeIPv6Async | Job-YZAHBZ | . \PR\corerun.exe | ? | 6,510.77 us | 107.640 us | 115.174 us | 6,484.16 us | 6,279.208 us | 6,733.40 us | 1.00 | 0.04 | 7233 B | 1.00 |
| DefaultMutualHandshakeIPv6Async | Job-EHNRXP | \main\corerun.exe | ? | 6,532.72 us | 188.858 us | 209.916 us | 6,518.10 us | 6,244.809 us | 7,038.09 us | 1.00 | 0.00 | 7249 B | 1.00 |
| WriteReadAsync | Job-YZAHBZ | . \PR\corerun.exe | ? | 10.57 us | 0.197 us | 0.202 us | 10.53 us | 10.324 us | 11.08 us | 1.02 | 0.03 | - | NA |
| WriteReadAsync | Job-EHNRXP | \main\corerun.exe | ? | 10.31 us | 0.256 us | 0.285 us | 10.31 us | 9.878 us | 10.93 us | 1.00 | 0.00 | - | NA |
| ReadWriteAsync | Job-YZAHBZ | . \PR\corerun.exe | ? | 76.45 us | 2.629 us | 3.027 us | 75.11 us | 72.332 us | 83.77 us | 0.99 | 0.03 | 1 B | NA |
| ReadWriteAsync | Job-EHNRXP | \main\corerun.exe | ? | 77.11 us | 1.450 us | 1.286 us | 76.89 us | 75.247 us | 79.70 us | 1.00 | 0.00 | - | NA |
| ConcurrentReadWrite | Job-YZAHBZ | . \PR\corerun.exe | ? | 214.46 us | 42.115 us | 48.499 us | 226.24 us | 114.934 us | 286.70 us | 5.52 | 2.34 | - | NA |
| ConcurrentReadWrite | Job-EHNRXP | \main\corerun.exe | ? | 46.76 us | 21.285 us | 24.512 us | 29.94 us | 23.200 us | 88.81 us | 1.00 | 0.00 | - | NA |
| ConcurrentReadWriteLargeBuffer | Job-YZAHBZ | . \PR\corerun.exe | ? | 35.70 us | 4.703 us | 5.416 us | 35.95 us | 26.951 us | 45.78 us | 0.47 | 0.07 | - | NA |
| ConcurrentReadWriteLargeBuffer | Job-EHNRXP | \main\corerun.exe | ? | 75.87 us | 5.475 us | 6.305 us | 75.97 us | 64.819 us | 88.57 us | 1.00 | 0.00 | - | NA |
| HandshakeContosoAsync | Job-YZAHBZ | . \PR\corerun.exe | Tls12 | 18,413.63 us | 1,749.626 us | 2,014.872 us | 18,670.07 us | 13,162.108 us | 21,200.56 us | 1.05 | 0.12 | 8837 B | 1.00 |
| HandshakeContosoAsync | Job-EHNRXP | \main\corerun.exe | Tls12 | 17,841.33 us | 685.134 us | 703.582 us | 17,725.48 us | 16,269.083 us | 19,452.18 us | 1.00 | 0.00 | 8848 B | 1.00 |
| HandshakeECDSA256CertAsync | Job-YZAHBZ | . \PR\corerun.exe | Tls12 | 11,583.84 us | 832.064 us | 958.206 us | 11,734.53 us | 9,884.870 us | 12,836.61 us | 1.05 | 0.08 | 7445 B | 1.00 |
| HandshakeECDSA256CertAsync | Job-EHNRXP | \main\corerun.exe | Tls12 | 11,022.29 us | 509.512 us | 586.755 us | 11,196.88 us | 10,124.448 us | 12,123.88 us | 1.00 | 0.00 | 7448 B | 1.00 |
| HandshakeRSA2048CertAsync | Job-YZAHBZ | . \PR\corerun.exe | Tls12 | 9,649.37 us | 590.719 us | 632.062 us | 9,656.99 us | 8,478.865 us | 10,959.73 us | 0.98 | 0.07 | 8048 B | 1.00 |
| HandshakeRSA2048CertAsync | Job-EHNRXP | \main\corerun.exe | Tls12 | 9,884.29 us | 309.214 us | 356.092 us | 9,932.62 us | 8,681.656 us | 10,433.19 us | 1.00 | 0.00 | 8028 B | 1.00 |
| HandshakeContosoAsync | Job-YZAHBZ | . \PR\corerun.exe | Tls13 | 22,000.81 us | 1,204.403 us | 1,338.691 us | 22,286.12 us | 20,138.360 us | 23,801.10 us | 1.02 | 0.09 | 9080 B | 1.00 |
| HandshakeContosoAsync | Job-EHNRXP | \main\corerun.exe | Tls13 | 21,886.53 us | 1,303.394 us | 1,500.990 us | 21,548.64 us | 19,493.360 us | 24,402.58 us | 1.00 | 0.00 | 9042 B | 1.00 |
| HandshakeECDSA256CertAsync | Job-YZAHBZ | . \PR\corerun.exe | Tls13 | 14,209.87 us | 658.521 us | 758.354 us | 14,330.59 us | 12,484.375 us | 15,398.19 us | 1.02 | 0.10 | 7626 B | 1.00 |
| HandshakeECDSA256CertAsync | Job-EHNRXP | \main\corerun.exe | Tls13 | 13,969.16 us | 838.224 us | 931.683 us | 13,670.56 us | 12,548.726 us | 15,907.23 us | 1.00 | 0.00 | 7641 B | 1.00 |
| HandshakeRSA2048CertAsync | Job-YZAHBZ | \8.0.0\corerun.exe | Tls13 | 13,176.09 us | 645.199 us | 743.012 us | 13,002.01 us | 11,995.380 us | 14,366.19 us | 1.05 | 0.10 | 8230 B | 1.00 |
| HandshakeRSA2048CertAsync | Job-EHNRXP | \main\corerun.exe | Tls13 | 12,621.10 us | 759.072 us | 843.707 us | 12,890.19 us | 10,836.742 us | 13,818.75 us | 1.00 | 0.00 | 8233 B | 1.00 |

The proposed change does a zero-byte read only before each TLS frame. I wrote a test that writes random data and adds some delay. I saw that ~5% of reads would still allocate a GCHandle, as the TLS frame may not be read in a single read. That of course depends on the specific pattern, but it still looks like an improvement. When we receive the beginning of a frame, it is likely that the rest will arrive as well, i.e. the receive buffer will be needed. We could add the zero-byte read into the loop just before the current TIOAdapter.ReadAsync to get 100% synchronous receives. Or we could do it only for Windows, the async path, or cases where the underlying stream is a NetworkStream (which may not work for Kestrel), to minimize the impact on other cases where the extra call has overhead with no benefit.
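
To make the "zero-byte read inside the loop" alternative concrete, here is a minimal sketch, assuming a caller-supplied frame buffer for brevity. `FrameReadSketch` and `FillFrameAsync` are hypothetical names; the real code path goes through TIOAdapter.ReadAsync and the internal SslStream buffer, neither of which is reproduced here.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class FrameReadSketch
{
    // Hypothetical helper: fills 'frame' from 'stream', waiting with a
    // zero-byte read before every real read. Returns the bytes actually read,
    // which is less than frame.Length only if the stream ends early.
    public static async ValueTask<int> FillFrameAsync(
        Stream stream, Memory<byte> frame, CancellationToken ct = default)
    {
        int total = 0;
        while (total < frame.Length)
        {
            // Wait until data is available (on streams that support this).
            await stream.ReadAsync(Memory<byte>.Empty, ct);

            // The real read may return only part of the frame.
            int n = await stream.ReadAsync(frame.Slice(total), ct);
            if (n == 0)
            {
                break; // end of stream before the frame was complete
            }
            total += n;
        }
        return total;
    }
}
```

The trade-off discussed above is visible in the loop: every partial read costs one extra call on the underlying stream, which only pays off when that stream can complete the following read synchronously.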

I had to touch the conformance tests, as they would now see zero-byte reads that were not expected, as well as zero-byte reads during the handshake, i.e. before the test body even starts.

cc: @davidfowl @Tratcher in case this has some unexpected impact on Kestrel

@wfurt wfurt added this to the 8.0.0 milestone Jun 14, 2023
@wfurt wfurt requested review from stephentoub and rzikm June 14, 2023 16:49
@wfurt wfurt self-assigned this Jun 14, 2023

ghost commented Jun 14, 2023

Tagging subscribers to this area: @dotnet/ncl, @bartonjs, @vcsjones
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: wfurt
Assignees: wfurt
Labels:

area-System.Net.Security

Milestone: 8.0.0

Comment on lines 723 to 733
if (frameSize == UnknownTlsFrameLength)
{
    // make sure we have space for the whole frame
    _buffer.EnsureAvailableSpace(frameSize - _buffer.EncryptedLength);
    // We do not have enough data to determine frame size. Use provided estimate e.g.
    // full TLS frame for read, and somewhat shorter frame for handshake or renegotiation
    _buffer.EnsureAvailableSpace(estimatedSize);
}
else
{
    // move existing data to the beginning of the buffer (they will
    // be couple of bytes only, otherwise we would have entire
    // header and know exact size)
    _buffer.EnsureAvailableSpace(_buffer.Capacity - _buffer.EncryptedLength);
    // make sure we have space for the whole frame
    _buffer.EnsureAvailableSpace(frameSize - _buffer.EncryptedLength);
}

Consider instead:

// If we don't have enough data to determine the frame size, use the provided estimate
// (e.g. a full TLS frame for reads, and a somewhat shorter frame for handshake / renegotiation).
// If we do know the frame size, ensure we have space for the whole frame.
_buffer.EnsureAvailableSpace(frameSize == UnknownTlsFrameLength ?
    estimatedSize :
    frameSize - _buffer.EncryptedLength);
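
For readers unfamiliar with where `frameSize` comes from, the following is a simplified sketch of the sizing decision in the suggestion above. It relies on the TLS record header being 5 bytes with a 16-bit big-endian payload length in the last two bytes. The class name, the helper names, and the sentinel value chosen for `UnknownTlsFrameLength` are assumptions for illustration only; the real SslStream code handles more cases than this.

```csharp
using System;

public static class TlsFrameSizeSketch
{
    // TLS record header: content type (1), version (2), payload length (2).
    public const int TlsHeaderSize = 5;

    // Assumed sentinel for "not enough bytes to know the size yet".
    public const int UnknownTlsFrameLength = int.MaxValue;

    // Returns header + payload size, or the sentinel if the header is incomplete.
    public static int GetFrameSize(ReadOnlySpan<byte> buffered)
    {
        if (buffered.Length < TlsHeaderSize)
        {
            return UnknownTlsFrameLength;
        }
        int payloadLength = (buffered[3] << 8) | buffered[4];
        return TlsHeaderSize + payloadLength;
    }

    // Mirrors the suggested expression: use the estimate when the size is
    // unknown, otherwise ask only for what is still missing from the frame.
    public static int SpaceToEnsure(int frameSize, int encryptedLength, int estimatedSize) =>
        frameSize == UnknownTlsFrameLength
            ? estimatedSize
            : frameSize - encryptedLength;
}
```

For example, a buffered header of `17 03 03 01 00` describes a 256-byte payload, so the frame is 261 bytes and, with 5 bytes already buffered, 256 more bytes of space are needed.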

@stephentoub stephentoub left a comment

Nice!

Co-authored-by: Stephen Toub <stoub@microsoft.com>
@davidfowl

cc @amcasey

@rzikm rzikm left a comment

LGTM modulo existing comments

@davidfowl

Eager to try this change out.

@wfurt wfurt merged commit 4542e09 into dotnet:main Jun 19, 2023
106 checks passed
@wfurt wfurt deleted the zeroRead branch June 19, 2023 18:37

davidfowl commented Jun 19, 2023

I want to clarify one thing: with all of the existing code doing zero byte reads, do we still end up doing multiple layers of zero byte reads, or does this really only affect the handshake code (where the caller doesn't own the buffer)?

i.e.

await sslStream.ReadAsync(Memory<byte>.Empty);
await sslStream.ReadAsync(buffer);

How many reads happen on the underlying stream?

stephentoub commented Jun 19, 2023

> How many reads happen on the underlying stream?

I'd expect three. I don't believe there's any memory here for whether the last read issued was zero-byte (nor am I convinced that would always be sound, in particular if buffer was zero-length).
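
One way to check such a count empirically would be to place a counting wrapper between SslStream and the transport. The sketch below is only an illustration: `ReadCountingStream` is a hypothetical class, its counters are not thread-safe, and it forwards only the members needed for the example (writes go through the synchronous `Write` override).

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical wrapper that counts reads issued against the inner stream.
public sealed class ReadCountingStream : Stream
{
    private readonly Stream _inner;

    public ReadCountingStream(Stream inner) => _inner = inner;

    public int TotalReads { get; private set; }
    public int ZeroByteReads { get; private set; }

    private void Record(int requested)
    {
        TotalReads++;
        if (requested == 0)
        {
            ZeroByteReads++;
        }
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        Record(count);
        return _inner.Read(buffer, offset, count);
    }

    public override ValueTask<int> ReadAsync(Memory<byte> buffer, CancellationToken cancellationToken = default)
    {
        Record(buffer.Length);
        return _inner.ReadAsync(buffer, cancellationToken);
    }

    public override Task<int> ReadAsync(byte[] buffer, int offset, int count, CancellationToken cancellationToken)
    {
        Record(count);
        return _inner.ReadAsync(buffer, offset, count, cancellationToken);
    }

    public override void Write(byte[] buffer, int offset, int count) => _inner.Write(buffer, offset, count);
    public override void Flush() => _inner.Flush();

    public override bool CanRead => _inner.CanRead;
    public override bool CanSeek => false;
    public override bool CanWrite => _inner.CanWrite;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
}
```

Wrapping the transport in this class before handing it to SslStream and comparing `TotalReads` with `ZeroByteReads` after the two calls in the question would show how many reads reach the underlying stream. The result depends on how the data arrives, so it is not stated here.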

wfurt commented Jun 19, 2023

We actually do not break out of EnsureFullTlsFrameAsync, so the initial read may do more IO work and the second would just decrypt data. There was special handling for zero byte reads, but it did not return when the zero read finished on the underlying stream. If this is what we want, I can make a follow-up change.

@davidfowl

@wfurt let's wait for the performance numbers to come back from the various https benchmarks. In Kestrel's case, the stream isn't a NetworkStream; it's a custom stream implementation that reads from a PipeReader.

cc @sebastienros

@dotnet dotnet locked as resolved and limited conversation to collaborators Jul 27, 2023

Successfully merging this pull request may close these issues.

Out Of Memory From Pinned byte[] used in SslStream
5 participants