
[WIP, NOREVIEW] Linux SslStream: custom BIO_METHOD over managed buffer windows #128245

Draft
rzikm wants to merge 5 commits into dotnet:main from rzikm:perf/sslstream-custom-bio

Conversation


@rzikm rzikm commented May 15, 2026

Note

This pull request was prepared with AI assistance (GitHub Copilot CLI). The code, build, and test validation were performed by the assistant under my supervision.

Summary

Replace the two BIO_s_mem instances backing each SSL handle on Linux with a custom BIO_METHOD that reads from / writes to caller-supplied managed buffer windows, and thread the user's Memory<byte> all the way through SSL_read. Together these changes eliminate the BIO staging memcpy in each direction (one on encrypt, one on decrypt) plus the CopyDecryptedData memcpy on the read side of the Linux hot path.

Previous behavior

SslStream staged every TLS record through BIO_s_mem, and reads returned plaintext via SslStream's internal decrypted buffer:

| Direction | Today |
| --- | --- |
| Decrypt | managed enc buf → BIO_write (copy 1) → BIO_s_mem storage → SSL_read reads from the BIO (copy 2, into OpenSSL's record buffer) → decrypt in place into SslStream._buffer → CopyDecryptedData (copy 3) into the user's Memory<byte> |
| Encrypt | managed plaintext → SSL_write → BIO_write into BIO_s_mem (copy 1) → BIO_read from the BIO into outToken (copy 2) |

New behavior

| Direction | This PR |
| --- | --- |
| Decrypt | managed enc buf is the BIO read window → SSL_read writes plaintext directly into the user's Memory<byte> (no CopyDecryptedData) |
| Encrypt | managed plaintext → SSL_write → BIO_write lands directly in the managed outToken window |

Two distinct optimizations land in this PR:

  1. Custom BIO_METHOD (ManagedSpanBio in pal_bio.c): eliminates the BIO staging memcpy in each direction.
  2. DecryptMessageDirect (new Unix-only PAL API): threads the user's Memory<byte> through SSL_read so OpenSSL decrypts straight into the caller's buffer, eliminating the CopyDecryptedData memcpy on the read side. Windows/OSX/Android expose IsDirectDecryptSupported = false and a throwing stub; the JIT eliminates the branch on those platforms.

Design notes

ManagedSpanBio (pal_bio.c)

  • Per-BIO context tracks an input read window (caller-supplied pinned buffer), an output write window (caller-supplied pinned buffer), a heap-backed spill buffer for output overflow, and a heap-backed carry buffer for unconsumed input bytes.
  • Output: BIO_write lands directly into the write window; once it fills, the BIO falls back to the spill buffer. BioGetWriteResult reports bytes written to the window plus a spill flag; BioDrainSpill copies the spill out afterwards.
  • Input: BIO_read drains any leftover carry first, then the window. On BioClearReadWindow any unread tail is migrated into the carry — this preserves the BIO_s_mem accumulation semantics that SslStreamPal.HandshakeInternal / DecryptMessage rely on (the SSL engine may not consume every byte handed to the BIO in one call).
  • BIO_CTRL_PENDING / BIO_CTRL_RESET updated accordingly.

The output spill buffer is non-negotiable: SSL_read can also emit bytes to the output BIO (KeyUpdate response, alerts, close_notify on shutdown), so the BIO must always be able to absorb writes even when the caller didn't pre-arm a write window. Spill-stress measurements (forced 100% spill via env var) confirm the path is correct and bounded; normal-workload instrumentation shows zero spill events across ~7.5 M Encrypt calls — the path is dead code on the hot path under normal traffic.
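To make the window mechanics concrete, here is a minimal C# sketch of how the managed encrypt path drives these exports. The export names (BioSetWriteWindow, BioGetWriteResult, BioDrainSpill) come from this PR's export list; the signatures and the stub bodies below are assumptions for illustration, not the actual Interop code.

```csharp
using System;

static unsafe class EncryptFlowSketch
{
    // Stand-ins for the native exports (assumed signatures, not the real P/Invokes).
    static void BioSetWriteWindow(IntPtr bio, byte* ptr, int len) { /* native */ }
    static void BioGetWriteResult(IntPtr bio, out int windowBytes, out int spillBytes)
        { windowBytes = 0; spillBytes = 0; }
    static int BioDrainSpill(IntPtr bio, byte* dst, int len) => 0;   /* native */
    static int SslWrite(IntPtr ssl, ReadOnlySpan<byte> plaintext) => plaintext.Length;

    public static int Encrypt(IntPtr ssl, IntPtr writeBio,
                              ReadOnlySpan<byte> plaintext, Span<byte> outToken)
    {
        fixed (byte* outPtr = outToken)
        {
            try
            {
                // Arm the output window: SSL_write's ciphertext lands here
                // directly, with no BIO_s_mem staging copy.
                BioSetWriteWindow(writeBio, outPtr, outToken.Length);
                SslWrite(ssl, plaintext);

                // Bytes that landed in the window, plus any overflow that went
                // to the heap-backed spill buffer (cold path).
                BioGetWriteResult(writeBio, out int written, out int spilled);
                if (spilled > 0)
                {
                    written += BioDrainSpill(writeBio, outPtr + written,
                                             outToken.Length - written);
                }
                return written;
            }
            finally
            {
                // Clear the window before the fixed block ends so the native
                // side never holds a pointer into unpinned managed memory.
                BioSetWriteWindow(writeBio, null, 0);
            }
        }
    }
}
```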

DecryptMessageDirect

  • Adds Interop.OpenSsl.Decrypt(input, output) taking ciphertext and plaintext as separate spans (the prior in-place call site passes the same span for both to preserve today's behavior).
  • Adds SSL_pending wrapper to detect plaintext that OpenSSL buffered internally when the user span was smaller than a record's plaintext (up to 16 KiB). SslStream tracks this via _palHasPendingPlaintext (guarded by _handshakeLock) and drains residual via a follow-up direct SSL_read with empty input before the next network IO.
  • Direct-decrypt is gated on SslStreamPal.IsDirectDecryptSupported && !buffer.IsEmpty && _handshakeWaiter == null. Non-OK SSL statuses copy any direct-written bytes into extraBuffer so the existing renegotiate/ContextExpired flow keeps working.
  • NegotiateClientCertificateAsync _buffer.ActiveLength > 0 precondition extended to also check _palHasPendingPlaintext so the PendingDecryptedData_Throws contract is preserved when the residual lives inside OpenSSL rather than _buffer.
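A self-contained sketch of that read-side gating and residual-drain logic follows; IDirectDecryptPal and every name below are stand-ins (assumptions), not the real SslStreamPal surface.

```csharp
using System;

interface IDirectDecryptPal
{
    bool IsDirectDecryptSupported { get; }
    // SSL_read with separate ciphertext input and plaintext output spans.
    int DecryptDirect(ReadOnlySpan<byte> ciphertext, Span<byte> plaintext);
    // SSL_pending: plaintext OpenSSL buffered internally on the last read.
    int PendingPlaintext();
}

static class DirectDecryptSketch
{
    public static int Read(IDirectDecryptPal pal, ReadOnlySpan<byte> record,
                           Memory<byte> userBuffer, ref bool palHasPendingPlaintext,
                           bool rehandshakeInFlight)
    {
        // The gate: platform support, a non-empty destination, no rehandshake in flight.
        if (!pal.IsDirectDecryptSupported || userBuffer.IsEmpty || rehandshakeInFlight)
            throw new NotSupportedException("caller falls back to the in-place path");

        // Residual plaintext from a prior oversized record must be drained before
        // new network IO: SSL_read with empty input only consumes OpenSSL's
        // internal buffer.
        ReadOnlySpan<byte> input = palHasPendingPlaintext
            ? ReadOnlySpan<byte>.Empty
            : record;
        int written = pal.DecryptDirect(input, userBuffer.Span);

        // A record's plaintext (up to 16 KiB) may exceed the user span; remember
        // the leftover so the next read drains it first.
        palHasPendingPlaintext = pal.PendingPlaintext() > 0;
        return written;
    }
}
```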

Concurrency

DecryptData / DecryptDataDirect / EncryptData all run under _handshakeLock in SslStream, so the BIO state machine and _palHasPendingPlaintext see a single in-flight SSL_* call at a time. Buffer pointers are stashed only for the duration of the SSL_read / SSL_write call and cleared before the fixed block ends.

Compatibility

All shim entries used (BIO_meth_new, BIO_meth_set_*, BIO_set_data, BIO_get_data, BIO_get_new_index, BIO_set_init, BIO_set_flags / BIO_clear_flags / BIO_test_flags, SSL_pending) are available since OpenSSL 1.1.0, which is our minimum.

Validation

  • libs.native+libs clean (0 warnings, 0 errors).
  • System.Net.Security.Tests functional: 4933 pass, 0 fail, 8 skip (matches baseline, all expected platform/OS skips).
  • Crank benchmarks against aspnet-gold-lin (TLS 1.3, baseline = 3d73a08f1ba, current tip = 7371f2e2a1b):
    • SslStream read-write: +28.8% read MB/s (601 → 774), +26.1% write MB/s (695 → 877), +20.5% throughput-per-server-CPU%.
    • SslStream handshake: flat (within run noise).
    • HTTPS GET 16 KiB keep-alive / Connection-close: flat (within run noise — workloads are dominated by something other than SSL CPU at this concurrency).
  • Spill instrumentation: 0 spill events across ~7.5 M Encrypt operations in real workloads. Spill-stress forced-100% mode: 0 errors across 500k+ records, ~20% slower under saturation — confirms the safety net is functionally correct and the cost is bounded.

See follow-up comments for full benchmark tables (including CPU usage) and instrumentation results.

Copilot AI review requested due to automatic review settings May 15, 2026 08:00
@dotnet-policy-service

Tagging subscribers to this area: @bartonjs, @vcsjones, @dotnet/area-system-security
See info in area-owners.md if you want to be subscribed.


Copilot AI left a comment


Pull request overview

This PR replaces Linux/OpenSSL SslStream memory BIO staging with a custom managed-window BIO to reduce TLS record copies, and also includes a separate NegotiateStream stale-buffer bug fix.

Changes:

  • Adds native managed-span BIO APIs, OpenSSL shim entries, and exports.
  • Updates Unix SslStream handshake/encrypt/decrypt paths to use managed read/write windows plus spill draining.
  • Fixes NegotiateStream read-buffer state after mid-frame read failure and adds a regression test.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| src/native/libs/System.Security.Cryptography.Native/pal_bio.h | Declares the managed-span BIO native APIs. |
| src/native/libs/System.Security.Cryptography.Native/pal_bio.c | Implements the custom BIO_METHOD with read-carry and write-spill buffers. |
| src/native/libs/System.Security.Cryptography.Native/opensslshim.h | Adds OpenSSL BIO method/flag function shims. |
| src/native/libs/System.Security.Cryptography.Native/entrypoints.c | Exports the new native BIO entry points. |
| src/libraries/Common/src/Interop/Unix/System.Security.Cryptography.Native/Interop.Ssl.cs | Adds P/Invokes and switches Unix SSL handles to managed-span BIOs. |
| src/libraries/Common/src/Interop/Unix/System.Security.Cryptography.Native/Interop.OpenSsl.cs | Reworks Unix OpenSSL handshake/encrypt/decrypt to arm BIO windows and drain spill output. |
| src/libraries/System.Net.Security/src/System/Net/Security/NegotiateStream.cs | Defers read-buffer state updates until reads/decryption succeed. |
| src/libraries/System.Net.Security/tests/FunctionalTests/NegotiateStreamStreamToStreamTest.cs | Adds a regression test for stale data after a mid-frame read failure. |

@rzikm rzikm changed the title Linux SslStream: custom BIO_METHOD over managed buffer windows [WIP, NOREVIEW] Linux SslStream: custom BIO_METHOD over managed buffer windows May 15, 2026
@rzikm rzikm added the NO-REVIEW Experimental/testing PR, do NOT review it label May 15, 2026
Replace the pair of BIO_s_mem instances backing each SSL handle on
Linux with a custom BIO_METHOD that reads/writes directly into
caller-supplied managed buffer windows, with a heap-backed spill
buffer for output overflow and a heap-backed carry buffer for
unconsumed input bytes.

This eliminates one memcpy per TLS record in both directions
(encrypt and decrypt) by allowing OpenSSL to read plaintext from
and write ciphertext into managed buffers in-place, instead of
staging through BIO_s_mem.

Native side (src/native/libs/System.Security.Cryptography.Native/):
* pal_bio.c gains a ManagedSpanBio implementation (read/write/ctrl
  callbacks, lazy BIO_METHOD init via pthread_once) plus seven
  exports: BioNewManagedSpan, BioSetReadWindow, BioClearReadWindow,
  BioSetWriteWindow, BioGetWriteResult, BioDrainSpill,
  BioResetManagedSpan.
* When BioClearReadWindow is called with unread bytes still in the
  window, the tail is copied into a per-BIO readCarry buffer so the
  next BIO_read drains it before any new window. This preserves the
  BIO_s_mem semantic that the SslStreamPal layer relies on.
* opensslshim.h adds the BIO_meth_* / BIO_get_data / BIO_set_data /
  BIO_get_new_index / BIO_clear_flags / BIO_test_flags / BIO_set_init
  / BIO_set_flags shim entries (all required since OpenSSL 1.1.0).
* entrypoints.c registers the new exports.

Managed side:
* Interop.Ssl.cs declares the seven new P/Invokes and switches
  SafeSslHandle.Create to allocate ManagedSpan BIOs instead of
  memory BIOs.
* Interop.OpenSsl.cs rewrites Decrypt, Encrypt and DoSslHandshake
  to pin caller buffers, call BioSet*Window before the SSL_*
  operation, and BioClearReadWindow / BioGetWriteResult /
  BioDrainSpill afterwards. New helpers ComputeMaxTlsOutput and
  DrainOutputBioSpill centralise the output-bound logic.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@rzikm rzikm force-pushed the perf/sslstream-custom-bio branch from a9e5bb6 to a134137 Compare May 15, 2026 08:21
- pal_bio: fail BIO_read on lost carry bytes instead of silently dropping
- pal_bio: BIO_CTRL_RESET clears window pointers + error flag
- pal_bio: drop unused BioResetManagedSpan entry point
- DoSslHandshake/Encrypt/Decrypt: clear BIO windows in finally inside fixed
- Encrypt: snapshot pre-write Size so drained spill bytes survive a failed
  SSL_write instead of being reset to 0
- Encrypt: pass only the per-record upper bound to EnsureAvailableSpace
  (not Size + upperBound, which over-allocates by Size)
- ComputeMaxTlsOutput: use OpenSSL's SSL3_RT_MAX_ENCRYPTED_OVERHEAD (256)
  per record instead of the 128-byte estimate that could trigger the spill
  fallback for legitimate cipher suites

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 15, 2026 08:41

rzikm commented May 15, 2026

Note

AI-generated content disclosure: this benchmark and comment were prepared with GitHub Copilot CLI.

Triggering an end-to-end SslStream round-trip benchmark to validate the perf impact of the custom BIO_METHOD. Each iteration writes a chunk on the client and drains it on the server, exercising both the encrypt path (BIO write side) and decrypt path (BIO read side). Parametrized on chunk size so we see the per-record overhead effect (small chunks) and the bulk-throughput effect (large chunks). Linux only — the change is Linux/OpenSSL-specific.

@EgorBot -linux_amd -linux_intel

using System;
using System.Net;
using System.Net.Security;
using System.Net.Sockets;
using System.Security.Authentication;
using System.Security.Cryptography;
using System.Security.Cryptography.X509Certificates;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(SslStreamBench).Assembly).Run(args);

[MemoryDiagnoser]
public class SslStreamBench
{
    private SslStream _client = null!;
    private SslStream _server = null!;
    private byte[] _payload = null!;
    private byte[] _readBuf = null!;

    [Params(64, 1024, 16384, 65536)]
    public int ChunkSize { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        using var rsa = RSA.Create(2048);
        var req = new CertificateRequest("CN=localhost", rsa, HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);
        var tmp = req.CreateSelfSigned(DateTimeOffset.UtcNow.AddDays(-1), DateTimeOffset.UtcNow.AddDays(30));
        var cert = new X509Certificate2(tmp.Export(X509ContentType.Pfx), (string?)null, X509KeyStorageFlags.Exportable);

        var listener = new TcpListener(IPAddress.Loopback, 0);
        listener.Start();
        var clientSock = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        var connectTask = clientSock.ConnectAsync(IPAddress.Loopback, ((IPEndPoint)listener.LocalEndpoint).Port);
        var serverSock = listener.AcceptSocket();
        connectTask.GetAwaiter().GetResult();
        listener.Stop();
        clientSock.NoDelay = true;
        serverSock.NoDelay = true;

        _client = new SslStream(new NetworkStream(clientSock, true), false, (_, _, _, _) => true);
        _server = new SslStream(new NetworkStream(serverSock, true), false);

        var copts = new SslClientAuthenticationOptions
        {
            TargetHost = "localhost",
            EnabledSslProtocols = SslProtocols.Tls12,
        };
        var sopts = new SslServerAuthenticationOptions
        {
            ServerCertificate = cert,
            EnabledSslProtocols = SslProtocols.Tls12,
        };

        Task.WaitAll(
            _client.AuthenticateAsClientAsync(copts),
            _server.AuthenticateAsServerAsync(sopts));

        _payload = new byte[ChunkSize];
        new Random(42).NextBytes(_payload);
        _readBuf = new byte[ChunkSize];
    }

    [Benchmark]
    public async Task RoundTrip()
    {
        await _client.WriteAsync(_payload).ConfigureAwait(false);
        int total = 0;
        while (total < ChunkSize)
        {
            int r = await _server.ReadAsync(_readBuf.AsMemory(total)).ConfigureAwait(false);
            if (r == 0) break;
            total += r;
        }
    }

    [GlobalCleanup]
    public void Cleanup()
    {
        _client?.Dispose();
        _server?.Dispose();
    }
}


Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.


rzikm commented May 15, 2026

Note

AI-generated content disclosure: this benchmark and comment were produced with GitHub Copilot CLI.

crank: SslStream read-write throughput + handshake

Setup

  • Profile: aspnet-gold-lin (Intel Xeon Gold, 56 logical cores, dedicated lab box per role)
  • Scenario file: sslstream.benchmarks.yml
  • Scenarios run: read-write (full-duplex bulk transfer) and handshake (handshake-only loop)
  • TLS 1.3, send/receive buffer = 32 KB, client/server CertContext selection, TLS resumption allowed
  • 1 client connection (the scenario default — saturates both encrypt and decrypt paths on a single SSL session)
  • 15 s warm-up + 15 s measurement window
  • Framework forced to net11.0 via --server.framework net11.0 --server.runtimeVersion edge (same for client) so the locally-built libs can be dropped on top
  • Baseline: System.Net.Security.dll + libSystem.Security.Cryptography.Native.OpenSsl.so built from 3d73a08f1ba (origin/main at the merge-base of this PR)
  • Custom BIO (encrypt-only zero-copy): PR at commit 68743ca4d55
  • + Direct Decrypt (this comment, current PR tip 7371f2e2a1b): also threads the user Memory<byte> through SSL_read so plaintext lands directly in the caller's buffer (one more memcpy eliminated on the read path)
  • Both sides — server and client — receive the same overlay files

CPU columns below report benchmarks/cpu/raw (sum across cores; 100 % = one fully-busy core). The lab boxes have 56 logical cores.

Results (single run each unless noted)

| Scenario | Metric | Baseline | Custom BIO | + Direct Decrypt | Δ vs Baseline |
| --- | --- | --- | --- | --- | --- |
| read-write | sslstream/read/mean (MB/s) | 601.1 | 758.5 | 774.1 | +28.8% |
| read-write | sslstream/write/mean (MB/s) | 695.4 | 795.2 | 876.6 | +26.1% |
| read-write | server CPU (cores %) | 194 | 217 | 205 | +5.7% |
| read-write | client CPU (cores %) | 226 | 218 | 214 | −5.3% |
| read-write | combined throughput / server-core% (MB/s) | 6.68 | 7.16 | 8.05 | +20.5% |
| handshake | sslstream/handshake/avg (ms) | 4.974 | 5.088 | 4.994 | +0.4% |
| handshake | sslstream/handshake/p99 (ms) | 6.911 | 7.112 | 6.909 | 0.0% |
| handshake | server CPU (cores %) | 112 | 136 | 125 | +12% |
| handshake | client CPU (cores %) | 95 | 93 | 94 | −1% |

read-write: Direct decrypt adds another ~+2% on read MB/s and +10% on write MB/s on top of the custom-BIO encrypt change (write side benefits because the now-faster read loop pulls records out of the kernel quicker and the duplex throughput re-balances). Total improvement vs origin/main is now +28.8% read / +26.1% write, with server-CPU efficiency (MB/s per server-core%) up +20.5% — i.e. the same single SSL session now drives ~20% more bytes per unit of server CPU.

handshake: Within noise of baseline, as expected — handshake-only scenarios don't exercise the application-data BIO or SSL_read fast paths the optimization targets.

Both directions are now zero-copy in steady state: SSL_write writes ciphertext directly into the outbound ProtocolToken window, and SSL_read writes plaintext directly into the caller's Memory<byte>. The output-side spill buffer remains in place as a correctness fallback (see spill instrumentation comment above) but is never hit on the hot path.

Next: I also re-ran the httpclient HTTPS scenarios — see the follow-up comment.


rzikm commented May 15, 2026

Note

AI-generated content disclosure: this benchmark and comment were produced with GitHub Copilot CLI.

crank: httpclient + Kestrel over HTTPS

Follow-up to measure impact on an end-to-end HTTP path where the SSL encrypt/decrypt cost is amortized over HTTP parsing/Kestrel processing.

Setup

  • Profile: aspnet-gold-lin (Intel Xeon Gold, 56 cores, dedicated lab box per role)
  • Scenario file: httpclient.benchmarks.yml, scenario httpclient-kestrel-get
  • HTTP/1.1 over TLS 1.3 (useHttps=true, server uses Kestrel default TLS pipeline)
  • Response body 16 KiB (responseSize=16384) — one full TLS record
  • Client: 1 HttpClient × 32 in-flight requests (concurrencyPerHttpClient=32 numberOfHttpClients=1)
  • 15 s warm-up + 15 s measurement
  • Two variants:
    • keep-alive: default headers — HttpClient reuses pooled TLS connections, so handshake cost is amortized.
    • Connection: close (requestHeaders=connectionclose): one TLS handshake per HTTP request — exercises the handshake-heavy path so the steady-state BIO win is diluted by handshake cost.
  • Framework, runtime version, and binary overlay set up exactly as the SslStream run above.
  • Baseline: 3d73a08f1ba. Custom BIO: 68743ca4d55. + Direct Decrypt: current PR tip 7371f2e2a1b (two runs each for the http variants because their per-run variance is meaningful).

CPU columns below report benchmarks/cpu/raw (sum across cores; 100 % = one fully-busy core). The lab boxes have 56 logical cores.

Results

| Variant | Metric | Baseline | Custom BIO | + Direct Decrypt (run 1 / run 2 / mean) | Δ vs Baseline (mean) |
| --- | --- | --- | --- | --- | --- |
| keep-alive | http/rps/mean | 27,039.6 | 27,545.9 | 26,422.6 / 26,879.7 / 26,651 | −1.4% |
| keep-alive | server CPU (cores %) | 778 | 667 | 747 / 817 / 782 | +0.5% |
| keep-alive | client CPU (cores %) | 503 | 486 | 491 / 562 / 527 | +4.8% |
| keep-alive | RPS per server-core% | 34.8 | 41.3 | 35.4 / 32.9 / 34.1 | −2.0% |
| Connection: close | http/rps/mean | 7,180.4 | 7,285.5 | 6,489.7 / 6,839.3 / 6,665 | −7.2% |
| Connection: close | server CPU (cores %) | 1,849 | 1,839 | 1,672 / 1,479 / 1,576 | −14.8% |
| Connection: close | client CPU (cores %) | 1,725 | 1,710 | 1,868 / 1,984 / 1,926 | +11.7% |
| Connection: close | RPS per server-core% | 3.88 | 3.96 | 3.88 / 4.62 / 4.23 | +9.0% |

All six runs (3 versions × 2 variants): 0 errors, 0 bad-status responses.

Interpretation

  • The httpclient HTTPS workload is much more variable than the pure-SslStream read-write benchmark — both keep-alive and Connection: close RPS showed ±2–3 % swings between back-to-back runs even on this dedicated lab profile, and CPU was even more variable. Two-run averages above are reported to give a clearer picture; single-run swings should not be over-interpreted.
  • keep-alive RPS is essentially flat across all three builds — within run-to-run variance. The custom-BIO commit alone showed a clear server-CPU drop in earlier measurements; that benefit is partially obscured here by the broader run variance.
  • Connection: close is dominated by full TLS 1.3 handshakes (one per request), so neither the BIO nor the decrypt-direct change moves it meaningfully. The headline −7.2% mean RPS swing is well inside this scenario's run-to-run noise band; server-CPU figures move in the opposite direction (down 15%), which is the canonical signature of noise rather than a real regression. The pure-SslStream and instrumentation results (no spill hits, ~28% read/write throughput uplift) confirm the changes don't regress the encrypt/decrypt path itself.

Net: the bulk encrypt/decrypt wins are clearest in workloads where SSL CPU is the bottleneck (pure-SslStream read-write). End-to-end HTTPS workloads either see the win as a small CPU/throughput improvement or are bottlenecked elsewhere and show no significant change.


rzikm commented May 15, 2026

Note

AI-generated content disclosure: this analysis and comment were produced with GitHub Copilot CLI.

Spill-path instrumentation (validation only — now removed from the PR)

Added temporary instrumentation (env-var gated, never committed to the final PR) to the Encrypt and DrainOutputBioSpill paths to count how often the output-side spill buffer is actually needed in real workloads:

  • Encrypt calls (total)
  • post-write-spill = Encrypt calls where spillLen > 0 after SSL_write (window was too small for what OpenSSL emitted)
  • post-write-spill-bytes = total bytes that went through the spill path
  • pre-drain-spill-hits = DrainOutputBioSpill calls that found leftover bytes (e.g. alert/KeyUpdate emitted by a prior SSL_read)
  • pre-drain-spill-bytes = total bytes drained via the pre-drain path

Re-ran the same three scenarios from the previous comments on aspnet-gold-lin, TLS 1.3, with the instrumentation active on both sides:

| Scenario | Side | Encrypt calls | post-write spills | pre-drain hits |
| --- | --- | --- | --- | --- |
| sslstream read-write (32 KiB buffers, 1 conn) | server | 600,000+ | 0 | 0 |
| sslstream read-write | client | 740,849 | 0 | 0 |
| httpclient HTTPS GET 16 KiB, keep-alive, 32 in-flight | server | 4,195,656 | 0 | 0 |
| httpclient HTTPS GET 16 KiB, keep-alive, 32 in-flight | client | 839,132 | 0 | 0 |
| httpclient HTTPS GET 16 KiB, Connection: close, 32 in-flight | server | 977,651 | 0 | 0 |
| httpclient HTTPS GET 16 KiB, Connection: close, 32 in-flight | client | 195,531 | 0 | 0 |

(Server counts are the last periodic-dump snapshots before crank SIGKILLs the server, so the true totals are slightly higher than shown. Clients exit cleanly so those numbers are exact.)

Aggregate: 0 spill events across ~7.5 M Encrypt operations, covering bidirectional bulk transfer, persistent-connection HTTPS, and handshake-per-request HTTPS.

Spill-stress validation (forced 100% spill, post-direct-decrypt)

Because the spill code path is dead-code on the hot path, a second env-var (DOTNET_SSLSTREAM_BIO_SPILL_STRESS=1) was added to clamp the write-window to 16 bytes, forcing every Encrypt call to overflow into the spill buffer and drain it. This was re-run after the direct-decrypt change to confirm the safety net still works correctly when combined with the new read path:

| Scenario | Mode | Encrypt calls | spill rate | spill bytes | errors |
| --- | --- | --- | --- | --- | --- |
| sslstream read-write (server) | spill-stress + direct-decrypt | 500,000+ | 100% | 16.4 GB | 0 |

Throughput in stress mode (read 617 / write 805 MB/s) is ~10–20% below normal mode (774 / 877 MB/s) — confirming the spill memcpy itself has a real-but-bounded cost, and proving the spill path is functionally correct in the worst case.

The spill buffer remains in place as a defensive fallback for:

  1. Underestimates of ComputeMaxTlsOutput (currently inputLen + 256 * ((inputLen >> 14) + 2) — generous: TLS 1.3 record overhead is well under 256 bytes per record; see the sketch after this list).
  2. Bytes OpenSSL writes outside our explicit windows (e.g. alerts emitted during SSL_read / handshake-internal alerts).
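
For reference, a sketch of that bound (the name ComputeMaxTlsOutput and the formula are quoted from item 1 above; the exact signature is an assumption):

```csharp
// Per-record output bound as quoted above: the payload plus
// SSL3_RT_MAX_ENCRYPTED_OVERHEAD (256) for each 16 KiB record, where
// (inputLen >> 14) + 2 over-counts by one record to stay generous.
static int ComputeMaxTlsOutput(int inputLen)
    => inputLen + 256 * ((inputLen >> 14) + 2);
```

For example, a 64 KiB write gives 65536 >> 14 = 4 full records, so the bound is 65536 + 256 × 6 = 67072 bytes — comfortably above what any supported cipher suite emits.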

Neither situation was observed in the regular benchmarks, so the spill path is effectively dead code on the hot path under normal traffic — which is the desired outcome. The path is still required for correctness when those edge cases occur (KeyUpdate response during a Decrypt, close_notify during shutdown, mid-stream alerts, etc.) and we have proof it works under saturation.

Instrumentation and the stress-clamp were removed from the PR after this validation step (commit 7371f2e2a1b).

rzikm and others added 2 commits May 15, 2026 14:57
Symmetric to the existing custom-BIO encrypt optimization, this change
threads the caller-supplied Memory<byte> all the way through to
SSL_read so that OpenSSL writes decrypted plaintext directly into the
user destination, eliminating the intermediate copy from the internal
encrypted buffer to the user buffer (CopyDecryptedData on the read
path) for the common case where the user buffer has enough room.

Approach:
- Split Interop.OpenSsl.Decrypt into a (ReadOnlySpan<byte> input,
  Span<byte> output) form: the input span feeds the BIO read window
  for ciphertext, the output span is the SSL_read destination for
  plaintext. The legacy in-place call site (DecryptMessage) now passes
  the same span for both, preserving today's behavior.
- Add SSL_pending wrapper (CryptoNative_SslPending) so we can detect
  plaintext residual that OpenSSL buffered internally when the user
  span was smaller than a record's plaintext. The next read drains it
  via DecryptMessageDirect(empty input, user buffer) before any
  network IO.
- New SslStreamPal API (Unix only for now): DecryptMessageDirect plus
  IsDirectDecryptSupported. Other PALs (Windows/OSX/Android) expose
  IsDirectDecryptSupported=false and a throwing stub so the JIT
  eliminates the new branch on those platforms.
- SslStream.IO.ReadAsyncInternal: gated on
  SslStreamPal.IsDirectDecryptSupported, non-empty user buffer and no
  in-flight rehandshake, uses the new direct path. Non-OK status
  copies the direct-written bytes into extraBuffer so the existing
  Renegotiate/ContextExpired handlers keep working. The
  net_ssl_renegotiate_buffer guard now also checks
  _palHasPendingPlaintext to keep the
  NegotiateClientCertificateAsync_PendingDecryptedData_Throws contract.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This reverts commit 7371f2e.

The direct-decrypt optimization caused a record-layer failure
(error:0A000139:SSL routines::record layer failure) under HTTP/2 with
concurrency >= 2, where the client receives large response payloads via
direct decrypt. HTTP/1.1 keep-alive and HTTP/1.1 connection: close paths
were not affected and the original PR validation did not exercise HTTP/2
multiplexed reads.

Keeping the custom-BIO encrypt optimization (commit a134137), which
remains correct under all protocols tested.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address PR review feedback by collapsing the per-operation BIO setup,
SSL_{do_handshake,read,write}, BIO result retrieval and ERR_clear_error
into a single P/Invoke per TLS operation (CryptoNative_Ssl{Handshake,
Encrypt,Decrypt}). This removes three GC suspend/resume transitions per
TLS read or write.

On the read path, the atomic SslDecrypt now takes separate input
(ciphertext) and output (user buffer) pointers. When the user buffer is
large enough to receive a full TLS record's plaintext (>= 16 KB), the
decrypted bytes are written directly into the user-provided memory,
avoiding the intermediate copy from _buffer.DecryptedSpan via
CopyDecryptedData. Smaller reads continue to use the in-place path,
keeping the implementation free of partial-record/drain state.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 15, 2026 18:37

rzikm commented May 15, 2026

Note

This comment was prepared with AI assistance (GitHub Copilot CLI). Measurements were collected and validated by the assistant under my supervision.

Update — atomic SSL ops + direct-decrypt re-landed (commit 7d3fc43)

Two changes squashed into one commit on top of the prior tip:

1. Atomic native SSL operations (addressing @bartonjs's review)

Each TLS op (handshake step / encrypt / decrypt) is now a single P/Invoke. The native side does ERR_clear_error → set BIO read/write windows via SSL_get_rbio / SSL_get_wbio → SSL_{do_handshake,read,write} → retrieve BIO write result → clear windows, all inside one native call, so the thread stays in GC-preemptive mode throughout. This removes 3 GC suspend/resume transitions per TLS op vs. the prior multi-call sequence.

New native entry points: CryptoNative_SslHandshake, CryptoNative_SslEncrypt, CryptoNative_SslDecrypt. The old per-step pinvokes (BioSetReadWindow / BioSetWriteWindow / BioGetWriteResult / BioClearReadWindow) are no longer called from the SSL hot path; they remain available for the handshake helper.
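
The managed declaration for the collapsed call might look like the following (the entry-point name CryptoNative_SslEncrypt comes from this comment; the parameter list is my assumption, shown only to illustrate the one-transition shape):

```csharp
using System;
using System.Runtime.InteropServices;

internal static class SslInteropSketch
{
    // Hypothetical signature — one native call clears the OpenSSL error queue,
    // arms both BIO windows, runs SSL_write, reads back the BIO result, and
    // clears the windows, so managed code crosses the native boundary once.
    [DllImport("libSystem.Security.Cryptography.Native.OpenSsl")]
    internal static extern unsafe int CryptoNative_SslEncrypt(
        IntPtr ssl,
        byte* input, int inputLen,    // plaintext to encrypt
        byte* output, int outputLen,  // managed write window for ciphertext
        out int windowBytes,          // ciphertext that landed in the window
        out int spillBytes);          // overflow held in the native spill buffer
}
```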

2. Direct-decrypt restored, gated on buffer size

The previous direct-decrypt attempt (reverted at dc16ab65) was broken under HTTP/2 concurrency because SSL_pending alone is not strong enough to track partial-record state, and the drain path interacted badly with TLS 1.3 post-handshake records.

Re-landed under a simple gate: direct-decrypt now requires buffer.Length >= 16384 (max single TLS record plaintext per RFC 5246/8446). With that gate, one SSL_read always fully consumes the record and emits all of its plaintext, so SSL_pending is always 0 after, and the drain path is never reached on the hot path. Small reads (e.g. HTTP/2 9-byte frame headers) silently fall back to the in-place path, which is the prior known-good behavior.
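
In other words, the gate reduces to a single size check (the names below are stand-ins, not the actual SslStream members):

```csharp
// A TLS record carries at most 2^14 = 16384 bytes of plaintext
// (RFC 5246 / RFC 8446), so a destination at least that large always
// absorbs a whole record in one SSL_read and SSL_pending stays 0 after.
static bool CanDirectDecrypt(Memory<byte> userBuffer) => userBuffer.Length >= 16384;
```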

Validation

  • System.Net.Security.Tests functional: 4941 pass / 0 fail / 8 skip
  • HTTP/2 c=10 (responseSize=16384, useHttps=true): 5,251 RPS, 0 exceptions
  • HTTP/2 c=100 (responseSize=16384, useHttps=true): 20,779 RPS, 0 exceptions ✅ (the case the prior attempt was failing on)

Benchmark deltas (vs. main baseline 3d73a08, aspnet-gold-lin, TLS 1.3)

| Scenario / metric | Baseline (main 3d73a08) | This PR (7d3fc43) | Δ vs main |
| --- | --- | --- | --- |
| read-write read MB/s | 601.1 | 798.6 | +32.9% |
| read-write write MB/s | 695.4 | 835.9 | +20.2% |
| handshake mean ms | 4.974 | 4.924 | noise |
| HTTPS GET 16 KiB, HTTP/1.1, c=32 keep-alive | 27,039 RPS | 27,745 RPS | noise |
| HTTPS GET 16 KiB, HTTP/2, c=10 | 5,048 RPS | 5,251 RPS | +4.0% |
| HTTPS GET 16 KiB, HTTP/2, c=100 | 20,477 RPS | 20,779 RPS | +1.5% |

All runs: 0 bad-status responses, 0 exceptions. HTTP/2 deltas are small because at 16 KiB responses the bottleneck is HTTP/2 framing / Kestrel work rather than the SSL memcpy path; the read-write benchmark directly stresses what this PR actually accelerates.


Copilot AI left a comment


Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.


Labels

area-System.Security · NO-REVIEW (Experimental/testing PR, do NOT review it)
