tls: do not clobber cert-verify transport failure with trailing ECONNRESET#45014
Open
nyckone wants to merge 2 commits into
Open
tls: do not clobber cert-verify transport failure with trailing ECONNRESET#45014nyckone wants to merge 2 commits into
nyckone wants to merge 2 commits into
Conversation
…RESET When io_handle_bio's SO_ERROR probe pushes a peer-originated ECONNRESET onto the BoringSSL error queue (the path added in envoyproxy#44149), SslSocket's drainErrorQueue() unconditionally set detected_io_error_ to ConnectionReset. For TLS handshakes that fail due to a certificate verify failure (e.g. CRL revocation) the queue contains BOTH the SSL_R_CERTIFICATE_VERIFY_FAILED entry (the actual root cause) and the trailing ECONNRESET entry from the peer tearing the connection down after our bad-cert alert. Because the ECONNRESET was processed last, the IoError surfaced as ConnectionReset, the connection flipped detected_close_type_ to RemoteReset, and the HTTP conn pool reported reset reason 'remote connection failure' instead of the original 'connection failure' + transport failure reason 'TLS_error:...CERTIFICATE_VERIFY_FAILED' that operators relied on. Defer the detected_io_error_ assignment until after the queue is fully walked, and only commit ConnectionReset when no TLS protocol-level failure (cert verify failure or missing client cert) is also queued. This preserves the pure-ECONNRESET path that envoyproxy#44149 added while keeping the diagnostic TLS error signal as the user-facing failure reason in mixed scenarios. Fixes envoyproxy#45011. Signed-off-by: nyckone <nyckone@users.noreply.github.com> Signed-off-by: Doron Hogery <doron.hogery@gmail.com>
…NRESET Adds a SslSocketPeer friend (following the existing ContextImplPeer pattern) that exposes drainErrorQueue() / detected_io_error_ / failure_reason_ for targeted unit tests, then adds two regression tests: * DrainErrorQueuePrefersCertVerifyOverEconnreset reproduces issue envoyproxy#45011 by seeding the BoringSSL queue with SSL_R_CERTIFICATE_VERIFY_FAILED followed by a trailing LIB_SYS/ECONNRESET (the ordering io_handle_bio's SO_ERROR probe produces in practice). It asserts that detected_io_error_ is NOT set to ConnectionReset, so the TLS cert-verify diagnostic remains the surfaced root cause and continues to flow into transport_failure_reason. Verified to fail without the prior drainErrorQueue fix. * DrainErrorQueueReportsStandaloneEconnreset guards the converse case so the fix does not over-correct: a bare LIB_SYS/ECONNRESET still surfaces as IoErrorCode::ConnectionReset. Signed-off-by: Doron Hogery <doronhogery@meta.com> Signed-off-by: Doron Hogery <doron.hogery@gmail.com>
ba00825 to
6cdc189
Compare
Contributor
Author
|
@ggreenway - PTAL |
ggreenway
requested changes
May 14, 2026
|
|
||
| bug_fixes: | ||
| # *Changes expected to improve the state of the world and are unlikely to have negative effects* | ||
| - area: tls |
Member
There was a problem hiding this comment.
In this case we don't need a changelog; this is fixing something that was just merged last week and was never included in any release
| // bad-cert alert), and reporting it as the transport error would clobber the more diagnostic | ||
| // ``connection failure`` + ``transport failure reason: TLS_error:...CERTIFICATE_VERIFY_FAILED`` | ||
| // signal that operators rely on (see issue #45011). | ||
| if (saw_econnreset && !saw_cert_verify_failed && !saw_no_client_cert) { |
Member
There was a problem hiding this comment.
I wonder if we should only raise the econnreset if there's no other error; if there's any other error, it's more likely to be the cause of the reset, than the reset itself being causal. WDYT?
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Commit Message
tls: do not clobber cert-verify transport failure with trailing ECONNRESET
When
io_handle_bio'sSO_ERRORprobe pushes a peer-originatedECONNRESETonto the BoringSSL error queue (the path added in #44149),SslSocket::drainErrorQueue()unconditionally setdetected_io_error_toConnectionReset. For TLS handshakes that fail due to a certificate verify failure (e.g. CRL revocation, untrusted CA), the queue contains BOTH theSSL_R_CERTIFICATE_VERIFY_FAILEDentry (the actual root cause) and the trailingECONNRESETfrom the peer tearing the connection down after our bad-cert alert. Because theECONNRESETwas processed last, theIoErrorsurfaced asConnectionReset, the connection flippeddetected_close_type_toRemoteReset, and the HTTP conn pool reported reset reasonremote connection failureinstead of the originalconnection failure+ transport failure reasonTLS_error:...CERTIFICATE_VERIFY_FAILEDthat operators (notably the Istio team in #45011) rely on.Defer the
detected_io_error_assignment until after the queue is fully walked, and only commitConnectionResetwhen no TLS protocol-level failure (cert verify failure or missing client cert) is also queued. This preserves the pure-ECONNRESETpath that #44149 added while keeping the diagnostic TLS error signal as the user-facing failure reason in mixed scenarios.Additional Description
Adds a small
SslSocketPeerfriend insource/common/tls/ssl_socket.h(following the existingContextImplPeerpattern atsource/common/tls/context_impl.h:122) so the regression tests can exercisedrainErrorQueue()in isolation by seeding the BoringSSL error queue directly withERR_put_error()— the bug requires a specific FIFO ordering of two queued errors that is hard to deterministically orchestrate via real sockets.Risk Level
Low. The behavior change is strictly narrower than #44149's original landing: the pure-
ECONNRESETpath still surfacesConnectionResetexactly as before; only the previously-unintended mixed-error case (TLS protocol failure + trailing RST) is restored to its pre-#44149 behavior.Testing
test/common/tls/ssl_socket_test.cc:DrainErrorQueuePrefersCertVerifyOverEconnreset— seedsSSL_R_CERTIFICATE_VERIFY_FAILEDfollowed byLIB_SYS/ECONNRESET, assertsdetected_io_error_is NOTConnectionResetandfailure_reason_still containsCERTIFICATE_VERIFY_FAILED. Verified to fail without the fix.DrainErrorQueueReportsStandaloneEconnreset— guards the converse: a bareLIB_SYS/ECONNRESETstill surfaces asIoErrorCode::ConnectionReset, preventing an over-correction of the fix.TlsConnectionResetDetectionandTlsConnectionResetDetectionDisabledByRuntimetests still pass.Docs Changes
changelogs/current.yaml—bug_fixesentry underarea: tlsreferencing #45011.Release Notes
See changelog entry.
Platform Specific Features
None.
[Optional Runtime guard]
The new selection logic is implicitly gated by the existing
envoy.reloadable_features.ssl_socket_report_connection_resetruntime feature added in #44149: when that feature is disabled theECONNRESETpath is never even considered, so the fix has no effect.[Optional Fixes #Issue]
Fixes #45011.
[Optional Deprecated]
None.
[Optional API Considerations]
None.