tls: do not clobber cert-verify transport failure with trailing ECONNRESET #45014

Open
nyckone wants to merge 2 commits into envoyproxy:main from nyckone:tls-fix-ssl-cert-verify-econnreset

@nyckone nyckone commented May 11, 2026

Commit Message

tls: do not clobber cert-verify transport failure with trailing ECONNRESET

When io_handle_bio's SO_ERROR probe pushes a peer-originated ECONNRESET onto the BoringSSL error queue (the path added in #44149), SslSocket::drainErrorQueue() unconditionally set detected_io_error_ to ConnectionReset. For TLS handshakes that fail due to a certificate verify failure (e.g. CRL revocation, untrusted CA), the queue contains BOTH the SSL_R_CERTIFICATE_VERIFY_FAILED entry (the actual root cause) and the trailing ECONNRESET from the peer tearing the connection down after our bad-cert alert. Because the ECONNRESET was processed last, the IoError surfaced as ConnectionReset, the connection flipped detected_close_type_ to RemoteReset, and the HTTP conn pool reported reset reason remote connection failure instead of the original connection failure + transport failure reason TLS_error:...CERTIFICATE_VERIFY_FAILED that operators (notably the Istio team in #45011) rely on.

Defer the detected_io_error_ assignment until after the queue is fully walked, and only commit ConnectionReset when no TLS protocol-level failure (cert verify failure or missing client cert) is also queued. This preserves the pure-ECONNRESET path that #44149 added while keeping the diagnostic TLS error signal as the user-facing failure reason in mixed scenarios.
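The deferred decision described above can be sketched in isolation. This is a minimal model, not the code in the PR: `QueuedError`, the one-value `IoErrorCode`, and `selectIoError` are hypothetical stand-ins for the BoringSSL queue entries and for the state that `drainErrorQueue()` updates. Only the final condition mirrors the diff under review.

```cpp
#include <optional>
#include <vector>

// Hypothetical stand-ins for the entries drained from the BoringSSL error
// queue; the real code classifies entries by their library and reason codes.
enum class QueuedError { CertVerifyFailed, NoClientCert, SysEconnreset, Other };

// Hypothetical stand-in for Envoy's IoErrorCode; only the value used here.
enum class IoErrorCode { ConnectionReset };

// Walks the whole queue before deciding, so a trailing ECONNRESET cannot
// overwrite a TLS protocol-level failure that was queued earlier.
std::optional<IoErrorCode> selectIoError(const std::vector<QueuedError>& queue) {
  bool saw_econnreset = false;
  bool saw_cert_verify_failed = false;
  bool saw_no_client_cert = false;

  for (const QueuedError entry : queue) {
    switch (entry) {
    case QueuedError::SysEconnreset:
      saw_econnreset = true;
      break;
    case QueuedError::CertVerifyFailed:
      saw_cert_verify_failed = true;
      break;
    case QueuedError::NoClientCert:
      saw_no_client_cert = true;
      break;
    case QueuedError::Other:
      break;
    }
  }

  // Commit ConnectionReset only when no TLS protocol-level failure is queued.
  if (saw_econnreset && !saw_cert_verify_failed && !saw_no_client_cert) {
    return IoErrorCode::ConnectionReset;
  }
  return std::nullopt;
}
```

In this model, a queue of `{CertVerifyFailed, SysEconnreset}` yields no I/O error, while a queue holding only `SysEconnreset` still yields `ConnectionReset`.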

Additional Description

Adds a small SslSocketPeer friend in source/common/tls/ssl_socket.h (following the existing ContextImplPeer pattern at source/common/tls/context_impl.h:122) so the regression tests can exercise drainErrorQueue() in isolation by seeding the BoringSSL error queue directly with ERR_put_error() — the bug requires a specific FIFO ordering of two queued errors that is hard to deterministically orchestrate via real sockets.
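For readers unfamiliar with the peer pattern, the sketch below shows the general shape under simplifying assumptions. `FakeSslSocket` and `FakeSslSocketPeer` are hypothetical names, and the placeholder body of `drainErrorQueue()` has nothing to do with the real implementation; the point is only how a friend class gives tests access to private members without widening the public interface.

```cpp
#include <string>

// Hypothetical, heavily simplified stand-in for a class with private state.
class FakeSslSocket {
private:
  // Grants the test-only peer access to the private members below.
  friend class FakeSslSocketPeer;

  // Placeholder body; the real method walks the BoringSSL error queue.
  void drainErrorQueue() { failure_reason_ = "drained"; }

  std::string failure_reason_;
};

// Test-only accessor: static forwarding functions, no state of its own.
class FakeSslSocketPeer {
public:
  static void drainErrorQueue(FakeSslSocket& socket) { socket.drainErrorQueue(); }
  static const std::string& failureReason(const FakeSslSocket& socket) {
    return socket.failure_reason_;
  }
};
```

A test would call `FakeSslSocketPeer::drainErrorQueue(socket)` and then inspect `FakeSslSocketPeer::failureReason(socket)`, leaving production callers unable to reach either member.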

Risk Level

Low. The behavior change is strictly narrower than #44149's original landing: the pure-ECONNRESET path still surfaces ConnectionReset exactly as before; only the previously-unintended mixed-error case (TLS protocol failure + trailing RST) is restored to its pre-#44149 behavior.

Testing

  • Two new regression tests in test/common/tls/ssl_socket_test.cc:
    • DrainErrorQueuePrefersCertVerifyOverEconnreset — seeds SSL_R_CERTIFICATE_VERIFY_FAILED followed by LIB_SYS/ECONNRESET, asserts detected_io_error_ is NOT ConnectionReset and failure_reason_ still contains CERTIFICATE_VERIFY_FAILED. Verified to fail without the fix.
    • DrainErrorQueueReportsStandaloneEconnreset — guards the converse: a bare LIB_SYS/ECONNRESET still surfaces as IoErrorCode::ConnectionReset, preventing an over-correction of the fix.
  • Existing TlsConnectionResetDetection and TlsConnectionResetDetectionDisabledByRuntime tests still pass.
  • Existing cert-verify tests pass.

Docs Changes

changelogs/current.yaml: bug_fixes entry under area: tls referencing #45011.

Release Notes

See changelog entry.

Platform Specific Features

None.

[Optional Runtime guard]

The new selection logic is implicitly gated by the existing envoy.reloadable_features.ssl_socket_report_connection_reset runtime feature added in #44149: when that feature is disabled the ECONNRESET path is never even considered, so the fix has no effect.

[Optional Fixes #Issue]

Fixes #45011.

[Optional Deprecated]

None.

[Optional API Considerations]

None.

nyckone added 2 commits May 11, 2026 23:08
…RESET

When io_handle_bio's SO_ERROR probe pushes a peer-originated ECONNRESET
onto the BoringSSL error queue (the path added in envoyproxy#44149), SslSocket's
drainErrorQueue() unconditionally set detected_io_error_ to ConnectionReset.
For TLS handshakes that fail due to a certificate verify failure (e.g. CRL
revocation) the queue contains BOTH the SSL_R_CERTIFICATE_VERIFY_FAILED
entry (the actual root cause) and the trailing ECONNRESET entry from the
peer tearing the connection down after our bad-cert alert. Because the
ECONNRESET was processed last, the IoError surfaced as ConnectionReset,
the connection flipped detected_close_type_ to RemoteReset, and the HTTP
conn pool reported reset reason 'remote connection failure' instead of
the original 'connection failure' + transport failure reason
'TLS_error:...CERTIFICATE_VERIFY_FAILED' that operators relied on.

Defer the detected_io_error_ assignment until after the queue is fully
walked, and only commit ConnectionReset when no TLS protocol-level
failure (cert verify failure or missing client cert) is also queued.
This preserves the pure-ECONNRESET path that envoyproxy#44149 added while keeping
the diagnostic TLS error signal as the user-facing failure reason in
mixed scenarios.

Fixes envoyproxy#45011.

Signed-off-by: nyckone <nyckone@users.noreply.github.com>
Signed-off-by: Doron Hogery <doron.hogery@gmail.com>
…NRESET

Adds a SslSocketPeer friend (following the existing ContextImplPeer pattern)
that exposes drainErrorQueue() / detected_io_error_ / failure_reason_ for
targeted unit tests, then adds two regression tests:

* DrainErrorQueuePrefersCertVerifyOverEconnreset reproduces issue envoyproxy#45011 by
  seeding the BoringSSL queue with SSL_R_CERTIFICATE_VERIFY_FAILED followed by
  a trailing LIB_SYS/ECONNRESET (the ordering io_handle_bio's SO_ERROR probe
  produces in practice). It asserts that detected_io_error_ is NOT set to
  ConnectionReset, so the TLS cert-verify diagnostic remains the surfaced root
  cause and continues to flow into transport_failure_reason. Verified to fail
  without the prior drainErrorQueue fix.

* DrainErrorQueueReportsStandaloneEconnreset guards the converse case so the
  fix does not over-correct: a bare LIB_SYS/ECONNRESET still surfaces as
  IoErrorCode::ConnectionReset.

Signed-off-by: Doron Hogery <doronhogery@meta.com>
Signed-off-by: Doron Hogery <doron.hogery@gmail.com>
@nyckone nyckone force-pushed the tls-fix-ssl-cert-verify-econnreset branch from ba00825 to 6cdc189 Compare May 11, 2026 21:04
@ggreenway ggreenway self-assigned this May 12, 2026

nyckone commented May 12, 2026

@ggreenway - PTAL

@ggreenway ggreenway left a comment

/wait

Comment thread changelogs/current.yaml

bug_fixes:
# *Changes expected to improve the state of the world and are unlikely to have negative effects*
- area: tls

In this case we don't need a changelog; this is fixing something that was just merged last week and was never included in any release

// bad-cert alert), and reporting it as the transport error would clobber the more diagnostic
// ``connection failure`` + ``transport failure reason: TLS_error:...CERTIFICATE_VERIFY_FAILED``
// signal that operators rely on (see issue #45011).
if (saw_econnreset && !saw_cert_verify_failed && !saw_no_client_cert) {

I wonder if we should only raise the econnreset if there's no other error; if there's any other error, it's more likely to be the cause of the reset, than the reset itself being causal. WDYT?
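One possible reading of this suggestion, sketched as a standalone model rather than as a proposed patch: report the reset only when ECONNRESET is the sole entry kind in the queue. `QueuedError` and `shouldReportReset` are hypothetical names introduced for illustration and are not part of the PR.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical classification of drained error-queue entries.
enum class QueuedError { CertVerifyFailed, NoClientCert, SysEconnreset, Other };

// Stricter policy: any non-ECONNRESET entry suppresses the reset report,
// on the assumption that it is the more likely root cause.
bool shouldReportReset(const std::vector<QueuedError>& queue) {
  const bool saw_econnreset =
      std::find(queue.begin(), queue.end(), QueuedError::SysEconnreset) != queue.end();
  const bool saw_other_error =
      std::any_of(queue.begin(), queue.end(),
                  [](QueuedError entry) { return entry != QueuedError::SysEconnreset; });
  return saw_econnreset && !saw_other_error;
}
```

Compared with the condition in the diff, this variant also suppresses the reset when an unrelated error (modeled here as `Other`) precedes it, which is the behavioral difference the question is asking about.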

Development

Successfully merging this pull request may close these issues.

SslSocket: ECONNRESET reporting clobbers transport_failure_reason for TLS verify failures (#44149)

2 participants