[Bug] SslServerTlsHandler.exceptionCaught() swallows exceptions without closing channel, causing permanent half-open TLS connections #16201

@uuuyuqi

Pre-check

  • I am sure that all the content I provide is in English.

Search before asking

  • I had searched in the issues and found no similar issues.

Apache Dubbo Component

Java SDK (apache/dubbo)

Dubbo Version

Dubbo Java 3.3 (also affects 3.2.x). Netty 4.1.x.

Steps to reproduce this issue

Scenario: Provider has a broken Netty dependency (e.g., incompatible netty-buffer version causing NoClassDefFoundError: Could not initialize class io.netty.buffer.PooledUnsafeDirectByteBuf).

  1. Consumer connects to Provider — TCP three-way handshake succeeds (handled by OS kernel, unaffected by the Netty bug).
  2. NettyClient.doConnect() only waits for TCP handshake completion, so it considers the connection successful. DubboInvoker is created and added to validInvokers.
  3. TLS handshake begins asynchronously — Consumer sends ClientHello.
  4. Provider's Netty read loop tries to allocate a ByteBuf to read the incoming data → NoClassDefFoundError is thrown.
  5. Netty's NioByteUnsafe.handleReadException() fires pipeline.fireExceptionCaught(cause) but does not close the channel (Netty only auto-closes for IOException or OutOfMemoryError).
  6. The exception reaches SslServerTlsHandler.exceptionCaught(), which only logs the error — it neither closes the channel nor propagates the exception.
  7. The channel remains TCP-active but is completely non-functional at the application layer.
  8. Consumer's DubboInvoker.isAvailable() returns true (it only checks channel.isActive()), so the invoker is never removed from validInvokers.
  9. All RPC requests routed to this Provider time out after 10 seconds.

What you expected to happen

When SslServerTlsHandler.exceptionCaught() is invoked, the channel should be closed (via ctx.close()), just like the userEventTriggered() method in the same class already does on TLS handshake failure. This would allow:

  • The Consumer to detect channelInactive → isConnected()=false → isAvailable()=false
  • Dubbo's addInvalidateInvoker mechanism to remove the broken invoker from validInvokers
  • The self-healing loop to work as designed
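
To make the expected consumer-side effect concrete, here is a minimal sketch in plain Java. It is a toy model, not Dubbo's code: `InvokerHealthModel`, `Invoker`, `onChannelInactive()` and `selectValid()` are hypothetical names, and the only behavior taken from this report is that availability follows channel liveness.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of consumer-side invoker filtering; NOT Dubbo's real classes. */
final class InvokerHealthModel {

    /** Hypothetical stand-in for an invoker bound to a single channel. */
    static final class Invoker {
        private final String provider;
        private boolean channelActive = true;

        Invoker(String provider) {
            this.provider = provider;
        }

        /** Called when the remote side closes the connection. */
        void onChannelInactive() {
            channelActive = false;
        }

        /** As described in this report, availability follows channel liveness only. */
        boolean isAvailable() {
            return channelActive;
        }

        String provider() {
            return provider;
        }
    }

    /** Returns the invokers that still report themselves as available. */
    static List<Invoker> selectValid(List<Invoker> invokers) {
        List<Invoker> valid = new ArrayList<>();
        for (Invoker invoker : invokers) {
            if (invoker.isAvailable()) {
                valid.add(invoker);
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        Invoker healthy = new Invoker("provider-a");
        Invoker broken = new Invoker("provider-b");
        List<Invoker> all = Arrays.asList(healthy, broken);

        // Without a server-side close, the broken provider still looks valid.
        System.out.println("valid before close: " + selectValid(all).size());

        // Once the provider closes the channel, the consumer can drop it.
        broken.onChannelInactive();
        System.out.println("valid after close: " + selectValid(all).size());
    }
}
```

In this model the broken provider can only leave the valid set after the server closes the channel, which is the step the current handler never performs.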

Current behavior of SslServerTlsHandler.exceptionCaught() (lines 60-68):

@Override
public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) throws Exception {
    logger.error(INTERNAL_ERROR, "unknown error in remoting module", "",
            "TLS negotiation failed when trying to accept new connection.", cause);
    // BUG: no ctx.close() and no ctx.fireExceptionCaught(cause)
    // The exception is silently swallowed, channel stays open but broken
}

Compare with userEventTriggered() in the same class (lines 81-89), which correctly closes the channel:

} else {
    logger.error(INTERNAL_ERROR, "", "",
            "TLS negotiation failed when trying to accept new connection.",
            handshakeEvent.cause());
    ctx.close();  // ← correctly closes the channel
}

Similarly, SslClientTlsHandler.userEventTriggered() on the Consumer side fires ctx.fireExceptionCaught() on TLS failure but does not close the channel, which can also lead to half-open connections.

Anything else

Root cause analysis:

The exception propagation chain breaks at SslServerTlsHandler.exceptionCaught():

Netty read loop: allocate ByteBuf → NoClassDefFoundError
    ↓
NioByteUnsafe.handleReadException() → pipeline.fireExceptionCaught(cause)
    (Netty does NOT auto-close: NoClassDefFoundError is not IOException/OutOfMemoryError)
    ↓
SslServerTlsHandler.exceptionCaught() → logs error, BUT:
    ✗ Does NOT call ctx.close()
    ✗ Does NOT call ctx.fireExceptionCaught(cause)
    → Exception is silently swallowed
    → Channel remains TCP-active but application-dead
    ↓
NettyServerHandler.exceptionCaught() → NEVER reached (exception stopped above)
    ↓
Consumer side: channel still active → isAvailable()=true → invoker never removed
    → Continuous timeout on every RPC call routed to this Provider

This is not limited to NoClassDefFoundError — any non-IOException/non-OutOfMemoryError exception during the Netty read loop would trigger the same behavior, leaving the channel in a zombie state.
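
The difference between the two handler behaviors can be shown with a small stand-alone model. This is a sketch that assumes nothing beyond the chain described above. It does not use Netty's API: `Channel`, `Context`, `Handler`, `SwallowingTlsHandler` and `ClosingTlsHandler` are hypothetical stand-ins that keep only the "log" versus "log and close" distinction.

```java
/** Toy model of the exception path described above; NOT Netty's real API. */
final class ZombieChannelModel {

    /** Stand-in for a channel: only tracks whether it is still open. */
    static final class Channel {
        private boolean active = true;

        boolean isActive() {
            return active;
        }

        void close() {
            active = false;
        }
    }

    /** Stand-in for a handler context: lets a handler close its channel. */
    static final class Context {
        private final Channel channel;

        Context(Channel channel) {
            this.channel = channel;
        }

        void close() {
            channel.close();
        }
    }

    interface Handler {
        void exceptionCaught(Context ctx, Throwable cause);
    }

    /** Mirrors the reported behavior: log the error and do nothing else. */
    static final class SwallowingTlsHandler implements Handler {
        @Override
        public void exceptionCaught(Context ctx, Throwable cause) {
            System.err.println("TLS negotiation failed: " + cause);
        }
    }

    /** Mirrors the requested behavior: log the error, then close the channel. */
    static final class ClosingTlsHandler implements Handler {
        @Override
        public void exceptionCaught(Context ctx, Throwable cause) {
            System.err.println("TLS negotiation failed: " + cause);
            ctx.close();
        }
    }

    /** Simulates a read failure that is neither IOException nor OutOfMemoryError. */
    static boolean channelActiveAfterReadError(Handler tlsHandler) {
        Channel channel = new Channel();
        Throwable cause = new NoClassDefFoundError("io.netty.buffer.PooledUnsafeDirectByteBuf");
        tlsHandler.exceptionCaught(new Context(channel), cause);
        return channel.isActive();
    }

    public static void main(String[] args) {
        System.out.println("swallowing handler, channel still active: "
                + channelActiveAfterReadError(new SwallowingTlsHandler()));
        System.out.println("closing handler, channel still active: "
                + channelActiveAfterReadError(new ClosingTlsHandler()));
    }
}
```

Whether the real handler should also forward the exception with ctx.fireExceptionCaught(cause) after closing is a separate design choice that this model leaves out.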

Are you willing to submit a pull request to fix on your own?

  • Yes I am willing to submit a pull request on my own!
