[Bug] TLS certificate hot-reload leaks native memory due to unreleased old SslContext #10397

@qianye1001

Before Creating the Bug Report

  • I found a bug, not just asking a question, which should be created in GitHub Discussions.

  • I have searched the GitHub Issues and GitHub Discussions of this repository and believe that this is not a duplicate.

  • I have confirmed that this bug belongs to the current repository, not other repositories of RocketMQ.

Runtime platform environment

Linux (observed on CentOS 7 / Ubuntu 22.04)

RocketMQ version

develop (also affects 5.x releases with TLS hot-reload enabled)

JDK Version

JDK 8 / JDK 11, using netty-tcnative (OpenSSL provider)

Describe the Bug

When TLS certificates are dynamically reloaded via TlsCertificateManager (triggered by the file watcher), a new SslContext is created but the old one is never explicitly released. netty-tcnative's OpenSslContext is reference-counted and allocates native (off-heap) memory for the certificate chain, private key, and SSL session cache, so dropping the Java reference to the old context does not free that memory. Reclamation is left to GC finalization, which may run very late or not at all when heap pressure is low.

This causes native memory (RSS) to grow monotonically with each certificate rotation cycle. In long-running Proxy/Broker deployments with frequent cert rotations (e.g., short-lived certificates rotated every few hours), this eventually leads to OOM kills.

Steps to Reproduce

  1. Enable TLS with OpenSSL provider (tls.provider=OPENSSL) on Broker or Proxy
  2. Configure certificate hot-reload (tlsCertWatchIntervalMs)
  3. Repeatedly replace the certificate files to trigger reload cycles
  4. Monitor native memory (RSS or jcmd <pid> VM.native_memory): it grows on each reload and is never reclaimed
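
For step 4, RSS can be sampled from /proc on Linux before and after each reload. The snippet below is only a rough sketch: RssProbe and its method names are hypothetical helpers (not part of RocketMQ), and it assumes the standard "VmRSS: <n> kB" line in /proc/<pid>/status. As far as I know, NMT mostly accounts for memory the JVM itself allocates, so RSS is probably the more telling signal for allocations made inside OpenSSL.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper for sampling resident set size on Linux.
final class RssProbe {

    // Extracts the VmRSS value in kB from the lines of /proc/<pid>/status.
    // Returns -1 if no VmRSS line is present.
    static long parseVmRssKb(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("VmRSS:")) {
                String[] parts = line.substring("VmRSS:".length()).trim().split("\\s+");
                return Long.parseLong(parts[0]);
            }
        }
        return -1L;
    }

    // Reads RSS for the given pid; pass "self" to sample the current process.
    static long readRssKb(String pid) throws IOException {
        return parseVmRssKb(Files.readAllLines(Paths.get("/proc", pid, "status")));
    }
}
```

Sampling RssProbe.readRssKb(brokerPid) after each certificate replacement, and comparing against the value taken before the first reload, should make the per-rotation growth visible.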

What Did You Expect to See?

Native memory should remain stable after certificate rotation. The old SslContext should be released promptly when replaced.

What Did You See Instead?

Native memory grows ~200KB–1MB per rotation cycle (depending on cert chain length and session cache size) and is never reclaimed until process restart.

Additional Context

The fix should call ReferenceCountUtil.release(oldSslContext) after the new context is installed. Care is needed to defer the release until in-flight channels using the old context have closed, or to use ReferenceCountUtil.safeRelease() together with proper draining logic.
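
To illustrate the swap-then-release idea without pulling in Netty, here is a minimal sketch. Everything in it is a stand-in: Counted mimics the retain()/release() part of io.netty.util.ReferenceCounted, FakeSslContext plays the role of the native-backed context, and ReloadableContextHolder is a hypothetical name, not an existing RocketMQ class. It assumes each in-flight connection holds its own reference to the context it was created from, so the old context is freed only when the last such connection releases it.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Stand-in for the retain/release contract of io.netty.util.ReferenceCounted.
interface Counted {
    Counted retain();
    boolean release(); // true when the count reached zero and the resource was freed
}

// Hypothetical native-backed context; starts with one reference owned by its creator.
final class FakeSslContext implements Counted {
    private final AtomicInteger refCnt = new AtomicInteger(1);
    final String label;
    volatile boolean freed; // stands in for "native memory has been released"

    FakeSslContext(String label) { this.label = label; }

    @Override public Counted retain() {
        for (;;) {
            int c = refCnt.get();
            if (c <= 0) throw new IllegalStateException("already freed: " + label);
            if (refCnt.compareAndSet(c, c + 1)) return this;
        }
    }

    @Override public boolean release() {
        int c = refCnt.decrementAndGet();
        if (c < 0) throw new IllegalStateException("over-released: " + label);
        if (c == 0) { freed = true; return true; }
        return false;
    }
}

// Hypothetical holder: owns one reference to whichever context is current.
final class ReloadableContextHolder<T extends Counted> {
    private final AtomicReference<T> current;

    ReloadableContextHolder(T initial) { this.current = new AtomicReference<>(initial); }

    T get() { return current.get(); }

    // Called per new connection: takes an extra reference on the current context.
    T acquire() {
        for (;;) {
            T ctx = current.get();
            try {
                ctx.retain();
                return ctx;
            } catch (IllegalStateException replacedConcurrently) {
                // The context was swapped out and freed between get() and retain(); retry.
            }
        }
    }

    // Installs the new context first, then drops the holder's reference to the old one.
    // Returns true if the old context was freed immediately (no connection still held it).
    boolean reload(T next) {
        T old = current.getAndSet(next);
        return old != null && old != next && old.release();
    }
}
```

In this sketch a connection calls acquire() when it is set up and release() on the returned context when it closes, which gives the "defer until in-flight channels have closed" behavior through the reference count itself rather than through a separate drain step. Whether the real reload path can rely on the same mechanism should be checked against the Netty version in use.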

Related: #10302 (SNI multi-domain support) introduces more SslContext instances per domain, making this leak more severe if not addressed.
