Skip to content

[ISSUE #10398] Fix native memory leak on TLS certificate hot-reload#10399

Open
qianye1001 wants to merge 2 commits into
apache:developfrom
qianye1001:fix/issue-10398
Open

[ISSUE #10398] Fix native memory leak on TLS certificate hot-reload#10399
qianye1001 wants to merge 2 commits into
apache:developfrom
qianye1001:fix/issue-10398

Conversation

@qianye1001
Copy link
Copy Markdown
Contributor

Summary

Fix native memory leak caused by old SslContext not being released during TLS certificate hot-reload when using the OpenSSL (netty-tcnative) provider.

Fixes #10398

Root Cause

When TLS certificates are dynamically reloaded via FileWatchService, the loadSslContext() methods in NettyRemotingServer and ProxyAndTlsProtocolNegotiator directly overwrite the sslContext field without releasing the old instance. Since ReferenceCountedOpenSslContext allocates native off-heap memory for the certificate chain, private key, and SSL session cache, each reload leaks ~100KB-1MB of native memory per rotation cycle.

Changes

File 1: remoting/src/main/java/org/apache/rocketmq/remoting/netty/NettyRemotingServer.java

  • Added import io.netty.util.ReferenceCountUtil;
  • Modified loadSslContext(): build new context into local variable, save old reference, assign new context to field, then release old context via ReferenceCountUtil.release(oldSslContext) in try-catch

File 2: proxy/src/main/java/org/apache/rocketmq/proxy/grpc/ProxyAndTlsProtocolNegotiator.java

  • Added import io.grpc.netty.shaded.io.netty.util.ReferenceCountUtil;
  • Changed sslContext field to private static volatile SslContext sslContext; for thread-safe visibility
  • Modified loadSslContext(): build new context into local variable, save old reference, assign new context to field, then release old context via ReferenceCountUtil.release(oldSslContext) in try-catch

Fix Strategy

Uses "build new, then release old" ordering to ensure sslContext is never null or pointing to a released context during the swap:

  1. Build new SslContext into a local variable
  2. Save old sslContext reference
  3. Assign new context to the field
  4. Release old context via ReferenceCountUtil.release() (no-op for non-refcounted JDK SslContext)

Testing

  • Existing TLS integration tests cover handshake correctness
  • Manual verification: run broker with TLS enabled, trigger cert reload 100+ times, confirm stable RSS/native memory

Backward Compatibility

  • No public API changes
  • No configuration changes
  • No new dependencies (ReferenceCountUtil already in Netty transitive deps)
  • Existing connections unaffected (SslHandler holds its own reference)

Risk Assessment

LOW — Minimal, well-isolated lifecycle fix following established Netty ReferenceCounted resource management patterns.

Copy link
Copy Markdown
Contributor Author

@qianye1001 qianye1001 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review by rocketmq-reviewer-bot

Summary

Fix native memory leak in TLS certificate hot-reload by releasing the old SslContext via ReferenceCountUtil.release() after assigning the new one. Uses a safe "build new → swap → release old" pattern with try-catch protection.

Comparison with Previous Attempt (PR #10396, closed)

This PR improves upon the earlier closed PR #10396 in two important ways:

  1. Builds new context into a local variable first — if buildSslContext() throws, the old context remains valid (no partial state)
  2. Wraps release() in try-catch — prevents release failures from disrupting the reload flow

Findings

  • [Info] NettyRemotingServer.java:187-196 — Release ordering is correct. Building into newSslContext local var, saving oldSslContext, assigning new to field, then releasing old. This is the standard Netty ReferenceCounted swap pattern. ✅

  • [Info] NettyRemotingServer.javasslContext is declared as protected volatile in parent class NettyRemotingAbstract (line 121), ensuring visibility across threads. ✅

  • [Info] ProxyAndTlsProtocolNegotiator.java:81 — Good addition of volatile to the static sslContext field. The gRPC TLS handshake threads read this field, so visibility is important. ✅

  • [Info] ProxyAndTlsProtocolNegotiator.java:143-148 — Uses io.grpc.netty.shaded.io.netty.util.ReferenceCountUtil (shaded package), which is correct since the proxy module depends on the shaded Netty from grpc-netty-shaded. ✅

  • [Low] NettyRemotingServer.java:56 — Import ordering: ReferenceCountUtil is inserted between HashedWheelTimer and Timeout, breaking alphabetical order. Minor style issue — should be placed after io.netty.util.HashedWheelTimer and before io.netty.util.Timeout (or at end of the io.netty.util.* block depending on project convention). Non-blocking.

  • [Low] Both files — loadSslContext() is not synchronized. If called concurrently (e.g., from multiple FileWatchService callbacks), two threads could capture the same oldSslContext and both release it → IllegalReferenceCountException. In practice, FileWatchService dispatches from a single callback thread, so the risk is low. Adding synchronized would provide defense-in-depth but is not strictly required.

  • [Suggestion] Consider adding a unit test that verifies the fix:

    // Pseudocode
    server.loadSslContext();  // First call: sslContext is set, oldContext is null
    SslContext first = server.sslContext;
    server.loadSslContext();  // Second call: old context should be released
    // Assert: first.refCnt() == 0 (for OpenSSL) or no exception (for JDK)

    This would prevent regression and validate the release logic.

Cross-repo Note

No cross-repo impact. This change is internal to the broker/proxy TLS lifecycle management.

Verdict

Approve with minor suggestions — This is a well-structured fix that improves upon the earlier PR #10396. The "build into local var → swap → release old with try-catch" pattern is the correct approach for ReferenceCounted resource management in Netty. Thread safety is adequate with volatile on both declarations. The try-catch around release() is a good defensive measure. Main improvement: add automated test coverage.


Automated review by rocketmq-reviewer-bot

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 78.94737% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 48.96%. Comparing base (c9e51d6) to head (382e260).

Files with missing lines Patch % Lines
...etmq/proxy/grpc/ProxyAndTlsProtocolNegotiator.java 80.00% 2 Missing ⚠️
...e/rocketmq/remoting/netty/NettyRemotingServer.java 77.77% 2 Missing ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##             develop   #10399      +/-   ##
=============================================
- Coverage      49.09%   48.96%   -0.14%     
+ Complexity     13507    13469      -38     
=============================================
  Files           1376     1376              
  Lines         100537   100553      +16     
  Branches       12983    12985       +2     
=============================================
- Hits           49357    49232     -125     
- Misses         45184    45303     +119     
- Partials        5996     6018      +22     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] TLS certificate hot-reload leaks native memory due to unreleased old SslContext

3 participants