Skip to content

fix(gRPC): connection pool leak when connection is closed and there are no more subsequent calls#1945

Merged
DMwangnima merged 1 commit intocloudwego:mainfrom
DMwangnima:optimize/grpc_framer
Apr 23, 2026
Merged

fix(gRPC): connection pool leak when connection is closed and there are no more subsequent calls#1945
DMwangnima merged 1 commit intocloudwego:mainfrom
DMwangnima:optimize/grpc_framer

Conversation

@DMwangnima
Copy link
Copy Markdown
Contributor

@DMwangnima DMwangnima commented Apr 13, 2026

What type of PR is this?

fix

Check the PR title.

  • This PR title match the format: <type>(optional scope): <description>
  • The description of this PR title is user-oriented and clear enough for others to understand.
  • Attach the PR updating the user documentation if the current PR requires user awareness at the usage level. User docs repo

(Optional) Translate the PR title into Chinese.

(Optional) More detailed description for this PR(en: English/zh: Chinese).

en:

  1. Restructure gRPC connection pool to use per-slotatomic.Pointer+sync.Mutex, replacing the old unsynchronized[]ClientTransportslice, fixing connection leak where closed/broken transports stayed in the pool indefinitely
  2. MoveonClose/onGoAway callback invocation after releasinghttp2Client.muto prevent deadlock when callbacks callIsActive()
  3. AddisNewguard +isClosed()check in singleflight to prevent re-inserting closed transports after Clean()/Close()
  4. Fix resource leaks: close dialed conn on TLS handshake failure; callt.Close(err)on Flush failure during newHTTP2Clientinit
  5. FixcloseStreamTaskGC leak: remove task from global ticker when transport is draining with zero active streams
  6. AddNewClientTransportWithConfigwith new callback signatures (ctx, transport, err/reason); deprecate NewClientTransport

zh(optional):

(Optional) Which issue(s) this PR fixes:

(optional) The PR that updates user documentation:

@DMwangnima DMwangnima requested review from a team as code owners April 13, 2026 03:30
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 13, 2026

Codecov Report

❌ Patch coverage is 86.20690% with 28 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.82%. Comparing base (bfa27f3) to head (3158415).
⚠️ Report is 46 commits behind head on main.

Files with missing lines Patch % Lines
pkg/remote/trans/nphttp2/conn_pool.go 89.41% 6 Missing and 3 partials ⚠️
pkg/remote/trans/nphttp2/grpc/http2_client.go 67.85% 6 Missing and 3 partials ⚠️
pkg/remote/trans/nphttp2/grpc/transport.go 50.00% 8 Missing ⚠️
pkg/remote/trans/nphttp2/conn_pool_slot.go 97.29% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1945      +/-   ##
==========================================
+ Coverage   61.37%   62.82%   +1.44%     
==========================================
  Files         388      394       +6     
  Lines       35063    30219    -4844     
==========================================
- Hits        21521    18985    -2536     
+ Misses      12247     9939    -2308     
  Partials     1295     1295              
Flag Coverage Δ
integration 51.64% <59.60%> (+1.14%) ⬆️
unit 53.26% <86.20%> (+1.62%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@DMwangnima DMwangnima force-pushed the optimize/grpc_framer branch 5 times, most recently from 6ead596 to c69b3d6 Compare April 16, 2026 11:43
@DMwangnima DMwangnima force-pushed the optimize/grpc_framer branch 7 times, most recently from 217f0c3 to 4961d52 Compare April 21, 2026 13:25
Previously a closed/broken transport stayed in the nphttp2 client pool
indefinitely, because only a later put() could overwrite the slot.

Restructure the pool into per-slot transportSlot with atomic.Pointer for
lock-free reads and sync.Mutex for serialized writes. Each slot registers
onClose/onGoAway callbacks that evict the transport via CAS-based
removeTransport as soon as it closes or receives GoAway.

To avoid deadlock between transportSlot.mu and http2Client.mu,
http2Client.Close and handleGoAway now invoke onClose/onGoAway after
releasing t.mu and after the state has been set to closing/draining.

Key changes:
- Per-slot atomic.Pointer + Mutex replaces the old []ClientTransport
  slice with unsynchronized put/get.
- isNew + store-first-then-recheck eliminates the TOCTOU between
  isClosed() and LoadOrStore when racing with Close().
- Get() fast-exits via select{case <-ctx.Done()} when NewStream fails
  due to user context timeout, avoiding unnecessary singleflight entry.
- newTransport now closes the raw conn on TLS handshake failure.
- newHTTP2Client now calls t.Close(err) on Flush failure, preventing
  half-initialized transport leaks.
- closeStreamTask.Tick removes itself when draining + 0 active streams,
  allowing the http2Client to be GCed.
- Close() is idempotent via atomic CAS.
- Callback signatures now take (ctx, trans, err); a new ClientConfig +
  NewClientTransportWithConfig expose the new shape. The deprecated
  NewClientTransport is preserved as a thin adapter with documented
  timing change.
Comment thread pkg/remote/trans/nphttp2/grpc/http2_client.go
Comment thread pkg/remote/trans/nphttp2/conn_pool_slot.go
@DMwangnima DMwangnima force-pushed the optimize/grpc_framer branch from 4961d52 to 3158415 Compare April 22, 2026 06:47
@DMwangnima DMwangnima merged commit 17be90b into cloudwego:main Apr 23, 2026
24 of 25 checks passed
@DMwangnima DMwangnima deleted the optimize/grpc_framer branch April 23, 2026 07:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants