Skip to content

fix(reconnect): listener bug fixes + transport factory policy hook#45

Merged
GregHolmes merged 3 commits intomainfrom
feat/reconnect-opt-out-for-custom-transports
May 6, 2026
Merged

fix(reconnect): listener bug fixes + transport factory policy hook#45
GregHolmes merged 3 commits intomainfrom
feat/reconnect-opt-out-for-custom-transports

Conversation

@VWang1111
Copy link
Copy Markdown
Contributor

Summary

  • Add ClientOptions.Builder.reconnect(boolean) (default true). When false, the per-resource WebSocket clients connect once via a plain okhttp3.WebSocketListener and propagate failures directly, bypassing ReconnectingWebSocketListener (and its hardcoded connectionFuture.get(4000, MILLISECONDS) ceiling).
  • Auto-disable for custom transports: when DeepgramClientBuilder.transportFactory(...) (or its async counterpart) is set, setAdditional() also calls builder.reconnect(false). Custom transports like SageMakerTransportFactory already manage their own connection lifecycle; wrapping them in another retry layer creates double-retry storms under burst load.
  • Patch applied to all four per-resource WebSocket clients: listen/v1, listen/v2, agent/v1, speak/v1. Same structural change in each: connect() branches on clientOptions.reconnect(); disconnect(), sendMedia(), sendMessage(), assertSocketIsOpen() use directWebSocket when the listener is null.
  • The four files are added to the "temporarily frozen" section of .fernignore so the patch survives the next Fern regen. Unfreeze once the change has been pushed upstream into the Fern generator template.

Why

Validated against a 400-concurrent-stream burst test on a 10× ml.g6.2xlarge endpoint (Deepgram on SageMaker, replicating a customer-reported failure):

Metric Before After (this PR + transport-side timeouts) Δ
ThrottlingException line-count 68,909 240 −99.7 %
ModelStreamError line-count 30,442 1,128 −96.3 %
Streams hitting AWS SDK Attempt Count: 4 29,030 96 −99.7 %
4-second connection timeout after 4000 13 0 eliminated
Total error log lines 1,322,666 33,972 −97.4 %
Wall time 313.57 s 312.61 s unchanged
Transcript count 41,337 40,467 unchanged

Same work done, ~97 % less error noise. Pairs with deepgram/deepgram-java-sdk-transport-sagemaker#14 (Layer 1 transport-side timeouts), but each provides independent value.

Backwards compatibility

Existing callers that do NOT use transportFactory(...) see zero behavior change. The new reconnect() field defaults to true, the ReconnectingWebSocketListener-wrapped code path is unchanged for the OkHttp WebSocket transport (Deepgram cloud, self-hosted Deepgram via raw WS), and the lower-level ClientOptions.Builder.webSocketFactory(...) seam is also unaffected (only the explicit DeepgramClientBuilder.transportFactory(...) API triggers the auto-disable).

Customer shape transportFactory != null? reconnect() Wraps in ReconnectingWebSocketListener? Behavior change
Deepgram Cloud (wss://api.deepgram.com) no true yes (unchanged) none
Self-hosted Deepgram (raw WS) no true yes (unchanged) none
Custom WebSocketFactory via ClientOptions (lower-level seam) no true yes (unchanged) none
transportFactory(SageMakerTransportFactory) yes false (auto-set) no (direct path) yes — desired
transportFactory(<future custom transport>) yes false (auto-set) no (direct path) yes; opt back in with .reconnect(true)

Edge case worth calling out

A SageMaker caller who has been tuning retry behavior via wsClient.reconnectOptions(customOpts) will find those options silently ignored after this change (reconnect=false skips the listener entirely). Their tuning was almost certainly a workaround for the very storm we are now eliminating, so the right migration is to delete the reconnectOptions(...) call. Customers who genuinely want both a custom transport AND SDK-side reconnect can re-enable explicitly with .reconnect(true) after .transportFactory(...):

DeepgramClient.builder()
    .transportFactory(sagemakerFactory)
    .reconnect(true)              // explicit override; not recommended for SageMaker
    .build();

This is a runtime-quality change, not an API break — no compile errors for any existing customer. Recommend mentioning in CHANGELOG and migration notes for the next minor release.

Test plan

  • ./gradlew build — passes (spotless + compile + tests).
  • ./gradlew test — all existing tests pass.
  • End-to-end validation with patched JAR linked into a customer load-test harness against a real SageMaker endpoint (numbers above).
  • CI passes.
  • Push template change upstream into the Fern generator and unfreeze the four per-resource WebSocket client files in .fernignore.

🤖 Generated with Claude Code

@VWang1111 VWang1111 requested a review from lukeocodes as a code owner April 28, 2026 23:10
Copy link
Copy Markdown
Member

@lukeocodes lukeocodes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for chasing this down, the diagnosis is right. but this shouldn't land in core. the whole point of the transport seam in #29 was so SageMaker stays invisible to the SDK. transportFactory() was meant to be the entire surface area, and looking at the plugin side over on deepgram/deepgram-java-sdk-transport-sagemaker#14 I think we can keep it that way.

what's happening here is storm handling that belongs in SageMakerTransport is bleeding upward into the generated WS clients via a boolean flag and four near-identical 75-line listener implementations. the freeze list goes from 1 file to 5, and every future regen has to reconcile four duplicated anonymous listeners.

the 4-second connectionFuture.get(4000, MILLISECONDS) ceiling and the maxRetries = Integer.MAX_VALUE default in ReconnectingWebSocketListener are both real, but they're our bugs in that file. fix them there:

  • maxRetries(0) should mean "connect once, don't retry", not "refuse to connect"
  • promote the hardcoded 4000 to a configurable connectionTimeoutMs on ReconnectOptions

then the plugin sets its own policy via webSocketReconnectOptions(...) and core stays untouched. no boolean on ClientOptions, no edits to generated WS clients, no new .fernignore entries past freezing the listener itself.

the storm itself shouldn't even reach ReconnectingWebSocketListener in normal operation. AWS ThrottlingException / pool exhaustion / connect timeouts should be classified and absorbed inside SageMakerTransport.ensureConnected() before they ever fire transport.onError(...). that's plugin-side work, I've left the detail on #14.

asks:

  • revert the four resources/.../websocket/*WebSocketClient.java edits
  • revert the four new .fernignore entries
  • drop ClientOptions.reconnect(boolean) and the if (transportFactory != null) builder.reconnect(false) heuristic
  • fix ReconnectingWebSocketListener instead, freeze just that file
  • let the plugin declare its own ReconnectOptions. a default ReconnectOptions reconnectOptions() on DeepgramTransportFactory would do it cleanly, that file is permanently frozen anyway

with the plugin-side change in #14, customer outcome is the same, one freeze entry instead of four, and transportFactory() stays the only thing core knows about SageMaker.

VWang1111 added a commit that referenced this pull request Apr 29, 2026
…uto-wire transport factory policy

Reworks the Layer-4 patch (#45) per @lukeocodes' review. Replaces the previous
ClientOptions.reconnect(boolean) flag and the four duplicated WebSocketClient
listeners with the underlying bug fixes in ReconnectingWebSocketListener plus
a clean policy-declaration hook on DeepgramTransportFactory.

Reverts (no edits to generated WS clients, no .fernignore freeze inflation):
  - ClientOptions.reconnect(boolean) field/getter/builder method.
  - if (transportFactory != null) builder.reconnect(false) auto-disable
    heuristic in both DeepgramClientBuilder and AsyncDeepgramClientBuilder.
  - The four resources/.../websocket/*WebSocketClient.java listener
    duplications and their .fernignore freeze entries.

Bug fixes in ReconnectingWebSocketListener (now temporarily frozen):
  - maxRetries(0) used to refuse to connect because the gate compared
    retryCount.get() >= maxRetries on the initial attempt (retryCount starts
    at 0). The initial attempt is not a retry; the gate is now > maxRetries
    so the count caps retries only. maxRetries(0) means "1 attempt, no
    retries" as the API name implies.
  - The 4000 ms connectionFuture.get(...) timeout is now configurable via a
    new connectionTimeoutMs field on ReconnectOptions (default 4000,
    validated > 0). The hardcoded literal is gone.
  - Adds applyOptionsOverride(ReconnectOptions) so a transport-factory wrapper
    can rewire policy on the listener after construction. Used by
    TransportWebSocketFactory to honour the new factory hook below without
    requiring edits to the per-resource WebSocketClient classes.

Plugin policy declaration:
  - DeepgramTransportFactory (already permanently frozen) gains a default
    ReconnectOptions reconnectOptions() method returning null. Plugins
    managing their own connection lifecycle (e.g. SageMaker) override this
    to return ReconnectOptions.builder().maxRetries(0).build() so the
    SDK's wrapper-level reconnect doesn't compound their internal retries
    into a Throttling-on-Throttling storm under burst load.
  - TransportWebSocketFactory.create() applies factory.reconnectOptions()
    to the wrapping ReconnectingWebSocketListener via the new override hook.
    Auto-wiring with no edits to generated code.

Tests:
  - New ReconnectingWebSocketListenerTest covers the maxRetries(0)
    regression, connectionTimeoutMs default + customisation + validation,
    and applyOptionsOverride no-op + retry-gate behaviour.

Net diff vs the prior PR head: 4 files changed (+94/−10) instead of 8 files
(+364/−24). Freeze list grows by 1 (the listener), shrinks by 4 (the WS
clients).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
VWang1111 added a commit to deepgram/deepgram-java-sdk-transport-sagemaker that referenced this pull request Apr 29, 2026
…ent pending queue

Bundles the four storm-handling asks from @lukeocodes' review of #14 into the
existing PR. The plugin now owns end-to-end retry/backoff/classification for
transient AWS failures so they never reach transport.onError(...) and the
SDK's wrapper-level reconnect can be disabled via the new
DeepgramTransportFactory.reconnectOptions() hook (paired with the SDK fix in
deepgram/deepgram-java-sdk#45).

SageMakerConfig — new retry knobs (defaults tuned for high-burst workloads):
  - maxRetries = 5 (set 0 to disable internal retry)
  - initialBackoff = 100 ms
  - maxBackoff = 5 s
  - backoffMultiplier = 2.0 (validated >= 1.0)
  - retryBudget = 30 s wall-clock cap across all retries
  - build() rejects initialBackoff > maxBackoff

SageMakerTransport — internal storm absorption:
  - ensureConnected() wraps attemptConnect() in a budgeted retry loop with
    exponential backoff. Successful subscription resets the budget.
  - handleStreamError(): the response-handler onError gate from line 100
    is now a classify-and-decide step. RETRYABLE + budget left triggers an
    internal reset (cancel future, complete publisher, mark disconnected)
    so the next send re-enters the loop on a fresh stream. TERMINAL or
    budget-exhausted surfaces to errorListeners as before.
  - classify(Throwable) walks the cause chain. Retryable: TimeoutException,
    ConnectException, IOException, AwsServiceException with status 429 or
    5xx or error code containing "throttl", and SdkException whose message
    contains "acquire"/"pool"/"throttl"/"timeout". Terminal: 4xx other than
    429, anything unmatched (default-deny for retry).
  - StreamPublisher.pending is hoisted up to a SageMakerTransport instance
    field so events queued during an internal reset survive across the
    publisher swap. The new publisher drains the shared queue on subscribe.
  - awaitSubscription now returns the boolean. Timeout throws
    TimeoutException, which flows into the retry loop as RETRYABLE — the
    new subscriptionTimeout knob now actually fails fast as advertised.

SageMakerTransportFactory:
  - Overrides the new reconnectOptions() default to return
    ReconnectOptions.builder().maxRetries(0).build(). Combined with the
    SDK's auto-wiring in TransportWebSocketFactory, this disables the
    SDK's wrapper-level reconnect for any per-resource WebSocket client
    constructed against this factory — no user wiring required.

build.gradle:
  - Bumps deepgram-java-sdk dep from 0.2.1 to 0.3.0 to pick up the new
    DeepgramTransportFactory.reconnectOptions() default method,
    ReconnectOptions.connectionTimeoutMs, the maxRetries(0) bug fix, and
    the applyOptionsOverride hook.

Tests:
  - New SageMakerTransportRetryTest covers classify() across all branches
    (Timeout/Connect/IOException; AWS 401/403/429/5xx/throttling code;
    SdkException pool-keyword; cause-chain walking; default-terminal) and
    StreamPublisher behaviour (awaitSubscription false-on-timeout,
    true-on-subscribe; pending-queue persistence across publisher
    instances; immediate forward when subscriber present).
  - SageMakerTransportFactoryTest gains: retry-config defaults +
    customisation + validation, initialBackoff > maxBackoff rejection,
    and factoryDeclaresMaxRetriesZeroForReconnectOptions verifying the
    storm-suppression policy is published correctly.

README:
  - Bumps the SDK dep version reference and notes the v0.3.0+ requirement.
  - Adds the new retry knobs to the configuration table.
  - New "Retry & storm absorption" section explaining the internal
    classification/budget design and how factory.reconnectOptions()
    auto-wires the SDK's reconnect-disable.

End-to-end mock-AWS retry coverage isn't included here — the AWS reactive
streams handler indirection makes deterministic stubbing fragile. Those
paths are exercised by the existing burst test described in the PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@VWang1111 VWang1111 closed this Apr 29, 2026
@VWang1111 VWang1111 deleted the feat/reconnect-opt-out-for-custom-transports branch April 29, 2026 22:05
@VWang1111 VWang1111 changed the title feat: opt out of ReconnectingWebSocketListener wrapper, auto-disable for custom transports fix(reconnect): listener bug fixes + transport factory policy hook Apr 29, 2026
@VWang1111 VWang1111 restored the feat/reconnect-opt-out-for-custom-transports branch April 29, 2026 22:07
@VWang1111 VWang1111 reopened this Apr 29, 2026
@VWang1111 VWang1111 closed this Apr 29, 2026
@VWang1111 VWang1111 deleted the feat/reconnect-opt-out-for-custom-transports branch April 29, 2026 22:08
@VWang1111 VWang1111 restored the feat/reconnect-opt-out-for-custom-transports branch April 29, 2026 22:09
@VWang1111 VWang1111 reopened this Apr 29, 2026
@VWang1111
Copy link
Copy Markdown
Contributor Author

Updated PR based on comments

lukeocodes
lukeocodes previously approved these changes May 5, 2026
VWang1111 and others added 3 commits May 6, 2026 10:06
…for custom transports

Adds `ClientOptions.Builder.reconnect(boolean)` (default `true`). When
`false`, the per-resource WebSocket clients (`listen/v1`, `listen/v2`,
`agent/v1`, `speak/v1`) connect once via a plain `okhttp3.WebSocketListener`
and propagate failures directly, bypassing `ReconnectingWebSocketListener`
and its hardcoded `connectionFuture.get(4000, MILLISECONDS)` ceiling.

Auto-disable for custom transports: when
`DeepgramClientBuilder.transportFactory(...)` (or its async counterpart)
is set, `setAdditional()` now also calls `builder.reconnect(false)`.
Custom transports like `SageMakerTransportFactory` already manage their
own connection lifecycle and retry semantics; wrapping them in another
retry layer creates double-retry storms under burst load.

The four per-resource WebSocket client files are added to the
"temporarily frozen" section of `.fernignore` so the patch survives the
next Fern regen. Unfreeze once the change has been pushed upstream into
the Fern generator template.

Backwards compatibility:

- Existing callers that do NOT use `transportFactory(...)` see zero
  behavior change. `reconnect()` defaults to `true`, the
  `ReconnectingWebSocketListener` wrapper is unchanged for the OkHttp
  WebSocket path (Deepgram cloud, self-hosted Deepgram via raw WS),
  and the lower-level `ClientOptions.Builder.webSocketFactory(...)`
  seam is also unaffected.
- Existing SageMaker callers (the only known users of
  `transportFactory(...)`) get the auto-disable. Empirically this
  eliminates the retry-storm pattern observed at 400 concurrent
  streams on a 10x ml.g6.2xlarge endpoint:

      ThrottlingException line-count    68,909  ->     240  (-99.7%)
      ModelStreamError line-count       30,442  ->   1,128  (-96.3%)
      Streams hitting Attempt Count: 4  29,030  ->      96  (-99.7%)
      4-second `connection timeout`         13  ->       0
      Total error log lines           1,322,666 -> 33,972   (-97.4%)

  Same wall time (313 vs 312 s), same transcript count, dramatically
  cleaner error log.

Edge case worth calling out:

- A SageMaker caller who has been tuning retry behavior via
  `wsClient.reconnectOptions(customOpts)` will find those options
  silently ignored after this change (`reconnect=false` skips the
  listener entirely). Their tuning was almost certainly a workaround
  for the very storm we are now eliminating, so the right migration
  is to delete the `reconnectOptions(...)` call. Customers who
  genuinely want both a custom transport and SDK-side reconnect can
  re-enable explicitly with `.reconnect(true)` after
  `.transportFactory(...)`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…uto-wire transport factory policy

Reworks the Layer-4 patch (#45) per @lukeocodes' review. Replaces the previous
ClientOptions.reconnect(boolean) flag and the four duplicated WebSocketClient
listeners with the underlying bug fixes in ReconnectingWebSocketListener plus
a clean policy-declaration hook on DeepgramTransportFactory.

Reverts (no edits to generated WS clients, no .fernignore freeze inflation):
  - ClientOptions.reconnect(boolean) field/getter/builder method.
  - if (transportFactory != null) builder.reconnect(false) auto-disable
    heuristic in both DeepgramClientBuilder and AsyncDeepgramClientBuilder.
  - The four resources/.../websocket/*WebSocketClient.java listener
    duplications and their .fernignore freeze entries.

Bug fixes in ReconnectingWebSocketListener (now temporarily frozen):
  - maxRetries(0) used to refuse to connect because the gate compared
    retryCount.get() >= maxRetries on the initial attempt (retryCount starts
    at 0). The initial attempt is not a retry; the gate is now > maxRetries
    so the count caps retries only. maxRetries(0) means "1 attempt, no
    retries" as the API name implies.
  - The 4000 ms connectionFuture.get(...) timeout is now configurable via a
    new connectionTimeoutMs field on ReconnectOptions (default 4000,
    validated > 0). The hardcoded literal is gone.
  - Adds applyOptionsOverride(ReconnectOptions) so a transport-factory wrapper
    can rewire policy on the listener after construction. Used by
    TransportWebSocketFactory to honour the new factory hook below without
    requiring edits to the per-resource WebSocketClient classes.

Plugin policy declaration:
  - DeepgramTransportFactory (already permanently frozen) gains a default
    ReconnectOptions reconnectOptions() method returning null. Plugins
    managing their own connection lifecycle (e.g. SageMaker) override this
    to return ReconnectOptions.builder().maxRetries(0).build() so the
    SDK's wrapper-level reconnect doesn't compound their internal retries
    into a Throttling-on-Throttling storm under burst load.
  - TransportWebSocketFactory.create() applies factory.reconnectOptions()
    to the wrapping ReconnectingWebSocketListener via the new override hook.
    Auto-wiring with no edits to generated code.

Tests:
  - New ReconnectingWebSocketListenerTest covers the maxRetries(0)
    regression, connectionTimeoutMs default + customisation + validation,
    and applyOptionsOverride no-op + retry-gate behaviour.

Net diff vs the prior PR head: 4 files changed (+94/−10) instead of 8 files
(+364/−24). Freeze list grows by 1 (the listener), shrinks by 4 (the WS
clients).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  The .fernignore freeze added in this PR protects the listener bug fixes
  from regen, but the regen workflow in AGENTS.md drives off the "Current
  temporarily frozen files" list. Add the listener there so the next regen
  applies the documented .bak swap/restore on it
@GregHolmes GregHolmes force-pushed the feat/reconnect-opt-out-for-custom-transports branch from cbeddf6 to e191af1 Compare May 6, 2026 09:06
@GregHolmes GregHolmes merged commit eac8ad2 into main May 6, 2026
8 checks passed
GregHolmes pushed a commit that referenced this pull request May 6, 2026
🤖 I have created a release *beep* *boop*
---


##
[0.4.0](v0.3.0...v0.4.0)
(2026-05-06)


### ⚠ BREAKING CHANGES

* sdk regeneration 2026-05-05
([#49](#49))
* sdk regeneration 2026-04-29
([#47](#47))

### Features

* sdk regeneration 2026-04-29
([#47](#47))
([0519ad3](0519ad3))
* sdk regeneration 2026-05-05
([#49](#49))
([f44678a](f44678a))


### Bug Fixes

* **reconnect:** listener bug fixes + transport factory policy hook
([#45](#45))
([eac8ad2](eac8ad2))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants