Retry transient network errors in HttpRetryPolicy (fix #260) by msrathore-db · Pull Request #398 · databricks/databricks-sql-nodejs

msrathore-db · 2026-05-29T06:35:26Z

Summary

Fixes #260 — the FetchError: socket hang up / Premature close / ECONNRESET failures users have been reporting during long-running workloads, including the cloud-storage variant (*.blob.core.windows.net result-link downloads) reported by @haggholm and @eranga-acerta on the issue.

Root cause

HttpRetryPolicy.invokeWithRetry only consulted shouldRetry after the fetch resolved, so when fetch() itself rejected with a network error the rejection propagated on attempt #1 with no retry. Additionally, response.buffer() (Thrift) and response.arrayBuffer() (CloudFetch) sat outside the retried block, so node-fetch's "Premature close" body-stream errors were not retried either.

Sibling drivers don't have this gap: Python relies on urllib3's default OSError retry, JDBC uses Apache HttpClient's stale-connection detection + retry handler.

Changes

HttpRetryPolicy.invokeWithRetry — catches operation rejections, classifies them via isRetryableNetworkError (FetchError system / body-timeout, OS errnos ECONNRESET/ECONNREFUSED/ETIMEDOUT/EHOSTUNREACH/ENETUNREACH/EPIPE/ENOTFOUND/EAI_AGAIN, and message patterns socket hang up/premature close/aborted), and retries with the same attempt-budget / backoff used for HTTP retries.
ThriftHttpConnection.write — reads response.buffer() inside the retried block so body-stream failures are retried too.
CloudFetchResultHandler.fetch — reads response.arrayBuffer() inside the retried block, with a guarded fallback so existing test stubs that pre-bake Response objects keep working.

Idempotency

Unchanged. ExecuteStatement and other write-y Thrift methods continue to use NullRetryPolicy. Network-error retry only applies where HttpRetryPolicy was already in use (idempotent Thrift methods + CloudFetch GETs).

Test plan

Unit: `HttpRetryPolicy.test.ts` — 4 new cases (FetchError ECONNRESET retried, "Premature close" retried, non-transient errors pass through, max-attempts on network errors). 14/14 passing.
Touched-area suite (`HttpRetryPolicy + NullRetryPolicy + HttpConnection + CloudFetchResultHandler`) — 30/30 passing.
Synthetic integration: local TCP server that RSTs first 3 connections — CloudFetch download recovers on attempt Some readme cleanup. #4.
Live warehouse end-to-end (pecotesting Azure): TLS-layer RST injection on *.blob.core.windows.net mid-response. Base branch crashes with the exact FetchError from issue [DATABRICKS ERROR]: "FetchError: socket hang up" #260; with this fix the same fault injection passes transparently with correct row counts.

Risk

Retry attempts/timeout budget is shared with HTTP-status retries, so worst-case wall-clock failure latency is unchanged.
Behavior on ExecuteStatement (NullRetryPolicy) is unchanged — no risk of double-execution.
Existing tests pass without modification (the CloudFetch handler has a fallback for stub-based tests).

This pull request and its description were written by Isaac.

`HttpRetryPolicy.invokeWithRetry` only consulted `shouldRetry` after the fetch resolved, so when `fetch()` itself rejected with a network error (`socket hang up`, `ECONNRESET`, etc.) the rejection propagated on attempt #1 with no retry — the exact failure pattern reported in issue #260, including the cloud-storage variant haggholm and eranga-acerta reported. A related issue: `response.buffer()` (ThriftHttpConnection) and `response.arrayBuffer()` (CloudFetchResultHandler) sat outside the retried block, so node-fetch's "Premature close" body-stream errors were also not retried. Changes: * `HttpRetryPolicy.invokeWithRetry` catches operation rejections, classifies them via `isRetryableNetworkError` (FetchError `system` / `body-timeout`, OS errnos `ECONNRESET`/`ECONNREFUSED`/`ETIMEDOUT`/ `EHOSTUNREACH`/`ENETUNREACH`/`EPIPE`/`ENOTFOUND`/`EAI_AGAIN`, and message patterns `socket hang up`/`premature close`/`aborted`), and retries with the same attempt-budget/backoff used for HTTP retries. Aligns with what Python (urllib3 default OSError retry) and JDBC (Apache HttpClient retry handler) already do. * `ThriftHttpConnection.write` reads `response.buffer()` inside the retried block, so body-stream failures are retried too. * `CloudFetchResultHandler.fetch` reads `response.arrayBuffer()` inside the retried block, with a guarded fallback so existing test stubs that pre-bake `Response` objects keep working. Idempotency is preserved: `ExecuteStatement` and other write-y Thrift methods continue to use `NullRetryPolicy`. Network-error retry only applies where `HttpRetryPolicy` was already in use (idempotent Thrift methods + CloudFetch GETs). Verified end-to-end against the pecotesting Azure warehouse with TLS- layer RST injection on `*.blob.core.windows.net` mid-response: base crashes with `FetchError: socket hang up` (matches issue #260 exactly); with this fix the same fault injection passes transparently with correct row counts. Co-authored-by: Isaac

github-actions · 2026-05-29T06:36:13Z

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

msrathore-db temporarily deployed to azure-prod May 29, 2026 06:35 — with GitHub Actions Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Retry transient network errors in HttpRetryPolicy (fix #260)#398

Retry transient network errors in HttpRetryPolicy (fix #260)#398
msrathore-db wants to merge 1 commit into
mainfrom
fix/issue-260-retry-network-errors

msrathore-db commented May 29, 2026

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

msrathore-db commented May 29, 2026

Summary

Root cause

Changes

Idempotency

Test plan

Risk

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant