Skip to content

Retry transient network errors in HttpRetryPolicy (fix #260)#398

Open
msrathore-db wants to merge 1 commit into
mainfrom
fix/issue-260-retry-network-errors
Open

Retry transient network errors in HttpRetryPolicy (fix #260)#398
msrathore-db wants to merge 1 commit into
mainfrom
fix/issue-260-retry-network-errors

Conversation

@msrathore-db
Copy link
Copy Markdown
Contributor

Summary

Fixes #260 — the FetchError: socket hang up / Premature close / ECONNRESET failures users have been reporting during long-running workloads, including the cloud-storage variant (*.blob.core.windows.net result-link downloads) reported by @haggholm and @eranga-acerta on the issue.

Root cause

HttpRetryPolicy.invokeWithRetry only consulted shouldRetry after the fetch resolved, so when fetch() itself rejected with a network error the rejection propagated on attempt #1 with no retry. Additionally, response.buffer() (Thrift) and response.arrayBuffer() (CloudFetch) sat outside the retried block, so node-fetch's "Premature close" body-stream errors were not retried either.

Sibling drivers don't have this gap: Python relies on urllib3's default OSError retry, JDBC uses Apache HttpClient's stale-connection detection + retry handler.

Changes

  • HttpRetryPolicy.invokeWithRetry — catches operation rejections, classifies them via isRetryableNetworkError (FetchError system / body-timeout, OS errnos ECONNRESET/ECONNREFUSED/ETIMEDOUT/EHOSTUNREACH/ENETUNREACH/EPIPE/ENOTFOUND/EAI_AGAIN, and message patterns socket hang up/premature close/aborted), and retries with the same attempt-budget / backoff used for HTTP retries.
  • ThriftHttpConnection.write — reads response.buffer() inside the retried block so body-stream failures are retried too.
  • CloudFetchResultHandler.fetch — reads response.arrayBuffer() inside the retried block, with a guarded fallback so existing test stubs that pre-bake Response objects keep working.

Idempotency

Unchanged. ExecuteStatement and other write-y Thrift methods continue to use NullRetryPolicy. Network-error retry only applies where HttpRetryPolicy was already in use (idempotent Thrift methods + CloudFetch GETs).

Test plan

  • Unit: `HttpRetryPolicy.test.ts` — 4 new cases (FetchError ECONNRESET retried, "Premature close" retried, non-transient errors pass through, max-attempts on network errors). 14/14 passing.
  • Touched-area suite (`HttpRetryPolicy + NullRetryPolicy + HttpConnection + CloudFetchResultHandler`) — 30/30 passing.
  • Synthetic integration: local TCP server that RSTs first 3 connections — CloudFetch download recovers on attempt Some readme cleanup. #4.
  • Live warehouse end-to-end (pecotesting Azure): TLS-layer RST injection on *.blob.core.windows.net mid-response. Base branch crashes with the exact FetchError from issue [DATABRICKS ERROR]: "FetchError: socket hang up" #260; with this fix the same fault injection passes transparently with correct row counts.

Risk

  • Retry attempts/timeout budget is shared with HTTP-status retries, so worst-case wall-clock failure latency is unchanged.
  • Behavior on ExecuteStatement (NullRetryPolicy) is unchanged — no risk of double-execution.
  • Existing tests pass without modification (the CloudFetch handler has a fallback for stub-based tests).

This pull request and its description were written by Isaac.

`HttpRetryPolicy.invokeWithRetry` only consulted `shouldRetry` after the
fetch resolved, so when `fetch()` itself rejected with a network error
(`socket hang up`, `ECONNRESET`, etc.) the rejection propagated on attempt
#1 with no retry — the exact failure pattern reported in issue #260,
including the cloud-storage variant haggholm and eranga-acerta reported.

A related issue: `response.buffer()` (ThriftHttpConnection) and
`response.arrayBuffer()` (CloudFetchResultHandler) sat outside the
retried block, so node-fetch's "Premature close" body-stream errors were
also not retried.

Changes:
* `HttpRetryPolicy.invokeWithRetry` catches operation rejections,
  classifies them via `isRetryableNetworkError` (FetchError `system` /
  `body-timeout`, OS errnos `ECONNRESET`/`ECONNREFUSED`/`ETIMEDOUT`/
  `EHOSTUNREACH`/`ENETUNREACH`/`EPIPE`/`ENOTFOUND`/`EAI_AGAIN`, and
  message patterns `socket hang up`/`premature close`/`aborted`), and
  retries with the same attempt-budget/backoff used for HTTP retries.
  Aligns with what Python (urllib3 default OSError retry) and JDBC
  (Apache HttpClient retry handler) already do.
* `ThriftHttpConnection.write` reads `response.buffer()` inside the
  retried block, so body-stream failures are retried too.
* `CloudFetchResultHandler.fetch` reads `response.arrayBuffer()` inside
  the retried block, with a guarded fallback so existing test stubs
  that pre-bake `Response` objects keep working.

Idempotency is preserved: `ExecuteStatement` and other write-y Thrift
methods continue to use `NullRetryPolicy`. Network-error retry only
applies where `HttpRetryPolicy` was already in use (idempotent Thrift
methods + CloudFetch GETs).

Verified end-to-end against the pecotesting Azure warehouse with TLS-
layer RST injection on `*.blob.core.windows.net` mid-response: base
crashes with `FetchError: socket hang up` (matches issue #260 exactly);
with this fix the same fault injection passes transparently with
correct row counts.

Co-authored-by: Isaac
@github-actions
Copy link
Copy Markdown

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[DATABRICKS ERROR]: "FetchError: socket hang up"

1 participant