Skip to content

feat(retries): add min_attempts and absolute_max_elapsed_time_ms to BackoffStrategy#342

Merged
aballman merged 3 commits into
mainfrom
aballman/fs-1988-min-attempts
May 27, 2026
Merged

feat(retries): add min_attempts and absolute_max_elapsed_time_ms to BackoffStrategy#342
aballman merged 3 commits into
mainfrom
aballman/fs-1988-min-attempts

Conversation

@aballman
Copy link
Copy Markdown
Contributor

@aballman aballman commented May 21, 2026

Summary

retry_with_backoff_async currently checks now - start > max_elapsed_time before sleeping. When RetryConfig.max_elapsed_time is set to e.g. 5 min and the per-attempt httpx client timeout is 30 min, any attempt that runs longer than the budget blows it on attempt 1 — zero retries fire on subsequent transient errors. Closes that short-circuit without re-introducing the unbounded retry loops that recent budget tightening was meant to prevent.

What changes

New optional fields on BackoffStrategy (default values preserve existing behavior):

  • min_attempts: int = 0 — minimum retry attempts that must fire before max_elapsed_time is honored. Counts retries, not the initial attempt. min_attempts=2 permits 1 initial + at least 2 retries (3 total attempts) before the soft cap can cut the loop.
  • absolute_max_elapsed_time_ms: int | None = None — cap on when a new retry can START. Does NOT interrupt an in-flight func() call. Worst-case wall-clock under this cap is absolute_max_elapsed_time_ms + per_attempt_timeout.

Loop changes in both sync (retry_with_backoff) and async (retry_with_backoff_async) paths:

  1. Post-attempt cap check. Soft cap honored only when retries >= min_attempts; hard cap unconditional.
  2. Pre-sleep hard-cap check. Refuses to sleep into a retry whose projected start would cross the hard cap.
  3. Post-sleep verification. Belt-and-suspenders against late wakeups and projection rounding.
  4. Helper extraction (_cap_hit_after_attempt, _raise_or_return_after_cap) dedupes the soft/hard cap logic between sync and async.

Validation in BackoffStrategy.__init__ rejects min_attempts < 0, absolute_max_elapsed_time_ms <= 0, and hard cap below soft cap.

.genignore updated to preserve these fields across future Speakeasy regens, matching the general.py and users.py precedent.

Design choices

  • Hard cap is a retry-start cap, not a wall-clock bound. In-flight func() cannot be interrupted from the retry loop. Consumers should pair the cap with a sensible per-attempt timeout to keep the worst case bounded.
  • Defaults preserve existing behavior so this change is non-breaking for every existing consumer. Consumers opt in by setting min_attempts > 0 and/or absolute_max_elapsed_time_ms.

Tests

49 tests pass (46 unit + 3 split-PDF retry integration). New coverage:

  • T1–T14 in _test_unstructured_client/unit/test_retries.py: fake-clock harness monkeypatching time.time / time.sleep / asyncio.sleep / random.uniform. Covers the slow-first-attempt + min_attempts floor scenario, floor-is-not-a-ceiling semantics, hard cap overrides floor, sleep truncation, TemporaryError early-return through both caps, PermanentError short-circuit immunity, and BackoffStrategy.__init__ validation.
  • test_split_pdf_cache_tmp_data_chunk_request_stream_is_replay_safe in integration/test_decorators.py: pins the body-replay invariant for chunk requests built from open file objects (the split_pdf_cache_tmp_data=True path). Iterates request.stream twice directly — bypasses request.read() caching — so a future Speakeasy template change that produced a single-consumption stream would fail this test.

Test plan

  • uv run pytest _test_unstructured_client/unit/test_retries.py — 46 pass
  • uv run pytest _test_unstructured_client/integration/test_decorators.py::test_split_pdf_*retry* _test_unstructured_client/integration/test_decorators.py::test_split_pdf_cache_tmp_data_* — 3 pass
  • CI green

…ackoffStrategy

Closes a short-circuit in retry_with_backoff{,_async} where a single slow
first attempt can exhaust max_elapsed_time before any retry fires. With
the partitioner config (max_elapsed_time=5min, CLIENT_TIMEOUT_MS=30min),
any chunk whose first attempt exceeds 5 minutes blew the retry budget
on attempt 1 -- zero retries on subsequent transient errors. Documented
in two customer-visible regressions in the platform partition path.

New fields on BackoffStrategy:

* min_attempts (default 0) -- minimum number of retry attempts that must
  fire before max_elapsed_time is honored. Counts retries (not the
  initial attempt). Default 0 preserves existing behavior.
* absolute_max_elapsed_time_ms (default None) -- cap on when a new retry
  can START. Does NOT interrupt an in-flight func() call. Worst-case
  wall-clock under this cap is absolute_max_elapsed_time_ms +
  per_attempt_timeout. Default None preserves existing behavior.

Loop changes in both retry_with_backoff and retry_with_backoff_async:

* Post-attempt cap check honors min_attempts as a floor on the soft cap
  and treats the hard cap as unconditional.
* Pre-sleep hard-cap check refuses to sleep into a doomed retry whose
  projected start would exceed the hard cap.
* Post-sleep verification (belt-and-suspenders against late wakeups and
  rounding drift in the projection).
* Helper extraction (_cap_hit_after_attempt, _raise_or_return_after_cap)
  dedupes logic between sync and async paths.

Validation rejects min_attempts < 0, absolute_max_elapsed_time_ms <= 0,
and absolute_max_elapsed_time_ms below max_elapsed_time.

.genignore: ignore src/unstructured_client/utils/retries.py to preserve
these fields across Speakeasy regens. Documented merge procedure for
future template updates. Pushing these fields upstream to Speakeasy
templates is filed as a follow-up.

Tests:

* T1-T14 in unit/test_retries.py with a fake-clock harness that
  monkeypatches time.time / time.sleep / asyncio.sleep / random.uniform.
  Covers the v1 reproducer (slow first attempt + min_attempts floor),
  floor-is-not-a-ceiling semantics, hard cap overrides floor, sleep
  truncation, TemporaryError early-return through both caps, and
  PermanentError short-circuit immunity.
* Validation tests for the new __init__ guards.
* New integration test
  test_split_pdf_cache_tmp_data_chunk_request_stream_is_replay_safe
  pins the body-replay invariant: chunks built from open file objects
  (cache_tmp_data=True) must produce replay-safe httpx requests so SDK
  retries actually deliver the original multipart payload. Iterates
  request.stream twice directly to bypass request.read() caching.
@aballman aballman force-pushed the aballman/fs-1988-min-attempts branch from 759f400 to 12ba6d6 Compare May 22, 2026 21:19
@aballman aballman changed the title feat(FS-1988): add min_attempts and absolute_max_elapsed_time_ms to BackoffStrategy feat(retries): add min_attempts and absolute_max_elapsed_time_ms to BackoffStrategy May 26, 2026
@aballman
Copy link
Copy Markdown
Contributor Author

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 27, 2026

Claude finished @aballman's task —— View job


PR Review Complete ✓

Reviewing PR #342: feat(retries): add min_attempts and absolute_max_elapsed_time_ms to BackoffStrategy

Todo List:

  • Read repository guidelines (CLAUDE.md) - None found
  • Review main implementation in retries.py
  • Examine test coverage and quality
  • Check other modified files (.genignore, CHANGELOG.md)
  • Provide comprehensive review feedback

Review Summary

This is a well-implemented PR that addresses a legitimate issue where slow first attempts could exhaust retry budgets before any retries occur. The code quality is high with comprehensive test coverage.

Code Review Feedback

Strengths

1. Clean API Design

  • src/unstructured_client/utils/retries.py:25-26: New optional parameters with sensible defaults preserve backward compatibility
  • src/unstructured_client/utils/retries.py:31-45: Comprehensive validation logic prevents invalid configurations

2. Excellent Test Coverage

  • _test_unstructured_client/unit/test_retries.py:177-228: Thorough validation testing covers edge cases
  • _test_unstructured_client/unit/test_retries.py:20-48: Sophisticated fake clock mechanism enables deterministic testing
  • Tests cover both sync and async paths consistently

3. Robust Implementation

  • src/unstructured_client/utils/retries.py:182-205: Helper functions _cap_hit_after_attempt and _raise_or_return_after_cap eliminate code duplication between sync/async paths
  • src/unstructured_client/utils/retries.py:246-256: Sleep truncation prevents doomed retries that would exceed hard cap
  • src/unstructured_client/utils/retries.py:259-269: Belt-and-suspenders post-sleep verification

4. Clear Documentation

  • PR description thoroughly explains the problem and solution
  • src/unstructured_client/utils/retries.py:28-30: Inline comments clarify parameter semantics
  • Test names like test_t3_slow_first_attempt_then_success_sync are descriptive

⚠️ Minor Concerns

1. Magic Numbers in Tests

  • _test_unstructured_client/unit/test_retries.py:300,319,338,405,424,454,523: Various hardcoded timeouts (6000ms, 9950ms, 11000ms) could benefit from named constants explaining their significance

2. Error Message Clarity

  • src/unstructured_client/utils/retries.py:201-204: Error messages include technical details like "retries + 1 attempts over 6.0s" which may confuse end users. Consider more user-friendly messages.

📝 Implementation Notes

1. Design Decisions Well-Justified

  • Hard cap only prevents retry start, doesn't interrupt in-flight requests - sensible choice
  • min_attempts counts retries not initial attempts - clear from tests and docs

2. Thread Safety

  • Implementation is stateless per retry loop, avoiding concurrency issues
  • Clock mocking approach in tests is clean and doesn't introduce race conditions

3. Integration with Existing Code

  • src/unstructured_client/utils/retries.py:129-131,175-177: Updates to retry and retry_async functions cleanly pass new parameters
  • .genignore:23-25: Proper Speakeasy configuration prevents regeneration conflicts

Test Quality Assessment

The test suite is particularly impressive:

  • T1-T14 test numbering provides clear test organization
  • FakeClock mechanism enables precise timing control without flaky tests
  • Both sync and async coverage ensures feature parity
  • Edge case coverage includes validation, early returns, and cap interactions

Minor Suggestions

  1. Consider extracting timeout constants in tests for better readability
  2. Error messages could be more user-friendly while retaining debugging info
  3. The test_split_pdf_cache_tmp_data_chunk_request_stream_is_replay_safe test in integration looks unrelated to this PR but passes - good defensive programming

Verdict: APPROVE

This PR successfully solves the stated problem with a clean, well-tested implementation. The backward compatibility is preserved, the new functionality is opt-in, and the code quality meets high standards. Ready to merge.


@aballman aballman merged commit 6c28d32 into main May 27, 2026
17 checks passed
@aballman aballman deleted the aballman/fs-1988-min-attempts branch May 27, 2026 15:56
aballman added a commit that referenced this pull request May 27, 2026
## Summary

Cuts a 0.44.1 release so the `BackoffStrategy` retry-budget fields that
landed on main in #342 actually ship to PyPI. v0.44.0 was tagged before
#342 merged, so PyPI v0.44.0 does not include those fields.

## Changes

- `src/unstructured_client/_version.py`: bump `__version__` and
`__user_agent__` to `0.44.1`.
- `CHANGELOG.md`: split the combined 0.44.0 entry — `min_attempts` /
`absolute_max_elapsed_time_ms` move under a new `## 0.44.1` section so
the changelog matches what's actually in each PyPI artifact.
- `RELEASES.md`: append a 0.44.1 entry following the existing
Speakeasy-publish format.

## Test plan

- [x] No code changes; only metadata files
- [ ] CI green
- [ ] PyPI publish workflow picks up the bump on merge
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants