-
Notifications
You must be signed in to change notification settings - Fork 1
UN-2793 [FEAT] Added exponential backoff retry mechanism for platform service connections #199
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UN-2793 [FEAT] Added exponential backoff retry mechanism for platform service connections #199
Conversation
…l backoff retry mechanism for platform service connections - Implemented retry_utils module with configurable retry behavior - Added @retry_on_connection_error decorator to platform service calls - Supports exponential backoff with jitter to prevent thundering herd - Configuration via environment variables for max retries, delays, and backoff factor - Automatically retries on ConnectionError and errno 111 (Connection refused) - Improved error handling and logging for better debugging - Bumped SDK version to v0.78.0
|
Warning Rate limit exceeded@chandrasekharan-zipstack has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 4 minutes and 55 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (1)
Summary by CodeRabbit
Summary by CodeRabbit
WalkthroughAdds a retry utilities module and applies retry decorators to platform, prompt, and adapter service-call methods, updates error handling and logging, and bumps SDK version from v0.77.1 to v0.78.0. No public API signatures changed. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor Client
participant Adapter as ToolAdapter
participant Platform as PlatformHelper
participant Prompt as PromptHelper
participant Retry as Retry Decorator
participant Svc as Remote Service
rect rgb(245,245,255)
note over Adapter,Platform: Adapter configuration fetch
Client->>Adapter: get_adapter_config()
Adapter->>Retry: _get_adapter_configuration()
loop Retry attempts (env-configurable backoff)
Retry->>Svc: HTTP request
alt Success
Svc-->>Retry: 2xx response
Retry-->>Adapter: result
else Retryable error
Svc--x Retry: network/timeout/5xx
Retry-->>Retry: wait(backoff+jitter)
else Exhausted
Retry-->>Adapter: raise ConnectionError
end
end
end
rect rgb(245,255,245)
note over Platform: Generic platform call
Client->>Platform: _call_service(...)
Platform->>Retry: _call_service()
loop Retry attempts
Retry->>Svc: HTTP request
alt Success
Svc-->>Retry: 2xx
Retry-->>Platform: result
else Retryable error
Svc--x Retry: transient failure
Retry-->>Retry: wait/backoff
else Give up
Retry-->>Platform: raise ConnectionError
Platform-->>Client: propagate ConnectionError (after logging/emitting tool error)
end
end
end
rect rgb(255,245,245)
note over Prompt: Prompt service call
Client->>Prompt: _call_service(...)
Prompt->>Retry: _call_service()
loop Retry attempts
Retry->>Svc: HTTP request
alt Success
Svc-->>Retry: 2xx
Retry-->>Prompt: result
else Retryable error
Svc--x Retry: transient failure
Retry-->>Retry: wait/backoff
else Exhausted
Retry-->>Prompt: raise error
Prompt-->>Client: propagate error
end
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
src/unstract/sdk/__init__.py (1)
1-6: Add missing Git tag and release notes for v0.78.0
Packaging is configured to pull the version fromsrc/unstract/sdk/__init__.py, so the SDK version is already in sync.
- Create and push the Git tag
v0.78.0- Draft or update the release notes on GitHub for v0.78.0
src/unstract/sdk/platform.py (1)
130-135: Add explicit timeouts to external requests.
Without timeouts, calls can hang indefinitely; with retries this compounds latency.@@ - response = requests.post( - url=url, json=payload, params=params, headers=req_headers - ) + response = requests.post( + url=url, json=payload, params=params, headers=req_headers, timeout=REQUEST_TIMEOUT + ) @@ - response = requests.get(url=url, params=params, headers=req_headers) + response = requests.get(url=url, params=params, headers=req_headers, timeout=REQUEST_TIMEOUT)Add a module-level default near the logger:
@@ logger = logging.getLogger(__name__) +REQUEST_TIMEOUT: tuple[float, float] = (3.05, 30.0) # (connect, read) secondssrc/unstract/sdk/adapter.py (1)
66-67: Add explicit timeouts to the platform request.
Prevents indefinite hangs and bounds total retry latency.- response = requests.get(url, headers=headers, params=query_params) + response = requests.get( + url, headers=headers, params=query_params, timeout=REQUEST_TIMEOUT + )Add a module-level default near the logger:
@@ logger = logging.getLogger(__name__) +REQUEST_TIMEOUT: tuple[float, float] = (3.05, 30.0) # (connect, read) seconds
🧹 Nitpick comments (7)
src/unstract/sdk/utils/retry_utils.py (2)
121-201: Improve failure logging and formatting; align with linter hints.Use logger.exception on final failure to capture traceback and avoid f-string coercions in logging calls.
- logger_instance.error( - f"Failed '{func.__name__}' after {attempt + 1} attempt(s). " - f"Error: {str(e)}" - ) + logger_instance.exception( + "Failed '%s' after %d attempt(s). Error: %s", + func.__name__, attempt + 1, e + ) @@ - logger_instance.warning( - f"Retryable error in '{func.__name__}' " - f"(attempt {attempt + 1}/{max_attempts}). " - f"Error: {str(e)}. Retrying in {delay:.2f} seconds..." - ) + logger_instance.warning( + "Retryable error in '%s' (attempt %d/%d). Error: %s. Retrying in %.2f seconds...", + func.__name__, attempt + 1, max_attempts, e, delay + )
55-61: More robust env parsing for boolean jitter.Accept common truthy values ("1", "yes", "on", "true") and falsy equivalents.
- jitter=os.getenv(f"{prefix}_RETRY_JITTER", "true").lower() == "true", + jitter=os.getenv(f"{prefix}_RETRY_JITTER", "true").strip().lower() + in {"1", "true", "yes", "on"},src/unstract/sdk/platform.py (3)
94-107: Docstring should mention retry configurability.
Add the env knobs so users know how to tune retries.@@ - This method automatically retries on connection errors with exponential backoff. + This method automatically retries on connection errors with exponential backoff. + Retry behavior is configurable via env: + - PLATFORM_SERVICE_MAX_RETRIES (default: 3) + - PLATFORM_SERVICE_INITIAL_DELAY (default: 1.0s) + - PLATFORM_SERVICE_MAX_DELAY (default: 60.0s) + - PLATFORM_SERVICE_BACKOFF_FACTOR (default: 2.0) + - PLATFORM_SERVICE_RETRY_JITTER (default: true)
140-146: Use logging.exception and avoid emitting user-facing error on every retry.
Log the stack once per failure; let the decorator own retry-attempt logs to reduce noise.- logger.error(f"Connection error to platform service: {connect_err}") - msg = ( - "Unable to connect to platform service. Might attempt to retry, " - "please contact admin if the retries fail." - ) - self.tool.stream_log(msg, level=LogLevel.ERROR) + logger.exception("Connection error to platform service: %s", connect_err) + msg = ( + "Unable to connect to platform service. Will retry with exponential backoff; " + "please contact admin if retries ultimately fail." + ) raise ConnectionError(msg) from connect_errIf user-facing logs are required, emit them once after retries are exhausted (in the decorator).
148-159: Harden error extraction and reuse helper for consistency.
Guard JSON parsing and prefer AdapterUtils-style extraction to avoid JSONDecodeError and inconsistent messages.@@ - content_type = response.headers.get("Content-Type", "").lower() - if MimeType.JSON in content_type: - response_json = response.json() - if "error" in response_json: - error_message = response_json["error"] - elif response.text: - error_message = response.text + content_type = response.headers.get("Content-Type", "").lower() + try: + if MimeType.JSON in content_type: + response_json = response.json() + error_message = response_json.get("error", error_message) + elif response.text: + error_message = response.text + except ValueError: + if response.text: + error_message = response.textOptional: Factor this via a shared utility (like AdapterUtils.get_msg_from_request_exc) for parity with adapter.py.
src/unstract/sdk/adapter.py (2)
64-65: Use header constants and include request ID for traceability.
Align with platform.py and carry request correlation.- headers = {"Authorization": f"Bearer {self.bearer_token}"} + headers = { + "Authorization": f"Bearer {self.bearer_token}", + "X-Request-Id": self.request_id or "", + }If available, prefer RequestHeader.AUTHORIZATION and RequestHeader.REQUEST_ID constants for consistency.
50-55: Docstring: note retry configurability.
Make the env-based controls discoverable here too.@@ - retries on connection errors with exponential backoff. + retries on connection errors with exponential backoff. + Controlled via: + PLATFORM_SERVICE_MAX_RETRIES, PLATFORM_SERVICE_INITIAL_DELAY, + PLATFORM_SERVICE_MAX_DELAY, PLATFORM_SERVICE_BACKOFF_FACTOR, + PLATFORM_SERVICE_RETRY_JITTER.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge Base: Disabled due to Reviews > Disable Knowledge Base setting
📒 Files selected for processing (5)
mypy-errors.txt(1 hunks)src/unstract/sdk/__init__.py(1 hunks)src/unstract/sdk/adapter.py(4 hunks)src/unstract/sdk/platform.py(4 hunks)src/unstract/sdk/utils/retry_utils.py(1 hunks)
🧰 Additional context used
🪛 Ruff (0.12.2)
src/unstract/sdk/utils/retry_utils.py
115-115: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
159-159: Consider moving this statement to an else block
(TRY300)
174-177: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
176-176: Use explicit conversion flag
Replace with conversion flag
(RUF010)
188-188: Use explicit conversion flag
Replace with conversion flag
(RUF010)
src/unstract/sdk/adapter.py
82-84: Avoid specifying long messages outside the exception class
(TRY003)
92-92: Avoid specifying long messages outside the exception class
(TRY003)
src/unstract/sdk/platform.py
140-140: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
🔇 Additional comments (3)
src/unstract/sdk/__init__.py (1)
1-1: Version bump looks good.No code-path changes; only the version constant was updated to v0.78.0.
src/unstract/sdk/platform.py (1)
1-1: LGTM: module-level logger setup is appropriate.Also applies to: 17-17
src/unstract/sdk/adapter.py (1)
2-2: LGTM: logger wiring mirrors platform module.Also applies to: 15-15
Signed-off-by: Chandrasekharan M <117059509+chandrasekharan-zipstack@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Chandrasekharan M <117059509+chandrasekharan-zipstack@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (1)
src/unstract/sdk/utils/retry_utils.py (1)
38-46: Also validate max_delay >= initial_delay.Prevents confusing configs where the first backoff is immediately capped.
if backoff_factor <= 0: raise ValueError("backoff_factor must be > 0") + if max_delay < initial_delay: + raise ValueError("max_delay must be >= initial_delay")
🧹 Nitpick comments (5)
src/unstract/sdk/utils/retry_utils.py (5)
61-67: Parse RETRY_JITTER more broadly; friendlier env parsing.Make boolean parsing tolerant to common truthy values; optional: wrap numeric parsing to raise clear messages.
return cls( max_retries=int(os.getenv(f"{prefix}_MAX_RETRIES", "3")), initial_delay=float(os.getenv(f"{prefix}_INITIAL_DELAY", "1.0")), max_delay=float(os.getenv(f"{prefix}_MAX_DELAY", "60.0")), backoff_factor=float(os.getenv(f"{prefix}_BACKOFF_FACTOR", "2.0")), - jitter=os.getenv(f"{prefix}_RETRY_JITTER", "true").lower() == "true", + jitter=os.getenv(f"{prefix}_RETRY_JITTER", "true").strip().lower() + in {"1", "true", "t", "yes", "y", "on"}, )If you want clearer errors for bad numeric envs, I can add small helpers to validate and include the offending var/value in the exception.
180-185: Use logging.exception and lazy formatting; silences TRY400/RUF010 and preserves traceback.Switch to logger formatting placeholders and include traceback on final failure.
- logger_instance.error( - f"Failed '{func.__name__}' after {attempt + 1} attempt(s). " - f"Error: {str(e)}" - ) + logger_instance.exception( + "Failed '%s' after %d attempt(s).", + func.__name__, attempt + 1, + ) @@ - logger_instance.warning( - f"Retryable error in '{func.__name__}' " - f"(attempt {attempt + 1}/{max_attempts}). " - f"Error: {str(e)}. Retrying in {delay:.2f} seconds..." - ) + logger_instance.warning( + "Retryable error in '%s' (attempt %d/%d). Error: %s. Retrying in %.2f seconds...", + func.__name__, attempt + 1, max_attempts, e, delay, + )Also applies to: 191-195
119-124: Nit: replace en dash and silence non-crypto RNG warning (S311).Use ASCII hyphen in the comment and mark RNG as non-crypto.
- # Add jitter if enabled (0–25% positive jitter) + # Add jitter if enabled (0-25% positive jitter) @@ - delay = base + (base * random.uniform(0.0, 0.25)) + delay = base + (base * random.uniform(0.0, 0.25)) # noqa: S311 non-crypto jitterAlso applies to: 121-121
127-132: Optional: injectable sleep for testability and faster unit tests.Allow passing a sleep function (default
time.sleep) so tests can stub it out.def retry_with_exponential_backoff( config: RetryConfig | None = None, retryable_exceptions: tuple[type[Exception], ...] | None = None, logger_instance: logging.Logger | None = None, + sleep_func: Callable[[float], None] | None = None, ) -> Callable: @@ if logger_instance is None: logger_instance = logger + if sleep_func is None: + sleep_func = time.sleep @@ - time.sleep(delay) + sleep_func(delay)Also applies to: 145-147, 198-198
85-92: Consider optionally handling HTTP 429 with Retry-After (if applicable to the platform API).If your platform returns 429 for throttling, optionally retry using the
Retry-Afterheader (seconds or HTTP-date), with a cap tomax_delay.Would you like me to confirm the platform’s throttling semantics before adding this? If yes, I can search the service docs or scan the repo for 429 handling.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge Base: Disabled due to Reviews > Disable Knowledge Base setting
📒 Files selected for processing (1)
src/unstract/sdk/utils/retry_utils.py(1 hunks)
🧰 Additional context used
🪛 Ruff (0.12.2)
src/unstract/sdk/utils/retry_utils.py
39-39: Avoid specifying long messages outside the exception class
(TRY003)
41-41: Avoid specifying long messages outside the exception class
(TRY003)
43-43: Avoid specifying long messages outside the exception class
(TRY003)
45-45: Avoid specifying long messages outside the exception class
(TRY003)
70-70: Redefinition of unused ConnectionError from line 13
Remove definition: ConnectionError
(F811)
70-70: Redefinition of unused HTTPError from line 13
Remove definition: HTTPError
(F811)
119-119: Comment contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF003)
121-121: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
165-165: Consider moving this statement to an else block
(TRY300)
180-183: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
182-182: Use explicit conversion flag
Replace with conversion flag
(RUF010)
194-194: Use explicit conversion flag
Replace with conversion flag
(RUF010)
🔇 Additional comments (1)
src/unstract/sdk/utils/retry_utils.py (1)
127-206: No async functions decorated with retry_with_exponential_backoff detected
Ran a repository-wide regex search for@retry_with_exponential_backoffonasync defand found no matches—no blocking issues present.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/unstract/sdk/adapter.py (1)
66-67: Add client-side timeouts to external call (retries can’t help a hang).
requests.gethas no timeout; a stalled socket blocks indefinitely and bypasses the retry logic. Pass a sane connect/read timeout.Apply:
- response = requests.get(url, headers=headers, params=query_params) + response = requests.get( + url, headers=headers, params=query_params, timeout=(5, 30) + )
♻️ Duplicate comments (2)
src/unstract/sdk/adapter.py (1)
128-133: Good: ConnectionError now bubbles to the decorator; mapping to SdkError happens at the API boundary.This resolves the earlier concern about neutralizing the retry decorator. Looks correct.
src/unstract/sdk/utils/retry_utils.py (1)
234-237: Bug: all OSErrors are retried due to the OR logic.
is_retryable = is_retryable or isinstance(e, retryable_exceptions)makes everyOSErrorretryable (includingPermissionError), bypassing your errno filter. RemoveOSErrorfromretryable_exceptionsand rely onis_retryable_errorfor errno-based checks.Apply:
decorator = retry_with_exponential_backoff( config=retry_config, - retryable_exceptions=(ConnectionError, OSError), + retryable_exceptions=(ConnectionError,), )
🧹 Nitpick comments (6)
src/unstract/sdk/adapter.py (2)
2-2: Remove unused logger/import.
loggingandlogger = logging.getLogger(__name__)are not used in this file.Apply:
-import logging @@ -logger = logging.getLogger(__name__)Also applies to: 15-15
91-94: Tighten return type hint.
get_adapter_configalways returns a dict or raises;Noneisn’t returned.Apply:
- def get_adapter_config( - tool: BaseTool, adapter_instance_id: str - ) -> dict[str, Any] | None: + def get_adapter_config( + tool: BaseTool, adapter_instance_id: str + ) -> dict[str, Any]:src/unstract/sdk/utils/retry_utils.py (4)
180-183: Log final failure with traceback.Use
logging.exceptionwhen re-raising to capture stack traces.Apply:
- logger_instance.error( - f"Failed '{func.__name__}' after {attempt + 1} attempt(s). " - f"Error: {str(e)}" - ) + logger_instance.exception( + "Failed '%s' after %d attempt(s). Error: %s", + func.__name__, attempt + 1, e, + )
191-195: Prefer logging format params over f-strings (and explicit conversion).Improves performance and avoids string building when the log level is disabled.
Apply:
- logger_instance.warning( - f"Retryable error in '{func.__name__}' " - f"(attempt {attempt + 1}/{max_attempts}). " - f"Error: {str(e)}. Retrying in {delay:.2f} seconds..." - ) + logger_instance.warning( + "Retryable error in '%s' (attempt %d/%d). Error: %s. Retrying in %.2f seconds...", + func.__name__, attempt + 1, max_attempts, e, delay, + )
38-45: Validate that max_delay >= initial_delay.Prevents a permanently clamped delay when misconfigured.
Apply:
if max_delay <= 0: raise ValueError("max_delay must be > 0") + if max_delay < initial_delay: + raise ValueError("max_delay must be >= initial_delay")
119-124: Minor: fix en-dash and silence non-crypto RNG warning.Use a hyphen and add a per-line ignore for S311 since jitter isn’t cryptographic.
Apply:
- # Add jitter if enabled (0–25% positive jitter) + # Add jitter if enabled (0-25% positive jitter) if config.jitter: - delay = base + (base * random.uniform(0.0, 0.25)) + delay = base + (base * random.uniform(0.0, 0.25)) # noqa: S311
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge Base: Disabled due to Reviews > Disable Knowledge Base setting
📒 Files selected for processing (2)
src/unstract/sdk/adapter.py(5 hunks)src/unstract/sdk/utils/retry_utils.py(1 hunks)
🧰 Additional context used
🪛 Ruff (0.12.2)
src/unstract/sdk/adapter.py
87-87: Avoid specifying long messages outside the exception class
(TRY003)
131-133: Avoid specifying long messages outside the exception class
(TRY003)
src/unstract/sdk/utils/retry_utils.py
39-39: Avoid specifying long messages outside the exception class
(TRY003)
41-41: Avoid specifying long messages outside the exception class
(TRY003)
43-43: Avoid specifying long messages outside the exception class
(TRY003)
45-45: Avoid specifying long messages outside the exception class
(TRY003)
70-70: Redefinition of unused ConnectionError from line 13
Remove definition: ConnectionError
(F811)
70-70: Redefinition of unused HTTPError from line 13
Remove definition: HTTPError
(F811)
119-119: Comment contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF003)
121-121: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
165-165: Consider moving this statement to an else block
(TRY300)
180-183: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
182-182: Use explicit conversion flag
Replace with conversion flag
(RUF010)
194-194: Use explicit conversion flag
Replace with conversion flag
(RUF010)
🔇 Additional comments (1)
src/unstract/sdk/adapter.py (1)
80-87: Confirm intent: catching HTTPError here disables 5xx retries.Because this block converts HTTPError to SdkError inside the decorated function, the decorator won’t see HTTP 502/503/504 and won’t retry, even though the retry utility considers them retryable. If you intended to retry 5xx, re-raise HTTPError here and translate at the caller after retries.
Option if you want 5xx retries:
- except HTTPError as e: - default_err = ( - "Error while calling the platform service, please contact the admin." - ) - msg = AdapterUtils.get_msg_from_request_exc( - err=e, message_key="error", default_err=default_err - ) - raise SdkError(f"Error retrieving adapter. {msg}") from e + except HTTPError: + # Let the retry decorator handle transient 5xx; convert after retries at the caller. + raise
…onfigurable exceptions
- Replace custom retry implementation with battle-tested backoff library
- Add configurable exception types while preserving existing OSError errno logic
- Implement generic retry decorator factory supporting multiple service prefixes
- Maintain backward compatibility with existing platform service retry behavior
- Add comprehensive environment variable configuration for retry parameters
- Improve logging with structured backoff details and exception context
- Reduce codebase complexity from 236 lines to ~108 lines (55% reduction)
- Support custom retry configurations per service via prefix-based env vars
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🧹 Nitpick comments (7)
pyproject.toml (3)
28-29: Constrain backoff to a safe major range.Avoid unexpected breaking changes by pinning an upper bound.
- "backoff>=2.2.1", + "backoff>=2.2.1,<3",
117-120: Align linter target with supported Python version.Package supports >=3.9, but Ruff targets py312. This can mask 3.9-incompatible syntax (e.g., type unions). Recommend aligning.
-[tool.ruff] -line-length = 90 -target-version = "py312" +[tool.ruff] +line-length = 90 +target-version = "py39"
205-205: Mypy config targets 3.12 while package supports 3.9.To avoid type hints/syntax creeping beyond 3.9 features, consider matching runtime.
-[tool.mypy] -python_version = "3.12" +[tool.mypy] +python_version = "3.9"src/unstract/sdk/platform.py (2)
133-137: Add request timeouts to avoid indefinite hangs.Without a timeout, retries won’t trigger on stalled connections.
- response = requests.post( - url=url, json=payload, params=params, headers=req_headers - ) + response = requests.post( + url=url, json=payload, params=params, headers=req_headers, timeout=30 + ) @@ - response = requests.get(url=url, params=params, headers=req_headers) + response = requests.get(url=url, params=params, headers=req_headers, timeout=30)
151-163: Prefer e.response over a locally constructed Response() when extracting error details.Avoids empty headers/body when the exception contains the actual response.
- except RequestException as e: - # Extract error information from the response if available - error_message = str(e) - content_type = response.headers.get("Content-Type", "").lower() - if MimeType.JSON in content_type: - response_json = response.json() + except RequestException as e: + # Extract error information from the response if available + error_message = str(e) + resp = getattr(e, "response", None) or response + content_type = resp.headers.get("Content-Type", "").lower() if resp is not None else "" + if resp is not None and MimeType.JSON in content_type: + response_json = resp.json() if "error" in response_json: error_message = response_json["error"] - elif response.text: - error_message = response.text + elif resp is not None and resp.text: + error_message = resp.text self.tool.stream_error_and_exit( f"Error from platform service. {error_message}" )src/unstract/sdk/adapter.py (1)
70-73: Add request timeout for robustness.Prevents indefinite waits and enables Timeout-based retries when server stalls.
- response = requests.get(url, headers=headers, params=query_params) + response = requests.get( + url, headers=headers, params=query_params, timeout=30 + )src/unstract/sdk/utils/retry_utils.py (1)
111-118: Prefer logger.error with exc_info over logger.exception with explicit exception text.Keeps logs clean and avoids redundant exception string formatting.
- logger_instance.exception( - "Giving up after %d retries for %s: %s", tries, prefix, exception - ) + logger_instance.error( + "Giving up after %d retries for %s: %s", tries, prefix, exception, exc_info=True + )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (4)
pyproject.toml(1 hunks)src/unstract/sdk/adapter.py(6 hunks)src/unstract/sdk/platform.py(4 hunks)src/unstract/sdk/utils/retry_utils.py(1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.1)
src/unstract/sdk/adapter.py
91-91: Avoid specifying long messages outside the exception class
(TRY003)
135-137: Avoid specifying long messages outside the exception class
(TRY003)
src/unstract/sdk/platform.py
143-143: Redundant exception object included in logging.exception call
(TRY401)
🔇 Additional comments (3)
src/unstract/sdk/platform.py (1)
94-103: Decorator usage looks good.Applying the retry decorator at the method boundary is the right place.
src/unstract/sdk/adapter.py (2)
47-52: Good placement of retry decorator and tightened error mapping.Letting ConnectionError propagate to the decorator and mapping to SdkError at the call site is correct.
132-137: Caller-side conversion to SdkError looks good.This preserves retry behavior and surfaces a domain error after exhausting retries.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Chandrasekharan M <117059509+chandrasekharan-zipstack@users.noreply.github.com>
Signed-off-by: Chandrasekharan M <117059509+chandrasekharan-zipstack@users.noreply.github.com>
Signed-off-by: Chandrasekharan M <117059509+chandrasekharan-zipstack@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chandrasekharan-zipstack would be great if we can also cover prompt service retry with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (4)
src/unstract/sdk/utils/retry_utils.py (4)
29-51: Include socket/built-in timeouts to cover non-requests flows.
socket.timeout(subclass ofTimeoutError) often lacks errno and won’t match your OSError errno set. Add them explicitly to avoid missing retries when not usingrequests.@@ - # Requests connection and timeout errors - if isinstance(error, (ConnectionError, Timeout)): + # Connection and timeout errors + # - requests: ConnectionError, Timeout + # - stdlib: TimeoutError, socket.timeout + if isinstance(error, (ConnectionError, Timeout, TimeoutError)): return True + + # socket.timeout may not carry errno reliably + try: + import socket # local import to avoid unconditional dependency + if isinstance(error, socket.timeout): + return True + except Exception: + pass
81-92: Harden env parsing; fall back to sane defaults instead of raising.If envs are malformed (e.g., “abc”), current casts raise at import-time (due to the global
retry_platform_service_call). Prefer tolerant parsing with warnings.- max_tries = int(os.getenv(f"{prefix}_MAX_RETRIES", "3")) + 1 # +1 for initial attempt - max_time = float(os.getenv(f"{prefix}_MAX_TIME", "60")) - base = float(os.getenv(f"{prefix}_BASE_DELAY", "1.0")) - factor = float(os.getenv(f"{prefix}_MULTIPLIER", "2.0")) - use_jitter = os.getenv(f"{prefix}_JITTER", "true").strip().lower() in { + def _int_env(key: str, default: int) -> int: + raw = os.getenv(key) + if raw is None: + return default + try: + return int(raw) + except ValueError: + logger.warning("Invalid %s=%r; falling back to %d", key, raw, default) + return default + + def _float_env(key: str, default: float) -> float: + raw = os.getenv(key) + if raw is None: + return default + try: + return float(raw) + except ValueError: + logger.warning("Invalid %s=%r; falling back to %.2f", key, raw, default) + return default + + def _bool_env(key: str, default: bool) -> bool: + raw = os.getenv(key) + if raw is None: + return default + return raw.strip().lower() in {"true", "1", "yes", "on"} + + max_tries = _int_env(f"{prefix}_MAX_RETRIES", 3) + 1 # +1 for initial attempt + max_time = _float_env(f"{prefix}_MAX_TIME", 60.0) + base = _float_env(f"{prefix}_BASE_DELAY", 1.0) + factor = _float_env(f"{prefix}_MULTIPLIER", 2.0) + use_jitter = _bool_env(f"{prefix}_JITTER", True)
81-92: Optional: add per-attempt cap via MAX_DELAY and wire through to expo.This complements MAX_TIME and prevents single sleeps from growing too large under long retry windows.
@@ - use_jitter = os.getenv(f"{prefix}_JITTER", "true").strip().lower() in { + use_jitter = os.getenv(f"{prefix}_JITTER", "true").strip().lower() in { "true", "1", "yes", "on", } + max_delay_env = os.getenv(f"{prefix}_MAX_DELAY") + max_delay = float(max_delay_env) if max_delay_env else None @@ - return backoff.on_exception( - backoff.expo, + return backoff.on_exception( + backoff.expo, exceptions, # Use the configurable exceptions max_tries=max_tries, max_time=max_time, base=base, - factor=factor, + factor=factor, + max_value=max_delay, jitter=backoff.full_jitter if use_jitter else None, giveup=lambda e: not is_retryable_error(e), on_backoff=on_backoff_handler, on_giveup=on_giveup_handler, )Also applies to: 120-129
102-109: Tweak logging: avoid leaking secrets and fix retry count wording.Exception strings can include URLs/tokens; also the “retries” count in give-up should be
tries - 1.- logger_instance.warning( - "Retry %d/%d for %s: %s (waiting %.1fs)", - tries, - max_tries - 1, - prefix, - exception, - wait, - ) + logger_instance.warning( + "Retry %d/%d for %s (waiting %.1fs) | exc=%s", + tries, + max_tries - 1, + prefix, + wait, + type(exception).__name__, + ) @@ - logger_instance.exception( - "Giving up after %d retries for %s: %s", tries, prefix, exception - ) + # tries includes the initial attempt; retries = tries - 1 + logger_instance.error( + "Giving up after %d retries for %s | exc=%s", + max(0, tries - 1), + prefix, + type(exception).__name__, + )Also applies to: 116-118
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (2)
src/unstract/sdk/platform.py(4 hunks)src/unstract/sdk/utils/retry_utils.py(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/unstract/sdk/platform.py
🔇 Additional comments (3)
src/unstract/sdk/utils/retry_utils.py (3)
1-12: Solid migration to backoff; clear public API.Factory-based decorator with env-driven config and preserved errno logic looks good.
135-141: Docs/variables consistent with PR summary—nice.Env names align with the PR description and defaults. Good clarity for operators.
121-132: Verify backoff kwargs compatibility with pinned version Manually confirm that your pinned backoff version (>=2.2.1) supports passing base, factor and max_value to backoff.on_exception, and ensure tests cover both retry code paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/unstract/sdk/prompt.py (2)
189-197: Add a client-side timeout to make retries effective and prevent indefinite hangs.Without a requests timeout, a call can block forever and never trigger retries. Introduce a timeout parameter (default 30s) and pass it to requests.
Apply these diffs:
@@ - def _call_service( + def _call_service( self, url_path: str, payload: dict[str, Any] | None = None, params: dict[str, str] | None = None, headers: dict[str, str] | None = None, method: str = "POST", + timeout: float = 30.0, ) -> dict[str, Any]:And:
@@ - response = requests.post( - url=url, json=payload, params=params, headers=req_headers - ) + response = requests.post( + url=url, json=payload, params=params, headers=req_headers, timeout=timeout + ) @@ - response = requests.get(url=url, params=params, headers=req_headers) + response = requests.get(url=url, params=params, headers=req_headers, timeout=timeout)
223-229: Pass the timeout to requests to avoid indefinite blocking.Follow-up to the timeout addition above; ensure both POST and GET pass it.
Apply:
- response = requests.post( - url=url, json=payload, params=params, headers=req_headers - ) + response = requests.post( + url=url, json=payload, params=params, headers=req_headers, timeout=timeout + ) @@ - response = requests.get(url=url, params=params, headers=req_headers) + response = requests.get(url=url, params=params, headers=req_headers, timeout=timeout)
🧹 Nitpick comments (5)
src/unstract/sdk/prompt.py (1)
189-189: Guard against duplicate effects when retrying POSTs; ensure idempotency via request_id.Since retries apply to POST endpoints, confirm the server deduplicates via the Request-ID header. If not guaranteed, generate a UUID when request_id is None to make retries safe.
Example (outside this hunk):
# At top of file import uuid # In __init__ self.request_id = request_id or str(uuid.uuid4())Can you confirm the prompt service treats Request-ID as an idempotency key?
src/unstract/sdk/utils/retry_utils.py (4)
29-51: Optionally include HTTP 429 (rate limiting) as retryable.Many services return 429 for throttling; retrying with backoff is appropriate.
Apply:
- # Retry on server errors and bad gateway - if status_code in [502, 503, 504]: + # Retry on throttling and transient server errors + if status_code in [429, 502, 503, 504]: return TrueOptionally honor Retry-After in a future enhancement.
81-92: Harden env parsing with basic validation/clamping to sane defaults.Protect against misconfig (negative/zero values) that would create invalid backoff schedules.
Apply:
max_tries = int(os.getenv(f"{prefix}_MAX_RETRIES", "3")) + 1 # +1 for initial attempt max_time = float(os.getenv(f"{prefix}_MAX_TIME", "60")) base = float(os.getenv(f"{prefix}_BASE_DELAY", "1.0")) factor = float(os.getenv(f"{prefix}_MULTIPLIER", "2.0")) use_jitter = os.getenv(f"{prefix}_JITTER", "true").strip().lower() in { "true", "1", "yes", "on", } + + # Clamp/sanitize + if max_tries < 1: + max_tries = 1 + if max_time <= 0: + max_time = 60.0 + if base <= 0: + base = 1.0 + if factor <= 0: + factor = 2.0
81-92: Expose optional per-attempt max delay (PREFIX_MAX_DELAY) and wire it to expo.Prevents unbounded waits while still respecting MAX_TIME.
Apply:
@@ - use_jitter = os.getenv(f"{prefix}_JITTER", "true").strip().lower() in { + use_jitter = os.getenv(f"{prefix}_JITTER", "true").strip().lower() in { "true", "1", "yes", "on", } + max_value_env = os.getenv(f"{prefix}_MAX_DELAY") + max_value = float(max_value_env) if max_value_env else None
120-132: Cap exponential backoff using max_value, if provided.Wires the new PREFIX_MAX_DELAY into backoff.expo.
Apply:
- return backoff.on_exception( - backoff.expo, + return backoff.on_exception( + backoff.expo, exceptions, # Use the configurable exceptions max_tries=max_tries, max_time=max_time, base=base, - factor=factor, + factor=factor, + max_value=max_value, jitter=backoff.full_jitter if use_jitter else None, giveup=lambda e: not is_retryable_error(e), on_backoff=on_backoff_handler, on_giveup=on_giveup_handler, )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (2)
src/unstract/sdk/prompt.py(3 hunks)src/unstract/sdk/utils/retry_utils.py(1 hunks)
🔇 Additional comments (4)
src/unstract/sdk/prompt.py (2)
18-18: LGTM: Using the preconfigured PROMPT_SERVICE retry decorator is appropriate.
201-209: Docstring update aligns with the new env-driven retry behavior.src/unstract/sdk/utils/retry_utils.py (2)
141-149: LGTM: Pre-instantiated decorators with clear prefixes make adoption easy.
9-11: backoff dependency is pinned to >=2.2.1 in pyproject.toml
|
@ritwik-g I've addressed your comments and added support for prompt service as well |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (1)
src/unstract/sdk/utils/retry_utils.py(1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.1)
src/unstract/sdk/utils/retry_utils.py
83-83: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
136-136: Consider moving this statement to an else block
(TRY300)
151-156: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
167-167: Redundant exception object included in logging.exception call
(TRY401)
179-184: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
🔇 Additional comments (1)
src/unstract/sdk/utils/retry_utils.py (1)
75-86: Ensure jitter stays within the configured cap.Right now we clamp the exponential delay before adding jitter, so the final sleep interval can overshoot the configured cap. For example, with
MAX_TIME=60, once the exponential term hits 60 the jitter pushes the delay up to 75 s, which immediately trips thedelay >= remaining_timeguard and aborts retries even though we still had time budget left. Capping after jitter keeps the behaviour aligned with the config.- delay = base_delay * (multiplier**attempt) - - # Cap at max delay - delay = min(delay, max_delay) - - # Add jitter if enabled (0-25% of delay) - if jitter: - jitter_amount = delay * random.uniform(0, 0.25) - delay += jitter_amount - - return delay + delay = base_delay * (multiplier**attempt) + + if jitter: + jitter_amount = delay * random.uniform(0, 0.25) + delay += jitter_amount + + return min(delay, max_delay)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (1)
src/unstract/sdk/utils/retry_utils.py(1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.1)
src/unstract/sdk/utils/retry_utils.py
83-83: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
136-136: Consider moving this statement to an else block
(TRY300)
151-156: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
167-167: Redundant exception object included in logging.exception call
(TRY401)
179-184: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
268-268: Avoid specifying long messages outside the exception class
(TRY003)
270-270: Avoid specifying long messages outside the exception class
(TRY003)
272-272: Avoid specifying long messages outside the exception class
(TRY003)
274-274: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (5)
src/unstract/sdk/utils/retry_utils.py (5)
1-14: LGTM!Module setup and imports are appropriate. The
randommodule is correctly used for backoff jitter (S311 warning is a false positive in this context).
17-53: LGTM!The retryable error detection is comprehensive and correctly handles connection errors, timeouts, specific HTTP status codes, and OS-level connection failures including errno 111 (ECONNREFUSED).
89-211: Core retry logic is sound.The exponential backoff implementation correctly handles max retries, max time, retry predicates, and exception filtering. The defensive code at lines 206-207 is good practice even if logically unreachable.
214-289: LGTM! Configuration validation is appropriate.The environment-driven configuration with validation (lines 267-274) correctly fails fast on invalid values. The TRY003 static analysis hints about exception messages are acceptable in this context where inline validation messages provide clear feedback.
292-306: LGTM! Module-level decorators are properly configured.The pre-instantiated decorators with clear documentation of environment variables make the API easy to consume. Note the comment from hari-kuriakose about adding unit tests for this library.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Chandrasekharan M <117059509+chandrasekharan-zipstack@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Chandrasekharan M <117059509+chandrasekharan-zipstack@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Chandrasekharan M <117059509+chandrasekharan-zipstack@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Chandrasekharan M <117059509+chandrasekharan-zipstack@users.noreply.github.com>
What
Why
How
create_retry_decoratorfor flexible retry configurationsCan this PR break any existing features. If yes, please list possible items. If no, please explain why.
No, this PR should not break any existing features because:
Database Migrations
Dependencies Versions / Env Variables
Environment variables (per service prefix, e.g.,
PLATFORM_SERVICE,PROMPT_SERVICE):{PREFIX}_MAX_RETRIES: Maximum retry attempts (default: 3){PREFIX}_MAX_TIME: Maximum total time in seconds (default: 60) (please advice if this should be more @ritwik-g){PREFIX}_BASE_DELAY: Initial delay in seconds (default: 1.0){PREFIX}_MULTIPLIER: Backoff multiplier (default: 2.0){PREFIX}_JITTER: Enable/disable jitter (default: true)Examples:
PLATFORM_SERVICE_MAX_RETRIES=5API_SERVICE_MAX_RETRIES=2, API_SERVICE_BASE_DELAY=2.0Relevant Docs
Related Issues or PRs
Notes on Testing
Screenshots
Checklist
I have read and understood the Contribution Guidelines.