fix: retry on authentication failure with credential refresh#7812
fix: retry on authentication failure with credential refresh#7812wpfleger96 merged 7 commits intomainfrom
Conversation
25e39cf to
0b13967
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0b13967496
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
When goosed gets a 403 from a provider (e.g., Databricks token expired after auto-rotation), it now invalidates the in-memory secrets cache, re-reads secrets.yaml from disk, rebuilds the provider, and retries once. Previously, authentication errors were terminal — the session broke and users saw "Invalid access token" with no recovery. This works with a sidecar that periodically refreshes secrets.yaml from AWS Secrets Manager. The sidecar keeps the file fresh; this change makes goosed pick up the new credentials on auth failure. - Make `invalidate_secrets_cache()` public on Config - Add Authentication retry arm in the agent reply loop (max 1 retry)
0b13967 to
d5c9e46
Compare
The agent loop was handling credential refresh directly, adding ~40 lines to an already overweight function. Auth retry is a provider concern — move it into the ProviderRetry blanket impl where it sits closer to the HTTP calls. Add Provider::refresh_credentials() (default returns Err so only providers that support it opt in). DatabricksProvider overrides it to invalidate the secrets cache. Make DatabricksAuthProvider re-read the token from Config on each request (matching how OAuth already works) so cache invalidation actually takes effect without rebuilding the entire provider.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fed65316d2
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c93848d63e
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Address code review feedback: - Use the constructor-supplied token as initial value via a shared Arc<Mutex<Option<String>>> between DatabricksProvider and its auth provider. refresh_credentials() clears the cache, causing the next get_auth_header() call to re-read from Config. Preserves from_params semantics where callers supply a token directly. - Remove the dead trait-default with_retry_config body — the blanket impl overrides it for all Provider types, so the duplicated backoff logic was unreachable. - Log when refresh_credentials() itself fails so operators can see that a refresh was attempted.
c93848d to
256eac4
Compare
Avoids a race where another thread sees token_cache == None and repopulates it from the still-stale secrets cache.
If the provider was created via from_params with a caller-supplied token, refresh_credentials clears the cache but Config may not have DATABRICKS_TOKEN. Fall back to the original constructor-provided token instead of erroring.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4e7f7b6090
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
Hi @wpfleger96, thanks for the contribution! Can you address the Codex code review comment and then this looks good to merge. |
DOsinga
left a comment
There was a problem hiding this comment.
remove any redundant comments and check codex' replies, otherwise LGTM
…oken-retry * origin/main: (21 commits) Remove java/.ai-usage-marker directory (#7925) test(acp): add terminal delegation fixtures and fix shell singleton (#7923) fix: bump pctx_code_mode to 0.3.0 for iterator type checking fix (#7892) feat: persist GooseMode per-session via session DB (#7854) feat(otel): propagate session.id to spans and log records (#7490) fix(test): add env_lock to is_openai_reasoning_model tests (#7917) fix(acp): pass session_id when loading extensions so skills are discovered (#7868) updated canonical models (#7920) feat(autovisualiser): Migrate the autovisualiser extension to MCP Apps (#7852) fix: add tool_choice and parallel_tool_calls to chatgpt_codex provider (#7867) fix: tool confirmation handling for multiple requests (#7856) Remove dead OllamaSetup onboarding flow (#7861) fix: resolve tokio::sync::Mutex deadlock in recipe retry path (#7832) Upgrade Electron 40.6.0 → 41.0.0 (#7851) Only show up to 50 lines of source code (#7578) fix: stop writing without error when hitting broken pipe for goose session list (#7858) feat(acp): add session/set_mode handler (#7801) Keep messages in sync (#7850) More acp tools (#7843) fix: skip upgrade-insecure-requests CSP for external HTTP backends (#7714) ...
- Remove doc comment on refresh_credentials() per reviewer request - Simplify from_params token_cache init to direct Some(api_key.clone()) instead of unnecessary match expression
* origin/main: (72 commits) No Check do Check (#7942) Log 500 errors and also show error for direct download (#7936) fix: retry on authentication failure with credential refresh (#7812) Remove java/.ai-usage-marker directory (#7925) test(acp): add terminal delegation fixtures and fix shell singleton (#7923) fix: bump pctx_code_mode to 0.3.0 for iterator type checking fix (#7892) feat: persist GooseMode per-session via session DB (#7854) feat(otel): propagate session.id to spans and log records (#7490) fix(test): add env_lock to is_openai_reasoning_model tests (#7917) fix(acp): pass session_id when loading extensions so skills are discovered (#7868) updated canonical models (#7920) feat(autovisualiser): Migrate the autovisualiser extension to MCP Apps (#7852) fix: add tool_choice and parallel_tool_calls to chatgpt_codex provider (#7867) fix: tool confirmation handling for multiple requests (#7856) Remove dead OllamaSetup onboarding flow (#7861) fix: resolve tokio::sync::Mutex deadlock in recipe retry path (#7832) Upgrade Electron 40.6.0 → 41.0.0 (#7851) Only show up to 50 lines of source code (#7578) fix: stop writing without error when hitting broken pipe for goose session list (#7858) feat(acp): add session/set_mode handler (#7801) ...
* main: Add DCO git commit command to AGENTS.md (#7945) fix(claude-code): remove incorrect agent_visible filter on user message (#7931) No Check do Check (#7942) Log 500 errors and also show error for direct download (#7936) fix: retry on authentication failure with credential refresh (#7812) Remove java/.ai-usage-marker directory (#7925) test(acp): add terminal delegation fixtures and fix shell singleton (#7923) fix: bump pctx_code_mode to 0.3.0 for iterator type checking fix (#7892) feat: persist GooseMode per-session via session DB (#7854) feat(otel): propagate session.id to spans and log records (#7490) fix(test): add env_lock to is_openai_reasoning_model tests (#7917) fix(acp): pass session_id when loading extensions so skills are discovered (#7868)
* origin/main: feat: adversarial agent for preventing leaking of info and more (#7948) Update contributing.md (#7927) docs: add credit balance monitoring section (#7952) docs: add Cerebras provider to supported providers list (#7953) docs: add TUI client documentation to ACP clients guide (#7950) fix: removed double dash in pnpm command (#7951) docs: polish ACP docs (#7946) claude adaptive thinking (#7944) feat: new onboarding flow (#7266) Add DCO git commit command to AGENTS.md (#7945) fix(claude-code): remove incorrect agent_visible filter on user message (#7931) No Check do Check (#7942) Log 500 errors and also show error for direct download (#7936) fix: retry on authentication failure with credential refresh (#7812) Remove java/.ai-usage-marker directory (#7925) test(acp): add terminal delegation fixtures and fix shell singleton (#7923) fix: bump pctx_code_mode to 0.3.0 for iterator type checking fix (#7892) feat: persist GooseMode per-session via session DB (#7854)
Summary
When goosed gets a 403 from a provider (e.g., Databricks token expired after auto-rotation), it now invalidates the in-memory secrets cache, re-reads
secrets.yamlfrom disk, and retries once. Previously, authentication errors were terminal — the session broke with "Invalid access token" and no recovery.Context
Databricks tokens auto-rotate every 30 days in AWS Secrets Manager. The Block internal Slackbot fetches the token at pod startup and writes
secrets.yaml. Goosed caches it in memory forever viaConfig::all_secrets(). When the token rotates while pods are running, goosed uses a stale token and returns 403 to all users.This fix works with a sidecar container that periodically refreshes
secrets.yamlfrom AWS SM. The sidecar keeps the file fresh; this change makes goosed pick up the new credentials on auth failure.Changes
Provider::refresh_credentials()to theProvidertrait (base.rs) with a defaultErrreturn — providers opt in by overridingDatabricksProvideroverridesrefresh_credentials()to invalidate the secrets cache (databricks.rs)DatabricksAuthProvider::get_auth_header()re-read the token fromConfig::global()on each call instead of using the value baked in at construction — matches how the OAuth path already works, and means cache invalidation actually takes effect without rebuilding the providerretry.rs): onProviderError::Authentication, callrefresh_credentials()and retry once if it succeedsagent.rsinvalidate_secrets_cache()public onConfig(base.rs)