Skip to content

fix: retry on authentication failure with credential refresh#7812

Merged
wpfleger96 merged 7 commits intomainfrom
wpfleger/databricks-token-retry
Mar 16, 2026
Merged

fix: retry on authentication failure with credential refresh#7812
wpfleger96 merged 7 commits intomainfrom
wpfleger/databricks-token-retry

Conversation

@wpfleger96
Copy link
Collaborator

@wpfleger96 wpfleger96 commented Mar 11, 2026

Summary

When goosed gets a 403 from a provider (e.g., Databricks token expired after auto-rotation), it now invalidates the in-memory secrets cache, re-reads secrets.yaml from disk, and retries once. Previously, authentication errors were terminal — the session broke with "Invalid access token" and no recovery.

Context

Databricks tokens auto-rotate every 30 days in AWS Secrets Manager. The Block internal Slackbot fetches the token at pod startup and writes secrets.yaml. Goosed caches it in memory forever via Config::all_secrets(). When the token rotates while pods are running, goosed uses a stale token and returns 403 to all users.

This fix works with a sidecar container that periodically refreshes secrets.yaml from AWS SM. The sidecar keeps the file fresh; this change makes goosed pick up the new credentials on auth failure.

Changes

  • Add Provider::refresh_credentials() to the Provider trait (base.rs) with a default Err return — providers opt in by overriding
  • DatabricksProvider overrides refresh_credentials() to invalidate the secrets cache (databricks.rs)
  • Make DatabricksAuthProvider::get_auth_header() re-read the token from Config::global() on each call instead of using the value baked in at construction — matches how the OAuth path already works, and means cache invalidation actually takes effect without rebuilding the provider
  • Add auth retry to the provider retry layer (retry.rs): on ProviderError::Authentication, call refresh_credentials() and retry once if it succeeds
  • Remove the previous agent-loop implementation of auth retry from agent.rs
  • Make invalidate_secrets_cache() public on Config (base.rs)

@wpfleger96 wpfleger96 force-pushed the wpfleger/databricks-token-retry branch from 25e39cf to 0b13967 Compare March 13, 2026 15:04
@wpfleger96 wpfleger96 marked this pull request as ready for review March 13, 2026 15:18
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0b13967496

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

When goosed gets a 403 from a provider (e.g., Databricks token expired
after auto-rotation), it now invalidates the in-memory secrets cache,
re-reads secrets.yaml from disk, rebuilds the provider, and retries
once. Previously, authentication errors were terminal — the session
broke and users saw "Invalid access token" with no recovery.

This works with a sidecar that periodically refreshes secrets.yaml
from AWS Secrets Manager. The sidecar keeps the file fresh; this
change makes goosed pick up the new credentials on auth failure.

- Make `invalidate_secrets_cache()` public on Config
- Add Authentication retry arm in the agent reply loop (max 1 retry)
@wpfleger96 wpfleger96 force-pushed the wpfleger/databricks-token-retry branch from 0b13967 to d5c9e46 Compare March 13, 2026 15:37
@wpfleger96 wpfleger96 enabled auto-merge March 13, 2026 16:51
The agent loop was handling credential refresh directly, adding ~40
lines to an already overweight function. Auth retry is a provider
concern — move it into the ProviderRetry blanket impl where it sits
closer to the HTTP calls.

Add Provider::refresh_credentials() (default returns Err so only
providers that support it opt in). DatabricksProvider overrides it
to invalidate the secrets cache. Make DatabricksAuthProvider re-read
the token from Config on each request (matching how OAuth already
works) so cache invalidation actually takes effect without rebuilding
the entire provider.
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fed65316d2

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c93848d63e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Address code review feedback:

- Use the constructor-supplied token as initial value via a shared
  Arc<Mutex<Option<String>>> between DatabricksProvider and its auth
  provider. refresh_credentials() clears the cache, causing the next
  get_auth_header() call to re-read from Config. Preserves from_params
  semantics where callers supply a token directly.
- Remove the dead trait-default with_retry_config body — the blanket
  impl overrides it for all Provider types, so the duplicated backoff
  logic was unreachable.
- Log when refresh_credentials() itself fails so operators can see
  that a refresh was attempted.
@wpfleger96 wpfleger96 force-pushed the wpfleger/databricks-token-retry branch from c93848d to 256eac4 Compare March 13, 2026 22:32
Avoids a race where another thread sees token_cache == None and
repopulates it from the still-stale secrets cache.
If the provider was created via from_params with a caller-supplied
token, refresh_credentials clears the cache but Config may not have
DATABRICKS_TOKEN. Fall back to the original constructor-provided
token instead of erroring.
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4e7f7b6090

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@jh-block jh-block self-assigned this Mar 16, 2026
@jh-block
Copy link
Collaborator

Hi @wpfleger96, thanks for the contribution! Can you address the Codex code review comment and then this looks good to merge.

Copy link
Collaborator

@DOsinga DOsinga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove any redundant comments and check codex' replies, otherwise LGTM

…oken-retry

* origin/main: (21 commits)
  Remove java/.ai-usage-marker directory (#7925)
  test(acp): add terminal delegation fixtures and fix shell singleton (#7923)
  fix: bump pctx_code_mode to 0.3.0 for iterator type checking fix (#7892)
  feat: persist GooseMode per-session via session DB (#7854)
  feat(otel): propagate session.id to spans and log records (#7490)
  fix(test): add env_lock to is_openai_reasoning_model tests (#7917)
  fix(acp): pass session_id when loading extensions so skills are discovered (#7868)
  updated canonical models (#7920)
  feat(autovisualiser): Migrate the autovisualiser extension to MCP Apps  (#7852)
  fix: add tool_choice and parallel_tool_calls to chatgpt_codex provider (#7867)
  fix: tool confirmation handling for multiple requests (#7856)
  Remove dead OllamaSetup onboarding flow (#7861)
  fix: resolve tokio::sync::Mutex deadlock in recipe retry path (#7832)
  Upgrade Electron 40.6.0 → 41.0.0 (#7851)
  Only show up to 50 lines of source code (#7578)
  fix: stop writing without error when hitting broken pipe for goose session list (#7858)
  feat(acp): add session/set_mode handler (#7801)
  Keep messages in sync (#7850)
  More acp tools (#7843)
  fix: skip upgrade-insecure-requests CSP for external HTTP backends (#7714)
  ...
- Remove doc comment on refresh_credentials() per reviewer request
- Simplify from_params token_cache init to direct Some(api_key.clone())
  instead of unnecessary match expression
@wpfleger96 wpfleger96 added this pull request to the merge queue Mar 16, 2026
Merged via the queue into main with commit b3510ba Mar 16, 2026
24 checks passed
@wpfleger96 wpfleger96 deleted the wpfleger/databricks-token-retry branch March 16, 2026 16:37
wpfleger96 added a commit that referenced this pull request Mar 16, 2026
* origin/main: (72 commits)
  No Check do Check (#7942)
  Log 500 errors and also show error for direct download (#7936)
  fix: retry on authentication failure with credential refresh (#7812)
  Remove java/.ai-usage-marker directory (#7925)
  test(acp): add terminal delegation fixtures and fix shell singleton (#7923)
  fix: bump pctx_code_mode to 0.3.0 for iterator type checking fix (#7892)
  feat: persist GooseMode per-session via session DB (#7854)
  feat(otel): propagate session.id to spans and log records (#7490)
  fix(test): add env_lock to is_openai_reasoning_model tests (#7917)
  fix(acp): pass session_id when loading extensions so skills are discovered (#7868)
  updated canonical models (#7920)
  feat(autovisualiser): Migrate the autovisualiser extension to MCP Apps  (#7852)
  fix: add tool_choice and parallel_tool_calls to chatgpt_codex provider (#7867)
  fix: tool confirmation handling for multiple requests (#7856)
  Remove dead OllamaSetup onboarding flow (#7861)
  fix: resolve tokio::sync::Mutex deadlock in recipe retry path (#7832)
  Upgrade Electron 40.6.0 → 41.0.0 (#7851)
  Only show up to 50 lines of source code (#7578)
  fix: stop writing without error when hitting broken pipe for goose session list (#7858)
  feat(acp): add session/set_mode handler (#7801)
  ...
lifeizhou-ap added a commit that referenced this pull request Mar 17, 2026
* main:
  Add DCO git commit command to AGENTS.md (#7945)
  fix(claude-code): remove incorrect agent_visible filter on user message (#7931)
  No Check do Check (#7942)
  Log 500 errors and also show error for direct download (#7936)
  fix: retry on authentication failure with credential refresh (#7812)
  Remove java/.ai-usage-marker directory (#7925)
  test(acp): add terminal delegation fixtures and fix shell singleton (#7923)
  fix: bump pctx_code_mode to 0.3.0 for iterator type checking fix (#7892)
  feat: persist GooseMode per-session via session DB (#7854)
  feat(otel): propagate session.id to spans and log records (#7490)
  fix(test): add env_lock to is_openai_reasoning_model tests (#7917)
  fix(acp): pass session_id when loading extensions so skills are discovered (#7868)
jh-block added a commit that referenced this pull request Mar 17, 2026
* origin/main:
  feat: adversarial agent for preventing leaking of info and more  (#7948)
  Update contributing.md (#7927)
  docs: add credit balance monitoring section (#7952)
  docs: add Cerebras provider to supported providers list (#7953)
  docs: add TUI client documentation to ACP clients guide (#7950)
  fix: removed double dash in pnpm command (#7951)
  docs: polish ACP docs (#7946)
  claude adaptive thinking (#7944)
  feat: new onboarding flow (#7266)
  Add DCO git commit command to AGENTS.md (#7945)
  fix(claude-code): remove incorrect agent_visible filter on user message (#7931)
  No Check do Check (#7942)
  Log 500 errors and also show error for direct download (#7936)
  fix: retry on authentication failure with credential refresh (#7812)
  Remove java/.ai-usage-marker directory (#7925)
  test(acp): add terminal delegation fixtures and fix shell singleton (#7923)
  fix: bump pctx_code_mode to 0.3.0 for iterator type checking fix (#7892)
  feat: persist GooseMode per-session via session DB (#7854)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants