Skip to content

fix(control-plane): retry handler errors with exponential backoff in informer#1182

Merged
markturansky merged 5 commits intoalphafrom
fix/cp-credential-rolebinding-and-project-delete
Apr 3, 2026
Merged

fix(control-plane): retry handler errors with exponential backoff in informer#1182
markturansky merged 5 commits intoalphafrom
fix/cp-credential-rolebinding-and-project-delete

Conversation

@markturansky
Copy link
Copy Markdown
Contributor

@markturansky markturansky commented Apr 3, 2026

Summary

  • Transient errors (e.g. TSA RoleBinding not yet propagated when the project reconciler runs ensureRunnerSecrets) caused events to be permanently dropped — the informer logged the error and moved on with no retry.
  • Add a retryLoop goroutine alongside dispatchLoop. Failed handlers are requeued onto a buffered retryCh with exponential backoff: 2s → 4s → 8s → 16s → 30s (cap).
  • After retryMaxAttempts (5) the error is logged as permanent.
  • Fixes the race where a newly-created project namespace's TSA RoleBinding isn't propagated by the time ensureRunnerSecrets runs.

Test plan

  • Create a new project — CP logs should show namespace provisioned and ambient-runner-secrets created without permanent failure
  • If a transient forbidden error occurs, CP logs should show handler failed, will retry with attempt and retry_in fields, followed by eventual success
  • After 5 failed attempts, CP logs handler failed after max retries

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Added credential management support: create, delete, view, and list credentials
    • Added JSON output format flag to agent start command
    • Added GitHub integration demo script for end-to-end workflow testing
  • Improvements

    • Enhanced event handler resilience with exponential backoff retry logic
    • Improved namespace cleanup handling during project deletion

Ambient Code Bot and others added 5 commits April 2, 2026 21:21
…ovision namespace on project delete

- Drop ensureCredentialRoleBindings from kube_reconciler: runner authenticates
  via BOT_TOKEN (control-plane JWT), not K8s SA token, so binding a non-existent
  credential:token-reader ClusterRole served no purpose and blocked session start
- Fix project_reconciler EventDeleted to call DeprovisionNamespace instead of
  logging "namespace retained for safety" no-op

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Adds `credentials` as a valid resource type for `acpctl get`,
`acpctl delete`, and `acpctl describe`, alongside existing verbs.
Aliases: credential, cred, creds.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Adds `kind: Credential` handling to `acpctl apply -f / -k`.
Supports create and patch semantics (created/configured/unchanged).
Token field expands env vars ($VAR syntax) matching spec usage.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- acpctl agent start now supports -o json returning the session object
- demo-github.sh: GitHub credential end-to-end demo script

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…informer

Transient errors (e.g. TSA RoleBinding not yet propagated when the project
reconciler runs ensureRunnerSecrets) caused events to be permanently dropped.

Add a retryLoop goroutine alongside dispatchLoop. Failed handlers are requeued
onto a buffered retryCh with exponential backoff (2s base, 2x per attempt, 30s
cap). After retryMaxAttempts (5) the error is logged as permanent and dropped.

Retry schedule for a single failure: 2s → 4s → 8s → 16s → 30s.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 3, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

Walkthrough

This PR adds credential management support across acpctl (apply, get, delete, describe commands), introduces event handler retry logic with exponential backoff in the control plane informer, implements namespace deprovisioning on project deletion, and removes credential RoleBinding provisioning during session reconciliation.

Changes

Cohort / File(s) Summary
CLI Credential Management
components/ambient-cli/cmd/acpctl/apply/cmd.go, components/ambient-cli/cmd/acpctl/delete/cmd.go, components/ambient-cli/cmd/acpctl/describe/cmd.go, components/ambient-cli/cmd/acpctl/get/cmd.go
Extended apply, delete, describe, and get commands to support credential/credentials resource types with appropriate CRUD operations, aliases, and help text updates. get credentials includes table output with ID, NAME, PROVIDER, DESCRIPTION, and AGE columns.
CLI Output Formatting
components/ambient-cli/cmd/acpctl/agent/cmd.go
Added --output/-o flag to agent start command; when set to json, renders response as JSON instead of human-readable text.
Demo Script
components/ambient-cli/demo-github.sh
New interactive end-to-end demo script for GitHub automation workflow; creates project, agent, credential with GitHub PAT, starts session, polls session state, streams session messages, and cleans up resources via EXIT trap.
Informer Retry Logic
components/ambient-control-plane/internal/informer/informer.go
Added retry support for failed event handler executions with exponential backoff (bounded delay) and max attempt limits; new retryLoop goroutine processes retries with scheduled fireAt timing.
Reconciliation Cleanup
components/ambient-control-plane/internal/reconciler/kube_reconciler.go, components/ambient-control-plane/internal/reconciler/project_reconciler.go
Removed credential RoleBinding provisioning from session reconciliation; added namespace deprovisioning on project deletion via DeprovisionNamespace.

Sequence Diagram(s)

sequenceDiagram
    participant Handler as Event Handler
    participant Dispatch as dispatchLoop
    participant Retry as retryLoop
    participant Scheduler as Timer/Scheduler

    Dispatch->>Handler: Execute handler
    alt Handler succeeds
        Handler-->>Dispatch: nil
    else Handler fails
        Handler-->>Dispatch: error
        Dispatch->>Dispatch: Calculate backoff delay<br/>(exponential with cap)
        Dispatch->>Retry: Send retryEvent<br/>(attempt count, fireAt)
        Retry->>Scheduler: Wait until fireAt
        Scheduler-->>Retry: Timer fires
        Retry->>Dispatch: Re-invoke dispatchEvent
        Dispatch->>Handler: Execute handler (retry attempt)
    end
Loading

Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 2 warnings)

Check name Status Explanation Resolution
Security And Secret Handling ❌ Error Pull request contains critical security vulnerabilities: GitHub PAT token exposed via CLI arguments in demo-github.sh line 277; idempotency broken in apply/cmd.go lines 297-312 missing equality checks before setting changed flag. Use --token-file or stdin for secrets; add equality checks in buildCredentialPatch() for URL, Email, Labels, Annotations; implement pre-commit hooks to detect plaintext secrets in CLI arguments.
Docstring Coverage ⚠️ Warning Docstring coverage is 6.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Kubernetes Resource Safety ⚠️ Warning PR has two safety gaps: missing OwnerReferences on child resources and MCP sidecar lacks resource limits/requests. Add OwnerReferences to child resources and define CPU/memory requests/limits (e.g., 100m/256Mi requests, 500m/512Mi limits) for MCP sidecar.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed Title follows Conventional Commits format (fix(control-plane): ...) and directly addresses the main change: retry handler errors in the informer.
Performance And Algorithmic Complexity ✅ Passed Performance-critical components are well-designed: informer retry uses 256-sized buffered channel with 5 max attempts and 30s backoff cap; credentials list enforces pagination (.Size(args.limit) with 100 default); demo polling has bounded timeouts (180s/300s defaults) with deadline checks.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/cp-credential-rolebinding-and-project-delete
✨ Simplify code
  • Create PR with simplified code
  • Commit simplified code in branch fix/cp-credential-rolebinding-and-project-delete

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@markturansky markturansky merged commit 281a643 into alpha Apr 3, 2026
44 checks passed
@markturansky markturansky deleted the fix/cp-credential-rolebinding-and-project-delete branch April 3, 2026 01:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant