Skip to content

fix(manifests): prevent RWO PVC multi-attach errors during backend ro…#853

Merged
Gkrumbach07 merged 2 commits intomainfrom
fix/backend-pvc-multi-attach
Mar 9, 2026
Merged

fix(manifests): prevent RWO PVC multi-attach errors during backend ro…#853
Gkrumbach07 merged 2 commits intomainfrom
fix/backend-pvc-multi-attach

Conversation

@maskarb
Copy link
Copy Markdown
Contributor

@maskarb maskarb commented Mar 9, 2026

Summary

Set strategy.type: Recreate on the backend-api deployment to prevent multi-attach errors on the backend-state-pvc (ReadWriteOnce) during rollouts on multi-node clusters.

Problem

The backend deployment uses a RWO PVC (backend-state-pvc) for session state persistence. With the default RollingUpdate strategy, the new pod can be scheduled on a different node before the old pod fully terminates and releases the volume, causing a multi-attach error that blocks the deployment.

Fix

Recreate strategy ensures the old pod is fully terminated (and its volume detached) before the new pod is created. This is the Kubernetes-recommended approach for deployments with RWO PVCs and is consistent with the existing replicas: 1 constraint.

Test plan

  • Deploy to multi-node cluster and verify rollout completes without multi-attach errors

Fixes: RHOAIENG-52353

🤖 Generated with Claude Code

@github-actions

This comment has been minimized.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 9, 2026

Claude Code Review

Summary

This PR adds a RollingUpdate strategy (maxSurge=0, maxUnavailable=1) to the backend-api Deployment to prevent RWO PVC multi-attach errors on multi-node clusters. The change is minimal, correct, and directly solves a real operational problem. No security, architectural, or error-handling concerns are present.

Issues by Severity

Blocker Issues: None

Critical Issues: None

Major Issues: None

Minor Issues:

  1. Consider Recreate strategy for clearer intent

File: components/manifests/base/backend-deployment.yaml lines 9-13

The combination of RollingUpdate + maxSurge=0 + maxUnavailable=1 is functionally equivalent to Recreate for a replicas=1 deployment. The native Recreate strategy is the canonical Kubernetes pattern for single-replica RWO PVC workloads - it documents the intent without requiring readers to reason through the interaction of two numeric parameters, and uses 4 fewer lines. The current approach is correct; this is a readability observation only.

  1. maxUnavailable comment describes effect, not mechanism

File: components/manifests/base/backend-deployment.yaml line 13

The comment 'Allow the old pod to terminate first' describes the desired outcome rather than the Kubernetes semantic. maxUnavailable=1 means at most 1 pod can be unavailable during the rollout. Given replicas=1 and maxSurge=0, this forces sequential terminate-then-replace behaviour, but that connection is non-obvious. A comment like 'With replicas=1 and maxSurge=0, forces terminate-before-replace' would be more precise. Minor nit - existing comments together do convey the intent.

Positive Highlights

  • Minimal and focused: 5 lines of YAML that directly fix a concrete production failure mode with no collateral changes.
  • Well-reasoned PR description: Clearly explains the root cause (RWO volumes can only be attached to one node at a time) and the mechanism of the fix.
  • Consistent with existing awareness: The pre-existing comment about single pod for RWO PVC shows this constraint was already documented; this PR closes the loop by enforcing it at the rollout strategy level.

Recommendations

  1. (Optional) Consider replacing the RollingUpdate strategy with strategy type Recreate - this is the idiomatic Kubernetes pattern for single-replica RWO PVC workloads.
  2. Fix is correct and ready to merge as-is.

Reviewed by Claude Code against ambient-code/platform standards

…llouts

Set maxSurge=0 and maxUnavailable=1 on the backend-api deployment so
the old pod is fully terminated (and its RWO volume detached) before
the new pod is created. This prevents multi-attach errors on
multi-node clusters where the new pod could be scheduled on a
different node while the old pod still holds the volume.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@maskarb maskarb force-pushed the fix/backend-pvc-multi-attach branch from 29a4f1c to 4831e7f Compare March 9, 2026 15:25
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 9, 2026

Ambient Code Platform

Kubernetes-native AI automation platform that orchestrates agentic sessions through containerized microservices. Built with Go (backend, operator), NextJS + Shadcn (frontend), Python (runner), and Kubernetes CRDs.

Technical artifacts still use "vteam" for backward compatibility.

Structure

  • components/backend/ - Go REST API (Gin), manages K8s Custom Resources with multi-tenant project isolation
  • components/frontend/ - NextJS web UI for session management and monitoring
  • components/operator/ - Go Kubernetes controller, watches CRDs and creates Jobs
  • components/runners/ambient-runner/ - Python runner executing Claude Code CLI in Job pods
  • components/ambient-cli/ - Go CLI (acpctl), manages agentic sessions from the command line
  • components/public-api/ - Stateless HTTP gateway, proxies to backend (no direct K8s access)
  • components/manifests/ - Kustomize-based deployment manifests and overlays
  • e2e/ - Cypress end-to-end tests
  • docs/ - Astro Starlight documentation site

Key Files

  • CRD definitions: components/manifests/base/crds/agenticsessions-crd.yaml, projectsettings-crd.yaml
  • Session lifecycle: components/backend/handlers/sessions.go, components/operator/internal/handlers/sessions.go
  • Auth & RBAC middleware: components/backend/handlers/middleware.go
  • K8s client init: components/operator/internal/config/config.go
  • Runner entry point: components/runners/ambient-runner/main.py
  • Route registration: components/backend/routes.go
  • Frontend API layer: components/frontend/src/services/api/, src/services/queries/

Session Flow

User Creates Session → Backend Creates CR → Operator Spawns Job →
Pod Runs Claude CLI → Results Stored in CR → UI Displays Progress

Commands

make build-all                # Build all container images
make deploy                   # Deploy to cluster
make test                     # Run tests
make lint                     # Lint code
make kind-up                  # Start local Kind cluster
make test-e2e-local           # Run E2E tests against Kind

Per-Component

# Backend / Operator (Go)
cd components/backend && gofmt -l . && go vet ./... && golangci-lint run
cd components/operator && gofmt -l . && go vet ./... && golangci-lint run

# Frontend
cd components/frontend && npm run build  # Must pass with 0 errors, 0 warnings

# Runner (Python)
cd components/runners/ambient-runner && uv venv && uv pip install -e .

# Docs
cd docs && npm run dev  # http://localhost:4321

Critical Context

  • User token auth required: All user-facing API ops use GetK8sClientsForRequest(c), never the backend service account
  • OwnerReferences on all child resources: Jobs, Secrets, PVCs must have controller owner refs
  • No panic() in production: Return explicit fmt.Errorf with context
  • No any types in frontend: Use proper types, unknown, or generic constraints
  • Conventional commits: Squashed on merge to main

Pre-commit Hooks

The project uses the pre-commit framework to run linters locally before every commit. Configuration lives in .pre-commit-config.yaml.

Install

make setup-hooks

What Runs

On every git commit:

Hook Scope
trailing-whitespace, end-of-file-fixer, check-yaml, check-added-large-files, check-merge-conflict, detect-private-key All files
ruff-format, ruff (check + fix) Python (runners, scripts)
gofmt, go vet, golangci-lint Go (backend, operator, public-api — per-module)
eslint Frontend TypeScript/JavaScript
branch-protection Blocks commits to main/master/production

On every git push:

Hook Scope
push-protection Blocks pushes to main/master/production

Run Manually

make lint                                    # All hooks, all files
pre-commit run gofmt-check --all-files       # Single hook
pre-commit run --files path/to/file.go       # Single file

Skip Hooks

git commit --no-verify    # Skip pre-commit hooks
git push --no-verify      # Skip pre-push hooks

Notes

  • Go and ESLint wrappers (scripts/pre-commit/) skip gracefully if the toolchain is not installed
  • tsc --noEmit and npm run build are not included (slow; CI gates on them)
  • Branch/push protection scripts remain in scripts/git-hooks/ and are invoked by pre-commit

Testing

  • Frontend unit tests: cd components/frontend && npx vitest run --coverage (466 tests, ~74% coverage). See components/frontend/README.md.
  • E2E tests: cd e2e && npx cypress run --browser chrome (58 tests, mock SDK). See e2e/README.md.
  • Runner tests: cd components/runners/ambient-runner && python -m pytest tests/
  • Backend tests: cd components/backend && make test. See components/backend/TEST_GUIDE.md.

More Info

See BOOKMARKS.md for architecture decisions, development context, code patterns, and component-specific guides.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 9, 2026

Claude Code Review test

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 9, 2026

NOTE: Full review posted in follow-up comment below.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 9, 2026

Ambient Code Platform

Kubernetes-native AI automation platform that orchestrates agentic sessions through containerized microservices. Built with Go (backend, operator), NextJS + Shadcn (frontend), Python (runner), and Kubernetes CRDs.

Technical artifacts still use "vteam" for backward compatibility.

Structure

  • components/backend/ - Go REST API (Gin), manages K8s Custom Resources with multi-tenant project isolation
  • components/frontend/ - NextJS web UI for session management and monitoring
  • components/operator/ - Go Kubernetes controller, watches CRDs and creates Jobs
  • components/runners/ambient-runner/ - Python runner executing Claude Code CLI in Job pods
  • components/ambient-cli/ - Go CLI (acpctl), manages agentic sessions from the command line
  • components/public-api/ - Stateless HTTP gateway, proxies to backend (no direct K8s access)
  • components/manifests/ - Kustomize-based deployment manifests and overlays
  • e2e/ - Cypress end-to-end tests
  • docs/ - Astro Starlight documentation site

Key Files

  • CRD definitions: components/manifests/base/crds/agenticsessions-crd.yaml, projectsettings-crd.yaml
  • Session lifecycle: components/backend/handlers/sessions.go, components/operator/internal/handlers/sessions.go
  • Auth & RBAC middleware: components/backend/handlers/middleware.go
  • K8s client init: components/operator/internal/config/config.go
  • Runner entry point: components/runners/ambient-runner/main.py
  • Route registration: components/backend/routes.go
  • Frontend API layer: components/frontend/src/services/api/, src/services/queries/

Session Flow

User Creates Session → Backend Creates CR → Operator Spawns Job →
Pod Runs Claude CLI → Results Stored in CR → UI Displays Progress

Commands

make build-all                # Build all container images
make deploy                   # Deploy to cluster
make test                     # Run tests
make lint                     # Lint code
make kind-up                  # Start local Kind cluster
make test-e2e-local           # Run E2E tests against Kind

Per-Component

# Backend / Operator (Go)
cd components/backend && gofmt -l . && go vet ./... && golangci-lint run
cd components/operator && gofmt -l . && go vet ./... && golangci-lint run

# Frontend
cd components/frontend && npm run build  # Must pass with 0 errors, 0 warnings

# Runner (Python)
cd components/runners/ambient-runner && uv venv && uv pip install -e .

# Docs
cd docs && npm run dev  # http://localhost:4321

Critical Context

  • User token auth required: All user-facing API ops use GetK8sClientsForRequest(c), never the backend service account
  • OwnerReferences on all child resources: Jobs, Secrets, PVCs must have controller owner refs
  • No panic() in production: Return explicit fmt.Errorf with context
  • No any types in frontend: Use proper types, unknown, or generic constraints
  • Conventional commits: Squashed on merge to main

Pre-commit Hooks

The project uses the pre-commit framework to run linters locally before every commit. Configuration lives in .pre-commit-config.yaml.

Install

make setup-hooks

What Runs

On every git commit:

Hook Scope
trailing-whitespace, end-of-file-fixer, check-yaml, check-added-large-files, check-merge-conflict, detect-private-key All files
ruff-format, ruff (check + fix) Python (runners, scripts)
gofmt, go vet, golangci-lint Go (backend, operator, public-api — per-module)
eslint Frontend TypeScript/JavaScript
branch-protection Blocks commits to main/master/production

On every git push:

Hook Scope
push-protection Blocks pushes to main/master/production

Run Manually

make lint                                    # All hooks, all files
pre-commit run gofmt-check --all-files       # Single hook
pre-commit run --files path/to/file.go       # Single file

Skip Hooks

git commit --no-verify    # Skip pre-commit hooks
git push --no-verify      # Skip pre-push hooks

Notes

  • Go and ESLint wrappers (scripts/pre-commit/) skip gracefully if the toolchain is not installed
  • tsc --noEmit and npm run build are not included (slow; CI gates on them)
  • Branch/push protection scripts remain in scripts/git-hooks/ and are invoked by pre-commit

Testing

  • Frontend unit tests: cd components/frontend && npx vitest run --coverage (466 tests, ~74% coverage). See components/frontend/README.md.
  • E2E tests: cd e2e && npx cypress run --browser chrome (58 tests, mock SDK). See e2e/README.md.
  • Runner tests: cd components/runners/ambient-runner && python -m pytest tests/
  • Backend tests: cd components/backend && make test. See components/backend/TEST_GUIDE.md.

More Info

See BOOKMARKS.md for architecture decisions, development context, code patterns, and component-specific guides.

1 similar comment
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 9, 2026

Ambient Code Platform

Kubernetes-native AI automation platform that orchestrates agentic sessions through containerized microservices. Built with Go (backend, operator), NextJS + Shadcn (frontend), Python (runner), and Kubernetes CRDs.

Technical artifacts still use "vteam" for backward compatibility.

Structure

  • components/backend/ - Go REST API (Gin), manages K8s Custom Resources with multi-tenant project isolation
  • components/frontend/ - NextJS web UI for session management and monitoring
  • components/operator/ - Go Kubernetes controller, watches CRDs and creates Jobs
  • components/runners/ambient-runner/ - Python runner executing Claude Code CLI in Job pods
  • components/ambient-cli/ - Go CLI (acpctl), manages agentic sessions from the command line
  • components/public-api/ - Stateless HTTP gateway, proxies to backend (no direct K8s access)
  • components/manifests/ - Kustomize-based deployment manifests and overlays
  • e2e/ - Cypress end-to-end tests
  • docs/ - Astro Starlight documentation site

Key Files

  • CRD definitions: components/manifests/base/crds/agenticsessions-crd.yaml, projectsettings-crd.yaml
  • Session lifecycle: components/backend/handlers/sessions.go, components/operator/internal/handlers/sessions.go
  • Auth & RBAC middleware: components/backend/handlers/middleware.go
  • K8s client init: components/operator/internal/config/config.go
  • Runner entry point: components/runners/ambient-runner/main.py
  • Route registration: components/backend/routes.go
  • Frontend API layer: components/frontend/src/services/api/, src/services/queries/

Session Flow

User Creates Session → Backend Creates CR → Operator Spawns Job →
Pod Runs Claude CLI → Results Stored in CR → UI Displays Progress

Commands

make build-all                # Build all container images
make deploy                   # Deploy to cluster
make test                     # Run tests
make lint                     # Lint code
make kind-up                  # Start local Kind cluster
make test-e2e-local           # Run E2E tests against Kind

Per-Component

# Backend / Operator (Go)
cd components/backend && gofmt -l . && go vet ./... && golangci-lint run
cd components/operator && gofmt -l . && go vet ./... && golangci-lint run

# Frontend
cd components/frontend && npm run build  # Must pass with 0 errors, 0 warnings

# Runner (Python)
cd components/runners/ambient-runner && uv venv && uv pip install -e .

# Docs
cd docs && npm run dev  # http://localhost:4321

Critical Context

  • User token auth required: All user-facing API ops use GetK8sClientsForRequest(c), never the backend service account
  • OwnerReferences on all child resources: Jobs, Secrets, PVCs must have controller owner refs
  • No panic() in production: Return explicit fmt.Errorf with context
  • No any types in frontend: Use proper types, unknown, or generic constraints
  • Conventional commits: Squashed on merge to main

Pre-commit Hooks

The project uses the pre-commit framework to run linters locally before every commit. Configuration lives in .pre-commit-config.yaml.

Install

make setup-hooks

What Runs

On every git commit:

Hook Scope
trailing-whitespace, end-of-file-fixer, check-yaml, check-added-large-files, check-merge-conflict, detect-private-key All files
ruff-format, ruff (check + fix) Python (runners, scripts)
gofmt, go vet, golangci-lint Go (backend, operator, public-api — per-module)
eslint Frontend TypeScript/JavaScript
branch-protection Blocks commits to main/master/production

On every git push:

Hook Scope
push-protection Blocks pushes to main/master/production

Run Manually

make lint                                    # All hooks, all files
pre-commit run gofmt-check --all-files       # Single hook
pre-commit run --files path/to/file.go       # Single file

Skip Hooks

git commit --no-verify    # Skip pre-commit hooks
git push --no-verify      # Skip pre-push hooks

Notes

  • Go and ESLint wrappers (scripts/pre-commit/) skip gracefully if the toolchain is not installed
  • tsc --noEmit and npm run build are not included (slow; CI gates on them)
  • Branch/push protection scripts remain in scripts/git-hooks/ and are invoked by pre-commit

Testing

  • Frontend unit tests: cd components/frontend && npx vitest run --coverage (466 tests, ~74% coverage). See components/frontend/README.md.
  • E2E tests: cd e2e && npx cypress run --browser chrome (58 tests, mock SDK). See e2e/README.md.
  • Runner tests: cd components/runners/ambient-runner && python -m pytest tests/
  • Backend tests: cd components/backend && make test. See components/backend/TEST_GUIDE.md.

More Info

See BOOKMARKS.md for architecture decisions, development context, code patterns, and component-specific guides.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 9, 2026

Claude Code Review

Summary

This PR adds a Recreate deployment strategy to backend-deployment.yaml to prevent ReadWriteOnce PVC multi-attach errors on multi-node clusters. The change is technically correct and well-scoped (2 lines). Recreate is actually the stronger solution for RWO PVC scenarios compared to what the PR description advertises, though the description mismatch is a minor issue.

Issues by Severity

Blocker Issues

None

Critical Issues

None

Major Issues

None

Minor Issues

1. PR description does not match the actual implementation

  • File: components/manifests/base/backend-deployment.yaml
  • Problem: The PR body says "Set maxSurge=0 and maxUnavailable=1 on the backend-api deployment" but the actual change uses strategy.type: Recreate. These are meaningfully different approaches. RollingUpdate with maxSurge=0, maxUnavailable=1 still uses the rolling update machinery and can still race with async PVC detach on multi-node clusters. Recreate is the stronger guarantee. The mismatch is misleading for anyone reading Git history.
  • Standard violated: CLAUDE.md — conventional commits squashed on merge to main should accurately reflect what landed.
  • Suggested fix: Update the squash commit message to state that Recreate strategy is used and note the accepted brief downtime window on rollouts.

2. Accepted downtime tradeoff is implicit

  • File: components/manifests/base/backend-deployment.yaml, added strategy block
  • Problem: Recreate produces a zero-pod window between termination of the old pod and start of the new one. A future contributor might change this back to RollingUpdate without realising the RWO constraint. The existing replicas: 1 # Single pod for RWO PVC comment is close, but does not capture the deploy-time implication.
  • Suggested: Extend the inline comment, e.g.: "Required: RWO PVC cannot be multi-attached. Accepts brief downtime on redeploy; acceptable for single-replica RWO workloads."

Positive Highlights

  • Recreate is the right call. Unlike RollingUpdate with maxSurge=0, Recreate provides a hard guarantee that no new pod is scheduled until all old pods are fully terminated. For RWO volumes on multi-node clusters, async PVC detach means RollingUpdate (even conservatively configured) can still race. The implementation is actually stronger than the PR description suggests.
  • Minimal, focused change. Two lines with a clear inline comment — no scope creep, no unrelated edits.
  • Consistent with existing intent. The pre-existing replicas: 1 # Single pod for RWO PVC comment shows clear awareness of the RWO constraint; this change reinforces it at the rollout-strategy level.

Recommendations

  1. Update the squash commit message to reflect Recreate rather than maxSurge/maxUnavailable, to keep Git history accurate.
  2. Optionally extend the inline comment to explicitly call out the accepted downtime tradeoff, as a guard against future refactors.

Reviewed by Claude Code using repository standards from .claude/context/ and .claude/patterns/.


🔍 View AI decision process (logs available for 90 days)

📋 View memory system files loaded (click to expand)

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

@Gkrumbach07 Gkrumbach07 merged commit 3f669f9 into main Mar 9, 2026
17 checks passed
@Gkrumbach07 Gkrumbach07 deleted the fix/backend-pvc-multi-attach branch March 9, 2026 20:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants