Skip to content

Conversation

@cooktheryan
Copy link
Collaborator

No description provided.

@github-actions
Copy link
Contributor

github-actions bot commented Nov 11, 2025

PR Review: JWT Vertex Credentials Architecture

This PR refactors Vertex AI credential management from filesystem-based to API-based authentication. This is a significant security improvement that aligns with zero-trust principles. Overall, the implementation is solid with excellent test coverage, but I've identified several critical issues that need attention before merge.


🔴 Critical Issues

1. Backend Reads Secret from Wrong Namespace

Location: components/backend/handlers/sessions.go:924-926

Problem: The code comment says "Load Vertex credentials from ambient-vertex secret in the project namespace", but the production config comment (operator-config-openshift.yaml) says "Ensure ambient-vertex secret exists in backend namespace".

// Load Vertex credentials from ambient-vertex secret in the project namespace
vertexSecretName := "ambient-vertex"
vertexSecret, err := K8sClient.CoreV1().Secrets(project).Get(c.Request.Context(), vertexSecretName, v1.GetOptions{})

This reads from project namespace (line 926), but the production config says:

# Ensure ambient-vertex secret exists in backend namespace with:
#   kubectl create secret generic ambient-vertex \
#     -n <backend-namespace>

Impact: This is a critical mismatch. The backend will fail to find the secret unless it's duplicated in every project namespace, defeating the security benefit of centralized credential management.

Fix Required: Either:

  • Change line 926 to read from appConfig.BackendNamespace OR
  • Update documentation to clarify secrets must exist per-project

Recommendation: Read from backend namespace for centralized management (matches original architecture goal).


2. Missing Token Length Logging

Location: components/backend/handlers/sessions.go:865-869

Problem: Token validation doesn't log token length, violating CLAUDE.md security pattern (line 932):

token := strings.TrimSpace(parts[1])
if token == "" {
    c.JSON(http.StatusUnauthorized, gin.H{"error": "empty token"})
    return
}

Per CLAUDE.md:

// NEVER log the token itself
log.Printf("Processing request with token (len=%d)", len(token))

Impact: Hinders debugging when token auth fails.

Fix: Add: log.Printf("MintSessionVertexCredentials: tokenLen=%d", len(token))


3. Test Assumes Wrong Secret Namespace

Location: components/backend/handlers/sessions_test.go:22-32

Problem: Test creates secret in backend-namespace (line 25) but mocks request for test-project (line 60), yet test passes. This only works because the fake K8s client doesn't enforce namespace isolation.

vertexSecret := &corev1.Secret{
    ObjectMeta: metav1.ObjectMeta{
        Name:      "ambient-vertex",
        Namespace: "backend-namespace",  // ❌ Wrong namespace
    },
    // ...
}
// ...
req := httptest.NewRequest("POST", "/api/projects/test-project/agentic-sessions/test-session/vertex/credentials", nil)

Impact: Test doesn't catch the namespace mismatch issue (#1), giving false confidence.

Fix: Update test to match actual implementation (create secret in test-project OR change implementation to read from backend namespace and update test accordingly).


⚠️ High Priority Issues

4. Missing Error Context in Backend

Location: components/backend/handlers/sessions.go:874-876

Per CLAUDE.md: "Always log errors with context"

rv, err := K8sClient.AuthenticationV1().TokenReviews().Create(c.Request.Context(), tr, v1.CreateOptions{})
if err != nil {
    c.JSON(http.StatusInternalServerError, gin.H{"error": "token review failed"})
    return
}

Fix: Add: log.Printf("TokenReview failed for session %s/%s: %v", project, sessionName, err)


5. Type-Unsafe Metadata Access

Location: components/backend/handlers/sessions.go:911-912

Per CLAUDE.md: "FORBIDDEN: Direct type assertions without checking"

meta, _ := obj.Object["metadata"].(map[string]interface{})
anns, _ := meta["annotations"].(map[string]interface{})

Problem: Ignores ok value - will panic if metadata or annotations aren't maps.

Fix: Use three-value returns and check ok:

meta, ok := obj.Object["metadata"].(map[string]interface{})
if !ok {
    c.JSON(http.StatusInternalServerError, gin.H{"error": "invalid session metadata"})
    return
}
anns, _ := meta["annotations"].(map[string]interface{}) // This is safe after meta check

6. Nil Secret Data Not Checked

Location: components/backend/handlers/sessions.go:948-949

c.JSON(http.StatusOK, gin.H{
    "credentials": string(credJSON),
    "projectId":   string(vertexSecret.Data["project-id"]),    // ❌ May be nil
    "region":      string(vertexSecret.Data["region"]),       // ❌ May be nil
})

Problem: If project-id or region keys don't exist, string(nil) returns empty string without error.

Impact: Runner will fail silently with empty config values.

Fix: Validate before response:

projectID, ok := vertexSecret.Data["project-id"]
if !ok || len(projectID) == 0 {
    c.JSON(http.StatusNotFound, gin.H{"error": "Vertex credentials missing project-id"})
    return
}
region, ok := vertexSecret.Data["region"]
if !ok || len(region) == 0 {
    c.JSON(http.StatusNotFound, gin.H{"error": "Vertex credentials missing region"})
    return
}
c.JSON(http.StatusOK, gin.H{
    "credentials": string(credJSON),
    "projectId":   string(projectID),
    "region":      string(region),
})

7. Python Wrapper Hardcodes Credential Path

Location: components/runners/claude-code-runner/wrapper.py:662

creds_path = '/tmp/vertex-credentials.json'

Problem: Multiple concurrent sessions in the same pod would overwrite each other's credentials.

Impact: Race conditions in multi-session environments (if ever implemented).

Fix: Use session-specific path:

creds_path = f'/tmp/vertex-credentials-{self.context.session_id}.json'

📝 Medium Priority Issues

8. Inconsistent Error Messages

  • Line 929: "Vertex credentials not configured for this project"
  • Line 940: "Vertex credentials missing ambient-code-key.json"

Issue: First message is user-facing (project not configured), second is technical (missing key in existing secret).

Suggestion: Make both user-friendly or add context to logs.


9. Missing Cleanup in Python Wrapper

Location: wrapper.py:662-670

Credentials file is written to /tmp but never cleaned up.

Impact: Credentials remain on disk until pod termination (minor risk since /tmp is tmpfs).

Fix: Add cleanup in wrapper shutdown/error paths:

try:
    os.remove(creds_path)
except:
    pass

10. Operator Log Message Unclear

Location: components/operator/internal/handlers/sessions.go:261

log.Printf("Vertex AI enabled - runner will fetch credentials from backend API")

Suggestion: Add session context: log.Printf("Session %s/%s: Vertex AI enabled - runner will fetch credentials from backend API", sessionNamespace, name)


✅ Strengths

  1. Excellent Security Improvement: Moving from filesystem-based to API-based credentials is a major security win
  2. Comprehensive Test Coverage: Backend tests cover all edge cases (8 test functions)
  3. Proper Token Validation: TokenReview validates SA namespace and matches CR annotation
  4. RBAC Cleanup: Removed unnecessary secrets permissions from operator ClusterRole
  5. Clean Operator Refactoring: Removed 150+ lines of complex secret-copying logic
  6. Python Tests: Test coverage for new _fetch_vertex_credentials method
  7. Zero-Trust Architecture: Runner can't access credentials without valid BOT_TOKEN

📋 Pre-Merge Checklist

Before merging, please address:


🎯 Architecture Validation

The overall architecture is sound and follows CLAUDE.md guidelines:

Authentication Pattern: Uses backend service account (K8sClient) for reading secrets - correct per CLAUDE.md
Token Security: BOT_TOKEN validation via TokenReview with namespace + SA verification
Error Handling: Proper IsNotFound checks, appropriate HTTP status codes
Separation of Concerns: Backend owns secrets, operator only passes flag
Security Context: Credentials fetched on-demand, not persisted in pod filesystem


🔧 Suggested Follow-up Work (Post-Merge)

  1. Add integration test for full credential flow (backend → runner → Vertex AI)
  2. Consider adding rate limiting to vertex/credentials endpoint
  3. Add metrics for credential fetch success/failure rates
  4. Document secret creation process in production deployment guide

Summary

This PR represents a significant architectural improvement. The credential-on-demand pattern is exactly right. However, the namespace mismatch between code, tests, and documentation is a blocker that must be resolved before merge. Once the critical issues are addressed, this will be production-ready.

Recommendation: Request changes for critical issues (#1, #3), rest can be addressed before merge or as follow-up PR.

@github-actions
Copy link
Contributor

Pull Request Review: JWT/Vertex AI Credential Management

This PR refactors how Vertex AI credentials are managed, moving from a secret-copying approach to a backend-served JWT token pattern. This is a significant architectural improvement that enhances security and simplifies the RBAC model.

Strengths

  1. Excellent Security Architecture

    • Centralizes credential management in the backend namespace (single source of truth)
    • Uses proper ServiceAccount token validation via TokenReview API
    • Verifies namespace and SA matching against CR annotations
    • Never logs sensitive tokens (uses tokenLen=%d pattern)
    • Removes need for operator to have cross-namespace secret permissions
  2. Comprehensive Test Coverage

    • 8 test cases covering all security boundaries: success path, missing auth, invalid token, wrong namespace, session not found, SA mismatch, secret not found, missing credential file
    • Uses proper fake K8s clients and reactor patterns
    • Good edge case coverage
  3. Clean Operator Simplification

    • Removes secret copying logic and cleanup code
    • Reduces ClusterRole permissions (no more cross-namespace secret access)
    • Operator only needs CLAUDE_CODE_USE_VERTEX flag
    • Removes unused environment variables from ConfigMaps
  4. Follows Established Patterns

    • Mirrors MintSessionGitHubToken handler pattern (handlers/sessions.go:811-845)
    • Uses backend service account for reading backend namespace secret (correct per CLAUDE.md guidelines)
    • Returns structured JSON with credentials, projectId, region

🔧 Issues to Address

Critical Issues

  1. Duplicate setupTestGVR() Call (sessions_test.go:70)

    DynamicClient = fake.NewSimpleDynamicClient(s, sessionCR)
    setupTestGVR()
    setupTestGVR()  // ❌ Duplicate call

    Fix: Remove line 70

  2. Missing Error Check (sessions_test.go:351)

    var response map[string]interface{}
    json.Unmarshal(w.Body.Bytes(), &response)  // ❌ Error ignored
    if !contains(response["error"].(string), "Vertex credentials not configured") {

    Fix:

    if err := json.Unmarshal(w.Body.Bytes(), &response); err != nil {
        t.Fatalf("Failed to parse response: %v", err)
    }
  3. Potential Nil Pointer Panic (sessions_test.go:352)
    The test assumes response["error"] exists and is a string, but doesn't check:

    if !contains(response["error"].(string), "Vertex credentials not configured") {

    Fix:

    errMsg, ok := response["error"].(string)
    if !ok || !contains(errMsg, "Vertex credentials not configured") {
        t.Errorf("Expected error about missing credentials, got: %v", response)
    }
  4. Custom String Helper Reinvents stdlib (sessions_test.go:421-432)

    func stringContains(s, substr string) bool {
        for i := 0; i <= len(s)-len(substr); i++ {
            if s[i:i+len(substr)] == substr {
                return true
            }
        }
        return false
    }

    Fix: Use strings.Contains() from stdlib instead

Medium Issues

  1. Type Assertion Pattern Inconsistency
    The new handler uses direct type assertions in several places:

    meta, _ := obj.Object["metadata"].(map[string]interface{})
    anns, _ := meta["annotations"].(map[string]interface{})

    Per CLAUDE.md backend standards ("Type-Safe Unstructured Access"), should use unstructured.Nested* helpers:

    anns, found, err := unstructured.NestedStringMap(obj.Object, "metadata", "annotations")
    if !found || err != nil {
        c.JSON(http.StatusInternalServerError, gin.H{"error": "invalid session metadata"})
        return
    }
    expectedSA, ok := anns["ambient-code.io/runner-sa"]
  2. Secret Data Keys Not Validated
    The handler reads vertexSecret.Data["project-id"] and ["region"] without checking if they exist. Should validate all required keys:

    projectID, ok := vertexSecret.Data["project-id"]
    if !ok {
        c.JSON(http.StatusNotFound, gin.H{"error": "Vertex secret missing project-id"})
        return
    }
  3. Inconsistent Backend Namespace Naming
    handlers/sessions.go:930 uses:

    backendNamespace := Namespace
    vertexSecretName := "ambient-vertex"

    But the comment says "Backend reads from its own namespace". Consider logging this for debugging:

    log.Printf("Reading Vertex credentials from backend namespace: %s", backendNamespace)

Minor Issues

  1. Missing Context Timeout
    The handler uses c.Request.Context() without timeout. Consider adding timeout for secret read operations:

    ctx, cancel := context.WithTimeout(c.Request.Context(), 5*time.Second)
    defer cancel()
    vertexSecret, err := K8sClient.CoreV1().Secrets(backendNamespace).Get(ctx, vertexSecretName, v1.GetOptions{})
  2. Unused Import in Operator
    components/operator/internal/handlers/sessions.go:24 removed usage of k8s.io/client-go/util/retry but didn't remove the import (verify with go build)

  3. Documentation Mismatch
    The production ConfigMap comment (operator-config-openshift.yaml:10-15) shows manual kubectl create secret command but doesn't specify which namespace. Should clarify:

    # Ensure ambient-vertex secret exists in backend namespace (same as backend pod) with:
    #   kubectl create secret generic ambient-vertex \
    #     --from-file=ambient-code-key.json=/path/to/service-account.json \
    #     --from-literal=project-id=YOUR_PROJECT \
    #     --from-literal=region=YOUR_REGION \
    #     -n ambient-code  # <-- or your backend namespace

📋 Code Quality Checklist (from CLAUDE.md)

  • User token authentication pattern (uses backend SA for backend namespace secret - correct)
  • No panic() in production code
  • Token redaction in logs
  • [⚠️] Type-safe unstructured access (Issue Epic: Jira Integration & Workflow #5)
  • Proper error handling and logging
  • OwnerReferences not applicable (no child resources created)
  • Security context patterns not applicable (no pods created)
  • Tests added

🚀 Recommendations

  1. Add Integration Test: Create a test that validates the full flow: operator starts job → runner calls endpoint → receives credentials
  2. Document Secret Structure: Add to docs/configuration.md the exact structure of the ambient-vertex secret
  3. Consider Secret Validation: Add a preflight check in the backend startup to validate the vertex secret exists and has required keys
  4. Add Metrics: Consider adding Prometheus metrics for vertex credential requests (success/failure counts)
  5. Rate Limiting: Consider rate limiting this endpoint to prevent credential exposure via brute force

🎯 Priority Actions Before Merge

Must Fix:

Should Fix:

Nice to Have:

📊 Overall Assessment

This is a well-architected change that significantly improves the security posture of Vertex AI credential management. The approach of centralizing credentials in the backend and serving them via authenticated API is the right design pattern.

The test coverage is excellent and follows Go testing best practices. The main issues are minor bugs in the test code and some opportunities to follow established type-safety patterns more closely.

Recommendation: Approve after fixing the 3 must-fix issues. The should-fix issues can be addressed in a follow-up PR if time is constrained.


Testing Checklist (run before merge):

cd components/backend
go test -v ./handlers -run TestMintSessionVertexCredentials
gofmt -l .
go vet ./...
golangci-lint run

@github-actions
Copy link
Contributor

Code Review: JWT Vertex Authentication Changes

This PR implements a significant architectural improvement by moving Vertex AI credential management from operator-level secret copying to a centralized backend API service with JWT authentication. I've conducted a comprehensive review based on the project's CLAUDE.md guidelines.


Summary

Architecture Change: Transitions from operator copying ambient-vertex secrets to each project namespace → backend API serving credentials on-demand to authorized runners via JWT tokens.

Key Benefits:

  • ✅ Reduced RBAC permissions (operator no longer needs cross-namespace secret access)
  • ✅ Centralized credential management
  • ✅ Improved security with per-session token validation
  • ✅ Eliminates secret proliferation across namespaces

Detailed Findings

✅ Strengths

1. Backend Implementation (components/backend/handlers/sessions.go:848-956)

Excellent authentication pattern:

  • Follows CLAUDE.md guidelines for token security (never logs token values - line 872)
  • Proper TokenReview validation with ServiceAccount verification (lines 874-900)
  • Validates SA matches session annotation ambient-code.io/runner-sa (lines 903-925)
  • Namespace isolation check prevents cross-project access (lines 897-900)

Security best practices:

  • Uses backend service account to read from own namespace only (line 931)
  • Returns generic error messages to API clients (no internal details leaked)
  • Proper error handling with IsNotFound checks

Type safety:

  • Uses type assertions with ok checks: anns, _ := meta["annotations"].(map[string]interface{}) (line 915)
  • Could improve with unstructured.NestedStringMap() helpers per CLAUDE.md standards

2. Comprehensive Test Coverage (components/backend/handlers/sessions_test.go)

432 lines of new tests covering:

  • ✅ Success path with valid SA token
  • ✅ Missing/invalid Authorization header
  • ✅ Invalid token (unauthenticated)
  • ✅ Wrong namespace (cross-project access attempt)
  • ✅ Session not found
  • ✅ SA name mismatch
  • ✅ Secret not found
  • ✅ Missing credential file in secret

Strong adherence to Go testing patterns: Uses fake clients, proper mocking with PrependReactor, clear test names.

3. Operator Simplification (components/operator/internal/handlers/sessions.go)

Removed 166 lines of secret copying logic:

  • Eliminates copySecretToNamespace() complexity
  • Removes deleteAmbientVertexSecret() cleanup code
  • Cleaner reconciliation loop

Better separation of concerns: Operator focuses on Job creation, backend handles secrets.

4. Runner Implementation (components/runners/claude-code-runner/wrapper.py)

Secure credential handling (lines 630-679):

  • Writes credentials to /tmp/vertex-credentials.json (tmpfs, in-memory)
  • Sets restrictive permissions: os.chmod(creds_path, 0o400) (owner read-only)
  • Validates all required fields before proceeding

Proper async patterns: Uses run_in_executor for blocking urllib operations (line 1729).

Comprehensive error handling: Returns empty dict on failure, clear logging at each step.


⚠️ Issues & Recommendations

Critical Issues

1. Type Safety Violation in Backend (sessions.go:914-915)

meta, _ := obj.Object["metadata"].(map[string]interface{})
anns, _ := meta["annotations"].(map[string]interface{})

Problem: Direct type assertions without checking, discarding errors.

Risk: If metadata is not a map, anns will be nil and anns["ambient-code.io/runner-sa"] will succeed but return nil.

Fix per CLAUDE.md:

meta, found, err := unstructured.NestedMap(obj.Object, "metadata")
if err \!= nil || \!found {
    c.JSON(http.StatusInternalServerError, gin.H{"error": "invalid session metadata"})
    return
}
anns, _, _ := unstructured.NestedStringMap(meta, "annotations")
expectedSA := ""
if v, ok := anns["ambient-code.io/runner-sa"]; ok {
    expectedSA = strings.TrimSpace(v)
}

Reference: CLAUDE.md "Type-Safe Unstructured Access" section.


2. Interface Type Change Lacks Justification (git/operations.go:48, github_auth.go:27, repo.go:23)

Changed from:

K8sClient *kubernetes.Clientset

To:

K8sClient kubernetes.Interface

Questions:

  • What problem does this solve?
  • Are there new mock requirements for tests?
  • Does this change API compatibility?

Recommendation: Add a comment explaining why the interface is preferred. If it's for testability, ensure all consumers can handle kubernetes.Interface.


3. Missing Error Logging in Token Review (sessions.go:877-878)

if err \!= nil {
    c.JSON(http.StatusInternalServerError, gin.H{"error": "token review failed"})
    return
}

Problem: Per CLAUDE.md error handling patterns, all errors should be logged with context before returning.

Fix:

if err \!= nil {
    log.Printf("TokenReview failed for project=%s, session=%s: %v", project, sessionName, err)
    c.JSON(http.StatusInternalServerError, gin.H{"error": "token review failed"})
    return
}

4. Missing Route Registration Context (routes.go:36)

api.POST("/projects/:projectName/agentic-sessions/:sessionName/vertex/credentials", handlers.MintSessionVertexCredentials)

Problem: Route is registered OUTSIDE the ValidateProjectContext() middleware group (line 38).

Risk: Endpoint doesn't benefit from project-level RBAC validation that other session endpoints have.

Analysis: This might be intentional since the handler does its own TokenReview, but it's inconsistent with the pattern at routes.go:33 where GitHub token minting also sits outside the middleware.

Recommendation: Add a comment explaining why this endpoint bypasses ValidateProjectContext(), or consider using the middleware for consistency.


Medium Priority Issues

5. Hardcoded Secret Name (sessions.go:930)

vertexSecretName := "ambient-vertex"

Recommendation: Extract to a constant or config variable for easier customization across environments.

const VertexSecretName = "ambient-vertex"

Or reference from operator/internal/types/resources.go if AmbientVertexSecretName constant still exists there (I see it referenced in tests).


6. Incomplete Documentation in Manifests

overlays/production/operator-config-openshift.yaml:10-15 has good instructions for creating the secret, but:

  • Doesn't specify which namespace ("backend-namespace" is generic)
  • Doesn't mention that backend needs RBAC to read secrets in its own namespace
  • Doesn't explain what happens if the secret is missing (sessions will fail with 404)

Recommendation: Add troubleshooting section to docs or README:

# Prerequisites for Vertex AI:
# 1. Create ambient-vertex secret in the BACKEND namespace (e.g., ambient-code):
#    kubectl create secret generic ambient-vertex \
#      --from-file=ambient-code-key.json=/path/to/sa.json \
#      --from-literal=project-id=YOUR_PROJECT \
#      --from-literal=region=YOUR_REGION \
#      -n ambient-code
#
# 2. Verify backend ServiceAccount can read secrets in its namespace
# 3. Set CLAUDE_CODE_USE_VERTEX=1 in operator config
#
# Troubleshooting:
# - If runners fail with "Vertex credentials not configured", check secret exists
# - If 403 errors, verify backend RBAC allows secret reads

7. Runner Test Gaps

tests/test_wrapper_vertex.py lost 157 lines (406 → 249). Need to verify:

  • ✅ Backend API fetch tests (covered)
  • ✅ Credential file writing tests (covered)
  • ❓ File permission tests (0o400 enforcement)
  • ❓ Cleanup tests (is temp file deleted on exit?)
  • ❓ Credential rotation tests (what happens if backend updates secret mid-session?)

Recommendation: Add test for file permissions:

def test_vertex_credentials_file_permissions(self):
    # After setup, verify file has mode 0o400
    assert stat.S_IMODE(os.stat('/tmp/vertex-credentials.json').st_mode) == 0o400

8. Operator RBAC Cleanup (base/rbac/operator-clusterrole.yaml)

Removed secret permissions (lines 51-54). ✅ Good!

Verify: Does operator still need ANY secret permissions? Check if there are other secrets it creates/manages (e.g., runner SA tokens in provisionRunnerTokenForSession).

Note: That function runs in backend, not operator, so operator should be clear. Confirm no other secret usage in operator code.


Low Priority / Nitpicks

9. Unused Import Cleanup (operator/internal/handlers/sessions.go:24)

"k8s.io/client-go/util/retry"

Removed code that used retry, but import might still be there. Run go mod tidy and gofmt to clean up.


10. Logging Consistency

Backend uses log.Printf (stdlib) while operator uses same. Good consistency.

Runner uses Python logging module. Good.

Recommendation: Ensure log levels are appropriate:

  • logging.info for Vertex setup (line 672) ✅
  • logging.error for fetch failures (line 1723) ✅
  • Consider logging.warning for non-fatal issues

11. Token Length Logging

Backend logs tokenLen=%d (line 872) ✅ Excellent per CLAUDE.md!

Runner doesn't log BOT_TOKEN length when fetching credentials. Consider adding for consistency:

logging.debug(f"Using BOT_TOKEN for Vertex auth (len={len(bot)})")

🧪 Testing Recommendations

Pre-Merge Checklist:

  1. Run backend tests:

    cd components/backend
    go test ./handlers -v -run TestMintSessionVertexCredentials
  2. Run operator tests (if any exist for session handling):

    cd components/operator
    go test ./internal/handlers -v
  3. Run runner tests:

    cd components/runners/claude-code-runner
    pytest tests/test_wrapper_vertex.py -v
  4. Linting:

    cd components/backend && gofmt -l . && golangci-lint run
    cd components/operator && gofmt -l . && golangci-lint run
  5. E2E Test (if Vertex is enabled):

    • Deploy to test cluster with ambient-vertex secret
    • Create session with CLAUDE_CODE_USE_VERTEX=1
    • Verify runner can fetch credentials and initialize Anthropic client
    • Check logs for "Vertex AI configured: project=..."

📋 Pre-Commit Actions (from CLAUDE.md)

Before merging, verify:


Security Analysis

✅ Security Improvements

  1. Reduced Secret Sprawl: Credentials stay in backend namespace, not copied everywhere
  2. Per-Session Authorization: Each request validated via TokenReview + SA annotation check
  3. Minimal File Permissions: Runner writes creds to tmpfs with 0o400
  4. Token Redaction: Proper logging practices throughout

⚠️ Security Considerations

  1. Credentials in API Response: credentials field contains raw GCP SA JSON in HTTP response body (line 952)

    • Mitigation: Uses HTTPS, authorized tokens only
    • Recommendation: Consider encrypting response or using short-lived tokens if GCP supports
  2. Temp File Lifetime: /tmp/vertex-credentials.json persists until pod exits

    • Mitigation: tmpfs is in-memory, not persistent storage
    • Recommendation: Document cleanup behavior or add explicit cleanup on exit
  3. Backend Namespace Secret: Single ambient-vertex secret serves all projects

    • Risk: Compromise of backend namespace exposes all Vertex credentials
    • Mitigation: Proper namespace isolation, RBAC restrictions
    • Recommendation: Consider per-project secrets if multi-tenancy requires isolation

Performance Considerations

New API Call per Session: Runners now make HTTP request to backend for credentials.

Impact: Negligible (one-time fetch during session init, ~10ms overhead).

Benefit: Eliminates operator reconciliation loop for secret copying (reduces cluster load).


Documentation Gaps

  1. No docs update for new endpoint: POST /api/projects/:project/agentic-sessions/:session/vertex/credentials

    • Add to API reference docs
    • Document request/response format
    • Document authentication requirements (BOT_TOKEN)
  2. No migration guide from old secret-copying approach

    • What happens to existing sessions with old secrets?
    • Do operators need to clean up old secrets manually?
  3. No troubleshooting guide for Vertex setup

    • How to verify secret is correctly configured?
    • How to test credential fetch independently?

Overall Assessment

Architecture: ⭐⭐⭐⭐⭐ (5/5) - Excellent design, cleaner separation of concerns

Implementation: ⭐⭐⭐⭐☆ (4/5) - Solid code, needs type safety fixes

Testing: ⭐⭐⭐⭐⭐ (5/5) - Comprehensive backend tests, good runner coverage

Security: ⭐⭐⭐⭐☆ (4/5) - Strong auth model, minor credential handling improvements possible

Documentation: ⭐⭐⭐☆☆ (3/5) - Needs API docs and troubleshooting guide


Recommendation

Status: ✅ Approve with minor fixes

Required before merge:

  1. Fix type safety in sessions.go:914-915 (use unstructured.Nested*)
  2. Add error logging to TokenReview failure (sessions.go:877)
  3. Run gofmt and golangci-lint on backend/operator

Recommended (can be follow-up PRs):
4. Document new API endpoint
5. Add troubleshooting section to production overlay comments
6. Extract ambient-vertex constant
7. Add file permission test to runner tests


Great work on this refactoring! The JWT-based credential distribution is a significant improvement over the previous secret-copying approach. Once the type safety and error logging issues are addressed, this is ready to merge.

Files to update:

  • components/backend/handlers/sessions.go (lines 914-915, 877)
  • Run linters before final commit

Let me know if you'd like help with any of these fixes! 🚀

@github-actions
Copy link
Contributor

PR Review: JWT Vertex AI Integration

I've reviewed PR #285 which implements Vertex AI credential management through a centralized backend service. Overall, this is a well-architected security improvement that follows the project's CLAUDE.md guidelines closely. Here's my detailed feedback:


Strengths

1. Excellent Security Architecture

  • Centralized credential storage: Moving from per-namespace secrets to a single ambient-vertex secret in the backend namespace is a significant security improvement
  • Token-based authentication: Using BOT_TOKEN (ServiceAccount token) with TokenReview validation follows established patterns from MintSessionGitHubToken
  • Credential isolation: Runners fetch credentials on-demand rather than having them pre-provisioned in every namespace
  • Temporary file handling: Credentials written to /tmp (tmpfs) with 0o400 permissions and cleaned up after use (wrapper.py:669-705)
  • Token redaction: Proper logging that never exposes actual token values (handlers/sessions.go:828)

2. CLAUDE.md Compliance

  • Type-safe unstructured access: Correctly uses unstructured.NestedStringMap() and unstructured.NestedString() (handlers/sessions.go:813-816, 826)
  • Proper error handling: Returns explicit errors with context, no panics in production code
  • Token security: Never logs tokens, uses len(token) for debugging
  • Authentication pattern: Uses TokenReview validation matching MintSessionGitHubToken pattern
  • RBAC verification: Validates ServiceAccount matches CR annotation

3. Comprehensive Test Coverage

The new test file (handlers/sessions_test.go) covers all critical paths:

  • ✅ Success path with valid credentials
  • ✅ Missing Authorization header
  • ✅ Invalid token
  • ✅ Namespace mismatch
  • ✅ Session not found
  • ✅ ServiceAccount mismatch
  • ✅ Secret not found
  • ✅ Missing credential file in secret

4. RBAC Reduction

  • Removed operator permissions: Operator no longer needs secrets:get,create,delete cluster-wide (operator-clusterrole.yaml)
  • Principle of least privilege: Operator only reads the flag, backend serves credentials

🔴 Critical Issues

1. Interface Type Change Without Justification (High Priority)

Files: git/operations.go:48, handlers/github_auth.go:27, handlers/repo.go:23, handlers/sessions.go:38, handlers/sessions.go:563

// Changed from:
K8sClient *kubernetes.Clientset
// To:
K8sClient kubernetes.Interface

Problem: This changes the concrete type to an interface across multiple files, but:

  • No explanation in PR description
  • No test modifications to verify interface compatibility
  • Could break code that depends on Clientset-specific methods not in Interface

Recommendation:

  • Add a comment in the PR description explaining why this change is necessary for Vertex integration
  • If this is for testability, update all tests to use fake clients with the Interface type
  • If this change is unrelated to Vertex, consider splitting into a separate refactoring PR

2. Missing Documentation for New Endpoint

File: routes.go:40

The new endpoint /projects/:projectName/agentic-sessions/:sessionName/vertex/credentials is added with a comment explaining it uses BOT_TOKEN, but:

  • No API documentation in docs/
  • No OpenAPI/Swagger spec update
  • Runners need to know this endpoint exists

Recommendation:

  • Add endpoint documentation to docs/reference/api-endpoints.md (if it exists)
  • Update any API reference documentation
  • Consider adding a code example showing how runners should call this endpoint

⚠️ Moderate Issues

3. Inconsistent Error Logging

File: handlers/sessions.go:773, 807

// Line 773: Logs error
if err != nil {
    log.Printf("TokenReview failed for project=%s, session=%s: %v", project, sessionName, err)
    c.JSON(http.StatusInternalServerError, gin.H{"error": "token review failed"})
    return
}

// Line 807: Also logs error (good!)
if err != nil {
    log.Printf("Failed to get session %s/%s: %v", project, sessionName, err)
    c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to read session"})
    return
}

This is good and consistent. However, the original MintSessionGitHubToken didn't have the log at line 773 originally. This PR adds it, which is an improvement.

Recommendation: ✅ This is actually fine - the consistency is good!

4. Hardcoded Secret Name

File: handlers/sessions.go:888

const vertexSecretName = "ambient-vertex"

Issue: The secret name is hardcoded, making it harder to test or support multiple configurations.

Recommendation:

  • Move to a package-level configurable variable set from environment variable (e.g., VERTEX_SECRET_NAME)
  • Default to "ambient-vertex" for backward compatibility
  • Makes testing easier (can create test-specific secrets)
// Suggested pattern:
var VertexSecretName = getEnv("VERTEX_SECRET_NAME", "ambient-vertex")

5. Credential Cleanup Not Guaranteed on Panic

File: wrapper.py:692-705

def _cleanup_vertex_credentials(self):
    """Clean up Vertex AI credential file from /tmp."""
    if self._vertex_credentials_path and os.path.exists(self._vertex_credentials_path):
        try:
            os.remove(self._vertex_credentials_path)
            logging.info(f"Cleaned up Vertex credential file: {self._vertex_credentials_path}")
        except Exception as e:
            logging.warning(f"Failed to clean up Vertex credential file {self._vertex_credentials_path}: {e}")
        finally:
            self._vertex_credentials_path = None

Issue: If the Python process is killed (SIGKILL) or crashes before reaching cleanup, the credential file persists in /tmp.

Mitigation:

  • /tmp is already mounted as tmpfs (in-memory), so credentials disappear on pod termination ✅
  • File permissions are 0o400 (owner read-only) ✅
  • Credentials are short-lived (pod lifetime only) ✅

Recommendation: Document this in code comments to clarify the security model relies on tmpfs cleanup.


💡 Suggestions & Best Practices

6. Consider Secret Rotation

The current design requires manual secret updates. Consider:

  • Adding a timestamp or version field to the secret
  • Implementing a mechanism for runners to detect stale credentials
  • Adding a /vertex/credentials/refresh endpoint for credential rotation

7. Add Retry Logic for Credential Fetch

File: wrapper.py:1706-1778

The _fetch_vertex_credentials() method has a single 10-second timeout. Consider:

  • Exponential backoff for transient network errors
  • Distinguishing between 401 (auth failure) and 5xx (retry-able) errors
# Suggested pattern:
for attempt in range(3):
    try:
        # ... fetch logic ...
        break
    except _urllib_error.HTTPError as he:
        if he.code in (401, 403, 404):
            raise  # Don't retry auth failures
        if attempt < 2:
            await asyncio.sleep(2 ** attempt)  # Exponential backoff

8. Monitoring & Observability

Consider adding:

  • Metrics for credential fetch success/failure rates
  • Alerts for repeated credential fetch failures
  • Audit logging for credential access (who/when/which session)

9. ConfigMap Simplification

Files: base/operator-deployment.yaml, overlays/*/operator-config*.yaml

Good cleanup: Removed unused env vars (CLOUD_ML_REGION, ANTHROPIC_VERTEX_PROJECT_ID, GOOGLE_APPLICATION_CREDENTIALS) from operator ConfigMaps

Suggestion: Add comments explaining why only CLAUDE_CODE_USE_VERTEX is needed by the operator (acts as a feature flag, actual credentials fetched by runner).


🧪 Testing Recommendations

10. Additional Test Cases

Consider adding tests for:

Backend (handlers/sessions_test.go):

  • ✅ Already covers most paths - excellent work!
  • Add: Test with malformed JSON in secret's ambient-code-key.json
  • Add: Test with valid token but wrong audience (if audience validation is implemented)

Runner (wrapper.py):

  • Add: Test _fetch_vertex_credentials() with network timeouts
  • Add: Test _cleanup_vertex_credentials() when file doesn't exist
  • Add: Integration test verifying full E2E flow (may require test cluster)

11. E2E Test Updates

File: e2e/cypress/e2e/vteam.cy.ts (not modified)

The E2E tests don't cover Vertex AI sessions. Consider:

  • Adding a test that creates a session with CLAUDE_CODE_USE_VERTEX=1
  • Mocking the Vertex credential endpoint
  • Verifying the runner pod receives credentials correctly

📋 Documentation Checklist

Before merging, ensure:

  • Update docs/configuration.md (or equivalent) with ambient-vertex secret format
  • Document the new credential fetch endpoint in API reference
  • Add deployment guide for Vertex AI setup (creating the secret)
  • Update overlays/production/operator-config-openshift.yaml comment with complete secret creation command (already has good start!)
  • Add architecture diagram showing credential flow: Backend Secret → API → Runner → Vertex AI

🎯 Final Verdict

Approval Recommendation: Approve with minor changes

This PR represents a significant security improvement and follows the project's coding standards closely. The architecture is sound, test coverage is comprehensive, and the implementation follows established patterns.

Before merging:

  1. 🔴 MUST: Explain the kubernetes.Interface change or revert if unnecessary
  2. ⚠️ SHOULD: Make secret name configurable for testing
  3. 💡 NICE TO HAVE: Add API documentation and retry logic

Security Score: 9/10 - Excellent security design with proper token handling, credential isolation, and RBAC enforcement.

Code Quality: 8.5/10 - Clean, well-tested code following project standards. Minor improvements possible in configurability.

Great work on this implementation! The centralized credential management pattern could serve as a model for other cloud provider integrations.


Review completed by: Claude Code (Sonnet 4.5)
Files reviewed: 18 files, 1961 additions, 1808 deletions
Reference: CLAUDE.md backend development standards

@github-actions
Copy link
Contributor

PR Review: JWT Vertex Credential Management

Overview

This PR refactors Vertex AI credential management to use a JWT-based flow where runners authenticate to the backend API using their ServiceAccount tokens (BOT_TOKEN) to fetch GCP credentials dynamically. This is a significant security improvement over the previous approach.

✅ Strengths

1. Security Improvements

  • Excellent: Credentials no longer distributed via Secrets to every session namespace
  • Excellent: Backend serves as centralized credential vault, reading from its own namespace
  • Excellent: Runner writes credentials to /tmp (tmpfs, in-memory) with 0o400 permissions
  • Excellent: Cleanup of credential file after session completion (_cleanup_vertex_credentials())
  • Good: Token authentication properly validated via TokenReview API
  • Good: ServiceAccount authorization verified against session annotation

2. Code Quality

  • Excellent: Comprehensive test coverage for MintSessionVertexCredentials (432 lines, 8 test cases)
  • Good: Proper use of type-safe unstructured.NestedStringMap per CLAUDE.md standards (sessions.go:914)
  • Good: Error handling with structured logging and appropriate HTTP status codes
  • Good: Token length logging instead of actual values (sessions.go:868)

3. RBAC Simplification

  • Excellent: Removed operator's cluster-wide Secret access (operator-clusterrole.yaml)
  • Good: Reduced operator environment variables (operator-deployment.yaml)

⚠️ Issues & Recommendations

1. CRITICAL: Missing Credential Cleanup on Error Paths

Location: components/runners/claude-code-runner/wrapper.py:638-690

The _setup_vertex_credentials() method creates a credential file but only tracks it for cleanup on success. If an exception occurs after writing the file, it won't be cleaned up.

Problem:

# Line 672-678: File written
with open(creds_path, 'w') as f:
    f.write(credentials_json)

# If ANY exception occurs here or in caller, _vertex_credentials_path is never set
# and cleanup won't happen on pod exit

self._vertex_credentials_path = creds_path  # Line 684

Fix: Set self._vertex_credentials_path immediately after successful write, before potential failure points:

with open(creds_path, 'w') as f:
    f.write(credentials_json)
os.chmod(creds_path, 0o400)

# Set immediately so cleanup happens even if subsequent steps fail
self._vertex_credentials_path = creds_path

logging.info(f"Vertex credentials written to {creds_path} with mode 0o400")
# ... rest of function

2. HIGH: Inconsistent Error Handling in GitHub vs Vertex Flows

Location: components/backend/handlers/sessions.go:770-841 vs 847-955

The MintSessionGitHubToken and MintSessionVertexCredentials functions have duplicate validation logic but inconsistent error logging.

Inconsistency:

  • GitHub token endpoint: No logging before TokenReview creation (line 771)
  • Vertex endpoint: Logs token length before TokenReview (line 868)
  • GitHub token endpoint: Missing log on session fetch failure (line 806)
  • Vertex endpoint: Logs session fetch failure (line 908)

Recommendation: Extract common validation logic into a shared function:

// validateRunnerToken performs TokenReview and verifies SA matches session annotation
func validateRunnerToken(c *gin.Context, project, sessionName, token string) (saName string, err error) {
    // Common validation logic from both functions
    // Return extracted SA name on success
}

Then both endpoints can use:

saName, err := validateRunnerToken(c, project, sessionName, token)
if err != nil {
    return // validateRunnerToken already sent response
}

3. MEDIUM: Test Coverage Gap - Backend Integration Tests

Location: components/backend/handlers/sessions_test.go

The unit tests are excellent but missing integration test scenarios:

  • ❌ End-to-end flow with real backend API server
  • ❌ Concurrent requests from multiple runners
  • ❌ Token expiration/renewal scenarios
  • ❌ Backend namespace misconfiguration

Recommendation: Add integration test in tests/integration/:

func TestVertexCredentialFlow_Integration(t *testing.T) {
    // Test real runner token → backend → GCP credential flow
    // Verify credential file permissions, cleanup, etc.
}

4. MEDIUM: Vertex Secret Format Not Validated

Location: components/backend/handlers/sessions.go:942-946

The code checks if ambient-code-key.json exists but doesn't validate:

  • JSON structure of credentials
  • Presence of required fields (type, project_id, private_key, etc.)
  • project-id and region fields in secret

Current:

credJSON, ok := vertexSecret.Data["ambient-code-key.json"]
if !ok {
    c.JSON(http.StatusNotFound, gin.H{"error": "Vertex credentials missing ambient-code-key.json"})
    return
}
// Returns potentially invalid JSON to runner

Fix: Add validation:

credJSON, ok := vertexSecret.Data["ambient-code-key.json"]
if !ok {
    c.JSON(http.StatusNotFound, gin.H{"error": "Vertex credentials missing ambient-code-key.json"})
    return
}

// Validate JSON structure
var credStruct map[string]interface{}
if err := json.Unmarshal(credJSON, &credStruct); err != nil {
    log.Printf("Invalid Vertex credential JSON in secret: %v", err)
    c.JSON(http.StatusInternalServerError, gin.H{"error": "Invalid Vertex credential format"})
    return
}

// Validate required fields
requiredFields := []string{"type", "project_id", "private_key"}
for _, field := range requiredFields {
    if _, ok := credStruct[field]; !ok {
        log.Printf("Vertex credential missing required field: %s", field)
        c.JSON(http.StatusInternalServerError, gin.H{"error": "Incomplete Vertex credentials"})
        return
    }
}

5. LOW: Operator Environment Variable Cleanup Incomplete

Location: components/manifests/base/operator-deployment.yaml:37-45

The comment says "Operator only needs the flag" but the deployment still reads from ConfigMap. Consider removing the ConfigMap reference entirely if only the flag is needed:

Current:

- name: CLAUDE_CODE_USE_VERTEX
  valueFrom:
    configMapKeyRef:
      name: operator-config
      key: CLAUDE_CODE_USE_VERTEX

Consider: If this is truly just a flag, use a literal value in production overlay:

# In overlays/production/operator-deployment.yaml
- name: CLAUDE_CODE_USE_VERTEX
  value: "1"

6. LOW: Missing Documentation for Secret Creation

Location: components/manifests/overlays/production/operator-config-openshift.yaml:7-17

The comment provides secret creation command but doesn't specify:

  • Which backend namespace to use
  • How to obtain the service account JSON
  • Required GCP permissions for the service account

Recommendation: Add to production deployment docs or ConfigMap comment:

# Required GCP Permissions:
#   - aiplatform.endpoints.predict
#   - aiplatform.models.get
# Service Account: ambient-code@PROJECT_ID.iam.gserviceaccount.com
# Backend Namespace: Use same namespace as vteam-backend deployment

🔍 Minor Code Quality Issues

7. Duplicate Type-Safe Access Pattern (Good to Consolidate)

Location: sessions.go:803-820 vs 913-924

Both MintSessionGitHubToken and MintSessionVertexCredentials extract the runner-sa annotation identically. Consider a helper:

func getRunnerSAFromSession(obj *unstructured.Unstructured) (string, error) {
    anns, _, _ := unstructured.NestedStringMap(obj.Object, "metadata", "annotations")
    sa, ok := anns["ambient-code.io/runner-sa"]
    if !ok {
        return "", fmt.Errorf("runner-sa annotation not found")
    }
    return strings.TrimSpace(sa), nil
}

8. Test Helper Function Already Exists

Location: sessions_test.go:421-432

The contains() and stringContains() helpers can be replaced with strings.Contains() from stdlib:

// Delete lines 421-432, use:
if !strings.Contains(response["error"].(string), "Vertex credentials not configured") {
    t.Errorf("Expected error about missing credentials, got: %v", response["error"])
}

📋 Pre-Merge Checklist

🎯 Summary

This is a well-architected security improvement with excellent test coverage. The main concerns are:

  1. Credential cleanup on error paths (must fix before merge)
  2. Secret format validation (should fix to prevent runtime errors)
  3. Code duplication (nice to have, not blocking)

The approach of using JWT authentication for credential distribution is sound and follows Kubernetes security best practices. Once the critical cleanup issue is addressed, this will be a solid improvement to the platform's security posture.

Recommendation: Address items #1 and #4, then approve for merge.


Review generated following CLAUDE.md backend development standards

@github-actions
Copy link
Contributor

Pull Request Review: JWT Vertex Authentication

Overview

This PR implements Vertex AI credential distribution via backend API endpoints, improving security by centralizing credential management and eliminating the need for operators to directly access GCP service account keys.


🎯 Positive Highlights

1. Security Architecture Improvements

  • Centralized credential management: Backend reads ambient-vertex secret from its own namespace, eliminating credential duplication
  • Token-based authentication: Runners use BOT_TOKEN (ServiceAccount token) for credential fetching, following existing GitHub token pattern
  • Credential cleanup: Temporary credential file (/tmp/vertex-credentials.json) is properly cleaned up after session completion (wrapper.py:694-707)
  • Restrictive file permissions: Credentials written with mode 0o400 (read-only for owner)

2. Code Quality & Best Practices

  • Type-safe unstructured access: Proper use of unstructured.NestedString helpers in handlers/sessions.go:862-863 (per CLAUDE.md standards)
  • Comprehensive test coverage: Added 7 test cases covering success, auth failures, namespace mismatches, missing secrets, and more
  • DRY principle: Refactored common token validation logic into reusable validateRunnerToken function (handlers/sessions.go:749-821)
  • Interface usage: Changed *kubernetes.Clientset to kubernetes.Interface for better testability

3. Error Handling

  • Structured logging: All error paths include contextual information (project, session, tokenLen)
  • Graceful degradation: Operator's preflight check logs informational warnings if secret missing, doesn't block startup
  • Token redaction: No tokens logged (only token length), following security best practices

⚠️ Issues & Concerns

1. Critical: Security Vulnerability - Credential Exposure 🔴

Location: wrapper.py:1735

req = _urllib_request.Request(url, data=b"{}", headers={'Content-Type': 'application/json'}, method='POST')
bot = (os.getenv('BOT_TOKEN') or '').strip()
if bot:
    req.add_header('Authorization', f'Bearer {bot}')

Problem: The BOT_TOKEN is appended to WebSocket URL as query parameter (wrapper.py:82):

setattr(self.shell.transport, 'url', ws + f"?token={bot}")

Risk: Query parameters can be logged in:

  • Web server access logs
  • Reverse proxy logs (nginx, HAProxy)
  • Browser history (if WebSocket URL is exposed to client)
  • Monitoring/APM tools

Recommendation: Use headers for authentication instead of query params for WebSocket connections, or document that backend must disable query param logging for these endpoints.


2. Type Safety: Missing Error Checks 🟡

Location: handlers/sessions.go:862-863

userID, _, _ := unstructured.NestedString(obj.Object, "spec", "userContext", "userId")
userID = strings.TrimSpace(userID)
if userID == "" {
    c.JSON(http.StatusBadRequest, gin.H{"error": "session missing user context"})
    return
}

Issue: The third return value (error) is ignored with _. While the current check for empty string catches the issue, you should validate the error to distinguish between "field not found" vs "field exists but wrong type".

Per CLAUDE.md:

REQUIRED: Use unstructured.Nested* helpers with three-value returns
REQUIRED: Check found before using values; handle type mismatches gracefully

Recommendation:

userID, found, err := unstructured.NestedString(obj.Object, "spec", "userContext", "userId")
if err != nil {
    log.Printf("Invalid userContext.userId type for session %s/%s: %v", project, sessionName, err)
    c.JSON(http.StatusBadRequest, gin.H{"error": "invalid user context"})
    return
}
if !found || strings.TrimSpace(userID) == "" {
    c.JSON(http.StatusBadRequest, gin.H{"error": "session missing user context"})
    return
}

3. Backend Namespace Hardcoding 🟡

Location: handlers/sessions.go:879-880

const vertexSecretName = "ambient-vertex"
backendNamespace := Namespace
vertexSecret, err := K8sClient.CoreV1().Secrets(backendNamespace).Get(c.Request.Context(), vertexSecretName, v1.GetOptions{})

Issue: The Namespace variable is set at startup. If backend and operator run in different namespaces (common in multi-tenant deployments), this assumes the secret exists in the backend's namespace.

Current behavior: Works correctly per design, but lacks documentation

Recommendation: Add a comment explaining the namespace assumption:

// Load Vertex credentials from ambient-vertex secret in the backend namespace.
// Note: Backend must have ambient-vertex secret in its own namespace (not the project/session namespace).
// This centralizes credential management and prevents credential duplication across projects.
const vertexSecretName = "ambient-vertex"
backendNamespace := Namespace  // Backend's namespace, set at startup

4. Test Coverage: Missing Integration Tests 🟡

Current tests (handlers/sessions_test.go):

  • ✅ Unit tests with mocked K8s clients
  • ✅ TokenReview validation
  • ✅ Error conditions

Missing:

  • ❌ Integration test with real backend API call
  • ❌ E2E test verifying runner can actually fetch credentials
  • ❌ Test for credential cleanup on session termination

Recommendation: Add integration test in components/backend/tests/integration/ that:

  1. Creates a real ambient-vertex secret
  2. Creates an AgenticSession with runner SA
  3. Calls the /vertex/credentials endpoint with SA token
  4. Verifies credentials match secret data

5. Operator: Misleading Preflight Check 🟡

Location: operator/internal/preflight/vertex.go:23-36

// Check if ambient-vertex secret exists in backend namespace (informational only)
// This is not a hard requirement as backend may be in different namespace
_, err := config.K8sClient.CoreV1().Secrets(operatorNamespace).Get(...)
if err != nil {
    log.Printf("  Info: secret '%s' not found in namespace '%s'", ...)
    log.Printf("  Note: Ensure backend namespace has the ambient-vertex secret with:...")
}

Issue: The check looks in operatorNamespace but the log message says "backend namespace". These may be different namespaces.

Recommendation: Either:

  1. Skip this check entirely (backend will error at runtime if secret missing)
  2. Check the actual backend namespace (requires knowing backend namespace at operator startup)
  3. Clarify in logs: "Checking operator namespace for secret (backend may use different namespace)"

6. RBAC: Backend Service Account Permissions 🟡

Current RBAC (rbac/operator-clusterrole.yaml shows operator permissions, but backend RBAC not in diff)

Required: Backend ServiceAccount needs:

- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["get"]
  # Limited to backend's own namespace

Question: Is this permission already granted? Not visible in the PR diff.

Recommendation: Verify backend ServiceAccount has secrets:get permission in its own namespace, or add a preflight check in backend startup.


7. Error Response Consistency 🟢 Minor

Location: handlers/sessions.go:883-886, 891-893

// Case 1: Secret not found
c.JSON(http.StatusNotFound, gin.H{"error": "Vertex credentials not configured for this project"})

// Case 2: Missing credential file
c.JSON(http.StatusNotFound, gin.H{"error": "Vertex credentials missing ambient-code-key.json"})

Issue: Both return 404, but the second case is a configuration error (should be 500 Internal Server Error) since the secret exists but is malformed.

Recommendation:

// Case 2: Missing credential file (malformed secret)
c.JSON(http.StatusInternalServerError, gin.H{"error": "Vertex credentials misconfigured"})

This helps distinguish between "feature not enabled" (404) vs "feature broken" (500).


🧪 Testing Recommendations

Required Before Merge:

  1. Integration test: Real backend → runner credential fetch flow
  2. RBAC verification: Confirm backend SA has secrets:get in its namespace
  3. E2E test: Full AgenticSession using Vertex AI (if GCP test env available)

Suggested Test Cases:

  • Vertex session with valid credentials → success
  • Vertex session with missing secret → proper error message
  • Vertex session with malformed secret → proper error
  • Vertex session with wrong SA token → 401/403
  • Credential file cleanup after session completion
  • Credential file cleanup after session crash/timeout

📋 Performance Considerations

Positive:

  • ✅ Credentials fetched once per session (not per API call)
  • ✅ Temporary file in /tmp (tmpfs, memory-backed filesystem)
  • ✅ Single HTTP request to backend for credential fetch

Potential Issues:

  • ⚠️ No caching of credentials across sessions (each session fetches fresh)
    • Impact: Minimal - credentials are small (~2KB JSON)
    • Tradeoff: Fresh credentials prevent stale credential issues

🔒 Security Assessment

Strengths:

  1. ✅ Centralized credential management (single source of truth)
  2. ✅ No credential duplication across namespaces
  3. ✅ Token-based authentication for runner access
  4. ✅ Automatic credential cleanup
  5. ✅ Restrictive file permissions

Weaknesses:

  1. 🔴 High: BOT_TOKEN in WebSocket query parameters (see Issue Outcome: Reduce Refinement Time with agent System #1)
  2. 🟡 Medium: No encryption at rest for /tmp/vertex-credentials.json
    • Mitigation: File mode 0o400 + tmpfs (memory-backed) + cleanup
  3. 🟡 Medium: Credentials returned as plain JSON over HTTP
    • Mitigation: Assumes TLS/HTTPS in production (should document this requirement)

Checklist (Per CLAUDE.md)

Backend/Operator Standards:

  • Authentication: Uses validateRunnerToken for token validation
  • Authorization: Validates SA matches session annotation
  • Error Handling: All errors logged with context
  • Token Security: No tokens in logs (only length)
  • Type Safety: Uses unstructured.Nested* helpers
  • Resource Cleanup: Credentials cleaned up on completion
  • Status Updates: N/A (no status updates in this PR)
  • Tests: Added comprehensive unit tests
  • Logging: Structured logs with context
  • Code Quality: Run golangci-lint before merge ⚠️

📝 Documentation Needs

The following should be documented before merge:

  1. Deployment Guide: How to create ambient-vertex secret

    kubectl create secret generic ambient-vertex \
      --from-file=ambient-code-key.json=/path/to/sa.json \
      --from-literal=project-id=YOUR_PROJECT \
      --from-literal=region=us-central1 \
      -n <backend-namespace>
  2. Security Requirements:

    • TLS/HTTPS required for credential transmission
    • Backend namespace access control
    • Token rotation strategy
  3. Architecture Diagram: Update docs to show new credential flow:

    Runner → (BOT_TOKEN) → Backend → ambient-vertex secret → Vertex Credentials → Runner
    

🎯 Final Verdict

Approval Status: Approve with Changes Required ⚠️

This PR demonstrates excellent architectural improvements and follows most CLAUDE.md standards. However, Issue #1 (BOT_TOKEN in query params) should be addressed before merge to prevent credential leakage in logs.

Required Changes:

  1. 🔴 Fix: BOT_TOKEN in WebSocket query params (security risk)
  2. 🟡 Improve: Type-safe error handling for NestedString calls
  3. 🟡 Document: Namespace assumptions and RBAC requirements

Recommended Changes:

  1. Add integration test for credential fetch flow
  2. Clarify operator preflight check logs
  3. Fix error response codes (404 vs 500)

Estimated Effort:

  • Required changes: 2-3 hours
  • Recommended changes: 3-4 hours
  • Total: 5-7 hours

📚 References

  • CLAUDE.md: Backend Development Standards (lines 285-448)
  • CLAUDE.md: Operator Patterns (lines 449-537)
  • CLAUDE.md: Security Patterns (lines 538-581)

Great work overall! The centralized credential management is a significant security improvement. Looking forward to seeing the WebSocket auth issue resolved. 🚀

@github-actions
Copy link
Contributor

Code Review: JWT Vertex AI Credential Management

This PR refactors Vertex AI credential management to use a more secure, centralized approach. The backend now serves credentials via an authenticated API endpoint instead of the operator distributing secrets to project namespaces.


Strengths

1. Security Improvements

  • Centralized credential management: Vertex credentials stay in the backend namespace, reducing secret sprawl
  • Proper token validation: validateRunnerToken() follows CLAUDE.md patterns with TokenReview + SA verification
  • Token redaction: Follows security standards (logs token length, not content) - components/backend/handlers/sessions.go:751
  • Least privilege: Operator no longer needs secret copy permissions, reduced RBAC footprint

2. Code Quality

  • Excellent test coverage: 8 comprehensive test cases cover success, auth failures, namespace mismatches, missing secrets - components/backend/handlers/sessions_test.go
  • Type-safe unstructured access: Uses unstructured.NestedStringMap() instead of raw type assertions (line 805) ✅ CLAUDE.md compliant
  • Proper error handling: All errors logged with context, appropriate HTTP status codes returned
  • Code reuse: validateRunnerToken() helper reduces duplication between GitHub and Vertex endpoints
  • Clear separation of concerns: Operator sets flag, backend manages credentials, runner fetches on-demand

3. Architecture

  • Follow kubernetes.Interface pattern: Changed from *kubernetes.Clientset to kubernetes.Interface for better testability (git/operations.go:48)
  • Runtime credential fetching: Runner fetches credentials JIT instead of mounting secrets, supports dynamic updates
  • Clean lifecycle management: Vertex credential file cleanup in wrapper.py (lines 689-702)

⚠️ Issues & Concerns

HIGH PRIORITY

1. Potential Secret Exposure in Logs ⚠️

Location: components/runners/claude-code-runner/wrapper.py:213

logging.info(f"  GOOGLE_APPLICATION_CREDENTIALS: {os.environ.get('GOOGLE_APPLICATION_CREDENTIALS')}")

Issue: This logs the file path, which is acceptable, but be careful this doesn't accidentally log credential contents elsewhere.

Recommendation: Add a comment clarifying this only logs the path, not contents.


2. Missing Interface Methods Verification ⚠️

Location: components/backend/handlers/github_auth.go:27, git/operations.go:48

Issue: Changed K8sClient from *kubernetes.Clientset to kubernetes.Interface. While this improves testability, you should verify all callers only use methods defined in the interface.

Recommendation: Run the full test suite and ensure no runtime panics due to missing interface methods.

cd components/backend && go test ./... -v

3. Operator Test Deletion 🚨

Location: components/operator/internal/handlers/sessions_test.go (deleted 501 lines)

Issue: The entire operator test file was deleted. This removes test coverage for:

  • Session event handling
  • Job creation logic
  • Status updates
  • Error scenarios

Recommendation: Do not delete these tests. Instead, update them to reflect the new Vertex credential flow. Operator tests are critical for preventing regressions.


MEDIUM PRIORITY

4. Incomplete Error Context in validateRunnerToken()

Location: components/backend/handlers/sessions.go:805-806

anns, _, _ := unstructured.NestedStringMap(obj.Object, "metadata", "annotations")

Issue: The error return value is ignored. If there's a type mismatch (annotations is not a map[string]string), the error is silently swallowed.

Recommendation: Check the error and log it:

anns, found, err := unstructured.NestedStringMap(obj.Object, "metadata", "annotations")
if err != nil {
    log.Printf("Failed to parse annotations for session %s/%s: %v", project, sessionName, err)
    c.JSON(http.StatusInternalServerError, gin.H{"error": "invalid session metadata"})
    return ""
}
if !found {
    anns = map[string]string{}
}

5. Race Condition in Credential Fetch

Location: components/runners/claude-code-runner/wrapper.py:1703-1774

Issue: If the backend service is restarted during a runner session, the credential fetch will fail with a non-retryable error.

Recommendation: Add retry logic with exponential backoff for transient network errors:

for attempt in range(3):
    try:
        response = urllib.request.urlopen(req, timeout=10)
        break
    except urllib.error.URLError as e:
        if attempt < 2:
            await asyncio.sleep(2 ** attempt)
        else:
            raise

6. Missing RBAC Documentation

Location: components/manifests/base/rbac/operator-clusterrole.yaml

Issue: The RBAC changes removed secret copy permissions, but this isn't documented in commit messages or PR description.

Recommendation: Add a migration note explaining:

  • Operator no longer needs secrets: [create, update] in project namespaces
  • Backend needs secrets: [get] in its own namespace (for ambient-vertex)
  • Runners need network access to backend API

LOW PRIORITY

7. Hardcoded Secret Name

Location: components/backend/handlers/sessions.go:894

const vertexSecretName = "ambient-vertex"

Recommendation: Consider making this configurable via environment variable for flexibility in different environments.


8. Credential File Permissions

Location: components/runners/claude-code-runner/wrapper.py:670

os.chmod(creds_path, 0o400)  # Read-only for owner

Good practice, but document why 0o400 (prevents other users in container from reading).


9. Test Uses setupTestGVR() Twice

Location: components/backend/handlers/sessions_test.go:69-70

setupTestGVR()
setupTestGVR()

Issue: Duplicate call (harmless but redundant).

Fix: Remove one line.


🔍 Security Review

Secure Patterns

  1. TokenReview validates runner identity before serving credentials
  2. ServiceAccount name verified against session annotation (prevents token reuse across sessions)
  3. Namespace isolation enforced (token from wrong namespace rejected)
  4. Credentials served only to authorized SAs (not broadcast to all runners)
  5. Temporary credential file cleaned up on exit

⚠️ Security Considerations

  1. Backend becomes single point of failure: If backend is compromised, all Vertex credentials are exposed
    • Mitigation: Ensure backend has strong RBAC, runs with minimal privileges
  2. Credential caching: Runner stores credentials in /tmp for session duration
    • Mitigation: File permissions set to 0o400 (good), but /tmp is shared in some environments
    • Recommendation: Consider using in-memory credential storage if possible
  3. No credential rotation: If ambient-vertex secret is updated, existing runners won't pick up new credentials
    • Mitigation: Document that runners must be restarted after credential rotation

📊 Performance Considerations

  1. Additional HTTP request: Each runner now makes an API call to fetch credentials (adds ~10-100ms latency)
    • Impact: Negligible for typical session durations (minutes to hours)
  2. Backend load: Backend now serves credentials for every session startup
    • Impact: Minimal (credentials fetched once per session, not per request)
  3. Secret read: Backend reads ambient-vertex secret on every credential request
    • Optimization: Consider caching secret in memory with a 5-minute TTL

🧪 Testing Recommendations

Required Before Merge

  1. Backend unit tests: Excellent coverage (8 test cases)
  2. 🚨 Operator tests: Restore deleted tests, update for new flow
  3. ⚠️ Integration test: Test full flow (operator → runner → backend → Vertex API)
  4. ⚠️ Negative test: Verify runner fails gracefully when backend is unreachable
  5. ⚠️ RBAC test: Verify operator no longer needs secret permissions in project namespaces

Test Coverage Gaps

  • No tests for wrapper.py credential fetching logic (consider adding unit tests)
  • No tests for credential cleanup on runner crash/termination
  • No tests for concurrent credential requests (thread safety)

📝 Documentation Updates Needed

  1. Migration guide: How to upgrade from old secret-per-namespace model
  2. Architecture diagram: Show new credential flow (operator → backend → runner)
  3. Troubleshooting: Common errors (backend unreachable, invalid token, secret missing)
  4. RBAC requirements: Update docs to reflect reduced operator permissions

🎯 Recommendations Summary

MUST FIX (Blocking)

  1. 🚨 Restore operator tests (sessions_test.go) - update, don't delete
  2. ⚠️ Fix error handling in validateRunnerToken() (line 805)

SHOULD FIX (High Priority)

  1. Verify kubernetes.Interface change doesn't break callers
  2. Add retry logic for credential fetch in wrapper.py
  3. Document RBAC changes in PR description

NICE TO HAVE (Low Priority)

  1. Make secret name configurable
  2. Add comment about logging file path vs. contents
  3. Remove duplicate setupTestGVR() call in test
  4. Consider credential caching in backend (performance optimization)

Overall Assessment

Verdict: APPROVE with changes required

This is a well-architected refactoring that improves security posture by centralizing credential management. The backend code follows CLAUDE.md standards excellently (type-safe unstructured access, proper error handling, comprehensive tests). However, deleting all operator tests is a critical issue that must be addressed before merge.

The new credential flow (operator sets flag → runner fetches from backend) is cleaner than the old model (operator copies secrets to every namespace), but the PR needs:

  1. Operator test coverage restored
  2. Integration testing to verify end-to-end flow
  3. Documentation updates for the new architecture

Estimated time to address blockers: 2-4 hours (restore operator tests, fix error handling)


Great work on the security improvements and test coverage! The validateRunnerToken() helper is especially well-designed and will be useful for future authenticated endpoints. 🚀

Copy link
Collaborator

@Gkrumbach07 Gkrumbach07 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me

@Gkrumbach07
Copy link
Collaborator

Sandboxing the pod better will help this be more secure too.

@Gkrumbach07 Gkrumbach07 merged commit 2f783a1 into ambient-code:main Nov 12, 2025
21 of 22 checks passed
cooktheryan added a commit that referenced this pull request Nov 12, 2025
cooktheryan added a commit that referenced this pull request Nov 12, 2025
jeremyeder pushed a commit to jeremyeder/platform that referenced this pull request Nov 12, 2025
Signed-off-by: Ryan Cook <rcook@redhat.com>
jeremyeder pushed a commit to jeremyeder/platform that referenced this pull request Nov 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants