WIP: Jwt vertex #285

cooktheryan · 2025-11-11T14:04:59Z

No description provided.

Signed-off-by: Ryan Cook <rcook@redhat.com>

github-actions · 2025-11-11T14:06:48Z

PR Review: JWT Vertex Credentials Architecture

This PR refactors Vertex AI credential management from filesystem-based to API-based authentication. This is a significant security improvement that aligns with zero-trust principles. Overall, the implementation is solid with excellent test coverage, but I've identified several critical issues that need attention before merge.

🔴 Critical Issues

1. Backend Reads Secret from Wrong Namespace

Location: components/backend/handlers/sessions.go:924-926

Problem: The code comment says "Load Vertex credentials from ambient-vertex secret in the project namespace", but the production config comment (operator-config-openshift.yaml) says "Ensure ambient-vertex secret exists in backend namespace".

// Load Vertex credentials from ambient-vertex secret in the project namespace
vertexSecretName := "ambient-vertex"
vertexSecret, err := K8sClient.CoreV1().Secrets(project).Get(c.Request.Context(), vertexSecretName, v1.GetOptions{})

This reads from project namespace (line 926), but the production config says:

# Ensure ambient-vertex secret exists in backend namespace with:
#   kubectl create secret generic ambient-vertex \
#     -n <backend-namespace>

Impact: This is a critical mismatch. The backend will fail to find the secret unless it's duplicated in every project namespace, defeating the security benefit of centralized credential management.

Fix Required: Either:

Change line 926 to read from appConfig.BackendNamespace OR
Update documentation to clarify secrets must exist per-project

Recommendation: Read from backend namespace for centralized management (matches original architecture goal).

2. Missing Token Length Logging

Location: components/backend/handlers/sessions.go:865-869

Problem: Token validation doesn't log token length, violating CLAUDE.md security pattern (line 932):

token := strings.TrimSpace(parts[1])
if token == "" {
    c.JSON(http.StatusUnauthorized, gin.H{"error": "empty token"})
    return
}

Per CLAUDE.md:

// NEVER log the token itself
log.Printf("Processing request with token (len=%d)", len(token))

Impact: Hinders debugging when token auth fails.

Fix: Add: log.Printf("MintSessionVertexCredentials: tokenLen=%d", len(token))

3. Test Assumes Wrong Secret Namespace

Location: components/backend/handlers/sessions_test.go:22-32

Problem: Test creates secret in backend-namespace (line 25) but mocks request for test-project (line 60), yet test passes. This only works because the fake K8s client doesn't enforce namespace isolation.

vertexSecret := &corev1.Secret{
    ObjectMeta: metav1.ObjectMeta{
        Name:      "ambient-vertex",
        Namespace: "backend-namespace",  // ❌ Wrong namespace
    },
    // ...
}
// ...
req := httptest.NewRequest("POST", "/api/projects/test-project/agentic-sessions/test-session/vertex/credentials", nil)

Impact: Test doesn't catch the namespace mismatch issue (#1), giving false confidence.

Fix: Update test to match actual implementation (create secret in test-project OR change implementation to read from backend namespace and update test accordingly).

⚠️ High Priority Issues

4. Missing Error Context in Backend

Location: components/backend/handlers/sessions.go:874-876

Per CLAUDE.md: "Always log errors with context"

rv, err := K8sClient.AuthenticationV1().TokenReviews().Create(c.Request.Context(), tr, v1.CreateOptions{})
if err != nil {
    c.JSON(http.StatusInternalServerError, gin.H{"error": "token review failed"})
    return
}

Fix: Add: log.Printf("TokenReview failed for session %s/%s: %v", project, sessionName, err)

5. Type-Unsafe Metadata Access

Location: components/backend/handlers/sessions.go:911-912

Per CLAUDE.md: "FORBIDDEN: Direct type assertions without checking"

meta, _ := obj.Object["metadata"].(map[string]interface{})
anns, _ := meta["annotations"].(map[string]interface{})

Problem: Ignores ok value - will panic if metadata or annotations aren't maps.

Fix: Use three-value returns and check ok:

meta, ok := obj.Object["metadata"].(map[string]interface{})
if !ok {
    c.JSON(http.StatusInternalServerError, gin.H{"error": "invalid session metadata"})
    return
}
anns, _ := meta["annotations"].(map[string]interface{}) // This is safe after meta check

6. Nil Secret Data Not Checked

Location: components/backend/handlers/sessions.go:948-949

c.JSON(http.StatusOK, gin.H{
    "credentials": string(credJSON),
    "projectId":   string(vertexSecret.Data["project-id"]),    // ❌ May be nil
    "region":      string(vertexSecret.Data["region"]),       // ❌ May be nil
})

Problem: If project-id or region keys don't exist, string(nil) returns empty string without error.

Impact: Runner will fail silently with empty config values.

Fix: Validate before response:

projectID, ok := vertexSecret.Data["project-id"]
if !ok || len(projectID) == 0 {
    c.JSON(http.StatusNotFound, gin.H{"error": "Vertex credentials missing project-id"})
    return
}
region, ok := vertexSecret.Data["region"]
if !ok || len(region) == 0 {
    c.JSON(http.StatusNotFound, gin.H{"error": "Vertex credentials missing region"})
    return
}
c.JSON(http.StatusOK, gin.H{
    "credentials": string(credJSON),
    "projectId":   string(projectID),
    "region":      string(region),
})

7. Python Wrapper Hardcodes Credential Path

Location: components/runners/claude-code-runner/wrapper.py:662

creds_path = '/tmp/vertex-credentials.json'

Problem: Multiple concurrent sessions in the same pod would overwrite each other's credentials.

Impact: Race conditions in multi-session environments (if ever implemented).

Fix: Use session-specific path:

creds_path = f'/tmp/vertex-credentials-{self.context.session_id}.json'

📝 Medium Priority Issues

8. Inconsistent Error Messages

Line 929: "Vertex credentials not configured for this project"
Line 940: "Vertex credentials missing ambient-code-key.json"

Issue: First message is user-facing (project not configured), second is technical (missing key in existing secret).

Suggestion: Make both user-friendly or add context to logs.

9. Missing Cleanup in Python Wrapper

Location: wrapper.py:662-670

Credentials file is written to /tmp but never cleaned up.

Impact: Credentials remain on disk until pod termination (minor risk since /tmp is tmpfs).

Fix: Add cleanup in wrapper shutdown/error paths:

try:
    os.remove(creds_path)
except:
    pass

10. Operator Log Message Unclear

Location: components/operator/internal/handlers/sessions.go:261

log.Printf("Vertex AI enabled - runner will fetch credentials from backend API")

Suggestion: Add session context: log.Printf("Session %s/%s: Vertex AI enabled - runner will fetch credentials from backend API", sessionNamespace, name)

✅ Strengths

Excellent Security Improvement: Moving from filesystem-based to API-based credentials is a major security win
Comprehensive Test Coverage: Backend tests cover all edge cases (8 test functions)
Proper Token Validation: TokenReview validates SA namespace and matches CR annotation
RBAC Cleanup: Removed unnecessary secrets permissions from operator ClusterRole
Clean Operator Refactoring: Removed 150+ lines of complex secret-copying logic
Python Tests: Test coverage for new _fetch_vertex_credentials method
Zero-Trust Architecture: Runner can't access credentials without valid BOT_TOKEN

📋 Pre-Merge Checklist

Before merging, please address:

CRITICAL: Fix namespace mismatch (Outcome: Reduce Refinement Time with agent System #1) - decide backend vs project namespace
CRITICAL: Update test to match implementation (Epic: Data Source Integration #3)
Add token length logging (Epic: RAT Architecture & Design #2)
Add error context logging (Epic: AI Agent Development #4)
Fix type-unsafe metadata access (Epic: Jira Integration & Workflow #5)
Validate all secret data fields before returning (Epic: Testing & Validation #6)
Use session-specific credential file path (Test: Automation Workflow Validation #7)
Run make lint on backend and operator (verify gofmt, golangci-lint pass)
Update documentation in operator-config-openshift.yaml to match implementation

🎯 Architecture Validation

The overall architecture is sound and follows CLAUDE.md guidelines:

✅ Authentication Pattern: Uses backend service account (K8sClient) for reading secrets - correct per CLAUDE.md
✅ Token Security: BOT_TOKEN validation via TokenReview with namespace + SA verification
✅ Error Handling: Proper IsNotFound checks, appropriate HTTP status codes
✅ Separation of Concerns: Backend owns secrets, operator only passes flag
✅ Security Context: Credentials fetched on-demand, not persisted in pod filesystem

🔧 Suggested Follow-up Work (Post-Merge)

Add integration test for full credential flow (backend → runner → Vertex AI)
Consider adding rate limiting to vertex/credentials endpoint
Add metrics for credential fetch success/failure rates
Document secret creation process in production deployment guide

Summary

This PR represents a significant architectural improvement. The credential-on-demand pattern is exactly right. However, the namespace mismatch between code, tests, and documentation is a blocker that must be resolved before merge. Once the critical issues are addressed, this will be production-ready.

Recommendation: Request changes for critical issues (#1, #3), rest can be addressed before merge or as follow-up PR.

Signed-off-by: Ryan Cook <rcook@redhat.com>

…rtex

github-actions · 2025-11-11T14:32:22Z

Pull Request Review: JWT/Vertex AI Credential Management

This PR refactors how Vertex AI credentials are managed, moving from a secret-copying approach to a backend-served JWT token pattern. This is a significant architectural improvement that enhances security and simplifies the RBAC model.

✅ Strengths

Excellent Security Architecture
- Centralizes credential management in the backend namespace (single source of truth)
- Uses proper ServiceAccount token validation via TokenReview API
- Verifies namespace and SA matching against CR annotations
- Never logs sensitive tokens (uses tokenLen=%d pattern)
- Removes need for operator to have cross-namespace secret permissions
Comprehensive Test Coverage
- 8 test cases covering all security boundaries: success path, missing auth, invalid token, wrong namespace, session not found, SA mismatch, secret not found, missing credential file
- Uses proper fake K8s clients and reactor patterns
- Good edge case coverage
Clean Operator Simplification
- Removes secret copying logic and cleanup code
- Reduces ClusterRole permissions (no more cross-namespace secret access)
- Operator only needs CLAUDE_CODE_USE_VERTEX flag
- Removes unused environment variables from ConfigMaps
Follows Established Patterns
- Mirrors MintSessionGitHubToken handler pattern (handlers/sessions.go:811-845)
- Uses backend service account for reading backend namespace secret (correct per CLAUDE.md guidelines)
- Returns structured JSON with credentials, projectId, region

🔧 Issues to Address

Critical Issues

Duplicate setupTestGVR() Call (sessions_test.go:70)

DynamicClient = fake.NewSimpleDynamicClient(s, sessionCR)
setupTestGVR()
setupTestGVR()  // ❌ Duplicate call

Fix: Remove line 70

Missing Error Check (sessions_test.go:351)

var response map[string]interface{}
json.Unmarshal(w.Body.Bytes(), &response)  // ❌ Error ignored
if !contains(response["error"].(string), "Vertex credentials not configured") {

Fix:

if err := json.Unmarshal(w.Body.Bytes(), &response); err != nil {
    t.Fatalf("Failed to parse response: %v", err)
}

Potential Nil Pointer Panic (sessions_test.go:352)
The test assumes response["error"] exists and is a string, but doesn't check:

if !contains(response["error"].(string), "Vertex credentials not configured") {

Fix:

errMsg, ok := response["error"].(string)
if !ok || !contains(errMsg, "Vertex credentials not configured") {
    t.Errorf("Expected error about missing credentials, got: %v", response)
}

Custom String Helper Reinvents stdlib (sessions_test.go:421-432)

func stringContains(s, substr string) bool {
    for i := 0; i <= len(s)-len(substr); i++ {
        if s[i:i+len(substr)] == substr {
            return true
        }
    }
    return false
}

Fix: Use strings.Contains() from stdlib instead

Medium Issues

Type Assertion Pattern Inconsistency
The new handler uses direct type assertions in several places:

meta, _ := obj.Object["metadata"].(map[string]interface{})
anns, _ := meta["annotations"].(map[string]interface{})

Per CLAUDE.md backend standards ("Type-Safe Unstructured Access"), should use unstructured.Nested* helpers:

anns, found, err := unstructured.NestedStringMap(obj.Object, "metadata", "annotations")
if !found || err != nil {
    c.JSON(http.StatusInternalServerError, gin.H{"error": "invalid session metadata"})
    return
}
expectedSA, ok := anns["ambient-code.io/runner-sa"]

Secret Data Keys Not Validated
The handler reads vertexSecret.Data["project-id"] and ["region"] without checking if they exist. Should validate all required keys:

projectID, ok := vertexSecret.Data["project-id"]
if !ok {
    c.JSON(http.StatusNotFound, gin.H{"error": "Vertex secret missing project-id"})
    return
}

Inconsistent Backend Namespace Naming
handlers/sessions.go:930 uses:
```
backendNamespace := Namespace
vertexSecretName := "ambient-vertex"
```
But the comment says "Backend reads from its own namespace". Consider logging this for debugging:
```
log.Printf("Reading Vertex credentials from backend namespace: %s", backendNamespace)
```

Minor Issues

Missing Context Timeout
The handler uses c.Request.Context() without timeout. Consider adding timeout for secret read operations:

ctx, cancel := context.WithTimeout(c.Request.Context(), 5*time.Second)
defer cancel()
vertexSecret, err := K8sClient.CoreV1().Secrets(backendNamespace).Get(ctx, vertexSecretName, v1.GetOptions{})

Unused Import in Operator
components/operator/internal/handlers/sessions.go:24 removed usage of k8s.io/client-go/util/retry but didn't remove the import (verify with go build)

Documentation Mismatch
The production ConfigMap comment (operator-config-openshift.yaml:10-15) shows manual kubectl create secret command but doesn't specify which namespace. Should clarify:

# Ensure ambient-vertex secret exists in backend namespace (same as backend pod) with:
#   kubectl create secret generic ambient-vertex \
#     --from-file=ambient-code-key.json=/path/to/service-account.json \
#     --from-literal=project-id=YOUR_PROJECT \
#     --from-literal=region=YOUR_REGION \
#     -n ambient-code  # <-- or your backend namespace

📋 Code Quality Checklist (from CLAUDE.md)

User token authentication pattern (uses backend SA for backend namespace secret - correct)
No panic() in production code
Token redaction in logs
[⚠️] Type-safe unstructured access (Issue Epic: Jira Integration & Workflow #5)
Proper error handling and logging
OwnerReferences not applicable (no child resources created)
Security context patterns not applicable (no pods created)
Tests added

🚀 Recommendations

Add Integration Test: Create a test that validates the full flow: operator starts job → runner calls endpoint → receives credentials
Document Secret Structure: Add to docs/configuration.md the exact structure of the ambient-vertex secret
Consider Secret Validation: Add a preflight check in the backend startup to validate the vertex secret exists and has required keys
Add Metrics: Consider adding Prometheus metrics for vertex credential requests (success/failure counts)
Rate Limiting: Consider rate limiting this endpoint to prevent credential exposure via brute force

🎯 Priority Actions Before Merge

Must Fix:

Issue Outcome: Reduce Refinement Time with agent System #1: Remove duplicate setupTestGVR() call
Issue Epic: RAT Architecture & Design #2: Add error check for json.Unmarshal
Issue Epic: Data Source Integration #3: Add nil check before type assertion

Should Fix:

Issue Epic: AI Agent Development #4: Use strings.Contains() instead of custom helper
Issue Epic: Jira Integration & Workflow #5: Use unstructured.NestedStringMap() for type safety
Issue Epic: Testing & Validation #6: Validate all secret data keys exist

Nice to Have:

Issues Test: Automation Workflow Validation #7-10: Documentation and minor improvements

📊 Overall Assessment

This is a well-architected change that significantly improves the security posture of Vertex AI credential management. The approach of centralizing credentials in the backend and serving them via authenticated API is the right design pattern.

The test coverage is excellent and follows Go testing best practices. The main issues are minor bugs in the test code and some opportunities to follow established type-safety patterns more closely.

Recommendation: Approve after fixing the 3 must-fix issues. The should-fix issues can be addressed in a follow-up PR if time is constrained.

Testing Checklist (run before merge):

cd components/backend
go test -v ./handlers -run TestMintSessionVertexCredentials
gofmt -l .
go vet ./...
golangci-lint run

github-actions · 2025-11-11T14:36:09Z

Code Review: JWT Vertex Authentication Changes

This PR implements a significant architectural improvement by moving Vertex AI credential management from operator-level secret copying to a centralized backend API service with JWT authentication. I've conducted a comprehensive review based on the project's CLAUDE.md guidelines.

Summary

Architecture Change: Transitions from operator copying ambient-vertex secrets to each project namespace → backend API serving credentials on-demand to authorized runners via JWT tokens.

Key Benefits:

✅ Reduced RBAC permissions (operator no longer needs cross-namespace secret access)
✅ Centralized credential management
✅ Improved security with per-session token validation
✅ Eliminates secret proliferation across namespaces

Detailed Findings

✅ Strengths

1. Backend Implementation (`components/backend/handlers/sessions.go:848-956`)

Excellent authentication pattern:

Follows CLAUDE.md guidelines for token security (never logs token values - line 872)
Proper TokenReview validation with ServiceAccount verification (lines 874-900)
Validates SA matches session annotation ambient-code.io/runner-sa (lines 903-925)
Namespace isolation check prevents cross-project access (lines 897-900)

Security best practices:

Uses backend service account to read from own namespace only (line 931)
Returns generic error messages to API clients (no internal details leaked)
Proper error handling with IsNotFound checks

Type safety:

Uses type assertions with ok checks: anns, _ := meta["annotations"].(map[string]interface{}) (line 915)
Could improve with unstructured.NestedStringMap() helpers per CLAUDE.md standards

2. Comprehensive Test Coverage (`components/backend/handlers/sessions_test.go`)

432 lines of new tests covering:

✅ Success path with valid SA token
✅ Missing/invalid Authorization header
✅ Invalid token (unauthenticated)
✅ Wrong namespace (cross-project access attempt)
✅ Session not found
✅ SA name mismatch
✅ Secret not found
✅ Missing credential file in secret

Strong adherence to Go testing patterns: Uses fake clients, proper mocking with PrependReactor, clear test names.

3. Operator Simplification (`components/operator/internal/handlers/sessions.go`)

Removed 166 lines of secret copying logic:

Eliminates copySecretToNamespace() complexity
Removes deleteAmbientVertexSecret() cleanup code
Cleaner reconciliation loop

Better separation of concerns: Operator focuses on Job creation, backend handles secrets.

4. Runner Implementation (`components/runners/claude-code-runner/wrapper.py`)

Secure credential handling (lines 630-679):

Writes credentials to /tmp/vertex-credentials.json (tmpfs, in-memory)
Sets restrictive permissions: os.chmod(creds_path, 0o400) (owner read-only)
Validates all required fields before proceeding

Proper async patterns: Uses run_in_executor for blocking urllib operations (line 1729).

Comprehensive error handling: Returns empty dict on failure, clear logging at each step.

⚠️ Issues & Recommendations

Critical Issues

1. Type Safety Violation in Backend (sessions.go:914-915)

meta, _ := obj.Object["metadata"].(map[string]interface{})
anns, _ := meta["annotations"].(map[string]interface{})

Problem: Direct type assertions without checking, discarding errors.

Risk: If metadata is not a map, anns will be nil and anns["ambient-code.io/runner-sa"] will succeed but return nil.

Fix per CLAUDE.md:

meta, found, err := unstructured.NestedMap(obj.Object, "metadata")
if err \!= nil || \!found {
    c.JSON(http.StatusInternalServerError, gin.H{"error": "invalid session metadata"})
    return
}
anns, _, _ := unstructured.NestedStringMap(meta, "annotations")
expectedSA := ""
if v, ok := anns["ambient-code.io/runner-sa"]; ok {
    expectedSA = strings.TrimSpace(v)
}

Reference: CLAUDE.md "Type-Safe Unstructured Access" section.

2. Interface Type Change Lacks Justification (git/operations.go:48, github_auth.go:27, repo.go:23)

Changed from:

K8sClient *kubernetes.Clientset

To:

K8sClient kubernetes.Interface

Questions:

What problem does this solve?
Are there new mock requirements for tests?
Does this change API compatibility?

Recommendation: Add a comment explaining why the interface is preferred. If it's for testability, ensure all consumers can handle kubernetes.Interface.

3. Missing Error Logging in Token Review (sessions.go:877-878)

if err \!= nil {
    c.JSON(http.StatusInternalServerError, gin.H{"error": "token review failed"})
    return
}

Problem: Per CLAUDE.md error handling patterns, all errors should be logged with context before returning.

Fix:

if err \!= nil {
    log.Printf("TokenReview failed for project=%s, session=%s: %v", project, sessionName, err)
    c.JSON(http.StatusInternalServerError, gin.H{"error": "token review failed"})
    return
}

4. Missing Route Registration Context (routes.go:36)

api.POST("/projects/:projectName/agentic-sessions/:sessionName/vertex/credentials", handlers.MintSessionVertexCredentials)

Problem: Route is registered OUTSIDE the ValidateProjectContext() middleware group (line 38).

Risk: Endpoint doesn't benefit from project-level RBAC validation that other session endpoints have.

Analysis: This might be intentional since the handler does its own TokenReview, but it's inconsistent with the pattern at routes.go:33 where GitHub token minting also sits outside the middleware.

Recommendation: Add a comment explaining why this endpoint bypasses ValidateProjectContext(), or consider using the middleware for consistency.

Medium Priority Issues

5. Hardcoded Secret Name (sessions.go:930)

vertexSecretName := "ambient-vertex"

Recommendation: Extract to a constant or config variable for easier customization across environments.

const VertexSecretName = "ambient-vertex"

Or reference from operator/internal/types/resources.go if AmbientVertexSecretName constant still exists there (I see it referenced in tests).

6. Incomplete Documentation in Manifests

overlays/production/operator-config-openshift.yaml:10-15 has good instructions for creating the secret, but:

Doesn't specify which namespace ("backend-namespace" is generic)
Doesn't mention that backend needs RBAC to read secrets in its own namespace
Doesn't explain what happens if the secret is missing (sessions will fail with 404)

Recommendation: Add troubleshooting section to docs or README:

# Prerequisites for Vertex AI:
# 1. Create ambient-vertex secret in the BACKEND namespace (e.g., ambient-code):
#    kubectl create secret generic ambient-vertex \
#      --from-file=ambient-code-key.json=/path/to/sa.json \
#      --from-literal=project-id=YOUR_PROJECT \
#      --from-literal=region=YOUR_REGION \
#      -n ambient-code
#
# 2. Verify backend ServiceAccount can read secrets in its namespace
# 3. Set CLAUDE_CODE_USE_VERTEX=1 in operator config
#
# Troubleshooting:
# - If runners fail with "Vertex credentials not configured", check secret exists
# - If 403 errors, verify backend RBAC allows secret reads

7. Runner Test Gaps

tests/test_wrapper_vertex.py lost 157 lines (406 → 249). Need to verify:

✅ Backend API fetch tests (covered)
✅ Credential file writing tests (covered)
❓ File permission tests (0o400 enforcement)
❓ Cleanup tests (is temp file deleted on exit?)
❓ Credential rotation tests (what happens if backend updates secret mid-session?)

Recommendation: Add test for file permissions:

def test_vertex_credentials_file_permissions(self):
    # After setup, verify file has mode 0o400
    assert stat.S_IMODE(os.stat('/tmp/vertex-credentials.json').st_mode) == 0o400

8. Operator RBAC Cleanup (base/rbac/operator-clusterrole.yaml)

Removed secret permissions (lines 51-54). ✅ Good!

Verify: Does operator still need ANY secret permissions? Check if there are other secrets it creates/manages (e.g., runner SA tokens in provisionRunnerTokenForSession).

Note: That function runs in backend, not operator, so operator should be clear. Confirm no other secret usage in operator code.

Low Priority / Nitpicks

9. Unused Import Cleanup (operator/internal/handlers/sessions.go:24)

"k8s.io/client-go/util/retry"

Removed code that used retry, but import might still be there. Run go mod tidy and gofmt to clean up.

10. Logging Consistency

Backend uses log.Printf (stdlib) while operator uses same. Good consistency.

Runner uses Python logging module. Good.

Recommendation: Ensure log levels are appropriate:

logging.info for Vertex setup (line 672) ✅
logging.error for fetch failures (line 1723) ✅
Consider logging.warning for non-fatal issues

11. Token Length Logging

Backend logs tokenLen=%d (line 872) ✅ Excellent per CLAUDE.md!

Runner doesn't log BOT_TOKEN length when fetching credentials. Consider adding for consistency:

logging.debug(f"Using BOT_TOKEN for Vertex auth (len={len(bot)})")

🧪 Testing Recommendations

Pre-Merge Checklist:

Run backend tests:

cd components/backend
go test ./handlers -v -run TestMintSessionVertexCredentials

Run operator tests (if any exist for session handling):

cd components/operator
go test ./internal/handlers -v

Run runner tests:

cd components/runners/claude-code-runner
pytest tests/test_wrapper_vertex.py -v

Linting:

cd components/backend && gofmt -l . && golangci-lint run
cd components/operator && gofmt -l . && golangci-lint run

E2E Test (if Vertex is enabled):
- Deploy to test cluster with ambient-vertex secret
- Create session with CLAUDE_CODE_USE_VERTEX=1
- Verify runner can fetch credentials and initialize Anthropic client
- Check logs for "Vertex AI configured: project=..."

📋 Pre-Commit Actions (from CLAUDE.md)

Before merging, verify:

Authentication: New endpoint uses proper token validation ✅
Authorization: SA matches session annotation ✅
Error Handling: Add logging to TokenReview error (Issue Epic: Data Source Integration #3)
Token Security: No tokens in logs ✅
Type Safety: Use unstructured.Nested* helpers (Issue Outcome: Reduce Refinement Time with agent System #1)
Resource Cleanup: OwnerReferences properly set ✅
Tests: Comprehensive test coverage ✅
Logging: Missing context in some error logs (Issue Epic: Data Source Integration #3)
Code Quality: Run gofmt, golangci-lint (Issue Bump actions/add-to-project from 0.5.0 to 1.0.2 #9)

Security Analysis

✅ Security Improvements

Reduced Secret Sprawl: Credentials stay in backend namespace, not copied everywhere
Per-Session Authorization: Each request validated via TokenReview + SA annotation check
Minimal File Permissions: Runner writes creds to tmpfs with 0o400
Token Redaction: Proper logging practices throughout

⚠️ Security Considerations

Credentials in API Response: credentials field contains raw GCP SA JSON in HTTP response body (line 952)
- Mitigation: Uses HTTPS, authorized tokens only
- Recommendation: Consider encrypting response or using short-lived tokens if GCP supports
Temp File Lifetime: /tmp/vertex-credentials.json persists until pod exits
- Mitigation: tmpfs is in-memory, not persistent storage
- Recommendation: Document cleanup behavior or add explicit cleanup on exit
Backend Namespace Secret: Single ambient-vertex secret serves all projects
- Risk: Compromise of backend namespace exposes all Vertex credentials
- Mitigation: Proper namespace isolation, RBAC restrictions
- Recommendation: Consider per-project secrets if multi-tenancy requires isolation

Performance Considerations

New API Call per Session: Runners now make HTTP request to backend for credentials.

Impact: Negligible (one-time fetch during session init, ~10ms overhead).

Benefit: Eliminates operator reconciliation loop for secret copying (reduces cluster load).

Documentation Gaps

No docs update for new endpoint: POST /api/projects/:project/agentic-sessions/:session/vertex/credentials
- Add to API reference docs
- Document request/response format
- Document authentication requirements (BOT_TOKEN)
No migration guide from old secret-copying approach
- What happens to existing sessions with old secrets?
- Do operators need to clean up old secrets manually?
No troubleshooting guide for Vertex setup
- How to verify secret is correctly configured?
- How to test credential fetch independently?

Overall Assessment

Architecture: ⭐⭐⭐⭐⭐ (5/5) - Excellent design, cleaner separation of concerns

Implementation: ⭐⭐⭐⭐☆ (4/5) - Solid code, needs type safety fixes

Testing: ⭐⭐⭐⭐⭐ (5/5) - Comprehensive backend tests, good runner coverage

Security: ⭐⭐⭐⭐☆ (4/5) - Strong auth model, minor credential handling improvements possible

Documentation: ⭐⭐⭐☆☆ (3/5) - Needs API docs and troubleshooting guide

Recommendation

Status: ✅ Approve with minor fixes

Required before merge:

Fix type safety in sessions.go:914-915 (use unstructured.Nested*)
Add error logging to TokenReview failure (sessions.go:877)
Run gofmt and golangci-lint on backend/operator

Recommended (can be follow-up PRs):
4. Document new API endpoint
5. Add troubleshooting section to production overlay comments
6. Extract ambient-vertex constant
7. Add file permission test to runner tests

Great work on this refactoring! The JWT-based credential distribution is a significant improvement over the previous secret-copying approach. Once the type safety and error logging issues are addressed, this is ready to merge.

Files to update:

components/backend/handlers/sessions.go (lines 914-915, 877)
Run linters before final commit

Let me know if you'd like help with any of these fixes! 🚀

…rtex

github-actions · 2025-11-11T15:03:01Z

PR Review: JWT Vertex AI Integration

I've reviewed PR #285 which implements Vertex AI credential management through a centralized backend service. Overall, this is a well-architected security improvement that follows the project's CLAUDE.md guidelines closely. Here's my detailed feedback:

✅ Strengths

1. Excellent Security Architecture

Centralized credential storage: Moving from per-namespace secrets to a single ambient-vertex secret in the backend namespace is a significant security improvement
Token-based authentication: Using BOT_TOKEN (ServiceAccount token) with TokenReview validation follows established patterns from MintSessionGitHubToken
Credential isolation: Runners fetch credentials on-demand rather than having them pre-provisioned in every namespace
Temporary file handling: Credentials written to /tmp (tmpfs) with 0o400 permissions and cleaned up after use (wrapper.py:669-705)
Token redaction: Proper logging that never exposes actual token values (handlers/sessions.go:828)

2. CLAUDE.md Compliance

✅ Type-safe unstructured access: Correctly uses unstructured.NestedStringMap() and unstructured.NestedString() (handlers/sessions.go:813-816, 826)
✅ Proper error handling: Returns explicit errors with context, no panics in production code
✅ Token security: Never logs tokens, uses len(token) for debugging
✅ Authentication pattern: Uses TokenReview validation matching MintSessionGitHubToken pattern
✅ RBAC verification: Validates ServiceAccount matches CR annotation

3. Comprehensive Test Coverage

The new test file (handlers/sessions_test.go) covers all critical paths:

✅ Success path with valid credentials
✅ Missing Authorization header
✅ Invalid token
✅ Namespace mismatch
✅ Session not found
✅ ServiceAccount mismatch
✅ Secret not found
✅ Missing credential file in secret

4. RBAC Reduction

Removed operator permissions: Operator no longer needs secrets:get,create,delete cluster-wide (operator-clusterrole.yaml)
Principle of least privilege: Operator only reads the flag, backend serves credentials

🔴 Critical Issues

1. Interface Type Change Without Justification (High Priority)

Files: git/operations.go:48, handlers/github_auth.go:27, handlers/repo.go:23, handlers/sessions.go:38, handlers/sessions.go:563

// Changed from:
K8sClient *kubernetes.Clientset
// To:
K8sClient kubernetes.Interface

Problem: This changes the concrete type to an interface across multiple files, but:

No explanation in PR description
No test modifications to verify interface compatibility
Could break code that depends on Clientset-specific methods not in Interface

Recommendation:

Add a comment in the PR description explaining why this change is necessary for Vertex integration
If this is for testability, update all tests to use fake clients with the Interface type
If this change is unrelated to Vertex, consider splitting into a separate refactoring PR

2. Missing Documentation for New Endpoint

File: routes.go:40

The new endpoint /projects/:projectName/agentic-sessions/:sessionName/vertex/credentials is added with a comment explaining it uses BOT_TOKEN, but:

No API documentation in docs/
No OpenAPI/Swagger spec update
Runners need to know this endpoint exists

Recommendation:

Add endpoint documentation to docs/reference/api-endpoints.md (if it exists)
Update any API reference documentation
Consider adding a code example showing how runners should call this endpoint

⚠️ Moderate Issues

3. Inconsistent Error Logging

File: handlers/sessions.go:773, 807

// Line 773: Logs error
if err != nil {
    log.Printf("TokenReview failed for project=%s, session=%s: %v", project, sessionName, err)
    c.JSON(http.StatusInternalServerError, gin.H{"error": "token review failed"})
    return
}

// Line 807: Also logs error (good!)
if err != nil {
    log.Printf("Failed to get session %s/%s: %v", project, sessionName, err)
    c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to read session"})
    return
}

This is good and consistent. However, the original MintSessionGitHubToken didn't have the log at line 773 originally. This PR adds it, which is an improvement.

Recommendation: ✅ This is actually fine - the consistency is good!

4. Hardcoded Secret Name

File: handlers/sessions.go:888

const vertexSecretName = "ambient-vertex"

Issue: The secret name is hardcoded, making it harder to test or support multiple configurations.

Recommendation:

Move to a package-level configurable variable set from environment variable (e.g., VERTEX_SECRET_NAME)
Default to "ambient-vertex" for backward compatibility
Makes testing easier (can create test-specific secrets)

// Suggested pattern:
var VertexSecretName = getEnv("VERTEX_SECRET_NAME", "ambient-vertex")

5. Credential Cleanup Not Guaranteed on Panic

File: wrapper.py:692-705

def _cleanup_vertex_credentials(self):
    """Clean up Vertex AI credential file from /tmp."""
    if self._vertex_credentials_path and os.path.exists(self._vertex_credentials_path):
        try:
            os.remove(self._vertex_credentials_path)
            logging.info(f"Cleaned up Vertex credential file: {self._vertex_credentials_path}")
        except Exception as e:
            logging.warning(f"Failed to clean up Vertex credential file {self._vertex_credentials_path}: {e}")
        finally:
            self._vertex_credentials_path = None

Issue: If the Python process is killed (SIGKILL) or crashes before reaching cleanup, the credential file persists in /tmp.

Mitigation:

/tmp is already mounted as tmpfs (in-memory), so credentials disappear on pod termination ✅
File permissions are 0o400 (owner read-only) ✅
Credentials are short-lived (pod lifetime only) ✅

Recommendation: Document this in code comments to clarify the security model relies on tmpfs cleanup.

💡 Suggestions & Best Practices

6. Consider Secret Rotation

The current design requires manual secret updates. Consider:

Adding a timestamp or version field to the secret
Implementing a mechanism for runners to detect stale credentials
Adding a /vertex/credentials/refresh endpoint for credential rotation

7. Add Retry Logic for Credential Fetch

File: wrapper.py:1706-1778

The _fetch_vertex_credentials() method has a single 10-second timeout. Consider:

Exponential backoff for transient network errors
Distinguishing between 401 (auth failure) and 5xx (retry-able) errors

# Suggested pattern:
for attempt in range(3):
    try:
        # ... fetch logic ...
        break
    except _urllib_error.HTTPError as he:
        if he.code in (401, 403, 404):
            raise  # Don't retry auth failures
        if attempt < 2:
            await asyncio.sleep(2 ** attempt)  # Exponential backoff

8. Monitoring & Observability

Consider adding:

Metrics for credential fetch success/failure rates
Alerts for repeated credential fetch failures
Audit logging for credential access (who/when/which session)

9. ConfigMap Simplification

Files: base/operator-deployment.yaml, overlays/*/operator-config*.yaml

✅ Good cleanup: Removed unused env vars (CLOUD_ML_REGION, ANTHROPIC_VERTEX_PROJECT_ID, GOOGLE_APPLICATION_CREDENTIALS) from operator ConfigMaps

Suggestion: Add comments explaining why only CLAUDE_CODE_USE_VERTEX is needed by the operator (acts as a feature flag, actual credentials fetched by runner).

🧪 Testing Recommendations

10. Additional Test Cases

Consider adding tests for:

Backend (handlers/sessions_test.go):

✅ Already covers most paths - excellent work!
Add: Test with malformed JSON in secret's ambient-code-key.json
Add: Test with valid token but wrong audience (if audience validation is implemented)

Runner (wrapper.py):

Add: Test _fetch_vertex_credentials() with network timeouts
Add: Test _cleanup_vertex_credentials() when file doesn't exist
Add: Integration test verifying full E2E flow (may require test cluster)

11. E2E Test Updates

File: e2e/cypress/e2e/vteam.cy.ts (not modified)

The E2E tests don't cover Vertex AI sessions. Consider:

Adding a test that creates a session with CLAUDE_CODE_USE_VERTEX=1
Mocking the Vertex credential endpoint
Verifying the runner pod receives credentials correctly

📋 Documentation Checklist

Before merging, ensure:

Update docs/configuration.md (or equivalent) with ambient-vertex secret format
Document the new credential fetch endpoint in API reference
Add deployment guide for Vertex AI setup (creating the secret)
Update overlays/production/operator-config-openshift.yaml comment with complete secret creation command (already has good start!)
Add architecture diagram showing credential flow: Backend Secret → API → Runner → Vertex AI

🎯 Final Verdict

Approval Recommendation: Approve with minor changes

This PR represents a significant security improvement and follows the project's coding standards closely. The architecture is sound, test coverage is comprehensive, and the implementation follows established patterns.

Before merging:

🔴 MUST: Explain the kubernetes.Interface change or revert if unnecessary
⚠️ SHOULD: Make secret name configurable for testing
💡 NICE TO HAVE: Add API documentation and retry logic

Security Score: 9/10 - Excellent security design with proper token handling, credential isolation, and RBAC enforcement.

Code Quality: 8.5/10 - Clean, well-tested code following project standards. Minor improvements possible in configurability.

Great work on this implementation! The centralized credential management pattern could serve as a model for other cloud provider integrations.

Review completed by: Claude Code (Sonnet 4.5)
Files reviewed: 18 files, 1961 additions, 1808 deletions
Reference: CLAUDE.md backend development standards

github-actions · 2025-11-11T15:12:30Z

PR Review: JWT Vertex Credential Management

Overview

This PR refactors Vertex AI credential management to use a JWT-based flow where runners authenticate to the backend API using their ServiceAccount tokens (BOT_TOKEN) to fetch GCP credentials dynamically. This is a significant security improvement over the previous approach.

✅ Strengths

1. Security Improvements

Excellent: Credentials no longer distributed via Secrets to every session namespace
Excellent: Backend serves as centralized credential vault, reading from its own namespace
Excellent: Runner writes credentials to /tmp (tmpfs, in-memory) with 0o400 permissions
Excellent: Cleanup of credential file after session completion (_cleanup_vertex_credentials())
Good: Token authentication properly validated via TokenReview API
Good: ServiceAccount authorization verified against session annotation

2. Code Quality

Excellent: Comprehensive test coverage for MintSessionVertexCredentials (432 lines, 8 test cases)
Good: Proper use of type-safe unstructured.NestedStringMap per CLAUDE.md standards (sessions.go:914)
Good: Error handling with structured logging and appropriate HTTP status codes
Good: Token length logging instead of actual values (sessions.go:868)

3. RBAC Simplification

Excellent: Removed operator's cluster-wide Secret access (operator-clusterrole.yaml)
Good: Reduced operator environment variables (operator-deployment.yaml)

⚠️ Issues & Recommendations

1. CRITICAL: Missing Credential Cleanup on Error Paths

Location: components/runners/claude-code-runner/wrapper.py:638-690

The _setup_vertex_credentials() method creates a credential file but only tracks it for cleanup on success. If an exception occurs after writing the file, it won't be cleaned up.

Problem:

# Line 672-678: File written
with open(creds_path, 'w') as f:
    f.write(credentials_json)

# If ANY exception occurs here or in caller, _vertex_credentials_path is never set
# and cleanup won't happen on pod exit

self._vertex_credentials_path = creds_path  # Line 684

Fix: Set self._vertex_credentials_path immediately after successful write, before potential failure points:

with open(creds_path, 'w') as f:
    f.write(credentials_json)
os.chmod(creds_path, 0o400)

# Set immediately so cleanup happens even if subsequent steps fail
self._vertex_credentials_path = creds_path

logging.info(f"Vertex credentials written to {creds_path} with mode 0o400")
# ... rest of function

2. HIGH: Inconsistent Error Handling in GitHub vs Vertex Flows

Location: components/backend/handlers/sessions.go:770-841 vs 847-955

The MintSessionGitHubToken and MintSessionVertexCredentials functions have duplicate validation logic but inconsistent error logging.

Inconsistency:

GitHub token endpoint: No logging before TokenReview creation (line 771)
Vertex endpoint: Logs token length before TokenReview (line 868)
GitHub token endpoint: Missing log on session fetch failure (line 806)
Vertex endpoint: Logs session fetch failure (line 908)

Recommendation: Extract common validation logic into a shared function:

// validateRunnerToken performs TokenReview and verifies SA matches session annotation
func validateRunnerToken(c *gin.Context, project, sessionName, token string) (saName string, err error) {
    // Common validation logic from both functions
    // Return extracted SA name on success
}

Then both endpoints can use:

saName, err := validateRunnerToken(c, project, sessionName, token)
if err != nil {
    return // validateRunnerToken already sent response
}

3. MEDIUM: Test Coverage Gap - Backend Integration Tests

Location: components/backend/handlers/sessions_test.go

The unit tests are excellent but missing integration test scenarios:

❌ End-to-end flow with real backend API server
❌ Concurrent requests from multiple runners
❌ Token expiration/renewal scenarios
❌ Backend namespace misconfiguration

Recommendation: Add integration test in tests/integration/:

func TestVertexCredentialFlow_Integration(t *testing.T) {
    // Test real runner token → backend → GCP credential flow
    // Verify credential file permissions, cleanup, etc.
}

4. MEDIUM: Vertex Secret Format Not Validated

Location: components/backend/handlers/sessions.go:942-946

The code checks if ambient-code-key.json exists but doesn't validate:

JSON structure of credentials
Presence of required fields (type, project_id, private_key, etc.)
project-id and region fields in secret

Current:

credJSON, ok := vertexSecret.Data["ambient-code-key.json"]
if !ok {
    c.JSON(http.StatusNotFound, gin.H{"error": "Vertex credentials missing ambient-code-key.json"})
    return
}
// Returns potentially invalid JSON to runner

Fix: Add validation:

credJSON, ok := vertexSecret.Data["ambient-code-key.json"]
if !ok {
    c.JSON(http.StatusNotFound, gin.H{"error": "Vertex credentials missing ambient-code-key.json"})
    return
}

// Validate JSON structure
var credStruct map[string]interface{}
if err := json.Unmarshal(credJSON, &credStruct); err != nil {
    log.Printf("Invalid Vertex credential JSON in secret: %v", err)
    c.JSON(http.StatusInternalServerError, gin.H{"error": "Invalid Vertex credential format"})
    return
}

// Validate required fields
requiredFields := []string{"type", "project_id", "private_key"}
for _, field := range requiredFields {
    if _, ok := credStruct[field]; !ok {
        log.Printf("Vertex credential missing required field: %s", field)
        c.JSON(http.StatusInternalServerError, gin.H{"error": "Incomplete Vertex credentials"})
        return
    }
}

5. LOW: Operator Environment Variable Cleanup Incomplete

Location: components/manifests/base/operator-deployment.yaml:37-45

The comment says "Operator only needs the flag" but the deployment still reads from ConfigMap. Consider removing the ConfigMap reference entirely if only the flag is needed:

Current:

- name: CLAUDE_CODE_USE_VERTEX
  valueFrom:
    configMapKeyRef:
      name: operator-config
      key: CLAUDE_CODE_USE_VERTEX

Consider: If this is truly just a flag, use a literal value in production overlay:

# In overlays/production/operator-deployment.yaml
- name: CLAUDE_CODE_USE_VERTEX
  value: "1"

6. LOW: Missing Documentation for Secret Creation

Location: components/manifests/overlays/production/operator-config-openshift.yaml:7-17

The comment provides secret creation command but doesn't specify:

Which backend namespace to use
How to obtain the service account JSON
Required GCP permissions for the service account

Recommendation: Add to production deployment docs or ConfigMap comment:

# Required GCP Permissions:
#   - aiplatform.endpoints.predict
#   - aiplatform.models.get
# Service Account: ambient-code@PROJECT_ID.iam.gserviceaccount.com
# Backend Namespace: Use same namespace as vteam-backend deployment

🔍 Minor Code Quality Issues

7. Duplicate Type-Safe Access Pattern (Good to Consolidate)

Location: sessions.go:803-820 vs 913-924

Both MintSessionGitHubToken and MintSessionVertexCredentials extract the runner-sa annotation identically. Consider a helper:

func getRunnerSAFromSession(obj *unstructured.Unstructured) (string, error) {
    anns, _, _ := unstructured.NestedStringMap(obj.Object, "metadata", "annotations")
    sa, ok := anns["ambient-code.io/runner-sa"]
    if !ok {
        return "", fmt.Errorf("runner-sa annotation not found")
    }
    return strings.TrimSpace(sa), nil
}

8. Test Helper Function Already Exists

Location: sessions_test.go:421-432

The contains() and stringContains() helpers can be replaced with strings.Contains() from stdlib:

// Delete lines 421-432, use:
if !strings.Contains(response["error"].(string), "Vertex credentials not configured") {
    t.Errorf("Expected error about missing credentials, got: %v", response["error"])
}

📋 Pre-Merge Checklist

Fix CRITICAL credential cleanup issue (recommendation Outcome: Reduce Refinement Time with agent System #1)
Add validation for Vertex secret JSON format (recommendation Epic: AI Agent Development #4)
Extract duplicate validation logic (recommendation Epic: RAT Architecture & Design #2)
Run gofmt -w components/backend components/operator
Run golangci-lint run in backend and operator
Update PR title (remove "WIP:" when ready)
Add integration test or document testing plan for Vertex flow

🎯 Summary

This is a well-architected security improvement with excellent test coverage. The main concerns are:

Credential cleanup on error paths (must fix before merge)
Secret format validation (should fix to prevent runtime errors)
Code duplication (nice to have, not blocking)

The approach of using JWT authentication for credential distribution is sound and follows Kubernetes security best practices. Once the critical cleanup issue is addressed, this will be a solid improvement to the platform's security posture.

Recommendation: Address items #1 and #4, then approve for merge.

Review generated following CLAUDE.md backend development standards

github-actions · 2025-11-11T15:20:15Z

Pull Request Review: JWT Vertex Authentication

Overview

This PR implements Vertex AI credential distribution via backend API endpoints, improving security by centralizing credential management and eliminating the need for operators to directly access GCP service account keys.

🎯 Positive Highlights

1. Security Architecture Improvements ✅

Centralized credential management: Backend reads ambient-vertex secret from its own namespace, eliminating credential duplication
Token-based authentication: Runners use BOT_TOKEN (ServiceAccount token) for credential fetching, following existing GitHub token pattern
Credential cleanup: Temporary credential file (/tmp/vertex-credentials.json) is properly cleaned up after session completion (wrapper.py:694-707)
Restrictive file permissions: Credentials written with mode 0o400 (read-only for owner)

2. Code Quality & Best Practices ✅

Type-safe unstructured access: Proper use of unstructured.NestedString helpers in handlers/sessions.go:862-863 (per CLAUDE.md standards)
Comprehensive test coverage: Added 7 test cases covering success, auth failures, namespace mismatches, missing secrets, and more
DRY principle: Refactored common token validation logic into reusable validateRunnerToken function (handlers/sessions.go:749-821)
Interface usage: Changed *kubernetes.Clientset to kubernetes.Interface for better testability

3. Error Handling ✅

Structured logging: All error paths include contextual information (project, session, tokenLen)
Graceful degradation: Operator's preflight check logs informational warnings if secret missing, doesn't block startup
Token redaction: No tokens logged (only token length), following security best practices

⚠️ Issues & Concerns

1. Critical: Security Vulnerability - Credential Exposure 🔴

Location: wrapper.py:1735

req = _urllib_request.Request(url, data=b"{}", headers={'Content-Type': 'application/json'}, method='POST')
bot = (os.getenv('BOT_TOKEN') or '').strip()
if bot:
    req.add_header('Authorization', f'Bearer {bot}')

Problem: The BOT_TOKEN is appended to WebSocket URL as query parameter (wrapper.py:82):

setattr(self.shell.transport, 'url', ws + f"?token={bot}")

Risk: Query parameters can be logged in:

Web server access logs
Reverse proxy logs (nginx, HAProxy)
Browser history (if WebSocket URL is exposed to client)
Monitoring/APM tools

Recommendation: Use headers for authentication instead of query params for WebSocket connections, or document that backend must disable query param logging for these endpoints.

2. Type Safety: Missing Error Checks 🟡

Location: handlers/sessions.go:862-863

userID, _, _ := unstructured.NestedString(obj.Object, "spec", "userContext", "userId")
userID = strings.TrimSpace(userID)
if userID == "" {
    c.JSON(http.StatusBadRequest, gin.H{"error": "session missing user context"})
    return
}

Issue: The third return value (error) is ignored with _. While the current check for empty string catches the issue, you should validate the error to distinguish between "field not found" vs "field exists but wrong type".

Per CLAUDE.md:

REQUIRED: Use unstructured.Nested* helpers with three-value returns
REQUIRED: Check found before using values; handle type mismatches gracefully

Recommendation:

userID, found, err := unstructured.NestedString(obj.Object, "spec", "userContext", "userId")
if err != nil {
    log.Printf("Invalid userContext.userId type for session %s/%s: %v", project, sessionName, err)
    c.JSON(http.StatusBadRequest, gin.H{"error": "invalid user context"})
    return
}
if !found || strings.TrimSpace(userID) == "" {
    c.JSON(http.StatusBadRequest, gin.H{"error": "session missing user context"})
    return
}

3. Backend Namespace Hardcoding 🟡

Location: handlers/sessions.go:879-880

const vertexSecretName = "ambient-vertex"
backendNamespace := Namespace
vertexSecret, err := K8sClient.CoreV1().Secrets(backendNamespace).Get(c.Request.Context(), vertexSecretName, v1.GetOptions{})

Issue: The Namespace variable is set at startup. If backend and operator run in different namespaces (common in multi-tenant deployments), this assumes the secret exists in the backend's namespace.

Current behavior: Works correctly per design, but lacks documentation

Recommendation: Add a comment explaining the namespace assumption:

// Load Vertex credentials from ambient-vertex secret in the backend namespace.
// Note: Backend must have ambient-vertex secret in its own namespace (not the project/session namespace).
// This centralizes credential management and prevents credential duplication across projects.
const vertexSecretName = "ambient-vertex"
backendNamespace := Namespace  // Backend's namespace, set at startup

4. Test Coverage: Missing Integration Tests 🟡

Current tests (handlers/sessions_test.go):

✅ Unit tests with mocked K8s clients
✅ TokenReview validation
✅ Error conditions

Missing:

❌ Integration test with real backend API call
❌ E2E test verifying runner can actually fetch credentials
❌ Test for credential cleanup on session termination

Recommendation: Add integration test in components/backend/tests/integration/ that:

Creates a real ambient-vertex secret
Creates an AgenticSession with runner SA
Calls the /vertex/credentials endpoint with SA token
Verifies credentials match secret data

5. Operator: Misleading Preflight Check 🟡

Location: operator/internal/preflight/vertex.go:23-36

// Check if ambient-vertex secret exists in backend namespace (informational only)
// This is not a hard requirement as backend may be in different namespace
_, err := config.K8sClient.CoreV1().Secrets(operatorNamespace).Get(...)
if err != nil {
    log.Printf("  Info: secret '%s' not found in namespace '%s'", ...)
    log.Printf("  Note: Ensure backend namespace has the ambient-vertex secret with:...")
}

Issue: The check looks in operatorNamespace but the log message says "backend namespace". These may be different namespaces.

Recommendation: Either:

Skip this check entirely (backend will error at runtime if secret missing)
Check the actual backend namespace (requires knowing backend namespace at operator startup)
Clarify in logs: "Checking operator namespace for secret (backend may use different namespace)"

6. RBAC: Backend Service Account Permissions 🟡

Current RBAC (rbac/operator-clusterrole.yaml shows operator permissions, but backend RBAC not in diff)

Required: Backend ServiceAccount needs:

- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["get"]
  # Limited to backend's own namespace

Question: Is this permission already granted? Not visible in the PR diff.

Recommendation: Verify backend ServiceAccount has secrets:get permission in its own namespace, or add a preflight check in backend startup.

7. Error Response Consistency 🟢 Minor

Location: handlers/sessions.go:883-886, 891-893

// Case 1: Secret not found
c.JSON(http.StatusNotFound, gin.H{"error": "Vertex credentials not configured for this project"})

// Case 2: Missing credential file
c.JSON(http.StatusNotFound, gin.H{"error": "Vertex credentials missing ambient-code-key.json"})

Issue: Both return 404, but the second case is a configuration error (should be 500 Internal Server Error) since the secret exists but is malformed.

Recommendation:

// Case 2: Missing credential file (malformed secret)
c.JSON(http.StatusInternalServerError, gin.H{"error": "Vertex credentials misconfigured"})

This helps distinguish between "feature not enabled" (404) vs "feature broken" (500).

🧪 Testing Recommendations

Required Before Merge:

Integration test: Real backend → runner credential fetch flow
RBAC verification: Confirm backend SA has secrets:get in its namespace
E2E test: Full AgenticSession using Vertex AI (if GCP test env available)

Suggested Test Cases:

Vertex session with valid credentials → success
Vertex session with missing secret → proper error message
Vertex session with malformed secret → proper error
Vertex session with wrong SA token → 401/403
Credential file cleanup after session completion
Credential file cleanup after session crash/timeout

📋 Performance Considerations

Positive:

✅ Credentials fetched once per session (not per API call)
✅ Temporary file in /tmp (tmpfs, memory-backed filesystem)
✅ Single HTTP request to backend for credential fetch

Potential Issues:

⚠️ No caching of credentials across sessions (each session fetches fresh)
- Impact: Minimal - credentials are small (~2KB JSON)
- Tradeoff: Fresh credentials prevent stale credential issues

🔒 Security Assessment

Strengths:

✅ Centralized credential management (single source of truth)
✅ No credential duplication across namespaces
✅ Token-based authentication for runner access
✅ Automatic credential cleanup
✅ Restrictive file permissions

Weaknesses:

🔴 High: BOT_TOKEN in WebSocket query parameters (see Issue Outcome: Reduce Refinement Time with agent System #1)
🟡 Medium: No encryption at rest for /tmp/vertex-credentials.json
- Mitigation: File mode 0o400 + tmpfs (memory-backed) + cleanup
🟡 Medium: Credentials returned as plain JSON over HTTP
- Mitigation: Assumes TLS/HTTPS in production (should document this requirement)

✅ Checklist (Per CLAUDE.md)

Backend/Operator Standards:

📝 Documentation Needs

The following should be documented before merge:

Deployment Guide: How to create ambient-vertex secret

kubectl create secret generic ambient-vertex \
  --from-file=ambient-code-key.json=/path/to/sa.json \
  --from-literal=project-id=YOUR_PROJECT \
  --from-literal=region=us-central1 \
  -n <backend-namespace>

Security Requirements:
- TLS/HTTPS required for credential transmission
- Backend namespace access control
- Token rotation strategy

Architecture Diagram: Update docs to show new credential flow:

Runner → (BOT_TOKEN) → Backend → ambient-vertex secret → Vertex Credentials → Runner

🎯 Final Verdict

Approval Status: Approve with Changes Required ⚠️

This PR demonstrates excellent architectural improvements and follows most CLAUDE.md standards. However, Issue #1 (BOT_TOKEN in query params) should be addressed before merge to prevent credential leakage in logs.

Required Changes:

🔴 Fix: BOT_TOKEN in WebSocket query params (security risk)
🟡 Improve: Type-safe error handling for NestedString calls
🟡 Document: Namespace assumptions and RBAC requirements

Recommended Changes:

Add integration test for credential fetch flow
Clarify operator preflight check logs
Fix error response codes (404 vs 500)

Estimated Effort:

Required changes: 2-3 hours
Recommended changes: 3-4 hours
Total: 5-7 hours

📚 References

CLAUDE.md: Backend Development Standards (lines 285-448)
CLAUDE.md: Operator Patterns (lines 449-537)
CLAUDE.md: Security Patterns (lines 538-581)

Great work overall! The centralized credential management is a significant security improvement. Looking forward to seeing the WebSocket auth issue resolved. 🚀

github-actions · 2025-11-11T15:42:02Z

Code Review: JWT Vertex AI Credential Management

This PR refactors Vertex AI credential management to use a more secure, centralized approach. The backend now serves credentials via an authenticated API endpoint instead of the operator distributing secrets to project namespaces.

✅ Strengths

1. Security Improvements

Centralized credential management: Vertex credentials stay in the backend namespace, reducing secret sprawl
Proper token validation: validateRunnerToken() follows CLAUDE.md patterns with TokenReview + SA verification
Token redaction: Follows security standards (logs token length, not content) - components/backend/handlers/sessions.go:751
Least privilege: Operator no longer needs secret copy permissions, reduced RBAC footprint

2. Code Quality

Excellent test coverage: 8 comprehensive test cases cover success, auth failures, namespace mismatches, missing secrets - components/backend/handlers/sessions_test.go
Type-safe unstructured access: Uses unstructured.NestedStringMap() instead of raw type assertions (line 805) ✅ CLAUDE.md compliant
Proper error handling: All errors logged with context, appropriate HTTP status codes returned
Code reuse: validateRunnerToken() helper reduces duplication between GitHub and Vertex endpoints
Clear separation of concerns: Operator sets flag, backend manages credentials, runner fetches on-demand

3. Architecture

Follow kubernetes.Interface pattern: Changed from *kubernetes.Clientset to kubernetes.Interface for better testability (git/operations.go:48)
Runtime credential fetching: Runner fetches credentials JIT instead of mounting secrets, supports dynamic updates
Clean lifecycle management: Vertex credential file cleanup in wrapper.py (lines 689-702)

⚠️ Issues & Concerns

HIGH PRIORITY

1. Potential Secret Exposure in Logs ⚠️

Location: components/runners/claude-code-runner/wrapper.py:213

logging.info(f"  GOOGLE_APPLICATION_CREDENTIALS: {os.environ.get('GOOGLE_APPLICATION_CREDENTIALS')}")

Issue: This logs the file path, which is acceptable, but be careful this doesn't accidentally log credential contents elsewhere.

Recommendation: Add a comment clarifying this only logs the path, not contents.

2. Missing Interface Methods Verification ⚠️

Location: components/backend/handlers/github_auth.go:27, git/operations.go:48

Issue: Changed K8sClient from *kubernetes.Clientset to kubernetes.Interface. While this improves testability, you should verify all callers only use methods defined in the interface.

Recommendation: Run the full test suite and ensure no runtime panics due to missing interface methods.

cd components/backend && go test ./... -v

3. Operator Test Deletion 🚨

Location: components/operator/internal/handlers/sessions_test.go (deleted 501 lines)

Issue: The entire operator test file was deleted. This removes test coverage for:

Session event handling
Job creation logic
Status updates
Error scenarios

Recommendation: Do not delete these tests. Instead, update them to reflect the new Vertex credential flow. Operator tests are critical for preventing regressions.

MEDIUM PRIORITY

4. Incomplete Error Context in validateRunnerToken()

Location: components/backend/handlers/sessions.go:805-806

anns, _, _ := unstructured.NestedStringMap(obj.Object, "metadata", "annotations")

Issue: The error return value is ignored. If there's a type mismatch (annotations is not a map[string]string), the error is silently swallowed.

Recommendation: Check the error and log it:

anns, found, err := unstructured.NestedStringMap(obj.Object, "metadata", "annotations")
if err != nil {
    log.Printf("Failed to parse annotations for session %s/%s: %v", project, sessionName, err)
    c.JSON(http.StatusInternalServerError, gin.H{"error": "invalid session metadata"})
    return ""
}
if !found {
    anns = map[string]string{}
}

5. Race Condition in Credential Fetch

Location: components/runners/claude-code-runner/wrapper.py:1703-1774

Issue: If the backend service is restarted during a runner session, the credential fetch will fail with a non-retryable error.

Recommendation: Add retry logic with exponential backoff for transient network errors:

for attempt in range(3):
    try:
        response = urllib.request.urlopen(req, timeout=10)
        break
    except urllib.error.URLError as e:
        if attempt < 2:
            await asyncio.sleep(2 ** attempt)
        else:
            raise

6. Missing RBAC Documentation

Location: components/manifests/base/rbac/operator-clusterrole.yaml

Issue: The RBAC changes removed secret copy permissions, but this isn't documented in commit messages or PR description.

Recommendation: Add a migration note explaining:

Operator no longer needs secrets: [create, update] in project namespaces
Backend needs secrets: [get] in its own namespace (for ambient-vertex)
Runners need network access to backend API

LOW PRIORITY

7. Hardcoded Secret Name

Location: components/backend/handlers/sessions.go:894

const vertexSecretName = "ambient-vertex"

Recommendation: Consider making this configurable via environment variable for flexibility in different environments.

8. Credential File Permissions

Location: components/runners/claude-code-runner/wrapper.py:670

os.chmod(creds_path, 0o400)  # Read-only for owner

Good practice, but document why 0o400 (prevents other users in container from reading).

9. Test Uses setupTestGVR() Twice

Location: components/backend/handlers/sessions_test.go:69-70

setupTestGVR()
setupTestGVR()

Issue: Duplicate call (harmless but redundant).

Fix: Remove one line.

🔍 Security Review

✅ Secure Patterns

TokenReview validates runner identity before serving credentials
ServiceAccount name verified against session annotation (prevents token reuse across sessions)
Namespace isolation enforced (token from wrong namespace rejected)
Credentials served only to authorized SAs (not broadcast to all runners)
Temporary credential file cleaned up on exit

⚠️ Security Considerations

Backend becomes single point of failure: If backend is compromised, all Vertex credentials are exposed
- Mitigation: Ensure backend has strong RBAC, runs with minimal privileges
Credential caching: Runner stores credentials in /tmp for session duration
- Mitigation: File permissions set to 0o400 (good), but /tmp is shared in some environments
- Recommendation: Consider using in-memory credential storage if possible
No credential rotation: If ambient-vertex secret is updated, existing runners won't pick up new credentials
- Mitigation: Document that runners must be restarted after credential rotation

📊 Performance Considerations

Additional HTTP request: Each runner now makes an API call to fetch credentials (adds ~10-100ms latency)
- Impact: Negligible for typical session durations (minutes to hours)
Backend load: Backend now serves credentials for every session startup
- Impact: Minimal (credentials fetched once per session, not per request)
Secret read: Backend reads ambient-vertex secret on every credential request
- Optimization: Consider caching secret in memory with a 5-minute TTL

🧪 Testing Recommendations

Required Before Merge

✅ Backend unit tests: Excellent coverage (8 test cases)
🚨 Operator tests: Restore deleted tests, update for new flow
⚠️ Integration test: Test full flow (operator → runner → backend → Vertex API)
⚠️ Negative test: Verify runner fails gracefully when backend is unreachable
⚠️ RBAC test: Verify operator no longer needs secret permissions in project namespaces

Test Coverage Gaps

No tests for wrapper.py credential fetching logic (consider adding unit tests)
No tests for credential cleanup on runner crash/termination
No tests for concurrent credential requests (thread safety)

📝 Documentation Updates Needed

Migration guide: How to upgrade from old secret-per-namespace model
Architecture diagram: Show new credential flow (operator → backend → runner)
Troubleshooting: Common errors (backend unreachable, invalid token, secret missing)
RBAC requirements: Update docs to reflect reduced operator permissions

🎯 Recommendations Summary

MUST FIX (Blocking)

🚨 Restore operator tests (sessions_test.go) - update, don't delete
⚠️ Fix error handling in validateRunnerToken() (line 805)

SHOULD FIX (High Priority)

Verify kubernetes.Interface change doesn't break callers
Add retry logic for credential fetch in wrapper.py
Document RBAC changes in PR description

NICE TO HAVE (Low Priority)

Make secret name configurable
Add comment about logging file path vs. contents
Remove duplicate setupTestGVR() call in test
Consider credential caching in backend (performance optimization)

✅ Overall Assessment

Verdict: APPROVE with changes required

This is a well-architected refactoring that improves security posture by centralizing credential management. The backend code follows CLAUDE.md standards excellently (type-safe unstructured access, proper error handling, comprehensive tests). However, deleting all operator tests is a critical issue that must be addressed before merge.

The new credential flow (operator sets flag → runner fetches from backend) is cleaner than the old model (operator copies secrets to every namespace), but the PR needs:

Operator test coverage restored
Integration testing to verify end-to-end flow
Documentation updates for the new architecture

Estimated time to address blockers: 2-4 hours (restore operator tests, fix error handling)

Great work on the security improvements and test coverage! The validateRunnerToken() helper is especially well-designed and will be useful for future authenticated endpoints. 🚀

Gkrumbach07

looks good to me

Gkrumbach07 · 2025-11-12T13:12:48Z

Sandboxing the pod better will help this be more secure too.

This reverts commit 2f783a1.

Reverts #285

Signed-off-by: Ryan Cook <rcook@redhat.com>

Reverts ambient-code#285

cooktheryan added 3 commits November 10, 2025 16:41

WIP: Vertex JWT

165c5bc

Signed-off-by: Ryan Cook <rcook@redhat.com>

dont need comment line

090c36f

Merge branch 'main' into jwt_vertex

a2915a4

cooktheryan added 2 commits November 11, 2025 09:29

incorporating feedback

ea7bf56

Signed-off-by: Ryan Cook <rcook@redhat.com>

Merge branch 'jwt_vertex' of github.com:cooktheryan/vTeam into jwt_ve…

6daf3c7

…rtex

Merge branch 'main' into jwt_vertex

b940ee7

cooktheryan added 3 commits November 11, 2025 10:00

feedback fixes

dea76b5

Merge branch 'jwt_vertex' of github.com:cooktheryan/vTeam into jwt_ve…

9a3554f

…rtex

Merge branch 'main' into jwt_vertex

f77b7e6

additional lint and feedback

bda39af

fix of secruity error

cbe8a20

Gkrumbach07 approved these changes Nov 12, 2025

View reviewed changes

Gkrumbach07 merged commit 2f783a1 into ambient-code:main Nov 12, 2025
21 of 22 checks passed

cooktheryan added a commit that referenced this pull request Nov 12, 2025

Revert "WIP: Jwt vertex (#285)"

430170e

This reverts commit 2f783a1.

cooktheryan mentioned this pull request Nov 12, 2025

Revert "WIP: Jwt vertex" #289

Merged

cooktheryan added a commit that referenced this pull request Nov 12, 2025

Revert "WIP: Jwt vertex" (#289)

b3caed8

Reverts #285

jeremyeder pushed a commit to jeremyeder/platform that referenced this pull request Nov 12, 2025

WIP: Jwt vertex (ambient-code#285)

179e3ca

Signed-off-by: Ryan Cook <rcook@redhat.com>

jeremyeder pushed a commit to jeremyeder/platform that referenced this pull request Nov 12, 2025

Revert "WIP: Jwt vertex" (ambient-code#289)

88351d1

Reverts ambient-code#285

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

WIP: Jwt vertex #285

WIP: Jwt vertex #285

Uh oh!

cooktheryan commented Nov 11, 2025

Uh oh!

github-actions bot commented Nov 11, 2025 •

edited

Loading

Uh oh!

github-actions bot commented Nov 11, 2025

Uh oh!

github-actions bot commented Nov 11, 2025

Uh oh!

github-actions bot commented Nov 11, 2025

Uh oh!

github-actions bot commented Nov 11, 2025

Uh oh!

github-actions bot commented Nov 11, 2025

Uh oh!

github-actions bot commented Nov 11, 2025

Uh oh!

Gkrumbach07 left a comment

Uh oh!

Gkrumbach07 commented Nov 12, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

WIP: Jwt vertex #285

WIP: Jwt vertex #285

Uh oh!

Conversation

cooktheryan commented Nov 11, 2025

Uh oh!

github-actions bot commented Nov 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: JWT Vertex Credentials Architecture

🔴 Critical Issues

1. Backend Reads Secret from Wrong Namespace

2. Missing Token Length Logging

3. Test Assumes Wrong Secret Namespace

⚠️ High Priority Issues

4. Missing Error Context in Backend

5. Type-Unsafe Metadata Access

6. Nil Secret Data Not Checked

7. Python Wrapper Hardcodes Credential Path

📝 Medium Priority Issues

8. Inconsistent Error Messages

9. Missing Cleanup in Python Wrapper

10. Operator Log Message Unclear

✅ Strengths

📋 Pre-Merge Checklist

🎯 Architecture Validation

🔧 Suggested Follow-up Work (Post-Merge)

Summary

Uh oh!

github-actions bot commented Nov 11, 2025

Pull Request Review: JWT/Vertex AI Credential Management

✅ Strengths

🔧 Issues to Address

Critical Issues

Medium Issues

Minor Issues

📋 Code Quality Checklist (from CLAUDE.md)

🚀 Recommendations

🎯 Priority Actions Before Merge

📊 Overall Assessment

Uh oh!

github-actions bot commented Nov 11, 2025

Code Review: JWT Vertex Authentication Changes

Summary

Detailed Findings

✅ Strengths

1. Backend Implementation (components/backend/handlers/sessions.go:848-956)

2. Comprehensive Test Coverage (components/backend/handlers/sessions_test.go)

3. Operator Simplification (components/operator/internal/handlers/sessions.go)

4. Runner Implementation (components/runners/claude-code-runner/wrapper.py)

⚠️ Issues & Recommendations

Critical Issues

Medium Priority Issues

Low Priority / Nitpicks

🧪 Testing Recommendations

📋 Pre-Commit Actions (from CLAUDE.md)

Security Analysis

✅ Security Improvements

⚠️ Security Considerations

Performance Considerations

Documentation Gaps

Overall Assessment

Recommendation

Uh oh!

github-actions bot commented Nov 11, 2025

PR Review: JWT Vertex AI Integration

✅ Strengths

1. Excellent Security Architecture

2. CLAUDE.md Compliance

3. Comprehensive Test Coverage

4. RBAC Reduction

🔴 Critical Issues

1. Interface Type Change Without Justification (High Priority)

2. Missing Documentation for New Endpoint

⚠️ Moderate Issues

3. Inconsistent Error Logging

4. Hardcoded Secret Name

5. Credential Cleanup Not Guaranteed on Panic

💡 Suggestions & Best Practices

6. Consider Secret Rotation

7. Add Retry Logic for Credential Fetch

github-actions bot commented Nov 11, 2025 •

edited

Loading

1. Backend Implementation (`components/backend/handlers/sessions.go:848-956`)

2. Comprehensive Test Coverage (`components/backend/handlers/sessions_test.go`)

3. Operator Simplification (`components/operator/internal/handlers/sessions.go`)

4. Runner Implementation (`components/runners/claude-code-runner/wrapper.py`)